Multiple postgres connections from cronjob causes site to go unresponsive #1374

Closed

joelrebel opened this issue Dec 9, 2013 · 4 comments

@joelrebel commented Dec 9, 2013

The site affected in this case was data.sa.gov.au (s104, s105). Accessing the site would time out and nginx would return a 501 error.

The s105 host had these cron jobs running:

root     21997  0.0  0.0  19112   392 ?        Ss   Apr23   2:26 cron
root     10296  0.0  0.0  33844  1088 ?        S    07:00   0:00  \_ CRON
okfn     10298  0.0  0.0   4404   596 ?        Ss   07:00   0:00  |   \_ /bin/sh -c /usr/lib/ckan/sa/bin/paster --plugin=ckan tracking update -c /etc/ckan/sa/production.ini && /usr/lib/ckan/sa/bin/paster --plugin=ckan search-index rebuild
okfn     10818  0.0  2.8 202220 58892 ?        S    07:00   0:01  |       \_ /usr/lib/ckan/sa/bin/python /usr/lib/ckan/sa/bin/paster --plugin=ckan search-index rebuild -r -c /etc/ckan/sa/production.ini

root     23170  0.0  0.0  33844  1088 ?        S    08:00   0:00  \_ CRON
okfn     23176  0.0  0.0   4404   600 ?        Ss   08:00   0:00      \_ /bin/sh -c /usr/lib/ckan/sa/bin/paster --plugin=ckan tracking update -c /etc/ckan/sa/production.ini && /usr/lib/ckan/sa/bin/paster --plugin=ckan search-index rebuild
okfn     23177 27.5  2.8 202180 58900 ?        S    08:00   0:01          \_ /usr/lib/ckan/sa/bin/python /usr/lib/ckan/sa/bin/paster --plugin=ckan tracking update -c /etc/ckan/sa/production.ini

These seemed to be responsible for the following database connections/queries on s104:

postgres 22213  0.0  0.1 136416  3760 ?        SN   Oct25   8:37 /usr/lib/postgresql/9.1/bin/postgres -D /var/lib/postgresql/9.1/main -c config_file=/etc/postgresql/9.1/main/postgresql.conf
postgres 22260  0.0  1.7 136820 35636 ?        SNs  Oct25  22:30  \_ postgres: writer process
postgres 22261  0.0  0.0 136416  1652 ?        SNs  Oct25  14:43  \_ postgres: wal writer process
postgres 22262  0.0  1.7 137676 35372 ?        SNs  Oct25   7:58  \_ postgres: autovacuum launcher process
postgres 22263  0.0  0.0  96980  1068 ?        SNs  Oct25  21:10  \_ postgres: stats collector process
postgres 17290  0.0  0.5 138524 11196 ?        SNs  01:35   0:00  \_ postgres: sa_datastore_readonly sa_datastore 10.179.187.168(57182) SELECT waiting
postgres 17299  0.0  0.5 138524 11292 ?        SNs  01:36   0:00  \_ postgres: sa_datastore_readonly sa_datastore 10.179.187.168(57245) SELECT waiting
postgres  1877  0.0  0.9 139952 19560 ?        SNs  07:00   0:00  \_ postgres: sa sa 10.179.187.168(53728) idle in transaction
postgres  1891  0.0  0.2 137668  5092 ?        SNs  07:00   0:00  \_ postgres: sa sa 10.179.187.168(53734) idle
postgres  1894  0.0  0.4 139040  8764 ?        SNs  07:00   0:00  \_ postgres: sa_datastore_readwrite sa_datastore 10.179.187.168(53735) idle in transaction
postgres  1895  0.0  0.2 137668  5352 ?        SNs  07:00   0:00  \_ postgres: sa_datastore_readonly sa_datastore 10.179.187.168(53737) idle
postgres  2266  0.0  0.5 141004 10296 ?        SNs  07:00   0:00  \_ postgres: sa_datastore_readwrite sa_datastore 10.179.187.168(53776) SELECT waiting
postgres  2267  0.0  0.3 138420  7072 ?        SNs  07:00   0:00  \_ postgres: sa sa 10.179.187.168(53779) idle
postgres  2268  0.0  0.2 137668  5020 ?        SNs  07:00   0:00  \_ postgres: sa sa 10.179.187.168(53780) idle
postgres  2269  0.0  0.3 137796  7172 ?        SNs  07:00   0:00  \_ postgres: sa_datastore_readwrite sa_datastore 10.179.187.168(53781) CREATE VIEW waiting
postgres  2270  0.0  0.2 137668  5160 ?        SNs  07:00   0:00  \_ postgres: sa_datastore_readonly sa_datastore 10.179.187.168(53782) idle
postgres  3931  0.0  0.2 137656  4488 ?        SNs  07:04   0:00  \_ postgres: sa_datastore_readonly sa_datastore 10.179.187.168(54503) SELECT waiting
postgres  4026  0.0  0.2 137668  4580 ?        SNs  07:05   0:00  \_ postgres: sa_datastore_readonly sa_datastore 10.179.187.168(54563) SELECT waiting
postgres  5903  0.0  0.3 137792  7736 ?        SNs  07:08   0:00  \_ postgres: sa_datastore_readwrite sa_datastore 10.179.187.168(55237) CREATE VIEW waiting
postgres  7477  0.0  0.3 137792  7532 ?        SNs  07:11   0:00  \_ postgres: sa_datastore_readwrite sa_datastore 10.179.187.168(55350) CREATE VIEW waiting
postgres 25323  0.0  0.3 137792  7664 ?        SNs  07:51   0:00  \_ postgres: sa_datastore_readwrite sa_datastore 10.179.187.168(56000) CREATE VIEW waiting
postgres 25327  0.0  0.3 137792  7632 ?        SNs  07:51   0:00  \_ postgres: sa_datastore_readwrite sa_datastore 10.179.187.168(56008) CREATE VIEW waiting
postgres 27241  0.0  0.3 138420  7428 ?        SNs  07:55   0:00  \_ postgres: sa sa 10.179.187.168(56093) idle
postgres 27242  0.0  0.3 138420  7400 ?        SNs  07:55   0:00  \_ postgres: sa sa 10.179.187.168(56094) idle
postgres 27243  0.0  0.2 137668  5088 ?        SNs  07:55   0:00  \_ postgres: sa sa 10.179.187.168(56095) idle
postgres 27244  0.0  0.2 137668  5088 ?        SNs  07:55   0:00  \_ postgres: sa sa 10.179.187.168(56096) idle
postgres 27245  0.0  0.3 137796  7932 ?        SNs  07:55   0:00  \_ postgres: sa_datastore_readwrite sa_datastore 10.179.187.168(56097) CREATE VIEW waiting
postgres 27246  0.0  0.3 137796  7808 ?        SNs  07:55   0:00  \_ postgres: sa_datastore_readwrite sa_datastore 10.179.187.168(56098) CREATE VIEW waiting
postgres 27247  0.0  0.2 137668  5360 ?        SNs  07:55   0:00  \_ postgres: sa_datastore_readonly sa_datastore 10.179.187.168(56099) idle
postgres 27248  0.0  0.2 137668  5356 ?        SNs  07:55   0:00  \_ postgres: sa_datastore_readonly sa_datastore 10.179.187.168(56100) idle
postgres 29079  0.0  0.3 138420  7368 ?        SNs  08:00   0:00  \_ postgres: sa sa 10.179.187.168(56171) idle
postgres 29089  0.0  0.2 137668  5080 ?        SNs  08:00   0:00  \_ postgres: sa sa 10.179.187.168(56172) idle
postgres 29094  0.0  0.3 137796  7808 ?        SNs  08:00   0:00  \_ postgres: sa_datastore_readwrite sa_datastore 10.179.187.168(56174) CREATE VIEW waiting
postgres 29096  0.0  0.2 137668  5348 ?        SNs  08:00   0:00  \_ postgres: sa_datastore_readonly sa_datastore 10.179.187.168(56176) idle
postgres 29097  0.0  0.3 138420  7412 ?        SNs  08:00   0:00  \_ postgres: sa sa 10.179.187.168(56177) idle
postgres 29102  0.0  0.2 137668  5084 ?        SNs  08:00   0:00  \_ postgres: sa sa 10.179.187.168(56180) idle
postgres 29103  0.0  0.3 137796  7776 ?        SNs  08:00   0:00  \_ postgres: sa_datastore_readwrite sa_datastore 10.179.187.168(56181) CREATE VIEW waiting
postgres 29106  0.0  0.2 137668  5356 ?        SNs  08:00   0:00  \_ postgres: sa_datastore_readonly sa_datastore 10.179.187.168(56182) idle

On killing those queries, data.sa.gov.au was responsive again.

@ghost ghost assigned joetsoi Dec 10, 2013
@maxious (Member) commented Dec 10, 2013

Could this be caused by the __read_connection_has_correct_privileges function concurrently trying to use/create/delete the _foo table? https://github.com/okfn/ckan/blob/master/ckanext/datastore/plugin.py#L148

My PostgreSQL error log contains some entries like:

2013-11-17 04:47:28 UTC ERROR:  relation "_foo" already exists
2013-11-17 04:47:28 UTC STATEMENT:  CREATE TABLE _foo ()
2013-11-17 04:47:28 UTC ERROR:  cache lookup failed for relation 236044
2013-11-17 04:47:28 UTC STATEMENT:  DROP TABLE IF EXISTS _foo
2013-11-17 04:48:18 UTC ERROR:  duplicate key value violates unique constraint "pg_type_typname_nsp_index"
2013-11-17 04:48:18 UTC DETAIL:  Key (typname, typnamespace)=(_foo, 2200) already exists.
2013-11-17 04:48:18 UTC STATEMENT:  CREATE TABLE _foo ()
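
These errors are consistent with two of the cron-launched processes racing to create and drop the same _foo table. A two-session sketch along these lines (hypothetical, not taken from the incident) produces the same class of failures:

-- Hypothetical interleaving of two concurrent privilege checks; the table name _foo
-- is the one from the log above, the session labels are illustrative.

-- session A:
BEGIN;
CREATE TABLE _foo ();          -- succeeds; the new catalog rows are not yet visible to B

-- session B (second cron-launched process):
BEGIN;
CREATE TABLE _foo ();          -- waits on A's uncommitted insert, then fails with either
                               -- "relation _foo already exists" or the duplicate key error
                               -- on pg_type_typname_nsp_index, depending on timing

-- session A:
DROP TABLE IF EXISTS _foo;     -- concurrent create/drop of the same name is also how
COMMIT;                        -- "cache lookup failed for relation ..." can appear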

(you can kill idle connections with "SELECT pg_terminate_backend(procpid) FROM pg_stat_activity WHERE current_query like '%' AND query_start < current_timestamp - INTERVAL '2' MINUTE;" but if the site is already locked up the damage is done)
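
A slightly more targeted variant (a sketch for the pre-9.2 pg_stat_activity columns shown above; 9.2 renamed procpid/current_query to pid/query) terminates only backends that have been idle in a transaction for a while, instead of every long-lived connection:

-- Terminate only connections idle in a transaction for more than two minutes
-- (column names are the PostgreSQL 9.1 ones visible in the ps output above).
SELECT pg_terminate_backend(procpid)
FROM pg_stat_activity
WHERE current_query = '<IDLE> in transaction'
  AND query_start < current_timestamp - INTERVAL '2' MINUTE;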

@joetsoi (Contributor) commented Dec 16, 2013

Yes, it probably has more to do with read_connection_has_correct_privileges being called concurrently by multiple processes as well. The function, if executed on its own by a single process, should be fine, as the exclusive locks are only held during the CREATE and DROP statements. The write connection's transaction shouldn't be holding an exclusive lock while the SELECT has_table_privilege is being executed, as it's not doing anything at that point, but the select will have had to wait for the write connection to switch to a shared lock first.

It could be caused by another process starting up at this point and attempting to create the _foo table while the select is happening on a separate one. We might be able to fix it by using a random table name instead of _foo. I am, however, concerned that we aren't closing the connection elsewhere, in another action function or something. Either way, it'd be good to fix the "_foo already exists" error first, in case it's masking another connection bug underneath.

I'm also willing to be completely wrong here, if someone who's more knowledgeable with databases than I am is willing to comment.

joetsoi added a commit that referenced this issue Dec 16, 2013
remove the read connection, and use the write connection with the
username of the read connection to test if the privileges are correct
joetsoi added a commit that referenced this issue Dec 16, 2013
 * remove nested try statement
 * use make_url instead of splicing the string for _get_db_url()
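
For reference, the approach described in the first commit above can be done from one connection because has_table_privilege accepts an explicit role name; a minimal sketch (not the actual ckanext-datastore code), using the role and table names from this issue:

-- Sketch of a single-connection privilege check: the write connection creates the
-- probe table and asks whether the read-only role could SELECT from it, so no
-- second connection is needed. Role and table names are the ones from this issue.
CREATE TABLE _foo ();
SELECT pg_catalog.has_table_privilege('sa_datastore_readonly', '_foo', 'SELECT');
DROP TABLE IF EXISTS _foo;
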
@joetsoi (Contributor) commented Dec 16, 2013

@maxious I have a pull request that fixes read_connection_has_correct_privileges to use only one connection. It's the "CREATE VIEW waiting" connections in our logs that still concern me: were they waiting only because of connections from read_connection_has_correct_privileges, or is there another unclosed connection somewhere? Do you have the logs of your crashes for comparison?

Are you running a test CKAN instance as well? If so, would you be able to test the pull request?
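
If this happens again, a lock query along these lines (a rough sketch; it matches only the lock fields relevant here and uses the pre-9.2 procpid/current_query column names) would show which backend each waiting CREATE VIEW or SELECT is actually stuck behind:

-- For each backend waiting on a lock, show a backend holding a conflicting lock
-- (approximate: joins only on locktype/database/relation/transactionid).
SELECT waiting.procpid       AS waiting_pid,
       waiting.current_query AS waiting_query,
       holder.procpid        AS holding_pid,
       holder.current_query  AS holding_query
FROM pg_locks wl
JOIN pg_locks hl
  ON  hl.locktype = wl.locktype
  AND hl.database      IS NOT DISTINCT FROM wl.database
  AND hl.relation      IS NOT DISTINCT FROM wl.relation
  AND hl.transactionid IS NOT DISTINCT FROM wl.transactionid
  AND hl.pid <> wl.pid
  AND hl.granted
JOIN pg_stat_activity waiting ON waiting.procpid = wl.pid
JOIN pg_stat_activity holder  ON holder.procpid  = hl.pid
WHERE NOT wl.granted;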

@joetsoi (Contributor) commented Apr 1, 2014

Unless we can get better logs when this occurs, or it occurs again, I'll close the issue.

A "select * from pg_stat_activity;" before killing the processes would be useful for debugging this if it recurs.
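
Something along these lines, captured before terminating anything, should be enough to tell the waiting backends apart from whatever is blocking them (a sketch; the column list is the PostgreSQL 9.1 one):

-- Snapshot connection and lock state before killing anything, so the output can be
-- attached to the issue (pre-9.2 column names, matching the 9.1 server above).
SELECT now(), procpid, usename, waiting, xact_start, query_start, current_query
FROM pg_stat_activity
ORDER BY query_start;

SELECT locktype, relation::regclass AS relation, mode, granted, pid
FROM pg_locks
ORDER BY granted, pid;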

@joetsoi joetsoi closed this as completed Apr 1, 2014
joetsoi added a commit that referenced this issue Feb 10, 2015
remove the read connection, and use the write connection with the
username of the read connection to test if the privileges are correct
joetsoi added a commit that referenced this issue Feb 10, 2015
 * remove nested try statement
 * use make_url instead of splicing the string for _get_db_url()