Multiple postgres connections from cronjob causes site to go unresponsive #1374

Closed

joelrebel opened this issue Dec 9, 2013 · 4 comments

@joelrebel commented Dec 9, 2013

The site affected in this case was data.sa.gov.au (s104, s105). Accessing the site would time out and nginx would return a 501 error.

The s105 host had these cron jobs running:

root     21997  0.0  0.0  19112   392 ?        Ss   Apr23   2:26 cron
root     10296  0.0  0.0  33844  1088 ?        S    07:00   0:00  \_ CRON
okfn     10298  0.0  0.0   4404   596 ?        Ss   07:00   0:00  |   \_ /bin/sh -c /usr/lib/ckan/sa/bin/paster --plugin=ckan tracking update -c /etc/ckan/sa/production.ini && /usr/lib/ckan/sa/bin/paster --plugin=ckan search-index rebuild
okfn     10818  0.0  2.8 202220 58892 ?        S    07:00   0:01  |       \_ /usr/lib/ckan/sa/bin/python /usr/lib/ckan/sa/bin/paster --plugin=ckan search-index rebuild -r -c /etc/ckan/sa/production.ini

root     23170  0.0  0.0  33844  1088 ?        S    08:00   0:00  \_ CRON
okfn     23176  0.0  0.0   4404   600 ?        Ss   08:00   0:00      \_ /bin/sh -c /usr/lib/ckan/sa/bin/paster --plugin=ckan tracking update -c /etc/ckan/sa/production.ini && /usr/lib/ckan/sa/bin/paster --plugin=ckan search-index rebuild
okfn     23177 27.5  2.8 202180 58900 ?        S    08:00   0:01          \_ /usr/lib/ckan/sa/bin/python /usr/lib/ckan/sa/bin/paster --plugin=ckan tracking update -c /etc/ckan/sa/production.ini

These seemed to be responsible for the following database connections/queries on s104:

postgres 22213  0.0  0.1 136416  3760 ?        SN   Oct25   8:37 /usr/lib/postgresql/9.1/bin/postgres -D /var/lib/postgresql/9.1/main -c config_file=/etc/postgresql/9.1/main/postgresql.conf
postgres 22260  0.0  1.7 136820 35636 ?        SNs  Oct25  22:30  \_ postgres: writer process
postgres 22261  0.0  0.0 136416  1652 ?        SNs  Oct25  14:43  \_ postgres: wal writer process
postgres 22262  0.0  1.7 137676 35372 ?        SNs  Oct25   7:58  \_ postgres: autovacuum launcher process
postgres 22263  0.0  0.0  96980  1068 ?        SNs  Oct25  21:10  \_ postgres: stats collector process
postgres 17290  0.0  0.5 138524 11196 ?        SNs  01:35   0:00  \_ postgres: sa_datastore_readonly sa_datastore 10.179.187.168(57182) SELECT waiting
postgres 17299  0.0  0.5 138524 11292 ?        SNs  01:36   0:00  \_ postgres: sa_datastore_readonly sa_datastore 10.179.187.168(57245) SELECT waiting
postgres  1877  0.0  0.9 139952 19560 ?        SNs  07:00   0:00  \_ postgres: sa sa 10.179.187.168(53728) idle in transaction
postgres  1891  0.0  0.2 137668  5092 ?        SNs  07:00   0:00  \_ postgres: sa sa 10.179.187.168(53734) idle
postgres  1894  0.0  0.4 139040  8764 ?        SNs  07:00   0:00  \_ postgres: sa_datastore_readwrite sa_datastore 10.179.187.168(53735) idle in transaction
postgres  1895  0.0  0.2 137668  5352 ?        SNs  07:00   0:00  \_ postgres: sa_datastore_readonly sa_datastore 10.179.187.168(53737) idle
postgres  2266  0.0  0.5 141004 10296 ?        SNs  07:00   0:00  \_ postgres: sa_datastore_readwrite sa_datastore 10.179.187.168(53776) SELECT waiting
postgres  2267  0.0  0.3 138420  7072 ?        SNs  07:00   0:00  \_ postgres: sa sa 10.179.187.168(53779) idle
postgres  2268  0.0  0.2 137668  5020 ?        SNs  07:00   0:00  \_ postgres: sa sa 10.179.187.168(53780) idle
postgres  2269  0.0  0.3 137796  7172 ?        SNs  07:00   0:00  \_ postgres: sa_datastore_readwrite sa_datastore 10.179.187.168(53781) CREATE VIEW waiting
postgres  2270  0.0  0.2 137668  5160 ?        SNs  07:00   0:00  \_ postgres: sa_datastore_readonly sa_datastore 10.179.187.168(53782) idle
postgres  3931  0.0  0.2 137656  4488 ?        SNs  07:04   0:00  \_ postgres: sa_datastore_readonly sa_datastore 10.179.187.168(54503) SELECT waiting
postgres  4026  0.0  0.2 137668  4580 ?        SNs  07:05   0:00  \_ postgres: sa_datastore_readonly sa_datastore 10.179.187.168(54563) SELECT waiting
postgres  5903  0.0  0.3 137792  7736 ?        SNs  07:08   0:00  \_ postgres: sa_datastore_readwrite sa_datastore 10.179.187.168(55237) CREATE VIEW waiting
postgres  7477  0.0  0.3 137792  7532 ?        SNs  07:11   0:00  \_ postgres: sa_datastore_readwrite sa_datastore 10.179.187.168(55350) CREATE VIEW waiting
postgres 25323  0.0  0.3 137792  7664 ?        SNs  07:51   0:00  \_ postgres: sa_datastore_readwrite sa_datastore 10.179.187.168(56000) CREATE VIEW waiting
postgres 25327  0.0  0.3 137792  7632 ?        SNs  07:51   0:00  \_ postgres: sa_datastore_readwrite sa_datastore 10.179.187.168(56008) CREATE VIEW waiting
postgres 27241  0.0  0.3 138420  7428 ?        SNs  07:55   0:00  \_ postgres: sa sa 10.179.187.168(56093) idle
postgres 27242  0.0  0.3 138420  7400 ?        SNs  07:55   0:00  \_ postgres: sa sa 10.179.187.168(56094) idle
postgres 27243  0.0  0.2 137668  5088 ?        SNs  07:55   0:00  \_ postgres: sa sa 10.179.187.168(56095) idle
postgres 27244  0.0  0.2 137668  5088 ?        SNs  07:55   0:00  \_ postgres: sa sa 10.179.187.168(56096) idle
postgres 27245  0.0  0.3 137796  7932 ?        SNs  07:55   0:00  \_ postgres: sa_datastore_readwrite sa_datastore 10.179.187.168(56097) CREATE VIEW waiting
postgres 27246  0.0  0.3 137796  7808 ?        SNs  07:55   0:00  \_ postgres: sa_datastore_readwrite sa_datastore 10.179.187.168(56098) CREATE VIEW waiting
postgres 27247  0.0  0.2 137668  5360 ?        SNs  07:55   0:00  \_ postgres: sa_datastore_readonly sa_datastore 10.179.187.168(56099) idle
postgres 27248  0.0  0.2 137668  5356 ?        SNs  07:55   0:00  \_ postgres: sa_datastore_readonly sa_datastore 10.179.187.168(56100) idle
postgres 29079  0.0  0.3 138420  7368 ?        SNs  08:00   0:00  \_ postgres: sa sa 10.179.187.168(56171) idle
postgres 29089  0.0  0.2 137668  5080 ?        SNs  08:00   0:00  \_ postgres: sa sa 10.179.187.168(56172) idle
postgres 29094  0.0  0.3 137796  7808 ?        SNs  08:00   0:00  \_ postgres: sa_datastore_readwrite sa_datastore 10.179.187.168(56174) CREATE VIEW waiting
postgres 29096  0.0  0.2 137668  5348 ?        SNs  08:00   0:00  \_ postgres: sa_datastore_readonly sa_datastore 10.179.187.168(56176) idle
postgres 29097  0.0  0.3 138420  7412 ?        SNs  08:00   0:00  \_ postgres: sa sa 10.179.187.168(56177) idle
postgres 29102  0.0  0.2 137668  5084 ?        SNs  08:00   0:00  \_ postgres: sa sa 10.179.187.168(56180) idle
postgres 29103  0.0  0.3 137796  7776 ?        SNs  08:00   0:00  \_ postgres: sa_datastore_readwrite sa_datastore 10.179.187.168(56181) CREATE VIEW waiting
postgres 29106  0.0  0.2 137668  5356 ?        SNs  08:00   0:00  \_ postgres: sa_datastore_readonly sa_datastore 10.179.187.168(56182) idle

On killing those queries, data.sa.gov.au was responsive again.

@ghost ghost assigned joetsoi Dec 10, 2013
@maxious (Member) commented Dec 10, 2013

Could this be caused by the __read_connection_has_correct_privileges function concurrently trying to use/create/delete the _foo table? https://github.com/okfn/ckan/blob/master/ckanext/datastore/plugin.py#L148

My PostgreSQL error log contains some entries like:

2013-11-17 04:47:28 UTC ERROR:  relation "_foo" already exists
2013-11-17 04:47:28 UTC STATEMENT:  CREATE TABLE _foo ()
2013-11-17 04:47:28 UTC ERROR:  cache lookup failed for relation 236044
2013-11-17 04:47:28 UTC STATEMENT:  DROP TABLE IF EXISTS _foo
2013-11-17 04:48:18 UTC ERROR:  duplicate key value violates unique constraint "pg_type_typname_nsp_index"
2013-11-17 04:48:18 UTC DETAIL:  Key (typname, typnamespace)=(_foo, 2200) already exists.
2013-11-17 04:48:18 UTC STATEMENT:  CREATE TABLE _foo ()
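
These errors are consistent with two of the cron-launched processes racing to create and drop the same _foo table. A two-session sketch along these lines (hypothetical, not taken from the incident) produces the same class of failures:

-- Hypothetical interleaving of two concurrent privilege checks; the table name _foo
-- is the one from the log above, the session labels are illustrative.

-- session A:
BEGIN;
CREATE TABLE _foo ();          -- succeeds; the new catalog rows are not yet visible to B

-- session B (second cron-launched process):
BEGIN;
CREATE TABLE _foo ();          -- waits on A's uncommitted insert, then fails with either
                               -- "relation _foo already exists" or the duplicate key error
                               -- on pg_type_typname_nsp_index, depending on timing

-- session A:
DROP TABLE IF EXISTS _foo;     -- concurrent create/drop of the same name is also how
COMMIT;                        -- "cache lookup failed for relation ..." can appear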

(you can kill idle connections with "SELECT pg_terminate_backend(procpid) FROM pg_stat_activity WHERE current_query like '%' AND query_start < current_timestamp - INTERVAL '2' MINUTE;" but if the site is already locked up the damage is done)
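
A slightly more targeted variant (a sketch for the pre-9.2 pg_stat_activity columns shown above; 9.2 renamed procpid/current_query to pid/query) terminates only backends that have been idle in a transaction for a while, instead of every long-lived connection:

-- Terminate only connections idle in a transaction for more than two minutes
-- (column names are the PostgreSQL 9.1 ones visible in the ps output above).
SELECT pg_terminate_backend(procpid)
FROM pg_stat_activity
WHERE current_query = '<IDLE> in transaction'
  AND query_start < current_timestamp - INTERVAL '2' MINUTE;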

@joetsoi (Contributor) commented Dec 16, 2013

Yes, it probably has more to do with read_connection_has_correct_privileges being called concurrently by multiple processes as well. The function, if executed on its own by a single process, should be fine, as the exclusive locks are only held during the CREATE and DROP statements. The write connection's transaction shouldn't be holding an exclusive lock while the SELECT has_table_privilege is being executed, as it's not doing anything at that point, but the select will have had to wait for the write connection to switch to a shared lock first.

It could be caused by another process starting up at this point and attempting to create the _foo table while the select is happening on a separate one. We might be able to fix it by using a random table name instead of _foo. I am, however, concerned that we aren't closing the connection elsewhere, in another action function or something. Either way, it'd be good to fix the "_foo already exists" error first, in case it's masking another connection bug underneath.

I'm also willing to be completely wrong here, if someone who's more knowledgeable with databases than I am is willing to comment.

joetsoi added a commit that referenced this issue Dec 16, 2013
remove the read connection, and use the write connection with the
username of the read connection to test if the privileges are correct
joetsoi added a commit that referenced this issue Dec 16, 2013
 * remove nested try statement
 * use make_url instead of splicing the string for _get_db_url()
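
For reference, the approach described in the first commit above can be done from one connection because has_table_privilege accepts an explicit role name; a minimal sketch (not the actual ckanext-datastore code), using the role and table names from this issue:

-- Sketch of a single-connection privilege check: the write connection creates the
-- probe table and asks whether the read-only role could SELECT from it, so no
-- second connection is needed. Role and table names are the ones from this issue.
CREATE TABLE _foo ();
SELECT pg_catalog.has_table_privilege('sa_datastore_readonly', '_foo', 'SELECT');
DROP TABLE IF EXISTS _foo;
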
@joetsoi (Contributor) commented Dec 16, 2013

@maxious I have a pull request that fixes read_connection_has_correct_privileges to use only one connection. It's the "CREATE VIEW waiting" connections in our logs that still concern me: were they waiting only because of connections from read_connection_has_correct_privileges, or is there another unclosed connection somewhere? Do you have the logs of your crashes for comparison?

Are you running a test CKAN instance as well? If so, would you be able to test the pull request?
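
If this happens again, a lock query along these lines (a rough sketch; it matches only the lock fields relevant here and uses the pre-9.2 procpid/current_query column names) would show which backend each waiting CREATE VIEW or SELECT is actually stuck behind:

-- For each backend waiting on a lock, show a backend holding a conflicting lock
-- (approximate: joins only on locktype/database/relation/transactionid).
SELECT waiting.procpid       AS waiting_pid,
       waiting.current_query AS waiting_query,
       holder.procpid        AS holding_pid,
       holder.current_query  AS holding_query
FROM pg_locks wl
JOIN pg_locks hl
  ON  hl.locktype = wl.locktype
  AND hl.database      IS NOT DISTINCT FROM wl.database
  AND hl.relation      IS NOT DISTINCT FROM wl.relation
  AND hl.transactionid IS NOT DISTINCT FROM wl.transactionid
  AND hl.pid <> wl.pid
  AND hl.granted
JOIN pg_stat_activity waiting ON waiting.procpid = wl.pid
JOIN pg_stat_activity holder  ON holder.procpid  = hl.pid
WHERE NOT wl.granted;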

@joetsoi (Contributor) commented Apr 1, 2014

Unless we can get better logs when this occurs, or it occurs again, I'll close the issue.

A "select * from pg_stat_activity;" before killing the processes would be useful for debugging this if it recurs.
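
Something along these lines, captured before terminating anything, should be enough to tell the waiting backends apart from whatever is blocking them (a sketch; the column list is the PostgreSQL 9.1 one):

-- Snapshot connection and lock state before killing anything, so the output can be
-- attached to the issue (pre-9.2 column names, matching the 9.1 server above).
SELECT now(), procpid, usename, waiting, xact_start, query_start, current_query
FROM pg_stat_activity
ORDER BY query_start;

SELECT locktype, relation::regclass AS relation, mode, granted, pid
FROM pg_locks
ORDER BY granted, pid;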

@joetsoi joetsoi closed this as completed Apr 1, 2014
joetsoi added a commit that referenced this issue Feb 10, 2015
remove the read connection, and use the write connection with the
username of the read connection to test if the privileges are correct
joetsoi added a commit that referenced this issue Feb 10, 2015
 * remove nested try statement
 * use make_url instead of splicing the string for _get_db_url()