Multiple postgres connections from cronjob causes site to go unresponsive #1374
Could this be caused by the `__read_connection_has_correct_privileges` function concurrently trying to use/create/delete the `_foo` table? https://github.com/okfn/ckan/blob/master/ckanext/datastore/plugin.py#L148 My PostgreSQL error log contains some entries like:
(You can kill idle connections with

```sql
SELECT pg_terminate_backend(procpid)
FROM pg_stat_activity
WHERE current_query LIKE '%'
  AND query_start < current_timestamp - INTERVAL '2' MINUTE;
```

but if the site is already locked up, the damage is done.)
Yes, it probably has more to do with `read_connection_has_correct_privileges` being called concurrently by multiple processes. The function, if executed on its own by a single process, should be fine, as the exclusive locks are only held during the CREATE and DROP statements. The write connection's transaction shouldn't be holding an exclusive lock while the `SELECT has_table_privilege` is being executed, as it's not doing anything at that point, but the select would have had to wait for the write connection to switch to a shared lock first.

The problem could be another process starting up at this point and attempting to create the `_foo` table while the select is happening on a separate connection. We might be able to fix it by using a random table name instead of `_foo`. I am, however, concerned that we aren't closing the connection elsewhere, in another action function or somewhere similar. Either way, it'd be good to fix the "`_foo` already exists" error first, in case it's masking another connection bug underneath.

I'm also willing to be completely wrong here, if someone more knowledgeable about databases than I am is willing to comment.
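The random-table-name idea above could look something like this. This is only a sketch of the naming part; the helper name and the `_foo_` prefix are my own, not CKAN's:

```python
import uuid

def make_probe_table_name():
    # Give each process its own throwaway table name, so concurrent
    # privilege checks never race on creating/dropping the same "_foo".
    return '_foo_' + uuid.uuid4().hex

# Two concurrent processes would probe different tables:
a, b = make_probe_table_name(), make_probe_table_name()
```

The CREATE/DROP pair in the privilege check would then operate on `a` in one process and `b` in the other, so a second worker starting mid-check no longer collides on the same table.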
Remove the read connection, and use the write connection with the username of the read connection to test if the privileges are correct.

* remove nested try statement
* use `make_url` instead of splicing the string for `_get_db_url()`
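The single-connection approach described in the commit message can be sketched as the sequence of statements run on the write connection. This is my own illustration of the idea, not the pull request's actual code; it relies on PostgreSQL's `has_table_privilege` function, and the `datastore_read` role name is hypothetical:

```python
def single_connection_privilege_probe(read_user):
    # All three statements run on the ONE (write) connection:
    # create a probe table, ask postgres whether the read-only role
    # could write to it, then drop the probe table. The check should
    # pass only if the answer to the middle query is false.
    return [
        'CREATE TABLE _foo ()',
        "SELECT has_table_privilege('%s', '_foo', 'INSERT')" % read_user,
        'DROP TABLE _foo',
    ]

stmts = single_connection_privilege_probe('datastore_read')
```

Because no second connection is opened, there is no window in which a read connection sits waiting on the write connection's exclusive lock.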
@maxious I have a pull request that fixes `read_connection_has_correct_privileges` to only use one connection. It's the "CREATE VIEW waiting" connections in our logs that still concern me: were they just waiting because of connections from `read_connection_has_correct_privileges`, or is there another unclosed connection somewhere? Do you have the logs of your crashes for comparison? Are you running a test CKAN instance as well? If so, would you be able to test the pull request?
Unless we can get better logs when this occurs, I'll close the issue. If it recurs, running `select * from pg_stat_activity;` before killing the processes will be useful for debugging.
The site affected in this case was data.sa.gov.au (s104, s105). Accessing the site would time out and nginx would return a 501 error.
The s105 host had these cronjobs running:

which seemed to be responsible for these db connections/queries on s104:
On killing those queries, data.sa.gov.au was responsive again.