Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Job timeout doesn't kill the mysql query #4629

Merged
merged 2 commits into from
Feb 24, 2020

Conversation

rauchy
Copy link
Contributor

@rauchy rauchy commented Feb 23, 2020

I'm using the master branch. During refreshing one of my dashboards, a mysql query reached the timeout which raised the "query execution time limit".
I thought the timeout handler would also kill the related sql query in mysql server while it didn't. So there are lots of stuck sql queries in the mysql server.

So is that possible to also kill the related running query in the database server during handling the timeout?

Thanks.

@Jakdaw
Copy link
Contributor

Jakdaw commented Feb 10, 2020

That should happen. What version of redash are you using?

@dingweihua
Copy link
Author

@Jakdaw
I'm using the latest codes in master branch. It shows "v9.0.0-alpha" from the manager command.
I know the master branch is fully using RQ now. Does v8.x (which uses Celery) have the same result?

@arikfr
Copy link
Member

arikfr commented Feb 10, 2020

@rauchy does the RQ implementation of cancel send Interrupt signal to the job before killing it?

@dingweihua
Copy link
Author

@arikfr I dig into the code and found that the RQ did send interrupt signal to the job after cancelling the job. However my case is, I didn't manually cancel the job but let the query automatically trigger the timeout exception. But the timeout handler didn't kill the sql query.

@rauchy rauchy self-assigned this Feb 12, 2020
@rauchy
Copy link
Contributor

rauchy commented Feb 12, 2020

@dingweihua thanks for reporting. I will look into the signal propagation issue for MySQL, but meanwhile - you might benefit from updating to the latest master, as it seems that your version had an implicit 180 second timeout, which is no longer imposed.

@rauchy rauchy added this to To do in Switch from Celery to RQ via automation Feb 12, 2020
@rauchy rauchy added the Backend label Feb 12, 2020
@dingweihua
Copy link
Author

@rauchy Yeah i got it. Thanks! You updated the latest default time limit to -1 to mention the user to setup it.

@dingweihua
Copy link
Author

On considering this issue, I'm trying to use version 8.0 (which uses Celery). Would the sql query be automatically killed after timeout in v8.0? @rauchy

@rauchy rauchy moved this from To do to In progress in Switch from Celery to RQ Feb 16, 2020
@rauchy
Copy link
Contributor

rauchy commented Feb 16, 2020

Testing this, it appears that MySQL is receiving signals properly on timeouts and manual cancellation using the latest master. @dingweihua could you elaborate on the issue you are facing?

@dingweihua
Copy link
Author

Thanks @rauchy
My question is on v8.0. In v8.0, if we wait a MYSQL query (some time-consuming query) to timeout, will our signal processing kill the sql query in the Mysql database server? As we mentioned above, it wouldn't in the master branch.

@dingweihua
Copy link
Author

@dingweihua thanks for reporting. I will look into the signal propagation issue for MySQL, but meanwhile - you might benefit from updating to the latest master, as it seems that your version had an implicit 180 second timeout, which is no longer imposed.

Thinking about this issue again, i found you've already set the signal processing handler in QueryExecutor's run() method. But in this situation, I don't think our thread could receive the SIGINT signal. SIGINT is translated into a KeyboardInterrupt exception. While it seems that the JobTimeoutException is not a "SIGINT" like signal.

Actually, I debugged my local code (master branch). The run_query Exception was raised (in QueryExecutor.run()) while no SIGINT signal received.

@rauchy
Copy link
Contributor

rauchy commented Feb 18, 2020

As we mentioned above, it wouldn't in the master branch.

What makes you think that? I ran the following test and saw that the query was cancelled in MySQL:

  1. Start a long-running MySQL query in Redash (Something like select sleep(3600)).
  2. Open a MySQL console and run SHOW FULL PROCESSLIST;, you should see the above query in there:
mysql> SHOW FULL PROCESSLIST;
+----+-----------------+------------------+------+---------+------+------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Id | User            | Host             | db   | Command | Time | State                  | Info                                                                                                                                                                                            |
+----+-----------------+------------------+------+---------+------+------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|  4 | event_scheduler | localhost        | NULL | Daemon  |   27 | Waiting on empty queue | NULL                                                                                                                                                                                            |
|  8 | root            | localhost        | NULL | Query   |    0 | starting               | SHOW FULL PROCESSLIST                                                                                                                                                                           |
|  9 | user            | 123.12.1.2:49612 | db   | Query   |    7 | User sleep             | /* Username: john@doe.com, Query ID: adhoc, Queue: queries, Job ID: 7ce0d635-9499-4a85-a06b-31215d61c560, Query Hash: 31d8816df61c454e0c28d7373391820d, Scheduled: False */ select sleep(3600)  |
+----+-----------------+------------------+------+---------+------+------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
3 rows in set (0.00 sec)
  1. Cancel the query in Redash.
  2. Run the same command in your MySQL console and watch that the query process has been terminated:
mysql> SHOW FULL PROCESSLIST;
+----+-----------------+-----------+------+---------+------+------------------------+-----------------------+
| Id | User            | Host      | db   | Command | Time | State                  | Info                  |
+----+-----------------+-----------+------+---------+------+------------------------+-----------------------+
|  4 | event_scheduler | localhost | NULL | Daemon  |   35 | Waiting on empty queue | NULL                  |
|  8 | root            | localhost | NULL | Query   |    0 | starting               | SHOW FULL PROCESSLIST |
+----+-----------------+-----------+------+---------+------+------------------------+-----------------------+
2 rows in set (0.00 sec)

Same thing with setting a short time limit with a query that times out. Are you getting different results?

@dingweihua
Copy link
Author

As we mentioned above, it wouldn't in the master branch.

What makes you think that? I ran the following test and saw that the query was cancelled in MySQL:

  1. Start a long-running MySQL query in Redash (Something like select sleep(3600)).
  2. Open a MySQL console and run SHOW FULL PROCESSLIST;, you should see the above query in there:
mysql> SHOW FULL PROCESSLIST;
+----+-----------------+------------------+------+---------+------+------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Id | User            | Host             | db   | Command | Time | State                  | Info                                                                                                                                                                                            |
+----+-----------------+------------------+------+---------+------+------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|  4 | event_scheduler | localhost        | NULL | Daemon  |   27 | Waiting on empty queue | NULL                                                                                                                                                                                            |
|  8 | root            | localhost        | NULL | Query   |    0 | starting               | SHOW FULL PROCESSLIST                                                                                                                                                                           |
|  9 | user            | 123.12.1.2:49612 | db   | Query   |    7 | User sleep             | /* Username: john@doe.com, Query ID: adhoc, Queue: queries, Job ID: 7ce0d635-9499-4a85-a06b-31215d61c560, Query Hash: 31d8816df61c454e0c28d7373391820d, Scheduled: False */ select sleep(3600)  |
+----+-----------------+------------------+------+---------+------+------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
3 rows in set (0.00 sec)
  1. Cancel the query in Redash.
  2. Run the same command in your MySQL console and watch that the query process has been terminated:
mysql> SHOW FULL PROCESSLIST;
+----+-----------------+-----------+------+---------+------+------------------------+-----------------------+
| Id | User            | Host      | db   | Command | Time | State                  | Info                  |
+----+-----------------+-----------+------+---------+------+------------------------+-----------------------+
|  4 | event_scheduler | localhost | NULL | Daemon  |   35 | Waiting on empty queue | NULL                  |
|  8 | root            | localhost | NULL | Query   |    0 | starting               | SHOW FULL PROCESSLIST |
+----+-----------------+-----------+------+---------+------+------------------------+-----------------------+
2 rows in set (0.00 sec)

Same thing with setting a short time limit with a query that times out. Are you getting different results?

@rauchy Yes, you're right! I have the same idea on your mentioned case. But my case is different from yours. In my case, I didn't manually trigger the cancel operation. My situation is:

  1. Start a long-running MySQL query in Redash
  2. SHOW FULL PROCESSLIST and we'll see the query
  3. Wait for the job timeout as we set the time limit to maybe 1 minute (do not trigger the cancel operation)
  4. From the frontend webpage, we got the error message:

Error running query: Query exceeded Redash query execution time limit.
While the sql query in database server was not killed.

  1. SHOW FULL PROCESSLIST But the sql query is still in the mysql server.

This is my case. In my case, I don't think we can receive the signal so we couldn't do anything on the running sql query. Do you think we should also kill the sql query after facing time limit (without manually cancelling)?

@arikfr
Copy link
Member

arikfr commented Feb 18, 2020

@rauchy when a timeout is reached how the job is being cancelled?

@rauchy
Copy link
Contributor

rauchy commented Feb 18, 2020

Same thing with setting a short time limit with a query that times out.

@dingweihua I've just double-checked this. As I said, on master you should see the same results when cancelling manually or when cancellation automatically occurs after a timeout.

@arikfr on manual cancellations the work horse will receive a SIGINT and cancel as it used to (by executing the MySQL KILL command), while on timeouts the work horse process would get terminated, which results in MySQL terminating the running queries. In fact, we can consider dropping the KILL execution for manual cancellations, as the work horse would exit anyway.

@arikfr
Copy link
Member

arikfr commented Feb 18, 2020

@rauchy I know that Postgres terminates running queries if the client disconnects, but not sure about MySQL. I'm pretty sure it doesn't (hence the explicit behavior we have). Maybe it was introduced in newer versions of MySQL?

@rauchy
Copy link
Contributor

rauchy commented Feb 18, 2020

@dingweihua which version of MySQL are you running?

@dingweihua
Copy link
Author

I'm using MySQL 5.5.5-10.2.12-MariaDB

@rauchy
Copy link
Contributor

rauchy commented Feb 19, 2020

@dingweihua I've tried this with 10.2.31 and 10.1.44 and they worked as expected. Wondering if this is dependent on the type of query you are running. In my example, I'm using SELECT SLEEP(3600), which puts the MySQL in a user-sleep state - can you try it with such a query?

@dingweihua
Copy link
Author

@rauchy I've tried SELECT SLEEP(3600) and YES the timeout killed this query in MySQL server! So that means it works well on "SLEEP" query. But when i use a table related query, the SQL query wasn't killed after timeout. You can also try a time-consuming query on some large table and I think the same situation would also happen.

@rauchy
Copy link
Contributor

rauchy commented Feb 20, 2020

@dingweihua interesting. Can you try running with https://github.com/getredash/redash/tree/forward-timeout-signals-to-mysql?

@dingweihua
Copy link
Author

@rauchy Yes, i just tried the branch and it works! I think it's a good fix in your change. By using the signal SIGALRM, you raised the JobTimeoutException so that the mysql runner could catch the exception to cancel the running query thread. Thanks!

@rauchy rauchy requested a review from arikfr February 23, 2020 08:46
@rauchy
Copy link
Contributor

rauchy commented Feb 23, 2020

@arikfr WDYT about this? We'll probably need to attach to this signal to other multi-threaded runners.

Switch from Celery to RQ automation moved this from In progress to Reviewer approved Feb 24, 2020
@rauchy rauchy merged commit 35250d6 into master Feb 24, 2020
Switch from Celery to RQ automation moved this from Reviewer approved to Done Feb 24, 2020
@rauchy rauchy deleted the forward-timeout-signals-to-mysql branch February 24, 2020 22:16
@rauchy rauchy mentioned this pull request Feb 24, 2020
2 tasks
suitedJK pushed a commit to suitedJK/redash that referenced this pull request Jul 21, 2020
* forward timeout SIGALRMs to MySQL threads in order to kill any running proccesses

* no need to attach to SIGALRM as RQ already does that

# Conflicts:
#	redash/query_runner/mysql.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

None yet

4 participants