Statements stuck in sys.jobs #6383

Environment:
Crate 1.1.6
Amazon Linux AMI 4.9.38-16.33.amzn1.x86_64
6 AWS t2.small nodes, in 3 availability zones

Problem description:
I posted this issue on Slack and was instructed to create an issue here.

Seeing quite a few (over 100) entries in sys.jobs for statements with old start dates, some up to a week old. The application which started the queries is definitely no longer running, and I don't seem to be able to KILL the jobs either (nothing happens).

Currently 102 jobs. Killing a couple of them using KILL didn't change anything; the jobs still show up in sys.jobs.

I've tried running a few of the queries found in sys.jobs and they all complete in under a second. Nothing noteworthy is in the logs, and node CPU looks normal.

I haven't restarted the nodes, in an attempt to preserve this state. What else should I look for? Thanks.
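For reference, a minimal sketch of the checks described above, assuming the sys.jobs columns id, stmt, and started and the KILL statement as documented for this CrateDB version range; the job id below is a placeholder, not a real job:

```sql
-- List the oldest entries still sitting in sys.jobs.
SELECT id, stmt, started
FROM sys.jobs
ORDER BY started ASC
LIMIT 10;

-- Try to kill one of them; the id is a placeholder.
KILL '00000000-0000-0000-0000-000000000000';
```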
Comments
Thanks for reporting. I remember that we fixed something along the lines of entries not properly clearing from sys.jobs.
We are also seeing stuck jobs for simple SELECT statements. We can't even KILL them; it has no effect. The only solution seems to be to restart the cluster. This issue has been present on almost every version; we are now on 2.1.6 and still get them.
Could you also post the output of a query on sys.operations?
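A sketch of such a query, assuming the sys.operations columns job_id, name, and started and equi-join support as documented for CrateDB 2.x:

```sql
-- Link in-flight operations back to the statements that started them.
SELECT o.job_id, o.name, o.started, j.stmt
FROM sys.operations o
JOIN sys.jobs j ON o.job_id = j.id
ORDER BY o.started;
```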
Unfortunately the cluster was restarted recently, however I will report back when jobs get stuck again.
Well, that did not take long 😞. One of the nodes in the cluster just ran out of heap (going from 4 GB to 7.9 GB in 15 seconds, which seems way too fast and more like a bug). The cluster became unresponsive until the node was restarted, so I could not run the sys.operations query. Unfortunately I could not generate a heap dump because it was on the production cluster and had to be fixed ASAP. However, I sent you the logs via email. Thank you!
Here is the result for the sys.operations query:
Going from 4 GB to 7.9 GB in 15 seconds is not uncommon, but running out of heap is definitely an issue that should be caught by the circuit breaker. I think this might be a different problem; maybe you could open a separate issue for it?
Any chance to get a heap dump from the node?
I tried to get a heap dump, but without success. Now everything seems fine (I did change the "write.wait_for_active_shards" setting to 1 for a few tables, since we were getting some errors about not enough active shard copies). Edit: created a separate issue, #6472.
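For context, a sketch of that settings change, assuming the table setting name "write.wait_for_active_shards" from the CrateDB docs; the table name is a placeholder:

```sql
-- Require only one active shard copy before accepting writes.
-- "my_table" stands in for the affected tables.
ALTER TABLE my_table SET ("write.wait_for_active_shards" = 1);
```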
I currently have 719 rows stuck in sys.jobs.
@SneakyMax could you also report:
Yeah. The result of the command above gives me:

I'd give you the Crate logs for the errors too, but unfortunately our logs are full of thousands and thousands of lines of a deprecation warning.
@mfussenegger I finally managed to get a heap dump. Sent to your email.
@SneakyMax Are you using the crate python driver? This deprecation warning is fixed in newer versions of the python driver.

Regarding the issue that describes stuck queries listed in sys.jobs:
@mikethebeer We're using the crate python driver. That would be hilarious if this problem was because of flooded logs. I'll see if I can find anything after upgrading the driver.
@SneakyMax OK, keep us informed if you see stuck queries again after the driver update. This was just an assumption that came to mind first, since I have the feeling that certain nodes become unresponsive and so their queries get stuck.
@SneakyMax Another parameter that you could look at is the number of shards in the cluster. If I remember correctly, there was an issue with hanging queries in clusters with a very large number of shards.
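A sketch of how to check that, using the same sys.shards columns and GROUP BY-on-alias style as the queries later in this thread:

```sql
-- Shard counts per node; a very high total can be a factor.
SELECT _node['name'] AS node_name, count(*) AS num_shards
FROM sys.shards
GROUP BY node_name
ORDER BY num_shards DESC;
```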
@seut Yeah, we currently have 39 statements in sys.jobs, none of which should be there. Crate 2.2.3.
@seut Yes, we also have around 10 jobs stuck which can't be killed (on a 5-node cluster running Crate 2.1.6). Here are the statements from those jobs:

SELECT id, severity, description, passed FROM sys.checks WHERE passed = FALSE ORDER BY severity DESC, id

select schema_name || '.' || table_name, min_lucene_version from sys.shards where min_lucene_version not like '6.%.%' order by 1
Thanks for updating the issue again with more information. Previously we thought this could be related to more complex queries, but it seems to affect simple queries on sys tables as well, which led me to check the job-log code and how the entries are cleaned up. While looking through the code that runs before a job log entry is removed, I found that we don't remove the entry in the case of channel errors, which can occur while sending the REST response. This could be the cause of this issue, because all the clients used here are REST-based. I've opened a PR (#6718), so the fix should land soon and you can then test whether the issue still occurs.
Closing this as we believe it's been fixed with 2.2.6. Please re-open if anyone encounters this in 2.2.6+ |
We've upgraded now and …
@mfussenegger we've been running 2.3 for a week now and no extra entries in sys.jobs! |
We have also updated to 2.3.2 and still see a few stuck jobs which cannot be killed (a few UPDATE statements).
Were the UPDATE statements executed while nodes were being restarted? We have a fix for that case in 2.3.3. If that's not it, a heap dump would be very valuable for us to investigate further.
The stuck jobs are from today, but the nodes haven't been restarted since about two days ago (a few crashed). Is there any way to determine which node the stuck queries executed on?
Something is strange… those 5 stuck UPDATE statements do not have entries in sys.operations. Here is the query we used for the shard overview:

SELECT table_name, schema_name, format('%s.%s', schema_name, table_name) AS fqn, _node['id'] AS node_id, state, routing_state, relocating_node, count(*), "primary", sum(num_docs), avg(num_docs), sum(size) FROM sys.shards GROUP BY table_name, schema_name, fqn, node_id, state, routing_state, relocating_node, "primary"
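A sketch of how that symptom can be spotted directly, assuming outer-join support as available in CrateDB 2.x; this is not a statement from the thread:

```sql
-- Jobs that have no corresponding operation entries on any node.
SELECT j.id, j.stmt, j.started
FROM sys.jobs j
LEFT JOIN sys.operations o ON o.job_id = j.id
WHERE o.job_id IS NULL;
```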
Can you show us the exact UPDATE statements? There are different execution plans for UPDATE statements; an UPDATE that matches on the primary key takes a different path than a generic update-by-query.
This is one of the queries I was referring to that shouldn't become stuck anymore in 2.3.3.
Ah, I see. All the UPDATE queries are like this:

UPDATE "tracknamic"."vehicle" SET "group_id" = ?, "make" = ?, "model" = ?, … WHERE id = '6fe97252-deac-468b-8412-3364cbb36405'
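Purely illustrative, contrasting the two plan shapes mentioned above; the SET values and the non-key filter are placeholders:

```sql
-- Matches a single document by primary key: routed directly to that document.
UPDATE "tracknamic"."vehicle" SET "model" = 'X'
WHERE id = '6fe97252-deac-468b-8412-3364cbb36405';

-- Matches on a non-key column: executed as a query plus update across shards.
UPDATE "tracknamic"."vehicle" SET "model" = 'X' WHERE "make" = 'Y';
```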
I've got a few more questions regarding the stuck update statements:
So far I'm not able to reproduce this, nor can I see anything in the code that looks suspicious. Did the number of stuck entries grow since you initially reported this? Did the crashes happen around the time the queries got stuck?
Since I reported this, no other stuck jobs have appeared. We will try to update to 2.3.3 and report if it happens again.
k thx, let us know once you've upgraded.