
Statements stuck in sys.jobs #6383

Closed
tellezb opened this issue Oct 10, 2017 · 37 comments
Labels
bug Clear identification of incorrect behaviour v/2.3

Comments

@tellezb

tellezb commented Oct 10, 2017

Environment:
Crate 1.1.6
Amazon Linux AMI 4.9.38-16.33.amzn1.x86_64
6 AWS t2.small nodes, in 3 availability zones

Problem description:
I posted this issue on Slack and was instructed to create an issue here.

Seeing quite a few (over 100) entries in sys.jobs for statements with old start dates, some up to a week old. The application which started the queries is definitely not running, and I don't seem to be able to KILL the jobs either (nothing happens).

Currently 102 jobs. Killing a couple of them with KILL didn't change anything (the jobs still show up in sys.jobs):

[screenshot: sys.jobs entries]

I've tried running a few of the queries found in sys.jobs and they all run in under a second. Nothing noteworthy is in the logs. Node CPU looks normal:
[screenshot: node CPU graphs]

I haven't restarted the nodes, in an attempt to preserve this state. What else should I look for? Thanks.
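Roughly the kind of statements involved (a minimal sketch; the job id is a placeholder):

-- Oldest entries still listed in sys.jobs.
SELECT id, started, stmt
FROM sys.jobs
ORDER BY started ASC
LIMIT 20;

-- Attempt to kill one of them by its id (placeholder UUID).
KILL 'a72e9a23-4e2d-41bf-8b7a-2f36c9d41c11';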

@mfussenegger
Member

Thanks for reporting

  • What kind of queries are those? Are all of them the same? (e.g. all contain GROUP BY, all are JOINs... something like that)
  • What kind of entries are in sys.operations?

I remember that we fixed something along the lines of entries not properly clearing from sys.jobs a while ago. Would it be possible to upgrade to the latest version? (Currently 2.1.)

@tellezb
Author

tellezb commented Oct 10, 2017

The queries vary and are against a few different tables. None of them have GROUP BY, some of them have ORDER BY, some of them have filters, none of them filter by pk.

339 entries in sys.operations
[screenshot: sys.operations entries]

Unfortunately, I cannot update the cluster to 2.1 quite yet.

@rps-v

rps-v commented Oct 30, 2017

We are also seeing stuck jobs on simple SELECT statements. We can't even KILL them; it has no effect. The only solution seems to be restarting the cluster. This issue has been present in almost every version; we are now on 2.1.6 and still get them.

@mfussenegger
Member

@rps-v

Could you

a) Also post the output of select name, _node['name'], count(*) from sys.operations group by 1, 2;
b) Confirm that on the client side no queries get stuck
c) Provide a heap dump of one of the nodes which contains stuck operations?

@rps-v

rps-v commented Oct 31, 2017

@mfussenegger

Unfortunately, the cluster was restarted recently; however, I will report back when jobs get stuck again.

@rps-v

rps-v commented Nov 1, 2017

@mfussenegger

Well, that did not take long 😞. One of the nodes in the cluster just ran out of heap (it went from 4gb to 7.9gb in 15 seconds, which seems way too fast and more like a bug).

The cluster became unresponsive until the node was restarted, so I could not run the sys.operations query. Unfortunately I could not generate a heap dump either, because it was on the production cluster and I had to fix it ASAP. However, I sent you the logs via email.

Thank you !!

@rps-v

rps-v commented Nov 2, 2017

@mfussenegger

Here is the result for the sys.operations query:

[screenshot: sys.operations query result]

@mfussenegger
Member

One of the nodes in the cluster just ran out of heap (from 4gb to 7.9gb in 15 seconds, which seems way too fast and more like a bug)

Going from 4gb to 7.9gb in 15 sec is not uncommon.
But the logs indicate that the garbage collection couldn't free any memory:

[gc][old][151891][74] duration [7.4s], collections [1]/[7.4s], total [7.4s]/[4.5m], memory [7.9gb]->[7.9gb]/[7.9gb], all_pools {[young] [532.5mb]->[532.5mb]/[532.5mb]}{[survivor] [64.4mb]->[64.1mb]/[66.5mb]}{[old] [7.3gb]->[7.3gb]/[7.3gb]}

That's definitely an issue that should be caught by the circuit breaker.

  • Did you change any circuit breaker settings?
  • Were any unusual queries run around that time?

I think this might be a different problem - maybe you could open up a separate issue for this?

Here is the result for the sys.operations query:

Any chance to get a heap dump from node s0?

@rps-v

rps-v commented Nov 2, 2017

@mfussenegger

  • We did not change any circuit breaker settings (sent you the config via email)
  • As far as I could find, no unusual queries were running

I tried to get a heap dump using jmap, but it was running really slowly, and since this happened on the production cluster, I had to abort it and restart the node. I will try again if queries get stuck.

Now everything seems fine (I did change the "write.wait_for_active_shards" setting to 1 for a few tables, since we were getting some errors about not enough active shard copies).
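For reference, that per-table change looks roughly like this (a sketch with a placeholder table name):

-- Lower the number of shard copies that must be active before a write is accepted.
ALTER TABLE my_table SET ("write.wait_for_active_shards" = 1);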

Edit: Created a separate issue #6472

@abrenneke

I currently have 719 rows stuck in sys.jobs, most from 4 days ago. CPU usage is 0, so it's not like they're doing anything.

@mfussenegger
Member

@SneakyMax

Could you also report:

  • CrateDB version you're using
  • Can you tell if the clients received a response for these queries?
  • Are all queries of a certain kind (e.g. all contain joins, all are group by queries - something like that)?
  • If you're using 2.1 or 2.2, would it be possible to provide a heap dump from a node which contains any stuck queries? (This can be checked with select name, _node['name'], count(*) from sys.operations group by 1, 2;)

@abrenneke

abrenneke commented Nov 20, 2017

Yeah.

  • Crate 2.1.6
  • 4 days ago is when we were testing some applications and running many (~250) queries at once - we did get timeouts for some of them. I assumed it was the nginx instance in front of Crate, but now that I think about it, the nginx logs just said "upstream error"
  • They're all of the format SELECT stuff FROM ( SELECT stuff FROM table WHERE stuff GROUP BY stuff ) AS e INNER JOIN other_table WHERE stuff ORDER BY stuff

The result of the command above gives me:

name | _node['name'] | count(*)
distributed merge | James Tiberius Kirk | 715
distributing collect | James Tiberius Kirk | 715
collect | James Tiberius Kirk | 667
fetchContext | James Tiberius Kirk | 666
mergeOnHandler | S'chn T'gai Spock | 1
fetchContext | S'chn T'gai Spock | 166
nested-loop | S'chn T'gai Spock | 715
collect | S'chn T'gai Spock | 168
nl-merge | S'chn T'gai Spock | 1381

I'd give you crate logs for the errors too, but unfortunately our crate logs are full of thousands and thousands of lines of

[2017-11-20T22:11:13,253][WARN ][o.e.d.r.RestController   ] Content type detection for rest requests is deprecated. Specify the content type using the [Content-Type] header.

@rps-v

rps-v commented Nov 21, 2017

@mfussenegger I finally managed to get a heap dump. Sent to your email.

@mikethebeer

@SneakyMax
Thx for the update.

I'd give you crate logs for the errors too, but unfortunately our crate logs are full of thousands and thousands of lines of

[2017-11-20T22:11:13,253][WARN ][o.e.d.r.RestController   ] Content type detection for rest requests is deprecated. Specify the content type using the [Content-Type] header.

Are you using the crate Python driver? This deprecation warning is fixed in Python driver versions >= 0.20.1. I really recommend updating to the latest driver, since this deprecation warning seems to flood your logs, which is not good for performance.

Regarding the issue of stuck queries listed in sys.jobs/sys.operations: could you check whether you get any disconnect warnings from participating nodes in the cluster? A possible cause could be that you are running into network partitions, or that a node hangs (maybe just because of the flooded logs?!)

@abrenneke

@mikethebeer We're using the node-crate driver. It looks like the Content-Type header was added on Oct 11th; we just hadn't upgraded yet.

It would be hilarious if this really was because of flooded logs. I'll see if I can find anything after upgrading the driver.

@mikethebeer

@SneakyMax OK, keep us informed if you see stuck queries again after the driver update. This was just the first assumption that came to mind, since I have the feeling that certain nodes become unresponsive and their queries then get stuck.

@mikethebeer

@SneakyMax Another parameter that you could look at is the number of shards in the cluster. If I remember correctly, there was an issue with hanging queries in sys.jobs some time ago; in that scenario a customer had way too many shards assigned in the cluster.
Could you also check the number of shards and replicas that you have configured for your tables? Also send us the number of nodes in your cluster so that we can compare the relation between them.
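A rough sketch of queries that would give us those numbers (exact column names can differ slightly between CrateDB versions):

-- Configured shards and replicas per user table.
SELECT table_schema, table_name, number_of_shards, number_of_replicas
FROM information_schema.tables
WHERE table_schema NOT IN ('sys', 'information_schema', 'pg_catalog');

-- Shard copies actually allocated per table.
SELECT schema_name, table_name, count(*) AS allocated_shards
FROM sys.shards
GROUP BY 1, 2
ORDER BY 3 DESC;

-- Number of nodes currently in the cluster.
SELECT count(*) AS nodes FROM sys.nodes;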

@seut
Member

seut commented Jan 3, 2018

@tellezb @rps-v @SneakyMax is this still an issue for you?

@abrenneke

@seut Yeah, we currently have 39 statements in sys.jobs, and none of them should be there. Crate 2.2.3.

@abrenneke

re: @mikethebeer

  • The main large table has 4 configured shards, and 8 started shards
  • We have two nodes in our cluster

@rps-v

rps-v commented Jan 4, 2018

@seut Yes, we also have around 10 jobs stuck which can't be killed (on a 5 node cluster running Crate 2.1.6).

Here are the statements from those jobs:

SELECT id, severity, description, passed FROM sys.checks WHERE passed = FALSE ORDER BY severity DESC, id

select schema_name || '.' || table_name, min_lucene_version from sys.shards where min_lucene_version not like '6.%.%' order by 1

@mxm
Contributor

mxm commented Jan 5, 2018

Thanks for updating the issue again with some more information. Previously we thought this could be related to more complex queries, but it seems to also affect simple queries on sys tables, which led me to check some of the job log code and how the entries are cleaned up.

While looking through the code that is executed before the job log entry is removed, I found that we don't remove the entry in case of channel errors, which can occur while sending the REST response. This could be the cause of this issue, because all the clients used here are REST based.

I've opened a PR (#6718), so this should land soon and you can then test whether the issue still occurs.

@mfussenegger
Member

Closing this as we believe it's been fixed with 2.2.6.

Please re-open if anyone encounters this in 2.2.6+

mfussenegger added the bug (Clear identification of incorrect behaviour) and v/2.2.6 labels on Feb 7, 2018
@abrenneke

We've upgraded now, with stats.enabled set to true; we will let you know if it happens again.
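For reference, a minimal sketch of enabling that setting at runtime (assuming the SET GLOBAL form rather than an entry in crate.yml):

-- Enable tracking so jobs and operations show up in sys.jobs / sys.operations.
SET GLOBAL PERSISTENT stats.enabled = true;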

@abrenneke

@mfussenegger we've been running 2.3 for a week now and no extra entries in sys.jobs!

@rps-v

rps-v commented Feb 21, 2018

We have also updated to 2.3.2 and still see a few stuck jobs which cannot be killed (a few UPDATE statements).

@mfussenegger
Member

We have also updated to 2.3.2 and still see a few stuck jobs which cannot be killed (a few UPDATE statements).

Were the UPDATE statements executed while nodes were being restarted? We have a fix for that case in 2.3.3.
And there is also a fix in 2.3.3 that should prevent queries on sys.checks or sys.shards from getting stuck.

If it's neither of those, a heap dump would be very valuable for us to investigate further.

@rps-v

rps-v commented Feb 21, 2018

The stuck jobs are from today, but the nodes haven't been restarted since about 2 days ago (a few crashed).
The jobs are not from any sys table, so it shouldn't be that.

Is there any way to determine which node the stuck queries executed on?

@mfussenegger
Member

Is there any way to determine which node the stuck queries executed on?

The sys.operations table contains _node['name'] and _node['id'] columns. You can join sys.jobs and sys.operations on jobs.id = operations.job_id.
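For example (a sketch; the selected columns are illustrative):

-- Map each stuck job to the node(s) its operations run on.
SELECT j.id AS job_id,
       j.stmt,
       o.name AS operation,
       o._node['name'] AS node_name,
       o._node['id'] AS node_id
FROM sys.jobs j
JOIN sys.operations o ON j.id = o.job_id;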

@rps-v

rps-v commented Feb 22, 2018

Something is strange... those 5 stuck UPDATE statements do not have entries in sys.operations (_node is null via JOIN).
Also saw a stuck job from today with the following statement (could it be the Admin UI?):

SELECT table_name, schema_name, format('%s.%s', schema_name, table_name) AS fqn, _node['id'] AS node_id, state, routing_state, relocating_node, count(*), "primary", sum(num_docs), avg(num_docs), sum(size) FROM sys.shards GROUP BY table_name, schema_name, fqn, node_id, state, routing_state, relocating_node, "primary"

@mfussenegger
Member

Something is strange... those 5 stuck UPDATE statements do not have entries in sys.operations (_node is null via JOIN).

Can you show us the exact update statements? There are different execution plans for UPDATE statements. If the UPDATE has a WHERE pk_column = ? it's executed in a way that doesn't result in sys.operations entries.

Also saw a stuck job from today with the following statement (could be the Admin UI ?):

This is one of the queries I was referring to that shouldn't become stuck anymore in 2.3.3.
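For illustration, the two shapes described above (table and column names are hypothetical):

-- Filter on the primary key: executed directly, no sys.operations entries are expected.
UPDATE my_table SET col = 'x' WHERE id = 'some-pk-value';

-- Filter on a non-PK column: planned as a regular job, so entries should show up in sys.operations.
UPDATE my_table SET col = 'x' WHERE created_at < '2018-01-01';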

@rps-v

rps-v commented Feb 22, 2018

Ah, I see.

All the UPDATE queries are like this:

UPDATE "tracknamic"."vehicle" SET "group_id"=? , "make"=? , "model"=? , ...  WHERE  id = '6fe97252-deac-468b-8412-3364cbb36405'

@mfussenegger
Member

@rps-v

I've got a few more questions regarding the stuck update statements:

  1. Are there any errors in the logs?
  2. What kind of client are you using? Is it Postgres based, and are you by chance using prepared statements?
  3. Do you know if the client received responses for these statements, or did the client get stuck / run into a timeout?

@rps-v

rps-v commented Feb 26, 2018

@mfussenegger

  1. No errors in logs (on all nodes)
  2. We are using the HTTP client (node-crate)
  3. Unfortunately I do not have that information

@mfussenegger
Member

@rps-v

So far I'm not able to reproduce this, nor can I see anything in the code that looks suspicious.

Did the number of stuck entries grow since you initially reported this?

but the nodes haven't been restarted since about 2 days ago (a few crashed).

Did the crashes happen around the time when the queries got stuck?

@rps-v

rps-v commented Feb 27, 2018

@mfussenegger

Since I reported this, there have been no other stuck UPDATE queries; however, there were some from sys tables (which you said is fixed in 2.3.3).
No, the crashes occurred 2 days before the queries got stuck.

We will try to update to 2.3.3 and report if it happens again.
I think the issue can be closed for now.

@mfussenegger
Member

k thx, let us know once you've upgraded.
