
Statements stuck in sys.jobs #6383

Closed
tellezb opened this issue Oct 10, 2017 · 37 comments
Labels
bug Clear identification of incorrect behaviour v/2.3

Comments

@tellezb

tellezb commented Oct 10, 2017

Environment:
Crate 1.1.6
Amazon Linux AMI 4.9.38-16.33.amzn1.x86_64
6 AWS t2.small nodes, in 3 availability zones

Problem description:
I posted this issue on Slack and was instructed to create an issue here.

Seeing quite a few (over 100) entries in sys.jobs for statements with old start dates, some up to a week old. The application which started the queries is definitely not running, and I don't seem to be able to KILL the jobs either (nothing happens).

Currently 102 jobs. Killing a couple of them with KILL didn't change anything (the jobs still show up in sys.jobs):

[screenshot: sys.jobs entries]

I've tried running a few of the queries found in sys.jobs and they all run in under a second. Nothing noteworthy is in the logs. Node CPU looks normal:
[screenshot: node CPU graphs]

I haven't restarted the nodes, in an attempt to preserve this state. What else should I look for? Thanks.
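Roughly the kind of statements involved (a minimal sketch; the job id is a placeholder):

-- Oldest entries still listed in sys.jobs.
SELECT id, started, stmt
FROM sys.jobs
ORDER BY started ASC
LIMIT 20;

-- Attempt to kill one of them by its id (placeholder UUID).
KILL 'a72e9a23-4e2d-41bf-8b7a-2f36c9d41c11';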

@mfussenegger
Member

Thanks for reporting

  • What kind of queries are those? Are all of them the same? (e.g. all contain GROUP BY, all are JOINs... something like that)
  • What kind of entries are in sys.operations?

I remember that we fixed something along the lines of entries not properly clearing from sys.jobs a while ago. Would it be possible to upgrade to the latest version? (Currently 2.1.)

@tellezb
Author

tellezb commented Oct 10, 2017

The queries vary and are against a few different tables. None of them have GROUP BY, some of them have ORDER BY, some of them have filters, none of them filter by pk.

339 entries in sys.operations
[screenshot: sys.operations entries]

Unfortunately, I cannot update the cluster to 2.1 quite yet.

@rps-v

rps-v commented Oct 30, 2017

We are also seeing stuck jobs on simple SELECT statements. We can't even KILL them; it has no effect. The only solution seems to be restarting the cluster. This issue has been present in almost every version; we are now on 2.1.6 and still get them.

@mfussenegger
Member

@rps-v

Could you

a) Also post the output of select name, _node['name'], count(*) from sys.operations group by 1, 2;
b) Confirm that on the client side no queries get stuck
c) Provide a heap dump of one of the nodes which contains stuck operations?

@rps-v

rps-v commented Oct 31, 2017

@mfussenegger

Unfortunately, the cluster was restarted recently; however, I will report back when jobs get stuck again.

@rps-v

rps-v commented Nov 1, 2017

@mfussenegger

Well, that did not take long 😞. One of the nodes in the cluster just ran out of heap (it went from 4gb to 7.9gb in 15 seconds, which seems way too fast and more like a bug).

The cluster became unresponsive until the node was restarted, so I could not run the sys.operations query. Unfortunately I could not generate a heap dump either, because it was on the production cluster and I had to fix it ASAP. However, I sent you the logs via email.

Thank you !!

@rps-v

rps-v commented Nov 2, 2017

@mfussenegger

Here is the result for the sys.operations query:

[screenshot: sys.operations query result]

@mfussenegger
Member

One of the nodes in the cluster just ran out of heap (from 4gb to 7.9gb in 15 seconds, which seems way too fast and more like a bug)

Going from 4gb to 7.9gb in 15 sec is not uncommon.
But the logs indicate that the garbage collection couldn't free any memory:

[gc][old][151891][74] duration [7.4s], collections [1]/[7.4s], total [7.4s]/[4.5m], memory [7.9gb]->[7.9gb]/[7.9gb], all_pools {[young] [532.5mb]->[532.5mb]/[532.5mb]}{[survivor] [64.4mb]->[64.1mb]/[66.5mb]}{[old] [7.3gb]->[7.3gb]/[7.3gb]}

That's definitely an issue that should be caught by the circuit breaker.

  • Did you change any circuit breaker settings?
  • Were any unusual queries run around that time?

I think this might be a different problem - maybe you could open up a separate issue for this?

Here is the result for the sys.operations query:

Any chance to get a heap dump from node s0?

@rps-v

rps-v commented Nov 2, 2017

@mfussenegger

  • We did not change any circuit breaker settings (sent you the config via email)
  • As far as I could find, no unusual queries were running

I tried to get a heap dump using jmap, but it was running really slowly, and since this happened on the production cluster, I had to abort it and restart the node. I will try again if queries get stuck.

Now everything seems fine (I did change the "write.wait_for_active_shards" setting to 1 for a few tables, since we were getting some errors about not enough active shard copies).
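For reference, that per-table change looks roughly like this (a sketch with a placeholder table name):

-- Lower the number of shard copies that must be active before a write is accepted.
ALTER TABLE my_table SET ("write.wait_for_active_shards" = 1);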

Edit: Created a separate issue #6472

@abrenneke

I currently have 719 rows stuck in sys.jobs, most from 4 days ago. CPU usage is 0, so it's not like they're doing anything.

@mfussenegger
Member

@SneakyMax

Could you also report:

  • CrateDB version you're using
  • Can you tell if the clients received a response for these queries?
  • Are all queries of a certain kind (e.g. all contain joins, all are group by queries - something like that)?
  • If you're using 2.1 or 2.2, would it be possible to provide a heap dump from a node which contains any stuck queries? (This can be checked with select name, _node['name'], count(*) from sys.operations group by 1, 2;)

@abrenneke

abrenneke commented Nov 20, 2017

Yeah.

  • Crate 2.1.6
  • 4 days ago is when we were testing some applications and running many (~250) queries at once - we did get timeouts for some of them. I assumed it was the nginx instance in front of Crate, but now that I think about it, the nginx logs just said "upstream error"
  • They're all of the format SELECT stuff FROM ( SELECT stuff FROM table WHERE stuff GROUP BY stuff ) AS e INNER JOIN other_table WHERE stuff ORDER BY stuff

The result of the command above gives me:

name | _node['name'] | count(*)
distributed merge | James Tiberius Kirk | 715
distributing collect | James Tiberius Kirk | 715
collect | James Tiberius Kirk | 667
fetchContext | James Tiberius Kirk | 666
mergeOnHandler | S'chn T'gai Spock | 1
fetchContext | S'chn T'gai Spock | 166
nested-loop | S'chn T'gai Spock | 715
collect | S'chn T'gai Spock | 168
nl-merge | S'chn T'gai Spock | 1381

I'd give you crate logs for the errors too, but unfortunately our crate logs are full of thousands and thousands of lines of

[2017-11-20T22:11:13,253][WARN ][o.e.d.r.RestController   ] Content type detection for rest requests is deprecated. Specify the content type using the [Content-Type] header.

@rps-v

rps-v commented Nov 21, 2017

@mfussenegger I finally managed to get a heap dump. Sent to your email.

@mikethebeer

@SneakyMax
Thx for the update.

I'd give you crate logs for the errors too, but unfortunately our crate logs are full of thousands and thousands of lines of

[2017-11-20T22:11:13,253][WARN ][o.e.d.r.RestController   ] Content type detection for rest requests is deprecated. Specify the content type using the [Content-Type] header.

Are you using the crate Python driver? This deprecation warning is fixed in Python driver versions >= 0.20.1. I really recommend updating to the latest driver, since this deprecation warning seems to flood your logs, which is not good for performance.

Regarding the issue of stuck queries listed in sys.jobs/sys.operations: could you check whether you get any disconnect warnings from participating nodes in the cluster? A possible cause could be that you are running into network partitions, or that a node hangs (maybe just because of the flooded logs?!)

@abrenneke

@mikethebeer We're using the node-crate driver. It looks like the Content-Type header was added on Oct 11th; we just hadn't upgraded yet.

It would be hilarious if this really was because of flooded logs. I'll see if I can find anything after upgrading the driver.

@mikethebeer

@SneakyMax OK, keep us informed if you see stuck queries again after the driver update. This was just the first assumption that came to mind, since I have the feeling that certain nodes become unresponsive and their queries then get stuck.

@mikethebeer

@SneakyMax Another parameter that you could look at is the number of shards in the cluster. If I remember correctly, there was an issue with hanging queries in sys.jobs some time ago; in that scenario a customer had way too many shards assigned in the cluster.
Could you also check the number of shards and replicas that you have configured for your tables? Also send us the number of nodes in your cluster so that we can compare the relation between them.
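A rough sketch of queries that would give us those numbers (exact column names can differ slightly between CrateDB versions):

-- Configured shards and replicas per user table.
SELECT table_schema, table_name, number_of_shards, number_of_replicas
FROM information_schema.tables
WHERE table_schema NOT IN ('sys', 'information_schema', 'pg_catalog');

-- Shard copies actually allocated per table.
SELECT schema_name, table_name, count(*) AS allocated_shards
FROM sys.shards
GROUP BY 1, 2
ORDER BY 3 DESC;

-- Number of nodes currently in the cluster.
SELECT count(*) AS nodes FROM sys.nodes;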

@seut
Member

seut commented Jan 3, 2018

@tellezb @rps-v @SneakyMax is this still an issue for you?

@abrenneke

@seut Yeah, we currently have 39 statements in sys.jobs, and none of them should be there. Crate 2.2.3.

@abrenneke

re: @mikethebeer

  • The main large table has 4 configured shards, and 8 started shards
  • We have two nodes in our cluster

@rps-v

rps-v commented Jan 4, 2018

@seut Yes, we also have around 10 jobs stuck which can't be killed (on a 5 node cluster running Crate 2.1.6).

Here are the statements from those jobs:

SELECT id, severity, description, passed FROM sys.checks WHERE passed = FALSE ORDER BY severity DESC, id

select schema_name || '.' || table_name, min_lucene_version from sys.shards where min_lucene_version not like '6.%.%' order by 1

@mxm
Contributor

mxm commented Jan 5, 2018

Thanks for updating the issue again with some more information. Previously we thought this could be related to more complex queries, but it seems to also affect simple queries on sys tables, which led me to check some of the job log code and how the entries are cleaned up.

While looking through the code that is executed before the job log entry is removed, I found that we don't remove the entry in case of channel errors, which can occur while sending the REST response. This could be the cause of this issue, because all the clients used here are REST based.

I've opened a PR (#6718), so this should land soon and you can then test whether the issue still occurs.

@mfussenegger
Member

Closing this as we believe it's been fixed with 2.2.6.

Please re-open if anyone encounters this in 2.2.6+

mfussenegger added the bug (Clear identification of incorrect behaviour) and v/2.2.6 labels on Feb 7, 2018
@abrenneke

We've upgraded now, with stats.enabled set to true; we will let you know if it happens again.
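For reference, a minimal sketch of enabling that setting at runtime (assuming the SET GLOBAL form rather than an entry in crate.yml):

-- Enable tracking so jobs and operations show up in sys.jobs / sys.operations.
SET GLOBAL PERSISTENT stats.enabled = true;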

@abrenneke

@mfussenegger we've been running 2.3 for a week now and no extra entries in sys.jobs!

@rps-v

rps-v commented Feb 21, 2018

We have also updated to 2.3.2 and still see a few stuck jobs which cannot be killed (a few UPDATE statements).

@mfussenegger
Member

We have also updated to 2.3.2 and still see a few stuck jobs which cannot be killed (a few UPDATE statements).

Were the UPDATE statements executed while nodes were being restarted? We have a fix for that case in 2.3.3.
And there is also a fix in 2.3.3 that should prevent queries on sys.checks or sys.shards from getting stuck.

If it's neither of those, a heap dump would be very valuable for us to investigate further.

@rps-v

rps-v commented Feb 21, 2018

The stuck jobs are from today, but the nodes haven't been restarted since about 2 days ago (a few crashed).
The jobs are not from any sys table, so it shouldn't be that.

Is there any way to determine which node the stuck queries executed on?

@mfussenegger
Member

Is there any way to determine which node the stuck queries executed on?

The sys.operations table contains _node['name'] and _node['id'] columns. You can join sys.jobs and sys.operations on jobs.id = operations.job_id.
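For example (a sketch; the selected columns are illustrative):

-- Map each stuck job to the node(s) its operations run on.
SELECT j.id AS job_id,
       j.stmt,
       o.name AS operation,
       o._node['name'] AS node_name,
       o._node['id'] AS node_id
FROM sys.jobs j
JOIN sys.operations o ON j.id = o.job_id;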

@rps-v

rps-v commented Feb 22, 2018

Something is strange... those 5 stuck UPDATE statements do not have entries in sys.operations (_node is null via JOIN).
Also saw a stuck job from today with the following statement (could it be the Admin UI?):

SELECT table_name, schema_name, format('%s.%s', schema_name, table_name) AS fqn, _node['id'] AS node_id, state, routing_state, relocating_node, count(*), "primary", sum(num_docs), avg(num_docs), sum(size) FROM sys.shards GROUP BY table_name, schema_name, fqn, node_id, state, routing_state, relocating_node, "primary"

@mfussenegger
Member

Something is strange... those 5 stuck UPDATE statements do not have entries in sys.operations (_node is null via JOIN).

Can you show us the exact update statements? There are different execution plans for UPDATE statements. If the UPDATE has a WHERE pk_column = ? it's executed in a way that doesn't result in sys.operations entries.

Also saw a stuck job from today with the following statement (could be the Admin UI ?):

This is one of the queries I was referring to that shouldn't become stuck anymore in 2.3.3.
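For illustration, the two shapes described above (table and column names are hypothetical):

-- Filter on the primary key: executed directly, no sys.operations entries are expected.
UPDATE my_table SET col = 'x' WHERE id = 'some-pk-value';

-- Filter on a non-PK column: planned as a regular job, so entries should show up in sys.operations.
UPDATE my_table SET col = 'x' WHERE created_at < '2018-01-01';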

@rps-v

rps-v commented Feb 22, 2018

Ah, I see.

All the UPDATE queries are like this:

UPDATE "tracknamic"."vehicle" SET "group_id"=? , "make"=? , "model"=? , ...  WHERE  id = '6fe97252-deac-468b-8412-3364cbb36405'

@mfussenegger
Member

@rps-v

I've got a few more questions regarding the stuck update statements:

  1. Are there any errors in the logs?
  2. What kind of client are you using? Is it Postgres based, and are you by chance using prepared statements?
  3. Do you know if the client received responses for these statements, or did the client get stuck / run into a timeout?

@rps-v

rps-v commented Feb 26, 2018

@mfussenegger

  1. No errors in logs (on all nodes)
  2. We are using the HTTP client (node-crate)
  3. Unfortunately I do not have that information

@mfussenegger
Member

@rps-v

So far I'm not able to reproduce this, nor can I see anything in the code that looks suspicious.

Did the number of stuck entries grow since you initially reported this?

but the nodes haven't been restarted since about 2 days ago (a few crashed).

Did the crashes happen around the time when the queries got stuck?

@rps-v

rps-v commented Feb 27, 2018

@mfussenegger

Since I reported this, there have been no other stuck UPDATE queries; however, there were some from sys tables (which you said is fixed in 2.3.3).
No, the crashes occurred 2 days before the queries got stuck.

We will try to update to 2.3.3 and report if it happens again.
I think the issue can be closed for now.

@mfussenegger
Member

k thx, let us know once you've upgraded.
