
Kafka Indexing Service lagging every hour #5979

Closed
erankor opened this issue Jul 8, 2018 · 10 comments

Hi all,

We're using the Kafka Indexing Service to ingest data into Druid (version 0.10.0). Recently we've been seeing it lag every hour - exactly on the hour, the end time of the data source stops advancing for 5-6 minutes, and then it takes another 2-3 minutes to fully recover.
This is most likely related to segment handoff somehow, however -

  1. We have plenty of worker capacity - most of the time we run 4 Kafka tasks x 2 data sources (8 tasks), and we have 4 servers with a capacity of 5 slots each (20 in total). So even assuming we need 2x the tasks during handoff (16), we certainly have enough.
  2. Looking at the performance metrics (load, CPU, memory) of the middle managers at the time of the lag, they don't seem to be working hard.

Are there any parameters on the overlord/middle managers/Kafka tasks that should be tuned to solve this?
I'll happily provide any additional info that can help.

Thank you!

Eran

@erankor (Author) commented Jul 9, 2018

The issue got much worse today :( it reached a 30-40 minute lag during handoff. I tried stopping realtime ingestion completely, but it didn't help -

  1. shut down the KIS tasks (had to kill the tasks manually to get them to complete)
  2. stopped the overlord, coordinator and middle managers
  3. restarted all the services
  4. re-enabled KIS tasks

Adding some logs that will hopefully help troubleshoot this:
overlord-exception2.txt
overlord-exception3.txt
stalled-task.txt
overlord-exception1.txt

  1. stalled-task.txt - if I'm reading this correctly, the worker was waiting for ~15 minutes on some request issued to the overlord
  2. overlord-exception1-3.txt - some exceptions I saw in the overlord logs -
    a. "io.druid.java.util.common.RetryUtils - Failed on try x" + MySQLIntegrityConstraintViolationException
    b. "The RuntimeException could not be mapped to a response, re-throwing to the HTTP container" + MySQLIntegrityConstraintViolationException
    c. "The RuntimeException could not be mapped to a response, re-throwing to the HTTP container" + "Unable to grant lock to inactive Task"

These exceptions seem to be happening quite a lot; here are the counts for a random 2-hour window -
$ journalctl -u druid_over* -S -7200 | grep WARN | grep -c 'io.druid.java.util.common.RetryUtils'
92
$ journalctl -u druid_over* -S -7200 | grep -c 'Unable to grant lock to inactive Task'
135
$ journalctl -u druid_over* -S -7200 | grep -c 'UnableToExecuteStatementException'
179

Any direction you can give me here would be appreciated

Thank you

Eran

@erankor (Author) commented Jul 10, 2018

Problem solved!
The issue was that the druid_pendingSegments table in MySQL had grown huge over time (~800k rows), which caused very slow MySQL queries. In turn, this made segment allocation work very slowly on the overlord, which slowed down the bootstrapping of the Kafka tasks.
Anyway, I stopped all services, truncated the table and reinserted some of the recent rows (not sure if that part is actually needed, but I wanted to be on the safe side).
Now everything is working well again :)
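
For reference, the cleanup boiled down to something like this - just a sketch, run with all Druid services stopped; created_date is stored as an ISO 8601 string, so a string comparison against a formatted UTC cutoff works, and re-inserting the last week of rows is the "safe side" part that may not even be needed:

    -- Sketch only (services stopped): keep the last week of pending segments, drop the rest.
    CREATE TABLE druid_pendingSegments_keep AS
        SELECT * FROM druid_pendingSegments
        WHERE created_date > DATE_FORMAT(DATE_SUB(UTC_TIMESTAMP(), INTERVAL 7 DAY), '%Y-%m-%dT%H:%i:%s');
    TRUNCATE TABLE druid_pendingSegments;
    INSERT INTO druid_pendingSegments SELECT * FROM druid_pendingSegments_keep;
    DROP TABLE druid_pendingSegments_keep;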

erankor closed this Jul 10, 2018

@jihoonson (Contributor) commented Jul 10, 2018

@erankor good to hear that you've solved it! FYI, a new feature for automatic pendingSegments cleanup (#5149) was added in 0.12.0. This should make it easier to handle the growing pendingSegments table.

@erankor (Author) commented Jul 10, 2018

Thanks @jihoonson, I bumped into that pull request after I figured out the problem - that's good news :)
Another small question on this - until we upgrade, I would like to add some periodic cleanup for this table. Is it safe to delete rows whose created_date is older than a week or so? I won't have any job queued for that long; my main concern is whether it's OK to update this table directly while the overlord is running.
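
Concretely, I have something like the following in mind (just a sketch; created_date looks like it's stored as an ISO 8601 string, so I'd compare it against a formatted UTC cutoff):

    -- Sketch only: drop pending segments older than a week,
    -- after making sure no running task is anywhere near that old.
    DELETE FROM druid_pendingSegments
    WHERE created_date < DATE_FORMAT(DATE_SUB(UTC_TIMESTAMP(), INTERVAL 7 DAY), '%Y-%m-%dT%H:%i:%s');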

Thanks!

Eran

@jihoonson (Contributor) commented Jul 10, 2018

I won't have any job queued for that long; my main concern is whether it's OK to update this table directly while the overlord is running.

@erankor yes, it should be fine as long as you make sure there are no running tasks with a created_date older than a week.

@erankor (Author) commented Jul 10, 2018

Great! Thanks @jihoonson!

@erankor (Author) commented Jul 11, 2018

@jihoonson, a couple of additional questions on the MySQL tables -

  1. Following this issue, we started tracking slow queries on the MySQL DB, and I'm seeing queries like the following repeat often -
    SELECT payload FROM druid_segments WHERE used = true AND dataSource = 'player-events-historical' AND ((start <= '2018-07-11T04:00:00.000Z' AND end >= '2018-07-11T03:00:00.000Z'));
    a. Is it possible to delete rows with used = false from this table? I saw there is already an index on used, but keeping the table small can probably help; in our case only ~1% of the rows have used = true, I'm guessing because the KIS tasks create partitioned segments that we later merge.
    b. I'm thinking about replacing the existing (used) index with a composite (used, end) index, since these queries always seem to have end >= (some recent timestamp) - does that make sense? (See the sketch right after this list.)
  2. Going over the tables one by one, I saw that druid_tasks & druid_tasklogs also keep accumulating rows - does it make sense to clean those? From what I saw, there is no date column on druid_tasklogs, but I can get a list of old tasks from druid_tasks and then delete both the tasks and their logs by id.
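
For 1b, this is roughly the change I have in mind - a sketch against MySQL, and the name of the existing single-column index is a guess, so I'd confirm it with SHOW INDEX FROM druid_segments first:

    -- Sketch only: swap the single-column index on used for a composite (used, end) index.
    -- idx_druid_segments_used is an assumed name; check SHOW INDEX FROM druid_segments.
    ALTER TABLE druid_segments
        DROP INDEX idx_druid_segments_used,
        ADD INDEX idx_druid_segments_used_end (used, `end`);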

Thank you!

Eran

@jihoonson (Contributor) commented Jul 11, 2018

@erankor

  1. That SQL lists all used (live) segments falling in an interval; it can be called 1) whenever a new segment is allocated in KIS and 2) when a taskLock is acquired in batch ingestion (e.g., index or merge tasks).
    a. Yes, you can delete them, but I would recommend using the kill task rather than manually deleting rows. You can set druid.coordinator.kill.on to true so that coordinators periodically kill unused segments. Please check http://druid.io/docs/0.10.0/configuration/coordinator.html#coordinator-operation for details.
    b. I haven't tested it, but that sounds reasonable.
  2. The druid_tasklogs table is not actually being used, so you can freely delete all entries in it. Please check #5859. For druid_tasks, yes, you can delete old entries from that table. One thing I'm curious about: do you have an actual issue with the druid_tasks table being too large, or are you just checking all tables? If you have an actual issue, it may make sense to support auto cleanup for the druid_tasks table too.

@erankor (Author) commented Jul 11, 2018

Thanks @jihoonson. Regarding 2, as you guessed - I just went over all the tables and looked deeper into the ones with a non-negligible number of rows; I don't have any actual issue with this table. However, I will still add a cronjob to delete records from druid_tasks & druid_tasklogs, since we're using a very weak MySQL instance and I'd rather keep all tables as small as possible.
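
The cronjob will boil down to something like this - a sketch; the active, created_date and task_id column names are from memory of the 0.10.0 schema (so I'd verify with DESCRIBE first), and the 30-day retention is arbitrary:

    -- Sketch only: prune completed tasks older than 30 days, plus their (unused) log rows.
    -- Column names (active, created_date, task_id) assume the 0.10.0 schema; verify first.
    DELETE FROM druid_tasklogs
    WHERE task_id IN (
        SELECT id FROM druid_tasks
        WHERE active = false
          AND created_date < DATE_FORMAT(DATE_SUB(UTC_TIMESTAMP(), INTERVAL 30 DAY), '%Y-%m-%dT%H:%i:%s'));

    DELETE FROM druid_tasks
    WHERE active = false
      AND created_date < DATE_FORMAT(DATE_SUB(UTC_TIMESTAMP(), INTERVAL 30 DAY), '%Y-%m-%dT%H:%i:%s');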

@jihoonson (Contributor) commented Jul 11, 2018

@erankor thanks! I raised #5994.
