downtime when upgrading single node overlord+coordinator to 0.13.0 #6854

pdeva · 2019-01-14T21:39:31Z

I noticed this behavior while upgrading from 0.12.3 to 0.13.0.
We have a single node merged overlord+coordinator.

The assumption is that since overlord/coordinator are not in the query path, upgrading that node shouldn't result in any downtime.

However, it seems that when the node restarted (after the upgrade), it terminated all existing KIS (Kafka Indexing Service) tasks and restarted them. This took a few minutes, and during those 5 minutes or so, no realtime data was available for querying.

I saw a lot of messages like this during the node startup (after upgrade):

o.a.d.i.c.IndexTaskClient [IndexTaskClient-pctile-hour-0] No TaskLocation available for task [index_kafka_pctile-hour_ec0a9bc8420bc02_ehjdgpfc], this task may not have been assigned to a worker yet or may have already completed

Is this expected behavior? If so, it might be worth documenting.

gianm · 2019-01-14T21:45:33Z

I don't think that it's expected behavior. Do you have more than one overlord+coordinator (highly available deployment)? Or just 1? Did the system recover on its own?

pdeva · 2019-01-14T21:46:31Z

Just 1.
The system did indeed recover; however, for those 5 minutes all realtime data (in KIS tasks) was unavailable for querying.

pdeva · 2019-02-19T05:26:07Z

If KIS is being marked as stable (ref #6970), this bug might be worth either fixing or at least documenting.

Upgrading overlord nodes in 0.14.0 can result in a few minutes of query downtime for KIS tasks, as noticed in apache#6854. This behavior should be documented until it is fixed by apache#6958 in 0.15.

stale · 2019-11-26T06:11:10Z

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

stale · 2019-12-24T06:26:00Z

This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.

pdeva mentioned this issue Mar 11, 2019

Support Kafka supervisor adopting running tasks between versions #6958

Closed

pdeva mentioned this issue Mar 12, 2019

document query downtime behavior of upgrading overlords in 0.14 #7241

Closed

stale bot added the stale label Nov 26, 2019

stale bot closed this as completed Dec 24, 2019
