downtime when upgrading single node overlord+coordinator to 0.13.0 #6854

Closed
pdeva opened this issue Jan 14, 2019 · 5 comments

pdeva (Contributor) commented Jan 14, 2019

I noticed this behavior while upgrading from 0.12.3 to 0.13.0.
We have a single node merged overlord+coordinator.

The assumption is that since overlord/coordinator are not in the query path, upgrading that node shouldn't result in any downtime.

However, when the node restarted after the upgrade, it appears to have terminated all existing KIS (Kafka Indexing Service) tasks and restarted them. This took about 5 minutes, and during that time no realtime data was available for querying.

I saw a lot of messages like this during the node startup (after upgrade):

o.a.d.i.c.IndexTaskClient [IndexTaskClient-pctile-hour-0] No TaskLocation available for task [index_kafka_pctile-hour_ec0a9bc8420bc02_ehjdgpfc], this task may not have been assigned to a worker yet or may have already completed
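
To gauge how many tasks were affected during a restart like this, the overlord log can be scanned for the message above. The following is a minimal sketch, assuming the log lines keep the same wording as the quoted message; the function name `tasks_without_location` is hypothetical and not part of Druid.

```python
import re
from collections import Counter

# Matches the message quoted above and captures the bracketed task ID.
# Assumption: the wording of the log message is unchanged in your version.
NO_LOCATION = re.compile(r"No TaskLocation available for task \[([^\]]+)\]")


def tasks_without_location(log_lines):
    """Count how often each task ID appears in 'No TaskLocation' messages."""
    counts = Counter()
    for line in log_lines:
        match = NO_LOCATION.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample_lines = [
    "o.a.d.i.c.IndexTaskClient [IndexTaskClient-pctile-hour-0] "
    "No TaskLocation available for task "
    "[index_kafka_pctile-hour_ec0a9bc8420bc02_ehjdgpfc], this task may not "
    "have been assigned to a worker yet or may have already completed",
    "some unrelated log line",
]

affected = tasks_without_location(sample_lines)
for task_id, hits in affected.items():
    print(task_id, hits)
```

In practice the lines would come from the overlord log file rather than an in-memory list, for example by passing an open file object to the function.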

Is this expected behavior? If so, it might be worth documenting.

gianm (Contributor) commented Jan 14, 2019

I don't think that it's expected behavior. Do you have more than one overlord+coordinator (highly available deployment)? Or just 1? Did the system recover on its own?

pdeva (Contributor, Author) commented Jan 14, 2019

Just 1.
The system did indeed recover; however, for those 5 minutes all realtime data (in KIS tasks) was unavailable for querying.

pdeva (Contributor, Author) commented Feb 19, 2019

If KIS is being marked as stable (ref #6970), this bug might be worth either fixing or at least documenting.

pdeva added a commit to pdeva/druid that referenced this issue Mar 12, 2019
Upgrading overlord nodes in 0.14.0 can result in a few minutes of query downtime for KIS tasks, as noticed in apache#6854.
This behavior should be documented until it is fixed by apache#6958 in 0.15.

stale bot commented Nov 26, 2019

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

stale bot added the stale label Nov 26, 2019

stale bot commented Dec 24, 2019

This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.

stale bot closed this as completed Dec 24, 2019