
Conversation

ecourreges-orange

  • token-aware routing is broken
  • the complexity of map_replicas() is too high (proportional to the number of tokens) and leads to timeouts on control-connection reconnect if the timeout is set below 3s on a cluster of more than 10 nodes with vnodes enabled
@datastax-bot

Hi @ecourreges-orange, thanks for your contribution!

In order for us to evaluate and accept your PR, we ask that you sign a contribution license agreement. It's all electronic and will take just minutes.

Sincerely,
DataStax Bot.

@mpenick
Contributor

mpenick commented Jul 11, 2016

@ecourreges-orange Thanks for your feedback. This is something I'll start digging into first thing tomorrow. First I'll get token awareness working properly, then I'll look into reducing the complexity of map_replicas().

@ecourreges-orange
Author

ecourreges-orange commented Jul 12, 2016

Sorry, I don't know where to post issues/feature requests, but here are the ones related to this pull request:

  • The driver should connect to the contact_points in random order at startup, not in lexicographic order; this limits the impact of the control_connection host/node being down. The Java driver already does this.
  • The useSchema and tokenAware functionalities should be independent, as useSchema incurs unneeded cost when only tokenAware is wanted; for token awareness the driver only needs the keyspace list.
  • On control_connection reconnect, map_replicas() should be called only once at the end, not once per node.
  • The complexity of tokens_to_replicas() should be reduced and tested on a cluster of at least 10 nodes with vnodes (i.e. >=2560 token ranges).
  • The request timeout should be decoupled from control_connection actions; in the case above, setting the request timeout too low produces an infinite loop (at 100% CPU) of control_connection reconnects that each run map_replicas() and time out.
  • on_query_meta_schema() should not clear the token_map built in on_query_hosts().
  • All of this must be unit tested.

I understand these are not straightforward fixes, but they are pretty important for guaranteeing good QoS on our production Cassandra cluster during maintenance and node up/down events.

Thank you.
Regards,
Emmanuel.

@mpenick
Contributor

mpenick commented Jul 12, 2016

Thanks again. For future issues you can use JIRA: https://datastax-oss.atlassian.net/, but feel free to continue using GitHub. I've created this issue: https://datastax-oss.atlassian.net/browse/CPP-389 to track the problems with token awareness (and also linked related issues).

The contact-point randomization improvement is tracked in this issue: https://datastax-oss.atlassian.net/browse/CPP-193

@mpenick
Contributor

mpenick commented Aug 17, 2016

Issues addressed @ 4c744dc. Please let us know if this resolves your issues.
