
Conversation

ecourreges-orange

  • token-aware routing is broken
  • the complexity of map_replicas() is too high (proportional to the number of tokens) and leads to timeouts on control-connection reconnect if the timeout is set below 3s on a cluster of more than 10 nodes with vnodes enabled
@datastax-bot

Hi @ecourreges-orange, thanks for your contribution!

In order for us to evaluate and accept your PR, we ask that you sign a contribution license agreement. It's all electronic and will take just minutes.

Sincerely,
DataStax Bot.

@mpenick
Contributor

mpenick commented Jul 11, 2016

@ecourreges-orange Thanks for your feedback. This is something I'll start digging into first thing tomorrow. First I'll get token awareness working properly, then I'll look into reducing the complexity of map_replicas().

@ecourreges-orange
Author

ecourreges-orange commented Jul 12, 2016

Sorry, I don't know where to post issues/feature requests, but here are the ones related to this pull request:

  • The driver should connect to the contact_points in random order at startup, not in lexicographic order; this limits the impact of the control_connection host/node being down. The Java driver already does this.
  • The useSchema and tokenAware functionalities should be independent, as useSchema incurs unneeded cost when only tokenAware is wanted; for token awareness the driver only needs the keyspace list.
  • On control_connection reconnect, map_replicas() should be called only once at the end, not once per node.
  • The complexity of tokens_to_replicas() should be reduced and tested on a cluster of at least 10 nodes with vnodes (i.e. >=2560 token ranges).
  • The request timeout should be decoupled from control_connection actions; in the case above, setting the request timeout too low produces an infinite loop (at 100% CPU) of control_connection reconnects that each run map_replicas() and time out.
  • on_query_meta_schema() should not clear the token_map built in on_query_hosts().
  • All of this must be unit tested.

I understand these are not straightforward fixes, but they are pretty important for guaranteeing good QoS on our production Cassandra cluster during maintenance and node up/down events.

Thank you.
Regards,
Emmanuel.

@mpenick
Contributor

mpenick commented Jul 12, 2016

Thanks again. For future issues you can use JIRA: https://datastax-oss.atlassian.net/, but feel free to continue using GitHub. I've created this issue: https://datastax-oss.atlassian.net/browse/CPP-389 to track the problems with token awareness (and also linked related issues).

The contact-point randomization improvement is tracked in this issue: https://datastax-oss.atlassian.net/browse/CPP-193

@mpenick
Contributor

mpenick commented Aug 17, 2016

Issues addressed @ 4c744dc. Please let us know if this resolves your issues.
