Add Rack awareness support #967

Merged: 6 commits into basho:develop-3.0, Apr 29, 2021

Conversation

systream

This PR introduces the concept of a location (you could also call it a site or availability zone). When claiming a new ring, the list of nodes is ordered taking the location of the individual nodes into consideration, so that two adjacent nodes are preferably from different locations.
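
To illustrate the idea, here is a minimal sketch (not the code in this PR; the module and function names are invented for the example) of ordering nodes by taking one from each location group in turn, so that neighbouring positions in the claim order tend to come from different locations:

-module(location_order_sketch).
-export([order_by_location/1]).

%% Nodes is a list of {Node, Location} pairs.
order_by_location(Nodes) ->
    %% Group the nodes by location, preserving their original order within each group.
    Grouped = lists:foldl(
                fun({Node, Loc}, Acc) ->
                        maps:update_with(Loc, fun(L) -> [Node | L] end, [Node], Acc)
                end, #{}, Nodes),
    Groups = [lists:reverse(G) || G <- maps:values(Grouped)],
    %% Take one node from each location group per round, so adjacent entries
    %% in the result tend to come from different locations.
    round_robin(Groups, []).

round_robin([], Acc) ->
    Acc;
round_robin(Groups, Acc) ->
    Heads = [H || [H | _] <- Groups],
    Tails = [T || [_ | T] <- Groups, T =/= []],
    round_robin(Tails, Acc ++ Heads).

For example, two loc_A nodes and two loc_B nodes come out interleaved (A, B, A, B or B, A, B, A) rather than grouped together.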

@martinsumner
Contributor

Thank you for this. I've taken a quick look, and I want to make sure I understand this correctly.

  1. The aim is data diversity across locations to improve data protection (e.g. try to make sure data is in multiple Availability Zones).
  2. This doesn't offer any guarantee that data for any particular preflist is stored in diverse locations, it just tries to bias the feed into the algorithm to make it more likely that data is stored in diverse locations.

Is my understanding correct?

@systream
Author

systream commented Feb 23, 2021

1. Yes, exactly.
2. You are right, there are circumstances where it cannot guarantee the diversity of locations, much like the non-optimal node count/ring size combinations where tail violations can happen.

It should be used responsibly, like the other options.

I tested it across different ring size/node count/location combinations, and it works quite well, but for example it won't optimise transfers between the old and new ring.

The aim was not to modify the original claiming algorithm until one or more locations have been set.

@martinsumner
Contributor

martinsumner commented Feb 24, 2021

The challenge is how much value there is without any guarantee. Perhaps it would be nice if there were a CLI command that allowed you to report on how location-safe the cluster plan is.

Previously, some consideration had been given to resurrecting claim v3 to implement things like location awareness. The claim v3 algorithm was a radical change, treating claim as an optimisation problem. With claim v3, if you weren't happy with the optimised plan, you could reject it and roll the dice again, giving some control to the operator.

Location awareness is on the to-do list for riak-core-lite; perhaps that team has some thoughts on this.

I will add a link to the PR in the Slack channel as well, to try and canvass some broader feedback.

@systream
Author

Cool, thanks.

There is something I'd like to clarify.
It is true that if you pick the wrong node count or set unbalanced locations across the nodes, it cannot guarantee diversity. (I think this is normal; a warning message is a great idea.)

Example:

---- Cluster Status ----
Ring ready: true

+--------------------+------+-------+-----+-------+--------+
|        node        |status| avail |ring |pending|location|
+--------------------+------+-------+-----+-------+--------+
|     dev1@127.0.0.1 |valid |  up   | 31.3|  --   | loc_A  |
| (C) dev2@127.0.0.1 |valid |  up   | 37.5|  --   | loc_B  |
|     dev3@127.0.0.1 |valid |  up   | 31.3|  --   | loc_C  |
+--------------------+------+-------+-----+-------+--------+

Key: (C) = Claimant; availability marked with '!' is unexpected

The first and last partitions are on the same node, so it could be a problem since the ring wraps around; in any case, it is a problem.

(dev1@127.0.0.1)8> riak_core_location:print_ring_with_location().
'dev2@127.0.0.1' ("loc_B")       0
'dev3@127.0.0.1' ("loc_C")       91343852333181432387730302044767688728495783936
'dev1@127.0.0.1' ("loc_A")       182687704666362864775460604089535377456991567872
'dev2@127.0.0.1' ("loc_B")       274031556999544297163190906134303066185487351808
'dev3@127.0.0.1' ("loc_C")       365375409332725729550921208179070754913983135744
'dev1@127.0.0.1' ("loc_A")       456719261665907161938651510223838443642478919680
'dev2@127.0.0.1' ("loc_B")       548063113999088594326381812268606132370974703616
'dev3@127.0.0.1' ("loc_C")       639406966332270026714112114313373821099470487552
'dev1@127.0.0.1' ("loc_A")       730750818665451459101842416358141509827966271488
'dev2@127.0.0.1' ("loc_B")       822094670998632891489572718402909198556462055424
'dev3@127.0.0.1' ("loc_C")       913438523331814323877303020447676887284957839360
'dev1@127.0.0.1' ("loc_A")       1004782375664995756265033322492444576013453623296
'dev2@127.0.0.1' ("loc_B")       1096126227998177188652763624537212264741949407232
'dev3@127.0.0.1' ("loc_C")       1187470080331358621040493926581979953470445191168
'dev1@127.0.0.1' ("loc_A")       1278813932664540053428224228626747642198940975104
'dev2@127.0.0.1' ("loc_B")       1370157784997721485815954530671515330927436759040

If you have a proper node count/location setup, it will work:

Example:

---- Cluster Status ----
Ring ready: true

+--------------------+------+-------+-----+-------+--------+
|        node        |status| avail |ring |pending|location|
+--------------------+------+-------+-----+-------+--------+
|     dev1@127.0.0.1 |valid |  up   | 25.0|  --   | loc_A  |
| (C) dev2@127.0.0.1 |valid |  up   | 25.0|  --   | loc_B  |
|     dev3@127.0.0.1 |valid |  up   | 25.0|  --   | loc_C  |
|     dev4@127.0.0.1 |valid |  up   | 25.0|  --   | Loc_D  |
+--------------------+------+-------+-----+-------+--------+

Key: (C) = Claimant; availability marked with '!' is unexpected
(dev1@127.0.0.1)11> riak_core_location:print_ring_with_location().
'dev2@127.0.0.1' ("loc_B")       0
'dev3@127.0.0.1' ("loc_C")       91343852333181432387730302044767688728495783936
'dev4@127.0.0.1' ("Loc_D")       182687704666362864775460604089535377456991567872
'dev1@127.0.0.1' ("loc_A")       274031556999544297163190906134303066185487351808
'dev2@127.0.0.1' ("loc_B")       365375409332725729550921208179070754913983135744
'dev3@127.0.0.1' ("loc_C")       456719261665907161938651510223838443642478919680
'dev4@127.0.0.1' ("Loc_D")       548063113999088594326381812268606132370974703616
'dev1@127.0.0.1' ("loc_A")       639406966332270026714112114313373821099470487552
'dev2@127.0.0.1' ("loc_B")       730750818665451459101842416358141509827966271488
'dev3@127.0.0.1' ("loc_C")       822094670998632891489572718402909198556462055424
'dev4@127.0.0.1' ("Loc_D")       913438523331814323877303020447676887284957839360
'dev1@127.0.0.1' ("loc_A")       1004782375664995756265033322492444576013453623296
'dev2@127.0.0.1' ("loc_B")       1096126227998177188652763624537212264741949407232
'dev3@127.0.0.1' ("loc_C")       1187470080331358621040493926581979953470445191168
'dev4@127.0.0.1' ("Loc_D")       1278813932664540053428224228626747642198940975104
'dev1@127.0.0.1' ("loc_A")       1370157784997721485815954530671515330927436759040
---- Cluster Status ----
Ring ready: true

+--------------------+------+-------+-----+-------+--------+
|        node        |status| avail |ring |pending|location|
+--------------------+------+-------+-----+-------+--------+
|     dev1@127.0.0.1 |valid |  up   | 25.0|  --   | Loc_A  |
| (C) dev2@127.0.0.1 |valid |  up   | 25.0|  --   | Loc_A  |
|     dev3@127.0.0.1 |valid |  up   | 25.0|  --   | Loc_B  |
|     dev4@127.0.0.1 |valid |  up   | 25.0|  --   | Loc_B  |
+--------------------+------+-------+-----+-------+--------+

Key: (C) = Claimant; availability marked with '!' is unexpected
(dev1@127.0.0.1)13> riak_core_location:print_ring_with_location().
'dev3@127.0.0.1' ("Loc_B")       0
'dev1@127.0.0.1' ("Loc_A")       91343852333181432387730302044767688728495783936
'dev4@127.0.0.1' ("Loc_B")       182687704666362864775460604089535377456991567872
'dev2@127.0.0.1' ("Loc_A")       274031556999544297163190906134303066185487351808
'dev3@127.0.0.1' ("Loc_B")       365375409332725729550921208179070754913983135744
'dev1@127.0.0.1' ("Loc_A")       456719261665907161938651510223838443642478919680
'dev4@127.0.0.1' ("Loc_B")       548063113999088594326381812268606132370974703616
'dev2@127.0.0.1' ("Loc_A")       639406966332270026714112114313373821099470487552
'dev3@127.0.0.1' ("Loc_B")       730750818665451459101842416358141509827966271488
'dev1@127.0.0.1' ("Loc_A")       822094670998632891489572718402909198556462055424
'dev4@127.0.0.1' ("Loc_B")       913438523331814323877303020447676887284957839360
'dev2@127.0.0.1' ("Loc_A")       1004782375664995756265033322492444576013453623296
'dev3@127.0.0.1' ("Loc_B")       1096126227998177188652763624537212264741949407232
'dev1@127.0.0.1' ("Loc_A")       1187470080331358621040493926581979953470445191168
'dev4@127.0.0.1' ("Loc_B")       1278813932664540053428224228626747642198940975104
'dev2@127.0.0.1' ("Loc_A")       1370157784997721485815954530671515330927436759040
---- Cluster Status ----
Ring ready: true

+--------------------+------+-------+-----+-------+--------+
|        node        |status| avail |ring |pending|location|
+--------------------+------+-------+-----+-------+--------+
|     dev1@127.0.0.1 |valid |  up   | 25.0|  --   | Loc_A  |
| (C) dev2@127.0.0.1 |valid |  up   | 25.0|  --   | Loc_B  |
|     dev3@127.0.0.1 |valid |  up   | 25.0|  --   | Loc_A  |
|     dev4@127.0.0.1 |valid |  up   | 25.0|  --   | Loc_B  |
+--------------------+------+-------+-----+-------+--------+

Key: (C) = Claimant; availability marked with '!' is unexpected
(dev1@127.0.0.1)14> riak_core_location:print_ring_with_location().
'dev2@127.0.0.1' ("Loc_B")       0
'dev1@127.0.0.1' ("Loc_A")       91343852333181432387730302044767688728495783936
'dev4@127.0.0.1' ("Loc_B")       182687704666362864775460604089535377456991567872
'dev3@127.0.0.1' ("Loc_A")       274031556999544297163190906134303066185487351808
'dev2@127.0.0.1' ("Loc_B")       365375409332725729550921208179070754913983135744
'dev1@127.0.0.1' ("Loc_A")       456719261665907161938651510223838443642478919680
'dev4@127.0.0.1' ("Loc_B")       548063113999088594326381812268606132370974703616
'dev3@127.0.0.1' ("Loc_A")       639406966332270026714112114313373821099470487552
'dev2@127.0.0.1' ("Loc_B")       730750818665451459101842416358141509827966271488
'dev1@127.0.0.1' ("Loc_A")       822094670998632891489572718402909198556462055424
'dev4@127.0.0.1' ("Loc_B")       913438523331814323877303020447676887284957839360
'dev3@127.0.0.1' ("Loc_A")       1004782375664995756265033322492444576013453623296
'dev2@127.0.0.1' ("Loc_B")       1096126227998177188652763624537212264741949407232
'dev1@127.0.0.1' ("Loc_A")       1187470080331358621040493926581979953470445191168
'dev4@127.0.0.1' ("Loc_B")       1278813932664540053428224228626747642198940975104
'dev3@127.0.0.1' ("Loc_A")       1370157784997721485815954530671515330927436759040

@martinsumner
Contributor

I don't know if you're seeing the CI failures, but there are dialyzer issues:

src/riak_core_cluster_cli.erl
  40: Function register_cli/0 has no local return
  44: Function register_all_usage/0 has no local return
  54: Function register_all_commands/0 will never be called
  63: Function status_register/0 will never be called
 150: Function partition_count_register/0 will never be called
 183: Function partitions_register/0 will never be called
 231: Function partition_register/0 will never be called
 290: Function location_register/0 will never be called
===> Warnings written to /home/travis/build/basho/riak_core/_build/default/22.3.dialyzer_warnings
===> Warnings occurred running dialyzer: 8

The next stage is riak_test testing. There is a test group defined for core tests. If you've not used riak_test before, there are some setup instructions.

There hasn't been any positive feedback via Slack etc. with regard to pushing for this change. There is interest in rack awareness with stricter guarantees; however, I think any complete solution is going to be such a radical change as to be unrealistic compared to this relatively simple enhancement.

There is no firm process for getting something like this approved at the moment. I can see positive aspects of the change, but would want a broader consensus behind it. @martincox do you have an opinion?

@systream
Author

systream commented Mar 8, 2021

Sorry, the dialyzer issues are fixed.
I can accept that it might not be the best solution.
We have strict security restrictions, so I can't tell much about how we use Riak, but to be able to scale without increasing the n_val we need a feature something like this. I'd like to help make rack awareness happen. :)

@systream
Author

systream commented Mar 9, 2021

I put riak_core in _checkouts and ran the tests:

That's 100.0% for those keeping score
making summary
bucket_props_roundtrip-bitcask: pass
bucket_props_validation-bitcask: pass
bucket_types-bitcask: pass
http_bucket_types-bitcask: pass
cluster_meta_basic-bitcask: pass
cluster_meta_rmr-bitcask: pass
gh_riak_core_154-bitcask: pass
gh_riak_core_155-bitcask: pass
gh_riak_core_176-bitcask: pass
http_bucket_types-bitcask: pass
verify_build_cluster-bitcask: pass
verify_claimant-bitcask: pass
verify_dynamic_ring-bitcask: pass
verify_leave-bitcask: pass
verify_reset_bucket_props-bitcask: pass
verify_riak_lager-bitcask: pass

Before I write more tests, it would be nice to know whether to continue with this solution or not.

@martincox
Contributor

I think it sounds like a good starting point, even with the caveat over the lack of strict guarantees. It could potentially be developed further in the future, if anyone can invest more time into it. It makes sense to me to adopt it as is; it looks to provide some improvement. I'd be happy to see it merged. 👍

@martinsumner
Contributor

@systream just a note on timescales. I'm going to proceed with release 3.0.4 this week without this change, as I need something fixed ASAP, but once we have a riak_test we're satisfied with, I will set off the process for producing a Riak 3.0.5 which includes this rack awareness PR.

I have a few other things on for the next few days, but I might have some time towards the end of next week to help with a riak_test test if required.

@systream
Author

Okay, cool, thanks! I have already started writing new tests in riak_test, but unfortunately I had urgent tasks to finish.
I think it will be finished in a couple of days.

@systream
Author

systream commented Mar 29, 2021

@martinsumner I wrote tests in riak_test. I'm not totally sure whether they are appropriate; could you check them? Thanks.
basho/riak_test#1353.

@martinsumner
Contributor

I will have a look later in the week. Thanks.

@martinsumner
Contributor

I think I'm fine with this from a test and code perspective now. Sorry for the delay, I've been distracted with other things.

@systream - for others to use it, a write-up would be helpful (just some markdown in docs/). Is this something you could do?

@martincox - anything to add?

@systream
Author

Sure, I will do the docs

martinsumner self-requested a review on April 15, 2021 at 08:21
@systream
Author

What's the next step? Is there anything I need to do?

@martinsumner
Contributor

Sorry, it is on me now. I will start work on finalising a release next week. Apologies for the delay.

@systream
Author

Sorry, I don't want to rush you; I just got the feeling that I might need to do something.

@martinsumner
Contributor

@systream - sorry, but some late suggestions for the docs.

I think the instructions need to be extended to make it clear that locations need to be allocated sequentially in order for the claim to have the desired effect. That is to say:

node1 -> loc1
node2 -> loc2
node3 -> loc3
node4 -> loc1
node5 -> loc2
node6 -> loc3

should provide a diverse spread across locations, but the following mapping will not:

node1 -> loc1
node2 -> loc1
node3 -> loc2
node4 -> loc2
node5 -> loc3
node6 -> loc3

Also, in order to check a planned but uncommitted cluster change, this script would be useful:

PlannedRing = element(1, lists:last(element(3, riak_core_claimant:plan()))).
riak_core_location:check_ring(PlannedRing, 3, 2).
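
Assuming check_ring/3 returns the list of violating preference lists, an empty result for the planned ring would mean every 3-node preflist spans at least 2 distinct locations:

%% [] here would mean the staged plan has no location violations for n_val 3.
riak_core_location:check_ring(PlannedRing, 3, 2) =:= [].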

@systream
Author

It doesn't (or shouldn't) matter which node is where, as long as the number of distinct locations is right.
Both of the node-location assignments should work. I did check it.

@martinsumner
Contributor

Perhaps I did something wrong, but I've just checked it, and not allocating nodes in sequence caused an issue:

$ dev/dev6/riak/bin/riak admin cluster plan
=============================== Staged Changes ================================
Action         Details(s)
-------------------------------------------------------------------------------
set-location  'dev1@127.0.0.1' to rack_a
set-location  'dev2@127.0.0.1' to rack_a
set-location  'dev3@127.0.0.1' to rack_b
set-location  'dev4@127.0.0.1' to rack_b
set-location  'dev5@127.0.0.1' to rack_c
set-location  'dev6@127.0.0.1' to rack_c
-------------------------------------------------------------------------------


NOTE: Applying these changes will result in 1 cluster transition

###############################################################################
                         After cluster transition 1/1
###############################################################################

================================= Membership ==================================
Status     Ring    Pending    Node
-------------------------------------------------------------------------------
valid     100.0%     17.2%    dev1@127.0.0.1 (rack_a)
valid       0.0%     15.6%    dev2@127.0.0.1 (rack_a)
valid       0.0%     17.2%    dev3@127.0.0.1 (rack_b)
valid       0.0%     15.6%    dev4@127.0.0.1 (rack_b)
valid       0.0%     17.2%    dev5@127.0.0.1 (rack_c)
valid       0.0%     17.2%    dev6@127.0.0.1 (rack_c)
-------------------------------------------------------------------------------
Valid:6 / Leaving:0 / Exiting:0 / Joining:0 / Down:0

WARNING: Not all replicas will be on distinct locations

Transfers resulting from cluster changes: 53
  10 transfers from 'dev1@127.0.0.1' to 'dev2@127.0.0.1'
  10 transfers from 'dev1@127.0.0.1' to 'dev4@127.0.0.1'
  11 transfers from 'dev1@127.0.0.1' to 'dev6@127.0.0.1'
  11 transfers from 'dev1@127.0.0.1' to 'dev3@127.0.0.1'
  11 transfers from 'dev1@127.0.0.1' to 'dev5@127.0.0.1'

$ dev/dev6/riak/bin/riak admin cluster commit
Cluster changes committed
Eshell V10.7  (abort with ^G)
(dev6@127.0.0.1)1> {ok, R} = riak_core_ring_manager:get_my_ring().
(dev6@127.0.0.1)3> length(riak_core_location:check_ring(R, 3, 2)).
40

Whereas a sequential allocation had no such violations.

I'm going to have a play around to see if I can see how this might not be working for me.

@martinsumner
Contributor

martinsumner commented Apr 29, 2021

Hang on ... I think I might have prematurely checked before transfers were complete!

Sorry, my mistake. That was the case.
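
(For anyone repeating this check, it is worth confirming that all ownership transfers have completed before re-running riak_core_location:check_ring/3; something like the standard transfers report should do, assuming the Riak 3.0 riak admin style used earlier in this thread:)

$ dev/dev6/riak/bin/riak admin transfers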

martinsumner merged commit 1f7e204 into basho:develop-3.0 on Apr 29, 2021