
Diskless cluster mode v2 #676

Merged
merged 6 commits into from Aug 17, 2020

Conversation

eolivelli
Contributor

@eolivelli eolivelli commented Aug 9, 2020

This is the second version of the experimental diskless-cluster mode.

Recap about "diskless cluster mode":
With mode=diskless-cluster we store data pages and index pages on BookKeeper instead of the local disk.
In this mode the local disk is used only for tmpDirectory, to swap temporary result sets (which do not survive a restart of the server).
In order to store data on BK we created BookKeeperDataStorageManager, which generally works this way:

  • it mimics FileDataStorageManager behaviour
  • metadata are stored on ZooKeeper
  • real data (and index) pages are saved in BK ledgers (one page per ledger; this is only the initial implementation and we could improve it in the future, but it actually makes deletion of data easy)
  • the mapping tablespace, datapageId -> ledgerId is stored on ZooKeeper
  • in v1 we stored all of the metadata in znodes that are children of a znode named after the nodeId of the server; this way each node was independent of the others
  • in v1 leader and followers behave like in the traditional "cluster" mode, in which followers tail the log and replay the data "locally"; this is not really local, because with BookKeeperDataStorageManager we are storing data on Bookies
  • for instance, in v1 when you have 3 nodes, each node handles data pages independently, and this leads to a huge waste of space due to the high level of write amplification (please also note that you will probably write to BK with a 3/3/3 replication configuration, that is, keeping 3 copies of each entry!)
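The one-page-per-ledger scheme above can be sketched with an in-memory stand-in for the ZooKeeper mapping. This is illustration only: the class and method names below are hypothetical, and a plain map replaces the znode that BookKeeperDataStorageManager actually uses.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the (tablespace, datapageId) -> ledgerId mapping.
// In the real implementation the mapping lives on ZooKeeper; here an
// in-memory map stands in for the znode so the idea is self-contained.
class PageToLedgerMap {
    private final Map<String, Long> mapping = new ConcurrentHashMap<>();

    private static String key(String tableSpace, long pageId) {
        return tableSpace + "/" + pageId;
    }

    // record that a page was written to its own dedicated ledger
    // (one page per ledger, as in the initial implementation)
    void pageWritten(String tableSpace, long pageId, long ledgerId) {
        mapping.put(key(tableSpace, pageId), ledgerId);
    }

    // dropping a page is cheap: forget the mapping and return the
    // ledger id so the caller can delete that single ledger on BK
    Long pageDropped(String tableSpace, long pageId) {
        return mapping.remove(key(tableSpace, pageId));
    }
}
```

This is why one-page-per-ledger makes deletion easy: dropping a page maps to deleting exactly one ledger, with no compaction of shared ledgers.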

Differences from v1:

  • only the leader maintains the data; followers do nothing
  • use ZooKeeper "multi" to ensure leadership for every write to data-store metadata ("multi" is like a transaction in ZK: every operation in the batch must succeed in order for the whole batch to complete, otherwise the batch has no visible effects)
  • followers only start the TableSpaceManager and announce themselves on tablespacereplicastate
  • allow 'replica:*' in the CREATE/ALTER TABLESPACE command in order to allow ANY node to become a replica, and eventually leader in case the leader dies (you have to set maxleaderinactivitytime explicitly)
  • add more test cases
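The "multi"-guarded write can be illustrated with a minimal in-memory simulation. This is not the real ZooKeeper API (MiniZk and its fields are invented for illustration): the point is that a version check on the leadership znode is bundled atomically with the metadata write, so a stale leader's batch fails as a whole and has no visible effect.

```java
// In-memory simulation of the idea behind ZooKeeper "multi": the metadata
// write only takes effect if a check on the leadership znode version
// succeeds in the same atomic batch. All names here are hypothetical.
class MiniZk {
    int leaderVersion = 1;     // version of the hypothetical leadership znode
    String metadata = "v1";    // data-store metadata guarded by the check

    // Applies the write only if expectedVersion still matches, atomically;
    // otherwise the whole "batch" is aborted with no visible effect.
    synchronized boolean multiCheckAndSet(int expectedVersion, String newMetadata) {
        if (leaderVersion != expectedVersion) {
            return false;      // check op failed: nothing is written
        }
        metadata = newMetadata;
        return true;
    }
}
```

In the real code the check and the write would be separate operations submitted in one ZooKeeper multi call; the simulation just shows why an old leader (holding a stale version) can no longer store data.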

Problems with v1 addressed by this PR:

  • if you had 3 replicas with replicationfactor=3 you ended up with 9 copies of each data page (3 node-level copies, each replicated 3 times inside BK)!
  • there was no strong guarantee that only one node could write to the data pages
  • there were no tests about running two "leaders" (the old leader MUST NOT be able to perform checkpoints or 'store' data to the storage layer)
  • there was no way to say "every node can handle this tablespace"
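The "every node can handle this tablespace" feature boils down to a membership check with a wildcard, mirroring the isNodeAssignedToTableSpace check introduced by this PR (the wrapper class name below is hypothetical):

```java
import java.util.Set;

// Sketch of the 'replica:*' wildcard: a node is assigned to a tablespace
// either because it is listed explicitly or because "*" is present,
// meaning ANY node may become a replica (and eventually leader).
class ReplicaAssignment {
    static boolean isNodeAssigned(Set<String> replicas, String nodeId) {
        return replicas.contains(nodeId) || replicas.contains("*");
    }
}
```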

@coveralls

coveralls commented Aug 9, 2020

Pull Request Test Coverage Report for Build 1023

  • 124 of 150 (82.67%) changed or added relevant lines in 8 files are covered.
  • 31 unchanged lines in 6 files lost coverage.
  • Overall coverage increased (+0.3%) to 65.362%

Changes Missing Coverage:

File | Covered Lines | Changed/Added Lines | %
herddb-core/src/main/java/herddb/core/DBManager.java | 7 | 8 | 87.5%
herddb-core/src/main/java/herddb/core/PostCheckpointAction.java | 2 | 3 | 66.67%
herddb-core/src/main/java/herddb/mem/MemoryDataStorageManager.java | 3 | 5 | 60.0%
herddb-core/src/main/java/herddb/file/FileDataStorageManager.java | 14 | 19 | 73.68%
herddb-core/src/main/java/herddb/cluster/BookKeeperDataStorageManager.java | 91 | 108 | 84.26%

Files with Coverage Reduction:

File | New Missed Lines | %
herddb-core/src/main/java/herddb/core/TableManager.java | 1 | 81.56%
herddb-core/src/main/java/herddb/model/planner/FilterOp.java | 1 | 74.0%
herddb-core/src/main/java/herddb/jmx/JMXUtils.java | 2 | 66.29%
herddb-core/src/main/java/herddb/cluster/BookKeeperDataStorageManager.java | 4 | 75.44%
herddb-utils/src/main/java/herddb/index/blink/BLink.java | 11 | 71.65%
herddb-core/src/main/java/herddb/sql/CalcitePlanner.java | 12 | 84.41%

Totals:
Change from base Build 1015: +0.3%
Covered Lines: 18004
Relevant Lines: 27545

💛 - Coveralls

- only one node handles data
- use ZooKeeper 'multi' operation to ensure leadership while dealing with metadata in BKDataStorageManager
- followers only announce themselves on metadata
- new 'replica:*' parameter to tell that every node is a replica and can become leader
- add tests
@eolivelli eolivelli marked this pull request as ready for review August 9, 2020 12:52
@eolivelli eolivelli added this to the release/0.19.0 milestone Aug 9, 2020
@eolivelli
Contributor Author

@dcdiennea @dmercuriali @amitvc @diegosalvi PTAL

@eolivelli
Contributor Author

@nicoloboschi I added more details as you requested offline, PTAL

@@ -1207,8 +1256,9 @@ private static LogSequenceNumber readLogSequenceNumberFromIndexMetadataFile(Stri
@Override
public List<Table> loadTables(LogSequenceNumber sequenceNumber, String tableSpace) throws DataStorageManagerException {
try {
String tableSpaceDirectory = getTableSpaceZNode(tableSpace);
ensureZNodeDirectory(tableSpaceDirectory);
// TODO
Contributor

Impl leftovers?

Contributor Author

actually this is no longer needed.
I am dropping this commented code, thanks

@@ -1259,8 +1309,9 @@ private static LogSequenceNumber readLogSequenceNumberFromIndexMetadataFile(Stri
@Override
public List<Index> loadIndexes(LogSequenceNumber sequenceNumber, String tableSpace) throws DataStorageManagerException {
try {
String tableSpaceDirectory = getTableSpaceZNode(tableSpace);
ensureZNodeDirectory(tableSpaceDirectory);
// TODO
Contributor

Impl leftovers?

Contributor Author

actually this is no longer needed.
I am dropping this commented code, thanks

@@ -81,6 +81,10 @@ private TableSpace(
this.metadataStorageCreationTime = metadataStorageCreationTime;
}

public boolean isNodeAssignedToTableSpace(String nodeId) {
return this.replicas.contains(nodeId) || this.replicas.contains("*");
Contributor

Could we use a constant?

Contributor Author

done

@diegosalvi
Contributor

I'm thinking that in the future a ledger cache on disk could be helpful

@eolivelli
Contributor Author

I'm thinking that in the future a ledger cache on disk could be helpful
@diegosalvi can you elaborate more?
This could be a good idea.
Can you file an issue with your idea, please?
This way we can address this comment in a follow-up work.

In my opinion copying data locally can help reads but it will add additional write amplification. So we should look for a good use case

@eolivelli eolivelli closed this Aug 17, 2020
@eolivelli eolivelli reopened this Aug 17, 2020
@eolivelli
Contributor Author

Travis passed
https://travis-ci.org/github/diennea/herddb/builds/718591247

@diegosalvi your comments have been addressed. I am merging this patch now.

@eolivelli eolivelli merged commit 6a2b380 into master Aug 17, 2020
@eolivelli eolivelli deleted the fix/disklessclusterv2 branch August 17, 2020 18:21