
Diskless cluster mode v2 #676

Merged
merged 6 commits into from Aug 17, 2020

Conversation

eolivelli
Contributor

@eolivelli eolivelli commented Aug 9, 2020

This is the second version of the experimental diskless-cluster mode.

Recap about "diskless cluster mode":
With mode=diskless-cluster we store data pages and index pages on BookKeeper instead of the local disk.
In this mode the local disk is used only for tmpDirectory, to swap temporary result sets (which do not survive a restart of the server).
In order to store data on BK we created BookKeeperDataStorageManager, which generally works this way:

  • it mimics FileDataStorageManager behaviour
  • metadata are stored on ZooKeeper
  • real data (and index) pages are saved in BK ledgers (one page per ledger; this is only the initial implementation and we could improve it in the future, but it actually makes deletion of data easy)
  • the mapping tablespace, datapageId -> ledgerId is stored on ZooKeeper
  • in v1 we stored all of the metadata in znodes that are children of a znode named after the nodeId of the server; this way each node was independent of the others
  • in v1 leader and followers behave like in the traditional "cluster" mode, in which followers tail the log and replay the data "locally"; this is not really local, because with BookKeeperDataStorageManager we are storing data on Bookies
  • for instance, in v1 when you have 3 nodes, each node handles data pages independently, and this leads to a huge waste of space due to the high level of write amplification (please also note that you will probably write to BK with a 3/3/3 replication configuration, that is, keeping 3 copies of each entry!)
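The one-page-per-ledger scheme above can be sketched with an in-memory stand-in for the ZooKeeper mapping. This is illustration only: the class and method names below are hypothetical, and a plain map replaces the znode that BookKeeperDataStorageManager actually uses.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the (tablespace, datapageId) -> ledgerId mapping.
// In the real implementation the mapping lives on ZooKeeper; here an
// in-memory map stands in for the znode so the idea is self-contained.
class PageToLedgerMap {
    private final Map<String, Long> mapping = new ConcurrentHashMap<>();

    private static String key(String tableSpace, long pageId) {
        return tableSpace + "/" + pageId;
    }

    // record that a page was written to its own dedicated ledger
    // (one page per ledger, as in the initial implementation)
    void pageWritten(String tableSpace, long pageId, long ledgerId) {
        mapping.put(key(tableSpace, pageId), ledgerId);
    }

    // dropping a page is cheap: forget the mapping and return the
    // ledger id so the caller can delete that single ledger on BK
    Long pageDropped(String tableSpace, long pageId) {
        return mapping.remove(key(tableSpace, pageId));
    }
}
```

This is why one-page-per-ledger makes deletion easy: dropping a page maps to deleting exactly one ledger, with no compaction of shared ledgers.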

Differences from v1:

  • only the leader maintains the data; followers do nothing
  • use ZooKeeper "multi" to ensure leadership for every write to data-store metadata ("multi" is like a transaction in ZK: every operation in the batch must succeed in order for the whole batch to complete, otherwise the batch has no visible effects)
  • followers only start the TableSpaceManager and announce themselves on tablespacereplicastate
  • allow 'replica:*' in the CREATE/ALTER TABLESPACE command in order to allow ANY node to become a replica, and eventually leader in case the leader dies (you have to set maxleaderinactivitytime explicitly)
  • add more test cases
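The "multi"-guarded write can be illustrated with a minimal in-memory simulation. This is not the real ZooKeeper API (MiniZk and its fields are invented for illustration): the point is that a version check on the leadership znode is bundled atomically with the metadata write, so a stale leader's batch fails as a whole and has no visible effect.

```java
// In-memory simulation of the idea behind ZooKeeper "multi": the metadata
// write only takes effect if a check on the leadership znode version
// succeeds in the same atomic batch. All names here are hypothetical.
class MiniZk {
    int leaderVersion = 1;     // version of the hypothetical leadership znode
    String metadata = "v1";    // data-store metadata guarded by the check

    // Applies the write only if expectedVersion still matches, atomically;
    // otherwise the whole "batch" is aborted with no visible effect.
    synchronized boolean multiCheckAndSet(int expectedVersion, String newMetadata) {
        if (leaderVersion != expectedVersion) {
            return false;      // check op failed: nothing is written
        }
        metadata = newMetadata;
        return true;
    }
}
```

In the real code the check and the write would be separate operations submitted in one ZooKeeper multi call; the simulation just shows why an old leader (holding a stale version) can no longer store data.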

Problems with v1 addressed by this PR:

  • if you had 3 replicas with replicationfactor=3 you ended up with 9 copies of each data page (3 node-level copies, each replicated 3 times inside BK)!
  • there was no strong guarantee that only one node could write to the data pages
  • there were no tests about running two "leaders" (the old leader MUST NOT be able to perform checkpoints or 'store' data to the storage layer)
  • there was no way to say "every node can handle this tablespace"
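The "every node can handle this tablespace" feature boils down to a membership check with a wildcard, mirroring the isNodeAssignedToTableSpace check introduced by this PR (the wrapper class name below is hypothetical):

```java
import java.util.Set;

// Sketch of the 'replica:*' wildcard: a node is assigned to a tablespace
// either because it is listed explicitly or because "*" is present,
// meaning ANY node may become a replica (and eventually leader).
class ReplicaAssignment {
    static boolean isNodeAssigned(Set<String> replicas, String nodeId) {
        return replicas.contains(nodeId) || replicas.contains("*");
    }
}
```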

@coveralls

coveralls commented Aug 9, 2020

Pull Request Test Coverage Report for Build 1023

  • 124 of 150 (82.67%) changed or added relevant lines in 8 files are covered.
  • 31 unchanged lines in 6 files lost coverage.
  • Overall coverage increased (+0.3%) to 65.362%

Changes Missing Coverage:

File | Covered Lines | Changed/Added Lines | %
herddb-core/src/main/java/herddb/core/DBManager.java | 7 | 8 | 87.5%
herddb-core/src/main/java/herddb/core/PostCheckpointAction.java | 2 | 3 | 66.67%
herddb-core/src/main/java/herddb/mem/MemoryDataStorageManager.java | 3 | 5 | 60.0%
herddb-core/src/main/java/herddb/file/FileDataStorageManager.java | 14 | 19 | 73.68%
herddb-core/src/main/java/herddb/cluster/BookKeeperDataStorageManager.java | 91 | 108 | 84.26%

Files with Coverage Reduction:

File | New Missed Lines | %
herddb-core/src/main/java/herddb/core/TableManager.java | 1 | 81.56%
herddb-core/src/main/java/herddb/model/planner/FilterOp.java | 1 | 74.0%
herddb-core/src/main/java/herddb/jmx/JMXUtils.java | 2 | 66.29%
herddb-core/src/main/java/herddb/cluster/BookKeeperDataStorageManager.java | 4 | 75.44%
herddb-utils/src/main/java/herddb/index/blink/BLink.java | 11 | 71.65%
herddb-core/src/main/java/herddb/sql/CalcitePlanner.java | 12 | 84.41%

Totals:
Change from base Build 1015: +0.3%
Covered Lines: 18004
Relevant Lines: 27545

💛 - Coveralls

- only one node handles data
- use ZooKeeper 'multi' operation to ensure leadership while dealing with metadata in BKDataStorageManager
- followers only announce themselves on metadata
- new 'replica:*' parameter to tell that every node is a replica and can become leader
- add tests
@eolivelli eolivelli marked this pull request as ready for review August 9, 2020 12:52
@eolivelli eolivelli added this to the release/0.19.0 milestone Aug 9, 2020
@eolivelli
Contributor Author

@dcdiennea @dmercuriali @amitvc @diegosalvi PTAL

@eolivelli
Contributor Author

@nicoloboschi I added more details as you requested offline, PTAL

@@ -1207,8 +1256,9 @@ private static LogSequenceNumber readLogSequenceNumberFromIndexMetadataFile(Stri
@Override
public List<Table> loadTables(LogSequenceNumber sequenceNumber, String tableSpace) throws DataStorageManagerException {
try {
String tableSpaceDirectory = getTableSpaceZNode(tableSpace);
ensureZNodeDirectory(tableSpaceDirectory);
// TODO
Contributor

Impl leftovers?

Contributor Author

actually this is no longer needed.
I am dropping this commented code, thanks

@@ -1259,8 +1309,9 @@ private static LogSequenceNumber readLogSequenceNumberFromIndexMetadataFile(Stri
@Override
public List<Index> loadIndexes(LogSequenceNumber sequenceNumber, String tableSpace) throws DataStorageManagerException {
try {
String tableSpaceDirectory = getTableSpaceZNode(tableSpace);
ensureZNodeDirectory(tableSpaceDirectory);
// TODO
Contributor

Impl leftovers?

Contributor Author

actually this is no longer needed.
I am dropping this commented code, thanks

@@ -81,6 +81,10 @@ private TableSpace(
this.metadataStorageCreationTime = metadataStorageCreationTime;
}

public boolean isNodeAssignedToTableSpace(String nodeId) {
return this.replicas.contains(nodeId) || this.replicas.contains("*");
Contributor

Could we use a constant?

Contributor Author

done

@diegosalvi
Contributor

I'm thinking that in the future a ledger cache on disk could be helpful

@eolivelli
Contributor Author

I'm thinking that in the future a ledger cache on disk could be helpful
@diegosalvi can you elaborate more?
This could be a good idea.
Can you file an issue with your idea, please?
This way we can address this comment in a follow-up work.

In my opinion copying data locally can help reads but it will add additional write amplification. So we should look for a good use case

@eolivelli eolivelli closed this Aug 17, 2020
@eolivelli eolivelli reopened this Aug 17, 2020
@eolivelli
Contributor Author

Travis passed
https://travis-ci.org/github/diennea/herddb/builds/718591247

@diegosalvi your comments have been addressed. I am merging this patch now.

@eolivelli eolivelli merged commit 6a2b380 into master Aug 17, 2020
@eolivelli eolivelli deleted the fix/disklessclusterv2 branch August 17, 2020 18:21