BP-14 Force Ledger Bookie side implementation #1317
Conversation
This patch is still WIP, as it lacks important features/fixes.
But the patch is open for early review.
if (masterKeyCache.get(ledgerId) == null) {
    // Force the load into masterKey cache
    byte[] oldValue = masterKeyCache.putIfAbsent(ledgerId, masterKey);
this part should be refactored and put in common with addEntryInternal. I did not touch addEntryInternal in this PR to make the path as simple as possible.
@@ -1008,6 +1037,8 @@ public void run() {
    if (entry != null && (!syncData || entry.ackBeforeSync)) {
        toFlush.set(i, null);
        numEntriesToFlush--;
        flushedNotSyncedEntries.add(
            new AbstractMap.SimpleImmutableEntry<>(entry.ledgerId, entry.entryId));
AbstractMap.SimpleImmutableEntry is to be replaced with a more specific class without boxing. Maybe it is better to use a Recyclable object.
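As a rough sketch of the boxing concern, a plain value class holding two primitive `long` fields would already avoid the `Long` boxing that `SimpleImmutableEntry<Long, Long>` implies (the class and field names below are illustrative, not taken from the patch):

```java
// Illustrative sketch: a primitive-field replacement for
// AbstractMap.SimpleImmutableEntry<Long, Long>. Class and field
// names are hypothetical, not from the patch.
final class EntryKey {
    final long ledgerId; // stored as primitive, no Long boxing
    final long entryId;

    EntryKey(long ledgerId, long entryId) {
        this.ledgerId = ledgerId;
        this.entryId = entryId;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof EntryKey)) {
            return false;
        }
        EntryKey other = (EntryKey) o;
        return ledgerId == other.ledgerId && entryId == other.entryId;
    }

    @Override
    public int hashCode() {
        return 31 * Long.hashCode(ledgerId) + Long.hashCode(entryId);
    }
}
```

A pooled variant could additionally follow Netty's Recycler pattern to avoid the allocation entirely, which seems to be what the comment hints at.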
public class SyncCursor {

    private long minAddSynced = -1;
    private final SortedMap<Long, Long> ranges = new TreeMap<>();
@sijie you requested a change about using some Guava utility for this, but I cannot find the comment. Could you please give me your suggestion again?
I will be happy to have a more efficient implementation.
This change is not expected to land in 4.7; it is targeted to 4.8 (hopefully).
save resources by not always reading from LedgerStorage after Bookie restart
Moving forward with this task, remaining parts:
add all of @sijie's comments in #671, like:
I have to think about moving SyncCursor to Bookie, because I have to listen for 'fsync' on entries already acknowledged, and this can be done in Journal and not in Bookie (without adding some kind of JournalSyncListener interface).
@sijie I am trying to use RangeSet (https://github.com/google/guava/wiki/NewCollectionTypesExplained#rangeset).
@eolivelli - pulsar uses RangeSet for tracking the acks. The problem is similar, so I suggested looking into RangeSet. But if you think you have a better solution to implement this, feel free to implement it. FYI, this is how pulsar uses RangeSet - https://github.com/apache/incubator-pulsar/blob/master/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedCursorImpl.java
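Whichever structure is chosen, the cursor only needs "record a synced entry id, merge adjacent ranges, report the contiguous synced prefix" semantics. A minimal TreeMap-based sketch of those semantics (illustrative names, not the patch's actual code; a Guava RangeSet could replace the map):

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch of the SyncCursor semantics: track synced entry
// ids as merged [first, last] inclusive ranges, and expose the highest
// id reachable from 0 without gaps. Hypothetical names throughout.
final class RangeCursor {
    // key = first id of a range, value = last id (inclusive)
    private final TreeMap<Long, Long> ranges = new TreeMap<>();

    void addSynced(long entryId) {
        Map.Entry<Long, Long> floor = ranges.floorEntry(entryId);
        if (floor != null && floor.getValue() >= entryId) {
            return; // already covered by an existing range
        }
        long first = entryId;
        long last = entryId;
        if (floor != null && floor.getValue() == entryId - 1) {
            first = floor.getKey(); // merge with the range ending just below
        }
        Map.Entry<Long, Long> next = ranges.ceilingEntry(entryId + 1);
        if (next != null && next.getKey() == entryId + 1) {
            last = next.getValue(); // merge with the range starting just above
            ranges.remove(next.getKey());
        }
        ranges.put(first, last);
    }

    // Highest entry id such that 0..id are all synced, or -1 if none.
    long contiguousSyncedUpTo() {
        Map.Entry<Long, Long> first = ranges.firstEntry();
        return (first != null && first.getKey() == 0L) ? first.getValue() : -1L;
    }
}
```

With Guava the same idea would lean on `TreeRangeSet` and range coalescing instead of the hand-rolled merge above.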
…n order to recover correctly ledgers with gaps from LedgerStorage. Address Sijie's comments.
@@ -103,12 +104,19 @@ void initialize(ServerConfiguration conf,
    long addEntry(ByteBuf entry) throws IOException, BookieException;

    /**
     * Read an entry from storage.
     * Read an entry from storage. In case of {@link BookieProtocol#LAST_ADD_CONFIRMED} we are going to return the
I am adding these comments because I had to dig into the code to understand clearly what the contract of the method is.
@sijie currently I am blocked because I need to recover the cursor after bookie restart, see my email on the list.
@reddycharan please review, I cannot add you to the "reviewers" list.
retest this please
 * <ul>
 * <li>0, 1, 2, 3 we have maxAddForced = 3 (all entries from 0 to 3 are stored)</li>
 * <li>0, 1, 3, 5 we have maxAddForced = 1 (all entries from 0 to 1 are stored, 2 and 4 are not stored)</li>
 * <li>1, 2, 3 we have maxAddForced = -1 (not available, as 0 is missing, so there is a gap)</li>
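The rule in these examples can be sketched as a small helper: maxAddForced is the last id of the contiguous prefix starting at 0, or -1 if entry 0 is missing (class and method names here are illustrative, not the patch's):

```java
// Illustrative sketch of the maxAddForced rule described in the
// javadoc examples. Names are hypothetical, not from the patch.
final class MaxAddForced {
    // sortedStoredEntryIds: the stored entry ids in ascending order.
    static long maxAddForced(long[] sortedStoredEntryIds) {
        long max = -1L;
        for (long id : sortedStoredEntryIds) {
            if (id == max + 1) {
                max = id;  // extends the contiguous prefix from 0
            } else if (id > max + 1) {
                break;     // first gap ends the prefix
            }
        }
        return max;
    }
}
```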
There can never be contiguous entries from the bookie's perspective if Ensemble Size > WQ.
Currently in BP-14 we are restricting to ES = WQ >= AQ.
So eventually there will be no holes, only temporary ones due to WriteAdvHandle or temporary network errors/retransmissions.
Hmm, so this is not just a test statement: the implementation itself is going to restrict the ensemble size. Usually people select this mode for better performance; limiting this reduces the striping and hence the throughput. What other options do we have, so that the solution doesn't push us into a corner?
@jvrao from the very beginning of BP-14 we decided not to implement "striping" in the first implementation because it brings many problems.
In my first proposal I had a "force" API in which the client was sending not just the 'last written entry id' but a range of entries, in order to ask the bookie about the segment of the ledger which is the current 'tail segment', because with striping the bookie does not receive all the entries of the ledger.
As an initial implementation it is fine to have ES = WQ; we can enhance this feature in the future. My initial idea was to make the client send the range of entries which defines the current segment (bookies do not have info about the location of entries). This way the bookie can report to the client the actual status of the entries in the range, and the client is able to advance the LAC.
The client will advance the LAC only in the presence of a guarantee that the sequence of entries is fully persisted on AQ bookies, without holes.
In other words:
- the range 0...LAC is always persisted durably on AQ bookies
- the force() API enables the client to wait for an fsync on all the bookies in the current ensemble.
In the presence of ES > WQ, entries will be striped, so the force() API becomes far more complex (not infeasible, but really complex), because the client must ask the bookies about each segment.
OK to talk about this at the next meeting.
cc @sijie
I think we are mixing up multiple things here. Let me see if I can clarify:
- The client advances the LAC based on the bookie's response to a write.
- The bookie decides whether to persist the write based on the addEntry flag.
If we split them, then there is NO change needed in the LAC logic.
There is no need to maintain this cursor or track lastAddCommit.
The application is taking the responsibility of potentially losing data between sync points.
And just like in any other storage subsystem, if a write is acknowledged successfully, a reader must be allowed to read it.
So, there is one guarantee we break in this durability model: "if an entry is read once, you can always read it". That is expected.
Also on the sync, we don't need to return any entryId (last commit id or something); a successful sync guarantees that all the data written on that extent is synced.
Hope this clarifies and makes the patch simpler. I briefly mentioned it to @sijie too, and he did not raise any major concern with this model.
Extending my thought...
This would make the recovery complicated if we updated the piggybacked LAC too, because the LAC may point to something that is not there.
So, we can have a nonDurableLAC on the client side. This gets updated just like today's LAC, and LAC = nonDurableLAC on a successful SYNC. With this model, you can achieve what you are trying to do now without maintaining the cursor. But this won't allow reads of all entries that were written after the previous sync.
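The split proposed here can be sketched on the client side as two counters, one advancing on every acknowledged write and one catching up only on a successful sync (all names below are hypothetical, not from the patch):

```java
// Illustrative sketch of the proposed client-side model: the
// non-durable LAC advances on every acknowledged write, and the
// durable LAC only catches up on a successful sync. Hypothetical
// names, not the patch's actual code.
final class ClientLacState {
    private long lac = -1L;            // durable LAC, safe for readers
    private long nonDurableLac = -1L;  // advances like today's LAC

    void onWriteAcknowledged(long entryId) {
        if (entryId > nonDurableLac) {
            nonDurableLac = entryId;
        }
    }

    void onSyncCompleted() {
        lac = nonDurableLac; // LAC = nonDurableLAC on successful SYNC
    }

    long lastAddConfirmed() {
        return lac;
    }
}
```

Readers keyed off `lastAddConfirmed()` would then never see entries written after the previous sync, which matches the trade-off the comment describes.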
 * </p>
 * <p>
 * This cursor is kept in memory and, in order to handle the case of Bookie restart, we reconstruct the
 * {@link #lastAddForced} value from {@link LedgerStorage} just by asking for the maximum written entryId (with a scan)
How long do you keep the cursor in memory?
We have a Guava cache. If there is no activity on a ledger, or there are too many concurrent ledgers, some cursors will be dropped and reconstructed if needed (by a force ledger call). We must make the size and other knobs of this cache tunable via the configuration file.
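The eviction behavior described (drop on pressure, rebuild on demand) can be sketched with a size-bounded LRU map from the JDK; the actual patch uses a Guava cache, which adds time-based expiry knobs on top. Class name and the `maxCursors` knob below are illustrative:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch: a size-bounded, access-ordered cursor cache.
// The real patch uses a Guava cache; CursorCache and maxCursors are
// hypothetical names standing in for the configurable knobs.
final class CursorCache<C> {
    private final Map<Long, C> cache;

    CursorCache(int maxCursors) {
        // accessOrder = true gives LRU eviction semantics
        this.cache = new LinkedHashMap<Long, C>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, C> eldest) {
                return size() > maxCursors; // evict least recently used
            }
        };
    }

    C get(long ledgerId) {
        return cache.get(ledgerId); // null means "rebuild on demand"
    }

    void put(long ledgerId, C cursor) {
        cache.put(ledgerId, cursor);
    }
}
```

With Guava, `CacheBuilder.maximumSize(...)` plus `expireAfterAccess(...)` would cover both the size and the inactivity dimensions mentioned in the comment.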
Having a 'close' RPC would ease cache eviction.
retest this please
Closing, superseded by #1375
Implement "force" API on journal and bookie side.
In BP-14 the client will have a new force(long ledgerId) API which ensures that every write has been persisted durably to the bookie (currently to the journal), even when using the DEFERRED_SYNC WriteFlag.
This change is the implementation of such API on Journal and on Bookie.
On the Journal side we have to keep track of which entries have been acknowledged to the client but are not guaranteed to be written durably (forced) to disk (they are written with the ackBeforeSync flag).
During a restart of the bookie this in-memory cursor (LastForceCursor) is lost, so we have to reconstruct the information from LedgerStorage by reading up to the "max entry id"; this "max entry id" value MUST be supplied by the writer because, in the case of WriteAdvHandle (sparse entryIds), the bookie cannot otherwise know which entries of a given ledger are stored on it.
Master issue #471
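The recovery path described above can be sketched as follows, assuming (consistent with the maxAddForced javadoc earlier in the thread) that the cursor tracks the contiguous stored prefix; the `entryExists` callback stands in for the real LedgerStorage lookup and, like the class name, is hypothetical:

```java
import java.util.function.LongPredicate;

// Illustrative sketch of cursor recovery after a Bookie restart:
// rebuild the in-memory value by probing LedgerStorage for each entry
// up to the writer-supplied maxEntryId. Hypothetical names throughout.
final class CursorRecovery {
    // Returns the reconstructed lastAddForced: the last id of the
    // contiguous stored prefix 0..maxEntryId, or -1 if entry 0 is missing.
    static long recoverLastAddForced(long maxEntryId, LongPredicate entryExists) {
        long lastAddForced = -1L;
        for (long id = 0; id <= maxEntryId; id++) {
            if (!entryExists.test(id)) {
                break; // first gap: nothing beyond is contiguously stored
            }
            lastAddForced = id;
        }
        return lastAddForced;
    }
}
```

This also shows why the writer must supply maxEntryId: with sparse entryIds from WriteAdvHandle, the scan has no way to know on its own where the ledger's id space ends.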