
BP-14 Force Ledger Bookie side implementation #1317

Closed
wants to merge 5 commits

Conversation


@eolivelli commented Apr 5, 2018

Implement "force" API on journal and bookie side.

In BP-14 the client will have a new force(long ledgerId) API which ensures that every write has been persisted durably to the bookie (currently to the journal), even when using the DEFERRED_SYNC WriteFlag.
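
For context, a minimal sketch of what client-side usage could look like under BP-14. The client API shown here (WriteFlag.DEFERRED_SYNC, WriteHandle#force and the builder methods) belongs to the client half of BP-14 and is illustrative; the final names may differ.

import org.apache.bookkeeper.client.api.BookKeeper;
import org.apache.bookkeeper.client.api.WriteFlag;
import org.apache.bookkeeper.client.api.WriteHandle;

class DeferredSyncExample {
    static void writeWithDeferredSync(BookKeeper bk, byte[] password, byte[] payload) throws Exception {
        // Create a ledger with ES = WQ (the restriction discussed later) and DEFERRED_SYNC writes.
        try (WriteHandle wh = bk.newCreateLedgerOp()
                .withEnsembleSize(3)
                .withWriteQuorumSize(3)
                .withAckQuorumSize(2)
                .withPassword(password)
                .withWriteFlags(WriteFlag.DEFERRED_SYNC)
                .execute()
                .get()) {
            wh.append(payload);   // acknowledged before fsync because of DEFERRED_SYNC
            wh.force().get();     // completes only when all acked entries are durable on the ensemble
        }
    }
}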

This change is the implementation of that API on the Journal and on the Bookie.

On the Journal we have to keep track of entries that have been acknowledged to the client but are not yet guaranteed to be durably written (forced) to disk (they are written with the ackBeforeSync flag).

During a restart of the bookie this in-memory cursor (LastForceCursor) is lost, so we have to reconstruct the information from LedgerStorage by reading up to the "max entry id". This "max entry id" value MUST be supplied by the writer because, in the case of WriteAdvHandle (sparse entryIds), the bookie cannot know which entries of a given ledger are stored on it.
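
A rough sketch of that recovery step, assuming a simple sequential scan. The entryExists predicate stands for whatever LedgerStorage lookup ends up being used; it is an assumption for illustration, not an actual API call in this patch.

import java.util.function.LongPredicate;

class LastForcedRecovery {
    // After a bookie restart the in-memory cursor is gone: scan up to the
    // writer-supplied maxEntryId and stop at the first gap.
    static long recoverLastForced(long maxEntryId, LongPredicate entryExists) {
        long lastForced = -1L;
        for (long entryId = 0; entryId <= maxEntryId; entryId++) {
            if (!entryExists.test(entryId)) {
                break; // first missing entry: nothing beyond it is contiguously persisted
            }
            lastForced = entryId;
        }
        return lastForced;
    }
}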

Master issue #471

@eolivelli

This patch is still WIP, as it lacks important features/fixes:

  • after a Bookie restart the LastPersistedEntryId is read from LedgerStorage, but this recovery must notify the Journal; otherwise, on each new forceLedger the bookie will not be able to use the information in the journal and will read from LedgerStorage again
  • during reviews of early versions of this patch @sijie requested changes to SyncCursor in order to have a more efficient implementation (this is still to be addressed)
  • I am temporarily using AbstractMap.SimpleImmutableEntry to store ledgerId/entryId pairs, but it is very inefficient as it boxes Long values; we could use a Recyclable object instead

But the patch is open for early review


if (masterKeyCache.get(ledgerId) == null) {
    // Force the load into masterKey cache
    byte[] oldValue = masterKeyCache.putIfAbsent(ledgerId, masterKey);
@eolivelli:

This part should be refactored and shared with addEntryInternal. I did not touch addEntryInternal in this PR to keep the patch as simple as possible.

@@ -1008,6 +1037,8 @@ public void run() {
    if (entry != null && (!syncData || entry.ackBeforeSync)) {
        toFlush.set(i, null);
        numEntriesToFlush--;
        flushedNotSyncedEntries.add(
            new AbstractMap.SimpleImmutableEntry<>(entry.ledgerId, entry.entryId));
@eolivelli:

AbstractMap.SimpleImmutableEntry is to be replaced with a more specific class without boxing. Maybe it is better to use a Recyclable object.
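
A minimal sketch of such a recyclable pair, based on Netty's io.netty.util.Recycler; the class name LedgerEntryPair is hypothetical and not part of this patch.

import io.netty.util.Recycler;

final class LedgerEntryPair {

    private static final Recycler<LedgerEntryPair> RECYCLER = new Recycler<LedgerEntryPair>() {
        @Override
        protected LedgerEntryPair newObject(Handle<LedgerEntryPair> handle) {
            return new LedgerEntryPair(handle);
        }
    };

    private final Recycler.Handle<LedgerEntryPair> handle;
    long ledgerId;
    long entryId;

    private LedgerEntryPair(Recycler.Handle<LedgerEntryPair> handle) {
        this.handle = handle;
    }

    // Fetch a pooled instance instead of allocating a pair and boxing two Longs.
    static LedgerEntryPair get(long ledgerId, long entryId) {
        LedgerEntryPair pair = RECYCLER.get();
        pair.ledgerId = ledgerId;
        pair.entryId = entryId;
        return pair;
    }

    // Return the instance to the pool once the journal has forced the entry.
    void recycle() {
        ledgerId = -1L;
        entryId = -1L;
        handle.recycle(this);
    }
}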

public class SyncCursor {

    private long minAddSynced = -1;
    private final SortedMap<Long, Long> ranges = new TreeMap<>();
@eolivelli:

@sijie you requested a change about using some Guava utility for this, but I cannot find the comment. Could you please give me your suggestion again?
I would be happy to have a more efficient implementation.

@eolivelli requested a review from sijie on April 5, 2018 08:03
@eolivelli

This change is not expected to land in 4.7; it is targeted at 4.8 (hopefully).

@eolivelli self-assigned this Apr 5, 2018
save resources by not always reading from ledgerstorage after Bookie restart
@eolivelli

Moving forward with this task, remaining parts:

  • add a test case about fencing: forceLedger should not work on a fenced ledger (fencing is handled in writes, but since the new code follows the same scheme we should have a test which covers all code branches)
  • add a better implementation of SyncCursor

address all of @sijie's comments in #671, like:

I have to think about moving SyncCursor to the Bookie, because I have to listen for 'fsync' of entries already acknowledged, and this can be done in the Journal and not in the Bookie (without adding some kind of JournalSyncListener interface).
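
For illustration only, the kind of shape such a "JournalSyncListener" could have; this is a hypothetical interface, not part of this patch.

// Hypothetical callback the Journal could invoke after an fsync, so that a
// SyncCursor living outside the Journal (e.g. in the Bookie) can be advanced.
interface JournalSyncListener {
    void onEntriesSynced(long ledgerId, long lastSyncedEntryId);
}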

@eolivelli

@sijie I am trying to use RangeSet (https://github.com/google/guava/wiki/NewCollectionTypesExplained#rangeset).
At first glance it does not seem really "cheaper" to me.
Maybe I am not taking the right approach, but every modification to a RangeSet allocates a lot of objects.
Maybe you already have experience with this utility.

@sijie commented Apr 12, 2018

@eolivelli - Pulsar uses RangeSet for tracking the acks. The problem is similar, so I suggested looking into RangeSet. But if you think you have a better solution, feel free to implement that instead.

FYI, this is how Pulsar uses RangeSet: https://github.com/apache/incubator-pulsar/blob/master/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedCursorImpl.java
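
A possible sketch of a RangeSet-backed cursor along these lines; the class and method names are illustrative, not the SyncCursor in this patch.

import com.google.common.collect.Range;
import com.google.common.collect.RangeSet;
import com.google.common.collect.TreeRangeSet;

class RangeSetSyncCursor {

    private final RangeSet<Long> forced = TreeRangeSet.create();

    // Mark a single entry as durably persisted; TreeRangeSet coalesces
    // connected ranges, e.g. [0,3) + [3,4) becomes [0,4).
    synchronized void markForced(long entryId) {
        forced.add(Range.closedOpen(entryId, entryId + 1));
    }

    // Largest entryId such that 0..entryId are all forced, or -1 if entry 0 is missing.
    synchronized long maxAddForced() {
        if (forced.isEmpty()) {
            return -1L;
        }
        Range<Long> first = forced.asRanges().iterator().next();
        if (!first.contains(0L)) {
            return -1L; // there is a gap at the beginning
        }
        return first.upperEndpoint() - 1; // closedOpen: upper bound is exclusive
    }
}

Note that each markForced call still allocates a Range (plus possible internal map churn), which is in line with the allocation concern raised above.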

@@ -103,12 +104,19 @@ void initialize(ServerConfiguration conf,
long addEntry(ByteBuf entry) throws IOException, BookieException;

/**
* Read an entry from storage.
* Read an entry from storage. In case of {@link BookieProtocol#LAST_ADD_CONFIRMED} we are going to return the
@eolivelli:

I am adding these comments because I had to dig into the code to understand clearly what the contract of the method is.

@eolivelli

@sijie currently I am blocked because I need to recover the cursor after a bookie restart; see my email on the list.

@eolivelli changed the title from "WIP - BP-14 Force Ledger Bookie side implementation" to "BP-14 Force Ledger Bookie side implementation" on Apr 20, 2018
@eolivelli

@reddycharan please review; I cannot add you to the "reviewers" list.

@eolivelli

retest this please

* <ul>
* <li>0, 1, 2, 3 we have maxAddForced = 3 (all entries from 0 to 3 are stored)</li>
* <li>0, 1, 3, 5 we have maxAddForced = 1 (all entries from 0 to 1 are stored, 2 and 4 are not stored)</li>
* <li>1, 2, 3 we have maxAddForced = -1 (not available, as 0 is missing, so there is a gap)</li>
Contributor:

There can never be contiguous entries from the bookie's perspective if ensemble size > WQ.

@eolivelli:

Currently in BP-14 we are restricting to ES = WQ >= AQ.
So eventually there will be no holes, only temporary ones due to WriteAdvHandle or temporary network errors/retransmissions.

Contributor:

Hmm, so this is not just a test statement; the implementation itself is going to restrict the ensemble size. Usually people select this mode for better performance, and limiting this reduces striping and hence throughput. What other options do we have, so that the solution doesn't push us into a corner?

@eolivelli:

@jvrao from the very beginning of BP-14 we decided not to implement "striping" in the first implementation because it brings many problems.

In my first proposal I had a "force" API in which the client sent not just the 'last written entry id' but a range of entries, in order to ask the bookie about the segment of the ledger which is the current 'tail segment', because with striping the bookie does not receive all the entries of the ledger.

As an initial implementation it is fine to have ES = WQ; we can enhance this feature in the future. My initial idea was to make the client send the range of entries which defines the current segment (bookies do not have info about the location of entries). This way the bookie can report to the client the actual status of the entries in the range and the client is able to advance the LAC.

The client will advance the LAC only with the guarantee of a sequence of entries fully persisted on AQ bookies, without holes.
In other words:
the range 0...LAC is always persisted durably on AQ bookies.

The force() API enables the client to wait for an fsync on all the bookies in the current ensemble.
With ES > WQ entries will be striped, so the force() API becomes far more complex, not infeasible, but really complex, because the client must ask the bookies about each segment.

OK to talk about this at the next meeting.

cc @sijie

Contributor:

I think we are mixing up multiple things here. Let me see if I can clarify

  1. The client advancing the LAC is based on the bookie's response to the write.
  2. The bookie decides whether or not to persist the write based on the addEntry flag.

If we split them, then there is NO change needed in the LAC logic.
No need to maintain this cursor or track lastAddCommit.

The application takes on the responsibility of potentially losing data between sync points. And just like in any other storage subsystem, if a writer is acknowledged successfully, a reader must be allowed to read that data.

So, one guarantee we break in this durability model: "if an entry is read once, you can always read it". That is expected.

Also, on the sync, we don't need to return any entryId (last commit id or something); a successful sync guarantees that all the data written on that extent is synced.

Hope this clarifies things and makes the patch simpler. I briefly mentioned this to @sijie too, and he did not raise any major concern with this model.

Contributor:

Extending my thought: this would make recovery complicated if we update the piggybacked LAC too, because the LAC may point to something that is not there.

So, we can have a nonDurableLAC on the client side. This gets updated just like today's LAC, and LAC = nonDurableLAC on a successful SYNC. With this model you can achieve what you are trying to do now without maintaining the cursor. But this won't allow reads of entries that were written after the previous sync.
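
A small client-side sketch of that idea, using hypothetical names (DeferredSyncLacTracker, onAddAcknowledged, onForceComplete) just to make the flow concrete; these are not actual BookKeeper client fields.

class DeferredSyncLacTracker {

    private long lastAddConfirmed = -1L; // LAC exposed to readers / piggybacked
    private long nonDurableLac = -1L;    // advances like today's LAC, on acks only

    // Called when a DEFERRED_SYNC add is acknowledged (not necessarily fsynced yet).
    synchronized void onAddAcknowledged(long entryId) {
        if (entryId == nonDurableLac + 1) {
            nonDurableLac = entryId;
        }
    }

    // Called when force()/SYNC completes successfully on the ensemble:
    // everything acknowledged so far is durable, so publish it as the LAC.
    synchronized void onForceComplete() {
        lastAddConfirmed = nonDurableLac;
    }

    synchronized long getLastAddConfirmed() {
        return lastAddConfirmed;
    }
}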

* </p>
* <p>
* This cursor is kept in memory, and in order to handle the case of a Bookie restart we reconstruct the
* {@link #lastAddForced} value from {@link LedgerStorage} just by asking for the maximum written entryId (with a scan)
Contributor:

How long do you keep the cursor in memory?

@eolivelli:

We have a Guava cache. If there is no activity on a ledger, or there are too many concurrent ledgers, some cursors will be dropped and reconstructed if needed (by a forceLedger call). We must make the size and other knobs of this cache tunable via the configuration file.
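
Roughly what such a cache could look like with Guava's CacheBuilder. SyncCursor is the class from this PR; the knob values, the SyncCursorCache name and the reconstruction step are placeholders, not actual configuration settings.

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import java.util.concurrent.TimeUnit;

class SyncCursorCache {

    private final LoadingCache<Long, SyncCursor> cursors = CacheBuilder.newBuilder()
            .maximumSize(10_000)                      // should come from ServerConfiguration
            .expireAfterAccess(10, TimeUnit.MINUTES)  // drop cursors of idle ledgers
            .build(new CacheLoader<Long, SyncCursor>() {
                @Override
                public SyncCursor load(Long ledgerId) {
                    // Reconstruct from LedgerStorage on a miss,
                    // e.g. after a bookie restart or after eviction.
                    return reconstructFromLedgerStorage(ledgerId);
                }
            });

    SyncCursor cursorFor(long ledgerId) {
        return cursors.getUnchecked(ledgerId);
    }

    private SyncCursor reconstructFromLedgerStorage(long ledgerId) {
        // Placeholder for the scan of LedgerStorage up to the writer-supplied max entry id.
        return new SyncCursor();
    }
}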


@eolivelli:

Having a 'close' RPC will ease cache eviction.

@eolivelli

retest this please

@eolivelli

Closing, superseded by #1375

@eolivelli closed this Apr 28, 2018