
Add a NoopEngine implementation #31163

Merged
merged 23 commits into elastic:closed-index-replication from add-noop-engine Jun 15, 2018

Conversation

dakrone
Member

@dakrone dakrone commented Jun 7, 2018

This adds a new Engine implementation that does... nothing. Any operation throws
an UnsupportedOperationException when attempted. This engine is intended as a
building block for replicated closed indices in subsequent code changes.

Relates to #31141
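The pattern described above can be sketched in miniature. This is a hedged illustration only: `SimpleEngine` and `NoOpSimpleEngine` are hypothetical names, not Elasticsearch's actual `Engine` API.

```java
// Hedged sketch of the idea, not Elasticsearch's actual Engine class: a
// "no-op" engine implements the same interface as a real engine but
// rejects every operation with UnsupportedOperationException.
interface SimpleEngine {
    void index(String doc);

    void delete(String id);
}

final class NoOpSimpleEngine implements SimpleEngine {
    @Override
    public void index(String doc) {
        throw new UnsupportedOperationException("indexing is not supported on a no-op engine");
    }

    @Override
    public void delete(String id) {
        throw new UnsupportedOperationException("deletes are not supported on a no-op engine");
    }
}
```

Because the no-op variant satisfies the same interface, the surrounding machinery can hold onto "an engine" for a closed shard without special-casing, while every mutating call fails fast.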

@dakrone dakrone added the >non-issue and :Distributed/Engine (Anything around managing Lucene and the Translog in an open shard) labels on Jun 7, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

Member

@jasontedor jasontedor left a comment


I left some high-level comments. I have not done a careful review.

/**
* Returns true if the engine is a noop engine
*/
public abstract boolean isNoopEngine();
Member

This feels wrong to me. It looks like it's only used for testing? Is this really needed? We did not feel the need to add something like this for the following engine in CCR.

Member Author

Sorry, this snuck in from me extracting this from my other branch (where it is actually used); I'll remove it for now and re-add it for the next set of work.

@@ -24,4 +24,9 @@
public Engine newReadWriteEngine(EngineConfig config) {
return new InternalEngine(config);
}

@Override
public Engine newNoopEngine(EngineConfig config) {
Member

Can you take a look at how we handle the following engine in CCR and see if that extension point makes sense instead of adding this here? It feels conceptually wrong that the internal engine factory returns something other than an internal engine.

Member Author

Do we want to make this a plugin? I was thinking built-in, but I guess we could move it to a module if we wanted to; I wasn't sure if it's a good idea to do engine implementations inside of modules.

Member Author

Looking at this again, it appears that EnginePlugin is part of the CCR branch, are there plans to factor it out into a separate PR that goes into master?

Member

I think it would be great if we could make it a module, I am always happy to see isolation when it makes sense and here I think it can. Note that CCR has an engine implementation inside a module. 😇

We talked about exactly this possibility of reusing the engine plugin work in the context of closed indices many months ago, so we could bring that work out of the CCR branch and into 6.x/master.

Member Author

Okay, I've removed the changes to the EngineFactory and I can use the pluggable one for the next PR (since this one doesn't hook the engine in yet)

@dakrone dakrone requested a review from s1monw June 7, 2018 18:04
Contributor

@s1monw s1monw left a comment


left some comments.


@Override
public void restoreLocalCheckpointFromTranslog() {

Contributor

extra NL

Translog translog = null;

// The deletion policy for the translog should not keep any translogs around, so the min age/size is set to -1
final TranslogDeletionPolicy translogDeletionPolicy = new TranslogDeletionPolicy(-1, -1);
Contributor

can we move this into the try block please?

Member Author

Certainly

private final IndexCommit lastCommit;
private final LocalCheckpointTracker localCheckpointTracker;
private final String historyUUID;
private SegmentInfos lastCommittedSegmentInfos;
Contributor

this can be final no?

Member Author

Yep

* Directory so that the last commit's user data can be read for the historyUUID
* and last committed segment info.
*/
public class NoopEngine extends Engine {
Contributor

can this class be final and maybe pkg private?

Member Author

Sure, though I'll un-final it when I work on the second half of this, but it can be final for now :)

@Override
public CommitId flush() throws EngineException {
try {
translog.rollGeneration();
Contributor

this looks wrong. We don't write anything here why do we need to modify the translog? I think this should be read-only

Member Author

In order for flushing to clear existing translogs (because we don't want any translog operations to be replayed for peer or store recovery), we want the flush method to remove the translog. This was added so that flushing the new engine ensures that we don't have any translog operations around that could cause UnsupportedOperationExceptions during recovery.

Member Author

I can remove this for now and re-introduce it later when adding the state transition part, if that makes it better, but we still need to be able to completely remove translog ops before doing recovery since we have no way to do operation-based recovery.

What do you think?

Contributor

Yeah, I'd like to remove it for now; I can't see in this change why it's needed.

Member Author

Okay, I've removed that change from this PR

Contributor

@bleskes bleskes left a comment


Thanks @dakrone . I left some comments and questions.

translogDeletionPolicy.setTranslogGenerationOfLastCommit(lastGen);
translogDeletionPolicy.setMinTranslogGenerationForRecovery(lastGen);

localCheckpointTracker = createLocalCheckpointTracker();
Contributor

we don't really need a local checkpoint tracker here, instead can we work on master to not expose the localCheckpointTracker out of engine (similar to how we don't expose the translog) and then we can avoid creating it? We should assert that maxSeq == localCheckpoint when opening lucene and otherwise fail.

Member Author

Do you mean something like this: #31213

Contributor

yes. thanks.

final TranslogDeletionPolicy translogDeletionPolicy = new TranslogDeletionPolicy(-1, -1);

lastCommittedSegmentInfos = store.readLastCommittedSegmentsInfo();
translog = openTranslog(engineConfig, translogDeletionPolicy, engineConfig.getGlobalCheckpointSupplier());
Contributor

Do we really need a translog? All we want is to validate that the translog has the right UUID and that it's empty.

Member Author

We use the translog all over the place in other methods in Engine. We can try to encapsulate all of these into different methods (similar to #31213), but I'm not sure what benefit we'd actually get from that. In the next iteration we'll need the translog because we'll need to sync and flush it on engine opening so that there are no operations to be replayed during peer recovery (which I think will still need a translog to retrieve stats about # of ops).

Contributor

I'm not sure what benefit we'd actually get from that

I looked at these methods and I agree it's on the edge; there are quite a few. I would still prefer we replace them with unsupported operations or noop returns (e.g. empty snapshots). This will lock things down and prevent things that aren't supposed to happen, which I think is good. An alternative is to implement a NoopTranslog, but that's another rabbit hole.

we'll need to sync and flush it on engine opening

Why is that? I think it's good to only close indices that have no ongoing indexing (like our plan for frozen indices). Regardless, why can't we do the flush/trim when we close the open engine and convert it to a noop engine?

I can see one thing down the road, because we may close an index on recovery where it has broken settings (TBD). In that case I would still prefer to make utility methods like Store#associateIndexWithNewTranslog that work on the folder. Note that this problem isn't solved even if we keep the translog open, as we can't index operations from it into the lucene index with the NoopEngine, nor ship them to a recovering shard with a NoopEngine in it. We assume those don't exist. I think we should discuss this separately.
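The "noop returns (e.g. empty snapshots)" alternative mentioned above could look roughly like this sketch. `OperationSnapshot` and `EmptySnapshot` are illustrative names for this example, not the real `Translog.Snapshot` API:

```java
import java.util.Collections;
import java.util.Iterator;

// Hypothetical minimal snapshot interface, for illustration only.
interface OperationSnapshot extends Iterable<String> {
    int totalOperations();
}

// An inert snapshot: it reports zero operations and iterates over nothing,
// so callers that expect a snapshot keep working without special-casing
// the no-op engine.
final class EmptySnapshot implements OperationSnapshot {
    @Override
    public int totalOperations() {
        return 0;
    }

    @Override
    public Iterator<String> iterator() {
        return Collections.emptyIterator();
    }
}
```

The trade-off discussed in the thread is between this "inert value" style (callers proceed harmlessly) and throwing UnsupportedOperationException (callers fail loudly when they do something that shouldn't happen).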

Member Author

I think it's good to only close indices that have no ongoing indexing (like our plan for frozen index).

Yes, but even with no ongoing indexing, a translog still remains (due to retention policy)

Regardless - why can't we do the flush / trim when we close the open engine and convert it to a noop engine?

That would require setting a new retention policy on an existing engine (making a part of InternalEngine mutable which makes me :(). In the future though, we could do the retention policy, sync, and flush/trim when the NoopEngine is opened, and then immediately close the translog.

In order to do this though, we'll have to remove the getTranslog method from Engine, is that something you want me to do as a precursor to this?


@Override
public boolean ensureTranslogSynced(Stream<Translog.Location> locations) {
return false;
Contributor

I think this can throw an unsupported operation exception too?

}

@Override
public Translog getTranslog() {
Contributor

this increases visibility

@dakrone dakrone mentioned this pull request Jun 8, 2018
@dakrone dakrone added v7.0.0 and removed v7.0.0 labels Jun 8, 2018
@dakrone
Member Author

dakrone commented Jun 11, 2018

@jasontedor @bleskes @s1monw I've removed almost all the pieces from this that were deemed unnecessary, can you take another look please?

Contributor

@bleskes bleskes left a comment


Looks great, thanks for the extra iteration. I left some minor asks.

lastCommit = indexCommits.get(indexCommits.size()-1);
historyUUID = lastCommit.getUserData().get(HISTORY_UUID_KEY);
// We don't want any translogs hanging around for recovery, so we need to set these accordingly
final long lastGen = Long.parseLong(lastCommit.getUserData().get(Translog.TRANSLOG_GENERATION_KEY));
Contributor

this is not used?

Also, I think the point is to validate integrity, i.e. open the translog, check that its UUID matches, check that it's empty, and shut it down.

Member Author

Okay, I pushed a commit removing the unused lastGen and opening/closing the translog for validation purposes.


@Override
public SeqNoStats getSeqNoStats(long globalCheckpoint) {
return new SeqNoStats(0, 0, 0);
Contributor

use the incoming globalCheckpoint?

Member Author

Wouldn't that violate the assertion that global checkpoint is the minimum of all shards' local checkpoints?

Contributor

That would, but we don't check this here. Also, I think we should load the stats from the last commit point, not just 0 (both the max seq no and the local checkpoint are stored in the commit).

Contributor

@bleskes bleskes left a comment


I did another round. Sorry for not thinking about these in the first run.

final TranslogDeletionPolicy translogDeletionPolicy = new TranslogDeletionPolicy(-1, -1);

// The translog is opened and closed to validate that the translog UUID from lucene is the same as the one in the translog
Translog translog = openTranslog(engineConfig, translogDeletionPolicy, engineConfig.getGlobalCheckpointSupplier());


}

@Override
public void resetLocalCheckpoint(long localCheckpoint) {
Contributor

can you assert this is equal to getLocalCheckpoint()?

Member Author

Yep added this

}

@Override
public long getLocalCheckpoint() {
Contributor

can you return the local checkpoint stored in the commit?

Member Author

Sure, added this
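Reading the local checkpoint back out of a commit's user data might look roughly like this sketch. The map-based shape and the literal key are illustrative; the real key constant lives in Elasticsearch's SequenceNumbers class:

```java
import java.util.Map;

class CommitCheckpointReader {
    // Illustrative key; Elasticsearch keeps this as a constant
    // (SequenceNumbers.LOCAL_CHECKPOINT_KEY) rather than a literal.
    static final String LOCAL_CHECKPOINT_KEY = "local_checkpoint";

    // A commit's user data maps string keys to string values, so the
    // checkpoint has to be parsed back into a long instead of defaulting
    // to 0 as the earlier draft of getSeqNoStats did.
    static long localCheckpoint(Map<String, String> commitUserData) {
        return Long.parseLong(commitUserData.get(LOCAL_CHECKPOINT_KEY));
    }
}
```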


public class NoopEngineTests extends EngineTestCase {
private static final IndexSettings INDEX_SETTINGS = IndexSettingsModule.newIndexSettings("index", Settings.EMPTY);

Contributor

Can you also add a test that creates a normal engine, adds some data, flushes, closes, and then opens a noop engine, and verifies that all the stats make sense and that getting a lucene snapshot returns the expected data (doc counts etc.)?

Member Author

getting a lucene snapshot returns the expected data (doc counts etc)?

Do you mean from newTranslogSnapshotFromMinSeqNo? I've been returning an empty snapshot for this with docCount of 0 (and no operations, because we have no way of returning operations), so it won't match the same doc count that a snapshot from a regular engine will.

Contributor

I meant this. The standard engine ends up returning a SnapshotIndexCommit, which is why I used the term lucene snapshot. I hope this is clearer.

Member Author

Yep, thanks for clarifying, this was added as a test

@dakrone
Member Author

dakrone commented Jun 12, 2018

@bleskes thanks for taking another look, I think I addressed all your feedback

Contributor

@bleskes bleskes left a comment


LGTM. Thanks for the extra work @dakrone. I left some questions on the test, but I'm happy either way.

@@ -98,6 +98,9 @@ public void close() {

// The translog is opened and closed to validate that the translog UUID from lucene is the same as the one in the translog
Translog translog = openTranslog(engineConfig, translogDeletionPolicy, engineConfig.getGlobalCheckpointSupplier());
Contributor

wrap this in try finally?

Member Author

Yes, good catch; I thought about that last night but forgot to do it. I'll update that.
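The wrapping the reviewer asks for can be sketched with try-with-resources; `CheckedResource` is a hypothetical stand-in for the real Translog class:

```java
// Hypothetical stand-in for a translog: an AutoCloseable resource opened
// only so its UUID can be checked against the one recorded in Lucene.
final class CheckedResource implements AutoCloseable {
    private final String uuid;

    CheckedResource(String uuid) {
        this.uuid = uuid;
    }

    String uuid() {
        return uuid;
    }

    @Override
    public void close() {
        // release file handles, locks, etc.
    }
}

class UuidValidator {
    // try-with-resources guarantees close() runs even when validation
    // throws, which is the point of wrapping the open/validate/close
    // sequence instead of closing the resource on the happy path only.
    static void validate(String expectedUuid, String actualUuid) {
        try (CheckedResource resource = new CheckedResource(actualUuid)) {
            if (!expectedUuid.equals(resource.uuid())) {
                throw new IllegalStateException("translog UUID mismatch");
            }
        }
    }
}
```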

@@ -56,4 +67,38 @@ public void testTwoNoopEngines() throws IOException {
engine2.close();
}

public void testNoopAfterRegularEngine() throws IOException {
int docs = randomIntBetween(1, 10);
ReplicationTracker tracker = (ReplicationTracker) engine.config().getGlobalCheckpointSupplier();
Contributor

why is this needed? we typically just write to the engine?

Member Author

This is a rabbit hole @jasontedor and I got into last night. In order to completely flush and remove the translog, we need a deletion policy where we can advance the generation needed for recovery. In order to advance that, we have to have the tracker working correctly (which I guess is normally managed at the IndexShard level), so this allows us to advance global checkpoints as needed so we can remove the translog and open a new noop engine (otherwise the noop engine would throw an exception about there being translog operations when it is opened).

Member

@dnhatn dnhatn Jun 12, 2018


We can pass the global checkpoint as a LongSupplier to an engine.

        final AtomicLong globalCheckpoint = new AtomicLong(SequenceNumbers.NO_OPS_PERFORMED);
        try (Store store = createStore();
             InternalEngine engine = createEngine(store, createTempDir(), globalCheckpoint::get)) {
            engine.syncTranslog(); // Make sure that the global checkpoint is persisted to the translog's checkpoint
        }

List<IndexCommit> indexCommits = DirectoryReader.listCommits(store.directory());
lastCommit = indexCommits.get(indexCommits.size()-1);
historyUUID = lastCommit.getUserData().get(HISTORY_UUID_KEY);
localCheckpoint = Long.parseLong(lastCommit.getUserData().get(SequenceNumbers.LOCAL_CHECKPOINT_KEY));
Contributor

This is not yet relevant to this PR, but these will have to be the same IMO.

Member

@jasontedor jasontedor left a comment


I left some comments/questions.

try {
lastCommittedSegmentInfos = store.readLastCommittedSegmentsInfo();
List<IndexCommit> indexCommits = DirectoryReader.listCommits(store.directory());
lastCommit = indexCommits.get(indexCommits.size()-1);
Member

Formatting nit: indexCommits.size()-1 -> indexCommits.size() - 1

// The deletion policy for the translog should not keep any translogs around, so the min age/size is set to -1
final TranslogDeletionPolicy translogDeletionPolicy = new TranslogDeletionPolicy(-1, -1);

// The translog is opened and closed to validate that the translog UUID from lucene is the same as the one in the translog
Member

Can we have a test case for this?

Member Author

Sure, added a test

final NoopEngine engine = new NoopEngine(noopConfig(INDEX_SETTINGS, store, primaryTranslogDir));
expectThrows(UnsupportedOperationException.class, () -> engine.index(null));
expectThrows(UnsupportedOperationException.class, () -> engine.delete(null));
expectThrows(UnsupportedOperationException.class, () -> engine.noOp(null));
Member

As a follow-up can we have:

  • Noop -> NoOp
  • noop -> no-op (when written in exception messages to the user, and code comments)
  • noopEngine -> `noOpEngine`

Member Author

ooookkaayy


public void testTwoNoopEngines() throws IOException {
engine.close();
// It's so noop you can even open two engines for the same store without tripping anything
Member

Is this something that we are going to rely on?

Member Author

Probably not? I just thought it was a good test that we aren't acquiring locks or locking files accidentally or anything like that

Member

Okay, maybe the test name should be clearer, or at least it should have a comment explaining its purpose?

Member Author

Sure I'll add a comment

@dakrone
Member Author

dakrone commented Jun 13, 2018

@jasontedor I addressed your comments (I'll do the rename in a followup)

Member

@jasontedor jasontedor left a comment


LGTM.

@dakrone
Member Author

dakrone commented Jun 14, 2018

@elasticmachine run sample packaging tests please

@dakrone dakrone merged commit e1d068d into elastic:closed-index-replication Jun 15, 2018
dakrone added a commit to dakrone/elasticsearch that referenced this pull request Jun 18, 2018
In elastic#31163 (comment)
Jason requested renaming NoopEngine to NoOpEngine. This commit renames the class
and relevant parts.

Relates to elastic#31141
dakrone added a commit that referenced this pull request Jun 19, 2018
In #31163 (comment)
Jason requested renaming NoopEngine to NoOpEngine. This commit renames the class
and relevant parts.

Relates to #31141
dakrone added a commit to dakrone/elasticsearch that referenced this pull request Jul 2, 2018
This commit uses the NoOp engine introduced in elastic#31163 for closed indices.
Instead of being removed from the routing table and essentially "dropped" (but
not deleted), closed indices are now replicated the same way open indices are.

Relates to elastic#31141
tlrx pushed a commit that referenced this pull request Sep 14, 2018
This adds a new Engine implementation that does.. nothing. Any operations throw
an `UnsupportedOperationException` when tried. This engine is intended as a
building block for replicated closed indices in subsequent code changes.

Relates to #31141
tlrx pushed a commit that referenced this pull request Sep 14, 2018
In #31163 (comment)
Jason requested renaming NoopEngine to NoOpEngine. This commit renames the class
and relevant parts.

Relates to #31141
tlrx pushed a commit to tlrx/elasticsearch that referenced this pull request Sep 20, 2018
This adds a new Engine implementation that does.. nothing. Any operations throw
an `UnsupportedOperationException` when tried. This engine is intended as a
building block for replicated closed indices in subsequent code changes.

Relates to elastic#31141
tlrx pushed a commit to tlrx/elasticsearch that referenced this pull request Sep 20, 2018
In elastic#31163 (comment)
Jason requested renaming NoopEngine to NoOpEngine. This commit renames the class
and relevant parts.

Relates to elastic#31141
tlrx pushed a commit to tlrx/elasticsearch that referenced this pull request Sep 26, 2018
This adds a new Engine implementation that does.. nothing. Any operations throw
an `UnsupportedOperationException` when tried. This engine is intended as a
building block for replicated closed indices in subsequent code changes.

Relates to elastic#31141
tlrx pushed a commit to tlrx/elasticsearch that referenced this pull request Sep 26, 2018
In elastic#31163 (comment)
Jason requested renaming NoopEngine to NoOpEngine. This commit renames the class
and relevant parts.

Relates to elastic#31141
@dakrone dakrone deleted the add-noop-engine branch February 4, 2019 14:47
Labels
:Distributed/Engine Anything around managing Lucene and the Translog in an open shard. >non-issue