
Use _refresh instead of reading from Translog in the RT GET case #20102

Merged
merged 10 commits into elastic:master from s1monw:simplify_realtime_get on Aug 24, 2016

Conversation

7 participants
@s1monw
Contributor

commented Aug 22, 2016

Today we do a lot of accounting inside the engine to maintain locations
of documents inside the transaction log. This is only needed to ensure
we can return the document's source from the engine if it hasn't been refreshed.
Aside from the added complexity of being able to read from the currently written translog
and of maintaining pointers into the translog, this also caused inconsistencies, such as different values
of the _ttl field depending on whether it was read from the translog or not. TermVectors are totally different if
the document is fetched from the translog, since copy fields are ignored, etc.

This change simply calls refresh if the document's latest version is not in the index. This
streamlines the semantics of the _get API and allows for more optimizations inside the engine
and on the transaction log. Note: _refresh is only called iff the requested document has not been refreshed yet but has recently been updated or added.

NOTE: there are more APIs that are test-only now and can potentially be removed... I haven't run the REST tests yet, so it's really WIP
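The mechanism the PR describes can be modeled in a few lines. This is an illustrative sketch only: the class and method names below are hypothetical, not the real Elasticsearch Engine API. It captures the idea of tracking unrefreshed writes per id, refreshing on a realtime GET only when the requested id has a pending change, and then always serving the document from the index instead of reading its source from the translog.

```java
import java.util.HashMap;
import java.util.Map;

class EngineSketch {
    private final Map<String, Long> pending = new HashMap<>(); // written, not yet searchable
    private final Map<String, Long> index = new HashMap<>();   // searchable (refreshed)
    int refreshCount = 0;

    void indexDoc(String id, long version) {
        pending.put(id, version);
    }

    void refresh() {
        index.putAll(pending); // makes ALL pending docs visible, not just the requested one
        pending.clear();
        refreshCount++;
    }

    Long get(String id, boolean realtime) {
        if (realtime && pending.containsKey(id)) {
            refresh(); // replaces the old "read source from the translog" path
        }
        return index.get(id);
    }

    public static void main(String[] args) {
        EngineSketch e = new EngineSketch();
        e.indexDoc("1", 1L);
        System.out.println("get 1 -> " + e.get("1", true) + ", refreshes: " + e.refreshCount);
    }
}
```

Note how a GET on a document with no pending change triggers no refresh at all, which is the point s1monw makes later in the thread.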

@mikemccand

Contributor

commented Aug 22, 2016

LGTM, so simple.

This goes a long way toward addressing #19787, where the version map consumes 2/3 of the indexing buffer. The version map is still alive with this change, but not storing the Translog.Location will save a lot of heap.
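Back-of-the-envelope arithmetic for the per-entry saving mikemccand mentions; every constant here is an assumption about a typical 64-bit JVM layout (and about Translog.Location holding two longs and an int), not a measured number.

```java
public class VersionMapSavings {
    // Approximate bytes no longer kept per version-map entry once the
    // Translog.Location reference (and the Location object itself) is dropped.
    static long approxBytesSavedPerEntry() {
        final long objectRef = 8;      // uncompressed object reference (assumption)
        final long objectHeader = 16;  // object header for the Location object (assumption)
        final long locationFields = 2 * Long.BYTES + Integer.BYTES; // generation, position, size (assumption)
        return objectRef + objectHeader + locationFields;
    }

    public static void main(String[] args) {
        System.out.println("~" + approxBytesSavedPerEntry() + " bytes saved per entry");
    }
}
```

Multiplied across every document indexed since the last refresh, a few dozen bytes per entry adds up quickly in the indexing buffer.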

@jpountz

Contributor

commented Aug 22, 2016

LGTM2

@nik9000

Contributor

commented Aug 22, 2016

In a followup we should switch the default for GETs to non-realtime. It's nice that we can still do a realtime GET, but the performance is very different. I figure we can do the docs in that PR? I'm happy to make the PR if no one else wants it.

@jpountz

Contributor

commented Aug 22, 2016

@nik9000 What is your motivation to make GET not realtime by default? Are you worried about the worst-case of someone doing a PUT then GET on the same id all the time?

@nik9000

Contributor

commented Aug 22, 2016

Are you worried about the worst-case of someone doing a PUT then GET on the same id all the time?

Basically, yes. HTTP GETs are usually fast and don't make changes to the server. People are used to that because it is true for almost all GETs. This way of doing realtime GETs is still fairly fast in most cases but it'll make lots of little segments. Lots of little segments are bad for performance for lots of reasons but in this case I worry because they don't actually slow down the GET all that much, just the searches done later.

I think it is the right thing to do if we're trying to push folks towards fast things by default.

@s1monw

Contributor Author

commented Aug 22, 2016

Lots of little segments are bad for performance for lots of reasons but in this case I worry because they don't actually slow down the GET all that much, just the searches done later.

I don't think this is true. We only refresh iff the document we are fetching has changed since the last time we refreshed. So I guess commonly we won't refresh at all.

s1monw added some commits Aug 22, 2016

@s1monw s1monw removed WIP discuss labels Aug 23, 2016

@s1monw

Contributor Author

commented Aug 23, 2016

I removed the discuss and WIP labels and updated the docs. I think it's ready.

has been updated but is not yet refreshed, the get API will issue a refresh
call in-place to make the document visible. This will also make other documents
changed since the last refresh visible. In order to disable realtime GET,
one can set the `realtime` parameter to `false`.

@clintongormley
source is not enabled). It is a good practice to assume that the fields
will be loaded from source when using realtime GET, even if the fields
are stored.
stored in the mapping).

@clintongormley

clintongormley Aug 24, 2016

Member

stored -> <<mapping-store,stored>>

@s1monw

s1monw Aug 24, 2016

Author Contributor

not sure I get what you mean

@jpountz

jpountz Aug 24, 2016

Contributor

he wants you to add a link

@s1monw

s1monw Aug 24, 2016

Author Contributor

yeah, GitHub markup mess-up


==== get API

As of 5.0.0 the get API will issue a refresh if a the requested document has

@clintongormley

clintongormley Aug 24, 2016

Member

a the -> the

==== get API

As of 5.0.0 the get API will issue a refresh if a the requested document has
been changed since the last refresh but the change isn't refreshed yet. This

@clintongormley

clintongormley Aug 24, 2016

Member

isn't -> hasn't been

@clintongormley

Member

commented Aug 24, 2016

minor docs changes, otherwise LGTM


  @Override
  public long ramBytesUsed() {
-     return RamUsageEstimator.NUM_BYTES_OBJECT_HEADER + Long.BYTES + RamUsageEstimator.NUM_BYTES_OBJECT_REF +
-         (translogLocation != null ? translogLocation.size : 0);
+     return RamUsageEstimator.NUM_BYTES_OBJECT_HEADER + Long.BYTES + RamUsageEstimator.NUM_BYTES_OBJECT_REF;

@jpountz

jpountz Aug 24, 2016

Contributor

so Translog.Location does not need to implement Accountable anymore?

@@ -182,140 +181,12 @@ private GetResult innerGet(String type, String id, String[] gFields, boolean rea

try {
// break between having loaded it from translog (so we only have _source), and having a document to load
if (get.docIdAndVersion() != null) {
return innerGetLoadFromStoredFields(type, id, gFields, fetchSourceContext, get, mapperService);

@jpountz

jpountz Aug 24, 2016

Contributor

can you fix the indentation?

@@ -103,12 +96,12 @@ static TermVectorsResponse getTermVectors(IndexShard indexShard, TermVectorsRequ
Versions.DocIdAndVersion docIdAndVersion = get.docIdAndVersion();
/* from an artificial document */
if (request.doc() != null) {
-     termVectorsByField = generateTermVectorsFromDoc(indexShard, request, !docFromTranslog);
+     termVectorsByField = generateTermVectorsFromDoc(indexShard, request, true);

@jpountz

jpountz Aug 24, 2016

Contributor

this seems to be the only call site of generateTermVectorsFromDoc so maybe we can simplify generateTermVectorsFromDoc by removing all branches that handle the false case
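The simplification jpountz suggests can be sketched generically. The names below are illustrative, not the actual TermVectorsService code: once every remaining call site passes `true` for a boolean parameter, the flag can be inlined and the branches handling `false` deleted.

```java
public class DeadBranchExample {
    // Before: a boolean parameter with only one live value across all callers.
    static String generate(String doc, boolean fromLiveDoc) {
        if (fromLiveDoc) {
            return "vectors(" + doc + ")";
        }
        return "vectors-from-translog(" + doc + ")"; // dead once translog reads are gone
    }

    // After: the parameter and its false-branch are removed.
    static String generateSimplified(String doc) {
        return "vectors(" + doc + ")";
    }

    public static void main(String[] args) {
        // Both paths agree for the only value callers still use.
        System.out.println(generate("d", true).equals(generateSimplified("d")));
    }
}
```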

@s1monw s1monw merged commit c499427 into elastic:master Aug 24, 2016

1 of 2 checks passed

elasticsearch-ci Build started sha1 is merged.
Details
CLA Commit author is a member of Elasticsearch
Details

@s1monw s1monw deleted the s1monw:simplify_realtime_get branch Aug 24, 2016

@tylerfontaine

Member

commented Aug 24, 2016

In the event of an explosion of segments because of these refreshes, performance concerns aside (as I don't know enough to comment on them), it /could/ chew through file descriptors, which would be more problematic. Would it make sense to include a caveat to that effect in the event someone abuses this?

Or are we assuming that

  1. it shouldn't happen most of the time
  2. if it did, the segments are small enough that default merging rules would keep file descriptor use in check?
@s1monw

Contributor Author

commented Aug 25, 2016

@tylerfontaine very good question! The assumption here is that our mechanisms of throttling indexing and only flushing compound-file segments will keep the number of file descriptors at bay by triggering merging. If you have thousands of shards on a node, that might become a problem, but that's orthogonal to this change. Makes sense?

@ismael-hasan

Member

commented Aug 25, 2016

As of now, one recommendation to improve bulk indexing speed is to change the refresh interval to a higher value (30 seconds) or to disable refresh while doing bulk operations.
In the following use case:

  • Periodic bulk operations
  • Quality of data is not good, the same document can be updated several times in a bulk operation

Could it happen that we slow down indexing because of too many refresh operations happening constantly?
Does this make sense? If so, would it be worth it (and possible, and advisable!) for updates done from the bulk API not to trigger the refresh, but instead use the translog?

@s1monw

Contributor Author

commented Aug 25, 2016

@ismael-hasan your analysis of the problem is correct. If you are updating the same doc over and over again, you will hit refresh far more often. If this is the rule rather than the exception in your application, it would be recommended to bulk those updates together at your application layer. I am convinced that this is a rare corner case; would you agree, given the overall user base? We are trying to reduce the footprint in terms of memory and APIs that exist solely because of this corner case, so reading from the translog in only this situation would basically revert this PR. We don't want to sacrifice the overall experience for corner cases like this, since it would make maintaining performance and quality hard over time.

Yet, I am happy to add this limitation and your use case to the docs to give people a heads-up. Can you elaborate on your use case a bit more so I can phrase it?

@ismael-hasan

Member

commented Aug 25, 2016

@s1monw It was an example use case, not an actual one. I also think it is an unusual scenario and it should be addressed from the data point of view. I saw it just a couple of times in previous jobs, for instance with partial updates: a document is inserted in a production database, which triggers storing it somewhere else to be indexed later; then a new field is populated and that change is stored as well, and a second/third time; finally, when all of this goes to ingestion, we have one record for the original document and three records adding new fields to the object. I just wanted to raise awareness of it and to understand better how we planned to handle it; if allowing the translog approach for bulk was already considered and discarded, I am happy with it!

I strongly +1 adding a note to the documentation; it will save some headaches in the future with "Ingestion is going slow". It is easier to plan your bulk ingestion ahead with this in mind than to change the approach later when you realize you are refreshing too often during bulk. I would add something like: "When doing bulk indexing, sometimes the data will not be curated, and a single bulk operation may include several operations on the same document. This can cause slow indexing, since it triggers unnecessary refresh operations. In this case it is recommended to pre-process the data so that each bulk operation sends only one addition/update per document, including all of the modifications at once."

@s1monw

Contributor Author

commented Aug 25, 2016

@ismael-hasan I thought about your use case again and I think there is a misunderstanding. This really only applies to indexing if you use Elasticsearch as a primary store, not when you sync from a database. For instance, if you have a trigger in the DB, you generate a new version of doc X and either index it or put it aside (as you said) to bulk it later. Now another field is populated; you create version N+1 of the doc and either index it or put it aside in the same bulk, and so on. None of these indexing operations will use the GET API. If you use the upsert feature for something like a counter (note: Elasticsearch is not good for counters!), where you read the document from ES via GET, increment the counter, and index it again, that is when you will hit this. I hope this clarifies the situation?

@ismael-hasan

Member

commented Aug 25, 2016

Note this is informative; by no means am I suggesting to revert this change.
@s1monw There is another scenario that came to my mind; sharing it in case more people were doing this. I had a couple of cases in which users had log entries related by "event_id", so, to build an entity-centric approach, we used Logstash outputs to do upserts: as the log entries related to an event came into the system, the entity would be updated. This came with an acknowledged and accepted penalty in indexing performance.
It seems the performance in 5.0 will be much worse using this way of centralizing entities from logs, so it is worth starting to consider different methods for this scenario.

jpountz added a commit to jpountz/elasticsearch that referenced this pull request Aug 26, 2016

GET operations should not extract fields from `_source`. elastic#20158
This makes GET operations more consistent with `_search` operations which expect
`(stored_)fields` to work on stored fields and source filtering to work on the
`_source` field. This is now possible thanks to the fact that GET operations
do not read from the translog anymore (elastic#20102) and also allows getting rid of
`FieldMapper#isGenerated`.

The `_termvectors` API (and thus more_like_this too) was relying on the fact
that GET operations would extract fields from either stored fields or the source
so the logic to do this that used to exist in `ShardGetService` has been moved
to `TermVectorsService`. It would be nice if term vectors did not rely on this,
but this does not seem to be a low hanging fruit.

tlrx added a commit to tlrx/elasticsearch that referenced this pull request Jun 30, 2017

tlrx added a commit that referenced this pull request Jul 4, 2017

tlrx added a commit that referenced this pull request Jul 4, 2017

tlrx added a commit that referenced this pull request Jul 4, 2017
