DocumentSizeObserver infrastructure to allow not reporting upon failures #104859

pgomulka · 2024-01-29T12:45:39Z

We want to report that observation of document parsing has finished only after successful indexing.
To achieve this, reporting must happen in only one place (not, as previously, in both IngestService and the 'bulk action').

This commit splits the DocumentParsingObserver in two: a DocumentSizeObserver, which wraps an XContentParser and returns the observed state, and a DocumentSizeReporter, which performs an action once parsing has completed and indexing has succeeded.

To report in one place, we need to pass the state from IngestService to the 'bulk action'. The state is currently represented as a long, normalisedBytesParsed.

In TransportShardBulkAction we read normalisedBytesParsed, and the serverless plugin checks whether the value indicates that parsing already happened in IngestService (value != -1). If it did, we create a DocumentSizeObserver with the fixed normalisedBytesParsed and do not increment it.

When indexing completes successfully, we report the observed state for the index with DocumentSizeReporter.

Small nit: by passing the documentParsingObserver via SourceToParse, we no longer have to inject it through the complex hierarchy for DocumentParser. Hence some constructor changes.
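
To make the split concrete, here is a minimal, self-contained sketch of the idea, not the code from this PR. The interface names follow the description, but their shapes are assumptions: the real observer wraps an XContentParser, whereas this sketch uses a hypothetical `onFieldParsed` callback and counts UTF-8 bytes as a stand-in for "normalised bytes". `CountingObserver`, `RecordingReporter`, `parse` and `indexAndReport` are invented for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SizeObservationSketch {

    /** Observes parsing and exposes the observed size; it has no way to report. */
    interface DocumentSizeObserver {
        /** Hypothetical callback: invoked by the simplified parser for each field value. */
        void onFieldParsed(String value);

        long normalisedBytesParsed();
    }

    /** Reports an observed size for an index; only the bulk-action stand-in holds one. */
    interface DocumentSizeReporter {
        void onCompleted(String indexName, long normalisedBytesParsed);
    }

    /** Assumption: "size" is simply the number of UTF-8 bytes seen. */
    static final class CountingObserver implements DocumentSizeObserver {
        private long bytes;

        @Override
        public void onFieldParsed(String value) {
            bytes += value.getBytes(StandardCharsets.UTF_8).length;
        }

        @Override
        public long normalisedBytesParsed() {
            return bytes;
        }
    }

    /** Test double that remembers every report it receives. */
    static final class RecordingReporter implements DocumentSizeReporter {
        final List<String> reports = new ArrayList<>();

        @Override
        public void onCompleted(String indexName, long normalisedBytesParsed) {
            reports.add(indexName + ":" + normalisedBytesParsed);
        }
    }

    /** Stand-in for document parsing: feeds every field to the observer. */
    static void parse(List<String> fields, DocumentSizeObserver observer) {
        for (String field : fields) {
            observer.onFieldParsed(field);
        }
    }

    /** Stand-in for the bulk action: the size is reported only if indexing succeeded. */
    static void indexAndReport(String index, List<String> fields, boolean indexingSucceeds, DocumentSizeReporter reporter) {
        DocumentSizeObserver observer = new CountingObserver();
        parse(fields, observer);
        if (indexingSucceeds) {
            reporter.onCompleted(index, observer.normalisedBytesParsed());
        }
    }

    public static void main(String[] args) {
        RecordingReporter reporter = new RecordingReporter();
        indexAndReport("logs", List.of("abc", "de"), true, reporter);
        indexAndReport("logs", List.of("rejected"), false, reporter);
        System.out.println(reporter.reports); // prints [logs:5]
    }
}
```

Because `parse` only ever receives the observer, the parsing side has nothing it could call to report; the decision stays with the code that knows whether indexing succeeded.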


pgomulka · 2024-01-31T13:18:25Z

server/src/main/java/org/elasticsearch/action/bulk/TransportShardBulkAction.java

+            DocumentParsingObserver documentParsingObserver = request.getNormalisedBytesParsed() != -1L
+                ? documentParsingSupplier.forAlreadyParsedInIngest(request.getNormalisedBytesParsed())
+                : documentParsingSupplier.getNewObserver();


I could simplify this with just

Suggested change

- DocumentParsingObserver documentParsingObserver = request.getNormalisedBytesParsed() != -1L
-     ? documentParsingSupplier.forAlreadyParsedInIngest(request.getNormalisedBytesParsed())
-     : documentParsingSupplier.getNewObserver();
+ DocumentParsingObserver documentParsingObserver = request.getNormalisedBytesParsed() != -1L
+     ? DocumentParsingObserver.EMPTY_INSTANCE
+     : documentParsingSupplier.getNewObserver();

Then in onComplete I could simply add an if:

DocumentParsingReporter documentParsingReporter = documentParsingSupplier.getDocumentParsingReporter();
IndexRequest request = context.getRequestToExecute();
if (request.getNormalisedBytesParsed() != -1) {
    documentParsingReporter.onCompleted(docWriteRequest.index(), request.getNormalisedBytesParsed());
} else {
    DocumentParsingObserver documentParsingObserver = context.getDocumentParsingObserver();
    documentParsingReporter.onCompleted(docWriteRequest.index(), documentParsingObserver.normalisedBytesParsed());
}

with this I could remove the forAlreadyParsedInIngest method and the FixedDocumentParsingObserver instance

Whether simplified or not, maybe it would be nice to extract this out into a static method on DocumentParsingObserver (or something like that)? It would give you a handy place to write a javadoc and would encapsulate the -1L magic number.

fair point, better to hide the complexity of dealing with a magic number in a serverless plugin
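
One way the suggested encapsulation could look is sketched below. This is an illustration under assumptions, not the code that was merged: `NOT_PARSED_YET`, `forRequest`, `MeteringObserver`, `FixedObserver` and the `onBytesParsed` callback are hypothetical names. The point is that a single static method owns the -1 sentinel and callers never compare against it.

```java
public class ObserverSelectionSketch {

    /** Sentinel meaning "this request was not parsed earlier, e.g. in the ingest phase". */
    static final long NOT_PARSED_YET = -1L;

    interface DocumentSizeObserver {
        /** Hypothetical callback invoked as the parser consumes bytes. */
        void onBytesParsed(long bytes);

        long normalisedBytesParsed();
    }

    /** Accumulates whatever the parser reports. */
    static final class MeteringObserver implements DocumentSizeObserver {
        private long total;

        @Override
        public void onBytesParsed(long bytes) {
            total += bytes;
        }

        @Override
        public long normalisedBytesParsed() {
            return total;
        }
    }

    /** Returns the size it was created with and ignores any further parsing. */
    static final class FixedObserver implements DocumentSizeObserver {
        private final long fixed;

        FixedObserver(long fixed) {
            this.fixed = fixed;
        }

        @Override
        public void onBytesParsed(long bytes) {
            // Intentionally a no-op: the document was already measured.
        }

        @Override
        public long normalisedBytesParsed() {
            return fixed;
        }
    }

    /**
     * The only place that knows about the sentinel: a value of -1 yields a fresh
     * metering observer, any other value yields an observer fixed at that size.
     */
    static DocumentSizeObserver forRequest(long normalisedBytesParsed) {
        return normalisedBytesParsed == NOT_PARSED_YET
            ? new MeteringObserver()
            : new FixedObserver(normalisedBytesParsed);
    }

    public static void main(String[] args) {
        DocumentSizeObserver fresh = forRequest(NOT_PARSED_YET);
        fresh.onBytesParsed(42);
        DocumentSizeObserver fixed = forRequest(100L);
        fixed.onBytesParsed(42);
        System.out.println(fresh.normalisedBytesParsed() + " " + fixed.normalisedBytesParsed()); // prints 42 100
    }
}
```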

elasticsearchmachine · 2024-02-01T12:19:40Z

Hi @pgomulka, I've created a changelog YAML for you.

elasticsearchmachine · 2024-02-01T14:10:37Z

Pinging @elastic/es-core-infra (Team:Core/Infra)

rjernst · 2024-02-01T20:40:03Z

Given the change to need to explicitly serialize the state, and that state being a long (the size of the document), perhaps we should narrow the scope of the observer. We could rename it to DocumentSizeObserver (or something similar), and the resulting size is what is serialized. So the observer would explicitly have the size available after parsing. Then we would not need 2 different interfaces. @pgomulka wdyt?


pgomulka · 2024-02-02T10:20:15Z

Given the change to need to explicitly serialize the state, and that state being a long (the size of the document), perhaps we should narrow the scope of the observer. We could rename it to DocumentSizeObserver (or something similar), and the resulting size is what is serialized.

Fair point, I like having the explicit size in the name. Updated the PR.

So the observer would explicitly have the size available after parsing. Then we would not need 2 different interfaces. @pgomulka wdyt?

The reason for the split into DocumentSizeObserver and DocumentParsingReporter is so that code that is not meant to report (IngestService and DocumentParser) has no access to a method that would allow it to do so.
Reporting should only be done in TransportShardBulkAction.onComplete, and only this code has access to the DocumentParsingReporter instance.

pgomulka · 2024-02-02T14:54:03Z

I suggest that maybe we focus on INDEX only here. I have a change for updates, but it would complicate the PR even more (wiring and changes to UpdateHelper https://github.com/elastic/elasticsearch/pull/105063/files#diff-6a17455034b7c4884ae809d49357d6dfedaedd2d0780a10af30b314ce757546e)

rjernst

Looks pretty good, I have a few more nits on naming

rjernst · 2024-02-06T14:22:58Z

server/src/main/java/org/elasticsearch/TransportVersions.java

@@ -167,6 +167,7 @@ static TransportVersion def(int id) {
    public static final TransportVersion DESIRED_NODE_VERSION_OPTIONAL_STRING = def(8_580_00_0);
    public static final TransportVersion ML_INFERENCE_REQUEST_INPUT_TYPE_UNSPECIFIED_ADDED = def(8_581_00_0);
    public static final TransportVersion ASYNC_SEARCH_STATUS_SUPPORTS_KEEP_ALIVE = def(8_582_00_0);
+    public static final TransportVersion NORMALISED_BYTES_PARSED = def(8_583_00_0);


Could this be more descriptive? eg INDEX_REQUEST_NORMALIZED_BYTES_PARSED? It's a mouthful, but it gives at least some context as to what the normalized bytes are for.

rjernst · 2024-02-06T14:38:55Z

server/src/main/java/org/elasticsearch/plugins/internal/DocumentParsingSupplier.java

+/**
+ * An interface to provide instances of document parsing observer and reporter
+ */
+public interface DocumentParsingSupplier {


nit: supplier is normally for a singular thing being supplied. We've typically used "provider" (SPI terminology) for something like this which provides many things.

rjernst · 2024-02-06T14:39:37Z

server/src/main/java/org/elasticsearch/plugins/internal/DocumentParsingSupplier.java

+    /**
+     * @return a new 'empty' observer to use when observing parsing
+     */
+    DocumentSizeObserver getDocumentSizeObserver();


Since new objects are being created (it's not the same instance each call), can we use "new" or "create" prefix instead of "get"?

rjernst · 2024-02-06T14:40:25Z

server/src/main/java/org/elasticsearch/plugins/internal/DocumentParsingSupplier.java

+    /**
+     * @return an observer to use when continue observing parsing based on previous result
+     */
+    DocumentSizeObserver getDocumentSizeObserver(long normalisedBytesParsed);


nit: Can we call this newFixedDocumentSizeObserver to make clear the size will not change?

rjernst · 2024-02-06T14:41:40Z

server/src/main/java/org/elasticsearch/plugins/internal/DocumentParsingSupplierPlugin.java

     */
-    Supplier<DocumentParsingObserver> getDocumentParsingObserverSupplier();
+    DocumentParsingSupplier getDocumentParsingSupplier();


Can we eliminate a level of indirection here by having the plugin interface contain the 3 methods for constructing these classes?

The DocumentParsingSupplier is used in the transport classes and IngestService. It would be odd to pass a plugin down into those classes.

stu-elastic · 2024-02-06T16:21:26Z

server/src/main/java/org/elasticsearch/action/bulk/TransportShardBulkAction.java

@@ -173,6 +180,7 @@ protected int primaryOperationCount(BulkShardRequest request) {
        return request.items().length;
    }

+    // TODO PG this is just for testing?


I was worried that there was another public method, performOnPrimary, but it turns out that it is only used in testing. It is called by an x-pack test, hence the scope had to be changed.

rjernst

I left a few more nits about naming, and it looks like the TransportVersion needs to be updated. Otherwise LGTM.

rjernst · 2024-02-09T14:25:08Z

server/src/main/java/org/elasticsearch/action/index/IndexRequest.java

@@ -204,6 +206,9 @@ public IndexRequest(@Nullable ShardId shardId, StreamInput in) throws IOExceptio
        } else {
            requireDataStream = false;
        }
+        if (in.getTransportVersion().onOrAfter(INDEX_REQUEST_NORMALIZED_BYTES_PARSED)) {
+            normalisedBytesParsed = in.readLong();


any reason not to use vlong?
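
A later commit in this PR is titled "vlong to zlong due to -1 value". The sketch below illustrates the likely reasoning, assuming "zlong" refers to a zigzag-encoded variable-length long: a plain variable-length encoding is designed for non-negative values, so the -1 sentinel (all 64 bits set) would need the maximum number of bytes, while zigzag first maps small negative numbers to small unsigned ones. The code is standalone and does not use the Elasticsearch stream classes; `varLengthBytes` is a hypothetical helper that only counts bytes.

```java
public class ZigZagSketch {

    /** Maps signed to unsigned so small magnitudes stay small: 0->0, -1->1, 1->2, -2->3, ... */
    static long zigZagEncode(long value) {
        return (value << 1) ^ (value >> 63);
    }

    /** Inverse of zigZagEncode. */
    static long zigZagDecode(long encoded) {
        return (encoded >>> 1) ^ -(encoded & 1);
    }

    /** Bytes needed by a 7-bits-per-byte variable-length encoding of the given bit pattern. */
    static int varLengthBytes(long unsignedValue) {
        int bytes = 1;
        while ((unsignedValue & ~0x7FL) != 0) {
            unsignedValue >>>= 7;
            bytes++;
        }
        return bytes;
    }

    public static void main(String[] args) {
        long sentinel = -1L;
        // Raw bit pattern of -1 pushed through a variable-length encoding.
        System.out.println(varLengthBytes(sentinel)); // prints 10
        // Zigzag first, then variable-length.
        System.out.println(varLengthBytes(zigZagEncode(sentinel))); // prints 1
        // The round trip preserves the sentinel.
        System.out.println(zigZagDecode(zigZagEncode(sentinel))); // prints -1
    }
}
```

For comparison, a fixed-width long always takes 8 bytes, so the zigzag form is smaller for the sentinel and for typical document sizes alike.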

rjernst · 2024-02-09T14:27:21Z

server/src/main/java/org/elasticsearch/plugins/internal/DocumentParsingReporter.java

+/**
+ * An interface to allow performing an action when parsing has been completed and successful
+ */
+public interface DocumentParsingReporter {


Since this only handles document size, can we make the name parallel to the observer as DocumentSizeReporter?

rjernst · 2024-02-09T14:28:03Z

server/src/main/java/org/elasticsearch/plugins/internal/DocumentParsingSupplierPlugin.java

 */
-public interface DocumentParsingObserverPlugin {
+public interface DocumentParsingSupplierPlugin {


Since we now call the supplier a provider, can we use the same terminology in the plugin here, ie DocumentParsingProviderPlugin?

rjernst · 2024-02-09T14:30:19Z

server/src/main/java/org/elasticsearch/plugins/internal/DocumentParsingProvider.java

+    /**
+     * @return an observer to use when continue observing parsing based on previous result
+     */
+    DocumentSizeObserver newDocumentSizeObserver(long normalisedBytesParsed);


Can we use a more distinctive name from creating a "real" observer above? eg newFixedDocumentSizeObserver, and make it clear in the javadoc this observer does not actually observe anything, it reports the exact size passed in. The javadoc currently implies it would keep increasing the size when passed to a parser.

It is not always a fixed DocumentSizeObserver.
If parsing has not happened in IngestService, we still want to create a new metering DocumentSizeObserver.
I will keep the method name, but will remove the javadoc wording indicating that it will continue parsing (this was based on the previous idea that we wanted to keep the interfaces in ES independent of the implementation in serverless).

I renamed the 2 methods as per your suggestion there and updated the javadoc.

Update requests that send a document (or part of one) should allow the size of that document to be metered. Update requests that use a script should not be metered (reported size 0). This commit follows up on #104859. The update's document is parsed in UpdateHelper, following the same pattern we use to meter parsing in IngestService. If a script is used, the observed size will be 0. The observed value is then reported in TransportShardBulkAction and, because the value is 0 or positive, the modified document is not metered again. This commit also renames getDocumentParsingSupplier to getDocumentParsingProvider (this was accidentally omitted in #104859).
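
The value passed along for an update therefore has three meanings: -1 (nothing observed yet), 0 (script update, nothing to meter) and a positive size (document already measured). A rough sketch of that decision follows. All names here (`observedSize`, `needsMeteringInBulkAction`, `NOT_PARSED_YET`) are hypothetical, and UTF-8 byte length is only an assumed stand-in for the normalised size.

```java
import java.nio.charset.StandardCharsets;

public class UpdateMeteringSketch {

    /** Sentinel meaning "no size has been observed for this request yet". */
    static final long NOT_PARSED_YET = -1L;

    /**
     * Size observed while preparing an update: 0 when a script is used,
     * otherwise the byte length of the supplied (partial) document.
     */
    static long observedSize(String partialDoc, String script) {
        if (script != null) {
            return 0L;
        }
        return partialDoc.getBytes(StandardCharsets.UTF_8).length;
    }

    /** The bulk-action stand-in meters again only when nothing was observed earlier. */
    static boolean needsMeteringInBulkAction(long normalisedBytesParsed) {
        return normalisedBytesParsed == NOT_PARSED_YET;
    }

    public static void main(String[] args) {
        long docUpdate = observedSize("{\"a\":1}", null);
        long scriptUpdate = observedSize(null, "ctx._source.a += 1");
        System.out.println(docUpdate + " " + needsMeteringInBulkAction(docUpdate)); // prints 7 false
        System.out.println(scriptUpdate + " " + needsMeteringInBulkAction(scriptUpdate)); // prints 0 false
        System.out.println(needsMeteringInBulkAction(NOT_PARSED_YET)); // prints true
    }
}
```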

es: working draft

4886ef8

pgomulka added the WIP label Jan 29, 2024

elasticsearchmachine added the v8.13.0 label Jan 29, 2024

pgomulka added 13 commits January 29, 2024 15:31

new reporter interface

a08f969

Merge remote-tracking branch 'origin/main' into test_rejection

d4ff750

reporter

815a53e

only report upon non delete

83f51c0

spotless

d4400ac

npe in tests

cad3270

test passing

12a7dc7

Merge remote-tracking branch 'pgomulka/test_rejection' into test_rejection

7eb355a

java doc

5f7d0a2

spotlessAPply

a1261b4

spotlessAPply

5411d7c

rename

1349020

Merge remote-tracking branch 'origin/main' into test_rejection

6b53252

pgomulka commented Jan 31, 2024


pgomulka added 3 commits February 1, 2024 10:26

method renames

223ee7b

Merge remote-tracking branch 'origin/main' into test_rejection

abecfe7

Merge remote-tracking branch 'origin/main' into test_rejection

d45c9de

pgomulka added the :Core/Infra/Core (Core issues without another label) and >enhancement labels and removed the WIP label Feb 1, 2024

pgomulka changed the title ~~[DRAFT] ES - document observing with rejections~~ ES - document observing with rejections Feb 1, 2024

Update docs/changelog/104859.yaml

1f9b35c

pgomulka added the :Core/Infra/Metrics label (Metrics and metering infrastructure) and removed the :Core/Infra/Core label (Core issues without another label) Feb 1, 2024

pgomulka changed the title ~~ES - document observing with rejections~~ Document parsing observer to not report upon failures Feb 1, 2024

pgomulka changed the title ~~Document parsing observer to not report upon failures~~ DocumentParsingObserver infrastructure to allow not reporting upon failures Feb 1, 2024

pgomulka self-assigned this Feb 1, 2024

pgomulka marked this pull request as ready for review February 1, 2024 14:10

elasticsearchmachine added the Team:Core/Infra label (Meta label for core/infra team) Feb 1, 2024

pgomulka requested review from rjernst and stu-elastic February 1, 2024 14:58

pgomulka added 2 commits February 2, 2024 10:00

renames

bc3ba6c

Merge remote-tracking branch 'pgomulka/test_rejection' into test_rejection

a9cc740

pgomulka mentioned this pull request Feb 2, 2024

Infrastructure for metering the update requests #105063

Merged

rjernst reviewed Feb 6, 2024


pgomulka added 2 commits February 6, 2024 15:56

code review follow up- renames

fc101a8

Merge remote-tracking branch 'origin/main' into test_rejection

596a009

stu-elastic reviewed Feb 6, 2024


remove todos

d4b9134

pgomulka requested a review from rjernst February 6, 2024 18:16

rjernst approved these changes Feb 9, 2024


renames, code review follow up

cdf9a8a

pgomulka changed the title ~~DocumentParsingObserver infrastructure to allow not reporting upon failures~~ DocumentSizeObserver infrastructure to allow not reporting upon failures Feb 12, 2024

pgomulka added 5 commits February 12, 2024 12:36

Merge remote-tracking branch 'origin/main' into test_rejection

c997ea8

move the 'if !=-1' to server

f70389e

Merge remote-tracking branch 'origin/main' into test_rejection

6987610

vlong to zlong due to -1 value

913f80a

remove unused field

aa50c77

pgomulka merged commit 11f3c29 into elastic:main Feb 12, 2024
14 checks passed

pgomulka commented Jan 29, 2024 (edited)
pgomulka Jan 31, 2024
joegallo Jan 31, 2024
pgomulka Feb 1, 2024
elasticsearchmachine commented Feb 1, 2024
elasticsearchmachine commented Feb 1, 2024
rjernst commented Feb 1, 2024
pgomulka commented Feb 2, 2024
pgomulka commented Feb 2, 2024 (edited)
rjernst left a comment
rjernst Feb 6, 2024
rjernst Feb 6, 2024
rjernst Feb 6, 2024
rjernst Feb 6, 2024
rjernst Feb 6, 2024
pgomulka Feb 6, 2024
stu-elastic Feb 6, 2024
pgomulka Feb 6, 2024
rjernst left a comment
rjernst Feb 9, 2024
rjernst Feb 9, 2024
rjernst Feb 9, 2024
rjernst Feb 9, 2024
pgomulka Feb 12, 2024 (edited)
pgomulka Feb 12, 2024
