
Support dotted field notations in the reroute processor #96243

Merged

Conversation

felixbarny
Member

@graphaelli reported an issue when using the reroute processor on APM data. The cause is that APM Server sets data_stream.dataset as a dotted field name rather than in object notation. After routing, the document contains both the dotted and the nested field with differing values, which creates a conflict for the constant_keyword mapping of data_stream.dataset.

IMO, IngestDocument#setValue and IngestDocument#getValue should always consider both dotted and nested notation, but that's a much bigger change that I don't want to conflate with this enhancement, which is borderline a bug fix.
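To illustrate (with hypothetical dataset values), a rough simulate request showing the shape of input that triggers the conflict — the processor cannot see the dotted data_stream.dataset key, so the nested value it writes during routing can end up differing from the dotted one:

```
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      { "reroute": {} }
    ]
  },
  "docs": [
    {
      "_index": "logs-generic-default",
      "_source": {
        "data_stream.dataset": "apm.app.myservice",
        "data_stream": { "type": "logs", "namespace": "default" },
        "message": "hello"
      }
    }
  ]
}
```

Before this change, the nested data_stream.dataset written by the processor falls back to a default while the dotted key keeps its original value, so the indexed document ends up carrying two different values for the same constant_keyword field.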

@felixbarny felixbarny added >enhancement :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP labels May 22, 2023
@felixbarny felixbarny requested a review from jbaiera May 22, 2023 08:38
@felixbarny felixbarny self-assigned this May 22, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine elasticsearchmachine added Team:Data Management Meta label for data/management team v8.9.0 external-contributor Pull request authored by a developer outside the Elasticsearch team labels May 22, 2023
@elasticsearchmachine
Collaborator

Hi @felixbarny, I've created a changelog YAML for you.

@felixbarny felixbarny requested a review from eyalkoren May 22, 2023 09:10
@felixbarny felixbarny added >bug v8.8.1 auto-backport-and-merge Automatically create backport pull requests and merge when ready and removed >enhancement labels May 22, 2023
@elasticsearchmachine
Collaborator

Hi @felixbarny, I've updated the changelog YAML for you.

@jbaiera
Member

jbaiera commented May 24, 2023

So, I've taken a look through the code, and I'm not entirely sure we should go this route by default. As it stands, the IngestDocument.ctxMap is only used in scripting. All other interactions with it are done through the IngestDocument get and set value methods.

IMO, IngestDocument#setValue and IngestDocument#getValue should always consider both dotted and nested notation, but that's a much bigger change that I don't want to conflate with this enhancement, which is borderline a bug fix.

Fully agree that this is way out of scope, but even going down that path opens up a lot of corner cases for how ingest node should handle dotted field names. For instance, if we use a set processor to read from a dotted field in order to write it somewhere else, does the destination get a dotted field name or object notation? If a dotted field already exists, do we overwrite it with the object notation or vice versa?
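To make the corner case concrete, a hypothetical set processor that copies from a dotted key (field names made up for illustration):

```
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      { "set": { "field": "event.dataset", "copy_from": "data_stream.dataset" } }
    ]
  },
  "docs": [
    { "_source": { "data_stream.dataset": "nginx.access" } }
  ]
}
```

Today the copy_from path is resolved as a nested object path, so the purely dotted key is not visible at all; if we start considering dotted names, we then have to decide whether event.dataset gets written as a dotted key or as nested json, and which notation wins when both already exist.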

Is there any way to make use of the dot expander in most cases before running the reroute processor? This would resolve the issue without having to wrestle with the above design concerns.
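For concreteness, something along these lines (hypothetical pipeline name; the exact dot_expander options would need checking):

```
PUT _ingest/pipeline/logs-routing
{
  "processors": [
    { "dot_expander": { "field": "data_stream.dataset", "ignore_failure": true } },
    { "dot_expander": { "field": "data_stream.namespace", "ignore_failure": true } },
    { "dot_expander": { "field": "data_stream.type", "ignore_failure": true } },
    { "reroute": {} }
  ]
}
```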

@felixbarny
Member Author

Is there any way to make use of the dot expander in most cases before running the reroute processor? This would resolve the issue without having to wrestle with the above design concerns.

I don't like the implicit requirement of having to use the dot expander processor in order to be able to use the reroute processor. We could potentially de-dot data_stream.* by default, which is effectively what this PR is doing.

@jbaiera
Member

jbaiera commented May 24, 2023

I don't like the implicit requirement of having to use the dot expander processor in order to be able to use the reroute processor.

For what it is worth, none of the existing processors are equipped to handle dotted field names other than the dot expander. Dotted field names are not visible in ingest. I don't think there is consensus to be found on how best to fix ingest so that dotted fields are considered for set and get field operations.

Accepting dotted field names when parsing data into Lucene is fine because documents are treated as a loose bag of terms that never need to be rematerialized. Picking between nested json and dotted field names when writing a value to a field in ingest is more complicated because it changes the source. Making assumptions about what the user wants/is ok with is what leads to usability bugs.

If you have a pipeline that works on documents with dotted fields, using a dot expander makes it clear that dotted field names will be normalized into nested json and visible to subsequent processors. Between assertions in the ingest processors and the simulate features we offer, I think we're more likely to find issues earlier on in the ingest process with this approach than down the path in the document parsing code at the shard layer.

This feels like accepting technical debt, but I'll mark this as team discuss for the DM team in case others disagree with that assessment.

@jbaiera
Member

jbaiera commented May 24, 2023

Linking to a Hadoop issue because I view Ingest Node and Hadoop in a similar light: both are expected to read and write source data in ways that the rest of Elasticsearch does not concern itself with.

elastic/elasticsearch-hadoop#853

@ruflin
Member

ruflin commented May 25, 2023

Using the dedot processor by default seems to go against where we eventually want to land, which is that everything is flattened: #88934

For what it is worth, none of the existing processors are equipped to handle dotted field names other than the dot expander.

I think we all agree that this is not great and needs to be fixed eventually. It is a problem that has hit us several times in the past. But I see the case of the reroute processor slightly differently: it has a default behaviour on certain fields, and these are special fields targeted at the data stream naming scheme. I consider it acceptable to have special logic for these fields. This becomes especially important as even our own products use these fields differently, and we should not put the burden on the user to figure this out.

@felixbarny
Member Author

As it stands, the IngestDocument.ctxMap is only used in scripting. All other interactions with it are done through the IngestDocument get and set value methods.

Except for the dot expander processor, which calls IngestDocument.getSourceAndMetadata which returns the same ctxMap object as IngestDocument.ctxMap.

Picking between nested json and dotted field names when writing a value to a field in ingest is more complicated because it changes the source. Making assumptions about what the user wants/is ok with is what leads to usability bugs.

Compared to the general problem of "solving" dots in field names in ingest pipelines, the advantage that we have in the context of the reroute processor is that we're not working with completely generic data. We're dealing with very specific, well-defined ECS fields:

  • data_stream.type
  • data_stream.dataset
  • event.dataset
  • data_stream.namespace

These are defined as objects in ECS and we're already modifying these properties with the reroute processor, so they're not really "sacred" properties that we need to be careful of modifying. It's the reroute processor's job to modify these attributes so that they're in sync with the index name.

I don't think it's helpful to insist that incoming documents use nested field notation for the reroute processor to work. People use dots and nested notation interchangeably, as ES doesn't differentiate between the notations once the document is stored. Our solutions generally prefer retrieving information via fields rather than _source, in line with the general recommendation, and with synthetic _source you can't rely too much on the exact structure of the source anyway.

Rejecting documents with dotted field names for data_stream.* would be completely against the spirit of accepting all logs. Therefore, I think the only path we have is to accept reading from the dotted field name equivalents and to ensure that the output of the reroute processor is a document that never contains both a nested and a dotted notation of data_stream.*. Otherwise, this triggers a constant_keyword collision that is very hard for users to understand.
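In other words, the contract I'm aiming for (values hypothetical, and whether the surviving spelling is dotted or nested is an implementation detail) is that the document leaving the reroute processor carries a single, consistent value:

```
{
  "_index": "logs-nginx.access-default",
  "_source": {
    "data_stream": {
      "type": "logs",
      "dataset": "nginx.access",
      "namespace": "default"
    }
  }
}
```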

@jbaiera
Member

jbaiera commented May 26, 2023

We discussed this in our most recent team meeting and there was general discomfort from the engineers present about dotted field name usage in ingest and around fixing this for just the reroute processor.

Using the dedot processor by default seems to go against where we eventually want to land, which is that everything is flattened: #88934

Based on this direction, I think it's likely that this will continue to be a problem for other processors. If the plan is to put pipeline logic in the hands of users so that they can manage their own logs parsing logic, and to send dotted field names mixed with nested json, then we are setting them up for frustration if ingest does not handle dotted field names gracefully. We would like to have a plan in place on how to support dotted field names across all processors, especially if we are moving in the direction of all logs mappings being flattened, as this problem will only continue to pop up.

Rejecting documents with dotted field names for data_stream.* would be completely against the spirit of accepting all logs.

Accepting all logs is absolutely the strategy we want to follow for this feature. The suggestion of using the dot expander earlier in the pipeline would allow accepting these logs without making special changes for this one case, and hopefully would guard against further problems in other processors. We've discussed having JSON parsing by default for Logs+. We're setting default timestamps on documents based on their arrival time. I was hoping that expanding dotted fields would be a feature worth considering in the same light. Doing so would buy us flexibility in how we approach the greater problem of dotted field names in ingest.

Plus, if we move forward with the change toward being lenient around accepting sub-objects when they are disabled (as described in #88934), then I'm not sure I understand how the dot expander conflicts with the accept all logs objective. Those nested fields would just be flattened at index time.
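As a side note, a minimal sketch of what that leniency looks like (hypothetical index name, not something this PR introduces): with subobjects disabled, dots in field names are simply treated as part of the leaf field name at mapping time.

```
PUT my-index
{
  "mappings": {
    "subobjects": false,
    "properties": {
      "data_stream.dataset": { "type": "constant_keyword" }
    }
  }
}
```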

Let's see if we can make some time soon to discuss this further. Dotted field names are an incredibly messy feature that we support and we want to make sure we're going in the right direction here, not just introducing quick fixes as they arise.

@felixbarny felixbarny left a comment
Member Author

We would like to have a plan in place on how to support dotted field names across all processors, especially if we are moving in the direction of all logs mappings being flattened, as this problem will only continue to pop up.

I do agree that this problem is bigger than just the reroute processor and ideally, we'd have a consistent approach for all processors.

However, the changes in this PR don't necessarily conflict with that goal. I think we can make progress on this specific issue and the general issue independently. The way the changes are made in this PR doesn't conflict with any general changes to support dotted field notations in ingest processors, and it is aligned with how we handle dots in field names in the fields API as well as in Mustache templates.

@felixbarny
Member Author

felixbarny commented May 31, 2023

Plus, if we move forward with the change toward being lenient around accepting sub-objects when they are disabled (as described in #88934), then I'm not sure I understand how the dot expander conflicts with the accept all logs objective. Those nested fields would just be flattened at index time.

Using the dot expander on all incoming data conflicts with one of the main motivations for using subobjects: false: we want to fully accept documents that would currently be considered to have an object/scalar mismatch, for example documents that contain both the fields foo and foo.bar. Applying the dot expander to such a document results in an error.
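For example (field names made up), simulating the dot expander against such a document fails rather than producing a usable result:

```
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      { "dot_expander": { "field": "foo.bar" } }
    ]
  },
  "docs": [
    { "_source": { "foo": "some-scalar", "foo.bar": "some-value" } }
  ]
}
```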

@bpintea bpintea added v8.8.2 and removed v8.8.1 labels Jun 6, 2023
@jbaiera jbaiera left a comment
Member

Based on discussions (both here and externally), as well as how things are heading for other parts of the ingest code with regard to dotted field names (especially the work on the fields API in Painless), I think we'd be good to get this in at this point.

LGTM, thanks for the discussion and iterations!

@felixbarny felixbarny merged commit 2e81bcc into elastic:main Jun 6, 2023
12 checks passed
@elasticsearchmachine
Collaborator

💚 Backport successful

Branch 8.8 — backport successful

@zhengminhz

Caused by: org.elasticsearch.hadoop.EsHadoopIllegalStateException: Found field 'connect2.SSLhandshake'. Fields containing dots ('.') are not supported in es-hadoop
    at org.elasticsearch.spark.sql.RowValueReader.addToBuffer(RowValueReader.scala:62)
    at org.elasticsearch.spark.sql.RowValueReader.addToBuffer$(RowValueReader.scala:55)
    at org.elasticsearch.spark.sql.ScalaRowValueReader.addToBuffer(ScalaEsRowValueReader.scala:32)
    at org.elasticsearch.spark.sql.ScalaRowValueReader.addToMap(ScalaEsRowValueReader.scala:118)
    at org.elasticsearch.hadoop.serialization.ScrollReader.map(ScrollReader.java:1058)
    at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:895)
    at org.elasticsearch.hadoop.serialization.ScrollReader.readHitAsMap(ScrollReader.java:608)
    at org.elasticsearch.hadoop.serialization.ScrollReader.readHit(ScrollReader.java:432)
    ... 44 more
@felixbarny I still get this error when using version 8.8.2.

@felixbarny
Member Author

Fields containing dots ('.') are not supported in es-hadoop

This PR did not add support for dots in es-hadoop. It only added support for dotted field names in the reroute processor.

There's another discussion about adding support for dots in more processors:

Yet another issue discusses adding support for dots in es-hadoop:
