Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add dotexpander processor #20078

Merged

Conversation

martijnvg
Copy link
Member

@martijnvg martijnvg commented Aug 19, 2016

Adds a processor the turns fields with dots into object fields, so that other processors can clean the data up before indexing.

@martijnvg martijnvg added >bug review :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP v5.0.0-beta1 labels Aug 19, 2016
@talevy
Copy link
Contributor

talevy commented Aug 22, 2016

LGTM

UPDATED

@rjernst
Copy link
Member

rjernst commented Aug 22, 2016

I'm confused on the purpose of this feature? Dots in field names don't matter in master. They all get read the same way, as field name separators. So to document parsing, this:

{
  "foo" : {
      "bar.baz" : "value"
   }
}

will be no different than:

{
  "foo" : {
      "bar": {
          "baz" : "value"
      }
   }
}

@talevy
Copy link
Contributor

talevy commented Aug 22, 2016

@rjernst

This is for pre-indexing, we have our own fieldname resolution and do not have such a differentiator without this PR. am I missing something?

@rjernst
Copy link
Member

rjernst commented Aug 22, 2016

@talevy There is no reason to structure the fields with dots in the names, they are not treated any different by document parsing than if the structure was there as in my example.

@talevy
Copy link
Contributor

talevy commented Aug 22, 2016

@rjernst, I agree that there is no reason, but users may have some other reasons for having them be that way and it makes sense for us to have a way to support it; no?

@rjernst
Copy link
Member

rjernst commented Aug 22, 2016

but users may have some other reasons for having them be that way

IMO this is untenable long term. The point of ingest is to get the data setup to be searchable in a particular way. What the source looks like should not matter, and continuing to add things which allow users to rely on the structure of _source prevent improvements like #9034.

@talevy
Copy link
Contributor

talevy commented Aug 22, 2016

I kind of view it the other way. I view this as enabling users to convert between untenable structures into better ones where they can use ingest to move their legacy sources away from the dots-in-fields structure.

That being said, responsibility for resolving this on the client end seems acceptable for me as well.

@talevy
Copy link
Contributor

talevy commented Aug 22, 2016

after speaking with @rjernst offline, I think we should pre-process the document beforehand to convert objects of this type:

{ "foo.bar": "baz" } 

into objects of this type:

{ "foo": { "bar" : "baz" } }

Since the object mappers do this internally while indexing anyways... it makes sense for Ingest to do it upfront. This will allow us to freely describe fields with dots without the need for escaping

thanks @rjernst for the chime, I think this is a friendlier solution since this is the default behavior in core anyways

@martijnvg
Copy link
Member Author

I like idea to convert fields with dots into object fields.

I just wonder what the behaviour should if a document being processed has both object and with field names with dots (very rare case). Something like this:

{
  "foo.bar" : "value1",
   "foo" : {
      "bar" : "value2"
    }
}

Should we then just turn the bar field under foo field into an array field holding both values?

@clintongormley
Copy link

I'm not sure I agree with automatically converting dots to objects here, for exactly the reason @martijnvg has pointed out: you need to deal with conflicts. The ingest processor is the point at which you can take your raw messy documents and transform them into something more useful. In other words, you can take a doc like { "foo": 5, "foo.bar": 10 } and rewrite it into something that can be indexed into Elasticsearch.

Instead of transforming dots to objects automatically, i would provide this as function as a processor. Instead of adding support for escaping of dots, I would recommend using the script processor (which already has syntax for accessing fields with dots). This would allow the user to fix their document as they see fit, after which they can apply the dots-to-objects processor.

@martijnvg martijnvg force-pushed the ingest_handle_dots_in_field_names branch from 3bbd81f to 9e84927 Compare August 29, 2016 15:43
@martijnvg martijnvg changed the title Add support for dots in fieldnames when referring to fields in processors Add dedot processor Aug 29, 2016
@martijnvg
Copy link
Member Author

@talevy @clintongormley I've updated the PR to add a dedot processor instead of adding general support for dots in field names.

@martijnvg martijnvg removed the discuss label Aug 29, 2016
Otherwise these fields can't accessed by any processor.

[[uppercase-options]]
.Uppercase Options

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be Dedot Options?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

happens to me all the time

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my shameless copy pasting...

@clintongormley
Copy link

thanks @martijnvg - what happens when there are conflicting fields? Worth adding to the docs?

@clintongormley
Copy link

I'm also wondering if this should be called dedot given that its behaviour is very different from the logstash version. What about dot_expander?

@talevy
Copy link
Contributor

talevy commented Aug 29, 2016

I agree with @clintongormley dedot is rather confusing. especially since it used to have a different meaning.

@martijnvg
Copy link
Member Author

what happens when there are conflicting fields?

It turns it into an array and you need to deal with it (even when types are conflicting, and if that isn't resolved then when serializing to json or in ES it will scream).

What about dot_expander?

+1

@clintongormley
Copy link

It turns it into an array and you need to deal with it (even when types are conflicting, and if that isn't resolved then when serializing to json or in ES it will scream).

What about this document?

{
  "foo": 5,
  "foo.bar": 10
}

@martijnvg
Copy link
Member Author

@clintongormley So this would need to be done in two steps, first rename foo to foo.bar and then dot expend foo.bar (field with the actual dot) field. This way the actual implementation remains simple for this processor.

@martijnvg martijnvg force-pushed the ingest_handle_dots_in_field_names branch 2 times, most recently from 9ee5d27 to b87dbc9 Compare August 30, 2016 07:53
@clintongormley
Copy link

@martijnvg what i mean is: what would this processor do if it encountered this document? Throw an exception? What does that exception look like? And should we add this info to the docs?

@martijnvg
Copy link
Member Author

The error that is now thrown is not clear (java.lang.IllegalArgumentException: cannot set [foo] with parent object of type [java.lang.String] as part of path [foo.bar]), so I can try to improve that. I did update the docs, explaining to use the rename processor in this situation.

@martijnvg martijnvg changed the title Add dedot processor Add dotexpander processor Aug 31, 2016
@martijnvg martijnvg force-pushed the ingest_handle_dots_in_field_names branch from b87dbc9 to 37fd8f9 Compare August 31, 2016 08:45

Expands a field with dots into an object field. This processor allows fields
with dots in the name to be accessible by other processors in the pipeline.
Otherwise these fields can't accessed by any processor.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing a be

Otherwise these fields can't be accessed by any processor.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@martijnvg martijnvg force-pushed the ingest_handle_dots_in_field_names branch from 37fd8f9 to ced55e9 Compare September 2, 2016 21:35
@talevy
Copy link
Contributor

talevy commented Sep 2, 2016

LGTM

import org.elasticsearch.ingest.Processor;

import java.util.Map;
import java.util.regex.Pattern;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unused import

@martijnvg martijnvg force-pushed the ingest_handle_dots_in_field_names branch from ced55e9 to ee4c485 Compare September 5, 2016 05:27
@martijnvg martijnvg force-pushed the ingest_handle_dots_in_field_names branch from ee4c485 to 6f6d17d Compare September 5, 2016 05:28
@martijnvg martijnvg merged commit 6f6d17d into elastic:master Sep 5, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP v5.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants