Stringify Objects for Tokenizing #17918
Comments
Not sure how well this would play with source filtering, etc. Wondering if it wouldn't be a better idea to implement this as a JSON stringify processor in the ingest node?
First I've heard of these. Got a pointer to some documentation?
I've got serious reservations about the whole ingest node plan. Anything not defined in the database mappings or settings has the potential to be applied inconsistently. Otherwise, there's no conflict between these two features.
@ralphlevan this feels like a transformation that should happen before the analysis phase, which means that we're unlikely to accept a PR that targets analysis |
I agree. This change happens during document parsing, not analysis. The code is complete and working as I hoped. The document source is unchanged and the complex object that was provided is returned unchanged. But downstream tokenizers get passed the complete object as a string. Copy_to passes the stringified object to the new fields. The only code touched was
Here's the demonstration test:
Hmmm... I'm afraid I don't like it. I don't like that
I understand. There is clearly a philosophical tension within Elasticsearch about whether it should be used as a primary data repository or only as a discovery engine for data stored elsewhere. If primary data support was not intended, then why support arbitrarily complex data? A comma-separated list of data items would be much simpler to handle. But if complex objects can be ingested, why can't that complexity be used in the indexing?

I lean heavily toward using the database as a primary repository. The customer has complex data and does not want to mangle/unmangle that data every time they touch it, just to simplify the job of the indexer. The code I am submitting supports that: customers can provide complex data and get that complex data back unmodified, but the indexer gets access to more than simple atomic data.

I have a philosophical problem with the processing of data not being centralized. The ingest node plan assumes voluntary compliance by the client software submitting data. Any plan to mangle data externally to the database has the potential to support multiple paths to the database and inconsistent processing of the data. I feel strongly that such processing needs to be centralized, and the simplest center for that processing is the database. That's what I have provided by having all the processing specified in the database mapping.
How is coercing an object to a string any different than coercing a string to a number? The mapping says it's an int, but the _source says it's a string? The explicit coerce parameter is what makes both of these legal.
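For context, the existing coerce behavior on numeric fields works roughly like this; a minimal mapping sketch (index and field names here are hypothetical, and the syntax shown is the modern typeless form rather than the 2.x form discussed in this thread):

```json
PUT my-index
{
  "mappings": {
    "properties": {
      "count": { "type": "integer", "coerce": true }
    }
  }
}
```

With coerce enabled (the default for numeric fields), a document containing "count": "5" is indexed as the integer 5 while _source keeps the original string, which is the precedent the comment above is appealing to.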
We have been discussing it on FixitFriday and decided not to implement this feature. It should be done on the client side.
@jpountz, can you point me at the record of that discussion? I'm disappointed I wasn't offered an opportunity to participate.
@ralphlevan sorry I missed your message. It was an internal discussion whose main arguments can be read at #19691 (comment) if you are interested.
I think the new ingest node feature will allow me to construct the fields I need for indexing. Thanks! Ralph
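As a sketch of that approach: the library sort-key case mentioned later in this thread could be handled by an ingest pipeline whose script processor derives the sort field at index time, keeping the logic centralized on the server rather than in client code. The pipeline name and field names below are hypothetical:

```json
PUT _ingest/pipeline/title-sort
{
  "description": "Derive a title sort key by stripping nonfiling leading characters",
  "processors": [
    {
      "script": {
        "source": "ctx.title_sort = ctx.title.value.substring(ctx.title.nonfiling)"
      }
    }
  ]
}
```

A document indexed with ?pipeline=title-sort whose title field is an object like {"value": "The Hobbit", "nonfiling": 4} would then get a plain title_sort string that analyzers can tokenize normally, while _source keeps the complex object intact.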
Describe the feature:
There are cases where complex objects need to be passed to tokenizers. I am writing a new feature that will allow a mapping to specify that a property of type string may coerce an object within that property to a string.
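The mapping for the example discussed next is not preserved in this thread; a plausible reconstruction under the proposed syntax might look like the following (hypothetical — this feature was never merged, and "string" with an object-coercing coerce option reflects the 2.x-era proposal, not any released API):

```json
PUT library
{
  "mappings": {
    "properties": {
      "a": { "type": "string", "coerce": true }
    }
  }
}

PUT library/doc/1
{
  "a": { "b": "2", "c": "3" }
}
```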
In this example, the tokenizer will be passed the string {"b":"2","c":"3"}. It will be up to the tokenizer to decide how to treat it. The default tokenizer returns the tokens "b", "2", "c" and "3".
This situation occurs frequently in library data. Complex objects have been created in which the values of some properties affect how the other properties are treated. A simple example occurs with book titles: there is a piece of data that specifies how many leading characters should be stripped from the title to use it as a sort key.
So far, the changes have been restricted to StringFieldMapper and TokenCountFieldMapper. I am making the changes in version 2.1.0 and expect to merge them into all subsequent versions.
I expect to be able to use the copy_to parameter, but don't know exactly how that will work yet.