Stringify Objects for Tokenizing #17918

Closed
ralphlevan opened this issue Apr 21, 2016 · 14 comments
Labels
:Data Management/Ingest Node, discuss, :Search/Analysis

Comments

@ralphlevan

Describe the feature:
There are cases where complex objects need to be passed to tokenizers. I am writing a new feature that allows a mapping to specify that a property of type string should coerce any object found in that property to a string.

PUT /test
{
  "mappings": {
     "test": {
        "properties": {
           "a": {
              "type": "string",
              "coerce": true
           }
        }
     }
  }
}
PUT /test/test/1
{
    "a": {
        "b":"2",
        "c":"3"
    }
}

In this example, the tokenizer will be passed the string {"b":"2","c":"3"}. It is up to the tokenizer to decide how to treat that; the default tokenizer produces the tokens b, 2, c, and 3.
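You can check what the default analyzer does with that string via the _analyze API (a sketch against the 2.x request format; the real response also carries offsets and positions, omitted here):

POST /_analyze
{
  "analyzer": "standard",
  "text": "{\"b\":\"2\",\"c\":\"3\"}"
}

The standard analyzer splits on the braces, quotes, colons, and commas, leaving exactly those four tokens.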

This situation occurs frequently in library data. Complex objects have been created in which the values of some properties affect how the other properties are treated. A simple example occurs with book titles: there is a piece of data that specifies how many leading characters should be stripped from the title when using it as a sort key.
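For example, a title record might look like the following (the field names here are hypothetical, not from any actual schema):

{
  "title": {
    "value": "The Sound and the Fury",
    "nonfiling": 4
  }
}

Here nonfiling says to skip the leading four characters ("The ") when building the sort key, so the title files under "Sound".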

So far, the changes have been restricted to StringFieldMapper and TokenCountFieldMapper. I am making the changes in version 2.1.0 and expect to merge them into all subsequent versions.

I expect to be able to use the copy_to parameter, but don't know exactly how that will work yet.

clintongormley added the discuss and :Search/Analysis labels Apr 25, 2016
@clintongormley

Not sure how well this would play with source filtering, etc. Wondering if it wouldn't be a better idea to implement this as a JSON stringify processor in the ingest node?
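There is no built-in stringify processor, but as a rough sketch of that idea against a later ingest API (the pipeline name is made up, and Json.dump assumes a Painless version that provides it):

PUT /_ingest/pipeline/stringify_b
{
  "description": "Serialize the object in field b into the string field copyOfB",
  "processors": [
    {
      "script": {
        "source": "ctx.copyOfB = Json.dump(ctx.b);"
      }
    }
  ]
}

Documents indexed through this pipeline keep b as an object in _source, while copyOfB carries the JSON string for analysis.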

clintongormley added the :Data Management/Ingest Node label Apr 25, 2016
@ralphlevan (Author)

First I've heard of these. Got a pointer to some documentation?

@clintongormley

@ralphlevan (Author)

I've got serious reservations about the whole ingest node plan. Anything not defined in the database mappings or settings has the potential to be applied inconsistently.

Otherwise, there's no conflict between these two features.

@clintongormley

@ralphlevan this feels like a transformation that should happen before the analysis phase, which means that we're unlikely to accept a PR that targets analysis.

@ralphlevan (Author)

ralphlevan commented Apr 28, 2016

I agree. This change happens during document parsing, not analysis.

The code is complete and working as I hoped. The document source is unchanged, and the complex object that was provided is returned unchanged. But downstream tokenizers get passed the complete object as a string, and copy_to passes the stringified object to the new fields.

The only classes touched were DocumentParser, StringFieldMapper, TokenCountFieldMapper, and JsonXContentParser. (It turns out the DocumentParser changes were just traces I added to figure out what was going on; I've reverted those.)

@ralphlevan (Author)

Here's the demonstration test:

PUT /test
{
  "mappings": {
     "test": {
        "properties": {
           "b": {
              "type": "string",
              "coerce": true,
              "copy_to": "copyOfB"

           },
           "copyOfB": {
               "type": "string",
               "store": true
           }
        }
     }
  }
}
PUT /test/test/1
{
    "a": "z",
    "b": {
        "c":"2",
        "d":"3"
    },
    "e": "y"
}

Response:

{
   "_index": "test",
   "_type": "test",
   "_id": "1",
   "_version": 1,
   "_shards": {
      "total": 2,
      "successful": 1,
      "failed": 0
   },
   "created": true
}
GET /test/test/1

Response:

{
   "_index": "test",
   "_type": "test",
   "_id": "1",
   "_version": 1,
   "found": true,
   "_source": {
      "a": "z",
      "b": {
         "c": "2",
         "d": "3"
      },
      "e": "y"
   }
}
GET /test/test/1?fields=copyOfB

Response:

{
   "_index": "test",
   "_type": "test",
   "_id": "1",
   "_version": 1,
   "found": true,
   "fields": {
      "copyOfB": [
         "{\"c\":\"2\",\"d\":\"3\"}"
      ]
   }
}

@clintongormley

Hmmm... I'm afraid I don't like it. I don't like that b is mapped as a string but the _source contains an object. I'd much prefer this to be an explicit change happening in an ingest processor, but I'm happy to leave this open for further discussion.

@ralphlevan (Author)

I understand.

There is clearly a philosophical tension within Elasticsearch about whether it should be used as a primary data repository or only as a discovery engine for data stored elsewhere. If primary data support were not intended, then why support arbitrarily complex data? A comma-separated list of data items would be much simpler to handle. But if complex objects can be ingested, why can't that complexity be used in the indexing?

I lean heavily toward using the database as a primary repository. The customer has complex data and does not want to mangle/unmangle that data every time they touch it, just to simplify the job of the indexer. The code I am submitting supports that: customers can provide complex data and get it back unmodified, while the indexer gets access to more than simple atomic data.

I have a philosophical problem with the processing of data not being centralized. The ingest node plan assumes voluntary compliance by the client software submitting data. Any plan to mangle data externally to the database opens up multiple paths into the database, and with them the potential for inconsistent processing of the data. I feel strongly that such processing needs to be centralized, and the simplest center for it is the database. That's what I have provided by having all the processing specified in the database mapping.

@ralphlevan (Author)

How is coercing an object to a string any different from coercing a string to a number? The mapping says it's an int, but the _source says it's a string. The explicit coerce parameter is what makes both of these legal.
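For comparison, this is the existing numeric coercion (index and field names here are illustrative):

PUT /numbers
{
  "mappings": {
    "test": {
      "properties": {
        "count": {
          "type": "integer",
          "coerce": true
        }
      }
    }
  }
}
PUT /numbers/test/1
{
  "count": "5"
}

The count field is indexed as the integer 5, while _source still returns the string "5".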

@jpountz (Contributor)

jpountz commented May 6, 2016

We have been discussing it on FixitFriday and decided not to implement this feature. It should be done on the client side.

jpountz closed this as completed May 6, 2016
@ralphlevan (Author)

@jpountz , can you point me at the record of that discussion? I'm disappointed I wasn't offered an opportunity to participate.

@jpountz (Contributor)

jpountz commented Aug 11, 2016

@ralphlevan sorry I missed your message. It was an internal discussion whose main arguments can be read at #19691 (comment) if you are interested.

@ralphlevan (Author)

I think the new ingest node feature will allow me to construct the fields I need for indexing.

Thanks!

Ralph

