Stringify Objects for Tokenizing #17918

Closed
ralphlevan opened this issue Apr 21, 2016 · 14 comments
Labels
:Data Management/Ingest Node, discuss, :Search/Analysis

Comments

@ralphlevan

Describe the feature:
There are cases where complex objects need to be passed to tokenizers. I am writing a new feature that allows a mapping to specify that a property of type string should coerce any object found in that property to a string.

PUT /test
{
  "mappings": {
     "test": {
        "properties": {
           "a": {
              "type": "string",
              "coerce": true
           }
        }
     }
  }
}
PUT /test/test/1
{
    "a": {
        "b":"2",
        "c":"3"
    }
}

In this example, the tokenizer will be passed the string {"b":"2","c":"3"}. It is up to the tokenizer to decide how to treat that; the default tokenizer produces the tokens b, 2, c, and 3.
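You can check what the default analyzer does with that string via the _analyze API (a sketch against the 2.x request format; the real response also carries offsets and positions, omitted here):

POST /_analyze
{
  "analyzer": "standard",
  "text": "{\"b\":\"2\",\"c\":\"3\"}"
}

The standard analyzer splits on the braces, quotes, colons, and commas, leaving exactly those four tokens.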

This situation occurs frequently in library data. Complex objects have been created in which the values of some properties affect how the other properties are treated. A simple example occurs with book titles: there is a piece of data that specifies how many leading characters should be stripped from the title when using it as a sort key.
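For example, a title record might look like the following (the field names here are hypothetical, not from any actual schema):

{
  "title": {
    "value": "The Sound and the Fury",
    "nonfiling": 4
  }
}

Here nonfiling says to skip the leading four characters ("The ") when building the sort key, so the title files under "Sound".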

So far, the changes have been restricted to StringFieldMapper and TokenCountFieldMapper. I am making the changes in version 2.1.0 and expect to merge them into all subsequent versions.

I expect to be able to use the copy_to parameter, but don't know exactly how that will work yet.

clintongormley added the discuss and :Search/Analysis labels Apr 25, 2016
@clintongormley

Not sure how well this would play with source filtering, etc. Wondering if it wouldn't be a better idea to implement this as a JSON stringify processor in the ingest node?
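There is no built-in stringify processor, but as a rough sketch of that idea against a later ingest API (the pipeline name is made up, and Json.dump assumes a Painless version that provides it):

PUT /_ingest/pipeline/stringify_b
{
  "description": "Serialize the object in field b into the string field copyOfB",
  "processors": [
    {
      "script": {
        "source": "ctx.copyOfB = Json.dump(ctx.b);"
      }
    }
  ]
}

Documents indexed through this pipeline keep b as an object in _source, while copyOfB carries the JSON string for analysis.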

clintongormley added the :Data Management/Ingest Node label Apr 25, 2016
@ralphlevan (Author)

First I've heard of these. Got a pointer to some documentation?

@clintongormley

@ralphlevan (Author)

I've got serious reservations about the whole ingest node plan. Anything not defined in the database mappings or settings has the potential to be applied inconsistently.

Otherwise, there's no conflict between these two features.

@clintongormley

@ralphlevan this feels like a transformation that should happen before the analysis phase, which means that we're unlikely to accept a PR that targets analysis.

@ralphlevan (Author)

ralphlevan commented Apr 28, 2016

I agree. This change happens during document parsing, not analysis.

The code is complete and working as I hoped. The document source is unchanged, and the complex object that was provided is returned unchanged. But downstream tokenizers get passed the complete object as a string, and copy_to passes the stringified object to the new fields.

The only classes touched were DocumentParser, StringFieldMapper, TokenCountFieldMapper, and JsonXContentParser. (It turns out the DocumentParser changes were just traces I added to figure out what was going on; I've reverted those.)

@ralphlevan (Author)

Here's the demonstration test:

PUT /test
{
  "mappings": {
     "test": {
        "properties": {
           "b": {
              "type": "string",
              "coerce": true,
              "copy_to": "copyOfB"

           },
           "copyOfB": {
               "type": "string",
               "store": true
           }
        }
     }
  }
}
PUT /test/test/1
{
    "a": "z",
    "b": {
        "c":"2",
        "d":"3"
    },
    "e": "y"
}

Response:

{
   "_index": "test",
   "_type": "test",
   "_id": "1",
   "_version": 1,
   "_shards": {
      "total": 2,
      "successful": 1,
      "failed": 0
   },
   "created": true
}
GET /test/test/1

Response:

{
   "_index": "test",
   "_type": "test",
   "_id": "1",
   "_version": 1,
   "found": true,
   "_source": {
      "a": "z",
      "b": {
         "c": "2",
         "d": "3"
      },
      "e": "y"
   }
}
GET /test/test/1?fields=copyOfB

Response:

{
   "_index": "test",
   "_type": "test",
   "_id": "1",
   "_version": 1,
   "found": true,
   "fields": {
      "copyOfB": [
         "{\"c\":\"2\",\"d\":\"3\"}"
      ]
   }
}

@clintongormley

Hmmm... I'm afraid I don't like it. I don't like that b is mapped as a string but the _source contains an object. I'd much prefer this to be an explicit change happening in an ingest processor, but I'm happy to leave this open for further discussion.

@ralphlevan (Author)

I understand.

There is clearly a philosophical tension within Elasticsearch about whether it should be used as a primary data repository or only as a discovery engine for data stored elsewhere. If primary data support were not intended, then why support arbitrarily complex data? A comma-separated list of data items would be much simpler to handle. But if complex objects can be ingested, why can't that complexity be used in the indexing?

I lean heavily toward using the database as a primary repository. The customer has complex data and does not want to mangle/unmangle that data every time they touch it, just to simplify the job of the indexer. The code I am submitting supports that: customers can provide complex data and get it back unmodified, while the indexer gets access to more than simple atomic data.

I have a philosophical problem with the processing of data not being centralized. The ingest node plan assumes voluntary compliance by the client software submitting data. Any plan to mangle data externally to the database opens up multiple paths into the database, and with them the potential for inconsistent processing of the data. I feel strongly that such processing needs to be centralized, and the simplest center for it is the database. That's what I have provided by having all the processing specified in the database mapping.

@ralphlevan (Author)

How is coercing an object to a string any different from coercing a string to a number? The mapping says it's an int, but the _source says it's a string. The explicit coerce parameter is what makes both of these legal.
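For comparison, this is the existing numeric coercion (index and field names here are illustrative):

PUT /numbers
{
  "mappings": {
    "test": {
      "properties": {
        "count": {
          "type": "integer",
          "coerce": true
        }
      }
    }
  }
}
PUT /numbers/test/1
{
  "count": "5"
}

The count field is indexed as the integer 5, while _source still returns the string "5".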

@jpountz (Contributor)

jpountz commented May 6, 2016

We have been discussing it on FixitFriday and decided not to implement this feature. It should be done on the client side.

jpountz closed this as completed May 6, 2016
@ralphlevan (Author)

@jpountz , can you point me at the record of that discussion? I'm disappointed I wasn't offered an opportunity to participate.

@jpountz (Contributor)

jpountz commented Aug 11, 2016

@ralphlevan sorry I missed your message. It was an internal discussion whose main arguments can be read at #19691 (comment) if you are interested.

@ralphlevan (Author)

I think the new ingest node feature will allow me to construct the fields I need for indexing.

Thanks!

Ralph

