
Indexing Arbitrary Key/Value Data #7029

Description

@kamaci

I can index data like this with Druid:

{"ts":"2018-01-01T02:35:45Z","appToken":"guid", "eventName":"app-open", "key1":"value1"}
via this configuration:

"parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "json",
          "timestampSpec" : {
            "format" : "iso",
            "column" : "ts"
          },
          "dimensionsSpec" : {
            "dimensions": [
              "appToken",
              "eventName",
              "key1"
            ]
          }
        }
      }

However, I also want to index data like this:

{
  "ts":"2018-01-01T03:35:45Z",
  "appToken":"guid",
  "eventName":"app-open",
  "properties":[{"randomKey1":"randomValue1"}, {"randomKey2":"randomValue2"}]
}

where properties is an array whose members have arbitrary keys and values. JSON flattening (Flatten JSON) doesn't work in this case, i.e. for:

"world": [{"hey": "there"}, {"tree": "apple"}]

However, I don't know what the keys will be at indexing time. The documentation handles such data with a configuration like this:

...
{
  "type": "path",
  "name": "world-hey",
  "expr": "$.world[0].hey"
},
{
  "type": "path",
  "name": "worldtree",
  "expr": "$.world[1].tree"
}
...
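For reference, those path entries sit inside a flattenSpec within the parseSpec. A minimal sketch of that placement, following the layout in Druid's flattening documentation (the field names here just mirror the "world" example above):

"parseSpec" : {
  "format" : "json",
  "flattenSpec" : {
    "useFieldDiscovery" : true,
    "fields" : [
      {
        "type" : "path",
        "name" : "world-hey",
        "expr" : "$.world[0].hey"
      },
      {
        "type" : "path",
        "name" : "worldtree",
        "expr" : "$.world[1].tree"
      }
    ]
  },
  "timestampSpec" : {
    "format" : "iso",
    "column" : "ts"
  }
}

As noted above, every path expression still has to name its keys and array indices up front, which is exactly what arbitrary key/value data makes impossible.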

PS: I started a conversation about this task on the mailing list and then created an issue for it: Indexing Arbitrary Key/Value Data

Here is @gianm's reply to that mail thread:

There isn't currently an out of the box parser in Druid that can do what
you are describing. But it is an interesting feature to think about. Today
you could implement this using a custom parser (instead of using the
builtin json/avro/etc parsers, write an extension that implements an
InputRowParser, and you can do anything you want, including automatic
flattening of nested data).

In terms of how this might be done out of the box in the future I could
think of a few ideas.

  1. Have some way to define an "automatic flatten spec". Maybe something
    that systematically flattens in a particular way: in your example, perhaps
    it'd automatically create fields like "world.0.hey" and "world.1.tree".

  2. A repetition and definition level scheme similar to Parquet:
    https://blog.twitter.com/engineering/en_us/a/2013/dremel-made-simple-with-parquet.html.
    It sounds like this could be more natural and lend itself to better
    compression than (1).

  3. Create a new column type designed to store json-like data, although
    presumably in some more optimized form. Add some query-time functionality
    for extracting values from it. Use this for storing the original input
    data. This would only really make sense if you had rollup disabled. In this
    case, the idea would be that you would store an entire ingested object in
    this new kind of column, and extract some subset fields for faster access
    into traditional dimension and metric columns.
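For illustration, here is a minimal sketch of the automatic flattening described in idea (1), which a custom InputRowParser extension (as suggested above) could apply before building rows. This is an assumption-laden sketch, not a Druid API: the AutoFlatten class and its dotted-key convention are hypothetical.

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Druid: recursively flattens nested
// maps and lists into dotted keys, so {"world": [{"hey": "there"}]}
// becomes {"world.0.hey": "there"}, matching the naming in idea (1).
public class AutoFlatten
{
  public static Map<String, Object> flatten(Map<String, Object> nested)
  {
    Map<String, Object> flat = new LinkedHashMap<>();
    flattenInto("", nested, flat);
    return flat;
  }

  private static void flattenInto(String prefix, Object node, Map<String, Object> flat)
  {
    if (node instanceof Map) {
      for (Map.Entry<?, ?> e : ((Map<?, ?>) node).entrySet()) {
        String key = prefix.isEmpty() ? String.valueOf(e.getKey()) : prefix + "." + e.getKey();
        flattenInto(key, e.getValue(), flat);
      }
    } else if (node instanceof List) {
      List<?> list = (List<?>) node;
      for (int i = 0; i < list.size(); i++) {
        String key = prefix.isEmpty() ? String.valueOf(i) : prefix + "." + i;
        flattenInto(key, list.get(i), flat);
      }
    } else {
      flat.put(prefix, node); // leaf: string, number, boolean, or null
    }
  }
}

Applied to the properties example above, this would yield keys like properties.0.randomKey1 without anyone naming them in the spec; a parser extension could then hand the flattened map and the discovered key set to something like Druid's MapBasedInputRow.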
