
Indexing Arbitrary Key/Value Data #7029

Description

@kamaci

I can index data like this with Druid:

{"ts":"2018-01-01T02:35:45Z","appToken":"guid", "eventName":"app-open", "key1":"value1"}
via this configuration:

"parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "json",
          "timestampSpec" : {
            "format" : "iso",
            "column" : "ts"
          },
          "dimensionsSpec" : {
            "dimensions": [
              "appToken",
              "eventName",
              "key1"
            ]
          }
        }
      }

However, I also want to index data like this:

{
  "ts":"2018-01-01T03:35:45Z",
  "appToken":"guid",
  "eventName":"app-open",
  "properties":[{"randomKey1":"randomValue1"}, {"randomKey2":"randomValue2"}]
}

where properties is an array whose members have arbitrary keys and values. JSON flattening (Flatten JSON) doesn't work in this case, i.e. for:

"world": [{"hey": "there"}, {"tree": "apple"}]

However, I don't know what the keys will be at indexing time. The documentation handles such data with a configuration like this:

...
{
  "type": "path",
  "name": "world-hey",
  "expr": "$.world[0].hey"
},
{
  "type": "path",
  "name": "worldtree",
  "expr": "$.world[1].tree"
}
...
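For reference, those path entries sit inside a flattenSpec within the parseSpec. A minimal sketch of that placement, following the layout in Druid's flattening documentation (the field names here just mirror the "world" example above):

"parseSpec" : {
  "format" : "json",
  "flattenSpec" : {
    "useFieldDiscovery" : true,
    "fields" : [
      {
        "type" : "path",
        "name" : "world-hey",
        "expr" : "$.world[0].hey"
      },
      {
        "type" : "path",
        "name" : "worldtree",
        "expr" : "$.world[1].tree"
      }
    ]
  },
  "timestampSpec" : {
    "format" : "iso",
    "column" : "ts"
  }
}

As noted above, every path expression still has to name its keys and array indices up front, which is exactly what arbitrary key/value data makes impossible.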

PS: I started a conversation about this task on the mailing list and then created an issue for it: Indexing Arbitrary Key/Value Data

Here is @gianm's reply to that mail thread:

There isn't currently an out of the box parser in Druid that can do what
you are describing. But it is an interesting feature to think about. Today
you could implement this using a custom parser (instead of using the
builtin json/avro/etc parsers, write an extension that implements an
InputRowParser, and you can do anything you want, including automatic
flattening of nested data).

In terms of how this might be done out of the box in the future I could
think of a few ideas.

  1. Have some way to define an "automatic flatten spec". Maybe something
    that systematically flattens in a particular way: in your example, perhaps
    it'd automatically create fields like "world.0.hey" and "world.1.tree".

  2. A repetition and definition level scheme similar to Parquet:
    https://blog.twitter.com/engineering/en_us/a/2013/dremel-made-simple-with-parquet.html.
    It sounds like this could be more natural and lend itself to better
    compression than (1).

  3. Create a new column type designed to store json-like data, although
    presumably in some more optimized form. Add some query-time functionality
    for extracting values from it. Use this for storing the original input
    data. This would only really make sense if you had rollup disabled. In this
    case, the idea would be that you would store an entire ingested object in
    this new kind of column, and extract some subset fields for faster access
    into traditional dimension and metric columns.
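For illustration, here is a minimal sketch of the automatic flattening described in idea (1), which a custom InputRowParser extension (as suggested above) could apply before building rows. This is an assumption-laden sketch, not a Druid API: the AutoFlatten class and its dotted-key convention are hypothetical.

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Druid: recursively flattens nested
// maps and lists into dotted keys, so {"world": [{"hey": "there"}]}
// becomes {"world.0.hey": "there"}, matching the naming in idea (1).
public class AutoFlatten
{
  public static Map<String, Object> flatten(Map<String, Object> nested)
  {
    Map<String, Object> flat = new LinkedHashMap<>();
    flattenInto("", nested, flat);
    return flat;
  }

  private static void flattenInto(String prefix, Object node, Map<String, Object> flat)
  {
    if (node instanceof Map) {
      for (Map.Entry<?, ?> e : ((Map<?, ?>) node).entrySet()) {
        String key = prefix.isEmpty() ? String.valueOf(e.getKey()) : prefix + "." + e.getKey();
        flattenInto(key, e.getValue(), flat);
      }
    } else if (node instanceof List) {
      List<?> list = (List<?>) node;
      for (int i = 0; i < list.size(); i++) {
        String key = prefix.isEmpty() ? String.valueOf(i) : prefix + "." + i;
        flattenInto(key, list.get(i), flat);
      }
    } else {
      flat.put(prefix, node); // leaf: string, number, boolean, or null
    }
  }
}

Applied to the properties example above, this would yield keys like properties.0.randomKey1 without anyone naming them in the spec; a parser extension could then hand the flattened map and the discovered key set to something like Druid's MapBasedInputRow.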
