Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Query Time Lookups for Dimensions in Druid #211

Closed
sid88in opened this issue Mar 29, 2016 · 9 comments
Closed

Query Time Lookups for Dimensions in Druid #211

sid88in opened this issue Mar 29, 2016 · 9 comments
Labels
enhancement:request Enhancement request submitted by anyone from the community inactive Inactive for >= 30 days

Comments

@sid88in
Copy link
Contributor

sid88in commented Mar 29, 2016

This feature comes from http://druid.io/docs/latest/querying/lookups.html

Use Case:
Have dimension -> values mapping at query time. In real world we can have small set of mappings approx 10 - 20 or large mappings > 1000. When we visualize the data in Caravel, then the current version just shows the dimension values as integers if the data in druid is coming from fact table (metrics and dimensions). As we don't have joins in druid, we want query time lookups to get actual value of the dimension in charts and graphs. I have been able to hack the code and test this feature with map lookup feature. The query generator creates query in this form:

{
    "queryType": "groupBy",
    "dataSource": "XXX",
    "intervals": [
        "2016-03-16/2016-03-17"
    ],
    "granularity": "day",
    "dimensions": [
        {
            "type": "extraction",
            "dimension": "ABC",
            "outputName": "ABC",
            "extractionFn": {
                "type": "lookup",
                "lookup": {
                    "type": "map",
                    "map": {
                        "1": "sid",
                        "2": "gupta"
                    }
                },
                "retainMissingValue": true,
                "injective": false
            }
        }
    ],
    "aggregations": [
        {
            "type": "doubleSum",
            "fieldName": "XYZ",
            "name": "XYZ"
        },
        {
            "type": "longSum",
            "fieldName": "PQR",
            "name": "PQR"
        }
    ]
}

I want to discuss the design of this feature before coding it out and making it available for PR.
I am currently thinking of having 2 types of lookups: static lookup (small key value pairs) and dynamic lookups (jdbc lookups). For static lookup we can either have a function to preload the data from json mapping in the code or let users input key value mappings from the UI http://hostname:port/druiddatasourcemodelview/edit/2

Any thoughts? I still have to make jdbc lookups to work

@xrmx xrmx added enhancement:request Enhancement request submitted by anyone from the community druid labels Aug 9, 2016
@eirsep
Copy link

eirsep commented Oct 5, 2016

Will this feature be implemented anytime soon?

@zhaown
Copy link

zhaown commented Oct 6, 2016

+1, Looking forward to this too.

@rohitkochar
Copy link

+1
We also need this feature

@kkalyan
Copy link
Contributor

kkalyan commented Nov 1, 2016

@sid88in can you share your work in progress code ?

@sid88in
Copy link
Contributor Author

sid88in commented Nov 5, 2016

@kkalyan I moved from Yahoo to Glassdoor while working on Caravel :) Was able to hack the code to implement query time lookups in Caravel. Basically you will need to modify the query function (where the actually query gets formulated). I modified it to use map function and it worked. I'll talk to Maxime to see what is his thought on this and we may want to engineer it properly to be used by everyone.

@mistercrunch
Copy link
Member

mistercrunch commented Nov 6, 2016

I just added support for dimensionsSpec here #1545 which exposes ways to do specify mapping on a per-column basis:
http://druid.io/docs/latest/querying/dimensionspecs.html

To use it, you have to create a new calculated column on the datasource and paste your valid dimensionSpec expression in the text box.

You can define your mapping as a hard coded JSON object in our dimensionSpec, and it also looks like you can define "namespace" mappings that are stored on all the nodes on the Druid cluster and you can reference them in dimensionSpec.
http://druid.io/docs/latest/querying/lookups.html

Maybe a complementary feature on the Caravel would be for a way to manage your Druid Lookups in Caravel and have Caravel sync them with the Druid cluster. This is not on our current roadmap but we get to it at some point.

@Igosuki
Copy link

Igosuki commented Jan 19, 2017

Update - 2017-26-01

Tested it again today after adding various mappings and it works fine both for jdbc and static mapping. I think this issue should be closed.

Original

@mistercrunch I used the feature today, and it seems the columns is ignored when building the result data from the Druid query (I can see the lookup dimension working if I do the query by myself to the Druid API).

For instance :
I have users buying products, my metric is an aggregate of product price, or just the global count. Naturally I try to lookup a table of the following :

{
    "dimensions": [
        "user_id",
        {
            "replaceMissingValueWith": "N/A",
            "name": "product_name",
            "dimension": "product_id",
            "type": "lookup",
            "injective": false,
            "outputName": "product_name_lookup"
        }
    ],
    "aggregations": [
        {
            "type": "count",
            "name": "count"
        }
    ],
    "intervals": "2017-01-12T20:43:43+00:00/2017-01-19T20:43:43+00:00",
    "limitSpec": {
        "limit": 5000,
        "type": "default",
        "columns": [
            {
                "direction": "descending",
                "dimension": "count"
            }
        ]
    },
    "granularity": "all",
    "postAggregations": [],
    "queryType": "groupBy",
    "dataSource": "user-purchases"
}

Druid will return n of :

{
    "version": "v1",
    "timestamp": "2017-01-12T20:43:43.000Z",
    "event": {
      "count": 20,
      "user_id": "10",
      "product_name_lookup": "A Nice Vase"
    }
  },

But the 'product_name_lookup' column will not appear on reports in superset.

@KostasChr
Copy link

I am happily using superset - and I have a calculated column on the datasource with a dimensionSpec expression that describes a query time lookup.

Grouping works well but filtering does not on this calculated column... Anyone succeeded in doing this? Maybe I need to update the query generator?

@ghulands
Copy link

This should be fixed via #4740

@kristw kristw added the inactive Inactive for >= 30 days label Mar 20, 2019
@stale stale bot closed this as completed Apr 11, 2019
zhaoyongjie pushed a commit to zhaoyongjie/incubator-superset that referenced this issue Nov 17, 2021
zhaoyongjie pushed a commit to zhaoyongjie/incubator-superset that referenced this issue Nov 24, 2021
zhaoyongjie pushed a commit to zhaoyongjie/incubator-superset that referenced this issue Nov 25, 2021
zhaoyongjie pushed a commit to zhaoyongjie/incubator-superset that referenced this issue Nov 26, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement:request Enhancement request submitted by anyone from the community inactive Inactive for >= 30 days
Projects
None yet
Development

No branches or pull requests