Support multiple lookups within one namespace #2523

Closed
sirpkt opened this issue Feb 23, 2016 · 17 comments

@sirpkt
Contributor

sirpkt commented Feb 23, 2016

Even with the same source (a JDBC table or CSV file), we currently have to create a separate configuration for each distinct (key, value) column pair.
For example, if a lookup DB table has four columns A, B, C, and D, and I need three lookups (A to B, C to D, and A to C), then I have to create three different namespaces, each carrying lots of redundant information such as the URI, poll period, ID, and so on.

In my opinion, this approach is bad for maintenance.
When the source configuration changes (for example, a password or a table/column name), it is hard to tell which namespaces are affected by the change, and it is tedious to manually update all the related namespaces.

So I think it is better to split the namespace into two levels: the namespace and its lookup maps.
A namespace is defined at the data-source level, and a lookup map is defined for each distinct (key, value) column or field pair within that data source.

For the example above, the changed configuration could look like the following:

{
  "type":"jdbc",
  "namespace":"DB1",
  "connectorConfig":{
    "createTables":true,
    "connectURI":"jdbc:mysql://localhost:3306/druid",
    "user":"druid",
    "password":"diurd"
  },
  "table":"some_lookup_table",
  "lookupMaps": [
    {"name": "AtoB", "key": "A", "value": "B"},
    {"name": "CtoD", "key": "C", "value": "D"},
    {"name": "AtoC", "key": "A", "value": "C"}
  ],
  "tsColumn":"timestamp_column",
  "pollPeriod":600000
}

And NamespacedExtractor could take one additional parameter that indicates the lookup map name within the given namespace.

@b-slim
Contributor

b-slim commented Feb 23, 2016

@sirpkt I am not sure how you would call the lookup at query time?

@sirpkt
Contributor Author

sirpkt commented Feb 24, 2016

@b-slim Before this patch, a user can use a namespaced lookup by setting the lookup of a LookupDimensionSpec to

{
  "type":"namespace",
  "namepace":"DB1"
}

If this issue is resolved, the user could also specify a lookup map name within the namespace:

{
  "type":"namespace",
  "namepace":"DB1",
  "mapName":"CtoD"
}

@b-slim
Contributor

b-slim commented Feb 24, 2016

@sirpkt If you go that route, you won't be able to get the registered object via the lookup ref manager, which is the use case you want to support. You can probably get around that by having another lookup delegate.

@sirpkt
Contributor Author

sirpkt commented Feb 25, 2016

@b-slim Oh, I just read about #2329.
So in the near future every lookup extractor will be managed by the lookup ref manager,
but that is not the case yet, especially for the namespace lookup extensions including the Kafka namespace, right?

I have two questions:

  1. If I use the static configuration for a namespace as before (and do not register a lookup extractor factory with the lookup ref manager), can I still use that namespace as before via the lookup of a LookupDimensionSpec?
  2. Is there any good example code to reference for registering a lookup extractor factory? It seems like I have to register my lookup extractor factory with every node (or with the Coordinator node after step 2 of Support lookups at Query Time #2329 is completed?), but there is no interface to register it.

@b-slim
Contributor

b-slim commented Feb 25, 2016

@sirpkt Yes, the idea is to have a cluster-wide config via HTTP+ZK, so you would register your lookup via the Coordinator. The way to do it is to simply send the JSON config object to the Coordinator, and Jackson will create the object.
Also, the static config is going away in favor of a snapshot that reflects the distributed state via the Coordinator (PR #2517).
Currently the core of the module is in development by @drcrallen, and I am reviewing the PR.
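
To make that concrete, here is a minimal sketch of such a registration call, assuming a Coordinator at localhost:8081, a /druid/coordinator/v1/lookups/config endpoint, and a __default tier; the exact path and payload shape depend on the Druid version, so treat every name below as illustrative:

# Sketch only: sends the lookup config JSON object to the Coordinator, which
# deserializes it with Jackson. The endpoint path, tier layout, and payload
# shape are assumptions (they changed across Druid versions after #2329/#2517).
import requests

COORDINATOR = "http://localhost:8081"  # hypothetical Coordinator address

lookup_config = {
    "__default": {                             # tier name (assumed default tier)
        "DB1_AtoB": {                          # hypothetical lookup name
            "version": "v1",
            "lookupExtractorFactory": {
                "type": "cachedNamespace",     # assumed factory type
                "extractionNamespace": {
                    "type": "jdbc",
                    "connectorConfig": {
                        "connectURI": "jdbc:mysql://localhost:3306/druid",
                        "user": "druid",
                        "password": "diurd"
                    },
                    "table": "some_lookup_table",
                    "keyColumn": "A",
                    "valueColumn": "B",
                    "pollPeriod": 600000
                }
            }
        }
    }
}

resp = requests.post(COORDINATOR + "/druid/coordinator/v1/lookups/config",
                     json=lookup_config)
resp.raise_for_status()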

@drcrallen
Contributor

@sirpkt Can you expand more on the pain points here? There are some pretty deep-cutting changes in the proposed PR, and I'm not quite getting the main pain point they are solving.

@ryryguy

ryryguy commented Apr 6, 2016

As requested I am sharing our use case. We're using a TSV in S3 for a namespace lookup (at least to start with, we will probably switch over to a JDBC source eventually). We have a single key column, which always corresponds to the same actual dimension in Druid. We have a dozen lookup columns (could grow by a handful, but I'd think no more than 20). And we're starting pretty small now with only about 100K rows, but expect that could grow to several million rows before too long.

We don't need this updated really frequently. Actually we're still working out our ETLs and so forth to deal with revisions and additions to the lookup data. But I wouldn't expect us to have updates more frequently than hourly, and probably more like daily.

As far as pain points with this arrangement go - there sure is plenty of boilerplate in the config. I have an array of a dozen entries in druid.query.extraction.namespace.lookups that are identical in all fields except for namespace and valueColumn. A bit clunky, but not so much that I'd really complain about it - I did write a couple of simple scripts that generate the stuff to be placed in config.
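
For what it's worth, such a generator only needs a few lines; here is a minimal sketch, assuming a single TSV with one key column and several value columns (the file name, column names, and property values are all hypothetical):

# Sketch only: emits the near-identical entries for
# druid.query.extraction.namespace.lookups, varying only namespace and valueColumn.
import json

VALUE_COLUMNS = ["colA", "colB", "colC"]          # hypothetical value columns
ALL_COLUMNS = ["campaign_id"] + VALUE_COLUMNS     # key column first

entries = [
    {
        "type": "uri",
        "namespace": "campaign_" + col,           # one namespace per value column
        "uri": "s3://my-bucket/lookups/qa/customDims.tsv",  # hypothetical URI
        "namespaceParseSpec": {
            "format": "tsv",
            "columns": ALL_COLUMNS,
            "keyColumn": "campaign_id",
            "valueColumn": col
        },
        "pollPeriod": "PT1H"
    }
    for col in VALUE_COLUMNS
]

print(json.dumps(entries, indent=2))              # paste into the runtime config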

I'm more concerned about the overhead when we do update the lookup source. Druid will have to load and parse this (potentially) 20-column x 3M-row TSV once per lookup. I haven't done any benchmarking, but I have noticed that it can take on the order of 15 seconds to completely load our current 12 x 100K case. Even if it takes a few minutes that is not a deal breaker (assuming it does not interfere with query performance or produce inconsistent results while in progress). But it certainly seems like it could be a lot more efficient to load and parse the file once instead of 12 or 20 times.

Overall, the configuration and use feel a bit clunky, I think because from the user's point of view we have just one "lookup namespace": there is a single source and a single key column. It would feel more natural to define the data-source-level properties (uri, format, columns) and the key column once, along with a list of allowed targetColumns, and then use it in dimension specs and filters by referencing just the one namespace plus a targetColumn. It might start to look like an ingestion spec at that point, with dataSchema- and ioConfig-like sections.

But honestly I don't know how much of a priority I'd want it to be. Associating a single namespace with a single key column and multiple value columns might well be overfitting to our specific case, and it's certainly quite usable as it stands.

(One side note, the ability to include columns in the CSV which are not key or value columns is useful for assembling the data manually - we can include "friendly name" sort of columns that are helpful to people who are filling in or auditing the actual lookup data.)

@drcrallen
Contributor

Thanks @ryryguy
The use case you are talking about sounds more like a join than a lookup. Just FYI, when lookups were spec'd out and designed, the intended use case was small-ish hash joins against the larger data in Druid (like ID to name), not a full join between two large datasets.

I have tested lookups up to about 1M unique keys, but much beyond that has not really been investigated because it is outside the scope of the use case for lookups.

If you use off-heap storage for the lookups, they should still work for large lookups (a few million keys), but the impact on performance (especially for filtering on lookups) will not be as minimal as with the intended use case of tens to a few thousand unique keys.

"Large" data joins are still on the "would like to do some day" level.

@ryryguy

ryryguy commented Apr 6, 2016

Thanks for the info @drcrallen

To clarify our use case a little bit further, we are using this functionality because the lookup values are not completely stable. We don't want to trigger mass re-ingestion of the primary data whenever someone corrects a typo in an address or the like.

Also, we would only be querying one dimension from the lookup at a time, but we will allow multiple filters selecting against it. (Slowish performance if multiple lookup filters are selected would probably be acceptable, that is not a primary or major use case.)

One trick I suppose we could do if the length of the lookup started causing problems would be to split them into smaller lookups. The key column in the lookup is a campaign id. (Rows in Druid represent phone calls; campaigns to phone calls are one-to-many.) In our business entities campaigns are owned by accounts. All our queries will also filter on a primary account id. So we could make one TSV per account, embedding the account id in the namespace names. That could cut down the maximum length of a lookup dramatically, though at the cost of increased complexity in marshalling and configuring the lookups.

If we do need to optimize for performance, does that seem like a good approach? Or would having larger numbers of smaller lookups run into different problems?
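
A minimal sketch of that per-account split, assuming the source TSV carries an account_id column alongside the campaign id key (both column names are hypothetical):

# Sketch only: splits one large lookup TSV into one TSV per account so that
# each account can get its own, much smaller, lookup namespace.
import csv
from collections import defaultdict

rows_by_account = defaultdict(list)
with open("campaign_lookup.tsv", newline="") as src:        # hypothetical input file
    reader = csv.DictReader(src, delimiter="\t")
    columns = reader.fieldnames
    for row in reader:
        rows_by_account[row["account_id"]].append(row)      # hypothetical column

for account_id, rows in rows_by_account.items():
    with open("campaign_lookup_%s.tsv" % account_id, "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=columns, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)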

@ryryguy

ryryguy commented Apr 28, 2016

So, I just got a surprising S3 bill. Is it the case that when Druid polls an S3 lookup source (as governed by the "pollPeriod" config value), it does a full GET on the lookup files rather than just checking the timestamp? There's no indication in the Druid logs that I can find, but based on the evidence of our S3 bill, that sure seems to be what is happening.

If so, all the more reason to avoid multiple scans of the same file. Or better still, fix it to actually use the timestamps!
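
For reference, a timestamp-only check is just a HEAD-style request; here is a minimal sketch using boto3 (this is not what Druid's jets3t-based code does, only an illustration of the suggested behaviour, with a hypothetical bucket and key):

# Sketch only: checks the object's Last-Modified timestamp with a HEAD request
# and only issues a full GET when the object has changed since the last poll.
# Bucket, key, and the cached timestamp are hypothetical.
from datetime import datetime, timezone
import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "my-bucket", "lookups/qa/customDims.tsv"

# Timestamp remembered from the previous poll (start from the epoch on first run).
last_seen = datetime(1970, 1, 1, tzinfo=timezone.utc)

head = s3.head_object(Bucket=BUCKET, Key=KEY)    # metadata only, no data transfer
if head["LastModified"] > last_seen:
    obj = s3.get_object(Bucket=BUCKET, Key=KEY)  # full GET only when actually stale
    body = obj["Body"].read()
    last_seen = head["LastModified"]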

@drcrallen
Contributor

@ryryguy That is a good question. It calls org.jets3t.service.S3Service#listObjects, which might be doing weird stuff under the hood instead of just looking at the metadata. I'll have to investigate more.

@ryryguy

ryryguy commented Apr 28, 2016

So just a little more info for you, I turned on the enhanced S3 logging and let Druid run for a little while. It's not entirely clear cut because the S3 logging is "best effort delivery" - delayed and not guaranteed. But attempting to reconstruct via the internal timestamps, it looks like on the polling period intervals, there is a bunch of activity that looks like this:

[redacted] REST.GET.BUCKET - "GET /?max-keys=1000&prefix=lookups%2Fqa%2F&delimiter=%2F HTTP/1.1" 200 - 2488 - 14 14 "-" "JetS3t/0.9.4 (Linux/3.2.0-77-generic; amd64; en; JVM 1.8.0_20)" -
[redacted] REST.GET.OBJECT lookups/qa/customDims-truncated.tsv "GET /lookups/qa/customDims-truncated.tsv HTTP/1.1" 200 - 2392 2392 10 9 "-" "JetS3t/0.9.4 (Linux/3.2.0-77-generic; amd64; en; JVM 1.8.0_20)" -

There is at least one of these REST.GET.BUCKET/REST.GET.OBJECT pairs per configured lookup per polling period. I have some examples where there are 4 times as many of these pairs as there are lookups. 😦 So yeah, I'd say the jets3t code is doing more than just checking metadata, all right.

I'm too embarrassed to say how much of an S3 bill I ran up before noticing this, and it's my own fault for not making sure the Druid instance was co-located with S3 to avoid transfer charges. But at least putting a warning in the documentation right away seems like a good idea! (And maybe a separate issue to track it?)

@drcrallen
Contributor

@ryryguy Already filed #2894.

@Igosuki
Contributor

Igosuki commented Aug 10, 2017

It seems like what's really wanted is the ability to write your own cache-lookup SQL query and then map any of the result set columns.

@nosahama

nosahama commented May 8, 2019

(Quoted @ryryguy's comment from Apr 6, 2016 in full.)

Hi there, I am trying to configure Druid to load a lookup file from S3. How do I do this? Do I use the file:/ syntax, or is there another syntax for loading lookups from S3?

@stale

stale bot commented Feb 12, 2020

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

@stale stale bot added the stale label Feb 12, 2020
@stale

stale bot commented Mar 11, 2020

This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.

@stale stale bot closed this as completed Mar 11, 2020