Query Time Lookup #1259

Merged: 1 commit into apache:master on Jul 28, 2015
Conversation

drcrallen (Contributor):

This PR is at the code review stage. Comments on either the high-level overview or the low-level implementations are welcome. This master comment will be updated with pertinent discussion points if they become major topics in the thread below.

Add query time lookups for renames via query "namespace" (may end up renaming this to something with a more natural description)

  • Add ability to explicitly rename using a "namespace", which is a particular data collection that is loaded on all realtime nodes, historical nodes, and brokers. If any of these nodes has the namespace extension, ALL nodes must have the namespace extension.
  • Add namespace caching and populating (can be on heap or off heap); a structural sketch follows this list
  • Add NamespaceExtractionCacheManager for handling caches
  • Added ExtractionNamespace for handling metadata on the extraction namespaces
  • Added ExtractionNamespaceUpdate for handling metadata related to updates
  • Add extension which caches renames from a kafka stream (requires kafka8)
  • Added README.md for the namespace kafka extension
  • Added docs
  • Added namespace/size, namespace/count, namespace/deltaTasksStarted metrics
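
The caching bullets above only name the moving parts. As a rough, hypothetical sketch of the data shape they imply (class and method names are mine, not the PR's), a per-namespace rename cache boils down to a map of maps:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

public class RenameCache
{
  // namespace -> (key -> renamed value)
  private final ConcurrentMap<String, ConcurrentMap<String, String>> caches =
      new ConcurrentHashMap<>();

  // Returns the lookup function for a namespace; keys with no rename pass
  // through unchanged.
  public Function<String, String> lookupFn(final String namespace)
  {
    final ConcurrentMap<String, String> map =
        caches.computeIfAbsent(namespace, ns -> new ConcurrentHashMap<>());
    return key -> map.getOrDefault(key, key);
  }

  // Called by whatever populates the namespace (poller, kafka consumer, ...).
  public void put(final String namespace, final String key, final String value)
  {
    caches.computeIfAbsent(namespace, ns -> new ConcurrentHashMap<>())
          .put(key, value);
  }
}
```

Presumably the off-heap option in the PR swaps the inner map for an off-heap store behind the same kind of lookup interface.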

Note: the below is for the second round of PRs, which will go in after the static config. Here is how namespace populating will eventually work ( https://github.com/metamx/druid/tree/queryTimeLookup_announce ):

  1. The namespace-serving nodes announce (through Announcer) that they exist
  2. The Coordinator listens for this announcement and fires off a list of namespaces to the node
  3. The Coordinator regularly polls the metadata and POSTs the most recent list to the namespace-serving nodes on each update round.
  4. The namespace-serving nodes determine what to add or drop each time the new list is sent to them.

Since everything is loaded everywhere, this is an "at least once announcement" kind of approach: things eventually settle onto a stable state as the leading Coordinator keeps polling the metadata and pushing the "correct" state to each node.
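
A minimal sketch of the reconciliation step in (4), with hypothetical names (not the PR's actual classes): each time the Coordinator POSTs the desired list, the serving node diffs it against what it currently has loaded.

```java
import java.util.HashSet;
import java.util.Set;

public class NamespaceReconciler
{
  // Invoked on every update round with the full desired state.
  public void onNewList(final Set<String> loaded, final Set<String> desired)
  {
    final Set<String> toAdd = new HashSet<>(desired);
    toAdd.removeAll(loaded);    // announced but not yet loaded

    final Set<String> toDrop = new HashSet<>(loaded);
    toDrop.removeAll(desired);  // loaded but no longer announced

    toAdd.forEach(this::load);
    toDrop.forEach(this::drop);
  }

  private void load(final String namespace) { /* populate its cache */ }
  private void drop(final String namespace) { /* cancel updates, remove cache */ }
}
```

Because the diff is recomputed from the full list on every round, a lost or duplicated POST is harmless, which is what lets the "at least once" semantics settle.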

Prior PR:
#1093

"columns":["key","value]
}
},
"updateMs":0
Contributor:

Why would the update be a function outside of the namespace? Shouldn't the namespace define its own update mechanisms that make sense for however it is integrated?

Contributor (Author):

It can, especially since updateMs doesn't make much sense for some namespace types like Kafka.

Contributor:

Are you saying that you agree and will remove updateMs as an external property, instead making it a part of the particular namespace?

Contributor (Author):

Yes. Instead of having a "namespace" and "updateMs" side by side, the namespace spec will contain updateMs for those types which understand such a thing, and the "namespace" will be the top-level object.
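
For illustration, the agreed-upon shape would put updateMs inside the namespace spec itself, roughly like this (a hypothetical Jackson POJO, not the PR's actual class):

```java
import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;

public class UriNamespaceSpec
{
  private final String uri;
  private final long updateMs; // meaningful for polling types; kafka-style
                               // specs would simply not define this field

  @JsonCreator
  public UriNamespaceSpec(
      @JsonProperty("uri") final String uri,
      @JsonProperty("updateMs") final long updateMs
  )
  {
    this.uri = uri;
    this.updateMs = updateMs;
  }

  public String getUri() { return uri; }
  public long getUpdateMs() { return updateMs; }
}
```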

final String namespace
)
{
Preconditions.checkNotNull(kafkaTopic, "kafkatTopic required");
Contributor:

With @NotNull in place, are we putting additional checks in with the expectation that these objects will be hand-constructed without JSON deserialization as well?
And a minor typo: s/kafkatTopic/kafkaTopic/

Contributor (Author):

There are only checks (Preconditions.checkNotNull) in the constructor. The only way someone should be able to set the values to null is via reflection, in which case all bets are off.

Contributor:

I'm fine with the Preconditions checks, but thought they were redundant because Jackson will do the null checks automatically, given that @NotNull is specified on those fields.
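
To make the trade-off concrete, here is a hedged sketch of the pattern under discussion (field name taken from the quoted diff; the class name is hypothetical). @NotNull only helps when the object is built through a validating Jackson deserializer; the Preconditions check also covers hand-constructed instances:

```java
import javax.validation.constraints.NotNull;

import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;
import com.google.common.base.Preconditions;

public class KafkaNamespaceSpec
{
  @NotNull
  private final String kafkaTopic;

  @JsonCreator
  public KafkaNamespaceSpec(
      @JsonProperty("kafkaTopic") @NotNull final String kafkaTopic
  )
  {
    // Fails fast for `new KafkaNamespaceSpec(null)`, a path that bean
    // validation on the JSON route never sees.
    this.kafkaTopic = Preconditions.checkNotNull(kafkaTopic, "kafkaTopic required");
  }

  public String getKafkaTopic() { return kafkaTopic; }
}
```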

drcrallen force-pushed the queryTimeLookup branch 2 times, most recently from c98ae14 to a4431bb, on April 3, 2015
Assert.assertTrue(fnCache.containsKey(ns));
prior = runs.get();
Thread.sleep(50);
Assert.assertTrue(runs.get() > prior);
Contributor:

There is really no guarantee that this will happen (it is likely, though). Since this check does not seem to be essential (I understand it is checking that the update runner continued to run as scheduled), does it make sense to remove it?

Contributor (Author):

I could, but the purpose is exactly as you put it: to make sure the scheduler is behaving properly before continuing with the rest of the test.

There is a similar question just a few lines down (which is where I, on very rare occasions, see problems): whether the scheduler has now stopped properly.

If you have an idea for a better way to test or ensure this then I am very open to suggestions.

Contributor (Author):

The problem is that if a namespace is scheduled to update regularly, I don't want it to continue updating after it has been told to cancel its tasks, or else the namespace will re-populate during the delete process.

Contributor (Author):

I modified this a bit to hopefully change the way the tests are done and squash the problem, which I think was a test problem rather than an impl problem.
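
For illustration only (this is not the change the PR actually made), one deterministic alternative to sleep-and-recheck in tests like the quoted one is to latch on updater runs, then assert the count stops moving after cancellation:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

import org.junit.Assert;
import org.junit.Test;

public class SchedulerBehaviorTest
{
  @Test
  public void testUpdaterRunsThenStops() throws Exception
  {
    final AtomicLong runs = new AtomicLong(0);
    final CountDownLatch ranTwice = new CountDownLatch(2);
    final ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor();
    exec.scheduleAtFixedRate(
        () -> {
          runs.incrementAndGet();
          ranTwice.countDown(); // signal each scheduled run
        },
        0, 10, TimeUnit.MILLISECONDS
    );
    // Deterministic wait: this fails only if the updater genuinely never
    // runs twice within the timeout, instead of racing a fixed sleep.
    Assert.assertTrue("updater did not keep running", ranTwice.await(10, TimeUnit.SECONDS));

    exec.shutdownNow();
    Assert.assertTrue("executor did not stop", exec.awaitTermination(10, TimeUnit.SECONDS));

    // The "did it actually cancel" half: after termination the run count
    // must stop increasing.
    final long after = runs.get();
    Thread.sleep(50);
    Assert.assertEquals(after, runs.get());
  }
}
```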

drcrallen closed this Jul 21, 2015
drcrallen reopened this Jul 21, 2015
* date version of data given a base descriptor and a matching pattern. "Version" is completely dependent on the
* implementation but is commonly equal to the "last modified" timestamp.
*
* EXPERIMENTAL: This is implemented explicitly for URIExtractionNamespaceFunctionFactory
Contributor:

I'd prefer not to have messages like this. When we release QTL, it should be tested in production somewhere.

Contributor (Author):

Removing experimental notice.

drcrallen (Contributor Author):

There was a question as to why certain namespace types exist. Here's a brief rundown:

  • uri: What we will be using. Takes a flat data file and parses it with a parseSpec whose formats are supported as per the druid-api ParseSpecs. This one operates on a polling interval to look for a new data file to download.
  • kafka: This one instantly propagates changes.
  • jdbc: This one is only there because I kept getting requests for a way to read the data through a JDBC connection. It functions very similarly to the uri case, except it pulls from a DB.

The question was whether a kafka extension is really needed. Of the current ways to propagate data, it is the only one that instantly propagates a change through the cluster. If that is deemed not required, then it is pretty easy to pull the extension out.
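
On the uri type's polling: the javadoc quoted earlier ties "version" to the last-modified timestamp. A hedged sketch of that check (hypothetical names, not Druid's API) only re-downloads when the source reports something newer:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public final class LastModifiedPoller
{
  private long cachedVersion = Long.MIN_VALUE;

  // Returns true when the caller should re-download and re-populate the
  // namespace; a HEAD request keeps the version check cheap.
  public boolean needsUpdate(final URL url) throws IOException
  {
    final HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    try {
      conn.setRequestMethod("HEAD");
      final long version = conn.getLastModified(); // 0 if the header is absent
      if (version > cachedVersion) {
        cachedVersion = version;
        return true;
      }
      return false;
    }
    finally {
      conn.disconnect();
    }
  }
}
```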

* Adds kafka, URI, and JDBC namespace definitions
* Add ability to explicitly rename using a "namespace", which is a particular data collection that is loaded on all realtime nodes, historical nodes, and brokers. If any of these nodes has the namespace extension, ALL nodes must have the namespace extension.
* Add namespace caching and populating (can be on heap or off heap)
* Add NamespaceExtractionCacheManager for handling caches
* Added ExtractionNamespace for handling metadata on the extraction namespaces
* Added ExtractionNamespaceUpdate for handling metadata related to updates
* Add extension which caches renames from a kafka stream (requires kafka8)
* Added README.md for the namespace kafka extension
* Added docs
* Added namespace/size, namespace/count, namespace/deltaTasksStarted metrics

Add static config for namespaces via `druid.query.extraction.namespace`
* This is a rebase of https://github.com/b-slim/druid/tree/static_config_only
fjy added a commit that referenced this pull request on Jul 28, 2015
fjy merged commit 2256794 into apache:master on Jul 28, 2015