Query Time Lookup #1259

Merged: 1 commit into apache:master on Jul 28, 2015
Conversation

drcrallen (Contributor):

This PR is at the code review stage. Comments on either the high-level overview or the low-level implementations are welcome. This master comment will be updated with pertinent discussion points if they become major topics in the thread below.

Add query time lookups for renames via query "namespace" (may end up renaming this to something with a more natural description)

  • Add ability to explicitly rename using a "namespace", which is a particular data collection that is loaded on all realtime nodes, historical nodes, and brokers. If any of these nodes has the namespace extension, ALL nodes must have the namespace extension.
  • Add namespace caching and populating (can be on heap or off heap); a structural sketch follows this list
  • Add NamespaceExtractionCacheManager for handling caches
  • Added ExtractionNamespace for handling metadata on the extraction namespaces
  • Added ExtractionNamespaceUpdate for handling metadata related to updates
  • Add extension which caches renames from a kafka stream (requires kafka8)
  • Added README.md for the namespace kafka extension
  • Added docs
  • Added namespace/size, namespace/count, namespace/deltaTasksStarted metrics
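
The caching bullets above only name the moving parts. As a rough, hypothetical sketch of the data shape they imply (class and method names are mine, not the PR's), a per-namespace rename cache boils down to a map of maps:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

public class RenameCache
{
  // namespace -> (key -> renamed value)
  private final ConcurrentMap<String, ConcurrentMap<String, String>> caches =
      new ConcurrentHashMap<>();

  // Returns the lookup function for a namespace; keys with no rename pass
  // through unchanged.
  public Function<String, String> lookupFn(final String namespace)
  {
    final ConcurrentMap<String, String> map =
        caches.computeIfAbsent(namespace, ns -> new ConcurrentHashMap<>());
    return key -> map.getOrDefault(key, key);
  }

  // Called by whatever populates the namespace (poller, kafka consumer, ...).
  public void put(final String namespace, final String key, final String value)
  {
    caches.computeIfAbsent(namespace, ns -> new ConcurrentHashMap<>())
          .put(key, value);
  }
}
```

Presumably the off-heap option in the PR swaps the inner map for an off-heap store behind the same kind of lookup interface.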

Note: the below is for the second round of PRs, which will go in after the static config. Here is how namespace populating will eventually work ( https://github.com/metamx/druid/tree/queryTimeLookup_announce ):

  1. The namespace-serving nodes announce (through Announcer) that they exist
  2. The Coordinator listens for this announcement and fires off a list of namespaces to the node
  3. The Coordinator regularly polls the metadata and POSTs the most recent list to the namespace-serving nodes on each update round.
  4. The namespace-serving nodes determine what to add or drop each time the new list is sent to them.

Since everything is loaded everywhere, this is an "at least once announcement" kind of approach: things eventually settle onto a stable state as the leading Coordinator keeps polling the metadata and pushing the "correct" state to each node.
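
A minimal sketch of the reconciliation step in (4), with hypothetical names (not the PR's actual classes): each time the Coordinator POSTs the desired list, the serving node diffs it against what it currently has loaded.

```java
import java.util.HashSet;
import java.util.Set;

public class NamespaceReconciler
{
  // Invoked on every update round with the full desired state.
  public void onNewList(final Set<String> loaded, final Set<String> desired)
  {
    final Set<String> toAdd = new HashSet<>(desired);
    toAdd.removeAll(loaded);    // announced but not yet loaded

    final Set<String> toDrop = new HashSet<>(loaded);
    toDrop.removeAll(desired);  // loaded but no longer announced

    toAdd.forEach(this::load);
    toDrop.forEach(this::drop);
  }

  private void load(final String namespace) { /* populate its cache */ }
  private void drop(final String namespace) { /* cancel updates, remove cache */ }
}
```

Because the diff is recomputed from the full list on every round, a lost or duplicated POST is harmless, which is what lets the "at least once" semantics settle.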

Prior PR:
#1093

"columns":["key","value]
}
},
"updateMs":0
Contributor:

Why would the update be a function outside of the namespace? Shouldn't the namespace define its own update mechanisms that make sense for however it is integrated?

Contributor (Author):

It can, especially since updateMs doesn't make much sense for some namespace types like Kafka.

Contributor:

Are you saying that you agree and will remove updateMs as an external property, instead making it a part of the particular namespace?

Contributor (Author):

Yes. Instead of having a "namespace" and "updateMs" side by side, the namespace spec will contain updateMs for those types which understand such a thing, and the "namespace" will be the top-level object.
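
For illustration, the agreed-upon shape would put updateMs inside the namespace spec itself, roughly like this (a hypothetical Jackson POJO, not the PR's actual class):

```java
import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;

public class UriNamespaceSpec
{
  private final String uri;
  private final long updateMs; // meaningful for polling types; kafka-style
                               // specs would simply not define this field

  @JsonCreator
  public UriNamespaceSpec(
      @JsonProperty("uri") final String uri,
      @JsonProperty("updateMs") final long updateMs
  )
  {
    this.uri = uri;
    this.updateMs = updateMs;
  }

  public String getUri() { return uri; }
  public long getUpdateMs() { return updateMs; }
}
```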

final String namespace
)
{
Preconditions.checkNotNull(kafkaTopic, "kafkatTopic required");
Contributor:

With @NotNull in place, are we putting additional checks in with the expectation that these objects will be hand-constructed without JSON deserialization as well?
And a minor typo: s/kafkatTopic/kafkaTopic/

Contributor (Author):

There are only checks (Preconditions.checkNotNull) in the constructor. The only way someone should be able to set the values to null is via reflection, in which case all bets are off.

Contributor:

I'm fine with the Preconditions checks, but thought they were redundant because Jackson will do the null checks automatically, given that @NotNull is specified on those fields.
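
To make the trade-off concrete, here is a hedged sketch of the pattern under discussion (field name taken from the quoted diff; the class name is hypothetical). @NotNull only helps when the object is built through a validating Jackson deserializer; the Preconditions check also covers hand-constructed instances:

```java
import javax.validation.constraints.NotNull;

import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;
import com.google.common.base.Preconditions;

public class KafkaNamespaceSpec
{
  @NotNull
  private final String kafkaTopic;

  @JsonCreator
  public KafkaNamespaceSpec(
      @JsonProperty("kafkaTopic") @NotNull final String kafkaTopic
  )
  {
    // Fails fast for `new KafkaNamespaceSpec(null)`, a path that bean
    // validation on the JSON route never sees.
    this.kafkaTopic = Preconditions.checkNotNull(kafkaTopic, "kafkaTopic required");
  }

  public String getKafkaTopic() { return kafkaTopic; }
}
```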

drcrallen force-pushed the queryTimeLookup branch 2 times, most recently from c98ae14 to a4431bb, on April 3, 2015
Assert.assertTrue(fnCache.containsKey(ns));
prior = runs.get();
Thread.sleep(50);
Assert.assertTrue(runs.get() > prior);
Contributor:

There is really no guarantee that this will happen (it is likely, though). Since this check does not seem to be essential (I understand it is checking that the update runner continued to run as scheduled), does it make sense to remove it?

Contributor (Author):

I could, but the purpose is exactly as you put it: to make sure the scheduler is behaving properly before continuing with the rest of the test.

There is a similar question just a few lines down (which is where I, on very rare occasions, see problems): whether the scheduler has now stopped properly.

If you have an idea for a better way to test or ensure this then I am very open to suggestions.

Contributor (Author):

The problem is that if a namespace is scheduled to update regularly, I don't want it to continue updating after it has been told to cancel its tasks, or else the namespace will re-populate during the delete process.

Contributor (Author):

I modified this a bit to hopefully change the way the tests are done and squash the problem, which I think was a test problem rather than an impl problem.
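
For illustration only (this is not the change the PR actually made), one deterministic alternative to sleep-and-recheck in tests like the quoted one is to latch on updater runs, then assert the count stops moving after cancellation:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

import org.junit.Assert;
import org.junit.Test;

public class SchedulerBehaviorTest
{
  @Test
  public void testUpdaterRunsThenStops() throws Exception
  {
    final AtomicLong runs = new AtomicLong(0);
    final CountDownLatch ranTwice = new CountDownLatch(2);
    final ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor();
    exec.scheduleAtFixedRate(
        () -> {
          runs.incrementAndGet();
          ranTwice.countDown(); // signal each scheduled run
        },
        0, 10, TimeUnit.MILLISECONDS
    );
    // Deterministic wait: this fails only if the updater genuinely never
    // runs twice within the timeout, instead of racing a fixed sleep.
    Assert.assertTrue("updater did not keep running", ranTwice.await(10, TimeUnit.SECONDS));

    exec.shutdownNow();
    Assert.assertTrue("executor did not stop", exec.awaitTermination(10, TimeUnit.SECONDS));

    // The "did it actually cancel" half: after termination the run count
    // must stop increasing.
    final long after = runs.get();
    Thread.sleep(50);
    Assert.assertEquals(after, runs.get());
  }
}
```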

drcrallen closed this Jul 21, 2015
drcrallen reopened this Jul 21, 2015
* date version of data given a base descriptor and a matching pattern. "Version" is completely dependent on the
* implementation but is commonly equal to the "last modified" timestamp.
*
* EXPERIMENTAL: This is implemented explicitly for URIExtractionNamespaceFunctionFactory
Contributor:

I'd prefer not to have messages like this. When we release QTL, it should be tested in production somewhere.

Contributor (Author):

Removing experimental notice.

drcrallen (Contributor Author):

There was a question as to why certain namespace types exist. Here's a brief rundown:

  • uri: What we will be using. Takes a flat data file and parses it with a parseSpec whose formats are supported as per the druid-api ParseSpecs. This one operates on a polling interval to look for a new data file to download.
  • kafka: This one instantly propagates changes.
  • jdbc: This one is only there because I kept getting requests for a way to read the data through a JDBC connection. It functions very similarly to the uri case, except it pulls from a DB.

The question was whether a kafka extension is really needed. Of the current ways to propagate data, it is the only one that instantly propagates a change through the cluster. If that is deemed not required, then it is pretty easy to pull the extension out.
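
On the uri type's polling: the javadoc quoted earlier ties "version" to the last-modified timestamp. A hedged sketch of that check (hypothetical names, not Druid's API) only re-downloads when the source reports something newer:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public final class LastModifiedPoller
{
  private long cachedVersion = Long.MIN_VALUE;

  // Returns true when the caller should re-download and re-populate the
  // namespace; a HEAD request keeps the version check cheap.
  public boolean needsUpdate(final URL url) throws IOException
  {
    final HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    try {
      conn.setRequestMethod("HEAD");
      final long version = conn.getLastModified(); // 0 if the header is absent
      if (version > cachedVersion) {
        cachedVersion = version;
        return true;
      }
      return false;
    }
    finally {
      conn.disconnect();
    }
  }
}
```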

* Adds kafka, URI, and JDBC namespace definitions
* Add ability to explicitly rename using a "namespace", which is a particular data collection that is loaded on all realtime nodes, historical nodes, and brokers. If any of these nodes has the namespace extension, ALL nodes must have the namespace extension.
* Add namespace caching and populating (can be on heap or off heap)
* Add NamespaceExtractionCacheManager for handling caches
* Added ExtractionNamespace for handling metadata on the extraction namespaces
* Added ExtractionNamespaceUpdate for handling metadata related to updates
* Add extension which caches renames from a kafka stream (requires kafka8)
* Added README.md for the namespace kafka extension
* Added docs
* Added namespace/size, namespace/count, namespace/deltaTasksStarted metrics

Add static config for namespaces via `druid.query.extraction.namespace`
* This is a rebase of https://github.com/b-slim/druid/tree/static_config_only
fjy added a commit that referenced this pull request on Jul 28, 2015
fjy merged commit 2256794 into apache:master on Jul 28, 2015