Skip to content
tralfamadude edited this page Nov 21, 2012 · 13 revisions

Druid

Druid is a distributed, column-oriented analytical datastore. It was originally created to resolve query latency issues seen with trying to use Hadoop to power an interactive service.

Hadoop has shown the world that it’s possible to house your data warehouse on commodity hardware for a fraction of the price of typical solutions. As people adopt Hadoop for their data warehousing needs, they find two things.

  1. They can now query all of their data in a fairly flexible manner and answer any question they have
  2. The queries take a long time

The first one is the joy that everyone feels the first time they get Hadoop running. The latter is what they realize after they have used Hadoop interactively for a while because Hadoop is optimized for throughput, not latency. Druid is a system that you can set up in your organization next to Hadoop. It provides the ability to access your data in an interactive slice-and-dice fashion. It trades off some query flexibility and takes over the storage format in order to provide the speed.

Druid is especially useful if you are summarizing your data sets and then querying the summarizations. If you put your summarizations into Druid, you will get quick queryability out of a system that you can be confident will scale up as your data volumes increase (as far as we are aware, the largest running Druid cluster houses 15TB of queryable data).

We have more details about the general design of the system and why you might want to use it in our Design doc.

Contributing and learning more

If you would like to contribute to Druid or have any questions about usage or code, you can write to the mailing list(Google Group):

druid-development@googlegroups.com
https://groups.google.com/d/forum/druid-development

We are also squatting in channel #druid-dev on irc.freenode.net.

If you are interested in contributing to the code, we accept pull requests. Note, that we have only just completed decoupling our Metamarkets-specific code from the code base and we took some short-cuts in interface design to make it happen. So, there are a number of interfaces that exist right now which are likely to be in flux. If you are embedding Druid in your system, it will be safest for the time being to only extend/implement interfaces that this wiki describes, as those are intended as stable (unless otherwise mentioned).

For issue tracking, we are using the github issue tracker. Please fill out an issue from the Issues tab on the github screen.

Be sure to look at the Concepts and Terminology page.

Composition of a Druid Cluster

The durable/persisted data kept by Druid is in a form termed “segments” which can be on local disk and/or in a key-value store like S3 or HDFS.

A druid cluster is composed of various node types or services:

  1. Compute
    The base node, processes queries at the segment level, supports replication
  2. Broker
    The query broker, understands the data topology and routes queries to the correct compute nodes
  3. Master
    The coordinator node, makes sure that all segments are being served, replicated and balanced. General custodian for the cluster
  4. Realtime
    The realtime node is the data pathway for rapid indexing and querying of new data in realtime

The following diagram shows the data flow for queries without showing batch indexing:

Simple Data Flow

Getting Started, Configuration, and Setup

The best place to get started, and for demonstration purposes, is the demo of a Realtime node running standalone. The query process, index building, and realtime ingestion features are exercised without the complication of multiple processes on multiple machines. See RealtimeStandaloneMain page for how to get started with this key part.

The most basic cluster setup of Druid requires the Compute, Broker and Master servers listed above, a MySQL metadata server, and a ZooKeeper cluster. A Realtime server can be added to enable real time queries. See Cluster Setup for more details.

For basic server configuration see the configuration page and for a discussion of how to query a server, see the Querying page.

Data Ingestion

Data ingestion can occur in either of two ways:

  1. Realtime
    Rapid aggregation, indexing, and querying of recent data. The indexed segments eventually become historical data.
  2. Batch Ingestion
    Using Hadoop batch jobs to aggregate and index historical data.
Clone this wiki locally