-
Notifications
You must be signed in to change notification settings - Fork 3.7k
Home
Druid is an open source, real-time analytical data store that supports fast ad-hoc queries on large-scale data sets
- Learn more by reading the White Paper
- Check out some Examples
- Analytics – Druid is for exploratory analytics. It supports a variety of filters, aggregators and query types as well as pluggable analytics. Different users have developed things like top N, Histograms etc but the key is that they all scale linearly and return quickly by leveraging Druid’s core.
- Interactivity – Druid users leverage it for exploratory querying of real-time ingested data.
- Uptime – Druid is used to back SaaS implementations that need to be up all the time. No downtime for upgrades or scaling.
- Predictable/Fast – Druid gives fast results every time
- Scalability – Druid is handles billions of events and terabytes of data per day and growing
- Efficiency – Druid uses as little storage and CPU as possible. Due to the volumes referenced above efficiency is key to being able to achieve the speeds and manage the volumes cost effectively.
Buzzwords aside, it was originally created to resolve query latency issues seen with trying to use Hadoop to power an interactive service. Hadoop has shown the world that it’s possible to house your data warehouse on commodity hardware for a fraction of the price of typical solutions. As people adopt Hadoop for their data warehousing needs, they find two things.
- They can now query all of their data in a fairly flexible manner and answer any question they have
- The queries take a long time
The first one is the joy that everyone feels the first time they get Hadoop running. The latter is what they realize after they have used Hadoop interactively for a while because Hadoop is optimized for throughput, not latency. Druid is a system that you can set up in your organization next to Hadoop. It provides the ability to access your data in an interactive slice-and-dice fashion. It trades off some query flexibility and takes over the storage format in order to provide the speed.
Druid is especially useful if you are summarizing your data sets and then querying the summarizations. If you put your summarizations into Druid, you will get quick queryability out of a system that you can be confident will scale up as your data volumes increase. Deployments have scaled up to 2TB of data per hour at peak ingested and aggregated in real-time.
We have more details about the general design of the system and why you might want to use it in our Design doc.
The best place to get started, and for demonstration purposes, is the demo of a Realtime node running standalone. The query process, index building, and realtime ingestion features are exercised without the complication of multiple processes on multiple machines. See the Realtime Examples page for how to get started with this key part.
The most basic cluster setup of Druid requires the Compute, Broker and Master servers listed above, a MySQL metadata server, and a ZooKeeper cluster. A Realtime server can be added to enable real time queries. See Cluster Setup for more details.
For basic server configuration see the configuration page and for a discussion of how to query a server, see the Querying page.
Data ingestion can occur in either of two ways:
-
Realtime
Rapid aggregation, indexing, and querying of recent data. The indexed segments eventually become historical data. -
Batch Ingestion
Using Hadoop batch jobs to aggregate and index historical data.
If you would like to contribute to Druid or have any questions about usage or code, you can write to the mailing list(Google Group):
druid-development@googlegroups.com
https://groups.google.com/d/forum/druid-development
We are also squatting in channel #druid-dev on irc.freenode.net.
On Twitter we are #druidio
If you are interested in contributing to the code, we accept pull requests. Note, that we have only just completed decoupling our Metamarkets-specific code from the code base and we took some short-cuts in interface design to make it happen. So, there are a number of interfaces that exist right now which are likely to be in flux. If you are embedding Druid in your system, it will be safest for the time being to only extend/implement interfaces that this wiki describes, as those are intended as stable (unless otherwise mentioned).
For issue tracking, we are using the github issue tracker. Please fill out an issue from the Issues tab on the github screen.
We also have a Libraries page that lists external libraries that people have created for working with Druid.
Be sure to look at the Concepts and Terminology page.
The durable/persisted data kept by Druid is in a form termed Segments which can be on local disk and/or in a key-value store like S3 or HDFS.
A druid cluster is composed of various node types or services:
-
Compute
The base node, processes queries at the segment level, supports replication -
Broker
The query broker, understands the data topology and routes queries to the correct compute nodes -
Master
The coordinator node, makes sure that all segments are being served, replicated and balanced. General custodian for the cluster -
Realtime
The realtime node is the data pathway for rapid indexing and querying of new data in realtime
The following diagram shows the data flow for queries without showing batch indexing: