# Ingesting Data in Apache Druid

This notebook is an introduction to data ingestion in Apache Druid and serves as an index into the other notebooks in this folder. 

Apache Druid is a database specialized in serving fast analytic applications, with a goal of subsecond response times on queries. It is designed to execute efficient ad-hoc filtering and aggregation on large datasets ranging from terabytes to petabytes. Because it uses resources efficiently, it also supports high concurrency. Its ability to query data as it is ingested from streaming platforms enables real-time analytics.

To use these capabilities, the data must first be ingested. Ingestion is the process of taking raw data, organizing it, and converting it into Druid's segment format. The segment format and its distribution through a cluster are what enable fast filtering and aggregation even as the data volume scales.

This set of notebooks focuses on the different forms of ingestion and provides runnable examples. The datasets used in these notebooks are small because they assume that you are running the Docker Compose cluster entirely on your personal computer.

## The Segment Format

A Druid segment is a column-oriented data structure. This makes analytic queries faster because only the columns involved in a query are read, which is usually a small subset of the data. Dimension columns use a sorted dictionary that helps search the segment and compress the data in the column. For fast filtering, each dictionary value has an index of the rows that contain it. These indexes are stored as bitmaps, so complex multi-value and multi-column filter criteria can be resolved with bitwise operations. Learn more about the [Druid Segment here](https://druid.apache.org/docs/latest/design/segments.html).
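
As a rough illustration of the ideas above, the following sketch models a dimension column with a sorted dictionary and one bitmap per dictionary value. It is a toy model, not Druid's actual on-disk layout: the function names, the sample `country` and `device` columns, and the use of Python integers as bitmaps are all assumptions made for this example.

```python
# Toy model of a dimension column: sorted dictionary + one bitmap per value.
# Illustrative only; Druid's real segment format is more sophisticated.

def encode_column(values):
    """Return (sorted dictionary, dictionary-encoded rows, bitmap per value)."""
    dictionary = sorted(set(values))
    ids = {value: i for i, value in enumerate(dictionary)}
    encoded = [ids[value] for value in values]
    bitmaps = {value: 0 for value in dictionary}
    for row, value in enumerate(values):
        bitmaps[value] |= 1 << row  # set the bit for this row number
    return dictionary, encoded, bitmaps


def matching_rows(bitmap):
    """List the row numbers whose bit is set in the bitmap."""
    return [row for row in range(bitmap.bit_length()) if (bitmap >> row) & 1]


# Hypothetical sample data: two dimension columns over six rows.
country = ["US", "FR", "US", "DE", "FR", "US"]
device = ["mobile", "mobile", "desktop", "mobile", "desktop", "mobile"]

country_dict, country_ids, country_bm = encode_column(country)
_, _, device_bm = encode_column(device)

# country = 'US' AND device = 'mobile' becomes a bitwise AND.
us_mobile = country_bm["US"] & device_bm["mobile"]
# country = 'FR' OR country = 'DE' becomes a bitwise OR.
fr_or_de = country_bm["FR"] | country_bm["DE"]

print(country_dict, country_ids)
print(matching_rows(us_mobile), matching_rows(fr_or_de))
```

The filter is resolved entirely on the bitmaps; the column values themselves are only read for the rows that survive.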


### Segment Size 
At query time, each segment is processed by a single thread, and many threads process segments in parallel. Each segment carries some processing overhead. With a large number of small segments, a query that needs them incurs more of that overhead. With a very small number of very large segments, parallelism is reduced. So there is a balance between the size and quantity of the segments for a given dataset. Apache Druid engineers have found that a good starting point for segment size is approximately 5 million rows and somewhere between 500 and 700 MB.
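
The trade-off above can be turned into a back-of-the-envelope estimate. This sketch only applies the 5 million row starting point to a row count; the function name and the sample volumes are hypothetical, and real sizing also depends on row width and compression.

```python
import math

# Starting point quoted in the text above; tune it for your own data.
TARGET_ROWS_PER_SEGMENT = 5_000_000


def segments_needed(rows_in_time_chunk, target_rows=TARGET_ROWS_PER_SEGMENT):
    """Estimate how many segments a time chunk needs to stay near the target."""
    return max(1, math.ceil(rows_in_time_chunk / target_rows))


# Hypothetical volumes for a single time chunk.
print(segments_needed(120_000_000))  # a busy day
print(segments_needed(2_000_000))    # a quiet day: one undersized segment
```

If most time chunks end up as a single undersized segment, a coarser time bucket may be worth considering; if each chunk needs many segments, clustering becomes more relevant.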

See the Time partitioning and Clustering sections below for details on how to achieve the ideal segment size. Read more about [segment size optimization here](https://druid.apache.org/docs/latest/operations/segment-optimization.html).


## Batch & Real-time
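Data can be ingested in batches or in real time. Batch ingestion loads a bounded set of data, such as files, and suits initial loads, backfills, and periodic reprocessing. Streaming ingestion reads continuously from a streaming platform such as Apache Kafka or Amazon Kinesis and makes events available for query as they arrive, which suits real-time analytics.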

Try them out:
- [Batch Ingestion Notebook](01-batch-ingestion.ipynb)
- [Streaming Ingestion Notebook](02-streaming-ingestion.ipynb)

## Time Partitioning

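Druid organizes data by time. Every row has a primary timestamp, and rows are grouped into time buckets (time chunks) such as an hour or a day. Each segment belongs to exactly one time bucket, so queries that filter on time only need to read the segments for the matching buckets.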

- [Time Partitioning Notebook](03-time-partitioning.ipynb)
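
As a minimal sketch of time bucketing, the following code groups rows by the start of the hour or day that contains their timestamp. The function names and sample events are hypothetical; `__time` is the name Druid uses for the primary timestamp column, but the bucketing logic here is only an illustration, not Druid's implementation.

```python
from collections import defaultdict
from datetime import datetime, timezone


def time_chunk_start(ts, granularity):
    """Truncate a timestamp to the start of its time bucket."""
    if granularity == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    if granularity == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported granularity: {granularity}")


def partition_by_time(rows, granularity):
    """Group rows into time buckets keyed by bucket start."""
    chunks = defaultdict(list)
    for row in rows:
        chunks[time_chunk_start(row["__time"], granularity)].append(row)
    return dict(chunks)


# Hypothetical events spanning two days.
events = [
    {"__time": datetime(2024, 1, 1, 0, 15, tzinfo=timezone.utc), "page": "/home"},
    {"__time": datetime(2024, 1, 1, 0, 45, tzinfo=timezone.utc), "page": "/cart"},
    {"__time": datetime(2024, 1, 1, 13, 5, tzinfo=timezone.utc), "page": "/home"},
    {"__time": datetime(2024, 1, 2, 9, 30, tzinfo=timezone.utc), "page": "/home"},
]

by_day = partition_by_time(events, "day")
by_hour = partition_by_time(events, "hour")
print(len(by_day), len(by_hour))
```

A coarser bucket produces fewer, larger groups of rows; a finer bucket produces more, smaller ones. That choice is one of the main levers for reaching a good segment size.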

## Clustering
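A time partition can contain multiple segments. Clustering (secondary partitioning) on one or more dimensions places rows with similar values in the same segments, so a query that filters on those dimensions can prune the segments that cannot contain matching rows. Clustering is most useful when queries frequently filter on the same dimensions.
- [Clustering Notebook](04-clustering.ipynb)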
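
To show why the arrangement of rows across segments matters for pruning, this sketch splits the rows of one time partition into fixed-size segments, once in arrival order and once sorted by a dimension, and records each segment's minimum and maximum value. The function names, the `customer` dimension, and the min/max pruning rule are assumptions for illustration; Druid's partitioning schemes differ in detail.

```python
def build_segments(rows, dimension, rows_per_segment, clustered):
    """Split rows into segments, optionally sorting by the dimension first."""
    ordered = sorted(rows, key=lambda r: r[dimension]) if clustered else list(rows)
    segments = []
    for start in range(0, len(ordered), rows_per_segment):
        chunk = ordered[start:start + rows_per_segment]
        values = [r[dimension] for r in chunk]
        segments.append({"min": min(values), "max": max(values), "rows": chunk})
    return segments


def prune(segments, value):
    """Keep only segments whose value range could contain the filter value."""
    return [s for s in segments if s["min"] <= value <= s["max"]]


# Hypothetical rows from a single time partition, in arrival order.
customers = ["acme", "zeta", "bolt", "acme", "mega", "bolt", "zeta", "mega"]
rows = [{"customer": c} for c in customers]

unclustered = build_segments(rows, "customer", 2, clustered=False)
clustered = build_segments(rows, "customer", 2, clustered=True)

# Segments that must be read for: WHERE customer = 'bolt'
print(len(prune(unclustered, "bolt")), len(prune(clustered, "bolt")))
```

In this toy data, the clustered layout lets the filter skip all but one of the four segments, while the arrival-order layout can skip only one.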

## Rollup
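Rollup pre-aggregates data at ingestion time: rows that share the same truncated timestamp and the same dimension values are combined into a single row with aggregated metrics. Storing and scanning fewer rows improves query performance and supports higher concurrency.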
- [Rollup Notebook](05-rollup.ipynb)
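
The idea of rollup can be sketched as a group-by performed while loading. In this illustration, timestamps are truncated to the minute and rows with the same minute and dimension values are merged into one row with a count and a sum. The function name, the `page` dimension, and the `bytes` metric are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def rollup(rows, dimensions):
    """Combine rows that share a truncated timestamp and dimension values."""
    totals = defaultdict(lambda: {"count": 0, "bytes": 0})
    for row in rows:
        minute = row["__time"].replace(second=0, microsecond=0)
        key = (minute,) + tuple(row[d] for d in dimensions)
        totals[key]["count"] += 1
        totals[key]["bytes"] += row["bytes"]
    return dict(totals)


# Hypothetical raw events.
raw = [
    {"__time": datetime(2024, 1, 1, 12, 0, 5), "page": "/home", "bytes": 100},
    {"__time": datetime(2024, 1, 1, 12, 0, 40), "page": "/home", "bytes": 150},
    {"__time": datetime(2024, 1, 1, 12, 0, 59), "page": "/cart", "bytes": 200},
    {"__time": datetime(2024, 1, 1, 12, 1, 10), "page": "/home", "bytes": 50},
    {"__time": datetime(2024, 1, 1, 12, 1, 30), "page": "/home", "bytes": 70},
]

rolled = rollup(raw, ["page"])
print(len(raw), "raw rows ->", len(rolled), "rolled-up rows")
```

The trade-off is that individual events can no longer be retrieved after rollup; only the aggregated values remain.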

## Compaction
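Compaction rewrites the existing segments of a time partition, for example to merge many small segments into fewer, better-sized ones or to change how the data is partitioned. It is typically needed after streaming ingestion or late-arriving data has left a time partition fragmented. Auto-compaction runs this process automatically in the background.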
- [Compaction Notebook](06-compaction.ipynb)
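
As a simplified picture of what compaction achieves, this sketch takes the row counts of the segments in one time partition and repacks the same rows into segments of a target size. The function name and the numbers are hypothetical, and the repacking rule is an assumption for illustration; Druid's compaction tasks rewrite the actual data and support many more options.

```python
def compact(segment_row_counts, target_rows):
    """Repack the rows of one time partition into target-sized segments."""
    total = sum(segment_row_counts)
    compacted = [target_rows] * (total // target_rows)
    if total % target_rows:
        compacted.append(total % target_rows)  # remainder goes in a last segment
    return compacted


# Hypothetical: twelve small segments left behind by streaming ingestion.
before = [500_000] * 12
after = compact(before, target_rows=5_000_000)
print(len(before), "segments ->", len(after), "segments:", after)
```

The total number of rows is unchanged; only the number and size of the segments that a query must open are different.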

## Schema Auto-Discovery
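Schema auto-discovery lets Druid detect the columns and their data types from the incoming data instead of requiring every dimension to be declared up front. It is desirable when the shape of the data is not known in advance or changes over time. It is less desirable when you need strict control over which columns are stored and how they are typed.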
- [Auto Discovery Notebook](07-auto-discovery.ipynb)