# Analyzing data distributions

The Apache Druid database provides functions to analyze large distributions thanks to the [Apache Datasketches](https://datasketches.apache.org/docs/Quantiles/QuantilesOverview.html) project's [Quantiles](https://datasketches.apache.org/docs/Quantiles/QuantilesOverview.html) sketch, as well as [t-digest](https://druid.apache.org/docs/latest/development/extensions-contrib/tdigestsketch-quantiles/), [momentSketch](https://druid.apache.org/docs/latest/development/extensions-contrib/momentsketch-quantiles/), and [approxHistogram](https://druid.apache.org/docs/latest/development/extensions-core/approximate-histograms/).

In this tutorial, you work through some examples of the functions available in the Datasketches extension, a core extension that enables estimation of quantiles, ranks, and histograms.

## Prerequisites

This tutorial works with Druid 26.0.0 or later.

#### Run using Docker

Launch this tutorial and all prerequisites using the `druid-jupyter` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see [Docker for Jupyter Notebook tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).
   
#### Run without using Docker

If you do not use the Docker Compose environment, you need the following:

* A running Apache Druid instance, with a `DRUID_HOST` local environment variable containing the server name of your Druid router.
* [druidapi](https://github.com/apache/druid/blob/master/examples/quickstart/jupyter-notebooks/druidapi/README.md), a Python client for Apache Druid. Follow the instructions in the Install section of the README file.
* [matplotlib](https://matplotlib.org/), a library for creating visualizations in Python.
* [pandas](https://pandas.pydata.org/), a data analysis and manipulation tool.

### Initialize Python

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If successful, the Druid version number will be shown in the output.

In [None]:
import druidapi
import os

# Use the DRUID_HOST environment variable if it is set; otherwise fall back to localhost.
druid_host = f"http://{os.environ.get('DRUID_HOST', 'localhost')}:8888"
    
print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)

display = druid.display
sql_client = druid.sql
status_client = druid.status

status_client.version

### Load example data

Once your Druid environment is up and running, ingest the sample data for this tutorial.

Run the following cell to create a table called `example-flights-quantiles`. When the ingestion completes, you'll see a description of the final table.

You can monitor the progress of the ingestion task in the Druid console.

In [None]:
sql='''
REPLACE INTO "example-flights-quantiles" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://static.imply.io/example-data/flight_on_time/flights/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11.csv.zip"]}',
    '{"type":"csv","findColumnsFromHeader":true}'
  )
) EXTEND ("depaturetime" VARCHAR, "arrivalime" VARCHAR, "Year" BIGINT, "Quarter" BIGINT, "Month" BIGINT, "DayofMonth" BIGINT, "DayOfWeek" BIGINT, "FlightDate" VARCHAR, "Reporting_Airline" VARCHAR, "DOT_ID_Reporting_Airline" BIGINT, "IATA_CODE_Reporting_Airline" VARCHAR, "Tail_Number" VARCHAR, "Flight_Number_Reporting_Airline" BIGINT, "OriginAirportID" BIGINT, "OriginAirportSeqID" BIGINT, "OriginCityMarketID" BIGINT, "Origin" VARCHAR, "OriginCityName" VARCHAR, "OriginState" VARCHAR, "OriginStateFips" BIGINT, "OriginStateName" VARCHAR, "OriginWac" BIGINT, "DestAirportID" BIGINT, "DestAirportSeqID" BIGINT, "DestCityMarketID" BIGINT, "Dest" VARCHAR, "DestCityName" VARCHAR, "DestState" VARCHAR, "DestStateFips" BIGINT, "DestStateName" VARCHAR, "DestWac" BIGINT, "CRSDepTime" BIGINT, "DepTime" BIGINT, "DepDelay" BIGINT, "DepDelayMinutes" BIGINT, "DepDel15" BIGINT, "DepartureDelayGroups" BIGINT, "DepTimeBlk" VARCHAR, "TaxiOut" BIGINT, "WheelsOff" BIGINT, "WheelsOn" BIGINT, "TaxiIn" BIGINT, "CRSArrTime" BIGINT, "ArrTime" BIGINT, "ArrDelay" BIGINT, "ArrDelayMinutes" BIGINT, "ArrDel15" BIGINT, "ArrivalDelayGroups" BIGINT, "ArrTimeBlk" VARCHAR, "Cancelled" BIGINT, "CancellationCode" VARCHAR, "Diverted" BIGINT, "CRSElapsedTime" BIGINT, "ActualElapsedTime" BIGINT, "AirTime" BIGINT, "Flights" BIGINT, "Distance" BIGINT, "DistanceGroup" BIGINT, "CarrierDelay" BIGINT, "WeatherDelay" BIGINT, "NASDelay" BIGINT, "SecurityDelay" BIGINT, "LateAircraftDelay" BIGINT, "FirstDepTime" VARCHAR, "TotalAddGTime" VARCHAR, "LongestAddGTime" VARCHAR, "DivAirportLandings" VARCHAR, "DivReachedDest" VARCHAR, "DivActualElapsedTime" VARCHAR, "DivArrDelay" VARCHAR, "DivDistance" VARCHAR, "Div1Airport" VARCHAR, "Div1AirportID" VARCHAR, "Div1AirportSeqID" VARCHAR, "Div1WheelsOn" VARCHAR, "Div1TotalGTime" VARCHAR, "Div1LongestGTime" VARCHAR, "Div1WheelsOff" VARCHAR, "Div1TailNum" VARCHAR, "Div2Airport" VARCHAR, "Div2AirportID" VARCHAR, "Div2AirportSeqID" VARCHAR, "Div2WheelsOn" VARCHAR, "Div2TotalGTime" 
VARCHAR, "Div2LongestGTime" VARCHAR, "Div2WheelsOff" VARCHAR, "Div2TailNum" VARCHAR, "Div3Airport" VARCHAR, "Div3AirportID" VARCHAR, "Div3AirportSeqID" VARCHAR, "Div3WheelsOn" VARCHAR, "Div3TotalGTime" VARCHAR, "Div3LongestGTime" VARCHAR, "Div3WheelsOff" VARCHAR, "Div3TailNum" VARCHAR, "Div4Airport" VARCHAR, "Div4AirportID" VARCHAR, "Div4AirportSeqID" VARCHAR, "Div4WheelsOn" VARCHAR, "Div4TotalGTime" VARCHAR, "Div4LongestGTime" VARCHAR, "Div4WheelsOff" VARCHAR, "Div4TailNum" VARCHAR, "Div5Airport" VARCHAR, "Div5AirportID" VARCHAR, "Div5AirportSeqID" VARCHAR, "Div5WheelsOn" VARCHAR, "Div5TotalGTime" VARCHAR, "Div5LongestGTime" VARCHAR, "Div5WheelsOff" VARCHAR, "Div5TailNum" VARCHAR, "Unnamed: 109" VARCHAR))
SELECT
  TIME_PARSE("depaturetime") AS "__time",
  "Reporting_Airline",
  "Tail_Number",
  "Distance",
  "Origin",
  DepDelayMinutes + ArrDelayMinutes AS "Delay"
FROM "ext"
PARTITIONED BY DAY
'''

sql_client.run_task(sql)
sql_client.wait_until_ready('example-flights-quantiles')
display.table('example-flights-quantiles')

Finally, run the following cell to import additional Python modules that you will use.

In [None]:
import json
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd

## Understanding the Quantiles sketch

A sketch is a compact data structure that holds a representation of the distribution of values. Sketches are a class of streaming algorithms that includes quantiles and count-distinct algorithms. To create Quantiles sketches, use the [`DS_QUANTILES_SKETCH`](https://druid.apache.org/docs/latest/querying/sql-functions#ds_quantiles_sketch) function and provide the column for which to create the sketch.

To see what Quantiles sketches look like, run the following cell. It uses `DS_QUANTILES_SKETCH` to create a Quantiles sketch for both `Distance` and `Delay`.

In [None]:
sql='''
SELECT 
  DS_QUANTILES_SKETCH(Distance) AS "Distance_QS",
  DS_QUANTILES_SKETCH(Delay) AS "Delay_QS"
FROM "example-flights-quantiles"
'''

display.sql(sql)

You can combine this function with `FILTER` to limit the sketch based on other dimension values.

Run the cell below to create two Quantiles sketches that represent only flights leaving Atlanta.

In [None]:
sql='''
SELECT
  DS_QUANTILES_SKETCH(Distance) FILTER (WHERE Origin='ATL') AS "Distance_QS",
  DS_QUANTILES_SKETCH(Delay) FILTER (WHERE Origin='ATL') AS "Delay_QS"
FROM "example-flights-quantiles"
'''

display.sql(sql)

As with other Datasketches, the `DS_QUANTILES_SKETCH` function (or the [classic equivalent](https://druid.apache.org/docs/latest/development/extensions-core/datasketches-quantiles#aggregator), `quantilesDoublesSketch`) can be used at ingestion time to pre-calculate these objects, reducing the processing required at query time.

In the rest of this notebook, these representations will be used to find quantiles (the estimated value at a given position) and ranks (the estimated position of a specific value), and to build histograms.
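
To build some intuition for why a small summary can answer questions about a large distribution, here is a minimal sketch in plain Python. It is not the Datasketches algorithm: it swaps in a simple uniform reservoir sample as a stand-in, and the function name, stream, and sizes are all hypothetical.

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of at most k items from a stream.

    Illustrative stand-in only; the Quantiles sketch uses a different algorithm.
    """
    rng = random.Random(seed)
    sample = []
    for i, value in enumerate(stream):
        if i < k:
            sample.append(value)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = value
    return sample

# Hypothetical stream of 100,000 values, seen one at a time.
stream = range(100_000)
summary = sorted(reservoir_sample(stream, k=1_000))

# The summary never grows beyond 1,000 values, yet its middle value
# lands near the true median of the stream (about 50,000).
estimated_median = summary[len(summary) // 2]
print(len(summary), estimated_median)
```

The point of the illustration is the trade-off: a fixed-size summary gives approximate answers, and the error shrinks as the summary grows.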

## Estimating the quantile

A quantile allows us to conclude that, within a given set of observations, the specified percentage of values falls below a particular value.

Druid provides functions that allow you to estimate single and multiple quantiles.

### Estimate a single quantile value

Run the following cell, which uses the [`DS_GET_QUANTILE`](https://druid.apache.org/docs/latest/querying/sql-functions#ds_get_quantile) function to find the values for each `Reporting_Airline` at the quartile positions: 25%, 50%, and 75%.

In [None]:
sql='''
SELECT
  Reporting_Airline,
  DS_GET_QUANTILE(DS_QUANTILES_SKETCH(Distance),.25) AS "25percent",
  DS_GET_QUANTILE(DS_QUANTILES_SKETCH(Distance),.5) AS "50percent",
  DS_GET_QUANTILE(DS_QUANTILES_SKETCH(Distance),.75) AS "75percent"
FROM "example-flights-quantiles"
GROUP BY 1
'''

display.sql(sql)

The results show that, for each `Reporting_Airline`, 25% of flights flew a `Distance` below the first value, 50% below the second, and 75% below the third.
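
To make that interpretation concrete, the following sketch computes exact quartiles for a small, hypothetical list of distances. The helper `exact_quantile` and its indexing convention are assumptions chosen for illustration. Druid estimates these values from a sketch rather than sorting the raw data.

```python
def exact_quantile(values, fraction):
    """Return the value at the given fractional position of the sorted data.

    Uses one simple convention: the item at index floor(fraction * n).
    """
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(fraction * len(ordered)))
    return ordered[index]

# Hypothetical flight distances in miles.
distances = [1200, 300, 450, 800, 150, 2400, 950, 600]

for fraction in (0.25, 0.5, 0.75):
    value = exact_quantile(distances, fraction)
    below = sum(1 for d in distances if d < value) / len(distances)
    print(f"{fraction:.2f} -> {value} ({below:.0%} of values are below)")
```

With eight values, the three quartiles are 450, 800, and 1200, and exactly 25%, 50%, and 75% of the values fall below them.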

Druid provides a shorthand version of the SQL statement used in the previous cell: [`APPROX_QUANTILE_DS`](https://druid.apache.org/docs/latest/querying/sql-functions#approx_quantile_ds).

Run the cell below, which gives the same result as the query above but uses this shorthand function. You don't need to call `DS_QUANTILES_SKETCH` to create a sketch against `Distance`. Instead, you provide the `Distance` column directly.

In [None]:
sql='''
SELECT
  Reporting_Airline,
  APPROX_QUANTILE_DS(Distance,.25) AS "25percent",
  APPROX_QUANTILE_DS(Distance,.5) AS "50percent",
  APPROX_QUANTILE_DS(Distance,.75) AS "75percent"
FROM "example-flights-quantiles"
GROUP BY 1
'''

display.sql(sql)

As with `DS_QUANTILES_SKETCH`, this can be combined with a `FILTER` clause.

Run the following cell, which looks only at flights that were delayed by more than 3 hours. For each of three airlines, the `APPROX_QUANTILE_DS` function returns the distance below which 98% of flights in this category flew.

In [None]:
sql='''
SELECT
  APPROX_QUANTILE_DS(Distance,.98) FILTER (WHERE Reporting_Airline = 'CO') AS "Distance_CO",
  APPROX_QUANTILE_DS(Distance,.98) FILTER (WHERE Reporting_Airline = 'US') AS "Distance_US",
  APPROX_QUANTILE_DS(Distance,.98) FILTER (WHERE Reporting_Airline = 'AA') AS "Distance_AA"
FROM "example-flights-quantiles"
WHERE Delay > 180
'''

display.sql(sql)
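
The query above uses two kinds of filtering: `WHERE` removes rows before any aggregation runs, while `FILTER` narrows the input of a single aggregate. The following sketch mimics that behavior in plain Python. The rows and the helper `distances_for` are hypothetical.

```python
# Hypothetical rows: (airline, distance in miles, delay in minutes).
flights = [
    ("CO", 1400, 200),
    ("CO", 300, 10),
    ("US", 900, 240),
    ("AA", 2500, 190),
    ("AA", 700, 185),
    ("AA", 1200, 0),
]

# WHERE Delay > 180: applies to the whole query, before aggregation.
delayed = [row for row in flights if row[2] > 180]

def distances_for(airline, rows):
    """FILTER (WHERE Reporting_Airline = airline): input for one aggregate only."""
    return [distance for code, distance, _ in rows if code == airline]

# Each aggregate sees only its own airline's delayed flights.
per_airline = {code: distances_for(code, delayed) for code in ("CO", "US", "AA")}
print(per_airline)
```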

Run the next cell to plot, by airline, the number of flights in a particular hour whose delay exceeded the third-quartile delay of the two-week period before.

Table `b` lists, for each `Reporting_Airline`, the third-quartile `Delay` within a two-week period.

A `JOIN` against `example-flights-quantiles` then uses this data to `COUNT` the number of flights, per `Reporting_Airline`, with a `Delay` above the third quartile.

The sorted results are then put into a pandas DataFrame and plotted.

In [None]:
sql='''
WITH b AS (
SELECT
  "Reporting_Airline",
  APPROX_QUANTILE_DS(Delay,.75) AS "75delay"
FROM "example-flights-quantiles"
WHERE (TIMESTAMP '2005-11-01' <= "__time" AND "__time" < TIMESTAMP '2005-11-14')
GROUP BY 1)

SELECT
  a."Reporting_Airline",
  COUNT(*) AS "Flights"
FROM "example-flights-quantiles" a
LEFT JOIN b ON a.Reporting_Airline = b.Reporting_Airline
WHERE TIME_FLOOR(a."__time", 'PT1H') = TIMESTAMP '2005-11-15 11:00:00'
AND a."Delay" > b."75delay"
GROUP BY 1
ORDER BY 2 DESC
'''

df1 = pd.DataFrame(sql_client.sql(sql))

df1.plot.bar(x='Reporting_Airline', y='Flights')
plt.show()

A query like the one above could be adapted to work with streaming ingestion of event data, giving a real-time view of the numbers of flights currently delayed that exceed the 75th percentile delay from the previous two weeks.

### Estimate multiple quantile values

In the examples above, you used separate `APPROX_QUANTILE_DS` functions to find the values at different positions in the sketch.

```sql
  APPROX_QUANTILE_DS(Distance,.25) AS "25percent",
  APPROX_QUANTILE_DS(Distance,.5) AS "50percent",
  APPROX_QUANTILE_DS(Distance,.75) AS "75percent"
```

Alternatively, the values can be returned in an [`ARRAY`](https://druid.apache.org/docs/latest/querying/sql-array-functions) by using the [`DS_GET_QUANTILES`](https://druid.apache.org/docs/latest/querying/sql-functions#ds_get_quantiles) function.

Run the following cell, which returns two `ARRAY` columns instead of a separate column per quantile: one for quartiles and one for deciles, broken down by `Reporting_Airline` on a particular day.

In [None]:
sql='''
SELECT
  Reporting_Airline,
  DS_GET_QUANTILES(DS_QUANTILES_SKETCH(Distance),0, .25, .5, .75, 1) AS "quartiles",
  DS_GET_QUANTILES(DS_QUANTILES_SKETCH(Distance),0, .1, .2, .3, .4, .5, .6, .7, .8, .9, 1) AS "deciles"
FROM "example-flights-quantiles"
WHERE TIME_FLOOR("__time", 'P1D') = TIMESTAMP '2005-11-24'
GROUP BY 1
'''

display.sql(sql)
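
If you fetch results like these into Python, you may want one column per quantile again. The sketch below shows one way to unpack them. The rows are hypothetical, and whether an array arrives as a Python list or as JSON text depends on the client, so the helper `as_list` accepts both as an assumption.

```python
import json

def as_list(value):
    """Return a Python list whether the array arrived as a list or as JSON text."""
    return json.loads(value) if isinstance(value, str) else list(value)

# Hypothetical result rows; real values come from running the query above.
rows = [
    {"Reporting_Airline": "AA", "quartiles": "[100.0, 400.0, 800.0, 1200.0, 2500.0]"},
    {"Reporting_Airline": "XE", "quartiles": [80.0, 300.0, 500.0, 750.0, 1400.0]},
]

# One label per fraction requested in the query: 0, .25, .5, .75, 1.
labels = ["q0", "q25", "q50", "q75", "q100"]

unpacked = [
    {"Reporting_Airline": row["Reporting_Airline"],
     **dict(zip(labels, as_list(row["quartiles"])))}
    for row in rows
]

for row in unpacked:
    print(row)
```

A list of flat dictionaries like `unpacked` can be passed straight to `pd.DataFrame` for plotting.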

## Estimating the rank (cumulative distribution functions)

Druid provides two functions for estimating the [normalized rank](https://datasketches.apache.org/docs/Quantiles/SketchingQuantilesAndRanksTutorial.html) of a value within the data. In other words, they estimate what percentage of records fall below a given value.
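
As a minimal illustration of that definition, the following sketch computes an exact normalized rank for a small, hypothetical list of distances. The helper `normalized_rank` is an assumption for illustration, and it counts values strictly below the given value. Druid estimates the rank from a sketch instead.

```python
from bisect import bisect_left

def normalized_rank(values, value):
    """Fraction of values that are strictly below `value`."""
    ordered = sorted(values)
    return bisect_left(ordered, value) / len(ordered)

# Hypothetical flight distances in miles.
distances = [1200, 300, 450, 800, 150, 2400, 950, 600]

# Five of the eight distances are shorter than 950 miles.
print(normalized_rank(distances, 950))  # 0.625
```

Rank is the inverse of quantile: a quantile maps a position to a value, and a rank maps a value back to a position.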

### Estimate a single rank position

Use the [`DS_RANK`](https://druid.apache.org/docs/latest/querying/sql-functions#ds_rank) function to estimate where a given value falls in the distribution of the data, returned as a percentage.

Run the following cell to obtain the rank of a particular value in the "XE" flights on 24 November 2005.

Notice that the `DS_QUANTILES_SKETCH` function is used to provide `DS_RANK` with the sketch representation.

In [None]:
sql='''
SELECT
  DS_RANK(DS_QUANTILES_SKETCH(Distance),950) AS "rank"
FROM "example-flights-quantiles"
WHERE TIME_FLOOR("__time", 'P1D') = TIMESTAMP '2005-11-24'
AND "Reporting_Airline" = 'XE'
'''

display.sql(sql)

This tells us that about 90% of "XE" flights on that day flew a shorter distance than this. The results for "XE" in the previous cell, which showed the deciles for the same flights, corroborate that.

Run the cell below to see where a delay of 30 minutes ranks among the delayed flights reported by each `Reporting_Airline` in the week commencing 31 October 2005.

In [None]:
sql='''
SELECT
  Reporting_Airline,
  DS_RANK(DS_QUANTILES_SKETCH(Delay),30) AS "rank"
FROM "example-flights-quantiles"
WHERE TIME_FLOOR("__time", 'P1W') = TIMESTAMP '2005-10-31'
AND Delay > 0
GROUP BY 1
ORDER BY 2 DESC
'''

display.sql(sql)

### Estimate multiple rank positions

To estimate multiple rank positions, use the [`DS_CDF`](https://druid.apache.org/docs/latest/querying/sql-functions#ds_cdf) function.

Run the cell below to use this function to return an `ARRAY` of ranking positions for 6 different delays per `Reporting_Airline`, from 30 minutes through to 16 hours. The results show what percentage of reported delays in the dataset were under each of the lengths of time we have specified.

The `MAX` is added as a sorting criterion.

In [None]:
sql='''
SELECT
  Reporting_Airline,
  DS_CDF(DS_QUANTILES_SKETCH(Delay), 30, 60, 120, 240, 480, 960) AS "cdf",
  MAX(Delay) AS "Longest Delay"
FROM "example-flights-quantiles"
WHERE TIME_FLOOR("__time", 'P1W') = TIMESTAMP '2005-11-07'
AND Delay > 0
GROUP BY 1
ORDER BY 3 DESC
'''

display.sql(sql)
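
To see what a cumulative distribution looks like on raw data, the following sketch computes exact values for a small, hypothetical list of delays at the same split points used in the query. The helper `cdf_at` is an assumption for illustration. The exact shape of the array Druid returns, such as whether it includes a final entry for values above the last split point, is best checked against the query output.

```python
from bisect import bisect_left

def cdf_at(values, split_points):
    """For each split point, return the fraction of values strictly below it."""
    ordered = sorted(values)
    return [bisect_left(ordered, point) / len(ordered) for point in split_points]

# Hypothetical delays in minutes.
delays = [5, 12, 20, 35, 45, 70, 90, 150, 300, 1000]

print(cdf_at(delays, [30, 60, 120, 240, 480, 960]))
# [0.3, 0.5, 0.7, 0.8, 0.9, 0.9]
```

The values never decrease from left to right, because each split point includes everything counted by the one before it.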

## Estimating histograms

To help you understand the shape of a distribution, Druid provides the [`DS_HISTOGRAM`](https://druid.apache.org/docs/latest/querying/sql-functions#ds_histogram) function, which also acts on an underlying Quantiles sketch.

Run the following cell to see how the `DS_HISTOGRAM` function estimates the number of flights taking off from Los Angeles that fall into each `Distance` bucket.

In [None]:
sql='''
SELECT Origin,
  DS_HISTOGRAM(DS_QUANTILES_SKETCH(Distance), 250, 500, 750, 1000, 1250, 1500, 1750, 2000) AS Histogram
FROM "example-flights-quantiles"
WHERE "Origin" = 'LAX'
AND TIME_FLOOR("__time", 'P1W') = TIMESTAMP '2005-10-31'
GROUP BY 1
'''

display.sql(sql)

The results show an estimated count of the number of flights in each of the `Distance` buckets specified, plus one final bucket for the remainder.
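
The bucket layout can be sketched in plain Python with exact counts on a small, hypothetical list of distances. The helper `histogram` is an assumption for illustration, and so is its treatment of a value that equals a split point, which it places in the higher bucket.

```python
from bisect import bisect_right

def histogram(values, split_points):
    """Count values per bucket; N split points produce N + 1 buckets."""
    counts = [0] * (len(split_points) + 1)
    for value in values:
        # bisect_right sends a value equal to a split point to the higher bucket.
        counts[bisect_right(split_points, value)] += 1
    return counts

# Hypothetical flight distances in miles.
distances = [100, 250, 300, 600, 1100, 1800, 2500, 2600]
split_points = [250, 500, 750, 1000, 1250, 1500, 1750, 2000]

print(histogram(distances, split_points))
# [1, 2, 1, 0, 1, 0, 0, 1, 2]
```

Eight split points give nine buckets, and the last bucket holds the two distances of 2,000 miles or more.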

Combining `DS_QUANTILES_SKETCH` with `FILTER` focuses the distribution being analysed.

Run the next cell, which produces histograms for two specific `Reporting_Airline` values with flights from Los Angeles.

In [None]:
sql='''
SELECT
  DS_HISTOGRAM(DS_QUANTILES_SKETCH(Distance) FILTER (WHERE "Reporting_Airline" = 'AA'), 250, 500, 750, 1000, 1250, 1500, 1750, 2000) AS "Histogram-AA",
  DS_HISTOGRAM(DS_QUANTILES_SKETCH(Distance) FILTER (WHERE "Reporting_Airline" = 'AS'), 250, 500, 750, 1000, 1250, 1500, 1750, 2000) AS "Histogram-AS"
FROM "example-flights-quantiles"
WHERE "Origin" = 'LAX'
AND TIME_FLOOR("__time", 'P1W') = TIMESTAMP '2005-10-31'
'''

display.sql(sql)

## Summary

* Apache Datasketches Quantiles sketches are supported in Apache Druid
* SQL functions exist for estimating individual ranks (`DS_RANK`) and quantiles (`DS_GET_QUANTILE`)
* Druid can return an `ARRAY` of ranks (`DS_CDF`) and quantiles (`DS_GET_QUANTILES`)
* Histograms can be estimated using the `DS_HISTOGRAM` function

## Learn more

* Read the [quantiles extension](https://druid.apache.org/docs/latest/development/extensions-core/datasketches-quantiles) documentation
* Read more about the functions for [creating](https://druid.apache.org/docs/latest/querying/sql-aggregations#quantiles-sketch-functions) quantile sketches
* Read the [Druid Data Cookbook](https://blog.hellmar-becker.de/2022/03/20/druid-data-cookbook-quantiles-in-druid-with-datasketches/) article by Hellmar Becker on the quantiles sketch