# Generating Apache DataSketches at ingestion time

It's extremely common for analysts to want to count unique occurrences of some dimension value in data. Because Druid has a long history of handling large volumes of data, it offers an approximation technique based on Apache DataSketches to speed up this calculation.

A further speed boost can be achieved by storing sketches inside `TABLE`s directly, making query execution for `COUNT(DISTINCT)` operations leaner and more efficient.

In this tutorial, you work through generating both HyperLogLog and Theta sketch objects as part of an ingestion.

## Prerequisites

This tutorial works with Druid 26.0.0 or later.

#### Run using Docker

Launch this tutorial and all prerequisites using the `druid-jupyter` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see [Docker for Jupyter Notebook tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).
   
#### Run without using Docker

If you do not use the Docker Compose environment, you need the following:

* A running Apache Druid instance, with a `DRUID_HOST` local environment variable containing the server name of your Druid router.
* [druidapi](https://github.com/apache/druid/blob/master/examples/quickstart/jupyter-notebooks/druidapi/README.md), a Python client for Apache Druid. Follow the instructions in the Install section of the README file.
* [matplotlib](https://matplotlib.org/), a library for creating visualizations in Python.
* [pandas](https://pandas.pydata.org/), a data analysis and manipulation tool.

### Initialize Python

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If successful, the Druid version number will be shown in the output.

In [None]:
import druidapi
import os

if 'DRUID_HOST' not in os.environ:
    druid_host = "http://localhost:8888"
else:
    druid_host = f"http://{os.environ['DRUID_HOST']}:8888"
    
print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)

display = druid.display
sql_client = druid.sql
status_client = druid.status

status_client.version

Finally, run the following cell to import additional Python modules that you will use.

In [None]:
import json
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd

## Generating sketch objects with SQL

Run the next cell to get an indication of what a sketch object looks like.

It applies a `GROUP BY` to `Reporting_Airline` and uses the [`DS_HLL`](https://druid.apache.org/docs/26.0.0/querying/sql-functions.html#ds_hll) function to create sets of `Tail_Number`s inside a "HyperLogLog" sketch object.

The results show, for each `Reporting_Airline`, the highly optimized, aggregated list of the `Tail_Number`s.

Instead of Druid calculating these with every query, this notebook will walk through storing these in a table and addressing them directly.

In [None]:
sql = '''
SELECT
    "Reporting_Airline",
    DS_HLL("Tail_Number") AS "Sketch"
FROM "On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11"
GROUP BY 1
LIMIT 5
'''

display.sql(sql)
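
To build intuition for what a sketch holds, the following is a minimal, illustrative HyperLogLog in plain Python. It is not the Apache DataSketches implementation that Druid uses: the class name `ToyHLL`, the SHA-1 hash, and the generated tail numbers are assumptions made for this sketch. It shows the two properties the rest of this notebook relies on: a sketch estimates a distinct count from a fixed number of registers, and two sketches can be merged into one.

```python
import hashlib
import math


class ToyHLL:
    """Illustrative HyperLogLog; not the DataSketches implementation."""

    def __init__(self, lg_k=12):
        self.lg_k = lg_k
        self.m = 1 << lg_k                  # number of registers
        self.registers = [0] * self.m

    def add(self, value):
        # Derive a 64-bit hash of the value.
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        width = 64 - self.lg_k
        index = h >> width                  # top lg_k bits pick a register
        rest = h & ((1 << width) - 1)       # remaining bits
        # Rank is the position of the leftmost 1-bit in the remaining bits.
        rank = width - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other):
        # Union of two sketches with the same lg_k: element-wise maximum.
        merged = ToyHLL(self.lg_k)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


# Two overlapping fleets of made-up tail numbers, 10,000 distinct in total.
fleet_a = ToyHLL()
fleet_b = ToyHLL()
for i in range(6000):
    fleet_a.add(f"N{i}")
for i in range(4000, 10000):
    fleet_b.add(f"N{i}")

combined = fleet_a.merge(fleet_b)
print(round(combined.estimate()))
```

The merged sketch is identical to one built from all the values at once, which is why stored sketches can be combined at query time without access to the raw values.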

### Creating sketches during batch ingestion

The next cell incorporates the `DS_HLL` function above into a `REPLACE` statement to build a new table, `flights-counts`.

As with the previous SQL, `GROUP BY` is the key to creating an aggregated list of values to go into the sets represented by the sketches. In addition to `Reporting_Airline`, however, we add `Origin` and `Dest`. And because we need to have a `__time` field, the sets are also broken down by hour using a `TIME_FLOOR` function.

Notice that the SQL does not `SELECT` the original `Tail_Number` values. If we kept that field as a dimension, the `GROUP BY` would not aggregate any rows into the sketch: there would be a 1:1 relationship between each row and each `Tail_Number`, which is the opposite of what we are designing for! As a consequence, the raw `Tail_Number` values are no longer available to SQL queries on the new table, for example in a `GROUP BY` or a `WHERE`.

You will also spot a [`DS_THETA`](https://druid.apache.org/docs/26.0.0/querying/sql-functions.html#ds_theta) function, giving the ability to do intersection and difference operations in addition to unions.

Run this cell to store the ingestion SQL into the `sql` variable.

In [None]:
sql='''
REPLACE INTO "flights-counts" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://static.imply.io/example-data/flight_on_time/flights/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11.csv.zip"]}',
    '{"type":"csv","findColumnsFromHeader":true}'
  )
) EXTEND ("depaturetime" VARCHAR, "arrivalime" VARCHAR, "Year" BIGINT, "Quarter" BIGINT, "Month" BIGINT, "DayofMonth" BIGINT, "DayOfWeek" BIGINT, "FlightDate" VARCHAR, "Reporting_Airline" VARCHAR, "DOT_ID_Reporting_Airline" BIGINT, "IATA_CODE_Reporting_Airline" VARCHAR, "Tail_Number" VARCHAR, "Flight_Number_Reporting_Airline" BIGINT, "OriginAirportID" BIGINT, "OriginAirportSeqID" BIGINT, "OriginCityMarketID" BIGINT, "Origin" VARCHAR, "OriginCityName" VARCHAR, "OriginState" VARCHAR, "OriginStateFips" BIGINT, "OriginStateName" VARCHAR, "OriginWac" BIGINT, "DestAirportID" BIGINT, "DestAirportSeqID" BIGINT, "DestCityMarketID" BIGINT, "Dest" VARCHAR, "DestCityName" VARCHAR, "DestState" VARCHAR, "DestStateFips" BIGINT, "DestStateName" VARCHAR, "DestWac" BIGINT, "CRSDepTime" BIGINT, "DepTime" BIGINT, "DepDelay" BIGINT, "DepDelayMinutes" BIGINT, "DepDel15" BIGINT, "DepartureDelayGroups" BIGINT, "DepTimeBlk" VARCHAR, "TaxiOut" BIGINT, "WheelsOff" BIGINT, "WheelsOn" BIGINT, "TaxiIn" BIGINT, "CRSArrTime" BIGINT, "ArrTime" BIGINT, "ArrDelay" BIGINT, "ArrDelayMinutes" BIGINT, "ArrDel15" BIGINT, "ArrivalDelayGroups" BIGINT, "ArrTimeBlk" VARCHAR, "Cancelled" BIGINT, "CancellationCode" VARCHAR, "Diverted" BIGINT, "CRSElapsedTime" BIGINT, "ActualElapsedTime" BIGINT, "AirTime" BIGINT, "Flights" BIGINT, "Distance" BIGINT, "DistanceGroup" BIGINT, "CarrierDelay" BIGINT, "WeatherDelay" BIGINT, "NASDelay" BIGINT, "SecurityDelay" BIGINT, "LateAircraftDelay" BIGINT, "FirstDepTime" VARCHAR, "TotalAddGTime" VARCHAR, "LongestAddGTime" VARCHAR, "DivAirportLandings" VARCHAR, "DivReachedDest" VARCHAR, "DivActualElapsedTime" VARCHAR, "DivArrDelay" VARCHAR, "DivDistance" VARCHAR, "Div1Airport" VARCHAR, "Div1AirportID" VARCHAR, "Div1AirportSeqID" VARCHAR, "Div1WheelsOn" VARCHAR, "Div1TotalGTime" VARCHAR, "Div1LongestGTime" VARCHAR, "Div1WheelsOff" VARCHAR, "Div1TailNum" VARCHAR, "Div2Airport" VARCHAR, "Div2AirportID" VARCHAR, "Div2AirportSeqID" VARCHAR, "Div2WheelsOn" VARCHAR, "Div2TotalGTime" 
VARCHAR, "Div2LongestGTime" VARCHAR, "Div2WheelsOff" VARCHAR, "Div2TailNum" VARCHAR, "Div3Airport" VARCHAR, "Div3AirportID" VARCHAR, "Div3AirportSeqID" VARCHAR, "Div3WheelsOn" VARCHAR, "Div3TotalGTime" VARCHAR, "Div3LongestGTime" VARCHAR, "Div3WheelsOff" VARCHAR, "Div3TailNum" VARCHAR, "Div4Airport" VARCHAR, "Div4AirportID" VARCHAR, "Div4AirportSeqID" VARCHAR, "Div4WheelsOn" VARCHAR, "Div4TotalGTime" VARCHAR, "Div4LongestGTime" VARCHAR, "Div4WheelsOff" VARCHAR, "Div4TailNum" VARCHAR, "Div5Airport" VARCHAR, "Div5AirportID" VARCHAR, "Div5AirportSeqID" VARCHAR, "Div5WheelsOn" VARCHAR, "Div5TotalGTime" VARCHAR, "Div5LongestGTime" VARCHAR, "Div5WheelsOff" VARCHAR, "Div5TailNum" VARCHAR, "Unnamed: 109" VARCHAR))
SELECT
  TIME_FLOOR(TIME_PARSE("depaturetime"), 'PT1H') AS "__time",
  "Reporting_Airline",
  "Origin",
  "Dest",
  COUNT(*) AS "Events",
  MAX("Distance") AS "Distance_Max",
  MIN("Distance") AS "Distance_Min",
  DS_HLL("Tail_Number") AS "Tail_Number_HLL",
  DS_THETA("Tail_Number") AS "Tail_Number_THETA"
FROM "ext"
GROUP BY 1, 2, 3, 4
PARTITIONED BY DAY
'''
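
The effect of the `GROUP BY` can be pictured with a small plain-Python sketch. The rows below are made-up sample data, and exact Python `set`s stand in for the sketch columns; Druid stores compact approximate sketches instead.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw rows: (departure time, airline, origin, dest, tail number).
rows = [
    ("2005-11-01 08:05:00", "DL", "ATL", "DFW", "N101"),
    ("2005-11-01 08:40:00", "DL", "ATL", "DFW", "N102"),
    ("2005-11-01 08:55:00", "DL", "ATL", "DFW", "N101"),
    ("2005-11-01 09:10:00", "DL", "ATL", "DFW", "N103"),
]

rolled_up = defaultdict(lambda: {"Events": 0, "Tail_Numbers": set()})
for ts, airline, origin, dest, tail in rows:
    # Similar in spirit to TIME_FLOOR(..., 'PT1H'): truncate to the hour.
    hour = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").replace(minute=0, second=0)
    key = (hour, airline, origin, dest)
    rolled_up[key]["Events"] += 1
    rolled_up[key]["Tail_Numbers"].add(tail)   # the set plays the role of the sketch

for key, agg in sorted(rolled_up.items()):
    print(key[0], key[1:], agg["Events"], sorted(agg["Tail_Numbers"]))
```

Four input rows collapse into two stored rows. The individual tail numbers are no longer columns of their own, but each stored row still carries enough information to count them.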

A specific context parameter must be used when creating sketches at ingestion time: [`finalizeAggregations`](https://druid.apache.org/docs/26.0.0/multi-stage-query/reference.html#context-parameters). Setting it to `false` prompts Druid to store the sketch itself rather than a finalized estimate. The Druid console applies this setting for you automatically.

As we are running the ingestion programmatically, we must construct a request (`req`) with the appropriate context parameters before we execute the ingestion.

In [None]:
req = sql_client.sql_request(sql)
req.add_context("finalize", "false")
req.add_context("finalizeAggregations", "false")
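
For readers who call the SQL API without `druidapi`, the request built above corresponds roughly to a JSON body with a `query` field and a `context` object. The sketch below only builds and prints such a body. It assumes the standard Druid SQL request format, and the shortened query string is a placeholder rather than a runnable statement.

```python
import json

# Placeholder for the ingestion statement defined earlier in this notebook.
ingest_sql = 'REPLACE INTO "flights-counts" OVERWRITE ALL SELECT ... PARTITIONED BY DAY'

payload = {
    "query": ingest_sql,
    "context": {
        # Keep the intermediate sketch instead of a finalized number.
        "finalize": False,
        "finalizeAggregations": False,
    },
}

body = json.dumps(payload, indent=2)
print(body)
```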

Run the following cell to start the ingestion. Monitor the ingestion task itself in the Druid Console as it runs.

Once finished, you will see the table definition, including the two sketch columns, `Tail_Number_HLL` and `Tail_Number_THETA`. You'll notice that these two columns have a `COMPLEX` type, indicating that they store DataSketches objects.

In [None]:
sql_client.run_task(req)
sql_client.wait_until_ready('flights-counts')
display.table('flights-counts')

### Creating sketches during streaming ingestion

In streaming ingestion, rather than using `DS_HLL` or `DS_THETA`, you include the Druid native equivalent in the [`metricsSpec`](https://druid.apache.org/docs/26.0.0/ingestion/ingestion-spec.html#metricsspec) and, instead of `GROUP BY`, enable [`queryGranularity`](https://druid.apache.org/docs/latest/ingestion/ingestion-spec.html#granularityspec) and `rollup` to truncate the timestamp and pre-aggregate the rows.

The sketch aggregations in the `REPLACE` statement above are therefore equivalent to the following `metricsSpec` entries:

```json
    {
      "type": "HLLSketchBuild",
      "name": "Tail_Number_HLL",
      "fieldName": "Tail_Number",
      "lgK": 12,
      "tgtHllType": "HLL_4"
    },
    {
      "type": "thetaSketch",
      "name": "Tail_Number_THETA",
      "fieldName": "Tail_Number",
      "size": 16384
    }
```
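
To show where these aggregators sit, here is a hedged sketch of the relevant part of a streaming `dataSchema`, written as a Python dictionary. The data source name, the `name` values, the timestamp column, and the granularities are assumptions chosen to mirror the batch example. A real supervisor spec also needs an `ioConfig` and other settings that are not shown here.

```python
import json

data_schema = {
    "dataSource": "flights-counts",
    "timestampSpec": {"column": "depaturetime", "format": "auto"},
    # Tail_Number is deliberately not a dimension: it only feeds the sketches.
    "dimensionsSpec": {"dimensions": ["Reporting_Airline", "Origin", "Dest"]},
    "granularitySpec": {
        "segmentGranularity": "day",
        "queryGranularity": "hour",   # plays the role of TIME_FLOOR(..., 'PT1H')
        "rollup": True,               # plays the role of GROUP BY
    },
    "metricsSpec": [
        {"type": "count", "name": "Events"},
        {
            "type": "HLLSketchBuild",
            "name": "Tail_Number_HLL",
            "fieldName": "Tail_Number",
            "lgK": 12,
            "tgtHllType": "HLL_4",
        },
        {
            "type": "thetaSketch",
            "name": "Tail_Number_THETA",
            "fieldName": "Tail_Number",
            "size": 16384,
        },
    ],
}

print(json.dumps(data_schema, indent=2))
```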

For the purpose of this notebook, we will use the `TABLE` that you have ingested above in batch mode.

Notice that here it's easy to see some internal parameters for sketch generation, like the `lgK` value for HLL. In SQL mode, these are exposed as supplementary parameters to the `DS_HLL` function. Be cautious of changing these values without researching the effects - not just in accuracy but also in terms of performance and segment size.
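
As a rough guide to that trade-off, the textbook HyperLogLog analysis gives a relative standard error of about 1.04 / sqrt(2^lgK), and a layout with 4 bits per register, such as `HLL_4`, needs roughly 2^lgK / 2 bytes for its registers. The sketch below tabulates both. These are approximations from the general HLL literature, not measured Druid figures, and the helper names are hypothetical.

```python
import math


def hll_relative_error(lg_k):
    """Textbook HLL relative standard error for 2**lg_k registers."""
    return 1.04 / math.sqrt(2 ** lg_k)


def hll4_register_bytes(lg_k):
    """Approximate register storage at 4 bits per register, excluding headers."""
    return (2 ** lg_k) // 2


for lg_k in (10, 12, 14, 16):
    error = hll_relative_error(lg_k)
    size = hll4_register_bytes(lg_k)
    print(f"lgK={lg_k:2d}  error~{error:.2%}  registers~{size} bytes")
```

Each increase of 2 in `lgK` halves the expected error but quadruples the register storage per sketch, and that cost is paid for every stored row.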

## Using COUNT(DISTINCT) on stored sketches

Use specific SQL functions when carrying out `COUNT(DISTINCT)` operations on stored sketches.

* For HLL, [`APPROX_COUNT_DISTINCT_DS_HLL`](https://druid.apache.org/docs/26.0.0/querying/sql-functions.html#approx_count_distinct_ds_hll), and
* for Theta, [`APPROX_COUNT_DISTINCT_DS_THETA`](https://druid.apache.org/docs/26.0.0/querying/sql-functions.html#approx_count_distinct_ds_theta).

The following cell estimates the number of unique airplanes over a specific time period using both the HyperLogLog and Theta sketch objects in the `TABLE` that you just created. The query groups the results by `Reporting_Airline` and also includes a `SUM` of the stored `Distance_Max` values, labeled `Miles_Flown`.

In [None]:
sql='''
SELECT
   "Reporting_Airline",
   SUM("Distance_Max") AS "Miles_Flown",
   APPROX_COUNT_DISTINCT_DS_HLL("Tail_Number_HLL") AS "HLLApprox",
   APPROX_COUNT_DISTINCT_DS_THETA("Tail_Number_THETA") AS "ThetaApprox"
FROM "flights-counts"
WHERE TIMESTAMP '2005-10-31' <= __time AND __time <= TIMESTAMP '2005-11-20'
GROUP BY 1
'''

display.sql(sql)
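
These functions merge the stored sketches before estimating, which matters because distinct counts are not additive. A plain-Python sketch with made-up tail numbers, again using exact sets in place of sketches, shows the difference between merging and summing per-row counts:

```python
# Hypothetical stored rows for one airline: one set of tail numbers per hour.
hourly_sets = [
    {"N101", "N102"},
    {"N102", "N103"},
    {"N101", "N103", "N104"},
]

summed = sum(len(s) for s in hourly_sets)    # counts the same aircraft repeatedly
merged = len(set().union(*hourly_sets))      # merge first, then count

print(summed, merged)   # 7 4
```

Storing a plain count per row would only support the first, inflated figure. Storing a sketch per row keeps the second one available for any combination of rows.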

## Calculating set union with stored sketches

Using stored sketches as part of a union operation is very similar to when they have not been stored. The difference is that you pass the stored sketch column, rather than the raw data, to the `DS_HLL` function.

Run the next cell, which:

* Constructs three sets, each represented as a HyperLogLog sketch, using `DS_HLL` against the stored `Tail_Number_HLL` sketches - it also applies a `FILTER` to isolate flights out of three specific cities,
* Applies `HLL_SKETCH_UNION` to union the three sets, and
* Estimates the resulting set size with `HLL_SKETCH_ESTIMATE`.

It also includes the same operation using Theta sketches. Again, three sets are created from the underlying sketch data, which are then unioned, and the size of the result is estimated.

It uses `TIME_FLOOR` to give a week-by-week `GROUP BY` of the data.

In [None]:
sql='''
SELECT
  TIME_FLOOR("__time",'P1W') AS "Week commencing",
  HLL_SKETCH_ESTIMATE(
     HLL_SKETCH_UNION(
       DS_HLL("Tail_Number_HLL") FILTER (WHERE "Origin"='ATL'),
       DS_HLL("Tail_Number_HLL") FILTER (WHERE "Origin"='DFW'),
       DS_HLL("Tail_Number_HLL") FILTER (WHERE "Origin"='SFO')
      )
    ) AS "AnyThreeCity-HLL",
  THETA_SKETCH_ESTIMATE(
     THETA_SKETCH_UNION(
       DS_THETA("Tail_Number_THETA") FILTER (WHERE "Origin"='ATL'),
       DS_THETA("Tail_Number_THETA") FILTER (WHERE "Origin"='DFW'),
       DS_THETA("Tail_Number_THETA") FILTER (WHERE "Origin"='SFO')
      )
    ) AS "AnyThreeCity-THETA"
FROM "flights-counts"
WHERE TIMESTAMP '2005-10-31' <= __time AND __time <= TIMESTAMP '2005-11-20'
GROUP BY 1
'''

display.sql(sql)
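
The shape of this query (filter per city, union, estimate, grouped by week) can be mirrored in plain Python. This is only an illustration: the stored rows are made up, exact sets stand in for sketches, and the `week_commencing` helper assumes that `TIME_FLOOR` with `'P1W'` floors to the Monday of each week.

```python
from collections import defaultdict
from datetime import date, timedelta


def week_commencing(day):
    """Floor a date to the Monday of its week (assumed to match 'P1W')."""
    return day - timedelta(days=day.weekday())


# Hypothetical stored rows: (day, origin, set of tail numbers).
stored = [
    (date(2005, 11, 1), "ATL", {"N101", "N102"}),
    (date(2005, 11, 3), "DFW", {"N102", "N103"}),
    (date(2005, 11, 4), "JFK", {"N900"}),
    (date(2005, 11, 8), "SFO", {"N101", "N104"}),
]

cities = ("ATL", "DFW", "SFO")
per_week = defaultdict(lambda: {city: set() for city in cities})
for day, origin, tails in stored:
    if origin in cities:                          # the FILTER (WHERE "Origin" = ...) step
        per_week[week_commencing(day)][origin] |= tails

for week, by_city in sorted(per_week.items()):
    any_three = set().union(*by_city.values())    # the *_SKETCH_UNION step
    print(week, len(any_three))                   # the *_SKETCH_ESTIMATE step
```

An aircraft that departs from more than one of the three cities in the same week is counted once, which is the behavior the union functions provide on the stored sketches.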

## Calculating set intersection and difference with pre-existing Theta sketches

The same techniques apply to intersection and difference operations on stored sketches as to raw data. The difference is the column from which the sets are created through `DS_THETA`: the stored sketch column rather than the raw dimension.

Run the following cell, which uses the stored Theta sketch to perform both a difference and an intersection operation on two sets of airplanes, one from the week commencing 31st October, and another for the week commencing 7th November.

In [None]:
sql='''
SELECT
  "Reporting_Airline",
  THETA_SKETCH_ESTIMATE(
     THETA_SKETCH_NOT(
       DS_THETA("Tail_Number_THETA") FILTER (WHERE TIME_FLOOR("__time",'P1W') = TIMESTAMP '2005-10-31'),
       DS_THETA("Tail_Number_THETA") FILTER (WHERE TIME_FLOOR("__time",'P1W') = TIMESTAMP '2005-11-07')
      )
    ) AS "WeekOneNotTwo",
  THETA_SKETCH_ESTIMATE(
     THETA_SKETCH_INTERSECT(
       DS_THETA("Tail_Number_THETA") FILTER (WHERE TIME_FLOOR("__time",'P1W') = TIMESTAMP '2005-10-31'),
       DS_THETA("Tail_Number_THETA") FILTER (WHERE TIME_FLOOR("__time",'P1W') = TIMESTAMP '2005-11-07')
      )
    ) AS "WeekOneAndTwo"
FROM "flights-counts"
GROUP BY 1
'''

display.sql(sql)
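
To see why Theta sketches support intersection and difference, the following is a toy "k minimum values" style Theta sketch in plain Python. It is a simplified illustration, not the DataSketches implementation: the class name `ToyTheta`, the SHA-1 hash, and the tail numbers are assumptions made for this sketch. Each sketch keeps the hashes that fall below a threshold `theta`, and set operations work on those retained hashes.

```python
import hashlib


class ToyTheta:
    """Illustrative KMV-style Theta sketch; not the DataSketches implementation."""

    def __init__(self, k=1024, theta=1.0, hashes=None):
        self.k = k
        self.theta = theta                   # sampling threshold in (0, 1]
        self.hashes = set(hashes or ())      # retained hashes, all below theta

    @staticmethod
    def _hash(value):
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2.0 ** 64   # roughly uniform in [0, 1)

    def _trim(self):
        # Keep at most k hashes; the largest one removed becomes the new theta.
        while len(self.hashes) > self.k:
            self.theta = max(self.hashes)
            self.hashes.remove(self.theta)

    def add(self, value):
        h = self._hash(value)
        if h < self.theta:
            self.hashes.add(h)
            self._trim()

    def estimate(self):
        return len(self.hashes) / self.theta

    def union(self, other):
        theta = min(self.theta, other.theta)
        result = ToyTheta(self.k, theta, (h for h in self.hashes | other.hashes if h < theta))
        result._trim()
        return result

    def intersect(self, other):
        theta = min(self.theta, other.theta)
        return ToyTheta(self.k, theta, (h for h in self.hashes & other.hashes if h < theta))

    def a_not_b(self, other):
        theta = min(self.theta, other.theta)
        return ToyTheta(self.k, theta, (h for h in self.hashes - other.hashes if h < theta))


# Made-up tail numbers seen in two consecutive weeks.
week_one = ToyTheta()
week_two = ToyTheta()
for tail in ("N101", "N102", "N103"):
    week_one.add(tail)
for tail in ("N102", "N103", "N104"):
    week_two.add(tail)

print(week_one.a_not_b(week_two).estimate())     # like "WeekOneNotTwo"
print(week_one.intersect(week_two).estimate())   # like "WeekOneAndTwo"
```

With so few values, neither sketch has reached its capacity `k`, so the results are exact. Once a sketch holds more than `k` distinct values, `theta` drops below 1 and the results become estimates.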

## Conclusion

* Druid can be pre-loaded with sketch objects that speed up approximation both in batch and streaming ingestion
* Specific SQL functions are used to address the sketch for `COUNT(DISTINCT)` operations

## Learn more

* Watch [Employ Approximation](https://youtu.be/fSWwJs1gCvQ?list=PLDZysOZKycN7MZvNxQk_6RbwSJqjSrsNR) by Peter Marshall
* Read [Ingesting Data Sketches into Apache Druid](https://blog.hellmar-becker.de/2022/12/26/ingesting-data-sketches-into-apache-druid/) by Hellmar Becker
* Read more about the native "aggregator" functions for streaming ingestion
    * [ThetaSketch function](https://druid.apache.org/docs/26.0.0/development/extensions-core/datasketches-theta.html)
    * [HyperLogLog function](https://druid.apache.org/docs/26.0.0/development/extensions-core/datasketches-hll.html)