# Generating and ingesting Apache DataSketches

It's extremely common for analysts to want to count unique occurrences of some dimension value in data. Because Druid was built to handle large volumes of data, it includes an approximation technique from computer science that speeds up this calculation. In this tutorial, you work through some examples, see the effect of turning approximation on and off, and make queries even faster by pre-generating the objects that Druid uses to execute them.

## Prerequisites

This tutorial works with Druid 26.0.0 or later.

#### Run using Docker

Launch this tutorial and all prerequisites using the `druid-jupyter` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see [Docker for Jupyter Notebook tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).
   
#### Run without using Docker

If you do not use the Docker Compose environment, you need the following:

* A running Apache Druid instance, with a `DRUID_HOST` local environment variable containing the hostname of your Druid router.
* [druidapi](https://github.com/apache/druid/blob/master/examples/quickstart/jupyter-notebooks/druidapi/README.md), a Python client for Apache Druid. Follow the instructions in the Install section of the README file.
* [matplotlib](https://matplotlib.org/), a library for creating visualizations in Python.
* [pandas](https://pandas.pydata.org/), a data analysis and manipulation tool.

### Initialize Python

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If successful, the Druid version number will be shown in the output.

In [1]:
import druidapi
import os

# Use the DRUID_HOST environment variable if set; otherwise assume localhost.
druid_host = f"http://{os.environ.get('DRUID_HOST', 'localhost')}:8888"
    
print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)

display = druid.display
sql_client = druid.sql
status_client = druid.status

status_client.version

Opening a connection to http://druid-master-0.lan:8888.


'26.0.0'

Finally, run the following cell to import additional Python modules that you will use.

In [3]:
import json
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd

## Generating sketch objects with SQL

To improve the performance of approximate `COUNT(DISTINCT)` queries, you can reference table columns that already contain sketches.

Run the next cell to get an indication of what a HyperLogLog sketch looks like. It applies a `GROUP BY` to `Reporting_Airline` and uses the [`DS_HLL`](https://druid.apache.org/docs/26.0.0/querying/sql-functions.html#ds_hll) function to create sets of `Tail_Number`s inside a "HyperLogLog" sketch object.

In [8]:
sql = '''
SELECT
    "Reporting_Airline",
    DS_HLL("Tail_Number") AS "Sketch"
FROM "On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11"
GROUP BY 1
LIMIT 5
'''

display.sql(sql)

Reporting_Airline,Sketch
AA,"""CgEHDAAYAAIAAAAAAAAAAAAAAAAOp6xAAAAAAAAAAACDDQAAAAAAAAAAAAAAAAMAACAAIAAAIAAAAAEAAAAAAAAABQMAACAAAAAAAAAAAAAAAAAAAAADAAATEAAAAAAAAAAAAAAEIAAAAAAAAAAAAAAANggAEAAAAAAAAQABAQAAAgAAAAABEAADAgAAAAIAAAIAEAADAgAwABAAABAAAAAGAgAAABQAAABCBAMAAQAAAAEAAQAAAAAAABABAAAAACAAAAAAACIAAQYAAQAAAgACEDABIABgIAAAAAAAAAAAAAAAAAAAIAAAAAAAAAACAAAQAAAAARAAAgADAAAAAAAGEAAAAgAQAAEBAAAAAAAAAAEgUAAAAAAAAQBQUAAgAAAAAAAAEQAgAAAAAAAQASIBEgAAAAAAAAAAAAAQAAAAAQABAAAAACAAAAAAAAUAAAAAAAAAAAAAAAAAQAAAEQAAAAAAAAEAAQAQAhAlAAAAAAAAAAQAAAAAAAAAAAAAABABEACQAAAAEAAAIAAAAAAAIAAgAAAAAAEBAAEAAAQAAAAABAAAAQAQAAAAAhAAAAAAAAAAAAAJMAAAAAAAAAAAAAAAAAAAAAAgABMABQAAEwAQAAAAAAAAAAAAAAAQQAAAAAAwEQAQABAAAAAAAAAAEAEQMAAQAAAAAAAAAgEAAAAwAQAAAAAAAAAgIAIAACACMgIAIAACAQAAYDAAEAAAAAAAAAABACACAgAAEAAAAAEBEQQAAAAAERAAAAAgAAAAAAAAAgAAAAAAAAAAAAACFAAABQAAAAAAIAEAAAAAAAADABADAAAQYDQAAQABEAAAAQAAAQAAAAAAEAAAEQAAAQAAAAAwIAAAAAABAQAAAAAwAAAAQAAAEAEQAAAQMAAAAAATAAAAAAAAAAAAIwAAAAABEAAAAAAAAAAAEAAAAAAgAAAAAAAAAAAAABAAAQAAAAAAAAAAAAJQAQAAAAAAAgIDAAAAEAACAAAAEAACkAAAEAAAAAAAATAAABAAAAAQBgAAEAAAEAAAACIAAgAAIAAAEAAAAAAAABAAAAAAAAJQEAAABAAAAAAAAAAgAAABAAAAAQAAFAAAAAEAAAAAAAAAEAAAAAAAAAEAAQAAAFAAAAUgABAQAQAAAAAAAhAAAAABAAAAAAABAAABIAAAAAAQBQAAAQAAABAgAAEAAAAAARAAAAAAAAAAAAAAAQEAAQEAAAAAAAEAAQAAAAAGAAAAAAACAAABAAAAAAAwAAAAAABAAEAQAAAAAAAAAAAQABAAAAAAAAAAAAAAAAADEAAAAAABAAAAAQAAAAAAARAAAAACMAABAAACAAAABRAAAAEAAAAAAAAAADABEAAAAAAAAAAAAAAAAAAAAAABAAAAAwIAAAAAACACAAAAAAABAAAAAAIBMBACQBADAAAAAAAAAAAAAQAAAAAQEAIQABAAEAJAAAAAAAAAAAAAAAAAAhAAAAAAEAAAAAAAAEAwAAAAMAAAAAEAAAAAAWAAAAAAAQAAAAAgEgAAAAAAAAEAAAAwAAAAACEAABAAAgEAAAAAAAAAAAAAQAcAAAAAAQMAAAAQBAAAAAAAEAAhAgAAAAEQAQAwAAAAAQBAIAAAUAAAAAAAARAAUAAAAAAgAAMAEAAAMAAAAAEAAAARQAAAAAAABRAAAwAQAAABAQgAAAAAAAAAEAEAAABgAAACAQACABATAAABAAABAAAAAAAAAwAAEAAQAAAAAAAAMBAAAAAAAAAAcAAQABAgAGAFEAAAAAAAAAAAEAAgAAIQAAADAAAAAAAmACABBAAEEAAAACAgAAAAAAADAQAAAAEAAwAAAAAAAAAAAAEAAAAAAAAAAAEBAABAAAAAAQAAEFAAAgACAAIAAAIAQAAAAEAAAAUAAAACAgABAAAAAAAAAAAAAAACAAIAAAABAAAAAAEGAAAAAAAAAAAAMBABAwAAAiAgAAADBAAAAAADIAAAAAAAAAAAAA
AAAAIAAQAAAAAAAAAAAAAAAAAAAAAAAAEgAAIAAAAAAAAAACAAAAAAABAAAAAQAAACAAAAUAAAEAIgAEAgAAEAAAAAAAAAAAAAACAQAAAAAwABAwAQAAAAAAAAAAAAAAAgABAAAAUAAgIAAAEAAAAAAAAAACACAAAAAAAAAAEAAAABAAAAAAACAgAQACAAAAEAAAAAIAAAAAAAIAMAAAAAEBACAAcAAAFBEAAAAAAAABAAAQAAAJCQAAMAkAEAAANQMBAAAAABAAEAAAAAAAMAAAAAMAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAVAAAAABAAAAAAAAAQAAAAAAAAAQAAYAAAAAAAABAAAAAAAQEAAwAhAAAAADAAAAAAAAAQAAAAAgAAAQABAAAAFCEAAAAAABEAABAAAAAAAAAAAAADAAAAABAAEAAAAAAAAAABAAAAAAIAAAExAAAAAAEQAAAAAAIBAAAAAAEAAAABARAAAAAAEAAAAAAAAAAFABAAACAAAAAABAAAABAAQwEAABAAAAAAAgAAAAAAADAAAABQJAAAAAAhABAAAAAhAgAAAAACAQAAAAUAAAAAQAARAAAAAAAQABEgAQAAAwEAAQAAEAAAAAAAAAAAAhADAAAEAQABAAAAAAYAAAAAAAAAAAAAAAEQAAAAAAAAAAAQIwAAEAAAMAYAADMAAAAAAAEAAwABBAAAAAAAAAAAAAAAAAACAAAAAAMAAAAQAEAAAAABAQAAAAAQAAAAAAAAMAEAACABAAAAAAAAAAAAAQEQ"""
AS,"""AwEHDAgIAAFsAAAAA/xeBARtOA4FXeUJCdpABDXVywgP+j8GEs4iBhQnmQcWHLgH6iypBhiHNAgbGbwKHMFzBB2/zAwjpBMNJe2GESi+/QcpCNAE50XMBS6aoAkwjHAUMaCdBDQAPQQ1d6kHNgbnBTiCKwZi2JgEPIKOCD3kvgQ+9ksHQBiJEkNBAglEeq0FRx7VBx0qDAlKfJUMS59ICE3vfBZOdYILT+ScBFD+TgVTP/kGVpv6EldMegZY+9EJVgNzBlzYlwpgeT0PYS3eBmJ8hA6B4/AHZoo8CmsgGwZsDC4I+7nrCG95pgtwU/IFc6aJBHY7mgd8vK0XfRAJC38lIgaBmQ0ShAwjIoXvPQaKeVoOjFncBY2ucQuRm98Fk4ghCZsVYA+gPNkFUFIuEVNQdAarjxQOrYFqBa55SQaxZpgFs95lEbQkKAS13uMLuYm6CLwDZhK0CLMGvoyRE7+y3QzCWT0PxKZjDsXrKgzf5moLycs8BsyGdQXQdmgTNpzwB0uJlgXeZ7QG6qa1CSjdpg3msP0F5xu9C+regRHuFXAN9X0mBOcRfhX5H4gO+zQ9C+cVGg99gWQI"""
B6,"""AwEHDAcIAAFYAAAAeOKnBxed4A3BodkGhqhZFAfC7QgInDEKiUK9FgskOgkMe94IjV6qHw72eAtVXakGEkMfB6d2agiXgCUEn664BX+Q+wWiNYUTpOS7BozMgA2nU9APqer3DKofBBGsswgJbfW0BjD38gTrkdwPMliaBLVp9gjVMMgHVzCtDjlc9Qs6RjsHu71xBMa9BQc9O/EKvmYVC79G4wpB72gG4T10C+dPTwRGYuoGSDDKCknPAgTLDmAFDu88Bs2tqBJOPxoKT8E2BG2x3wlSWF8GU4bOBdRTwAgOI5QI1ADiBr5MOQ7Y3qsE2TnVBel4FwZcZJQHXcBwBMyEXQdf7x8F4FQiBuHowwdintMHZFiwCyzcSgnmJ3IF537RCWhGhgSkyxYGMJ0rCIljKgZttTkHcO82CXg+1hHVR1YKSNboCOQErg8pU5kE+OMoCnmY/gjTU70EDHc2CLApCwZ+eLsWYYeGBg=="""
CO,"""AwEHDAkIAAFGAQAAAGrVByHBpQsE5hIIB15rEQiIVRcK4p4EDFBPBg+2sQQTUGoHFLbXBhVutgsW+FsEGDRlB5AFFwoamIIVGwxlDhzKRwQfiG4ET6BDEyHAkwgiHuoGIxSpECQ2qxYoWMcNKXBnDixMawgtwucHJkGOBvY7/AWSPeUHMl6/BDAF4gc1hPkENpYrBTdwbAU46PkGOYAaB+uf4AQ8lM4JQAIlBkMiFApGdIAFR9itBs/wngdJftEHSl7QDkswIwRNBlQGTpqlC08yOATNBA8HURBAC1M6UQlU9lUFtEabDaF5GAc1SZ4EWwgvCFzQRwc4S8YH0xSzEWAKyg9ICxYJYjwVB2OYjgZnoFYH/9LaBGo0ihKtv7wJbeTZBG4eaw1v9GQEfeW1B3PcdwR0uNsEdtSTBnlikgmmqFMGQH4ACYDw7A+CXKYKg6CDEoTWCwuF2lgThqYSBOo6cxGIDEoVN1CUBox6GgmNHlAIjhxbB48cfAaTivgElawDBZb2LwSXICwHmxTfBc6bPQ2h/ukH9VpfD61SkQjYmscKhCDoCU2TAwaswKYGQJzpB64OoxOvrBoJsEqDBrLqGQa0WE0EtiT/Cbdcuwm4xgkJuuCzBMTsqxC9krcNviSYC8BWAxIWW4sMxNp4ClHpegfH6BAEy6IQDMyYSQvNQGENzkj+BM/eMwfQApkPDI4FEtaYPQfX8GQN2IxAB//akQTaCmsE25zwBnnHsgsKp3ME35CmB+DKEQTiICEE93XrDYIy3ArnWPkO6chOBAPNewnrwnoJ9D34BO5s0QcE1gkK8RhRCvQooQ8f17cP9mbkB/fYAwb47KIE+WzDBhrAUQaDeLgFx244B/9GvwUAr3UHATVdCAJDxgQDXYoIB5+jEgi9AwYJxS8HCnFgBQv7lxYJBawIDd+5BJvkAAcPTf8KEJOpByHj2wwTgX8GFl2OCheDogg1yBEJHKMIBh25Tw8fAxsEFRYOCyHxFQ3TnnAKI2uiEfoPSwWGP3sTJnF2BG8leQsp4TYJIw9cB6YKuQos4SgILdeBBzBvcwfjwF0LMnHqBDTz2gz7OmoSNpeDDjcB0AfP0FQKOrFCBDw9FQU9FREHPvdqCHnEDghApzwKQa37BULFHwZEvywF8KtCCUYZmg2yKE0NSIeiBEqF0wdLEzcJTC8GD009XA5PHWAFUNu0BNjwxAVSRx0GYuKtBZd7YxJWKR0HV99kFBeDbAVbFw4ZShJdDl5P4wVfS6UEvqJUD2HZ/Qq/bwYGZ5smD9ciRQlqfbkIa/X1F2xfZwVtVe4Jb+X7BnB5JA43QIULLE7pCX3hyxhQuRgEf2P9BoGrogSW7HAJhScNC4ZVTgbvMgkGLF+uCItRAwiNdcoGjp3sCZATjhGRUb8SAJHWBpTJEwmVBTIblkPmDZdn2RWadzQGm0dSDH8d8gb3U7MRodlVBaK/9Aula3MEqLf3BmdBtAeqMb0ErSsqCq7fjQayS58KUMTmFviwwgbq0n4LvoVZCr/nHwQc0j4Imk+8CuzV7QjDsRcIxMWVDsh34AvJlwQHUHqdBs3zyAvOw5MLiI9EBsMZWBiIhwUKA9GoBAqawwTYs50KB6f9Bu7b4w/cB8IK3S1/BlMUqwbhF/QG4q2pCHrRuAbke50E58OPBmv4IhOoQaQLoUrDBSGU4wXu7SYI8If/CuoNxAvzO4UF9OtYBc8lwgWVlREP9+UsCq5JngugHeoHDfkxBtEvHA8="""
DH,"""AwEHDAcIAAE1AAAAgB6vB/TG3AaC8x4MA2qaBoVWhwUKzBUL8JAJFX04hBAXz4cGmdm2C546+Acf2FsEoJVxDiLhGwej5g4GpX7TE6iktgRdJ4IKshUsC8E55wk+2eMHhU4sE0FGDQ3CIbwNxSciB0eJLAZI2m8Fy7NuC82uogZQAggI0/gMBtT/0AXW1cgEXQCYDN4Y8RDfzo0IYZMXFuI60wtj/UYG5CJ3CudOxAdo538ba6SOBGxeygfuf74Fb/IoHvCpQgjyO/MIdO65BXlYyQn62qkF+40xDf0XkAo="""


The results show, for each `Reporting_Airline`, a compact, aggregated representation of the set of `Tail_Number`s. By storing this in the table, Druid queries gain an overall speed boost from an increase in table data efficiency and a decrease in computation effort.
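To get a feel for what is inside those Base64 strings, the following sketch decodes the first few bytes of the `DH` result above. The meaning given to each byte in the comments (a family identifier at offset 2 and `lgK` at offset 3) is an assumption based on the DataSketches HLL serialization layout, not something Druid documents as a stable interface, so treat it as an illustration only.

```python
import base64

# First 16 characters of the "DH" sketch shown above (16 Base64 chars = 12 bytes).
sketch_prefix = "AwEHDAcIAAE1AAAA"
header = base64.b64decode(sketch_prefix)

# Assumed layout of the DataSketches HLL preamble; Druid does not guarantee these offsets.
family_id = header[2]  # assumed: identifies the sketch family
lg_k = header[3]       # assumed: log2 of the number of HLL buckets

print(f"{len(header)} bytes decoded, family id {family_id}, lgK {lg_k}")
# prints: 12 bytes decoded, family id 7, lgK 12
```

If the assumed layout holds, the decoded `lgK` of 12 means the sketch was built with 4096 buckets.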

### Creating sketches during batch ingestion

The next cell incorporates the `DS_HLL` function into a `REPLACE` statement to build a new table, `flights-counts`. `GROUP BY` aggregates `Tail_Number`s for each hour (`TIME_FLOOR`), `Reporting_Airline`, `Origin`, and `Dest`.

Unlike HyperLogLog, Theta sketches allow for intersection and difference set operations. The [`DS_THETA`](https://druid.apache.org/docs/26.0.0/querying/sql-functions.html#ds_theta) function is included in the SQL below to set you up for trying this out later.

Notice that the SQL does not `SELECT` the original `Tail_Number`s. If we kept that field, the `GROUP BY` wouldn't aggregate any rows into the sketch: there would be a 1:1 relationship between each row and each `Tail_Number`, which is the opposite of what we are designing for! By implication, it is no longer possible to use the raw `Tail_Number` values in SQL queries against the new table, for example in a `GROUP BY` or a `WHERE` clause.
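The effect of leaving `Tail_Number` out of the grouping can be sketched in plain Python. The rows below are made up for illustration, and an ordinary Python `set` stands in for the sketch; Druid's real sketches are approximate and far more compact.

```python
from collections import defaultdict

# Hypothetical rows: (hour, airline, origin, dest, tail_number)
rows = [
    ("2005-11-01T08", "AA", "DFW", "ATL", "N101"),
    ("2005-11-01T08", "AA", "DFW", "ATL", "N102"),
    ("2005-11-01T08", "AA", "DFW", "ATL", "N101"),
    ("2005-11-01T09", "AA", "DFW", "SFO", "N103"),
]

def roll_up(rows, key_len):
    """Group rows by their first key_len fields, collecting tail numbers into a set."""
    groups = defaultdict(set)
    for row in rows:
        groups[row[:key_len]].add(row[4])
    return groups

without_tail = roll_up(rows, 4)  # GROUP BY hour, airline, origin, dest
with_tail = roll_up(rows, 5)     # the same, plus Tail_Number as a dimension

print(len(without_tail), "rows when Tail_Number is only aggregated")  # 2 rows
print(len(with_tail), "rows when Tail_Number is kept as a dimension")  # 3 rows
```

With `Tail_Number` in the key, every group holds exactly one tail number, so nothing is gained by storing a sketch alongside it.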

In [None]:
sql='''
REPLACE INTO "flights-counts" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://static.imply.io/example-data/flight_on_time/flights/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11.csv.zip"]}',
    '{"type":"csv","findColumnsFromHeader":true}'
  )
) EXTEND ("depaturetime" VARCHAR, "arrivalime" VARCHAR, "Year" BIGINT, "Quarter" BIGINT, "Month" BIGINT, "DayofMonth" BIGINT, "DayOfWeek" BIGINT, "FlightDate" VARCHAR, "Reporting_Airline" VARCHAR, "DOT_ID_Reporting_Airline" BIGINT, "IATA_CODE_Reporting_Airline" VARCHAR, "Tail_Number" VARCHAR, "Flight_Number_Reporting_Airline" BIGINT, "OriginAirportID" BIGINT, "OriginAirportSeqID" BIGINT, "OriginCityMarketID" BIGINT, "Origin" VARCHAR, "OriginCityName" VARCHAR, "OriginState" VARCHAR, "OriginStateFips" BIGINT, "OriginStateName" VARCHAR, "OriginWac" BIGINT, "DestAirportID" BIGINT, "DestAirportSeqID" BIGINT, "DestCityMarketID" BIGINT, "Dest" VARCHAR, "DestCityName" VARCHAR, "DestState" VARCHAR, "DestStateFips" BIGINT, "DestStateName" VARCHAR, "DestWac" BIGINT, "CRSDepTime" BIGINT, "DepTime" BIGINT, "DepDelay" BIGINT, "DepDelayMinutes" BIGINT, "DepDel15" BIGINT, "DepartureDelayGroups" BIGINT, "DepTimeBlk" VARCHAR, "TaxiOut" BIGINT, "WheelsOff" BIGINT, "WheelsOn" BIGINT, "TaxiIn" BIGINT, "CRSArrTime" BIGINT, "ArrTime" BIGINT, "ArrDelay" BIGINT, "ArrDelayMinutes" BIGINT, "ArrDel15" BIGINT, "ArrivalDelayGroups" BIGINT, "ArrTimeBlk" VARCHAR, "Cancelled" BIGINT, "CancellationCode" VARCHAR, "Diverted" BIGINT, "CRSElapsedTime" BIGINT, "ActualElapsedTime" BIGINT, "AirTime" BIGINT, "Flights" BIGINT, "Distance" BIGINT, "DistanceGroup" BIGINT, "CarrierDelay" BIGINT, "WeatherDelay" BIGINT, "NASDelay" BIGINT, "SecurityDelay" BIGINT, "LateAircraftDelay" BIGINT, "FirstDepTime" VARCHAR, "TotalAddGTime" VARCHAR, "LongestAddGTime" VARCHAR, "DivAirportLandings" VARCHAR, "DivReachedDest" VARCHAR, "DivActualElapsedTime" VARCHAR, "DivArrDelay" VARCHAR, "DivDistance" VARCHAR, "Div1Airport" VARCHAR, "Div1AirportID" VARCHAR, "Div1AirportSeqID" VARCHAR, "Div1WheelsOn" VARCHAR, "Div1TotalGTime" VARCHAR, "Div1LongestGTime" VARCHAR, "Div1WheelsOff" VARCHAR, "Div1TailNum" VARCHAR, "Div2Airport" VARCHAR, "Div2AirportID" VARCHAR, "Div2AirportSeqID" VARCHAR, "Div2WheelsOn" VARCHAR, "Div2TotalGTime" 
VARCHAR, "Div2LongestGTime" VARCHAR, "Div2WheelsOff" VARCHAR, "Div2TailNum" VARCHAR, "Div3Airport" VARCHAR, "Div3AirportID" VARCHAR, "Div3AirportSeqID" VARCHAR, "Div3WheelsOn" VARCHAR, "Div3TotalGTime" VARCHAR, "Div3LongestGTime" VARCHAR, "Div3WheelsOff" VARCHAR, "Div3TailNum" VARCHAR, "Div4Airport" VARCHAR, "Div4AirportID" VARCHAR, "Div4AirportSeqID" VARCHAR, "Div4WheelsOn" VARCHAR, "Div4TotalGTime" VARCHAR, "Div4LongestGTime" VARCHAR, "Div4WheelsOff" VARCHAR, "Div4TailNum" VARCHAR, "Div5Airport" VARCHAR, "Div5AirportID" VARCHAR, "Div5AirportSeqID" VARCHAR, "Div5WheelsOn" VARCHAR, "Div5TotalGTime" VARCHAR, "Div5LongestGTime" VARCHAR, "Div5WheelsOff" VARCHAR, "Div5TailNum" VARCHAR, "Unnamed: 109" VARCHAR))
SELECT
  TIME_FLOOR(TIME_PARSE("depaturetime"), 'PT1H') AS "__time",
  "Reporting_Airline",
  "Origin",
  "Dest",
  COUNT(*) AS "Events",
  MAX("Distance") AS "Distance_Max",
  MIN("Distance") AS "Distance_Min",
  DS_HLL("Tail_Number") AS "Tail_Number_HLL",
  DS_THETA("Tail_Number") AS "Tail_Number_THETA"
FROM "ext"
GROUP BY 1, 2, 3, 4
PARTITIONED BY DAY
'''

When you do this programmatically, be sure to include a context parameter that prompts Druid to store the sketch itself rather than a finalized estimate: [`finalizeAggregations`](https://druid.apache.org/docs/26.0.0/multi-stage-query/reference.html#context-parameters). If you build the ingestion using the Druid console, this setting is applied for you automatically.

The following cell adds the parameters and then executes the ingestion.

When the ingestion is finished, you will see the table definition. Monitor the ingestion task itself in the Druid Console as it runs.

In [None]:
req = sql_client.sql_request(sql)
req.add_context("finalize", "false")
req.add_context("finalizeAggregations", "false")

sql_client.run_task(req)
sql_client.wait_until_ready('flights-counts')
display.table('flights-counts')
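If you are not using `druidapi`, the same context parameter can travel with the statement in a JSON request body. The following is a minimal sketch that only builds that body. The SQL string is a shortened placeholder, and the target URL in the comment assumes Druid's SQL task endpoint on a router at port 8888.

```python
import json

# Placeholder statement; in practice this is the full REPLACE statement above.
sql = 'REPLACE INTO "flights-counts" OVERWRITE ALL SELECT ... PARTITIONED BY DAY'

payload = {
    "query": sql,
    "context": {
        # Keep sketch columns as sketches instead of finalizing them to numbers.
        "finalizeAggregations": False,
    },
}

body = json.dumps(payload)
# Assumption: POST this body to http://<router>:8888/druid/v2/sql/task
print(body)
```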

As noted before, use specific SQL functions to estimate the size of the set represented by the sketch.

* For HLL, [`APPROX_COUNT_DISTINCT_DS_HLL`](https://druid.apache.org/docs/26.0.0/querying/sql-functions.html#approx_count_distinct_ds_hll).
* For Theta, [`APPROX_COUNT_DISTINCT_DS_THETA`](https://druid.apache.org/docs/26.0.0/querying/sql-functions.html#approx_count_distinct_ds_theta).

Here's an example query showing our estimated results.

In [None]:
sql='''
SELECT
   "Reporting_Airline",
   SUM("Distance_Max") AS "Miles_Flown",
   APPROX_COUNT_DISTINCT_DS_HLL("Tail_Number_HLL") AS "HLLApprox",
   APPROX_COUNT_DISTINCT_DS_THETA("Tail_Number_THETA") AS "ThetaApprox"
FROM "flights-counts"
GROUP BY 1
'''

display.sql(sql)

### Creating sketches during streaming ingestion

In streaming ingestion, rather than using `DS_HLL` or `DS_THETA`, you include the native Druid equivalent in the [`metricsSpec`](https://druid.apache.org/docs/26.0.0/ingestion/ingestion-spec.html#metricsspec) and, instead of `GROUP BY`, enable [`queryGranularity`](https://druid.apache.org/docs/latest/ingestion/ingestion-spec.html#granularityspec) and `rollup` to truncate the timestamp and pre-aggregate the rows.

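The `DS_HLL` and `DS_THETA` calls in the `REPLACE` statement above are therefore equivalent to the following `metricsSpec` entries:

```json
    {
      "type": "HLLSketchBuild",
      "name": "Tail_Number_HLL",
      "fieldName": "Tail_Number",
      "lgK": 12,
      "tgtHllType": "HLL_4"
    },
    {
      "type": "thetaSketch",
      "name": "Tail_Number_THETA",
      "fieldName": "Tail_Number",
      "size": 16384
    }
```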

Notice that here it's easy to see some internal parameters for sketch generation, like the `lgK` value for HLL. In SQL mode, these are exposed as supplementary parameters to the `DS_HLL` function. Be cautious of changing these values without researching the effects - not just in accuracy but also in terms of performance and segment size.
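As a rough guide to the accuracy side of that trade-off, the textbook HyperLogLog analysis gives a relative standard error of about 1.04 / sqrt(2^lgK). The sketch below uses that classic formula as an assumption; the DataSketches implementation that Druid uses publishes its own error tables, which may differ.

```python
import math

def hll_relative_error(lg_k):
    """Textbook HyperLogLog relative standard error for 2**lg_k buckets."""
    return 1.04 / math.sqrt(2 ** lg_k)

# Each step up in lgK doubles the bucket count and shrinks the error by sqrt(2).
for lg_k in (10, 12, 14):
    print(f"lgK={lg_k}: {2 ** lg_k} buckets, ~{hll_relative_error(lg_k):.2%} error")
```

Raising `lgK` by two halves the expected error but quadruples the number of buckets each sketch has to carry, which is why segment size and query performance need checking alongside accuracy.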

## Calculating set union with pre-existing Theta and HyperLogLog sketches

HyperLogLog and Theta sketches allow us to estimate the `COUNT(DISTINCT)` of the union of two or more sets. When Druid executes a `COUNT(DISTINCT)` query in approximate mode, it builds a set that is the union of independent sets of results, several from each data segment, and gives back to us an estimate of that set's size.

In Druid SQL, you have access to functions that allow you to define and union your own sets in order to estimate their size.
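Why sketches can be unioned at all is easier to see with a toy example. The class below is a "k minimum values" sketch, a simplified relative of the Theta sketch. It is not Druid's implementation, and every name in it is made up for illustration.

```python
import hashlib

def hash01(value):
    """Map a string to a pseudo-random float in [0, 1)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2 ** 64

class ToySketch:
    """Keeps only the k smallest hash values it has seen."""

    def __init__(self, k=64, hashes=()):
        self.k = k
        self.hashes = sorted(set(hashes))[:k]

    @classmethod
    def of(cls, values, k=64):
        return cls(k, (hash01(v) for v in values))

    def union(self, other):
        # Merging is just pooling the retained hashes and trimming back to k.
        return ToySketch(min(self.k, other.k), self.hashes + other.hashes)

    def estimate(self):
        if len(self.hashes) < self.k:
            return float(len(self.hashes))     # fewer than k distinct values: exact
        return (self.k - 1) / self.hashes[-1]  # k-th smallest hash sets the scale

# Two small "segments" that share one tail number.
segment_1 = ToySketch.of(["N101", "N102", "N103"])
segment_2 = ToySketch.of(["N103", "N104"])
print(segment_1.union(segment_2).estimate())  # 4.0

# Larger overlapping segments: merging sketches equals sketching everything at once.
big_a = ToySketch.of(f"N{i}" for i in range(0, 6000))
big_b = ToySketch.of(f"N{i}" for i in range(4000, 10000))
direct = ToySketch.of(f"N{i}" for i in range(10000))
merged = big_a.union(big_b)
print(merged.hashes == direct.hashes)  # True
print(merged.estimate())               # approximate; the true answer is 10000
```

The second comparison is the property that matters: per-segment sketches can be merged without going back to the raw rows, and the result is the same sketch you would have built from all the data in one pass.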

Run the next cell, which:

* Gets three sets of `Tail_Number`s using `DS_HLL`, applying a `FILTER` to isolate flights out of three specific cities,
* Applies `HLL_SKETCH_UNION` to union the three sets, and
* Estimates the resulting set size with `HLL_SKETCH_ESTIMATE`.

It repeats the same steps with the Theta equivalents, and uses `TIME_FLOOR` to give us a week-by-week `GROUP BY` of the data.

In [None]:
sql='''
SELECT
  TIME_FLOOR("__time",'P1W') AS "Week commencing",
  HLL_SKETCH_ESTIMATE(
     HLL_SKETCH_UNION(
       DS_HLL("Tail_Number_HLL") FILTER (WHERE "Origin"='ATL'),
       DS_HLL("Tail_Number_HLL") FILTER (WHERE "Origin"='DFW'),
       DS_HLL("Tail_Number_HLL") FILTER (WHERE "Origin"='SFO')
      )
    ) AS "AnyThreeCity-HLL",
  THETA_SKETCH_ESTIMATE(
     THETA_SKETCH_UNION(
       DS_THETA("Tail_Number_THETA") FILTER (WHERE "Origin"='ATL'),
       DS_THETA("Tail_Number_THETA") FILTER (WHERE "Origin"='DFW'),
       DS_THETA("Tail_Number_THETA") FILTER (WHERE "Origin"='SFO')
      )
    ) AS "AnyThreeCity-THETA"
FROM "flights-counts"
WHERE TIMESTAMP '2005-10-31' <= __time AND __time <= TIMESTAMP '2005-11-20'
GROUP BY 1
'''

display.sql(sql)

## Calculating set intersection and difference with pre-existing Theta sketches

With Theta sketches, you can also approximate the size of:

* The intersection of two sets (people who went to both MoMA _and_ the NPG)
* The difference between one set and another (people who went to MoMA _but not_ the NPG)

For simplicity, the following sections contain queries that take advantage of the data set you've created that contains sketches. You can apply the same techniques on tables without sketches in them: use the `DS_THETA` function on the raw data instead.

### Set intersection

Run the next cell to see set intersection being used with Theta sketches to produce a count estimate of the overlap between the sets.
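A query along the following lines, modeled on the union query earlier, would estimate how many aircraft flew out of both of two cities in each week. It assumes the `flights-counts` table built above and Druid's `THETA_SKETCH_INTERSECT` function; the choice of ATL and SFO is illustrative only.

```python
# A sketch of an intersection query; the airports are illustrative.
sql = '''
SELECT
  TIME_FLOOR("__time", 'P1W') AS "Week commencing",
  THETA_SKETCH_ESTIMATE(
    THETA_SKETCH_INTERSECT(
      DS_THETA("Tail_Number_THETA") FILTER (WHERE "Origin" = 'ATL'),
      DS_THETA("Tail_Number_THETA") FILTER (WHERE "Origin" = 'SFO')
    )
  ) AS "ATL and SFO"
FROM "flights-counts"
GROUP BY 1
'''

print(sql)
```

In the notebook, pass `sql` to `display.sql(sql)` as in the earlier cells.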

### Set difference

Finally, run the next cell to use Theta sketch operations to estimate the size of the difference between one set and another.

Note that this operation is not commutative: Druid calculates the size of the difference A minus B (the elements of A that are not in B), not the symmetric difference between the sets.
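The order dependence can be made explicit by generating the query in both directions. This is a minimal sketch that assumes the `flights-counts` table built above and Druid's `THETA_SKETCH_NOT` function; the helper `difference_sql` and the airports are hypothetical, chosen for illustration.

```python
def difference_sql(keep, remove):
    """Build a query estimating aircraft seen at `keep` but not at `remove`."""
    return f'''
SELECT
  THETA_SKETCH_ESTIMATE(
    THETA_SKETCH_NOT(
      DS_THETA("Tail_Number_THETA") FILTER (WHERE "Origin" = '{keep}'),
      DS_THETA("Tail_Number_THETA") FILTER (WHERE "Origin" = '{remove}')
    )
  ) AS "{keep} not {remove}"
FROM "flights-counts"
'''

# The two directions are different queries and will generally return different estimates.
atl_not_sfo = difference_sql("ATL", "SFO")
sfo_not_atl = difference_sql("SFO", "ATL")

print(atl_not_sfo)
print(sfo_not_atl)
```

Running each string with `display.sql(...)` shows the two one-sided differences side by side. The helper interpolates fixed airport codes only, so it is not suitable for untrusted input.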

## Conclusion

* Approximation is the default execution model for `COUNT(DISTINCT)` queries
* You can turn it off with a query context parameter
* Accuracy is highly dependent on the distribution and cardinality of data across the database
* Druid can be pre-loaded with sketch objects that speed up approximation both in batch and streaming ingestion
* HyperLogLog and Theta sketches both allow you to approximate `COUNT(DISTINCT)` of entire sets
* Only Theta sketches allow you to carry out set intersection and difference operations

## Learn more

* Try estimation on your own dataset:
    * Identify a high-cardinality column in one of your own data sets
    * Test how long a `COUNT(DISTINCT)` query takes to run with approximation turned on
    * Test how long the same query takes to run with approximation turned off
* Watch [Employ Approximation](https://youtu.be/fSWwJs1gCvQ?list=PLDZysOZKycN7MZvNxQk_6RbwSJqjSrsNR) by Peter Marshall
* Read [Ingesting Data Sketches into Apache Druid](https://blog.hellmar-becker.de/2022/12/26/ingesting-data-sketches-into-apache-druid/) by Hellmar Becker
* Read more about the native "aggregator" functions for streaming ingestion
    * [ThetaSketch function](https://druid.apache.org/docs/26.0.0/development/extensions-core/datasketches-theta.html)
    * [HyperLogLog function](https://druid.apache.org/docs/26.0.0/development/extensions-core/datasketches-hll.html)