# Performing set operations

Users often need to compare or combine multiple sets of results. They may want to merge results into a single list, find where the sets overlap, or see how they differ. Druid provides standard SQL set operations as well as functions that use approximation to speed up these types of calculation. In this tutorial, you work through examples of the different techniques and see the effect that approximation has on the results.

## Prerequisites

This tutorial works with Druid 26.0.0 or later.

Launch this tutorial and all prerequisites using the `druid-jupyter` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see [Docker for Jupyter Notebook tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).

You must also have loaded the "FlightCarrierOnTime (1 month)" sample data, using defaults, into the table `On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11`.

If you do not use the Docker Compose environment, you need the following:
* A running Druid instance.
   * Update the `druid_host` variable to point to your Router endpoint. For example, `druid_host = "http://localhost:8888"`.
* The following Python packages:
   * `druidapi`, a Python client for Apache Druid

To start this tutorial, run the next cell. It imports the libraries the tutorial uses, defines the Druid host, and creates the Druid client. The quickstart deployment configures Druid to listen on port `8888` by default, so outside the Docker Compose environment you make API calls against `http://localhost:8888`.


In [None]:
import druidapi
import json
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd

# druid_host is the hostname and port for your Druid deployment.
# In the Docker Compose tutorial environment, this is the Router
# service running at "http://router:8888".
# If you are not using the Docker Compose environment, edit the `druid_host`.

druid_host = "http://druid-master-0:8888"
druid_host

druid = druidapi.jupyter_client(druid_host)
display = druid.display
sql_client = druid.sql
display.tables('INFORMATION_SCHEMA')

## Combining result sets

These features of Druid allow us to combine multiple sets of results together to create one single set.

### Merging result sets with `UNION ALL`

Execute the following query to combine two different queries: one that returns 10 flights taking off from San Francisco at around 11 o'clock in the morning, and another that returns 10 flights departing from Atlanta in the same hour.

In [None]:
sql = '''
WITH
set1 AS (
  SELECT
  __time,
  "Origin",
  "Tail_Number",
  "Flight_Number_Reporting_Airline"
  FROM "On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11"
  WHERE Origin = 'SFO'
  AND DATE_TRUNC('HOUR', __time) = TIMESTAMP '2005-11-01 11:00:00'
  ORDER BY __time
  LIMIT 10
  ),
set2 AS (
SELECT
  __time,
  "Origin",
  "Tail_Number",
  "Flight_Number_Reporting_Airline"
  FROM "On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11"
  WHERE Origin = 'ATL'
  AND DATE_TRUNC('HOUR', __time) = TIMESTAMP '2005-11-01 11:00:00'
  ORDER BY __time
  LIMIT 10
  )
  
SELECT * from set1
UNION ALL
SELECT * from set2
'''

display.sql(sql)

This is what's known as a "top-level" `UNION` operation: each set of results was gathered individually, one after the other, and the list of results concatenated.

Notice that these results are not in order by time – even though the individual sets did `ORDER BY` time, because the result is a concatenation.
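The difference between concatenating two ordered sets and merging them in order can be sketched in plain Python. The rows below are hypothetical values, not output from the flight data, and the sketch only illustrates the idea rather than how Druid executes the query.

```python
import heapq

# Hypothetical (departure_time, origin) rows; each list is already sorted by time.
set1 = [("11:00", "SFO"), ("11:20", "SFO"), ("11:45", "SFO")]
set2 = [("11:05", "ATL"), ("11:10", "ATL"), ("11:50", "ATL")]

# What a top-level UNION ALL does: append one result set to the other.
concatenated = set1 + set2

# What a time-ordered result would need: an ordered merge of both sets.
merged = list(heapq.merge(set1, set2))

print([time for time, _ in concatenated])
print([time for time, _ in merged])
```

The concatenated list restarts at `11:05` after `11:45`, which is the same pattern you see in the query results.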

For extra detail, run the cell below to see the `EXPLAIN` of the query plan. Notice that there are two `query` parts, with each one being one of our queries.

* [Top-level UNION ALL](https://druid.apache.org/docs/latest/querying/sql.html#top-level)

In [None]:
print(json.dumps(json.loads(sql_client.explain_sql(sql)['PLAN']), indent=2))

One might think that the solution is to bring back `__time` in each result set as an additional dimension, and to then use `ORDER BY` in a new outer query, like this:

```sql
SELECT "Origin",
    "Tail_Number",
    "Flight_Number_Reporting_Airline"
FROM (
    SELECT * from set1
    UNION ALL
    SELECT * from set2
    )
ORDER BY __time
```

However, when there is a level of abstraction over the concatenated result set, Druid switches how it executes the query, and this adds a number of constraints on what is possible.

Take a look at this query. Each set still looks at SFO and ATL, which we expect to `UNION` into a single set, with `__time`, that we can then `ORDER BY` to get a single, time-ordered, result set.

Run this cell, however, and you will receive an error from Druid, complete with information about the constraints.


In [None]:
sql = '''
WITH
set1 AS (
  SELECT
  __time,
  "Origin",
  "Tail_Number",
  "Flight_Number_Reporting_Airline"
  FROM "On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11"
  WHERE Origin = 'SFO'
  AND DATE_TRUNC('HOUR', __time) = TIMESTAMP '2005-11-01 11:00:00'
  ),
set2 AS (
SELECT
  __time,
  "Origin",
  "Tail_Number",
  "Flight_Number_Reporting_Airline"
  FROM "On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11"
  WHERE Origin = 'ATL'
  AND DATE_TRUNC('HOUR', __time) = TIMESTAMP '2005-11-01 11:00:00'
  )
  
SELECT __time,
    "Origin",
    "Tail_Number",
    "Flight_Number_Reporting_Airline"
FROM (
    SELECT * from set1
    UNION ALL
    SELECT * from set2
    )
ORDER BY __time
'''

print(json.dumps(json.loads(sql_client.explain_sql(sql)['PLAN']), indent=2))

If you'd like the detail on why, run the next cell, which displays the query plan for a working version of the query above. You will see that Druid executes the query on a special `union` datasource, which it builds directly from entire tables. The operands of this kind of `UNION ALL` cannot have:

> expressions, column aliasing, JOIN, GROUP BY, ORDER BY, and so on

Learn more here:

* [`union` datasources](https://druid.apache.org/docs/26.0.0/querying/datasource.html#union)

In [None]:
sql = '''
WITH
set1 AS (
  SELECT
  __time,
  "Origin",
  "Tail_Number",
  "Flight_Number_Reporting_Airline"
  FROM "On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11"
  ),
set2 AS (
SELECT
  __time,
  "Origin",
  "Tail_Number",
  "Flight_Number_Reporting_Airline"
  FROM "On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11"
  )
  
SELECT "Origin",
    "Tail_Number",
    "Flight_Number_Reporting_Airline"
FROM (
    SELECT * from set1
    UNION ALL
    SELECT * from set2
    )
ORDER BY __time
'''

print(json.dumps(json.loads(sql_client.explain_sql(sql)['PLAN']), indent=2))

### Aggregations on result sets with `UNION ALL`

Using `UNION ALL` it's possible to concatenate aggregate calculations.

Take a look at this cell before you run it – what do you expect to happen?

In [None]:
sql = '''
WITH
set1 AS (
  SELECT
  "Origin",
  COUNT(*) AS "Flights",
  MAX(Distance) AS "Longest",
  MIN(Distance) AS "Shortest"
  FROM "On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11"
  WHERE Origin = 'SFO'
  AND DATE_TRUNC('HOUR', __time) = TIMESTAMP '2005-11-01 11:00:00'
  GROUP BY 1
  ),
set2 AS (
SELECT
  "Origin",
  COUNT(*) AS "Frights",
  MIN(Distance) AS "Shortest",
  MAX(Distance) AS "Lengthiest"
  FROM "On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11"
  WHERE Origin = 'ATL'
  AND DATE_TRUNC('HOUR', __time) = TIMESTAMP '2005-11-01 11:00:00'
  GROUP BY 1
  )
  
SELECT * from set1
UNION ALL
SELECT * from set2
'''

display.sql(sql)

As the `UNION ALL` concatenated the sets, it very simply added the results for Atlanta to the end of the results for San Francisco. It did not take into account that the columns in set 2 were in a different order, nor did it take into account _either_ of the errors in field names.

Instead, the query ought to have been more explicit, taking into account the proper field names in each set, ensuring consistency with `set1`'s schema:

```sql
SELECT "Flights", "Shortest", "Longest" from set1
UNION ALL
SELECT "Frights", "Shortest", "Lengthiest" from set2
```
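The positional behaviour can be sketched in plain Python. This is a minimal illustration, assuming that a top-level `UNION ALL` matches columns by position and takes its column names from the first query, as standard SQL does. The values are invented, and the function name `union_all_by_position` is hypothetical.

```python
# Hypothetical result sets as (column names, rows); the numbers are made up.
set1 = (["Origin", "Flights", "Longest", "Shortest"], [("SFO", 12, 2704, 77)])
set2 = (["Origin", "Frights", "Shortest", "Lengthiest"], [("ATL", 71, 83, 2139)])

def union_all_by_position(first, second):
    """Concatenate rows; keep the first set's column names, ignore the second's."""
    names, rows = first
    _, more_rows = second
    return names, rows + more_rows

names, rows = union_all_by_position(set1, set2)
for row in rows:
    print(dict(zip(names, row)))
```

In this sketch the Atlanta row ends up with its shortest distance under `Longest`, because nothing checks the second set's column names.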

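### Approximately merging sets of single values

HyperLogLog is a probabilistic algorithm that estimates the number of distinct values in a set from a small summary called a sketch. The next cell estimates the distinct users seen in two consecutive hours: it unions HLL sketches and Theta sketches, and it also uses Theta sketches to estimate the difference and the intersection. The cell queries the `wikipedia` datasource, which is not part of the prerequisites above, so load that sample data before you run it.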

In [None]:
sql='''
SELECT
  HLL_SKETCH_ESTIMATE (
   HLL_SKETCH_UNION (
      DS_HLL("user") FILTER (WHERE DATE_TRUNC('HOUR', __time) = TIMESTAMP '2016-06-27 11:00:00'),
      DS_HLL("user") FILTER (WHERE DATE_TRUNC('HOUR', __time) = TIMESTAMP '2016-06-27 12:00:00')
      )
    ),
  THETA_SKETCH_ESTIMATE(
   THETA_SKETCH_UNION (
      DS_THETA("user") FILTER (WHERE DATE_TRUNC('HOUR', __time) = TIMESTAMP '2016-06-27 11:00:00'),
      DS_THETA("user") FILTER (WHERE DATE_TRUNC('HOUR', __time) = TIMESTAMP '2016-06-27 12:00:00')
      )
    ),
  THETA_SKETCH_ESTIMATE(
   THETA_SKETCH_NOT (
      DS_THETA("user") FILTER (WHERE DATE_TRUNC('HOUR', __time) = TIMESTAMP '2016-06-27 11:00:00'),
      DS_THETA("user") FILTER (WHERE DATE_TRUNC('HOUR', __time) = TIMESTAMP '2016-06-27 12:00:00')
      )
    ),
  THETA_SKETCH_ESTIMATE(
   THETA_SKETCH_INTERSECT (
      DS_THETA("user") FILTER (WHERE DATE_TRUNC('HOUR', __time) = TIMESTAMP '2016-06-27 11:00:00'),
      DS_THETA("user") FILTER (WHERE DATE_TRUNC('HOUR', __time) = TIMESTAMP '2016-06-27 12:00:00')
      )
    )
FROM "wikipedia"
'''

display.sql(sql)

### Running COUNT(DISTINCT) without approximation

We can set the query context parameter `useApproximateCountDistinct` to `false` to force Druid not to use approximation. We won't get the speed boost afforded by the sketching approach, but that's OK because the example dataset is so small! It would be a different story if `Tail_Number` had high cardinality, as a column of IP addresses or user identifiers would.
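If you are not using `druidapi`, the same request can be sketched as the JSON body that Druid's SQL API (`POST /druid/v2/sql`) accepts, with the context carried in a `context` object. This is a minimal sketch that only builds the body and sends nothing. The table name `my_table` is a placeholder, and how `druidapi` builds its own request internally is not shown here.

```python
import json

# Placeholder query; "my_table" is a hypothetical table name.
sql = 'SELECT COUNT(DISTINCT "Tail_Number") AS "Events" FROM "my_table"'

payload = {
    "query": sql,
    # Same flag and value that the tutorial passes with add_context().
    "context": {"useApproximateCountDistinct": "false"},
}

body = json.dumps(payload, indent=2)
print(body)
```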

In [None]:
sql='''
SELECT "Reporting_Airline", COUNT(DISTINCT "Tail_Number") AS "Events"
FROM "On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11"
GROUP BY 1
ORDER BY 2
'''

req = sql_client.sql_request(sql)
req.add_context("useApproximateCountDistinct", "false")
resp = sql_client.sql_query(req)

df = pd.DataFrame(resp.rows)
df.plot.bar(x='Reporting_Airline', y='Events')
plt.show()

### Comparing approximate and non-approximate results

On the surface, the exact results do not _look_ different from the approximate results that Druid returns by default. And, in a lot of user interfaces, that's perfectly fine!

The next cell will run the query in two modes – accurate and approximate. It then displays a `diff` between the two results.

In [None]:
sql = '''
SELECT "Reporting_Airline", COUNT(DISTINCT "Tail_Number") AS "Events"
FROM "On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11"
GROUP BY 1
ORDER BY 2
'''

req = sql_client.sql_request(sql)
req.add_context("useApproximateCountDistinct", "false")
resp = sql_client.sql_query(req)

df1 = pd.DataFrame(sql_client.sql(sql))
df2 = pd.DataFrame(resp.rows)

df3 = df1.compare(df2, keep_equal=True)
df3

There are _value_ errors, as you might expect with approximation. These in turn can affect the _ordering_ of results.
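Because the two results can be ordered differently, a comparison that pairs rows by position may pair different airlines. One alternative is to match the results by airline before measuring the error. The following is a minimal sketch with invented counts; the helper name `relative_errors` is hypothetical.

```python
# Hypothetical distinct counts per airline; the numbers are invented.
exact = {"AA": 700, "DL": 520, "UA": 455}
approx = {"AA": 703, "DL": 517, "UA": 455}

def relative_errors(exact_counts, approx_counts):
    """Return {key: (approx - exact) / exact} for keys present in both results."""
    return {
        key: (approx_counts[key] - exact_counts[key]) / exact_counts[key]
        for key in exact_counts
        if key in approx_counts
    }

errors = relative_errors(exact, approx)
for airline, err in sorted(errors.items()):
    print(f"{airline}: {err:+.2%}")
```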

Error in sketch-based approximation is probabilistic, rather than guaranteed: a certain percentage of the time, you can expect an estimate to be within a certain distance of the true value. The size of a sketch is also not dependent on the data; the default size of a sketch in Druid is just over 2000 bytes.
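To make the idea of probabilistic error concrete, here is a minimal sketch of a K-Minimum-Values (KMV) estimator, a simpler relative of the Theta sketch. It is not the Apache DataSketches implementation that Druid uses, and the function names are hypothetical. It keeps only the `k` smallest hash values, so the summary has a fixed size, and a larger `k` tends to give a smaller error.

```python
import hashlib

def hash01(value):
    """Map a value to a pseudo-random float in [0, 1) using SHA-256."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def kmv_sketch(values, k):
    """Keep the k smallest distinct hash values.

    Illustration only: a streaming implementation would keep a bounded
    heap instead of hashing every value into a set first.
    """
    return sorted({hash01(v) for v in values})[:k]

def kmv_estimate(sketch, k):
    """Estimate the distinct count from the k-th smallest hash value."""
    if len(sketch) < k:
        return float(len(sketch))  # fewer than k distinct values: exact count
    return (k - 1) / sketch[k - 1]

TRUE_COUNT = 20_000
values = [f"user-{i % TRUE_COUNT}" for i in range(60_000)]

estimates = {k: kmv_estimate(kmv_sketch(values, k), k) for k in (64, 1024)}
for k, est in estimates.items():
    print(f"k={k}: estimate={est:.0f}, error={(est - TRUE_COUNT) / TRUE_COUNT:+.1%}")
```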

Approximation is especially helpful for very high cardinality data. When there are hundreds of thousands, millions, or even tens of millions of distinct values, passing the individual distinct values around to be merged takes more time and more storage than using data sketches.

As an experiment, you may want to:

* Ingest or use a much larger data set
* Identify a high-cardinality column
* Issue a `COUNT(DISTINCT)` query with approximation turned on
* Issue another query with approximation turned off

## COUNT(DISTINCT) queries on sketched data

For even faster performance, we can provide Druid with compatible sketches inside the data itself. We do this at ingestion time, pre-populating some dimensions with the sketches that would otherwise have to be computed at query time.

This technique also massively reduces the footprint of the data in the database. By storing highly optimized representations of groups of unique values, you avoid storing the individual values themselves.

There are two types of Apache DataSketches sketch that allow for `COUNT(DISTINCT)` computations:

* [HyperLogLog](https://druid.apache.org/docs/26.0.0/querying/sql-aggregations.html#hll-sketch-functions)
* [Theta](https://druid.apache.org/docs/26.0.0/querying/sql-aggregations.html#theta-sketch-functions)

A Theta sketch allows for set operations such as union, intersection, and difference, while a HyperLogLog ("HLL") sketch supports only union.

To illustrate how this works, the next cell uses a `GROUP BY` query and generates a datasketch with a `DS_HLL` function.

In [None]:
sql = '''
SELECT "Reporting_Airline", DS_HLL("Tail_Number") AS "Sketch"
FROM "On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11"
GROUP BY 1
LIMIT 5
'''

display.sql(sql)

In our results, we get a "human readable" version of what a sketch looks like.

This is thanks to the [`DS_HLL`](https://druid.apache.org/docs/26.0.0/querying/sql-functions.html#ds_hll) function, which creates a HLL sketch. For a Theta sketch, we can use the [`DS_THETA`](https://druid.apache.org/docs/26.0.0/querying/sql-functions.html#ds_theta) function.

Each sketch represents, in a highly optimized format, the aggregated list of the `Tail_Number`s in the data set. Sketches are _mergeable_, which is essential in a massively-parallelised query operation where individual microservices carry out individual calculations that must then be brought together to give a final result.

Imagine that our query is executed in parallel on all the data in the database: the sketches, like the ones you see above, are then merged into a final sketch. When presented with the final _merged_ sketch, Druid uses the Apache DataSketches library to estimate how many distinct `Tail_Number`s there are in that set, and presents the result back to us. This operation works on much less data, and requires much less CPU power, than a non-approximate `COUNT(DISTINCT)`, where every row of our `GROUP BY` would have to be passed back to be merged.
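Mergeability can be illustrated with HyperLogLog-style registers. This is a minimal sketch, not the Apache DataSketches code: it uses only 16 registers, omits the estimator, and uses hypothetical tail numbers. The point is that merging the registers built from two separate pieces of data gives the same registers as building them over all the data at once.

```python
import hashlib

LG_K = 4                      # 2**4 = 16 registers; tiny, for illustration only
NUM_REGISTERS = 1 << LG_K
REST_BITS = 64 - LG_K

def hash64(value):
    """Hash a value to a 64-bit integer using SHA-256."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def build_registers(values):
    """Per register, record the largest 'rank' (leading zeros + 1) seen."""
    registers = [0] * NUM_REGISTERS
    for value in values:
        h = hash64(value)
        bucket = h >> REST_BITS                   # top LG_K bits pick a register
        rest = h & ((1 << REST_BITS) - 1)         # remaining bits
        rank = REST_BITS - rest.bit_length() + 1  # leading zeros in rest, plus 1
        registers[bucket] = max(registers[bucket], rank)
    return registers

def merge(a, b):
    """Merge two register arrays by taking the element-wise maximum."""
    return [max(x, y) for x, y in zip(a, b)]

# Two overlapping pieces of data, as if held by two different processes.
segment_a = [f"N{i}" for i in range(0, 600)]
segment_b = [f"N{i}" for i in range(400, 1000)]

merged = merge(build_registers(segment_a), build_registers(segment_b))
combined = build_registers(segment_a + segment_b)
print(merged == combined)
```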

### Creating sketches during batch ingestion

The next cell ingests the example flight data into a new table, `flights-counts`, and utilizes a `GROUP BY` to aggregate all the tail numbers into two sketches: a HLL sketch using `DS_HLL` and a Theta sketch using `DS_THETA`.

Notice that we no longer store the original field, `Tail_Number`. If we kept that field, the `GROUP BY` wouldn't aggregate any rows into the sketch: there would be a 1:1 relationship between each row and each `Tail_Number`, which is the opposite of what we are designing for! By implication, it is no longer possible to use the raw `Tail_Number` values in SQL clauses like `GROUP BY` or `WHERE`.

The `GROUP BY` below generates a sketch _for each combination_ of the dimension values that we group by, so having too many dimensions defeats the purpose of aggregating the data! Therefore the `SELECT` has been crafted to retain only the dimensions our imaginary end users will want to filter or `GROUP BY` the `COUNT(DISTINCT)` data on.
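The trade-off between the number of dimensions and the amount of aggregation can be sketched in plain Python. The rows below are hypothetical, a Python `set` stands in for each sketch, and the helper name `rollup` is an assumption made for this illustration.

```python
from collections import defaultdict

# Hypothetical raw rows: (hour, airline, origin, dest, tail_number).
rows = [
    ("2005-11-01T11", "AA", "SFO", "DFW", "N101"),
    ("2005-11-01T11", "AA", "SFO", "DFW", "N102"),
    ("2005-11-01T11", "AA", "SFO", "DFW", "N101"),
    ("2005-11-01T11", "AA", "SFO", "ORD", "N103"),
    ("2005-11-01T12", "AA", "SFO", "DFW", "N101"),
]

def rollup(rows, key_columns):
    """Group rows by the chosen columns; a set stands in for each sketch."""
    groups = defaultdict(set)
    for row in rows:
        key = tuple(row[i] for i in key_columns)
        groups[key].add(row[4])
    return dict(groups)

coarse = rollup(rows, key_columns=(0, 1))        # hour, airline
fine = rollup(rows, key_columns=(0, 1, 2, 3))    # hour, airline, origin, dest
with_tail = rollup(rows, key_columns=(0, 1, 2, 3, 4))  # tail number kept as a dimension

print(len(rows), len(coarse), len(fine), len(with_tail))
```

Each extra dimension splits the data into more groups, so each stored row summarizes fewer tail numbers. When the tail number itself is kept as a dimension, every group holds exactly one value.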

In [None]:
sql='''
REPLACE INTO "flights-counts" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://static.imply.io/example-data/flight_on_time/flights/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11.csv.zip"]}',
    '{"type":"csv","findColumnsFromHeader":true}'
  )
) EXTEND ("depaturetime" VARCHAR, "arrivalime" VARCHAR, "Year" BIGINT, "Quarter" BIGINT, "Month" BIGINT, "DayofMonth" BIGINT, "DayOfWeek" BIGINT, "FlightDate" VARCHAR, "Reporting_Airline" VARCHAR, "DOT_ID_Reporting_Airline" BIGINT, "IATA_CODE_Reporting_Airline" VARCHAR, "Tail_Number" VARCHAR, "Flight_Number_Reporting_Airline" BIGINT, "OriginAirportID" BIGINT, "OriginAirportSeqID" BIGINT, "OriginCityMarketID" BIGINT, "Origin" VARCHAR, "OriginCityName" VARCHAR, "OriginState" VARCHAR, "OriginStateFips" BIGINT, "OriginStateName" VARCHAR, "OriginWac" BIGINT, "DestAirportID" BIGINT, "DestAirportSeqID" BIGINT, "DestCityMarketID" BIGINT, "Dest" VARCHAR, "DestCityName" VARCHAR, "DestState" VARCHAR, "DestStateFips" BIGINT, "DestStateName" VARCHAR, "DestWac" BIGINT, "CRSDepTime" BIGINT, "DepTime" BIGINT, "DepDelay" BIGINT, "DepDelayMinutes" BIGINT, "DepDel15" BIGINT, "DepartureDelayGroups" BIGINT, "DepTimeBlk" VARCHAR, "TaxiOut" BIGINT, "WheelsOff" BIGINT, "WheelsOn" BIGINT, "TaxiIn" BIGINT, "CRSArrTime" BIGINT, "ArrTime" BIGINT, "ArrDelay" BIGINT, "ArrDelayMinutes" BIGINT, "ArrDel15" BIGINT, "ArrivalDelayGroups" BIGINT, "ArrTimeBlk" VARCHAR, "Cancelled" BIGINT, "CancellationCode" VARCHAR, "Diverted" BIGINT, "CRSElapsedTime" BIGINT, "ActualElapsedTime" BIGINT, "AirTime" BIGINT, "Flights" BIGINT, "Distance" BIGINT, "DistanceGroup" BIGINT, "CarrierDelay" BIGINT, "WeatherDelay" BIGINT, "NASDelay" BIGINT, "SecurityDelay" BIGINT, "LateAircraftDelay" BIGINT, "FirstDepTime" VARCHAR, "TotalAddGTime" VARCHAR, "LongestAddGTime" VARCHAR, "DivAirportLandings" VARCHAR, "DivReachedDest" VARCHAR, "DivActualElapsedTime" VARCHAR, "DivArrDelay" VARCHAR, "DivDistance" VARCHAR, "Div1Airport" VARCHAR, "Div1AirportID" VARCHAR, "Div1AirportSeqID" VARCHAR, "Div1WheelsOn" VARCHAR, "Div1TotalGTime" VARCHAR, "Div1LongestGTime" VARCHAR, "Div1WheelsOff" VARCHAR, "Div1TailNum" VARCHAR, "Div2Airport" VARCHAR, "Div2AirportID" VARCHAR, "Div2AirportSeqID" VARCHAR, "Div2WheelsOn" VARCHAR, "Div2TotalGTime" 
VARCHAR, "Div2LongestGTime" VARCHAR, "Div2WheelsOff" VARCHAR, "Div2TailNum" VARCHAR, "Div3Airport" VARCHAR, "Div3AirportID" VARCHAR, "Div3AirportSeqID" VARCHAR, "Div3WheelsOn" VARCHAR, "Div3TotalGTime" VARCHAR, "Div3LongestGTime" VARCHAR, "Div3WheelsOff" VARCHAR, "Div3TailNum" VARCHAR, "Div4Airport" VARCHAR, "Div4AirportID" VARCHAR, "Div4AirportSeqID" VARCHAR, "Div4WheelsOn" VARCHAR, "Div4TotalGTime" VARCHAR, "Div4LongestGTime" VARCHAR, "Div4WheelsOff" VARCHAR, "Div4TailNum" VARCHAR, "Div5Airport" VARCHAR, "Div5AirportID" VARCHAR, "Div5AirportSeqID" VARCHAR, "Div5WheelsOn" VARCHAR, "Div5TotalGTime" VARCHAR, "Div5LongestGTime" VARCHAR, "Div5WheelsOff" VARCHAR, "Div5TailNum" VARCHAR, "Unnamed: 109" VARCHAR))
SELECT
  TIME_FLOOR(TIME_PARSE("depaturetime"), 'PT1H') AS "__time",
  "Reporting_Airline",
  "Origin",
  "Dest",
  COUNT(*) AS "Events",
  MAX("Distance") AS "Distance_Max",
  MIN("Distance") AS "Distance_Min",
  DS_HLL("Tail_Number") AS "Tail_Number_HLL",
  DS_THETA("Tail_Number") AS "Tail_Number_THETA"
FROM "ext"
GROUP BY 1, 2, 3, 4
PARTITIONED BY DAY
'''


When doing this programmatically, we need to include context parameters that prompt Druid to store the sketch itself rather than a finalized estimate: `finalize` and [`finalizeAggregations`](https://druid.apache.org/docs/26.0.0/multi-stage-query/reference.html#context-parameters), both set to `false`. Notice that, if you build an ingestion using the console, these settings are applied for you automatically.

The following cell adds the parameters and then executes the ingestion.

In [None]:
req = sql_client.sql_request(sql)
req.add_context("finalize", "false")
req.add_context("finalizeAggregations", "false")

sql_client.run_task(req)
sql_client.wait_until_ready('flights-counts')
display.table('flights-counts')

Open your Druid console's ingestion tab to monitor the progress of the ingestion.

Now we can use specific SQL functions that inform Druid to use sketches we have created:

* For HLL, [`APPROX_COUNT_DISTINCT_DS_HLL`](https://druid.apache.org/docs/26.0.0/querying/sql-functions.html#approx_count_distinct_ds_hll), and
* for Theta, [`APPROX_COUNT_DISTINCT_DS_THETA`](https://druid.apache.org/docs/26.0.0/querying/sql-functions.html#approx_count_distinct_ds_theta).

Here's an example query showing our estimated results – notice that we can still use the `FILTER` clause to split results.

In [None]:
sql='''
SELECT
   "Reporting_Airline",
   SUM("Distance_Max") AS "Miles_Flown",
   APPROX_COUNT_DISTINCT_DS_HLL("Tail_Number_HLL") FILTER (WHERE "Distance_Max" > 2000) AS "HLLApprox-over2k",
   APPROX_COUNT_DISTINCT_DS_THETA("Tail_Number_THETA") FILTER (WHERE "Distance_Max" < 2000) AS "ThetaApprox-under2k"
FROM "flights-counts"
GROUP BY 1
'''

display.sql(sql)

Remembering that HLL sketches are mergeable, we can take multiple sets of results and estimate an overall distinct count.

In this query, we generate three HLL sketches covering flights out of three cities in the United States over a three-week period. We then merge these together, and estimate how many distinct `Tail_Number`s there were. You'll recognise the `APPROX_COUNT_DISTINCT_DS_HLL` function and the `DS_HLL` function, which generates sketches for the `Tail_Number`s originating in each city. To that we add the `HLL_SKETCH_UNION` function, which merges each of our result sets. We then use the `HLL_SKETCH_ESTIMATE` function to turn the merged sketch into a readable number.

We're then grouping those calculations by weeks by using `TIME_FLOOR`.
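The reason for merging the sketches, rather than adding up the three per-city estimates, is that the same aircraft can depart from more than one city. A minimal sketch with exact Python sets and invented tail numbers shows the difference:

```python
# Hypothetical tail numbers seen departing each airport in one week.
atl = {"N101", "N102", "N103"}
dfw = {"N102", "N104"}
sfo = {"N103", "N104", "N105"}

per_city = [len(atl), len(dfw), len(sfo)]

# The union counts each aircraft once, however many cities it departed from.
any_of_three = len(atl | dfw | sfo)

print(per_city, sum(per_city), any_of_three)
```

Here the per-city counts add up to 8, but only 5 distinct aircraft are involved. `HLL_SKETCH_UNION` plays the role of the set union, on sketches instead of exact sets.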

In [None]:
sql='''
SELECT
  TIME_FLOOR("__time",'P1W') AS "Week commencing",
  APPROX_COUNT_DISTINCT_DS_HLL("Tail_Number_HLL") FILTER (WHERE "Origin"='ATL') AS "From Atlanta",
  APPROX_COUNT_DISTINCT_DS_HLL("Tail_Number_HLL") FILTER (WHERE "Origin"='DFW') AS "From Dallas",
  APPROX_COUNT_DISTINCT_DS_HLL("Tail_Number_HLL") FILTER (WHERE "Origin"='SFO') AS "From San Francisco",
  HLL_SKETCH_ESTIMATE(
     HLL_SKETCH_UNION(
       DS_HLL("Tail_Number_HLL") FILTER (WHERE "Origin"='ATL'),
       DS_HLL("Tail_Number_HLL") FILTER (WHERE "Origin"='DFW'),
       DS_HLL("Tail_Number_HLL") FILTER (WHERE "Origin"='SFO')
      )
    ) AS "From any of the three",
  APPROX_COUNT_DISTINCT_DS_HLL("Tail_Number_HLL") AS "From any city"
FROM "flights-counts"
WHERE TIMESTAMP '2005-10-31' <= __time AND __time <= TIMESTAMP '2005-11-20'
GROUP BY 1
'''

display.sql(sql)

### Creating sketches during streaming ingestion

In streaming ingestion, the same principles apply – you include an entry in the [`metricsSpec`](https://druid.apache.org/docs/26.0.0/ingestion/ingestion-spec.html#metricsspec) part of your ingestion specification, enabling [`queryGranularity`](https://druid.apache.org/docs/latest/ingestion/ingestion-spec.html#granularityspec) and `rollup` to truncate the time stamp and pre-aggregate the rows.

The `DS_HLL` and `DS_THETA` calls in the batch ingestion above are equivalent to these `metricsSpec` entries:

```json
    {
      "type": "HLLSketchBuild",
      "name": "Tail_Number_HLL",
      "fieldName": "Tail_Number",
      "lgK": 12,
      "tgtHllType": "HLL_4"
    },
    {
      "type": "thetaSketch",
      "name": "Tail_Number_THETA",
      "fieldName": "Tail_Number",
      "size": 16384
    }
```

Notice that here it's easy to see some internal parameters for sketch generation, like the `lgK` value for HLL. In SQL mode, these are exposed as supplementary parameters to the `DS_HLL` function. Be cautious of changing these values without researching the effects - not just in accuracy but also in terms of performance and segment size.
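As a rough guide to what `lgK` controls, the arithmetic can be sketched as follows. It rests on two assumptions: that `HLL_4` stores 4 bits per register, and that the error follows the `1.04 / sqrt(m)` figure from the original HyperLogLog analysis. The DataSketches library uses its own estimators and adds header and exception storage, so treat these numbers as indicative only. The function names are hypothetical.

```python
import math

def hll_register_count(lg_k):
    """Number of registers in an HLL sketch with the given lgK."""
    return 2 ** lg_k

def classic_hll_rse(lg_k):
    """Relative standard error of the original HyperLogLog estimator."""
    return 1.04 / math.sqrt(hll_register_count(lg_k))

def hll4_register_bytes(lg_k):
    """Bytes for the registers alone at 4 bits each (headers excluded)."""
    return hll_register_count(lg_k) * 4 // 8

for lg_k in (10, 12, 14):
    print(
        lg_k,
        hll_register_count(lg_k),
        hll4_register_bytes(lg_k),
        f"{classic_hll_rse(lg_k):.3%}",
    )
```

Under these assumptions, `lgK = 12` gives 4096 registers in 2048 bytes. Each increase of `lgK` by 2 quadruples the register storage and halves the expected error.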

## Conclusion

* Approximation is the default execution model for `COUNT(DISTINCT)` queries
* You can turn it off with a query context parameter
* Accuracy is highly dependent on the distribution and cardinality of data across the database
* Druid can be pre-loaded with sketch objects that speed up approximation both in batch and streaming ingestion

## Learn more

* Watch [Employ Approximation](https://youtu.be/fSWwJs1gCvQ?list=PLDZysOZKycN7MZvNxQk_6RbwSJqjSrsNR) by Peter Marshall
* Read [Ingesting Data Sketches into Apache Druid](https://blog.hellmar-becker.de/2022/12/26/ingesting-data-sketches-into-apache-druid/) by Hellmar Becker
* Read more about the native "aggregator" functions for streaming ingestion
    * [ThetaSketch function](https://druid.apache.org/docs/26.0.0/development/extensions-core/datasketches-theta.html)
    * [HyperLogLog function](https://druid.apache.org/docs/26.0.0/development/extensions-core/datasketches-hll.html)