# Counting distinct values

Analysts often need to count the distinct values in a column or result set. Apache Druid speeds up this type of calculation through approximation, using probabilistic data structures.

In this tutorial, you work through some examples to see the effect of turning approximation on and off, and of making queries even faster by pre-generating the objects that approximation uses.

## Prerequisites

This tutorial works with Druid 26.0.0 or later.

### Run using Docker

Launch this tutorial and all prerequisites using the `druid-jupyter` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see [Docker for Jupyter Notebook tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).
   
### Run without using Docker

If you do not use the Docker Compose environment, you need the following:

* A running Apache Druid instance, with a `DRUID_HOST` local environment variable containing the server name of your Druid router.
* [druidapi](https://github.com/apache/druid/blob/master/examples/quickstart/jupyter-notebooks/druidapi/README.md), a Python client for Apache Druid. Follow the instructions in the Install section of the README file.
* [matplotlib](https://matplotlib.org/), a library for creating visualizations in Python.
* [pandas](https://pandas.pydata.org/), a data analysis and manipulation tool.

### Initialize Python

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If successful, the Druid version number will be shown in the output.

In [None]:
import druidapi
import os

# Fall back to localhost when the DRUID_HOST environment variable is not set.
if 'DRUID_HOST' not in os.environ:
    druid_host = "http://localhost:8888"
else:
    druid_host = f"http://{os.environ['DRUID_HOST']}:8888"

print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)

display = druid.display
sql_client = druid.sql
status_client = druid.status

status_client.version

### Load example data

Once your Druid environment is up and running, ingest the sample data for this tutorial.

Run the following cell to create a table called `example-flights-countdistinct`. When the ingestion completes, the cell displays a description of the final table.

You can monitor the progress of the ingestion task in the Druid console.

In [None]:
sql='''
REPLACE INTO "example-flights-countdistinct" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://static.imply.io/example-data/flight_on_time/flights/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11.csv.zip"]}',
    '{"type":"csv","findColumnsFromHeader":true}'
  )
) EXTEND ("depaturetime" VARCHAR, "arrivalime" VARCHAR, "Year" BIGINT, "Quarter" BIGINT, "Month" BIGINT, "DayofMonth" BIGINT, "DayOfWeek" BIGINT, "FlightDate" VARCHAR, "Reporting_Airline" VARCHAR, "DOT_ID_Reporting_Airline" BIGINT, "IATA_CODE_Reporting_Airline" VARCHAR, "Tail_Number" VARCHAR, "Flight_Number_Reporting_Airline" BIGINT, "OriginAirportID" BIGINT, "OriginAirportSeqID" BIGINT, "OriginCityMarketID" BIGINT, "Origin" VARCHAR, "OriginCityName" VARCHAR, "OriginState" VARCHAR, "OriginStateFips" BIGINT, "OriginStateName" VARCHAR, "OriginWac" BIGINT, "DestAirportID" BIGINT, "DestAirportSeqID" BIGINT, "DestCityMarketID" BIGINT, "Dest" VARCHAR, "DestCityName" VARCHAR, "DestState" VARCHAR, "DestStateFips" BIGINT, "DestStateName" VARCHAR, "DestWac" BIGINT, "CRSDepTime" BIGINT, "DepTime" BIGINT, "DepDelay" BIGINT, "DepDelayMinutes" BIGINT, "DepDel15" BIGINT, "DepartureDelayGroups" BIGINT, "DepTimeBlk" VARCHAR, "TaxiOut" BIGINT, "WheelsOff" BIGINT, "WheelsOn" BIGINT, "TaxiIn" BIGINT, "CRSArrTime" BIGINT, "ArrTime" BIGINT, "ArrDelay" BIGINT, "ArrDelayMinutes" BIGINT, "ArrDel15" BIGINT, "ArrivalDelayGroups" BIGINT, "ArrTimeBlk" VARCHAR, "Cancelled" BIGINT, "CancellationCode" VARCHAR, "Diverted" BIGINT, "CRSElapsedTime" BIGINT, "ActualElapsedTime" BIGINT, "AirTime" BIGINT, "Flights" BIGINT, "Distance" BIGINT, "DistanceGroup" BIGINT, "CarrierDelay" BIGINT, "WeatherDelay" BIGINT, "NASDelay" BIGINT, "SecurityDelay" BIGINT, "LateAircraftDelay" BIGINT, "FirstDepTime" VARCHAR, "TotalAddGTime" VARCHAR, "LongestAddGTime" VARCHAR, "DivAirportLandings" VARCHAR, "DivReachedDest" VARCHAR, "DivActualElapsedTime" VARCHAR, "DivArrDelay" VARCHAR, "DivDistance" VARCHAR, "Div1Airport" VARCHAR, "Div1AirportID" VARCHAR, "Div1AirportSeqID" VARCHAR, "Div1WheelsOn" VARCHAR, "Div1TotalGTime" VARCHAR, "Div1LongestGTime" VARCHAR, "Div1WheelsOff" VARCHAR, "Div1TailNum" VARCHAR, "Div2Airport" VARCHAR, "Div2AirportID" VARCHAR, "Div2AirportSeqID" VARCHAR, "Div2WheelsOn" VARCHAR, "Div2TotalGTime" 
VARCHAR, "Div2LongestGTime" VARCHAR, "Div2WheelsOff" VARCHAR, "Div2TailNum" VARCHAR, "Div3Airport" VARCHAR, "Div3AirportID" VARCHAR, "Div3AirportSeqID" VARCHAR, "Div3WheelsOn" VARCHAR, "Div3TotalGTime" VARCHAR, "Div3LongestGTime" VARCHAR, "Div3WheelsOff" VARCHAR, "Div3TailNum" VARCHAR, "Div4Airport" VARCHAR, "Div4AirportID" VARCHAR, "Div4AirportSeqID" VARCHAR, "Div4WheelsOn" VARCHAR, "Div4TotalGTime" VARCHAR, "Div4LongestGTime" VARCHAR, "Div4WheelsOff" VARCHAR, "Div4TailNum" VARCHAR, "Div5Airport" VARCHAR, "Div5AirportID" VARCHAR, "Div5AirportSeqID" VARCHAR, "Div5WheelsOn" VARCHAR, "Div5TotalGTime" VARCHAR, "Div5LongestGTime" VARCHAR, "Div5WheelsOff" VARCHAR, "Div5TailNum" VARCHAR, "Unnamed: 109" VARCHAR))
SELECT
  TIME_PARSE("depaturetime") AS "__time",
  "Reporting_Airline",
  "Tail_Number",
  "Origin"
FROM "ext"
PARTITIONED BY DAY
'''

sql_client.run_task(sql)
sql_client.wait_until_ready('example-flights-countdistinct')
display.table('example-flights-countdistinct')

Finally, run the following cell to import additional Python modules that you will use.

In [None]:
import json
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd

## Using COUNT(DISTINCT)

The `COUNT(DISTINCT)` function is the most common way to find the number of distinct elements in a set. But there are other ways to use Druid's massively parallel query execution engine to solve this problem, particularly on data sets that contain tens of thousands, or even millions, of unique values.

In the sections that follow, you will try out:

* COUNT(DISTINCT) with and without approximation
* Apache DataSketches-based set operations

### Run COUNT(DISTINCT) with approximation

By default, Apache Druid applies approximation to `COUNT(DISTINCT)` queries, which keeps interactive data exploration performant even with a computationally expensive operation like `COUNT(DISTINCT)`.

> Approximations improve scalability, storage, and memory use - at the cost of some error.
> 
> _[Gian Merlino](https://github.com/gianm)_

You can look into Druid's [configuration files](https://druid.apache.org/docs/26.0.0/configuration/index.html#sql) to find whether this approach has been left as the default by your system administrators (`druid.sql.planner.useApproximateCountDistinct`) and what approach to approximation will be used (`druid.sql.approxCountDistinct.function`).

When approximation is used, intermediate results from each data process are put into a representation called a [data sketch](https://datasketches.apache.org/), a probabilistic data structure whose size does not depend on the size of the underlying data. Druid then unions these sketches and estimates the size of the resulting set.
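
To make the idea concrete, here is a minimal sketch of a K-Minimum-Values (KMV) estimator in plain Python. It is a toy illustration of a fixed-size, mergeable summary, not the algorithm that Druid or Apache DataSketches actually runs. The names `hash01`, `make_sketch`, `union`, and `estimate`, the sketch size `K`, and the synthetic tail numbers are all assumptions made for this example.

```python
import hashlib

K = 256  # maximum number of hash values the toy sketch retains (assumed)

def hash01(value: str) -> float:
    """Map a value to a pseudo-random float in [0, 1)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def make_sketch(values, k=K):
    """Keep only the k smallest distinct hash values."""
    return sorted({hash01(v) for v in values})[:k]

def union(a, b, k=K):
    """Merge two sketches; the result again holds at most k values."""
    return sorted(set(a) | set(b))[:k]

def estimate(sketch, k=K):
    """Estimate the number of distinct values the sketch has seen."""
    if len(sketch) < k:
        return float(len(sketch))  # fewer than k values: the count is exact
    return (k - 1) / sketch[-1]    # the k-th smallest hash reveals the density

# Two hypothetical data processes, each holding part of the data,
# with 2,000 synthetic tail numbers appearing in both parts.
part_a = [f"N{i}" for i in range(0, 6000)]
part_b = [f"N{i}" for i in range(4000, 10000)]

merged = union(make_sketch(part_a), make_sketch(part_b))
approx = estimate(merged)
exact = len(set(part_a) | set(part_b))

print(f"exact={exact}, approximate={approx:.0f}, retained hashes={len(merged)}")
```

Each partial sketch holds at most `K` numbers no matter how many rows it summarizes, which is why merging sketches across data processes stays cheap.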

Run the following cell to execute a `COUNT(DISTINCT)` query that, by default, will run using approximation. It finds the number of unique `Tail_Number` values for each `Reporting_Airline` and stores the results in a dataframe. The results are then plotted in a bar chart.

In [None]:
sql = '''
SELECT
    "Reporting_Airline",
    COUNT(DISTINCT "Tail_Number") AS "Unique Tail Numbers"
FROM "example-flights-countdistinct"
GROUP BY 1
ORDER BY 2
'''

df1 = pd.DataFrame(sql_client.sql(sql))

df1.plot.bar(x='Reporting_Airline', y='Unique Tail Numbers')
plt.show()

### Run COUNT(DISTINCT) without approximation

Supply a query context parameter, `useApproximateCountDistinct`, to force Druid to not use approximation for `COUNT(DISTINCT)` queries.

Using the same SQL statement as before, the following cell crafts a request (`req`). The request is then given a context parameter to turn off approximation. The response is then stored and put into a second dataframe, from which we get a plot of unique `Tail_Number`s by `Reporting_Airline`.

In [None]:
req = sql_client.sql_request(sql)
req.add_context("useApproximateCountDistinct", "false")
resp = sql_client.sql_query(req)

df2 = pd.DataFrame(resp.rows)
df2.plot.bar(x='Reporting_Airline', y='Unique Tail Numbers')
plt.show()
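
For reference, the context parameter travels alongside the query text in the request body. The sketch below builds, but does not send, an equivalent request for Druid's SQL HTTP endpoint using only the Python standard library. It assumes that the router listens on `http://localhost:8888`, and the `build_request` helper is a hypothetical name used only here.

```python
import json
import urllib.request

def build_request(druid_host: str, query: str, approximate: bool) -> urllib.request.Request:
    """Build a POST request for Druid's SQL endpoint with a query context."""
    payload = {
        "query": query,
        "context": {"useApproximateCountDistinct": approximate},
    }
    return urllib.request.Request(
        url=f"{druid_host}/druid/v2/sql",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

example_sql = 'SELECT COUNT(DISTINCT "Tail_Number") FROM "example-flights-countdistinct"'
request = build_request("http://localhost:8888", example_sql, approximate=False)

print(request.full_url)
print(request.data.decode("utf-8"))

# To send it against a running Druid instance:
#   with urllib.request.urlopen(request) as response:
#       rows = json.load(response)
```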

### Compare the results

The next cell compares the two results above: `df1` holds the results from the default approximation approach, while `df2` holds the results with approximation turned off.

In [None]:
df3 = df1.compare(df2, keep_equal=True)
df3

The table shows:

* A row number
* The reporting airline in the approximate results (`self`) versus that in the non-approximate results (`other`)
* The calculated number of distinct `Tail_Number` values in each set of results

Notice that there are _value_ errors, as you might expect with approximation, and that in some instances this affects the _order_ of results.

Error in sketch-based approximation is probabilistic, rather than guaranteed. That's to say that a certain percentage of the time you can expect the measurements you take to be within a certain distance of the true value.
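
A small simulation illustrates what probabilistic error means in practice. It is a sketch under stated assumptions: it uses a toy K-Minimum-Values estimator rather than Druid's HyperLogLog or Theta implementations, it models hash values as uniform random numbers, and the values of `true_n`, `k`, and `trials` are arbitrary.

```python
import heapq
import random

def kmv_estimate(hashes, k):
    """Estimate the distinct count from the k-th smallest hash value."""
    kth_smallest = heapq.nsmallest(k, hashes)[-1]
    return (k - 1) / kth_smallest

rng = random.Random(42)   # fixed seed so that the run is repeatable
true_n = 5000             # true number of distinct values (assumed)
k = 256                   # hash values kept by the toy sketch (assumed)
trials = 200

relative_errors = []
for _ in range(trials):
    # Each distinct value hashes to an independent uniform number in [0, 1).
    hashes = [rng.random() for _ in range(true_n)]
    estimate = kmv_estimate(hashes, k)
    relative_errors.append(abs(estimate - true_n) / true_n)

tolerance = 2 / k ** 0.5  # roughly two standard errors for this toy estimator
share_within = sum(e <= tolerance for e in relative_errors) / trials

print(f"{share_within:.0%} of {trials} estimates fall within ±{tolerance:.1%} of the true value")
```

Most runs land inside the tolerance, but no single estimate is guaranteed to.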

## Calculating set union with Theta and HyperLogLog sketches

There are two types of Apache DataSketches sketch that you can use to estimate the size of a union of two or more sets:

* [HyperLogLog](https://druid.apache.org/docs/26.0.0/querying/sql-aggregations.html#hll-sketch-functions)
* [Theta](https://druid.apache.org/docs/26.0.0/querying/sql-aggregations.html#theta-sketch-functions)

Each allows Druid to estimate the `COUNT(DISTINCT)` of the union of two or more sets. When you ran the `COUNT(DISTINCT)` query in approximate mode, Druid arrived at a single set that was the union of the intermediate sets, and returned an estimate of the set size.

In Druid SQL, you can access functions that allow you to define your own sets that you can union in order to estimate their size.

Run the next cell, which:

* Gets three sets of `Tail_Number` values using `DS_HLL`, applying a `FILTER` to isolate flights out of three specific airports,
* Applies `HLL_SKETCH_UNION` to union the three sets,
* Estimates the resulting set size with `HLL_SKETCH_ESTIMATE`, and
* Repeats the same steps with the Theta sketch functions `DS_THETA`, `THETA_SKETCH_UNION`, and `THETA_SKETCH_ESTIMATE`.

It uses `TIME_FLOOR` to give a week-by-week `GROUP BY` of the data.

In [None]:
sql='''
SELECT
  TIME_FLOOR("__time",'P1W') AS "Week commencing",
  HLL_SKETCH_ESTIMATE(
     HLL_SKETCH_UNION(
       DS_HLL("Tail_Number") FILTER (WHERE "Origin"='ATL'),
       DS_HLL("Tail_Number") FILTER (WHERE "Origin"='DFW'),
       DS_HLL("Tail_Number") FILTER (WHERE "Origin"='SFO')
      )
    ) AS "AnyCity-HLL",
  THETA_SKETCH_ESTIMATE(
     THETA_SKETCH_UNION(
       DS_THETA("Tail_Number") FILTER (WHERE "Origin"='ATL'),
       DS_THETA("Tail_Number") FILTER (WHERE "Origin"='DFW'),
       DS_THETA("Tail_Number") FILTER (WHERE "Origin"='SFO')
      )
    ) AS "AnyCity-Theta"
FROM "example-flights-countdistinct"
WHERE TIMESTAMP '2005-10-31' <= __time AND __time <= TIMESTAMP '2005-11-20'
GROUP BY 1
'''

display.sql(sql)
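
The weekly buckets produced by `TIME_FLOOR("__time", 'P1W')` can be reproduced in plain Python to check which week a flight belongs to. This sketch assumes that week buckets start on Monday at midnight UTC, which is consistent with the Monday start date, 31st October 2005, used in the time filter of the query above. `floor_to_week` is a hypothetical helper.

```python
from datetime import datetime, timedelta

def floor_to_week(ts: datetime) -> datetime:
    """Return midnight on the Monday of the week that contains ts."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight - timedelta(days=midnight.weekday())  # Monday is weekday 0

for ts in (
    datetime(2005, 10, 31, 0, 0),    # Monday: already a bucket boundary
    datetime(2005, 11, 3, 14, 30),   # Thursday of the same week
    datetime(2005, 11, 7, 8, 15),    # the following Monday
):
    print(ts, "->", floor_to_week(ts))
```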

The HyperLogLog and Theta sketch functions work differently, and the intermediate sketches are constructed at query time with different defaults, so the HyperLogLog and Theta estimates differ.

Read more about this in the [documentation](https://druid.apache.org/docs/26.0.0/querying/sql-aggregations.html#hll-sketch-functions) for the `DS_HLL` and `DS_THETA` functions.

## Calculating set intersection and difference with Theta sketches

With Theta sketches, you can additionally approximate the size of:

* The intersection of two sets (airplanes that departed from both ATL _and_ SFO)
* The difference between one set and another (airplanes that departed from ATL and _not_ SFO)
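
The following toy example shows why a Theta-style sketch can support these operations. It is a simplified illustration and not the Apache DataSketches implementation: each sketch keeps the hash values that fall below a threshold `theta`, so two sketches can be compared hash by hash. The helper names, the nominal size `K`, and the synthetic tail numbers are assumptions.

```python
import hashlib

K = 1024  # nominal number of hash values per toy sketch (assumed)

def hash01(value: str) -> float:
    """Map a value to a pseudo-random float in [0, 1)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def make_sketch(values, k=K):
    """Return (theta, retained): the hashes strictly below the threshold theta."""
    hashes = sorted({hash01(v) for v in values})
    if len(hashes) <= k:
        return 1.0, set(hashes)      # small set: keep everything
    theta = hashes[k]                # the (k+1)-th smallest hash is the cut-off
    return theta, set(hashes[:k])

def intersect(a, b):
    """Hashes present in both sketches, below the smaller threshold."""
    theta = min(a[0], b[0])
    return theta, {h for h in a[1] & b[1] if h < theta}

def a_not_b(a, b):
    """Hashes present in a but not in b, below the smaller threshold."""
    theta = min(a[0], b[0])
    return theta, {h for h in a[1] - b[1] if h < theta}

def estimate(sketch):
    """Scale the retained count up by the sampled fraction theta."""
    theta, retained = sketch
    return len(retained) / theta

# Synthetic tail numbers seen at two hypothetical airports; 2,000 appear at both.
atl = make_sketch(f"N{i}" for i in range(0, 6000))
sfo = make_sketch(f"N{i}" for i in range(4000, 9000))

both = estimate(intersect(atl, sfo))     # true size: 2,000
atl_only = estimate(a_not_b(atl, sfo))   # true size: 4,000
sfo_only = estimate(a_not_b(sfo, atl))   # true size: 3,000

print(f"ATL and SFO: about {both:.0f}")
print(f"ATL not SFO: about {atl_only:.0f}")
print(f"SFO not ATL: about {sfo_only:.0f}")
```

Swapping the arguments of the difference gives a different answer, because "in ATL but not SFO" and "in SFO but not ATL" are different sets.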


### Set intersection

Run the next cell to see the intersection between three Theta sketch sets, week-by-week.

As in the query above, each set is filtered to specific airports, then an intersection is performed, before finally the size of that set is estimated and passed back.

In [None]:
sql='''
SELECT
  TIME_FLOOR("__time",'P1W') AS "Week commencing",
  THETA_SKETCH_ESTIMATE(
     THETA_SKETCH_INTERSECT(
       DS_THETA("Tail_Number") FILTER (WHERE "Origin"='ATL'),
       DS_THETA("Tail_Number") FILTER (WHERE "Origin"='DFW'),
       DS_THETA("Tail_Number") FILTER (WHERE "Origin"='SFO')
      )
    ) AS "AllThreeCities"
FROM "example-flights-countdistinct"
WHERE TIMESTAMP '2005-10-31' <= __time AND __time <= TIMESTAMP '2005-11-20'
GROUP BY 1
'''

display.sql(sql)

This is an important application of set operations: estimating how many entities carried out each of several actions. In the example above, that is how many airplanes took off from all three cities in each week.

Another application might be to use Druid's event analytics capabilities to look for time-based intersections.

The next cell creates a dataframe from a SQL query that builds two sets using the `FILTER` clause. The first set represents all airplanes that flew in the week commencing 31st October, and the second represents those that flew in the week commencing 7th November. The query then intersects these to create a new sketch representing all the airplanes that flew in both weeks. Finally, the size of that set is estimated, with a `GROUP BY` that breaks it down by `Reporting_Airline`.

In [None]:
sql='''
SELECT
  "Reporting_Airline",
  THETA_SKETCH_ESTIMATE(
     THETA_SKETCH_INTERSECT(
       DS_THETA("Tail_Number") FILTER (WHERE TIME_FLOOR("__time",'P1W') = TIMESTAMP '2005-10-31'),
       DS_THETA("Tail_Number") FILTER (WHERE TIME_FLOOR("__time",'P1W') = TIMESTAMP '2005-11-07')
      )
    ) AS "BothWeeks"
FROM "example-flights-countdistinct"
GROUP BY 1
ORDER BY 2 DESC
'''

df = pd.DataFrame(sql_client.sql(sql))

df.plot.bar(x='Reporting_Airline', y='BothWeeks')
plt.show()

### Set difference

Finally, we turn to using Theta sketch operations to estimate the size of the difference between one set and another.

The next cell switches the `THETA_SKETCH_INTERSECT` operation, which intersects the sets, for `THETA_SKETCH_NOT`, which does a difference operation. The plot we see therefore charts, approximately, how many airplanes flew in the week commencing 31st October that did _not_ also fly in the next week.

Note that this operation is not commutative: Druid calculates the size of the one-way difference (A minus B) for each airline. It is not a symmetric difference operation.

In [None]:
sql='''
SELECT
  "Reporting_Airline",
  THETA_SKETCH_ESTIMATE(
     THETA_SKETCH_NOT(
       DS_THETA("Tail_Number") FILTER (WHERE TIME_FLOOR("__time",'P1W') = TIMESTAMP '2005-10-31'),
       DS_THETA("Tail_Number") FILTER (WHERE TIME_FLOOR("__time",'P1W') = TIMESTAMP '2005-11-07')
      )
    ) AS "FirstNotSecondWeek"
FROM "example-flights-countdistinct"
GROUP BY 1
ORDER BY 2 DESC
'''

df = pd.DataFrame(sql_client.sql(sql))

df.plot.bar(x='Reporting_Airline', y='FirstNotSecondWeek')
plt.show()

## Summary

* Approximation is the default execution model for `COUNT(DISTINCT)` queries.
* You can turn it off with a query context parameter.
* Accuracy is governed by the size and mode of the data sketch and by the operations you perform.
* HyperLogLog and Theta sketches both allow you to approximate `COUNT(DISTINCT)` of entire sets and of their unions.
* Only Theta sketches allow you to carry out set intersection and difference operations.

## Go further

* Try estimation on your own dataset:
    * Identify a high-cardinality column in one of your own datasets
    * Test how long a `COUNT(DISTINCT)` query takes to run with approximation turned on
    * Test how long the same query takes to run with approximation turned off
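
One way to take these timings from Python is a small wrapper around `time.perf_counter`. The `timed` helper below is a hypothetical name, and the commented-out usage assumes the `sql_client` object and a `sql` string like those defined earlier in this notebook. Wall-clock timings taken from a notebook include network and client overhead, so treat them as rough comparisons.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload so that the example runs without a Druid connection.
result, seconds = timed(sum, range(1_000_000))
print(f"result={result}, took {seconds:.4f}s")

# With a running Druid instance, the same helper could wrap the calls used above:
#   _, approx_seconds = timed(sql_client.sql, sql)
```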

## Learn more

* Read the [Theta sketch](https://druid.apache.org/docs/26.0.0/development/extensions-core/datasketches-theta.html) documentation for reference on ingestion and native queries on Theta sketches in Druid.
* Review the [Theta sketch scalar functions](https://druid.apache.org/docs/26.0.0/querying/sql-scalar.html#theta-sketch-functions) and [Theta sketch aggregation functions](https://druid.apache.org/docs/latest/querying/sql-aggregations.html#theta-sketch-functions) documentation.
* Read [Sketches for high cardinality columns](https://druid.apache.org/docs/latest/ingestion/schema-design.html#sketches-for-high-cardinality-columns) in the schema design guidance.
* Peek at the [DataSketches extension](https://druid.apache.org/docs/latest/development/extensions-core/datasketches-extension.html) documentation for information about other available sketches.
* Visit the [Apache DataSketches](https://datasketches.apache.org) project site, where you will also find information on [accuracy](https://datasketches.apache.org/docs/Theta/ThetaAccuracy.html).
* Watch [Casting the spell: Apache Druid in practice](https://youtu.be/QAitmv8QRq4) by Itai Yaffe and Yakir Buskilla (Nielsen)
* Watch [Employ Approximation](https://youtu.be/il84eH0kUyc) by Peter Marshall (Imply)
* Watch [Advertiser audience forecasting with Apache Druid](https://youtu.be/7PRWDMRSAOw) by Qasim Zeeshan and Sundeep Yedida (Reddit)
* Watch [Funnel Analysis in Mobile Gaming - leveraging approximation algorithms for low latency analytics](https://youtu.be/il84eH0kUyc) by Ramón Lastres Guerrero (Game Analytics)