# Counting distinct values

__Analysts very often want to count the unique occurrences of a dimension value in their data. Because Druid was built to handle large volumes of data, it includes an approximation technique that speeds up this calculation. In this tutorial, you work through some examples to see the effect of turning approximation on and off, and of making queries even faster by pre-generating the objects that Druid uses to execute them.__

## Prerequisites

* Import the "FlightCarrierOnTime (1 month)" sample data into the default table called `On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11`


In [None]:
import druidapi
import json
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd

# druid_host is the hostname and port for your Druid deployment. 
# In the Docker Compose tutorial environment, this is the Router
# service running at "http://router:8888".
# If you are not using the Docker Compose environment, edit the `druid_host`.

druid_host = "http://druid-master-0.lan:8888"
druid_host

druid = druidapi.jupyter_client(druid_host)
display = druid.display
sql_client = druid.sql

## COUNT(DISTINCT) queries on simple values

Here's a very simple query to find the number of distinct tail numbers for each reporting airline in the example dataset.

```sql
SELECT "Reporting_Airline", COUNT(DISTINCT "Tail_Number") AS "Events"
FROM "On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11"
GROUP BY 1
ORDER BY 2
```

### COUNT(DISTINCT) with approximation

Druid automatically looks for query patterns that can make use of approximation. In this instance, Druid identifies a match for approximate `COUNT(DISTINCT)`. Each data server computes its own result and sends it back for merging in a special object called a [data sketch](https://datasketches.apache.org/). Druid then merges these sketches, rather than full lists of distinct values, and passes the final result back to us.

> Approximations improve scalability, storage, and memory use - at the cost of some error.
> 
> _[Gian Merlino](https://github.com/gianm)_
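To make the merging step more concrete, here is a toy HyperLogLog-style sketch in plain Python. It is only an illustration of the general idea and not the implementation that Druid or Apache DataSketches uses: the `ToyHLL` class, the `sketch_of` helper, and the made-up tail numbers are all assumptions for this example. Each hypothetical data server builds a fixed-size array of registers, and merging simply keeps the larger value of each pair of registers.

```python
import hashlib
import math


class ToyHLL:
    """A toy HyperLogLog-style sketch (illustration only, not Druid's code)."""

    def __init__(self, p=10):
        self.p = p                      # number of bits used to pick a register
        self.m = 1 << p                 # number of registers (1024 when p=10)
        self.registers = [0] * self.m

    def add(self, value):
        # Derive a 64-bit hash from the value.
        digest = hashlib.sha1(value.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - self.p)                  # top p bits choose a register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit within the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other):
        # Merging keeps the larger value of each pair of registers.
        merged = ToyHLL(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)       # bias constant, valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # small-range correction
        return raw


def sketch_of(values):
    """Build a sketch from an iterable of strings."""
    sketch = ToyHLL()
    for v in values:
        sketch.add(v)
    return sketch


# Two hypothetical data servers hold overlapping sets of made-up tail numbers.
server_a = sketch_of(f"N{i:05d}" for i in range(0, 6000))
server_b = sketch_of(f"N{i:05d}" for i in range(4000, 10000))

# The merged sketch is still just 1024 small integers, however many values were added.
merged = server_a.merge(server_b)

# A sketch built from all the values in one place, for comparison.
combined = sketch_of(f"N{i:05d}" for i in range(0, 10000))

print(round(merged.estimate()), "estimated distinct values; the true count is 10000")
```

In this toy, merging the two per-server sketches gives exactly the same registers as sketching all the values in one place, which is why only the small sketches need to travel between servers.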

Let's run this query with all of Druid's defaults to see what the results are like. (We can safely omit a `__time` filter thanks to the tiny size of the example dataset.)

In [None]:
sql = '''
SELECT "Reporting_Airline", COUNT(DISTINCT "Tail_Number") AS "Events"
FROM "On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11"
GROUP BY 1
ORDER BY 2
'''
df = pd.DataFrame(sql_client.sql(sql))

df.plot.bar(x='Reporting_Airline', y='Events')
plt.show()

### COUNT(DISTINCT) without approximation

We can supply a query context parameter, `useApproximateCountDistinct`, to force Druid to not use approximation. We won't get the speed boost afforded by the sketching approach – but that's OK because the example dataset is so small!

In [None]:
sql='''
SELECT "Reporting_Airline", COUNT(DISTINCT "Tail_Number") AS "Events"
FROM "On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11"
GROUP BY 1
ORDER BY 2
'''

req = sql_client.sql_request(sql)
req.add_context("useApproximateCountDistinct", "false")
resp = sql_client.sql_query(req)

df = pd.DataFrame(resp.rows)
df.plot.bar(x='Reporting_Airline', y='Events')
plt.show()
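For reference, the context parameter travels alongside the query text in the request. The following is a minimal sketch of the JSON body that Druid's SQL API (`POST /druid/v2/sql`) accepts, assuming you were calling the API directly rather than through `druidapi`. The `build_sql_payload` helper is a hypothetical name used only for this illustration.

```python
import json


def build_sql_payload(sql, approximate=True):
    """Build a JSON-serializable body for a Druid SQL API request.

    Hypothetical helper: it only assembles the payload and does not send it.
    """
    return {
        "query": sql,
        "context": {
            # Same parameter and string value as in the cell above.
            "useApproximateCountDistinct": "true" if approximate else "false",
        },
    }


example_sql = '''
SELECT "Reporting_Airline", COUNT(DISTINCT "Tail_Number") AS "Events"
FROM "On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11"
GROUP BY 1
ORDER BY 2
'''

payload = build_sql_payload(example_sql, approximate=False)
body = json.dumps(payload, indent=2)
print(body)
```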

### Results compared

On the surface, these do not _look_ different. And, in a lot of user interfaces, that's perfectly fine!

But let's go a bit deeper and see how the results actually differ.

In [None]:
sql = '''
SELECT "Reporting_Airline", COUNT(DISTINCT "Tail_Number") AS "Events"
FROM "On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11"
GROUP BY 1
ORDER BY 2
'''

req = sql_client.sql_request(sql)
req.add_context("useApproximateCountDistinct", "false")
resp = sql_client.sql_query(req)

df1 = pd.DataFrame(sql_client.sql(sql))
df2 = pd.DataFrame(resp.rows)

df3 = df1.compare(df2, keep_equal=True)
df3

There are differences in the _values_, as you might expect with approximation, and these in turn affect the _ordering_ of the results.

Note that the error margins we see here are exaggerated because our dataset is so small.
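To put a number on the difference, one option is to compute the relative error for each airline. The sketch below assumes that both result sets are lists of dictionaries keyed by column name, which is the shape the notebook passes to `pd.DataFrame`. The `compare_counts` helper and the sample rows are made up for illustration and are not real query results.

```python
def compare_counts(approx_rows, exact_rows, key="Reporting_Airline", value="Events"):
    """Return one row per key with the approximate count, exact count and error in percent."""
    exact = {row[key]: row[value] for row in exact_rows}
    report = []
    for row in approx_rows:
        e = exact[row[key]]
        a = row[value]
        report.append({
            key: row[key],
            "approx": a,
            "exact": e,
            "error_pct": round(100.0 * (a - e) / e, 2),
        })
    # Largest absolute error first.
    return sorted(report, key=lambda r: abs(r["error_pct"]), reverse=True)


# Hypothetical rows, invented for this example only.
approx_rows = [
    {"Reporting_Airline": "X1", "Events": 102},
    {"Reporting_Airline": "X2", "Events": 495},
    {"Reporting_Airline": "X3", "Events": 250},
]
exact_rows = [
    {"Reporting_Airline": "X1", "Events": 100},
    {"Reporting_Airline": "X2", "Events": 500},
    {"Reporting_Airline": "X3", "Events": 250},
]

report = compare_counts(approx_rows, exact_rows)
for line in report:
    print(line)
```

With the notebook's client, the two inputs could be the rows returned for the approximate and the exact query respectively.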

The `Tail_Number` column also has very low cardinality whereas, in reality, this technique is especially helpful for very high cardinality data. Imagine that, for our query, only a few hundred rows are being passed back from each data server.

But when there are hundreds of thousands, millions, or even tens of millions of distinct values, passing these results back to be merged, and then merging them, takes a very long time. With the DataSketches approach, very _very_ small objects are passed back and merged, and we therefore get much faster query results.

As an experiment, you may want to:

* Ingest or use a much larger data set
* Identify a high-cardinality column
* Issue a `COUNT(DISTINCT)` query with approximation turned on
* Issue another query with approximation turned off
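For the experiment above, a small timing helper can make the comparison easier. This is only a sketch: `time_call` is a hypothetical helper, and the dummy workload stands in for a real query so that the example runs without a Druid connection.

```python
import time


def time_call(run_query, repeats=3):
    """Call run_query several times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Stand-in workload; replace it with a function that issues your query.
def dummy_query():
    return sum(range(100_000))


fastest = time_call(dummy_query)
print(f"fastest run: {fastest:.6f} seconds")
```

With the client from the setup cell, `run_query` could be something like `lambda: sql_client.sql(sql)` for the approximate query, and a function that sends the request with `useApproximateCountDistinct` set to `false` for the exact one.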

## COUNT(DISTINCT) queries on sketched data

For even faster performance, we can provide Druid with compatible sketches inside the data itself. We do this at ingestion time, pre-populating some dimensions with the sketches that would otherwise have to be computed at query time.

Two types of Apache DataSketches sketch allow for `COUNT(DISTINCT)` computations:

* [HyperLogLog](https://druid.apache.org/docs/26.0.0/querying/sql-aggregations.html#hll-sketch-functions)
* [Theta](https://druid.apache.org/docs/26.0.0/querying/sql-aggregations.html#theta-sketch-functions)

A Theta sketch allows for set operations, like intersection and difference, while HyperLogLog ("HLL") does not.
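To give a feel for why set operations are possible, here is a toy "K minimum values" sketch, a simplified relative of the Theta sketch. It is an assumption-laden illustration rather than the Apache DataSketches implementation: `hash01`, `kmv_sketch`, `estimate`, and `intersect` are hypothetical names, and the tail numbers are made up. Because the sketch retains actual hash values below a threshold (theta), two sketches can be compared hash by hash, which is what makes an intersection possible.

```python
import hashlib


def hash01(value):
    """Map a string to a pseudo-random number in [0, 1)."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") >> 11) / 2.0 ** 53


def kmv_sketch(values, k=1024):
    """Return (theta, hashes): the k smallest hashes and the cut-off theta."""
    hashes = sorted({hash01(v) for v in values})
    if len(hashes) <= k:
        return 1.0, set(hashes)          # nothing discarded, so counts are exact
    return hashes[k], set(hashes[:k])    # theta is the (k+1)-th smallest hash


def estimate(sketch):
    """Estimate the number of distinct values summarized by the sketch."""
    theta, hashes = sketch
    return len(hashes) / theta


def intersect(a, b):
    # Below the smaller theta both sketches kept every hash they saw,
    # so hashes present in both sets belong to values present in both inputs.
    return min(a[0], b[0]), a[1] & b[1]


# Two made-up sets of tail numbers that share 2000 values.
set_a = kmv_sketch(f"N{i:05d}" for i in range(0, 6000))
set_b = kmv_sketch(f"N{i:05d}" for i in range(4000, 10000))
both = intersect(set_a, set_b)

print(round(estimate(set_a)), "estimated values in the first set (true count 6000)")
print(round(estimate(both)), "estimated values in both sets (true count 2000)")
```

The intersection estimate rests on fewer retained hashes than either input, so its relative error is typically larger than that of a plain distinct count.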

To understand how this works, let's first create a `GROUP BY` query that generates an HLL sketch for us to see.

In [None]:
sql = '''
SELECT "Reporting_Airline", DS_HLL("Tail_Number") AS "Sketch"
FROM "On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11"
GROUP BY 1
'''

display.sql(sql)

In our results, we get a "human readable" version of what a sketch looks like. These are like the objects used internally by Druid when it does the approximate calculation.

Now imagine that we want to use these sketches at query time. To do that, we use a SQL function called `APPROX_COUNT_DISTINCT_DS_HLL`. It uses the stored objects directly to calculate results, so Druid does not need to compute them at query time.

Let's therefore ingest the flight data again, but this time let's `GROUP BY` at ingestion time and use the `DS_HLL` function to output a sketch object that we can then use for `COUNT(DISTINCT)`.


In [None]:
sql='''
REPLACE INTO "On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://static.imply.io/example-data/flight_on_time/flights/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11.csv.zip"]}',
    '{"type":"csv","findColumnsFromHeader":true}'
  )
) EXTEND ("depaturetime" VARCHAR, "arrivalime" VARCHAR, "Year" BIGINT, "Quarter" BIGINT, "Month" BIGINT, "DayofMonth" BIGINT, "DayOfWeek" BIGINT, "FlightDate" VARCHAR, "Reporting_Airline" VARCHAR, "DOT_ID_Reporting_Airline" BIGINT, "IATA_CODE_Reporting_Airline" VARCHAR, "Tail_Number" VARCHAR, "Flight_Number_Reporting_Airline" BIGINT, "OriginAirportID" BIGINT, "OriginAirportSeqID" BIGINT, "OriginCityMarketID" BIGINT, "Origin" VARCHAR, "OriginCityName" VARCHAR, "OriginState" VARCHAR, "OriginStateFips" BIGINT, "OriginStateName" VARCHAR, "OriginWac" BIGINT, "DestAirportID" BIGINT, "DestAirportSeqID" BIGINT, "DestCityMarketID" BIGINT, "Dest" VARCHAR, "DestCityName" VARCHAR, "DestState" VARCHAR, "DestStateFips" BIGINT, "DestStateName" VARCHAR, "DestWac" BIGINT, "CRSDepTime" BIGINT, "DepTime" BIGINT, "DepDelay" BIGINT, "DepDelayMinutes" BIGINT, "DepDel15" BIGINT, "DepartureDelayGroups" BIGINT, "DepTimeBlk" VARCHAR, "TaxiOut" BIGINT, "WheelsOff" BIGINT, "WheelsOn" BIGINT, "TaxiIn" BIGINT, "CRSArrTime" BIGINT, "ArrTime" BIGINT, "ArrDelay" BIGINT, "ArrDelayMinutes" BIGINT, "ArrDel15" BIGINT, "ArrivalDelayGroups" BIGINT, "ArrTimeBlk" VARCHAR, "Cancelled" BIGINT, "CancellationCode" VARCHAR, "Diverted" BIGINT, "CRSElapsedTime" BIGINT, "ActualElapsedTime" BIGINT, "AirTime" BIGINT, "Flights" BIGINT, "Distance" BIGINT, "DistanceGroup" BIGINT, "CarrierDelay" BIGINT, "WeatherDelay" BIGINT, "NASDelay" BIGINT, "SecurityDelay" BIGINT, "LateAircraftDelay" BIGINT, "FirstDepTime" VARCHAR, "TotalAddGTime" VARCHAR, "LongestAddGTime" VARCHAR, "DivAirportLandings" VARCHAR, "DivReachedDest" VARCHAR, "DivActualElapsedTime" VARCHAR, "DivArrDelay" VARCHAR, "DivDistance" VARCHAR, "Div1Airport" VARCHAR, "Div1AirportID" VARCHAR, "Div1AirportSeqID" VARCHAR, "Div1WheelsOn" VARCHAR, "Div1TotalGTime" VARCHAR, "Div1LongestGTime" VARCHAR, "Div1WheelsOff" VARCHAR, "Div1TailNum" VARCHAR, "Div2Airport" VARCHAR, "Div2AirportID" VARCHAR, "Div2AirportSeqID" VARCHAR, "Div2WheelsOn" VARCHAR, "Div2TotalGTime" 
VARCHAR, "Div2LongestGTime" VARCHAR, "Div2WheelsOff" VARCHAR, "Div2TailNum" VARCHAR, "Div3Airport" VARCHAR, "Div3AirportID" VARCHAR, "Div3AirportSeqID" VARCHAR, "Div3WheelsOn" VARCHAR, "Div3TotalGTime" VARCHAR, "Div3LongestGTime" VARCHAR, "Div3WheelsOff" VARCHAR, "Div3TailNum" VARCHAR, "Div4Airport" VARCHAR, "Div4AirportID" VARCHAR, "Div4AirportSeqID" VARCHAR, "Div4WheelsOn" VARCHAR, "Div4TotalGTime" VARCHAR, "Div4LongestGTime" VARCHAR, "Div4WheelsOff" VARCHAR, "Div4TailNum" VARCHAR, "Div5Airport" VARCHAR, "Div5AirportID" VARCHAR, "Div5AirportSeqID" VARCHAR, "Div5WheelsOn" VARCHAR, "Div5TotalGTime" VARCHAR, "Div5LongestGTime" VARCHAR, "Div5WheelsOff" VARCHAR, "Div5TailNum" VARCHAR, "Unnamed: 109" VARCHAR))
SELECT
  TIME_PARSE("depaturetime") AS "__time",
  "Reporting_Airline",
  DS_HLL("Tail_Number") AS "Tail_Number_HLL",
  "Origin",
  "Dest",
  "Distance"
FROM "ext"
GROUP BY 1, 2, 4, 5, 6
PARTITIONED BY DAY
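'''

# Submit the ingestion as a task and wait until the table can be queried.
# This assumes the druidapi `sql_client` created in the setup cell.
sql_client.run_task(sql)
sql_client.wait_until_ready('On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11')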
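Once the ingestion has completed, the stored sketches could be queried along these lines. This is a sketch that assumes the `Tail_Number_HLL` column defined in the ingestion statement above. The block only builds the SQL text, and running it is left as a comment because that needs the `display` client from the setup cell.

```python
table = "On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11"

# APPROX_COUNT_DISTINCT_DS_HLL merges the stored sketches instead of
# building new ones from raw tail numbers at query time.
sql = f'''
SELECT "Reporting_Airline", APPROX_COUNT_DISTINCT_DS_HLL("Tail_Number_HLL") AS "Events"
FROM "{table}"
GROUP BY 1
ORDER BY 2
'''

print(sql)

# With the notebook's client from the setup cell, this could then be run as:
# display.sql(sql)
```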



## Conclusion

* Approximation is the default execution model for `COUNT(DISTINCT)` queries
* You can turn it off with a query context parameter
* Accuracy is highly dependent on the distribution and cardinality of data across the database

## Learn more

* Watch "[Employ Approximation](https://youtu.be/fSWwJs1gCvQ?list=PLDZysOZKycN7MZvNxQk_6RbwSJqjSrsNR)" by Peter Marshall at [Imply](https://imply.io)