# Using TopN approximation in Druid queries

<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

Imagine you're building a dynamic filter in your app: you want to populate it with, say, the ten most popular dimension values, ranked by a count (COUNT) in descending order (ORDER BY). Druid speeds up this type of query using TopN approximation by default. In this tutorial, you work through some examples and see the effect of turning approximation off.

## Prerequisites

This tutorial works with Druid 26.0.0 or later.

#### Run using Docker

Launch this tutorial and all prerequisites using the `druid-jupyter` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see [Docker for Jupyter Notebook tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).
   
#### Run without using Docker

If you do not use the Docker Compose environment, you need the following:

* A running Apache Druid instance, with a `DRUID_HOST` local environment variable containing the server name of your Druid router.
* [druidapi](https://github.com/apache/druid/blob/master/examples/quickstart/jupyter-notebooks/druidapi/README.md), a Python client for Apache Druid. Follow the instructions in the Install section of the README file.
* [matplotlib](https://matplotlib.org/), a library for creating visualizations in Python.
* [pandas](https://pandas.pydata.org/), a data analysis and manipulation tool.

### Initialize Python

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If successful, the Druid version number will be shown in the output.

In [None]:
import druidapi
import os

if 'DRUID_HOST' not in os.environ:
    druid_host = "http://localhost:8888"
else:
    druid_host = f"http://{os.environ['DRUID_HOST']}:8888"
    
print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)

display = druid.display
sql_client = druid.sql
status_client = druid.status

status_client.version

### Load example data

Once your Druid environment is up and running, ingest the sample data for this tutorial.

Run the following cell to create a table called `example-flights-topn`.  When completed, you'll see a description of the final table.

Monitor the ingestion task process in the Druid console.

In [1]:
sql='''
REPLACE INTO "example-flights-topn" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://static.imply.io/example-data/flight_on_time/flights/example-flights-topn.csv.zip"]}',
    '{"type":"csv","findColumnsFromHeader":true}'
  )
) EXTEND ("depaturetime" VARCHAR, "arrivalime" VARCHAR, "Year" BIGINT, "Quarter" BIGINT, "Month" BIGINT, "DayofMonth" BIGINT, "DayOfWeek" BIGINT, "FlightDate" VARCHAR, "Reporting_Airline" VARCHAR, "DOT_ID_Reporting_Airline" BIGINT, "IATA_CODE_Reporting_Airline" VARCHAR, "Tail_Number" VARCHAR, "Flight_Number_Reporting_Airline" BIGINT, "OriginAirportID" BIGINT, "OriginAirportSeqID" BIGINT, "OriginCityMarketID" BIGINT, "Origin" VARCHAR, "OriginCityName" VARCHAR, "OriginState" VARCHAR, "OriginStateFips" BIGINT, "OriginStateName" VARCHAR, "OriginWac" BIGINT, "DestAirportID" BIGINT, "DestAirportSeqID" BIGINT, "DestCityMarketID" BIGINT, "Dest" VARCHAR, "DestCityName" VARCHAR, "DestState" VARCHAR, "DestStateFips" BIGINT, "DestStateName" VARCHAR, "DestWac" BIGINT, "CRSDepTime" BIGINT, "DepTime" BIGINT, "DepDelay" BIGINT, "DepDelayMinutes" BIGINT, "DepDel15" BIGINT, "DepartureDelayGroups" BIGINT, "DepTimeBlk" VARCHAR, "TaxiOut" BIGINT, "WheelsOff" BIGINT, "WheelsOn" BIGINT, "TaxiIn" BIGINT, "CRSArrTime" BIGINT, "ArrTime" BIGINT, "ArrDelay" BIGINT, "ArrDelayMinutes" BIGINT, "ArrDel15" BIGINT, "ArrivalDelayGroups" BIGINT, "ArrTimeBlk" VARCHAR, "Cancelled" BIGINT, "CancellationCode" VARCHAR, "Diverted" BIGINT, "CRSElapsedTime" BIGINT, "ActualElapsedTime" BIGINT, "AirTime" BIGINT, "Flights" BIGINT, "Distance" BIGINT, "DistanceGroup" BIGINT, "CarrierDelay" BIGINT, "WeatherDelay" BIGINT, "NASDelay" BIGINT, "SecurityDelay" BIGINT, "LateAircraftDelay" BIGINT, "FirstDepTime" VARCHAR, "TotalAddGTime" VARCHAR, "LongestAddGTime" VARCHAR, "DivAirportLandings" VARCHAR, "DivReachedDest" VARCHAR, "DivActualElapsedTime" VARCHAR, "DivArrDelay" VARCHAR, "DivDistance" VARCHAR, "Div1Airport" VARCHAR, "Div1AirportID" VARCHAR, "Div1AirportSeqID" VARCHAR, "Div1WheelsOn" VARCHAR, "Div1TotalGTime" VARCHAR, "Div1LongestGTime" VARCHAR, "Div1WheelsOff" VARCHAR, "Div1TailNum" VARCHAR, "Div2Airport" VARCHAR, "Div2AirportID" VARCHAR, "Div2AirportSeqID" VARCHAR, "Div2WheelsOn" VARCHAR, "Div2TotalGTime" 
VARCHAR, "Div2LongestGTime" VARCHAR, "Div2WheelsOff" VARCHAR, "Div2TailNum" VARCHAR, "Div3Airport" VARCHAR, "Div3AirportID" VARCHAR, "Div3AirportSeqID" VARCHAR, "Div3WheelsOn" VARCHAR, "Div3TotalGTime" VARCHAR, "Div3LongestGTime" VARCHAR, "Div3WheelsOff" VARCHAR, "Div3TailNum" VARCHAR, "Div4Airport" VARCHAR, "Div4AirportID" VARCHAR, "Div4AirportSeqID" VARCHAR, "Div4WheelsOn" VARCHAR, "Div4TotalGTime" VARCHAR, "Div4LongestGTime" VARCHAR, "Div4WheelsOff" VARCHAR, "Div4TailNum" VARCHAR, "Div5Airport" VARCHAR, "Div5AirportID" VARCHAR, "Div5AirportSeqID" VARCHAR, "Div5WheelsOn" VARCHAR, "Div5TotalGTime" VARCHAR, "Div5LongestGTime" VARCHAR, "Div5WheelsOff" VARCHAR, "Div5TailNum" VARCHAR, "Unnamed: 109" VARCHAR))
SELECT
  TIME_PARSE("depaturetime") AS "__time",
  "Reporting_Airline",
  "Tail_Number",
  "Distance",
  "Origin",
  "Flight_Number_Reporting_Airline"
FROM "ext"
PARTITIONED BY DAY
'''

sql_client.run_task(sql)
sql_client.wait_until_ready('example-flights-topn')
display.table('example-flights-topn')


When ingestion has completed, run the following cell for the final part of the initialization. It imports the libraries we use to explore what TopN does.

In [None]:
import json
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd

## Example TopN style queries

Druid looks for patterns in incoming SQL SELECT statements to work out if they would benefit from using approximation. A ranking query, like the one below, matches the rules for TopN approximation, so Druid enables it by default.

To see this happen, we need a SQL statement that has:

* A `GROUP BY` on one dimension,
* An `ORDER BY` on one aggregate, and
* A `LIMIT` clause.

Run this query to see what the results are like:

In [None]:
sql = '''
SELECT
    "Reporting_Airline",
    COUNT(*) AS Flights,
    SUM("Distance") AS SumDistance
FROM
    "example-flights-topn"
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
'''
display.sql(sql)

You can use `EXPLAIN PLAN FOR` or the `explain_sql` method to see whether Druid used TopN approximation.

In [None]:
print(json.dumps(json.loads(sql_client.explain_sql(sql)['PLAN']), indent=2))

You know approximation is used when the `queryType` is `topN`.
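If you want to check the plan programmatically rather than by eye, a small helper can collect every `queryType` it contains. This is a minimal sketch: `query_types` is a hypothetical helper, not part of `druidapi`, and the sample plan is hand-written and heavily trimmed for illustration, so a real plan will contain many more fields.

```python
import json

def query_types(plan_json):
    """Return every queryType found in an EXPLAIN plan string, outermost first."""
    found = []

    def walk(node):
        # Plans are nested JSON, so search dictionaries and lists recursively.
        if isinstance(node, dict):
            if "queryType" in node:
                found.append(node["queryType"])
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(json.loads(plan_json))
    return found

# Hand-written, trimmed plan for illustration only (an assumption, not real output).
sample_plan = '[{"query": {"queryType": "topN", "threshold": 10}}]'
print(query_types(sample_plan))  # ['topN']
```

In the notebook you could pass it the same string used above, for example `query_types(sql_client.explain_sql(sql)['PLAN'])`.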

Druid automatically applies a `LIMIT` operation, not just on the final result set, but on the results calculated by each server that’s been called upon to answer the query. This results in less data bubbling up from each process to be merged overall, and therefore greater efficiency in execution.

There's an important reason why our query doesn't include a `HAVING` clause: for `HAVING` to work properly, Druid needs to have the full and final results of your aggregations. A single data service doesn't have the full picture - that's only available after all the results are merged.

Notice the `threshold` value?

```json
    "threshold": 10,
```

The `LIMIT` applied on each server is the larger (`max`) of two values: the `threshold` shown here, which comes from the `LIMIT` in the SQL, and a configuration setting in your cluster, which defaults to 1,000.

You can find out how to read and set this default `LIMIT` in the [documentation](https://druid.apache.org/docs/latest/querying/topnquery.html#aliasing).
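To make the trimming concrete, here is a toy simulation of the idea in plain Python. It is a simplified model under stated assumptions, not Druid's implementation: each "server" is just a dictionary of dimension value to local count, the function names are invented for this sketch, and `min_threshold` stands in for the cluster setting described above.

```python
from collections import Counter

def exact_top(servers, limit):
    """Merge every row from every server, then rank."""
    merged = Counter()
    for rows in servers:
        merged.update(rows)
    return merged.most_common(limit)

def approximate_top(servers, limit, min_threshold=1000):
    """Each server keeps only its local top rows before the merge."""
    per_server_limit = max(limit, min_threshold)
    merged = Counter()
    for rows in servers:
        local_top = Counter(rows).most_common(per_server_limit)
        merged.update(dict(local_top))
    return merged.most_common(limit)

# Invented data: "c" is never in a local top 2, but is the biggest overall.
servers = [
    {"a": 6, "b": 5, "c": 4},
    {"d": 5, "e": 5, "c": 4},
]

print(exact_top(servers, 1))                           # [('c', 8)]
print(approximate_top(servers, 1, min_threshold=2))    # [('a', 6)]
print(approximate_top(servers, 1))                     # [('c', 8)]
```

The last call shows why low cardinality is safe: with only a handful of values and a per-server limit of 1,000, nothing is trimmed and the approximate answer matches the exact one.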

As a first step in understanding the implications, we need to find data in our sample set where the cardinality of the dimension that we will `GROUP BY` exceeds that number. By default, that is 1000.

What's the cardinality of our dimension?

In [None]:
sql = '''
SELECT COUNT (DISTINCT "Reporting_Airline") AS UniqueReportingAirlines
FROM "example-flights-topn"
'''
display.sql(sql)

Twenty unique values is too low – the initial `LIMIT` has no effect! This means there is no trimming happening anywhere in the database. As a result, as the documentation explains, our results are going to be without error. All the data servers will return all their results, without trimming, to be merged and passed back to us.

Let's find another dimension.

In [None]:
sql = '''
SELECT
    COUNT (DISTINCT "Tail_Number") AS UniqueTailNumbers
FROM "example-flights-topn"
WHERE "Tail_Number" <> ''
'''
display.sql(sql)

With this many distinct values to `GROUP BY`, we know that data servers will trim their results when the `topN` engine is engaged.

There is another factor to consider – data distribution.

Run the next query to visualise the distribution of unique `Tail_Number`s in the example dataset.

In [None]:
sql = '''
SELECT
    "Tail_Number",
    COUNT(*) AS RecordCount
FROM "example-flights-topn"
WHERE "Tail_Number" <> ''
GROUP BY 1
ORDER BY 2 DESC
LIMIT 500
'''

df4 = pd.DataFrame(sql_client.sql(sql))

df4.plot(x='Tail_Number', y='RecordCount', marker='o')
plt.xticks(rotation=45, ha='right')
plt.gca().get_legend().remove()
plt.show()

Imagine that the cut-off point is in the first 10% of results - in this sample data that's about the first 400.

The distribution above shows that there is a _very_ high chance that the top result on any one server is also the top result across all our data, and likewise for the second, the third, and so on. The same ranking will, very likely, come back from all of the servers.

But as we approach 1000, 25% of the way along, we have a flatter distribution. It is not as predictable any more where results will rank. Consider, too, that this is a very simple distribution plot: what will happen when we have `WHERE` on `__time` or other dimensions?

Run the following cell to see the impact of the initial `LIMIT` on results.

It uses a single query that finds the number of records and the sum total distance for each `Tail_Number`.

It executes the query twice, putting results into DataFrames. The first, `df1`, is populated with results in the usual way - running `sql_client.sql(sql)` directly. The second, `df2`, uses a crafted `req` object that adds the `useApproximateTopN` query context parameter to turn off approximation.

It then creates a third, `df3`, which is a `compare` of `df2` against `df1`, and prints the results.

In [None]:
sql = '''
SELECT
    "Tail_Number",
    COUNT(*) AS "count",
    SUM(Distance) AS "distance"
FROM "example-flights-topn"
WHERE "Tail_Number" IS NOT NULL
GROUP BY 1
ORDER BY 3 DESC
LIMIT 500
'''

df1 = pd.DataFrame(sql_client.sql(sql))

req = sql_client.sql_request(sql)
req.add_context("useApproximateTopN", "false")
resp = sql_client.sql_query(req)

# Reuse the response rather than running the exact query a second time.
df2 = pd.DataFrame(resp.rows)

df3 = df1.compare(df2, keep_equal=True)
df3

You can see:

* The `Tail_Number` at each rank position in `self` (df1, approximate) and `other` (df2, exact)
* The self / other values for the calculated `count` and `distance`

You may notice some `Tail_Number`s are in different positions depending on what the calculated `distance` is: certain data servers returned different sets of results, depending entirely on local data distribution. And some `Tail_Number`s may not appear in the list at all as they drop "below the fold".
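When the comparison table is long, it can help to list only the values whose rank changed. The following is a minimal sketch in plain Python: `rank_changes` is a hypothetical helper, it assumes each result set is a list of dictionary rows already in ranked order (the same shape the notebook passes to `pd.DataFrame`), and the tail numbers are invented.

```python
def rank_changes(approx_rows, exact_rows, key):
    """Map each value to (approximate rank, exact rank) where the two differ.

    A rank of None means the value is missing from that result set,
    that is, it dropped "below the fold".
    """
    approx_rank = {row[key]: i for i, row in enumerate(approx_rows, start=1)}
    exact_rank = {row[key]: i for i, row in enumerate(exact_rows, start=1)}

    changes = {}
    for value in set(approx_rank) | set(exact_rank):
        pair = (approx_rank.get(value), exact_rank.get(value))
        if pair[0] != pair[1]:
            changes[value] = pair
    return changes

# Invented rows for illustration only.
approx = [{"Tail_Number": "N1"}, {"Tail_Number": "N2"}, {"Tail_Number": "N3"}]
exact = [{"Tail_Number": "N2"}, {"Tail_Number": "N1"}, {"Tail_Number": "N4"}]

for value, (a, e) in sorted(rank_changes(approx, exact, "Tail_Number").items()):
    print(value, "approximate rank:", a, "exact rank:", e)
```

In the notebook, `rank_changes(df1.to_dict("records"), df2.to_dict("records"), "Tail_Number")` would give the same kind of summary for the two DataFrames.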

Let's try this with a different dimension, `Flight_Number_Reporting_Airline`. The example dataset has more unique values, but the distribution is much flatter than `Tail_Number`. Run the following cell to see the count and a distribution plot.

In [None]:
sql = '''
SELECT COUNT(DISTINCT "Flight_Number_Reporting_Airline") AS UniqueFlightNumbers
FROM "example-flights-topn"
WHERE "Flight_Number_Reporting_Airline" <> ''
'''

display.sql(sql)

sql = '''
SELECT "Flight_Number_Reporting_Airline", COUNT(*) AS "Flights"
FROM "example-flights-topn"
WHERE "Flight_Number_Reporting_Airline" <> ''
GROUP BY 1
ORDER BY 2 DESC
LIMIT 500
'''

df5 = pd.DataFrame(sql_client.sql(sql))

df5.plot(x='Flight_Number_Reporting_Airline', y='Flights', kind="bar", xticks=[])
plt.gca().get_legend().remove()
plt.show()

With more unique values than the default per-server `LIMIT`, data servers trim their results for these queries, so TopN returns less data to be merged and executes more efficiently.

The flatter overall distribution we see means it's very likely each data process will itself have a flatter distribution of data, meaning that the top results are much less prominent overall and much less prominent locally.

Run the following cell to repeat the same test we did before, creating two sets of results, and comparing them.

In [None]:
sql = '''
SELECT
    "Flight_Number_Reporting_Airline",
    AVG("Distance") AS AverageDistance
FROM "example-flights-topn"
WHERE "Flight_Number_Reporting_Airline" IS NOT NULL
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
'''

req = sql_client.sql_request(sql)
req.add_context("useApproximateTopN", "false")
resp = sql_client.sql_query(req)

df1 = pd.DataFrame(sql_client.sql(sql))
# Reuse the response rather than running the exact query a second time.
df2 = pd.DataFrame(resp.rows)

df3 = df1.compare(df2, keep_equal=True)
df3

Here the impact of a flatter distribution over a greater cardinality is clear, not just in ranking order, but also in the values that have been calculated to give us that ranking. 

Flight number `17` is in a lower position with TopN than without it. And the calculation itself, because it is non-additive, has a higher error.
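A small worked example shows why an average suffers more than a count when a server's row is trimmed. This is a simplified model that assumes the average is merged from per-server sums and row counts; the numbers and the `merged_average` helper are invented for illustration.

```python
def merged_average(partials):
    """Combine (sum, row count) pairs from several servers into one average."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

# (sum of Distance, row count) for the same flight number on two servers.
server_1 = (3000, 2)  # local average 1500: ranks high on this server
server_2 = (600, 4)   # local average 150: ranks low here, so it may be trimmed

exact = merged_average([server_1, server_2])
approximate = merged_average([server_1])  # server_2's row never reached the merge

print(exact)        # 600.0
print(approximate)  # 1500.0
```

With a count, the trimmed row would only lower the total. With an average, losing the rows from one server shifts the result itself, which in turn changes where the value ranks.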

TopN is useful for interactive elements, then, like filters or initial lists of results to deep dive into. That's because of the speed boost we receive at the expense of accuracy – the mantra for all approximation.

We've seen that the accuracy of the ranking depends greatly on data distribution, and thereby on what each of the data servers "vote" for in terms of position.

The following cell contains a query that is a good example of TopN being applied: it creates a list of `Tail_Number`s within a particular period of time. Imagine that you might use this list to provide an interactive filter on `Tail_Number` to the end user when they're looking at this specific time period.

Run the following cell to show the cardinality of `Tail_Number`s in that period, and then to plot the distribution:

In [None]:
sql = '''
SELECT COUNT (DISTINCT "Tail_Number") AS UniqueTailNumbers
FROM "example-flights-topn"
WHERE "Tail_Number" <> ''
AND (TIMESTAMP '2005-11-01' <= "__time" AND "__time" <= TIMESTAMP '2005-11-14')
'''
display.sql(sql)

sql = '''
SELECT
    "Tail_Number",
    COUNT(*) AS "Flights"
FROM "example-flights-topn"
WHERE "Tail_Number" <> ''
AND (TIMESTAMP '2005-11-01' <= "__time" AND "__time" <= TIMESTAMP '2005-11-14')
GROUP BY 1
ORDER BY 2 DESC
LIMIT 500
'''

df4 = pd.DataFrame(sql_client.sql(sql))

df4.plot(x='Tail_Number', y='Flights', marker='o')
plt.xticks(rotation=45, ha='right')
plt.gca().get_legend().remove()
plt.show()

This distribution pattern is good for TopN - the highest ranking values are very prominent.

Run the following cell to compare the two styles of execution:

In [None]:
sql = '''
SELECT "Tail_Number", COUNT(*) AS "count", SUM(Distance) AS "distance"
    FROM "example-flights-topn"
    WHERE "Tail_Number" IS NOT NULL
    AND (TIMESTAMP '2005-11-01' <= "__time" AND "__time" <= TIMESTAMP '2005-11-14')
    GROUP BY 1
    ORDER BY 3 DESC
    LIMIT 500
'''

req = sql_client.sql_request(sql)
req.add_context("useApproximateTopN", "false")
resp = sql_client.sql_query(req)

df1 = pd.DataFrame(sql_client.sql(sql))
# Reuse the response rather than running the exact query a second time.
df2 = pd.DataFrame(resp.rows)

df3 = df1.compare(df2, keep_equal=True)
df3

The distribution, together with our filters, means that these results are useful for this kind of interactive UI element.

## Summary

* TopN is the default execution model for `GROUP BY` queries with one dimension, an `ORDER BY` and a `LIMIT` clause
* You can turn TopN off with a query context parameter
* Accuracy is highly dependent on distribution of the data, after filters etc., across the database
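
Outside the notebook client, the same context parameter can travel in the body of a request to Druid's SQL API. The sketch below only builds the JSON body and does not send it; `sql_payload` is a hypothetical helper, and the endpoint path in the comment assumes the standard SQL endpoint on your router.

```python
import json

def sql_payload(sql, approximate=True):
    """Build a SQL API request body with the TopN approximation flag set."""
    return {
        "query": sql,
        "context": {"useApproximateTopN": approximate},
    }

sql = '''
SELECT "Tail_Number", COUNT(*) AS "count"
FROM "example-flights-topn"
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
'''

# This body would be POSTed to <router>/druid/v2/sql (assumed endpoint).
body = json.dumps(sql_payload(sql, approximate=False), indent=2)
print(body)
```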