# Using TopN approximation in Druid queries

<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

Imagine you’re building a dynamic filter in your app: you want to populate it with, say, the most popular (COUNT) dimension values in descending order (ORDER BY). Druid speeds up this type of query by using TopN approximation by default. In this tutorial, you work through some examples and see the effect of turning approximation off.

## Prerequisites

This tutorial works with Druid 26.0.0 or later.

#### Run using Docker

Launch this tutorial and all prerequisites using the `druid-jupyter` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see [Docker for Jupyter Notebook tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).
   
#### Run without using Docker

If you do not use the Docker Compose environment, you need the following:

* A running Apache Druid instance, with a `DRUID_HOST` local environment variable containing the server name of your Druid router.
* [druidapi](https://github.com/apache/druid/blob/master/examples/quickstart/jupyter-notebooks/druidapi/README.md), a Python client for Apache Druid. Follow the instructions in the Install section of the README file.
* [matplotlib](https://matplotlib.org/), a library for creating visualizations in Python.
* [pandas](https://pandas.pydata.org/), a data analysis and manipulation tool.

### Initialize Python

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If successful, the Druid version number will be shown in the output.

In [None]:
import druidapi
import os

if 'DRUID_HOST' not in os.environ:
    druid_host = "http://localhost:8888"
else:
    druid_host = f"http://{os.environ['DRUID_HOST']}:8888"
    
print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)

display = druid.display
sql_client = druid.sql
status_client = druid.status

status_client.version

### Load example data

Once your Druid environment is up and running, ingest the sample data for this tutorial.

Run the following cell to create a table called `example-flights-topn`. When the ingestion completes, you'll see a description of the final table.

Monitor the ingestion task process in the Druid console.

In [None]:
sql='''
REPLACE INTO "example-flights-topn" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://static.imply.io/example-data/flight_on_time/flights/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11.csv.zip"]}',
    '{"type":"csv","findColumnsFromHeader":true}'
  )
) EXTEND ("depaturetime" VARCHAR, "arrivalime" VARCHAR, "Year" BIGINT, "Quarter" BIGINT, "Month" BIGINT, "DayofMonth" BIGINT, "DayOfWeek" BIGINT, "FlightDate" VARCHAR, "Reporting_Airline" VARCHAR, "DOT_ID_Reporting_Airline" BIGINT, "IATA_CODE_Reporting_Airline" VARCHAR, "Tail_Number" VARCHAR, "Flight_Number_Reporting_Airline" BIGINT, "OriginAirportID" BIGINT, "OriginAirportSeqID" BIGINT, "OriginCityMarketID" BIGINT, "Origin" VARCHAR, "OriginCityName" VARCHAR, "OriginState" VARCHAR, "OriginStateFips" BIGINT, "OriginStateName" VARCHAR, "OriginWac" BIGINT, "DestAirportID" BIGINT, "DestAirportSeqID" BIGINT, "DestCityMarketID" BIGINT, "Dest" VARCHAR, "DestCityName" VARCHAR, "DestState" VARCHAR, "DestStateFips" BIGINT, "DestStateName" VARCHAR, "DestWac" BIGINT, "CRSDepTime" BIGINT, "DepTime" BIGINT, "DepDelay" BIGINT, "DepDelayMinutes" BIGINT, "DepDel15" BIGINT, "DepartureDelayGroups" BIGINT, "DepTimeBlk" VARCHAR, "TaxiOut" BIGINT, "WheelsOff" BIGINT, "WheelsOn" BIGINT, "TaxiIn" BIGINT, "CRSArrTime" BIGINT, "ArrTime" BIGINT, "ArrDelay" BIGINT, "ArrDelayMinutes" BIGINT, "ArrDel15" BIGINT, "ArrivalDelayGroups" BIGINT, "ArrTimeBlk" VARCHAR, "Cancelled" BIGINT, "CancellationCode" VARCHAR, "Diverted" BIGINT, "CRSElapsedTime" BIGINT, "ActualElapsedTime" BIGINT, "AirTime" BIGINT, "Flights" BIGINT, "Distance" BIGINT, "DistanceGroup" BIGINT, "CarrierDelay" BIGINT, "WeatherDelay" BIGINT, "NASDelay" BIGINT, "SecurityDelay" BIGINT, "LateAircraftDelay" BIGINT, "FirstDepTime" VARCHAR, "TotalAddGTime" VARCHAR, "LongestAddGTime" VARCHAR, "DivAirportLandings" VARCHAR, "DivReachedDest" VARCHAR, "DivActualElapsedTime" VARCHAR, "DivArrDelay" VARCHAR, "DivDistance" VARCHAR, "Div1Airport" VARCHAR, "Div1AirportID" VARCHAR, "Div1AirportSeqID" VARCHAR, "Div1WheelsOn" VARCHAR, "Div1TotalGTime" VARCHAR, "Div1LongestGTime" VARCHAR, "Div1WheelsOff" VARCHAR, "Div1TailNum" VARCHAR, "Div2Airport" VARCHAR, "Div2AirportID" VARCHAR, "Div2AirportSeqID" VARCHAR, "Div2WheelsOn" VARCHAR, "Div2TotalGTime" 
VARCHAR, "Div2LongestGTime" VARCHAR, "Div2WheelsOff" VARCHAR, "Div2TailNum" VARCHAR, "Div3Airport" VARCHAR, "Div3AirportID" VARCHAR, "Div3AirportSeqID" VARCHAR, "Div3WheelsOn" VARCHAR, "Div3TotalGTime" VARCHAR, "Div3LongestGTime" VARCHAR, "Div3WheelsOff" VARCHAR, "Div3TailNum" VARCHAR, "Div4Airport" VARCHAR, "Div4AirportID" VARCHAR, "Div4AirportSeqID" VARCHAR, "Div4WheelsOn" VARCHAR, "Div4TotalGTime" VARCHAR, "Div4LongestGTime" VARCHAR, "Div4WheelsOff" VARCHAR, "Div4TailNum" VARCHAR, "Div5Airport" VARCHAR, "Div5AirportID" VARCHAR, "Div5AirportSeqID" VARCHAR, "Div5WheelsOn" VARCHAR, "Div5TotalGTime" VARCHAR, "Div5LongestGTime" VARCHAR, "Div5WheelsOff" VARCHAR, "Div5TailNum" VARCHAR, "Unnamed: 109" VARCHAR))
SELECT
  TIME_PARSE("depaturetime") AS "__time",
  "Reporting_Airline",
  "Tail_Number",
  "Distance",
  "Origin",
  "Flight_Number_Reporting_Airline"
FROM "ext"
PARTITIONED BY DAY
'''

sql_client.run_task(sql)
sql_client.wait_until_ready('example-flights-topn')
display.table('example-flights-topn')

When this is completed, run the following cell to load some Python libraries we need to explore what TopN does.

In [None]:
import json
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd

## Example TopN style queries

Druid looks for patterns in `SELECT` statements to determine if they would benefit from using approximation. A ranking query, like the following, matches the rules for TopN approximation, so Druid enables it by default.

For Druid to automatically optimize for TopN, you need an SQL statement that has:
* A GROUP BY on one dimension,
* an ORDER BY on one aggregate, and
* a LIMIT.

Run this query to see what the results are like:

In [None]:
sql = '''
SELECT
    "Reporting_Airline",
    COUNT(*) AS Flights,
    SUM("Distance") AS SumDistance
FROM
    "example-flights-topn"
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
'''
display.sql(sql)

Run the following cell, which uses the `explain_sql` method to show the [`EXPLAIN PLAN`](https://druid.apache.org/docs/latest/querying/sql-translation#interpreting-explain-plan-output) for this query.

In [None]:
print(json.dumps(json.loads(sql_client.explain_sql(sql)['PLAN']), indent=2))

The plan `queryType` is `topN`, showing that TopN approximation was used.
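If you would rather check the query type in code than read the printed plan, something like the following sketch could work. The `sample_plan` string is a hand-written, heavily abridged stand-in for the JSON in the `PLAN` column, and `plan_query_types` is a hypothetical helper, not part of `druidapi`.

```python
import json

# Hand-written, abridged stand-in for the JSON string in the PLAN column.
# A real plan carries many more fields; only the parts used here are shown.
sample_plan = json.dumps([
    {
        "query": {
            "queryType": "topN",
            "dataSource": {"type": "table", "name": "example-flights-topn"},
            "threshold": 10,
        }
    }
])


def plan_query_types(plan_json):
    """Return the native queryType of each query listed in a plan."""
    return [entry["query"]["queryType"] for entry in json.loads(plan_json)]


query_types = plan_query_types(sample_plan)
print(query_types)
```

In the notebook, you could pass `sql_client.explain_sql(sql)['PLAN']` in place of `sample_plan`, assuming the plan has the same overall shape.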

Druid applied a `LIMIT` operation on the results calculated by each data service involved in the query, improving processing efficiency by minimizing the amount of data transferred to the Broker.

This [pushed-down](https://druid.apache.org/docs/latest/querying/groupbyquery#limit-pushdown-optimization) `LIMIT` is the larger of the `threshold` in the plan (which came from the `LIMIT` in the SQL) and the [`minTopNThreshold`](https://druid.apache.org/docs/latest/querying/topnquery.html#aliasing) setting in your cluster, which defaults to 1,000.

To see the effect of this `LIMIT`, the cardinality of the `GROUP BY` dimension needs to exceed this cap.
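As a rough illustration of that rule, the sketch below works out the cut-off for a given SQL `LIMIT` and checks whether a dimension has enough distinct values for trimming to be possible. The function names and the cardinalities are made up for this example, and the default of 1,000 is the one quoted above.

```python
def pushed_down_limit(sql_limit, min_topn_threshold=1000):
    """Per-service cut-off: the larger of the SQL LIMIT and minTopNThreshold."""
    return max(sql_limit, min_topn_threshold)


def can_trim(cardinality, sql_limit, min_topn_threshold=1000):
    """True when a service could hold more distinct values than it returns."""
    return cardinality > pushed_down_limit(sql_limit, min_topn_threshold)


small_limit = pushed_down_limit(10)      # LIMIT 10 is raised to the cap
large_limit = pushed_down_limit(5000)    # a LIMIT above the cap is used as-is
low_cardinality = can_trim(20, 10)       # hypothetical dimension with 20 values
high_cardinality = can_trim(4000, 10)    # hypothetical dimension with 4,000 values

print(small_limit, large_limit, low_cardinality, high_cardinality)
```

Exceeding the cap overall is only a necessary condition: whether a given service actually trims depends on how many distinct values it holds locally.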

Run the following query to discover the cardinality of the `GROUP BY` on `Reporting_Airline`.

In [None]:
sql = '''
SELECT COUNT (DISTINCT "Reporting_Airline") AS UniqueReportingAirlines
FROM "example-flights-topn"
'''
display.sql(sql)

The number of unique values is below the `LIMIT` cap, so there is no trimming and the results are not approximate: every data server returns all of its results to be merged and passed back to us.

What is the cardinality for the `Tail_Number` dimension?

In [None]:
sql = '''
SELECT
    COUNT (DISTINCT "Tail_Number") AS UniqueTailNumbers
FROM "example-flights-topn"
WHERE "Tail_Number" <> ''
'''
display.sql(sql)

With this many distinct values to `GROUP BY`, the services involved in the query will trim their results when using the `topN` engine.

Run the next query to visualize the distribution of unique `Tail_Number`s in the example dataset.

In [None]:
sql = '''
SELECT
    "Tail_Number",
    COUNT(*) AS RecordCount
FROM "example-flights-topn"
WHERE "Tail_Number" <> ''
GROUP BY 1
ORDER BY 2 DESC
LIMIT 500
'''

df4 = pd.DataFrame(sql_client.sql(sql))

df4.plot(x='Tail_Number', y='RecordCount', marker='o')
plt.xticks(rotation=45, ha='right')
plt.gca().get_legend().remove()
plt.show()

The plot shows that we have a long tail distribution, meaning there is a high likelihood the same `Tail_Number` will be in rank position one across the data set, and therefore across all segments. The flatter the distribution, the less reliable this assertion is.
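One simple way to put a number on "long tail" versus "flat" is the share of all records held by the first few values. The sketch below is only an illustration: `head_share` is a hypothetical helper and both lists of counts are made up, not taken from the flights data.

```python
def head_share(counts, k):
    """Fraction of all records that belong to the k most frequent values."""
    ordered = sorted(counts, reverse=True)
    return sum(ordered[:k]) / sum(ordered)


# Made-up record counts per value, not taken from the flights data.
long_tail = [500, 200, 100, 50, 25, 10, 5, 5, 3, 2]
flat = [90] * 10

long_tail_share = head_share(long_tail, 2)
flat_share = head_share(flat, 2)
print(round(long_tail_share, 2), round(flat_share, 2))
```

The closer the share is to 1 for a small `k`, the more prominent the leading values are, and the more likely each segment is to agree on them.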

Take a look at the following cell to see a query that counts the number of records and sums total distance for each `Tail_Number`.

Run the cell to execute this query in both TopN and non-TopN modes. The first run puts the results into a DataFrame, `df1`, by running `sql_client.sql(sql)` directly. The second uses a crafted `req` object that adds the `useApproximateTopN` query context parameter to turn off approximation, storing the results in `df2`.

It then runs a `compare` of `df1` against `df2`, stores the result in `df3`, and displays it.

In [None]:
sql = '''
SELECT
    "Tail_Number",
    COUNT(*) AS "count",
    SUM(Distance) AS "distance"
FROM "example-flights-topn"
WHERE "Tail_Number" IS NOT NULL
GROUP BY 1
ORDER BY 3 DESC
LIMIT 500
'''

# Load the results into a pandas DataFrame

df1 = pd.DataFrame(sql_client.sql(sql))

# Set up a sql_request to turn off TopN approximation

req = sql_client.sql_request(sql)
req.add_context("useApproximateTopN", "false")
resp = sql_client.sql_query(req)

# Load the non-TopN results into a second pandas DataFrame using that request

df2 = pd.DataFrame(resp.rows)

# Load the compare of df1 to df2 into a new dataframe and print

df3 = df1.compare(df2, keep_equal=True)
df3

You can see:

* The `Tail_Number` at each rank position in `self` (`df1`) and `other` (`df2`)
* The self / other values for the calculated `count` and `distance`

You may notice some `Tail_Number`s are in different positions depending on what the calculated `distance` is: certain data servers returned different sets of results, depending entirely on local data distribution. And some `Tail_Number`s may not appear in the list at all because they fell below the cut-off applied by a specific process.
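The effect can be reproduced with a toy model. The sketch below is a deliberate simplification, not how Druid is implemented: two made-up "segments" each hold a few totals, each returns only its own top `cap` rows, and the trimmed lists are merged. The function names, the keys, and the tiny cap of 2 are all assumptions chosen to keep the example readable.

```python
from collections import Counter


def top_k(totals, k):
    """Return the k largest (value, total) pairs, largest first."""
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:k]


def exact_top(segments, limit):
    """Merge every row from every segment, then rank."""
    merged = Counter()
    for segment in segments:
        merged.update(segment)
    return top_k(merged, limit)


def approximate_top(segments, limit, cap):
    """Each segment returns only its own top `cap` rows before the merge."""
    merged = Counter()
    for segment in segments:
        merged.update(dict(top_k(segment, cap)))
    return top_k(merged, limit)


# Made-up totals: "C" is third in both segments but first overall.
segments = [
    {"A": 10, "B": 9, "C": 8},
    {"D": 7, "E": 6, "C": 5},
]

exact = exact_top(segments, limit=2)
approx = approximate_top(segments, limit=2, cap=2)
print("exact: ", exact)
print("approx:", approx)
```

Here `C` never makes a local cut, so it disappears from the approximate result even though it has the largest total overall.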

Let's try this with a different dimension, `Flight_Number_Reporting_Airline`. The example dataset has more unique values, but the distribution is much flatter than `Tail_Number`. Run the following cell to see the count and a distribution plot.

In [None]:
sql = '''
SELECT COUNT(DISTINCT "Flight_Number_Reporting_Airline") AS UniqueFlightNumbers
FROM "example-flights-topn"
WHERE "Flight_Number_Reporting_Airline" <> ''
'''

display.sql(sql)

sql = '''
SELECT "Flight_Number_Reporting_Airline", COUNT(*) AS Flights
FROM "example-flights-topn"
WHERE "Flight_Number_Reporting_Airline" <> ''
GROUP BY 1
ORDER BY 2 DESC
LIMIT 500
'''

# Load the results into a pandas DataFrame

df5 = pd.DataFrame(sql_client.sql(sql))

# Display a plot

df5.plot(x='Flight_Number_Reporting_Airline', y='Flights', kind="bar", xticks=[])
plt.gca().get_legend().remove()
plt.show()

This dimension, unlike `Tail_Number`, has a flatter distribution. Each data process is likely to have a flatter distribution of data, too, meaning the top ranking results are less prominent. The "voting" across the servers as to what is in the top is less clear.

Run the following cell to repeat the same test we did before, creating two sets of results, and comparing them.

In [None]:
sql = '''
SELECT
    "Flight_Number_Reporting_Airline",
    AVG("Distance") AS AverageDistance
FROM "example-flights-topn"
WHERE "Flight_Number_Reporting_Airline" IS NOT NULL
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
'''

# Set up a sql_request to turn off TopN approximation

req = sql_client.sql_request(sql)
req.add_context("useApproximateTopN", "false")
resp = sql_client.sql_query(req)

# Load two pandas DataFrames - one with the TopN and one with the non-TopN results

df1 = pd.DataFrame(sql_client.sql(sql))
df2 = pd.DataFrame(resp.rows)

# Load the compare of df1 to df2 into a new dataframe and print

df3 = df1.compare(df2, keep_equal=True)
df3

Here, the flatter distribution exaggerates ranking and calculation error. The error is compounded because `AVG` is non-additive: unlike a count or a sum, an average cannot be rebuilt by simply adding partial results together.
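To see why a missing partial result hurts an average differently from a sum, consider this small sketch. It assumes, purely for illustration, that an average is derived from per-server sums and row counts that are merged together. The numbers and helper names are made up.

```python
# Made-up per-server partial results for one flight number:
# (sum of distance, row count).
partials = [(1000, 2), (300, 3)]


def merged_sum(parts):
    """Total distance across the partial results that were returned."""
    return sum(total for total, _ in parts)


def merged_avg(parts):
    """Average distance derived from the merged sums and row counts."""
    return merged_sum(parts) / sum(count for _, count in parts)


exact_sum, exact_avg = merged_sum(partials), merged_avg(partials)
print("exact:", exact_sum, exact_avg)

# Drop one partial at a time, as if that server had trimmed this value.
trimmed = []
for missing in range(len(partials)):
    kept = partials[:missing] + partials[missing + 1:]
    trimmed.append((merged_sum(kept), merged_avg(kept)))
    print("without partial", missing, "->", trimmed[-1])
```

In this model a trimmed sum can only come out lower than the exact value, while a trimmed average can land on either side of it, which makes a ranking by average harder to trust.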

The following cell contains a query that is a good example of TopN being applied: it creates a list of `Tail_Number`s within a particular period of time. Imagine that you might use this list to provide an interactive filter on `Tail_Number` to the end user when they're looking at this specific time period.

Run the following cell to show the cardinality of `Tail_Number`s in that period, and then to plot the distribution:

In [None]:
sql = '''
SELECT COUNT (DISTINCT "Tail_Number") AS UniqueTailNumbers
FROM "example-flights-topn"
WHERE "Tail_Number" <> ''
AND (TIMESTAMP '2005-11-01' <= "__time" AND "__time" <= TIMESTAMP '2005-11-14')
'''
display.sql(sql)

sql = '''
SELECT
    "Tail_Number",
    COUNT(*) AS "Flights"
FROM "example-flights-topn"
WHERE "Tail_Number" <> ''
AND (TIMESTAMP '2005-11-01' <= "__time" AND "__time" <= TIMESTAMP '2005-11-14')
GROUP BY 1
ORDER BY 2 DESC
LIMIT 500
'''

df4 = pd.DataFrame(sql_client.sql(sql))

df4.plot(x='Tail_Number', y='Flights', marker='o')
plt.xticks(rotation=45, ha='right')
plt.gca().get_legend().remove()
plt.show()

This distribution pattern is good for TopN - the highest ranking values are very prominent.

Run the following cell to compare the two styles of execution:

In [None]:
sql = '''
SELECT "Tail_Number", COUNT(*) AS "count", SUM(Distance) AS "distance"
    FROM "example-flights-topn"
    WHERE "Tail_Number" IS NOT NULL
    AND (TIMESTAMP '2005-11-01' <= "__time" AND "__time" <= TIMESTAMP '2005-11-14')
    GROUP BY 1
    ORDER BY 3 DESC
    LIMIT 500
'''

req = sql_client.sql_request(sql)
req.add_context("useApproximateTopN", "false")
resp = sql_client.sql_query(req)

df1 = pd.DataFrame(sql_client.sql(sql))
df2 = pd.DataFrame(resp.rows)

df3 = df1.compare(df2, keep_equal=True)
df3

The distribution, together with our filters, means that these results are useful for this kind of interactive UI element.
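If you want a more compact summary than a row-by-row comparison, a helper along these lines could report how much of the exact list the approximate list recovers. `compare_rankings` is a hypothetical function written for this sketch, the sample rows are made up, and it assumes each result set is a list of dictionaries keyed by column name.

```python
def compare_rankings(approx_rows, exact_rows, key):
    """Summarize how an approximate ranking differs from an exact one."""
    approx_keys = [row[key] for row in approx_rows]
    exact_keys = [row[key] for row in exact_rows]
    shared = set(approx_keys) & set(exact_keys)
    return {
        # Share of the exact list that also appears in the approximate list.
        "overlap": len(shared) / len(exact_keys),
        # Values in the exact list that the approximate list lost.
        "missing": [k for k in exact_keys if k not in shared],
        # Values present in both lists but at a different rank.
        "moved": [k for k in exact_keys
                  if k in shared and approx_keys.index(k) != exact_keys.index(k)],
    }


# Made-up rows, not real query output.
approx_rows = [{"Tail_Number": n} for n in ("N1", "N3", "N2", "N5")]
exact_rows = [{"Tail_Number": n} for n in ("N1", "N2", "N3", "N4")]

summary = compare_rankings(approx_rows, exact_rows, "Tail_Number")
print(summary)
```

With the DataFrames from the cell above, you could call it as `compare_rankings(df1.to_dict("records"), df2.to_dict("records"), "Tail_Number")`.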

## Summary

The speed boost we receive through TopN, at the expense of some accuracy, makes it useful for interactive elements like filters or initial lists of results that people will then deep dive into.

* TopN is the default execution model for `GROUP BY` queries with one dimension, an `ORDER BY`, and a `LIMIT` clause
* You can turn TopN approximation off with the `useApproximateTopN` query context parameter
* Accuracy depends heavily on how the data, after any filters are applied, is distributed across the database
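Outside the notebook, the same context parameter can travel in the JSON body of a request to Druid's SQL API (a `POST` to `/druid/v2/sql`). The sketch below only builds that body and does not send it. `sql_payload` is a hypothetical helper and the query is just an example.

```python
import json


def sql_payload(query, approximate_topn=True):
    """Build a JSON request body with the TopN approximation flag set."""
    return json.dumps({
        "query": query,
        "context": {"useApproximateTopN": approximate_topn},
    })


example_query = (
    'SELECT "Tail_Number", COUNT(*) AS "count" '
    'FROM "example-flights-topn" '
    "GROUP BY 1 ORDER BY 2 DESC LIMIT 10"
)

body = sql_payload(example_query, approximate_topn=False)
print(body)
```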

## Learn more

Read the following documentation for more information:

* [TopN queries](https://druid.apache.org/docs/latest/querying/topnquery)