# Using `UNION ALL` to address multiple `TABLE`s in the same query

While working with Druid, you may need to bring two different tables of results together into a single result list, or to treat multiple tables as a single input to a query. This notebook introduces the `UNION ALL` operator and walks through two ways to use it: top-level and table-level `UNION ALL`.

## Prerequisites

This tutorial works with Druid 26.0.0 or later.

#### Run using Docker

Launch this tutorial and all prerequisites using the `druid-jupyter` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see [Docker for Jupyter Notebook tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).
   
#### Run Druid without Docker

If you do not use the Docker Compose environment, you need the following:

* A running Druid instance, with a `DRUID_HOST` local environment variable containing the server name of your Druid router
* [druidapi](https://github.com/apache/druid/blob/master/examples/quickstart/jupyter-notebooks/druidapi/README.md), a Python client for Apache Druid. Follow the instructions in the Install section of the README file.

### Initialization

Run the next cell to attempt a connection to Druid services. If successful, the Druid version number will be shown in the output.

In [2]:
import druidapi
import os

# Fall back to localhost when DRUID_HOST is not set.
druid_host = f"http://{os.environ.get('DRUID_HOST', 'localhost')}:8888"
    
print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)

display = druid.display
sql_client = druid.sql
status_client = druid.status

status_client.version

Opening a connection to http://druid-master-0.lan:8888.


'26.0.0'

## Using Top-level `UNION ALL` to concatenate result sets

Run the following cell to ingest the wikipedia data example. Once completed, you will see a description of the new table.

Monitor the ingestion in the Druid console while it runs.

In [3]:
sql='''
REPLACE INTO "example-wikipedia-unionall" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://druid.apache.org/data/wikipedia.json.gz"]}',
    '{"type":"json"}'
  )
) EXTEND ("isRobot" VARCHAR, "channel" VARCHAR, "timestamp" VARCHAR, "flags" VARCHAR, "isUnpatrolled" VARCHAR, "page" VARCHAR, "diffUrl" VARCHAR, "added" BIGINT, "comment" VARCHAR, "commentLength" BIGINT, "isNew" VARCHAR, "isMinor" VARCHAR, "delta" BIGINT, "isAnonymous" VARCHAR, "user" VARCHAR, "deltaBucket" BIGINT, "deleted" BIGINT, "namespace" VARCHAR, "cityName" VARCHAR, "countryName" VARCHAR, "regionIsoCode" VARCHAR, "metroCode" BIGINT, "countryIsoCode" VARCHAR, "regionName" VARCHAR))
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "isRobot",
  "channel",
  "flags",
  "isUnpatrolled",
  "page",
  "diffUrl",
  "added",
  "comment",
  "commentLength",
  "isNew",
  "isMinor",
  "delta",
  "isAnonymous",
  "user",
  "deltaBucket",
  "deleted",
  "namespace",
  "cityName",
  "countryName",
  "regionIsoCode",
  "metroCode",
  "countryIsoCode",
  "regionName"
FROM "ext"
PARTITIONED BY DAY
'''

sql_client.run_task(sql)
sql_client.wait_until_ready('example-wikipedia-unionall')
display.table('example-wikipedia-unionall')

Position,Name,Type
1,__time,TIMESTAMP
2,isRobot,VARCHAR
3,channel,VARCHAR
4,flags,VARCHAR
5,isUnpatrolled,VARCHAR
6,page,VARCHAR
7,diffUrl,VARCHAR
8,added,BIGINT
9,comment,VARCHAR
10,commentLength,BIGINT


With `UNION ALL`, we can append the results of one query to those of another.

The first query in the cell below, `set1`, returns the first ten edits to any "fr"-like `channel` between midday and 1pm on 27th June 2016. The second query, `set2`, does the same for any "en"-like `channel`.

In [5]:
sql = '''
WITH
set1 AS (
  SELECT
    __time,
    "channel",
    "page",
    "isRobot"
  FROM "example-wikipedia-unionall"
  WHERE DATE_TRUNC('HOUR', __time) = TIMESTAMP '2016-06-27 12:00:00'
    AND channel LIKE '#fr%'
  ORDER BY __time
  LIMIT 10
  ),
set2 AS (
  SELECT
    __time,
    "channel",
    "page",
    "isRobot"
  FROM "example-wikipedia-unionall"
  WHERE DATE_TRUNC('HOUR', __time) = TIMESTAMP '2016-06-27 12:00:00'
    AND channel LIKE '#en%'
  ORDER BY __time
  LIMIT 10
  )
  
SELECT * from set1
UNION ALL
SELECT * from set2
'''

display.sql(sql)

__time,channel,page,isRobot
2016-06-27T12:00:49.515Z,#fr.wikipedia,MHD (rappeur),False
2016-06-27T12:01:38.670Z,#fr.wikipedia,Utilisateur:Barada-nikto/Exercice de style,False
2016-06-27T12:03:25.126Z,#fr.wikipedia,Bataille de Noyon,False
2016-06-27T12:03:32.290Z,#fr.wikipedia,Niveau de base,False
2016-06-27T12:03:37.170Z,#fr.wikipedia,Utilisateur:Barada-nikto/Brouillon/bacasable,False
2016-06-27T12:04:37.709Z,#fr.wikipedia,Liste des personnages des Experts,False
2016-06-27T12:05:01.862Z,#fr.wikipedia,24 Heures du Mans 2016,False
2016-06-27T12:05:25.435Z,#fr.wikipedia,La Seyne-sur-Mer,False
2016-06-27T12:06:07.231Z,#fr.wikipedia,Parc naturel régional des volcans d'Auvergne,False
2016-06-27T12:06:38.941Z,#fr.wikipedia,Adam Baldwin,False


This is what's known as a [top-level](https://druid.apache.org/docs/latest/querying/sql.html#top-level) `UNION ALL` operation. First, `set1` was calculated, and the results of subsequent sets were then appended.

Notice that these results are not in order by time – even though the individual sets did `ORDER BY` time. Druid simply concatenated the two result sets together.
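If a single time-ordered list is needed, one option is to sort the combined rows on the client. The sketch below is illustrative only: the rows are hypothetical stand-ins for the two result sets (the "en" timestamps are made up), and it assumes the rows are already available as Python tuples.

```python
# Hypothetical rows standing in for the two result sets; each set is
# already sorted by time, as set1 and set2 are after their own ORDER BY.
set1 = [
    ("2016-06-27T12:00:49.515Z", "#fr.wikipedia"),
    ("2016-06-27T12:01:38.670Z", "#fr.wikipedia"),
]
set2 = [
    ("2016-06-27T12:00:10.000Z", "#en.wikipedia"),  # made-up timestamp
    ("2016-06-27T12:01:00.000Z", "#en.wikipedia"),  # made-up timestamp
]

# What a top-level UNION ALL hands back: set1 followed by set2.
concatenated = set1 + set2

# ISO 8601 strings in the same format and time zone sort chronologically,
# so a plain sort on the first field restores time order.
by_time = sorted(concatenated, key=lambda row: row[0])

print([channel for _, channel in concatenated])
print([channel for _, channel in by_time])
```

The concatenated list keeps all "fr" rows ahead of all "en" rows, while the sorted list interleaves them by timestamp.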

Optionally, run the next cell – it shows the precise `EXPLAIN PLAN` for the query. You can see there are two `query` execution plans, one for each query, and that Druid's planning process has taken time to optimize how the query above will actually execute.

In [6]:
import json

print(json.dumps(json.loads(sql_client.explain_sql(sql)['PLAN']), indent=2))

The following cell contains another example of `UNION ALL`, where some filtering and `GROUP BY` operations have been added. These operations appear both inside the individual component queries and in the queries that are combined by the top-level `UNION ALL`.

In [None]:
sql='''
WITH
set1 AS (
    SELECT "Reporting_Airline", "Distance"
    FROM "example-flights-unionall"
    WHERE DATE_TRUNC('HOUR', __time) = TIMESTAMP '2005-11-01 11:00:00'
  ),
set2 AS (
    SELECT
        "Reporting_Airline",
        COUNT(*) AS "Frights",
        MAX(Distance) AS "Lengthiest",
        MIN(Distance) AS "Shortest"
    FROM "example-flights-unionall"
    WHERE DATE_TRUNC('HOUR', __time) = TIMESTAMP '2005-11-02 11:00:00'
    AND "Reporting_Airline" LIKE 'A%'
    GROUP BY 1
  )

SELECT "Reporting_Airline",
        COUNT(*) AS "Frights",
        MAX(Distance) AS "Lengthiest",
        MIN(Distance) AS "Shortest" from set1
  WHERE "Reporting_Airline" LIKE 'AA'
  GROUP BY 1
UNION ALL
SELECT * from set2
'''

display.sql(sql)

The next cell contains two result sets: `set1` provides some statistics for the 1st November, `set2` for the 2nd November.

The `UNION ALL` operation simply concatenates the two sets of results.

In [None]:
sql='''
WITH
set1 AS (
    SELECT
        "Reporting_Airline",
        COUNT(*) AS "Flights",
        MIN(Distance) AS "Shortest",
        MAX(Distance) AS "Longest"
    FROM "example-flights-unionall"
    WHERE DATE_TRUNC('HOUR', __time) = TIMESTAMP '2005-11-01 11:00:00'
    AND "Reporting_Airline" LIKE 'A%'
    GROUP BY 1
  ),
set2 AS (
    SELECT
        "Reporting_Airline",
        COUNT(*) AS "Frights",
        MAX(Distance) AS "Lengthiest",
        MIN(Distance) AS "Shortest"
    FROM "example-flights-unionall"
    WHERE DATE_TRUNC('HOUR', __time) = TIMESTAMP '2005-11-02 11:00:00'
    AND "Reporting_Airline" LIKE 'A%'
    GROUP BY 1
  )

SELECT * from set1
UNION ALL
SELECT * from set2
'''

display.sql(sql)

The `UNION ALL` operation concatenated the sets, appending the results for 2nd November (`set2`) to the end of the results for 1st November (`set1`).

Notice that the values for longest and shortest are in the wrong columns for the `set2` rows: `UNION ALL` matches columns by position rather than by name, and `set2` selects `MAX` before `MIN`. Notice, too, that even though `set2` has "Frights" rather than "Flights" (!), there is only a "Flights" column, because the column names are taken from the first query.

The following query corrects this by being explicit about the field names selected from both sets, so that the `set2` values are appended under the correct columns.

```sql
SELECT "Flights", "Shortest", "Longest" from set1
UNION ALL
SELECT "Frights", "Shortest", "Lengthiest" from set2
```

Run the cell below to see what difference this makes:

In [None]:
sql='''
WITH
set1 AS (
    SELECT
        "Reporting_Airline",
        COUNT(*) AS "Flights",
        MIN(Distance) AS "Shortest",
        MAX(Distance) AS "Longest"
    FROM "example-flights-unionall"
    WHERE DATE_TRUNC('HOUR', __time) = TIMESTAMP '2005-11-01 11:00:00'
    AND "Reporting_Airline" LIKE 'A%'
    GROUP BY 1
  ),
set2 AS (
    SELECT
        "Reporting_Airline",
        COUNT(*) AS "Frights",
        MAX(Distance) AS "Lengthiest",
        MIN(Distance) AS "Shortest"
    FROM "example-flights-unionall"
    WHERE DATE_TRUNC('HOUR', __time) = TIMESTAMP '2005-11-02 11:00:00'
    AND "Reporting_Airline" LIKE 'A%'
    GROUP BY 1
  )

SELECT "Flights", "Shortest", "Longest" from set1
UNION ALL
SELECT "Frights", "Shortest", "Lengthiest" from set2
'''

display.sql(sql)
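The positional matching seen in the cells above can be reproduced without a Druid cluster. The following is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for Druid; the two tables and their figures are invented for illustration, and only the column-matching behaviour is being shown.

```python
import sqlite3

# SQLite stands in for Druid here; table contents are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE set1 ("Flights" INTEGER, "Shortest" INTEGER, "Longest" INTEGER);
    CREATE TABLE set2 ("Frights" INTEGER, "Lengthiest" INTEGER, "Shortest" INTEGER);
    INSERT INTO set1 VALUES (10, 100, 900);
    INSERT INTO set2 VALUES (20, 800, 200);
""")

# SELECT * pairs columns by position, so set2's "Lengthiest" value
# ends up under the "Shortest" header taken from set1.
cur = conn.execute("SELECT * FROM set1 UNION ALL SELECT * FROM set2")
star_columns = [d[0] for d in cur.description]
star_rows = cur.fetchall()

# Listing the columns in matching order puts each value under the right header.
cur = conn.execute("""
    SELECT "Flights", "Shortest", "Longest" FROM set1
    UNION ALL
    SELECT "Frights", "Shortest", "Lengthiest" FROM set2
""")
fixed_rows = cur.fetchall()

print(star_columns)
print(star_rows)
print(fixed_rows)
```

In both queries the header names come from the first `SELECT`; only the explicit column list changes which values sit beneath them.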

## Using Table-level `UNION ALL` to work with multiple tables

From one source of data, data engineers may create multiple `TABLE` datasources in order to:

* Separate data with different levels of `__time` granularity (i.e., the level of summarisation),
* Apply different security to different parts, for example, per tenant,
* Break up the data using filtering at ingestion time, for example, different tables for different HTTP error codes,
* Separate upstream data by the source device or system, for example, different types of IoT device,
* Isolate different periods of time, perhaps with different retention periods.

You can use `UNION ALL` to access _all_ the source data, referencing all the `TABLE` datasources through a sub-query or a `FROM` clause.

Druid executes these "[table level](https://druid.apache.org/docs/26.0.0/querying/sql.html#table-level)" `UNION ALL` queries differently to the "[top level](https://druid.apache.org/docs/26.0.0/querying/sql.html#top-level)" queries you have used so far. Table level `UNION ALL` makes use of `union` datasources, and it's important that you read the [documentation](https://druid.apache.org/docs/26.0.0/querying/datasource.html#union) to understand the functionality available to you.
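The table-level shape, a `UNION ALL` inside the `FROM` clause with filtering and grouping applied outside it, can be sketched without a Druid cluster. The example below uses Python's built-in `sqlite3` module purely as a stand-in: the `edits_en` and `edits_fr` tables and their rows are hypothetical, and SQLite has neither Druid's `union` datasource nor its restrictions, so this shows only the query structure.

```python
import sqlite3

# SQLite stands in for Druid; tables and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE edits_en (channel TEXT, is_robot TEXT);
    CREATE TABLE edits_fr (channel TEXT, is_robot TEXT);
    INSERT INTO edits_en VALUES ('#en.wikipedia', 'true'), ('#en.wikipedia', 'false');
    INSERT INTO edits_fr VALUES ('#fr.wikipedia', 'false');
""")

# The inner branches are plain column selections from each table;
# the filter and the GROUP BY live in the outer query.
rows = conn.execute("""
    SELECT channel, COUNT(*) AS edits
    FROM (
        SELECT channel, is_robot FROM edits_en
        UNION ALL
        SELECT channel, is_robot FROM edits_fr
    )
    WHERE is_robot = 'false'
    GROUP BY channel
    ORDER BY channel
""").fetchall()

print(rows)
```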

The next two cells create two new tables, `example-wikipedia-unionall-en` and `example-wikipedia-unionall-fr`. One table contains only English language channels, while the second contains only French language channels. Imagine that this is a design decision taken by a data engineer so that they can be governed separately. But now we want to have a single view of both that people can query.

Run these ingestion jobs, and monitor them as they run in the Druid Console.

In [None]:
sql='''
REPLACE INTO "example-wikipedia-unionall-en" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://druid.apache.org/data/wikipedia.json.gz"]}',
    '{"type":"json"}'
  )
) EXTEND ("isRobot" VARCHAR, "channel" VARCHAR, "timestamp" VARCHAR, "flags" VARCHAR, "isUnpatrolled" VARCHAR, "page" VARCHAR, "diffUrl" VARCHAR, "added" BIGINT, "comment" VARCHAR, "commentLength" BIGINT, "isNew" VARCHAR, "isMinor" VARCHAR, "delta" BIGINT, "isAnonymous" VARCHAR, "user" VARCHAR, "deltaBucket" BIGINT, "deleted" BIGINT, "namespace" VARCHAR, "cityName" VARCHAR, "countryName" VARCHAR, "regionIsoCode" VARCHAR, "metroCode" BIGINT, "countryIsoCode" VARCHAR, "regionName" VARCHAR))
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "isRobot",
  "channel",
  "flags",
  "isUnpatrolled",
  "page",
  "diffUrl",
  "added",
  "comment",
  "commentLength",
  "isNew",
  "isMinor",
  "delta",
  "isAnonymous",
  "user",
  "deltaBucket",
  "deleted",
  "namespace",
  "cityName",
  "countryName",
  "regionIsoCode",
  "metroCode",
  "countryIsoCode",
  "regionName"
FROM "ext"
WHERE "channel" LIKE '#en%'
PARTITIONED BY DAY
'''

sql_client.run_task(sql)
sql_client.wait_until_ready('example-wikipedia-unionall-en')
display.table('example-wikipedia-unionall-en')

In [None]:
sql='''
REPLACE INTO "example-wikipedia-unionall-fr" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://druid.apache.org/data/wikipedia.json.gz"]}',
    '{"type":"json"}'
  )
) EXTEND ("isRobot" VARCHAR, "channel" VARCHAR, "timestamp" VARCHAR, "flags" VARCHAR, "isUnpatrolled" VARCHAR, "page" VARCHAR, "diffUrl" VARCHAR, "added" BIGINT, "comment" VARCHAR, "commentLength" BIGINT, "isNew" VARCHAR, "isMinor" VARCHAR, "delta" BIGINT, "isAnonymous" VARCHAR, "user" VARCHAR, "deltaBucket" BIGINT, "deleted" BIGINT, "namespace" VARCHAR, "cityName" VARCHAR, "countryName" VARCHAR, "regionIsoCode" VARCHAR, "metroCode" BIGINT, "countryIsoCode" VARCHAR, "regionName" VARCHAR))
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "isRobot",
  "channel",
  "flags",
  "isUnpatrolled",
  "page",
  "diffUrl",
  "added",
  "comment",
  "commentLength",
  "isNew",
  "isMinor",
  "delta",
  "isAnonymous",
  "user",
  "deltaBucket",
  "deleted",
  "namespace",
  "cityName",
  "countryName",
  "regionIsoCode",
  "metroCode",
  "countryIsoCode",
  "regionName"
FROM "ext"
WHERE "channel" LIKE '#fr%'
PARTITIONED BY DAY
'''

sql_client.run_task(sql)
sql_client.wait_until_ready('example-wikipedia-unionall-fr')
display.table('example-wikipedia-unionall-fr')

The next cell creates `unifiedSource` using `UNION ALL`. This datasource is the single, unified view of all the data. You then use the unified datasource in a `SELECT` query to count the number of robot and non-robot edits by channel.

Remember, the `SELECT` in the `unifiedSource` must be simple in order to meet the constraints set by a table level `UNION ALL`, so any operations such as filtering can only be done in the outer `SELECT` statement.

In [None]:
sql = '''
WITH unifiedSource AS (
    SELECT
        "__time",
        "isRobot",
        "channel",
        "user",
        "countryName"
    FROM "example-wikipedia-unionall-en"
    UNION ALL
    SELECT
        "__time",
        "isRobot",
        "channel",
        "user",
        "countryName"
    FROM "example-wikipedia-unionall-fr"
    )

SELECT
    "channel",
    COUNT(*) FILTER (WHERE isRobot=true) AS "Robot Edits",
    COUNT(DISTINCT "user") FILTER (WHERE isRobot=true) AS "Robot Editors",
    COUNT(*) FILTER (WHERE isRobot=false) AS "Human Edits",
    COUNT(DISTINCT "user") FILTER (WHERE isRobot=false) AS "Human Editors"
FROM unifiedSource
GROUP BY 1
'''

display.sql(sql)
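When there are more than two tables, writing each branch by hand becomes repetitive. The helper below is a hypothetical sketch, not part of `druidapi`: `build_union_all_source` is an invented name, and it assumes every table exposes the listed columns and that the names need no escaping.

```python
def build_union_all_source(tables, columns):
    """Return the body of a table-level UNION ALL over `tables`.

    Hypothetical helper: every branch is a plain SELECT of the same
    columns in the same order, with no filtering or expressions.
    Identifiers are quoted but not escaped.
    """
    if not tables:
        raise ValueError("at least one table is required")
    column_list = ", ".join(f'"{c}"' for c in columns)
    branches = [f'SELECT {column_list} FROM "{t}"' for t in tables]
    return "\nUNION ALL\n".join(branches)


unified = build_union_all_source(
    ["example-wikipedia-unionall-en", "example-wikipedia-unionall-fr"],
    ["__time", "isRobot", "channel", "user", "countryName"],
)

# Filtering and aggregation stay in the outer SELECT.
sql = f'''
SELECT "channel", COUNT(*) AS "Edits"
FROM (
{unified}
)
GROUP BY 1
'''

print(sql)
```

The resulting string could then be passed to `display.sql(sql)` in the same way as the cells above.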

## Conclusion

* There are two modes for `UNION ALL` in Druid - top level and table level
* Top level is a simple concatenation, and operations must be done on the source `TABLE`s
* Table level uses a `union` data source, and operations must be done on the outer `SELECT`

## Learn more

* Watch [Plan your Druid table datasources](https://youtu.be/OpYDX4RYLV0?list=PLDZysOZKycN7MZvNxQk_6RbwSJqjSrsNR) by Peter Marshall
* Read about [union](https://druid.apache.org/docs/26.0.0/querying/datasource.html#union) datasources in the documentation
* Read the latest [documentation](https://druid.apache.org/docs/26.0.0/querying/sql.html#union-all) on the `UNION ALL` operator