# Working with nested columns

<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

This tutorial demonstrates how to work with [nested columns](https://druid.apache.org/docs/latest/querying/nested-columns.html) in Apache Druid.

Druid stores nested data structures in `COMPLEX<json>` columns. In this tutorial, you perform the following tasks:

- Ingest nested JSON data using SQL-based ingestion.
- Transform nested data during ingestion using SQL JSON functions.
- Perform queries to display, filter, and aggregate nested data.

Druid supports directly ingesting nested data in the following formats: JSON, Parquet, Avro, ORC, and Protobuf.


## Table of contents

- [Prerequisites](#Prerequisites)
- [Ingest nested data](#Ingest-nested-data)
- [Transform nested data](#Transform-nested-data)
- [Query nested data](#Query-nested-data)
- [Learn more](#Learn-more)

For the best experience, use JupyterLab so that you can always access the table of contents.

## Prerequisites

You need to install the `requests` library for Python before you start. For example:

```bash
pip3 install requests
```

Next, you need a Druid cluster. This tutorial uses the configuration described in the Druid [Quickstart (local)](https://druid.apache.org/docs/latest/tutorials/index.html). Download Druid from the quickstart page. In the root of the Druid folder, run the following command to start Druid:

```bash
./bin/start-druid
```

Finally, you need either JupyterLab (recommended) or Jupyter Notebook. Visit the [Jupyter site](https://jupyter.org/) if you want to learn more about these interfaces.

Both the quickstart Druid cluster and Jupyter deploy at `localhost:8888` by default, so you need to change the port for Jupyter. To do this, stop Jupyter if it's running and start it with the `port` parameter included. For example, you can use the following command to start Jupyter on port `3001`:

```bash
# If you're using JupyterLab
jupyter lab --port 3001
# If you're using Jupyter Notebook
jupyter notebook --port 3001 
```

To start this tutorial, run the next cell. It imports the Python packages you need and defines variables for two datasources and the Druid host the tutorial uses. The quickstart deployment configures Druid to listen on port `8888` by default, so you'll make API calls against `http://localhost:8888`.

In [None]:
import requests
import json

# druid_host is the hostname and port for your Druid deployment.
# In a distributed environment, use the Router service as the `druid_host`.
druid_host = "http://localhost:8888"
dataSource1 = "kttm"
dataSource2 = "kttm_transform"
print(f"\033[1mDruid host\033[0m: {druid_host}")

In the rest of the tutorial, the `endpoint`, `http_method`, and `payload` variables are updated to accomplish different tasks.
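
Each SQL cell below repeats the same pattern: pick an endpoint, serialize the query into a JSON body, and send it. As a minimal sketch of that pattern, the hypothetical helper `build_sql_request` (not part of this tutorial or of Druid) assembles the URL, headers, and body without sending anything:

```python
import json

def build_sql_request(druid_host, query, as_task=False):
    """Return (url, headers, body) for a Druid SQL request.

    as_task=True targets the SQL task endpoint used for ingestion;
    otherwise the interactive SQL endpoint is used.
    """
    endpoint = "/druid/v2/sql/task" if as_task else "/druid/v2/sql"
    headers = {"Content-Type": "application/json"}
    body = json.dumps({"query": query})
    return druid_host + endpoint, headers, body

url, headers, body = build_sql_request("http://localhost:8888", "SELECT 1")
print(url)
```

You could then pass these values to `requests.post(url, headers=headers, data=body)`.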

## Ingest nested data

Run the following cell to ingest sample clickstream data from the [Koalas to the Max](https://www.koalastothemax.com/) game.

In [None]:
endpoint = "/druid/v2/sql/task"
print(druid_host+endpoint)
http_method = "POST"

payload = json.dumps({
"query": "INSERT INTO \"kttm\" \
    WITH \"source\" AS \
    (SELECT * FROM TABLE(EXTERN('{\"type\":\"http\",\"uris\":[\"https://static.imply.io/example-data/kttm-nested-v2/kttm-nested-v2-2019-08-25.json.gz\"]}', \
       '{\"type\":\"json\"}','[{\"name\":\"timestamp\",\"type\":\"string\"},{\"name\":\"client_ip\",\"type\":\"string\"}, \
        {\"name\":\"session\",\"type\":\"string\"},{\"name\":\"session_length\",\"type\":\"string\"},{\"name\":\"event\",\"type\":\"COMPLEX<json>\"}, \
        {\"name\":\"agent\",\"type\":\"COMPLEX<json>\"},{\"name\":\"geo_ip\",\"type\":\"COMPLEX<json>\"}]'))) \
        SELECT TIME_PARSE(\"timestamp\") AS \"__time\", \"client_ip\", \"session\", \"session_length\", \"event\", \"agent\", \"geo_ip\" FROM \"source\" \
    PARTITIONED BY DAY"
})
    
headers = {'Content-Type': 'application/json'}

response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)
ingestion_taskId_response = response
ingestion_taskId = json.loads(ingestion_taskId_response.text)['taskId']

print(f"\nInserting data into the table named {dataSource1}.")
print("\nThe response includes the task ID and the status: " + response.text + ".")
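
The escaped quotes in the query string above are easy to get wrong. One alternative, sketched here as an assumption rather than as this tutorial's method, is to build the `EXTERN` arguments from Python objects with `json.dumps` and place them in a triple-quoted SQL string. The variable names `input_source`, `input_format`, and `signature` are illustrative.

```python
import json

# Input source, input format, and row signature for EXTERN, as Python objects.
input_source = {
    "type": "http",
    "uris": ["https://static.imply.io/example-data/kttm-nested-v2/kttm-nested-v2-2019-08-25.json.gz"],
}
input_format = {"type": "json"}
signature = [
    {"name": "timestamp", "type": "string"},
    {"name": "client_ip", "type": "string"},
    {"name": "session", "type": "string"},
    {"name": "session_length", "type": "string"},
    {"name": "event", "type": "COMPLEX<json>"},
    {"name": "agent", "type": "COMPLEX<json>"},
    {"name": "geo_ip", "type": "COMPLEX<json>"},
]

# json.dumps emits only double quotes here, so the values can sit inside
# single-quoted SQL string literals without further escaping.
query = f"""
INSERT INTO "kttm"
WITH "source" AS (
  SELECT * FROM TABLE(EXTERN(
    '{json.dumps(input_source)}',
    '{json.dumps(input_format)}',
    '{json.dumps(signature)}'
  ))
)
SELECT TIME_PARSE("timestamp") AS "__time",
  "client_ip", "session", "session_length", "event", "agent", "geo_ip"
FROM "source"
PARTITIONED BY DAY
"""

payload = json.dumps({"query": query})
print(query)
```

This approach relies on none of the JSON values containing a single quote, which holds for the objects shown.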

Run the following cell to get the status of the ingestion task.

In [None]:
import time

endpoint = f"/druid/indexer/v1/task/{ingestion_taskId}/status"
print(f"\033[1mQuery endpoint\033[0m: {druid_host+endpoint}")
http_method = "GET"

payload = {}
headers = {}

response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)
ingestion_status = json.loads(response.text)['status']['status']

if ingestion_status == "RUNNING":
  print("The ingestion is running...")

# Poll until the task reaches a terminal state so that a failed task doesn't loop forever.
while ingestion_status not in ("SUCCESS", "FAILED"):
  time.sleep(15)
  response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)
  ingestion_status = json.loads(response.text)['status']['status']

if ingestion_status == "SUCCESS":
  print("The ingestion is complete:")
else:
  print("The ingestion failed:")
print(json.dumps(response.json(), indent=4))
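
The polling logic above can also be wrapped in a reusable function with a timeout. The following is a minimal sketch, not part of the tutorial: `wait_for_task` is a hypothetical name, and it takes any zero-argument callable that returns the task status string, so it can be demonstrated here with a fake status source instead of a live cluster.

```python
import time

def wait_for_task(get_status, poll_seconds=15, timeout_seconds=600, sleep=time.sleep):
    """Poll get_status() until it returns SUCCESS or FAILED.

    get_status is any zero-argument callable returning the task status string.
    Raises TimeoutError if the task has not finished within timeout_seconds.
    """
    waited = 0
    while True:
        status = get_status()
        if status in ("SUCCESS", "FAILED"):
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"Task still {status} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds

# Demonstration with a fake status source and a no-op sleep.
fake_statuses = iter(["RUNNING", "RUNNING", "SUCCESS"])
final_status = wait_for_task(lambda: next(fake_statuses), sleep=lambda s: None)
print(final_status)
```

In this notebook, `get_status` could be a small function that requests the task status endpoint and returns `response.json()['status']['status']`.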

When the ingestion task status shows `SUCCESS`, run the following cell to query the data and return selected columns from 3 rows. Note the nested structure of the `event`, `agent`, and `geo_ip` columns.

In [None]:
endpoint = "/druid/v2/sql"
print(druid_host+endpoint)
http_method = "POST"

payload = json.dumps({
  "query": "SELECT session, event, agent, geo_ip FROM kttm LIMIT 3"
})
headers = {'Content-Type': 'application/json'}

response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)

print(json.dumps(json.loads(response.text), indent=4))

## Transform nested data

You can use Druid's [SQL JSON functions](https://druid.apache.org/docs/latest/querying/sql-json-functions.html) to transform nested data in your ingestion query.

Run the following cell to insert sample data into a new datasource named `kttm_transform`. The SELECT query extracts the `country` and `city` elements from the nested `geo_ip` column and creates a composite object `sessionDetails` containing `session` and `session_length`.

In [None]:
endpoint = "/druid/v2/sql/task"
print(druid_host+endpoint)
http_method = "POST"

payload = json.dumps({
"query": "INSERT INTO \"kttm_transform\" \
    WITH \"source\" AS \
    (SELECT * FROM TABLE(EXTERN('{\"type\":\"http\",\"uris\":[\"https://static.imply.io/example-data/kttm-nested-v2/kttm-nested-v2-2019-08-25.json.gz\"]}', \
       '{\"type\":\"json\"}','[{\"name\":\"timestamp\",\"type\":\"string\"},{\"name\":\"session\",\"type\":\"string\"},{\"name\":\"session_length\",\"type\":\"string\"}, \
        {\"name\":\"event\",\"type\":\"COMPLEX<json>\"},{\"name\":\"agent\",\"type\":\"COMPLEX<json>\"},{\"name\":\"geo_ip\",\"type\":\"COMPLEX<json>\"}]'))) \
        SELECT TIME_PARSE(\"timestamp\") AS \"__time\", \
        JSON_QUERY(geo_ip, '$.country') as country, \
        JSON_QUERY(geo_ip, '$.city') as city, \
        JSON_OBJECT('session':session, 'session_length':session_length) as sessionDetails \
        FROM \"source\" \
    PARTITIONED BY DAY"
})

headers = {'Content-Type': 'application/json'}

response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)
ingestion_taskId_response = response
ingestion_taskId = json.loads(ingestion_taskId_response.text)['taskId']

print(f"\nInserting data into the table named {dataSource2}.")
print("\nThe response includes the task ID and the status: " + response.text + ".")
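
To make the transformation concrete, the following sketch mimics in plain Python what the SELECT list does to a single row. The sample row is invented for illustration and is not taken from the dataset. The real transformation happens inside Druid during ingestion, and `TIME_PARSE` converts the string to a timestamp rather than copying it.

```python
# A made-up input row shaped like the kttm data (values are illustrative only).
row = {
    "timestamp": "2019-08-25T00:00:00.031Z",
    "session": "S-example",
    "session_length": "76261",
    "geo_ip": {"continent": "Europe", "country": "Finland", "city": "Helsinki"},
}

# Rough equivalent of the SELECT list in the ingestion query.
transformed = {
    "__time": row["timestamp"],               # TIME_PARSE("timestamp")
    "country": row["geo_ip"].get("country"),  # JSON_QUERY(geo_ip, '$.country')
    "city": row["geo_ip"].get("city"),        # JSON_QUERY(geo_ip, '$.city')
    "sessionDetails": {                       # JSON_OBJECT('session':..., 'session_length':...)
        "session": row["session"],
        "session_length": row["session_length"],
    },
}
print(transformed)
```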

Run the following cell to get the status of the ingestion task.

In [None]:
import time

endpoint = f"/druid/indexer/v1/task/{ingestion_taskId}/status"
print(f"\033[1mQuery endpoint\033[0m: {druid_host+endpoint}")
http_method = "GET"

payload = {}
headers = {}

response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)
ingestion_status = json.loads(response.text)['status']['status']

if ingestion_status == "RUNNING":
  print("The ingestion is running...")

# Poll until the task reaches a terminal state so that a failed task doesn't loop forever.
while ingestion_status not in ("SUCCESS", "FAILED"):
  time.sleep(15)
  response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)
  ingestion_status = json.loads(response.text)['status']['status']

if ingestion_status == "SUCCESS":
  print("The ingestion is complete:")
else:
  print("The ingestion failed:")
print(json.dumps(response.json(), indent=4))

When the ingestion task status shows `SUCCESS`, run the following cell to query the data and return `country`, `city`, and `sessionDetails` from 3 rows:

In [None]:
endpoint = "/druid/v2/sql"
print(druid_host+endpoint)
http_method = "POST"

payload = json.dumps({
  "query": "SELECT country, city, sessionDetails FROM kttm_transform LIMIT 3"
})
headers = {'Content-Type': 'application/json'}

response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)

print(json.dumps(json.loads(response.text), indent=4))

## Query nested data

Run the following cell to display the data types for columns in the `kttm` datasource. Note that nested columns display as `COMPLEX<json>`.

In [None]:
endpoint = "/druid/v2/sql"
print(druid_host+endpoint)
http_method = "POST"

payload = json.dumps({
  "query": "SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE \
   FROM INFORMATION_SCHEMA.COLUMNS \
   WHERE TABLE_NAME = 'kttm'"
})
headers = {'Content-Type': 'application/json'}

response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)

print(json.dumps(json.loads(response.text), indent=4))

You can use [`JSON_VALUE`](https://druid.apache.org/docs/latest/querying/sql-json-functions.html) to extract specific elements from a `COMPLEX<json>` object.

Run the following cell to extract `continent` from `geo_ip` and `category` from `agent` for 3 rows:

In [None]:
endpoint = "/druid/v2/sql"
print(druid_host+endpoint)
http_method = "POST"

payload = json.dumps({
  "query": "SELECT JSON_VALUE(geo_ip, '$.continent') as continent, \
   JSON_VALUE(agent, '$.category') as category \
   FROM kttm LIMIT 3"
})
headers = {'Content-Type': 'application/json'}

response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)

print(json.dumps(json.loads(response.text), indent=4))
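
`JSON_VALUE` takes a JSONPath-style expression such as `'$.continent'`. As a rough mental model only, the hypothetical function below resolves simple dot-notation paths against a Python dictionary. It handles only `$.key.key` paths, not array subscripts or the rest of the path syntax Druid accepts. It returns `None` for nested objects on the assumption that `JSON_VALUE` is meant for scalar values. The sample object is invented.

```python
def json_value(obj, path):
    """Resolve a simple '$.a.b' path against nested dicts; return None if absent."""
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    current = obj
    for key in path[1:].split("."):
        if key == "":
            continue  # skip the empty piece before the first dot
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    # Only scalar values are returned; nested objects and arrays count as no match.
    return None if isinstance(current, (dict, list)) else current

geo_ip = {"continent": "Europe", "country": "Finland", "city": "Helsinki"}
print(json_value(geo_ip, "$.continent"))
```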

### Grouping, filtering, aggregating

Run the following cell to see how you can use the `COUNT(DISTINCT)` aggregation with `JSON_VALUE`.

In [None]:
endpoint = "/druid/v2/sql"
print(druid_host+endpoint)
http_method = "POST"

payload = json.dumps({
  "query": "SELECT COUNT(DISTINCT(JSON_VALUE(geo_ip, '$.city'))) as \"Number of cities\" \
   FROM kttm"
})
headers = {'Content-Type': 'application/json'}

response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)

print(json.dumps(json.loads(response.text), indent=4))
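
By default, Druid SQL computes `COUNT(DISTINCT)` with an approximation algorithm, so the number returned above may differ slightly from the exact count. As a sketch, assuming the `useApproximateCountDistinct` query context parameter is available in your Druid version, the request body below asks for an exact count. It only builds the payload and sends nothing.

```python
import json

payload = json.dumps({
    "query": "SELECT COUNT(DISTINCT JSON_VALUE(geo_ip, '$.city')) AS \"Number of cities\" FROM kttm",
    # Ask Druid for an exact count instead of the default approximation.
    "context": {"useApproximateCountDistinct": False},
})
print(payload)
```

Exact distinct counts can be slower and use more memory than the approximation on large datasources.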

Run the following cell to filter and group a query using `JSON_VALUE`. The query selects the `browser` element from the `agent` column and the `country` and `city` elements from the `geo_ip` column for all rows where the city is `Helsinki`.

In [None]:
endpoint = "/druid/v2/sql"
print(druid_host+endpoint)
http_method = "POST"

payload = json.dumps({
  "query": "SELECT JSON_VALUE(agent, '$.browser') as browser, \
   JSON_VALUE(geo_ip, '$.country') as country, \
   JSON_VALUE(geo_ip, '$.city') as city \
   FROM kttm \
   WHERE JSON_VALUE(geo_ip, '$.city') in ('Helsinki') \
   GROUP BY 1,2,3 \
   ORDER BY 1"
})
headers = {'Content-Type': 'application/json'}

response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)

print(json.dumps(json.loads(response.text), indent=4))
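
To filter on a different city without editing the SQL text, one option is the `parameters` field of the Druid SQL API, which binds values to `?` placeholders. The sketch below only builds the request body. The helper name `city_query_payload` is hypothetical, the query uses `= ?` in place of the `in (...)` filter above, and you should confirm against the SQL API documentation that dynamic parameters behave as expected in your Druid version.

```python
import json

def city_query_payload(city):
    """Build a SQL API body that filters kttm rows by city using a dynamic parameter."""
    query = """
    SELECT JSON_VALUE(agent, '$.browser') AS browser,
      JSON_VALUE(geo_ip, '$.country') AS country,
      JSON_VALUE(geo_ip, '$.city') AS city
    FROM kttm
    WHERE JSON_VALUE(geo_ip, '$.city') = ?
    GROUP BY 1, 2, 3
    ORDER BY 1
    """
    return json.dumps({
        "query": query,
        # Each parameter is bound, in order, to a ? placeholder in the query.
        "parameters": [{"type": "VARCHAR", "value": city}],
    })

payload = city_query_payload("Helsinki")
print(payload)
```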

### Using helper operators

You can use SQL helper operators such as [`JSON_KEYS`](https://druid.apache.org/docs/latest/querying/sql-json-functions.html) and [`JSON_PATHS`](https://druid.apache.org/docs/latest/querying/sql-json-functions.html) to examine nested data and plan your queries. Run the following cell to return an array of field names and an array of paths for the `geo_ip` nested column.

In [None]:
endpoint = "/druid/v2/sql"
print(druid_host+endpoint)
http_method = "POST"

payload = json.dumps({
  "query": "SELECT ARRAY_CONCAT_AGG(DISTINCT JSON_KEYS(geo_ip, '$.')) as \"geo_ip keys\", \
   ARRAY_CONCAT_AGG(DISTINCT JSON_PATHS(geo_ip)) as \"geo_ip paths\" \
   FROM kttm"
})
headers = {'Content-Type': 'application/json'}

response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)

print(json.dumps(json.loads(response.text), indent=4))
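
As a rough analogy for what these functions report, the following sketch lists the top-level keys and the leaf paths of a nested Python dictionary. The function names mirror the SQL functions but are hypothetical Python helpers, the sample object is invented, and the sketch descends into nested objects only, not arrays.

```python
def json_keys(obj):
    """Return the top-level field names of a nested object."""
    return list(obj.keys()) if isinstance(obj, dict) else []

def json_paths(obj, prefix="$"):
    """Return the path of every leaf value, descending into nested dicts only."""
    if not isinstance(obj, dict):
        return [prefix]
    paths = []
    for key, value in obj.items():
        paths.extend(json_paths(value, f"{prefix}.{key}"))
    return paths

# Invented sample; the "location" object is for illustration only.
sample = {"city": "Helsinki", "location": {"lat": 60.17, "lon": 24.94}}
print(json_keys(sample))
print(json_paths(sample))
```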

## Learn more

This tutorial covers the basics of working with nested data. To learn more about nested data in Druid and related Druid features, see the following topics:

- [Nested columns](https://druid.apache.org/docs/latest/querying/nested-columns.html) for information about the nested columns feature, with ingestion and query examples. 
- [SQL JSON functions](https://druid.apache.org/docs/latest/querying/sql-json-functions.html) for details on all of the functions you used in this tutorial.
- [SQL-based ingestion](https://druid.apache.org/docs/latest/multi-stage-query/index.html) for information on how to use the multi-stage query task engine in Druid.