## Working with nested columns

<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

This tutorial demonstrates how to work with [nested columns](https://druid.apache.org/docs/latest/querying/nested-columns.html) in Apache Druid.

Druid stores nested data structures in `COMPLEX<json>` columns. In this tutorial, you perform the following tasks:

- Ingest nested JSON data using SQL-based ingestion.
- Transform nested data during ingestion using SQL JSON functions.
- Perform queries to display, filter, and aggregate nested data.
- Use helper operators to examine nested data and plan your queries.

Druid can ingest nested data directly from the following formats: JSON, Parquet, Avro, ORC, and Protobuf.

## Table of contents

- [Prerequisites](#Prerequisites)
- [Initialization](#Initialization)
- [Ingest nested data](#Ingest-nested-data)
- [Transform nested data](#Transform-nested-data)
- [Query nested data](#Query-nested-data)
- [Group, filter, and aggregate nested data](#Group-filter-and-aggregate-nested-data)
- [Use helper operators](#Use-helper-operators)
- [Learn more](#Learn-more)

## Prerequisites

This tutorial works with Druid 25.0.0 or later.

### Run with Docker

Launch this tutorial and all prerequisites using the `druid-jupyter` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see [Docker for Jupyter Notebook tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).

### Run without Docker

If you do not use the Docker Compose environment, you need the following:

* A running Apache Druid instance, with a `DRUID_HOST` local environment variable containing the server name of your Druid router.
* [druidapi](https://github.com/apache/druid/blob/master/examples/quickstart/jupyter-notebooks/druidapi/README.md), a Python client for Apache Druid. Follow the instructions in the Install section of the README file.

## Initialization

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If the connection succeeds, the output shows the Druid version number.

In [None]:
import druidapi
import os

# Use DRUID_HOST when it is set; otherwise assume a router on localhost.
druid_host = f"http://{os.environ.get('DRUID_HOST', 'localhost')}:8888"

print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)

display = druid.display
sql_client = druid.sql
status_client = druid.status

status_client.version

## Ingest nested data

Run the following cell to ingest sample clickstream data from the [Koalas to the Max](https://www.koalastothemax.com/) game.

In [None]:
sql = '''
INSERT INTO example_koalas_nesteddata
    WITH "source" AS
    (SELECT * FROM TABLE(EXTERN('{"type":"http","uris":["https://static.imply.io/example-data/kttm-nested-v2/kttm-nested-v2-2019-08-25.json.gz"]}',
       '{"type":"json"}','[{"name":"timestamp","type":"string"},{"name":"client_ip","type":"string"},
        {"name":"session","type":"string"},{"name":"session_length","type":"string"},{"name":"event","type":"COMPLEX<json>"},
        {"name":"agent","type":"COMPLEX<json>"},{"name":"geo_ip","type":"COMPLEX<json>"}]')))
    SELECT TIME_PARSE("timestamp") AS "__time",
    "client_ip", 
    "session", 
    "session_length", 
    "event", 
    "agent", 
    "geo_ip"
    FROM "source"
    PARTITIONED BY DAY
'''

sql_client.run_task(sql)
sql_client.wait_until_ready("example_koalas_nesteddata")
display.table("example_koalas_nesteddata")
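
The `EXTERN` call in the ingestion query above takes three JSON strings: the input source, the input format, and the column signature. Writing them by hand is easy to get wrong, so the following is a minimal sketch of how you might assemble them in Python. The helper name `build_extern` is hypothetical and not part of `druidapi`, and the sketch assumes that no value contains a single quote.

```python
import json


def build_extern(uri, columns):
    """Return an EXTERN(...) SQL fragment (hypothetical helper, not a druidapi call).

    `columns` maps each input column name to its Druid type.
    Assumes no value contains a single quote, so no SQL escaping is done.
    """
    compact = {"separators": (",", ":")}
    input_source = json.dumps({"type": "http", "uris": [uri]}, **compact)
    input_format = json.dumps({"type": "json"}, **compact)
    signature = json.dumps(
        [{"name": name, "type": col_type} for name, col_type in columns.items()],
        **compact,
    )
    # Each argument is a single-quoted SQL string literal that contains JSON.
    return f"EXTERN('{input_source}', '{input_format}', '{signature}')"


columns = {
    "timestamp": "string",
    "client_ip": "string",
    "session": "string",
    "session_length": "string",
    "event": "COMPLEX<json>",
    "agent": "COMPLEX<json>",
    "geo_ip": "COMPLEX<json>",
}
extern = build_extern(
    "https://static.imply.io/example-data/kttm-nested-v2/kttm-nested-v2-2019-08-25.json.gz",
    columns,
)
print(extern)
```

The nested columns are the ones declared as `COMPLEX<json>` in the signature; the remaining columns are read as plain strings.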

Druid reports the task as complete as soon as ingestion finishes. However, the resulting segments take some additional time to load before you can query them.

Wait for the table detail to display, then run the following cell to query the data and return selected columns from 3 rows. Note the nested structure of the `event`, `agent`, and `geo_ip` columns.

In [None]:
sql = '''
SELECT session, event, agent, geo_ip 
FROM example_koalas_nesteddata LIMIT 3
'''
resp = sql_client.sql_query(sql)
resp.show()

## Transform nested data

You can use Druid's [SQL JSON functions](https://druid.apache.org/docs/latest/querying/sql-json-functions.html) to transform nested data in your ingestion query.

Run the following cell to insert sample data into a new datasource named `example_koalas_nesteddata_transform`. The SELECT query extracts the `country` and `city` elements from the nested `geo_ip` column and creates a composite object `sessionDetails` containing `session` and `session_length`.

In [None]:
sql = '''
INSERT INTO example_koalas_nesteddata_transform
    WITH "source" AS
    (SELECT * FROM TABLE(EXTERN('{"type":"http","uris":["https://static.imply.io/example-data/kttm-nested-v2/kttm-nested-v2-2019-08-25.json.gz"]}',
       '{"type":"json"}','[{"name":"timestamp","type":"string"},{"name":"session","type":"string"},{"name":"session_length","type":"string"},
        {"name":"event","type":"COMPLEX<json>"},{"name":"agent","type":"COMPLEX<json>"},{"name":"geo_ip","type":"COMPLEX<json>"}]')))
    SELECT TIME_PARSE("timestamp") AS "__time",
    JSON_QUERY(geo_ip, '$.country') AS country,
    JSON_QUERY(geo_ip, '$.city') AS city,
    JSON_OBJECT('session':session, 'session_length':session_length) AS sessionDetails
    FROM "source"
    PARTITIONED BY DAY
'''

sql_client.run_task(sql)
sql_client.wait_until_ready("example_koalas_nesteddata_transform")
display.table("example_koalas_nesteddata_transform")

When the table detail displays, run the following cell to query the data and return `country`, `city`, and `sessionDetails` from 3 rows:

In [None]:
sql = '''
SELECT country, city, sessionDetails 
FROM example_koalas_nesteddata_transform 
LIMIT 3
'''
resp = sql_client.sql_query(sql)
resp.show()
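
To make the reshaping in the transform query concrete, here is a rough pure-Python analogue of what happens to one input row. This is only an illustration of the resulting row shape, not how Druid evaluates `JSON_QUERY` or `JSON_OBJECT`. The function name `transform_row` and all values in the sample row are made up for this sketch.

```python
def transform_row(row):
    """Mimic the shape produced by the transform query for one input row.

    Illustration only: Druid evaluates JSON_QUERY and JSON_OBJECT itself.
    """
    geo_ip = row.get("geo_ip") or {}
    return {
        "__time": row["timestamp"],
        # JSON_QUERY(geo_ip, '$.country') and JSON_QUERY(geo_ip, '$.city')
        "country": geo_ip.get("country"),
        "city": geo_ip.get("city"),
        # JSON_OBJECT('session':session, 'session_length':session_length)
        "sessionDetails": {
            "session": row["session"],
            "session_length": row["session_length"],
        },
    }


# Hypothetical input row; every value here is a placeholder.
sample_row = {
    "timestamp": "2019-08-25T00:00:00Z",
    "session": "example-session",
    "session_length": "1000",
    "geo_ip": {"country": "Exampleland", "city": "Sampleville"},
}
transformed = transform_row(sample_row)
print(transformed)
```

The two top-level strings `session` and `session_length` end up nested inside `sessionDetails`, while `country` and `city` are pulled out of `geo_ip` into their own columns.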

## Query nested data

Run the following cell to display the data types for columns in the `example_koalas_nesteddata` datasource. Note that nested columns display as `COMPLEX<json>`.

In [None]:
sql = '''
SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'example_koalas_nesteddata'
'''
resp = sql_client.sql_query(sql)
resp.show()

You can use [`JSON_VALUE`](https://druid.apache.org/docs/latest/querying/sql-json-functions.html) to extract specific elements from a `COMPLEX<json>` object.
    
Run the following cell to extract `continent` from `geo_ip` and `category` from `agent` for 3 rows:

In [None]:
sql = '''
SELECT JSON_VALUE(geo_ip, '$.continent') as continent,
JSON_VALUE(agent, '$.category') as category
FROM example_koalas_nesteddata LIMIT 3
'''
resp = sql_client.sql_query(sql)
resp.show()
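
If the path syntax is unfamiliar, the following sketch shows the idea behind a path such as `$.continent` using a plain Python dictionary. It handles only simple dotted paths, and it assumes that a path pointing at an object or array yields no value, since `JSON_VALUE` is meant for literal values. The function `json_value` and the sample document are hypothetical and are not Druid's implementation.

```python
def json_value(obj, path):
    """Return the scalar at a simple dotted path such as '$.a.b', else None.

    Sketch only: supports '$' followed by dot-separated keys, no array indexes.
    """
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    current = obj
    for key in filter(None, path[1:].split(".")):
        if not isinstance(current, dict) or key not in current:
            return None  # path does not exist in this document
        current = current[key]
    # Assumption: objects and arrays are not literal values, so return None.
    return None if isinstance(current, (dict, list)) else current


# Hypothetical nested value; the field contents are placeholders.
geo_ip = {"continent": "Europe", "location": {"region": "North"}}

print(json_value(geo_ip, "$.continent"))
print(json_value(geo_ip, "$.location.region"))
print(json_value(geo_ip, "$.location"))
```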

### Group, filter, and aggregate nested data

Run the following cell to see how you can use the `COUNT(DISTINCT ...)` aggregation with `JSON_VALUE`.

In [None]:
sql = '''
SELECT COUNT(DISTINCT(JSON_VALUE(geo_ip, '$.city'))) as "Number of cities"
FROM example_koalas_nesteddata
'''
resp = sql_client.sql_query(sql)
resp.show()

Run the following cell to filter and group query results using `JSON_VALUE`. The query selects the `browser` element from the `agent` column and the `country` and `city` elements from the `geo_ip` column, for all rows with city `Helsinki`.

In [None]:
sql = '''
SELECT JSON_VALUE(agent, '$.browser') as browser,
JSON_VALUE(geo_ip, '$.country') as country,
JSON_VALUE(geo_ip, '$.city') as city
FROM example_koalas_nesteddata
WHERE JSON_VALUE(geo_ip, '$.city') in ('Helsinki')
GROUP BY 1,2,3
ORDER BY 1
'''
resp = sql_client.sql_query(sql)
resp.show()
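
As a way to check your understanding of the two queries in this section, the sketch below reproduces their logic over a handful of in-memory rows. The rows are invented for illustration. Note one difference: Druid's `COUNT(DISTINCT)` is approximate by default, whereas this sketch counts exactly.

```python
# Hypothetical rows; only the fields used by the two queries are included.
rows = [
    {"agent": {"browser": "Chrome"}, "geo_ip": {"country": "Finland", "city": "Helsinki"}},
    {"agent": {"browser": "Firefox"}, "geo_ip": {"country": "Finland", "city": "Helsinki"}},
    {"agent": {"browser": "Chrome"}, "geo_ip": {"country": "Finland", "city": "Helsinki"}},
    {"agent": {"browser": "Safari"}, "geo_ip": {"country": "Sweden", "city": "Stockholm"}},
    {"agent": {"browser": "Chrome"}, "geo_ip": {"country": "Finland"}},  # no city
]


def city_of(row):
    """Return the nested city, or None when the row has no city."""
    return row.get("geo_ip", {}).get("city")


# COUNT(DISTINCT ...): a set drops duplicates, and missing values are skipped.
number_of_cities = len({city_of(r) for r in rows if city_of(r) is not None})

# WHERE ... IN ('Helsinki') then GROUP BY 1,2,3 with no aggregates:
# the result is the set of distinct (browser, country, city) tuples.
groups = sorted(
    {
        (r["agent"].get("browser"), r["geo_ip"].get("country"), city_of(r))
        for r in rows
        if city_of(r) in ("Helsinki",)
    },
    key=lambda group: group[0] or "",  # ORDER BY 1
)

print(number_of_cities)
print(groups)
```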

### Use helper operators

You can use SQL helper operators such as [`JSON_KEYS`](https://druid.apache.org/docs/latest/querying/sql-json-functions.html) and [`JSON_PATHS`](https://druid.apache.org/docs/latest/querying/sql-json-functions.html) to examine nested data and plan your queries. Run the following cell to return an array of field names and an array of paths for the `geo_ip` nested column.

In [None]:
sql = '''
SELECT ARRAY_CONCAT_AGG(DISTINCT JSON_KEYS(geo_ip, '$.')) as "geo_ip keys",
ARRAY_CONCAT_AGG(DISTINCT JSON_PATHS(geo_ip)) as "geo_ip paths"
FROM example_koalas_nesteddata
'''
resp = sql_client.sql_query(sql)
resp.show()
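
The difference between keys and paths is easiest to see on a value with more than one level of nesting. The sketch below lists both for a made-up document: keys are the field names at one level, while paths point at every literal value. The functions `json_keys` and `json_paths` are hypothetical Python stand-ins, not Druid's implementation, and the `geo_ip` values in the sample data may be flatter than this example.

```python
def json_keys(obj):
    """Return the field names at the top level of obj, or None if not an object."""
    return list(obj.keys()) if isinstance(obj, dict) else None


def json_paths(obj, prefix="$"):
    """Return a path for every literal (non-container) value in obj."""
    if isinstance(obj, dict):
        paths = []
        for key, value in obj.items():
            paths.extend(json_paths(value, f"{prefix}.{key}"))
        return paths
    if isinstance(obj, list):
        paths = []
        for index, value in enumerate(obj):
            paths.extend(json_paths(value, f"{prefix}[{index}]"))
        return paths
    return [prefix]  # a literal value: the path ends here


# Hypothetical document with one nested object and one array.
doc = {"city": "Sampleville", "location": {"lat": 1.0, "lon": 2.0}, "tags": ["a", "b"]}

print(json_keys(doc))
print(json_paths(doc))
```

A path returned this way can be passed to an extraction function such as `JSON_VALUE`, which is why listing paths first helps when planning queries.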

## Learn more

This tutorial covers the basics of working with nested data. To learn more about nested data in Druid and related Druid features, see the following topics:

- [Nested columns](https://druid.apache.org/docs/latest/querying/nested-columns.html) for information about the nested columns feature, with ingestion and query examples. 
- [SQL JSON functions](https://druid.apache.org/docs/latest/querying/sql-json-functions.html) for details on all of the functions you used in this tutorial.
- [SQL-based ingestion](https://druid.apache.org/docs/latest/multi-stage-query/index.html) for information on how to use Druid SQL-based ingestion.