## Scenario

In short: read records from multiple JSON and Parquet files with different schemas, explode/unnest and aggregate the data, and write the results to Postgres.

### Data

The data generated with `generate_example_data.py` for this scenario is in the `data/` directory and contains 25 JSON files and 5 Parquet files. Each JSON file and each row in the Parquet files contains a record with the following fields:

| Field | Optional | Data type | Possible values |
| --- | --- | --- | --- |
| id |  | int | [1, 600) |
| timestamp |  | str | ISO format timestamp |
| col1 | x | float | [100, 200) |
| col2 | x | float | [200, 300) |
| col3 | x | float | [300, 400) |
| tags | x | List[str] | "a", "b", "c", "d" |

### Task 1

Find the count and average of values in the fields `col1`, `col2`, and `col3` aggregated by the different tags. Write the result to Postgres as a new table.

### Task 2

Find the most common tag(s) for every month (ignoring the year). Write the result to Postgres as a new table.

### Plan of attack

We'll work through this problem in the following steps:
1. read JSON and Parquet files and combine schemas,
2. explode tags and get Task 1 aggregates,
3. convert str timestamp to actual timestamp and extract month,
4. get tag ranks by month and get top tags as a list by month for Task 2,
5. connect and write results to Postgres.

In [1]:
import duckdb

## 1. Read JSON and Parquet files and combine schemas

DuckDB supports globbing, so reading multiple files is as easy as reading a single file. Because our files can have different schemas, we call `read_json_auto()` and `read_parquet()` explicitly so that we can pass `union_by_name = true`, which matches columns by name across files and fills columns missing from a file with NULL.

In [8]:
# Read all Parquet files:
duckdb.sql("FROM read_parquet('data/*.parquet', union_by_name = true)").show()

┌───────┬─────────────────────┬──────────────┬────────────────────┬────────────────────┬────────────────────┐
│  id   │      timestamp      │     tags     │        col3        │        col2        │        col1        │
│ int64 │       varchar       │  varchar[]   │       double       │       double       │       double       │
├───────┼─────────────────────┼──────────────┼────────────────────┼────────────────────┼────────────────────┤
│   201 │ 2023-02-14T23:43:06 │ [c, d, b, a] │               NULL │               NULL │               NULL │
│   202 │ 2023-12-10T04:20:09 │ [c, b]       │  344.4384086978107 │ 283.82011872170756 │               NULL │
│   203 │ 2023-04-12T19:11:56 │ [a, d, b]    │  300.2019868412024 │ 250.26658507608727 │               NULL │
│   204 │ 2023-11-29T05:50:49 │ NULL         │               NULL │               NULL │ 160.47502928589932 │
│   205 │ 2023-07-24T18:16:09 │ [a]          │  389.3784272049234 │  240.8342356007753 │ 134.37312613968828 │
│   206 │ 

In [9]:
# Read all JSON files:
duckdb.sql("FROM read_json_auto('data/*.json', union_by_name = true)").show()

┌───────┬─────────────────────┬────────────────────┬────────────────────┬──────────────┬────────────────────┐
│  id   │      timestamp      │        col1        │        col3        │     tags     │        col2        │
│ int64 │       varchar       │       double       │       double       │  varchar[]   │       double       │
├───────┼─────────────────────┼────────────────────┼────────────────────┼──────────────┼────────────────────┤
│    10 │ 2023-03-19T15:02:04 │ 179.27596351197928 │  393.0434391894289 │ [d]          │               NULL │
│     9 │ 2023-06-26T21:56:05 │ 164.56291322757346 │ 331.81802543751616 │ [d]          │               NULL │
│     8 │ 2023-11-16T18:57:25 │               NULL │ 318.59194254112924 │ [b]          │               NULL │
│     7 │ 2023-07-19T03:39:52 │               NULL │  344.1654722833825 │ [d, b]       │ 200.66933318979386 │
│    19 │ 2023-03-27T05:11:16 │               NULL │               NULL │ NULL         │               NULL │
│     6 │ 

In [11]:
# UNION the results and save the query as a view:
query = """
CREATE OR REPLACE VIEW all_records_view AS (
    FROM read_parquet('data/*.parquet', union_by_name = true)
    UNION ALL BY NAME
    FROM read_json_auto('data/*.json', union_by_name = true)
)
"""
duckdb.sql(query)
duckdb.sql("FROM all_records_view ORDER BY id").show()

┌───────┬─────────────────────┬──────────────┬────────────────────┬────────────────────┬────────────────────┐
│  id   │      timestamp      │     tags     │        col3        │        col2        │        col1        │
│ int64 │       varchar       │  varchar[]   │       double       │       double       │       double       │
├───────┼─────────────────────┼──────────────┼────────────────────┼────────────────────┼────────────────────┤
│     1 │ 2023-02-09T06:28:40 │ [b]          │  359.8701591734018 │               NULL │ 116.18588359611803 │
│     2 │ 2023-05-14T23:19:16 │ [a, b]       │               NULL │               NULL │               NULL │
│     3 │ 2023-06-20T22:06:01 │ NULL         │               NULL │               NULL │               NULL │
│     4 │ 2023-02-01T09:24:39 │ [b, a, c]    │               NULL │  264.8946250873081 │ 153.48516545739275 │
│     5 │ 2023-02-10T04:00:55 │ NULL         │               NULL │               NULL │               NULL │
│     6 │ 

## 2. Explode tags and get Task 1 aggregates

To aggregate by tags, we'll explode the tag lists, i.e. we create a row for every tag in a list of tags. In DuckDB this is done with the function `unnest()`.

In [12]:
duckdb.sql("FROM all_records_view SELECT unnest(tags) AS tag, col1, col2, col3").show()

┌─────────┬────────────────────┬────────────────────┬───────────────────┐
│   tag   │        col1        │        col2        │       col3        │
│ varchar │       double       │       double       │      double       │
├─────────┼────────────────────┼────────────────────┼───────────────────┤
│ c       │               NULL │               NULL │              NULL │
│ d       │               NULL │               NULL │              NULL │
│ b       │               NULL │               NULL │              NULL │
│ a       │               NULL │               NULL │              NULL │
│ c       │               NULL │ 283.82011872170756 │ 344.4384086978107 │
│ b       │               NULL │ 283.82011872170756 │ 344.4384086978107 │
│ a       │               NULL │ 250.26658507608727 │ 300.2019868412024 │
│ d       │               NULL │ 250.26658507608727 │ 300.2019868412024 │
│ b       │               NULL │ 250.26658507608727 │ 300.2019868412024 │
│ a       │ 134.37312613968828 │  240.

One thing to note is that `unnest()` produces no output rows for a record whose `tags` is NULL, so records without tags silently disappear. We might want to keep them to get the aggregated values for untagged records. We can add them back under a `'NO TAGS'` label with a `UNION ALL`.

In [16]:
query = """
FROM all_records_view SELECT unnest(tags) AS tag, col1, col2, col3
UNION ALL
FROM all_records_view SELECT 'NO TAGS' AS tag, col1, col2, col3
WHERE tags IS NULL
"""
duckdb.sql(query).show()

┌─────────┬────────────────────┬────────────────────┬────────────────────┐
│   tag   │        col1        │        col2        │        col3        │
│ varchar │       double       │       double       │       double       │
├─────────┼────────────────────┼────────────────────┼────────────────────┤
│ c       │               NULL │               NULL │               NULL │
│ d       │               NULL │               NULL │               NULL │
│ b       │               NULL │               NULL │               NULL │
│ a       │               NULL │               NULL │               NULL │
│ c       │               NULL │ 283.82011872170756 │  344.4384086978107 │
│ b       │               NULL │ 283.82011872170756 │  344.4384086978107 │
│ a       │               NULL │ 250.26658507608727 │  300.2019868412024 │
│ d       │               NULL │ 250.26658507608727 │  300.2019868412024 │
│ b       │               NULL │ 250.26658507608727 │  300.2019868412024 │
│ a       │ 134.373126139

Now all we are missing are the aggregates. Let's also save the result as a view.

In [19]:
query = """
CREATE OR REPLACE VIEW task1 AS (
    WITH cte AS (
        FROM all_records_view SELECT unnest(tags) AS tag, col1, col2, col3
        UNION ALL
        FROM all_records_view SELECT 'NO TAGS' AS tag, col1, col2, col3
        WHERE tags IS NULL
    )
    FROM cte SELECT tag, COUNT(col1), MEAN(col1), COUNT(col2), MEAN(col2), COUNT(col3), MEAN(col3)
    GROUP BY tag
)
"""
duckdb.sql(query)
duckdb.sql("FROM task1").show()

┌─────────┬─────────────┬────────────────────┬─────────────┬────────────────────┬─────────────┬────────────────────┐
│   tag   │ count(col1) │     mean(col1)     │ count(col2) │     mean(col2)     │ count(col3) │     mean(col3)     │
│ varchar │    int64    │       double       │    int64    │       double       │    int64    │       double       │
├─────────┼─────────────┼────────────────────┼─────────────┼────────────────────┼─────────────┼────────────────────┤
│ b       │          31 │ 141.89237957017806 │          30 │  243.4542142670114 │          35 │  350.7843464869098 │
│ NO TAGS │          17 │ 148.85170712520218 │          16 │ 238.82129301737254 │          13 │  360.7105108776255 │
│ c       │          19 │ 148.04043550462617 │          20 │  244.4782917085122 │          20 │  339.3808263120647 │
│ a       │          26 │ 147.36241567052775 │          22 │ 244.83179086145273 │          29 │ 349.94927711507523 │
│ d       │          20 │  150.4693243437674 │          17 │ 246

## 3. Convert str timestamp to actual timestamp and extract month



## 4. Get tag ranks by month and get top tags as a list by month for Task 2



## 5. Connect and write results to Postgres

