# Data Querying and Exporting

This notebook demonstrates how to query data from the Apache Iceberg data lake (which replaced Delta Lake) and export it in various formats.

## Starting the servers:
To launch the server, open the Docker Desktop app and run the following command at the root of the project:
```bash
$ make up
```
This command will launch the Jupyter server using the environment variables defined in the `.env` file.

#### Understanding expected file paths:
DiveDB expects the following paths to be set in the `.env` file:
- `CONTAINER_DATA_PATH`
- `LOCAL_DATA_PATH`
- `LOCAL_ICEBERG_PATH`
- `CONTAINER_ICEBERG_PATH`

These paths are used to mount the Iceberg warehouse and file storage to the containers. The "LOCAL_" paths can be wherever makes sense for your local machine. The "CONTAINER_" paths are the paths that the containers expect. We recommend you keep the "CONTAINER_" paths as they are in the `.env.example` file.
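
Before launching, it can help to confirm that all four variables are defined. The following is a minimal sketch, not part of DiveDB: `missing_divedb_paths` is a hypothetical helper, and it assumes the variables from `.env` have already been loaded into the process environment (for example by your shell or by Docker Compose).

```python
import os

# The four path variables listed above as required.
REQUIRED_PATH_VARS = (
    "CONTAINER_DATA_PATH",
    "LOCAL_DATA_PATH",
    "LOCAL_ICEBERG_PATH",
    "CONTAINER_ICEBERG_PATH",
)


def missing_divedb_paths(env=None):
    """Return the required variables that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_PATH_VARS if not env.get(name)]


missing = missing_divedb_paths()
if missing:
    print("Missing in environment:", ", ".join(missing))
else:
    print("All DiveDB path variables are set.")
```

Passing a plain dictionary instead of relying on `os.environ` makes the check easy to try without touching your real environment.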

**Migration Note**: We have migrated from Delta Lake to Apache Iceberg for better performance and schema evolution capabilities.

#### When is the server ready?
You'll know it's ready when you see the following logs in the terminal:
```bash
jupyter-1            | [I 2024-08-30 16:12:37.083 ServerApp] Serving notebooks from local directory: /app
jupyter-1            | [I 2024-08-30 16:12:37.083 ServerApp] Jupyter Server 2.14.2 is running at:
jupyter-1            | [I 2024-08-30 16:12:37.083 ServerApp] http://e29d05e13fd0:8888/jupyter/tree
jupyter-1            | [I 2024-08-30 16:12:37.083 ServerApp]     http://127.0.0.1:8888/jupyter/tree
```
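
If you would rather not watch the logs, readiness can also be checked by polling the server URL. This is a rough sketch using only the standard library: `wait_for_server` is a hypothetical helper, and treating any HTTP response (including an error status such as a login redirect or 403) as "ready" is an assumption.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url, timeout_s=60.0, interval_s=2.0):
    """Poll `url` until it answers over HTTP or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(url, timeout=interval_s):
                return True  # the server returned a successful response
        except urllib.error.HTTPError:
            return True  # the server answered, even if with an error status
        except OSError:
            pass  # connection refused or timed out: not up yet
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

For the setup described here, the call would be `wait_for_server("http://localhost:8888/jupyter")`, which returns `True` once the Jupyter server responds and `False` if it does not respond within the timeout.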

## Connecting to the Jupyter Kernel:
To connect to the Jupyter server in your notebook, follow these steps:
1. Click the "Select Kernel" button at the top right of the page.
1. Pick the "Select another kernel" option in the dropdown menu.
1. Pick the "Existing Jupyter Server" option in the dropdown menu.
1. Now we need to connect to the Jupyter server.
    - If you previously connected to the Jupyter server
        - Pick the "localhost" option in the dropdown menu (or whatever you named it prior)
    - If you have not connected to the Jupyter server before
        - Pick the "Enter the URL of the running Jupyter server" option in the dropdown menu.
        - Enter http://localhost:8888/jupyter
        - Give it a name you'll remember (like "Local DiveDB Jupyter Server")
1. Press the "Reload" icon in the top right of the dropdown menu to see the latest kernel.
1. Pick the "Python 3" option in the dropdown menu.

This will ensure you execute the Jupyter notebook in the correct environment.

## Querying from Iceberg Data Lake
We connect to our datastores using the `DuckPond` class. DuckPond is a wrapper around a DuckDB connection with access to both our Metadata Database and our measurements stored in Apache Iceberg tables. The ability to query both sources of data from a single connection is useful for quickly accessing data for analysis.

Data in our Iceberg data lake is sorted into various views. There's a simple hierarchy to views: datasets are at the highest level and each dataset has a data iceberg, a point events iceberg, and a state events iceberg.
```mermaid
graph TD
    %% Datasets
    A[🛠️ SS Movement]
    B[🛠️ NESE Sleep]
    C[🛠️ OO Physiology]

    %% Databases
    D1[🧊 data]
    D2[🧊 point events]
    D3[🧊 state events]

    E1[🧊 data]
    E2[🧊 point events]
    E3[🧊 state events]

    F1[🧊 data]
    F2[🧊 point events]
    F3[🧊 state events]

    %% Connections
    A --> D1
    A --> D2
    A --> D3

    B --> E1
    B --> E2
    B --> E3

    C --> F1
    C --> F2
    C --> F3
```
(Legend: 🛠️ = dataset. 🧊 = iceberg.)

To see all of the views available, call the `list_all_views` method on your DuckPond instance.

In [None]:
import os
import importlib
import DiveDB.services.duck_pond
importlib.reload(DiveDB.services.duck_pond)

from DiveDB.services.duck_pond import DuckPond

duck_pond = DuckPond(os.environ["CONTAINER_ICEBERG_PATH"])

display(duck_pond.list_all_views())

There are two main ways to query data from the Iceberg data lake:
1. Using the DuckPond `get_data` method
2. Using the DuckPond connection to query directly

### Using the DuckPond `get_data` method
DuckPond's `get_data` method constructs a query from the parameters you pass to it and returns the result lazily. It takes the following optional parameters (the examples in this notebook additionally pass `dataset`, `frequency`, and `classes`):
- `labels`: A string or list of data labels to query.
- `logger_ids`: A string or list of logger IDs to query.
- `animal_ids`: A string or list of animal IDs to query.
- `deployment_ids`: A string or list of deployment IDs to query.
- `recording_ids`: A string or list of recording IDs to query.
- `date_range`: A tuple of start and end dates to query.
- `limit`: The maximum number of rows to return.

The `get_data` method returns a [DuckDB DuckDBPyConnection](https://duckdb.org/docs/api/python/reference/#duckdb.DuckDBPyConnection), which can be used to convert the data into many different formats, including the following ([see the documentation for a full list](https://duckdb.org/docs/api/python/conversion#result-conversion-duckdb-results-to-python)):
- NumPy Array (`.fetchnumpy()`)
- Pandas DataFrame (`.df()`)
- Arrow Table (`.arrow()`)
- Polars DataFrame (`.pl()`)

Until a conversion method is called, the data is not loaded into memory. This allows for large queries to be run without using too much memory.

##### Example:

In [None]:
import os
import importlib
import DiveDB.services.duck_pond
importlib.reload(DiveDB.services.duck_pond)

from DiveDB.services.duck_pond import DuckPond

duck_pond = DuckPond(os.environ["CONTAINER_ICEBERG_PATH"])

conn = duck_pond.conn.sql("""SELECT count(*) FROM "EP Physiology_Data" """)

display(conn.df())

## Querying for shared frequency from Iceberg
Iceberg can store multiple signal names at a single frequency. If you query a single signal name, the data will be returned as a list of values for each timestamp. If you query multiple signal names, the data will be returned as a list of lists of values for each timestamp.

The data will be returned as a Pandas DataFrame with a DatetimeIndex.

##### Example:

In [None]:
import os
import importlib
import DiveDB.services.duck_pond
importlib.reload(DiveDB.services.duck_pond)

from DiveDB.services.duck_pond import DuckPond

duck_pond = DuckPond(os.environ["CONTAINER_ICEBERG_PATH"])

duck_pond.get_data(    
    labels=["derived_data_depth"],
    animal_ids="apfo-001a",
    dataset="EP Physiology",
    frequency=1/60,  # Once a minute
)
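
The `frequency` argument is expressed in Hz, so its reciprocal is the spacing between returned samples. The sketch below only illustrates that arithmetic; `sampling_interval` is a hypothetical helper and not part of DuckPond.

```python
from datetime import timedelta


def sampling_interval(frequency_hz):
    """Return the time between consecutive samples at `frequency_hz`."""
    if frequency_hz <= 0:
        raise ValueError("frequency must be positive")
    return timedelta(seconds=1 / frequency_hz)


print(sampling_interval(1 / 60))  # the once-a-minute rate used above
print(sampling_interval(10))      # a 10 Hz rate
```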


### Using the DuckPond connection to query directly
More complex queries can be run directly on the DuckPond connection. This is useful for queries that the `get_data` method may not support, such as those involving grouping or aggregations.

DuckDB's SQL syntax is very similar to that of other SQL databases. A full breakdown of the syntax can be found [in the documentation](https://duckdb.org/docs/sql/introduction).

The connection object can be found in the `duck_pond.conn` attribute. To run queries, use the `sql` method, which returns a [DuckDB DuckDBPyRelation](https://duckdb.org/docs/api/python/reference/#duckdb.DuckDBPyRelation) that can be converted into many different formats, including the following ([see the documentation for a full list](https://duckdb.org/docs/api/python/conversion#result-conversion-duckdb-results-to-python)):
- NumPy Array (`.fetchnumpy()`)
- Pandas DataFrame (`.df()`)
- Arrow Table (`.arrow()`)
- Polars DataFrame (`.pl()`)

##### Example:

In [None]:
import importlib
import os
# Reload the DuckPond module to pick up any changes
import DiveDB.services.duck_pond
importlib.reload(DiveDB.services.duck_pond)
from DiveDB.services.duck_pond import DuckPond

duck_pond = DuckPond(os.environ["CONTAINER_ICEBERG_PATH"])

df = duck_pond.conn.sql("""
    SELECT label, avg(float_value) as mean_data
    FROM "EP Physiology_Data"
    WHERE label = 'sensor_data_light'
    OR label = 'sensor_data_temperature'
    GROUP BY label
""").df()

display(df)
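
View names such as `EP Physiology_Data` contain a space, so they have to be wrapped in double quotes in SQL, as in the query above. When the dataset name comes from a variable, a small helper keeps the quoting consistent. This is a sketch: `quote_identifier` is a hypothetical helper, not part of DuckPond, and the `_Data` suffix is inferred from the view name used in this notebook.

```python
def quote_identifier(name):
    """Wrap `name` in double quotes, doubling any embedded double quote."""
    return '"' + name.replace('"', '""') + '"'


dataset = "EP Physiology"
view = quote_identifier(f"{dataset}_Data")
query = f"SELECT label, avg(float_value) AS mean_data FROM {view} GROUP BY label"
print(query)
```

The resulting string can be passed to `duck_pond.conn.sql(...)` in the same way as the hardcoded query.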


## Chaining Queries
Queries can be chained together to form a pipeline. This is useful for running complex queries that involve multiple steps.

##### Example:

In [None]:
import importlib
import os
# Reload the DuckPond module to pick up any changes
import DiveDB.services.duck_pond
importlib.reload(DiveDB.services.duck_pond)
from DiveDB.services.duck_pond import DuckPond

duck_pond = DuckPond(os.environ["CONTAINER_ICEBERG_PATH"])

# Get the filtered data
filtered_data = duck_pond.get_data(    
    dataset="EP Physiology",
    # Resample values to 10 Hz and make sure each signal has the same time intervals
    frequency=10,
    # Aggregation of events (think state events - behaviors) type: state (has start and end dates)
    classes="sensor_data_accelerometer",
    
)

display(filtered_data)


## Query Variables
Sometimes we don't want to hardcode variables in our queries. The `execute` method accepts a list of values, which are bound in order to the `$1`, `$2`, ... placeholders in the query.

##### Example:

In [None]:
import importlib
import os
# Reload the DuckPond module to pick up any changes
import DiveDB.services.duck_pond
importlib.reload(DiveDB.services.duck_pond)
from DiveDB.services.duck_pond import DuckPond

duck_pond = DuckPond(os.environ["CONTAINER_ICEBERG_PATH"])

label = "sensor_data_temperature"
df = duck_pond.conn.execute("""
    SELECT label, avg(float_value) as mean_data
    FROM "EP Physiology_Data"
    WHERE label = $1
    GROUP BY label
""", [label]).df()

display(df)
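
The same mechanism extends to a variable number of values. The sketch below builds an `IN (...)` clause with one `$n` placeholder per value, so the values themselves never get interpolated into the SQL text. `build_in_clause` is a hypothetical helper, not part of DuckPond, and the column name is assumed to come from trusted code rather than user input.

```python
def build_in_clause(column, values):
    """Return an IN clause with positional placeholders plus its parameter list."""
    values = list(values)
    if not values:
        raise ValueError("at least one value is required")
    placeholders = ", ".join(f"${i}" for i in range(1, len(values) + 1))
    return f"{column} IN ({placeholders})", values


clause, params = build_in_clause(
    "label", ["sensor_data_light", "sensor_data_temperature"]
)
query = f"""
    SELECT label, avg(float_value) AS mean_data
    FROM "EP Physiology_Data"
    WHERE {clause}
    GROUP BY label
"""
# Assumed usage, mirroring the cell above:
# df = duck_pond.conn.execute(query, params).df()
```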

## Query Metadata Database
We can also query the Metadata Database directly. This is useful for querying data that is not stored in the Iceberg data lake and for joining it with measurement data.

##### Example:

In [None]:
import importlib
import os
# Reload the DuckPond module to pick up any changes
import DiveDB.services.duck_pond
importlib.reload(DiveDB.services.duck_pond)
from DiveDB.services.duck_pond import DuckPond

from DiveDB.services.notion_orm import NotionORMManager

notion_manager = NotionORMManager(
    db_map={"Animal DB": os.environ["ANIMALS_DB_ID"]},
    token=os.environ["NOTION_API_KEY"],
)

duck_pond = DuckPond(os.environ["CONTAINER_ICEBERG_PATH"], notion_manager=notion_manager)


# Show all tables we have access to
print(duck_pond.get_db_schema())

df = duck_pond.conn.sql("""
    SELECT * FROM Animals LIMIT 10
""").df()


display(df)

## Exporting Data to EDF
When it's easier to work with EDF files, we can export the data to an EDF file. This is useful for working with the data in other software packages.

Calling `export_to_edf(output_dir)` on a `DiveData` object creates one output EDF file for each recording in the `DiveData` relation, saved to `output_dir` with filename `<recording_id>.edf`. 

*Note: it currently requires a lot of memory. Can be improved.*<br/>
*Note: it's lacking support for most info fields in the EDF file.*
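
Because the output filenames follow the `<recording_id>.edf` pattern, the expected paths can be computed ahead of time and checked after an export. This is a minimal sketch: both helpers are hypothetical, and the recording IDs you pass in are whatever IDs your query covers.

```python
from pathlib import Path


def expected_edf_paths(output_dir, recording_ids):
    """Map each recording ID to the `<recording_id>.edf` path described above."""
    out = Path(output_dir)
    return {rid: out / f"{rid}.edf" for rid in recording_ids}


def missing_exports(output_dir, recording_ids):
    """Return the recording IDs whose EDF file is not on disk."""
    paths = expected_edf_paths(output_dir, recording_ids)
    return [rid for rid, path in paths.items() if not path.is_file()]
```

Calling `missing_exports(".tmp/my_output_dir/", ids)` after an export returns an empty list when every recording produced a file.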

##### Example:

In [None]:
import os
import importlib
import DiveDB.services.duck_pond
import DiveDB.services.dive_data
importlib.reload(DiveDB.services.duck_pond)
importlib.reload(DiveDB.services.dive_data)

from DiveDB.services.duck_pond import DuckPond
from DiveDB.services.notion_orm import NotionORMManager

notion_manager = NotionORMManager(
    db_map={"Animal DB": os.environ["ANIMALS_DB_ID"]},
    token=os.environ["NOTION_API_KEY"],
)

duck_pond = DuckPond(os.environ["CONTAINER_ICEBERG_PATH"], notion_manager=notion_manager)

dive_data = duck_pond.get_data(    
    dataset="EP Physiology",
    labels=["sensor_data_temperature", "derived_data_depth"],
    limit=1000000,
)

output_edf_paths = dive_data.export_to_edf(".tmp/my_output_dir/")
display(output_edf_paths)



## Importing exported EDF as an MNE Signal Array
To work with the data in MNE, we can export it to an EDF file and then read that file into MNE.


##### Example:

In [None]:
import mne
import importlib
import os
import DiveDB.services.duck_pond
importlib.reload(DiveDB.services.duck_pond)
from DiveDB.services.duck_pond import DuckPond
from DiveDB.services.notion_orm import NotionORMManager

notion_manager = NotionORMManager(
    db_map={"Animal DB": os.environ["ANIMALS_DB_ID"]},
    token=os.environ["NOTION_API_KEY"],
)

duck_pond = DuckPond(os.environ["CONTAINER_ICEBERG_PATH"], notion_manager=notion_manager)

dive_data = duck_pond.get_data(    
    dataset="EP Physiology",
    labels="ECG_ICA2",
    limit=1000000,
)

output_edf_paths = dive_data.export_to_edf(".tmp/my_output_dir/")
raw = mne.io.read_raw_edf(output_edf_paths[0])
display(raw)