# Data Querying and Exporting

This notebook demonstrates the process of querying data from Delta Lake and exporting it in various formats.

## Starting the servers:
To launch the servers, open the Docker Desktop app and run the following command from the root of the project:
```bash
$ make up
```
This command launches the Django server, Postgres database, and Jupyter server, using the environment variables defined in the `.env` file across all containers.

#### Understanding expected file paths:
DiveDB expects the following paths to be set in the `.env` file:
- `CONTAINER_DATA_PATH`
- `LOCAL_DATA_PATH`
- `HOST_DELTA_LAKE_PATH`
- `CONTAINER_DELTA_LAKE_PATH`

These paths are used to mount the Delta Lake and file storage into the containers. The `LOCAL_` and `HOST_` paths can point anywhere that makes sense on your local machine. The `CONTAINER_` paths are the locations the containers expect. We recommend keeping the `CONTAINER_` paths as they are in the `.env.example` file.
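
As a quick sanity check, the four variables listed above can be verified from Python before starting any work. This is only a minimal sketch: the helper name `missing_path_vars` is hypothetical (it is not part of DiveDB), and it assumes the `.env` values have already been loaded into the process environment.

```python
import os

# Variable names taken from the list above.
REQUIRED_PATH_VARS = (
    "CONTAINER_DATA_PATH",
    "LOCAL_DATA_PATH",
    "HOST_DELTA_LAKE_PATH",
    "CONTAINER_DELTA_LAKE_PATH",
)


def missing_path_vars(env=None):
    """Return the required path variables that are unset or empty.

    Hypothetical helper for illustration; pass a dict to check values
    other than the current process environment.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_PATH_VARS if not env.get(name)]
```

Calling `missing_path_vars()` returns an empty list when all four paths are set, and otherwise the names that still need a value in `.env`.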

#### When is the server ready?
Three processes need to be running before the server is ready:
1. The Django server (`web`)
2. The Postgres database (`metadata_database`)
3. The Jupyter server (`jupyter`)

Jupyter is almost always the last to start up. You'll know it's ready when you see the following logs in the terminal:
```bash
jupyter-1            | [I 2024-08-30 16:12:37.083 ServerApp] Serving notebooks from local directory: /app
jupyter-1            | [I 2024-08-30 16:12:37.083 ServerApp] Jupyter Server 2.14.2 is running at:
jupyter-1            | [I 2024-08-30 16:12:37.083 ServerApp] http://e29d05e13fd0:8888/jupyter/tree
jupyter-1            | [I 2024-08-30 16:12:37.083 ServerApp]     http://127.0.0.1:8888/jupyter/tree
```
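
Instead of watching the logs, readiness could also be checked by polling the Jupyter URL until it answers. The following is a rough sketch using only the standard library. The function name `wait_for_server` and its default timings are assumptions for illustration, not something DiveDB provides.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url, timeout_s=120.0, interval_s=2.0):
    """Return True once `url` gives any HTTP response, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(url, timeout=5):
                return True
        except urllib.error.HTTPError:
            # The server answered with an error status (for example 403),
            # which still means the process is up and listening.
            return True
        except (urllib.error.URLError, OSError):
            # Connection refused or reset: the container is still starting.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

For example, `wait_for_server("http://localhost:8888/jupyter")` would block until the Jupyter server responds or roughly two minutes pass.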

## Connecting to the Jupyter Kernel:
To connect to the Jupyter server in your notebook, follow these steps:
1. Click the "Select Kernel" button at the top right of the page.
1. Pick the "Select another kernel" option in the dropdown menu.
1. Pick the "Existing Jupyter Server" option in the dropdown menu.
1. Connect to the Jupyter server:
    - If you have previously connected to the Jupyter server:
        - Pick the "localhost" option in the dropdown menu (or whatever name you gave it earlier).
    - If you have not connected to the Jupyter server before:
        - Pick the "Enter the URL of the running Jupyter server" option in the dropdown menu.
        - Enter `http://localhost:8888/jupyter`.
        - Give it a name you'll remember (like "Local DiveDB Jupyter Server").
1. Press the "Reload" icon in the top right of the dropdown menu to see the latest kernel.
1. Pick the "Python 3" option in the dropdown menu.

This will ensure you execute the Jupyter notebook in the correct environment.

## Querying from Delta Lake
We connect to our datastores using the `DuckPond` class. DuckPond is a wrapper around a DuckDB connection with access to both our Metadata Database and our measurements stored in Delta Lake. The ability to query both sources of data from a single connection is useful for quickly accessing data for analysis.

There are two main ways to query data from Delta Lake:
1. Using the DuckPond `get_delta_data` method
2. Using the DuckPond connection to query directly

### Using the DuckPond `get_delta_data` method
DuckPond's `get_delta_data` method builds a query from the parameters you pass to it and returns a DuckDB result object, which makes it a quick way to pull data for analysis. It takes the following optional parameters:
- `labels`: A string or list of data labels to query.
- `logger_ids`: A string or list of logger IDs to query.
- `animal_ids`: A string or list of animal IDs to query.
- `deployment_ids`: A string or list of deployment IDs to query.
- `recording_ids`: A string or list of recording IDs to query.
- `date_range`: A tuple of start and end dates to query.
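- `limit`: The maximum number of rows to return.
- `frequency`: The frequency (in Hz) to resample the data to, as used in the examples later in this notebook.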

The `get_delta_data` method returns a [DuckDB DuckDBPyConnection](https://duckdb.org/docs/api/python/reference/#duckdb.DuckDBPyConnection), which can convert the data into many different formats, including the following ([see the documentation for a full list](https://duckdb.org/docs/api/python/conversion#result-conversion-duckdb-results-to-python)):
- NumPy Array (`.fetchnumpy()`)
- Pandas DataFrame (`.df()`)
- Arrow Table (`.arrow()`)
- Polars DataFrame (`.pl()`)

Until a conversion method is called, the data is not loaded into memory. This allows for large queries to be run without using too much memory.

##### Example:

In [5]:
import os
import importlib
import DiveDB.services.duck_pond
import DiveDB.services.utils.edf
importlib.reload(DiveDB.services.duck_pond)
importlib.reload(DiveDB.services.utils.edf)

from DiveDB.services.duck_pond import DuckPond
from DiveDB.services.utils.edf import create_mne_array, create_mne_edf

duckpond = DuckPond(os.environ["CONTAINER_DELTA_LAKE_PATH"])

# Count every row in the Delta Lake; .df() converts the result to a Pandas DataFrame
row_count = duckpond.conn.sql("SELECT count(*) FROM DataLake").df()

display(row_count)

## Querying for shared frequency from Delta Lake
Delta Lake can store multiple signal names at a single frequency. If you query a single signal name, the data will be returned as a list of values for each timestamp. If you query multiple signal names, the data will be returned as a list of lists of values for each timestamp.

The data will be returned as a Pandas DataFrame with a DatetimeIndex.

##### Example:

In [None]:
import os
import importlib
import DiveDB.services.duck_pond
import DiveDB.services.utils.edf
importlib.reload(DiveDB.services.duck_pond)
importlib.reload(DiveDB.services.utils.edf)

from DiveDB.services.duck_pond import DuckPond
from DiveDB.services.utils.edf import create_mne_array, create_mne_edf

duckpond = DuckPond(os.environ["CONTAINER_DELTA_LAKE_PATH"])

duckpond.get_delta_data(    
    labels=["corr_depth"],
    animal_ids="oror-002",
    frequency=1/60,  # Once a minute
)


### Using the DuckPond connection to query directly
More complex queries can be run directly on the DuckPond connection. This is useful for queries that the `get_delta_data` method may not support, such as those involving grouping or aggregations.

DuckDB's SQL syntax is very similar to that of other SQL databases. A full breakdown of the syntax can be found [in the documentation](https://duckdb.org/docs/sql/introduction).

The connection object can be found in the `duckpond.conn` attribute. To run queries, use the `sql` method, which returns a [DuckDB DuckDBPyRelation](https://duckdb.org/docs/api/python/reference/#duckdb.DuckDBPyRelation) that can convert the data into many different formats, including the following ([see the documentation for a full list](https://duckdb.org/docs/api/python/conversion#result-conversion-duckdb-results-to-python)):
- NumPy Array (`.fetchnumpy()`)
- Pandas DataFrame (`.df()`)
- Arrow Table (`.arrow()`)
- Polars DataFrame (`.pl()`)

##### Example:

In [None]:
import importlib
import os
# Reload the DuckPond module to pick up any changes
import DiveDB.services.duck_pond
importlib.reload(DiveDB.services.duck_pond)
from DiveDB.services.duck_pond import DuckPond

duckpond = DuckPond(os.environ["CONTAINER_DELTA_LAKE_PATH"])

df = duckpond.conn.sql("""
    SELECT label, avg(value) as mean_data
    FROM (
        SELECT label, value.int as value
        FROM DataLake
        WHERE label = 'sensor_data_ecg'  -- Update to match labels in your data
        OR label = 'sensor_data_light'
    )
    GROUP BY label
""").df()

display(df)


## Chaining Queries
Queries can be chained together to form a pipeline. This is useful for running complex queries that involve multiple steps.

##### Example:

In [None]:
import importlib
import os
# Reload the DuckPond module to pick up any changes
import DiveDB.services.duck_pond
importlib.reload(DiveDB.services.duck_pond)
from DiveDB.services.duck_pond import DuckPond

duckpond = DuckPond(os.environ["CONTAINER_DELTA_LAKE_PATH"])

# Get the filtered data
filtered_data = duckpond.get_delta_data(
    animal_ids="oror-002",
    # Resample values to 10 Hz so that every signal shares the same time intervals
    frequency=10,
    # Aggregation of events (think state events, i.e. behaviors); type: state (has start and end dates)
    classes="sensor_data_accelerometer",
)

display(filtered_data)


## Query Variables
Sometimes we don't want to hardcode variables in our queries. We can use the `execute` method to pass variables to the query.

##### Example:

In [None]:
import importlib
import os
# Reload the DuckPond module to pick up any changes
import DiveDB.services.duck_pond
importlib.reload(DiveDB.services.duck_pond)
from DiveDB.services.duck_pond import DuckPond

duckpond = DuckPond(os.environ["CONTAINER_DELTA_LAKE_PATH"])

label = "sensor_data_temperature"
df = duckpond.conn.execute("""
SELECT label, avg(value) as mean_data
FROM (
    SELECT label, value.float as value
    FROM DataLake
    WHERE label = $1
)
GROUP BY label
""", [label]).df()
display(df)

## Query Metadata Database
We can also query the Metadata Database directly. This is useful for retrieving data that is not stored in Delta Lake and for joining it with measurement data in a single query.

##### Example:

In [None]:
import importlib
import os
# Reload the DuckPond module to pick up any changes
import DiveDB.services.duck_pond
importlib.reload(DiveDB.services.duck_pond)
from DiveDB.services.duck_pond import DuckPond

duckpond = DuckPond(os.environ["CONTAINER_DELTA_LAKE_PATH"])


# Show all tables we have access to
print(duckpond.get_db_schema())

df = duckpond.conn.sql("""
SELECT value.float as value
FROM DataLake 
JOIN Metadata.public.Animals ON DataLake.animal = Animals.id
WHERE Animals.project_id = 'test12_Wednesday'
AND label = 'sensor_data_temperature'
""").df()


display(df)

## Exporting Data to EDF
When it's easier to work with EDF files, we can export the data to an EDF file. This is useful for working with the data in other software packages.

The `create_mne_edf` function takes a DuckDB connection and a file path and creates an EDF file. 

*Note: it currently requires a lot of memory. This can be improved.*

*Note: it lacks support for most info fields in the EDF file. This can be improved.*

##### Example:

In [None]:
import os
import importlib
import DiveDB.services.duck_pond
import DiveDB.services.utils.edf
importlib.reload(DiveDB.services.duck_pond)
importlib.reload(DiveDB.services.utils.edf)

from DiveDB.services.duck_pond import DuckPond
from DiveDB.services.utils.edf import create_mne_edf

duckpond = DuckPond(os.environ["CONTAINER_DELTA_LAKE_PATH"])

conn = duckpond.get_delta_data(    
    animal_ids="mian-003",
    labels=["ECG_ICA2", "EEG_ICA5"],
    limit=1000000,
)

create_mne_edf(conn, "test.edf")

## Exporting Data to MNE Signal Array
To analyze or manipulate the data in MNE, we can export it to an MNE signal array.

The `create_mne_array` function takes a DuckDB connection and returns an MNE RawArray.

##### Example:

In [None]:
import importlib
import os
import DiveDB.services.duck_pond
importlib.reload(DiveDB.services.duck_pond)
from DiveDB.services.duck_pond import DuckPond
from DiveDB.services.utils.edf import create_mne_array

duckpond = DuckPond(os.environ["CONTAINER_DELTA_LAKE_PATH"])

conn = duckpond.get_delta_data(    
    animal_ids="mian-003",
    labels="ECG_ICA2",
    limit=1000000,
)

raw = create_mne_array(conn, resample=100, l_freq=1, h_freq=20)
display(raw)

In [None]:
import xarray as xr

with xr.open_dataset("./data/deployment_data.nc") as ds:
    display(ds)