# Export EDF

This notebook demonstrates how to export DiveDB data as an EDF file.

While under development, it also contains the prototype (non-library) code, which will be deleted once this notebook is ready to be merged into the main branch.

Punch list:
- [x] Make a list
- [x] Understand task :) 
- [ ] Prototype:
    - [x] Load basic metadata
    - [x] Load signals
    - [x] Generate EDF file 
        - [x] Can mne serve our needs here? Check whether it supports multiple sample rates and arbitrary metadata: edfio does!
        - [x] Decide whether to use a different library OR extend mne: use edfio, which is what mne depends on
    - [x] Test EDF file can be opened externally (e.g. through EDF.jl or other app)
    - [x] Test EDF encodes max/min values
    - [ ] Add metadata to EDF header
- [ ] In tests, write (failing) test for basic new functionality
- [ ] Turn prototype into library code - test passes!
- [ ] Write up edge case tests
    - [ ] Make 'em pass OR file 'em
- [ ] Clean up this notebook (delete this punch list!)
- [ ] Mark PR ready for review

Reminder: this is the end goal

```python
# Example of usage once complete

import os

from DiveDB.services.duck_pond import DuckPond

duckpond = DuckPond(os.environ["CONTAINER_DELTA_LAKE_PATH"])

dive_data = duckpond.get_delta_data(    
    labels=["eeg"],
    animal_ids="apfo-001a",
)

dive_data.export_to_edf("path_to_output.edf")
```

### Prototype

In [1]:
# 1. Get metadata
import os
import importlib
import DiveDB.services.duck_pond as dp
importlib.reload(dp)

duckpond = dp.DuckPond(os.environ["CONTAINER_DELTA_LAKE_PATH"])

# Example from the querying_docs notebook
data = duckpond.get_delta_data(    
    labels=["derived_data_depth"],
    animal_ids="apfo-001a",
    frequency=1/60,  # Once a minute
)
display(data)

# Okay, but is there a way to find out what animal_ids, etc, are available?
# Time to go spelunking!
duckpond.get_db_schema()

# ...okay, cool. :) 

(┌───────────────────────────┬───────────┬─────────────────────┐
 │         datetime          │  animal   │ derived_data_depth  │
 │ timestamp with time zone  │  varchar  │       double        │
 ├───────────────────────────┼───────────┼─────────────────────┤
 │ 2019-11-07 19:50:45+00    │ apfo-001a │ -2.0053139536656666 │
 │ 2019-11-07 19:50:45.02+00 │ apfo-001a │ -2.0053139536656666 │
 │ 2019-11-07 19:50:45.04+00 │ apfo-001a │ -2.0053139536656666 │
 │ 2019-11-07 19:50:45.06+00 │ apfo-001a │ -2.0053139536656666 │
 │ 2019-11-07 19:50:45.08+00 │ apfo-001a │ -2.0053139536656666 │
 │ 2019-11-07 19:50:45.1+00  │ apfo-001a │ -2.0053139536656666 │
 │ 2019-11-07 19:50:45.12+00 │ apfo-001a │ -2.0053139536656666 │
 │ 2019-11-07 19:50:45.14+00 │ apfo-001a │ -2.0053139536656666 │
 │ 2019-11-07 19:50:45.16+00 │ apfo-001a │ -2.0053139536656666 │
 │ 2019-11-07 19:50:45.18+00 │ apfo-001a │ -2.0053139536656666 │
 │            ·              │     ·     │           ·         │
 │            ·          

┌──────────┬─────────┬────────────────────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬───────────┐
│ database │ schema  │            name            │                                                                                                                                                          column_names                                                                                                                                                           │              

In [2]:
# Let's try a SQL query as well (also ripped from the querying_docs notebook)
labels_df = duckpond.conn.sql("""
    SELECT DISTINCT label
    FROM DataLake
""").df()
# display(labels_df)

animals_df = duckpond.conn.sql("""
    SELECT DISTINCT animal
    FROM DataLake
""").df()
# display(animals_df)

signal_df = duckpond.conn.sql("""
    SELECT DISTINCT class, label
    FROM DataLake
""").df()
display(signal_df)


Unnamed: 0,class,label
0,derived_data_corrected_gyr,gy
1,derived_data_corrected_acc,az
2,logger_data_CC-35,light2
3,logger_data_CC-35,gz
4,sensor_data_magnetometer,mz
...,...,...
56,sensor_data_accelerometer,ax
57,derived_data_calibrated_acc,ax
58,logger_data_CC-35,mz
59,sensor_data_gyroscope,gx


In [3]:
# Note: sort_values returns a new DataFrame, so this line leaves signal_df itself unsorted
signal_df.sort_values(by="class")
print(signal_df['class'].unique())

['derived_data_corrected_gyr' 'derived_data_corrected_acc'
 'logger_data_CC-35' 'sensor_data_magnetometer'
 'derived_data_calibrated_mag' 'derived_data_calibration_acc'
 'sensor_data_gyroscope' 'derived_data_prh' 'derived_data_calibration_mag'
 'derived_data_inclination_angle' 'derived_data_depth' 'sensor_data_light'
 'sensor_data_temperature' 'derived_data_corrected_mag'
 'sensor_data_accelerometer' 'derived_data_calibrated_acc'
 'sensor_data_pressure']


In [4]:
# commenting out b/c otherwise this crashes my kernel (if i do other stuff after it)

# # Once more from the other notebook....
# # Get the filtered data
# resampled_data = duckpond.get_delta_data(    
#     animal_ids="apfo-001a",
#     # Resample values to 100 Hz and make sure each signal has the same time intervals
#     frequency=100,
#     # Aggregation of events (think state events - behaviors) type: state (has state and end dates)
#     classes="sensor_data_accelerometer",
# )
# display(resampled_data)
# # Huh. okay, `frequency` triggering a materialization + resample is interesting, not sure 
# # I would have guessed that from the API! I would have guessed that had to do with 
# # the sampling rate of the recording.

# # Okay, so the output of `get_delta_data` with a set frequency returns the signal as a dataframe.

In [5]:

# Is there a way to get the original sample rate? 
unmaterialized_data = duckpond.get_delta_data(    
    animal_ids="apfo-001a",
    # frequency=None skips resampling, so each signal keeps its original sample rate
    frequency=None,
    # Aggregation of events (think state events - behaviors) type: state (has state and end dates)
    classes="sensor_data_accelerometer",
)
display(unmaterialized_data)

FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))

(┌─────────────────────────────┬───────────┬───────────────┬────────────────┬───────────────┐
 │          datetime           │  animal   │      ax       │       az       │      ay       │
 │  timestamp with time zone   │  varchar  │    double     │     double     │    double     │
 ├─────────────────────────────┼───────────┼───────────────┼────────────────┼───────────────┤
 │ 2019-11-07 19:50:45+00      │ apfo-001a │ -0.0071826049 │ -10.6901104125 │  0.0263362182 │
 │ 2019-11-07 19:50:45.0025+00 │ apfo-001a │  0.0167594116 │  -10.637437976 │  0.0407014282 │
 │ 2019-11-07 19:50:45.005+00  │ apfo-001a │ -0.0167594116 │ -10.6565915893 │  0.0430956298 │
 │ 2019-11-07 19:50:45.0075+00 │ apfo-001a │  0.0239420166 │ -10.7356002441 │  0.0430956298 │
 │ 2019-11-07 19:50:45.01+00   │ apfo-001a │  0.0023942016 │ -10.6158901611 │  0.0478840332 │
 │ 2019-11-07 19:50:45.0125+00 │ apfo-001a │  0.0023942016 │ -10.7236292358 │  0.0622492431 │
 │ 2019-11-07 19:50:45.015+00  │ apfo-001a │  0.0526724365 │
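
The timestamps above are 2.5 ms apart, which suggests the original sample rate can be recovered from the `datetime` column itself. Here is a minimal sketch of that idea using only the standard library. `infer_sampling_frequency` and its `rel_tol` parameter are hypothetical names, not part of DiveDB, and the example timestamps are generated here rather than loaded from the data lake.

```python
from datetime import datetime, timedelta, timezone
from statistics import median


def infer_sampling_frequency(timestamps, rel_tol=1e-3):
    """Estimate a sampling frequency in Hz from ordered timestamps.

    Hypothetical helper (not part of DiveDB). Raises ValueError if the
    spacing between samples is not uniform within `rel_tol`.
    """
    if len(timestamps) < 2:
        raise ValueError("need at least two timestamps")
    # Sample-to-sample intervals (sampling periods) in seconds.
    periods = [
        (later - earlier).total_seconds()
        for earlier, later in zip(timestamps, timestamps[1:])
    ]
    period = median(periods)
    if period <= 0:
        raise ValueError("median sampling period must be positive")
    # Reject gaps or jitter instead of silently trusting the first interval.
    if any(abs(p - period) > rel_tol * period for p in periods):
        raise ValueError("non-uniform sampling interval")
    return 1.0 / period


# Synthetic timestamps spaced 2.5 ms apart, like the accelerometer rows above.
start = datetime(2019, 11, 7, 19, 50, 45, tzinfo=timezone.utc)
stamps = [start + timedelta(microseconds=2500 * i) for i in range(8)]
frequency_hz = infer_sampling_frequency(stamps)
print(round(frequency_hz))  # prints 400
```

Raising on non-uniform spacing is a deliberate choice for this sketch: a gap in the recording would otherwise produce a wrong rate without any warning.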

In [6]:
# ... okay, got it. now, let's do what needs doing. 
# But also, keep in mind that we should NOT pass a frequency into `get_delta_data`
# before EDF export unless we are very explicit about what we are doing and why. 

# When we don't pass in a frequency (i.e., don't resample), I expected a DuckDBPyRelation
# out of `get_delta_data`, but the type printed below is a tuple
print(type(unmaterialized_data))

# ...from the task, I think we want a DuckDBPyConnection instead? Currently unclear to me
# how these interoperate.

<class 'tuple'>


In [7]:
# Okay, now to an EDF! 
# Let's do the demo from edfio (what mne depends on for its EDF support)

from edfio import Edf, EdfSignal, read_edf
import numpy as np
import math

# edfio's example
example_edf = Edf(
    [
        EdfSignal(np.random.randn(30 * 256), sampling_frequency=256, label="EEG Fpz"),
        EdfSignal(np.random.randn(30), sampling_frequency=1, label="Body Temp"),
    ]
)

outpath = ".tmp/example.edf"
example_edf.write(outpath)

example_edf_roundtrip = read_edf(outpath)
display(example_edf_roundtrip.signals)
display(example_edf_roundtrip.signals[0].data)


(<EdfSignal EEG Fpz 256Hz>, <EdfSignal Body Temp 1Hz>)

array([ 0.77000165, -0.20818987,  0.10912007, ...,  1.57440918,
        0.61083287, -0.44157653])

In [10]:
# ...and now with our data!
# Can we make an EDF from our data? 
# intentionally picking signals with different sampling rates
classes = ["sensor_data_accelerometer","sensor_data_pressure", "derived_data_prh"]

# Set up for EDF - first figure out what common max duration is, etc
max_duration_sec = 0
for class_name in classes:
    df = duckpond.get_delta_data(animal_ids="apfo-001a",
                                 classes=[class_name],
                                 ).df()

    # TODO-optimize: surely there is a way to get the number of samples without loading them all??
    # If so, do that!
    # Note: the diff between consecutive timestamps is the sampling *period* in seconds
    sampling_period_sec = df["datetime"].diff()[1:].dt.total_seconds().unique()[0]
    sampling_frequency = round(1 / sampling_period_sec)  # TODO-safety: don't just blindly round o_O

    # TODO-correctly! need to figure out max signal length, then start time, then
    # left-pad to the correct start time + right-pad to the correct stop time (lol EDF)
    # For now, we're faking it. We happen to know that the longest of these
    # signals is 20 s, so let's zero-pad to that:
    sig_max_duration_sec = math.ceil(df.shape[0] / sampling_frequency)
    max_duration_sec = max(max_duration_sec, sig_max_duration_sec)

FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))

AttributeError: 'tuple' object has no attribute 'df'
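
The padding TODO in the cell above (find the common window, then pad each signal out to it) can be sketched as below. This is a rough standard-library sketch, assuming integer sampling frequencies and a window rounded up to whole seconds as in the prototype. `padding_plan` is a hypothetical name, and the start times and sample counts in the example are made up for illustration.

```python
import math
from datetime import datetime, timedelta, timezone


def padding_plan(signals):
    """Compute pad counts so that all signals span the same time window.

    `signals` maps a name to (start_time, n_samples, sampling_frequency_hz).
    Hypothetical helper; returns {name: (left_pad, right_pad)} in samples.
    """
    common_start = min(start for start, _, _ in signals.values())
    common_end = max(
        start + timedelta(seconds=n / fs) for start, n, fs in signals.values()
    )
    # Mirror the prototype: round the common duration up to whole seconds.
    total_sec = math.ceil((common_end - common_start).total_seconds())
    plan = {}
    for name, (start, n, fs) in signals.items():
        # Samples missing before this signal starts.
        left = round((start - common_start).total_seconds() * fs)
        # Samples missing after this signal ends.
        right = total_sec * fs - left - n
        plan[name] = (left, right)
    return plan


# Made-up example: pressure starts 2 s late and stops 1 s early.
t0 = datetime(2019, 11, 7, 19, 50, 45, tzinfo=timezone.utc)
plan = padding_plan({
    "accelerometer": (t0, 8000, 400),
    "pressure": (t0 + timedelta(seconds=2), 850, 50),
})
print(plan)  # prints {'accelerometer': (0, 0), 'pressure': (100, 50)}
```

Each signal's padded length is then `left + n_samples + right`, which equals `total_sec * fs` for every signal.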

In [None]:
# Now actually use the real data to create an edf
signals = []
for class_name in classes:
    df = duckpond.get_delta_data(    
        animal_ids="apfo-001a",
        classes=[class_name],
    ).df()

    # TODO-safety: check that there's only one value after the unique, check that
    # this is an integer value or whatever the EDF spec requires, etc
    sampling_period_sec = df["datetime"].diff()[1:].dt.total_seconds().unique()[0]
    sampling_frequency = round(1 / sampling_period_sec)  # TODO-safety: don't just blindly round; see if floating point values are allowed here too?? o_O

    if class_name.startswith("sensor_data"):
        class_prefix = class_name[12:]
    elif class_name.startswith("derived_data"):
        class_prefix = "**" + class_name[13:]
    else:
        class_prefix = class_name

    # Need to figure out max signal length, then start time, then
    # left-pad to the correct start time + right-pad to the correct stop time (lol EDF)
    # TODO-future: instead of padding w/ 0, use some signal-specific value
    num_channels = df.shape[1] - 1
    i_sample_start_offset = 0 #TODO - make sure this is set to the signal's offset
    for i_channel in range(0, num_channels):
        signal_data = np.zeros(max_duration_sec * sampling_frequency, dtype=np.float64)
        channel_label = (df.columns)[i_channel + 1]

        # TODO-future safety: make sure signal labels are unique across recording
        # TODO-future: pull into own function
        if class_name == channel_label:
            signal_label = class_prefix if len(class_prefix) <= 16 else class_prefix[0:16]  # lol EDF
        else:
            max_prefix_length = 16 - len(channel_label) - 1  # lol EDF 
            # TODO handle case when max_prefix_length is < 0
            signal_prefix = class_prefix if len(class_prefix) <= max_prefix_length else class_prefix[0:max_prefix_length]
            signal_label = signal_prefix + "-" + channel_label

        num_samples = len(df[channel_label].values)
        signal_data[i_sample_start_offset:i_sample_start_offset + num_samples] = df[channel_label].values

        signal = EdfSignal(signal_data,
                           sampling_frequency=sampling_frequency, 
                           label=signal_label)
        # TODO-add header metadata
        signals.append(signal)
    
divedb_edf = Edf(signals)
path = ".tmp/prototype.edf"
divedb_edf.write(path)

divedb_edf_roundtrip = read_edf(path)
display(divedb_edf_roundtrip.signals)
display(divedb_edf_roundtrip.signals[0].data)
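
The label logic in the cell above has a TODO to pull it into its own function. One possible shape for that is sketched below. `make_edf_label` is a hypothetical name, and the behaviour when the channel label leaves no room for a prefix (drop the prefix and truncate the channel label) is an assumption made here, since the prototype leaves that case open.

```python
EDF_LABEL_MAX_LEN = 16  # EDF signal labels are limited to 16 characters


def make_edf_label(class_name, channel_label, max_len=EDF_LABEL_MAX_LEN):
    """Build an EDF signal label from a DiveDB class and channel name.

    Hypothetical helper following the prototype's naming scheme.
    """
    if class_name.startswith("sensor_data_"):
        prefix = class_name[len("sensor_data_"):]
    elif class_name.startswith("derived_data_"):
        # The prototype marks derived signals with a leading "**".
        prefix = "**" + class_name[len("derived_data_"):]
    else:
        prefix = class_name

    # Single-channel classes reuse the class name as the channel label.
    if class_name == channel_label:
        return prefix[:max_len]

    # Assumption: with no room left for "<prefix>-", keep only the channel.
    if len(channel_label) >= max_len - 1:
        return channel_label[:max_len]

    room = max_len - len(channel_label) - 1  # 1 for the "-" separator
    return prefix[:room] + "-" + channel_label


print(make_edf_label("sensor_data_accelerometer", "ax"))  # prints accelerometer-ax
print(make_edf_label("derived_data_depth", "derived_data_depth"))  # prints **depth
```

This does not yet check that labels are unique across the recording, which the prototype also lists as a TODO.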

In [None]:
class_name = "sensor_data_accelerometer"
df = duckpond.get_delta_data(    
    animal_ids="apfo-001a",
    classes=[class_name],
    # limit=1000,
).df()


In [None]:
# Commenting out to avoid an unnecessary error! Uncomment to see the (fully expected) error. :) 
# # Can an edf contain nan or inf? 

# sig = np.random.randn(30 * 256)
# sig[0] = np.nan
# print(sig)
# example_edf = Edf([EdfSignal(sig, sampling_frequency=256, label="EEG Fpz")])

# # nope! that answers that.
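
Since NaN and inf are rejected, any such samples would need handling before a signal is built. Below is a minimal sketch of one option: replace them with a fill value and report how many were replaced. `replace_non_finite` and `fill_value` are hypothetical names, and whether a fill value is the right policy for DiveDB is an open question. With NumPy arrays the same check could be done with `np.isfinite`.

```python
import math


def replace_non_finite(samples, fill_value=0.0):
    """Return (cleaned_samples, n_replaced). Hypothetical helper."""
    cleaned = []
    n_replaced = 0
    for value in samples:
        if math.isfinite(value):
            cleaned.append(value)
        else:
            # NaN, +inf and -inf all land here.
            cleaned.append(fill_value)
            n_replaced += 1
    return cleaned, n_replaced


cleaned, n_replaced = replace_non_finite([0.5, float("nan"), -1.25, float("inf")])
print(cleaned, n_replaced)  # prints [0.5, 0.0, -1.25, 0.0] 2
```

Returning the count keeps the replacement visible, so a caller can log it or refuse to export when too many samples are affected.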

In [None]:
for sig in divedb_edf_roundtrip.signals:
    print(sig)
    print(sig.__dict__)

sig = divedb_edf_roundtrip.signals[0]

In [None]:
sig.__dict__

In [None]:
# Let's practice setting the other fields for the recording 
from edfio import Patient, Recording
import datetime
import json

edf = divedb_edf_roundtrip.copy()

# Okay, looks like these additional fields are VERY strict: they disallow spaces, so they basically
# can't be JSON. According to PA, the canonical thing to do here is use annotations, so we'll do that
# for anything interesting. Single-word fields/responses are allowed, in the kwarg form
additional = ('kwarg1', 'value1', 'kwarg2', 'value2')

edf.recording = Recording(
    # startdate=datetime.date(2002, 2, 2), #TODO
    equipment_code="X", #TODO
)

path = ".tmp/prototype.edf"
edf.write(path)
edf.__dict__
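
For the short single-word values that do fit in the header, a small sanitizer could enforce the no-spaces rule up front. The sketch below assumes the constraint is the EDF+ one (printable ASCII, with spaces inside a subfield replaced by underscores); it has not been checked against edfio's own validation. `to_header_subfield` is a hypothetical name.

```python
import re


def to_header_subfield(value):
    """Make a value safe for an EDF+ header subfield. Hypothetical helper."""
    # Collapse each run of whitespace into a single underscore.
    text = re.sub(r"\s+", "_", str(value).strip())
    if not text:
        raise ValueError("subfield must not be empty")
    # Only printable, non-space ASCII is allowed (assumed EDF+ rule).
    if not all(33 <= ord(ch) <= 126 for ch in text):
        raise ValueError(f"non-ASCII or control character in {text!r}")
    return text


print(to_header_subfield("deployment 2019 11 07"))  # prints deployment_2019_11_07
print(to_header_subfield("apfo-001a"))  # prints apfo-001a
```

Anything richer than a single token, such as JSON metadata, would still go into annotations as noted above.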


In [None]:
divedb_edf.recording.__dict__

In [None]:
# Huzzah! Time to clean up :) 
# ...actually false. Time to figure out how to get the metadata into the EDF header!
# Check the edfio API: https://github.com/the-siesta-group/edfio?tab=readme-ov-file#usage 
# and https://edfio.readthedocs.io/en/stable/examples.html 

In [None]:
# Okay, how does this work now?
import DiveDB.services.duck_pond as dp
importlib.reload(dp)
import DiveDB.services.dive_data as dd
importlib.reload(dd)

duckpond = dp.DuckPond(os.environ["CONTAINER_DELTA_LAKE_PATH"])

results = duckpond.get_delta_data(    
    classes=["derived_data_depth", "sensor_data_accelerometer"],
    animal_ids="apfo-001a",
    limit=100, # 0000
)
print(type(results))
print(results.df())


FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))

<class 'DiveDB.services.dive_data.DiveData'>
                           datetime     animal  derived_data_depth
0         2019-11-07 19:50:45+00:00  apfo-001a           -2.005314
1  2019-11-07 19:50:45.020000+00:00  apfo-001a           -2.005314
2  2019-11-07 19:50:45.040000+00:00  apfo-001a           -2.005314
3  2019-11-07 19:50:45.060000+00:00  apfo-001a           -2.005314
4  2019-11-07 19:50:45.080000+00:00  apfo-001a           -2.005314
..                              ...        ...                 ...
95 2019-11-07 19:50:46.900000+00:00  apfo-001a           -1.983105
96 2019-11-07 19:50:46.920000+00:00  apfo-001a           -1.983105
97 2019-11-07 19:50:46.940000+00:00  apfo-001a           -1.983105
98 2019-11-07 19:50:46.960000+00:00  apfo-001a           -1.983105
99 2019-11-07 19:50:46.980000+00:00  apfo-001a           -1.983105

[100 rows x 3 columns]


In [None]:
display([])