# Export EDF

This notebook demonstrates the process of exporting DiveDB data as an EDF file.

While under development, it also contains the prototype (non-library) code; that'll be deleted when this notebook is ready to be merged into the main branch.

Punch list:
- [x] Make a list
- [x] Understand task :) 
- [ ] Prototype:
    - [x] Load basic metadata
    - [x] Load signals
    - [x] Generate EDF file 
        - [x] Can mne serve our needs here? Check for multiple sample rates and arbitrary metadata: edfio can!
        - [x] Decide if different library OR extend mne: use edfio, which is what mne depends on
    - [ ] Test EDF file can be opened externally (e.g. through EDF.jl or other app)
    - [ ] Add metadata to EDF header
- [ ] In tests, write (failing) test for basic new functionality
- [ ] Turn prototype into library code - test passes!
- [ ] Write up edge case tests
    - [ ] Make 'em pass OR file 'em
- [ ] Clean up this notebook (delete this punch list!)
- [ ] Mark PR ready for review

Reminder: this is the end goal

```python
# Example of usage once complete
import os

from DiveDB.services.duck_pond import DuckPond

duckpond = DuckPond(os.environ["CONTAINER_DELTA_LAKE_PATH"])

dive_data = duckpond.get_delta_data(
    labels=["eeg"],
    animal_ids="apfo-001a",
)

dive_data.export_to_edf("path_to_output.edf")
```

### Prototype

In [None]:
# 1. Get metadata
import os
import importlib
import DiveDB.services.duck_pond as dp
importlib.reload(dp)

duckpond = dp.DuckPond(os.environ["CONTAINER_DELTA_LAKE_PATH"])

# Example from the querying_docs notebook
data = duckpond.get_delta_data(    
    labels=["derived_data_depth"],
    animal_ids="apfo-001a",
    frequency=1/60,  # Once a minute
)
display(data)

# Okay, but is there a way to find out what animal_ids, etc, are available?
# Time to go spelunking!
duckpond.get_db_schema()

# ...okay, cool. :) 

In [None]:
# Let's try a SQL query as well (also ripped from the querying_docs notebook)
labels_df = duckpond.conn.sql("""
    SELECT DISTINCT label
    FROM DataLake
""").df()
display(labels_df)

animals_df = duckpond.conn.sql("""
    SELECT DISTINCT animal
    FROM DataLake
""").df()
display(animals_df)


In [None]:
labels_df["label"][:]

In [None]:
# commenting out b/c otherwise this crashes my kernel (if i do other stuff after it)

# # Once more from the other notebook....
# # Get the filtered data
# resampled_data = duckpond.get_delta_data(    
#     animal_ids="apfo-001a",
#     # Resample values to 100 Hz and make sure each signal has the same time intervals
#     frequency=100,
#     # Aggregation of events (think state events - behaviors) type: state (has state and end dates)
#     classes="sensor_data_accelerometer",
# )
# display(resampled_data)
# # Huh. okay, `frequency` triggering a materialization + resample is interesting, not sure 
# # I would have guessed that from the API! I would have guessed that had to do with 
# # the sampling rate of the recording.

# # Okay, so the output of `get_delta_data` with a set frequency returns the signal as a dataframe.

In [None]:

# Is there a way to get the original sample rate? 
unmaterialized_data = duckpond.get_delta_data(    
    animal_ids="apfo-001a",
    # frequency=None means no resampling, so each signal keeps its original sample times
    frequency=None,
    # Aggregation of events (think state events - behaviors) type: state (has state and end dates)
    classes="sensor_data_accelerometer",
)
display(unmaterialized_data)
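
On the "original sample rate" question: one option is to infer it from the timestamps of the unresampled data. The sketch below is not part of DiveDB; `infer_sampling_frequency` and its `rel_tol` parameter are hypothetical names, and it assumes the timestamps are timezone-naive and can be passed in as a numpy `datetime64` array (e.g. `df["datetime"].to_numpy()`). It refuses to guess if the spacing is irregular.

```python
import numpy as np


def infer_sampling_frequency(timestamps, rel_tol=1e-6):
    """Infer a constant sampling frequency (Hz) from sample timestamps.

    Hypothetical helper, not a DiveDB API. `timestamps` is a 1-D array of
    numpy datetime64 values. Raises ValueError if the spacing is irregular.
    """
    ts = np.asarray(timestamps, dtype="datetime64[ns]")
    if ts.size < 2:
        raise ValueError("need at least two samples to infer a sampling frequency")

    # Seconds between consecutive samples, as floats.
    periods = np.diff(ts) / np.timedelta64(1, "s")
    period = float(np.median(periods))
    if period <= 0:
        raise ValueError("timestamps must be strictly increasing")

    # Every interval has to match the median interval within the tolerance.
    if not np.allclose(periods, period, rtol=rel_tol, atol=0.0):
        raise ValueError("irregular sample spacing; refusing to guess a frequency")
    return 1.0 / period


# Five samples spaced 20 ms apart.
ts = np.datetime64("2024-01-01T00:00:00") + np.arange(5) * np.timedelta64(20, "ms")
print(infer_sampling_frequency(ts))
```

Whether the result also has to be an integer number of Hz depends on how the EDF writer handles data record durations, so that check is left out here.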

In [None]:
# ... okay, got it. now, let's do what needs doing. 
# But also, keep in mind that we should NOT pass a frequency into `get_delta_data`
# before EDF export unless we are very explicit about what we are doing and why. 

# When we don't pass in a frequency (i.e., resample), we get a DuckDBPyRelation
# out of `get_delta_data`
print(type(unmaterialized_data))

# ...from task, I think we want a DuckDBPyConnection instead? Currently unclear to me
# how these interop.

In [None]:
# Okay, now to an EDF! 
# Let's do the demo from edfio (what mne depends on for its EDF support)

from edfio import Edf, EdfSignal, read_edf
import numpy as np

# edfio's example
example_edf = Edf(
    [
        EdfSignal(np.random.randn(30 * 256), sampling_frequency=256, label="EEG Fpz"),
        EdfSignal(np.random.randn(30), sampling_frequency=1, label="Body Temp"),
    ]
)

outpath = ".tmp/example.edf"
os.makedirs(os.path.dirname(outpath), exist_ok=True)  # make sure .tmp/ exists before writing
example_edf.write(outpath)

example_edf_roundtrip = read_edf(outpath)
display(example_edf_roundtrip.signals)
display(example_edf_roundtrip.signals[0].data)


In [None]:
# ...and now with our data!
# Can we make an EDF from our data? 
# intentionally picking signals with different sampling rates

# ...normally we could query these all at the same time, except that we're putting limits
# on here so that we don't have to get ALL values for each signal. Also, in real case, 
# this is where we'd pull all data and then split it up and make one EDF file per animal/deployment/etc. 
# For now? Hard code it, bebe!
signals = []
for label in ["ax", "derived_data_depth"]:
    df = duckpond.get_delta_data(    
        animal_ids="apfo-001a",
        labels=[label],
        limit=1000,
    ).df()

    # TODO-safety: check that there's only one value after the unique, check that 
    # this is an integer value or whatever the EDF spec requires, etc
    # Note: the diff between timestamps is the sampling *period* (seconds per sample)
    sampling_period = df["datetime"].diff()[1:].dt.total_seconds().unique()[0]
    # round() rather than a bare int(): int() truncates, so 49.999... would become 49
    sampling_frequency = int(round(1 / sampling_period))  # TODO-safety: don't just blindly round o_O
    
    label_sanitized = label[:16]  # EDF labels are limited to 16 characters (lol EDF)

    # TODO-correctly! need to figure out max signal length, then start time, then 
    # lpad to the correct start time + rpad to the correct stop time (lol EDF)
    # For now, we're faking it. We happen to know that the maximum duration signal 
    # of these two is 20 s, so let's zero-pad to that:
    signal_data = np.zeros(20 * sampling_frequency)
    signal_data[:len(df)] = df[label].values  # len(df) is the pandas "nrow"

    signal = EdfSignal(signal_data,
                       sampling_frequency=sampling_frequency, 
                       label=label_sanitized)
    signals.append(signal)
    
divedb_edf = Edf(signals)
path = ".tmp/prototype.edf"
divedb_edf.write(path)

divedb_edf_roundtrip = read_edf(path)
display(divedb_edf_roundtrip.signals)
display(divedb_edf_roundtrip.signals[0].data)
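
For the TODO-correctly above, here is a minimal sketch of padding one signal onto a shared recording window instead of the hard-coded 20 s. `pad_to_window` is a hypothetical helper (not DiveDB or edfio API), times are plain seconds relative to any shared reference, and zero is used as the fill value only because the prototype does the same.

```python
import numpy as np


def pad_to_window(samples, signal_start, window_start, window_stop, sampling_frequency):
    """Zero-pad `samples` so the result spans [window_start, window_stop).

    Hypothetical helper. Times are in seconds relative to a shared reference;
    `sampling_frequency` is in Hz.
    """
    samples = np.asarray(samples, dtype=float)
    # Number of samples in the full window, and where this signal starts in it.
    total = int(round((window_stop - window_start) * sampling_frequency))
    offset = int(round((signal_start - window_start) * sampling_frequency))
    if offset < 0 or offset + samples.size > total:
        raise ValueError("signal does not fit inside the requested window")

    padded = np.zeros(total)
    padded[offset:offset + samples.size] = samples  # left pad before, right pad after
    return padded


# A 3-sample, 1 Hz signal starting 2 s into a 10 s window.
print(pad_to_window([1.0, 2.0, 3.0], signal_start=2.0, window_start=0.0,
                    window_stop=10.0, sampling_frequency=1))
```

In the real export, the window would presumably run from the earliest start to the latest stop across all signals for one animal/deployment, and each signal would be padded at its own sampling frequency.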

In [None]:
# Huzzah! Time to clean up :) 
# ...actually false. Time to figure out how to get the metadata into the EDF header!
# Check the edfio API: https://github.com/the-siesta-group/edfio?tab=readme-ov-file#usage 
# and https://edfio.readthedocs.io/en/stable/examples.html 
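
While working out the header metadata, it may help to inspect what actually lands in the file without going through edfio. The sketch below reads the fixed-width ASCII header fields defined by the EDF specification (256 bytes, followed by one 16-byte label per signal). `read_edf_header` and the dictionary keys are hypothetical names chosen for this example; the 80-byte patient and recording identification fields are where header metadata ends up.

```python
# Fixed-width ASCII fields that make up the first 256 bytes of an EDF file.
# The key names are made up for this sketch; the widths come from the EDF spec.
EDF_HEADER_FIELDS = [
    ("version", 8),
    ("patient_id", 80),
    ("recording_id", 80),
    ("startdate", 8),
    ("starttime", 8),
    ("header_bytes", 8),
    ("reserved", 44),
    ("num_data_records", 8),
    ("data_record_duration", 8),
    ("num_signals", 4),
]


def read_edf_header(stream):
    """Read the fixed EDF header and the signal labels from a binary stream.

    Hypothetical helper. Values are returned as stripped strings, exactly as
    stored in the file.
    """
    header = {}
    for name, width in EDF_HEADER_FIELDS:
        header[name] = stream.read(width).decode("ascii").strip()

    # The per-signal section starts with one 16-byte label per signal.
    num_signals = int(header["num_signals"])
    header["labels"] = [
        stream.read(16).decode("ascii").strip() for _ in range(num_signals)
    ]
    return header


# Usage, assuming the prototype file written earlier in this notebook exists:
# with open(".tmp/prototype.edf", "rb") as f:
#     display(read_edf_header(f))
```

This only looks at the header; it does not validate the signal data, so it complements rather than replaces opening the file in an external EDF reader.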