# Data Uploader

This notebook demonstrates the process of uploading data from EDF and NetCDF files to Delta Lake and OpenStack Swift for long-term storage.

It also includes the setup and execution of the data upload process, as well as querying the uploaded data for analysis.

## Starting the servers:
To launch the server, open the Docker Desktop app and run the following command at the root of the project:
```bash
$ make up
```
This command will launch the Django server, Postgres database, and Jupyter server using the environment variables defined in the `.env` file across all containers.

#### Understanding expected file paths:
DiveDB expects the following paths to be set in the `.env` file:
- `CONTAINER_DATA_PATH`
- `LOCAL_DATA_PATH`
- `HOST_DELTA_LAKE_PATH`
- `CONTAINER_DELTA_LAKE_PATH`
- `HOST_FILE_STORAGE_PATH`
- `CONTAINER_FILE_STORAGE_PATH`

These paths are used to mount the Delta Lake and file storage to the containers. The "LOCAL_" and "HOST_" paths can be wherever makes sense for your local machine. The "CONTAINER_" paths are the paths that the containers expect. We recommend you keep the "CONTAINER_" paths as they are in the `.env.example` file.
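
As a quick sanity check, the sketch below reports which of the variables listed above are unset or empty in the current environment. The helper name `find_missing_path_vars` is hypothetical and not part of DiveDB. Which variables are actually visible inside a given container depends on how Docker Compose passes the `.env` file through, so treat a "missing" result as a prompt to check your configuration rather than as a definitive error.

```python
import os

# Variable names taken from the list above.
REQUIRED_PATH_VARS = [
    "CONTAINER_DATA_PATH",
    "LOCAL_DATA_PATH",
    "HOST_DELTA_LAKE_PATH",
    "CONTAINER_DELTA_LAKE_PATH",
    "HOST_FILE_STORAGE_PATH",
    "CONTAINER_FILE_STORAGE_PATH",
]


def find_missing_path_vars(env=None):
    """Return the required variable names that are unset or empty in `env`.

    Hypothetical helper for illustration; defaults to the process environment.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_PATH_VARS if not env.get(name)]


missing = find_missing_path_vars()
if missing:
    print("Missing path variables:", ", ".join(missing))
else:
    print("All expected path variables are set.")
```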

#### When is the server ready?
There are 3 processes that need to be running for the server to be ready:
1. The Django server (`web`)
2. The Postgres database (`metadata_database`)
3. The Jupyter server (`jupyter`)

Jupyter is almost always the last to start up. You'll know it's ready when you see the following logs in the terminal:
```bash
jupyter-1            | [I 2024-08-30 16:12:37.083 ServerApp] Serving notebooks from local directory: /app
jupyter-1            | [I 2024-08-30 16:12:37.083 ServerApp] Jupyter Server 2.14.2 is running at:
jupyter-1            | [I 2024-08-30 16:12:37.083 ServerApp] http://e29d05e13fd0:8888/jupyter/tree
jupyter-1            | [I 2024-08-30 16:12:37.083 ServerApp]     http://127.0.0.1:8888/jupyter/tree
```
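
If you capture the container logs as text, the readiness check described above can be sketched as a simple string match. The function name `jupyter_is_ready` is hypothetical, and the match relies only on the wording of the log lines shown above, which may differ between Jupyter Server versions.

```python
def jupyter_is_ready(log_lines):
    """Return True once any log line reports that the Jupyter server is running.

    Hypothetical helper: it matches the "Jupyter Server ... is running at"
    wording from the sample logs above.
    """
    return any(
        "Jupyter Server" in line and "is running at" in line
        for line in log_lines
    )


# Two of the log lines shown above.
sample_logs = [
    "jupyter-1            | [I 2024-08-30 16:12:37.083 ServerApp] Serving notebooks from local directory: /app",
    "jupyter-1            | [I 2024-08-30 16:12:37.083 ServerApp] Jupyter Server 2.14.2 is running at:",
]
print(jupyter_is_ready(sample_logs))
```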

## Connecting to the Jupyter Kernel:
To connect to the Jupyter server in your notebook, follow these steps:
1. Click the "Select Kernel" button at the top right of the page.
1. Pick the "Select another kernel" option in the dropdown menu.
1. Pick the "Existing Jupyter Server" option in the dropdown menu.
1. Now we need to connect to the Jupyter server.
    - If you previously connected to the Jupyter server
        - Pick the "localhost" option in the dropdown menu (or whatever you named it prior)
    - If you have not connected to the Jupyter server before
        - Pick the "Enter the URL of the running Jupyter server" option in the dropdown menu.
        - Enter http://localhost:8888/jupyter
        - Give it a name you'll remember (like "Local DiveDB Jupyter Server")
1. Press the "Reload" icon in the top right of the dropdown menu to see the latest kernel.
1. Pick the "Python 3" option in the dropdown menu.

This will ensure you execute the Jupyter notebook in the correct environment.

After connecting to the Jupyter server, ensure your notebook runs this command to set the appropriate file paths:

## Preparing to upload data:
There are two aspects to any data upload:
1. A table containing measurements and time
    - This can be an EDF, CSV, or a data
2. The metadata for the measurements
    - This describes the context of the measurements using the following fields:
        - animal
        - deployment
        - logger
        - recording

There are several ways to define your metadata. 

#### Option 1: Supplied Metadata CSV File
If you have a metadata file, you can use the `metadata_map` dictionary to map the columns in the CSV file to the corresponding model. You can then pass the metadata file to the `upload_edf` function, which will interactively extract the metadata for your measurements using the data in the metadata file and in the Metadata Database.

#### Option 2: Interactive Selection (in the works)
If you provide no metadata, you can use the `upload_edf` function to interactively select the metadata for your measurements using the data in the Metadata Database.

#### Option 3: Supplied Metadata Dictionary
If you know the metadata for your measurements, you can pass a dictionary to the `upload_edf` function. The dictionary should represent metadata that already exists in the Metadata Database and contain the following fields:
- animal: The animal ID
- deployment: The deployment name
- logger: The logger ID
- recording: The recording name
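
Before passing a dictionary to the uploader, it can help to confirm that the four fields listed above are present. The sketch below is a hypothetical pre-check, not part of DiveDB, and the values are placeholders. It only inspects the dictionary itself and does not verify that the records exist in the Metadata Database.

```python
# Field names taken from the list above.
EXPECTED_FIELDS = ("animal", "deployment", "logger", "recording")


def missing_metadata_fields(metadata):
    """Return the expected field names that are absent or empty in `metadata`."""
    return [field for field in EXPECTED_FIELDS if not metadata.get(field)]


# Placeholder values for illustration only.
metadata = {
    "animal": "example-animal",
    "deployment": "example-deployment",
    "recording": "example-recording",
}

missing = missing_metadata_fields(metadata)
if missing:
    print("Metadata is missing:", ", ".join(missing))
```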


### Uploading EDFs
The `edf_file_paths` list contains the paths to the EDF files that we want to upload. It can point to files on your local machine or on a remote server.
In this example, the files are located in the `./data/files/` directory and are named `test12_Wednesday_05_DAY1_PROCESSED.edf` and `test12_Wednesday_05_DAY2_PROCESSED.edf`.

The `metadata_file_path` variable holds the path to the CSV file containing metadata for the EDF files (named `Sleep Study Metadata.csv`).

The `metadata_map` dictionary is used to map the columns in the CSV metadata file to the corresponding model. The keys in the dictionary represent the fields in the database, and the values represent the column names in the CSV file. For example:
- "animal" maps to the "Nickname" column in the CSV file.
- "deployment" maps to the "Deployment" column in the CSV file.
- "logger" maps to the "Logger Used" column in the CSV file.
- "recording" maps to the "Recording ID" column in the CSV file.
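
To make the direction of the mapping concrete, the sketch below applies a `metadata_map` to one CSV row using only the standard library. It is an illustration of the key/value convention, not DiveDB's implementation. The helper `extract_metadata` and the CSV values are hypothetical.

```python
import csv
import io

# Database field name -> CSV column name, as described above.
metadata_map = {
    "animal": "Nickname",
    "deployment": "Deployment",
    "logger": "Logger Used",
    "recording": "Recording ID",
}

# Hypothetical CSV content; the values are placeholders, not real study data.
sample_csv = io.StringIO(
    "Nickname,Deployment,Logger Used,Recording ID\n"
    "example-animal,example-deployment,example-logger,example-recording\n"
)


def extract_metadata(row, mapping):
    """Translate one CSV row into a dict keyed by database field name."""
    return {field: row[column] for field, column in mapping.items()}


rows = [extract_metadata(row, metadata_map) for row in csv.DictReader(sample_csv)]
print(rows[0])
```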

The `upload_edf` function will perform the following:
- use the metadata map to extract the metadata for your measurements using the data in the metadata file and in the Metadata Database
- upload copies of the EDF files to OpenStack Swift
- upload the measurements to Delta Lake (by default, 10M measurements per batch)

The process takes between 5 and 10 minutes to complete per gigabyte. (*note: we can speed this up by parallelizing the upload process*)
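
The batching mentioned above (10M measurements per batch by default) can be pictured with a generic chunking generator. This is a minimal sketch of the general technique under the assumption that measurements arrive as an iterable. The name `iter_batches` is hypothetical and the code does not reflect how DiveDB writes to Delta Lake internally.

```python
from itertools import islice


def iter_batches(measurements, batch_size):
    """Yield successive lists of at most `batch_size` items from an iterable."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(measurements)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Small numbers for illustration; the uploader's default batch is far larger.
for batch in iter_batches(range(10), batch_size=4):
    print(len(batch))
```

Each batch is materialized only when requested, so memory use is bounded by the batch size rather than by the size of the whole file.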


In [None]:
import os

os.environ["DJANGO_ALLOW_ASYNC_UNSAFE"] = "true"

import importlib
import DiveDB.services.data_uploader
importlib.reload(DiveDB.services.data_uploader)
from DiveDB.services.data_uploader import DataUploader


data_uploader = DataUploader()


edf_file_paths = [
    "./data/files/test12_Wednesday_05_DAY1_PROCESSED.edf",
    "./data/files/test12_Wednesday_05_DAY2_PROCESSED.edf"
]

metadata_file_path = "./data/files/Sleep Study Metadata.csv"

metadata_map = {
    "animal": "Nickname",
    "deployment": "Deployment",
    "logger": "Logger Used",
    "recording": "Recording ID"
}

data_uploader.upload_edf(edf_file_paths, metadata_file_path, metadata_map)

### Uploading NetCDF files
The first argument to `upload_netcdf` is the path to the NetCDF file that we want to upload. It can point to a file on your local machine or on a remote server.
In this example, the file is `data/deployment_data.nc`.

The `upload_netcdf` function will perform the following:
- use the provided metadata dictionary to extract the metadata for your measurements
- upload copies of the NetCDF files to OpenStack Swift
- upload the measurements to Delta Lake

The process takes between 30 seconds and 1 minute to complete per gigabyte; about two-thirds of that time is spent uploading the files to OpenStack Swift. (*note: we can speed this up by parallelizing the upload process*)

In [14]:
import os
import importlib

os.environ["DJANGO_ALLOW_ASYNC_UNSAFE"] = "true"
os.environ["SKIP_OPENSTACK_UPLOAD"] = "true"

import DiveDB.services.data_uploader
importlib.reload(DiveDB.services.data_uploader)
from DiveDB.services.data_uploader import DataUploader

data_uploader = DataUploader()

metadata = {
    'animal': 'oror-002', 
    'deployment': '2024-01-16_oror-002a', 
    'recording': '2024-01-16_oror-002a_UF-01_001'
}

data_uploader.upload_netcdf("data/deployment_data.nc", metadata)

Creating file record for deployment_data.nc and uploading to OpenStack...
Skipping OpenStack upload...
Processing 10 variables in the netCDF file.


Processing variables:   0%|          | 0/10 [00:00<?, ?it/s]


IndexError: index 1 is out of bounds for axis 1 with size 1

### Converting NetCDF files to expected format

The `upload_netcdf` method expects NetCDF files to match a specific format. This section demonstrates the process of converting NetCDF files from the format employed for Elephant Seal data in Costa et al. (publication pending) to the format expected by the DiveDB uploader.

In [2]:
import os
import importlib

os.environ["DJANGO_ALLOW_ASYNC_UNSAFE"] = "true"
os.environ["SKIP_OPENSTACK_UPLOAD"] = "true"

# Reload the modules so local code changes are picked up without restarting the kernel
import DiveDB.services.data_uploader
importlib.reload(DiveDB.services.data_uploader)
from DiveDB.services.data_uploader import DataUploader
import DiveDB.services.utils.netcdf_conversions
importlib.reload(DiveDB.services.utils.netcdf_conversions)
from DiveDB.services.utils.netcdf_conversions import convert_to_formatted_dataset

converted_file_path = "./data/processed_2004001_TrackTDR_Processed.nc.nc"
converts_ds = convert_to_formatted_dataset("./data/2004001_TrackTDR_Processed.nc", output_file_path=converted_file_path)

data_uploader = DataUploader()

metadata = {
    'animal': 'oror-002', 
    'deployment': '2024-01-16_oror-002a', 
    'recording': '2024-01-16_oror-002a_UF-01_001'
}

data_uploader.upload_netcdf(converted_file_path, metadata)

No valid variables to convert for group RAW_ARGOS
No valid variables to convert for group RAW_GPS
No valid variables to convert for group RAW_TDR2
No valid variables to convert for group RAW_TDR3
All values are NaN or empty string for variable SEMI_MAJ_AXIS in group CURATED_LOCATIONS
All values are NaN or empty string for variable SEMI_MIN_AXIS in group CURATED_LOCATIONS
All values are NaN or empty string for variable ELLIPSE_ORIENTATION in group CURATED_LOCATIONS
No valid variables to convert for group CLEAN_ZOC_TDR2
No valid variables to convert for group CLEAN_ZOC_TDR3
Creating file record for processed_2004001_TrackTDR_Processed.nc.nc and uploading to OpenStack...
Skipping OpenStack upload...
Processing 8 variables in the netCDF file.


  & (numeric_values == numeric_values.astype(int)),
  numeric_values.astype(int),
Processing variables: 100%|██████████| 8/8 [00:12<00:00,  1.57s/it]

Upload complete.





In [19]:
import xarray as xr
import importlib
import DiveDB.services.utils.netcdf_conversions
importlib.reload(DiveDB.services.utils.netcdf_conversions)
from DiveDB.services.utils.netcdf_conversions import convert_to_formatted_dataset

raw_file_path = "./data/TrackTDR RawCurated.nc"
raw_ds = convert_to_formatted_dataset(raw_file_path)

processed_file_path = "./data/2004001_TrackTDR_Processed.nc"
converted_processed_ds = convert_to_formatted_dataset(processed_file_path)

owned_file_path = "./data/deployment_data.nc"
owned_ds = xr.open_dataset(owned_file_path)

print("Raw Data")
display(raw_ds)
# Print all variables in the raw dataset for the first 5 coordinate values
for var in raw_ds.variables:
    print(f"Variable: {var}")
    print(raw_ds[var].values[:5])
    print("\n")
print("Processed Data")
display(converted_processed_ds)
for var in converted_processed_ds.variables:
    print(f"Variable: {var}")
    print(converted_processed_ds[var].values[:5])
    print("\n")
print("Owned Data")
display(owned_ds)
for var in owned_ds.variables:
    print(f"Variable: {var}")
    print(owned_ds[var].values[:5])
    print("\n")



No valid variables to convert for group RAW_ARGOS
No valid variables to convert for group RAW_GPS
No valid variables to convert for group RAW_TDR2
No valid variables to convert for group RAW_TDR3
All values are NaN or empty string for variable SEMI_MAJ_AXIS in group CURATED_LOCATIONS
All values are NaN or empty string for variable SEMI_MIN_AXIS in group CURATED_LOCATIONS
All values are NaN or empty string for variable ELLIPSE_ORIENTATION in group CURATED_LOCATIONS
No valid variables to convert for group CLEAN_ZOC_TDR2
No valid variables to convert for group CLEAN_ZOC_TDR3
No valid variables to convert for group TDR2
No valid variables to convert for group TDR2_8S
No valid variables to convert for group TDR3
No valid variables to convert for group TDR3_8S
Raw Data


Variable: RAW_TDR1_samples
['2004-02-16T23:52:00.000000000' '2004-02-16T23:52:04.000000000'
 '2004-02-16T23:52:08.000000000' '2004-02-16T23:52:12.000000000'
 '2004-02-16T23:52:16.000000000']


Variable: DEPTH
[-40. -40. -40. -40. -40.]


Variable: EXTERNAL_TEMP
[9.96920997e+36 9.96920997e+36 9.96920997e+36 9.96920997e+36
 9.96920997e+36]


Variable: LIGHT
[9.96920997e+36 9.96920997e+36 9.96920997e+36 9.96920997e+36
 9.96920997e+36]


Variable: CURATED_LOCATIONS_samples
['2004-02-23T09:45:00.000000000' '2004-02-23T09:45:00.000000000'
 '2004-02-23T20:01:16.000000000' '2004-02-24T00:38:32.000000000'
 '2004-02-24T05:14:13.000000000']


Variable: LAT
[37.116396 37.116396 37.251    37.532    37.463   ]


Variable: LON
[-122.330756 -122.330756 -122.638    -122.749    -123.2     ]


Variable: LOC_CLASS
['G' 'G' 'A' 'B' 'B']


Variable: CLEAN_ZOC_TDR1_samples
['2004-02-24T09:45:00.000000000' '2004-02-24T09:45:03.000000000'
 '2004-02-24T09:45:08.000000000' '2004-02-24T09:45:12.000000000'
 '2004-

Variable: TDR1_samples
['2004-02-23T10:07:19.000000000' '2004-02-23T10:18:47.000000000'
 '2004-02-23T10:29:28.000000000' '2004-02-23T10:43:07.000000000'
 '2004-02-23T11:00:16.000000000']


Variable: MAXDEPTH
[35.  48.5 55.  61.  65.5]


Variable: DURATION
[604. 564. 708. 912. 932.]


Variable: DESC_TIME
[ 32.  52.  60. 108.  96.]


Variable: BOTT_TIME
[532. 452. 576. 720. 740.]


Variable: ASC_TIME
[40. 60. 72. 84. 96.]


Variable: DESC_RATE
[0.766 0.654 0.692 0.514 0.448]


Variable: ASC_RATE
[0.812 0.5   0.59  0.631 0.578]


Variable: PDI
[ 84.  76. 112. 116. 120.]


Variable: WIGGLES_DESC
[1. 1. 1. 1. 1.]


Variable: WIGGLES_BOTT
[43. 27. 45. 53. 53.]


Variable: WIGGLES_ASC
[1. 1. 1. 1. 1.]


Variable: TOT_VERT_DIST_BOTT
[72.  63.  85.  61.5 75.5]


Variable: BOTT_RANGE
[10.5 18.5 13.5  8.  22.5]


Variable: EFFICIENCY
[9.96920997e+36 9.96920997e+36 9.96920997e+36 9.96920997e+36
 9.96920997e+36]


Variable: IDZ
[0. 0. 0. 0. 1.]


Variable: SOLAR_EL
[-54.07211781 -52.33159011 -50.62

Variable: temperature_samples
['2024-01-16T17:13:56.000000000' '2024-01-16T17:13:57.000000000'
 '2024-01-16T17:13:58.000000000' '2024-01-16T17:13:59.000000000'
 '2024-01-16T17:14:00.000000000']


Variable: sensor_data_temperature
[[19.94]
 [19.96]
 [19.99]
 [19.98]
 [19.99]]


Variable: gyroscope_samples
['2024-01-16T17:13:56.000000000' '2024-01-16T17:13:56.020000000'
 '2024-01-16T17:13:56.040000000' '2024-01-16T17:13:56.060000000'
 '2024-01-16T17:13:56.080000000']


Variable: sensor_data_gyroscope
[[ 15.43126473 -22.34872823  -6.9174635 ]
 [ 18.09182761 -21.28450307  -6.38535092]
 [ 18.09182761 -20.22027792  -8.51380123]
 [ 18.62394019 -23.41295338  -5.85323835]
 [ 16.49548988 -21.81661565  -8.51380123]]


Variable: depth_samples
['2024-01-16T17:13:56.000000000' '2024-01-16T17:13:56.020000000'
 '2024-01-16T17:13:56.040000000' '2024-01-16T17:13:56.060000000'
 '2024-01-16T17:13:56.080000000']


Variable: sensor_data_depth
[[-1.83603688]
 [-1.80491761]
 [-1.80491761]
 [-1.83603688]
 [-1.

In [7]:
converts_ds

In [None]:
import xarray as xr
import netCDF4 as nc

input_file_path = "./data/TrackTDR RawCurated.nc"
with nc.Dataset(input_file_path, "r") as rootgrp:
    for group in rootgrp.groups:
        print(group)
        with xr.open_dataset(input_file_path, group=group) as ds:
            display(ds)
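
The raw output earlier in this section shows the value 9.96920997e+36 for variables such as `EXTERNAL_TEMP`, `LIGHT`, and `EFFICIENCY`. That number matches netCDF's default fill value for 32-bit floats, so those entries most likely mark missing data rather than real measurements. The sketch below shows one way to drop them from a plain list of values. The helper `drop_fill_values` and the sample readings are hypothetical. With NumPy arrays, the same comparison could be vectorized.

```python
import math

# netCDF's default fill value for 32-bit floats (NC_FILL_FLOAT).
NC_FILL_FLOAT = 9.969209968386869e36


def drop_fill_values(values, fill_value=NC_FILL_FLOAT, rel_tol=1e-6):
    """Return the values that are not (approximately) equal to the fill value."""
    return [
        value
        for value in values
        if not math.isclose(value, fill_value, rel_tol=rel_tol)
    ]


# Hypothetical readings mixed with fill values as they appear in printed output.
light = [9.96920997e36, 12.5, 9.96920997e36, 14.0]
print(drop_fill_values(light))
```

A relative tolerance is used because the printed value is a rounded form of the stored 32-bit float, so an exact equality test could miss it.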

In [17]:
import numpy as np
import netCDF4 as nc
import xarray as xr

with xr.open_dataset("./data/deployment_data.nc") as ds:
    display(ds)
    display(ds["event_data"].values[:100])
    # for var in ds.variables:
    #     print(var)
    #     print(ds[var].values[:5])

    
    # fill_value = 9.96920997e+36
    # tolerance = 1e+30  # Adjust the tolerance as needed
    # eor_values = ds["LIGHT"].values
    # filtered_values = eor_values[np.abs(eor_values - fill_value) > tolerance]
    # print(filtered_values)

array([['point', 'heartbeat_manual_ok', 'heartbeat detection', 'nan'],
       ['point', 'heartbeat_manual_reject', 'heartbeat detection', 'nan'],
       ['point', 'exhalation_breath', 'exhalation followed by breath',
        'start exhale  Breath; snapshots for first breath taken later so they are out of sequence for file name time (time of snapshot not of event on video)'],
       ['point', 'heartbeat_manual_reject', 'heartbeat detection', 'nan'],
       ['point', 'heartbeat_manual_ok', 'heartbeat detection', 'nan'],
       ['point', 'heartbeat_manual_ok', 'heartbeat detection', 'nan'],
       ['point', 'heartbeat_manual_ok', 'heartbeat detection', 'nan'],
       ['point', 'heartbeat_manual_ok', 'heartbeat detection', 'nan'],
       ['point', 'heartbeat_manual_ok', 'heartbeat detection', 'nan'],
       ['point', 'heartbeat_manual_ok', 'heartbeat detection', 'nan'],
       ['point', 'heartbeat_manual_ok', 'heartbeat detection', 'nan'],
       ['point', 'heartbeat_manual_ok', 'heartbeat