# Data Uploader

This notebook demonstrates how to upload EDF file data to Delta Lake and OpenStack Swift for long-term storage.

It also includes the setup and execution of the data upload process, as well as querying the uploaded data for analysis.

## Starting the servers:
To launch the server, open the Docker Desktop app and run the following command at the root of the project:
```bash
$ make up
```
This command will launch the Django server, Postgres database, and Jupyter server using the environment variables defined in the `.env` file across all containers.

#### Understanding expected file paths:
DiveDB expects the following paths to be set in the `.env` file:
- `CONTAINER_DATA_PATH`
- `LOCAL_DATA_PATH`
- `HOST_DELTA_LAKE_PATH`
- `CONTAINER_DELTA_LAKE_PATH`

These paths are used to mount the Delta Lake and file storage to the containers. The "LOCAL_" and "HOST_" paths can be wherever makes sense for your local machine. The "CONTAINER_" paths are the paths that the containers expect. We recommend you keep the "CONTAINER_" paths as they are in the `.env.example` file.
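
As a quick sanity check before running any uploads, you can confirm from a notebook cell that these variables are visible in the environment. This is a minimal sketch: `missing_path_vars` is a hypothetical helper, not part of DiveDB, and it assumes all four variables are passed through to the container you run it in.

```python
import os

# Path variables that DiveDB expects in the .env file (listed above).
REQUIRED_PATH_VARS = (
    "CONTAINER_DATA_PATH",
    "LOCAL_DATA_PATH",
    "HOST_DELTA_LAKE_PATH",
    "CONTAINER_DELTA_LAKE_PATH",
)


def missing_path_vars(env=None):
    """Return the required path variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_PATH_VARS if not env.get(name)]


missing = missing_path_vars()
if missing:
    print("Missing in environment:", ", ".join(missing))
else:
    print("All DiveDB path variables are set.")
```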

#### When is the server ready?
There are 3 processes that need to be running for the server to be ready:
1. The Django server (`web`)
2. The Postgres database (`postgres`)
3. The Jupyter server (`jupyter`)

Jupyter is almost always the last to start up. You'll know it's ready when you see the following logs in the terminal:
```bash
jupyter-1            | [I 2024-08-30 16:12:37.083 ServerApp] Serving notebooks from local directory: /app
jupyter-1            | [I 2024-08-30 16:12:37.083 ServerApp] Jupyter Server 2.14.2 is running at:
jupyter-1            | [I 2024-08-30 16:12:37.083 ServerApp] http://e29d05e13fd0:8888/jupyter/tree
jupyter-1            | [I 2024-08-30 16:12:37.083 ServerApp]     http://127.0.0.1:8888/jupyter/tree
```
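
If you script the startup rather than watching the terminal, the readiness line can be detected programmatically. The sketch below is illustrative only: `jupyter_is_ready` is a hypothetical helper, and it assumes the log wording matches the sample shown above.

```python
import re

# Matches the readiness line, e.g. "Jupyter Server 2.14.2 is running at:".
READY_PATTERN = re.compile(r"Jupyter Server \S+ is running at")


def jupyter_is_ready(log_lines):
    """Return True if any log line reports that the Jupyter server is running."""
    return any(READY_PATTERN.search(line) for line in log_lines)


sample = [
    "jupyter-1            | [I 2024-08-30 16:12:37.083 ServerApp] Serving notebooks from local directory: /app",
    "jupyter-1            | [I 2024-08-30 16:12:37.083 ServerApp] Jupyter Server 2.14.2 is running at:",
]
print(jupyter_is_ready(sample))  # True
```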

## Connecting to the Jupyter Kernel:
To connect to the Jupyter server in your notebook, follow these steps:
1. Click the "Select Kernel" button at the top right of the page.
1. Pick the "Select another kernel" option in the dropdown menu.
1. Pick the "Existing Jupyter Server" option in the dropdown menu.
1. Now we need to connect to the Jupyter server.
    - If you previously connected to the Jupyter server
        - Pick the "localhost" option in the dropdown menu (or whatever you named it prior)
    - If you have not connected to the Jupyter server before
        - Pick the "Enter the URL of the running Jupyter server" option in the dropdown menu.
        - Enter http://localhost:8888/jupyter
        - Give it a name you'll remember (like "Local DiveDB Jupyter Server")
1. Press the "Reload" icon in the top right of the dropdown menu to see the latest kernel.
1. Pick the "Python 3" option in the dropdown menu.

This will ensure you execute the Jupyter notebook in the correct environment.

After connecting to the Jupyter server, ensure your notebook runs this command to set the appropriate file paths:

## Preparing to upload data:
There are two aspects to any data upload:
1. A netCDF file containing measurements and time
2. The metadata for the measurements
    - This describes the context of the measurements using the following fields:
        - animal
        - deployment
        - logger

There are several ways to define your metadata. 

#### Supplied Metadata Dictionary
If you know the metadata for your measurements, you can pass a dictionary to the `upload_netcdf` function. The dictionary should represent metadata existing in the Metadata database and contain the following fields:
- animal: The animal ID
- deployment: The deployment name
- recording: The recording name
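
A minimal sketch of such a dictionary, with a small check that all three fields are present. The `check_metadata` helper and the ID values are hypothetical placeholders; the recording name follows the `deployment_animal_logger` pattern used in the examples later in this notebook.

```python
REQUIRED_METADATA_KEYS = ("animal", "deployment", "recording")


def check_metadata(metadata):
    """Return the dict unchanged, or raise ValueError if a required field is missing or empty."""
    missing = [key for key in REQUIRED_METADATA_KEYS if not metadata.get(key)]
    if missing:
        raise ValueError(f"Metadata is missing: {', '.join(missing)}")
    return metadata


# Placeholder values for illustration only; use IDs that exist in your database.
metadata = check_metadata(
    {
        "animal": "example-animal-id",
        "deployment": "example-deployment-id",
        "recording": "example-deployment-id_example-animal-id_example-logger-id",
    }
)
```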


### Uploading netCDF files
The `netcdf_file_path` list contains the paths to the netCDF files that we want to upload. It can point to files on your local machine or on a remote server.
In this example, the file is located in the `../data/files/` directory and is named `deployment_data.nc`.

The `upload_netcdf` function will perform the following:
- use the provided metadata dictionary to extract the metadata for your measurements
- upload copies of the netCDF files to OpenStack Swift
- upload the measurements to Delta Lake

The process takes between 30 seconds and 1 minute per gigabyte. About two thirds of that time is spent uploading the files to OpenStack Swift. (*Note: we can speed this up by parallelizing the upload process.*)
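
To turn that rate into a rough expectation for a given file, the arithmetic can be sketched as follows. `estimate_upload_seconds` is a hypothetical helper, and it assumes decimal gigabytes (10^9 bytes) and the per-gigabyte range quoted above.

```python
def estimate_upload_seconds(size_bytes, low_s_per_gb=30.0, high_s_per_gb=60.0):
    """Return a (low, high) estimate of the upload time in seconds."""
    size_gb = size_bytes / 1e9  # assumption: 1 GB = 10^9 bytes
    return size_gb * low_s_per_gb, size_gb * high_s_per_gb


low, high = estimate_upload_seconds(2_500_000_000)  # a 2.5 GB file
print(f"Expected upload time: {low:.0f} to {high:.0f} seconds")
# Expected upload time: 75 to 150 seconds
```

For a real file, `os.path.getsize(path)` gives the size in bytes.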

### Example netCDF File
An example netCDF file that meets the above requirements can be downloaded here: [https://figshare.com/ndownloader/files/50061330](https://figshare.com/ndownloader/files/50061330). It can be used as a template for your own data.
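
If you prefer to fetch the file from a notebook cell, a minimal sketch using only the standard library is shown below. `download_example` is a hypothetical helper, not part of DiveDB; it skips the download when the destination already exists.

```python
from pathlib import Path
from urllib.request import urlretrieve

EXAMPLE_URL = "https://figshare.com/ndownloader/files/50061330"


def download_example(dest, url=EXAMPLE_URL):
    """Download the example file to `dest` unless it already exists; return the path."""
    dest = Path(dest)
    if not dest.exists():
        # Create the target directory first so urlretrieve can write the file.
        dest.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(url, str(dest))
    return dest
```

Calling `download_example("./files/example_data.nc")` would place the file at the path used in Example 1.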

### Example 1: Uploading a netCDF file when metadata is already in the database

In [None]:
import os
import importlib
import xarray as xr

os.environ["SKIP_OPENSTACK_UPLOAD"] = "true"
os.environ["DJANGO_ALLOW_ASYNC_UNSAFE"] = "true"

from DiveDB.services.duck_pond import DuckPond

import DiveDB.services.data_uploader
importlib.reload(DiveDB.services.data_uploader)
from DiveDB.services.data_uploader import DataUploader

duckpond = DuckPond(os.environ["CONTAINER_DELTA_LAKE_PATH"])
data_uploader = DataUploader()

# The example netCDF file can be downloaded from https://figshare.com/ndownloader/files/50061330
example_data_path = "./files/example_data.nc"

# Prepare data for each model
with xr.open_dataset(example_data_path) as ds:
    animal_id = ds.attrs.get("animal_info_page_id")
    deployment_id = ds.attrs.get("deployment_info_page_id")
    
    sensor_info_attrs = {key: value for key, value in ds.attrs.items() if key.startswith("sensor_info")}
    sensor_info_words = list(set(key.split("sensor_info_")[1].split("_")[0] for key in sensor_info_attrs))
    logger_ids = {ds.attrs.get(f"sensor_info_{word}_logger_id") for word in sensor_info_words}
    
    if len(logger_ids) == 1:
        logger_id = list(logger_ids)[0]
        metadata = {
            "animal": animal_id,
            "deployment": deployment_id,
            "recording": f"{deployment_id}_{animal_id}_{logger_id}"
        }

        data_uploader.upload_netcdf(example_data_path, metadata)
    else:
        print("Multiple loggers detected. Divide data into separate files for each logger.")
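
The single-logger check in the cell above relies on the naming pattern of the `sensor_info_<label>_...` attributes. The following sketch reproduces that logic on a plain dictionary so it can be tried without a netCDF file. The attribute values are made up, and `logger_ids_from_attrs` is a hypothetical helper rather than a DiveDB function.

```python
def logger_ids_from_attrs(attrs):
    """Collect the distinct logger IDs referenced by `sensor_info_*` attributes."""
    # The label is the first underscore-separated token after the "sensor_info_" prefix.
    labels = {
        key.split("sensor_info_")[1].split("_")[0]
        for key in attrs
        if key.startswith("sensor_info_")
    }
    return {attrs.get(f"sensor_info_{label}_logger_id") for label in labels}


# Made-up attributes that mimic the naming pattern of the example file.
attrs = {
    "animal_info_page_id": "example-animal",
    "sensor_info_depth_logger_id": "LOGGER-A",
    "sensor_info_depth_sensor_start_datetime": "2024-01-01 00:00:00",
    "sensor_info_temperature_logger_id": "LOGGER-A",
}

print(logger_ids_from_attrs(attrs))  # {'LOGGER-A'}
```

Two sensor labels that point at the same logger collapse to one ID, which is why the upload proceeds only when the resulting set has exactly one element.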



### Example 2: Uploading a netCDF file when metadata is not in the database

If the metadata is not in the database, you must create the animal, deployment, logger, and recording records yourself before calling `upload_netcdf`.

In [None]:
import os
import importlib
import xarray as xr

os.environ["SKIP_OPENSTACK_UPLOAD"] = "true"
os.environ["DJANGO_ALLOW_ASYNC_UNSAFE"] = "true"

from DiveDB.services.duck_pond import DuckPond

import DiveDB.services.data_uploader
importlib.reload(DiveDB.services.data_uploader)
from DiveDB.services.data_uploader import DataUploader

duckpond = DuckPond(os.environ["CONTAINER_DELTA_LAKE_PATH"])
data_uploader = DataUploader()

# The example netCDF file can be downloaded from https://figshare.com/ndownloader/files/50061330
example_data_path = "./files/example.nc"

# Prepare data for each model
with xr.open_dataset(example_data_path) as ds:
    display(ds)
    animal_data = {
        "animal_id": ds.attrs.get("animal_info_page_id"),
        "project_id": ds.attrs.get("animal_info_Project ID"),
        "scientific_name": ds.attrs.get("animal_info_Scientific Name"),
        "common_name": ds.attrs.get("animal_info_Common Name"),
        "lab_id": ds.attrs.get("animal_info_Lab ID"),
        "birth_year": -999,  # Birth year is not provided in the example data
        "sex": ds.attrs.get("animal_info_Sex"),
        "domain_ids": ds.attrs.get("animal_info_Domain IDs"),
    }

    deployment_data = {
        "deployment_id": ds.attrs.get("deployment_info_page_id"),
        "domain_deployment_id": ds.attrs.get("deployment_info_Domain Deployment ID"),
        "animal_age_class": ds.attrs.get("deployment_info_Animal Age Class"),
        "animal_age": -999,  # Animal age is not provided in the example data
        "deployment_type": ds.attrs.get("deployment_info_Deployment Type"),
        "deployment_name": ds.attrs.get("deployment_info_page_id"),
        "rec_date": ds.attrs.get("deployment_info_Recording Date"),
        "deployment_latitude": ds.attrs.get("deployment_info_Deployment Latitude"),
        "deployment_longitude": ds.attrs.get("deployment_info_Deployment Longitude"),
        "deployment_location": ds.attrs.get("deployment_info_Deployment Location"),
        "departure_datetime": ds.attrs.get("deployment_info_Departure Datetime"),
        "timezone": ds.attrs.get("deployment_info_Time Zone"),
        "recovery_latitude": ds.attrs.get("deployment_info_Recovery Latitude"),
        "recovery_longitude": ds.attrs.get("deployment_info_Recovery Longitude"),
        "recovery_location": ds.attrs.get("deployment_info_Recovery Location"),
        "arrival_datetime": ds.attrs.get("deployment_info_Recording Date") + " " + ds.attrs.get("deployment_info_Start Time"),
        "notes": ds.attrs.get("deployment_info_Notes"),
    }

    # Create or get records
    animal, _ = data_uploader.get_or_create_animal(animal_data)
    deployment, _ = data_uploader.get_or_create_deployment(deployment_data)
    
    sensor_info_attrs = {key: value for key, value in ds.attrs.items() if key.startswith("sensor_info")}
    sensor_info_words = list(set(key.split("sensor_info_")[1].split("_")[0] for key in sensor_info_attrs))
    logger_ids = {ds.attrs.get(f"sensor_info_{word}_logger_id") for word in sensor_info_words}
    
    if len(logger_ids) == 1:
        logger_label = sensor_info_words[0]
        logger_data = {
            "logger_id": ds.attrs.get(f"sensor_info_{logger_label}_logger_id"),
            "manufacturer": ds.attrs.get(f"sensor_info_{logger_label}_logger_manufacturer"),
            "manufacturer_name": ds.attrs.get(f"sensor_info_{logger_label}_logger_model"),
            "serial_no": ds.attrs.get(f"sensor_info_{logger_label}_logger_serial_number"),
            "ptt": ds.attrs.get(f"sensor_info_{logger_label}_logger_ptt"),
            "type": ds.attrs.get(f"sensor_info_{logger_label}_logger_type"),
            "notes": ds.attrs.get(f"sensor_info_{logger_label}_details"),
        }
        
        logger, _ = data_uploader.get_or_create_logger(logger_data)
        
        recording_data = {
            # Single quotes inside the f-strings keep this valid on Python versions before 3.12.
            "recording_id": f"{deployment_data['deployment_id']}_{animal_data['animal_id']}_{logger_data['logger_id']}",
            "name": f"{deployment_data['deployment_id']}_{animal_data['animal_id']}_{logger_data['logger_id']}",
            "animal": animal,
            "deployment": deployment,
            "logger": logger,
            "start_time": ds.attrs.get(f"sensor_info_{logger_label}_sensor_start_datetime"),
            "end_time": ds.attrs.get(f"sensor_info_{logger_label}_sensor_end_datetime"),
            "timezone": ds.attrs.get("Time_Zone"),
            "quality": ds.attrs.get("Quality"),
            "attachment_location": ds.attrs.get("Attachment_Location"),
            "attachment_type": ds.attrs.get("Attachment_Type"),
        }

        recording, _ = data_uploader.get_or_create_recording(recording_data)

        metadata = {
            "animal": animal.id,
            "deployment": deployment.id,
            "recording": recording.id,
        } 

        data_uploader.upload_netcdf(example_data_path, metadata)

    else:
        print("Multiple loggers detected. Divide data into separate files for each logger.")
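
One fragile spot in the cell above is `arrival_datetime`: `ds.attrs.get(...)` returns `None` for a missing attribute, and concatenating `None` with a string raises a `TypeError`. A defensive variant could look like the sketch below. `combine_date_time` is a hypothetical helper, and returning `None` for incomplete input is an assumption about what the database field should hold in that case.

```python
def combine_date_time(date_str, time_str):
    """Join a date and a time string with a space, or return None if either is missing."""
    if not date_str or not time_str:
        return None
    return f"{date_str} {time_str}"


# Made-up values for illustration.
print(combine_date_time("2024-01-01", "08:30:00"))  # 2024-01-01 08:30:00
print(combine_date_time("2024-01-01", None))        # None
```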