# Data Uploader

This notebook demonstrates the process of uploading Pyologger-prepared data in NetCDFs to DiveDB's Iceberg storage.

## Starting your environment (Docker or Local)
DiveDB can be run either within a Docker container or directly on your local machine. Both approaches rely on the same .env configuration file for environment variables, but differ in how file paths are handled. Docker will more closely match a cloud-hosted environment, but local runs are quicker and easier to set up.

### Option 1: Run Jupyter server with Docker

To launch the server using Docker, open the Docker Desktop app and run the following command at the root of the project:
```bash
$ make up
```
This command will launch the Jupyter server using the environment variables defined in your .env file.

#### Understanding expected file paths (Docker):

For local uploads in Docker, DiveDB expects the following variables to be set in your .env file:
- `LOCAL_DATA_PATH`: The path to the local data directory (source of data to be uploaded).
- `LOCAL_ICEBERG_PATH`: The path to the local iceberg directory (destination of data to be uploaded).
- `CONTAINER_DATA_PATH`: The path that the container will see for the data directory.
- `CONTAINER_ICEBERG_PATH`: The path that the container will see for the iceberg directory.

The `LOCAL_` paths refer to folders on your machine. The `CONTAINER_` paths define how those folders are mounted inside the container. We recommend keeping the `CONTAINER_` paths consistent with the values provided in `.env.example`.

#### Uploads to S3 (Docker):

If uploading to S3, the following variables must also be defined:
- `S3_ENDPOINT`: The endpoint of the S3 bucket.
- `S3_ACCESS_KEY`: The access key of the S3 bucket.
- `S3_SECRET_KEY`: The secret key of the S3 bucket.
- `S3_BUCKET`: Just the name of the S3 bucket.

These values are used regardless of whether you run DiveDB in Docker or locally.
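
A quick check before uploading can catch a missing variable early. The sketch below is only an illustration: `missing_env_vars` is a hypothetical helper, not part of DiveDB, and it knows only the variable names listed above.

```python
import os

# Variable names copied from the lists above.
DOCKER_PATH_VARS = [
    "LOCAL_DATA_PATH",
    "LOCAL_ICEBERG_PATH",
    "CONTAINER_DATA_PATH",
    "CONTAINER_ICEBERG_PATH",
]
S3_VARS = ["S3_ENDPOINT", "S3_ACCESS_KEY", "S3_SECRET_KEY", "S3_BUCKET"]


def missing_env_vars(required, environ=None):
    """Return the names in `required` that are unset or empty (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


# Report anything that still needs to be added to .env.
missing = missing_env_vars(DOCKER_PATH_VARS + S3_VARS)
if missing:
    print("Missing from environment:", ", ".join(missing))
```

Drop `S3_VARS` from the call if you only upload to local Iceberg storage.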

#### When is the server ready?

You’ll know the server is ready when you see logs like the following:
```
jupyter-1            | [I 2024-08-30 16:12:37.083 ServerApp] Serving notebooks from local directory: /app
jupyter-1            | [I 2024-08-30 16:12:37.083 ServerApp] Jupyter Server 2.14.2 is running at:
jupyter-1            | [I 2024-08-30 16:12:37.083 ServerApp] http://e29d05e13fd0:8888/jupyter/tree
jupyter-1            | [I 2024-08-30 16:12:37.083 ServerApp]     http://127.0.0.1:8888/jupyter/tree
```
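
If you would rather not watch the logs, readiness can also be probed from Python. This is a minimal sketch using only the standard library; `server_is_up` and `wait_for_server` are hypothetical helpers, and treating any HTTP response (including an error status such as 403) as "ready" is an assumption that may be too lenient for some setups.

```python
import time
import urllib.error
import urllib.request


def server_is_up(url, timeout=2.0):
    """Return True if something answers HTTP at `url`, whatever the status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server responded, just not with a 2xx status (e.g. a token is required).
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout: nothing is listening yet.
        return False


def wait_for_server(url, attempts=30, delay=1.0, probe=server_is_up):
    """Poll `url` up to `attempts` times, sleeping `delay` seconds between tries."""
    for _ in range(attempts):
        if probe(url):
            return True
        time.sleep(delay)
    return False
```

For the Docker setup described here, a call such as `wait_for_server("http://localhost:8888/jupyter")` would poll for up to about 30 seconds.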

#### Connecting to the Jupyter Kernel from within VSCode

The following steps assume that you're working within VSCode, with both the Python and Jupyter extensions installed. For other workflows, you'll need to follow your workflow-specific steps to connect to the Jupyter Kernel launched by `make up`.

To connect to the Jupyter server in your notebook, follow these steps:
1. Click the "Select Kernel" button at the top right of the page.
1. Pick the "Select another kernel" option in the dropdown menu.
1. Pick the "Existing Jupyter Server" option in the dropdown menu.
1. Now we need to connect to the Jupyter server.
    - If you previously connected to the Jupyter server
        - Pick the "localhost" option in the dropdown menu (or whatever you named it prior)
    - If you have not connected to the Jupyter server before
        - Pick the "Enter the URL of the running Jupyter server" option in the dropdown menu.
        - Enter http://localhost:8888/jupyter
        - Give it a name you'll remember (like "Local DiveDB Jupyter Server")
1. Press the "Reload" icon in the top right of the dropdown menu to see the latest kernel.
1. Pick the "Python 3" option in the dropdown menu.

This will ensure you execute the Jupyter notebook in the correct environment.

### Option 2: Run Jupyter server locally (no Docker)

You can also run the server directly on your machine, without Docker.

The environment variables from .env are still used. Load them into your shell manually (e.g., using dotenv or exporting them yourself), or structure your notebook to load them directly with python-dotenv.
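
To illustrate what loading the file involves, here is a small standard-library sketch. It handles only plain `KEY=VALUE` lines, blank lines, and `#` comments, and both function names are hypothetical. In practice, python-dotenv's `load_dotenv()` is the more robust choice.

```python
import os


def parse_env_text(text):
    """Parse simple KEY=VALUE lines (a sketch, not the full dotenv syntax)."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blanks, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Remove surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def load_env_file(path=".env", environ=None, override=False):
    """Copy values from the file at `path` into `environ` (defaults to os.environ)."""
    environ = os.environ if environ is None else environ
    with open(path, encoding="utf-8") as handle:
        parsed = parse_env_text(handle.read())
    for key, value in parsed.items():
        # Keep values already set in the shell unless override is requested.
        if override or key not in environ:
            environ[key] = value
    return parsed
```

Calling `load_env_file()` in the first cell of the notebook, from the project root, makes the variables available to later cells.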

#### Understanding expected file paths (Local):

If running locally (outside Docker), the following variables are still expected:
- `LOCAL_DATA_PATH`: The path to the local data directory.
- `LOCAL_ICEBERG_PATH`: The path to the local iceberg directory.

Unlike the Docker setup, the container-specific variables (CONTAINER_DATA_PATH, CONTAINER_ICEBERG_PATH) are not used. Any paths referenced in your code should point directly to the LOCAL_ paths.
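
One way to keep a notebook working in both setups is to choose the variable at runtime. The sketch below is an assumption-laden illustration: `resolve_iceberg_path` is a hypothetical helper, and detecting Docker by the presence of `/.dockerenv` is a common heuristic rather than something DiveDB defines.

```python
import os


def resolve_iceberg_path(environ=None, in_docker=None):
    """Return the Iceberg path appropriate for the current environment."""
    environ = os.environ if environ is None else environ
    if in_docker is None:
        # Heuristic (assumption): Docker creates /.dockerenv inside containers.
        in_docker = os.path.exists("/.dockerenv")
    name = "CONTAINER_ICEBERG_PATH" if in_docker else "LOCAL_ICEBERG_PATH"
    try:
        return environ[name]
    except KeyError:
        raise KeyError(f"{name} is not set; check your .env file") from None
```

Passing `in_docker=True` or `in_docker=False` explicitly skips the heuristic when you already know where the notebook runs.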

#### Uploads to S3 (Local):

S3 configuration remains unchanged. The same variables must be set:
- `S3_ENDPOINT`: The endpoint of the S3 bucket.
- `S3_ACCESS_KEY`: The access key of the S3 bucket.
- `S3_SECRET_KEY`: The secret key of the S3 bucket.
- `S3_BUCKET`: The name of the S3 bucket.

---

Choose the option that best fits your need. Docker provides a consistent, containerized environment, while the local setup is more flexible for quick iteration and debugging.


## Preparing to upload data:
There are two aspects to any data upload:
1. A netCDF file containing measurements and time
2. The metadata for the measurements
    - This describes the context of the measurements using the following fields:
        - dataset
        - animal
        - deployment
        - logger

There are several ways to define your metadata. This notebook covers one: supplying a metadata dictionary.

#### Supplied Metadata Dictionary
If you know the metadata for your measurements, you can pass a dictionary to the `upload_netcdf` function. The dictionary should reference metadata that already exists in the Metadata database and contain the following fields:
- animal: The animal ID
- deployment: The deployment name
- recording: The recording name

Example 1 below also passes a `dataset` key holding the dataset ID.
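
As an illustration, a dictionary with the three fields listed above could be checked before it is handed to `upload_netcdf`. The values here are placeholders, not real IDs, and `missing_metadata_fields` is a hypothetical helper rather than part of DiveDB.

```python
REQUIRED_FIELDS = ("animal", "deployment", "recording")


def missing_metadata_fields(metadata, required=REQUIRED_FIELDS):
    """Return the required keys that are absent or empty in `metadata`."""
    return [field for field in required if not metadata.get(field)]


# Placeholder values for illustration only; real values must already
# exist in the Metadata database.
metadata = {
    "animal": "example-animal-id",
    "deployment": "example-deployment-name",
    "recording": "example-recording-name",
}

missing = missing_metadata_fields(metadata)
if missing:
    raise ValueError(f"Metadata is missing: {', '.join(missing)}")
```

Extra keys are allowed by this check; it only reports required fields that are absent or empty.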


### Uploading netCDF files
The `netcdf_file_path` list contains the paths to the netCDF files that we want to upload. It can point to files on your local machine or on a remote server.
In this example, the file is located in the ../data/files/ directory and is named deployment_data.nc.

The `upload_netcdf` function will perform the following:
- use the provided metadata dictionary to extract the metadata for your measurements
- upload the measurements to DiveDB's Iceberg storage

The process takes roughly 20 seconds per gigabyte (*note: we can speed this up by parallelizing the upload process*).
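
For planning purposes, the 20 seconds per gigabyte figure quoted above can be turned into a rough estimate. This is simple arithmetic, not a benchmark: the helper names are hypothetical, and treating 1 GB as 10^9 bytes is an assumption.

```python
import os

SECONDS_PER_GB = 20  # rough throughput figure quoted in this notebook


def estimate_upload_seconds(size_bytes, seconds_per_gb=SECONDS_PER_GB):
    """Estimate upload time, treating 1 GB as 10**9 bytes (an assumption)."""
    return size_bytes / 10**9 * seconds_per_gb


def estimate_upload_seconds_for_file(path):
    """Estimate upload time for the file at `path` from its size on disk."""
    return estimate_upload_seconds(os.path.getsize(path))


# A 5 GB file at 20 s/GB:
print(f"{estimate_upload_seconds(5 * 10**9):.0f} s")
```

Actual times will vary with disk, network, and file layout, so treat the result as an order-of-magnitude guide.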

### Example netCDF File
An example netCDF file that meets the above requirements can be downloaded here: [https://figshare.com/ndownloader/files/50061330](https://figshare.com/ndownloader/files/50061330). It can be used as a template for your own data.

Once you've downloaded that file into the local `DiveDB/files/` subdirectory, you'll either need to rename it to `example_data.nc` or set `example_data_path` in the following examples to the name of the downloaded file. 

### Example 1: Uploading a netCDF file

```python
import os
import importlib
import xarray as xr

from DiveDB.services.duck_pond import DuckPond

# Reload the uploader module so local code changes are picked up without restarting the kernel
import DiveDB.services.data_uploader
importlib.reload(DiveDB.services.data_uploader)
from DiveDB.services.data_uploader import DataUploader

# Create DuckPond instance (new Iceberg-based data lake)
# When running locally (no Docker), point this at LOCAL_ICEBERG_PATH instead
duck_pond = DuckPond(os.environ["CONTAINER_ICEBERG_PATH"])
data_uploader = DataUploader(duck_pond=duck_pond)

# See above [Example netCDF File](#example-netcdf-file) for saving an example file
# to this path; if the file has not been renamed to `example_data.nc`, update the
# path this variable points to.
example_data_path = "/Users/williamgislason/Downloads/2025-01-30_oror-001_output.nc"

# Prepare data for each model
with xr.open_dataset(example_data_path) as ds:
    display(ds)  # `display` is provided by the Jupyter/IPython session
    dataset_id = ds.attrs.get("dataset_info_page_id")
    animal_id = ds.attrs.get("animal_info_page_id")
    deployment_id = ds.attrs.get("deployment_info_page_id")

    # Collect the sensor_info_* attributes and the distinct logger IDs they reference
    sensor_info_attrs = {key: value for key, value in ds.attrs.items() if key.startswith("sensor_info_")}
    sensor_info_words = list(set(key.split("sensor_info_")[1].split("_")[0] for key in sensor_info_attrs))
    logger_ids = {ds.attrs.get(f"sensor_info_{word}_logger_id") for word in sensor_info_words}

    # Only upload when every sensor in the file belongs to the same logger
    if len(logger_ids) == 1:
        logger_id = list(logger_ids)[0]
        metadata = {
            "dataset": dataset_id,
            "animal": animal_id,
            "deployment": deployment_id,
            "recording": f"{deployment_id}_{animal_id}_{logger_id}"
        }

        data_uploader.upload_netcdf(example_data_path, metadata)
    else:
        print("Multiple loggers detected. Divide data into separate files for each logger.")
```
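
The metadata-building step in Example 1 can also be pulled into a function that works on a plain attribute dictionary, which makes it testable without opening a netCDF file. This is a sketch of one possible refactor, not DiveDB code: the function names are hypothetical, the `ecg` and `accel` sensor names and all ID values are invented for illustration, and unlike Example 1 it ignores sensors that have no `logger_id` attribute.

```python
SENSOR_PREFIX = "sensor_info_"


def logger_ids_from_attrs(attrs):
    """Return the distinct logger IDs referenced by sensor_info_* attributes."""
    sensors = {
        key[len(SENSOR_PREFIX):].split("_")[0]
        for key in attrs
        if key.startswith(SENSOR_PREFIX)
    }
    ids = {attrs.get(f"{SENSOR_PREFIX}{sensor}_logger_id") for sensor in sensors}
    ids.discard(None)  # unlike Example 1, skip sensors without a logger_id
    return ids


def build_metadata(attrs):
    """Build the upload metadata dict, requiring exactly one logger."""
    logger_ids = logger_ids_from_attrs(attrs)
    if len(logger_ids) != 1:
        raise ValueError(f"Expected exactly one logger, found {len(logger_ids)}")
    (logger_id,) = logger_ids
    deployment_id = attrs.get("deployment_info_page_id")
    animal_id = attrs.get("animal_info_page_id")
    return {
        "dataset": attrs.get("dataset_info_page_id"),
        "animal": animal_id,
        "deployment": deployment_id,
        "recording": f"{deployment_id}_{animal_id}_{logger_id}",
    }


# Invented attribute values, shaped like the ones Example 1 reads from ds.attrs.
example_attrs = {
    "dataset_info_page_id": "ds-1",
    "animal_info_page_id": "an-1",
    "deployment_info_page_id": "dep-1",
    "sensor_info_ecg_logger_id": "log-1",
    "sensor_info_accel_logger_id": "log-1",
}
print(build_metadata(example_attrs))
```

With a real file, `build_metadata(dict(ds.attrs))` would replace the inline logic, and the `ValueError` takes the place of the "Multiple loggers detected" message.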

