# Working with Galv data

This notebook demonstrates how to work with Galv data in Python.
It closely follows the code example found on the Files page of the Galv app.

Before we can get our hands on data, we need to know where the API is (`host`), how to identify ourselves to it (`token`), and which data we want (`dataset_ids`).
If you're using the public Galv instance at https://galv-backend-dev.fly.dev, you can leave `host` as it is.

You can get a token by going to the Galv app, clicking on your name in the top right, and selecting "API token" from the dropdown.
Then select "create a new token" and copy the token that is generated.
Give the token an expiry date far enough ahead that it won't lapse before you've finished working with the data.

In [None]:
host = "https://galv-backend-dev.fly.dev"  # base URL of the Galv API, without a trailing slash
token = input("Provide your API token: ").strip()
dataset_ids = [input("Provide the dataset ID: ").strip()]
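
Typing a token into `input` may leave it visible in the saved notebook output. One alternative, sketched here under the assumption that you are happy to keep the token in an environment variable, is to read it from the environment and fall back to a hidden prompt. The variable name `GALV_API_TOKEN` and the helper `read_token` are hypothetical and not part of Galv.

```python
import os
from getpass import getpass


def read_token(env_var="GALV_API_TOKEN"):
    """Return the API token from the environment, prompting only if it is unset.

    `GALV_API_TOKEN` is a hypothetical variable name chosen for this sketch.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        # getpass hides the typed value instead of echoing it back
        token = getpass("Provide your API token: ").strip()
    if not token:
        raise ValueError("An API token is required to talk to the API")
    return token
```

With this in place, `token = read_token()` could stand in for the `input` call.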

## Download data

First we need some boilerplate code to cover configuration and importing the necessary libraries.

In [None]:
import requests
import duckdb
import tempfile
import os
import json

# Configuration
headers = {
    "Authorization": f"Bearer {token}",
    "accept": "application/json"
}
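verbose = True  # set to False to silence progress messages
dataset_metadata = {}  # dataset ID -> metadata returned by the API
parquets = {}  # dataset ID -> DuckDB relation over the downloaded partitions


def vprintln(message):
    """Print a progress message when `verbose` is enabled."""
    if verbose:
        print(message)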


Then we can define our `get_dataset` function.
It downloads the metadata for a dataset, then follows each entry in `parquet_partitions` to download that partition's `.parquet` file into a temporary directory.
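
The lookup chain can be sketched without touching the network. The field names `parquet_partitions` and `parquet_file` are the ones `get_dataset` reads; the helper `resolve_parquet_urls`, the `fetch_json` callable, and every URL in the example are hypothetical stand-ins.

```python
def resolve_parquet_urls(dataset_id, fetch_json, host):
    """Follow dataset metadata -> partition records -> .parquet file URLs.

    `fetch_json` is any callable mapping a URL to the decoded JSON body,
    so the lookup logic can be exercised without a live API.
    """
    metadata = fetch_json(f"{host}/files/{dataset_id}/")
    file_urls = []
    for partition_url in metadata["parquet_partitions"]:
        partition = fetch_json(partition_url)
        file_urls.append(partition["parquet_file"])
    return file_urls


# Made-up responses standing in for the API (all URLs are illustrative)
fake_api = {
    "https://example.test/files/abc/": {
        "parquet_partitions": [
            "https://example.test/parquet_partitions/1/",
            "https://example.test/parquet_partitions/2/",
        ]
    },
    "https://example.test/parquet_partitions/1/": {
        "parquet_file": "https://example.test/p1.parquet"
    },
    "https://example.test/parquet_partitions/2/": {
        "parquet_file": "https://example.test/p2.parquet"
    },
}

urls = resolve_parquet_urls("abc", fake_api.__getitem__, "https://example.test")
print(urls)
```

Each dataset therefore costs one metadata request, one request per partition record, and one request per `.parquet` file.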

In [None]:
def get_dataset(id):
    vprintln(f"Downloading dataset {id}")

    response = requests.get(f"{host}/files/{id}/", headers=headers)
    if response.status_code != 200:
        print(f"Error fetching dataset {id}: {response.status_code}")
        return

    try:
        body = response.json()
    except ValueError:  # base class of json.JSONDecodeError and of the error requests raises
        print(f"Error parsing JSON for dataset {id}")
        return

    dataset_metadata[id] = body
    parquet_partitions = dataset_metadata[id]["parquet_partitions"]
    len_partitions = len(parquet_partitions)
    vprintln(f"Downloading {len_partitions} parquet partitions for dataset {id}")

    dataset_dir = tempfile.mkdtemp(prefix=f"py_{id}")

    for i, pp in enumerate(parquet_partitions):
        vprintln(f"Downloading partition {i + 1} from {pp}")
        partition_response = requests.get(pp, headers=headers)
        if partition_response.status_code != 200:
            print(f"Error fetching parquet partition {i + 1} for dataset {id}: {partition_response.status_code}")
            continue

        try:
            parquet_info = partition_response.json()
        except ValueError:  # base class of json.JSONDecodeError and of the error requests raises
            print(f"Error parsing JSON for dataset {id} parquet partition {i + 1}")
            continue

        pq_file = parquet_info["parquet_file"]
        vprintln(f"Downloading .parquet from {pq_file}")
        path = os.path.join(dataset_dir, f"{i + 1}.parquet")

        download_response = requests.get(pq_file, headers=headers)
        if download_response.status_code == 200:
            with open(path, 'wb') as f:
                f.write(download_response.content)
            vprintln(f"Partition {i + 1} downloaded successfully")
        else:
            print(f"Error downloading .parquet file from {pq_file}: {download_response.status_code}")

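    # Load every downloaded partition as a single DuckDB relation
    if not os.listdir(dataset_dir):
        print(f"No partitions were downloaded for dataset {id}")
        return
    parquets[id] = duckdb.read_parquet(os.path.join(dataset_dir, "*.parquet"))
    vprintln("Completed.")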
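
`download_response.content` holds a whole partition in memory before it is written to disk. For large partitions, one option is to write the body in chunks and give the file its final `.parquet` name only once it is complete, so a half-written file is never matched by the `*.parquet` glob. A minimal sketch, where `write_chunks` is a hypothetical helper and the chunk source is any iterable of `bytes`:

```python
import os
import tempfile


def write_chunks(chunks, path):
    """Write an iterable of byte chunks to `path` via a temporary `.part` file."""
    part_path = path + ".part"
    with open(part_path, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
    # Rename only after every chunk has been written
    os.replace(part_path, path)
    return path


# Demo with made-up bytes; this is not a valid parquet file
demo_dir = tempfile.mkdtemp(prefix="py_demo")
demo_path = write_chunks([b"PAR1", b"", b"..."], os.path.join(demo_dir, "1.parquet"))
```

With `requests`, the chunks could come from `requests.get(pq_file, headers=headers, stream=True).iter_content(chunk_size=1024 * 1024)`.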




Finally, we call our download function for each dataset ID.

In [None]:
for id in dataset_ids:
    get_dataset(id)
    vprintln(f"Completed dataset {id}")

vprintln("All datasets complete.")
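
`get_dataset` prints a message and returns when a request fails rather than raising, so the loop finishes either way. A quick check of which IDs actually produced data might look like this (a sketch; `missing_datasets` is a hypothetical helper and the IDs are made up):

```python
def missing_datasets(requested_ids, loaded):
    """Return the requested IDs that have no entry in `loaded`, in order."""
    return [dataset_id for dataset_id in requested_ids if dataset_id not in loaded]


# Stand-in values; in the notebook these would be `dataset_ids` and `parquets`
example_ids = ["id-1", "id-2", "id-3"]
example_loaded = {"id-1": object(), "id-3": object()}
print(missing_datasets(example_ids, example_loaded))  # → ['id-2']
```

Calling `missing_datasets(dataset_ids, parquets)` after the loop gives the IDs worth retrying.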

## Working with the data

Most data-oriented tasks are covered in other notebooks, so here we just take a quick look at our data.

In [None]:
# Load a dataset as a pandas DataFrame (requires pandas to be installed)
df = parquets[dataset_ids[0]].df()
df

In [None]:
from IPython.display import JSON

# Show the dataset's metadata as an expandable tree
JSON(dataset_metadata[dataset_ids[0]], expanded=True, root="metadata")
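
The interactive `JSON` view may not render in every front end. A plain-text fallback using only the standard library is sketched below with a made-up metadata dictionary: only the `parquet_partitions` key is taken from this notebook, and the `name` key and the helper `format_metadata` are invented for illustration.

```python
import json


def format_metadata(metadata, indent=2):
    """Return metadata as indented JSON text with keys sorted for stable output."""
    # default=str keeps the dump from failing on values json cannot encode
    return json.dumps(metadata, indent=indent, sort_keys=True, default=str)


example_metadata = {
    "parquet_partitions": ["https://example.test/parquet_partitions/1/"],
    "name": "example dataset",  # invented key, for illustration only
}
print(format_metadata(example_metadata))
```

In the notebook, `print(format_metadata(dataset_metadata[dataset_ids[0]]))` would show the same information as plain text.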