# LEAP Pangeo JupyterHub: Hackathon Data Access Guide

**Urban Futures:** Co-Creating Climate Resilience in NYC  
**Dates:** Thurs Jan 15 - Sat Jan 17, 2026

This notebook is a guide for hackathon participants on:

-   Navigating the LEAP Pangeo JupyterHub environment
-   Finding and opening hackathon datasets hosted on **OSN pods
    (public, read-only)**
-   Using **rclone**, **xarray**, and **fsspec** to explore and load
    data efficiently
-   Writing intermediate and final results to **LEAP `leap-scratch`
    (temporary)**
-   Avoiding common pitfalls

**Please refer to our technical documentation for more on LEAP Pangeo:**
https://leap-stc.github.io/

------------------------------------------------------------------------

## 1) Getting Started: Launching your JupyterHub

When you sign into the LEAP Pangeo JupyterHub, you will start on a
**Server Options** page before your environment launches. This is where
you choose the compute profile (CPU/RAM/GPU) and the software image (the
prebuilt environment).

### Step-by-step: start your server

1.  Log in to the hub at https://leap.2i2c.cloud/
2.  If you are not already running a server, click **Start My Server**.
3.  On the **Server Options** page, select:
    -   **Image** (software environment)
    -   **CPU/RAM profile** (how much compute you request)
4.  Click **Start** and wait for your JupyterLab to open.

### Choosing an image (software environment)

-   **Recommended Default:** **1 CPU / 8 GB RAM**. This option is
    sufficient for most hackathon work.
-   **GPU Option:** Select a GPU-enabled image *only if you are actively
    running GPU-accelerated workloads (e.g., deep learning training)*.
### Use GPUs only when necessary

-   GPUs are a shared resource.
-   If many participants select the GPU profile simultaneously, the hub
    can become unstable or fail to schedule servers.
-   **Please use GPU only when actively running GPU tasks**, and switch
    back to the default CPU profile when you’re done.

### Best practices to avoid crashing shared resources

-   Prefer the default profile unless you have a clear need.
-   Do not “camp” on GPU resources while idle.
-   Write outputs to `leap-scratch`, not your home directory.

If you hit startup errors or long waits, switch back to the default
profile first and retry.

------------------------------------------------------------------------

## 2) Mental Model: how storage fits together

You will interact with **three** main storage locations:

### 2A) OSN Pod (S3-compatible object storage) - **public, read-only**

-   These are the hackathon’s **published** datasets, so they remain
    accessible to the public even after the event.
-   You cannot write to the OSN pod.
-   Whenever possible, **read** data directly from OSN rather than
    downloading full datasets locally.

### 2B) LEAP JupyterHub Home Directory (`/home/jovyan`) - **small, persistent**

-   This is where your notebooks live.
-   Storage is limited to 100 GB, and shared infrastructure depends on
    it.
-   **HARD RULE:** do **not** store large datasets here. If your home
    directory grows very large (over 100 GB), it can cause the servers
    to crash for everyone.
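
To keep an eye on the rule above, you can measure how much space your home directory is using. The sketch below uses only the Python standard library; the 50 GB warning threshold and the helper names are illustrative assumptions, not official LEAP settings.

```python
import os

HOME_DIR = "/home/jovyan"  # home directory path used on the hub
WARN_GB = 50               # hypothetical warning threshold, not an official limit


def dir_size_bytes(root):
    """Sum the sizes of all regular files under `root` (symlinks are skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root, onerror=lambda err: None):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if not os.path.islink(path):
                    total += os.path.getsize(path)
            except OSError:
                # File vanished or is unreadable; skip it
                continue
    return total


def size_report(root=HOME_DIR, warn_gb=WARN_GB):
    """Return (size_in_gb, over_threshold) for `root`."""
    size_gb = dir_size_bytes(root) / 1e9
    return size_gb, size_gb > warn_gb
```

Calling `size_report()` on the hub returns the current size in GB and whether it exceeds the threshold. Running `du -sh ~` in a terminal gives a similar figure.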

### 2C) LEAP `leap-scratch` (Google Cloud Storage bucket) - **writable, 30 days**

-   This is where you should write:
    -   intermediate files (data subsets, temporary exports)
    -   model outputs and figures
    -   anything else that is large (gigabyte-scale)
-   `leap-scratch` is cleared after 30 days, so it is not intended for
    long-term archiving.

------------------------------------------------------------------------

## 3) Why you need `leap-scratch`

Even if the “source of truth” datasets are on OSN, you still need a
writable workspace for:

-   **Subsetting:** You might extract NYC-only slices, specific time
    windows, or a few variables.
-   **Derived products:** Regrids, indices, summary tables, GeoJSON
    outputs, tiles, plots, ML features, etc.
-   **Team collaboration:** A shared scratch prefix is a convenient
    handoff point.
-   **Performance:** Writing computed intermediates (small) can
    accelerate iterative workflows without redoing expensive steps.

**Key practice:** *Read big data from OSN; write small/medium artifacts
to `leap-scratch`; keep home directory minimal.*

------------------------------------------------------------------------

## 4) Using Git to collaborate with members

During the hackathon, teams may choose to use **Git/GitHub** to
collaborate on code, notebooks, and documentation.

> A full Git + GitHub tutorial for JupyterHub is available here:
> https://github.com/leap-stc/LEAPCourse-Climate-Pred-Challenges/blob/main/Tutorials/Github-Tutorial.md

### How Git fits into JupyterHub

On LEAP JupyterHub, Git is typically used to:

-   Share notebooks and scripts within a team
-   Track changes to analysis code
-   Collaborate asynchronously without emailing files

## Two ways to use Git on JupyterHub

### Option 1: Use the JupyterLab interface (recommended for beginners)

JupyterLab includes a built-in **Git extension** that allows you to:

-   Clone repositories
-   View changed files
-   Commit and push changes
-   Pull updates from teammates

You can access it from:

-   the **left sidebar** (Branch icon), or
-   the **Git** menu in the top menu bar

### Option 2: Use the terminal (recommended if you know Git)

You can also use Git from the Unix terminal inside JupyterHub.

Example workflow (run these commands in a terminal):

``` bash
git clone https://github.com/<org-or-user>/<repo>.git
cd <repo>
git status
git add .
git commit -m "Add analysis notebook"
git push
```

------------------------------------------------------------------------

## 5) Hackathon datasets (OSN layout and access)

All hackathon datasets (HRRR, CorrDiff, and ERA5) are hosted on public
OSN object storage. OSN uses an S3-compatible interface, which means you
access data by combining:

-   an endpoint URL (where the storage service lives)
-   a bucket name (the top-level container)
-   a prefix (a “folder” path inside the bucket)

For this hackathon, these values are:

    OSN_ENDPOINT_URL = "https://nyu1.osn.mghpcc.org"
    OSN_BUCKET = "leap-pangeo-manual"
    HACKATHON_PREFIX = "hackathon-2026/"

**All datasets live under the following path:**

    OSN_ROOT = f"s3://{OSN_BUCKET}/{HACKATHON_PREFIX}" 

    # For this hackathon: s3://leap-pangeo-manual/hackathon-2026/

Inside this root directory, you will find separate subdirectories for
each dataset:

-   `hrrr/` – High-Resolution Rapid Refresh weather model output
-   `era5_cds/NYC/` – ERA5 reanalysis data clipped to the NYC region
-   `corrdiff/` – CorrDiff downscaled model outputs

**Note:** When working in Python, you do not include the endpoint URL in
the s3:// path. Instead, you provide the endpoint separately and
reference data using the bucket and prefix, as shown in later examples.

This separation (endpoint vs. bucket vs. prefix) is important and is a
common source of confusion when working with S3-compatible storage.
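
As a small illustration of this separation, the sketch below keeps the endpoint out of the `s3://` URL and returns it as a separate options dictionary. The helper names `osn_url` and `osn_storage_options` are hypothetical and exist only for this example.

```python
OSN_ENDPOINT_URL = "https://nyu1.osn.mghpcc.org"
OSN_BUCKET = "leap-pangeo-manual"
HACKATHON_PREFIX = "hackathon-2026/"


def osn_url(*parts):
    """Build an s3:// URL under the hackathon root (the endpoint is NOT part of it)."""
    key = HACKATHON_PREFIX + "/".join(p.strip("/") for p in parts if p)
    return f"s3://{OSN_BUCKET}/{key}"


def osn_storage_options():
    """Options that tell s3fs/fsspec which endpoint serves the bucket."""
    return {"anon": True, "client_kwargs": {"endpoint_url": OSN_ENDPOINT_URL}}
```

For example, `osn_url("era5_cds", "NYC")` gives `s3://leap-pangeo-manual/hackathon-2026/era5_cds/NYC`, and the dictionary from `osn_storage_options()` can be passed as `storage_options=` to xarray or unpacked into `fsspec.filesystem("s3", **osn_storage_options())`.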

------------------------------------------------------------------------

## 6) Reading data directly from OSN with Python (xarray + fsspec)

The goal is to avoid downloading full datasets. Instead, open them
*in place* using `xarray` + `fsspec` (via `s3fs`).

### Import the usual stack

    import os
    import xarray as xr
    import s3fs
    import fsspec

### 6A) Create an anonymous S3 filesystem for OSN / List Directories

    # Anonymous access to a public OSN endpoint
    fs_osn = s3fs.S3FileSystem(
        anon=True,
        client_kwargs={"endpoint_url": OSN_ENDPOINT_URL},
    )

    # Test: list top-level within the hackathon prefix
    fs_osn.ls(f"{OSN_BUCKET}/{HACKATHON_PREFIX}")[:20]

### 6B) Open Zarr datasets

The `hrrr` and `era5_cds` datasets are stored in Zarr format. You can
open Zarr datasets like this:

    # Example: replace with a real Zarr path you discover via listing
    example_zarr_path = f"{OSN_BUCKET}/{HACKATHON_PREFIX}hrrr/<some-variable>/<some-store>.zarr"

    mapper = fs_osn.get_mapper(example_zarr_path)

    # Open the dataset lazily (use consolidated=True if the store has consolidated metadata)
    ds = xr.open_zarr(mapper, consolidated=False)

    ds

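**Cautions (Zarr):**

-   Do **not** call `.load()` on the full dataset. Select the
    variables, region, and time window you need first, then load only
    that subset.

For more information on working with lazily loaded, Dask-backed data
such as Zarr stores, please refer to
https://docs.xarray.dev/en/stable/user-guide/dask.html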
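
One way to honor the Zarr caution above is to check how large a selection is before loading it. The guard below is a minimal sketch: the function name and the 1 GB default are assumptions chosen with the 8 GB default profile in mind, not hub policy.

```python
def check_load_budget(nbytes, limit_gb=1.0):
    """Return the size in GB, or raise if it exceeds `limit_gb`."""
    size_gb = nbytes / 1e9
    if size_gb > limit_gb:
        raise MemoryError(
            f"Selection is {size_gb:.2f} GB; limit is {limit_gb:.2f} GB. "
            "Subset further before loading."
        )
    return size_gb


# Typical use with a lazily opened xarray dataset `ds`
# (the dimension name "time" is an assumption; check your dataset):
#
#   subset = ds[["<some-variable>"]].isel(time=slice(0, 24))  # still lazy
#   check_load_budget(subset.nbytes)  # size comes from metadata, no data is read
#   subset = subset.load()
```

Because `nbytes` is computed from array shapes and dtypes, the check itself does not read data from OSN.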

### 6C) Open NetCDF / Other Formats

If the data is NetCDF (`.nc`), you can open it via fsspec and xarray like this:

``` python
import xarray as xr

# Example NetCDF path
example_nc = f"s3://{OSN_BUCKET}/{HACKATHON_PREFIX}corrdiff/<somefile>.nc"

ds_nc = xr.open_dataset(
    example_nc,
    engine="h5netcdf",
    storage_options={
        "anon": True,
        "client_kwargs": {"endpoint_url": OSN_ENDPOINT_URL},
    },
)

ds_nc
```

**Cautions (NetCDF):**

-   Avoid opening dozens/hundreds of remote `.nc` files at once.
-   If you need many files, work in batches and aggregate outputs
    (e.g., per-day/per-month metrics), not full raw arrays.
-   If you hit performance issues, stage the needed files to
    `leap-scratch` and work from there.
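
The batching advice above can be sketched with two small helpers. Both names are hypothetical, and `summarize` stands for a function you would write yourself that opens one batch of files, reduces them, and returns a small result.

```python
def batched(items, batch_size):
    """Yield consecutive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    items = list(items)
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def summarize_in_batches(paths, summarize, batch_size=10):
    """Apply `summarize` to each batch of paths and collect the small results."""
    return [summarize(batch) for batch in batched(paths, batch_size)]


# Dry run with placeholder file names; `len` stands in for a real reduction
demo_paths = [f"file_{i}.nc" for i in range(25)]
demo_counts = summarize_in_batches(demo_paths, len, batch_size=10)  # [10, 10, 5]
```

Only one batch of files is open at a time, and only the reduced results are kept in memory.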

For more information on how to navigate and subset NetCDF files, please
refer to https://docs.xarray.dev/en/stable/user-guide/io.html#netcdf

------------------------------------------------------------------------

## 7) Writing outputs to `leap-scratch` (GCS): the safe writable workspace

Outputs can end up in `leap-scratch` in two ways: by writing directly
to `leap-scratch`, or by moving files from your home directory to
`leap-scratch`. The first method is strongly preferred.

### 7A) Using `gcsfs` for scratch I/O

    import gcsfs

    # This will use your JupyterHub credentials (already set up on LEAP hub)
    fs_gcs = gcsfs.GCSFileSystem()

    # Fill in a scratch prefix given to you or choose a team folder
    # Example structure: gs://leap-scratch/hackathon-2026/<team-name>/
    SCRATCH_PREFIX = "leap-scratch/hackathon-2026/"

    # List (should work if you have access)
    fs_gcs.ls(SCRATCH_PREFIX)[:20]

### 7B) First Method: Write outputs directly to `leap-scratch`

Most outputs (tables, model results, derived datasets) are created **in
memory** as Python objects.  
The safest and most efficient approach is to **write them directly to
`leap-scratch`** without saving them locally.

Example: write a Pandas DataFrame to `leap-scratch`

``` python
import pandas as pd
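from datetime import datetime, timezone
import gcsfs

fs_gcs = gcsfs.GCSFileSystem()

SCRATCH_PREFIX = "leap-scratch/hackathon-2026/<team-name>/"

df = pd.DataFrame({
    # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated since Python 3.12)
    "timestamp": [datetime.now(timezone.utc).isoformat()],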
    "note": ["example output"]
})

out_csv = f"{SCRATCH_PREFIX}outputs/example.csv"

with fs_gcs.open(out_csv, "w") as f:
    df.to_csv(f, index=False)

out_csv
```

### 7C) Second Method: Move or copy files from home directory to `leap-scratch`

Sometimes a file already exists locally (e.g., a plot image or a file
created by a tool that writes to disk).  
In that case, you can **copy** the file from your home directory into
`leap-scratch`.

Example: copy a local image file to `leap-scratch`

``` python
import gcsfs

fs_gcs = gcsfs.GCSFileSystem()

local_file = "/home/jovyan/output.png"
scratch_file = "leap-scratch/hackathon-2026/<team-name>/outputs/output.png"

fs_gcs.put(local_file, scratch_file)

scratch_file
```

*Note: This should not be your default workflow, because it requires
keeping large files in your JupyterHub home directory.*

------------------------------------------------------------------------

## 8) Opening files from `leap-scratch` (GCS)

### 8A) CSV → Pandas

``` python
import pandas as pd

csv_path = "gs://leap-scratch/hackathon-2026/<team-name>/outputs/table.csv"

with fs_gcs.open(csv_path, "r") as f:
    df = pd.read_csv(f)

df.head()
```

### 8B) Parquet → Pandas

``` python
import pandas as pd

pq_path = "gs://leap-scratch/hackathon-2026/<team-name>/outputs/table.parquet"
with fs_gcs.open(pq_path, "rb") as f:
    df = pd.read_parquet(f)
```

### 8C) JSON → Python

``` python
import json

json_path = "gs://leap-scratch/hackathon-2026/<team-name>/outputs/config.json"
with fs_gcs.open(json_path, "r") as f:
    obj = json.load(f)
obj
```

### 8D) NetCDF (`.nc`) → Xarray

``` python
import xarray as xr

nc_path = "gs://leap-scratch/hackathon-2026/<team-name>/derived/subset.nc"
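with fs_gcs.open(nc_path, "rb") as f:
    # Load inside the block: the file handle closes when the block exits,
    # so a lazily opened dataset would fail on later reads
    ds = xr.open_dataset(f).load()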
ds
```

### 8E) Zarr store → Xarray

``` python
import xarray as xr

zarr_path = "gs://leap-scratch/hackathon-2026/<team-name>/derived/subset.zarr"
ds = xr.open_zarr(fs_gcs.get_mapper(zarr_path))
ds
```

**Caution (GCS):** Prefer Zarr for large multi-dimensional arrays.
NetCDF is fine for smaller derived subsets.

------------------------------------------------------------------------

## 9) A minimal “pattern” for hackathon work

1.  **Explore**: list OSN prefixes; find the dataset you need  
2.  **Open**: `xr.open_zarr(...)` from OSN (lazy)  
3.  **Subset/derive**: compute only what you need for NYC + your time
    window  
4.  **Persist**: save reduced data or derived outputs to
    `leap-scratch`  
5.  **Visualize/ship**: build your demo/visualization using scratch
    outputs
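
Steps 2 to 4 of this pattern could look roughly like the sketch below. It is an illustration under assumptions, not a tested pipeline: the function names are hypothetical, the store is assumed to have a coordinate named `time`, and all paths are placeholders you would replace with real ones found by listing.

```python
OSN_ENDPOINT_URL = "https://nyu1.osn.mghpcc.org"
SCRATCH_ROOT = "leap-scratch/hackathon-2026"


def scratch_path(team, *parts):
    """Build a path under a team folder in leap-scratch."""
    return "/".join([SCRATCH_ROOT, team.strip("/"), *(p.strip("/") for p in parts)])


def subset_to_scratch(src_zarr, dst_zarr, variables, time_range):
    """Open a Zarr store on OSN lazily, keep a slice, and write it to scratch."""
    # Imported here so the path helper above works without these packages
    import gcsfs
    import s3fs
    import xarray as xr

    fs_osn = s3fs.S3FileSystem(
        anon=True, client_kwargs={"endpoint_url": OSN_ENDPOINT_URL}
    )
    ds = xr.open_zarr(fs_osn.get_mapper(src_zarr), consolidated=False)  # lazy open

    # `variables` is a list of names; assumes a coordinate called "time"
    subset = ds[variables].sel(time=slice(*time_range))

    fs_gcs = gcsfs.GCSFileSystem()
    subset.to_zarr(fs_gcs.get_mapper(dst_zarr), mode="w")  # overwrites dst_zarr
    return subset
```

A call might pass `scratch_path("<team-name>", "derived", "subset.zarr")` as `dst_zarr`, a list such as `["<some-variable>"]` as `variables`, and a `("<start>", "<end>")` tuple as `time_range`.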

------------------------------------------------------------------------

## 10) If something fails

-   If listing OSN returns empty: verify endpoint URL, bucket, prefix
-   If you see permissions errors on scratch: confirm your
    `leap-scratch` path and access
-   If xarray says “No group found in store”: you may be pointing at a
    non-Zarr path; list deeper and confirm the `.zarr` store root
-   If things are slow: reduce request size (fewer timesteps/variables),
    use `chunks=` and avoid eager reads