# WildfireSpreadTS: Download and Convert to HDF5

This notebook downloads the WildfireSpreadTS dataset from Zenodo and converts it to HDF5 format for faster training. Run this **once** on Colab, then save the HDF5 output to Google Drive so you can reuse it across sessions.

**Requirements:**
- ~50 GB free disk for download + extraction (Colab Pro recommended for larger disk)
- Or save directly to Google Drive (slower I/O, but persists)

## 1. Mount Google Drive (optional but recommended)

Mount Drive to persist the dataset. If you skip this, data will be lost when the runtime disconnects.

In [7]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## 2. Configure paths

- **`USE_DRIVE`**: Set to `True` to save the dataset to Google Drive (persists across sessions). Set to `False` to use Colab's local disk (faster, but lost on disconnect).
- **`DRIVE_BASE`**: Base path in Drive when `USE_DRIVE=True`.

In [8]:
from pathlib import Path

USE_DRIVE = True  # Set False to use /content only (faster, but ephemeral)
DRIVE_BASE = "/content/drive/MyDrive/wildfire_dataset"

if USE_DRIVE:
    BASE_DIR = DRIVE_BASE
else:
    BASE_DIR = "/content/wildfire_dataset"

HDF5_DIR = f"{BASE_DIR}/hdf5"    # Converted HDF5 output
ZIP_PATH = f"{BASE_DIR}/WildfireSpreadTS.zip"

!mkdir -p "{BASE_DIR}" "{HDF5_DIR}"
print(f"HDF5_DIR: {HDF5_DIR}")

HDF5_DIR: /content/drive/MyDrive/wildfire_dataset/hdf5


## 3. Download dataset from Zenodo

The dataset is ~48 GB. This may take 30–60+ minutes depending on connection.

In [None]:
ZENODO_URL = "https://zenodo.org/records/8006177/files/WildfireSpreadTS.zip?download=1"

if not Path(ZIP_PATH).exists():
    !wget -O "{ZIP_PATH}" "{ZENODO_URL}"
else:
    print(f"Zip already exists at {ZIP_PATH}, skipping download.")

## 4. Extract the zip archive

Extraction may take 10–20 minutes. The zip contains GeoTIFF files organized by year and fire event.

In [9]:
!unzip -o "{ZIP_PATH}" -d "{BASE_DIR}"
# Zip extracts to BASE_DIR/WildfireSpreadTS/{2018,2019,2020,2021}/
DATA_DIR = f"{BASE_DIR}/WildfireSpreadTS"
print(f"Extraction complete. Data at: {DATA_DIR}")

Archive:  /content/drive/MyDrive/wildfire_dataset/WildfireSpreadTS.zip
   creating: /content/drive/MyDrive/wildfire_dataset/2018/
   creating: /content/drive/MyDrive/wildfire_dataset/2018/fire_21890058/
  inflating: /content/drive/MyDrive/wildfire_dataset/2018/fire_21890058/2018-07-24.tif  
  inflating: /content/drive/MyDrive/wildfire_dataset/2018/fire_21890058/2018-07-16.tif  
  inflating: /content/drive/MyDrive/wildfire_dataset/2018/fire_21890058/2018-07-12.tif  
  inflating: /content/drive/MyDrive/wildfire_dataset/2018/fire_21890058/2018-07-21.tif  
  inflating: /content/drive/MyDrive/wildfire_dataset/2018/fire_21890058/2018-07-17.tif  
  inflating: /content/drive/MyDrive/wildfire_dataset/2018/fire_21890058/2018-07-23.tif  
  inflating: /content/drive/MyDrive/wildfire_dataset/2018/fire_21890058/2018-07-25.tif  
  inflating: /content/drive/MyDrive/wildfire_dataset/2018/fire_21890058/2018-07-15.tif  
  inflating: /content/drive/MyDrive/wildfire_dataset/2018/fire_21890058/2018-07-18.ti

## 5. Clone WildfireSpreadTS and install dependencies

The conversion script lives in the original WildfireSpreadTS repo. We need it for `CreateHDF5Dataset.py`.

In [10]:
!git clone https://github.com/SebastianGer/WildfireSpreadTS.git /content/WildfireSpreadTS
%cd /content/WildfireSpreadTS
!pip install -q rasterio h5py numpy torch tqdm

fatal: destination path '/content/WildfireSpreadTS' already exists and is not an empty directory.
/content/WildfireSpreadTS


## 6. Convert TIF to HDF5

This step reads each fire's GeoTIFFs and writes them to HDF5. HDF5 is ~2× the size but much faster to load during training. Expect 30–60+ minutes.

In [11]:
# Patch FireSpreadDataset.py for compatibility with PyTorch >= 2.x
# T_co was renamed/removed; create an alias so the upstream script still works.
fp = "/content/WildfireSpreadTS/src/dataloader/FireSpreadDataset.py"
txt = open(fp).read()
patched = txt.replace(
    "from torch.utils.data.dataset import T_co",
    "try:\n    from torch.utils.data.dataset import T_co\nexcept ImportError:\n    from torch.utils.data.dataset import _T_co as T_co",
)
open(fp, 'w').write(patched)
print('Patch applied.')


Patch applied.


In [12]:
import os
os.environ["HDF5_USE_FILE_LOCKING"] = "FALSE"

!python src/preprocess/CreateHDF5Dataset.py \
  --data_dir "{DATA_DIR}" \
  --target_dir "{HDF5_DIR}"

0it [00:00, ?it/s]


## 7. Optional: Free disk space

After conversion, you can delete the zip and raw TIFs to free space. Only run this if you're sure the HDF5 conversion completed successfully.

In [None]:
# Uncomment to delete zip and raw data (saves ~50+ GB)
# !rm -f "{ZIP_PATH}"
# !rm -rf "{DATA_DIR}"
# print("Cleanup done.")

## Done

Use `HDF5_DIR` as `--data_dir` when training:

```bash
python train_wildfire.py --data_dir /content/drive/MyDrive/wildfire_dataset/hdf5 --output_dir ./runs/wildfire --load_from_hdf5
```

(Adjust the path if you used a different `DRIVE_BASE` or `USE_DRIVE=False`.)