# WildfireSpreadTS Dataset Exploration

This notebook explores the structure and contents of the WildfireSpreadTS dataset downloaded from Zenodo.

**Dataset Info:**
- Zenodo Record: https://zenodo.org/records/8006177
- DOI: 10.5281/zenodo.8006177
- Coverage: 607 U.S. fire events (Jan 2018 - Oct 2021)
- Format: Multi-temporal, multi-modal remote-sensing data (GeoTIFF files)

**Goals:**
1. Understand dataset structure and file organization
2. Explore .tif file structure and multi-band channels
3. Identify label channels (active fire detections, burned area)
4. Prepare for label extraction and feature engineering for XGBoost training


In [11]:
# Setup and imports
from pathlib import Path
import pandas as pd
import numpy as np
from collections import defaultdict

# Try to import rasterio for reading .tif files
try:
    import rasterio
    HAS_RASTERIO = True
except ImportError:
    HAS_RASTERIO = False
    print("⚠️  rasterio not installed. Install with: pip install rasterio")

# Set dataset directory
PROJECT_ROOT = Path().resolve().parent
DATASET_DIR = PROJECT_ROOT / "data" / "raw" / "wildfirespreadts"

print(f"Dataset directory: {DATASET_DIR}")
print(f"Directory exists: {DATASET_DIR.exists()}")

if not DATASET_DIR.exists():
    print("\n⚠️  Dataset directory not found!")
    print("Please download the dataset first using:")
    print("  python scripts/download_wildfirespreadts.py")


Dataset directory: C:\Users\muron\Documents\ml-wildfire-risk-predictor\data\raw\wildfirespreadts
Directory exists: True


## 1. Dataset Structure Overview


In [12]:
# Find all files and directories
all_files = list(DATASET_DIR.rglob('*')) if DATASET_DIR.exists() else []
files_only = [f for f in all_files if f.is_file()]
dirs_only = [d for d in all_files if d.is_dir()]

print(f"Total files found: {len(files_only)}")
print(f"Total directories found: {len(dirs_only)}")
print(f"\nDataset directory structure:")
print(f"  {DATASET_DIR}")

if files_only:
    # Group files by extension
    by_ext = defaultdict(list)
    total_size = 0
    
    for filepath in files_only:
        size_mb = filepath.stat().st_size / (1024 * 1024)
        ext = filepath.suffix.lower() or '(no extension)'
        by_ext[ext].append({'name': filepath.name, 'size_mb': size_mb, 'path': filepath})
        total_size += size_mb
    
    # Display summary
    print("\n" + "="*70)
    print("Files by Type")
    print("="*70)
    print(f"{'Extension':<20} {'Count':<10} {'Total Size (MB)':<20}")
    print("-"*70)
    
    for ext in sorted(by_ext.keys()):
        count = len(by_ext[ext])
        total_mb = sum(f['size_mb'] for f in by_ext[ext])
        print(f"{ext:<20} {count:<10} {total_mb:>15.2f}")
    
    print("-"*70)
    print(f"{'TOTAL':<20} {len(files_only):<10} {total_size:>15.2f}")
    print("="*70)
else:
    print("\n⚠️  No files found. The dataset may need to be extracted.")


Total files found: 13609
Total directories found: 611

Dataset directory structure:
  C:\Users\muron\Documents\ml-wildfire-risk-predictor\data\raw\wildfirespreadts

Files by Type
Extension            Count      Total Size (MB)     
----------------------------------------------------------------------
.md                  1                     0.00
.pdf                 1                     0.00
.tif                 13607             47010.65
----------------------------------------------------------------------
TOTAL                13609             47010.65


## 2. Top-Level Files


In [13]:
# Show top-level files
top_level_files = [f for f in files_only if f.parent == DATASET_DIR]

if top_level_files:
    print("Top-level files:")
    print("="*70)
    
    file_type_map = {
        '.nc': 'NetCDF (use xarray or netCDF4)',
        '.h5': 'HDF5 (use h5py or xarray)',
        '.hdf5': 'HDF5 (use h5py or xarray)',
        '.hdf': 'HDF/HDF-EOS (use h5py, pyhdf, or xarray)',
        '.zip': 'ZIP archive (extract first)',
        '.tar': 'TAR archive (extract first)',
        '.gz': 'GZIP compressed (may need extraction)',
        '.parquet': 'Parquet (use pandas or pyarrow)',
        '.csv': 'CSV (use pandas)',
        '.json': 'JSON (use json or pandas)'
    }
    
    file_info_list = []
    for filepath in sorted(top_level_files):
        size_mb = filepath.stat().st_size / (1024 * 1024)
        ext = filepath.suffix.lower()
        file_type = file_type_map.get(ext, f'Unknown format ({ext})')
        file_info_list.append({
            'filename': filepath.name,
            'size_mb': size_mb,
            'type': file_type,
            'path': filepath
        })
        print(f"  {filepath.name:<50} {size_mb:>8.2f} MB  ({file_type})")
    
    # Create DataFrame for easier viewing
    df_files = pd.DataFrame(file_info_list)
    print("\n" + "="*70)
else:
    print("No top-level files found.")
    df_files = pd.DataFrame()


Top-level files:
  README.md                                              0.00 MB  (Unknown format (.md))
  WildfireSpreadTS_Documentation.pdf                     0.00 MB  (Unknown format (.pdf))



## 3. Directory Structure


In [14]:
# Show directory structure (first 3 levels)
if dirs_only:
    print("Directory structure (showing first 3 levels):")
    print("="*70)
    
    # Get unique directories
    unique_dirs = sorted(set(d for d in dirs_only if d != DATASET_DIR))
    
    for directory in unique_dirs[:20]:  # Show first 20
        rel_path = directory.relative_to(DATASET_DIR)
        depth = len(rel_path.parts)
        if depth <= 3:
            indent = "  " * depth
            # Count files in this directory
            file_count = len([f for f in files_only if f.parent == directory])
            print(f"{indent}{rel_path.name}/ ({file_count} files)")
    
    if len(unique_dirs) > 20:
        print(f"  ... and {len(unique_dirs) - 20} more directories")
    print("="*70)
else:
    print("No subdirectories found.")


Directory structure (showing first 3 levels):
  2018/ (0 files)
    fire_21458798/ (14 files)
    fire_21458801/ (11 files)
    fire_21458806/ (19 files)
    fire_21458836/ (35 files)
    fire_21458848/ (15 files)
    fire_21459234/ (14 files)
    fire_21459239/ (15 files)
    fire_21459242/ (16 files)
    fire_21459249/ (20 files)
    fire_21459253/ (15 files)
    fire_21538827/ (16 files)
    fire_21615465/ (15 files)
    fire_21615469/ (11 files)
    fire_21617464/ (12 files)
    fire_21688910/ (19 files)
    fire_21688916/ (33 files)
    fire_21690064/ (38 files)
    fire_21690071/ (17 files)
    fire_21690073/ (16 files)
  ... and 591 more directories


## 4. Explore .tif File Structure and Labels

The dataset contains 13,607 GeoTIFF (.tif) files. These are multi-band files containing:
- Weather variables (temperature, humidity, wind, etc.)
- Fuel variables
- Topography
- **Labels**: Active fire detections and burned area

Let's explore the structure of these files to identify which bands contain labels.
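
Before relying on value ranges alone, it can help to check whether the files carry band names. The following is a minimal sketch, assuming the GeoTIFFs store per-band descriptions (this has not been verified here). `read_band_descriptions` and `find_bands_by_keyword` are hypothetical helper names, and the keywords are guesses rather than the dataset's documented band names.

```python
def read_band_descriptions(tif_path):
    """Return the per-band description tuple of a GeoTIFF (entries may be None)."""
    import rasterio  # imported lazily so the helper below works without rasterio

    with rasterio.open(tif_path) as src:
        return src.descriptions


def find_bands_by_keyword(descriptions, keywords=("fire", "burn")):
    """Return 1-based indices of bands whose description contains any keyword."""
    matches = []
    for index, description in enumerate(descriptions, start=1):
        text = (description or "").lower()
        if any(keyword in text for keyword in keywords):
            matches.append(index)
    return matches


# Illustrative input only; these names are made up, not the dataset's real band names.
example_descriptions = ("temperature", None, "Active Fire")
print(find_bands_by_keyword(example_descriptions))
```

If the descriptions turn out to be empty, the documentation PDF remains the authoritative source for the band mapping.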


In [15]:
# Find .tif files
tif_files = [f for f in files_only if f.suffix.lower() in ['.tif', '.tiff']]

print(f"GeoTIFF files found: {len(tif_files)}")
print(f"Total size: {sum(f.stat().st_size for f in tif_files) / (1024**3):.2f} GB")

# Find sample files from different years
def find_sample_files():
    """Find a few sample .tif files from different years."""
    sample_files = []
    for year_dir in sorted(DATASET_DIR.glob("[0-9][0-9][0-9][0-9]")):
        year_tif_files = [f for f in tif_files if f.parent.parent == year_dir or f.parent == year_dir]
        if year_tif_files:
            # Get a file from a fire event directory
            sample_files.append(year_tif_files[0])
            if len(sample_files) >= 3:
                break
    return sample_files

sample_files = find_sample_files()
print(f"\nSample files selected: {len(sample_files)}")
for f in sample_files:
    print(f"  - {f.relative_to(DATASET_DIR)}")


GeoTIFF files found: 13607
Total size: 45.91 GB

Sample files selected: 3
  - 2018\fire_21458798\2018-01-01.tif
  - 2019\fire_22710141\2019-03-07.tif
  - 2020\fire_23654679\2020-01-01.tif


## 5. Explore .tif File Structure

Let's examine sample .tif files to understand their multi-band structure and identify label channels.
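
Functions such as `np.nanmin` and `np.nanmean` emit a `RuntimeWarning` when a band contains no valid values. The sketch below shows one way to compute per-band statistics without triggering those warnings. `summarize_band` is a hypothetical helper, and returning `None` for empty bands is a design choice made for this example.

```python
import numpy as np


def summarize_band(band):
    """Return min/max/mean/std over finite values, or None stats if none exist."""
    values = np.asarray(band, dtype="float64")
    finite = values[np.isfinite(values)]
    if finite.size == 0:
        # An all-NaN band would make np.nanmin/np.nanmean emit RuntimeWarnings.
        return {"min": None, "max": None, "mean": None, "std": None, "valid": 0}
    return {
        "min": float(finite.min()),
        "max": float(finite.max()),
        "mean": float(finite.mean()),
        "std": float(finite.std()),
        "valid": int(finite.size),
    }


empty_band = np.full((2, 2), np.nan, dtype="float32")
mixed_band = np.array([[0.0, 1.0], [np.nan, 3.0]], dtype="float32")
print(summarize_band(empty_band))
print(summarize_band(mixed_band))
```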


In [16]:
def explore_tif_file(file_path):
    """
    Explore a .tif file to understand its structure.
    
    Returns:
        dict with information about file structure
    """
    print(f"\n{'='*70}")
    print(f"Exploring: {file_path.name}")
    print(f"Path: {file_path.relative_to(DATASET_DIR)}")
    print(f"{'='*70}")
    
    if not HAS_RASTERIO:
        print("  ⚠️  rasterio not installed. Cannot read .tif files.")
        print("  Install with: pip install rasterio")
        return None
    
    try:
        with rasterio.open(file_path) as src:
            print(f"  Shape (height, width): {src.shape}")
            print(f"  Number of bands: {src.count}")
            print(f"  CRS: {src.crs}")
            print(f"  Transform: {src.transform}")
            print(f"  Data type: {src.dtypes[0]}")
            
            # Read and analyze each band
            print(f"\n  Band Analysis:")
            print(f"  {'Band':<8} {'Min':<12} {'Max':<12} {'Mean':<12} {'Std':<12} {'Non-null':<12}")
            print(f"  {'-'*70}")
            
            band_info = []
            for i in range(1, min(src.count + 1, 24)):  # Check up to 23 bands (typical for WildfireSpreadTS)
                band = src.read(i)
                band_min = np.nanmin(band)
                band_max = np.nanmax(band)
                band_mean = np.nanmean(band)
                band_std = np.nanstd(band)
                non_null = np.sum(~np.isnan(band))
                
                print(f"  {i:<8} {band_min:<12.2f} {band_max:<12.2f} {band_mean:<12.2f} {band_std:<12.2f} {non_null:<12}")
                
                # Try to identify label bands
                # Active fire detections are typically binary (0/1) or small integers
                # Burned area is typically continuous (0 to some max value)
                band_type = "unknown"
                if band_max <= 1.1 and band_min >= -0.1:
                    if np.all(np.isin(band[~np.isnan(band)], [0, 1])):
                        band_type = "binary (likely fire detection)"
                    else:
                        band_type = "normalized (0-1)"
                elif band_max > 1 and band_max < 1000:
                    band_type = "continuous (possible burned area)"
                elif band_max >= 1000:
                    band_type = "large values (possible raw data)"
                
                band_info.append({
                    'band': i,
                    'min': band_min,
                    'max': band_max,
                    'mean': band_mean,
                    'std': band_std,
                    'type': band_type
                })
            
            if src.count > 23:
                print(f"  ... ({src.count - 23} more bands)")
            
            return {
                'bands': src.count,
                'shape': src.shape,
                'dtype': str(src.dtypes[0]),
                'band_info': band_info
            }
    except Exception as e:
        print(f"  ⚠️  Error reading file: {e}")
        return None

# Explore sample files
if sample_files and HAS_RASTERIO:
    file_info = []
    for file_path in sample_files:
        info = explore_tif_file(file_path)
        if info:
            info['file'] = file_path.name
            info['year'] = file_path.parent.name if file_path.parent.name.isdigit() else file_path.parent.parent.name
            file_info.append(info)
    
    # Summary
    if file_info:
        print(f"\n{'='*70}")
        print("Summary")
        print(f"{'='*70}")
        print(f"\nFiles explored: {len(file_info)}")
        print(f"Average bands per file: {np.mean([f['bands'] for f in file_info]):.1f}")
        print(f"Band count range: {min([f['bands'] for f in file_info])} - {max([f['bands'] for f in file_info])}")
        
        # Try to identify potential label bands
        print(f"\n{'='*70}")
        print("Potential Label Bands (based on data characteristics):")
        print(f"{'='*70}")
        print("Look for:")
        print("  - Binary bands (0/1): Likely active fire detections → P-model target")
        print("  - Continuous bands (0 to moderate values): Likely burned area → A-model target")
        print("  - Other bands: Weather, fuel, topography features")
elif not HAS_RASTERIO:
    print("\n⚠️  Cannot explore .tif files without rasterio.")
    print("Install with: pip install rasterio")
else:
    print("\n⚠️  No sample files found to explore.")



Exploring: 2018-01-01.tif
Path: 2018\fire_21458798\2018-01-01.tif
  Shape (height, width): (304, 247)
  Number of bands: 23
  CRS: EPSG:32610
  Transform: | 375.00, 0.00, 697500.00|
| 0.00,-375.00, 4136625.00|
| 0.00, 0.00, 1.00|
  Data type: float32

  Band Analysis:
  Band     Min          Max          Mean         Std          Non-null    
  ----------------------------------------------------------------------
  1        326.00       5257.00      2193.27      652.59       75088       
  2        715.00       6665.00      2768.31      787.28       75088       
  3        192.00       4190.00      1642.03      531.64       75088       
  4        -5354.00     9246.00      3342.11      1272.54      75088       
  5        -1212.00     9577.00      1791.95      804.91       75088       
  6        0.00         0.00         0.00         0.00         75088       
  7        0.60         1.90         0.96         0.17         75088       
  8        48.00        311.00       232.52      

  band_min = np.nanmin(band)
  band_max = np.nanmax(band)
  band_mean = np.nanmean(band)
  var = nanvar(a, axis=axis, dtype=dtype, out=out, ddof=ddof,


## 6. Label Channel Identification

Based on the exploration above, we need to identify which bands contain:
1. **Active fire detections** (binary: 0/1) → P-model target (`ignition`)
2. **Burned area** (continuous: hectares/km²) → A-model target (`log_burned_area`)

**Note**: WildfireSpreadTS documentation mentions 23 multi-modal channels. Check the documentation PDF for the exact band mapping.
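
Once the label band is known, both targets could be derived roughly as follows. This is a sketch under stated assumptions, not the dataset's own definition: it assumes any finite value above zero in the active-fire band marks a detection, and it approximates burned area as the number of detected pixels times the pixel area. The 375 m pixel size comes from the affine transform of the sample file, and `make_targets` is a hypothetical helper name.

```python
import numpy as np

PIXEL_SIZE_M = 375.0  # pixel size from the sample file's affine transform
PIXEL_AREA_KM2 = (PIXEL_SIZE_M / 1000.0) ** 2  # 0.140625 km² per pixel


def make_targets(fire_band):
    """Derive (ignition, log_burned_area) from one active-fire band.

    Assumption: any finite value > 0 marks a fire detection.
    """
    band = np.asarray(fire_band, dtype="float64")
    detected = np.isfinite(band) & (band > 0)
    n_pixels = int(detected.sum())
    ignition = int(n_pixels > 0)
    burned_area_km2 = n_pixels * PIXEL_AREA_KM2
    # log1p keeps zero-area samples finite (log1p(0) == 0).
    return ignition, float(np.log1p(burned_area_km2))


example = np.array([[0.0, np.nan, 1.0], [0.0, 1.0, 0.0]])
print(make_targets(example))
```

Using `log1p` rather than a plain logarithm avoids negative infinity for samples with no detected fire pixels.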


In [17]:
# Check if we can identify label bands from the data
if file_info and len(file_info) > 0:
    print("="*70)
    print("Label Band Identification Guide")
    print("="*70)
    print("""
Based on WildfireSpreadTS documentation and typical wildfire datasets:

**Expected Structure (23 channels):**
- Weather variables: Temperature, humidity, wind, precipitation
- Fuel variables: Fuel moisture, fuel type
- Topography: Elevation, slope, aspect
- Vegetation: NDVI, vegetation indices
- **Labels** (typically in specific bands):
  - Active fire detections: Binary (0/1) or small integers
  - Burned area: Continuous values (hectares or km²)

**Next Steps:**
1. Check WildfireSpreadTS_Documentation.pdf for exact band mapping
2. Extract labels from identified bands
3. Match labels to embeddings (by filename/date/location)
4. Create target variables:
   - `ignition`: Binary (1 if fire detected, 0 otherwise)
   - `log_burned_area`: Continuous (log-transform burned area)

**Note**: Labels are described as:
- "Highly imbalanced" (most cells have no fire)
- "Noisy" (due to smoke, clouds, detection inaccuracies)
- May need filtering/cleaning before training
    """)
    
    # Show band statistics summary
    if file_info[0].get('band_info'):
        print("\n" + "="*70)
        print("Band Statistics Summary (from first file):")
        print("="*70)
        df_bands = pd.DataFrame(file_info[0]['band_info'])
        print(df_bands.to_string(index=False))
else:
    print("Run the previous cell to explore .tif file structure first.")


Label Band Identification Guide

Based on WildfireSpreadTS documentation and typical wildfire datasets:

**Expected Structure (23 channels):**
- Weather variables: Temperature, humidity, wind, precipitation
- Fuel variables: Fuel moisture, fuel type
- Topography: Elevation, slope, aspect
- Vegetation: NDVI, vegetation indices
- **Labels** (typically in specific bands):
  - Active fire detections: Binary (0/1) or small integers
  - Burned area: Continuous values (hectares or km²)

**Next Steps:**
1. Check WildfireSpreadTS_Documentation.pdf for exact band mapping
2. Extract labels from identified bands
3. Match labels to embeddings (by filename/date/location)
4. Create target variables:
   - `ignition`: Binary (1 if fire detected, 0 otherwise)
   - `log_burned_area`: Continuous (log-transform burned area)

**Note**: Labels are described as:
- "Highly imbalanced" (most cells have no fire)
- "Noisy" (due to smoke, clouds, detection inaccuracies)
- May need filtering/cleaning before training

## 7. Documentation Files

Check documentation files for band mapping and label information.


In [18]:
# Find documentation files
doc_keywords = ['readme', 'doc', 'documentation', 'guide', 'info']
doc_files = [f for f in files_only if any(kw in f.name.lower() for kw in doc_keywords)]

if doc_files:
    print("Documentation files found:")
    print("="*70)
    for doc_file in doc_files:
        size_mb = doc_file.stat().st_size / (1024 * 1024)
        print(f"  - {doc_file.relative_to(DATASET_DIR)} ({size_mb:.2f} MB)")
        
        # Try to read text files
        if doc_file.suffix.lower() in ['.txt', '.md', '.rst']:
            try:
                with open(doc_file, 'r', encoding='utf-8', errors='ignore') as f:
                    content = f.read(1000)  # First 1000 chars
                    print(f"    Preview: {content[:200]}...")
            except OSError as e:
                print(f"    Could not read file: {e}")
        elif doc_file.suffix.lower() == '.pdf':
            print(f"    → Open this PDF to find band mapping and label information")
    
    print("\n" + "="*70)
    print("Important: Check WildfireSpreadTS_Documentation.pdf for:")
    print("  - Exact band mapping (which band = which variable)")
    print("  - Label channel indices (active fire, burned area)")
    print("  - Data format and coordinate system")
    print("="*70)
else:
    print("No documentation files found.")


Documentation files found:
  - README.md (0.00 MB)
    Preview: # WildfireSpreadTS Dataset

This directory contains the WildfireSpreadTS dataset downloaded from Zenodo.

## Dataset Information

- **Name**: WildfireSpreadTS
- **Zenodo Record**: https://zenodo.org/r...
  - WildfireSpreadTS_Documentation.pdf (0.00 MB)
    → Open this PDF to find band mapping and label information

Important: Check WildfireSpreadTS_Documentation.pdf for:
  - Exact band mapping (which band = which variable)
  - Label channel indices (active fire, burned area)
  - Data format and coordinate system


## 8. Summary

Exploration summary and current status.


In [19]:
# Summary and recommendations
print("="*70)
print("EXPLORATION SUMMARY")
print("="*70)
print(f"\nDataset location: {DATASET_DIR}")
print(f"Total files: {len(files_only)}")
print(f"Total size: {sum(f.stat().st_size for f in files_only) / (1024**3):.2f} GB")

# Check for .tif files
tif_files = [f for f in files_only if f.suffix.lower() in ['.tif', '.tiff']]
if tif_files:
    print(f"\n📸 GeoTIFF files detected ({len(tif_files)} files)")
    print("  → Multi-band files with weather, fuel, topography, and labels")
    print("  → Can extract CNN embeddings: python src/data/extract_wildfirespreadts_embeddings.py")
    print("  → Need to extract labels from specific bands for training")

# Check if embeddings already extracted
embeddings_path = PROJECT_ROOT / "data" / "processed" / "wildfirespreadts_embeddings.parquet"
if embeddings_path.exists():
    try:
        df_emb = pd.read_parquet(embeddings_path)
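        print(f"\n✓ CNN Embeddings already extracted ({len(df_emb):,} rows)")
        print("  → Location: data/processed/wildfirespreadts_embeddings.parquet")
        print("  → Next: Extract labels and combine with embeddings")
    except Exception as e:
        # Report unreadable parquet files instead of silently ignoring them
        print(f"\n⚠️  Could not read embeddings file: {e}")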
else:
    print(f"\n⚠️  CNN Embeddings not yet extracted")
    print("  → Run: python src/data/extract_wildfirespreadts_embeddings.py")

print("\n" + "="*70)
print("CURRENT STATUS")
print("="*70)
print("""
✅ Dataset downloaded and organized
✅ File structure explored
✅ .tif file structure analyzed
⏳ Labels need to be extracted from .tif bands
⏳ Labels need to be matched with embeddings
⏳ Combined features file needs to be created
⏳ Ready for data splitting and model training after label extraction
""")
print("="*70)


EXPLORATION SUMMARY

Dataset location: C:\Users\muron\Documents\ml-wildfire-risk-predictor\data\raw\wildfirespreadts
Total files: 13609
Total size: 45.91 GB

📸 GeoTIFF files detected (13607 files)
  → Multi-band files with weather, fuel, topography, and labels
  → Can extract CNN embeddings: python src/data/extract_wildfirespreadts_embeddings.py
  → Need to extract labels from specific bands for training

✓ CNN Embeddings already extracted (27,214 rows)
  → Location: data/processed/wildfirespreadts_embeddings.parquet
  → Next: Extract labels and combine with embeddings

CURRENT STATUS

✅ Dataset downloaded and organized
✅ File structure explored
✅ .tif file structure analyzed
⏳ Labels need to be extracted from .tif bands
⏳ Labels need to be matched with embeddings
⏳ Combined features file needs to be created
⏳ Ready for data splitting and model training after label extraction
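
Matching labels with embeddings needs a shared key. One option, sketched below, is to derive it from the `<year>/fire_<id>/<YYYY-MM-DD>.tif` layout seen in the sample paths. `parse_sample_key` is a hypothetical helper, and whether the embeddings file carries matching columns is an assumption that still has to be checked.

```python
from datetime import date
from pathlib import PureWindowsPath


def parse_sample_key(relative_path):
    """Split '<year>/fire_<id>/<YYYY-MM-DD>.tif' into (year, fire_id, date)."""
    # PureWindowsPath accepts both '\\' and '/' as separators.
    parts = PureWindowsPath(relative_path).parts
    if len(parts) != 3:
        raise ValueError(f"Unexpected path layout: {relative_path!r}")
    year_dir, fire_dir, file_name = parts
    if not fire_dir.startswith("fire_"):
        raise ValueError(f"Unexpected fire directory: {fire_dir!r}")
    fire_id = fire_dir[len("fire_"):]
    # The file stem is expected to be an ISO date such as 2018-01-01.
    observation_date = date.fromisoformat(PureWindowsPath(file_name).stem)
    return int(year_dir), fire_id, observation_date


print(parse_sample_key("2018\\fire_21458798\\2018-01-01.tif"))
print(parse_sample_key("2019/fire_22710141/2019-03-07.tif"))
```

The resulting tuple could serve as a join key in pandas, provided the embeddings table is keyed the same way.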

