# 01 - Exploration: District Prediction

Load pre-computed embeddings and add district labels from locality data.

**Tasks:**
- Load pre-computed embeddings from `data/processed/`
- Load original eBird data to get locality information
- Extract district labels from locality names
- Match embeddings to districts using sampling IDs
- Explore district distribution
- Save embeddings with district labels
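
The first task can be sketched as follows. This is a minimal sketch that assumes the archive uses the `embeddings` and `checklist_ids` keys written by the save cell later in this notebook; the helper `load_embeddings` and the demo file name are hypothetical. A throwaway archive is written first so the example runs on its own.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd


def load_embeddings(npz_path):
    """Return embeddings as a DataFrame indexed by checklist ID (hypothetical helper)."""
    with np.load(npz_path) as archive:
        embeddings = archive["embeddings"]
        checklist_ids = archive["checklist_ids"]
    # Name the latent dimensions z0, z1, ... for readability.
    columns = [f"z{i}" for i in range(embeddings.shape[1])]
    index = pd.Index(checklist_ids, name="checklist_id")
    return pd.DataFrame(embeddings, index=index, columns=columns)


# Demo with a throwaway archive; real files would live under data/processed/.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "embeddings_demo.npz"
    np.savez(
        demo_path,
        embeddings=np.arange(6, dtype=float).reshape(3, 2),
        checklist_ids=np.array(["S1", "S2", "S3"]),
    )
    emb_df = load_embeddings(demo_path)

print(emb_df.shape)
```

Indexing the frame by checklist ID makes the later match to district labels a plain index lookup rather than a positional one.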

## Setup Paths

In [3]:
import sys
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

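# Add the project root to sys.path so the shared `src` package (the backbone) is importable
project_root = Path.cwd().parent.parent.parent  # Three levels up from this notebook's folder: the repository root
sys.path.insert(0, str(project_root))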

print(f"Project root: {project_root}")
print(f"Current project: {Path.cwd().parent.name}")

Project root: c:\Users\Arnav\Documents\Python Scripts\bird embeddings
Current project: district_prediction


## Load eBird Data

In [4]:
from src.data import load_ebird_data

# Load shared eBird data
data_path = project_root / 'data' / 'raw' / 'ebd_IN-KL_smp_relSep-2025.txt'
df = load_ebird_data(str(data_path), nrows=50000)  # Adjust nrows as needed

print(f"Loaded {len(df)} observations")
print(f"Columns: {df.columns.tolist()}")
df.head()

✓ Loaded eBird data from: ebd_IN-KL_smp_relSep-2025.txt
  Rows: 50,000
  Columns: 53
  (Limited to first 50,000 rows)
Loaded 50000 observations
Columns: ['GLOBAL UNIQUE IDENTIFIER', 'LAST EDITED DATE', 'TAXONOMIC ORDER', 'CATEGORY', 'TAXON CONCEPT ID', 'COMMON NAME', 'SCIENTIFIC NAME', 'SUBSPECIES COMMON NAME', 'SUBSPECIES SCIENTIFIC NAME', 'EXOTIC CODE', 'OBSERVATION COUNT', 'BREEDING CODE', 'BREEDING CATEGORY', 'BEHAVIOR CODE', 'AGE/SEX', 'COUNTRY', 'COUNTRY CODE', 'STATE', 'STATE CODE', 'COUNTY', 'COUNTY CODE', 'IBA CODE', 'BCR CODE', 'USFWS CODE', 'ATLAS BLOCK', 'LOCALITY', 'LOCALITY ID', 'LOCALITY TYPE', 'LATITUDE', 'LONGITUDE', 'OBSERVATION DATE', 'TIME OBSERVATIONS STARTED', 'OBSERVER ID', 'OBSERVER ORCID ID', 'SAMPLING EVENT IDENTIFIER', 'OBSERVATION TYPE', 'PROTOCOL NAME', 'PROTOCOL CODE', 'PROJECT NAMES', 'PROJECT IDENTIFIERS', 'DURATION MINUTES', 'EFFORT DISTANCE KM', 'EFFORT AREA HA', 'NUMBER OBSERVERS', 'ALL SPECIES REPORTED', 'GROUP IDENTIFIER', 'HAS MEDIA', 'APPROVED', '

| | GLOBAL UNIQUE IDENTIFIER | LAST EDITED DATE | TAXONOMIC ORDER | CATEGORY | TAXON CONCEPT ID | COMMON NAME | SCIENTIFIC NAME | SUBSPECIES COMMON NAME | SUBSPECIES SCIENTIFIC NAME | EXOTIC CODE | ... | NUMBER OBSERVERS | ALL SPECIES REPORTED | GROUP IDENTIFIER | HAS MEDIA | APPROVED | REVIEWED | REASON | CHECKLIST COMMENTS | SPECIES COMMENTS | Unnamed: 52 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | URN:CornellLabOfOrnithology:EBIRD:OBS529729760 | 2024-04-28 17:19:40.106716 | 11963 | species | avibase-B9538B77 | Oriental Hobby | Falco severus | | | | ... | | 0 | | 0 | 1 | 1 | | Submitted as rarity upload. Compiled by Abhina... | Observer&#61;Bourdillon--Reference&#61;http://... | |
| 1 | URN:CornellLabOfOrnithology:EBIRD:OBS444817753 | 2024-04-28 15:42:40.756927 | 5833 | species | avibase-D348C2D6 | Sociable Lapwing | Vanellus gregarius | | | | ... | | 0 | | 0 | 1 | 1 | | Submitted as rarity upload. Compiled by Abhina... | Observer= HS Fergusson--https://www.biodiversi... | |
| 2 | URN:CornellLabOfOrnithology:EBIRD:OBS444815350 | 2024-04-28 15:42:40.756927 | 6265 | species | avibase-236D9272 | Indian Courser | Cursorius coromandelicus | | | | ... | | 0 | | 0 | 1 | 1 | | Submitted as rarity upload. Compiled by Abhina... | Observer&#61; Museum collector--Reference&#61;... | |
| 3 | URN:CornellLabOfOrnithology:EBIRD:OBS444815641 | 2024-04-28 15:42:40.756927 | 6265 | species | avibase-236D9272 | Indian Courser | Cursorius coromandelicus | | | | ... | | 0 | | 0 | 1 | 1 | | Submitted as rarity upload. Compiled by Abhina... | Observer&#61; Museum collector--Reference&#61;... | |
| 4 | URN:CornellLabOfOrnithology:EBIRD:OBS444815825 | 2024-04-28 15:42:40.756927 | 6265 | species | avibase-236D9272 | Indian Courser | Cursorius coromandelicus | | | | ... | | 0 | | 0 | 1 | 1 | | Submitted as rarity upload. Compiled by Abhina... | Observer&#61; Museum collector--Reference&#61;... | |


## Explore Data

In [None]:
# Quick sanity checks on the loaded sample; extend with your own exploration as needed
print(f"Unique checklists: {df['SAMPLING EVENT IDENTIFIER'].nunique()}")
print(f"Unique species: {df['COMMON NAME'].nunique()}")
print(f"Date range: {df['OBSERVATION DATE'].min()} to {df['OBSERVATION DATE'].max()}")
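
One of the listed tasks is exploring the district distribution. A minimal sketch follows, assuming the `COUNTY` column carries the district name for this Kerala extract (worth verifying against `LOCALITY` before relying on it). The toy frame stands in for `df`, and `district_distribution` is a hypothetical helper.

```python
import pandas as pd


def district_distribution(observations, district_col="COUNTY"):
    """Count unique checklists per district, most frequent first."""
    # Count each checklist once, since one checklist spans many observation rows.
    checklists = observations.drop_duplicates("SAMPLING EVENT IDENTIFIER")
    counts = checklists[district_col].value_counts(dropna=False)
    return counts.rename("n_checklists")


toy = pd.DataFrame({
    "SAMPLING EVENT IDENTIFIER": ["S1", "S1", "S2", "S3", "S4"],
    "COUNTY": ["Idukki", "Idukki", "Idukki", "Wayanad", None],
})
dist = district_distribution(toy)
print(dist)
```

Passing `dropna=False` keeps checklists without a district visible in the counts, which helps decide how to handle them before training a classifier.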

## Preprocess Data

In [None]:
from src.data import EBirdPreprocessor

# Preprocess using backbone
preprocessor = EBirdPreprocessor()
processed_df = preprocessor.fit_transform(df)

print(f"Processed shape: {processed_df.shape}")
print(f"Features (species): {processed_df.shape[1]}")
processed_df.head()
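
`EBirdPreprocessor` is not shown in this notebook, so the following is only a guess at the shape of its output: one row per checklist and one column per species, as the "Features (species)" print suggests. The sketch builds such a presence/absence matrix with plain pandas on toy data; it is not the backbone's implementation.

```python
import pandas as pd

toy = pd.DataFrame({
    "SAMPLING EVENT IDENTIFIER": ["S1", "S1", "S2", "S2", "S2"],
    "COMMON NAME": [
        "House Crow", "Common Myna", "House Crow", "House Crow", "Indian Courser",
    ],
})

# One row per checklist, one column per species, 1 if the species was reported.
presence = (
    pd.crosstab(toy["SAMPLING EVENT IDENTIFIER"], toy["COMMON NAME"])
    .clip(upper=1)          # repeated rows for a species still count as presence
    .astype("int8")
)
print(presence)
```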

## Extract Embeddings

In [None]:
from src.inference import EmbeddingExtractor

# Load shared VAE model
model_path = project_root / 'models' / 'vae_model_inference_ready.pth'
extractor = EmbeddingExtractor(str(model_path), device='cpu')

# Extract embeddings
embeddings = extractor.extract_embeddings(processed_df, use_mean=True)

print(f"Embeddings shape: {embeddings.shape}")
print(f"Latent dimensions: {embeddings.shape[1]}")

## Add Project-Specific Features/Labels

**TODO**: Add your project-specific code here.

Examples:
- Extract labels from location/date/other columns
- Compute additional features
- Filter data for specific analysis

In [None]:
# Example: add labels
# Note: df.loc[...] only works if df and processed_df share the same index.
# labels = df.loc[processed_df.index, 'YOUR_LABEL_COLUMN']

# Your code here
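
For this project the labels are districts. Below is a minimal sketch of attaching one district per checklist, assuming `processed_df` is indexed by `SAMPLING EVENT IDENTIFIER` (suggested by the save cell, which stores `processed_df.index` as `checklist_ids`) and that `COUNTY` holds the district. Toy frames stand in for `df` and `processed_df.index`; `district_labels_for` is a hypothetical helper.

```python
import pandas as pd


def district_labels_for(observations, checklist_ids, district_col="COUNTY"):
    """Return one district label per checklist ID, in the order given."""
    per_checklist = (
        observations
        .drop_duplicates("SAMPLING EVENT IDENTIFIER")
        .set_index("SAMPLING EVENT IDENTIFIER")[district_col]
    )
    # reindex keeps the embedding row order and yields NaN for unmatched IDs.
    return per_checklist.reindex(checklist_ids)


toy_df = pd.DataFrame({
    "SAMPLING EVENT IDENTIFIER": ["S2", "S1", "S1", "S3"],
    "COUNTY": ["Wayanad", "Idukki", "Idukki", "Kollam"],
})
toy_index = pd.Index(["S1", "S2", "S9"])  # S9 has no matching observation

labels = district_labels_for(toy_df, toy_index)
keep = labels.notna().to_numpy()  # boolean mask, usable on the embeddings rows too
print(labels.tolist(), keep.tolist())
```

Applying the same `keep` mask to both the labels and the embeddings array keeps the two aligned after unlabeled checklists are dropped.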

## Save Processed Data

In [None]:
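# Save embeddings and metadata to the project folder
save_path = Path('../data/processed/embeddings.npz')
save_path.parent.mkdir(parents=True, exist_ok=True)  # np.savez does not create missing directories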

np.savez(
    save_path,
    embeddings=embeddings,
    # Add your labels/metadata here
    # labels=labels.values,
    checklist_ids=processed_df.index.values
)

print(f"✓ Saved processed data to {save_path}")
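
One detail worth checking once labels are added: NumPy pickles object-dtype arrays inside an `.npz`, and `np.load` refuses to read them unless `allow_pickle=True` is passed. The sketch below uses toy arrays and a hypothetical `districts` key, and casts string arrays to a fixed-width Unicode dtype before saving so the archive loads with the defaults.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

embeddings = np.zeros((3, 4), dtype=np.float32)
labels = pd.Series(["Idukki", "Wayanad", "Idukki"], index=["S1", "S2", "S3"])

with tempfile.TemporaryDirectory() as tmp_dir:
    save_path = Path(tmp_dir) / "processed" / "embeddings_with_districts.npz"
    save_path.parent.mkdir(parents=True, exist_ok=True)
    np.savez(
        save_path,
        embeddings=embeddings,
        # astype(str) turns object arrays into fixed-width Unicode, so no pickling.
        districts=labels.to_numpy().astype(str),
        checklist_ids=labels.index.to_numpy().astype(str),
    )
    # Loads without allow_pickle=True because no object arrays were stored.
    with np.load(save_path) as archive:
        loaded = {key: archive[key] for key in archive.files}

print(loaded["districts"].dtype.kind)
```

Note that `astype(str)` also turns missing labels into the literal string `'nan'`, so unlabeled rows are best filtered out before saving.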

## Summary

**TODO**: Summarize what you found in this exploration.