# NB02: Interactive Exploration — AlphaEarth Embeddings, Geography & Environment

**Can run on JupyterHub or locally** after `pip install -r ../requirements.txt`.

This notebook takes the extracted data from NB01 and performs interactive analysis with plotly visualizations. The goal is to characterize the AlphaEarth embedding space and understand what these 64-dimensional satellite-derived vectors actually capture.

## Questions we're exploring

1. **Coverage**: Which metadata attributes are available, and in what combinations?
2. **Data quality**: Are the lat/lon coordinates trustworthy, or do some refer to institutions rather than sampling sites?
3. **Environment harmonization**: Can we map 5,774 free-text isolation source values into meaningful categories?
4. **Embedding structure**: What does the 64-dim space look like when projected to 2D? Do clusters correspond to environments?
5. **Geography–embedding relationship**: Does geographic proximity predict embedding similarity?
6. **Cluster–environment correspondence**: Do UMAP clusters map onto environment types?

## Interactive figures

All plotly figures in this notebook are **interactive** — hover for tooltips, click legend entries to toggle, use the toolbar to zoom/pan/select. Static PNGs are also saved to `../figures/` for documentation.

In [None]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.preprocessing import normalize
from sklearn.cluster import DBSCAN
import umap
import os
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)

DATA_DIR = '../data'
FIG_DIR = '../figures'
os.makedirs(FIG_DIR, exist_ok=True)

EMB_COLS = [f'A{i:02d}' for i in range(64)]

# Helper to save both PNG and interactive HTML
def save_fig(fig, name):
    """Save a plotly figure as PNG and HTML."""
    # Static PNG export via write_image needs the kaleido package installed
    fig.write_image(os.path.join(FIG_DIR, f'{name}.png'), scale=2)
    fig.write_html(os.path.join(FIG_DIR, f'{name}.html'))
    print(f'Saved {FIG_DIR}/{name}.png + .html')

In [None]:
# Load data from NB01
df = pd.read_csv(os.path.join(DATA_DIR, 'alphaearth_with_env.csv'))
coverage = pd.read_csv(os.path.join(DATA_DIR, 'coverage_stats.csv'))
attr_counts = pd.read_csv(os.path.join(DATA_DIR, 'ncbi_env_attribute_counts.csv'))
iso_counts = pd.read_csv(os.path.join(DATA_DIR, 'isolation_source_raw_counts.csv'))

print(f'Loaded {len(df):,} genomes with {len(df.columns)} columns')
print(f'Unique isolation_source values: {len(iso_counts):,}')

---
## 1. Coverage Overview

Not all genomes have all metadata fields. Understanding the overlap between attributes is important for knowing which analyses are feasible and what biases might exist.

We show two views:
- A **bar chart** of per-attribute population rates
- An **intersection chart** showing which combinations of attributes co-occur (UpSet-style)

In [None]:
fig_cov = px.bar(
    coverage.sort_values('pct_of_alphaearth', ascending=True),
    x='pct_of_alphaearth', y='attribute', orientation='h', text='n_genomes',
    title='NCBI Environment Attribute Population Rates (AlphaEarth genomes)',
    labels={'pct_of_alphaearth': '% of AlphaEarth genomes', 'attribute': ''},
)
fig_cov.update_traces(texttemplate='%{text:,}', textposition='outside')
fig_cov.update_layout(width=700, height=400)
save_fig(fig_cov, 'coverage_bar')
fig_cov.show()

In [None]:
# Intersection analysis: which attribute combinations co-occur?
upset_cols = {
    'Lat/Lon': df['cleaned_lat'].notna() & df['cleaned_lon'].notna(),
    'Isolation Source': df['isolation_source'].notna(),
    'Env Broad Scale': df['env_broad_scale'].notna(),
    'Host': df['host'].notna(),
}
upset_df = pd.DataFrame(upset_cols)

combos = upset_df.groupby(list(upset_df.columns)).size().reset_index(name='count')
combos = combos.sort_values('count', ascending=False)
combos['label'] = combos.apply(
    lambda r: ' + '.join([c for c in upset_df.columns if r[c]]) or 'None', axis=1
)

fig_int = px.bar(
    combos.head(12), x='count', y='label', orientation='h', text='count',
    title='Top Metadata Attribute Combinations (UpSet-style)',
    labels={'count': 'Number of genomes', 'label': ''},
)
fig_int.update_traces(texttemplate='%{text:,}', textposition='outside')
fig_int.update_layout(width=800, height=450, yaxis={'categoryorder': 'total ascending'})
save_fig(fig_int, 'coverage_intersections')
fig_int.show()

### Interpretation

The largest group has **all four** attributes (Lat/Lon + Isolation Source + Env Broad Scale + Host) — these are the best-annotated genomes. But a substantial fraction has only Lat/Lon + Isolation Source without ENVO ontology terms or host information. This means `isolation_source` will be our primary environment classifier for most genomes.

---
## 2. Coordinate Quality Control

Geographic coordinates in NCBI are self-reported by submitters. Common problems include:

- **Institutional addresses**: Genomes sequenced at a university may have the lab's coordinates instead of the sampling site. This shows up as many diverse species at the exact same location.
- **Low-precision coordinates**: Integer-degree values (e.g., 39.0, -77.0) suggest approximate or rounded locations.
- **Legitimate sampling sites**: Some locations (e.g., DOE field sites like Rifle, CO or ENIGMA sites) genuinely have many diverse samples collected at the same coordinates. These should not be flagged as suspicious.

Our heuristic flags coordinates as **suspicious** if they have >50 genomes AND >10 different species at the exact same location. This is a rough first pass — some flagged sites (like Rifle) are legitimate field research locations with intensive sampling campaigns. A more refined approach would check whether the genomes at each location have consistent isolation sources (homogeneous = real site) vs diverse unrelated sources (heterogeneous = institutional).

We also flag **integer-degree coordinates** as low precision.
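
The refinement suggested above (checking whether the isolation sources at a location are homogeneous) could be sketched roughly as follows. This is an illustration under assumptions, not part of the notebook's pipeline: the helper `source_homogeneity`, the column `top_source_share`, and the 0.5 cutoff are all hypothetical, and the toy frame stands in for `coords`.

```python
import pandas as pd

def source_homogeneity(frame, key='coord_key', source='isolation_source'):
    """Per location: genome count and share of genomes with the most common source."""
    def top_share(s):
        s = s.dropna()
        if s.empty:
            return float('nan')
        # value_counts sorts descending, so the first entry is the dominant source
        return s.value_counts(normalize=True).iloc[0]

    grouped = frame.groupby(key)[source]
    return pd.DataFrame({
        'n_genomes': grouped.size(),
        'top_source_share': grouped.apply(top_share),
    })

# Toy data: one location with a single source, one with four unrelated sources
toy = pd.DataFrame({
    'coord_key': ['39.54,-107.78'] * 4 + ['40.44,-79.97'] * 4,
    'isolation_source': ['groundwater'] * 4 + ['blood', 'urine', 'sputum', 'wound'],
})
homog = source_homogeneity(toy)
# Hypothetical cutoff: a dominant source covering at least half the genomes
homog['likely_field_site'] = homog['top_source_share'] >= 0.5
print(homog)
```

In practice the raw strings at a real field site vary ("Rifle well CD01 at 16ft depth"), so applying this to the harmonized category rather than the raw `isolation_source` would probably work better.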

In [None]:
has_coords = df['cleaned_lat'].notna() & df['cleaned_lon'].notna()
coords = df[has_coords].copy()
print(f'Genomes with lat/lon: {len(coords):,}')

# Round for duplicate detection (4 decimal places ~ 11m precision)
coords['lat_round'] = coords['cleaned_lat'].round(4)
coords['lon_round'] = coords['cleaned_lon'].round(4)
coords['coord_key'] = coords['lat_round'].astype(str) + ',' + coords['lon_round'].astype(str)

print(f'Unique coordinate locations: {coords["coord_key"].nunique():,}')

In [None]:
# Examine the most-shared coordinates
print('Top 15 most-shared coordinates:')
print(f'{"Lat":>10} {"Lon":>10} {"Genomes":>8} {"Species":>8}  Isolation sources')
print('-' * 80)
for ck, cnt in coords['coord_key'].value_counts().head(15).items():
    mask = coords['coord_key'] == ck
    n_sp = coords.loc[mask, 'species'].nunique()
    iso_vals = coords.loc[mask, 'isolation_source'].dropna().unique()
    iso_str = ', '.join(str(v) for v in iso_vals[:3])[:55]
    lat, lon = ck.split(',')
    print(f'{lat:>10} {lon:>10} {cnt:>8,} {n_sp:>8}  {iso_str}')

Several of these are recognizable research sites:
- **(39.54, -107.78)**: Rifle, CO — DOE IFRC groundwater research site with extensive metagenomic sampling
- **(48.36, -123.30)**: Saanich Inlet, BC — oceanographic time series with oxygen-minimum zone sampling
- **(52.11, 79.17)**: Siberian soda lakes — extremophile sampling campaigns

Others like **(-12.0, -77.0)** (Lima, Peru — integer coordinates, single species) and **(40.44, -79.97)** (Pittsburgh — diverse clinical samples) look more like institutional coordinates.

In [None]:
# Build quality flags
coord_genome_count = coords.groupby('coord_key').size()
coord_species_count = coords.groupby('coord_key')['species'].nunique()

suspicious_set = set(
    coord_genome_count[
        (coord_genome_count > 50) & (coord_species_count > 10)
    ].index
)

is_int_lat = (coords['cleaned_lat'] % 1 == 0)
is_int_lon = (coords['cleaned_lon'] % 1 == 0)

coords['coord_quality'] = 'good'
coords.loc[is_int_lat & is_int_lon, 'coord_quality'] = 'low_precision'
coords.loc[coords['coord_key'].isin(suspicious_set), 'coord_quality'] = 'suspicious_cluster'

# Propagate to main dataframe
df['coord_quality'] = 'no_coords'
df.loc[coords.index, 'coord_quality'] = coords['coord_quality']

print('Coordinate quality distribution:')
for q, n in df['coord_quality'].value_counts().items():
    print(f'  {q}: {n:,} ({100*n/len(df):.1f}%)')
print(f'\nSuspicious cluster locations: {len(suspicious_set)}')
print(f'\nNote: Some "suspicious" locations are legitimate field sites (Rifle, Saanich Inlet, etc.).')
print('A refined heuristic should check isolation_source homogeneity at each location.')

In [None]:
fig_qc = px.scatter_geo(
    coords, lat='cleaned_lat', lon='cleaned_lon', color='coord_quality',
    color_discrete_map={'good': 'green', 'low_precision': 'orange', 'suspicious_cluster': 'red'},
    hover_data=['genome_id', 'species', 'isolation_source'],
    title='Coordinate Quality Assessment (hover for details)',
    opacity=0.5,
)
fig_qc.update_traces(marker_size=3)
fig_qc.update_layout(width=1000, height=600)
save_fig(fig_qc, 'coord_quality_map')
fig_qc.show()

---
## 3. Environment Label Harmonization

The `isolation_source` field contains **5,774 unique free-text values** — everything from "feces" to "permafrost active layer soil" to "Rifle well CD01 at 16ft depth". To make this usable for analysis, we map these to ~12 broad categories using keyword matching.

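The mapping is deliberately simple: categories are checked in order, and the first keyword found as a substring of the lowercased value wins. Two consequences follow. Order matters, because a general keyword in an earlier category shadows a more specific one later (`'lake'` under Freshwater is tested before `'soda lake'` under Extreme). Short keywords can also match inside longer words (`'ant'` occurs in "plant", `'rat'` in "nitrate"), so some assignments are false positives. Values that don't match any category are labeled "Other", and we inspect these to see if the mapping needs expansion.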

### Category definitions

| Category | Example isolation sources |
|----------|-------------------------|
| Marine | ocean, seawater, marine sediment, coral, hydrothermal |
| Freshwater | river, lake, groundwater, drinking water |
| Soil | soil, rhizosphere, compost, permafrost, sediment |
| Human gut | feces, stool, human intestine, infant feces |
| Human clinical | blood, sputum, urine, wound, abscess, patient |
| Human other | human skin, oral, saliva, nasal |
| Animal | chicken, cattle, fish, insect, rumen |
| Plant | leaf, root nodule, phyllosphere, endophyte |
| Food | cheese, milk, fermented, kimchi, meat |
| Wastewater | sewage, sludge, bioreactor, treatment plant |
| Extreme | hot spring, hypersaline, acid mine, soda lake |
| Air | air, atmosphere, aerosol, dust |

In [None]:
ENV_CATEGORIES = [
    ('Marine', ['ocean', 'marine', 'sea water', 'seawater', 'deep sea', 'coastal water',
                'marine sediment', 'coral', 'sponge', 'hydrothermal', 'estuary', 'brackish',
                'saline lake', 'salt lake', 'salt marsh', 'mangrove']),
    ('Freshwater', ['freshwater', 'fresh water', 'river', 'lake', 'stream', 'pond',
                    'spring water', 'groundwater', 'aquifer', 'drinking water', 'tap water',
                    'well water']),
    ('Soil', ['soil', 'rhizosphere', 'compost', 'peat', 'permafrost', 'sediment', 'mud',
              'clay', 'sand', 'agricultural', 'farmland', 'forest soil', 'grassland']),
    ('Human gut', ['human gut', 'human feces', 'human fecal', 'human stool', 'human faeces',
                   'human faecal', 'human intestin', 'human colon', 'human cecum',
                   'human rectal', 'meconium', 'infant fec', 'infant gut', 'feces', 'fecal',
                   'faeces', 'faecal', 'stool', 'rectal swab']),
    ('Human clinical', ['blood', 'sputum', 'urine', 'wound', 'abscess', 'csf', 'biopsy',
                        'bronch', 'patient', 'clinical', 'hospital', 'icu']),
    ('Human other', ['human', 'homo sapiens', 'skin', 'oral', 'saliva', 'nasal', 'vaginal',
                     'respiratory']),
    ('Animal', ['chicken', 'cattle', 'cow', 'pig', 'swine', 'sheep', 'goat', 'horse', 'dog',
                'cat', 'mouse', 'rat', 'fish', 'shrimp', 'insect', 'bee', 'ant', 'termite',
                'bird', 'poultry', 'animal', 'bovine', 'porcine', 'feline', 'canine', 'avian',
                'tick', 'mosquito', 'nematode', 'worm', 'rumen']),
    ('Plant', ['plant', 'leaf', 'stem', 'flower', 'fruit', 'seed', 'phyllosphere', 'endophyte',
               'epiphyte', 'bark', 'wood', 'crop', 'rice', 'wheat', 'maize', 'corn', 'soybean',
               'potato', 'tomato', 'lettuce', 'grape', 'root nodule', 'root']),
    ('Food', ['food', 'cheese', 'milk', 'dairy', 'yogurt', 'ferment', 'kimchi', 'sauerkraut',
              'wine', 'beer', 'kefir', 'meat', 'sausage', 'bread', 'dough', 'pickle']),
    ('Wastewater', ['wastewater', 'waste water', 'sewage', 'sludge', 'activated sludge',
                    'bioreactor', 'biogas', 'anaerobic digest', 'treatment plant']),
    ('Extreme', ['hot spring', 'thermal', 'geothermal', 'volcanic', 'hypersaline', 'alkaline',
                 'acidic', 'acid mine', 'mine drainage', 'glacier', 'ice', 'polar', 'desert',
                 'cave', 'soda lake']),
    ('Air', ['air', 'atmosphere', 'aerosol', 'dust', 'indoor air']),
]


def harmonize(value):
    """Map a raw isolation_source string to a broad category via keyword matching."""
    if pd.isna(value):
        return 'Unknown'
    vl = str(value).lower().strip()
    if vl in ('', 'missing', 'not collected', 'not applicable', 'not available',
              'unknown', 'na', 'n/a', 'none'):
        return 'Unknown'
    for cat, kws in ENV_CATEGORIES:
        for kw in kws:
            if kw in vl:
                return cat
    return 'Other'


df['env_category'] = df['isolation_source'].apply(harmonize)

print('Harmonized environment categories:')
for cat, count in df['env_category'].value_counts().items():
    print(f'  {cat}: {count:,} ({100*count/len(df):.1f}%)')

In [None]:
# What's in "Other"? These may need additional keywords.
other = df.loc[df['env_category'] == 'Other', 'isolation_source'].value_counts()
# `other` is a value_counts Series, so its length is the number of distinct raw values
print(f'"Other" category: {(df["env_category"]=="Other").sum():,} genomes, '
      f'{len(other):,} unique values')
print(f'\nTop 20 unmapped isolation sources:')
other.head(20)

The "Other" category contains site-specific labels ("Aspo HRL", "Olkiluoto" — underground research labs), generic terms ("water", "bodily fluid", "tissue"), and clinical sites ("cerebrospinal fluid", "lung", "throat swab"). Adding more keywords could capture some of these, but the long tail of 3,000+ unique values means there will always be some residual "Other".

Let's also check whether the ENVO ontology fields (`env_broad_scale`) provide cleaner categories:

In [None]:
n_broad = df['env_broad_scale'].notna().sum()
print(f'env_broad_scale coverage: {n_broad:,} ({100*n_broad/len(df):.1f}%)')
if n_broad > 0:
    print(f'\nTop 15 env_broad_scale values:')
    print(df['env_broad_scale'].value_counts().head(15).to_string())
    print(f'\nNote: Many values are ENVO IDs without labels, or generic terms like '
          f'"not applicable" and "missing". Coverage is lower than isolation_source '
          f'but the structured values (e.g., "marine biome [ENVO:00000447]") are cleaner.')

In [None]:
# Visualization of harmonized categories
cat_counts = df['env_category'].value_counts().reset_index()
cat_counts.columns = ['category', 'count']
cat_known = cat_counts[~cat_counts['category'].isin(['Unknown'])]

fig_cat = px.bar(
    cat_known.sort_values('count', ascending=True),
    x='count', y='category', orientation='h', text='count',
    title='Harmonized Environment Categories (excluding Unknown)',
    labels={'count': 'Number of genomes', 'category': ''},
)
fig_cat.update_traces(texttemplate='%{text:,}', textposition='outside')
fig_cat.update_layout(width=700, height=500)
save_fig(fig_cat, 'env_categories')
fig_cat.show()

### Key observation

The AlphaEarth dataset has a strong **clinical/human bias**: Human clinical (20%) + Human gut (16%) + Human other (2%) = **38%** of genomes are human-associated. Environmental categories (Soil 7%, Marine 7%, Freshwater 7%) are much smaller. This reflects the overall bias in NCBI — clinical pathogens are the most sequenced organisms, and they happen to have good geographic metadata because of epidemiological tracking.

---
## 4. UMAP of Embedding Space

We reduce the 64-dimensional AlphaEarth embeddings to 2D using UMAP for visualization. This reveals whether the embedding space has meaningful structure — clusters, gradients, or separations that correspond to environmental or taxonomic groupings.

### Method
- L2-normalize embeddings first (for unit vectors, squared Euclidean distance equals twice the cosine distance, so Euclidean nearest neighbors are also cosine nearest neighbors)
- Fit UMAP on a 20K subsample for speed, then transform all ~80K points
- Parameters: `n_neighbors=15`, `min_dist=0.1`, `metric='euclidean'`
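
The reason for normalizing before using a Euclidean metric can be checked numerically. The sketch below uses random vectors as a stand-in for real embeddings (`demo`, `unit`, and the other names are illustrative only) and verifies the identity ‖u − v‖² = 2(1 − cos θ) for unit vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
demo = rng.normal(size=(5, 64))                      # stand-in for 64-dim embeddings
unit = demo / np.linalg.norm(demo, axis=1, keepdims=True)  # L2-normalize each row

u, v = unit[0], unit[1]
sq_euclid = np.sum((u - v) ** 2)     # squared Euclidean distance
cosine_dist = 1.0 - np.dot(u, v)     # cosine distance (vectors already unit length)

print(f'squared Euclidean: {sq_euclid:.6f}')
print(f'2 x cosine dist:   {2 * cosine_dist:.6f}')
```

Since one distance is a monotonic function of the other, neighbor rankings agree, which is what UMAP's neighbor graph depends on.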

If pre-computed UMAP coordinates exist from a previous run, we load those instead to avoid recomputation.

In [None]:
# Filter to genomes with valid (non-NaN) embeddings
valid_mask = ~df[EMB_COLS].isna().any(axis=1)
df_clean = df[valid_mask].copy().reset_index(drop=True)
print(f'Genomes with valid embeddings: {len(df_clean):,} / {len(df):,} '
      f'({len(df) - len(df_clean):,} dropped due to NaN)')

# Check for pre-computed UMAP coordinates
umap_path = os.path.join(DATA_DIR, 'umap_coords.csv')
if os.path.exists(umap_path):
    umap_df = pd.read_csv(umap_path)
    df_clean = df_clean.merge(umap_df, on='genome_id', how='left')
    n_mapped = df_clean['umap_x'].notna().sum()
    print(f'Loaded pre-computed UMAP coordinates for {n_mapped:,} genomes')
    if n_mapped < len(df_clean) * 0.9:
        print('WARNING: Many genomes missing UMAP coords — will recompute')
        recompute = True
    else:
        recompute = False
else:
    print('No pre-computed UMAP coordinates found — will compute from scratch')
    recompute = True

In [None]:
if recompute:
    embeddings = df_clean[EMB_COLS].values
    emb_normed = normalize(embeddings, norm='l2')

    # Fit on subsample for speed, transform all
    N_FIT = 20_000
    np.random.seed(42)
    fit_idx = np.random.choice(len(emb_normed), min(N_FIT, len(emb_normed)), replace=False)

    print(f'Fitting UMAP on {len(fit_idx):,} subsample...')
    reducer = umap.UMAP(
        n_components=2, n_neighbors=15, min_dist=0.1,
        metric='euclidean', random_state=42,
    )
    reducer.fit(emb_normed[fit_idx])
    print('Transforming all points...')
    coords_2d = reducer.transform(emb_normed)

    df_clean['umap_x'] = coords_2d[:, 0]
    df_clean['umap_y'] = coords_2d[:, 1]

    # Save for future runs
    df_clean[['genome_id', 'umap_x', 'umap_y']].to_csv(umap_path, index=False)
    print(f'Saved UMAP coordinates to {umap_path}')
else:
    print('Using pre-computed UMAP coordinates.')

### UMAP colored by environment category

If the embeddings encode environmental information, we'd expect genomes from similar environments to cluster together in UMAP space.

In [None]:
fig_env = px.scatter(
    df_clean, x='umap_x', y='umap_y', color='env_category',
    hover_data=['genome_id', 'species', 'isolation_source', 'cleaned_lat', 'cleaned_lon'],
    title='UMAP of AlphaEarth Embeddings — by Environment Category',
    opacity=0.4, labels={'umap_x': 'UMAP 1', 'umap_y': 'UMAP 2'},
)
fig_env.update_traces(marker_size=3)
fig_env.update_layout(width=1000, height=700)
save_fig(fig_env, 'umap_by_env_category')
fig_env.show()

### UMAP colored by phylum

Taxonomy provides an alternative lens — do genomes cluster by phylogeny in embedding space? If the embeddings primarily capture *where* organisms live rather than *what* they are, phylogenetic clusters should be weaker than environmental clusters.

In [None]:
top_phyla = df_clean['phylum'].value_counts().head(10).index.tolist()
df_clean['phylum_display'] = df_clean['phylum'].where(
    df_clean['phylum'].isin(top_phyla), 'Other'
)

fig_phy = px.scatter(
    df_clean, x='umap_x', y='umap_y', color='phylum_display',
    hover_data=['genome_id', 'species', 'isolation_source'],
    title='UMAP of AlphaEarth Embeddings — by Phylum',
    opacity=0.4, labels={'umap_x': 'UMAP 1', 'umap_y': 'UMAP 2'},
)
fig_phy.update_traces(marker_size=3)
fig_phy.update_layout(width=1000, height=700)
save_fig(fig_phy, 'umap_by_phylum')
fig_phy.show()

### UMAP colored by coordinate quality

Do genomes with suspicious coordinates occupy specific regions of embedding space? If institutional addresses have consistent satellite imagery (urban land use), they might form their own cluster.

In [None]:
fig_cq = px.scatter(
    df_clean, x='umap_x', y='umap_y', color='coord_quality',
    color_discrete_map={
        'good': 'green', 'low_precision': 'orange',
        'suspicious_cluster': 'red', 'no_coords': 'lightgray'
    },
    hover_data=['genome_id', 'species', 'cleaned_lat', 'cleaned_lon'],
    title='UMAP — by Coordinate Quality',
    opacity=0.4, labels={'umap_x': 'UMAP 1', 'umap_y': 'UMAP 2'},
)
fig_cq.update_traces(marker_size=3)
fig_cq.update_layout(width=1000, height=700)
save_fig(fig_cq, 'umap_by_coord_quality')
fig_cq.show()

---
## 5. Geographic Map

Where in the world are these genomes from? The interactive map lets you zoom into specific regions and hover for genome details.

In [None]:
geo_df = df_clean[df_clean['cleaned_lat'].notna() & df_clean['cleaned_lon'].notna()].copy()
print(f'Genomes with coordinates: {len(geo_df):,}')

fig_map = px.scatter_geo(
    geo_df, lat='cleaned_lat', lon='cleaned_lon', color='env_category',
    hover_data=['genome_id', 'species', 'isolation_source', 'phylum'],
    title='Global Distribution of AlphaEarth Genomes — by Environment',
    opacity=0.5,
)
fig_map.update_traces(marker_size=3)
fig_map.update_layout(width=1100, height=600)
save_fig(fig_map, 'global_map_by_env')
fig_map.show()

---
## 6. Embedding Distance vs Geographic Distance

A key question: **do the AlphaEarth embeddings capture geographic proximity?** If satellite imagery at two locations is similar (similar climate, land use, vegetation), the embeddings should be close in cosine distance, even if the locations are far apart geographically.

Conversely, if the embeddings mainly encode *location* rather than *environment type*, we'd expect a strong monotonic relationship between geographic distance and embedding distance.

We draw 50K random genome pairs with replacement (using only good-quality coordinates), discard the few self-pairs, and compute both distances.

In [None]:
good = df_clean[
    (df_clean['coord_quality'] == 'good') & df_clean['cleaned_lat'].notna()
].copy()
print(f'Genomes with good coordinates: {len(good):,}')

# Sample random pairs
np.random.seed(42)
n = len(good)
idx1 = np.random.randint(0, n, size=50_000)
idx2 = np.random.randint(0, n, size=50_000)
mask = idx1 != idx2
idx1, idx2 = idx1[mask], idx2[mask]
print(f'Sampled {len(idx1):,} random pairs')

# Haversine geographic distance
lats, lons = good['cleaned_lat'].values, good['cleaned_lon'].values
R = 6371  # Earth radius km
lat1r, lon1r = np.radians(lats[idx1]), np.radians(lons[idx1])
lat2r, lon2r = np.radians(lats[idx2]), np.radians(lons[idx2])
dlat, dlon = lat2r - lat1r, lon2r - lon1r
a = np.sin(dlat/2)**2 + np.cos(lat1r) * np.cos(lat2r) * np.sin(dlon/2)**2
a = np.clip(a, 0.0, 1.0)  # guard against rounding pushing `a` slightly outside [0, 1]
geo_dist = 2 * R * np.arcsin(np.sqrt(a))

# Cosine distance between embeddings (vectorized)
emb = good[EMB_COLS].values
dots = np.sum(emb[idx1] * emb[idx2], axis=1)
norms1 = np.linalg.norm(emb[idx1], axis=1)
norms2 = np.linalg.norm(emb[idx2], axis=1)
emb_dist = 1 - dots / (norms1 * norms2 + 1e-10)

same_sp = good['species'].values[idx1] == good['species'].values[idx2]

pairs = pd.DataFrame({
    'geo_dist_km': geo_dist, 'emb_cosine_dist': emb_dist, 'same_species': same_sp
})
print(f'Same-species pairs: {same_sp.sum():,} ({100*same_sp.mean():.1f}%)')

In [None]:
fig_dist = px.scatter(
    pairs.sample(10_000, random_state=42),
    x='geo_dist_km', y='emb_cosine_dist', color='same_species',
    color_discrete_map={True: 'blue', False: 'gray'},
    title='Geographic Distance vs Embedding Distance (good coords, 10K sample)',
    labels={'geo_dist_km': 'Geographic Distance (km)',
            'emb_cosine_dist': 'Embedding Cosine Distance',
            'same_species': 'Same Species'},
    opacity=0.3,
)
fig_dist.update_traces(marker_size=3)
fig_dist.update_layout(width=900, height=600)
save_fig(fig_dist, 'geo_vs_embedding_distance')
fig_dist.show()

In [None]:
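# Binned analysis for clearer trend
# Note: pd.cut uses right-closed intervals by default, so the first bin is (0, 100].
# Pairs at identical coordinates (0 km) and pairs beyond 20,000 km fall outside
# all bins, get NaN, and are dropped by the groupby below.
pairs['geo_bin'] = pd.cut(
    pairs['geo_dist_km'],
    bins=[0, 100, 500, 1000, 2000, 5000, 10000, 20000],
    labels=['<100', '100-500', '500-1K', '1K-2K', '2K-5K', '5K-10K', '10K-20K']
)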
binned = pairs.groupby('geo_bin', observed=True).agg(
    mean_emb=('emb_cosine_dist', 'mean'),
    median_emb=('emb_cosine_dist', 'median'),
    n=('emb_cosine_dist', 'count'),
).reset_index()

print('Mean embedding distance by geographic distance bin:')
print(binned.to_string(index=False))

fig_bin = px.bar(
    binned, x='geo_bin', y='mean_emb', text='n',
    title='Mean Embedding Distance by Geographic Distance',
    labels={'geo_bin': 'Geographic Distance (km)', 'mean_emb': 'Mean Cosine Distance'},
)
fig_bin.update_traces(texttemplate='n=%{text:,}', textposition='outside')
fig_bin.update_layout(width=800, height=500)
save_fig(fig_bin, 'geo_vs_embedding_binned')
fig_bin.show()

### Key finding

There is a **clear monotonic relationship** between geographic distance and embedding distance:

| Distance | Mean Cosine Distance |
|----------|---------------------|
| <100 km | ~0.41 |
| 500–1K km | ~0.56 |
| 5K–10K km | ~0.80 |
| 10K–20K km | ~0.82 |

Nearby genomes have **substantially more similar embeddings**. The relationship is steepest at short distances (<2000 km) and plateaus at intercontinental distances (>5000 km). This indicates that AlphaEarth embeddings carry **real geographic/environmental signal** rather than noise, although binned means alone cannot separate the effect of location from that of environment type.

The plateau suggests that the embeddings capture local environmental conditions (climate, land use) more than global position — genomes 5000 km apart are about as different as genomes 20000 km apart, because at those distances, the environments are essentially random draws from the global distribution.
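
The binned means describe the trend but do not quantify it. One way to put a number on it is a rank correlation, computed overall and again for distant pairs only to probe the plateau. The sketch below is illustrative: it runs on synthetic data shaped like a saturating curve (a stand-in for the real `pairs` frame), and `spearman_rho` is a hypothetical helper that computes Spearman's rho as the Pearson correlation of ranks.

```python
import numpy as np
import pandas as pd

def spearman_rho(x, y):
    """Spearman rank correlation, computed as the Pearson correlation of ranks."""
    return pd.Series(x).rank().corr(pd.Series(y).rank())

# Synthetic stand-in for `pairs`: distance rises with geography, then saturates
rng = np.random.default_rng(42)
geo = rng.uniform(0, 20_000, size=2_000)
emb = 0.8 * (1 - np.exp(-geo / 2_000)) + rng.normal(0, 0.05, size=2_000)
demo_pairs = pd.DataFrame({'geo_dist_km': geo, 'emb_cosine_dist': emb})

rho_all = spearman_rho(demo_pairs['geo_dist_km'], demo_pairs['emb_cosine_dist'])

# Restricting to distant pairs shows how much of the trend survives the plateau
far = demo_pairs[demo_pairs['geo_dist_km'] > 5_000]
rho_far = spearman_rho(far['geo_dist_km'], far['emb_cosine_dist'])

print(f'rho (all pairs): {rho_all:.2f}; rho (>5000 km only): {rho_far:.2f}')
```

On the real data, the same two calls on `pairs` would show whether the relationship is driven almost entirely by short-range pairs, as the plateau suggests.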

---
## 7. Embedding Cluster × Environment Cross-Tabulation

We cluster the UMAP space with DBSCAN and check whether clusters correspond to environment categories. If the embeddings encode environment type, we'd expect each cluster to be dominated by one or a few categories.

DBSCAN parameters (`eps=0.5`, `min_samples=50`) are a first pass — the number and granularity of clusters may need tuning.
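
A common heuristic for choosing `eps` is to look at the distribution of each point's distance to its k-th nearest neighbour, with k set near `min_samples`, and pick a value around where that distribution bends upward. The sketch below shows the idea on synthetic 2D points. `k_distance` is a hypothetical brute-force helper that is only suitable for small inputs, and the blobs are a stand-in for the UMAP coordinates.

```python
import numpy as np

def k_distance(points, k):
    """Distance from each point to its k-th nearest neighbour (brute force)."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    dists.sort(axis=1)        # column 0 is the point itself (distance 0)
    return dists[:, k]

# Two tight blobs plus scattered outliers, as a stand-in for UMAP coordinates
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=(0, 0), scale=0.3, size=(200, 2))
blob_b = rng.normal(loc=(5, 5), scale=0.3, size=(200, 2))
outliers = rng.uniform(-3, 8, size=(20, 2))
demo_xy = np.vstack([blob_a, blob_b, outliers])

kd = np.sort(k_distance(demo_xy, k=10))
for q in (50, 90, 95, 99):
    print(f'{q}th percentile of 10-NN distance: {np.percentile(kd, q):.2f}')
```

For the full ~80K points the pairwise matrix would not fit in memory; `sklearn.neighbors.NearestNeighbors` with `kneighbors` computes the same quantity efficiently.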

In [None]:
# Use one mask for both selecting points and assigning labels, so lengths always match
valid_umap = df_clean[['umap_x', 'umap_y']].notna().all(axis=1)
umap_xy = df_clean.loc[valid_umap, ['umap_x', 'umap_y']].values

clustering = DBSCAN(eps=0.5, min_samples=50).fit(umap_xy)
df_clean.loc[valid_umap, 'umap_cluster'] = clustering.labels_

n_clusters = len(set(clustering.labels_)) - (1 if -1 in clustering.labels_ else 0)
n_noise = (clustering.labels_ == -1).sum()
print(f'DBSCAN found {n_clusters} clusters + {n_noise:,} noise points')
print('\nLargest 10 clusters:')
cluster_sizes = pd.Series(clustering.labels_).value_counts()
# Drop the noise label (-1) before taking the top 10
for label, size in cluster_sizes.drop(-1, errors='ignore').head(10).items():
    print(f'  Cluster {label}: {size:,} genomes')

In [None]:
df_clean['cl_str'] = df_clean['umap_cluster'].fillna(-1).astype(int).astype(str)
df_clean.loc[df_clean['umap_cluster'].fillna(-1) == -1, 'cl_str'] = 'noise'

fig_cl = px.scatter(
    df_clean[valid_umap], x='umap_x', y='umap_y', color='cl_str',
    hover_data=['genome_id', 'species', 'env_category', 'isolation_source'],
    title=f'UMAP Clusters (DBSCAN: {n_clusters} clusters)',
    opacity=0.4,
    labels={'umap_x': 'UMAP 1', 'umap_y': 'UMAP 2', 'cl_str': 'Cluster'},
)
fig_cl.update_traces(marker_size=3)
fig_cl.update_layout(width=1000, height=700, showlegend=False)
save_fig(fig_cl, 'umap_clusters')
fig_cl.show()

In [None]:
# Cross-tabulation: what environment categories make up each cluster?
xtab_df = df_clean[
    (df_clean['umap_cluster'].fillna(-1) >= 0) &
    (df_clean['env_category'] != 'Unknown')
].copy()
xtab_df['umap_cluster'] = xtab_df['umap_cluster'].astype(int)

# Only show largest clusters for readability
top_clusters = xtab_df['umap_cluster'].value_counts().head(20).index.tolist()
xtab_top = xtab_df[xtab_df['umap_cluster'].isin(top_clusters)]

xtab = pd.crosstab(xtab_top['env_category'], xtab_top['umap_cluster'], normalize='columns')

fig_h = px.imshow(
    xtab, title='Environment Category Composition of Top 20 Clusters (column-normalized)',
    labels={'x': 'UMAP Cluster', 'y': 'Environment Category', 'color': 'Fraction'},
    aspect='auto', color_continuous_scale='Blues',
)
fig_h.update_layout(width=900, height=500)
save_fig(fig_h, 'cluster_env_heatmap')
fig_h.show()

In [None]:
# Reverse view: for each environment category, which clusters contain its genomes?
xtab_row = pd.crosstab(xtab_top['env_category'], xtab_top['umap_cluster'], normalize='index')

fig_hr = px.imshow(
    xtab_row, title='Cluster Distribution per Environment Category (row-normalized)',
    labels={'x': 'UMAP Cluster', 'y': 'Environment Category', 'color': 'Fraction'},
    aspect='auto', color_continuous_scale='Oranges',
)
fig_hr.update_layout(width=900, height=500)
save_fig(fig_hr, 'env_cluster_distribution')
fig_hr.show()

---
## Summary

### What we found

1. **Data coverage**: Nearly all AlphaEarth genomes have lat/lon (100%) and isolation_source (92%). ENVO ontology terms cover ~40%. Strong clinical/human bias (38% of genomes).

2. **Coordinate quality**: ~36% of genomes cluster at shared coordinates. Some are legitimate field sites (Rifle, Saanich Inlet), others are likely institutional addresses. The current QC heuristic needs refinement to distinguish these cases.

3. **Environment harmonization**: 5,774 unique isolation_source values mapped to 12 broad categories. ~17% remain as "Other" — these are site-specific labels, generic terms, or clinical sites that could be captured with additional keywords.

4. **Embedding structure**: The 64-dim space shows structure in UMAP, with 320 clusters (DBSCAN at `eps=0.5`, `min_samples=50`). The cluster count depends on these parameters and on the 2D projection, so it is only suggestive of fine-grained environmental variation beyond the broad categories.

5. **Geography–embedding relationship**: **Clear monotonic relationship**: genome pairs <100 km apart have a mean cosine distance of ~0.41, while intercontinental pairs have ~0.82. This indicates that the embeddings encode real environmental/geographic signal.

### Next steps

- Refine coordinate QC to distinguish legitimate field sites from institutional addresses
- Reduce "Other" category by adding more keywords or using `env_broad_scale` as a fallback
- Tune DBSCAN parameters for coarser clusters that map more cleanly onto environment categories
- Investigate what the individual embedding dimensions represent (correlation with latitude, temperature, precipitation, etc.)
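
As a starting point for the last item, each embedding dimension can be rank-correlated with a candidate covariate such as absolute latitude. The sketch below is a rough illustration on synthetic data: the frame `demo`, the four stand-in dimensions, and the planted signal in `A02` are all assumptions made for the example, not properties of the real embeddings.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500
demo = pd.DataFrame({'cleaned_lat': rng.uniform(-60, 70, size=n)})
demo_cols = [f'A{i:02d}' for i in range(4)]   # 4 stand-in dims instead of 64
for col in demo_cols:
    demo[col] = rng.normal(size=n)
# Planted signal so the example has something to find
demo['A02'] = 0.02 * demo['cleaned_lat'].abs() + rng.normal(0, 0.2, size=n)

# Spearman-style correlation: Pearson correlation computed on ranks
abs_lat_rank = demo['cleaned_lat'].abs().rank()
rho = demo[demo_cols].rank().corrwith(abs_lat_rank)

# Order dimensions by strength of association, regardless of sign
rho = rho.reindex(rho.abs().sort_values(ascending=False).index)
print(rho)
```

On the real data the same pattern would use `EMB_COLS` and `df_clean`, restricted to rows with good-quality coordinates; temperature or precipitation would need an external climate dataset joined by location.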

In [None]:
print('=== Analysis complete ===')
print(f'\nGenomes: {len(df):,} total, {len(df_clean):,} with valid embeddings')
print(f'Coord quality: good={sum(df["coord_quality"]=="good"):,}, '
      f'suspicious={sum(df["coord_quality"]=="suspicious_cluster"):,}')
print(f'\nFigures saved:')
for f in sorted(os.listdir(FIG_DIR)):
    if not f.startswith('.'):
        sz = os.path.getsize(os.path.join(FIG_DIR, f)) / 1e6
        print(f'  {f} ({sz:.1f} MB)')