# NB01: Data Extraction — AlphaEarth Embeddings + Environment Labels

**Requires BERDL JupyterHub** — `get_spark_session()` is only available in JupyterHub kernels.

## Background

The BERDL pangenome database includes **AlphaEarth environmental embeddings**: 64-dimensional vectors derived from satellite imagery at each genome's sampling location. These embeddings encode environmental context (climate, land use, vegetation, etc.), but their relationship to traditional environment metadata has not been systematically characterized.

This notebook extracts and joins three data layers for downstream interactive exploration:

1. **AlphaEarth embeddings** (`alphaearth_embeddings_all_years`) — 83K genomes with 64-dim vectors, cleaned lat/lon, and taxonomy
2. **NCBI environment metadata** (`ncbi_env`) — free-text environment labels in Entity-Attribute-Value format
3. **Coverage statistics** — which genomes have which metadata fields populated

### Key considerations

- AlphaEarth embeddings cover only **28.4%** of all 293K genomes (83K), and that subset is biased toward genomes with valid lat/lon metadata
- `ncbi_env` is an EAV table (multiple rows per genome) — we pivot it into one row per genome
- The `isolation_source` field is free text with thousands of unique values — harmonization happens in NB02

### Outputs

All saved to `../data/`:

| File | Description |
|------|-------------|
| `alphaearth_with_env.csv` | Merged embeddings + pivoted env labels (one row per genome) |
| `coverage_stats.csv` | Per-attribute population rates |
| `ncbi_env_attribute_counts.csv` | Full inventory of all 334 `harmonized_name` values |
| `isolation_source_raw_counts.csv` | Raw value frequencies for harmonization in NB02 |

In [None]:
import os
import pandas as pd

# get_spark_session() is injected into JupyterHub kernels — no import needed
spark = get_spark_session()

DATA_DIR = '../data'
os.makedirs(DATA_DIR, exist_ok=True)

print('Spark session initialized')
print(f'Output directory: {os.path.abspath(DATA_DIR)}')

## 1. Extract AlphaEarth Embeddings

The `alphaearth_embeddings_all_years` table contains one row per genome with:
- **64 embedding dimensions** (A00–A63): satellite-derived environmental vectors, values in [-0.54, 0.54]
- **Cleaned coordinates**: `cleaned_lat`, `cleaned_lon` — parsed and validated from NCBI metadata
- **Taxonomy**: full GTDB hierarchy from domain to species
- **Biosample links**: `ncbi_biosample_accession_id` for joining to `ncbi_env`

At 83K rows, this is small enough to collect entirely to the driver node.
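
A back-of-the-envelope estimate supports this. The sketch below assumes the 64 embedding values are held as 8-byte floats and uses the approximate 83K row count quoted above; neither the storage type nor the size of the non-embedding columns (taxonomy strings, accession IDs) is confirmed by the table schema, so treat the result as a lower bound.

```python
# Rough memory estimate for the embedding columns only.
# Assumptions (not confirmed by the source): float64 storage, ~83,000 rows.
n_rows = 83_000
n_dims = 64
bytes_per_value = 8  # float64

embedding_bytes = n_rows * n_dims * bytes_per_value
embedding_mb = embedding_bytes / 1e6
print(f'Embedding block: ~{embedding_mb:.1f} MB')
```

Even allowing several times that figure for string columns and pandas overhead, the table should fit comfortably in driver memory.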

In [None]:
ae_df = spark.sql("""
    SELECT *
    FROM kbase_ke_pangenome.alphaearth_embeddings_all_years
""").toPandas()

emb_cols = [c for c in ae_df.columns if c.startswith('A') and c[1:].isdigit()]

print(f'AlphaEarth embeddings: {len(ae_df):,} genomes')
print(f'Embedding dimensions: {len(emb_cols)}')
print(f'Lat/lon non-null: {ae_df["cleaned_lat"].notna().sum():,} / {len(ae_df):,}')
print(f'Year range: {ae_df["cleaned_year"].min()} – {ae_df["cleaned_year"].max()}')
print(f'\nTaxonomy coverage:')
for col in ['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species']:
    if col in ae_df.columns:
        print(f'  {col}: {ae_df[col].nunique():,} unique')

In [None]:
ae_df.head(3)

## 2. Inventory NCBI Environment Attributes

The `ncbi_env` table uses an **Entity-Attribute-Value (EAV)** format — each row is one attribute for one biosample. Before we pivot, let's see what attributes exist and how many genomes have each.

This inventory helps us decide which attributes to pivot into columns (we want the most-populated ones) and reveals the overall metadata landscape.
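
The query in the next cell reports two counts per attribute, and they can differ when a biosample carries the same attribute more than once. The toy example below uses made-up accessions and values to show the difference; it is a pandas sketch of the same aggregation, not the Spark query itself.

```python
import pandas as pd

# Toy EAV table (made-up accessions and values, for illustration only)
eav = pd.DataFrame({
    'accession': ['S1', 'S1', 'S1', 'S2', 'S2'],
    'harmonized_name': ['host', 'host', 'depth', 'host', 'depth'],
    'content': ['Homo sapiens', 'human', '10 m', 'Bos taurus', '5 m'],
})

toy_counts = (
    eav.groupby('harmonized_name')
    .agg(
        n_rows=('content', 'size'),          # like COUNT(*)
        n_genomes=('accession', 'nunique'),  # like COUNT(DISTINCT accession)
    )
    .reset_index()
)
print(toy_counts)
```

Here `host` has three rows but only two distinct accessions, because `S1` lists it twice.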

In [None]:
attr_counts = spark.sql("""
    SELECT harmonized_name,
           COUNT(*) as n_rows,
           COUNT(DISTINCT accession) as n_genomes
    FROM kbase_ke_pangenome.ncbi_env
    GROUP BY harmonized_name
    ORDER BY n_genomes DESC
""").toPandas()

print(f'Total distinct harmonized_name values: {len(attr_counts)}')
print(f'\nTop 30 attributes by number of genomes:')
attr_counts.head(30)

The most-populated attributes are `collection_date` (273K genomes), `geo_loc_name` (272K), and `isolation_source` (245K). The ENVO ontology fields (`env_broad_scale`, `env_local_scale`, `env_medium`) cover ~80-88K genomes each — roughly matching the AlphaEarth coverage, which makes sense since both require geographic metadata.

We'll pivot the most useful attributes into columns for our analysis.

In [None]:
attr_counts.to_csv(os.path.join(DATA_DIR, 'ncbi_env_attribute_counts.csv'), index=False)
print(f'Saved ncbi_env_attribute_counts.csv ({len(attr_counts)} attributes)')

## 3. Pivot NCBI Environment Labels for AlphaEarth Genomes

We join `ncbi_env` to our AlphaEarth genomes via `ncbi_biosample_accession_id` and pivot selected attributes from rows into columns.

The approach:
1. Register AlphaEarth biosample IDs as a Spark temp view for efficient joining
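2. Use `MAX(CASE WHEN ...)` to pivot each attribute into its own column (if a biosample has several values for one attribute, `MAX` keeps the lexicographically largest string)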
3. This gives us one row per genome with all environment fields as columns
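
To make the pivot semantics concrete, here is a small pandas sketch of the same rows-to-columns reshaping on made-up data. It is an illustration of what `MAX(CASE WHEN ...)` with `GROUP BY accession` produces, not a replacement for the Spark query; the accessions and values are invented.

```python
import pandas as pd

# Toy EAV rows (invented values); S1 has two conflicting 'host' entries
eav = pd.DataFrame({
    'accession': ['S1', 'S1', 'S1', 'S2'],
    'harmonized_name': ['isolation_source', 'host', 'host', 'isolation_source'],
    'content': ['soil', 'Homo sapiens', 'human', 'seawater'],
})
keep = ['isolation_source', 'host']

wide = (
    eav[eav['harmonized_name'].isin(keep)]
    .groupby(['accession', 'harmonized_name'])['content']
    .max()                    # like SQL MAX: the largest string wins on duplicates
    .unstack()                # attributes become columns; absent pairs become NaN
    .reindex(columns=keep)    # fix the column order, keep all requested attributes
    .reset_index()
)
print(wide)
```

`S1` ends up with `host = 'human'` because lowercase letters sort after uppercase ones, and `S2` gets a missing `host`. The choice between duplicate values is therefore arbitrary with respect to meaning, which is worth keeping in mind when interpreting pivoted fields.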

In [None]:
# Register AlphaEarth biosample IDs as temp view for Spark join
biosample_ids = ae_df['ncbi_biosample_accession_id'].dropna().unique().tolist()
print(f'AlphaEarth genomes with biosample IDs: {len(biosample_ids):,}')

biosample_sdf = spark.createDataFrame(
    [(b,) for b in biosample_ids],
    ['accession']
)
biosample_sdf.createOrReplaceTempView('ae_biosamples')

In [None]:
# Pivot key environment attributes into columns
env_pivot = spark.sql("""
    SELECT ne.accession,
           MAX(CASE WHEN ne.harmonized_name = 'isolation_source' THEN ne.content END) as isolation_source,
           MAX(CASE WHEN ne.harmonized_name = 'geo_loc_name' THEN ne.content END) as geo_loc_name,
           MAX(CASE WHEN ne.harmonized_name = 'env_broad_scale' THEN ne.content END) as env_broad_scale,
           MAX(CASE WHEN ne.harmonized_name = 'env_local_scale' THEN ne.content END) as env_local_scale,
           MAX(CASE WHEN ne.harmonized_name = 'env_medium' THEN ne.content END) as env_medium,
           MAX(CASE WHEN ne.harmonized_name = 'host' THEN ne.content END) as host,
           MAX(CASE WHEN ne.harmonized_name = 'collection_date' THEN ne.content END) as collection_date,
           MAX(CASE WHEN ne.harmonized_name = 'lat_lon' THEN ne.content END) as lat_lon,
           MAX(CASE WHEN ne.harmonized_name = 'depth' THEN ne.content END) as depth,
           MAX(CASE WHEN ne.harmonized_name = 'altitude' THEN ne.content END) as altitude,
           MAX(CASE WHEN ne.harmonized_name = 'temp' THEN ne.content END) as temp
    FROM kbase_ke_pangenome.ncbi_env ne
    JOIN ae_biosamples ab ON ne.accession = ab.accession
    GROUP BY ne.accession
""").toPandas()

print(f'Environment labels pivoted for {len(env_pivot):,} genomes')
print(f'\nAttribute population rates:')
for col in env_pivot.columns[1:]:
    n = env_pivot[col].notna().sum()
    pct = 100 * n / len(env_pivot) if len(env_pivot) > 0 else 0
    print(f'  {col}: {n:,} ({pct:.1f}%)')

Nearly all AlphaEarth genomes have `geo_loc_name` and `collection_date` (100%), and 92% have `isolation_source`. The ENVO ontology fields (`env_broad_scale`, `env_local_scale`, `env_medium`) cover about 38-42% — these may be cleaner than free-text `isolation_source` but have lower coverage.

## 4. Merge Embeddings with Environment Labels

Join on `ncbi_biosample_accession_id` (left join to keep all AlphaEarth genomes, even those without env labels).
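
The toy example below sketches how a left join behaves here, using two optional `DataFrame.merge` arguments that can help when checking a join: `validate='many_to_one'` raises if the right-hand keys are not unique, and `indicator=True` adds a `_merge` column showing which rows found a match. The `genome_id` column and all values are hypothetical.

```python
import pandas as pd

# Hypothetical embedding rows; G3 has no biosample ID
emb = pd.DataFrame({
    'genome_id': ['G1', 'G2', 'G3'],
    'ncbi_biosample_accession_id': ['S1', 'S2', None],
    'A00': [0.1, -0.2, 0.3],
})
# Hypothetical pivoted labels; S2 matched but has no isolation_source
env = pd.DataFrame({
    'accession': ['S1', 'S2'],
    'isolation_source': ['soil', None],
})

toy_merged = emb.merge(
    env,
    left_on='ncbi_biosample_accession_id',
    right_on='accession',
    how='left',
    validate='many_to_one',  # fail loudly if env has duplicate accessions
    indicator=True,          # adds '_merge': 'both' or 'left_only'
)
print(toy_merged[['genome_id', 'isolation_source', '_merge']])
```

All three genomes survive the join. The `_merge` column separates the two reasons a label can be missing: `G3` had no matching biosample (`left_only`), while `G2` matched but the field itself was empty.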

In [None]:
merged = ae_df.merge(
    env_pivot,
    left_on='ncbi_biosample_accession_id',
    right_on='accession',
    how='left'
)

if 'accession' in merged.columns:
    merged = merged.drop(columns=['accession'])

print(f'Merged dataset: {len(merged):,} genomes')
print(f'  With isolation_source: {merged["isolation_source"].notna().sum():,}')
print(f'  Without: {merged["isolation_source"].isna().sum():,}')

## 5. Coverage Statistics

Compute per-attribute population rates and boolean flags. These flags will be used in NB02 for UpSet-style intersection plots.
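
As a rough sketch of how boolean flags feed an UpSet-style plot, the example below counts how many rows fall into each distinct combination of flags. The flag values are made up, and NB02 may compute intersections differently.

```python
import pandas as pd

# Made-up flags for five genomes
flags = pd.DataFrame({
    'has_latlon':           [True, True, True, True, False],
    'has_isolation_source': [True, True, True, False, False],
    'has_host':             [True, False, False, False, False],
})

# One row per observed True/False combination, with its size
intersections = (
    flags.groupby(list(flags.columns))
    .size()
    .reset_index(name='n_genomes')
)
print(intersections)
```

Each output row is one bar in an UpSet plot: the flag columns say which sets intersect, and `n_genomes` gives the bar height.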

In [None]:
flag_cols = {
    'has_latlon': merged['cleaned_lat'].notna() & merged['cleaned_lon'].notna(),
    'has_isolation_source': merged['isolation_source'].notna(),
    'has_env_broad_scale': merged['env_broad_scale'].notna(),
    'has_env_local_scale': merged['env_local_scale'].notna(),
    'has_env_medium': merged['env_medium'].notna(),
    'has_host': merged['host'].notna(),
    'has_geo_loc_name': merged['geo_loc_name'].notna(),
}

for name, flag in flag_cols.items():
    merged[name] = flag

coverage = pd.DataFrame([
    {'attribute': name.replace('has_', ''), 'n_genomes': int(flag.sum()),
     'pct_of_alphaearth': round(100 * flag.mean(), 1)}
    for name, flag in flag_cols.items()
])

print('Coverage of AlphaEarth genomes:')
coverage

In [None]:
coverage.to_csv(os.path.join(DATA_DIR, 'coverage_stats.csv'), index=False)
print('Saved coverage_stats.csv')

## 6. Isolation Source Value Counts

The `isolation_source` field is free text entered by submitters — thousands of unique values that need harmonization. We save the raw value counts so NB02 can build a keyword-based mapping to broad categories (Soil, Marine, Human gut, etc.).
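
For orientation, a keyword-based mapping of this kind might look like the sketch below. The rules, the `categorize` helper, and the `Unclassified` label are all hypothetical placeholders; the actual mapping is built in NB02 from the saved counts.

```python
import re

# Hypothetical keyword rules, checked in order (first match wins)
KEYWORD_RULES = [
    ('Human gut', r'\b(feces|faeces|stool|gut)\b'),
    ('Soil', r'\b(soil|rhizosphere)\b'),
    ('Marine', r'\b(seawater|marine|ocean)\b'),
]
# Placeholder strings treated as "no information"
UNCLASSIFIED_VALUES = {'', 'missing', 'unknown'}

def categorize(raw):
    """Map a raw isolation_source string to a broad category (illustrative only)."""
    text = (raw or '').strip().lower()
    if text in UNCLASSIFIED_VALUES:
        return 'Unclassified'
    for category, pattern in KEYWORD_RULES:
        if re.search(pattern, text):
            return category
    return 'Unclassified'

for value in ['Human feces', 'forest soil', 'Seawater', 'Unknown', None]:
    print(repr(value), '->', categorize(value))
```

Because the first matching rule wins, rule order matters for values that contain several keywords, so more specific categories should come first.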

In [None]:
iso_counts = (
    merged.loc[merged['isolation_source'].notna(), 'isolation_source']
    .value_counts()
    .reset_index()
)
iso_counts.columns = ['isolation_source', 'count']

print(f'Unique isolation_source values: {len(iso_counts):,}')
print(f'\nTop 30 values:')
iso_counts.head(30)

The top values are dominated by clinical samples (feces, blood, sputum, stool, urine), reflecting the strong bias toward pathogen genomes in NCBI. Environmental samples (soil, seawater, groundwater) are present but less common. The `missing` and `Unknown` values will need to be grouped with the unclassified category.

In [None]:
iso_counts.to_csv(os.path.join(DATA_DIR, 'isolation_source_raw_counts.csv'), index=False)
print('Saved isolation_source_raw_counts.csv')

## 7. Save Merged Dataset

In [None]:
out_path = os.path.join(DATA_DIR, 'alphaearth_with_env.csv')
merged.to_csv(out_path, index=False)

print(f'Saved alphaearth_with_env.csv')
print(f'  Rows: {len(merged):,}')
print(f'  Columns: {len(merged.columns)}')
print(f'  File size: {os.path.getsize(out_path) / 1e6:.1f} MB')

## 8. Sanity Checks

Quick validation that the data looks reasonable before passing to NB02.
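
Beyond printing summary statistics, checks like these can be turned into explicit pass/fail tests. The sketch below shows one possible shape for a coordinate-range check; `extent_problems` is a hypothetical helper and the sample rows are invented.

```python
import pandas as pd

def extent_problems(df):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    lat = df['cleaned_lat'].dropna()
    lon = df['cleaned_lon'].dropna()
    if not lat.between(-90, 90).all():
        problems.append('latitude outside [-90, 90]')
    if not lon.between(-180, 180).all():
        problems.append('longitude outside [-180, 180]')
    return problems

# Invented rows: one valid frame (with a missing coordinate) and one invalid frame
good = pd.DataFrame({'cleaned_lat': [45.0, -33.9, None], 'cleaned_lon': [7.5, 151.2, None]})
bad = pd.DataFrame({'cleaned_lat': [95.0], 'cleaned_lon': [10.0]})

print(extent_problems(good))
print(extent_problems(bad))
```

Missing coordinates are ignored here rather than flagged, since null handling is tracked separately by the `has_latlon` flag.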

In [None]:
emb_cols = [f'A{i:02d}' for i in range(64)]
emb_stats = merged[emb_cols].describe().T

print('Embedding dimensions (A00–A63):')
print(f'  Value range: [{emb_stats["min"].min():.3f}, {emb_stats["max"].max():.3f}]')
print(f'  Mean of means: {emb_stats["mean"].mean():.3f}')
print(f'  Mean of stds:  {emb_stats["std"].mean():.3f}')
print(f'  Any NaN: {merged[emb_cols].isna().any().any()}')
print(f'  Genomes with any NaN embedding: {merged[emb_cols].isna().any(axis=1).sum():,}')

print(f'\nGeographic extent:')
print(f'  Latitude:  [{merged["cleaned_lat"].min():.2f}, {merged["cleaned_lat"].max():.2f}]')
print(f'  Longitude: [{merged["cleaned_lon"].min():.2f}, {merged["cleaned_lon"].max():.2f}]')

print(f'\nTop 10 phyla (of {merged["phylum"].nunique()} total):')
merged['phylum'].value_counts().head(10)

## Summary

We extracted **83,287 genomes** with 64-dimensional AlphaEarth environmental embeddings and joined them with NCBI environment metadata. Key observations:

- **Nearly all** genomes have lat/lon coordinates (99.99%) and geographic location names (100%)
- **91.6%** have an `isolation_source` label (5,774 unique raw values that need harmonization)
- **38–42%** have structured ENVO ontology terms (`env_broad_scale`, `env_local_scale`, `env_medium`)
- **63.7%** have a `host` field — reflecting the clinical sample bias
- Some genomes (~3,838) have NaN in at least one embedding dimension; these will be filtered in NB02
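
One way the NaN filtering could be done downstream is with `DataFrame.dropna` restricted to the embedding columns. This is only a sketch on a toy frame; NB02 may use a different approach.

```python
import numpy as np
import pandas as pd

emb_cols = [f'A{i:02d}' for i in range(64)]

# Toy frame: three genomes, one with a single missing dimension
toy = pd.DataFrame(np.zeros((3, 64)), columns=emb_cols)
toy.loc[1, 'A05'] = np.nan

# Drop any row with a NaN in at least one embedding column
complete = toy.dropna(subset=emb_cols)
print(f'{len(toy) - len(complete)} of {len(toy)} rows dropped')
```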

Next: `02_interactive_exploration.ipynb` for coordinate QC, environment harmonization, UMAP visualization, and geographic analysis.

In [None]:
print('=== Data extraction complete ===')
print(f'\nOutput files in {os.path.abspath(DATA_DIR)}:')
for f in sorted(os.listdir(DATA_DIR)):
    if f.endswith('.csv'):
        size = os.path.getsize(os.path.join(DATA_DIR, f)) / 1e6
        print(f'  {f} ({size:.1f} MB)')