# 01 — Download & Explore GSE2034 (GEOparse)

**Goal:** Download the breast cancer microarray dataset **GSE2034** from GEO, cache it locally, and build an expression matrix for downstream analysis.

**What you'll do in this notebook:**
1. Set up folders & imports
2. Download GSE2034 using `GEOparse`
3. Inspect metadata (platforms, samples)
4. Build a probe × sample expression matrix from each sample's `VALUE` column
5. Try basic probe→gene mapping from the GPL annotation
6. Save matrices to disk for later notebooks

> Tip: If a step fails because of a network hiccup, re-run the cell. GEOparse reuses files already present in the cache directory instead of downloading them again.
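
If re-running by hand gets tedious, the retry can be scripted. The helper below is a minimal sketch, not part of GEOparse: `with_retries`, its parameters, and the delay schedule are hypothetical names and choices made for illustration.

```python
import time


def with_retries(fn, attempts=3, base_delay=2.0, exceptions=(Exception,)):
    """Call fn(); on failure wait and try again, doubling the delay each time."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except exceptions as err:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({err!r}); retrying in {delay:.0f}s")
            time.sleep(delay)


# Hypothetical usage with the download cell further down:
# gse = with_retries(lambda: GEOparse.get_GEO(geo="GSE2034", destdir=str(DATA_RAW)))
```

Catching every `Exception` is deliberately broad for a sketch; narrowing `exceptions` to network-related errors would avoid retrying genuine bugs.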


In [1]:
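from pathlib import Path

import GEOparse
import numpy as np
import pandas as pd
from IPython.display import display  # explicit, so display() also works outside Jupyter

# Resolve the project root (the parent folder when run from notebooks/) and
# define the raw-data, processed-data, and report directories
PROJECT_ROOT = Path.cwd().parent if Path.cwd().name == "notebooks" else Path.cwd()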
DATA_RAW = PROJECT_ROOT / "data" / "raw" / "GEO"
DATA_PROCESSED = PROJECT_ROOT / "data" / "processed"
REPORTS = PROJECT_ROOT / "reports"

for p in [DATA_RAW, DATA_PROCESSED, REPORTS]:
    p.mkdir(parents=True, exist_ok=True)

print("Project root:", PROJECT_ROOT)
print("Raw GEO cache:", DATA_RAW)
print("Processed:", DATA_PROCESSED)
print("GEOparse version:", GEOparse.__version__)


Project root: c:\Users\Kartikey Samadhiya\Desktop\Desktop\Acads\Projects\ML\MiniProject\BIO-MLLAB
Raw GEO cache: c:\Users\Kartikey Samadhiya\Desktop\Desktop\Acads\Projects\ML\MiniProject\BIO-MLLAB\data\raw\GEO
Processed: c:\Users\Kartikey Samadhiya\Desktop\Desktop\Acads\Projects\ML\MiniProject\BIO-MLLAB\data\processed
GEOparse version: 2.0.4


In [2]:
# Download the Series GSE2034 and cache in data/raw/GEO
# annotate_gpl=True tries to attach annotation tables (useful later)
gse_id = "GSE2034"
gse = GEOparse.get_GEO(geo=gse_id, destdir=str(DATA_RAW), annotate_gpl=True, how="full")
print(gse)
print("Number of samples (GSM):", len(gse.gsms))
print("Number of platforms (GPL):", len(gse.gpls))
print("Platforms:", list(gse.gpls.keys()))


19-Nov-2025 09:48:46 DEBUG utils - Directory c:\Users\Kartikey Samadhiya\Desktop\Desktop\Acads\Projects\ML\MiniProject\BIO-MLLAB\data\raw\GEO already exists. Skipping.
19-Nov-2025 09:48:46 INFO GEOparse - File already exist: using local version.
19-Nov-2025 09:48:46 INFO GEOparse - Parsing c:\Users\Kartikey Samadhiya\Desktop\Desktop\Acads\Projects\ML\MiniProject\BIO-MLLAB\data\raw\GEO\GSE2034_family.soft.gz: 
19-Nov-2025 09:48:46 DEBUG GEOparse - DATABASE: GeoMiame
19-Nov-2025 09:48:46 DEBUG GEOparse - SERIES: GSE2034
19-Nov-2025 09:48:47 DEBUG GEOparse - PLATFORM: GPL96
19-Nov-2025 09:48:48 DEBUG GEOparse - SAMPLE: GSM36777
19-Nov-2025 09:48:48 DEBUG GEOparse - SAMPLE: GSM36778
19-Nov-2025 09:48:48 DEBUG GEOparse - SAMPLE: GSM36779
19-Nov-2025 09:48:48 DEBUG GEOparse - SAMPLE: GSM36780
19-Nov-2025 09:48:48 DEBUG GEOparse - SAMPLE: GSM36781
19-Nov-2025 09:48:48 DEBUG GEOparse - SAMPLE: GSM36782
19-Nov-2025 09:48:48 DEBUG GEOparse - SAMPLE: GSM36783
... (log truncated; the remaining SAMPLE lines are omitted)

<SERIES: GSE2034 - 286 SAMPLES, 1 d(s)>
Number of samples (GSM): 286
Number of platforms (GPL): 1
Platforms: ['GPL96']


In [3]:
# Peek at study-level metadata
pd.set_option("display.max_rows", 20)
meta = pd.Series({k: v[0] if isinstance(v, list) and len(v) == 1 else v 
                  for k, v in gse.metadata.items()})
meta.head(30)


title                               Breast cancer relapse free survival
geo_accession                                                   GSE2034
status                                            Public on Feb 23 2005
submission_date                                             Dec 03 2004
last_update_date                                            Aug 10 2018
                                            ...                        
supplementary_file    ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE2nnn/...
platform_id                                                       GPL96
platform_taxid                                                     9606
sample_taxid                                                       9606
relation              BioProject: https://www.ncbi.nlm.nih.gov/biopr...
Length: 23, dtype: object

In [4]:
# Inspect sample (GSM) annotation: characteristics_ch1 often holds phenotype fields
def extract_gsm_meta(gsm):
    row = {"gsm": gsm.name}
    for k, v in gsm.metadata.items():
        if isinstance(v, list):
            row[k] = "; ".join(map(str, v))
        else:
            row[k] = v
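    # Defensive fallback: GSM tables normally hold only data columns (e.g. ID_REF, VALUE),
    # so this loop is usually a no-op; it copies these fields only if a table carries them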
    if hasattr(gsm, "table") and isinstance(gsm.table, pd.DataFrame):
        for col in ["title", "source_name_ch1"]:
            if col in gsm.table.columns:
                row[col] = "; ".join(gsm.table[col].astype(str).unique())
    return row

gsm_meta = pd.DataFrame([extract_gsm_meta(gsm) for gsm in gse.gsms.values()])
print("GSM meta shape:", gsm_meta.shape)
display(gsm_meta.head(10))
# Save for later
gsm_meta.to_csv(DATA_PROCESSED / f"{gse_id}_gsm_metadata.csv", index=False)


GSM meta shape: (286, 28)


,gsm,title,geo_accession,status,submission_date,last_update_date,type,channel_count,source_name_ch1,organism_ch1,...,contact_institute,contact_address,contact_city,contact_state,contact_zip/postal_code,contact_country,supplementary_file,relation,series_id,data_row_count
0,GSM36777,Wang4812_JA_277,GSM36777,Public on Feb 23 2005,Dec 03 2004,May 31 2013,RNA,1,Breast,Homo sapiens,...,Veridex,,San Diego,CA,92121,USA,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM36nn...,Reanalyzed by: GSE47561,GSE2034,22283
1,GSM36778,Wang4813_JA_278,GSM36778,Public on Feb 23 2005,Dec 03 2004,May 31 2013,RNA,1,Breast,Homo sapiens,...,Veridex,,San Diego,CA,92121,USA,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM36nn...,Reanalyzed by: GSE47561,GSE2034,22283
2,GSM36779,Wang4630_JA798@C,GSM36779,Public on Feb 23 2005,Dec 03 2004,May 31 2013,RNA,1,Breast,Homo sapiens,...,Veridex,,San Diego,CA,92121,USA,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM36nn...,Reanalyzed by: GSE47561,GSE2034,22283
3,GSM36780,Wang4889_JA_846@2,GSM36780,Public on Feb 23 2005,Dec 03 2004,May 31 2013,RNA,1,Breast,Homo sapiens,...,Veridex,,San Diego,CA,92121,USA,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM36nn...,Reanalyzed by: GSE47561,GSE2034,22283
4,GSM36781,Wang4857_JA_765@2,GSM36781,Public on Feb 23 2005,Dec 03 2004,May 31 2013,RNA,1,Breast,Homo sapiens,...,Veridex,,San Diego,CA,92121,USA,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM36nn...,Reanalyzed by: GSE47561,GSE2034,22283
5,GSM36782,Wang4585_JA600,GSM36782,Public on Feb 23 2005,Dec 03 2004,May 31 2013,RNA,1,Breast,Homo sapiens,...,Veridex,,San Diego,CA,92121,USA,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM36nn...,Reanalyzed by: GSE47561,GSE2034,22283
6,GSM36783,Wang4586_JA601,GSM36783,Public on Feb 23 2005,Dec 03 2004,May 31 2013,RNA,1,Breast,Homo sapiens,...,Veridex,,San Diego,CA,92121,USA,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM36nn...,Reanalyzed by: GSE47561,GSE2034,22283
7,GSM36784,Wang4587_JA602,GSM36784,Public on Feb 23 2005,Dec 03 2004,May 31 2013,RNA,1,Breast,Homo sapiens,...,Veridex,,San Diego,CA,92121,USA,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM36nn...,Reanalyzed by: GSE47561,GSE2034,22283
8,GSM36785,Wang4820_JA_605,GSM36785,Public on Feb 23 2005,Dec 03 2004,May 31 2013,RNA,1,Breast,Homo sapiens,...,Veridex,,San Diego,CA,92121,USA,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM36nn...,Reanalyzed by: GSE47561,GSE2034,22283
9,GSM36786,Wang4821_JA_606,GSM36786,Public on Feb 23 2005,Dec 03 2004,Apr 02 2012,RNA,1,Breast,Homo sapiens,...,Veridex,,San Diego,CA,92121,USA,ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM36nn...,,GSE2034,22283
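
Because `extract_gsm_meta` joins list-valued fields with `"; "`, a field such as `characteristics_ch1` ends up as a single `key: value; key: value` string. The sketch below shows one way to split such strings into separate columns. The function name `split_characteristics` is hypothetical, and the demo strings are made up for illustration; they are not values from GSE2034.

```python
import pandas as pd


def split_characteristics(series, sep="; "):
    """Turn 'key: value; key: value' strings into one column per key."""
    def parse(cell):
        out = {}
        if not isinstance(cell, str):
            return out  # missing values yield an all-NaN row
        for item in cell.split(sep):
            key, found, value = item.partition(":")
            if found:  # skip fragments without a 'key:' prefix
                out[key.strip()] = value.strip()
        return out

    return pd.DataFrame([parse(c) for c in series], index=series.index)


# Made-up strings for illustration only; NOT taken from GSE2034.
demo = pd.Series(
    ["tissue: breast; status: A", "tissue: breast; status: B", None],
    index=["S1", "S2", "S3"],
)
pheno = split_characteristics(demo)
print(pheno)
```

On the real table this would be called as `split_characteristics(gsm_meta["characteristics_ch1"])`, assuming that column is present in `gsm_meta`.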


In [5]:
# Build expression matrix: probes (rows) × samples (columns) by pivoting VALUE
# This works when every GSM table contains a 'VALUE' column (typical for microarrays)
expr = gse.pivot_samples('VALUE')  # rows: probes, columns: GSM IDs
print("Expression shape:", expr.shape)
display(expr.iloc[:5, :10])
# Save
expr.to_parquet(DATA_PROCESSED / f"{gse_id}_expr_probe.parquet")
expr.to_csv(DATA_PROCESSED / f"{gse_id}_expr_probe.csv")


Expression shape: (22283, 286)


ID_REF,GSM36777,GSM36778,GSM36779,GSM36780,GSM36781,GSM36782,GSM36783,GSM36784,GSM36785,GSM36786
1007_s_at,3848.1,6520.9,5285.7,4043.7,4263.6,2949.8,5498.9,3863.1,3370.4,3991.9
1053_at,228.9,112.5,178.4,398.7,417.7,221.2,280.4,198.2,304.7,198.2
117_at,213.1,189.8,269.7,312.4,327.1,225.0,243.5,244.4,348.5,185.3
121_at,1009.4,2083.3,1203.4,1104.4,1043.3,1117.6,1085.4,1423.1,1196.4,993.3
1255_g_at,31.8,145.8,42.5,108.2,69.2,47.4,84.3,102.0,22.8,86.3
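
To make the pivot less of a black box, here is a conceptual sketch of the same operation in plain pandas: stack the per-sample tables in long form, then pivot to probes × samples. It illustrates the idea and is not GEOparse's actual implementation; the sample names, probe IDs, and values are toy data.

```python
import pandas as pd

# Two toy per-sample tables (hypothetical values) shaped like GSM tables.
tables = {
    "S1": pd.DataFrame({"ID_REF": ["p1", "p2"], "VALUE": [1.0, 2.0]}),
    "S2": pd.DataFrame({"ID_REF": ["p1", "p2"], "VALUE": [3.0, 4.0]}),
}

# Stack the tables in long form, tagging each row with its sample name...
long_df = pd.concat(
    [t.assign(name=name) for name, t in tables.items()], ignore_index=True
)

# ...then pivot so that probes become rows and samples become columns.
wide = long_df.pivot(index="ID_REF", columns="name", values="VALUE")
print(wide)
```

The result has `ID_REF` as the row index and `name` as the column index name, matching the two header labels in the displayed matrix.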


In [6]:
# Try mapping probes to gene symbols using primary platform (first GPL)
# Note: Field names differ across platforms; we attempt a best-effort mapping.

primary_gpl = list(gse.gpls.values())[0]
gpl_tbl = primary_gpl.table.copy()
print("GPL table columns:", list(gpl_tbl.columns)[:20])

# Common candidate columns
candidate_gene_cols = [
    "Gene Symbol", "GENE_SYMBOL", "Symbol", "Gene symbol", "GeneSymbol", "SYMBOL"
]
candidate_id_cols = ["ID", "Probe ID", "ID_REF"]

def first_present(cols, df):
    return next((c for c in cols if c in df.columns), None)

gene_col = first_present(candidate_gene_cols, gpl_tbl)
id_col = first_present(candidate_id_cols, gpl_tbl)

if gene_col and id_col:
    map_df = gpl_tbl[[id_col, gene_col]].rename(columns={id_col: "probe_id", gene_col: "gene"})
    # Probes annotated with several genes list them as "A /// B"; keep only the first symbol
    map_df["gene"] = map_df["gene"].fillna("").astype(str).str.split(" /// ").str[0].str.strip()
    print(map_df.head())

    # Collapse probes to genes (mean across probes per gene, ignoring empty gene labels)
    valid = map_df.query("gene != ''")
    expr_gene = (
        expr
        .merge(valid, left_index=True, right_on="probe_id", how="inner")
        .drop(columns=["probe_id"])
        .groupby("gene", as_index=True).mean(numeric_only=True)
        .sort_index()
    )
    print("Gene-level matrix:", expr_gene.shape)
    display(expr_gene.iloc[:5, :5])

    expr_gene.to_parquet(DATA_PROCESSED / f"{gse_id}_expr_gene.parquet")
    expr_gene.to_csv(DATA_PROCESSED / f"{gse_id}_expr_gene.csv")
else:
    print("Could not find expected probe/gene columns on GPL; keeping probe-level matrix.")


GPL table columns: ['ID', 'GB_ACC', 'SPOT_ID', 'Species Scientific Name', 'Annotation Date', 'Sequence Type', 'Sequence Source', 'Target Description', 'Representative Public ID', 'Gene Title', 'Gene Symbol', 'ENTREZ_GENE_ID', 'RefSeq Transcript ID', 'Gene Ontology Biological Process', 'Gene Ontology Cellular Component', 'Gene Ontology Molecular Function']
    probe_id    gene
0  1007_s_at    DDR1
1    1053_at    RFC2
2     117_at   HSPA6
3     121_at    PAX8
4  1255_g_at  GUCA1A
Gene-level matrix: (13237, 286)


gene,GSM36777,GSM36778,GSM36779,GSM36780,GSM36781
A1CF,279.7,659.5,249.3,366.6,457.8
A2M,3885.5,3382.9,4260.2,5263.7,3906.1
A4GALT,159.1,84.6,94.1,83.2,80.3
A4GNT,291.1,213.2,270.7,260.6,219.3
AAAS,213.0,34.1,236.3,81.5,146.4
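
The collapse step is easier to verify on a toy example. The sketch below repeats the same logic (first symbol only, unannotated probes dropped, mean per gene) on four hypothetical probes, so the expected result can be checked by hand.

```python
import pandas as pd

# Toy probe-level matrix and annotation (hypothetical IDs and values).
expr_toy = pd.DataFrame(
    {"S1": [10.0, 30.0, 5.0, 7.0], "S2": [20.0, 40.0, 6.0, 8.0]},
    index=pd.Index(["p1", "p2", "p3", "p4"], name="ID_REF"),
)
annot = pd.DataFrame({
    "probe_id": ["p1", "p2", "p3", "p4"],
    "gene": ["G1", "G1 /// G2", "G3", None],
})

# Keep the first symbol of multi-gene probes; drop unannotated probes.
annot["gene"] = annot["gene"].fillna("").str.split(" /// ").str[0].str.strip()
valid = annot[annot["gene"] != ""]

# Attach gene labels to probes, then average all probes of the same gene.
gene_toy = (
    expr_toy.merge(valid, left_index=True, right_on="probe_id", how="inner")
    .drop(columns=["probe_id"])
    .groupby("gene")
    .mean()
)
print(gene_toy)
```

Here `p1` and `p2` are averaged into `G1`, the second symbol `G2` is discarded, and `p4` disappears because it has no gene label. Averaging is only one convention; keeping the highest-variance probe per gene is a common alternative.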


In [7]:
# Basic sanity checks & simple EDA
def summarize_matrix(M):
    s = pd.DataFrame({
        "n_genes_or_probes": [M.shape[0]],
        "n_samples": [M.shape[1]],
        "min": [float(np.nanmin(M.values))],
        "max": [float(np.nanmax(M.values))],
        "mean": [float(np.nanmean(M.values))],
        "std": [float(np.nanstd(M.values))],
        "missing_vals": [int(np.isnan(M.values).sum())],
    })
    return s

probe_summary = summarize_matrix(expr)
display(probe_summary)

# If gene-level exists, summarize too
gene_path = DATA_PROCESSED / f"{gse_id}_expr_gene.parquet"
if gene_path.exists():
    expr_gene = pd.read_parquet(gene_path)
    gene_summary = summarize_matrix(expr_gene)
    display(gene_summary)


,n_genes_or_probes,n_samples,min,max,mean,std,missing_vals
0,22283,286,0.1,157291.8,964.580996,3317.334678,0


,n_genes_or_probes,n_samples,min,max,mean,std,missing_vals
0,13237,286,0.1,99725.9,890.305044,2632.25999,0
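
A maximum of 157291.8 with a mean near 965 suggests the values are raw-scale intensities rather than log-transformed ones. The sketch below encodes that check as a rule of thumb. The function names, the threshold of 30, and the pseudocount of 1 are all assumptions chosen for illustration, not an established standard.

```python
import numpy as np


def looks_log_scaled(values, max_log_value=30.0):
    """Heuristic: treat data as log-scaled if its maximum is small."""
    arr = np.asarray(values, dtype=float)
    return bool(np.nanmax(arr) <= max_log_value)


def log2_if_needed(values, pseudocount=1.0):
    """Apply log2(x + pseudocount) unless the data already looks log-scaled."""
    arr = np.asarray(values, dtype=float)
    if looks_log_scaled(arr):
        return arr
    # Clip negatives to 0 so the logarithm is always defined.
    return np.log2(np.clip(arr, 0, None) + pseudocount)


# Small array using the min/max reported in the probe-level summary above.
raw = np.array([[0.1, 157291.8], [964.6, 3317.3]])
logged = log2_if_needed(raw)
print(looks_log_scaled(raw), looks_log_scaled(logged))
```

On the full matrix this would be `log2_if_needed(expr.values)`, wrapped back into a DataFrame with the original index and columns.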


## Next steps

- **02_preprocess_standardize.ipynb**: log transform (likely needed, since the summaries above show raw-scale intensities up to 157291.8), standardize per gene, handle missing values.
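- **03_distance_clustering.ipynb**: compute correlation distance (1 - Pearson r), then hierarchical clustering. Ward linkage is formally defined for Euclidean distances, so average linkage may be a safer pairing with correlation distance.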
- **04_bootstrap_stability.ipynb**: bootstrap resampling to assess cluster stability.
- **05_enrichment.ipynb**: run per-cluster enrichment (e.g., MSigDB) using `gseapy`.
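
As a preview of the first two follow-up notebooks, here is a minimal NumPy sketch of per-gene standardization and the correlation distance 1 - Pearson r. The function names and the synthetic data are assumptions for illustration; the later notebooks may implement these steps differently.

```python
import numpy as np


def zscore_rows(X):
    """Standardize each row (gene) to mean 0 and std 1; constant rows become 0."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=1, keepdims=True)
    sd = X.std(axis=1, keepdims=True)
    sd[sd == 0] = 1.0  # avoid division by zero for constant rows
    return (X - mu) / sd


def correlation_distance(X):
    """Pairwise 1 - Pearson r between the columns (samples) of X."""
    r = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    return 1.0 - r


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))  # 50 synthetic genes x 4 synthetic samples
X[:, 1] = X[:, 0]             # sample 1 duplicates sample 0
X[:, 2] = -X[:, 0]            # sample 2 mirrors sample 0

Z = zscore_rows(X)            # each gene now has mean 0 and std 1
D = correlation_distance(X)   # 4 x 4 symmetric matrix with values in [0, 2]
print(np.round(D, 3))
```

Identical samples get distance 0 and perfectly anti-correlated samples get distance 2. To cluster genes instead of samples, pass the transposed matrix to `correlation_distance`.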
