# 02 - Intersect, Align, and Scale (TCGA ↔ METABRIC)

**Goal:** Put TCGA (logCPM) and METABRIC (microarray) on the **same gene space** and a **comparable scale** for cross-cohort modeling.  
**Inputs:**  
- `data_proc/tcga_expr_logcpm.parquet`  
- `data_proc/tcga_labels.tsv`  
- `data_proc/metabric_expr_raw.parquet`   
- `data_proc/metabric_labels.tsv`

**Outputs (to be created):**  
- `data_proc/aligned/tcga_expr_aligned.parquet`  
- `data_proc/aligned/metabric_expr_aligned.parquet`  
- `data_proc/aligned/scaler_tcga_stats.json` (if z-scoring)  
- `data_proc/aligned/tcga_expr_z.parquet`, `data_proc/aligned/metabric_expr_z.parquet` (when we scale)

**Provenance:** Conda env `tcga-brca-survival-project` | Date: <fill> | Author: Amith Murikinati

In [3]:
# core
import os, json, sys, math, time, gc
from pathlib import Path

# data stack
import numpy as np
import pandas as pd

# display options (clean tables)
pd.set_option("display.width", 160)
pd.set_option("display.max_columns", 50)

# reproducibility (used later for splits/models)
SEED = 42
rng = np.random.default_rng(SEED)

print("Python:", sys.version.split()[0])
print("Pandas:", pd.__version__)
print("NumPy:", np.__version__)

Python: 3.11.13
Pandas: 2.3.2
NumPy: 2.3.3


In [4]:
# paths relative to repo root
REPO = Path.cwd().parent if Path.cwd().name == "notebooks" else Path.cwd()
DATA_PROC = REPO / "data_proc"
ALIGNED = DATA_PROC / "aligned"
ALIGNED.mkdir(parents=True, exist_ok=True)

# expected inputs
P_TCGA_X = DATA_PROC / "tcga_expr_logcpm.parquet"
P_TCGA_Y = DATA_PROC / "tcga_labels.tsv"

# METABRIC expression matrix; only 'metabric_expr_raw.parquet' is checked here (no fallback filename)
P_MB_X = DATA_PROC / "metabric_expr_raw.parquet"
P_MB_Y = DATA_PROC / "metabric_labels.tsv"

print("Repo:", REPO)
print("Exists TCGA X?:", P_TCGA_X.exists())
print("Exists TCGA Y?:", P_TCGA_Y.exists())
print("Exists METABRIC X?:", P_MB_X.exists(), "->", P_MB_X.name)
print("Exists METABRIC Y?:", P_MB_Y.exists())
print("Aligned out dir:", ALIGNED)

Repo: C:\Users\mailt\Desktop\DSprojects2025\tcga-brca-survival-project
Exists TCGA X?: True
Exists TCGA Y?: True
Exists METABRIC X?: True -> metabric_expr_raw.parquet
Exists METABRIC Y?: True
Aligned out dir: C:\Users\mailt\Desktop\DSprojects2025\tcga-brca-survival-project\data_proc\aligned
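
If more than one METABRIC filename needs to be supported, a small helper can pick the first candidate that exists and fail loudly otherwise. This is a minimal sketch, not part of the notebook: `first_existing` and the second filename `metabric_expr_log2.parquet` are hypothetical names used only for illustration, and the demo runs against a temporary directory rather than `data_proc/`.

```python
from pathlib import Path
import tempfile


def first_existing(candidates):
    """Return the first path in `candidates` that exists, else raise."""
    for p in candidates:
        p = Path(p)
        if p.exists():
            return p
    raise FileNotFoundError(f"None of the candidate files exist: {[str(c) for c in candidates]}")


# demo in a throwaway directory (stand-in for data_proc/)
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "metabric_expr_raw.parquet").touch()
    chosen = first_existing([
        tmp / "metabric_expr_log2.parquet",  # hypothetical alternative name, absent here
        tmp / "metabric_expr_raw.parquet",
    ])
    chosen_name = chosen.name

print("Picked:", chosen_name)
```

Raising on a missing file, rather than printing `False` and continuing, surfaces path problems before the large parquet loads start.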


In [7]:
# load label tables (small)
tcga_y = pd.read_csv(P_TCGA_Y, sep="\t")
mb_y = pd.read_csv(P_MB_Y, sep="\t")
print("TCGA labels:", tcga_y.shape, "| cols:", tcga_y.columns.tolist())
print("MB labels:", mb_y.shape, "| cols:", mb_y.columns.tolist())

# fast load: read the full matrices into memory (fine at these sizes); switch to chunked reads if memory is tight
tcga_X = pd.read_parquet(P_TCGA_X)
mb_X   = pd.read_parquet(P_MB_X)

print("TCGA X:", tcga_X.shape, tcga_X.dtypes.iloc[:3].tolist())
print("MB   X:", mb_X.shape,   mb_X.dtypes.iloc[:3].tolist())

# basic cross-checks (no action yet; just info)
print("TCGA genes:", tcga_X.index.nunique(), "| samples:", tcga_X.shape[1])
print("MB   genes:", mb_X.index.nunique(),   "| samples:", mb_X.shape[1])

TCGA labels: (1094, 3) | cols: ['SAMPLE_ID', 'os_event', 'os_time_months']
MB labels: (1980, 4) | cols: ['SAMPLE_ID', 'PATIENT_ID', 'os_event', 'os_time_months']
TCGA X: (59427, 1094) [dtype('float32'), dtype('float32'), dtype('float32')]
MB   X: (20385, 1980) [dtype('float32'), dtype('float32'), dtype('float32')]
TCGA genes: 59427 | samples: 1094
MB   genes: 20385 | samples: 1980
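
The matrices above have very different gene counts (59427 vs 20385), so the next step is to restrict both to their shared genes in a common order. The sketch below shows one way to do that on toy data. It assumes both indexes already hold the same kind of gene identifier, which the notebook has not confirmed at this point; if one cohort uses Ensembl IDs and the other gene symbols, an ID mapping step would be needed first. `align_gene_space` and the `toy_*` frames are illustrative names, not part of the pipeline.

```python
import numpy as np
import pandas as pd


def align_gene_space(a: pd.DataFrame, b: pd.DataFrame):
    """Restrict two genes x samples matrices to their shared genes, in the same order."""
    # keep the first row for any duplicated gene ID so each index is unique
    a = a[~a.index.duplicated(keep="first")]
    b = b[~b.index.duplicated(keep="first")]
    # shared genes, sorted so both outputs have identical row order
    shared = a.index.intersection(b.index).sort_values()
    return a.loc[shared], b.loc[shared]


# toy stand-ins for tcga_X and mb_X (rows = genes, columns = samples)
toy_tcga = pd.DataFrame(
    np.arange(8, dtype="float32").reshape(4, 2),
    index=["A", "B", "C", "D"], columns=["t1", "t2"],
)
toy_mb = pd.DataFrame(
    np.arange(9, dtype="float32").reshape(3, 3),
    index=["D", "B", "E"], columns=["m1", "m2", "m3"],
)

toy_tcga_al, toy_mb_al = align_gene_space(toy_tcga, toy_mb)
print("Shared genes:", toy_tcga_al.index.tolist())
print("Shapes:", toy_tcga_al.shape, toy_mb_al.shape)
```

Sorting the shared index makes the row order deterministic, so the two aligned files can later be compared or stacked without re-checking gene order.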


In [8]:
# parameters for the upcoming step (the transform itself is not executed in this cell)
SCALING_METHOD = "zscore"   # planned options: "zscore" or "qnz" (quantile-normalize to TCGA, then z-score)
TRAIN_SOURCE   = "TCGA"     # fit stats on TCGA, apply to METABRIC
SAVE_TAG       = "v1"       # bump when iterating

print(f"Config -> scaling={SCALING_METHOD}, fit_on={TRAIN_SOURCE}, tag={SAVE_TAG}")

Config -> scaling=zscore, fit_on=TCGA, tag=v1
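
With `scaling=zscore` and `fit_on=TCGA`, the idea is to compute per-gene mean and standard deviation on TCGA only, then apply those same statistics to METABRIC. The sketch below illustrates this on toy data under a few assumptions: matrices are genes x samples with matching gene indexes, the population standard deviation (`ddof=0`) is used, and zero-variance genes get a standard deviation of 1 to avoid division by zero. `fit_gene_stats`, `apply_zscore`, and the `toy_*` names are hypothetical, and the JSON layout is only one possible shape for `scaler_tcga_stats.json`.

```python
import json

import pandas as pd


def fit_gene_stats(train: pd.DataFrame) -> pd.DataFrame:
    """Per-gene mean and std across samples (columns) of the training cohort."""
    stats = pd.DataFrame({
        "mean": train.mean(axis=1),
        "std": train.std(axis=1, ddof=0),
    }).astype("float64")  # float64 so values serialize cleanly to JSON
    # zero-variance genes would divide by zero; use 1.0 so they map to 0 after centering
    stats["std"] = stats["std"].replace(0.0, 1.0)
    return stats


def apply_zscore(X: pd.DataFrame, stats: pd.DataFrame) -> pd.DataFrame:
    """Center and scale each gene (row) with previously fitted statistics."""
    return X.sub(stats["mean"], axis=0).div(stats["std"], axis=0)


# toy stand-ins: train plays the role of TCGA, test the role of METABRIC
toy_train = pd.DataFrame([[1.0, 3.0], [2.0, 2.0]], index=["A", "B"], columns=["t1", "t2"])
toy_test = pd.DataFrame([[2.0, 4.0], [5.0, 2.0]], index=["A", "B"], columns=["m1", "m2"])

toy_stats = fit_gene_stats(toy_train)
toy_train_z = apply_zscore(toy_train, toy_stats)
toy_test_z = apply_zscore(toy_test, toy_stats)

# one possible on-disk layout: {gene: {"mean": ..., "std": ...}}
stats_json = json.dumps(toy_stats.to_dict(orient="index"))
print(toy_test_z)
```

Fitting on the training cohort alone keeps METABRIC information out of the scaler. Whether TCGA-derived statistics transfer well across platforms (RNA-seq logCPM vs microarray) is a separate question that this sketch does not address.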
