# TCGA Fusion & Preprocessing Notebook

### Overview
This notebook is designed to preprocess and integrate multiple datasets, including:

- **Expression matrices**
- **Clinical data**
- **Sample metadata**

The goal is to merge these datasets into a unified format suitable for training a **Deep Learning Model (MLP)**.

### Why are we doing this?

1. **Deep Learning Models**, such as Multi-Layer Perceptrons (MLPs), require structured and clean input data. This ensures the model can learn meaningful patterns without being hindered by inconsistencies or missing values.
2. By combining clinical, expression, and sample data, we create a **comprehensive dataset** that captures both molecular and clinical features. This enables the model to:
   - Predict outcomes more accurately.
   - Identify key biomarkers.
   - Support personalized medicine approaches.

### What will this notebook achieve?
- Preprocess raw data files to ensure consistency.
- Map and align identifiers across datasets.
- Merge all relevant features into a single dataset ready for **MLP training**.

Let's get started!

# Data Loading 

In [3]:

# Libraries and paths

import pandas as pd
import numpy as np
from tqdm import tqdm
import os
import json

# Paths (edit EXPR_PATH to your local path; keep exactly one assignment uncommented)

# Terry's path
# EXPR_PATH = "C:\\Users\\assou\\Documents\\PYTHON\\BIP12\\gdc_download_20251125_142547.268493"

# Elodie's path
EXPR_PATH = "/Users/elodiehusson/Desktop/dataset_DL"

# Pierre's path
# EXPR_PATH = " "
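Since each collaborator uses a different local folder, one option is to read the path from an environment variable rather than editing the cell. The following is only a minimal sketch: the variable name `TCGA_EXPR_PATH` and the helper `resolve_expr_path` are hypothetical and not part of the original notebook.

```python
import os
from pathlib import Path


def resolve_expr_path(default="."):
    """Return the folder given by TCGA_EXPR_PATH (hypothetical name), else `default`."""
    return Path(os.environ.get("TCGA_EXPR_PATH", default)).expanduser()


# Each user sets TCGA_EXPR_PATH once in their shell; the notebook stays unchanged
expr_path = resolve_expr_path()
print(expr_path)
```

With this approach, no personal path needs to be committed to the shared notebook.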


In [4]:
# Data loading
samples = pd.read_csv("../data/sample.tsv", sep="\t")
# Previously exported mapping table; it is rebuilt from the metadata JSON in the next section
map_df = pd.read_csv("../script/metadata_mapping.csv")

# Mapping File Names to Entity Submitter IDs

In this section, we will:

- Create a **mapping dataframe** using the metadata file provided by TCGA.
- Map **`file names`** to **`entity submitter IDs`**.
- Include other relevant IDs required for merging with clinical data later on.

### Why is this important?
This mapping is a **crucial step** in preparing the data for:
- Accurate integration of clinical and expression datasets.
- Ensuring consistency across all data sources for downstream analysis.
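To make the identifier mapping concrete, here is a minimal sketch of how a case-level submitter ID can be derived from a longer TCGA barcode by keeping its first three dash-separated fields. The barcode below is made up for illustration, and the helper name `to_case_submitter_id` is hypothetical.

```python
def to_case_submitter_id(entity_submitter_id: str) -> str:
    """Keep the first three dash-separated fields of a TCGA barcode."""
    return "-".join(entity_submitter_id.split("-")[:3])


# Made-up aliquot-level barcode, used only for illustration
example_barcode = "TCGA-EM-A3AQ-01A-11R-A22L-07"
print(to_case_submitter_id(example_barcode))  # TCGA-EM-A3AQ
```

Several aliquots of the same patient collapse to the same case-level ID, which is why duplicates have to be handled later on.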

In [5]:
# Build the mapping from file_name to aliquot_id, case_id and case submitter_id using the GDC metadata
with open("../data/metadata.cart.2025-11-25.json") as f:
    meta = json.load(f)

rows = []

for entry in meta:
    file_name = entry["file_name"]
    file_id = entry["file_id"]
    
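    # Only the first associated entity of each file is used
    ent = entry["associated_entities"][0]
    
    aliquot_id = ent["entity_id"]
    case_id = ent["case_id"]
    # Keep the first three barcode fields (e.g. TCGA-EM-A3AQ) as the case submitter ID
    submitter_id = "-".join(ent["entity_submitter_id"].split("-")[0:3])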
    
    rows.append({
        "file_name": file_name,
        "aliquot_id": aliquot_id,
        "case_id": case_id,
        "submitter_id": submitter_id
    })

map_df = pd.DataFrame(rows)

# Save the mapping dataframe (lookup table) as a .csv file
map_df.to_csv("metadata_mapping.csv", index=False)

map_df

Unnamed: 0,file_name,aliquot_id,case_id,submitter_id
0,aae97d39-d53c-4387-b1ca-415a8d7cea7c.rna_seq.a...,aa0338ae-c4e7-4f20-9be7-347aa5a2d8f2,e8f56d0f-eee4-4def-a43a-dec91f4382a1,TCGA-EM-A3AQ
1,eeb87fb5-69a1-4f07-816a-f9662a5e5650.rna_seq.a...,231c36ea-41b0-4a29-bb3f-222a266c2a19,fd3315da-c870-4ad0-9d2a-50b1647d3e46,TCGA-J8-A4HW
2,595a1305-804c-45c2-aa46-8a4f9cc3fc2f.rna_seq.a...,09c1d9fb-fb0f-4cb7-8a02-00e98e478eec,fd3315da-c870-4ad0-9d2a-50b1647d3e46,TCGA-J8-A4HW
3,d256684b-a5f0-4124-9044-1443348ec94e.rna_seq.a...,03a8d214-2f94-4ede-b6bc-e5e7d6d6515d,5e085199-152a-40f5-a8f8-3a3a0f31c2e0,TCGA-DJ-A3UW
4,cba84817-ec75-481b-be5c-bf2cb79cf3a3.rna_seq.a...,b769a631-6c4c-43aa-821b-b10cd4bba51b,d4c68c1c-a3f3-4e0c-b555-d457378a1d24,TCGA-BJ-A291
...,...,...,...,...
567,69591486-40ca-4916-b935-d79e541ada41.rna_seq.a...,7c372899-4ca8-4f06-9d2c-058f9a859e4a,3b3c99ab-5336-4433-b682-e1a590221611,TCGA-DJ-A13X
568,c162a6f7-231a-46d3-94f6-d18c8f9d483a.rna_seq.a...,67107614-9a9d-4d62-9c2e-3b8e242ac92d,3a211d5a-085f-4902-86b6-50e4eb36b897,TCGA-EM-A3O7
569,5885620e-b451-4014-9e2a-fff70415feb1.rna_seq.a...,35b40466-674d-4909-aec0-0067f8f6f00c,3a7c35e0-9ed1-4098-8585-52f5991b2534,TCGA-IM-A3EB
570,e0db2fea-b597-4569-951f-a0563c9a5521.rna_seq.a...,ee052957-28e2-4192-b54d-b7c877594ece,3bfa52af-cb40-45d8-9bdc-591985aae7fb,TCGA-EL-A3ZO


We need to filter the clinical file:
- remove whitespace from the column names and from all cells
- remove samples that were not taken from the thyroid gland
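The whitespace cleanup described above can be sketched on a toy table. This is an illustration under assumptions, not the notebook's own method: the helper `clean_whitespace` is hypothetical, and it only trims leading and trailing whitespace in cells so that multi-word values such as "Thyroid gland" remain comparable.

```python
import pandas as pd

# Toy clinical table with stray whitespace (values are made up)
toy = pd.DataFrame({
    " cases.case_id ": ["a1 ", " b2"],
    "diagnoses.tissue_or_organ_of_origin": [" Thyroid gland", "Thyroid gland "],
    "diagnoses.days_to_diagnosis": [0, 0],
})


def clean_whitespace(df):
    """Remove whitespace from column names and trim string cells."""
    out = df.copy()
    # Column names: remove every whitespace character
    out.columns = ["".join(c.split()) if isinstance(c, str) else c for c in out.columns]
    # Cells: trim only the ends; non-string values are left untouched
    for col in out.columns:
        out[col] = out[col].map(lambda v: v.strip() if isinstance(v, str) else v)
    return out


cleaned = clean_whitespace(toy)
print(list(cleaned.columns))
print(cleaned["diagnoses.tissue_or_organ_of_origin"].tolist())
```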

In [6]:
clinical = pd.read_csv("../data/clinical.tsv", sep="\t")

# Remove all whitespace (of any kind) from the column names and the index.
# Note: string cell values are not modified by these two lines.
clinical.columns = (clinical.columns.map(lambda c: ''.join(c.split()) if isinstance(c, str) else c))
clinical.index = clinical.index.map(lambda v: ''.join(v.split()) if isinstance(v, str) else v)

# Drop rows where days_to_diagnosis != 0 AND tissue_or_organ_of_origin != "Thyroid gland".
# Non-numeric days are coerced to NaN, which also compares as != 0.
days = pd.to_numeric(clinical["diagnoses.days_to_diagnosis"], errors="coerce")
cond_remove = (days != 0) & (clinical["diagnoses.tissue_or_organ_of_origin"] != "Thyroid gland")
clinical = clinical[~cond_remove].copy()

# Set case_id as the index
clinical = clinical.set_index("cases.case_id")


# Map each case_id to its submitter_id in clinical
def get_sub_id_case(case_id):
    match = map_df[map_df["case_id"] == case_id]
    if len(match) > 0:
        return match["submitter_id"].values[0]
    return case_id  # fall back to the case_id if no mapping is found

clinical.index = clinical.index.map(get_sub_id_case)
clinical.index.name = "submitter_id"

clinical
clinical.to_csv("../clinical_sub.tsv", sep="\t", index=True)
clinical_sub = pd.read_csv("../clinical_sub.tsv", sep="\t")
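The row-by-row lookup above filters the mapping table once per index entry. As an alternative, a plain dictionary lookup gives the same "first match, else keep the original value" behaviour. This is a minimal sketch on toy data; the IDs are made up.

```python
import pandas as pd

# Toy mapping table and toy clinical index (made-up IDs)
toy_map = pd.DataFrame({
    "case_id": ["c1", "c1", "c2"],
    "submitter_id": ["TCGA-AA-0001", "TCGA-AA-0001", "TCGA-AA-0002"],
})
toy_index = pd.Index(["c1", "c2", "c3"], name="cases.case_id")

# Keep the first submitter_id seen for each case_id
case_to_sub = (
    toy_map.drop_duplicates(subset="case_id")
           .set_index("case_id")["submitter_id"]
           .to_dict()
)

# Unmapped case_ids fall back to themselves, as in the function above
new_index = pd.Index([case_to_sub.get(c, c) for c in toy_index], name="submitter_id")
print(new_index.tolist())
```

Entries that keep their original case_id after this step are worth inspecting, since they will not match any submitter ID in later joins.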



In [7]:
clinical["diagnoses.ajcc_pathologic_stage"].astype(str).value_counts()

diagnoses.ajcc_pathologic_stage
Stage I                            1155
Stage III                           472
Stage IVA                           212
Stage II                            198
Stage IVC                            24
'--                                   4
Stage IV                              4
Name: count, dtype: int64

In [8]:
# Concatenate expression data since each file represents ~ one sample 
# takes around 1 minute to run

expression = []

for root, dirs, files in os.walk(EXPR_PATH):
    for f in files:
        if f.endswith(".tsv"):
            path = os.path.join(root, f) # full path to the file

            try:
                df = pd.read_csv(path, sep="\t", comment="#")

                # Keep only the expression column for this sample, named after the file
                expression.append(
                    df.set_index("gene_id")[["fpkm_uq_unstranded"]]
                    .rename(columns={"fpkm_uq_unstranded": f})
                )

            except Exception as err:
                # Report files that cannot be parsed instead of skipping them silently
                print(f"Skipped {path}: {err}")

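# Concatenate all expression columns horizontally
concat_all_expr = pd.concat(expression, axis=1)
# Drop the first 4 rows (assumed to be the STAR summary rows N_unmapped, N_multimapping,
# N_noFeature, N_ambiguous), then transpose to samples x genes
concat_all_expr = concat_all_expr[4:].T.copy()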
concat_all_expr.index.name = "file_name"
concat_all_expr

file_name,ENSG00000000003.15,ENSG00000000005.6,ENSG00000000419.13,ENSG00000000457.14,ENSG00000000460.17,ENSG00000000938.13,ENSG00000000971.16,ENSG00000001036.14,ENSG00000001084.13,ENSG00000001167.14,...,ENSG00000288661.1,ENSG00000288662.1,ENSG00000288663.1,ENSG00000288665.1,ENSG00000288667.1,ENSG00000288669.1,ENSG00000288670.1,ENSG00000288671.1,ENSG00000288674.1,ENSG00000288675.1
990d59a1-18bd-4903-b1c9-f4d8b9edf980.rna_seq.augmented_star_gene_counts.tsv,12.7498,0.0361,22.1672,2.0347,0.7283,2.1849,26.4351,18.4802,2.8694,11.9485,...,0.0,0.0000,0.0468,0.0,0.4373,0.0000,4.6207,0.0,0.0000,0.4287
fc853d38-8069-41b0-af9c-77925a3f8063.rna_seq.augmented_star_gene_counts.tsv,16.3237,0.0490,33.2109,2.1767,0.9475,9.8380,42.9446,17.0994,4.8779,13.5060,...,0.0,0.0000,0.1332,0.0,0.0000,0.0045,3.8706,0.0,0.0188,0.3335
7091c2c6-682c-4c5e-810e-fa254b3a20bc.rna_seq.augmented_star_gene_counts.tsv,11.1951,0.0404,32.0691,2.6413,0.7896,6.2225,16.8074,14.3013,5.0269,11.6965,...,0.0,0.0000,0.0942,0.0,0.0000,0.0000,3.8857,0.0,0.0124,0.1421
bafc3122-091b-4648-bc51-8e6c72e47b6a.rna_seq.augmented_star_gene_counts.tsv,19.8536,0.0138,35.4080,2.3813,0.7640,3.4978,13.4795,17.3233,4.3141,12.4163,...,0.0,0.0000,0.0536,0.0,0.0000,0.0000,3.7747,0.0,0.0127,0.2545
0f6e2216-6762-4c82-aa09-a9b36a475392.rna_seq.augmented_star_gene_counts.tsv,11.9501,0.0000,26.5317,2.7781,0.6816,6.5881,11.0855,15.4334,4.1267,11.1938,...,0.0,0.0000,0.0639,0.0,0.0000,0.0000,3.6067,0.0,0.0086,0.1969
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2df38eb8-5350-4951-9159-a8add6474efe.rna_seq.augmented_star_gene_counts.tsv,21.5420,0.1269,35.5278,1.7214,0.9467,4.7013,41.1976,27.1338,5.5665,9.2689,...,0.0,0.0000,0.0329,0.0,0.0000,0.0000,5.4264,0.0,0.0053,0.4763
a61ada75-0759-45d0-9cb0-f08847b16a5d.rna_seq.augmented_star_gene_counts.tsv,24.5194,0.0263,37.2317,1.5829,0.7970,5.0245,26.1152,27.1348,5.5435,10.6860,...,0.0,0.0000,0.0273,0.0,0.0000,0.0000,4.3526,0.0,0.0020,0.1387
4fa0416d-79d1-4b68-a146-1956bcaf49b7.rna_seq.augmented_star_gene_counts.tsv,20.4696,0.0338,29.0052,2.3045,0.8117,2.6664,22.5726,17.3186,3.6015,12.8668,...,0.0,0.0000,0.0307,0.0,0.0000,0.0000,4.3269,0.0,0.0104,0.3420
185ecfd0-680c-4874-8667-5d7543ec562c.rna_seq.augmented_star_gene_counts.tsv,13.3239,0.0000,34.2054,1.3185,0.2376,4.7621,26.6526,18.8882,2.6144,7.3573,...,0.0,0.3477,0.0220,0.0,0.0000,0.0000,3.9138,0.0,0.0022,0.6209
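The per-file loading and concatenation can be sketched on a small in-memory file. This is a hedged illustration: the toy file is a simplified stand-in for a gene counts file, the helper `read_expression_column` is hypothetical, and the summary rows are dropped by their `N_` prefix rather than by position.

```python
import io
import pandas as pd


def read_expression_column(handle, name, value_col="fpkm_uq_unstranded"):
    """Read one expression file and return a single column named after the file."""
    df = pd.read_csv(handle, sep="\t", comment="#")
    col = df.set_index("gene_id")[value_col].rename(name)
    # Drop summary rows such as N_unmapped by name instead of by position
    return col[~col.index.str.startswith("N_")]


# Simplified toy file: one comment line, a header, 4 summary rows, 2 genes
toy_file = (
    "# toy header comment\n"
    "gene_id\tfpkm_uq_unstranded\n"
    "N_unmapped\t\n"
    "N_multimapping\t\n"
    "N_noFeature\t\n"
    "N_ambiguous\t\n"
    "ENSG_A\t1.5\n"
    "ENSG_B\t0.0\n"
)

columns = [read_expression_column(io.StringIO(toy_file), f"file_{i}.tsv") for i in range(2)]
matrix = pd.concat(columns, axis=1).T  # rows = files, columns = genes
matrix.index.name = "file_name"
print(matrix)
```

Filtering by name avoids silently dropping real genes if a file happens to lack the summary rows.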


In [9]:
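# Count the unique expression files and the unique cases they map to (via map_df)
unique_files = concat_all_expr.index.nunique()
unique_cases = map_df.loc[map_df["file_name"].isin(concat_all_expr.index), "case_id"].nunique()
# Files beyond one per case are counted as duplicates (this includes files without a mapping)
print("Number of unique samples:", unique_files, f", so there are {len(concat_all_expr)-unique_cases} duplicates")

Number of unique samples: 572 , so there are 67 duplicates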
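To see where the surplus files come from, one can count files per case and, separately, files with no entry in the mapping table. The following is a minimal sketch on toy data with made-up names; it is not the notebook's own computation.

```python
import pandas as pd

# Toy mapping table and toy list of expression files (f5.tsv has no mapping)
toy_map = pd.DataFrame({
    "file_name": ["f1.tsv", "f2.tsv", "f3.tsv", "f4.tsv"],
    "case_id": ["c1", "c1", "c2", "c3"],
})
toy_files = pd.Index(["f1.tsv", "f2.tsv", "f3.tsv", "f4.tsv", "f5.tsv"])

# Number of distinct files per case, restricted to files that are present
mapped = toy_map[toy_map["file_name"].isin(toy_files)]
files_per_case = mapped.groupby("case_id")["file_name"].nunique()

n_extra = int((files_per_case - 1).sum())  # files beyond the first one of each case
n_unmapped = int((~toy_files.isin(toy_map["file_name"])).sum())
print("extra files:", n_extra, "| unmapped files:", n_unmapped)
```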


# Removing duplicates and setting `submitter_id` as the index

In [10]:
# 1) Build the file_name -> case_id lookup
file_to_case = (
    map_df.drop_duplicates(subset="file_name")   # safety net in case a file appears twice
          .set_index("file_name")["case_id"]
)

# 2) Create a new dataframe with a case_id column mapped from the index (file_name)
expr_matrix = concat_all_expr.copy()
expr_matrix.insert(
    0,  # put the column first (optional)
    "case_id",
    expr_matrix.index.to_series().map(file_to_case)
)

# 3) Drop unmapped rows
expr_matrix = expr_matrix.dropna(subset=["case_id"]).copy()

# 4) Keep a single sample per case_id (the first one encountered)
expr_matrix = expr_matrix.loc[~expr_matrix["case_id"].duplicated(keep="first")].copy()

# 5) Drop the case_id column
expr_matrix = expr_matrix.drop(columns=["case_id"])


# 6) Use the submitter_id values as the index
file_to_submitter = (map_df.drop_duplicates(subset="file_name").set_index("file_name")["submitter_id"])  # file_name -> submitter_id lookup
expr_matrix.index = expr_matrix.index.map(file_to_submitter)
expr_matrix.index.name = "submitter_id"

expr_matrix

submitter_id,ENSG00000000003.15,ENSG00000000005.6,ENSG00000000419.13,ENSG00000000457.14,ENSG00000000460.17,ENSG00000000938.13,ENSG00000000971.16,ENSG00000001036.14,ENSG00000001084.13,ENSG00000001167.14,...,ENSG00000288661.1,ENSG00000288662.1,ENSG00000288663.1,ENSG00000288665.1,ENSG00000288667.1,ENSG00000288669.1,ENSG00000288670.1,ENSG00000288671.1,ENSG00000288674.1,ENSG00000288675.1
TCGA-FE-A22Z,12.7498,0.0361,22.1672,2.0347,0.7283,2.1849,26.4351,18.4802,2.8694,11.9485,...,0.0,0.0000,0.0468,0.0,0.4373,0.0000,4.6207,0.0,0.0000,0.4287
TCGA-EL-A3ZS,16.3237,0.0490,33.2109,2.1767,0.9475,9.8380,42.9446,17.0994,4.8779,13.5060,...,0.0,0.0000,0.1332,0.0,0.0000,0.0045,3.8706,0.0,0.0188,0.3335
TCGA-KS-A41I,11.1951,0.0404,32.0691,2.6413,0.7896,6.2225,16.8074,14.3013,5.0269,11.6965,...,0.0,0.0000,0.0942,0.0,0.0000,0.0000,3.8857,0.0,0.0124,0.1421
TCGA-E3-A3E0,19.8536,0.0138,35.4080,2.3813,0.7640,3.4978,13.4795,17.3233,4.3141,12.4163,...,0.0,0.0000,0.0536,0.0,0.0000,0.0000,3.7747,0.0,0.0127,0.2545
TCGA-DJ-A3US,11.9501,0.0000,26.5317,2.7781,0.6816,6.5881,11.0855,15.4334,4.1267,11.1938,...,0.0,0.0000,0.0639,0.0,0.0000,0.0000,3.6067,0.0,0.0086,0.1969
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
TCGA-DJ-A3VJ,18.4508,0.0214,34.8468,1.8898,0.7342,3.1442,33.0756,17.6688,3.7384,11.4268,...,0.0,0.0000,0.0361,0.0,0.0000,0.0000,3.2226,0.0,0.0099,0.3202
TCGA-EL-A3D6,24.5194,0.0263,37.2317,1.5829,0.7970,5.0245,26.1152,27.1348,5.5435,10.6860,...,0.0,0.0000,0.0273,0.0,0.0000,0.0000,4.3526,0.0,0.0020,0.1387
TCGA-DJ-A2Q4,20.4696,0.0338,29.0052,2.3045,0.8117,2.6664,22.5726,17.3186,3.6015,12.8668,...,0.0,0.0000,0.0307,0.0,0.0000,0.0000,4.3269,0.0,0.0104,0.3420
TCGA-ET-A39K,13.3239,0.0000,34.2054,1.3185,0.2376,4.7621,26.6526,18.8882,2.6144,7.3573,...,0.0,0.3477,0.0220,0.0,0.0000,0.0000,3.9138,0.0,0.0022,0.6209
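The "map, drop unmapped, keep first" logic can be condensed into a boolean mask. This is a minimal sketch on toy data; it deduplicates directly on the submitter ID, which assumes that case IDs and submitter IDs correspond one-to-one.

```python
import pandas as pd

# Toy expression matrix indexed by file name (values are made up)
toy_expr = pd.DataFrame(
    {"ENSG_A": [1.0, 2.0, 3.0], "ENSG_B": [0.1, 0.2, 0.3]},
    index=pd.Index(["f1.tsv", "f2.tsv", "f3.tsv"], name="file_name"),
)
# f1 and f2 belong to the same patient
file_to_sub = pd.Series({
    "f1.tsv": "TCGA-AA-0001",
    "f2.tsv": "TCGA-AA-0001",
    "f3.tsv": "TCGA-AA-0002",
})

sub_ids = toy_expr.index.map(file_to_sub)
# Keep rows that are mapped and are the first occurrence of their submitter ID
keep = sub_ids.notna() & ~sub_ids.duplicated(keep="first")

dedup = toy_expr[keep].copy()
dedup.index = pd.Index(sub_ids[keep], name="submitter_id")
print(dedup)
```

Note that "first encountered" depends on file traversal order, so which aliquot is kept for a patient is not guaranteed to be stable across machines.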


In [11]:
# Check for any remaining duplicates after deduplication
print("Number of index entries:", len(expr_matrix.index))
print("Duplicates:", expr_matrix.index.duplicated().sum())

Number of index entries: 505
Duplicates: 0


# Adding clinical information to the expression matrix

The clinical variables of interest:
- age (e.g. age56): `clinical_sub["demographic.age_at_index"]`
- sex (male or female): `clinical_sub["demographic.gender"]`
- tumor stage (StageI, StageIII, StageII, StageIVA, StageIVC, normal, StageIV): `clinical_sub["diagnoses.ajcc_pathologic_stage"]`
- tumor type: `clinical_sub["cases.disease_type"]`, `clinical_sub["diagnoses.primary_diagnosis"]`


Challenges:
- There are still duplicates in the clinical file, but for the variables of interest this is not a problem because the value is always the same across a patient's rows. This holds for age, sex, and tumor stage so far; tumor type has not been checked yet.
- Done: patients without a tumor stage are relabeled as "normal".
- The tumor type is not used because the categories are too broad; clustering might help identify patients who share a tumor type.
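The claim that duplicated clinical rows carry the same values per patient can be checked by counting distinct values per `submitter_id`. This is a minimal sketch on a made-up table in which one patient is deliberately inconsistent.

```python
import pandas as pd

# Toy clinical table: P2 has two different stages on purpose
toy_clinical = pd.DataFrame({
    "submitter_id": ["P1", "P1", "P2", "P2"],
    "demographic.age_at_index": [46, 46, 51, 51],
    "demographic.gender": ["female", "female", "male", "male"],
    "diagnoses.ajcc_pathologic_stage": ["Stage I", "Stage I", "Stage II", "Stage III"],
})
cols = [
    "demographic.age_at_index",
    "demographic.gender",
    "diagnoses.ajcc_pathologic_stage",
]

# Number of distinct values per patient and per variable (NaN counts as a value)
n_values = toy_clinical.groupby("submitter_id")[cols].nunique(dropna=False)

# Patients with more than one distinct value in at least one variable
inconsistent = n_values[(n_values > 1).any(axis=1)]
print(inconsistent)
```

An empty result on the real table would support keeping only the first row per patient.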

Step 1: inspect the variables in clinical_sub before starting

In [12]:
# age:
clinical_sub["demographic.age_at_index"].value_counts()

demographic.age_at_index
46    76
51    70
33    64
55    63
37    61
      ..
89     4
80     4
81     3
88     3
87     2
Name: count, Length: 73, dtype: int64

In [13]:
# sex:
clinical_sub["demographic.gender"].value_counts()

demographic.gender
female                1521
male                   548
Name: count, dtype: int64

In [14]:
# tumor stage:
clinical_sub["diagnoses.ajcc_pathologic_stage"].value_counts()

diagnoses.ajcc_pathologic_stage
Stage I                            1155
Stage III                           472
Stage IVA                           212
Stage II                            198
Stage IVC                            24
'--                                   4
Stage IV                              4
Name: count, dtype: int64

In [15]:
# tumor type:
print(clinical_sub["diagnoses.primary_diagnosis"].value_counts(),"\n\n")
clinical_sub["cases.disease_type"].value_counts()

diagnoses.primary_diagnosis
Papillary adenocarcinoma, NOS                  1429
Papillary carcinoma, follicular variant         447
Papillary carcinoma, columnar cell              153
Nonencapsulated sclerosing carcinoma             18
Carcinoma, NOS                                    7
Papillary carcinoma, oxyphilic cell               7
Follicular carcinoma, minimally invasive          2
Oxyphilic adenocarcinoma                          2
Papillary carcinoma, NOS                          2
Follicular adenocarcinoma, NOS                    2
Name: count, dtype: int64 




cases.disease_type
Adenomas and Adenocarcinomas    2060
Epithelial Neoplasms, NOS          7
Squamous Cell Neoplasms            2
Name: count, dtype: int64

Step 2: add the clinical variables to the expression matrix

In [16]:
# Build a minimal clinical table indexed by submitter_id
clinical_map = (clinical_sub.drop_duplicates(subset="submitter_id")
        .set_index("submitter_id")[[
            "demographic.age_at_index",
            "demographic.gender",
            "diagnoses.ajcc_pathologic_stage",  ]])

clinical_map["diagnoses.ajcc_pathologic_stage"] = (
    clinical_map["diagnoses.ajcc_pathologic_stage"]
        .astype(str)
        .str.strip()
        .replace({"'--": "normal", "--": "normal", "nan": "normal"})
)


def build_new_index(submitter_id):
    # Remove the TCGA- prefix
    sid = submitter_id.replace("TCGA-", "")
    
    # Note: .loc raises a KeyError if submitter_id is absent from clinical_map
    age = clinical_map.loc[submitter_id, "demographic.age_at_index"]
    sex = clinical_map.loc[submitter_id, "demographic.gender"]
    stage = clinical_map.loc[submitter_id, "diagnoses.ajcc_pathologic_stage"]
    
    return f"{sid}_age{age}_{sex}_{stage}_"

expr_matrix_clinic = expr_matrix.copy()
expr_matrix_clinic.index = expr_matrix_clinic.index.map(build_new_index)

KeyError: 'TCGA-FE-A22Z'
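The `KeyError` indicates that at least one submitter ID of the expression matrix is absent from the index of the clinical table. Its root cause is not established here. A minimal sketch for listing such IDs, and for building labels without failing on them, could look like this; the toy tables and the helper `build_label` are hypothetical.

```python
import pandas as pd

# Toy data: TCGA-AA-0003 has expression data but no clinical row
toy_expr_index = pd.Index(["TCGA-AA-0001", "TCGA-AA-0002", "TCGA-AA-0003"])
toy_clinical_map = pd.DataFrame(
    {
        "demographic.age_at_index": [46, 51],
        "demographic.gender": ["female", "male"],
        "diagnoses.ajcc_pathologic_stage": ["StageI", "normal"],
    },
    index=pd.Index(["TCGA-AA-0001", "TCGA-AA-0002"], name="submitter_id"),
)

# IDs present in the expression matrix but missing from the clinical table
missing = toy_expr_index.difference(toy_clinical_map.index)
print("Missing from clinical table:", missing.tolist())


def build_label(submitter_id, table):
    """Build the annotated label, leaving IDs without clinical data unchanged."""
    if submitter_id not in table.index:
        return submitter_id
    row = table.loc[submitter_id]
    sid = submitter_id.replace("TCGA-", "")
    age = row["demographic.age_at_index"]
    sex = row["demographic.gender"]
    stage = row["diagnoses.ajcc_pathologic_stage"]
    return f"{sid}_age{age}_{sex}_{stage}_"


labels = [build_label(s, toy_clinical_map) for s in toy_expr_index]
print(labels)
```

Whether to keep unlabeled samples or drop them is a modelling decision; the sketch only makes the mismatch visible.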

In [None]:
expr_matrix_clinic

submitter_id,ENSG00000000003.15,ENSG00000000005.6,ENSG00000000419.13,ENSG00000000457.14,ENSG00000000460.17,ENSG00000000938.13,ENSG00000000971.16,ENSG00000001036.14,ENSG00000001084.13,ENSG00000001167.14,...,ENSG00000288661.1,ENSG00000288662.1,ENSG00000288663.1,ENSG00000288665.1,ENSG00000288667.1,ENSG00000288669.1,ENSG00000288670.1,ENSG00000288671.1,ENSG00000288674.1,ENSG00000288675.1
TCGA-FE-A22Z,12.7498,0.0361,22.1672,2.0347,0.7283,2.1849,26.4351,18.4802,2.8694,11.9485,...,0.0,0.0000,0.0468,0.0,0.4373,0.0000,4.6207,0.0,0.0000,0.4287
TCGA-EL-A3ZS,16.3237,0.0490,33.2109,2.1767,0.9475,9.8380,42.9446,17.0994,4.8779,13.5060,...,0.0,0.0000,0.1332,0.0,0.0000,0.0045,3.8706,0.0,0.0188,0.3335
TCGA-KS-A41I,11.1951,0.0404,32.0691,2.6413,0.7896,6.2225,16.8074,14.3013,5.0269,11.6965,...,0.0,0.0000,0.0942,0.0,0.0000,0.0000,3.8857,0.0,0.0124,0.1421
TCGA-E3-A3E0,19.8536,0.0138,35.4080,2.3813,0.7640,3.4978,13.4795,17.3233,4.3141,12.4163,...,0.0,0.0000,0.0536,0.0,0.0000,0.0000,3.7747,0.0,0.0127,0.2545
TCGA-DJ-A3US,11.9501,0.0000,26.5317,2.7781,0.6816,6.5881,11.0855,15.4334,4.1267,11.1938,...,0.0,0.0000,0.0639,0.0,0.0000,0.0000,3.6067,0.0,0.0086,0.1969
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
TCGA-DJ-A3VJ,18.4508,0.0214,34.8468,1.8898,0.7342,3.1442,33.0756,17.6688,3.7384,11.4268,...,0.0,0.0000,0.0361,0.0,0.0000,0.0000,3.2226,0.0,0.0099,0.3202
TCGA-EL-A3D6,24.5194,0.0263,37.2317,1.5829,0.7970,5.0245,26.1152,27.1348,5.5435,10.6860,...,0.0,0.0000,0.0273,0.0,0.0000,0.0000,4.3526,0.0,0.0020,0.1387
TCGA-DJ-A2Q4,20.4696,0.0338,29.0052,2.3045,0.8117,2.6664,22.5726,17.3186,3.6015,12.8668,...,0.0,0.0000,0.0307,0.0,0.0000,0.0000,4.3269,0.0,0.0104,0.3420
TCGA-ET-A39K,13.3239,0.0000,34.2054,1.3185,0.2376,4.7621,26.6526,18.8882,2.6144,7.3573,...,0.0,0.3477,0.0220,0.0,0.0000,0.0000,3.9138,0.0,0.0022,0.6209


Step 3: save the file

In [None]:
# Check that missing stages were correctly mapped to "normal"
clinical_map["diagnoses.ajcc_pathologic_stage"].value_counts()

diagnoses.ajcc_pathologic_stage
StageI      282
StageIII    112
StageII      52
StageIVA     47
StageIVC      6
normal        4
StageIV       2
Name: count, dtype: int64

In [None]:
# The expression matrix cannot be stored on GitHub because it is too large (>100 MB), so it is saved locally
# This takes about 54 seconds
expr_matrix_clinic.to_csv(os.path.join(EXPR_PATH, "expression_matrix_clinical.tsv"), sep="\t", index=True)
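Because file size is the constraint here, writing the matrix with gzip compression may be worth trying. This is a minimal sketch on a toy matrix using a temporary folder; whether the compressed real matrix would fall under the size limit is not known.

```python
import os
import tempfile
import pandas as pd

# Toy matrix standing in for the real expression matrix
toy = pd.DataFrame(
    {"ENSG_A": [1.0, 2.0]},
    index=pd.Index(["s1", "s2"], name="submitter_id"),
)

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "expression_matrix_clinical.tsv.gz")
    # Write a gzip-compressed TSV
    toy.to_csv(out_path, sep="\t", index=True, compression="gzip")
    # read_csv infers the compression from the .gz extension
    reloaded = pd.read_csv(out_path, sep="\t", index_col="submitter_id")

print(reloaded)
```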

## Sample Data

This step is crucial for ensuring that all datasets are aligned and can be effectively merged for comprehensive analysis.

### Key Notes:
- In the **sample data**, the `submitter_id` is already present.
- Therefore, no additional mapping is required.
- The goal here is simply to set the `submitter_id` as the index for easier access and consistency.

In [None]:
# Set cases.submitter_id as the index
samples = samples.set_index("cases.submitter_id")
samples.head()
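Once both tables share the submitter ID as index, they can be aligned on the IDs they have in common. This is a minimal sketch on toy data; the column `samples.sample_type` is a hypothetical example, and the sketch assumes one row per submitter ID in the sample table.

```python
import pandas as pd

# Toy sample table (P4 has no expression data)
toy_samples = pd.DataFrame({
    "cases.submitter_id": ["P1", "P2", "P4"],
    "samples.sample_type": ["Primary Tumor", "Primary Tumor", "Primary Tumor"],
}).set_index("cases.submitter_id")

# Toy expression matrix (P3 has no sample row)
toy_expr = pd.DataFrame(
    {"ENSG_A": [1.0, 2.0, 3.0]},
    index=pd.Index(["P1", "P2", "P3"], name="submitter_id"),
)

# Keep only the patients present in both tables, then join on the index
shared = toy_expr.index.intersection(toy_samples.index)
merged = toy_expr.loc[shared].join(toy_samples.loc[shared])
print(merged)
```

If the sample table holds several rows per patient, the join would repeat expression rows, so it should be deduplicated first.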