## TCGA-LUAD (Lung Adenocarcinoma) dataset
#### Use canonical correlation analysis (CCA) to identify sets of genes that best differentiate NSCLC (non-small cell lung cancer) patients from healthy controls, highlighting genetic factors in NSCLC development.


In [5]:
import os
import pandas as pd

# Paths to the RNA-Seq expression matrix and the clinical (phenotype) matrix
first_filename = os.path.join("..", "data", "illumina", "HiSeqV2")
second_filename = os.path.join("..", "data", "phenotypes", "TCGA.LUAD.sampleMap_LUAD_clinicalMatrix")

# Both files are tab-separated; the first column holds the row labels
try:
    X1 = pd.read_csv(first_filename, sep="\t", index_col=0)
except FileNotFoundError as err:
    raise FileNotFoundError(f"File {first_filename} not found.") from err

try:
    X2 = pd.read_csv(second_filename, sep="\t", index_col=0)
except FileNotFoundError as err:
    raise FileNotFoundError(f"File {second_filename} not found.") from err
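
CCA needs both views to describe the same samples in the same order. The sketch below shows one possible way to align an expression table (genes × samples) with a clinical table (samples × fields). It assumes that the expression columns and the clinical index both hold TCGA sample IDs. The helper name `align_samples`, the toy IDs, the gene names, and the `gender` column are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

def align_samples(expr: pd.DataFrame, clin: pd.DataFrame):
    """Restrict both tables to shared sample IDs, in the same row order.

    expr: genes x samples (sample IDs as column labels)
    clin: samples x clinical fields (sample IDs as index)
    Assumes sample IDs are unique in both tables.
    """
    shared = expr.columns.intersection(clin.index)
    expr_t = expr[shared].T        # samples x genes
    clin_sub = clin.loc[shared]    # samples x clinical fields
    return expr_t, clin_sub

# Toy data with hypothetical sample IDs and values
expr = pd.DataFrame(
    {"TCGA-AA-0001-01": [9.9, 4.2], "TCGA-AA-0002-11": [8.6, 9.4]},
    index=["GENE_A", "GENE_B"],
)
clin = pd.DataFrame(
    {"gender": ["MALE", "FEMALE", "MALE"]},
    index=["TCGA-AA-0002-11", "TCGA-AA-0001-01", "TCGA-AA-0003-01"],
)

expr_t, clin_sub = align_samples(expr, clin)
print(expr_t.shape, clin_sub.shape)
```

On the real data the call would be `align_samples(X1, X2)`, provided the assumption about where the sample IDs live holds.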

`X1` has **20,530** rows (genes) and **576** columns (samples).
Each column corresponds to one patient sample.
The column names are TCGA sample IDs: the **suffix -01 indicates a primary tumor sample** and -11 indicates a solid tissue normal (non-tumor) sample.
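
The suffix rule can be sketched as a small helper. This is a minimal illustration, not part of the original notebook: the function name `sample_type` is hypothetical, and the ranges follow the TCGA barcode convention in which sample-type codes 01 to 09 denote tumor, 10 to 19 normal, and 20 to 29 control samples.

```python
def sample_type(sample_id: str) -> str:
    """Classify a TCGA sample ID by its trailing two-digit sample-type code."""
    # Take the last dash-separated field; keep two digits in case a vial
    # letter follows (e.g. "01A")
    code = int(sample_id.rsplit("-", 1)[-1][:2])
    if 1 <= code <= 9:
        return "tumor"
    if 10 <= code <= 19:
        return "normal"
    if 20 <= code <= 29:
        return "control"
    return "unknown"

# Sample IDs taken from the column names of the expression matrix
ids = ["TCGA-69-7978-01", "TCGA-62-8399-01", "TCGA-50-5931-11"]
labels = {sid: sample_type(sid) for sid in ids}
print(labels)
```

Applied to the full matrix, something like `X1.columns.map(sample_type).value_counts()` should give the number of tumor and normal samples.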

In [6]:
print(f"Shape of Illumina data: {X1.shape}")
print(f"Shape of Phenotype data: {X2.shape}")
print("\n----------------------------------------------")
print("First rows of gene expression profiles measured using RNA-Seq;")
print(X1.head())
print("\n----------------------------------------------")
print("First rows of clinical data on phenotypes;")
print(X2.head())

Shape of Illumina data: (20530, 576)
Shape of Phenotype data: (706, 147)

----------------------------------------------
First rows of gene expression profiles measured using RNA-Seq;
           TCGA-69-7978-01  TCGA-62-8399-01  TCGA-78-7539-01  TCGA-50-5931-11  \
sample                                                                          
ARHGEF10L           9.9898          10.4257           9.6264           8.6835   
HIF3A               4.2598          11.6239           9.1362           9.4824   
RNF17               0.4181           0.0000           1.1231           0.8221   
RNF10              10.3657          11.5489          11.6692          11.7341   
RNF11              11.1718          11.0200          10.4679          11.6787   

           TCGA-73-4658-01  TCGA-44-6775-01  TCGA-44-2655-01  TCGA-44-3398-01  \
sample                                                                          
ARHGEF10L           9.2078          10.0039           9.3263           9.0249   
HIF3A
