# Lung Cancer Gene Expression - Data Curation

**Objective:** To load, clean, and prepare the [GSE81089](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE81089) lung cancer dataset for machine learning analysis. This notebook covers the initial data curation phase.

---

## Table of Contents
1. [Setup and Data Loading](#1-setup-and-data-loading)
2. [Initial Data Inspection and Reshaping](#2-initial-data-inspection-and-reshaping)
3. [Data Cleaning and Quality Control](#3-data-cleaning-and-quality-control)
4. [Feature Discrepancies and Final Datasets](#4-feature-discrepancies-and-final-datasets)
5. [Target Variable (y) Creation](#5-target-variable-creation)
6. [Cleaned Datasets](#6-final-cleaned-datasets)
7. [Datasets Intersection](#7-dataseets-intersection)
8. [Paired Datasets](#8-paired-datasets)
9. [Summary](#9-Summary)

## 1. Setup and Data Loading <a id='1-setup-and-data-loading'></a>

In [3]:
# Import necessary libraries for data manipulation
import pandas as pd
import numpy as np

In [4]:
# File paths for the raw count and FPKM datasets
x_file_path_raw = 'GSE81089_readcounts_featurecounts.tsv.gz'
x_file_path_fpkm = 'GSE81089_FPKM_cufflinks.tsv.gz'

In [5]:
# Load the raw counts data into a pandas DataFrame
x_df_raw = pd.read_csv(x_file_path_raw, sep="\t", comment="!", index_col=0)

# Load the FPKM normalized data into a pandas DataFrame
x_df_fpkm = pd.read_csv(x_file_path_fpkm, sep="\t", comment="!", index_col=0)

## 2. Initial Data Inspection and Reshaping <a id='2-initial-data-inspection-and-reshaping'></a>

In [7]:
# Display the first 5 rows of the raw data
x_df_raw.head()

Ensembl_gene_id,L400T,L401T,L404T,L406T,L413T,L414T,L417T,L420T,L439T,L440T,...,L877T,L879T,L880T,L881N,L881T,L884T,L885T,L886T,L887T,L890T
ENSG00000000003,6364,5953,3179,3617,5363,4336,2340,2931,4132,11516,...,4288,3391,2458,1190,4529,3783,1154,3307,3065,4180
ENSG00000000005,17,1,4,0,131,1,4,0,0,3,...,41,0,0,2,0,2,0,0,0,0
ENSG00000000419,2255,3068,2342,3264,2843,2936,3735,3264,1126,1696,...,2293,1973,2064,2484,6170,3271,1965,3141,2924,1485
ENSG00000000457,1941,1317,1931,1473,1285,1129,657,1029,1849,1680,...,1675,1010,1256,1024,982,1266,937,824,1344,618
ENSG00000000460,653,1083,1225,1101,687,1657,937,1030,670,476,...,1123,837,871,563,2017,1320,400,533,1143,610


In [8]:
# Display the first 5 rows of the FPKM data
x_df_fpkm.head()

Ensembl_gene_id,L400T,L401T,L404T,L406T,L413T,L414T,L417T,L420T,L439T,L440T,...,L877T,L879T,L880T,L881N,L881T,L884T,L885T,L886T,L887T,L890T
ENSG00000000003,52.195,37.8891,23.191,25.0324,41.9686,28.5794,19.9008,23.2632,37.024,101.314,...,29.7194,28.0398,21.309,6.60301,22.7284,26.425,10.5688,29.845,18.0575,44.8734
ENSG00000000005,0.230061,0.086034,0.048022,0.0,2.5709,0.087192,0.234047,0.0,0.0,0.0,...,0.404797,0.0,0.0,0.149436,0.0,0.024609,0.0,0.0,0.0,0.0
ENSG00000000419,43.8616,47.0457,38.1292,54.303,51.2969,42.1604,78.1961,60.7283,23.5296,34.258,...,35.6239,39.2999,43.9324,32.4767,76.6832,53.6666,42.4278,65.9684,39.9645,38.3959
ENSG00000000457,14.7101,7.81233,12.3117,8.41631,8.84999,6.28052,4.66235,7.39264,14.5544,12.8389,...,10.0797,6.79271,9.15611,4.75463,4.02081,7.114,7.67308,6.23066,6.5676,5.3117
ENSG00000000460,4.81335,5.92073,8.21385,6.71221,4.79088,9.64764,6.60509,8.49746,5.45051,3.40198,...,6.94448,5.92882,6.26774,2.32631,9.3759,8.97605,2.83725,4.44909,6.03634,6.40973


In [9]:
# Check the dimensions (rows, columns) of each DataFrame.
# Note: At this stage, rows represent genes and columns represent samples.
# We can observe a difference in the number of genes (rows). This will be handled in a later step.

print(f'Shape Raw data: {x_df_raw.shape} \nShape FPKM data: {x_df_fpkm.shape}')


Shape Raw data: (63152, 218) 
Shape FPKM data: (63130, 218)


In [10]:
# Transpose the raw counts DataFrame
x_df_raw = x_df_raw.T
x_df_raw.head()

Ensembl_gene_id,ENSG00000000003,ENSG00000000005,ENSG00000000419,ENSG00000000457,ENSG00000000460,ENSG00000000938,ENSG00000000971,ENSG00000001036,ENSG00000001084,ENSG00000001167,...,ENSG00000272537,ENSG00000272538,ENSG00000272539,ENSG00000272540,ENSG00000272541,ENSG00000272542,ENSG00000272543,ENSG00000272544,ENSG00000272545,TC%
L400T,6364.0,17.0,2255.0,1941.0,653.0,918.0,22449.0,2609.0,2099.0,2416.0,...,1.0,0.0,0.0,4.0,1.0,4.0,0.0,0.0,0.0,35.0
L401T,5953.0,1.0,3068.0,1317.0,1083.0,1478.0,10359.0,4318.0,2221.0,4401.0,...,0.0,0.0,0.0,15.0,0.0,0.0,0.0,0.0,0.0,25.0
L404T,3179.0,4.0,2342.0,1931.0,1225.0,1485.0,6044.0,3406.0,4486.0,4297.0,...,1.0,0.0,0.0,11.0,2.0,2.0,0.0,0.0,0.0,30.0
L406T,3617.0,0.0,3264.0,1473.0,1101.0,1454.0,18843.0,2708.0,9493.0,4417.0,...,1.0,0.0,0.0,56.0,8.0,0.0,0.0,0.0,2.0,35.0
L413T,5363.0,131.0,2843.0,1285.0,687.0,1110.0,7972.0,3550.0,2932.0,4683.0,...,5.0,0.0,0.0,53.0,2.0,2.0,0.0,0.0,2.0,35.0


In [11]:
# Transpose the FPKM DataFrame
x_df_fpkm = x_df_fpkm.T
x_df_fpkm.head()

Ensembl_gene_id,ENSG00000000003,ENSG00000000005,ENSG00000000419,ENSG00000000457,ENSG00000000460,ENSG00000000938,ENSG00000000971,ENSG00000001036,ENSG00000001084,ENSG00000001167,...,ENSG00000272537,ENSG00000272538,ENSG00000272539,ENSG00000272540,ENSG00000272541,ENSG00000272542,ENSG00000272543,ENSG00000272544,ENSG00000272545,TC%
L400T,52.195,0.230061,43.8616,14.7101,4.81335,7.40831,112.426,43.9196,12.1289,13.3027,...,0.0,0.0,0.0,1.38909,0.312571,0.086774,0.0,0.0,0.0,35.0
L401T,37.8891,0.086034,47.0457,7.81233,5.92073,9.83188,39.7146,60.4056,9.20525,19.6343,...,0.0,0.0,0.0,1.15011,0.050841,0.0,0.0,0.0,0.0,25.0
L404T,23.191,0.048022,38.1292,12.3117,8.21385,9.68575,25.9596,49.0519,23.9222,20.166,...,0.0,0.0,0.0,1.11998,0.551958,0.036071,0.0,0.0,0.0,30.0
L406T,25.0324,0.0,54.303,8.41631,6.71221,10.9263,80.2073,40.47,46.9369,20.1807,...,0.0,0.0,0.0,4.3457,0.319958,0.0,0.0,0.0,0.086684,35.0
L413T,41.9686,2.5709,51.2969,8.84999,4.79088,8.36149,38.4429,58.1048,15.6082,23.3442,...,0.0,0.0,0.0,2.61839,0.415423,0.041381,0.0,0.0,0.09524,35.0


In [12]:
# Verify the new dimensions of the dataframes.
# The format is now (samples, features).

print(f'Shape Raw data: {x_df_raw.shape} \nShape FPKM data: {x_df_fpkm.shape}')

Shape Raw data: (218, 63152) 
Shape FPKM data: (218, 63130)


## 3. Data Cleaning and Quality Control <a id='3-data-cleaning-and-quality-control'></a>

In [14]:
# summary of the raw counts DataFrame.
x_df_raw.info()

<class 'pandas.core.frame.DataFrame'>
Index: 218 entries, L400T to L890T
Columns: 63152 entries, ENSG00000000003 to TC%
dtypes: float64(63152)
memory usage: 105.0+ MB


In [15]:
# summary of the FPKM DataFrame.
x_df_fpkm.info()

<class 'pandas.core.frame.DataFrame'>
Index: 218 entries, L400T to L890T
Columns: 63130 entries, ENSG00000000003 to TC%
dtypes: float64(63130)
memory usage: 105.0+ MB


In [16]:
# descriptive statistics for raw counts DataFrame.
# x_df_raw.describe().round(2)

In [17]:
# descriptive statistics for FPKM DataFrame.
# x_df_fpkm.describe().round(2)

### 3.1. Null Value Check

In [19]:
# Null values for the raw counts DataFrame.
print(f'Total Null: {x_df_raw.isnull().sum().sum()}')
# Null values by columns
print(f'Gene Null: {x_df_raw.columns[x_df_raw.isnull().any()]}')
# Null values by row
print(f'Sample Null: {x_df_raw.index[x_df_raw.isnull().any(axis=1)]}')


Total Null: 1
Gene Null: Index(['TC%'], dtype='object', name='Ensembl_gene_id')
Sample Null: Index(['L771T_1'], dtype='object')


In [20]:
# Null values for the FPKM DataFrame.
print(f'Total Null: {x_df_fpkm.isnull().sum().sum()}')
# Null values by columns
print(f'Gene Null: {x_df_fpkm.columns[x_df_fpkm.isnull().any()]}')
# Null values by row
print(f'Sample Null: {x_df_fpkm.index[x_df_fpkm.isnull().any(axis=1)]}')


Total Null: 1
Gene Null: Index(['TC%'], dtype='object', name='Ensembl_gene_id')
Sample Null: Index(['L771T_1'], dtype='object')


### 3.2. Identify and Remove Non-Gene Columns

In [22]:
# identify any columns that are not gene identifiers for the raw counts DataFrame.
[col for col in x_df_raw.columns if not col.startswith("ENSG")]

['TC%']

In [23]:
# identify any columns that are not gene identifiers for the FPKM DataFrame.
[col for col in x_df_fpkm.columns if not col.startswith("ENSG")]

['TC%']

##### TC%
The TC% column represents the Tumor Cell Percentage, a quality control metric used in the source study by Djureinovic et al. The authors only included tissue samples with greater than 10% tumor content to ensure the quality and purity of the expression signal.

This column is removed from our feature matrix because it is sample metadata, not a gene expression feature. Including it would cause data leakage, allowing the model to predict the outcome directly from this value rather than learning from the underlying patterns in the gene expression profiles.
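
If the tumor cell percentage is still wanted for later quality checks, one option is to split it off into its own Series before dropping it from the features. The sketch below uses a small made-up frame; the names `toy`, `features`, and `tc_percent` are illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the transposed expression matrix (samples x columns).
# Gene names and values are invented for illustration only.
toy = pd.DataFrame(
    {
        "ENSG_A": [10.0, 0.0, 3.0],
        "ENSG_B": [5.0, 7.0, 1.0],
        "TC%": [35.0, np.nan, 25.0],
    },
    index=["S1T", "S2T", "S3T"],
)

# Keep the metadata column as a separate Series, indexed by sample ID.
tc_percent = toy["TC%"].copy()

# Drop it from the feature matrix so only gene columns remain.
features = toy.drop(columns=["TC%"])

print(features.columns.tolist())
print(int(tc_percent.isna().sum()))
```

Because `tc_percent` shares the sample index with `features`, it can be joined back later if needed.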

In [25]:
# Remove the non-gene 'TC%' column from the raw data DataFrame.
x_df_raw = x_df_raw.drop(columns=["TC%"])

In [26]:
# Remove the non-gene 'TC%' column from the FPKM data DataFrame.
x_df_fpkm = x_df_fpkm.drop(columns=["TC%"])

In [27]:
# Verify the new shapes after dropping the non-gene column.
print(f'Shape Raw data: {x_df_raw.shape} \nShape FPKM data: {x_df_fpkm.shape}')

Shape Raw data: (218, 63151) 
Shape FPKM data: (218, 63129)


### 3.3. Check for Duplicates
Three critical checks for data integrity on both datasets:
1.  **Duplicate Rows:** Samples with an identical expression profile across all genes.
2.  **Duplicate Index:** Samples that share the same ID.
3.  **Duplicate Columns:** Features (genes) that share the same ID.
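
The three checks can also be summarized as counts, which avoids printing very wide empty DataFrames. This is a minimal sketch on toy data; `duplicate_report` is a hypothetical helper, not a function used in this notebook.

```python
import pandas as pd


def duplicate_report(df: pd.DataFrame) -> dict:
    """Count duplicated rows, index labels, and column labels in df."""
    return {
        # Rows whose values repeat an earlier row across all columns.
        "duplicate_rows": int(df.duplicated().sum()),
        # Index labels (sample IDs) that repeat an earlier label.
        "duplicate_index": int(df.index.duplicated().sum()),
        # Column labels (gene IDs) that repeat an earlier label.
        "duplicate_columns": int(df.columns.duplicated().sum()),
    }


# Toy frame: the second row repeats the first, and the label "S2" occurs twice.
toy = pd.DataFrame(
    [[1, 2], [1, 2], [3, 4]],
    index=["S1", "S2", "S2"],
    columns=["G1", "G2"],
)

report = duplicate_report(toy)
print(report)
```

A report with all three counts equal to zero corresponds to the "Empty DataFrame" outputs shown in the cells below.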

In [29]:
# Duplicate checks for the raw data DataFrame

# Check for fully duplicated rows (samples with identical values across all genes) in the raw data DataFrame.
print(x_df_raw[x_df_raw.duplicated()])

Empty DataFrame
Columns: [ENSG00000000003, ENSG00000000005, ENSG00000000419, ENSG00000000457, ENSG00000000460, ENSG00000000938, ENSG00000000971, ENSG00000001036, ENSG00000001084, ENSG00000001167, ENSG00000001460, ENSG00000001461, ENSG00000001497, ENSG00000001561, ENSG00000001617, ENSG00000001626, ENSG00000001629, ENSG00000001630, ENSG00000001631, ENSG00000002016, ENSG00000002079, ENSG00000002330, ENSG00000002549, ENSG00000002586, ENSG00000002587, ENSG00000002726, ENSG00000002745, ENSG00000002746, ENSG00000002822, ENSG00000002834, ENSG00000002919, ENSG00000002933, ENSG00000003056, ENSG00000003096, ENSG00000003137, ENSG00000003147, ENSG00000003249, ENSG00000003393, ENSG00000003400, ENSG00000003402, ENSG00000003436, ENSG00000003509, ENSG00000003756, ENSG00000003987, ENSG00000003989, ENSG00000004059, ENSG00000004139, ENSG00000004142, ENSG00000004399, ENSG00000004455, ENSG00000004468, ENSG00000004478, ENSG00000004487, ENSG00000004534, ENSG00000004660, ENSG00000004700, ENSG00000004766, ENSG0

In [30]:
# Check for duplicate index labels (samples) in the raw data DataFrame.
print(x_df_raw[x_df_raw.index.duplicated(keep=False)])

Empty DataFrame
Columns: [ENSG00000000003, ENSG00000000005, ENSG00000000419, ENSG00000000457, ENSG00000000460, ENSG00000000938, ENSG00000000971, ENSG00000001036, ENSG00000001084, ENSG00000001167, ENSG00000001460, ENSG00000001461, ENSG00000001497, ENSG00000001561, ENSG00000001617, ENSG00000001626, ENSG00000001629, ENSG00000001630, ENSG00000001631, ENSG00000002016, ENSG00000002079, ENSG00000002330, ENSG00000002549, ENSG00000002586, ENSG00000002587, ENSG00000002726, ENSG00000002745, ENSG00000002746, ENSG00000002822, ENSG00000002834, ENSG00000002919, ENSG00000002933, ENSG00000003056, ENSG00000003096, ENSG00000003137, ENSG00000003147, ENSG00000003249, ENSG00000003393, ENSG00000003400, ENSG00000003402, ENSG00000003436, ENSG00000003509, ENSG00000003756, ENSG00000003987, ENSG00000003989, ENSG00000004059, ENSG00000004139, ENSG00000004142, ENSG00000004399, ENSG00000004455, ENSG00000004468, ENSG00000004478, ENSG00000004487, ENSG00000004534, ENSG00000004660, ENSG00000004700, ENSG00000004766, ENSG0

In [31]:
# Check for duplicate column names (features/genes) in the raw data DataFrame.
print(x_df_raw.loc[:, x_df_raw.columns.duplicated(keep=False)])

Empty DataFrame
Columns: []
Index: [L400T, L401T, L404T, L406T, L413T, L414T, L417T, L420T, L439T, L440T, L441T, L442T, L444T, L446T, L447T, L452T, L455T, L456T, L457T, L458T, L459T, L462T, L464T, L466T, L468T, L470T, L471T, L472T, L473T, L480T, L481T, L483T, L484T, L488T, L490T, L493T, L496T, L504T, L511N, L511T, L529T, L530T, L531T, L532N, L532T, L534T, L535T, L538T, L539T, L541T, L543T, L545T, L546T, L551T, L557T, L559T, L561N, L561T, L563N, L563T, L565T, L566N, L566T, L567T, L568T, L569T, L572N, L572T, L582T, L583T, L584T, L585T, L586T, L592T, L593T, L596T, L598T, L599T, L601T, L602T, L603T, L604T, L605T, L606N, L606T, L607T, L608T_2122, L612T, L613T, L616N, L616T, L617T, L619T, L620T, L621T, L626T, L628T, L630T, L633T, L635T, ...]

[218 rows x 0 columns]


In [32]:
# Duplicate checks for the FPKM data DataFrame

# Check for fully duplicated rows (samples with identical values across all genes) in the FPKM data DataFrame.
print(x_df_fpkm[x_df_fpkm.duplicated()])

Empty DataFrame
Columns: [ENSG00000000003, ENSG00000000005, ENSG00000000419, ENSG00000000457, ENSG00000000460, ENSG00000000938, ENSG00000000971, ENSG00000001036, ENSG00000001084, ENSG00000001167, ENSG00000001460, ENSG00000001461, ENSG00000001497, ENSG00000001561, ENSG00000001617, ENSG00000001626, ENSG00000001629, ENSG00000001630, ENSG00000001631, ENSG00000002016, ENSG00000002079, ENSG00000002330, ENSG00000002549, ENSG00000002586, ENSG00000002587, ENSG00000002726, ENSG00000002745, ENSG00000002746, ENSG00000002822, ENSG00000002834, ENSG00000002919, ENSG00000002933, ENSG00000003056, ENSG00000003096, ENSG00000003137, ENSG00000003147, ENSG00000003249, ENSG00000003393, ENSG00000003400, ENSG00000003402, ENSG00000003436, ENSG00000003509, ENSG00000003756, ENSG00000003987, ENSG00000003989, ENSG00000004059, ENSG00000004139, ENSG00000004142, ENSG00000004399, ENSG00000004455, ENSG00000004468, ENSG00000004478, ENSG00000004487, ENSG00000004534, ENSG00000004660, ENSG00000004700, ENSG00000004766, ENSG0

In [33]:
# Check for duplicate index labels (samples) in the FPKM data DataFrame.
print(x_df_fpkm[x_df_fpkm.index.duplicated(keep=False)])

Empty DataFrame
Columns: [ENSG00000000003, ENSG00000000005, ENSG00000000419, ENSG00000000457, ENSG00000000460, ENSG00000000938, ENSG00000000971, ENSG00000001036, ENSG00000001084, ENSG00000001167, ENSG00000001460, ENSG00000001461, ENSG00000001497, ENSG00000001561, ENSG00000001617, ENSG00000001626, ENSG00000001629, ENSG00000001630, ENSG00000001631, ENSG00000002016, ENSG00000002079, ENSG00000002330, ENSG00000002549, ENSG00000002586, ENSG00000002587, ENSG00000002726, ENSG00000002745, ENSG00000002746, ENSG00000002822, ENSG00000002834, ENSG00000002919, ENSG00000002933, ENSG00000003056, ENSG00000003096, ENSG00000003137, ENSG00000003147, ENSG00000003249, ENSG00000003393, ENSG00000003400, ENSG00000003402, ENSG00000003436, ENSG00000003509, ENSG00000003756, ENSG00000003987, ENSG00000003989, ENSG00000004059, ENSG00000004139, ENSG00000004142, ENSG00000004399, ENSG00000004455, ENSG00000004468, ENSG00000004478, ENSG00000004487, ENSG00000004534, ENSG00000004660, ENSG00000004700, ENSG00000004766, ENSG0

In [34]:
# Check for duplicate column names (features/genes) in the FPKM data DataFrame.
print(x_df_fpkm.loc[:, x_df_fpkm.columns.duplicated(keep=False)])

Empty DataFrame
Columns: []
Index: [L400T, L401T, L404T, L406T, L413T, L414T, L417T, L420T, L439T, L440T, L441T, L442T, L444T, L446T, L447T, L452T, L455T, L456T, L457T, L458T, L459T, L462T, L464T, L466T, L468T, L470T, L471T, L472T, L473T, L480T, L481T, L483T, L484T, L488T, L490T, L493T, L496T, L504T, L511N, L511T, L529T, L530T, L531T, L532N, L532T, L534T, L535T, L538T, L539T, L541T, L543T, L545T, L546T, L551T, L557T, L559T, L561N, L561T, L563N, L563T, L565T, L566N, L566T, L567T, L568T, L569T, L572N, L572T, L582T, L583T, L584T, L585T, L586T, L592T, L593T, L596T, L598T, L599T, L601T, L602T, L603T, L604T, L605T, L606N, L606T, L607T, L608T_2122, L612T, L613T, L616N, L616T, L617T, L619T, L620T, L621T, L626T, L628T, L630T, L633T, L635T, ...]

[218 rows x 0 columns]


## 4. Feature Discrepancies and Final Datasets <a id='4-feature-discrepancies-and-final-datasets'></a>

### 4.1. Analyze Feature Discrepancies
We've observed that the two dataframes have a different number of genes. Here we identify exactly which genes are unique to each set. This is likely due to the different bioinformatics tools used to generate each file.
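
As a small illustration of the comparison, pandas `Index` objects support set-style operations directly, and `Index.difference` returns its result sorted, so the listing is reproducible between runs. The gene labels below are made up for the sketch.

```python
import pandas as pd

# Invented column labels standing in for the two gene sets.
raw_cols = pd.Index(["G1", "G2", "G3", "G4"])
fpkm_cols = pd.Index(["G2", "G3", "G5"])

# Genes present in one set but not in the other.
only_in_raw = raw_cols.difference(fpkm_cols)
only_in_fpkm = fpkm_cols.difference(raw_cols)

# Genes present in both sets.
shared = raw_cols.intersection(fpkm_cols)

print(list(only_in_raw), list(only_in_fpkm), list(shared))
```

The cells below use Python `set` objects instead, which give the same membership but no guaranteed ordering when listed.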

In [37]:
# Verify the shapes
print(f'Shape Raw data: {x_df_raw.shape} \nShape FPKM data: {x_df_fpkm.shape}')

Shape Raw data: (218, 63151) 
Shape FPKM data: (218, 63129)


In [38]:

# Get the set of gene IDs (columns) from each DataFrame to compare them
genes_raw = set(x_df_raw.columns)
genes_fpkm = set(x_df_fpkm.columns)

# Find genes that are in the raw dataset but not in the FPKM dataset
missing_genes = genes_raw.difference(genes_fpkm)

print(f"Number of genes missing from the FPKM dataset: {len(missing_genes)}")
print("IDs of some of the missing genes:")
print(list(missing_genes)[:10]) # First 10

# If any missing genes were found, analyze their values in the raw dataset
if missing_genes:
    missing_genes_counts = x_df_raw[list(missing_genes)]

    # Display descriptive statistics for these genes.
    print("\nStatistics for the missing genes in the raw dataset:")
    print(missing_genes_counts.describe())
    

Number of genes missing from the FPKM dataset: 23
IDs of some of the missing genes:
['ENSG00000256825', 'ENSG00000261459', 'ENSG00000272162', 'ENSG00000258830', 'ENSG00000259753', 'ENSG00000183900', 'ENSG00000250424', 'ENSG00000258724', 'ENSG00000226894', 'ENSG00000267531']

Statistics for the missing genes in the raw dataset:
Ensembl_gene_id  ENSG00000256825  ENSG00000261459  ENSG00000272162  \
count                      218.0            218.0            218.0   
mean                         0.0              0.0              0.0   
std                          0.0              0.0              0.0   
min                          0.0              0.0              0.0   
25%                          0.0              0.0              0.0   
50%                          0.0              0.0              0.0   
75%                          0.0              0.0              0.0   
max                          0.0              0.0              0.0   

Ensembl_gene_id  ENSG00000258830  ENSG00

In [39]:
# Align the raw DataFrame by dropping the columns that are not present in the FPKM DataFrame (all zero values)
x_df_raw = x_df_raw.drop(columns=list(missing_genes))
# Verify the new shapes. Note that the FPKM dataframe now has one extra column.
print(f'Shape Raw data: {x_df_raw.shape} \nShape FPKM data: {x_df_fpkm.shape}')

Shape Raw data: (218, 63128) 
Shape FPKM data: (218, 63129)


In [40]:
# Find genes that are in the FPKM dataset but not in the raw dataset
missing_genes = genes_fpkm.difference(genes_raw)

print(f"Number of genes found in FPKM but not in raw data: {len(missing_genes)}")
print("IDs of some of the missing genes:")
print(list(missing_genes)[:10]) # First 10

# If any missing genes were found, analyze their values in the FPKM dataset
if missing_genes:
    missing_genes_counts = x_df_fpkm[list(missing_genes)]

    # Display descriptive statistics for these genes.
    print("\nStatistics for the missing genes in the FPKM dataset")
    print(missing_genes_counts.describe())


# A discrepancy was found between the two data files: one gene appears in the FPKM dataset but not in the raw counts data.
# This is a common occurrence, likely due to the different software tools or gene "maps" used to process each file.
# This difference will be noted and resolved in the data alignment step before modeling.

Number of genes found in FPKM but not in raw data: 1
IDs of some of the missing genes:
['ENSG00000223972']

Statistics for the missing genes in the FPKM dataset
Ensembl_gene_id  ENSG00000223972
count                 218.000000
mean                    0.085066
std                     0.144929
min                     0.000000
25%                     0.028140
50%                     0.053657
75%                     0.092494
max                     1.426270


### 4.2. Identify and Remove Constant Features
We remove non-informative features. Any gene that has the same value across all samples (zero variance) provides no information for a model and should be dropped. We will do this for each dataframe independently.
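
A minimal sketch of the idea on toy data (gene names and values are invented): a column is treated as constant when it holds at most one distinct value, whether that value is zero or not.

```python
import pandas as pd

# G1 is all zeros, G3 is constant but non-zero, G2 varies across samples.
toy = pd.DataFrame(
    {"G1": [0.0, 0.0, 0.0], "G2": [1.0, 2.0, 3.0], "G3": [5.0, 5.0, 5.0]},
    index=["S1", "S2", "S3"],
)

# nunique() counts distinct non-missing values per column.
n_unique = toy.nunique()

# Using <= 1 also catches columns that contain only missing values.
constant_cols = n_unique.index[n_unique <= 1]
informative = toy.drop(columns=constant_cols)

print(list(constant_cols))
print(list(informative.columns))
```

In this notebook, the min/max check in the cells below confirms that every constant column in both datasets is all zeros.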

In [42]:
#  Select only the columns from the raw DataFrame that have a constant value.
constant_columns = x_df_raw.columns[x_df_raw.nunique() == 1]
df_constant_genes = x_df_raw[constant_columns]

# Calculate the overall minimum and maximum across this entire subset.
overall_min = df_constant_genes.min().min()
overall_max = df_constant_genes.max().max()

print(f"For the {len(constant_columns)} columns with constant values (Raw):")
print(f"The overall minimum value is: {overall_min}")
print(f"The overall maximum value is: {overall_max}")

For the 11920 columns with constant values (Raw):
The overall minimum value is: 0.0
The overall maximum value is: 0.0


In [43]:
#  Select only the columns from the FPKM DataFrame that have a constant value.
constant_columns_fpkm = x_df_fpkm.columns[x_df_fpkm.nunique() == 1]
df_constant_genes_fpkm = x_df_fpkm[constant_columns_fpkm]

# Calculate the overall minimum and maximum across this entire subset.
overall_min_fpkm = df_constant_genes_fpkm.min().min()
overall_max_fpkm = df_constant_genes_fpkm.max().max()

print(f"For the {len(constant_columns_fpkm)} columns with constant values (FPKM):")
print(f"The overall minimum value is: {overall_min_fpkm}")
print(f"The overall maximum value is: {overall_max_fpkm}")

For the 17969 columns with constant values (FPKM):
The overall minimum value is: 0.0
The overall maximum value is: 0.0


In [44]:
# Create a new DataFrame for the raw data, excluding its constant columns (all have zero values)
x_df_raw_cleaned = x_df_raw.drop(columns=constant_columns)
# Create a new DataFrame for the FPKM data, excluding its respective constant columns (all have zero values)
x_df_fpkm_cleaned = x_df_fpkm.drop(columns=constant_columns_fpkm)
# Print the new shapes to confirm the removal of non-informative features from each dataset.
print(f'Shape Raw Data Cleaned: {x_df_raw_cleaned.shape} \nShape FPKM Data Cleaned: {x_df_fpkm_cleaned.shape}')


# The FPKM dataset has more all-zero columns than the raw counts. A likely explanation is that the FPKM
# estimation reports 0 for some genes that have only a few reads in the raw counts.

Shape Raw Data Cleaned: (218, 51208) 
Shape FPKM Data Cleaned: (218, 45160)


## 5. Target Variable (y) Creation <a id='5-target-variable-creation'></a>

We will now create the binary target variable (`y`) by classifying samples based on their ID ('T' for Tumor, 'N' for Normal). The result is stored in a pandas Series to maintain index alignment with the feature data.
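
As an alternative sketch, the label can be read from the letter that directly follows the numeric part of the ID, instead of searching for 'N' or 'T' anywhere in the string. This assumes IDs follow the pattern seen in the outputs above (`L` + digits + `N`/`T`, optionally followed by a suffix such as `_1`); the ID `X123` is a made-up example of an ID that does not match.

```python
import pandas as pd

# A few IDs in the style shown above, plus one invented non-matching ID.
sample_ids = pd.Index(["L400T", "L511N", "L608T_2122", "L771T_1", "X123"])

# Capture the tissue letter that follows the leading 'L<digits>' block.
tissue_code = sample_ids.to_series().str.extract(r"^L\d+([NT])", expand=False)

# Map the letter to the binary label; non-matching IDs stay missing (NaN).
labels = tissue_code.map({"N": 0, "T": 1}).rename("cancer_status")

print(labels)
```

Anchoring the pattern to the start of the ID means a suffix that happened to contain 'N' or 'T' would not change the label.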

### 5.1. Create the Target Variable (y)

In [47]:
sample_identifiers = x_df_raw.index.tolist()

# Create an empty list to hold the class labels (0 or 1)
y_response_list = []

# Iterate through each sample ID to classify it
for sample_id in sample_identifiers:
    text_to_check = ""
    if isinstance(sample_id, str):
        text_to_check = sample_id.upper()

    # Classify based on the presence of 'N' or 'T'.
    if 'N' in text_to_check:
        y_response_list.append(0)  # Normal tissue
    elif 'T' in text_to_check:
        y_response_list.append(1)  # Tumor
    else:
        # Handle any sample ID that doesn't fit the pattern
        y_response_list.append(np.nan) # Use NaN (Not a Number) for missing values

# Convert the list of labels into a pandas Series.
# We pass the sample IDs as the index to ensure alignment with the feature data (X).
y_df = pd.Series(y_response_list, index=sample_identifiers, name="cancer_status")


# Verification

# Display the first 5 entries of the new target Series
print(y_df.head())

# Counts to get the distribution of the classes
print(y_df.value_counts())


L400T    1
L401T    1
L404T    1
L406T    1
L413T    1
Name: cancer_status, dtype: int64
cancer_status
1    199
0     19
Name: count, dtype: int64


### 5.2. Verify sample IDs Alignment

We verify that the sample IDs (index) are consistent across the cleaned dataframes and the newly created target variable `y_df`. This confirms that the data objects are correctly aligned by sample.
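
Note that `Index.equals` is order-sensitive: two objects with the same sample IDs in a different order are reported as not equal. The toy sketch below (invented IDs and values) shows the difference and how `reindex` can restore the order if that ever happens.

```python
import pandas as pd

# Same three samples, but y is stored in a different order than X.
X = pd.DataFrame({"G1": [1, 2, 3]}, index=["S1", "S2", "S3"])
y = pd.Series([0, 1, 1], index=["S3", "S1", "S2"], name="cancer_status")

# Order-sensitive comparison versus a pure membership comparison.
same_order = X.index.equals(y.index)
same_labels = set(X.index) == set(y.index)

# Reorder y so that it follows the row order of X.
y_aligned = y.reindex(X.index)

print(same_order, same_labels, y_aligned.tolist())
```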

In [49]:
# Check if the index of the cleaned raw DataFrame matches the index of the target Series 'y'.
raw_y_aligned = x_df_raw_cleaned.index.equals(y_df.index)
print(f"Are the Raw data and the Y variable aligned?  -> {raw_y_aligned}")

# Check if the index of the cleaned FPKM DataFrame matches the index of the target Series 'y'.
fpkm_y_aligned = x_df_fpkm_cleaned.index.equals(y_df.index)
print(f"Are the FPKM data and the Y variable aligned? -> {fpkm_y_aligned}")

# Check if the samples in the two feature DataFrames are aligned with each other.
raw_fpkm_aligned = x_df_raw_cleaned.index.equals(x_df_fpkm_cleaned.index)
print(f"Are the Raw and FPKM data aligned with each other? -> {raw_fpkm_aligned}")


Are the Raw data and the Y variable aligned?  -> True
Are the FPKM data and the Y variable aligned? -> True
Are the Raw and FPKM data aligned with each other? -> True


In [50]:
# Print the first 5 identifiers from each object for a visual confirmation.
print("Raw Data Index: ", x_df_raw_cleaned.index[:5].tolist())
print("FPKM Data Index: ", x_df_fpkm_cleaned.index[:5].tolist())
print("Y Target Index:            ", y_df.index[:5].tolist())


Raw Data Index:  ['L400T', 'L401T', 'L404T', 'L406T', 'L413T']
FPKM Data Index:  ['L400T', 'L401T', 'L404T', 'L406T', 'L413T']
Y Target Index:             ['L400T', 'L401T', 'L404T', 'L406T', 'L413T']


## 6. Cleaned Datasets <a id='6-final-cleaned-datasets'></a>
After completing the curation process, we have the following clean, sample-aligned data objects, ready for use in the prediction_model.

* **`x_df_raw_cleaned`**: DataFrame with raw counts. Constant features are removed; samples are aligned with `y_df`.
* **`x_df_fpkm_cleaned`**: DataFrame with FPKM values. Constant features are removed; samples are aligned with `y_df`.
* **`y_df`**: Pandas Series containing the binary target labels (0 for Normal, 1 for Tumor).

---
**Note on Feature Discrepancy**

The raw and FPKM feature sets do not match exactly (51,208 vs. 45,160 genes after cleaning), which is a common result of using different bioinformatics tools for data processing.
The final strategy for aligning the data for modeling will be decided in the prediction_model file. The options include:
- Using the intersection of genes.
- Developing independent pipelines for each dataset.
- Proceeding with only one of the datasets.


## 7. Datasets Intersection <a id='7-dataseets-intersection'></a>
In case we decide to work with the intersection of genes, this code creates the final datasets.

In [53]:
# --- Align Cleaned DataFrames via Intersection ---

# Find the intersection of columns between the two cleaned dataframes.
common_features = x_df_raw_cleaned.columns.intersection(x_df_fpkm_cleaned.columns)
print(f"Number of common, non-constant genes to be used for modeling: {len(common_features)}")

# Filter both dataframes to contain only these common features.
x_final_raw = x_df_raw_cleaned[common_features]
x_final_fpkm = x_df_fpkm_cleaned[common_features]

# Verify that the final shapes are identical.
print("\n--- Final, Aligned, and Cleaned Shapes ---")
print(f"Shape of final raw data: {x_final_raw.shape}")
print(f"Shape of final FPKM data: {x_final_fpkm.shape}")

Number of common, non-constant genes to be used for modeling: 43239

--- Final, Aligned, and Cleaned Shapes ---
Shape of final raw data: (218, 43239)
Shape of final FPKM data: (218, 43239)


## 8. Paired Datasets <a id='8-paired-datasets'></a>
**Isolating the Paired-Sample Cohort**

The dataset contains 19 tumor samples that are paired with 19 matched normal tissue samples from the same patients. A paired analysis can be a powerful method to control for inter-individual variability.

For this reason, we will create dedicated dataframes for this paired subset. The decision to use this specific cohort for a separate analysis will be made in the prediction_model.
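
A minimal sketch of the pairing logic on made-up IDs, assuming (as in this dataset's naming) that a normal sample `L<digits>N` is matched by a tumor sample `L<digits>T`. It checks that each tumor counterpart actually exists before building the pair list, so a missing partner is reported instead of raising a `KeyError` during selection. The IDs `L600T` and `L700N` are invented to show the unmatched case.

```python
import pandas as pd

# Invented sample IDs: two complete pairs, one unpaired tumor, one unpaired normal.
sample_ids = pd.Index(["L511N", "L511T", "L532N", "L532T", "L600T", "L700N"])

normals = [sid for sid in sample_ids if sid.endswith("N")]

pairs = []      # (normal_id, tumor_id) tuples with both members present
unmatched = []  # normal samples without a tumor counterpart

for normal_id in normals:
    # Swap only the trailing letter so other characters are left untouched.
    tumor_id = normal_id[:-1] + "T"
    if tumor_id in sample_ids:
        pairs.append((normal_id, tumor_id))
    else:
        unmatched.append(normal_id)

print(pairs)
print(unmatched)
```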

In [55]:
# --- Creating a Paired Subset from the "Cleaned" DataFrames ---

# 1. Identify the 19 normal sample IDs
# We use the index from one of the dataframes, as the samples are aligned.
all_sample_ids_cleaned = x_df_raw_cleaned.index
normal_samples_cleaned = [sid for sid in all_sample_ids_cleaned if isinstance(sid, str) and sid.endswith('N')]

# 2. Find their tumor counterparts by swapping the trailing 'N' for 'T'.
paired_tumor_samples_cleaned = [sid[:-1] + 'T' for sid in normal_samples_cleaned]

# 3. Combine the lists to get all 38 paired IDs.
all_paired_ids_cleaned = normal_samples_cleaned + paired_tumor_samples_cleaned

print(f"Total paired IDs found: {len(all_paired_ids_cleaned)}")

# 4. Filter the "_cleaned" dataframes to create the paired subset.
# We use .loc[] to select rows by their index label.
x_raw_cleaned_paired = x_df_raw_cleaned.loc[all_paired_ids_cleaned]
x_fpkm_cleaned_paired = x_df_fpkm_cleaned.loc[all_paired_ids_cleaned]

# 5. Verification
print("\n--- Shapes of Paired DataFrames (from 'cleaned') ---")
print(f"Shape raw cleaned paired: {x_raw_cleaned_paired.shape}")
print(f"Shape fpkm cleaned paired: {x_fpkm_cleaned_paired.shape}")

Total paired IDs found: 38

--- Shapes of Paired DataFrames (from 'cleaned') ---
Shape raw cleaned paired: (38, 51208)
Shape fpkm cleaned paired: (38, 45160)


In [56]:
# --- Creating a Paired Subset from the "Final" (Aligned) DataFrames ---

# 1. Identify the paired IDs (the list of IDs is the same as before).
all_sample_ids_final = x_final_raw.index
normal_samples_final = [sid for sid in all_sample_ids_final if isinstance(sid, str) and sid.endswith('N')]
paired_tumor_samples_final = [sid[:-1] + 'T' for sid in normal_samples_final]
all_paired_ids_final = normal_samples_final + paired_tumor_samples_final

# 2. Filter the "final" dataframes to create the paired subset.
x_raw_final_paired = x_final_raw.loc[all_paired_ids_final]
x_fpkm_final_paired = x_final_fpkm.loc[all_paired_ids_final]

# 3. Verification
print("\n--- Shapes of Paired DataFrames (from 'final' data) ---")
print(f"Shape of final raw paired data: {x_raw_final_paired.shape}")
print(f"Shape of final FPKM paired data: {x_fpkm_final_paired.shape}")


--- Shapes of Paired DataFrames (from 'final' data) ---
Shape of final raw paired data: (38, 43239)
Shape of final FPKM paired data: (38, 43239)


In [57]:
# Create the corresponding target variable for the paired subset
y_df_paired = y_df.loc[all_paired_ids_final]

# Print the class distribution for the paired subset to verify it has 19 of each class.
print(f"Class distribution in the paired subset:\n{y_df_paired.value_counts()}")

Class distribution in the paired subset:
cancer_status
0    19
1    19
Name: count, dtype: int64


In [58]:
# --- Verification of Index Alignment Across All Paired Datasets ---

# First, collect all the data objects to check in a dictionary.
# All of these variables were created in the previous steps.
datasets_to_check = {
    "x_raw_cleaned_paired": x_raw_cleaned_paired,
    "x_fpkm_cleaned_paired": x_fpkm_cleaned_paired,
    "x_raw_final_paired": x_raw_final_paired,
    "x_fpkm_final_paired": x_fpkm_final_paired,
    "y_df_paired": y_df_paired 
}

# Use the index of the first dataframe as the reference for comparison.
reference_index = datasets_to_check["x_raw_cleaned_paired"].index
all_aligned = True

print("--- Verifying Alignment of All Paired Datasets ---")

# Loop through each dataset and compare its index to the reference index.
for name, df in datasets_to_check.items():
    if not df.index.equals(reference_index):
        print(f"MISMATCH FOUND: The index of '{name}' does not match the reference.")
        all_aligned = False
        break # Stop checking if a mismatch is found

# Print the final result.
if all_aligned:
    print("Success! All specified paired data objects are perfectly aligned by sample.")
else:
    print("\nAttention! A misalignment was detected. Please review the steps where these dataframes were created.")

--- Verifying Alignment of All Paired Datasets ---
Success! All specified paired data objects are perfectly aligned by sample.


## 9. Summary <a id='9-Summary'></a>
After completing the data curation process, we have generated several data objects:

Full Datasets (218 Samples):
* **`x_df_raw_cleaned`**: DataFrame with raw counts, after removing only its own constant features.
* **`x_df_fpkm_cleaned`**: DataFrame with FPKM values, after removing only its own constant features.
* **`x_final_raw`**: Contains raw counts for the common, non-constant genes found in both datasets (intersection after cleaning).
* **`x_final_fpkm`**: Contains FPKM values for the common, non-constant genes found in both datasets (intersection after cleaning).
* **`y_df`**: Pandas Series containing the binary target labels (0 for Normal, 1 for Tumor) for all 218 samples.
    
Paired Subsets (38 Samples):
* **`x_raw_cleaned_paired`**: Paired subset (19 vs. 19) taken from the intermediate x_df_raw_cleaned dataframe.
* **`x_fpkm_cleaned_paired`**: Paired subset (19 vs. 19) taken from the intermediate x_df_fpkm_cleaned dataframe.
* **`x_raw_final_paired`**: Contains raw counts for the 38 paired samples, using the final aligned feature set.
* **`x_fpkm_final_paired`**: Contains FPKM values for the 38 paired samples, using the final aligned feature set.
* **`y_df_paired`**: The corresponding target labels for the 38 paired samples, aligned with the paired dataframes.

The specific datasets (e.g., full vs. paired, raw vs. FPKM) to be used for modeling will be selected in the prediction_model based on the analysis goals.
  


In [102]:
# Save DataFrames ('output_csvs' folder)

# import os
# os.makedirs("output_csvs", exist_ok=True)

# x_df_raw_cleaned.to_csv("output_csvs/x_df_raw_cleaned.csv")
# x_df_fpkm_cleaned.to_csv("output_csvs/x_df_fpkm_cleaned.csv")

# x_final_raw.to_csv("output_csvs/x_final_raw.csv")
# x_final_fpkm.to_csv("output_csvs/x_final_fpkm.csv")

# x_raw_final_paired.to_csv("output_csvs/x_raw_final_paired.csv")
# x_fpkm_final_paired.to_csv("output_csvs/x_fpkm_final_paired.csv")

# y_df.to_csv("output_csvs/y_df.csv")
# y_df_paired.to_csv("output_csvs/y_df_paired.csv")
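
When the saved files are read back, the sample IDs need to be restored as the index. The sketch below shows a round trip on a toy frame written to a temporary directory; the file name `toy.csv` and the data are illustrative only.

```python
import os
import tempfile

import pandas as pd

# Toy frame with sample IDs as the index, mirroring the layout used above.
toy = pd.DataFrame({"G1": [1.5, 2.5]}, index=["S1T", "S2N"])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.csv")

    # to_csv writes the index as the first column by default.
    toy.to_csv(path)

    # index_col=0 turns that first column back into the index on load.
    restored = pd.read_csv(path, index_col=0)

print(restored.index.tolist())
print(restored["G1"].tolist())
```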