# An alternative approach: using the NSCLC gene expression data hosted on TCIA

## Our blog post uses a partner solution (Illumina DRAGEN). This dataset, by contrast, was pre-processed with open-source tools and made available on TCIA: STAR v.2.3 for alignment and Cufflinks v.2.0.2 for expression calls. Further details can be found in [1].

#### [1] Zhou, Mu, et al. "Non–small cell lung cancer radiogenomics map identifies relationships between molecular and imaging phenotypes with prognostic implications." Radiology 286.1 (2018): 307-315.

## Step 1 - Download file "GSE103584_R01_NSCLC_RNAseq.txt.gz" from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE103584
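
The download can also be scripted. The sketch below builds the supplementary-file URL from the GEO accession, assuming GEO's usual directory layout (`/geo/series/<stub>nnn/<accession>/suppl/`). The helper names `geo_suppl_url` and `download_geo_file` are hypothetical, and the URL pattern is an assumption that should be checked against the GEO page before relying on it.

```python
import urllib.request
from pathlib import Path

GEO_ACCESSION = "GSE103584"
FILE_NAME = "GSE103584_R01_NSCLC_RNAseq.txt.gz"


def geo_suppl_url(accession: str, file_name: str) -> str:
    """Build a supplementary-file URL, assuming GEO's usual directory layout."""
    # GEO groups series by replacing the last three digits with 'nnn',
    # e.g. GSE103584 -> GSE103nnn (assumption; verify on the GEO page).
    stub = accession[:-3] + "nnn"
    return (
        "https://ftp.ncbi.nlm.nih.gov/geo/series/"
        f"{stub}/{accession}/suppl/{file_name}"
    )


def download_geo_file(accession: str, file_name: str, dest_dir: str = ".") -> Path:
    """Download the file into dest_dir unless it is already there."""
    dest = Path(dest_dir) / file_name
    if not dest.exists():
        urllib.request.urlretrieve(geo_suppl_url(accession, file_name), dest)
    return dest
```

Calling `download_geo_file(GEO_ACCESSION, FILE_NAME)` would then place the archive in the working directory.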

## Step 2 - Create a subset by removing case IDs and genes that are not relevant

In [1]:
import pandas as pd

# pandas infers gzip compression from the '.gz' extension, so the downloaded file can be read as-is
gen_data = pd.read_csv('GSE103584_R01_NSCLC_RNAseq.txt.gz', sep='\t')

# Remove case IDs that lack weight and pack-years values in the clinical data
l_caseID_drop = ['R01-003', 'R01-004', 'R01-006', 'R01-007', 'R01-015', 'R01-016', 'R01-018', 'R01-022', 'R01-023', 'R01-098', 'R01-105']

gen_data1 = gen_data.drop(columns=l_caseID_drop)
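
Rather than hard-coding the list, the case IDs to drop could be derived from the clinical table. The sketch below uses small toy frames; the clinical column names (`Case ID`, `Weight (lbs)`, `Pack Years`) and the `Not Collected` placeholder are assumptions about the clinical file, not taken from this notebook.

```python
import pandas as pd

# Toy expression table: genes in rows, one column per case (as in the GEO file).
expr = pd.DataFrame(
    {
        "Unnamed: 0": ["GENE_A", "GENE_B"],
        "R01-001": [1.2, 0.0],
        "R01-002": [3.4, 2.1],
        "R01-003": [0.5, 0.7],
    }
)

# Toy clinical table; column names and the placeholder value are assumed.
clinical = pd.DataFrame(
    {
        "Case ID": ["R01-001", "R01-002", "R01-003"],
        "Weight (lbs)": ["150", "Not Collected", "180"],
        "Pack Years": ["20", "35", "Not Collected"],
    }
)

# Treat anything that is not a number as missing.
numeric = clinical[["Weight (lbs)", "Pack Years"]].apply(pd.to_numeric, errors="coerce")
incomplete = clinical.loc[numeric.isna().any(axis=1), "Case ID"]

# Drop only the IDs that are actually present as columns.
to_drop = [case for case in incomplete if case in expr.columns]
expr_subset = expr.drop(columns=to_drop)
```

This version flags a case when either value is missing; whether that matches the intended rule depends on how the original list was built.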

In [2]:
# Name the unnamed first column (it holds the gene symbols) so it can be used as the index
gen_data1.rename(columns={'Unnamed: 0': 'Case_ID'}, inplace=True)

# Transpose the dataframe so that rows = case IDs and columns = genes
gen_data1.set_index('Case_ID', inplace=True)
gen_data_t = gen_data1.transpose()
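
As a quick illustration of what the transpose does, the toy frame below starts with genes in rows and ends with cases in rows. The gene and case names are made up for the example.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Case_ID": ["GENE_A", "GENE_B"],  # first column holds gene names
        "R01-001": [1.0, 2.0],
        "R01-002": [3.0, 4.0],
    }
)

toy_t = toy.set_index("Case_ID").transpose()

# Rows are now case IDs and columns are genes.
print(toy_t.index.tolist())
print(toy_t.columns.tolist())

# The name 'Case_ID' travels with the gene axis, so the row index is unnamed.
print(toy_t.index.name, toy_t.columns.name)
```

Because the row index ends up unnamed, the CSV written later has an empty top-left header cell.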

In [3]:
# Keep the genes suggested in Zhou, Mu, et al. [1]
# These are genes corresponding to Metagenes 19, 10, 9, 4, 3, 21 in Table 2 of the paper

l_genes = ['LRIG1', 'HPGD', 'GDF15', 'CDH2', 'POSTN', 'VCAN', 'PDGFRA', 'VCAM1', 'CD44', 'CD48', 'CD4', 'LYL1', 'SPI1', 'CD37', 'VIM', 'LMO2', 'EGR2', 'BGN', 'COL4A1', 'COL5A1', 'COL5A2']
gen_data2 = gen_data_t[l_genes]
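
Selecting columns with a list raises a `KeyError` if any gene symbol is absent from the table, so it may help to check first. A minimal sketch on toy data: the expression values are invented, and `wanted` stands in for `l_genes`.

```python
import pandas as pd

toy_t = pd.DataFrame(
    {"LRIG1": [1.0, 2.0], "HPGD": [0.5, 0.1], "OTHER": [9.0, 9.0]},
    index=["R01-001", "R01-002"],
)

wanted = ["LRIG1", "HPGD", "GDF15"]

# Split the requested symbols into those present and those missing.
missing = [g for g in wanted if g not in toy_t.columns]
present = [g for g in wanted if g in toy_t.columns]

# Index only with symbols that exist, and report the rest.
subset = toy_t[present]
print("missing genes:", missing)
```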

In [4]:
# Replace NaN with 0
gen_data3 = gen_data2.fillna(0)

# Sort rows by case ID
gen_data4 = gen_data3.sort_index(axis = 0)

# Add the 'SurvivalStatus' label column, taken from the clinical dataset.
# The values must be listed in the same order as the sorted case IDs.
l_survivalstatus = [1,0,0,0,1,0,1,1,0,0,1,1,0,1,0,1,1,0,1,1,0,1,0,0,0,0,1,0,0,1,0,1,0,0,0,0,0,0,1,1,0,0,0,1,1,0,0,1,1,0,0,0,1,0,0,1,1,0,0,1,1,1,1,0,1,0,1,0,0,0,0,0,1,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,1,0,1,1,0,0,0,0]

gen_data4["SurvivalStatus"] = l_survivalstatus
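
Assigning a plain list relies on its order matching the sorted case IDs exactly. The sketch below shows an alternative that aligns labels by case ID, using toy data; the `status_by_case` mapping is hypothetical and would in practice be built from the clinical dataset.

```python
import pandas as pd

features = pd.DataFrame(
    {"LRIG1": [1.0, 2.0, 3.0]},
    index=["R01-002", "R01-001", "R01-005"],
).sort_index()

# Hypothetical labels keyed by case ID (1 and 0, as in l_survivalstatus).
status_by_case = pd.Series({"R01-001": 1, "R01-005": 0, "R01-002": 0})

# Fail early if any case lacks a label.
unlabeled = features.index.difference(status_by_case.index)
assert unlabeled.empty, f"no label for: {list(unlabeled)}"

# Assignment from a Series aligns on the index, not on position.
features["SurvivalStatus"] = status_by_case
```

With this approach the order of the labels no longer matters, and a missing case is caught before the column is written.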

## Step 3 - Store preprocessed data in CSV file

In [5]:
# Save the data to a CSV file. This file is the input for 'preprocess-genomic-data.ipynb'
gen_data4.to_csv('Genomic-data-119patients-TCIA.csv')

## Step 4 - Manually add 'Case_ID' to the first cell of the CSV file (row 0, column 0)
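
The manual edit could likely be avoided by naming the index column when the file is written. The sketch below uses the `index_label` argument of `DataFrame.to_csv` on a toy frame and writes to an in-memory buffer so the header line can be inspected; the toy values are invented.

```python
import io

import pandas as pd

toy = pd.DataFrame(
    {"LRIG1": [1.0, 2.0], "SurvivalStatus": [1, 0]},
    index=["R01-001", "R01-002"],
)

# index_label fills the header cell above the index column (row 0, column 0).
buffer = io.StringIO()
toy.to_csv(buffer, index_label="Case_ID")

header = buffer.getvalue().splitlines()[0]
print(header)
```

With the real data, the equivalent call would be `gen_data4.to_csv('Genomic-data-119patients-TCIA.csv', index_label='Case_ID')`.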