# Introduction

This paper introduces several convolutional neural network (CNN) models for accurate prediction of cancer types based on gene expression data from The Cancer Genome Atlas (TCGA).

The main contributions of this paper are:

* Proposing three novel CNN architectures (1D-CNN, 2D-Vanilla-CNN, and 2D-Hybrid-CNN) tailored for processing unstructured gene expression data. These models achieve excellent accuracy (up to 95.7%) in classifying 33 cancer types and normal samples simultaneously.

* Incorporating normal tissue samples during training to account for the influence of tissue-of-origin, which helps identify cancer-specific markers rather than just tissue-specific markers.

* Developing a unique interpretation scheme based on guided saliency maps to identify important marker genes for each cancer type predicted by the CNN models.

* Identifying a total of 2090 marker genes across 33 cancer types and normal samples, including well-known cancer markers like GATA3 and ESR1 for breast cancer. The marker genes exhibit differential expression between cancer types.

* Demonstrating the models' applicability by achieving 88.42% accuracy in predicting breast cancer subtypes using the 1D-CNN architecture.

Overall, this paper presents novel CNN models tailored for cancer prediction from gene expression data, accounting for tissue-of-origin effects, and providing an interpretation scheme to identify potential cancer biomarkers, contributing to precision oncology and early cancer detection.


# Scope of reproducibility

The paper and its repository provide a reasonable basis for reproducing the results and findings:
* Code for all three CNN models is available on GitHub, which I can refer to and reuse when building my own model for predicting 33 cancer types.
* For each proposed CNN model (1D-CNN, 2D-Vanilla-CNN, 2D-Hybrid-CNN), the architectural details are given, including the number of layers, kernel sizes, and hyperparameters.
* For the 34-class setting, neither the paper nor the GitHub repository clearly states where to find the normal tissue sample dataset used for prediction.
* No trained model weights or random seeds are provided. However, the available code and model architectures, along with the other details, should be reasonably sufficient to reproduce the model training and results.

Finally, reproducing the exact accuracy figures for the 33-cancer-type model may still be challenging for me because of potential differences in computational environments and random initializations.
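Since the paper and repository do not state random seeds, one way to make my own runs repeatable is to fix the seeds myself. The sketch below is a minimal example under that assumption; the helper name `set_seed` and the value 42 are my own choices, not taken from the original work.

```python
import random

import numpy as np
import torch


def set_seed(seed: int = 42) -> None:
    """Seed the Python, NumPy and PyTorch RNGs (hypothetical helper, not from the paper)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)


# Re-seeding with the same value reproduces the same random draws.
set_seed(42)
first = torch.rand(3)
set_seed(42)
second = torch.rand(3)
print(torch.equal(first, second))  # True
```

Fixing seeds makes repeated runs in the same environment comparable, but it does not guarantee the paper's exact numbers, since hardware and library versions can still differ.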


# Methodology

# Environment

I use Jupyter Notebook within Google Colab as my main interface for developing and customizing the code for my project. Its flexibility allows me to add text alongside the code, which improves readability for both developers and reviewers.

During runtime, the system crashed because of limited RAM and disk space when downloading and preprocessing the data. To address this issue, I switched to the Google Colab TPU v2 runtime to ease the strain on system resources.


# Python Version

In [1]:
import sys
print(sys.version)

3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]


# Dependencies/Packages needed


In [None]:
!pip install plotly-express
!pip install -U scikit-learn
!pip install progressbar2

# Data

# Data Download instructions

According to the original paper, the models were trained and tested on gene expression profiles from The Cancer Genome Atlas (TCGA).

As a first step, I used the dataset available from

https://xenabrowser.net/datapages/?cohort=TCGA%20Pan-Cancer%20(PANCAN)&addHub=https%3A%2F%2Flegacy.xenahubs.net&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443

under the heading "exon expression RNAseq".

The two files downloaded and unzipped below make up this dataset, which consists of gene expression (RNA-Seq) data for 10,459 samples across 20,530 features. The samples represent 33 different cancer types, which are the target labels for the classification task.


In [None]:
!wget "https://pancanatlas.xenahubs.net/download/TCGA_phenotype_denseDataOnlyDownload.tsv.gz"

In [None]:
!gunzip TCGA_phenotype_denseDataOnlyDownload.tsv.gz

In [None]:
!wget "https://legacy.xenahubs.net/download/TCGA.PANCAN.sampleMap/HiSeqV2.gz"

In [None]:
!gunzip 'HiSeqV2.gz'

# Preprocessing code + command

# Data Description

For my prediction model of 33 cancer types, I am working with a large dataset of RNA sequencing data and associated clinical information. The key characteristics of the data are:

* Dataset: The dataset is loaded from a file named 'HiSeqV2' and contains gene expression data for 10,459 samples. The data is initially loaded in chunks of 21,000 rows to conserve memory.
* Columns: After transposition, the dataset has 20,530 columns, where each column represents a different gene (feature). The column names are taken from the first row of the transposed data.
* Samples: The rows in the dataset represent the different samples, with each row containing the gene expression values for that sample.
* Labels: The clinical information for the samples is loaded from a separate file named 'TCGA_phenotype_denseDataOnlyDownload.tsv'. This file contains the sample ID, sample type (e.g., Additional Metastatic, Metastatic), and primary disease for each sample.
* Data Representation: The gene expression data is represented as a numerical matrix, where each row corresponds to a sample and each column corresponds to a gene. The values in the matrix represent the expression level of each gene in each sample.
* Diseases: The dataset includes samples from a variety of cancer types, including skin cutaneous melanoma, thyroid carcinoma, sarcoma, prostate adenocarcinoma, and many others. The unique disease types are identified and listed in the output.
* Disease Encoding: A dictionary 'diseasedict' is created to map the disease types to numeric values, which is useful for downstream machine learning tasks.
* Data Dimensions: The total number of samples in the dataset is 10,459, and the number of features (genes) per sample is 20,530.

In my model code, I prepare this large-scale RNA sequencing dataset for analysis and model building, with a focus on exploring gene expression patterns across different cancer types.


# Load Data

In [None]:
import pandas as pd
import h5py
import numpy as np
#import progressbar

In [None]:
data = 'HiSeqV2'
labels = 'TCGA_phenotype_denseDataOnlyDownload.tsv'
dbPath = 'data.h5'
verbose = False

print('Loading data ... Patience.')
df = pd.read_csv(data, sep='\t').transpose()

print('Loading labels ...')
labeldf = pd.read_csv(labels, sep = '\t')


Loading data ... Patience.
Loading labels ...


In [None]:
print('Housekeeping ...')
df.columns = df.iloc[0]
df = df.drop('Sample', axis = 0)

labeldf = labeldf.set_index('sample')

Housekeeping ...
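
After the plain `transpose()`, the gene names end up in the first data row, so pandas stores every column as `object` (the `merged_df.info()` output further down reports `object(20532)`). A possible alternative, sketched here on a tiny in-memory table rather than the real `HiSeqV2` file, is to read the first column as the index so that the values stay numeric. The toy gene and sample names are made up for illustration.

```python
import io

import pandas as pd

# Toy stand-in for HiSeqV2: genes in rows, samples in columns (values are invented).
toy_tsv = (
    "Sample\tTCGA-AA-0001-01\tTCGA-AA-0002-11\n"
    "GENE_A\t10.5\t9.25\n"
    "GENE_B\t0.0\t1.5\n"
)

# index_col=0 keeps the gene names out of the data, so the transposed frame is numeric.
expr = pd.read_csv(io.StringIO(toy_tsv), sep="\t", index_col=0).transpose()
expr = expr.astype("float32")  # float32 needs half the memory of float64

print(expr.shape)  # (2, 2): samples x genes
print(expr.dtypes.unique())
```

With numeric columns, later steps such as `mean`, `std`, and the conversion to tensors do not need an extra cast.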


In [None]:
df

Sample,ARHGEF10L,HIF3A,RNF17,RNF10,RNF11,RNF13,GTF2IP1,REM1,MTVR2,RTN4RL2,...,TULP2,NPY5R,GNGT2,GNGT1,TULP3,PTRF,BCL6B,GSTK1,SELP,SELS
TCGA-S9-A7J2-01,10.9576,4.8099,0.4657,11.2675,10.1761,10.4769,13.0456,3.2299,0.4657,8.7533,...,0.0,1.3357,2.9741,0.0,9.2594,9.4779,6.1595,9.6465,0.0,9.4848
TCGA-G3-A3CH-11,11.0186,5.3847,0.0,11.669,11.398,10.8249,11.5487,3.5408,1.4714,7.9144,...,0.0,3.5408,5.5302,0.0,7.5066,10.5302,7.3741,13.0045,7.0466,10.3411
TCGA-EK-A2RE-01,9.7106,2.8888,0.4192,11.4903,11.7371,9.9473,10.841,2.5988,0.0,3.9541,...,0.7436,0.0,2.953,1.2319,9.5217,13.8492,6.5812,9.2958,0.4192,9.745
TCGA-44-6778-01,9.6205,7.9642,1.5378,11.8432,11.0531,10.9005,12.4145,4.5366,2.0609,4.1805,...,0.0,2.0609,6.1839,4.1291,8.9832,12.3412,9.0862,10.4779,9.4517,10.4395
TCGA-VM-A8C8-01,11.6596,8.5622,0.0,11.2677,11.3549,10.8579,13.256,5.9962,0.0,5.357,...,1.3549,0.0,4.753,0.6034,9.0573,8.8984,5.9116,9.9584,1.6216,9.6811
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
TCGA-95-7947-01,10.0459,6.6572,0.0,11.3827,10.9459,10.5165,12.5061,3.6214,0.6881,3.1611,...,3.758,0.9387,4.2502,0.0,9.566,10.8301,7.6826,11.3541,7.3565,10.3328
TCGA-VQ-AA6F-01,9.5758,5.8461,0.0,11.6292,10.6314,11.5036,12.5995,5.9237,0.9655,5.556,...,0.3979,0.3979,4.74,5.4128,8.9264,11.2215,7.8272,10.8221,7.0938,10.614
TCGA-55-6985-11,9.6575,8.9521,0.4791,11.6766,11.3748,11.3349,12.3318,5.6618,0.4791,4.7755,...,0.8381,2.5176,6.8133,1.1254,9.0194,13.5597,9.9022,10.9969,9.3046,10.2187
TCGA-DD-A115-01,11.7589,3.7591,0.0,12.0914,11.5774,10.1702,12.0789,4.3081,0.0,11.0616,...,2.8619,2.8619,4.1258,0.0,7.5925,11.2591,8.0674,13.3772,6.4848,9.8594


In [None]:
labeldf

sample,sample_type_id,sample_type,_primary_disease
TCGA-D3-A1QA-07,7.0,Additional Metastatic,skin cutaneous melanoma
TCGA-DE-A4MD-06,6.0,Metastatic,thyroid carcinoma
TCGA-J8-A3O2-06,6.0,Metastatic,thyroid carcinoma
TCGA-J8-A3YH-06,6.0,Metastatic,thyroid carcinoma
TCGA-EM-A2P1-06,6.0,Metastatic,thyroid carcinoma
...,...,...,...
TCGA-17-Z059-01,,,lung adenocarcinoma
TCGA-17-Z060-01,,,lung adenocarcinoma
TCGA-17-Z061-01,,,lung adenocarcinoma
TCGA-17-Z062-01,,,lung adenocarcinoma


In [None]:
df.columns

Index(['ARHGEF10L', 'HIF3A', 'RNF17', 'RNF10', 'RNF11', 'RNF13', 'GTF2IP1',
       'REM1', 'MTVR2', 'RTN4RL2',
       ...
       'TULP2', 'NPY5R', 'GNGT2', 'GNGT1', 'TULP3', 'PTRF', 'BCL6B', 'GSTK1',
       'SELP', 'SELS'],
      dtype='object', name='Sample', length=20530)

In [None]:
merged_df = df.join(labeldf, how='inner')

In [None]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10459 entries, TCGA-S9-A7J2-01 to TCGA-FV-A3I0-11
Columns: 20533 entries, ARHGEF10L to _primary_disease
dtypes: float64(1), object(20532)
memory usage: 1.6+ GB


In [None]:
merged_df

Unnamed: 0,ARHGEF10L,HIF3A,RNF17,RNF10,RNF11,RNF13,GTF2IP1,REM1,MTVR2,RTN4RL2,...,GNGT1,TULP3,PTRF,BCL6B,GSTK1,SELP,SELS,sample_type_id,sample_type,_primary_disease
TCGA-S9-A7J2-01,10.9576,4.8099,0.4657,11.2675,10.1761,10.4769,13.0456,3.2299,0.4657,8.7533,...,0.0,9.2594,9.4779,6.1595,9.6465,0.0,9.4848,1.0,Primary Tumor,brain lower grade glioma
TCGA-G3-A3CH-11,11.0186,5.3847,0.0,11.669,11.398,10.8249,11.5487,3.5408,1.4714,7.9144,...,0.0,7.5066,10.5302,7.3741,13.0045,7.0466,10.3411,11.0,Solid Tissue Normal,liver hepatocellular carcinoma
TCGA-EK-A2RE-01,9.7106,2.8888,0.4192,11.4903,11.7371,9.9473,10.841,2.5988,0.0,3.9541,...,1.2319,9.5217,13.8492,6.5812,9.2958,0.4192,9.745,1.0,Primary Tumor,cervical & endocervical cancer
TCGA-44-6778-01,9.6205,7.9642,1.5378,11.8432,11.0531,10.9005,12.4145,4.5366,2.0609,4.1805,...,4.1291,8.9832,12.3412,9.0862,10.4779,9.4517,10.4395,1.0,Primary Tumor,lung adenocarcinoma
TCGA-VM-A8C8-01,11.6596,8.5622,0.0,11.2677,11.3549,10.8579,13.256,5.9962,0.0,5.357,...,0.6034,9.0573,8.8984,5.9116,9.9584,1.6216,9.6811,1.0,Primary Tumor,brain lower grade glioma
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
TCGA-95-7947-01,10.0459,6.6572,0.0,11.3827,10.9459,10.5165,12.5061,3.6214,0.6881,3.1611,...,0.0,9.566,10.8301,7.6826,11.3541,7.3565,10.3328,1.0,Primary Tumor,lung adenocarcinoma
TCGA-VQ-AA6F-01,9.5758,5.8461,0.0,11.6292,10.6314,11.5036,12.5995,5.9237,0.9655,5.556,...,5.4128,8.9264,11.2215,7.8272,10.8221,7.0938,10.614,1.0,Primary Tumor,stomach adenocarcinoma
TCGA-55-6985-11,9.6575,8.9521,0.4791,11.6766,11.3748,11.3349,12.3318,5.6618,0.4791,4.7755,...,1.1254,9.0194,13.5597,9.9022,10.9969,9.3046,10.2187,11.0,Solid Tissue Normal,lung adenocarcinoma
TCGA-DD-A115-01,11.7589,3.7591,0.0,12.0914,11.5774,10.1702,12.0789,4.3081,0.0,11.0616,...,0.0,7.5925,11.2591,8.0674,13.3772,6.4848,9.8594,1.0,Primary Tumor,liver hepatocellular carcinoma


In [None]:
len(merged_df._primary_disease.unique())

33
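
The merged table above contains rows whose `sample_type` is `Solid Tissue Normal`, so one possible way to obtain the 34th "normal" class is to relabel those rows. Whether the paper built its normal class exactly this way is an assumption on my part; the toy rows and the column name `target` below are illustrative only.

```python
import pandas as pd

# Toy stand-in for the label columns of merged_df (rows are invented).
toy = pd.DataFrame(
    {
        "sample_type": ["Primary Tumor", "Solid Tissue Normal", "Primary Tumor"],
        "_primary_disease": [
            "lung adenocarcinoma",
            "lung adenocarcinoma",
            "thymoma",
        ],
    },
    index=["TCGA-AA-0001-01", "TCGA-AA-0002-11", "TCGA-AA-0003-01"],
)

# Assumption: every 'Solid Tissue Normal' row goes into a single 'normal' class,
# regardless of the tissue it came from.
is_normal = toy["sample_type"] == "Solid Tissue Normal"
toy["target"] = toy["_primary_disease"].where(~is_normal, "normal")

print(toy["target"].tolist())
# ['lung adenocarcinoma', 'normal', 'thymoma']
```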

In [None]:
labeldf._primary_disease.unique()

array(['skin cutaneous melanoma', 'thyroid carcinoma', 'sarcoma',
       'prostate adenocarcinoma', 'pheochromocytoma & paraganglioma',
       'pancreatic adenocarcinoma', 'head & neck squamous cell carcinoma',
       'esophageal carcinoma', 'colon adenocarcinoma',
       'cervical & endocervical cancer', 'breast invasive carcinoma',
       'bladder urothelial carcinoma', 'testicular germ cell tumor',
       'kidney papillary cell carcinoma', 'kidney clear cell carcinoma',
       'acute myeloid leukemia', 'rectum adenocarcinoma',
       'ovarian serous cystadenocarcinoma', 'lung adenocarcinoma',
       'liver hepatocellular carcinoma',
       'uterine corpus endometrioid carcinoma', 'glioblastoma multiforme',
       'brain lower grade glioma', 'uterine carcinosarcoma', 'thymoma',
       'stomach adenocarcinoma', 'diffuse large B-cell lymphoma',
       'lung squamous cell carcinoma', 'mesothelioma',
       'kidney chromophobe', 'uveal melanoma', 'cholangiocarcinoma',
       'adrenocorti

# Filtering [Optional]

In [None]:
gene_means = df.mean(axis=0)


In [None]:
gene_stds = df.std(axis=0)

In [None]:
gene_stds

Sample
ARHGEF10L    1.211973
HIF3A        2.741488
RNF17        1.330406
RNF10        0.384238
RNF11        0.664399
               ...   
PTRF          1.54043
BCL6B        1.484694
GSTK1        0.896771
SELP         2.477773
SELS         0.636399
Length: 20530, dtype: object

In [None]:
low_mean_genes = gene_means[gene_means < 0.5].index

In [None]:
low_information_genes = gene_means[gene_means < 0.5].index.union(gene_stds[gene_stds < 0.8].index)

In [None]:
len(low_information_genes)

7370

In [None]:
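# Filter out genes with mean < 0.5 or standard deviation < 0.8
# (gene_stats combines the per-gene statistics computed in the cells above)
gene_stats = pd.DataFrame({'mean': gene_means.astype(float), 'std': gene_stds.astype(float)})
genes_to_remove = gene_stats[(gene_stats['mean'] < 0.5) | (gene_stats['std'] < 0.8)].index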
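
The cells above identify low-information genes but do not yet drop them from the data. A minimal sketch of applying the same thresholds (mean < 0.5 or standard deviation < 0.8) is shown below on a small made-up matrix; `filter_genes` is a hypothetical helper name.

```python
import pandas as pd


def filter_genes(expr: pd.DataFrame, min_mean: float = 0.5, min_std: float = 0.8) -> pd.DataFrame:
    """Drop genes (columns) whose mean or standard deviation is below the thresholds."""
    expr = expr.astype(float)
    keep = (expr.mean(axis=0) >= min_mean) & (expr.std(axis=0) >= min_std)
    return expr.loc[:, keep]


# Toy matrix: rows are samples, columns are genes (values are invented).
toy = pd.DataFrame(
    {
        "LOW_MEAN": [0.0, 0.1, 0.2, 0.1],  # mean 0.1 -> removed
        "LOW_STD": [5.0, 5.1, 5.0, 5.1],   # std of about 0.06 -> removed
        "KEPT": [1.0, 4.0, 8.0, 11.0],     # mean 6.0, std of about 4.4 -> kept
    }
)
filtered = filter_genes(toy)
print(list(filtered.columns))  # ['KEPT']
```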


# Visualization

//I will add a few more charts to give a better view of my prediction models for 33 cancer types.

I will also cover a detailed explanation in my video presentation and update the project PDF and this notebook accordingly towards the completion of the project.//

* Each point in the plot below represents a patient sample, with colors indicating the associated cancer type.
* This visualization helps in understanding the similarities and differences in gene expression profiles among various cancer types.
* Distinct clusters corresponding to different cancer types are identifiable, suggesting that gene expression patterns can distinguish between cancer types.

In [None]:
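import numpy as np
from sklearn.manifold import TSNE

# X is the (samples x genes) expression matrix read back from the HDF5 file
# in the "Save data in h5 format" section; that cell must be run first.
X_embedded = TSNE(n_components=2, learning_rate='auto',
                   init='random', perplexity=3).fit_transform(X)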
X_embedded.shape

(10459, 2)

In [None]:
diseasedict = {
    'skin cutaneous melanoma':0, 'thyroid carcinoma':1, 'sarcoma':2,
    'prostate adenocarcinoma':3, 'pheochromocytoma & paraganglioma':4,
    'pancreatic adenocarcinoma':5, 'head & neck squamous cell carcinoma':6,
    'esophageal carcinoma':7, 'colon adenocarcinoma':8,
    'cervical & endocervical cancer':9, 'breast invasive carcinoma':10,
    'bladder urothelial carcinoma':11, 'testicular germ cell tumor':12,
    'kidney papillary cell carcinoma':13, 'kidney clear cell carcinoma':14,
    'acute myeloid leukemia':15, 'rectum adenocarcinoma':16,
    'ovarian serous cystadenocarcinoma':17, 'lung adenocarcinoma':18,
    'liver hepatocellular carcinoma':19,
    'uterine corpus endometrioid carcinoma':20, 'glioblastoma multiforme':21,
    'brain lower grade glioma':22, 'uterine carcinosarcoma':23, 'thymoma':24,
    'stomach adenocarcinoma':25, 'diffuse large B-cell lymphoma':26,
    'lung squamous cell carcinoma':27, 'mesothelioma':28,
    'kidney chromophobe':29, 'uveal melanoma':30, 'cholangiocarcinoma':31,
    'adrenocortical cancer':32
}

keyslist = list(diseasedict.keys())
valueslist = list(diseasedict.values())

cancers = []

for classno in y:
  cancers.append(keyslist[valueslist.index(classno)])


In [None]:
tsne = pd.DataFrame(X_embedded, columns = ["tsne1", "tsne2"])
cancers = pd.DataFrame(cancers, columns = ["cancer"])
tsne = pd.concat([tsne,cancers], axis = 1, sort = False)
tsne = tsne.sort_values(by = "cancer")

In [None]:
pip install plotly-express

Collecting plotly-express
  Downloading plotly_express-0.4.1-py2.py3-none-any.whl (2.9 kB)
Collecting statsmodels>=0.9.0 (from plotly-express)
  Downloading statsmodels-0.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (10.8 MB)
Collecting patsy>=0.5 (from plotly-express)
  Downloading patsy-0.5.6-py2.py3-none-any.whl (233 kB)
Installing collected packages: patsy, statsmodels, plotly-express
Successfully installed patsy-0.5.6 plotly-express-0.4.1 statsmodels-0.14.1


In [None]:

import plotly_express as px

figx = px.scatter(
    tsne,
    x="tsne1",
    y="tsne2",
    color="cancer",
    hover_name="cancer",
    width=970,
    height=500,
    template="ggplot2",
    color_discrete_sequence= px.colors.qualitative.Alphabet,
    #facet_col="group_label",
    size_max=0.1,
)

figx.show()

# Model

* Citation to the original paper

PAPER: Convolutional neural network models for cancer type prediction based on gene expression.

AUTHORS: Milad Mostavi, Yu-Chiao Chiu, Yufei Huang, & Yidong Chen.

JOURNAL: BMC Medical Genomics, 13(Suppl 5).

Article No.: 44

YEAR: 2020

* Link to the paper is

https://bmcmedgenomics.biomedcentral.com/articles/10.1186/s12920-020-0677-2

* Link to the original paper’s repo is

 https://github.com/chenlabgccri/CancerTypePrediction

* Model Description

 * The model is defined in the CNN1D class, which inherits from nn.Module (the base class for all neural network modules in PyTorch).
 * The model consists of the following layers:
   * Two Conv1d layers with 16 and 32 output channels, respectively, each with a kernel size of 3.
   * MaxPool1d layers with a kernel size of 2 for downsampling the feature maps.
   * Two fully connected (Linear) layers with 64 and num_classes units, respectively, for the final classification.
 * ReLU activation functions are applied after the convolutional layers and the first fully connected layer.
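
To see where the input size of the first fully connected layer comes from, the arithmetic can be traced layer by layer. The sketch below assumes the variant with `padding=1` used in the later cells (so each convolution preserves the length) and 20,530 input genes; the function name is hypothetical.

```python
def flattened_size(num_features: int, out_channels: int = 32, num_pools: int = 2) -> int:
    """Number of values fed to the first Linear layer after the conv/pool stack.

    Assumes kernel_size=3 with padding=1 (length unchanged by each Conv1d)
    and MaxPool1d(2), which halves the length (rounding down) at each stage.
    """
    length = num_features
    for _ in range(num_pools):
        length = length // 2
    return out_channels * length


size = flattened_size(20530)
print(size)  # 164224 = 32 channels * 5132 positions
```

This matches the expression `32 * (num_features // 4)` used for `fc1`, because halving twice with rounding down gives the same result as integer division by 4.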

# Implementation Code



Please Note :

The code implementation is currently in progress and has not been completed yet. I will conduct the final code review, optimization, and formatting in the last phase of the project development.



# Build & Train 1d CNN

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Define the 1D CNN model
class CNN1D(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(CNN1D, self).__init__()
        # The input has one channel; input_dim is the number of genes per sample
        self.conv1 = nn.Conv1d(in_channels=1, out_channels=64, kernel_size=3)
        self.pool = nn.MaxPool1d(kernel_size=2)
        # Length after the unpadded conv is input_dim - 2, then halved by pooling
        self.fc1 = nn.Linear(64 * ((input_dim - 2) // 2), 128)
        self.fc2 = nn.Linear(128, output_dim)

    def forward(self, x):
        x = self.pool(torch.relu(self.conv1(x)))
        x = x.view(x.size(0), -1)  # Flatten to (batch, channels * length)
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Encode labels aligned to the rows of df (labeldf may hold samples that are not in df)
label_encoder = LabelEncoder()
sample_labels = labeldf.loc[df.index, '_primary_disease']

# Convert data to PyTorch tensors (df holds object-dtype values, so cast first)
X_tensor = torch.tensor(df.values.astype(np.float32))
y_tensor = torch.tensor(label_encoder.fit_transform(sample_labels), dtype=torch.long)

# Reshape X_tensor to add the channel dimension (1 for 1D CNN)
X_tensor = X_tensor.unsqueeze(1)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_tensor, y_tensor, test_size=0.2, random_state=42)

# Create DataLoader objects
train_dataset = TensorDataset(X_train, y_train)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Initialize the model, loss function, and optimizer
model = CNN1D(input_dim=X_train.shape[2], output_dim=len(label_encoder.classes_))
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Train the model
num_epochs = 10
for epoch in range(num_epochs):
    running_loss = 0.0
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print(f'Epoch {epoch+1}, Loss: {running_loss}')

# Evaluate the model
with torch.no_grad():
    outputs = model(X_test)
    _, predicted = torch.max(outputs, 1)
    accuracy = (predicted == y_test).sum().item() / y_test.size(0)
    print(f'Test Accuracy: {accuracy}')


In [None]:
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader
from sklearn.preprocessing import LabelEncoder

# Assuming you already have the dataframes loaded as 'df' and 'labeldf'

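# Encode the labels, aligned to the rows of df (labeldf may hold samples that are not in df)
label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(labeldf.loc[df.index, '_primary_disease'])

# Convert data to PyTorch tensors (df holds object-dtype values, so cast first)
X = torch.from_numpy(df.values.astype('float32'))
y = torch.from_numpy(labels).long()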


In [None]:

# Create dataset and data loader
dataset = TensorDataset(X, y)
batch_size = 32
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

# Define the 1D CNN model
class CNN1D(nn.Module):
    def __init__(self, num_features, num_classes):
        super(CNN1D, self).__init__()
        self.conv1 = nn.Conv1d(1, 16, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(16, 32, kernel_size=3, padding=1)
        self.pool = nn.MaxPool1d(2)
        self.fc1 = nn.Linear(32 * (num_features // 4), 64)
        self.fc2 = nn.Linear(64, num_classes)

    def forward(self, x):
        x = x.unsqueeze(1)  # Add channel dimension
        x = self.pool(nn.functional.relu(self.conv1(x)))
        x = self.pool(nn.functional.relu(self.conv2(x)))
        x = x.view(x.size(0), -1)
        x = nn.functional.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Initialize the model
num_features = df.shape[1]
num_classes = len(label_encoder.classes_)
model = CNN1D(num_features, num_classes)

# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters())

# Train the model
num_epochs = 10
for epoch in range(num_epochs):
    for inputs, targets in dataloader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
    print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

# Save data in h5 format

In [None]:
import progressbar
dbPath='data1.h5'
nTotal = df.shape[0]    #10459
nFeat = df.shape[1]     #20530

print('Total Number of samples: '+ str(nTotal))
print('Features (RNASeq) per sample: ' + str(nFeat))

print('Diseases to predict: ')

diseases = labeldf._primary_disease.unique()

for disease in diseases:
    print(disease)

# Defining Categorical values for each disease

diseasedict = {
    'skin cutaneous melanoma':0, 'thyroid carcinoma':1, 'sarcoma':2,
    'prostate adenocarcinoma':3, 'pheochromocytoma & paraganglioma':4,
    'pancreatic adenocarcinoma':5, 'head & neck squamous cell carcinoma':6,
    'esophageal carcinoma':7, 'colon adenocarcinoma':8,
    'cervical & endocervical cancer':9, 'breast invasive carcinoma':10,
    'bladder urothelial carcinoma':11, 'testicular germ cell tumor':12,
    'kidney papillary cell carcinoma':13, 'kidney clear cell carcinoma':14,
    'acute myeloid leukemia':15, 'rectum adenocarcinoma':16,
    'ovarian serous cystadenocarcinoma':17, 'lung adenocarcinoma':18,
    'liver hepatocellular carcinoma':19,
    'uterine corpus endometrioid carcinoma':20, 'glioblastoma multiforme':21,
    'brain lower grade glioma':22, 'uterine carcinosarcoma':23, 'thymoma':24,
    'stomach adenocarcinoma':25, 'diffuse large B-cell lymphoma':26,
    'lung squamous cell carcinoma':27, 'mesothelioma':28,
    'kidney chromophobe':29, 'uveal melanoma':30, 'cholangiocarcinoma':31,
    'adrenocortical cancer':32
}

print('Creating Database File at : ' + dbPath)
db = h5py.File(dbPath, mode = 'w')

print('Setting up Database')
db.create_dataset("name", (nTotal,), np.dtype('|S16'))
db.create_dataset("RNASeq", (nTotal, nFeat), np.float32)
db.create_dataset("label", (nTotal,), np.uint8)

idx = 0

print('Writing ' + str(nTotal) + ' samples to Dataset')

for index, row in progressbar.progressbar(df.iterrows(), redirect_stdout=True):
    try:
        data = labeldf.loc[index]
        disease = data['_primary_disease']  # look up by column name rather than position
        if(verbose):
            print('Processing '+ str(idx) + ' of ' + str(nTotal) + ' : ' + index + '\t disease: \t' + str(disease))
        db["name"][idx] = np.asarray(index, dtype = np.dtype('|S16'))
        db["RNASeq"][idx] = np.asarray(row, dtype = np.float32)
        db["label"][idx] = np.uint8(diseasedict[disease])
        idx = idx + 1
    except KeyError:
        # The sample has no label row, or its disease is not in diseasedict
        print("Error: Cannot find label")
        continue

print('Closing Database ..')
db.close()
print('Complete!')

Total Number of samples: 10459
Features (RNASeq) per sample: 20530
Diseases to predict: 
skin cutaneous melanoma
thyroid carcinoma
sarcoma
prostate adenocarcinoma
pheochromocytoma & paraganglioma
pancreatic adenocarcinoma
head & neck squamous cell carcinoma
esophageal carcinoma
colon adenocarcinoma
cervical & endocervical cancer
breast invasive carcinoma
bladder urothelial carcinoma
testicular germ cell tumor
kidney papillary cell carcinoma
kidney clear cell carcinoma
acute myeloid leukemia
rectum adenocarcinoma
ovarian serous cystadenocarcinoma
lung adenocarcinoma
liver hepatocellular carcinoma
uterine corpus endometrioid carcinoma
glioblastoma multiforme
brain lower grade glioma
uterine carcinosarcoma
thymoma
stomach adenocarcinoma
diffuse large B-cell lymphoma
lung squamous cell carcinoma
mesothelioma
kidney chromophobe
uveal melanoma
cholangiocarcinoma
adrenocortical cancer
Creating Database File at : data1.h5
Setting up Database
Writing 10459 samples to Dataset


| |       #                                       | 10458 Elapsed Time: 0:00:19


Closing Database ..
Complete!


In [None]:
del df

In [None]:
db = h5py.File(dbPath, mode = 'r')
X = db["RNASeq"][...]
y = db["label"][...]

In [None]:
print(X.shape)
print(y.shape)

(10459, 20530)
(10459,)


In [None]:
y

array([22, 19,  9, ..., 18, 19, 19], dtype=uint8)

# Training
Please Note :

// This section is a work in progress: I am still developing the code to train a cancer prediction model for 33 types of cancer, so all of the items listed below are still being worked on.
I am currently uploading my training code to a GitHub repository as a Jupyter Notebook (.ipynb) file. This will give the reviewer a glimpse of my progress and serve as a reference.//

* Hyperparameters
 * Report at least 3 types of hyperparameters, such as learning rate, batch size, hidden size, dropout
* Computational Requirements
 * Report at least 3 types of requirements, such as type of hardware, average runtime per epoch, total number of trials, GPU hours used, number of training epochs
* Training Code

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader

# Convert data to PyTorch tensors
X = torch.from_numpy(X).float()
y = torch.from_numpy(y).long()


In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:

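# Create dataset and data loader from the training split only,
# so that the held-out test samples are not seen during training
dataset = TensorDataset(X_train, y_train)
batch_size = 32
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)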


In [None]:

# Define the 1D CNN model
class CNN1D(nn.Module):
    def __init__(self, num_features, num_classes):
        super(CNN1D, self).__init__()
        self.conv1 = nn.Conv1d(1, 16, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(16, 32, kernel_size=3, padding=1)
        self.pool = nn.MaxPool1d(2)
        self.fc1 = nn.Linear(32 * (num_features // 4), 64)
        self.fc2 = nn.Linear(64, num_classes)

    def forward(self, x):
        x = x.unsqueeze(1)  # Add channel dimension
        x = self.pool(nn.functional.relu(self.conv1(x)))
        x = self.pool(nn.functional.relu(self.conv2(x)))
        x = x.view(x.size(0), -1)
        x = nn.functional.relu(self.fc1(x))
        x = self.fc2(x)
        return x

In [None]:


# Initialize the model
num_features = X.shape[1]
num_classes = len(torch.unique(y))
model = CNN1D(num_features, num_classes)


In [None]:

# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters())

# Train the model
num_epochs = 20
for epoch in range(num_epochs):
    for inputs, targets in dataloader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
    print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

Epoch [1/20], Loss: 3.3160
Epoch [2/20], Loss: 3.3369
Epoch [3/20], Loss: 3.0634
Epoch [4/20], Loss: 3.1332
Epoch [5/20], Loss: 3.1870
Epoch [6/20], Loss: 3.3486
Epoch [7/20], Loss: 3.2953
Epoch [8/20], Loss: 3.0553
Epoch [9/20], Loss: 3.3541
Epoch [10/20], Loss: 3.1441
Epoch [11/20], Loss: 3.2377
Epoch [12/20], Loss: 2.9853
Epoch [13/20], Loss: 3.0908
Epoch [14/20], Loss: 3.2155
Epoch [15/20], Loss: 3.4003
Epoch [16/20], Loss: 3.0812
Epoch [17/20], Loss: 3.2563
Epoch [18/20], Loss: 3.1533
Epoch [19/20], Loss: 3.3396
Epoch [20/20], Loss: 3.1586
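
The values printed above are the loss of the last mini-batch in each epoch, which is noisy. For reference, a uniform guess over 33 classes gives a cross-entropy of ln(33) ≈ 3.50, so losses that stay between roughly 3.0 and 3.4 suggest the network has learned little so far. One way to get a steadier signal is to log the mean loss over all batches, as in the sketch below. The tiny random dataset and the linear model are placeholders, not my actual data or CNN, and `train_one_epoch` is a hypothetical helper name.

```python
import math

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def train_one_epoch(model, loader, criterion, optimizer) -> float:
    """Run one epoch and return the sample-weighted mean training loss."""
    model.train()
    total_loss, total_samples = 0.0, 0
    for inputs, targets in loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * inputs.size(0)
        total_samples += inputs.size(0)
    return total_loss / total_samples


torch.manual_seed(0)
num_classes = 33
# Placeholder data: 64 random samples with 10 features each.
X_demo = torch.randn(64, 10)
y_demo = torch.randint(0, num_classes, (64,))
loader = DataLoader(TensorDataset(X_demo, y_demo), batch_size=16, shuffle=True)

model = nn.Linear(10, num_classes)  # stand-in for CNN1D
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())

chance_loss = math.log(num_classes)  # cross-entropy of a uniform guess
for epoch in range(3):
    mean_loss = train_one_epoch(model, loader, criterion, optimizer)
    print(f"Epoch {epoch + 1}: mean loss {mean_loss:.4f} (chance level {chance_loss:.4f})")
```

Comparing the mean loss against the chance level makes it easier to tell whether a change to the model or the preprocessing actually helps.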


# Others

In [None]:
from MulticoreTSNE import MulticoreTSNE as TSNE
tsne = TSNE(n_jobs=4, n_components=2, verbose = 1)
Y  = tsne.fit_transform(X)

In [None]:
diseasedict = {
    'skin cutaneous melanoma':0, 'thyroid carcinoma':1, 'sarcoma':2,
    'prostate adenocarcinoma':3, 'pheochromocytoma & paraganglioma':4,
    'pancreatic adenocarcinoma':5, 'head & neck squamous cell carcinoma':6,
    'esophageal carcinoma':7, 'colon adenocarcinoma':8,
    'cervical & endocervical cancer':9, 'breast invasive carcinoma':10,
    'bladder urothelial carcinoma':11, 'testicular germ cell tumor':12,
    'kidney papillary cell carcinoma':13, 'kidney clear cell carcinoma':14,
    'acute myeloid leukemia':15, 'rectum adenocarcinoma':16,
    'ovarian serous cystadenocarcinoma':17, 'lung adenocarcinoma':18,
    'liver hepatocellular carcinoma':19,
    'uterine corpus endometrioid carcinoma':20, 'glioblastoma multiforme':21,
    'brain lower grade glioma':22, 'uterine carcinosarcoma':23, 'thymoma':24,
    'stomach adenocarcinoma':25, 'diffuse large B-cell lymphoma':26,
    'lung squamous cell carcinoma':27, 'mesothelioma':28,
    'kidney chromophobe':29, 'uveal melanoma':30, 'cholangiocarcinoma':31,
    'adrenocortical cancer':32
}


In [None]:
keyslist = list(diseasedict.keys())
valueslist = list(diseasedict.values())

cancers = []

for classno in y:
  cancers.append(keyslist[valueslist.index(classno)])

In [None]:
tsne = pd.DataFrame(Y, columns = ["tsne1", "tsne2"])
cancers = pd.DataFrame(cancers, columns = ["cancer"])
tsne = pd.concat([tsne,cancers], axis = 1, sort = False)
tsne = tsne.sort_values(by = "cancer")

In [None]:
!pip install plotly_express

Collecting plotly_express
  Downloading https://files.pythonhosted.org/packages/d4/d6/8a2906f51e073a4be80cab35cfa10e7a34853e60f3ed5304ac470852a08d/plotly_express-0.4.1-py2.py3-none-any.whl
Installing collected packages: plotly-express
Successfully installed plotly-express-0.4.1


In [None]:

import plotly_express as px

figx = px.scatter(
    tsne,
    x="tsne1",
    y="tsne2",
    color="cancer",
    hover_name="cancer",
    width=970,
    height=500,
    template="ggplot2",
    color_discrete_sequence= px.colors.qualitative.Alphabet,
    #facet_col="group_label",
    size_max=0.1,
)

figx.show()

# Evaluation
// This section has not been started yet. Once I complete the model training, I will add the metrics description and the evaluation code.//
* Metrics Description
* Evaluation Code
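
Until the metrics are finalized, the following is only a sketch of what the evaluation code could look like. It assumes the trained model yields integer class predictions and reports overall accuracy, macro-averaged precision, recall and F1, and a confusion matrix using scikit-learn. The helper name `evaluate_predictions` and the toy labels are hypothetical and are not project results.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
)


def evaluate_predictions(y_true, y_pred, n_classes):
    """Return accuracy, macro precision/recall/F1 and the confusion matrix."""
    class_ids = np.arange(n_classes)
    accuracy = accuracy_score(y_true, y_pred)
    # Macro averaging weights every class equally, regardless of its size
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=class_ids, average="macro", zero_division=0
    )
    cm = confusion_matrix(y_true, y_pred, labels=class_ids)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "confusion_matrix": cm,
    }


# Toy example with 3 classes (illustrative values only)
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
metrics = evaluate_predictions(y_true, y_pred, n_classes=3)
print(metrics["accuracy"])
print(metrics["confusion_matrix"])
```

For the actual project, `n_classes` would be 33 and the labels would come from the held-out split rather than from hand-written arrays.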

# Testing


In [None]:
pip install vaex


In [None]:
import pandas as pd
from sklearn.feature_selection import VarianceThreshold, SelectPercentile, f_classif

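# Read the expression matrix in chunks of gene rows; each chunk is transposed
# so that rows are samples and columns are genes.
chunksize = 21000
chunks = []

for chunk in pd.read_csv('HiSeqV2', sep='\t', chunksize=chunksize, iterator=True):
    chunk = chunk.transpose()
    chunks.append(chunk)
    # Only the first chunk (at most 21000 gene rows) is kept
    break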

In [None]:
del df

In [None]:
df = chunks[0]
df.columns = df.iloc[0]

# Dropping the first row
df = df.drop(df.index[0])

# Resetting the index
df = df.reset_index(drop=True)

In [None]:
df

Sample,ARHGEF10L,HIF3A,RNF17,RNF10,RNF11,RNF13,GTF2IP1,REM1,MTVR2,RTN4RL2,...,IL28A,TMEM208,DYNC1H1,EPHA10,TIE1,ZNF718,DMRTA2,THG1L,ZNF716,DMRTA1
0,10.9576,4.8099,0.4657,11.2675,10.1761,10.4769,13.0456,3.2299,0.4657,8.7533,...,0.0,9.1962,14.0874,6.919,9.5257,6.8868,0.0,6.73,0.0,2.266
1,11.0186,5.3847,0.0,11.669,11.398,10.8249,11.5487,3.5408,1.4714,7.9144,...,0.0,10.7013,11.7089,1.4714,9.195,4.4189,0.9157,7.4854,0.0,7.9968
2,9.7106,2.8888,0.4192,11.4903,11.7371,9.9473,10.841,2.5988,0.0,3.9541,...,0.0,9.1294,13.1303,2.8888,7.3414,6.5555,5.2125,8.1145,0.0,3.5176
3,9.6205,7.9642,1.5378,11.8432,11.0531,10.9005,12.4145,4.5366,2.0609,4.1805,...,0.0,9.0347,13.739,3.7751,10.1636,7.2835,0.7088,7.0273,0.7088,5.7114
4,11.6596,8.5622,0.0,11.2677,11.3549,10.8579,13.256,5.9962,0.0,5.357,...,0.0,9.4628,13.7719,7.0142,9.3571,5.8738,2.7468,6.979,0.6034,2.5042
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10454,10.0459,6.6572,0.0,11.3827,10.9459,10.5165,12.5061,3.6214,0.6881,3.1611,...,0.0,8.6877,13.0213,3.2096,8.8636,7.2905,7.7081,6.7946,0.0,7.9031
10455,9.5758,5.8461,0.0,11.6292,10.6314,11.5036,12.5995,5.9237,0.9655,5.556,...,1.1828,8.7461,13.421,7.4089,8.897,5.1799,1.1828,8.6513,0.0,5.1799
10456,9.6575,8.9521,0.4791,11.6766,11.3748,11.3349,12.3318,5.6618,0.4791,4.7755,...,0.0,9.172,12.9206,4.5515,10.9946,6.2817,1.5701,7.3406,0.4791,6.289
10457,11.7589,3.7591,0.0,12.0914,11.5774,10.1702,12.0789,4.3081,0.0,11.0616,...,0.0,10.9194,12.3151,6.7174,9.417,6.3664,0.0,6.9818,5.6546,6.7702


In [None]:
df = df.transpose()

In [None]:
df.transpose()

Sample,ARHGEF10L,HIF3A,RNF17,RNF10,RNF11,RNF13,GTF2IP1,REM1,MTVR2,RTN4RL2,...,IL28A,TMEM208,DYNC1H1,EPHA10,TIE1,ZNF718,DMRTA2,THG1L,ZNF716,DMRTA1
10.9576,10.9576,4.8099,0.4657,11.2675,10.1761,10.4769,13.0456,3.2299,0.4657,8.7533,...,0.0,9.1962,14.0874,6.919,9.5257,6.8868,0.0,6.73,0.0,2.266
11.0186,11.0186,5.3847,0.0,11.669,11.398,10.8249,11.5487,3.5408,1.4714,7.9144,...,0.0,10.7013,11.7089,1.4714,9.195,4.4189,0.9157,7.4854,0.0,7.9968
9.7106,9.7106,2.8888,0.4192,11.4903,11.7371,9.9473,10.841,2.5988,0.0,3.9541,...,0.0,9.1294,13.1303,2.8888,7.3414,6.5555,5.2125,8.1145,0.0,3.5176
9.6205,9.6205,7.9642,1.5378,11.8432,11.0531,10.9005,12.4145,4.5366,2.0609,4.1805,...,0.0,9.0347,13.739,3.7751,10.1636,7.2835,0.7088,7.0273,0.7088,5.7114
11.6596,11.6596,8.5622,0.0,11.2677,11.3549,10.8579,13.256,5.9962,0.0,5.357,...,0.0,9.4628,13.7719,7.0142,9.3571,5.8738,2.7468,6.979,0.6034,2.5042
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10.0459,10.0459,6.6572,0.0,11.3827,10.9459,10.5165,12.5061,3.6214,0.6881,3.1611,...,0.0,8.6877,13.0213,3.2096,8.8636,7.2905,7.7081,6.7946,0.0,7.9031
9.5758,9.5758,5.8461,0.0,11.6292,10.6314,11.5036,12.5995,5.9237,0.9655,5.556,...,1.1828,8.7461,13.421,7.4089,8.897,5.1799,1.1828,8.6513,0.0,5.1799
9.6575,9.6575,8.9521,0.4791,11.6766,11.3748,11.3349,12.3318,5.6618,0.4791,4.7755,...,0.0,9.172,12.9206,4.5515,10.9946,6.2817,1.5701,7.3406,0.4791,6.289
11.7589,11.7589,3.7591,0.0,12.0914,11.5774,10.1702,12.0789,4.3081,0.0,11.0616,...,0.0,10.9194,12.3151,6.7174,9.417,6.3664,0.0,6.9818,5.6546,6.7702


In [None]:
data = 'HiSeqV2'
labels = 'TCGA_phenotype_denseDataOnlyDownload.tsv'
dbPath = 'data.h5'
verbose = False

print('Loading labels ...')
labeldf = pd.read_csv(labels, sep = '\t')


Loading labels ...


In [None]:
labeldf

Unnamed: 0,sample,sample_type_id,sample_type,_primary_disease
0,TCGA-D3-A1QA-07,7.0,Additional Metastatic,skin cutaneous melanoma
1,TCGA-DE-A4MD-06,6.0,Metastatic,thyroid carcinoma
2,TCGA-J8-A3O2-06,6.0,Metastatic,thyroid carcinoma
3,TCGA-J8-A3YH-06,6.0,Metastatic,thyroid carcinoma
4,TCGA-EM-A2P1-06,6.0,Metastatic,thyroid carcinoma
...,...,...,...,...
12799,TCGA-17-Z059-01,,,lung adenocarcinoma
12800,TCGA-17-Z060-01,,,lung adenocarcinoma
12801,TCGA-17-Z061-01,,,lung adenocarcinoma
12802,TCGA-17-Z062-01,,,lung adenocarcinoma


In [None]:
df

Unnamed: 0,ARHGEF10L,10.9576,11.0186,9.7106,9.6205,11.6596,0.7316,9.2845,8.4529,10.2648,...,12.0972,9.048,10.2686,8.9397,10.0416,10.0459,9.5758,9.6575,11.7589,11.525
0,ARHGEF10L,10.9576,11.0186,9.7106,9.6205,11.6596,0.7316,9.2845,8.4529,10.2648,...,12.0972,9.048,10.2686,8.9397,10.0416,10.0459,9.5758,9.6575,11.7589,11.525
1,HIF3A,4.8099,5.3847,2.8888,7.9642,8.5622,1.2147,2.359,3.9888,5.7145,...,3.0069,5.691,2.6968,5.5271,5.701,6.6572,5.8461,8.9521,3.7591,3.9462
2,RNF17,0.4657,0.0,0.4192,1.5378,0.0,0.0,2.7396,0.0,0.0,...,6.828,0.0,0.0,0.0,0.0,0.0,0.0,0.4791,0.0,0.0
3,RNF10,11.2675,11.669,11.4903,11.8432,11.2677,11.7164,12.4102,12.3562,11.6663,...,11.8776,11.964,11.7363,11.7051,11.2921,11.3827,11.6292,11.6766,12.0914,11.8189
4,RNF11,10.1761,11.398,11.7371,11.0531,11.3549,10.4861,11.1902,9.7486,10.8571,...,10.0321,11.2763,10.6688,11.7347,10.9219,10.9459,10.6314,11.3748,11.5774,11.2605
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,ZNF718,6.8868,4.4189,6.5555,7.2835,5.8738,7.1398,5.66,4.9763,5.4129,...,5.4872,5.2001,6.6087,7.551,6.0973,7.2905,5.1799,6.2817,6.3664,4.7616
9996,DMRTA2,0.0,0.9157,5.2125,0.7088,2.7468,0.7316,0.0,0.4466,0.0,...,0.0,0.0,0.0,0.0,0.0,7.7081,1.1828,1.5701,0.0,0.0
9997,THG1L,6.73,7.4854,8.1145,7.0273,6.979,7.694,7.9317,5.8206,7.6446,...,7.6841,8.2363,6.0889,7.4439,7.4721,6.7946,8.6513,7.3406,6.9818,7.3016
9998,ZNF716,0.0,0.0,0.0,0.7088,0.6034,0.0,0.6006,0.0,0.0,...,1.3919,0.0,0.8825,2.6029,0.0,0.0,0.0,0.4791,5.6546,0.0


In [None]:
df

Sample,10.9576,11.0186,9.7106,9.6205,11.6596,0.7316,9.2845,8.4529,10.2648,8.0441,...,12.0972,9.0480,10.2686,8.9397,10.0416,10.0459,9.5758,9.6575,11.7589,11.5250
ARHGEF10L,10.9576,11.0186,9.7106,9.6205,11.6596,0.7316,9.2845,8.4529,10.2648,8.0441,...,12.0972,9.048,10.2686,8.9397,10.0416,10.0459,9.5758,9.6575,11.7589,11.525
HIF3A,4.8099,5.3847,2.8888,7.9642,8.5622,1.2147,2.359,3.9888,5.7145,6.1856,...,3.0069,5.691,2.6968,5.5271,5.701,6.6572,5.8461,8.9521,3.7591,3.9462
RNF17,0.4657,0.0,0.4192,1.5378,0.0,0.0,2.7396,0.0,0.0,0.3369,...,6.828,0.0,0.0,0.0,0.0,0.0,0.0,0.4791,0.0,0.0
RNF10,11.2675,11.669,11.4903,11.8432,11.2677,11.7164,12.4102,12.3562,11.6663,12.109,...,11.8776,11.964,11.7363,11.7051,11.2921,11.3827,11.6292,11.6766,12.0914,11.8189
RNF11,10.1761,11.398,11.7371,11.0531,11.3549,10.4861,11.1902,9.7486,10.8571,11.6214,...,10.0321,11.2763,10.6688,11.7347,10.9219,10.9459,10.6314,11.3748,11.5774,11.2605
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZNF718,6.8868,4.4189,6.5555,7.2835,5.8738,7.1398,5.66,4.9763,5.4129,5.8264,...,5.4872,5.2001,6.6087,7.551,6.0973,7.2905,5.1799,6.2817,6.3664,4.7616
DMRTA2,0.0,0.9157,5.2125,0.7088,2.7468,0.7316,0.0,0.4466,0.0,5.2307,...,0.0,0.0,0.0,0.0,0.0,7.7081,1.1828,1.5701,0.0,0.0
THG1L,6.73,7.4854,8.1145,7.0273,6.979,7.694,7.9317,5.8206,7.6446,5.9665,...,7.6841,8.2363,6.0889,7.4439,7.4721,6.7946,8.6513,7.3406,6.9818,7.3016
ZNF716,0.0,0.0,0.0,0.7088,0.6034,0.0,0.6006,0.0,0.0,0.0,...,1.3919,0.0,0.8825,2.6029,0.0,0.0,0.0,0.4791,5.6546,0.0


In [None]:

print('Housekeeping ...')
df.columns = df.iloc[0]
df = df.drop('Sample', axis = 0)

labeldf = labeldf.set_index('sample')

# dimensions: 10459 x 20530

nTotal = df.shape[0]    #10459
nFeat = df.shape[1]     #20530

print('Total Number of samples: '+ str(nTotal))
print('Features (RNASeq) per sample: ' + str(nFeat))

print('Diseases to predict: ')

diseases = labeldf._primary_disease.unique()

for disease in diseases:
    print(disease)

# Defining Categorical values for each disease

diseasedict = {
    'skin cutaneous melanoma':0, 'thyroid carcinoma':1, 'sarcoma':2,
    'prostate adenocarcinoma':3, 'pheochromocytoma & paraganglioma':4,
    'pancreatic adenocarcinoma':5, 'head & neck squamous cell carcinoma':6,
    'esophageal carcinoma':7, 'colon adenocarcinoma':8,
    'cervical & endocervical cancer':9, 'breast invasive carcinoma':10,
    'bladder urothelial carcinoma':11, 'testicular germ cell tumor':12,
    'kidney papillary cell carcinoma':13, 'kidney clear cell carcinoma':14,
    'acute myeloid leukemia':15, 'rectum adenocarcinoma':16,
    'ovarian serous cystadenocarcinoma':17, 'lung adenocarcinoma':18,
    'liver hepatocellular carcinoma':19,
    'uterine corpus endometrioid carcinoma':20, 'glioblastoma multiforme':21,
    'brain lower grade glioma':22, 'uterine carcinosarcoma':23, 'thymoma':24,
    'stomach adenocarcinoma':25, 'diffuse large B-cell lymphoma':26,
    'lung squamous cell carcinoma':27, 'mesothelioma':28,
    'kidney chromophobe':29, 'uveal melanoma':30, 'cholangiocarcinoma':31,
    'adrenocortical cancer':32
}


Housekeeping ...


KeyError: "['Sample'] not found in axis"
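
The `KeyError` above is most likely raised because `Sample` is, at this point, the name of the index rather than a row label, so `df.drop('Sample', axis = 0)` finds nothing to drop. One way around this, sketched here on a tiny made-up table in the same layout as `HiSeqV2` (genes as rows, samples as columns), is to read the gene column as the index and transpose once. The gene names, sample IDs and values below are invented for illustration.

```python
import io

import pandas as pd

# Tiny stand-in for HiSeqV2: tab-separated, first column 'Sample' holds the
# gene names, remaining columns are sample IDs (all values are made up).
raw = io.StringIO(
    "Sample\tTCGA-AA-0001-01\tTCGA-AA-0002-01\tTCGA-AA-0003-11\n"
    "GENE_A\t1.0\t2.0\t3.0\n"
    "GENE_B\t4.0\t5.0\t6.0\n"
)

# index_col=0 makes the gene names the index, so no header row has to be
# promoted or dropped afterwards
genes_by_samples = pd.read_csv(raw, sep="\t", index_col=0)

# Transpose once: rows become samples, columns become genes
df_demo = genes_by_samples.T
df_demo.index.name = "sample"
df_demo.columns.name = None

n_total, n_feat = df_demo.shape
print("Total number of samples:", n_total)
print("Features (RNASeq) per sample:", n_feat)
```

With the real file, the same two steps would be expected to give the 10459 x 20530 layout noted in the cell above, with sample IDs as the index, which is also what the later lookup `labeldf.loc[index]` relies on.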

In [None]:
pip install progressbar2



In [None]:
import h5py
import numpy as np
import progressbar

print('Creating Database File at : ' + dbPath)

# A context manager closes the file even if the loop raises. A handle left open
# by an earlier run makes mode='w' fail with "unable to truncate a file which
# is already open".
with h5py.File(dbPath, mode = 'w') as db:
    print('Setting up Database')
    db.create_dataset("name", (nTotal,), np.dtype('|S16'))
    db.create_dataset("RNASeq", (nTotal, nFeat), np.float32)
    db.create_dataset("label", (nTotal,), np.uint8)

    idx = 0

    print('Writing ' + str(nTotal) + ' samples to Dataset')

    for index, row in progressbar.progressbar(df.iterrows(), redirect_stdout=True):
        try:
            # Look up the phenotype row for this sample ID
            sample_info = labeldf.loc[index]
            disease = sample_info['_primary_disease']
            if verbose:
                print('Processing '+ str(idx) + ' of ' + str(nTotal) + ' : ' + index + '\t disease: \t' + str(disease))
            db["name"][idx] = np.asarray(index, dtype = np.dtype('|S16'))
            db["RNASeq"][idx] = np.asarray(row, dtype = np.float32)
            db["label"][idx] = np.uint8(diseasedict[disease])
            idx = idx + 1
        except KeyError:
            # The sample has no phenotype row, or its disease is not in diseasedict
            print("Error: Cannot find label")
            continue

    print('Closing Database ..')
print('Complete!')

Creating Database File at : data.h5


OSError: Unable to synchronously create file (unable to truncate a file which is already open)

In [None]:

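# After the per-chunk transpose, rows are samples and columns are genes
data = pd.concat(chunks, axis=1)
data.columns = data.iloc[0]            # the first row holds the gene names
data = data.iloc[1:].astype('float32')

# Load the labels (one cancer type per sample, in the same row order as `data`)
labels = pd.read_csv('labels.csv', sep='\t')['cancer_type']

# Step 1: Remove zero-variance features (VarianceThreshold defaults to threshold=0)
variance_selector = VarianceThreshold()
variance_selector.fit(data)
data = data.loc[:, variance_selector.get_support()]

# Step 2: Select roughly the top 7091 features based on ANOVA F-value
percentile = min(100.0, 100 * 7091 / data.shape[1])
selector = SelectPercentile(f_classif, percentile=percentile)
selected_features = selector.fit(data, labels).get_support(indices=True)

# Filter the data to include only the selected feature columns
filtered_data = data.iloc[:, selected_features]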

In [None]:
labeldf.info()

<class 'pandas.core.frame.DataFrame'>
Index: 12804 entries, TCGA-D3-A1QA-07 to TCGA-02-0002-01
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   sample_type_id    12732 non-null  float64
 1   sample_type       12732 non-null  object 
 2   _primary_disease  12804 non-null  object 
dtypes: float64(1), object(2)
memory usage: 400.1+ KB


In [None]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import LabelEncoder

# Assuming you already have the dataframes loaded as 'df' and 'labeldf'

# Encode the labels
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(labeldf['_primary_disease'])

# Create a Random Forest classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Fit the classifier and get the feature importances
clf.fit(df, y)
feature_importances = clf.feature_importances_


ValueError: Found input variables with inconsistent numbers of samples: [10459, 12804]
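
The `ValueError` above reports 10459 rows in `df` against 12804 labels in `labeldf`, so the label table contains samples that have no expression profile. A sketch of one way to align the two before fitting, assuming both tables are indexed by sample ID, is shown below on made-up data. `align_features_and_labels` is a hypothetical helper and is not part of the original repository.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder


def align_features_and_labels(features, labels, label_column):
    """Keep only samples present in both tables, in the same order."""
    shared = features.index.intersection(labels.index)
    return features.loc[shared], labels.loc[shared, label_column]


# Made-up expression matrix: 4 samples x 5 genes
rng = np.random.default_rng(0)
expr = pd.DataFrame(
    rng.random((4, 5)),
    index=["s1", "s2", "s3", "s4"],
    columns=[f"gene{i}" for i in range(5)],
)

# Made-up label table with extra samples (s8, s9) and a different row order
lab = pd.DataFrame(
    {"_primary_disease": ["a", "b", "a", "b", "a", "b"]},
    index=["s4", "s9", "s2", "s1", "s3", "s8"],
)

X, disease = align_features_and_labels(expr, lab, "_primary_disease")
y_demo = LabelEncoder().fit_transform(disease)

# Both inputs now have the same number of rows, so fitting no longer fails
clf_demo = RandomForestClassifier(n_estimators=10, random_state=42)
clf_demo.fit(X, y_demo)
importances = clf_demo.feature_importances_
print(X.shape, y_demo.shape, importances.shape)
```

This sketch assumes sample IDs are unique in the label table; duplicated IDs would need to be resolved before the lookup.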

In [None]:
labeldf

NameError: name 'labeldf' is not defined

# Results:

// Status: not yet started. Once model training is complete, I will run the evaluation and update the results for each of the points below. //

* Table of results (no need to include additional experiments, but main reproducibility result should be included)
* All claims should be supported by experiment results
* Discuss with respect to the hypothesis and results from the original paper
* Experiments beyond the original paper
 Credits for each experiment depend on how hard it is to run the experiments. Each experiment should include results and discussion
* Ablation Study


# Discussion:

// This section is incomplete because the project is still in progress. I will keep updating it as the work moves toward completion. //

Implications of the experimental results, whether the original paper was reproducible, and if it wasn’t, what factors made it irreproducible.

* “What was easy”

 The GitHub repository referenced in the paper helped me outline a preliminary scope for my predictive model, which aims to identify 33 cancer types.

* “What was difficult”

 So far, I have completed the data preprocessing and am now working on training the model. During the project proposal phase, the literature review suggested that obtaining normal tissue samples from the dataset would be straightforward. During the actual data download, however, it proved nearly impossible to locate such samples. I therefore decided to focus on a prediction model for the 33 cancer types only.
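
 One avenue that may be worth checking, sketched here under assumptions: in TCGA barcodes the two-digit sample-type code 11 denotes "Solid Tissue Normal", and the phenotype table loaded in this notebook already carries `sample_type_id` and `sample_type` columns. If that coding holds for this download, normal samples could be separated with a filter like the one below. The rows in the example are made up, and `split_normal_samples` is a hypothetical helper.

```python
import pandas as pd

# Assumption: TCGA sample-type code for "Solid Tissue Normal"
NORMAL_SAMPLE_TYPE_ID = 11


def split_normal_samples(labels):
    """Split a phenotype table into normal-tissue rows and all other rows."""
    is_normal = labels["sample_type_id"] == NORMAL_SAMPLE_TYPE_ID
    return labels[is_normal], labels[~is_normal]


# Made-up rows in the same layout as the phenotype table
labels_demo = pd.DataFrame(
    {
        "sample": [
            "TCGA-XX-0001-01",
            "TCGA-XX-0002-11",
            "TCGA-XX-0003-06",
            "TCGA-XX-0004-01",
        ],
        "sample_type_id": [1.0, 11.0, 6.0, None],
        "sample_type": ["Primary Tumor", "Solid Tissue Normal", "Metastatic", None],
        "_primary_disease": [
            "lung adenocarcinoma",
            "lung adenocarcinoma",
            "thyroid carcinoma",
            "lung adenocarcinoma",
        ],
    }
).set_index("sample")

normal, other = split_normal_samples(labels_demo)
print(len(normal), "normal sample(s);", len(other), "other sample(s)")
```

 Rows with a missing `sample_type_id` fall into the second group here, so they would need a separate check before being treated as tumor samples.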
* Recommendations to the original authors or others who work in this area for improving reproducibility

 I am currently discussing with the TA what content is expected in this part of the write-up.

* Public GitHub Repo

* Publish your code in a public repository on GitHub and attach the URL in the notebook.

 https://github.com/anjaligang/DLH_Project_anjali9.git

* Make sure your code is documented properly

 * A README.md file describing the exact steps to run your code is required.

 * Check “ML Code Completeness Checklist” (https://github.com/paperswithcode/releasing-research-code)

 * Check “Best Practices for Reproducibility” (https://www.cs.mcgill.ca/~ksinha4/practices_for_reproducibility/)

 The checklist mentioned above has been reviewed and is currently being implemented in my ongoing development work.
