# Predicting Cancer Types Using Machine Learning

## 1. Data Import

In [58]:
import pandas as pd

train = pd.read_csv('data/data_train.csv')
test = pd.read_csv('data/data_test.csv')
actual = pd.read_csv('data/actual.csv')

## Data Description

The dataset contains gene expression data in RES format, a file format documented by GenePattern ([more about RES format here](https://www.genepattern.org/file-formats-guide#RES)). It covers 7129 genes, one per row; the columns hold the samples.

For each gene, the numerical entries denote its expression levels within a given sample. Additionally, an accompanying `call` column indicates whether the gene is classified as Absent (A), Marginal (M), or Present (P) in that particular sample.
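
The column layout described above can be illustrated on a tiny, made-up frame. This is only a sketch: the gene names and values are hypothetical, and it assumes the naming pattern visible in this notebook's output (sample numbers for expression columns, `call`, `call.1`, ... for call columns).

```python
import pandas as pd

# Tiny hypothetical frame mimicking the layout of data_train.csv (values are made up).
toy = pd.DataFrame({
    "Gene Description": ["gene A", "gene B", "gene C"],
    "Gene Accession Number": ["G1_at", "G2_at", "G3_at"],
    "1": [-214, 88, 15],
    "call": ["A", "P", "M"],
    "2": [-139, 283, 7],
    "call.1": ["A", "P", "A"],
})

# Expression columns are the ones whose name is a sample number.
sample_cols = [c for c in toy.columns if c.isdigit()]
# Call columns are named 'call', 'call.1', 'call.2', ...
call_cols = [c for c in toy.columns if c.startswith("call")]

indexed = toy.set_index("Gene Accession Number")
expression = indexed[sample_cols]
calls = indexed[call_cols]

# Count how often each call (A, M, P) occurs across all genes and samples.
call_counts = calls.stack().value_counts()
print(sample_cols)
print(call_counts.to_dict())
```

Matching columns by name pattern, rather than by position, keeps the split independent of the order in which the samples appear in the file.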

The dataset is divided into two files:

- `train`: Contains data from 38 samples.
- `test`: Contains data from 34 samples.

Together, the two files contain 72 samples.

### Cancer Types

The dataset focuses on two types of cancer:

1. **Acute Myeloid Leukemia (AML):** AML affects myeloid cells, which are responsible for generating certain types of white blood cells.

2. **Acute Lymphocytic Leukemia (ALL):** ALL is a form of cancer that impacts lymphocytes, a crucial type of white blood cell involved in the immune response. ([source](https://www.healthline.com/health/leukemia/aml-vs-all))

### Patient Information

The `actual` file provides information about individual patients, including their unique identifiers and the specific type of cancer they have been diagnosed with (AML or ALL).


In [59]:
print(train.shape, test.shape)

(7129, 78) (7129, 70)
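
The column counts follow from the file layout: two descriptive columns (`Gene Description` and `Gene Accession Number`) plus one expression column and one `call` column per sample. A quick arithmetic check, with `expected_columns` as a helper name introduced only for illustration:

```python
n_descriptive_cols = 2  # 'Gene Description' and 'Gene Accession Number'
cols_per_sample = 2     # one expression column plus one call column


def expected_columns(n_samples):
    """Number of columns expected for a file holding n_samples samples."""
    return n_descriptive_cols + cols_per_sample * n_samples


print(expected_columns(38), expected_columns(34))  # prints: 78 70
```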


In [60]:
train.head()

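| | Gene Description | Gene Accession Number | 1 | call | 2 | call.1 | 3 | call.2 | 4 | call.3 | ... | 29 | call.33 | 30 | call.34 | 31 | call.35 | 32 | call.36 | 33 | call.37 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AFFX-BioB-5_at (endogenous control) | AFFX-BioB-5_at | -214 | A | -139 | A | -76 | A | -135 | A | ... | 15 | A | -318 | A | -32 | A | -124 | A | -135 | A |
| 1 | AFFX-BioB-M_at (endogenous control) | AFFX-BioB-M_at | -153 | A | -73 | A | -49 | A | -114 | A | ... | -114 | A | -192 | A | -49 | A | -79 | A | -186 | A |
| 2 | AFFX-BioB-3_at (endogenous control) | AFFX-BioB-3_at | -58 | A | -1 | A | -307 | A | 265 | A | ... | 2 | A | -95 | A | 49 | A | -37 | A | -70 | A |
| 3 | AFFX-BioC-5_at (endogenous control) | AFFX-BioC-5_at | 88 | A | 283 | A | 309 | A | 12 | A | ... | 193 | A | 312 | A | 230 | P | 330 | A | 337 | A |
| 4 | AFFX-BioC-3_at (endogenous control) | AFFX-BioC-3_at | -295 | A | -264 | A | -376 | A | -419 | A | ... | -51 | A | -139 | A | -367 | A | -188 | A | -407 | A |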


In [61]:
actual.head()

| | patient | cancer |
|---|---|---|
| 0 | 1 | ALL |
| 1 | 2 | ALL |
| 2 | 3 | ALL |
| 3 | 4 | ALL |
| 4 | 5 | ALL |


In [62]:
actual.cancer.unique()

array(['ALL', 'AML'], dtype=object)

## 2. Data Manipulation

In [63]:
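# keep only the gene identifier and the expression columns for samples 1-38;
# the call columns are not needed. Selecting by name also puts the samples
# in numeric order, which the positional label assignment below relies on.
required_train_columns = ['Gene Accession Number'] + [str(i) for i in range(1, 39)]
# transpose so that rows are samples and columns are genes
train = train[required_train_columns].set_index('Gene Accession Number').transpose()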

In [64]:
# keep only the gene identifier and the expression columns for samples 39-72;
# the call columns are not needed
required_test_columns = ['Gene Accession Number'] + [str(i) for i in range(39, 73)]
# transpose so that rows are samples and columns are genes
test = test[required_test_columns].set_index('Gene Accession Number').transpose()

In [65]:
# adding the target value column, i.e., cancer type from 'actual' file
train['target'] = list(actual.cancer.iloc[:38])

test['target'] = list(actual.cancer.iloc[38:])
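
The assignment above is positional, so it relies on the rows of `train` and `test` being in the same order as the rows of `actual`. A more defensive variant looks up each label by patient ID. The sketch below uses small hypothetical frames (`toy_expr`, `toy_actual`) rather than the real data, and assumes the sample IDs in the index are numeric strings, as they are after the transpose:

```python
import pandas as pd

# Hypothetical miniature versions of the transposed expression frame and 'actual'.
toy_expr = pd.DataFrame(
    {"G1_at": [10, 20, 30], "G2_at": [1, 2, 3]},
    index=["3", "1", "2"],  # sample IDs as strings, deliberately out of order
)
toy_actual = pd.DataFrame({"patient": [1, 2, 3], "cancer": ["ALL", "AML", "ALL"]})

# Look up each sample's label by patient ID instead of by row position.
labels = toy_actual.set_index("patient")["cancer"]
toy_expr["target"] = labels.loc[toy_expr.index.astype(int)].to_numpy()
print(toy_expr)
```

With this lookup, a shuffled or reordered expression frame still receives the correct labels.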

In [66]:
# separate the features (X) from the labels (y) for use in the models
X_train = train.drop(columns='target')
y_train = train.target
X_test = test.drop(columns='target')
y_test = test.target

## 3. Model Training

### SVM Model (with different kernels)

In [133]:
from sklearn.svm import SVC

# radial basis function (rbf) kernel
svcrbf = SVC(kernel='rbf', C=10)
svcrbf.fit(X_train, y_train)

print('Accuracy: ', svcrbf.score(X_test, y_test))

Accuracy:  0.9705882352941176


In [122]:
# linear kernel
svclin = SVC(kernel='linear')
svclin.fit(X_train, y_train)

print('Accuracy: ', svclin.score(X_test, y_test))

Accuracy:  0.9705882352941176


In [134]:
# polynomial kernel
svcpoly = SVC(kernel='poly', C=10)
svcpoly.fit(X_train, y_train)

print('Accuracy: ', svcpoly.score(X_test, y_test))

Accuracy:  0.9705882352941176


In [145]:
# sigmoid kernel
svcsig = SVC(kernel='sigmoid', C=10)
svcsig.fit(X_train, y_train)

print('Accuracy: ', svcsig.score(X_test, y_test))

Accuracy:  0.9117647058823529
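
With only 34 test samples, a single held-out score per kernel is a coarse comparison. One possible complement is cross-validation on the training data. The sketch below runs on synthetic data from `make_classification`, not on the leukemia data, and adds two assumptions that are not part of the cells above: features are standardized with `StandardScaler`, and `C=10` is used for every kernel, including the linear one.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the expression matrix (NOT the leukemia data):
# 38 samples, 200 features, two classes.
X_demo, y_demo = make_classification(
    n_samples=38, n_features=200, n_informative=10, random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = {}
for kernel in ["rbf", "linear", "poly", "sigmoid"]:
    # Scaling is an added assumption; the cells above fit SVC on raw values.
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=10))
    cv_scores[kernel] = cross_val_score(model, X_demo, y_demo, cv=cv).mean()

for kernel, score in cv_scores.items():
    print(f"{kernel}: mean CV accuracy = {score:.3f}")
```

Putting the scaler inside the pipeline means it is refit on each training fold, so no information from the validation fold leaks into the preprocessing.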


### Random Forest Model

In [94]:
from sklearn.ensemble import RandomForestClassifier

rfmodel = RandomForestClassifier(random_state=42)
rfmodel.fit(X_train, y_train)

print('Accuracy: ', rfmodel.score(X_test, y_test))

Accuracy:  0.8529411764705882
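
A fitted random forest also exposes `feature_importances_`, which could be used to see which genes the model relies on most. The sketch below uses synthetic data and hypothetical gene names (`gene_0`, `gene_1`, ...), so its ranking says nothing about the leukemia dataset:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (NOT the leukemia data) with hypothetical gene names.
X_arr, y_demo = make_classification(
    n_samples=38, n_features=50, n_informative=5, random_state=0
)
gene_names = [f"gene_{i}" for i in range(X_arr.shape[1])]
X_demo = pd.DataFrame(X_arr, columns=gene_names)

rf_demo = RandomForestClassifier(random_state=42)
rf_demo.fit(X_demo, y_demo)

# Impurity-based importances, one value per feature, summing to 1.
importances = pd.Series(rf_demo.feature_importances_, index=gene_names)
top_genes = importances.sort_values(ascending=False).head(5)
print(top_genes)
```

On the real data the same lines would apply to `rfmodel` and `X_train.columns`, under the assumption that impurity-based importances are a suitable ranking for this many correlated features.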


### Neural Network Classification Model

In [149]:
from sklearn.neural_network import MLPClassifier

nnmodel = MLPClassifier(random_state=42)
nnmodel.fit(X_train, y_train)

print('Accuracy: ', nnmodel.score(X_test, y_test))

Accuracy:  0.9411764705882353
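
Because the test set has only 34 samples, each reported accuracy corresponds to a small whole number of misclassified samples. The conversion below uses the accuracies printed above; the dictionary keys are labels chosen here for readability:

```python
n_test = 34  # number of test samples, as stated in the data description

# Accuracies copied from the outputs above.
reported_accuracy = {
    "SVM (rbf)": 0.9705882352941176,
    "SVM (linear)": 0.9705882352941176,
    "SVM (poly)": 0.9705882352941176,
    "SVM (sigmoid)": 0.9117647058823529,
    "Random Forest": 0.8529411764705882,
    "Neural Network (MLP)": 0.9411764705882353,
}

# Convert each accuracy into the number of misclassified test samples.
misclassified = {
    name: round((1 - acc) * n_test) for name, acc in reported_accuracy.items()
}

for name, errors in misclassified.items():
    print(f"{name}: {errors} of {n_test} test samples misclassified")
```

The three best SVM kernels each miss one sample, the neural network two, the sigmoid kernel three, and the random forest five, so the gaps between models amount to a handful of samples.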
