# Predicting Cancer Types Using Machine Learning

## 1. Data Import

In [1]:
# importing necessary libraries and functions
import pandas as pd
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

import warnings
warnings.filterwarnings("ignore")

train = pd.read_csv('data/data_train.csv')
test = pd.read_csv('data/data_test.csv')
actual = pd.read_csv('data/actual.csv')

## Data Description

The dataset comprises gene expression data stored in RES format, a GenePattern file format for expression data ([more about RES format here](https://www.genepattern.org/file-formats-guide#RES)). Each file has 7129 rows, one per gene, and its columns hold the samples.

For each gene, the numerical entries denote its expression levels within a given sample. Additionally, an accompanying `call` column indicates whether the gene is classified as Absent (A), Marginal (M), or Present (P) in that particular sample.

The dataset is divided into two files:

- `train`: Contains data from 38 samples.
- `test`: Contains data from 34 samples.

This totals to 72 samples in the entire dataset.
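The layout described above, with each sample's expression column followed by its `call` column, can be illustrated on a toy table. This is a minimal sketch: the gene names and values are made up, and the `call`, `call.1` naming is assumed to follow what pandas produces for repeated column headers.

```python
import pandas as pd

# Toy stand-in for the real file (hypothetical values):
# 3 genes, 2 samples, each sample column followed by its call column.
toy = pd.DataFrame({
    "Gene Description": ["gene a", "gene b", "gene c"],
    "Gene Accession Number": ["G1_at", "G2_at", "G3_at"],
    "1": [-214, -153, 88],
    "call": ["A", "A", "P"],
    "2": [-139, -73, 283],
    "call.1": ["A", "M", "P"],
})

# Sample columns are the ones whose names are purely numeric.
sample_cols = [c for c in toy.columns if c.isdigit()]
# Call columns are the ones named 'call', 'call.1', ...
call_cols = [c for c in toy.columns if c.startswith("call")]

indexed = toy.set_index("Gene Accession Number")
expression = indexed[sample_cols]  # numeric expression levels
calls = indexed[call_cols]         # A / M / P flags

print(expression.shape)  # (3, 2): genes as rows, samples as columns
```

Selecting the sample columns by name pattern avoids hard-coding how many samples a file contains.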

### Cancer Types

The dataset focuses on two types of cancer:

1. **Acute Myeloid Leukemia (AML):** AML affects myeloid cells, which are responsible for generating certain types of white blood cells.

2. **Acute Lymphocytic Leukemia (ALL):** ALL is a form of cancer that impacts lymphocytes, a crucial type of white blood cell involved in the immune response. ([source](https://www.healthline.com/health/leukemia/aml-vs-all))

### Patient Information

The `actual` file provides information about individual patients, including their unique identifiers and the specific type of cancer they have been diagnosed with (AML or ALL).


In [2]:
print(train.shape, test.shape)

(7129, 78) (7129, 70)


In [3]:
train.head()

Unnamed: 0,Gene Description,Gene Accession Number,1,call,2,call.1,3,call.2,4,call.3,...,29,call.33,30,call.34,31,call.35,32,call.36,33,call.37
0,AFFX-BioB-5_at (endogenous control),AFFX-BioB-5_at,-214,A,-139,A,-76,A,-135,A,...,15,A,-318,A,-32,A,-124,A,-135,A
1,AFFX-BioB-M_at (endogenous control),AFFX-BioB-M_at,-153,A,-73,A,-49,A,-114,A,...,-114,A,-192,A,-49,A,-79,A,-186,A
2,AFFX-BioB-3_at (endogenous control),AFFX-BioB-3_at,-58,A,-1,A,-307,A,265,A,...,2,A,-95,A,49,A,-37,A,-70,A
3,AFFX-BioC-5_at (endogenous control),AFFX-BioC-5_at,88,A,283,A,309,A,12,A,...,193,A,312,A,230,P,330,A,337,A
4,AFFX-BioC-3_at (endogenous control),AFFX-BioC-3_at,-295,A,-264,A,-376,A,-419,A,...,-51,A,-139,A,-367,A,-188,A,-407,A


In [4]:
actual.head()

Unnamed: 0,patient,cancer
0,1,ALL
1,2,ALL
2,3,ALL
3,4,ALL
4,5,ALL


In [5]:
actual.cancer.unique()

array(['ALL', 'AML'], dtype=object)

## 2. Data Manipulation

In [6]:
# removing the call columns as they are not required
required_train_columns = ['Gene Accession Number']
for i in range(1,39):
    required_train_columns.append(str(i))
# transposing the dataframe to have rows as samples
train = train[required_train_columns].set_index('Gene Accession Number').transpose()

In [7]:
# removing the call columns as they are not required
required_test_columns = ['Gene Accession Number']
for i in range(39,73):
    required_test_columns.append(str(i))
# transposing the dataframe to have rows as samples    
test = test[required_test_columns].set_index('Gene Accession Number').transpose()

In [8]:
# adding the target value column, i.e., cancer type from 'actual' file
train['target'] = list(actual.cancer.iloc[:38])

test['target'] = list(actual.cancer.iloc[38:])

In [9]:
# defining train and test sets clearly to easily use in models
X_train = train[train.columns[:-1]]
y_train = train.target
X_test = test[test.columns[:-1]]
y_test = test.target
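The cells above attach labels by position, which works because the sample columns were selected in patient order. As an alternative, labels can be looked up by patient ID. The sketch below uses toy stand-ins for the feature table and the label file; all values are hypothetical.

```python
import pandas as pd

# Toy stand-ins (hypothetical values) for the transposed feature table
# and the label file.
features = pd.DataFrame(
    {"G1_at": [10, 20, 30], "G2_at": [1, 2, 3]},
    index=["2", "1", "3"],  # sample IDs as strings, deliberately out of order
)
labels = pd.DataFrame({"patient": [1, 2, 3], "cancer": ["ALL", "AML", "ALL"]})

# Look labels up by patient ID instead of relying on row position.
label_by_patient = labels.set_index("patient")["cancer"]
target = label_by_patient.loc[features.index.astype(int)].to_numpy()

X = features
y = pd.Series(target, index=features.index, name="target")
print(y.to_dict())
```

With this lookup, reordering the sample columns cannot silently misalign features and labels.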

## 3. Model Training

### SVM Model (with different kernels)

In [10]:
# radial basis function (rbf) kernel
svcrbf = SVC(kernel='rbf', C=10)
svcrbf.fit(X_train, y_train)

print('Accuracy: ', svcrbf.score(X_test, y_test))

Accuracy:  0.9705882352941176


In [11]:
# linear kernel
svclin = SVC(kernel='linear')
svclin.fit(X_train, y_train)

print('Accuracy: ', svclin.score(X_test, y_test))

Accuracy:  0.9705882352941176


In [12]:
# polynomial kernel
svcpoly = SVC(kernel='poly', C=10)
svcpoly.fit(X_train, y_train)

print('Accuracy: ', svcpoly.score(X_test, y_test))

Accuracy:  0.9705882352941176


In [13]:
# sigmoid kernel
svcsig = SVC(kernel='sigmoid', C=10)
svcsig.fit(X_train, y_train)

print('Accuracy: ', svcsig.score(X_test, y_test))

Accuracy:  0.9117647058823529


### Random Forest Model

In [14]:
rfmodel = RandomForestClassifier(random_state=42)
rfmodel.fit(X_train, y_train)

print('Accuracy: ', rfmodel.score(X_test, y_test))

Accuracy:  0.8529411764705882


### Neural Network Classification Model

In [15]:
# Define the parameter grid for grid search
param_grid = {
    'activation': ['relu', 'tanh'],
    'hidden_layer_sizes': [(50,), (100,), (50, 50), (100, 50)],
    'alpha' : [0.0001, 0.001, 0.01],
    'learning_rate' : ['constant', 'invscaling', 'adaptive']
}

# Initialize the MLPClassifier
nnmodel = MLPClassifier(random_state=42, early_stopping=True)

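# Initialize GridSearchCV
# ('accuracy' is the scorer name; the accuracy_score function itself does not
# have the (estimator, X, y) signature that the scoring argument expects)
grid_search = GridSearchCV(nnmodel, param_grid, cv=3, n_jobs=-1, scoring='accuracy')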

# Perform grid search
grid_search.fit(X_train, y_train)

# Get the best parameters and the best estimator
best_params = grid_search.best_params_
best_estimator = grid_search.best_estimator_

# Evaluate the model on the test set
accuracy = best_estimator.score(X_test, y_test)

print(f"Best Parameters: {best_params}")
print(f"Test Accuracy: {accuracy}")

Best Parameters: {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (50,), 'learning_rate': 'constant'}
Test Accuracy: 0.9411764705882353


## 4. Result Storage

In [16]:
# # Uncomment the lines below to store each model's predictions on the test data
# # The generated outputs can be found in the 'output' folder [note that the following code writes to the current directory, not to that folder]
# pd.DataFrame({'target':y_test,'prediction':svcrbf.predict(X_test)}).to_csv('svm_rbf_preds.csv')
# pd.DataFrame({'target':y_test,'prediction':svclin.predict(X_test)}).to_csv('svm_lin_preds.csv')
# pd.DataFrame({'target':y_test,'prediction':svcpoly.predict(X_test)}).to_csv('svm_poly_preds.csv')
# pd.DataFrame({'target':y_test,'prediction':svcsig.predict(X_test)}).to_csv('svm_sig_preds.csv')
# pd.DataFrame({'target':y_test,'prediction':rfmodel.predict(X_test)}).to_csv('random_forest_preds.csv')
# # nnmodel itself is never fitted (GridSearchCV fits clones), so use the refitted best estimator
# pd.DataFrame({'target':y_test,'prediction':best_estimator.predict(X_test)}).to_csv('neural_networks_preds.csv')
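The note above says the commented-out lines do not write into the 'output' folder. One way to do that is sketched here. The helper name `save_predictions`, the folder `output_demo`, and the toy labels are hypothetical and only serve the illustration.

```python
from pathlib import Path

import pandas as pd


def save_predictions(y_true, y_pred, filename, out_dir="output"):
    """Write true labels and predictions side by side to out_dir/filename."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    target = out_path / filename
    frame = pd.DataFrame({"target": list(y_true), "prediction": list(y_pred)})
    frame.to_csv(target, index=False)
    return target


# Example with toy labels (hypothetical values).
path = save_predictions(
    ["ALL", "AML"], ["ALL", "ALL"], "demo_preds.csv", out_dir="output_demo"
)
print(path)
```

In the notebook, the same helper could be called with `y_test` and the output of any fitted model's `predict(X_test)`.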

## SVM's role in cancer type prediction

In our study, we used Support Vector Machines (SVMs) to predict cancer types from the RES-format gene expression data. An SVM looks for the decision boundary that separates the classes, here AML and ALL, with the largest margin.

We evaluated four kernels:

- **Linear Kernel**: Effective for linearly separable data.
- **RBF Kernel**: Can capture non-linear relationships in genetic data.
- **Polynomial Kernel**: Represents polynomial decision boundaries.
- **Sigmoid Kernel**: Builds the decision function from a hyperbolic tangent of the inner product between samples.

Each kernel was evaluated once on the 34-sample test set. The linear, RBF, and polynomial kernels each achieved an accuracy of approximately 97.06%, which suggests that SVMs can distinguish AML from ALL cases well on this gene data.

The sigmoid kernel achieved a lower accuracy of approximately 91.18%, which on 34 samples corresponds to two additional misclassified samples. For our specific dataset and problem, the sigmoid kernel appears less well-suited than the others, which illustrates why the kernel should be chosen to match the data's characteristics.
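The kernel comparison can also be written as a single loop. The sketch below runs on synthetic data from `make_classification`, not on the gene dataset, so its accuracies say nothing about the results reported here. It also uses `C=10` for every kernel, whereas the notebook left the linear kernel at its default `C`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the gene data: few samples, many features.
X, y = make_classification(
    n_samples=72, n_features=200, n_informative=10, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=34, stratify=y, random_state=0
)

# Fit one SVC per kernel and record its test accuracy.
results = {}
for kernel in ["linear", "rbf", "poly", "sigmoid"]:
    model = SVC(kernel=kernel, C=10)
    model.fit(X_tr, y_tr)
    results[kernel] = model.score(X_te, y_te)

for kernel, acc in results.items():
    print(f"{kernel:8s} {acc:.4f}")
```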

### Handling High-Dimensional Data

SVMs are often a reasonable choice when features far outnumber samples, as in our study with 7129 features and 38 training samples. Several properties are relevant here:

1. **Decision Boundaries in High Dimensions**: SVMs search for a separating boundary directly in the high-dimensional feature space, so no explicit feature selection step was applied before training.

2. **Margin Maximization**: They maximize the margin between classes, which tends to improve generalization to new data points.

3. **Kernel Trick for Complexity**: Non-linear kernels let the model represent non-linear boundaries without constructing the transformed features explicitly.

4. **Identifying Important Features**: With a linear kernel, the magnitude of the learned weights can hint at which genes influence the decision most; we did not carry out this analysis here.

5. **Effective Generalization**: Despite the high dimensionality, the SVMs generalized well to the 34 unseen test samples.
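The idea of reading feature weights from a linear SVM can be sketched as follows. The data is synthetic and the ranking is only illustrative. It relies on `coef_`, which scikit-learn's `SVC` exposes only for the linear kernel.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in: 60 samples, 50 features.
X, y = make_classification(
    n_samples=60, n_features=50, n_informative=5, n_redundant=0,
    random_state=0,
)

model = SVC(kernel="linear")
model.fit(X, y)

# For a two-class problem coef_ has shape (1, n_features).
weights = np.abs(model.coef_).ravel()

# Indices of the five features with the largest absolute weight.
top = np.argsort(weights)[::-1][:5]
print(top)
```

Weight magnitudes depend on feature scale, so a ranking like this is easier to interpret when the features have been standardized first.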

# Neural Network Classification Analysis

In our study, we used a neural network classifier (scikit-learn's `MLPClassifier`) to predict cancer types from gene data. This approach is useful for the following reasons:

- **Complex Relationship Modeling**: Neural networks excel at capturing intricate, non-linear relationships in high-dimensional genetic data. This uncovers subtle patterns not easily discernible with traditional methods.

- **Feature Extraction and Abstraction**: Neural networks automatically extract relevant features, potentially identifying critical genetic markers for cancer classification.

- **Adaptability to High-Dimensional Data**: Neural networks handle large feature sets, as demonstrated with our 7129-feature gene dataset.

We also conducted a grid search to optimize the neural network parameters:

- **Grid Search Process**:

  This process systematically explores a specified hyperparameter grid to find the best-performing combination. Our grid varied the activation function, the hidden layer sizes, the regularization strength (`alpha`), and the learning rate schedule.

  Grid search compares the combinations by cross-validated accuracy (3 folds here), which helps select settings that neither overfit nor underfit. The selected model reached approximately 94.12% accuracy on the test set.
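A reduced version of the grid search is sketched below on synthetic data. The grid is smaller than the one used in the notebook to keep the run short, and the data comes from `make_classification`, so the selected parameters carry no meaning for the gene dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data.
X, y = make_classification(n_samples=60, n_features=30, random_state=0)

# Reduced, illustrative grid: 2 x 2 = 4 combinations.
param_grid = {
    "hidden_layer_sizes": [(20,), (20, 10)],
    "alpha": [0.0001, 0.01],
}

search = GridSearchCV(
    MLPClassifier(random_state=42, max_iter=300),
    param_grid,
    cv=3,
    scoring="accuracy",  # scorer given by name
)
search.fit(X, y)

print(search.best_params_)
# One mean cross-validated accuracy per parameter combination.
print(search.cv_results_["mean_test_score"])
```

Inspecting `cv_results_["mean_test_score"]` is a quick check that every combination was actually scored; failed fits show up there as `nan`.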

# Model Comparison

We evaluated three models for cancer type prediction based on gene data: Neural Networks, Support Vector Machines (SVM), and Random Forest.

- **Neural Network Accuracy**: ~94.12%
- **SVM Accuracy**: ~97.06% (linear, RBF, and polynomial kernels; ~91.18% with the sigmoid kernel)
- **Random Forest Accuracy**: ~85.29% (no hyperparameter tuning)
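Because the test set has only 34 samples, each accuracy above corresponds to a whole number of correct predictions. The short calculation below converts the reported scores back into counts; the dictionary keys are labels chosen for this illustration.

```python
n_test = 34  # number of test samples

# Accuracies as printed by the cells above.
reported = {
    "SVM (linear/RBF/poly)": 0.9705882352941176,
    "SVM (sigmoid)": 0.9117647058823529,
    "Neural network": 0.9411764705882353,
    "Random forest": 0.8529411764705882,
}

# accuracy = correct / n_test, so correct = accuracy * n_test
counts = {name: round(acc * n_test) for name, acc in reported.items()}

for name, correct in counts.items():
    print(f"{name}: {correct}/{n_test} correct, {n_test - correct} misclassified")
```

Seen this way, the gap between the best SVM kernels and the neural network is a single test sample.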

## Model Strengths:

- **SVM**:
  - Effective in high-dimensional spaces.
  - Handles linear and non-linear relationships.
  - Robust against overfitting.

- **Neural Networks**:
  - Exceptional at capturing complex relationships.
  - Automatically extracts relevant features.

- **Random Forest**:
  - Handles high-dimensional data effectively.
  - Provides feature importances.

## Model Weaknesses:

- **SVM**:
  - Can be computationally expensive.
  - Sensitive to kernel and hyperparameters.

- **Neural Networks**:
  - Prone to overfitting with insufficient data.
  - Computationally intensive.

- **Random Forest**:
  - Can be computationally expensive with many trees.
  - Limited in capturing complex relationships.

## Model Selection:

Given our context, the SVM model, with an accuracy of ~97.06%, is the most suitable. It works well in high-dimensional spaces and generalized well to the test set. The neural network also performed well but is more computationally intensive to train and tune. Random Forest was run with default settings and might improve with hyperparameter tuning.

# Discussion

## Broader Implications of Accurate Cancer Type Prediction

Accurate cancer type prediction holds immense significance in clinical practice and medical research. It can:

- **Guide Treatment Strategies**: Precise cancer typing informs treatment decisions, enabling tailored therapies for better patient outcomes.

- **Facilitate Early Detection**: Early identification of cancer types can lead to more effective and less invasive interventions.

- **Enable Personalized Medicine**: Understanding the specific genetic characteristics of a tumor allows for targeted treatments, minimizing side effects.

- **Advance Research and Drug Development**: Accurate classification aids in identifying potential drug targets and developing new therapies.

## Real-World Applications

The models' performance has practical applications:

- **Clinical Settings**: Doctors can use these models to support their diagnostic process, providing an additional layer of confidence in cancer typing.

- **Research Institutes**: Scientists can employ these models for genetic studies, accelerating discoveries in oncology.

- **Drug Development**: Pharmaceutical companies can use accurate cancer typing to streamline clinical trials and develop more effective drugs.

# Conclusion

## Summary of Findings

After evaluation on the 34-sample test set, the Support Vector Machine (SVM) emerged as the best-performing model, with an accuracy of approximately 97.06% for the linear, RBF, and polynomial kernels. Its effectiveness in high-dimensional spaces and its generalization to the test set make it our preferred choice for cancer type prediction on this dataset.

## Importance of Model Selection and Tuning

This study underscores the role of thoughtful model selection and parameter tuning in machine learning. The SVM's success highlights the importance of matching the model to the data's characteristics, and the kernel comparison shows that a single hyperparameter choice can change the outcome.

In the context of cancer prediction, these considerations are paramount for achieving clinically relevant results and advancing cancer research and treatment.
