# Multilayer Perceptron for Pancreatic Cell Classification

* For this homework, you will use scRNA-seq data to classify different types of pancreatic cells.

* To do so, we will apply and compare several of the classification algorithms demonstrated in class.


The preprocessed scRNA-seq data were published by:

* [Abdelaal, T.; Michielsen, L.; Cats, D.; Hoogduin, D.; Mei, H.; Reinders, M. J. T.; Mahfouz, A. A Comparison of Automatic Cell Identification Methods for Single-Cell RNA Sequencing Data. Genome Biology 2019, 20 (1), 194.](https://doi.org/10.1186/s13059-019-1795-z)


## Problem Statement

1. Load pancreatic cell labels from `data/Pancreatic_Labels.csv` into a list or vector called `all_labels`, and print this vector. Load the single-cell RNA-seq data from `data/subset_combined_humanpancreas_data.csv` into a pandas data frame called `scRNAseq_data`, making sure to specify that the index column is column 0. Print the head of this data frame. The classes in `all_labels` correspond to rows in `scRNAseq_data`.

2. Isolate indices in `all_labels` corresponding to either "alpha" or "beta" pancreatic cell labels. Isolate values in `all_labels` corresponding to these indices, and store this new label vector as `labels_ab`. Likewise, isolate rows in `scRNAseq_data` corresponding to these indices, and store these rows in a data frame called `scRNAseq_ab`. Print the **length** of `labels_ab` and the **shape** of `scRNAseq_ab`.

3. We will train classifiers to predict "alpha" or "beta" pancreatic cell types from scRNA-seq data. Our covariates are the mRNA counts in `scRNAseq_ab` and our labels are the "alpha" and "beta" classes in `labels_ab`. Create an 80%/20% train/test split of this data.

4. Using the training set, train a logistic regression classifier to predict pancreatic cell labels. Then, evaluate the classifier performance by reporting its accuracy at predicting cell classes in the test set.

5. Repeat problem 4 for an SVM classifier. You may select a kernel of your choice, but please explicitly specify the kernel when initializing your classifier.

6. Repeat problem 4 for an MLP classifier. You may select an activation function of your choice, but please explicitly specify the activation function when initializing your classifier. You may change the hidden layer architecture from its default settings if you would like, but you do not have to.

7. How did the performance of the three classifiers compare to one another?

## Solutions

In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

1. Load pancreatic cell labels from `data/Pancreatic_Labels.csv` into a list or vector called `all_labels`, and print this vector. Load the single-cell RNA-seq data from `data/subset_combined_humanpancreas_data.csv` into a pandas data frame called `scRNAseq_data`, making sure to specify that the index column is column 0. Print the head of this data frame. The classes in `all_labels` correspond to rows in `scRNAseq_data`.

In [8]:
all_data_labels = pd.read_csv('data/Pancreatic_Labels.csv')
all_labels = all_data_labels.loc[:, 'x'].values
print(all_labels)

scRNAseq_data = pd.read_csv('data/subset_combined_humanpancreas_data.csv', index_col = 0)
scRNAseq_data.head()

['beta' 'delta' 'delta' ... 'gamma' 'gamma' 'gamma']


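| Unnamed: 0 | A1BG | A1CF | A2M | A4GALT | AAAS | AACS | AACSP1 | AADAC | AADAT | AAED1 | ... | ASPSCR1 | ASRGL1 | ASS1 | ASTE1 | ASTN1 | ASTN2 | ASUN | ASXL1 | ASXL2 | ATAD1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| human1_lib1.final_cell_0007 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| human1_lib1.final_cell_0013 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| human1_lib1.final_cell_0014 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| human1_lib1.final_cell_0015 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 1.0 |
| human1_lib1.final_cell_0017 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |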
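
Because the labels are matched to the rows of the data frame purely by position, it may be worth confirming that the two objects have the same length and looking at how many cells carry each label. The following is a minimal sketch on toy stand-in data; `toy_labels` and `toy_counts` are hypothetical and are not the real data set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy stand-ins for all_labels and scRNAseq_data
toy_labels = np.array(["beta", "delta", "alpha", "gamma"])
toy_counts = pd.DataFrame(
    [[0.0, 1.0], [2.0, 0.0], [0.0, 0.0], [1.0, 3.0]],
    index=["cell_1", "cell_2", "cell_3", "cell_4"],
    columns=["A1BG", "A1CF"],
)

# Labels are matched to rows by position, so the lengths must agree
assert len(toy_labels) == toy_counts.shape[0]

# Count how many cells carry each label
label_counts = pd.Series(toy_labels).value_counts()
print(label_counts)
```
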


2. Isolate indices in `all_labels` corresponding to either "alpha" or "beta" pancreatic cell labels. Isolate values in `all_labels` corresponding to these indices, and store this new label vector as `labels_ab`. Likewise, isolate rows in `scRNAseq_data` corresponding to these indices, and store these rows in a data frame called `scRNAseq_ab`. Print the **length** of `labels_ab` and the **shape** of `scRNAseq_ab`.

In [19]:
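# Cell types to keep
labels = ['alpha', 'beta']
# Positional indices of the entries in all_labels that match either type
inds = np.where(
    [i in labels for i in all_labels]
)[0]
# Label vector restricted to the alpha and beta cells
labels_ab = all_labels[inds].flatten()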

print('Length of labels_ab:', len(labels_ab))
print("Cell subclass labels: \n")
print(labels_ab)

Length of labels_ab: 8567
Cell subclass labels: 

['beta' 'beta' 'beta' ... 'alpha' 'alpha' 'alpha']


In [23]:
scRNAseq_ab = scRNAseq_data.iloc[inds,:]
print('Shape of scRNAseq_ab:', scRNAseq_ab.shape)

Shape of scRNAseq_ab: (8567, 1000)
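
The same selection can also be written with a boolean mask instead of a list comprehension. This is only an alternative sketch, shown on hypothetical toy data (`toy_labels`, `toy_counts`); it uses `np.isin` to build the mask and applies it to both the labels and the rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy stand-ins for all_labels and scRNAseq_data
toy_labels = np.array(["beta", "delta", "alpha", "gamma", "alpha"])
toy_counts = pd.DataFrame(
    np.arange(10, dtype=float).reshape(5, 2),
    columns=["A1BG", "A1CF"],
)

# True wherever the label is "alpha" or "beta"
mask = np.isin(toy_labels, ["alpha", "beta"])

# Apply the same mask to the labels and to the rows of the data frame
toy_labels_ab = toy_labels[mask]
toy_counts_ab = toy_counts.loc[mask]

print("Length of toy_labels_ab:", len(toy_labels_ab))
print("Shape of toy_counts_ab:", toy_counts_ab.shape)
```
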


3. We will train classifiers to predict "alpha" or "beta" pancreatic cell types from scRNA-seq data. Our covariates are the mRNA counts in `scRNAseq_ab` and our labels are the "alpha" and "beta" classes in `labels_ab`. Create an 80%/20% train/test split of this data. Print the number of samples in the training set.

In [40]:
X_train, X_test, y_train, y_test = train_test_split(scRNAseq_ab.values, labels_ab, test_size=0.2)
print('# samples in training set:', len(X_train))
print('# samples in test set:', len(X_test))


# samples in training set: 6853
# samples in test set: 1714
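
The split above is random, so the exact accuracies reported below will change from run to run. If reproducibility matters, one option is to fix `random_state` and, optionally, pass `stratify` so both sets keep the same class proportions. The sketch below uses hypothetical synthetic data rather than the real counts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical synthetic counts standing in for scRNAseq_ab / labels_ab
rng = np.random.default_rng(0)
X_toy = rng.poisson(1.0, size=(100, 5)).astype(float)
y_toy = np.array(["alpha"] * 60 + ["beta"] * 40)

# Fixed seed for a repeatable split; stratify keeps the 60/40 class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

print("# samples in training set:", len(X_tr))
print("# samples in test set:", len(X_te))
print(np.unique(y_te, return_counts=True))
```
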


4. Using the training set, train a logistic regression classifier to predict pancreatic cell labels. Then, evaluate the classifier performance by reporting its accuracy at predicting cell classes in the test set.

In [41]:
logReg = LogisticRegression()
logReg.fit(X_train, y_train)

y_logReg = logReg.predict(X_test)
logReg_score = logReg.score(X_test, y_test)
print('Log reg accuracy:', logReg_score)

Log reg accuracy: 0.9789964994165694


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


5. Repeat problem 4 for an SVM classifier. You may select a kernel of your choice, but please explicitly specify the kernel when initializing your classifier.

In [42]:
svm = SVC(kernel = 'linear')
svm.fit(X_train, y_train)

y_svm = svm.predict(X_test)
svm_score = svm.score(X_test, y_test)
print('SVM accuracy:', svm_score)

SVM accuracy: 0.9702450408401401
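
Accuracy alone does not show which class is being misclassified. As an optional extra, a confusion matrix breaks the test predictions down by true and predicted class. The sketch below fits a linear-kernel SVM on hypothetical synthetic data; it is an illustration, not part of the original solution.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical synthetic counts standing in for scRNAseq_ab / labels_ab
rng = np.random.default_rng(0)
X_toy = rng.poisson(1.0, size=(200, 20)).astype(float)
y_toy = np.array(["alpha"] * 100 + ["beta"] * 100)
X_toy[:100, 0] += 3.0  # shift one feature so the classes partly overlap

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

toy_svm = SVC(kernel="linear")
toy_svm.fit(X_tr, y_tr)

# Rows are true classes, columns are predicted classes,
# both in the order given by `labels`
cm = confusion_matrix(y_te, toy_svm.predict(X_te), labels=["alpha", "beta"])
toy_svm_score = toy_svm.score(X_te, y_te)

print(cm)
print("SVM accuracy:", toy_svm_score)
```

The diagonal of the matrix counts the correct predictions, so its sum divided by the total number of test samples equals the accuracy.
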


6. Repeat problem 4 for an MLP classifier. You may select an activation function of your choice, but please explicitly specify the activation function when initializing your classifier. You may change the hidden layer architecture from its default settings if you would like, but you do not have to.

In [43]:
mlp_classifier = MLPClassifier(
    activation="logistic",
    max_iter=10000
)

mlp_classifier.fit(X_train, y_train)
y_test_predicted = mlp_classifier.predict(X_test)
print("MLP Accuracy: " + str(mlp_classifier.score(X_test, y_test)))

MLP Accuracy: 0.9801633605600933
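
The problem allows changing the hidden layer architecture. In scikit-learn this is controlled by the `hidden_layer_sizes` parameter, whose default is a single hidden layer of 100 units. The sketch below, on hypothetical synthetic data, uses two hidden layers of arbitrarily chosen sizes and inspects the resulting weight matrices.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Hypothetical synthetic counts standing in for scRNAseq_ab / labels_ab
rng = np.random.default_rng(0)
X_toy = rng.poisson(1.0, size=(200, 20)).astype(float)
y_toy = np.array(["alpha"] * 100 + ["beta"] * 100)
X_toy[:100, 0] += 20.0  # make one "gene" strongly mark the alpha class

toy_mlp = MLPClassifier(
    hidden_layer_sizes=(50, 25),  # two hidden layers; sizes are illustrative
    activation="logistic",
    max_iter=2000,
    random_state=0,
)
toy_mlp.fit(X_toy, y_toy)

# One weight matrix per layer transition: input -> 50 -> 25 -> output
layer_shapes = [w.shape for w in toy_mlp.coefs_]
print(layer_shapes)
```
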


7. How did the performance of the three classifiers compare to one another?

The three classifiers perform very similarly, with test accuracies between 0.97 and 0.98 on the 80%/20% train/test split. The MLP classifier (0.980) is slightly better than logistic regression (0.979), and both are about 0.01 better than the SVM classifier (0.970). Given how close the results are, a more computationally intensive classifier may not be necessary for a split like this one: logistic regression reaches essentially the same accuracy as the MLP.

I also tested the classifiers on a 50%/50% train/test split and noticed that logistic regression and the MLP stayed at roughly the same accuracy (a small drop), while the SVM dropped a little more (~0.05).
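
Because each comparison above rests on a single random split, differences of a few thousandths may be within split-to-split variation. One way to get a steadier comparison is k-fold cross-validation, sketched below with `cross_val_score` on hypothetical synthetic data. The model settings mirror the ones used above, except that `max_iter` and `random_state` values are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Hypothetical synthetic counts standing in for scRNAseq_ab / labels_ab
rng = np.random.default_rng(0)
X_toy = rng.poisson(1.0, size=(200, 20)).astype(float)
y_toy = np.array(["alpha"] * 100 + ["beta"] * 100)
X_toy[:100, 0] += 3.0  # shift one feature so the classes partly overlap

models = {
    "Log reg": LogisticRegression(max_iter=1000),
    "SVM": SVC(kernel="linear"),
    "MLP": MLPClassifier(activation="logistic", max_iter=2000, random_state=0),
}

# 5-fold cross-validated accuracy for each classifier
cv_scores = {}
for name, model in models.items():
    cv_scores[name] = cross_val_score(model, X_toy, y_toy, cv=5)
    print(name, "mean:", cv_scores[name].mean(), "std:", cv_scores[name].std())
```

Comparing the means alongside the standard deviations gives a rough sense of whether one classifier is consistently better or whether the gap is comparable to the fold-to-fold spread.
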
