# Supervised Learning

The goal of this notebook is to demonstrate the use of supervised learning algorithms to predict cell types from single-cell RNA-seq data.
The cells in the training and test data sets were labeled by a team of experts in the field, and those labeled cells serve as input to the different supervised learning algorithms.

## Library Imports and Data Loading

In [44]:
# Import libraries
%matplotlib inline
import matplotlib.pyplot as plt
import scanpy as sp
import pandas as pd
import numpy as np
import seaborn as sns
import loompy

from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

RANDOM_SEED = 42

In [26]:
# Data Loading
X_train = pd.read_csv("data/X_train_pca.csv")
y_train = pd.read_csv("data/y_train.csv")
X_test = pd.read_csv("data/X_test_pca.csv")
y_test = pd.read_csv("data/y_test.csv")

X_train.rename(columns={"Unnamed: 0": "CellID"}, inplace=True)
X_train.set_index("CellID", inplace=True)
y_train.set_index("CellID", inplace=True)

X_test.rename(columns={"Unnamed: 0": "CellID"}, inplace=True)
X_test.set_index("CellID", inplace=True)
y_test.set_index("CellID", inplace=True)
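
Before fitting any model, it can help to confirm that the feature and label tables describe the same cells in the same order, and to look at how the labels are distributed. The following is a minimal sketch on toy data; `X_demo`, `y_demo`, and the cell IDs are hypothetical stand-ins for the notebook's `X_train` and `y_train`, not values from the real data set.

```python
import pandas as pd

# Toy stand-ins for the notebook's X_train / y_train (hypothetical values).
cell_ids = pd.Index(["cell_a", "cell_b", "cell_c", "cell_d"], name="CellID")
X_demo = pd.DataFrame(
    {"PC1": [0.1, -0.3, 1.2, 0.7], "PC2": [1.0, 0.4, -0.8, 0.2]},
    index=cell_ids,
)
y_demo = pd.DataFrame(
    {"type": ["CD4 T", "CD8 T", "CD8 T", "Tumor"]},
    index=cell_ids,
)

# Features and labels should refer to the same cells in the same order.
indices_aligned = X_demo.index.equals(y_demo.index)

# Relative frequency of each label. The largest fraction is the accuracy
# that a most-frequent baseline would reach on this table.
class_fractions = y_demo["type"].value_counts(normalize=True)

print("Indices aligned:", indices_aligned)
print(class_fractions)
```

On the real data the same two checks would be run against `X_train`/`y_train` and `X_test`/`y_test` after the index has been set.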

In [55]:
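def print_accuracy_report(model, model_name):
    """Print train/test accuracy and the test-set classification report for a fitted model."""
    # Relies on the notebook-level X_train, y_train, X_test, and y_test loaded above.
    print("{} Training Set Accuracy: {}".format(model_name, model.score(X_train, y_train['type'])))
    print("{} Test Set Accuracy: {}".format(model_name, model.score(X_test, y_test['type'])))
    # zero_division=0 reports 0.0 (without a warning) for classes that are never predicted.
    print("\nClassification Report for {}\n{}".format(model_name, classification_report(y_test['type'], model.predict(X_test), zero_division=0)))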
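
The classification report shows how well each class is recovered, but not which classes a misclassified cell is assigned to. `confusion_matrix` is imported above, so one possible complement is a labeled confusion table. This is only a sketch: `confusion_frame` and the toy label lists are hypothetical and are not part of the original notebook.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix


def confusion_frame(y_true, y_pred):
    """Return the confusion matrix as a DataFrame (rows: true, columns: predicted)."""
    # Use every label seen in either list so rows and columns line up.
    labels = sorted(set(y_true) | set(y_pred))
    matrix = confusion_matrix(y_true, y_pred, labels=labels)
    return pd.DataFrame(
        matrix,
        index=pd.Index(labels, name="true"),
        columns=pd.Index(labels, name="predicted"),
    )


# Toy labels (hypothetical): both B cells are predicted as CD4 T.
y_true_demo = ["B cell", "B cell", "CD4 T", "CD8 T", "CD8 T"]
y_pred_demo = ["CD4 T", "CD4 T", "CD4 T", "CD8 T", "CD4 T"]

cm_demo = confusion_frame(y_true_demo, y_pred_demo)
print(cm_demo)
```

With a fitted model, the call would look like `confusion_frame(list(y_test['type']), list(model.predict(X_test)))`.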

## Algorithm Testing

### K Nearest Neighbors

k-nearest neighbors was evaluated in a previous notebook; it is reproduced here so that it can be included in the comparison with the other algorithms.
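
Since the point of this section is to compare algorithms, one way to keep the results side by side is to collect each model's accuracies into a single table. The sketch below assumes synthetic data from `make_classification` in place of the PCA features, and the names `models`, `rows`, and `comparison` are illustrative only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic three-class data standing in for the real PCA features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

# Any estimator with fit() and score() can be added to this dictionary.
models = {
    "kNN with k=15": KNeighborsClassifier(n_neighbors=15),
    "Most-Frequent Dummy": DummyClassifier(strategy="most_frequent"),
}

rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    rows.append(
        {
            "model": name,
            "train_accuracy": model.score(X_tr, y_tr),
            "test_accuracy": model.score(X_te, y_te),
        }
    )

# One row per model, best test accuracy first.
comparison = (
    pd.DataFrame(rows).set_index("model").sort_values("test_accuracy", ascending=False)
)
print(comparison)
```

A large gap between the two accuracy columns for a model is a quick indicator of overfitting.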

In [56]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors = 15).fit(X_train, y_train['type'])
print_accuracy_report(knn, "kNN with k=15")

kNN with k=15 Training Set Accuracy: 0.9488202637057599
kNN with k=15 Test Set Accuracy: 0.7623674035887776

Classification Report for kNN with k=15
              precision    recall  f1-score   support

    Alveolar       0.99      0.95      0.97        92
      B cell       0.00      0.00      0.00       855
       CD4 T       0.59      0.91      0.71      2119
       CD8 T       0.70      0.79      0.74      2503
          DC       0.00      0.00      0.00       101
 Endothelial       1.00      0.94      0.97       317
  Epithelial       1.00      0.88      0.94        43
        Mast       0.98      0.99      0.99       117
     Myeloid       0.93      0.93      0.93      1810
          NK       0.00      0.00      0.00       176
         RBC       0.00      0.00      0.00       187
     Stromal       1.00      0.89      0.94       306
       Tumor       0.98      0.90      0.93      1441
         pDC       0.00      0.00      0.00        20

    accuracy                           

### Dummy Classifier

In [49]:
from sklearn.dummy import DummyClassifier

mostfrequent_dummy = DummyClassifier(strategy = "most_frequent", random_state=RANDOM_SEED).fit(X_train, y_train['type'])
print_accuracy_report(mostfrequent_dummy, "Most-Frequent Dummy Classifier")

stratified_dummy = DummyClassifier(strategy = "stratified", random_state=RANDOM_SEED).fit(X_train, y_train['type'])
print_accuracy_report(stratified_dummy, "Stratified Dummy Classifier")

Most-Frequent Dummy Classifier Training Set Accuracy: 0.24819074055715276
Most-Frequent Dummy Classifier Test Set Accuracy: 0.24814117180529394

Classification Report for Most-Frequent Dummy Classifier
              precision    recall  f1-score   support

    Alveolar       0.00      0.00      0.00        92
      B cell       0.00      0.00      0.00       855
       CD4 T       0.00      0.00      0.00      2119
       CD8 T       0.25      1.00      0.40      2503
          DC       0.00      0.00      0.00       101
 Endothelial       0.00      0.00      0.00       317
  Epithelial       0.00      0.00      0.00        43
        Mast       0.00      0.00      0.00       117
     Myeloid       0.00      0.00      0.00      1810
          NK       0.00      0.00      0.00       176
         RBC       0.00      0.00      0.00       187
     Stromal       0.00      0.00      0.00       306
       Tumor       0.00      0.00      0.00      1441
         pDC       0.00      0.00      0.

### Support Vector Machine Classifier

In [50]:
from sklearn.svm import SVC

svm = SVC(C=10, random_state=RANDOM_SEED).fit(X_train, y_train['type'])
print_accuracy_report(svm, "SVM with C=10")

SVM with C=10 Training Set Accuracy: 0.9804946961435511
SVM with C=10 Test Set Accuracy: 0.6759195003469812

Classification Report for SVM with C=10
              precision    recall  f1-score   support

    Alveolar       0.97      0.38      0.55        92
      B cell       0.01      0.01      0.01       855
       CD4 T       0.70      0.71      0.70      2119
       CD8 T       0.59      0.64      0.62      2503
          DC       0.00      0.00      0.00       101
 Endothelial       0.99      0.74      0.85       317
  Epithelial       0.00      0.00      0.00        43
        Mast       1.00      0.98      0.99       117
     Myeloid       0.92      0.96      0.94      1810
          NK       0.00      0.00      0.00       176
         RBC       0.17      0.02      0.03       187
     Stromal       0.53      0.96      0.69       306
       Tumor       0.93      0.90      0.92      1441
         pDC       0.00      0.00      0.00        20

    accuracy                           

### Decision Tree

In [51]:
from sklearn.tree import DecisionTreeClassifier

decision_tree_nomd = DecisionTreeClassifier(random_state=RANDOM_SEED).fit(X_train, y_train['type'])
print_accuracy_report(decision_tree_nomd, "Decision Tree without Max Depth")

decision_tree_md = DecisionTreeClassifier(max_depth=len(y_train['type'].unique()), random_state=RANDOM_SEED).fit(X_train, y_train['type'])
print_accuracy_report(decision_tree_md, "Decision Tree with Max Depth={}".format(len(y_train['type'].unique())))

Decision Tree without Max Depth Training Set Accuracy: 1.0
Decision Tree without Max Depth Test Set Accuracy: 0.5492217705958164

Classification Report for Decision Tree without Max Depth
              precision    recall  f1-score   support

    Alveolar       0.99      0.92      0.96        92
      B cell       0.01      0.01      0.01       855
       CD4 T       0.67      0.64      0.65      2119
       CD8 T       0.49      0.29      0.37      2503
          DC       0.03      0.03      0.03       101
 Endothelial       0.96      0.96      0.96       317
  Epithelial       0.33      0.93      0.49        43
        Mast       0.95      0.31      0.46       117
     Myeloid       0.85      0.89      0.87      1810
          NK       0.09      0.02      0.03       176
         RBC       0.11      0.16      0.13       187
     Stromal       0.84      0.93      0.89       306
       Tumor       0.67      0.72      0.70      1441
         pDC       0.00      0.00      0.00        20

  

### Naive Bayes Classifiers

In [53]:
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB().fit(X_train, y_train['type'])
print_accuracy_report(nb, "GaussianNB")

GaussianNB Training Set Accuracy: 0.9006146525230495
GaussianNB Test Set Accuracy: 0.7235055021314564

Classification Report for GaussianNB
              precision    recall  f1-score   support

    Alveolar       1.00      0.67      0.81        92
      B cell       0.50      0.76      0.60       855
       CD4 T       0.82      0.63      0.71      2119
       CD8 T       0.78      0.70      0.74      2503
          DC       0.00      0.00      0.00       101
 Endothelial       0.89      0.47      0.62       317
  Epithelial       1.00      0.81      0.90        43
        Mast       0.99      0.97      0.98       117
     Myeloid       0.82      0.92      0.87      1810
          NK       0.00      0.00      0.00       176
         RBC       0.10      0.04      0.05       187
     Stromal       0.45      0.97      0.62       306
       Tumor       0.70      0.85      0.77      1441
         pDC       0.00      0.00      0.00        20

    accuracy                           0.72     

### Random Forest

In [54]:
from sklearn.ensemble import RandomForestClassifier

random_forest = RandomForestClassifier(random_state=RANDOM_SEED).fit(X_train, y_train['type'])
print_accuracy_report(random_forest, "Random Forest")

Random Forest Training Set Accuracy: 1.0
Random Forest Test Set Accuracy: 0.755626053335977

Classification Report for Random Forest
              precision    recall  f1-score   support

    Alveolar       0.98      0.98      0.98        92
      B cell       0.24      0.19      0.21       855
       CD4 T       0.71      0.78      0.75      2119
       CD8 T       0.66      0.75      0.70      2503
          DC       0.00      0.00      0.00       101
 Endothelial       0.99      0.98      0.98       317
  Epithelial       0.83      0.12      0.20        43
        Mast       0.90      0.16      0.28       117
     Myeloid       0.92      0.99      0.96      1810
          NK       0.00      0.00      0.00       176
         RBC       0.00      0.00      0.00       187
     Stromal       0.99      0.93      0.96       306
       Tumor       0.92      0.99      0.95      1441
         pDC       0.00      0.00      0.00        20

    accuracy                           0.76     10087
 

### Gradient-boosted Decision Tree

In [57]:
from sklearn.ensemble import GradientBoostingClassifier

gbdt = GradientBoostingClassifier(random_state=RANDOM_SEED).fit(X_train, y_train['type'])
print_accuracy_report(gbdt, "Gradient-boosted Decision Tree")

Gradient-boosted Decision Tree Training Set Accuracy: 0.9736789927629622
Gradient-boosted Decision Tree Test Set Accuracy: 0.7670268662635075

Classification Report for Gradient-boosted Decision Tree
              precision    recall  f1-score   support

    Alveolar       0.94      0.63      0.75        92
      B cell       0.30      0.31      0.31       855
       CD4 T       0.77      0.80      0.78      2119
       CD8 T       0.70      0.77      0.73      2503
          DC       0.17      0.06      0.09       101
 Endothelial       0.98      0.95      0.96       317
  Epithelial       1.00      0.05      0.09        43
        Mast       0.99      0.98      0.99       117
     Myeloid       0.92      0.98      0.95      1810
          NK       0.07      0.01      0.02       176
         RBC       0.24      0.04      0.07       187
     Stromal       0.84      0.89      0.86       306
       Tumor       0.93      0.92      0.93      1441
         pDC       0.00      0.00      0.00

### Neural Network

In [58]:
from sklearn.neural_network import MLPClassifier

neural_net_100 = MLPClassifier(hidden_layer_sizes=[100], alpha=0.1, random_state=RANDOM_SEED).fit(X_train, y_train['type'])
print_accuracy_report(neural_net_100, "Neural Network with 1 layer, 100 units")
neural_net_2_10 = MLPClassifier(hidden_layer_sizes=[10, 10], alpha=0.1, random_state=RANDOM_SEED).fit(X_train, y_train['type'])
print_accuracy_report(neural_net_2_10, "Neural Network with 2 layers, 10 units each")

Neural Network with 1 layer, 100 units Training Set Accuracy: 0.9915733121839992
Neural Network with 1 layer, 100 units Test Set Accuracy: 0.6808763755328641

Classification Report for Neural Network with 1 layer, 100 units
              precision    recall  f1-score   support

    Alveolar       0.97      0.98      0.97        92
      B cell       0.01      0.01      0.01       855
       CD4 T       0.63      0.70      0.66      2119
       CD8 T       0.61      0.67      0.64      2503
          DC       0.00      0.00      0.00       101
 Endothelial       0.99      0.98      0.99       317
  Epithelial       1.00      0.05      0.09        43
        Mast       1.00      0.88      0.94       117
     Myeloid       0.91      0.94      0.92      1810
          NK       0.05      0.01      0.01       176
         RBC       0.09      0.02      0.03       187
     Stromal       0.93      0.95      0.94       306
       Tumor       0.96      0.83      0.89      1441
         pDC       