# Introduction to Classification 
## 11/19/19
## Due 11/25/19 @ 11:59 PM

In [None]:
# Load the modules we'll need
from datascience import *
import numpy as np
import random
import seaborn as sns
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.manifold import TSNE
plt.style.use('fivethirtyeight')
from client.api.notebook import Notebook

Today we'll explore the task of classification by seeing whether we can successfully predict a cell type given gene expression measurements. To do so, we'll use the single-cell RNA-seq data from last week. These data consist of B cells, T cells, and NK cells from the murine spleen. Let's start by loading and normalizing the data as we did previously.

In [None]:
# Load metadata for single cells
sc_meta = pd.read_csv('https://raw.githubusercontent.com/ds-connectors/Data88-Genetics_and_Genomics/master/Lab07/spleen_meta_sc.csv', sep = ',', header = 0).set_index('index').rename(columns ={'Unnamed: 0':'Sample Num'})
# Load single-cell expression data and normalize it
scRNA_data_pre = pd.read_csv('https://raw.githubusercontent.com/ds-connectors/Data88-Genetics_and_Genomics/master/Lab07/cell_data.csv', sep = ',', header = 0).set_index('Gene')
# These data were on log scale so let's go back to counts
sc_data = (2 ** scRNA_data_pre - 1)
sc_total_med = sc_data.sum().median()
sc_data_norm = sc_data / sc_data.sum() * sc_total_med
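
To make the normalization step concrete, here is a minimal sketch on a made-up count table (the gene and cell names and all values are hypothetical, not taken from the spleen data). It follows the same recipe as above: divide each cell by its total count, then rescale by the median total across cells.

```python
import pandas as pd

# Hypothetical toy counts: rows are genes, columns are cells.
toy_counts = pd.DataFrame(
    {"cell_a": [10, 30, 60], "cell_b": [1, 3, 6], "cell_c": [20, 20, 10]},
    index=["gene_1", "gene_2", "gene_3"],
)

# Total counts per cell (column sums) and the median of those totals.
cell_totals = toy_counts.sum()
median_total = cell_totals.median()

# Dividing a DataFrame by a Series aligns the Series with the columns,
# so each cell is scaled by its own total before rescaling to the median.
toy_norm = toy_counts / cell_totals * median_total

print(cell_totals.tolist())
print(toy_norm.sum().tolist())
```

After normalization every cell has the same total, so differences between cells reflect relative expression rather than sequencing depth.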

Classification is usually a supervised learning task: we have data with known class labels and want to build a model that predicts the label of new data from what we have already observed. We therefore need to extract the labels and "features" from our data before building the model.

In [None]:
# Extract the cell type labels and gene expression values for all the cells.
sc_labels = sc_meta.cell_ontology_class.copy()   # .copy() makes an independent copy; [:] on a pandas Series does not guarantee one
sc_features = sc_data_norm.T                 # Transpose so that rows are cells (samples) and columns are genes (features)

One of the key components of model building is splitting our data into training and test sets. We use the training set to construct our model and then evaluate its performance on the test set. There are many ways to split into training and test sets, but a simple random split works well for our data. Fortunately, scikit-learn's train_test_split function does everything for us.

In [None]:
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(sc_features, sc_labels, test_size=0.3)
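
As an optional aside, the sketch below shows two extra arguments of train_test_split that the lab does not use: `random_state` (to make the split reproducible) and `stratify` (to keep class proportions the same in both sets). The toy arrays and their names are hypothetical and only serve to illustrate the call.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 5 features, imbalanced labels.
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(100, 5))
toy_y = np.array(["B cell"] * 60 + ["T cell"] * 30 + ["NK cell"] * 10)

# random_state fixes the shuffle; stratify preserves the 60/30/10 class mix.
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=0, stratify=toy_y
)

print(toy_X_train.shape, toy_X_test.shape)
print(np.unique(toy_y_test, return_counts=True))
```

Stratifying can matter when one class is rare, since a purely random split may leave very few examples of that class in the test set.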

Having split our data, we are now ready to train our model. We're going to use a machine learning algorithm known as a random forest. Go ahead and run the following cell to build the classifier.

In [None]:
# Fit model
clf=RandomForestClassifier(n_estimators = 200)
clf.fit(X_train, y_train)

We now want to compute the accuracy on our training and test sets. We can use the model to make predictions and then compare them to the known true labels.

In [None]:
# Compute train and test accuracies
y_train_pred = clf.predict(X_train)
y_test_pred = ?  # Predict using the test data

[np.mean(y_train_pred == y_train), ?] # Fill in the test accuracy

# How do these accuracies compare? Why do you think this is?
# Answer:

The confusion matrix is a useful way to visualize the performance of our algorithm. It shows how predictions compare to the truth for each class. Perfect classification would be a diagonal matrix.
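
To illustrate how to read a confusion matrix, here is a small sketch with hand-made labels (the six toy labels are hypothetical). In scikit-learn, rows correspond to the true class and columns to the predicted class; the `labels` argument, which is not used in the lab, fixes the row and column order.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for six cells.
toy_true = ["B cell", "B cell", "B cell", "T cell", "T cell", "NK cell"]
toy_pred = ["B cell", "B cell", "T cell", "T cell", "T cell", "T cell"]

# Fix the class order so rows/columns are B cell, T cell, NK cell.
class_order = ["B cell", "T cell", "NK cell"]
toy_cm = confusion_matrix(toy_true, toy_pred, labels=class_order)

print(toy_cm)
```

Here the first row says that two B cells were predicted correctly and one was called a T cell, while the last row says the single NK cell was called a T cell. Off-diagonal entries are the misclassifications.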

In [None]:
# Produce confusion matrix. Fill in the true test labels first and the predicted test labels second.
confusion_matrix(?, ?)

# Interpret your confusion matrix. What do you think it means? Does this make sense in light of the t-SNE plots we saw before?
# Answer:

Let's generate plots of our data with the true and predicted labels to see how the classifier performs. We'll start by running t-SNE so that we can visualize our cell type clusters.

In [None]:
# Produce t-SNE embeddings
x = StandardScaler().fit_transform(np.log2(X_test+1))
pca = PCA(n_components = 100)
principalComponents = pca.fit_transform(x)
tsne = TSNE(n_components = 2)
X_embedded = tsne.fit_transform(principalComponents)

In [None]:
# Make a plot which colors points by the true labels, predicted labels, and whether the classifier was correct or not.
plt.figure(figsize=(15,5))

plt.subplot(131)
plt.scatter(X_embedded[y_test == 'B cell',0], X_embedded[y_test == 'B cell',1])
plt.scatter(X_embedded[y_test == 'T cell',0], X_embedded[y_test == 'T cell',1])
plt.scatter(X_embedded[y_test == 'NK cell',0], X_embedded[y_test == 'NK cell',1])
plt.legend(['B cells','T cells', 'NK cells'])
plt.xlabel('t-SNE 1')
plt.ylabel('t-SNE 2')
plt.title('t-SNE, true labels')

plt.subplot(132)
plt.scatter(X_embedded[y_test_pred == 'B cell',0], X_embedded[y_test_pred == 'B cell',1])
plt.scatter(X_embedded[y_test_pred == 'T cell',0], X_embedded[y_test_pred == 'T cell',1])
plt.scatter(X_embedded[y_test_pred == 'NK cell',0], X_embedded[y_test_pred == 'NK cell',1])
plt.legend(['B cells','T cells', 'NK cells'])
plt.xlabel('t-SNE 1')
plt.ylabel('t-SNE 2')
plt.title('t-SNE, predicted labels')

plt.subplot(133)
# Fill in the statements such that the plot colors points by whether they were classified correctly or not.
# HINT: You may want to use y_test_pred == y_test and/or y_test_pred != y_test
plt.scatter(X_embedded[?,0], X_embedded[?,1])
plt.scatter(X_embedded[?,0], X_embedded[?,1])
plt.legend(['Correct','Incorrect'])
plt.xlabel('t-SNE 1')
plt.ylabel('t-SNE 2')
plt.title('t-SNE, Classification performance')
plt.show()

Random forests produce a quantity known as feature importances which (hopefully) tell us about how useful certain features are when trying to discriminate among classes. Thus, in theory, features with high importances are relevant for identifying class membership. It's not always so pretty in reality, but it's a good starting point for finding genes which are cell type markers.
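
As a sanity check on the idea, the sketch below fits a random forest to synthetic data in which only one feature carries information about the class. The feature names, sample size, and effect size are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: one feature shifts with the class, two are pure noise.
rng = np.random.default_rng(0)
n = 300
toy_labels = rng.integers(0, 2, size=n)
toy_features = pd.DataFrame({
    "informative_gene": toy_labels * 3.0 + rng.normal(size=n),
    "noise_gene_1": rng.normal(size=n),
    "noise_gene_2": rng.normal(size=n),
})

toy_clf = RandomForestClassifier(n_estimators=100, random_state=0)
toy_clf.fit(toy_features, toy_labels)

# Importances are normalized to sum to 1 across features.
toy_imp = pd.Series(toy_clf.feature_importances_, index=toy_features.columns)
print(toy_imp.sort_values(ascending=False))
```

In this controlled setting the informative feature should receive most of the importance. Real expression data are messier, for example because correlated genes can share importance among themselves.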

In [None]:
# Feature importances. Let's look at the top genes
feature_imp = pd.Series(clf.feature_importances_, index = X_train.columns)
feature_imp.sort_values(ascending=False)[0:25]

# Look at the list of genes that have large importances. To my eye, there are three gene prefixes which show up a few times. These are related genes. Identify these three "families". Look up a few of the genes online. Do their functions make sense given the task we're performing?

# Answer:

Let's look at a histogram of the feature importances. Many of them are zero and the scale is quite compressed, so we'll need to add 1e-10 and then apply np.log10 to get something useful.
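
The reason for the small offset can be seen in a short sketch with made-up importance values: the logarithm of zero is negative infinity, which cannot be binned in a histogram, whereas adding 1e-10 maps the zeros to a finite value of about -10.

```python
import numpy as np

# Hypothetical importances, including an exact zero.
toy_importances = np.array([0.0, 1e-6, 1e-3, 0.2])

# Without an offset, log10(0) is -inf (the divide warning is silenced here).
with np.errstate(divide="ignore"):
    raw_log = np.log10(toy_importances)

# With the offset, every value is finite and zeros land near -10.
shifted_log = np.log10(toy_importances + 1e-10)

print(raw_log)
print(shifted_log)
```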

In [None]:
plt.hist(?) # Fill in the correct quantities to make the requested histogram. HINT: np.log10(? + 1e-10)
plt.xlabel('log10 feature importance')
plt.ylabel('Frequency')
plt.title('Importances for cell type classification')
plt.show()

The manner in which we split our data into training and test sets will affect our model's performance. One factor is the fraction of the data assigned to the training set versus the test set. We can change this using the 'test_size' parameter in train_test_split. Let's build a for loop to see how this quantity affects our accuracy rate.

In [None]:
# Let's re-train with different training set fractions. How does the test accuracy change?
test_acc = []
for i in [.9, .8, .7, .6, .5, .4, .3, .2, .1]:
    X_train, X_test, y_train, y_test = train_test_split(sc_features, sc_labels, test_size = 1-i)
    clf=RandomForestClassifier(n_estimators = 200)
    clf.fit(?, ?) # fit the model on the training data
    y_test_pred = clf.predict(?) # predict on the test set
    test_acc.append(?) # append the test accuracy

In [None]:
plt.plot([.9, .8, .7, .6, .5, .4, .3, .2, .1], ?) # Plot the test accuracy as a function of the training fraction
plt.xlabel('Fraction of data in training set')
plt.ylabel('Classification accuracy')
plt.title('Cell type accuracy vs training set size')
plt.show()

# Comment on your resulting plot. What do you think is generally true about the relationship between how much data is in the training set and performance? You may want to run it a few times since there is some randomness in the process.

# Answer:

Feature selection (and model selection) is another crucial component of classification tasks. Essentially, if we have many features, it may not be optimal to use all of them to build the model and we need some way to figure out what to keep and what to discard. One thing we can do is look at the accuracy on the test set as a function of the features we choose to keep. Let's see how our model performs when we truncate at different values of the feature importance.
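
The general idea can be sketched on synthetic data, under the assumption that only two of twenty features are informative (all names, sizes, and the 0.1 cutoff below are hypothetical choices for illustration). The sketch fits a forest on all features, keeps only those whose importance passes a cutoff, and refits on the reduced set.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data: 20 noise features, two of which are shifted by the class.
rng = np.random.default_rng(1)
n = 400
toy_y = rng.integers(0, 2, size=n)
toy_X = pd.DataFrame(rng.normal(size=(n, 20)),
                     columns=[f"gene_{j}" for j in range(20)])
toy_X["gene_0"] += 3.0 * toy_y
toy_X["gene_1"] -= 3.0 * toy_y

Xtr, Xte, ytr, yte = train_test_split(toy_X, toy_y, test_size=0.5, random_state=1)

# Fit on all features and record the importances.
full = RandomForestClassifier(n_estimators=100, random_state=1).fit(Xtr, ytr)
imp = pd.Series(full.feature_importances_, index=Xtr.columns)

# Keep only features whose importance passes the (arbitrary) cutoff.
keep = imp.index[imp >= 0.1]
reduced = RandomForestClassifier(n_estimators=100, random_state=1).fit(Xtr[keep], ytr)

full_acc = full.score(Xte, yte)
reduced_acc = reduced.score(Xte[keep], yte)
print(list(keep), full_acc, reduced_acc)
```

Note that the importances here are computed from the training set only. Choosing features with information from the test set would make the test accuracy look better than it really is.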

In [None]:
# Let's re-train with fewer features (genes). How does the test accuracy compare?
X_train, X_test, y_train, y_test = train_test_split(sc_features, sc_labels, test_size=.5)
clf=RandomForestClassifier(n_estimators = 200)

test_acc = []
for i in [0, .000001, .00001, .0001, .001, .01]:
    X_train_sub = X_train.T[feature_imp >= i].T
    X_test_sub = X_test.T[feature_imp >= i].T

    clf.fit(?, ?) # Fit model on the reduced training data
    # Compute test accuracy
    y_test_pred = clf.predict(?) # Predict on reduced testing data
    test_acc.append(?) # Append accuracy at that iteration

In [None]:
plt.plot(np.log10([0+1e-10, .000001, .00001, .0001, .001, .01]), ?) # Plot test accuracy
plt.xlabel('Log10 feature importance cutoff')
plt.ylabel('Test set accuracy')
plt.title('Cell type accuracy vs FI cutoff')
plt.show()

# Comment on your resulting plot. For this data, what seems to be true about performance as a function of the number of features we retain? You may want to run it a few times since there is some randomness in the process.

# Answer:

# Comment on how classification performed compared to clustering using t-SNE. Which was better? What is the key difference that makes one approach preferable to the other?

# Answer:

In [None]:
ok = Notebook('lab08_classification.ok')
_ = ok.auth(inline=True)

In [None]:
# Submit the assignment.
_ = ok.submit()