In [None]:
"""
Introduction to statistical and machine learning using scikit-learn (https://scikit-learn.org/stable/)

Instructor: Shaina Lu (slu@cshl.edu)
2019 URP Data Analysis using Python Course
11 July 2019"""

# Imports and Globals

In [None]:
#imports
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline  
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn import metrics

# An introduction to basic machine learning with sklearn using the Iris dataset

In this first section, we will quickly introduce scikit-learn for machine learning. It is one of the more popular Python packages for statistical and machine learning. It works seamlessly with numpy arrays and pandas dataframes and can be thought of as a machine learning extension to scipy.

(For those interested, a couple of years ago Google released its own Python machine learning package called TensorFlow. It's very powerful, especially for computationally intensive tasks such as deep learning. TensorFlow uses its own data structures. We will not be covering TensorFlow in our course; those interested can check out: https://tensorflow.org)

### Load sample data and poke around to understand it

For this tutorial, we'll use a classic dataset called the Iris dataset collected by R.A. Fisher. There will be more details below.

In [None]:
#load data, we're loading a dataset that comes with the library so the functions are specific to it
iris = datasets.load_iris()

In [None]:
#let's see how the iris dataset is stored
type(iris)

What's an sklearn Bunch? Looking at the documentation for the sklearn iris dataset (https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html), we see that a Bunch is a dictionary-like object. Let's try interacting with it as we would a dictionary and see what the keys are.

In [None]:
#get keys from iris dataset
iris.keys()

Great! We see a handful of keys (six in the scikit-learn version used for this course; newer versions may list a few more). Let's take a look at the description.

In [None]:
#index the iris dictionary to see the description
print(iris['DESCR'])

This gives a nice description and background of the classic iris dataset. Now, let's poke around and see what the other things in the iris dictionary are.

In [None]:
#print each of the remaining values in the iris dictionary by indexing the dictionary with the keys

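One possible way to fill in the cell above is to loop over the keys and print a short summary of each value. This is only a sketch: it skips `DESCR` (already printed) and shows just the first three entries of any array so the output stays readable.

```python
from sklearn import datasets

iris = datasets.load_iris()

#loop over every key except the long description and summarise its value
for key in iris.keys():
    if key == "DESCR":
        continue
    value = iris[key]
    if hasattr(value, "shape"):
        #numpy arrays: show the shape and the first few entries instead of everything
        print(key, value.shape, value[:3])
    else:
        #lists, strings and anything else are short enough to print directly
        print(key, value)
```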

Okay, the values in the iris dataset look as we expected from the documentation. Let's store the data in a pandas dataframe to more easily manipulate it.

In [None]:
#create a pandas dataframe to store the data and a pandas series to store the targets or y labels
irisdf = pd.DataFrame(data=iris['data'], index=range(1,151,1), columns=iris['feature_names'])
ylabels = pd.Series(data=iris['target'], index=range(1,151,1), name="iris type")

In [None]:
#let's check that our dataframe looks as we expect


In [None]:
#check the dataframe dimensions


In [None]:
#now let's check the series


In [None]:
#check the series dimensions

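The four checks above could look something like the following. It is a sketch that rebuilds the dataframe and series so it runs on its own; `value_counts` is an extra check (not asked for above) that each iris type has the same number of flowers.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
irisdf = pd.DataFrame(data=iris['data'], index=range(1,151,1), columns=iris['feature_names'])
ylabels = pd.Series(data=iris['target'], index=range(1,151,1), name="iris type")

#first few rows and dimensions of the feature dataframe
print(irisdf.head())
print(irisdf.shape)

#first few entries and dimensions of the label series
print(ylabels.head())
print(ylabels.shape)

#extra check: how many flowers carry each label
print(ylabels.value_counts())
```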

### Get summary statistics for the different features

Everything looks good so far and as we expect. Let's play around with a few summary statistics using pandas. A great resource to look up handy pandas commands is the pandas cheat sheet, available here: https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf

In [None]:
#average of all petal lengths
irisdf.loc[:,"petal length (cm)"].mean()

Cool, this output matches the mean petal length reported in the description. Play around with the other features and other summary statistics.

In [None]:
#try other features and summary statistics here

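If you are not sure where to start, `describe` gives several summary statistics for every feature at once. The sketch below rebuilds the dataframe so it runs on its own; the choice of median and standard deviation afterwards is just an example.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
irisdf = pd.DataFrame(data=iris['data'], index=range(1,151,1), columns=iris['feature_names'])

#count, mean, std, min, quartiles and max for every feature in one table
summary = irisdf.describe()
print(summary)

#individual statistics work the same way as mean()
print(irisdf.loc[:,"sepal width (cm)"].median())
print(irisdf.std())
```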

Now, what if I only want to get the mean petal length for each of the three iris types? 

In [None]:
#use the ylabels to get the mean for each iris type separately
print(irisdf.loc[ylabels==0,"petal length (cm)"].mean())
print(irisdf.loc[ylabels==1,"petal length (cm)"].mean())
print(irisdf.loc[ylabels==2,"petal length (cm)"].mean())
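The three lines above repeat the same operation once per label. A more compact alternative, sketched here, is to let pandas group the rows by label with `groupby`. Passing the `ylabels` series works because it shares its index with `irisdf`.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
irisdf = pd.DataFrame(data=iris['data'], index=range(1,151,1), columns=iris['feature_names'])
ylabels = pd.Series(data=iris['target'], index=range(1,151,1), name="iris type")

#group the rows by iris type and average every feature within each group
meansbytype = irisdf.groupby(ylabels).mean()
print(meansbytype)

#pull out a single feature, one value per iris type
print(meansbytype.loc[:,"petal length (cm)"])
```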

In [None]:
#try other features and summary statistics here


### Plot two of the features for all the flowers

In [None]:
#plot petal length by petal width


I wonder if that cluster on the lower left represents one type of flower. The iris description did mention that one group of flowers was linearly separable from the rest. Let's plot each iris type in a different color.

In [None]:
#same as above, but plot each iris type with a different color

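One way to draw the colored version is to call `scatter` once per iris type, so matplotlib assigns each type its own color and legend entry. This is a sketch, not the only way to do it; in a notebook the figure renders at the end of the cell, while in a plain script you would add `plt.show()`.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
irisdf = pd.DataFrame(data=iris['data'], index=range(1,151,1), columns=iris['feature_names'])
ylabels = pd.Series(data=iris['target'], index=range(1,151,1), name="iris type")

fig, ax = plt.subplots()
#one scatter call per iris type; each call gets the next color in the default cycle
for code, name in enumerate(iris['target_names']):
    subset = irisdf.loc[ylabels == code]
    ax.scatter(subset["petal length (cm)"], subset["petal width (cm)"], label=name)

ax.set_xlabel("petal length (cm)")
ax.set_ylabel("petal width (cm)")
ax.legend()
```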

It appears that setosa iris flowers are indeed separate from the other two types. Let's see if we can build a simple classifier to predict iris type based on these two features.

### Linear regression as a classifier

Many of you are probably familiar with linear regression as fitting a line to the data, but today we're going to take it one step further and use linear regression as a classifier. For the sklearn linear regression documentation: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

First, we'll create a new target series so that the setosa flowers will have a value of 1 and the others will have a value of 0.

In [None]:
#build a binary target: 1 for setosa (original label 0), 0 for the other two types
#the comparison creates a new series, so the original ylabels are left unchanged
setosalabs = (ylabels == 0).astype(int)
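One thing to watch when building a new target series: assigning a pandas series to a second name does not copy it, so in-place edits made through either name change the same data. Here is a small sketch with a toy series (the values are made up for illustration).

```python
import pandas as pd

original = pd.Series([0, 1, 2, 0], name="toy labels")

alias = original               #same object under a second name, nothing is copied
independent = original.copy()  #separate object with its own data

alias[alias != 0] = 5          #edits through the alias also show up in original

print(original.tolist())       #changed along with the alias
print(independent.tolist())    #still holds the starting values

#a comparison always returns a new series, so it is a safe way to recode labels
binary = (independent == 0).astype(int)
print(binary.tolist())
```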

Then we'll split the data into train/test folds using a 50/50 split.

In [None]:
def splitdata(data, testratio):
    #set seed so train and test will always split the same
    np.random.seed(42)
    shuffindices = np.random.permutation(len(data))
    testsize = int(len(data) * testratio)
    testindices = shuffindices[:testsize]
    trainindices = shuffindices[testsize:]
    return data.iloc[trainindices], data.iloc[testindices]

In [None]:
#use the above function to split the data
iristrain, iristest = splitdata(irisdf, 0.5)
ytrain, ytest = splitdata(setosalabs, 0.5) #the split function is seeded so both will split along the same indices
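For comparison, scikit-learn ships its own helper, `train_test_split`, which shuffles the features and labels together in one call. The sketch below is an alternative to the hand-written `splitdata`, not what this notebook uses; the variable names ending in `alt` are made up here, and the rows it picks will differ from the split above.

```python
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
irisdf = pd.DataFrame(data=iris['data'], index=range(1,151,1), columns=iris['feature_names'])
ylabels = pd.Series(data=iris['target'], index=range(1,151,1), name="iris type")
setosalabs = (ylabels == 0).astype(int)

#features and labels are split along the same shuffled indices in a single call
#random_state fixes the shuffle so the split is reproducible
trainalt, testalt, ytrainalt, ytestalt = train_test_split(
    irisdf, setosalabs, test_size=0.5, random_state=42)

print(trainalt.shape, testalt.shape)
```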

Fit the linear regression model

In [None]:
#fit the linear model to predict setosa vs. not setosa (ytrain) from the two petal features
reg = LinearRegression().fit(iristrain.loc[:,["petal length (cm)","petal width (cm)"]], ytrain)

In [None]:
#how good is the fit based on the R^2 coefficient of determination
reg.score(iristrain.loc[:,["petal length (cm)","petal width (cm)"]], ytrain)
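To see what `score` reports, the sketch below computes the coefficient of determination by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares, and compares it with `reg.score`. For brevity it fits on all 150 flowers rather than the train fold, so the number will not match the cell above exactly.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression

iris = datasets.load_iris()
X = iris['data'][:, 2:4]                 #petal length and petal width
y = (iris['target'] == 0).astype(int)    #1 for setosa, 0 otherwise

reg = LinearRegression().fit(X, y)
fitted = reg.predict(X)

#R^2 = 1 - (residual sum of squares) / (total sum of squares)
ssres = np.sum((y - fitted) ** 2)
sstot = np.sum((y - y.mean()) ** 2)
r2byhand = 1 - ssres / sstot

print(r2byhand, reg.score(X, y))
```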

Predict the held out test data using the linear regression model trained on the train fold of the data

In [None]:
#use the regression to predict the iris type
predictions = reg.predict(iristest.loc[:,["petal length (cm)","petal width (cm)"]])
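The regression returns a continuous score for each flower rather than a 0 or 1 label. To turn scores into class labels you have to pick a cutoff. The sketch below uses 0.5, which is an assumption made here (the midpoint between the two label values), and a single cutoff like this is not guaranteed to classify every flower correctly. The ROC curve sidesteps the choice by sweeping over all cutoffs.

```python
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LinearRegression

def splitdata(data, testratio):
    #same seeded split as in the notebook
    np.random.seed(42)
    shuffindices = np.random.permutation(len(data))
    testsize = int(len(data) * testratio)
    return data.iloc[shuffindices[testsize:]], data.iloc[shuffindices[:testsize]]

iris = datasets.load_iris()
petals = pd.DataFrame(iris['data'][:, 2:4], columns=iris['feature_names'][2:4])
issetosa = pd.Series((iris['target'] == 0).astype(int), name="is setosa")

xtrain, xtest = splitdata(petals, 0.5)
ytrain, ytest = splitdata(issetosa, 0.5)

reg = LinearRegression().fit(xtrain, ytrain)
scores = reg.predict(xtest)

#call a flower setosa when its score reaches the (assumed) 0.5 cutoff
predictedlabels = (scores >= 0.5).astype(int)

#fraction of held-out flowers whose predicted label matches the true label
accuracy = (predictedlabels == ytest.to_numpy()).mean()
print(accuracy)
```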

Now let's evaluate how well the linear regression did in predicting the held-out test set by plotting the receiver operating characteristic (ROC) curve.

In [None]:
#calculate the false and true positive rate at various thresholds
fpr, tpr, thresholds = metrics.roc_curve(y_true = ytest, y_score = predictions, pos_label = 1)
auroc = metrics.roc_auc_score(y_true = ytest, y_score = predictions)

plt.plot(fpr, tpr)
plt.plot([0,1],[0,1],'k--')  #y=x line
plt.xlabel("false positive rate")
plt.ylabel("true positive rate")
plt.text(0.7, 0.1, r'$AUC=%.3f$' %auroc, fontsize=16)
plt.show()

The performance here is perfect, as it should be given that setosa is entirely linearly separable (even visually) from the other two flower types in the dimensions of petal length and width, as seen in the above scatter plot.
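To make the ROC curve less of a black box, here is a sketch of what happens at a single threshold: everything scoring at or above the threshold is called positive, and the true and false positive rates are counted from that. The labels and scores are toy values made up for illustration, and `ratesat` is a helper name invented here. `roc_curve` repeats this kind of calculation over many thresholds.

```python
import numpy as np

#toy data: true labels and the scores a model might assign
ytrue = np.array([1, 1, 1, 0, 0, 0, 0, 1])
scores = np.array([0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1, 0.7])

def ratesat(threshold):
    #everything scoring at or above the threshold is predicted positive
    predictedpositive = scores >= threshold
    truepositives = np.sum(predictedpositive & (ytrue == 1))
    falsepositives = np.sum(predictedpositive & (ytrue == 0))
    tpr = truepositives / np.sum(ytrue == 1)   #share of real positives that were caught
    fpr = falsepositives / np.sum(ytrue == 0)  #share of real negatives wrongly flagged
    return fpr, tpr

#lowering the threshold can only keep or raise both rates
for threshold in [0.5, 0.35]:
    print(threshold, ratesat(threshold))
```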

# Allen Institute mouse brain in situ hybridization data

The Allen Institute generated a transcriptome-wide whole-brain in situ hybridization dataset published back in 2007. I will give a fuller description of the data during lecture. We will be using a subset of this data for the following section. A web interface for the data can be found here: https://mouse.brain-map.org/

I've taken care of a lot of the pre-processing necessary for this data. The specific steps I took will be described in the lecture. In addition, I've only given you half of the total voxels (chosen randomly) and the top 100 genes that are highly expressed in the thalamus, but not in the rest of the brain. Pre-processing and filtering data is a large part of data analysis. However, I've chosen to give you filtered and subsetted data, because the original dataset is far too large to tractably compute on your individual computers. You have had and will get more practice in filtering data with Ben.

### Read in data

First, download the two data files from my GitHub repository under urpcourse19/data/lecture7/.

In [None]:
#read in data
#remember, the infile path will change for you depending on where you've downloaded the data
infile1 = "/home/slu/urpcourse19/data/lecture7/ABAISHsubset.csv"
infile2 = "/home/slu/urpcourse19/data/lecture7/ABAISHsubset_labels.csv"
abasubset = pd.read_csv(infile1, index_col=0)
labels = pd.read_csv(infile2, index_col=0, header=None)

In [None]:
#look at the shape and the beginning of the dataframe to make sure everything looks good


In [None]:
#split the data and ylabels into train and test folds
#abatrain, abatest = 
#ytrain, ytest = 

### Linear regression

In [None]:
#fit the linear model
#reg = 

In [None]:
#how good is the fit based on the R^2 coefficient of determination


Predict the held out test data using the linear regression model trained on the train fold of the data

In [None]:
#use the regression to predict thalamus membership for the held-out voxels
#predictions =

Evaluate the performance of using linear regression as a classifier

In [None]:
#calculate the false and true positive rate at various thresholds


The performance here is very good too! This won't always be the case. I chose features (genes) that are highly differentially expressed, and they do a very good job of distinguishing samples that belong to the thalamus from those that belong to the rest of the brain.
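If you get stuck on the cells above, the whole workflow (split, fit, predict, AUROC) can be sketched end to end as below. To keep the sketch runnable without the course files it uses a synthetic stand-in generated with numpy: `fakeexpr`, `fakelabels`, the sizes and the size of the expression shift are all made up here and say nothing about the real data. With the real files you would pass `abasubset` and a single label column instead, assuming the labels file holds one 0/1 value per voxel.

```python
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LinearRegression

def splitdata(data, testratio):
    #same seeded split as in the iris section
    np.random.seed(42)
    shuffindices = np.random.permutation(len(data))
    testsize = int(len(data) * testratio)
    return data.iloc[shuffindices[testsize:]], data.iloc[shuffindices[:testsize]]

#synthetic stand-in: 200 "voxels" by 5 "genes", NOT the Allen Institute data
rng = np.random.default_rng(0)
nvoxels, ngenes = 200, 5
fakelabels = pd.Series(rng.integers(0, 2, size=nvoxels), name="in region")
#expression is noise, shifted upward for voxels labelled 1
fakeexpr = pd.DataFrame(
    rng.normal(size=(nvoxels, ngenes)) + 2.0 * fakelabels.to_numpy()[:, None],
    columns=["gene%d" % i for i in range(ngenes)])

#split features and labels along the same indices
xtrain, xtest = splitdata(fakeexpr, 0.5)
ytrain, ytest = splitdata(fakelabels, 0.5)

#fit on the train fold, score the held-out fold, summarise with the AUROC
reg = LinearRegression().fit(xtrain, ytrain)
predictions = reg.predict(xtest)
auroc = metrics.roc_auc_score(y_true=ytest, y_score=predictions)
print(auroc)
```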