# Module 9 - Leukemia project  week 1

Today we will put together everything we have learned so far to develop a complete end-to-end analysis. We will be working with gene expression data from leukemia patients, acquired in 1999 and later [published in Science](https://doi.org/10.1126/science.286.5439.531). The paper demonstrated how new cases of cancer could be classified by gene expression monitoring (via DNA microarray) and thereby provided a general approach for identifying new cancer classes and assigning tumors to known classes. The data was used to classify patients with acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). Our exercise is to develop a model that discriminates between AML and ALL patients based only on this gene expression data and to compare our model with the published results of the 1999 paper.

## Setup
Let's get all the requirements sorted before we move on to the exercise.

In [None]:
# Requirements
!pip install --upgrade ipykernel
!pip install kaggle
!pip install pandas
!pip install tableone
!pip install numpy
!pip install matplotlib
!pip install scipy
!pip install seaborn
!pip install scikit-learn

# Globals
seed = 1017

#imports
import kaggle
import pandas as pd
import seaborn as sns
from tableone import TableOne
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats


#magic
%matplotlib inline

## Loading the data via kaggle API
The leukemia data set was sourced from Kaggle.
To download the data directly from [Kaggle](https://www.kaggle.com) you will need a Kaggle account. **It's free.** Once you have created your account you can generate an API token. After you log in you should see a circular account icon in the upper right of any Kaggle page. Clicking on the icon opens a sidebar on the right where you can select "Account" to edit your account. Scroll down to the API section and click the "Create New API Token" button. An API token should download automatically, and a prompt will tell you which directory to put the token in so that Python knows where to find it. For macOS users this location is `~/.kaggle/kaggle.json`. Once you have done this, modify the code below to download the dataset to the `data` folder distributed with this notebook.

In [None]:
#log in to kaggle using your api token
kaggle.api.authenticate()

#path relative to this notebook to put the data
datadir = 'data'

#name of the dataset on kaggle
dataset = 'crawford/gene-expression'

#download the data
kaggle.api.dataset_download_files(dataset, path=datadir, unzip=True)

## Loading the data
There are two datasets containing the initial (training, 38 samples) and independent (test, 34 samples) datasets used in the paper. These datasets contain measurements corresponding to ALL and AML samples from Bone Marrow and Peripheral Blood. Intensity values have been re-scaled such that overall intensities for each chip are equivalent.

In [None]:
# load the data as pandas dataframes
labels = pd.read_csv('data/actual.csv', index_col = 'patient')
test = pd.read_csv('data/data_set_ALL_AML_independent.csv')
train = pd.read_csv('data/data_set_ALL_AML_train.csv')

***Task*** Use the head() function to quickly have a look at the training data frame.

In [None]:
#Use the head() function to display the first few rows of the training data frame.
train.head()

The testing set is formatted the same as the training set. Notice that the gene description and accession number are given along with the expression intensity and the outcome (call) for each patient. The patient outcomes are also provided in the file `actual.csv`. I think it will be more convenient to use the outcomes in this file and delete the 'call' columns in both the training and testing sets. ***Question: What are the observational units of interest?***

## Formatting
***Task*** Remove the 'call' columns from the training and testing sets.

In [None]:
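#<--remove this for student version-->
cols = [col for col in test.columns if 'call' in col]
#use the columns keyword; a positional axis argument is not accepted by recent pandas versions
test = test.drop(columns=cols)

cols = [col for col in train.columns if 'call' in col]
train = train.drop(columns=cols)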

train.head()

Let's consider what the observational unit should be. ***Task*** Format the data to have observations in rows and features in columns.

In [None]:
# remove this for student version
train = train.T
test = test.T
train.head()

We can also remove the gene bookkeeping data because we will not use it today.

In [None]:
#remove the gene bookkeeping data.
train = train.drop(['Gene Description', 'Gene Accession Number'])
test = test.drop(['Gene Description', 'Gene Accession Number'])
train.head()

Now let's encode the outcomes for binary classification. We'll use zeros for the ALL outcomes and ones for AML. Remember that the first 38 patients were partitioned into the training set; the remainder are in the testing set.

In [None]:
#remove for student version
labels = labels.replace({'ALL':0,'AML':1})
labels_train = labels[labels.index <= 38]
labels_test = labels[labels.index > 38]
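The split above relies on patient IDs 1 to 38 being the training patients. Any later step that pairs `labels_train` with the rows of `train` also needs both tables to be in the same patient order. The following is a minimal sketch of how that alignment could be checked and enforced. It uses small made-up data frames: `toy_train`, `toy_labels`, and the `cancer` column name are hypothetical stand-ins, not the course data.

```python
import pandas as pd

# Hypothetical stand-ins for the transposed expression table and the label table.
# Rows of toy_train are patients; the index holds patient IDs as strings,
# which is what transposing a table with patient numbers as column headers gives.
toy_train = pd.DataFrame(
    {"gene_a": [10.0, 20.0, 30.0], "gene_b": [1.0, 2.0, 3.0]},
    index=["2", "3", "1"],  # deliberately not in numeric order
)
toy_labels = pd.DataFrame(
    {"cancer": [0, 1, 0]},
    index=pd.Index([1, 2, 3], name="patient"),
)

# Convert the patient IDs to integers so they match the label index
toy_train.index = toy_train.index.astype(int)

# Reorder the labels to follow the row order of the expression table
toy_labels_aligned = toy_labels.loc[toy_train.index]

# Each label row now refers to the same patient as the expression row at that position
assert (toy_labels_aligned.index == toy_train.index).all()
print(toy_labels_aligned["cancer"].tolist())
```

Selecting labels by patient ID with `.loc`, rather than by position, keeps the pairing correct even if the expression table's rows are not sorted.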

### Treat missing data
Before moving on to a table 1, let's look for and treat any missing data. Remember to check for values that don't make sense. I think replacing with the mean value would be a reasonable imputation strategy. ***Task*** Check for unreasonable and missing values and impute them with the column mean.

In [None]:
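#remove in student version

#make sure the expression values are numeric (transposing leaves object dtype)
train = train.astype(float)
test = test.astype(float)

#remove zeros
#train = train.replace(0, np.nan)
#test = test.replace(0, np.nan)

#treat infinite values as missing
train = train.replace([np.inf, -np.inf], np.nan)
test = test.replace([np.inf, -np.inf], np.nan)

#impute with the column means, computed on the training set only
col_means = train.mean()
train = train.fillna(col_means)
test = test.fillna(col_means)

#KNN imputation -- don't bother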


***Question*** How would you go about visualizing the data we just formatted? Why can't I just make a table 1?

In [None]:
#remove from student version
from sklearn import preprocessing
from sklearn.decomposition import PCA

#Do a PCA for all data (this should probably be done only with the training data)
nPC=10
df_all = pd.concat([train, test], ignore_index=True)  #DataFrame.append was removed in pandas 2.0
X_all = preprocessing.StandardScaler().fit_transform(df_all)
pca = PCA(n_components=nPC, random_state=seed)
X_pca = pca.fit_transform(X_all)
print(X_pca.shape)

In [None]:
#remove from student version
from sklearn import preprocessing
from sklearn.decomposition import PCA

#Do a PCA fit on the training data only
nPC=10
pca = PCA(n_components=nPC, random_state=seed)

#define standard scaler
scaler = preprocessing.StandardScaler()

#fit pca on the training data
#df_all = train.append(test, ignore_index=True)
X_scaled = scaler.fit_transform(train)
X_train_pca = pca.fit_transform(X_scaled)

#transform the testing data using the fit scaler and pca objects
X_test_pca = pca.transform(scaler.transform(test))

print(X_train_pca.shape)

### Visualize Engineered Features
Let's plot the feature distributions.

In [None]:
#plot the principal component distributions
for PC in range(nPC):
    sns.kdeplot(data=X_train_pca[:, PC])
    #X_pca[:,nPC].plot.kde(bw_method='scott') #use bw_method=.02 for a lower bandwidth gaussian representation
    plt.legend(["PC" + str(PC)])
    plt.show()

In [None]:
# remove in student version 

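# rescale the engineered features if necessary
# (a separate scaler is used so the scaler fit on the gene data is not overwritten)
pc_scaler = preprocessing.StandardScaler()
X_train_pca = pc_scaler.fit_transform(X_train_pca)
X_test_pca = pc_scaler.transform(X_test_pca)

#plot the principal component distributions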
for PC in range(nPC):
    sns.kdeplot(data=X_test_pca[:, PC])
    #X_pca[:,nPC].plot.kde(bw_method='scott') #use bw_method=.02 for a lower bandwidth gaussian representation
    plt.legend(["PC" + str(PC)])
    plt.show()

## Capstone Project
We have chosen a more [challenging Acute Myeloid Leukemia dataset](https://www.synapse.org/#!Synapse:syn2455683/wiki/64007) to test your ML skills. You will have to request access to this data through the link provided and follow the directions to the 'request access' form. Originally a DREAM challenge, this dataset contains multiple outcomes such as disease relapse and response to treatment. Both classification and regression tasks are possible. The data represents over 200 leukemia patients and includes high-dimensional genomics data in addition to clinical covariates. Your challenge is to develop models that address these outcomes completely end-to-end over the next few weeks. We will review your analysis in the final week of the course. We will check in periodically to make sure your analyses are going smoothly, but in the meantime your analysis should generally follow these beats.
+ ***Format*** the data with observations in rows and features in columns
+ ***Manually exclude data*** like bookkeeping variables
+ ***Normalize*** Data
+ ***Treat missing*** Data
+ Choose and employ a ***Data Partitioning*** scheme
+ Do ***Feature selection*** for Clinical covariates and ***Dimension Reduction*** for omics data.
+ ***Choose a model*** with the highest performance potential
+ ***Tune your model***
+ Calculate ***Performance metrics***
+ Report ***Model Predictions***
+ ***Interpret*** results
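Several of the beats above (treating missing data, normalizing, dimension reduction, choosing a model, partitioning, and performance metrics) can be chained together so that every preprocessing step is fit only on the training portion of each partition. The following is a minimal sketch using scikit-learn on synthetic data. The data, the choice of logistic regression, the number of components, and the order of imputation and scaling are all illustrative assumptions, not recommendations for the capstone dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

seed = 1017
rng = np.random.default_rng(seed)

# Synthetic stand-in data: 60 patients (rows) by 200 features (columns)
n_patients, n_features = 60, 200
y = np.repeat([0, 1], n_patients // 2)
X = rng.normal(size=(n_patients, n_features))
X[y == 1, :10] += 2.0  # make the first 10 features informative
X[rng.random(X.shape) < 0.01] = np.nan  # sprinkle in some missing values

# Chain the preprocessing steps so that each one is fit on the training folds only
model = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("reduce", PCA(n_components=5, random_state=seed)),
    ("classify", LogisticRegression(max_iter=1000)),
])

# Data partitioning: stratified 5-fold cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print("Mean cross-validated accuracy:", scores.mean())
```

Wrapping the steps in a `Pipeline` means the imputer, scaler, and PCA never see the held-out fold, which avoids leaking information from the evaluation data into the model.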


## Pathway Representation
PCA is one of the most common dimension reduction schemes; however, interpreting the biology from principal components is difficult. The PCs explain fluctuations in the data itself, not necessarily the biological relationships at play. It would be better to choose a representation that is more convenient for interpretation in a clinical setting. Let us aggregate the data of related genes into a value that represents the pathways they participate in. Finding the predictive pathways may be more useful to explain the biology, design future in vivo models, and drive treatment development.

The goal is to assign a number to each pathway represented in our dataset and use these numbers as the predictive features for each patient. For each pathway we will need to find all the genes in our dataset that participate in the pathway and describe how these related genes alone fluctuate. We'll use a PCA to do this and take the value of the first PC as the value of the pathway feature. We will then normalize this pathway representation data before moving on to feature selection.

### Convert Affymetrix Gene IDs to KEGG IDs
The source data has genes in the Affymetrix format. We would like to use KEGG IDs because of the convenient Python tools available for KEGG pathway analysis.

[g:Convert](http://www.protocol-online.org/cgi-bin/prot/jump.cgi?ID=4145) is a gene identifier tool that allows conversion of genes, proteins, microarray probes, standard names, various database identifiers, etc. A mix of IDs of different types may be entered into g:Convert. The user needs to select a target database; all input IDs will be converted into the target database format. Input IDs that have no corresponding entry in the target database will be displayed as N/A.
 
Let's use the g:Convert web tool to convert the Affymetrix IDs to KEGG enzyme IDs. Copy and paste the "Gene Accession Number" column in the training set CSV file into the Query box in g:Convert. Exclude the column header when you paste. Select "Homo sapiens (Human)" in the organism box and select "KEGG_ENZYME" in the target namespace box. Press the "Run query" button to convert the IDs. When the query finishes you can download the results as a CSV file and save it in the data folder.

In [None]:
#Let's look at the g:Convert file
gprof = pd.read_csv('data/gProfiler_hsapiens.csv')
gprof.head()

The corresponding gene names can be found in the 'name' column. Notice that many initial aliases map to the same gene name. For every unique gene name, let's get the pathways it participates in. We'll closely follow the [KEGG tutorial](https://bioservices.readthedocs.io/en/master/kegg_tutorial.html) to show us how.

In [None]:
#install bioservices and import KEGG
!pip install bioservices
from bioservices.kegg import KEGG

In [None]:
#start the kegg interface
ntrface = KEGG() 

#Example search for pathways by gene name
ntrface.get_pathway_by_gene("GAPDH", "hsa")

In [None]:
#get unique genes in gprof dataframe
genes = gprof['name'].unique()
genes = genes[0:50] #limit to first 50 genes so it runs quickly
len(genes)

In [None]:
%%capture
#init dictionary to hold pathway to gene mapping
pathway_dict = {}

#loop over unique genes in the dataset
for gene in genes:
    #get pathways associated with gene
    pathways = ntrface.get_pathway_by_gene(str(gene), "hsa")
    #loop over associated pathways
    try:
        for pathway, desc in pathways.items():
            #check if pathway is new
            if pathway not in pathway_dict:
                #add new pathway and init gene list
                pathway_dict[pathway] = [str(gene)]
            else:
                # add gene to the known pathway's list of genes
                pathway_dict[pathway].append(str(gene))
    except AttributeError:
        print("No pathways found for " + str(gene))

In [None]:
#display dictionary as pandas Series so we can show the head
display(pd.Series(pathway_dict).head())

In [None]:
#remove from student version
from sklearn import preprocessing
from sklearn.decomposition import PCA

#Use a single principal component per pathway, fit on the training data only
nPC=1
pca = PCA(n_components=nPC, random_state=seed)

#define standard scaler
scaler = preprocessing.StandardScaler()

#loop over pathways
for pathway in pathway_dict:
    #TODO: get Gene Accession names for every Gene in pathway
    
    #TODO: select the columns representing the genes in the pathway
    
    #fit pca on the training data
    #(until the two steps above are filled in, this uses all genes for every pathway)
    X_scaled = scaler.fit_transform(train)
    X_train_pca = pca.fit_transform(X_scaled)

    #transform the testing data using the fit scaler and pca objects
    X_test_pca = pca.transform(scaler.transform(test))

    print(X_train_pca.shape)
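The loop above leaves the gene selection steps as placeholders. The following is a minimal sketch of the pathway representation on synthetic data: one first-PC value per pathway per patient. The gene names, `toy_pathways`, `toy_train`, and `toy_test` are hypothetical. Linking the gene names in `pathway_dict` back to the columns of `train` would additionally require the gene-to-accession mapping from the g:Convert file, which is not shown here.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

seed = 1017
rng = np.random.default_rng(seed)

# Synthetic stand-in data: patients in rows, genes in columns (hypothetical gene names)
gene_names = ["G1", "G2", "G3", "G4", "G5"]
toy_train = pd.DataFrame(rng.normal(size=(20, 5)), columns=gene_names)
toy_test = pd.DataFrame(rng.normal(size=(8, 5)), columns=gene_names)

# Hypothetical pathway-to-gene mapping, shaped like pathway_dict above
toy_pathways = {"pathway_A": ["G1", "G2", "G3"], "pathway_B": ["G3", "G4", "G5"]}

train_features = {}
test_features = {}
for pathway, pathway_genes in toy_pathways.items():
    # keep only the genes of this pathway that are present in the data
    cols = [g for g in pathway_genes if g in toy_train.columns]
    if not cols:
        continue

    # fit the scaler and a one-component PCA on the training data only
    scaler = StandardScaler()
    pca = PCA(n_components=1, random_state=seed)
    train_pc1 = pca.fit_transform(scaler.fit_transform(toy_train[cols]))

    # project the test data with the objects fit on the training data
    test_pc1 = pca.transform(scaler.transform(toy_test[cols]))

    train_features[pathway] = train_pc1[:, 0]
    test_features[pathway] = test_pc1[:, 0]

# one column per pathway, one row per patient
pathway_train = pd.DataFrame(train_features, index=toy_train.index)
pathway_test = pd.DataFrame(test_features, index=toy_test.index)
print(pathway_train.shape, pathway_test.shape)
```

A fresh scaler and PCA are created for each pathway so that every pathway feature is computed from its own genes only.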