# Classification and Prediction in GenePattern Notebook

This notebook will show you how to use k-Nearest Neighbors (kNN) to build a predictor, use it to classify leukemia subtypes, and assess its accuracy in cross-validation.

### k-Nearest Neighbors (kNN)
kNN classifies an unknown sample by assigning it the phenotype label most frequently represented among the k nearest known samples (Golub and Slonim et al., 1999).

Additionally, you can select a weighting factor for the 'votes' of the nearest neighbors. For example, you can weight each vote by the reciprocal of the distance to that neighbor, so that closer neighbors have a greater say.
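
The voting rule described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the GenePattern module's implementation: the function name `knn_predict`, the use of Euclidean distance, and the toy data are all assumptions made for the example.

```python
from collections import defaultdict

import numpy as np


def knn_predict(train_x, train_y, query, k=3, weighting="reciprocal"):
    """Classify `query` by a vote among its k nearest known samples (hypothetical helper)."""
    train_x = np.asarray(train_x, dtype=float)
    query = np.asarray(query, dtype=float)

    # Euclidean distance from the query to every known sample (an assumed metric).
    distances = np.linalg.norm(train_x - query, axis=1)

    # Indices of the k closest known samples.
    nearest = np.argsort(distances, kind="stable")[:k]

    votes = defaultdict(float)
    for idx in nearest:
        if weighting == "reciprocal":
            # Closer neighbors cast larger votes; the tiny constant avoids division by zero.
            weight = 1.0 / (distances[idx] + 1e-12)
        else:
            # Unweighted: every neighbor counts equally.
            weight = 1.0
        votes[train_y[idx]] += weight

    # Return the label with the largest total vote.
    return max(votes, key=votes.get)


# Toy data, made up for illustration: two features per sample.
samples = [[1.0, 1.0], [2.0, 2.0], [2.2, 1.8], [6.0, 6.0]]
labels = ["ALL", "AML", "AML", "ALL"]
query = [1.1, 1.0]

unweighted = knn_predict(samples, labels, query, k=3, weighting="none")
weighted = knn_predict(samples, labels, query, k=3, weighting="reciprocal")
print(unweighted, weighted)
```

In this toy case the unweighted vote goes to the two more distant AML samples, while the reciprocal weighting lets the single very close ALL sample win.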

## 1. Log in to GenePattern
<div class="alert alert-warning">
<ul>
<li>Select Broad Institute as the server.
<li>Enter your username and password.
<li>Click Run.
<li>When you are logged in, you can click the - button in the upper right hand corner to collapse the cell.
</ul>
</div>

In [4]:
# !AUTOEXEC

%reload_ext genepattern

# Don't have the GenePattern Notebook? It can be installed from PIP: 
# pip install genepattern-notebook 
import gp

# The following widgets are components of the GenePattern Notebook extension.
try:
    from genepattern import GPAuthWidget, GPJobWidget, GPTaskWidget
except ImportError:
    def GPAuthWidget(input):
        print("GP Widget Library not installed. Please visit http://genepattern.org")
    def GPJobWidget(input):
        print("GP Widget Library not installed. Please visit http://genepattern.org")
    def GPTaskWidget(input):
        print("GP Widget Library not installed. Please visit http://genepattern.org")

# The gpserver object holds your authentication credentials and is used to
# make calls to the GenePattern server through the GenePattern Python library.
# Your actual username and password have been removed from the code shown
# below for security reasons.
gpserver = gp.GPServer("https://genepattern.broadinstitute.org/gp", "", "")

# Return the authentication widget to view it
GPAuthWidget(gpserver)

## 2. Preprocess gene expression data
- The PreprocessDataset module allows you to remove uninformative genes. These are genes whose values do not vary more than a certain amount between the two classes being compared.
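
To make the idea of removing uninformative genes concrete, here is a rough sketch of a variation filter. It assumes the filter clamps values to a floor and ceiling and then compares each gene's maximum and minimum across all samples; the function name `variation_filter` and the toy matrix are hypothetical, and the defaults simply mirror the parameter values used in the cell below. The actual module may differ in its details.

```python
import numpy as np


def variation_filter(expr, floor=20.0, ceiling=20000.0,
                     min_fold_change=3.0, min_delta=100.0):
    """Return a boolean mask of rows (genes) that vary enough to keep (hypothetical helper)."""
    # Clamp every value into [floor, ceiling] before measuring variation.
    clipped = np.clip(np.asarray(expr, dtype=float), floor, ceiling)

    row_max = clipped.max(axis=1)
    row_min = clipped.min(axis=1)

    # Keep a gene only if it varies enough both relatively (fold change)
    # and absolutely (delta). The floor is positive, so row_min is never zero.
    return (row_max / row_min >= min_fold_change) & (row_max - row_min >= min_delta)


# Toy matrix, made up for illustration: rows are genes, columns are samples.
expression = np.array([
    [100.0, 110.0, 95.0, 105.0],   # nearly flat: fails both tests
    [50.0, 400.0, 30.0, 900.0],    # varies widely: passes both tests
    [5.0, 25.0, 10.0, 70.0],       # fold change 3.5 after flooring, but delta is only 50
])
keep = variation_filter(expression)
print(keep)
```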

<div class="alert alert-warning">
- Click Run in the GenePattern cell below to launch the PreprocessDataset module.
- When the job is complete, the status in the upper right corner of the cell will display "Complete".
</div>

In [5]:
# !AUTOEXEC

preprocessdataset_task = gp.GPTask(gpserver, 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00020:5.1')
preprocessdataset_job_spec = preprocessdataset_task.make_job_spec()
preprocessdataset_job_spec.set_parameter("input.filename", "http://software.broadinstitute.org/cancer/software/genepattern/data/all_aml/all_aml_train.gct")
preprocessdataset_job_spec.set_parameter("threshold.and.filter", "1")
preprocessdataset_job_spec.set_parameter("floor", "20")
preprocessdataset_job_spec.set_parameter("ceiling", "20000")
preprocessdataset_job_spec.set_parameter("min.fold.change", "3")
preprocessdataset_job_spec.set_parameter("min.delta", "100")
preprocessdataset_job_spec.set_parameter("num.outliers.to.exclude", "0")
preprocessdataset_job_spec.set_parameter("row.normalization", "0")
preprocessdataset_job_spec.set_parameter("row.sampling.rate", "1")
preprocessdataset_job_spec.set_parameter("threshold.for.removing.rows", "")
preprocessdataset_job_spec.set_parameter("number.of.columns.above.threshold", "")
preprocessdataset_job_spec.set_parameter("log2.transform", "0")
preprocessdataset_job_spec.set_parameter("output.file.format", "3")
preprocessdataset_job_spec.set_parameter("output.file", "<input.filename_basename>.preprocessed")
GPTaskWidget(preprocessdataset_task)

## 3. Run k-Nearest Neighbors Cross Validation
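Cross-validation estimates accuracy by repeatedly holding samples out, training on the rest, and predicting the held-out samples. The sketch below assumes the leave-one-out variant with an unweighted kNN vote and Euclidean distance; the function name `loocv_knn` and the toy data are hypothetical and are not taken from the KNNXvalidation module.

```python
from collections import Counter

import numpy as np


def loocv_knn(features, labels, k=3):
    """Leave-one-out cross-validation of an unweighted kNN classifier (hypothetical helper)."""
    features = np.asarray(features, dtype=float)
    predictions = []
    for held_out in range(len(labels)):
        # Train on every sample except the held-out one.
        train_idx = [i for i in range(len(labels)) if i != held_out]
        distances = np.linalg.norm(features[train_idx] - features[held_out], axis=1)

        # Majority vote among the k nearest training samples.
        nearest = np.argsort(distances, kind="stable")[:k]
        votes = Counter(labels[train_idx[j]] for j in nearest)
        predictions.append(votes.most_common(1)[0][0])

    # Accuracy is the fraction of held-out samples predicted correctly.
    correct = sum(p == t for p, t in zip(predictions, labels))
    return predictions, correct / len(labels)


# Toy data, made up for illustration: two separated groups plus one AML sample
# that sits among the ALL samples.
features = [[1.0], [1.1], [0.9], [1.2], [5.0], [5.1], [4.9], [1.3]]
labels = ["ALL", "ALL", "ALL", "ALL", "AML", "AML", "AML", "AML"]

predictions, accuracy = loocv_knn(features, labels, k=3)
print(predictions, accuracy)
```

Only the last sample is misclassified here, because its three nearest neighbors all belong to the other class.
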
In the result cell for the PreprocessDataset job, you will see two files.
<div class="alert alert-warning">Click the "i" icon next to the all_aml_train.preprocessed.gct file.</div>
You will see a dialog box with several options.

<div class="alert alert-warning">
- Select "Send to existing GenePattern Cell"
- Choose "KNNXvalidation"
- Click Run.
</div>

In [6]:
# !AUTOEXEC

knnxvalidation_task = gp.GPTask(gpserver, 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00013:6')
knnxvalidation_job_spec = knnxvalidation_task.make_job_spec()
knnxvalidation_job_spec.set_parameter("data.filename", "")
knnxvalidation_job_spec.set_parameter("class.filename", "http://software.broadinstitute.org/cancer/software/genepattern/data/all_aml/all_aml_train.cls")
knnxvalidation_job_spec.set_parameter("num.features", "10")
knnxvalidation_job_spec.set_parameter("feature.selection.statistic", "0")
knnxvalidation_job_spec.set_parameter("min.std", "")
knnxvalidation_job_spec.set_parameter("num.neighbors", "3")
knnxvalidation_job_spec.set_parameter("weighting.type", "1")
knnxvalidation_job_spec.set_parameter("distance.measure", "1")
knnxvalidation_job_spec.set_parameter("pred.results.file", "<data.filename_basename>.pred.odf")
knnxvalidation_job_spec.set_parameter("feature.summary.file", "<data.filename_basename>.feat.odf")
GPTaskWidget(knnxvalidation_task)

<div class="alert alert-warning">
- Click the "i" icon next to the `all_aml_train.preprocessed.pred.odf` file.
- Select "View Code use".
- Select and copy the reference to the output file, for example `job1306740.get_output_files()[1]` (do NOT include the "this file = " part).
- Paste the result into the code below to replace **INSERT PASTED CODE HERE**.
- The resulting line should look like `prediction_results_filename = job1306740.get_output_files()[1]`.
- Execute the cell below.
</div>

In [None]:
prediction_results_filename = **INSERT PASTED CODE HERE**
prediction_results_file = prediction_results_filename.open()

## 4. View prediction results
### a. Download result file and parse it into data objects
- Execute the following two cells:

In [None]:
%matplotlib inline

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Read the result file into a list of lines. read() may return bytes
# (depending on the Python and library versions), so decode defensively.
predres = prediction_results_file.read()
if isinstance(predres, bytes):
    predres = predres.decode("utf-8")
lines = predres.splitlines()

# Extract column names and types from the ODF header:
col_names = lines[2].split("\t")
col_names[0] = col_names[0].replace("COLUMN_NAMES:", "")

col_types = lines[3].split("\t")
col_types[0] = col_types[0].replace("COLUMN_TYPES:", "")

# The last header line gives the number of data rows that follow:
data_rows = int(lines[9].replace("DataLines=", ""))

# Parse exactly data_rows tab-separated rows after the header:
result_data = [line.split("\t") for line in lines[10:10 + data_rows]]

df = pd.DataFrame(result_data, columns=col_names)

# Sort the labels so the plot orientation is the same on every run:
class_labels = sorted(set(df["True Class"]))

# All columns are read as strings; convert the confidence values to numbers:
confidence = pd.to_numeric(df["Confidence"])

### b. Create a bar plot of prediction results
Execute the cell below. You will see a bar graph of class predictions.
- The direction of each bar indicates which class was predicted
- The length of each bar indicates the confidence level
- Blue = correct prediction
- Red = incorrect prediction

In [None]:
result_bars = list()
bar_colors = list()
tick_labels = list()

for i in range(0, data_rows):
    tick_labels.append(df["Samples"][i])
    if df["Predicted Class"][i] == class_labels[1]:
        result_bars.append(confidence[i])
    else:
        result_bars.append(-confidence[i])
        
    if df["Correct?"][i] == "true":
        bar_colors.append('b')
    else:
        bar_colors.append('r')

ind = np.arange(data_rows)  # the x locations for the groups
width = 0.8
plt.style.use('ggplot')
fig = plt.figure(figsize=(16,12))
ax = fig.add_subplot(111)

# Set figure and axis titles
plt.title(class_labels[0]+" versus "+class_labels[1]+" Prediction Results")
plt.xlabel("Sample Name")
plt.ylabel("Confidence value")
plt.axis([0,data_rows,-1.25,1.25])
plt.text(0.2, -1.15, "Predicted " + class_labels[0])
plt.text(data_rows, 1.05, "Predicted " + class_labels[1], horizontalalignment='right')
plt.grid(True)

# Plot bar chart of predicted classes, labeling each bar with its sample name
rects1 = ax.bar(ind, result_bars, color=bar_colors, width=width)
plt.xticks(ind, tick_labels, rotation=50)
plt.show()
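
Alongside the plot, a numeric summary can help when assessing accuracy. The sketch below is one possible way to compute the overall accuracy and a confusion table with pandas. It uses a small made-up stand-in named `example_df` that has the same column names as the parsed results, so it runs on its own; with real results you would use the parsed DataFrame instead.

```python
import pandas as pd

# Stand-in for the parsed prediction results (values made up for illustration).
example_df = pd.DataFrame({
    "Samples": ["s1", "s2", "s3", "s4"],
    "True Class": ["ALL", "ALL", "AML", "AML"],
    "Predicted Class": ["ALL", "AML", "AML", "AML"],
    "Correct?": ["true", "false", "true", "true"],
})

# Fraction of samples whose predicted class matched the true class.
accuracy = (example_df["Correct?"] == "true").mean()

# Confusion table: rows are true classes, columns are predicted classes.
confusion = pd.crosstab(example_df["True Class"], example_df["Predicted Class"])

print("Accuracy:", accuracy)
print(confusion)
```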

## References

Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. 1984. [Classification and regression trees](https://www.amazon.com/Classification-Regression-Wadsworth-Statistics-Probability/dp/0412048418?ie=UTF8&*Version*=1&*entries*=0). Wadsworth & Brooks/Cole Advanced Books & Software, Monterey, CA.

Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., and Lander, E.S. 1999. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression. [Science 286:531-537](http://science.sciencemag.org/content/286/5439/531.long).

Lu, J., Getz, G., Miska, E.A., Alvarez-Saavedra, E., Lamb, J., Peck, D., Sweet-Cordero, A., Ebert, B.L., Mak, R.H., Ferrando, A.A., Downing, J.R., Jacks, T., Horvitz, H.R., Golub, T.R. 2005. MicroRNA expression profiles classify human cancers. [Nature 435:834-838](http://www.nature.com/nature/journal/v435/n7043/full/nature03702.html).

Rifkin, R., Mukherjee, S., Tamayo, P., Ramaswamy, S., Yeang, C-H, Angelo, M., Reich, M., Poggio, T., Lander, E.S., Golub, T.R., Mesirov, J.P. 2003. An Analytical Method for Multiclass Molecular Cancer Classification. [SIAM Review 45(4):706-723](http://epubs.siam.org/doi/abs/10.1137/S0036144502411986).

Slonim, D.K., Tamayo, P., Mesirov, J.P., Golub, T.R., Lander, E.S. 2000. Class prediction and discovery using gene expression data. In [Proceedings of the Fourth Annual International Conference on Computational Molecular Biology (RECOMB)](http://dl.acm.org/citation.cfm?id=332564). ACM Press, New York. pp. 263-272.