# Classification and Prediction in GenePattern Notebook

This notebook will show you how to use k-Nearest Neighbors (kNN) to build a predictor, use it to classify BRCA tumor and normal samples, and assess its accuracy in cross-validation.

### k-Nearest Neighbors (kNN)
KNN classifies an unknown sample by assigning it the phenotype label most frequently represented among the k nearest known samples. 

Additionally, you can select a weighting factor for the 'votes' of the nearest neighbors. For example, one might weight the votes by the reciprocal of the distance between neighbors to give closer neighbors a greater vote.
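
To make the voting rule concrete, here is a minimal sketch of kNN classification with optional reciprocal-distance weighting. It is an illustration only, not the KNNXValidation implementation: the function name `knn_predict`, the use of Euclidean distance, and the toy data are all assumptions made for this example.

```python
import numpy as np
from collections import defaultdict

def knn_predict(train_X, train_y, sample, k=3, weighted=True):
    """Classify `sample` by an (optionally distance-weighted) vote of its k nearest neighbors."""
    # Euclidean distance from the unknown sample to every known sample (assumed metric)
    distances = np.linalg.norm(train_X - sample, axis=1)
    nearest = np.argsort(distances)[:k]

    votes = defaultdict(float)
    for i in nearest:
        # Reciprocal-distance weighting; the small constant avoids division by zero
        weight = 1.0 / (distances[i] + 1e-12) if weighted else 1.0
        votes[train_y[i]] += weight
    return max(votes, key=votes.get)

# Toy data (hypothetical): two features per sample, two phenotype labels
train_X = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0], [10.0, 10.0]])
train_y = ["tumor", "normal", "normal", "normal"]
sample = np.array([0.5, 0.5])

unweighted_label = knn_predict(train_X, train_y, sample, k=3, weighted=False)
weighted_label = knn_predict(train_X, train_y, sample, k=3, weighted=True)
print(unweighted_label, weighted_label)
```

In this toy case the unweighted vote goes to the majority label among the three neighbors, while the weighted vote favors the single very close neighbor, which is the effect the weighting is meant to have.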

<h2>1. Log in to GenePattern</h2>

<div class="alert alert-info">
<h3 style="margin-top: 0;"> Instructions <i class="fa fa-info-circle"></i></h3>
<ul>
	<li>Enter your username and password <i>(if needed)</i> in the GenePattern login cell below.</li>
	<li>Click <em>Log into GenePattern</em>.</li>
	<li>Alternatively, if you are prompted to Login as your username, just click that button and give it a couple seconds to authenticate.</li>
</ul>
</div>

In [8]:
# Requires GenePattern Notebook: pip install genepattern-notebook
import gp
import genepattern

# Username and password removed for security reasons.
genepattern.display(genepattern.session.register("https://cloud.genepattern.org/gp", "", ""))


## 2. Run k-Nearest Neighbors Cross Validation

<div class="alert alert-info">
<h3 style="margin-top: 0;"> Instructions <i class="fa fa-info-circle"></i></h3>
<ol>
<li>Select <a class="nbtools-markdown-file" href="https://datasets.genepattern.org/data/ccmi_tutorial/2017-12-15/BRCA_HUGO_symbols.preprocessed.gct">BRCA_HUGO_symbols.preprocessed.gct</a> in the <strong>data filename</strong> field below.
</li><li>Select <a class="nbtools-markdown-file" href="https://datasets.genepattern.org/data/ccmi_tutorial/2017-12-15/BRCA_HUGO_symbols.preprocessed.cls">BRCA_HUGO_symbols.preprocessed.cls</a> in the <strong>class filename</strong> field.
    </li><li>Click <strong>Run</strong>.</li>
</ol>
</div>

In [9]:
import warnings
warnings.filterwarnings('ignore')

knnxvalidation_task = gp.GPTask(genepattern.get_session(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00013')
knnxvalidation_job_spec = knnxvalidation_task.make_job_spec()
knnxvalidation_job_spec.set_parameter("data.filename", "")
knnxvalidation_job_spec.set_parameter("class.filename", "")
knnxvalidation_job_spec.set_parameter("num.features", "10")
knnxvalidation_job_spec.set_parameter("feature.selection.statistic", "0")
knnxvalidation_job_spec.set_parameter("min.std", "")
knnxvalidation_job_spec.set_parameter("num.neighbors", "3")
knnxvalidation_job_spec.set_parameter("weighting.type", "1")
knnxvalidation_job_spec.set_parameter("distance.measure", "1")
knnxvalidation_job_spec.set_parameter("pred.results.file", "<data.filename_basename>.pred.odf")
knnxvalidation_job_spec.set_parameter("feature.summary.file", "<data.filename_basename>.feat.odf")
genepattern.display(knnxvalidation_task)


## 3. View a list of features used in the prediction model

<div class="alert alert-info">
<h3 style="margin-top: 0;"> Instructions <i class="fa fa-info-circle"></i></h3>
<ol>
    <li>Select the <em>KNNXvalidation</em> job result cell by clicking anywhere in it.
    </li><li>Click on the <i class="fa fa-info-circle"></i> icon next to the <q>&lt;filename>.feat.odf</q> file.
    </li><li>Select <q>Send to DataFrame</q>.
</li><li>You will see a new cell created below the job result cell.
</li><li>You will see a table of features, descriptions, and the number of times each feature was included in a model in a cross-validation loop.
</li></ol></div>
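
The count column in that table can be understood with a small sketch: in each cross-validation fold, features are re-selected from the training samples only, and a counter records how often each one is chosen. The example below assumes leave-one-out folds and a simple signal-to-noise-style score; both are assumptions for illustration, as are the function name `loo_feature_counts` and the toy data, and none of it is taken from the module's source.

```python
import numpy as np
from collections import Counter

def loo_feature_counts(X, y, feature_names, num_features=2):
    """Count how often each feature is selected across leave-one-out folds."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    counts = Counter()
    for held_out in range(X.shape[0]):
        # Train on every sample except the held-out one
        keep = np.arange(X.shape[0]) != held_out
        X_train, y_train = X[keep], y[keep]
        a, b = X_train[y_train == 0], X_train[y_train == 1]
        # Assumed score: difference of class means over the summed standard deviations
        score = np.abs(a.mean(axis=0) - b.mean(axis=0)) / (a.std(axis=0) + b.std(axis=0) + 1e-12)
        top = np.argsort(score)[::-1][:num_features]
        counts.update(feature_names[i] for i in top)
    return counts

# Toy data (hypothetical): rows are samples, columns are features g1..g3
X = [[1.0, 5.0, 2.0],
     [1.2, 5.1, 9.0],
     [0.9, 4.9, 4.0],
     [8.0, 1.0, 3.0],
     [8.1, 1.2, 8.0],
     [7.9, 0.9, 5.0]]
y = [0, 0, 0, 1, 1, 1]
feature_counts = loo_feature_counts(X, y, ["g1", "g2", "g3"], num_features=2)
print(feature_counts)
```

A feature that is chosen in every fold is a stable marker for the class distinction; one that appears only occasionally depends on which sample was held out.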

## 4. View prediction results

<div class="alert alert-info">
<h3 style="margin-top: 0;"> Instructions <i class="fa fa-info-circle"></i></h3>
<ol>
    <li>For the <q>prediction results file</q> parameter below, click the down arrow in the file input box.
    </li><li>Right click the <q>BRCA_HUGO_symbols.preprocessed.pred.odf</q> above and select <q>Copy link address</q>.</li>
    <li>Paste the link into the <q>Prediction Results File</q> parameter.</li>
    <li>Click <strong><em>Run</em></strong>.</li>
    <li>You will see the prediction results in an interactive viewer.</li>
</ol>
</div>


In [10]:
predictionresultsviewer_task = gp.GPTask(genepattern.session.get(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.visualizer:00019')
predictionresultsviewer_job_spec = predictionresultsviewer_task.make_job_spec()
predictionresultsviewer_job_spec.set_parameter("prediction.results.file", "")
genepattern.display(predictionresultsviewer_task)


<h2>Part 2: Compare classification of preprocessed data to non-preprocessed data</h2>
<p>
Back in the Data Preparation section at the beginning of the workshop, we ran the RNA-Seq read counts through the VoomNormalize module to preprocess them into a form suitable for downstream GenePattern analyses that were originally designed for microarray data. Many of these approaches assume that the data are normally distributed, which is not true of RNA-Seq read count data. This module provides one approach to accommodate that. It uses a mean-variance modeling technique ('voom' from the limma Bioconductor package) to transform the dataset so that it approximates a normal distribution, with the goal of being able to apply statistical methods and workflows that assume normality.
</p>
<p>
Now let's take a look at how well our KNNXValidation classifier works if we provide it with the RNA-Seq matrix from before this processing step.
</p>
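
As a rough illustration of why a transformation helps, the sketch below converts raw read counts to log2 counts per million, which is the kind of log-scale value that voom-style preprocessing works with. It is a simplification under stated assumptions: it deliberately omits the mean-variance modeling and weighting, and the function name `log2_cpm` and the count matrix are hypothetical.

```python
import numpy as np

def log2_cpm(counts):
    """Convert a genes x samples matrix of raw read counts to log2 counts per million."""
    counts = np.asarray(counts, dtype=float)
    library_sizes = counts.sum(axis=0)  # total reads per sample (column)
    # Offsets of 0.5 and 1 keep the logarithm finite when a count is zero
    return np.log2((counts + 0.5) / (library_sizes + 1.0) * 1e6)

# Hypothetical counts: 3 genes (rows) x 2 samples (columns) with very different depths
raw_counts = np.array([[0, 10],
                       [100, 1000],
                       [899999, 8999]])
transformed = log2_cpm(raw_counts)
print(transformed)
```

Dividing by the library size removes differences in sequencing depth between samples, and the logarithm compresses the long right tail of count data so that distance-based methods such as kNN are not dominated by a few highly expressed genes.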

<h2>5. Run KNNXValidation on the non-normalized data</h2>
For this exercise, compare the results of the classifier above with the results of the same classifier and settings applied to BRCA data that has not first been normalized with VoomNormalize.

<div class="alert alert-info">
<h3 style="margin-top: 0;"> Instructions <i class="fa fa-info-circle"></i></h3>
<ol>
    <li>Select the blank cell below.</li> 
    <li>Click the <q><i class="fa fa-th"></i> Tools</q> button in the toolbar above and select the <em>KNNXValidation</em> module to insert the analysis.</li>
	<li>Select <a class="nbtools-markdown-file" href="https://datasets.genepattern.org/data/ccmi_tutorial/2018-09-20/BRCA_Dataset.collapsed.gct" target="_blank">BRCA_Dataset.collapsed.gct</a> in the <strong>data filename</strong> field below. This is the version of the data that has been collapsed to HUGO symbols but not run through VoomNormalize.</li>
	<li>Select <a class="nbtools-markdown-file" href="https://datasets.genepattern.org/data/ccmi_tutorial/2018-09-20/BRCA_Dataset.cls" target="_blank">BRCA_Dataset.cls</a> in the <strong>class filename</strong> field.</li>
	<li>Click <strong>Run</strong>.</li>
    <li>Examine the results using the PredictionResultsViewer module.</li>
</ol>
</div>

<h3>Discussion</h3>

<p>As you can see, the KNNXValidation classifier works just as well for the non-normalized data, though there is slightly less confidence in one of the calls.&nbsp; This reflects the fact that, from a gene-expression point of view, separating tumor cells from normal cells is a fairly easy distinction to make, which is why we use it in these exercises.</p>


## 6. Compare the selected features

We saw above that, with the same settings, KNNXValidation classifies our tumor/normal samples about as well on the non-normalized BRCA data as on the preprocessed data. Here we will examine whether the two models rely on the same features.

<div class="alert alert-info">
<h3 style="margin-top: 0;"> Instructions <i class="fa fa-info-circle"></i></h3>
<ol>
    <li>From Step 5 above, click on <strong>BRCA_Dataset.collapsed.feat.odf</strong> and choose <strong>Send to DataFrame</strong>.</li>
    <li>Move the resulting cell below this one using the down arrow (<i class="fa-arrow-down fa"></i>) in the toolbar.</li>
	<li>Repeat these 2 steps for the <strong>BRCA_HUGO_symbols.preprocessed.feat.odf</strong> from the beginning of this notebook.</li>
	<li>Compare the lists of features.  How many features are in both runs?</li>
</ol>
</div>
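
If you prefer to compare the two lists programmatically rather than by eye, set operations are enough. The sketch below is a minimal example under stated assumptions: the helper `compare_feature_lists` is hypothetical, and the gene names are placeholders rather than real results. With the DataFrames created above you would first pull the feature-name column out of each one as a list.

```python
def compare_feature_lists(features_a, features_b):
    """Return the features shared by both lists and those unique to each."""
    set_a, set_b = set(features_a), set(features_b)
    return {
        "shared": sorted(set_a & set_b),
        "only_a": sorted(set_a - set_b),
        "only_b": sorted(set_b - set_a),
    }

# Placeholder names for illustration only, not actual module output
unprocessed_features = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
preprocessed_features = ["GENE_B", "GENE_D", "GENE_E", "GENE_F"]

comparison = compare_feature_lists(unprocessed_features, preprocessed_features)
print(len(comparison["shared"]), comparison["shared"])
```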



### The Answer (spoiler warning)

<p class="lead">'MATN2',
 'COMP',
 'KLHL29',
 'SPRY2',
 'TMEM220',
 'MME',
 'INHBA',
 'SLC50A1',
 'TNFRSF10D',
 'NTRK2'</p>

### Discussion

<p>As you can see, the lists of features for the classifiers for the unprocessed and preprocessed data are not identical, but there is some overlap.&nbsp; Nonetheless each of the classifiers was able to perfectly classify the samples.&nbsp; As mentioned before, this is a reflection of the fact that this is a pretty easy classification problem and there are many genes that could be used to make the class distinction.</p>



<h3 id="Bonus-Content---comparing-the-processed-and-unprocessed-data">Bonus Content - comparing the processed and unprocessed data</h3>

<p>To see the differences between the datasets, you can use the cell below to generate X-Y plots of selected genes from each dataset.</p>

<p>The cell below will use the pre-computed results from the spoilers above.&nbsp; In addition, a third plot shows the unprocessed dataset after it has been log transformed, as a third reference point.</p>



In [4]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

import numpy as np
from gp.data import GCT
import matplotlib.pyplot as plt

# Genes offered in both drop-down menus; each label maps to the same value.
GENE_CHOICES = {gene: gene for gene in [
    "ADAM33", "ANXA1", "CCNE2", "CDCA8", "CERS2", "COL10A1", "COMP", "CXCL2",
    "DMD", "EIF2AK1", "FAXDC2", "FLVCR1", "GGTA1P", "HSD17B6", "IL11RA",
    "INHBA", "KIF26B", "KLHL29", "LINC02408", "LMOD1", "MAMDC2", "MATN2",
    "MIR3153", "MME", "MMP11", "NTRK2", "RGN", "ROBO3", "SCN4B", "SLC50A1",
    "SPRY2", "TDRD10", "TMEM220", "TNFRSF10D", "TSLP", "UBE2T", "UNC5B",
    "VEGFD", "WISP1",
]}

@genepattern.build_ui(parameters={
    "gene_1": {
        "name": "Gene 1",
        "description": "First gene for X-Y plots",
        "default": "MME",
        "type": "choice",
        "choices": dict(GENE_CHOICES)
    },
    "gene_2": {
        "name": "Gene 2",
        "description": "Second gene for X-Y plots",
        "default": "SPRY2",
        "type": "choice",
        "choices": dict(GENE_CHOICES)
    },
    "output_var": {
        "hide": True,
    }
})
def show_X_Y_plots(gene_1, gene_2):

    processedInput_GCT = GCT(gp.GPFile(genepattern.get_session(0), "https://datasets.genepattern.org/data/ccmi_tutorial/2018-09-20/BRCA_HUGO_symbols.preprocessed.gct"))
    UNprocessedInput_GCT = GCT(gp.GPFile(genepattern.get_session(0), "https://datasets.genepattern.org/data/ccmi_tutorial/2018-09-20/BRCA_Dataset.collapsed.gct"))
    ltgct = GCT(gp.GPFile(genepattern.get_session(0), "https://cloud.genepattern.org/gp/jobResults/82746/BRCA_HUGO_symbols.preprocessed.gct"))


    # drop the description from the index
    processedInput_GCT.reset_index(inplace=True)
    processedInput_GCT.set_index('Name', inplace=True)
    del processedInput_GCT['Description']

    UNprocessedInput_GCT.reset_index(inplace=True)
    UNprocessedInput_GCT.set_index('Name', inplace=True)
    del UNprocessedInput_GCT['Description']

    ltgct.reset_index(inplace=True)
    ltgct.set_index('Name', inplace=True)
    del ltgct['Description']
    ltgct = ltgct.astype(float)

    cls2 = gp.GPFile(genepattern.get_session(0), "https://datasets.genepattern.org/data/ccmi_tutorial/2018-09-20/BRCA_Dataset.cls")

    clsLines =  cls2.read().splitlines()
    labels = np.asarray(clsLines[1].split(), dtype=str)[1:] 
    # split() with no argument tolerates repeated or trailing whitespace
    classes = np.asarray(clsLines[2].split(), dtype=int)


    dft = processedInput_GCT.transpose()
    dft['classes'] = [labels[i] for i in classes]

    UNdft = UNprocessedInput_GCT.transpose()
    UNdft['classes'] = [labels[i] for i in classes]

    lgdft = ltgct.transpose()
    lgdft['classes'] = [labels[i] for i in classes]


    # One figure with three side-by-side panels: unprocessed (left),
    # preprocessed (middle), and log-transformed unprocessed (right).
    fig = plt.figure()
    fig.set_size_inches(18, 6)
    panels = [
        (131, UNdft, "Unprocessed"),
        (132, dft, "Preprocessed Data"),
        (133, lgdft, "Unprocessed - log transformed"),
    ]
    for position, frame, title in panels:
        ax = fig.add_subplot(position)
        ax.margins(0.05)  # Optional, just adds 5% padding to the autoscaling
        # Plot each class as its own series so it gets its own color and legend entry
        for name, group in frame.groupby('classes'):
            ax.plot(group[gene_1], group[gene_2], marker='o', linestyle='', ms=5, label=name)
        ax.legend()
        ax.set_xlabel(gene_1)
        ax.set_ylabel(gene_2)
        ax.set_title(title)

    plt.show()



<h3>Bonus - Discussion</h3>

<p>As you can see from the plots, the processed (voom) data separates the classes much more cleanly than the unprocessed data for most gene pairs.&nbsp; Comparing it to the log-transformed unprocessed data, we see that for many genes the clusters on the X-Y plots appear a bit tighter, probably due to voom&#39;s variance correction.</p>

