# Classification and Prediction in GenePattern Notebook

This notebook shows how to use k-Nearest Neighbors (kNN) to build a predictor, use it to classify samples from the TCGA breast cancer (BRCA) dataset linked below, and assess its accuracy by cross-validation.

### K-nearest-neighbors (KNN)
KNN classifies an unknown sample by assigning it the phenotype label most frequently represented among the k nearest known samples. 

Additionally, you can select a weighting factor for the 'votes' of the nearest neighbors. For example, weighting each vote by the reciprocal of the distance to the neighbor gives closer neighbors a greater vote.
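
The voting rule described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the GenePattern module's implementation: the function name `knn_predict`, the Euclidean distance, the toy points, and the labels `A` and `B` are all assumptions made for the example.

```python
import math
from collections import defaultdict


def knn_predict(train_points, train_labels, query, k=3, weighted=False):
    """Classify `query` by a vote among its k nearest training samples."""
    # Euclidean distance from the query to every known sample (an assumption;
    # other distance measures work the same way).
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest samples.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]

    votes = defaultdict(float)
    for distance, label in nearest:
        # Unweighted: one vote each. Weighted: vote = 1 / distance, so closer
        # neighbors count more (a tiny epsilon avoids division by zero).
        votes[label] += 1.0 / (distance + 1e-12) if weighted else 1.0
    return max(votes, key=votes.get)


# Hypothetical toy data: one sample of class A close to the query,
# two samples of class B farther away.
points = [(0.0, 0.0), (3.0, 0.0), (3.2, 0.0)]
labels = ["A", "B", "B"]

print(knn_predict(points, labels, (1.0, 0.0), k=3))                 # B (2 votes to 1)
print(knn_predict(points, labels, (1.0, 0.0), k=3, weighted=True))  # A (1.0 vs about 0.95)
```

With equal votes the two farther B samples outnumber the single A sample; with reciprocal-distance weights the nearby A sample outweighs them.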

<h2>1. Log in to GenePattern</h2>

<ul>
	<li>Select GenePattern AWS Beta as the server.</li>
	<li>Enter your username and password.</li>
	<li>Click <em>Login to GenePattern</em>.</li>
</ul>


In [4]:
# Requires GenePattern Notebook: pip install genepattern-notebook
import gp
import genepattern

# Username and password removed for security reasons.
genepattern.GPAuthWidget(genepattern.register_session("https://gp-beta-ami.genepattern.org/gp", "", ""))

## 2. Run k-Nearest Neighbors Cross Validation

<div class="alert alert-info">
- Drag [BRCA_HUGO_symbols.preprocessed.gct](https://datasets.genepattern.org/data/TCGA_BRCA/DP_4_BRCA_HUGO_symbols.preprocessed.gct) to the **data filename** field below.
- Drag [BRCA_HUGO_symbols.preprocessed.cls](https://datasets.genepattern.org/data/TCGA_BRCA/Pred_2_BRCA_HUGO_symbols.preprocessed.cls) to the **class filename** field.
- Click **Run**.
</div>

In [5]:
# Look up the KNNXValidation module on the logged-in server by its LSID.
knnxvalidation_task = gp.GPTask(genepattern.get_session(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00013')
knnxvalidation_job_spec = knnxvalidation_task.make_job_spec()
# The data and class files are left empty here; supply them through the widget form.
knnxvalidation_job_spec.set_parameter("data.filename", "")
knnxvalidation_job_spec.set_parameter("class.filename", "")
# Use 10 features per model and vote among the 3 nearest neighbors.
knnxvalidation_job_spec.set_parameter("num.features", "10")
knnxvalidation_job_spec.set_parameter("feature.selection.statistic", "0")
knnxvalidation_job_spec.set_parameter("min.std", "")
knnxvalidation_job_spec.set_parameter("num.neighbors", "3")
# Numeric values are option codes; see the module documentation for their meanings.
knnxvalidation_job_spec.set_parameter("weighting.type", "1")
knnxvalidation_job_spec.set_parameter("distance.measure", "1")
# Output names are derived from the basename of the input data file.
knnxvalidation_job_spec.set_parameter("pred.results.file", "<data.filename_basename>.pred.odf")
knnxvalidation_job_spec.set_parameter("feature.summary.file", "<data.filename_basename>.feat.odf")
# Display the module form so the job can be configured and run interactively.
genepattern.GPTaskWidget(knnxvalidation_task)
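
To make the idea of cross-validating a kNN predictor concrete, here is a small plain-Python sketch of one common scheme, leave-one-out: each sample is held out in turn, a predictor is built from the remaining samples, and the held-out sample is classified. Whether the module uses exactly this scheme is an assumption here, and the helper names, toy points, and labels `class_a` and `class_b` are hypothetical.

```python
import math
from collections import Counter


def predict_knn(train_x, train_y, query, k):
    """Majority vote among the k training samples closest to `query`."""
    ranked = sorted(zip(train_x, train_y), key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def leave_one_out(samples, labels, k=3):
    """Return one prediction per sample, each made without that sample."""
    predictions = []
    for i, held_out in enumerate(samples):
        # Train on everything except sample i.
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        predictions.append(predict_knn(train_x, train_y, held_out, k))
    return predictions


# Hypothetical toy data: two well-separated groups of three samples.
samples = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
labels = ["class_a", "class_a", "class_a", "class_b", "class_b", "class_b"]

predictions = leave_one_out(samples, labels, k=3)
print(predictions == labels)  # True: every held-out sample is recovered
```

Because each prediction is made by a model that never saw the sample being classified, the resulting error rate is a fairer estimate of performance on new data than accuracy on the training set.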

## 4. View a list of features used in the prediction model

<div class="alert alert-info">
- Select the XXXXXX.KNNXvalidation job result cell by clicking anywhere in it.
- Click the *i* icon next to the `<filename>.feat.odf` file.
- Select "Send to DataFrame".
- A new cell appears below the job result cell.
- The new cell shows a table of features, their descriptions, and the number of times each feature was included in a model during the cross-validation loop.
</div>
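
The per-feature counts in that table can be understood with a short sketch: if features are re-selected inside every cross-validation fold, a feature's count is the number of folds in which it made the cut. The code below is an illustration under stated assumptions, not the module's code: it assumes leave-one-out folds, exactly two classes, and a signal-to-noise-style score, and the gene names and values are made up.

```python
from collections import Counter
from statistics import mean, stdev


def score_feature(values, labels, class_a, class_b):
    """Signal-to-noise-style score: |mean_a - mean_b| / (sd_a + sd_b)."""
    a = [v for v, lab in zip(values, labels) if lab == class_a]
    b = [v for v, lab in zip(values, labels) if lab == class_b]
    return abs(mean(a) - mean(b)) / (stdev(a) + stdev(b) + 1e-12)


def count_selected_features(expression, labels, num_features):
    """Tally how often each feature ranks in the top `num_features` across folds.

    `expression` maps a feature name to its list of values, one per sample.
    """
    class_a, class_b = sorted(set(labels))  # assumes exactly two classes
    counts = Counter()
    for held_out in range(len(labels)):
        # Score features using only the samples that stay in this fold.
        keep = [i for i in range(len(labels)) if i != held_out]
        fold_labels = [labels[i] for i in keep]
        scores = {
            name: score_feature([values[i] for i in keep], fold_labels, class_a, class_b)
            for name, values in expression.items()
        }
        top = sorted(scores, key=scores.get, reverse=True)[:num_features]
        counts.update(top)
    return counts


# Hypothetical expression values for three features across six samples.
expression = {
    "GENE_A": [1.0, 1.2, 0.9, 5.0, 5.3, 4.8],  # separates the classes strongly
    "GENE_B": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # uninformative
    "GENE_C": [3.0, 3.4, 2.7, 4.0, 4.5, 3.6],  # separates the classes weakly
}
labels = ["class_a"] * 3 + ["class_b"] * 3

print(count_selected_features(expression, labels, num_features=1))
```

A feature that is chosen in every fold is a stable marker; one that appears in only a few folds depends on which samples happened to be in the training set.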

## 5. View prediction results

<div class="alert alert-info">
- For the **prediction results file** parameter below, click the down arrow in the file input box.
- Select the `BRCA_HUGO_symbols.preprocessed.pred.odf` file.
- Click **Run**.
- You will see the prediction results in an interactive viewer.
</div>

In [6]:
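# Look up the prediction results viewer module on the logged-in server by its LSID.
predictionresultsviewer_task = gp.GPTask(genepattern.get_session(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.visualizer:00019')
predictionresultsviewer_job_spec = predictionresultsviewer_task.make_job_spec()
# Left empty here; choose the .pred.odf file through the widget form.
predictionresultsviewer_job_spec.set_parameter("prediction.results.file", "")
# Display the module form so the viewer can be launched interactively.
genepattern.GPTaskWidget(predictionresultsviewer_task)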
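
Assessing a predictor's accuracy comes down to comparing each sample's true label with its cross-validated prediction. The sketch below computes overall accuracy and a confusion matrix in plain Python; the function names and the label lists are hypothetical and are not read from the `.pred.odf` file.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels):
    """Count how often each (true, predicted) label pair occurs."""
    return Counter(zip(true_labels, predicted_labels))


def accuracy(true_labels, predicted_labels):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical labels for five samples; the second one is misclassified.
true_labels = ["class_a", "class_a", "class_a", "class_b", "class_b"]
predicted = ["class_a", "class_b", "class_a", "class_b", "class_b"]

print(accuracy(true_labels, predicted))  # 0.8
for (truth, guess), count in sorted(confusion_matrix(true_labels, predicted).items()):
    print(f"true={truth} predicted={guess} count={count}")
```

The confusion matrix shows which classes are mistaken for one another, which a single accuracy figure hides.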
