# Working with Samples and Features

From a combined dataset of cancer and normal samples, extract the normal samples. Within the normal samples, find the genes coexpressed with LRPPRC (Affymetrix probe M92439_at), a gene with mitochondrial function.

## Before you begin

* Sign in to GenePattern by entering your username and password into the form below. If you are seeing a block of code instead of the login form, go to the menu above and select Cell > Run All.
* The data we will use in this exercise is from the Global Cancer Map, published along with the paper *[Multi-Class Cancer Diagnosis Using Tumor Gene Expression Signatures](http://www.broadinstitute.org/cgi-bin/cancer/publications/pub_paper.cgi?mode=view&paper_id=61)*.
* Links to the data files used in this exercise are below:
    * RES file: [GCM_Total.res](http://software.broadinstitute.org/cancer/software/genepattern/data/gcm/GCM_Total.res)
    * CLS file: [GCM_Total.cls](http://software.broadinstitute.org/cancer/software/genepattern/data/gcm/GCM_Total.cls)
    * CHIP file: [HU6800.chip](http://software.broadinstitute.org/cancer/software/genepattern/data/gcm/HU6800.chip)

In [2]:
# !AUTOEXEC

%reload_ext genepattern

# Don't have the GenePattern Notebook? It can be installed from PIP: 
# pip install genepattern-notebook 
import gp

# The following widgets are components of the GenePattern Notebook extension.
try:
    from genepattern import GPAuthWidget, GPJobWidget, GPTaskWidget
except ImportError:
    def GPAuthWidget(input):
        print("GP Widget Library not installed. Please visit http://genepattern.org")
    def GPJobWidget(input):
        print("GP Widget Library not installed. Please visit http://genepattern.org")
    def GPTaskWidget(input):
        print("GP Widget Library not installed. Please visit http://genepattern.org")

# The gpserver object holds your authentication credentials and is used to
# make calls to the GenePattern server through the GenePattern Python library.
# Your actual username and password have been removed from the code shown
# below for security reasons.
gpserver = gp.GPServer("https://genepattern.broadinstitute.org/gp", "", "")

# Return the authentication widget to view it
GPAuthWidget(gpserver)

## Step 1: Selecting a Subset of an Expression File

1. Insert an analysis cell for the *SelectFeaturesColumns* module and move it below this set of instructions.
2. Set the following parameters:
    1. Drag-and-drop the *[GCM_Total.res](http://software.broadinstitute.org/cancer/software/genepattern/data/gcm/GCM_Total.res)* file linked above into the *input.filename* parameter.
    2. Set the *columns* parameter to *190-279*.
    3. Set the *output.file* parameter to *GCM_Normals.res*.
3. Click the *Run* button.
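
To make the column selection concrete, here is a minimal sketch of how a range such as *190-279* could be expanded and applied to a matrix of expression values. It is not the module's implementation: the function names and toy matrix are hypothetical, the sketch assumes zero-based, inclusive column indices, and it ignores the details of the RES file format.

```python
def parse_column_spec(spec):
    """Expand a spec such as "0-2,5" into a list of column indices."""
    indices = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = part.split("-")
            indices.extend(range(int(start), int(end) + 1))  # inclusive range
        else:
            indices.append(int(part))
    return indices


def select_columns(rows, spec):
    """Keep only the sample columns named by spec from each row of values."""
    keep = parse_column_spec(spec)
    return [[row[i] for i in keep] for row in rows]


# Hypothetical toy matrix: 2 genes x 6 samples (assumed zero-based columns).
toy_rows = [
    [10, 11, 12, 13, 14, 15],
    [20, 21, 22, 23, 24, 25],
]
normals = select_columns(toy_rows, "3-5")
print(normals)  # [[13, 14, 15], [23, 24, 25]]

# The range used in this exercise spans 90 columns.
print(len(parse_column_spec("190-279")))  # 90
```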

## Step 2: Finding Coexpressed Genes

1. Insert an analysis cell for the *GeneNeighbors* module and move it below this set of instructions.
2. Set the following parameters:
    1. Send the *GCM_Normals.res* file produced by the *SelectFeaturesColumns* job above to the *data.filename* parameter.
    2. Set the *gene.accession* parameter to *M92439_at*.
3. Click the *Run* button.
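
The idea behind finding coexpressed genes can be sketched as ranking every probe by how closely its expression profile tracks the query probe. The example below assumes Pearson correlation as the similarity measure, which may differ from the metric the module is configured to use. The function names, the toy profiles, and the probe names other than M92439_at are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile has no defined correlation; treat as 0
    return cov / math.sqrt(var_x * var_y)


def nearest_neighbors(expression, query, k):
    """Return the k probes most correlated with the query probe."""
    reference = expression[query]
    scores = [
        (probe, pearson(reference, values))
        for probe, values in expression.items()
        if probe != query
    ]
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:k]


# Hypothetical toy profiles across four samples.
toy_expression = {
    "M92439_at": [1.0, 2.0, 3.0, 4.0],
    "probe_A": [2.0, 4.0, 6.0, 8.0],  # rises with the query
    "probe_B": [4.0, 3.0, 2.0, 1.0],  # falls as the query rises
    "probe_C": [1.0, 3.0, 2.0, 4.0],  # partially tracks the query
}
neighbors = nearest_neighbors(toy_expression, "M92439_at", k=2)
for probe, score in neighbors:
    print(probe, round(score, 3))
```

In this toy data, `probe_A` ranks first and `probe_C` second, while the anti-correlated `probe_B` falls outside the top two.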

## Step 3: Viewing Coexpressed Genes

1. Look for the *GCM_Normals.markerdata.gct* file produced by the GeneNeighbors job above. 
2. Click it and look for *Send to New GenePattern Cell* in the menu, then select *HeatMapViewer*.
3. Move the new *HeatMapViewer* cell below these instructions.
4. Click the *Run* button.

## Step 4: Collapse the Expression File

1. Insert an analysis cell for the *CollapseDataset* module and move it below this set of instructions.
2. Set the following parameters:
    1. Send the *GCM_Normals.markerdata.gct* file produced by the *GeneNeighbors* job above to the *dataset.file* parameter.
    2. Drag-and-drop *[HU6800.chip](http://software.broadinstitute.org/cancer/software/genepattern/data/gcm/HU6800.chip)* to the *chip.platform* parameter.
3. Click the *Run* button.
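
Collapsing replaces probe-level rows with gene-level rows, using the CHIP file to map each probe to a gene symbol. The sketch below illustrates one possible rule, keeping the per-sample maximum when several probes map to the same gene. That rule is an assumption, as is dropping unmapped probes; the module's actual collapse mode may differ. The function name, toy values, and every probe or gene name other than M92439_at and LRPPRC are hypothetical.

```python
def collapse_by_max(expression, probe_to_gene):
    """Collapse probe rows to gene rows, keeping the per-sample maximum."""
    collapsed = {}
    for probe, values in expression.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # assumption: probes without a mapped symbol are dropped
        if gene in collapsed:
            collapsed[gene] = [max(a, b) for a, b in zip(collapsed[gene], values)]
        else:
            collapsed[gene] = list(values)
    return collapsed


# Hypothetical probe-to-symbol mapping, standing in for a CHIP file.
toy_chip = {
    "M92439_at": "LRPPRC",
    "probe_X1": "GENE_X",
    "probe_X2": "GENE_X",
}
# Hypothetical expression values for two samples.
toy_probes = {
    "M92439_at": [5.0, 6.0],
    "probe_X1": [1.0, 9.0],
    "probe_X2": [4.0, 2.0],
    "probe_unmapped": [7.0, 7.0],
}
collapsed = collapse_by_max(toy_probes, toy_chip)
print(collapsed)  # {'LRPPRC': [5.0, 6.0], 'GENE_X': [4.0, 9.0]}
```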

## Step 5: Converting an Affy Expression File to a List of Genes

1. Look for the *GCM_Normals.markerdata.collapsed.gct* file produced by the *CollapseDataset* job above. 
2. Click it and look for *Send to New GenePattern Cell* in the menu, then select *ExtractRowNames*.
3. Move the new *ExtractRowNames* cell below these instructions.
4. Click the *Run* button.
5. View the resulting gene list by clicking *GCM_Normals.markerdata.collapsed.row.names.txt* and selecting *Open in New Tab*.
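
For reference, extracting row names from a GCT file amounts to reading the first column of each data row. A GCT file starts with a version line, a dimensions line, and a column header line, followed by one row per feature. The sketch below works on a small hand-written GCT string; the function name and the toy content are hypothetical, and it is not the module's implementation.

```python
def extract_row_names(gct_text):
    """Return the first column (row names) of the data rows in GCT text."""
    lines = gct_text.strip().splitlines()
    # Line 1 is the version tag, line 2 the dimensions, line 3 the column header.
    n_rows = int(lines[1].split("\t")[0])
    data_lines = lines[3:3 + n_rows]
    return [line.split("\t")[0] for line in data_lines]


# Hypothetical two-gene, two-sample GCT content.
toy_gct = "\n".join([
    "#1.2",
    "2\t2",
    "Name\tDescription\tsample_1\tsample_2",
    "LRPPRC\tna\t5.0\t6.0",
    "GENE_X\tna\t4.0\t9.0",
])
row_names = extract_row_names(toy_gct)
print(row_names)  # ['LRPPRC', 'GENE_X']
```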


## Find pathways associated with gene list
The following code queries the [mygene.info](http://mygene.info) gene database service for each gene in the result list to determine which Reactome pathways are associated with it.

<div class="alert alert-warning">
<p>Executing the cells below will read in a list of genes, similar to the list created earlier in the main Samples and Features exercise. Each gene in this list will then be sent to <a href="http://mygene.info">mygene.info</a>, a gene database service.</p>
</div>

<div class="alert alert-warning">
- Click on the i icon next to the `GCM_Normals.markerdata.collapsed.row.names.txt` file in the last step
- Select "View Code use"
- Select and copy the reference to the output file, for example `job1306740.get_output_files()[1]` (do NOT include the "this file = " part)
- Paste the result into the code below to replace **INSERT PASTED CODE HERE**
- The resulting line should look like `gene_list_filename = job1306740.get_output_files()[1]`
- Execute the cell below
</div>

In [None]:
gene_list_filename = **INSERT PASTED CODE HERE**
gene_list_file = gene_list_filename.open()
gene_list = gene_list_file.readlines()

In [None]:
import requests

for line in gene_list:
    # Lines read from the server may be bytes and end with a newline
    if isinstance(line, bytes):
        line = line.decode("utf-8")
    fields = line.split()
    if not fields:
        continue  # skip blank lines
    gene = fields[0]  # keep only the symbol, dropping any trailing description
    response = requests.get("https://mygene.info/v2/query",
                            params={"q": gene, "fields": "pathway.reactome"})
    hits = response.json().get("hits", [])
    print(gene)
    pathways = []
    for hit in hits:
        reactome = hit.get("pathway", {}).get("reactome", [])
        # A single pathway may come back as one dict rather than a list
        if isinstance(reactome, dict):
            reactome = [reactome]
        for entry in reactome:
            pathways.append(entry["name"])
    if not pathways:
        pathways.append("No pathways found")
    # Print each distinct pathway once, in alphabetical order
    for p in sorted(set(pathways)):
        print("\t" + p)