# Differential Expression of RNA-Seq data in GenePattern Notebook

Compute differentially expressed genes or transcripts and visualize the results

## Before you begin

You must log in to a GenePattern server; in this notebook we will use **```GenePattern AWS Beta```**.

<div class="alert alert-info">
* Sign in to GenePattern by entering your username and password into the form below. 


In [6]:
# Requires GenePattern Notebook: pip install genepattern-notebook
import gp
import genepattern

# Username and password removed for security reasons.
genepattern.GPAuthWidget(genepattern.register_session("https://gp-beta-ami.genepattern.org/gp", "", ""))

## Load data to the notebook

#### Load the phenotype/class labels (contained in CLS file)
In order to make the phenotype labels file (the CLS file) easily accessible in the GenePattern modules and functions on this notebook, we will use the RenameFile module.

<div class="alert alert-info">
- Drag [BRCA_labels.cls](https://datasets.genepattern.org/data/ccmi_tutorial/2017-12-15/intermediate_data/WP_1_workshop_BRCA_labels.cls) to the **cls file** field.  
    Ignore the warning, *these are not the file formats you are looking for.*
- Leave the rest of the parameters as default:
  + **output filename**: classes.cls
  + **screen filename**: no
  + **force copy**: no
- Click **Run**

In [7]:
renamefile_task = gp.GPTask(genepattern.get_session(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00338')
renamefile_job_spec = renamefile_task.make_job_spec()
renamefile_job_spec.set_parameter("input.file", "")
renamefile_job_spec.set_parameter("output.filename", "classes.cls")
renamefile_job_spec.set_parameter("screen.filename", "no")
renamefile_job_spec.set_parameter("force.copy", "no")
genepattern.GPTaskWidget(renamefile_task)
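
To make the contents of a CLS file more concrete, here is a minimal sketch of a parser for the categorical CLS layout (three lines: counts, class names, one label per sample). The function name `parse_cls` and the example text are made up for illustration; this is not how GenePattern itself reads the file.

```python
def parse_cls(text):
    """Parse the text of a categorical CLS file into (class_names, labels)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) != 3:
        raise ValueError("a categorical CLS file has exactly three non-empty lines")
    # Line 1: <number of samples> <number of classes> 1
    n_samples, n_classes = (int(x) for x in lines[0].split()[:2])
    # Line 2: '#' followed by the class names
    if not lines[1].startswith("#"):
        raise ValueError("the second line must start with '#'")
    class_names = lines[1].lstrip("#").split()
    # Line 3: one label per sample, in the same order as the GCT columns
    labels = lines[2].split()
    if len(class_names) != n_classes or len(labels) != n_samples:
        raise ValueError("header counts do not match the file contents")
    return class_names, labels


# Made-up example: four samples, two classes.
example = "4 2 1\n# tumor normal\n0 0 1 1\n"
class_names, labels = parse_cls(example)
print(class_names, labels)
```

The labels on the third line must follow the same sample order as the columns of the GCT file loaded in the next step.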

#### Load the RNA-Seq counts (contained in GCT file) and normalize the data
In order to make the RNA-Seq counts (the GCT file) easily accessible in the GenePattern modules and functions in this notebook, we will load them through the PreprocessReadCounts module, which also normalizes them.

PreprocessReadCounts transforms the raw RNA-Seq counts so that they approximately follow a normal distribution.
<div class="alert alert-info">
Run PreprocessReadCounts using the following parameters:

+ **input file**: Drag and drop the file [BRCA_unversioned_ensembl_ids.collapsed.filtered.gct](https://datasets.genepattern.org/data/ccmi_tutorial/2017-12-15/intermediate_data/DP_3_1_BRCA_unversioned_ensembl_ids.collapsed.filtered.gct)
+ **cls file**: The output from the **RenameFile** module (i.e., **classes.cls** if you used the suggested parameters above)
+ **output file**: BRCA_dataset.preprocessed.gct
+ Click **Run**

In [8]:
preprocessreadcounts_task = gp.GPTask(genepattern.get_session(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00355')
preprocessreadcounts_job_spec = preprocessreadcounts_task.make_job_spec()
preprocessreadcounts_job_spec.set_parameter("input.file", "")
preprocessreadcounts_job_spec.set_parameter("cls.file", "")
preprocessreadcounts_job_spec.set_parameter("output.file", "BRCA_dataset.preprocessed.gct")
preprocessreadcounts_job_spec.set_parameter("expression.value.filter.threshold", "1")
genepattern.GPTaskWidget(preprocessreadcounts_task)
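
The kind of transformation involved can be illustrated with a log2 counts-per-million (log-CPM) calculation, using the offsets applied by limma's voom. Whether PreprocessReadCounts computes exactly this is an assumption here; the function `log2_cpm` and the toy counts are made up for illustration.

```python
import math


def log2_cpm(counts):
    """Convert raw counts to log2 counts-per-million.

    counts maps each gene name to a list of raw counts, one per sample.
    """
    n_samples = len(next(iter(counts.values())))
    # Library size = total number of counts in each sample (column sum).
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        gene: [
            # The 0.5 and 1.0 offsets avoid taking the log of zero.
            math.log2((c + 0.5) / (lib_sizes[j] + 1.0) * 1e6)
            for j, c in enumerate(row)
        ]
        for gene, row in counts.items()
    }


# Toy data: the second sample was sequenced twice as deeply as the first.
raw = {"GENE_A": [10, 20], "GENE_B": [990, 1980]}
normalized = log2_cpm(raw)
print(normalized["GENE_A"])
```

Although the raw counts of `GENE_A` differ by a factor of two between the samples, the log-CPM values are nearly equal because each count is scaled by its sample's library size.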

## Compute differentially expressed transcripts


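The following analysis cell ranks transcripts or genes by their association with the phenotype, using either Pearson correlation (the default) or an information-theoretic measure (the information coefficient), to find those that are significantly differentially expressed.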

<div class="alert alert-info">
- Click on the downward arrow (▼) for the **gene expression dataset** parameter and select the file we loaded above, **BRCA_dataset.preprocessed.gct**
- Click on the downward arrow (▼) for the **phenotype file** parameter and select the file we loaded above, **classes.cls**
- Leave the rest of the parameters as default:
  + **output filename**: diffex_output
  + **ranking method**: Pearson Correlation Matching
  + **max number of genes to show**: 20
  + **number of permutations**: 100
  + **title**: Differential Expression Results
  + **random seed**: 20180314
  + **output variable**: gene_scores
- Click **Run**

In [9]:
import genepattern
import ccalnoir as ccal

from ccalnoir import differential_gene_expression
# import pandas as pd
# import urllib.request
from ccalnoir import compute_information_coefficient
from ccalnoir import custom_pearson_corr
RANDOM_SEED = 20180314

genepattern.GPUIBuilder(differential_gene_expression, name="Differential gene expression, discrete phenotype.",
                        description="Sort genes according to their association with a discrete phenotype or class vector.",
                        parameters={
                                "gene_expression":{"name":"gene_expression_dataset",
                                                   "type": "file",
                                                   "kinds": ["gct"]},
                                "phenotype_file":{"type": "file",
                                                  "kinds": ["cls"]},
                                "ranking_method":{
                                                  "default": "custom_pearson_corr",
                                                  "choices":{'Pearson Correlation Matching':"custom_pearson_corr",
                                                             'Information Coefficient Matching':"compute_information_coefficient",
                                                            }
                                                     },
                                "title":{"default":"Differential Expression Results"},
                                "output_filename":{"default": "diffex_output",},
                                "random_seed":{"default":20180314,},
                                "output_var":{"default": "gene_scores",},
                        })
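
As a rough illustration of what ranking genes against a discrete phenotype involves, the sketch below scores each gene by its Pearson correlation with a 0/1 class vector and estimates a permutation p-value. It is not the `ccalnoir` implementation; `pearson`, `rank_genes`, and the toy data are hypothetical names and values chosen for this example.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def rank_genes(expression, labels, n_permutations=100, seed=20180314):
    """Return (gene, score, p_value) tuples sorted from highest to lowest score."""
    rng = random.Random(seed)  # a fixed seed makes the permutations reproducible
    results = []
    for gene, values in expression.items():
        score = pearson(values, labels)
        shuffled = list(labels)
        hits = 0
        for _ in range(n_permutations):
            rng.shuffle(shuffled)
            # Count shuffled labelings that score at least as extreme.
            if abs(pearson(values, shuffled)) >= abs(score):
                hits += 1
        p_value = (hits + 1) / (n_permutations + 1)
        results.append((gene, score, p_value))
    return sorted(results, key=lambda r: r[1], reverse=True)


# Toy data: three genes measured in six samples (three per class).
labels = [0, 0, 0, 1, 1, 1]
expression = {
    "UP": [1, 2, 1, 8, 9, 8],
    "DOWN": [9, 8, 9, 1, 2, 1],
    "FLAT": [5, 4, 6, 5, 6, 4],
}
ranked = rank_genes(expression, labels)
for gene, score, p_value in ranked:
    print(gene, round(score, 3), round(p_value, 3))
```

Genes at the top of the list are higher in the class labeled 1, and genes at the bottom are higher in the class labeled 0. Increasing the number of permutations gives finer-grained p-values at the cost of run time.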

## Investigate differentially expressed genes using genecards.org

For a quick look at each of the top differentially expressed genes, we query http://www.genecards.org/ using the Table Creator function.
<div class="alert alert-info">
- Leave both parameters as default
  + **differential expression results**: gene_scores
  + **max number of genes to show**: 20
- Click **Run**
- When the table has been created, click any link in the ```GeneCards link``` column to learn about that particular gene

In [10]:
import numpy as np
import pandas as pd

def make_clickable(url):
    to_display = url.split('=')[1]
    return '<a href="{}" target="_blank">{}</a>'.format(url, to_display)

def make_table(df,max_number_of_genes_to_show, actual_ranking=False):
    df['Rank'] = range(1,len(df)+1)
#     pd.options.display.max_colwidth = 100
    if max_number_of_genes_to_show > len(df):
        max_number_of_genes_to_show = len(df)
        print("You want to show more genes than your data contains ಠ_ಠ")
        print("Showing you only {} instead (i.e., all the genes you have provided)".format(max_number_of_genes_to_show))
    
    if actual_ranking:
        df['abs-Score'] = np.absolute(df['Score'])
        sorted_df = df.sort_values(by=["abs-Score"], ascending=False)
        sorted_df = sorted_df.head(max_number_of_genes_to_show)
    else:
        to_show = int(np.floor(max_number_of_genes_to_show/2))
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        sorted_df = pd.concat([df.head(to_show), df.tail(to_show)])

    sorted_df['GeneCards link']=pd.Series(["http://www.genecards.org/cgi-bin/carddisp.pl?gene={}".format(gene) 
                                     for gene in sorted_df.index.values], index=sorted_df.index)
    
    cols = list(sorted_df)
    # move the column to head of list using index, pop and insert
    cols.insert(0, cols.pop(cols.index('GeneCards link')))
    # use .loc to reorder (.ix was removed in pandas 1.0)
    sorted_df = sorted_df.loc[:, cols]
    sorted_df.set_index('Rank', inplace=True)
    styled_table = sorted_df.style.format({'GeneCards link': make_clickable})
    return styled_table

genepattern.GPUIBuilder(make_table, name="Table Creator", 
                        description="Show differentially expressed genes with links to genecards.org",
                        parameters={
                                "df":{"name": "differential_expression_results",
                                      "default":"gene_scores"},
                                "max_number_of_genes_to_show":{"description":"Maximum number of genes to show in the table (half will be overexpressed in one class and half in the other)",
                                                               "default":20},
                                "actual_ranking":{"hide":True},
                                "output_var":{"hide":True},
                        })

<div class="well">
*Note:* We are ranking based on the score for each gene. This value can be positive (overexpressed in one class and underexpressed in the other) or negative (the opposite expression profile). Hence, we want to look at high positive values but also at highly negative values.
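
The idea of looking at both ends of the ranking can be sketched without pandas. The function below takes half of the requested genes from the top of the score-sorted list and half from the bottom, and attaches the GeneCards URL used above. The name `top_and_bottom` and the example scores are hypothetical.

```python
GENECARDS_URL = "http://www.genecards.org/cgi-bin/carddisp.pl?gene={}"


def top_and_bottom(scores, max_genes):
    """Pick up to max_genes genes, half from each end of the ranking.

    scores maps each gene name to its score; the result is a list of
    (gene, score, url) tuples ordered from highest to lowest score.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    half = min(max_genes, len(ranked)) // 2
    # Most positive scores first, then the most negative ones.
    selected = ranked[:half] + ranked[len(ranked) - half:]
    return [(gene, score, GENECARDS_URL.format(gene)) for gene, score in selected]


# Made-up scores for five genes.
scores = {"A": 0.9, "B": 0.1, "C": -0.2, "D": -0.8, "E": 0.5}
rows = top_and_bottom(scores, max_genes=2)
for gene, score, url in rows:
    print(gene, score, url)
```

Sorting by the absolute value of the score instead would also surface both strongly positive and strongly negative genes, but it would not guarantee an even split between the two classes.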
