# Differential Expression: RNA-Seq data in GenePattern Notebook

Compute differentially expressed genes or transcripts and visualize the results

<div class="alert alert-info">
<h3 style="margin-top: 0;"> Instructions <i class="fa fa-info-circle"></i></h3>
<ul><li>Sign in to GenePattern by entering your username and password into the form below. </li></ul>
</div>

In [7]:
# Requires GenePattern Notebook: pip install genepattern-notebook
import gp
import genepattern

# Username and password removed for security reasons.
genepattern.display(genepattern.session.register("https://cloud.genepattern.org/gp", "", ""))


# 1. Differential expression: tumor vs normal

<p>Below you can find the files we will use for differential expression.</p>

<ul>
    <li><a class="nbtools-markdown-file" href="https://datasets.genepattern.org/data/ccmi_tutorial/2018-09-20/BRCA_HUGO_symbols.preprocessed.gct">BRCA_HUGO_symbols.preprocessed.gct</a></li>
    <li><a class="nbtools-markdown-file" href="https://datasets.genepattern.org/data/ccmi_tutorial/2018-09-20/BRCA_Dataset.cls">BRCA_Dataset.cls</a></li>
</ul>
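For orientation, here is a minimal sketch of how the two file types are laid out. It assumes the standard GCT 1.2 layout (a version line, a dimensions line, then a tab-separated table) and a CLS file with numeric class labels. The helper names `read_gct` and `read_cls` and the tiny in-memory files are made up for illustration; they are not part of GenePattern or ccalnoir.

```python
import io
import pandas as pd

# Tiny made-up GCT file: version line, "<n_rows>\t<n_samples>", header, data rows.
gct_text = (
    "#1.2\n"
    "2\t3\n"
    "Name\tDescription\tS1\tS2\tS3\n"
    "GENE_A\tna\t1.0\t2.0\t3.0\n"
    "GENE_B\tna\t4.0\t5.0\t6.0\n"
)
# Tiny made-up CLS file: "<n_samples> <n_classes> 1", "# <class names>", labels.
cls_text = "3 2 1\n# normal tumor\n0 1 1\n"

def read_gct(handle):
    """Read a GCT file into a genes-by-samples DataFrame (hypothetical helper)."""
    # Skip the version and dimensions lines; the first column holds gene names.
    df = pd.read_csv(handle, sep="\t", skiprows=2, index_col=0)
    return df.drop(columns=["Description"])

def read_cls(handle):
    """Read a CLS file into (class_names, labels) (hypothetical helper).

    Assumes the label line uses integers; some CLS files use class names instead.
    """
    lines = [line.strip() for line in handle if line.strip()]
    class_names = lines[1].lstrip("#").split()
    labels = [int(x) for x in lines[2].split()]
    return class_names, labels

expression = read_gct(io.StringIO(gct_text))
class_names, labels = read_cls(io.StringIO(cls_text))
print(expression.shape, class_names, labels)
```

The expression table has one row per gene and one column per sample, and the CLS labels follow the same sample order as the GCT columns.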

## Compute differentially expressed genes


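<p>The following analysis cell ranks transcripts or genes by how strongly their expression is associated with the phenotype, using either Pearson correlation (the default ranking method) or an information-theoretic measure (the information coefficient).</p>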
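To make the idea of ranking genes by their association with a class vector concrete, here is a minimal sketch that uses Pearson correlation, one of the ranking methods offered by the analysis cell. The function `rank_by_pearson` and the toy data are assumptions made for illustration; this is not the ccalnoir implementation, which may normalize or score differently.

```python
import numpy as np
import pandas as pd

def rank_by_pearson(expression, labels):
    """Score each gene (row) by its Pearson correlation with a 0/1 class vector.

    Hypothetical helper. Genes with constant expression would produce NaN
    scores because their variance is zero.
    """
    target = np.asarray(labels, dtype=float)
    centered_x = expression.sub(expression.mean(axis=1), axis=0).to_numpy()
    centered_y = target - target.mean()
    numerator = centered_x @ centered_y
    denominator = np.sqrt((centered_x ** 2).sum(axis=1) * (centered_y ** 2).sum())
    scores = pd.DataFrame({"Score": numerator / denominator}, index=expression.index)
    # Most positive scores first, most negative scores last.
    return scores.sort_values(by="Score", ascending=False)

# Made-up expression matrix: three genes, four samples.
toy = pd.DataFrame(
    [[1.0, 2.0, 8.0, 9.0],   # higher in class 1
     [9.0, 8.0, 2.0, 1.0],   # higher in class 0
     [5.0, 4.0, 5.0, 4.0]],  # no association with the class
    index=["UP", "DOWN", "FLAT"],
    columns=["s1", "s2", "s3", "s4"],
)
toy_scores = rank_by_pearson(toy, [0, 0, 1, 1])
print(toy_scores)
```

Genes overexpressed in class 1 end up at the top of the table, genes overexpressed in class 0 at the bottom, and genes unrelated to the phenotype in the middle.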

<!--<p>--&gt; If you are using microarray data you can use ComparativeMarkerSelection, or DESeq2 as an alternative...</p>-->

<div class="alert alert-info">
<h3 style="margin-top: 0;"> Instructions <i class="fa fa-info-circle"></i></h3>
<ol>
    <li>For the <strong>gene expression dataset</strong> parameter, select from the dropdown menu the file:&nbsp;<strong>BRCA_HUGO_symbols.preprocessed.gct</strong></li>
	<li>For the <strong>phenotype file</strong> parameter,&nbsp;select from the dropdown menu the file:&nbsp;<strong>BRCA_Dataset.cls</strong></li>
	<li>Leave the rest of the parameters as default.
    <!-->
	<ul>
		<li><strong>output filename</strong>: diffex_output</li>
		<li><strong>ranking method</strong>: Pearson Correlation Matching</li>
		<li><strong>max number of genes to show</strong>: 20</li>
		<li><strong>number of permutations</strong>: 10</li>
		<li><strong>title</strong>: Differential Expression Results</li>
		<li><strong>random seed</strong>: 20180314</li>
		<li><strong>output variable</strong>: gene_scores</li>
	</ul>
    </!-->
	</li>
	<li>Click <strong>Run</strong></li>
</ol>
</div>


In [2]:
import genepattern
import ccalnoir as ccal

from ccalnoir import differential_gene_expression
import pandas as pd
import numpy as np
from ccalnoir import compute_information_coefficient
from ccalnoir import custom_pearson_corr
RANDOM_SEED = 20180314

genepattern.GPUIBuilder(differential_gene_expression, name="Differential gene expression, discrete phenotype.",
                        description="Sort genes according to their association with a discrete phenotype or class vector.",
                        parameters={
                                "gene_expression":{"name":"gene_expression_dataset",
                                                   "type": "file",
                                                   "kinds": ["gct"]},
                                "phenotype_file":{"type": "file",
                                                  "kinds": ["cls"]},
                                "ranking_method":{
                                                  "default": "custom_pearson_corr",
                                                  "choices":{'Pearson Correlation Matching':"custom_pearson_corr",
                                                             'Information Coefficient Matching':"compute_information_coefficient",
                                                            }
                                                     },
                                "title":{"default":"Differential Expression Results"},
                                "output_filename":{"default": "diffex_output",},
                                "random_seed":{"default":20180314,},
                                "output_var":{"default": "gene_scores",},
                        })


## Investigate differentially expressed genes

<p>For a quick look at each of the top differentially expressed genes, we look them up on <a href="http://www.genecards.org/" target="_blank">http://www.genecards.org/</a> using the Table Creator function.</p>

<div class="alert alert-info">
<h3 style="margin-top: 0;"> Instructions <i class="fa fa-info-circle"></i></h3>
<ul>
	<li>Leave both parameters as default:
	<ul>
		<li><strong>differential expression results</strong>: gene_scores</li>
		<li><strong>max number of genes to show</strong>: 20</li>
	</ul>
	</li>
    <li>Click <strong><em>Run</em></strong></li>
	<li>When the table has been created, click on any link in the <q>GeneCards link</q> column to learn about that particular gene</li>
</ul>
</div>


In [3]:
import numpy as np
import pandas as pd

def make_clickable(url):
    to_display = url.split('=')[1]
    return '<a href="{}" target="_blank">{}</a>'.format(url, to_display)

def make_table(df, max_number_of_genes_to_show, actual_ranking=False):
    # Work on a copy so the caller's DataFrame is not modified.
    df = df.copy()
    df['Rank'] = range(1, len(df) + 1)
    if max_number_of_genes_to_show > len(df):
        max_number_of_genes_to_show = len(df)
        print("You want to show more genes than your data contains ಠ_ಠ")
        print("Showing you only {} instead (i.e., all the genes you have provided)".format(max_number_of_genes_to_show))

    if actual_ranking:
        # Rank by absolute score, regardless of direction.
        df['abs-Score'] = np.absolute(df['Score'])
        sorted_df = df.sort_values(by=["abs-Score"], ascending=False)
        sorted_df = sorted_df.head(max_number_of_genes_to_show).copy()
    else:
        # Take half of the genes from the top of the ranking and half from the bottom.
        to_show = max_number_of_genes_to_show // 2
        # pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
        sorted_df = pd.concat([df.head(to_show), df.tail(to_show)])

    sorted_df['GeneCards link'] = ["http://www.genecards.org/cgi-bin/carddisp.pl?gene={}".format(gene)
                                   for gene in sorted_df.index.values]

    # Move the link column to the front of the column list.
    cols = list(sorted_df)
    cols.insert(0, cols.pop(cols.index('GeneCards link')))
    # .loc replaces the .ix indexer, which was removed in pandas 1.0.
    sorted_df = sorted_df.loc[:, cols]
    sorted_df.set_index('Rank', inplace=True)
    # Render each URL as a clickable link showing only the gene name.
    return sorted_df.style.format({'GeneCards link': make_clickable})

genepattern.GPUIBuilder(make_table, name="Table Creator", 
                        description="Show differentially expressed genes with links to genecards.org",
                        parameters={
                                "df":{"name": "differential_expression_results",
                                      "description":'The output from differential expression (for this exercise, leave it as "gene_scores")',
                                      "default":"gene_scores"},
                                "max_number_of_genes_to_show":{"description":"Maximum number of genes to show in the table (half will be overexpressed in one class and half in the other)",
                                                               "default":20},
                                "actual_ranking":{"hide":True},
                                "output_var":{"hide":True},
                        })


<div class="well">
    <strong>Note:</strong> We are ranking genes by their score. A score can be positive (overexpressed in one class and underexpressed in the other) or negative (the opposite expression profile), so we look at both the highest positive values and the most negative values.
</div>
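As a small illustration of the note above, the sketch below contrasts two ways of picking genes from a score table: taking both ends of the ranking, or taking the largest absolute scores. The scores, gene names, and helper functions are made up for this example.

```python
import pandas as pd

# Made-up scores, already sorted from most positive to most negative.
demo_scores = pd.DataFrame(
    {"Score": [0.9, 0.5, 0.1, -0.2, -0.6, -0.8]},
    index=["G1", "G2", "G3", "G4", "G5", "G6"],
)

def both_extremes(scores, n_genes):
    """Take n_genes // 2 rows from the top and the same number from the bottom."""
    half = n_genes // 2
    return pd.concat([scores.head(half), scores.tail(half)])

def largest_magnitude(scores, n_genes):
    """Take the n_genes rows with the largest absolute score."""
    order = scores["Score"].abs().sort_values(ascending=False).index
    return scores.loc[order[:n_genes]]

print(both_extremes(demo_scores, 4).index.tolist())
print(largest_magnitude(demo_scores, 4).index.tolist())
```

The first approach always shows an equal number of genes for each class; the second can favor one class if its genes have stronger scores.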

# 2. Triple Negative BRCA

<p>We will now compute differentially expressed genes of triple negative BRCA vs non triple negative BRCA. As in the previous section, this step ranks transcripts or genes by their association with the phenotype, using Pearson correlation by default or, optionally, the information coefficient.</p>

<p>Below you can find the files we will use.</p>

<ul>
    <li><a class="nbtools-markdown-file" href="https://datasets.genepattern.org/data/TCGA_BRCA/GSEA/brca_primary_all.gct">brca_primary_all.gct</a></li>
    <li><a class="nbtools-markdown-file" href="https://datasets.genepattern.org/data/TCGA_BRCA/GSEA/triple_negative.cls">triple_negative.cls</a></li>
</ul>

<div class="alert alert-info">
<h3 style="margin-top: 0;"> Instructions <i class="fa fa-info-circle"></i></h3>
<ol>
    <li>Click on the gear icon and toggle code view to hide the code of the cell below. Notice the imports; these will be executed automatically when the notebook loads.</li>
	<li>For the <strong>gene expression dataset</strong> parameter, select from the dropdown menu the file:&nbsp;<strong>brca_primary_all.gct</strong></li>
	<li>For the <strong>phenotype file</strong> parameter,&nbsp;select from the dropdown menu the file:&nbsp;<strong>triple_negative.cls</strong></li>
	<li>Leave the rest of the parameters as default.
    <!-->
	<ul>
		<li><strong>output filename</strong>: diffex_output</li>
		<li><strong>ranking method</strong>: Pearson Correlation Matching</li>
		<li><strong>max number of genes to show</strong>: 20</li>
		<li><strong>number of permutations</strong>: 10</li>
		<li><strong>title</strong>: Differential Expression Results</li>
		<li><strong>random seed</strong>: 20180314</li>
		<li><strong>output variable</strong>: gene_scores</li>
	</ul>
    </!-->
	</li>
	<li>Click <strong>Run</strong></li>
</ol>
</div>


<div class="well well-sm">
    <strong>Note:</strong> Because of the size of this dataset, running this analysis will take some time (especially the plotting portion). Remember to look at the status of your kernel represented in the dot in the upper right corner (next to "Python 3.7").
</div>

In [4]:
import genepattern
import ccalnoir as ccal

from ccalnoir import differential_gene_expression
# import pandas as pd
# import urllib.request
from ccalnoir import compute_information_coefficient
from ccalnoir import custom_pearson_corr
RANDOM_SEED = 20180314

genepattern.GPUIBuilder(differential_gene_expression, name="Differential gene expression, discrete phenotype.",
                        description="Sort genes according to their association with a discrete phenotype or class vector.",
                        parameters={
                                "gene_expression":{"name":"gene_expression_dataset",
                                                   "type": "file",
                                                   "kinds": ["gct"]},
                                "phenotype_file":{"type": "file",
                                                  "kinds": ["cls"]},
                                "ranking_method":{
                                                  "default": "custom_pearson_corr",
                                                  "choices":{'Pearson Correlation Matching':"custom_pearson_corr",
                                                             'Information Coefficient Matching':"compute_information_coefficient",
                                                            }
                                                     },
                                "title":{"default":"Differential Expression Results"},
                                "output_filename":{"default": "diffex_output",},
                                "random_seed":{"default":20180314,},
                                "output_var":{"default": "gene_scores",},
                        })


## Validate experimental results

We will corroborate the results summarized in the abstract of "A New Gene Expression Signature for Triple-Negative Breast Cancer Using Frozen Fresh Tissue before Neoadjuvant Chemotherapy" by Santuario-Facio et al. (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5469719/):

> Forty genes showed differential expression pattern in TNBC tumors. Among these, nine overexpressed genes (PRKX/PRKY, UGT8, HMGA1, LPIN1, HAPLN3, FAM171A1, BCL141A, FOXC1, and ANKRD11), and one underexpressed gene (ANX9) are involved in general metabolism. Based on this biochemical peculiarity and the overexpression of BCL11A and FOXC1 (involved in tumor growth and metastasis, respectively)[...]

In [None]:
# From https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5469719/
genes_of_interest = ['PRKX','PRKY', 'UGT8','HMGA1','LPIN1','HAPLN3','FAM171A1','BCL141A','FOXC1','ANKRD11','BCL11A','ANX9']

In [None]:
def validate_results(genes_of_interest, gene_scores):
    # Rank all genes by score; rank 1 is the highest (most positive) score.
    ranks = gene_scores['Score'].rank(ascending=False)
    present = []
    for gene in genes_of_interest:
        if gene in gene_scores.index:
            present.append(gene)
        else:
            print(f"Gene {gene} not present.")
    # Select the rows directly; DataFrame.append was removed in pandas 2.0.
    tn_genes = gene_scores.loc[present].copy()
    tn_genes['rank'] = ranks[present]
    return tn_genes.sort_values(by='Score', ascending=False)

validate_results(genes_of_interest, gene_scores)

In [5]:
import numpy as np
import pandas as pd

def make_clickable(url):
    to_display = url.split('=')[1]
    return '<a href="{}" target="_blank">{}</a>'.format(url, to_display)

def make_table(df, max_number_of_genes_to_show, actual_ranking=False):
    # Work on a copy so the caller's DataFrame is not modified.
    df = df.copy()
    df['Rank'] = range(1, len(df) + 1)
    if max_number_of_genes_to_show > len(df):
        max_number_of_genes_to_show = len(df)
        print("You want to show more genes than your data contains ಠ_ಠ")
        print("Showing you only {} instead (i.e., all the genes you have provided)".format(max_number_of_genes_to_show))

    if actual_ranking:
        # Rank by absolute score, regardless of direction.
        df['abs-Score'] = np.absolute(df['Score'])
        sorted_df = df.sort_values(by=["abs-Score"], ascending=False)
        sorted_df = sorted_df.head(max_number_of_genes_to_show).copy()
    else:
        # Take half of the genes from the top of the ranking and half from the bottom.
        to_show = max_number_of_genes_to_show // 2
        # pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
        sorted_df = pd.concat([df.head(to_show), df.tail(to_show)])

    sorted_df['GeneCards link'] = ["http://www.genecards.org/cgi-bin/carddisp.pl?gene={}".format(gene)
                                   for gene in sorted_df.index.values]

    # Move the link column to the front of the column list.
    cols = list(sorted_df)
    cols.insert(0, cols.pop(cols.index('GeneCards link')))
    # .loc replaces the .ix indexer, which was removed in pandas 1.0.
    sorted_df = sorted_df.loc[:, cols]
    sorted_df.set_index('Rank', inplace=True)
    # Render each URL as a clickable link showing only the gene name.
    return sorted_df.style.format({'GeneCards link': make_clickable})

genepattern.GPUIBuilder(make_table, name="Table Creator", 
                        description="Show differentially expressed genes with links to genecards.org",
                        parameters={
                                "df":{"name": "differential_expression_results",
                                      "description":'The output from differential expression (for this exercise, leave it as "gene_scores")',
                                      "default":"gene_scores"},
                                "max_number_of_genes_to_show":{"description":"Maximum number of genes to show in the table (half will be overexpressed in one class and half in the other)",
                                                               "default":20},
                                "actual_ranking":{"hide":True},
                                "output_var":{"hide":True},
                        })


# 3. UI Builder example

We can turn code that we use repeatedly into a function and wrap it with the GenePattern UI Builder.

## Summary

<div class="alert alert-info">
<h3 style="margin-top: 0;"> Instructions <i class="fa fa-info-circle"></i></h3>
<p>Run the cell below to create a UI Builder cell. Note that this cell executes the same code that we used earlier.</p>
</div>

In [6]:
@genepattern.build_ui
def validate_results(genes_of_interest, gene_scores):
    # Rank all genes by score; rank 1 is the highest (most positive) score.
    ranks = gene_scores['Score'].rank(ascending=False)
    present = []
    for gene in genes_of_interest:
        if gene in gene_scores.index:
            present.append(gene)
        else:
            print(f"Gene {gene} not present.")
    # Select the rows directly; DataFrame.append was removed in pandas 2.0.
    tn_genes = gene_scores.loc[present].copy()
    tn_genes['rank'] = ranks[present]
    return tn_genes.sort_values(by='Score', ascending=False)
