# Differential Expression 1: RNA-Seq data in GenePattern Notebook

Compute differentially expressed genes or transcripts and visualize the results

## Before you begin

You must log in to a GenePattern server. In this notebook we will use **GenePattern Cloud**.

<div class="alert alert-info">
<h3 style="margin-top: 0;"> Instructions <i class="fa fa-info-circle"></i></h3>
<ul><li>Sign in to GenePattern by entering your username and password into the form below. </li></ul>
</div>

In [1]:
# Requires GenePattern Notebook: pip install genepattern-notebook
import gp
import genepattern

# Username and password removed for security reasons.
genepattern.display(genepattern.session.register("https://cloud.genepattern.org/gp", "", ""))

GPAuthWidget()

# 1. Differential expression: tumor vs normal

Below you can find the files we will use for differential expression.

In [2]:
file_list = ['https://datasets.genepattern.org/data/ccmi_tutorial/2018-09-20/BRCA_HUGO_symbols.preprocessed.gct',
             'https://datasets.genepattern.org/data/ccmi_tutorial/2018-09-20/BRCA_Dataset.cls',
             ]
genepattern.GPUIOutput(files=file_list)

UIOutput(files=['https://datasets.genepattern.org/data/ccmi_tutorial/2018-09-20/BRCA_HUGO_symbols.preprocessed…
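For orientation, both file types are plain text and can be read with a few lines of Python. The sketch below assumes the standard GCT 1.2 layout (a `#1.2` version line, a dimensions line, then a tab-separated table whose first two columns are `Name` and `Description`) and the categorical CLS layout (a counts line, a `#`-prefixed class-name line, and one line of per-sample labels). The helper names `parse_gct` and `parse_cls` and the miniature example files are made up for illustration; they are not part of GenePattern or ccalnoir.

```python
def parse_gct(text):
    """Parse GCT 1.2 text into (sample_names, {gene_name: [values]})."""
    lines = text.strip().splitlines()
    if lines[0].strip() != "#1.2":
        raise ValueError("expected a '#1.2' version line")
    n_rows, n_cols = (int(x) for x in lines[1].split()[:2])
    header = lines[2].split("\t")
    samples = header[2:]  # skip the Name and Description columns
    data = {}
    for line in lines[3:]:
        fields = line.split("\t")
        data[fields[0]] = [float(v) for v in fields[2:]]
    if len(data) != n_rows or len(samples) != n_cols:
        raise ValueError("dimensions line does not match the table")
    return samples, data


def parse_cls(text):
    """Parse categorical CLS text into (class_names, labels)."""
    lines = text.strip().splitlines()
    n_samples, n_classes = (int(x) for x in lines[0].split()[:2])
    class_names = lines[1].lstrip("#").split()
    labels = lines[2].split()
    if len(labels) != n_samples or len(class_names) != n_classes:
        raise ValueError("counts line does not match the labels")
    return class_names, labels


# Hypothetical miniature files, only to show the layout.
example_gct = (
    "#1.2\n"
    "2\t3\n"
    "Name\tDescription\tS1\tS2\tS3\n"
    "GENE_A\tna\t1.0\t2.5\t0.3\n"
    "GENE_B\tna\t4.0\t0.0\t7.2\n"
)
example_cls = "3 2 1\n# tumor normal\n0 0 1\n"

samples, expression = parse_gct(example_gct)
class_names, labels = parse_cls(example_cls)
print(samples, expression["GENE_A"], class_names, labels)
```

The GCT columns and the CLS labels are matched by position, so the order of samples must be the same in both files.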

## Compute differentially expressed genes


<p>The following analysis cell uses an information-theoretic method to find significantly differentially expressed transcripts or genes.</p>

<!--<p>--&gt; If you are using microarray data you can use ComparativeMarkerSelection, or DESeq2 as an alternative...</p>-->

<div class="alert alert-info">
<h3 style="margin-top: 0;"> Instructions <i class="fa fa-info-circle"></i></h3>
<ol>
    <li>Click on the gear icon and toggle code view to hide the code of the cell below. Note the imports: these are executed automatically when the notebook loads.</li>
	<li>For the <strong>gene expression dataset</strong> parameter, select from the dropdown menu the file:&nbsp;<strong>BRCA_HUGO_symbols.preprocessed.gct</strong></li>
	<li>For the <strong>phenotype file</strong> parameter,&nbsp;select from the dropdown menu the file:&nbsp;<strong>BRCA_Dataset.cls</strong></li>
	<li>Leave the rest of the parameters as default.
    <!-->
	<ul>
		<li><strong>output filename</strong>: diffex_output</li>
		<li><strong>ranking method</strong>: Pearson Correlation Matching</li>
		<li><strong>max number of genes to show</strong>: 20</li>
		<li><strong>number of permutations</strong>: 10</li>
		<li><strong>title</strong>: Differential Expression Results</li>
		<li><strong>random seed</strong>: 20180920</li>
		<li><strong>output variable</strong>: gene_scores</li>
	</ul>
    </!-->
	</li>
	<li>Click <strong>Run</strong></li>
</ol>
</div>


In [3]:
!pip install --user --no-deps git+https://github.com/edjuaro/ccal-noir.git > installed.txt
import genepattern
import ccalnoir as ccal

from ccalnoir import differential_gene_expression
import pandas as pd
import numpy as np
from ccalnoir import compute_information_coefficient
from ccalnoir import custom_pearson_corr
RANDOM_SEED = 20180314

genepattern.GPUIBuilder(differential_gene_expression, name="Differential gene expression, discrete phenotype.",
                        description="Sort genes according to their association with a discrete phenotype or class vector.",
                        parameters={
                                "gene_expression":{"name":"gene_expression_dataset",
                                                   "type": "file",
                                                   "kinds": ["gct"]},
                                "phenotype_file":{"type": "file",
                                                  "kinds": ["cls"]},
                                "ranking_method":{
                                                  "default": "custom_pearson_corr",
                                                  "choices":{'Pearson Correlation Matching':"custom_pearson_corr",
                                                             'Information Coefficient Matching':"compute_information_coefficient",
                                                            }
                                                     },
                                "title":{"default":"Differential Expression Results"},
                                "output_filename":{"default": "diffex_output",},
                                "ramdon_seed":{"default":20180314,},
                                "output_var":{"default": "gene_scores",},
                        })

UIBuilder(description='Sort genes according to their association with a discrete phenotype or class vector.', …
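The option labelled "Pearson Correlation Matching" suggests that each gene is scored by how well its expression profile correlates with the class vector. The following is a minimal NumPy sketch of that general idea, not the ccalnoir implementation; the function name `rank_genes_by_pearson` and the toy data are assumptions made for illustration.

```python
import numpy as np


def rank_genes_by_pearson(expression, labels):
    """Return (scores, order): per-gene Pearson r with labels, sorted high to low.

    expression: array of shape (n_genes, n_samples)
    labels: array of shape (n_samples,) holding 0/1 class labels
    """
    x = np.asarray(expression, dtype=float)
    y = np.asarray(labels, dtype=float)
    # Center each gene (row) and the label vector.
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean()
    # Pearson r = sum(xc * yc) / sqrt(sum(xc^2) * sum(yc^2)), computed row-wise.
    denom = np.sqrt((xc ** 2).sum(axis=1) * (yc ** 2).sum())
    scores = (xc @ yc) / denom
    order = np.argsort(-scores)  # indices from most positive to most negative
    return scores, order


# Toy data: gene 0 tracks the labels, gene 1 is anti-correlated, gene 2 is unrelated.
labels = np.array([1, 1, 1, 0, 0, 0])
expression = np.array([
    [5.0, 6.0, 5.5, 1.0, 0.5, 1.5],
    [0.2, 0.1, 0.3, 4.0, 4.2, 3.9],
    [1.0, 2.0, 1.5, 1.0, 2.0, 1.5],
])
scores, order = rank_genes_by_pearson(expression, labels)
print(np.round(scores, 3), order)
```

A score near +1 means the gene is higher in the class labelled 1, a score near -1 means it is higher in the class labelled 0, and a score near 0 means no linear association with the phenotype.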

## Investigate differentially expressed genes using genecards.org

<p>For a quick look at each of the top differentially expressed genes, we look them up on <a href="http://www.genecards.org/" target="_blank">http://www.genecards.org/</a> using the Table Creator function.</p>

<div class="alert alert-info">
<h3 style="margin-top: 0;"> Instructions <i class="fa fa-info-circle"></i></h3>
<ul>
	<li>Leave both parameters as default
	<ul>
		<li><strong>differential expression results</strong>: gene_scores</li>
		<li><strong>max number of genes to show</strong>: 20</li>
	</ul>
	</li>
	<li>Click <strong>Run</strong></li>
	<li>When the table has been created, click on any link in the <code>GeneCards link</code> column to learn about that particular gene</li>
</ul>
</div>


In [4]:
import numpy as np
import pandas as pd

def make_clickable(url):
    to_display = url.split('=')[1]
    return '<a href="{}" target="_blank">{}</a>'.format(url, to_display)

def make_table(df,max_number_of_genes_to_show, actual_ranking=False):
    df['Rank'] = range(1,len(df)+1)
#     pd.options.display.max_colwidth = 100
    if max_number_of_genes_to_show > len(df):
        max_number_of_genes_to_show = len(df)
        print("You want to show more genes than your data contains ಠ_ಠ")
        print("Showing you only {} instead (i.e., all the genes you have provided)".format(max_number_of_genes_to_show))
    
    if actual_ranking:
        df['abs-Score'] = np.absolute(df['Score'])
        sorted_df = df.sort_values(by=["abs-Score"], ascending=False)
        sorted_df = sorted_df.head(max_number_of_genes_to_show)
    else:
        to_show = int(np.floor(max_number_of_genes_to_show/2))
        # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
        sorted_df = pd.concat([df.head(to_show), df.tail(to_show)])

    sorted_df['GeneCards link']=pd.Series(["http://www.genecards.org/cgi-bin/carddisp.pl?gene={}".format(gene) 
                                     for gene in sorted_df.index.values], index=sorted_df.index)
    
    cols = list(sorted_df)
    # move the column to head of list using index, pop and insert
    cols.insert(0, cols.pop(cols.index('GeneCards link')))
    # reorder the columns (DataFrame.ix was removed in pandas 1.0)
    sorted_df = sorted_df[cols]
    sorted_df.set_index('Rank', inplace=True)
    styled_table = sorted_df.style.format({'GeneCards link': make_clickable})
    return styled_table

genepattern.GPUIBuilder(make_table, name="Table Creator", 
                        description="Show differentially expressed genes with links to genecards.org",
                        parameters={
                                "df":{"name": "differential_expression_results",
                                      "description":'The output from differential expression (for this exercise, leave it as "gene_scores")',
                                      "default":"gene_scores"},
                                "max_number_of_genes_to_show":{"description":"Maximum number of genes to show in the table (half will be overexpressed in one class and half in the other)",
                                                               "default":20},
                                "actual_ranking":{"hide":True},
                                "output_var":{"hide":True},
                        })

UIBuilder(description='Show differentially expressed genes with links to genecards.org', function_import='make…

<div class="well">
*Note:* We are ranking genes by their score. This value can be positive (overexpressed in one class and underexpressed in the other) or negative (the opposite expression profile). Hence, we want to look at both the high positive and the highly negative values.
</div>
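The note above describes two ways of picking "top" genes from a signed ranking. The sketch below contrasts them with pandas on a made-up score table (the gene names and values are hypothetical): taking both ends of the ranking versus taking the largest absolute scores.

```python
import pandas as pd

# Hypothetical scores, already sorted from most positive to most negative.
scores = pd.DataFrame(
    {"Score": [0.91, 0.40, 0.05, -0.50, -0.62, -0.88]},
    index=["G1", "G2", "G3", "G4", "G5", "G6"],
)

k = 2  # genes to keep from each end of the ranking

# Option 1: the k most positive plus the k most negative genes.
both_ends = pd.concat([scores.head(k), scores.tail(k)])

# Option 2: the 2*k genes with the largest absolute score, regardless of sign.
by_magnitude = scores.reindex(
    scores["Score"].abs().sort_values(ascending=False).index
).head(2 * k)

print(both_ends.index.tolist())     # ['G1', 'G2', 'G5', 'G6']
print(by_magnitude.index.tolist())  # ['G1', 'G6', 'G5', 'G4']
```

Taking both ends guarantees that each class is represented by the same number of genes, while ranking by magnitude can favor one direction when the scores are asymmetric, as they are here.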

---

# 2. Create your own CLS file

We will separate the same 40 samples based on their tissue source site (TSS, the hospital to which the patient donated their samples).

<div class="alert alert-info">
<h3 style="margin-top: 0;"> Instructions <i class="fa fa-info-circle"></i></h3>
<ul>
<li> Click on the <strong>input file</strong> parameter field and select the same GCT file that you used above. </li>
<li> Click Run.  </li>
<li> On the job cell that is created, click on the gear icon (⚙️) and select "Pop Out Visualizer."  </li>
<li> Select all samples by clicking <strong>Check All</strong>.  </li>
<li> Click Next.  </li>
<li> To create two classes, type "BH" into the <strong>Enter class name</strong> field, then click <strong>➕ Add Class</strong>. Repeat this to create a "rest" class. </li>
<li> Click Next.  </li>
<li> Sort all of the "BH" samples into their respective class:
    <ul>
    <li> Make sure the "BH" class is selected in the right-hand column.  </li>
    <li> Type "BH" under <strong>All Fields</strong> in the left-hand column, then click <strong>Filter...</strong>  </li>
    <li> Click on the blank square next to <strong>Sample Name</strong> in the header to select all of the visible samples.  </li>
    <li> Click on the right-pointing arrow (➡) to move those samples to the right-hand column.  </li>
    </ul>
</li>
<li> Sort the rest of the samples into the "rest" class:
    <ul>
    <li> In the right-hand column, select "rest(0)" from the <strong>class</strong> drop-down menu.  </li>
    <li> Click on the blank square next to <strong>Sample Name</strong> in the header to select all of the visible samples.  </li>
    <li> Click on the right-pointing arrow (➡) to move those samples to the right-hand column.  </li>
    </ul>
</li>
<li> Click Next.  </li>
<li> Review the classification of the samples.  </li>
<li> Click Next.  </li>
<li> Type "new_cls.cls" in the <strong>Enter file name</strong> field.</li>
<li> Select View/Download file.</li>
<li> Click Save.  </li>
<li> Save this file on your computer to be used in the next step.  </li>
<li> Close the popped-out visualizer. </li>
</ul>
</div>

In [5]:
clsfilecreator_task = gp.GPTask(genepattern.session.get(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.visualizer:00261')
clsfilecreator_job_spec = clsfilecreator_task.make_job_spec()
clsfilecreator_job_spec.set_parameter("input.file", "")
genepattern.display(clsfilecreator_task)

GPTaskWidget(lsid='urn:lsid:broad.mit.edu:cancer.software.genepattern.module.visualizer:00261')
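The same two-class split can also be written out programmatically. The sketch below is an alternative to the visualizer, under the assumption that sample names follow the TCGA barcode convention, in which the second dash-separated field is the tissue source site code. The function name `make_cls_text` and the sample names are hypothetical. Class names are listed in order of first appearance so that the first label used (0) maps to the first class name on the second line.

```python
def make_cls_text(sample_names, tss_code="BH", other_name="rest"):
    """Build two-class CLS text: samples from `tss_code` versus all others.

    Assumes TCGA-style names ("TCGA-<TSS>-<participant>-...") whose second
    dash-separated field is the tissue source site code.
    """
    classes = []
    for name in sample_names:
        fields = name.split("-")
        from_site = len(fields) > 1 and fields[1] == tss_code
        classes.append(tss_code if from_site else other_name)
    # Unique class names in order of first appearance.
    class_names = list(dict.fromkeys(classes))
    labels = [str(class_names.index(c)) for c in classes]
    header = "{} {} 1".format(len(classes), len(class_names))
    class_line = "# " + " ".join(class_names)
    return "\n".join([header, class_line, " ".join(labels)]) + "\n"


# Hypothetical sample names, in the same order as the GCT columns.
names = ["TCGA-BH-A0B3-01", "TCGA-A7-A0CE-01", "TCGA-BH-A0DH-11", "TCGA-E2-A15I-01"]
cls_text = make_cls_text(names)
print(cls_text)
```

The resulting text can be saved as `new_cls.cls`; the labels must stay in the same order as the sample columns of the GCT file.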

<p>The following analysis cell uses an information-theoretic method to find significantly differentially expressed transcripts or genes.</p>

<!--<p>--&gt; If you are using microarray data you can use ComparativeMarkerSelection, or DESeq2 as an alternative...</p>-->

<div class="alert alert-info">
<h3 style="margin-top: 0;"> Instructions <i class="fa fa-info-circle"></i></h3>
<ul>
	<li>For the <strong>gene expression dataset</strong> parameter, select the file we loaded above:&nbsp;<strong>BRCA_HUGO_symbols.preprocessed.gct</strong></li>
	<li>For the <strong>phenotype file</strong> parameter, select the file created in the previous step:&nbsp;<strong>
new_cls.cls</strong></li>
	<li>Leave the rest of the parameters as default:
	<ul>
		<li><strong>output filename</strong>: diffex_output</li>
		<li><strong>ranking method</strong>: Pearson Correlation Matching</li>
		<li><strong>max number of genes to show</strong>: 20</li>
		<li><strong>number of permutations</strong>: 10</li>
		<li><strong>title</strong>: Differential Expression Results</li>
		<li><strong>random seed</strong>: 20180920</li>
		<li><strong>output variable</strong>: gene_scores</li>
	</ul>
	</li>
	<li>Click <strong>Run</strong></li>
</ul>
</div>


In [6]:
import genepattern
import ccalnoir as ccal

from ccalnoir import differential_gene_expression
# import pandas as pd
# import urllib.request
from ccalnoir import compute_information_coefficient
from ccalnoir import custom_pearson_corr
RANDOM_SEED = 20180314

genepattern.GPUIBuilder(differential_gene_expression, name="Differential gene expression, discrete phenotype.",
                        description="Sort genes according to their association with a discrete phenotype or class vector.",
                        parameters={
                                "gene_expression":{"name":"gene_expression_dataset",
                                                   "type": "file",
                                                   "kinds": ["gct"]},
                                "phenotype_file":{"type": "file",
                                                  "kinds": ["cls"]},
                                "ranking_method":{
                                                  "default": "custom_pearson_corr",
                                                  "choices":{'Pearson Correlation Matching':"custom_pearson_corr",
                                                             'Information Coefficient Matching':"compute_information_coefficient",
                                                            }
                                                     },
                                "title":{"default":"Differential Expression Results"},
                                "output_filename":{"default": "diffex_output",},
                                "ramdon_seed":{"default":20180314,},
                                "output_var":{"default": "gene_scores",},
                        })

UIBuilder(description='Sort genes according to their association with a discrete phenotype or class vector.', …

<div class="well">
*Note:* We see a substantial drop in the similarity score of the top differentially expressed genes (as we had expected). If we had seen genes as clearly differentially expressed as in the first test, we would have suspected a strong batch effect.
</div>
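One generic way to judge whether a score is larger than chance would produce is a label-permutation test: shuffle the class labels many times and count how often the shuffled score is at least as extreme as the observed one. The sketch below illustrates that idea for a single gene with NumPy. It is not a description of how ccalnoir uses its permutations setting, and the function names and simulated values are assumptions.

```python
import numpy as np


def pearson_with_labels(values, labels):
    """Pearson correlation between one gene's values and 0/1 labels."""
    return float(np.corrcoef(values, labels)[0, 1])


def permutation_p_value(values, labels, n_permutations=1000, seed=0):
    """Estimate how often shuffled labels give |r| at least as large as observed."""
    rng = np.random.default_rng(seed)
    observed = abs(pearson_with_labels(values, labels))
    hits = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(labels)
        if abs(pearson_with_labels(values, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


labels = np.array([1] * 10 + [0] * 10)
rng = np.random.default_rng(42)
# Simulated gene whose two classes differ by three standard deviations.
separated = np.concatenate([rng.normal(3.0, 1.0, 10), rng.normal(0.0, 1.0, 10)])
# Simulated gene with no relationship to the labels.
unrelated = rng.normal(0.0, 1.0, 20)

p_separated = permutation_p_value(separated, labels)
p_unrelated = permutation_p_value(unrelated, labels)
print(p_separated, p_unrelated)
```

With only a handful of permutations the smallest attainable p-value is coarse (for example, 10 permutations cannot go below 1/11), which is worth keeping in mind when interpreting results.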

# 3. Evaluate the effects of sample size (with random data!)

In this section, we will evaluate the effects that (small) sample sizes have on differential expression results.
<div class="alert alert-info">
<h3 style="margin-top: 0;"> Instructions <i class="fa fa-info-circle"></i></h3>
- Run through the next 4 differential expression exercises (100 samples, 30 samples, 14 samples, 8 samples)
</div>

## 100 samples

<div class="alert alert-info">
<h3 style="margin-top: 0;"> Instructions <i class="fa fa-info-circle"></i></h3>
- Run the cell below to generate a gene expression matrix with 10,000 rows (genes) and 100 columns (samples).
- Run the differential expression UIBuilder cell to observe the results
</div>

In [None]:
n = 100
random_genes = pd.DataFrame(np.random.randn(10000, n), columns=list(range(n)))
random_genes.index.name = 'Names'
arr = np.array([1] * int(n/2) + [0] * int(n/2))

random_labels = pd.Series(arr, index=random_genes.columns)

In [8]:
import genepattern
import ccalnoir as ccal

from ccalnoir import differential_gene_expression
# import pandas as pd
# import urllib.request
from ccalnoir import compute_information_coefficient
from ccalnoir import custom_pearson_corr
RANDOM_SEED = 20180314

genepattern.GPUIBuilder(differential_gene_expression, name="Differential gene expression, discrete phenotype.",
                        description="Sort genes according to their association with a discrete phenotype or class vector.",
                        parameters={
                                "gene_expression":{"name":"gene_expression_dataset",
                                                   "type": "file",
                                                   "kinds": ["gct"]},
                                "phenotype_file":{"type": "file",
                                                  "kinds": ["cls"]},
                                "ranking_method":{
                                                  "default": "custom_pearson_corr",
                                                  "choices":{'Pearson Correlation Matching':"custom_pearson_corr",
                                                             'Information Coefficient Matching':"compute_information_coefficient",
                                                            }
                                                     },
                                "title":{"default":"Differential Expression Results"},
                                "output_filename":{"default": "diffex_output",},
                                "ramdon_seed":{"default":20180314,},
                                "output_var":{"default": "gene_scores",},
                        })

UIBuilder(description='Sort genes according to their association with a discrete phenotype or class vector.', …

## 30 samples

<div class="alert alert-info">
<h3 style="margin-top: 0;"> Instructions <i class="fa fa-info-circle"></i></h3>
- Run the cell below to generate a gene expression matrix with 10,000 rows (genes) and 30 columns (samples).
- Run the differential expression UIBuilder cell to observe the results
</div>

In [None]:
n = 30
random_genes = pd.DataFrame(np.random.randn(10000, n), columns=list(range(n)))
random_genes.index.name = 'Names'
arr = np.array([1] * int(n/2) + [0] * int(n/2))

random_labels = pd.Series(arr, index=random_genes.columns)

In [10]:
import genepattern
import ccalnoir as ccal

from ccalnoir import differential_gene_expression
# import pandas as pd
# import urllib.request
from ccalnoir import compute_information_coefficient
from ccalnoir import custom_pearson_corr
RANDOM_SEED = 20180314

genepattern.GPUIBuilder(differential_gene_expression, name="Differential gene expression, discrete phenotype.",
                        description="Sort genes according to their association with a discrete phenotype or class vector.",
                        parameters={
                                "gene_expression":{"name":"gene_expression_dataset",
                                                   "type": "file",
                                                   "kinds": ["gct"]},
                                "phenotype_file":{"type": "file",
                                                  "kinds": ["cls"]},
                                "ranking_method":{
                                                  "default": "custom_pearson_corr",
                                                  "choices":{'Pearson Correlation Matching':"custom_pearson_corr",
                                                             'Information Coefficient Matching':"compute_information_coefficient",
                                                            }
                                                     },
                                "title":{"default":"Differential Expression Results"},
                                "output_filename":{"default": "diffex_output",},
                                "ramdon_seed":{"default":20180314,},
                                "output_var":{"default": "gene_scores",},
                        })

UIBuilder(description='Sort genes according to their association with a discrete phenotype or class vector.', …

## 14 samples

<div class="alert alert-info">
<h3 style="margin-top: 0;"> Instructions <i class="fa fa-info-circle"></i></h3>
- Run the cell below to generate a gene expression matrix with 10,000 rows (genes) and 14 columns (samples).
- Run the differential expression UIBuilder cell to observe the results
</div>

In [None]:
n = 14
random_genes = pd.DataFrame(np.random.randn(10000, n), columns=list(range(n)))
random_genes.index.name = 'Names'
arr = np.array([1] * int(n/2) + [0] * int(n/2))
# np.random.shuffle(arr)

random_labels = pd.Series(arr, index=random_genes.columns)

In [12]:
import genepattern
import ccalnoir as ccal

from ccalnoir import differential_gene_expression
# import pandas as pd
# import urllib.request
from ccalnoir import compute_information_coefficient
from ccalnoir import custom_pearson_corr
RANDOM_SEED = 20180314

genepattern.GPUIBuilder(differential_gene_expression, name="Differential gene expression, discrete phenotype.",
                        description="Sort genes according to their association with a discrete phenotype or class vector.",
                        parameters={
                                "gene_expression":{"name":"gene_expression_dataset",
                                                   "type": "file",
                                                   "kinds": ["gct"]},
                                "phenotype_file":{"type": "file",
                                                  "kinds": ["cls"]},
                                "ranking_method":{
                                                  "default": "custom_pearson_corr",
                                                  "choices":{'Pearson Correlation Matching':"custom_pearson_corr",
                                                             'Information Coefficient Matching':"compute_information_coefficient",
                                                            }
                                                     },
                                "title":{"default":"Differential Expression Results"},
                                "output_filename":{"default": "diffex_output",},
                                "ramdon_seed":{"default":20180314,},
                                "output_var":{"default": "gene_scores",},
                        })

UIBuilder(description='Sort genes according to their association with a discrete phenotype or class vector.', …

## 8 samples

<div class="alert alert-info">
<h3 style="margin-top: 0;"> Instructions <i class="fa fa-info-circle"></i></h3>
- Run the cell below to generate a gene expression matrix with 10,000 rows (genes) and 8 columns (samples).
- Run the differential expression UIBuilder cell to observe the results
</div>

In [None]:
n = 8
random_genes = pd.DataFrame(np.random.randn(10000, n), columns=list(range(n)))
random_genes.index.name = 'Names'
arr = np.array([1] * int(n/2) + [0] * int(n/2))
# np.random.shuffle(arr)

random_labels = pd.Series(arr, index=random_genes.columns)

In [14]:
import genepattern
import ccalnoir as ccal

from ccalnoir import differential_gene_expression
# import pandas as pd
# import urllib.request
from ccalnoir import compute_information_coefficient
from ccalnoir import custom_pearson_corr
RANDOM_SEED = 20180314

genepattern.GPUIBuilder(differential_gene_expression, name="Differential gene expression, discrete phenotype.",
                        description="Sort genes according to their association with a discrete phenotype or class vector.",
                        parameters={
                                "gene_expression":{"name":"gene_expression_dataset",
                                                   "type": "file",
                                                   "kinds": ["gct"]},
                                "phenotype_file":{"type": "file",
                                                  "kinds": ["cls"]},
                                "ranking_method":{
                                                  "default": "custom_pearson_corr",
                                                  "choices":{'Pearson Correlation Matching':"custom_pearson_corr",
                                                             'Information Coefficient Matching':"compute_information_coefficient",
                                                            }
                                                     },
                                "title":{"default":"Differential Expression Results"},
                                "output_filename":{"default": "diffex_output",},
                                "ramdon_seed":{"default":20180314,},
                                "output_var":{"default": "gene_scores",},
                        })

UIBuilder(description='Sort genes according to their association with a discrete phenotype or class vector.', …

## Summary

<div class="well well-sm">
Conclusion: with a small sample size, it is easy to find genes that look significantly differentially expressed purely by chance, even in random data.
</div>

![random.png](attachment:random.png)
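The effect can also be checked numerically without the UIBuilder cells. The sketch below generates the same kind of random matrix (10,000 genes of standard-normal noise) for each sample size used above and reports the largest absolute Pearson correlation between any gene and a balanced 0/1 label vector. The function name `max_abs_correlation` and the fixed seed are assumptions, and the function expects an even number of samples.

```python
import numpy as np


def max_abs_correlation(n_samples, n_genes=10000, seed=0):
    """Largest |Pearson r| between random genes and balanced 0/1 labels."""
    rng = np.random.default_rng(seed)
    genes = rng.standard_normal((n_genes, n_samples))
    half = n_samples // 2  # n_samples is assumed to be even
    labels = np.array([1.0] * half + [0.0] * half)
    # Row-wise Pearson correlation of every gene with the label vector.
    gc = genes - genes.mean(axis=1, keepdims=True)
    lc = labels - labels.mean()
    r = (gc @ lc) / np.sqrt((gc ** 2).sum(axis=1) * (lc ** 2).sum())
    return float(np.abs(r).max())


results = {n: max_abs_correlation(n) for n in (100, 30, 14, 8)}
for n, r in results.items():
    print("{:>3} samples: max |r| = {:.2f}".format(n, r))
```

Because every gene here is pure noise, any strong correlation is a false positive; the best score among 10,000 random genes climbs towards 1 as the number of samples shrinks.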

# 4. UIBuilder example

<div class="well well-sm">
We can turn code that we use repeatedly into a function and wrap it with the GenePattern UI Builder.
</div>

<div class="alert alert-info">
<h3 style="margin-top: 0;"> Instructions <i class="fa fa-info-circle"></i></h3>
- Run the cell below to create a UIBuilder cell. Note that this cell executes the same code that we used at the beginning of each of the exercises in the section above.
</div>

In [None]:
@genepattern.build_ui()
def random_samples(n):
    random_genes = pd.DataFrame(np.random.randn(10000, n), columns=list(range(n)))
    random_genes.index.name = 'Names'
    arr = np.array([1] * int(n/2) + [0] * int(n/2))
    random_labels = pd.Series(arr, index=random_genes.columns)
    return random_genes, random_labels
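Note that the function above builds its label vector from `int(n/2)` twice, so for an odd `n` it would produce only `n - 1` labels and the `pd.Series` call would fail with a length mismatch. The variant below is a sketch of one way to handle that and to make the output reproducible; the name `random_samples_seeded` and the `n_genes` and `seed` parameters are additions for illustration and are not part of the original notebook. The UI Builder decorator is left out so that the function can be run on its own.

```python
import numpy as np
import pandas as pd


def random_samples_seeded(n, n_genes=10000, seed=None):
    """Random expression matrix plus 0/1 labels; also works for odd n."""
    rng = np.random.default_rng(seed)
    random_genes = pd.DataFrame(rng.standard_normal((n_genes, n)),
                                columns=list(range(n)))
    random_genes.index.name = 'Names'
    n_ones = n // 2
    # Any leftover sample goes to class 0 so the label count always equals n.
    arr = np.array([1] * n_ones + [0] * (n - n_ones))
    random_labels = pd.Series(arr, index=random_genes.columns)
    return random_genes, random_labels


genes, labels = random_samples_seeded(7, n_genes=50, seed=1)
print(genes.shape, labels.tolist())
```

Passing the same `seed` twice returns identical matrices, which makes it easier to compare runs across the different sample sizes.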