# Differential Expression part 2: Triple Negative BRCA

Compute the genes that are differentially expressed between triple-negative and non-triple-negative breast cancer (BRCA) samples.

## Before you begin

You must log in to a GenePattern server. In this notebook we will use **GenePattern Cloud**.

<div class="alert alert-info">
<h3 style="margin-top: 0;"> Instructions <i class="fa fa-info-circle"></i></h3>
<ul><li>Sign in to GenePattern by entering your username and password into the form below. </li></ul>
</div>

In [None]:
# Requires GenePattern Notebook: pip install genepattern-notebook
import gp
import genepattern

# Username and password removed for security reasons.
genepattern.display(genepattern.session.register("https://cloud.genepattern.org/gp", "", ""))

# Identify markers for triple negative breast cancer

## Load data into the notebook

<div class="well well-sm">
- **Note:** For this exercise, we use all of the BRCA primary tumor samples available at TCGA and separate them (using the provided CLS file) according to whether or not they are triple negative. The necessary files are provided in the cell below.
</div>

In [6]:
file_list = ['https://datasets.genepattern.org/data/TCGA_BRCA/GSEA/brca_primary_all.gct',
             'https://datasets.genepattern.org/data/TCGA_BRCA/GSEA/triple_negative.cls']

genepattern.GPUIOutput(files=file_list)

UIOutput(files=['https://datasets.genepattern.org/data/TCGA_BRCA/GSEA/brca_primary_all.gct', 'https://datasets…
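
The CLS file is what assigns each sample to a class. As a rough illustration of the categorical CLS layout (a header line with the sample and class counts, a `#` line naming the classes, then one label per sample), here is a minimal parsing sketch. The helper `parse_cls` and the six-sample text are hypothetical, and the sketch assumes numeric labels that index the class names in order; it does not show the actual contents of `triple_negative.cls`.

```python
def parse_cls(text):
    """Parse the text of a categorical CLS file into (class_names, labels)."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    # Header: <number of samples> <number of classes> 1
    n_samples, n_classes = (int(x) for x in lines[0].split()[:2])
    # Second line: '#' followed by the class names
    class_names = lines[1].lstrip('#').split()
    # Third line: one label per sample, in the same order as the GCT columns
    labels = lines[2].split()
    if len(labels) != n_samples or len(class_names) != n_classes:
        raise ValueError("CLS header does not match its contents")
    return class_names, labels

# Hypothetical six-sample example, NOT the real triple_negative.cls
example = """6 2 1
# non_TN TN
0 0 1 0 1 1
"""
class_names, labels = parse_cls(example)
# Assumes label i refers to the i-th class name
counts = {name: labels.count(str(i)) for i, name in enumerate(class_names)}
print(counts)
```

Counting the samples per class like this is a quick sanity check that the phenotype file lines up with the expression matrix before running the analysis.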

## Compute differentially expressed genes

<div class="well well-sm">
**Note:** Because of the size of this dataset, running this function will take some time (especially the plotting portion). Remember to check the status of your kernel, shown by the dot in the upper right corner (next to "Python 3.6").
</div>

<div class="alert alert-info">
<h3 style="margin-top: 0;"> Instructions <i class="fa fa-info-circle"></i></h3>
- For the gene expression dataset parameter, select from the dropdown menu the file: `brca_primary_all.gct`  
- For the phenotype file parameter, select from the dropdown menu the file: `triple_negative.cls`  
- Leave the rest of the parameters as default.
- Click Run.
</div>

In [7]:
import genepattern
import ccalnoir as ccal

from ccalnoir import differential_gene_expression
# import pandas as pd
# import urllib.request
from ccalnoir import compute_information_coefficient
from ccalnoir import custom_pearson_corr
RANDOM_SEED = 20180314

genepattern.GPUIBuilder(differential_gene_expression, name="Differential gene expression, discrete phenotype.", 
                        description="Sort genes according to their association with a discrete phenotype or class vector.",
                        parameters={
                                "gene_expression":{"name":"gene_expression_dataset",
                                                   "type": "file",
                                                   "kinds": ["gct"]},
                                "phenotype_file":{"type": "file",
                                                  "kinds": ["cls"]},
                                "ranking_method":{
                                                  "default": "custom_pearson_corr",
                                                  "choices":{'Pearson Correlation Matching':"custom_pearson_corr",
                                                             'Information Coefficient Matching':"compute_information_coefficient",
                                                            }
                                                     },
                                "title":{"default":"Differential Expression Results"},
                                "output_filename":{"default": "diffex_output",},
                                "random_seed":{"default":20180314,},
                                "output_var":{"default": "gene_scores",},
                        })

UIBuilder(description='Sort genes according to their association with a discrete phenotype or class vector.', …
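
To make the ranking step concrete, here is a minimal sketch of scoring genes by their Pearson correlation with a binary class vector and sorting by that score. It is an illustration on made-up data, not ccalnoir's `custom_pearson_corr` implementation; the gene names, expression values, and the helper `pearson_scores` are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 4 genes (rows) x 6 samples (columns)
expression = pd.DataFrame(
    [[1.0, 1.2, 0.9, 5.0, 5.5, 4.8],   # higher in class 1
     [4.0, 4.2, 3.9, 1.0, 0.8, 1.1],   # higher in class 0
     [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],   # no difference between classes
     [0.5, 0.7, 0.6, 1.5, 1.4, 1.6]],  # higher in class 1
    index=['GENE_A', 'GENE_B', 'GENE_C', 'GENE_D'],
)
# Class label of each sample, in column order (as a CLS file would provide)
phenotype = np.array([0, 0, 0, 1, 1, 1])

def pearson_scores(expression, phenotype):
    """Correlate every gene (row) with the phenotype vector and sort by score."""
    centered_x = expression.sub(expression.mean(axis=1), axis=0).values
    centered_y = phenotype - phenotype.mean()
    numerator = centered_x @ centered_y
    denominator = np.sqrt((centered_x ** 2).sum(axis=1) * (centered_y ** 2).sum())
    scores = pd.DataFrame({'Score': numerator / denominator}, index=expression.index)
    return scores.sort_values(by='Score', ascending=False)

toy_scores = pearson_scores(expression, phenotype)
print(toy_scores)
```

Genes whose expression tracks the class labels end up at the top of the table, genes that move in the opposite direction end up at the bottom, and genes with no class difference sit near zero in the middle.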

## Validate experimental results

We will corroborate the results summarized in the abstract of "A New Gene Expression Signature for Triple-Negative Breast Cancer Using Frozen Fresh Tissue before Neoadjuvant Chemotherapy" by Santuario-Facio et al. (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5469719/):  

> Forty genes showed differential expression pattern in TNBC tumors. Among these, nine overexpressed genes (PRKX/PRKY, UGT8, HMGA1, LPIN1, HAPLN3, FAM171A1, BCL141A, FOXC1, and ANKRD11), and one underexpressed gene (ANX9) are involved in general metabolism. Based on this biochemical peculiarity and the overexpression of BCL11A and FOXC1 (involved in tumor growth and metastasis, respectively)[...]


In [None]:
# From https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5469719/
genes_of_interest = ['PRKX','PRKY', 'UGT8','HMGA1','LPIN1','HAPLN3','FAM171A1','BCL141A','FOXC1','ANKRD11','BCL11A','ANX9']

In [None]:
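# Rank every gene by its score (rank 1 = highest score)
ranks = gene_scores['Score'].rank(ascending=False)

# Collect the genes of interest that are present in the results
found = []
for gene in genes_of_interest:
    if gene in gene_scores.index:
        found.append(gene)
    else:
        print(f"Gene {gene} not present.")

# Select the rows directly; DataFrame.append was removed in pandas 2.0
tn_genes = gene_scores.loc[found].copy()
tn_genes['rank'] = ranks[found]
tn_genes = tn_genes.sort_values(by='Score', ascending=False)
tn_genes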
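
When reading the `rank` column, keep in mind that it is a signed ranking: a gene with a strongly negative score lands near the bottom even though it is strongly associated with the phenotype. The sketch below, on a hypothetical four-gene table, contrasts the signed rank with a rank based on the absolute score; the column name `rank_by_abs` is an illustrative choice, not part of the analysis output.

```python
import pandas as pd

# Hypothetical scores, not results from the TCGA dataset
toy = pd.DataFrame({'Score': [0.8, -0.6, 0.1, 0.4]},
                   index=['G1', 'G2', 'G3', 'G4'])

# Signed rank: 1 = most positive score, last = most negative score
toy['rank'] = toy['Score'].rank(ascending=False)

# Rank by strength of association, ignoring the direction
toy['rank_by_abs'] = toy['Score'].abs().rank(ascending=False)

print(toy)
```

Here `G2` is last in the signed ranking but second by absolute score, which is the pattern to expect for a gene reported as underexpressed.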

In [8]:
import numpy as np
import pandas as pd

def make_clickable(url):
    to_display = url.split('=')[1]
    return '<a href="{}" target="_blank">{}</a>'.format(url, to_display)

def make_table(df,max_number_of_genes_to_show, actual_ranking=False):
    df['Rank'] = range(1,len(df)+1)
#     pd.options.display.max_colwidth = 100
    if max_number_of_genes_to_show > len(df):
        max_number_of_genes_to_show = len(df)
        print("You want to show more genes than your data contains ಠ_ಠ")
        print("Showing you only {} instead (i.e., all the genes you have provided)".format(max_number_of_genes_to_show))
    
    if actual_ranking:
        df['abs-Score'] = np.absolute(df['Score'])
        sorted_df = df.sort_values(by=["abs-Score"], ascending=False)
        sorted_df = sorted_df.head(max_number_of_genes_to_show)
    else:
        to_show = int(np.floor(max_number_of_genes_to_show/2))
        # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
        sorted_df = pd.concat([df.head(to_show), df.tail(to_show)])

    sorted_df['GeneCards link']=pd.Series(["http://www.genecards.org/cgi-bin/carddisp.pl?gene={}".format(gene) 
                                     for gene in sorted_df.index.values], index=sorted_df.index)
    
    cols = list(sorted_df)
    # move the column to head of list using index, pop and insert
    cols.insert(0, cols.pop(cols.index('GeneCards link')))
    # reorder the columns by label (.ix was removed in pandas 1.0)
    sorted_df = sorted_df.loc[:, cols]
    sorted_df.set_index('Rank', inplace=True)
    # Render the GeneCards URLs as clickable links
    styled_table = sorted_df.style.format({'GeneCards link': make_clickable})
    return styled_table

genepattern.GPUIBuilder(make_table, name="Table Creator", 
                        description="Show differentially expressed genes with links to genecards.org",
                        parameters={
                                "df":{"name": "differential_expression_results",
                                      "description":'The output from differential expression (for this exercise, leave it as "gene_scores")',
                                      "default":"gene_scores"},
                                "max_number_of_genes_to_show":{"description":"Maximum number of genes to show in the table (half will be overexpressed in one class and half in the other)",
                                                               "default":20},
                                "actual_ranking":{"hide":True},
                                "output_var":{"hide":True},
                        })

UIBuilder(description='Show differentially expressed genes with links to genecards.org', function_import='make…

<div class="well well-sm">
What if we want to look at these genes *as a set*?
</div>