# Analyzing HTSeq Data Using Two Different Models With GenePattern
The main goals of this project are:
- Analyze HTSeq count data with tools that assume an underlying [negative binomial distribution](https://en.wikipedia.org/wiki/Negative_binomial_distribution) on the data.
- Analyze HTSeq count data with tools that assume an underlying [normal distribution](https://en.wikipedia.org/wiki/Normal_distribution) on the data.
- Analyze [normalized HTSeq count](http://software.broadinstitute.org/cancer/software/genepattern/modules/docs/PreprocessReadCounts/1) data with tools that assume an underlying [normal distribution](https://en.wikipedia.org/wiki/Normal_distribution) on the data.
- Compare the results of differential gene expression analysis under the three scenarios above.
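
To make the difference between the two distributional assumptions more concrete, the sketch below computes the variance implied by a negative binomial model in its mean/dispersion parameterization (variance = mean + dispersion × mean²). The dispersion value is made up for illustration and is not estimated from this dataset.

```python
def negative_binomial_variance(mean, dispersion):
    """Variance of a negative binomial count with the given mean and dispersion.

    Uses the mean/dispersion parameterization: Var = mean + dispersion * mean**2.
    """
    return mean + dispersion * mean ** 2


# Hypothetical dispersion, chosen only for illustration.
ALPHA = 0.1

for mean in (10, 100, 1000):
    variance = negative_binomial_variance(mean, ALPHA)
    print("mean={:>5}  variance={:>9.1f}  variance/mean={:.1f}".format(
        mean, variance, variance / mean))
```

Under this model the variance grows faster than the mean, so the spread of raw counts depends strongly on expression level. This is one reason tools designed for approximately normal data can behave differently on untransformed counts.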

<img src="https://datasets.genepattern.org/data/ccmi_tutorial/2017-12-15/class_project_data/CCMI_workshop_project_overview.png" width="80%">

In [36]:
# Requires GenePattern Notebook: pip install genepattern-notebook
import gp
import genepattern

# Username and password removed for security reasons.
genepattern.display(genepattern.session.register("https://gp-beta-ami.genepattern.org/gp", "", ""))

---
## Section 1: Load and Filter the Dataset

In brief, the dataset used in this notebook consists of RNA-Seq counts downloaded from TCGA. We have selected 40 Breast Invasive Carcinoma (BRCA) samples: 20 come from tumor tissue and 20 come from the corresponding normal tissue.

<h3>1.1 Filter out uninformative genes.</h3>

<div class="alert alert-info">

<p>To remove the uninformative genes from the HTSeq dataset (i.e., the rows in the GCT file with the smallest variance), create a new cell below this one and use the <strong>PreprocessDataset</strong> GenePattern module with these parameters:</p>

<ul>
	<li><strong>input filename</strong>: Drag and drop the link to <a href="https://datasets.genepattern.org/data/TCGA_BRCA/WP_0_BRCA_cp_40_samples.gct" target="_blank">this GCT file</a><br />
	<em>Note: It should display the file&#39;s url after you have done so.</em></li>
	<li><strong>output filename</strong>: <strong>workshop_BRCA_filtered.gct</strong></li>
	<li>The rest of the parameters can be left as default.</li>
</ul>
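
As a rough illustration of what "removing the rows with the smallest variance" means, the sketch below reads a tiny made-up GCT file and keeps the most variable rows. It assumes the standard GCT layout (version line, dimensions line, header row, then one row per gene). The helper names `read_gct` and `keep_most_variable` are hypothetical, and the PreprocessDataset module applies its own thresholds and filters, so this is not the module's algorithm.

```python
import csv
import io
import statistics

# Tiny made-up GCT file (version line, dimensions line, header, then one row per gene).
EXAMPLE_GCT = (
    "#1.2\n"
    "3\t4\n"
    "Name\tDescription\tS1\tS2\tS3\tS4\n"
    "GENE_A\tna\t10\t12\t11\t9\n"
    "GENE_B\tna\t5\t250\t8\t300\n"
    "GENE_C\tna\t0\t0\t1\t0\n"
)


def read_gct(handle):
    """Return {gene_name: [values...]} from a GCT-formatted text handle."""
    handle.readline()  # version line, e.g. "#1.2"
    handle.readline()  # dimensions line: number of rows and columns
    reader = csv.reader(handle, delimiter="\t")
    next(reader)       # column header: Name, Description, sample names
    return {row[0]: [float(v) for v in row[2:]] for row in reader if row}


def keep_most_variable(table, n_keep):
    """Keep the n_keep rows with the largest sample variance."""
    ranked = sorted(table, key=lambda gene: statistics.variance(table[gene]),
                    reverse=True)
    return {gene: table[gene] for gene in ranked[:n_keep]}


counts = read_gct(io.StringIO(EXAMPLE_GCT))
filtered = keep_most_variable(counts, n_keep=2)
print(sorted(filtered))
```

In this toy example `GENE_C` is dropped because its values barely change across samples.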

### 1.2 Load the CLS file for future use by using the RenameFile GenePattern module.
<div class="alert alert-info">
To make the phenotype labels file (the CLS file) easily accessible to the GenePattern modules in this notebook, we will use the **RenameFile** module. Create a new cell below this one and run the RenameFile GenePattern module with the following parameters:
+ **input filename**: Drag and drop the link to [this CLS file](https://datasets.genepattern.org/data/TCGA_BRCA/WP_0_BRCA_cp_40_samples.cls)  
*Note: It should display the file's url after you have done so.*  
*Also: Ignore the "File may not be an acceptable format" warning.*
+ **output filename**: **workshop_BRCA_labels.cls**
+ The rest of the parameters can be left as default.
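
For readers who have not seen a CLS file before, the sketch below parses a small made-up example. It assumes the usual three-line categorical CLS layout (a header with the number of samples and classes, a line of class names starting with `#`, and one label per sample). The helper `read_cls` is hypothetical and is not part of the GenePattern library.

```python
EXAMPLE_CLS = (
    "4 2 1\n"
    "# normal tumor\n"
    "0 0 1 1\n"
)


def read_cls(text):
    """Return (class_names, labels) from the text of a categorical CLS file."""
    header, names_line, labels_line = text.strip().splitlines()[:3]
    n_samples, n_classes = (int(x) for x in header.split()[:2])
    class_names = names_line.lstrip("#").split()
    labels = labels_line.split()
    # Sanity check: the header must agree with the other two lines.
    if len(labels) != n_samples or len(class_names) != n_classes:
        raise ValueError("CLS header does not match its contents")
    return class_names, labels


class_names, labels = read_cls(EXAMPLE_CLS)
print(class_names, labels)
```

The labels appear in the same order as the sample columns of the GCT file, which is how the modules know which samples belong to which phenotype.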

---
## Section 2: Analyzing HTseq Counts Using a Negative Binomial Model
These results will be used as the reference for comparison later in this notebook and will be referred to as **`negative_binomial_results`**.

### 2.1 Perform differential gene expression using DESeq2 
<div class="alert alert-info">
Create a new cell below this one and use the **DESeq2** GenePattern module with the following parameters:

+ **input file**: From the dropdown menu, choose the output from the PreprocessDataset module (i.e., **workshop_BRCA_filtered.gct** if you used the suggested parameters in section 1).
+ **cls file**: From the dropdown menu, choose the output from the RenameFile module (i.e., **workshop_BRCA_labels.cls** if you used the suggested parameters in section 1).
+ Click on **Run** and move on to step 2.2 of this section once the job is complete.   

### 2.2 Extract top 100 differentially expressed genes and save them to a dictionary for later use.
<div class="alert alert-info">
We will parse one of the TXT files produced by the previous cell (**DESeq2**), extract only the information we need (i.e., the name and rank of the 100 most differentially expressed genes), and save it in a Python dictionary named **`negative_binomial_results`**. To do so, we use the GenePattern UI Builder in the next cell. Feel free to check out the underlying code if you want. Set the input parameters as follows:

- Send the **first output** of **DESeq2** to Extract Ranked Gene List From TXT GenePattern Variable { }  
    + Hint: the name of the file should be **workshop_BRCA_filtered.normal.vs.tumor.DESeq2_results_report.txt**
    + Click the "i" icon and, in the dropdown menu that appears, under **"Send to Existing GenePattern Cell"**, select **"Extract Ranked Gene List From TXT GenePattern Variable { }"**.
    + Alternatively, choose that TXT file from the dropdown menu of the cell below.

- **file var**: the action just before this one should have populated this parameter with a long URL similar to this one: *https://<span></span>gp-beta-ami.genepattern.org/gp/jobResults/1234567/workshop_BRCA_filtered.normal.vs.tumor.DESeq2_results_report.txt*.
- **number of genes**: 100 (default)
- **verbose**: true (default)
- Confirm that the **output variable** is set to **negative_binomial_results**.
- Run the cell.


In [59]:
import genepattern
def extract_genes_from_txt(file_var:'URL of the results_report_txt file from DESeq2', 
                           number_of_genes:'How many genes to extract'=100, 
                           verbose:'Whether or not to print the gene list'=True):
    
    genes_dict = {}  # Initializing the dictionary of genes and rankings
    
    # Get the job number and name of the file
    temp = file_var.split('/')
    # Programmatically access that job to open the file
    gp_file = eval('job'+temp[5]+'.get_file("'+temp[6]+'")')
    py_file = gp_file.open()
    py_file.readline()
    
    rank = 1
    for line in py_file.readlines():
        formatted_line = str(line,'utf-8').strip('\n').split('\t')
        genes_dict[formatted_line[0]] = rank
        if rank >= number_of_genes:
            break
        rank += 1
    
    if verbose:
        # For display only
        for gene in genes_dict:
            print("{}: {}".format(genes_dict[gene],gene))
    
    return genes_dict

genepattern.GPUIBuilder(extract_genes_from_txt,
                        name="Extract Ranked Gene List From TXT GenePattern Variable",
                        parameters={
                                    "file_var": {
                                                 "type": "file",
                                                 "kinds": ["txt"],
                                    }
                        })
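
The function above picks the job number and file name out of the URL by position (items 5 and 6 after splitting on `/`). The sketch below shows the same extraction with a basic check on the URL shape. `job_and_filename` is a hypothetical helper, not part of the GenePattern library, and the URL is the placeholder used in the instructions.

```python
from urllib.parse import urlparse


def job_and_filename(file_url):
    """Split a GenePattern job-result URL into (job_number, file_name)."""
    parts = urlparse(file_url).path.strip("/").split("/")
    # Expected path layout: gp/jobResults/<job number>/<file name>
    if len(parts) < 4 or parts[-3] != "jobResults":
        raise ValueError("not a job-result URL: {}".format(file_url))
    return parts[-2], parts[-1]


# Placeholder URL in the same shape as the one shown in the instructions.
url = ("https://gp-beta-ami.genepattern.org/gp/jobResults/1234567/"
       "workshop_BRCA_filtered.normal.vs.tumor.DESeq2_results_report.txt")
print(job_and_filename(url))
```

Checking the path before using its pieces gives a clearer error message if the wrong kind of link is pasted into the **file var** field.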

---
## Section 3: Analyzing HTSeq Counts Using a Naive Normal Model
These results will be used for comparison later in this notebook and will be referred to as **`naive_normal_results`**.

### 3.1 Perform differential gene expression analysis using ComparativeMarkerSelection
<div class="alert alert-info">
Create a new cell below this one and use the **ComparativeMarkerSelection** GenePattern module with the following parameters:
+ **input file**: The output from the **PreprocessDataset** module (i.e., **workshop_BRCA_filtered.gct** if you used the suggested parameters in section 1).
+ **cls file**: The output from the **RenameFile** module (i.e., **workshop_BRCA_labels.cls** if you used the suggested parameters in section 1).
+ The rest of the parameters can be left as default.

### 3.2 Extract top 100 genes and save to a dictionary for later use.
<div class="alert alert-info">
We will parse the ODF file produced by the previous cell (**ComparativeMarkerSelection**), extract only the information we need (i.e., the name and rank of the 100 most differentially expressed genes), and save it in a Python dictionary named **`naive_normal_results`**. To do so, we use the GenePattern UI Builder in the next cell. Feel free to check out the underlying code if you want. Set the input parameters as follows:

- Send the output of **ComparativeMarkerSelection** to Extract Ranked Gene List From ODF GenePattern Variable { }  
    + Click the "i" icon and, in the dropdown menu that appears, under **"Send to Existing GenePattern Cell"**, select **"Extract Ranked Gene List From ODF GenePattern Variable { }"**.
    + Alternatively, choose that ODF file from the dropdown menu of the cell below.

- **GP ODF**: the action just before this one should have populated this parameter with a long URL similar to this one: *https://<span></span>gp-beta-ami.genepattern.org/gp/jobResults/1234567/workshop_BRCA_filtered.comp.marker.odf*.
- **number of genes**: 100 (default)
- **verbose**: true (default)
- Confirm that the **output variable** is set to **naive_normal_results**.
- Run the cell.

In [55]:
from gp.data import ODF
def custom_CMSreader(GP_ODF:'URL of the ODF output from ComparativeMarkerSelection',
                     number_of_genes:'How many genes to extract'=100, 
                     verbose:'Whether or not to print the gene list'=True):
    
    # Get the job number and name of the file
    temp = GP_ODF.split('/')
    # Programmatically access that job to open the file
    GP_ODF = eval('ODF(job'+temp[5]+'.get_file("'+temp[6]+'"))')
#     GP_ODF = GP_ODF.dataframe
    # Keep only the top-ranked rows (.loc replaces .ix, which newer pandas versions no longer provide)
    GP_ODF = GP_ODF.loc[GP_ODF['Rank']<=number_of_genes,['Rank','Feature']]
    GP_ODF.set_index('Feature', inplace=True)
    to_return = GP_ODF.to_dict()['Rank']
    if verbose:
        # For display only
        genes_list = sorted([[v,k] for k,v in to_return.items()])
        for gene in genes_list:
            print("{}: {}".format(gene[0],gene[1]))
    return to_return


genepattern.GPUIBuilder(custom_CMSreader, 
                        name="Extract Ranked Gene List From ODF GenePattern Variable",
                        parameters={
                                    "GP_ODF": {
                                                 "type": "file",
                                                 "kinds": ["Comparative Marker Selection"],
                                    }
                        })
# naive_normal_results = custom_CMSreader(**INSERT_THE_VALUE_YOU_COPIED_IN_THE_PREVIOUS_CELL_HERE**, number_of_genes=100)

---
## Section 4: Comparing Results of the Negative Binomial and Naive Normal Models
In this section we define a function to compare the dictionaries that contain the lists of top differentially expressed genes and their ranks. The function takes into account both the overlap between **`negative_binomial_results`** and **`naive_normal_results`** and the ranking of the genes present in both lists.

<div class="alert alert-info">
Run the cell below this one and analyze the output of the **`compare_dictionaries()`** function. Use the following parameters:  
- **reference list**: negative_binomial_results
- **new list**: naive_normal_results

In [56]:
from scipy.stats import kendalltau as kTau

def compare_dictionaries(reference_list, new_list):
    # compute how many of the genes in ref are in new
    common = (list(set(reference_list) & set(new_list)))
    
    ref_common = [reference_list[temp] for temp in common]
    new_common = [new_list[temp] for temp in common]
    kendall_tau = kTau(ref_common,new_common)[0]  # Kendall's Tau measures the similarity between two ordered lists.
    metric = kendall_tau * len(common)/len(reference_list)  # Penalizing low overlap between lists.
    
    print("There is a {:.3g}% overlap.".format(100*len(common)/len(reference_list)),
          "Custom metric is {:.3g} (metric range [-1,1])".format(metric))
    return metric

# compare_dictionaries(negative_binomial_results, naive_normal_results)
genepattern.GPUIBuilder(compare_dictionaries, name="Compare Two Ranked Lists")
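
To see how the metric in `compare_dictionaries` behaves, the sketch below works through a tiny made-up example by hand. It counts concordant and discordant pairs directly, which matches Kendall's tau when there are no tied ranks. The gene names and ranks are placeholders, and `kendall_tau_no_ties` is a hypothetical helper written for illustration.

```python
from itertools import combinations


def kendall_tau_no_ties(x, y):
    """Kendall's tau for two equal-length sequences without tied values."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1      # the pair is ordered the same way in both lists
        elif direction < 0:
            discordant += 1      # the pair is ordered in opposite ways
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs


# Made-up ranked lists; gene names are placeholders.
reference = {"G1": 1, "G2": 2, "G3": 3, "G4": 4}
new = {"G1": 2, "G2": 1, "G3": 3, "G5": 4}

common = sorted(set(reference) & set(new))            # genes present in both lists
tau = kendall_tau_no_ties([reference[g] for g in common],
                          [new[g] for g in common])
overlap = len(common) / len(reference)                # 3 of 4 reference genes are shared
print(tau, overlap, tau * overlap)
```

Here two of the three shared pairs keep their order and one is swapped, so tau is 1/3. Multiplying by the 75% overlap gives a metric of 0.25. Because Kendall's tau lies between -1 and 1, the metric can be negative when the shared genes are ranked in mostly opposite order.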

---
## Section 5: Analyzing Transformed HTSeq Counts Using a Normal Model
These results will be used for comparison later in this notebook and will be referred to as **`transformed_normal_results`**.

### 5.1 Transform HTSeq counts to approximate a normal distribution
<div class="alert alert-info">
Create a new cell below this one and use the **PreprocessReadCounts** GenePattern module with the following parameters:

+ **input file**: The output from the **PreprocessDataset** module (i.e., **workshop_BRCA_filtered.gct** if you used the suggested parameters in section 1).
+ **cls file**: The output from the **RenameFile** module (i.e., **workshop_BRCA_labels.cls** if you used the suggested parameters in section 1).
+ **output file**: leave as default.

### 5.2 Perform differential gene expression analysis on transformed counts using ComparativeMarkerSelection
<div class="alert alert-info">
Create a new cell below this one and use the **ComparativeMarkerSelection** GenePattern module with the following parameters:

+ **input file**: The output from the **PreprocessReadCounts** module (i.e., **workshop_BRCA_filtered.preprocessed.gct** if you used the suggested parameters in step 5.1 of this section).
+ **cls file**: The output from the **RenameFile** module (i.e., **workshop_BRCA_labels.cls** if you used the suggested parameters in section 1).
+ The rest of the parameters can be left as default.

### 5.3 Extract top 100 genes and save to a dictionary for later use.
<div class="alert alert-info">

We will parse the ODF file produced by the previous cell (**ComparativeMarkerSelection**), extract only the information we need (i.e., the name and rank of the 100 most differentially expressed genes), and save it in a Python dictionary named **`transformed_normal_results`**. To do so, we use the GenePattern UI Builder in the next cell. Feel free to check out the underlying code if you want. Set the input parameters as follows:

- Send the output of **ComparativeMarkerSelection** to Extract Ranked Gene List From ODF GenePattern Variable { }  
    + Click the "i" icon and, in the dropdown menu that appears, under **"Send to Existing GenePattern Cell"**, select **"Extract Ranked Gene List From ODF GenePattern Variable { }"**.
    + Alternatively, choose that ODF file from the dropdown menu of the cell below.

- **GP ODF**: the action just before this one should have populated this parameter with a long URL similar to this one: *https://<span></span>gp-beta-ami.genepattern.org/gp/jobResults/1234567/workshop_BRCA_filtered.preprocessed.comp.marker.odf*.
- **number of genes**: 100 (default)
- **verbose**: true (default)
- Confirm that the **output variable** is set to **transformed_normal_results**.
- Run the cell.

In [57]:
#transformed_normal_results = custom_CMSreader(**INSERT_THE_VALUE_YOU_COPIED_IN_THE_PREVIOUS_CELL_HERE**, number_of_genes=100)

genepattern.GPUIBuilder(custom_CMSreader, 
                        name="Extract Ranked Gene List From ODF GenePattern Variable",
                        parameters={
                                    "GP_ODF": {
                                                 "type": "file",
                                                 "kinds": ["Comparative Marker Selection"],
                                    }
                        })

---
## Section 6: Comparing Results of the Negative Binomial and Transformed Normal Models
In this short section we use the function we defined in section 4 to compare the dictionaries that contain the lists of top differentially expressed genes and their ranks. Use the following parameters:  
- **reference list**: negative_binomial_results
- **new list**: transformed_normal_results

In [58]:
genepattern.GPUIBuilder(compare_dictionaries, name="Compare Two Ranked Lists")
# compare_dictionaries(negative_binomial_results, transformed_normal_results)

<div class="alert alert-success">
*Note:* Why do we get better results after using PreprocessReadCounts? From the module's documentation:

>Many of these tools were originally designed to handle microarray data - particularly from Affymetrix arrays - and so we must be mindful of that origin when preprocessing data for use with them.
>
>The module does this by using a mean-variance modeling technique [1] to transform the dataset to fit an approximation of a normal distribution, with the goal of thus being able to apply classic normal-based microarray-oriented statistical methods and workflows.
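
One common building block of mean-variance approaches is a log2 counts-per-million transform. The sketch below shows only that step, with small offsets to avoid taking the logarithm of zero. It is an assumption that PreprocessReadCounts does something comparable; the module's actual procedure (including any modeling of the mean-variance trend) is not reproduced here, and `log2_cpm` is a hypothetical helper.

```python
import math


def log2_cpm(counts, library_size=None):
    """Log2 counts-per-million for one sample, with small offsets to avoid log(0)."""
    if library_size is None:
        library_size = sum(counts)
    return [math.log2((c + 0.5) / (library_size + 1.0) * 1e6) for c in counts]


# Made-up counts for one sample (one entry per gene).
sample_counts = [0, 10, 100, 1000, 10000]
for count, value in zip(sample_counts, log2_cpm(sample_counts)):
    print("{:>6} -> {:.2f}".format(count, value))
```

On the log scale, counts that span several orders of magnitude are compressed into a much narrower range, which brings the data closer to what normal-based methods expect.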


---
## Extra credit: Cluster samples before and after transforming HTSeq counts

## EC 1 Cluster samples using HTSeq counts
In this section we will build upon the results from section 3 and perform some manual checks on them. It is good scientific practice to check the results of your analyses. The maroon elements in the following schematic represent what this section will accomplish:

<img src="https://datasets.genepattern.org/data/ccmi_tutorial/2017-12-15/class_project_data/CCMI_workshop_project_ec1.png" width="80%">

### EC 1.1 Display results of ComparativeMarkerSelection
<div class="alert alert-info">
Use **ComparativeMarkerSelectionViewer** to verify the output of **ComparativeMarkerSelection** from section 3 does not show any major problems. Use the following parameters:
- **comparative marker selection filename**: Select the output from **ComparativeMarkerSelection** from section 3 (i.e., **workshop_BRCA_filtered.comp.marker.odf** if you used the suggested parameters).
- **dataset filename**: Select the output from the PreprocessDataset module (i.e., **workshop_BRCA_filtered.gct** if you used the suggested parameters).
- Run the module.

### EC 1.2 Perform clustering on RNASeq samples
<div class="alert alert-info">
Use **KMeansClustering** to see if data can be clustered easily. Use the following parameters:
- input filename: The output from the **PreprocessDataset** module (i.e., **workshop_BRCA_filtered.gct** if you used the suggested parameters).
- number of clusters: 2.
- cluster by: columns.
- The rest of the parameters can be left as default.
- Run the module.

### EC 1.3 Manually review the results of clustering
<div class="alert alert-info">
Open both of the *first two* GCT files created by **KMeansClustering**. These files show which samples have been clustered together.
+ Click the "i" icon and on the dropdown menu that appears choose "Open in New Tab."  

Do the two clusters correspond to the tumor tissue samples ("TCGA-xx-xxxx-**01**") and the normal tissue samples ("TCGA-xx-xxxx-**11**")?
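
If you prefer to tally the samples rather than read them by eye, the sketch below counts tumor and normal samples per cluster from their barcodes, using the **01** (tumor) and **11** (normal) codes mentioned above. The barcodes and cluster names are placeholders, and `tissue_type` and `summarize_clusters` are hypothetical helpers. You would replace the lists with the sample names found in each cluster's GCT file.

```python
from collections import Counter


def tissue_type(barcode):
    """Classify a TCGA barcode by the two-digit sample-type code in its 4th field."""
    code = barcode.split("-")[3][:2]
    return {"01": "tumor", "11": "normal"}.get(code, "other")


def summarize_clusters(clusters):
    """Count tissue types per cluster; clusters maps a cluster name to barcodes."""
    return {name: Counter(tissue_type(b) for b in barcodes)
            for name, barcodes in clusters.items()}


# Placeholder barcodes; replace with the sample names listed in each cluster's GCT file.
example_clusters = {
    "cluster_1": ["TCGA-AA-0001-01", "TCGA-AA-0002-01", "TCGA-AA-0003-11"],
    "cluster_2": ["TCGA-AA-0001-11", "TCGA-AA-0002-11"],
}
for name, tally in summarize_clusters(example_clusters).items():
    print(name, dict(tally))
```

A cluster that contains almost only one tissue type suggests that the clustering separates tumor from normal samples well.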

## EC 2 Cluster samples using transformed HTSeq counts
In this section we will build upon the results from section 5 and perform some manual checks on them. It is good scientific practice to check the results of your analyses. The maroon elements in the following schematic represent what this section will accomplish:

<img src="https://datasets.genepattern.org/data/ccmi_tutorial/2017-12-15/class_project_data/CCMI_workshop_project_ec2.png" width="80%">

### EC 2.1 Display results of ComparativeMarkerSelection

<div class="alert alert-info">
Use **ComparativeMarkerSelectionViewer** to verify the output of **ComparativeMarkerSelection** from section 5 does not show any major problems. Use the following parameters:
- **comparative marker selection filename**: Select the output from **ComparativeMarkerSelection** from section 5 (i.e., **workshop_BRCA_filtered.preprocessed.comp.marker.odf** if you used the suggested parameters).
- **dataset filename**: Select the output from the PreprocessReadCounts module (i.e., **workshop_BRCA_filtered.preprocessed.gct** if you used the suggested parameters).
- Run the module.

### EC 2.2 Perform clustering on RNASeq samples
<div class="alert alert-info">
Use **KMeansClustering** to see if data can be clustered easily. Use the following parameters:
- input filename: The output from the **PreprocessReadCounts** module (i.e., **workshop_BRCA_filtered.preprocessed.gct** if you used the suggested parameters in step 5.1 from section 5).
- number of clusters: 2.
- cluster by: columns.
- The rest of the parameters can be left as default.
- Run the module.

### EC 2.3 Manually review the results of clustering
<div class="alert alert-info">
Open both of the first two GCT files created by **KMeansClustering**. These files show which samples have been clustered together.
+ Click the "i" icon and on the dropdown menu that appears choose "Open in New Tab."

Do the two clusters correspond to the tumor tissue samples ("TCGA-xx-xxxx-**01**") and the normal tissue samples ("TCGA-xx-xxxx-**11**")?