# Analyzing HTseq Data Using GenePattern
The goal of this project is twofold:
- Analyze HTseq count data with tools that were designed for count data.
- Process count data to be analyzed with tools designed for microarray data.

<img src="./CCMI_workshop_project_overview.png" width="80%">

---
## Section 1: Load and Filter the Dataset

### 1.1 Filter out uninformative genes.
To remove the uninformative genes from the HTseq dataset (i.e., the rows in the GCT file with the smallest variance), create a new cell below this one and use the PreprocessDataset GenePattern module with these parameters:
+ input filename: Drag and drop the link to [this GCT file](./BRCA_40_samples.gct)  
*Note: It should display the file's URL after you have done so.*  
+ output filename: **Workshop_BRCA_filtered.< input.file_extension >**  
  *Note: make sure you remove spaces after the "<" and before the ">" characters.*    
+ The rest of the parameters can be left as default.
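
To make the idea of "removing the rows with the smallest variance" concrete, here is a minimal pure-Python sketch. It assumes the standard GCT v1.2 layout (a `#1.2` line, a dimensions line, then a header row starting with `Name` and `Description`). The helper names `read_gct` and `keep_most_variable` and the toy values are hypothetical, and the sketch does not reproduce the actual filtering criteria of the PreprocessDataset module.

```python
import statistics

def read_gct(text):
    """Parse GCT v1.2 text into (sample_names, {gene_name: [values]})."""
    lines = text.strip().splitlines()
    if not lines[0].startswith("#1.2"):
        raise ValueError("expected a GCT v1.2 file")
    n_rows, n_cols = (int(x) for x in lines[1].split("\t")[:2])
    samples = lines[2].split("\t")[2:]  # columns 0 and 1 are Name and Description
    if len(samples) != n_cols:
        raise ValueError("header does not match the declared number of samples")
    rows = {}
    for line in lines[3:3 + n_rows]:
        fields = line.split("\t")
        rows[fields[0]] = [float(v) for v in fields[2:]]
    return samples, rows

def keep_most_variable(rows, n_keep):
    """Keep the n_keep genes with the largest sample variance."""
    ranked = sorted(rows, key=lambda g: statistics.variance(rows[g]), reverse=True)
    return {g: rows[g] for g in ranked[:n_keep]}

# Toy data (made up for illustration only)
toy_gct = "\n".join([
    "#1.2",
    "3\t4",
    "Name\tDescription\tS1\tS2\tS3\tS4",
    "GENE_A\tna\t10\t12\t11\t9",
    "GENE_B\tna\t5\t500\t7\t450",
    "GENE_C\tna\t100\t100\t100\t100",
])
samples, rows = read_gct(toy_gct)
filtered = keep_most_variable(rows, n_keep=2)
print(samples, sorted(filtered))
```

In this toy example `GENE_C` is constant across samples, so it carries no information for distinguishing classes and is the row that gets dropped.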

### 1.2 Load the CLS file for future use by using the RenameFile GenePattern module.
+ input filename: Drag and drop the link to [this CLS file](./BRCA_40_samples.cls)  
*Note: It should display the file's URL after you have done so.*  
+ output filename: **Workshop_BRCA_labels.< input.file_extension >**  
  *Note: make sure you remove spaces after the "<" and before the ">" characters.*  
+ The rest of the parameters can be left as default.
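
For readers unfamiliar with the CLS format, the following sketch shows how the class labels in such a file could be read. It assumes the common categorical CLS layout (a counts line, a `#`-prefixed line of class names, then one label per sample) and that the labels are the integer indices `0..k-1`. The function name `read_cls` and the toy content are hypothetical and are not taken from **BRCA_40_samples.cls**.

```python
def read_cls(text):
    """Parse categorical CLS text into (class_names, per-sample class names)."""
    lines = [ln.strip() for ln in text.strip().splitlines()]
    n_samples, n_classes = (int(x) for x in lines[0].split()[:2])
    class_names = lines[1].lstrip("#").split()
    labels = lines[2].split()
    if len(class_names) != n_classes:
        raise ValueError("class-name line does not match the declared class count")
    if len(labels) != n_samples:
        raise ValueError("label line does not match the declared sample count")
    # Assumption: labels are integer indices into class_names
    return class_names, [class_names[int(i)] for i in labels]

# Toy content (made up for illustration only)
toy_cls = "4 2 1\n# normal tumor\n0 0 1 1\n"
class_names, sample_classes = read_cls(toy_cls)
print(class_names, sample_classes)
```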

---
## Section 2: Process HTseq Data Directly
These results will be used as the ground truth for this notebook.

### 2.1 Perform differential gene expression analysis with the DESeq2 GenePattern module using the following parameters:
+ input file: From the dropdown menu, choose the output from the PreprocessDataset module (i.e., **Workshop_BRCA_filtered.gct** if you used the suggested parameters in Section 1).
+ cls file: From the dropdown menu, choose the output from the RenameFile module (i.e., **Workshop_BRCA_labels.cls** if you used the suggested parameters in Section 1).
+ Click on **Run** and move on to Step 2 of this section once the job is complete.   

### 2.2 Extract top 100 differentially expressed genes and save them to a DataFrame for later use.
+ Send the first output of DESeq2 to Code (e.g., **Workshop_BRCA_filtered.normal.vs.tumor.DEseq2_results_report.txt**).
+ Copy the name of the variable that was created.
    - *Note: it should be a name similar to* **workshop_brca_filtered_normal_vs_tumor_deseq2_results_report_txt_1234567**.
+ We will parse this text file and extract only the information that we want (i.e., the name and rank of the 100 most differentially expressed genes) by running the code in the next cell.

In [None]:
def extract_genes_from_txt(file_var, number_of_genes, verbose=True):
    """Return {gene_name: rank} for the first `number_of_genes` rows of the report."""
    genes_dict = {}  # Maps gene name -> rank (1 = first row of the report)
    py_file = file_var.open()
    py_file.readline()  # Skip the header row

    # The row order of the report is used as the ranking
    for rank, line in enumerate(py_file.readlines(), start=1):
        formatted_line = str(line, 'utf-8').strip('\n').split('\t')
        genes_dict[formatted_line[0]] = rank  # First column holds the gene name
        if rank >= number_of_genes:
            break

    if verbose:
        print(sorted([[v, k] for k, v in genes_dict.items()]))  # For display only

    return genes_dict

ground_truth = extract_genes_from_txt(INSERT_THE_VALUE_YOU_COPIED_IN_THE_PREVIOUS_CELL_HERE, number_of_genes=100)

---
## Section 3: Naively Using HTseq Data with GenePattern

### 3.1 Use the ComparativeMarkerSelection GenePattern module with the following parameters:
+ input file: The output from the PreprocessDataset module (i.e., **Workshop_BRCA_filtered.gct** if you used the suggested parameters in Section 1).
+ cls file: The output from the RenameFile module (i.e., **Workshop_BRCA_labels.cls** if you used the suggested parameters in Section 1).
+ The rest of the parameters can be left as default.

### 3.2 Extract the top 100 genes and save them to a dictionary for later use.
+ Send the ODF file from ComparativeMarkerSelection to a DataFrame.
+ Copy the name of that variable and use it in the code below.

In [None]:
def custom_CMSreader(GP_ODF, number_of_genes=20, verbose=True):
    # Keep only the rank and name of the top-ranked features.
    # .loc is used because .ix was removed in pandas 1.0.
    top = GP_ODF.dataframe.loc[GP_ODF['Rank'] <= number_of_genes, ['Rank', 'Feature']]
    top = top.set_index('Feature')
    to_return = top.to_dict()['Rank']  # Maps feature name -> rank
    if verbose:
        print(sorted([[v, k] for k, v in to_return.items()]))  # For display only
    return to_return

naive = custom_CMSreader(INSERT_THE_VALUE_YOU_COPIED_IN_THE_PREVIOUS_CELL_HERE, number_of_genes=100)

---
## Section 4: Comparing Ground Truth to Naive Approach
In this section we define a function to compare the dictionaries that contain the lists of top differentially expressed genes and their ranks. This function takes into account both the overlap between the "ground truth" and the "naive" lists and the ranking of the genes they have in common.

In [None]:
from scipy.stats import kendalltau as kTau
def compare_dictionaries(ref, new):
    # compute how many of the genes in ref are in new
    common = (list(set(ref) & set(new)))
    
    ref_common = [ref[temp] for temp in common]
    new_common = [new[temp] for temp in common]
    kendall_tau = kTau(ref_common, new_common)[0]  # Kendall's Tau measures the similarity between two ordered lists.
    metric = kendall_tau * len(common)/len(ref)  # Penalizing low overlap between lists.
    
    # Kendall's Tau lies in [-1, 1] and the overlap fraction in [0, 1], so the metric lies in [-1, 1].
    print("There is a {:.3g}% overlap.".format(100*len(common)/len(ref)),
          "Custom metric is {:.3g} (metric range [-1,1])".format(metric))
    return metric

In [None]:
compare_dictionaries(ground_truth, naive)
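
To build some intuition for the metric, the sketch below recomputes it on toy dictionaries in pure Python. Because the ranks within each dictionary are distinct, a tie-free Kendall's Tau (concordant minus discordant pairs, divided by the number of pairs) is sufficient here. The names `kendall_tau_no_ties` and `overlap_weighted_tau` and the toy gene names are hypothetical. Returning `0.0` when fewer than two genes are shared is an assumption of this sketch, not the behavior of the function above.

```python
from itertools import combinations

def kendall_tau_no_ties(x, y):
    """Kendall's Tau for two equally long sequences without ties."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

def overlap_weighted_tau(ref, new):
    """Tau over the shared genes, scaled by the fraction of ref that is shared."""
    common = sorted(set(ref) & set(new))
    if len(common) < 2:
        return 0.0  # Assumption: Tau is undefined here, so report no agreement
    tau = kendall_tau_no_ties([ref[g] for g in common], [new[g] for g in common])
    return tau * len(common) / len(ref)

# Toy rankings (made up for illustration only)
toy_ref = {"G1": 1, "G2": 2, "G3": 3, "G4": 4}
toy_reversed = {"G1": 4, "G2": 3, "G3": 2, "G4": 1}
toy_partial = {"G1": 1, "G2": 2, "G9": 3, "G8": 4}

print(overlap_weighted_tau(toy_ref, toy_ref))       # identical lists
print(overlap_weighted_tau(toy_ref, toy_reversed))  # same genes, opposite order
print(overlap_weighted_tau(toy_ref, toy_partial))   # half the genes shared, same order
```

Identical lists score 1, a fully reversed list scores -1, and a list that shares only half of the reference genes (in the same order) scores 0.5, which shows how the overlap factor scales down an otherwise perfect rank agreement.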

---
## Section 5: Preprocess HTseq Counts Before Performing Differential Expression

### 5.1 Use the PreprocessReadCounts GenePattern module with the following parameters:
+ input file: The output from the PreprocessDataset module (i.e., **Workshop_BRCA_filtered.gct** if you used the suggested parameters in Section 1).
+ cls file: The output from the RenameFile module (i.e., **Workshop_BRCA_labels.cls** if you used the suggested parameters in Section 1).
+ output file: leave as default.

### 5.2 Use the ComparativeMarkerSelection GenePattern module with the following parameters:
+ input file: The output from the PreprocessReadCounts module (i.e., **Workshop_BRCA_filtered.preprocessed.gct** if you used the suggested parameters in Step 1 of this section).
+ cls file: The output from the RenameFile module (i.e., **Workshop_BRCA_labels.cls** if you used the suggested parameters in Section 1).
+ The rest of the parameters can be left as default.

### 5.3 Extract the top 100 genes and save them to a dictionary for later use.
+ Send the ODF file from ComparativeMarkerSelection to a DataFrame.
+ Copy the name of that variable and use it in the code below.

In [None]:
preprocessed = custom_CMSreader(INSERT_THE_VALUE_YOU_COPIED_IN_THE_PREVIOUS_CELL_HERE, number_of_genes=100)

---
## Section 6: Comparing Ground Truth to New Approach
In this short section we use the function we defined in Section 4 to compare the dictionaries that contain the lists of top differentially expressed genes and their ranks.

In [None]:
compare_dictionaries(ground_truth, preprocessed)

*Note:* Why do we get better results after using PreprocessReadCounts? From the module's documentation:

>Many of these tools were originally designed to handle microarray data - particularly from Affymetrix arrays - and so we must be mindful of that origin when preprocessing data for use with them.
>
>The module does this by using a mean-variance modeling technique [1] to transform the dataset to fit an approximation of a normal distribution, with the goal of thus being able to apply classic normal-based microarray-oriented statistical methods and workflows.
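
The quoted documentation does not spell out the transformation, so the following is only a rough illustration of why a transform helps. It assumes a voom-style log2 counts-per-million step, which may differ from what PreprocessReadCounts actually does, and it omits the mean-variance weighting entirely. The function name `log2_cpm` and the toy counts are hypothetical.

```python
import math

def log2_cpm(counts):
    """Convert {gene: [raw count per sample]} to log2 counts per million.

    Assumption: uses log2((count + 0.5) / (library_size + 1) * 1e6),
    the offset form used by voom-style pipelines.
    """
    n_samples = len(next(iter(counts.values())))
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        gene: [math.log2((c + 0.5) / (lib_sizes[j] + 1.0) * 1e6)
               for j, c in enumerate(row)]
        for gene, row in counts.items()
    }

# Toy counts (made up): the second sample was sequenced about twice as deeply
toy_counts = {"GENE_A": [0, 10], "GENE_B": [1000, 2000]}
logged = log2_cpm(toy_counts)
for gene, values in logged.items():
    print(gene, [round(v, 2) for v in values])
```

Two things are visible even in this tiny example: zero counts map to finite values thanks to the offset, and `GENE_B`, whose raw counts differ twofold only because of sequencing depth, ends up with nearly identical values in both samples.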


## Extra Credit

1. Optional but encouraged, for the naive approach (Section 3):
    + Use ComparativeMarkerSelectionViewer to verify that the output of ComparativeMarkerSelection does not show any major problems.
    + Use KMeansClustering to see whether the data can be clustered easily. Use the following parameters:
        - input filename: The output from the PreprocessDataset module (i.e., **Workshop_BRCA_filtered.gct** if you used the suggested parameters in Section 1).
        - number of clusters: 2.
        - cluster by: columns.
        - The rest of the parameters can be left as default.
        - Do the two clusters correspond to the tumor tissue samples ("TCGA-xx-xxxx-**01**") and the normal tissue samples ("TCGA-xx-xxxx-**11**")?

2. Optional but encouraged, for the preprocessed data (Section 5):
    + Use ComparativeMarkerSelectionViewer to verify that the output of ComparativeMarkerSelection does not show any major problems.
    + Use KMeansClustering to see whether the data can be clustered easily. Use the following parameters:
        - input filename: The output from the PreprocessReadCounts module (i.e., **Workshop_BRCA_filtered.preprocessed.gct** if you used the suggested parameters in Section 5.1).
        - number of clusters: 2.
        - cluster by: columns.
        - The rest of the parameters can be left as default.
        - Do the two clusters correspond to the tumor tissue samples ("TCGA-xx-xxxx-**01**") and the normal tissue samples ("TCGA-xx-xxxx-**11**")?
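
One possible way to answer the clustering question without inspecting sample names by hand is sketched below. It relies only on the barcode convention described above (`01` for tumor, `11` for normal in the fourth field). The mapping `toy_assignments` is hypothetical: in practice you would have to build it yourself from the KMeansClustering output, whose file layout is not covered here.

```python
from collections import Counter

def tcga_sample_type(barcode):
    """Classify a TCGA barcode by the two-digit sample-type code in its fourth field."""
    code = barcode.split("-")[3][:2]
    if code == "01":
        return "tumor"
    if code == "11":
        return "normal"
    return "other"

def cluster_composition(assignments):
    """Turn {barcode: cluster_id} into {cluster_id: Counter of sample types}."""
    table = {}
    for barcode, cluster in assignments.items():
        table.setdefault(cluster, Counter())[tcga_sample_type(barcode)] += 1
    return table

# Hypothetical cluster assignments (made up for illustration only)
toy_assignments = {
    "TCGA-A1-0001-01": 0,
    "TCGA-A1-0002-01": 0,
    "TCGA-A1-0003-11": 1,
    "TCGA-A1-0004-11": 1,
    "TCGA-A1-0005-01": 1,
}
composition = cluster_composition(toy_assignments)
for cluster, counts in sorted(composition.items()):
    print(cluster, dict(counts))
```

A cluster that contains only one sample type indicates a clean separation, while a mixed cluster (such as cluster 1 in the toy data) shows where tumor and normal samples were grouped together.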