# Outlier Analysis
## Pan-Cancer and Personalized Cohorts

This script runs outlier analysis on an n-of-1 sample against pre-generated thresholds.
It returns two levels of results: pan-cancer and personalized.

### Input
- the sample's expression in log2(tpm+1), from step 1
- the thresholds for each background cohort, from step 3

### Output
Two output files.
#### outlier_results_SAMPLEID
A tab-separated file indexed by gene, with the following columns, in this order:

- sample: the sample's expression, in log2(tpm+1)
- is_top_5: blank or "top5"

Pan-cancer thresholds, in log2(tpm+1):
- pc_low
- pc_median
- pc_high

Pan-cancer outlier status and expression-variance filter status:
- pc_outlier: blank, "pc_up" or "pc_down"
- pc_is_filtered: blank (gene retained by the filter) or "pc_dropped"

Pan-disease thresholds, in log2(tpm+1). These columns are kept for file-format continuity,
but they are always blank because the consensus outliers are no longer derivable from a single set of thresholds.

- pd_low
- pd_median
- pd_high

Consensus outlier results, combined across the personalized cohorts:
- pd_outlier: blank, "pd_up" or "pd_down"

Pan-cancer percentile of the sample's expression for this gene:
- pc_percentile
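
As an illustration of how a value like pc_percentile can be derived, the sketch below ranks one expression value against a gene's sorted cohort percentile values by bisection, taking the midpoint of the leftmost and rightmost insertion points. The function name `percentile_rank` and the cut-point values are hypothetical. It assumes the cohort is summarized as 101 sorted values (0th to 100th percentile) per gene.

```python
import bisect


def percentile_rank(cohort_percentiles, sample_value):
    """Midpoint percentile of sample_value within sorted percentile cut points.

    cohort_percentiles: sorted list of values for one gene (assumed here to
    hold the 0th..100th percentile). Ties are handled by averaging the
    leftmost and rightmost insertion points, then truncating to an int.
    """
    leftmost = bisect.bisect_left(cohort_percentiles, sample_value)
    rightmost = bisect.bisect_right(cohort_percentiles, sample_value)
    return int((leftmost + rightmost - 1) / 2.0)


# Hypothetical gene whose cohort percentiles are 0.0, 0.1, ..., 10.0
cuts = [i / 10.0 for i in range(101)]
print(percentile_rank(cuts, 5.0))   # 50

# Hypothetical gene where the lowest 41 percentiles are all zero
tied = [0.0] * 41 + [float(i) for i in range(1, 61)]
print(percentile_rank(tied, 0.0))   # 20, the middle of the tied block
```

Averaging the two insertion points means a sample that ties with a long run of identical cohort values lands in the middle of that run rather than at either end.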



#### 4.0.json
contains the following keys:

- 'outlier_results': JSON version of the results table above
- 'personalized_outliers': up and down outliers, listed individually for each personalized cohort
- 'personalized_consensus_counts': outliers appearing in 2 or more cohorts, with the number of cohorts each appears in
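
To make the layout concrete, here is a rough sketch of the shape these three keys can take. The gene names and values are hypothetical, and only two of the result columns and two cohorts are shown. The nesting follows how the notebook builds each key.

```python
import json

# Hypothetical gene names and values, for illustration only.
example_4_0 = {
    "outlier_results": {
        # one entry per results column, each mapping gene -> string value
        "sample": {"GENE_A": "7.25", "GENE_B": "0"},
        "pc_outlier": {"GENE_A": "pc_up", "GENE_B": ""},
    },
    "personalized_outliers": {
        # one entry per personalized cohort that was present
        "pandis_outliers": {"up": ["GENE_A"], "down": []},
        "first_degree_outliers": {"up": ["GENE_A"], "down": ["GENE_B"]},
    },
    "personalized_consensus_counts": {
        # gene -> number of cohorts in which it was an outlier (2 or more)
        "up_outliers": {"GENE_A": 2},
        "down_outliers": {},
    },
}

print(json.dumps(example_4_0, indent=2, sort_keys=True))
```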

### Pan-cancer cohort:
 - runs against the chosen background_cohort
 - results are filtered against the expression-variance filters
 - up outliers are filtered against the sample's top 5% of expression values
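
The three rules above can be sketched with plain dictionaries. This is a simplified illustration, not the notebook's pandas implementation. The function name `call_outliers`, its parameters, and the gene data are all hypothetical.

```python
def call_outliers(sample, thresholds, top5_genes, kept_genes=None):
    """Return (up, down) gene lists for one sample.

    sample:      dict gene -> expression in log2(tpm+1)
    thresholds:  dict gene -> {"low": x, "high": y}
    top5_genes:  set of genes in the sample's top 5% by expression
    kept_genes:  genes retained by the expression-variance filter,
                 or None to skip that filter
    """
    up, down = [], []
    for gene, value in sorted(sample.items()):
        if gene not in thresholds:
            continue
        if kept_genes is not None and gene not in kept_genes:
            continue
        # Up outliers must also be among the sample's top 5% genes.
        if value > thresholds[gene]["high"] and gene in top5_genes:
            up.append(gene)
        elif value < thresholds[gene]["low"]:
            down.append(gene)
    return up, down


# Hypothetical data
sample = {"GENE_A": 9.0, "GENE_B": 0.5, "GENE_C": 6.0, "GENE_D": 8.0}
thresholds = {
    "GENE_A": {"low": 1.0, "high": 7.0},
    "GENE_B": {"low": 1.0, "high": 5.0},
    "GENE_C": {"low": 2.0, "high": 5.0},
    "GENE_D": {"low": 1.0, "high": 6.0},
}
pancan_up, pancan_down = call_outliers(
    sample, thresholds, top5_genes={"GENE_A"},
    kept_genes={"GENE_A", "GENE_B", "GENE_C"})
print(pancan_up, pancan_down)
```

In this example GENE_C exceeds its high threshold but is not a top-5% gene, and GENE_D is removed by the variance filter, so neither is reported.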

### Personalized cohorts:
 - results are NOT filtered against the expression-variance filters
 - up outliers are filtered against the sample's top 5% of expression values
 - the personalized cohorts are:
    - pandis
    - first_degree
    - first_and_second_degree
    - nof1_disease
 
 

In [None]:
import csv
import collections
import os
import uuid
import pandas as pd
import numpy as np
import errno # for errno.EEXIST. specific to python 2.7.5. https://github.com/beautify-web/js-beautify/pull/349 
import json
import bisect
import logging


# Setup: load conf, retrieve sample ID, logging
with open("conf.json","r") as conf:
    c=json.load(conf)
sample_id = c["sample_id"]    
print("Running on sample: {}".format(sample_id))

logging.basicConfig(**c["info"]["logging_config"])
logging.info("\n4: Outlier Analysis")
def and_log(s):
    logging.info(s)
    return s

# if the analysis failed, create (if necessary) the flag file and
# add to it the reason it failed; increase max fail level if necessary
def mark_analysis_failed(text, level):
    try:
        with open(c["file"]["flag_analysis_failed"], "r") as jf:
            failed_json = json.load(jf)
    except IOError as e:
        if e.errno == errno.ENOENT:
            failed_json = {"reason": {}, "maxlevel": str(level)}
        else:
            raise
    if int(failed_json["maxlevel"]) < level:
        failed_json["maxlevel"] = str(level)
    if "4.0" in failed_json["reason"].keys():
        failed_json["reason"]["4.0"] = failed_json["reason"]["4.0"] + text
    else:
        failed_json["reason"]["4.0"] = text
    with open(c["file"]["flag_analysis_failed"], "w") as jf:
        json.dump(failed_json, jf, indent=2)
    
# Input requires steps: 1, 3
with open(c["json"]["1"],"r") as jf:
    expression=pd.DataFrame.from_dict(
        json.load(jf)["tpm_hugo_norm_uniq"],
        orient="columns",
        dtype="float64")
with open(c["json"]["3"],"r") as jf:
    json_3 = json.load(jf)
        
j = {}

In [None]:
# Percentile rank of the sample's expression for one gene. Used with DataFrame.apply:
# takes a gene column from the binned cohort, the entire sample data frame, and the sample ID.
def get_sample_quantile(cohort_column, sample_df, sample_id):
    sample_value = sample_df.loc[cohort_column.name][sample_id] # a single number
    column_values = cohort_column.values # an array of the percentile values, 0 to 100
    
    leftmost = bisect.bisect_left(column_values, sample_value)
    rightmost = bisect.bisect_right(column_values, sample_value)
    
    return int((float(leftmost) + float(rightmost) - 1)/2)

In [None]:
# Calculate the genes in the top 5% of the N-of-1 sample
# takes sample expression as series
def sample_top5percent_genes(sample_expression):
    top5_value = sample_expression.describe(percentiles=[0.95]).loc["95%"]    
    top5_valuelists = sample_expression.loc[lambda x: x >= top5_value]
    
    # Exclude genes with zero expression from the top5%. If any such genes are present, 
    # the sample is considered QC fail - mark this by creating the ANALYSIS_FAILED file.
    if(top5_value <= 0):
        top5_valuelists = top5_valuelists.loc[lambda x: x > 0]
        
        qc_failure_text = ("QC_FAIL_LowTop5GeneCount: Top 5% threshold is equal to or lower than 0.<br/>"
                          "Fewer than five percent of the genes in this sample have expression greater than zero.")
        print(qc_failure_text)
        logging.error(qc_failure_text)
        mark_analysis_failed(qc_failure_text, 4)

    return top5_valuelists.keys()

In [None]:
# Run the outlier analysis! Inputs:
#
# sample_expression: Series (not data frame) of the sample expression data
#
# thresholds_dict: dictionary suitable for loading into a data frame; created
# from dataframe.to_json(orient='columns'). Contains columns for 'high', 'median',
# and 'low' thresholds per gene.
#
# top5_genes: pandas Index of gene names in the N-of-1 sample's top 5% of expression
#
# is_filtered: whether to apply the expression-variance filter to the up and down outliers
#
# filtered_genes_array: an array of strings that are the names of the genes to KEEP;
# ignored unless is_filtered


def get_outliers(sample_expression,
                  thresholds_dict,
                  top5_genes,
                  is_filtered=False,
                  filtered_genes_array=[] 
                  ):
        
    expression_thresholds = pd.DataFrame.from_dict(thresholds_dict, orient="columns", dtype="float32")
    
    # Down outliers 
    down_outliers = sample_expression.loc[lambda x: x < expression_thresholds["low"]]
    down_with_median = pd.concat([expression_thresholds["median"],down_outliers],axis=1,join="inner" )
    
    # Up outliers - filter by top5% genes of the n-of-1 sample
    up_outliers = sample_expression.loc[lambda x: x > expression_thresholds["high"]]
    up_with_median = pd.concat([expression_thresholds["median"],up_outliers],axis=1,join="inner" )
    up_in_top5 = up_with_median.loc[top5_genes.intersection(up_with_median.index)]
    
    if(is_filtered):
        print("    Using expression variance filters")
        filteredgenes = pd.Index(data=filtered_genes_array, copy=False, name="Gene")         
        final_up=up_in_top5.loc[filteredgenes.intersection(up_in_top5.index)]
        final_down=down_with_median.loc[filteredgenes.intersection(down_with_median.index)]
    else:
        print("    Not using expression variance filters")
        filteredgenes = pd.Index([])
        final_up = up_in_top5
        final_down = down_with_median     
    
    # SO now we have the outliers!
    print("    {}: {} up {} down {} top5".format(
        sample_expression.name,
        len(final_up),
        len(final_down),
        len(top5_genes)
    ))

    result =  {
            "expression_thresholds":expression_thresholds,
            "final_up_idx":final_up.index,
            "final_down_idx":final_down.index,
           }
    # if expression & variance filters used, also pass sample_expression and filtered genes
    if(is_filtered):
        result["sample_expression"] = sample_expression
        result["filteredgenes"] = filteredgenes
    
    return result

In [None]:
# print outlier results into the combined dataframe of everything
# Takes the two outlier results dictionaries and the percentiles series
def make_result_df(pancan_dict, disease_result_dict, percentiles):

    # an empty dict for disease_specific_result indicates we're skipping disease_specific
    skip_disease = (disease_result_dict == {})
                      
    result = pd.DataFrame(pancan_dict["sample_expression"])

    # 'sample' column - rename from sample ID to "sample"
    result.rename(columns={pancan_dict["sample_expression"].name:"sample"}, inplace=True)

    # Set top 5 column: blank by default, "top5" for the sample's top 5% genes
    # (top5_genes is the notebook-level index computed from the sample)
    result["is_top_5"] = ""
    result.loc[result.index.isin(top5_genes), "is_top_5"] = "top5"

    # Cohort thresholds
    result["pc_low"] = pancan_dict["expression_thresholds"]["low"]
    result["pc_median"] = pancan_dict["expression_thresholds"]["median"]
    result["pc_high"] = pancan_dict["expression_thresholds"]["high"]

    # outlier status
    result["pc_outlier"] = ""
    result.loc[result.index.isin(pancan_dict["final_up_idx"]), "pc_outlier"] = "pc_up"
    result.loc[result.index.isin(pancan_dict["final_down_idx"]), "pc_outlier"] = "pc_down"

    # filtered genes
    result["pc_is_filtered"] = "pc_dropped" # set all genes as dropped
    result.loc[result.index.isin(pancan_dict["filteredgenes"]), "pc_is_filtered"] = "" # then clear those the filter retained

    if(not skip_disease):
        # disease specific columns
        result["pd_low"] = disease_result_dict["expression_thresholds"]["low"]
        result["pd_median"] = disease_result_dict["expression_thresholds"]["median"]
        result["pd_high"] = disease_result_dict["expression_thresholds"]["high"]
        # disease outlier column
        result["pd_outlier"] = ""
        result.loc[result.index.isin(disease_result_dict["final_up_idx"]), "pd_outlier"] = "pd_up"
        result.loc[result.index.isin(disease_result_dict["final_down_idx"]), "pd_outlier"] = "pd_down"
    
    # percentile column - will match indices regardless of order
    result["pc_percentile"] = percentiles
    
    # TODO sometime later - if requested - also add pandisease filter columns
    # eg pd_is_filtered (pd_dropped, "")

    return result

In [None]:
%%time
# Percentile analysis : what is the percentile rank of the sample's genes within the cohort?

binned_cohort = pd.read_hdf(c["cohort"]["percentiles"])
sample_percentiles = binned_cohort.apply(get_sample_quantile, axis=0, sample_df=expression, sample_id=sample_id)

Get the pan-cancer outlier results

In [None]:
%%time

top5_genes = sample_top5percent_genes(expression[sample_id])

final_pancan = get_outliers(
                                expression[sample_id], 
                                json_3["pancan_thresholds"],
                                top5_genes,
                                True,  # pancan - use expression variance filters
                                json_3["pancan_filtered_genes"]
                               )

For each of the personalized cohorts that are present, get their outlier results.

In [None]:
%%time

personalized_thresholds = [
    "pandis_thresholds",
    "first_degree_thresholds",
    "first_and_second_degree_thresholds",
    "nof1_disease_thresholds"
]

personalized_results = {}
how_many_cohorts = 0

for threshold in personalized_thresholds:
    if(json_3[threshold]):
        how_many_cohorts += 1
        print(threshold)
        personalized_results[threshold] = get_outliers(expression[sample_id], 
                                        json_3[threshold],
                                        top5_genes
                                        # personalized - Don't use expression variance filters
                                        )
    else:
        print("{}: personalized cohort not found, skipping".format(threshold))

Store the personalized outliers in json

In [None]:
# Build the "up" and "down" outlier lists from final_up_idx and final_down_idx. Cohort names will be:
# ['pandis_outliers','nof1_disease_outliers','first_degree_outliers','first_and_second_degree_outliers']

j["personalized_outliers"] = {}
for thresholdname in personalized_results.keys():
    cohortname = thresholdname.replace("thresholds", "outliers")
    j["personalized_outliers"][cohortname] = {}
    for outlier in ["up", "down"]:
        j["personalized_outliers"][cohortname][outlier] = list(
            personalized_results[thresholdname]["final_{}_idx".format(outlier)])

Set the personalized expression thresholds that will appear in the outlier results file. These were originally the pan-disease thresholds.
However, since the "pan-disease outliers" are now consensus outliers and are not derivable from these thresholds, keeping the values in the file would be misleading. The columns stay in place, but their values are set to blank.

(Personalized thresholds for every personal cohort are available in the output from step 3.)

In [None]:
final_personalized = {}
final_personalized["expression_thresholds"] = final_pancan["expression_thresholds"].applymap(
    lambda x: "")

Then, get the final personalized up and down outliers.
The consensus rule depends on how many personalized cohorts are available: a gene is a consensus outlier when it is an outlier in at least two cohorts.

Alerts and failures for having too few cohorts were already recorded in the previous step (threshold generation).

- 0 cohorts: treat as if 0 pan-disease outliers were found
- 1 cohort: treat as if 0 pan-disease outliers were found (no consensus is possible)
- 2 cohorts: 2 out of 2 (the intersection of the outliers)
- 3 cohorts: 2 out of 3
- 4 cohorts: 2 out of 4
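
A minimal, self-contained sketch of this counting rule, using only the standard library. The helper name `consensus_outliers` and the gene lists are hypothetical. The notebook cell below applies the same idea to the pandas indices from each cohort.

```python
from collections import Counter


def consensus_outliers(per_cohort_outliers, min_cohorts=2):
    """Map gene -> number of cohorts calling it an outlier.

    Only genes seen in at least min_cohorts cohorts are kept. With fewer
    than two cohorts the result is empty, since no consensus is possible.
    """
    if len(per_cohort_outliers) < 2:
        return {}
    counts = Counter()
    for genes in per_cohort_outliers.values():
        counts.update(set(genes))  # count each gene once per cohort
    return {gene: n for gene, n in counts.items() if n >= min_cohorts}


# Hypothetical up outliers from three personalized cohorts
up_by_cohort = {
    "pandis": ["GENE_A", "GENE_B"],
    "first_degree": ["GENE_A"],
    "nof1_disease": ["GENE_B", "GENE_C", "GENE_A"],
}
print(consensus_outliers(up_by_cohort))
```

Here GENE_A is counted in three cohorts and GENE_B in two, so both are consensus outliers, while GENE_C appears in only one cohort and is dropped.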

In [None]:
j["personalized_consensus_counts"] = {}

if(how_many_cohorts in [0,1]):
    final_personalized["final_up_idx"] = pd.Index([])
    final_personalized["final_down_idx"] = pd.Index([])

elif(how_many_cohorts in [2,3,4]):
    for outlier_type in ["up", "down"]:
        all_genes = collections.Counter()
        for cohort in personalized_results.values():
            all_genes.update(cohort["final_{}_idx".format(outlier_type)])
        consensus_genes = {k:v for k,v in all_genes.items() if v >= 2}
        final_personalized["final_{}_idx".format(outlier_type)] = pd.Index(consensus_genes)
        j["personalized_consensus_counts"]["{}_outliers".format(outlier_type)] = consensus_genes
else:
    raise ValueError("Found {} cohorts; this should never happen!".format(how_many_cohorts))

In [None]:
print("Pancancer results: {}".format(list(final_pancan.keys())))
print("Personalized results: {}".format(list(final_personalized.keys())))

Finally, combine the results into a single dataframe. Save it as a tab-separated file, and store a string-formatted copy in the final JSON output.
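
The string conversion used for the JSON copy leaves existing strings untouched and renders numbers with at most 12 significant digits. A small sketch of that formatting rule follows. The helper name `as_string` is hypothetical, and it uses `isinstance` where the cell below uses a direct type comparison.

```python
def as_string(value):
    """Leave strings as-is; format numbers with up to 12 significant digits."""
    return value if isinstance(value, str) else "%.12g" % value


print(as_string(3.0))                 # 3
print(as_string(0.1234567890123456))  # 0.123456789012
print(as_string("pc_up"))             # pc_up
```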

In [None]:
oa_results = make_result_df(final_pancan, final_personalized,sample_percentiles)
oa_results_asString = oa_results.applymap(lambda x: x if (type(x) == str) else "%.12g" % x )
j["outlier_results"] = json.loads(oa_results_asString.to_json(orient='columns'))

In [None]:
print("Exporting to {}".format(c["file"]["outlier_results"]))

oa_results.to_csv(c["file"]["outlier_results"],
                    sep="\t",
                    index_label="Gene")

with open(c["json"]["4.0"], "w") as jsonfile:
    json.dump(j, jsonfile, indent=2)
print("Done!")