- Use the tumormap algorithm to calculate the most similar samples
- Determine the diseases of those samples
- Write the pancancer and pandisease sample lists for this sample

### Depends On Steps:
    0 (conf)
    1 (sample expression file created)

### Input files:

Tumormap placement is run using the `tumormap_expression` file within the `tumormap/` subdir of the compendium.
This file can be either .tsv or .hd5 format. If both are present, the .tsv will be used.

The standard setup is to have only a `tumormap_expression.hd5` file which is a hardlink to the outlier cohort hd5 file, and a `filtered_genes_to_keep.tsv` file listing the genes in that file to retain.  

The alternate setup is to have only a `tumormap_expression.tsv` file, in which case the `filtered_genes_to_keep.tsv` is not used.

These input files are always used:
- c["cohort"]["essential_clinical"] - the clinical data for the outlier analysis cohort; only the disease column is used.
- c["tumormap"]["essential_clinical"] - the clinical data for the samples in the Tumormap background cohort. In the standard setup, these are the same as the outlier cohort samples. In the alternate setup, they might be a different set of samples. 

These input files are sometimes used:
- c["tumormap"]["filtered_genes_to_keep"] - Used in the standard setup. List of genes to retain in the hd5 file based on the results of the expression & variance filters. 

- c["tumormap"]["background_hdf"] - Used in the standard setup. Path to the hd5 file of background expression to place the n-of-1 sample on.
- c["tumormap"]["background_tsv"] - Used in the alternate setup. Path to the tsv file of background expression. filtered_genes_to_keep are not checked for this file; it's expected to have expression & variance filters pre-applied.
- c["file"]["alternate_tumormap_expression"] - Path to an alternate N-of-1 expression file that should be used for tumormap placement instead of the calculated N-of-1 expression file. Will be used instead of the calculated n-of-1 expression if a file exists at this path.

### Output files:

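j["tumormap_results"] : a { neighbor ID -> similarity } dict of the top neighbors (including the self-sample, if present)
    - used by 2.5, 2.7
j["correlations_vs_focus_sample"] : a { sample ID -> correlation } dict for the whole background cohort, with the self-sample removed
j["first_degree_mcs_cohort"] : sorted list of sample IDs whose correlation exceeds the similarity threshold
j["tumormap_results_above_threshold"] : a { neighbor ID -> similarity } dict of the top 6 neighbors above the threshold
j["pancan_samples"], j["pandisease_samples"] : sorted sample ID lists for the pancancer and pandisease cohorts
j["sample_disease_counts"] : text table of per-disease sample counts in the pandisease cohort
c["file"]["sample_vs_compendium_tumormap_attribute"] : TSV of correlations vs the focus sample, loadable into Tumormap as an attribute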

### Details

#### Tumormap Algorithm
The nearest_samples() function computes the most similar samples to our N-of-1 sample. The code 
of this function was extracted from Yulia Newton's compute_sparse_matrix.py script.

The following version of compute_sparse_matrix.py was used as the source:

https://github.com/ucscHexmap/compute/blob/254838415da027039d5b2102579d18ea0e115438/calc/compute_sparse_matrix.py
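
As a rough illustration of the idea behind this computation (not the original script), Spearman correlation is Pearson correlation computed on rank-transformed values. The sketch below is dependency-free so it can be run anywhere; the notebook itself uses `scipy.stats.rankdata` and scikit-learn's `pairwise_distances` for the same steps. The helper names `average_ranks`, `pearson`, and `spearman`, and the toy values, are assumptions made for this example.

```python
import math


def average_ranks(values):
    """Rank values from 1..n; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = float(len(x))
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy expression vectors over five genes (hypothetical values).
focus_sample = [1.0, 2.0, 3.0, 4.0, 5.0]
background_sample = [2.0, 1.0, 4.0, 3.0, 5.0]
similarity = spearman(focus_sample, background_sample)  # approximately 0.8
```

Because only ranks matter, any monotonic rescaling of a sample's expression values leaves its similarity unchanged.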

#### Fully qualified sample ID
The "fully qualified sample ID" is relevant when our N-of-1 sample is already present in the background cohort.
It does not affect placement on the tumormap, where the N-of-1 sample ID is not important, but it is relevant for pre- and post-processing.

If our sample is present in the cohort, it's expected to appear in the MCS with a 1.0 correlation. Thus, we add an additional MCS slot beforehand to account for this, and check that it appeared as expected afterward.
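
The extra-slot logic can be sketched as follows. This is a simplified illustration under assumptions, not the notebook's exact code: the helper name `top_neighbors` and the sample IDs are hypothetical, and the notebook decides whether the self-sample is expected by looking it up in the tumormap clinical file.

```python
def top_neighbors(ranked_ids, neighbor_count, self_id=None):
    """Return (neighbor IDs, whether the self-sample was seen).

    ranked_ids: sample IDs sorted from most to least correlated
    self_id: ID of the focus sample in the cohort, or None if it is absent
    """
    self_expected = int(self_id is not None)
    # Reserve one extra slot when the focus sample should match itself.
    window = ranked_ids[0:neighbor_count + self_expected]
    self_found = self_id in window
    # Drop the self-sample so only true neighbors remain.
    neighbors = [sample for sample in window if sample != self_id]
    return neighbors, self_found


ranked = ["SELF", "A", "B", "C", "D", "E", "F", "G"]
neighbors, self_found = top_neighbors(ranked, 6, self_id="SELF")
```

With the extra slot, a sample that matches itself still ends up with six real neighbors.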

The tumormap results will appear using the actual sample ID, and not the fully qualified sample ID.

In [None]:
import os
import json
import errno
import logging
import numpy as np
import pandas as pd
import scipy.stats  # nearest_samples() calls scipy.stats.rankdata
import sklearn.metrics.pairwise as sklp
import operator

# Setup: load conf, retrieve sample ID, logging
with open("conf.json","r") as conf:
    c=json.load(conf)
sample_id = c["sample_id"]    
print("Running on sample: {}".format(sample_id))

logging.basicConfig(**c["info"]["logging_config"])
logging.info("\n2.0: Get Most Similar Samples")
def and_log(s):
    logging.info(s)
    return s
    
# if the analysis failed, create (if necessary) the flag file and
# add to it the reason it failed; increase max fail level if necessary
def mark_analysis_failed(text, level):
    try:
        with open(c["file"]["flag_analysis_failed"], "r") as jf:
            failed_json = json.load(jf)
    except IOError as e:
        if e.errno == errno.ENOENT:
            failed_json = {"reason": {}, "maxlevel": str(level)}
        else:
            raise
    if int(failed_json["maxlevel"]) < level:
        failed_json["maxlevel"] = str(level)
    if "2.0" in failed_json["reason"].keys():
        failed_json["reason"]["2.0"] = failed_json["reason"]["2.0"] + text
    else:
        failed_json["reason"]["2.0"] = text
    with open(c["file"]["flag_analysis_failed"], "w") as jf:
        json.dump(failed_json, jf, indent=2)

j = {}


fqsample_id=c["info"]["id_for_tumormap"]
if(fqsample_id != sample_id):
    print("Using the alias '{}' when searching for this sample on the compendium.".format(fqsample_id))
    
### Configuration ###
neighbor_count=6

Get the minimum threshold of correlation for which a sample is sufficiently similar to our N-of-1 sample. Use the tumormap threshold (vs cohort) as we're pulling these MCS from the tumormap.

In [None]:
SIMILARITY_THRESHOLD = float(c["tumormap"]["info"]["mcs_similarity_threshold"])
print(SIMILARITY_THRESHOLD)

Figure out which cohort we are using -- either the standard hd5 or the alternate tsv.

In [None]:
%%time

# if expression TSV exists :
# use it, don't apply filter
# otherwise, use expression hd5, apply filter.

if(os.path.isfile(c["tumormap"]["background_tsv"])):
    print ("Loading tumormap background cohort as TSV.")
    print ("Expecting that expression & variance filters have been pre-applied.")
    filtered_cohort = pd.read_csv(c["tumormap"]["background_tsv"],
                         sep="\t",
                         index_col=0)
else:
    print ("Loading tumormap background cohort as hdf.")
    cohort = pd.read_hdf(c["tumormap"]["background_hdf"])
    print ("Filtering genes based on expression and variance filters.")
    
    keep_these_genes=pd.read_csv(c["tumormap"]["filtered_genes_to_keep"],
                             sep="\t", dtype="str", index_col="Gene").index
    filtered_cohort = cohort.loc[keep_these_genes]

In [None]:
"""
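Compute the Spearman correlation (1 - Spearman distance) between sample and
every member of the cohort and return the N most correlated cohort samples,
sorted from most to least similar.

cohort, sample: Pandas data frames with rows=genes
N: the number of most similar samples to return. Pass it explicitly:
   the default of -1 returns all samples except the least similar one.

"""
def nearest_samples(cohort, sample, N=-1):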
    # Reduce to only common features
    print("Computing intersection")
    intersection = cohort.index.intersection(sample.index)
    if len(intersection) == 0:
        print("Error: sample and cohort have no features in common.\n"
              "Are you using an alternate cohort and need to use the alternate n-of-1 sample?")
        raise KeyboardInterrupt
    
    cohort_incommon = cohort[cohort.index.isin(intersection)].sort_index(axis=0)
    sample_incommon = sample[sample.index.isin(intersection)].sort_index(axis=0)
    
    # Column wise rank transform to turn correlation into spearman
    print("Transforming Rank")
    cohort_transformed = np.apply_along_axis(scipy.stats.rankdata, 1, cohort_incommon.values.T)
    sample_transformed = np.apply_along_axis(scipy.stats.rankdata, 1, sample_incommon.values.T)

    # Compute spearman distances
    print("Computing distances")
    distances = sklp.pairwise_distances(X=cohort_transformed, Y=sample_transformed, metric="correlation", n_jobs=1)
            
    # Rank and return top N
    rank = 1 - pd.DataFrame(distances, cohort.columns.values)
    return rank.sort_values(by=0, ascending=False)[0:N]

In [None]:
# Should we use a pre-existing alternate expression file, or our calculated one?
if(os.path.exists(c["file"]["alternate_tumormap_expression"])):
    print("Loading alternate tumormap expression from {}".format(c["file"]["alternate_tumormap_expression"]))
    n_of_1_expression=pd.read_csv(c["file"]["alternate_tumormap_expression"], delimiter="\t", index_col=0)
else:
    print("Loading sample expression from Step 1 JSON file")
    with open(c["json"]["1"],"r") as step1_json:
        n_of_1_expression=pd.DataFrame.from_dict(json.load(step1_json)["tpm_hugo_norm_uniq"],
                                                 dtype="float64", orient="columns")

Then, get the correlations from our N-of-1 sample to every other sample.

In [None]:
%%time
cohort_correlations = nearest_samples(filtered_cohort, n_of_1_expression, N=len(filtered_cohort.columns))

Get the traditional tumormap results. These are the top 6 samples, or the top 7 if the N-of-1 sample is in the tumormap background cohort.

In [None]:
# Get the tumormap metadata & see whether our N-of-1 sample is in it
tumormap_diseases = pd.read_csv(
                    c["tumormap"]["essential_clinical"], 
                    sep="\t", keep_default_na=False, na_values=['_']) # no default NA so we get "" instead of np.nan
self_neighbor_expected = int(fqsample_id in tumormap_diseases["th_sampleid"].values)

In [None]:
# Traditional tumormap results, retaining the self-sample
neighbor_ids_and_values = cohort_correlations[
    0:(neighbor_count + self_neighbor_expected)].to_dict()[0]
j["tumormap_results"] = neighbor_ids_and_values.copy()
neighbor_ids_and_values

Then, handle potential error cases before continuing with the analysis, checking the entire correlation results. This will drop an existing self-sample from correlations_dict.

OK cases:
  - We didn't expect the self-sample in the most similar samples (MSS), and it is not there
  - We expected the self-sample in the MSS, and it is there, with good correlation (>= 0.999)
  
Failure cases:
  - There is a sample in the MSS with good correlation (> 0.999) that we did not expect
  - We expected the self-sample in the MSS, and it is there, but with bad correlation (< 0.999)
  - We expected the self-sample in the MSS, and it is not there at all
  
Also, mark a failure if no MSS (other than self) has a correlation higher than 0.80.
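
The cases above can be summarized in a small, side-effect-free sketch. It is an illustration under assumptions rather than the notebook's implementation: the function name `check_self_neighbor` and the failure labels are hypothetical, and it returns labels instead of writing to the failure flag file. The guard against an empty dict is also an addition of this sketch.

```python
def check_self_neighbor(correlations, self_id, self_expected,
                        self_threshold=0.999, min_best=0.8):
    """Return a list of failure labels; an empty list means all checks passed."""
    remaining = dict(correlations)  # copy so the caller's dict is untouched
    failures = []
    if self_expected:
        if self_id in remaining:
            # Remove the self-sample; it must correlate almost perfectly.
            if remaining.pop(self_id) < self_threshold:
                failures.append("self_low_correlation")
        else:
            failures.append("self_not_found")
    else:
        # No self-sample expected, so nothing should be near-identical.
        if any(value > self_threshold for value in remaining.values()):
            failures.append("unexpected_near_identical")
    # The best remaining neighbor must exceed the minimum correlation.
    if remaining and max(remaining.values()) <= min_best:
        failures.append("best_correlation_too_low")
    return failures


toy_correlations = {"SELF": 1.0, "A": 0.95, "B": 0.91}
toy_failures = check_self_neighbor(toy_correlations, "SELF", True)
```

Returning labels keeps the decision logic separate from reporting, which makes each case easy to exercise with toy data.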

In [None]:
correlations_dict = cohort_correlations.to_dict()[0]

# If we added a neighbor for itself, confirm that the self neighbor was found.
# Then drop it from correlations_dict so that its disease will not be included.
if(self_neighbor_expected == 1):
    if(fqsample_id in correlations_dict):
        self_corr = correlations_dict.pop(fqsample_id)
        if(self_corr < 0.999): # Correlation arbitrarily chosen
            lowself_message = ("This sample was indicated to be present on the Tumor Map background cohort"
                             " with ID {}. However, the sample with that ID has only {} correlation with"
                             " the focus sample. Confirm that the sample is labeled correctly.<br/><br/>").format(
                             fqsample_id, self_corr)
            print(lowself_message)
            mark_analysis_failed(lowself_message, 4)
    else:
        notfound_message = ("This sample was indicated to be present on the Tumor Map background cohort"
                            " with ID {}. However, none of the sample's Most Correlated Samples have that ID."
                            " You might have the wrong focus sample expression file!"
                            " Confirm that the sample is labeled correctly.<br/><br/>").format(fqsample_id)
        print(notfound_message)
        mark_analysis_failed(notfound_message, 4)
        
# On the other hand, make sure that we didn't get a self neighbor we weren't expecting
else:
    for neighbor, value in correlations_dict.items():
        if value > 0.999:
            tooclose_message = ("Most Correlated Sample correlation too high! The background sample {} "
                               "has a {} correlation with the focus sample; this might be the same expression"
                               " file. If this is intentional, use the alias column in the manifest.tsv to indicate "
                               "that the focus sample is present in the cohort under this name.<br/><br/>").format(
                neighbor, value)
            print(tooclose_message)
            mark_analysis_failed(tooclose_message, 4)
            
if max(correlations_dict.values()) <= 0.8:
    lowcorr_message = ("Most Correlated Sample correlation too low! The highest correlation for this sample is "
                      "{}. A correlation below 0.80 can indicate extremely low read depth, incorrect normalization, "
                      " or other problems.<br/><br/>").format(max(correlations_dict.values()))
    print(lowcorr_message)
    mark_analysis_failed(lowcorr_message, 4)
    

In [None]:
# Correlations dict (now sans focus sample) added to output json.
j["correlations_vs_focus_sample"] = correlations_dict.copy()

Now let's make some personalized cohorts based on the tumormap results.
First, get all the first-degree Most Correlated Samples: all samples which are more similar to the N-of-1 sample than the similarity threshold.


In [None]:
first_degree_mcs = {k:v for (k,v) in correlations_dict.items() if v > SIMILARITY_THRESHOLD}
j["first_degree_mcs_cohort"] = sorted(first_degree_mcs.keys())
# TODO: If any of these samples are not in the outlier cohort, a later step will fail.
first_degree_mcs

Then, we'll make our new variant on the traditional pan-disease cohort. These are the diseases of the top 6 first-degree MCS; that is, the typical top 6 neighbors, but only counting those which are more similar than the threshold.
If we have fewer than 6, use only those.

The self-sample has already been dropped from correlations_dict, so we simply use neighbor_count without self_neighbor_expected.
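
As a worked example with toy numbers, the selection can be sketched like this. The helper name `top_neighbors_above_threshold`, the sample IDs, and the threshold of 0.9 are assumptions for illustration; the real threshold comes from the configuration.

```python
def top_neighbors_above_threshold(correlations, threshold, neighbor_count=6):
    """Keep samples above the threshold, then take the most correlated ones."""
    first_degree = {k: v for k, v in correlations.items() if v > threshold}
    top_ids = sorted(first_degree, key=first_degree.get,
                     reverse=True)[0:neighbor_count]
    return {k: first_degree[k] for k in top_ids}


# Hypothetical correlations with the self-sample already removed.
toy_correlations = {"A": 0.97, "B": 0.93, "C": 0.89, "D": 0.95}
top_two = top_neighbors_above_threshold(toy_correlations, 0.9, neighbor_count=2)
all_above = top_neighbors_above_threshold(toy_correlations, 0.9)
```

Here `C` is excluded by the threshold, so asking for six neighbors yields only three.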

In [None]:
top_six_ids = sorted(
    first_degree_mcs, key=first_degree_mcs.get, reverse=True)[0:neighbor_count]
top_six_dict = { k: first_degree_mcs[k] for k in top_six_ids }
j["tumormap_results_above_threshold"] = top_six_dict.copy()

# Find the diseases associated with the neighbors
dis_samps_df = tumormap_diseases[tumormap_diseases["th_sampleid"].isin(top_six_dict)]
print("Neighbors and Diseases:")
print(dis_samps_df[["th_sampleid", "disease"]])

# Get diseases and drop empty
cleaned_found_diseases = [d for d in dis_samps_df["disease"].unique().tolist() if d != ""]
if len(cleaned_found_diseases) == 0:
    print("Warning! No diseases found for any nearest neighbor! Pandisease cohort is empty.")
    # The pandisease cohort will be empty, but we may be able to survive with other cohorts
else:
    print("\nUsing the following diseases:\n{}".format("\n".join(cleaned_found_diseases)))


In [None]:
# Create the sample lists for the pancancer and disease-specific cohorts
# Load up the samples vs disease matrix for the Compendium (not tumormap)
cohort_diseases = pd.read_csv(
                    c["cohort"]["essential_clinical"], 
                    sep="\t", keep_default_na=False, na_values=['_'])

# Exclude the sample in question (including medbook name if present) from the cohort
samples_excluding_self = cohort_diseases[~cohort_diseases["th_sampleid"].isin([fqsample_id])]

# Then, select the samples for the diseases that were found among the neighbors
# This selects from the pool of samples available to the cohort, as opposed to those available from the tumormap.

disease_specific_df = samples_excluding_self[samples_excluding_self["disease"].isin(cleaned_found_diseases)]

j["pancan_samples"] = sorted(samples_excluding_self["th_sampleid"].tolist())
j["pandisease_samples"] = sorted(disease_specific_df["th_sampleid"].tolist())

In [None]:
# Print some useful info here

if(fqsample_id in cohort_diseases["th_sampleid"].values):
    isin = ""
    hasbeenremoved = " and has been removed"
else:
    isin = " not"
    hasbeenremoved = ""

print("Background cohorts generated.")
print("Original cohort: {} samples".format(len(cohort_diseases)))
print("N-of-1 sample {} was{} in the original cohort{}.".format(fqsample_id, isin, hasbeenremoved))
print("{} total pancancer samples.".format(len(j["pancan_samples"])))
print("Diseases in pandisease cohort: {}".format(", ".join(cleaned_found_diseases)))
print("{} total pandisease samples.".format(len(j["pandisease_samples"])))


In [None]:
# For each disease, count how many samples in the pandisease cohort are listed with this disease
# Used by the report in step 2.5
j["sample_disease_counts"]=("Pan-disease cohort breakdown:\nDisease\tCohort_sample_count\n" +
                           "\n".join(
                                map(
                                    lambda disease: "{}\t{}".format(
                                        disease,
                                        len(
                                            disease_specific_df[
                                                disease_specific_df["disease"].isin([disease])])),
                                    cleaned_found_diseases
                            )))

print(j["sample_disease_counts"])

Create a TSV of compendium correlations vs focus sample, suitable to load into Tumormap as an attribute. Sorted by correlation from high to low.
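
For reference, the expected layout of that file can be sketched with toy data. The helper name `attribute_tsv_lines` and the sample IDs are hypothetical; the sketch builds the lines in memory instead of writing to disk.

```python
import operator


def attribute_tsv_lines(sample_id, correlations):
    """Build the attribute file lines: a header, then rows sorted high to low."""
    lines = ["sample ID\tCorrelationVs{}".format(sample_id)]
    for neighbor, value in sorted(correlations.items(),
                                  key=operator.itemgetter(1),
                                  reverse=True):
        lines.append("{}\t{}".format(neighbor, value))
    return lines


toy_lines = attribute_tsv_lines("S1", {"A": 0.5, "B": 0.9})
```

Each row pairs a background sample ID with its correlation to the focus sample.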

In [None]:
with open(c["file"]["sample_vs_compendium_tumormap_attribute"], "w") as f:
    f.write("sample ID\tCorrelationVs{}\n".format(c["sample_id"]))
    for k, v in sorted(j["correlations_vs_focus_sample"].items(),
                       key=operator.itemgetter(1),
                       reverse=True):
        f.write("{}\t{}\n".format(k, v))

In [None]:
# Write the final JSON

with open(c["json"]["2.0"], "w") as jsonfile:
    json.dump(j, jsonfile, indent=2)
    
print("Done!")