In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import os
sns.set_theme()

In [2]:
data_dir = "/oak/stanford/groups/akundaje/atwang/gp_mouse_sc_analyses/results/assembly_2/mm10/merge_peak_scores_downsampled"
data_path = os.path.join(data_dir, "scores.tsv")



# Per-Peak Visualizations

Now, I look into the distributions of several metrics across individual mouse peaks.

In [3]:

data = pd.read_csv(data_path, sep='\t', header=0)
data

FileNotFoundError: [Errno 2] No such file or directory: '/oak/stanford/groups/akundaje/atwang/gp_mouse_sc_analyses/results/assembly_2/mm10/merge_peak_scores_downsampled/scores.tsv'

## Mean vs Variance per peak

Here, we look at the means and variances of divergence between species for each mouse peak.  

The x axis for each subplot is the difference of the predicted log10 counts from the GP models and the mouse models, averaged over 5 folds. A positive value indicates a greater prediction from the GP models relative to the mouse models.

The y axis is the estimated standard deviation, across folds, of the divergence between the two species' count outputs. A higher standard deviation indicates greater variability across folds.    
The color indicates the significance of the difference between species for each peak, as evaluated using the Mann-Whitney U test on quantile-normalized predictions. I also looked into using DESeq, but its assumptions are geared towards observed counts from experimental data and aren't really appropriate for analyzing predicted counts from models (more below).
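
As a rough illustration of how these three per-peak quantities could be computed, here is a minimal sketch for a single peak. The fold values are synthetic, and pairing fold i of the mouse model with fold i of the GP model, using the sample standard deviation (`ddof=1`), and running the test on five values per species are all assumptions rather than details confirmed by this notebook.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical per-fold predicted log10 counts for one peak (5 folds per species).
mouse_folds = rng.normal(loc=0.0, scale=0.05, size=5)
gp_folds = rng.normal(loc=0.5, scale=0.05, size=5)

# Divergence per fold: GP prediction minus mouse prediction (assumed pairing by fold).
diff = gp_folds - mouse_folds
diff_mean = diff.mean()
diff_std = diff.std(ddof=1)  # sample standard deviation across folds (assumption)

# Two-sided Mann-Whitney U test between the two sets of fold predictions.
result = stats.mannwhitneyu(
    gp_folds, mouse_folds, alternative="two-sided", method="exact"
)
neg_log10_p = -np.log10(result.pvalue)
```

If the test really is run on five values per species, the exact two-sided p-value cannot go below 2/252 (about 0.0079), so the -log10 p-value is capped at roughly 2.1 no matter how far apart the two species are.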

Here, we see basically no correlation between mean and variance, which directly contradicts the assumptions made by DESeq. The exceptions are mCM-1-3, Other-2, and pCM-3, though these seem to be due to technical issues (more later).
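
One way to put a number on this visual impression is a per-cell-type rank correlation between `diff_mean` and `diff_std`. The sketch below runs on a synthetic stand-in table: only the column names `Label`, `diff_mean`, and `diff_std` come from this notebook, and the labels and values are made up.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)

# Toy stand-in for the scores table; the values are synthetic.
toy = pd.DataFrame({
    "Label": np.repeat(["A", "B"], 200),
    "diff_mean": rng.normal(size=400),
    "diff_std": rng.gamma(shape=2.0, scale=0.05, size=400),
})

# Spearman correlation between mean divergence and its spread, per cell type.
rows = []
for label, group in toy.groupby("Label"):
    rho, _ = stats.spearmanr(group["diff_mean"], group["diff_std"])
    rows.append({"Label": label, "mean_std_spearman": rho})
mean_var = pd.DataFrame(rows)
```

Values near zero would be consistent with the absence of a mean-variance trend; cell types with a clearly non-zero value would be the ones to inspect further.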

In [None]:
sns.relplot(data=data, x="diff_mean", y="diff_std", hue="-log10p_qn", col="Label", col_wrap=5, s=5, edgecolor=None)


## Mouse vs. GP predictions

Next, we look at the mean predictions of the mouse and GP models on the mouse peaks.  

The x axis is the mean predicted log10 counts as predicted by the mouse models, averaged across 5 folds. The y axis is the mean predicted log10 counts as predicted by the GP models.

And as before, the color indicates the significance of the difference between species.

Note that within each species and each cell type, the predicted log counts are zero-centered across peaks to account for read depth differences across species. Thus, the intercept in the plots is always zero.
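
The zero-centering step can be sketched as follows. The column names `ss_raw`, `xs_raw`, `ss_centered`, and `xs_centered` are hypothetical and used only for illustration; the actual centering is done upstream of this notebook.

```python
import pandas as pd

# Toy table: hypothetical predicted log10 counts for two cell types.
toy = pd.DataFrame({
    "Label": ["A", "A", "A", "B", "B", "B"],
    "ss_raw": [1.0, 2.0, 3.0, 4.0, 6.0, 8.0],  # mouse-model predictions
    "xs_raw": [2.0, 3.0, 4.0, 1.0, 2.0, 3.0],  # GP-model predictions
})

# Subtract the per-cell-type mean across peaks, separately for each species.
for raw, centered in [("ss_raw", "ss_centered"), ("xs_raw", "xs_centered")]:
    toy[centered] = toy[raw] - toy.groupby("Label")[raw].transform("mean")
```

Because both axes then have mean zero within each panel, any least-squares line fitted to a panel passes through the origin.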

For most cell types, the relationship between the axes is linear with a slope of 1. 

However, for some cell types, the slope is clearly different from 1, or the relationship is even non-linear. I account for this when calculating significance by quantile-normalizing the predicted counts from each model. It is unclear whether this phenomenon is an artifact of training or represents real biology.
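
To illustrate why quantile normalization removes a slope difference, here is a minimal sketch of one standard form of the procedure: each column is mapped onto the mean of the sorted columns. The helper `quantile_normalize` is hypothetical, does not treat ties specially, and may differ from the normalization actually used to produce `-log10p_qn`.

```python
import numpy as np

def quantile_normalize(columns: np.ndarray) -> np.ndarray:
    """Give every column the same distribution: the mean of the sorted columns."""
    order = np.argsort(columns, axis=0)
    ranks = np.argsort(order, axis=0)  # rank of each entry within its column
    reference = np.sort(columns, axis=0).mean(axis=1)  # shared target quantiles
    return reference[ranks]

# Hypothetical zero-centered predictions: same peak ranking, but a slope of 3.
mouse = np.array([-1.0, 0.0, 1.0, 2.0])
gp = 3.0 * mouse
normalized = quantile_normalize(np.column_stack([mouse, gp]))
```

In this toy case the two normalized columns are identical, so a pure slope (or any monotone distortion) between the models no longer contributes to the difference between species.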

In [None]:
g = sns.relplot(data=data, x="ss_mean", y="xs_mean", hue="-log10p_qn", col="Label", col_wrap=5, s=5, edgecolor=None)
# g.set(xscale="log")
# g.set(yscale="log")

In [None]:
g = sns.relplot(data=data, x="diff_mean_qn", y="prof_e_dist", col="Label", hue="-log10p_qn", col_wrap=5, s=5, edgecolor=None)

In [None]:
g = sns.relplot(data=data, x="diff_mean_qn", y="contrib_counts_e_dist", col="Label", hue="-log10p_qn", col_wrap=5, s=5, edgecolor=None)

In [None]:
g = sns.relplot(data=data, x="diff_mean_qn", y="contrib_profiles_e_dist", col="Label", hue="-log10p_qn", col_wrap=5, s=5, edgecolor=None)

In [None]:
g = sns.relplot(data=data, x="contrib_counts_e_dist", y="contrib_profiles_e_dist", col="Label", hue="-log10p_qn", col_wrap=5, s=5, edgecolor=None)

In [None]:
g = sns.relplot(data=data, x="contrib_counts_e_dist", y="prof_e_dist", col="Label", hue="-log10p_qn", col_wrap=5, s=5, edgecolor=None)

In [None]:
g = sns.relplot(data=data, x="contrib_profiles_e_dist", y="prof_e_dist", col="Label", hue="-log10p_qn", col_wrap=5, s=5, edgecolor=None)

In [None]:
sns.relplot(data=data, x="cross_contrib_counts_dist_mean", y="contrib_counts_e_dist", col="Label", hue="contrib_counts_nlp", hue_norm=(0,3), col_wrap=5, s=5, edgecolor=None)

In [None]:
sns.relplot(data=data, x="cross_contrib_profiles_dist_mean", y="contrib_profiles_e_dist", col="Label", hue="contrib_profiles_nlp", hue_norm=(0,3), col_wrap=5, s=5, edgecolor=None)

In [None]:
g = sns.relplot(data=data, x="ss_mean", y="xs_mean", hue="contrib_counts_e_dist", col="Label", col_wrap=5, s=5, edgecolor=None)
# g.set(xscale="log")
# g.set(yscale="log")

In [None]:
g = sns.relplot(data=data, x="ss_mean", y="xs_mean", hue="contrib_profiles_e_dist", col="Label", col_wrap=5, s=5, edgecolor=None)
# g.set(xscale="log")
# g.set(yscale="log")

In [None]:
sns.relplot(data=data, x="cross_contrib_counts_dist_mean", y="contrib_counts_e_dist", col="Label", hue="-log10p_qn", hue_norm=(0,3), col_wrap=5, s=5, edgecolor=None)

In [None]:
sns.relplot(data=data, x="ss_self_contrib_counts_dist_mean", y="xs_self_contrib_counts_dist_mean", col="Label", col_wrap=5, s=5, edgecolor=None)


In [None]:
sns.relplot(data=data, x="ss_self_contrib_counts_dist_mean", y="cross_contrib_counts_dist_mean", col="Label", col_wrap=5, s=5, edgecolor=None)


In [None]:
sns.relplot(data=data, x="ss_self_contrib_profiles_dist_mean", y="xs_self_contrib_profiles_dist_mean", col="Label", col_wrap=5, s=5, edgecolor=None)


In [None]:
sns.relplot(data=data, x="ss_self_contrib_counts_dist_mean", y="xs_self_contrib_counts_dist_mean", col="Label", hue="obs_logcounts", col_wrap=5, s=5, edgecolor=None)


In [None]:
sns.relplot(data=data, x="ss_self_contrib_counts_dist_mean", y="xs_self_contrib_counts_dist_mean", col="Label", hue="ss_mean", col_wrap=5, s=5, edgecolor=None)


In [None]:
sns.relplot(data=data, x="ss_self_contrib_counts_dist_mean", y="cross_contrib_counts_dist_mean", col="Label", hue="-log10p_qn", col_wrap=5, s=5, edgecolor=None)


In [None]:
sns.relplot(data=data, x="ss_self_contrib_counts_dist_mean", y="xs_self_contrib_counts_dist_mean", col="Label", hue="-log10p_qn", col_wrap=5, s=5, edgecolor=None)


In [None]:
from scipy import stats

# Per-cell-type Spearman correlations across peaks.
header = ["Label", "SpearmanSame", "SpearmanCross", "SpearmanCrossPooled", "SpearmanObserved"]
records = []
for name, group in data.groupby("Label"):
    fold_0_ss = group["fold_0_ss"]
    # Two folds of the mouse model against each other.
    corr_ss = stats.spearmanr(fold_0_ss, group["fold_1_ss"]).correlation
    # Mouse fold 0 vs. GP fold 0.
    corr_xs = stats.spearmanr(fold_0_ss, group["fold_0_xs"]).correlation
    # Fold-averaged mouse predictions vs. fold-averaged GP predictions.
    corr_xs_pooled = stats.spearmanr(group["ss_mean"], group["xs_mean"]).correlation
    # Mouse fold 0 vs. observed counts.
    corr_observed = stats.spearmanr(fold_0_ss, group["obs_logcounts"]).correlation
    records.append([name, corr_ss, corr_xs, corr_xs_pooled, corr_observed])

corrs = pd.DataFrame.from_records(records, columns=header)
corrs

In [None]:
sns.relplot(data=data, x="fold_0_ss", y="obs_logcounts", hue="-log10p_qn", col="Label", col_wrap=5, s=5, edgecolor=None)


## Predicted vs. True counts

We look at the fold 0 mouse model predictions compared to the true counts on mouse peaks.  

The x axis is the predicted log10 counts as predicted by the mouse model. The y axis is the log1p of the true counts for the same peaks. Note that these values are on different scales, but they are linearly related.
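
The scale difference can be made concrete with a small sketch. It assumes that "log1p" means the natural log of one plus the count (as in `np.log1p`); the counts are made up for illustration.

```python
import numpy as np

counts = np.array([0.0, 9.0, 99.0, 999.0])  # hypothetical observed counts

natural = np.log1p(counts)       # ln(1 + count), as on the y axis
base10 = np.log10(1.0 + counts)  # log10(1 + count)

# The two differ only by the constant factor ln(10).
ratio_ok = np.allclose(natural, np.log(10.0) * base10)
```

A constant scale factor like this changes the slope of the scatter plot but not the ranking of peaks, so rank-based measures such as Spearman correlation are unaffected.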

And as before, the color indicates the significance of the difference between species.

There seems to be an enrichment of significant peaks in inputs with low model predictive accuracy. This indicates potential false positives we should watch out for. 

## Predicted vs. predicted counts

We look at the fold 0 mouse model predictions compared to the fold 1 mouse model predictions.  

The x axis is the predicted log10 counts as predicted by the fold 0 mouse model. The y axis is the predicted log10 counts as predicted by the fold 1 mouse model.

And as before, the color indicates the significance of the difference between species.

There seems to be an enrichment of significant peaks in inputs with low model predictive accuracy. This indicates potential false positives we should watch out for. 

In [None]:
sns.relplot(data=data, x="fold_0_ss", y="fold_1_ss", hue="-log10p_qn", col="Label", col_wrap=5, s=5, edgecolor=None)


## Predicted vs. cross-species predicted counts

We look at the fold 0 mouse model predictions compared to the fold 0 GP model predictions.  

The x axis is the predicted log10 counts as predicted by the fold 0 mouse model. The y axis is the predicted log10 counts as predicted by the fold 0 GP model.

And as before, the color indicates the significance of the difference between species.

As expected, significantly different peaks tend to be at the edges of the distribution.

In [None]:
sns.relplot(data=data, x="fold_0_ss", y="fold_0_xs", hue="-log10p_qn", col="Label", col_wrap=5, s=5, edgecolor=None)


## Quantitative cross-species prediction performance

Lastly, we look at the cross-species predictive performance for each cell type using Spearman correlation across peaks.

SpearmanSame is the correlation of predicted counts from two folds of the mouse model. This serves as an upper bound on cross-species predictive performance.

SpearmanCross is the correlation of predicted counts from a mouse model with the predicted counts from the corresponding GP model.

SpearmanCrossPooled is the same comparison using the fold-averaged predictions (ss_mean vs. xs_mean) instead of a single fold.

SpearmanObserved is the correlation of predicted counts from a mouse model with true observed mouse counts.

We see that SpearmanCross is substantially closer to SpearmanSame than to SpearmanObserved, indicating that the predicted counts from the two species agree with each other much more strongly than predicted and true counts do within the same species.