# Reconstruct cytokine trajectories for the theoretical antigen classes
To run this notebook, you need:
- To have run the notebook ``theoretical_antigen_classes_from_capacity_HighMI_13.ipynb`` (and hence its dependencies) and saved its results in ``results/capacity/``. 
- To have run the notebook `generate_synthetic_data.ipynb` and saved its outputs in the folder `results/reconstruction`. By default, the decoder is trained on a selection of 9 datasets (result files `quadratic_tanh_pipeline_selectdata.pkl` and `tanh_norm_factors_integrals_selectdata.hdf`).
- Dataframe of model parameters ``results/fits/df_params_Sigmoid_freealpha20_reg04_selectdata.hdf``  fitted on an ensemble of 14 datasets, an output of the notebook ``latentspace_models.ipynb`` with the force model with matching. 
- Raw cytokine time series in `data/final/`, in particular the corrected dataset `"cytokineConcentrationPickleFile-20210619-HighMI_13-final-corrected.pkl"` (corrected by the channel capacity notebook above), to recover the absolute scale of cytokine concentrations after reconstruction. 
- Table of OT-1 antigens' EC$_{50}$s, values from the literature and from our own measurements, in the JSON file `data/misc/potencies_df_2021.json`. Also, for plotting aesthetics, tick parameters saved in JSON files in this folder. 

The files mentioned above are available in the Zenodo repository archive. 

## Motivation
We use the EC$_{50}$s of the six antigen prototypes found from the channel capacity analysis, which was based on the HighMI_13 data. This dataset, with its many technical replicates for each antigen, allowed us to perform an accurate estimation of the variance of cytokine responses to different antigens, as it appears in the model parameter distributions. Therefore, we can use the covariance matrices (parameters $a_0$, $\tau_0$, $\theta$) and the variance of other parameters ($v_{t1}$, $\alpha$, $\beta$) fitted on these parameter distributions, and interpolated at the antigen classes' EC$_{50}$s. 

However, to have an accurate picture of the typical cytokine trajectories corresponding to those six antigen classes, it is more appropriate to use the *average* value of latent space parameters (and hence the average latent space trajectories) estimated from multiple datasets. It is also more appropriate to use a decoder (i.e. cytokine reconstruction coefficients from latent space) optimized on multiple datasets, rather than one that we would optimize on the latent space trajectories of the dataset used for channel capacity. Indeed, while the latter experiment provides excellent estimates of the *variance* of parameters (and thus of the mutual information and channel capacity), it does not necessarily reflect the average cytokine responses seen in most other experiments. Moreover, estimating the variance of parameters from an aggregate of multiple experimental repeats would be incorrect, as this variance is enlarged by batch effects and variability in various preparation steps or calibration standards over two years of experimental repeats. 

Therefore, to generate cytokine trajectories as representative as possible of the six antigen classes we derived, we use the noise structure estimated from the channel capacity analysis, but the average parameter values and the reconstruction decoder optimized on an ensemble of datasets (the ones used to generate figure S12, which used KDEs to estimate parameter distributions). In short, we build multivariate Gaussian distributions interpolated as a function of EC$_{50}$, with covariances estimated from the channel capacity dataset and averages estimated from many datasets. 
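
As a minimal sketch of this idea, the snippet below draws samples from a Gaussian whose mean and covariance come from two different sources. All numbers are made up for illustration; in the notebook, the means come from the multi-dataset fits and the covariances from the HighMI_13 analysis.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical inputs for one antigen class: means of (a0, tau0, theta)
# as if interpolated from many datasets, covariance as if estimated
# from a single dataset with many replicates.
mean_multi = np.array([2.0, 1.5, -0.4])
cov_single = np.array([[0.04, 0.01, 0.00],
                       [0.01, 0.09, 0.02],
                       [0.00, 0.02, 0.01]])

# Draw parameter samples around the multi-dataset mean
samples = rng.multivariate_normal(mean_multi, cov_single, size=5000)

# Empirical moments of the samples, to compare with the inputs
emp_mean = samples.mean(axis=0)
emp_cov = np.cov(samples, rowvar=False)
```

The empirical mean and covariance of the samples should approach the two inputs as the number of samples grows, which is a quick sanity check that the two sources were combined as intended.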

In [None]:
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import pickle
import seaborn as sns
import os, json
import sys
main_dir_path = os.path.abspath('../')
if main_dir_path not in sys.path:
    sys.path.insert(0, main_dir_path)

from utils.statistics import build_symmetric
from utils.distrib_interpolation import (eval_interpolated_means_covs, 
                        interpolate_params_vs_logec50, stats_per_levels)
from utils.extra_pairplots import dual_pairplot

In [None]:
%matplotlib inline

#plt.rcParams["figure.figsize"] = (2.25, 1.75)
plt.rcParams["axes.labelsize"] = 8.
plt.rcParams["legend.fontsize"] = 8.
plt.rcParams["axes.labelpad"] = 0.5
plt.rcParams["xtick.labelsize"] = 7.
plt.rcParams["ytick.labelsize"] = 7.
plt.rcParams["legend.title_fontsize"] = 8.
plt.rcParams["axes.titlesize"] = 8.
plt.rcParams["font.size"] = 8.
plt.rcParams["figure.dpi"] = 150

# Load uniform tick props across the whole figure 3
with open(os.path.join(main_dir_path, "data", "misc", "minor_ticks_props.json"), "r") as hd:
    props_minorticks = json.load(hd)
with open(os.path.join(main_dir_path, "data", "misc", "major_ticks_props.json"), "r") as hd:
    props_majorticks = json.load(hd)

## Interpolate distributions as a function of EC$_{50}$
We already have the covariance matrices for $a_0$, $\tau_0$, $\theta$ and variances of other parameters at the right EC$_{50}$s from HighMI_13. 

In [None]:
# Import the antigen classes' EC50s
file_name = "chancap_theo_antigen_trajectories_highmi13.hdf"
df_agclass_ec50s = pd.read_hdf(os.path.join(main_dir_path, "results", "capacity", file_name), key="ec50s")

# Import covariance and variance from HighMI_13
df_agclass_covparams = pd.read_hdf(os.path.join(main_dir_path, "results", "capacity", file_name), key="covparams")
df_agclass_varparams = pd.read_hdf(os.path.join(main_dir_path, "results", "capacity", file_name), key="varparams")

### Interpolate mean parameters as a function of EC$_{50}$

For the averages, we need to import empirical parameter distributions (like we do in the ``generate_synthetic_data.ipynb`` notebook), fit the averages, and interpolate them as a function of EC$_{50}$. 
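
The sketch below illustrates the general procedure on made-up data: estimate a mean and its standard error per peptide, then fit a weighted curve against $\log_{10}$ EC$_{50}$ and evaluate it at an arbitrary potency. It deliberately swaps in a weighted straight line (`np.polyfit`) as a stand-in for the splines built by `interpolate_params_vs_logec50`, whose internals are not shown here; the peptide potencies, trend, and noise level are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical log10(EC50) values and a made-up linear trend of one parameter
log_ec50 = np.array([-2.0, -1.0, 0.5, 2.0])
true_mean = 1.0 - 0.3 * log_ec50
n_rep = np.array([12, 8, 10, 6])  # unequal numbers of fits per peptide

means, sems = [], []
for mu, n in zip(true_mean, n_rep):
    samples = mu + 0.1 * rng.normal(size=n)
    means.append(samples.mean())
    # Standard error of the mean: sample std / sqrt(n)
    sems.append(samples.std(ddof=1) / np.sqrt(n))
means, sems = np.array(means), np.array(sems)

# Weighted least squares; np.polyfit expects w = 1/sigma, not 1/sigma**2
coefs = np.polyfit(log_ec50, means, deg=1, w=1.0 / sems)

# Evaluate the fitted curve at an arbitrary log10(EC50)
interp_value = np.polyval(coefs, 0.0)
```

Weighting by the inverse standard error lets peptides with many replicates constrain the curve more strongly than peptides with few.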

In [None]:
# Add error bars; can't use relplot or catplot anymore, 
# they don't have the option to include known standard deviations
def plot_params_vs_logec50(df_estim, df_estim_vari, ser_x, cols_plot=None, 
                        ser_interp=None, x_name="Peptide", col_wrap=3):
    """ Plot parameter estimates with error bars against the x variable (log EC50, usually). 
    Optional ser_interp contains interpolating splines for each parameter
    as a function of the x variable. 
    """
    if cols_plot is None:
        nplots = len(df_estim.columns)
        cols_plot = df_estim.columns
    else:
        nplots = len(cols_plot)
    
    nrows = nplots // col_wrap + min(1, nplots % col_wrap)  # Add 1 if there is a remainder. 
    ncols = min(nplots, col_wrap)
    
    fig, axes = plt.subplots(nrows, ncols, sharex=True, sharey=False)
    fig.set_size_inches(3*ncols, 2.5*nrows)
    axes = axes.flatten()
    
    for i in range(nplots):
        estim = df_estim[cols_plot[i]]
        stds = np.sqrt(df_estim_vari[cols_plot[i]])
        x_labels = estim.index.get_level_values(x_name)
        xpoints = ser_x.reindex(x_labels)  # assume ser_x has a single index level?
        xmin, xmax = np.amin(xpoints), np.amax(xpoints)
        axes[i].errorbar(xpoints, estim, yerr=stds, ls='none', marker="o", ms=5, label="Fit")
        axes[i].set_ylabel(cols_plot[i])
        xr = xmax - xmin
        yr = np.amax(estim) - np.amin(estim)
        for j in range(len(x_labels)):
            axes[i].annotate(x_labels[j], xy=(xpoints.iloc[j]+0.01*xr, estim.iloc[j]+0.02*yr), fontsize=8)
        if ser_interp is not None:
            spl = ser_interp[cols_plot[i]]
            xrange = np.linspace(xmin, xmax, 201)
            axes[i].plot(xrange, spl(xrange), lw=1.5, label="Interpolation")
            axes[i].legend()
    return [fig, axes]

In [None]:
df_sample_p = pd.read_hdf(os.path.join(main_dir_path, "results", "fits", 
                    "df_params_Sigmoid_freealpha_reg04_selectdata.hdf"), key="df_params")

# Remove different T cell numbers if necessary
tcn_fit = "100k"
try:
    df_sample_p = df_sample_p.xs(tcn_fit, level="TCellNumber", axis=0, drop_level=True)
except KeyError: 
    print("TCellNumber was already sliced; skipping.")

# Drop A8 and Q7, which we don't use in general. 
df_sample_p = df_sample_p.drop(["A8", "Q7"], level="Peptide")
print(df_sample_p.index.get_level_values("Data").unique())

# Only keep G4 and E1 if theta < 0; we know that others are outliers,
# artifacts due to insufficient regularization or background fluorescence
df_sample_p = df_sample_p.loc[np.logical_not(df_sample_p.index.isin(["E1", "G4"], level="Peptide")&(df_sample_p["theta"] > 0.0))]

In [None]:
# Fit distributions on the imported parameter samples
params_with_cov = ["a0", "tau0", "theta"]
levels_group = ["Peptide"]

# THIS IS WHERE THE GAUSSIANS ARE ACTUALLY FITTED. Only keep the means
ret = stats_per_levels(df_sample_p, levels_groupby=levels_group, feats_keep=params_with_cov)
df_p_means, df_p_means_estim_vari, _, _, _ = ret

# We might have zero variance on E1/G4 which are at zero
# This can cause bugs when interpolating with zero error bars, so set variance to very small value
df_p_means_estim_vari = df_p_means_estim_vari.clip(lower=np.abs(df_p_means).min().min())

In [None]:
# Import EC50s of peptides
df_potencies = pd.read_json(os.path.join(main_dir_path, "data", "misc", "potencies_df_2021.json"))
ser_log10ec50s = np.log10(df_potencies).mean(axis=1)
ser_log10ec50s.index.name = "Peptide"
# Only keep peptides for which we have parameter values
ser_log10ec50s = ser_log10ec50s.loc[df_sample_p.index.get_level_values("Peptide").unique()]
print(ser_log10ec50s)

In [None]:
ser_splines_means = interpolate_params_vs_logec50(df_p_means, df_p_means_estim_vari, 
                                      ser_log10ec50s, x_name="Peptide")

# Plot the interpolation versus the data
fig, axes = plot_params_vs_logec50(df_p_means, df_p_means_estim_vari, ser_log10ec50s, 
                             ser_interp=ser_splines_means, cols_plot=None, x_name="Peptide", col_wrap=3)
for ax in axes[-3:]:  # 3 is col_wrap
    ax.set_xlabel(r"$\log_{10}{\mathrm{EC}_{50}}$ [-]")
fig.tight_layout()
plt.show()
plt.close()

In [None]:
# Evaluate the means at the antigen classes' EC50s
n_categories = len(df_agclass_ec50s.index.get_level_values("TheoreticalAntigen").unique())
interpolated_means = np.zeros([n_categories, 3])
for i in range(n_categories):
    interpolated_means[i, 0] = ser_splines_means["a0"](df_agclass_ec50s["log10ec50"].iloc[i])
    interpolated_means[i, 1] = ser_splines_means["tau0"](df_agclass_ec50s["log10ec50"].iloc[i])
    interpolated_means[i, 2] = ser_splines_means["theta"](df_agclass_ec50s["log10ec50"].iloc[i])

In [None]:
# For other parameters (v1, alpha, beta), we use linear interpolation.
# We do not compute error bars on the estimated means, and these
# parameters are not clearly monotonic as a function of EC50, so we
# avoid the cubic splines used for a0, tau0, theta. 
def interpolate_nearest_peptides(logec50, pep_logec50s, ser_values):
    """ Given an arbitrary log_10 EC_50, a list of peptide log_10 EC_50s, and a 
    value of the quantity to interpolate for each peptide label, find the two peptides
    closest to the desired EC_50 and interpolate linearly between their values. """
    # Find the peptide below and the peptide above
    sorted_ec50s = pep_logec50s.sort_values()
    ec50_index_above = np.searchsorted(sorted_ec50s, logec50, side="left")
    if ec50_index_above >= len(sorted_ec50s):
        raise ValueError("We are above the interpolation range")
    if ec50_index_above == 0:
        # Index -1 would silently wrap around to the last peptide, so check explicitly
        if logec50 < sorted_ec50s.iloc[0]:
            raise ValueError("We are below the interpolation range")
        ec50_index_above = 1  # Exactly at the lower bound: use the first interval
    ec50_above = sorted_ec50s.iloc[ec50_index_above]
    pep_above = sorted_ec50s.index[ec50_index_above]
    ec50_below = sorted_ec50s.iloc[ec50_index_above-1]
    pep_below = sorted_ec50s.index[ec50_index_above-1]
    
    # Find the parameter values below and above
    try:
        value_below = ser_values[pep_below]
        value_above = ser_values[pep_above]
    except KeyError as e:
        print("Peptide {} not available; check consistency of EC50 and parameter tables.".format(e))
        raise e
    
    # Interpolate linearly
    value_inter = (logec50 - ec50_below) / (ec50_above - ec50_below) * (value_above - value_below) + value_below
    return value_inter

In [None]:
ser_means_pars = {
    "v1": df_sample_p["v1"].groupby("Peptide").mean(), 
    "alpha": df_sample_p["alpha"].groupby("Peptide").mean(), 
    "beta": df_sample_p["beta"].groupby("Peptide").mean()
}
ideal_peptides_par_means = {
    "v1": np.asarray(list(map(lambda x: interpolate_nearest_peptides(x, ser_log10ec50s, ser_means_pars["v1"]), 
                             df_agclass_ec50s["log10ec50"].values))), 
    "alpha": np.asarray(list(map(lambda x: interpolate_nearest_peptides(x, ser_log10ec50s, ser_means_pars["alpha"]), 
                             df_agclass_ec50s["log10ec50"].values))),
    "beta": np.asarray(list(map(lambda x: interpolate_nearest_peptides(x, ser_log10ec50s, ser_means_pars["beta"]), 
                             df_agclass_ec50s["log10ec50"].values)))
}

In [None]:
# Combine the mean parameters evaluated so far (a0, tau0, theta; v1, alpha, beta)
df_agclass_meanparams = pd.DataFrame(np.zeros([n_categories, 6]), 
                        index=df_agclass_ec50s.index,  
                        columns=pd.Index(["a0", "tau0", "theta", "v1", "alpha", "beta"], 
                                            name="Parameter"))
df_agclass_meanparams.iloc[:, :3] = interpolated_means

for i, pch in zip((3, 4, 5), ("v1", "alpha", "beta")):
    df_agclass_meanparams.iloc[:, i] = ideal_peptides_par_means[pch]

print(df_agclass_meanparams)

## Generate sigmoid model trajectories in latent space
Generate the average trajectories from the mean parameters. Also generate a bunch of parameter values and associated latent space trajectories, using the variance and covariance of parameters we also have for each antigen class. 

This part has a tricky aspect: we fitted $N_i(t')$, where $t' = t/\tilde{t}$, $\tilde{t} = 20 $ h (the time scale). Now, we want $n_i(t) = \frac{d N_i}{dt} = \frac{d t'}{dt} \frac{d N_i(t')}{d t'} = \frac{1}{\tilde{t}} n_i(t', a_0', \ldots)$, where $n_i(t', a_0', \ldots)$ is the formal function $n_i(t)$ called with $t'$ and parameters fitted for $N_i(t')$, instead of $t$: same functional form, different scale of variables. We need to compensate this by dividing by $\tilde{t}$, because when fitting $N_i(t')$, we have the following things happening:
 - To preserve $\alpha t = \alpha' t'$ and $\beta t = \beta' t'$ in the exponentials with $t$ replaced by $t'$, we need $\alpha' = \tilde{t} \alpha$ and $\beta' = \tilde{t} \beta$
 - Because the magnitude of $N_i$ is proportional to $a_0 / \alpha$ or $v_i / \alpha$, then $a_0' = \tilde{t} a_0$, i.e. the fitted value of $a_0$ is too large by a factor of $\tilde{t}$
 - Since the functional form of $n_i$ has a magnitude proportional to $a_0$ only, calling that function with the fitted value of $a_0'$ would give a concentration too large by a factor $\tilde{t}$ for being truly $\frac{d N_i}{dt}$. 
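
The rescaling argument above can be checked numerically on a toy example. The functions below are a hypothetical exponential pair chosen only because the rate has magnitude $\propto a_0$ and its integral has magnitude $\propto a_0/\alpha$; they are not the sigmoid model used in this notebook, and the parameter values are made up.

```python
import math

t_scale = 20.0           # time scale in hours, as in the text
a0, alpha = 3.0, 0.05    # hypothetical parameters in real time units (per hour)

def n_toy(t, a0, alpha):
    """Toy rate: magnitude proportional to a0 only."""
    return a0 * math.exp(-alpha * t)

def N_toy(t, a0, alpha):
    """Time integral of n_toy: magnitude proportional to a0 / alpha."""
    return a0 / alpha * (1.0 - math.exp(-alpha * t))

# Parameters that a fit of N against t' = t / t_scale would return
a0_p, alpha_p = t_scale * a0, t_scale * alpha

t = 12.0
tp = t / t_scale
same_integral = N_toy(tp, a0_p, alpha_p)   # equals N_toy(t, a0, alpha)
naive_rate = n_toy(tp, a0_p, alpha_p)      # too large by a factor t_scale
corrected_rate = naive_rate / t_scale      # equals n_toy(t, a0, alpha)
```

The integral is unchanged by the reparametrization, while the rate evaluated with the primed parameters must be divided by the time scale to recover the true derivative.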
 

In [None]:
from ltspcyt.scripts.sigmoid_ballistic import sigmoid_conc_full_freealpha, ballistic_sigmoid_freealpha

In [None]:
def generate_synth_sample_params(df_m, df_c, df_v, p_with, v2v1, rng=None, nsamp=100):
    """ Function that generates nsamp parameter distribution samples for each
    key in the index of df_m, which contains the average value of parameters for each key
    (usually, the keys are the different peptides). df_c and df_v contain the covariance
    matrices and variance of the parameters for each key. p_with specifies which
    parameters are in the covariance matrix, and which are without covariance and  only
    appear in the variances of df_v. v2v1 is the ratio of v2/v1 at late times. 
    rng is a random number generator (np.random.Generator). 
    """
    if rng is None:
        rng = np.random.default_rng()
    new_idx = pd.MultiIndex.from_product([df_m.index, range(nsamp)],
                                            names=[df_m.index.name, "Sample"])
    df_p = pd.DataFrame(np.zeros([len(df_m.index)*nsamp, len(df_m.columns)]),
                index=new_idx, columns=df_m.columns)
    # Treat parameters that require a covariance matrix
    # Generate nsamp samples for each key in the index. 
    for key in df_m.index:
        cov_mat = build_symmetric(df_c.loc[key])
        mean_vec = df_m.loc[key, p_with].values
        df_p.loc[key, p_with] = rng.multivariate_normal(mean_vec, cov_mat, nsamp)
    
    # Now treat parameters that have separate variance
    other_p = list(df_m.columns)
    for p in p_with:
        other_p.remove(p)
    for key in df_m.index:
        varis = df_v.loc[key, other_p].values
        mean_vec = df_m.loc[key, other_p].values
        df_p.loc[key, other_p] = (mean_vec
            + np.sqrt(varis).reshape(1, -1)
                *rng.normal(size=[nsamp, len(other_p)]))

    # At the end, clip parameter a0, tau0, v1, alpha, beta to be >= 0
    # Clip alpha and beta to tscale/100 = 1/5
    params_clips = {"a0":(0, np.inf), "tau0":(0, np.inf), "t0":(0, np.inf), 
                   "v1":(0, np.inf), "alpha":(0.2, np.inf), "beta":(0.2, np.inf)}
    for p, bounds in params_clips.items():
        if p in df_p.columns:
            # Assign back: clipping a column selection in place may not modify df_p
            df_p[p] = df_p[p].clip(*bounds)
    return df_p


def compute_lsmodel_trajectories(df_p_samp, times, v2v1, tscale=20.0):
    """ For each row giving LS model parameter values in df_p_samp,
    compute latent space trajectories, to be reconstructed later.
    Compute both n_i and N_i. """
    # Normalize time
    taus = times / tscale
    # Initialize DataFrame
    idx = pd.MultiIndex.from_tuples([(*k, t) for t in range(len(times)) for k in df_p_samp.index],
                                names=[*df_p_samp.index.names] + ["Time"])
    cols = pd.Index(["ls1", "ls2", "LS1", "LS2"], name="Feature")
    df_traj = pd.DataFrame(np.zeros([len(df_p_samp.index)*len(times), len(cols)]),
                            index=idx, columns=cols)
    df_traj = df_traj.sort_index()
    # Compute trajectories for each parameter set
    # Make sure we have the right parameters.
    assert np.all(np.asarray(["a0", "tau0", "theta", "v1", "alpha", "beta"]) == df_p_samp.columns.values)
    for key in df_p_samp.index:
        params = df_p_samp.loc[key].values
        ls12 = sigmoid_conc_full_freealpha(taus, params, v2v1_ratio=v2v1)
        # Normalize n1 and n2 by tscale, due to the derivative wrt tau of N1, N2
        df_traj.loc[key, "ls1":"ls2"] = ls12.T / tscale
        LS12 = ballistic_sigmoid_freealpha(taus, *params, v2v1_ratio=v2v1)
        df_traj.loc[key, "LS1":"LS2"] = LS12.T

    return df_traj

In [None]:
# Sample parameter tuples for each antigen class
nsamples = 100
# Pick typical v2/v1 for the datasets used
v2v1_slope = pd.read_hdf(os.path.join(main_dir_path, "results", "reconstruction", 
                        "ser_v2v1_synth_selectdata.hdf"), key="ser_v2v1_synth").mean()
time_scale = 20.0
rdgen = np.random.default_rng(seed=292192031)
df_synth_p = generate_synth_sample_params(df_agclass_meanparams, df_agclass_covparams,
                df_agclass_varparams, params_with_cov, v2v1=v2v1_slope, rng=rdgen, nsamp=nsamples)

In [None]:
# Compare the empirical parameter to synthetic parameters samples
idx = np.concatenate([np.arange(n) for n in df_sample_p.groupby("Peptide").count().sort_index().values[:, 0]])
df_both_p = df_sample_p.reset_index().set_index("Peptide")
df_both_p = df_both_p.rename_axis(index={"Peptide":"TheoreticalAntigen"})

df_both_p["Sample"] = idx
df_both_p = df_both_p.set_index("Sample", append=True)
df_both_p = pd.concat({"Data":df_both_p, "Synthetic":df_synth_p}, names=["Source"], axis=0)

#pep_order = ["N4", "Q4", "T4", "V4", "G4", "E1", "A2", "Y3"]
#fig, axes, legend2 = dual_pairplot(data=df_both_p.reset_index(), vari=list(df_synth_p.columns),
#                dual_lvl="Source", dual_labels=["Data", "Synthetic"],
#                dual_hues = [(0.5, 0.5, 0.5), plt.cm.viridis([206])[0]],
#                hue="IdealPeptide", hue_order=pep_order, s=12, alpha=0.7)
#fig.set_size_inches(2*6, 2*6)
#fig.tight_layout()
#plt.show()
#plt.close()

In [None]:
# Compute latent space curves from the generated parameters. The function compute_lsmodel_trajectories
# only computes ls_i and LS_i, not the tanh(LS_i), which will have to be added before reconstruction. 
tpoints = np.arange(0, 73, dtype=int)  # Use ints because nicer as index.
df_latent_synth = compute_lsmodel_trajectories(df_synth_p, tpoints, v2v1_slope, tscale=time_scale)

df_latent_synthmean = compute_lsmodel_trajectories(pd.concat({1:df_agclass_meanparams}, names=["Sample"]), 
                                                   tpoints, v2v1_slope, tscale=time_scale)

In [None]:
sns.relplot(data=df_latent_synthmean.stack("Feature").reset_index(), 
            col="Feature", x="Time", y=0, hue="TheoreticalAntigen", kind="line", facet_kws=dict(sharey=False), 
           palette={i:col for i, col in enumerate(sns.color_palette("deep"))}, height=2.5, lw=2.)

In [None]:
g = sns.relplot(data=df_latent_synthmean.reset_index(), x="LS1", y="LS2", hue="TheoreticalAntigen", kind="line",
           palette={i:col for i, col in enumerate(sns.color_palette("deep"))}, height=2.5, sort=False)
g.axes[0, 0].set_aspect("equal")
plt.show()
plt.close()

## Reconstruct cytokine trajectories
We reconstruct both the average trajectory of each antigen category and all the trajectories sampled around it. 
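
To make the reconstruction step concrete, here is a minimal sketch of the feature construction and decoding on made-up numbers. It assumes a purely linear decoder for simplicity, whereas the notebook loads a pre-trained pipeline built around `QuadraticRegression`; the latent values, normalization factors, and decoder matrix below are all hypothetical.

```python
import numpy as np

# Hypothetical latent-space trajectory at 4 time points:
# rates (ls1, ls2) and integrals (LS1, LS2)
ls = np.array([[0.0, 0.0], [0.5, 0.2], [0.3, 0.4], [0.1, 0.1]])
LS = np.array([[0.0, 0.0], [0.5, 0.2], [0.8, 0.6], [0.9, 0.7]])
tanh_norm = np.array([2.0, 1.0])  # made-up normalization factors

# Feature order: ls1, ls2, tanh(LS1/norm1), tanh(LS2/norm2)
features = np.hstack([ls, np.tanh(LS / tanh_norm)])

# Made-up linear decoder of shape [n_cytokines, n_features]
recon_matrix = np.array([[1.0, 0.0, 0.5, -0.2],
                         [0.0, 1.0, -1.0, 0.3],
                         [-1.0, 0.2, 0.1, 0.8]])

# Decode, giving an array of shape [n_times, n_cytokines]
cytokines = features @ recon_matrix.T
# Zero corresponds to the lower limit of detection, so clip negative outputs
cytokines = np.clip(cytokines, 0.0, None)
```

The output is still in the scaled units of the decoder; converting back to $\log_{10}$ concentrations is a separate step.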

In [None]:
from ltspcyt.scripts.reconstruction import QuadraticRegression

In [None]:
def reconstruct_from_lstraj(df_ls, pipeline, tanh_norm, cyorder):
    """ Reconstruct cytokine time series from latent space trajectories.
    Use a pre-trained pipeline (basically contains a linear regression matrix)
    and pre-determined normalization coefficients tanh_norm. 
    The reconstructions are not scaled back to the proper log10(pM) scale: 
    the lower LOD corresponds to zero in the reconstruction output.  
    The scaling back and offsetting have to be done separately later. 
    """
    # Add features necessary to reconstruction
    df_features = df_ls.copy()
    df_features["tanh_LS1"] = np.tanh(df_ls["LS1"]/tanh_norm["LS1"])
    df_features["tanh_LS2"] = np.tanh(df_ls["LS2"]/tanh_norm["LS2"])
    df_features = df_features.drop(["LS1", "LS2"], axis=1)
    # Make sure the order of features is correct
    feat_order = ["ls1", "ls2", "tanh_LS1", "tanh_LS2"]
    df_features = df_features.reindex(columns=pd.Index(feat_order, name="Feature"))

    # Reconstruct from features: one big dot product with recon. matrix
    # recon_matrix has shape [n_cytos, n_features] so need to transpose
    # and dot with recon_matrix to the right, so the dimensions left
    # are [n_samples, n_cytos]
    # Use the cyorder argument rather than the global cyto_order
    df_cytos = pd.DataFrame(pipeline.predict(df_features), index=df_features.index, 
                             columns=pd.Index(cyorder, name="Cytokine"))

    # Clip
    df_cytos = df_cytos.clip(0, np.inf)

    # Rescale to proper log10 scale
    #df_cytos = df_cytos * cyscales.reshape(1, -1)
    return df_cytos


In [None]:
df_min, df_max = pickle.load(open(os.path.join(main_dir_path, "data", "trained-networks", 
                                               "min_max-thomasRecommendedTraining.pkl"), "rb"))
df_min, df_max = df_min.xs("integral", level="Feature"), df_max.xs("integral", level="Feature")

In [None]:
# Import reconstruction objects
model_type = "mixed_quad"
recon_folder = os.path.join(main_dir_path, "results", "reconstruction")
with open(os.path.join(recon_folder, "quadratic_tanh_pipeline_selectdata.pkl"), "rb") as hd:
    pipe = pickle.load(hd)
tanh_norm_factors = pd.read_hdf(os.path.join(recon_folder, "tanh_norm_factors_integrals_selectdata.hdf"), 
                                key="tanh_norm")
tanh_norm_factors = tanh_norm_factors.rename({"Node 1":"LS1", "Node 2":"LS2"}, axis=1)
print(tanh_norm_factors)
print(pipe[model_type].regressor_.Q)

# Reconstruct sampled and mean latent space trajectories
cyto_order = df_min.index.get_level_values("Cytokine")
print(cyto_order)
df_recon_synth = reconstruct_from_lstraj(df_latent_synth, pipe, tanh_norm_factors, cyto_order)
df_recon_synthmean = reconstruct_from_lstraj(df_latent_synthmean, pipe, tanh_norm_factors, cyto_order)

### Plot the scaled cytokine trajectories

In [None]:
# It is nicer to average all sampled trajectories than to use the trajectory for average parameters. 
df_recon_synthmean2 = df_recon_synth.groupby(["TheoreticalAntigen", "Time"]).mean()
sns.relplot(data=df_recon_synthmean2.stack("Cytokine").reset_index(), x="Time", y=0, 
    col="Cytokine", hue="TheoreticalAntigen", height=2.5, kind="line", 
    palette=sns.color_palette("deep", 
    n_colors=len(df_recon_synthmean.index.get_level_values("TheoreticalAntigen").unique())))

## Rescale and export time integrals of reconstruction

In [None]:
def scale_back(df_cyt, dfmin, dfmax):
    """ Take scaled cytokine data/reconstruction, and put it back in log_10 scale. """
    feat_keys = ["integral", "concentration", "derivative"]
    df_scaled = df_cyt.copy()
    for typ in feat_keys:
        try:
            df_scaled[typ] = df_cyt[typ] * (dfmax - dfmin)
            if typ == "integral":
                df_scaled[typ] = df_scaled[typ] + dfmin
        except KeyError:
            continue  # This feature isn't available in this df; fine
        else:
            print("Put scale back for feature", typ)
    return df_scaled

In [None]:
# Obtain the lower LOD (lowest concentration in pM, which was set to zero by the log-transform)
# Parse LOD files, take the geometric average lowest concentration detected
# in the experiments used to estimate the average parameter values
all_kept_lower_lods = {}
lod_path = os.path.join(main_dir_path, "data/", "LOD/")
for f in os.listdir(lod_path):
    if not f.endswith(".pkl"): continue
    try:
        loddf = pd.read_pickle(os.path.join(lod_path, f))
    except Exception:
        print("Could not load LOD file " + f)
        continue
    key = f.split("-")[2]
    if key in df_sample_p.index.get_level_values("Data").unique():
        lower_lod = pd.Series({k:loddf[k][2] for k in loddf.keys()})
        all_kept_lower_lods[key] = lower_lod
all_kept_lower_lods = pd.concat(all_kept_lower_lods, axis=1, names=["Data"]).T
print(all_kept_lower_lods)
pM_offsets = np.exp((np.log(all_kept_lower_lods).mean())) * 1000
print(pM_offsets)

In [None]:
levels_to_stack = list(df_recon_synth.index.names)
levels_to_stack.remove("Time")
df_integrals = (df_recon_synth.copy().unstack(levels_to_stack).sort_index()
                .cumsum(axis=0).stack(levels_to_stack).unstack("Time").stack("Time"))
df_recon_combined = pd.concat({"concentration":df_recon_synth, "integral":df_integrals}, 
                             axis=1, names=["Feature", "Cytokine"])
df_recon_combined = scale_back(df_recon_combined, df_min, df_max)
df_recon_combined["concentration"] += np.log10(pM_offsets)

In [None]:
df_recon_combined

In [None]:
# Add time integrals, save the results
#df_recon_combined.to_hdf("output_recon/synthetic/antigen_prototypes_recon_conc_integrals_HighMI13_training-means.hdf", key="df")

## Proper plot of cytokine time series for antigen classes
Average time series with shaded standard deviation at each time. 

In [None]:
# Use same color palette as in the main text

all_theo_antigen_colors = sns.color_palette("deep", 10)
theoretical_antigen_colors = sns.color_palette("deep", n_categories)
# Make the second class have a fuchsia color similar to A2
theoretical_antigen_colors = [all_theo_antigen_colors[0],all_theo_antigen_colors[6]]+all_theo_antigen_colors[1:5]
theoretical_antigen_colors = [sns.set_hls_values(a, s=0.4, l=0.6) for a in theoretical_antigen_colors]
theoretical_antigen_colors[-1] = (0, 0, 0, 1)  # Make the null peptide black. 

# Color for smaller, lighter sample trajectories. Not used here, using alphas instead. 
#colors_samples = [sns.set_hls_values(a, l=0.8) for a in theoretical_antigen_colors]
#colors_samples[-1] = (0.5, 0.5, 0.5, 0.8)
#colors_samples = theoretical_antigen_colors

idealpeps = df_recon_synth.index.get_level_values("TheoreticalAntigen").unique()
colors_dict = {idealpeps[i]:theoretical_antigen_colors[i] for i in range(len(idealpeps))}

In [None]:
# Logarithmic minor ticks (we plotted the real log so need to put log ticks manually)
# Find the linear scale limiting ticks
def compute_log_minor_ticks(loglims, stp=2, base=10.0):
    smallest_major = int(np.floor(loglims[0]))
    largest_major = int(np.ceil(loglims[1]))
    n_decades = largest_major - smallest_major

    # Generate linear ranges with the exponents found
    tiles = []
    for i in range(n_decades):
        tiles.append(np.arange(stp*base**(smallest_major+i), 
                    base**(smallest_major+i+1), stp*base**(smallest_major+i)))
    minorticks = np.concatenate(tiles, axis=0)
    minorticks = np.log(minorticks) / np.log(base)
    minorticks = minorticks[(minorticks > loglims[0]) * (minorticks < loglims[1])]
    return minorticks

# Formatting log(number) into "number \times 10^power" string
def format_scinotation(logf, n_decim=0):
    r""" Transform the log_10 of a number, given by logf, into a string
    in scientific notation, a \times 10^{power}, where a has n_decim decimals"""
    pwr = int(np.floor(logf))
    num = round(10**(logf - pwr), n_decim)
    if num >= 10:  # Rounding can push the mantissa up to 10
        pwr += 1
        num = num / 10
    if n_decim == 0:
        num = "{" + str(int(num)) + "}"
    else:
        # Keep exactly n_decim digits after the dot
        num = "{" + "{:.{}f}".format(num, n_decim) + "}"
    # Wrap the power in braces so negative or multi-digit exponents render properly
    s = r"${} \times 10^{}$".format(num, "{" + str(pwr) + "}")
    return s

In [None]:
# Consider doing separate panels for each antigen if the graphs become too overcrowded
# Plot cytokine time series statistics
def plot_cyto_timeseries_grid_stats(df_cy, grouplvl="TheoreticalAntigen", pep_sel=None, palet=None, pep_ec50s=None, 
                                   feature="concentration"):
    nice_cytos_lbls = {"IFNg": r"IFN-$\gamma$", "TNFa": "TNF"}  
    # Get the number of antigen classes
    if pep_sel is None:
        pep_sel = df_cy.index.get_level_values(grouplvl).unique()
        
    # Compute the average and variance of each cytokine time series for each antigen class
    df_means = df_cy.groupby([grouplvl, "Time"]).mean()
    df_varis = df_cy.groupby([grouplvl, "Time"]).var()
    
    # Prepare plots
    cytos = df_cy.columns.to_list()
    ncols = len(pep_sel)
    nrows = len(cytos)
    fig, axes = plt.subplots(nrows, ncols, sharex=True, sharey="row")
    fig.set_size_inches(0.9*ncols, 0.9*nrows)

    # Get the time points
    times = df_cy.index.get_level_values("Time").unique().to_list()
    times.sort()
    
    # Prepare a color palette, if one was not already provided
    if palet is None:
        palet = sns.color_palette(n_colors=len(pep_sel))
        palet = {k:palet[i] for i, k in enumerate(pep_sel)}
    
    # Plot each cytokine for each peptide
    leghandles, leglabels = [], []
    for j, ky in enumerate(pep_sel):
        if pep_ec50s is not None:
            ec50 = pep_ec50s.loc[ky, "log10ec50"]
            label_ec50 = format_scinotation(ec50)
        else:
            label_ec50 = ky
        for i in range(len(cytos)):
            mn = df_means.loc[(ky, times), cytos[i]]
            std = np.sqrt(df_varis.loc[(ky, times), cytos[i]])
            li, = axes[i, j].plot(times, mn, label=label_ec50, color=palet[ky])
            axes[i, j].plot(times, mn-std, lw=1., ls="--", color=palet[ky], alpha=0.5)
            axes[i, j].plot(times, mn+std, lw=1., ls="--", color=palet[ky], alpha=0.5)
            axes[i, j].fill_between(times, mn-std, mn+std,  color=palet[ky], alpha=0.2)
            if i == 0:
                leghandles.append(li)
                leglabels.append(label_ec50)
                axes[i, j].set_title(r"EC$_{50}$: " + label_ec50, fontsize=7)

    # Some labeling
    for i in range(len(cytos)):
        ylbl = nice_cytos_lbls.get(cytos[i], cytos[i])
        if feature == "concentration":
            ylbl = ylbl + " (pM)"
            ylblsize=None
        elif feature == "integral":
            ylbl = r"$\int_0^t \mathrm{d}u \, \log_{10}[$" + ylbl + r"$]$"
            ylblsize = 6
        axes[i, 0].set_ylabel(ylbl, size=ylblsize)
    for j in range(ncols):
        axes[-1, j].set_xlabel("Time (h)")
    
    # Using logarithmic ticks
    if feature == "concentration":
        for i in range(len(cytos)):
            ax = axes[i, 0]
            cytlims = ax.get_ylim()
            minorticks = compute_log_minor_ticks(cytlims, stp=1, base=10.0)
            y_ticker = mpl.ticker.FuncFormatter(lambda x, pos:r"$10^{}$".format("{"+str(int(x))+"}"))
            ax.yaxis.set_major_formatter(y_ticker)
            skp = (cytlims[1] - cytlims[0]) // 5 + 1
            majoryticks = np.arange(np.ceil(cytlims[0]).astype(int), np.floor(cytlims[1]).astype(int)+1, skp)
            if len(majoryticks) == 1:
                majoryticks = np.arange(np.ceil(cytlims[0]).astype(int), np.floor(cytlims[1]).astype(int)+2, skp)
            ax.set_yticks(majoryticks)
            ax.set_yticks(minorticks, minor=True)
            ax.tick_params(which="minor", axis="both", **props_minorticks)
            ax.tick_params(which="major", axis="both", **props_majorticks)

    # Decide where to place the legend
    leg_esthetics = dict(frameon=False, title=r"Antigen EC$_{50}$", fontsize=7, borderpad=0.2, 
                        labelspacing=0.3, handletextpad=0.5, borderaxespad=0.3)
    leg = fig.legend(leghandles, leglabels,
                        bbox_to_anchor=(0.92, 0.5), loc="center left", **leg_esthetics)
    fig.tight_layout(w_pad=1.0)
    fig.subplots_adjust(right=0.92)
    return df_means, df_varis, fig, axes, leg


In [None]:
retn = plot_cyto_timeseries_grid_stats(df_recon_combined["concentration"], 
                grouplvl="TheoreticalAntigen", pep_sel=None, palet=colors_dict, 
                pep_ec50s=df_agclass_ec50s)
_, _, fig, axes, leg = retn
#fig.savefig("figures_synthetic/cytokine_reconstructions_theoretical_antigen_classes_multipanel.pdf", 
#            transparent=True, bbox_inches="tight", bbox_extra_artists=(leg,))
plt.show()
plt.close()
del retn

In [None]:
# Integrals too
retn = plot_cyto_timeseries_grid_stats(df_recon_combined["integral"], 
                grouplvl="TheoreticalAntigen", pep_sel=None, palet=colors_dict, 
                pep_ec50s=df_agclass_ec50s, feature="integral")
df_recon_int_means, df_recon_int_varis, fig, axes, leg = retn
plt.show()
plt.close()
del retn