# Latent space dynamical model plus reconstruction give a model for cytokine dynamics
We can use our latent space model, combined with our cytokine reconstruction method (linear regression with quadratic terms), i.e. our "latent space decoder", as a dynamical model that accurately fits 5D cytokine concentration time series (in log scale). 
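
To make the idea of a "latent space decoder" concrete, here is a minimal sketch of a linear regression with quadratic terms that maps 2D latent coordinates to 5 outputs. It runs on synthetic data with plain NumPy least squares. The helper names (`quadratic_features`, `fit_decoder`, `decode`) and the choice to include a cross term are assumptions made for illustration; this is not the implementation behind `train_reconstruction`.

```python
import numpy as np

def quadratic_features(latent):
    """Build [1, n1, n2, n1^2, n2^2, n1*n2] from an (n_samples, 2) array."""
    n1, n2 = latent[:, 0], latent[:, 1]
    return np.column_stack([np.ones_like(n1), n1, n2, n1**2, n2**2, n1 * n2])

def fit_decoder(latent, cytokines):
    """Least-squares coefficients mapping quadratic latent features to cytokines."""
    coefs, *_ = np.linalg.lstsq(quadratic_features(latent), cytokines, rcond=None)
    return coefs  # shape (6, n_cytokines)

def decode(latent, coefs):
    """Reconstruct cytokines and clip negative values to zero."""
    return np.clip(quadratic_features(latent) @ coefs, 0.0, None)

# Synthetic example: 200 latent points, 5 "cytokines" from a known quadratic map
rng = np.random.default_rng(0)
latent = rng.uniform(0.0, 1.0, size=(200, 2))
true_coefs = rng.uniform(0.1, 1.0, size=(6, 5))  # positive, so outputs are non-negative
cytokines = quadratic_features(latent) @ true_coefs

coefs = fit_decoder(latent, cytokines)
recon = decode(latent, coefs)
```

Because the synthetic data are noise-free, the fitted coefficients should match `true_coefs` up to numerical precision; on real data the fit would only be approximate.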

## Main steps in the model fitting

The first main part consists in fitting a dynamical model to data projected in latent space, and reconstructing cytokine trajectories from model trajectories: 
1. Import naive OT-1 data to be fitted with the model
    1. Select datasets to optimize the reconstruction algorithm ("training" data)
    2. Select a dataset for the plot ("test" data)
2. Optimize the reconstruction algorithm
3. Fit the force model with matching to that dataset in the latent space
4. Compute the model concentration curves corresponding to the fitted parameters values
5. Project back those curves to cytokine concentration space

The second main part consists in comparing the original cytokine time courses and the ones generated from the ballistic parameters:
1. Use more naive OT-1 data to estimate noise (error bars and covariance elements at each time point) on the cytokine data
2. Compute residuals between the original time courses and the model-generated, reconstructed cytokine trajectories
3. Compute a $\chi^2$ for each time course, summing the multivariate chi-squared at each 5D time point: $\chi^2 = \sum_t \sum_{ij} c_i(t) {C^{-1}}_{ij}(t) c_j(t)$, where $c_i(t)$ is the residual of cytokine $i$ at time $t$ and $C(t)$ is the covariance matrix between cytokines at time $t$. 
4. The number of degrees of freedom $\nu$ is (nb time points) $\times$ (nb cytokines) - (nb model parameters fitted on a time course) $= 12 \times 5 - 6 = 54$. This takes into account correlations between cytokines but neglects correlations between time points, because we don't have enough data to estimate one large covariance matrix between all time points of all cytokines. In other words, we take this hypothetical big covariance matrix and assume it is block-diagonal, with one block per time point. 
5. Compute a p-value for each time course fit with the chi-squared distribution. This p-value gives the probability that a correct model of the data would give an equal or larger $\chi^2$, taking into account the amount of noise (i.e. the covariance matrix) of the data. 
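
The steps above can be sketched on synthetic numbers as follows. This is only an illustration under stated assumptions: the residuals are drawn at random rather than coming from a model fit, the same covariance matrix is reused at every time point, and the function name `multivariate_chisq` is hypothetical. The degrees of freedom use the counts quoted above (12 time points, 5 cytokines, 6 parameters).

```python
import numpy as np
from scipy import stats

def multivariate_chisq(resids, inv_covs):
    """Sum r(t)^T C^{-1}(t) r(t) over time points.

    resids: (n_times, n_cyto) residuals; inv_covs: (n_times, n_cyto, n_cyto).
    """
    return float(np.einsum("ti,tij,tj->", resids, inv_covs, resids))

n_times, n_cyto, n_params = 12, 5, 6
nu = n_times * n_cyto - n_params  # degrees of freedom

# Synthetic residuals drawn from a known covariance, identical at every time point
rng = np.random.default_rng(1)
a = rng.normal(size=(n_cyto, n_cyto))
cov = a @ a.T + 0.5 * np.eye(n_cyto)  # symmetric positive definite
resids = rng.multivariate_normal(np.zeros(n_cyto), cov, size=n_times)
inv_covs = np.broadcast_to(np.linalg.inv(cov), (n_times, n_cyto, n_cyto))

chisq = multivariate_chisq(resids, inv_covs)
# Survival function = P(chi^2_nu >= chisq); equivalent to 1 - cdf
pvalue = stats.chi2.sf(chisq, df=nu)
print(chisq / nu, pvalue)
```

Treating each time point as its own block in `inv_covs` is the block-diagonal assumption described in step 4.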



## Code structure

The following useful functions are in separate Python scripts, for clarity of the notebook. 

- Scripts to import and process data: `ltspcyt.scripts.neural_network`
- Scripts to optimize a reconstruction model (on "reconstruction training" data): `ltspcyt.scripts.reconstruction`
- Scripts to fit the sigmoid-ballistic model (on separate "reconstruction test" data): `ltspcyt.scripts.sigmoid_ballistic` and `fitting_functions`

In [None]:
import os
import numpy as np
import pandas as pd
import scipy as sp
import scipy.stats
import json, pickle
from time import perf_counter

import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns

# Scripts for data importation
from ltspcyt.scripts.neural_network import import_WT_output

# Scripts for reconstruction, using distinct functions for distinct methods. 
from ltspcyt.scripts.reconstruction import train_reconstruction, plot_recon_true, compute_latent_curves

# Scripts for curve fitting
from ltspcyt.scripts.sigmoid_ballistic import (
    return_param_and_fitted_latentspace_dfs, sigmoid_conc_full_freealpha, 
    ballistic_sigmoid_freealpha)

In [None]:
%matplotlib inline

# Part I: Model fitting and reconstruction

# I.1 Import data

### I.1.1 Import all data
Remove unwanted levels, normalize the values with the neural network training data's min and max. 
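
As a rough illustration of this normalization and of the projection that follows it, the sketch below uses random stand-ins for the training min/max and for the network input weights (`x_min`, `x_max` and `weights` are hypothetical, not the values loaded from disk). It shows why the offset is subtracted for integrals only: a constant shift drops out of any rate of change.

```python
import numpy as np

rng = np.random.default_rng(2)
n_times, n_cyto, n_nodes = 10, 5, 2

# Hypothetical stand-ins for the training min/max and the MLP input weights
x_min = rng.uniform(0.0, 0.5, size=n_cyto)
x_max = x_min + rng.uniform(1.0, 2.0, size=n_cyto)
weights = rng.normal(size=(n_nodes, n_cyto))  # plays the role of P

# An "integral" feature at unit time steps and its finite-difference derivative
integral = np.cumsum(rng.uniform(0.0, 0.3, size=(n_times, n_cyto)), axis=0)
derivative = np.diff(integral, axis=0)

# Integrals: subtract the minimum, then divide by the range
integral_scaled = (integral - x_min) / (x_max - x_min)
# Rates of change: the constant offset x_min drops out, only the range remains
derivative_scaled = derivative / (x_max - x_min)

# Project each feature on the two latent nodes (no intercept added)
latent_integral = integral_scaled @ weights.T
latent_derivative = derivative_scaled @ weights.T
```

With this convention, differencing the projected integrals gives exactly the projected derivatives, since both scaling and projection are linear.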

In [None]:
df_wt = import_WT_output()
df_min, df_max = pd.read_pickle(os.path.join("data", "trained-networks", "min_max-thomasRecommendedTraining.pkl"))
df_min, df_max = df_min.xs("integral", level="Feature"), df_max.xs("integral", level="Feature")

# Projection matrix
P = np.load(os.path.join("data", "trained-networks", "mlp_input_weights-thomasRecommendedTraining.npy")).T

In [None]:
peptides = ["N4", "Q4", "T4", "V4", "G4", "E1", "A2", "Y3", "A8", "Q7"]
concentrations = ["1uM", "100nM", "10nM", "1nM"]
cytokines = df_min.index.get_level_values("Cytokine")
times = np.arange(0, 73)

# Select only the desired cytokines and times (the T cell number is selected in a later cell)
df_wt = df_wt.unstack("Time").loc[:, (slice(None), cytokines, times)].stack("Time")

# Rescale and project each feature, but do not offset (don't use MLP's intercepts)
proj_dfs = []
feat_keys = ["integral", "concentration", "derivative"]
cols = pd.Index(["Node 1", "Node 2"], name="Node", copy=True)

for typ in feat_keys:
    # Rescale with the training min and max
    if typ == "integral":
        df_wt[typ] = (df_wt[typ] - df_min)/(df_max - df_min)
    else:   # for conc and deriv, the constant rescaling term disappears. 
        df_wt[typ] = df_wt[typ]/(df_max - df_min)
    df_temp = pd.DataFrame(np.dot(df_wt[typ], P.T), index=df_wt[typ].index, columns=cols)
    proj_dfs.append(df_temp)
df_proj = pd.concat(proj_dfs, axis=1, names=["Feature"], keys=feat_keys)
del proj_dfs, cols, feat_keys

In [None]:
# Remove different T cell numbers
tcellnum = "100k"
df_wt = df_wt.xs(tcellnum, level="TCellNumber", axis=0, drop_level=True)
df_proj = df_proj.xs(tcellnum, level="TCellNumber", axis=0, drop_level=True)

### I.1.2 Select training and testing datasets

In [None]:
subset_train = ["HighMI_1-1", "HighMI_1-3"]

df_wt_train = df_wt.loc[subset_train]
df_proj_train = df_proj.loc[subset_train]

subset_test = ["HighMI_1-2", "HighMI_1-4"]

df_wt_test = df_wt.loc[subset_test]
df_proj_test = df_proj.loc[subset_test]

# Remove A2 and Y3 from the training
df_wt_train = df_wt_train.drop(["A2", "Y3"], level="Peptide", axis=0)
df_proj_train = df_proj_train.drop(["A2", "Y3"], level="Peptide", axis=0)

## I.2 Train the reconstruction function
Also compute the reconstructed cytokines for both the test and training data sets

In [None]:
# Find the reconstruction matrix, based on reconstructing the feature chosen below
feature = "concentration"
model_type = "mixed_quad"

modelargs = {"which_to_square":[0, 1]}

# Add an extra nonlinear feature to the latent space inputs:
# tanh of the integrals, normalized by their mean over the training data
tanh_norm_factors = df_proj_train["integral"].mean(axis=0)
print(tanh_norm_factors)

df_proj_train = pd.concat([df_proj_train["concentration"], np.tanh(df_proj_train["integral"] / tanh_norm_factors)], 
                           keys=["concentration", "tanh integral"], names=["Feature"], axis=1)

In [None]:
pipe, score = train_reconstruction(df_proj_train, df_wt_train, feature=feature, 
                                   method=model_type, model_args=modelargs, do_scale_out=False)
print("Reconstruction training R^2 score:", score)

## I.3 Fit the ballistic model to test data
We fit the latent space integrals $N_1$ and $N_2$, rescaling time by $\tilde{t} = 20 h$, as we usually do. 

In [None]:
# Choice of fitting hyperparameters
fit_vars={"Constant velocity":["v0","t0","theta","vt"],"Constant force":["F","t0","theta","vt"],
         "Sigmoid":["a0", "t0", "theta", "v1", "gamma"], 
         "Sigmoid_freealpha":["a0", "t0", "theta", "v1", "alpha", "beta"]}
fit = "Sigmoid_freealpha"
regul_rate = 0.4
tscale = 20.

# Fit the integrals
start_time = perf_counter()

ret = return_param_and_fitted_latentspace_dfs(
    df_proj_test.xs("integral", level="Feature", axis=1), 
    fit, reg_rate=regul_rate, time_scale=tscale)
df_params, df_compare, df_hess, ser_v2v1 = ret

end_t = perf_counter()
print("Time to fit: ", end_t - start_time)
del start_time, end_t

nparameters = len(fit_vars[fit])

## I.4 Compute ballistic curves for fitted parameters
The normalization is a bit tricky here. We fitted $N_i(t')$, where $t' = t/\tilde{t}$, $\tilde{t} = 20 $ h (the time scale). Now, we want $n_i(t) = \frac{d N_i}{dt} = \frac{d t'}{dt} \frac{d N_i(t')}{d t'} = \frac{1}{\tilde{t}} n_i(t', a_0', \ldots)$, where $n_i(t', a_0', \ldots)$ is the function $n_i(t)$ called with $t'$ and parameters fitted for $N_i(t')$, instead of $t$: same functional form, different scale of variables. We need to compensate this by dividing by $\tilde{t}$, because when we fitted $N_i(t')$, we had the following things happening:
 - To preserve $\alpha t = \alpha' t'$ and $\beta t = \beta' t'$ in the exponentials with $t$ replaced by $t'$, $\alpha' = \tilde{t} \alpha$
 - Because the magnitude of $N_i$ is proportional to $a_0 / \alpha$ or $v_i / \alpha$, then $a_0' = \tilde{t} a_0$, i.e. the fitted value for $a_0$ is too large in reality by a factor of $\tilde{t}$
 - Since the functional form of $n_i$ has a magnitude proportional to $a_0$ only, calling that function with the fitted value of $a_0'$ would give a concentration too large by a factor $\tilde{t}$ for being truly $\frac{d N_i}{dt}$. 
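
The scaling argument above can be checked numerically on a toy example. The sketch below assumes a simple saturating form, $N(t) = (a_0/\alpha)(1 - e^{-\alpha t})$ with $n(t) = a_0 e^{-\alpha t}$, which is a hypothetical stand-in and not the sigmoid-ballistic model used in this notebook; only the dependence on $a_0$, $\alpha$ and $\tilde{t}$ is meant to carry over.

```python
import numpy as np

def integral_model(t, a0, alpha):
    """Hypothetical saturating integral N(t) = (a0 / alpha) * (1 - exp(-alpha * t))."""
    return a0 / alpha * (1.0 - np.exp(-alpha * t))

def conc_model(t, a0, alpha):
    """Its time derivative n(t) = a0 * exp(-alpha * t)."""
    return a0 * np.exp(-alpha * t)

tscale = 20.0         # time scale, in hours
a0, alpha = 0.3, 0.1  # "true" parameters, expressed in hours
times = np.linspace(0.0, 72.0, 73)

# Parameters that a fit on rescaled time t' = t / tscale would return
a0_fit, alpha_fit = tscale * a0, tscale * alpha

# Same integral curve, whether expressed in t or in t'
same_integral = np.allclose(integral_model(times, a0, alpha),
                            integral_model(times / tscale, a0_fit, alpha_fit))

# Calling the concentration function with fitted parameters overshoots by tscale,
# so divide by tscale to recover dN/dt
conc_from_fit = conc_model(times / tscale, a0_fit, alpha_fit) / tscale
```

Here `conc_from_fit` coincides with `conc_model(times, a0, alpha)`, which is the compensation by $1/\tilde{t}$ described in the text.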
 

In [None]:
# Extend ser_v2v1 to have one entry per entry in df_params
ser_v2v1_synth = pd.Series(np.zeros(len(df_params.index)), index=df_params.index)
for dset in ser_v2v1.index:
    ser_v2v1_synth[dset] = ser_v2v1[dset]

# Create a new df_compare, by concatenation.
df_latent_fit = compute_latent_curves(df_params.loc[:, :"beta"], ser_v2v1_synth, tanh_norm_factors, times,
    model="Sigmoid_freealpha", tsc=tscale)

# Replace the values in df_compare
df_compare2 = df_compare.unstack("Processing type").unstack("Feature").stack("Node").unstack("Node")

df_compare2[("Fit", "concentration")] = df_latent_fit["concentration"]  # each row is a time
df_compare2[("Fit", "tanh integral", "Node 1")] = df_latent_fit[("tanh integral", "Node 1")]
df_compare2[("Fit", "tanh integral", "Node 2")] = df_latent_fit[("tanh integral", "Node 2")]

## I.5 Reconstruct cytokines from generated curves

In [None]:
df_recon_test = pd.DataFrame(pipe.predict(df_latent_fit), index=df_latent_fit.index, 
                             columns=df_wt_test.xs(feature, axis=1, level="Feature", drop_level=False).columns)

# Concentrations can't be negative but sometimes this reconstruction gives slightly negative values
# So the last part of our reconstruction algorithm is to clip values to zero
df_recon_test.clip(lower=0.0, inplace=True)

## I.6 Compare visually the generated cytokines to the data

In [None]:
figlist = plot_recon_true(df_wt_test, df_recon_test.loc[df_recon_test.index.get_level_values("Time") > 0], 
                          feature=feature, sharey=True, do_legend=True, 
                          palette=sns.color_palette(), pept=peptides)
for xp in figlist.keys():
    print(xp)
    legend = figlist[xp].axes[-1].get_legend()
    plt.show()
plt.close()

### Adding E1 from another dataset
E1 is not available in the `HighMI_1` experiment, but we would still like to see its reconstruction, so we fit the model on another experiment (`CD25MutantTimeSeries_OT1_Timeseries_2`) and reconstruct cytokines using the same reconstruction coefficients as for `HighMI_1`. 

In [None]:
# Select data for E1
dset_with_e1 = ["CD25MutantTimeSeries_OT1_Timeseries_2"]
df_wt_e1 = df_wt.loc[dset_with_e1]
df_proj_e1 = df_proj.loc[dset_with_e1]

# Fit the model on N_1 and N_2
fit = "Sigmoid_freealpha"
regul_rate = 0.4
tscale = 20.
nparameters = len(fit_vars[fit])

start_time = perf_counter()

ret = return_param_and_fitted_latentspace_dfs(
    df_proj_e1.xs("integral", level="Feature", axis=1), 
    fit, reg_rate=regul_rate, time_scale=tscale)
df_params_e1, df_compare_e1, df_hess_e1, ser_v2v1_e1 = ret

end_t = perf_counter()
print("Time to fit: ", end_t - start_time)
del start_time, end_t


# Extend ser_v2v1 to have one entry per entry in df_params
ser_v2v1_synth_e1 = pd.Series(np.zeros(len(df_params_e1.index)), index=df_params_e1.index)
for dset in ser_v2v1_e1.index:
    ser_v2v1_synth_e1[dset] = ser_v2v1_e1[dset]

# Create a new df_compare, by concatenation.
df_latent_fit_e1 = compute_latent_curves(df_params_e1.loc[:, :"beta"], ser_v2v1_synth_e1, tanh_norm_factors, times,
    model="Sigmoid_freealpha", tsc=tscale)

# Replace the values in df_compare
df_compare_e1 = df_compare_e1.unstack("Processing type").unstack("Feature").stack("Node").unstack("Node")

df_compare_e1[("Fit", "concentration")] = df_latent_fit_e1["concentration"]  # each row is a time
df_compare_e1[("Fit", "tanh integral", "Node 1")] = df_latent_fit_e1[("tanh integral", "Node 1")]
df_compare_e1[("Fit", "tanh integral", "Node 2")] = df_latent_fit_e1[("tanh integral", "Node 2")]

# Reconstruct cytokines for the model curves fitted on this dataset
df_recon_e1 = pd.DataFrame(pipe.predict(df_latent_fit_e1), index=df_latent_fit_e1.index, 
                             columns=df_wt_e1.xs(feature, axis=1, level="Feature", drop_level=False).columns)

# Concentrations can't be negative but sometimes this reconstruction gives slightly negative values
# So the last part of our reconstruction algorithm is to clip values to zero
df_recon_e1.clip(lower=0.0, inplace=True)

# Part II: Quality of model fits ($\chi^2$ p-values)

## II.1 Importing a few functions to estimate noise from raw data

In [None]:
# Functions to estimate noise from data and put back absolute scale of reconstructions
from utils.recon_scaling import extract_process_naive_part, import_folder_naive_data, scale_back

## II.2 Import reconstruction and original data

In [None]:
df_recon = scale_back(df_recon_test, df_min, df_max)
# Or, import the result saved to disk
#df_recon = scale_back(pd.read_hdf(os.path.join(
#    "results", "reconstruction", "df_compare_recon-fit_HighMI-1.hdf"), key="df_recon"), df_min, df_max)

In [None]:
# Load the original HighMI_1 data, log-scale it, select TCellNumber of interest
df_orig_full = {}
for i in range(1, 5):
    df_orig_full["HighMI_1-{}".format(i)] = pd.read_pickle(
        os.path.join("data", "final", "cytokineConcentrationPickleFile-20200624-HighMI_1-{}-final.pkl".format(i)))

df_orig_full = pd.concat(df_orig_full, names=["Data"])
df_orig_full = df_orig_full.loc[df_orig_full.index.isin(cytokines, level="Cytokine")].unstack("Cytokine").stack("Time")

# Rescale by the minimum concentration (lower LOD) and take the log10
dset_choice = "HighMI_1-2"
df_orig_full = np.log10(df_orig_full / df_orig_full.min(axis=0))
df_orig_full = df_orig_full.xs("100k", level="TCellNumber", axis=0)
df_orig = df_orig_full.xs(dset_choice, level="Data", axis=0)

## II.3 Prepare more raw data to estimate noise
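
The noise estimate computed in this section combines a standard deviation per condition with a pooled standard deviation per cytokine, used as a floor. A minimal sketch of that idea on a small synthetic table is given below; the experiment names, the two cytokine columns and all variable names are made up for illustration, and the real data have more index levels.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
index = pd.MultiIndex.from_product(
    [["exp1", "exp2", "exp3"], ["N4", "V4"], [1.0, 24.0]],
    names=["Data", "Peptide", "Time"])
df = pd.DataFrame(rng.normal(size=(len(index), 2)), index=index,
                  columns=["IFNg", "IL-2"])
# Make one condition identical across experiments, hence zero variance
df.loc[(slice(None), "V4", 24.0), "IL-2"] = 0.0

cond_levels = ["Peptide", "Time"]
# Deviation of each replicate from the mean of its condition
diff = df - df.groupby(level=cond_levels).transform("mean")
# Standard deviation per condition (ddof=1) and pooled over all conditions
std_per_cond = df.groupby(level=cond_levels).std()
std_pooled = np.sqrt((diff**2).sum(axis=0) / (len(df) - 1))

# Floor: no condition is allowed a standard deviation below the pooled one
std_floored = std_per_cond.copy()
for cyto in std_floored.columns:
    too_small = std_floored[cyto] < std_pooled[cyto]
    std_floored.loc[too_small, cyto] = std_pooled[cyto]
```

The floor matters for conditions such as the zero-variance one constructed here, which would otherwise give an infinite weight to their residuals in a $\chi^2$.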

In [None]:
# Similar datasets that we use to get a reasonable noise estimate:
raw_data_list = [
    "cytokineConcentrationPickleFile-20190412-PeptideComparison_OT1_Timeseries_18-final.pkl", 
    "cytokineConcentrationPickleFile-20190608-PeptideComparison_OT1_Timeseries_19-final.pkl", 
    "cytokineConcentrationPickleFile-20190718-NewPeptideComparison_OT1_Timeseries_20-final.pkl", 
    "cytokineConcentrationPickleFile-20190802-TCellNumber_OT1_Timeseries_7-final.pkl", 
    "cytokineConcentrationPickleFile-20191022-PeptideComparison_OT1_Timeseries_21-final.pkl", 
    "cytokineConcentrationPickleFile-20200220-Activation_TCellNumber_1-final.pkl",
    "cytokineConcentrationPickleFile-20200624-HighMI_1-1-final.pkl",
    "cytokineConcentrationPickleFile-20200624-HighMI_1-2-final.pkl",
    "cytokineConcentrationPickleFile-20200624-HighMI_1-3-final.pkl",
    "cytokineConcentrationPickleFile-20200624-HighMI_1-4-final.pkl",
    "cytokineConcentrationPickleFile-20200627-TCellNumber_2-final.pkl", 
    "cytokineConcentrationPickleFile-20190404-CD25MutantTimeSeries_OT1_Timeseries_2-final.pkl"
]
df_raw_data, df_nM_min_conc =  import_folder_naive_data(os.path.join("data", "final"), raw_data_list)
df_raw_data = df_raw_data.loc[:, df_raw_data.columns.isin(cytokines, level="Cytokine")]
df_raw_data = df_raw_data.xs("100k", level="TCellNumber", axis=0)

# Keep only the time points and peptides of interest
df_raw_data = df_raw_data.loc[(slice(None), *df_orig.index.levels)]

In [None]:
# Computing the deviations from the average across the selected raw data sets
all_lvl_names = list(df_raw_data.index.names)
all_lvl_names.remove("Data")

# More fair noise estimation: per time point, peptide, concentration
#df_std = np.sqrt(df_raw_data.groupby(all_lvl_names).var())
# Equivalent way:
df_diff = df_raw_data - df_raw_data.groupby(all_lvl_names).mean()
df_std = np.sqrt((df_diff**2).groupby(all_lvl_names).sum() / (df_diff.groupby(all_lvl_names).count() - 1))

# Combining all residuals to get one single variance estimate per cytokine
ser_std = np.sqrt((df_diff**2).sum(axis=0) / (df_raw_data.shape[0] - 1))

# Some points in some peptide conditions have zero variance, e.g. IL-2 at 72hrs for V4 is exactly 0, always. 
# In this case, replace the standard deviation with the ser_std value (pooled estimate for the cytokine overall)
# Actually, replace all points where the variance is smaller than ser_std; that's an artifact
# ser_std is already low because it averages including all points where the variance is zero. 
for cyto in df_std.columns:
    where_wrong = df_std[cyto] < ser_std[cyto]
    df_std.loc[where_wrong, cyto] = ser_std[cyto]
df_std.min()

## II.4 Compute residuals and variance per cytokine

In [None]:
# Compute the chi-square of the reconstruction as a fit to the original data, 
# one per cytokine (not multivariate just yet). Then, obtain a p-value 

# Select the conditions we will plot
conc_choice = ["1uM"]
dset_choice = "HighMI_1-2"
peps_to_plot = ["N4", "Q4", "T4", "V4"]

# Remove all unwanted conditions
df_orig_select = df_orig.loc[(peps_to_plot, conc_choice), :]
df_recon_select = df_recon.loc[(dset_choice, peps_to_plot, conc_choice), "concentration"].droplevel("Data")
df_std_select = df_std.loc[(peps_to_plot, conc_choice), :]

# We should get one p-value per time course to plot. 
traject_indices = list(df_orig.index.names)
traject_indices.remove("Time")

# Find the time points available in both original data and recon
# for the conditions we already selected above. 
common_times = df_orig_select.index.get_level_values("Time").unique()
common_times = [a for a in common_times if a in df_recon_select.index.get_level_values("Time").unique()]
df_recon_select = df_recon_select.loc[(slice(None), slice(None), common_times)]
df_orig_select = df_orig_select.loc[(slice(None), slice(None), common_times)]
df_std_select = df_std_select.loc[(slice(None), slice(None), common_times)]

# Now, compute the chisquare, grouped by traject_indices
df_resids_select = (df_orig_select - df_recon_select)

## II.5 Multivariate $\chi^2$ distribution and p-value
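
Inverting a covariance matrix estimated from few replicates can blow up when some eigenvalues are nearly zero, so small eigenvalues are raised to a floor before inversion. The sketch below illustrates this on a small rank-deficient matrix. It uses a symmetric eigendecomposition (`np.linalg.eigh`) rather than the SVD used in the notebook cell, and the function name `floored_inverse` and the floor value are assumptions for illustration.

```python
import numpy as np

def floored_inverse(cov, min_eigenval):
    """Invert a symmetric covariance matrix after raising small eigenvalues."""
    eigvals, eigvecs = np.linalg.eigh(cov)       # symmetric eigendecomposition
    eigvals = np.maximum(eigvals, min_eigenval)  # floor tiny or zero eigenvalues
    cov_fixed = (eigvecs * eigvals) @ eigvecs.T  # V diag(lambda) V^T
    inv_cov = (eigvecs / eigvals) @ eigvecs.T    # V diag(1/lambda) V^T
    return cov_fixed, inv_cov

# Rank-deficient example: two perfectly correlated variables plus an independent one
cov = np.array([[1.0, 1.0, 0.0],
                [1.0, 1.0, 0.0],
                [0.0, 0.0, 0.5]])
cov_fixed, inv_cov = floored_inverse(cov, min_eigenval=1e-3)
```

For a symmetric positive semi-definite matrix the singular values and eigenvalues coincide, so both decompositions should lead to the same corrected matrix and inverse.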

In [None]:
# Multivariate chisq: each cytokine at a given time point is a different random variable
# So we have 5 times more points (still fitted with 6 d.o.f total), but we must
# take the covariance matrix between cytokine residuals into account
df_cov_mat = df_diff.groupby(all_lvl_names).cov()

# Remove extremely small eigenvalues and invert each matrix
min_eigenval = float(ser_std.min())**2
df_invcov_mat = df_cov_mat.copy()
df_cov_mat_corrected = df_cov_mat.copy()
for lvl in df_std.index:
    mat = df_cov_mat.loc[lvl].values
    umat, sigmat, vmat = np.linalg.svd(mat)
    # Remove extremely small eigenvalues
    sigmat[sigmat < min_eigenval/2] = min_eigenval/2
    inv_eigenvals = 1.0 / sigmat
    # Rebuild covmat and inverse cov mat without small eigenvalues
    covmat = umat.dot(np.diagflat(sigmat)).dot(vmat)
    df_cov_mat_corrected.loc[lvl] = covmat
    invcovmat = umat.dot(np.diagflat(inv_eigenvals)).dot(vmat)
    df_invcov_mat.loc[lvl] = invcovmat
    
print(df_invcov_mat.loc[("N4", "1uM", 24.0)])

In [None]:
# Compute the multivariate chisquare for each time series
df_invcov_mat_select = df_invcov_mat.loc[(peps_to_plot, conc_choice, common_times), :]
ser_mv_chisq = {}
for key in df_orig_select.index.droplevel("Time").unique():
    chisq = 0.0
    df_mat = df_invcov_mat_select.loc[key]
    resids = df_resids_select.loc[key]
    for t in common_times:
        mat = df_mat.xs(t, level="Time")
        chisq += resids.loc[t].values.dot(mat).dot(resids.loc[t].values.T)
    ser_mv_chisq[key] = chisq

# Convert to Series
ser_mv_chisq = pd.Series(ser_mv_chisq, index=df_orig_select.index.droplevel("Time").unique())

# Normalize chi squares. Number of dof: timepoints * nb_cytokines - nb_params
nu_dof5 = len(common_times)*len(cytokines) - 6
ser_mv_chisq_norm = ser_mv_chisq / nu_dof5
print(ser_mv_chisq_norm)

chi_distrib5 = sp.stats.chi2(df=nu_dof5)
ser_pval_mv = 1.0 - chi_distrib5.cdf(ser_mv_chisq)
ser_pval_mv = pd.Series(ser_pval_mv, index=ser_mv_chisq.index)
print("Multivariate p-values (high is good):")
print(ser_pval_mv)

### Adding E1 from another dataset
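For E1, the noise estimate of each cytokine is set to its value in `ser_std`, which is also the minimum standard deviation allowed in `df_std` above, and correlations between cytokines are neglected. It is not a good idea to include E1 in the covariance matrix above, as it would bias all values towards artificially low noise. 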
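
When the covariance matrix is diagonal, as assumed for E1 in the cells below, the multivariate $\chi^2$ reduces to a sum of squared standardized residuals. The sketch below checks this equivalence on synthetic residuals; the values in `sigma` are hypothetical and unrelated to `ser_std`.

```python
import numpy as np

rng = np.random.default_rng(4)
n_times, n_cyto = 8, 5
sigma = np.array([0.2, 0.3, 0.25, 0.4, 0.35])  # hypothetical per-cytokine std
resids = rng.normal(scale=sigma, size=(n_times, n_cyto))

# Full quadratic form with a diagonal inverse covariance, one time point at a time
inv_cov = np.diag(1.0 / sigma**2)
chisq_matrix = sum(r @ inv_cov @ r for r in resids)

# Equivalent shortcut: sum of squared standardized residuals
chisq_diag = np.sum((resids / sigma) ** 2)
```

Keeping the matrix form, as the notebook does, has the advantage that the same code path handles both the full and the diagonal covariance cases.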

In [None]:
# Put back actual cytokine scale and keep E1 only
df_recon_e1 = scale_back(df_recon_e1, df_min, df_max)
df_recon_e1 = df_recon_e1.xs("E1", level="Peptide", drop_level=False)
df_recon_e1 = df_recon_e1.droplevel("Data").loc[:, "concentration"]

In [None]:
df_orig_e1 = pd.read_pickle(os.path.join("data", "final", 
    "cytokineConcentrationPickleFile-20190404-CD25MutantTimeSeries_OT1_Timeseries_2-final.pkl"))
cytokines = ["IFNg", "IL-17A", "IL-2", "IL-6", "TNFa"]
df_orig_e1 = df_orig_e1.xs("WT", level="Genotype")
df_orig_e1 = df_orig_e1.loc[df_orig_e1.index.isin(cytokines, level="Cytokine")].unstack("Cytokine").stack("Time")

# Rescale by the minimum concentration (lower LOD) and take the log10
df_orig_e1 = np.log10(df_orig_e1 / df_orig_e1.min(axis=0))

# Select TCellNumber and E1 only
df_orig_e1 = df_orig_e1.xs("100k", level="TCellNumber", axis=0)
df_orig_e1 = df_orig_e1.xs("E1", level="Peptide", drop_level=False)
df_orig_e1 = df_orig_e1.reset_index()
df_orig_e1 = df_orig_e1.set_index(["Peptide", "Concentration", "Time"])

In [None]:
# Add a covariance matrix for each time point of the E1 time series
# Just use ser_std on the diagonal for the cytokines, assume no correlation. 
df_cov_e1 = pd.DataFrame(np.zeros([len(df_orig_e1.index)*len(df_orig_e1.columns), len(df_orig_e1.columns)]), 
    index=pd.MultiIndex.from_product([*df_orig_e1.index.levels, df_orig_e1.columns]), 
    columns=df_orig_e1.columns)
df_invcov_e1 = df_cov_e1.copy()
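for key in df_orig_e1.index:
    for i, cyto in enumerate(df_orig_e1.columns):
        # Assign with a single .loc call: chained .loc[...].iloc[...] assignment
        # may write to a temporary copy instead of the original DataFrame
        df_cov_e1.loc[key + (cyto,), cyto] = ser_std.iloc[i]**2
        df_invcov_e1.loc[key + (cyto,), cyto] = 1 / ser_std.iloc[i]**2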
print(df_invcov_e1)

In [None]:
e1_times = df_orig_e1.index.get_level_values("Time").unique()
df_resids_e1 = df_orig_e1 - df_recon_e1.loc[df_recon_e1.index.isin(e1_times, level="Time")]
df_resids_e1 = df_resids_e1.dropna()  # Times not available
e1_times = df_resids_e1.index.get_level_values("Time").unique()
chisq_e1 = 0.0
for t in e1_times:
    mat = df_invcov_e1.xs(t, level="Time")
    resids = df_resids_e1.xs(t, level="Time")
    chisq_e1 += float(resids.values.dot(mat).dot(resids.values.T))
print("chi^2:", chisq_e1)
# Compute normalized chi-squared and p-value
chisq_e1_norm = chisq_e1 / nu_dof5
print("Normalized chi^2:", chisq_e1_norm)
pvalue_e1 = 1.0 - chi_distrib5.cdf(chisq_e1)
print("p-value:", pvalue_e1)

# III. Plotting reconstructed time courses

In [None]:
# Retrieve the minimum concentration in HighMI_1 for proper scaling back
# This is a constant: log(cyto/min) = log(cyto/units) - log(min/units)
# that we will add back to all data before plotting
pM_offset = 1000 * df_nM_min_conc.loc[dset_choice]
print(pM_offset)

In [None]:
# Logarithmic minor ticks (we plotted the real log so need to put log ticks manually)
# Find the linear scale limiting ticks
def compute_log_minor_ticks(logylims, stp=2, base=10.0):
    smallest_major = int(np.floor(logylims[0]))
    largest_major = int(np.ceil(logylims[1]))
    n_decades = largest_major - smallest_major

    # Generate linear ranges with the exponents found
    tiles = []
    for i in range(n_decades):
        tiles.append(np.arange(stp*base**(smallest_major+i), 
                    base**(smallest_major+i+1), stp*base**(smallest_major+i)))
    minorticks = np.concatenate(tiles, axis=0)
    minorticks = np.log(minorticks) / np.log(base)
    minorticks = minorticks[(minorticks > logylims[0]) * (minorticks < logylims[1])]
    return minorticks

In [None]:
### Plot comparing the data to the reconstruction, with error bars
## FIRST ROW: data vs reconstructed cytokine trajectories
# For each peptide, plot one quantity
fig, allaxes = plt.subplots(2, 5, sharex=False, sharey="row")
fig.set_size_inches(4.75, 2*1.5)
axes = allaxes[0]

cytokines_plot = ["IFNg", "IL-2", "IL-17A", "IL-6", "TNFa"]
cytokines_nice = [r"IFN-$\gamma$", "IL-2", "IL-17A", "IL-6", "TNF"]

times_recon = df_recon.index.get_level_values("Time").unique()
times_orig = df_orig.index.get_level_values("Time").unique()

peps_to_plot = ["N4", "Q4", "T4", "V4", "E1"]
pep_palette = sns.color_palette(n_colors=len(peps_to_plot)+1)
pep_palette.pop(4)
pep_palette = {peps_to_plot[i]:pep_palette[i] for i in range(len(pep_palette))}
lw_choice = 1.5
for j, pep in enumerate(peps_to_plot[:4]):
    for i in range(len(cytokines_plot)):
        y = df_recon.loc[(dset_choice, pep, conc_choice, times_recon), ("concentration", cytokines_plot[i])].values
        y2 = df_orig.loc[(pep, conc_choice, times_orig), (cytokines_plot[i],)].values.flatten()
        # Offsets for proper absolute concentration plotting
        y = y + np.log10(pM_offset.loc[cytokines_plot[i]])
        y2 = y2 + np.log10(pM_offset.loc[cytokines_plot[i]])
        yerr = df_std.loc[(pep, conc_choice, times_orig), (cytokines_plot[i],)].values.flatten()
        # Display p-value in legend. Optional, currently off. 
        #pvalue = float(ser_pval_mv.loc[(pep, conc_choice)])
        #chi_value = float(ser_mv_chisq_norm.loc[(pep, conc_choice)])
        axes[i].plot(times_recon, y, color=pep_palette[pep], lw=lw_choice, zorder=2*(4-j)+5,
            label=pep)  #+" (p = {:.2f},\n".format(pvalue)+r"$\chi^2/\nu$ = {:.2f})".format(chi_value))
        axes[i].errorbar(times_orig, y2, yerr=yerr, color=pep_palette[pep], lw=lw_choice, zorder=2*(4-j)+4,
                         ls="none", marker="o", ms=2, elinewidth=lw_choice*0.5)

# Add E1 manually to the plot
for i in range(len(cytokines_plot)):
    recon_times = df_recon_e1.index.get_level_values("Time").unique()
    y = df_recon_e1.loc[("E1", "1uM"), cytokines_plot[i]].values
    y2 = df_orig_e1.loc[("E1", "1uM", e1_times), cytokines_plot[i]].values
    y = y + np.log10(pM_offset.loc[cytokines_plot[i]])
    y2 = y2 + np.log10(pM_offset.loc[cytokines_plot[i]])
    # Variance: we used ser_std anyways
    yerr = ser_std[cytokines_plot[i]]
    axes[i].plot(recon_times, y, color=pep_palette["E1"], lw=lw_choice, zorder=1,
            label="E1")  #+" (p = {:.2f},\n".format(pvalue_e1)+r"$\chi^2/\nu$ = {:.2f})".format(chisq_e1_norm))
    axes[i].errorbar(e1_times, y2, yerr=yerr, color=pep_palette["E1"], lw=lw_choice, zorder=2,
                         ls="none", marker="o", ms=2, elinewidth=lw_choice*0.5)

# Add logarithmic ticks manually
#ylims_log100 = (np.log(10.0**axes[0].get_ylim()[0]) / np.log(100), 
#                np.log(10.0**axes[0].get_ylim()[1]) / np.log(100))
#print(ylims_log100)
#minorticks = compute_log_minor_ticks(ylims_log100, stp=1, base=100.0)
#Convert back to log10, which is the scale of the axis
#minorticks = np.log10(100**minorticks)

with open(os.path.join("data", "misc", "minor_ticks_props.json"), "r") as hd:
    props_minorticks = json.load(hd)
with open(os.path.join("data", "misc", "major_ticks_props.json"), "r") as hd:
    props_majorticks = json.load(hd)
ylims_log10 = axes[0].get_ylim()
minorticks = compute_log_minor_ticks(ylims_log10, stp=1, base=10.0)

def powten_format(x, pos):
    return r"$10^{" + "{}".format(int(x)) + r"}$"
for i, ylbl in enumerate(cytokines_nice):
    axes[i].set_title(ylbl, size=7, pad=4., va="top", y=0.91)
    axes[i].tick_params(axis="both", **props_majorticks)
    axes[i].tick_params(axis="y", which="minor", **props_minorticks)
    axes[i].set_yticks(np.arange(np.ceil(ylims_log10[0]), np.floor(ylims_log10[1])+1, 1))
    axes[i].set_yticks(minorticks, minor=True)
    axes[i].yaxis.set_major_formatter(mpl.ticker.FuncFormatter(powten_format))

# Label the y axes
axes[0].set_ylabel("[cytokine] (pM)", size=7, labelpad=0.5)
for i in range(1, len(cytokines_plot)):
    axes[i].set_ylabel("")
    
# Label the x axes
for i in range(len(cytokines_plot)):
    axes[i].set_xlabel("Time (h)", size=7, labelpad=0.5)
    axes[i].set_xticks([0, 30, 60])
    axes[i].set_xticklabels([0, 30, 60])
    for axis in ['bottom', 'left', "top", "right"]:
        axes[i].spines[axis].set_linewidth(0.8)
    
# Legend for peptides on the side
handles, labels = axes[0].get_legend_handles_labels()
leg = fig.legend(handles, labels, loc="upper left", bbox_to_anchor=(0.95, 0.95), 
                 handlelength=0.8, fontsize=7, labelspacing=0.3, frameon=False, 
                 borderaxespad=0.5, handletextpad=0.3)

# Legend on top to emphasize reconstruction vs data
handles2 = [
    mpl.lines.Line2D([0, 1], [0, 1], color="k", ls="none", marker="o", ms=2), 
    mpl.lines.Line2D([0, 1], [0, 1], color="k", lw=lw_choice)]
labels2 = ["Data", "Reconstruction from latent space model"]
second_legend = mpl.legend.Legend(parent=fig, handles=handles2, labels=labels2, ncol=2, 
                    loc='upper center', bbox_to_anchor=(0.5, 1.02), frameon=False, 
                    handlelength=0.8, fontsize=8, labelspacing=0.3, 
                    borderaxespad=0.5, handletextpad=0.3)
fig.add_artist(second_legend)


## SECOND ROW: residuals
axes = allaxes[1]

times_recon = df_recon.index.get_level_values("Time").unique()
times_orig = df_orig.index.get_level_values("Time").unique()

peps_to_plot = ["N4", "Q4", "T4", "V4", "E1"]
pep_palette = sns.color_palette(n_colors=len(peps_to_plot)+1)
pep_palette.pop(4)
pep_palette = {peps_to_plot[i]:pep_palette[i] for i in range(len(pep_palette))}
lw_choice = 2.
for j, pep in enumerate(peps_to_plot[:4]):
    for i in range(len(cytokines_plot)):
        y = df_resids_select.loc[(pep, conc_choice, times_orig), (cytokines_plot[i],)].values.flatten()
        yerr = df_std.loc[(pep, conc_choice, times_orig), (cytokines_plot[i],)].values.flatten()
        #pvalue = float(ser_pval_mv.loc[(pep, conc_choice)])
        #chi_value = float(ser_mv_chisq_norm.loc[(pep, conc_choice)])
        zord = (5-j)
        axes[i].plot(times_orig, y, color=pep_palette[pep], lw=lw_choice, 
                         zorder=zord+2*5, label=pep)
        axes[i].plot(times_orig, yerr, color=pep_palette[pep], zorder=zord+5,
                         ls="-.", lw=lw_choice*0.5, alpha=0.7, label=r"$\sigma_{"+pep+"}$")
        axes[i].plot(times_orig, -yerr, color=pep_palette[pep], zorder=zord+5,
                         ls="-.", lw=lw_choice*0.5, alpha=0.7)
        # Fill between?
        #axes[i].fill_between(times_orig, -yerr, yerr, color=pep_palette[pep], 
        #                     zorder=zord, alpha=0.1)

# Add E1 manually to the plot
for i in range(len(cytokines_plot)):
    y = df_resids_e1.loc[("E1", "1uM", e1_times), cytokines_plot[i]].values
    # Variance: we used ser_std anyways
    yerr = np.asarray([ser_std[cytokines_plot[i]]]*len(e1_times))
    axes[i].plot(e1_times, y, color=pep_palette["E1"], lw=lw_choice, zorder=1+10,
            label="E1")
    axes[i].plot(e1_times, yerr, color=pep_palette["E1"], zorder=1+5, ls="-.", 
                 lw=lw_choice/1.5, alpha=0.65, label=r"$\sigma_{E1}$")
    axes[i].plot(e1_times, -yerr, color=pep_palette["E1"], zorder=1+5, ls="-.", 
                 lw=lw_choice/1.5, alpha=0.65)
    #axes[i].fill_between(e1_times, -yerr, yerr, color=pep_palette["E1"], 
    #              zorder=1, alpha=0.1)


for i, ylbl in enumerate(cytokines_nice):
    axes[i].tick_params(axis="both", labelsize=7, length=2., width=0.8)

# Label the y axes
axes[0].set_ylabel(r"$\Delta \log_{10}$([cytokine])", size=7, labelpad=1.5)
for i, ylbl in enumerate(cytokines_nice):
    axes[i].set_title(ylbl, size=7, pad=4., va="top", y=0.91)
    for axis in ['bottom', 'left', "top", "right"]:
        axes[i].spines[axis].set_linewidth(0.8)

# Label the x axes
for i in range(len(cytokines_plot)):
    axes[i].set_xlabel("Time (h)", size=7, labelpad=0.5)
    axes[i].set_xticks([0, 30, 60])
    axes[i].set_xticklabels([0, 30, 60])
    for axis in ['bottom', 'left', "top", "right"]:
        axes[i].spines[axis].set_linewidth(0.8)
    
# Legend for peptides below
handles3, labels3 = axes[0].get_legend_handles_labels()
handles3 = handles3[1::2]
labels3 = labels3[1::2]

# Third legend for residuals
third_legend = mpl.legend.Legend(parent=fig, handles=handles3, labels=labels3, 
                  bbox_to_anchor=(0.95, 0.45), loc="upper left", ncol=1, 
                  handlelength=0.8, fontsize=7, labelspacing=0.3, frameon=False, 
                 borderaxespad=0.5, handletextpad=0.3)
fig.add_artist(third_legend)


# Fourth legend for residuals and standard deviation
handles4 = [
    mpl.lines.Line2D([0, 1], [0, 1], color="k", ls="-", lw=lw_choice), 
    mpl.lines.Line2D([0, 1], [0, 1], color="k", ls="-.", lw=lw_choice/1.5)]
labels4 = ["Residual reconstruction-data", "Standard deviation"]
fourth_legend = mpl.legend.Legend(parent=fig, handles=handles4, labels=labels4, ncol=2, 
                    loc='center', bbox_to_anchor=(0.5, 0.47), frameon=False, 
                    handlelength=1.85, fontsize=8, labelspacing=0.3, 
                    borderaxespad=0.5, handletextpad=0.3)
fig.add_artist(fourth_legend)

fig.tight_layout(w_pad=0.3, h_pad=2.5)

#fig.savefig(os.path.join("figures", "reconstruction", "supp_figure_reconstruction_model_pvalues_residuals.pdf"), 
#           bbox_inches="tight", bbox_extra_artists=(leg, second_legend, third_legend, fourth_legend), transparent=True)
plt.show()
plt.close()

In [None]:
# Output the p-values and chi^2/nu (and chi^2 and nu separately) to a text file
df_fit_stats = pd.DataFrame(
    {"p-value": ser_pval_mv.loc[(peps_to_plot[:4], conc_choice)].tolist()+[pvalue_e1], 
    "chi^2/nu": ser_mv_chisq_norm.loc[(peps_to_plot[:4], conc_choice)].tolist()+[chisq_e1_norm], 
    "chi^2": ser_mv_chisq.loc[(peps_to_plot[:4], conc_choice)].tolist() + [chisq_e1], 
    "nu": [nu_dof5]*len(peps_to_plot)
    }, index=pd.MultiIndex.from_product(
        (peps_to_plot, conc_choice), names=["Peptide", "Concentration"])
    )
print(df_fit_stats)
#df_fit_stats.to_json(os.path.join("results", "reconstruction", "reconstruction_model_pvalues.json"))