# Reconstruct cytokine time courses from data projected in latent space

The best reconstructions are obtained with the following procedure. 
In addition to the latent space concentrations $n_1$ and $n_2$, the reconstruction includes linear terms in $\tanh(N_1/\bar{N}_1)$ and $\tanh(N_2/\bar{N}_2)$, where $N_1$ and $N_2$ are the latent space integrals, as well as quadratic terms in the concentrations. The constants $\bar{N}_1$ and $\bar{N}_2$ are normalization constants, taken to be the average values of $N_1$ and $N_2$ over all times and conditions in the training data. The tanh functions saturate the integrals, so that they can sustain the reconstructed cytokines at late times without causing an artificial, continuous increase in the cytokine values. 

In other words, we reconstruct cytokine $c_i$ with the following combination of terms: 

$$ c_i = Q_{i1} n_1 + Q_{i2} n_2 + Q_{i3} n_1^2 + Q_{i4} n_2^2 + Q_{i5} n_1 n_2  + Q_{i6} \tanh{(N_1/\bar{N}_1)} + Q_{i7} \tanh{(N_2/\bar{N}_2)}$$

The $5\times 7$ matrix $Q_{ij}$ is fitted by linear least-squares regression on the non-linear terms. 
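
As a minimal sketch of this fit, assuming synthetic latent trajectories rather than the real data, the seven regressors can be stacked into a design matrix and $Q$ obtained with `np.linalg.lstsq`. The arrays `n`, `N`, `Q_true` and the helper `build_features` are hypothetical and are not part of `ltspcyt`; the actual fit in this notebook is done by `train_reconstruction`.

```python
import numpy as np

def build_features(n, N, N_bar):
    """Stack the 7 regressors: n1, n2, n1^2, n2^2, n1*n2, tanh(N1/N1_bar), tanh(N2/N2_bar)."""
    n1, n2 = n[:, 0], n[:, 1]
    tanh_N = np.tanh(N / N_bar)  # saturating function of the integrals
    return np.column_stack([n1, n2, n1**2, n2**2, n1 * n2, tanh_N[:, 0], tanh_N[:, 1]])

rng = np.random.default_rng(0)
t = np.linspace(0.0, 68.0, 200)

# Hypothetical latent-space concentrations: two transient pulses
n = np.column_stack([t * np.exp(-t / 10.0), t**2 * np.exp(-t / 8.0) / 20.0])
# Latent-space integrals, approximated by a cumulative sum
N = np.cumsum(n, axis=0) * (t[1] - t[0])
N_bar = N.mean(axis=0)  # normalization constants (mean over all samples)

X = build_features(n, N, N_bar)    # design matrix, shape (200, 7)
Q_true = rng.normal(size=(5, 7))   # hypothetical 5 x 7 coefficient matrix
C = X @ Q_true.T                   # five synthetic "cytokines", shape (200, 5)

# The problem is linear in Q even though the features are non-linear in n and N
Q_fit_T, *_ = np.linalg.lstsq(X, C, rcond=None)
Q_fit = Q_fit_T.T                  # shape (5, 7), one row per cytokine
```

The tanh columns stay bounded while the integrals themselves keep growing, which is what lets them sustain a late-time plateau in the reconstruction.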

In [None]:
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import pickle
import pandas as pd
import os
import seaborn as sns

In [None]:
from ltspcyt.scripts.adapt_dataframes import sort_SI_column
from ltspcyt.scripts.neural_network import import_WT_output
%matplotlib inline

In [None]:
from ltspcyt.scripts.reconstruction import (train_reconstruction, plot_recon_true, 
                                    performance_recon, plot_histograms)

# Import cytokine data, integrals, projections, and MLP
Use the OT-1 datasets, which give the simplest latent space we have. 

In [None]:
peptides = ["N4", "Q4", "T4", "V4", "G4", "E1", "A2", "Y3", "A8", "Q7"]
concentrations = ["1uM", "100nM", "10nM", "1nM"]

In [None]:
df_wt = import_WT_output()

In [None]:
df_min, df_max = pd.read_pickle("data/trained-networks/min_max-thomasRecommendedTraining.pkl")
# Use a context manager so the file handle is closed after loading
with open("data/trained-networks/mlp-thomasRecommendedTraining.pkl", "rb") as f:
    mlp = pickle.load(f)
df_min, df_max = df_min.xs("integral", level="Feature"), df_max.xs("integral", level="Feature")

# Projection matrix
P = mlp.coefs_[0].T
print(P)

In [None]:
cytokines = df_min.index.get_level_values("Cytokine")
times = np.arange(0, 69)

In [None]:
# Select only the desired cytokines and times (the T cell number is selected further below)
df_wt = df_wt.unstack("Time").loc[:, (slice(None), cytokines, times)].stack("Time")

# Rescale and project each feature
proj_dfs = []
feat_keys = ["integral", "concentration", "derivative"]
cols = pd.Index(["Node 1", "Node 2"], name="Node", copy=True)
print(P.T)

for typ in feat_keys:
    # Rescale with the training min and max
    if typ == "integral":
        df_wt[typ] = (df_wt[typ] - df_min)/(df_max - df_min)
    else:   # for conc and deriv, the constant offset term -df_min disappears. 
        df_wt[typ] = df_wt[typ]/(df_max - df_min)
    df_temp = pd.DataFrame(np.dot(df_wt[typ], P.T), index=df_wt[typ].index, columns=cols)
    proj_dfs.append(df_temp)
df_proj = pd.concat(proj_dfs, axis=1, names=["Feature"], keys=feat_keys)
del proj_dfs, cols, feat_keys  # temporary variables

## Select training and test data
Use different datasets as a simple means of splitting the data. `sklearn.model_selection.train_test_split` could also be used, but it is not necessary: by selecting different datasets, we are sure to have similar test and training data, modulo experimental variability, since the peptide conditions are the same. 

Select only one T cell number for now. You could try using the same reconstruction coefficients for different T cell numbers, but it would not work as well, because the 2D manifold changes slightly depending on T cell number. 
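
A minimal sketch of this dataset-based split, assuming a toy dataframe whose row index mimics the notebook's `Data` and `Peptide` levels (the values and the single `IL-2` column are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the notebook's row index: (Data, Peptide, Time)
index = pd.MultiIndex.from_product(
    [["HighMI_1-1", "HighMI_1-2", "HighMI_1-3", "HighMI_1-4"], ["N4", "Q4"], [0, 1, 2]],
    names=["Data", "Peptide", "Time"],
)
df = pd.DataFrame({"IL-2": np.arange(len(index), dtype=float)}, index=index)

train_sets = ["HighMI_1-1", "HighMI_1-3"]
test_sets = ["HighMI_1-2", "HighMI_1-4"]

# .loc with a list of labels selects on the first index level ("Data")
df_train = df.loc[train_sets]
df_test = df.loc[test_sets]

# Every peptide appears in both splits, and no dataset is shared between them
train_peps = set(df_train.index.get_level_values("Peptide"))
test_peps = set(df_test.index.get_level_values("Peptide"))
shared = set(df_train.index.get_level_values("Data")) & set(df_test.index.get_level_values("Data"))
```

Splitting on whole datasets keeps each time course intact, whereas a random row-wise split would scatter the time points of a single experiment across the training and test sets.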

In [None]:
# Remove different T cell numbers
tcellnum = "100k"
df_wt = df_wt.xs(tcellnum, level="TCellNumber", axis=0, drop_level=True)
df_proj = df_proj.xs(tcellnum, level="TCellNumber", axis=0, drop_level=True)

In [None]:
# Keep multiple datasets to populate latent space better
# Mix datasets with old and new protocols, because IL-6 is low in new ones, for instance. 

subset_train = [
    "HighMI_1-1", 
    "HighMI_1-3"
]
df_wt_train = df_wt.loc[subset_train]
df_proj_train = df_proj.loc[subset_train]

subset_test = [
    "HighMI_1-2", 
    "HighMI_1-4"
]
df_wt_test = df_wt.loc[subset_test]
df_proj_test = df_proj.loc[subset_test]

### Visualize the latent space of selected data 
Color per dataset to emphasize experimental variability. 

In [None]:
g = sns.relplot(x="Node 1", y="Node 2", data=df_proj_train["integral"].reset_index(), 
            col="Peptide", col_wrap=3, col_order=peptides, kind="line",
            hue="Data", hue_order=subset_train, palette=sns.color_palette("Set2", len(subset_train)), 
            size="Concentration", sort=False, height=2.5)
legend = g._legend
#g.fig.savefig("figures/latentspace/latentspace_integral_train_datasets.pdf", 
#              transparent=True, bbox_extra_artists=(legend,), bbox_inches='tight')
plt.show()
plt.close()

In [None]:
# Test data
g = sns.relplot(x="Time", y=0, data=df_proj_test["concentration"].sort_index(level="Time").stack("Node").reset_index(), 
            col="Peptide", row="Node", col_order=peptides, kind="line",
            hue="Data", hue_order=subset_test, palette=sns.color_palette("Set2", len(subset_test)), 
            size="Concentration", sort=False, height=2.5)
legend = g._legend
#g.fig.savefig("figures/latentspace/latentspace_integral_test_datasets.pdf", 
#              transparent=True, bbox_extra_artists=(legend,), bbox_inches='tight')
plt.show()
plt.close()

# Reconstruction of the selected feature from the projections

We reconstruct only one feature (integrals, concentrations, or derivatives) at a time. Once the reconstructed cytokines are obtained, of course the other features can be recovered by differentiation or time integration. 
All available methods are defined in `ltspcyt/scripts/reconstruction.py`:
- Linear regression
- Linear regression with quadratic terms
- Neural network with two input variables
- Linear regression with mixed input features, some with quadratic terms

Here we use linear regression with quadratic terms, option `"mixed_quad"`, and we add $\tanh(N_i/\bar{N}_i)$ terms for saturation.

To try simple linear regression, use the commented out cell instead of the one below. Note that in that case only, because of the linearity, the same reconstruction coefficients can be used for time integrals, concentrations, or derivatives alike. 
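
The linearity remark can be checked numerically. The sketch below uses random synthetic data (the arrays `n`, `N`, and `Q_lin` are hypothetical) and a cumulative sum as a stand-in for time integration: a linear map commutes with integration, while a quadratic term does not.

```python
import numpy as np

rng = np.random.default_rng(1)
dt = 1.0
n = rng.random((69, 2))          # hypothetical latent concentrations, 69 time points
N = np.cumsum(n, axis=0) * dt    # latent integrals (simple Riemann sum)

Q_lin = rng.normal(size=(5, 2))  # hypothetical linear reconstruction coefficients

# Linear case: reconstruct then integrate, or integrate then reconstruct
c_lin = n @ Q_lin.T
C_from_c = np.cumsum(c_lin, axis=0) * dt
C_from_N = N @ Q_lin.T
linear_commutes = np.allclose(C_from_c, C_from_N)

# Quadratic case: the integral of n1^2 is not the square of the integral of n1
int_of_square = np.cumsum(n[:, 0] ** 2) * dt
square_of_int = N[:, 0] ** 2
quadratic_commutes = np.allclose(int_of_square, square_of_int)
```

This is why coefficients fitted with quadratic or tanh terms apply only to the feature they were trained on.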

In [None]:
# Find the reconstruction matrix for the selected feature (here, concentrations)
feature = "concentration"
model_type = "mixed_quad"

modelargs = {"which_to_square":[0, 1]}

# Add extra input features: tanh of the latent-space integrals,
# normalized by their mean over the training data (saturating terms)
norm_factors = df_proj_train["integral"].mean(axis=0)


df_proj_train2 = pd.concat([df_proj_train["concentration"], np.tanh(df_proj_train["integral"] / norm_factors)], 
                           keys=["concentration", "tanh_integrals"], names=["Feature"], axis=1)
df_proj_test2 = pd.concat([df_proj_test["concentration"], np.tanh(df_proj_test["integral"] / norm_factors)], 
                           keys=["concentration", "tanh_integrals"], names=["Feature"], axis=1)

In [None]:
pipe, score = train_reconstruction(df_proj_train2, df_wt_train, feature=feature, 
                                   method=model_type, model_args=modelargs, do_scale_out=False)
print("R^2 training score:", score)
print("Regression coefficients (Q matrix):")
print(pipe[model_type].regressor_.Q)

In [None]:
# Reconstruct both the test and training data sets for the selected feature (the pipeline does not work for other features).
# No inverse_transform is needed: when pipe.predict is called, the inverse transform is applied to the regressor's prediction.
# These wrappers are convenient, but beware of forgetting what happens under the hood.

columns2 = pd.MultiIndex.from_product([[feature], df_wt_train[feature].columns], names=df_wt_train.columns.names)

df_recon_train = pd.DataFrame(pipe.predict(df_proj_train2), index=df_wt_train.index, 
                              columns=columns2)
df_recon_test = pd.DataFrame(pipe.predict(df_proj_test2), index=df_wt_test.index, 
                             columns=columns2)

## Save reconstruction results

In [None]:
# Save the results: processed latent space used as input and reconstruction
fnames = "results/reconstruction/df_{}_include-integrals_tanh_HighMI_1.hdf"
# fnames = "results/reconstruction/df_{}_linear_HighMI_1.hdf"  # Use this name if using linear reconstruction

df_recon_train.to_hdf(fnames.format("recon"), key="train", mode="w")
df_recon_test.to_hdf(fnames.format("recon"), key="test", mode="a")

df_proj_train.to_hdf(fnames.format("proj"), key="train", mode="w")
df_proj_test.to_hdf(fnames.format("proj"), key="test", mode="a")

df_wt_train.to_hdf(fnames.format("wt"), key="train", mode="w")
df_wt_test.to_hdf(fnames.format("wt"), key="test", mode="a")

### Compare the reconstruction and data for test replicates
Plot per cytokine, for all datasets, to infer general trends that we can then correct manually for each cytokine. 

In [None]:
# Per experiment
figlist = plot_recon_true(df_wt_test, df_recon_test, feature=feature, sharey=True)
# If one wants to select only a subset of the data
#figlist = plot_recon_true(df_wt_test.loc[(slice(None), ["N4", "T4"]), :], 
#                          df_recon_test.loc[(slice(None), ["N4", "T4"]), :], 
#                          feature=feature, sharey=True)
for exp in figlist.keys():
    legend = figlist[exp].axes[-1].get_legend()
    figlist[exp].set_size_inches(9, 6)
    #figlist[exp].savefig("figures_quadratic/per_dataset/cyto_reconstruction_HighMI-test_selectpep_" + exp + "_add-integrals_tanh.pdf", 
    #                     format="pdf", transparent=True, bbox_extra_artists=(legend,), bbox_inches='tight')
plt.show()
plt.close()

## Quantification of the reconstruction performance
Histogram of residuals and residuals over time. 

Note that the R2 score computed by sklearn is slightly different, because it takes the uniform average of the R2 score of each cytokine separately. If $X$ is the true data (shape $N_{samp} \times {N_{dim}}$) and $\hat{X}$ is the reconstruction:

$$ R^2_{sklearn} = \frac{1}{N_{dim}}\sum_{j=1}^{N_{dim}} \left[ 1 - \frac{\sum_{i} (X_{ij} - \hat{X}_{ij})^2}{\sum_{i'}(X_{i'j} - \langle X_{j} \rangle)^2}  \right]$$

Here, we treat the five cytokines as 5 components of a 5D vector and sum the squared residuals in the five dimensions (i.e. compute the L2 norm of the difference vector) into a single R2 score. 

$$ R^2_{vector} = 1 -  \frac{\sum_{i, j} (X_{ij} - \hat{X}_{ij})^2}{\sum_{i', j} (X_{i'j} - \langle X_{j} \rangle)^2} $$
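
To make the difference concrete, here is a minimal NumPy sketch of both definitions on synthetic data. The helpers `r2_uniform_average` and `r2_vector` are hypothetical names written for this illustration; the first follows the uniform averaging that sklearn's `r2_score` applies by default to multi-output data, and the second follows the vector formula above.

```python
import numpy as np

def r2_uniform_average(X, X_hat):
    """Average of the R^2 computed separately for each column (cytokine)."""
    ss_res = ((X - X_hat) ** 2).sum(axis=0)
    ss_tot = ((X - X.mean(axis=0)) ** 2).sum(axis=0)
    return np.mean(1.0 - ss_res / ss_tot)

def r2_vector(X, X_hat):
    """Single R^2 with squared residuals summed over samples and columns."""
    ss_res = ((X - X_hat) ** 2).sum()
    ss_tot = ((X - X.mean(axis=0)) ** 2).sum()
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(2)
# Five synthetic "cytokines" with very different variances
scales = np.array([10.0, 5.0, 1.0, 0.5, 0.1])
X = rng.normal(size=(500, 5)) * scales
# Reconstruction with the same absolute error on every column
X_hat = X + rng.normal(scale=0.05, size=X.shape)

r2_avg = r2_uniform_average(X, X_hat)
r2_vec = r2_vector(X, X_hat)
```

With equal absolute errors, the low-variance column drags the uniform average down, while the vector score is dominated by the high-variance columns and stays close to 1.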

In [None]:
# Check the performance of the reconstruction, per dataset
perform_train = performance_recon(df_wt_train, df_recon_train, toplevel="Data", feature=feature)
perform_test = performance_recon(df_wt_test, df_recon_test, toplevel="Data", feature=feature)
#perform_train_conc = performance_recon(df_wt_train, df_recon_train, toplevel="Data", feature="concentration")
#perform_test_conc = performance_recon(df_wt_test, df_recon_test, toplevel="Data", feature="concentration")

# Plot the histograms and print the results
print("------ Performance on TRAIN datasets ------")
print("-- {} --".format(feature))
print("R2 coefficient (score):", perform_train[-1])
print("Residuals per dataset: \n", perform_train[0])
fig, axes = plot_histograms(perform_train[1], perform_train[2])
plt.show()
plt.close()

print("------ Performance on TEST datasets ------")
print("-- {} --".format(feature))
print("R2 coefficient (score):", perform_test[-1])
print("Residuals per dataset per point: \n", perform_test[0])
fig, axes = plot_histograms(perform_test[1], perform_test[2])
plt.show()
plt.close()



In [None]:
# Plot residuals
palette_backup = plt.rcParams['axes.prop_cycle'].by_key()['color']
peptides_backup_short = ["N4", "Q4", "T4", "V4", "A2"]

def plot_residuals_percyto(df_res, feature="integral", toplevel="Data", datatype="relative",
    sharey=True, palette=palette_backup, pept=peptides_backup_short, y_lims=None):
    """
    Plot the reconstruction residuals over time, one figure per cytokine.

    Args:
        df_res (pd.DataFrame): dataframe containing the residuals
        feature (str): the feature to compare ("integral", "concentration", "derivative")
        toplevel (str): the first index level, one row of panels per entry
        datatype (str): "relative" or "absolute"
        sharey (bool): whether or not the y axis on each row should be shared
            True by default, which shows whether some cytokines weigh less in the reconstruction.
        palette (list): list of colors, at least as long as pept
        pept (list): list of peptides
        y_lims (pd.DataFrame): dataframe of maxes, used as symmetric y axis limits

    Returns:
        figlist (dict): one matplotlib figure per cytokine, keyed by cytokine name
    """
    # Slice for the desired feature
    df = df_res.xs(feature, level="Feature", axis=1, drop_level=True)

    # Plot the result
    # Rows are for cytokines, columns for peptides
    # One panel per dataset
    figlist = {}
    for cyt in df.columns.get_level_values("Cytokine").unique():
        # Extract labels
        cols = df.index.get_level_values("Peptide").unique()
        cols = [p for p in pept if p in cols]  # Use the right order
        try:
            rows = df.index.get_level_values(toplevel).unique()
        except KeyError:
            rows = df.columns.get_level_values("Node").unique()
            print("Reconstructing latent space")
        # Sort the concentrations
        concs_num = sort_SI_column(df.index.get_level_values("Concentration").unique(), "M")
        concs = np.asarray(df.index.get_level_values("Concentration").unique())[np.argsort(concs_num)]
        # Prepare colors and sizes
        colors = {cols[i]:palette[i] for i in range(len(cols))}
        sizes = {concs[i]:1 + i for i in range(len(concs))}
        fig, axes = plt.subplots(len(rows), len(cols), sharex=False, sharey=sharey)
        fig.set_size_inches(3*len(cols), 3*len(rows))
        times = df.index.get_level_values("Time").unique()
        times = [float(t) for t in times]
        for i, xp in enumerate(rows):
            for j, pep in enumerate(cols):
                for k in concs:
                    try:
                        li1, = axes[i, j].plot(times, df.loc[(xp, pep, k), cyt],
                                    color=colors[pep], lw=sizes[k], ls="-")
                        li2 = axes[i, j].axhline(0, color="k", ls="--", lw=1.)
                    except KeyError:  # This combination does not exist
                        continue
                # Some labeling
                if j == 0:
                    units = "" if datatype == "absolute" else " [%]"
                    axes[i, j].set_ylabel(xp[:-20] + "\n" + "Residuals" + units)
                    if y_lims is not None:
                        axes[i, j].set_ylim(-y_lims.loc[xp, (feature, cyt)], 
                                              y_lims.loc[xp, (feature, cyt)])
                if i == len(rows) - 1:
                    axes[i, j].set_xlabel("Time")
                elif i == 0:
                    axes[i, j].set_title(pep)
        # Save the figure afterwards, with a title
        fig.suptitle(cyt)
        figlist[cyt] = fig
    return figlist

In [None]:
# Residuals (negative if the reconstruction is smaller than the true value)
df_resids_test = df_recon_test - df_wt_test
df_resids_train = df_recon_train - df_wt_train

# Find max of each cytokine across all peptides, etc. so it's the min, max of the plots
# Group on the "Data" index level (the axis argument of groupby is deprecated in recent pandas)
ylims_train = df_wt_train.groupby(level="Data").max()
ylims_test = df_wt_test.groupby(level="Data").max()

In [None]:
figlist = plot_residuals_percyto(df_resids_test, feature=feature, sharey=True, 
                                 datatype="absolute", y_lims=ylims_test)
for cyt in figlist.keys():
    legend = figlist[cyt].axes[-1].get_legend()
    #figlist[cyt].savefig("figures/reconstruction/cyto_reconstruction_HighMI-test_" + cyt + "_add-integrals_tanh.pdf", 
    #                     format="pdf", transparent=True, bbox_extra_artists=(legend,), bbox_inches='tight')
    plt.show()
plt.close()