# Generating cytokine trajectories from ballistic parameters
The goal is to build a generative model of cytokine dynamics. We want to show that we can create realistic time courses just by picking a few parameter values, corresponding to different ligand qualities, ligand quantities, and T cell numbers.

In particular, here, we fit kernel density estimates (KDEs) in model parameter space, sample from them, and compute the resulting latent and full-space trajectories. We then give those trajectories to Thomas for re-classification. We also compare them to trajectories from the data sets used to build the KDEs. 


## Main steps to follow

1. Import selected WT data
    1. Select HighMI dataset to train the reconstruction algorithm
    2. Select datasets to fit parameter value KDEs
2. Train the reconstruction algorithm
3. Fit the sigmoid-ballistic model to all datasets in latent space
4. Fit KDEs on the parameter space
5. Sample from the KDEs to generate model (sigmoid-ballistic) latent space trajectories 
6. Project those curves back to cytokine concentration space, as well as those used to train the reconstruction
7. Try also reconstructing all data sets used for parameter space, as a proof that we can uniformize them? 
    1. And look again at their re-projection in latent space after uniformization, as a proof that they look more alike? 
8. Prepare a nice dataframe of synthetic, processed cytokine time courses for classification by Thomas' neural network. 

## Code structure

The following useful functions are in separate Python scripts, for clarity of the notebook. 

- Scripts to import and process data: ltspcyt.scripts.neural_network
- Scripts to train a reconstruction model: ltspcyt.scripts.reconstruction
- Scripts to fit the sigmoid-ballistic model: ltspcyt.scripts.sigmoid_ballistic and fitting_functions
    - With free $\alpha$. 

Then, the notebook will use those functions as we follow the steps outlined just above. 

# 1. Importing scripts and data
Same kind of code as in other notebooks

In [None]:
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import pickle
from time import perf_counter  # For timing
import pandas as pd
import os
import seaborn as sns

from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV
from sklearn.base import BaseEstimator
from sklearn.preprocessing import StandardScaler, MinMaxScaler

In [None]:
# Scripts for data importation
from ltspcyt.scripts.adapt_dataframes import set_standard_order, sort_SI_column
from ltspcyt.scripts.neural_network import import_WT_output
from ltspcyt.scripts.latent_space import import_mutant_output

# Scripts for reconstruction, using distinct functions for distinct methods. 
from ltspcyt.scripts.reconstruction import (train_reconstruction, plot_recon_true, 
    compute_latent_curves, fit_param_distrib_kdes, ScalerKernelDensity, sample_from_kdes)

# Scripts for curve fitting
from ltspcyt.scripts.sigmoid_ballistic import (
    return_param_and_fitted_latentspace_dfs, sigmoid_conc_full_freealpha, 
    ballistic_sigmoid_freealpha)

In [None]:
%matplotlib inline

## 1.1 Import all data
Remove unwanted levels, normalize the values. 
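
As a minimal sketch of the normalization and projection done in the cells below, the toy example here rescales integrals with a min and max, rescales concentrations with the range only, and projects both to two latent nodes without an intercept. All arrays (`x_min`, `x_max`, `P`, the toy time courses) are hypothetical values chosen for illustration, not the trained network's.

```python
import numpy as np

# Toy integral time courses for 3 hypothetical cytokines (rows: time points)
times = np.arange(0.0, 5.0)
integral = np.outer(times**2, [1.0, 2.0, 0.5]) + 3.0
concentration = np.gradient(integral, times, axis=0)  # d(integral)/dt

# Hypothetical training min and max of the integrals, one per cytokine
x_min = np.array([3.0, 3.0, 3.0])
x_max = np.array([19.0, 35.0, 11.0])

# Integrals: offset and scale. Concentrations: scale only, since the
# constant x_min vanishes when differentiating the normalized integral.
integral_norm = (integral - x_min) / (x_max - x_min)
conc_norm = concentration / (x_max - x_min)

# Hypothetical projection matrix (2 latent nodes x 3 cytokines), no intercept
P = np.array([[0.5, -0.2, 0.1],
              [0.3, 0.4, -0.6]])
latent_integral = integral_norm @ P.T
latent_conc = conc_norm @ P.T
```

Because differentiation and projection are both linear, the latent concentration is the time derivative of the latent integral, which is why the same `df_max - df_min` factor is applied to every feature.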

In [None]:
df_wt = import_WT_output()

In [None]:
df_min, df_max = pd.read_pickle(os.path.join("data", "trained-networks", "min_max-thomasRecommendedTraining.pkl"))
df_min, df_max = df_min.xs("integral", level="Feature"), df_max.xs("integral", level="Feature")

# Projection matrix
P = np.load(os.path.join("data", "trained-networks", "mlp_input_weights-thomasRecommendedTraining.npy")).T

In [None]:
peptides = ["N4", "Q4", "T4", "V4", "G4", "E1", "A2", "Y3", "A8", "Q7"]
concentrations = ["1uM", "100nM", "10nM", "1nM"]
cytokines = df_min.index.get_level_values("Cytokine")
times = np.arange(0, 69)

In [None]:
# Select only the desired cytokines and times (the T cell number is selected in the next cell)
df_wt = df_wt.unstack("Time").loc[:, (slice(None), cytokines, times)].stack("Time")

# Rescale and project each feature, but do not offset (don't use MLP's intercepts)
proj_dfs = []
feat_keys = ["integral", "concentration", "derivative"]
cols = pd.Index(["Node 1", "Node 2"], name="Node", copy=True)
print(P.T)

for typ in feat_keys:
    # Rescale with the training min and max
    if typ == "integral":
        df_wt[typ] = (df_wt[typ] - df_min)/(df_max - df_min)
    else:   # for conc and deriv, the constant rescaling term disappears. 
        df_wt[typ] = df_wt[typ]/(df_max - df_min)
    df_temp = pd.DataFrame(np.dot(df_wt[typ], P.T), index=df_wt[typ].index, columns=cols)
    proj_dfs.append(df_temp)
df_proj = pd.concat(proj_dfs, axis=1, names=["Feature"], keys=feat_keys)
del proj_dfs, cols, feat_keys  # temporary variables

In [None]:
# Remove different T cell numbers
tcellnum = "100k"
df_wt = df_wt.xs(tcellnum, level="TCellNumber", axis=0, drop_level=True)
df_proj = df_proj.xs(tcellnum, level="TCellNumber", axis=0, drop_level=True)

## 1.2 Select training and testing datasets
Here, we don't remove A2 and Y3 from the reconstruction optimization (training) data, because the goal is to have reconstructions as good as possible and generate realistic cytokine trajectories from the model, and including as many peptides as possible helps. We do not have to show that we can reconstruct new peptides not previously seen in training: we already did that in the notebooks `reconstruct_cytokines_fromLSmodel_pvalues.ipynb` and `reconstruct_cytokines_fromLSdata.ipynb`. 

In [None]:
# Keep multiple datasets to populate latent space better
# Mix datasets with old and new protocols, because IL-6 is low in new ones, for instance. 
subset_exp = [
    "PeptideComparison_OT1_Timeseries_18", 
    "PeptideComparison_OT1_Timeseries_19", 
    "NewPeptideComparison_OT1_Timeseries_20", 
    "PeptideComparison_OT1_Timeseries_21", 
    "TCellNumber_OT1_Timeseries_7",
    "CD25MutantTimeSeries_OT1_Timeseries_2",     
    "Activation_Timeseries_1", 
    "ITAMDeficient_OT1_Timeseries_9", 
    "PeptideTumorComparison_OT1_Timeseries_1", 
]
# subset_exp2 = ["HighMI_1-" + str(i) for i in range(1, 5)]

df_wt_train = df_wt.loc[subset_exp]
df_proj_train = df_proj.loc[subset_exp]
df_wt_train = df_wt_train.loc[df_wt_train.index.isin(concentrations, level="Concentration")]
df_proj_train = df_proj_train.loc[df_proj_train.index.isin(concentrations, level="Concentration")]

df_wt_kde = df_wt.loc[subset_exp]
df_proj_kde = df_proj.loc[subset_exp]
df_wt_kde = df_wt_kde.loc[df_wt_kde.index.isin(concentrations, level="Concentration")]
df_proj_kde = df_proj_kde.loc[df_proj_kde.index.isin(concentrations, level="Concentration")]

# 2. Train the reconstruction function
Also compute the reconstructed cytokines for both the test and training data sets
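
To make the idea concrete, here is a minimal sketch of a quadratic reconstruction from latent features to cytokine space. It is a stand-in, not the `train_reconstruction` implementation: it assumes a generic scikit-learn `PolynomialFeatures` plus `LinearRegression` pipeline in place of the `mixed_quad` model, and all data are synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Toy latent inputs: 2 "concentration" nodes and 2 "integral" nodes
lat_conc = rng.normal(size=(200, 2))
lat_integ = rng.uniform(0.0, 5.0, size=(200, 2))

# Saturate the integrals with tanh, normalized by their mean per node
norm_factors = lat_integ.mean(axis=0)
X = np.hstack([lat_conc, np.tanh(lat_integ / norm_factors)])

# Toy "cytokine" targets with 3 outputs: linear part plus one cross term
W = rng.normal(size=(4, 3))
Y = X @ W + (X[:, :1] * X[:, 1:2]) * np.array([0.5, -1.0, 2.0])

# Degree-2 polynomial expansion followed by ordinary least squares
recon = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
recon.fit(X, Y)
r2 = recon.score(X, Y)
Y_hat = recon.predict(X)
```

Since the toy targets are exactly quadratic in the features, the fit is essentially perfect here; on real data the score returned by the training function is the quantity to inspect.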

In [None]:
# Find the reconstruction matrix, based on reconstructing the chosen feature (here, concentrations)
feature = "concentration"
model_type = "mixed_quad"

modelargs = {"which_to_square":[0, 1]}

# Add some extra nonlinear features. 
# Here: tanh of the integrals, normalized by their mean in each latent node
tanh_norm_factors = df_proj_train["integral"].mean(axis=0)
print(tanh_norm_factors)

df_proj_train = pd.concat([df_proj_train["concentration"], np.tanh(df_proj_train["integral"] / tanh_norm_factors)], 
                           keys=["concentration", "tanh integral"], names=["Feature"], axis=1)

In [None]:
pipe, score = train_reconstruction(df_proj_train, df_wt_train, feature=feature, 
                                   method=model_type, model_args=modelargs, do_scale_out=False)
print(score)

print(pipe[model_type].regressor_.Q)

## Remark
We could also skip this step, and just import a pre-trained set of reconstruction coefficients, or in my case, a pickled reconstruction pipeline. 

# 3. Fit the ballistic model to train data
This is standard: we fit the integrals and rescale time by $\tilde{t} = 20$ h. 
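
The effect of rescaling time on the parameter ranges can be illustrated with a toy fit. This is only a sketch: `toy_integral` is a hypothetical saturating curve, not the sigmoid-ballistic model, and the fit uses plain `scipy.optimize.curve_fit` without the regularization applied in the notebook.

```python
import numpy as np
from scipy.optimize import curve_fit

tscale = 20.0  # hours, same role as the notebook's time scale
t_hours = np.arange(0.0, 69.0)
t_resc = t_hours / tscale

def toy_integral(tp, amp, rate):
    """Hypothetical saturating curve, NOT the sigmoid-ballistic model."""
    return amp * (1.0 - np.exp(-rate * tp))

# Noise-free toy data with a rate of 0.05 per hour, i.e. 1.0 per unit of t'
y = toy_integral(t_hours, 2.0, 0.05)

# Fitting against t' keeps both parameters of order one
popt, pcov = curve_fit(toy_integral, t_resc, y, p0=[1.0, 0.5])
amp_fit, rate_fit = popt
rate_per_hour = rate_fit / tscale  # convert back to physical units
```

Parameters fitted against $t'$ are expressed in units of $\tilde{t}$, so any rate has to be divided by `tscale` before it is interpreted per hour.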

In [None]:
# Choice of fitting hyperparameters
fit_vars={"Constant velocity":["v0","t0","theta","vt"],"Constant force":["F","t0","theta","vt"],
         "Sigmoid":["a0", "t0", "theta", "v1", "gamma"], 
         "Sigmoid_freealpha":["a0", "t0", "theta", "v1", "alpha", "beta"]}
fit = "Sigmoid_freealpha"
regul_rate = 0.4
tscale = 20.  # Rescaling of time for nicer parameter ranges
name_specs = "{}20_reg{}".format(fit, str(round(regul_rate, 2)).replace(".", ""))

# Fit the integrals
start_time = perf_counter()

ret = return_param_and_fitted_latentspace_dfs(
    df_proj_kde.xs("integral", level="Feature", axis=1), 
    fit, reg_rate=regul_rate, time_scale=tscale)
df_params, df_compare, df_hess, df_v2v1 = ret

end_t = perf_counter()
print("Time to fit: ", end_t - start_time)
del start_time, end_t

nparameters = len(fit_vars[fit])
print(df_hess.median())
# The concentrations in df_compare should be good, because they are dN_i / dt computed numerically
# so we don't have to worry about the magnitude of a_0, v_i. However, we want to generate the curves ourselves
# with our equations, so we call the n_i function and take care of scaling as below. 

### Visual inspection of the fits

In [None]:
dataset = subset_exp[-1]  # Select the data set to plot here
tcellnum = "100k"
df_compare_sel = df_compare.xs(dataset, level="Data", axis=0)
print(df_compare_sel.index.names)
df_compare_sel.columns.names = ["Node"]
data=df_compare_sel.loc[(peptides,concentrations,slice(None),slice(None),"concentration"),:]
h=sns.relplot(data=data.stack().reset_index(),x="Time",y=0,kind="line",sort=False,
            hue="Peptide",hue_order=peptides,
            col="Concentration",col_order=concentrations, row="Node",
            style="Processing type", height=3.25)
#h.fig.tight_layout()
# h.fig.savefig(os.path.join("figures", "fit", "concentrations_{}_{}.pdf".format(name_specs, dataset)))
plt.show()
plt.close()

In [None]:
#df_params_sel = df_params.xs(tcellnum, level="TCellNumber", axis=0)
pep_order = [a for a in peptides if a in df_params.index.get_level_values("Peptide").unique()]
h = sns.pairplot(data=df_params.reset_index(), vars=["a0", "t0", "theta", "v1", "alpha", "beta"], 
                 hue="Peptide", hue_order=pep_order)
legend = h.legend

#h.fig.savefig(os.path.join("figures", "fit", "pairplot_{}_selectdata.pdf".format(name_specs)), 
#    transparent=True, bbox_extra_artists=(legend,), bbox_inches='tight')
plt.show()
plt.close()

# 4. Fit KDEs to the parameter distributions and sample from them
This will then allow us in 5) to sample from those distributions and generate completely synthetic time courses. 

We need to make sure that all parameters have a comparable scale; otherwise, the bandwidth may be far too large for some parameters and produce unrealistic samples. This happens for $v_1$, which has a scale 10x smaller than $a_0$: with a shared bandwidth, sampled $v_1$ values can be far too large. So instead of bare KDEs, we use a pipeline that starts with a standard scaler, pre-fit on all data, where mins and maxs are the fitting bounds on the parameters. 
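
The scaling issue can be reproduced on toy data. The sketch below assumes a plain `StandardScaler` fitted on the samples themselves (not the bounds-based scaler described above, and not the `ScalerKernelDensity` class) followed by a Gaussian `KernelDensity`; the parameter values and the bandwidth are made up for illustration.

```python
import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Toy fitted parameters: column 0 plays the role of a0, column 1 of v1 (10x smaller)
params = np.column_stack([rng.normal(2.0, 0.5, size=300),
                          rng.normal(0.2, 0.05, size=300)])

# Standardize so a single isotropic bandwidth suits both parameters
scaler = StandardScaler().fit(params)
kde = KernelDensity(kernel="gaussian", bandwidth=0.3)
kde.fit(scaler.transform(params))

# Sample in scaled space, then map back to the original parameter units
samples = scaler.inverse_transform(kde.sample(1000, random_state=0))

# For contrast: same bandwidth applied directly to the unscaled parameters
kde_raw = KernelDensity(kernel="gaussian", bandwidth=0.3).fit(params)
samples_raw = kde_raw.sample(1000, random_state=0)
```

Comparing `samples[:, 1].std()` with `samples_raw[:, 1].std()` shows how the unscaled KDE smears the small-scale parameter over a range several times wider than the data.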

In [None]:
# Fit KDEs
print(df_params.columns)
dict_kdes, v2v1_kde = fit_param_distrib_kdes(df_params[fit_vars[fit]], df_v2v1, group_lvls=["Peptide"])
# Also, get a KDE of v2/v1 ratios, sample it for each peptide
print(dict_kdes)

In [None]:
df_params_synth = sample_from_kdes(dict_kdes, fit_vars[fit], fit, {a:4 for a in dict_kdes.keys()}, seed=130695)

In [None]:
#df_params_sel = df_params.xs(tcellnum, level="TCellNumber", axis=0)
pep_order = [a for a in peptides if a in df_params.index.get_level_values("Peptide").unique()]

h = sns.pairplot(data=df_params_synth.reset_index(), vars=["a0", "t0", "theta", "v1"], 
                 hue="Peptide", hue_order=pep_order)
legend = h.legend

#h.fig.savefig(os.path.join("figures", "fits", "pairplot_{}_selectdata.pdf".format(name_specs)), 
#    transparent=True, bbox_extra_artists=(legend,), bbox_inches='tight')
plt.show()
plt.close()

In [None]:
# Final slopes sample
ser_v2v1_synth = pd.Series(v2v1_kde.sample(len(df_params_synth.index))[:, 0], 
                          index=df_params_synth.index, name="v2v1")

# 5. Compute ballistic curves for sampled parameters
This part is the trickiest: we fitted $N_i(t')$, where $t' = t/\tilde{t}$ and $\tilde{t} = 20$ h (the time scale). Now, we want $n_i(t) = \frac{d N_i}{dt} = \frac{d t'}{dt} \frac{d N_i(t')}{d t'} = \frac{1}{\tilde{t}} n_i(t', a_0', \ldots)$, where $n_i(t', a_0', \ldots)$ is the formal function $n_i(t)$ called with $t'$ and the parameters fitted for $N_i(t')$, instead of $t$: same functional form, different scale of variables. We compensate for this by dividing by $\tilde{t}$. This is taken care of in the function `compute_latent_curves`. 
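
The chain-rule factor can be checked numerically. In this sketch, `N_resc` and `n_resc` are a hypothetical toy curve and its analytical derivative in rescaled time, standing in for the fitted model functions; only the $1/\tilde{t}$ bookkeeping is the point.

```python
import numpy as np

tscale = 20.0  # hours

def N_resc(tp, a=1.5, tau=1.2):
    """Hypothetical integral curve in rescaled time t' (toy, not the real model)."""
    return a * np.tanh(tp / tau)

def n_resc(tp, a=1.5, tau=1.2):
    """Analytical derivative dN/dt' of the toy curve above."""
    return a / tau / np.cosh(tp / tau) ** 2

t_hours = np.linspace(0.0, 68.0, 6801)

# Concentration in physical time: the chain rule gives the 1/tscale factor
n_physical = n_resc(t_hours / tscale) / tscale

# Numerical check: differentiate N(t / tscale) directly with respect to t
n_numeric = np.gradient(N_resc(t_hours / tscale), t_hours)
max_err = np.max(np.abs(n_numeric - n_physical))
```

Omitting the division by `tscale` would overestimate every concentration by a factor of 20, which is easy to miss because the curve shapes stay identical.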
 

In [None]:
# Create a new df_compare, by concatenation.
df_latent_synth = compute_latent_curves(df_params_synth, ser_v2v1_synth, tanh_norm_factors, times,
    model="Sigmoid_freealpha", tsc=tscale)

In [None]:
print(df_params_synth.loc["N4", "t0"])

In [None]:
h=sns.relplot(data=df_latent_synth.xs(feature, level="Feature", axis=1).stack().reset_index(), 
            x="Time", y=0, kind="line", sort=False, hue="Peptide", hue_order=peptides,
            col="Replicate", row="Node")
plt.show()
plt.close()

# 6. Reconstruct cytokines from generated curves

In [None]:
df_recon_synth = pd.DataFrame(pipe.predict(df_latent_synth), index=df_latent_synth.index, 
                             columns=df_wt_kde.xs(feature, axis=1, level="Feature", drop_level=False).columns)

In [None]:
#df_recon_synth = pd.concat([df_recon_synth], keys=["Synthetic"], names=["Data"]+df_recon_synth.index.names)
# Rename the "Replicate" level to "Concentration", then relabel its values with concentration names
df_recon_synth.index = df_recon_synth.index.set_names("Concentration", level="Replicate")
df_recon_synth.index = df_recon_synth.index.set_levels(["1uM", "100nM", "10nM", "1nM"], level="Concentration")

# 7. Compare the generated cytokines to the data

In [None]:
#figlist = plot_recon_true(df_recon_synth, df_recon_synth, feature=feature, sharey=True, do_legend=False, 
#                          palette=pep_palette, pept=peptides)
dset = subset_exp[1]
df_both = pd.concat([df_wt_kde.xs(dset, level="Data", axis=0), df_recon_synth], 
                    axis=0, keys=["HighMI", "Synth"], names=["Data"])

with sns.plotting_context("notebook", font_scale=0.75):
    h = sns.relplot(data=df_both.stack().reset_index(),x="Time",y="concentration", size="Concentration",
                kind="line",sort=False, hue="Peptide", hue_order=pep_order, style="Data", 
                style_order=["Synth", "HighMI"],
                row="Cytokine", col="Peptide", col_order=["N4", "A2", "Y3", "Q4", "T4", "V4"], 
               height=2.5, aspect=1)

#h.fig.savefig(os.path.join("figures", "reconstruction", "synthetic_data_selectdata_tanh.pdf"), transparent=True, 
#              bbox_inches="tight", bbox_extra_artists=(h.legend,))
plt.show()
plt.close()

## Saving useful results
- The dictionary of fitted ScalerKernelDensity instances: will need to import the ScalerKernelDensity class and StandardScaler before reading the pickled object. 
- The Pipeline used for reconstruction. Will need to import the QuadraticRegression class before reading the pickled object. 
- The scaling coefficients in the tanh (pd.Series). 
- The sampled parameter dataframe (pd.DataFrame)
- The reconstructed cytokines from the sampled dataframe (pd.DataFrame). 
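
As a sketch of the pickling round trip described in this list, the example below saves and reloads a toy dictionary of plain scikit-learn `KernelDensity` objects. The file name and toy data are hypothetical; for the real objects, the custom classes (`ScalerKernelDensity`, `QuadraticRegression`) would have to be imported in the reading session in the same way `KernelDensity` is here.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.neighbors import KernelDensity  # class must be importable when unpickling

rng = np.random.default_rng(2)

# Toy dictionary of fitted KDEs keyed by peptide name
toy_kdes = {pep: KernelDensity(bandwidth=0.5).fit(rng.normal(size=(50, 2)))
            for pep in ["N4", "Q4"]}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "toy_kde_dict.pkl")  # hypothetical file name
    with open(path, "wb") as hd:
        pickle.dump(toy_kdes, hd)
    with open(path, "rb") as hd:
        loaded = pickle.load(hd)

# The reloaded estimators give the same log-densities as the originals
point = np.zeros((1, 2))
same = all(np.allclose(toy_kdes[k].score_samples(point),
                       loaded[k].score_samples(point)) for k in toy_kdes)
```

Pickle stores a reference to the class, not its code, so loading fails with an `AttributeError` or `ModuleNotFoundError` if the defining module cannot be found.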

In [None]:
# Clip reconstructed cytokines (if we judge they are reasonable above) to remove slightly negative values
df_recon_synth.clip(lower=0.0, inplace=True)

In [None]:
option = "HighMI_1" if np.all([a.startswith("HighMI_1") for a in subset_exp]) else "selectdata"
folder = os.path.join("results", "reconstruction")
# Pipeline and KDEs
with open(os.path.join(folder, "scalerkde_dict_sigmoid_freealpha_{}.pkl".format(option)), "wb") as hd:
    pickle.dump(dict_kdes, hd)
with open(os.path.join(folder, "v2v1_kde_sigmoid_freealpha_{}.pkl".format(option)), "wb") as hd:
    pickle.dump(v2v1_kde, hd)

with open(os.path.join(folder, "quadratic_tanh_pipeline_{}.pkl".format(option)), "wb") as hd:
    pickle.dump(pipe, hd)
tanh_norm_factors.to_hdf(os.path.join(folder, "tanh_norm_factors_integrals_{}.hdf".format(option)), key="tanh_norm")

# Generated parameters
df_params_synth.to_hdf(os.path.join(folder, "df_params_synth_sigmoid_freealpha_{}.hdf".format(option)), key="df_params_synth")
ser_v2v1_synth.to_hdf(os.path.join(folder, "ser_v2v1_synth_{}.hdf".format(option)), key="ser_v2v1_synth")

# Generated data (clipped to remove negative values)
df_recon_synth.to_hdf(os.path.join(folder, "df_recon_synth_sigmoid_freealpha_{}.hdf".format(option)), key="df_recon_synth")
df_latent_synth.to_hdf(os.path.join(folder, "df_latent_synth_sigmoid_freealpha_{}.hdf".format(option)), key="df_latent_synth")