# Mutual information per time point

To run this notebook, you need:
- Pre-processed cytokine time series in the `data/processed/` folder.
- The input weights of a neural network and the min and max cytokine concentrations used to scale the data, in `data/trained-networks`.

Those files are available in the data repository hosted online, or you can generate them yourself from raw cytokine data using [`cytokine-pipeline`](https://github.com/soorajachar/antigen-encoding-pipeline).

By default, the code uses the following datasets (the last three are added to ensure E1 is present in the MI calculation):
- `HighMI_1-1.hdf`, `HighMI_1-2.hdf`, `HighMI_1-3.hdf`, `HighMI_1-4.hdf`  (4 replicates split in 4 files by processing code)
- `PeptideComparison_1.hdf`
- `PeptideComparison_2.hdf`
- `Activation_3.hdf`

You can change which datasets are used in the code below.


## Procedure
Here, we apply the mutual information estimator defined in (Kraskov et al., 2004) and (Ross, 2014) to compute the mutual information (MI) between peptide quality $Q$ and cytokines $\mathbf{X}$ as a function of time. At each time point, MI is computed by pooling the time points within a sliding time window of 3 hours, for better statistics. We use various quantities for the (vector) random variable $\mathbf{X}$: each individual cytokine, the vector of 5 cytokines (IFN-$\gamma$, IL-2, IL-17A, IL-6, TNF), each latent space variable (LS$_1$ or LS$_2$), and the two latent space variables combined in a vector (LS$_1$, LS$_2$).

Then, we compare to the mutual information between antigen quality and the parameters of the constant velocity model, fitted on latent space time courses as a way to summarize the entire kinetics of cytokines with a single vector of three real numbers ($v_0, t_0, \theta$).

We use a dataset (HighMI_1) which contains 4 replicates of the cytokine time series for each peptide at each concentration. This is the dataset shown in main figure 1. 
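
The pooling step of the procedure can be sketched as follows. This is only an illustration on toy data: the helper `pool_window` and the index layout (`Peptide`, `Replicate`, `Time`) are assumptions made for the example, and the notebook's actual windowing is implemented in `utils.mi_time_window`.

```python
import numpy as np
import pandas as pd

def pool_window(series, t_start, window=3):
    """Return (labels, values) for all samples with t_start <= Time < t_start + window.

    Hypothetical helper: every time point in the window is treated as one
    more sample of the same peptide.
    """
    t = series.index.get_level_values("Time")
    sel = series[(t >= t_start) & (t < t_start + window)]
    labels = sel.index.get_level_values("Peptide").to_numpy()
    return labels, sel.to_numpy()

# Toy data: 2 peptides x 2 replicates x 6 hourly time points
idx = pd.MultiIndex.from_product(
    [["N4", "E1"], [1, 2], range(1, 7)],
    names=["Peptide", "Replicate", "Time"],
)
toy = pd.Series(np.arange(len(idx), dtype=float), index=idx)

labels, values = pool_window(toy, t_start=1, window=3)
print(len(values))  # 2 peptides * 2 replicates * 3 time points = 12 samples
```

With a 3-hour window, each peptide contributes three times as many samples as at a single time point, which is why the number of nearest neighbors is scaled by the window length later in the notebook.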

In [None]:
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import os

import utils.custom_pandas as cpd

In [None]:
%matplotlib inline

In [None]:
plt.rcParams["figure.figsize"] = (2.25, 1.75)
plt.rcParams["axes.labelsize"] = 8.
plt.rcParams["legend.fontsize"] = 8.
plt.rcParams["axes.labelpad"] = 0.5
plt.rcParams["xtick.labelsize"] = 7.
plt.rcParams["ytick.labelsize"] = 7.
plt.rcParams["legend.title_fontsize"] = 8.
plt.rcParams["axes.titlesize"] = 8.
plt.rcParams["font.size"] = 8.

# Import data and project to latent space
The HighMI_1 replicates were split into four separate files by our processing pipeline, for compatibility with most earlier experiments, which had only one replicate per condition. Here, we recombine those files to rebuild the whole dataset.
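
As a minimal sketch of the recombination, passing a dictionary to `pd.concat` stacks the frames and records each key in a new outer index level. The toy frames and values below are made up for illustration; only the `names=["Data"]` convention is taken from this notebook.

```python
import pandas as pd

# Two toy "replicate files" sharing the same index structure
idx = pd.MultiIndex.from_product([["N4", "Q4"], [1, 2]], names=["Peptide", "Time"])
rep1 = pd.DataFrame({"IL-2": [0.1, 0.2, 0.3, 0.4]}, index=idx)
rep2 = pd.DataFrame({"IL-2": [0.5, 0.6, 0.7, 0.8]}, index=idx)

# The dictionary keys become the values of a new outermost level named "Data"
combined = pd.concat({"HighMI_1-1": rep1, "HighMI_1-2": rep2}, names=["Data"])
print(list(combined.index.names))
print(combined.shape)
```

Each replicate stays identifiable through the `Data` level, so it can later be selected or renamed without reloading the files.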

In [None]:
# Cytokine data
df_dict = {}
for fi in os.listdir(os.path.join("data", "processed")):
    if fi.startswith("HighMI_1-") and fi.endswith(".hdf"):
        df_dict[fi[:-4]] = pd.read_hdf(os.path.join("data", "processed", fi))

df_wt = pd.concat(df_dict, names=["Data"])
df_wt = df_wt.xs("100k", level="TCellNumber", drop_level=False)

In [None]:
peptides = ["N4", "Q4", "T4", "V4", "G4", "E1", "A2", "Y3", "A8"]
concentrations = ["1uM","100nM","10nM","1nM"]

In [None]:
minmaxfile = os.path.join("data", "trained-networks", "min_max-thomasRecommendedTraining.hdf")
df_min = pd.read_hdf(minmaxfile, key="df_min")
df_max = pd.read_hdf(minmaxfile, key="df_max")

cytokines = ["IFNg", "IL-17A", "IL-2", "IL-4", "IL-6", "IL-10", "TNFa"]
times = np.arange(1, 73)
print(cytokines)

In [None]:
# Project to latent space and scale
df = df_wt.unstack("Time").loc[:, ("integral", cytokines, times)].stack("Time")
df_conc = df_wt.unstack("Time").loc[:, ("concentration", cytokines, times)].stack("Time")
df = df.droplevel("Feature", axis=1)
df_conc = df_conc.droplevel("Feature", axis=1)

# Normalize
df_min = df.min()
df_max = df.max()
df = (df - df_min)/(df_max - df_min)

## Add E1 from other datasets
This peptide was not included in the HighMI_1 dataset because it consistently gives zero cytokine response, only measurement noise. Therefore, we import it from a few other datasets, since this null peptide category is important to get a proper estimate of mutual information. 
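
Because the E1 series come from other datasets, they are scaled below with the same minimum and maximum as the HighMI_1 data, so that all peptides live on a common scale. A minimal sketch of that idea, with made-up numbers and a hypothetical helper `scale_with_reference`:

```python
import pandas as pd

# Reference data that defines the scaling (toy values)
ref = pd.DataFrame({"IFNg": [0.0, 2.0, 4.0], "IL-2": [1.0, 3.0, 5.0]})
ref_min, ref_max = ref.min(), ref.max()

def scale_with_reference(df, ref_min, ref_max):
    """Min-max scale df column by column using externally supplied bounds."""
    return (df - ref_min) / (ref_max - ref_min)

# Data from another experiment, scaled with the reference bounds, not its own
null = pd.DataFrame({"IFNg": [0.4, 0.0], "IL-2": [1.0, 1.4]})
null_scaled = scale_with_reference(null, ref_min, ref_max)
print(null_scaled)
```

Scaling the null peptide with its own minimum and maximum would instead stretch pure measurement noise over the full [0, 1] range.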

In [None]:
# Import a few datasets containing E1
df_dict = {}
for fi in os.listdir(os.path.join("data", "processed")):
    if fi.startswith("Activation_3") and fi.endswith(".hdf"):
        df_dict[fi[:-4]] = (pd.read_hdf(os.path.join("data", "processed", fi))
            .xs("E1", level="Peptide", drop_level=False).xs("Naive", level="ActivationType", drop_level=True))
    elif fi.startswith("PeptideComparison_1") and fi.endswith(".hdf"):
        df_dict[fi[:-4]] = (pd.read_hdf(os.path.join("data", "processed", fi))
            .xs("E1", level="Peptide", drop_level=False))
    elif fi.startswith("PeptideComparison_2") and fi.endswith(".hdf"):
        df_dict[fi[:-4]] = (pd.read_hdf(os.path.join("data", "processed", fi))
            .xs("E1", level="Peptide", drop_level=False))

df_data_e1 = pd.concat(df_dict, names=["Data"])
df_data_e1 = df_data_e1.loc[:, (slice(None), cytokines)]

In [None]:
df_conc_e1 = df_data_e1.xs("concentration", level="Feature", axis=1)
df_integ_e1 = df_data_e1.xs("integral", level="Feature", axis=1)
df_integ_e1 = (df_integ_e1 - df_min)/(df_max - df_min)

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
df_conc = pd.concat([df_conc, df_conc_e1]).sort_index()

## Functions to compute MI over a sliding time window
The heavy lifting of the MI estimation is done by functions defined in ``utils.mi_time_window`` and ``utils.discrete_continuous_info``; see those files for the details. In a sentence, we concatenate time points of all series for each peptide over a short, sliding time window, and we estimate MI from those samples using our own Python implementation of the Kraskov estimator (Kraskov et al., 2004), which was translated from the Matlab version developed by (Ross, 2014) and significantly optimized for Python by us.
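
To make the idea concrete, here is a compact sketch of a nearest-neighbor MI estimator between a discrete label and a continuous vector, following the formula of (Ross, 2014). It is not the optimized code in `utils.discrete_continuous_info`: the function name `mi_discrete_continuous` is hypothetical, it assumes every class has more than `k` samples, and it does not handle duplicated points.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def mi_discrete_continuous(labels, values, k=3):
    """Illustrative estimate of I(label; value) in bits (Ross, 2014 formula)."""
    labels = np.asarray(labels)
    values = np.asarray(values, dtype=float)
    if values.ndim == 1:
        values = values[:, None]
    n = len(labels)
    radius = np.empty(n)
    n_label = np.empty(n)
    for lab in np.unique(labels):
        mask = labels == lab
        if mask.sum() <= k:
            raise ValueError("each class needs more than k samples")
        pts = values[mask]
        # k + 1 neighbors because each query point finds itself at distance 0
        dist, _ = cKDTree(pts).query(pts, k=k + 1)
        radius[mask] = dist[:, -1]   # distance to k-th neighbor of the same class
        n_label[mask] = mask.sum()
    full_tree = cKDTree(values)
    # m_i: points of any class within that distance, excluding the point itself
    m = np.array([len(full_tree.query_ball_point(values[i], radius[i])) - 1
                  for i in range(n)])
    m = np.maximum(m, k)             # guard against floating-point round-off
    mi_nats = digamma(n) - np.mean(digamma(n_label)) + digamma(k) - np.mean(digamma(m))
    return float(mi_nats / np.log(2))

rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 200)
separated = np.concatenate([rng.normal(0, 1, 200), rng.normal(50, 1, 200)])
unrelated = rng.normal(0, 1, 400)
mi_sep = mi_discrete_continuous(labels, separated, k=3)   # close to 1 bit
mi_null = mi_discrete_continuous(labels, unrelated, k=3)  # close to 0 bits
print(mi_sep, mi_null)
```

Two perfectly distinguishable, equally likely classes carry about 1 bit, while a variable unrelated to the label carries about 0 bits, up to estimator noise.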


In [None]:
# Our own Python implementation of the MI algorithm
from utils.discrete_continuous_info import discrete_continuous_info_fast
from utils.mi_time_window import compute_mi_timecourse

## Compute MI for individual cytokines and the vector of five cytokines

In [None]:
all_variables_dfs = {
    "all cytokines": df_conc
}
all_variables_dfs.update({c:df_conc[c] for c in cytokines})

In [None]:
all_variables_dfs.keys()

In [None]:
# Number of NN: 3 neighbors times length of time window.  
df_mi_time, max_mi = compute_mi_timecourse(all_variables_dfs, q="Peptide", overlap=False, 
                      window=3, knn=3*3, speed="fast")

In [None]:
# This is an unpolished version of main figure 1D of the antigen encoding paper (same data shown)
g = sns.relplot(data=df_mi_time.stack("Variable").reset_index(), x="Time", y=0, hue="Variable", kind="line", height=3)
#g.fig.savefig(os.path.join("figures", "capacity", "mi_vs_time_cytokines_HighMI_1.pdf"), transparent=True, 
#              bbox_inches="tight", bbox_extra_artists=(g.legend,))

## Save all results for further plotting
Note that the MI is probably slightly overestimated: this dataset has several replicates, but not hundreds, so the points are quite sparse in 5D space. For channel capacity, we will use a dataset with more replicates than this one, and we will use ballistic parameters in a lower-dimensional space, which is less affected by low sample numbers.
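
A useful reference when reading these estimates is that the MI between $Q$ and any variable cannot exceed the entropy of $Q$, which equals $\log_2$(number of peptides) when all peptides have the same number of samples. The sketch below, with the hypothetical helper `label_entropy_bits`, computes that entropy from a list of labels and shows that an unbalanced design gives a lower bound than the balanced one.

```python
import numpy as np

def label_entropy_bits(labels):
    """Shannon entropy (in bits) of the empirical label distribution."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

balanced = ["N4", "Q4", "T4", "V4"] * 10             # 10 samples per peptide
unbalanced = ["N4"] * 30 + ["Q4", "T4", "V4"] * 5    # N4 over-represented

h_balanced = label_entropy_bits(balanced)      # log2(4) = 2 bits
h_unbalanced = label_entropy_bits(unbalanced)  # less than 2 bits
print(h_balanced, h_unbalanced)
```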

In [None]:
# Append the theoretical maximal MI (entropy of Q) to the dataframe, for reference. 
# This is simply log_2(number of peptides). 
df_mi_time["MaxMI"] = np.nan
# Use a single .loc assignment: chained indexing may not modify the dataframe
df_mi_time.loc[df_mi_time.index[-1], "MaxMI"] = max_mi
df_mi_time

In [None]:
# Uncomment to save; data used for main figure 1E
# df_mi_time.to_hdf(os.path.join("results", "mi_time", "miStatistics-HighMI_1-all-cytokines.hdf"), key="df")

# MI estimation for latent space variables
LS$_1$ and LS$_2$ taken together preserve all information available in the five cytokines. 
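
The projection itself is a linear map: each row of scaled cytokine integrals is multiplied by the input weight matrix of the network, which has one column per latent variable. The sketch below uses random placeholder weights and toy data, and the cytokine column order is an assumption; the notebook loads the trained weights from `data/trained-networks`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
cyto = ["IFNg", "IL-17A", "IL-2", "IL-6", "TNFa"]

# Toy scaled cytokine integrals: 4 samples x 5 cytokines
scaled = pd.DataFrame(rng.random((4, 5)), columns=cyto)

# Placeholder weights (5 cytokines x 2 latent nodes), NOT the trained network
weights = rng.normal(size=(5, 2))

latent = pd.DataFrame(scaled.to_numpy() @ weights,
                      index=scaled.index, columns=["Node 1", "Node 2"])
print(latent.shape)  # (4, 2)
```

The column order of the data has to match the row order of the weight matrix, which is why the notebook first restricts the cytokine columns to those listed in the min/max file.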

In [None]:
mlp_coefs = np.load(os.path.join("data", "trained-networks", "mlp_input_weights-thomasRecommendedTraining.npy"))
minmaxfile = os.path.join("data", "trained-networks", "min_max-thomasRecommendedTraining.hdf")
df_min = pd.read_hdf(minmaxfile, key="df_min")
df_max = pd.read_hdf(minmaxfile, key="df_max")

# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
df2 = pd.concat([df, df_integ_e1])
df2 = cpd.xs_slice(df2, "Cytokine", df_min.index.get_level_values("Cytokine").unique().tolist(), axis=1)
# Rename Data to Replicate, add Data level
lvl_names = df2.index.names
df2.index = df2.index.set_names(["Replicate"]+lvl_names[1:])
df2 = df2.rename({a:str(i) for i, a in enumerate(df2.index.get_level_values("Replicate").unique())})
df2 = pd.concat({"HighMI_1": df2}, names=["Data"])
df2 = df2.sort_index()
df_proj = pd.DataFrame(np.dot(df2, mlp_coefs), index=df2.index, columns=["Node 1", "Node 2"])

In [None]:
all_variables_dfs_latent = {
    "LS1": df_proj["Node 1"],
    "LS2": df_proj["Node 2"],
    "2 LS": df_proj
}
df_mi_latent, max_mi_latent = compute_mi_timecourse(all_variables_dfs_latent, q="Peptide",
                       overlap=False, window=3, knn=3*3, speed="fast")
print(df_mi_latent.max(axis=0))

In [None]:
sns.relplot(data=df_mi_latent.stack("Variable").reset_index(), x="Time", y=0, hue="Variable", kind="line", height=2.)

# MI estimation for $v_0$, $t_0$, $\theta$
Fit the constant velocity parameters on each time series, then compute MI between that description of cytokine time kinetics and antigen quality. 

In [None]:
from ltspcyt.scripts.sigmoid_ballistic import return_param_and_fitted_latentspace_dfs
fit_vars = {"Constant velocity":["v0", "t0", "theta", "vt"], 
           "Sigmoid_freealpha":["a0", "tau0", "theta", "v1", "alpha", "beta"]}

In [None]:
e1_key = (slice(None), slice(None), slice(None), "E1")
# Add Gaussian noise (standard deviation 0.01) to the E1 rows of df2
df2.loc[e1_key, :] = df2.loc[e1_key, :] + 0.01*np.random.normal(size=df2.loc[e1_key].shape)

In [None]:
# Fitting
choice_model = "Constant velocity"
regul_rate = 1.0

# Here, we need to reject negative v2v1 slopes; this improves the constant velocity fit for mouse1-replicate4
ret = return_param_and_fitted_latentspace_dfs(df_proj, choice_model, reg_rate=regul_rate, reject_neg_slope=True)
df_params, df_compare, df_hess, df_v2v1 = ret

nparameters = len(fit_vars[choice_model])
peptides = [a for a in peptides if a in df_params.index.get_level_values("Peptide").unique()]

In [None]:
df_params

In [None]:
var_choice = fit_vars[choice_model][:3]
pep_palette_order = ["N4", "Q4", "T4", "V4", "G4", "E1", "A2", "Y3"]
palette = sns.color_palette(n_colors=len(pep_palette_order))
pep_palette = {pep:palette[i] for i, pep in enumerate(pep_palette_order)}
hue_order = [a for a in pep_palette_order if a in df_params.index.get_level_values("Peptide").unique()]
sns.pairplot(data=df_params.iloc[:, :4].reset_index(), hue="Peptide", hue_order=hue_order, 
             palette=[pep_palette.get(a) for a in hue_order], 
             vars=var_choice)

In [None]:
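# Remove clear outliers: drop V4 and E1 fits with theta >= pi/2.
# The product is evaluated before the comparison, so rows of other peptides
# give 0 < pi/2 and are kept.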
df_params = df_params.loc[df_params.index.isin(["V4"], level="Peptide")*df_params["theta"] < np.pi/2]
df_params = df_params.loc[df_params.index.isin(["E1"], level="Peptide")*df_params["theta"] < np.pi/2]

In [None]:
var_choice = fit_vars[choice_model][:3]
vals = df_params[var_choice].values
pep_map = {peptides[i]:i for i in range(len(peptides))}
target = df_params.index.get_level_values("Peptide").map(pep_map)
# Number of nearest neighbors: same as the number used per time point before (3), for a fair comparison
mi_v0t0theta = discrete_continuous_info_fast(target, vals, k=3, base=2)

In [None]:
# Result
print(mi_v0t0theta)  # bits