Antigen encoding from high dimensional cytokine dynamics – theory

Repository with all the code necessary for modelling cytokine dynamics in latent space, generating synthetic cytokine data by reconstruction from the latent space, and computing the channel capacity of cytokine dynamics for antigen quality. The code generates all the results needed to reproduce the main and supplementary figures related to the theoretical parts of the paper:

Sooraj R. Achar#, François X. P. Bourassa#, Thomas J. Rademaker#, Angela Lee, Taisuke Kondo, Emanuel Salazar-Cavazos, John S. Davies, Naomi Taylor, Paul François, and Grégoire Altan-Bonnet. "Universal antigen encoding of T cell activation from high dimensional cytokine data", submitted, 2021. (#: these authors contributed equally)

All cytokine time series data necessary to run the code is included in the Github repository. Also included are the weights of the neural network that produces the latent space used throughout the paper, and a few other parameters (e.g., antigen functional EC50s). More neural networks can be trained, and more cytokine data processing and fitting can be done, using the antigen-encoding-pipeline user interface, also hosted on Github.

Installation

Compilation of the chancapmc C module

You need to run the Python setup script: first navigate to the chancapmc/ folder, then execute the script and move the built library (.so file) back into the chancapmc/ folder:

cd chancapmc/
python setup_chancapmc.py build_ext --inplace
mv build/lib*/chancapmc*.so .

Then, if you want, try running the test file for the Python interface:

python test_chancapmc.py

Unit tests in C are also available in the script unittests.c. Compile and execute them according to the instructions given in the header of the file.

More details can be found in the Github repository where this module is hosted separately: https://github.com/frbourassa/chancapmc.

Downloading data

Data is tracked in the git repository and is downloaded along with the code when the repository is cloned. There are starting data files in four folders:

  • data/initial/: Contains the raw cytokine time series data.
  • data/LOD/: Contains limits of detection used to process some files (not essential; the code can run without it).
  • data/misc/: Contains plotting parameters and EC50 values for antigens.
  • data/trained-networks/: Contains the projection matrix and normalization factors used to construct the latent space, as well as the other weight matrices of the default neural network.
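
For orientation, the latent space is obtained by normalizing processed cytokine time courses and applying the projection matrix stored in data/trained-networks/. The sketch below illustrates that projection only; the file names, the min-max normalization scheme, and the matrix shapes are assumptions, not the repository's actual files (see ltspcyt for the real implementation).

# Minimal sketch of projecting processed cytokine features into the latent space.
# File names, normalization scheme, and shapes are ASSUMPTIONS for illustration;
# see data/trained-networks/ and the ltspcyt module for the actual ones.
import numpy as np
import pandas as pd

P = np.load("data/trained-networks/projection_matrix.npy")    # hypothetical name; assumed shape (n_features, 2)
mins = np.load("data/trained-networks/norm_mins.npy")         # hypothetical normalization factors
maxs = np.load("data/trained-networks/norm_maxs.npy")

df = pd.read_hdf("data/processed/example_dataset.hdf")        # processed time series (hypothetical file)
X = (df.to_numpy() - mins) / (maxs - mins)                    # rescale each cytokine feature
latent = X @ P                                                # latent-space coordinates, one row per time point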

Data preprocessing

After cloning the git repository along with the data (or after unzipping the cytokine data files (HDF5 format) into data/initial/), run the script run_first_prepare_data.py. It will save cleaned-up versions of the raw dataframes in data/final/, then process all cytokine time series (log transformation, smoothing spline interpolation, time integration), and save the processed time series in data/processed/.
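
To illustrate what this processing involves, here is a minimal, self-contained sketch of the three steps applied to a single synthetic cytokine time series. It is not the project's implementation (which lives in ltspcyt), and the sampling times and values are made up.

# Sketch of the processing steps (log transformation, smoothing spline interpolation,
# time integration) on one synthetic cytokine time series. NOT the project's code.
import numpy as np
from scipy.interpolate import UnivariateSpline

times = np.array([0.0, 1.0, 2.0, 4.0, 8.0, 12.0, 24.0, 48.0, 72.0])     # hours (hypothetical sampling)
conc = np.array([1.0, 3.0, 10.0, 40.0, 90.0, 120.0, 80.0, 30.0, 10.0])  # concentration (a.u., made up)

# 1. Log transformation (small offset so zero concentrations remain finite)
log_conc = np.log10(conc + 1.0)

# 2. Smoothing spline interpolation onto a regular time grid
spline = UnivariateSpline(times, log_conc, s=0.1)
t_grid = np.linspace(times[0], times[-1], 200)
smooth = spline(t_grid)

# 3. Time integration (cumulative trapezoidal integral of the smoothed series)
integral = np.concatenate(
    [[0.0], np.cumsum(0.5 * (smooth[1:] + smooth[:-1]) * np.diff(t_grid))])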


IMPORTANT

Run run_first_prepare_data.py the first time you download the code and data. Otherwise no analysis script will run.


Suggested order in which to run the code

As a general rule, always run scripts from the top folder of the repository (i.e., the antigen_encoding_theory/ folder, unless you renamed it), not from its subfolders.

Some scripts and Jupyter notebooks depend on outputs saved to disk by other code files. The first time you use this repository, we suggest executing them in the following order. Afterwards, once the outputs are saved on disk, it becomes easier to move from one code file to another.

  1. run_first_prepare_data.py to process data.
  2. fit_latentspace_model.ipynb to fit model parameters on latent space trajectories. Ideally, run it three times, once for each version of the model.
  3. reconstruct_cytokines_fromLSdata.ipynb to reconstruct cytokine time series from data projected in latent space.
  4. reconstruct_cytokines_fromLSmodel_pvalues.ipynb to use the latent space model as a cytokine model that can fit cytokines themselves, via reconstruction.
  5. generate_synthetic_data.ipynb to generate new cytokine time series by sampling model parameters and reconstructing the corresponding cytokines.
  6. compute_channel_capacity_HighMI_3.ipynb to compute the channel capacity $C$ of antigen encoding, using interpolation of multivariate Gaussian distributions in parameter space and our chancapmc module (Blahut-Arimoto algorithm with Monte Carlo integration for continuous input-output pdfs; a toy sketch of this algorithm is given below, after the list).
  7. theoretical_antigen_classes_from_capacity_HighMI_3.ipynb to subdivide the continuum of antigen qualities into $2^{C}$ "theoretical" antigen classes.

Once these main codes are run and their outputs saved, secondary scripts can be run more freely, and lastly the plotting functions:

  8. Secondary calculations in more_main_scripts/. Some will save further output files used by plotting scripts.
  9. Finally, run the plotting scripts in main_plotting_scripts/.
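
For readers curious about the calculation in step 6, the sketch below shows a Blahut-Arimoto iteration with Monte Carlo integration for a channel with a discrete input Q and multivariate Gaussian conditional outputs P(Y|Q). It is an illustration only: the means, covariances, and sample sizes are made up, and the project's actual, optimized implementation is the compiled chancapmc module.

# Toy Blahut-Arimoto iteration with Monte Carlo integration for a channel with a
# discrete input Q and multivariate Gaussian conditional outputs P(Y|Q).
# Illustration only; the real implementation is the compiled chancapmc module.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Made-up conditional distributions P(Y|Q) for three "antigen classes" in a 2D output space
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0]), np.array([4.0, 3.0])]
covs = [0.5 * np.eye(2), 0.8 * np.eye(2), 0.6 * np.eye(2)]
n_inputs = len(means)

p = np.full(n_inputs, 1.0 / n_inputs)  # initial input distribution over Q
n_samples = 3000                       # Monte Carlo samples per conditional

for _ in range(100):
    # D[i] = KL( P(Y|Q=i) || sum_j p_j P(Y|Q=j) ), estimated by Monte Carlo
    D = np.zeros(n_inputs)
    for i in range(n_inputs):
        y = rng.multivariate_normal(means[i], covs[i], size=n_samples)
        log_cond = multivariate_normal.logpdf(y, means[i], covs[i])
        log_marg = np.logaddexp.reduce(
            [np.log(p[j]) + multivariate_normal.logpdf(y, means[j], covs[j])
             for j in range(n_inputs)], axis=0)
        D[i] = np.mean(log_cond - log_marg)
    mutual_info_nats = np.dot(p, D)    # I(Q;Y) for the current input distribution
    p *= np.exp(D)                     # Blahut-Arimoto update of the input distribution
    p /= p.sum()

print("Estimated channel capacity: {:.2f} bits".format(mutual_info_nats / np.log(2)))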

We give more details on these scripts and notebooks in the CONTENTS.md file. Most files also list their dependencies in their headers. In addition, the flowchart below illustrates these dependencies and the folders through which the scripts share files.

Diagram of the code structure

The following diagram represents the main dependencies between most of the scripts in this project. Scripts are colored by theme (neural networks: pale orange, model fitting: orange, reconstruction: yellow, channel capacity: green, data processing: pink). Indented scripts need other scripts to be run first, as indicated by the arrows on the left or right, annotated with the folders where intermediate results are stored. Scripts that produce figures included in the main or supplementary text are indicated by arrows going to the sub-folders in figures/.

Code structure diagram

Requirements

The Python code was tested on Mac (macOS Catalina 10.15.7) and Linux (Linux 3.2.84-amd64-sata x86_64) with the Anaconda 2020.07 distribution and the following versions of important packages:

  • Python 3.7.6
  • numpy 1.19.2
  • scipy 1.5.2
  • pandas 1.2.0
  • matplotlib 3.3.2
  • seaborn 0.11.1
  • scikit-learn 0.23.2

The following additional Python packages are necessary only for specific scripts in the project (most of the code does not need them):

  • tensorflow 2.0.0 (macOS) or 2.3.0 (Linux)
  • wurlitzer 2.0.1
  • channel-capacity-estimator 1.0.1 (see Github page)

The exact Python configuration used is included in data/python_environment.yml.
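
If you use conda, this file should let you recreate a matching environment, for example:

conda env create -f data/python_environment.yml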

Moreover, a C compiler is necessary to build the module chancapmc (C code interfaced with the Python C API). The module was tested with compilers Apple clang 11.0.3 (macOS) and GNU gcc 4.9.4 (Linux).

Note on sub-modules

The code modules ltspcyt (latent space cytokines) and chancapmc (channel capacity Monte Carlo) contain the core functions for data processing, latent space building, model fitting (ltspcyt), and channel capacity calculation (chancapmc).

The ltspcyt module is essentially a collection of the core functions behind the GUI of antigen-encoding-pipeline, with added functions and classes for cytokine reconstruction. Notably, it includes a customized version of Scipy's curve_fit function that can fit vector-valued functions of a scalar variable and can add L1 regularization of the fitted parameters.
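
To illustrate the idea behind that customized fit (not its actual implementation), the sketch below fits a made-up vector-valued function of time by least squares with an L1 penalty on the parameters, using scipy.optimize.least_squares; the model, data, and penalty weight are invented for this example.

# Sketch of fitting a vector-valued function of a scalar variable (time) by least
# squares with an L1 penalty on the parameters, in the spirit of the customized
# curve_fit in ltspcyt. Model, data, and penalty weight are made up.
import numpy as np
from scipy.optimize import least_squares

def model(t, params):
    # Toy 2D trajectory: node 1 rises and saturates, node 2 follows on a different timescale
    a, tau1, tau2 = params
    n1 = a * (1.0 - np.exp(-t / tau1))
    n2 = a * (1.0 - np.exp(-t / tau2)) - 0.5 * n1
    return np.stack([n1, n2], axis=1)          # shape (n_times, 2): vector-valued output

def residuals(params, t, data, lam):
    res = (model(t, params) - data).ravel()    # flatten the vector-valued residuals
    # L1 penalty encoded as extra residuals: their squares sum to lam * |params|_1
    penalty = np.sqrt(lam * np.abs(params))
    return np.concatenate([res, penalty])

t = np.linspace(0.0, 72.0, 30)
true_params = np.array([3.0, 8.0, 20.0])
rng = np.random.default_rng(1)
data = model(t, true_params) + 0.1 * rng.standard_normal((t.size, 2))

fit = least_squares(residuals, x0=[1.0, 5.0, 5.0], args=(t, data, 0.01))
print("Fitted parameters:", fit.x)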

chancapmc is also hosted separately on Github: https://github.com/frbourassa/chancapmc. It is licensed under the more permissive BSD 3-Clause License.

This module provides functions to calculate the channel capacity between a discrete input variable Q and a continuous (vectorial) output variable Y whose conditional distributions P(Y|Q) are multivariate normal. It may be extended to other multivariate distributions in the future.

License information

This repository is licensed under the GNU GPLv3.0 because one of the scripts (estimate_channel_capacity_cce.ipynb) uses the channel-capacity-estimator package from Grabowski et al., 2019, which is also licensed under GPLv3.0. Other dependencies are licensed under the BSD 3-Clause License, which is compatible with GPLv3.0.
