The Great Repertoire Project
This repository contains code and data used for our study of the baseline human antibody repertoire. Briefly, we performed ultra-deep sequencing of the antibody repertoires of 10 healthy adult subjects (approximately 3 billion total antibody sequences). The Great Repertoire Project revealed a massively diverse repertoire with surprisingly high overlap between the repertoires of different subjects.
The code used in this project is assembled into a series of Jupyter notebooks. There are two sets of notebooks: those containing code used for DATA PROCESSING and those containing code used to MAKE FIGURES. GitHub will render each of the notebooks, but the code cannot be executed from within GitHub. If you'd like to actually run the code contained in the notebooks, you must clone the repository.
NOTE: Whenever possible, the intermediate datasets required to run the code are included in this repository. However, many intermediate datasets are too large to be included. In such cases, links to the required datasets are provided in the appropriate notebook.
We have generated several large datasets, in two primary groups: antibody sequences from healthy adult subjects, and synthetic antibody sequences generated using statistical models of V(D)J recombination.
Antibody sequencing data
Raw and processed datasets from each subject can be downloaded using the following links. Some of these datasets are quite large (the compressed raw FASTQs are roughly 100GB per subject, and the uncompressed JSON datasets range from ~100GB to nearly 1TB).
For each subject, there are a total of 18 samples: 3 technical replicates of each of 6 biological replicates. Biological replicates refer to different aliquots of peripheral blood mononuclear cells (PBMCs), from which total RNA was separately isolated and processed. Thus, sequences or clonotypes found in multiple biological replicates are assumed to have independently occurred in different cells. Technical replicates refer to independent library preparations using the same aliquot of PBMC-derived RNA. In each of the above datasets, samples 1-6 are biological replicates. Samples 7-12 and 13-18 are technical replicates of samples 1-6.
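The sample numbering described above can be sketched in a few lines of Python. This is only an illustration, not code from the notebooks: the function names `replicate_ids` and `group_by_biological_replicate` are hypothetical, and the sketch assumes that order is preserved within each block of six (that is, sample 7 and sample 13 are the technical replicates of sample 1, sample 8 and sample 14 of sample 2, and so on). Check the notebooks before relying on that ordering.

```python
def replicate_ids(sample):
    """Map a sample number (1-18) to (biological_replicate, technical_replicate).

    ASSUMPTION: ordering is preserved within each block of six samples,
    so samples 1, 7 and 13 all derive from the same PBMC aliquot.
    """
    if not 1 <= sample <= 18:
        raise ValueError("sample number must be between 1 and 18")
    biological = (sample - 1) % 6 + 1   # which PBMC aliquot (1-6)
    technical = (sample - 1) // 6 + 1   # which library preparation (1-3)
    return biological, technical


def group_by_biological_replicate():
    """Return {biological_replicate: [sample numbers]} for all 18 samples."""
    groups = {}
    for sample in range(1, 19):
        biological, _ = replicate_ids(sample)
        groups.setdefault(biological, []).append(sample)
    return groups


if __name__ == "__main__":
    for biological, samples in sorted(group_by_biological_replicate().items()):
        print("biological replicate", biological, "-> samples", samples)
```

Grouping samples this way matters when counting shared sequences: under the reasoning above, only sequences seen in samples from different biological replicates would be treated as independent occurrences, whereas sequences shared only among technical replicates of one aliquot may come from the same cell.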
Due to technical issues, the sequence data for subject 326797 was spread across two HiSeq flowcells. Thus, the raw FASTQs and FastQC results can be downloaded in two separate batches. Starting with the first processed dataset (UMI-corrected consensus FASTAs), reads from both flowcells were pooled.
Synthetic antibody sequences
We generated synthetic antibody sequences using IGoR. Two datasets of synthetic sequences are available. As with the repertoire sequencing datasets above, the annotated datasets are quite large (uncompressed, each exceeds 1TB in size).
- Ten batches of 100M synthetic sequences, generated with IGoR's default V(D)J recombination model:
- Ten batches of 100M synthetic sequences, generated with subject-specific recombination models, inferred by IGoR using 500,000 unmutated antibody sequences from each subject:
- Python 3.3+ (although Python 2.7 may work for many or most notebooks, this has not been tested)
- Jupyter Notebook
Additionally, each notebook may require additional third-party Python packages. Any notebook-specific requirements, as well as instructions for package installation with pip, are provided in each notebook.
If you're new to Python, a great way to get started is to install the Anaconda Python distribution, which includes pip as well as a ton of useful scientific Python packages.