Code and data used in The Great Repertoire Project

The Great Repertoire Project

This repository contains code and data used for our study of the baseline human antibody repertoire. Briefly, we performed ultra-deep sequencing of the antibody repertoires of 10 healthy, adult subjects (approximately 3 billion total antibody sequences). The Great Repertoire Project revealed a massively diverse repertoire with surprisingly high overlap between the repertoires of different subjects.

Code

The code used in this project is assembled into a series of Jupyter notebooks. There are two sets of notebooks: those containing code used for DATA PROCESSING and those containing code used to MAKE FIGURES. GitHub will render each of the notebooks, but the code cannot be executed from within GitHub. If you'd like to actually run the code contained in the notebooks, you must clone the repository.

NOTE: Whenever possible, the intermediate datasets required to run the code are included in this repository. However, many intermediate datasets are too large to be included. In such cases, links to the required datasets are provided in the appropriate notebook.

Datasets

We have generated several large datasets, in two primary groups: antibody sequences from healthy adult subjects, and synthetic antibody sequences generated using statistical models of V(D)J recombination.

Antibody sequencing data

Raw and processed datasets from each subject can be downloaded using the following links. Some of these datasets are quite large (the compressed raw FASTQs are roughly 100GB per subject, and the uncompressed JSON datasets range from ~100GB to nearly 1TB).
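Because the JSON datasets are far larger than typical system memory, they are best read as a stream rather than loaded whole. The following is a minimal sketch, assuming the files are line-delimited JSON (one annotated sequence record per line); that layout is an assumption and should be confirmed against the downloaded files. The function names `iter_records` and `count_records` are hypothetical helpers, not part of this repository.

```python
import json


def iter_records(path):
    """Yield one parsed record at a time from a line-delimited JSON file.

    ASSUMPTION: each non-empty line of the file is a complete JSON object.
    Only one line is held in memory at a time, so file size is not a limit.
    """
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def count_records(path):
    """Count records without building a list of them."""
    return sum(1 for _ in iter_records(path))
```

Any per-record filtering or tallying can be done inside a `for record in iter_records(path):` loop, which keeps memory use flat regardless of dataset size.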

For each subject, there are a total of 18 samples: 3 technical replicates of each of 6 biological replicates. Biological replicates refer to different aliquots of peripheral blood mononuclear cells (PBMCs), from which total RNA was separately isolated and processed. Thus, sequences or clonotypes found in multiple biological replicates are assumed to have independently occurred in different cells. Technical replicates refer to independent library preparations using the same aliquot of PBMC-derived RNA. In each of the above datasets, samples 1-6 are biological replicates. Samples 7-12 and 13-18 are technical replicates of samples 1-6.
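The sample numbering described above can be turned into a small lookup when grouping samples for analysis. This is a sketch only: it assumes the technical replicates are listed in the same order as samples 1-6 (so samples 7 and 13 derive from the same RNA aliquot as sample 1, samples 8 and 14 from the same aliquot as sample 2, and so on). That ordering is not stated explicitly here and should be verified. The function names are hypothetical.

```python
def replicate_ids(sample):
    """Map a sample number (1-18) to (biological_replicate, technical_replicate).

    ASSUMPTION: samples 7-12 and 13-18 follow the same order as samples 1-6.
    """
    if not 1 <= sample <= 18:
        raise ValueError("sample must be between 1 and 18")
    biological = (sample - 1) % 6 + 1   # 1..6, the PBMC aliquot
    technical = (sample - 1) // 6 + 1   # 1..3, the library preparation
    return biological, technical


def group_by_biological_replicate():
    """Return {biological_replicate: [sample numbers]} for all 18 samples."""
    groups = {}
    for sample in range(1, 19):
        biological, _ = replicate_ids(sample)
        groups.setdefault(biological, []).append(sample)
    return groups
```

Under this assumption, samples that share a biological replicate come from the same RNA aliquot, so only sequences seen across different biological replicates would count as independent observations.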

Due to technical issues, the sequence data for subject 326797 was spread across two HiSeq flowcells. Thus, the raw FASTQs and FASTQC results can be downloaded in two separate batches. Starting with the first processed dataset (UMI-corrected consensus FASTAs), reads from both flowcells were pooled.

Synthetic antibody sequences

We generated synthetic antibody sequences using IGoR. Two datasets of synthetic sequences are available. As with the repertoire sequencing datasets above, the annotated datasets are quite large (uncompressed, each exceeds 1TB in size).

Requirements

  • Python 3.3+ (although Python 2.7 may work for many or most notebooks, this has not been tested)
  • Jupyter Notebook

Additionally, each notebook may require additional third-party Python packages. Any notebook-specific requirements, as well as instructions for package installation with pip, are provided in each notebook.
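Before opening a notebook, it can be useful to confirm that the interpreter and packages are in place. The sketch below checks the Python version against the 3.3+ requirement and reports which top-level packages cannot be found. The helper names and any package names passed in are illustrative; each notebook lists its actual requirements. Note that `importlib.util.find_spec` is only available in Python 3.4 and later.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 3)  # minimum version stated in the requirements list


def python_ok(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


def missing_packages(names):
    """Return the top-level package names that cannot be located.

    Pass top-level names only (e.g. "numpy", not "numpy.linalg").
    """
    return [name for name in names if importlib.util.find_spec(name) is None]
```

For example, `missing_packages(["notebook"])` returns an empty list when the `notebook` package is installed, and `["notebook"]` otherwise.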

If you're new to Python, a great way to get started is to install the Anaconda Python distribution, which includes pip as well as a ton of useful scientific Python packages.