
Running the Example Pipeline

This page serves to familiarize new users with the basic flow of running the ImmuneDB pipeline. Example input FASTQ files are provided which contain human B-cell heavy chain sequences.

Commands are marked as being run either in the Docker container or on the host.

To begin, run the Docker container :ref:`as documented <running-the-container>`:

Metadata Specification

Before ImmuneDB can be run, metadata must be specified for each input file. For this example, one has already been created for you. To learn how to create a metadata file for your own data, see :ref:`Creating a Metadata Sheet`.

ImmuneDB Instance Creation

Next, we create a database for the data with:

This creates a new database named example_db and stores its configuration in /share/configs/example_db.json.

Identifying or Importing Sequences

The first step of the pipeline is to annotate sequences and store the resulting data in the newly created database. To do so, the immunedb_identify command is used. It requires that V and J germline sequences be specified in two separate FASTA files. The Docker image provides human and mouse IGH, TRA, and TRB germlines in $HOME/germlines.

For this example, there are two provided input files in /example along with the requisite metadata.tsv file which you can view with:

Given this, run the immunedb_identify command:

Sequence Collapsing

ImmuneDB determines the uniqueness of a sequence both at the sample and subject level. For the latter, immunedb_collapse is used to find sequences that are the same except at positions that have an N. Thus, the sequences ATNN and ANCN would be collapsed.
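The collapsing rule can be illustrated with a minimal Python sketch. This is not ImmuneDB's implementation; the function name `can_collapse` is hypothetical, and the sketch assumes the two sequences are already aligned and of equal length.

```python
def can_collapse(seq_a: str, seq_b: str) -> bool:
    """Return True if two sequences agree at every position where
    neither of them has an ambiguous base (N).

    Assumption: sequences of different lengths are never collapsed.
    """
    if len(seq_a) != len(seq_b):
        return False
    # A position is compatible if the bases match or either one is N.
    return all(a == b or "N" in (a, b) for a, b in zip(seq_a, seq_b))


print(can_collapse("ATNN", "ANCN"))  # True: every difference involves an N
print(can_collapse("ATNN", "AGCN"))  # False: T and G conflict at position 2
```

The first call reproduces the example from the text: the only positions where ATNN and ANCN differ contain an N in one of the two sequences.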

To collapse sequences, run:

Clonal Assignment

After sequences are assigned V and J genes, they can be clustered into clones with the immunedb_clones command, by default based on CDR3 amino acid (AA) similarity. This command takes a number of arguments, which should be reviewed before use.

There are three ways to create clones: based on CDR3 AA similarity, based on exact CDR3 NT identity (for T cells), and a lineage-based method. For this example we'll use the similarity-based method with default parameters:

This creates clones in which all sequences share the same V gene, the same J gene, and (by default) at least 85% CDR3 AA identity.
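As a rough illustration of the membership criteria, the following Python sketch checks whether two sequences could belong to the same clone. It is not ImmuneDB's code: the field names (`v_gene`, `j_gene`, `cdr3_aa`) and both functions are hypothetical, and it assumes identity is the fraction of matching positions between CDR3s of equal length. How pairs are then grouped into clones is not covered here.

```python
def cdr3_identity(a: str, b: str) -> float:
    """Fraction of matching positions between two CDR3 AA strings.

    Assumption: CDR3s of different lengths have zero identity.
    """
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def same_clone(seq_a: dict, seq_b: dict, min_identity: float = 0.85) -> bool:
    """Apply the three criteria: same V gene, same J gene, similar CDR3."""
    return (
        seq_a["v_gene"] == seq_b["v_gene"]
        and seq_a["j_gene"] == seq_b["j_gene"]
        and cdr3_identity(seq_a["cdr3_aa"], seq_b["cdr3_aa"]) >= min_identity
    )


# Made-up example records with 20-residue CDR3s.
ref = {"v_gene": "IGHV3-23", "j_gene": "IGHJ4", "cdr3_aa": "CARDYYGSGSYYFDYWGQGT"}
near = {"v_gene": "IGHV3-23", "j_gene": "IGHJ4", "cdr3_aa": "CARDYYGSGSYYFDYWGQGS"}
far = {"v_gene": "IGHV3-23", "j_gene": "IGHJ4", "cdr3_aa": "CARAAAASGSYYFDYWGQGT"}

print(same_clone(ref, near))  # one mismatch in 20 residues (95% identity)
print(same_clone(ref, far))   # four mismatches in 20 residues (80% identity)
```

With the default threshold of 0.85, the first pair passes and the second does not.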

Statistics Generation

Two sets of statistics can be calculated in ImmuneDB:

  • Clone Statistics: For each clone and sample combination, the number of unique and total sequences that appear, as well as the mutations from the germline.
  • Sample Statistics: Distribution of sequence and clone features on a per-sample basis, including V and J usage, nucleotides matching the germline, copy number, V length, and CDR3 length. All of these are calculated with and without outliers, and both including and excluding partial reads.

These are calculated with the immunedb_clone_stats and immunedb_sample_stats commands, which must be run in that order.
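To make the clone statistics concrete, here is a small Python sketch of counting unique and total sequences per clone and sample combination. It is only an illustration under stated assumptions: the row layout and the function `clone_sample_counts` are hypothetical, "unique" is taken to mean the number of distinct sequences, and "total" is taken to mean the sum of their copy numbers.

```python
from collections import defaultdict

# Hypothetical rows: (clone_id, sample_id, sequence, copy_number).
rows = [
    (1, "sample_A", "ACGT", 5),
    (1, "sample_A", "ACGA", 2),
    (1, "sample_B", "ACGT", 1),
    (2, "sample_A", "TTGC", 3),
]


def clone_sample_counts(rows):
    """Count distinct sequences and summed copies per (clone, sample) pair."""
    sequences = defaultdict(set)
    totals = defaultdict(int)
    for clone_id, sample_id, sequence, copies in rows:
        key = (clone_id, sample_id)
        sequences[key].add(sequence)
        totals[key] += copies
    return {
        key: {"unique": len(sequences[key]), "total": totals[key]}
        for key in sequences
    }


stats = clone_sample_counts(rows)
print(stats[(1, "sample_A")])  # {'unique': 2, 'total': 7}
```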

Selection Pressure (Optional)

Warning

Selection pressure calculations are time-consuming, so you can skip this step if time is limited.

Selection pressure of clones can be calculated with Baseline. To do so, run:

Note that this process is relatively slow and may take some time to complete.

Clone Trees (Optional)

Lineage trees for clones are generated with the immunedb_clone_trees command. The only currently supported method is neighbor-joining as provided by Clearcut.

Among others, the --min-mut-copies parameter allows for mutations to be omitted if they have not occurred at least a specified number of times. This can be useful to correct for sequencing error.
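The idea behind such a threshold can be sketched in Python as follows. This is a simplified illustration rather than ImmuneDB's implementation: the mutation representation and the function `filter_mutations` are hypothetical, and each observation of a mutation is assumed to count once.

```python
from collections import Counter

# Hypothetical mutations observed across a clone's sequences, written as
# (position, germline base, observed base).
observed = [
    (45, "A", "G"), (45, "A", "G"), (45, "A", "G"),
    (102, "C", "T"),
    (210, "G", "A"), (210, "G", "A"),
]


def filter_mutations(mutations, min_copies):
    """Keep only mutations seen at least `min_copies` times."""
    counts = Counter(mutations)
    return {mut: n for mut, n in counts.items() if n >= min_copies}


kept = filter_mutations(observed, min_copies=2)
print(kept)
```

With a threshold of 2, the mutation at position 102 is dropped because it was seen only once, which is the kind of isolated change that sequencing error tends to produce.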

Web Interface

ImmuneDB has a web interface to interact with a database instance. Running this can be slightly complicated, but the Docker image contains a helper script to simplify the process:

You can then navigate to http://localhost:8080.