# **Analysis of adaptive immune repertoires**
<br>
## Simon D.W. Frost
### *University of Cambridge*

## Introduction

- What are adaptive immune receptor repertoires?
- Why study them?
- How can we study them?
  - Opportunities provided by next generation sequencing

## B vs. T cell receptors

- Similarities
  - Diversity in IGH and TRB repertoires is generated by recombination of V, D, and J germline gene segments
- Differences
  - Somatic hypermutation in BCRs
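
One simple way to quantify somatic hypermutation is the fraction of positions at which a BCR sequence differs from its assigned germline. The sketch below assumes the two sequences are already aligned to equal length; the function name `mutation_frequency` and the toy sequences are hypothetical, not taken from any particular tool.

```python
def mutation_frequency(observed: str, germline: str) -> float:
    """Fraction of comparable positions where observed differs from germline.

    Assumes the sequences are pre-aligned (equal length). Positions that are
    gaps ('-', '.') or ambiguous ('N') in either sequence are skipped.
    """
    if len(observed) != len(germline):
        raise ValueError("sequences must be aligned to the same length")
    skip = set("-.N")
    compared = 0
    mismatches = 0
    for obs, germ in zip(observed.upper(), germline.upper()):
        if obs in skip or germ in skip:
            continue
        compared += 1
        if obs != germ:
            mismatches += 1
    return mismatches / compared if compared else 0.0


# Toy example (made-up sequences): 2 mismatches over 11 comparable positions,
# because the position holding 'N' is excluded from the comparison.
germline = "CAGGTGCAGCTG"
observed = "CAGGTACAGNTA"
freq = mutation_frequency(observed, germline)
print(freq)
```

TCR sequences are expected to score close to zero on such a measure, apart from sequencing and PCR error.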

## Workflow

- Bioinformatics processing
  - Sequence cleaning etc.
  - Germline assignment
  - Identification of clones
- Statistical analysis
  - CDR3 length, amino acid properties
  - Diversity
  - BCRs: Somatic hypermutation
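
Two of the statistical summaries listed above can be sketched in a few lines. This is a minimal illustration, assuming CDR3 amino acid sequences have already been extracted and treating each distinct CDR3 as a "clone" (real pipelines usually define clones more carefully); the function names and sequences are made up for the example.

```python
import math
from collections import Counter


def cdr3_length_distribution(cdr3s):
    """Count how many CDR3 amino acid sequences have each length."""
    return Counter(len(s) for s in cdr3s)


def shannon_diversity(clone_counts):
    """Shannon entropy (natural log) of clone frequencies."""
    counts = [c for c in clone_counts if c > 0]
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts)


# Made-up CDR3 amino acid sequences, one per read.
cdr3s = ["CASSLGQYF", "CASSLGQYF", "CASSPDRGYEQYF", "CASRTGNTEAFF"]

lengths = cdr3_length_distribution(cdr3s)
clones = Counter(cdr3s)  # simplification: a clone is an identical CDR3
diversity = shannon_diversity(clones.values())

print(sorted(lengths.items()))
print(diversity)
```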

## Sequencing platforms

- Roche 454
  - Longer reads, but discontinued
- Illumina
  - Paired-end reads of up to 300 bp
    - Sufficient for characterising most of the V(D)J region
    - High throughput at (relatively) low cost
- PacBio
  - Can sequence amplicons spanning the variable and constant regions
  - Has been used for sequencing VH-VL amplicons using linked PCR

## Barcoding

- Barcodes used to identify samples run in multiplex (MIDs)
- Unique molecular identifiers (UMIs)
  - Random nucleotide tags intended to uniquely label distinct molecules
- Can be combined with microfluidics/single cell PCR to allow e.g. sequencing of paired VH/VL
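
To illustrate how UMIs are used downstream, the sketch below groups reads by UMI and takes a per-position majority vote within each group. It assumes the UMI is a fixed-length prefix of each read and that reads sharing a UMI have equal length; the function names and reads are hypothetical, and dedicated tools handle alignment, quality scores, and UMI errors that this toy version ignores.

```python
from collections import Counter, defaultdict


def consensus(seqs):
    """Majority-vote consensus; zip truncates to the shortest sequence."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*seqs))


def collapse_by_umi(reads, umi_length):
    """Group reads by their leading UMI and return one consensus per UMI."""
    groups = defaultdict(list)
    for read in reads:
        umi, insert = read[:umi_length], read[umi_length:]
        groups[umi].append(insert)
    return {umi: consensus(inserts) for umi, inserts in groups.items()}


# Made-up reads: a 4-nt UMI followed by a 5-nt insert.
reads = [
    "ACGT" + "TTGCA",
    "ACGT" + "TTGCA",
    "ACGT" + "TTGGA",  # one likely PCR/sequencing error at position 4
    "GGCC" + "AAGCT",
]
collapsed = collapse_by_umi(reads, umi_length=4)
print(collapsed)
```

Collapsing to one consensus per UMI both corrects sporadic errors and removes PCR duplicates, so counts reflect molecules rather than reads.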

## Sequence formats

- Depend on the sequencing platform used
- The NCBI Sequence Read Archive (SRA) stores reads in its own format, sra
- Most tools can take input as:
  - fastq
  - fasta
  - either uncompressed or compressed (gzipped)
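
When writing your own scripts, it helps to accept both plain and gzipped input transparently. A minimal sketch, assuming text-based fastq/fasta files; the helper name `open_maybe_gzipped` and the file names are hypothetical.

```python
import gzip
import os
import tempfile


def open_maybe_gzipped(path):
    """Open a text file for reading, decompressing on the fly if gzipped.

    Detection uses the two-byte gzip magic number rather than the file
    extension, so misnamed files are still handled.
    """
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "r")


# Demonstration: write the same record plain and gzipped, then read both back.
record = ">read1\nACGT\n"
contents = []
with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, "reads.fasta")
    zipped = os.path.join(tmp, "reads.fasta.gz")
    with open(plain, "w") as out:
        out.write(record)
    with gzip.open(zipped, "wt") as out:
        out.write(record)
    for path in (plain, zipped):
        with open_maybe_gzipped(path) as handle:
            contents.append(handle.read())

print(contents[0] == contents[1])
```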


## fastq

- Contains base calls with associated quality scores
- Paired-end datasets may interleave the two reads of each pair in one file, or store them in separate files

```
@SRR765688.43055 HQP9GKT01ANQEB length=386
NNNNNNNNNNNNNNNNNNNNCCGGGGAGTCTCTGAAAAATCT
+
IIIIIIIIIIIIIIIIIIIHHHCCCCIIIIIIII66666III
```
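
The quality line encodes one score per base as a printable character. The sketch below reads four-line records and decodes the scores, assuming the common Phred+33 encoding (older Illumina files may use a different offset); the function name and the toy record are hypothetical.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, qualities) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a trailing blank line)
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # separator line starting with '+'
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed record: {header!r}")
        # Assumption: Phred+33, i.e. score = ASCII code - 33.
        yield header[1:], sequence, [ord(ch) - 33 for ch in quality]


# Toy record, made up for illustration.
fastq_text = "@read1 example\nACGTN\n+\nII6C!\n"
records = list(read_fastq(io.StringIO(fastq_text)))
print(records)
```

Under this encoding `I` corresponds to a score of 40 and `6` to a score of 21, so runs of `6` in a quality line mark lower-confidence base calls.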

## fasta

- Contains base calls only

```
>SRR765688.43055 HQP9GKT01ANQEB length=386
NNNNNNNNNNNNNNNNNNNNCCGGGGAGTCTCTGAAAAATCT
```
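
Converting fastq to fasta amounts to keeping the header and sequence lines and dropping the qualities. A minimal sketch, assuming strict four-line fastq records with no wrapped sequences; the function name is hypothetical.

```python
def fastq_to_fasta(fastq_text):
    """Convert four-line FASTQ records to FASTA, discarding quality scores."""
    lines = fastq_text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    out = []
    for i in range(0, len(lines), 4):
        header, sequence = lines[i], lines[i + 1]
        if not header.startswith("@"):
            raise ValueError(f"malformed header: {header!r}")
        out.append(">" + header[1:])  # '@' becomes '>'
        out.append(sequence)
    return "\n".join(out) + "\n"


fasta_text = fastq_to_fasta("@r1 desc\nACGT\n+\nIIII\n")
print(fasta_text, end="")
```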

## Output formats

- VDJML
- Change-O
- vdjtools

## VDJML

![](https://vdjserver.org/vdjml/xsd_doc/1/xsd_dot.svg)

## Software

- Diverse and growing suite of software
- Various languages:
  - Python
  - R
  - Perl
  - C/C++
  - Java
- List at [omictools/rep-seq](https://omictools.com/rep-seq-category) curated by Mikhail Shugay

## Reproducibility

- Research should be *reproducible*
- Important for transparency in research
- Also useful in practice
  - Often have to rerun analyses
    - Expanded datasets, updated tools, etc.

## Why scripts are not enough

- Many tools ship with scripts that run them on example datasets
- Fine as demonstrations, but not well suited to research use
  - Lack documentation
  - Pipelines may break on other datasets
  - May want to combine different tools

## Jupyter notebooks

- Interactive notebook, interleaves
  - Notes
  - Code
  - Output (text, tables, images)
- Based on Python, but with multiple language 'kernels'
  - `bash_kernel`: shell commands
  - `IRkernel`: R commands
  - `python3`: Python commands
- 'Magics' allow mixing of languages inside a single notebook
- Can also generate slideshows

## R notebooks

- Although not covered here, notebooks are also available for R in the RStudio environment

## Workflows

- Like many bioinformatic analyses, analysis of repertoires often involves repetitive tasks
- May want to run an analysis from start to finish with different settings
- *Workflow tools* can help to automate this

## Snakemake

- Text-based workflow
  - Can include Python, R, and shell commands
- Uses file names to infer workflow
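
The idea of inferring a workflow from file names can be illustrated with a toy planner in plain Python. This is not Snakemake's implementation or syntax: the rule names, the file patterns, and the single `{sample}` wildcard are all hypothetical, chosen only to show how a requested output file can be traced back through the rules that produce it.

```python
# Each hypothetical rule maps an output pattern to the inputs it needs.
RULES = {
    "clean": ("{sample}.clean.fastq", ["{sample}.fastq"]),
    "assign": ("{sample}.vdj.tsv", ["{sample}.clean.fastq"]),
}


def match_output(pattern, filename):
    """Return the wildcard value if filename fits the pattern, else None."""
    prefix, suffix = pattern.split("{sample}")
    long_enough = len(filename) > len(prefix) + len(suffix)
    if long_enough and filename.startswith(prefix) and filename.endswith(suffix):
        return filename[len(prefix):len(filename) - len(suffix)]
    return None


def plan(target, existing):
    """Return the ordered list of (rule, output) jobs needed to build target."""
    if target in existing:
        return []  # already on disk, nothing to do
    for name, (output, inputs) in RULES.items():
        sample = match_output(output, target)
        if sample is None:
            continue
        jobs = []
        for inp in inputs:
            jobs += plan(inp.replace("{sample}", sample), existing)
        return jobs + [(name, target)]
    raise ValueError(f"no rule produces {target!r}")


jobs = plan("S1.vdj.tsv", existing={"S1.fastq"})
print(jobs)
```

Asking for the final table is enough: the planner works backwards to the cleaning step because its output name matches the input the assignment step needs.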

## Common Workflow Language (CWL)

- Tools and workflows are described in structured text
  - YAML
  - JSON
- Language agnostic
- Workflow is stated explicitly rather than inferred

## Example datasets

- TCR
  - Britanova et al. *J. Immunol.* (2016)
- BCR
  - Jiang et al. *Sci. Transl. Med.* (2013)

## Britanova et al. (2016)

- 65 healthy individuals aged 6–103 yrs

## Stability of TCR repertoire over time

![](britanova2016.F1.large.jpg)

## Age-related changes

![](britanova2016.F2.large.jpg)

## Repertoire changes

![](britanova2016.F3.large.jpg)

## Age and gender

![](britanova2016.F4.large.jpg)

## Jiang et al. (2013)

- Analyzed antibody repertoires from peripheral blood drawn from 17 human volunteers immunized with 2009 or 2010 seasonal influenza vaccines
- Three age groups:
  - children (8–17 yrs)
  - young adults (18–30 yrs)
  - elderly (70–100 yrs)
- Given trivalent inactivated influenza vaccine (TIV) or live attenuated influenza vaccine (LAIV; not given to the elderly group)

## Isotype distribution

![](jiang2013.F1.large.jpg)

## Predicted influenza-specific lineages

![](jiang2013.F2.large.jpg)

## Age-related lineages

![](jiang2013.F3.large.jpg)

## Interlineage structure of IgGs in visit 2 PBMCs

![](jiang2013.F4.large.jpg)

## Outline

- Day 1
  - Obtaining NGS data from the NCBI Sequence Read Archive
  - Data cleaning with pRESTO
- Day 2
  - Snakemake workflows
- Day 3
  - Analysis