In [39]:
%load_ext rpy2.ipython

# Germline assignment

## Introduction

- Most analyses work at the level of a clonotype
- Identifying which germline genes gave rise to a rearranged BCR/TCR is usually the first step in the analysis
- Complicated by:
  - Sequencing error
  - Incomplete reference sequences
  - Evolutionary similarity between alleles in a family
  - Deletions
  - N- and P- nucleotides
  - Somatic hypermutation in the context of BCRs
- D genes are particularly difficult to assign, as they are short and often heavily trimmed
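
As a minimal sketch of why D assignment is hard: once trimming leaves only a few D nucleotides in the junction, that fragment may occur in more than one germline D gene. The gene names and sequences below are toy values made up for illustration, not IMGT reference data, and exact substring matching is a simplification of what an aligner does.

```python
# Toy germline D segments (hypothetical, for illustration only).
TOY_D_GERMLINES = {
    "toyD1": "GGGACAGGGGGC",
    "toyD2": "GGGACTAGCGGGGGGG",
}


def d_candidates(fragment, germlines):
    """Return the names of germline D genes that contain the fragment exactly."""
    return sorted(name for name, seq in germlines.items() if fragment in seq)


# A short G-rich fragment is compatible with both toy genes ...
ambiguous = d_candidates("GGGGG", TOY_D_GERMLINES)
# ... whereas a fragment covering a distinguishing position matches only one.
unique = d_candidates("ACAGG", TOY_D_GERMLINES)
print(ambiguous, unique)
```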

## Software

- A growing number of software packages perform germline assignment
  - Similarity-based
    - IgBLAST, IgGraph, IMGT/V-QUEST
  - Hidden Markov models
    - iHMMune-align, partis, repgenHMM
  - Phylogenetic approaches
    - IgSCUEAL

## IgBLAST

- Highly tuned version of the BLAST algorithm for sequence similarity
- Fast
- Reasonably accurate for TCRs and BCRs with low mutational load
  - Less so for highly mutated BCRs

## Reference datasets

- IgBLAST requires a database of germline sequences against which query sequences are compared
- IMGT restricts dissemination of germline datasets
  - Germline sets usually have to be downloaded separately

## Postprocessing

- IgBLAST's output is not easy to work with directly
- Several postprocessors convert it into more convenient formats

## Change-O

- Both a *package* and a *format*
- Package:
  - Python package
  - Parses output from IgBLAST and IMGT/HighV-QUEST
  - Can also define clones and reconstruct germline sequences
- Format:
  - Simple tabular format
    - Easy to analyse and process
  - Specifies names for particular fields
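
Because the format is plain tab-delimited text with named columns, it can be read with Python's standard library alone. The sketch below assumes the column names that appear in the MakeDb.py output in this notebook (`SEQUENCE_ID`, `V_CALL`, `J_CALL`); the two rows and the helper `count_calls` are made up for illustration.

```python
import csv
import io
from collections import Counter

# Toy Change-O style table (hypothetical rows; real files have many more columns).
TOY_TABLE = (
    "SEQUENCE_ID\tV_CALL\tJ_CALL\n"
    "read1\tTRBV27*01\tTRBJ1-1*01\n"
    "read2\tTRBV27*01\tTRBJ2-7*01\n"
)


def count_calls(handle, field):
    """Count how often each value of `field` occurs in a tab-delimited table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row[field] for row in reader)


v_counts = count_calls(io.StringIO(TOY_TABLE), "V_CALL")
print(v_counts.most_common())
```

For a real file, an open file handle can be passed in place of the `io.StringIO` object.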

## Workflow

- Obtain reference sequences
- Generate IgBLAST database
- Convert FASTQ to FASTA
- Run IgBLAST
- Parse the output of IgBLAST
- Generate clones

## Obtaining germline sequences

- Germline sequences can be obtained from IMGT
- There is also a community-based effort by the AIRR Community
  - Work in progress

## Generating IgBLAST database

In [1]:
%%bash
# V-segment database
perl ./edit_imgt_file.pl IMGT_Human_TRBV.fasta > database/human_trb_v
makeblastdb -parse_seqids -dbtype nucl -in database/human_trb_v



Building a new DB, current time: 10/31/2016 17:09:29
New DB name:   /home/simon/Projects/aairr16-working/slides/database/human_trb_v
New DB title:  database/human_trb_v
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 142 sequences in 0.00876689 seconds.


In [2]:
%%bash
# D-segment database
perl ./edit_imgt_file.pl IMGT_Human_TRBD.fasta > database/human_trb_d
makeblastdb -parse_seqids -dbtype nucl -in database/human_trb_d
# J-segment database
perl ./edit_imgt_file.pl IMGT_Human_TRBJ.fasta > database/human_trb_j
makeblastdb -parse_seqids -dbtype nucl -in database/human_trb_j



Building a new DB, current time: 10/31/2016 17:09:36
New DB name:   /home/simon/Projects/aairr16-working/slides/database/human_trb_d
New DB title:  database/human_trb_d
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 3 sequences in 0.000468016 seconds.


Building a new DB, current time: 10/31/2016 17:09:36
New DB name:   /home/simon/Projects/aairr16-working/slides/database/human_trb_j
New DB title:  database/human_trb_j
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 16 sequences in 0.000876904 seconds.
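
The Perl step before each `makeblastdb` call reformats the IMGT FASTA files. As a rough Python sketch of that kind of cleanup, the function below assumes the allele name is the second `|`-delimited header field and that `.` marks an IMGT alignment gap; it is an illustration under those assumptions, not a port of `edit_imgt_file.pl`. The accession and sequence in the toy record are placeholders.

```python
def simplify_imgt_fasta(lines):
    """Yield FASTA lines with short headers and ungapped sequences.

    Assumption: the allele name is the second '|'-delimited header field
    and '.' marks an alignment gap.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            fields = line[1:].split("|")
            # Fall back to the whole header if it has no '|' separators.
            yield ">" + (fields[1] if len(fields) > 1 else fields[0])
        elif line:
            yield line.replace(".", "")


# Hypothetical record: the accession and sequence are placeholders.
toy = [">X00000|TRBV27*01|Homo sapiens|F|V-REGION\n", "gaagcc...caagtg\n"]
cleaned = list(simplify_imgt_fasta(toy))
print("\n".join(cleaned))
```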


## Convert FASTQ to FASTA

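- Most pipelines keep sequence data in FASTQ format, but IgBLAST takes FASTA
- These Biopython snippets first rewrite the read headers so that the UMI becomes the sequence ID and the read count becomes a `CONSCOUNT` annotation, then convert FASTQ to FASTA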

In [31]:
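from Bio import SeqIO

def reheader(records):
    """Rewrite headers of the form 'MIG UMI:<umi>:<count>' as '<umi>|CONSCOUNT=<count>'."""
    for record in records:
        new_id = record.description.replace("MIG UMI:", "")
        record.id = new_id.replace(":", "|CONSCOUNT=")
        # Clear the other header fields so that only the new ID is written.
        record.description = ""
        record.name = ""
        yield record

# Context managers close both files even if parsing fails part-way.
with open("A2-i131.fastq") as inhandle, open("A2-i131_reheader.fastq", "w") as outhandle:
    SeqIO.write(reheader(SeqIO.parse(inhandle, "fastq")), outhandle, "fastq")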

In [32]:
!head A2-i131_reheader.fastq

@GTCATTTAGCATGCTG|CONSCOUNT=2
TCCTGGAGTCGCCCAGCCCCAACCAGACCTCTCTGTACTTCTGTGCCAGCAGTTTAGAGGGGTACACTGAAGCTTTCTTTGGACAAGGCACCAGACTCAC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@TTGCGTTCGTTCTAAT|CONSCOUNT=3
TGAGCAACATGAGCCCTGAAGACAGCAGCATATATCTCTGCAGCGTCGTTACTAAGGACAGGGAAGAGACCCAGTACTTCGGGCCAGGCACGCGGCTCCT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@TATAATATCGTAACGT|CONSCOUNT=7
GATCCAGCCTGCAAAGCTTGAGGACTCGGCCGTGTATCTCTGTGCCAGCAGCTCCGGATACACCGGGGAGCTGTTTTTTGGAGAAGGCTCTAGGCTGACC


In [33]:
from Bio import SeqIO
SeqIO.convert("A2-i131_reheader.fastq","fastq","A2-i131.fasta","fasta")

429994
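
The header rewriting can be checked in isolation, without Biopython or the input file. This small sketch assumes the original headers look like `MIG UMI:<umi>:<count>`, which is what the two `replace` calls in the reheadering cell imply; the function name `umi_header_to_id` is made up for illustration.

```python
def umi_header_to_id(description):
    """Turn 'MIG UMI:<umi>:<count>' into '<umi>|CONSCOUNT=<count>'."""
    return description.replace("MIG UMI:", "").replace(":", "|CONSCOUNT=")


new_id = umi_header_to_id("MIG UMI:GTCATTTAGCATGCTG:2")
print(new_id)
```

Note that every remaining `:` is replaced, so a header with additional colon-separated fields would need a stricter parser.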

## Running IgBLAST

- The custom tabular output format `7 std qseq sseq btop` is chosen so that the results can be parsed with Change-O

In [34]:
%%bash
igblastn \
    -germline_db_V database/human_trb_v \
    -germline_db_D database/human_trb_d \
    -germline_db_J database/human_trb_j \
    -auxiliary_data optional_file/human_gl.aux \
    -domain_system imgt -ig_seqtype TCR -organism human \
    -outfmt '7 std qseq sseq btop' \
    -query A2-i131.fasta \
    -out A2-i131.fmt7 \
    -num_threads 2
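
When several samples need the same treatment, it can help to build the command in Python rather than retyping it. The sketch below only assembles the argument list from the flags used in the cell above; the function name `igblastn_command` and its `db_prefix` and `threads` parameters are made up for illustration, and nothing is executed.

```python
import shlex


def igblastn_command(query, out, db_prefix="database/human_trb", threads=2):
    """Build the igblastn argument list used above for a given query file."""
    return [
        "igblastn",
        "-germline_db_V", f"{db_prefix}_v",
        "-germline_db_D", f"{db_prefix}_d",
        "-germline_db_J", f"{db_prefix}_j",
        "-auxiliary_data", "optional_file/human_gl.aux",
        "-domain_system", "imgt",
        "-ig_seqtype", "TCR",
        "-organism", "human",
        "-outfmt", "7 std qseq sseq btop",
        "-query", query,
        "-out", out,
        "-num_threads", str(threads),
    ]


cmd = igblastn_command("A2-i131.fasta", "A2-i131.fmt7")
# shlex.join quotes the multi-word -outfmt value for display in a shell.
print(shlex.join(cmd))
# To run it: subprocess.run(cmd, check=True), which requires igblastn on PATH.
```

Passing the arguments as a list avoids shell quoting problems with the multi-word `-outfmt` value.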

## Postprocess IgBLAST output

In [35]:
%%bash
MakeDb.py igblast -i A2-i131.fmt7 -s A2-i131.fasta -r IMGT_Human_TRB[VDJ].fasta \
    --regions --scores

        START> MakeDB
      ALIGNER> IgBlast
ALIGN_RESULTS> A2-i131.fmt7
     SEQ_FILE> A2-i131.fasta
     NO_PARSE> False
 SCORE_FIELDS> True
REGION_FIELDS> True

PROGRESS> 20:08:50 [                    ]   0% (      0) 0.0 min
PROGRESS> 20:10:28 [#                   ]   5% ( 21,500) 1.6 min
PROGRESS> 20:12:11 [##                  ]  10% ( 43,000) 3.3 min
PROGRESS> 20:13:47 [###                 ]  15% ( 64,500) 4.9 min
PROGRESS> 20:15:31 [####                ]  20% ( 86,000) 6.7 min
PROGRESS> 20:17:10 [#####               ]  25% (107,500) 8.3 min
PROGRESS> 20:18:41 [######              ]  30% (129,000) 9.8 min
PROGRESS> 20:20:13 [#######             ]  35% (150,500) 11.4 min
PROGRESS> 20:21:49 [########            ]  40% (172,000) 13.0 min
PROGRESS> 20:23:24 [#########           ]  45% (193,500) 14.6 min
PROGRESS> 20:24:54 [##########          ]  50% (215,000) 16.1 min
PROGRESS> 20:26:28 [###########         ]  55% (236,500) 17.6 min
PROGRESS> 20:28:00 [############        ]  60% (258

In [36]:
!head A2-i131_db-pass.tab

SEQUENCE_ID	SEQUENCE_INPUT	FUNCTIONAL	IN_FRAME	STOP	MUTATED_INVARIANT	INDELS	V_CALL	D_CALL	J_CALL	SEQUENCE_VDJ	SEQUENCE_IMGT	V_SEQ_START	V_SEQ_LENGTH	V_GERM_START_VDJ	V_GERM_LENGTH_VDJ	V_GERM_START_IMGT	V_GERM_LENGTH_IMGT	NP1_LENGTH	D_SEQ_START	D_SEQ_LENGTH	D_GERM_START	D_GERM_LENGTH	NP2_LENGTH	J_SEQ_START	J_SEQ_LENGTH	J_GERM_START	J_GERM_LENGTH	JUNCTION_LENGTH	JUNCTION	V_SCORE	V_IDENTITY	V_EVALUE	V_BTOP	J_SCORE	J_IDENTITY	J_EVALUE	J_BTOP	HMM_SCORE	FWR1_IMGT	FWR2_IMGT	FWR3_IMGT	FWR4_IMGT	CDR1_IMGT	CDR2_IMGT	CDR3_IMGT	CONSCOUNT
GTCATTTAGCATGCTG	TCCTGGAGTCGCCCAGCCCCAACCAGACCTCTCTGTACTTCTGTGCCAGCAGTTTAGAGGGGTACACTGAAGCTTTCTTTGGACAAGGCACCAGACTCAC	T	T	F		F	TRBV27*01	TRBD1*01,TRBD2*02	TRBJ1-1*01	TCCTGGAGTCGCCCAGCCCCAACCAGACCTCTCTGTACTTCTGTGCCAGCAGTTTAGAGGGGTACACTGAAGCTTTCTTTGGACAAGGCACCAGACTCAC	......................................................................................................................................................................................................
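
The first record has two comma-separated alleles in `D_CALL`, which is how tied assignments appear in this table. A minimal sketch for handling such fields is shown here; the helper names `split_calls` and `gene_names` are made up for illustration, and the allele suffix is assumed to follow a `*`.

```python
def split_calls(call):
    """Split a comma-separated call field into individual allele names."""
    return [c.strip() for c in call.split(",") if c.strip()]


def gene_names(call):
    """Collapse alleles to gene names by dropping the '*NN' allele suffix."""
    return sorted({allele.split("*")[0] for allele in split_calls(call)})


d_call = "TRBD1*01,TRBD2*02"  # D_CALL value from the first record above
print(split_calls(d_call), gene_names(d_call))
```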

## Identifying clones

In [43]:
%%R
library(ggplot2)
library(alakazam)
library(shazam)

In [None]:
%%R
db <- readChangeoDb("A2-i131_db-pass.tab")
db <- distToNearest(db, model="ham", symmetry="min")
p1 <- ggplot() + theme_bw() + 
    ggtitle("Distance to nearest: ham") + xlab("distance") +
    geom_histogram(data=db, aes(x=DIST_NEAREST), binwidth=0.025, 
                   fill="steelblue", color="white")
plot(p1)
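
To make the plotted quantity concrete, here is a minimal Python sketch of a length-normalized Hamming distance to the nearest neighbour. It is a simplification and not the shazam implementation: it compares only junctions of equal length, ignores V and J calls, and skips identical junctions, all of which are assumptions of this sketch. The junctions are toy sequences.

```python
def hamming_fraction(a, b):
    """Hamming distance between equal-length strings, divided by their length."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def dist_to_nearest(junctions):
    """For each junction, the smallest distance to a different same-length junction.

    Returns None where no comparable neighbour exists.
    """
    result = []
    for i, query in enumerate(junctions):
        dists = [
            hamming_fraction(query, other)
            for j, other in enumerate(junctions)
            if i != j and len(other) == len(query) and other != query
        ]
        result.append(min(dists) if dists else None)
    return result


# Toy junctions (hypothetical sequences).
toy = ["TGTGCCAGC", "TGTGCCAGT", "TGTAAAGGG", "TGTGCC"]
print(dist_to_nearest(toy))
```

A histogram of these values is what guides the choice of a distance threshold for clonal grouping.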

## Define clones

In [None]:
%%bash
DefineClones.py bygroup -d A2-i131_db-pass.tab --act set --model ham \
--sym min --norm len --dist 0.20
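
The idea behind threshold-based clonal grouping can be sketched in a few lines of Python. This is an illustration only, not the Change-O implementation: it assumes single-linkage clustering, treats a distance less than or equal to the threshold as linked, and groups on junction length alone rather than on V gene, J gene and junction length. The junctions are toy sequences.

```python
def define_clones(junctions, threshold=0.20):
    """Single-linkage clustering on length-normalized Hamming distance."""
    parent = list(range(len(junctions)))

    def find(i):
        # Follow parent links to the cluster root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, a in enumerate(junctions):
        for j in range(i + 1, len(junctions)):
            b = junctions[j]
            if len(a) != len(b):
                continue  # only junctions of equal length are compared
            dist = sum(x != y for x, y in zip(a, b)) / len(a)
            if dist <= threshold:
                parent[find(i)] = find(j)  # merge the two clusters

    # Relabel cluster roots as consecutive clone IDs starting at 1.
    labels = {}
    return [labels.setdefault(find(i), len(labels) + 1) for i in range(len(junctions))]


# Toy junctions (hypothetical sequences).
toy = ["TGTGCCAGCA", "TGTGCCAGCT", "TGTGCCAGTT", "TGTAAAGGGA", "TGTGCC"]
clone_ids = define_clones(toy)
print(clone_ids)
```

With single linkage, two junctions end up in the same clone if a chain of close neighbours connects them, even when they are further apart than the threshold themselves.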

In [None]:
!head A2-i131_db-pass_clone-pass.tab