# IgBLAST

Starting from a FASTQ file of sequencing reads (`SRR765688.fastq`):
- Convert it to FASTA
- Use IgBLAST to assign germline V, D, and J segments
- Post-process the results using Change-O

## Convert FASTQ to FASTA

In [None]:
from Bio import SeqIO
# Convert the reads to FASTA; SeqIO.convert returns the number of records written
SeqIO.convert('SRR765688.fastq', 'fastq', 'SRR765688.fasta', 'fasta')
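
For intuition, the conversion above amounts to keeping each read's identifier and sequence and dropping the quality line. The following is a minimal standard-library sketch, assuming simple four-line FASTQ records (no wrapped sequences). The helper name `fastq_to_fasta` and the sample reads are hypothetical; `SeqIO.convert` remains the more robust choice for real data.

```python
import io


def fastq_to_fasta(fastq_handle, fasta_handle):
    """Write one FASTA record per four-line FASTQ record; return the record count."""
    count = 0
    while True:
        header = fastq_handle.readline().rstrip("\n")
        if not header:
            break  # end of file (or a trailing blank line)
        sequence = fastq_handle.readline().rstrip("\n")
        fastq_handle.readline()  # '+' separator line, ignored
        fastq_handle.readline()  # quality string, not kept in FASTA
        if not header.startswith("@"):
            raise ValueError(f"Malformed FASTQ header: {header!r}")
        fasta_handle.write(f">{header[1:]}\n{sequence}\n")
        count += 1
    return count


# Hypothetical two-read example held in memory instead of on disk.
example_fastq = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nIIII\n")
example_fasta = io.StringIO()
n_records = fastq_to_fasta(example_fastq, example_fasta)
print(n_records)
print(example_fasta.getvalue(), end="")
```

The same function works on real files by passing handles from `open(...)` in place of the `io.StringIO` objects.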

## Set up IgBLAST

In [None]:
%%bash
# Download IgBLAST's internal data and optional files from NCBI
wget -r -nH --cut-dirs=4 --no-parent ftp://ftp.ncbi.nih.gov/blast/executables/igblast/release/internal_data
wget -r -nH --cut-dirs=4 --no-parent ftp://ftp.ncbi.nih.gov/blast/executables/igblast/release/optional_file

In [None]:
%%bash
# Requires the IMGT FASTA files described under "Obtain reference sequences" below
mkdir -p database
# V-segment database
perl ./edit_imgt_file.pl IMGT_Human_IGHV.fasta > database/human_igh_v
makeblastdb -parse_seqids -dbtype nucl -in database/human_igh_v
# D-segment database
perl ./edit_imgt_file.pl IMGT_Human_IGHD.fasta > database/human_igh_d
makeblastdb -parse_seqids -dbtype nucl -in database/human_igh_d
# J-segment database
perl ./edit_imgt_file.pl IMGT_Human_IGHJ.fasta > database/human_igh_j
makeblastdb -parse_seqids -dbtype nucl -in database/human_igh_j

## Obtain reference sequences

Go to [http://www.imgt.org/vquest/refseqh.html](http://www.imgt.org/vquest/refseqh.html) and download the human IGHV, IGHD, and IGHJ sequences in FASTA format. Save them as `IMGT_Human_IGHV.fasta`, `IMGT_Human_IGHD.fasta`, and `IMGT_Human_IGHJ.fasta`.

(Advanced users can instead download the whole database from [here](http://www.imgt.org/download/GENE-DB/) and post-process it.)

## Use IgBLAST

IgBLAST has many options (see the help output below), but the most important ones are:

- germline_db_V: the V gene database
- germline_db_D: the D gene database
- germline_db_J: the J gene database
- auxiliary_data: the file of auxiliary annotations for the germline J genes (e.g. coding frame start positions)
- domain_system: the system used (e.g. imgt) for defining the domains
- ig_seqtype: Ig or TCR
- organism: e.g. human, mouse
- outfmt: the output format; for post-processing with Change-O, it has to be '7 std qseq sseq btop'
- query: the input data in FASTA format
- out: the output filename
- num_threads: the number of threads to use
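
The options listed above map directly onto command-line flags. As a rough sketch, the command can be assembled in Python and printed for inspection before running it. The database paths follow the `database/human_igh_*` names built earlier; the auxiliary file path `optional_file/human_gl.aux`, the thread count, and the helper name `build_igblastn_command` are assumptions made for illustration, so check them against your own download.

```python
import shlex


def build_igblastn_command(query, out, db_prefix="database/human_igh",
                           aux="optional_file/human_gl.aux", threads=4):
    """Return the igblastn call as a list of arguments (nothing is executed)."""
    options = {
        "-germline_db_V": f"{db_prefix}_v",
        "-germline_db_D": f"{db_prefix}_d",
        "-germline_db_J": f"{db_prefix}_j",
        "-auxiliary_data": aux,          # assumed file name, see lead-in
        "-domain_system": "imgt",
        "-ig_seqtype": "Ig",
        "-organism": "human",
        "-outfmt": "7 std qseq sseq btop",  # format required for Change-O
        "-query": query,
        "-out": out,
        "-num_threads": threads,         # arbitrary choice
    }
    command = ["igblastn"]
    for flag, value in options.items():
        command += [flag, str(value)]
    return command


command = build_igblastn_command("SRR765688.fasta", "SRR765688.fmt7")
# shlex.join quotes the multi-word outfmt value so the line can be pasted into a shell.
print(shlex.join(command))
```

Keeping the arguments in a list means the multi-word `outfmt` value stays a single argument if the list is later passed to `subprocess.run`.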

In [None]:
!igblastn -help

Complete the following cell to run `SRR765688.fasta` against the IGH databases generated previously. Ensure that the `outfmt` value is '7 std qseq sseq btop', and save the output as `SRR765688.fmt7`.

In [None]:
%%bash
igblastn

## Post-process IgBLAST output

In [None]:
%%bash
MakeDb.py igblast -i SRR765688.fmt7 -s SRR765688.fasta -r IMGT_Human_IGH[VDJ].fasta \
    --regions --scores

In [None]:
%%bash
ParseDb.py split -d SRR765688_db-pass.tab -f FUNCTIONAL

In [None]:
%load_ext rpy2.ipython

In [None]:
%%R
library(ggplot2)
library(alakazam)
library(shazam)
db <- readChangeoDb("SRR765688_db-pass_FUNCTIONAL-T.tab")
db <- distToNearest(db, model="ham", symmetry="min")
p1 <- ggplot() + theme_bw() + 
    ggtitle("Distance to nearest: ham") + xlab("distance") +
    geom_histogram(data=db, aes(x=DIST_NEAREST), binwidth=0.01, 
                   fill="steelblue", color="white")
plot(p1)

**Look at the histogram above, decide on a good distance cutoff for defining a clone, and add that value after `--dist` at the end of the following command.**
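
If the histogram is bimodal, one common heuristic is to place the cutoff at the bottom of the valley between the two peaks. The sketch below illustrates that idea on made-up distances; it is not what Change-O or shazam does internally. The function name `valley_cutoff`, the `min_separation` parameter, and the toy values are all hypothetical, and the result should only be treated as a starting point to compare against the plot.

```python
def valley_cutoff(distances, binwidth=0.01, min_separation=5):
    """Return the centre of the emptiest bin between the two tallest, well-separated peaks."""
    eps = 1e-9  # guards against floating-point error when assigning bins
    n_bins = int(max(distances) / binwidth + eps) + 1
    counts = [0] * n_bins
    for d in distances:
        counts[int(d / binwidth + eps)] += 1

    # Tallest bin, then the tallest bin at least `min_separation` bins away from it.
    first_peak = max(range(n_bins), key=counts.__getitem__)
    candidates = [i for i in range(n_bins) if abs(i - first_peak) >= min_separation]
    if not candidates:
        raise ValueError("No second peak far enough from the first one")
    second_peak = max(candidates, key=counts.__getitem__)

    # Lowest count strictly between the peaks; take the middle bin if there are ties.
    low, high = sorted((first_peak, second_peak))
    between = range(low + 1, high)
    lowest = min(counts[i] for i in between)
    valley_bins = [i for i in between if counts[i] == lowest]
    valley = valley_bins[len(valley_bins) // 2]
    return round((valley + 0.5) * binwidth, 4)


# Toy DIST_NEAREST values: one mode near 0.03 and another near 0.30.
toy_distances = ([0.02] * 5 + [0.03] * 9 + [0.04] * 4 + [0.15]
                 + [0.28] * 6 + [0.30] * 12 + [0.32] * 7)
cutoff = valley_cutoff(toy_distances)
print(cutoff)
```

On real data the input would be the `DIST_NEAREST` column, with missing values removed first.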

In [None]:
%%bash
DefineClones.py bygroup -d SRR765688_db-pass_FUNCTIONAL-T.tab --act set --model ham \
--sym min --norm len --dist

In [None]:
%%bash
CreateGermlines.py -d SRR765688_db-pass_FUNCTIONAL-T_clone-pass.tab -r IMGT_Human_IGH[VDJ].fasta \
    -g dmask --cloned

In [None]:
import pandas as pd
db = pd.read_csv("SRR765688_db-pass_FUNCTIONAL-T_clone-pass_germ-pass.tab", sep="\t")
db
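
A natural first check on the final table is how many sequences fall into each clone. The sketch below counts clone sizes with the standard library, assuming the table has a `CLONE` column (adjust `clone_field` if your Change-O version names it differently). The helper name `clone_sizes` and the three-row toy table are hypothetical.

```python
import csv
import io
from collections import Counter


def clone_sizes(handle, clone_field="CLONE"):
    """Count the rows per clone identifier in a tab-delimited table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row[clone_field] for row in reader)


# Toy stand-in for the germ-pass table; real tables have many more columns.
toy_table = io.StringIO(
    "SEQUENCE_ID\tCLONE\n"
    "seq1\t1\n"
    "seq2\t1\n"
    "seq3\t2\n"
)
sizes = clone_sizes(toy_table)
for clone_id, size in sizes.most_common():
    print(clone_id, size)
```

For the real file, pass a handle from `open("SRR765688_db-pass_FUNCTIONAL-T_clone-pass_germ-pass.tab", newline="")` instead of the toy table.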