Simple Python-based simulator used for evaluation of V(D)J assignment software
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Failed to load latest commit information.

IGH simulation

This repository contains Python scripts to simulate IGH rearrangements, in order to test V(D)J assignment utilities. These were used to generate simulated data for our paper on V(D)J assignment using phylogenetic placement. The code is very simple (and in places, horribly inefficient), and at some point, this code may be overhauled.

It uses the following files:

  • IGHV.fas, IGHD.fas, IGHJ.fas: these are functional V, D and J human genes obtained from IMGT (IMGT/V-QUEST reference directory release 201443-5, 24 October 2014). Ambiguous nucleotides have been resolved to the base of the most closely related allele. Some alleles contain Ns.
    • IGHV1-2*03 has a N replaced with G
    • IGHV1-45*01 has a N replaced with T
    • IGHV4-30-4*05 N replaced with A
    • IGHV4-4*06 has NNN replaced with AGG
    • IGHV4-59*09 has NANNN replaced with GAAGG
  • vdj.csv: this contains information on inferred V(D)J rearrangements from a curated set of immunoglobulin sequences analysed by Jackson et al. (2004). This was taken from an Excel sheet in the supplementary information. I renamed the columns, and saved as a csv file.
  • Mutability.csv and Substitution.csv: these are files on mutability and substitution probabilities from the S5F model of Yaari et al. 2013, version 0.1 (July 26, 2013), downloaded from the Kleinstein lab webpage.


  • Python
  • Python libraries (e.g. installable via pip)
    • Biopython
    • Numpy
    • progress

This Python script generates all possible V(D)J rearrangements of *01 alleles, without insertions, deletions, or mutations.


python <output filename>

This Python script generates V(D)J rearrangements using simple random sampling (SRS) of the V, D and J regions. Deletion lengths in the V, D and J regions and the lengths of the N1 and N2 non-templated regions are generated by resampling the distributions from Jackson et al. (given in the file vdj.csv). Base frequencies for N1 and N2 are from concatenated N1 and N2 regions. The resulting sequences are coding (no stops), have the conserved [FW]G.G motif, end with the J region motif TVSS, and are restricted to have a CDR3 region as defined by the following (Pythonic) regular expression.



python <number of reads> <output filename stub> <random number seed>

This will generate a FASTA file (<output filename stub>.fas), as well as a table of germlines (<output filename stub>.txt).

This script takes a FASTA file, e.g. one produced by or, and adds mutations according to the S5F model (or at least, my interpretation of it). This script ensures that despite mutations, the above conditions (no stops, conserved motifs, and the regular expression) are preserved; there is a hard-coded limit of 100 mutation attempts per sequence.


python <number of mutations> <input fasta> <output fasta> <random number seed>


Some simulated data have been included:

python simple_rearrangements.fas
python 1000 simple_plus_indels 1
python 40 simple_plus_indels.fas simple_plus_indels_40.fas 1

To do

  • Improve efficiency, especially for mutations


If for some bizarre reason you use this code in your research, please cite:

Frost SDW, Murrell B, Hossain ASMM, Silverman GJ, Kosakovsky Pond SL. (2015) Assigning and visualizing germline genes in antibody repertoires. Philos Trans R Soc Lond B Biol Sci. 370:20140240. doi:10.1098/rstb.2014.0240.