# Synthetic antibody repertoires
  
We'll create synthetic antibody (Ab) repertoires using two approaches. In both cases, we'll generate 10 "repertoires", matching the 10 subjects for which we have observed repertoire data. First, we'll generate synthetic sequences using IGoR's default recombination model, which yields 10 synthetic repertoires derived from the same model. Second, we'll generate synthetic repertoires using the recombination model inferred for each of our 10 subjects.
  
The following Python packages are required to run this Jupyter Notebook:
  * [abutils](https://github.com/briney/abutils)

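All of the Python requirements can be installed with `pip install abutils`. IGoR itself must also be installed and available on your `PATH` as `igor`, because the notebook calls it as a shell command.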

In [1]:
from __future__ import print_function

import multiprocessing as mp
import os
import shutil
import subprocess as sp
import sys
import time

from abutils.utils.jobs import monitor_mp_jobs
from abutils.utils.pipeline import list_files, make_dir

### User-defined options

If you want to create your own synthetic sequences (rather than just look through the code), feel free to modify the options below to suit your needs.

  * `num_seqs` is the number of sequences you'd like to synthesize for each subject.
  * `num_batches` is the number of synthetic "repertoires" you'd like to generate. Multiple repertoires will be generated in parallel.
  * `working_dir` is the working directory for IGoR.
  * `fasta_dir` is the directory into which the synthetic antibody sequences (in FASTA format) will be written.

In [25]:
num_seqs = 100000000
num_batches = 10

working_dir = './data/igor_synthetic_subject-specific-models/'
fasta_dir = './data/igor_synthetic_subject-specific-models/fastas'
make_dir(fasta_dir)

In [None]:
subjects = [os.path.basename(f) for f in list_files('./data/subject-specific_models/')]

## Generate synthetic sequences

The following function generates synthetic sequences with IGoR and converts IGoR's semicolon-delimited output to FASTA. Multiple repertoires will be generated in parallel using Python's `multiprocessing` package. With the default parameters above (100M synthetic sequences), you will need about 100 GB of disk space for each repertoire you plan to generate. The final FASTA files (again, assuming the default 100M synthetic sequences) will total approximately 40 GB.
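The disk-space figure above can be turned into a quick pre-flight check. The following is a rough sketch, not part of the original notebook: `estimate_scratch_gb` and `has_enough_space` are hypothetical helpers, and the estimate assumes that scratch space scales linearly with the number of sequences, which has not been verified. It requires Python 3 for `shutil.disk_usage`.

```python
import shutil

# Figures quoted in the text for the default run (100M sequences per repertoire).
REFERENCE_NUM_SEQS = 100_000_000
REFERENCE_SCRATCH_GB = 100  # per repertoire, while IGoR's output is on disk


def estimate_scratch_gb(num_seqs, num_parallel):
    """Rough scratch-space estimate in GB, ASSUMING linear scaling with num_seqs."""
    per_repertoire = REFERENCE_SCRATCH_GB * num_seqs / REFERENCE_NUM_SEQS
    return per_repertoire * num_parallel


def has_enough_space(path, required_gb):
    """Return True if the filesystem holding `path` has at least `required_gb` free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


needed = estimate_scratch_gb(num_seqs=100_000_000, num_parallel=10)
print('Estimated scratch space: {:.0f} GB'.format(needed))
print('Enough space in current directory:', has_enough_space('.', needed))
```

If the check fails, reducing `num_seqs` or the number of repertoires generated at the same time lowers the peak requirement.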

In [1]:
def synthesize_sequences(working_dir, fasta_dir, batch, num_seqs, model_params=None, model_marginals=None):
    # assemble the IGoR shell command
    igor_cmd = 'igor -set_wd {} -batch {} -species human -chain heavy_naive'.format(working_dir,
                                                                                    batch)
    # use a custom recombination model only if both model files were supplied
    if model_params is not None and model_marginals is not None:
        igor_cmd += ' -set_custom_model {} {}'.format(model_params, model_marginals)
    # generate num_seqs sequences without sequencing errors
    igor_cmd += ' -generate {} --noerr'.format(num_seqs)
    
    # run IGoR
    synth = sp.Popen(igor_cmd, stdout=sp.PIPE, stderr=sp.PIPE, shell=True)
    o, e = synth.communicate()
    
    # IGoR's output is semicolon-delimited, convert to FASTA
    csv_file = os.path.join(working_dir, '{}_generated/generated_seqs_noerr.csv'.format(batch))
    fasta_file = os.path.join(fasta_dir, '{}.fasta'.format(batch))
    with open(csv_file, 'r') as c:
        with open(fasta_file, 'w') as f:
            for line in c:
                if 'seq_index' in line:
                    continue
                name, seq = line.strip().split(';')
                f.write('>{}\n{}\n'.format(name, seq))
    shutil.rmtree(os.path.join(working_dir, '{}_generated'.format(batch)))
    return o, e
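The CSV-to-FASTA step in the function above can be tried on its own without running IGoR. This is a minimal sketch under stated assumptions: `igor_csv_to_fasta` is a hypothetical helper that mirrors the conversion loop, the `seq_index;nt_sequence` header is assumed from the `seq_index` check in the function, and the two sequences are made up for illustration.

```python
import os
import tempfile


def igor_csv_to_fasta(csv_file, fasta_file):
    """Convert semicolon-delimited 'index;sequence' rows to FASTA; return the record count."""
    count = 0
    with open(csv_file, 'r') as c, open(fasta_file, 'w') as f:
        for line in c:
            line = line.strip()
            # skip blank lines and the header row
            if not line or line.startswith('seq_index'):
                continue
            name, seq = line.split(';')
            f.write('>{}\n{}\n'.format(name, seq))
            count += 1
    return count


# Made-up input for illustration only; this is not real IGoR output.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, 'generated_seqs_noerr.csv')
fasta_path = os.path.join(tmp_dir, 'example.fasta')
with open(csv_path, 'w') as handle:
    handle.write('seq_index;nt_sequence\n0;ACGTACGT\n1;TTGACCA\n')

num_records = igor_csv_to_fasta(csv_path, fasta_path)
with open(fasta_path) as handle:
    fasta_text = handle.read()
print(fasta_text)
```

Each input row becomes one FASTA record whose name is the sequence index.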

### Default recombination model

In [None]:
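# worker pool; up to num_batches repertoires are generated in parallel
p = mp.Pool(processes=num_batches)
async_results = []

# FASTA files are named by subject, so a later run that uses the same
# fasta_dir and subject names will overwrite these files
for subject in subjects:
    batch = subject
    args = (working_dir, fasta_dir, batch, num_seqs)
    async_results.append(p.apply_async(synthesize_sequences, args=args))

monitor_mp_jobs(async_results)
p.close()
p.join()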
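The `apply_async` pattern used above can be sketched in isolation. This is a minimal illustration rather than the notebook's own code: `fake_synthesize` and the subject names are hypothetical stand-ins, and a `ThreadPool` replaces the process pool so that the example runs without IGoR. It shows how calling `.get()` on each result retrieves the `(stdout, stderr)` pair returned by the worker, which can help when checking whether a run failed.

```python
from multiprocessing.pool import ThreadPool


def fake_synthesize(batch, num_seqs):
    """Hypothetical stand-in for synthesize_sequences; returns fake (stdout, stderr)."""
    return 'generated {} sequences for {}'.format(num_seqs, batch), ''


batches = ['subject_A', 'subject_B', 'subject_C']  # placeholder names

pool = ThreadPool(processes=2)
async_results = [pool.apply_async(fake_synthesize, args=(b, 1000)) for b in batches]
pool.close()
pool.join()

# .get() returns the worker's return value, or re-raises any exception it raised
outputs = [r.get() for r in async_results]
for batch, (stdout, stderr) in zip(batches, outputs):
    print(batch, '->', stdout)
```

Because `.get()` re-raises exceptions from the worker, a job that failed while converting IGoR's output would surface here instead of failing silently.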

### Subject-specific recombination models

Model files are included in the GitHub repo, but if you'd like to use alternate model files, update the `model_dir` variable below as needed.

In [None]:
model_dir = './data/subject-specific_models/'
# worker pool; up to num_batches repertoires are generated in parallel
p = mp.Pool(processes=num_batches)
async_results = []

for subject in subjects:
    batch = subject
    # each subject directory holds the inferred model parameters and marginals
    params = os.path.join(model_dir, '{}/final_parms.txt'.format(subject))
    marginals = os.path.join(model_dir, '{}/final_marginals.txt'.format(subject))
    args = (working_dir, fasta_dir, batch, num_seqs, params, marginals)
    async_results.append(p.apply_async(synthesize_sequences, args=args))

monitor_mp_jobs(async_results)
p.close()
p.join()
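When using alternate model files, it may be worth confirming that every subject directory contains both files before launching long-running jobs. The sketch below is an assumption-laden illustration: `find_missing_model_files` is a hypothetical helper, the subject names are placeholders, and the demo builds a throwaway directory rather than reading the real `model_dir`.

```python
import os
import tempfile


def find_missing_model_files(model_dir, subjects,
                             filenames=('final_parms.txt', 'final_marginals.txt')):
    """Return a list of expected model file paths that do not exist."""
    missing = []
    for subject in subjects:
        for filename in filenames:
            path = os.path.join(model_dir, subject, filename)
            if not os.path.isfile(path):
                missing.append(path)
    return missing


# Toy directory layout for illustration; subject names are placeholders.
demo_dir = tempfile.mkdtemp()
for subject, files in [('subject_A', ['final_parms.txt', 'final_marginals.txt']),
                       ('subject_B', ['final_parms.txt'])]:
    os.makedirs(os.path.join(demo_dir, subject))
    for name in files:
        open(os.path.join(demo_dir, subject, name), 'w').close()

missing = find_missing_model_files(demo_dir, ['subject_A', 'subject_B'])
print(missing)
```

An empty list means every subject has both model files; any paths returned point to the files that need to be supplied.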