# Subject-specific VDJ recombination models

Calculate a VDJ recombination model for each subject from that subject's dataset of unmutated antibody sequences. The models are created using [**IGoR**](https://github.com/qmarcou/IGoR).

The following Python package is required to run the code in this notebook:
  * [abutils](https://github.com/briney/abutils)

It can be installed by running `pip install abutils`

In [1]:
import os
import subprocess as sp
import sys

from abutils.utils.pipeline import list_files, make_dir

## Subjects

In [22]:
with open('./data/subjects.txt') as f:
    subjects = sorted(f.read().split())

## Files and directories

Running the code in this notebook requires a dataset that is too large to be included in this repository. You can download the dataset [**here**](http://burtonlab.s3.amazonaws.com/GRP_github_data/igor_naive_inference_input.tar.gz). Decompressing the downloaded archive in `./data/vdj_recombination_modeling/` will produce a folder structure compatible with the defaults in this notebook. If you'd prefer to store the data in a different location, update the `naive_input_dir` variable below.

***NOTE:*** *The uncompressed dataset is fairly large (about 16GB), so ensure that you have sufficient available storage before downloading and decompressing.*

In [35]:
naive_input_dir = './data/vdj_recombination_modeling/igor_inference_input_naive'
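
Because the full run is long, it can help to confirm up front that every subject has an input file. The sketch below is not part of the original notebook: `find_missing_inputs` is a hypothetical helper, and it assumes the `<subject>.txt` naming used in the model-building loop later in this notebook.

```python
import os


def find_missing_inputs(subjects, input_dir):
    """Return the subjects that have no '<subject>.txt' file in input_dir.

    Hypothetical helper; the '<subject>.txt' naming is assumed from the
    loop that builds the models.
    """
    missing = []
    for subject in subjects:
        path = os.path.join(input_dir, '{}.txt'.format(subject))
        if not os.path.isfile(path):
            missing.append(subject)
    return missing
```

Calling `find_missing_inputs(subjects, naive_input_dir)` should return an empty list when the archive has been unpacked in the expected location.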

## Create model

We'll now use IGoR to create a recombination model. We first subsample each naive sequence file to randomly select 500,000 sequences per subject. In the IGoR publication, the authors note that an accurate model for TCRbeta recombination can be created with as few as 5,000 sequences, so we expect 500,000 sequences from each subject to be adequate for accurate model inference.
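
In this notebook the subsampling is delegated to IGoR through its `-subsample` option. Purely to illustrate what drawing a fixed-size random sample from a large file involves, here is a minimal sketch of reservoir sampling, one standard technique for this. It is an assumption-laden illustration, not a description of how IGoR subsamples internally, and `reservoir_sample` is a hypothetical name.

```python
import random


def reservoir_sample(items, k, seed=None):
    """Draw up to k items uniformly at random from an iterable in one pass.

    Illustrative only: this is classic reservoir sampling, not IGoR's
    own subsampling code.
    """
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample
```

Because the input is consumed one item at a time, only `k` items are ever held in memory, which matters for input files of this size.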

***NOTE:*** *This code will take a substantial amount of time to run. On a large AWS instance (m5.24xlarge) with 96 vCPUs and 370GB of memory, it takes several days to complete. Additionally, the output files generated by the model inference process (primarily the alignment files) are very large, requiring about 1.5TB of disk space in total for all 10 subjects. Please ensure that adequate resources are available before running this code.*
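
A quick way to check the storage requirement before starting is to query free space on the target filesystem. This is a minimal sketch using the standard library; `has_free_space` is a hypothetical helper, and the threshold you pass is your own estimate (the note above suggests roughly 1.5TB for all 10 subjects).

```python
import shutil


def has_free_space(path, required_bytes):
    """Return True if the filesystem containing `path` has at least
    `required_bytes` of free space. `path` must already exist."""
    return shutil.disk_usage(path).free >= required_bytes
```

For example, `has_free_space('/data', int(1.5e12))` would test the filesystem that holds the working directories used below, assuming `/data` exists on your machine.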

In [2]:
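def compute_model(input_file, working_dir, batch, subsample):
    """Run IGoR's read, align, and infer steps for one input file.

    The stdout and stderr of each step are saved in working_dir.
    """
    make_dir(working_dir)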
    
    # read sequences
    print('Reading sequences...')
    read_cmd = 'igor -set_wd {} -batch {} -subsample {} -read_seqs {}'.format(working_dir,
                                                                              batch,
                                                                              subsample,
                                                                              input_file)
    rc = sp.Popen(read_cmd, stdout=sp.PIPE, stderr=sp.PIPE, shell=True)
    stdout, stderr = rc.communicate()
    with open(os.path.join(working_dir, 'read_seqs.stdout'), 'wb') as f:
        f.write(stdout)
    with open(os.path.join(working_dir, 'read_seqs.stderr'), 'wb') as f:
        f.write(stderr)
        
    # align
    print('Aligning sequences...')
    align_cmd = 'igor -set_wd {} -batch {} -species human -chain heavy_naive -align --all'.format(working_dir,
                                                                                                  batch)
    ac = sp.Popen(align_cmd, stdout=sp.PIPE, stderr=sp.PIPE, shell=True)
    stdout, stderr = ac.communicate()
    with open(os.path.join(working_dir, 'align.stdout'), 'wb') as f:
        f.write(stdout)
    with open(os.path.join(working_dir, 'align.stderr'), 'wb') as f:
        f.write(stderr)
    
    # infer
    print('Inferring model...')
    infer_cmd = 'igor -set_wd {} -batch {} -species human -chain heavy_naive -infer'.format(working_dir,
                                                                                            batch)
    ic = sp.Popen(infer_cmd, stdout=sp.PIPE, stderr=sp.PIPE, shell=True)
    stdout, stderr = ic.communicate()
    with open(os.path.join(working_dir, 'infer.stdout'), 'wb') as f:
        f.write(stdout)
    with open(os.path.join(working_dir, 'infer.stderr'), 'wb') as f:
        f.write(stderr)
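
The function above repeats the same run-and-save pattern three times and does not inspect exit codes, so a failed step is silently followed by the next one. One possible refactoring is sketched below. `run_and_log` is a hypothetical helper that is not part of the original notebook, and it assumes that a failing `igor` call returns a non-zero exit code.

```python
import os
import subprocess as sp


def run_and_log(cmd, working_dir, name):
    """Run `cmd` in a shell, save its output to <name>.stdout and
    <name>.stderr in working_dir, and raise if the command fails.

    Hypothetical helper; assumes a non-zero exit code signals failure.
    """
    proc = sp.run(cmd, shell=True, stdout=sp.PIPE, stderr=sp.PIPE)
    # Write both streams before checking the exit code so logs are kept
    # even when the command fails.
    for stream, data in (('stdout', proc.stdout), ('stderr', proc.stderr)):
        log_path = os.path.join(working_dir, '{}.{}'.format(name, stream))
        with open(log_path, 'wb') as f:
            f.write(data)
    if proc.returncode != 0:
        raise RuntimeError(
            '{} failed with exit code {}'.format(name, proc.returncode))
```

With such a helper, each step in `compute_model` would reduce to a single call such as `run_and_log(read_cmd, working_dir, 'read_seqs')`, and a failure in one step would stop the pipeline for that subject instead of continuing on incomplete output.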

In [None]:
for subject in subjects:
    print(subject)
    naive_file = os.path.join(naive_input_dir, '{}.txt'.format(subject))
    working_dir = '/data/igor/models/{}'.format(subject)
    compute_model(naive_file, working_dir, subject, 500000)
    print('')