# Repertoire classification subsampling

When training a classifier to assign repertoires to the subject from which they were obtained, we need a set of subsampled sequences. The sequences have been condensed to just the V- and J-gene assignments and the CDR3 length (VJ-CDR3len). Subsample sizes range from 10 to 10,000 sequences per biological replicate.

The [`abutils`](https://www.github.com/briney/abutils) Python package is required for this notebook, and can be installed by running `pip install abutils`.

*NOTE: this notebook requires the Unix command-line tool `shuf`, so it needs a Unix-based operating system to run correctly (macOS and most flavors of Linux should be fine). Running this notebook on Windows 10 may be possible using the [Windows Subsystem for Linux](https://docs.microsoft.com/en-us/windows/wsl/about), but we have not tested this.*

In [None]:
from __future__ import print_function, division

from collections import Counter
import os
import subprocess as sp
import sys
import tempfile

# list_files is used in the subsampling loop at the end of the notebook
from abutils.utils.pipeline import list_files, make_dir

## Subjects, subsample sizes, and directories

The `input_dir` should contain deduplicated clonotype sequences. The data files are too large to be included in the GitHub repository, but may be downloaded [**here**](http://burtonlab.s3.amazonaws.com/GRP_github_data/techrep-merged_vj-cdr3len_no-header.tar.gz). The data is downloaded as a compressed archive; decompress it in the `data` directory (in the same parent directory as this notebook) and you should be ready to go. If you want to store the downloaded data in some other location, adjust the `input_dir` path below as needed.

By default, subsample sizes increase by 10 from 10 to 100, by 100 from 100 to 1,000, and by 1,000 from 1,000 to 10,000.

In [None]:
with open('./data/subjects.txt') as f:
    subjects = sorted(f.read().split())

subsample_sizes = list(range(10, 100, 10)) + list(range(100, 1000, 100)) + list(range(1000, 11000, 1000))

input_dir = './data/techrep-merged_vj-cdr3len_no-header/'
subsample_dir = './data/repertoire_classification/user-created_subsamples_vj-cdr3len'
make_dir(subsample_dir)

## Subsampling

In [None]:
def subsample(infile, outfile, n_seqs, iterations):
    # Each output line holds the clonotype counts for one independent random draw.
    with open(outfile, 'w') as f:
        for iteration in range(iterations):
            # Run shuf inside the loop so every iteration gets a fresh subsample.
            # Passing an argument list avoids shell quoting problems with file paths.
            p = sp.Popen(['shuf', '-n', str(n_seqs), infile], stdout=sp.PIPE, stderr=sp.PIPE)
            stdout, stderr = p.communicate()
            if p.returncode != 0:
                raise RuntimeError(stderr.decode('utf-8'))
            # shuf returns bytes; decode, then join each line's fields with underscores.
            seqs = ['_'.join(s.split()) for s in stdout.decode('utf-8').split('\n') if s.strip()]
            counts = Counter(seqs)
            count_strings = ['{}:{}'.format(k, v) for k, v in counts.items()]
            f.write(','.join(count_strings) + '\n')
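
On systems where `shuf` is unavailable, the random draw could in principle be done in pure Python. The sketch below is an assumption-laden alternative, not part of the original workflow: `sample_lines` is a hypothetical helper that uses reservoir sampling to pick up to `n_seqs` lines uniformly at random without replacement, reading the file once and never holding more than `n_seqs` lines in memory. It has not been checked against the `shuf`-based results.

```python
import random


def sample_lines(path, n_seqs, rng=random):
    """Return up to n_seqs lines from path, drawn uniformly without replacement.

    Hypothetical stand-in for `shuf -n n_seqs path`, using reservoir sampling.
    """
    reservoir = []
    with open(path) as handle:
        for i, line in enumerate(handle):
            if i < n_seqs:
                # Fill the reservoir with the first n_seqs lines.
                reservoir.append(line)
            else:
                # Keep the new line with probability n_seqs / (i + 1).
                j = rng.randint(0, i)
                if j < n_seqs:
                    reservoir[j] = line
    # Shuffle so the output order is random as well.
    rng.shuffle(reservoir)
    return reservoir
```

The returned lines keep their trailing newlines, so they would still need the same strip-and-join step that `subsample` applies to the `shuf` output.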

In [None]:
for subject in subjects:
    print(subject)
    files = list_files(os.path.join(input_dir, subject))
    for file_ in files:
        for subsample_size in subsample_sizes:
            num = os.path.basename(file_).split('_')[0]
            ofile = os.path.join(subsample_dir, '{}_{}-{}'.format(subject, subsample_size, num))
            subsample(file_, ofile, subsample_size, 50)
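
Each output file written above contains one line per iteration, and each line is a comma-separated list of `clonotype:count` pairs. As a minimal sketch of how such a line might be read back for downstream use, the hypothetical helper `parse_subsample_line` below rebuilds a `Counter` from one line. The clonotype names in the usage example are made up for illustration and are not taken from the dataset.

```python
from collections import Counter


def parse_subsample_line(line):
    """Rebuild clonotype counts from one 'key:count,key:count' line."""
    counts = Counter()
    for item in line.strip().split(','):
        if not item:
            continue
        # Split on the last colon so the key itself is left untouched.
        key, count = item.rsplit(':', 1)
        counts[key] = int(count)
    return counts


# Illustrative input only; these clonotype names are invented.
example = parse_subsample_line('IGHV1-2_IGHJ4_15:2,IGHV3-23_IGHJ6_12:1\n')
```

Summing the values of the resulting `Counter` recovers the number of sequences drawn for that iteration, which is a quick way to confirm that a file was written with the intended subsample size.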