# Clonotype deduplication (synthetic repertoires)

Starting with annotated sequence data (in abstar's `minimal` output format), this notebook reduces sequences to clonotypes and collapses duplicate clonotypes.

The [`abutils`](https://www.github.com/briney/abutils) Python package is required and can be installed by running `pip install abutils`.

*NOTE: this notebook requires the Unix command-line tool `sort` (along with `uniq` and `wc`), so it needs a Unix-based operating system to run correctly (macOS and most flavors of Linux should be fine). Running this notebook on Windows 10 may be possible using the [Windows Subsystem for Linux](https://docs.microsoft.com/en-us/windows/wsl/about), but we have not tested this.*
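
Before running the notebook, it may help to confirm that the shell tools it calls (`sort`, `uniq` and `wc`) are on the `PATH`. The sketch below is not part of the original workflow; the helper name `have_tool` is hypothetical, and it assumes Python 3.3 or later, where `shutil.which` is available.

```python
import shutil


def have_tool(name):
    """Return True if an executable called `name` is found on the PATH."""
    return shutil.which(name) is not None


# Report which of the required command-line tools are missing, if any.
required_tools = ['sort', 'uniq', 'wc']
missing_tools = [tool for tool in required_tools if not have_tool(tool)]
if missing_tools:
    print('missing tools:', ', '.join(missing_tools))
else:
    print('all required tools found')
```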

In [None]:
from __future__ import print_function, division

import itertools
import multiprocessing as mp
import os
import subprocess as sp
import sys
import tempfile

from abutils.utils.jobs import monitor_mp_jobs
from abutils.utils.pipeline import list_files, make_dir
from abutils.utils.progbar import progress_bar

### Subjects, directories and data fields

The input data (annotated synthetic sequences in [abstar's](https://github.com/briney/abstar) `minimal` format) is too large to be stored in a GitHub repository. The two synthetic datasets can be downloaded from the following links:

  * default IGoR recombination model: [**DOWNLOAD**](http://burtonlab.s3.amazonaws.com/GRP_github_data/synthetic_default-model_minimal.tar.gz)
  * subject-specific IGoR recombination models: [**DOWNLOAD**](http://burtonlab.s3.amazonaws.com/GRP_github_data/synthetic_subject-specific-models_minimal.tar.gz)

The datasets are large (each is approximately 1 TB uncompressed), so make sure you have enough space before downloading. Decompressing the default IGoR recombination model archive from within the `data` directory (located in the same parent directory as this notebook) will allow the code in this notebook to run without modification. If you would prefer to store the input data somewhere else, or would like to use the subject-specific IGoR model data instead, be sure to modify the `raw_input_dir` path below.

The data fields defined below correspond to column positions in abstar's `minimal` format. If your annotation file is formatted differently, change the field positions to match it.
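
As a rough illustration of how the field positions are used, the sketch below builds a clonotype string from one comma-separated row. The helper name `clonotype_from_line` and the example row are hypothetical (the row is made-up filler, not real abstar output); only the field indices and the `'no'` productivity check mirror this notebook.

```python
# Field positions as used in this notebook (0-based column indices).
PROD_FIELD = 3
V_FIELD = 5
J_FIELD = 9
CDR3AA_FIELD = 12


def clonotype_from_line(line, sep=','):
    """Return 'V J CDR3aa' for a productive row, or None for a non-productive one."""
    data = line.strip().split(sep)
    if data[PROD_FIELD] == 'no':
        return None
    return ' '.join([data[V_FIELD], data[J_FIELD], data[CDR3AA_FIELD]])


# Build a made-up 13-column row purely to exercise the indexing.
row = ['x'] * 13
row[PROD_FIELD] = 'yes'
row[V_FIELD] = 'IGHV1-2'
row[J_FIELD] = 'IGHJ4'
row[CDR3AA_FIELD] = 'ARDYW'
example_line = ','.join(row)

print(clonotype_from_line(example_line))
```

If the annotation file uses a different column order, only the four constants need to change.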

In [4]:
# subjects
with open('./data/subjects.txt') as f:
    subjects = sorted(f.read().split())

# directories
raw_input_dir = './data/synthetic_default-model_minimal/'
raw_clonotype_dir = './data/synthetic_default-model_vj-aa/'
unique_clonotype_dir = './data/dedup_synthetic_default-model_vj-aa/'
counts_clonotype_dir = './data/dedup_synthetic_default-model_vj-aa_with-counts/'
temp_dir = './data/temp'
logfile = './data/dedup.log'

# make directories
make_dir(raw_clonotype_dir)
make_dir(unique_clonotype_dir)
make_dir(counts_clonotype_dir)
make_dir(temp_dir)
with open(logfile, 'w') as f:
    f.write('')

# data fields
prod_field = 3
v_field = 5
j_field = 9
cdr3aa_field = 12

### Deduplication

For each synthetic sequence datafile, we'd like to create three new clonotype files:
  1. a file containing the raw clonotypes, one for every productive sequence
  2. a file containing just unique clonotypes, used to quantify cross-sample clonotype sharing
  3. a file containing unique clonotypes with counts (the number of times each unique clonotype was observed), used to quantify repeat observation
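
To make the relationship between the three outputs concrete, here is a small in-memory sketch using `collections.Counter`. It is only an illustration on made-up clonotypes: the notebook itself relies on the Unix `sort` and `uniq` tools, and the function name `summarize_clonotypes` is hypothetical.

```python
from collections import Counter


def summarize_clonotypes(clonotypes):
    """Return (raw list, sorted unique list, Counter of observations)."""
    raw = list(clonotypes)
    counts = Counter(raw)
    unique = sorted(counts)
    return raw, unique, counts


# Made-up clonotypes in the 'V J CDR3aa' layout used by this notebook.
example = [
    'IGHV1-2 IGHJ4 ARDYW',
    'IGHV3-23 IGHJ6 ARGGYW',
    'IGHV1-2 IGHJ4 ARDYW',
]
raw, unique, counts = summarize_clonotypes(example)

print('raw clonotypes:', len(raw))
print('unique clonotypes:', len(unique))
for clonotype in unique:
    print(counts[clonotype], clonotype)
```

An in-memory counter like this is convenient for small inputs, but it holds every distinct clonotype in memory at once, which is likely impractical for datasets of the size described above.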

In [5]:
def dedup_clonotypes(minimal_file):
      
    # process minimal file
    print(os.path.basename(minimal_file))
    clonotype_output_data = []
    sequence_output_data = []
    raw_clonotype_file = os.path.join(raw_clonotype_dir, os.path.basename(minimal_file))
    unique_clonotype_file = os.path.join(unique_clonotype_dir, os.path.basename(minimal_file))
    counts_clonotype_file = os.path.join(counts_clonotype_dir, os.path.basename(minimal_file))

    # collect clonotype/sequence information
    with open(minimal_file) as f:
        for line in f:
            data = line.strip().split(',')
            if data[prod_field] == 'no':
                continue
            v_gene = data[v_field]
            j_gene = data[j_field]
            cdr3_aa = data[cdr3aa_field]
            clonotype_output_data.append(' '.join([v_gene, j_gene, cdr3_aa]))

    # write raw clonotype info to file
    raw_clonotype_string = '\n'.join(clonotype_output_data)
    with open(raw_clonotype_file, 'w') as rf:
        rf.write(raw_clonotype_string)
    raw_clonotype_count = len(clonotype_output_data)
    print('raw clonotypes:', raw_clonotype_count)
    
    # collapse duplicate clonotypes (without counts)
    # universal_newlines=True lets communicate() accept a str on Python 2 and 3
    uniq_cmd = 'sort -u -o {} -'.format(unique_clonotype_file)
    p = sp.Popen(uniq_cmd, stdout=sp.PIPE, stderr=sp.PIPE, stdin=sp.PIPE,
                 shell=True, universal_newlines=True)
    stdout, stderr = p.communicate(input=raw_clonotype_string)

    # collapse duplicate clonotypes (with counts)
    uniq_cmd = 'sort -T {} | uniq -c > {}'.format(temp_dir,
                                                  counts_clonotype_file)
    p = sp.Popen(uniq_cmd, stdout=sp.PIPE, stderr=sp.PIPE, stdin=sp.PIPE,
                 shell=True, universal_newlines=True)
    stdout, stderr = p.communicate(input=raw_clonotype_string)
    
    # count the number of unique clonotypes
    wc_cmd = 'wc -l {}'.format(unique_clonotype_file)
    q = sp.Popen(wc_cmd, stdout=sp.PIPE, stderr=sp.PIPE, shell=True)
    _count, _ = q.communicate()
    unique_clonotype_count = int(_count.split()[0])
    print('unique clonotypes:', unique_clonotype_count)
    with open(logfile, 'a') as f:
        f.write('CLONOTYPES: {} {}\n'.format(raw_clonotype_count, unique_clonotype_count))
    print('')

In [None]:
for minimal_file in list_files(raw_input_dir):
    dedup_clonotypes(minimal_file)
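
The with-counts files are written by `uniq -c`, which prefixes each line with a whitespace-padded count. A minimal sketch for reading such lines back into `(count, clonotype)` pairs follows; the function name `parse_uniq_counts` and the example lines are hypothetical, and the exact padding width may vary between `uniq` implementations.

```python
def parse_uniq_counts(lines):
    """Yield (count, clonotype) pairs from `uniq -c`-style lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split only on the first run of whitespace so the clonotype,
        # which itself contains spaces, stays intact.
        count, clonotype = line.split(None, 1)
        yield int(count), clonotype


# Made-up lines imitating `uniq -c` output.
example_lines = [
    '      2 IGHV1-2 IGHJ4 ARDYW\n',
    '      1 IGHV3-23 IGHJ6 ARGGYW\n',
]
parsed = list(parse_uniq_counts(example_lines))
for count, clonotype in parsed:
    print(count, clonotype)
```

The same generator can be given an open file handle in place of the list, since it only iterates over lines.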