# Clonotype and sequence counting

Starting with deduplicated clonotypes or sequences, count the occurrence of repeatedly observed clonotypes/sequences (seen in multiple biological replicates from the same subject) or shared clonotypes/sequences (seen in multiple subjects).

The [`abutils`](https://www.github.com/briney/abutils) Python package is required and can be installed by running `pip install abutils`.

In [2]:
import multiprocessing as mp
import os
import sys

from abutils.utils.jobs import monitor_mp_jobs
from abutils.utils.pipeline import list_files, make_dir

### Files and directories

There are two input data directories each for clonotypes and sequences -- one containing deduplicated single-subject pools (to quantify repeatedly observed clonotypes/sequences) and another containing deduplicated cross-subject pools (to quantify shared clonotypes/sequences). The input data used by the following code is too large to be included in this repository. Input datasets can be generated using the code in [**this Jupyter notebook**](LINK). Alternatively, data can be downloaded from the following links:
  * single-subject clonotype data can be downloaded [**here**](LINK)
  * cross-subject clonotype data can be downloaded [**here**](LINK)
  * single-subject sequence data can be downloaded [**here**](LINK)
  * cross-subject sequence data can be downloaded [**here**](LINK)

If generating the input data using the code in the referenced Jupyter notebook, the data should be deposited into the appropriate directory. If downloading the data, either decompress the downloaded data file in the appropriate directory or modify the `single_subject_clonotype_dir`, `cross_subject_clonotype_dir`, `single_subject_sequence_dir` and/or `cross_subject_sequence_dir` variables as needed.

***NOTE:*** *The uncompressed cross-subject input data is quite large (>2TB for clonotypes and >20TB for sequences). Ensure that you have sufficient storage before downloading and decompressing.*
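
As a quick pre-download check, the free space on the target filesystem can be compared against the sizes quoted in the note above. This is a minimal sketch: the helper name `has_free_space` is hypothetical, the path `'.'` is an assumption about where the data will be stored, and the quoted sizes are lower bounds, so treat a passing check as necessary rather than sufficient.

```python
import shutil


def has_free_space(path, required_bytes):
    """Return True if the filesystem holding `path` has at least `required_bytes` free."""
    return shutil.disk_usage(path).free >= required_bytes


TB = 1000 ** 4  # one decimal terabyte, in bytes

# Lower bounds taken from the note above (">2TB" and ">20TB").
required = {'clonotypes': 2 * TB, 'sequences': 20 * TB}

for name, size in required.items():
    status = 'enough space' if has_free_space('.', size) else 'NOT enough space'
    print('{}: {}'.format(name, status))
```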

In [3]:
# directories
single_subject_clonotype_dir = './data/dedup_subject_clonotype_pools/'
cross_subject_clonotype_dir = './data/dedup_cross-subject_clonotype_pools/'
clonotype_output_dir = './data/user-calculated_cross-subject_clonotype_duplicate-counts/'
single_subject_sequence_dir = './data/dedup_subject_sequence_pools/'
cross_subject_sequence_dir = './data/dedup_cross-subject_sequence_pools/'
sequence_output_dir = './data/user-calculated_cross-subject_sequence_duplicate-counts/'
make_dir(clonotype_output_dir)
make_dir(sequence_output_dir)

# files
clonotype_files = [f for f in list_files(single_subject_clonotype_dir) if 'pool_vj-aa_with-counts.txt' in f]
clonotype_files += [f for f in list_files(cross_subject_clonotype_dir) if 'pool_vj-aa_with-counts.txt' in f]
sequence_files = [f for f in list_files(single_subject_sequence_dir) if 'pool_nt-seq_with-counts.txt' in f]
sequence_files += [f for f in list_files(cross_subject_sequence_dir) if 'pool_nt-seq_with-counts.txt' in f]

In [4]:
# clonotypes
clonotype_files_by_subject_count = {i: [] for i in range(1, 11)}
for f in clonotype_files:
    num = len(os.path.basename(f).split('_')[0].split('-'))
    clonotype_files_by_subject_count[num].append(f)

# sequences
sequence_files_by_subject_count = {i: [] for i in range(1, 11)}
for f in sequence_files:
    num = len(os.path.basename(f).split('_')[0].split('-'))
    sequence_files_by_subject_count[num].append(f)
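
The grouping above infers the number of subjects in a pool from the file name alone: everything before the first underscore is taken as the subject prefix, and subject identifiers within it are assumed to be joined by hyphens. The sketch below isolates that step. The helper name `subject_count` and the subject identifiers `A`, `B` and `C` are hypothetical placeholders, not real file names from the dataset.

```python
import os


def subject_count(path):
    """Count the hyphen-separated subject IDs in the file name prefix."""
    prefix = os.path.basename(path).split('_')[0]
    return len(prefix.split('-'))


# Hypothetical paths; only the naming pattern matters here.
examples = [
    './data/dedup_subject_clonotype_pools/A_pool_vj-aa_with-counts.txt',
    './data/dedup_cross-subject_clonotype_pools/A-B-C_pool_vj-aa_with-counts.txt',
]
for path in examples:
    print(os.path.basename(path), '->', subject_count(path))
```

Note that this only works if subject identifiers themselves contain neither underscores nor hyphens.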

## Duplicate counting

Now we'd like to count the duplicate (repeatedly observed or shared) clonotypes and sequences for every groupwise combination of our 10 subjects. Each group can contain one or more subjects, so there are 2^10 - 1 = 1,023 possible groups. We'll use the `multiprocessing` package to parallelize the process, which should speed things up substantially, although even with parallelization this will take some time.
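
To get a feel for the workload, the number of possible subject groups can be computed per group size. This is a small illustrative calculation, not part of the original pipeline, and it assumes every non-empty combination of the 10 subjects is present as an input file.

```python
from math import comb  # available in Python 3.8+

n_subjects = 10

# Number of distinct groups containing exactly k subjects.
groups_by_size = {k: comb(n_subjects, k) for k in range(1, n_subjects + 1)}
total_groups = sum(groups_by_size.values())

for k, n in groups_by_size.items():
    print('{:>2} subject(s): {:>3} groups'.format(k, n))
print('total:', total_groups)  # 2**10 - 1 = 1023
```

Groups of five subjects are the most numerous (252), so that iteration of the loops below would submit the most jobs.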

In [None]:
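def count_duplicates(input_file, output_dir):
    # Tally how many clonotypes/sequences were observed 1, 2, ..., 10 times.
    # The first whitespace-delimited field of each line is read as the
    # occurrence count; a value outside 1-10 raises a KeyError.
    counts = {str(i): 0 for i in range(1, 11)}
    with open(input_file, 'r') as f:
        for line in f:
            if not line.strip():
                continue
            c = line.strip().split()[0]
            counts[c] += 1
    # Name the output file after the subject prefix of the input file.
    subject_prefix = os.path.basename(input_file).split('_')[0]
    output_file = os.path.join(output_dir, '{}_occurrence-counts.txt'.format(subject_prefix))
    with open(output_file, 'w') as f:
        # One tab-separated "occurrence, tally" row per line, in numeric order.
        data = ['{}\t{}'.format(k, v) for k, v in sorted(counts.items(), key=lambda x: int(x[0]))]
        data_string = '\n'.join(data)
        f.write(data_string)
    return counts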
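
The tallying step can be tried on a toy file before committing to the full dataset. The sketch below re-implements only the counting logic with a `Counter`. The helper name `tally_first_column`, the file name and the `clonotype_*` entries are made-up placeholders, and the layout (occurrence count in the first column) is an assumption read off the function above.

```python
import os
import tempfile
from collections import Counter


def tally_first_column(path):
    """Count how often each value appears in the first column of `path`."""
    tally = Counter()
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if fields:  # skip blank lines
                tally[fields[0]] += 1
    return tally


# Toy input: occurrence count, then a placeholder clonotype label.
toy_lines = ['1\tclonotype_a', '1\tclonotype_b', '2\tclonotype_c', '3\tclonotype_d', '']

with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, 'A-B-C_pool_vj-aa_with-counts.txt')
    with open(toy_path, 'w') as handle:
        handle.write('\n'.join(toy_lines))
    toy_counts = tally_first_column(toy_path)

print(dict(toy_counts))  # {'1': 2, '2': 1, '3': 1}
```

Unlike `count_duplicates`, which pre-populates keys `'1'` through `'10'` and fails on anything else, a `Counter` accepts any first-column value, which can be handy for spotting unexpected values in the input.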

### Clonotypes

In [None]:
p = mp.Pool(maxtasksperchild=1)
clonotype_counts = {}

for num in clonotype_files_by_subject_count.keys():
    async_results = []
    print('subject count:', num)
    sys.stdout.flush()
    for ifile in clonotype_files_by_subject_count[num]:
        async_results.append(p.apply_async(count_duplicates, args=(ifile, clonotype_output_dir)))
    monitor_mp_jobs(async_results)
    clonotype_counts[num] = [ar.get() for ar in async_results]
    print('\n')
p.close()
p.join()

### Sequences

In [None]:
p = mp.Pool(maxtasksperchild=1)
sequence_counts = {}

for num in sequence_files_by_subject_count.keys():
    async_results = []
    print('subject count:', num)
    sys.stdout.flush()
    for ifile in sequence_files_by_subject_count[num]:
        async_results.append(p.apply_async(count_duplicates, args=(ifile, sequence_output_dir)))
    monitor_mp_jobs(async_results)
    sequence_counts[num] = [ar.get() for ar in async_results]
    print('\n')
p.close()
p.join()