# Shared clonotype frequency

For every groupwise combination of two or more subjects, compute the frequency of universally shared clonotypes (that is, clonotypes found in the repertoire of every member in the group).
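As an illustration of the quantity being computed, here is a minimal sketch on toy data. The `repertoires` dictionary and the `shared_frequency` helper are hypothetical and not part of this notebook's pipeline, which reads pre-computed count files instead. The sketch assumes the same normalization used later in this notebook: the number of universally shared clonotypes divided by the size of the smallest repertoire in the group.

```python
import itertools

# Hypothetical toy repertoires: subject name -> set of clonotype identifiers
repertoires = {
    'A': {'c1', 'c2', 'c3', 'c4'},
    'B': {'c2', 'c3', 'c5'},
    'C': {'c3', 'c4', 'c5', 'c6', 'c7'},
}


def shared_frequency(group, repertoires):
    """Fraction of the smallest repertoire found in every member of the group."""
    sets = [repertoires[s] for s in group]
    shared = set.intersection(*sets)        # clonotypes present in all members
    smallest = min(len(s) for s in sets)    # size of the smallest repertoire
    return len(shared) / smallest


# every combination of two or more subjects
toy_frequencies = {}
for size in range(2, len(repertoires) + 1):
    for group in itertools.combinations(sorted(repertoires), size):
        toy_frequencies[group] = shared_frequency(group, repertoires)

for group, freq in toy_frequencies.items():
    print('{}: {:.3f}'.format(', '.join(group), freq))
```

Holding every repertoire in memory as a set is only practical for small inputs, which is presumably why the notebook works from pre-computed duplicate counts.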

The following Python packages are required to run the code in this notebook:
  * numpy
  * pandas
  * [abutils](https://github.com/briney/abutils)

They can be installed by running `pip install numpy pandas abutils`.

*NOTE: this notebook requires the Unix command line tool `wc`, so it needs a Unix-based operating system to run correctly (macOS and most flavors of Linux should be fine). Running this notebook on Windows 10 may be possible using the [Windows Subsystem for Linux](https://docs.microsoft.com/en-us/windows/wsl/about), but we have not tested this.*
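If `wc` is not available, the line counts can also be obtained in pure Python. The following is a minimal sketch, not the method used in this notebook. The `count_lines` helper is a hypothetical name, and like `wc -l` it counts newline characters, so a final line without a trailing newline is not counted.

```python
def count_lines(path, chunk_size=1024 * 1024):
    """Count newline characters in a file, reading it in binary chunks.

    Hypothetical stand-in for `wc -l`; chunked reads avoid loading a
    multi-gigabyte file into memory at once.
    """
    n = 0
    with open(path, 'rb') as f:
        # iter() with a sentinel keeps reading until f.read() returns b''
        for chunk in iter(lambda: f.read(chunk_size), b''):
            n += chunk.count(b'\n')
    return n
```

This will typically be slower than `wc` on very large files, but it removes the dependency on a Unix command line tool.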

In [8]:
from collections import Counter
from datetime import datetime
import itertools
import json
import multiprocessing as mp
import os
import subprocess as sp
import sys

import numpy as np
import pandas as pd

from abutils.utils.jobs import monitor_mp_jobs
from abutils.utils.pipeline import list_files, make_dir
from abutils.utils.progbar import progress_bar

### Subjects, files and directories

In [12]:
# files and directories
dedup_subject_dir = './data/dedup_subject_clonotype_pools/'
cross_subject_occurance_files = list_files('./data/cross-subject_clonotype_duplicate-counts/')

# subjects
with open('./data/subjects.txt') as f:
    subjects = sorted(f.read().split())

### Number of unique clonotypes per subject

If you'd like to count the number of unique clonotypes per subject yourself, you can run the code in [**this**](LINK) notebook or download a dataset containing each subject's unique clonotypes [**here**](LINK). Note that the decompressed unique clonotype dataset is fairly large (about 8 GB).

All we're doing is counting the number of lines in the unique clonotype file. If you'd rather not download and decompress the data just to count the lines, skip the next block of code.

In [None]:
subject_sizes = {}
for subject in subjects:
    print(subject)
    dedup_file = os.path.join(dedup_subject_dir, '{}_dedup_pool_vj-aa.txt'.format(subject))
    # pass the command as a list (no shell) so unusual file names are handled safely
    wc_cmd = ['wc', '-l', dedup_file]
    p = sp.Popen(wc_cmd, stdout=sp.PIPE, stderr=sp.PIPE)
    stdout, stderr = p.communicate()
    size = int(stdout.strip().split()[0])
    subject_sizes[subject] = size

Only run the following code block if you skipped the previous one. This loads pre-computed unique clonotype counts for each subject.

In [5]:
with open('./data/unique_clonotype_counts.json') as f:
    subject_sizes = json.load(f)

## Quantify shared clonotypes

In [14]:
shared_frequencies_by_group_size = {i + 1: [] for i in range(len(subjects))}
shared_frequencies_by_group = []

start_time = datetime.now()
progress_bar(0, len(cross_subject_occurance_files), start_time=start_time)

for i, of in enumerate(cross_subject_occurance_files):
    # hyphen-separated subject names precede the first underscore in the file name
    _subjects = os.path.basename(of).split('_')[0].split('-')
    smallest = min(subject_sizes[s] for s in _subjects)
    # universally shared clonotypes occur once per group member
    min_freq = str(len(_subjects))
    # default to 0 so a file without a matching line doesn't reuse the previous group's count
    count = 0
    with open(of) as f:
        for line in f:
            fields = line.strip().split()
            if not fields:
                continue
            if fields[0] == min_freq:
                count = int(fields[1])
                break
    # normalize by the smallest repertoire in the group
    frequency = 1. * count / smallest
    shared_frequencies_by_group.append('{}: {}'.format(', '.join(_subjects), 100. * frequency))
    shared_frequencies_by_group_size[len(_subjects)].append(frequency)
    progress_bar(i + 1, len(cross_subject_occurance_files), start_time=start_time)

with open('./data/shared_clonotypes/groupwise_shared_clonotype_frequencies.txt', 'w') as f:
    f.write('\n'.join(shared_frequencies_by_group))
    
with open('./data/shared_clonotypes/groupwise_shared_clonotype_frequencies_by-size.json', 'w') as f:
    json.dump(shared_frequencies_by_group_size, f)

(1023/1023) ||||||||||||||||||||||||||||||||||||||||||||||||||||  100%  (00:01)  
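Once the by-size frequencies have been written, they can be summarized per group size. The sketch below is one possible way to do that and is not part of the original analysis. The `summarize_by_size` helper and the toy values are hypothetical. Note that `json.dump` converts integer dictionary keys to strings, so the keys are cast back to integers here.

```python
import numpy as np


def summarize_by_size(freqs_by_size):
    """Mean and median shared frequency (as percentages) for each group size."""
    summary = {}
    for size, freqs in freqs_by_size.items():
        if not freqs:
            continue  # skip group sizes with no data
        pct = np.asarray(freqs, dtype=float) * 100.
        summary[int(size)] = {
            'n_groups': len(freqs),
            'mean_pct': float(pct.mean()),
            'median_pct': float(np.median(pct)),
        }
    return summary


# Hypothetical toy values, shaped like the JSON written above (string keys)
toy = {'1': [], '2': [0.01, 0.03], '3': [0.002]}
summary = summarize_by_size(toy)
for size in sorted(summary):
    print(size, summary[size])
```

To apply this to the real output, load the JSON file with `json.load` and pass the resulting dictionary to the helper.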