# AntiRef: data download

The download scripts used in the current version of AntiRef (`v2022.12.14`) were retrieved from the [Observed Antibody Space](http://opig.stats.ox.ac.uk/webapps/oas/) (OAS) on December 14, 2022 and can be found in the `scripts/` directory. Each script contains one `wget` command per line, and each command downloads a single dataset from OAS.
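
Because each line is expected to hold one `wget` command followed by a URL, a script can be inspected before anything is downloaded. The following is a minimal sketch under that assumption; `read_download_urls` is a hypothetical helper, not part of AntiRef or `abutils`.

```python
from typing import List


def read_download_urls(script_path: str) -> List[str]:
    """Return the URL from each non-empty line of a download script.

    Assumes every non-empty line has the form ``wget <url>``, so the URL
    is taken to be the last whitespace-separated token on the line.
    """
    urls = []
    with open(script_path) as handle:
        for line in handle:
            tokens = line.split()
            if tokens:  # skip blank lines
                urls.append(tokens[-1])
    return urls
```

For example, `len(read_download_urls('./download_heavy.txt'))` would give the number of datasets that script is set to fetch.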


### download scripts
Dataset filtering criteria (used to build the download scripts):

| field | value |
| --- | --- |
| Chain | `'Heavy'` or `'Light'` |
| Disease | `'None'` |
| BSource | `'PBMC'` |
| Species | `'Human'` |
| Vaccine | `'None'` |


All other fields were left as default (`'*'`), which performs no additional filtering.

__NOTE:__ *the `'PBMC'` filter excludes all of the sequences from [Briney et al., 2019](https://www.nature.com/articles/s41586-019-0879-y), which have a bsource of `'LeukoPak'` in the OAS database (they are the only set of samples with that bsource). That's not a problem for this data download, however, since we will retain only full-length VDJ regions. Briney et al., 2019 used BIOMED-2 primers, so all of those sequences are FR1-truncated and would have been excluded post-download anyway.*

### results

The `'Heavy'` chain search yielded:
* **631,028,215** sequences (unique within the individual dataset)
* datasets from **31** different studies 
* a total of **2,931** datasets 

The `'Light'` chain search yielded:
* **272,491,529** sequences (unique within the individual dataset)
* datasets from **13** different studies 
* a total of **437** datasets

In [None]:
# install dependencies
%pip install tqdm abutils

In [None]:
import os
import sys
import subprocess as sp 
from typing import Optional

from tqdm.notebook import tqdm

import abutils

In [35]:
def oas_downloader(
    oas_sh_file: str, 
    raw_download_dir: str, 
    decompressed_dir: Optional[str] = None, 
    decompress: bool = True
):
    '''
    Downloads annotated antibody sequence data from the `Observed Antibody Space`_ repository.
    
    Parameters
    ----------
    oas_sh_file : str
        Path to an OAS-generated data download script. Must be a text file with a single
        ``wget`` command per line.
    
    raw_download_dir : str
        Path to a directory into which compressed (`'.gz'`) data files will be 
        downloaded. If the directory does not exist, it will be created.
        
    decompressed_dir : str, default=None
        Path to a directory into which data files will be decompressed. If `decompress` is
        ``True`` and `decompressed_dir` is not provided, files will be decompressed into
        a ``decompressed`` subdirectory of `raw_download_dir`. Default is ``None``.
        
    decompress : bool, default=True
        If ``True``, CSV files will be decompressed after downloading. Default is ``True``.
    
    .. _Observed Antibody Space:
        http://opig.stats.ox.ac.uk/webapps/oas/
    '''
    # directory setup
    raw_download_dir = os.path.abspath(raw_download_dir)
    if not os.path.isdir(raw_download_dir):
        abutils.io.make_dir(raw_download_dir)
    if decompress:
        if decompressed_dir is None:
            decompressed_dir = os.path.join(raw_download_dir, 'decompressed')
        decompressed_dir = os.path.abspath(decompressed_dir)
        if not os.path.isdir(decompressed_dir):
            abutils.io.make_dir(decompressed_dir)
    # do the download
    with open(oas_sh_file) as oas_file:
        lines = [l for line in oas_file.readlines() if (l := line.strip())]
        pbar = tqdm(lines)
        for line in pbar:
            if sline := line.strip():
                _, url = sline.split()
                compressed_fname = os.path.basename(url)
                compressed_file = os.path.join(raw_download_dir, compressed_fname)
                pbar.set_description(f"{compressed_fname} - downloading")
                # pass arguments as a list (no shell) so unusual paths are handled safely
                p = sp.run(["wget", "-O", compressed_file, url], stdout=sp.PIPE, stderr=sp.PIPE)
                if p.returncode != 0:
                    print(f"download failed: {url}", file=sys.stderr)
                    continue
                # decompress if desired
                if decompress:
                    pbar.set_description(f"{compressed_fname} - decompressing")
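                    # remove only a trailing '.gz'; str.rstrip('.gz') strips characters, not a suffix
                    decompressed_fname = compressed_fname[:-3] if compressed_fname.endswith('.gz') else compressed_fname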
                    decompressed_file = os.path.join(decompressed_dir, decompressed_fname)
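                    # write gunzip's stdout directly to the target file (no shell redirection)
                    with open(decompressed_file, "wb") as out:
                        p = sp.run(["gunzip", "-c", compressed_file], stdout=out, stderr=sp.PIPE)
                    if p.returncode != 0:
                        print(f"decompression failed: {compressed_file}", file=sys.stderr)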

### heavy chains

The download process takes quite a while (at least a few hours, potentially much longer depending on your internet connection) and produces a large amount of data:  

* compressed: **296GB**
* decompressed: **2.7TB**

Make sure the current directory has sufficient storage, or modify the `raw_download_dir` and `decompressed_dir` arguments to point to a location with sufficient storage.
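
One way to check the storage requirement before starting is to query free space with `shutil.disk_usage`. This is a minimal sketch: `has_free_space` and the two constants are hypothetical names, and the sizes above are read as decimal units (an assumption).

```python
import shutil


def has_free_space(path: str, required_bytes: int) -> bool:
    """Return True if the filesystem containing `path` has at least `required_bytes` free.

    `path` must already exist.
    """
    return shutil.disk_usage(path).free >= required_bytes


# Heavy chain sizes reported above, interpreted as decimal GB/TB (an assumption).
HEAVY_COMPRESSED_BYTES = 296 * 10**9      # 296 GB
HEAVY_DECOMPRESSED_BYTES = 2_700 * 10**9  # 2.7 TB
```

If both directories live on the same filesystem, `has_free_space('.', HEAVY_COMPRESSED_BYTES + HEAVY_DECOMPRESSED_BYTES)` checks whether the current directory can hold the compressed and decompressed files together.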

In [39]:
oas_downloader(
    oas_sh_file='./download_heavy.txt', 
    raw_download_dir='./data/raw/gz/heavy', 
    decompressed_dir='./data/raw/csv/heavy', 
    decompress=True,
    )

### light chains

The download process takes quite a while and produces a large amount of data:  

* compressed: **96GB**
* decompressed: **1.1TB**

Make sure the current directory has sufficient storage, or modify the `raw_download_dir` and `decompressed_dir` arguments to point to a location with sufficient storage.
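
Because the download runs for a long time, an interrupted run is a realistic possibility. The sketch below lists which files from a download script are still absent from the raw download directory. It assumes one `wget <url>` command per line and that files are saved under the basename of their URL; `missing_downloads` is a hypothetical helper, not part of AntiRef or `abutils`.

```python
import os
from typing import List


def missing_downloads(script_path: str, raw_download_dir: str) -> List[str]:
    """List URLs from the download script whose file is absent or empty locally."""
    missing = []
    with open(script_path) as handle:
        for line in handle:
            tokens = line.split()
            if not tokens:  # skip blank lines
                continue
            url = tokens[-1]
            local_file = os.path.join(raw_download_dir, os.path.basename(url))
            # treat a zero-byte file as missing, since a failed download
            # may leave an empty file behind
            if not os.path.isfile(local_file) or os.path.getsize(local_file) == 0:
                missing.append(url)
    return missing
```

For example, `missing_downloads('./download_light.txt', './data/raw/gz/light')` returns the URLs that would still need to be fetched. It does not verify that a non-empty file is complete.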

In [None]:
oas_downloader(
    oas_sh_file='./download_light.txt', 
    raw_download_dir='./data/raw/gz/light', 
    decompressed_dir='./data/raw/csv/light', 
    decompress=True,
    )