Preprint: https://www.biorxiv.org/content/10.1101/2020.08.22.262576v1
More info: https://altlabs.tech/projects/
To reproduce all figures, first clone the repository. Then request the data.
Next, download the conda environment manager, which comes with the required Python 3.6 distribution. This has only been tested on Linux, specifically Ubuntu 15.04 and 18.04.
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
Restart the current shell, then create and activate the provided environment. You may want to change the prefix at the bottom of the .yml file. For more info on installing a conda env from a file, see the conda docs (https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#sharing-an-environment).
Create the environment from the yml file:
conda env create -f attrib.yml
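If you do want to change the prefix, the edit can be scripted. The following is a minimal sketch, not part of the repository. It assumes the environment file contains a top-level prefix: line, as files written by conda env export do. The function name set_prefix, the example file contents, and both paths are hypothetical.

```python
def set_prefix(yml_text, new_prefix):
    """Return yml_text with its top-level 'prefix:' line pointing at new_prefix.

    If the text has no such line, one is appended at the end.
    """
    lines = yml_text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        if line.startswith("prefix:"):
            lines[i] = "prefix: " + new_prefix
            replaced = True
    if not replaced:
        lines.append("prefix: " + new_prefix)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical file contents and target path, for illustration only;
    # the real attrib.yml may look different.
    example = (
        "name: attrib\n"
        "dependencies:\n"
        "  - python=3.6\n"
        "prefix: /home/someone/miniconda3/envs/attrib\n"
    )
    print(set_prefix(example, "/home/me/miniconda3/envs/attrib"))
```

To apply it to the real file, read attrib.yml into a string, pass it through the function, and write the result back before running the command above.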
You are ready to go! Each notebook in the figures folder reproduces a figure subpanel, except where stated otherwise.
The other folder contains the code necessary to reproduce all training and analysis outside of the figures. It uses two environments: one is the same as the one used above for the figures (found in attrib.yml), and the other, described below, is for training deep learning models on GPUs. The latter environment is referred to as pytorch_training. In addition to the description below, each file should flag its environment in the header.
Its contents are as follows:
CNN
Code to train and run inference with the CNN architecture modeled on Nielsen and Voigt (2018). Env:pytorch_training
blast
Configure a blast database and search it for matches to produce the blast baseline. Note that this requires the blast command line tool to be installed (https://www.ncbi.nlm.nih.gov/books/NBK279690/). Env:attrib
bpe
Learn an encoding of the training set sequences using byte pair encoding and unigram encoding. Env:pytorch_training
calibration
Learn the temperature scaling parameter for recalibrating the deteRNNt model. Env:pytorch_training
countries
Train a random forest to predict nation of origin. Env:attrib
deteRNNt
Training and inference with the deteRNNt model. Training proceeds in steps noted in the file names. Env:pytorch_training
lineages
Train a random forest to predict ancestor lab. Env:attrib
lineages_and_split
Train test split and parse the lineages. Env:attrib
score
Compute the accuracies, top10 accuracies, and calibration curves for all models. Also produces Figure 2. Env:attrib
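For readers unfamiliar with the metrics named above, here is a generic sketch of the standard definitions of temperature scaling and top-k accuracy. It is not the repository's implementation, which may differ in detail; the function names and the toy numbers are made up for illustration, and it uses only the Python standard library.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature; temperature > 1 flattens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_accuracy(scores, labels, k=10):
    """Fraction of examples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


if __name__ == "__main__":
    # Toy logits for 3 examples over 4 classes (made-up numbers).
    logits = [
        [2.0, 1.0, 0.1, -1.0],
        [0.2, 0.1, 3.0, 0.0],
        [1.0, 2.5, 0.3, 2.0],
    ]
    labels = [0, 1, 3]
    probs = [softmax_with_temperature(row, temperature=1.5) for row in logits]
    print("top-1:", top_k_accuracy(probs, labels, k=1))
    print("top-2:", top_k_accuracy(probs, labels, k=2))
```

Dividing by a positive temperature does not change the ranking of classes, so it leaves accuracy and top-k accuracy untouched and only changes how confident the predicted probabilities are. That is why recalibration can be learned separately from training.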
Most of the code flagged for use with the pytorch_training environment was run on Amazon Web Services (AWS). This README assumes AWS access; please feel free to open an issue if you need help recreating the training environment on another machine or cloud service.
- Launch an AWS instance. Most of the training can be accomplished on p2.xlarge instances, and some code (e.g. calibration and bpe) does not need GPUs.
- Select the Deep Learning AMI for Ubuntu (this analysis used AMI ami-0f9e8c4a1305ecd22).
- Connect to the instance and download the data and code as described above.
- Start the conda environment: source activate pytorch_p36
- Install ray: pip install -U ray && pip install -U ray[debug]
- Install sentencepiece for BPE tasks: pip install sentencepiece
- Ensure that the pandas version can read the pickled dataframes: conda install -y pandas=0.24.1
- Run the desired code!
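Before launching a long training run, it can help to confirm that the packages installed above are visible from the active environment. The following is a small sanity-check sketch, not part of the repository. The helper name check_packages is hypothetical, and it reads each package's __version__ attribute, falling back to "unknown" where a package does not define one.

```python
import importlib
import importlib.util


def check_packages(names):
    """Map each package name to its installed version, or None if it cannot be found."""
    report = {}
    for name in names:
        if importlib.util.find_spec(name) is None:
            report[name] = None
            continue
        module = importlib.import_module(name)
        report[name] = getattr(module, "__version__", "unknown")
    return report


if __name__ == "__main__":
    # Packages named in the setup steps; pandas should report 0.24.1.
    for name, version in check_packages(["pandas", "ray", "sentencepiece"]).items():
        print(name, version if version is not None else "NOT INSTALLED")
```

Run it inside the activated environment. A NOT INSTALLED line, or a pandas version other than the one pinned above, points to a setup step that did not take effect.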