Deconvoluting the Diversity of Within-host Pathogen Strain in a MLST Framework
- We are using Python 2.7. You will only need the folder BorreliaPipeline to run the pipeline. This can be done in the following
git init git remote add origin https://github.com/WGS-TB/MLST git fetch origin git checkout origin/master -- BorreliaPipeline
SRA toolkit: To download sample reads. Please download the toolkit here: https://www.ncbi.nlm.nih.gov/sra/docs/toolkitsoft/. You will need
fastq-dumpto download the sample reads. Our samples are downloaded using
fastq-dumpinto your bin folder (ours: /usr/local/bin) or provide the path to
download_samples.pyscript. The python script calls the
fastq-dumpcommand to download sample reads.
Bowtie v0.12.7: For read mapping. Please download it here: https://sourceforge.net/projects/bowtie-bio/files/bowtie/0.12.7/. You will need both
bowtie-build, copy these into your bin folder or provide the path to the folder containing
mapSamples.pyscript. The python script calls
bowtie-buildcommands to map reads.
CPLEX Python API: For solving ILP. You will need academic license to download CPLEX. Once you installed CPLEX, follow the instructions here to install CPLEX-Python modules: https://www.ibm.com/support/knowledgecenter/SSSA5P_12.5.1/ilog.odms.cplex.help/CPLEX/GettingStarted/topics/set_up/Python_setup.html
ART: For simulating reads. Please download it here: https://www.niehs.nih.gov/research/resources/software/biostatistics/art/. You will need
art_illumina, copy this into your bin file or provide the path for
run_sim.py. This script runs the allele simulation.
Samtools v1.3.1: For extracting information from sam files. Please download it here: https://sourceforge.net/projects/samtools/files/samtools/1.3.1/. You will need
samtools, copy it into your bin folder or provide the path to
mapSamples.py. The python script calls the
samtoolscommand while extracting information from sam files after mapping.
In computing edit distance, we used the package by https://github.com/aflc/editdistance and we hereby acknowledge Hiroyuki Tanaka, where the algorithm is proposed by Heikki Hyyrö, "Explaining and extending the bit-parallel approximate string matching algorithm of Myers", (2001). Possible Python packages you will need: numpy, pandas, matplotlib, linecache, argparse . Please use
pip installto install these packages if you do not have any of the packages.
We also use the package PyEMD for EMD calculations, where we hereby acknowledge Gary Doran for his PyEMD package: https://github.com/garydoranjr/pyemd.
We are not sure if we cover all the Python packages here, please let us know if we miss any!
Instructions to run the pipeline
The required scripts and files are in
BorreliaPipelinefolder (except for sample reads as the files are big).
python download_samples.py [-h] [-c CMD]in the
pipelinefolder to download the samples' reads (downloaded samples are based on the SRR_Acc_List.txt).
download_samples.pyhas an optional argument
-c, which sets the path for
fastq-dumpcommand needed to download the reads for Borrelia samples. By default, it assumes
fastq-dumpis in your bin folder. After running this script, it will create a
datafolder which contains folders for each sample.
Since we are using Bowtie to map the sample reads, we run
python mapSamples.py [-h] [-c NUMOFCORES] [-b BOWTIE] [-s SAMTOOLS]to map the sample reads.
-cspecifies the number of core for running Bowtie, default is 4.
-bspecifies the path to the folder containing two commands
bowtie-build, default assumes these two commands are in the user's bin folder.
-sspecifies the path of
samtools, default assumes the command is in user's bin folder. It will be easier to understand with a working example:
python mapSamples.py -c 2 -b ../bowtieFolder/ -s ../../samtools, for example. The script will create a
variantsAndPropfolder containing a folder for each sample, where each of these folder contains a list of
sampleX_geneY_paired_reads.txttext files, in which the text file contains information needed for our first stage ILP pipeline. These information are extracted from
To solve the allele diversity problem, run the following script in the pipeline folder:
python alleleDiversity.py [-h] [-s SAMPLE]
If you want to run on a specific sample, you can indicate using
-s. Otherwise it will run on all samples in
variantsAndProp. For example, you could run
python alleleDiversity.py -s SRR2034333. This script will output some intermediate results to the screen and create 8 CSV files (for each locus) in each of the sample folder in
variantsAndProp, named as
geneX_proportions.csv which contains the alleles and their proportions identified at
The script will output some intermediate results for you to keep track.
- To solve the global strain diversity problem, run the following script:
python globalStrainDiversity.py [-h] [-o OUTPUT] [-oc OBJECTIVECOMPONENT] [-timelim TIMELIMIT] [-g GAP] [-pathToDistMat PTDM]
-o, the folder name for the results to be stored. Default will be
-oc argument specifies different formulation of the objective function, 'all' includes all components of the objective function, 'noPropAndErr' omits the proportion and error component.
-g specifies the relative gap in % to stop the solver, default is 5%.
-timelim specifies the maximum time limit in seconds before stopping the solver, default is 600 seconds. The program will stop if and only if
-timelim are both satisfied. For example, if
-g is 5 and
timelim is 600, if the relative gap is above 5% and the 600 seconds have passed, the program will not stop until gap of 5% is reached.
-pathToDistMat is the folder name containing the distance matrices for each gene, which is provided and named 'editDist'.
The script will output some intermediate results for you to keep track. It will create a new folder in your current directory with name specified by
-o. The folder contains a list of
sampleX_strainsAndProportions.csv with sampleX corresponds a particular sample, and the csv file contains the strains and their proportions identified in sampleX.
Instructions to run the allele diversity ADP simulation
The required scripts and files to run the simulation are in the
alleleSimulationfolder in the
alleleSimulationfolder, you only have to run
python run_sim.py [-h] [-i NUMOFITER] [-f SIMULATIONRESULTFOLDER] [-c COVERAGE] [-e EDITDIST] [-fp FULLPAPER] [-b BOWTIE] [-s SAMTOOLS] [-a ART]
-i specifies the number of simulations on each gene, default is 40.
-f specifies the name of the folder to store the results, default is simulation_results.
-c specifies the coverage to test on, default is 30.
-b specifies the path to the folder containing
bowtie-build commands, default assumes that both commands are in user's bin folder.
-s specifies the path to
samtools, default assumes it's in your bin folder.
-a specifies the path to
art_illumina, default assumes that it's in the user's bin folder.
-fp set to True means we run a 2D parameters experiments(coverage and editDist), which is presented in our paper. In the results folder, it will contain .csv and .png files representing different statistics.
- You will need to install
kallistoand the link is provided here: https://pachterlab.github.io/kallisto/download. To run simulation using kallisto, run
run_kallisto_sim.py [-h] [-i NUMOFITER] [-f SIMULATIONRESULTFOLDER] [-c COVERAGE] [-a ART] [-k KAL] [-fp FULLPAPER]
Instructions are similar to
-k refers to the path to
kallisto and default assumes it is in user's bin folder.
- To run novel allele experiment, run
python run_loo_sim.py [-h] [-i NUMOFITER] [-f SIMULATIONRESULTFOLDER] [-c COVERAGE] [-b BOWTIE] [-s SAMTOOLS] [-a ART]
Instructions are the same as running
Instructions to run the SDP and full pipeline simulation
- In the
strainAndFullPipeline_simcontains the scripts and files to run the SDP simulation(with no reads involved) and full pipeline simulation for our method.
strainEST_comparecontains the scripts to run the full pipeline simulation which compares with strainEST. The link to install strainEST is here: https://github.com/compmetagen/strainest. Details about running these simulations are written in separate README files instead in each of the folder.
For Minimum Spanning Tree Analysis
A sample script to compute the MST is in the
pipeline folder and can be altered for your own analysis.