Code to generate synthetic labelled metagenomic training data for the AMRtime metagenomic AMR prediction project.
ART read simulator available here or via conda (conda install -c bioconda art)
SeqAn C++ library available here or via your package manager.
CMake available here or via your package manager.
To run unit tests: googletest C++ library available here or via your package manager.
The build uses the cmake configuration tool for out-of-source builds:
cmake ..
If your SeqAn library is in an unusual location, or cmake is missing the module to find it, you may need to specify the path to the library include directory, e.g. if it is in ~/conda/envs/AMRtime/include:
cmake -DSEQAN_INCLUDE_DIRS=~/conda/envs/AMRtime/include ..
To build with unit tests you need the googletest library and must enable the tests when running cmake.
As with SeqAn, if the googletest library is installed in an unusual location you may need to specify its path when running cmake.
To run the unit tests:
See amrtime generate_training --help for detailed usage.
amrtime generate_training [options] path_tsv_specifying_inputs
To use amrtime generate_training, a TSV file must be provided that lists
the input genomes, their RGI (v4+) annotation TSV files, and the
copy number for each genome. The copy number is an integer and
can be used to simulate the relative abundance of source genomes
in the synthetic metagenome.
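As a rough illustration of how integer copy numbers translate into relative abundance, the sketch below computes each genome's share of all genome copies. It assumes abundance is counted per genome copy and ignores genome length; the function name and the genome file names are hypothetical and not part of AMRtime.

```python
from fractions import Fraction


def relative_abundance(copy_numbers):
    """Return each genome's share of all genome copies (assumption: per-copy abundance)."""
    total = sum(copy_numbers.values())
    if total <= 0:
        raise ValueError("at least one genome needs a positive copy number")
    return {name: Fraction(count, total) for name, count in copy_numbers.items()}


# Hypothetical genomes with copy numbers 1, 3 and 6.
shares = relative_abundance(
    {"genome_a.fasta": 1, "genome_b.fasta": 3, "genome_c.fasta": 6}
)
for name, share in shares.items():
    print(f"{name}\t{float(share):.2f}")
```

Under this assumption, a genome with copy number 6 out of 10 total copies contributes 60% of the genome copies in the synthetic metagenome.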
Each line of this TSV file must have the format:
path_to_genome_fasta \t path_to_annotation_tsv_for_genome \t genome_copy_number
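A minimal sketch of writing such an input file with Python's csv module is shown below. The helper name, the example paths and the requirement that copy numbers be at least 1 are assumptions made for illustration; the README only states that copy numbers are integers.

```python
import csv
import tempfile
from pathlib import Path


def write_input_tsv(rows, tsv_path):
    """Write (genome_fasta, annotation_tsv, copy_number) rows as tab-separated lines."""
    with open(tsv_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for genome, annotation, copies in rows:
            # Assumption: copy numbers must be integers >= 1.
            if not isinstance(copies, int) or copies < 1:
                raise ValueError(f"copy number for {genome} must be an integer >= 1")
            writer.writerow([genome, annotation, copies])


# Hypothetical genome and RGI annotation paths.
rows = [
    ("genomes/a.fasta", "rgi/a.tsv", 1),
    ("genomes/b.fasta", "rgi/b.tsv", 5),
]
# Written to a temporary directory here; any writable path would do.
tsv_path = Path(tempfile.mkdtemp()) / "inputs.tsv"
write_input_tsv(rows, tsv_path)
print(tsv_path.read_text(), end="")
```

The resulting file is what would be passed as the positional argument to amrtime generate_training.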
On a successfully completed run, this tool generates the following outputs:
output_synthetic_metagenome.fasta - the synthetic 'assembled' metagenome built from the input genome contigs, each copied the specified number of times.
output.fq - the full set of simulated reads with the specified sequencing error profile, fold coverage and read length.
output.sam - the SAM file specifying where the simulated reads were sampled from in the original genomes.
output_errFree.sam - the same SAM file but containing reads free of simulated sequencing error.
output_labels.tsv - the output labels in the format of a TSV file with the columns:
overlap length. Where there is no overlap between the read and an annotation these will all be
Finally, if the clean option is enabled, two additional files will be generated containing only data from reads that are labelled:
output_clean.fq - only those reads that are labelled with an RGI annotation.
output_clean_labels.tsv - only the labels associated with those reads.
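To check that a run produced everything listed above, a small helper along the following lines could be used. It is only a sketch: the file names are taken from this list, while the function names and the idea that the "output" prefix and the output directory are configurable are assumptions.

```python
from pathlib import Path

# Suffixes taken from the output file names listed above.
CORE_SUFFIXES = [
    "_synthetic_metagenome.fasta",
    ".fq",
    ".sam",
    "_errFree.sam",
    "_labels.tsv",
]
# Extra files expected only when the clean option is enabled.
CLEAN_SUFFIXES = ["_clean.fq", "_clean_labels.tsv"]


def expected_outputs(directory, prefix="output", clean=False):
    """List the output paths expected in `directory` for a given prefix."""
    suffixes = CORE_SUFFIXES + (CLEAN_SUFFIXES if clean else [])
    return [Path(directory) / f"{prefix}{suffix}" for suffix in suffixes]


def missing_outputs(directory, prefix="output", clean=False):
    """Return the expected output paths that do not exist as files."""
    return [p for p in expected_outputs(directory, prefix, clean) if not p.is_file()]


for path in expected_outputs(".", clean=True):
    print(path.name)
```

Calling missing_outputs on the run directory after the tool finishes returns an empty list when every expected file is present.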
First, either pull the pre-built container from Docker Hub:
docker pull finlaymaguire/amrtime
Alternatively, build it yourself:
git clone https://github.com/beiko-lab/AMRtime
docker build -t finlaymaguire/amrtime AMRtime
Then you can mount a local directory and run interactively within the container, e.g. to mount a local directory called amrtime_data in your current directory to /data within the container and then start a bash shell in the container:
docker run -i -v $PWD/amrtime_data:/data -t --entrypoint /bin/bash finlaymaguire/amrtime