How Low Can You Go? Short-read polishing of Oxford Nanopore bacterial genome assemblies Code Repository
This repository holds code and data for this manuscript:
Bouras G, Judd LM, Edwards RA, Vreugde S, Stinear TP, Wick RR. How low can you go? Short-read polishing of Oxford Nanopore bacterial genome assemblies. Microbial Genomics. 2024. doi:10.1099/mgen.0.001254.
Contents:
figures
: contains all of the manuscript's main and supplementary figures along with their captions.

supp_tables.xlsx
: contains the paper's supplementary tables.

ont_assemblies
: contains the ONT-only Trycycler assemblies used as a starting point for polishing.

reference_assemblies
: contains the polished and manually curated assemblies used as a ground truth.

pypolca_example_plot
: contains code to simulate reads and errors and to make the example plot (Figure 1).

main_analysis
: contains the read subsampling, polishing and plotting commands for the main analysis (Figures 2, S2, S3 and S7).

errors_in_repeats
: contains the details of the errors-in-repeats analysis (Figure S1).

long_homopolymer
: contains the details of the long-homopolymer analysis (Figure S4).

error_characterisation
: contains the detailed error characterisation of the 37 existing errors and all polisher-introduced errors (Table S2 and Figures S5 and S6).

hybracter_analysis
: contains the read subsampling, assembly and plotting commands for the Hybracter analysis in the paper (Figures S8 and S9).

reference_chromosome_assemblies_hybracter
: contains the polished and manually curated assemblies used as a ground truth, chromosomes only. Used for the Hybracter analysis.

low_quality_draft
: contains the details of the polishing analysis using low-quality draft assemblies (Figure S10).

parameter_sweep
: contains the details of the low-depth parameter sweep analysis (Table S6).

compare_assemblies.py
: assembly comparison script used for counting/characterising errors.

hapog
: contains additional figure panels for Hapo-G results (produced after the manuscript was published).
ONT and Illumina reads are not included in this repository due to size, but they can be found on SRA:
Genome | ONT reads | Illumina reads |
---|---|---|
Campylobacter jejuni (ATCC-33560) | SRR27638397 | SRR26899120 |
Campylobacter lari (ATCC-35221) | SRR27638396 | SRR26899115 |
Escherichia coli (ATCC-25922) | SRR27638398 | SRR26899128 |
Listeria ivanovii (ATCC-19119) | SRR27638399 | SRR26899136 |
Listeria monocytogenes (ATCC-BAA-679) | SRR27638394 | SRR26899101 |
Listeria welshimeri (ATCC-35897) | SRR27638395 | SRR26899109 |
Salmonella enterica (ATCC-10708) | SRR27638402 | SRR26899135 |
Vibrio cholerae (ATCC-14035) | SRR27638401 | SRR26899095 |
Vibrio parahaemolyticus (ATCC-17802) | SRR27638400 | SRR26899141 |
These can be downloaded with the fastq-dl program, e.g. for the ONT reads:
CPUS=16
fastq-dl --accession SRR27638397 --cpus $CPUS
fastq-dl --accession SRR27638396 --cpus $CPUS
fastq-dl --accession SRR27638398 --cpus $CPUS
fastq-dl --accession SRR27638399 --cpus $CPUS
fastq-dl --accession SRR27638394 --cpus $CPUS
fastq-dl --accession SRR27638395 --cpus $CPUS
fastq-dl --accession SRR27638402 --cpus $CPUS
fastq-dl --accession SRR27638401 --cpus $CPUS
fastq-dl --accession SRR27638400 --cpus $CPUS
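The commands above fetch only the ONT accessions. As a minimal sketch, the same pattern can be looped over both columns of the table to fetch the Illumina reads as well. The accessions are copied from the table; the `download_reads` function and the `DRY_RUN` switch are hypothetical helpers introduced here for illustration and are not part of fastq-dl. By default the loop only prints the commands it would run.

```shell
# Accessions copied from the SRA table above (same genome order in both lists).
ONT_READS="SRR27638397 SRR27638396 SRR27638398 SRR27638399 SRR27638394 SRR27638395 SRR27638402 SRR27638401 SRR27638400"
ILLUMINA_READS="SRR26899120 SRR26899115 SRR26899128 SRR26899136 SRR26899101 SRR26899109 SRR26899135 SRR26899095 SRR26899141"
CPUS=16

# DRY_RUN is a hypothetical switch for this sketch, not a fastq-dl option.
# 1 (default) prints each command; 0 actually runs fastq-dl.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: download (or print the command for) each accession given.
download_reads() {
    for accession in "$@"; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "fastq-dl --accession $accession --cpus $CPUS"
        else
            fastq-dl --accession "$accession" --cpus "$CPUS"
        fi
    done
}

# Word splitting of the two lists is intentional here.
download_reads $ONT_READS $ILLUMINA_READS
```

Setting `DRY_RUN=0` before running the script performs the downloads, which requires fastq-dl to be installed and on the `PATH`.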