plasgraph2-datasets

This repository contains training and testing sets for the PlASgraph2 tool.

Each set consists of:

A .csv file listing all assemblies (e.g. eskapee-test.csv) with columns: path to gfa.gz file, path to gfa.csv file and an identifier of the sample
A folder with files for each assembly (more detailed description of both files is below):
- gfa.gz file is a GFA file with the assembly graph, compressed by gzip
- gfa.csv is a file with the correct classification of each contig
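
As a rough illustration of the layout above, the following Python sketch reads a set's list file and returns one tuple per assembly. The function names `read_assembly_list` and `open_gfa` are hypothetical, and the sketch assumes the list file has no header row (pass `has_header=True` if it does) and that the three columns appear in the order given above.

```python
import csv
import gzip


def read_assembly_list(list_path, has_header=False):
    """Return (gfa_path, labels_path, sample_id) tuples from a set's list file.

    Assumption: columns are in the order described in this README and the
    file has no header row unless has_header is set.
    """
    rows = []
    with open(list_path, newline="") as handle:
        reader = csv.reader(handle)
        if has_header:
            next(reader, None)  # skip the header line
        for row in reader:
            if len(row) < 3:
                continue  # skip blank or malformed lines
            rows.append((row[0].strip(), row[1].strip(), row[2].strip()))
    return rows


def open_gfa(gfa_path):
    """Open a gzip-compressed GFA file as text."""
    return gzip.open(gfa_path, "rt")
```

For example, `read_assembly_list("eskapee-test.csv")` would give the paths needed to open each assembly graph and its classification file.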

GFA assembly graph files

PlASgraph2 was developed with GFA files produced by Unicycler and SKESA. In principle, plASgraph2 should be usable with other assemblers that use the GFA format. However, one of the features is the read coverage of a node, which is currently obtained from GFA files as follows:

If nodes have the dp tag, containing the normalized read depth computed by Unicycler, it is used as the read depth of the node.
If nodes have the KC tag, containing the k-mer count reported by SKESA, its value is divided by the length of the sequence corresponding to the node.

In both cases, the coverage is normalized by dividing it by the weighted mean of the coverages of all nodes. As a result, chromosome contigs are expected to have coverage close to 1. However, for Unicycler this step does not change the values much because a similar procedure was already done.
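
The coverage rules above can be sketched in Python as follows. This is not the plASgraph2 implementation: the function names are hypothetical, and the sketch assumes that the weighted mean uses node lengths as weights, which the text above does not state. Node length is taken from the sequence field, or from the standard GFA LN tag when the sequence is given as `*`.

```python
def raw_coverage(fields):
    """Return (coverage, length) for a tab-split GFA segment line.

    coverage is None when the node has neither a dp nor a KC tag.
    """
    sequence = fields[2]
    tags = {}
    for tag in fields[3:]:
        parts = tag.split(":", 2)  # TAG:TYPE:VALUE
        if len(parts) == 3:
            tags[parts[0]] = parts[2]
    if sequence != "*":
        length = len(sequence)
    else:
        length = int(tags.get("LN", 0))
    if "dp" in tags:
        return float(tags["dp"]), length
    if "KC" in tags and length > 0:
        return float(tags["KC"]) / length, length
    return None, length


def normalized_coverages(gfa_lines):
    """Map node name to coverage divided by the weighted mean coverage.

    Assumption: weights are node lengths.
    """
    raw = {}
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "S" or len(fields) < 3:
            continue  # only segment lines describe nodes
        coverage, length = raw_coverage(fields)
        if coverage is not None:
            raw[fields[1]] = (coverage, length)
    total_length = sum(length for _, length in raw.values())
    if total_length == 0:
        return {}
    mean = sum(cov * length for cov, length in raw.values()) / total_length
    if mean == 0:
        return {name: cov for name, (cov, _) in raw.items()}
    return {name: cov / mean for name, (cov, _) in raw.items()}
```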

For other assemblies, make sure that the GFA contains a dp or KC tag with a similar meaning. If the assembler does not provide this information, you can align the source reads to the assembly and label the nodes with read coverage obtained from the alignments.
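
If read depth per node has been computed separately (for example from read alignments), one way to attach it to the graph is to append a dp tag to each segment line. The sketch below assumes a hypothetical dictionary `depth_by_node` mapping node names to depth values; how those values are computed is outside its scope, and the output format has not been checked against plASgraph2 itself.

```python
def add_dp_tags(gfa_lines, depth_by_node):
    """Yield GFA lines, appending dp:f:<depth> to segments lacking dp or KC.

    depth_by_node is a hypothetical mapping from node name to read depth.
    """
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        is_segment = fields[0] == "S" and len(fields) >= 3
        if is_segment and fields[1] in depth_by_node:
            has_depth = any(f.startswith(("dp:", "KC:")) for f in fields[3:])
            if not has_depth:
                fields.append(f"dp:f:{depth_by_node[fields[1]]:.4f}")
        yield "\t".join(fields)
```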

Classification CSV files

The CSV file with correct classification should contain the following columns (plus any other optional columns):

contig: contig id from the gfa.gz file
label: one of the strings chromosome, plasmid, ambiguous or unlabeled. Label ambiguous means that the contig should be correctly classified as both a chromosome and a plasmid (e.g. a transposon present in both molecules of the given sample) and unlabeled means that the correct label is unknown, e.g. due to short contig size.
length: the contig length (used for evaluation only, not needed for training)
chrom_score: for the gold-standard answer, this should be 1 for chromosome and ambiguous and 0 for plasmid and unlabeled (if omitted, the score will be filled in according to label). For predictions of actual tools, the score should be the confidence that the contig belongs to the chromosome class.
plasmid_score: should be 1 for plasmid and ambiguous and 0 for chromosome and unlabeled; otherwise analogous to chrom_score.
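
The relationship between label and the two scores in a gold-standard file can be written down as a small lookup table. The Python sketch below fills in missing scores for a single row represented as a dictionary keyed by column name; `LABEL_SCORES` and `fill_scores` are hypothetical names, not part of plASgraph2.

```python
# (chrom_score, plasmid_score) for each gold-standard label
LABEL_SCORES = {
    "chromosome": (1, 0),
    "plasmid": (0, 1),
    "ambiguous": (1, 1),
    "unlabeled": (0, 0),
}


def fill_scores(row):
    """Return a copy of row with missing or empty scores derived from label."""
    chrom, plasmid = LABEL_SCORES[row["label"]]
    filled = dict(row)
    if filled.get("chrom_score") in (None, ""):
        filled["chrom_score"] = chrom
    if filled.get("plasmid_score") in (None, ""):
        filled["plasmid_score"] = plasmid
    return filled
```

Rows that already carry scores, such as predictions from a tool, are left unchanged.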

ESKAPEE training and testing sets

ESKAPEE species are Enterococcus faecium, Staphylococcus aureus, Klebsiella pneumoniae, Acinetobacter baumannii, Pseudomonas aeruginosa, Enterobacter spp. and Escherichia coli.

eskapee-train: The training set of samples from ESKAPEE species (internally split to training and validation). It contains 70 samples, 10 samples from each species. For each sample there are 2 assemblies (Unicycler and SKESA).
eskapee-test: The testing set of samples from ESKAPEE species, 112 samples in total (E.fae. 2, S.aur. 31, K.pne. 46, A.bau. 5, P.aer. 5, E.spp 15, E.col 8)

Other testing sets

cfre-test: Citrobacter freundii, 50 samples
efer-test: Escherichia fergusonii, 50 samples
koxy-test: Klebsiella oxytoca, 31 samples
myco-test: mycobacteria (genera Mycobacterium, Mycobacteroides, Mycolicibacterium, Mycolicibacter), 30 samples
sent-test: Salmonella enterica, 29 samples

Summary tables

reference_genomes.csv contains the list of chromosome references (roughly one per species used in the study) used in our paper to aid gold-standard annotation of contigs in hybrid assemblies. The CSV file has two columns: the accession number of the sequence and a description of the sequence.
all_samples.csv contains the list of all samples in all our sets. Columns:
- dataset: which dataset uses this sample
- our_id: our sample id, which is used with the suffix -s or -u based on the assembler used
- sample_id: typically the NCBI/ENA/DDBJ accession of the bacterial sample, except for samples from Arredondo-Alonso et al. 2018, where their ID is used
- short_reads: accession of short reads in SRA. Missing in samples from Boostrom et al. 2022, where short reads were provided by the authors
- long_reads: accession of long reads in SRA. Missing in samples from Arredondo-Alonso et al. 2018, where long reads were provided by the authors at figshare
- long_reads_type: Nanopore or PacBio
- reference: source of the sample
- species: bacterial species
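
A minimal Python sketch for working with all_samples.csv is shown below. It assumes the file has a header row using the column names listed above, and that the -s suffix corresponds to SKESA and -u to Unicycler; that mapping is inferred from the assembler names and is not stated explicitly here. Both function names are hypothetical.

```python
import csv
from collections import Counter


def samples_per_dataset(all_samples_path):
    """Count rows per value of the dataset column (assumes a header row)."""
    with open(all_samples_path, newline="") as handle:
        return Counter(row["dataset"] for row in csv.DictReader(handle))


def assembly_ids(our_id):
    """Build per-assembler ids; the suffix-to-assembler mapping is assumed."""
    return {"skesa": f"{our_id}-s", "unicycler": f"{our_id}-u"}
```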
