Kover Dataset Format
Kover datasets are HDF5 files. They are structured as follows:
HDF5 "example.kover" {
FILE_CONTENTS {
group /
dataset /genome_identifiers
dataset /kmer_by_matrix_column
dataset /kmer_matrix
dataset /kmer_sequences
dataset /phenotype
group /splits
group /splits/my_split_name_1
dataset /splits/my_split_name_1/test_genome_idx
dataset /splits/my_split_name_1/train_genome_idx
dataset /splits/my_split_name_1/unique_risk_by_anti_kmer
dataset /splits/my_split_name_1/unique_risk_by_kmer
dataset /splits/my_split_name_1/unique_risks
group /splits/my_split_name_2
dataset /splits/my_split_name_2/test_genome_idx
dataset /splits/my_split_name_2/train_genome_idx
dataset /splits/my_split_name_2/unique_risk_by_anti_kmer
dataset /splits/my_split_name_2/unique_risk_by_kmer
dataset /splits/my_split_name_2/unique_risks
group /splits/my_split_name_2/folds
group /splits/my_split_name_2/folds/fold_1
dataset /splits/my_split_name_2/folds/fold_1/test_genome_idx
dataset /splits/my_split_name_2/folds/fold_1/train_genome_idx
dataset /splits/my_split_name_2/folds/fold_1/unique_risk_by_anti_kmer
dataset /splits/my_split_name_2/folds/fold_1/unique_risk_by_kmer
dataset /splits/my_split_name_2/folds/fold_1/unique_risks
group /splits/my_split_name_2/folds/fold_2
dataset /splits/my_split_name_2/folds/fold_2/test_genome_idx
dataset /splits/my_split_name_2/folds/fold_2/train_genome_idx
dataset /splits/my_split_name_2/folds/fold_2/unique_risk_by_anti_kmer
dataset /splits/my_split_name_2/folds/fold_2/unique_risk_by_kmer
dataset /splits/my_split_name_2/folds/fold_2/unique_risks
}
}
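The listing above can also be produced programmatically. The following is a minimal sketch, not part of Kover: `list_paths` is a hypothetical helper that walks any mapping-like hierarchy. It assumes that group-like nodes expose `items()` and dataset-like nodes do not, which is how `h5py` groups and datasets behave. Nested dicts stand in for the file here so the sketch runs without any dependency.

```python
def list_paths(group, prefix=""):
    """Yield (kind, path) pairs for every group and dataset under `group`.

    Assumption: anything that has an `items` method is treated as a group,
    everything else as a dataset.
    """
    for name, node in group.items():
        path = f"{prefix}/{name}"
        if hasattr(node, "items"):
            yield ("group", path)
            # Recurse into the sub-group, extending the path prefix.
            yield from list_paths(node, path)
        else:
            yield ("dataset", path)


# Hypothetical stand-in for a small Kover file (lists play the role of datasets).
example = {
    "genome_identifiers": ["genome_a", "genome_b"],
    "splits": {"my_split_name_1": {"test_genome_idx": [1], "train_genome_idx": [0]}},
}

for kind, path in list_paths(example):
    print(kind, path)
```

With `h5py` installed, passing an `h5py.File` opened in read mode on a `.kover` file instead of `example` should give the same kind of listing.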
The root group (/) has the following attributes:
- compression: the level of gzip compression (0-9)
- created:
- genome_source: the source for the genomic data (e.g.: path to data file)
- genome_source_type: the type of genomic data (e.g.: tsv)
- phenotype_metadata_source: the source for the metadata (e.g.: path to data file)
- phenotype_name: a user specified name for the phenotype
- uuid: a unique identifier for the kover dataset
The genome_identifiers dataset contains the identifier of each genome.
The kmer_by_matrix_column dataset contains the index of the k-mer sequence (in dataset:kmer_sequences) that is associated with each column of the k-mer matrix.
The kmer_matrix dataset gives the presence/absence (1 and 0, respectively) of each k-mer in each genome. The binary values in each column are packed into integers, so the number of rows should be the size of dataset:genome_identifiers divided by the number of bits in the integers used for packing, rounded up. The number of columns should match the size of dataset:kmer_sequences and dataset:kmer_by_matrix_column.
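The bit packing of the k-mer matrix can be illustrated with a short sketch. This is not Kover's implementation: the helper names are hypothetical, and the bit layout is an assumption (genome `i` is stored in packed row `i // bits_per_int`, at bit position `i % bits_per_int`, least significant bit first). The actual bit order and integer width used by Kover should be checked before relying on this.

```python
import math


def packed_row_count(n_genomes, bits_per_int=64):
    """Number of packed rows needed to hold one bit per genome."""
    return math.ceil(n_genomes / bits_per_int)


def unpack_column(packed_column, n_genomes, bits_per_int=64):
    """Recover the 0/1 presence values of one k-mer for every genome.

    `packed_column` is the list of integers found in one matrix column.
    ASSUMPTION: least significant bit first within each integer.
    """
    values = []
    for genome_idx in range(n_genomes):
        word = packed_column[genome_idx // bits_per_int]
        values.append((word >> (genome_idx % bits_per_int)) & 1)
    return values


# Five genomes packed into 4-bit integers need two rows.
print(packed_row_count(5, bits_per_int=4))
print(unpack_column([0b0110, 0b0001], n_genomes=5, bits_per_int=4))
```

Packing reduces storage by the width of the integer: with 64-bit integers, 130 genomes occupy 3 rows instead of 130.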
The kmer_sequences dataset gives the sequence of each k-mer in the dataset. Its size should match the number of columns in dataset:kmer_matrix and the size of dataset:kmer_by_matrix_column.
The phenotype dataset gives the label (0 or 1) assigned to each genome. Its size should match the size of dataset:genome_identifiers.
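The size constraints stated for the top-level datasets can be turned into a simple consistency check. The sketch below is illustrative only: `check_dimensions` is a hypothetical helper, and its arguments are plain integers that the caller is assumed to have read from the shapes of the corresponding datasets.

```python
def check_dimensions(n_genomes, n_labels, n_matrix_columns,
                     n_kmer_sequences, n_column_indices):
    """Return a list of problems; an empty list means the sizes agree."""
    problems = []
    if n_labels != n_genomes:
        problems.append(
            f"phenotype has {n_labels} labels but there are {n_genomes} genomes"
        )
    if n_matrix_columns != n_kmer_sequences:
        problems.append(
            f"kmer_matrix has {n_matrix_columns} columns but "
            f"kmer_sequences has {n_kmer_sequences} entries"
        )
    if n_matrix_columns != n_column_indices:
        problems.append(
            f"kmer_matrix has {n_matrix_columns} columns but "
            f"kmer_by_matrix_column has {n_column_indices} entries"
        )
    return problems


# A consistent dataset yields no problems.
print(check_dimensions(10, 10, 5, 5, 5))
```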
The splits group contains one sub-group for each partition of the dataset (split) that has been defined by the user. Each partition sub-group has the following attributes:
- n_folds: The number of cross-validation folds contained in the split.
- random_seed: The random seed that was used for data partitioning.
- test_proportion: The proportion of examples that are in the testing set.
- train_proportion: The proportion of examples that are in the training set.
The test_genome_idx dataset contains the indices of the genomes, with respect to dataset:kmer_matrix and dataset:genome_identifiers, that belong to the testing set for the split.
The train_genome_idx dataset contains the indices of the genomes, with respect to dataset:kmer_matrix and dataset:genome_identifiers, that belong to the training set for the split.
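As an illustration of how the index datasets are meant to be used, the sketch below selects the identifiers and labels of the training and testing genomes. `split_view` is a hypothetical helper, plain lists stand in for the HDF5 datasets, and it is assumed that dataset:phenotype is stored in the same genome order as dataset:genome_identifiers.

```python
def split_view(genome_identifiers, phenotype, train_idx, test_idx):
    """Return (train, test) lists of (identifier, label) pairs."""
    train = [(genome_identifiers[i], phenotype[i]) for i in train_idx]
    test = [(genome_identifiers[i], phenotype[i]) for i in test_idx]
    return train, test


# Hypothetical data: four genomes, three used for training and one for testing.
identifiers = ["genome_a", "genome_b", "genome_c", "genome_d"]
labels = [0, 1, 1, 0]
train, test = split_view(identifiers, labels, train_idx=[0, 2, 3], test_idx=[1])
print(train)
print(test)
```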
The unique_risk_by_kmer dataset contains the individual error rate of each k-mer at predicting the labels by its presence, i.e., the empirical risk of presence(k-mer). The value that is stored is not the actual error rate, but the index of the corresponding unique error rate value in dataset:unique_risks.
The unique_risk_by_anti_kmer dataset contains the individual error rate of each k-mer at predicting the labels by its absence, i.e., the empirical risk of absence(k-mer). The value that is stored is not the actual error rate, but the index of the corresponding unique error rate value in dataset:unique_risks.
The unique_risks dataset contains the unique empirical risk values referenced by dataset:unique_risk_by_kmer and dataset:unique_risk_by_anti_kmer. This reduces the memory requirements of loading and storing the empirical risk for each k-mer/anti-kmer, since there are often fewer unique values than k-mers.
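The index-based storage of risks can be sketched as follows. This is an illustration of the idea rather than Kover's code: the function names are hypothetical, the sorted order of the unique values is an assumption, and only one risk array is encoded here, whereas in a Kover dataset the presence and absence risks share the same dataset:unique_risks.

```python
def encode_risks(risks):
    """Replace each risk by its index in a table of unique values."""
    unique_risks = sorted(set(risks))  # ASSUMPTION: any stable order would work
    index_of = {risk: i for i, risk in enumerate(unique_risks)}
    risk_index_by_kmer = [index_of[risk] for risk in risks]
    return risk_index_by_kmer, unique_risks


def decode_risks(risk_index_by_kmer, unique_risks):
    """Recover the actual risk of each k-mer from the stored indices."""
    return [unique_risks[i] for i in risk_index_by_kmer]


# Five k-mers but only three distinct risk values.
risks = [0.25, 0.5, 0.25, 0.0, 0.5]
indices, unique = encode_risks(risks)
print(indices, unique)
print(decode_risks(indices, unique) == risks)
```

The saving comes from storing one small integer index per k-mer plus a short table of floating-point values, instead of one floating-point value per k-mer.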
The folds group contains one sub-group for each cross-validation fold that is available. The number of cross-validation folds is specified by the user when creating the split. Each fold corresponds to a partition of the data in the training set of the split. The fold sub-groups have exactly the same structure as the split sub-groups.
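Since a split may or may not contain a folds group (in the listing, my_split_name_1 has none while my_split_name_2 has two folds), code that reads a split should handle both cases. The sketch below uses a hypothetical helper, `fold_names`, and nested dicts as stand-ins for the split sub-groups. It only relies on `in`, indexing and `keys()`, which `h5py` groups also support.

```python
def fold_names(split_group):
    """Return the names of the folds in a split, or [] if it has none."""
    if "folds" not in split_group:
        return []
    # Lexicographic order: "fold_10" would sort before "fold_2".
    return sorted(split_group["folds"].keys())


# Hypothetical stand-ins mirroring the two splits in the listing.
split_1 = {"train_genome_idx": [0, 1], "test_genome_idx": [2]}
split_2 = {
    "train_genome_idx": [0, 1],
    "test_genome_idx": [2],
    "folds": {
        "fold_1": {"train_genome_idx": [0], "test_genome_idx": [1]},
        "fold_2": {"train_genome_idx": [1], "test_genome_idx": [0]},
    },
}

print(fold_names(split_1))
print(fold_names(split_2))
```

Because each fold has the same structure as a split, any function written for a split sub-group can be applied to a fold sub-group unchanged.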