Kover Dataset Format

Alexandre Drouin edited this page May 20, 2016 · 9 revisions

Kover datasets are HDF5 files. They are structured as follows:

HDF5 "example.kover" {
FILE_CONTENTS {
 group      /
    dataset    /genome_identifiers
    dataset    /kmer_by_matrix_column
    dataset    /kmer_matrix
    dataset    /kmer_sequences
    dataset    /phenotype
    group      /splits
       group      /splits/my_split_name_1
          dataset    /splits/my_split_name_1/test_genome_idx
          dataset    /splits/my_split_name_1/train_genome_idx
          dataset    /splits/my_split_name_1/unique_risk_by_anti_kmer
          dataset    /splits/my_split_name_1/unique_risk_by_kmer
          dataset    /splits/my_split_name_1/unique_risks
       group      /splits/my_split_name_2
          dataset    /splits/my_split_name_2/test_genome_idx
          dataset    /splits/my_split_name_2/train_genome_idx
          dataset    /splits/my_split_name_2/unique_risk_by_anti_kmer
          dataset    /splits/my_split_name_2/unique_risk_by_kmer
          dataset    /splits/my_split_name_2/unique_risks
          group      /splits/my_split_name_2/folds
             group      /splits/my_split_name_2/folds/fold_1
                dataset    /splits/my_split_name_2/folds/fold_1/test_genome_idx
                dataset    /splits/my_split_name_2/folds/fold_1/train_genome_idx
                dataset    /splits/my_split_name_2/folds/fold_1/unique_risk_by_anti_kmer
                dataset    /splits/my_split_name_2/folds/fold_1/unique_risk_by_kmer
                dataset    /splits/my_split_name_2/folds/fold_1/unique_risks
             group      /splits/my_split_name_2/folds/fold_2
                dataset    /splits/my_split_name_2/folds/fold_2/test_genome_idx
                dataset    /splits/my_split_name_2/folds/fold_2/train_genome_idx
                dataset    /splits/my_split_name_2/folds/fold_2/unique_risk_by_anti_kmer
                dataset    /splits/my_split_name_2/folds/fold_2/unique_risk_by_kmer
                dataset    /splits/my_split_name_2/folds/fold_2/unique_risks
 }
}
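
The listing above can be reproduced by walking the group tree recursively. The following is a minimal sketch, not part of Kover: the helper name `walk` is hypothetical, and it is duck-typed so that it can be demonstrated on nested dicts (standing in for groups) and lists (standing in for datasets) without requiring an HDF5 library.

```python
def walk(node, path=""):
    """Yield (path, kind) for every group and dataset found below `node`.

    `node` only needs an `.items()` method, so an open h5py File/Group
    or a plain nested dict both work.
    """
    for name, child in node.items():
        child_path = f"{path}/{name}"
        if hasattr(child, "items"):
            # Mapping-like child: a group (or a dict stand-in).
            yield child_path, "group"
            yield from walk(child, child_path)
        else:
            # Anything else is treated as a dataset (or a list stand-in).
            yield child_path, "dataset"


# Illustrative stand-in for a very small Kover file (values are made up).
example = {
    "genome_identifiers": ["g1", "g2"],
    "splits": {
        "my_split_name_1": {
            "train_genome_idx": [0],
            "test_genome_idx": [1],
        }
    },
}

listing = list(walk(example))
for item_path, kind in listing:
    print(f"{kind:<10} {item_path}")
```

With the h5py package installed, the same function should accept a real file, e.g. `walk(h5py.File("example.kover", "r"))`, since h5py groups expose `.items()` and datasets do not.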

Top Level (/)

This group has the following attributes:

  • compression: the level of gzip compression (0-9)
  • created: when the dataset was created
  • genome_source: the source for the genomic data (e.g.: path to data file)
  • genome_source_type: the type of genomic data (e.g.: tsv)
  • phenotype_metadata_source: the source for the metadata (e.g.: path to data file)
  • phenotype_name: a user specified name for the phenotype
  • uuid: a unique identifier for the kover dataset

Dataset: genome_identifiers

This dataset contains the identifier of each genome.

Dataset: kmer_by_matrix_column

This dataset contains, for each column of the k-mer matrix, the index of the associated k-mer sequence in dataset:kmer_sequences.
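
As an illustration of this indirection, the sketch below resolves matrix columns to k-mer sequences. The values are toy data and the helper name `sequence_for_column` is hypothetical; in a real file both arrays would be read from the HDF5 datasets.

```python
# Toy stand-ins for the two datasets (illustrative values only).
kmer_sequences = ["AAC", "ACG", "CGT", "GTT"]
# Column j of the k-mer matrix describes kmer_sequences[kmer_by_matrix_column[j]].
kmer_by_matrix_column = [2, 0, 3, 1]


def sequence_for_column(column):
    """Return the k-mer sequence described by the given matrix column."""
    return kmer_sequences[kmer_by_matrix_column[column]]


column_sequences = [
    sequence_for_column(j) for j in range(len(kmer_by_matrix_column))
]
print(column_sequences)
```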

Dataset: kmer_matrix

This dataset gives the presence/absence (1 and 0, respectively) of each k-mer in each genome. The binary values in the columns are packed into integers. The number of rows should be the number of genomes (the size of dataset:genome_identifiers) divided by the size, in bits, of the integers used for bit packing the columns, rounded up. The number of columns should match the size of dataset:kmer_sequences and dataset:kmer_by_matrix_column.
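
To make the row count concrete, here is a minimal pure-Python sketch of column-wise bit packing. It is not Kover's implementation: the function names are hypothetical, and both the word size and the bit order (genome g stored in bit g mod word size, least significant bit first) are assumptions chosen for illustration.

```python
import math


def pack_columns(matrix, word_bits):
    """Pack a genomes x k-mers 0/1 matrix column-wise into integers.

    Each packed row covers `word_bits` consecutive genomes.
    """
    n_genomes = len(matrix)
    n_kmers = len(matrix[0]) if matrix else 0
    n_rows = math.ceil(n_genomes / word_bits)
    packed = [[0] * n_kmers for _ in range(n_rows)]
    for genome, row in enumerate(matrix):
        block, bit = divmod(genome, word_bits)
        for kmer, present in enumerate(row):
            if present:
                # Assumed bit order: least significant bit first.
                packed[block][kmer] |= 1 << bit
    return packed


def is_present(packed, genome, kmer, word_bits):
    """Read back a single presence/absence value from the packed matrix."""
    block, bit = divmod(genome, word_bits)
    return (packed[block][kmer] >> bit) & 1


# 5 genomes x 2 k-mers, packed into 4-bit words for readability.
matrix = [[1, 0], [0, 1], [1, 1], [0, 0], [1, 0]]
packed = pack_columns(matrix, word_bits=4)
print(packed)
```

With 5 genomes and 4-bit words, the packed matrix has ceil(5 / 4) = 2 rows, and the number of columns is unchanged.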

Dataset: kmer_sequences

This dataset gives the sequence of each k-mer in the dataset. Its size should match the number of columns in dataset:kmer_matrix and the size of dataset:kmer_by_matrix_column.

Dataset: phenotype

This dataset gives the label (0 or 1) assigned to each genome. The size should match the size of dataset:genome_identifiers.

The splits group (/splits)

This group contains one sub-group for each partition of the dataset (split) that has been defined by the user. Each partition sub-group has the following attributes:

  • n_folds: The number of cross-validation folds contained in the split.
  • random_seed: The random seed that was used for data partitioning.
  • test_proportion: The proportion of examples that are in the testing set.
  • train_proportion: The proportion of examples that are in the training set.

Dataset: test_genome_idx

This dataset contains the indices of the genomes that belong to the testing set for the split. The indices refer to positions in dataset:kmer_matrix and dataset:genome_identifiers.

Dataset: train_genome_idx

This dataset contains the indices of the genomes that belong to the training set for the split. The indices refer to positions in dataset:kmer_matrix and dataset:genome_identifiers.
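
The sketch below shows how such index datasets could be used to select the genomes of each set. The data and the helper name `select` are hypothetical. Indexing dataset:phenotype with the same indices is an assumption, based on its size matching that of dataset:genome_identifiers.

```python
# Toy stand-ins for the datasets (illustrative values only).
genome_identifiers = ["genome_a", "genome_b", "genome_c", "genome_d", "genome_e"]
phenotype = [1, 0, 0, 1, 1]
train_genome_idx = [0, 2, 3]
test_genome_idx = [1, 4]


def select(values, indices):
    """Return the entries of `values` located at the given positions."""
    return [values[i] for i in indices]


train_ids = select(genome_identifiers, train_genome_idx)
train_labels = select(phenotype, train_genome_idx)
test_ids = select(genome_identifiers, test_genome_idx)
test_labels = select(phenotype, test_genome_idx)
print(train_ids, train_labels)
print(test_ids, test_labels)
```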

Dataset: unique_risk_by_kmer

This dataset contains the individual error rate of each k-mer at predicting the labels by its presence, i.e., the empirical risk of presence(k-mer). The value that is stored is not the actual error rate, but the index of the unique error rate value in dataset:unique_risks.

Dataset: unique_risk_by_anti_kmer

This dataset contains the individual error rate of each k-mer at predicting the labels by its absence, i.e., the empirical risk of absence(k-mer). The value that is stored is not the actual error rate, but the index of the unique error rate value in dataset:unique_risks.

Dataset: unique_risks

This dataset contains the unique empirical risk values obtained in dataset:unique_risk_by_kmer and dataset:unique_risk_by_anti_kmer. This reduces the memory requirements of loading and storing the empirical risk for each k-mer/anti-kmer, since there are often fewer unique values than k-mers.
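
The relationship between these three datasets can be sketched as follows. This is an illustration under stated assumptions, not Kover's code: the function names are hypothetical, the presence rule is assumed to predict label 1 when the k-mer is present (and the absence rule when it is absent), the risk is taken as the fraction of misclassified genomes, and the unique values are assumed to be stored in sorted order.

```python
def empirical_risks(matrix, labels):
    """Return per-k-mer error rates of the presence and absence rules."""
    n_genomes = len(labels)
    n_kmers = len(matrix[0])
    presence, absence = [], []
    for kmer in range(n_kmers):
        # The presence rule errs whenever the k-mer value differs from the label.
        errors = sum(1 for g in range(n_genomes) if matrix[g][kmer] != labels[g])
        presence.append(errors / n_genomes)
        # The absence rule makes the opposite prediction on every genome.
        absence.append((n_genomes - errors) / n_genomes)
    return presence, absence


def index_unique(presence, absence):
    """Store each distinct risk once and replace risks by indices into that list."""
    unique_risks = sorted(set(presence) | set(absence))
    position = {risk: i for i, risk in enumerate(unique_risks)}
    by_kmer = [position[risk] for risk in presence]
    by_anti_kmer = [position[risk] for risk in absence]
    return unique_risks, by_kmer, by_anti_kmer


# 4 genomes x 3 k-mers (unpacked for readability), with binary labels.
matrix = [[1, 0, 1], [0, 1, 1], [1, 0, 0], [0, 1, 0]]
labels = [1, 0, 1, 0]
presence, absence = empirical_risks(matrix, labels)
unique_risks, unique_risk_by_kmer, unique_risk_by_anti_kmer = index_unique(
    presence, absence
)
print(unique_risks, unique_risk_by_kmer, unique_risk_by_anti_kmer)
```

The actual risk of a k-mer is then recovered as `unique_risks[unique_risk_by_kmer[j]]`.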

The folds group (/splits/split_name/folds)

This group contains one sub-group for each cross-validation fold that is available. The number of cross-validation folds is specified by the user when creating the split. Each fold corresponds to a partition of the data in the training set of the split. The fold sub-groups have the exact same structure as the split sub-groups.
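
As an illustration of how folds relate to the split, the sketch below partitions the training indices of a split into folds. It is a generic k-fold scheme, not necessarily the one Kover uses: the function name `make_folds` is hypothetical, and the assumption is that each fold holds out one slice of the split's training set as its testing set and trains on the rest.

```python
import random


def make_folds(train_genome_idx, n_folds, random_seed):
    """Partition the split's training indices into cross-validation folds."""
    rng = random.Random(random_seed)
    shuffled = list(train_genome_idx)
    rng.shuffle(shuffled)
    folds = {}
    for fold in range(n_folds):
        # Every n_folds-th genome, starting at `fold`, is held out for testing.
        held_out = shuffled[fold::n_folds]
        held_out_set = set(held_out)
        folds[f"fold_{fold + 1}"] = {
            "train_genome_idx": sorted(i for i in shuffled if i not in held_out_set),
            "test_genome_idx": sorted(held_out),
        }
    return folds


# Training indices of a hypothetical split (illustrative values only).
split_train_idx = [0, 2, 3, 5, 7, 8]
folds = make_folds(split_train_idx, n_folds=2, random_seed=42)
for name, fold in folds.items():
    print(name, fold)
```

Under this scheme, the training and testing indices of every fold are disjoint and together cover exactly the training set of the split.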