Help: Wizard ‐ Challenge ‐ Data

Isabelle Guyon edited this page Sep 27, 2017 · 31 revisions

Use "Pick another dataset" if you want to change the dataset.

AutoML format

In this version of Chalab, datasets must be formatted in the AutoML format. DOWNLOAD AN EXAMPLE.

Organizers are encouraged to propose tasks of **CLASSIFICATION** (binary, multi-label, or multi-class) or **REGRESSION** (prediction of a continuous variable). The datasets can cover a wide range of application areas, such as pharmacology, medicine, marketing, ecology, text, image, video and speech processing, and span a range of difficulties (sparsity, missing data, noise, categorical variables, etc.). BUT, all datasets must be pre-formatted in a fixed-length feature-based representation. The number of variables and samples can vary between thousands and millions. BUT, the size of the zipped archive containing ALL the data must not exceed 200 MB.

The data files should be put under a directory of name DataName/, where DataName is the dataset name, and zipped. They should include the following text files [GROUPED VERSION]:

  • DataName.data
  • DataName.solution

OR [SPLIT VERSION]:

  • DataName_train.data
  • DataName_train.solution
  • DataName_valid.data
  • DataName_valid.solution
  • DataName_test.data
  • DataName_test.solution
If the data are supplied without a split into train(ing), valid(ation) and test sets, Chalab will perform the split. However, it is sometimes better that the organizers perform the split themselves. For example, the organizers may NOT want to mix data from the same sources across the training, validation, and test sets. In a speaker-independent speech recognition task, for instance, one should have different speakers for training and testing. Chalab would not take that into account and would perform a random split, mixing data from all speakers in all sets.
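If you let Chalab split the data, the result is a purely random partition. A minimal sketch of such a split, assuming a 60/20/20 train/valid/test ratio (an illustrative choice, not Chalab's documented one) and a tiny synthetic dataset in place of the real DataName files:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Tiny synthetic stand-in for DataName.data / DataName.solution:
# 10 samples, 3 features, binary targets.
X = rng.normal(size=(10, 3))
y = (X[:, 0] > 0).astype(int)

# Random 60/20/20 partition of the sample indices.
n = X.shape[0]
perm = rng.permutation(n)
n_train, n_valid = int(0.6 * n), int(0.2 * n)
splits = {
    "train": perm[:n_train],
    "valid": perm[n_train:n_train + n_valid],
    "test":  perm[n_train + n_valid:],
}
sizes = {name: len(rows) for name, rows in splits.items()}

# Each subset would then be saved as DataName_<split>.data /
# DataName_<split>.solution, e.g. with np.savetxt(..., fmt="%g").
```

Note that this split ignores sample provenance entirely, which is exactly why a speaker-independent task needs an organizer-made split.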

We use several data formats:

  • Data files:
    • Full matrices are represented as space delimited numeric value tables, samples in lines and features in columns.
    • Sparse binary matrices are represented as lists of indices of non-zero feature values. Each line contains the list representing one sample.
    • Sparse numeric matrices include lists of pairs "index:value". Each line contains the list representing one sample.
  • Solution files: Target values (one value or vector per line, corresponding to the target values of one sample). The solution files of the validation and test data are hidden from the participants. The goal of the challenge is to produce prediction values for these solution files.
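The three data-file formats above can be parsed as follows. This is an illustrative sketch, assuming 1-based feature indices in the sparse formats (check the downloadable example to confirm the indexing convention):

```python
import numpy as np

def parse_dense(lines):
    """Full matrix: space-delimited numeric values, one sample per line."""
    return np.array([[float(v) for v in line.split()] for line in lines])

def parse_sparse_binary(lines, n_feat):
    """Sparse binary: each line lists the indices of non-zero features
    (assumed 1-based here)."""
    X = np.zeros((len(lines), n_feat))
    for i, line in enumerate(lines):
        for idx in line.split():
            X[i, int(idx) - 1] = 1.0
    return X

def parse_sparse(lines, n_feat):
    """Sparse numeric: each line lists 'index:value' pairs
    (indices assumed 1-based here)."""
    X = np.zeros((len(lines), n_feat))
    for i, line in enumerate(lines):
        for pair in line.split():
            idx, val = pair.split(":")
            X[i, int(idx) - 1] = float(val)
    return X
```

In practice the number of features for the sparse formats is taken from the info file (feat_num), since no single line is guaranteed to mention the last feature.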

Optional header files and type files

Unrecognized file names will be ignored by Chalab. However, several other optional files may be provided, including column headings:

  • DataName_feat.name
  • DataName_label.name
The files DataName_feat.name and DataName_label.name provide the column headers of the data files (features) and solution files (labels) respectively. They are text files with one entry per line (one for each column header). These two files will be made available to participants. They can be useful for data visualisation purposes.

Also optionally, you may supply the feature types (Numerical, Binary, or Categorical) in the file:

  • DataName_feat.type
The number of lines in this text file is the same as the number of features (one type per line).
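All three optional files share the same one-entry-per-line layout, so a single helper suffices to write them. A sketch, where the feature names, types, and label names are made-up examples:

```python
def write_column_file(path, entries):
    """Write one entry per line, in the same order as the matrix columns."""
    with open(path, "w") as f:
        for entry in entries:
            f.write(str(entry) + "\n")

# Hypothetical metadata for a three-feature, single-label dataset.
write_column_file("DataName_feat.name", ["age", "weight", "smoker"])
write_column_file("DataName_feat.type", ["Numerical", "Numerical", "Binary"])
write_column_file("DataName_label.name", ["healthy"])
```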

Optional public documentation

We also encourage the organizers to supply optional public documentation files with all datasets:

  • DataName_public.info (will be made available to participants).

The DataName_public.info file contains the following fields (you may supply only a subset of such fields):

  • usage = 'Challenge name'.
  • name = 'DataName' (dataset short nickname to be used in file names).
  • task = 'regression', 'binary.classification', 'multiclass.classification', or 'multilabel.classification'.
  • target_type = 'Numerical' or 'Binary' (we do not use categorical targets; multiclass problems for c classes have c binary targets).
  • feat_type = 'Numerical', 'Categorical', or 'Binary'.
  • metric = An AutoML metric such as 'r2_metric', 'auc_metric', 'bac_metric', 'f1_metric', 'pac_metric', or your own metric.
  • feat_num = number of variables (or features), i.e. number of columns of the data matrix.
  • target_num = number of target values (1 for regression; same as label_num for classification problems, because we do not use categorical targets).
  • label_num = number of labels for classification problems (same as target_num or NA for regression).
  • train_num = number of training examples.
  • valid_num = number of validation set examples.
  • test_num = number of test set examples.
  • has_categorical = existence of categorical variable (yes=1, no=0).
  • has_missing = existence of missing values (yes=1, no=0).
  • is_sparse = the data matrix is a sparse matrix (yes=1, no=0).
  • time_budget = the time budget in seconds. In this version of Chalab, we impose a maximum of 500 seconds of execution time per submission.

Three fields are of particular importance: task, metric, and time_budget. They vary from dataset to dataset. If code is submitted, training and testing must be done within the time budget, including data reading and result writing to file. The results will be evaluated with the given metric for the given task. Some metrics are computed differently for multilabel and multiclass classification. WARNING: Editing the "info" file with an editor may introduce special characters that will render it unreadable by our sample code.
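Since the info file is a plain list of key = value lines, it can be generated programmatically, which also avoids the editor pitfall mentioned above. A sketch with made-up field values, written in plain ASCII:

```python
# Illustrative field values for a hypothetical dataset; the key = value
# layout mirrors the field list above.
fields = {
    "usage": "'My challenge'",
    "name": "'DataName'",
    "task": "'binary.classification'",
    "target_type": "'Binary'",
    "metric": "'auc_metric'",
    "feat_num": 3,
    "train_num": 100,
    "time_budget": 500,
}

info = "\n".join(f"{k} = {v}" for k, v in fields.items()) + "\n"

# Writing with encoding="ascii" raises an error on any special character
# that could make the file unreadable by the sample code.
with open("DataName_public.info", "w", encoding="ascii") as f:
    f.write(info)
```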

Optional private documentation

You may also include:

  • DataName_private.info (will be kept confidential)
This file will NOT be downloadable by participants; it will be kept as documentation for you, for further reference.

The DataName_private.info file contains the following fields:

  • title = 'Original data title/name.'
  • keywords = Any relevant keyword, for example: 'text.classification,document.processing'
  • authors = 'List of original data providers.'
  • resource_url = 'URL of where the data came from'
  • contact_name = 'Name of the person who formatted the data'
  • contact_url = 'URL of the contact person (to avoid listing the email)'
  • license = 'URL of a license, e.g. any open data license such as http://creativecommons.org/about/cc0'
  • date_created = 'Date of data release'
  • past_usage = 'Whether the data were used in other challenges or benchmarks or in published papers.'
  • description = 'Detailed data description.'
  • preparation = 'How the data were prepared, preprocessed. Methods for anonymizing, subsampling, etc.'
  • representation = 'What the features represent (e.g. word frequencies in a document, pixels, etc.)'
  • real_feat_num = An integer number indicating the number of "true" (informative) features.
  • probe_num = An integer number indicating the number of "fake" distractor features (probes).
  • frac_probes = Fraction of probes, i.e. probe_num/(probe_num+real_feat_num).
  • feat_type = { 'Numerical' 'Categorical' 'Binary' } # do not change that
  • feat_type_freq = [ 1 0 0 ] # These 3 numbers add up to one; they indicate the fraction of features of each type 'Numerical'/'Categorical'/'Binary'
  • label_names = { 'label1' 'label2' etc. 'labeln' } # Column header of the solution files indicating the names of the labels in multi-label or multi-class tasks.
  • train_label_freq = [ 0.00971176 0.00415485 etc. 0.0715763 ] # Frequency of the labels in training data
  • train_label_entropy = 0.781311 # Entropy of the labels in training data
  • train_sparsity = 0.998401 # Sparsity of training data (a number between 0 and 1 indicating the fraction of features having 0 value).
  • train_frac_missing = 0 # The fraction of missing values in training data.
  • valid_label_freq = [ 0.0094996 0.00395019 etc. 0.0710078 ] # Frequency of the labels in validation data
  • valid_label_entropy = 0.780359 # Entropy of the labels in validation data
  • valid_sparsity = 0.998397 # Sparsity of validation data
  • valid_frac_missing = 0 # The fraction of missing values in validation data.
  • test_label_freq = [ 0.00999453 0.00395133 etc. 0.0719716 ] # Frequency of the labels in test data
  • test_label_entropy = 0.781757 # Entropy of the labels in test data
  • test_sparsity = 0.998401 # Sparsity of test data
  • test_frac_missing = 0 # The fraction of missing values in test data.
  • train_data_aspect_ratio = 3.81305 # Ratio of number of features over the number of training examples.
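Several of the statistics above can be recomputed directly from the data and solution matrices. A sketch using the obvious definitions (Chalab's exact conventions, e.g. the logarithm base used for the entropy, are assumptions here):

```python
import numpy as np

def label_freq(Y):
    """Frequency of each label in a one-hot solution matrix (samples in rows)."""
    return Y.mean(axis=0)

def label_entropy(Y):
    """Entropy of the label distribution (natural log assumed)."""
    p = label_freq(Y)
    p = p[p > 0] / p.sum()
    return float(-(p * np.log(p)).sum())

def sparsity(X):
    """Fraction of entries that are exactly zero."""
    return float((X == 0).mean())

def frac_missing(X):
    """Fraction of missing (NaN) entries."""
    return float(np.isnan(X).mean())

def aspect_ratio(X):
    """Number of features over the number of examples."""
    return X.shape[1] / X.shape[0]
```

Computing these once per split (train/valid/test) fills in most of the numeric fields listed above.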