kfold

kfold creates K-fold splits from data files and assists in training and testing models on those splits, which is useful for cross-validation in supervised machine learning.

Command overview

help                 Display global or [command] help documentation
split                Split a data file into K partitions
test                 Apply trained models on a dataset previously split using kfold
train                Train models on a dataset previously split using kfold

Example usage

10-fold cross-validation of the standard MaltParser on a treebank named shuffled.c32.conll may be done as follows:

kfold split -f -i shuffled.c32.conll --fold -d '\n\n'
kfold train -f --base shuffled.c32.conll -- java -jar ~/Tools/malt-1.4.1/malt.jar -c %B.model_%N -i %I -m learn
kfold test -f --base shuffled.c32.conll -- java -jar ~/Tools/malt-1.4.1/malt.jar -c %B.model_%N -i %T -o %O -m parse
eval07.pl -q -g shuffled.c32.conll -s shuffled.c32.conll.output

MaltParser does not like to put its models in a subdirectory, so instead of the default model filenames that kfold provides through %M, the commands above construct flat, non-nested model filenames using %B.model_%N.
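The pairing behind these steps can be illustrated with a short Ruby sketch. It assumes that fold N consists of every part except part N, and that the model trained on fold N is then applied to the held-out part N. The method name cross_validation_pairs and the in-memory arrays are hypothetical and are not part of kfold.

```ruby
# Hypothetical sketch, not kfold's actual code: pair each held-out part
# with the fold built from the remaining K-1 parts.
def cross_validation_pairs(parts)
  parts.each_index.map do |n|
    # Fold n+1 is assumed to hold every part except part n+1.
    training = parts.each_with_index.reject { |_, i| i == n }.flat_map(&:first)
    { fold: format('%02d', n + 1), train: training, test: parts[n] }
  end
end

parts = [%w[a b], %w[c d], %w[e f]]
cross_validation_pairs(parts).each do |pair|
  puts "fold #{pair[:fold]}: train=#{pair[:train].inspect} test=#{pair[:test].inspect}"
end
```

Every entry ends up in exactly one test set, which is what allows the per-part outputs to be scored together against the full data file.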

Command details

The following is simply the output of the built-in help commands.

Splitting data files

NAME:

  split

DESCRIPTION:

  Given the data file INPUT, the partitions are written to files named INPUT.parts/{01..K}

SYNOPSIS:

  kfold split -i INPUT [options]

EXAMPLES:

# Split the file sample.txt into 4 parts
kfold split -k4 -i sample.txt

# Split the double-newline-delimited file sample.conll into 10 parts
kfold split -d"\n\n" -i sample.conll

OPTIONS:

-i, --input FILE 
    Data file to split

-k, --parts N 
    The number of partitions desired

-d, --delimiter DELIM 
    String used to separate individual entries (newline per default)

-g, --granularity N 
    Ensure the number of entries in each partition is divisible by N (useful for block-structured data)

-f, --overwrite 
    Remove existing parts prior to executing

--fold 
    Additionally, create K folds of K-1 parts in another folder

--parts-name STRING 
    Use the given name as suffix for the partitions folder created

--folds-name STRING 
    Use the given name as suffix for the folds folder created
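
As an illustration of how the delimiter, part count and granularity options could interact, here is a minimal Ruby sketch. It is not kfold's implementation: the method names split_entries and part_names are hypothetical, and dealing out entries as contiguous chunks in their original order is an assumption.

```ruby
# Hypothetical sketch of delimiter-based splitting; not kfold's actual code.
# Assumption: entries keep their original order and form contiguous chunks.
def split_entries(text, k:, delimiter: "\n", granularity: 1)
  entries = text.split(delimiter).reject(&:empty?)
  # Group entries into blocks of `granularity` so a block is never torn apart.
  # If the entry count is not a multiple of `granularity`, the last block is short.
  blocks = entries.each_slice(granularity).to_a
  base, extra = blocks.size.divmod(k)
  # The first `extra` parts receive one additional block.
  sizes = Array.new(k) { |i| base + (i < extra ? 1 : 0) }
  sizes.map { |size| blocks.shift(size).flatten(1) }
end

# Part filenames in the INPUT.parts/{01..K} style described above.
def part_names(input, k, parts_name: 'parts')
  (1..k).map { |n| format('%s.%s/%02d', input, parts_name, n) }
end

text = "w1\nw2\n\nw3\n\nw4\nw5\n\nw6"
parts = split_entries(text, k: 2, delimiter: "\n\n")
part_names('sample.conll', 2).zip(parts).each do |name, entries|
  puts "#{name}: #{entries.size} entries"
end
```

Because whole blocks are distributed, every part holds a multiple of the granularity whenever the total number of entries does.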

Training on the folds

NAME:

  train

DESCRIPTION:

  Given training data previously split into K parts and folds, train K models on the K folds.

  Certain keywords in the training command and its arguments are interpolated at runtime:

   * %N  - fold number, e.g. '01'
   * %F  - fold filename, e.g. 'brown.train/01'
   * %I  - alias for %F
   * %M  - model filename, e.g. 'brown.models/01'
   * %B  - basename (as specified on the command line), e.g. 'brown'
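
The keyword substitution listed above can be sketched in Ruby as follows. The method name interpolate is hypothetical, and the default suffixes 'train' and 'models' are assumptions read off the example filenames in the list; kfold's real code may differ.

```ruby
# Hypothetical sketch of keyword interpolation for one fold; not kfold's code.
# The default suffixes are assumptions taken from the examples
# 'brown.train/01' and 'brown.models/01'.
def interpolate(args, base:, fold:, folds_name: 'train', models_name: 'models')
  number = format('%02d', fold)
  fold_file = "#{base}.#{folds_name}/#{number}"
  values = {
    '%N' => number,
    '%F' => fold_file,
    '%I' => fold_file, # alias for %F
    '%M' => "#{base}.#{models_name}/#{number}",
    '%B' => base
  }
  # A single gsub pass means substituted text is never scanned again.
  args.map { |arg| arg.gsub(/%[NFIMB]/) { |key| values.fetch(key) } }
end

cmd = %w[java -jar malt.jar -c %B.model_%N -i %I -m learn]
puts interpolate(cmd, base: 'brown', fold: 1).join(' ')
# prints: java -jar malt.jar -c brown.model_01 -i brown.train/01 -m learn
```

Working on the argument list rather than on one joined string keeps each argument intact even when a substituted filename contains spaces.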

SYNOPSIS:

  kfold train --base NAME [options] -- CMD [--CMD-OPTIONS] [CMD-ARGS]

EXAMPLES:

# Train MaltParser for cross-validation
kfold train -f --base shuffled.c32.conll -- java -jar ~/Tools/malt-1.4.1/malt.jar -c %B.model_%N -i %I -m learn

OPTIONS:

-f, --overwrite 
    Remove existing models prior to executing

--base NAME 
    Default prefix of training folds and model files

--folds-name SUFFIX 
    Look for folds {01..K} in the folder BASE.SUFFIX

--models-name SUFFIX 
    Yield model names as BASE.SUFFIX/{01..K} as interpolation pattern %M

Testing the models on their corresponding data file parts

NAME:

  test

DESCRIPTION:

  Process the K parts of a split data file using K previously trained models.

  Certain keywords in the testing command and its arguments are interpolated at runtime:

   * %N  - part number, e.g. '01'
   * %T  - part filename, e.g. 'brown.test/01'
   * %I  - alias for %T
   * %O  - output filename, e.g. 'brown.outputs/01'
   * %M  - model filename, e.g. 'brown.models/01'
   * %B  - basename (as specified on the command line), e.g. 'brown'

SYNOPSIS:

  kfold test --base NAME [options] -- CMD [--CMD-OPTIONS] [CMD-ARGS]

EXAMPLES:

# Apply trained MaltParser models for cross-validation
kfold test -f --base shuffled.c32.conll -- java -jar ~/Tools/malt-1.4.1/malt.jar -c %B.model_%N -i %T -o %O -m parse

OPTIONS:

-f, --overwrite 
    Remove existing test output prior to executing

--base NAME 
    Default prefix of model files and test outputs

--parts-name SUFFIX 
    Look for parts {01..K} to be processed in the folder BASE.SUFFIX

--models-name SUFFIX 
    Yield model names as BASE.SUFFIX/{01..K} as interpolation pattern %M

--outputs-name SUFFIX 
    Yield output filenames as BASE.SUFFIX/{01..K} as interpolation pattern %O

--output-name SUFFIX 
    Put the concatenated output of all models in BASE.SUFFIX
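
The concatenation step could look roughly like the Ruby sketch below. It assumes that the per-part outputs are joined in part order, and it takes the default suffixes 'outputs' and 'output' from the example filenames brown.outputs/01 and shuffled.c32.conll.output. The method name concatenate_outputs is hypothetical.

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical sketch, not kfold's actual code: join the per-part outputs
# BASE.outputs/{01..K} into the single file BASE.output, in part order.
def concatenate_outputs(base, outputs_name: 'outputs', output_name: 'output')
  # Zero-padded names of equal width sort lexicographically in part order.
  part_files = Dir.glob("#{base}.#{outputs_name}/*").sort
  File.open("#{base}.#{output_name}", 'w') do |out|
    part_files.each { |path| out.write(File.read(path)) }
  end
  part_files
end

# Usage with throwaway files in a temporary directory.
Dir.mktmpdir do |dir|
  base = File.join(dir, 'brown')
  FileUtils.mkdir_p("#{base}.outputs")
  File.write("#{base}.outputs/01", "first part\n")
  File.write("#{base}.outputs/02", "second part\n")
  concatenate_outputs(base)
  puts File.read("#{base}.output")
end
```

If the parts are contiguous chunks of the original file, a concatenation in part order lines up entry by entry with that file, which is what an evaluation against the unsplit gold data relies on.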