SV-M

Structural Variant Machine (SV-M) to accurately predict InDels from NGS paired-end short reads as described in:

D. Grimm, J. Hagmann, D. Koenig, D. Weigel and K. Borgwardt (2013). Accurate indel prediction using paired-end short reads. BMC Genomics 14:132.

Structural Variant Machine (SV-M)

Contents:

  1. Installation
  2. Usage
  3. Input format
  4. Output files and format
  5. Training Data
  6. Author and license information

1. Installation

To install the tool, compile the source code by typing the following into your Linux/Mac terminal:

make all

This compiles the source code and generates two directories, build and bin. The bin directory contains the compiled tool sv-m.

To re-compile:

make clean
make all

2. Usage

a) Prediction

To predict whether an indel candidate is true or false, use the -predict command:

./sv-m -predict <model_file> <normalization_parameter_file> <data_file> <output_filename>

where:

  • <model_file>: trained SVM model file
  • <normalization_parameter_file>: the corresponding normalization parameter file for the trained SVM model
  • <data_file>: input data file with all features
  • <output_filename>: filename for the output file
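If predictions are run from a script, the call above could be wrapped roughly as follows. This is a minimal sketch, not part of SV-M: the helper names are hypothetical, and only the argument order is taken from the command line shown above.

```python
import subprocess


def build_predict_command(binary, model_file, normalization_file, data_file, output_file):
    """Return the argument list for: sv-m -predict <model> <norm> <data> <output>."""
    return [binary, "-predict", model_file, normalization_file, data_file, output_file]


def run_prediction(binary, model_file, normalization_file, data_file, output_file):
    """Run sv-m and raise CalledProcessError if it exits with a non-zero status."""
    command = build_predict_command(
        binary, model_file, normalization_file, data_file, output_file
    )
    subprocess.run(command, check=True)
```

For example, `run_prediction("bin/sv-m", "model.svm", "model_normalization.param", "candidates.tab", "predictions.tab")` would start a prediction, where `candidates.tab` and `predictions.tab` are placeholder file names.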

b) Training

To train a new SVM model on a set of features, use the -train command:

./sv-m -train <data_filename> <output_directory>

where:

  • <data_filename>: input data file
  • <output_directory>: name of an existing empty output directory

Optional arguments:

  • -n: number of folds k for k-fold cross-validation (default = 10)
  • -experiments: number of experiments/repeats (default = 1)

In general, several experiments are performed.

3. Input format

a) Prediction

The <model_file> and <normalization_parameter_file> can be found in the Model folder in the root directory. For a new or different set of features, these files have to be generated by training a new model.

  • <data_file> format (tab-separated):
<chromosome> <start position> <end position> <feature 1> <feature 2> ... <feature n>
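As a rough illustration of this layout, the following Python sketch formats candidate indels as tab-separated rows. The function names, the chromosome label, and the feature values are hypothetical. The number and meaning of the features depend on the model in use.

```python
def format_prediction_row(chromosome, start, end, features):
    """Return one line: chromosome, start, end, then the features, tab-separated."""
    fields = [chromosome, start, end, *features]
    return "\t".join(str(field) for field in fields)


def write_prediction_input(path, candidates):
    """Write one row per candidate; all rows should carry the same number of features."""
    with open(path, "w") as handle:
        for chromosome, start, end, features in candidates:
            handle.write(format_prediction_row(chromosome, start, end, features) + "\n")


# Hypothetical candidate with three made-up feature values.
print(format_prediction_row("Chr1", 1200, 1215, [0.8, 12, 3.5]))
```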

b) Training

  • <data_filename> format (tab-separated):
<class label: 1 for positive, -1 for negative> <chromosome> <start position> <end position> <feature 1> <feature 2> ... <feature n>
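Before training, it can help to check that every line follows this layout. The sketch below is one possible validator, not part of SV-M. It assumes that positions are integers and that all features are numeric, neither of which is stated explicitly above.

```python
def parse_training_line(line):
    """Split one training line into (label, chromosome, start, end, features)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        raise ValueError("expected label, chromosome, start, end and at least one feature")
    label = int(fields[0])
    if label not in (1, -1):
        raise ValueError(f"class label must be 1 or -1, got {label}")
    chromosome = fields[1]
    # Assumption: positions are integers and features are numeric.
    start, end = int(fields[2]), int(fields[3])
    features = [float(value) for value in fields[4:]]
    return label, chromosome, start, end, features
```

A caller could additionally compare `len(features)` across lines to catch rows with missing columns.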

4. Output files and format

a) Prediction

  • <output_filename> format (tab-separated):
<class label: 1 for positive, -1 for negative> <probability of the positive class (negative class: 1 - probability of the positive class)> <chromosome> <start position> <end position> <feature 1> <feature 2> ... <feature n>
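The prediction output can be post-processed with a few lines of Python. The sketch below keeps only positive predictions above a probability cut-off. The function names and the default threshold of 0.9 are hypothetical choices, and the code assumes that label, probability, and positions are written as plain numbers.

```python
def parse_prediction_line(line):
    """Parse one output line: label, probability, chromosome, start, end, features."""
    fields = line.rstrip("\n").split("\t")
    return {
        "label": int(float(fields[0])),  # accepts "1" as well as "1.0"
        "probability": float(fields[1]),
        "chromosome": fields[2],
        "start": int(fields[3]),
        "end": int(fields[4]),
        "features": fields[5:],  # left as strings; their meaning depends on the model
    }


def confident_positives(lines, min_probability=0.9):
    """Return positive predictions whose probability is at least min_probability."""
    predictions = (parse_prediction_line(line) for line in lines if line.strip())
    return [
        prediction
        for prediction in predictions
        if prediction["label"] == 1 and prediction["probability"] >= min_probability
    ]
```

Since `confident_positives` accepts any iterable of lines, an open file handle can be passed to it directly.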

b) Training

The output directory contains the following output files:

model.svm: The trained model file

model_normalization.param: The corresponding normalization parameters for that model

results.txt: A summary of the performance of the model and the corresponding weights

experiments.tab: A tab-separated file containing the C-value, AUC and BEP value for each experiment:

<C-Value> <AUC> <BEP>
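When several experiments are run, the rows of experiments.tab can be summarized with a short script. The following sketch assumes that the file has no header row and that all three columns are numeric. The function name and the choice of summary values are illustrative only.

```python
from statistics import mean


def summarize_experiments(lines):
    """Summarize rows of the form: C-value, AUC, BEP (tab-separated)."""
    rows = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        c_value, auc, bep = (float(value) for value in line.split("\t")[:3])
        rows.append((c_value, auc, bep))
    if not rows:
        raise ValueError("no experiments found")
    best = max(rows, key=lambda row: row[1])  # experiment with the highest AUC
    return {
        "experiments": len(rows),
        "mean_auc": mean(row[1] for row in rows),
        "mean_bep": mean(row[2] for row in rows),
        "best_c_value": best[0],
    }
```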

5. Training Data

The trainingsdata folder contains the Sanger-validated training data. For more detailed information and the file format, see the README file within that folder.

6. Author and license information

Version: 0.1
Author: Dominik Gerhard Grimm
Mail: dominik.grimm@tuebingen.mpg.de
Date: 7 December 2011

Group: Machine Learning and Computational Biology Group (http://webdav.tuebingen.mpg.de/u/karsten/group/)
Institutes: Max Planck Institute for Developmental Biology and Max Planck Institute for Intelligent Systems (Tübingen, Germany)

This tool makes use of libSVM 3.0 (www.csie.ntu.edu.tw/~cjlin/libsvm/)