File listing:

  • main script for tagging an input plain text
  • script for installing libraries and dependencies


This work is licensed under a Creative Commons Attribution 3.0 Unported License. See the files LICENSEDATA.TXT and COPYING-CC.TXT, which should be in the top-level directory of this distribution.

Description and installation

The svm_wsd program implements a machine-learning Word Sense Disambiguation (WSD) system based on Support Vector Machines, using a bag-of-words model to represent the features. It is implemented in Python, so a working Python installation (version 2.6 or later) is required. Some additional dependencies are downloaded and installed automatically by the installation script.
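To make the bag-of-words representation concrete, here is a minimal sketch that turns tokenised contexts into sparse feature lines in the text format read by libsvm (`label index:value ...`). It is an illustration only: the function names, the toy Dutch contexts and the use of raw token counts are assumptions, and the features actually extracted by svm_wsd (context window, lemmas versus tokens, weighting) may differ.

```python
from collections import Counter

def build_vocabulary(contexts):
    """Map every distinct token to a 1-based feature index (libsvm indices start at 1)."""
    vocab = {}
    for tokens in contexts:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab) + 1
    return vocab

def to_libsvm_line(label, tokens, vocab):
    """Encode one context as 'label idx:count ...' with indices in ascending order."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    feats = " ".join("%d:%d" % (i, c) for i, c in sorted(counts.items()))
    return "%d %s" % (label, feats)

# Hypothetical training contexts for the ambiguous word "bank" (two senses).
contexts = [["de", "bank", "leent", "geld"], ["zit", "op", "de", "bank"]]
labels = [1, 2]

vocab = build_vocabulary(contexts)
lines = [to_libsvm_line(lab, ctx, vocab) for lab, ctx in zip(labels, contexts)]
for line in lines:
    print(line)
```

Word order is discarded and only the presence (here, the count) of each context word is kept, which is what makes this a bag-of-words model.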

To install this program, follow these 2 steps:

    1. Download/clone this repository
    2. Go to the folder and run the installation script, which will install libsvm, treetagger and the models for WSD; or, alternatively, go to the folder and run the script to work with NAF files (without treetagger)

In summary:

$ git clone
$ cd svm_wsd
$ . OR .


The input for the program has to be valid UTF-8 plain text. The script will read the text from the standard input. So there are 2 easy ways to call the program:

To work with plain text files, calling treetagger for lemma and POS tagging:

$ echo 'Dit is een input text' |
$ cat my_input_file.txt |
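Because the program expects valid UTF-8 on standard input, it can help to check a file before piping it in. The following is a small sketch of such a check; the helper name `is_valid_utf8` is hypothetical and is not part of svm_wsd.

```python
def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

ok = is_valid_utf8(u"Dit is een input text".encode("utf-8"))
bad = is_valid_utf8(b"\xff\xfe not utf-8")
print(ok, bad)
```

To check a real file, read it in binary mode (`open(path, "rb").read()`) and pass the resulting bytes to the function.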

The output will be an XML file with a structure similar to the one used by the SemCor corpus:

  <sent s_num="sentence number">
    <wf id="identifier" lemma="lemma" pos="part-of-speech" sense_confidence="value" sense_label="lexical-unit-label">token</wf>

The attributes sense_confidence and sense_label are only present when the token is disambiguated.
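As a sketch of how this output could be consumed, the snippet below collects the disambiguated tokens with Python's standard `xml.etree.ElementTree`. The sample document is invented for illustration: the `<text>` root element and all attribute values are assumptions, and only the `sent`/`wf` element names and attribute names come from the structure shown above.

```python
import xml.etree.ElementTree as ET

# Hypothetical output; the <text> root and the values are assumptions.
sample = """<text>
  <sent s_num="1">
    <wf id="w1" lemma="de" pos="det">De</wf>
    <wf id="w2" lemma="bank" pos="noun" sense_confidence="0.87" sense_label="bank-n-1">bank</wf>
  </sent>
</text>"""

def disambiguated_tokens(xml_string):
    """Return (id, token, sense_label, confidence) for every wf that carries a sense."""
    root = ET.fromstring(xml_string)
    result = []
    for wf in root.iter("wf"):
        label = wf.get("sense_label")
        if label is not None:  # attribute is absent on tokens that were not disambiguated
            confidence = float(wf.get("sense_confidence"))
            result.append((wf.get("id"), wf.text, label, confidence))
    return result

tokens = disambiguated_tokens(sample)
print(tokens)
```

Tokens without a `sense_label` attribute are simply skipped, matching the rule that both attributes appear only on disambiguated tokens.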

For working with NAF files:

$ cat input.naf | --naf -ref odwnSY > output.naf

The -ref parameter selects the type of reference written to the output:

  • corLU: for cornetto lexical unit ids
  • odwnLU: OpenDutchWordNet lexical unit ids
  • odwnSY: OpenDutchWordNet synset ids.

The output is the input NAF extended with the senses and their confidences, represented as external references in the externalReferences element of each term. The full ranking of senses, with the confidence value returned by the SVM for each one, is included in the output.
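The sketch below shows one way the ranked senses might be read back from such a NAF file. It assumes the common NAF layout in which each `term` holds an `externalReferences` element with `externalRef` children carrying `reference` and `confidence` attributes; the term id, resource name and sense identifiers in the sample are made up, so check them against the actual output before relying on this.

```python
import xml.etree.ElementTree as ET

# Hypothetical NAF fragment; attribute values are placeholders, not real identifiers.
sample_naf = """<NAF>
  <terms>
    <term id="t1" lemma="bank" pos="N">
      <externalReferences>
        <externalRef resource="hypothetical-resource" reference="sense-a" confidence="0.30"/>
        <externalRef resource="hypothetical-resource" reference="sense-b" confidence="0.70"/>
      </externalReferences>
    </term>
    <term id="t2" lemma="de" pos="D"/>
  </terms>
</NAF>"""

def ranked_senses(naf_string):
    """Map each term id to its (reference, confidence) pairs, best sense first."""
    root = ET.fromstring(naf_string)
    ranking = {}
    for term in root.iter("term"):
        refs = [(ref.get("reference"), float(ref.get("confidence", "0")))
                for ref in term.iter("externalRef")]
        if refs:  # terms without external references were not disambiguated
            ranking[term.get("id")] = sorted(refs, key=lambda r: r[1], reverse=True)
    return ranking

ranking = ranked_senses(sample_naf)
print(ranking)
```

The first pair in each list is the sense with the highest confidence; the remaining pairs give the rest of the ranking.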



Word Sense Disambiguation system developed in the DutchSemCor project using Support Vector Machines. The input is plain text, and the output is XML.



