multiword-tagger

This module reads a KAF or NAF file and detects multiword sequences of terms according to WordNet.

https://github.com/cltl/MultiWordTagger

This module was developed at VU University Amsterdam.

KafMultiWordTagger version 1.0

Copyright: VU University Amsterdam

email: piek.vossen@vu.nl

DESCRIPTION: KafMultiWordTagger reads KAF or NAF files and detects multiword sequences in the Term layer using WordNet lexicons. A configuration file is needed for each language to determine the head of the multiword. The head is used to adapt the chunk and dependency layers.

SOURCE CODE: https://kyoto.let.vu.nl/svn/kyoto/trunk/modules/mwtagger

DEPENDENCIES (Maven group, artifact, version, scope):

  1. eu.kyotoproject.kaf, KyotoKafSaxParser, 1.0, compile
  2. net.sf.pipet, pipet-api, 1.2.3, provided

The binaries can be built with Maven using the pom.xml:

mvn install

BINARIES: http://kyoto.let.vu.nl/~kyoto/files/multiwordtagger/mwtagger.v.01.zip

WEBSITE: http://xmlgroup.iit.cnr.it/kyoto/index.php?option=com_content&view=article&id=311&Itemid=139

REQUIREMENTS: KafMultiWordTagger was compiled with Java 1.6 on Mac OS X. It should run on any platform that supports Java 1.6. It requires no installation steps other than copying the directory structure as is. You may need to edit the configuration file so that it points to the proper WN-LMF file and uses the correct language patterns.

Integration in KYOTO pipeline:

The eu.kyotoproject.multiwordtagger.MultiwordTaggerModule class should be used to run MWT as a module within the KYOTO PipeT architecture. Within the standard KYOTO pipeline, the MultiWordTagger operates on the KAF that is generated by the Linguistic Processors (LPs), before word-sense disambiguation takes place. As a pipeline module in KYOTO, MWT takes kaf/lp as an input stream and generates kaf/mw as an output stream for any KAF document in the document base to which the MWT is added as a processor. The MWT module takes the path to a configuration file as a configuration value in the constructor. This path is specified through the pipeline configuration option (see the documentation on PipeT). The configuration file contains the patterns for a specified language (see above) and the path to the WordNet lexicons in WN-LMF format containing the multiwords, for example:

    # last or first
    # any pos tag that marks post head position, e.g. for English a preposition P
    # terminates the search for the head so that the last N before P becomes the head
    # patterns are checked in the listed order
    # first matching pattern applies
    lang=en
    generic_wn_lmf=/Projects/Kyoto/Data/mwtagger/resources/wnen3.xml.lmf
    domain_wn_lmf=/Projects/Kyoto/Data/mwtagger/resources/wneng_domain_LMF_v3.xml
    N:P
    N:last
    V:last
    G:last
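As an illustration of how such a file could be read, here is a minimal sketch. It is not the tagger's own loader: the class name MwtConfigSketch and its fields are hypothetical, and it assumes that lines starting with # are comments, that settings are key=value pairs, and that patterns have the form POS:rule. It also assumes Java 7 or later.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of KafMultiWordTagger: parses the
// configuration layout shown above under the assumptions in the lead-in.
class MwtConfigSketch {
    // key=value settings such as lang, generic_wn_lmf, domain_wn_lmf
    final Map<String, String> settings = new LinkedHashMap<>();
    // Each pattern is {POS, rule}, kept in file order because order matters
    final List<String[]> patterns = new ArrayList<>();

    static MwtConfigSketch parse(List<String> lines) {
        MwtConfigSketch config = new MwtConfigSketch();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            int eq = line.indexOf('=');
            int colon = line.indexOf(':');
            if (eq > 0) {
                // Settings are checked first so paths containing ':' stay intact
                config.settings.put(line.substring(0, eq).trim(),
                                    line.substring(eq + 1).trim());
            } else if (colon > 0) {
                config.patterns.add(new String[] {
                    line.substring(0, colon).trim(),
                    line.substring(colon + 1).trim()
                });
            }
        }
        return config;
    }

    public static void main(String[] args) throws IOException {
        MwtConfigSketch config = parse(Files.readAllLines(Paths.get(args[0])));
        System.out.println("lang = " + config.settings.get("lang"));
        for (String[] pattern : config.patterns) {
            System.out.println("pattern " + pattern[0] + " -> " + pattern[1]);
        }
    }
}
```

Keeping the patterns in a list rather than a map preserves the listed order, which matters because the first matching pattern applies.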

It is possible to specify up to two WordNet files in WN-LMF format containing the multiwords. If no multiword lexicons are found, the program aborts and does not generate output. If no patterns are specified, the MWT will take the last word with the same POS as the head. Through the configuration file, MWT can be set to run on different languages and with different WN-LMF files. Specify the correct absolute (!) path to the WN-LMF files for running MWT. You may also need to validate the patterns and the POS codes in KAF.
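The pattern rules can be read as a small head-selection procedure. The sketch below is one possible interpretation, not the tagger's actual code: the class HeadSelectionSketch is hypothetical, POS tags are compared by exact equality, a rule other than first or last is treated as a POS tag that terminates the search, and the fallback when nothing matches (simply taking the last term) is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of head selection; an interpretation of the
// pattern rules described above, not the KafMultiWordTagger implementation.
class HeadSelectionSketch {

    // Returns the index of the head term, given the POS tag of each term
    // in the multiword and the {POS, rule} patterns in configuration order.
    static int selectHead(List<String> posTags, List<String[]> patterns) {
        for (String[] pattern : patterns) {
            String headPos = pattern[0];
            String rule = pattern[1];
            int head;
            if (rule.equals("first")) {
                head = posTags.indexOf(headPos);
            } else if (rule.equals("last")) {
                head = posTags.lastIndexOf(headPos);
            } else {
                // Assumption: the rule is a POS tag that ends the search, so
                // the head is the last headPos term before its first occurrence
                int stop = posTags.indexOf(rule);
                head = stop < 0 ? -1 : posTags.subList(0, stop).lastIndexOf(headPos);
            }
            if (head >= 0) {
                return head; // first matching pattern applies
            }
        }
        // Assumed fallback when no pattern matches: take the last term
        return posTags.size() - 1;
    }

    public static void main(String[] args) {
        List<String[]> patterns = new ArrayList<>();
        patterns.add(new String[] {"N", "P"});
        patterns.add(new String[] {"N", "last"});
        patterns.add(new String[] {"V", "last"});
        patterns.add(new String[] {"G", "last"});

        // N P N: the N before the preposition is the head, prints 0
        System.out.println(selectHead(Arrays.asList("N", "P", "N"), patterns));
        // G N: no preposition, so N:last picks the final noun, prints 1
        System.out.println(selectHead(Arrays.asList("G", "N"), patterns));
    }
}
```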

To run MWT as a standalone program on KAF files on disk:

The eu.kyotoproject.multiwordtagger.MultiTaggerTest class can be used to run the tagger as a standalone application on any set of KAF files on disk. MultiTaggerTest takes two arguments:

  1. the full path to the configuration file
  2. the full path to a folder that contains the KAF files

Below is an example of how to call the MWT test class on a folder containing KAF documents:

java -Xmx512m -cp ./lib/kaf.jar:./lib/mwtagger.jar eu.kyotoproject.multiwordtagger.MultiTaggerTest "/Projects/Kyoto/mwtagger.v01/conf/mwtagger.english.cfg" "/Projects/Kyoto/Data/Estuaries/English"

This call uses the configuration file for English and processes all files with the extension .kaf in the English folder. It creates a new folder English_lp+mw to store all the KAF files with multiword annotation.
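The file selection and output folder naming described above can be sketched as follows. This is illustrative only: the class KafFolderSketch is hypothetical, and placing the output folder next to the input folder is an assumption, since the text above only states its name.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the MultiTaggerTest source: shows which files
// would be selected and how the output folder name is derived.
class KafFolderSketch {

    static boolean isKafFile(String name) {
        return name.endsWith(".kaf");
    }

    // Collects the regular files in the folder whose name ends with .kaf
    static List<File> kafFiles(File folder) {
        List<File> result = new ArrayList<>();
        File[] entries = folder.listFiles(); // null if not a readable directory
        if (entries == null) {
            return result;
        }
        for (File entry : entries) {
            if (entry.isFile() && isKafFile(entry.getName())) {
                result.add(entry);
            }
        }
        return result;
    }

    // Assumption: the output folder is a sibling of the input folder
    static File outputFolderFor(File inputFolder) {
        return new File(inputFolder.getParentFile(), inputFolder.getName() + "_lp+mw");
    }

    public static void main(String[] args) {
        File input = new File(args[0]);
        System.out.println("Output folder: " + outputFolderFor(input));
        for (File kaf : kafFiles(input)) {
            System.out.println("Would process: " + kaf.getName());
        }
    }
}
```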

LICENSE:

KafMultiWordTagger is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

KafMultiWordTagger is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with KafMultiWordTagger. If not, see http://www.gnu.org/licenses/.