extjwnl/extjwnl-data-mcr30

About

extjwnl-data-mcr30 provides prepackaged jars containing wordnet data from the Multilingual Central Repository (MCR) 3.0 (2016 release). Currently only the Spanish portion is included.

A configuration file is included, so the data can be loaded through extjwnl's default resource instance without writing your own configuration.

Getting started

In your pom.xml:

<dependency>
    <groupId>net.sf.extjwnl</groupId>
    <artifactId>extjwnl</artifactId>
    <version>2.0.3</version>
</dependency>
<dependency>
    <groupId>net.sf.extjwnl.mcr</groupId>
    <artifactId>extjwnl-data-spa-mcr30</artifactId>
    <version>1.0.5</version>
</dependency>

In your code:

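import net.sf.extjwnl.dictionary.*;

// Loads the dictionary from the bundled configuration on the classpath.
// getDefaultResourceInstance() declares net.sf.extjwnl.JWNLException.
Dictionary d = Dictionary.getDefaultResourceInstance();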

Mapping Between Dictionaries

extjwnl-data-mcr30 also contains an alignment module which supports loading multiple dictionaries and mapping word senses between them. To use it, you first need the following additional dependency in your pom.xml:

<dependency>
    <groupId>net.sf.extjwnl.mcr</groupId>
    <artifactId>extjwnl-data-alignment-mcr30</artifactId>
    <version>1.0.5</version>
</dependency>

Then you can load the MCR 3.0 Spanish wordnet together with two versions (3.0 and 3.1) of Princeton WordNet:

import net.sf.extjwnl.dictionary.*;
import net.sf.extjwnl.data.mcr30.alignment.*;

Dictionary spa = InterLingualIndex.getDictionary("mcr30", "spa");
Dictionary wn31 = InterLingualIndex.getDictionary("wn31", "eng");
Dictionary wn30 = InterLingualIndex.getDictionary("wn30", "eng");

After that, if you have a Spanish synset, you can find the corresponding English synset (if a mapping exists):

Synset englishSynset = InterLingualIndex.mapSynset(spanishSynset, wn31);

If you need to map many synsets, use the SynsetMapper interface instead for better performance:

SynsetMapper mapper = InterLingualIndex.loadMapper(spa, wn31);
Synset englishSynset1 = mapper.mapSynset(spanishSynset1);
Synset englishSynset2 = mapper.mapSynset(spanishSynset2);
...

For more information, see the javadoc.

Acknowledgements

The data for this package comes from the Multilingual Central Repository (MCR):

Aitor Gonzalez-Agirre, Egoitz Laparra and German Rigau (2012) Multilingual Central Repository version 3.0: upgrading a very large lexical knowledge base. In Proceedings of the 6th Global WordNet Conference (GWC 2012) Matsue, Japan.

@InProceedings{Gonzalez-Agirre:Laparra:Rigau:2012,
  author = "Aitor Gonzalez-Agirre and Egoitz Laparra and German Rigau",
  title = "Multilingual Central Repository version 3.0: upgrading a very large lexical knowledge base",
  booktitle = "Proceedings of the 6th Global WordNet Conference (GWC 2012)",
  year = 2012,
  address = "Matsue",
}

This package is designed for use with extjwnl. The resource bundling is based on the pattern set by extjwnl-data-wn31 for the English-language Princeton WordNet 3.1.

Princeton University "About WordNet." WordNet. Princeton University. 2010.

MCR data is converted into extjwnl format via a modified version of the wn-mcr-transform script. You can find the modified version here.

The MCR is aligned with Princeton WordNet 3.0, so for realigning to Princeton WordNet 3.1, we use the 3.0->3.1 mapping_wordnet.json from:

@misc{ZendelWordNetConv19,
  author = {Zendel, Oliver},
  title = {WordNet v3.0 vs. v3.1 mapping},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/ozendelait/wordnet-to-json}},
  commit = {7521b70937355e826ea7e028a615108cdb18d0ee}
}
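The realignment described above amounts to composing two lookups: MCR synset to Princeton WordNet 3.0, then 3.0 to 3.1. The following is only a minimal sketch of that idea using plain maps. The class name `RealignmentSketch`, the method `realign`, and the identifiers in the usage note are hypothetical placeholders; the sketch does not reproduce the structure of mapping_wordnet.json or the alignment module's actual implementation.

```java
import java.util.Map;
import java.util.Optional;

/** Illustrative only: composes two ID mappings; not the alignment module's code. */
final class RealignmentSketch {

    /**
     * Follows mcrId through the MCR->WN3.0 table and then the WN3.0->WN3.1 table.
     * Returns an empty Optional when either step has no entry.
     */
    static Optional<String> realign(String mcrId,
                                    Map<String, String> mcrToWn30,
                                    Map<String, String> wn30ToWn31) {
        // Optional.map yields an empty Optional when the second lookup returns null.
        return Optional.ofNullable(mcrToWn30.get(mcrId)).map(wn30ToWn31::get);
    }
}
```

For example, with placeholder tables `{spa-A -> wn30-A, spa-B -> wn30-B}` and `{wn30-A -> wn31-A}`, `realign("spa-A", ...)` yields `wn31-A`, while `spa-B` yields an empty result because its 3.0 synset has no 3.1 counterpart in the table.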

Stemming

Language-specific stemming rules are packaged in each data module; for example, here are the Spanish-specific stemming rules.

Exceptional Forms

For Spanish, exceptional forms (irregular verb conjugations, noun pluralizations, and adjective pluralizations) are enumerated using the morphala project. All lemmas from the MCR dictionary are run through morphala's conjugation/pluralization routines. From the resulting derived form, we attempt to reverse-derive the lemma as a base form via the standard DetachSuffixesOperation. When this fails, we treat the derived form as an exception and add it to supplemental_spa.txt.
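The round-trip check described above can be sketched as follows. This is a simplified illustration, not the project's build code: the class `ExceptionalFormSketch`, its two-entry suffix list, and the sample words are assumptions made for the example, and the suffix detaching here is a stand-in for extjwnl's DetachSuffixesOperation rather than a call to it. The derived forms are assumed to have been produced already by an inflection tool such as morphala.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

/** Illustrative only: a toy version of the "derive, then try to reverse" check. */
final class ExceptionalFormSketch {

    // Hypothetical, deliberately incomplete plural suffixes to detach.
    static final List<String> SUFFIXES = Arrays.asList("es", "s");

    /** Returns the known lemmas reachable from form by detaching one suffix. */
    static List<String> detachSuffixes(String form, Set<String> lemmas) {
        List<String> found = new ArrayList<>();
        for (String suffix : SUFFIXES) {
            if (form.endsWith(suffix) && form.length() > suffix.length()) {
                String candidate = form.substring(0, form.length() - suffix.length());
                if (lemmas.contains(candidate)) {
                    found.add(candidate);
                }
            }
        }
        return found;
    }

    /** A derived form is exceptional when the suffix rules cannot recover its lemma. */
    static boolean isExceptional(String derivedForm, String lemma, Set<String> lemmas) {
        return !detachSuffixes(derivedForm, lemmas).contains(lemma);
    }
}
```

With the lemmas casa, flor and pez, the forms casas and flores reduce back to their lemmas, while peces does not (neither "pec" nor "pece" is a lemma), so under these toy rules only peces would be recorded as an exception.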

Future Work

If you are interested in adding support for languages beyond Spanish (such as Portuguese), please open an issue on this project. At a minimum, supporting a new language requires bringing in its language-specific dataset from MCR and adding stemming rules for regular inflections. A further improvement would be to extend morphala with the support needed to generate exceptional forms for that language.
