About

extjwnl-data-mcr30 packages wordnet data from the Multilingual Central Repository (MCR) 3.0, 2016 release, as ready-to-use jars. Currently only the Spanish portion is included.

A configuration file is included so that these resources can be used from your project with minimal setup.

Getting started

In your pom.xml:

<dependency>
    <groupId>net.sf.extjwnl</groupId>
    <artifactId>extjwnl</artifactId>
    <version>2.0.3</version>
</dependency>
<dependency>
    <groupId>net.sf.extjwnl.mcr</groupId>
    <artifactId>extjwnl-data-spa-mcr30</artifactId>
    <version>1.0.5</version>
</dependency>

In your code:

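import net.sf.extjwnl.dictionary.*;

// Loads the dictionary described by the bundled configuration file.
// The call is declared to throw JWNLException, so handle or propagate it.
Dictionary d = Dictionary.getDefaultResourceInstance();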

Mapping Between Dictionaries

extjwnl-data-mcr30 also contains an alignment module which supports loading multiple dictionaries and mapping word senses between them. To use it, you first need the following additional dependency in your pom.xml:

<dependency>
    <groupId>net.sf.extjwnl.mcr</groupId>
    <artifactId>extjwnl-data-alignment-mcr30</artifactId>
    <version>1.0.5</version>
</dependency>

Then you can load the MCR 3.0 Spanish wordnet together with two versions (3.0 and 3.1) of Princeton WordNet:

import net.sf.extjwnl.dictionary.*;
import net.sf.extjwnl.data.mcr30.alignment.*;

Dictionary spa = InterLingualIndex.getDictionary("mcr30", "spa");
Dictionary wn31 = InterLingualIndex.getDictionary("wn31", "eng");
Dictionary wn30 = InterLingualIndex.getDictionary("wn30", "eng");

After that, if you have a Spanish synset, you can find the corresponding English synset (if a mapping exists):

Synset englishSynset = InterLingualIndex.mapSynset(spanishSynset, wn31);

To map many synsets, use the SynsetMapper interface instead for better performance:

SynsetMapper mapper = InterLingualIndex.loadMapper(spa, wn31);
Synset englishSynset1 = mapper.mapSynset(spanishSynset1);
Synset englishSynset2 = mapper.mapSynset(spanishSynset2);
...

For more information, see the javadoc.

Acknowledgements

The data for this package comes from the Multilingual Central Repository (MCR):

Aitor Gonzalez-Agirre, Egoitz Laparra and German Rigau (2012) Multilingual Central Repository version 3.0: upgrading a very large lexical knowledge base. In Proceedings of the 6th Global WordNet Conference (GWC 2012) Matsue, Japan.

@InProceedings{Gonzalez-Agirre:Laparra:Rigau:2012,
  author = "Aitor Gonzalez-Agirre and Egoitz Laparra and German Rigau",
  title = "Multilingual Central Repository version 3.0: upgrading a very large lexical knowledge base",
  booktitle = "Proceedings of the 6th Global WordNet Conference (GWC 2012)",
  year = 2012,
  address = "Matsue",
}

This package is designed for use with extjwnl. The resource bundling is based on the pattern set by extjwnl-data-wn31 for the English-language Princeton WordNet 3.1.

Princeton University "About WordNet." WordNet. Princeton University. 2010.

MCR data is converted into extjwnl format via a modified version of the wn-mcr-transform script. You can find the modified version here.

The MCR is aligned with Princeton WordNet 3.0, so for realigning to Princeton WordNet 3.1, we use the 3.0->3.1 mapping_wordnet.json from:

@misc{ZendelWordNetConv19,
  author = {Zendel, Oliver},
  title = {WordNet v3.0 vs. v3.1 mapping},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/ozendelait/wordnet-to-json}},
  commit = {7521b70937355e826ea7e028a615108cdb18d0ee}
}
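The realignment step amounts to a table lookup: each MCR synset carries a Princeton WordNet 3.0 key, and the mapping translates that key to its 3.1 counterpart where one exists. The following is a minimal sketch of that idea, assuming the mapping has already been parsed into an in-memory map. The class name `OffsetRealigner`, the `offset-pos` key format, and the sample entry are hypothetical; they do not reflect the actual layout of mapping_wordnet.json or this project's build code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper: translates WordNet 3.0 synset keys into WordNet 3.1 keys. */
public class OffsetRealigner {
    // Keys are illustrative "offset-pos" strings; the real file may use another layout.
    private final Map<String, String> wn30ToWn31;

    public OffsetRealigner(Map<String, String> wn30ToWn31) {
        // Defensive copy so later changes to the caller's map do not affect lookups.
        this.wn30ToWn31 = new HashMap<>(wn30ToWn31);
    }

    /** Returns the 3.1 key for a 3.0 key, or empty when the table has no entry. */
    public Optional<String> realign(String wn30Key) {
        return Optional.ofNullable(wn30ToWn31.get(wn30Key));
    }

    public static void main(String[] args) {
        Map<String, String> table = new HashMap<>();
        table.put("00000001-n", "00000002-n"); // made-up sample entry
        OffsetRealigner realigner = new OffsetRealigner(table);
        System.out.println(realigner.realign("00000001-n").orElse("(no mapping)"));
        System.out.println(realigner.realign("99999999-v").orElse("(no mapping)"));
    }
}
```

Returning an `Optional` makes the "no mapping exists" case explicit, which matches the caveat that a Spanish synset may have no corresponding English synset.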

Stemming

Language-specific stemming rules are packaged in each data module; for example, here are the Spanish-specific stemming rules.

Exceptional Forms

For Spanish, exceptional forms (irregular verb conjugations, noun pluralizations, and adjective pluralizations) are enumerated using the morphala project. All lemmas from the MCR dictionary are run through morphala's conjugation/pluralization routines. From the resulting derived form, we attempt to reverse-derive the lemma as a base form via the standard DetachSuffixesOperation. When this fails, we treat the derived form as an exception and add it to supplemental_spa.txt.
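The procedure above is essentially a round-trip check, which can be sketched as follows. This is only an illustration under stated assumptions: a toy first-person-singular conjugator stands in for morphala, and a three-entry suffix table stands in for the packaged Spanish stemming rules. The class `ExceptionFinder` and its methods are hypothetical and are not part of extjwnl or morphala, and the result is not in the format of supplemental_spa.txt.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.UnaryOperator;

/** Hypothetical round-trip check; not part of extjwnl or morphala. */
public class ExceptionFinder {
    // Toy suffix rules {ending, replacement}, standing in for the packaged stemming rules.
    private static final String[][] SUFFIX_RULES = {
        {"o", "ar"}, {"o", "er"}, {"o", "ir"}
    };

    /** True if detaching some suffix from the derived form yields the lemma. */
    public static boolean recoverable(String derived, String lemma) {
        for (String[] rule : SUFFIX_RULES) {
            if (derived.endsWith(rule[0])) {
                String stem = derived.substring(0, derived.length() - rule[0].length());
                if ((stem + rule[1]).equals(lemma)) {
                    return true;
                }
            }
        }
        return false;
    }

    /** Maps each derived form that the suffix rules cannot reduce to its lemma. */
    public static Map<String, String> findExceptions(List<String> lemmas,
                                                     UnaryOperator<String> inflect) {
        Map<String, String> exceptions = new TreeMap<>();
        for (String lemma : lemmas) {
            String derived = inflect.apply(lemma);
            if (!recoverable(derived, lemma)) {
                exceptions.put(derived, lemma);
            }
        }
        return exceptions;
    }

    /** Toy first-person-singular conjugator standing in for morphala. */
    public static String conjugate(String infinitive) {
        Map<String, String> irregular = new HashMap<>();
        irregular.put("ir", "voy");
        irregular.put("ser", "soy");
        String form = irregular.get(infinitive);
        // Regular case: drop the two-letter infinitive ending and append "o".
        return form != null ? form : infinitive.substring(0, infinitive.length() - 2) + "o";
    }

    public static void main(String[] args) {
        List<String> lemmas = Arrays.asList("hablar", "comer", "ir", "ser");
        System.out.println(findExceptions(lemmas, ExceptionFinder::conjugate));
    }
}
```

With these toy rules, "hablo" and "como" reduce back to "hablar" and "comer", so only "voy" and "soy" are reported as exceptions.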

Future Work

If you are interested in adding support for languages beyond Spanish (such as Portuguese), please open an issue on this project. The bare minimum for a new language is to bring in its dataset from MCR and add stemming rules for regular inflections. A bonus would be to enhance morphala with the support needed to generate exceptional forms for that language.
