This is an implementation of the Term Definition Vectors (TDV) method for language representation. TDV provides high-dimensional, sparse vector representations of lexical items (terms), ranging from morphemes to phrases, based on the definitions of their meanings. It contrasts with distributional representation methods, such as word2vec and GloVe, which derive term meanings from their usage patterns (context windows). Compared to distributional methods, TDV performs better at semantic similarity computation, while distributional methods perform better at semantic relatedness. In this implementation, each concept vector represents a sense described in Wiktionary, as well as its available translations. The TDV representations can be used in several natural language processing applications, and since each vector dimension represents a specific definition property, they are also human-readable. See the paper for more information on Term Definition Vectors.
Computing vector cosine will yield positive values for adjectives agreeing in sense, such as "skinny" and "thin", and negative values for those with opposing senses, such as "happy" and "sad".
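The cosine comparison above can be sketched with plain Python dictionaries standing in for the sparse vectors. The vectors and their dimension names below are toy values for illustration only, not real TDV dimensions:

```python
import math

def cosine(u, v):
    """Cosine similarity for sparse vectors stored as {dimension: weight} dicts."""
    dot = sum(w * v.get(d, 0.0) for d, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy vectors (illustrative only -- not actual TDV definition properties):
skinny = {"thin": 1.0, "lacking_fat": 0.8}
thin   = {"thin": 1.0, "narrow": 0.5}
happy  = {"happy": 1.0, "feeling": 0.6}
sad    = {"happy": -1.0, "feeling": 0.6}
```

With these values, `cosine(skinny, thin)` is positive (shared dimension with agreeing weights), while `cosine(happy, sad)` is negative (opposing weight on the shared `happy` dimension).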
Translations available in Wiktionary are mapped into the same vector space, so that terms like "today", "hoy" and "今日" have similar representations and can be compared meaningfully.
POS information can be used to reduce the number of senses in the definition vectors, disambiguating their meaning. POS-tagging can be done through popular tools, such as the Stanford Tagger or the NLTK toolset.
- Download the code: git clone https://github.com/dscarvalho/tdv.git
- Change to the 'tdv' directory, from the project root: cd tdv/tdv
- Run 'make all'
- Change to the 'build' directory: cd build
- Run either:
  - './run_service.sh' to start the TDV web service, or
  - './gen_vectors' to generate vectors and write them to a file.
You can adjust the base link weights, among other parameters, in the configuration files under 'build/cfg/'. Complete documentation of the system is under construction and will be included in the repository soon.
- A UNIX-like system: Linux, Mac OSX.
- A recent C++ compiler, with support for C++11: LLVM/Clang (>= 3.1), GCC (>= 4.8).
- Python 2.7
- JSON for modern C++ (included in the source).
- CppCMS (downloaded during setup).
CppCMS dependencies should be installed prior to setup, in particular CMake, Zlib and PCRE. See the CppCMS installation page for details.
The TDV web service provides the following methods:
- similarity: returns a similarity measure (cosine + heuristics) for a given pair of terms and their corresponding POS (optional).
- similar: returns Wiktionary entries that are similar to a provided term, in decreasing order of similarity. Can be reversed to obtain the "most dissimilar" or "opposite" entries.
- repr: returns the definition vector for the given term and POS (optional).
- disambig: given a sentence and a term from that sentence, with optional POS, returns the sense definition of the term.
- wiktdef: returns the pre-processed Wiktionary entry of a term.
All service responses are JSON compatible.
- http://localhost:6480/tdv/similarity?term1=cat&pos1=noun&term2=lion&pos2=noun
- http://localhost:6480/tdv/repr?term=move&pos=verb&human=true
- http://localhost:6480/tdv/disambig?sent=There%20is%20no%20ship%20docked%20here&term=ship
- http://localhost:6480/tdv/disambig?sent=They%20will%20ship%20those%20books%20next%20week&term=ship
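The service can be queried from any HTTP client. Below is a minimal sketch using only the Python standard library; the method names, parameters, and port are taken from the examples above, and the `tdv_url`/`tdv_query` helper names are our own:

```python
import json
import urllib.parse
import urllib.request

BASE = "http://localhost:6480/tdv"  # local service address, as in the examples above

def tdv_url(method, **params):
    """Build a request URL for a TDV service method."""
    return BASE + "/" + method + "?" + urllib.parse.urlencode(params)

def tdv_query(method, **params):
    """Call the service and decode its JSON response (requires run_service.sh to be running)."""
    with urllib.request.urlopen(tdv_url(method, **params)) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Example call (only works while the service is running):
# sim = tdv_query("similarity", term1="cat", pos1="noun", term2="lion", pos2="noun")
```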
An online demonstration service can be accessed at https://lasics.dcc.ufrj.br/tdv/
Extracted concept vectors can be downloaded from the following links:
- Concept vectors for English Wiktionary 2017-04-20 (cn,de,en,es,fr,jp,pt,vn)
- Concept vectors for English Wiktionary 2017-04-20 (cn,de,en,es,fr,jp,pt,vn) (human-readable)
- Concept vectors for English Wiktionary 2017-04-20 (en, Translingua)
- Concept vectors for English Wiktionary 2017-04-20 (en, Translingua) (human-readable)
Wiktionary pre-processed data can be downloaded from the following links:
- Pre-processed English Wiktionary 2017-04-20 (all languages)
- Pre-processed English Wiktionary 2017-04-20 (English, Translingua)
The files are minified and compressed with bzip2.
The schema for the pre-processed Wiktionary data can be found in wiktparser/wikt_entry_schema.json. Strings not processed by the parser are left in the original Wiktionary markup format.
These files are distributed under the Creative Commons Attribution-ShareAlike 3.0 license.
You can also process an up-to-date Wiktionary database file (XML) in the following way:
- Change to the 'wiktparser' directory, from the project root: cd tdv/wiktparser
- Run 'python extract.py -idx [path of temp. index] -n [num. parallel processes] -pp [num. of entries per process] [path of Wiktionary XML database dump] [path of the output file (JSON)]'
- The number of processes and entries per process should be adjusted according to the number of CPUs and available main memory.
- To filter the output file for specific languages, run 'python langfilter.py [path of the preprocessed Wiktionary file (JSON)] [languages]'
- where [languages] is a comma separated list of the desired languages. Ex: "English,German,Chinese".
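The filtering step can be sketched as follows. This is an assumption-laden illustration, not the langfilter.py implementation: the entry structure and the `"lang"` field name are hypothetical (the actual field names are defined in wiktparser/wikt_entry_schema.json):

```python
def filter_languages(entries, languages, lang_field="lang"):
    """Keep only entries whose language field is in the requested set.
    The field name 'lang' is an assumption; check wikt_entry_schema.json."""
    wanted = set(languages)
    return [e for e in entries if e.get(lang_field) in wanted]

# A comma-separated language list, in the format langfilter.py accepts:
langs = "English,German,Chinese".split(",")
# Hypothetical pre-processed entries:
sample = [{"lang": "English", "title": "cat"}, {"lang": "French", "title": "chat"}]
```

Here `filter_languages(sample, langs)` would keep only the English entry.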
WikTDV is released under the free MIT license. You are free to copy, modify and redistribute the code, under the condition of including the original copyright notice and license in all copies of substantial portions of this software. See LICENSE for more information.
If you use WikTDV in your research, please cite the paper:
Danilo S. Carvalho and Minh Le Nguyen. Building Lexical Vector Representations from Concept Definitions. [pdf] [bib]
Feel free to send us links to projects or research papers related to WikTDV that you think will be useful for others. They will be included in this page.
Contact info: {danilo, nguyenml} [at] jaist.ac.jp