A tool for calculating semantic similarity between words from a text corpus based on lexico-syntactic patterns.
- Currently, the tool consists of two separate programs -- patternsim and patternsim-rank (see below).
- This tool implements the extraction method described in these papers:
- Panchenko A., Morozova O., Naets H. “A Semantic Similarity Measure Based on Lexico-Syntactic Patterns.” In Proceedings of the 11th Conference on Natural Language Processing (KONVENS 2012), Vienna (Austria), 2012.
- http://www.oegai.at/konvens2012/proceedings/23_panchenko12p/
- Kristina Sabirova, Artem Lukanin. Automatic Extraction of Hypernyms and Hyponyms from Russian Texts // Supplementary Proceedings of the 3rd International Conference on Analysis of Images, Social Networks and Texts (AIST 2014) / Ed. by D. I. Ignatov, M. Y. Khachay, A. Panchenko, N. Konstantinova, R. Yavorsky, D. Ustalov. Vol. 1197: Supplementary Proceedings of AIST 2014. CEUR-WS.org, 2014. pp. 35-40.
- http://ceur-ws.org/Vol-1197/paper6.pdf
- A demo of the extraction results provided with this method can be accessed here: http://serelex.cental.be/
- Related repositories:
- Source code of the demo system: https://github.com/PomanoB/lsse
- An evaluation framework for semantic similarity measures: https://github.com/alexanderpanchenko/sim-eval
LGPLv3: http://www.gnu.de/documents/lgpl-3.0.en.html
A tool for collecting raw extraction counts using lexico-syntactic patterns.
Requirements
- Perl 5.14.x or higher
- Unitex 3.0beta (http://www-igm.univ-mlv.fr/~unitex/)
Installation on Ubuntu 12.04
- Install Unitex 3.0beta (http://www-igm.univ-mlv.fr/~unitex/zips/Unitex3.0beta.zip)
- Install cpanm: "sudo cpan App::cpanminus"
- Install all dependencies: "sudo cpanm --installdeps ."
Quick Start
Use ./rerank.sh to rerank relations with the default formula; the script also serves as a usage example for patternsim-rank.
Synopsis
patternsim [options] [corpus_file(s) ...]
Options
Usage:
patternsim [options] [corpus_file(s) ...]
Mandatory options:
--unitex Unitex main directory
--output (-o) output directory
Options:
--vocabulary (-v) input vocabulary file
--workers (-w) number of workers
--language (-l) language
--list-languages list all available languages
--verbose verbose mode
--help brief help message
--man full documentation
Options:
--unitex *unitex_main_directory*
Specify the Unitex main directory if you want to use your own
Unitex installation (overrides the patternsim configuration file)
--output -o *output_directory*
Specify the output directory.
--vocabulary --vocab -v *vocabulary_file*
Specify the UTF-8 input vocabulary file (one word per line)
--workers -w *number_of_workers*
Specify the number of parallel workers. Workers extract semantic
relations in parallel. A good number of workers is the number of
CPU cores minus 1.
--language -l *language_id*
Specify the current language
--list-languages
Show all available languages (language_id and full name)
--help -h
Prints a brief help message and exits.
--man Prints the manual page and exits.
--verbose
Activates verbose mode and explains every processing step.
Output is written to stderr.
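The rule of thumb given for --workers (the number of CPU cores minus 1) can be computed before launching patternsim. The following is only a minimal Python sketch; the helper name suggested_workers is hypothetical and is not part of patternsim.

```python
import os


def suggested_workers(cpu_count=None):
    """Return the number of CPU cores minus 1, but never fewer than 1.

    suggested_workers is a hypothetical helper, not a patternsim function.
    """
    if cpu_count is None:
        # os.cpu_count() may return None if the count cannot be determined.
        cpu_count = os.cpu_count() or 1
    return max(1, cpu_count - 1)


if __name__ == "__main__":
    # Print a value that could be passed to the --workers option.
    print(suggested_workers())
```

The printed value can then be passed to patternsim through --workers.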
Example
./patternsim --unitex /home/user/Unitex3.0beta -v vocabulary.txt -o output corpus.txt
This command writes a set of files to the directory "./output":
- conc-freq.csv -- a frequency list derived from a set of extraction concordances
- corpus-freq.csv -- a frequency list derived from an input corpus "corpus.txt"
- pairs.csv -- similarity matrix containing raw extraction counts between all single words
- pairs-np.csv -- similarity matrix containing raw extraction counts between all noun phrases
- pairs-voc.csv -- similarity matrix containing raw extraction counts between terms from the input vocabulary "vocabulary.txt"
The files conc-freq.csv and corpus-freq.csv are CSV files in the following format:
word;frequency\n
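As an illustration of the 'word;frequency' format, here is a minimal Python sketch that loads such a file into a dictionary. It assumes that the file has no header row and that frequencies are integers; neither is stated above. The function name read_frequencies and the sample rows are made up for this example.

```python
def read_frequencies(lines):
    """Parse 'word;frequency' rows into a dict mapping word -> int.

    Assumptions (not confirmed by the tool): no header row, integer counts.
    """
    freqs = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip empty lines
        # Split on the last ';' so that a word containing ';' stays intact.
        word, freq = line.rsplit(";", 1)
        freqs[word] = int(freq)
    return freqs


# Made-up sample rows, for illustration only.
sample = ["cat;120\n", "dog;95\n"]
frequencies = read_frequencies(sample)

# With a real file:
# with open("output/corpus-freq.csv", encoding="utf-8") as handle:
#     frequencies = read_frequencies(handle)
```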
The files pairs.csv, pairs-np.csv and pairs-voc.csv are CSV files in the following format:
target-word;relatum-word;e-syno;e-cohypo;e-hyper-hypo;e-hyper;e-hypo;e-all;e1;e2;e3;e4;e5;e6;e7;e8;e9;e10;e11;e12;e13;e14;e15;e16;e17\n
Here target-word and relatum-word are words, e-all is the number of extractions between target-word and relatum-word with all 17 patterns, and ei is the number of extractions between target-word and relatum-word with the i-th pattern (see the paper referenced above for details). Thus e-all = sum_i (ei).
e-syno, e-cohypo, e-hyper-hypo, e-hyper and e-hypo are the numbers of specific relations extracted between the terms (synonyms, co-hyponyms, hypernyms+hyponyms, hypernyms and hyponyms, respectively).
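The row layout of the pairs files and the identity e-all = sum_i (ei) can be checked with a short script. The following Python sketch is an illustration only: it assumes 25 semicolon-separated fields per row, no header row, and integer counts. The names FIELDS, parse_pair and pattern_sum_matches are hypothetical, and the sample row is made up.

```python
# Column names in the order given above: 2 words, 6 relation counts, 17 patterns.
FIELDS = (
    ["target-word", "relatum-word", "e-syno", "e-cohypo",
     "e-hyper-hypo", "e-hyper", "e-hypo", "e-all"]
    + ["e%d" % i for i in range(1, 18)]
)


def parse_pair(line):
    """Split one row of a pairs file into a dict; counts are converted to int."""
    values = line.rstrip("\r\n").split(";")
    if len(values) != len(FIELDS):
        raise ValueError(
            "expected %d fields, got %d" % (len(FIELDS), len(values))
        )
    record = dict(zip(FIELDS, values))
    for name in FIELDS[2:]:
        record[name] = int(record[name])  # assumption: counts are integers
    return record


def pattern_sum_matches(record):
    """Check that e-all equals e1 + e2 + ... + e17."""
    return record["e-all"] == sum(record["e%d" % i] for i in range(1, 18))


# Made-up row, for illustration only: e1 = 2, e2 = 1, all other patterns 0.
row = "car;vehicle;0;0;3;3;0;3;2;1;" + ";".join(["0"] * 15)
record = parse_pair(row)
```

A row for which pattern_sum_matches returns False would indicate either a damaged file or that one of the assumptions above does not hold.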
Corpus
Here are some corpora which you may use with this tool:
- Some Wikipedia articles: http://cental.fltr.ucl.ac.be/team/~panchenko/patternsim/corpus/
- For even bigger corpora use ukWaC and WaCkypedia: http://wacky.sslmit.unibo.it/doku.php?id=corpora
- Use a DBpedia dump of Wikipedia: http://wiki.dbpedia.org/Downloads
- Use a corpus of your own
Russian morphological dictionary
The Russian dictionary in this repository is an extract of the Russian computational morphological dictionary developed at CIS, Munich. This extract contains about 15% of the original dictionary (the most frequent lemmata). The whole dictionary actually contains 140,000 simple entries (= 2.7 million distinct forms), 166,000 simple proper nouns (= 900,000 distinct forms) and 1800 compound words.
If you want to use the full version of the lexicon, please contact:
Sebastian Nagel
CIS
Oettingenstr. 67
80538 München
Germany
wastl@cis.uni-muenchen.de
http://www.cis.uni-muenchen.de
For additional information see:
Nagel, Sebastian 2002: Formenbildung im Russischen. Formale Beschreibung und Automatisierung für das CISLEX-Wörterbuchsystem (http://www.cis.uni-muenchen.de/~wastl/pub/ruslex.pdf)
For a short description (in German), see http://www.cis.uni-muenchen.de/~wastl/pub/ruslexUnitex.pdf
Reranks semantic similarity scores between words extracted with patternsim. Directory -- "rank".
Synopsis
patternsim-rank [options]
System Requirements
- Windows -- Microsoft .NET framework 4.0 or higher (http://www.microsoft.com/net).
- Linux or Mac OSX -- Mono 2.0 or higher (http://www.go-mono.com/mono-downloads/). For instance, for Ubuntu 12.04 use "sudo apt-get install mono-runtime".
- At least 4 GB of RAM is recommended.
Binaries
Binaries are readily available in the bin folder. On Unix-based systems you may use "./patternsim-rank" or "./patternsim-rank.exe". On Windows, use "patternsim-rank.exe".
Testing
- Download test data http://cental.fltr.ucl.ac.be/team/~panchenko/sim-eval/patternsim-rank-data.tgz.
- Save the archive to the "rank" directory.
- Extract the data (tar xzf patternsim-rank-data.tgz). The directory "data" should appear.
- Run the tests.sh script. It writes its output to the data/output folder.
Recompilation
- Open patternsim-rank.sln with MonoDevelop or Visual Studio.
- Build the solution.
Options
p, pairs
Required. A UTF-8 encoded CSV file produced by the PatternSim program, in the format:
target;relatum;syno;cohypo;hyper_hypo;hyper;hypo;sum;pattern;pattern2;pattern3;pattern4;pattern5;pattern6;pattern7;pattern8;pattern9;pattern10;pattern11;pattern12;pattern13;pattern14;pattern15;pattern16;pattern17
This file must contain symmetric relations between words (PatternSim generates them by default). If there exists a relation 'target;relatum;type;sim', then there should exist one and only one relation 'relatum;target;type;sim' in the same file.
o, output
Required. A UTF-8 encoded CSV file 'target;relatum;sim', where 'sim' is the similarity score between 'target' and 'relatum'. This file is sorted by 'target' and then by 'sim'.
c, corpusfreq
Required. A UTF-8 encoded CSV file 'word;freq' with frequencies of words.
t, type
Required. Type of reranking:
- Efreq, no reranking, transform scores to the interval [0;1].
- Efreq-Rfreq, reranking by frequency of relations to other words. Uses option 'alpha'.
- Efreq-Rnum, reranking by number of relations to other words. Uses option 'beta'.
- Efreq-Cfreq, reranking by word frequency. Uses option 'corpusfreq'.
- Efreq-Rnum-Cfreq, reranking by number of relations to other words and by word frequency. Uses options 'beta' and 'corpusfreq'.
- Efreq-Rnum-Cfreq-Pnum, reranking by number of relations to other words, by word frequency and by the number of different patterns that extracted the relation. Uses options 'corpusfreq', 'patterns', 'beta' and 'sqrt'.
a, alpha
Expected number of relations per word, default -- 15.
b, beta
Minimum number of extractions which establish a relation between words, default -- 2.
s, sqrt
Use the square root of the number of different patterns, default -- true.
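The symmetry requirement on the 'pairs' file (every 'target;relatum' row needs exactly one mirrored 'relatum;target' row) can be verified before running patternsim-rank. The Python sketch below is a hedged illustration, not part of the tool: the function name missing_mirrors is hypothetical, and it only checks that the mirrored row exists exactly once, without comparing the counts.

```python
from collections import Counter


def missing_mirrors(pairs):
    """Return the (target, relatum) pairs whose mirrored pair does not occur exactly once.

    pairs: iterable of (target, relatum) tuples, e.g. the first two
    fields of every row of the pairs file.
    """
    counts = Counter(pairs)
    # Looking up an absent key in a Counter returns 0 and does not insert it.
    return sorted(
        (target, relatum)
        for (target, relatum) in counts
        if counts[(relatum, target)] != 1
    )


# Made-up pairs, for illustration only.
example = [("car", "vehicle"), ("vehicle", "car"), ("cat", "animal")]
problems = missing_mirrors(example)
```

An empty result means that every relation has exactly one mirrored counterpart.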