Sentiment Lexicon Generation Suite
This project provides executable files and scripts for generating
sentiment lexicons from GermaNet
(a German equivalent of the English
WordNet
), raw text corpora, and neural word embeddings.
Building
For generating a sentiment lexcion from pre-trained word embeddings, you first need to compile the C++ code by running the following commands (please note that the build requires the Armadillo library to be installed):
cd build/
cmake ../
make
Afterwards, an executable called vec2dic
wil apper in
the subdirectory bin
. You can exectute this file by envoking:
./bin/vec2dic [OPTIONS] --type=TYPE VECTOR_FILE SEED_FILE
where the TYPE
argument (an integer from zero to three) will
determine the algorithm to use for inducing a sentiment lexicon,
VECTORE_FILE
denotes a path to a text file with pre-trained
word2vec
embeddings (note that the file should be in the raw text
format with space separated values), and SEED_FILE
. We currently
support the following types of algorithms:
- 0 -- nearest centroids (default);
- 1 -- KNN;
- 2 -- PCA.
Examples
In addition to the C++ executables, we also provide several
reimplementations of popular alternative approaches which generate
sentiment lexcions from lexical taxonomies (e.g., GermaNet
) or raw
unlabeled text corpora. Please note that in order to use
dictionary-based methods, you need to download
GermaNet, which is not
included here by default due to license restrictions, and place its
files in the directory data/GermaNet_v9.0/
. For corpus-based
algorithms, you need to provide a pre-lemmatized corpus in the format
similar to the one used in data/snapshot_corpus_data/example.txt
.
Alternatively, for the method of Takamura et al. (2005), you need to
provide a list of coordinately conjoined pairs similar to the one
provided in data/corpus/cc_light.txt
.
Below, you can find a short summary and command examples of the provided systems.
Hu and Liu (2004)
For generating a sentiment lexicon with the method of Hu and Liu (2004), you should envoke the following command:
./scripts/generate_lexicon.py hu-liu \
--ext-syn-rels --seed-pos=adj \
--form2lemma=data/GermaNet_v9.0/gn_form2lemma.txt \
data/seeds/hu_liu_seedset.txt data/GermaNet_v9.0
Blair-Goldensohn (2008)
If you want to generate a sentiment lexicon using the method of Blair-Goldensohn et al. (2008), you should envoke the following command:
./scripts/generate_lexicon.py blair-goldensohn \
--ext-syn-rels --seed-pos=adj \
--form2lemma=data/GermaNet_v9.0/gn_form2lemma.txt \
data/seeds/hu_liu_seedset.txt data/GermaNet_v9.0/
Kim-Hovy (2004)
For generating a sentiment lexicon with the method of Kim and Hovy, (2004), use the following command:
./scripts/generate_lexicon.py kim-hovy \
--ext-syn-rels --seed-pos=adj \
--form2lemma=data/GermaNet_v9.0/gn_form2lemma.txt \
data/seeds/hu_liu_seedset.txt data/GermaNet_v9.0/
Takamura et al. (2005)
To generate a sentiment lexicon with the method of
Takamura et al. (2005),
use the following command instead (note that the file
data/corpus/cc.txt
is not included in this repository due to its big
size):
./scripts/generate_lexicon.py takamura \
--form2lemma=data/GermaNet_v9.0/gn_form2lemma.txt \
data/seeds/turney_littman_2003.txt data/GermaNet_v9.0/ data/corpus/cc.txt -1
Esuli and Sebastiani (2006)
For generating a sentiment lexicon using the SentiWordNet
method of
Esuli and Sebastiani (2006),
you should use the following command:
./scripts/generate_lexicon.py esuli --ext-syn-rels \
--seed-pos=adj --form2lemma=data/GermaNet_v9.0/gn_form2lemma.txt \
data/seeds/hu_liu_seedset.txt data/GermaNet_v9.0
Rao and Ravichandran (2009)
In order to generate a sentiment lexicon with the min-cut approach of Rao and Ravichandran (2009), use the below command:
./scripts/generate_lexicon.py rao-min-cut --ext-syn-rels \
--seed-pos=adj --form2lemma=data/GermaNet_v9.0/gn_form2lemma.txt \
data/seeds/hu_liu_seedset.txt data/GermaNet_v9.0
If you want to test the label propagation algorithm described by these authors, you should specify the following arguments:
./scripts/generate_lexicon.py rao-lbl-prop --ext-syn-rels \
--seed-pos=adj --form2lemma=data/GermaNet_v9.0/gn_form2lemma.txt \
data/seeds/hu_liu_seedset.txt data/GermaNet_v9.0
Awdallah and Radev (2010)
To generate a sentiment lexicon using the method of Awdallah and Radev (2010), you should use the following command:
./scripts/generate_lexicon.py awdallah --ext-syn-rels \
--seed-pos=adj --form2lemma=data/GermaNet_v9.0/gn_form2lemma.txt \
data/seeds/hu_liu_seedset.txt data/GermaNet_v9.0/
Velikovich et al. (2010)
For generating a sentiment lexicon using the algorithm of Velikovich et al. (2010), you can use the following command:
./scripts/generate_lexicon.py velikovich \
data/seeds/hu_liu_seedset.txt -1 data/snapshot_corpus_data/example.txt
Kiritchenko et al. (2014)
In order to generate a sentiment lexicon using the system of Kiritchenko et al. (2014), you should use the following command:
./scripts/generate_lexicon.py kiritchenko \
data/seeds/hu_liu_seedset.txt -1 data/snapshot_corpus_data/example.txt
Severyn and Moschitti (2014)
For generating a sentiment lexicon using the approach of Severyn and Moschitti (2014), you should use the following command:
./scripts/generate_lexicon.py severyn \
data/seeds/hu_liu_seedset.txt -1 data/snapshot_corpus_data/example.txt
Evaluation
You can evaluate the resulting sentiment lexicon on the PotTS dataset by using the following command and providing a valid path to the downloaded corpus data:
./scripts/evaluate.py -l data/form2lemma.txt \
data/results/esuli-sebastiani/esuli-sebastiani.ext-syn-rels.turney-littman-seedset.txt \
${PATH_TO_PotTS}/corpus/basedata/ ${PATH_TO_PotTS}/corpus/annotator-2/markables/