Sentiment Specific Word Embeddings
Word embeddings are typically learned from unannotated plain text and provide a dense vector representation of the syntactic/semantic aspects of a word. These representations, however, cannot distinguish contrasting aspects of a word sense, for example sentiment polarity or opposite senses (e.g. high/low).
In order to distinguish these contrasting aspects, one can use a training set of texts annotated with a specific polarity or sense, and specialize the generic word embeddings to take them into account.
dl-sentiwords.py allows creating sentiment specific word embeddings.
You will need generic word embeddings created from texts with broad coverage, for example Wikipedia, using either word2vec or DeepNL.
You can extract the plain text from a Wikipedia dump using WikiExtractor.
Then you need some training text annotated with polarity, for example sentiment-annotated tweets. The tweets training file should be in the format of the SemEval 2013 Sentiment Analysis in Twitter task, i.e. one tweet per line, tab-separated, in the following format:
100032373000896513 15486118 positive Wow!! Lady Gaga is actually at the Britney Spears Femme Fatale Concert tonight!!! She still listens to her music!!!! WOW!!!
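As a minimal sketch of how such a line can be parsed (assuming tab separators, as suggested by the .tsv extension; the function name is illustrative, not part of dl-sentiwords.py):

```python
# Parse one line of the SemEval-style training file:
# tweet_id <TAB> user_id <TAB> polarity <TAB> text
def parse_tweet_line(line):
    tweet_id, user_id, polarity, text = line.rstrip("\n").split("\t", 3)
    return {"id": tweet_id, "user": user_id, "polarity": polarity, "text": text}

line = "100032373000896513\t15486118\tpositive\tWow!! Lady Gaga is actually at the concert tonight!!!"
parsed = parse_tweet_line(line)
print(parsed["polarity"])  # positive
```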
If you are using word2vec embeddings, the script should be invoked like this:
dl-sentiwords.py training.tsv --vectors vectors.txt --variant word2vec
while if you are using DeepNL embeddings, you must supply two separate files with words and vectors:
dl-sentiwords.py training.tsv --vocab words.txt --vectors vectors.txt
Notice that the words and vectors files will be updated by the program, so use copies to avoid clobbering the originals.
Once you have trained the sentiment-specific embeddings, you can use them as features for a sentiment classifier. Notice that the new vocabulary will also contain relevant bigrams and trigrams, represented by concatenating their words with a '_' in between. The classifier will typically use additional features, for example:
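A minimal sketch of looking up such an n-gram in the resulting vocabulary, where multi-word entries are stored as words joined by '_' (the function and variable names here are illustrative, not part of DeepNL's API):

```python
# Look up an n-gram embedding: bigrams/trigrams are stored under keys
# built by joining their words with '_'.
def ngram_vector(tokens, vocab, vectors):
    key = "_".join(tokens)          # e.g. ["new", "york"] -> "new_york"
    if key in vocab:
        return vectors[vocab[key]]
    return None  # fall back to per-word vectors if the n-gram is absent

# Toy vocabulary and vectors for illustration only.
vocab = {"new_york": 0, "good": 1}
vectors = [[0.1, 0.2], [0.3, 0.4]]
print(ngram_vector(["new", "york"], vocab, vectors))  # [0.1, 0.2]
```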
- polarity score of word/ngram from a lexicon
- number of positive/negative emoticons
- number of all capital words (WOW)
- number of negation words (no, none, nobody)
- number of elongated words
- number of elongated punctuation sequences (!!, ??)
- number of tokens of each POS class
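The surface features above can be sketched as simple token counts. This is a hedged illustration under assumed definitions (e.g. an "elongated" word has a character repeated three or more times); the lexicons and function name are placeholders, not part of dl-sentiwords.py:

```python
import re

# Illustrative lexicons; a real classifier would use fuller lists.
NEGATIONS = {"no", "not", "none", "nobody", "never"}
EMOTICONS_POS = {":)", ":-)", ":D"}
EMOTICONS_NEG = {":(", ":-("}

def extra_features(tokens):
    return {
        "pos_emoticons": sum(t in EMOTICONS_POS for t in tokens),
        "neg_emoticons": sum(t in EMOTICONS_NEG for t in tokens),
        "all_caps": sum(t.isupper() and len(t) > 1 for t in tokens),
        "negations": sum(t.lower() in NEGATIONS for t in tokens),
        # elongated word: some character repeated 3+ times, e.g. "sooo"
        "elongated_words": sum(bool(re.search(r"(\w)\1{2,}", t)) for t in tokens),
        # elongated punctuation: runs like "!!" or "??!"
        "elongated_punct": sum(bool(re.search(r"[!?]{2,}", t)) for t in tokens),
    }

feats = extra_features("WOW !!! this is not sooo bad :)".split())
```

POS-class counts are omitted here since they require a tagger rather than token-level patterns.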