# Sentiment Lexicon

The original paper authors use a sentiment lexicon from MQPA - [Subjectivity Lexicon](mpqa.cs.pitt.edu/lexicons/subj_lexicon/).

> We used MPQA sentiment lexicon (Wilson et al., 2005) for our study, which contains 2,718 positive and 4,912 negative lexicons.

The lexicon is a single file with one term per row. Example:
```
type=weaksubj len=1 word1=abjure pos1=verb stemmed1=y priorpolarity=negative
```

Consulting the provided README points us to look at the last element/key `priorpolarity` to define our positive/negative lexicon. With a couple simple invocations, we can see that this methodology produces a lexicon that matches that of the original authors.[<sup>caveat</sup>](#Size-of-the-lexicon)

```
$ cat <mpqa-lexicon-file> | grep 'priorpolarity=positive' | wc -l
    2718
```

```
$ cat <mpqa-lexicon-file> | grep 'priorpolarity=negative' | wc -l
    4912 
```

(See [Notes](#Notes) for characteristics of the dataset)

## Lookup Table from Lexicon

Let's load the lexicon into memory and create an efficient method for lookup.

The lexicon includes word stems and parts of speech in addition to the actual word. Because the papers' authors to do describe how they used the lexicon, we will assume all possible features were used. Parses from Stanford CoreNLP (as well as annotated data provided by the authors) include lemmas and POS data so we will match tokens with lexicon entities with all available data.

This means that we only care about the following fields each entry in the lexicon:
 - `word1` - The word or stem (different than lemma!)
 - `pos1` - Part of speech will be `adj`, `adverb`, `anypos`, `noun`, or `verb`
 - `stemmed1` - True or False, is this entry a stem?
 - `priorpolarity` - `positive`, `negative`, `neutral`, or `both` - We will only use `positive` and `negative`

The MPQA sentiment lexicon uses a much more general set of POS labels than CoreNLP. This means that we will need a mapping between the two. (`anypos` is any part of speech)
```
$ cat subjclueslen1-HLTEMNLP05.tff | grep -o 'pos1=[a-z]*' | sort | uniq
pos1=adj
pos1=adverb
pos1=anypos
pos1=noun
pos1=verb
```

Additionally, the lexicon uses stemming and CoreNLP's annotations use lemmas. Although similar, [they are not the same](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html). Instead, we'll match the tokens either a whole word or a stem matches either the whole token or the lemma (in addition to part of speech).

We'll create a mapping of word/stem to (stem?, POS) to sentiment mapping. For efficient prefix lookups, we'll use a trie - specifically, Google's implementation [`pygtrie`](https://github.com/google/pygtrie) which is a drop-in replacement for Python dictionaries.

In [1]:
from enum import Enum

class POS(Enum):
    ANY_POS = 'anypos'
    ADJ = 'adj'
    ADVERB = 'adverb'
    NOUN = 'noun'
    VERB = 'verb'
    
class Sentiment(Enum):
    POSITIVE = 'positive'
    NEGATIVE = 'negative'
    
valid_sentiments = set([s.value for s in Sentiment.__members__.values()])

In [2]:
import pygtrie

trie = pygtrie.CharTrie()

def add_entry(text, pos, is_stem, sentiment):
    if text not in trie:
        trie[text] = dict()
    assert (pos, is_stem) not in trie[text], '{} already exists'.format((text, pos, is_stem))
    trie[text][(pos, is_stem)] = sentiment

def get_sentiment(text, pos):
    if trie.has_key(text):
        if (pos, False) in trie[text]:
            return trie[text][pos, False]
        elif (pos, True) in trie[text]:
            return trie[text][pos, True]
    else:
        for _, entries in reversed(list(trie.prefixes(text))):
            if (pos, True) in entries:
                return entries[pos, True]
    return None

In [3]:
filepath = '../../data/subjectivity_clues_hltemnlp05/subjclueslen1-HLTEMNLP05.tff'
with open(filepath, 'r') as f:
    for raw_entry in f:
        try:
            entry = dict((pair.split('=')) for pair in raw_entry.strip().split(' '))
        except ValueError as e:
            print('Skipping', raw_entry, e)
        if entry['priorpolarity'] in valid_sentiments:
            try:
                add_entry(entry['word1'], POS(entry['pos1']), entry['stemmed1']=='y', Sentiment(entry['priorpolarity']))
            except AssertionError as e:
                print('Skipping,', e)

Skipping, ('autocratic', <POS.ADJ: 'adj'>, False) already exists
Skipping, ('boast', <POS.VERB: 'verb'>, True) already exists
Skipping, ('brazenly', <POS.ANY_POS: 'anypos'>, False) already exists
Skipping, ('brazenness', <POS.NOUN: 'noun'>, False) already exists
Skipping, ('cohesive', <POS.ADJ: 'adj'>, False) already exists
Skipping, ('constructive', <POS.ADJ: 'adj'>, False) already exists
Skipping, ('disinterested', <POS.ADJ: 'adj'>, False) already exists
Skipping, ('disown', <POS.VERB: 'verb'>, True) already exists
Skipping, ('distinctive', <POS.ADJ: 'adj'>, False) already exists
Skipping, ('famed', <POS.ADJ: 'adj'>, False) already exists
Skipping, ('hasty', <POS.ADJ: 'adj'>, False) already exists
Skipping, ('hegemony', <POS.NOUN: 'noun'>, False) already exists
Skipping, ('isolation', <POS.NOUN: 'noun'>, False) already exists
Skipping, ('killer', <POS.NOUN: 'noun'>, False) already exists
Skipping, ('lecher', <POS.NOUN: 'noun'>, False) already exists
Skipping, ('lure', <POS.VERB: 'ver

In [4]:
len(trie)

6450

In [5]:
trie['help']

{(<POS.ADJ: 'adj'>, False): <Sentiment.POSITIVE: 'positive'>,
 (<POS.NOUN: 'noun'>, False): <Sentiment.POSITIVE: 'positive'>,
 (<POS.VERB: 'verb'>, True): <Sentiment.POSITIVE: 'positive'>}

## "Calculating" sentiment label for given span

For spans of text, the authors
> define the sentiment label ... to be positive if it contains more words that appear in teh positive sentiment lexicon than that appear in the negative one [and vice versa]

Easy!

In [6]:
from collections import Counter

def get_sentiment_label(tokens):
    sentiment = Counter()
    for token in tokens:
        pos = ptb2ezpos(token['pos'])
        ts = get_sentiment(token['originalText'], pos) or get_sentiment(token['lemma'], pos) if 'lemma' in token else None
        if ts is not None:
            sentiment[ts] += 1
    return sentiment.most_common()[0][0] if len(sentiment) > 0 else None

We will also have to create the mappings from CoreNLP's tag set ([Penn Treebank tags](http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)] to the much simplier tags used in the MPQA sentiment lexicons.

In [7]:
def ptb2ezpos(ptb_tag):
    tag = ptb_tag.lower()[:1]
    if tag == 'nn': # NN, NNS, NNP, NNPS
        return POS.NOUN
    elif tag == 'vb': # VB, VBD, VBG, VBN, VBP, VBZ
        return POS.VERB
    elif tag == 'rb': # RB, RBR, RBS
        return POS.ADVERB
    elif tag == 'jj': # JJ, JJR, JJS
        return POS.ADJ
    return POS.ANY_POS

Let's test it out with a file from MPQA.

In [8]:
from pycorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP('http://localhost:9000')
test_file = '../../data/database.mpqa.3.0/docs/20010926/23.17.57-23406'
text = open(test_file, 'r').read()
output = nlp.annotate(text, properties={
    'annotators': 'ner',
    'outputFormat': 'json'
})
sentences = output['sentences']

def get_text(tokens_slice):
    raw_text = []
    for token in tokens_slice:
        raw_text.append(token['originalText'])
        raw_text.append(token['after'])
    return ''.join(raw_text[:-1])

In [9]:
sentences[0]['tokens'][0]

{'after': '',
 'before': '',
 'characterOffsetBegin': 0,
 'characterOffsetEnd': 6,
 'index': 1,
 'lemma': 'TAIPEI',
 'ner': 'LOCATION',
 'originalText': 'TAIPEI',
 'pos': 'NNP',
 'word': 'TAIPEI'}

Something simple, what's the sentiment for each of the sentences?

In [10]:
for sentence in sentences[:4]:
    sent = get_sentiment_label(sentence['tokens'])
    print('{}\n{}\n'.format(sent, get_text(sentence['tokens'])))

Sentiment.NEGATIVE
TAIPEI, Sept 26 (AFP) -- Taiwan President Chen Shui-bian on Wednesday reiterated Taipei's full support for the United States as Washington prepared to launch reprisals against Afghanistan.

Sentiment.NEGATIVE
"On behalf of the government and people of the Republic of China (Taiwan's official name), I would like to extend our full support to the George W. Bush administration in its any decision and act against terrorists," Chen said while meeting Oregon governor John Kitzhaber.

None
Taiwan "would not stand idly by" because "the attacks were not only a challenge to the US but also a disruption of peace for mankind," Chen said in a statement released by the presidential office.

None
"The ROC government will be with the US government firmly."



## e2e test

In [11]:
import os
import sys
preproc_path = os.path.abspath(os.path.join('..'))
print(preproc_path)
if preproc_path not in sys.path:
    sys.path.append(preproc_path)

/Users/andrew/Documents/college/cs8803-css/replication-project/code/src


In [12]:
import importlib
from base_models import sentiment_lexicon

importlib.reload(sentiment_lexicon);

In [13]:
sl, sent_count, pos_count = sentiment_lexicon.SentimentLexicon.from_mpqa_file(filepath)
for sentence in sentences[:4]:
    sent = sl.get_sentiment_label(sentence['tokens'])
    print('{}\n{}\n'.format(sent, get_text(sentence['tokens'])))

(<Sentiment.NEGATIVE: 'negative'>, Counter({None: 31, <Sentiment.NEGATIVE: 'negative'>: 1}))
TAIPEI, Sept 26 (AFP) -- Taiwan President Chen Shui-bian on Wednesday reiterated Taipei's full support for the United States as Washington prepared to launch reprisals against Afghanistan.

(<Sentiment.NEGATIVE: 'negative'>, Counter({None: 52, <Sentiment.NEGATIVE: 'negative'>: 1}))
"On behalf of the government and people of the Republic of China (Taiwan's official name), I would like to extend our full support to the George W. Bush administration in its any decision and act against terrorists," Chen said while meeting Oregon governor John Kitzhaber.

(None, Counter({None: 41}))
Taiwan "would not stand idly by" because "the attacks were not only a challenge to the US but also a disruption of peace for mankind," Chen said in a statement released by the presidential office.

(None, Counter({None: 13}))
"The ROC government will be with the US government firmly."



In [14]:
sent_count

Counter({<Sentiment.NEGATIVE: 'negative'>: 4898,
         <Sentiment.POSITIVE: 'positive'>: 2712})

In [15]:
pos_count

Counter({<POS.ADJ: 'adj'>: 3000,
         <POS.ADVERB: 'adverb'>: 311,
         <POS.ANY_POS: 'anypos'>: 1036,
         <POS.NOUN: 'noun'>: 2017,
         <POS.VERB: 'verb'>: 1246})

## Notes

### Size of the lexicon

Although the paper's authors claim the MPQA dataset to contain "2,718 positive and 4,912 negative lexicons", it contains a negligible amount fewer unique entities. (Note: this is the loosest definition for uniqueness because reordered fields or other meaningless changes between rows will show up as unique rows under this count).

```
$ cat <mpqa-lexicon-file> | wc -l
    8222
```

```
$ cat <mpqa-lexicon-file> | sort | uniq | wc -l
    8209
```

In [16]:
from collections import Counter
Counter([s for entries in trie.values() for s in entries.values()])

Counter({<Sentiment.NEGATIVE: 'negative'>: 4898,
         <Sentiment.POSITIVE: 'positive'>: 2712})

### Interesting quirks (errors?) in the MPQA lexicon

I found some strange and seemingly erroneous lines in the lexicon file when first exploring the file. None of these quirks were "addressed" in anyway and all replication was performed without altering the original lexicon in any way.

The README tells us that the only possible values for `priorpolarity` should be `positive, negative, both, neutral` but that does not turn out to be the case...
```
$ cat subjclueslen1-HLTEMNLP05.tff | grep -o 'priorpolarity=[a-z]*' | sort | uniq
priorpolarity=both
priorpolarity=negative
priorpolarity=neutral
priorpolarity=positive
priorpolarity=weakneg
```

It looks like line 3749 in the lexicon file is the offending entry:
```
$ cat <mpqa-lexicon-file> | nl | \grep 'priorpolarity=weakneg'
  3749	type=weaksubj len=1 word1=impassive pos1=adj stemmed1=n polarity=negative priorpolarity=weakneg
```

It seems like almost all the entries have the same keys in the same order, but on lines 5549 and 5550 there is a stray `m`:
```
$ cat <mpqa-lexicon-file> | nl | grep ' m '
  5549	type=strongsubj len=1 word1=pervasive pos1=adj stemmed1=n m priorpolarity=negative
  5550	type=strongsubj len=1 word1=pervasive pos1=noun stemmed1=n m priorpolarity=negative
```