# Chapters 7 and 8: Sentiment Analysis

## Data loading and preliminary inspection

We will be using the popular `polarity dataset 2.0` collected by [Bo Pang and colleagues from Cornell University](http://www.cs.cornell.edu/people/pabo/movie-review-data/). Let's first read it in and inspect the data. This code is very similar to the routine applied to spam filtering in Chapter 2:

In [1]:
import os, codecs

def read_in(folder):
    files = os.listdir(folder)
    a_dict = {}
    for a_file in sorted(files):
        if not a_file.startswith("."):
            with codecs.open(folder + a_file, encoding='ISO-8859-1', errors='ignore') as f:
                # the file ID is the file name up to the first dot
                file_id = a_file.split(".")[0].strip()
                a_dict[file_id] = f.read()
    return a_dict

Now, when you download the dataset, it comes as two subfolders named `pos/` for all positive reviews and `neg/` for all negative ones, placed within a folder called `review_polarity/txt_sentoken/`. If you don't change the folder names, you can simply read in the contents of all positive and negative reviews and put them in separate Python dictionaries that map review IDs (the file names up to the first dot) to review text, using the function `read_in` from above.
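Note that `read_in` builds each path as `folder + a_file`, so it only works when the folder string ends with a slash. As a minimal sketch of an alternative, the variant below uses `pathlib` instead. The name `read_in_pathlib` and the toy file names and contents are made up for illustration and are not part of the dataset.

```python
import tempfile
from pathlib import Path

def read_in_pathlib(folder):
    """Map each file's ID (the file name up to the first dot) to its text."""
    reviews = {}
    for path in sorted(Path(folder).iterdir()):
        # Skip hidden files such as .DS_Store, and anything that is not a file
        if path.name.startswith(".") or not path.is_file():
            continue
        file_id = path.name.split(".")[0].strip()
        reviews[file_id] = path.read_text(encoding="ISO-8859-1", errors="ignore")
    return reviews

# Toy folder with made-up files, standing in for pos/ or neg/
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "cv000_1.txt").write_text("a fine film .\n", encoding="ISO-8859-1")
    (Path(tmp) / "cv001_2.txt").write_text("truly moving .\n", encoding="ISO-8859-1")
    (Path(tmp) / ".hidden").write_text("ignore me", encoding="ISO-8859-1")
    toy_reviews = read_in_pathlib(tmp)

print(sorted(toy_reviews))
```

Because `Path(folder)` accepts the folder name with or without a trailing slash, this version is a little more forgiving about how the path is written.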

Let's also print out the number of reviews in positive and negative dictionaries, as well as the very first positive and very first negative reviews in the dictionaries.

In [2]:
folder = "review_polarity/txt_sentoken/"
pos_dict = read_in(folder + "pos/")
print(len(pos_dict))
print(pos_dict.get(next(iter(pos_dict))))
neg_dict = read_in(folder + "neg/")
print(len(neg_dict))
print(neg_dict.get(next(iter(neg_dict))))

1000
films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before . 
for starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid '80s with a 12-part series called the watchmen . 
to say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd . 
the book ( or " graphic novel , " if you will ) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes . 
in other words , don't dismiss this film because of its source . 
if you can get past the whole comic book thing , you might find another stumbling block in from hell's directors , albert and allen hughes . 
getting the hughes brothers to direct this seems almos

So far so good! There are a couple of observations that you can make at this point:

- The dataset is balanced between positive and negative reviews – both subsets have `1000` texts.
- Texts are already preprocessed: they are tokenized (each word token is separated from its neighbours by white space) and converted to lower case.

What can you tell about the word content? Does the first positive review contain many positive words: i.e., if you didn't know it came from the `pos/` folder, would you be able to tell it was a positive one?

Let's next calculate some simple statistics, including: 
- Average length of a review in positive and negative subsets
- Average number of words per sentence
- Size of the vocabulary (unique words in each subset)
- *Diversity* of sentiment vocabulary – average number of times each word occurs in a text with a specific sentiment
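
To make these four quantities concrete, here is a minimal sketch on a made-up two-review corpus. The function name `toy_statistics` and the toy texts are assumptions for illustration only. The sketch treats each line as one sentence and skips empty lines, which is a choice of this sketch.

```python
def toy_statistics(reviews):
    """Return (avg review length, avg sentence length, vocab size, diversity)."""
    total_tokens = 0
    total_sents = 0
    vocab = set()
    for review in reviews.values():
        # One sentence per line; empty lines are not counted as sentences
        sents = [s for s in review.split("\n") if s.strip()]
        total_sents += len(sents)
        for sent in sents:
            tokens = sent.split()
            total_tokens += len(tokens)
            vocab.update(tokens)
    return (total_tokens / len(reviews),   # average tokens per review
            total_tokens / total_sents,    # average tokens per sentence
            len(vocab),                    # number of unique tokens
            total_tokens / len(vocab))     # average occurrences per unique token

# Made-up corpus: 12 tokens, 3 sentences, 6 unique tokens
toy = {
    "r1": "a good film .\na very good film .",
    "r2": "good fun .",
}
avg_len, avg_sent_len, vocab_size, diversity = toy_statistics(toy)
print(avg_len, avg_sent_len, vocab_size, diversity)
```

In this toy corpus each unique token occurs twice on average, so the diversity score is 2.0. A higher score means that, on average, each vocabulary item is reused more often.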

Many of these calculations will rely on having words extracted from text. Since texts are already tokenized and words are separated by white spaces, it's a matter of simply splitting texts into words by white spaces – see method `tokenize`:

In [3]:
def tokenize(text):
    # str.split() with no argument splits on any run of whitespace, newlines included
    return text.split()

def statistics(a_dict):
    length = 0
    sent_length = 0
    num_sents = 0
    vocab = []
    for review in a_dict.values():
        length += len(tokenize(review))
        sents = review.split("\n")
        num_sents += len(sents)
        for sent in sents:
            sent_length += len(tokenize(sent))
        vocab += tokenize(review)
    avg_length = float(length)/len(a_dict)
    avg_sent_length = float(sent_length)/num_sents
    vocab_size = len(set(vocab))
    diversity = float(length)/float(vocab_size)
    return avg_length, avg_sent_length, vocab_size, diversity
        
categories = ["Positive", "Negative"]
rows = []
rows.append(["Category", "Avg_Len(Review)", "Avg_Len(Sent)", "Vocabulary Size", "Diversity"])
stats = {}
stats["Positive"] = statistics(pos_dict)
stats["Negative"] = statistics(neg_dict)
for cat in categories:
    rows.append([cat, f"{stats.get(cat)[0]:.6f}", 
                f"{stats.get(cat)[1]:.6f}",
                f"{stats.get(cat)[2]:.6f}",
                f"{stats.get(cat)[3]:.6f}"])

columns = zip(*rows)
column_widths = [max(len(item) for item in col) for col in columns]
for row in rows:
    print(''.join(' {:{width}} '.format(row[i], width=column_widths[i]) 
                  for i in range(0, len(row))))   

 Category  Avg_Len(Review)  Avg_Len(Sent)  Vocabulary Size  Diversity 
 Positive  787.051000       23.191531      36805.000000     21.384350 
 Negative  705.630000       21.524266      34542.000000     20.428174 


As the table shows, positive reviews are on average longer than negative ones; they also have longer sentences, a larger vocabulary, and a higher diversity score. Let's see how many words occur in only one of the two vocabularies:

In [4]:
def vocab_difference(list1, list2):
    vocab1 = []
    vocab2 = []
    for rev in list1:
        vocab1 += tokenize(rev)
    for rev in list2:
        vocab2 += tokenize(rev)
    return sorted(list(set(vocab1) - set(vocab2)))

pos_wordlist = pos_dict.values()
neg_wordlist = neg_dict.values()

print(vocab_difference(pos_wordlist, neg_wordlist)[1500:1600])
print(vocab_difference(neg_wordlist, pos_wordlist)[1500:1600])
print()
print(str(len(vocab_difference(pos_wordlist, neg_wordlist))) + " unique words in positive reviews only")
print(str(len(vocab_difference(neg_wordlist, pos_wordlist))) + " unique words in negative reviews only")

['asssss', 'assuaged', 'asthmatic', 'astin', 'astonishment', 'astor', "astronauts'", 'astronomer/writer', 'astronomical', 'astronomically', 'asuma', 'at-odds', 'atf', 'atheism', 'atheists', 'atkine', "atlanta's", "atlantic's", 'atlantis', "atlantis'", 'atmospheres', 'atmospherically', 'atmostpheric', 'atomizers', 'atone', 'atreus', 'atrophied', 'atrophy', 'attachs', 'attainable', "attanasio's", "attendant's", 'attendees', 'attention-grabbing', 'attentive', 'attentiveness', 'attest', 'attilla-looking', 'attractions', 'attractive-seeming', 'attractiveness', 'au/~koukoula/', 'auberjonois', 'audacious', 'audacious--and', 'audience--in', 'audience-friendly', 'audience-pleaser', 'audience-pleasing', 'audiotapes', 'auditioned', 'auditor', 'audre', 'aug', 'aughra', 'augment', "august's", 'augusts', 'aulon', 'aumont', 'aunjanue', 'auntie', 'aural', 'aurally', 'aurelien', 'aurelius', "aurelius'", 'auriga', 'auriol', 'aurora', 'auschwitz', 'austen-like', "australian's", 'australian/belgian', 'aut

There are $16378$ words that occur only in positive reviews and $14115$ words that occur only in negative ones.
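
The code above rebuilds both vocabularies on every call to `vocab_difference`. As a hedged sketch of an alternative, you could build each vocabulary once as a Python `set` and then derive the exclusive and shared words with set operators. The helper name `vocab_set` and the toy reviews below are made up for illustration.

```python
def vocab_set(texts):
    """Collect the unique whitespace-separated tokens across all texts."""
    vocab = set()
    for text in texts:
        vocab.update(text.split())
    return vocab

# Made-up mini corpora standing in for the positive and negative reviews
toy_pos = ["a brilliant film", "brilliant and moving"]
toy_neg = ["a dull film", "dull and tedious"]

pos_vocab = vocab_set(toy_pos)
neg_vocab = vocab_set(toy_neg)

only_pos = sorted(pos_vocab - neg_vocab)   # words seen in positive reviews only
only_neg = sorted(neg_vocab - pos_vocab)   # words seen in negative reviews only
shared = sorted(pos_vocab & neg_vocab)     # words seen in both

print(only_pos, only_neg, shared)
```

The intersection gives the overlap directly, and the two differences reuse the same sets rather than re-tokenizing every review.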

So far, you've been looking at word forms – i.e., words like *movie* and *movies*, *like* and *liked* were considered as separate tokens in the statistics above. How much diversity is there in reviews when you take lemmas? Let's apply further NLP analysis and see how many word lemmas, not word forms (i.e., original tokens), there are in each subset.

First, import `spacy` and implement the function `lemmatize`, which returns lemmas rather than word tokens when the "switch" is turned on.

In [5]:
import spacy
nlp = spacy.load("en_core_web_md")

In [6]:
def lemmatize(sentence, switch):
    text = nlp(sentence.replace("\n", " "))
    if switch == "on":
        # lemma_ holds the lemma of each token as a string
        return [token.lemma_ for token in text]
    # otherwise return the spaCy tokens themselves
    return list(text)

#print out the results and compare
print(lemmatize(pos_dict.get(next(iter(pos_dict))), "on")[:200])
print()
print(lemmatize(pos_dict.get(next(iter(pos_dict))), "off")[:200])

['film', 'adapt', 'from', 'comic', 'book', 'have', 'have', 'plenty', 'of', 'success', ',', 'whether', '-PRON-', 'be', 'about', 'superhero', '(', 'batman', ',', 'superman', ',', 'spawn', ')', ',', 'or', 'gear', 'toward', 'kid', '(', 'casper', ')', 'or', 'the', 'arthouse', 'crowd', '(', 'ghost', 'world', ')', ',', 'but', 'there', 'be', 'never', 'really', 'be', 'a', 'comic', 'book', 'like', 'from', 'hell', 'before', '.', ' ', 'for', 'starter', ',', '-PRON-', 'be', 'create', 'by', 'alan', 'moore', '(', 'and', 'eddie', 'campbell', ')', ',', 'who', 'bring', 'the', 'medium', 'to', 'a', 'whole', 'new', 'level', 'in', 'the', 'mid', "'", '80', 'with', 'a', '12-part', 'series', 'call', 'the', 'watchman', '.', ' ', 'to', 'say', 'moore', 'and', 'campbell', 'thoroughly', 'research', 'the', 'subject', 'of', 'jack', 'the', 'ripper', 'would', 'be', 'like', 'say', 'michael', 'jackson', 'be', 'start', 'to', 'look', 'a', 'little', 'odd', '.', ' ', 'the', 'book', '(', 'or', '"', 'graphic', 'novel', ',', '"

Since processing with `spacy` might take time, let's run it once and store the results in specific data structures as we did in the previous chapters:

In [7]:
def spacy_preprocess_reviews(source):
    source_docs = {}
    index = 0
    for review_id in source.keys():
        #to speed processing up, you can disable "ner" – Named Entity Recognition module of spaCy
        source_docs[review_id] = nlp(source.get(review_id).replace("\n", ""), disable=["ner"])
        if index>0 and (index%200)==0:
            print(str(index) + " reviews processed")
        index += 1
    print("Dataset processed")
    return source_docs

pos_docs = spacy_preprocess_reviews(pos_dict)
neg_docs = spacy_preprocess_reviews(neg_dict)

200 reviews processed
400 reviews processed
600 reviews processed
800 reviews processed
Dataset processed
200 reviews processed
400 reviews processed
600 reviews processed
800 reviews processed
Dataset processed


Now, let's calculate statistics as before, but this time taking lemmas rather than word tokens into account – intuitively, this should make the space of unique 'words' smaller:

In [8]:
def statistics_lem(source_docs):
    length = 0
    vocab = []
    for review_id in source_docs.keys():
        review_doc = source_docs.get(review_id)
        lemmas = []
        for token in review_doc:
            lemmas.append(token.lemma_)
        length += len(lemmas)
        vocab += lemmas
    avg_length = float(length)/len(source_docs)
    vocab_size = len(set(vocab))
    diversity = float(length)/float(vocab_size)
    return avg_length, vocab_size, diversity
        
categories = ["Positive", "Negative"]
rows = []
rows.append(["Category", "Avg_Len(Review)", "Vocabulary Size", "Diversity"])
stats = {}
stats["Positive"] = statistics_lem(pos_docs)
stats["Negative"] = statistics_lem(neg_docs)
for cat in categories:
    rows.append([cat, f"{stats.get(cat)[0]:.6f}", 
                f"{stats.get(cat)[1]:.6f}",
                f"{stats.get(cat)[2]:.6f}"])

columns = zip(*rows)
column_widths = [max(len(item) for item in col) for col in columns]
for row in rows:
    print(''.join(' {:{width}} '.format(row[i], width=column_widths[i]) 
                  for i in range(0, len(row))))

 Category  Avg_Len(Review)  Vocabulary Size  Diversity 
 Positive  818.722000       24432.000000     33.510232 
 Negative  737.379000       22839.000000     32.285958 


The difference gets considerably smaller, but you can still see that positive reviews are longer and more diverse than negative ones. What do you think it says about the nature of positive and negative reviews?
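
To see why lemmatization shrinks the vocabulary and raises the diversity score at the same time, consider this minimal sketch. It uses a tiny hand-written lemma table (`TOY_LEMMAS`, purely hypothetical) in place of spaCy, and a made-up text.

```python
# Hypothetical hand-written lemma table standing in for spaCy's lemmatizer
TOY_LEMMAS = {"movies": "movie", "liked": "like", "likes": "like",
              "was": "be", "is": "be"}

def diversity(items):
    """Average number of occurrences per unique item."""
    return len(items) / len(set(items))

tokens = "i liked the movie . she likes movies . it was fun . it is fun .".split()
# Map each word form to its lemma; forms not in the table are kept as they are
lemmas = [TOY_LEMMAS.get(tok, tok) for tok in tokens]

print(len(set(tokens)), len(set(lemmas)))
print(diversity(tokens), diversity(lemmas))
```

The number of items stays the same while several word forms collapse onto one lemma, so the count of unique items drops and each one recurs more often on average.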

Let's now extract the non-overlapping lemmas from the positive and negative subsets. This requires you to modify the code for `vocab_difference` above slightly:

In [9]:
def vocab_lem_difference(source_docs1, source_docs2):
    vocab1 = []
    vocab2 = []
    for rev_id in source_docs1.keys():
        rev = source_docs1.get(rev_id)
        for token in rev:
            vocab1.append(token.lemma_)
    for rev_id in source_docs2.keys():
        rev = source_docs2.get(rev_id)
        for token in rev:
            vocab2.append(token.lemma_)
    return sorted(list(set(vocab1) - set(vocab2)))

print(str(len(vocab_lem_difference(pos_docs, neg_docs))) + " unique lemmas in positive reviews only")
print(str(len(vocab_lem_difference(neg_docs, pos_docs))) + " unique lemmas in negative reviews only")

9230 unique lemmas in positive reviews only
7637 unique lemmas in negative reviews only


Compare these numbers to the ones for word tokens above.

Finally, let's see if the number of unique adjectives and adverbs differs across the reviews with different polarity:

In [10]:
def vocab_pos_difference(source_docs1, source_docs2, pos):
    vocab1 = []
    vocab2 = []
    for rev_id in source_docs1.keys():
        rev = source_docs1.get(rev_id)
        for token in rev:
            if token.pos_==pos:
                vocab1.append(token.text)
    for rev_id in source_docs2.keys():
        rev = source_docs2.get(rev_id)
        for token in rev:
            if token.pos_==pos:
                vocab2.append(token.text)
    return sorted(list(set(vocab1) - set(vocab2)))


categories = ["Positive", "Negative"]
rows = []
rows.append(["Category", "Unique adj's", "Unique adv's"])
stats = {}
stats["Positive"] = (len(vocab_pos_difference(pos_docs, neg_docs, "ADJ")), 
                     len(vocab_pos_difference(pos_docs, neg_docs, "ADV"))) 
stats["Negative"] = (len(vocab_pos_difference(neg_docs, pos_docs, "ADJ")), 
                     len(vocab_pos_difference(neg_docs, pos_docs, "ADV"))) 
for cat in categories:
    rows.append([cat, f"{stats.get(cat)[0]:.6f}", 
                f"{stats.get(cat)[1]:.6f}"])

columns = zip(*rows)
column_widths = [max(len(item) for item in col) for col in columns]
for row in rows:
    print(''.join(' {:{width}} '.format(row[i], width=column_widths[i]) 
                  for i in range(0, len(row))))      

 Category  Unique adj's  Unique adv's 
 Positive  2374.000000   574.000000   
 Negative  1906.000000   495.000000   


As before, positive reviews turn out to have more unique adjectives and adverbs.

## Simple approach with sentiment lexicons

[William Hamilton and his colleagues from Stanford](https://nlp.stanford.edu/projects/socialsent/) collected lists of words with sentiment scores assigned:

- per part of speech, for instance for all words vs for adjectives only
- per decade, registering how sentiment changed over time, e.g. between the 1990s and the 2000s (take a look at how the sentiment of different words changed, e.g., '*wicked*' becoming much more positive these days, changing from $-3.09$ in the 1890s up to $-0.76$ in the 2000s)
- per domain, e.g. movie-related words specifically.

Let's use these lists to build a simple "classifier" that decides on the sentiment of a review by taking into account the overall weight of the sentiment-bearing words it contains.

Let's first implement a function that collects the sentiment score for each word from a given source list:

In [11]:
def collect_wordlist(input_file):
    word_dict = {}
    with codecs.open(input_file, encoding='ISO-8859-1', errors='ignore') as f:
        for a_line in f.readlines():
            cols = a_line.split("\t")
            # keep only rows with a word, a score, and at least one more column
            if len(cols) > 2:
                word = cols[0].strip()
                score = float(cols[1].strip())
                word_dict[word] = score
    return word_dict


adj_90 = collect_wordlist("sentiment_words/adjectives/1990.tsv")
print(adj_90.get("cool"))
print(len(adj_90))
adj_00 = collect_wordlist("sentiment_words/adjectives/2000.tsv")
print(adj_00.get("cool"))
print(len(adj_00))
all_90 = collect_wordlist("sentiment_words/frequent_words/1990.tsv")
print(len(all_90))
all_00 = collect_wordlist("sentiment_words/frequent_words/2000.tsv")
print(len(all_00))
movie_words = collect_wordlist("sentiment_words/subreddits/movies.tsv")
print(len(movie_words))

1.28
1968
1.19
2041
4924
4924
4981
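
If you want to try the parsing logic without downloading the lexicons, here is a minimal sketch that reads from an in-memory string instead of a file. The rows in `SAMPLE_TSV` are made up, and the only thing assumed about the layout is what `collect_wordlist` itself checks for: tab-separated rows with the word first, the score second, and at least one further column. The name `collect_wordlist_from` is hypothetical.

```python
import io

# Made-up rows: word <TAB> score <TAB> one further column
SAMPLE_TSV = "good\t1.5\t0.3\nbad\t-2.0\t0.4\nmalformed line\n"

def collect_wordlist_from(handle):
    """Read word -> score pairs from any iterable of tab-separated lines."""
    word_dict = {}
    for line in handle:
        cols = line.rstrip("\n").split("\t")
        if len(cols) > 2:
            try:
                word_dict[cols[0].strip()] = float(cols[1].strip())
            except ValueError:
                # skip rows whose second column is not a number
                continue
    return word_dict

toy_lexicon = collect_wordlist_from(io.StringIO(SAMPLE_TSV))
print(toy_lexicon)
```

Rows with fewer than three columns are silently dropped, just as in `collect_wordlist`; the extra `try`/`except` is an addition of this sketch.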


Now, let's implement a simple threshold-based approach, based on the idea that if the cumulative score is positive, it's a positive review, and if it is negative, it is a negative one. You have a choice between two strategies:
- each positive word is worth +1 point and each negative one -1
- each positive and negative word has a different weight, defined by its score in the lexicon.
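
The two strategies can disagree on the same review. Here is a minimal sketch on a made-up review with a hypothetical lexicon (`TOY_LEXICON`); the function names are also made up. A score of exactly zero is treated as positive, the same tie-breaking rule as in the functions below.

```python
# Hypothetical lexicon: three mildly positive words and one strongly negative one
TOY_LEXICON = {"good": 0.5, "nice": 0.4, "fine": 0.3, "dreadful": -3.0}

def binary_score(tokens, lexicon):
    """+1 for every lexicon word with a non-negative score, -1 otherwise."""
    score = 0
    for tok in tokens:
        if tok in lexicon:
            score += -1 if lexicon[tok] < 0 else 1
    return score

def weighted_score(tokens, lexicon):
    """Sum of the lexicon scores of all lexicon words in the review."""
    return sum(lexicon[tok] for tok in tokens if tok in lexicon)

def decide(score):
    """Negative cumulative score -> -1, anything else -> 1."""
    return -1 if score < 0 else 1

review = "good cast , nice score , fine sets , but a dreadful film".split()

print(binary_score(review, TOY_LEXICON), decide(binary_score(review, TOY_LEXICON)))
print(weighted_score(review, TOY_LEXICON), decide(weighted_score(review, TOY_LEXICON)))
```

With binary counting, three positive words outvote one negative word, so the review is labelled positive. With weighted scores, the single strongly negative word outweighs the three mild ones, so the review is labelled negative.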


In [12]:
def bin_decisions(a_dict, label, sent_dict):
    decisions = []
    for rev_id in a_dict.keys():
        score = 0
        for token in a_dict.get(rev_id):
            if token.text in sent_dict.keys():
                if sent_dict.get(token.text)<0:
                    score -= 1
                else:
                    score += 1
        if score < 0:
            decisions.append((-1, label))
        else:
            decisions.append((1, label))
    return decisions

def weighted_decisions(a_dict, label, sent_dict):
    decisions = []
    for rev_id in a_dict.keys():
        score = 0
        for token in a_dict.get(rev_id):
            if token.text in sent_dict.keys():
                score += sent_dict.get(token.text)
        if score < 0:
            decisions.append((-1, label))
        else:
            decisions.append((1, label))
    return decisions

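def get_accuracy(pos_docs, neg_docs, sent_dict):
    # binary strategy; to evaluate the weighted strategy,
    # call weighted_decisions here instead of bin_decisions
    decisions_pos = bin_decisions(pos_docs, 1, sent_dict)
    decisions_neg = bin_decisions(neg_docs, -1, sent_dict)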
    decisions_all = decisions_pos + decisions_neg
    lists = [decisions_pos, decisions_neg, decisions_all]
    accuracies = []
    for i in range(0, len(lists)):
        match = 0
        for item in lists[i]:
            if item[0]==item[1]:
                match += 1
        accuracies.append(float(match)/float(len(lists[i])))
    return accuracies

        
categories = ["Adj_90", "Adj_00", "All_90", "All_00", "Movies"]
rows = []
rows.append(["List", "Acc(positive)", "Acc(negative)", "Acc(all)"])
accs = {}
accs["Adj_90"] = get_accuracy(pos_docs, neg_docs, adj_90) 
accs["Adj_00"] = get_accuracy(pos_docs, neg_docs, adj_00) 
accs["All_90"] = get_accuracy(pos_docs, neg_docs, all_90) 
accs["All_00"] = get_accuracy(pos_docs, neg_docs, all_00) 
accs["Movies"] = get_accuracy(pos_docs, neg_docs, movie_words) 
for cat in categories:
    rows.append([cat, f"{accs.get(cat)[0]:.6f}", 
                f"{accs.get(cat)[1]:.6f}",
                f"{accs.get(cat)[2]:.6f}"])

columns = zip(*rows)
column_widths = [max(len(item) for item in col) for col in columns]
for row in rows:
    print(''.join(' {:{width}} '.format(row[i], width=column_widths[i]) 
                  for i in range(0, len(row))))    

 List    Acc(positive)  Acc(negative)  Acc(all) 
 Adj_90  0.889000       0.267000       0.578000 
 Adj_00  0.825000       0.354000       0.589500 
 All_90  1.000000       0.000000       0.500000 
 All_00  0.935000       0.086000       0.510500 
 Movies  0.014000       0.996000       0.505000 


Here is the summary of the **results:**

- With the binary approach:
    - adj_90: Positive: 0.889; Negative: 0.267; All: 0.578
    - adj_00: Positive: 0.825; Negative: 0.354; All: **0.5895**
    - all_90: Positive: **1.0**; Negative: 0.0; All: 0.5
    - all_00: Positive: 0.935; Negative: 0.086; All: 0.5105
    - movie_words: Positive: 0.014; Negative: **0.996**; All: 0.505

- With the weighted scores approach:
    - adj_90: Positive: 0.788; Negative: 0.507; All: **0.6475**
    - adj_00: Positive: 0.818; Negative: 0.424; All: 0.621
    - all_90: Positive: **0.984**; Negative: 0.017; All: 0.5005
    - all_00: Positive: 0.805; Negative: 0.373; All: 0.589
    - movie_words: Positive: 0.008; Negative: **0.997**; All: 0.5025
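
The cell above only calls `bin_decisions`, and the notebook does not show how the weighted figures were computed. One plausible route, sketched here under that assumption, is to pass the scoring strategy in as a parameter so that both tables come from the same evaluation code. All names and data below are made up, reviews are plain token lists rather than spaCy documents, and the toy numbers say nothing about accuracy on the real dataset.

```python
def binary_score(tokens, lexicon):
    """+1 per lexicon word with a non-negative score, -1 per negative one."""
    return sum(-1 if lexicon[t] < 0 else 1 for t in tokens if t in lexicon)

def weighted_score(tokens, lexicon):
    """Sum of lexicon scores over all lexicon words."""
    return sum(lexicon[t] for t in tokens if t in lexicon)

def accuracy_by_class(pos_reviews, neg_reviews, lexicon, score_fn):
    """Return [accuracy on positives, accuracy on negatives, overall accuracy]."""
    def predict(tokens):
        # a cumulative score of zero counts as positive
        return -1 if score_fn(tokens, lexicon) < 0 else 1
    pos_hits = sum(predict(r) == 1 for r in pos_reviews)
    neg_hits = sum(predict(r) == -1 for r in neg_reviews)
    total = len(pos_reviews) + len(neg_reviews)
    return [pos_hits / len(pos_reviews),
            neg_hits / len(neg_reviews),
            (pos_hits + neg_hits) / total]

# Hypothetical lexicon and made-up reviews
toy_lexicon = {"good": 0.5, "great": 2.0, "bad": -0.5, "awful": -3.0}
toy_pos = [["good", "great"], ["great", "bad", "bad"]]
toy_neg = [["bad", "awful"], ["good", "good", "awful"]]

acc_binary = accuracy_by_class(toy_pos, toy_neg, toy_lexicon, binary_score)
acc_weighted = accuracy_by_class(toy_pos, toy_neg, toy_lexicon, weighted_score)
print(acc_binary)
print(acc_weighted)
```

Reporting accuracy per class matters here: a lexicon that pushes nearly every review to one label (as `all_90` and `movie_words` do above) can score close to 1.0 on one class and still be no better than chance overall.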
    
Let's check how many words from the lexicons occur in each type of review:

In [13]:
def occurrences(a_dict, sent_dict):
    occur = []
    for rev_id in a_dict.keys():
        for token in a_dict.get(rev_id):
            if token.text in sent_dict.keys():
                occur.append(token.text)
    return len(set(occur))

categories = ["Adj_90", "Adj_00", "All_90", "All_00", "Movies"]
rows = []
rows.append(["List", "Occurs(pos)", "Occurs(neg)"])
occs = {}
occs["Adj_90"] = occurrences(pos_docs, adj_90), occurrences(neg_docs, adj_90)
occs["Adj_00"] = occurrences(pos_docs, adj_00), occurrences(neg_docs, adj_00)
occs["All_90"] = occurrences(pos_docs, all_90), occurrences(neg_docs, all_90)
occs["All_00"] = occurrences(pos_docs, all_00), occurrences(neg_docs, all_00)
occs["Movies"] = occurrences(pos_docs, movie_words), occurrences(neg_docs, movie_words)
for cat in categories:
    rows.append([cat, f"{occs.get(cat)[0]:.6f}", 
                f"{occs.get(cat)[1]:.6f}"])

columns = zip(*rows)
column_widths = [max(len(item) for item in col) for col in columns]
for row in rows:
    print(''.join(' {:{width}} '.format(row[i], width=column_widths[i]) 
                  for i in range(0, len(row))))    

 List    Occurs(pos)  Occurs(neg) 
 Adj_90  1726.000000  1667.000000 
 Adj_00  1782.000000  1731.000000 
 All_90  4772.000000  4712.000000 
 All_00  4789.000000  4732.000000 
 Movies  4675.000000  4660.000000 


More to follow in Chapter 8.