### Tokenization
We will conduct sentiment analysis using a Bag of Words (BOW) approach. This involves tokenizing text into words and then reducing each word to a base form. There are two common ways of reducing words to a base form:

- Stemming
- Lemmatization

Stemming is a text-processing step that reduces a word to its base form. The idea is that every word can be seen as a morphological variant of one base word. This approach processes words according to morphology rather than semantics, as each word is considered independently of its neighbouring words or enclosing sentence (https://www.ibm.com/think/topics/stemming-lemmatization).

Stemming reduces words to word bases by comparing a word to a pre-defined list of common suffixes. The process simply consists of removing recognized suffixes according to rules, which makes it heuristic. This can pose problems, as it occasionally produces incorrect base words (e.g. "nothing" becomes "noth", because the -ing suffix is removed).
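
The suffix-stripping idea can be illustrated with a toy sketch. This is not the Porter algorithm or any NLTK stemmer; the suffix list, the `naive_stem` name and the `min_stem_len` parameter are assumptions made up for this illustration.

```python
# Toy suffix-stripping stemmer: an illustration only, not the Porter algorithm.
SUFFIXES = ("ing", "ed", "ly", "es", "s")  # hypothetical, deliberately short list


def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least `min_stem_len` characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for w in ["playing", "played", "plays", "nothing"]:
    print(w, "->", naive_stem(w))
```

Because the rule only looks at the end of the word, "nothing" is cut to "noth" even though "-ing" is not a verbal suffix here.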

Lemmatization aims to reduce morphological variants to a dictionary base form (the lemma). It does so with the help of "Part of Speech" (POS) tagging, which assigns each word its syntactic function in the sentence and thereby allows the procedure to identify the correct dictionary base form. For example, the word "nothing" remains unaltered in lemmatization, because its syntactic function is that of a noun.
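
To make the role of the POS tag concrete, here is a minimal sketch with a hand-written lookup table. The table, its entries and the `toy_lemmatize` name are hypothetical; a real lemmatizer consults a full dictionary such as WordNet.

```python
# Hypothetical (word, POS) -> lemma table, written by hand for illustration only.
LEMMA_TABLE = {
    ("meeting", "v"): "meet",     # "they are meeting today"
    ("meeting", "n"): "meeting",  # "the meeting ran long"
    ("nothing", "n"): "nothing",  # stays intact, unlike with suffix stripping
    ("was", "v"): "be",
}


def toy_lemmatize(word, pos):
    """Look up the lemma for (word, pos); fall back to the lowercased word."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())


print(toy_lemmatize("meeting", "v"))  # meet
print(toy_lemmatize("meeting", "n"))  # meeting
print(toy_lemmatize("nothing", "n"))  # nothing
```

The same surface form maps to different lemmas depending on its tag, which is why the pipeline tags words before lemmatizing them.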

### PennTreebank to Wordnet POS

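We will be using PennTreebank tokenization (NLTK's `TreebankWordTokenizer`). A Treebank tokenizer follows the PennTreebank conventions, which include rules for splitting English contractions (for example, "didn't" becomes "did" and "n't") and for separating punctuation. The resulting tokens are the units that PennTreebank POS tags are defined over, which lets us tag each token with its syntactic function.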
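
The contraction convention can be sketched with two regular expressions. This is a simplified stand-in written for illustration, not NLTK's implementation; the `split_contractions` name is an assumption, and the real tokenizer applies many more rules.

```python
import re


def split_contractions(text):
    """Split "n't" off as its own token and pad basic punctuation (simplified)."""
    text = re.sub(r"(\w)(n't)\b", r"\1 \2", text)  # didn't -> did n't
    text = re.sub(r"([.,!?;])", r" \1 ", text)     # drop. -> drop .
    return text.split()


print(split_contractions("It didn't drop."))
# ['It', 'did', "n't", 'drop', '.']
```

Keeping "n't" as a separate token means the negation can receive its own POS tag instead of being hidden inside the verb.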

Note that the POS tags of PennTreebank are different from the Wordnet tags, hence we have to convert them to Wordnet tags within the pipeline. 

In [47]:
from nltk.tokenize import TreebankWordTokenizer
from nltk.corpus import wordnet as wn
from nltk import sent_tokenize, pos_tag
from nltk.stem import WordNetLemmatizer
from nltk.corpus import sentiwordnet as swn
import nltk
import pandas as pd
import numpy as np
import seaborn as sns
nltk.download('averaged_perceptron_tagger_eng')
nltk.download("sentiwordnet")

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /home/dan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package sentiwordnet to /home/dan/nltk_data...
[nltk_data]   Unzipping corpora/sentiwordnet.zip.


True

In [3]:
df_small = pd.read_csv("../data/small_corpus.csv")


In [4]:
df_small.head()

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,vote,style,image
0,1.0,False,"12 25, 2003",A3NN4RTUN0LHBN,B00009WAVB,George Rownd,This game is awful. Very bad graphics by a fir...,BAD,1072310400,,{'Format:': ' Video Game'},
1,1.0,True,"10 14, 2015",A1DAH6X9PGVH7D,B00XKCC00I,Paht,It would be really cool if the frame rate didn...,Terrible.,1444780800,,{'Platform:': ' Xbox One'},
2,1.0,False,"11 22, 2015",A2ZWI1773NNY12,B00W8FYF56,Pamela,got this for my grandson as a pre-order along ...,actually disappointed with both,1448150400,3.0,{'Format:': ' Video Game'},
3,1.0,False,"03 7, 2017",A1D7U5NSLRAU8E,B00BQMGW4Y,Kaxey,"Fake, ingenuine product. The picture shoes an ...",a shorter cable (OEMs were ten feet) and crapp...,1488844800,,,
4,1.0,False,"06 22, 2008",A2KBD1UW414PH2,B000V1OUTU,Greg,I bought the collector's edition and if you do...,"Sad, So much promise...",1214092800,4.0,"{'Edition:': "" Collector's""}",


In [5]:
# Function to convert PennTreebank tags to Wordnet tags
nltk.download("wordnet")

def penn_to_wn(tag):
    """
    Convert PennTreebank tags to simpler Wordnet tags
    """
    if tag.startswith("J"): # Adjectives start with J in PennTreebank tags
        return wn.ADJ
    elif tag.startswith("N"):
        return wn.NOUN
    elif tag.startswith("R"):
        return wn.ADV
    elif tag.startswith("V"):
        return wn.VERB

    return None

[nltk_data] Downloading package wordnet to /home/dan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [6]:
penn_to_wn("JJR") # JJR = Adjective, comparative. In PennTreebank POS

'a'

In [7]:
from nltk import sent_tokenize, pos_tag

In [8]:
# Test sentence and word tokenizer
review_tokens = df_small["reviewText"].apply(str)
review_tokens = review_tokens.apply(sent_tokenize)
sentence_tokens = review_tokens.loc[0]
sentence_token = sentence_tokens[1]
sentence_token

'Very bad graphics by a first person shooter.'

In [9]:
word_tokens = TreebankWordTokenizer().tokenize(sentence_token)
word_tokens

['Very', 'bad', 'graphics', 'by', 'a', 'first', 'person', 'shooter', '.']

In [10]:
tags = pos_tag(word_tokens)
print(tags)

[('Very', 'RB'), ('bad', 'JJ'), ('graphics', 'NNS'), ('by', 'IN'), ('a', 'DT'), ('first', 'JJ'), ('person', 'NN'), ('shooter', 'NN'), ('.', '.')]


In [11]:
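lemmatizer = WordNetLemmatizer()  # create once instead of once per word
for word, tag in tags:
    wn_tag = penn_to_wn(tag)  # None for tags without a Wordnet equivalent
    if not wn_tag:
        continue
    lemma = lemmatizer.lemmatize(word, pos=wn_tag)
    print(lemma)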
    

Very
bad
graphic
first
person
shooter


In [29]:
            
            

def lemmatize_corpus(text):
    """Return (lemma, Wordnet tag) pairs for the nouns, verbs, adjectives and adverbs in text."""
    tokenizer = TreebankWordTokenizer()
    lemmatizer = WordNetLemmatizer()
    lemmas = []
    for sentence_token in sent_tokenize(text):
        # Tokenize sentence into words and make (word, PennTreebank tag) tuples
        penn_tags = pos_tag(tokenizer.tokenize(sentence_token))

        for token, tag in penn_tags:
            # Convert the PennTreebank tag to a Wordnet tag. penn_to_wn returns None
            # for anything that is not a noun, verb, adjective or adverb, and
            # such tokens are skipped
            wn_tag = penn_to_wn(tag)
            if not wn_tag:
                continue
            lemma = lemmatizer.lemmatize(token.lower(), wn_tag)
            lemmas.append((lemma, wn_tag))

    return lemmas
    
            
        
        

In [30]:
text = "She had a little lamb. The lamb was very young and vulnerable!"
lemmatize_corpus(text)

[('have', 'v'),
 ('little', 'a'),
 ('lamb', 'n'),
 ('lamb', 'n'),
 ('be', 'v'),
 ('very', 'r'),
 ('young', 'a'),
 ('vulnerable', 'a')]

In [31]:
sample = df_small.head().copy(deep=True)
sample["lemmatized"] = sample["reviewText"].apply(lemmatize_corpus)
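# NOTE: `tokenize` is not defined in this section; it is assumed to be a helper defined
# elsewhere that returns (token, Wordnet tag) pairs without lemmatizing
sample["tags"] = sample["reviewText"].apply(tokenize)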


In [32]:
print(sample["reviewText"][1])
print(sample["lemmatized"][1])

It would be really cool if the frame rate didn't drop like crazy every time you throw a pass, catch a pass, cause a turnover, catch an INT, kickoff, call a play, try to get behind your blocks....the list can go on really but I'll leave you with this for now.
Save your money until these EA hacks fix this crap (although they probably won't).
[('be', 'v'), ('really', 'r'), ('cool', 'a'), ('frame', 'n'), ('rate', 'n'), ('do', 'v'), ("n't", 'r'), ('drop', 'v'), ('crazy', 'a'), ('time', 'n'), ('throw', 'v'), ('pas', 'n'), ('catch', 'v'), ('pas', 'n'), ('cause', 'v'), ('turnover', 'n'), ('catch', 'v'), ('int', 'n'), ('kickoff', 'n'), ('call', 'v'), ('play', 'n'), ('try', 'v'), ('get', 'v'), ('block', 'n'), ('.the', 'a'), ('list', 'n'), ('go', 'v'), ('really', 'r'), ('leave', 'v'), ('now', 'r'), ('save', 'v'), ('money', 'n'), ('ea', 'n'), ('hack', 'n'), ('fix', 'v'), ('crap', 'n'), ('probably', 'r'), ("n't", 'r')]


In [33]:
sample.head()

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,vote,style,image,lemmatized,tags
0,1.0,False,"12 25, 2003",A3NN4RTUN0LHBN,B00009WAVB,George Rownd,This game is awful. Very bad graphics by a fir...,BAD,1072310400,,{'Format:': ' Video Game'},,"[(game, n), (be, v), (awful, a), (very, r), (b...","[(game, n), (is, v), (awful, a), (Very, r), (b..."
1,1.0,True,"10 14, 2015",A1DAH6X9PGVH7D,B00XKCC00I,Paht,It would be really cool if the frame rate didn...,Terrible.,1444780800,,{'Platform:': ' Xbox One'},,"[(be, v), (really, r), (cool, a), (frame, n), ...","[(be, v), (really, r), (cool, a), (frame, n), ..."
2,1.0,False,"11 22, 2015",A2ZWI1773NNY12,B00W8FYF56,Pamela,got this for my grandson as a pre-order along ...,actually disappointed with both,1448150400,3.0,{'Format:': ' Video Game'},,"[(get, v), (grandson, n), (pre-order, n), (han...","[(got, v), (grandson, n), (pre-order, n), (Han..."
3,1.0,False,"03 7, 2017",A1D7U5NSLRAU8E,B00BQMGW4Y,Kaxey,"Fake, ingenuine product. The picture shoes an ...",a shorter cable (OEMs were ten feet) and crapp...,1488844800,,,,"[(fake, n), (ingenuine, a), (product, n), (pic...","[(Fake, n), (ingenuine, a), (product, n), (pic..."
4,1.0,False,"06 22, 2008",A2KBD1UW414PH2,B000V1OUTU,Greg,I bought the collector's edition and if you do...,"Sad, So much promise...",1214092800,4.0,"{'Edition:': "" Collector's""}",,"[(buy, v), (collector, n), (edition, n), (do, ...","[(bought, v), (collector, n), (edition, n), (d..."


In [50]:
# Test out SentiWordNet
lemma = sample["lemmatized"][1][1]
synset = wn.synsets(lemma[0], pos=lemma[1])

In [51]:
display(synset)

[Synset('truly.r.01'),
 Synset('actually.r.01'),
 Synset('in_truth.r.01'),
 Synset('very.r.01')]

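It seems that WordNet orders the synsets of a word by how frequent each sense is in some reference corpus (e.g. the Brown Corpus), so the first synset is typically the most common sense of the word (OpenAI, 2023).

Given that Bag of Words does not include contextual information, we cannot disambiguate between senses, so using the most common synset is a reasonable default.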

In [52]:
synset = synset[0]

In [53]:
display(synset)

# Name of synset object can be accessed by method
display(synset.name())

Synset('truly.r.01')

'truly.r.01'

In [54]:
swn_synset = swn.senti_synset(synset.name())

In [55]:
display(swn_synset)

SentiSynset('truly.r.01')

By evaluating the positive and negative scores, we can see where along the positive-negative (PN) polarity axis the SentiSynset is located.

In [57]:
display(swn_synset.pos_score())
display(swn_synset.neg_score())

0.625

0.0

Consequently, by **subtracting the negative from the positive score**, we can see towards which polarity the SentiSynset is leaning.

Note that there are three scores in SentiSynsets, **positive, negative, and objective**, and these scores **always sum up to one**. This means that the magnitude of the sentiment score (positive - negative) of a SentiSynset with high objectivity is necessarily limited. This is a desirable characteristic, as we want words that are more neutral to be less informative than words that are less neutral.

In [58]:
senti_score = swn_synset.pos_score() - swn_synset.neg_score()
senti_score

0.625
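
The relationship between the three scores can be checked with plain arithmetic. The sketch below does not call SentiWordNet; the `polarity` helper is a hypothetical name, and it only assumes what is stated above, namely that the positive, negative and objective scores sum to one.

```python
def polarity(pos, neg):
    """Return (objective, positive - negative) for one pair of scores."""
    if pos < 0 or neg < 0 or pos + neg > 1:
        raise ValueError("scores must be non-negative and sum to at most 1")
    obj = 1.0 - (pos + neg)  # the three scores sum to one
    return obj, pos - neg


# Scores of 'truly.r.01' as displayed above
obj, score = polarity(0.625, 0.0)
print(obj, score)  # 0.375 0.625

# The magnitude of the sentiment score can never exceed 1 - objective
assert abs(score) <= 1.0 - obj
```

A synset with an objective score of 0.875, for instance, can contribute at most 0.125 in either direction.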

### Complete Sentiment Scoring Function
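Now we can write one function that scores the sentiment of a text: it averages the word-level sentiment scores within each sentence and sums these sentence averages over the text. The function also does the lemmatization, so that we don't have to write any additional functions.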

In [97]:
def sentiment_score(text):
    """Sum, over the sentences in text, of the mean (positive - negative) score per scored word."""
    tokenizer = TreebankWordTokenizer()
    lemmatizer = WordNetLemmatizer()
    total_score = 0
    for sentence_token in sent_tokenize(str(text)):
        sentence_score = 0
        no_words_used = 0
        for word, tag in pos_tag(tokenizer.tokenize(sentence_token)):
            wn_tag = penn_to_wn(tag)
            if not wn_tag:
                continue  # not a noun, verb, adjective or adverb
            lemma = lemmatizer.lemmatize(word, pos=wn_tag)
            if not lemma:
                continue
            synsets = wn.synsets(lemma, pos=wn_tag)
            if not synsets:
                continue  # lemma has no Wordnet entry for this POS

            # Select most common synset and transform into SentiSynset
            swn_synset = swn.senti_synset(synsets[0].name())
            sentence_score += swn_synset.pos_score() - swn_synset.neg_score()

            # Word was used, add count
            no_words_used += 1

        # Average sentence score across words in the sentence, and add to total
        if no_words_used == 0:
            continue
        total_score += sentence_score / no_words_used

    return total_score
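
The aggregation step of `sentiment_score` can be isolated from the NLTK calls. The sketch below takes word-level scores that are already computed; the `aggregate` name and the example numbers are made up for illustration.

```python
def aggregate(sentence_word_scores):
    """Sum the per-sentence mean of word scores, skipping sentences with no scored words."""
    total = 0.0
    for scores in sentence_word_scores:
        if not scores:
            continue  # no noun, verb, adjective or adverb was scored
        total += sum(scores) / len(scores)
    return total


# Three sentences: two scored words, one scored word, no scored words
print(aggregate([[0.5, 0.0], [0.25], []]))  # 0.5
```

Because the sentence averages are summed rather than averaged, a text with more sentences can reach a larger absolute score. Dividing the total by the number of scored sentences would be one way to remove that length effect, if that is desired.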

In [98]:
df_small = df_small.copy(deep=True)
# No separate string cast is needed here: sentiment_score converts each value with str()

df_small["sentiment_score"] = df_small["reviewText"].apply(sentiment_score)

In [99]:
df_small[df_small["overall"] > 4]

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,vote,style,image,sentiment_score
3000,5.0,True,"06 14, 2013",A2OWL0NS33ZQST,B00113T0VA,KAW999,Great game lots of action great story that rel...,Endless hours of entertainment,1371168000,,{'Format:': ' CD-ROM'},,0.085227
3001,5.0,False,"03 31, 2005",A3O31BQQ0751PI,B00008KTNW,mike,this game has some of the best cutscenes ever....,masterpiece,1112227200,,,,0.181250
3002,5.0,True,"06 12, 2016",A1K619I1XM2BLC,B0166QDJDQ,Scott w.,Loved the game and the price!,Five Stars,1465689600,,{'Format:': ' Video Game'},,0.166667
3003,5.0,True,"12 10, 2016",ABGUYZ6CWA2V1,B01326JM3Y,blank,Great snes controller substitute very close to...,Five Stars,1481328000,,,,0.000000
3004,5.0,False,"07 16, 2014",A1WTMP0BQ76WTZ,B0014X7SQ6,Noneatallhere,What a great look back into the world of FF7. ...,A PSP must own for any action/role play fans.,1405468800,,{'Edition:': ' Standard'},,0.395485
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4495,5.0,True,"12 12, 2015",A30KYKOAYNH1WM,B000XJNTNS,Shannon Solis,Christmas gift,Five Stars,1449878400,,,,0.125000
4496,5.0,True,"12 29, 2015",A1B6EN5Y5DC6CP,B00002STI2,Roy Parr,Great game worth picking up,Five Stars,1451347200,,,,0.000000
4497,5.0,True,"01 9, 2017",A2JCY15Z85WOLQ,B00W8FYFBA,YYK,,Five Stars,1483920000,,{'Format:': ' Video Game'},,0.000000
4498,5.0,True,"07 9, 2015",A1E5UBXNO5AVF6,B00O9JLBOC,AKA,"Awesome deal, came in perfect condition for my...",Great deal,1436400000,,,,0.477273
