# Gap Framework - Data Preparation for Sentiment Analysis

## PDSG: Boardgame rating and comment

The Boardgame rating/comment dataset is in a CSV format. To start you will need to read in the raw data from the CSV file using a csv reader. Fortunately, Python has a CSV reader. Let's start by importing the reader.

In [1]:
# Import Python's CSV parser module
import csv

The Python csv reader creates a generator. For convenience, since you may want to re-read the file more than once as you practice preparing the data, let's make a function to create the generator.

Note: The boardgame rating/comment dataset has unicode characters in it. So we need to open the file with the encoding 'utf-8'. If we don't the reader will throw an exception when it encounters a unicode character sequence.

In [2]:
# Create a generator for reading each row as a list in the CSV file.
def load():
    # make sure to open with Unicode encoding (there are unicode chars in the dataset)
    f = open('boardgames.csv', encoding='utf-8')
    r = csv.reader(f, delimiter=',')
    return r

### Keep only the data I am interested in.

For my purposes, I am only interested in the rating and the comment. The entry and game ID will not contribute to my model, so I want to toss them out. I will simply create a new dataset where I will copy only the rating and comment over.

In [3]:
# Make a new dataset of just the rating and comment
dataset = []

# Let's read each row to assemble a new dataset
header = True
for row in load():
    # First row is a header, so let's skip it.
    if header:
        header = False
        continue
        
    # some of the ratings are floating point values, like 8.5 or 6.2. We will make them all ints for convinence. 
    rating = int(float(row[2]))

    dataset.append( { 'rating': rating, 'comment': row[3] } )

Let's check that we really create the new dataset as we expect.

In [4]:
print("Number of rows", len(dataset))
print("Row 0", dataset[0])
print("Row 2", dataset[2])

Number of rows 847
Row 0 {'rating': 8, 'comment': "++++ Thematic +++ Bluff - Many randomness   I really like that one. Maybe it's more fun to play as a cylon than as a human, but when you do, you really feel like an undercover agent who awaits the best moment to strike and hurt bad...  It is a long poker time when everyone tries to guess who is who.  Having seen the series is really a must, to enjoy this game, otherwise the thematic won't be appreciated."}
Row 2 {'rating': 8, 'comment': 'LOVE this game!  If only the GF would play it with me.  Tired at end of day with basic math = bad idea.'}


This first row shows some of the issues you will need to consider:

    - Punctuation sequences like +++ and *** may indicate the user is emphasizing the word.
    - All caps (e.g., BAD) may indicate the user is emphasizing the word.
    - ! may mean the user is elevating their statement.
    
So why is this an issue? The general rule of thumb in NLP preprocessing is to remove punctuation and lowercase words. Hum, that might cause us to lose valuable information. We will explore this, along with other things.

### Preprocessing with Gap

Okay, there are a zillion ways one could code preprocessing text for sentiment analysis. There are no shortage of blogs.

I will show you how to use **Gap** as an alternative to writing each line of preprocessing code by hand, and do it in a few simple steps. 

We start by first importing the <b style='color: saddlebrown'>SYNTAX</b> module from **Gap**. This module does syntactically analysis for Text.



In [5]:
# Import Words from Gap's Syntax module
from gapml.syntax import Words

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\'\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\'\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


I will use the <b style='color:saddlebrown'>Words</b> class to demonstrate methods to preprocess (i.e., prepare) the text for machine learning. Like many applications of NLP, sentiment analysis has its own unique challenges:

    - Reviews contain slang
    - Repetitive punctuation and all CAPS are used as emphasis.
    - Short-hand, incomplete sentences, higher frequency of spelling errors (not a concern to someone entering a review).
    
I will start with a simple example of a single 'positive' phrase; but in the phrase, the reviewer has used punctuation and all caps to emphasis words.

First, we preprocess the traditional way: remove punctuation, lowercase, stem and stopword removal.

In [6]:
# This is our helper function to make it easier to display preprocessed text w/o tagging
def towords(words):
    for word in words:
        print(word['word'], ' ')

#### Standard Method

Let's process and print the NLP tokenized string according to the 'standard rule of thumb'

In [7]:
w = Words('+++ My favourite game of all TIME!!')
towords(w.words)

game  
time  


OMG. What's wrong with this. Well, we lost stuff that might be important, like:

    - TIME was in all CAPS
    - +++ punctuation was used as emphasis
    - favorite was spelled according to UK spelling.

#### Bare Mode

Let's do the opposite and just keep everything. In **Gap**, that's the parameter *bare=True*.

In [8]:
w = Words('+++ My favourite game of all TIME!!', bare=True)
towords(w.words)

# Note that tag 14 is an acronymn

+  
+  
+  
My  
favourite  
game  
of  
all  
TIME  
!  
!  


Okay, that might be better. But favourite won't match other occurrences of the American version favorite, and TIME won't match time and Time, etc. Perhaps your neural network will learn the relationship. But why burden the neural network? It has plenty of other important things to learn, like predicting the rating!

#### Keeping Punctuation

Okay, let's go back to step 0. We start by doing the standard rule of thumb, but this time we keep punctuation by setting the keyword parameter *punct* to True, since it might indicate an emphasis. 

In [9]:
w = Words('+++ My favourite game of all TIME!!', punct=True)
towords(w.words)

+  
+  
+  
game  
time  
!  
!  


Okay, that's better. But we are still missing important things. Note how time is lowercased. But if we look at it's tag, it will tell us that it was uppercased (value 14). Also the punctuation all have tags as punctuation (value 23) or symbol (value 24). 

Interesting, we don't need to write code to tell if something was uppercased or if we have sequence of punctuation. We can just look at the tags.

In [10]:
print(w.words)

[{'tag': 24, 'word': '+'}, {'tag': 24, 'word': '+'}, {'tag': 24, 'word': '+'}, {'tag': 0, 'word': 'game'}, {'tag': 14, 'word': 'time'}, {'tag': 23, 'word': '!'}, {'tag': 23, 'word': '!'}]


#### Sentiment Words

We are still missing something. It's the word 'favourite' that indicates the sentiment! And it's in UK spelling. **Gap** has a word dictionary of US and UK spellings, slang and misspellings of words that (or may) indicate a sentiment. Let's tell **Gap** to keep these words by setting the keyword parameter *sentiment* to True.

In [11]:
w = Words('+++ My favourite game of all TIME!!', punct=True, sentiment=True)
print(w.words)

[{'tag': 24, 'word': '+'}, {'tag': 24, 'word': '+'}, {'tag': 24, 'word': '+'}, {'tag': 18, 'word': 'favorite'}, {'tag': 0, 'word': 'game'}, {'tag': 14, 'word': 'time'}, {'tag': 23, 'word': '!'}, {'tag': 23, 'word': '!'}]


Let's take a close look at the NLP sequence. Aah, eventhough time is lowercase, the tag (14) indicates it was all CAPS. We kept the punctuation. And look, we retain the word favorite, and it's in US spelling.

#### Negation

Okay, sometimes people use a negation. That is, they use a positive word like 'great' or 'good', but proceed it was a negation, like 'not' or 'never'. **Gap** recognizes this as well. Let's try one.

In [12]:
w = Words("Did not like the game.", sentiment=True, punct=True)
print(w.words)

# Note the positive word Like is negated by not, so it was removed.

[{'tag': 19, 'word': 'not'}, {'tag': 0, 'word': 'game'}, {'tag': 23, 'word': '.'}]


Ahh, **Gap** dropped the otherwise positive word 'like' and simply kept the negative word 'not'.

#### Contractions

**Gap** handles contractions as well. For example, the not might be part of a contraction like: don't, can't, won't, isn't, etc.

In [13]:
w = Words("Didn't like the game.", sentiment=True, punct=True)
print(w.words)

# Handles contractions too.

[{'tag': 19, 'word': 'not'}, {'tag': 0, 'word': 'game'}, {'tag': 23, 'word': '.'}]


#### So many ways to write a word

**Gap** has builtin and 3rd party stemmers and lemmatizers to match the same word with different word endings. For example, the words 'best' and 'better' are recognized as the same.

In [14]:
w = Words('it is a great game', sentiment=True, punct=True)
print(w.words)
w = Words('it is the greatest game', sentiment=True, punct=True)
print(w.words)

[{'tag': 18, 'word': 'great'}, {'tag': 0, 'word': 'game'}]
[{'tag': 18, 'word': 'great'}, {'tag': 0, 'word': 'game'}]


#### Spelling Errors

Reviews are notorious for spelling errors! **Gap** an implementation of the (Peter) Norvig speller builtin for English (and French, Spanish, German and Italian) language spell checker. Using the spell parameter, **Gap** will lookup each word in the pyaspeller dictionary. If the word is not found and the *Norvig* speller finds a replacement, the misspelled word is replaced.

Note in the example, the misspelled word 'grat' is replaced with 'great'.

In [15]:
w = Words('The game was graat!', sentiment=True, punct=True)
print(w.words)
w = Words('The game was graat!', sentiment=True, punct=True, spell='en')
print(w.words)

[{'tag': 0, 'word': 'game'}, {'tag': 0, 'word': 'graat'}, {'tag': 23, 'word': '!'}]
[{'tag': 0, 'word': 'game'}, {'tag': 18, 'word': 'great'}, {'tag': 23, 'word': '!'}]


### Prepare for Neural Network

First, **Gap** wasn't focused for sentiment analysis, but for extracting and processing text from PDF documents. So we need to do some last steps by hand that in the future will be incorporated into **Gap**, these include:


1. Add special markers.
2. Reduce token sequence to fixed length.
3. Convert words to integers.

#### Special Markers

Let's start defining special markers, some of which I will explain their purpose further into the code-along.




In [16]:
pad      = '<PAD>'
emphasis = '<EMP>'
positive = '<POS>'
negative = '<NEG>'

Let's know create a function that will insert some of our special markers, as follows:
    1. If a word has a negative sentiment tag, insert <NEG>
    2. If a word has a positive sentiment tag, insert <POS>
    3. If a word is all caps, insert a <EMP>
    4. If + or * occurs two or more times in sequence, replace with a <EMP>
    5. If a ! mark appears, replace with a <EMP>
    6. Otherwise, remove all remaining punctuation and symbols.

In [19]:
from gapml.syntax import Vocabulary

def prepare(comment):
    """ Convert text to NLP sequence """
    ret = []
    
    # Create words object from text: keep punctuation and sentiment words
    try:
        words = Words(comment, punct=True, sentiment=True)
    except:
        return
    
    # Reconstruct word list, adding special markers
    last = None
    for word in words.words:
        # Add special <EMP> marker for all caps words
        if word['tag'] == Vocabulary.ACRONYM:
            ret.append(emphasis)
        # Add special <POS> marker for positive words
        if word['tag'] == Vocabulary.POSITIVE:
            ret.append(positive)
        # Add special <NEG> marker for negative words
        elif word['tag'] == Vocabulary.NEGATIVE:
            ret.append(negative)
        # Drop punctuation, unless its an exclamation mark
        elif word['tag'] == Vocabulary.PUNCT:
            # Add <EMP> if exclamation mark
            if word['word'] == '!':
                ret.append(emphasis)
            continue
        elif word['tag'] == Vocabulary.SYMBOL:
            # Drop symbols, unless + or * as multiple sequence
            if word['word'] in ['+', '*']:
                # Add <EMP> if multiple sequence
                if last == word['word']:
                    ret.append(emphasis)
            continue
        ret.append(word['word'])
        # remember the last word
        last = word['word']
        
    return ret

Let's now update each comment in the dataset with our special markers using the prepare function.

Note, since the contents of lists are mutable in Python, we do not need to create a new list. We can replace the previous element in the list with a new element in place (no copy).

In [20]:
# Annotate each row with special markers
for row in dataset:
    comment = prepare(row['comment'])
    row['comment'] = comment

In [21]:
dataset[1]

{'comment': ['well',
  '<NEG>',
  'ugly',
  'artwork',
  'help',
  'immerse',
  'egyptian',
  'theme',
  '<POS>',
  'cool',
  'auction',
  '<EMP>',
  'ra',
  '<EMP>',
  '<EMP>',
  '<EMP>'],
 'rating': 8}

#### Truncating Size

So we plan to send our sequences into a Recurrent Neural Network (RNN), like a LSTM. Best practice is that each sequence of tokens we input is the same length. So we need to pick a length, and anything above that length we truncate and below that length we add the special marker <PAD> - alas, the purpose of the <PAD> special marker.
    
Let's set the sequence length to 15.

In [22]:
seq_len = 15

# Resize each comment to be exactly 15 tokens
for row in dataset:
    try:
        rlen = len(row['comment'])
    except:
        continue
    # If comment is > 15 tokens, then  truncate
    if rlen > 15:
        row['comment'] = row['comment'][0:15]
    # If less than 15 tokens, then add paddimg
    elif rlen < 15:
        for _ in range(rlen, 15+1):
            row['comment'].append('<PAD>')

#### Let's look at a couple of rows and see what changed.
    - Row 1: Was 16 elements, we dropped the last element <EMP>.
    - Row 2: Was 13 elements, we added two <PAD> elements.

In [23]:
dataset[1]

{'comment': ['well',
  '<NEG>',
  'ugly',
  'artwork',
  'help',
  'immerse',
  'egyptian',
  'theme',
  '<POS>',
  'cool',
  'auction',
  '<EMP>',
  'ra',
  '<EMP>',
  '<EMP>'],
 'rating': 8}

In [24]:
dataset[2]

{'comment': ['<EMP>',
  'love',
  'game',
  '<EMP>',
  '<EMP>',
  'gf',
  'play',
  'tir',
  'basic',
  'math',
  '<NEG>',
  'bad',
  'idea',
  '<PAD>',
  '<PAD>',
  '<PAD>'],
 'rating': 8}

#### Convert Words to Integers

Well, inputs to neural networks are numbers not words! Yeaks, what next.

Well, we need to make a dictionary that maps each unique word to a unique integer and the replace the words in our inputs with the corresponding integer value.

In Python, we do that with a dictionary object. Let's start by initializing our dictionary with the special markers.

In [25]:
word2int = {}
word2int['<PAD>'] = 0
word2int['<EMP>'] = 1
word2int['<POS>'] = 2
word2int['<NEG>'] = 3

Let's now build a dictionary of all the words in our dataset, mapping each word to a unique integer

In [26]:
index = 4 # next integer value of word to add to dictionary

# Walk thru each row in the dataset
for row in dataset:
    # Get the list of preprocessed words for this row's comment
    words = row['comment']
    # For each word, we will see if we need to add it to the dictionary
    for word in words:
        # looks like this word is not in the dictionary!
        if word not in word2int:
            # Add the word and set its value to the next index sequence
            word2int[word] = index
            index += 1

Let's do some verification that this step work.

In [27]:
print("Number of Unique Words:", len(word2int))
print("Mapping of the word fun:", word2int['fun'])

Number of Unique Words: 1774
Mapping of the word fun: 7


In [28]:
# Update each row in the dataset, replacing words with their unique integer value
for row in dataset:
    comment = []
    # for each word in the comment, lookup its integer value
    for word in row['comment']:
        # replace the word with its unique integer value
        comment.append(word2int[word])
    # replace the comment with the list of integer values
    row['comment'] = comment

#### Let's look at a couple of rows and see what changed.

In [29]:
dataset[1]

{'comment': [15, 3, 16, 17, 18, 19, 20, 21, 2, 22, 23, 1, 24, 1, 1],
 'rating': 8}

In [30]:
dataset[2]

{'comment': [1, 25, 26, 1, 1, 27, 8, 28, 29, 30, 3, 31, 32, 0, 0, 0],
 'rating': 8}

# END OF SESSION 4

You are now ready to input your dataset into a Recurrent Neural Network (RNN) like a LSTM or GRU.