Welcome to module 2.1. In this module, we will use the popular Python library `scikit-learn` to build a sentiment classifier with Naive Bayes! We will introduce the concept of a feature as a numerical representation of the input data, and we will experiment with different types of features to investigate their impact on training.


Let's first refresh your memory on the Naive Bayes model. 

## ❓ Pre-module quiz

Say that we have two events: Fire and Smoke. $P(Fire)$ is the probability of a fire (or in other words, how often a fire occurs), $P(Smoke)$ is the probability of seeing smoke (how often we see smoke). We want to know $P(Fire|Smoke)$, that is, how often fire occurs when we see smoke. Suppose we know the following:

$P(Fire)=0.01$

$P(Smoke)=0.1$

$P(Smoke|Fire)=0.9$ (i.e. 90\% of fires produce smoke)


Can you work out $P(Fire|Smoke)$?

A. 0.1

B. 0.09

C. 0.01

D. 0.9




<hr>    <!-- please remember this! -->
<details>
  <summary>Click <b>here</b> to see the answer.</summary>
  <p>\begin{equation*}P(Fire|Smoke) =  \frac{P(Fire)*P(Smoke|Fire)}{P(Smoke)} \end{equation*} </p>
<p>\begin{equation*}=\frac{0.01 * 0.9}{ 0.1} \end{equation*}</p>
<p>\begin{equation*}=0.09\end{equation*}</p>

</details> 
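
The calculation in the answer can be checked with a few lines of Python. This is only a minimal sketch of Bayes' rule applied to the quiz numbers; the function name `posterior` and the variable names are our own choices for illustration.

```python
def posterior(prior, likelihood, evidence):
    """Bayes' rule: P(A|B) = P(A) * P(B|A) / P(B)."""
    return prior * likelihood / evidence

p_fire = 0.01              # P(Fire)
p_smoke = 0.1              # P(Smoke)
p_smoke_given_fire = 0.9   # P(Smoke|Fire)

p_fire_given_smoke = posterior(p_fire, p_smoke_given_fire, p_smoke)
# round to hide floating-point noise in the last digits
print(round(p_fire_given_smoke, 4))
```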



# Sentiment Analysis Task Introduction

Our task of focus today is a popular NLP classification task: sentiment analysis. What exactly is sentiment? Sentiment relates to the meaning of a word or sequence of words and is usually associated with an opinion or emotion. And analysis? Well, this is the process of looking at data and making inferences; in this case, using machine learning to learn and predict whether a movie review is positive or negative.

In this section, we will replicate the experiments from the paper *Thumbs up? Sentiment Classification using Machine Learning Techniques* (https://www.aclweb.org/anthology/W02-1011.pdf). We will extract a number of features, including unigrams, bigrams and part-of-speech (POS) tags, among others, and train Naive Bayes models on these features.

## Preparing data

The data for this tutorial is stored in the `./data` folder. The two subdirectories `./data/pos` and `./data/neg` contain samples of positive and negative IMDb movie reviews. Each line of a review text file is a tokenized sentence.

As usual, we download the files for the notebook from GitHub. If you're running this notebook locally or on Binder, you may skip this cell.

In [1]:
!wget https://github.com/cambridgeltl/python4cl/raw/module_2.1/module_2/module_2.1/data.zip
!unzip -n -q data.zip

--2020-11-24 13:04:10--  https://github.com/cambridgeltl/python4cl/raw/module_2.1/module_2/module_2.1/data.zip
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/cambridgeltl/python4cl/module_2.1/module_2/module_2.1/data.zip [following]
--2020-11-24 13:04:11--  https://raw.githubusercontent.com/cambridgeltl/python4cl/module_2.1/module_2/module_2.1/data.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.16.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.16.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4848090 (4.6M) [application/zip]
Saving to: ‘data.zip.5’


2020-11-24 13:04:15 (1.52 MB/s) - ‘data.zip.5’ saved [4848090/4848090]



We can load an individual text file by opening it, reading in the ASCII text, and closing the file. For example, we can load the first negative review file “cv000_29416.txt” as follows:


In [2]:

# load one file
filename = 'data/neg/cv000_29416.txt'
# open the file as read only
file = open(filename, 'r')
# read all text
text = file.read()
print (text)
# close the file
file.close()


plot : two teen couples go to a church party , drink and then drive . 
they get into an accident . 
one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . 
what's the deal ? 
watch the movie and " sorta " find out . . . 
critique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . 
which is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn't snag this one correctly . 
they seem to have taken this pretty neat concept , but executed it terribly . 
so what are the problems with the movie ? 
well , its main problem is that it's simply too jumbled . 
it starts off " normal " but then downshifts into this " fantasy " world in which you , as an audience member , have no id

This loads the document as a single string and preserves any white space, like new lines.

We can turn this into a function called `load_doc()` that takes the filename of the document to load and returns the text.


In [3]:
# load doc into memory
def load_doc(filename):
    """
    Parameters
    ----------
    filename: the filename to extract text

    Return
    ------
    text in strings
    """
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text
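
As a side note, the same open/read/close pattern is often written with a `with` statement, which closes the file automatically. The sketch below is an illustrative variant, not a replacement for `load_doc()`; the name `load_doc_with` and the temporary example file are assumptions made for this example only.

```python
import os
import tempfile

def load_doc_with(filename):
    """Hypothetical variant of load_doc() that uses a context manager."""
    # the file is closed automatically when the with-block ends,
    # even if read() raises an exception
    with open(filename, 'r') as file:
        return file.read()

# try it on a small temporary file rather than the real dataset
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.txt')
    with open(path, 'w') as f:
        f.write('a short review .\n')
    example_text = load_doc_with(path)

print(repr(example_text))
```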


We can process each directory in turn by first getting a list of its files with the `listdir()` function from Python's `os` module, then loading each file with `load_doc()`.

Below, we define a `process_docs()` function that loads all documents in a folder and returns them as a list of texts. We try it on the negative directory first.

In [4]:
from os import listdir 
# load all docs in a directory
def process_docs(directory):
    """
    Parameters
    ----------
    directory: a directory containing positive/negative samples from the Thumbs
    Up! dataset.

    Return
    ------
    A list of documents, where each document is a string
    """
    # walk through all files in the folder
    docs=[] # a list of review texts
    for filename in listdir(directory):
        # skip files that do not have the right extension
        if not filename.endswith(".txt"):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load document
        doc = load_doc(path)
        print('Loaded %s' % filename)
        docs.append(doc)
    return docs
 
# specify directory to load
directory = 'data/neg'
docs=process_docs(directory)

Loaded cv676_22202.txt
Loaded cv839_22807.txt
Loaded cv155_7845.txt
Loaded cv465_23401.txt
Loaded cv398_17047.txt
Loaded cv206_15893.txt
Loaded cv037_19798.txt
Loaded cv279_19452.txt
Loaded cv646_16817.txt
Loaded cv756_23676.txt
Loaded cv823_17055.txt
Loaded cv747_18189.txt
Loaded cv258_5627.txt
Loaded cv948_25870.txt
Loaded cv744_10091.txt
Loaded cv754_7709.txt
Loaded cv838_25886.txt
Loaded cv131_11568.txt
Loaded cv401_13758.txt
Loaded cv523_18285.txt
Loaded cv073_23039.txt
Loaded cv688_7884.txt
Loaded cv664_4264.txt
Loaded cv461_21124.txt
Loaded cv909_9973.txt
Loaded cv939_11247.txt
Loaded cv368_11090.txt
Loaded cv185_28372.txt
Loaded cv749_18960.txt
Loaded cv836_14311.txt
Loaded cv322_21820.txt
Loaded cv789_12991.txt
Loaded cv617_9561.txt
Loaded cv288_20212.txt
Loaded cv464_17076.txt
Loaded cv904_25663.txt
Loaded cv866_29447.txt
Loaded cv429_7937.txt
Loaded cv212_10054.txt
Loaded cv007_4992.txt
Loaded cv522_5418.txt
Loaded cv109_22599.txt
Loaded cv753_11812.txt
Loaded cv312_29308.tx

Loaded cv450_8319.txt
Loaded cv247_14668.txt
Loaded cv236_12427.txt
Loaded cv827_19479.txt
Loaded cv658_11186.txt
Loaded cv712_24217.txt
Loaded cv582_6678.txt
Loaded cv030_22893.txt
Loaded cv094_27868.txt
Loaded cv786_23608.txt
Loaded cv151_17231.txt
Loaded cv364_14254.txt
Loaded cv276_17126.txt
Loaded cv945_13012.txt
Loaded cv767_15673.txt
Loaded cv167_18094.txt
Loaded cv149_17084.txt
Loaded cv158_10914.txt
Loaded cv334_0074.txt
Loaded cv956_12547.txt
Loaded cv589_12853.txt
Loaded cv680_10533.txt
Loaded cv474_10682.txt
Loaded cv354_8573.txt
Loaded cv834_23192.txt
Loaded cv649_13947.txt
Loaded cv108_17064.txt
Loaded cv268_20288.txt
Loaded cv126_28821.txt
Loaded cv912_5562.txt
Loaded cv791_17995.txt
Loaded cv311_17708.txt
Loaded cv521_1730.txt
Loaded cv697_12106.txt
Loaded cv256_16529.txt
Loaded cv463_10846.txt
Loaded cv780_8467.txt
Loaded cv406_22199.txt
Loaded cv715_19246.txt
Loaded cv456_20370.txt
Loaded cv545_12848.txt
Loaded cv728_17931.txt
Loaded cv224_18875.txt
Loaded cv586_8048.

### ❓ Quiz  

Use the predefined `process_docs()` function to read in the negative and the positive reviews. How many reviews are there for each class?

You can write your code below:


<hr>    <!-- please remember this! -->
<details>
  <summary>Click <b>here</b> to see the answer.</summary>
    <p>1000 positive and 1000 negative reviews</p>


</details> 



# Feature Extraction


So far, each document is represented as text. However, to input the data into a machine learning model, we often need to convert text into numerical representations of features. 

We will start by introducing the concept of features and feature vectors. 


## What is a feature and what is a feature vector?

A feature is an individual measurable property or characteristic of a phenomenon being observed. For sentiment analysis, we can extract many features from a document to predict the sentiment polarity. For example, whether the word 'good' occurs in the document can be a feature. We can assign a number to this feature to indicate occurrence (e.g. 1) or absence (e.g. 0). We can then combine the values of all the features in the document into a numerical list that we call a feature vector.

The feature vector has a fixed length corresponding to the number of features. Each dimension represents a feature. For example, to represent unigrams in a document, we can create a vector with the same length as the vocabulary, where the value in each dimension (i.e. each position of the list) stores the frequency or presence of a specific word.


As an example, suppose we have seven words (apple, banana, red, dog, is, the, and) in the vocabulary, which are represented as seven dimensions (features). We also have the two documents below:

document 1: "the apple is red and the banana is yellow"

document 2: "the red dog"

To produce a feature vector that represents unigram presence in a document, we write '1' in a dimension if the word that dimension corresponds to is present in the document, and '0' if it is not. Words outside the vocabulary, such as 'yellow', have no dimension and are ignored. Below is a table view of the vectors.


|document no.|apple|banana|red|dog|is|the|and|
|------|------|------|------|------|------|------|------|
|document 1|1|1|1|0|1|1|1|
|document 2|0|0|1|1|0|1|0|


The feature vector for document 1 becomes `[1,1,1,0,1,1,1]`

To produce a feature vector that represents the frequency of each unigram in a document, we count the number of occurrences of each unigram and write that number in the corresponding dimension.

|document no.|apple|banana|red|dog|is|the|and|
|------|------|------|------|------|------|------|------|
|document 1|1|1|1|0|2|2|1|
|document 2|0|0|1|1|0|1|0|


Now the feature vector for document 1 becomes `[1,1,1,0,2,2,1]`

When the data consists of more than one document, as in our example, we have multiple feature vectors representing our data. We can stack the vectors into a list of vectors, which is often referred to as a matrix.
In our example, we have a 2*7 matrix, where 2 is the number of documents and 7 is the number of features.

Presence feature matrix for our toy data: `[[1,1,1,0,1,1,1],[0,0,1,1,0,1,0]]`
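
The toy example can also be worked through in plain Python. The sketch below uses ordinary lists only; the helper names `presence_vector` and `frequency_vector` are hypothetical and are not the functions we define later in this module.

```python
# the toy vocabulary, in the same order as the table columns
vocab = ['apple', 'banana', 'red', 'dog', 'is', 'the', 'and']
documents = [
    'the apple is red and the banana is yellow',
    'the red dog',
]

def presence_vector(text, vocab):
    """1 if the vocabulary word occurs in the text, otherwise 0."""
    tokens = set(text.split())
    return [1 if word in tokens else 0 for word in vocab]

def frequency_vector(text, vocab):
    """Number of times each vocabulary word occurs in the text."""
    tokens = text.split()
    return [tokens.count(word) for word in vocab]

# one row per document, one column per vocabulary word
presence_matrix = [presence_vector(doc, vocab) for doc in documents]
frequency_matrix = [frequency_vector(doc, vocab) for doc in documents]
print(presence_matrix)
print(frequency_matrix)
```

Note that 'yellow' does not appear in either matrix, because it is not part of the vocabulary.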


### ❓ Quiz  

Following the examples, please write below both the unigram presence and the unigram frequency feature vectors for the document text 'the red dog and the red apple'




<hr>    <!-- please remember this! -->
<details>
  <summary>Click <b>here</b> to see the answer.</summary>
    <p>Unigram presence feature vector: [1,0,1,1,0,1,1]</p>
    <p>Unigram frequency feature vector: [1,0,2,1,0,2,1]</p>


</details> 



### ❓ Quiz  

What is the matrix consisting of the unigram presence vectors of the following documents:

document1: 'the apple is red and the banana is yellow'

document2: 'the red dog and the red apple'


<hr>    <!-- please remember this! -->
<details>
  <summary>Click <b>here</b> to see the answer.</summary>
    <p>The matrix is: [[1,1,1,0,1,1,1],[1,0,1,1,0,1,1]] </p>


</details> 



The feature vectors are usually represented as `numpy` arrays in Python. Let's spend some time understanding what `numpy` is.

## Numpy arrays

A `numpy` array is similar to a `list`, but all its items share one data type, which typically makes it more memory-efficient and faster for numerical operations.

Below, we introduce several ways to create a numpy array:

In [5]:
import numpy as np
# create a numpy array of zeros with 2 items
vector1=np.zeros(2)
# create a numpy array from a list [1,2,3]
vector2=np.array([1,2,3])
# create an uninitialised array with 3 items; its values are arbitrary
vector3=np.empty(3)
print ('vector1',vector1)
print ('vector2',vector2)
print ('vector3',vector3)

vector1 [0. 0.]
vector2 [1 2 3]
vector3 [3. 2. 1.]


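We can also concatenate two numpy arrays with the same number of dimensions. Because `vector3` holds floats, the integers from `vector2` are converted to floats in the result.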

In [6]:
np.concatenate((vector2,vector3))

array([1., 2., 3., 3., 2., 1.])

So far, we have created numpy arrays with one dimension. Let's try creating a 2-D array (also called a matrix). We can pass the shape as a tuple (a,b), where a is the number of rows in the matrix and b is the number of columns.

In [7]:
matrix1=np.zeros((3,3)) 
# this is a matrix of zeros that has 3 vectors (rows), and within each vector there are 3 items. 
print ('matrix1',matrix1)
# a matrix from nested list
matrix2=np.array([[1,2,3],[2,3,4]])
print ('matrix2',matrix2)
# an empty matrix usually used as initialisation. It will print as an empty list
matrix3=np.empty((0,4))
print ('matrix3',matrix3)
# To check the size of each axis of an array, you can retrieve the shape attribute like this:
print (matrix2.shape) 


matrix1 [[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]
matrix2 [[1 2 3]
 [2 3 4]]
matrix3 []
(2, 3)


Numpy arrays are mutable. Therefore, we can change values in the vector. For example:

In [8]:
vector2[0]=0 # change the first item in vector2 to 0
print (vector2)
matrix2[0][2]=0 # change the third item of the first vector to 0
print (matrix2)
matrix2[0]=vector2 # change the first vector in matrix2 to vector2
print (matrix2)


[0 2 3]
[[1 2 0]
 [2 3 4]]
[[0 2 3]
 [2 3 4]]


We can also select several items of a numpy array at once by passing a list of indexes.

In [9]:
#Let's select the values at index 1,2 of vector2
vector2[[1,2]]

array([2, 3])
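
The same kind of indexing works on a 2-D array, where a list of indexes selects whole rows. This is useful later, when we pick the feature vectors and labels that belong to a train or test set. The sketch below uses made-up toy arrays (`toy_matrix`, `toy_labels`) purely for illustration.

```python
import numpy as np

# 4 toy documents with 3 features each, and one label per document
toy_matrix = np.array([[1, 0, 1],
                       [0, 1, 1],
                       [1, 1, 0],
                       [0, 0, 1]])
toy_labels = np.array([1, 0, 1, 0])

# select the rows (documents) at index 0 and 2, together with their labels
row_ids = [0, 2]
selected_rows = toy_matrix[row_ids]
selected_labels = toy_labels[row_ids]
print(selected_rows)
print(selected_labels)
```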

## Extracting unigrams

Now let's extract unigram features from our data. First, let’s load one document and look at the raw tokens split by white space. We will use the `load_doc()` function developed in the previous section. We can use the `split()` function to split the loaded document into unigram tokens separated by white space.

In [10]:
# load the document
filename = 'data/neg/cv000_29416.txt'
text = load_doc(filename)
# split into tokens by white space
tokens = text.split()
print(tokens)

['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party', ',', 'drink', 'and', 'then', 'drive', '.', 'they', 'get', 'into', 'an', 'accident', '.', 'one', 'of', 'the', 'guys', 'dies', ',', 'but', 'his', 'girlfriend', 'continues', 'to', 'see', 'him', 'in', 'her', 'life', ',', 'and', 'has', 'nightmares', '.', "what's", 'the', 'deal', '?', 'watch', 'the', 'movie', 'and', '"', 'sorta', '"', 'find', 'out', '.', '.', '.', 'critique', ':', 'a', 'mind-fuck', 'movie', 'for', 'the', 'teen', 'generation', 'that', 'touches', 'on', 'a', 'very', 'cool', 'idea', ',', 'but', 'presents', 'it', 'in', 'a', 'very', 'bad', 'package', '.', 'which', 'is', 'what', 'makes', 'this', 'review', 'an', 'even', 'harder', 'one', 'to', 'write', ',', 'since', 'i', 'generally', 'applaud', 'films', 'which', 'attempt', 'to', 'break', 'the', 'mold', ',', 'mess', 'with', 'your', 'head', 'and', 'such', '(', 'lost', 'highway', '&', 'memento', ')', ',', 'but', 'there', 'are', 'good', 'and', 'bad', 'ways', 'of'

To keep track of the frequency and presence of each token in the document, we will use the `Counter` dictionary from the `collections` module. Let's write a function `tokens_to_dict()` that turns a list of unigram tokens into a counter dictionary.

In [11]:
from collections import Counter
def tokens_to_dict(tokens):
    """
    Parameters
    ----------
    tokens: a list of tokens in a document

    Return
    ------
    A counter dictionary that records the number of occurrence for each token in a document
    """
    token2count=Counter()
    for token in tokens:
        token2count[token]+=1
    return token2count

tokens_dict=tokens_to_dict(tokens)
print (tokens_dict)

Counter({',': 44, 'the': 38, '.': 34, 'it': 21, 'and': 20, 'to': 16, 'of': 16, 'a': 14, 'that': 13, 'are': 13, 'is': 12, 'but': 10, '"': 10, 'this': 10, 'there': 10, '(': 9, ')': 9, 'in': 8, 'i': 7, '-': 7, '?': 6, 'movie': 6, 'all': 6, 'they': 5, 'into': 5, 'with': 5, 'pretty': 5, 'film': 5, 'make': 5, 'teen': 4, 'her': 4, 'for': 4, 'on': 4, 'which': 4, 'just': 4, 'its': 4, "it's": 4, 'from': 4, 'most': 4, 'over': 4, 'we': 4, ':': 3, 'then': 3, 'get': 3, 'an': 3, 'one': 3, 'has': 3, 'out': 3, 'even': 3, 'so': 3, 'you': 3, 'who': 3, 'like': 3, 'not': 3, '!': 3, 'two': 2, 'go': 2, 'see': 2, "what's": 2, 'mind-fuck': 2, 'very': 2, 'cool': 2, 'idea': 2, 'bad': 2, 'what': 2, 'since': 2, 'films': 2, 'your': 2, 'lost': 2, 'highway': 2, 'memento': 2, 'good': 2, "didn't": 2, 'have': 2, 'problem': 2, 'simply': 2, 'world': 2, 'audience': 2, 'going': 2, 'coming': 2, 'dead': 2, 'others': 2, 'look': 2, 'scenes': 2, 'things': 2, 'now': 2, "don't": 2, 'same': 2, 'again': 2, 'up': 2, 'after': 2, 'bigg
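
For reference, `Counter` can also count a list directly when the list is passed to its constructor, which gives the same counts as the explicit loop in `tokens_to_dict()`. The short sketch below uses a made-up token list (`toy_tokens`) to show this, along with two handy properties of a `Counter`.

```python
from collections import Counter

toy_tokens = ['the', 'red', 'dog', 'and', 'the', 'red', 'apple']

# passing the list to the constructor counts every token in one step
toy_counts = Counter(toy_tokens)

print(toy_counts['red'])          # count of a token that occurs
print(toy_counts['banana'])       # a missing token counts as 0 instead of raising KeyError
print(toy_counts.most_common(2))  # the two most frequent tokens with their counts
```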

We can put all of the above preprocessing steps into a function `clean_doc_unigrams()`. This function will preprocess the data and extract unigram tokens from the document. We then test it on another review, this time a positive review.


In [12]:
# turn a doc into clean tokens
def clean_doc_unigrams(doc):
    """
    Parameters
    ----------
    doc: text from a document

    Return
    ------
    A counter dictionary of tokens
    """
    # split into tokens by white space
    tokens = doc.split()
    tokens_dict=tokens_to_dict(tokens)
    return tokens_dict
 
# load the document
filename = 'data/pos/cv000_29590.txt'
text = load_doc(filename)
tokens_dict = clean_doc_unigrams(text)
print(tokens_dict)

Counter({'the': 46, ',': 43, '.': 23, 'and': 20, '(': 18, ')': 18, 'in': 18, 'a': 15, 'to': 15, 'of': 14, 'from': 8, 'but': 7, 'is': 7, 'it': 6, '"': 6, 'comic': 5, "don't": 5, 'film': 5, 'about': 4, 'like': 4, 'who': 4, 'with': 4, 'say': 4, 'you': 4, 'this': 4, "it's": 4, 'i': 4, 'had': 3, 'or': 3, 'been': 3, 'book': 3, 'for': 3, 'moore': 3, 'campbell': 3, 'ripper': 3, 'be': 3, 'that': 3, "hell's": 3, 'me': 3, ':': 3, 'than': 3, '?': 3, 'he': 3, 'an': 3, 'so': 3, 'even': 3, 'at': 3, 'all': 3, 'have': 2, 'world': 2, 'never': 2, 'really': 2, 'hell': 2, 'whole': 2, 'called': 2, 'jack': 2, 'starting': 2, 'little': 2, 'if': 2, 'more': 2, 'other': 2, 'because': 2, 'get': 2, 'hughes': 2, 'direct': 2, 'seems': 2, 'as': 2, 'ghetto': 2, 'whitechapel': 2, 'end': 2, 'place': 2, 'has': 2, 'when': 2, 'first': 2, 'turns': 2, 'peter': 2, 'not': 2, 'enough': 2, 'abberline': 2, 'depp': 2, 'graham': 2, 'into': 2, 'here': 2, 'both': 2, 'identity': 2, 'good': 2, '-': 2, 'make': 2, 'see': 2, '2': 2, "wasn'

Finally, we can integrate `clean_doc_unigrams()` into the data processing pipeline for all the files in a directory. We do so with the function `process_docs_unigrams()`.


In [13]:
# load all docs in a directory
def process_docs_unigrams(directory):
    """
    Parameters
    ----------
    directory: a directory containing positive/negative samples from the Thumbs
    Up! dataset.

    Return
    ------
    A list of unigram token counter dictionaries, where each dictionary records the frequency of each token that occurs in a document. 
    """
    # walk through all files in the folder
    print (directory)
    tokens_all=[]
    for filename in listdir(directory):
        # skip files that do not have the right extension
        if not filename.endswith(".txt"):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load document
        text=load_doc(path)
        # clean documents
        tokens_dict = clean_doc_unigrams(text)
        tokens_all.append(tokens_dict)
    return tokens_all


# unigrams for all the negative files
unigrams_neg=process_docs_unigrams('./data/neg')
# unigrams for all the positive files
unigrams_posi=process_docs_unigrams('./data/pos')
# all unigrams
unigrams_all=unigrams_posi+unigrams_neg


./data/neg
./data/pos


## Turn text to feature vectors

Now that we have the count of each unigram present in each document, we can convert these unigram counts into feature vectors. But before that, we first need to collect all the features to establish the dimensions of the feature vector. Here, the features are the unigram words in the vocabulary. To do this, we define a function `collect_vocab()` to collect all the unique words in the vocabulary from `unigrams_all`, the list of token dictionaries from all the documents.

In [14]:
from collections import Counter
def collect_vocab(tokens_all):
    """
    Parameters
    ----------
    tokens_all: a list of token dictionaries where each token dictionary is extracted from a document. 

    Return
    ------
    a set of unique tokens in the vocabulary of tokens_all
    """
    
    vocab=set() #Here, we create a `set()` to store all the unique words.
    for token_lst in tokens_all: # iterate through token dictionary for each document
        for token in token_lst:
            # add a word in the vocab set. If a word already exists in vocab, it will not be added twice.
            vocab.add(token)
    return vocab


unigram_vocab=collect_vocab(unigrams_all)
# Let's check how many words we have in the vocab
print (len(unigram_vocab))

50920


Now let's create the mapping between the vocabulary and the feature vector dimension indexes from `unigram_vocab`. We can define a function `create_vocab_feature_mappings()` to do this:

In [15]:
def create_vocab_feature_mappings(vocab):
    """
    Parameters
    ----------
    vocab: a set of unique features in the vocabulary

    Return
    ------
    vocab2index: a mapping dictionary with feature as key and dimension index as value
    index2vocab: a mapping dictionary with dimension index as key and feature as value
    """
    vocab2index={}
    index2vocab={}
    # a set has no fixed iteration order, so we sort the vocabulary to get the same indexes on every run
    for i,w in enumerate(sorted(vocab)): # iterate through the words in the vocabulary
        vocab2index[w]=i
        index2vocab[i]=w
    return vocab2index,index2vocab

unigram2index,index2unigram=create_vocab_feature_mappings(unigram_vocab)
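
To see what such a pair of mappings looks like, here is a small sketch on a made-up four-word vocabulary. A Python `set` has no guaranteed iteration order, so this sketch sorts the words first to obtain the same indexes on every run; that sorting step, like the names `toy_vocab`, `toy_word2index` and `toy_index2word`, is a choice made for this illustration.

```python
toy_vocab = {'dog', 'apple', 'red', 'the'}

toy_word2index = {}
toy_index2word = {}
# sorted() turns the set into an alphabetically ordered list
for i, word in enumerate(sorted(toy_vocab)):
    toy_word2index[word] = i
    toy_index2word[i] = word

print(toy_word2index)
# the two dictionaries are inverses of each other
print(toy_index2word[toy_word2index['red']])
```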

Now, we can use the `unigram2index` mapping dictionary to turn each document's token dictionary into a unigram feature vector. Remember, each dimension can either store the frequency of the word, or use 1 or 0 to represent whether the word occurs or not.
Let's design two functions, `create_feature_presence()` and `create_feature_frequency()`, to turn the token dictionary of each document into these two types of features.

We can write `create_feature_presence()` as follows:

In [16]:
import numpy as np
def create_feature_presence(token_dict,vocab2index):
    """
    Parameters
    ----------
    token_dict: a token counter dictionary for a document
    vocab2index: a mapping dictionary with feature as key and feature index as value

    Return
    ------
    a feature vector for the document
    """
    # create a numpy array with dimension size of the vocabulary size
    vector=np.zeros(len(vocab2index))
    for w in token_dict:
        index=vocab2index[w]
        vector[index]=1
    return vector


### ❓ Quiz  

Can you try implementing the function `create_feature_frequency()` to extract unigram frequency? (You can use `create_feature_presence()` as a starting point.)
You can write your code below:

In [17]:
def create_feature_frequency(token_dict,vocab2index):
    """
    Parameters
    ----------
    token_dict: a token counter dictionary for a document
    vocab2index: a mapping dictionary with feature as key and feature index as value

    Return
    ------
    a feature vector for the document
    """
    # create a numpy array with dimension size of the vocabulary size
    vector=np.zeros(len(vocab2index))
    for w in token_dict:
        index=vocab2index[w]
        vector[index]=token_dict[w]
    return vector


<hr>    <!-- please remember this! -->
<details>
  <summary>Click <b>here</b> to see the answer.</summary>
    <p>Simply change the line vector[index]=1 from create_feature_presence() to vector[index]=token_dict[w]</p>

   


</details> 



Now let's join the dots to create the function `create_feature_presence_all()` to loop over the unigrams from all the documents in `unigrams_all` and convert the token dictionary in each document to feature vectors. 

In [18]:
def create_feature_presence_all(tokens_all,vocab2index):
    """
    Parameters
    ----------
    tokens_all: a list of token dictionaries where each token dictionary is extracted from a document
    vocab2index: a mapping dictionary with feature as key and feature index as value

    Return
    ------
    a presence feature matrix of shape (number of documents, vocabulary size)
    """
   
    # since we know the number of documents and the vocabulary size, we can allocate the result matrix (2-D array) up front with the shape (number of documents, vocabulary size)
    
    presence_features=np.empty((len(tokens_all),len(vocab2index))) # np.empty leaves the values uninitialised; every row is overwritten in the loop below
    for doc_i,token_dict in enumerate(tokens_all):
        # convert the token counter dictionary of each document into a unigram presence vector
        presence_feature=create_feature_presence(token_dict,vocab2index)
        # assign the feature vector of the current document to its row in the result matrix
        presence_features[doc_i]=presence_feature
    return presence_features

Now we are ready to extract features from unigrams collected from all the data:

In [19]:
features_unigram_presence=create_feature_presence_all(unigrams_all,unigram2index)
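
It is worth estimating how much memory such a dense matrix needs before building it. The back-of-the-envelope sketch below uses the numbers seen earlier in this notebook (1000 positive plus 1000 negative reviews, and the vocabulary size printed above) and assumes the default `float64` data type of `np.zeros()` and `np.empty()`, which takes 8 bytes per value.

```python
n_documents = 2000        # 1000 positive + 1000 negative reviews
vocab_size = 50920        # size of unigram_vocab printed earlier
bytes_per_value = 8       # np.zeros and np.empty default to float64

# one value per (document, vocabulary word) pair
n_bytes = n_documents * vocab_size * bytes_per_value
print(f'{n_bytes / 1e6:.0f} MB')
```

If memory is tight, a smaller data type or a sparse representation would be possible alternatives, but we keep the dense matrix here for simplicity.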


### ❓ Quiz  

Can you follow the code above to extract the unigram frequency features of all the documents (using the `create_feature_frequency()` function you defined in the previous quiz)? You can name your function `create_feature_frequency_all()`. Please store the features in `features_unigram_frequency`. 

You can write your code below:

In [20]:
def create_feature_frequency_all(tokens_all,vocab2index):
    """
    Parameters
    ----------
    tokens_all: a list of token dictionaries where each token dictionary is extracted from a document
    vocab2index: a mapping dictionary with feature as key and feature index as value

    Return
    ------
    a frequency feature matrix of shape (number of documents, vocabulary size)
    """
   
    # since we know the number of documents and the vocabulary size, we can allocate the result matrix (2-D array) up front with the shape (number of documents, vocabulary size)
    
    frequency_features=np.empty((len(tokens_all),len(vocab2index))) # every row is overwritten in the loop below
    for doc_i,token_dict in enumerate(tokens_all):
        # convert the token counter dictionary of each document into a unigram frequency vector
        frequency_feature=create_feature_frequency(token_dict,vocab2index)
        # assign the feature vector of the current document to its row in the result matrix
        frequency_features[doc_i]=frequency_feature
    return frequency_features

features_unigram_frequency=create_feature_frequency_all(unigrams_all,unigram2index)

### ❓ Quiz  

Let's check the first review's unigram presence feature. Can you use the mapping in `index2unigram` to reveal which unigrams are present in this review?

Please answer: which words in the following list are present?

A. gayness

B. fabulous

C. snappiness

D. happiness

You can write your code below:

In [21]:
a=[index2unigram[i] for i,item in enumerate(features_unigram_presence[0]) if item==1]
'happiness' in a

False


<hr>    <!-- please remember this! -->
<details>
  <summary>Click <b>here</b> to see the answer.</summary>
    <p> Code:</p>
    <p>wordlist=[index2unigram[i] for i,item in enumerate(features_unigram_presence[0]) if item==1]</p>
    <p> A,C are present in the review </p>

   


</details> 



## Encode the labels

At the same time, we need to represent the labels (positive or negative) numerically for the model to train on. Let's create a numpy array of labels made of 1s and 0s, where 1 corresponds to positive reviews and 0 corresponds to negative reviews.

Remember that we concatenated the positive and then the negative data when producing `unigrams_all` and `features_unigram_presence`. We follow the same order to create the numpy array of labels, which tells us the sentiment for each feature vector in `features_unigram_presence`.

In [22]:
 
labels=len(unigrams_posi)*[1]+len(unigrams_neg)*[0]
labels=np.array(labels)

## Save the features and labels

We can also save the prepared features for the reviews ready for modeling.

This is a good practice as it decouples the data preparation from modeling, allowing you to focus on modeling and circle back to data preparation if you have new ideas.

To store a numpy array, we can use `numpy`'s `save()` function. `save()` takes two arguments: the first is the filename to write to, and the second is the numpy array.

In [23]:

np.save('features_unigram_presence.npy',features_unigram_presence)
np.save('labels_unigram_presence.npy',labels)

We can load the saved features back from the file with:
    

In [24]:
np.load('features_unigram_presence.npy')


array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
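
To convince ourselves that nothing is lost when saving and loading, we can do a quick round trip on a tiny array. This is a minimal sketch that writes to a temporary folder so that it does not touch the feature files above; the array `toy_features` and the file name are made up for illustration.

```python
import os
import tempfile
import numpy as np

toy_features = np.array([[1., 0., 1.],
                         [0., 1., 1.]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_features.npy')
    np.save(path, toy_features)   # write the array to disk
    restored = np.load(path)      # read it back into memory

# the restored array has the same values, shape and data type
print(np.array_equal(toy_features, restored))
print(restored.shape, restored.dtype)
```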

# Evaluation

## Prepare train/test split

Now it's time to prepare this dataset for model evaluation. We want to train a model that generalises to unseen data. Therefore, we split the dataset into a train set, on which the model is trained, and a test set of unseen examples, on which it is tested. A common choice for a single train-test split is 70% and 30%.

We will follow the paper and adopt 3-fold cross-validation, using `scikit-learn`'s `KFold`. In addition, to ensure that each fold has an equal number of positive and negative examples, we split the positive and the negative data separately.
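
To make the idea of splitting each class separately concrete, here is a sketch on six toy labels. It uses a hand-rolled helper, `k_fold_indices()`, as a simplified stand-in for what `KFold` does without shuffling; the helper and all `toy_` names are hypothetical and only serve this illustration.

```python
def k_fold_indices(n_items, n_folds):
    """Yield (train, test) position lists; assumes n_items is divisible by n_folds."""
    positions = list(range(n_items))
    fold_size = n_items // n_folds
    for k in range(n_folds):
        # the k-th block is the test set, everything else is the train set
        test = positions[k * fold_size:(k + 1) * fold_size]
        train = positions[:k * fold_size] + positions[(k + 1) * fold_size:]
        yield train, test

toy_labels = [1, 1, 1, 0, 0, 0]
toy_posi_id = [i for i, label in enumerate(toy_labels) if label == 1]
toy_neg_id = [i for i, label in enumerate(toy_labels) if label == 0]

toy_folds = []
# split the positive and the negative indexes separately, then combine them
for (train_p, test_p), (train_n, test_n) in zip(
        k_fold_indices(len(toy_posi_id), 3), k_fold_indices(len(toy_neg_id), 3)):
    train = [toy_posi_id[i] for i in train_p] + [toy_neg_id[i] for i in train_n]
    test = [toy_posi_id[i] for i in test_p] + [toy_neg_id[i] for i in test_n]
    toy_folds.append((train, test))
    print('train', train, 'test', test)
```

Every test set contains one positive and one negative example, and each example appears in exactly one test set across the three folds.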

Let's first load the unigram presence features and their labels. 

In [25]:
data=np.load('features_unigram_presence.npy')
labels=np.load('labels_unigram_presence.npy')


Produce the indices for the positive and the negative data:

In [26]:
posi_id=[i for i,label in enumerate(labels) if label==1]
neg_id=[i for i,label in enumerate(labels) if label==0]


Initialize `KFold` from `scikit-learn` and generate cross validation datasets

In [27]:
# import KFold and random
from sklearn.model_selection import KFold
import random


# set the number of folds
num_folds=3

# set random seed for cross validation
cross_val_seed = 42

# create the KFold object and create the splits for positive data
kf_posi = KFold(n_splits=num_folds, random_state=cross_val_seed, shuffle=True)
posi_folds=list(kf_posi.split(posi_id))

# create the KFold object and create the splits for negative data
kf_neg = KFold(n_splits=num_folds, random_state=cross_val_seed, shuffle=True)
neg_folds=list(kf_neg.split(neg_id))

# loop over the folds to create train and test data/label. 

for fold_idx, (train_index_posi, test_index_posi) in enumerate(posi_folds):

    print(f'Running fold {fold_idx+1}...')
    
    # Get the current fold for positive data
    fold_train_posi = np.array(posi_id)[train_index_posi]
    fold_test_posi = np.array(posi_id)[test_index_posi]
    
    # Get the current fold for negative data
    train_index_neg,test_index_neg=neg_folds[fold_idx]
    fold_train_neg=np.array(neg_id)[train_index_neg]
    fold_test_neg=np.array(neg_id)[test_index_neg]
    
    # ensure that we have balanced classes in both train and test
    assert len(fold_train_posi)==len(fold_train_neg)
    assert len(fold_test_posi)==len(fold_test_neg)
    # combine all train and test for the current fold
    fold_train=np.concatenate([fold_train_posi,fold_train_neg])
    fold_test=np.concatenate([fold_test_posi,fold_test_neg])
    print (fold_train)
    
    # shuffle the train and test indexes in the current fold
    random.shuffle(fold_train)
    random.shuffle(fold_test)
    
    # use the indexes in fold_train and fold_test to slice data and labels
    fold_train_data=data[fold_train]
    fold_test_data=data[fold_test]
    fold_train_label=labels[fold_train]
    fold_test_label=labels[fold_test]
    

Running fold 1...
[   0    1    4 ... 1996 1997 1999]
Running fold 2...
[   1    2    3 ... 1995 1996 1998]
Running fold 3...
[   0    2    3 ... 1997 1998 1999]


Again, we can wrap the above into a function `create_kfold_validation()`.

In [62]:
def create_kfold_validation(data,labels,num_folds):
    """
    Parameters
    ----------
    data: a numpy array of features
    labels: a numpy array of labels (1=positive, 0=negative)
    num_folds: the number of folds in cross validation

    Yields (works as a generator)
    ------
    fold_train_data: current fold of training data as numpy arrays
    fold_test_data: current fold of test data as numpy arrays
    fold_train_label: current fold of training labels as numpy arrays
    fold_test_label: current fold of test labels as numpy arrays
    """
    #produce the indices for the positive and the negative data
    posi_id=[i for i,label in enumerate(labels) if label==1]
    neg_id=[i for i,label in enumerate(labels) if label==0]

    # set random seed for cross validation
    cross_val_seed = 42

    # create the KFold object and create the splits for positive data
    kf_posi = KFold(n_splits=num_folds, random_state=cross_val_seed, shuffle=True)
    posi_folds=list(kf_posi.split(posi_id))

    # create the KFold object and create the splits for negative data
    kf_neg = KFold(n_splits=num_folds, random_state=cross_val_seed, shuffle=True)
    neg_folds=list(kf_neg.split(neg_id))

    # loop over the folds to create train and test data/label. 

    for fold_idx, (train_index_posi, test_index_posi) in enumerate(posi_folds):

        print(f'Running fold {fold_idx+1}...')

        # Get the current fold for positive data
        fold_train_posi = np.array(posi_id)[train_index_posi]
        fold_test_posi = np.array(posi_id)[test_index_posi]

        # Get the current fold for negative data
        train_index_neg,test_index_neg=neg_folds[fold_idx]
        fold_train_neg=np.array(neg_id)[train_index_neg]
        fold_test_neg=np.array(neg_id)[test_index_neg]

        # ensure that we have balanced classes in both train and test
        assert len(fold_train_posi)==len(fold_train_neg)
        assert len(fold_test_posi)==len(fold_test_neg)
        # combine all train and test for the current fold
        print ('combining positive and negative for train and test')
        fold_train=np.concatenate([fold_train_posi,fold_train_neg])
        fold_test=np.concatenate([fold_test_posi,fold_test_neg])
        
        
        # use the indexes in fold_train and fold_test to slice data and labels
        print ('slice training and test data')
        fold_train_data=data[fold_train]
        fold_test_data=data[fold_test]
        fold_train_label=labels[fold_train]
        fold_test_label=labels[fold_test]
        print ('yield')
        
        assert len(fold_train_data)==len(fold_train_label)
        assert len(fold_test_data)==len(fold_test_label)
        yield fold_train_data,fold_test_data,fold_train_label,fold_test_label

    

When we call `create_kfold_validation()`, we are initializing a generator. We can loop over the generator to produce `fold_train_data,fold_test_data,fold_train_label,fold_test_label` for every fold:

In [63]:
for fold_train_data,fold_test_data,fold_train_label,fold_test_label in create_kfold_validation(data,labels,num_folds=3):
    print ('length of train data',len(fold_train_data))
    print ('length of test data',len(fold_test_data))

    

Running fold 1...
combining positive and negative for train and test
slice training and test data
yield
length of train data 1332
length of test data 668
Running fold 2...
combining positive and negative for train and test
slice training and test data
yield
length of train data 1334
length of test data 666
Running fold 3...
combining positive and negative for train and test
slice training and test data
yield
length of train data 1334
length of test data 666
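
If generators are new to you, the toy sketch below shows the key behaviour: calling a generator function runs none of its body, and the work only happens once we iterate. The function `count_up_to()` and the `log` list are hypothetical names used purely for illustration.

```python
log = []

def count_up_to(limit):
    """Toy generator: yields 1..limit and records when its body starts."""
    log.append('started')
    for n in range(1, limit + 1):
        yield n

gen = count_up_to(3)          # creates the generator; the body has not run
started_before = list(log)    # still empty at this point
values = list(gen)            # iterating runs the body and collects the values

print(started_before)
print(values)
print(log)
```

This is why `create_kfold_validation()` only slices the data for a fold when the loop asks for it, instead of building all folds up front.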


## Naive Bayes Model and evaluation

Let's try fitting a Naive Bayes model to the features and labels from each validation fold, using the `MultinomialNB` class in `sklearn`. `MultinomialNB` is a Naive Bayes classifier for multinomial models. It applies additive (Laplace) smoothing by default. 

In [30]:
!pip install scikit-learn
from sklearn.naive_bayes import MultinomialNB

for fold_train_data,fold_test_data,fold_train_label,fold_test_label in create_kfold_validation(data,labels,num_folds=3):
    # initialize a multinomial naive bayes model
    model = MultinomialNB()
    # fit the model with features and labels for the training data
    model.fit(fold_train_data,fold_train_label)
    #evaluate the fitted model on the test set
    predicted=model.predict(fold_test_data)

Running fold 1...
[   0    1    4 ... 1996 1997 1999]
Running fold 2...
[   1    2    3 ... 1995 1996 1998]
Running fold 3...
[   0    2    3 ... 1997 1998 1999]
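
To see what additive smoothing does, here is a rough numpy sketch of a multinomial Naive Bayes classifier on toy data. It illustrates the idea only and is not scikit-learn's implementation; the function names, the four toy features and the `alpha=1.0` setting are assumptions made for the example.

```python
import numpy as np

def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    classes = np.unique(y)
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    # Add alpha to every feature count so unseen features never get
    # zero probability.
    counts = np.array([X[y == c].sum(axis=0) for c in classes]) + alpha
    log_likelihood = np.log(counts / counts.sum(axis=1, keepdims=True))
    return classes, log_prior, log_likelihood

def predict_multinomial_nb(X, classes, log_prior, log_likelihood):
    """Pick the class with the highest log posterior score."""
    scores = X @ log_likelihood.T + log_prior
    return classes[np.argmax(scores, axis=1)]

# Toy presence features, columns: good, great, bad, boring (hypothetical).
X_train = np.array([[1., 1., 0., 0.],
                    [1., 0., 0., 0.],
                    [0., 0., 1., 1.],
                    [0., 0., 1., 0.]])
y_train = np.array([1, 1, 0, 0])
X_test = np.array([[1., 0., 0., 0.],   # contains only "good"
                   [0., 0., 0., 1.]])  # contains only "boring"

params = fit_multinomial_nb(X_train, y_train)
toy_predictions = predict_multinomial_nb(X_test, *params)
print(toy_predictions)
```

Without the `+ alpha` term, "good" would have zero probability under the negative class and its log would be undefined, which is the problem smoothing avoids.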


In each fold, `predicted` is an array of predictions in 1s and 0s. Let's compare it with the gold labels and calculate accuracy as follows:

accuracy = number of correct examples / number of total examples
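
As a quick check of this formula, the sketch below computes accuracy on a small made-up pair of arrays. The values of `predicted` and `gold` here are hypothetical; comparing two numpy arrays element-wise gives an array of booleans, and its mean is the fraction of matches.

```python
import numpy as np

# Hypothetical predictions and gold labels for five examples.
predicted = np.array([1, 0, 1, 1, 0])
gold = np.array([1, 0, 0, 1, 1])

# Three of the five predictions match the gold labels.
accuracy = np.mean(predicted == gold)
print(accuracy)
```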

Let's create a function that calls the naive bayes model and performs evaluation 

In [31]:
def train_evaluate(train_data,train_label,test_data,test_label):
    """
    Parameters
    ----------
    train_data: numpy array of train data (number of examples * number of features)
    train_label: numpy array of train labels in the form of [1,0,1...] where 1 = positive, 0 = negative
    test_data: numpy array of test data (number of examples * number of features)
    test_label: numpy array of test labels in the form of [1,0,1...] where 1 = positive, 0 = negative

    Return
    ------
    accuracy score on the test data
    """
    # initialize a multinomial naive bayes model
    model = MultinomialNB()
    # fit the model with features and labels for the training data
    model.fit(train_data,train_label)
    #evaluate the fitted model on the test set
    predicted=model.predict(test_data)
    #evaluation:
    correct=0
    for i in range(len(predicted)):
        if predicted[i]==test_label[i]: #if the predicted result is the same with the gold label
            correct+=1
    acc=correct/len(predicted)


    print ('accuracy',acc)
    return acc

Now, integrating `train_evaluate()` into cross validation, we get an accuracy score for each fold. We then take the mean and standard deviation of the accuracy scores across folds. 

In [32]:
acc_list=[]
for fold_train_data,fold_test_data,fold_train_label,fold_test_label in create_kfold_validation(data,labels,num_folds=3):
    acc=train_evaluate(fold_train_data,fold_train_label,fold_test_data,fold_test_label)
    acc_list.append(acc)
    
# mean and standard deviation
print(f'Mean accuracy      : { np.mean(acc_list):.5f}')
print(f'Standard deviation : { np.std(acc_list):.5f}')

Running fold 1...
[   0    1    4 ... 1996 1997 1999]
accuracy 0.8233532934131736
Running fold 2...
[   1    2    3 ... 1995 1996 1998]
accuracy 0.8108108108108109
Running fold 3...
[   0    2    3 ... 1997 1998 1999]
accuracy 0.8258258258258259
Mean accuracy      : 0.82000
Standard deviation : 0.00657
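
One detail worth knowing when reporting these numbers: `np.std()` computes the population standard deviation by default (`ddof=0`). With only three folds, the sample standard deviation (`ddof=1`) is noticeably larger. The sketch below reuses the three fold accuracies printed above; which variant to report is a judgment call, so it is best to state which one you used.

```python
import numpy as np

# Fold accuracies copied from the output above.
acc_list = [0.8233532934131736, 0.8108108108108109, 0.8258258258258259]

mean_acc = np.mean(acc_list)
population_std = np.std(acc_list)       # ddof=0, as used in the notebook
sample_std = np.std(acc_list, ddof=1)   # divides by (n - 1) instead of n

print(f'Mean accuracy        : {mean_acc:.5f}')
print(f'Population std (n)   : {population_std:.5f}')
print(f'Sample std     (n-1) : {sample_std:.5f}')
```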


# Recap: The whole pipeline to train on unigram presence features

In [46]:
# 1. Extract unigrams from all documents
from os import listdir 
# unigrams for all the negative files
unigrams_neg=process_docs_unigrams('./data/neg')
# unigrams for all the positive files
unigrams_posi=process_docs_unigrams('./data/pos')
# all unigrams
unigrams_all=unigrams_posi+unigrams_neg

# 2. Turn text to feature vectors
# collect all unique unigram as all features
unigram_vocab=collect_vocab(unigrams_all)
# map unigram features to dimension indexes
unigram2index,index2unigram=create_vocab_feature_mappings(unigram_vocab)
# Convert unigrams to feature vectors for all documents
features_unigram_presence=create_feature_presence_all(unigrams_all,unigram2index)

# 3. Encode labels
labels=len(unigrams_posi)*[1]+len(unigrams_neg)*[0]
labels=np.array(labels)

# 4. Evaluation with 3-fold cross validation
data=features_unigram_presence
acc_list=[]
for fold_train_data,fold_test_data,fold_train_label,fold_test_label in create_kfold_validation(data,labels,num_folds=3):
    acc=train_evaluate(fold_train_data,fold_train_label,fold_test_data,fold_test_label)
    acc_list.append(acc)
    
# mean and standard deviation
print(f'Mean accuracy      : { np.mean(acc_list):.5f}')
print(f'Standard deviation : { np.std(acc_list):.5f}')


./data/neg
./data/pos
Running fold 1...
[   0    1    4 ... 1996 1997 1999]
accuracy 0.8233532934131736
Running fold 2...
[   1    2    3 ... 1995 1996 1998]
accuracy 0.8108108108108109
Running fold 3...
[   0    2    3 ... 1997 1998 1999]
accuracy 0.8258258258258259
Mean accuracy      : 0.82000
Standard deviation : 0.00657


# Other features

## Bigrams

Building on `clean_doc_unigrams()`, which turns a document into a unigram dictionary, we can create a function `clean_doc_bigrams()` to extract a bigram dictionary. (You can refresh your memory of how to create a bigram counter dictionary in module 1.4.)

In [34]:
# turn a doc into a counter of bigrams
def clean_doc_bigrams(doc):
    """
    Parameters
    ----------
    doc: string of the document text

    Return
    ------
    a counter dictionary with bigram as key and count as value
    """
    # split into tokens by white space
    tokens = doc.split()
    # remove tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    
    # extract bigrams
    bigram_dict=Counter() #initialize a bigram dictionary to be updated
    tokens=['<start>']+tokens+['<end>'] # add <start> and <end> token
    for i in range(len(tokens)): #loop over all the indices of the token list
        if i<len(tokens)-1: #if it's not the end of the token list
            bigram_current=(tokens[i],tokens[i+1])
            bigram_dict[bigram_current]+=1
    return bigram_dict
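
For comparison, the same bigram counting can be sketched more compactly by pairing each token with its successor using `zip()`. The helper name `bigrams_from_tokens()` and the toy sentence are assumptions for illustration; the result has the same structure as the output of `clean_doc_bigrams()`.

```python
from collections import Counter

def bigrams_from_tokens(tokens):
    """Count adjacent token pairs, padding with <start> and <end>."""
    padded = ['<start>'] + tokens + ['<end>']
    # zip(padded, padded[1:]) pairs every token with the one after it.
    return Counter(zip(padded, padded[1:]))

bigram_counts = bigrams_from_tokens('very good very good'.split())
for bigram, count in bigram_counts.items():
    print(bigram, count)
```

Here the bigram `('very', 'good')` occurs twice, while every other bigram occurs once.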

Now we can replace the `clean_doc_unigrams()` call with `clean_doc_bigrams()` in `process_docs_unigrams()`. We rename the resulting function `process_docs_bigrams()`; it counts the bigrams in each document. 

In [37]:
def process_docs_bigrams(directory):
    """
    Parameters
    ----------
    directory: a directory containing positive/negative samples from the Thumbs
    Up! dataset.

    Return
    ------
    a list of bigram counters each representing a document
    """
    # walk through all files in the folder
    tokens_all=[]
    print (directory)
    for filename in listdir(directory):
        # skip files that do not have the right extension
        if not filename.endswith(".txt"):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load document
        text=load_doc(path)
        # clean documents
        tokens_dict_current = clean_doc_bigrams(text)
        tokens_all.append(tokens_dict_current)
    return tokens_all
bigrams_neg=process_docs_bigrams('./data/neg')
bigrams_posi=process_docs_bigrams('./data/pos')
bigrams_all=bigrams_posi+bigrams_neg

./data/neg
./data/pos


We can follow the same procedure to turn the per-document bigram counters into bigram presence feature vectors:

In [44]:
bigram_vocab=collect_vocab(bigrams_all)
bigram2index,index2bigram=create_vocab_feature_mappings(bigram_vocab)
features_bigram_presence=create_feature_presence_all(bigrams_all,bigram2index)
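
As a reminder of what these three steps do, here is a rough, self-contained sketch on two toy documents. The logic below is an assumption about the general approach (collect a vocabulary, map each bigram to a column, mark presence with 1) and may differ in detail from the notebook's own `collect_vocab()`, `create_vocab_feature_mappings()` and `create_feature_presence_all()`.

```python
from collections import Counter

import numpy as np

# Two toy documents represented as bigram counters (hypothetical data).
toy_docs = [
    Counter({('<start>', 'good'): 1, ('good', 'movie'): 2}),
    Counter({('<start>', 'bad'): 1, ('bad', 'movie'): 1}),
]

# 1. Collect every unique bigram, sorted for a stable column order.
toy_vocab = sorted(set(bigram for doc in toy_docs for bigram in doc))
# 2. Map each bigram to a column index.
toy_bigram2index = {bigram: i for i, bigram in enumerate(toy_vocab)}
# 3. Mark presence with 1, regardless of how often the bigram occurs.
toy_features = np.zeros((len(toy_docs), len(toy_vocab)))
for row, doc in enumerate(toy_docs):
    for bigram in doc:
        toy_features[row, toy_bigram2index[bigram]] = 1.0

print(toy_vocab)
print(toy_features)
```

Note that `('good', 'movie')` occurs twice in the first document but still contributes a single 1, which is what distinguishes presence features from frequency features.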


The labels are the same as for `features_unigram_presence`. We can then directly train a Naive Bayes model on these bigram presence features:

In [47]:
data=features_bigram_presence
acc_list=[]
for fold_train_data,fold_test_data,fold_train_label,fold_test_label in create_kfold_validation(data,labels,num_folds=3):
    acc=train_evaluate(fold_train_data,fold_train_label,fold_test_data,fold_test_label)
    acc_list.append(acc)
    
# mean and standard deviation
print(f'Mean accuracy      : { np.mean(acc_list):.5f}')
print(f'Standard deviation : { np.std(acc_list):.5f}')

Running fold 1...
[   0    1    4 ... 1996 1997 1999]
accuracy 0.844311377245509
Running fold 2...
[   1    2    3 ... 1995 1996 1998]
accuracy 0.8423423423423423
Running fold 3...
[   0    2    3 ... 1997 1998 1999]
accuracy 0.8708708708708709
Mean accuracy      : 0.85251
Standard deviation : 0.01301


### ❓ Quiz  

Now we have features in the form of bigrams. What does each dimension of a vector represent now? How many dimensions does each vector have? You can inspect `features_bigram_presence` to answer these questions. 



<hr>    <!-- please remember this! -->
<details>
  <summary>Click <b>here</b> to see the answer.</summary>
    <p>Each dimension corresponds to a bigram. There are 463119 bigrams and therefore the dimension size of each vector is 463119. </p>
  

   


</details> 
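
Bigram vocabularies are much larger than unigram ones, so it is worth estimating how much memory a dense feature array needs before building it. The sketch below is a back-of-the-envelope calculation; the document count of 2000 and the round figure of 500,000 features are illustrative assumptions, and `dense_size_gb()` is a hypothetical helper.

```python
import numpy as np

def dense_size_gb(n_docs, n_features, dtype):
    """Size in gigabytes of a dense (n_docs x n_features) array."""
    return n_docs * n_features * np.dtype(dtype).itemsize / 1e9

n_docs = 2000          # assumed number of reviews
n_features = 500_000   # assumed, illustrative vocabulary size

float_gb = dense_size_gb(n_docs, n_features, np.float64)
uint8_gb = dense_size_gb(n_docs, n_features, np.uint8)
print(f'float64: {float_gb:.1f} GB, uint8: {uint8_gb:.1f} GB')
```

Because presence features are only ever 0 or 1, a smaller dtype or a sparse matrix representation can reduce memory use considerably if the dense array becomes a bottleneck.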



Other features to try afterwards:

#### Unigrams + POS

#### Adjective Unigrams (Quiz)

#### Unigrams above certain frequency threshold

#### Unigrams + Position

#### Unigrams + Bigrams (Quiz)

Negation (extension)

### ❓Final Quiz  

1. So far we have trained and evaluated models on the unigram and bigram presence features. Can you try to build another model on bigram frequency features with three-fold cross validation? (Hint: you can simply change the arguments to `produce_data_label_splits()`.)
You can write your code below and report the accuracy

2. Adding Negation!


<hr>    <!-- please remember this! -->
<details>
  <summary>Click <b>here</b> to see the answer.</summary>
    <p>train_data,test_data,train_label,test_label=produce_data_label_splits(bigram_presence_positive,bigram_presence_negative)</p>
 <p>acc=evaluate(train_data,train_label,test_data,test_label)</p>
    <p>One run of the model gives 0.84. Remember that your result can be a bit different due to the random split.  </p>
  

   


</details> 

