Welcome to module 2.1. In this module, we will start building and testing a sentiment classifier with Naive Bayes!


Let's first refresh your memory on the Naive Bayes model. 

## Pre-module quiz

Say that we have two events: Fire and Smoke. $P(Fire)$ is the probability of a fire (or in other words, how often a fire occurs), $P(Smoke)$ is the probability of seeing smoke (how often we see smoke). We want to know $P(Fire|Smoke)$, that is, how often fire occurs when we see smoke. Suppose we know the following:

$P(Fire)=0.01$

$P(Smoke)=0.1$

$P(Smoke|Fire)=0.9$ (i.e. 90\% of fires produce smoke)


Can you work out $P(Fire|Smoke)$?

A. 0.1

B. 0.09

C. 0.01

D. 0.9




<hr>    <!-- please remember this! -->
<details>
  <summary>Click <b>here</b> to see the answer.</summary>
  <p>\begin{equation*}P(Fire|Smoke) =  \frac{P(Fire)*P(Smoke|Fire)}{P(Smoke)} \end{equation*} </p>
<p>\begin{equation*}=\frac{0.01 * 0.9}{ 0.1} \end{equation*}</p>
<p>\begin{equation*}=0.09\end{equation*}</p>

</details> 
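Bayes' rule is short enough to write as a small function, which can be handy for checking this kind of calculation. The sketch below is only an illustration: the name `bayes_posterior` and the rain/cloud probabilities are made up for this example and are not part of the quiz.

```python
def bayes_posterior(prior, likelihood, evidence):
    """Return P(A|B) from P(A), P(B|A) and P(B)."""
    if evidence <= 0:
        raise ValueError("P(B) must be positive")
    return prior * likelihood / evidence

# Hypothetical numbers: P(Rain)=0.2, P(Cloud|Rain)=0.9, P(Cloud)=0.4
p_rain_given_cloud = bayes_posterior(prior=0.2, likelihood=0.9, evidence=0.4)
print(p_rain_given_cloud)
```

You can plug the Fire and Smoke probabilities into the same function to check your own answer.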



<h1 align='center' style='margin-top:2em'>
    <img src="../../resources/section_header.png" 
         style="height:36pt; display:inline; vertical-align:center; margin-top:0em" />
    <u>Sentiment Analysis</u>
    <img src="../../resources/section_header.png" 
         style="height:36pt; display:inline; vertical-align:center; margin-top:0em" />
<hr>
</h1>

Today, we are going to focus on a popular NLP classification task: sentiment analysis. What exactly is sentiment? Sentiment relates to the meaning of a word or sequence of words and is usually associated with an opinion or emotion. And analysis? Well, this is the process of looking at data and making inferences; in this case, using machine learning to learn and predict whether a movie review is positive or negative.

In this section, we will replicate the experiments from the paper "Thumbs up? Sentiment Classification using Machine Learning Techniques" (https://www.aclweb.org/anthology/W02-1011.pdf). We will extract a number of features, including unigrams, bigrams and POS tags, and train Naive Bayes models on these features. 

## Preparing data

The data for this tutorial is stored in the `./data` folder. The two subdirectories `./data/pos` and `./data/neg` contain samples of positive and negative IMDb movie reviews. Each line of a review text file is a tokenized sentence. 

We can load an individual text file by opening it, reading in the ASCII text, and closing the file. For example, we can load the first negative review file “cv000_29416.txt” as follows:


In [None]:

# load one file
filename = 'data/neg/cv000_29416.txt'
# open the file as read only
file = open(filename, 'r')
# read all text
text = file.read()
print (text)
# close the file
file.close()


This loads the document as ASCII and preserves any white space, like new lines.

We can turn this into a function called `load_doc()` that takes the filename of the document to load and returns its text.


In [None]:

 
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text
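A common alternative to calling `close()` by hand is a `with` block, which closes the file even if reading fails. The sketch below is one possible variant, not a replacement for the function above: the name `load_doc_safely` and the explicit `utf-8` encoding are assumptions (ASCII text is also valid UTF-8), and the demo uses a temporary file so that it runs without the `./data` folder.

```python
import os
import tempfile

def load_doc_safely(filename, encoding="utf-8"):
    # the with-block closes the file automatically, even if read() raises
    with open(filename, "r", encoding=encoding) as file:
        return file.read()

# demo on a throwaway file
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("a great film .\nloved it .\n")
    demo_text = load_doc_safely(demo_path)

print(demo_text)
```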


We can process each directory in turn by first getting a list of files in the directory using the `listdir()` function from the `os` module, then loading each file in turn.

For example, we can load each document in the negative directory using the `load_doc()` function to do the actual loading. Below, we define a `process_docs()` function to load all documents in a folder. 

Let's first read in the positive and negative files and store them as two lists of texts. To navigate the files, we can use Python's `os` module. 

In [None]:
from os import listdir 
# load all docs in a directory
def process_docs(directory):
    # walk through all files in the folder
    docs=[] # a list of review texts
    for filename in listdir(directory):
        # skip files that do not have the right extension
        if not filename.endswith(".txt"):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load document
        doc = load_doc(path)
        print('Loaded %s' % filename)
        docs.append(doc)
    return docs
 
# specify directory to load
directory = 'data/neg'
docs=process_docs(directory)
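Note that `listdir()` returns filenames in arbitrary order, so "the first review" may differ between machines. If a reproducible order matters, one option is to sort the names and build paths with `os.path.join()`. The helper name `list_review_paths` below is hypothetical, and the demo runs on a temporary folder rather than `./data`.

```python
import os
import tempfile

def list_review_paths(directory):
    # sorted() gives a reproducible order; os.listdir() alone does not
    return [os.path.join(directory, name)
            for name in sorted(os.listdir(directory))
            if name.endswith(".txt")]

with tempfile.TemporaryDirectory() as tmp_dir:
    # create two text files and one file with a different extension
    for name in ["b.txt", "a.txt", "notes.md"]:
        with open(os.path.join(tmp_dir, name), "w") as f:
            f.write("demo")
    demo_names = [os.path.basename(p) for p in list_review_paths(tmp_dir)]

print(demo_names)  # ['a.txt', 'b.txt']
```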

<p style="font-size:1.5em; font-weight: bold">
<img src="../../resources/exercise.png" style="height:36pt; display:inline; vertical-align:bottom; margin-right: 4pt" /> 
Try it yourself! <hr>
</p>

### Practice quiz 1:

Use the predefined `process_docs()` function to read in the negative and positive reviews. How many reviews are there for each class? 

You can write your code below:


<hr>    <!-- please remember this! -->
<details>
  <summary>Click <b>here</b> to see the answer.</summary>
    <p>1000 positive and 1000 negative reviews</p>


</details> 



## Feature Extraction


In this section, we will look at cleaning and extracting features from the movie review data. In machine learning, a feature is a measurable piece of data that can be used for analysis. For sentiment analysis, we can extract unigrams (bag of words) as features to predict the sentiment polarity. Since models can only process numbers rather than words, we also need to convert words into numerical values; we will cover the details later in this module. 

We will start by splitting the text to extract unigrams. 


### Pre-processing

First, let's load one document and look at the raw tokens split by white space. We will use the `load_doc()` function developed in the previous section and the string method `split()` to split the loaded document into tokens separated by white space.

In [None]:
# load the document
filename = 'data/neg/cv000_29416.txt'
text = load_doc(filename)
# split into tokens by white space
tokens = text.split()
print(tokens)

Just looking at the raw tokens can give us a lot of ideas of things to try, such as:

- Removing tokens that are just punctuation (e.g. '-').

- Removing tokens that contain numbers (e.g. '10/10').

- Removing tokens that don't carry much meaning (e.g. 'and').

Here is how we can implement these ideas:

- We can remove tokens that are just punctuation or contain numbers by calling the string method `isalpha()` on each token.
- We can remove English stop words using the stop word list provided by `NLTK`.
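To see what these two filters do, here is a small sketch on a handful of made-up tokens. The tiny `mini_stop_words` set is a stand-in chosen for this example so that the snippet runs without NLTK; it is not the NLTK list.

```python
sample_tokens = ["film", "-", "10/10", "and", "don't", "great"]

# isalpha() is True only if every character in the token is a letter
alphabetic = [t for t in sample_tokens if t.isalpha()]
print(alphabetic)  # ['film', 'and', 'great']

# hypothetical mini stop list, standing in for the NLTK stop words
mini_stop_words = {"and", "the", "a"}
kept = [t for t in alphabetic if t not in mini_stop_words]
print(kept)  # ['film', 'great']
```

Notice that `isalpha()` also drops tokens such as `don't`, because the apostrophe is not a letter.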

We add the above preprocessing steps:

In [None]:
# install nltk first (e.g. with pip in the jupyter notebook);
# the stop word list also needs a one-time nltk.download('stopwords')
from nltk.corpus import stopwords

# load the document
filename = 'data/neg/cv000_29416.txt'
text = load_doc(filename)
# split into tokens by white space
tokens = text.split()
# remove tokens that are not alphabetic
tokens = [word for word in tokens if word.isalpha()]
# filter out stop words
stop_words = set(stopwords.words('english'))
tokens = [w for w in tokens if w not in stop_words]
print(tokens)

As we are mainly interested in the frequency and presence of each unigram, we can store the unigram features using the `Counter` dictionary from the `collections` module. Let's write a function `tokens_to_dict()` that turns a list of unigram tokens into a counter dictionary of unigrams. 

In [None]:
from collections import Counter
def tokens_to_dict(tokens):
    token2count=Counter()
    for token in tokens:
        token2count[token]+=1
    return token2count
tokens_dict=tokens_to_dict(tokens)
print (tokens_dict)
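As a side note, `Counter` can also count a list directly, and its `most_common()` method returns the items sorted by count. The token list below is made up for illustration.

```python
from collections import Counter

demo_tokens = ["film", "good", "film", "plot", "film", "good"]

# passing the list to Counter counts every item in one step
demo_counts = Counter(demo_tokens)

print(demo_counts.most_common(2))  # [('film', 3), ('good', 2)]
print(demo_counts["unseen"])       # 0, missing keys count as zero
```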

We can put this into a function called `clean_doc_unigrams()` and test it on another review, this time a positive review.


In [None]:
# turn a doc into clean tokens
def clean_doc_unigrams(doc):
    # split into tokens by white space
    tokens = doc.split()
    # remove tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if w not in stop_words]
    tokens_dict=tokens_to_dict(tokens)
    return tokens_dict
 
# load the document
filename = 'data/pos/cv000_29590.txt'
text = load_doc(filename)
tokens_dict = clean_doc_unigrams(text)
print(tokens_dict)

Finally, we can add the above preprocessing steps into a function `process_docs_unigrams()` to process all the files in a directory. 


In [None]:
# load all docs in a directory
def process_docs_unigrams(directory):
    # walk through all files in the folder
    print (directory)
    for filename in listdir(directory):
        # skip files that do not have the right extension
        if not filename.endswith(".txt"):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load document
        text=load_doc(path)
        # clean documents
        tokens_dict_current = clean_doc_unigrams(text)
process_docs_unigrams('./data/neg')
process_docs_unigrams('./data/pos')

### Feature Vectors

To prepare the features as input to the model, we need to convert each word token into a numerical value so that the model can process it. To represent unigrams, we can create a vector (i.e. a list of numbers) whose dimension size equals the size of the vocabulary, where the value in each dimension stores the frequency or presence of a specific word. 


For example, suppose we have five words (apple, banana, red, dog, is) in the vocabulary, which are represented as five dimensions in the feature vectors. We also have a document (document 1) containing the following words: 

document 1: "apple is red"

We add '1' for the words present and '0' for words not present in the document, and create the following: 

|document no.|apple|banana|red|dog|is|
|------|------|------|------|------|------|
|document 1 |1|0|1|0|1|

We thus can represent the document as a feature vector [1,0,1,0,1]
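The toy example above can be reproduced in a few lines of plain Python. This is only a sketch of the idea using lists; the variable names are chosen for this illustration.

```python
vocabulary = ["apple", "banana", "red", "dog", "is"]
document_1 = "apple is red"

# collect the words of the document, then mark each vocabulary word as present (1) or absent (0)
words_in_doc = set(document_1.split())
feature_vector = [1 if w in words_in_doc else 0 for w in vocabulary]

print(feature_vector)  # [1, 0, 1, 0, 1]
```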

The first step now is to create a mapping between the dimension index of the vector and the word in the vocabulary. Let's first loop over the documents to collect all the words. Here, we use a dictionary that accumulates the words and their counts while we process all the documents. To do this, we can define a function `update_counter()` to collect the words and their counts from the token dictionary of each document:

In [None]:
def update_counter(overall_vocab_counter,token_dict):
    for w in token_dict:
        overall_vocab_counter[w]+=token_dict[w]
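For reference, `Counter` objects have a built-in `update()` method that adds counts in the same way. The sketch below uses two made-up per-document counters to show the accumulation.

```python
from collections import Counter

doc_a = Counter({"film": 2, "good": 1})
doc_b = Counter({"film": 1, "bad": 3})

total = Counter()
for doc_counts in (doc_a, doc_b):
    # Counter.update() adds the counts to the existing ones instead of replacing them
    total.update(doc_counts)

print(total)
```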
        


We can now define an `overall_vocab_counter` and then integrate the `update_counter()` function into `process_docs_unigrams()` to update `overall_vocab_counter`. 

In [None]:
from collections import Counter
overall_vocab_counter=Counter()
def process_docs_unigrams(directory,overall_vocab_counter):
    # walk through all files in the folder
    print (directory)
    for filename in listdir(directory):
        # skip files that do not have the right extension
        if not filename.endswith(".txt"):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load document
        text=load_doc(path)
        # clean documents
        tokens_dict_current = clean_doc_unigrams(text)
        update_counter(overall_vocab_counter,tokens_dict_current)
process_docs_unigrams('./data/neg',overall_vocab_counter)
process_docs_unigrams('./data/pos',overall_vocab_counter)


<p style="font-size:1.5em; font-weight: bold">
<img src="../../resources/exercise.png" style="height:36pt; display:inline; vertical-align:bottom; margin-right: 4pt" /> 
Try it yourself! <hr>
</p>

### Practice quiz 2:

Let's check the compiled `overall_vocab_counter`. How many distinct words are there in total after preprocessing? And what are the top 5 most frequent words?

You can write your code below:


<hr>    <!-- please remember this! -->
<details>
  <summary>Click <b>here</b> to see the answer.</summary>
    <p>There are 37607 distinct words in total.</p>
    <p>The 5 most frequent words are: 
        ('film', 8849),
 ('one', 5514),
 ('movie', 5429),
 ('like', 3543),
 ('even', 2554)</p>


</details> 



Now let's create the mapping between the vocabulary and the feature vector dimension index from this `overall_vocab_counter`. We can define a function `create_vocab_feature_mappings()` to do this:

In [None]:
def create_vocab_feature_mappings(overall_vocab_counter):
    vocab2index={}
    index2vocab={}
    for i,w in enumerate(overall_vocab_counter.keys()): # iterate through the words in the vocabulary
        vocab2index[w]=i
        index2vocab[i]=w
    return vocab2index,index2vocab
vocab2index,index2vocab=create_vocab_feature_mappings(overall_vocab_counter)

Now, let's turn each document into a unigram feature vector. In each dimension, we can either store the frequency of the word, or use 1 or 0 to represent whether the word occurs or not. 


Let's design two functions, `create_feature_presence()` and `create_feature_frequency()`, to turn the token dictionary of each document into these two types of features. 

To implement `create_feature_presence()`, we can do the following:

In [None]:
import numpy as np
def create_feature_presence(tokens_dict_current,vocab2index):
    # create a numpy array with dimension size of the vocabulary size
    vector=np.zeros(len(vocab2index))
    for w in tokens_dict_current:
        index=vocab2index[w]
        vector[index]=1
    return vector
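One thing to keep in mind: `vocab2index[w]` raises a `KeyError` if a document contains a word that is not in the vocabulary, which can happen with reviews that were not used to build it. A possible variant that skips unknown words is sketched below; the function name `create_feature_presence_skip_unknown` and the toy vocabulary are assumptions made for this illustration.

```python
import numpy as np

def create_feature_presence_skip_unknown(tokens_dict, vocab2index):
    vector = np.zeros(len(vocab2index))
    for w in tokens_dict:
        index = vocab2index.get(w)  # None when w is not in the vocabulary
        if index is not None:
            vector[index] = 1
    return vector

toy_vocab2index = {"apple": 0, "banana": 1, "red": 2}
# "kiwi" is not in the toy vocabulary and is silently ignored
toy_vector = create_feature_presence_skip_unknown({"apple": 2, "kiwi": 1}, toy_vocab2index)
print(toy_vector)  # [1. 0. 0.]
```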


### Numpy arrays

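Notice that we have created `numpy` arrays here to represent feature vectors. A `numpy` array behaves much like a `list`, but all of its items share one data type and are stored contiguously, so it typically uses less memory and supports faster numerical operations. 

Below, we introduce several ways to create a numpy array: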

In [None]:
# Create a numpy array of zeros with dimension 2
vector1=np.zeros(2)
# create a numpy array from a list [1,2,3]
vector2=np.array([1,2,3])
# create an uninitialized array of dimension 3 (its values are arbitrary)
vector3=np.empty(3)
print ('vector1',vector1)
print ('vector2',vector2)
print ('vector3',vector3)

So far, we have created numpy arrays of one dimension. Let's try creating a 2-D array (also called a matrix). We can pass the shape as a tuple (a,b), where a is the number of rows in the matrix and b is the number of columns. 

In [None]:
# this is a matrix of zeros that has 3 vectors, and within each vector there are 3 items. 
matrix1=np.zeros((3,3)) 
print ('matrix1',matrix1)
# a matrix from nested list
matrix2=np.array([[1,2,3],[2,3,4]])
print ('matrix2',matrix2)
# an empty matrix usually used as initialisation. It will print as an empty list
matrix3=np.empty((0,4))
print ('matrix3',matrix3)
#To check the axes of an array, you can retrieve the shape attribute like this:
print (matrix2.shape) 


Numpy arrays are mutable. Therefore, we can change values in the vector. For example:

In [None]:
vector2[0]=0 # change the first item in vector2 to 0
print (vector2)
matrix2[0][2]=0 # change the third item of the first vector in matrix2 to 0
print (matrix2)
matrix2[0]=vector2 # replace the first vector in matrix2 with the values of vector2
print (matrix2)
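Two details of these assignments are worth seeing in isolation: assigning a vector to a row copies its values, and `grid[1, 2]` addresses the same item as `grid[1][2]`. The arrays `row` and `grid` below are made up for this sketch.

```python
import numpy as np

row = np.array([7, 8, 9])
grid = np.zeros((2, 3), dtype=int)

grid[0] = row    # copies the values of row into the first row of grid
grid[1, 2] = 5   # same item as grid[1][2], addressed in one step
row[0] = -1      # changing row afterwards does not affect grid

print(grid)
print(grid.shape)  # (2, 3)
```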


<p style="font-size:1.5em; font-weight: bold">
<img src="../../resources/exercise.png" style="height:36pt; display:inline; vertical-align:bottom; margin-right: 4pt" /> 
Try it yourself! <hr>
</p>

### Practice quiz 3:

Can you try implementing the function `create_feature_frequency()`?
You can write your code below:

In [None]:
def create_feature_frequency(tokens_dict_current,vocab2index):
    # create a numpy array with dimension size of the vocabulary size
    vector=np.zeros(len(vocab2index))
    for w in tokens_dict_current:
        index=vocab2index[w]
        vector[index]=tokens_dict_current[w]
    return vector


<hr>    <!-- please remember this! -->
<details>
  <summary>Click <b>here</b> to see the answer.</summary>
    <p>Simply change the line vector[index]=1 from create_feature_presence() to vector[index]=tokens_dict_current[w]</p>

   


</details> 



Now let's create another function `process_docs_unigrams_presence()` on the basis of `process_docs_unigrams()` to loop over the documents again and convert the token dictionary in each document to feature vectors. 

In [None]:
def process_docs_unigrams_presence(directory,vocab2index):
    # loop over the directory to extract filenames that have the right extension
    filenames=[filename for filename in listdir(directory) if filename.endswith(".txt")]
    # since we know how many files we will process, and the vocabulary size, we can pre-allocate our result matrix with the shape of (file number, vocabulary size)
    
    unigram_presence_result=np.empty((len(filenames),len(vocab2index))) # uninitialized array; every row is overwritten in the loop below
    # walk through all files in the folder
    print (directory)
    for file_i,filename in enumerate(filenames):
        
        # create the full path of the file to open
        path = directory + '/' + filename
        # load document
        text=load_doc(path)
        # clean documents
        tokens_dict_current = clean_doc_unigrams(text)
        # convert bag of words in each document into unigram features
        unigram_presence=create_feature_presence(tokens_dict_current,vocab2index)
        # assign the unigram feature vector of the current document to its row in the unigram_presence_result matrix
        unigram_presence_result[file_i]=unigram_presence
    return unigram_presence_result

Now we are ready to extract features from both the negative and positive directories:

In [None]:
unigram_presence_positive=process_docs_unigrams_presence('./data/pos',vocab2index)
unigram_presence_negative=process_docs_unigrams_presence('./data/neg',vocab2index)


Let's take a look at the feature representation for the second negative review:

In [None]:
unigram_presence_negative[1]
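Because the vocabulary is large, most entries of such a vector are zero, so printing it directly is not very informative. A sketch of how to summarise a presence vector with `np.count_nonzero()` and `np.flatnonzero()` is shown below on a made-up five-word vocabulary.

```python
import numpy as np

toy_index2vocab = {0: "apple", 1: "banana", 2: "red", 3: "dog", 4: "is"}
toy_presence = np.array([1., 0., 1., 0., 1.])

# number of non-zero dimensions, and their positions in the vector
n_active = int(np.count_nonzero(toy_presence))
active_indices = np.flatnonzero(toy_presence)
active_words = [toy_index2vocab[int(i)] for i in active_indices]

print(n_active, "of", toy_presence.size, "dimensions are non-zero")
print(active_words)  # ['apple', 'red', 'is']
```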

<p style="font-size:1.5em; font-weight: bold">
<img src="../../resources/exercise.png" style="height:36pt; display:inline; vertical-align:bottom; margin-right: 4pt" /> 
Try it yourself! <hr>
</p>

### Practice quiz 4:

Can you follow the code above to extract unigram frequency features using the `create_feature_frequency()` function you defined in quiz 3? Please store the features in `unigram_frequency_positive` and `unigram_frequency_negative` for the positive and negative directories respectively. 

You can write your code below:

### Practice quiz 5:

Let's check the first positive review's unigram presence feature. Can you use the mapping in `index2vocab` to reveal which unigrams are present in this review? 
Please answer: Which words in the following list are present?

A. gayness

B. fabulous

C. snappiness

D. happiness

You can write your code below:

In [None]:
present_words=[index2vocab[i] for i,item in enumerate(unigram_presence_positive[0]) if item==1]
'happiness' in present_words


<hr>    <!-- please remember this! -->
<details>
  <summary>Click <b>here</b> to see the answer.</summary>
    <p> Code:</p>
    <p>wordlist=[index2vocab[i] for i,item in enumerate(unigram_presence_positive[0]) if item==1]</p>
    <p> A,C are present in the review </p>

   


</details> 



### Save the features

We can also save the prepared features for the reviews ready for modeling.

This is a good practice as it decouples the data preparation from modeling, allowing you to focus on modeling and circle back to data preparation if you have new ideas.

To store a numpy array, we can use `numpy`'s `save()` function. `save()` takes two arguments: the first is the filename to be written, and the second is the numpy array. 

In [None]:

np.save('unigram_presence_positive.npy',unigram_presence_positive)
np.save('unigram_presence_negative.npy',unigram_presence_negative)


We can load the positive data from the file by:
    

In [None]:
np.load('unigram_presence_positive.npy')
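To convince yourself that nothing is lost when saving and loading, you can do a round trip and compare the arrays. The sketch below uses a small made-up array and a temporary folder, so it does not touch the feature files created above.

```python
import os
import tempfile
import numpy as np

features = np.array([[1., 0., 1.], [0., 1., 0.]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_features.npy")
    np.save(path, features)   # write the array to disk
    restored = np.load(path)  # read it back into memory

print(np.array_equal(features, restored))  # True
```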


### Bigrams

Based on `clean_doc_unigrams()`, which turns a document into a unigram dictionary, we can create a function `clean_doc_bigrams()` to extract a bigram dictionary. (You can refresh your memory of how to create a bigram counter dictionary in module 1.4.)

In [None]:
# turn a doc into clean tokens
def clean_doc_bigrams(doc):
    # split into tokens by white space
    tokens = doc.split()
    # remove tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    
    # extract bigrams
    bigram_dict=Counter() #initialize a bigram dictionary to be updated
    tokens=['<start>']+tokens+['<end>'] # add <start> and <end> token
    for i in range(len(tokens)): #loop over all the indices of the token list
        if i<len(tokens)-1: #if it's not the end of the token list
            bigram_current=(tokens[i],tokens[i+1])
            bigram_dict[bigram_current]+=1
    return bigram_dict
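The index-based loop can also be written by pairing each token with its successor using `zip()`. The sketch below shows this on a made-up token list; the helper name `count_bigrams` is hypothetical and it skips the `isalpha()` filtering step for brevity.

```python
from collections import Counter

def count_bigrams(tokens):
    # pad the sequence so that the first and last words also appear in a bigram
    padded = ["<start>"] + tokens + ["<end>"]
    # zip pairs each token with the one that follows it
    return Counter(zip(padded, padded[1:]))

toy_bigrams = count_bigrams(["a", "great", "film"])
print(toy_bigrams)
```

Each key in the resulting counter is a tuple of two tokens, just like `bigram_current` in the function above.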

Now we can replace the `clean_doc_unigrams()` line with `clean_doc_bigrams()` in `process_docs_unigrams()`. We will rename the function to `process_docs_bigrams()`, which counts all the bigrams in the documents. 

In [None]:
overall_bigrams_counter=Counter()
def process_docs_bigrams(directory,overall_bigrams_counter):
    # walk through all files in the folder
    print (directory)
    for filename in listdir(directory):
        # skip files that do not have the right extension
        if not filename.endswith(".txt"):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load document
        text=load_doc(path)
        # clean documents
        tokens_dict_current = clean_doc_bigrams(text)
        update_counter(overall_bigrams_counter,tokens_dict_current)
process_docs_bigrams('./data/neg',overall_bigrams_counter)
process_docs_bigrams('./data/pos',overall_bigrams_counter)

Once we have `overall_bigrams_counter`, we can pass it to `create_vocab_feature_mappings()` to create index-bigram mappings. 

In [None]:
bigram2index,index2bigram=create_vocab_feature_mappings(overall_bigrams_counter)

Now we can modify `process_docs_unigrams_presence()` to process bigrams. 

In [None]:
def process_docs_bigrams_presence(directory,bigram2index):
    # loop over the directory to extract filenames that have the right extension
    filenames=[filename for filename in listdir(directory) if filename.endswith(".txt")]
    # since we know how many files we will process, and the number of distinct bigrams, we can pre-allocate our result matrix with the shape of (file number, number of bigrams)
    
    bigram_presence_result=np.empty((len(filenames),len(bigram2index))) # uninitialized array; every row is overwritten in the loop below
    # walk through all files in the folder
    print (directory)
    for file_i,filename in enumerate(filenames):
        
        # create the full path of the file to open
        path = directory + '/' + filename
        # load document
        text=load_doc(path)
        # clean documents
        bigrams_dict_current = clean_doc_bigrams(text)
        # convert the bigram dictionary of each document into bigram presence features
        bigram_presence=create_feature_presence(bigrams_dict_current,bigram2index)
        # assign the bigram feature vector of the current document to its row in the bigram_presence_result matrix
        bigram_presence_result[file_i]=bigram_presence
    return bigram_presence_result
bigram_presence_positive=process_docs_bigrams_presence('./data/pos',bigram2index)
bigram_presence_negative=process_docs_bigrams_presence('./data/neg',bigram2index)


In [None]:
# We then save the bigram features:
np.save('bigram_presence_positive.npy',bigram_presence_positive)
np.save('bigram_presence_negative.npy',bigram_presence_negative)

<p style="font-size:1.5em; font-weight: bold">
<img src="../../resources/exercise.png" style="height:36pt; display:inline; vertical-align:bottom; margin-right: 4pt" /> 
Try it yourself! <hr>
</p>

### Practice quiz 6:

We now have features in the form of bigrams. What does each dimension of a vector represent? How many dimensions does each vector have? You can inspect `bigram_presence_positive` to answer these questions. 



<hr>    <!-- please remember this! -->
<details>
  <summary>Click <b>here</b> to see the answer.</summary>
    <p>Each dimension corresponds to a bigram. There are 463119 bigrams and therefore the dimension size of each vector is 463119. </p>
  

   


</details> 
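With this many dimensions, the memory used by the feature matrix becomes noticeable. The sketch below estimates the size of one class's bigram matrix using the figures quoted in this module (1000 reviews per class, 463119 bigrams) and `numpy`'s default `float64` type, and compares it with a one-byte integer type. Using `np.uint8` for presence features is a suggestion made here, not something the code above does.

```python
import numpy as np

# figures quoted in this module's quiz answers
n_docs, n_bigrams = 1000, 463119

# size in bytes = number of cells * bytes per cell
bytes_float64 = n_docs * n_bigrams * np.dtype(np.float64).itemsize
bytes_uint8 = n_docs * n_bigrams * np.dtype(np.uint8).itemsize
print(f"float64: {bytes_float64 / 1e9:.2f} GB, uint8: {bytes_uint8 / 1e9:.2f} GB")

# a tiny presence matrix stored with one byte per cell
small = np.zeros((2, 5), dtype=np.uint8)
small[0, 3] = 1
print(small.dtype, small.nbytes)  # uint8 10
```

Since presence features only ever hold 0 or 1, a smaller integer type loses no information.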



#### Unigrams + POS

#### Adjective Unigrams (Quiz)

#### Unigrams above certain frequency threshold

#### Unigrams + Position

#### Unigrams + Bigrams (Quiz)

#### Negation (extension)

## Prepare train-test split

## Naive Bayes Model

## Evaluation