Welcome to module 2.1. In this module, we will start building and testing a sentiment classifier with Naive Bayes!


Let's first refresh your memory on the Naive Bayes model. 

## Pre-module quiz

Say that we have two events: Fire and Smoke. $P(Fire)$ is the probability of a fire (or in other words, how often a fire occurs), $P(Smoke)$ is the probability of seeing smoke (how often we see smoke). We want to know $P(Fire|Smoke)$, that is, how often fire occurs when we see smoke. Suppose we know the following:

$P(Fire)=0.01$

$P(Smoke)=0.1$

$P(Smoke|Fire)=0.9$ (i.e., 90\% of fires produce smoke)


Can you work out $P(Fire|Smoke)$?

A. 0.1

B. 0.09

C. 0.01

D. 0.9




<hr>    <!-- please remember this! -->
<details>
  <summary>Click <b>here</b> to see the answer.</summary>
  <p>\begin{equation*}P(Fire|Smoke) =  \frac{P(Fire)*P(Smoke|Fire)}{P(Smoke)} \end{equation*} </p>
<p>\begin{equation*}=\frac{0.01 * 0.9}{ 0.1} \end{equation*}</p>
<p>\begin{equation*}=0.09\end{equation*}</p>

</details> 



<h1 align='center' style='margin-top:2em'>
    <img src="../../resources/section_header.png" 
         style="height:36pt; display:inline; vertical-align:center; margin-top:0em" />
    <u>Sentiment Analysis</u>
    <img src="../../resources/section_header.png" 
         style="height:36pt; display:inline; vertical-align:center; margin-top:0em" />
<hr>
</h1>

## Preparing data

The data for this tutorial is stored in the `./data` folder. The two subdirectories `./data/pos` and `./data/neg` contain samples of IMDb positive and negative movie reviews. Each line of a review text file is a tokenized sentence.

We can load an individual text file by opening it, reading in the ASCII text, and closing the file. For example, we can load the first negative review file “cv000_29416.txt” as follows:


In [None]:

# load one file
filename = 'data/neg/cv000_29416.txt'
# open the file as read only
file = open(filename, 'r')
# read all text
text = file.read()
print(text)
# close the file
file.close()


This loads the document as ASCII and preserves any white space, like new lines.

We can turn this into a function called `load_doc()` that takes the filename of the document to load and returns the text.


In [None]:

 
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text
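
A `with` block closes the file automatically, even if reading raises an error. The sketch below is one possible variant, not the version used in the rest of this tutorial; the name `load_doc_with` and the explicit `encoding` argument are assumptions added for illustration.

```python
import os
import tempfile

def load_doc_with(filename, encoding='utf-8'):
    # the with block closes the file as soon as the block exits
    with open(filename, 'r', encoding=encoding) as file:
        return file.read()

# try it on a small temporary file (hypothetical content)
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.txt')
    with open(path, 'w', encoding='utf-8') as f:
        f.write('a tokenized sentence .\nanother sentence .\n')
    example_text = load_doc_with(path)

print(example_text)
```

The white space, including the new lines, is preserved in the returned string.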


We can process each directory in turn by first getting a list of files in the directory using the `listdir()` function from the `os` module, then loading each file in turn.

For example, we can load each document in the negative directory using the `load_doc()` function to do the actual loading. Below, we define a `process_docs()` function to load all documents in a folder. 

Let's first read in the positive and negative files and store them as two lists of texts. To navigate the files, we can use Python's `os` module. 

In [None]:
from os import listdir 
# load all docs in a directory
def process_docs(directory):
    # walk through all files in the folder
    docs=[] # a list of review texts
    for filename in listdir(directory):
        # skip files that do not have the right extension
        if not filename.endswith(".txt"):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load document
        doc = load_doc(path)
        print('Loaded %s' % filename)
        docs.append(doc)
    return docs
 
# specify directory to load
directory = 'data/neg'
docs=process_docs(directory)

<p style="font-size:1.5em; font-weight: bold">
<img src="../../resources/exercise.png" style="height:36pt; display:inline; vertical-align:bottom; margin-right: 4pt" /> 
Try it yourself! <hr>
</p>

### Practice quiz 1:

Use the predefined `process_docs()` function to read in the negative and positive reviews. How many reviews are there for each class? 

You can write your code below:


<hr>    <!-- please remember this! -->
<details>
  <summary>Click <b>here</b> to see the answer.</summary>
    <p>1000 positive and 1000 negative reviews</p>


</details> 



## Feature Extraction


In this section, we will look at cleaning and extracting features from the movie review data. We will start by splitting the text to extract unigrams. 


#### Pre-processing

First, let’s load one document and look at the raw tokens split by white space. We will use the `load_doc()` function developed in the previous section. We can use the `split()` function to split the loaded document into tokens separated by white space.

In [None]:
# load the document
filename = 'data/neg/cv000_29416.txt'
text = load_doc(filename)
# split into tokens by white space
tokens = text.split()
print(tokens)

Just looking at the raw tokens can give us a lot of ideas of things to try, such as:

- Removing tokens that are just punctuation (e.g. ‘-’).
- Removing tokens that contain numbers (e.g. ‘10/10’).
- Removing tokens that don’t have much meaning (e.g. ‘and’).

Some ideas for how to do this:

- We can remove tokens that are just punctuation or contain numbers by calling the string method `isalpha()` on each token.
- We can remove English stop words using the stop word list provided by `NLTK`.

We add the above preprocessing steps:

In [None]:
# install nltk using pip in the jupyter notebook
!python -m pip install nltk
import nltk
from nltk.corpus import stopwords

# download the stop word list (only needed once)
nltk.download('stopwords')

# load the document
filename = 'data/neg/cv000_29416.txt'
text = load_doc(filename)
# split into tokens by white space
tokens = text.split()
# remove tokens that are not alphabetic
tokens = [word for word in tokens if word.isalpha()]
# filter out stop words
stop_words = set(stopwords.words('english'))
tokens = [w for w in tokens if w not in stop_words]
print(tokens)

As we are mainly interested in the frequency and presence of each unigram, we can store the unigram features using the `Counter` dictionary from the `collections` module. Let's write a function `tokens_to_dict()` that turns a list of unigram tokens into a counter dictionary of unigrams. 

In [None]:
from collections import Counter
def tokens_to_dict(tokens):
    token2count=Counter()
    for token in tokens:
        token2count[token]+=1
    return token2count
tokens_dict=tokens_to_dict(tokens)
print(tokens_dict)
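
As a side note, a `Counter` can also be built directly from a list of tokens, which gives the same result as the explicit loop. The sketch below compares the two on a small made-up token list; `example_tokens` is a hypothetical input, not taken from the review data.

```python
from collections import Counter

# a small, made-up token list
example_tokens = ['good', 'film', 'good', 'plot']

# counting with an explicit loop, as in tokens_to_dict()
counts_loop = Counter()
for token in example_tokens:
    counts_loop[token] += 1

# counting by passing the list straight to Counter
counts_direct = Counter(example_tokens)

print(counts_direct)
print(counts_loop == counts_direct)
```

Both approaches produce the same counts; the loop version makes the counting step explicit, which is convenient if more processing is added per token later.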

We can put this into a function called `clean_doc_unigrams()` and test it on another review, this time a positive review.


In [None]:
# turn a doc into clean tokens
def clean_doc_unigrams(doc):
    # split into tokens by white space
    tokens = doc.split()
    # remove tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if not w in stop_words]
    tokens_dict=tokens_to_dict(tokens)
    return tokens_dict
 
# load the document
filename = 'data/pos/cv000_29590.txt'
text = load_doc(filename)
tokens_dict = clean_doc_unigrams(text)
print(tokens_dict)

Finally, we can add the above preprocessing steps into a function `process_docs_unigrams()` to process all the files in a directory. 


In [None]:
# load all docs in a directory
def process_docs_unigrams(directory):
    # walk through all files in the folder
    print (directory)
    for filename in listdir(directory):
        # skip files that do not have the right extension
        if not filename.endswith(".txt"):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load document
        text=load_doc(path)
        # clean documents
        tokens_dict = clean_doc_unigrams(text)
process_docs_unigrams('./data/neg')
process_docs_unigrams('./data/pos')

#### Feature Vectors

To prepare the features as input to the model, we need to turn the dictionary features into a vector of numbers. For unigram features, the feature vector has one dimension per word in the vocabulary. Each dimension stores the frequency or presence of the corresponding word. 

For example, suppose we have five words (apple, banana, red, dog, is) in the vocabulary, which are represented as five dimensions in the feature vectors. We also have a document (document 1) containing the following words: 

document 1: "apple is red"

We put a 1 in the dimension of each word that occurs in the document and a 0 everywhere else: 

|document no.|apple|banana|red|dog|is|
|------|------|------|------|------|------|
|document 1 |1|0|1|0|1|

We can thus represent the document as the feature vector [1,0,1,0,1].
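
The toy example above can be reproduced in a few lines of plain Python. This is only an illustrative sketch; the names `example_vocab`, `example_vocab2index`, and `example_vector` are hypothetical and are not used elsewhere in the tutorial.

```python
# the five-word toy vocabulary, in the same order as the table columns
example_vocab = ['apple', 'banana', 'red', 'dog', 'is']

# map each word to its dimension index
example_vocab2index = {word: i for i, word in enumerate(example_vocab)}

document_1 = 'apple is red'

# start with all zeros, then set a 1 for every word that occurs
example_vector = [0] * len(example_vocab)
for word in document_1.split():
    example_vector[example_vocab2index[word]] = 1

print(example_vector)  # [1, 0, 1, 0, 1]
```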

The first step now is to create a mapping between each dimension index of the vector and a word in the vocabulary. Let's define an overall `Counter` dictionary that accumulates the words and their counts while we process all the documents. To do this, we can define a function `update_counter()`:

In [None]:
def update_counter(overall_vocab_counter,current_vocab_counter):
    for w in current_vocab_counter:
        overall_vocab_counter[w]+=current_vocab_counter[w]
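
For comparison, `Counter` objects also have a built-in `update()` method that adds counts rather than replacing them, which has the same effect as the loop in `update_counter()`. A minimal sketch on made-up counts (the variable names and numbers here are hypothetical):

```python
from collections import Counter

# counts accumulated so far, and counts from one new document
overall = Counter({'film': 2, 'good': 1})
current = Counter({'film': 1, 'bad': 3})

# Counter.update() adds the counts of the argument to the existing counts
overall.update(current)

print(overall['film'], overall['good'], overall['bad'])
```

Writing the loop by hand, as in `update_counter()`, keeps the accumulation step visible, so either form is a reasonable choice.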
        


We can now define an `overall_vocab_counter` and then integrate the `update_counter()` function into `process_docs_unigrams()` to update `overall_vocab_counter`. 

In [None]:
overall_vocab_counter=Counter()
def process_docs_unigrams(directory):
    # walk through all files in the folder
    print (directory)
    for filename in listdir(directory):
        # skip files that do not have the right extension
        if not filename.endswith(".txt"):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load document
        text=load_doc(path)
        # clean documents
        tokens_dict = clean_doc_unigrams(text)
        update_counter(overall_vocab_counter,tokens_dict)
process_docs_unigrams('./data/neg')
process_docs_unigrams('./data/pos')


<p style="font-size:1.5em; font-weight: bold">
<img src="../../resources/exercise.png" style="height:36pt; display:inline; vertical-align:bottom; margin-right: 4pt" /> 
Try it yourself! <hr>
</p>

### Practice quiz 2:

Let's check the compiled `overall_vocab_counter`. How many distinct words are there in total after preprocessing? And what are the top 100 most frequent words?

You can write your code below:

In [None]:
len(overall_vocab_counter)


<hr>    <!-- please remember this! -->
<details>
  <summary>Click <b>here</b> to see the answer.</summary>
    <p>There are 37607 words in total</p>
    <p>The 5 most frequent words are: 
        ('film', 8849), ('one', 5514), ('movie', 5429), ('like', 3543), ('even', 2554)</p>


</details> 
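
For reference, the two `Counter` operations that this kind of inspection relies on are `len()`, which gives the number of distinct keys, and `most_common(n)`, which returns the `n` highest-count entries as `(word, count)` pairs. The sketch below uses a small made-up counter, not the real review vocabulary.

```python
from collections import Counter

# a small, made-up counter standing in for a vocabulary counter
example_counter = Counter({'film': 5, 'one': 3, 'movie': 4, 'plot': 1})

# number of distinct words
vocab_size = len(example_counter)

# the two most frequent words with their counts, highest first
top_two = example_counter.most_common(2)

print(vocab_size)
print(top_two)
```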



Now let's create the mapping between the vocabulary and the feature vector dimension indices from this `overall_vocab_counter`. We can define a function `create_vocab_feature_mappings()` to do this:

In [None]:
def create_vocab_feature_mappings(overall_vocab_counter):
    vocab2index={}
    index2vocab={}
    for i,w in enumerate(overall_vocab_counter.keys()): # iterate through the words in the vocabulary
        vocab2index[w]=i
        index2vocab[i]=w
    return vocab2index,index2vocab
vocab2index,index2vocab=create_vocab_feature_mappings(overall_vocab_counter)

Now, let's turn each document into a unigram feature vector. Each dimension can either store the frequency of the corresponding word, or a 1 or 0 to indicate whether the word occurs in the document or not. 

To represent these feature vectors, we use NumPy arrays, which are provided by Python's `numpy` module. For large numeric vectors, NumPy arrays typically require less memory than Python lists and are much faster to access. 

Let's design two functions, `create_feature_frequency()` and `create_feature_presence()`, that turn the token dictionary of each document into these two types of features. 

Short introduction of numpy?
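
As a brief orientation, the sketch below shows the handful of NumPy operations that are enough for building feature vectors: creating an array of zeros, assigning to individual positions, and inspecting the result. The values are made up for illustration.

```python
import numpy

# create a vector of five zeros (floats by default)
vector = numpy.zeros(5)

# set or increment individual dimensions by index
vector[0] = 2
vector[2] += 1

print(vector)        # the array contents
print(vector.shape)  # the dimensions of the array, here (5,)
print(vector.sum())  # the sum of all entries

# comparisons work element by element and return a boolean array
print(vector > 0)
```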

In [None]:
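import numpy
def create_feature_frequency(tokens_dict,vocab2index):
    # one possible implementation: create a numpy array with dimension size of the vocabulary size
    features = numpy.zeros(len(vocab2index))
    for token, count in tokens_dict.items():
        if token in vocab2index:  # skip words outside the vocabulary
            features[vocab2index[token]] = count
    return features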
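
The presence variant can follow the same pattern. The body below is only one possible sketch of `create_feature_presence()`, under the assumption that words missing from the vocabulary are simply ignored; the toy `example_vocab2index` and `example_tokens_dict` are hypothetical inputs.

```python
import numpy

def create_feature_presence(tokens_dict, vocab2index):
    # one dimension per vocabulary word, all zeros to start with
    features = numpy.zeros(len(vocab2index))
    for token in tokens_dict:
        # assumption: words that are not in the vocabulary are skipped
        if token in vocab2index:
            # store 1 regardless of how often the word occurs
            features[vocab2index[token]] = 1
    return features

# toy vocabulary and token dictionary for illustration
example_vocab2index = {'apple': 0, 'banana': 1, 'red': 2, 'dog': 3, 'is': 4}
example_tokens_dict = {'apple': 2, 'red': 1, 'is': 1, 'pear': 1}

presence = create_feature_presence(example_tokens_dict, example_vocab2index)
print(presence)
```

Even though 'apple' occurs twice in the toy document, its dimension holds 1, and the out-of-vocabulary word 'pear' leaves the vector unchanged.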
    

#### Save the features

We can also save the prepared features for the reviews ready for modeling.

This is a good practice as it decouples the data preparation from modeling, allowing you to focus on modeling and circle back to data preparation if you have new ideas.
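
One way to do this, sketched below, is to stack the feature vectors into a single NumPy array and write it to disk with `numpy.save()`, then read it back with `numpy.load()`. The file name `features.npy` and the small array are assumptions for illustration; a temporary directory is used so that the example does not leave files behind.

```python
import os
import tempfile

import numpy

# two made-up feature vectors, one row per document
features = numpy.array([[1, 0, 1],
                        [0, 2, 0]])

with tempfile.TemporaryDirectory() as tmp_dir:
    # hypothetical file name for the saved features
    path = os.path.join(tmp_dir, 'features.npy')
    # write the array to disk in NumPy's binary .npy format
    numpy.save(path, features)
    # read it back into memory
    restored = numpy.load(path)

print(numpy.array_equal(features, restored))
```

In a real project the file would be written to a permanent location, and the vocabulary mapping would need to be saved alongside it so that the dimensions can still be interpreted.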

#### Bigrams

Now let's adapt `clean_doc_unigrams()` to extract bigrams.
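
The core of the change is pairing each token with the one that follows it. The sketch below shows only that step on a made-up token list; the function name `tokens_to_bigram_dict` is hypothetical, and it leaves out the cleaning steps of `clean_doc_unigrams()`.

```python
from collections import Counter

def tokens_to_bigram_dict(tokens):
    # pair every token with its right-hand neighbour:
    # zip stops at the shorter sequence, so the last token starts no bigram
    bigrams = zip(tokens, tokens[1:])
    # count how often each (first, second) pair occurs
    return Counter(bigrams)

# a small, made-up token list
example_tokens = ['not', 'a', 'good', 'film', 'not', 'a', 'bad', 'film']
bigram_counts = tokens_to_bigram_dict(example_tokens)

print(bigram_counts)
```

One design point to keep in mind: if stop words or non-alphabetic tokens are filtered out before pairing, words that were not adjacent in the original text become neighbours, so the order of filtering and pairing affects which bigrams are extracted.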

#### Unigrams + POS

#### Adjective Unigrams (Quiz)

#### Unigrams above certain frequency threshold

#### Unigrams + Position

#### Unigrams + Bigrams (Quiz)

## Prepare train-test split

## Naive Bayes Model