Welcome to module 2.1. In this module, we will start building and testing a sentiment classifier with Naive Bayes!


Let's first refresh your memory on the Naive Bayes model. 

## ❓ Pre-module quiz

Say that we have two events: Fire and Smoke. $P(Fire)$ is the probability of a fire (or in other words, how often a fire occurs), $P(Smoke)$ is the probability of seeing smoke (how often we see smoke). We want to know $P(Fire|Smoke)$, that is, how often fire occurs when we see smoke. Suppose we know the following:

$P(Fire)=0.01$

$P(Smoke)=0.1$

$P(Smoke|Fire)=0.9$ (i.e. 90\% of fires produce smoke)


Can you work out $P(Fire|Smoke)$?

A. 0.1

B. 0.09

C. 0.01

D. 0.9




<hr>    <!-- please remember this! -->
<details>
  <summary>Click <b>here</b> to see the answer.</summary>
  <p>\begin{equation*}P(Fire|Smoke) =  \frac{P(Fire)*P(Smoke|Fire)}{P(Smoke)} \end{equation*} </p>
<p>\begin{equation*}=\frac{0.01 * 0.9}{ 0.1} \end{equation*}</p>
<p>\begin{equation*}=0.09\end{equation*}</p>

</details> 
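To check an answer numerically, Bayes' rule can be evaluated in a couple of lines. This is only a minimal sketch: the function name `posterior` and its parameter names are illustrative and not part of the course code.

```python
def posterior(prior, likelihood, evidence):
    """Bayes' rule: P(A|B) = P(A) * P(B|A) / P(B)."""
    return prior * likelihood / evidence

# values from the quiz above
p_fire = 0.01
p_smoke = 0.1
p_smoke_given_fire = 0.9

p_fire_given_smoke = posterior(p_fire, p_smoke_given_fire, p_smoke)
print(round(p_fire_given_smoke, 4))
```

The same function works for any pair of events, as long as the evidence probability is non-zero.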



# Sentiment Analysis

Today, we are going to focus on a popular NLP classification task: sentiment analysis. What exactly is sentiment? Sentiment relates to the meaning of a word or sequence of words and is usually associated with an opinion or emotion. And analysis? Well, this is the process of looking at data and making inferences; in this case, using machine learning to learn and predict whether a movie review is positive or negative.

In this section, we will replicate the experiments from the paper "Thumbs up? Sentiment Classification using Machine Learning Techniques" (https://www.aclweb.org/anthology/W02-1011.pdf). We will extract a number of features, such as unigrams, bigrams and POS tags, and train Naive Bayes models on these features. 

## Preparing data

The data for this tutorial is stored in the `./data` folder. The two subdirectories `./data/pos` and `./data/neg` contain samples of IMDb positive and negative movie reviews. Each line of a review text file is a tokenized sentence. 

As usual, we download the files for the notebook from GitHub. If you're running this notebook locally or on Binder, you may skip this cell.

In [1]:
!wget https://github.com/cambridgeltl/python4cl/raw/module_2.1/module_2/module_2.1/data.zip
!unzip -n -q data.zip

--2020-11-04 16:07:06--  https://github.com/cambridgeltl/python4cl/raw/module_2.1/module_2/module_2.1/data.zip
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/cambridgeltl/python4cl/module_2.1/module_2/module_2.1/data.zip [following]
--2020-11-04 16:07:07--  https://raw.githubusercontent.com/cambridgeltl/python4cl/module_2.1/module_2/module_2.1/data.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.60.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.60.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4848090 (4.6M) [application/zip]
Saving to: ‘data.zip.1’


2020-11-04 16:07:08 (11.3 MB/s) - ‘data.zip.1’ saved [4848090/4848090]



We can load an individual text file by opening it, reading in the ASCII text, and closing the file. For example, we can load the first negative review file “cv000_29416.txt” as follows:


In [2]:

# load one file
filename = 'data/neg/cv000_29416.txt'
# open the file as read only
file = open(filename, 'r')
# read all text
text = file.read()
print (text)
# close the file
file.close()


plot : two teen couples go to a church party , drink and then drive . 
they get into an accident . 
one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . 
what's the deal ? 
watch the movie and " sorta " find out . . . 
critique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . 
which is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn't snag this one correctly . 
they seem to have taken this pretty neat concept , but executed it terribly . 
so what are the problems with the movie ? 
well , its main problem is that it's simply too jumbled . 
it starts off " normal " but then downshifts into this " fantasy " world in which you , as an audience member , have no id

This loads the document as ASCII and preserves any white space, like new lines.

We can turn this into a function called load_doc() that takes a filename of the document to load and returns the text.


In [3]:
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text
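
As an alternative sketch, the same loading step can be written with a `with` block, which closes the file automatically even if reading fails. The name `load_doc_with` and the temporary example file are hypothetical and only serve to make the snippet self-contained.

```python
import os
import tempfile

def load_doc_with(filename):
    # the with-block closes the file even if reading raises an error
    with open(filename, 'r') as file:
        return file.read()

# try it on a small temporary file so the example is self-contained
with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = os.path.join(tmp_dir, 'example.txt')
    with open(example_path, 'w') as f:
        f.write('a very cool idea .\nbut a very bad package .\n')
    example_text = load_doc_with(example_path)

print(example_text)
```

Both versions return the same text; the `with` form simply removes the need to call `close()` by hand.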


We can process each directory in turn by first getting a list of files in the directory using the `listdir()` function from the `os` module, then loading each file in turn.
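
To see how directory listing behaves on its own, here is a small self-contained sketch. The file names are made up for illustration, and a temporary folder stands in for `./data/neg`.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # create three hypothetical files; only two have the .txt extension
    for name in ['b.txt', 'a.txt', 'notes.md']:
        with open(os.path.join(tmp_dir, name), 'w') as f:
            f.write('dummy text')
    # listdir() returns names in arbitrary order, so sort for reproducibility
    txt_names = sorted(n for n in os.listdir(tmp_dir) if n.endswith('.txt'))
    # os.path.join builds the full path with the right separator
    txt_paths = [os.path.join(tmp_dir, n) for n in txt_names]

print(txt_names)
```

Sorting is optional, but it makes the order of the loaded documents the same on every run and every machine.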

For example, we can load each document in the negative directory using the `load_doc()` function to do the actual loading. Below, we define a `process_docs()` function to load all documents in a folder. 

Let's first read in these positive and negative files and store them as two lists of texts. To navigate the files, we can use Python's `os` module. 

In [4]:
from os import listdir 
# load all docs in a directory
def process_docs(directory):
    # walk through all files in the folder
    docs=[] # a list of review texts
    for filename in listdir(directory):
        # skip files that do not have the right extension
        if not filename.endswith(".txt"):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load document
        doc = load_doc(path)
        print('Loaded %s' % filename)
        docs.append(doc)
    return docs
 
# specify directory to load
directory = 'data/neg'
docs=process_docs(directory)

Loaded cv676_22202.txt
Loaded cv839_22807.txt
Loaded cv155_7845.txt
Loaded cv465_23401.txt
Loaded cv398_17047.txt
Loaded cv206_15893.txt
Loaded cv037_19798.txt
Loaded cv279_19452.txt
Loaded cv646_16817.txt
Loaded cv756_23676.txt
Loaded cv823_17055.txt
Loaded cv747_18189.txt
Loaded cv258_5627.txt
Loaded cv948_25870.txt
Loaded cv744_10091.txt
Loaded cv754_7709.txt
Loaded cv838_25886.txt
Loaded cv131_11568.txt
Loaded cv401_13758.txt
Loaded cv523_18285.txt
Loaded cv073_23039.txt
Loaded cv688_7884.txt
Loaded cv664_4264.txt
Loaded cv461_21124.txt
Loaded cv909_9973.txt
Loaded cv939_11247.txt
Loaded cv368_11090.txt
Loaded cv185_28372.txt
Loaded cv749_18960.txt
Loaded cv836_14311.txt
Loaded cv322_21820.txt
Loaded cv789_12991.txt
Loaded cv617_9561.txt
Loaded cv288_20212.txt
Loaded cv464_17076.txt
Loaded cv904_25663.txt
Loaded cv866_29447.txt
Loaded cv429_7937.txt
Loaded cv212_10054.txt
Loaded cv007_4992.txt
Loaded cv522_5418.txt
Loaded cv109_22599.txt
Loaded cv753_11812.txt
Loaded cv312_29308.tx

Loaded cv959_16218.txt
Loaded cv004_12641.txt
Loaded cv706_25883.txt
Loaded cv579_12542.txt
Loaded cv947_11316.txt
Loaded cv929_1841.txt
Loaded cv713_29002.txt
Loaded cv226_26692.txt
Loaded cv561_9484.txt
Loaded cv951_11816.txt
Loaded cv495_16121.txt
Loaded cv420_28631.txt
Loaded cv761_13769.txt
Loaded cv346_19198.txt
Loaded cv106_18379.txt
Loaded cv389_9611.txt
Loaded cv488_21453.txt
Loaded cv850_18185.txt
Loaded cv129_18373.txt
Loaded cv436_20564.txt
Loaded cv032_23718.txt
Loaded cv087_2145.txt
Loaded cv075_6250.txt
Loaded cv303_27366.txt
Loaded cv970_19532.txt
Loaded cv174_9735.txt
Loaded cv700_23163.txt
Loaded cv690_5425.txt
Loaded cv781_5358.txt
Loaded cv910_21930.txt
Loaded cv571_29292.txt
Loaded cv890_3515.txt
Loaded cv047_18725.txt
Loaded cv605_12730.txt
Loaded cv454_21961.txt
Loaded cv280_8651.txt
Loaded cv869_24782.txt
Loaded cv920_29423.txt
Loaded cv414_11161.txt
Loaded cv181_16083.txt
Loaded cv433_10443.txt
Loaded cv693_19147.txt
Loaded cv472_29140.txt
Loaded cv990_12443.tx

Loaded cv551_11214.txt
Loaded cv402_16097.txt
Loaded cv871_25971.txt
Loaded cv878_17204.txt
Loaded cv989_17297.txt
Loaded cv719_5581.txt
Loaded cv084_15183.txt
Loaded cv188_20687.txt
Loaded cv766_7983.txt
Loaded cv857_17527.txt
Loaded cv340_14776.txt
Loaded cv872_13710.txt
Loaded cv771_28466.txt
Loaded cv230_7913.txt
Loaded cv559_0057.txt
Loaded cv266_26644.txt
Loaded cv125_9636.txt
Loaded cv566_8967.txt
Loaded cv628_20758.txt
Loaded cv218_25651.txt
Loaded cv353_19197.txt
Loaded cv417_14653.txt
Loaded cv999_14636.txt
Loaded cv265_11625.txt
Loaded cv462_20788.txt
Loaded cv093_15606.txt
Loaded cv897_11703.txt
Loaded cv899_17812.txt
Loaded cv068_14810.txt
Loaded cv536_27221.txt
Loaded cv568_17065.txt
Loaded cv940_18935.txt
Loaded cv399_28593.txt
Loaded cv438_8500.txt
Loaded cv683_13047.txt
Loaded cv211_9955.txt
Loaded cv663_14484.txt
Loaded cv662_14791.txt
Loaded cv245_8938.txt
Loaded cv116_28734.txt
Loaded cv594_11945.txt
Loaded cv099_11189.txt
Loaded cv427_11693.txt
Loaded cv510_24758.t

### ❓ Quiz  

Use the predefined `process_docs()` function to read in the negative and positive reviews. How many reviews are there for each class? 

You can write your code below:


<hr>    <!-- please remember this! -->
<details>
  <summary>Click <b>here</b> to see the answer.</summary>
    <p>1000 positive and 1000 negative reviews</p>


</details> 



## Feature Extraction


In this section, we will look at cleaning and extracting features from the movie review data. In machine learning, a feature is a measurable piece of data that can be used for analysis. For sentiment analysis, we can extract unigrams (bag of words) as features to predict the sentiment polarity. Since models can only work with numbers rather than words, we also need to convert words into some form of numerical values, which we will cover in detail later. 

We will start by splitting the text to extract unigrams. 


### Pre-processing

First, let’s load one document and look at the raw tokens split by white space. We will use the load_doc() function developed in the previous section. We can use the split() function to split the loaded document into tokens separated by white space.

In [5]:
# load the document
filename = 'data/neg/cv000_29416.txt'
text = load_doc(filename)
# split into tokens by white space
tokens = text.split()
print(tokens)

['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party', ',', 'drink', 'and', 'then', 'drive', '.', 'they', 'get', 'into', 'an', 'accident', '.', 'one', 'of', 'the', 'guys', 'dies', ',', 'but', 'his', 'girlfriend', 'continues', 'to', 'see', 'him', 'in', 'her', 'life', ',', 'and', 'has', 'nightmares', '.', "what's", 'the', 'deal', '?', 'watch', 'the', 'movie', 'and', '"', 'sorta', '"', 'find', 'out', '.', '.', '.', 'critique', ':', 'a', 'mind-fuck', 'movie', 'for', 'the', 'teen', 'generation', 'that', 'touches', 'on', 'a', 'very', 'cool', 'idea', ',', 'but', 'presents', 'it', 'in', 'a', 'very', 'bad', 'package', '.', 'which', 'is', 'what', 'makes', 'this', 'review', 'an', 'even', 'harder', 'one', 'to', 'write', ',', 'since', 'i', 'generally', 'applaud', 'films', 'which', 'attempt', 'to', 'break', 'the', 'mold', ',', 'mess', 'with', 'your', 'head', 'and', 'such', '(', 'lost', 'highway', '&', 'memento', ')', ',', 'but', 'there', 'are', 'good', 'and', 'bad', 'ways', 'of'

Just looking at the raw tokens can give us a lot of ideas of things to try, such as:

- Removing tokens that are just punctuation (e.g. ‘-’).

- Removing tokens that contain numbers (e.g. ‘10/10’).

- Removing tokens that don’t have much meaning (e.g. ‘and’).

Some ideas:

- We can remove tokens that are just punctuation or contain numbers by calling the string method `isalpha()` on each token.
- We can remove English stop words using the list loaded using `NLTK`.
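
The two ideas can be sketched on a handful of tokens. The stop-word set below is a tiny hand-written stand-in used purely for illustration; it is not the NLTK list.

```python
# toy stop-word set for illustration only; the real list comes from NLTK
toy_stop_words = {'and', 'the', 'a', 'to'}

raw_tokens = ['plot', ':', 'two', 'teen', 'couples', ',', '10/10',
              'and', "what's", 'the', 'deal', '?']

# str.isalpha() is True only if every character is a letter,
# so ':', '10/10' and "what's" are all dropped
alpha_tokens = [t for t in raw_tokens if t.isalpha()]
# drop the stop words
content_tokens = [t for t in alpha_tokens if t not in toy_stop_words]
print(content_tokens)
```

Note that `isalpha()` also discards contractions and hyphenated words, because the apostrophe and hyphen are not letters.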

We add the above preprocessing steps:

In [6]:
# install nltk using pip in the jupyter notebook
# and download the stopword lists
!pip install nltk
!python -m nltk.downloader stopwords 

# import the stopwords
from nltk.corpus import stopwords


You are using pip version 9.0.1, however version 20.2.4 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/liuqianchu/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [7]:
import string

# load the document
filename = 'data/neg/cv000_29416.txt'
text = load_doc(filename)
# split into tokens by white space
tokens = text.split()
# remove tokens that are not alphabetic
tokens = [word for word in tokens if word.isalpha()]
# filter out stop words
stop_words = set(stopwords.words('english'))
tokens = [w for w in tokens if not w in stop_words]
print(tokens)

['plot', 'two', 'teen', 'couples', 'go', 'church', 'party', 'drink', 'drive', 'get', 'accident', 'one', 'guys', 'dies', 'girlfriend', 'continues', 'see', 'life', 'nightmares', 'deal', 'watch', 'movie', 'sorta', 'find', 'critique', 'movie', 'teen', 'generation', 'touches', 'cool', 'idea', 'presents', 'bad', 'package', 'makes', 'review', 'even', 'harder', 'one', 'write', 'since', 'generally', 'applaud', 'films', 'attempt', 'break', 'mold', 'mess', 'head', 'lost', 'highway', 'memento', 'good', 'bad', 'ways', 'making', 'types', 'films', 'folks', 'snag', 'one', 'correctly', 'seem', 'taken', 'pretty', 'neat', 'concept', 'executed', 'terribly', 'problems', 'movie', 'well', 'main', 'problem', 'simply', 'jumbled', 'starts', 'normal', 'downshifts', 'fantasy', 'world', 'audience', 'member', 'idea', 'going', 'dreams', 'characters', 'coming', 'back', 'dead', 'others', 'look', 'like', 'dead', 'strange', 'apparitions', 'disappearances', 'looooot', 'chase', 'scenes', 'tons', 'weird', 'things', 'happen

As we are mainly interested in the frequency and presence of each unigram, we can store the unigram features using the `Counter` dictionary from the `collections` module. Let's write a function `tokens_to_dict()` that turns a list of unigram tokens into a counter dictionary of unigrams. 
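
Before writing the function, here is a small sketch of how a `Counter` behaves, using a made-up token list. It suggests that counting in a loop and passing the list straight to the constructor give the same result.

```python
from collections import Counter

toy_tokens = ['movie', 'bad', 'movie', 'teen', 'bad', 'movie']

# counting in an explicit loop; missing keys start at 0
loop_counts = Counter()
for token in toy_tokens:
    loop_counts[token] += 1

# passing the list to the constructor gives the same counts
direct_counts = Counter(toy_tokens)

print(loop_counts == direct_counts)
print(direct_counts.most_common(1))
```

`most_common(n)` returns the `n` entries with the highest counts, which is handy for inspecting a vocabulary.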

In [8]:
from collections import Counter
def tokens_to_dict(tokens):
    token2count=Counter()
    for token in tokens:
        token2count[token]+=1
    return token2count
tokens_dict=tokens_to_dict(tokens)
print (tokens_dict)

Counter({'movie': 6, 'pretty': 5, 'film': 5, 'make': 5, 'teen': 4, 'get': 3, 'one': 3, 'even': 3, 'like': 3, 'two': 2, 'go': 2, 'see': 2, 'cool': 2, 'idea': 2, 'bad': 2, 'since': 2, 'films': 2, 'lost': 2, 'highway': 2, 'memento': 2, 'good': 2, 'problem': 2, 'simply': 2, 'world': 2, 'audience': 2, 'going': 2, 'coming': 2, 'dead': 2, 'others': 2, 'look': 2, 'scenes': 2, 'things': 2, 'biggest': 2, 'secret': 2, 'hide': 2, 'minutes': 2, 'entertaining': 2, 'really': 2, 'part': 2, 'actually': 2, 'strangeness': 2, 'little': 2, 'sense': 2, 'still': 2, 'guess': 2, 'sagemiller': 2, 'away': 2, 'throughout': 2, 'apparently': 2, 'way': 2, 'crow': 2, 'plot': 1, 'couples': 1, 'church': 1, 'party': 1, 'drink': 1, 'drive': 1, 'accident': 1, 'guys': 1, 'dies': 1, 'girlfriend': 1, 'continues': 1, 'life': 1, 'nightmares': 1, 'deal': 1, 'watch': 1, 'sorta': 1, 'find': 1, 'critique': 1, 'generation': 1, 'touches': 1, 'presents': 1, 'package': 1, 'makes': 1, 'review': 1, 'harder': 1, 'write': 1, 'generally': 

We can put this into a function called `clean_doc_unigrams()` and test it on another review, this time a positive review.


In [9]:
# turn a doc into clean tokens
def clean_doc_unigrams(doc):
    # split into tokens by white space
    tokens = doc.split()
    # remove tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if not w in stop_words]
    tokens_dict=tokens_to_dict(tokens)
    return tokens_dict
 
# load the document
filename = 'data/pos/cv000_29590.txt'
text = load_doc(filename)
tokens_dict = clean_doc_unigrams(text)
# print the unigram counts for this positive review
print(tokens_dict)

Finally, we can put the above preprocessing steps into a function `process_docs_unigrams()` to process all the files in a directory. 


In [10]:
# load all docs in a directory
def process_docs_unigrams(directory):
    # walk through all files in the folder
    print (directory)
    for filename in listdir(directory):
        # skip files that do not have the right extension
        if not filename.endswith(".txt"):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load document
        text=load_doc(path)
        # clean documents
        tokens_dict_current = clean_doc_unigrams(text)
process_docs_unigrams('./data/neg')
process_docs_unigrams('./data/pos')

./data/neg
./data/pos


### Feature Vectors

To prepare the features as input to the model, we need to convert each word token into a numerical value so that the model can process it. To represent unigrams, we can create a vector (i.e. a list of numbers) whose dimension is the size of the vocabulary, where the value in each dimension stores the frequency or presence of a specific word. 


For example, suppose we have five words (apple, banana, red, dog, is) in the vocabulary, which are represented as five dimensions in the feature vectors. We also have a document (document 1) containing the following words: 

document 1: "apple is red"

We add '1' for the words present and '0' for words not present in the document, and create the following: 

|document no.|apple|banana|red|dog|is|
|------|------|------|------|------|------|
|document 1 |1|0|1|0|1|

We thus can represent the document as a feature vector [1,0,1,0,1]
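
The table above can be reproduced with a short sketch. The variable names are illustrative, and a plain Python list is used here rather than the arrays introduced later.

```python
vocabulary = ['apple', 'banana', 'red', 'dog', 'is']
document_1 = 'apple is red'

words_in_doc = set(document_1.split())
# 1 if the vocabulary word occurs in the document, 0 otherwise
feature_vector = [1 if word in words_in_doc else 0 for word in vocabulary]
print(feature_vector)
```

The position of each value is fixed by the order of the vocabulary, which is why a word-to-index mapping is needed.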

The first step now is to create a mapping between the dimension index of the vector and the word in the vocabulary. Let's first loop over the documents to collect all the words. Here, we use a dictionary that updates the words and their counts while we process all the documents. To do this, we can define a function `update_counter()` to collect the words and their counts from the token dictionary of each document:

In [11]:
def update_counter(overall_vocab_counter,token_dict):
    for w in token_dict:
        overall_vocab_counter[w]+=token_dict[w]
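
For comparison, `Counter` has a built-in `update()` method that adds counts in the same way as the loop above. The sketch below uses two made-up document counters.

```python
from collections import Counter

overall = Counter()
doc_a = Counter({'film': 2, 'good': 1})
doc_b = Counter({'film': 1, 'bad': 3})

# Counter.update() adds counts rather than replacing them
overall.update(doc_a)
overall.update(doc_b)
print(overall)
```

Writing the loop by hand, as in `update_counter()`, makes the accumulation explicit; `update()` is simply a shorter way to express it.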

We can now define an `overall_vocab_counter` and then integrate the `update_counter()` function into `process_docs_unigrams()` to update `overall_vocab_counter`. 

In [12]:
from collections import Counter
overall_vocab_counter=Counter()
def process_docs_unigrams(directory,overall_vocab_counter):
    # walk through all files in the folder
    print (directory)
    for filename in listdir(directory):
        # skip files that do not have the right extension
        if not filename.endswith(".txt"):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load document
        text=load_doc(path)
        # clean documents
        tokens_dict_current = clean_doc_unigrams(text)
        update_counter(overall_vocab_counter,tokens_dict_current)
process_docs_unigrams('./data/neg',overall_vocab_counter)
process_docs_unigrams('./data/pos',overall_vocab_counter)


./data/neg
./data/pos


### ❓ Quiz  

Let's check the compiled `overall_vocab_counter`. How many distinct words are there in total after preprocessing? And what are the top 5 most frequent words?

You can write your code below:


<hr>    <!-- please remember this! -->
<details>
  <summary>Click <b>here</b> to see the answer.</summary>
    <p>There are 37607 words in total</p>
    <p>The 5 most frequent words are: 
        ('film', 8849),
 ('one', 5514),
 ('movie', 5429),
 ('like', 3543),
 ('even', 2554),</p>


</details> 



Now let's create the mapping between the vocabulary and the feature vector dimension index from this `overall_vocab_counter`. We can define a function `create_vocab_feature_mappings()` to do this:

In [13]:
def create_vocab_feature_mappings(overall_vocab_counter):
    vocab2index={}
    index2vocab={}
    for i,w in enumerate(overall_vocab_counter.keys()): # iterate through the words in the vocabulary
        vocab2index[w]=i
        index2vocab[i]=w
    return vocab2index,index2vocab
vocab2index,index2vocab=create_vocab_feature_mappings(overall_vocab_counter)
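
The two mappings are inverses of each other. A compact sketch on a made-up counter, using dictionary comprehensions instead of the explicit loop, illustrates the round trip; the `toy_` names are hypothetical.

```python
from collections import Counter

toy_counter = Counter({'film': 3, 'good': 1, 'bad': 2})

# enumerate assigns 0, 1, 2, ... in the counter's insertion order
toy_vocab2index = {w: i for i, w in enumerate(toy_counter)}
toy_index2vocab = {i: w for w, i in toy_vocab2index.items()}

print(toy_vocab2index)
```

Looking a word up in one mapping and the result in the other returns the original word, which is what lets us decode a feature vector back into words later.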

Now, let's turn each document into a unigram feature vector. In each dimension, we can either store the frequency of the word, or use 1 or 0 to represent whether the word occurs or not. 


Let's design two functions `create_feature_presence()` and `create_feature_frequency()`  to turn the token dictionary for each document into these two types of features. 

To create `create_feature_presence()`, we can do the following:

In [14]:
import numpy as np
def create_feature_presence(tokens_dict_current,vocab2index):
    # create a numpy array with dimension size of the vocabulary size
    vector=np.zeros(len(vocab2index))
    for w in tokens_dict_current:
        index=vocab2index[w]
        vector[index]=1
    return vector
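
`create_feature_presence()` looks up every word with `vocab2index[w]`, which raises a `KeyError` for a word that is not in the vocabulary. That cannot happen here because the vocabulary is built from the same documents, but it would for unseen reviews. A possible variant that skips unknown words is sketched below; the name `create_feature_presence_safe` and the toy data are assumptions for illustration.

```python
import numpy as np

def create_feature_presence_safe(tokens_dict, vocab2index):
    vector = np.zeros(len(vocab2index))
    for w in tokens_dict:
        index = vocab2index.get(w)  # None if w is not in the vocabulary
        if index is not None:
            vector[index] = 1
    return vector

toy_vocab2index = {'apple': 0, 'banana': 1, 'red': 2}
toy_doc = {'apple': 2, 'red': 1, 'kiwi': 1}  # 'kiwi' is out of vocabulary
safe_vector = create_feature_presence_safe(toy_doc, toy_vocab2index)
print(safe_vector)
```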


### Numpy arrays

Notice that we have created `numpy` arrays here to represent feature vectors. A `numpy` array is similar to a `list`, but all of its elements share one data type and are stored compactly, so it typically uses less memory and supports faster numerical operations. 

Below, we introduce several ways to create a numpy array:

In [15]:
# create a numpy array of zeros with length 2
vector1=np.zeros(2)
# create a numpy array from a list [1,2,3]
vector2=np.array([1,2,3])
# create an uninitialized array of length 3; its contents are arbitrary
vector3=np.empty(3)
print ('vector1',vector1)
print ('vector2',vector2)
print ('vector3',vector3)

vector1 [0. 0.]
vector2 [1 2 3]
vector3 [3. 2. 1.]


We can also concatenate two numpy arrays of the same dimension

In [16]:
np.concatenate((vector2,vector3))

array([1., 2., 3., 3., 2., 1.])
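
The result above is printed with decimal points because `vector3` holds floats (the default type for `np.empty()`), so the integers from `vector2` are converted to floats when the arrays are joined. A small sketch with explicitly typed toy arrays shows the same effect:

```python
import numpy as np

int_vector = np.array([1, 2, 3])          # integer dtype
float_vector = np.array([3.0, 2.0, 1.0])  # float64 dtype

joined = np.concatenate((int_vector, float_vector))
print(joined.dtype)
print(joined)
```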

So far, we have created numpy arrays of one dimension. Let's try creating a 2-D array (also called a matrix). We pass the shape as a tuple (a,b), where a is the number of rows in the matrix and b is the number of columns; each dimension is also called an axis. 

In [17]:
matrix1=np.zeros((3,3)) 
# this is a matrix of zeros that has 3 vectors (rows), and within each vector there are 3 items
print ('matrix1',matrix1)
# a matrix from a nested list
matrix2=np.array([[1,2,3],[2,3,4]])
print ('matrix2',matrix2)
# a matrix with 0 rows and 4 columns, often used as an initial value; it prints as an empty list
matrix3=np.empty((0,4))
print ('matrix3',matrix3)
# to check the size of each axis of an array, retrieve its shape attribute like this:
print (matrix2.shape) 


matrix1 [[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]
matrix2 [[1 2 3]
 [2 3 4]]
matrix3 []
(2, 3)
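
To illustrate why a matrix with zero rows is useful as a starting point, the sketch below grows one row at a time with `np.vstack()`. The values are made up.

```python
import numpy as np

rows = np.empty((0, 4))  # 0 rows, 4 columns
for values in ([1, 2, 3, 4], [5, 6, 7, 8]):
    # vstack returns a new array with the extra row at the bottom
    rows = np.vstack((rows, np.array(values)))

print(rows.shape)
print(rows)
```

Each call to `np.vstack()` copies the whole array, so when the final number of rows is known in advance it is usually cheaper to allocate the full matrix once and fill it in.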


Numpy arrays are mutable. Therefore, we can change values in the vector. For example:

In [18]:
vector2[0]=0 # change the first item in vector2 to 0
print (vector2)
matrix2[0][2]=0 # change the third item of the first vector to 0
print (matrix2)
matrix2[0]=vector2 # change the first vector in matrix2 to vector2
print (matrix2)


[0 2 3]
[[1 2 0]
 [2 3 4]]
[[0 2 3]
 [2 3 4]]


We can also select several items at once by indexing a numpy array with a list of indices. 

In [19]:
#Let's select the values at index 1,2 of vector2
vector2[[1,2]]

array([2, 3])
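
A related technique is indexing with a boolean mask, which selects the positions where a condition holds. The sketch below uses a made-up presence vector and a hypothetical `toy_index2vocab` mapping to recover which words are present.

```python
import numpy as np

presence = np.array([1., 0., 1., 0., 1.])
toy_index2vocab = {0: 'apple', 1: 'banana', 2: 'red', 3: 'dog', 4: 'is'}

mask = presence == 1                    # array of True/False values
present_indices = np.flatnonzero(mask)  # positions where mask is True
present_words = [toy_index2vocab[int(i)] for i in present_indices]
print(present_words)
```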

### ❓ Quiz  

Can you try implementing the function `create_feature_frequency()` to extract unigram frequencies? (You can base it on `create_feature_presence()`.)
You can write your code below:

In [20]:
def create_feature_frequency(tokens_dict_current,vocab2index):
    # create a numpy array with dimension size of the vocabulary size
    vector=np.zeros(len(vocab2index))
    for w in tokens_dict_current:
        index=vocab2index[w]
        vector[index]=tokens_dict_current[w]
    return vector


<hr>    <!-- please remember this! -->
<details>
  <summary>Click <b>here</b> to see the answer.</summary>
    <p>Simply change the line vector[index]=1 from create_feature_presence() to vector[index]=tokens_dict_current[w]</p>

   


</details> 



Now let's create another function `process_docs_unigrams_presence()` on the basis of `process_docs_unigrams()` to loop over the documents again and convert the token dictionary in each document to feature vectors. 

In [21]:
def process_docs_unigrams_presence(directory,vocab2index):
    # loop over the directory to extract filenames that have the right extension
    filenames=[filename for filename in listdir(directory) if filename.endswith(".txt")]
    # since we know how many files we will process, and the vocabulary size, we can initialize our result matrix as an empty array with the shape of (file number, vocabulary size)
    
    unigram_presence_result=np.empty((len(filenames),len(vocab2index))) #initialize the result array as an empty array ready to be appended through the loop. 
    # walk through all files in the folder
    print (directory)
    for file_i,filename in enumerate(filenames):
        
        # create the full path of the file to open
        path = directory + '/' + filename
        # load document
        text=load_doc(path)
        # clean documents
        tokens_dict_current = clean_doc_unigrams(text)
        # convert bag of words in each document into unigram features
        unigram_presence=create_feature_presence(tokens_dict_current,vocab2index)
        # assign the unigram feature vector of the current document to row file_i of the unigram_presence_result matrix
        unigram_presence_result[file_i]=unigram_presence
    return unigram_presence_result
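
A note on memory: numpy's default `float64` type uses 8 bytes per cell. Assuming 1000 reviews per class and the 37607 vocabulary entries reported in the quiz answers, one feature matrix would take roughly 1000 × 37607 × 8 bytes, or about 300 MB. Since presence features are only 0 or 1, a one-byte type is one possible way to shrink this. The sketch below compares the two on toy sizes; whether a smaller type suits the later modelling step is an assumption you would need to check.

```python
import numpy as np

n_docs, vocab_size = 4, 10  # toy sizes; the real matrix is far larger

default_matrix = np.zeros((n_docs, vocab_size))                  # float64
compact_matrix = np.zeros((n_docs, vocab_size), dtype=np.uint8)  # 1 byte per cell

print(default_matrix.nbytes, compact_matrix.nbytes)
```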

Now we are ready to extract features from both the negative and positive directories:

In [22]:
unigram_presence_positive=process_docs_unigrams_presence('./data/pos',vocab2index)
unigram_presence_negative=process_docs_unigrams_presence('./data/neg',vocab2index)


./data/pos
./data/neg


Let's take a look at the feature representation for the second negative review:

In [23]:
unigram_presence_negative[1]

array([1., 1., 0., ..., 0., 0., 0.])

### ❓ Quiz  

Can you follow the code above to extract unigram frequency features using the `create_feature_frequency()` function you defined in the previous quiz? Please store the features in `unigram_frequency_positive` and `unigram_frequency_negative` for the positive and negative directories respectively. 

You can write your code below:

### ❓ Quiz  

Let's check the first positive review's unigram presence features. Can you use the mapping in `index2vocab` to reveal which unigrams are present in this review? 
Which words in the following list are present?

A. gayness

B. fabulous

C. snappiness

D. happiness

You can write your code below:

In [24]:
a=[index2vocab[i] for i,item in enumerate(unigram_presence_positive[0]) if item==1]
'happiness' in a

False


<hr>    <!-- please remember this! -->
<details>
  <summary>Click <b>here</b> to see the answer.</summary>
    <p> Code:</p>
    <p>wordlist=[index2vocab[i] for i,item in enumerate(unigram_presence_positive[0]) if item==1]</p>
    <p> A,C are present in the review </p>

   


</details> 



### Save the features

We can also save the prepared features for the reviews ready for modeling.

This is a good practice as it decouples the data preparation from modeling, allowing you to focus on modeling and circle back to data preparation if you have new ideas.

To store a numpy array, we can use `numpy`'s `save()` function. `save()` takes two arguments: the first is the filename to be written, and the second is the numpy array. 

In [25]:

np.save('unigram_presence_positive.npy',unigram_presence_positive)
np.save('unigram_presence_negative.npy',unigram_presence_negative)


We can load the positive data from the file with:
    

In [26]:
np.load('unigram_presence_positive.npy')


array([[0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [1., 1., 0., ..., 0., 0., 0.],
       ...,
       [0., 1., 0., ..., 0., 0., 0.],
       [1., 1., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 1., 1., 1.]])
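
A quick way to confirm that saving and loading preserve the data is a round trip on a small array. The sketch below writes to a temporary folder so it does not touch the real feature files; the array and file name are made up.

```python
import os
import tempfile
import numpy as np

toy_features = np.array([[0., 1., 1.], [1., 0., 0.]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_features.npy')
    np.save(path, toy_features)   # write the array to disk
    reloaded = np.load(path)      # read it back

print(np.array_equal(toy_features, reloaded))
```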

### Bigrams

Based on `clean_doc_unigrams()`, which turns a document into a unigram dictionary, we can create a function `clean_doc_bigrams()` to extract a bigram dictionary. (You can refresh your memory of how to create a bigram counter dictionary in module 1.4.)

In [27]:
# turn a doc into clean tokens
def clean_doc_bigrams(doc):
    # split into tokens by white space
    tokens = doc.split()
    # remove tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    
    # extract bigrams
    bigram_dict=Counter() #initialize a bigram dictionary to be updated
    tokens=['<start>']+tokens+['<end>'] # add <start> and <end> token
    for i in range(len(tokens)): #loop over all the indices of the token list
        if i<len(tokens)-1: #if it's not the end of the token list
            bigram_current=(tokens[i],tokens[i+1])
            bigram_dict[bigram_current]+=1
    return bigram_dict
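
The index-based loop above pairs each token with the one that follows it. As a sketch of an equivalent formulation, `zip()` can pair a list with a copy of itself shifted by one; the token list here is made up.

```python
from collections import Counter

toy_tokens = ['<start>', 'a', 'very', 'bad', 'movie', '<end>']

# zip pairs each token with its successor and stops at the shorter list
toy_bigrams = Counter(zip(toy_tokens, toy_tokens[1:]))
print(toy_bigrams)
```

A list of n tokens always yields n-1 bigrams, so the six tokens here give five pairs.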

Now we can replace the `clean_doc_unigrams()` line with `clean_doc_bigrams()` in `process_docs_unigrams()`. We will rename the function `process_docs_bigrams()`; it counts all the bigrams in the documents. 

In [28]:
overall_bigrams_counter=Counter()
def process_docs_bigrams(directory,overall_bigrams_counter):
    # walk through all files in the folder
    print (directory)
    for filename in listdir(directory):
        # skip files that do not have the right extension
        if not filename.endswith(".txt"):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load document
        text=load_doc(path)
        # clean documents
        tokens_dict_current = clean_doc_bigrams(text)
        update_counter(overall_bigrams_counter,tokens_dict_current)
process_docs_bigrams('./data/neg',overall_bigrams_counter)
process_docs_bigrams('./data/pos',overall_bigrams_counter)

./data/neg
./data/pos


Once we have `overall_bigrams_counter`, we can pass it to `create_vocab_feature_mappings()` to create index-bigram mappings. 

In [29]:
bigram2index,index2bigram=create_vocab_feature_mappings(overall_bigrams_counter)
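
`create_vocab_feature_mappings()` and `create_feature_presence()` are defined earlier in this module. As a reminder of the idea behind them, here is a toy sketch that maps bigrams to indices and builds a presence vector. The helper names `toy_mappings` and `toy_presence`, and the choice to order indices by frequency, are assumptions made for illustration and may differ from the module's own functions.

```python
from collections import Counter

def toy_mappings(counter):
    """Assign each key an index (assumption: most frequent key first)."""
    item2index = {item: i for i, (item, _) in enumerate(counter.most_common())}
    index2item = {i: item for item, i in item2index.items()}
    return item2index, index2item

def toy_presence(doc_counter, item2index):
    """Return a 0/1 list: 1 where the document contains the feature."""
    vector = [0] * len(item2index)
    for item in doc_counter:
        if item in item2index:  # ignore features outside the vocabulary
            vector[item2index[item]] = 1
    return vector

# made-up corpus-level and document-level bigram counts
corpus_counts = Counter({('a', 'good'): 3, ('good', 'film'): 2, ('bad', 'film'): 1})
b2i, i2b = toy_mappings(corpus_counts)
doc_counts = Counter({('good', 'film'): 2, ('a', 'good'): 1, ('so', 'so'): 1})
vec = toy_presence(doc_counts, b2i)
print(vec)   # [1, 1, 0]
```

Note that a presence vector records only whether a bigram occurs, so the count of 2 for `('good', 'film')` still becomes a 1.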

Now we can modify `process_docs_unigrams_presence()` to process bigrams. 

In [30]:
def process_docs_bigrams_presence(directory,bigram2index):
    # loop over the directory to extract filenames that have the right extension
    filenames=[filename for filename in listdir(directory) if filename.endswith(".txt")]
    # since we know how many files we will process and the vocabulary size, we can allocate the result matrix up front with the shape (number of files, vocabulary size)
    
    bigram_presence_result=np.empty((len(filenames),len(bigram2index))) # uninitialized array; every row is overwritten in the loop below
    # walk through all files in the folder
    print (directory)
    for file_i,filename in enumerate(filenames):
        
        # create the full path of the file to open
        path = directory + '/' + filename
        # load document
        text=load_doc(path)
        # clean documents
        bigrams_dict_current = clean_doc_bigrams(text)
        # convert the bigram counts of the current document into bigram presence features
        bigram_presence=create_feature_presence(bigrams_dict_current,bigram2index)
        # assign the bigram presence vector of the current document to its row (file_i) in the bigram_presence_result matrix
        bigram_presence_result[file_i]=bigram_presence
    return bigram_presence_result
bigram_presence_positive=process_docs_bigrams_presence('./data/pos',bigram2index)
bigram_presence_negative=process_docs_bigrams_presence('./data/neg',bigram2index)


./data/pos
./data/neg


In [31]:
# We then save the bigram features:
np.save('bigram_presence_positive.npy',bigram_presence_positive)
np.save('bigram_presence_negative.npy',bigram_presence_negative)

### ❓ Quiz  

Now we have features in the form of bigrams. What does each dimension of a vector represent now? How many dimensions does each vector have? You can inspect `bigram_presence_positive` to answer these questions.



<hr>    <!-- please remember this! -->
<details>
  <summary>Click <b>here</b> to see the answer.</summary>
    <p>Each dimension corresponds to a bigram. There are 463119 bigrams, and therefore each vector has 463119 dimensions. </p>
  

   


</details> 
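
A vector this wide has a practical cost when stored densely. As a rough back-of-the-envelope sketch: `np.empty` defaults to 64-bit floats (8 bytes each), so the memory needed is rows × columns × 8 bytes. The figure of 1,000 reviews per class below is an assumption for illustration; substitute the number of files in your own `./data/pos` and `./data/neg` folders.

```python
n_bigrams = 463119      # vocabulary size from the quiz answer
n_reviews = 1000        # assumed number of reviews in one class (hypothetical)
bytes_per_value = 8     # float64, the default dtype of np.empty

dense_bytes = n_reviews * n_bigrams * bytes_per_value
dense_gb = dense_bytes / 1e9
print(f"{dense_gb:.1f} GB per class")   # 3.7 GB per class
```

If that is too much for your machine, one option is to pass `dtype=np.uint8` to `np.empty`, which stores each 0/1 presence value in a single byte.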



other features afterwards

#### Unigrams + POS

#### Adjective Unigrams (Quiz)

#### Unigrams above certain frequency threshold

#### Unigrams + Position

#### Unigrams + Bigrams (Quiz)

Negation (extension)

## Evaluation

### Prepare train/test split

Now it's time to prepare the dataset for model evaluation. We want to train a model that generalises to unseen data. Therefore, we split the dataset into a train set, on which the model is trained, and a test set of unseen examples, on which it is evaluated. A common choice is a 70%/30% train-test split. Let's make a random 70%/30% split of our data. To keep the proportion of positive and negative examples the same in both sets, we make the split separately within the positive and the negative data.

Let's first make the train-test split for the positive data.

In [32]:

# make the split from positive data
# get the indexes for the positive data
positive_indices=list(range(len(unigram_presence_positive)))
# randomly shuffle the indices
import random
random.shuffle(positive_indices)
# get the training and test data indices
train_positive_indices=positive_indices[:int(len(positive_indices)*(0.7))]
test_positive_indices=positive_indices[int(len(positive_indices)*(0.7)):]
# now let's use the indices to extract train and test features
unigram_presence_positive_train=unigram_presence_positive[train_positive_indices]
unigram_presence_positive_test=unigram_presence_positive[test_positive_indices]

We can wrap up the above into a function that takes feature data and returns train and test splits.

In [33]:
def train_test_split(features,train_percentage=0.7):
    # get the indexes for the feature data
    indices=list(range(len(features)))
    # randomly shuffle the indices
    random.shuffle(indices)
    # get the training and test data indices
    train_indices=indices[:int(len(indices)*(train_percentage))]
    test_indices=indices[int(len(indices)*(train_percentage)):]
    # now let's use the indices to extract train and test features
    feature_train=features[train_indices]
    feature_test=features[test_indices]
    return feature_train,feature_test
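
Because `random.shuffle` draws from the global random state, each call gives a different split. If you want a repeatable split while debugging, one option is to shuffle with a dedicated seeded generator. The sketch below uses plain lists and a hypothetical helper `seeded_split`; the `seed` parameter is an addition for illustration and is not part of the function above.

```python
import random

def seeded_split(items, train_percentage=0.7, seed=0):
    """Shuffle indices with a private, seeded generator and split them."""
    rng = random.Random(seed)          # independent of the global random state
    indices = list(range(len(items)))
    rng.shuffle(indices)
    cut = int(len(indices) * train_percentage)
    train = [items[i] for i in indices[:cut]]
    test = [items[i] for i in indices[cut:]]
    return train, test

toy_docs = [f"doc{i}" for i in range(10)]
train_a, test_a = seeded_split(toy_docs, seed=42)
train_b, test_b = seeded_split(toy_docs, seed=42)
print(len(train_a), len(test_a))   # 7 3
print(train_a == train_b)          # True: same seed, same split
```

Note that scikit-learn also provides a function called `train_test_split` in `sklearn.model_selection`; importing it would shadow the one defined above.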

Let's apply the function `train_test_split()` to both the positive and the negative unigram presence features. 

In [34]:
positive_train,positive_test=train_test_split(unigram_presence_positive)
negative_train,negative_test=train_test_split(unigram_presence_negative)


Now we can create the train and test sets for all the data by concatenating the positive and negative splits.

In [35]:

train_data=np.concatenate((positive_train,negative_train))
test_data=np.concatenate((positive_test,negative_test))


At the same time, we need to represent the labels (positive or negative) numerically so that the model can train on them. Let's create NumPy arrays of 1s and 0s, where 1 corresponds to positive reviews and 0 to negative reviews.

In [37]:
 
train_label=len(positive_train)*[1]+len(negative_train)*[0]
test_label=len(positive_test)*[1]+len(negative_test)*[0]
train_label=np.array(train_label)
test_label=np.array(test_label)

Let's wrap up the above into a function that takes both positive and negative feature data and splits them into train and test data and labels. 

In [38]:
def produce_data_label_splits(positive_feature,negative_feature):
    #create train and test splits on positive and negative data's features
    positive_train_feature,positive_test_feature=train_test_split(positive_feature)
    negative_train_feature,negative_test_feature=train_test_split(negative_feature)
    #concatenate positive and negative features into the total train test data
    train_feature=np.concatenate((positive_train_feature,negative_train_feature))
    test_feature=np.concatenate((positive_test_feature,negative_test_feature))
    #create label arrays
    train_label=len(positive_train_feature)*[1]+len(negative_train_feature)*[0]
    test_label=len(positive_test_feature)*[1]+len(negative_test_feature)*[0]
    train_label=np.array(train_label)
    test_label=np.array(test_label)
    return train_feature,test_feature,train_label,test_label
train_data,test_data,train_label,test_label=produce_data_label_splits(unigram_presence_positive,unigram_presence_negative)



### Naive Bayes Model and evaluation

Let's try fitting a Naive Bayes model to the features and labels, using the `MultinomialNB` class from `sklearn`. 

In [40]:
!pip install scikit-learn  # the PyPI package is named scikit-learn; it is imported as sklearn
from sklearn.naive_bayes import MultinomialNB
# initialize a multinomial naive bayes model
model = MultinomialNB()
# fit the model with features and labels for the training data
model.fit(train_data,train_label)
#evaluate the fitted model on the test set
predicted=model.predict(test_data)
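
To see roughly what happens inside `fit` and `predict`, here is a small pure-Python sketch of multinomial Naive Bayes with add-one (Laplace) smoothing, which corresponds to the default `alpha=1.0` of `MultinomialNB`. It is a teaching sketch on made-up data, not scikit-learn's implementation, and the helper names `fit_nb` and `predict_nb` are hypothetical.

```python
import math

def fit_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log feature probabilities per class."""
    classes = sorted(set(y))
    n_features = len(X[0])
    log_prior, log_prob = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # total count of each feature within class c
        counts = [sum(row[j] for row in rows) for j in range(n_features)]
        total = sum(counts) + alpha * n_features
        log_prob[c] = [math.log((cnt + alpha) / total) for cnt in counts]
    return classes, log_prior, log_prob

def predict_nb(model, X):
    """Pick, for each row, the class with the highest log posterior score."""
    classes, log_prior, log_prob = model
    predictions = []
    for x in X:
        scores = {c: log_prior[c] + sum(v * lp for v, lp in zip(x, log_prob[c]))
                  for c in classes}
        predictions.append(max(scores, key=scores.get))
    return predictions

# made-up presence features over the vocabulary [good, great, bad, boring]
toy_X = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 0]]
toy_y = [1, 1, 0, 0]
toy_model = fit_nb(toy_X, toy_y)
print(predict_nb(toy_model, [[0, 1, 0, 0], [0, 0, 0, 1]]))   # [1, 0]
```

Working in log space turns the product of probabilities from Bayes' rule into a sum, which avoids numerical underflow when there are many features.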


`predicted` is now an array of predictions in 1s and 0s. Let's compare it with the gold labels and calculate accuracy as follows:

$\text{accuracy} = \frac{\text{number of correct examples}}{\text{total number of examples}}$

Let's loop over the examples to count the correct ones.

In [41]:
correct=0
for i in range(len(predicted)):
    if predicted[i]==test_label[i]: #if the predicted result is the same as the gold label
        correct+=1
acc=correct/len(predicted)

print ('accuracy',acc)

accuracy 0.8133333333333334
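
The counting loop can be condensed: comparing predictions and gold labels pairwise and averaging the matches gives the same number. Here is a minimal sketch with made-up labels, using a hypothetical helper `accuracy` (for NumPy arrays, `np.mean(predicted == test_label)` computes the same quantity):

```python
def accuracy(predicted, gold):
    """Fraction of positions where the prediction equals the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    matches = sum(p == g for p, g in zip(predicted, gold))
    return matches / len(gold)

toy_predicted = [1, 0, 1, 1, 0]
toy_gold      = [1, 0, 0, 1, 1]
print(accuracy(toy_predicted, toy_gold))   # 0.6
```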


Let's wrap up the above into a function that fits the Naive Bayes model and evaluates it.

In [42]:
def evaluate(train_data,train_label,test_data,test_label):
    # initialize a multinomial naive bayes model
    model = MultinomialNB()
    # fit the model with features and labels for the training data
    model.fit(train_data,train_label)
    #evaluate the fitted model on the test set
    predicted=model.predict(test_data)
    #evaluation:
    correct=0
    for i in range(len(predicted)):
        if predicted[i]==test_label[i]: #if the predicted result is the same as the gold label
            correct+=1
    acc=correct/len(predicted)


    print ('accuracy',acc)
    return acc

To run the train-test preparation and the model evaluation, we can do the following:

In [43]:
train_data,test_data,train_label,test_label=produce_data_label_splits(unigram_presence_positive,unigram_presence_negative)
acc=evaluate(train_data,train_label,test_data,test_label)


accuracy 0.8533333333333334


### ❓ Quiz  

Every time we run `produce_data_label_splits()`, we make a new random split of the dataset. Try running the function on the unigram presence features 10 times, evaluate each split with `evaluate()`, and report the mean accuracy and its standard deviation. Does the result change a lot? 

In [44]:
accs=[]
for _ in range(10):
    train_data,test_data,train_label,test_label=produce_data_label_splits(unigram_presence_positive,unigram_presence_negative)
    acc=evaluate(train_data,train_label,test_data,test_label)
    accs.append(acc)
acc_mean=np.mean(np.array(accs))
acc_std=np.std(np.array(accs))
print ('mean',acc_mean,'std',acc_std)

accuracy 0.825
accuracy 0.8183333333333334
accuracy 0.8383333333333334
accuracy 0.85


KeyboardInterrupt: 


<hr>    <!-- please remember this! -->
<details>
  <summary>Click <b>here</b> to see the answer.</summary>
    <p>The following result is from 10 random runs and your result may vary a little from this:</p>
    <p>mean: 0.8388; standard deviation: 0.01 </p>
  

   


</details> 
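
A note on the standard deviation: `np.std` computes the population standard deviation by default (`ddof=0`); with only ten runs, the sample standard deviation (`ddof=1`) is slightly larger. The sketch below shows both using Python's `statistics` module. The accuracy values are made up for illustration and are not results from this dataset.

```python
import statistics

toy_accs = [0.80, 0.82, 0.84, 0.86]      # illustrative values only

mean_acc = statistics.mean(toy_accs)
pop_std = statistics.pstdev(toy_accs)    # population std, like np.std(toy_accs)
sample_std = statistics.stdev(toy_accs)  # sample std, like np.std(toy_accs, ddof=1)

print(f"mean {mean_acc:.3f}  population std {pop_std:.4f}  sample std {sample_std:.4f}")
```

Whichever variant you report, it helps to say which one it is so that results stay comparable.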



### ❓ Quiz  

So far we have trained and evaluated a model on the unigram presence features (`unigram_presence_positive` and `unigram_presence_negative`). Can you build another model on the bigram presence features? (Hint: you can simply change the arguments to `produce_data_label_splits()`.)
Write your code below and report the accuracy.


<hr>    <!-- please remember this! -->
<details>
  <summary>Click <b>here</b> to see the answer.</summary>
    <p>train_data,test_data,train_label,test_label=produce_data_label_splits(bigram_presence_positive,bigram_presence_negative)</p>
 <p>acc=evaluate(train_data,train_label,test_data,test_label)</p>
    <p>One run of the model gives an accuracy of 0.84. Remember that your result can be a bit different due to the random split.  </p>
  

   


</details> 



## Cross validation

## Other features