# Sentiment Classification the old-fashioned way: 
## `Naive Bayes`, `Logistic Regression`, and `Ngrams`


I don't use the `fastai` library, so this notebook is my own version built on more common libraries.

The purpose of this notebook is to show how sentiment classification is done via the classic techniques of `Naive Bayes`, `Logistic regression`, and `Ngrams`.  We will be using `sklearn` and the `fastai` library.

In a future lesson, we will revisit sentiment classification using `deep learning`, so that you can compare the two approaches.

The content here was extended from [Lesson 10 of the fast.ai Machine Learning course](https://course.fast.ai/lessonsml1/lesson10.html). A linear model is pretty close to the state of the art here. Jeremy surpassed the state of the art using an RNN in fall 2017.

## 0. The fastai library

We will begin using [the fastai library](https://docs.fast.ai) (version 1.0) in this notebook.  We will use it more once we move on to neural networks.

The fastai library is built on top of PyTorch and encodes many state-of-the-art best practices. It is used in production at a number of companies.  You can read more about it here:

- [Fast.ai's software could radically democratize AI](https://www.zdnet.com/article/fast-ais-new-software-could-radically-democratize-ai/) (ZDNet)

- [fastai v1 for PyTorch: Fast and accurate neural nets using modern best practices](https://www.fast.ai/2018/10/02/fastai-ai/) (fast.ai)

- [fastai docs](https://docs.fast.ai/)

### Installation

With conda:

`conda install -c pytorch -c fastai fastai=1.0`

Or with pip:

`pip install fastai==1.0`

More [installation information here](https://github.com/fastai/fastai/blob/master/README.md).

Beginning in lesson 4, we will be using GPUs, so if you want, you could switch to a [cloud option](https://course.fast.ai/#using-a-gpu) now to set up fastai.

## 1. The IMDB dataset

<img src="IMDb.png" alt="floating point" style="width: 90%"/>

The [large movie review dataset](http://ai.stanford.edu/~amaas/data/sentiment/) contains a collection of 50,000 reviews from IMDB. We will use the version hosted as part of [fast.ai datasets](https://course.fast.ai/datasets.html) on AWS Open Datasets.

The dataset contains an even number of positive and negative reviews. The authors considered only highly polarized reviews. A negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. Neutral reviews are not included in the dataset. The dataset is divided into training and test sets of 25,000 labeled reviews each.

The **sentiment classification task** consists of predicting the polarity (positive or negative) of a given text.

### Imports

In [39]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [40]:
import pandas as pd
import numpy as np

import sklearn.feature_extraction.text as sklearn_text
import pickle 
from sklearn.model_selection import train_test_split
import string

### Preview the sample IMDb data set

fast.ai has a number of [datasets hosted via AWS Open Datasets](https://course.fast.ai/datasets.html) for easy download. We can see them by checking the docs for URLs (remember `??` is a helpful command):

It is always good to start working on a sample of your data before you use the full dataset-- this allows for quicker computations as you debug and get your code working. For IMDB, there is a sample dataset already available:

In [41]:
path = '../data/fastai/'
!ls {path}

british-fiction-corpus  movie_data.csv


In [42]:
df = pd.read_csv(f'{path}/movie_data.csv')
df.head()

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0
3,hi for all the people who have seen this wonde...,1
4,"I recently bought the DVD, forgetting just how...",0


### Extract the movie reviews from the sample IMDb data set.

My approach is completely different from the one in the lesson. The fastai library has gone through a lot of changes and the lesson's code is no longer relevant. It is better to use more common code, which also has better support from the community.

#### Remove punctuation

In [6]:
def remove_punctuation(text):
    """
    Transform/remove punctuation characters. 
    
    :param
    text : str
        Original text.
    
    :return
    text : str
        Text without punctuation.
    """
    table = str.maketrans('','', string.punctuation)
    return text.translate(table).strip()

df['review'] = df['review'].apply(lambda x: remove_punctuation(x))

In [7]:
df.head()

Unnamed: 0,review,sentiment
0,In 1974 the teenager Martha Moxley Maggie Grac...,1
1,OK so I really like Kris Kristofferson and his...,0
2,SPOILER Do not read this if you think about wa...,0
3,hi for all the people who have seen this wonde...,1
4,I recently bought the DVD forgetting just how ...,0


#### Lowercase everything

In [8]:
df['review'] = df['review'].apply(lambda x: x.lower())

In [9]:
df.head()

Unnamed: 0,review,sentiment
0,in 1974 the teenager martha moxley maggie grac...,1
1,ok so i really like kris kristofferson and his...,0
2,spoiler do not read this if you think about wa...,0
3,hi for all the people who have seen this wonde...,1
4,i recently bought the dvd forgetting just how ...,0


### Tokenize

In [10]:
from sklearn.feature_extraction.text import CountVectorizer

In [11]:
def count_vectorizer(data):
    vectorizer = CountVectorizer()
    embeddings = vectorizer.fit_transform(data)
    return embeddings, vectorizer

In [12]:
df.head()

Unnamed: 0,review,sentiment
0,in 1974 the teenager martha moxley maggie grac...,1
1,ok so i really like kris kristofferson and his...,0
2,spoiler do not read this if you think about wa...,0
3,hi for all the people who have seen this wonde...,1
4,i recently bought the dvd forgetting just how ...,0


### Train Test Split

In [13]:
X_train, X_test, y_train, y_test = train_test_split(df['review'], df['sentiment'], test_size=0.2, random_state=42)

In [90]:
X_train_count, vectorizer = count_vectorizer(X_train)

In [91]:
X_train_count.shape

(40000, 158461)

That means there are 40,000 reviews in the training set and 158,461 distinct tokens in the vocabulary.

In [92]:
X_test_count = vectorizer.transform(X_test)

In [93]:
X_test_count.shape

(10000, 158461)

`X_train_count` is a sparse matrix (the lesson explains the various ways of storing a sparse matrix). To inspect it as an ordinary dense array you can call its `todense()` method:

In [94]:
np.count_nonzero(X_train_count.todense())

5479461
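Densifying a matrix of this size can take a lot of memory. As a minimal sketch on a toy matrix (the `toy` array below is a made-up stand-in, not the IMDb data), the same count can be read from the sparse matrix directly through its `nnz` attribute and `getnnz` method:

```python
import numpy as np
from scipy import sparse

# Toy stand-in for X_train_count: 3 documents, 5 vocabulary entries.
toy = sparse.csr_matrix(np.array([
    [1, 0, 0, 2, 0],
    [0, 0, 3, 0, 0],
    [0, 0, 0, 0, 0],
]))

total_nonzero = toy.nnz                # stored non-zero entries in the whole matrix
per_row_nonzero = toy.getnnz(axis=1)   # non-zero entries per document (row)
density = toy.nnz / (toy.shape[0] * toy.shape[1])

print(total_nonzero, per_row_nonzero, density)
```

On the real data, `X_train_count.nnz` should give the same number as the dense count above without allocating the dense array.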

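The first row in the matrix seems to have 92 non-zero entries, which would mean that the first training review contains 92 unique tokens. Is this correct? Keep in mind that `train_test_split` shuffles the rows, so row 0 of `X_train_count` corresponds to `X_train.iloc[0]`, which is not necessarily `df.review[0]`.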

In [95]:
nnz = np.count_nonzero(X_train_count.todense()[0,:])
# Sparsity of the first row: fraction of vocabulary columns that are zero
n_cols = X_train_count.shape[1]
sparsity = (n_cols - nnz) / n_cols

In [96]:
sparsity

0.9994

In [97]:
np.count_nonzero(X_train_count.todense()[0,:])

92

In [98]:
words = [word for word in df.review[0].split()]

In [99]:
print(set(words))

{'teenager', 'to', 'on', 'dramatization', 'moved', 'support', 'net', 'criminal', 'fifteen', 'do', 'in', 'his', 'christopher', 'wealthy', 'the', 'days', 'br', '1974', 'haven', 'perjurer', 'mitchell', 'who', 'remained', 'and', 'kennedy', 'has', 'connecticut', 'is', 'moves', 'used', 'robert', 'shows', 'twentytwo', 'was', 'house', 'influence', 'my', 'maggie', 'but', 'former', 'stephen', 'night', 'story', 'murdered', 'how', 'parallel', 'murder', 'movie', 'crime', 'years', 'money', 'power', 'a', 'committed', 'family', 'welcome', 'than', 'there', 'good', 'highclass', 'weeks', 'murderbr', 'writing', 'however', 'eve', 'book', 'old', 'halloween', 'more', 'retired', 'la', 'discover', 'tv', 'backyard', 'trial', 'disgrace', 'girl', 'decides', 'mischief', 'available', 'fallen', 'oj', 'last', 'case', 'by', 'later', 'disclose', 'hideous', 'convicted', 'meloni', 'squirm', 'fuhrman', 'that', 'snoopy', 'forster', 'not', 'purpose', 'their', 'writer', 'title', 'greenwich', 'grace', 'charge', 'true', 'lack'

In [100]:
len(set(words))

142

#### Convert string to index

Note that there is a manual way, and there is also an _attribute_ of the sklearn vectorizer that gives you what you want. Both are shown:

In [101]:
df['review'][0]

'in 1974 the teenager martha moxley maggie grace moves to the highclass area of belle haven greenwich connecticut on the mischief night eve of halloween she was murdered in the backyard of her house and her murder remained unsolved twentytwo years later the writer mark fuhrman christopher meloni who is a former la detective that has fallen in disgrace for perjury in oj simpson trial and moved to idaho decides to investigate the case with his partner stephen weeks andrew mitchell with the purpose of writing a book the locals squirm and do not welcome them but with the support of the retired detective steve carroll robert forster that was in charge of the investigation in the 70s they discover the criminal and a net of power and money to cover the murderbr br murder in greenwich is a good tv movie with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a kennedy the powerful and rich family used their influence to cover the mu

In [102]:
# On scikit-learn >= 1.0, use vectorizer.get_feature_names_out() instead
features_name = vectorizer.get_feature_names()

In [103]:
features_name

['00',
 '000',
 '0000000000001',
 '000001',
 '0001',
 '00015',
 '001',
 '0010',
 '002',
 '00383042',
 '006',
 '0069',
 '007',
 '0079',
 '007br',
 '007s',
 '0080',
 '0083',
 '009',
 '00agent',
 '00s',
 '00schneider',
 '01',
 '010',
 '01000',
 '010707',
 '010br',
 '010makes',
 '0110',
 '012310',
 '0130',
 '013007',
 '02',
 '0205',
 '0210',
 '0230',
 '029',
 '02br',
 '03',
 '030',
 '03092005',
 '0310',
 '039',
 '03oct2009',
 '04',
 '04082007',
 '041',
 '044',
 '048',
 '05',
 '050',
 '0510',
 '053105',
 '05br',
 '06',
 '060241',
 '0615',
 '06and',
 '06br',
 '07',
 '075',
 '08',
 '081006',
 '087',
 '08th',
 '09',
 '09082009',
 '09br',
 '0br',
 '0clock',
 '0f',
 '0ne',
 '0r',
 '0s',
 '0stars',
 '0ttbr',
 '0when',
 '10',
 '100',
 '1000',
 '10000',
 '100000',
 '1000000',
 '10000000',
 '100000000',
 '1000000000',
 '1000000000000',
 '1000000000000010',
 '1000000000000010000000000000',
 '100000dm',
 '100001',
 '10002000',
 '10005000',
 '1000lb',
 '1000month',
 '1000s',
 '1000th',
 '1000word',
 '1

In [104]:
dictionary_features = {value:pos for pos, value in enumerate(features_name)}

In [105]:
len(dictionary_features)

158461

In [106]:
first_row_with_index = [dictionary_features[word] for word in df['review'][0].split()]

KeyError: 'a'

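This means `CountVectorizer` dropped some tokens (see the error above). It does not remove stop words by default, but its default `token_pattern` only keeps tokens of two or more alphanumeric characters, so one-character words such as 'a' never enter the vocabulary.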
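The `KeyError` can be reproduced on a toy corpus. The sketch below assumes two made-up documents and compares the default `CountVectorizer` with one whose `token_pattern` is relaxed to keep one-character tokens; the relaxed pattern is an illustrative choice, not something the lesson uses.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents for illustration only.
docs = ["a good movie", "i saw a bad movie"]

# Default settings: tokens need at least two alphanumeric characters.
default_vec = CountVectorizer().fit(docs)

# Illustrative alternative pattern that also keeps one-character tokens.
keep_short_vec = CountVectorizer(token_pattern=r"(?u)\b\w+\b").fit(docs)

default_vocab = sorted(default_vec.vocabulary_)
keep_short_vocab = sorted(keep_short_vec.vocabulary_)

print(default_vocab)
print(keep_short_vocab)

# A lookup with .get() returns None instead of raising KeyError.
print(default_vec.vocabulary_.get("a"))
```

When mapping a review to indexes by hand, skipping the words for which `.get()` returns `None` avoids the error.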

Easier way:

In [None]:
vectorizer.vocabulary_

In [None]:
from collections import Counter

In [None]:
# Split into words first; a raw string would be counted character by character
Counter('I ama am am'.split())

With `Counter` you can create the term frequencies of a document.

## 4. Create the document-term matrix for the IMDb

#### In non-deep learning methods of NLP, we are often interested only in `which words` were used in a review, and `how often each word got used`. This is known as the `bag of words` approach, and it suggests a really simple way to store a document (in this case, a movie review). 

#### For each review we can keep track of which words were used and how often each word was used with a `vector` whose `length` is the number of tokens in the vocabulary, which we will call `n`. The `indexes` of this `vector` correspond to the `tokens` in the `IMDb vocabulary`, and the `values` of the vector are the number of times the corresponding tokens appeared in the review. For example the values stored at indexes 0, 1, 2, 3, 4 of the vector record the number of times the 5 tokens ['xxunk','xxpad','xxbos','xxeos','xxfld'] appeared in the review, respectively.

#### Now, if our movie review database has `m` reviews, and each review is represented by a `vector` of length `n`, then vertically stacking the row vectors for all the reviews creates a matrix representation of the IMDb, which we call its `document-term matrix`. The `rows` correspond to `documents` (reviews), while the `columns` correspond to `terms` (or tokens in the vocabulary).
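As a rough illustration of the construction described above, here is a sketch that builds a small document-term matrix by hand with `Counter`. The three reviews are made up, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

import numpy as np

# Made-up reviews; whitespace splitting is a stand-in for real tokenization.
reviews = ["good movie good cast", "bad movie", "good plot bad ending"]
tokenized = [review.split() for review in reviews]

# Vocabulary: every distinct token, in sorted order, mapped to a column index.
vocab = sorted({token for doc in tokenized for token in doc})
index = {token: col for col, token in enumerate(vocab)}

# m x n matrix: one row per review, one column per vocabulary token.
doc_term = np.zeros((len(reviews), len(vocab)), dtype=int)
for row, doc in enumerate(tokenized):
    for token, count in Counter(doc).items():
        doc_term[row, index[token]] = count

print(vocab)
print(doc_term)
```

Each row is the bag-of-words vector of one review, so `doc_term[0]` records that "good" appears twice in the first review.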

You can create a term document matrix also with tf-idf from sklearn.

You will get a matrix in which each row is a review.

__Note:__

If you want to know how similar two documents are, you can compute the cosine similarity between their two vectors: <a href="https://stackoverflow.com/questions/11870210/tf-idf-simple-use-nltk-scikit-learn?rq=1">Article</a>. The cosine similarity is 1 when the two vectors point in the same direction, 0 when they are at 90 degrees (no relationship whatsoever), and -1 when they point in opposite directions (negative relationship).
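A minimal sketch of this comparison, assuming three made-up documents and scikit-learn's `cosine_similarity` helper:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up documents: the first two are identical, the third shares no words.
docs = [
    "the movie was great",
    "the movie was great",
    "terrible plot and acting",
]

vectors = TfidfVectorizer().fit_transform(docs)

# sims[i, j] is the cosine similarity between document i and document j.
sims = cosine_similarity(vectors)
print(sims.round(2))
```

Because tf-idf weights are never negative, the similarities obtained this way fall between 0 and 1; the value -1 cannot occur for such vectors.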

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
tfidf = TfidfVectorizer(min_df=4)

In [None]:
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

In [None]:
tfidf.vocabulary_

In [None]:
# On scikit-learn >= 1.0, use tfidf.get_feature_names_out() instead
tfidf.get_feature_names()

#### Save and Load

In [None]:
import scipy.sparse
scipy.sparse.save_npz('term_doc_fre.npz', X_train_tfidf)

In [None]:
X_train_tfidf = scipy.sparse.load_npz('term_doc_fre.npz')

## 5. Sparse Matrix Representation

#### Even though `min_df=4` has reduced the more than 158,000 unique tokens in our corpus of reviews down to a vocabulary of about 43,600, that's still a lot! But reviews are generally short, a few hundred words. So most tokens don't appear in a typical review. That means that most of the entries in the document-term matrix will be zeros, and therefore ordinary matrix operations will waste a lot of compute resources multiplying and adding zeros.

#### We want to make the best use of space and time by storing and performing matrix operations on our document-term matrix as a **sparse matrix**. `scipy` provides tools for efficient sparse matrix representation and operations.

#### Loosely speaking, a matrix with a high proportion of zeros is called `sparse` (the opposite of sparse is `dense`). For sparse matrices, you can save a lot of memory by only storing the non-zero values.

#### More specifically, a class of matrices is called **sparse** if the number of non-zero elements is proportional to the number of rows (or columns) instead of being proportional to the product rows x columns. An example is the class of diagonal matrices.


<img src="images/sparse.png" alt="floating point" style="width: 30%"/>



### Visualizing sparse matrix structure
<img src="sparse-matrix-structure-visualization.png" alt="floating point" style="width: 90%"/>
ref. https://scipy-lectures.org/advanced/scipy_sparse/introduction.html

### Sparse matrix storage formats

<img src="summary_of_sparse_matrix_storage_schemes.png" alt="floating point" style="width: 90%"/>
ref. https://scipy-lectures.org/advanced/scipy_sparse/storage_schemes.html

These are the most common sparse storage formats:
- coordinate-wise (which scipy calls COO)
- compressed sparse row (CSR)
- compressed sparse column (CSC)



### Definition of the Compressed Sparse Row (CSR) format

Let's start out with a prescription for the **CSR format** (ref. https://en.wikipedia.org/wiki/Sparse_matrix)

Given a full matrix **`A`** that has **`m`** rows, **`n`** columns, and **`N`** nonzero values, the CSR (Compressed Sparse Row) representation uses three arrays as follows:

1. **`Val[0:N]`** contains the **values** of the **`N` non-zero elements**.

2. **`Col[0:N]`** contains the **column indices** of the **`N` non-zero elements**. 
    
3. For each row **`i`** of **`A`**, **`RowPointer[i]`** contains the index in **Val** of the first **nonzero value** in row **`i`**. If there are no nonzero values in the **ith** row, then **`RowPointer[i] = None`** (scipy's `indptr` instead repeats the next row's pointer, so that `indptr[i] == indptr[i+1]`). And, by convention, an extra value **`RowPointer[m] = N`** is tacked on at the end.

Question: How many floats and ints does it take to store the matrix **`A`** in CSR format?

Let's walk through [a few examples](http://www.mathcs.emory.edu/~cheung/Courses/561/Syllabus/3-C/sparse.html) at the Emory University website
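To make the three arrays concrete, here is a small sketch with a made-up 3 x 4 matrix. It reads them from scipy's `csr_matrix`, where they are exposed as the attributes `data`, `indices` and `indptr`.

```python
import numpy as np
from scipy import sparse

# Made-up matrix with m = 3 rows, n = 4 columns and N = 4 non-zero values.
A = np.array([
    [10, 0, 0, 20],
    [0, 0, 0, 0],
    [0, 30, 0, 40],
])
A_csr = sparse.csr_matrix(A)

values = A_csr.data          # Val: the N non-zero values, row by row
col_indices = A_csr.indices  # Col: the column index of each value
row_pointer = A_csr.indptr   # RowPointer: m + 1 offsets into Val

print(values, col_indices, row_pointer)
```

This also suggests an answer to the question above: CSR needs `N` floats for the values plus `N + m + 1` ints for the column indices and row pointers. Note that scipy marks the empty second row by repeating the pointer rather than storing `None`.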



## 6. Store the document-term matrix in CSR format
i.e. given the `TextList` object containing the list of reviews, return the three arrays (values, column_indices, row_pointer)

### Scipy Implementation of sparse matrices

From the [Scipy Sparse Matrix Documentation](https://docs.scipy.org/doc/scipy-0.18.1/reference/sparse.html)

- To construct a matrix efficiently, use either dok_matrix or lil_matrix. The lil_matrix class supports basic slicing and fancy indexing with a similar syntax to NumPy arrays. As illustrated below, the COO format may also be used to efficiently construct matrices
- To perform manipulations such as multiplication or inversion, first convert the matrix to either CSC or CSR format.
- All conversions among the CSR, CSC, and COO formats are efficient, linear-time operations.

### To really understand the CSR format, we need to know how to do two things:
1. Translate a regular matrix A into CSR format
2. Reconstruct a regular matrix from its CSR sparse representation
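Both steps can be sketched in plain NumPy. The helper names `to_csr` and `from_csr` are made up for this illustration, and empty rows follow scipy's convention of repeating the row pointer instead of storing `None`.

```python
import numpy as np

def to_csr(A):
    """Translate a dense 2-D array into (values, col_indices, row_pointer)."""
    values, col_indices, row_pointer = [], [], [0]
    for row in A:
        for col, value in enumerate(row):
            if value != 0:
                values.append(value)
                col_indices.append(col)
        # Offset in `values` where the next row starts.
        row_pointer.append(len(values))
    return np.array(values), np.array(col_indices), np.array(row_pointer)

def from_csr(values, col_indices, row_pointer, n_cols):
    """Rebuild the dense matrix from its CSR arrays."""
    n_rows = len(row_pointer) - 1
    A = np.zeros((n_rows, n_cols), dtype=values.dtype)
    for i in range(n_rows):
        start, end = row_pointer[i], row_pointer[i + 1]
        A[i, col_indices[start:end]] = values[start:end]
    return A

A = np.array([
    [10, 0, 0, 20],
    [0, 0, 0, 0],
    [0, 30, 0, 40],
])
values, col_indices, row_pointer = to_csr(A)
A_rebuilt = from_csr(values, col_indices, row_pointer, n_cols=A.shape[1])
print(np.array_equal(A, A_rebuilt))
```

The number of columns has to be passed separately, because the three CSR arrays alone do not record it.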


## 8. What is a [Naive Bayes classifier](https://towardsdatascience.com/the-naive-bayes-classifier-e92ea9f47523)? 


This article is also pretty good: https://medium.com/syncedreview/applying-multinomial-naive-bayes-to-nlp-problems-a-practical-explanation-4f5271768ebf


#### The `bag of words model` considers a movie review as equivalent to a list of the counts of all the tokens that it contains. When you do this, you throw away the rich information that comes from the sequential arrangement of the tokens into sentences and paragraphs. 

#### Nevertheless, even if you are not allowed to read the review but are only given its representation as `token counts`, you can usually still get a pretty good sense of whether the review was good or bad. How do you do this?  By mentally gauging the overall `positive` or `negative` sentiment that the collection of words conveys, right?  

#### The `Naive Bayes Classifier` is an algorithm that encodes this simple reasoning process mathematically. It is based on two important pieces of information that we can learn from the training set:
* The `class priors`, i.e. the probabilities that a randomly chosen review will be `positive`, or `negative`
* The `token likelihoods` i.e. how likely is it that a given token would appear in a `positive` or `negative` review 

#### It turns out that this is all the information we need to build a model capable of predicting fairly accurately how any given review will be classified, given its text! 

#### We shall unfold the complete explanation of the magic of the Naive Bayes Classifier in the next section. 

#### Meanwhile, in this section, we focus on how to compute the necessary information from the training data, specifically the `prior probabilities` for reviews of each class, and the `class occurrence counts` and `class likelihood ratios` for each `token` in the `vocabulary`.

### 8A. Class priors

#### From the training data we can determine the `class priors` $p$ and $q$, which are the overall probabilities that a randomly chosen review is in the `positive`, or `negative` class, respectively.

#### $p=\frac{N^{+}}{N}$ 
#### and
#### $q=\frac{N^{-}}{N}$ 

#### Here $N^{+}$ and $N^{-}$ are the numbers of `positive` and `negative` reviews, and $N$ is the total number of reviews in the training set, so that 

#### $N = N^{+} + N^{-}$, 

#### and 

#### $q = 1-p$

### 8B. Class `occurrence counts`

#### Let $C^{+}_{t}$ and $C^{-}_{t}$ be the `occurrence counts` of token $t$ in `positive` and `negative` reviews, respectively, and $N^{+}$ and $N^{-}$ be the total numbers of `positive` and `negative` reviews in the data set, respectively.


### 8C. Class likelihood ratios

#### Then, given the knowledge that a review is classified as `positive`, the `conditional likelihood` that a token $t$ will appear in the review is
### $ L(t|+) = \frac{C^{+}_{t}}{N^+}$, 
#### and similarly, the `conditional likelihood` of a token appearing in a `negative` review is
### $ L(t|-) = \frac{C^{-}_{t}}{N^-}$

### 8D. The `log-count ratio`

#### From the class likelihood ratios, we can define a **log-count ratio** $R_{t}$ for each token $t$ as
### $ R_{t} = \text{log} \frac{L(t|+)}  {L(t|-)}$
#### The `log-count ratio` ranks tokens by their relative affinities for positive and negative reviews
#### We observe that
* $R_{t} \gt 0$ means `positive` reviews are more likely to contain this token 
* $R_{t} \lt 0$ means `negative` reviews are more likely to contain this token 
* $R_{t} = 0$ indicates the token $t$ has equal likelihood to appear in  `positive` and `negative` reviews
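The quantities from sections 8A to 8D can be computed on a toy example. The sketch below assumes a made-up document-term matrix `x` with three tokens and a 0/1 label array `y`. It uses the add-one smoothed likelihoods, because an unsmoothed count of zero would make the ratio undefined.

```python
import numpy as np

# Made-up document-term matrix; columns stand for ["great", "awful", "movie"].
x = np.array([
    [2, 0, 1],   # positive review
    [1, 0, 0],   # positive review
    [0, 3, 1],   # negative review
    [0, 1, 0],   # negative review
])
y = np.array([1, 1, 0, 0])   # 1 = positive, 0 = negative

# Class priors p and q.
p = (y == 1).mean()
q = (y == 0).mean()

# Occurrence counts of each token per class.
C_pos = x[y == 1].sum(axis=0)
C_neg = x[y == 0].sum(axis=0)

# Smoothed conditional likelihoods (add 1 to numerator and denominator).
L_pos = (C_pos + 1) / ((y == 1).sum() + 1)
L_neg = (C_neg + 1) / ((y == 0).sum() + 1)

# Log-count ratio for each token.
R = np.log(L_pos / L_neg)
print(p, q, R)
```

Here "great" gets a positive ratio, "awful" a negative one, and "movie", which is equally frequent in both classes, gets zero.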


## 9. Building a Naive Bayes Classifier for IMDb movie reviews

#### From the `occurrence count` arrays, we can compute the `class likelihoods` and `log-count ratios` of all the tokens in the vocabulary. 

### 9A. Compute the `class likelihoods`

#### We compute slightly modified `conditional likelihoods`, adding 1 to the numerator and denominator to ensure numerical stability (this add-one smoothing avoids zero likelihoods, whose logarithm is undefined).

In [61]:
L1 = (C1+1) / ((y.items==positive).sum() + 1)
L0 = (C0+1) / ((y.items==negative).sum() + 1)

### 9D.1 What is Bayes Theorem, and what does it have to say about IMDb movie reviews?

Consider two events, $A$ and $B$. <br>
Then the probability of $A$ and $B$ occurring together can be written in two ways: <br>
$p(A,B) = p(A|B)\cdot p(B)$ <br>
$p(A,B) = p(B|A)\cdot p(A)$

where $p(A|B)$ and $p(B|A)$ are conditional probabilities: <br>
$p(A|B)$ is the probability of $A$ occurring given that $B$ has occurred, <br>
$p(B|A)$ is the probability of $B$ occurring given that $A$ has occurred, <br>
$p(A)$ is the probability that $A$ occurs, <br>
$p(B)$ is the probability that $B$ occurs.


$\textbf{Bayes Theorem}$ is just the statement that the right hand sides of the above two equations are equal:

$p(A|B) \cdot p(B) = p(B|A) \cdot p(A)$

Applying $\textbf{Bayes Theorem}$ to our IMDb movie review problem:

We identify $A$ and $B$ as <br> 
$A \equiv \text{class}$, i.e. positive or negative, and <br>
$B \equiv \text{tokens}$, i.e. the "bag" of tokens used in the review

Then $\textbf{Bayes Theorem}$ says

$p(\text{class}|\text{tokens})\cdot p(\text{tokens}) = p(\text{tokens}|\text{class}) \cdot p(\text{class})$

so that <br>
$p(\text{class}|\text{tokens}) = p(\text{tokens}|\text{class})\cdot \frac{p(\text{class})}{p(\text{tokens})}$

Since $p(\text{tokens})$ does not depend on the class, we have the proportionality

$p(\text{class}|\text{tokens}) \propto p(\text{tokens}|\text{class})\cdot p(\text{class})$

The left hand side of the above expression is called the $\textbf{posterior class probability}$, the probability that the review is positive (or negative), given the tokens it contains. This is exactly what we want to predict!

### 9D.2 The Naive Bayes Classifier

#### Given the list of tokens in a review, we seek to predict whether the review is rated as `positive` or `negative` 

#### We can make the prediction if we know the `posterior class probabilities`.

#### $p(\text{class}|\text{tokens})$,
#### where $\text{class}$ is either `positive` or `negative`, and $\text{tokens}$ is the list of tokens that appear in the review.
#### [Bayes' Theorem](https://en.wikipedia.org/wiki/Bayes%27_theorem) tells us that the posterior probabilities, the likelihoods and the priors are related this way:

#### $p(\text{class}|\text{tokens}) \propto p(\text{tokens}|\text{class})\cdot p(\text{class})$

#### Now the tokens are not independent of one another.  For example, 'go' often appears with 'to', so if 'go' appears in a review it is more likely that the review also contains 'to'. Nevertheless, assuming the tokens are independent allows us to simplify things, so we recklessly do it, hoping it's not too wrong!
#### $p(\text{tokens}|\text{class}) = \prod_{i=1}^{n} p(t_{i}|\text{class})$

#### where $t_{i}$ is the $i\text{th}$ token in the vocabulary and $n$ is the number of tokens in the vocabulary. 

#### So Bayes' theorem is

#### $p(\text{class}|\text{tokens}) \propto p(\text{class}) \prod_{i=1}^{n} p(t_{i}|\text{class}) $

#### Taking the ratio of the $\textbf{posterior class probabilities}$ for the `positive` and `negative` classes, we have

#### $\frac{p(+|\text{tokens})}{p( - |\text{tokens})} =  \frac{p(+)}{p( - )}  \cdot  \prod_{i=1}^{n} \frac {p(t_{i}|+)}  {p(t_{i}| - )} = \frac{p}{q}  \cdot  \prod_{i=1}^{n} \frac {L(t_{i}|+)}  {L(t_{i}| - )}$
#### since likelihoods are proportional to probabilities.
#### Taking the log of both sides converts this to a `linear` problem:
#### $\text{log} \frac{p(+|\text{tokens})}{p( - |\text{tokens})} = \text{log}\frac{p}{q} + \sum_{i=1}^{n} \text{log} \frac {L(t_{i}|+)}  {L(t_{i}| - )} = b + \sum_{i=1}^{n}  R_{t_{i}}$

#### The first term on the right-hand side is the `bias`, and the second term is the dot product of the *binarized* embedding vector and the log-count ratios

#### If the left-hand side is greater than or equal to zero, we predict the review is `positive`, else we predict the review is `negative`. 

#### We can re-write the last equation in matrix form to generate an $m \times 1$ column vector $\textbf{preds}$ of log-ratios, whose signs give the review predictions:

#### $\textbf{preds} = \textbf{W} \cdot \textbf{R} + \textbf{b}$
#### where 

* $\textbf{preds} \equiv \text{log} \frac{p(+|\text{tokens})}{p( - |\text{tokens})}$
* $\textbf{W}$ is the $m\times n$ `binarized document-term matrix`, whose rows are the binarized embedding vectors for the movie reviews
* $\textbf{R}$ is the $n\times 1$ vector of `log-count ratios`  for the tokens, and 
* $\textbf{b}$ is an $m\times 1$ vector whose entries are the bias $b$


#### The Naive Bayes model consists of the log-counts vector $\textbf{R}$ and the bias $\textbf{b}$

### 9E. Implement our Naive Bayes Movie Review classifier
#### and use it to predict labels for the training and validation sets of the IMDb_sample data.

In [12]:
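from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

# GaussianNB needs dense input; toarray() returns a plain ndarray
# (todense() returns np.matrix, which newer scikit-learn versions may reject).
classifier = GaussianNB()
X_train_dense = X_train_tfidf.toarray()
classifier.fit(X_train_dense, y_train)
y_pred = classifier.predict(X_train_dense)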

In [114]:
y_train.shape, X_train_tfidf.shape, len(y_pred)

((40000,), (40000, 43611), 40000)

In [115]:
confusion_matrix(y_train, y_pred)

array([[19425,   541],
       [ 4044, 15990]])

In [116]:
(y_pred == y_train).mean()

0.885375

About 88.5% accuracy on the training set, not so bad! A binarized Naive Bayes is also implemented further below.

### 9F. Summary: A recipe for the Naive Bayes  Classifier
#### Here is a summary of our procedure for predicting labels with the Naive Bayes Classifier, starting with the training set `x` and the training labels `y`


#### 1. Compute the token count vectors
> C0 = np.squeeze(np.asarray(x[y.items==negative].sum(0))) <br> 
> C1 = np.squeeze(np.asarray(x[y.items==positive].sum(0))) <br> 

#### 2. Compute the token class likelihood vectors
> L0 = (C0+1) / ((y.items==negative).sum() + 1) <br> 
> L1 = (C1+1) / ((y.items==positive).sum() + 1) <br> 

#### 3. Compute the log-count ratios vector
> R = np.log(L1/L0)

#### 4. Compute the bias term
> b = np.log((y.items==positive).mean() / (y.items==negative).mean())

#### 5. The Naive Bayes model consists of the log-counts vector $\textbf{R}$ and the bias $\textbf{b}$
#### 6. Predict the movie review labels from a linear transformation of the log-count ratios vector:
> preds = (W @ R + b) > 0, <br> 
> where the weights matrix W = valid_doc_term.sign() is the binarized `valid_doc_term matrix` whose rows are the binarized embedding vectors for the movie reviews for which you want to predict ratings.
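The recipe above can be sketched with plain NumPy. This is not the lesson's fastai code: the fastai-specific `y.items` is replaced by a 0/1 label array, the helper names `fit_naive_bayes` and `predict_naive_bayes` are made up, and the data is a toy example.

```python
import numpy as np

def fit_naive_bayes(x, y):
    """Return the log-count ratios R and bias b for 0/1 labels y."""
    pos, neg = (y == 1), (y == 0)
    C1 = x[pos].sum(axis=0)              # token counts in positive reviews
    C0 = x[neg].sum(axis=0)              # token counts in negative reviews
    L1 = (C1 + 1) / (pos.sum() + 1)      # smoothed class likelihoods
    L0 = (C0 + 1) / (neg.sum() + 1)
    R = np.log(L1 / L0)
    b = np.log(pos.mean() / neg.mean())
    return R, b

def predict_naive_bayes(x, R, b):
    """Predict True (positive) where the log-ratio W @ R + b is above zero."""
    W = np.sign(x)                       # binarized document-term matrix
    return (W @ R + b) > 0

# Made-up document-term matrix; columns stand for ["great", "awful", "movie"].
x = np.array([
    [2, 0, 1],
    [1, 0, 0],
    [0, 3, 1],
    [0, 1, 0],
])
y = np.array([1, 1, 0, 0])

R, b = fit_naive_bayes(x, y)
preds = predict_naive_bayes(x, R, b)
print(preds, (preds == (y == 1)).mean())
```

On the real data the same two functions would need a dense or suitably converted matrix, since the sketch assumes a NumPy array.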


#### Binarized Naive Bayes

_Instead of keeping the count (or tf-idf weight) for each feature, you binarize it: 1 if the term is present, 0 if it is not. If "cat" appeared 3 times in `doc_1` you would have `cat = 3`, whereas with the binarized version you have `cat = 1`. That is the idea behind this model._

In [131]:
classifier = GaussianNB()

# sign() binarizes the matrix; toarray() gives the plain ndarray GaussianNB expects
x = X_train_tfidf.sign().toarray()
y = y_train

classifier.fit(x, y)
preds = classifier.predict(x)
accuracy_value_bin = (preds == y).mean()
print(f'Accuracy {accuracy_value_bin}')

Accuracy 0.791575


<font color="red">Poor performance compared to tf-idf.</font>

## 10. Working with the full IMDb data set

Now that we have our approach working on a smaller sample of the data, we can try using it on the full dataset.

### 10A. Download the data

In [78]:
path = untar_data(URLs.IMDB)
path.ls()

[WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/data_clas.pkl'),
 WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/data_lm.pkl'),
 WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/finetuned.pth'),
 WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/finetuned_enc.pth'),
 WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/imdb.vocab'),
 WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/ld.pkl'),
 WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/ll_clas.pkl'),
 WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/ll_lm.pkl'),
 WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/models'),
 WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/pretrained'),
 WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/README'),
 WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/test'),
 WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/tmp_clas'),
 WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/tmp_lm'),
 WindowsPath('C:/Users/cross-entropy/.fastai

In [79]:
(path/'train').ls()

[WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/train/labeledBow.feat'),
 WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/train/neg'),
 WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/train/pos'),
 WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/train/unsupBow.feat')]

In [85]:
with open('reviews_full.pickle', 'wb') as handle:
    pickle.dump(reviews_full, handle, protocol=pickle.HIGHEST_PROTOCOL)

#### In the future, we'll just be able to load our data:

In [86]:
train_doc_term = scipy.sparse.load_npz("train_doc_term.npz")
valid_doc_term = scipy.sparse.load_npz("valid_doc_term.npz")

In [87]:
with open('reviews_full.pickle', 'rb') as handle:
    reviews_full = pickle.load(handle)

## 13. The Logistic Regression classifier with the full IMDb data set

#### With the `scikit-learn` library, we can fit a logistic regression model where the features are the unigrams. Here $C$ is the inverse of the regularization strength, so smaller values mean stronger regularization.

In [122]:
from sklearn.linear_model import LogisticRegression

#### Using the full `document-term matrix`:

In [125]:
model = LogisticRegression(C=0.1, dual=False, solver='liblinear')
model.fit(X_train_tfidf, y_train)
preds = model.predict(X_train_tfidf)
train_accuracy = (preds == y_train).mean()
print(f'Train accuracy : {train_accuracy}')

Train accuracy : 0.8769


In [129]:
X_train_tfidf.sign()

<40000x43611 sparse matrix of type '<class 'numpy.float64'>'
	with 5330339 stored elements in Compressed Sparse Row format>

In [128]:
preds_test = model.predict(X_test_tfidf)
test_accuracy = (preds_test == y_test).mean()
print(f'Test accuracy : {test_accuracy}')

Test accuracy : 0.8624


<font color="red">OK, the result is not so bad. We get 0.88 on the train set and 0.86 on the test set. On the train set this is about 0.01 worse than __naive bayes__.</font>

## 14. `Trigram` representation of the `IMDb_sample`: preprocessing

#### Our next model is a version of logistic regression with Naive Bayes features extended to include bigrams and trigrams as well as unigrams, described [here](https://www.aclweb.org/anthology/P12-2018). For every document we compute binarized features as described above, but this time we use bigrams and trigrams too. Each feature is a log-count ratio. A logistic regression model is then trained to predict sentiment. Because of the much larger number of features, we will return to the smaller `IMDb_sample` data set.
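The following is only a rough sketch of that idea on made-up data, not the lesson's implementation: it builds binarized 1- to 3-gram features, scales each one by its log-count ratio, and fits a logistic regression on the result. The documents, labels and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up toy data standing in for the IMDb sample.
docs = [
    "a truly great movie",
    "great acting and a great plot",
    "a truly awful movie",
    "awful acting and a boring plot",
]
labels = np.array([1, 1, 0, 0])

# Binarized unigram, bigram and trigram features.
vectorizer = CountVectorizer(ngram_range=(1, 3), binary=True)
x = vectorizer.fit_transform(docs)

# Smoothed class likelihoods and log-count ratios, one per n-gram.
pos_rows = np.flatnonzero(labels == 1)
neg_rows = np.flatnonzero(labels == 0)
L_pos = (np.asarray(x[pos_rows].sum(axis=0)).ravel() + 1) / (len(pos_rows) + 1)
L_neg = (np.asarray(x[neg_rows].sum(axis=0)).ravel() + 1) / (len(neg_rows) + 1)
R = np.log(L_pos / L_neg)

# Scale every binarized feature by its log-count ratio, then fit the model.
x_nb = x.multiply(R).tocsr()
model = LogisticRegression(C=0.1, solver='liblinear')
model.fit(x_nb, labels)
train_preds = model.predict(x_nb)
print(train_preds)
```

With only four documents the fitted model says little; the point is the shape of the pipeline, in which the Naive Bayes ratios become the input features of the logistic regression.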

### What are `ngrams`?

#### An `n-gram` is a contiguous sequence of n items (where the items can be characters, syllables, or words).  A `1-gram` is a `unigram`, a `2-gram` is a `bigram`, and a `3-gram` is a `trigram`.

#### Here, we are referring to sequences of words. So examples of bigrams include "the dog", "said that", and "can't you".
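As a small illustration, word n-grams can be extracted with a sliding window over the tokens. The helper `word_ngrams` is a made-up name, and whitespace splitting stands in for real tokenization.

```python
def word_ngrams(text, n):
    """Return the contiguous word n-grams of a whitespace-tokenized text."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "the dog said that"
unigrams = word_ngrams(sentence, 1)
bigrams = word_ngrams(sentence, 2)
trigrams = word_ngrams(sentence, 3)

print(unigrams)
print(bigrams)
print(trigrams)
```

A sentence with `k` words yields `k - n + 1` n-grams, which is why the feature count grows quickly once bigrams and trigrams are included.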

### 14A. Which n-grams appear most often in the new matrix?

It is not always obvious what `CountVectorizer` produces, so it can be useful to check what the numbers in the matrix mean. Let's see an example:

In [14]:
from sklearn.feature_extraction.text import CountVectorizer

In [15]:
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

In [16]:
bigram_vectorizer = CountVectorizer(ngram_range=(1,2), min_df=1)

In [17]:
x_train_bigram = bigram_vectorizer.fit_transform(corpus).todense(); x_train_bigram

matrix([[0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0],
        [0, 0, 1, 0, 0, 1, 1, 0, 0, 2, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0],
        [1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0],
        [0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1]])

In [18]:
print(bigram_vectorizer.vocabulary_.get('this'))

18


In [19]:
# All the vocabulary and combinations of bigrams
bigram_vectorizer.vocabulary_

{'this': 18,
 'is': 5,
 'the': 12,
 'first': 3,
 'document': 2,
 'this is': 19,
 'is the': 6,
 'the first': 13,
 'first document': 4,
 'second': 9,
 'the second': 14,
 'second second': 11,
 'second document': 10,
 'and': 0,
 'third': 16,
 'one': 8,
 'and the': 1,
 'the third': 15,
 'third one': 17,
 'is this': 7,
 'this the': 20}

In [20]:
# Sum the appearances of each n-gram over all documents
x_train_bigram.sum(axis=0)

matrix([[1, 1, 3, 2, 2, 3, 2, 1, 1, 2, 1, 1, 4, 2, 1, 1, 1, 1, 3, 2, 1]])

In [21]:
# Sort the n-grams alphabetically so that they line up with the columns of the count matrix
word_list_sorted = sorted(bigram_vectorizer.vocabulary_); word_list_sorted

['and',
 'and the',
 'document',
 'first',
 'first document',
 'is',
 'is the',
 'is this',
 'one',
 'second',
 'second document',
 'second second',
 'the',
 'the first',
 'the second',
 'the third',
 'third',
 'third one',
 'this',
 'this is',
 'this the']

In [22]:
len(word_list_sorted)

21

In [23]:
x_bg_train_count = x_train_bigram.sum(axis=0); x_bg_train_count

matrix([[1, 1, 3, 2, 2, 3, 2, 1, 1, 2, 1, 1, 4, 2, 1, 1, 1, 1, 3, 2, 1]])

In [24]:
type(x_bg_train_count)

numpy.matrix

In [25]:
# Flatten the 1 x n matrix into a 1-D array
x_bg_train_count = np.ravel(x_bg_train_count)

In [26]:
type(x_bg_train_count)

numpy.ndarray

In [27]:
x_bg_train_count[0]

1

In [28]:
dic = [{word:count} for word, count in zip(word_list_sorted, x_bg_train_count)]

In [29]:
dic

[{'and': 1},
 {'and the': 1},
 {'document': 3},
 {'first': 2},
 {'first document': 2},
 {'is': 3},
 {'is the': 2},
 {'is this': 1},
 {'one': 1},
 {'second': 2},
 {'second document': 1},
 {'second second': 1},
 {'the': 4},
 {'the first': 2},
 {'the second': 1},
 {'the third': 1},
 {'third': 1},
 {'third one': 1},
 {'this': 3},
 {'this is': 2},
 {'this the': 1}]

### 14B. Naive Bayes with bigrams

In [33]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

In [30]:
bigram_vectorizer = CountVectorizer(ngram_range=(1,2), max_features=80000)

In [31]:
X_bigram = bigram_vectorizer.fit_transform(X_train)

In [34]:
naive_bayes = GaussianNB()

In [35]:
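# GaussianNB needs a dense array; recent scikit-learn versions reject the np.matrix returned by .todense()
naive_bayes.fit(X_bigram.toarray(), y_train)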

GaussianNB()

In [37]:
preds = naive_bayes.predict(X_bigram.toarray())

In [38]:
(y_train == preds).mean()

0.92465

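This looks very high, but keep in mind that it is accuracy on the training data, the same reviews the model was fit on, so it overstates how well the model will do on unseen reviews.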
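
The score above is measured on the reviews used for fitting. A fairer check uses held-out text, transformed with the vocabulary learned on the training set. Below is a minimal sketch on invented toy reviews. Note that it swaps in `MultinomialNB` for `GaussianNB`, since `MultinomialNB` accepts sparse count matrices directly; all names and texts are assumptions made for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB

# Invented toy reviews; 1 = positive, 0 = negative
train_texts = ['great movie', 'great acting', 'loved the plot',
               'terrible movie', 'boring plot', 'hated the acting']
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ['great plot', 'terrible acting']
test_labels = [1, 0]

vec = CountVectorizer(ngram_range=(1, 2))
X_tr = vec.fit_transform(train_texts)  # learn the vocabulary on the training texts only
X_te = vec.transform(test_texts)       # reuse that vocabulary for the held-out texts

nb = MultinomialNB()                   # works on the sparse matrix, no densifying needed
nb.fit(X_tr, train_labels)

train_acc = nb.score(X_tr, train_labels)
test_acc = nb.score(X_te, test_labels)
print(f'Train accuracy: {train_acc}, test accuracy: {test_acc}')
print(confusion_matrix(test_labels, nb.predict(X_te)))
```

On the real data, the equivalent step would be to call `transform` (not `fit_transform`) on the test reviews and compare the two accuracies.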

### 14E. Save the `ngram` data so we won't have to spend the time to generate it again

In [151]:
scipy.sparse.save_npz("X_train_bigram_matrix.npz", X_bigram)

In [152]:
with open('itongram.pickle', 'wb') as handle:
    pickle.dump(itongram, handle, protocol=pickle.HIGHEST_PROTOCOL)
    
with open('ngramtoi.pickle', 'wb') as handle:
    pickle.dump(ngramtoi, handle, protocol=pickle.HIGHEST_PROTOCOL)

### 14F. Load the `ngram` data

In [153]:
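# The file name must match the one passed to save_npz above
train_ngram_doc_matrix = scipy.sparse.load_npz("X_train_bigram_matrix.npz")
# Assumes a validation matrix was saved under this name in the same way
valid_ngram_doc_matrix = scipy.sparse.load_npz("valid_ngram_matrix.npz")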

In [154]:
with open('itongram.pickle', 'rb') as handle:
    itongram = pickle.load(handle)

with open('ngramtoi.pickle', 'rb') as handle:
    ngramtoi = pickle.load(handle)
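
A small round trip makes it easy to check that saving and loading agree. This is a sketch with stand-in data: the toy matrix, the toy vocabulary, and the file names are made up, and a temporary directory is used so nothing is left on disk.

```python
import os
import pickle
import tempfile

import numpy as np
import scipy.sparse

# Small illustrative matrix and vocabulary (stand-ins for the real n-gram data)
matrix = scipy.sparse.csr_matrix(np.array([[0, 1, 2], [3, 0, 0]]))
vocab = {'the': 0, 'the dog': 1, 'dog': 2}

with tempfile.TemporaryDirectory() as tmp_dir:
    matrix_path = os.path.join(tmp_dir, 'toy_ngram_matrix.npz')
    vocab_path = os.path.join(tmp_dir, 'toy_vocab.pickle')

    scipy.sparse.save_npz(matrix_path, matrix)
    with open(vocab_path, 'wb') as handle:
        pickle.dump(vocab, handle, protocol=pickle.HIGHEST_PROTOCOL)

    # Load from the same paths that were used for saving
    matrix_loaded = scipy.sparse.load_npz(matrix_path)
    with open(vocab_path, 'rb') as handle:
        vocab_loaded = pickle.load(handle)

# The loaded objects should match what was saved
same_matrix = (matrix != matrix_loaded).nnz == 0
print(same_matrix, vocab_loaded == vocab)
```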

## 15. A Naive Bayes IMDb classifier using Trigrams instead of Tokens

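You can repeat the same steps with trigrams by passing `ngram_range=(1,3)` to `CountVectorizer`. This is computationally more expensive because the number of features grows, so it is worth comparing the resulting accuracy with the bigram model.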
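
To see how the feature count grows, here is a minimal sketch on the small four-sentence corpus from section 14A. Only the `ngram_range` argument changes; the variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

bigram_vec = CountVectorizer(ngram_range=(1, 2))   # unigrams and bigrams
trigram_vec = CountVectorizer(ngram_range=(1, 3))  # unigrams, bigrams and trigrams

n_bigram_features = bigram_vec.fit_transform(corpus).shape[1]
n_trigram_features = trigram_vec.fit_transform(corpus).shape[1]

print(f'(1, 2) features: {n_bigram_features}')
print(f'(1, 3) features: {n_trigram_features}')
print('this is the' in trigram_vec.vocabulary_)
```

On the full IMDb data the growth is much larger, which is why capping the vocabulary with `max_features` (as done for the bigram model) is a sensible precaution.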

## References

* Sida Wang and Christopher D. Manning. [Baselines and Bigrams: Simple, Good Sentiment and Topic Classification](https://www.aclweb.org/anthology/P12-2018).
* Joseph Catanzarite. [The Naive Bayes Classifier](https://towardsdatascience.com/the-naive-bayes-classifier-e92ea9f47523), in Towards Data Science.