# Sentiment Classification of Movie Reviews (Using Naive Bayes, Logistic Regression, and Ngrams)
The purpose of this notebook is to go over Naive Bayes, Logistic Regression, and Ngrams for sentiment classification, using sklearn and fastai.

In [1]:
from fastai import *
from fastai.text import *

In [2]:
import sklearn.feature_extraction.text as sklearn_text

It is always good to start working on a sample of your data before you use the full dataset - this allows for quicker computations as you debug and get your code working. 

We will be using IMDB, which already has a sample dataset.

In [3]:
# Checking the datasets we have
?? URLs

Init signature: URLs()
Source:
class URLs():
    "Global constants for dataset and model URLs."
    LOCAL_PATH = Path.cwd()
    S3 = 'https://s3.amazonaws.com/fast-ai-'
    S3_IMAGE = f'{S3}imageclas/'
    S3_IMAGELOC = f'{S3}imagelocal/'
    S3_NLP = f'{S3}nlp/'
    S3_COCO = f'{S3}coco/'
    S3_MODEL = f'{S3}modelzoo/'
    COCO_SAMPLE = f'{S3_COCO}coco_sample'

In [3]:
# In our case we will use the IMDB Sample
path = untar_data(URLs.IMDB_SAMPLE)

# pointing to the path
path

PosixPath('/Users/diegomedina-bernal/.fastai/data/imdb_sample')

In [4]:
# Let's get a sense of what the data looks like - we will not use this for our model
df = pd.read_csv(path/'texts.csv')
df.head()

Unnamed: 0,label,text,is_valid
0,negative,Un-bleeping-believable! Meg Ryan doesn't even ...,False
1,positive,This is a extremely well-made film. The acting...,False
2,negative,Every once in a long while a movie will come a...,False
3,positive,Name just says it all. I watched this movie wi...,False
4,negative,This movie succeeds at being one of the most u...,False


In [5]:
df.iloc[0]['text']

"Un-bleeping-believable! Meg Ryan doesn't even look her usual pert lovable self in this, which normally makes me forgive her shallow ticky acting schtick. Hard to believe she was the producer on this dog. Plus Kevin Kline: what kind of suicide trip has his career been on? Whoosh... Banzai!!! Finally this was directed by the guy who did Big Chill? Must be a replay of Jonestown - hollywood style. Wooofff!"

In [6]:
import os

In [7]:
# Checking the path directory - this is the name of the file holding our dataset (IMDB sample)
os.listdir(path)

['texts.csv']

In [8]:
# We will be using TextList from the fastai library to process our data
? TextList

Init signature: TextList(items:Iterator, vocab:fastai.text.transform.Vocab=None, pad_idx:int=1, **kwargs)
Docstring:      Basic `ItemList` for text data.
File:           ~/miniconda3/lib/python3.6/site-packages/fastai/text/data.py
Type:           type


In [9]:
# Getting movie reviews - using TextList 

"""
path: path to our dataset
'texts.csv': name of our dataset file
cols='text': the column holding the text we want to process

split_from_df(col=2): column 2 (is_valid) marks the validation rows
label_from_df(cols=0): column 0 (label) holds our labels
"""

movie_reviews = (TextList.from_csv(path, 'texts.csv', cols='text')
                         .split_from_df(col=2)
                         .label_from_df(cols=0))

## Exploring what our data looks like
A good first step for any data problem is to explore the data and get a sense of what it looks like. In this case we are looking at movie reviews, which have been labeled as "positive" or "negative".

In [10]:
# Checking first review from valid 
movie_reviews.valid.x[0], movie_reviews.valid.y[0]

(Text xxbos xxmaj this very funny xxmaj british comedy shows what might happen if a section of xxmaj london , in this case xxmaj xxunk , were to xxunk itself independent from the rest of the xxup uk and its laws , xxunk & post - war xxunk . xxmaj merry xxunk is what would happen . 
 
  xxmaj the explosion of a wartime bomb leads to the xxunk of ancient xxunk which show that xxmaj xxunk was xxunk to the xxmaj xxunk of xxmaj xxunk xxunk ago , a small historical xxunk long since forgotten . xxmaj to the new xxmaj xxunk , however , this is an unexpected opportunity to live as they please , free from any xxunk from xxmaj xxunk . 
 
  xxmaj stanley xxmaj xxunk is excellent as the minor city xxunk who suddenly finds himself leading one of the world 's xxunk xxunk . xxmaj xxunk xxmaj margaret xxmaj xxunk is a delight as the history professor who sides with xxmaj xxunk . xxmaj others in the stand - out cast include xxmaj xxunk xxmaj xxunk , xxmaj paul xxmaj xxunk , xxmaj xxunk xxmaj xxunk , xxm

In NLP, a **token** is the basic unit of processing (what the tokens are depends on the application and your choices). Here, the tokens mostly correspond to words or punctuation, as well as several special tokens, corresponding to unknown words, capitalization, etc. 

Looking at our example, this is post-processed data that has been **tokenized**. You will see that some tokens start with ``xx``. Here is a list explaining what they mean. 

* ```UNK```: Is for an unknown word (one that isn't present in the current vocabulary)
* ```PAD```: Is the token used for padding, if we need to regroup several texts of different lengths in a batch
* ```BOS```: Represents the beginning of a text in your dataset
* ```FLD```: Is used if you set ```mark_fields=True``` in your ```TokenizeProcessor``` to separate the different fields of texts (if your texts are loaded from several different columns in a dataframe)
* ```TK_MAJ```: Is used to indicate the next word begins with a capital in the original text
* ```TK_UP```: Is used to indicate the next word is written in all caps in the original text
* ```TK_REP```: Is used to indicate the next character is repeated n times in the original text 
* ```TK_WREP```: Is used to indicate the next word is repeated n times in the original text
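
To make the capitalization markers concrete, here is a simplified sketch of how such tokens could be inserted. It is an illustration only and not fastai's actual implementation; the function name `mark_caps` is hypothetical.

```python
def mark_caps(tokens):
    """Lowercase tokens, adding 'xxup' before ALL-CAPS words and 'xxmaj' before Capitalized words."""
    out = []
    for t in tokens:
        if t.isupper() and len(t) > 1:
            out += ["xxup", t.lower()]     # whole word was upper case
        elif t[:1].isupper():
            out += ["xxmaj", t.lower()]    # only the first letter was upper case
        else:
            out.append(t)
    return out

marked = mark_caps(["This", "very", "funny", "British", "comedy", "in", "the", "UK"])
print(marked)
```

The markers let the vocabulary store a single lowercase entry per word while still keeping the capitalization information as separate tokens.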

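```itos``` stands for integer to string: a list that maps an index to its token. 
```stoi``` stands for string to integer: a dictionary that maps a token to its index. 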

In [17]:
# what is the shape of each?
len(movie_reviews.vocab.itos), len(movie_reviews.vocab.stoi)

(6010, 19159)

In [18]:
# Looking at tokens
movie_reviews.vocab.itos[:10]

['xxunk',
 'xxpad',
 'xxbos',
 'xxfld',
 'xxmaj',
 'xxup',
 'xxrep',
 'xxwrep',
 'the',
 '.']

In [20]:
movie_reviews.vocab.itos[30:40]

['but', 'film', 'you', ')', 'on', '(', "n't", 'are', 'he', 'his']

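To recap: ```movie_reviews.vocab.stoi``` is a dictionary that maps each token (a string) to its index. 

```movie_reviews.vocab.itos``` is the reverse: a list in which position ```i``` holds the token for index ```i```. 

So we can index or slice ```itos``` with an integer to get back the string it represents, and look a string up in ```stoi``` to get its index. Note that ```stoi``` has more entries than ```itos``` (19159 vs 6010), so several strings must map to the same index. 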
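
The toy vocabulary below is a minimal sketch of this two-way mapping. It assumes that ```stoi``` behaves like a `defaultdict` whose default index is 0 (the unknown token), which may be one reason the dictionary grows larger than the list; the tokens are made up for illustration.

```python
from collections import defaultdict

# index -> token (toy vocabulary, not the real one)
itos = ["xxunk", "xxpad", "the", "planet"]
# token -> index; unseen tokens fall back to 0, the index of 'xxunk'
stoi = defaultdict(int, {tok: i for i, tok in enumerate(itos)})

planet_idx = stoi["planet"]
unknown_idx = stoi["zyzzyva"]  # not in the vocabulary

print(planet_idx, itos[planet_idx])
print(unknown_idx, itos[unknown_idx])
print(len(itos), len(stoi))  # looking up an unseen key adds it to a defaultdict
```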

In [11]:
# Let's look up the index for a word
movie_reviews.vocab.stoi['planet']

1367

In [12]:
# to confirm this
movie_reviews.vocab.itos[1367]

'planet'

In [13]:
# checking where a word maps to 
word = 'sex'

idx = movie_reviews.vocab.stoi[word] # getting index for that word
token_returned = movie_reviews.vocab.itos[idx]

print(f'Initial word: {word}')
print(f'Index it points to: {idx}')
print(f'Token returned: {token_returned}')

Initial word: sex
Index it points to: 433
Token returned: sex


# Creating Term-Document Matrix
As covered in notebook 1, a term-document matrix represents a document as a "bag of words": we don't keep track of the order the words are in, just which words occur (and how often). 

In that notebook we used sklearn's ```CountVectorizer```. In this notebook we will create our own. Why? 
* To understand what is happening 'underneath the hood'
* To create something that will work with fastai TextList

To create our term-document matrix, we first need to learn about **counters** and **sparse matrices** 

## Counters
Counters are a useful Python object. Below is how they work.

In [14]:
c = Counter([4, 2, 8, 8, 4, 8])

In [15]:
c

Counter({4: 2, 2: 1, 8: 3})

A Counter takes a list of objects (ints in our case) and **counts** how many times each one appears in that list, returning the counts as a dictionary-like object.

In the output above, 4: 2 means that 4 appears twice in the list, and 8: 3 means that 8 appears 3 times. 
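
The same idea applies to tokens. Here is a short sketch on a made-up sentence, showing that a missing key simply counts as zero:

```python
from collections import Counter

tokens = "the movie was good , the acting was good".split()
counts = Counter(tokens)

print(counts["good"])         # how often 'good' appears
print(counts["bad"])          # a token that never appears counts as 0
print(counts.most_common(2))  # the two most frequent tokens
```

This is the behaviour our term-document matrix relies on: one Counter per document gives the non-zero entries of that document's row.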

## Sparse Matrices (SciPy)
Even though we have reduced over 19,000 words down to 6,000, that is still a lot. Most tokens don't appear in most reviews. We want to take advantage of this by storing our data as a **sparse matrix**.

A matrix with lots of zeros is called **sparse** (opposite of **dense**). For sparse matrices, you can save a lot of memory by only storing the non-zero values. 

These are the most common sparse storage formats: 

* Coordinate-wise (SciPy calls this COO)
* Compressed sparse row (CSR)
* Compressed sparse column (CSC)

A class of matrices is generally called sparse if the number of non-zero elements is proportional to the number of rows (or columns) instead of being proportional to the product rows x columns. 
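
As a small illustration of the CSR format, the sketch below builds a 2 x 3 matrix from the three arrays that CSR stores. The numbers are made up; the variable names mirror the ones used in the next section.

```python
import numpy as np
import scipy.sparse

# Dense matrix we want to represent:
# [[2, 0, 1],
#  [0, 0, 3]]
values = [2, 1, 3]     # the non-zero entries, row by row
j_indices = [0, 2, 2]  # column index of each entry
indptr = [0, 2, 3]     # row i owns entries indptr[i]:indptr[i+1]

m = scipy.sparse.csr_matrix((values, j_indices, indptr), shape=(2, 3), dtype=int)
dense = m.toarray()

print(dense)
print(m.nnz)  # number of stored entries
```

Only three numbers are stored for six cells; with thousands of columns and mostly zero counts, the saving becomes substantial.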

### Our Version of ```CountVectorizer```

In [16]:
# each document we use counter on will be:
doc = movie_reviews.valid.x[0]

In [17]:
doc

Text xxbos xxmaj this very funny xxmaj british comedy shows what might happen if a section of xxmaj london , in this case xxmaj xxunk , were to xxunk itself independent from the rest of the xxup uk and its laws , xxunk & post - war xxunk . xxmaj merry xxunk is what would happen . 

 xxmaj the explosion of a wartime bomb leads to the xxunk of ancient xxunk which show that xxmaj xxunk was xxunk to the xxmaj xxunk of xxmaj xxunk xxunk ago , a small historical xxunk long since forgotten . xxmaj to the new xxmaj xxunk , however , this is an unexpected opportunity to live as they please , free from any xxunk from xxmaj xxunk . 

 xxmaj stanley xxmaj xxunk is excellent as the minor city xxunk who suddenly finds himself leading one of the world 's xxunk xxunk . xxmaj xxunk xxmaj margaret xxmaj xxunk is a delight as the history professor who sides with xxmaj xxunk . xxmaj others in the stand - out cast include xxmaj xxunk xxmaj xxunk , xxmaj paul xxmaj xxunk , xxmaj xxunk xxmaj xxunk , xxmaj xx

In [18]:
doc.data

array([ 2,  4, 20, 70, ..., 14,  4,  0, 51])

In [19]:
def get_term_doc_matrix(label_list, vocab_len):
    j_indices = [] # column (token) index of each stored count
    indptr = []    # row pointers: row i owns entries indptr[i]:indptr[i+1]
    values = []    # the counts themselves
    indptr.append(0) # the first row starts at position 0
    
    for i, doc in enumerate(label_list):
        feature_counter = Counter(doc.data) # token index -> count within this document
        j_indices.extend(feature_counter.keys()) # token indices go in j_indices
        values.extend(feature_counter.values()) # matching counts go in values
        indptr.append(len(j_indices)) # mark where this document's row ends
        
    # Returning sparse matrix (values, j_indices, indptr)
    return scipy.sparse.csr_matrix((values, j_indices, indptr),
                                    shape=(len(indptr) - 1, vocab_len),
                                    dtype=int)

In [20]:
# using our countvectorizer
label_list = movie_reviews.valid.x # all the documents
vocab_len = len(movie_reviews.vocab.itos) # list of all words - grabbing length

val_term_doc = get_term_doc_matrix(label_list, vocab_len)

In [21]:
# 200 reviews, 6010 words - sparsely represented
val_term_doc.shape

(200, 6010)

In [22]:
# Creating our train term-doc (vocab_len is the same as above)
label_list = movie_reviews.train.x

train_term_doc = get_term_doc_matrix(label_list, vocab_len)

In [23]:
# 800 reviews, 6010 words - sparsely represented
train_term_doc.shape

(800, 6010)

In [24]:
# Let's check these matrices out
# We need to convert todense()
val_term_doc.todense()[:10,:10]

matrix([[32,  0,  1,  0, ...,  0,  0, 10,  7],
        [ 9,  0,  1,  0, ...,  0,  0,  7,  8],
        [ 6,  0,  1,  0, ...,  0,  0, 12, 12],
        [78,  0,  1,  0, ...,  0,  0, 44, 23],
        ...,
        [ 8,  0,  1,  0, ...,  0,  0,  8,  8],
        [43,  0,  1,  0, ...,  1,  0, 25, 24],
        [ 7,  0,  1,  0, ...,  0,  0,  9,  9],
        [19,  0,  1,  0, ...,  0,  0,  5,  9]])

In [25]:
train_term_doc.todense()[:10, :10]

matrix([[ 8,  0,  1,  0, ...,  0,  0,  2,  3],
        [22,  0,  1,  0, ...,  0,  0, 27, 19],
        [ 4,  0,  1,  0, ...,  0,  0,  5, 14],
        [13,  0,  1,  0, ...,  0,  0, 16,  7],
        ...,
        [ 4,  0,  1,  0, ...,  0,  0, 19,  7],
        [42,  0,  1,  0, ...,  0,  0, 30, 15],
        [18,  0,  1,  0, ...,  0,  0, 15, 11],
        [20,  0,  1,  0, ...,  0,  0, 10,  4]])

In [26]:
doc.data

array([ 2,  4, 20, 70, ..., 14,  4,  0, 51])

In [27]:
doc.text

"xxbos xxmaj this very funny xxmaj british comedy shows what might happen if a section of xxmaj london , in this case xxmaj xxunk , were to xxunk itself independent from the rest of the xxup uk and its laws , xxunk & post - war xxunk . xxmaj merry xxunk is what would happen . \n\n xxmaj the explosion of a wartime bomb leads to the xxunk of ancient xxunk which show that xxmaj xxunk was xxunk to the xxmaj xxunk of xxmaj xxunk xxunk ago , a small historical xxunk long since forgotten . xxmaj to the new xxmaj xxunk , however , this is an unexpected opportunity to live as they please , free from any xxunk from xxmaj xxunk . \n\n xxmaj stanley xxmaj xxunk is excellent as the minor city xxunk who suddenly finds himself leading one of the world 's xxunk xxunk . xxmaj xxunk xxmaj margaret xxmaj xxunk is a delight as the history professor who sides with xxmaj xxunk . xxmaj others in the stand - out cast include xxmaj xxunk xxmaj xxunk , xxmaj paul xxmaj xxunk , xxmaj xxunk xxmaj xxunk , xxmaj xx

### Fun Thought 1
Consider a function $f(x_{[i,n]}) = \hat{y}$, where $x_{[i,n]}$ represents the features $x_{i}$ through $x_{n}$ and $\hat{y}$ represents the output, which in most cases is a de-coherent distributed form. 

At a high level: when you plot $x_{[i,n]}$ (in $m$-dimensional space), you form a **map** to its de-coherent distribution, which is a point in the **whole distribution** of the system you are mapping. 
 
Each feature $x$ must be numerical; thus, when processing text, we use embeddings or other numerical representations of the text. 

To get this notion across, think of this scenario: you want to play golf, but you want to know whether it is a good day for it. Are the weather *parameters* good? *Good* here means that those *parameters*, when plugged into $f(x)$, fall within a satisfactory de-coherent distribution. 

Therefore, for a *system* to learn this **whole distribution** and *map* $x_{[i,n]}$ to $\hat{y}$, we need to run **many** experiments, where **many** means enough to be *statistically meaningful*. 

This is why, in *machine learning*, we need a lot of data to *learn* this distribution. 

The same principle is at the core of the *experimental method* in research. 

# Naive Bayes
This algorithm makes the *naive* assumption that all the features in the dataset are **independent of each other** given the class. It is usually used to get a baseline accuracy for the dataset.

$$P(c | x) = \frac{P(x|c)P(c)}{P(x)}$$

Where:
* $P(c|x)$ is the **posterior probability** of *class c* given *predictor x (features)*
* $P(c)$ is the **prior probability** of the *class*
* $P(x|c)$ is the **likelihood**, which is the probability of the predictor given the class
* $P(x)$ is the probability of the predictor on its own (the **evidence**)
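
A quick worked example of the formula, using made-up probabilities for a single word (all numbers here are assumptions chosen for easy arithmetic):

```python
# Hypothetical numbers for one word x (say, "great") and class c = positive
p_c = 0.5               # P(c): half of the reviews are positive
p_x_given_c = 0.04      # P(x|c): chance of seeing the word in a positive review
p_x_given_not_c = 0.01  # chance of seeing the word in a negative review

# P(x): total probability of seeing the word, summed over both classes
p_x = p_x_given_c * p_c + p_x_given_not_c * (1 - p_c)

# Bayes' rule
posterior = p_x_given_c * p_c / p_x
print(posterior)
```

With these numbers, seeing the word moves the probability of "positive" from the prior of 0.5 to about 0.8.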

### Advantages of Naive Bayes
* It is easy and fast to predict the class of the test dataset. It also performs well in multi-class prediction
* When the assumption of independence holds, a Naive Bayes classifier can perform better than other models like Logistic Regression, and it needs less training data
* It performs well with categorical input variables compared to numerical variables. For numerical variables, a normal distribution is assumed (bell curve, which is a strong assumption)

### Disadvantages of Naive Bayes
* If a categorical variable has a category in the test dataset that was not observed in the training dataset, the model will assign it a probability of 0 (zero) and will be unable to make a prediction. This is often known as **zero frequency**. To solve this, we can use a smoothing technique. One of the simplest smoothing techniques is called *Laplace estimation*
* Naive Bayes is also known to be a poor probability estimator, so its probability outputs are not to be taken too seriously
* Another limitation of Naive Bayes is the assumption of independent predictors. In real life, it is almost impossible to get a set of predictors that are completely independent 
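
The zero-frequency problem and its fix can be sketched in a few lines. This uses the textbook add-one (Laplace) form with made-up counts; the helper name `word_prob` is hypothetical, and the cells later in this notebook use a slightly different add-one variant.

```python
def word_prob(count, total, vocab_size, alpha=0):
    """Estimated probability of a word in a class, with optional add-alpha smoothing."""
    return (count + alpha) / (total + alpha * vocab_size)

# A word that never appeared in the negative training reviews (hypothetical counts)
unsmoothed = word_prob(count=0, total=1000, vocab_size=5000)
smoothed = word_prob(count=0, total=1000, vocab_size=5000, alpha=1)

print(unsmoothed)  # zero: any review containing this word gets probability 0 for this class
print(smoothed)    # small but non-zero
```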

We define the **log-count ratio $r$** for each word $f$:
$$ r = \log\frac{\text{ratio of feature } f \text{ in positive documents}}{\text{ratio of feature } f \text{ in negative documents}} $$
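
Here is a small sketch of the log-count ratio on a toy three-word vocabulary. The counts and document totals are made up; the add-one smoothing follows the form used in the cells further down.

```python
import numpy as np

vocab = ["great", "boring", "movie"]
p1 = np.array([30, 2, 50])  # how often each word occurs in positive reviews (made up)
p0 = np.array([5, 40, 50])  # how often each word occurs in negative reviews (made up)
n_pos, n_neg = 100, 100     # number of positive / negative reviews (made up)

pr1 = (p1 + 1) / (n_pos + 1)  # smoothed ratio in positive documents
pr0 = (p0 + 1) / (n_neg + 1)  # smoothed ratio in negative documents
r = np.log(pr1 / pr0)

for word, ratio in zip(vocab, r):
    print(f"{word}: {ratio:.3f}")
```

A positive $r$ marks a word that leans positive, a negative $r$ one that leans negative, and a value near zero a word that appears about equally often in both classes.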

In [44]:
movie_reviews.classes

['negative', 'positive']

In [51]:
x = train_term_doc # sparse matrix
y = movie_reviews.train.y # labels for train dataset
val_y = movie_reviews.valid.y # labels for validation dataset

In [54]:
# Grabbing int values of our classes from y
positive = y.c2i['positive']
negative = y.c2i['negative']

In [58]:
# Grabbing frequency of positive and negative for each word in our vocabulary
p1 = np.squeeze(np.asarray(x[y.items==positive].sum(0))) # summing up positives
p0 = np.squeeze(np.asarray(x[y.items==negative].sum(0))) # summing up negatives

In [59]:
len(p1), len(p0)

(6010, 6010)

In [60]:
p1[:10]

array([ 6468,     0,   383,     0, 10267,   674,    57,     0,  5260,  4195], dtype=int64)

In [62]:
p0[:10]

array([ 7153,     0,   417,     0, 10741,   908,    53,     1,  6150,  5147], dtype=int64)

In [63]:
# Grabbing vocab
v = movie_reviews.vocab

In [73]:
# How often does a word appear in positive & negative reviews?
def count_word_pos_neg(word):
    return p1[v.stoi[word]], p0[v.stoi[word]]

p,n = count_word_pos_neg('loved')
print(f'Positive count: {p}')
print(f'Negative count: {n}')

Positive count: 29
Negative count: 12


In [84]:
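# 'lust' is not in the vocabulary, so stoi maps it to index 0 ('xxunk');
# the counts below are therefore the totals for xxunk (compare p1[0] and p0[0] above)
p,n = count_word_pos_neg('lust')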
print(f'Positive count: {p}')
print(f'Negative count: {n}')

Positive count: 6468
Negative count: 7153


In [119]:
import random

# Grabbing a review
def grab_random_review(review_idxs_list_, d_set="train"):
    """
    Takes a list of review indexes (positive or negative) and returns the text
    of one randomly chosen review from the train or valid dataset.
    """
    random_idx = random.choice(review_idxs_list_)
    
    if d_set=="train":
        review_ = movie_reviews.train.x[random_idx]
    elif d_set=="valid":
        review_ = movie_reviews.valid.x[random_idx]
    else:
        raise ValueError(f'Unknown dataset "{d_set}": expected "train" or "valid"')
    
    return review_.text

In [124]:
# finding reviews in the train dataset that contain a given word
def grab_list_postive_negative_for_word(word):
    """
    Takes in a word and finds all positive and negative reviews that contain it.
    
    RETURNS:
        positive_review_list_idxs, negative_review_list_idxs
    """
    # grab word index
    word_idx = v.stoi[word]
    
    # rows (reviews) in which the word occurs at least once
    has_word = set((x[:, word_idx] > 0).nonzero()[0])
    
    # rows labeled positive / negative
    pos_rows = set(np.flatnonzero(y.items == positive))
    neg_rows = set(np.flatnonzero(y.items == negative))
    
    # our lists of indexes: reviews that contain the word, split by label
    idxs_found_p = list(has_word & pos_rows) # positive list
    idxs_found_n = list(has_word & neg_rows) # negative list
        
    return idxs_found_p, idxs_found_n

In [133]:
# let's grab a random review
word = "love"

# positive review
idxs_p, _ = grab_list_postive_negative_for_word(word)
review = grab_random_review(idxs_p)

print(f'Positive Review for word: {word}\n')
print('-----------------------------------------------------------------------')
print(review)
print('-----------------------------------------------------------------------')

Positive Review for word: love

-----------------------------------------------------------------------
xxbos xxmaj there are numerous films relating to xxup xxunk , but xxmaj mother xxmaj night is quite xxunk among them : xxmaj in this film , we are introduced to xxmaj howard xxmaj campbell ( xxmaj nolte ) , an xxmaj american living in xxmaj berlin and married to a xxmaj german , xxmaj xxunk xxmaj xxunk ( xxmaj lee ) , who decides to accept the role of a spy : xxmaj more specifically , a xxup cia agent xxmaj major xxmaj xxunk ( xxmaj goodman ) recruits xxmaj campbell who becomes a xxmaj nazi xxunk in order to enter the highest xxunk of the xxmaj hitler xxunk . xxmaj however , the deal is that the xxup us xxmaj government will never xxunk xxmaj campbell 's role in the war for national security reasons , and so xxmaj campbell becomes a hated figure across the xxup us . xxmaj after the war , he tries to xxunk his identity , but the past comes back and xxunk him . xxmaj his only " friend 

# Applying Naive Bayes

In [134]:
p1 = np.squeeze(np.asarray(x[y.items==positive].sum(0)))
p0 = np.squeeze(np.asarray(x[y.items==negative].sum(0)))

pr1 = (p1+1) / ((y.items==positive).sum() + 1)
pr0 = (p0+1) / ((y.items==negative).sum() + 1)

In [135]:
r = np.log(pr1/pr0); r

array([-0.015811,  0.084839,  0.      ,  0.084839, ...,  1.471133, -1.301455, -1.301455, -1.301455])

In [136]:
# Vocab most associated with positive and negative reviews - idxs
biggest = np.argpartition(r, -10)[-10:]
smallest = np.argpartition(r, 10)[:10]

In [137]:
# most positive words
p_words = [v.itos[k] for k in biggest]
print(f'Positive words: \n{p_words}\n')

# most negative words
n_words = [v.itos[k] for k in smallest]
print(f'Negative words: \n{n_words}')

Positive words: 
['han', 'jabba', 'davies', 'gilliam', 'jimmy', 'felix', 'biko', 'fanfan', 'astaire', 'noir']

Negative words: 
['vargas', 'naschy', 'worst', 'dog', 'porn', 'crater', 'crap', 'disappointment', 'soderbergh', 'fuqua']


In [151]:
# Testing Naive Bayes
ratio = ((y.items==positive).sum(), (y.items==negative).sum())
print(f'ratio: {ratio}')

b = np.log(ratio[0] / ratio[1])

preds = (val_term_doc @ r + b) > 0

# measuring our accuracy
(preds == val_y.items).mean()

ratio: (383, 417)


0.645

So, using a **sample** of our data, we got an accuracy of **64.5%**.

# Switching to full data set
Now we will apply the same approach to our entire dataset.

## Downloading Data

In [154]:
path = untar_data(URLs.IMDB)

In [155]:
path.ls()

[PosixPath('/Users/diegomedina-bernal/.fastai/data/imdb/test'),
 PosixPath('/Users/diegomedina-bernal/.fastai/data/imdb/tmp_clas'),
 PosixPath('/Users/diegomedina-bernal/.fastai/data/imdb/imdb.vocab'),
 PosixPath('/Users/diegomedina-bernal/.fastai/data/imdb/unsup'),
 PosixPath('/Users/diegomedina-bernal/.fastai/data/imdb/README'),
 PosixPath('/Users/diegomedina-bernal/.fastai/data/imdb/tmp_lm'),
 PosixPath('/Users/diegomedina-bernal/.fastai/data/imdb/train')]

In [156]:
# checking train folder
(path/'train').ls()

[PosixPath('/Users/diegomedina-bernal/.fastai/data/imdb/train/neg'),
 PosixPath('/Users/diegomedina-bernal/.fastai/data/imdb/train/pos'),
 PosixPath('/Users/diegomedina-bernal/.fastai/data/imdb/train/unsupBow.feat'),
 PosixPath('/Users/diegomedina-bernal/.fastai/data/imdb/train/labeledBow.feat')]

In [157]:
# Creating our dataset with fastai
reviews_full = (TextList.from_folder(path)
                        .split_by_folder(valid='test')
                        .label_from_folder(classes=['neg', 'pos']))

In [161]:
len(reviews_full.train), len(reviews_full.valid)

(25000, 25000)

In [162]:
# Grabbing our vocab
v = reviews_full.vocab

In [168]:
len(v.itos), len(v.stoi)

(38458, 118247)

In [169]:
# Grabbing a few words
v.itos[100:105]

['people', 'will', 'other', 'also', 'into']

In [171]:
? get_term_doc_matrix

Signature: get_term_doc_matrix(label_list, vocab_len)
Docstring: <no docstring>
File:      ~/Desktop/MadGeniusLearning/FASTAI_NLP/<ipython-input-19-7ae55a3c6a90>
Type:      function


In [174]:
# Getting our term documents
val_term_doc = get_term_doc_matrix(reviews_full.valid.x, len(reviews_full.vocab.itos))
trn_term_doc = get_term_doc_matrix(reviews_full.train.x, len(reviews_full.vocab.itos))

In [175]:
# Saving our documents
scipy.sparse.save_npz('trn_term_doc.npz', trn_term_doc)
scipy.sparse.save_npz('val_term_doc.npz', val_term_doc)

In [178]:
# Moving them into a new data directory
!mkdir data
!mv trn_term_doc.npz ./data; mv val_term_doc.npz ./data
!ls ./data

trn_term_doc.npz val_term_doc.npz


In [None]:
# # How to load - don't need to run now
# trn_term_doc = scipy.sparse.load_npz('./data/trn_term_doc.npz')
# val_term_doc = scipy.sparse.load_npz('./data/val_term_doc.npz')

In [180]:
# Checking our matrix
trn_term_doc.todense()[:10,:10] 

matrix([[ 0,  0,  1,  0, ...,  0,  0,  2,  2],
        [ 0,  0,  1,  0, ...,  0,  0, 12,  5],
        [ 0,  0,  1,  0, ...,  0,  0,  4,  9],
        [ 4,  0,  1,  0, ...,  1,  0, 19, 23],
        ...,
        [ 3,  0,  1,  0, ...,  0,  0, 10,  7],
        [ 2,  0,  1,  0, ...,  0,  0,  4,  5],
        [ 0,  0,  1,  0, ...,  0,  0,  9,  7],
        [ 2,  0,  1,  0, ...,  0,  0, 16,  7]])

# Naive Bayes on full dataset

In [181]:
x = trn_term_doc # our matrix
y = reviews_full.train.y # our true labels

val_y = reviews_full.valid.y.items # our validation labels

In [187]:
x

<25000x38458 sparse matrix of type '<class 'numpy.int64'>'
	with 3716653 stored elements in Compressed Sparse Row format>

In [186]:
# Getting labels
positive = y.c2i['pos']
negative = y.c2i['neg']

In [188]:
# grabbing p1, p0
p0 = np.squeeze(np.asarray(x[y.items==negative].sum(0)))
p1 = np.squeeze(np.asarray(x[y.items==positive].sum(0)))

In [189]:
pr1 = (p1+1) / ((y.items==positive).sum() + 1)
pr0 = (p0+1) / ((y.items==negative).sum() + 1)

In [190]:
r = np.log(pr1 / pr0)

In [191]:
b = np.log((y.items==positive).mean() / (y.items==negative).mean()); b

0.0

In [192]:
preds = (val_term_doc @ r + b) > 0 # predict positive when the summed log-count ratios plus bias are above 0

In [194]:
(preds == val_y).mean()

0.8086
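
The same steps were run on the sample and on the full dataset, so they can be wrapped into two small helpers. The sketch below does that and checks it on a tiny made-up dataset; the names `nb_fit` and `nb_predict` are hypothetical, and labels are assumed to be 1 for positive and 0 for negative.

```python
import numpy as np
import scipy.sparse

def nb_fit(x, y):
    """Return the log-count ratio r and the bias b from a term-document matrix and 0/1 labels."""
    pos, neg = (y == 1), (y == 0)
    p1 = np.asarray(x[pos].sum(0)).ravel()  # token counts over positive documents
    p0 = np.asarray(x[neg].sum(0)).ravel()  # token counts over negative documents
    pr1 = (p1 + 1) / (pos.sum() + 1)        # add-one smoothing, as in the cells above
    pr0 = (p0 + 1) / (neg.sum() + 1)
    r = np.log(pr1 / pr0)
    b = np.log(pos.mean() / neg.mean())
    return r, b

def nb_predict(x, r, b):
    """Predict positive (True) when the weighted sum of counts plus bias is above 0."""
    return (x @ r + b) > 0

# Toy data: column 0 counts "good", column 1 counts "bad"
x_train = scipy.sparse.csr_matrix(np.array([[2, 0], [1, 0], [0, 3], [0, 1]]))
y_train = np.array([1, 1, 0, 0])
x_valid = scipy.sparse.csr_matrix(np.array([[1, 0], [0, 2]]))

r, b = nb_fit(x_train, y_train)
preds = nb_predict(x_valid, r, b)
print(preds)
```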

# Logistic Regression
Now we will use **logistic regression** to tackle the same problem. 

Before moving forward: Let's learn more about **logistic regression**. 

#### Outcome
In logistic regression, the outcome (dependent variable) takes only a limited number of possible values, most commonly two (0 or 1). The model itself outputs a probability *between 0 and 1*. 

#### The dependent variable
Logistic regression is used when the response variable is categorical in nature. For instance: yes/no, true/false, red/green/blue, etc. 

#### Coefficient interpretation
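Logistic regression is a generalized linear model with a binomial family and a logit link, so each coefficient is the change in the log-odds of the positive class per unit change in that feature. For other generalized linear models, the interpretation depends on the family (binomial, Poisson, etc.) and link (log, logit, inverse-log, etc.) you use. 

#### Error minimization technique
Logistic regression uses the maximum likelihood method to arrive at the optimal solution. 

With the logistic loss function, the penalty for large errors grows only linearly (its slope approaches a constant), so badly misclassified points are penalized less heavily than under squared error. 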
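
To get a feel for how the logistic loss treats large errors, the sketch below compares it with squared error. It assumes labels in {-1, +1} and writes both losses in terms of the margin (label times raw score); the function names are made up for illustration.

```python
import math

def logistic_loss(margin):
    """log(1 + exp(-margin)), where margin = label * score and label is -1 or +1."""
    return math.log1p(math.exp(-margin))

def squared_loss(margin):
    """Squared error written in terms of the same margin."""
    return (1 - margin) ** 2

# Increasingly wrong predictions: the margin becomes more and more negative
for margin in (-1, -10, -100):
    print(margin, round(logistic_loss(margin), 3), squared_loss(margin))
```

For very negative margins the logistic loss is roughly the size of the margin itself, while the squared loss grows with its square.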

Below is how we can fit logistic regression where the features are the unigrams

## What is a unigram and ngram?
A **unigram** represents a single word, whereas an **ngram** represents a sequence of n consecutive words.

In [195]:
from sklearn.linear_model import LogisticRegression

In [198]:
m = LogisticRegression(C=0.1, dual=True, solver='liblinear') # dual=True is only supported by the liblinear solver
m.fit(x, y.items.astype(int))
preds = m.predict(val_term_doc)
(preds==val_y).mean() # printing accuracy

0.88272

In [201]:
# binary approach - does the word exist or not?
m = LogisticRegression(C=0.1, dual=True, solver='liblinear') # dual=True is only supported by the liblinear solver
m.fit(trn_term_doc.sign(), y.items.astype(int))
preds = m.predict(val_term_doc.sign())
(preds==val_y).mean() # printing accuracy

0.88544

# Trigram with NB features
Our next model is a version of logistic regression with Naive Bayes features. For every document we compute binarized features as shown above, but this time we use bigrams and trigrams too. Each feature is a log-count ratio. A logistic regression model is then trained to predict sentiment. 

## ngrams
An n-gram is a contiguous sequence of n items (where the items can be characters, syllables, or words). A 1-gram is a unigram, a 2-gram is a bigram, and a 3-gram is a trigram. 

Here, we are referring to sequences of words. So examples of bigrams include "the dog", "said that", and "can't you".
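
As a minimal sketch, n-grams can be read off a list of tokens with a sliding window. The helper name `ngrams` and the example sentence are made up for illustration.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[k:k + n]) for k in range(len(tokens) - n + 1)]

tokens = ["the", "dog", "said", "that"]
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A document with `len(tokens)` tokens yields `len(tokens) - n + 1` n-grams, which is why the feature space grows quickly once bigrams and trigrams are included.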

In [203]:
# Loading the data
path = untar_data(URLs.IMDB_SAMPLE) # sample
path.ls()

[PosixPath('/Users/diegomedina-bernal/.fastai/data/imdb_sample/texts.csv')]

In [204]:
# converting to TextList object
movie_reviews = (TextList.from_csv(path, 'texts.csv', cols='text')
                         .split_from_df(col=2)
                         .label_from_df(cols=0))

In [207]:
# Grabbing our vocab
v = movie_reviews.vocab.itos # returns list with each word

vocab_len = len(v)

## Our Data

### Creating train matrix

In [208]:
# ngram definition
min_n = 1
max_n = 3

j_indices = []
indptr = []
values = []
indptr.append(0)
num_tokens = vocab_len

itongram = dict()
ngramtoi = dict()

In [209]:
# Iterate through sequence of words to create ngrams
for i, doc in enumerate(movie_reviews.train.x):
    # Unigram counts; note that the n = 1 pass below adds the same unigrams again,
    # so unigram counts end up doubled in the final matrix
    feature_counter = Counter(doc.data)
    j_indices.extend(feature_counter.keys())
    values.extend(feature_counter.values())
    
    this_doc_ngrams = list()
    
    m = 0
    for n in range(min_n, max_n + 1):
        # Note: k runs up to vocab_len rather than len(doc.data), so slices taken
        # past the end of the document come back short or empty
        for k in range(vocab_len - n + 1):
            ngram = doc.data[k: k + n]
            if str(ngram) not in ngramtoi:
                if len(ngram)==1:
                    # unigrams keep their vocabulary index
                    num = ngram[0]
                    ngramtoi[str(ngram)] = num
                    itongram[num] = ngram
                else:
                    # new bigrams/trigrams get the next free index after the vocabulary
                    ngramtoi[str(ngram)] = num_tokens
                    itongram[num_tokens] = ngram
                    num_tokens += 1
            this_doc_ngrams.append(ngramtoi[str(ngram)])
            m += 1
            
    ngram_counters = Counter(this_doc_ngrams)
    j_indices.extend(ngram_counters.keys())
    values.extend(ngram_counters.values())
    indptr.append(len(j_indices))

In [211]:
# Creating our matrix
train_ngram_doc_matrix = scipy.sparse.csr_matrix((values, j_indices, indptr),
                                                shape=(len(indptr) - 1, len(ngramtoi)),
                                                dtype=int)

In [212]:
train_ngram_doc_matrix

<800x260428 sparse matrix of type '<class 'numpy.int64'>'
	with 678936 stored elements in Compressed Sparse Row format>

In [213]:
train_ngram_doc_matrix.todense()[:10, :10]

matrix([[16,  0,  2,  0, ...,  0,  0,  4,  6],
        [44,  0,  2,  0, ...,  0,  0, 54, 40],
        [ 8,  0,  2,  0, ...,  0,  0, 10, 30],
        [26,  0,  2,  0, ...,  0,  0, 32, 16],
        ...,
        [ 8,  0,  2,  0, ...,  0,  0, 38, 16],
        [84,  0,  2,  0, ...,  0,  0, 60, 32],
        [36,  0,  2,  0, ...,  0,  0, 30, 24],
        [40,  0,  2,  0, ...,  0,  0, 20, 10]])

More on ```train_ngram_doc_matrix```: it is a sparse matrix of shape (800, 260428), where 800 is the number of reviews and 260428 is the number of distinct n-gram features (unigrams, bigrams, and trigrams).
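
The index assignment behind those columns can be sketched in isolation. This simplified version uses tuples as dictionary keys (instead of `str(ngram)`) and only handles bigrams and trigrams; the function name `build_ngram_index` and the toy documents are made up for illustration.

```python
def build_ngram_index(docs, vocab_len, max_n=3):
    """Give every distinct bigram/trigram its own column index, starting after the vocabulary."""
    ngram_to_idx = {}
    next_idx = vocab_len  # unigrams already occupy columns 0..vocab_len-1
    for doc in docs:
        for n in range(2, max_n + 1):
            for k in range(len(doc) - n + 1):
                key = tuple(doc[k:k + n])
                if key not in ngram_to_idx:
                    ngram_to_idx[key] = next_idx
                    next_idx += 1
    return ngram_to_idx

# Two toy documents, already converted to token indices
docs = [[2, 5, 7, 5, 7], [5, 7, 9]]
idx = build_ngram_index(docs, vocab_len=10)

print(idx)
```

Each new n-gram claims the next unused column, so the number of columns equals the vocabulary size plus the number of distinct bigrams and trigrams seen in training.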

In [214]:
# Creating validation matrix (only n-grams already seen in training are kept)
j_indices = []
indptr = []
values = []
indptr.append(0)

for i, doc in enumerate(movie_reviews.valid.x):
    feature_counter = Counter(doc.data)
    j_indices.extend(feature_counter.keys())
    values.extend(feature_counter.values())
    this_doc_ngrams = list()
    
    m = 0
    for n in range(min_n, max_n + 1):
        for k in range(vocab_len - n + 1):
            ngram = doc.data[k: k+n]
            # n-grams never seen in training have no column, so they are skipped
            if str(ngram) in ngramtoi:
                this_doc_ngrams.append(ngramtoi[str(ngram)])
            m += 1
            
    ngram_counter = Counter(this_doc_ngrams)
    j_indices.extend(ngram_counter.keys())
    values.extend(ngram_counter.values())
    indptr.append(len(j_indices))

In [215]:
valid_ngram_doc_matrix = scipy.sparse.csr_matrix((values, j_indices, indptr),
                                                shape=(len(indptr) - 1, len(ngramtoi)),
                                                dtype=int)

In [216]:
valid_ngram_doc_matrix

<200x260428 sparse matrix of type '<class 'numpy.int64'>'
	with 121595 stored elements in Compressed Sparse Row format>

In [217]:
train_ngram_doc_matrix

<800x260428 sparse matrix of type '<class 'numpy.int64'>'
	with 678936 stored elements in Compressed Sparse Row format>

In [218]:
# saving our data
scipy.sparse.save_npz('./data/train_ngram_matrix.npz', train_ngram_doc_matrix)
scipy.sparse.save_npz('./data/valid_ngram_matrix.npz', valid_ngram_doc_matrix)

# Naive Bayes

In [219]:
x = train_ngram_doc_matrix
y = movie_reviews.train.y

positive = y.c2i['positive']
negative = y.c2i['negative']

In [220]:
x

<800x260428 sparse matrix of type '<class 'numpy.int64'>'
	with 678936 stored elements in Compressed Sparse Row format>

In [221]:
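# k caps the number of training rows used; it exceeds the 800 reviews,
# so the slices below keep every row
k = 260428

pos = (y.items == positive)[:k]
neg = (y.items == negative)[:k]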

In [222]:
xx = x[:k]

In [223]:
valid_labels = [o == positive for o in movie_reviews.valid.y.items]

In [224]:
# per-feature counts summed over the negative (p0) and positive (p1) reviews
p0 = np.squeeze(np.array(xx[neg].sum(0)))
p1 = np.squeeze(np.array(xx[pos].sum(0)))

In [225]:
# add-one smoothing: feature count per class divided by the number of reviews in that class
pr1 = (p1+1) / ((y.items==positive).sum() + 1)
pr0 = (p0+1) / ((y.items==negative).sum() + 1)

In [226]:
r = np.log(pr1/pr0)

In [227]:
b = np.log((y.items==positive).mean() / (y.items==negative).mean())

In [228]:
# log-odds score for each validation review; a score above 0 predicts positive
pre_preds = valid_ngram_doc_matrix @ r.T + b
preds = pre_preds.T > 0

(preds==valid_labels).mean() # our accuracy

0.76
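The Naive Bayes steps above (class-wise counts, add-one smoothing, log-count ratio `r`, prior term `b`, then a linear score) can be traced by hand on a tiny example. This is a sketch under stated assumptions: the count matrix and labels below are invented, and plain Python lists stand in for the sparse matrices used in the notebook.

```python
import math

# Toy count matrix (hypothetical): columns are the features "good", "bad", "movie".
train_x = [[2, 0, 1], [1, 0, 1], [0, 2, 1], [0, 1, 1]]
train_y = [1, 1, 0, 0]            # 1 = positive, 0 = negative
valid_x = [[1, 0, 1], [0, 1, 2]]

n_features = len(train_x[0])
n_pos = sum(1 for label in train_y if label == 1)
n_neg = len(train_y) - n_pos

# Per-feature counts summed over each class (p1 and p0 in the notebook).
p1 = [sum(row[j] for row, label in zip(train_x, train_y) if label == 1)
      for j in range(n_features)]
p0 = [sum(row[j] for row, label in zip(train_x, train_y) if label == 0)
      for j in range(n_features)]

# Add-one smoothing, then the log-count ratio r and the class-prior term b.
pr1 = [(count + 1) / (n_pos + 1) for count in p1]
pr0 = [(count + 1) / (n_neg + 1) for count in p0]
r = [math.log(num / den) for num, den in zip(pr1, pr0)]
b = math.log(n_pos / n_neg)

# Linear score per validation row; a score above zero predicts "positive".
scores = [sum(x_j * r_j for x_j, r_j in zip(row, r)) + b for row in valid_x]
preds = [score > 0 for score in scores]
```

Here "movie" appears equally often in both classes, so its entry in `r` is zero and it does not influence the prediction, while "good" and "bad" push the score in opposite directions.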

# Logistic Regression

In [230]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

In [231]:
veczr = CountVectorizer(ngram_range=(1,3), preprocessor=noop, tokenizer=noop, max_features=800000)

In [232]:
veczr

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=800000, min_df=1,
                ngram_range=(1, 3), preprocessor=<function noop at 0x137c63620>,
                stop_words=None, strip_accents=None,
                token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=<function noop at 0x137c63620>, vocabulary=None)

In [236]:
docs = movie_reviews.train.x

In [237]:
train_words = [[docs.vocab.itos[o] for o in doc.data] for doc in movie_reviews.train.x]
valid_words = [[docs.vocab.itos[o] for o in doc.data] for doc in movie_reviews.valid.x]

In [239]:
train_ngram_doc = veczr.fit_transform(train_words)

In [240]:
train_ngram_doc

<800x260427 sparse matrix of type '<class 'numpy.int64'>'
	with 565716 stored elements in Compressed Sparse Row format>

In [244]:
# Getting valid matrix
val_ngram_doc = veczr.transform(valid_words)
val_ngram_doc

<200x260427 sparse matrix of type '<class 'numpy.int64'>'
	with 93547 stored elements in Compressed Sparse Row format>

In [245]:
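# get_feature_names() was removed in scikit-learn 1.2; on newer versions use get_feature_names_out()
vocab = veczr.get_feature_names()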

In [252]:
vocab[200000:200005]

['the room she',
 'the room when',
 'the room where',
 'the rooms',
 'the rooms are']
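The vocabulary entries above are word n-grams joined by spaces. As a rough illustration of what `ngram_range=(1,3)` produces from an already tokenized review, the sketch below enumerates the n-grams of a short token list. The function `word_ngrams` is a hypothetical helper, not part of scikit-learn, and the real `CountVectorizer` may order or store its features differently.

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """Return every n-gram of length min_n..max_n as a space-joined string."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        # slide a window of n tokens across the document
        for k in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[k:k + n]))
    return ngrams


tokens = ["the", "room", "where"]
features = word_ngrams(tokens)
```

A document of `T` tokens yields `T` unigrams, `T - 1` bigrams, and `T - 2` trigrams, which is why the feature count grows so quickly compared with a unigram-only vocabulary.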

In [254]:
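# running our model
# dual=True is only supported by the liblinear solver, so we request it explicitly
m = LogisticRegression(C=0.1, dual=True, solver='liblinear')
m.fit(train_ngram_doc, y.items)
preds = m.predict(val_ngram_doc)
(preds.T == valid_labels).mean()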



0.785

In [255]:
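# binary: .sign() turns each n-gram count into a 0/1 presence flag
# dual=True is only supported by the liblinear solver, so we request it explicitly
m = LogisticRegression(C=0.1, dual=True, solver='liblinear')
m.fit(train_ngram_doc.sign(), y.items)
preds = m.predict(val_ngram_doc.sign())
(preds.T == valid_labels).mean()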



0.83
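The binarized run replaces each n-gram count with a presence flag, so a phrase that appears five times in a review carries the same weight as one that appears once. A minimal sketch of that transformation, with a made-up `counts` matrix and a hypothetical `sign` helper standing in for the sparse matrix's `.sign()` method:

```python
def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


# Toy count matrix (hypothetical): rows are reviews, columns are n-gram features.
counts = [[3, 0, 1], [0, 2, 0]]

# Counts are never negative, so sign() maps them to 0 (absent) or 1 (present).
binary = [[sign(c) for c in row] for row in counts]
```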