THE PROBLEM
The dataset contains 50,000 movie reviews from IMDb, 
with an equal number of positive and negative reviews. IMDb lets users rate movies on a scale from 1 to 10. 
To label the reviews, the curator of the data marked anything with ≤ 4 stars as negative and anything with ≥ 7 stars as positive. Reviews with 5 or 6 stars were left out. The dataset is divided into training and test sets.
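
As a small illustration of the labeling rule described above, here is a minimal sketch. The function name label_from_rating is hypothetical (it is not part of the dataset or of this notebook), and the 0/1 encoding is an assumption chosen to match the ['neg','pos'] order used later.

```python
def label_from_rating(rating):
    """Map a 1-10 star rating to 0 (negative), 1 (positive), or None (left out)."""
    if rating <= 4:
        return 0      # negative review
    if rating >= 7:
        return 1      # positive review
    return None       # 5 or 6 stars: excluded from the dataset

print([label_from_rating(r) for r in (1, 4, 5, 6, 7, 10)])
```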

Our task is to read each movie review and predict whether it is positive or negative.

Step 1: Import the libraries that we will use in this project

In [8]:
#import the important libraries
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from glob import glob
import numpy as np
import os,re,string
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
import warnings
warnings.filterwarnings("ignore")

Step 2: IMPORT DATA
The data can be downloaded by running the following commands in a Jupyter notebook:

In [None]:
#import data
!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!gunzip aclImdb_v1.tar.gz
!tar -xvf aclImdb_v1.tar

--2021-03-04 03:30:32--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz’


2021-03-04 03:30:37 (15.5 MB/s) - ‘aclImdb_v1.tar.gz’ saved [84125825/84125825]

gzip: aclImdb_v1.tar already exists; do you wish to overwrite (y or n)? 

Once the above commands have finished, you’ll see a train and a test directory. Inside the train directory there is a neg and a pos directory: the strongly positive reviews went into /pos and the strongly negative ones into /neg. Both directories contain a bunch of text files.

In [9]:
#Declare Path & Names
PATH='aclImdb/'
names = ['neg','pos']
#Check the files in the path
!ls {PATH}

imdbEr.txt  imdb.vocab	README	test  train


In [10]:
#Check the path for Train Folder
!ls {PATH}train

labeledBow.feat  pos	unsupBow.feat  urls_pos.txt
neg		 unsup	urls_neg.txt   urls_unsup.txt


In [11]:
#List the first few files in train/pos
!ls {PATH}train/pos | head

0_9.txt
10000_8.txt
10001_10.txt
10002_7.txt
10003_8.txt
10004_8.txt
10005_7.txt
10006_7.txt
10007_7.txt
10008_7.txt
ls: write error: Broken pipe


In [12]:
#Similar for the test folder
!ls {PATH}test

labeledBow.feat  neg  pos  urls_neg.txt  urls_pos.txt


In [13]:
#List the first few files in test/pos
!ls {PATH}test/pos | head

0_10.txt
10000_7.txt
10001_9.txt
10002_8.txt
10003_8.txt
10004_9.txt
10005_8.txt
10006_7.txt
10007_10.txt
10008_8.txt
ls: write error: Broken pipe


In [14]:
#Load every review and its label from the given class folders
def load_texts_labels_from_folders(path, folders):
    texts,labels = [],[]
    # the label of a review is the index of its folder in `folders`
    for idx,label in enumerate(folders):
        for fname in glob(os.path.join(path, label, '*.*')):
            # close each file promptly instead of leaking the handle
            with open(fname, 'r') as f:
                texts.append(f.read())
            labels.append(idx)
    # stored as np.int8 to save space 
    return texts, np.array(labels).astype(np.int8)

In [15]:
trn,trn_y = load_texts_labels_from_folders(f'{PATH}train',names)
val,val_y = load_texts_labels_from_folders(f'{PATH}test',names)

In [16]:
#Check the number of files
len(trn),len(trn_y),len(val),len(val_y)

(25000, 25000, 25000, 25000)

In [17]:
len(trn_y[trn_y==1]),len(val_y[val_y==1])

(12500, 12500)

In [18]:
np.unique(trn_y)

array([0, 1], dtype=int8)

Note:
The data is split evenly with 25k reviews intended for training and 25k for testing your classifier. Moreover, each set has 12.5k positive and 12.5k negative reviews.

Pull out a file as an example:

In [19]:
print(trn[0])
print()
print(f"Review's label: {trn_y[0]}")
# 0 represents a negative review

This is indeed a god adaptation of Jane Austen's novel. Compared with the American Version with Guinneth Paltrow, the script was written to resemble as much as possible the book. But the acting was awful. Besides Kate Beckinsale, who I believe was a true likeness of the Emma in the book, all the other actors were trying too hard. Mark Strong was not the "gentleman" he was supposed to be. He was often rude and offensive, had no feeling whatsoever, and throughout the entire film you could not see his love "growing" for Emma at all. This had a terrible effect on Kate Beckinsale, who seemed to be trying to "resque" her leading role as well as her partner's. Moreover, there was no chemistry between the entire cast. Hariett Smith, played by Samantha Morton, seemed to have no real attachment to Mr. Elton, played by Dominic Rowan. Therefore, she did not seem as heartbroken as she was portrayed in the book. The settings of the film are also too poor. The costumes are even more so. I would have 

Text data preprocessing:
The next step is to preprocess the movie reviews. To do this we will use the CountVectorizer API of scikit-learn, which converts a collection of text documents to a matrix of token counts. In practice, it creates a sparse bag-of-words matrix, with the caveat that it throws away one of the most interesting things about language: the order in which the words appear.

This is very often not a good idea, but in this particular case it turns out to work reasonably well. Normally, the order of the words matters a lot: if there is a “not” before something, then that “not” refers to that thing.

But in this case, we are only trying to predict whether a review is positive or negative. If the word “absurd” or “cryptic” appears a lot, then maybe that’s a sign that the movie isn’t very good. So the idea is to turn the reviews into a term-document matrix, where for each document (i.e. each review) we record which words it contains and how often, rather than what order they are in.
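
To make the idea concrete, here is a minimal sketch on two made-up sentences (the toy documents and variable names are assumptions for illustration, and it uses CountVectorizer's default tokenizer rather than the custom one defined later). It shows that each row holds token counts and that reordering the words of a document leaves its row unchanged.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy "reviews"; not taken from the IMDb data
docs = ["this movie is good", "this movie is bad bad"]

vec = CountVectorizer()
term_doc = vec.fit_transform(docs)   # sparse matrix: one row per document

# Column order follows the (alphabetically sorted) vocabulary
columns = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print(columns)
print(term_doc.toarray())

# Word order is discarded: a shuffled sentence yields the same row as docs[0]
shuffled = vec.transform(["good is this movie"])
print((shuffled.toarray() == term_doc[0].toarray()).all())
```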

Tokenization:
Before transforming our text into a term-document matrix, we need to tokenize it. 
In NLP, tokenization is the process of splitting text into a list of tokens (roughly, words and punctuation marks). 

For example:
"This movie is good" ---> ["This", "movie", "is", "good"]
This looks like a trivial process, but it isn't. Given the sentence "This 'movie' isn’t good.", how do you deal with the punctuation? A good tokenizer would turn it into:

"This 'movie' isn’t good." ---> ["This", "'", "movie", "'", "is", "n't", "good","."]
Every token is either a single piece of punctuation, a word, or the suffix n't, which is treated like a word. That’s how we would probably want to tokenize that piece of text. You wouldn’t want to split only on spaces, because that would produce odd tokens like "good." and "'movie'".
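
The difference can be seen in a minimal sketch like the one below. The regular expression here is an assumption chosen for illustration (it is not the tokenizer used in this notebook), and the example uses a plain ASCII apostrophe.

```python
import re

s = "This 'movie' isn't good."

# Splitting on whitespace keeps punctuation glued to the words
naive = s.split()
print(naive)

# A simple alternative: runs of word characters, or single punctuation marks
simple = re.findall(r"\w+|[^\w\s]", s)
print(simple)
```

Note that this simple pattern still breaks isn't into three pieces; producing an n't token takes a more careful tokenizer.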

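CountVectorizer (part of sklearn.feature_extraction.text) accepts a custom tokenizer. The regex tokenizer in the next cell is a simple one: it only separates punctuation from words, so it splits isn’t into isn, ’ and t rather than producing an n't token.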

In [20]:
#Tokenizing
re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')
def tokenize(s): return re_tok.sub(r' \1 ', s).split()

#See how tokenizing works
s = "This 'movie' isn’t good." 
tokenize(s)

['This', "'", 'movie', "'", 'isn', '’', 't', 'good', '.']

In [21]:
#create term-document matrix
veczr = CountVectorizer(tokenizer=tokenize)

fit_transform(trn) finds the vocabulary in the training set and transforms the training set into a term-document matrix. Since we have to apply the same transformation to the validation set, the second line of the next cell calls only transform(val). trn_term_doc and val_term_doc are sparse matrices. trn_term_doc[i] represents training document i and contains, for each word in the vocabulary, the number of times that word occurs in the document.
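
A minimal sketch of why the distinction matters, using made-up documents (the toy data and variable names are assumptions for illustration): words that appear only in the validation text are not in the fitted vocabulary, so transform simply ignores them and the column count stays the same.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["good movie", "bad movie"]   # toy training set
valid_docs = ["great movie"]               # "great" never appears in training

vec = CountVectorizer()
train_mat = vec.fit_transform(train_docs)  # learns the vocabulary: bad, good, movie
valid_mat = vec.transform(valid_docs)      # reuses that vocabulary unchanged

print(train_mat.shape, valid_mat.shape)
print(valid_mat.toarray())                 # only the "movie" column is non-zero
```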

In [22]:
trn_term_doc = veczr.fit_transform(trn)
# Important: Use same vocab for validation set
val_term_doc = veczr.transform(val)

As seen below, when we create this term-document matrix, the training set has 25,000 rows because there are 25,000 movie reviews, and 75,132 columns, which is the number of unique tokens in the vocabulary.

In [23]:
#See how many unique words are there
trn_term_doc

<25000x75132 sparse matrix of type '<class 'numpy.int64'>'
	with 3749745 stored elements in Compressed Sparse Row format>

75,132 is a lot of columns. Since most documents don’t contain most of these 75,132 words, we don’t want to store the matrix as a normal dense array in memory. Instead, we store it as a sparse matrix.

What is a sparse matrix?
A sparse matrix stores only the locations and values of the non-zero entries. For example, in document number 1, word number 4 appears 4 times, term number 123 appears once, and so forth.

(1, 4) → 4
(1, 123) → 1

That’s basically how it’s stored and the important thing is that it’s efficient.
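
The same example can be reproduced with SciPy in a minimal sketch (the matrix size of 3 x 200 is an arbitrary assumption for illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix

# A mostly-zero dense matrix: 3 documents, 200 vocabulary entries
dense = np.zeros((3, 200), dtype=np.int64)
dense[1, 4] = 4      # document 1 contains word 4 four times
dense[1, 123] = 1    # document 1 contains word 123 once

sparse = csr_matrix(dense)
print(sparse.nnz)    # number of stored (non-zero) elements

# List the stored coordinates and their values
rows, cols = sparse.nonzero()
for r, c in zip(rows, cols):
    print((int(r), int(c)), "->", int(sparse[r, c]))
```

Only the two non-zero entries are stored, instead of all 600 cells.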

*******************************************************************************************************************************************************************
Below is some practice code to check that the words are mapped properly

In [24]:
trn_term_doc[5] #87 stored elements

<1x75132 sparse matrix of type '<class 'numpy.int64'>'
	with 87 stored elements in Compressed Sparse Row format>

We grab the sixth review, which gives us a 75,132-long sparse row with 87 non-zero stored elements. In other words, the sixth review contains 87 distinct tokens.

In [25]:
#Approximate the distinct words of review 5 by lowercasing and splitting on spaces
w0 = set([o.lower() for o in trn[5].split(' ')]); w0

{'(as',
 'a',
 'actors!!!the',
 'and',
 'approach.the',
 'are',
 'as',
 'awful',
 'b-movie',
 'been',
 'blends',
 'but',
 'camera',
 'can',
 'childish',
 'clise',
 'day,there',
 'did)',
 'do',
 'enjoy',
 'even',
 'evening',
 'film',
 'for',
 'friends',
 'full',
 'funny',
 'genre',
 'go',
 'got',
 'green',
 'have',
 'i',
 'in',
 'inspiration(at',
 'is',
 'it',
 'its',
 'least',
 'light',
 'lines',
 'movie).everything',
 'movie,has',
 'must',
 'no',
 'of',
 'plot',
 'poe',
 'pointer,even',
 'really',
 'red',
 'reflections',
 'relaxing',
 "scenery's",
 'scenes',
 'see',
 'seriously!',
 'shallow',
 'shot',
 'slightest',
 'take',
 'that',
 'the',
 'then',
 "there's",
 'this',
 'thriller',
 'thrilling.if',
 'to',
 'want',
 'watch',
 'way',
 'where',
 'will',
 'with',
 'without',
 'you'}

In [61]:
#Approximate the distinct words of review 0 by lowercasing and splitting on spaces
w1 = set([o.lower() for o in trn[0].split(' ')]); w1

{'"blockbuster",',
 '"gentleman"',
 '"growing"',
 '"lighter"',
 '"resque"',
 '(excepting',
 'a',
 'acting',
 'actors',
 'actors.',
 'adaptation',
 'again,',
 'all',
 'all.',
 'also',
 'american',
 'and',
 'any',
 'are',
 'as',
 'at',
 'attachment',
 "austen's",
 'awful.',
 'be',
 'be.',
 'beckinsale)',
 'beckinsale,',
 'believe',
 'believed',
 'besides',
 'better',
 'between',
 "book's",
 'book,',
 'book.',
 'budget',
 'but',
 'by',
 'can',
 'cast.',
 'chemistry',
 'clear',
 'compared',
 'conclude,',
 'costumes',
 'could',
 'did',
 'does',
 'dominic',
 'dress',
 'effect',
 'elegant',
 'elton,',
 'emma',
 'ending',
 'ending,',
 'entire',
 'even',
 'face',
 'fashionable',
 'feeling',
 'film',
 'film.',
 'for',
 'god',
 'good',
 'guinneth',
 'had',
 'happiness',
 'hard.',
 'hariett',
 'have',
 'he',
 'heartbroken',
 'her',
 'here.',
 'his',
 'i',
 'if',
 'imagined',
 'in',
 'indeed',
 'is',
 'it',
 'jane',
 'kate',
 'killer',
 'leading',
 'likeness',
 'long.',
 'love',
 'loyal',
 'made',


In [27]:
len(w0)
# length is 77, which is close to the 87 stored elements; the 
# difference is that I didn’t use a real tokenizer here.

77

In [62]:
len(w1)
# length is 167, which is close to the 163 stored elements of 
# trn_term_doc[0]; the difference is that I didn’t use a real tokenizer here.

167

In [64]:
trn_term_doc[0] #163 stored elements

<1x75132 sparse matrix of type '<class 'numpy.int64'>'
	with 163 stored elements in Compressed Sparse Row format>

Sklearn lets us look at the vocabulary by calling veczr.get_feature_names(). Here are a few of the feature names:

In [28]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
vocab = veczr.get_feature_names()
print(len(vocab))
vocab[5000:5005]

75132


['aussie', 'aussies', 'austen', 'austeniana', 'austens']

CountVectorizer simply built a list of unique words and mapped each one to an integer ID. We can look up the ID of a particular word in veczr.vocabulary_. It is the reverse of veczr.get_feature_names: get_feature_names maps integer to word, while veczr.vocabulary_ maps word to integer.
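
A minimal sketch of the two directions of this mapping on toy data (the documents and the names word_to_id and id_to_word are assumptions for illustration). The reverse list is rebuilt from vocabulary_ itself, so it does not depend on which scikit-learn version is installed.

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer().fit(["good movie", "bad movie"])

word_to_id = vec.vocabulary_                        # word -> column index
id_to_word = sorted(word_to_id, key=word_to_id.get) # column index -> word

print(word_to_id["movie"])
print(id_to_word[word_to_id["movie"]])              # round trip gives the word back
```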

In [51]:
veczr.vocabulary_['awful']

5234

In [52]:
trn_term_doc[0,5234]
# word 'awful' (ID 5234) appears once in the first document

1

In [57]:
veczr.vocabulary_['shallow']

59356

In [60]:
trn_term_doc[5,59356]

1

In [36]:
vocab[4050]

'arching'

In [67]:
veczr.vocabulary_['ending']

21789

In [68]:
trn_term_doc[0,21789]
# word 'ending' appears thrice in the first document

3

*******************************************************************************************************************************************************************
Practice Ends Here

NAIVE BAYES:

In [39]:
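def pr(y_i):
    # per-token counts summed over the documents of class y_i
    # (uses the globals x and y, which are set in the next cell)
    p = x[y==y_i].sum(0)
    # add-one smoothing, divided by the number of documents in the class
    return (p+1) / ((y==y_i).sum()+1)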

In [40]:
x=trn_term_doc
y=trn_y

r = np.log((pr(1)/pr(0)))
b = np.log((y==1).mean() / (y==0).mean())
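
Read as formulas, the two cells above appear to compute the following. This is a sketch of what the code does, not a derivation; the symbols x_i (the count row of review i), N_c (the number of training reviews with label c) and N (the total number of training reviews) are notation introduced here.

```latex
\[
p_c = \frac{1 + \sum_{i \,:\, y_i = c} x_i}{1 + N_c}, \qquad c \in \{0, 1\}
\]
\[
r = \log \frac{p_1}{p_0}, \qquad
b = \log \frac{N_1 / N}{N_0 / N} = \log \frac{N_1}{N_0}
\]
```

Here p_c and r are vectors with one entry per vocabulary token, and the division and logarithm are taken element-wise. Since the training set has 12,500 reviews of each class, b = log 1 = 0 for this dataset.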

In [41]:
r.shape,val_term_doc.shape

((1, 75132), (25000, 75132))

In [69]:
preds

matrix([[False, False, False, ...,  True,  True,  True]])

In [43]:
#Formula for Naive Bayes
pre_preds = val_term_doc @ r.T + b
preds = pre_preds.T>0
(preds==val_y).mean()

0.81656
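
To see the same computation on something small enough to check by hand, here is a minimal sketch with a made-up dense count matrix. The toy data is an assumption for illustration, and pr is rewritten to take x and y as arguments instead of reading globals.

```python
import numpy as np

# 4 toy documents, 3 vocabulary tokens (columns); not taken from the IMDb data
x = np.array([[2, 0, 1],
              [1, 0, 0],
              [0, 3, 1],
              [0, 1, 0]])
y = np.array([1, 1, 0, 0])   # 1 = positive, 0 = negative

def pr(x, y, y_i):
    # smoothed token counts of class y_i, divided by the class size
    p = x[y == y_i].sum(0)
    return (p + 1) / ((y == y_i).sum() + 1)

r = np.log(pr(x, y, 1) / pr(x, y, 0))          # log-count ratio per token
b = np.log((y == 1).mean() / (y == 0).mean())  # log ratio of class frequencies

preds = (x @ r + b) > 0                        # True means "positive"
print(r)
print(preds)
```

Token 0 occurs only in positive documents and gets a positive weight, token 1 occurs only in negative ones and gets a negative weight, and token 2 is equally common in both, so its weight is zero.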

In [44]:
#Binarized Naive Bayes
x=trn_term_doc.sign()
r = np.log(pr(1)/pr(0))

pre_preds = val_term_doc.sign() @ r.T + b #sign binarize
preds = pre_preds.T>0
(preds==val_y).mean()

0.83016
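
The .sign() call used above turns counts into presence/absence indicators. A minimal sketch on a toy matrix (the values are assumptions for illustration):

```python
from scipy.sparse import csr_matrix

counts = csr_matrix([[2, 0, 1],
                     [0, 3, 0]])

# sign() maps every positive count to 1 and leaves zeros untouched
binary = counts.sign()
print(binary.toarray())
```

In the binarized variant, each token therefore contributes at most once per review, no matter how often it is repeated.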