# WARNING
**Please make sure to "COPY AND EDIT NOTEBOOK" to use compatible library dependencies! DO NOT CREATE A NEW NOTEBOOK AND COPY+PASTE THE CODE - this will use latest Kaggle dependencies at the time you do that, and the code will need to be modified to make it work. Also make sure internet connectivity is enabled on your notebook**

# Preliminaries
Write the installed requirements to a file every time the notebook runs, in case you need to go back and recover the dependencies. **MOST OF THESE REQUIREMENTS WOULD NOT BE NECESSARY FOR LOCAL INSTALLATION**

The latest known requirements for each notebook are hosted in the companion GitHub repo, and can be pulled down and installed here if needed. The companion GitHub repo is located at https://github.com/azunre/transfer-learning-for-nlp

In [1]:
!pip freeze > kaggle_image_requirements.txt

# Download IMDB Movie Review Dataset
Download the IMDB movie review archive and unpack it into the working directory.

In [2]:
import random
import pandas as pd

## Read-in the reviews and print some basic descriptions of them

!wget -q "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
!tar xzf aclImdb_v1.tar.gz

wget: /opt/conda/lib/libuuid.so.1: no version information available (required by wget)


# Define Tokenization, Stop-word and Punctuation Removal Functions
Before proceeding, we must decide how many samples to draw from each class. We must also decide the maximum number of tokens per review, and the maximum length of each token. This is done by setting the following overarching hyperparameters:

In [3]:
Nsamp = 1000 # target number of samples per class - 'negative', 'positive' (2*Nsamp reviews are drawn in total)
maxtokens = 200 # the maximum number of tokens per document
maxtokenlen = 100 # the maximum length of each token

**Tokenization**

In [4]:
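def tokenize(row):
    if row is None or row == '':  # compare by value; `is ''` tests object identity
        tokens = []  # empty token list, consistent with the non-empty branch
    else:
        tokens = row.split(" ")[:maxtokens]
    return tokens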

**Use regular expressions to remove unnecessary characters**

Next, we define a function that uses Python's built-in `re` module to remove punctuation marks and other nonword characters from the reviews. In the same step, we truncate all tokens to the hyperparameter maxtokenlen defined above.

In [5]:
import re

def reg_expressions(row):
    tokens = []
    try:
        for token in row:
            token = token.lower() # make all characters lower case
            token = re.sub(r'[\W\d]', "", token)
            token = token[:maxtokenlen] # truncate token
            tokens.append(token)
    except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
        token = ""
        tokens.append(token)
    return tokens
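
To see what the pattern `[\W\d]` actually strips, here is a minimal sketch that applies the same substitution to a few sample tokens. The helper name `clean_token` and the sample tokens are illustrative assumptions, not part of the notebook.

```python
import re

maxtokenlen = 100  # same value as the hyperparameter defined above

def clean_token(token):
    """Hypothetical helper mirroring the per-token steps of reg_expressions."""
    token = token.lower()                  # make all characters lower case
    token = re.sub(r'[\W\d]', "", token)   # drop non-word characters and digits
    return token[:maxtokenlen]             # truncate token

samples = ["Hello,", "it's", "2nd", "snake_case", "!!!"]
cleaned = [clean_token(t) for t in samples]
print(cleaned)  # ['hello', 'its', 'nd', 'snake_case', '']
```

Two things stand out: the underscore survives because `\W` treats it as a word character, and a punctuation-only token turns into an empty string that is still appended to the result.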

**Stop-word removal**

Stop-words are also removed. Stop-words are words that are very common in text but offer little useful information for classifying it. Words such as "is", "and", "the" and "are" are examples of stop-words. The NLTK library provides a list of English stop-words (127 at the time of writing; the exact count varies between NLTK versions) that can be used to filter our tokenized strings.

In [6]:
import nltk

nltk.download('stopwords')
from nltk.corpus import stopwords
stopwords = stopwords.words('english')    

# print(stopwords) # see default stopwords
# it may be beneficial to drop negation words from the removal list, as they can change the positive/negative meaning
# of a sentence
# stopwords.remove("no")
# stopwords.remove("nor")
# stopwords.remove("not")

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
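
The commented-out lines in the cell above hint at keeping negation words in the text. A minimal sketch of that idea follows; the short `stopword_list` is a hypothetical stand-in for `stopwords.words('english')`, so the example runs without downloading anything.

```python
# Hypothetical stand-in for stopwords.words('english'); the real list is much longer.
stopword_list = ["the", "is", "and", "a", "no", "nor", "not"]

# Negation words we want to keep in the text because they can flip sentiment.
negations = {"no", "nor", "not"}

# A set gives fast membership tests and makes it easy to exclude the negations.
stopword_set = {w for w in stopword_list if w not in negations}

tokens = ["the", "film", "is", "not", "good"]
kept = [t for t in tokens if t not in stopword_set]
print(kept)  # ['film', 'not', 'good']
```

Whether keeping negations helps depends on the model; with averaged word vectors the effect may be small, so it is best treated as something to test rather than assume.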


In [7]:
def stop_word_removal(row):
    token = [token for token in row if token not in stopwords]
    token = list(filter(None, token))  # drop empty strings; return a list rather than a lazy filter object
    return token

# Assemble Embedding Vectors

The following cells load pretrained fastText vectors and define the functions used to average them into one embedding vector per review.

In [8]:
import time
from gensim.models import FastText, KeyedVectors

!ls

start=time.time()
FastText_embedding = KeyedVectors.load_word2vec_format("../input/jigsaw/wiki.en.vec")
end = time.time()
print("Loading the embedding took %d seconds"%(end-start))

__notebook__.ipynb  aclImdb  aclImdb_v1.tar.gz	kaggle_image_requirements.txt
Loading the embedding took 1119 seconds


In [9]:
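import numpy as np  # needed by the functions below; the notebook otherwise imports it only in a later cell

def handle_out_of_vocab(embedding,in_txt):
    out = None
    for word in in_txt:
        try:
            tmp = embedding[word]
            tmp = tmp.reshape(1,len(tmp))
            
            if out is None:
                out = tmp
            else:
                out = np.concatenate((out,tmp),axis=0)
        except KeyError:  # skip words that are missing from the embedding vocabulary
            pass
    
    return out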
        

def assemble_embedding_vectors(data):
    out = None
    for item in data:
        tmp = handle_out_of_vocab(FastText_embedding,item)
        if tmp is not None:
            dim = tmp.shape[1]
            if out is not None:
                vec = np.mean(tmp,axis=0)
                vec = vec.reshape((1,dim))
                out = np.concatenate((out,vec),axis=0)
            else:
                out = np.mean(tmp,axis=0).reshape((1,dim))                                            
        else:
            pass
        
        
    return out
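
The two functions above amount to averaging the vectors of all in-vocabulary tokens in a review. The sketch below shows the same idea on a hypothetical three-dimensional `toy_embedding` dictionary (standing in for `FastText_embedding`); `mean_vector` is an illustrative name, not a function from the notebook.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings; the pretrained fastText vectors are far larger.
toy_embedding = {
    "good": np.array([1.0, 0.0, 2.0]),
    "film": np.array([3.0, 2.0, 0.0]),
}

def mean_vector(embedding, tokens):
    """Average the vectors of the tokens found in the embedding; None if none are found."""
    vectors = [embedding[t] for t in tokens if t in embedding]
    if not vectors:
        return None
    return np.mean(vectors, axis=0)

# "unknownword" is out of vocabulary and is simply skipped.
review_vec = mean_vector(toy_embedding, ["good", "unknownword", "film"])
print(review_vec)  # [2. 1. 1.]
```

Note that `assemble_embedding_vectors` drops any review with no in-vocabulary tokens. If that ever happens, the number of feature rows no longer matches the number of labels, so it is worth comparing the row count of the resulting matrix with the label count before training.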

# Putting It All Together To Assemble Dataset

Now, putting all the preprocessing steps together we assemble our dataset...

In [10]:
import os
import numpy as np

# shuffle raw data first
def unison_shuffle_data(data, header):
    p = np.random.permutation(len(header))
    data = data[p]
    header = np.asarray(header)[p]
    return data, header

# load data in appropriate form
def load_data(path):
    data, sentiments = [], []
    for folder, sentiment in (('neg', 0), ('pos', 1)):
        folder = os.path.join(path, folder)
        for name in os.listdir(folder):
            with open(os.path.join(folder, name), 'r', encoding='utf-8') as reader:
                text = reader.read()
            text = tokenize(text)
            text = stop_word_removal(text)
            text = reg_expressions(text)
            data.append(text)
            sentiments.append(sentiment)
    data_np = np.array(data, dtype=object)  # reviews differ in length, so store them as an object array
    data, sentiments = unison_shuffle_data(data_np, sentiments)
    
    return data, sentiments

train_path = os.path.join('aclImdb', 'train')
test_path = os.path.join('aclImdb', 'test')
raw_data, raw_header = load_data(train_path)

print(raw_data.shape)
print(len(raw_header))

(25000,)
25000
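
One detail of `load_data` worth seeing on a small input is the order of the steps: stop-words are removed before tokens are lowercased and stripped of punctuation. The sketch below reproduces that order on one sentence; `preprocess` and the tiny `stopword_set` are hypothetical stand-ins for the notebook's functions and the NLTK list.

```python
import re

stopword_set = {"this", "is", "a", "the"}  # hypothetical tiny stop-word list
maxtokens, maxtokenlen = 200, 100          # same values as the hyperparameters above

def preprocess(text):
    """Same step order as load_data: tokenize, remove stop-words, then regex-clean."""
    tokens = text.split(" ")[:maxtokens]
    tokens = [t for t in tokens if t and t not in stopword_set]
    tokens = [re.sub(r"[\W\d]", "", t.lower())[:maxtokenlen] for t in tokens]
    return tokens

result = preprocess("This is a great film - the best!")
print(result)  # ['this', 'great', 'film', '', 'best']
```

The capitalized "This" passes the filter because the comparison happens before lowercasing, and the lone "-" becomes an empty string after the regex step. The same effects appear as `'this'`, `'it'` and `''` entries in the debug output of the sampled data. Lowercasing before stop-word removal, and filtering empty strings last, would avoid both.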


In [11]:
# Subsample required number of samples
random_indices = np.random.choice(range(len(raw_header)),size=(Nsamp*2,),replace=False)
data_train = raw_data[random_indices]
header = raw_header[random_indices]

print("DEBUG::data_train::")
print(data_train)

DEBUG::data_train::
[list(['this', 'one', 'time', 'favorites', 'it', 'simple', 'sweet', '', 'definitely', 'chick', 'flick', 'romantic', 'comedy', 'i', 'really', 'like', 'film', 'full', 'good', 'quotes', 'it', 'one', 'favorites', 'albert', 'einstein', 'says', 'ed', 'walters', 'are', 'thinking', 'im', 'thinking', 'ed', 'says', 'now', 'odds', 'happening', 'in', 'opinion', 'film', 'fabulous', 'watch', 'get', 'amount', 'enjoyment', 'it', 'one', 'can', 'i', 'also', 'really', 'enjoyed', 'walter', 'ms', 'way', 'portraying', 'einstein', 'i', 'think', 'characters', 'fit', 'together', 'really', 'well', 'story', 'flows', 'nicely', 'there', 'many', 'times', 'i', 'find', 'smiling', 'right', 'along', 'film', 'quoting', 'favorite', 'lines', 'i', 'watch', 'it', 'i', 'would', 'recommend', 'movie', 'anyone', 'heart', 'enjoys', 'feelgood', 'romantic', 'comedy', 'then'])
 list(['the', 'premise', 'sucked', 'in', 'clear', '', 'seconds', 'either', 'david', 'lynch', 'something', 'seriously', 'terrible', 'inter

Display the sentiments and their frequencies in the dataset to confirm that it is roughly balanced between classes.

In [12]:
unique_elements, counts_elements = np.unique(header, return_counts=True)
print("Sentiments and their frequencies:")
print(unique_elements)
print(counts_elements)

Sentiments and their frequencies:
[0 1]
[1017  983]
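
The subsampling above draws `Nsamp*2` reviews at random, so the classes end up only approximately balanced (1017 versus 983 here). If exactly `Nsamp` reviews per class are wanted, one option is to sample indices within each class, as in this sketch. The `labels` array is a hypothetical stand-in for `raw_header`, and `n_per_class` plays the role of `Nsamp`.

```python
import numpy as np

rng = np.random.default_rng(0)
labels = np.array([0] * 50 + [1] * 50)  # hypothetical stand-in for raw_header
n_per_class = 10                        # stand-in for Nsamp

# Draw n_per_class indices without replacement from each class separately.
picked = np.concatenate([
    rng.choice(np.flatnonzero(labels == c), size=n_per_class, replace=False)
    for c in np.unique(labels)
])
rng.shuffle(picked)  # mix the classes so a later train/test slice is not ordered by class

values, counts = np.unique(labels[picked], return_counts=True)
print(values, counts)  # [0 1] [10 10]
```

The same `picked` index array can then be used to select both the reviews and their labels, which keeps the two aligned.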


**Featurize and Create Labels**

In [13]:
EmbeddingVectors = assemble_embedding_vectors(data_train)
print(EmbeddingVectors)

[[-0.04708445 -0.00859615 -0.10156192 ...  0.12017959  0.16323158
   0.01946514]
 [-0.08692177 -0.06223733 -0.1334786  ...  0.14544673  0.11161387
  -0.00043845]
 [-0.05798069 -0.05940554 -0.16345114 ...  0.10559762  0.12862308
   0.03148314]
 ...
 [-0.0937733  -0.01354256 -0.14242576 ...  0.1760791   0.10297086
   0.00673548]
 [-0.0864714  -0.03663545 -0.20604639 ...  0.15443572  0.14226893
   0.02450219]
 [-0.07466923  0.00334586 -0.16633242 ...  0.15489857  0.13078918
   0.05107713]]


In [14]:
data = EmbeddingVectors

idx = int(0.7*data.shape[0])

# 70% of data for training
train_x = data[:idx,:]
train_y = header[:idx]
# remaining 30% for testing
test_x = data[idx:,:]
test_y = header[idx:] 

print("train_x/train_y list details, to make sure it is of the right form:")
print(len(train_x))
print(train_x)
print(train_y[:5])
print(len(train_y))

train_x/train_y list details, to make sure it is of the right form:
1400
[[-0.04708445 -0.00859615 -0.10156192 ...  0.12017959  0.16323158
   0.01946514]
 [-0.08692177 -0.06223733 -0.1334786  ...  0.14544673  0.11161387
  -0.00043845]
 [-0.05798069 -0.05940554 -0.16345114 ...  0.10559762  0.12862308
   0.03148314]
 ...
 [-0.07950715 -0.12623046 -0.15015663 ...  0.15858898  0.09886043
   0.05608691]
 [-0.02690058 -0.03051995 -0.15439297 ...  0.13425885  0.20691273
   0.04434924]
 [-0.06817091 -0.0140137  -0.15882649 ...  0.13278545  0.10525348
   0.01010045]]
[1 0 0 0 1]
1400


# Logistic Regression Classifier

In [15]:
from sklearn.linear_model import LogisticRegression

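def fit(train_x,train_y):
    model = LogisticRegression()
    # let fitting errors propagate; silently returning an unfitted model
    # would only fail later, and less clearly, in predict()
    model.fit(train_x, train_y)
    return model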

model = fit(train_x,train_y)



In [16]:
predicted_labels = model.predict(test_x)
print("DEBUG::The logistic regression predicted labels are::")
print(predicted_labels)

DEBUG::The logistic regression predicted labels are::
[0 0 0 0 1 1 1 0 1 1 0 0 1 1 1 1 1 1 0 0 0 1 1 1 0 0 1 0 1 1 1 0 1 1 1 0 1
 1 1 0 1 0 0 0 1 0 0 1 0 1 1 1 1 0 1 1 1 1 0 0 1 1 0 1 0 1 0 1 0 0 0 0 0 0
 0 1 0 1 1 1 0 1 0 1 1 1 1 1 0 0 1 1 1 1 0 1 1 0 1 1 1 1 1 1 0 0 1 1 0 0 0
 0 0 0 0 1 1 0 0 1 0 1 1 0 1 1 0 0 1 1 1 1 1 0 1 1 1 1 0 1 0 0 0 0 0 0 1 1
 0 0 0 0 1 0 1 1 0 1 1 0 1 0 0 0 1 1 0 0 1 0 1 0 1 0 1 0 0 0 1 1 0 1 1 0 1
 1 1 1 1 1 0 0 1 0 1 0 1 0 1 0 1 1 1 0 1 0 0 0 1 1 0 1 0 1 1 1 1 0 1 1 0 0
 1 0 1 1 1 0 0 1 1 1 0 0 1 0 1 0 1 1 1 0 0 1 1 1 0 0 0 0 0 1 0 1 0 0 1 1 0
 0 0 0 0 0 0 1 1 1 1 0 1 0 0 0 0 1 1 0 1 0 0 0 1 1 1 1 0 0 1 0 0 0 0 0 0 1
 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 1 1 1 1 0 1 0 1 0 1 1 1 0 0 0 0 1 1
 1 0 0 1 0 1 0 1 0 1 0 0 1 0 0 1 1 0 0 1 1 1 1 1 0 0 0 1 0 0 1 0 0 0 0 1 0
 0 0 1 1 1 1 0 1 1 0 0 1 1 0 1 0 0 1 1 1 1 1 0 1 0 1 0 1 0 0 0 1 1 1 0 1 0
 0 0 1 1 0 1 0 0 0 1 1 1 0 0 1 0 0 1 1 0 1 1 1 0 0 0 0 1 1 1 1 0 0 1 1 0 1
 0 0 1 0 0 0 0 0 0 0 1 0 0 1 1 1 0 0 1 0 1 0 1

In [17]:
from sklearn.metrics import accuracy_score

acc_score = accuracy_score(test_y, predicted_labels)

print("The logistic regression accuracy score is::")
print(acc_score)

The logistic regression accuracy score is::
0.7633333333333333
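
Accuracy alone does not show which class the errors fall in. A confusion matrix is one way to check, sketched here with scikit-learn's `confusion_matrix` on small made-up label arrays. In the notebook, `test_y` and `predicted_labels` would take the place of `y_true` and `y_pred`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration only (0 = negative, 1 = positive).
y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0])

# Rows are true labels, columns are predicted labels.
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[2 1]
#  [1 2]]

acc = accuracy_score(y_true, y_pred)
print("Accuracy: %.3f" % acc)
```

Here the off-diagonal entries show one negative review predicted as positive and one positive review predicted as negative.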


# Random Forests

In [18]:
# Load scikit's random forest classifier library
from sklearn.ensemble import RandomForestClassifier

# Create a random forest Classifier. By convention, clf means 'Classifier'
clf = RandomForestClassifier(n_jobs=1, random_state=0)

# Train the Classifier to take the training features and learn how they relate
# to the training y (negative or positive sentiment)
start_time = time.time()
clf.fit(train_x, train_y)
end_time = time.time()
print("Training the Random Forest Classifier took %3d seconds"%(end_time-start_time))

predicted_labels = clf.predict(test_x)
print("DEBUG::The RF predicted labels are::")
print(predicted_labels)

acc_score = accuracy_score(test_y, predicted_labels)

print("DEBUG::The RF testing accuracy score is::")
print(acc_score)

Training the Random Forest Classifier took   0 seconds
DEBUG::The RF predicted labels are::
[0 0 0 0 1 1 1 0 0 1 0 1 1 1 1 0 1 0 1 0 0 0 1 1 0 0 1 0 1 1 0 0 1 1 1 1 1
 1 1 0 1 1 0 0 1 0 0 1 1 0 1 0 0 0 1 0 1 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0
 0 0 0 1 1 0 0 1 0 0 1 1 1 1 0 0 1 0 0 1 0 0 1 0 1 1 0 1 0 0 0 1 1 1 0 1 0
 0 0 0 0 1 1 0 0 1 0 1 1 0 1 1 1 0 1 0 1 1 1 0 1 1 0 1 0 1 1 0 0 0 0 1 1 1
 0 0 0 0 1 0 1 0 0 1 1 1 1 0 0 1 0 0 0 0 1 0 1 1 0 1 0 0 0 0 1 0 0 0 1 0 1
 1 1 1 1 0 0 0 1 0 1 0 1 1 1 0 1 0 0 0 1 0 0 0 1 1 0 1 0 1 1 0 0 0 1 1 0 0
 0 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 1 1 0 0 0 1 0 1 0 0 0 0 0 1 1 1 0 1 1 0 0
 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 1 1 1 1 0 0 0 1 1 1 1 1 0 1 0 0 0 1 0 0 1
 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 1 0 1 1 1 0 1 0 1 0 0 0 0 0 0 1 0 0 1
 1 0 0 0 0 0 0 1 0 1 1 0 0 0 0 1 1 1 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 1 1 0
 0 1 1 0 0 0 1 0 1 1 0 1 1 1 1 0 0 0 1 1 0 1 0 1 0 0 0 1 1 0 0 1 1 0 0 1 0
 0 0 1 1 1 1 0 1 0 1 1 1 0 0 0 1 0 1 0 1 1 0 0 0 0 0 0 0 0 1 0 0 1 1 1 0 1
 0 0 0 0
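
The forest above is built with scikit-learn's default number of trees, which depends on the installed version (10 before release 0.22, 100 from 0.22 onward). Setting `n_estimators` explicitly keeps runs comparable across versions. The sketch below does so on synthetic data from `make_classification`; the data and the value 100 are illustrative assumptions, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Synthetic binary classification data, used only to keep the example self-contained.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_train, y_train = X[:140], y[:140]   # 70% for training
X_test, y_test = X[140:], y[140:]     # remaining 30% for testing

# Fix the number of trees explicitly instead of relying on the version-dependent default.
forest = RandomForestClassifier(n_estimators=100, n_jobs=1, random_state=0)
forest.fit(X_train, y_train)

rf_acc = accuracy_score(y_test, forest.predict(X_test))
print("Synthetic-data accuracy: %.3f" % rf_acc)
```

With more trees, training takes longer but predictions tend to be more stable, so the training time and accuracy reported above may differ under a different default.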



In [19]:
from IPython.display import HTML
def create_download_link(title = "Download file", filename = "data.csv"):  
    html = '<a href={filename}>{title}</a>'
    html = html.format(title=title,filename=filename)
    return HTML(html)

#create_download_link(filename='GBMimportances.svg')

In [20]:
!rm -rf aclImdb
!rm aclImdb_v1.tar.gz