# WARNING
**Please make sure to "COPY AND EDIT NOTEBOOK" to use compatible library dependencies! DO NOT CREATE A NEW NOTEBOOK AND COPY+PASTE THE CODE - this will use latest Kaggle dependencies at the time you do that, and the code will need to be modified to make it work. Also make sure internet connectivity is enabled on your notebook**

# Preliminaries
Install required dependencies not already on the Kaggle image

In [1]:
# install sent2vec
!pip install git+https://github.com/epfml/sent2vec

Collecting git+https://github.com/epfml/sent2vec
  Cloning https://github.com/epfml/sent2vec to /tmp/pip-req-build-pk9bc4tp
  Running command git clone -q https://github.com/epfml/sent2vec /tmp/pip-req-build-pk9bc4tp
Building wheels for collected packages: sent2vec
  Building wheel for sent2vec (setup.py) ... done
  Created wheel for sent2vec: filename=sent2vec-0.0.0-cp36-cp36m-linux_x86_64.whl size=1139414 sha256=c1870293d604840a0617afa1a0cbcf9ffb07f5a4157f8d90fe59229c0e71fda8
  Stored in directory: /tmp/pip-ephem-wheel-cache-srnnopbd/wheels/f5/1a/52/b5f36e8120688b3f026ac0cefe9c6544905753c51d8190ff17
Successfully built sent2vec
Installing collected packages: sent2vec
Successfully installed sent2vec-0.0.0


Write the installed requirements to a file each time the notebook runs, in case you need to go back and recover the dependencies. **MOST OF THESE REQUIREMENTS WOULD NOT BE NECESSARY FOR A LOCAL INSTALLATION**

Requirements for each notebook are hosted in the companion GitHub repo and can be pulled down and installed here if needed. The companion repo is located at https://github.com/azunre/transfer-learning-for-nlp

In [2]:
!pip freeze > kaggle_image_requirements.txt

# Download IMDB Movie Review Dataset
Download IMDB dataset

In [3]:
import random
import pandas as pd

## Download and extract the raw reviews

!wget -q "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
!tar xzf aclImdb_v1.tar.gz

wget: /opt/conda/lib/libuuid.so.1: no version information available (required by wget)


# Define Tokenization, Stop-word and Punctuation Removal Functions
Before proceeding, we must decide how many samples to draw from each class. We must also decide the maximum number of tokens per review and the maximum length of each token. We do this by setting the following overarching hyperparameters.

**Preliminary overarching hyperparameters**


In [4]:
Nsamp = 1000 # number of samples to generate in each class - 'positive', 'negative'
maxtokens = 200 # the maximum number of tokens per document
maxtokenlen = 100 # the maximum length of each token

**Tokenization**

In [5]:
def tokenize(row):
    if row is None or row == '':  # compare by value; `is ''` tests identity and is unreliable
        tokens = []
    else:
        tokens = str(row).split(" ")[:maxtokens]
    return tokens
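
As a quick illustration of the truncation behaviour, the sketch below applies the same single-space split with a much smaller cap. The name `tokenize_sketch` and its `max_tokens` parameter are stand-ins introduced here for illustration and are not part of the notebook.

```python
def tokenize_sketch(text, max_tokens=5):
    """Split on single spaces and keep at most max_tokens tokens (hypothetical helper)."""
    if text is None or text == "":
        return []
    return str(text).split(" ")[:max_tokens]


sample = "This movie was far better than I expected it to be"
tokens = tokenize_sketch(sample)
print(tokens)
```

Only the first `max_tokens` tokens survive, so with the notebook's setting of 200 any review longer than 200 space-separated tokens is cut off at that point.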

**Use regular expressions to remove unnecessary characters**

Next, we define a function that uses regular expressions, via Python's built-in `re` module, to replace punctuation marks, digits, and other non-word characters in the reviews with spaces. In the same step, we truncate every token to the `maxtokenlen` hyperparameter defined above.

In [6]:
import re

def reg_expressions(row):
    tokens = []
    try:
        for token in row:
            token = token.lower()
            token = re.sub(r'[\W\d]', " ", token)
            token = token[:maxtokenlen] # truncate token
            tokens.append(token)
    except Exception:  # e.g. non-iterable row or non-string token; fall back to an empty token
        token = ""
        tokens.append(token)
    return tokens
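
To make the effect of the pattern `[\W\d]` concrete, here is a small sketch that applies the same lowercasing, substitution, and truncation to a few made-up tokens. The helper name `clean_tokens_sketch` and the shortened `max_token_len` are assumptions chosen for the example.

```python
import re


def clean_tokens_sketch(tokens, max_token_len=8):
    """Lowercase, blank out non-word characters and digits, then truncate (hypothetical helper)."""
    cleaned = []
    for token in tokens:
        token = token.lower()
        token = re.sub(r"[\W\d]", " ", token)  # each matched character becomes one space
        cleaned.append(token[:max_token_len])
    return cleaned


cleaned = clean_tokens_sketch(["Great!", "10/10", "I've", "unforgettable"])
print(cleaned)
```

Because matched characters are replaced with a space rather than deleted, a token such as `I've` becomes `i ve`, which is consistent with tokens like `'i ve'` in the debug output further down.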

**Stop-word removal**

Stop-words are also removed. Stop-words are words that are very common in text but offer little information that is useful for classifying it. Words such as "is", "and", "the", and "are" are examples of stop-words. The NLTK library contains a list of 127 English stop-words that can be used to filter our tokenized strings.

In [7]:
import nltk

nltk.download('stopwords')
from nltk.corpus import stopwords
stopwords = stopwords.words('english')    

# these stop-words may indicate positive/negative sentiment, so you may want to remove them from the stop-word list (i.e., keep them in the corpus)
# stopwords.remove("no")
# stopwords.remove("nor")
# stopwords.remove("not")

# print(stopwords) # see default stopwords


[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [8]:
def stop_word_removal(row):
    token = [token for token in row if token not in stopwords]
    token = filter(None, token)
    return token
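
The filtering step can be sketched without NLTK by using a tiny, made-up stop-word set. `STOP_WORDS_SKETCH` and `remove_stop_words_sketch` are illustrative names only; the real list comes from `nltk.corpus.stopwords`.

```python
STOP_WORDS_SKETCH = {"is", "and", "the", "are", "a"}  # tiny illustrative set, not NLTK's list


def remove_stop_words_sketch(tokens):
    """Drop stop-words, then drop empty strings (hypothetical helper)."""
    kept = [t for t in tokens if t not in STOP_WORDS_SKETCH]
    return [t for t in kept if t]


filtered = remove_stop_words_sketch(
    ["the", "plot", "is", "", "thin", "and", "The", "acting", "is", "fine"]
)
print(filtered)
```

Note that the membership test is case-sensitive, so the capitalized `The` survives. Since `load_data` filters stop-words before lowercasing, this may be why a lowercase `'this'` still appears in the debug output later in the notebook.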

# Assemble Embedding Vectors

The following functions are used to extract sent2vec embedding vectors for each review

In [9]:
import time
import sent2vec

s2v_model = sent2vec.Sent2vecModel()
start=time.time()
s2v_model.load_model('../input/sent2vec/wiki_unigrams.bin')
end = time.time()
print("Loading the sent2vec embedding took %d seconds"%(end-start))

Loading the sent2vec embedding took 50 seconds


In [10]:
import numpy as np  # needed here for np.concatenate

def assemble_embedding_vectors(data):
    out = None
    for item in data:
        # embed the space-joined, non-empty tokens of one review
        vec = s2v_model.embed_sentence(" ".join(str(i) for i in item if i))
        if vec is not None:
            if out is not None:
                out = np.concatenate((out, vec), axis=0)  # append as a new row
            else:
                out = vec
    return out
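
The stacking logic can be tried without the sent2vec model by swapping in a stand-in embedder. In the sketch below, `fake_embed_sentence` is a hypothetical placeholder that returns a `(1, dim)` row, which we assume mirrors the shape of the real embedding given that the notebook concatenates along axis 0. Collecting rows in a list and concatenating once is a variation on the notebook's approach that avoids re-copying the growing array on every iteration.

```python
import numpy as np


def fake_embed_sentence(sentence, dim=4):
    """Hypothetical stand-in embedder: a (1, dim) row filled with the sentence length."""
    return np.full((1, dim), float(len(sentence)), dtype=np.float32)


def stack_embeddings_sketch(token_lists, embed=fake_embed_sentence):
    """Embed each token list as one sentence and stack the rows into a 2-D array."""
    rows = []
    for tokens in token_lists:
        sentence = " ".join(str(t) for t in tokens if t)  # skip empty tokens
        vec = embed(sentence)
        if vec is not None:
            rows.append(vec)
    return np.concatenate(rows, axis=0) if rows else None


matrix = stack_embeddings_sketch([["good", "film"], ["bad", "", "plot"], ["fine"]])
print(matrix.shape)
```

Each input document contributes exactly one row, so the row order of the result matches the order of the input and stays aligned with the labels.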

# Putting It All Together To Assemble Dataset

Now, putting all the preprocessing steps together we assemble our dataset...

In [11]:
import os
import numpy as np

# shuffle raw data first
def unison_shuffle_data(data, header):
    p = np.random.permutation(len(header))
    data = data[p]
    header = np.asarray(header)[p]
    return data, header

def load_data(path):
    data, sentiments = [], []
    for folder, sentiment in (('neg', 0), ('pos', 1)):
        folder = os.path.join(path, folder)
        for name in os.listdir(folder):
            with open(os.path.join(folder, name), 'r') as reader:
                text = reader.read()
            text = tokenize(text)
            text = stop_word_removal(text)
            text = reg_expressions(text)
            data.append(text)
            sentiments.append(sentiment)
    data_np = np.array(data)
    data, sentiments = unison_shuffle_data(data_np, sentiments)
    
    return data, sentiments

train_path = os.path.join('aclImdb', 'train')
test_path = os.path.join('aclImdb', 'test')
raw_data, raw_header = load_data(train_path)

print(raw_data.shape)
print(len(raw_header))

(25000,)
25000
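
The key property of the shuffle is that documents and labels are reordered by the same permutation, so each review keeps its sentiment. A minimal sketch on toy data follows; the function name, the fixed seed, and the sample strings are assumptions made for the example.

```python
import numpy as np


def unison_shuffle_sketch(data, labels, seed=0):
    """Apply one random permutation to both arrays so pairs stay aligned (hypothetical helper)."""
    rng = np.random.RandomState(seed)
    p = rng.permutation(len(labels))
    return np.asarray(data)[p], np.asarray(labels)[p]


docs = np.array(["neg-a", "neg-b", "pos-a", "pos-b"])
labels = [0, 0, 1, 1]
shuffled_docs, shuffled_labels = unison_shuffle_sketch(docs, labels)
for doc, label in zip(shuffled_docs, shuffled_labels):
    print(doc, label)
```

Whatever order comes out, every `neg-*` document is still paired with label 0 and every `pos-*` document with label 1.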


In [12]:
# Subsample required number of samples
random_indices = np.random.choice(range(len(raw_header)),size=(Nsamp*2,),replace=False)
data_train = raw_data[random_indices]
header = raw_header[random_indices]

del raw_data, raw_header # huge and no longer needed, get rid of it

print("DEBUG::data_train::")
print(data_train)

DEBUG::data_train::
[list(['this', 'one', 'best', 'movies', 'serial', 'killers', 'i ve', 'ever', 'seen ', 'coming', 'someone', 'absolutely', 'loved', 'silence', 'lambs ', 'hbo', 'hit', 'jackpot', 'here ', 'this', 'film', 'compelling', 'first', 'moment', 'last  br', '   br', '  this', 'film', 'many', 'underlying', 'themes', 'hard', 'tell', 'exactly', 'about ', 'it', 'chronicles', 'decade long', 'search', 'russian', 'serial', 'killer', 'andrea', 'chikatilo ', 'stephen', 'rea', 'gives', 'brilliantly', 'reserved', 'performance', 'inexperienced', 'forensic', 'expert', 'put', 'charge', 'investigation ', 'donald', 'sutherland', 'gives', 'even', 'involving', 'performance', 'cynical', 'superior ', 'person', 'russian', 'government', 'willing', 'help', 'him ', 'both', 'performances', 'subtle', 'masterpieces   rea', 'begins', 'naive', 'unwilling', 'compromise ', 'sutherland', 'begins', 'detached', 'almost', 'amused', 'situation ', 'towards', 'end ', 'rea', 'becomes', 'world weary', 'beaten', 'syst

Display sentiments and their frequencies in the dataset, to ensure it is roughly balanced between classes

In [13]:
unique_elements, counts_elements = np.unique(header, return_counts=True)
print("Sentiments and their frequencies:")
print(unique_elements)
print(counts_elements)

Sentiments and their frequencies:
[0 1]
[ 993 1007]
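
The counts above are close to, but not exactly, `Nsamp` per class, because the subsample is drawn from the pooled data rather than from each class separately. If an exact balance were needed, one alternative (not what this notebook does) would be to sample indices per class. The sketch below uses only the standard library; the function name and toy labels are hypothetical.

```python
import random


def balanced_indices_sketch(labels, n_per_class, seed=0):
    """Pick n_per_class distinct indices from every class, then shuffle them (hypothetical helper)."""
    rng = random.Random(seed)
    chosen = []
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        chosen.extend(rng.sample(members, n_per_class))  # sampling without replacement
    rng.shuffle(chosen)
    return chosen


toy_labels = [0] * 10 + [1] * 12
idx = balanced_indices_sketch(toy_labels, n_per_class=4)
counts = {c: sum(1 for i in idx if toy_labels[i] == c) for c in (0, 1)}
print(counts)
```

For a roughly balanced dataset such as this one, the pooled draw used in the notebook is usually close enough, as the frequencies above show.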


**Featurize and Create Labels**

In [14]:
EmbeddingVectors = assemble_embedding_vectors(data_train)
print(EmbeddingVectors)

[[-0.03622827  0.01441378 -0.05925672 ... -0.00478834 -0.04687782
   0.0717904 ]
 [-0.05283374 -0.14769883 -0.02701941 ...  0.15918025  0.16663636
   0.18797083]
 [ 0.13212578 -0.15448087 -0.04736134 ... -0.05063647 -0.027518
   0.20125072]
 ...
 [-0.05686692 -0.11101679  0.10533109 ...  0.07239283  0.00357683
   0.09122648]
 [ 0.08307935 -0.12593347  0.00109499 ...  0.08826873 -0.07179418
   0.06714546]
 [ 0.02595801 -0.05723677  0.07862083 ... -0.03257715 -0.07341827
   0.0762985 ]]


In [15]:
data = EmbeddingVectors

idx = int(0.7*data.shape[0])

# 70% of data for training
train_x = data[:idx,:]
train_y = np.array(header[:idx])
# remaining 30% for testing
test_x = data[idx:,:]
test_y = np.array(header[idx:]) 

print("train_x/train_y list details, to make sure it is of the right form:")
print(len(train_x))
print(train_x)
print(train_y[:5])
print(len(train_y))

train_x/train_y list details, to make sure it is of the right form:
1400
[[-0.03622827  0.01441378 -0.05925672 ... -0.00478834 -0.04687782
   0.0717904 ]
 [-0.05283374 -0.14769883 -0.02701941 ...  0.15918025  0.16663636
   0.18797083]
 [ 0.13212578 -0.15448087 -0.04736134 ... -0.05063647 -0.027518
   0.20125072]
 ...
 [ 0.02726981  0.02248825 -0.21001302 ... -0.04563235 -0.27397254
   0.17307606]
 [-0.01805624 -0.09111983 -0.10452412 ... -0.1359355  -0.04638012
   0.12248638]
 [ 0.01417789 -0.11241481 -0.13324331 ... -0.07542464  0.00514632
   0.06470316]]
[1 0 0 1 0]
1400


# Train Shallow Model for IMDB Reviews

In [16]:
from keras.models import Model
from keras.layers import Input, Dense, Dropout

input_shape = (len(train_x[0]),)
sent2vec_vectors = Input(shape=input_shape)
dense = Dense(512, activation='relu')(sent2vec_vectors)
dense = Dropout(0.1)(dense)
output = Dense(1, activation='sigmoid')(dense)
model = Model(inputs=sent2vec_vectors, outputs=output)

Using TensorFlow backend.


In [17]:
model.compile(loss='binary_crossentropy',
              optimizer='adam', metrics=['accuracy'])
history = model.fit(train_x, train_y, validation_data=(test_x, test_y), batch_size=32,
                    epochs=10, shuffle=True)  # `epochs` replaces the deprecated `nb_epoch`


Train on 1400 samples, validate on 600 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


# Test Trained Model on Book Reviews from MDSD

In [18]:
!ls ../input/multi-domain-sentiment-dataset-books-and-dvds/

books.negative.review  dvd.negative.review
books.positive.review  dvd.positive.review


In [19]:
def parse_MDSD(data):
    out_lst = []
    for i in range(len(data)):
        txt = ""
        if(data[i]=="<review_text>\n"):
            j=i
            while(data[j]!="</review_text>\n"):
                txt = txt+data[j]
                j = j+1
            text = tokenize(txt)
            text = stop_word_removal(text)
            text = reg_expressions(text)
            out_lst.append(text)
            
            #print(txt)
            #print(text)
            
    return out_lst

with open ("../input/multi-domain-sentiment-dataset-books-and-dvds/books.negative.review", "r", encoding="latin1") as myfile:
    data=myfile.readlines()
neg_books = parse_MDSD(data)
len(neg_books)

with open ("../input/multi-domain-sentiment-dataset-books-and-dvds/books.positive.review", "r", encoding="latin1") as myfile:
    data=myfile.readlines()
pos_books = parse_MDSD(data)
len(pos_books)

#print(neg_books)
#print(pos_books)

header = [0]*len(neg_books)
header.extend([1]*len(pos_books))
neg_books.extend(pos_books)
MDSD_data = np.array(neg_books)

data, sentiments = unison_shuffle_data(np.array(MDSD_data), header)

len(sentiments)

2000
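
The parsing idea is to collect the lines that sit between a `<review_text>` line and the matching `</review_text>` line. The sketch below shows one way to do that with a small state flag. Unlike the loop above, which starts accumulating at the tag line itself, this version leaves the tag lines out of the extracted text. The helper name and the sample lines are invented for illustration.

```python
def extract_review_texts_sketch(lines):
    """Return the text found between <review_text> and </review_text> lines (hypothetical helper)."""
    reviews, current, inside = [], [], False
    for line in lines:
        stripped = line.strip()
        if stripped == "<review_text>":
            inside, current = True, []
        elif stripped == "</review_text>":
            if inside:
                reviews.append("".join(current).strip())
            inside = False
        elif inside:
            current.append(line)
    return reviews


# Made-up sample in the same line-per-tag layout
sample_lines = [
    "<review>\n",
    "<review_text>\n", "Dull and\n", "far too long.\n", "</review_text>\n",
    "</review>\n",
    "<review>\n",
    "<review_text>\n", "A wonderful read.\n", "</review_text>\n",
    "</review>\n",
]
texts = extract_review_texts_sketch(sample_lines)
print(texts)
```

Each extracted string could then be passed through `tokenize`, `stop_word_removal`, and `reg_expressions` exactly as in the cell above.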

**Try using the IMDB classifier directly on book review data...**

In [20]:
EmbeddingVectors = assemble_embedding_vectors(data)
print(EmbeddingVectors)
sentiments = np.asarray(sentiments)

[[ 6.9606602e-02 -3.0756295e-02 -2.3950104e-01 ... -2.2783343e-01
  -3.7275681e-01  4.9442065e-01]
 [-1.2904549e-02 -1.1575852e-01 -1.5874037e-01 ...  5.3884499e-02
  -2.0341949e-01  1.4110129e-01]
 [ 1.0849423e-01 -3.8145301e-01 -8.1480257e-02 ... -3.9360296e-02
   2.0374845e-01  2.6633894e-01]
 ...
 [-2.0717914e-01 -4.0148088e-01  1.2224840e-01 ... -4.1686586e-04
  -1.5059590e-01  1.6732480e-01]
 [-1.3965216e-01 -7.1579702e-02 -2.5804830e-01 ...  1.0827243e-01
  -3.4484950e-01  2.5415501e-01]
 [-1.5061742e-01 -4.2240743e-02 -3.8007267e-02 ...  3.8678069e-02
   5.0857328e-03  9.2025444e-02]]


In [21]:

print(model.evaluate(x=EmbeddingVectors,y=sentiments)) # evaluate IMDB classifier on books directly
print(model.metrics_names)

[0.6844062490463256, 0.7289999723434448]
['loss', 'accuracy']
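
The two numbers printed above are the loss and the accuracy, in that order. Assuming the standard definition of binary cross-entropy and an accuracy computed with a 0.5 decision threshold (which we assume matches what Keras does for a single sigmoid output), both can be reproduced by hand. The probabilities and labels below are made up for the example.

```python
import math


def binary_metrics_sketch(probs, labels, eps=1e-7):
    """Mean binary cross-entropy and 0.5-threshold accuracy (hypothetical helper)."""
    clipped = [min(max(p, eps), 1.0 - eps) for p in probs]  # avoid log(0)
    loss = -sum(
        y * math.log(p) + (1 - y) * math.log(1.0 - p)
        for p, y in zip(clipped, labels)
    ) / len(labels)
    accuracy = sum((p > 0.5) == bool(y) for p, y in zip(probs, labels)) / len(labels)
    return loss, accuracy


probs = [0.9, 0.2, 0.6, 0.4]   # toy predicted probabilities of the positive class
labels = [1, 0, 0, 1]          # toy true sentiments
loss, accuracy = binary_metrics_sketch(probs, labels)
print(round(loss, 4), accuracy)
```

Read this way, an accuracy of about 0.73 means the IMDB-trained classifier labels roughly 73% of the book reviews correctly without any adaptation.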


# Adaptation of Book Review Domain via Autoencoder

In [22]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

encoding_dim = 30 # chosen experimentally

input_shape = (len(train_x[0]),)
sent2vec_vectors = Input(shape=input_shape)
encoder = Dense(encoding_dim)(sent2vec_vectors)
dropout = Dropout(0.1)(encoder)
decoder = Dense(encoding_dim)(dropout)
dropout = Dropout(0.1)(decoder)
output = Dense(len(train_x[0]))(dropout)
autoencoder = Model(inputs=sent2vec_vectors, outputs=output)

In [23]:
autoencoder.compile(optimizer='adam',loss='mse',metrics=["mse","mae"])
autoencoder.fit(train_x, train_x, validation_data=(test_x, test_x), batch_size=32,
                epochs=50, shuffle=True)  # `epochs` replaces the deprecated `nb_epoch`


Train on 1400 samples, validate on 600 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


<keras.callbacks.callbacks.History at 0x7efadb0e9b38>

In [24]:
# project EmbeddingVectors with autoencoder.predict (sentiments are unchanged), then evaluate the IMDB model again

EmbeddingVectorsProjected = autoencoder.predict(EmbeddingVectors)

print(EmbeddingVectorsProjected.shape)

print(model.evaluate(x=EmbeddingVectorsProjected,y=sentiments))
print(model.metrics_names)

(2000, 600)
[0.5985191793441772, 0.7354999780654907]
['loss', 'accuracy']


In [25]:
from IPython.display import HTML
def create_download_link(title = "Download file", filename = "data.csv"):  
    html = '<a href="{filename}">{title}</a>'
    html = html.format(title=title,filename=filename)
    return HTML(html)

#create_download_link(filename='GBMimportances.svg')

In [26]:
!rm -rf aclImdb
!rm aclImdb_v1.tar.gz