# Introduction

This notebook is based on a notebook that can be found on Kaggle (URL: https://www.kaggle.com/desiredewaele/sentiment-analysis-on-imdb-reviews). It serves as a basis for developing a solution that determines whether a review is positive or negative. The model is a neural network and uses natural language processing (NLP) concepts to process the text.

The notebook has been adjusted to work with the latest TensorFlow (Keras) library, which required changes to match the current API. This was the most challenging part.

The sample data sets have been replaced by the full data sets (an increase from 5k to around 24k reviews for both the training and test data sets). Using the larger data sets increases the accuracy on the test data from 83% to 87%.

Adding layers may further improve the results. This is something to test tomorrow.

Frank Kornet
12/12/2019


In [1]:
# plotting
import scikitplot as skplt
from mlxtend.plotting import plot_decision_regions
from matplotlib import pyplot as plt
import seaborn as sns; sns.set()
%matplotlib inline

# data
import numpy as np
np.random.seed(125)
import pandas as pd
from sklearn.datasets import make_moons

# modeling
from sklearn.naive_bayes import GaussianNB
from sklego.mixture import GMMClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score, roc_curve
from sklearn.metrics import precision_score, recall_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklego.meta import Thresholder
from sklearn.pipeline import make_pipeline

# items not in Bryan's original list
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import text 
from sklearn.decomposition import PCA


# garbage
import gc; gc.enable()

# warnings
import warnings
warnings.filterwarnings("ignore")


# libraries imported by the other notebook and not in Bryan's original Notebook
import re

import nltk
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import words
from nltk.corpus import wordnet 
allEnglishWords = words.words() + [w for w in wordnet.words()]
allEnglishWords = np.unique([x.lower() for x in allEnglishWords])

from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Dense, Dropout

import plotly.offline as py
import plotly.graph_objs as go
py.init_notebook_mode(connected=True)

from IPython.display import HTML

In [2]:
allEnglishWords

array(["'hood", "'s_gravenhage", "'tween", ..., 'zythum', 'zyzomys',
       'zyzzogeton'], dtype='<U71')

## Load Train and Test Data



In [3]:
ls

Keras Notes.docx      README.md             exercise.ipynb
LICENSE               clean_test.csv
Preparing_CSVs.ipynb  clean_train.csv


In [4]:
train_df=pd.read_csv('clean_train.csv')
test_df=pd.read_csv('clean_test.csv')

In [5]:
train_df.head()

Unnamed: 0.1,Unnamed: 0,Review,Label
0,0,If I accidentally stumbled across this script ...,0
1,1,"This film, was one of my childhood favorites a...",1
2,2,this movie just goes to show that you dont nee...,1
3,3,This may be one of the worst movies to ever ma...,0
4,4,"OK I caught this film halfway through, but.oh....",0


In [6]:
test_df.head()

Unnamed: 0.1,Unnamed: 0,Review,Label
0,0,"This film is a good companion to Blair Witch, ...",0
1,1,Stanwyck and Morgan are perfectly cast in what...,1
2,2,"Coming from the ""druggie"" generation, I though...",1
3,3,It's a shame that quality actors like Baldwin ...,0
4,4,"Two years after this short, the last ""Our Gang...",0


## Preprocessor Class

The preprocessor class below cleans the review text: it strips HTML tags and non-alphanumeric characters, lowercases the text, and maps each word to the most frequent word form that shares its stem.

In [7]:
# Source: https://www.kaggle.com/desiredewaele/sentiment-analysis-on-imdb-reviews
class Preprocessor(object):
    ''' Preprocess data for NLP tasks. '''

    def __init__(self, alpha=True, lower=True, stemmer=True, english=False):
        self.alpha = alpha
        self.lower = lower
        self.stemmer = stemmer
        self.english = english
        
        self.uniqueWords = None
        self.uniqueStems = None
        
    def fit(self, texts):
        texts = self._doAlways(texts)

        allwords = pd.DataFrame({"word": np.concatenate(texts.apply(lambda x: x.split()).values)})
        self.uniqueWords = allwords.groupby(["word"]).size().rename("count").reset_index()
        self.uniqueWords = self.uniqueWords[self.uniqueWords["count"]>1]
        if self.stemmer:
            self.uniqueWords["stem"] = self.uniqueWords.word.apply(lambda x: PorterStemmer().stem(x)).values
            self.uniqueWords.sort_values(["stem", "count"], inplace=True, ascending=False)
            self.uniqueStems = self.uniqueWords.groupby("stem").first()
        
        #if self.english: self.words["english"] = np.in1d(self.words["mode"], allEnglishWords)
        print("Fitted.")
            
    def transform(self, texts):
        texts = self._doAlways(texts)
        if self.stemmer:
            allwords = np.concatenate(texts.apply(lambda x: x.split()).values)
            uniqueWords = pd.DataFrame(index=np.unique(allwords))
            uniqueWords["stem"] = pd.Series(uniqueWords.index).apply(lambda x: PorterStemmer().stem(x)).values
            uniqueWords["mode"] = uniqueWords.stem.apply(lambda x: self.uniqueStems.loc[x, "word"] if x in self.uniqueStems.index else "")
            texts = texts.apply(lambda x: " ".join([uniqueWords.loc[y, "mode"] for y in x.split()]))
        #if self.english: texts = self.words.apply(lambda x: " ".join([y for y in x.split() if self.words.loc[y,"english"]]))
        print("Transformed.")
        return(texts)

    def fit_transform(self, texts):
        texts = self._doAlways(texts)
        self.fit(texts)
        texts = self.transform(texts)
        return(texts)
    
    def _doAlways(self, texts):
        # Remove parts between <>'s
        texts = texts.apply(lambda x: re.sub('<.*?>', ' ', x))
        # Keep letters and digits only.
        if self.alpha: texts = texts.apply(lambda x: re.sub('[^a-zA-Z0-9 ]+', ' ', x))
        # Set everything to lower case
        if self.lower: texts = texts.apply(lambda x: x.lower())
        return texts 
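
The cleaning step in `_doAlways` can be tried on a single string without pandas. The sketch below mirrors its three regular-expression steps; the function name `clean_text` and the sample sentence are illustrative and not part of the original notebook.

```python
import re


def clean_text(text, alpha=True, lower=True):
    """Apply the same three cleaning steps as Preprocessor._doAlways to one string."""
    # Replace anything between angle brackets (for example <br />) with a space.
    text = re.sub(r"<.*?>", " ", text)
    # Replace every run of characters that are not letters, digits or spaces.
    if alpha:
        text = re.sub(r"[^a-zA-Z0-9 ]+", " ", text)
    # Lowercase the result.
    if lower:
        text = text.lower()
    return text


sample = "I don't care.<br /><br />Clichéd, but OK!"
# split() hides the repeated spaces left behind by the substitutions.
print(clean_text(sample).split())
```

Note that apostrophes and accented letters are replaced by spaces, which is why fragments such as "don t" and "clich d" show up in the preprocessed reviews.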

## Preprocess the Reviews

Now create a Preprocessor object and call it to fit and transform the data.

In [8]:
preprocess = Preprocessor(alpha=True, lower=True, stemmer=True)

In [9]:
%%time
trainX = preprocess.fit_transform(train_df.Review)

Fitted.
Transformed.
CPU times: user 48.5 s, sys: 1.76 s, total: 50.2 s
Wall time: 47 s


In [10]:
%%time
testX =  preprocess.transform(test_df.Review)

Transformed.
CPU times: user 41.6 s, sys: 1.46 s, total: 43.1 s
Wall time: 40.8 s


## Look at Preprocessed Reviews

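Let's look at the first review in the train and test data to see how the preprocessor changed the review text. This will give us a sense of what is happening. You can see that it removes markup and punctuation, lowercases everything, and replaces each word with a common form for its stem (for example, "points" becomes "point"). Words it did not keep during fitting are dropped (for example, "ticklish").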

In [11]:
trainX[0][0:512]

'if i accidentally stumbled across this script in textual form i would read it and maybe laugh i would not however laugh at the point in the film where the director would seems to want me to laugh although i am still not altogether sure where these are i don t care if this is woody allen this writer cannot writing dialogue or at least he cannot knowingly writing dialogue then draw performance from actors capable of draw laughter from even the most  of clown for example paraphrase i m an art historians i m lo'

In [12]:
train_df['Review'][0][0:512]

"If I accidentally stumbled across this script in textual form i would read it and maybe laugh. I would not, however laugh at the points in the film where the director would seem to want me to laugh. Although I am still not altogether sure where these are. I don't care if this is Woody Allen, this writer cannot write dialogue, or at least he cannot knowingly write dialogue then draw performances from actors capable of drawing laughter from even the most ticklish of clowns. For example:<br /><br />(paraphrase"

In [13]:
testX[0][0:512]

'this film is a good companion to blair witch because it does so much wrong that bw did right like bw this one pretend to be a documentary of ghostly events with each members of the team man his her own camera the sense of reality is never there however the participants are poorly written clich d characters and the events that take place are equally clich d the cat jump out of a closet falls chandelier etc also the stilted dialog and inept improved work by the overly attractive cast detract from the docu fee'

In [14]:
test_df['Review'][0][0:512]

'This film is a good companion to Blair Witch, because it does so much wrong that BW did right. Like BW, this one pretends to be a documentary of ghostly events, with each member of the team manning his/her own camera. <br /><br />The sense of reality is never there, however. The participants are poorly written clichéd characters and the events that take place are equally clichéd (the cat jumping out of a closet, falling chandelier, etc). Also the stilted dialog and inept improv work by the overly-attractive'

In [15]:
print(preprocess.uniqueWords.shape)

(46840, 3)


You can see that it encountered different forms of the word "disappoint" (7 forms to be precise).

In [16]:
preprocess.uniqueWords[preprocess.uniqueWords.word.str.contains("disappoint")]

Unnamed: 0,word,count,stem
18475,disappointingly,17,disappointingli
18473,disappointed,914,disappoint
18474,disappointing,419,disappoint
18476,disappointment,402,disappoint
18472,disappoint,98,disappoint
18479,disappoints,38,disappoint
18478,disappointments,22,disappoint


It mapped these 7 forms of "disappoint" to 2 stems.

In [17]:
print(preprocess.uniqueStems.shape)
preprocess.uniqueStems[preprocess.uniqueStems.word.str.contains("disappoint")]

(30834, 2)


Unnamed: 0_level_0,word,count
stem,Unnamed: 1_level_1,Unnamed: 2_level_1
disappoint,disappointed,914
disappointingli,disappointingly,17
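
The mapping from stems to a single representative word can be reproduced by hand. The sketch below uses the (word, count, stem) rows shown in the output above and keeps, per stem, the word with the highest count. The function name is illustrative, and ties are broken by first occurrence here, which may differ from the sort order used in `fit`.

```python
# (word, count, stem) rows copied from the uniqueWords output above.
rows = [
    ("disappointingly", 17, "disappointingli"),
    ("disappointed", 914, "disappoint"),
    ("disappointing", 419, "disappoint"),
    ("disappointment", 402, "disappoint"),
    ("disappoint", 98, "disappoint"),
    ("disappoints", 38, "disappoint"),
    ("disappointments", 22, "disappoint"),
]


def representative_words(rows):
    """Map each stem to its most frequent surface form."""
    best = {}
    for word, count, stem in rows:
        # Keep the word only if it is more frequent than the current best.
        if stem not in best or count > best[stem][1]:
            best[stem] = (word, count)
    return {stem: word for stem, (word, _) in best.items()}


print(representative_words(rows))
```

During `transform`, every word whose stem is "disappoint" is therefore rewritten as "disappointed".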


## Feature Engineering

Next, we take the preprocessed texts as input and calculate their TF-IDF vectors. We retain at most 10,000 features (vocabulary terms), so each text becomes a 10,000-dimensional vector. More information can be found at https://www.tfidf.com. The stop words are defined in the next cell.


In [18]:
stop_words = text.ENGLISH_STOP_WORDS.union(["thats","weve", "dont","lets","youre",
                                            "im","thi","ha", "wa","st","ask","want",
                                            "like","thank","know","susan","ryan",
                                            "say","got","ought","ive","theyre"])
print(stop_words)

frozenset({'perhaps', 'next', 'without', 'ie', 'indeed', 'until', 'amoungst', 'another', 'get', 'thereby', 'couldnt', 'will', 'while', 'they', 'found', 'to', 'six', 'myself', 'besides', 'three', 'then', 'wherein', 'front', 'de', 'only', 'than', 'keep', 'first', 'am', 'whole', 'none', 'name', 'much', 'becoming', 'my', 'between', 'what', 'so', 'throughout', 'thats', 'being', 'st', 'seem', 'never', 'nine', 'onto', 'within', 'his', 'want', 'eleven', 'became', 'we', 'hundred', 'please', 'twenty', 'among', 'neither', 'beyond', 'often', 'them', 'thereafter', 'is', 'un', 'see', 'those', 'might', 'system', 'mostly', 'someone', 'back', 'who', 'under', 'as', 'which', 'eg', 'afterwards', 'toward', 'thereupon', 'ryan', 'hereupon', 'hers', 'from', 'fifty', 'our', 'above', 'call', 'ought', 'below', 'whereby', 'least', 'side', 'thence', 'nor', 'into', 'behind', 'some', 'per', 'yourself', 'itself', 'this', 'be', 'other', 'weve', 'each', 'about', 'she', 'top', 'two', 'ha', 'most', 'thru', 'of', 'amount'
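
To get a feel for what the stop-word list does, the sketch below tokenizes a sentence with the default token pattern of scikit-learn's `TfidfVectorizer` (words of two or more characters) and then drops stop words. The tiny stop-word set and the sentence are made up for illustration; the real list is the one printed above.

```python
import re

# Default token pattern of TfidfVectorizer: sequences of 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Hypothetical, deliberately small stop-word list for this example only.
STOP_WORDS = frozenset({"this", "is", "the", "dont", "like", "im"})


def tokenize(text, stop_words=STOP_WORDS):
    """Lowercase, split into tokens, and remove stop words."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [tok for tok in tokens if tok not in stop_words]


print(tokenize("i dont like this film it is the worst"))
```

Single-character tokens such as "i" never reach the vocabulary because the pattern requires at least two characters.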

The vectorizer is created next, with the stop words passed in so that they are excluded from the vocabulary. At this point the `tfidf` variable holds an unfitted `TfidfVectorizer` object, and printing it shows its configuration.

In [19]:
tfidf = TfidfVectorizer(min_df=2, max_features=10000, stop_words=stop_words) 
#, ngram_range=(1,3)
print(tfidf)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=1.0, max_features=10000,
                min_df=2, ngram_range=(1, 1), norm='l2', preprocessor=None,
                smooth_idf=True,
                stop_words=frozenset({'a', 'about', 'above', 'across', 'after',
                                      'afterwards', 'again', 'against', 'all',
                                      'almost', 'alone', 'along', 'already',
                                      'also', 'although', 'always', 'am',
                                      'among', 'amongst', 'amoungst', 'amount',
                                      'an', 'and', 'another', 'any', 'anyhow',
                                      'anyone', 'anything', 'anyway',
                                      'anywhere', ...}),
                strip_accents=None, sublinear_tf=False,
                token_p
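
With the settings printed above (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), scikit-learn weights a term as tf × (ln((1 + n) / (1 + df)) + 1) and then scales each row to unit length. The pure-Python sketch below applies that formula to a made-up corpus of three tokenized documents; the function name and data are illustrative only.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """TF-IDF with smoothed IDF and L2 row normalisation for tokenized documents."""
    n_docs = len(docs)
    vocab = sorted({tok for doc in docs for tok in doc})
    # Document frequency: in how many documents each token occurs.
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1.0 for tok in vocab}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        raw = [tf[tok] * idf[tok] for tok in vocab]
        # Scale the row to unit Euclidean length (guard against empty rows).
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows


docs = [["great", "film"], ["bad", "film"], ["great", "great", "acting"]]
vocab, rows = tfidf_rows(docs)
print(vocab)
for row in rows:
    print([round(v, 3) for v in row])
```

In the second document, "bad" gets a higher weight than "film" because it occurs in fewer documents.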

In [20]:
%%time
trainX = tfidf.fit_transform(trainX).toarray()
print(trainX[0][0:512], '\n\n', train_df['Review'][0][0:512])

[0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         

In [21]:
%%time
testX = tfidf.transform(testX).toarray()
print(testX[0][0:512], '\n\n', test_df['Review'][0][0:512])

[0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         

In [22]:
print(trainX.shape)
print(testX.shape)

(24904, 10000)
(24678, 10000)


In [23]:
trainY = train_df.Label
testY =  test_df.Label

In [24]:
print(trainX.shape, trainY.shape)
print(testX.shape,  testY.shape)

(24904, 10000) (24904,)
(24678, 10000) (24678,)


## Feature Selection

Next, we take the 10k-dimensional TF-IDF vectors as input and keep the 2,000 dimensions that correlate most strongly with our sentiment target: the 1,000 most positively and the 1,000 most negatively correlated ones. The corresponding words (see below) make sense.

Note to self: this is not Principal Component Analysis (PCA), even though PCA is imported above. The cells below compute the Pearson correlation between each TF-IDF column and the label, sort the columns by that correlation, and keep the columns at both ends of the ranking. The selected features are original TF-IDF columns, not newly combined components.

In [25]:
from scipy.stats import pearsonr

In [26]:
print(trainX.shape)

(24904, 10000)


In [27]:
getCorrelation = np.vectorize(lambda x: pearsonr(trainX[:,x], trainY)[0])
print(type(getCorrelation), getCorrelation)

<class 'numpy.vectorize'> <numpy.vectorize object at 0x7fcb68d77810>


In [28]:
correlations = getCorrelation(np.arange(trainX.shape[1]))
print(correlations, len(correlations), type(correlations))

[-0.01284482 -0.01814795  0.00904327 ...  0.01702406  0.02087278
  0.00974081] 10000 <class 'numpy.ndarray'>


In [29]:
allIndeces = np.argsort(-correlations)
print(type(allIndeces), allIndeces, len(allIndeces))

<class 'numpy.ndarray'> [3917 5371 3149 ... 9690 9885  768] 10000


In [30]:
bestIndeces = allIndeces[np.concatenate([np.arange(1000), np.arange(-1000, 0)])]
print(type(bestIndeces), bestIndeces, len(bestIndeces))

<class 'numpy.ndarray'> [3917 5371 3149 ... 9690 9885  768] 2000


In [31]:
vocabulary = np.array(tfidf.get_feature_names())
print(type(vocabulary))
print(vocabulary[bestIndeces][:10])
print(vocabulary[bestIndeces][-10:])

<class 'numpy.ndarray'>
['great' 'love' 'excellent' 'best' 'beautiful' 'perfect' 'favorite'
 'enjoy' 'amazing' 'performance']
['poor' 'stupid' 'horrible' 'worse' 'terrible' 'boring' 'awful' 'waste'
 'worst' 'bad']


In [32]:
print(trainX.shape, testX.shape, min(bestIndeces), max(bestIndeces))

(24904, 10000) (24678, 10000) 9 9998


In [33]:
trainX = trainX[:,bestIndeces]

In [34]:
testX = testX[:,bestIndeces]

In [35]:
print(trainX.shape, trainY.shape)
print(testX.shape, testY.shape)

(24904, 2000) (24904,)
(24678, 2000) (24678,)
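
The selection performed in the cells above can be reproduced on toy data. The sketch below computes the Pearson correlation of each feature column with the labels in plain Python and returns the indices of the `k` most positively and `k` most negatively correlated columns. All names and numbers are made up for illustration, and treating a constant column as uncorrelated is an assumption of this sketch.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant column counts as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def select_features(columns, labels, k):
    """Indices of the k most positive and k most negative correlations."""
    corr = [pearson(col, labels) for col in columns]
    # Sort feature indices from most positive to most negative correlation.
    order = sorted(range(len(corr)), key=lambda i: -corr[i])
    return order[:k] + order[-k:]


labels = [1, 1, 0, 0]
columns = [
    [0.9, 0.8, 0.1, 0.0],  # high for positive reviews
    [0.5, 0.4, 0.5, 0.4],  # unrelated to the label
    [0.0, 0.1, 0.9, 0.7],  # high for negative reviews
    [0.2, 0.3, 0.1, 0.2],  # weakly positive
]
print(select_features(columns, labels, k=1))
```

The first returned index belongs to the strongest "positive" feature and the last to the strongest "negative" one, which matches the ordering of `bestIndeces`.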


## Model Architecture

We choose a simple dense network with six hidden layers, each followed by dropout, and a single sigmoid output unit that performs the binary classification.

In [36]:
DROPOUT = 0.5
ACTIVATION = "tanh"

model = Sequential([    
    Dense(int(trainX.shape[1]/2), activation=ACTIVATION, input_dim=trainX.shape[1]),
    Dropout(DROPOUT),
    Dense(int(trainX.shape[1]/2), activation=ACTIVATION),
    Dropout(DROPOUT),
    Dense(int(trainX.shape[1]/4), activation=ACTIVATION),
    Dropout(DROPOUT),
    Dense(100, activation=ACTIVATION),
    Dropout(DROPOUT),
    Dense(20, activation=ACTIVATION),
    Dropout(DROPOUT),
    Dense(5, activation=ACTIVATION),
    Dropout(DROPOUT),    
    Dense(1, activation='sigmoid'),
])

In [37]:
model.compile(optimizer=Adam(0.00005), loss='binary_crossentropy', metrics=['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 1000)              2001000   
_________________________________________________________________
dropout (Dropout)            (None, 1000)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 1000)              1001000   
_________________________________________________________________
dropout_1 (Dropout)          (None, 1000)              0         
_________________________________________________________________
dense_2 (Dense)              (None, 500)               500500    
_________________________________________________________________
dropout_2 (Dropout)          (None, 500)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 100)               5
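
The parameter counts in the summary can be checked by hand: a dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases, and dropout layers add none. A small sketch, with the layer widths read from the model definition above:

```python
def dense_params(n_in, n_out):
    """Number of trainable parameters of a fully connected layer."""
    return n_in * n_out + n_out


# Input width followed by the width of each Dense layer in the model.
layer_sizes = [2000, 1000, 1000, 500, 100, 20, 5, 1]

counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(counts)
print(sum(counts))
```

The first three values match the `Param #` column for `dense`, `dense_1` and `dense_2` in the summary.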

## Training the Model

As the notebook says, let's go.



In [38]:
EPOCHS = 30
BATCHSIZE = 1500
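
As a quick sanity check on these settings: when training on in-memory arrays, the number of weight updates per epoch is the number of samples divided by the batch size, rounded up. The sketch below works this out for the training set size shown earlier (24,904 rows); the helper name is illustrative.

```python
import math


def steps_per_epoch(n_samples, batch_size):
    """Number of batches needed to see every sample once."""
    return math.ceil(n_samples / batch_size)


n_train, batch_size, epochs = 24904, 1500, 30
steps = steps_per_epoch(n_train, batch_size)
last_batch = n_train - (steps - 1) * batch_size

print(steps, "steps per epoch")
print(last_batch, "samples in the last batch")
print(steps * epochs, "weight updates in total")
```

With so few updates per epoch and a small learning rate, it is plausible that the model needs many epochs to converge.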

In [39]:
for np_arr in [trainX, trainY, testX, testY]:
    print(np_arr.shape, None in np_arr)

(24904, 2000) False
(24904,) False
(24678, 2000) False
(24678,) False


In [40]:
%%time
model.fit(trainX, np.asarray(trainY), epochs=EPOCHS, batch_size=BATCHSIZE, 
          validation_data=(testX, np.asarray(testY)))

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30
CPU times: user 9min 41s, sys: 23.8 s, total: 10min 5s
Wall time: 1min 13s


<tensorflow.python.keras.callbacks.History at 0x7fcb69138890>

In [41]:
x = np.arange(EPOCHS)
history = model.history.history
history.keys()


dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])

In [42]:
data = [
#    go.Scatter(x=x, y=history["accuracy"], name="Train Accuracy", marker=dict(size=5), yaxis='y2'),
    go.Scatter(x=x, y=history["val_accuracy"], name="Test Accuracy", marker=dict(size=5), yaxis='y2'),
    go.Scatter(x=x, y=history["loss"], name="Train Loss", marker=dict(size=5)),
    go.Scatter(x=x, y=history["val_loss"], name="Test Loss", marker=dict(size=5))
]
layout = go.Layout(
    title="Model Training Evolution", font=dict(family='Palatino'), xaxis=dict(title='Epoch', dtick=1),
    yaxis1=dict(title="Loss", domain=[0, 0.45]), yaxis2=dict(title="Accuracy", domain=[0.55, 1]),
)
py.iplot(go.Figure(data=data, layout=layout), show_link=False)

## Model Evaluation

Let's first attach the probabilities and predictions to the original train and test dataframes. Then we can print out the respective accuracies and losses.

In [43]:
train_df["probability"] = model.predict(trainX)
train_df["prediction"] = train_df.probability-0.5>0
train_df["truth"] = train_df.Label==1
train_df.tail()

Unnamed: 0.1,Unnamed: 0,Review,Label,probability,prediction,truth
24899,24994,I was so looking forward to seeing this when i...,0,0.333382,False,False
24900,24996,Paul Verhoeven's predecessor to his breakout h...,1,0.794062,True,True
24901,24997,Financially strapped Paramount pulled out all ...,1,0.194794,False,True
24902,24998,This is one of the worst movies ever made. Tri...,0,0.194576,False,False
24903,24999,"FREDDY has gone from scary to funny,in this 6t...",1,0.779808,True,True


In [44]:
print(model.evaluate(trainX, np.asarray(trainY)))

[0.3271786570549011, 0.9200530052185059]


In [45]:
print((train_df.truth==train_df.prediction).mean())

0.9200530035335689
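
The two numbers returned by `model.evaluate` are the binary cross-entropy loss and the accuracy. The sketch below shows how both can be computed from labels and predicted probabilities, using the five rows of `train_df.tail()` shown above as input. The clipping constant `eps` is an assumption of this sketch to avoid `log(0)`.

```python
import math


def binary_cross_entropy(labels, probs, eps=1e-7):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


def accuracy(labels, probs, threshold=0.5):
    """Share of samples whose thresholded probability matches the label."""
    hits = sum((p > threshold) == bool(y) for y, p in zip(labels, probs))
    return hits / len(labels)


# Label and probability columns from train_df.tail() above.
labels = [0, 1, 1, 0, 1]
probs = [0.333382, 0.794062, 0.194794, 0.194576, 0.779808]

print("loss:", binary_cross_entropy(labels, probs))
print("accuracy:", accuracy(labels, probs))
```

Four of the five rows are classified correctly. The one confident miss (probability 0.194794 for a positive review) contributes most of the loss.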


In [46]:
test_df["probability"] = model.predict(testX)
test_df["prediction"] = test_df.probability-0.5>0
test_df["truth"] = test_df.Label==1
test_df.tail()

Unnamed: 0.1,Unnamed: 0,Review,Label,probability,prediction,truth
24673,24995,"With its rerelease by ADV Films, I've had a ch...",1,0.802117,True,True
24674,24996,A good cast... A good idea but turns out it is...,0,0.194416,False,False
24675,24997,"There seems to be a whole sub genre of cheap, ...",0,0.195075,False,False
24676,24998,The concept: show 4 families of diverse ethnic...,0,0.199903,False,False
24677,24999,"I mean let's face it, all you have to do in mo...",1,0.802351,True,True


In [47]:
print(model.evaluate(testX, np.array(testY)))

[0.3837502896785736, 0.8697220087051392]


In [48]:
print((test_df.truth==test_df.prediction).mean())

0.8697220196126104


## Error Analysis

Error analysis gives us great insight into the way the model is making its errors. Often, it shows data quality issues.

In [49]:
trainCross = train_df.groupby(["prediction", "truth"]).size().unstack()
trainCross

truth,False,True
prediction,Unnamed: 1_level_1,Unnamed: 2_level_1
False,11496,1055
True,936,11417


In [50]:
testCross = test_df.groupby(["prediction", "truth"]).size().unstack()
testCross

truth,False,True
prediction,Unnamed: 1_level_1,Unnamed: 2_level_1
False,10862,1811
True,1404,10601
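
The test cross table can be turned into the usual summary metrics. In the table, rows are predictions and columns are the truth, so a positive prediction on a negative review is a false positive, and a negative prediction on a positive review is a false negative. A small sketch using the four counts above:

```python
# Counts read from testCross (rows: prediction, columns: truth).
tn, fn = 10862, 1811   # prediction == False
fp, tp = 1404, 10601   # prediction == True

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # how many predicted positives are truly positive
recall = tp / (tp + fn)      # how many true positives were found

print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {precision:.4f}")
print(f"recall:    {recall:.4f}")
```

The accuracy computed this way agrees with the value reported by `model.evaluate` on the test data.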


In [51]:
truepositives = test_df[(test_df.truth==True)&(test_df.truth==test_df.prediction)]
print(len(truepositives), "true positives.")
truepositives.sort_values("probability", ascending=False).head(3)

10601 true positives.


Unnamed: 0.1,Unnamed: 0,Review,Label,probability,prediction,truth
13486,13602,I saw this tonight with moderate expectations ...,1,0.805829,True,True
10596,10679,"In Paris, a few months before the Nazi invasio...",1,0.805815,True,True
15204,15352,This sad romance is untellable because the dir...,1,0.805808,True,True


In [52]:
truenegatives = test_df[(test_df.truth==False)&(test_df.truth==test_df.prediction)]
print(len(truenegatives), "true negatives.")
truenegatives.sort_values("probability", ascending=True).head(3)

10862 true negatives.


Unnamed: 0.1,Unnamed: 0,Review,Label,probability,prediction,truth
8274,8331,I was truly looking forward to this title. It ...,0,0.194324,False,False
21050,21303,Movie Title - Tart<br /><br />Date of review -...,0,0.194332,False,False
13596,13713,Do not rent this movie. I ended up buying the ...,0,0.194335,False,False


In [53]:
falsenegatives = test_df[(test_df.truth==True)&(test_df.truth!=test_df.prediction)]
print(len(falsenegatives), "false negatives.")
falsenegatives.sort_values("probability", ascending=True).head(3)

1811 false negatives.


Unnamed: 0.1,Unnamed: 0,Review,Label,probability,prediction,truth
13782,13900,"Yes, indeed we have a winner- a winner in best...",1,0.194382,False,True
24043,24354,"Okay, I'll admit that if I didn't have kids, I...",1,0.194388,False,True
3732,3762,"I've waited 9 years to watch this film, simply...",1,0.194391,False,True


In [54]:
falsepositives = test_df[(test_df.truth==False)&(test_df.truth!=test_df.prediction)]
print(len(falsepositives), "false positives.")
falsepositives.sort_values("probability", ascending=False).head(3)

1404 false positives.


Unnamed: 0.1,Unnamed: 0,Review,Label,probability,prediction,truth
13405,13520,This is definitely one of the best Kung fu mov...,0,0.805789,True,False
17257,17431,Victor Jory never became a major star. He is b...,0,0.805777,True,False
10419,10502,This movie was pure genius. John Waters is bri...,0,0.80577,True,False


This is the review that was predicted as positive with the highest confidence while being labeled as negative. However, we can easily recognize it as a poorly labeled sample.

In [55]:
HTML(test_df.loc[10502].Review)

In [56]:
HTML(test_df.loc[23071].Review)

## Model Application

To use this model, we would store the model, along with the preprocessing objects, and run the unseen texts through the following pipeline.

In [57]:
unseen = pd.Series("this movie very good. I liked it a lot")

In [58]:
unseen = preprocess.transform(unseen)       # Text preprocessing
unseen = tfidf.transform(unseen).toarray()  # Feature engineering
unseen = unseen[:,bestIndeces]              # Feature selection
probability = model.predict(unseen)[0,0]  # Network feedforward

Transformed.


In [59]:
print(probability)
print("Positive!" if probability > 0.5 else "Negative!")

0.5799436
Positive!
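
Storing the fitted pieces mentioned above could look like the sketch below. It is one possible approach, not something the original notebook does: the helper names and file names are hypothetical, and it assumes the preprocessing objects can be pickled. The Keras network would be saved separately with `model.save(...)`.

```python
import os
import pickle
import tempfile


def save_artifacts(path, **artifacts):
    """Pickle a dictionary of named preprocessing objects to one file."""
    with open(path, "wb") as fh:
        pickle.dump(artifacts, fh)


def load_artifacts(path):
    """Load the dictionary written by save_artifacts."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# In the notebook this might be called as (hypothetical file name):
#   save_artifacts("sentiment_artifacts.pkl",
#                  preprocess=preprocess, tfidf=tfidf, best_indices=bestIndeces)
# Stand-in objects are used here so the example runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "artifacts.pkl")
    save_artifacts(path, best_indices=[3917, 5371, 3149], stop_words={"im", "thi"})
    restored = load_artifacts(path)

print(restored["best_indices"])
```

When loading in another session, the `Preprocessor` class definition has to be available before unpickling, because pickle stores a reference to the class rather than its code.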


In [60]:
print(train_df.shape, test_df.shape)

(24904, 6) (24678, 6)


In [61]:
train_df.columns

Index(['Unnamed: 0', 'Review', 'Label', 'probability', 'prediction', 'truth'], dtype='object')

In [62]:
test_df.columns

Index(['Unnamed: 0', 'Review', 'Label', 'probability', 'prediction', 'truth'], dtype='object')