# Movie Review Sentiment Analysis

Let's try sentiment analysis using [chariot](https://github.com/chakki-works/chariot) and [TensorFlow](https://www.tensorflow.org/).

1. Download the [Movie Review Data](https://github.com/chakki-works/chazutsu/tree/master/chazutsu#movie-review-data).
2. Preprocess the review text with chariot.
3. Load the pretrained word embedding [GloVe](https://nlp.stanford.edu/projects/glove/).
4. Build the model with TensorFlow (using `tf.keras`).
5. Train and evaluate the model.

This tutorial requires the following libraries.

* chazutsu
* chakin
* scipy
* scikit-learn
* tensorflow
* h5py
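
Before running the notebook, it can help to confirm that the libraries listed above are importable. The following is a minimal sketch using only the standard library; the helper name `find_missing` is hypothetical, and the module names are the usual import names for these packages (note that scikit-learn is imported as `sklearn`).

```python
from importlib.util import find_spec

# Import names for the libraries listed above
# (scikit-learn is imported as "sklearn").
REQUIRED_MODULES = ["chazutsu", "chakin", "scipy", "sklearn", "tensorflow", "h5py"]


def find_missing(modules):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in modules if find_spec(name) is None]


missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```

Anything reported as missing needs to be installed before you continue.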


## Prepare the packages

In [1]:
%load_ext autoreload
%autoreload 2


import os
import sys
from pathlib import Path


def set_path():
    # Add the parent directory to the import path and
    # use the current directory as the data root.
    if "../" not in sys.path:
        sys.path.append("../")
    root_dir = Path.cwd()
    return root_dir

ROOT_DIR = set_path()

## Download the Movie Review Data

In [2]:
import chazutsu
from chariot.storage import Storage

storage = Storage.setup_data_dir(ROOT_DIR)
r = chazutsu.datasets.MovieReview.polarity().download(storage.path("raw"))

Read resource from the existed resource(if you want to retry, set force=True).


In [3]:
r.train_data().head(3)

| Unnamed: 0 | polarity | review |
|---|---|---|
| 0 | 1 | a bleak look at how the boston underworld oper... |
| 1 | 1 | showgirls is the second major outing for the p... |
| 2 | 1 | countries and legal systems that take the rule... |


## Preprocess the review text with chariot

### Make a single preprocessor

In [4]:
import chariot.transformer as ct
from chariot.preprocessor import Preprocessor


review_processor = Preprocessor()
review_processor\
    .stack(ct.text.UnicodeNormalizer())\
    .stack(ct.Tokenizer("en"))\
    .stack(ct.token.StopwordFilter("en"))\
    .stack(ct.Vocabulary(min_df=5, max_df=0.5))\
    .fit(r.train_data()["review"])

Preprocessor(other_transformers=[],
       text_transformers=[UnicodeNormalizer(copy=True, form='NFKC')],
       token_transformers=[StopwordFilter(copy=True, lang='en')],
       tokenizer=Tokenizer(copy=True, lang='en'),
       vocabulary=Vocabulary(begin_of_sequence=None, copy=True, end_of_sequence=None,
      ignore_blank=True, max_df=0.5, min_df=5, padding=None, unknown=None,
      vocab_size=-1))

In [5]:
review_processor.vocabulary.get()[:10]

['@@PADDING@@',
 '@@UNKNOWN@@',
 '@@BEGIN_OF_SEQUENCE@@',
 '@@END_OF_SEQUENCE@@',
 "'re",
 'work',
 'better',
 'real',
 'gets',
 'going']
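
To get a feel for what `min_df=5` and `max_df=0.5` do, here is a small pure-Python sketch of document-frequency filtering. It is not chariot's implementation. It assumes the convention familiar from scikit-learn, where an integer `min_df` is an absolute document count and a float `max_df` is a fraction of all documents. The function `build_vocab` and the toy documents are made up for illustration, and the reserved tokens are placed first only to mirror the output above.

```python
from collections import Counter


def build_vocab(tokenized_docs, min_df=1, max_df=1.0,
                reserved=("@@PADDING@@", "@@UNKNOWN@@")):
    """Keep tokens whose document frequency lies within [min_df, max_df].

    min_df is an absolute document count and max_df is a fraction of all
    documents (an assumed convention, borrowed from scikit-learn).
    """
    n_docs = len(tokenized_docs)
    # Count each token at most once per document.
    doc_freq = Counter(token for doc in tokenized_docs for token in set(doc))
    kept = sorted(token for token, df in doc_freq.items()
                  if df >= min_df and df / n_docs <= max_df)
    return list(reserved) + kept


toy_docs = [
    ["good", "movie", "film"],
    ["bad", "movie", "film"],
    ["good", "plot", "film"],
    ["boring", "film"],
]
toy_vocab = build_vocab(toy_docs, min_df=2, max_df=0.5)
print(toy_vocab)
```

In this toy corpus, "film" appears in every document and is dropped by `max_df`, while words that occur in only one document are dropped by `min_df`.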

### Define the dataset preprocessor

In [6]:
from chariot.dataset_preprocessor import DatasetPreprocessor
from chariot.transformer.formatter import Padding


pad_length = 300

dp = DatasetPreprocessor()
dp.process("review")\
    .by(ct.text.UnicodeNormalizer())\
    .by(ct.Tokenizer("en"))\
    .by(ct.token.StopwordFilter("en"))\
    .by(ct.Vocabulary(min_df=5, max_df=0.5))\
    .by(Padding(length=pad_length))\
    .fit(r.train_data()["review"])

<chariot.dataset_preprocessor.ProcessBuilder at 0x16b0a0e48>
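
The `Padding(length=pad_length)` step gives every review the same length so the reviews can be stacked into one array. The sketch below shows the general idea in plain Python. The helper `pad_sequence` is hypothetical, and the choices it makes (pad on the right, cut from the end, padding index 0) are assumptions for illustration; chariot's `Padding` may behave differently.

```python
def pad_sequence(indices, length, pad_index=0):
    """Truncate or right-pad a list of token indices to exactly `length`."""
    trimmed = indices[:length]
    return trimmed + [pad_index] * (length - len(trimmed))


short_review = [5, 6, 7]
long_review = [1, 2, 3, 4, 5, 6]

print(pad_sequence(short_review, 5))
print(pad_sequence(long_review, 4))
```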

## Load the pretrained word embedding GloVe

In [7]:
_ = storage.chakin(name="GloVe.6B.200d")

In [8]:
embedding = review_processor.vocabulary.make_embedding(storage.path("external/glove.6B.200d.txt"))
print(embedding.shape)

(11770, 200)
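
The shape above is (vocabulary size, vector dimension): one row per vocabulary entry. As a rough illustration of how such a matrix can be assembled from the GloVe text format (one word per line followed by its space-separated vector), here is a pure-Python sketch. It is not what `make_embedding` does internally. The helper `make_embedding_matrix`, the in-memory file, and its numbers are all made up, and leaving unknown words as zero rows is an assumption.

```python
import io


def make_embedding_matrix(vocab, lines, dim):
    """Build one row per vocabulary entry from GloVe-style text lines.

    Words missing from the file keep an all-zero row.
    """
    index = {word: i for i, word in enumerate(vocab)}
    matrix = [[0.0] * dim for _ in vocab]
    for line in lines:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if word in index and len(values) == dim:
            matrix[index[word]] = [float(v) for v in values]
    return matrix


# Tiny in-memory stand-in for a GloVe file (made-up numbers).
fake_glove = io.StringIO("good 0.1 0.2 0.3\nmovie 0.4 0.5 0.6\nthe 0.7 0.8 0.9\n")
small_vocab = ["@@PADDING@@", "@@UNKNOWN@@", "good", "movie"]
embedding_matrix = make_embedding_matrix(small_vocab, fake_glove, dim=3)
print(len(embedding_matrix), len(embedding_matrix[0]))
```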


## Build the model with TensorFlow

### Test a baseline model

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics


def test_baseline(train, test):
    X = train["review"]
    y = train["polarity"]
    vectorizer = TfidfVectorizer()
    X_vector = vectorizer.fit_transform(X)

    classifier = LogisticRegression(solver="liblinear")
    classifier.fit(X_vector, y)

    predict = classifier.predict(vectorizer.transform(test["review"]))
    score = metrics.accuracy_score(test["polarity"], predict)

    print(score)

test_baseline(r.train_data(), r.test_data())

0.83


### Build the model

In [10]:
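from tensorflow import keras as K


# Vocabulary size and padding index come from the dataset preprocessor.
vocab_size = dp.process("review").preprocessor.vocabulary.count
padding_index = dp.process("review").preprocessor.vocabulary.pad
# Must match the dimension of the GloVe vectors loaded above (200d).
embedding_size = 200

def make_model():
    model = K.Sequential()
    # Mark padded positions so that mask-aware layers can skip them.
    model.add(K.layers.Masking(mask_value=padding_index, input_shape=(pad_length,)))
    # Initialize the embedding layer with the pretrained GloVe matrix.
    model.add(K.layers.Embedding(vocab_size, embedding_size, weights=[embedding]))
    # Average the word vectors over the sequence axis.
    model.add(K.layers.Lambda(lambda x: K.backend.mean(x, axis=1)))
    # A single sigmoid unit outputs the probability of positive polarity.
    model.add(K.layers.Dense(1, activation="sigmoid"))
    return model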

model = make_model()
model.compile(loss="binary_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])
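
The model above represents a review as the average of its word vectors. With padded input, a plain mean over all positions and a mean over the real tokens only give different results. The pure-Python sketch below illustrates just that difference. It does not reproduce Keras internals, and the helper names, the toy embedding table, and the padding index of 0 are assumptions for illustration.

```python
def mean_pool(vectors):
    """Average all vectors, dimension by dimension."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def masked_mean_pool(indices, vectors, pad_index=0):
    """Average only the vectors whose token index is not the padding index."""
    kept = [vec for idx, vec in zip(indices, vectors) if idx != pad_index]
    if not kept:
        return [0.0] * len(vectors[0])
    return mean_pool(kept)


# Two real tokens followed by two padding positions (toy 2-d embeddings).
token_indices = [2, 3, 0, 0]
toy_embedding = {0: [0.0, 0.0], 2: [1.0, 3.0], 3: [3.0, 5.0]}
sequence = [toy_embedding[i] for i in token_indices]

print(mean_pool(sequence))
print(masked_mean_pool(token_indices, sequence))
```

The plain mean is pulled toward the padding vector, and the effect grows as reviews get shorter relative to `pad_length`.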

### Train the model

In [11]:
formatted = dp(r.train_data()).preprocess().format().processed

In [12]:
print(formatted["review"].shape)
print(formatted["polarity"].shape)

(1400, 300)
(1400,)


In [13]:
model.fit(formatted["review"], formatted["polarity"], batch_size=32,
          validation_split=0.2, epochs=15, verbose=2)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 1120 samples, validate on 280 samples
Epoch 1/15
 - 2s - loss: 0.6922 - acc: 0.5161 - val_loss: 0.6903 - val_acc: 0.5250
Epoch 2/15
 - 2s - loss: 0.6776 - acc: 0.7098 - val_loss: 0.6839 - val_acc: 0.6321
Epoch 3/15
 - 1s - loss: 0.6644 - acc: 0.7482 - val_loss: 0.6774 - val_acc: 0.6571
Epoch 4/15
 - 1s - loss: 0.6478 - acc: 0.8277 - val_loss: 0.6689 - val_acc: 0.6571
Epoch 5/15
 - 1s - loss: 0.6288 - acc: 0.8446 - val_loss: 0.6610 - val_acc: 0.7107
Epoch 6/15
 - 2s - loss: 0.6020 - acc: 0.8652 - val_loss: 0.6489 - val_acc: 0.6893
Epoch 7/15
 - 2s - loss: 0.5708 - acc: 0.8929 - val_loss: 0.6375 - val_acc: 0.7107
Epoch 8/15
 - 2s - loss: 0.5349 - acc: 0.9152 - val_loss: 0.6239 - val_acc: 0.7071
Epoch 9/15
 - 2s - loss: 0.4951 - acc: 0.9357 - val_loss: 0.6102 - val_acc: 0.7214
Epoch 10/15
 - 2s - loss: 0.4527 - acc: 0.9375 - val_loss: 0.5950 - val_acc: 0.7179
Epoch 11/15
 - 2s - loss: 0.4100 - acc: 0.9545 - val_loss: 0.5816 - val_acc: 0.7179
Epoch 12/15
 - 2s - loss: 0.3683 - acc

<tensorflow.python.keras.callbacks.History at 0x16b1316d8>
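
The "Train on 1120 samples, validate on 280 samples" line follows from `validation_split=0.2`. According to the Keras documentation, the validation data is taken from the last samples of the arrays passed to `fit`, before any shuffling. The sketch below reproduces the arithmetic only. The helper `split_for_validation` is hypothetical, and the exact rounding Keras applies is an assumption.

```python
def split_for_validation(n_samples, validation_split):
    """Return (train_range, validation_range), with validation taken from the tail."""
    split_at = int(n_samples * (1.0 - validation_split))
    return range(0, split_at), range(split_at, n_samples)


train_idx, val_idx = split_for_validation(1400, 0.2)
print(len(train_idx), len(val_idx))
```

Because the split is positional, it is worth checking that the training data is not ordered by label before relying on the validation scores.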

### Evaluate the model

In [14]:
test_dataset = dp(r.test_data()).preprocess().format().processed

In [15]:
score, acc = model.evaluate(test_dataset["review"], test_dataset["polarity"], batch_size=32)



In [16]:
print("Loss: {}, Accuracy: {}".format(score, acc))

Loss: 0.4995402534802755, Accuracy: 0.7916666666666666


## Model & Preprocessor persistence

In [17]:
# Create the output directory if it does not exist yet.
os.makedirs("models", exist_ok=True)

model.save("models/sentiment_model.h5")
dp.save("models/movie_dp.tar.gz")
print("save models")

save models


### Load

In [18]:
loaded_dp = DatasetPreprocessor.load("models/movie_dp.tar.gz")

In [19]:
# Only the preprocessor is reloaded here; `model` is still the in-memory model trained above.
test_dataset = loaded_dp(r.test_data()).preprocess().format().processed
score, acc = model.evaluate(test_dataset["review"], test_dataset["polarity"], batch_size=32)



In [20]:
print("Loss: {}, Accuracy: {}".format(score, acc))

Loss: 0.4995402534802755, Accuracy: 0.7916666666666666
