# Movie Review Sentiment Analysis

This tutorial walks through sentiment analysis of movie reviews using [chariot](https://github.com/chakki-works/chariot) and [TensorFlow](https://www.tensorflow.org/).

1. Download the [Movie Review Data](https://github.com/chakki-works/chazutsu/tree/master/chazutsu#movie-review-data).
2. Preprocess the review text with chariot.
3. Load the pretrained word embedding [GloVe](https://nlp.stanford.edu/projects/glove/).
4. Build the model with TensorFlow (using `tf.keras`).
5. Train & evaluate the model.

This tutorial needs the following libraries.

* chazutsu
* chakin
* scipy
* scikit-learn
* tensorflow
* h5py
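
Before running the notebook it can help to confirm that these packages are importable. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical, and note that scikit-learn is imported under the name `sklearn`.

```python
import importlib.util


def missing_packages(names):
    """Return the import names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names of the libraries listed above (scikit-learn is imported as sklearn).
required = ["chazutsu", "chakin", "scipy", "sklearn", "tensorflow", "h5py"]
print("Missing:", missing_packages(required))
```

An empty list means every package was found; anything listed still needs to be installed.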


## Prepare the packages

In [1]:
%load_ext autoreload
%autoreload 2


import os
import sys
from pathlib import Path


def set_path():
    if "../" not in sys.path:
        sys.path.append("../")
    root_dir = Path.cwd()
    return root_dir

ROOT_DIR = set_path()

## Download the Movie Review Data

In [2]:
import chazutsu
from chariot.storage import Storage

storage = Storage.setup_data_dir(ROOT_DIR)
r = chazutsu.datasets.MovieReview.polarity().download(storage.data_path("raw"))

Read resource from the existed resource(if you want to retry, set force=True).


In [3]:
r.train_data().head(3)

| | polarity | review |
|---|---|---|
| 0 | 0 | what hath kevin williamson wrought ? while the... |
| 1 | 1 | note : some may consider portions of the follo... |
| 2 | 0 | in the finale of disney's " mighty joe young ,... |


## Preprocess the review text with chariot

### Make preprocessor

In [4]:
import chariot.transformer as ct
from chariot.preprocessor import Preprocessor


review_processor = Preprocessor(
                    text_transformers=[ct.text.UnicodeNormalizer()],
                    tokenizer=ct.Tokenizer("en"),
                    token_transformers=[ct.token.StopwordFilter("en")],
                    vocabulary=ct.Vocabulary(min_df=5, max_df=0.5))

review_processor.fit(r.train_data()["review"])

Preprocessor(text_transformers=[UnicodeNormalizer(copy=True, form='NFKC')],
       token_transformers=[StopwordFilter(copy=True, lang='en')],
       tokenizer=Tokenizer(copy=True, lang='en'),
       vocabulary=Vocabulary(begin_of_sequence=None, copy=True, end_of_sequence=None,
      max_df=0.5, min_df=5, padding=None, unknown=None, vocab_size=-1))

In [5]:
review_processor.vocabulary.get()[:10]

['@@PADDING@@',
 '@@UNKNOWN@@',
 '@@BEGIN_OF_SEQUENCE@@',
 '@@END_OF_SEQUENCE@@',
 'makes',
 '_',
 'better',
 'real',
 'role',
 'seen']
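
The `min_df=5, max_df=0.5` settings prune the vocabulary by document frequency. The sketch below illustrates the idea in plain Python. It assumes scikit-learn-style semantics (an integer `min_df` is an absolute document count, a float `max_df` is a fraction of all documents); chariot's actual implementation may differ, and `build_vocab` is a hypothetical helper.

```python
from collections import Counter


def build_vocab(tokenized_docs, min_df=1, max_df=1.0):
    """Keep tokens whose document frequency lies within [min_df, max_df].

    Assumption: min_df is an absolute count and max_df a fraction of documents.
    """
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each token once per document
    return sorted(
        token for token, df in doc_freq.items()
        if df >= min_df and df / n_docs <= max_df
    )


docs = [
    ["good", "movie", "film"],
    ["bad", "movie", "film"],
    ["good", "plot", "film"],
    ["bad", "plot", "film"],
]
print(build_vocab(docs, min_df=2, max_df=0.5))  # ['bad', 'good', 'movie', 'plot']
```

Here `film` appears in every document, so the `max_df` bound removes it, much as very common words carry little signal for polarity.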

### Define the preprocessing pipeline

In [6]:
from chariot.preprocess import Preprocess


preprocess = Preprocess({
    "review": review_processor
})


## Load the pretrained word embedding GloVe

In [7]:
_ = storage.chakin(name="GloVe.6B.200d")

In [8]:
embedding = review_processor.vocabulary.make_embedding(storage.data_path("external/glove.6B.200d.txt"))
print(embedding.shape)

(11816, 200)
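
Conceptually, `make_embedding` pairs each vocabulary entry with its pretrained vector, which is why the result has one row per vocabulary entry (11816) and 200 columns. The plain-Python sketch below shows that idea on toy data. It is not chariot's implementation: the helper names are hypothetical, and the all-zero fallback for words missing from GloVe is an assumption.

```python
def load_vectors(lines):
    """Parse GloVe-style text lines: a word followed by its float components."""
    vectors = {}
    for line in lines:
        word, *values = line.rstrip().split(" ")
        vectors[word] = [float(v) for v in values]
    return vectors


def make_embedding_matrix(vocab, vectors, dim):
    """Return one row per vocabulary entry, in vocabulary order.

    Assumption: words without a pretrained vector get an all-zero row.
    """
    return [vectors.get(word, [0.0] * dim) for word in vocab]


# Toy stand-in for a GloVe file with 3-dimensional vectors.
glove_lines = ["better 0.1 0.2 0.3", "role -0.5 0.0 0.5"]
vocab = ["@@PADDING@@", "@@UNKNOWN@@", "better", "role"]

matrix = make_embedding_matrix(vocab, load_vectors(glove_lines), dim=3)
# Rows match the vocabulary size, columns match the vector size.
print(len(matrix), len(matrix[0]))
```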


## Build the model with TensorFlow

### Prepare train dataset

In [9]:
preprocessed = preprocess.transform(r.train_data())

In [10]:
from chariot.feeder import Feeder
from chariot.transformer.formatter import Padding


pad_length = 300
feeder = Feeder({"review": Padding.from_(review_processor, length=pad_length)})
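
`Padding` brings every review to the same length so that a batch can be fed to the model as a fixed-size array. The following is a minimal sketch of the idea. It assumes a padding id of 0 (the vocabulary above lists `@@PADDING@@` first) and that longer sequences are cut at the end; chariot's `Padding` may pad or truncate on a different side, and `pad_sequence` is a hypothetical helper.

```python
def pad_sequence(ids, length, pad_id=0):
    """Truncate `ids` to `length`, or right-pad it with `pad_id`."""
    trimmed = list(ids[:length])
    return trimmed + [pad_id] * (length - len(trimmed))


print(pad_sequence([5, 8, 13], length=5))              # [5, 8, 13, 0, 0]
print(pad_sequence([5, 8, 13, 21, 34, 55], length=5))  # [5, 8, 13, 21, 34]
```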

### Test baseline model

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics


def test_baseline(data, feeder):
    """Fit a TF-IDF + logistic regression baseline and print its accuracy."""
    _data = feeder.transform(data)
    # Join the token ids into a string so TfidfVectorizer can treat each id as a word.
    X = [" ".join(map(str, ids)) for ids in _data["review"]]
    y = _data["polarity"]
    # Hold out 20% of the training data for validation.
    x_train, x_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2)
    vectorizer = TfidfVectorizer()
    x_train_v = vectorizer.fit_transform(x_train)

    classifier = LogisticRegression()
    classifier.fit(x_train_v, y_train)

    predict = classifier.predict(vectorizer.transform(x_valid))
    score = metrics.accuracy_score(y_valid, predict)

    print(score)

test_baseline(preprocessed, feeder)

0.7607142857142857
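
The baseline joins the token ids into a string so that `TfidfVectorizer` can treat each id as a word. One subtlety is that the vectorizer's default `token_pattern` is `(?u)\b\w\w+\b`, which only keeps tokens of two or more characters, so single-digit ids are dropped (including the padding id, if it is 0). The sketch below reproduces that pattern with the standard `re` module; the example ids are made up.

```python
import re

# Default token_pattern of scikit-learn's TfidfVectorizer.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

ids = [0, 0, 7, 12, 345]             # hypothetical padded id sequence
text = " ".join(map(str, ids))       # same joining as in test_baseline
tokens = TOKEN_PATTERN.findall(text)
print(tokens)                        # ['12', '345']
```

Passing `token_pattern=r"(?u)\b\w+\b"` to `TfidfVectorizer` would keep single-character tokens as well.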


### Make model

In [12]:
from tensorflow import keras as K  # public tf.keras API


vocab_size = review_processor.vocabulary.count
embedding_size = 200

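def make_model():
    model = K.Sequential()
    # Intended to mask the padding positions of each padded review.
    model.add(K.layers.Masking(mask_value=review_processor.vocabulary.pad, input_shape=(pad_length,)))
    # Initialize the embedding layer with the pretrained GloVe vectors.
    model.add(K.layers.Embedding(vocab_size, embedding_size, weights=[embedding]))
    # Average the word vectors over the sequence axis.
    model.add(K.layers.Lambda(lambda x: K.backend.mean(x, axis=1)))
    # Single sigmoid unit for binary (positive/negative) classification.
    model.add(K.layers.Dense(1, activation="sigmoid"))
    return model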

model = make_model()
model.compile(loss="binary_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])
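
The network is effectively a bag-of-embeddings classifier: it looks up a vector per token, averages the vectors along the sequence, and feeds the average to a single sigmoid unit. The following plain-Python sketch of that forward pass uses made-up weights; the names `predict_positive`, `embedding_table`, `weights`, and `bias` are illustrative and not taken from the model.

```python
import math


def predict_positive(token_ids, embedding_table, weights, bias):
    """Mean-pool the token embeddings, then apply a sigmoid unit."""
    vectors = [embedding_table[i] for i in token_ids]
    dim = len(weights)
    # Average over the sequence axis (axis=1 in the batched Keras model).
    pooled = [sum(vec[d] for vec in vectors) / len(vectors) for d in range(dim)]
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


embedding_table = [[0.0, 0.0], [1.0, -1.0], [3.0, 1.0]]  # 3 tokens, 2 dimensions
weights, bias = [0.5, -0.5], 0.0

# Tokens 1 and 2 average to [2.0, 0.0], giving a logit of 1.0.
print(predict_positive([1, 2], embedding_table, weights, bias))
```

Because word order is discarded by the average, this architecture is closer to the TF-IDF baseline than to a recurrent or convolutional model.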

### Train the model

In [13]:
adjusted = feeder.transform(preprocessed)
model.fit(adjusted["review"], adjusted["polarity"], batch_size=32,
          validation_split=0.2, epochs=15, verbose=2)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 1120 samples, validate on 280 samples
Epoch 1/15
 - 3s - loss: 0.6925 - acc: 0.5170 - val_loss: 0.6919 - val_acc: 0.5393
Epoch 2/15
 - 2s - loss: 0.6804 - acc: 0.6991 - val_loss: 0.6833 - val_acc: 0.6393
Epoch 3/15
 - 2s - loss: 0.6675 - acc: 0.7812 - val_loss: 0.6792 - val_acc: 0.6429
Epoch 4/15
 - 2s - loss: 0.6529 - acc: 0.7759 - val_loss: 0.6657 - val_acc: 0.7393
Epoch 5/15
 - 2s - loss: 0.6342 - acc: 0.8830 - val_loss: 0.6549 - val_acc: 0.7500
Epoch 6/15
 - 2s - loss: 0.6108 - acc: 0.8848 - val_loss: 0.6451 - val_acc: 0.7500
Epoch 7/15
 - 2s - loss: 0.5818 - acc: 0.9205 - val_loss: 0.6283 - val_acc: 0.7714
Epoch 8/15
 - 3s - loss: 0.5476 - acc: 0.9304 - val_loss: 0.6099 - val_acc: 0.7821
Epoch 9/15
 - 2s - loss: 0.5085 - acc: 0.9420 - val_loss: 0.5892 - val_acc: 0.8000
Epoch 10/15
 - 2s - loss: 0.4656 - acc: 0.9589 - val_loss: 0.5723 - val_acc: 0.7964
Epoch 11/15
 - 3s - loss: 0.4220 - acc: 0.9652 - val_loss: 0.5568 - val_acc: 0.7964
Epoch 12/15
 - 2s - loss: 0.3790 - acc

<tensorflow.python.keras.callbacks.History at 0x259fe59bcc0>
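
In the log, training accuracy keeps climbing while validation accuracy levels off around 0.80, which suggests the model starts to overfit. A small sketch that makes this visible from the logged numbers; the values are copied from the complete epochs above, and the truncated epoch 12 is left out.

```python
# (epoch, training accuracy, validation accuracy) copied from the log above.
history = [
    (1, 0.5170, 0.5393),
    (2, 0.6991, 0.6393),
    (3, 0.7812, 0.6429),
    (4, 0.7759, 0.7393),
    (5, 0.8830, 0.7500),
    (6, 0.8848, 0.7500),
    (7, 0.9205, 0.7714),
    (8, 0.9304, 0.7821),
    (9, 0.9420, 0.8000),
    (10, 0.9589, 0.7964),
    (11, 0.9652, 0.7964),
]

# First epoch that reaches the highest validation accuracy.
best_epoch, _, best_val_acc = max(history, key=lambda row: row[2])

# Gap between training and validation accuracy at the last complete epoch.
_, last_acc, last_val_acc = history[-1]
gap = last_acc - last_val_acc

print("best epoch:", best_epoch, "val_acc:", best_val_acc)
print("train/validation gap at epoch 11:", round(gap, 4))
```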

### Evaluate the model

In [14]:
test_dataset = feeder.transform(preprocess.transform(r.test_data()))

In [15]:
score, acc = model.evaluate(test_dataset["review"], test_dataset["polarity"], batch_size=32)



In [16]:
print("Loss: {}, Accuracy: {}".format(score, acc))

Loss: 0.5201510453224182, Accuracy: 0.7666666674613952
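
`model.evaluate` returns the loss first and then the metrics passed to `compile`, here binary cross-entropy and accuracy. The sketch below shows how these two numbers are defined, using toy predictions. The helper names are hypothetical, and the 0.5 decision threshold is the usual convention for binary accuracy rather than something stated in this notebook.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions on the correct side of the threshold."""
    hits = sum((p > threshold) == bool(y) for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]
print(binary_cross_entropy(y_true, y_prob), binary_accuracy(y_true, y_prob))
```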


## Model & Preprocessor persistence

In [17]:
import os


# Create the output directory if it does not exist yet.
os.makedirs("models", exist_ok=True)

model.save("models/sentiment_model.h5")
preprocess.save("models/movie_preprocess.tar.gz")
feeder.save("models/movie_feeder.tar.gz")
print("save models")

save models


### Load

In [18]:
loaded_preprocess = Preprocess.load("models/movie_preprocess.tar.gz")
loaded_feeder = Feeder.load("models/movie_feeder.tar.gz")

In [19]:
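# Only the preprocessor and feeder are reloaded here; `model` is the in-memory model trained above.
test_dataset = loaded_feeder.transform(loaded_preprocess.transform(r.test_data()))
score, acc = model.evaluate(test_dataset["review"], test_dataset["polarity"], batch_size=32)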



In [20]:
print("Loss: {}, Accuracy: {}".format(score, acc))

Loss: 0.5201510453224182, Accuracy: 0.7666666674613952
