# Movie Review Sentiment Analysis

Let's run sentiment analysis on movie reviews using [chariot](https://github.com/chakki-works/chariot) and [TensorFlow](https://www.tensorflow.org/).

1. Download the [Movie Review Data](https://github.com/chakki-works/chazutsu/tree/master/chazutsu#movie-review-data).
2. Preprocess the review text with chariot.
3. Load the pretrained word embedding [GloVe](https://nlp.stanford.edu/projects/glove/).
4. Build the model with TensorFlow (using `tf.keras`).
5. Train and evaluate the model.

This tutorial requires the following libraries:

* chazutsu
* chakin
* scipy
* scikit-learn
* tensorflow
* h5py


## Prepare the packages

In [1]:
%load_ext autoreload
%autoreload 2


import os
import sys
from pathlib import Path


def set_path():
    if "../" not in sys.path:
        sys.path.append("../")
    root_dir = Path.cwd()
    return root_dir

ROOT_DIR = set_path()

## Download the Movie Review Data

In [2]:
import chazutsu
from chariot.storage import Storage

storage = Storage.setup_data_dir(ROOT_DIR)
r = chazutsu.datasets.MovieReview.polarity().download(storage.data_path("raw"))

Read resource from the existed resource(if you want to retry, set force=True).


In [3]:
r.train_data().head(3)

Unnamed: 0,polarity,review
0,0,"synopsis : an aging master art thief , his sup..."
1,0,"plot : a separated , glamorous , hollywood cou..."
2,0,a friend invites you to a movie . this film wo...


## Preprocess the review text with chariot

### Make preprocessor

In [4]:
import chariot.transformer as ct
from chariot.preprocessor import Preprocessor


review_processor = Preprocessor(
                    tokenizer=ct.Tokenizer("en"),
                    text_transformers=[ct.text.UnicodeNormalizer()],
                    token_transformers=[ct.token.StopwordFilter("en")],
                    indexer=ct.Indexer(min_df=5, max_df=0.5))

review_processor.fit(r.train_data()["review"])

Preprocessor(indexer=Indexer(begin_of_sequence=None, copy=True, count=-1, end_of_sequence=None,
    max_df=0.5, min_df=5, padding=None, unknown=None),
       text_transformers=[UnicodeNormalizer(copy=True, form='NFKC')],
       token_transformers=[StopwordFilter(copy=True, lang='en')],
       tokenizer=Tokenizer(copy=True, lang='en'))

In [5]:
review_processor.indexer.vocab[:10]

['@@PADDING@@',
 '@@UNKNOWN@@',
 '@@BEGIN_OF_SEQUENCE@@',
 '@@END_OF_SEQUENCE@@',
 'big',
 "'re",
 '*',
 'makes',
 'seen',
 'real']
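
The vocabulary above starts with four reserved tokens, followed by the words that survived the `min_df=5, max_df=0.5` filter. The sketch below illustrates that kind of document-frequency filtering in plain Python. It is not chariot's implementation: `build_vocab` and `RESERVED` are hypothetical names, and the semantics (`min_df` as an absolute document count, `max_df` as a fraction of documents) are an assumption borrowed from scikit-learn's vectorizers.

```python
from collections import Counter

# Reserved tokens, in the order shown in the vocabulary output above.
RESERVED = ["@@PADDING@@", "@@UNKNOWN@@",
            "@@BEGIN_OF_SEQUENCE@@", "@@END_OF_SEQUENCE@@"]


def build_vocab(documents, min_df=1, max_df=1.0):
    """Build a vocabulary from tokenized documents, filtered by document frequency.

    Assumed semantics: min_df is a document count, max_df a fraction of documents.
    """
    n_docs = len(documents)
    # Count in how many documents each token appears (not total occurrences).
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    # Sort by descending frequency, then alphabetically, for a stable order.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = [token for token, df in ranked
            if df >= min_df and df / n_docs <= max_df]
    return RESERVED + kept


docs = [
    ["the", "film", "is", "great"],
    ["the", "film", "is", "dull"],
    ["the", "plot", "is", "great"],
    ["the", "cast", "is", "fine"],
]
vocab = build_vocab(docs, min_df=2, max_df=0.5)
print(vocab)
```

In this toy corpus, "the" and "is" appear in every document and are dropped by `max_df`, while words seen in only one document are dropped by `min_df`.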

### Define the preprocessing step

In [6]:
from chariot.preprocess import Preprocess


preprocess = Preprocess({
    "review": review_processor
})


## Load the pretrained word embedding GloVe

In [7]:
_ = storage.chakin(name="GloVe.6B.200d")

In [8]:
embedding = review_processor.indexer.make_embedding(storage.data_path("external/glove.6B.200d.txt"))
print(embedding.shape)

(10361, 200)
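
The shape `(10361, 200)` means one 200-dimensional row per vocabulary entry. As a rough illustration of how such a matrix can be assembled from a GloVe text file (one token per line followed by space-separated floats), here is a minimal pure-Python sketch. `make_embedding_matrix` is a hypothetical helper, not chariot's `make_embedding`, and the small random fallback for tokens missing from the file is an assumption.

```python
import io
import random


def make_embedding_matrix(vocab, glove_file, dim, seed=0):
    """Return one `dim`-sized row per vocabulary entry.

    Tokens found in the GloVe-format file get their pretrained vector;
    the rest keep a small random initialization (an assumed fallback).
    """
    rng = random.Random(seed)
    index = {token: i for i, token in enumerate(vocab)}
    matrix = [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in vocab]
    for line in glove_file:
        parts = line.rstrip().split(" ")
        token, values = parts[0], parts[1:]
        # Only overwrite rows for tokens that are in our vocabulary.
        if token in index and len(values) == dim:
            matrix[index[token]] = [float(v) for v in values]
    return matrix


# A tiny in-memory stand-in for glove.6B.200d.txt (3 dimensions instead of 200).
fake_glove = io.StringIO("big 0.1 0.2 0.3\nseen -0.4 0.5 0.6\n")
vocab = ["@@PADDING@@", "@@UNKNOWN@@", "big", "seen", "real"]
matrix = make_embedding_matrix(vocab, fake_glove, dim=3)
print(len(matrix), len(matrix[0]))
```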


## Build the model with TensorFlow

### Prepare train dataset

In [9]:
preprocessed = preprocess.apply(r.train_data())

In [10]:
from chariot.feeder import Feeder


pad_length = 300
feeder = Feeder({"review": review_processor.indexer.make_padding(length=pad_length)})
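
The feeder above forces every review to exactly `pad_length` token ids so that reviews can be batched. The following sketch shows the general idea. `pad_sequence` is a hypothetical helper; padding on the right, truncating from the end, and using `0` as the padding id are assumptions for illustration and may differ from what chariot does.

```python
def pad_sequence(ids, length, pad_id=0):
    """Truncate or right-pad a list of token ids to exactly `length` items."""
    clipped = ids[:length]
    return clipped + [pad_id] * (length - len(clipped))


short = pad_sequence([7, 8, 9], 5)
long = pad_sequence([7, 8, 9, 10, 11, 12], 5)
print(short)
print(long)
```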

### Test baseline model

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics


def test_baseline(data, feeder):
    _data = feeder.apply(data)
    X = [" ".join(map(str, ids)) for ids in _data["review"]]
    y = _data["polarity"]
    x_train, x_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2)
    vectorizer = TfidfVectorizer()
    x_train_v = vectorizer.fit_transform(x_train)

    classifier = LogisticRegression()
    classifier.fit(x_train_v, y_train)

    predict = classifier.predict(vectorizer.transform(x_valid))
    score = metrics.accuracy_score(y_valid, predict)

    print(score)

test_baseline(preprocessed, feeder)

0.7892857142857143


### Make model

In [12]:
from tensorflow import keras as K  # public import path; tensorflow.python is internal


vocab_size = len(review_processor.indexer.vocab)
embedding_size = 200

def make_model():
    model = K.Sequential()
    model.add(K.layers.Masking(mask_value=review_processor.indexer.pad, input_shape=(pad_length,)))
    model.add(K.layers.Embedding(vocab_size, embedding_size, weights=[embedding]))
    model.add(K.layers.Lambda(lambda x: K.backend.mean(x, axis=1)))
    model.add(K.layers.Dense(1, activation="sigmoid"))
    return model

model = make_model()
model.compile(loss="binary_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])
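
Conceptually, the model above looks up an embedding vector for each token, averages the vectors over the sequence, and passes the result through a single sigmoid unit. The sketch below reproduces that forward pass in plain Python for one review. `predict_positive` and the toy numbers are hypothetical, and the sketch ignores masking of padded positions.

```python
import math


def predict_positive(token_vectors, weights, bias):
    """Average the token vectors, then apply one sigmoid unit."""
    dim = len(weights)
    n = len(token_vectors)
    # Mean pooling over the sequence axis.
    pooled = [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]
    # Dense(1, activation="sigmoid"): weighted sum plus bias, then sigmoid.
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


vectors = [[0.2, -0.1], [0.4, 0.3], [0.0, 0.1]]
p = predict_positive(vectors, weights=[1.0, -1.0], bias=0.0)
print(round(p, 3))
```

Because the pooled vector discards word order, this architecture behaves like a bag-of-words classifier over pretrained embeddings.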

### Train the model

In [13]:
adjusted = feeder.apply(preprocessed)
model.fit(adjusted["review"], adjusted["polarity"], batch_size=32,
          validation_split=0.2, epochs=15, verbose=2)

Train on 1120 samples, validate on 280 samples
Epoch 1/15
 - 1s - loss: 0.6940 - acc: 0.5205 - val_loss: 0.6881 - val_acc: 0.5929
Epoch 2/15
 - 1s - loss: 0.6808 - acc: 0.6786 - val_loss: 0.6794 - val_acc: 0.6786
Epoch 3/15
 - 1s - loss: 0.6677 - acc: 0.7696 - val_loss: 0.6711 - val_acc: 0.7107
Epoch 4/15
 - 1s - loss: 0.6530 - acc: 0.8071 - val_loss: 0.6589 - val_acc: 0.7500
Epoch 5/15
 - 1s - loss: 0.6340 - acc: 0.8223 - val_loss: 0.6488 - val_acc: 0.7393
Epoch 6/15
 - 1s - loss: 0.6110 - acc: 0.8643 - val_loss: 0.6337 - val_acc: 0.7500
Epoch 7/15
 - 1s - loss: 0.5821 - acc: 0.8821 - val_loss: 0.6191 - val_acc: 0.7571
Epoch 8/15
 - 1s - loss: 0.5487 - acc: 0.8902 - val_loss: 0.5971 - val_acc: 0.7857
Epoch 9/15
 - 1s - loss: 0.5115 - acc: 0.9241 - val_loss: 0.5833 - val_acc: 0.7750
Epoch 10/15
 - 1s - loss: 0.4704 - acc: 0.9295 - val_loss: 0.5606 - val_acc: 0.8107
Epoch 11/15
 - 1s - loss: 0.4286 - acc: 0.9500 - val_loss: 0.5381 - val_acc: 0.8179
Epoch 12/15
 - 1s - loss: 0.3864 - acc

<tensorflow.python.keras._impl.keras.callbacks.History at 0x266139bb438>

### Evaluate the model

In [14]:
test_dataset = feeder.apply(preprocess.apply(r.test_data()))

In [15]:
score, acc = model.evaluate(test_dataset["review"], test_dataset["polarity"], batch_size=32)



In [16]:
print("Score: {}, Accuracy: {}".format(score, acc))

Score: 0.5089418419202169, Accuracy: 0.7833333325386047
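
Here `Score` is the loss the model was compiled with (binary cross-entropy) and `Accuracy` is the fraction of reviews classified correctly. A minimal sketch of how both numbers can be computed from predicted probabilities follows. The function names, the `0.5` threshold, and the toy data are assumptions for illustration, not Keras internals.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions on the correct side of the threshold."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]
loss = binary_crossentropy(y_true, y_prob)
acc = accuracy(y_true, y_prob)
print(loss, acc)
```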


## Model & Preprocessor persistence

In [17]:
# Create the output directory if it does not exist yet.
os.makedirs("models", exist_ok=True)

model.save("models/sentiment_model.h5")
preprocess.save("models/movie_preprocess.tar.gz")
feeder.save("models/movie_feeder.tar.gz")
print("save models")

save models


### Load

In [18]:
loaded_preprocess = Preprocess.load("models/movie_preprocess.tar.gz")
loaded_feeder = Feeder.load("models/movie_feeder.tar.gz")

In [19]:
test_dataset = loaded_feeder.apply(loaded_preprocess.apply(r.test_data()))
score, acc = model.evaluate(test_dataset["review"], test_dataset["polarity"], batch_size=32)



In [20]:
print("Score: {}, Accuracy: {}".format(score, acc))

Score: 0.5089418419202169, Accuracy: 0.7833333325386047
