# Movie Review Sentiment Analysis

Let's try sentiment analysis using [chariot](https://github.com/chakki-works/chariot) and [TensorFlow](https://www.tensorflow.org/).

1. Download the [Movie Review Data](https://github.com/chakki-works/chazutsu/tree/master/chazutsu#movie-review-data).
2. Preprocess the review text with chariot.
3. Load the pretrained [GloVe](https://nlp.stanford.edu/projects/glove/) word embeddings.
4. Build the model with TensorFlow (using `tf.keras`).
5. Train & evaluate the model.


## Prepare the packages

In [1]:
%load_ext autoreload
%autoreload 2


import os
import sys
from pathlib import Path


def set_path():
    if "../" not in sys.path:
        sys.path.append("../")
    root_dir = Path.cwd()
    return root_dir

ROOT_DIR = set_path()

## Download the Movie Review Data

In [2]:
import chazutsu
from chariot.storage import Storage

storage = Storage.setup_data_dir(ROOT_DIR)
r = chazutsu.datasets.MovieReview.polarity().download(storage.data_path("raw"))

Read resource from the existed resource(if you want to retry, set force=True).


In [3]:
train_dataset = storage.chazutsu(r).train_dataset
if len(train_dataset.fields) == 0:
    train_dataset.fields = ["polarity", "review"]
train_dataset.to_dataframe().head(3)

|   | polarity | review |
|---|----------|--------|
| 0 | 0 | synopsis : an aging master art thief , his sup... |
| 1 | 0 | plot : a separated , glamorous , hollywood cou... |
| 2 | 0 | a friend invites you to a movie . this film wo... |


## Preprocess the review text with chariot

In [4]:
import chariot.transformer as ct
from chariot.preprocessor import Preprocessor


preprocessor = Preprocessor(
                    tokenizer=ct.Tokenizer("en"),
                    text_transformers=[ct.text.UnicodeNormalizer()],
                    token_transformers=[ct.token.StopwordFilter("en")],
                    indexer=ct.Indexer())

preprocessor.fit(train_dataset.get("review"))

Preprocessor(indexer=Indexer(begin_of_seq=None, copy=True, end_of_seq=None, max_df=1.0, min_df=1,
    padding=None, size=-1, unknown=None),
       n_jobs=1,
       text_transformers=[UnicodeNormalizer(copy=True, form='NFKC')],
       token_transformers=[StopwordFilter(copy=True, lang='en')],
       tokenizer=Tokenizer(copy=True, lang='en'))

In [5]:
preprocessor.indexer.vocab[:10]

['__PAD__',
 '__UNK__',
 '__BOS__',
 '__EOS__',
 'character',
 ';',
 'characters',
 '!',
 'way',
 '--']
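
The vocabulary above starts with four reserved tokens, followed by tokens from the corpus. To illustrate what an indexer of this kind does, here is a minimal pure-Python sketch. It is not chariot's implementation: `build_vocab`, `to_indices`, and the frequency-based ordering are assumptions made for the example.

```python
from collections import Counter

# Reserved tokens, in the same order as the vocabulary printed above.
RESERVED = ["__PAD__", "__UNK__", "__BOS__", "__EOS__"]


def build_vocab(tokenized_docs, min_count=1):
    """Return reserved tokens followed by corpus tokens, most frequent first."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    # Sort by descending frequency, then alphabetically for a stable order.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return RESERVED + [tok for tok, n in ordered if n >= min_count]


def to_indices(tokens, vocab):
    """Map tokens to integer ids, falling back to the __UNK__ id."""
    lookup = {tok: i for i, tok in enumerate(vocab)}
    unk = lookup["__UNK__"]
    return [lookup.get(tok, unk) for tok in tokens]


docs = [["great", "movie", "!"], ["boring", "movie"]]
vocab = build_vocab(docs)
ids = to_indices(["great", "unseen", "movie"], vocab)
print(vocab)
print(ids)
```

A token that was never seen during fitting, such as `"unseen"` here, maps to the `__UNK__` id instead of raising an error.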

## Load the pretrained word embedding GloVe

In [6]:
_ = storage.chakin(name="GloVe.6B.100d")

In [7]:
embedding = preprocessor.indexer.make_embedding(storage.data_path("external/glove.6B.100d.txt"))
print(embedding.shape)

(21646, 100)
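
The printed shape is one row per vocabulary entry and one column per GloVe dimension. The sketch below shows one way such a matrix could be assembled from a GloVe-format text file (a word followed by space-separated floats on each line). The function `make_embedding_matrix` is hypothetical, and the handling of words missing from GloVe (small random values, zeros for padding) is an assumption, not necessarily what `make_embedding` does.

```python
import io
import random


def make_embedding_matrix(vocab, glove_lines, dim, seed=0):
    """Build one row per vocabulary entry from GloVe-format text lines."""
    rng = random.Random(seed)
    vectors = {}
    for line in glove_lines:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]

    matrix = []
    for token in vocab:
        if token in vectors:
            matrix.append(vectors[token])
        elif token == "__PAD__":
            matrix.append([0.0] * dim)  # keep the padding row at zero
        else:
            # Assumption: words missing from GloVe get small random values.
            matrix.append([rng.uniform(-0.05, 0.05) for _ in range(dim)])
    return matrix


# Tiny stand-in for glove.6B.100d.txt, with 3 dimensions instead of 100.
fake_glove = io.StringIO("movie 0.1 0.2 0.3\ngreat 0.4 0.5 0.6\n")
vocab = ["__PAD__", "__UNK__", "movie", "great", "zzz"]
matrix = make_embedding_matrix(vocab, fake_glove, dim=3)
print(len(matrix), len(matrix[0]))
```

Row `i` of the matrix corresponds to index `i` in the vocabulary, which is why the matrix can be passed directly as the initial weights of an embedding layer.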


## Build the model with TensorFlow

In [39]:
# Import Keras through the public TensorFlow API rather than the private tensorflow.python module.
from tensorflow import keras as K


vocab_size = len(preprocessor.indexer.vocab)
embedding_size = 100

model = K.Sequential()
model.add(K.layers.Embedding(vocab_size, embedding_size, weights=[embedding]))
model.add(K.layers.LSTM(embedding_size, dropout=0.5, recurrent_dropout=0.5))
model.add(K.layers.Dense(1, activation="sigmoid"))

model.compile(loss="binary_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])

In [26]:
feed = train_dataset.to_feed(field_transformers={
    "polarity": None,
    "review": preprocessor
})

y_full, X_full = feed.full()  # Get Batch

In [32]:
max_length = 80
y = y_full()
X = X_full.adjust(padding=max_length)
print(y.shape)
print(X.shape)

(1400,)
(1400, 80)
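
Every review ends up with exactly `max_length` token ids, so the batch has a fixed shape. A minimal sketch of that adjustment follows. The helper `pad_sequences` is hypothetical, and padding on the right and truncating from the end are assumptions; chariot's `adjust` may behave differently.

```python
def pad_sequences(sequences, length, pad_id=0):
    """Truncate or right-pad each id sequence to exactly `length` items."""
    padded = []
    for seq in sequences:
        clipped = list(seq[:length])
        padded.append(clipped + [pad_id] * (length - len(clipped)))
    return padded


batch = pad_sequences([[5, 8, 13], [7], [4, 4, 4, 4, 4, 4]], length=4)
print(batch)
```

The default `pad_id` of 0 matches the position of `__PAD__` in the vocabulary shown earlier.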


### Train the model

In [40]:
from sklearn.model_selection import train_test_split


X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2)

model.fit(X_train, y_train,
          batch_size=32,
          epochs=5,
          validation_data=(X_valid, y_valid), verbose=2)

Train on 1120 samples, validate on 280 samples
Epoch 1/5
 - 5s - loss: 0.7006 - acc: 0.5214 - val_loss: 0.6860 - val_acc: 0.5536
Epoch 2/5
 - 3s - loss: 0.6920 - acc: 0.5420 - val_loss: 0.6890 - val_acc: 0.5464
Epoch 3/5
 - 3s - loss: 0.6680 - acc: 0.6071 - val_loss: 0.6820 - val_acc: 0.5464
Epoch 4/5
 - 3s - loss: 0.6571 - acc: 0.6179 - val_loss: 0.6802 - val_acc: 0.5571
Epoch 5/5
 - 3s - loss: 0.6420 - acc: 0.6339 - val_loss: 0.6905 - val_acc: 0.5679


<tensorflow.python.keras._impl.keras.callbacks.History at 0x17868718d68>

### Evaluate the model

In [None]:
test_dataset = storage.chazutsu(r).test_dataset
if len(test_dataset.fields) == 0:
    test_dataset.fields = ["polarity", "review"]

feed = test_dataset.to_feed(field_transformers={
    "polarity": None,
    "review": preprocessor
})

y_test_full, X_test_full = feed.full()  # Get Batch
y_test = y_test_full()
X_test = X_test_full.adjust(padding=max_length)

In [None]:
score, acc = model.evaluate(X_test, y_test, batch_size=32)

In [None]:
print("Score: {}, Accuracy: {}".format(score, acc))
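
With `metrics=["accuracy"]`, `model.evaluate` returns the loss first and the accuracy second, so `score` above is the binary cross-entropy loss. For a sigmoid output, accuracy amounts to thresholding each predicted probability and comparing it with the label. The sketch below shows that calculation in plain Python. The helper `binary_accuracy` and the sample values are made up for illustration.

```python
def binary_accuracy(probabilities, labels, threshold=0.5):
    """Fraction of examples whose thresholded probability matches the label."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(int(pred == y) for pred, y in zip(predictions, labels))
    return correct / len(labels)


probs = [0.9, 0.2, 0.6, 0.4]   # illustrative sigmoid outputs
labels = [1, 0, 0, 0]          # illustrative polarity labels
accuracy = binary_accuracy(probs, labels)
print(accuracy)
```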

## Persist the model and preprocessor

In [None]:
# sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone joblib package.
import joblib


model.save("sentiment_model.h5")
joblib.dump(preprocessor, "sentiment_preprocessor.pkl")