# Movie Review Sentiment Analysis

This notebook walks through sentiment analysis of movie reviews, using [chariot](https://github.com/chakki-works/chariot) for preprocessing and [TensorFlow](https://www.tensorflow.org/) for modeling.

1. Download the [Movie Review Data](https://github.com/chakki-works/chazutsu/tree/master/chazutsu#movie-review-data).
2. Preprocess the review text with chariot.
3. Load the pretrained [GloVe](https://nlp.stanford.edu/projects/glove/) word embeddings.
4. Build the model with TensorFlow (using `tf.keras`).
5. Train and evaluate the model.


## Prepare the packages

In [1]:
%load_ext autoreload
%autoreload 2


import os
import sys
from pathlib import Path


def set_path():
    if "../" not in sys.path:
        sys.path.append("../")
    root_dir = Path.cwd()
    return root_dir

ROOT_DIR = set_path()

## Download the Movie Review Data

In [107]:
import chazutsu
from chariot.storage import Storage

storage = Storage.setup_data_dir(ROOT_DIR)
r = chazutsu.datasets.MovieReview.polarity().download(storage.data_path("raw"))

Read resource from the existed resource(if you want to retry, set force=True).


In [108]:
train_dataset = storage.chazutsu(r).train_dataset
if len(train_dataset.fields) == 0:
    train_dataset.fields = ["polarity", "review"]
train_dataset.to_dataframe().head(3)

|   | polarity | review |
|---|----------|--------|
| 0 | 0 | synopsis : an aging master art thief , his sup... |
| 1 | 0 | plot : a separated , glamorous , hollywood cou... |
| 2 | 0 | a friend invites you to a movie . this film wo... |


## Preprocess the review text with chariot

In [109]:
import chariot.transformer as ct
from chariot.preprocessor import Preprocessor


preprocessor = Preprocessor(
                    tokenizer=ct.Tokenizer("en"),
                    text_transformers=[ct.text.UnicodeNormalizer()],
                    token_transformers=[ct.token.StopwordFilter("en")],
                    indexer=ct.Indexer(min_df=5, max_df=0.5))

preprocessor.fit(train_dataset.get("review"))

Preprocessor(indexer=Indexer(begin_of_seq=None, copy=True, end_of_seq=None, max_df=0.5, min_df=5,
    padding=None, size=-1, unknown=None),
       n_jobs=1,
       text_transformers=[UnicodeNormalizer(copy=True, form='NFKC')],
       token_transformers=[StopwordFilter(copy=True, lang='en')],
       tokenizer=Tokenizer(copy=True, lang='en'))

In [110]:
preprocessor.indexer.vocab[:10]

['__PAD__',
 '__UNK__',
 '__BOS__',
 '__EOS__',
 'big',
 "'re",
 '*',
 'makes',
 'seen',
 'real']
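
The `min_df=5, max_df=0.5` arguments passed to the indexer look like the scikit-learn convention, where an integer is an absolute document count and a float is a fraction of the corpus. The following is a minimal pure-Python sketch under that assumption. `build_vocab` is a hypothetical helper written for illustration, not chariot's implementation, and it reserves only two of the special tokens shown above.

```python
from collections import Counter


def build_vocab(docs, min_df=1, max_df=1.0, reserved=("__PAD__", "__UNK__")):
    """Hypothetical helper: keep tokens whose document frequency is within bounds."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each token once per document
    # Assumption: integers are absolute counts, floats are fractions of the corpus.
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    kept = sorted(t for t, c in df.items() if low <= c <= high)
    return list(reserved) + kept


docs = [
    ["the", "film", "is", "great"],
    ["the", "film", "is", "dull"],
    ["the", "plot", "is", "great"],
    ["the", "cast", "is", "fine"],
]
vocab = build_vocab(docs, min_df=2, max_df=0.5)
print(vocab)
```

In this toy corpus, "the" and "is" appear in every document and are dropped by `max_df`, while words that appear only once are dropped by `min_df`.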

## Load the pretrained GloVe word embeddings

In [111]:
_ = storage.chakin(name="GloVe.6B.200d")

In [112]:
embedding = preprocessor.indexer.make_embedding(storage.data_path("external/glove.6B.200d.txt"))
print(embedding.shape)

(10361, 200)
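
The printed shape means one 200-dimensional row per vocabulary entry. A GloVe text file stores one word per line, followed by its space-separated vector components, so a matrix like this can be assembled roughly as sketched here. `make_embedding_matrix` is a hypothetical helper, and leaving missing words as zero vectors is an assumption; chariot's `make_embedding` may handle them differently.

```python
import io


def make_embedding_matrix(vocab, glove_file, dim):
    """Hypothetical helper: one row per vocabulary entry, zeros when a word is missing."""
    index = {word: i for i, word in enumerate(vocab)}
    matrix = [[0.0] * dim for _ in vocab]
    for line in glove_file:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        # Copy the vector only for words that are in the vocabulary.
        if word in index and len(values) == dim:
            matrix[index[word]] = [float(v) for v in values]
    return matrix


# A tiny in-memory stand-in for glove.6B.200d.txt (3 dimensions instead of 200).
fake_glove = io.StringIO("big 0.1 0.2 0.3\nseen -0.5 0.0 0.5\nunused 1.0 1.0 1.0\n")
vocab = ["__PAD__", "__UNK__", "big", "seen"]
embedding = make_embedding_matrix(vocab, fake_glove, dim=3)
print(len(embedding), len(embedding[0]))
```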


## Build the model with TensorFlow

### Prepare train dataset

In [113]:
feed = train_dataset.to_feed(field_transformers={
    "polarity": None,
    "review": preprocessor
})

y_full, X_full = feed.full()  # Get Batch

In [114]:
max_length = 300
y = y_full()
X = X_full.adjust(padding=max_length)
print(y.shape)
print(X.shape)

(1400,)
(1400, 300)
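
`adjust(padding=max_length)` turns variable-length index sequences into a fixed `(1400, 300)` array. The idea can be sketched in plain Python as below. `pad_sequences` is a hypothetical helper; padding with index 0 (the position of `__PAD__` in the vocabulary above) and padding or truncating at the end are assumptions, since the notebook does not show which side chariot uses.

```python
def pad_sequences(sequences, length, pad_index=0):
    """Hypothetical helper: truncate or right-pad each sequence to `length`."""
    padded = []
    for seq in sequences:
        clipped = list(seq[:length])
        padded.append(clipped + [pad_index] * (length - len(clipped)))
    return padded


batch = pad_sequences([[4, 5, 6], [7], [8, 9, 4, 5, 6, 7]], length=4)
for row in batch:
    print(row)
```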


### Test baseline model

In [115]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics


def test_baseline(X, y):
    # Join each row of token indices into a string so TfidfVectorizer can consume it.
    X_concat = [" ".join([str(i) for i in row]) for row in X]
    x_train, x_valid, y_train, y_valid = train_test_split(X_concat, y, test_size=0.2)
    vectorizer = TfidfVectorizer()
    x_train_v = vectorizer.fit_transform(x_train)

    classifier = LogisticRegression()
    classifier.fit(x_train_v, y_train)

    # Score on the held-out 20% split.
    predict = classifier.predict(vectorizer.transform(x_valid))
    score = metrics.accuracy_score(y_valid, predict)

    print(score)

test_baseline(X, y)

0.7964285714285714
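
One detail of this baseline is worth knowing. `TfidfVectorizer` tokenizes with the default pattern `(?u)\b\w\w+\b`, which only keeps tokens of two or more word characters, so single-digit indices (0 to 9, which here include the special tokens) are silently ignored. The sketch below applies the same regular expression with the standard `re` module to an illustrative index row.

```python
import re

# Default `token_pattern` of scikit-learn's TfidfVectorizer.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

row = [4, 12, 7, 305, 0, 0]  # an illustrative padded index sequence
text = " ".join(str(i) for i in row)
tokens = TOKEN_PATTERN.findall(text)
print(tokens)
```

If the single-digit indices should count as features too, passing `token_pattern=r"(?u)\b\w+\b"` to `TfidfVectorizer` is one way to keep them.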


### Build the model

In [152]:
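from tensorflow import keras as K


vocab_size = len(preprocessor.indexer.vocab)
embedding_size = 200  # must match the GloVe.6B.200d vectors

def make_model():
    model = K.Sequential()
    # Intended to mask padding positions in the input sequences.
    model.add(K.layers.Masking(mask_value=preprocessor.indexer.pad))
    # Initialize the embedding layer with the pretrained GloVe matrix.
    model.add(K.layers.Embedding(vocab_size, embedding_size, weights=[embedding]))
    # Average the word vectors over the sequence axis.
    model.add(K.layers.Lambda(lambda x: K.backend.mean(x, axis=1)))
    # Single sigmoid unit for binary (positive/negative) classification.
    model.add(K.layers.Dense(1, activation="sigmoid"))
    return model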

model = make_model()
model.compile(loss="binary_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])
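
Conceptually, this model looks up an embedding for each token index, averages the vectors over the sequence, and feeds the average to one sigmoid unit. The arithmetic for a single review can be sketched in plain Python as follows. All weights and vectors are toy values chosen for illustration, and the sketch ignores masking, so padding positions would count toward the mean.

```python
import math


def predict(indices, embedding, weights, bias):
    """Average the embedding rows for `indices`, then apply a sigmoid unit."""
    dim = len(embedding[0])
    vectors = [embedding[i] for i in indices]
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    logit = sum(w * m for w, m in zip(weights, mean)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Toy values, chosen only for illustration.
embedding = [[0.0, 0.0], [1.0, -1.0], [3.0, 1.0]]
weights, bias = [0.5, -0.5], 0.0
p = predict([1, 2], embedding, weights, bias)
print(round(p, 4))
```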

### Train the model

In [153]:
from sklearn.model_selection import train_test_split


X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2)

model.fit(X_train, y_train,
          batch_size=32,
          epochs=15,
          validation_data=(X_valid, y_valid), verbose=2)

Train on 1120 samples, validate on 280 samples
Epoch 1/15
 - 2s - loss: 0.6899 - acc: 0.5205 - val_loss: 0.6883 - val_acc: 0.5393
Epoch 2/15
 - 1s - loss: 0.6756 - acc: 0.6089 - val_loss: 0.6828 - val_acc: 0.5500
Epoch 3/15
 - 1s - loss: 0.6605 - acc: 0.7393 - val_loss: 0.6754 - val_acc: 0.6107
Epoch 4/15
 - 1s - loss: 0.6442 - acc: 0.7946 - val_loss: 0.6672 - val_acc: 0.6214
Epoch 5/15
 - 1s - loss: 0.6224 - acc: 0.7946 - val_loss: 0.6591 - val_acc: 0.6250
Epoch 6/15
 - 1s - loss: 0.5984 - acc: 0.8152 - val_loss: 0.6450 - val_acc: 0.6821
Epoch 7/15
 - 1s - loss: 0.5678 - acc: 0.8875 - val_loss: 0.6333 - val_acc: 0.6821
Epoch 8/15
 - 1s - loss: 0.5335 - acc: 0.8973 - val_loss: 0.6146 - val_acc: 0.6893
Epoch 9/15
 - 1s - loss: 0.4941 - acc: 0.9125 - val_loss: 0.6001 - val_acc: 0.7179
Epoch 10/15
 - 1s - loss: 0.4532 - acc: 0.9286 - val_loss: 0.5811 - val_acc: 0.7179
Epoch 11/15
 - 1s - loss: 0.4119 - acc: 0.9357 - val_loss: 0.5690 - val_acc: 0.7250
Epoch 12/15
 - 1s - loss: 0.3718 - acc

<tensorflow.python.keras._impl.keras.callbacks.History at 0x26b58c2f7b8>

### Evaluate the model

In [118]:
test_dataset = storage.chazutsu(r).test_dataset
if len(test_dataset.fields) == 0:
    test_dataset.fields = ["polarity", "review"]

feed = test_dataset.to_feed(field_transformers={
    "polarity": None,
    "review": preprocessor
})

y_test_full, X_test_full = feed.full()  # Get Batch
y_test = y_test_full()
X_test = X_test_full.adjust(padding=max_length)

In [119]:
score, acc = model.evaluate(X_test, y_test, batch_size=32)



In [120]:
print("Score: {}, Accuracy: {}".format(score, acc))

Score: 0.5044107842445373, Accuracy: 0.7866666666666666
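
`model.evaluate` returns the loss followed by the compiled metrics, so "Score" here is the mean binary cross-entropy and "Accuracy" is the fraction of correct predictions. Both quantities can be sketched in plain Python as below, assuming a decision threshold of 0.5. The labels and probabilities are illustrative, not taken from the model above.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions on the correct side of the threshold."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]  # illustrative model outputs
print(binary_crossentropy(y_true, y_prob), accuracy(y_true, y_prob))
```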


## Model & Preprocessor persistence

In [161]:
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23


os.makedirs("models", exist_ok=True)

model.save("models/sentiment_model.h5")
joblib.dump(preprocessor, "models/sentiment_preprocessor.pkl")
print("save models")

save models
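
The fitted preprocessor is persisted so that the same vocabulary can be applied at inference time. The round trip can be sketched with the standard `pickle` module, on which joblib builds. The `state` dictionary and the `encode` function are hypothetical stand-ins for the preprocessor, and an in-memory buffer stands in for the file on disk.

```python
import io
import pickle


def encode(tokens, state):
    """Map tokens to indices, falling back to the unknown index."""
    lookup = {t: i for i, t in enumerate(state["vocab"])}
    return [lookup.get(t, state["unknown"]) for t in tokens]


# Hypothetical fitted state standing in for the preprocessor.
state = {"vocab": ["__PAD__", "__UNK__", "big", "seen"], "unknown": 1}

buffer = io.BytesIO()  # stands in for models/sentiment_preprocessor.pkl
pickle.dump(state, buffer)
buffer.seek(0)
restored = pickle.load(buffer)

tokens = ["seen", "big", "movie"]
print(encode(tokens, state), encode(tokens, restored))
```

The restored object encodes text exactly as the original does, which is the property the evaluation in the next cells relies on.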


### Load the preprocessor

In [162]:
loaded_preprocessor = joblib.load("models/sentiment_preprocessor.pkl") 

In [163]:
feed = test_dataset.to_feed(field_transformers={
    "polarity": None,
    "review": loaded_preprocessor
})

y_test_full, X_test_full = feed.full()  # Get Batch
y_test = y_test_full()
X_test = X_test_full.adjust(padding=max_length)

score, acc = model.evaluate(X_test, y_test, batch_size=32)



In [164]:
print("Score: {}, Accuracy: {}".format(score, acc))

Score: 0.5071771903832754, Accuracy: 0.7916666674613952
