# Movie Review Sentiment Analysis

Let's run sentiment analysis on movie reviews using [chariot](https://github.com/chakki-works/chariot) and [TensorFlow](https://www.tensorflow.org/).

1. Download the [Movie Review Data](https://github.com/chakki-works/chazutsu/tree/master/chazutsu#movie-review-data).
2. Preprocess the review text with chariot.
3. Load the pretrained [GloVe](https://nlp.stanford.edu/projects/glove/) word embeddings.
4. Build the model with TensorFlow (using `tf.keras`).
5. Train and evaluate the model.

This tutorial requires the following libraries:

* chazutsu
* chakin
* scipy
* scikit-learn
* tensorflow
* h5py


## Prepare the packages

In [1]:
%load_ext autoreload
%autoreload 2


import os
import sys
from pathlib import Path


def set_path():
    if "../" not in sys.path:
        sys.path.append("../")
    root_dir = Path.cwd()
    return root_dir

ROOT_DIR = set_path()

## Download the Movie Review Data

In [2]:
import chazutsu
from chariot.storage import Storage

storage = Storage.setup_data_dir(ROOT_DIR)
r = chazutsu.datasets.MovieReview.polarity().download(storage.data_path("raw"))

Read resource from the existed resource(if you want to retry, set force=True).


In [3]:
train_dataset = storage.chazutsu(r).train_dataset
if len(train_dataset.fields) == 0:
    train_dataset.fields = ["polarity", "review"]
train_dataset.to_dataframe().head(3)

|   | polarity | review |
|---|----------|--------|
| 0 | 0 | what hath kevin williamson wrought ? while the... |
| 1 | 1 | note : some may consider portions of the follo... |
| 2 | 0 | in the finale of disney's " mighty joe young ,... |


## Preprocess the review text with chariot

In [4]:
import chariot.transformer as ct
from chariot.preprocessor import Preprocessor


preprocessor = Preprocessor(
                    tokenizer=ct.Tokenizer("en"),
                    text_transformers=[ct.text.UnicodeNormalizer()],
                    token_transformers=[ct.token.StopwordFilter("en")],
                    indexer=ct.Indexer(min_df=5, max_df=0.5))

preprocessor.fit(train_dataset.get("review"))

Preprocessor(indexer=Indexer(begin_of_seq=None, copy=True, end_of_seq=None, max_df=0.5, min_df=5,
    padding=None, size=-1, unknown=None),
       n_jobs=1,
       text_transformers=[UnicodeNormalizer(copy=True, form='NFKC')],
       token_transformers=[StopwordFilter(copy=True, lang='en')],
       tokenizer=Tokenizer(copy=True, lang='en'))

In [5]:
preprocessor.indexer.vocab[:10]

['__PAD__',
 '__UNK__',
 '__BOS__',
 '__EOS__',
 'makes',
 '_',
 'better',
 'real',
 'role',
 'seen']
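To make the `min_df=5, max_df=0.5` arguments more concrete, here is a minimal sketch of vocabulary filtering by document frequency. It is not chariot's implementation: the function name `build_vocab`, the alphabetical ordering, and the rule that an integer bound is an absolute count while a float bound is a proportion of documents (the convention scikit-learn uses) are all assumptions made for illustration.

```python
from collections import Counter

# Reserved entries, matching the first four items of the vocabulary shown above.
SPECIAL_TOKENS = ["__PAD__", "__UNK__", "__BOS__", "__EOS__"]


def build_vocab(tokenized_docs, min_df=1, max_df=1.0):
    """Keep tokens whose document frequency lies within [min_df, max_df].

    Hypothetical helper: assumes an int bound is an absolute document count
    and a float bound is a proportion of the number of documents.
    """
    n_docs = len(tokenized_docs)
    # Count each token once per document (document frequency, not term frequency).
    doc_freq = Counter(token for doc in tokenized_docs for token in set(doc))
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    kept = sorted(t for t, df in doc_freq.items() if low <= df <= high)
    return SPECIAL_TOKENS + kept


docs = [
    ["good", "movie", "great", "cast"],
    ["bad", "movie", "weak", "cast"],
    ["good", "movie", "bad", "ending"],
    ["great", "movie"],
]
vocab = build_vocab(docs, min_df=2, max_df=0.5)
print(vocab)
```

In this toy corpus "movie" appears in every document and is dropped by `max_df`, while "weak" and "ending" appear only once and are dropped by `min_df`.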

## Load the pretrained GloVe word embeddings

In [6]:
_ = storage.chakin(name="GloVe.6B.200d")

In [7]:
embedding = preprocessor.indexer.make_embedding(storage.data_path("external/glove.6B.200d.txt"))
print(embedding.shape)

(10453, 200)
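The resulting matrix has one row per vocabulary entry and one column per embedding dimension. The sketch below shows one way such a matrix could be assembled from a GloVe-style text file (one word per line followed by its vector). The helper `make_embedding_matrix` is hypothetical, and leaving rows at zero for words without a pretrained vector is an assumption; chariot's `make_embedding` may handle missing words differently.

```python
import io

import numpy as np


def make_embedding_matrix(vocab, lines, dim):
    """Build a (len(vocab), dim) matrix; rows without a pretrained vector stay zero."""
    word_to_index = {word: i for i, word in enumerate(vocab)}
    matrix = np.zeros((len(vocab), dim), dtype=np.float32)
    for line in lines:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        index = word_to_index.get(word)
        # Skip words outside the vocabulary and malformed lines.
        if index is not None and len(values) == dim:
            matrix[index] = [float(v) for v in values]
    return matrix


# Tiny in-memory stand-in for a GloVe text file.
fake_glove = io.StringIO(
    "good 0.1 0.2 0.3\n"
    "bad -0.1 -0.2 -0.3\n"
    "unused 0.9 0.9 0.9\n"
)
toy_vocab = ["__PAD__", "__UNK__", "good", "bad", "rare"]
toy_embedding = make_embedding_matrix(toy_vocab, fake_glove, dim=3)
print(toy_embedding.shape)
```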


## Build the model with TensorFlow

### Prepare the training dataset

In [8]:
feed = train_dataset.to_feed(field_transformers={
    "polarity": None,
    "review": preprocessor
})

y_full, X_full = feed.full()  # Get Batch

In [9]:
max_length = 300
y = y_full()
X = X_full.adjust(padding=max_length)
print(y.shape)
print(X.shape)

(1400,)
(1400, 300)
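After adjustment every review is a row of exactly `max_length` indices, which is why `X` has shape `(1400, 300)`. A minimal sketch of that kind of adjustment is shown below. The helper `pad_or_truncate` is hypothetical, and both the right-side padding and the truncation of the tail are assumptions; the source does not show which side chariot's `adjust` pads or cuts.

```python
def pad_or_truncate(sequences, length, pad_index=0):
    """Return sequences cut or right-padded to exactly `length` indices each."""
    adjusted = []
    for seq in sequences:
        clipped = list(seq[:length])                     # drop anything past `length`
        clipped += [pad_index] * (length - len(clipped))  # fill the remainder
        adjusted.append(clipped)
    return adjusted


batch = [[5, 8, 13], [4, 9, 6, 7, 21, 30]]
fixed = pad_or_truncate(batch, length=4)
print(fixed)  # [[5, 8, 13, 0], [4, 9, 6, 7]]
```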


### Test a baseline model

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics


def test_baseline(X, y):
    X_concat = [" ".join([str(i) for i in row]) for row in X]
    x_train, x_valid, y_train, y_valid = train_test_split(X_concat, y, test_size=0.2)
    vectorizer = TfidfVectorizer()
    x_train_v = vectorizer.fit_transform(x_train)

    classifier = LogisticRegression()
    classifier.fit(x_train_v, y_train)

    predict = classifier.predict(vectorizer.transform(x_valid))
    score = metrics.accuracy_score(y_valid, predict)

    print(score)

test_baseline(X, y)

0.775
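The baseline joins each row of word indices into a string, so `TfidfVectorizer` treats every index as a "word". One detail worth knowing: the vectorizer's default `token_pattern` only keeps tokens of two or more word characters, so single-digit indices are discarded. Assuming the padding index is 0, as the vocabulary listing suggests, the padding never reaches the classifier. The sketch below applies that default pattern with the standard `re` module; the example row is made up.

```python
import re

# Default token pattern of scikit-learn's text vectorizers:
# a token is two or more consecutive word characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

row = [0, 0, 7, 12, 345, 12]  # made-up padded row of word indices
text = " ".join(str(i) for i in row)
tokens = re.findall(DEFAULT_TOKEN_PATTERN, text)
print(tokens)  # ['12', '345', '12']
```

The same rule also drops the few real words whose index is below 10, which is a small loss for a quick baseline.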


### Build the model

In [11]:
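# Use the public Keras API rather than the private tensorflow.python package.
from tensorflow import keras as K


vocab_size = len(preprocessor.indexer.vocab)
embedding_size = 200

def make_model():
    model = K.Sequential()
    # Intended to mask positions that hold the padding index.
    model.add(K.layers.Masking(mask_value=preprocessor.indexer.pad))
    # Embedding layer initialized with the pretrained GloVe matrix.
    model.add(K.layers.Embedding(vocab_size, embedding_size, weights=[embedding]))
    # Average the word vectors over the time axis.
    model.add(K.layers.Lambda(lambda x: K.backend.mean(x, axis=1)))
    # A single sigmoid unit outputs the probability of positive polarity.
    model.add(K.layers.Dense(1, activation="sigmoid"))
    return model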

model = make_model()
model.compile(loss="binary_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])
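The model is essentially an averaged bag of word vectors followed by logistic regression. As a rough NumPy sketch of the forward pass (not the Keras implementation): look up each index in the embedding matrix, average over the time axis, then apply one sigmoid unit. The weights here are random toy values, and masking is deliberately ignored, so padding positions contribute to the mean in this sketch.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(token_ids, embedding_matrix, dense_w, dense_b):
    """token_ids: (batch, time) ints -> probabilities of shape (batch, 1)."""
    vectors = embedding_matrix[token_ids]   # (batch, time, dim) lookup
    averaged = vectors.mean(axis=1)         # (batch, dim) mean over time steps
    return sigmoid(averaged @ dense_w + dense_b)


rng = np.random.default_rng(0)
toy_matrix = rng.normal(size=(10, 4))       # toy vocabulary: 10 words, 4 dimensions
toy_w = rng.normal(size=(4, 1))
toy_b = np.zeros(1)
toy_ids = np.array([[2, 5, 7, 0], [3, 3, 0, 0]])

probs = forward(toy_ids, toy_matrix, toy_w, toy_b)
print(probs.shape)  # (2, 1)
```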

### Train the model

In [12]:
from sklearn.model_selection import train_test_split


X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2)

model.fit(X_train, y_train,
          batch_size=32,
          epochs=15,
          validation_data=(X_valid, y_valid), verbose=2)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 1120 samples, validate on 280 samples
Epoch 1/15
 - 2s - loss: 0.6902 - acc: 0.5179 - val_loss: 0.6862 - val_acc: 0.5821
Epoch 2/15
 - 2s - loss: 0.6767 - acc: 0.6875 - val_loss: 0.6792 - val_acc: 0.6393
Epoch 3/15
 - 2s - loss: 0.6639 - acc: 0.6536 - val_loss: 0.6712 - val_acc: 0.6643
Epoch 4/15
 - 2s - loss: 0.6462 - acc: 0.8116 - val_loss: 0.6625 - val_acc: 0.7000
Epoch 5/15
 - 2s - loss: 0.6262 - acc: 0.8500 - val_loss: 0.6515 - val_acc: 0.7107
Epoch 6/15
 - 2s - loss: 0.6009 - acc: 0.8634 - val_loss: 0.6387 - val_acc: 0.7179
Epoch 7/15
 - 2s - loss: 0.5708 - acc: 0.8955 - val_loss: 0.6258 - val_acc: 0.7464
Epoch 8/15
 - 2s - loss: 0.5356 - acc: 0.9071 - val_loss: 0.6103 - val_acc: 0.7536
Epoch 9/15
 - 2s - loss: 0.4969 - acc: 0.9295 - val_loss: 0.5949 - val_acc: 0.7464
Epoch 10/15
 - 2s - loss: 0.4556 - acc: 0.9509 - val_loss: 0.5784 - val_acc: 0.7679
Epoch 11/15
 - 2s - loss: 0.4118 - acc: 0.9616 - val_loss: 0.5633 - val_acc: 0.7643
Epoch 12/15
 - 2s - loss: 0.3707 - acc

<tensorflow.python.keras.callbacks.History at 0x2587d8dc080>

### Evaluate the model

In [13]:
test_dataset = storage.chazutsu(r).test_dataset
if len(test_dataset.fields) == 0:
    test_dataset.fields = ["polarity", "review"]

feed = test_dataset.to_feed(field_transformers={
    "polarity": None,
    "review": preprocessor
})

y_test_full, X_test_full = feed.full()  # Get Batch
y_test = y_test_full()
X_test = X_test_full.adjust(padding=max_length)

In [14]:
score, acc = model.evaluate(X_test, y_test, batch_size=32)



In [15]:
print("Score: {}, Accuracy: {}".format(score, acc))

Score: 0.5046493029594421, Accuracy: 0.7849999992052714
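Here `score` is the loss the model was compiled with (binary cross-entropy) and `acc` is the accuracy metric. The plain-Python sketch below shows how the two quantities are defined, using made-up labels and probabilities; the clipping constant `eps` and the 0.5 decision threshold are assumptions of this sketch.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


toy_labels = [1, 0, 1, 0]
toy_probs = [0.9, 0.2, 0.4, 0.6]
toy_loss = binary_crossentropy(toy_labels, toy_probs)
toy_acc = accuracy(toy_labels, toy_probs)
print("Score: {:.4f}, Accuracy: {}".format(toy_loss, toy_acc))
```

Two of the four toy predictions fall on the wrong side of the threshold, so the accuracy is 0.5 even though the loss also reflects how confident each prediction was.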


## Model & Preprocessor persistence

In [16]:
# sklearn.externals.joblib was removed from scikit-learn; import joblib directly.
import joblib


# Create the output directory if it does not exist yet.
os.makedirs("models", exist_ok=True)

model.save("models/sentiment_model.h5")
joblib.dump(preprocessor, "models/sentiment_preprocessor.pkl")
print("save models")

save models
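Saving the preprocessor next to the model matters because the vocabulary-to-index mapping must be identical at inference time. The sketch below illustrates the save-and-restore round trip with the standard `pickle` module as a stand-in for joblib (which builds on pickle). The `state` dictionary is a hypothetical simplification of a fitted preprocessor, not chariot's actual structure.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical, simplified stand-in for the state of a fitted preprocessor.
state = {"vocab": ["__PAD__", "__UNK__", "good", "bad"], "max_length": 300}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "toy_preprocessor.pkl"
    with open(path, "wb") as f:
        pickle.dump(state, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored vocabulary maps tokens to the same indices as before saving.
token_to_index = {word: i for i, word in enumerate(restored["vocab"])}
unknown_index = token_to_index["__UNK__"]
ids = [token_to_index.get(token, unknown_index) for token in ["good", "plot", "bad"]]
print(ids)  # [2, 1, 3]
```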


### Load

In [17]:
loaded_preprocessor = joblib.load("models/sentiment_preprocessor.pkl") 

In [18]:
feed = test_dataset.to_feed(field_transformers={
    "polarity": None,
    "review": loaded_preprocessor
})

y_test_full, X_test_full = feed.full()  # Get Batch
y_test = y_test_full()
X_test = X_test_full.adjust(padding=max_length)

score, acc = model.evaluate(X_test, y_test, batch_size=32)



In [19]:
print("Score: {}, Accuracy: {}".format(score, acc))

Score: 0.5046493029594421, Accuracy: 0.7849999992052714
