# Sentiment Prediction in IMDB Reviews using an LSTM

[[Notebook](https://github.com/fastestimator/fastestimator/blob/master/apphub/NLP/imdb/imdb.ipynb)] [[TF Implementation](https://github.com/fastestimator/fastestimator/blob/master/apphub/NLP/imdb/imdb_tf.py)] [[Torch Implementation](https://github.com/fastestimator/fastestimator/blob/master/apphub/NLP/imdb/imdb_torch.py)]

In [1]:
import tempfile
import os
import numpy as np
import torch
import torch.nn as nn
import fastestimator as fe
from fastestimator.dataset.data import imdb_review
from fastestimator.op.numpyop.univariate.reshape import Reshape
from fastestimator.op.tensorop.loss import CrossEntropy
from fastestimator.op.tensorop.model import ModelOp, UpdateOp
from fastestimator.trace.io import BestModelSaver
from fastestimator.trace.metric import Accuracy
from fastestimator.backend import load_model

In [2]:
MAX_WORDS = 10000
MAX_LEN = 500
batch_size = 64
epochs = 10
train_steps_per_epoch = None
eval_steps_per_epoch = None

<h2>Building components</h2>

### Step 1: Prepare training & evaluation data and define a `Pipeline`

We load the dataset from `tf.keras.datasets.imdb`, which contains movie reviews and their binary sentiment labels. Each word has been replaced with an integer that specifies the word's frequency rank in the corpus. To ensure that all sequences have the same length, we need to pad the input sequences before defining the `Pipeline`.

In [3]:
train_data, eval_data = imdb_review.load_data(MAX_LEN, MAX_WORDS)
pipeline = fe.Pipeline(train_data=train_data,
                       eval_data=eval_data,
                       batch_size=batch_size,
                       ops=Reshape(1, inputs="y", outputs="y"))
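
To make the padding step concrete, here is a minimal sketch in plain Python. The helper `pad_sequence` is hypothetical and is not part of FastEstimator; padding on the right and truncating overly long inputs are assumptions of this sketch, not a description of what `imdb_review.load_data` does internally.

```python
from typing import List


def pad_sequence(tokens: List[int], max_len: int, pad_value: int = 0) -> List[int]:
    """Right-pad (or truncate) a list of token ids to exactly max_len entries.

    Hypothetical helper for illustration only.
    """
    clipped = tokens[:max_len]
    return clipped + [pad_value] * (max_len - len(clipped))


# Two toy "reviews" of different lengths, already converted to word ids.
reviews = [[12, 7, 256], [5, 9, 33, 41, 8, 2]]
padded = [pad_sequence(r, max_len=5) for r in reviews]
print(padded)
```

After padding, every sample has the same length, so the samples can be stacked into a single batch.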

### Step 2: Create a `model` and FastEstimator `Network`

First, we define the neural network architecture. We then pass the model definition and an optimizer into `fe.build`:

In [4]:
class ReviewSentiment(nn.Module):
    def __init__(self, embedding_size=64, hidden_units=64):
        super().__init__()
        self.embedding = nn.Embedding(MAX_WORDS, embedding_size)
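        # in_channels must match the embedding width, so reuse embedding_size rather than hard-coding 64
        self.conv1d = nn.Conv1d(in_channels=embedding_size, out_channels=32, kernel_size=3, padding=1)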
        self.maxpool1d = nn.MaxPool1d(kernel_size=4)
        self.lstm = nn.LSTM(input_size=125, hidden_size=hidden_units, num_layers=1)
        self.fc1 = nn.Linear(in_features=hidden_units, out_features=250)
        self.fc2 = nn.Linear(in_features=250, out_features=1)

    def forward(self, x):
        x = self.embedding(x)
        x = x.permute((0, 2, 1))
        x = self.conv1d(x)
        x = torch.relu(x)
        x = self.maxpool1d(x)
        output, _ = self.lstm(x)
        x = output[:, -1]  # sequence output of only last timestamp
        x = torch.tanh(x)
        x = self.fc1(x)
        x = torch.relu(x)
        x = self.fc2(x)
        x = torch.sigmoid(x)
        return x
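
The value `input_size=125` in the LSTM follows from the sequence length after the convolution and pooling layers. The sketch below works through that arithmetic using the output-length formula given in the PyTorch documentation for `Conv1d` and `MaxPool1d`. The two helper functions are hypothetical names introduced here for illustration.

```python
def conv1d_out_len(length: int, kernel_size: int, padding: int = 0,
                   stride: int = 1, dilation: int = 1) -> int:
    """Output length of a 1D convolution along the sequence axis."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


def maxpool1d_out_len(length: int, kernel_size: int, padding: int = 0,
                      dilation: int = 1) -> int:
    """Output length of 1D max pooling, with stride equal to kernel_size (the PyTorch default)."""
    stride = kernel_size
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


MAX_LEN = 500  # same value as in the notebook

conv_len = conv1d_out_len(MAX_LEN, kernel_size=3, padding=1)  # padding=1 keeps the length unchanged
pooled_len = maxpool1d_out_len(conv_len, kernel_size=4)       # pooling by 4 divides the length by 4
print(conv_len, pooled_len)
```

The pooled tensor therefore has shape `(batch, 32, 125)`, and its last dimension is what the LSTM receives as `input_size`. Changing `MAX_LEN` would require changing `input_size` accordingly.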

`Network` is the object that defines the whole training graph, including models, loss functions, and optimizers. A `Network` can contain several models and loss functions (e.g., for GANs). `fe.Network` takes a series of operators; in this case the basic `ModelOp`, a loss op, and `UpdateOp` suffice. Note that "y_pred" is the key in the data dictionary that stores the predictions.

In [5]:
model = fe.build(model_fn=lambda: ReviewSentiment(), optimizer_fn="adam")
network = fe.Network(ops=[
    ModelOp(model=model, inputs="x", outputs="y_pred"),
    CrossEntropy(inputs=("y_pred", "y"), outputs="loss"),
    UpdateOp(model=model, loss_name="loss")
])
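
Because the model ends in a single sigmoid unit, the loss here is presumably binary cross-entropy. That is an assumption about how the `CrossEntropy` op handles a one-column prediction, so check the op's documentation to confirm. The plain-Python sketch below shows the quantity being minimized; `binary_cross_entropy` is a hypothetical helper, not the FastEstimator implementation.

```python
import math
from typing import Sequence


def binary_cross_entropy(y_pred: Sequence[float], y_true: Sequence[int], eps: float = 1e-7) -> float:
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, t in zip(y_pred, y_true):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_pred)


# A model that always outputs 0.5 is at chance level; its loss is ln(2).
chance_loss = binary_cross_entropy([0.5, 0.5], [1, 0])
print(round(chance_loss, 4))

# Confident and correct predictions give a much smaller loss.
good_loss = binary_cross_entropy([0.9, 0.1], [1, 0])
print(round(good_loss, 4))
```

A loss near ln(2) ≈ 0.693 corresponds to chance-level predictions on a binary task, which is a useful reference point when reading the training log.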

### Step 3: Prepare `Estimator` and configure the training loop

`Estimator` is the API that wraps the `Pipeline`, `Network`, and other training metadata together. `Estimator` also contains `Traces`, which are similar to Keras callbacks.

In the training loop, we want to measure the validation loss and save the model that has the minimum loss. `BestModelSaver` is a convenient `Trace` to achieve this. Let's also measure accuracy over time using another `Trace`:

In [6]:
model_dir = tempfile.mkdtemp()
traces = [Accuracy(true_key="y", pred_key="y_pred"), BestModelSaver(model=model, save_dir=model_dir)]
estimator = fe.Estimator(network=network,
                         pipeline=pipeline,
                         epochs=epochs,
                         traces=traces,
                         train_steps_per_epoch=train_steps_per_epoch,
                         eval_steps_per_epoch=eval_steps_per_epoch)
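
The idea of keeping the model with the minimum validation loss can be sketched as follows. `BestLossTracker` is a hypothetical class written for illustration and is not how `BestModelSaver` is implemented. The loss values are made up.

```python
from typing import Optional


class BestLossTracker:
    """Remembers the lowest loss seen so far and reports when a new best occurs."""

    def __init__(self) -> None:
        self.best_loss: Optional[float] = None
        self.best_epoch: Optional[int] = None

    def update(self, epoch: int, loss: float) -> bool:
        """Return True if this epoch improved on the best loss (a checkpoint would be written here)."""
        if self.best_loss is None or loss < self.best_loss:
            self.best_loss = loss
            self.best_epoch = epoch
            return True
        return False


tracker = BestLossTracker()
eval_losses = [0.62, 0.48, 0.51, 0.45]  # made-up per-epoch validation losses
saved_epochs = [epoch for epoch, loss in enumerate(eval_losses, start=1) if tracker.update(epoch, loss)]
print(saved_epochs, tracker.best_epoch, tracker.best_loss)
```

In this toy run a checkpoint would be written after epochs 1, 2, and 4, while epoch 3 is skipped because its loss is worse than the best seen so far.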

<h2>Training</h2>

In [7]:
estimator.fit()

    ______           __  ______     __  _                 __            
   / ____/___ ______/ /_/ ____/____/ /_(_)___ ___  ____ _/ /_____  _____
  / /_  / __ `/ ___/ __/ __/ / ___/ __/ / __ `__ \/ __ `/ __/ __ \/ ___/
 / __/ / /_/ (__  ) /_/ /___(__  ) /_/ / / / / / / /_/ / /_/ /_/ / /    
/_/    \__,_/____/\__/_____/____/\__/_/_/ /_/ /_/\__,_/\__/\____/_/     
                                                                        

FastEstimator-Start: step: 1; logging_interval: 100; num_device: 0;
FastEstimator-Train: step: 1; loss: 0.6982045;
FastEstimator-Train: step: 100; loss: 0.69076145; steps/sec: 4.55;
FastEstimator-Train: step: 200; loss: 0.6970146; steps/sec: 5.49;
FastEstimator-Train: step: 300; loss: 0.67406845; steps/sec: 5.6;
FastEstimator-Train: step: 358; epoch: 1; epoch_time: 69.22 sec;
FastEstimator-BestModelSaver: Saved model to /var/folders/lx/drkxftt117gblvgsp1p39rlc0000gn/T/tmpds6dz9wa/model_best_loss.pt
FastEstimator-Eval: step: 358; epoch: 1; accuracy: 0.6826

<h2>Inferencing</h2>

For inferencing, first we have to load the trained model weights. We previously saved model weights corresponding to our minimum loss, and now we will load the weights using `load_model()`:

In [8]:
model_name = 'model_best_loss.pt'
model_path = os.path.join(model_dir, model_name)
load_model(model, model_path)

Let's pick a random sequence from the evaluation data and compare the model's prediction with the ground truth:

In [9]:
selected_idx = np.random.randint(10000)
print("Ground truth is: ", eval_data[selected_idx]['y'])

Ground truth is:  1


Create a data dictionary for inference. The `transform()` method of `Pipeline` and `Network` applies all of their operations to the given data:

In [10]:
infer_data = {"x": eval_data[selected_idx]['x'], "y": eval_data[selected_idx]['y']}
data = pipeline.transform(infer_data, mode="infer")
data = network.transform(data, mode="infer")

Finally, print the inferencing results.

In [11]:
print("Prediction for the input sequence: ", np.array(data["y_pred"])[0][0])

Prediction for the input sequence:  0.91389465
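
The model outputs a sigmoid score between 0 and 1 rather than a hard label. One common way to compare it with the ground truth is to apply a threshold, as in the sketch below. The 0.5 cut-off and the helper name `to_label` are assumptions made for illustration; the sketch also assumes that label 1 denotes a positive review.

```python
def to_label(probability: float, threshold: float = 0.5) -> int:
    """Map a sigmoid score to a binary class label (1 = positive, 0 = negative)."""
    return 1 if probability >= threshold else 0


prediction = 0.91389465  # score printed above
ground_truth = 1         # label printed earlier for the same sample

predicted_label = to_label(prediction)
print(predicted_label, predicted_label == ground_truth)
```

Here the thresholded prediction agrees with the ground-truth label for the selected sample.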


## Using your own dataset
This example can be adapted to any custom dataset for a sequence-to-vector task. The `Pipeline` in this example assumes that each sample is a tokenized sentence (every word represented by an index) with a fixed length (0-padded).

If you already have a tokenized dataset, you can create a `Dataset` class that produces output similar to that of this example. If your dataset is not yet tokenized (i.e., it is still raw text), you can use a `Tokenize` Operator similar to [this example](https://github.com/fastestimator/fastestimator/blob/master/apphub/NLP/named_entity_recognition/bert.ipynb).
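
For a rough idea of what tokenizing raw text into frequency-ranked indices involves, here is a minimal sketch using only the standard library. The functions `build_vocab` and `encode`, the whitespace tokenization, and the reserved ids (0 for padding, 1 for unknown words) are all assumptions of this sketch; they do not reproduce the index convention of the IMDB loader or of FastEstimator's `Tokenize` Operator.

```python
from collections import Counter
from typing import Dict, List


def build_vocab(texts: List[str], max_words: int) -> Dict[str, int]:
    """Assign ids by descending word frequency; ids 0 and 1 are reserved (pad, unknown)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    most_frequent = counts.most_common(max_words - 2)
    return {word: idx for idx, (word, _) in enumerate(most_frequent, start=2)}


def encode(text: str, vocab: Dict[str, int], unk_id: int = 1) -> List[int]:
    """Replace each word with its id, mapping out-of-vocabulary words to unk_id."""
    return [vocab.get(word, unk_id) for word in text.lower().split()]


corpus = ["great movie great acting", "great movie", "dull"]
vocab = build_vocab(corpus, max_words=4)  # keeps only the 2 most frequent words
encoded = encode("great dull movie", vocab)
print(vocab, encoded)
```

The encoded sequences would still need to be 0-padded to a fixed length before they match the input format that the `Pipeline` expects.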