# Instructor Do: RNNs for NLP - Sentiment Analysis

In this activity, students will learn how to define an LSTM RNN model for sentiment analysis using Keras. The activity also introduces the data preparation that LSTM models require for natural language processing.

In [None]:
# Initial imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from pathlib import Path

%matplotlib inline

## The Dataset

The provided data file contains `6878` customer reviews of coffee shops in Austin, Texas. The reviews were taken from Yelp; however, the names of the coffee shops were anonymized for privacy reasons.

The dataset has the following columns:

* `coffee_shop_name`: The anonymized name of the coffee shop.

* `full_review_text`: The customer reviews.

* `sentiment`: The sentiment of each customer's review. `0` - Negative, `1` - Positive.

In [None]:
# Import the dataset


## Data Preprocessing

RNN input requires an array data type. The `full_review_text` column will be transformed into the `X` array and the `sentiment` column into the `y` array.
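
As a minimal sketch of this step, the example below builds a tiny in-memory DataFrame with the same three columns and extracts the two arrays. The DataFrame name `df` and the three sample rows are hypothetical stand-ins for the real reviews file, which is not loaded here.

```python
import pandas as pd

# Hypothetical three-row stand-in for the reviews DataFrame
df = pd.DataFrame(
    {
        "coffee_shop_name": ["Shop A", "Shop B", "Shop C"],
        "full_review_text": [
            "Great espresso and friendly staff",
            "Cold coffee and slow service",
            "Cozy place to study",
        ],
        "sentiment": [1, 0, 1],
    }
)

# .values returns each column as a NumPy array
X = df["full_review_text"].values
y = df["sentiment"].values

print(X.shape, y.shape)
```

Both arrays are one-dimensional at this point: `X` holds raw text and `y` holds the `0`/`1` labels.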

In [None]:
# Creating the X and y vectors
X = 
y = 

To train the RNN model, the text data must be encoded as sequences of integers. This transformation can be done using the following tools from Keras.
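
A minimal sketch of the encoding on two made-up reviews, assuming a TensorFlow 2.x installation where `tensorflow.keras.preprocessing.text.Tokenizer` is available. The sample texts are hypothetical, and `lower=True` (already the default) is spelled out only for clarity.

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Hypothetical sample reviews
texts = [
    "Great coffee, great staff",
    "Bad coffee, but great pastries",
]

# Build the vocabulary from the texts (punctuation is filtered by default)
tokenizer = Tokenizer(lower=True)
tokenizer.fit_on_texts(texts)

# word_index maps each word to an integer, starting at 1,
# with the most frequent words receiving the lowest indices
print(tokenizer.word_index)

# Replace every word with its integer index
sequences = tokenizer.texts_to_sequences(texts)
print(sequences)
```

The encoded sequences have different lengths because the reviews do, which is why a padding step is needed before training.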

In [None]:
# Import Keras modules for data encoding


In [None]:
# Create an instance of the Tokenizer and fit it with the X text data
tokenizer = 

In [None]:
# Print the first five elements of the encoded vocabulary


In [None]:
# Transform the text data to numerical sequences
X_seq = 

# Contrast a sample numerical sequence with its text version


The RNN model requires that all entries of the `X` vector have the same length; the `pad_sequences` function ensures that all integer-encoded reviews have the same size. Each entry in `X` will be truncated to `140` integers, or padded with `0`s if it is shorter.
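
A minimal sketch of the padding behavior, using a length of `5` instead of `140` so the result is easy to read. The two sequences are hypothetical, and `padding="post"` is an assumption made for this illustration; by default `pad_sequences` both pads and truncates at the start of each sequence.

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Hypothetical integer-encoded reviews of different lengths
sequences = [[5, 3, 8], [1, 2, 3, 4, 5, 6, 7]]
max_words = 5

# Short sequences get trailing zeros; long ones are cut down to max_words
# (truncation still happens at the start, which is the default)
X_pad = pad_sequences(sequences, maxlen=max_words, padding="post")
print(X_pad)
```

Every row of the result has exactly `max_words` entries, so the rows stack into a single two-dimensional array.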

In [None]:
# Padding sequences
X_pad = 

Now that the data is encoded, the training, validation, and testing sets will be created.
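
A minimal sketch of the split, assuming a 60/20/20 partition obtained with two calls to `train_test_split`. The proportions, the `random_state` value, and the synthetic arrays are assumptions made for this illustration, not values taken from the activity.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the padded reviews and their labels
X_pad = np.arange(100 * 5).reshape(100, 5)
y = np.array([0, 1] * 50)

# First split: hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X_pad, y, test_size=0.2, random_state=78
)

# Second split: take 25% of the remaining 80% for validation (20% overall)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=78
)

print(X_train.shape, X_val.shape, X_test.shape)
```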

In [None]:
# Creating training, validation, and testing sets
from sklearn.model_selection import train_test_split


## Build and Train the LSTM RNN Model

In this section, a custom LSTM RNN model will be designed in Keras and fitted (trained) using the training data defined above.

These are the steps that will be followed:

* Define the model architecture in Keras.

* Compile the model.

* Fit the model to the training data.
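
The three steps above can be sketched end to end on toy data. The layer sizes, the loss, the optimizer, and the random integer inputs below are all assumptions chosen to keep the example small and fast; they are not the values used in the activity.

```python
import numpy as np
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Toy data: 32 random "reviews" of 10 integer tokens each, with 0/1 labels
rng = np.random.default_rng(0)
vocabulary_size = 50
max_words = 10
X_train = rng.integers(1, vocabulary_size, size=(32, max_words))
y_train = rng.integers(0, 2, size=(32,))

# Step 1: define the architecture
model = Sequential(
    [
        Input(shape=(max_words,)),
        Embedding(vocabulary_size, 8),   # 8-dimensional word vectors
        LSTM(16),                        # 16 LSTM units (hypothetical size)
        Dense(1, activation="sigmoid"),  # probability of positive sentiment
    ]
)

# Step 2: compile with a loss suited to binary classification
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# Step 3: fit on the training data
history = model.fit(X_train, y_train, epochs=1, batch_size=16, verbose=0)
print(sorted(history.history.keys()))
```

A single sigmoid output unit pairs naturally with `binary_crossentropy` because the labels are `0` or `1`.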

### Importing the Keras Modules

To build an LSTM RNN model in Keras, the `Sequential` model is used; however, there are two new types of layers that are needed:

* `Embedding`: It's a type of layer that is used in neural networks to process integer-encoded text data by mapping each word index to a dense vector.

* `LSTM`: It's used to add an LSTM layer to the model.

In [None]:
# Import Keras modules for model creation
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

### Setting Up the Model

The `Embedding` layer takes as a parameter the size of the vocabulary in the text to be processed. The `vocabulary_size` is set to the total number of words in the `tokenizer` dictionary plus `1`. The other parameter this layer needs is `input_length`, which is set to `140` (the `max_words` variable), the value defined for padding the reviews.

The `embedding_size` parameter specifies how many dimensions will be used to represent each word. As a rule of thumb, a multiple of eight can be used; for this demo, tuning the model showed that a value of `64` delivered the best result.
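
A minimal sketch of how these three values shape the `Embedding` layer's output. The three-word `word_index` dictionary is a hypothetical stand-in for `tokenizer.word_index`. The `input_length` argument is left out here as an assumption of this sketch: the layer can infer the sequence length from the batch it receives, and some recent Keras releases flag that argument as deprecated.

```python
import numpy as np
from tensorflow.keras.layers import Embedding

# Hypothetical stand-in for tokenizer.word_index
word_index = {"coffee": 1, "great": 2, "place": 3}

# +1 because word indices start at 1 and index 0 is reserved for padding
vocabulary_size = len(word_index) + 1
max_words = 140
embedding_size = 64

embedding = Embedding(input_dim=vocabulary_size, output_dim=embedding_size)

# Two padded reviews of max_words integers each
batch = np.zeros((2, max_words), dtype="int32")
batch[0, :3] = [2, 1, 3]

# Each integer is replaced by a vector of embedding_size values
vectors = embedding(batch)
print(vectors.shape)
```

The output has one `embedding_size`-dimensional vector per token, so its shape is `(reviews, max_words, embedding_size)`.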

In [None]:
# Model set-up
vocabulary_size = 
max_words = 
embedding_size = 

### Defining the Model's Structure

In [None]:
# Define the LSTM RNN model

# Layer 1

# Layer 2

# Output layer


### Compiling the Model

In [None]:
# Compile the model


In [None]:
# Summarize the model


### Training the Model

In [None]:
# Training the model
batch_size = 1000


### Making Predictions
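
A minimal sketch of turning model output into class labels and comparing them with the real values. The `probabilities` array is a hypothetical stand-in for the output of `model.predict(X_test)`, which for a single sigmoid unit has shape `(n, 1)`; the `0.5` threshold and the column names are assumptions as well.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for model.predict(X_test)
probabilities = np.array([[0.91], [0.12], [0.55], [0.40]])
y_test = np.array([1, 0, 0, 0])

# Probabilities above 0.5 are labeled positive (1), the rest negative (0)
predicted = (probabilities > 0.5).astype("int32").ravel()

# Side-by-side view of real and predicted sentiment
results = pd.DataFrame({"Actual": y_test, "Predicted": predicted})
print(results)

# Fraction of reviews where the prediction matches the real label
accuracy = (results["Actual"] == results["Predicted"]).mean()
print(accuracy)
```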

In [None]:
# Make sentiment predictions


In [None]:
# Create a DataFrame of Real and Predicted values
