<font color='red'>@@@</font>

The RNN cheatsheet: https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks

<font color='red'>@@@</font>

In [1]:
# Since these models can be slow to train, we use a timer to monitor the runtime. The measured time may be a lower bound, since my desktop CPU is fairly fast.
import time  # for stopwatch and sleep
t1 = time.perf_counter()  # track execution time

# Instructor Do: RNNs for NLP - Sentiment Analysis

In this activity, students will learn how to define an LSTM RNN model for sentiment analysis using Keras. It also introduces how to prepare data for LSTM models in natural language processing.

In [2]:
# Initial imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from pathlib import Path

%matplotlib inline

## The Dataset

The provided data file contains `6878` customer reviews of Coffee Shops in Austin, Texas. The reviews were taken from Yelp; however, the names of the Coffee Shops were anonymized for privacy reasons.

The dataset has the following columns:

* `coffee_shop_name`: The anonymized name of the coffee shop.

* `full_review_text`: The customer reviews.

* `sentiment`: The sentiment of each customer's review. `0` - Negative, `1` - Positive.

In [3]:
# Import the dataset
filepath="C:/Users/CS_Knit_tinK_SC/Documents/My Data Sources/121621/02_austin_coffee_shops_reviews.csv"
#file_path = Path("../Resources/austin_coffee_shops_reviews.csv")
reviews_df = pd.read_csv(filepath)
reviews_df.head()

| | coffee_shop_name | full_review_text | sentiment |
|---|---|---|---|
| 0 | Coffee Shop 66 | Love love loved the atmosphere! Every corner o... | 1 |
| 1 | Coffee Shop 66 | Listed in Date Night: Austin, Ambiance in Aust... | 1 |
| 2 | Coffee Shop 66 | Listed in Brunch Spots I loved the eclectic an... | 1 |
| 3 | Coffee Shop 66 | Very cool decor! Good drinks Nice seating How... | 0 |
| 4 | Coffee Shop 66 | They are located within the Northcross mall sh... | 1 |


## Data Preprocessing

RNN input requires an array data type. The `full_review_text` column will be transformed into the `X` array and the `sentiment` column into the `y` array.

In [5]:
# Creating the X and y vectors
X = reviews_df["full_review_text"].values
y = reviews_df["sentiment"].values

To train the RNN model, we need to encode the text data as sequences of integers. This transformation can be done using the following tools from Keras.

In [6]:
reviews_df["sentiment"].values

array([1, 1, 1, ..., 1, 1, 1], dtype=int64)

<font color='red'>@@@</font>

Importing some new libraries!

Keras contains many preprocessing modules: https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing

Here, we use 2 of them...

> Padding is a special form of masking where the masked steps are at the start or the end of a sequence. Padding comes from the need to encode sequence data into contiguous batches: in order to make all sequences in a batch fit a given standard length, it is necessary to pad or truncate some sequences.
>
> `raw_inputs = [[711, 632, 71],
    [73, 8, 3215, 55, 927],
    [83, 91, 1, 645, 1253, 927],
]`
>
> `[[ 711  632   71    0    0    0]
 [  73    8 3215   55  927    0]
 [  83   91    1  645 1253  927]]`

https://www.tensorflow.org/guide/keras/masking_and_padding

Tokenizer: https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer

<font color='red'>@@@</font>

In [7]:
# Import Keras modules for data encoding
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [8]:
# Create an instance of the Tokenizer and fit it with the X text data
tokenizer = Tokenizer(lower=True)
tokenizer.fit_on_texts(X)

<font color='red'>@@@</font>

The `Tokenizer` builds a dictionary, `word_index`, of `word: integer` pairs. The integer is merely an assigned index; lower integers go to more frequent words.

<font color='red'>@@@</font>

In [9]:
# Print the first five elements of the encoded vocabulary
for token in list(tokenizer.word_index)[:5]:
    print(f"word: '{token}', token: {tokenizer.word_index[token]}")

word: 'the', token: 1
word: 'and', token: 2
word: 'a', token: 3
word: 'i', token: 4
word: 'to', token: 5
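
To make the word-to-integer mapping concrete, here is a minimal pure-Python sketch of how a frequency-ranked word index can be built and applied. It is a simplified stand-in, not the Keras implementation: the helper names `build_word_index` and `encode`, the toy reviews, and the alphabetical tie-breaking are assumptions for illustration, and punctuation handling is skipped entirely.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer rank; 1 is the most frequent word, 0 is left unused."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count; ties are broken alphabetically (an assumption of this sketch)
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def encode(text, word_index):
    """Replace each known word with its integer; unknown words are dropped."""
    return [word_index[w] for w in text.lower().split() if w in word_index]

toy_reviews = ["the coffee was great", "the latte was cold", "great coffee"]
word_index = build_word_index(toy_reviews)
print(word_index)
print(encode("the latte was great", word_index))
```

Because the ranks start at `1`, the value `0` stays free to serve as the padding value later on.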


<font color='red'>@@@</font>

Every review is transformed into a numerical sequence.

<font color='red'>@@@</font>

In [10]:
# Transform the text data to numerical sequences
X_seq = tokenizer.texts_to_sequences(X)

# Contrast a sample numerical sequence with its text version
print("**Text comment**")
print(X[0])
print("**Numerical sequence representation**")
print(X_seq[0])

**Text comment**
Love love loved the atmosphere! Every corner of the coffee shop had its own style, and there were swings!!! I ordered the matcha latte, and it was muy fantastico! Ordering and getting my drink were pretty streamlined. I ordered on an iPad, which included all beverage selections that ranged from coffee to wine, desired level of sweetness, and a checkout system. I got my latte within minutes!  I was hoping for a typical heart or feather on my latte, but found myself listing out all the possibilities of what the art may be. Any ideas?
**Numerical sequence representation**
[53, 53, 301, 1, 114, 188, 589, 6, 1, 8, 65, 29, 255, 351, 810, 2, 36, 50, 1138, 4, 125, 1, 511, 69, 2, 11, 10, 5621, 5019, 506, 2, 319, 16, 106, 50, 89, 4562, 4, 125, 21, 58, 1112, 68, 1909, 40, 967, 998, 18, 5020, 43, 8, 5, 416, 3656, 1018, 6, 732, 2, 3, 4563, 1289, 4, 90, 16, 69, 999, 312, 4, 10, 1364, 12, 3, 811, 652, 39, 5622, 21, 16, 69, 17, 302, 474, 4202, 38, 40, 1, 4203, 6, 71, 1, 368, 439, 

**The RNN model requires that all entries of the `X` vector have the same length; the `pad_sequences` function ensures that all integer-encoded reviews have the same size. Each entry in `X` will be truncated to `140` integers, or padded with `0`s if it is shorter.**

In [11]:
# Padding sequences
X_pad = pad_sequences(X_seq, maxlen=140, padding="post")
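
The call above sets `padding="post"` but leaves `truncating` at its Keras default of `"pre"`, so short reviews get zeros appended while long reviews lose their first tokens. The sketch below imitates that behavior in plain Python on the `raw_inputs` example quoted earlier; the function name `pad_post_truncate_pre` is hypothetical and `maxlen=5` is chosen only so that one row gets truncated.

```python
def pad_post_truncate_pre(sequences, maxlen, value=0):
    """Pad short sequences at the end; cut long ones from the front."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                        # keep only the last `maxlen` items
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

raw_inputs = [[711, 632, 71], [73, 8, 3215, 55, 927], [83, 91, 1, 645, 1253, 927]]
for row in pad_post_truncate_pre(raw_inputs, maxlen=5):
    print(row)
# [711, 632, 71, 0, 0]
# [73, 8, 3215, 55, 927]
# [91, 1, 645, 1253, 927]
```

If the opening words of a long review matter more than the closing ones, passing `truncating="post"` to `pad_sequences` keeps the beginning instead.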

Now that the data is encoded, the training and testing sets will be created.

In [12]:
# Creating training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_pad, y, random_state=78)

## Build and Train the LSTM RNN Model

In this section, a custom LSTM RNN model is going to be designed in Keras, and it's going to be fitted (trained) using the training data we defined.

These are the steps that will be followed:

* Define the model architecture in Keras.

* Compile the model.

* Fit the model to the training data.

### Importing the Keras Modules

To build an LSTM RNN model in Keras, the `Sequential` model is used; however, there are two new types of layers that are needed:

* `Embedding`: It's a type of layer that is used in neural networks to process encoded text data.

* `LSTM`: It's used to add an LSTM layer to the model.

In [13]:
# Import Keras modules for model creation
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

<font color='red'>@@@</font>

Embedding: https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding?version=stable

LSTM: https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM?version=stable

<font color='red'>@@@</font>

### Setting Up the Model

**Read below**

The `Embedding` layer takes the size of the vocabulary as its first parameter. `vocabulary_size` is set to the total number of words in the `tokenizer` dictionary plus `1`, because index `0` is reserved for padding and is never assigned to a word. The layer also takes `input_length`, set here to `140` (the `max_words` variable), which is the length the reviews were padded to.

The `embedding_size` parameter specifies how many dimensions are used to represent each word. As a rule of thumb, a multiple of eight can be used; for this demo, setting it to `64` delivered the best result.

In [14]:
# Model set-up
vocabulary_size = len(tokenizer.word_counts.keys()) + 1
max_words = 140
embedding_size = 64  # generally use multiples of 8
vocabulary_size

17051
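
An embedding layer can be pictured as a lookup table with one row per integer index and `embedding_size` numbers per row. The sketch below builds such a table in plain Python to show why the table needs `vocabulary_size` rows (including row `0` for padding). The random initial values and the helper name `embed` are assumptions for illustration; in the real layer the rows are weights learned during training.

```python
import random

vocabulary_size = 17051   # from the cell above: words in the tokenizer plus 1
embedding_size = 64

rng = random.Random(0)
# One row per integer index; row 0 belongs to the padding value 0
embedding_table = [
    [rng.uniform(-0.05, 0.05) for _ in range(embedding_size)]
    for _ in range(vocabulary_size)
]

def embed(sequence):
    """Look up the vector for every integer in an encoded review."""
    return [embedding_table[index] for index in sequence]

encoded_review = [53, 53, 301, 1, 114, 0, 0]   # a few tokens followed by padding
vectors = embed(encoded_review)
print(len(vectors), len(vectors[0]))            # 7 vectors of 64 numbers each
print(vocabulary_size * embedding_size)         # size of the table: 1091264 weights
```

The highest index the tokenizer can produce is `vocabulary_size - 1`, so without the extra row the last word in the dictionary would fall outside the table.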

### Defining the Model's Structure

In [15]:
# Define the LSTM RNN model
model = Sequential()

# Layer 1
# NOTE: This layer processes the integer-encoded sequence of each review comment to create a dense vector that feeds into the next LSTM layer.
model.add(Embedding(vocabulary_size, embedding_size, input_length=max_words))

# Layer 2
# NOTE: units=280 is 2x max_words
# The LSTM transforms the sequence of dense vectors into a single vector that carries info about the entire sequence;
# the `Dense` output layer uses that vector to score sentiments.
model.add(LSTM(units=280))

# NOTE: Adding more LSTMs could improve performance!

# Output layer
model.add(Dense(1, activation="sigmoid"))
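
To give a feel for how an LSTM carries information across a sequence, here is a toy single-unit LSTM step with scalar inputs, written in plain Python. It follows the standard gate equations but is only a sketch: the weight names, the value `0.5` for every weight, and the three-number input sequence are all made up, and the real layer works on 64-dimensional vectors with 280 units.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One time step of a single-unit LSTM with scalar input."""
    i = sigmoid(w["wi"] * x + w["ui"] * h_prev + w["bi"])     # input gate
    f = sigmoid(w["wf"] * x + w["uf"] * h_prev + w["bf"])     # forget gate
    o = sigmoid(w["wo"] * x + w["uo"] * h_prev + w["bo"])     # output gate
    g = math.tanh(w["wg"] * x + w["ug"] * h_prev + w["bg"])   # candidate cell value
    c = f * c_prev + i * g                                    # updated cell state
    h = o * math.tanh(c)                                      # updated hidden state
    return h, c

# Hypothetical weights: every parameter set to 0.5 purely for illustration
weights = {name: 0.5 for name in
           ["wi", "ui", "bi", "wf", "uf", "bf", "wo", "uo", "bo", "wg", "ug", "bg"]}

h, c = 0.0, 0.0
for x in [0.2, -0.1, 0.4]:        # a toy sequence of three scalar inputs
    h, c = lstm_step(x, h, c, weights)
print(h, c)                       # the final h depends on every input seen so far
```

Only the final hidden state is passed on to the `Dense` layer, which is why the LSTM output in this model is a single vector per review rather than one vector per word.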

### Compiling the Model

In [16]:
# Compile the model
model.compile(
    loss="binary_crossentropy",
    optimizer="adam"
)
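
The `binary_crossentropy` loss penalizes predicted probabilities that sit far from the true `0`/`1` label. The following plain-Python sketch computes the mean loss for two made-up sets of predictions; the function, the clipping constant `eps`, and the sample numbers are illustrative assumptions, not the Keras internals.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over a batch of labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)      # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

labels = [1, 0, 1, 1]
confident = [0.9, 0.1, 0.8, 0.95]    # probabilities close to the labels
hesitant = [0.6, 0.4, 0.5, 0.55]     # probabilities close to 0.5
print(binary_crossentropy(labels, confident))
print(binary_crossentropy(labels, hesitant))
```

The confident and correct predictions yield the lower loss, which is the direction the optimizer pushes the model during training.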

In [17]:
# Summarize the model
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 140, 64)           1091264   
                                                                 
 lstm (LSTM)                 (None, 280)               386400    
                                                                 
 dense (Dense)               (None, 1)                 281       
                                                                 
Total params: 1,477,945
Trainable params: 1,477,945
Non-trainable params: 0
_________________________________________________________________
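
The parameter counts in the summary can be reproduced by hand. This short check uses the values from the cells above and the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias per unit.

```python
vocabulary_size, embedding_size, lstm_units = 17051, 64, 280

embedding_params = vocabulary_size * embedding_size
# Four gates, each with input weights, recurrent weights, and one bias per unit
lstm_params = 4 * ((embedding_size + lstm_units) * lstm_units + lstm_units)
dense_params = lstm_units * 1 + 1     # one weight per LSTM unit plus a bias

print(embedding_params, lstm_params, dense_params)        # 1091264 386400 281
print(embedding_params + lstm_params + dense_params)      # 1477945
```

Most of the weights live in the embedding table, so the vocabulary size dominates the size of this model.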


### Training the Model

<font color='red'>@@@</font>

A larger batch size generally shortens training time per epoch, because the weights are updated fewer times.

<font color='red'>@@@</font>
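
As a rough illustration of the batch-size effect, the sketch below counts how many weight updates one epoch needs for two batch sizes. It assumes the file really holds the `6878` reviews stated earlier and that `train_test_split` used its default 25% test share; the comparison value `32` is only an example.

```python
import math

n_reviews = 6878          # dataset size stated in "The Dataset"
test_fraction = 0.25      # assumption: scikit-learn's default when test_size is not given
n_test = math.ceil(n_reviews * test_fraction)
n_train = n_reviews - n_test

for batch_size in (32, 1000):
    steps = math.ceil(n_train / batch_size)
    print(f"batch_size={batch_size}: {steps} weight updates per epoch")
# batch_size=32: 162 weight updates per epoch
# batch_size=1000: 6 weight updates per epoch
```

Fewer updates per epoch usually means faster epochs, though it can also mean the model needs more epochs to reach the same quality.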

In [18]:
# Training the model
batch_size = 1000
model.fit(
    X_train,
    y_train,
    epochs=10,
    batch_size=batch_size,
    verbose=1,
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x1ac12df4e48>

### Making Predictions

In [19]:
# Make sentiment predictions
predicted = (model.predict(X_test[:10]) > 0.5).astype("int32")
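
The sigmoid output layer returns a probability between 0 and 1 for each review, and the comparison with `0.5` turns it into a class label. A plain-Python equivalent of that thresholding step, with made-up probabilities and a hypothetical helper name `to_labels`:

```python
def to_labels(probabilities, threshold=0.5):
    """Convert sigmoid outputs to 0/1 class labels."""
    return [int(p > threshold) for p in probabilities]

sample_probabilities = [0.97, 0.51, 0.50, 0.12]   # made-up model outputs
print(to_labels(sample_probabilities))            # [1, 1, 0, 0]
```

Note that a probability of exactly `0.5` is labeled negative, because the comparison is strict.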

<font color='red'>@@@</font>

`_` is a conventional name for a throwaway variable: a value that has to be unpacked but is not used afterwards.

<font color='red'>@@@</font>

In [20]:
# The table below compares the actual text (not the sequences) from the original DataFrame to the predicted values.
# To get that text, apply train_test_split with the same random state to the original X and keep only X_test_original (the other values are not needed).
_, X_test_original, _, _ = train_test_split(X, y, random_state=78)
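
Re-running the split with the same random state lines the texts up with the padded sequences because the shuffle depends only on the seed and the number of rows, not on the contents. The sketch below shows the idea with a simplified stand-in for `train_test_split`; the helper `split_indices` and the toy data are hypothetical, and scikit-learn's actual shuffling differs in detail.

```python
import random

def split_indices(n_samples, test_fraction=0.25, seed=78):
    """Shuffle row positions with a fixed seed and cut them into train/test."""
    positions = list(range(n_samples))
    random.Random(seed).shuffle(positions)
    n_test = round(n_samples * test_fraction)
    return positions[n_test:], positions[:n_test]

texts = ["review a", "review b", "review c", "review d"]
encoded = [[1, 2], [3, 4], [5, 6], [7, 8]]    # stand-ins for padded sequences

# `_` receives the training rows, which are not needed here
_, test_rows_text = split_indices(len(texts))
_, test_rows_encoded = split_indices(len(encoded))
print(test_rows_text == test_rows_encoded)    # True: same seed, same length, same rows
```

The alignment breaks if the seed changes or if rows are dropped from one of the arrays between the two calls.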

In [21]:
# Create a DataFrame of Real and Predicted values
sentiments = pd.DataFrame({"Text": X_test_original[:10], "Actual": y_test[:10], "Predicted": predicted.ravel()})
sentiments

| | Text | Actual | Predicted |
|---|---|---|---|
| 0 | 3 check-ins Listed in Om Noms Austin Props to ... | 1 | 1 |
| 1 | This my second time at this great coffee shop ... | 1 | 1 |
| 2 | I basically live in this place during semester... | 1 | 1 |
| 3 | A friend introduced me to Alta's last year by ... | 1 | 1 |
| 4 | I think this place makes the best espresso dri... | 1 | 1 |
| 5 | The only place in town I would order my Cortad... | 1 | 1 |
| 6 | Parking is skimp, but luckily there are some s... | 1 | 1 |
| 7 | Awesome new coffee shop in the Domain! It's a ... | 1 | 1 |
| 8 | I came by here in a party of 2 a little before... | 1 | 1 |
| 9 | This is literally my favorite coffee shop in A... | 1 | 1 |
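
A quick tally of the table above, with the two columns copied by hand, shows how many of the ten predictions agree with the actual labels. This is only a sanity check on a tiny sample, not an evaluation of the model.

```python
actual = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]      # "Actual" column above
predicted = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]   # "Predicted" column above

matches = sum(a == p for a, p in zip(actual, predicted))
print(f"{matches}/{len(actual)} correct")                  # 10/10 correct
print("negative reviews in sample:", actual.count(0))      # 0
```

All ten reviews in this sample are positive, so it says nothing yet about how the model handles negative reviews.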


In [22]:
t2 = time.perf_counter()
execution_time = t2 - t1
print("execution time: " + str(round(execution_time, 1)) + " seconds")

execution time: 165.9 seconds
