# Overview 

In this final section, we're going to work with word embeddings and see how we can use them to perform sentiment analysis.

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Embedding, GlobalAveragePooling1D
from keras.datasets import imdb
from keras.optimizers import SGD, RMSprop, Adam
from keras.preprocessing.sequence import pad_sequences
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Data Pre-Processing

Again, we're going to load the training and testing data that have been provided to us and then perform a bit of exploratory analysis to guide some additional pre-processing.

In [None]:
# Get the raw training and testing data
(X_train, y_train), (X_test, y_test) = imdb.load_data()

In [None]:
# The data is in a somewhat unusual format; let's take a look at how it's
# currently formatted and how we'll need to adjust it to be used by Keras
X_train[0]

## Exercise 1

Typically, natural language processing problems are very skewed: a small number of words covers most of the word usage in the data. If this is true, we can shrink the vocabulary without paying a big price in model performance while significantly reducing computation time (similar in spirit to reducing dimensions with PCA). To check this hypothesis, our first exercise for this project will be to look at the distribution of words in the data. Specifically, I would like you to create a histogram displaying the distribution of word usage across the reviews. Because the IMDB word ids are ranked by frequency (a lower id means a more common word), a histogram of the ids shows the skew directly. To do this, you will need to represent the training data as a DataFrame and use this DataFrame to make a histogram.

In [None]:
# Fill in the create_word_df function

# An example of a valid output for this function would look like

# sample_num_vect | words
# -----------------------
# 1               | 15
# 1               | 27
# 1               | 3
# ...

def create_word_df(word_vect, sample_num):
    # Repeat the sample_num an appropriate number of times
    sample_num_vect = np.repeat(sample_num, len(word_vect))
    
    # Return a DataFrame with two columns: the sample_num_vect and words
    return pd.DataFrame({"sample_num_vect": sample_num_vect,
                         "words": word_vect})
    
# Apply create_word_df to each element of the X_train data
word_df = pd.concat(list(map(create_word_df, X_train, range(len(X_train)))),
                    ignore_index=True)

# Using the word_df DataFrame, plot the distribution of words in the data
word_df["words"].hist()
plt.show()
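
Beyond eyeballing the histogram, one way to quantify the skew is to compute how much of the total word usage the most frequent ids cover. The following is a minimal sketch on a tiny made-up corpus; `toy_reviews` and `coverage_of_top_k` are hypothetical names introduced for illustration and are not part of the exercise.

```python
from collections import Counter

def coverage_of_top_k(sequences, k):
    """Fraction of all tokens accounted for by the k most frequent word ids."""
    counts = Counter(word for seq in sequences for word in seq)
    total = sum(counts.values())
    top_k_total = sum(count for _, count in counts.most_common(k))
    return top_k_total / total

# Hypothetical toy data standing in for X_train (lists of integer word ids)
toy_reviews = [[1, 4, 4, 7], [1, 4, 9], [1, 4, 12, 4]]

for k in (1, 2, 5):
    print(k, round(coverage_of_top_k(toy_reviews, k), 2))
```

On the real data, the same function could be called as `coverage_of_top_k(X_train, 5000)` to see what share of all tokens a 5000-word vocabulary retains.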

## Exercise 2

Before we can use a word embedding model, each review (a vector of word ids) needs to have the same length; therefore, we need to look at the distribution of review lengths in the data. Please generate a boxplot displaying the distribution of review lengths.

**HINT:** "count" is an aggregation function that can be used to tell you the number of rows in a group

In [None]:
# We need to group by the sample_num_vect,
# count the number of rows per entry, and then 
# plot this using Pandas
word_df.groupby("sample_num_vect").agg("count").boxplot()
plt.show()
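
The boxplot gives a visual answer; if you want a number to back up the choice of a maximum length, percentiles of the review lengths are one option. This is a rough sketch on made-up data, using only the standard library; `length_percentile` and `toy_reviews` are hypothetical names, not part of the exercise.

```python
import statistics

def length_percentile(sequences, pct):
    """Approximate pct-th percentile (1 to 99) of the sequence lengths."""
    lengths = [len(seq) for seq in sequences]
    # quantiles(n=100) returns the 99 cut points between the percentiles
    cut_points = statistics.quantiles(lengths, n=100, method="inclusive")
    return cut_points[pct - 1]

# Hypothetical toy data: nine short reviews and one very long outlier
toy_reviews = [[0] * n for n in (10, 20, 30, 40, 50, 60, 70, 80, 90, 500)]

print(length_percentile(toy_reviews, 50))
print(length_percentile(toy_reviews, 90))
```

A cutoff near a high percentile keeps most reviews intact while preventing a handful of very long ones from dictating the padded length.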

Now that we have determined an appropriate vocabulary size and vector length for our dataset, we can reload the IMDB data with these constraints. Fortunately, since this is such a commonly used dataset, these options are built into the loading function.

In [None]:
# Get the training and testing data with the new constraints
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=5000, maxlen=500, seed=17)
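
To make the effect of `num_words` concrete, the sketch below mimics the idea of replacing every id at or above the cap with an out-of-vocabulary marker. It assumes the documented default `oov_char=2` of `imdb.load_data`; `cap_vocabulary` and `toy_review` are hypothetical names, not Keras functions or real data.

```python
def cap_vocabulary(sequence, num_words, oov_char=2):
    """Replace every word id >= num_words with the out-of-vocabulary id."""
    return [word if word < num_words else oov_char for word in sequence]

# Hypothetical review containing two rare words (ids 7301 and 5000)
toy_review = [1, 14, 22, 7301, 43, 5000, 4999]
print(cap_vocabulary(toy_review, num_words=5000))
```

The rare words are not deleted; they all collapse onto the same marker id, so the review keeps its original length.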

In [None]:
# Finally we need to pad the sequences to ensure that all of the vectors have the same
# length
X_train = pad_sequences(X_train, maxlen=500)
X_test = pad_sequences(X_test, maxlen=500)
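
With its default settings, `pad_sequences` pads and truncates at the front of each sequence, using 0 as the filler value. The following pure-Python sketch imitates that default behaviour for a single sequence; `pad_pre` is a hypothetical helper written for illustration, not the Keras implementation.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    truncated = sequence[-maxlen:]
    return [value] * (maxlen - len(truncated)) + truncated

print(pad_pre([5, 8, 3], maxlen=5))           # short review: filler goes in front
print(pad_pre([5, 8, 3, 9, 2, 7], maxlen=5))  # long review: the earliest item is dropped
```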

# Word Embeddings

Now we are going to introduce word embeddings in the context of sentiment analysis. An embedding layer maps each word id to a dense vector whose values are learned during training.

## Exercise 3

For this exercise, use the Keras API and the following layers:
- Embedding
- GlobalAveragePooling1D

Build a neural network with the following hyper-parameters:

- Embedding layer with 32 dimensional word vectors
- Default GlobalAveragePooling1D
- One dense layer with 64 units
- Standard SGD optimizer
- Binary cross-entropy loss function
- Train for five epochs

When this is done, evaluate the model out-of-sample (that is, on the test data).

**HINT**: when you have a binary output, the final activation function is "sigmoid" and only has one unit
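
To see why a single sigmoid unit pairs naturally with the binary cross-entropy loss, here is a small sketch of both formulas for one example. The function names are my own and the input value 2.0 is arbitrary; Keras computes the same quantities internally, with extra numerical safeguards.

```python
import math

def sigmoid(z):
    """Squash any real number into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, p_pred):
    """Loss for one example: y_true is 0 or 1, p_pred is in (0, 1)."""
    return -(y_true * math.log(p_pred) + (1 - y_true) * math.log(1 - p_pred))

p = sigmoid(2.0)  # the model is fairly confident the review is positive
print(round(p, 3))
print(round(binary_cross_entropy(1, p), 3))  # small loss if the label is 1
print(round(binary_cross_entropy(0, p), 3))  # large loss if the label is 0
```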

In [None]:
# The trick is to know which arguments to pass to Embedding()
# input_dim => number of words in the vocabulary (5000)
# output_dim => number of dimensions used to represent each word (32)
# input_length => expected length of the input sequences (500)
model = Sequential([
    Embedding(input_dim=5000, output_dim=32, input_length=500),
    GlobalAveragePooling1D(),
    Dense(64, activation="relu"),
    Dense(1, activation="sigmoid")
])

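# Compile the model the same way we've done before, tracking accuracy too
model.compile(optimizer=SGD(), loss="binary_crossentropy", metrics=["accuracy"])

# Fit the model to data
model.fit(X_train, y_train, epochs=5, batch_size=128)

# Evaluate the model out-of-sample, as the exercise asks
test_loss, test_acc = model.evaluate(X_test, y_test)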
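
It can help to see what the first two layers compute. The sketch below is a toy, pure-Python stand-in: an embedding is a table lookup (one row per word id), and global average pooling takes the mean of those rows over the sequence positions. The sizes, the random table, and the name `embed_and_pool` are assumptions for illustration; in Keras the table holds trainable weights.

```python
import random

random.seed(0)
vocab_size, embed_dim = 10, 4  # toy sizes; the exercise uses 5000 and 32

# One row of embed_dim numbers per word id (in Keras these are learned)
embedding_table = [[random.uniform(-1, 1) for _ in range(embed_dim)]
                   for _ in range(vocab_size)]

def embed_and_pool(word_ids):
    """Look up each id's vector, then average over the sequence positions."""
    vectors = [embedding_table[i] for i in word_ids]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

pooled = embed_and_pool([0, 0, 3, 7, 3])  # a padded toy review
print(len(pooled))  # one number per embedding dimension, whatever the review length
```

Two things follow from this: word order is ignored by the averaging, and with the default layer settings the padded positions (id 0) are averaged in along with the real words.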

## Exercise 4

Using the optimization algorithm that was assigned to your group, define and train an embedding model that has the same specifications as before; plot the training history and report the final test error. Remember to type out the neural network rather than just copy-pasting it.

In [None]:
# I'll show the example for the Adam() optimizer; it's very similar for 
# the other ones
model = Sequential([
    Embedding(input_dim=5000, output_dim=32, input_length=500),
    GlobalAveragePooling1D(),
    Dense(64, activation="relu"),
    Dense(1, activation="sigmoid")
])

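# We just have to change from SGD() to Adam() -- that's how easy it is!
model.compile(optimizer=Adam(), loss="binary_crossentropy", metrics=["accuracy"])

# Fit the model to data, keeping the history so we can plot it
history = model.fit(X_train, y_train, epochs=5, batch_size=128)

# Plot the training history (loss and accuracy per epoch)
pd.DataFrame(history.history).plot()
plt.show()

# Report the final test error
test_loss, test_acc = model.evaluate(X_test, y_test)
print("Test error:", 1 - test_acc)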
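
For intuition about what changes when SGD() is swapped for Adam(), here is a sketch of one update step of each for a single weight. It uses the textbook form of Adam with the Keras default hyper-parameters; the function names are my own, and the Keras internals may differ in detail (for example in where epsilon is applied).

```python
import math

def sgd_step(w, grad, lr=0.01):
    """Plain SGD: move against the gradient, scaled by the learning rate."""
    return w - lr * grad

def adam_step(w, grad, m, v, t, lr=0.001, beta_1=0.9, beta_2=0.999, eps=1e-7):
    """One Adam update for a single weight; returns the new (w, m, v)."""
    m = beta_1 * m + (1 - beta_1) * grad       # running mean of gradients
    v = beta_2 * v + (1 - beta_2) * grad ** 2  # running mean of squared gradients
    m_hat = m / (1 - beta_1 ** t)              # bias correction for step t
    v_hat = v / (1 - beta_2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# Toy problem: minimise f(w) = w**2, whose gradient is 2 * w
w_sgd = sgd_step(1.0, grad=2.0)
w_adam, m, v = adam_step(1.0, grad=2.0, m=0.0, v=0.0, t=1)
print(w_sgd, w_adam)
```

Note that the first Adam step has a size close to the learning rate regardless of how large the gradient is, because the update is normalised by the gradient's running magnitude.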

## Exercise 5

In our previous model, we arbitrarily chose to represent each word with a
32-dimensional vector. Let's see how sensitive our model is to that choice:
using a 4-, 128-, or 256-dimensional vector with the Adam optimizer, retrain
the model and compare the results to determine how much this hyper-parameter matters.

In [None]:
# I'll show you how to do this for 128; just play around with the 
# `output_dim` hyper-parameter; we'll still use the Adam() optimizer
model = Sequential([
    Embedding(input_dim=5000, output_dim=128, input_length=500), # just change the output_dim
    GlobalAveragePooling1D(),
    Dense(64, activation="relu"),
    Dense(1, activation="sigmoid")
])

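# Same Adam() optimizer as before; only the embedding size has changed
model.compile(optimizer=Adam(), loss="binary_crossentropy", metrics=["accuracy"])

# Fit the model to data
model.fit(X_train, y_train, epochs=5, batch_size=128)

# Evaluate out-of-sample to compare against the 32-dimensional model
test_loss, test_acc = model.evaluate(X_test, y_test)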
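
One reason the embedding size matters is that it drives the number of trainable weights. As a quick back-of-the-envelope check, the sketch below counts the parameters of the architecture used in these exercises (embedding, average pooling, one 64-unit dense layer, one sigmoid unit); `count_parameters` is a hypothetical helper, and `model.summary()` should report matching totals.

```python
def count_parameters(vocab_size, embed_dim, hidden_units=64):
    """Trainable weights in Embedding -> pooling -> Dense(64) -> Dense(1)."""
    embedding = vocab_size * embed_dim                # one vector per word id
    hidden = embed_dim * hidden_units + hidden_units  # weights plus biases
    output = hidden_units + 1                         # weights plus bias
    return embedding + hidden + output                # pooling has no weights

for dim in (4, 32, 128, 256):
    print(dim, count_parameters(5000, dim))
```

Almost all of the weights sit in the embedding table, so the parameter count grows roughly linearly with the embedding dimension.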