# Neural Networks Part 3:

## Pre-Trained Networks and Word Embeddings and LSTMs, oh my!

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import nltk
from nltk.corpus import stopwords

from sklearn.model_selection import train_test_split

from tensorflow import keras
from tensorflow.keras import layers

from gensim.models import word2vec

In [None]:
# Defining our results visualization function
def visualize_training_results(history):
    '''
    From https://machinelearningmastery.com/display-deep-learning-model-training-history-in-keras/
    
    Input: keras history object (output from trained model)
    '''
    fig, (ax1, ax2) = plt.subplots(2, sharex=True)
    fig.suptitle('Model Results')

    # summarize history for accuracy
    ax1.plot(history.history['accuracy'])
    ax1.plot(history.history['val_accuracy'])
    ax1.set_ylabel('Accuracy')
    ax1.legend(['train', 'test'], loc='upper left')
    # summarize history for loss
    ax2.plot(history.history['loss'])
    ax2.plot(history.history['val_loss'])
    ax2.set_ylabel('Loss')
    ax2.legend(['train', 'test'], loc='upper left')
    
    plt.xlabel('Epoch')
    plt.show()

## First: Pre-Trained Image Classification Model

A pretrained network (for CNNs, the reusable part is also known as the convolutional base) consists of layers that have already been trained, typically on a large, general dataset. For images, these layers have already learned general patterns, textures, colors, etc., so when you feed in your own training data, useful features can be detected immediately. This part is **feature extraction**.

You typically add your own final layers and train them to classify or regress for your problem. Going a step further and unfreezing some of the pretrained layers so that they train too is **fine-tuning**.

Here are the pretrained image classification models that exist within Keras: https://keras.io/api/applications/

To demonstrate the utility of pretrained networks, we'll compare model performance between a baseline model and a model using a pretrained network (VGG19).

### Adding Pretrained Layers

VGG19: https://keras.io/api/applications/vgg/#vgg19-function

In [None]:
# Import from tensorflow.keras to match the other imports in this notebook
from tensorflow.keras.applications import VGG19

In [None]:
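# include_top=False drops VGG19's own 1000-class ImageNet classifier,
# leaving just the convolutional base for us to build on
pretrained = VGG19(weights='imagenet',
                   include_top=False,
                   input_shape=(224, 224, 3))
# May download data at this step, shouldn't take long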

In [None]:
pretrained.summary()

In [None]:
cnn = keras.models.Sequential()
cnn.add(pretrained)

# freezing layers so they don't get retrained with your new data
for layer in cnn.layers:
    layer.trainable=False 

In [None]:
# adding our own dense layers
cnn.add(layers.Flatten())
cnn.add(layers.Dense(132, activation='relu'))
# one sigmoid unit for binary classification (softmax over a single unit always outputs 1)
cnn.add(layers.Dense(1, activation='sigmoid'))
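
A quick aside on the output activation: with a single output unit, softmax normalizes over just one value, so it returns 1.0 no matter what the network computes. A sigmoid gives a usable probability instead. Here is a minimal plain-Python sketch of the difference (the helper functions `softmax` and `sigmoid` are written here for illustration and are not the Keras implementations):

```python
import math

def softmax(logits):
    """Softmax over a list of logits (shifted by the max for stability)."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Logistic sigmoid of a single logit."""
    return 1.0 / (1.0 + math.exp(-z))

for z in (-3.0, 0.0, 3.0):
    # softmax over one value is always [1.0]; sigmoid varies with the logit
    print(z, softmax([z]), round(sigmoid(z), 3))
```

Softmax only becomes informative with two or more output units, one per class.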

In [None]:
cnn.summary()

In [None]:
# to verify that the weights are "frozen" 
for layer in cnn.layers:
    print(layer.name, layer.trainable)

With this you can now compile and fit your model!

## Now: Using Pre-Trained Embeddings and NNs for NLP Tasks

For the most part, following this example: https://stackabuse.com/python-for-nlp-movie-sentiment-analysis-using-deep-learning-in-keras/

Also relevant: https://machinelearningmastery.com/develop-word-embedding-model-predicting-movie-review-sentiment/

Data is: https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

In [None]:
df = pd.read_csv('data/IMDB_Reviews.csv')

In [None]:
df.head()

In [None]:
df.info()

### Pre-Split Preprocessing

Some initial preprocessing can be done before the train/test split.

In [None]:
# Let's check out an example review...
df['review'][2]

We have some HTML tags inside these texts... will want to remove them. But how?

Enter: Regular Expressions (regex).

Testing: https://regexr.com/

In [None]:
# Find the pattern to remove html tags
import re

html_tag_pattern = re.compile(r'<[^>]*>')

test = html_tag_pattern.sub('', df['review'][2])

In [None]:
test

In [None]:
# Apply our pattern to the dataset
df['review'] = df['review'].map(lambda x: re.sub(r'<[^>]*>', '', x))

# Same as
# df['review'] = df['review'].map(lambda x: html_tag_pattern.sub('', x))

In [None]:
# Sanity check
df['review'][2]
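
Why `[^>]*` and not `.*` in the pattern? `.*` is greedy, so it runs from the first `<` all the way to the last `>` in the text and swallows everything in between. Here is a small sketch on a made-up review string (the `sample` text is invented for illustration):

```python
import re

sample = "Great film.<br /><br />Would <b>watch</b> again."

# Greedy: matches from the first '<' to the LAST '>' in the string
greedy = re.sub(r'<.*>', '', sample)

# Negated class: each match stops at the first '>' after its '<'
per_tag = re.sub(r'<[^>]*>', '', sample)

print(greedy)
print(per_tag)
```

Notice that `per_tag` glues "film." and "Would" together, because the `<br />` tags were the only thing separating them. If that matters for your tokens, substituting a space instead of an empty string is one option.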

Let's also remove stopwords

In [None]:
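# If this raises a LookupError, run nltk.download('stopwords') once first
# Using a set makes the membership checks below much faster than a list
stop_words = set(stopwords.words('english'))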

In [None]:
df['review'] = df['review'].apply(lambda x: ' '.join(
    [word for word in x.split() if word not in (stop_words)]))
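
One thing to be aware of: NLTK's English stop word list is all lowercase, and the filter above compares words exactly as they appear, so capitalized stop words like "The" survive. A minimal sketch of the difference, using a tiny hand-written stand-in for the real list (the `stop_words` set and `review` string here are made up for illustration):

```python
# Tiny stand-in for nltk's English stop word list (which is all lowercase)
stop_words = {'the', 'a', 'is', 'this'}

review = "The plot is thin but this cast is great"

# Compare words as written: "The" does not match "the"
case_sensitive = ' '.join(w for w in review.split() if w not in stop_words)

# Lowercase each word only for the comparison
case_insensitive = ' '.join(w for w in review.split() if w.lower() not in stop_words)

print(case_sensitive)
print(case_insensitive)
```

Whether this matters depends on the task; it is shown here only so the behavior is not a surprise.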

We can also preprocess our target variable.

In [None]:
# Create a target map
target_map = {'positive': 1,
              'negative': 0}

In [None]:
# Map it
df['sentiment'] = df['sentiment'].map(target_map)

In [None]:
# Sanity check
df.head()

### Split, and then Post-Split Processing

Now let's perform a train/test split:

In [None]:
# Define our X and y
X = df['review']
y = df['sentiment']

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [None]:
X_train.shape

Now, time to tokenize our text. Going to use keras's tokenizer: https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer

In [None]:
# Showcasing an example to start
# Using .iloc because the split shuffles rows, so label 2 may not be in X_train
X_train.iloc[2]

In [None]:
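# Find our longest review - will need for padding later
# Note: this counts whitespace-separated words, so it only approximates the
# tokenized length; pad_sequences will truncate anything longer than maxlen
max_length = max(len(s.split()) for s in X_train)
max_length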

In [None]:
# Now to tokenize
# Recommend checking out their default values - they're removing punctuation for us!
tokenizer = keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(X_train)

X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)

In [None]:
# Same example, after processing
print(X_train[2])

In [None]:
# Grab the corpus size
# Adding 1 because of reserved 0 index
vocab_size = len(tokenizer.word_index) + 1
print(vocab_size)
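
To make the tokenizer less of a black box, here is a rough plain-Python sketch of the core idea: rank words by frequency, give each one an integer starting at 1, and drop words that were not seen during fitting. The function names `fit_word_index` and `texts_to_sequences` are hypothetical helpers written for this sketch, and it skips the punctuation filtering that the Keras tokenizer also does.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Index 0 is left unused so it can be reserved for padding
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace each known word with its index; unknown words are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

train = ["great movie great cast", "terrible movie"]
word_index = fit_word_index(train)

print(word_index)
print(texts_to_sequences(["great unseen movie"], word_index))
print(len(word_index) + 1)  # vocabulary size, counting the reserved 0 index
```

This is also why the tokenizer is fit on the training set only: words that appear just in the test set are treated as unseen.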

In [None]:
# Now, let's pad so each review is the same length as our longest review
# Basically, adding zeros at the end

X_train = keras.preprocessing.sequence.pad_sequences(
    X_train, maxlen=max_length, padding='post')
X_test = keras.preprocessing.sequence.pad_sequences(
    X_test, maxlen=max_length, padding='post')

In [None]:
X_train[2]
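
Here is a small sketch of what post-padding does, including what happens to a sequence that is longer than `maxlen`. The helper `pad_post` is a hypothetical stand-in written for illustration; it assumes the default truncation behavior, which cuts from the front of the sequence.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad each sequence at the end; keep only the last `maxlen` items if too long."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                             # truncate from the front
        padded.append(seq + [value] * (maxlen - len(seq)))    # pad at the end
    return padded

print(pad_post([[5, 2], [7, 1, 4, 9, 3]], maxlen=4))
```

The short sequence gains trailing zeros, while the long one loses its first element.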

### Pre-Trained Word Embeddings: GloVe (Global Vectors for Word Representation)

The link to download the GloVe files: https://nlp.stanford.edu/projects/glove/

The function and code below come from: https://realpython.com/python-keras-text-classification/#using-pretrained-word-embeddings

In [None]:
def create_embedding_matrix(glove_filepath, word_index, embedding_dim):
    '''
    Grabs the embeddings just for the words in our vocabulary
    
    Inputs:
    glove_filepath - string, location of the glove text file to use
    word_index - word_index attribute from the keras tokenizer
    embedding_dim - int, number of dimensions to embed, a hyperparameter
    
    Output:
    embedding_matrix - numpy array of embeddings
    '''
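    vocab_size = len(word_index) + 1  # Adding 1 again because of reserved 0 index
    embedding_matrix = np.zeros((vocab_size, embedding_dim))

    # GloVe files are UTF-8 text; be explicit so this doesn't depend on the OS default
    with open(glove_filepath, encoding='utf8') as f: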
        for line in f:
            word, *vector = line.split()
            if word in word_index:
                idx = word_index[word] 
                embedding_matrix[idx] = np.array(
                    vector, dtype=np.float32)[:embedding_dim]

    return embedding_matrix

In [None]:
embedding_dim = 50
embedding_matrix = create_embedding_matrix('glove.6B.50d.txt',
                                           tokenizer.word_index, 
                                           embedding_dim)

In [None]:
embedding_matrix
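
Any word in our vocabulary that GloVe doesn't know keeps an all-zero row, so it is worth checking how much of the vocabulary is actually covered. Here is a self-contained sketch of the same parsing logic on three made-up "GloVe" lines, followed by a coverage check (the fake file contents and the small `word_index` are invented for illustration):

```python
import io
import numpy as np

# Three made-up 3-dimensional "GloVe" lines: a word followed by its vector
fake_glove = io.StringIO(
    "movie 0.1 0.2 0.3\n"
    "great 0.4 0.5 0.6\n"
    "zebra 0.7 0.8 0.9\n"
)
word_index = {'great': 1, 'movie': 2, 'meh': 3}  # hypothetical tokenizer.word_index

embedding_dim = 3
matrix = np.zeros((len(word_index) + 1, embedding_dim))
for line in fake_glove:
    word, *vector = line.split()
    if word in word_index:
        matrix[word_index[word]] = np.array(vector, dtype=np.float32)

# Count rows with at least one non-zero entry (row 0 is the padding row)
covered = np.count_nonzero(np.count_nonzero(matrix, axis=1))
print(matrix)
print(f"coverage: {covered / len(word_index):.2f}")
```

The same two lines at the end could be run against the real `embedding_matrix` and `tokenizer.word_index` to see what share of the review vocabulary has a pretrained vector.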

In [None]:
# Time to model!
model = keras.models.Sequential()
model.add(layers.Embedding(vocab_size, embedding_dim, 
                           weights=[embedding_matrix], 
                           input_length=max_length, 
                           trainable=False)) # Note - not retraining the embedding layer
model.add(layers.GlobalMaxPool1D())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()

In [None]:
history = model.fit(X_train, y_train,
                    epochs=10,
                    batch_size=100,
                    validation_data=(X_test, y_test))

In [None]:
score = model.evaluate(X_test, y_test, verbose=0)
print("Test loss:", score[0])
print("Test accuracy:", score[1])

# Visualize results
visualize_training_results(history)

Evaluate:

- 


### Treat Embeddings as Starting Weights, but Allow Training:

In [None]:
# Time to model!
model = keras.models.Sequential()
model.add(layers.Embedding(vocab_size, embedding_dim, 
                           weights=[embedding_matrix], 
                           input_length=max_length, 
                           trainable=True)) # Now it can retrain the embedding layer
model.add(layers.GlobalMaxPool1D())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()

In [None]:
history = model.fit(X_train, y_train,
                    epochs=10,
                    batch_size=100,
                    validation_data=(X_test, y_test))
# Takes about... 3 minutes?

In [None]:
score = model.evaluate(X_test, y_test, verbose=0)
print("Test loss:", score[0])
print("Test accuracy:", score[1])

# Visualize results
visualize_training_results(history)

Evaluate:

- 


### Early Stopping

Patience: the number of consecutive epochs the model can train without improving the monitored metric before training is stopped.

Reference: https://machinelearningmastery.com/how-to-stop-training-deep-neural-networks-at-the-right-time-using-early-stopping/
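
To see how patience plays out, here is a minimal plain-Python sketch of the stopping rule applied to a made-up list of validation losses. The function `stopping_epoch` and the loss values are hypothetical, and the sketch assumes any decrease at all counts as an improvement.

```python
def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch where training would stop, or None if it never stops early."""
    best = float('inf')
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

losses = [0.60, 0.45, 0.40, 0.42, 0.41, 0.39]
print(stopping_epoch(losses, patience=2))
print(stopping_epoch(losses, patience=3))
```

With `patience=2` training halts at epoch 5 and never reaches the better loss at epoch 6, while `patience=3` lets it continue. A small patience saves time but can stop on a temporary plateau.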

In [None]:
# Implement early stopping
from tensorflow.keras.callbacks import EarlyStopping
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=2)

In [None]:
# Combine with a model saving feature, so it saves as it improves
from tensorflow.keras.callbacks import ModelCheckpoint
mc = ModelCheckpoint('best_model.h5', monitor='val_accuracy', mode='max', verbose=1, save_best_only=True)

In [None]:
# Same model as just before this
model.summary()

In [None]:
# Just adding more epochs
history = model.fit(X_train, y_train,
                    epochs=100,
                    batch_size=100,
                    validation_data=(X_test, y_test),
                    callbacks=[es, mc])

### LSTM

Note: at the time of writing, a TensorFlow bug involving newer NumPy releases breaks this section. If you have NumPy 1.20 or later, the code below won't run.

https://github.com/tensorflow/models/issues/9706

In [None]:
np.__version__

In [None]:
model = keras.models.Sequential()
model.add(layers.Embedding(vocab_size, embedding_dim,
                           weights=[embedding_matrix],
                           input_length=max_length,
                           trainable=True))
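# Swapping our previous pooling and dense layers for an LSTM
# Adding some dropout to help prevent overfitting - the LSTM layer offers two ways:
# dropout= (on the inputs, used here) and recurrent_dropout= (on the recurrent state)
model.add(layers.LSTM(embedding_dim,
                      dropout=0.2))
# return_sequences is left at its default (False) so the LSTM emits one vector
# per review, which matches our one-label-per-review target
model.add(layers.Dense(1, activation='sigmoid'))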

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()

### Saving your model

In [None]:
model.save('model.h5')
model.save_weights('model_weights.h5')

In [None]:
from tensorflow.keras.models import load_model

my_model = load_model('model.h5')
my_model.load_weights('model_weights.h5')

In [None]:
# X_test is already a NumPy array after padding, so it has no .values attribute
my_model.evaluate(X_test, y_test)