# NLP: Using Pre-Trained Word Embeddings

In [None]:
# General imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

from tensorflow import keras
from tensorflow.keras import layers

In [None]:
# Defining a results visualization function
def visualize_training_results(history):
    '''
    From https://machinelearningmastery.com/display-deep-learning-model-training-history-in-keras/
    
    Input: keras history object (output from trained model)
    '''
    fig, (ax1, ax2) = plt.subplots(2, sharex=True)
    fig.suptitle('Model Results')

    # summarize history for accuracy
    ax1.plot(history.history['accuracy'])
    ax1.plot(history.history['val_accuracy'])
    ax1.set_ylabel('Accuracy')
    ax1.legend(['train', 'test'], loc='upper left')
    # summarize history for loss
    ax2.plot(history.history['loss'])
    ax2.plot(history.history['val_loss'])
    ax2.set_ylabel('Loss')
    ax2.legend(['train', 'test'], loc='upper left')
    
    plt.xlabel('Epoch')
    plt.show()

## Pre-Trained Networks and Word Embeddings


In [None]:
# More specific imports
import nltk
from nltk.corpus import stopwords

from gensim.models import word2vec

Data is: https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

In [None]:
df = pd.read_csv('data/IMDB_Reviews.csv')

In [None]:
df.head()

In [None]:
df.info()

In [None]:
# Our target value
df['sentiment'].unique()

### Pre-Split Preprocessing

Doing some initial preprocessing that can be done before the train/test split

In [None]:
# Let's check out an example review...
index_num = 15903 # Defining the index number of the review to explore

df['review'].iloc[index_num]

We have some HTML tags inside these texts... will want to remove them. But how?

Enter: Regular Expressions (regex).

Testing: https://regexr.com/

In [None]:
# Find the pattern to remove html tags
import re

html_tag_pattern = re.compile(r'<[^>]*>')

test = html_tag_pattern.sub('', df['review'].iloc[index_num])

In [None]:
test

In [None]:
# Apply our pattern to the dataset
df['review'] = df['review'].map(lambda x: re.sub(r'<[^>]*>', '', x))

# Same as
# df['review'] = df['review'].map(lambda x: html_tag_pattern.sub('', x))

In [None]:
# Sanity check
df['review'].iloc[index_num]
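
To see what the pattern above actually matches, here is a minimal sketch on a made-up string (the `sample` text is hypothetical, not taken from the dataset). The pattern reads as: a `<`, then any run of characters that are not `>`, then a `>`.

```python
import re

# Same pattern as above: "<", then any characters that are not ">", then ">"
html_tag_pattern = re.compile(r'<[^>]*>')

# Hypothetical snippet standing in for a real review
sample = 'Great film.<br /><br />Would <b>definitely</b> watch again.'

cleaned = html_tag_pattern.sub('', sample)
print(cleaned)
```

Note that substituting an empty string joins the text on either side of a tag (`film.Would`). Substituting a single space instead is one alternative if that matters for tokenizing later.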

Let's also remove stopwords

In [None]:
# Run nltk.download('stopwords') first if the corpus is missing; a set makes lookups fast
stop_words = set(stopwords.words('english'))

In [None]:
# Neat bit of code!
df['review'] = df['review'].apply(lambda x: ' '.join(
    [word for word in x.split() if word.lower() not in (stop_words)]))
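
The same filter, pulled out into a function and run on a toy sentence, might look like the sketch below. The mini stopword set and the helper name `remove_stopwords` are assumptions for illustration. The real list comes from NLTK.

```python
# Hypothetical mini stopword set standing in for stopwords.words('english')
stop_words = {'the', 'is', 'a', 'of', 'and'}

def remove_stopwords(text, stop_words):
    """Drop whitespace-separated words whose lowercase form is a stopword."""
    return ' '.join(word for word in text.split() if word.lower() not in stop_words)

print(remove_stopwords('The plot is a mess and the acting is worse', stop_words))
```

Because the text is split on whitespace only, a stopword with punctuation attached (like `the,`) does not match and stays in the review.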

Can also pre-process our target variable

In [None]:
# Create a target map
target_map = {'positive': 1,
              'negative': 0}

In [None]:
# Map it
df['sentiment'] = df['sentiment'].map(target_map)

In [None]:
# Sanity check
df.head()

### Split, and then Post-Split Processing
Now let's perform a train/test split:

In [None]:
# Define our X and y
X = df['review']
y = df['sentiment']

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [None]:
X_train.shape

In [None]:
# Need to find that same review now that the index is shuffled
train_index_num = X_train.index.get_loc(15903)
X_train.iloc[train_index_num]

### Vanilla Text Classification... What Would We Do?

Aka what would this look like without a NN?

In [None]:
# Let's use a TF-IDF vectorizer 
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# What parameters should we set? What steps have we already done, what do we still need to do?
# Already removed stopwords!
vectorizer = TfidfVectorizer(
    max_df=.95,  # ignores words that appear in more than 95% of docs
    min_df=2  # ignores words that appear in fewer than 2 docs
)  

In [None]:
vectorizer.fit(X_train)

X_train_vec = vectorizer.transform(X_train)
X_test_vec = vectorizer.transform(X_test)
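
To make the numbers in the vectorized text less of a black box, here is a rough pure-Python sketch of the calculation. It assumes scikit-learn's defaults as I understand them (raw term counts, smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, then L2 normalization). The toy corpus and the function name `tfidf` are made up for illustration, and this is not the library's actual implementation.

```python
import math
from collections import Counter

# Hypothetical toy corpus; each document is already lowercased
docs = ['great movie great cast', 'terrible movie', 'great soundtrack']

def tfidf(docs):
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Raw count times smoothed inverse document frequency
        weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
                   for t, c in counts.items()}
        # L2-normalize so every document vector has length 1
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

for vec in tfidf(docs):
    print({t: round(w, 3) for t, w in vec.items()})
```

Words that are frequent in one document but rare across the corpus end up with the largest weights.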

#### Explore Our Vectorized Text

In [None]:
# Let's look at that same example again
X_train.iloc[train_index_num]

In [None]:
train_index_num

In [None]:
X_train.loc[X_train.str.contains('CHILDREN\'S MOVIE!!!')]

In [None]:
# Creating a df of tf-idf values, where each column is a word in the vocabulary
tfidf_train_df = pd.DataFrame(X_train_vec.toarray(), 
                              columns=vectorizer.get_feature_names_out(), 
                              index=X_train.index)

In [None]:
# Grabbing that row once it's been vectorized
test_doc = tfidf_train_df.iloc[train_index_num]

test_doc[test_doc > 0].sort_values(ascending=False).head(15) # Showing values > 0

What does this tell you about the word "censure" in this document?

- 


In [None]:
# Now let's model
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()

In [None]:
classifier.fit(X_train_vec, y_train)

classifier.score(X_test_vec, y_test)

Evaluate:

- 


## Moving to NN-Based Text Classification!

#### Different Pre-Processing Steps!

Let's walk through these steps first, then discuss why we didn't just use vectorized text.

Going to use keras's tokenizer: https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer

In [None]:
# Find our longest review - will need for padding later
max_length = max([len(s.split()) for s in X_train])
max_length

In [None]:
# Now to tokenize
# Recommend checking out their default values - they're removing punctuation for us!
tokenizer = keras.preprocessing.text.Tokenizer()

tokenizer.fit_on_texts(X_train)

X_train_token = tokenizer.texts_to_sequences(X_train)
X_test_token = tokenizer.texts_to_sequences(X_test)

In [None]:
# What is this doing? 
#Let's look at the first 10 key-value pairs in the word_index dict

list(tokenizer.word_index.items())[:10]

In [None]:
# Same example, after processing
print(X_train_token[train_index_num])

In [None]:
# Grab the corpus size
# Adding 1 because of reserved 0 index
vocab_size = len(tokenizer.word_index) + 1
print(vocab_size)
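
What is the tokenizer doing under the hood? Roughly: counting words, ranking them by frequency, and swapping each word for its rank. The sketch below imitates that idea in plain Python. The toy texts and the helper `texts_to_sequences` are hypothetical stand-ins, not the Keras code itself.

```python
from collections import Counter

# Hypothetical training texts, already lowercased with punctuation removed
train_texts = ['good movie', 'good good plot', 'bad movie plot twist']

# Count every word, then give the most frequent word index 1, the next index 2, ...
word_counts = Counter(word for text in train_texts for word in text.split())
word_index = {word: i for i, (word, _) in enumerate(word_counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace each known word with its index; words outside the vocabulary are skipped."""
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]

print(word_index)
print(texts_to_sequences(['good twist ending'], word_index))
```

Index 0 is never assigned to a word, which is why the vocabulary size above is `len(word_index) + 1`.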

In [None]:
# Now, let's pad so each review is the same length as our longest review
# Basically, adding zeros at the end

X_train_processed = keras.preprocessing.sequence.pad_sequences(
    X_train_token, maxlen=max_length, padding='post')
X_test_processed = keras.preprocessing.sequence.pad_sequences(
    X_test_token, maxlen=max_length, padding='post')

In [None]:
print(X_train_processed[2])
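
Padding itself is simple enough to sketch by hand. The helper `pad_post` below is a hypothetical stand-in that mimics `padding='post'`, along with what I understand to be the default behavior for sequences that are too long (dropping tokens from the front).

```python
def pad_post(sequences, maxlen, value=0):
    """Pad each sequence at the end with `value`; keep only the last `maxlen` tokens if too long."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # truncate from the front when too long
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

print(pad_post([[7, 2], [4, 9, 1, 3, 8]], maxlen=4))
```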

#### Why Couldn't We Just Use TF-IDF?

In [None]:
# Look at the vectorized text
tfidf_train_df.head()

In [None]:
# Look at the preprocessed text from keras
X_train_processed

What is the difference? Specifically, what are the columns representing in each of these? What are the numbers?

- 


## Using Pre-Trained Word Embeddings in NNs for NLP Tasks

For the most part, following this example: https://stackabuse.com/python-for-nlp-movie-sentiment-analysis-using-deep-learning-in-keras/

Also relevant: https://machinelearningmastery.com/develop-word-embedding-model-predicting-movie-review-sentiment/

### GloVe (Global Vectors for Word Representation)

The link to download the GloVe files: https://nlp.stanford.edu/projects/glove/

> **You will need to download the GloVe embeddings directly, since these files are all too big for github!**

The below function and code comes from: https://realpython.com/python-keras-text-classification/#using-pretrained-word-embeddings

In [None]:
def create_embedding_matrix(glove_filepath, word_index, embedding_dim):
    '''
    Grabs the embeddings just for the words in our vocabulary
    
    Inputs:
    glove_filepath - string, location of the glove text file to use
    word_index - word_index attribute from the keras tokenizer
    embedding_dim - int, number of dimensions to embed, a hyperparameter
    
    Output:
    embedding_matrix - numpy array of embeddings
    '''
    vocab_size = len(word_index) + 1  # Adding 1 again because of the reserved 0 index
    embedding_matrix = np.zeros((vocab_size, embedding_dim))

    with open(glove_filepath, encoding='utf-8') as f:
        for line in f:
            word, *vector = line.split()
            if word in word_index:
                idx = word_index[word] 
                embedding_matrix[idx] = np.array(
                    vector, dtype=np.float32)[:embedding_dim]

    return embedding_matrix

In [None]:
embedding_dim = 50
embedding_matrix = create_embedding_matrix('glove.6B.50d.txt',
                                           tokenizer.word_index, 
                                           embedding_dim)

In [None]:
embedding_matrix
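
Not every word in our vocabulary will have a GloVe vector, and those rows stay all zeros. The sketch below walks through the same lookup on a tiny in-memory "file" and then checks how much of the vocabulary was covered. The 3-dimensional vectors, the toy `word_index`, and the variable names are all made up for illustration.

```python
import io

# Hypothetical 3-dimensional GloVe-style file: one word per line followed by its vector
glove_text = io.StringIO(
    'movie 0.1 0.2 0.3\n'
    'good 0.4 0.5 0.6\n'
    'banana 0.7 0.8 0.9\n'
)
# Hypothetical tokenizer vocabulary; 'zzyzx' has no pre-trained vector
word_index = {'good': 1, 'movie': 2, 'zzyzx': 3}
embedding_dim = 3

# Row 0 stays all zeros because index 0 is reserved for padding
matrix = [[0.0] * embedding_dim for _ in range(len(word_index) + 1)]
for line in glove_text:
    word, *vector = line.split()
    if word in word_index:
        matrix[word_index[word]] = [float(v) for v in vector[:embedding_dim]]

# Share of vocabulary words that received a non-zero pre-trained vector
covered = sum(any(row) for row in matrix[1:])
coverage = covered / len(word_index)
print(matrix)
print(f'coverage: {coverage:.2f}')
```

A low coverage number on the real data would suggest that many tokens (typos, rare names) are being fed to the model as all-zero vectors.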

How is this different from previous preprocessing steps?

- 


In [None]:
# Time to model!
model = keras.models.Sequential()
model.add(layers.Embedding(vocab_size, embedding_dim, 
                           weights=[embedding_matrix], 
                           input_length=max_length, 
                           trainable=False)) # Note - not retraining the embedding layer
model.add(layers.Flatten()) # flattening these layers down before connecting to dense layer
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.summary()
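
The parameter counts in the summary can be checked by hand. Here is a small sketch for this Embedding, Flatten, Dense(1) stack. The sizes passed in are hypothetical, so plug in your own `vocab_size` and `max_length`.

```python
def count_parameters(vocab_size, embedding_dim, max_length):
    """Parameter counts for Embedding -> Flatten -> Dense(1)."""
    embedding = vocab_size * embedding_dim   # one vector per vocabulary index
    dense = max_length * embedding_dim + 1   # one weight per flattened value, plus a bias
    return embedding, dense

# Hypothetical sizes, chosen only for illustration
embedding_params, dense_params = count_parameters(
    vocab_size=10_000, embedding_dim=50, max_length=200)
print(embedding_params, dense_params)
```

With `trainable=False`, the embedding parameters are the ones that show up as non-trainable.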

In [None]:
history = model.fit(X_train_processed, y_train,
                    epochs=10,
                    batch_size=100,
                    validation_data=(X_test_processed, y_test))

In [None]:
score = model.evaluate(X_test_processed, y_test, verbose=0)
print("Test loss:", score[0])
print("Test accuracy:", score[1])

# Visualize results
visualize_training_results(history)

Evaluate:

- 


### Treat Embeddings as Starting Weights, but Allow Training:

In [None]:
# Time to model!
model = keras.models.Sequential()
model.add(layers.Embedding(vocab_size, embedding_dim, 
                           weights=[embedding_matrix], 
                           input_length=max_length, 
                           trainable=True)) # Now it can retrain the embedding layer
model.add(layers.Flatten())
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()

In [None]:
history = model.fit(X_train_processed, y_train,
                    epochs=10,
                    batch_size=100,
                    validation_data=(X_test_processed, y_test))
# Takes about... 3 minutes?

In [None]:
score = model.evaluate(X_test_processed, y_test, verbose=0)
print("Test loss:", score[0])
print("Test accuracy:", score[1])

# Visualize results
visualize_training_results(history)

Evaluate:

- 


### Early Stopping

Patience: how many epochs the model can keep training without improvement before training is stopped

Reference: https://machinelearningmastery.com/how-to-stop-training-deep-neural-networks-at-the-right-time-using-early-stopping/
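
The patience idea can be simulated without training anything. The function below is a simplified sketch of the logic (not the Keras implementation), run on a made-up list of validation losses.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None if it runs to the end."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0  # improvement: remember it and reset the counter
        else:
            wait += 1             # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation losses per epoch
losses = [0.60, 0.50, 0.45, 0.47, 0.46, 0.44]
print(early_stopping_epoch(losses, patience=2))
```

With these numbers, a patience of 2 stops before the better loss in the final epoch is ever seen, while a patience of 3 would have kept going. That trade-off is what the hyperparameter controls.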

In [None]:
# Implement early stopping
from tensorflow.keras.callbacks import EarlyStopping
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=2)

In [None]:
# Combine with a model saving feature, so it saves as it improves
from tensorflow.keras.callbacks import ModelCheckpoint
mc = ModelCheckpoint('best_model.h5', monitor='val_accuracy', mode='max', verbose=1, save_best_only=True)

In [None]:
# Same model as just before this
model.summary()

In [None]:
# Just adding more epochs
history = model.fit(X_train_processed, y_train,
                    epochs=100,
                    batch_size=100,
                    validation_data=(X_test_processed, y_test),
                    callbacks=[es, mc])

# This takes... a while

### LSTM

Note: TensorFlow may still have a bug related to newer NumPy versions. If you have NumPy 1.20+, this might not work.

https://github.com/tensorflow/models/issues/9706

In [None]:
np.__version__

In [None]:
model = keras.models.Sequential()
model.add(layers.Embedding(vocab_size, embedding_dim,
                           weights=[embedding_matrix],
                           input_length=max_length,
                           trainable=True))
# Replacing our Flattening layer with an LSTM
# Adding some dropout to prevent overfitting - `dropout` here, `recurrent_dropout` is the other way
# Keeping the default return_sequences=False so only the final output feeds the Dense layer
model.add(layers.LSTM(embedding_dim, 
                      dropout=0.2))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.summary()

We could do this all here... or we could move over to Kaggle and run this with GPUs!

### Saving your model

In [None]:
model.save('model.h5')
model.save_weights('model_weights.h5')

In [None]:
from tensorflow.keras.models import load_model

my_model = load_model('model.h5')
my_model.load_weights('model_weights.h5')

In [None]:
# X_test_processed is already a numpy array, so it has no .values attribute
my_model.evaluate(X_test_processed, y_test.values)

## Additional Resources

- Sklearn's [Working with Text Data Tutorial](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)

What else can we do with natural language data beyond text classification? 

- [This blog post](https://blog.aureusanalytics.com/blog/5-natural-language-processing-techniques-for-extracting-information) by Aureus Analytics provides an overview of other machine learning techniques used to extract meaning from text: Named Entity Recognition, Sentiment Analysis, Text Summarization, Aspect Mining and Topic Modeling

### Neural Network Vectorizer Resources:

Want another way to embed words for machine learning? Check out Word2Vec - a way of vectorizing text that tries to capture the relationships between words. The image below comes from [this paper](https://arxiv.org/pdf/1310.4546.pdf) by developers at Google, which describes the Skip-gram neural network model that Word2Vec (a tool you can use to implement this model) relies on. You'll note that the distance between each country and its capital city is about the same. That distance actually has meaning, so you can imagine that the difference between `cat` and `kitten` would be about the same as the difference between `dog` and `puppy`. Et cetera!

![screenshot from a paper on the Skip-gram model from developers at Google, https://arxiv.org/pdf/1310.4546.pdf](images/Fig2-DsitributedRepresentationsOfWordsAndPhrasesAndTheirCompositionality.png)
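
The `cat`/`kitten` idea can be sketched with vector arithmetic. The 2-D vectors below are invented so that the "young animal" offset is identical for both pairs. Real embeddings have many more dimensions and only approximate this.

```python
import math

# Hypothetical 2-D vectors invented for illustration
vectors = {
    'cat':    [1.0, 1.0],
    'kitten': [1.0, 2.0],
    'dog':    [3.0, 1.0],
    'puppy':  [3.0, 2.0],
    'car':    [5.0, 0.0],
}

def cosine(a, b):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# kitten - cat + dog should land near puppy if the offset is shared
target = [k - c + d for k, c, d in
          zip(vectors['kitten'], vectors['cat'], vectors['dog'])]
candidates = {w: v for w, v in vectors.items() if w not in ('kitten', 'cat', 'dog')}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(best)
```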

- [Pathmind's A.I. Wiki - A Beginner's Guide to Word2Vec](https://wiki.pathmind.com/word2vec)
- [Chris McCormick's Word2Vec Tutorial](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)
- [What is the difference between Word2Vec and GloVe?](https://machinelearninginterview.com/topics/natural-language-processing/what-is-the-difference-between-word2vec-and-glove/)