# AML Week 2, Lecture 1: Preparing Text for Deep NLP Models (TextVectorization)

## Learning Objectives

- How to create a train-val-test split for TensorFlow datasets from an existing train-test split.
- How to use a Keras TextVectorization layer.
- Demonstrate how TensorFlow models use sequences with Embedding layers.


In [None]:
# Adding parent directory to python path
import os, sys
sys.path.append(os.path.abspath("../"))

In [None]:
## Load the autoreload extension
%load_ext autoreload 
%autoreload 2

import demo  as fn


## Data

In [None]:
from IPython.display import display, Markdown
with open("../Data-AmazonReviews/Amazon Product Reviews.md") as f:
    display(Markdown(f.read()))

In [None]:
import tensorflow as tf
import numpy as np
# Then Set Random Seeds
tf.keras.utils.set_random_seed(42)
tf.random.set_seed(42)
np.random.seed(42)
# Then run the Enable Deterministic Operations Function
tf.config.experimental.enable_op_determinism()

# MacOS Sonoma Fix
tf.config.set_visible_devices([], 'GPU')

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


from tensorflow.keras.layers import TextVectorization
from tensorflow.keras import layers
from tensorflow.keras import optimizers

from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.model_selection import GridSearchCV
from imblearn.under_sampling import RandomUnderSampler
from sklearn.preprocessing import LabelEncoder
from sklearn import set_config
set_config(transform_output='pandas')
pd.set_option('display.max_colwidth', 250)

# Imports for building the sequence models
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers, optimizers, regularizers

In [None]:
import joblib
df = joblib.load('../Data-AmazonReviews/processed_data.joblib')
df.info()
df.head()

In [None]:
def create_groups(x):
    if x>=5.0:
        return "high"
    elif x <=2.0:
        return "low"
    else: 
        return None

To understand what customers do and do not like about Hoover products, we will define 2 groups:
- High Ratings
    - Overall rating = 5.0
- Low Ratings
    - Overall rating = 1.0 or 2.0


We can use a function and .map to define group names based on the numeric overall ratings.

In [None]:
## Use the function to create a new "rating" column with groups
df['rating'] = df['overall'].map(create_groups)
df.head()

In [None]:
df['rating'].value_counts(dropna=False)

In [None]:
## Check class balance of 'rating'
df['rating'].value_counts(normalize=True)

In [None]:
# Create a df_ml without null ratings
df_ml = df.dropna(subset=['rating']).copy()
df_ml.isna().sum()

In [None]:
## X - Option A)  lemmas
# def join_tokens(token_list):
#     joined_tokens = ' '.join(token_list)
#     return joined_tokens
# X = df_ml['spacy_lemmas'].apply(join_tokens)

# X - Option B) original raw text
X = df_ml['text']

# y - use our binary target 
y = df_ml['rating']
X.head(10)

In [None]:
y.value_counts(normalize=True)

# 📚 New For Today:

- Starting with a simple train-test-split for ML model (like in movie nlp project)
- Resampling Imbalanced training data
- Creating tensorflow dataset from X_train, y_train (so dataset is rebalanced)
- Creating tensorflow dataset (intended to be split in 2 ) for X_test and y_test

## From Train-Test Split for ML to Train-Test-Val Split for ANNs

In [None]:
# Perform 70:30 train test split
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=.3, random_state=42)
len(X_train_full), len(X_test)

### Using Sklearn's LabelEncoder

- Neural networks need numeric targets, so we can't use the text labels directly.
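
As a rough illustration of what label encoding does, the plain-Python sketch below sorts the unique class names and uses each name's position as its integer code, which mirrors how `LabelEncoder` orders `classes_`. The sample labels are made up for this example.

```python
# Hypothetical sample of text labels (not taken from the review data)
labels = ["high", "low", "high", "high", "low"]

# Sort the unique class names; the position of each name becomes its integer code
classes = sorted(set(labels))
class_to_int = {name: idx for idx, name in enumerate(classes)}

# Encode the labels, then decode them back to the original names
encoded = [class_to_int[name] for name in labels]
decoded = [classes[idx] for idx in encoded]

print(classes)
print(encoded)
```

Because the codes follow sorted order, it is worth checking `classes_` before interpreting which class a model output of 0 or 1 refers to.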

In [None]:
y_train_full[:10]

In [None]:
# Instantiate label encoder
encoder = LabelEncoder()

# Fit and transform the training target
y_train_full_enc = encoder.fit_transform(y_train_full)

# Transform the test target (no re-fitting)
y_test_enc = encoder.transform(y_test)

y_train_full_enc[:10]

In [None]:
# Original Class names saved as .classes_
classes = encoder.classes_
classes

In [None]:
# Can inverse-transform 
encoder.inverse_transform([0,1])

### Undersampling Majority Class

In [None]:
from imblearn.under_sampling import RandomUnderSampler

# Instantiate a RandomUnderSampler
sampler = RandomUnderSampler(random_state=42)

In [None]:
try:
    X_train, y_train = sampler.fit_resample(X_train_full,y_train_full_enc)
except Exception as e:
    display(e)

In [None]:
# Fit_resample on the reshaped X_train data and y-train data
X_train, y_train_enc = sampler.fit_resample(X_train_full.values.reshape(-1,1),
                                        y_train_full_enc)
X_train.shape

In [None]:
# Flatten the reshaped X_train data back to 1D
X_train = X_train.flatten()
X_train.shape

In [None]:
# Check for class balance
pd.Series(y_train_enc).value_counts()
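
Conceptually, random undersampling keeps every minority-class row and draws an equally sized random subset of each larger class. The sketch below shows that idea with the standard library only. The toy texts, toy labels, and the helper name `undersample` are assumptions for illustration, not the imblearn implementation.

```python
import random
from collections import Counter

def undersample(texts, labels, seed=42):
    """Randomly drop rows so every class has as many rows as the smallest class."""
    rng = random.Random(seed)
    # Group row positions by class label
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    n_keep = min(len(idx) for idx in by_class.values())
    # Sample n_keep row positions from each class without replacement
    keep = []
    for idx in by_class.values():
        keep.extend(rng.sample(idx, n_keep))
    keep.sort()
    return [texts[i] for i in keep], [labels[i] for i in keep]

# Toy imbalanced data: six rows of class 0, two rows of class 1
toy_texts = [f"review {i}" for i in range(8)]
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1]

X_bal, y_bal = undersample(toy_texts, toy_labels)
print(Counter(y_bal))
```

Only the training split is rebalanced; the test data keeps its original class proportions so evaluation reflects the real distribution.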

## Previous Class' ML Model

In [None]:
# from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# from sklearn.pipeline import Pipeline
# from sklearn.naive_bayes import MultinomialNB

In [None]:
# ## Create a model pipeline 
# count_pipe = Pipeline([('vectorizer',  CountVectorizer()), 
#                        ('naivebayes',  MultinomialNB())])

# count_pipe.fit(X_train, y_train_enc)
# fn.evaluate_classification(count_pipe, X_train, y_train_enc, X_test, y_test_enc,)

# Preparing For Deep NLP (Train-Test-Val Datasets)

## 🕹️ Prepare Tensorflow Datasets

Since we already have train/test X and y vars, we will make 2 dataset objects using tf.data.Dataset.from_tensor_slices.

1. The training dataset using X_train, y_train (that we resampled/balanced)
2. The val/test dataset using X_test, y_test.

We will then split the val/test dataset into a val/test split.
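
`.take(n)` keeps the first `n` elements and `.skip(n)` drops them, so together they partition a dataset without overlap. A list-based sketch of the same idea follows; the ten-element toy list is a stand-in for the combined val/test dataset.

```python
from itertools import islice

# Toy stand-in for the combined val/test dataset
val_test = list(range(10))

# Split point for a 50/50 split
n_val = int(len(val_test) * 0.5)

# take(n): the first n elements; skip(n): everything after the first n
val_part = list(islice(val_test, n_val))
test_part = list(islice(val_test, n_val, None))

print(len(val_part), len(test_part))
```

This only yields a stable split if the underlying order does not change between passes, which is why the val/test dataset is not reshuffled.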

<!-- 
### T/T/V Split - Order of Operations (if using 1 dataset object)

1) **Create full dataset object & Shuffle Once.**
2) Calculate number of samples for training and validation data.
3) Create the train/test/val splits using .take() and .skip()
4) **Add shuffle to the train dataset only.**
5) (Optional/Not Used on LP) If applying a transformation (e.g. train_ds.map(...)) to the data, add  here, before .cache()
7) (Optional) Add .cache() to all splits to increase speed  (but may cause problems with large datasets)
8) **Add .batch to all splits (default batch size=32)**
9) (Optional) Add .prefetch(tf.data.AUTOTUNE)
10) (Optional) Print out final length of datasets -->

In [None]:
# Convert training data to Dataset Object
train_ds = tf.data.Dataset.from_tensor_slices((X_train, y_train_enc))
# Shuffle dataset once
train_ds = train_ds.shuffle(len(train_ds), reshuffle_each_iteration=False)

Create a test and validation dataset using X_test,y_test

In [None]:
# Convert test to dataset object to split
val_test_ds = tf.data.Dataset.from_tensor_slices((X_test, y_test_enc))

In [None]:
# Calculate # of samples for 50/50 val/test split
n_val_samples = int(len(val_test_ds) *.5)
n_val_samples

In [None]:
## Perform the val/test split

## Create the validation dataset using .take
val_ds = val_test_ds.take(n_val_samples)

## Create the test dataset using skip
test_ds = val_test_ds.skip(n_val_samples)

In [None]:
# Comparing the lengths of all 3 splits
len(train_ds), len(val_ds), len(test_ds)

### Adding Shuffling and Batching

Let's examine a single element.

In [None]:
# display a sample single element 
example_X, example_y= train_ds.take(1).get_single_element()
print(example_X,'\n\n',example_y)

In [None]:
# (Repeat) display a sample single element 
example_X, example_y= train_ds.take(1).get_single_element()
print(example_X,'\n\n',example_y)

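Notice that we get the same example both times: the training data was shuffled once with `reshuffle_each_iteration=False`, so its order is fixed.

Add `.shuffle` to the training data so that it is reshuffled on every pass.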
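
The difference between shuffling once and reshuffling on every pass can be sketched in plain Python. The two class names below are hypothetical helpers written for this illustration; they imitate the behavior but are not how `tf.data` is implemented.

```python
import random

class ShuffledOnce:
    """Shuffle once at creation; every pass yields the same order."""
    def __init__(self, items, seed=42):
        self.items = list(items)
        random.Random(seed).shuffle(self.items)

    def __iter__(self):
        return iter(self.items)

class ReshuffledEachPass:
    """Draw a fresh random order every time iteration starts."""
    def __init__(self, items, seed=42):
        self.items = list(items)
        self.rng = random.Random(seed)

    def __iter__(self):
        order = list(self.items)
        self.rng.shuffle(order)
        return iter(order)

data = list(range(10))
fixed = ShuffledOnce(data)
fresh = ReshuffledEachPass(data)

# Two passes over each dataset
fixed_pass_1, fixed_pass_2 = list(fixed), list(fixed)
fresh_pass_1, fresh_pass_2 = list(fresh), list(fresh)

print(fixed_pass_1, fixed_pass_2)
print(fresh_pass_1, fresh_pass_2)
```

The fixed-order version is what makes a `.take()`/`.skip()` split reproducible, while the reshuffling version is what we want for training batches.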

In [None]:
# Shuffle only the training data every epoch
train_ds = train_ds.shuffle(len(train_ds))

In [None]:
# (Repeat) display a sample single element 
example_X, example_y= train_ds.take(1).get_single_element()
print(example_X,'\n\n',example_y)

In [None]:
# (Repeat) display a sample single element 
example_X, example_y= train_ds.take(1).get_single_element()
print(example_X,'\n\n',example_y)

> Add batching (use 32 for batch_size)

In [None]:
#  Setting the batch_size for all datasets
BATCH_SIZE =32
# use .batch to add batching to all 3 datasets
train_ds = train_ds.batch(BATCH_SIZE)
val_ds = val_ds.batch(BATCH_SIZE)
test_ds = test_ds.batch(BATCH_SIZE)

# Confirm the number of batches in each
print (f' There are {len(train_ds)} training batches.')
print (f' There are {len(val_ds)} validation batches.')
print (f' There are {len(test_ds)} testing batches.')

In [None]:
# (Repeat) display a sample single element 
example_X, example_y= train_ds.take(1).get_single_element()
print(example_X,'\n\n',example_y)


A single element now contains 32 samples since we set  batch_size to 32.
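
Batching groups consecutive samples into chunks, so the number of batches is the sample count divided by the batch size, rounded up. The last batch may be smaller when the division is not exact. A minimal sketch with a toy list of 70 samples (the helper name `batch` is an assumption for illustration):

```python
import math

def batch(items, batch_size):
    """Yield consecutive chunks of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Toy dataset of 70 samples
toy = list(range(70))
batches = list(batch(toy, 32))

print(len(batches), [len(b) for b in batches])
```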

## 📚 Vectorizing Text with Keras's TextVectorization Layer (Demo)
### Compare what is happening during text vectorization when using count versus sequence

`TextVectorization` is a flexible layer that can convert text to either bag-of-words counts or integer sequences.
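
To make the contrast concrete, here is a plain-Python sketch of the two representations for one sentence. It is an approximation of the idea, not the Keras implementation: the alphabetical tie-breaking between equally frequent words is an assumption of this sketch, so the vocabulary order may differ from what the layer produces.

```python
import string
from collections import Counter

text = "Sometimes I love this vacuum, sometimes i hate this vacuum"

# Lowercase and strip punctuation, then split on whitespace
cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
tokens = cleaned.split()

# Vocabulary ordered by descending frequency; alphabetical tie-breaking is an
# assumption of this sketch and may not match the order Keras produces
freq = Counter(tokens)
words = sorted(freq, key=lambda w: (-freq[w], w))

# "count" style: one column per vocabulary entry, with [UNK] first
count_vocab = ["[UNK]"] + words
count_vector = [freq.get(w, 0) for w in count_vocab]

# "int" style: one integer per token; 0 is reserved for padding and 1 for [UNK]
int_vocab = ["", "[UNK]"] + words
word_to_int = {w: i for i, w in enumerate(int_vocab)}
int_sequence = [word_to_int.get(t, 1) for t in tokens]

print(count_vocab, count_vector)
print(int_sequence)
```

The count vector has a fixed width (the vocabulary size) and discards word order, while the integer sequence keeps word order and has one entry per token.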

In [None]:
# Example text for demo 
example_text = ['Sometimes I love this vacuum, sometimes i hate this vacuum']

### TextVectorization Layer - Demo Count Vectorization

In [None]:
# Create text Vectorization layer - set to count vectorization
count_vectorizer = tf.keras.layers.TextVectorization(output_mode='count')

In [None]:
# Get the vocabulary from the vectorization layer.
count_vectorizer.get_vocabulary()

- Before training, the vocabulary contains only the out-of-vocabulary token ([UNK]).

In [None]:
# Fitting the vectorizer using .adapt
count_vectorizer.adapt(example_text)
# Check the vocabulary after training the layer.
count_vectorizer.get_vocabulary()

In [None]:
# Convert example to count-vectorization
counts = count_vectorizer(example_text)
counts

- Size of vectorized text - column for every word in vocab

In [None]:
# Getting the counts as a DataFrame
pd.DataFrame(counts.numpy(), columns=count_vectorizer.get_vocabulary())

### TextVectorization Layer - Demo Sequence Vectorization

- `output_mode='int'` returns sequences of integer token indices.
- The sequence length is chosen by the data scientist; we use 20 for the demo.

In [None]:
# Create text Vectorization layer for sequences
sequence_vectorizer = tf.keras.layers.TextVectorization(output_mode='int',output_sequence_length=20)

In [None]:
# Check the vocabulary of the new sequence vectorizer.
vocab =  sequence_vectorizer.get_vocabulary()
vocab

- Before training, the vocabulary contains only the padding token ('') and the out-of-vocabulary token ([UNK]).

In [None]:
# Fit the vectorizer using .adapt
sequence_vectorizer.adapt(example_text)
# Check the vocabulary after training the layer
vocab =  sequence_vectorizer.get_vocabulary()
vocab

To demonstrate how sequences are used, we will make a dictionary with the integer code as the key and the corresponding word as the value.

In [None]:
# Save dictionaries to look up words from ints 
int_to_str  = {idx:word for idx, word in enumerate(vocab)} # Dictionary Comprehension
int_to_str

In [None]:
# Convert example to sequences
sequences = sequence_vectorizer(example_text)
sequences

- Why are there 0s at the end of the sequence?
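
Every sequence must have the same length before it can be stacked into a batch, so shorter texts are padded with 0 and longer texts are cut off. A minimal sketch of that rule, assuming padding and truncation both happen at the end of the sequence (the helper name `pad_or_truncate` and the toy sequences are made up for this example):

```python
def pad_or_truncate(seq, length, pad_value=0):
    """Cut seq to `length`, or right-pad it with pad_value to reach `length`."""
    return list(seq[:length]) + [pad_value] * max(0, length - len(seq))

# A short sequence (5 tokens) and a long one (25 tokens)
short_seq = [3, 2, 7, 4, 5]
long_seq = list(range(1, 26))

padded = pad_or_truncate(short_seq, 20)
truncated = pad_or_truncate(long_seq, 20)

print(padded)
print(truncated)
```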

In [None]:
# Cannot be made into a dataframe
try:
    pd.DataFrame(sequences, columns = vocab)
except Exception as e:
    display(e)

In [None]:
# save the sequences as numpy array for the loop below
sequences = sequences.numpy()
sequences

In [None]:
# For each integer code, display the corresponding word
for val in sequences[0]:
    print(f"{val} = {int_to_str[val]}")

##  Embedding Layer

In [None]:
# Saving the Size of the Vocab
VOCAB_SIZE = sequence_vectorizer.vocabulary_size()
VOCAB_SIZE

The embedding layer needs the size of the vocabulary (`input_dim`) and the desired number of embedding dimensions (`output_dim`, e.g. 100, 200, or 300).
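
An embedding layer can be thought of as a lookup table with one row per vocabulary index. The sketch below builds a small random table in plain Python and looks up a toy sequence; the sizes, the random initial values, and the sequence are assumptions chosen for illustration.

```python
import random

rng = random.Random(42)
vocab_size, embed_dim = 8, 4  # toy sizes

# Embedding matrix: one row of embed_dim numbers per vocabulary index
embedding_matrix = [
    [rng.uniform(-0.05, 0.05) for _ in range(embed_dim)]
    for _ in range(vocab_size)
]

# A toy integer sequence (the trailing 0s are padding)
sequence = [3, 2, 7, 4, 5, 0, 0]

# The lookup: replace each integer with its row of the matrix
embedded = [embedding_matrix[idx] for idx in sequence]

print(len(embedded), len(embedded[0]))
```

A sequence of length 7 becomes a 7 x 4 array, and repeated indices (such as the padding 0s) always map to the same row. During training, the rows themselves are the weights that get updated.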


In [None]:
# Create embedding layer of desired # of values
EMBED_DIM = 50
embedding_layer = tf.keras.layers.Embedding(input_dim=VOCAB_SIZE,
                                           output_dim=EMBED_DIM,
                                           input_length= 20)
embedding_layer

### Demonstrating Sequence to Vector Embedding Lookup 

In [None]:
# Minimum Model Needed to Create Embedding Layer for Vocab
demo_embed = Sequential()
demo_embed.add(sequence_vectorizer)
demo_embed.add(embedding_layer)
demo_embed.compile(optimizer='adam', loss='mse')
demo_embed.summary()

In [None]:
# Embedding has row per word with EMBED_DIM of 50
sequence_vectorizer.vocabulary_size(), EMBED_DIM

In [None]:
# Get the weights from the embedding layer (this is your actual embedding matrix)
embedding_weights = demo_embed.layers[1].get_weights()[0]
embedding_weights

In [None]:
# Preview one set of embedding weights
embedding_weights[0]

In [None]:
embedding_weights.shape


In [None]:
# Show embeddings for each token in the sequence
for val in sequences[0]:
    print(f"{val} = {int_to_str[val]}")
    print(embedding_weights[val])
    print()

In [None]:
stop

# LECTURE 1 STOP HERE

### Word Vectors Math

In [None]:
# Prepare the words and their corresponding vectors
vector_dict = {}
for i, word in int_to_str.items():#tokenizer.word_index.items():
    # Save the weights for word (based on numeric index)
    vector_dict[word]= embedding_weights[i] 

    # vector_list.append(embedding_weights[i])
vector_dict.keys()

In [None]:
# Display the vector for "love"
vector_dict['love']

In [None]:
# Display the vector for "hate"
vector_dict['hate']

In [None]:
# Vectors can be added/subtracted to get output vector - then find most similar word  
vector_dict['hate'] + vector_dict['love'] + vector_dict['vacuum']
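
Finding the "most similar word" to a combined vector usually means ranking the vocabulary by cosine similarity. The sketch below does this for three hand-written toy vectors; the vectors and the helper names are assumptions for illustration. Note that the demo embedding above has not been trained, so similarities computed from its random weights would not be meaningful.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query, vectors, topn=2):
    """Return the topn (word, similarity) pairs, most similar first."""
    scored = [(word, cosine_similarity(query, vec)) for word, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Hand-written toy vectors (not learned from data)
toy_vectors = {
    "love": [1.0, 0.2, 0.0],
    "like": [0.9, 0.3, 0.1],
    "hate": [-1.0, 0.1, 0.0],
}

# Add two vectors element-wise, then rank the vocabulary against the result
combined = [a + b for a, b in zip(toy_vectors["love"], toy_vectors["like"])]
top_matches = most_similar(combined, toy_vectors)

print(top_matches)
```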

## Word Embeddings Demo (Pre-Trained)

###  Pretrained Word Embeddings with GloVe

- [Click here](https://nlp.stanford.edu/data/glove.6B.zip) to start downloading the GloVe zip file (glove.6B.zip).
- Unzip the downloaded zip archive.
- Open the extracted folder and find the `glove.6B.100d.txt` file. (Size is over 300MB.)
- Move the text file from Downloads to the same folder as this notebook.
- **Make sure to ignore the large file using GitHub Desktop**

In [None]:
from gensim.models import KeyedVectors
# Load GloVe vectors into a gensim model
glove_model = KeyedVectors.load_word2vec_format("glove.6B.100d.txt", binary=False, no_header=True)

In [None]:
# You can now use `glove_model` to access individual word vectors, similar to a dictionary
vector = glove_model['king']
vector

In [None]:
vector.shape

In [None]:
# Find similarity between words
glove_model.similarity('king', 'queen')

In [None]:
# Perform word math
result = glove_model.most_similar(positive=['woman', 'king'], negative=['man'], topn=5)
result

In [None]:
# Use GloVe to find the words most similar to 'king'
glove_model.most_similar('king')

In [None]:
# Manually calculating new vector for word math
new_vector = glove_model['king'] - glove_model['man'] + glove_model['woman']
new_vector

In [None]:
# Using .most_similar with an array
glove_model.most_similar(new_vector)

In [None]:
# Manually calculating new vector for word math
new_vector = glove_model['monarchy'] + glove_model['vote'] + glove_model['government']
glove_model.most_similar(new_vector)

In [None]:
# Manually calculating new vector for word math
new_vector = glove_model['baby'] + glove_model['age']
glove_model.most_similar(new_vector)

In [None]:
# Manually calculating new vector for word math
new_vector = glove_model['baby'] + glove_model['baby']
glove_model.most_similar(new_vector)

# Returning to Hoover Data

### Create the Training Texts Dataset

In [None]:
# Fit the layer on the training texts
try:
    sequence_vectorizer.adapt(train_ds)
except Exception as e:
    display(e)

> We need to get a version of our data that is **only the texts**.

In [None]:
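# Get just the texts (drop the labels) from train_ds
ds_texts = train_ds.map(lambda x, y: x)

# Preview the text-only dataset (one batch of raw strings)
ds_texts.take(1).get_single_element()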

### Determine appropriate sequence length. 

In [None]:
# df_ml['length (characters)'] = df_ml['text'].map(len)
# df_ml.head(3)

# ax = sns.histplot(data=df_ml, hue='rating', x='length (characters)',
#                 stat='percent',common_norm=False)#, estimator='median',);
# ax.axvline()

In [None]:
# Let's take a look at the length of each text
# We will split on each space, and then get the length
df_ml['length (tokens)'] = df_ml['text'].map( lambda x: len(x.split(" ")))
df_ml['length (tokens)'].describe()

In [None]:
SEQUENCE_LENGTH = 150
ax = sns.histplot(data=df_ml, hue='rating', x='length (tokens)',kde=True,
                stat='probability',common_norm=False)#, estimator='median',);
ax.axvline(SEQUENCE_LENGTH, color='red', ls=":")
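
One way to sanity-check a candidate sequence length is to compute what fraction of texts fit within it without truncation. A minimal sketch follows; the token counts are invented for illustration and the helper name `coverage` is an assumption.

```python
def coverage(lengths, sequence_length):
    """Fraction of texts whose token count fits within sequence_length."""
    return sum(1 for n in lengths if n <= sequence_length) / len(lengths)

# Hypothetical token counts for ten reviews
toy_lengths = [12, 40, 75, 90, 150, 151, 300, 20, 33, 610]

fits = coverage(toy_lengths, 150)
print(fits)
```

A length that covers most texts keeps padding and truncation both modest; a few very long reviews do not need to set the length for everyone.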

In [None]:
# import numpy as np
# from sklearn.metrics.pairwise import cosine_similarity

# # Define a function to calculate cosine similarity
# def find_closest_embeddings(embedding):
    
#     return sorted(vector_dict.keys(), key=lambda word: cosine_similarity([vector_dict[word]], [embedding]))

# # Example of finding words similar to 'vacuum' 
# similar_to_vacuum = find_closest_embeddings(vector_dict['vacuum'])[:5]  # Get the top 5 similar words

# # Print the similar words
# print("Words similar to 'vacuum':", similar_to_vacuum)

# # Demonstration of vector arithmetic: 'hate' + 'love' + 'vacuum'
# combined_vector = vector_dict['hate'] + vector_dict['love'] + vector_dict['vacuum']
# similar_to_combined = find_closest_embeddings(combined_vector)[:5]  # Get the top 5 similar words

# # Print the similar words to the combined vector
# print("Words similar to the combination of 'hate', 'love', and 'vacuum':", similar_to_combined)
# # 

In [None]:

# # Example of finding words similar to 'vacuum' 
# n_results = 5
# demo_word = 'vacuum'
# add_word = 'love'

# similar_to_vacuum = find_closest_embeddings(vector_dict[demo_word])[:n_results]  # Get the top 5 similar words

# # Print the similar words
# print(f"Words similar to '{demo_word}':")
# print(similar_to_vacuum)

# # Demonstration of vector arithmetic: 'hate' + 'love' + 'vacuum'
# combined_vector =vector_dict[add_word] + vector_dict[demo_word]# vector_dict['hate'] + 

# similar_to_combined = find_closest_embeddings(combined_vector)[:n_results]  # Get the top 5 similar words

# # Print the similar words to the combined vector
# print(f"\nWords similar to the combination of {demo_word} + {add_word}")
# print(similar_to_combined)


# Our First Deep Sequence Model

### Combining the 

### Simple RNN

In [None]:

## Create text Vectorization layer
SEQUENCE_LENGTH = 150  # chosen from the token-length histogram above
EMBED_DIM = 50  # same embedding size as the demo above

sequence_vectorizer = tf.keras.layers.TextVectorization(
    standardize="lower_and_strip_punctuation",
    output_mode="int",
    output_sequence_length=SEQUENCE_LENGTH
)

sequence_vectorizer.adapt(ds_texts)

In [None]:
VOCAB_SIZE = sequence_vectorizer.vocabulary_size()


# Define sequential model with pre-trained vectorization layer and *new* embedding layer
model = Sequential([
    sequence_vectorizer,
    layers.Embedding(input_dim=VOCAB_SIZE,
                              output_dim=EMBED_DIM, 
                              input_length=SEQUENCE_LENGTH)
    ])

In [None]:
def build_rnn_model(text_vectorization_layer):
    VOCAB_SIZE = text_vectorization_layer.vocabulary_size()
    SEQUENCE_LENGTH = text_vectorization_layer.get_config()['output_sequence_length']
    
    
    # Define sequential model with pre-trained vectorization layer and *new* embedding layer
    model = Sequential([
        text_vectorization_layer,
        layers.Embedding(input_dim=VOCAB_SIZE,
                                  output_dim=EMBED_DIM, 
                                  input_length=SEQUENCE_LENGTH)
        ])
        
    # Add *new* SimpleRNN layer
    model.add(layers.SimpleRNN(32)) #BEST=32
    
    # Add output layer
    model.add(layers.Dense(1, activation='sigmoid'))
 
    # Compile the model
    model.compile(optimizer=optimizers.legacy.Adam(learning_rate = .001), 
                  loss='bce',
                  metrics=['accuracy'])
    
    model.summary()
    return model

def get_callbacks(patience=3, monitor='val_accuracy'):
    early_stop = tf.keras.callbacks.EarlyStopping(patience=patience, monitor=monitor)
    return [early_stop]
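
The patience logic behind early stopping can be sketched in a few lines: training stops once the monitored score has failed to improve for `patience` consecutive epochs. This is a simplified illustration that ignores options such as `min_delta` and `restore_best_weights`; the helper name and the validation scores are made up.

```python
def early_stop_epoch(val_scores, patience=3):
    """Return the 0-based epoch at which training would stop, or None if it never stops."""
    best = float("-inf")
    wait = 0  # epochs since the last improvement
    for epoch, score in enumerate(val_scores):
        if score > best:
            best = score
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation accuracies per epoch
toy_val_accuracy = [0.70, 0.75, 0.74, 0.73, 0.72, 0.80]

stop_with_patience_3 = early_stop_epoch(toy_val_accuracy, patience=3)
stop_with_patience_5 = early_stop_epoch(toy_val_accuracy, patience=5)

print(stop_with_patience_3, stop_with_patience_5)
```

With a patience of 3 the run ends before the late improvement in the final epoch is ever seen, which illustrates the trade-off when choosing the patience value.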

In [None]:
# Build the RNN model and specify the vectorizer
rnn_model = build_rnn_model(sequence_vectorizer)

# Define number of epochs
EPOCHS = 30
# Fit the model
history = rnn_model.fit(
    train_ds,
    epochs=EPOCHS,
    validation_data=val_ds,
    callbacks=get_callbacks(patience=5),
)
fn.plot_history(history,figsize=(6,4))

In [None]:
# Obtain the results
results = fn.evaluate_classification_network(
    rnn_model, X_train=train_ds, 
    X_test=test_ds,# history=history
);

# Next Class

> We will continue with this task and introduce and apply various sequence models.

# APPENDIX - Save for Next Lecture

In [None]:
## TEMP/EXP - extract embedding matrix

embedding_weights = rnn_model.layers[1].get_weights()[0]
embedding_weights.shape

> - Conceptual example of using the maximum value as final result.
> - Relate to GlobalMaxPooling1D() layer
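
Pooling collapses a sequence of token vectors into a single vector by taking, for each embedding dimension, either the maximum or the average across all tokens. A plain-Python sketch with three hand-written toy vectors (not taken from the model):

```python
# Three toy token vectors, each with 3 embedding dimensions
token_vectors = [
    [0.1, -0.4, 0.3],
    [0.5, 0.2, -0.1],
    [-0.2, 0.0, 0.6],
]

# zip(*rows) regroups the values by dimension, so each `col` holds one
# dimension's values across all tokens
max_pooled = [max(col) for col in zip(*token_vectors)]
avg_pooled = [sum(col) / len(col) for col in zip(*token_vectors)]

print(max_pooled)
print(avg_pooled)
```

Either way, the output has one value per embedding dimension regardless of how many tokens were in the sequence.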

In [None]:
# Saving the MAX values (relate to GlobalMaxPooling)
max_vector = np.max((vector_dict['hate'], vector_dict['love'] ,vector_dict['vacuum']), axis=0)
print(max_vector.shape)
max_vector

In [None]:
# Saving the Average values (relate to GlobalAveragePooling)
avg_vector = np.mean((vector_dict['hate'], vector_dict['love'] ,vector_dict['vacuum']), axis=0)
print(avg_vector.shape)
avg_vector