<a href="https://colab.research.google.com/github/doronschwartz/NLP/blob/main/HW2/HW2_Part3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using pre-trained word embeddings

**Author:** [fchollet](https://twitter.com/fchollet)<br>
**Date created:** 2020/05/05<br>
**Last modified:** 2020/05/05<br>
**Description:** Text classification on the Newsgroup20 dataset using pre-trained GloVe word embeddings.

Taken from:
https://keras.io/examples/nlp/pretrained_word_embeddings/

Modified by Avi Rosenfeld on 2023/02/21

## Setup

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow import keras

## Introduction

In this example, we show how to train a text classification model that uses pre-trained
word embeddings.

We'll work with the Newsgroup20 dataset, a set of 20,000 message board messages
belonging to 20 different topic categories.

For the pre-trained word embeddings, we'll use
[GloVe embeddings](http://nlp.stanford.edu/projects/glove/).

However, rather than downloading the vectors ourselves, we will load them through Gensim's pre-trained models. More on all available models can be found at:

https://radimrehurek.com/gensim/models/word2vec.html#pretrained-models

## Load pre-trained word embeddings

Let's download pre-trained GloVe embeddings through Gensim. This model uses 200-dimensional vectors. Feel free to try the other options, with 50-dimensional,
100-dimensional, or 300-dimensional vectors. You can also swap in the Word2Vec models just as easily!

In [None]:
import gensim.downloader as api
glove_model = api.load("glove-wiki-gigaword-200")
embedding_dim = glove_model.vector_size

In [None]:
print(embedding_dim)

200
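
To make concrete what these vectors are for, here is a minimal sketch of how similarity between two embeddings is commonly measured. The 3-dimensional vectors and the `cosine_similarity` helper are made up for illustration; they are not taken from the GloVe model loaded above.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings" (the model above uses 200 dimensions).
toy_vectors = {
    "cat": np.array([0.9, 0.1, 0.0]),
    "dog": np.array([0.8, 0.2, 0.1]),
    "rocket": np.array([0.0, 0.1, 0.9]),
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


sim_cat_dog = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])
sim_cat_rocket = cosine_similarity(toy_vectors["cat"], toy_vectors["rocket"])

# In this toy setup the related pair scores higher than the unrelated pair.
print(sim_cat_dog > sim_cat_rocket)
```

With real pre-trained vectors, the expectation is the same: words used in similar contexts tend to have a higher cosine similarity.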


## Download the Newsgroup20 data

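Note that, as opposed to our previous notebook, here we download the entire 20_newsgroup data (`subset='all'`, i.e. both the training and test portions). The active line below restricts the data to the four categories in `cats`; the commented-out line at the top loads all 20 categories.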

In [None]:
from sklearn.datasets import fetch_20newsgroups
#newsgroups_all = fetch_20newsgroups(subset='all', shuffle=True, random_state=42, remove=('headers', 'footers', 'quotes'))
cats = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
#newsgroups_all = fetch_20newsgroups(subset='all',categories=cats)
newsgroups_all = fetch_20newsgroups(subset='all',categories=cats, remove=('headers', 'footers', 'quotes'))

## Let's take a look at the data

The raw posts contain header lines that leak the file's category, either
explicitly (the first line is literally the category name), or implicitly, e.g. via the
`Organization` field. We get rid of the headers with the built-in `remove` filter used above.

In [None]:
samples = newsgroups_all.data
labels =  newsgroups_all.target
class_names = newsgroups_all.target_names

print("Classes:", class_names)
print("Number of samples:", len(samples))

Classes: ['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']
Number of samples: 3387


## Shuffle and split the data into training & validation sets

In [None]:
# Shuffle the data
seed = 1337
rng = np.random.RandomState(seed)
rng.shuffle(samples)
rng = np.random.RandomState(seed)
rng.shuffle(labels)

# Extract a training & validation split
validation_split = 0.2
num_validation_samples = int(validation_split * len(samples))
train_samples = samples[:-num_validation_samples]
val_samples = samples[-num_validation_samples:]
train_labels = labels[:-num_validation_samples]
val_labels = labels[-num_validation_samples:]
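
The shuffle above relies on re-creating the random generator with the same seed, so that the samples and the labels receive the same permutation. The following sketch checks this on toy data (the `toy_` names are hypothetical and chosen so they do not overwrite the notebook's variables):

```python
import numpy as np

# Toy data: label i belongs to document "doc{i}".
toy_samples = ["doc0", "doc1", "doc2", "doc3", "doc4"]
toy_labels = np.array([0, 1, 2, 3, 4])

# Two generators with the same seed apply the same permutation.
toy_seed = 1337
np.random.RandomState(toy_seed).shuffle(toy_samples)
np.random.RandomState(toy_seed).shuffle(toy_labels)

# Every sample should still sit next to its own label.
pairs_aligned = all(s == f"doc{l}" for s, l in zip(toy_samples, toy_labels))
print(pairs_aligned)

# Same slicing logic as above: the last 20% becomes the validation set.
toy_num_val = int(0.2 * len(toy_samples))
toy_train, toy_val = toy_samples[:-toy_num_val], toy_samples[-toy_num_val:]
print(len(toy_train), len(toy_val))
```

An alternative that avoids the double seeding would be to draw one index permutation and apply it to both sequences; the approach above is the one this notebook uses.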

In [None]:
#train_samples = train_samples.astype(str)

## Create a vocabulary index

Let's use the `TextVectorization` layer to index the vocabulary found in the dataset.
Later, we'll use the same layer instance to vectorize the samples.

Our layer will only consider the top 20,000 words, and will truncate or pad sequences to
be exactly 200 tokens long.

In [None]:
from tensorflow.keras.layers import TextVectorization

vectorizer = TextVectorization(max_tokens=20000, output_sequence_length=200)
text_ds = tf.data.Dataset.from_tensor_slices(train_samples).batch(128)
vectorizer.adapt(text_ds)

You can retrieve the computed vocabulary via `vectorizer.get_vocabulary()`. Let's
print the top 5 words:

In [None]:
vectorizer.get_vocabulary()[:5]

['', '[UNK]', 'the', 'to', 'of']

Let's vectorize a test sentence:

In [None]:
output = vectorizer([["the cat sat on the man"]])
output.numpy()[0, :6]

array([    2, 11969,  4109,    17,     2,   339])

As you can see, "the" gets represented as "2". Why not 0, given that "the" was the first
word in the vocabulary? That's because index 0 is reserved for padding and index 1 is
reserved for "out of vocabulary" tokens.
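
The indexing scheme can be illustrated with a simplified pure-Python stand-in. This is only a sketch of the idea (whitespace tokenization, lowercasing, two reserved indices, truncation and right-padding), not the actual `TextVectorization` implementation; `toy_build_vocab` and `toy_vectorize` are hypothetical helper names.

```python
from collections import Counter


def toy_build_vocab(texts, max_tokens):
    """Most frequent words first, after the two reserved entries."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Index 0 is reserved for padding and index 1 for unknown words,
    # so only max_tokens - 2 slots remain for real words.
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common


def toy_vectorize(text, word_to_index, sequence_length):
    """Map words to indices, then truncate or right-pad to a fixed length."""
    ids = [word_to_index.get(word, 1) for word in text.lower().split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))


toy_corpus = ["the cat sat", "the dog sat", "the cat ran"]
toy_vocab = toy_build_vocab(toy_corpus, max_tokens=5)
toy_word_to_index = {word: i for i, word in enumerate(toy_vocab)}

print(toy_vocab)
# "dog" and "down" are not in the 5-entry vocabulary, so they map to 1;
# the two trailing zeros are padding.
print(toy_vectorize("the dog sat down", toy_word_to_index, sequence_length=6))
```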

Here's a dict mapping words to their indices:

In [None]:
voc = vectorizer.get_vocabulary()
word_index = dict(zip(voc, range(len(voc))))

Now, let's prepare a corresponding embedding matrix that we can use in a Keras
`Embedding` layer. It's a simple NumPy matrix where the entry at index `i` is the pre-trained
vector for the word of index `i` in our `vectorizer`'s vocabulary.

Hint: I found you can add stopword removal or other filters here. I personally used the stop word list imported with `from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS`.

As a check, looking up the words of our test sentence in `word_index` gives the same encoding as above:

In [None]:
test = ["the", "cat", "sat", "on", "the", "man"]
[word_index[w] for w in test]


Now let's build the embedding matrix for the 20 Newsgroup vocabulary using our preloaded word embeddings!

In [None]:
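vocab_size = len(word_index)
# Vocabulary size plus two spare rows (they are never assigned and stay all-zero)
num_tokens = len(voc) + 2
embedding_matrix = np.zeros((num_tokens, embedding_dim))

hits = 0
misses = 0

# Prepare embedding matrix: row i holds the pre-trained vector for the word with index i.
# Words not found in the pre-trained model keep an all-zero row.
for word, i in word_index.items():
    if word in glove_model:
        hits += 1
        embedding_matrix[i] = glove_model[word]
    else:
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))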


Converted 15832 words (4168 misses)
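
The same loop can be followed by hand on a toy vocabulary. In this sketch a plain dict (`toy_pretrained`) stands in for the GloVe model, and all names and values are made up for illustration:

```python
import numpy as np

# Hypothetical stand-ins for the vectorizer vocabulary and the pre-trained lookup.
toy_voc = ["", "[UNK]", "the", "cat", "zzyzx"]
toy_pretrained = {
    "the": np.array([0.1, 0.2]),
    "cat": np.array([0.3, 0.4]),
}
toy_dim = 2

toy_matrix = np.zeros((len(toy_voc), toy_dim))
toy_hits, toy_misses = 0, 0
for i, word in enumerate(toy_voc):
    vector = toy_pretrained.get(word)
    if vector is not None:
        toy_matrix[i] = vector  # row i now holds the vector for word i
        toy_hits += 1
    else:
        toy_misses += 1  # row i stays all zeros

print("Converted %d words (%d misses)" % (toy_hits, toy_misses))
print(toy_matrix)
```

Here the padding entry, the unknown-word entry, and the rare word `zzyzx` all end up with zero rows, which is what a "miss" means in the count above.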


In [None]:
print(word_index.items())
print(embedding_matrix[10])  # Note that UNK is like OOV from the lecture

[ 2.68049985e-01  3.60320002e-01 -3.31999987e-01 -5.46419978e-01
 -5.04509985e-01 -1.34610003e-02 -8.04319978e-01 -2.42139995e-01
  5.37360013e-01  7.75810003e-01 -3.25540006e-01  4.83000010e-01
  8.42649996e-01  3.77799988e-01 -1.47670001e-01  5.31920016e-01
 -7.05179989e-01  4.40369993e-01  7.50349998e-01 -1.81710005e-01
  7.01390028e-01  2.93829989e+00  4.56119999e-02 -2.11759999e-01
  1.99469998e-01 -4.81750011e-01 -2.58150011e-01  4.62000012e-01
 -5.68410009e-03 -3.05629998e-01 -5.75410008e-01 -1.95269994e-02
 -1.37510002e-01 -5.94500005e-01 -3.82160008e-01 -1.35409996e-01
 -6.64439976e-01 -2.30279997e-01 -5.54660000e-02  3.84209991e-01
 -1.68880001e-01  5.14619984e-02 -2.82929987e-01  4.50760007e-01
 -3.64639997e-01  3.61009985e-01  1.09350002e+00 -1.19470000e-01
  4.97290008e-02  4.87650000e-02  4.89439994e-01 -3.31379997e-04
  1.63650006e-01  4.97429997e-01  3.38140011e-01  1.55699998e-02
  2.57620007e-01 -5.84829986e-01 -5.58210015e-01 -2.90919989e-01
  2.36110002e-01 -2.89510

Next, we load the pre-trained word embeddings matrix into an `Embedding` layer.

Note that we set `trainable=False` so as to keep the embeddings fixed (we don't want to update them during training).  

This acts as the "transfer learning" aspect of this model.


In [None]:
from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(
    num_tokens,
    embedding_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False,
)
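
Conceptually, a frozen embedding layer is a table lookup: each integer token index selects one row of the matrix. A minimal NumPy sketch of that idea, with a made-up 4-row table (this mimics the lookup only, not the Keras layer itself):

```python
import numpy as np

# Hypothetical embedding table: 4 tokens, 2 dimensions each.
embedding_table = np.array([
    [0.0, 0.0],  # index 0: padding
    [0.0, 0.0],  # index 1: [UNK]
    [0.1, 0.2],  # index 2
    [0.3, 0.4],  # index 3
])

# A batch of 2 sequences, each 3 token indices long.
token_ids = np.array([[2, 3, 0],
                      [3, 1, 0]])

# Integer-array indexing picks one row per token index.
embedded = embedding_table[token_ids]
print(embedded.shape)  # (batch, sequence_length, embedding_dim)
```

With `trainable=False`, the rows of the table are never changed by training; only the layers stacked on top of the lookup are updated.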

## Build the model

A simple 1D convnet with global max pooling and a classifier at the end.  The different convolutional and pooling layers are meant to act as a type of feature selection, only focusing on the important embeddings (words).  However, I have found that some feature selection done in advance still helps, by improving performance and/or model speed.

In [None]:
from tensorflow.keras import layers

int_sequences_input = keras.Input(shape=(None,), dtype="int64")
embedded_sequences = embedding_layer(int_sequences_input)
x = layers.Conv1D(128, 5, activation="relu")(embedded_sequences)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(128, 5, activation="relu")(x)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(128, 5, activation="relu")(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.5)(x)
preds = layers.Dense(len(class_names), activation="softmax")(x)
model = keras.Model(int_sequences_input, preds)
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding (Embedding)       (None, None, 200)         4000400   
                                                                 
 conv1d (Conv1D)             (None, None, 128)         128128    
                                                                 
 max_pooling1d (MaxPooling1  (None, None, 128)         0         
 D)                                                              
                                                                 
 conv1d_1 (Conv1D)           (None, None, 128)         82048     
                                                                 
 max_pooling1d_1 (MaxPoolin  (None, None, 128)         0         
 g1D)                                                        
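
The summary reports `None` for the sequence dimension because the model input has shape `(None,)`. As a sanity check, the sketch below works out the sequence length after each layer for the 200-token sequences produced by the vectorizer, and recomputes the parameter counts listed above. It assumes the Keras defaults for these layers ("valid" padding, convolution stride 1, pooling stride equal to the pool size); the helper names are hypothetical.

```python
def conv1d_output_length(length, kernel_size):
    """Output length of Conv1D with "valid" padding and stride 1."""
    return length - kernel_size + 1


def maxpool1d_output_length(length, pool_size):
    """Output length of MaxPooling1D with "valid" padding and stride == pool_size."""
    return (length - pool_size) // pool_size + 1


def conv1d_param_count(kernel_size, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters


# Sequence length after each Conv1D/MaxPooling1D layer, starting from 200 tokens.
sequence_lengths = [200]
for layer_kind in ["conv", "pool", "conv", "pool", "conv"]:
    current = sequence_lengths[-1]
    if layer_kind == "conv":
        sequence_lengths.append(conv1d_output_length(current, 5))
    else:
        sequence_lengths.append(maxpool1d_output_length(current, 5))
print(sequence_lengths)  # [200, 196, 39, 35, 7, 3]

# Parameter counts: 20002 embedding rows (20000 vocabulary entries + 2) of 200 values.
embedding_params = 20002 * 200
first_conv_params = conv1d_param_count(5, 200, 128)
second_conv_params = conv1d_param_count(5, 128, 128)
print(embedding_params, first_conv_params, second_conv_params)  # 4000400 128128 82048
```

The final `GlobalMaxPooling1D` then reduces the remaining 3 positions to a single 128-dimensional vector per sample.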

## Train the model

First, convert our list-of-strings data to NumPy arrays of integer indices. The arrays
are right-padded.

In [None]:
x_train = vectorizer(np.array([[s] for s in train_samples])).numpy()
x_val = vectorizer(np.array([[s] for s in val_samples])).numpy()

y_train = np.array(train_labels)
y_val = np.array(val_labels)

We use categorical crossentropy as our loss since we're doing softmax classification.
Moreover, we use `sparse_categorical_crossentropy` since our labels are integers.
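
The two losses compute the same quantity; they differ only in how the labels are encoded. The following NumPy sketch, using made-up probabilities for two samples and four classes, shows that picking the predicted probability at the integer label matches the one-hot formulation:

```python
import numpy as np

# Hypothetical softmax outputs for 2 samples over 4 classes.
probs = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.2, 0.5, 0.2, 0.1]])
int_labels = np.array([0, 1])

# "Sparse" form: index the probability of the true class directly.
sparse_loss = -np.log(probs[np.arange(len(int_labels)), int_labels])

# "Categorical" form: one-hot encode the labels, then sum over classes.
one_hot = np.eye(probs.shape[1])[int_labels]
categorical_loss = -np.sum(one_hot * np.log(probs), axis=1)

print(np.allclose(sparse_loss, categorical_loss))
```

Using the sparse variant simply saves the one-hot encoding step for integer labels such as `y_train` and `y_val`.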

In [None]:
model.compile(
    loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["acc"],
)
model.fit(x_train, y_train, batch_size=128, epochs=30, validation_data=(x_val, y_val))

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.src.callbacks.History at 0x78e42af00e50>

## Export an end-to-end model

Now, we may want to export a `Model` object that takes as input a string of arbitrary
length, rather than a sequence of indices. It would make the model much more portable,
since you wouldn't have to worry about the input preprocessing pipeline.

Our `vectorizer` is actually a Keras layer, so it's simple:

In [None]:
string_input = keras.Input(shape=(1,), dtype="string")
x = vectorizer(string_input)
preds = model(x)
end_to_end_model = keras.Model(string_input, preds)

probabilities = end_to_end_model.predict(
    [["this message is about computer graphics and 3D modeling"]]
)
print("The categories are " + str(class_names))
print("The probabilities for the model are " + str(probabilities))

class_names[np.argmax(probabilities[0])]

The categories are ['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']
The probabilities for the model are [[6.0392282e-20 1.0000000e+00 4.0591908e-19 2.3149380e-21]]


'comp.graphics'