<a href="https://colab.research.google.com/github/nitinpunjabi/nlp-demystified/blob/main/notebooks/nlpdemyst_static_word_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing Demystified | Static Word Embeddings
https://nlpdemystified.org<br>
https://github.com/nitinpunjabi/nlp-demystified

**IMPORTANT**<br>
Enable **GPU acceleration** by going to *Runtime > Change Runtime Type*.
<br><br>
Also, if you're running this for free in the cloud rather than using a paid tier or using a local Jupyter server on your machine, then the notebook will *time out* after a period of inactivity. If that happens and you don't reconnect in time, you will need to upgrade spaCy again and reinstall the requisite statistical package(s).
<br><br>
Refer to this link on how to run Colab notebooks locally on your machine to avoid this issue:<br>
https://research.google.com/colaboratory/local-runtimes.html

**Upgrading spaCy**<br>
At the time this notebook was created, spaCy had newer releases but Colab was still using version 2.x by default. So the first step is to upgrade spaCy and download a statistical model for English.



In [None]:
!pip install -U spacy==3.*

**NOTE**<br>
In this notebook, we won't train standalone word embeddings from scratch. Rather, we'll:
1. Use *pretrained* embeddings in one model.
2. Train embeddings alongside another model.
<br>

If you want to try training standalone word embeddings, coding Skip-Gram With Negative Sampling (SGNS) from scratch shouldn't be too hard now that you know all the details. But I recommend just using the **Gensim** library instead:<br>
https://radimrehurek.com/gensim/models/word2vec.html<br>
https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html



# spaCy Vectors

Unlike previous notebooks where we installed the **en_core_web_sm** (small) model for spaCy, we're going to install the **en_core_web_md** (medium) model instead.
<br><br>
This is because the small model doesn't come with pretrained word vectors, whereas the medium and large models do.<br>
https://spacy.io/models/en#en_core_web_md<br>
https://spacy.io/models/en#en_core_web_lg

In [None]:
!python -m spacy download en_core_web_md

In [None]:
import spacy
nlp = spacy.load('en_core_web_md')

We can see the number of word vectors in the package through the **metadata**.<br>
https://spacy.io/api/language#meta

In [None]:
nlp.meta['vectors']

We can retrieve a vocabulary entry (a **Lexeme**) and view its raw vector and shape.<br>
https://spacy.io/api/lexeme

In [None]:
pizza = nlp.vocab['pizza']
print(pizza.vector.shape)
pizza.vector

spaCy **Lexemes** have a *similarity* method to compare vectors.<br>
https://spacy.io/api/lexeme#similarity

In this case, we can see related words having a higher similarity measure...

In [None]:
print(pizza.similarity(nlp.vocab['tomato']))
print(pizza.similarity(nlp.vocab['sauce']))
print(pizza.similarity(nlp.vocab['cheese']))

...relative to often unrelated words.

In [None]:
print(pizza.similarity(nlp.vocab['gorilla']))
print(pizza.similarity(nlp.vocab['tree']))
print(pizza.similarity(nlp.vocab['yoga']))
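By default, spaCy's similarity estimate is the cosine similarity between the two vectors. Here's a minimal sketch of that calculation in plain Python. The `cosine_similarity` helper and the three-dimensional toy vectors are made up for illustration and are not real spaCy vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption for this sketch: similarity with a zero vector is 0.
        return 0.0
    return dot / (norm_a * norm_b)

# Toy vectors (NOT real embeddings).
pizza_vec = [0.9, 0.8, 0.1]
cheese_vec = [0.8, 0.9, 0.2]
gorilla_vec = [0.1, 0.0, 0.9]

print(cosine_similarity(pizza_vec, cheese_vec))   # vectors point the same way
print(cosine_similarity(pizza_vec, gorilla_vec))  # vectors point different ways
```

Because cosine similarity only looks at direction, two vectors of very different lengths can still score close to 1.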

Out-of-vocabulary (OOV) words get an all-zero vector.

In [None]:
nlp.vocab['womblyboo'].vector

The vector for any sequence (**doc** or **span**) is simply the average of all the token vectors in the sequence.<br>
https://spacy.io/api/doc<br>
https://spacy.io/api/span<br>
https://spacy.io/usage/linguistic-features#vectors-similarity


In [None]:
d1 = nlp("The company has an office in Budapest")
d2 = nlp("We have a support team in Hungary")

In [None]:
d1.vector

The _similarity_ method works for **docs** and **spans** as well.

In [None]:
print(d1[-1:])
print(d2[-1:])
d1[-1:].similarity(d2[-1:])

In [None]:
print(d1[2:5])
print(d2[2:5])
d1[2:5].similarity(d2[2:5])

In [None]:
d1.similarity(d2)

But comparing similarity based on only token averages can come with issues such as false positives. In this example, the two documents compared clearly have little to nothing in common, but the similarity measurement is relatively high (likely due to stop/generic words).

In [None]:
d3 = nlp("dolphins can be pretty mean")

In [None]:
print(d2)
print(d3)
d2.similarity(d3)

Also, creating a sequence vector from token averages throws out word order. So two sentences made of identical words but with different meanings can receive a perfect similarity score.

In [None]:
d1 = nlp("dog bites man")
d2 = nlp("man bites dog")
d1.similarity(d2)
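To see why word order disappears, here's a small sketch that averages toy word vectors by hand. The two-dimensional vectors and the `mean_vector` helper are made up for illustration. Averaging is just a sum divided by a count, so shuffling the words can't change the result.

```python
def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Toy 2-d word vectors (made up, not from any real model).
toy_vectors = {"dog": [1.0, 0.0], "bites": [0.0, 2.0], "man": [3.0, 1.0]}

s1 = mean_vector([toy_vectors[w] for w in "dog bites man".split()])
s2 = mean_vector([toy_vectors[w] for w in "man bites dog".split()])

print(s1)
print(s2)
print(s1 == s2)
```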

Doing quick-and-dirty similarity measures like this is probably best if your corpus is domain-specific and similarity is based more on keywords. The more specific, the better.<br><br>
For example, a corpus of business news headlines would probably work well.

In [None]:
d1 = nlp("Volkswagen intends to double electric car sales in China")
d2 = nlp("First Toyota with solid state battery will be hybrid")
d3 = nlp("Dolphins are the thugs of the ocean")

In [None]:
print(d1.similarity(d2))
print(d1.similarity(d3))

One way to deal with false positives is to extract key information such as part-of-speech tags and entities, and compute similarity based only on those. In this example, documents 1 and 3 score high on similarity, possibly because Chile has a particular dolphin in its waters (that, and the shared stop words).

In [None]:
d1 = nlp("I want to visit Santiago this winter")
d2 = nlp("When is the best time to tour Chile")
d3 = nlp("I wouldn't want to run into a dolphin in a dark alley")

In [None]:
print(d1.similarity(d2))
print(d1.similarity(d3))

Here we have a function which removes stop words and retains only base-form verbs (tokens tagged `VB`), nouns, and entities.

In [None]:
def filter_text(s):
  d = nlp(s)
  tokens = [t.text for t in d if 
            not t.is_stop 
            and (t.tag_ == 'VB' or t.pos_ == 'NOUN' or t.ent_type_ != '')]
  return nlp(" ".join(tokens))
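To make the filter condition easier to follow, here's the same boolean test applied to hand-annotated tokens. `ToyToken` is a hypothetical stand-in for a spaCy token, and the tags below are written by hand, so they may differ from what **en_core_web_md** actually assigns.

```python
from dataclasses import dataclass

@dataclass
class ToyToken:
    # Hypothetical stand-in for the spaCy token attributes used by filter_text.
    text: str
    is_stop: bool
    tag: str
    pos: str
    ent_type: str = ""

def keep(t):
    """Same condition as in filter_text, applied to a ToyToken."""
    return (not t.is_stop) and (t.tag == "VB" or t.pos == "NOUN" or t.ent_type != "")

# Hand-written annotations for illustration only.
tokens = [
    ToyToken("I", True, "PRP", "PRON"),
    ToyToken("want", False, "VBP", "VERB"),
    ToyToken("to", True, "TO", "PART"),
    ToyToken("visit", False, "VB", "VERB"),
    ToyToken("Santiago", False, "NNP", "PROPN", "GPE"),
    ToyToken("this", True, "DT", "DET"),
    ToyToken("winter", False, "NN", "NOUN"),
]

kept = [t.text for t in tokens if keep(t)]
print(kept)
```

Note that with these annotations, "want" is dropped even though it's a verb, because its fine-grained tag is `VBP` rather than `VB`. Checking `pos == "VERB"` instead would keep every verb form.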

If we filter the same three sentences, the similarity scores become more sensible. Document 1's similarity to both other documents drops, but its score against document 3 is now clearly lower than its score against document 2.

In [None]:
d1 = filter_text("I want to visit Santiago this winter")
d2 = filter_text("When is the best time to tour Chile")
d3 = filter_text("I wouldn't want to run into a dolphin in a dark alley")

In [None]:
print(d1.similarity(d2))
print(d1.similarity(d3))

Also, simple similarity doesn't capture things like *intent*. In this example, someone interested in visiting Santiago or the best time to tour Chile might be interested in cheap flight tickets to the region as well.<br><br>
This isn't a fair comparison because the method isn't built for that, but it's just something to keep in mind. The resulting similarity scores actually aren't too bad, but having additional metadata like where the person currently is or doing specific preprocessing would be useful.

In [None]:
d4 = filter_text("Discount flight tickets to South America")

In [None]:
print(d1.similarity(d4))
print(d2.similarity(d4))

You can modify spaCy's vectors, add your own, and also load third-party vectors rather than using spaCy's built-in vectors:<br>
https://spacy.io/api/vectors

# Using Third-Party Vectors

There are a variety of pretrained, static word vector packages out there. In this section, we'll use the **Google News** vectors, a collection of three million 300-dimensional word vectors trained on roughly 100 billion words from a Google News corpus (released in 2013).<br><br>
We could technically load these vectors into spaCy, but we'll use **gensim** here instead because it's easier and the library offers an API to do a few cool things with word vectors.


In [None]:
# Upgrade gensim just in case.
!pip install -U gensim==4.*

We'll need to first download the actual word vectors. It's over a gigabyte but will fit well within the space constraints of our environment.

In [None]:
!wget -P /root/input/ -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"

In [None]:
embedding_file = '/root/input/GoogleNews-vectors-negative300.bin.gz'

Next, we'll have **gensim** load the vectors through the **KeyedVectors** module which will enable us to look up vectors by tokens and indices.<br>
https://radimrehurek.com/gensim/models/keyedvectors.html
<br><br>
To save time and space, we'll limit ourselves to 200,000 word vectors for now.

In [None]:
from gensim.models.keyedvectors import KeyedVectors

In [None]:
%%time
word_vectors = KeyedVectors.load_word2vec_format(embedding_file, binary=True, limit=200000)

Retrieving a word's vector is a matter of using a token as a key.

In [None]:
word_vectors['cell']

The *most_similar* method returns the words with the closest vectors.<br>
https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.most_similar

In [None]:
word_vectors.most_similar(positive=['cell'], topn=10)

We can also add vectors first, and retrieve the words most similar to that summation. Here, we're adding the vectors for 'cell' and 'phone' and retrieving the vectors closest to that result. 

In [None]:
word_vectors.most_similar(positive=['cell', 'phone'], topn=10)

Given a collection of words, the *doesnt_match* method returns the word that doesn't go with the rest (i.e. the word whose vector is furthest from the mean of the vectors in the collection).<br>
https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.doesnt_match

In [None]:
word_vectors.doesnt_match(["apple", "orange", "hamburger", "banana", "kiwi"])
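Here's a rough sketch of the idea behind *doesnt_match* in plain Python: normalize every vector to unit length, average them, and return the word least similar to that average. This is my reading of the approach rather than gensim's actual code, and the two-dimensional vectors are made up.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(vectors):
    """Return the word whose unit vector is least similar to the mean direction."""
    units = {word: unit(vec) for word, vec in vectors.items()}
    dims = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dims)])
    # The dot product of unit vectors is their cosine similarity.
    return min(units, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

# Toy vectors (made up): three "fruit-like" directions and one outlier.
toy_vectors = {
    "apple": [0.9, 0.1],
    "orange": [0.8, 0.2],
    "banana": [0.85, 0.15],
    "hamburger": [0.1, 0.9],
}

print(odd_one_out(toy_vectors))
```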

We can see the power of context in this example with 'Toyota' being correctly identified as the odd one out.

In [None]:
word_vectors.doesnt_match(["Microsoft", "Apple", "Toyota", "Amazon", "Netflix", "Google"])

Visualizing word vectors is straightforward and can offer insights into what kind of contexts the training algorithm picked up.<br><br>
Because these word vectors have a dimension of 300, we need to reduce them down to two dimensions to plot them on a regular graph. This can be done through **Principal Component Analysis (PCA)**:<br>
https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html<br>
<br>
Here, we're plotting the words we considered in the slides.

In [None]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy as np

In [None]:
def display_pca_scatterplot(model, words):        
    word_vectors = np.array([model[w] for w in words])

    twodim = PCA().fit_transform(word_vectors)[:,:2]
    
    plt.figure(figsize=(10,10))
    plt.scatter(twodim[:,0], twodim[:,1], edgecolors='k', c='r', s=128)
    for word, (x,y) in zip(words, twodim):
        plt.text(x+0.05, y+0.05, word)

In [None]:
display_pca_scatterplot(word_vectors, ['swim', 'swimming', 'cat', 'dog', 'feline', 'road', 'car', 'bus'])

We can even solve analogies (to a limited extent) with vector arithmetic.<br><br>
Here, we're solving the analogy:<br>
*Rome is to Italy as London is to* \_\_\_\_\_\_\_\_\_\_.<br><br>
Arithmetically, this is Italy + London - Rome.


In [None]:
word_vectors.most_similar(positive=['Italy', 'London'], negative=['Rome'], topn=3)
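The arithmetic can be sketched with toy vectors. Below, the first axis loosely encodes "which country" and the second "city vs. country". All values and the `analogy` helper are made up for illustration. gensim itself works with unit-normalized vectors and also leaves the input words out of the results, which this sketch imitates by skipping them as candidates.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy 2-d vectors (made up).
toy_vectors = {
    "Rome": [1.0, 1.0],
    "Italy": [1.0, 3.0],
    "London": [3.0, 1.0],
    "Britain": [3.0, 3.0],
    "Paris": [5.0, 1.0],
}

def analogy(positive, negative, vocab):
    """Add the positive vectors, subtract the negative ones, return the closest other word."""
    dims = len(next(iter(vocab.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, vocab[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vocab[word])]
    candidates = [w for w in vocab if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(vocab[w], target))

print(analogy(["Italy", "London"], ["Rome"], toy_vectors))
```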

Visualizing it can help with geometric intuition.

In [None]:
display_pca_scatterplot(word_vectors, ['Rome', 'Italy', 'London', 'Britain'])

# Using Pretrained Word Vectors for Classification

In this section, we'll train a **Keras** model to use these Google News vectors to perform sentiment analysis on a bunch of **Yelp** reviews.
<br><br>
For this model, we'll increase the number of word vectors loaded to 1,000,000.



In [None]:
%%time
word_vectors = KeyedVectors.load_word2vec_format(embedding_file, binary=True, limit=1000000)

The dataset we'll use is *Yelp Polarity Reviews*, a collection of ~600,000 reviews split into a training set (560,000) and a test set (38,000).<br><br>
The original Yelp reviews use a five-star rating system. The ratings in this dataset have been modified to simply be negative (label==1) or positive (label==2).<br>
https://www.tensorflow.org/datasets/catalog/yelp_polarity_reviews<br><br>
TensorFlow comes with a datasets loader but we're going to download the file manually and process the data ourselves for completeness.

In [None]:
!wget -P /root/input/ -c "https://s3.amazonaws.com/fast-ai-nlp/yelp_review_polarity_csv.tgz"

Extracting the archive creates a *yelp_review_polarity_csv* folder containing *train.csv* and *test.csv* in the current working directory (*/content* by default on Colab).

In [None]:
!tar xvzf /root/input/yelp_review_polarity_csv.tgz

# Show current working directory.
!pwd

The **Pandas** library makes it simple to load a CSV file into memory and manipulate the data.<br>
https://pandas.pydata.org/<br>
https://pandas.pydata.org/docs/<br>
https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html?highlight=read_csv

In [None]:
import pandas as pd

Here, we're loading the CSV into a Pandas **dataframe** (sort of like an in-memory table) and giving the columns names.<br>
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html?highlight=dataframe#pandas.DataFrame

In [None]:
yelp_train = pd.read_csv('yelp_review_polarity_csv/train.csv', names=['sentiment', 'review'])
yelp_train.shape

We can get a quick view of the data through the *head* method.<br>
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html

In [None]:
yelp_train.head()

To save on training time, we'll train on 100,000 reviews rather than the full set. To do that, we'll shuffle the dataset using the *sample* method and *copy* the first 100,000 entries. The reason to shuffle first is to ensure we get a mix of reviews from a variety of businesses (in case the data is sorted in some way).<br>
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html<br>
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.copy.html


In [None]:
TRAIN_SIZE = 100000
yelp_train = yelp_train.sample(frac=1, random_state=1)[:TRAIN_SIZE].copy()
yelp_train.shape

The next thing to do is adjust the labels. This is a **binary classification problem**, so our model's output layer will be a single unit with a **sigmoid** activation function. This function's output will be between 0 and 1 which is then compared against the training label. But the labels are currently 1 for negative, and 2 for positive, which is going to cause problems when calculating the loss.<br><br>
So we'll simply replace the 1s with 0s, and 2s with 1s using the *replace* method.<br>
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html
<br><br>
Alternatively, we could keep the labels as-is and treat this as a **multiclass classification** problem with two labels and use a **softmax**, but we would then need to **one-hot encode** the labels.


In [None]:
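# Map 1 (negative) -> 0 and 2 (positive) -> 1 in a single pass.
# Assigning the result back avoids in-place modification of a column selection,
# which newer pandas versions warn about.
yelp_train['sentiment'] = yelp_train['sentiment'].replace({1: 0, 2: 1})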
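To see why labels of 1 and 2 would cause problems, here's the binary cross-entropy formula for a single example, written out in plain Python. The formula assumes the label is either 0 or 1. The probability value is made up.

```python
import math

def binary_cross_entropy(y, p):
    """Loss for one example with label y and predicted probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

p = 0.9  # The model is fairly confident the review is positive.

loss_with_label_1 = binary_cross_entropy(1, p)  # label encoded as 0/1
loss_with_label_2 = binary_cross_entropy(2, p)  # label left as 2

print(loss_with_label_1)  # a small positive loss
print(loss_with_label_2)  # a negative "loss", which makes no sense
```

With a label of 2, the `(1 - y)` term flips sign, so the optimizer would be rewarded for the wrong thing.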

In [None]:
yelp_train.head()

As we've done throughout this course, we'll create train/validation splits.<br>
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [None]:
from sklearn.model_selection import train_test_split
yelp_train_split, yelp_val_split = train_test_split(yelp_train, train_size=0.85, random_state=1)

In [None]:
# Set up training data.
train_reviews = yelp_train_split['review']
y_train = np.array(yelp_train_split['sentiment'])

# Set up validation data.
val_reviews = yelp_val_split['review']
y_val = np.array(yelp_val_split['sentiment'])

A quick sanity check to see how our data is distributed (e.g. balanced or skewed).

In [None]:
import collections
collections.Counter(y_train)

Because we're relying more on richer encodings (in this case, word vectors), we won't perform as much preprocessing this time around. We'll stick with using the regular Keras **tokenizer** and just filter out numbers and certain symbols.<br>
https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer<br><br>
We'll also have the tokenizer limit itself to tokenizing only the most frequent 20,000 words. This way, the model will focus on the most frequent descriptive sentiment words.

In [None]:
from tensorflow import keras
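# The backslash is escaped so the filter string contains a literal '\' without an invalid-escape warning.
tokenizer = keras.preprocessing.text.Tokenizer(num_words=20000, filters='0123456789!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~', lower=True)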

Build the vocabulary.

In [None]:
%%time
tokenizer.fit_on_texts(train_reviews)

The next step is to vectorize our reviews. In the [_Neural Network Foundations_](https://github.com/nitinpunjabi/nlp-demystified/blob/main/notebooks/nlpdemyst_neural_network_foundations.ipynb) notebook, we used the *texts_to_matrix* method to turn text into binary bags of words.<br><br>
Here, we're going to use the *texts_to_sequences* method to turn each review into a sequence of integers, with each integer representing its corresponding token.<br>
https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer#texts_to_sequences



In [None]:
%%time
x_train = tokenizer.texts_to_sequences(train_reviews)

In [None]:
# The first review in the training set, vectorized.
print(x_train[0])
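Conceptually, the tokenizer builds a word-to-integer index ordered by frequency, then swaps each word for its integer. Here's a simplified sketch of that idea in plain Python. The helper names mirror the Keras ones but this is not the Keras implementation, and it only splits on spaces.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an integer, 1 = most frequent (0 is left free for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words=None):
    """Replace words with their indices, dropping unknown or too-rare words."""
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            i = word_index.get(word)
            # Like Keras, keep only indices below num_words when it is set.
            if i is not None and (num_words is None or i < num_words):
                seq.append(i)
        sequences.append(seq)
    return sequences

toy_reviews = ["great food great service great", "bad food"]
toy_index = fit_word_index(toy_reviews)

print(toy_index)
print(texts_to_sequences(toy_reviews, toy_index))
print(texts_to_sequences(["great pizza food"], toy_index))  # 'pizza' is unseen, so it's dropped
```

Words that never appeared during fitting simply vanish from the sequence, which is why the reconstructed review can be shorter than the original.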

We can look up the corresponding tokens using the tokenizer's *index_word* dict. Here are the tokens corresponding to the first three integers from the first vectorized review.

In [None]:
[tokenizer.index_word[x] for x in x_train[0][:3]]

We can also convert the integer sequence back to text using the *sequences_to_texts* method, and compare it against the original text.<br>
https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer#sequences_to_texts

In [None]:
# Review reconstructed from integer sequence.
tokenizer.sequences_to_texts([x_train[0]])

In [None]:
# Original review text.
train_reviews.iloc[0]

Some models and situations require us to **pad** our sequences to the same length. While that's not the case here, it can still be beneficial for all our inputs (and consequently, our batches) to be of uniform size to help with optimizations.<br><br>
In this case, we'll make all our reviews 200 tokens in length (in practice, you can choose a number based on some analysis). So the reviews longer than 200 tokens will be truncated, while the reviews shorter than 200 will be padded with zeroes.<br>
https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences

In [None]:
MAX_REVIEW_LEN = 200
x_train = keras.preprocessing.sequence.pad_sequences(x_train, maxlen=MAX_REVIEW_LEN)

In [None]:
print(x_train[0])
print(x_train[1])
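By default, *pad_sequences* pads and truncates at the *front* of each sequence (`padding='pre'`, `truncating='pre'`), so short reviews get leading zeroes and long reviews lose their earliest tokens. Here's a plain-Python sketch of that default behaviour. The `pad_or_truncate` helper is made up for illustration and assumes `maxlen` is at least 1.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Keep the last `maxlen` items, or left-pad with `value` up to `maxlen`."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

short_review = pad_or_truncate([5, 8, 2], maxlen=5)
long_review = pad_or_truncate([5, 8, 2, 9, 4, 7, 1], maxlen=5)

print(short_review)
print(long_review)
```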

Our training set is prepared. We can now also vectorize and pad our validation set.

In [None]:
x_val = tokenizer.texts_to_sequences(val_reviews)
x_val = keras.preprocessing.sequence.pad_sequences(x_val, maxlen=MAX_REVIEW_LEN)

Now we need to incorporate the Google News vectors (currently loaded into gensim) into our Keras model. What we'll do is create an embedding matrix that maps each tokenizer integer to its respective word vector.<br><br>
For example, here's the index for the word "good" from the Keras tokenizer and the word vector for "good" from gensim. We want a matrix which maps the index to the vector.


In [None]:
print(tokenizer.word_index['good'])

In [None]:
# Part of the vector for the word 'good'.
print(word_vectors['good'][:50])

We'll create this embedding matrix by first initializing a matrix of zeros, then looping over every word in the tokenizer vocabulary and:
1. Checking if the word has a corresponding vector in gensim.
2. If it does, then copy the vector into the matrix row corresponding to the word's index.

In [None]:
# + 1 to account for padding token.
num_tokens = len(tokenizer.word_index) + 1

# Initialize a matrix of zeroes of size: vocabulary x embedding dimension.
embedding_dim = 300
embedding_matrix = np.zeros((num_tokens, embedding_dim))

for word, i in tokenizer.word_index.items():
  if word_vectors.has_index_for(word):
    embedding_matrix[i] = word_vectors[word].copy()


In [None]:
# Quick visual check.
print(embedding_matrix[tokenizer.word_index['good']][:50])

We're ready to build our first model using pretrained word vectors. The first layer we'll add is a Keras **embedding** layer which is essentially a trainable lookup table/matrix.<br>
https://keras.io/api/layers/base_layer/#layer-class<br>
https://keras.io/api/layers/core_layers/embedding/<br><br>
In this case, we'll populate the **embedding** layer with the embedding matrix we created, and set *trainable* to True. This means we'll allow the model to adjust/fine-tune the word vectors as needed for greater accuracy. This corresponds to one of the scenarios we covered in the slides.

In [None]:
# Import layers from the same Keras package used above to avoid mixing implementations.
from tensorflow.keras import layers

In [None]:
embedding_layer = layers.Embedding(
    num_tokens,
    embedding_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    input_length=MAX_REVIEW_LEN,
    trainable=True
)
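At its core, an embedding layer is a row lookup: each integer in the input sequence is swapped for the matching row of the matrix. Here's a tiny sketch with a made-up matrix (the values and the `embed` helper are for illustration only).

```python
# Toy embedding matrix: row i holds the vector for token index i.
toy_matrix = [
    [0.0, 0.0],  # 0: padding
    [0.2, 0.7],  # 1
    [0.9, 0.1],  # 2
    [0.4, 0.4],  # 3
]

def embed(sequence, matrix):
    """Swap every integer in the sequence for its row in the matrix."""
    return [matrix[i] for i in sequence]

embedded = embed([0, 0, 2, 1], toy_matrix)
print(embedded)
```

When the layer is trainable, backpropagation updates these rows just like any other weights.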

We'll use a simple architecture for this model. Here are a few things to note:<br>
1. Each training example is a sequence of *integers* which gets converted to a sequence of *vectors*, but subsequent layers expect one vector per example. So we're inserting a **GlobalAveragePooling1D** layer after the embedding layer to average out all the word vectors into a single vector, before sending it further into the network. For classification, this can be pretty effective as a base model approach.
2. There was no science behind choosing 128 units in the first hidden layer and 64 units in the second hidden layer. The intuition was that signal would be distilled from 300 dimensions down to 128 dimensions, then down to 64 dimensions before going to output.<br>

https://keras.io/api/layers/pooling_layers/global_average_pooling1d/
<br><br>
When we call the model's *summary* method, note how there are no params for the **GlobalAveragePooling1D** layer.

In [None]:
model = keras.Sequential()
model.add(embedding_layer)
model.add(layers.GlobalAveragePooling1D())
model.add(layers.Dense(128, activation='relu'))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.summary()
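The parameter counts in the summary can be checked by hand. An embedding layer has one weight per matrix entry, a dense layer has one weight per input-unit pair plus one bias per unit, and average pooling has nothing to learn. A small sketch (the helper names and the vocabulary size of 50,000 are made up for illustration):

```python
def embedding_params(num_tokens, embedding_dim):
    """One weight for every entry in the lookup table."""
    return num_tokens * embedding_dim

def dense_params(n_inputs, n_units):
    """One weight per input-unit pair, plus one bias per unit."""
    return n_inputs * n_units + n_units

hidden_1 = dense_params(300, 128)
hidden_2 = dense_params(128, 64)
output = dense_params(64, 1)

print(hidden_1, hidden_2, output)
# With a hypothetical 50,000-token vocabulary, the embedding dominates everything else.
print(embedding_params(50_000, 300))
```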

We won't use **early stopping** for this run. This way, we'll be able to compare metrics between the train and validation sets.

In [None]:
history = model.fit(x_train, y_train, epochs=20, batch_size=512, validation_data=(x_val, y_val))

In [None]:
def plot_train_vs_val_performance(history):
  training_losses = history.history['loss']
  validation_losses = history.history['val_loss']

  training_accuracy = history.history['accuracy']
  validation_accuracy = history.history['val_accuracy']

  epochs = range(1, len(training_losses) + 1)

  import matplotlib.pyplot as plt
  fig, (ax1, ax2) = plt.subplots(2)
  fig.set_figheight(15)
  fig.set_figwidth(15)
  fig.tight_layout(pad=5.0)

  # Plot training vs. validation loss.
  ax1.plot(epochs, training_losses, 'bo', label='Training Loss')
  ax1.plot(epochs, validation_losses, 'b', label='Validation Loss')
  ax1.title.set_text('Training vs. Validation Loss')
  ax1.set_xlabel('Epoch')
  ax1.set_ylabel('Loss')
  ax1.legend()

  # Plot training vs. validation accuracy.
  ax2.plot(epochs, training_accuracy, 'bo', label='Training Accuracy')
  ax2.plot(epochs, validation_accuracy, 'b', label='Validation Accuracy')
  ax2.title.set_text('Training vs. Validation Accuracy')
  ax2.set_xlabel('Epoch')
  ax2.set_ylabel('Accuracy')
  ax2.legend()

  plt.show()

In [None]:
plot_train_vs_val_performance(history)

We'll initialize a new embedding layer and model, then train for only as many epochs as it took for the validation loss to start diverging from the training loss in the plot above.

In [None]:
embedding_layer = layers.Embedding(
    num_tokens,
    embedding_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    input_length=MAX_REVIEW_LEN,
    trainable=True
)

model = keras.Sequential()
model.add(embedding_layer)
model.add(layers.GlobalAveragePooling1D())
model.add(layers.Dense(128, activation='relu'))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=<DESIRED_EPOCHS>, batch_size=512)
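If you'd rather not eyeball the plot, one simple heuristic (a substitute for reading off the divergence point, not the same thing) is to pick the epoch with the lowest validation loss from the first run. The losses below are made up. In the notebook, they would come from `history.history['val_loss']`.

```python
def best_epoch(val_losses):
    """1-based epoch number with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__) + 1

# Made-up validation losses for illustration.
toy_val_losses = [0.52, 0.41, 0.36, 0.34, 0.35, 0.37, 0.40]

chosen_epochs = best_epoch(toy_val_losses)
print(chosen_epochs)
```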

Now that we have a trained model, let's try it on the test data. As we did with the training data, we'll:
1. Replace the labels with 0 for negative sentiment, and 1 for positive.
2. Convert the reviews into a sequence of integers and pad/truncate each review to a fixed length.

In [None]:
yelp_test = pd.read_csv('yelp_review_polarity_csv/test.csv', names=['sentiment', 'review'])

In [None]:
# Map 1 (negative) -> 0 and 2 (positive) -> 1, as we did for the training set.
yelp_test['sentiment'] = yelp_test['sentiment'].replace({1: 0, 2: 1})
yelp_test

In [None]:
y_test = np.array(yelp_test['sentiment'])
y_test

In [None]:
x_test = tokenizer.texts_to_sequences(yelp_test['review'])

In [None]:
x_test = keras.preprocessing.sequence.pad_sequences(x_test, maxlen=MAX_REVIEW_LEN)

In [None]:
model.evaluate(x_test, y_test)

Not bad for a conceptually simple model: we average a review's word vectors, pass the result through a few plain hidden layers, and finish with a sigmoid, with no regularization and default settings for model components (e.g. the optimizer).<br><br>
We can now use the model for predictions.

In [None]:
def sentiment(reviews):
  seqs = tokenizer.texts_to_sequences(reviews)
  seqs = keras.preprocessing.sequence.pad_sequences(seqs, maxlen=MAX_REVIEW_LEN)
  return model.predict(seqs)


In [None]:
# Real reviews from Google Reviews.
pos_review = "The best seafood joint in East Village San Diego!  Great lobster roll, great fish, great oysters, great bread, great cocktails, and such amazing service.  The atmosphere is top notch and the location is so much fun being located just a block away from Petco Park (San Diego Padres Stadium)."
neg_review = "A thoroughly disappointing experience. When you book a Marriott you expect a certain standard. Albany falls way short. Room cleaning has to be booked 24 hours in advance but nobody thought to mention this at check in. The hotel is tired and needs a face-lift. The only bright light in a sea of mediocrity were the pancakes at breakfast. Sadly they weren't enough to save the experience. If you travel to Albany, then do yourself a big favour and book the Westin."

In [None]:
sentiment([pos_review, neg_review])
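The model outputs one sigmoid probability per review. To turn those into labels, we can apply a cut-off. The 0.5 threshold, the `to_labels` helper, and the probabilities below are assumptions for illustration.

```python
def to_labels(probabilities, threshold=0.5):
    """Map sigmoid outputs to 'positive'/'negative' using a cut-off."""
    return ["positive" if p >= threshold else "negative" for p in probabilities]

# Made-up probabilities standing in for the flattened output of model.predict.
labels = to_labels([0.97, 0.08])
print(labels)
```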

# Training New Embeddings and a Model at the Same Time

For this last model, rather than using pretrained embeddings, we'll start with a **random** embedding matrix and let the model come up with its own vectors simultaneously while fitting the training data.<br><br>
We'll also use **early stopping**, but otherwise keep everything else the same.

In [None]:
model = keras.Sequential()

# The 'trainable' property is True by default.
model.add(layers.Embedding(input_dim=num_tokens, 
                           output_dim=embedding_dim, 
                           input_length=MAX_REVIEW_LEN))


model.add(layers.GlobalAveragePooling1D())
model.add(layers.Dense(128, activation='relu'))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

es_callback = keras.callbacks.EarlyStopping(monitor='val_loss', patience=3)
history = model.fit(x_train, y_train, epochs=20, batch_size=512, validation_data=(x_val, y_val), callbacks=[es_callback])
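The early stopping rule itself is simple: keep track of the best validation loss so far, and stop once `patience` epochs in a row fail to beat it. Here's a plain-Python sketch of that logic with made-up losses. It's a simplification of the Keras callback, which has more options.

```python
def stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses)  # Patience never ran out.

# Made-up validation losses: the best value is at epoch 3.
toy_val_losses = [0.50, 0.40, 0.35, 0.36, 0.37, 0.38, 0.30]

stopped_at = stopping_epoch(toy_val_losses, patience=3)
print(stopped_at)
```

Note that unless `restore_best_weights=True` is passed to the callback, the model keeps the weights from the final epoch rather than from the best one.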

In [None]:
model.evaluate(x_test, y_test)

In this case, it looks like we get comparable performance between fine-tuning pretrained vectors and training embeddings from scratch as part of the model, likely because of the nature and amount of the data.

# Try This

In our first model, we used pretrained vectors in the **embedding layer** and set the *trainable* property to **True**, allowing the model to fine-tune the word vectors.<br><br>
Instantiate the same model but this time, set the *trainable* property in the **embedding layer** to **False**. What happens to training performance? Does the training speed increase or decrease? What happens if you try to add some regularization like dropout?

In [None]:
# Instantiate the embedding layer.


model = keras.Sequential()

# Add layers.


# Compile model.


# Call fit.


# Evaluate the model.


# Alternative Static Embedding Algorithms

## GloVe
**GloVe (Global Vectors for Word Representation)** is another algorithm for creating static word vectors. You can read the original GloVe paper and download pretrained word vectors here:<br>
https://nlp.stanford.edu/projects/glove/

## Doc2Vec
An algorithm which represents a document as a dense vector and addresses some weaknesses of bag-of-words models.<br>
https://arxiv.org/abs/1405.4053<br>
https://radimrehurek.com/gensim/models/doc2vec.html<br>

## fastText
An alternative approach to creating embeddings. Instead of assigning a vector to each _word_ (e.g. a separate vector each for "dog" and "dogs"), a vector is assigned to each _subword_. For fastText, a subword is defined as a character n-gram.
<br><br>
So if n=3, then a word like "hello" would result in vectors for "<he", "hel", "ell", "llo", "lo>" (note that "<" and ">" are special boundary characters marking the start and end of the word). The vector for "hello" would be the sum of all the above vectors. This helps deal with OOV situations because vectors can still be assembled for unseen words as long as their n-grams exist in the vocabulary.<br>
https://fasttext.cc/<br>
https://radimrehurek.com/gensim/models/fasttext.html
<br><br>
**We'll cover subword tokenization in greater detail later in the course.**
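As a quick illustration of the fastText n-gram idea described above, here's a plain-Python sketch that extracts character n-grams for a single value of n. The `char_ngrams` helper is made up for illustration. As far as I know, fastText uses a range of n-gram lengths by default and also keeps a vector for the whole word.

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]

hello_ngrams = char_ngrams("hello")
print(hello_ngrams)
```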