<a href="https://www.kaggle.com/code/adwaitkesharwani/get-going-with-glove-embedding?scriptVersionId=111527770" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [1]:
import numpy as np
import pandas as pd
from tqdm import tqdm

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout, Bidirectional, LSTM, GlobalMaxPool1D, Embedding
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder


Every time we work with text data, we must represent the text in some mathematical form, otherwise the algorithm won't be able to handle the data. There are many ways to do this transformation, and one among them is GloVe embeddings. In this notebook, we will come a step closer to GloVe embedding and apply it to a binary classification problem.

There are a variety of pre-trained GloVe word embeddings available for download. More information about the training corpus of different Glove embedding can be found on <a href=https://nlp.stanford.edu/projects/glove/> this</a> website. In this notebook, we will use the glovetwitter27b50d embedding, which has 50 dimensions and was trained on 2B tweets from Twitter.

The embedding is available as a text file where each line has a string containing a word and its vector representation. We will convert the content of this text file into a dictionary.

In [2]:
# Read the text file
glovetwitter27b50d = "../input/glovetwitter27b50dtxt/glove.twitter.27B.50d.txt"
file = open(glovetwitter27b50d)
glovetwitter27b50d = file.readlines()

In [3]:
# Convert the text file into a dictionary
def ConvertToEmbeddingDictionary(glovetwitter27b50d):
    embedding_dictionary = {}
    for word_embedding in tqdm(glovetwitter27b50d):
        word_embedding = word_embedding.split()
        word = word_embedding[0]
        embedding = np.array([float(i) for i in word_embedding[1:]])
        embedding_dictionary[word] = embedding
    return embedding_dictionary
embedding_dictionary = ConvertToEmbeddingDictionary(glovetwitter27b50d)

100%|██████████| 1193514/1193514 [00:20<00:00, 58352.07it/s]


In [4]:
# Let's look at the embedding of the word "hello."
embedding_dictionary['hello']

array([ 0.28751  ,  0.31323  , -0.29318  ,  0.17199  , -0.69232  ,
       -0.4593   ,  1.3364   ,  0.709    ,  0.12118  ,  0.11476  ,
       -0.48505  , -0.088608 , -3.0154   , -0.54024  , -1.326    ,
        0.39477  ,  0.11755  , -0.17816  , -0.32272  ,  0.21715  ,
        0.043144 , -0.43666  , -0.55857  , -0.47601  , -0.095172 ,
        0.0031934,  0.1192   , -0.23643  ,  1.3234   , -0.45093  ,
       -0.65837  , -0.13865  ,  0.22145  , -0.35806  ,  0.20988  ,
        0.054894 , -0.080322 ,  0.48942  ,  0.19206  ,  0.4556   ,
       -1.642    , -0.83323  , -0.12974  ,  0.96514  , -0.18214  ,
        0.37733  , -0.19622  , -0.12231  , -0.10496  ,  0.45388  ])

We will use the below sample corpus to learn how to transform any text dataset using GloVe embedding.

In [5]:
sample_corpus = ['The woods are lovely, dark and deep',
                 'But I have promises to keep',   
                 'And miles to go before I sleep', 
                 'And miles to go before I sleep']

In [6]:
# This is the maximum number of tokens we wish to consider from our dataset.
# When there are more tokens, the tokens with the highest frequency are chosen.
max_number_of_words = 5

We will use the Keras tokenizer to extract the tokens from our data. Tokenizer assigns an index to each token, and we can convert any text to a sequence of indices using the texts_to_sequences function.

In [7]:
# Note: Keras tokenizer selects only top n-1 tokens if the num_words is set to n
tokenizer = Tokenizer(num_words=max_number_of_words)
tokenizer.fit_on_texts(sample_corpus)
sample_corpus_tokenized = tokenizer.texts_to_sequences(sample_corpus)
print(tokenizer.word_index)

{'and': 1, 'i': 2, 'to': 3, 'miles': 4, 'go': 5, 'before': 6, 'sleep': 7, 'the': 8, 'woods': 9, 'are': 10, 'lovely': 11, 'dark': 12, 'deep': 13, 'but': 14, 'have': 15, 'promises': 16, 'keep': 17}


The below sentence is converted to a list of two tokens because we had set the maximum number of tokens to 5. The indices of tokens 'i' and 'to' are selected because they are among the top 4 frequently occurring tokens.

In [8]:
print("But I have promises to keep: ", sample_corpus_tokenized[1])

But I have promises to keep:  [2, 3]


Now that we have chosen a set of tokens from our text corpus, we must develop an embedding matrix for them. The embedding matrix will have columns equal to the embedding's dimension and rows equal to the number of tokens.

In [9]:
total_number_of_words = min(max_number_of_words, len(tokenizer.word_index))
embedding_matrix = np.zeros((total_number_of_words,50))
for word, i in tokenizer.word_index.items():
    if i >= total_number_of_words: break
    if word in embedding_dictionary.keys():
        embedding_vector = embedding_dictionary[word]
        embedding_matrix[i] = embedding_vector

In [10]:
print(embedding_matrix)

[[ 0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00
   0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00
   0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00
   0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00
   0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00
   0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00
   0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00
   0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00
   0.0000e+00  0.0000e+00]
 [-4.3196e-01 -1.8965e-01 -2.8294e-02 -2.5903e-01 -4.4810e-01  5.3591e-01
   9.4627e-01 -7.8060e-02 -5.4519e-01 -7.2878e-01 -3.0083e-02 -2.8677e-01
  -6.4640e+00 -3.1295e-01  1.2351e-01 -2.4630e-01  2.9458e-02 -8.3529e-01
   1.9647e-01 -1.5722e-01 -5.5620e-01 -2.7029e-02 -2.3915e-01  1.8188e-01
  -1.5156e-01  5.4768e-01  1.3767e-01  2.1828e-01  6.1069e-01 -3.6790e-01
   2.3187e-

Artificial neural networks and ML algorithms can't deal with a variable length of inputs, so we need to convert the embeddings of every input sequence to a fixed size. There are many approaches to do this, but the most simple one is to sum the embedding of every token in a sentence and normalize the vector.

In [11]:
def convertToSentenceVector(sentences):
    new_sentences = []
    for sentence in sentences:
        sentence_vector = []
        for word_index in sentence:
            sentence_vector.append(embedding_matrix[word_index])
        sentence_vector = np.array(sentence_vector).sum(axis=0)
        embedding_vector / np.sqrt((embedding_vector ** 2).sum())
        new_sentences.append(sentence_vector)
    return new_sentences

Below is the 50-dimensional embedding of the first sentence in our training corpus.

In [12]:
sample_corpus_vectorized = convertToSentenceVector(sample_corpus_tokenized)
print(sample_corpus_vectorized[0])

[-0.43196   -0.18965   -0.028294  -0.25903   -0.4481     0.53591
  0.94627   -0.07806   -0.54519   -0.72878   -0.030083  -0.28677
 -6.464     -0.31295    0.12351   -0.2463     0.029458  -0.83529
  0.19647   -0.15722   -0.5562    -0.027029  -0.23915    0.18188
 -0.15156    0.54768    0.13767    0.21828    0.61069   -0.3679
  0.023187   0.33281   -0.18062   -0.0094163  0.31861   -0.19201
  0.35759    0.50104    0.55981    0.20561   -1.1167    -0.3063
 -0.14224    0.20285    0.10245   -0.39289   -0.26724   -0.37573
  0.16076   -0.74501  ]


## Sentiment Classification: IMDB Movie Reviews Dataset

Now we can perform the steps discussed above to get the embeddings for this dataset.

In [13]:
df = pd.read_csv("../input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv")
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [14]:
X = df['review']
y = df['sentiment']

In [15]:
le = LabelEncoder()
y = le.fit_transform(y)
le.classes_

array(['negative', 'positive'], dtype=object)

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=0)

In [17]:
#set the maximum number of tokens to 50000
max_number_of_words = 50000

In [18]:
tokenizer = Tokenizer(num_words=max_number_of_words)
tokenizer.fit_on_texts(X_train)
X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)

In [19]:
total_number_of_words = min(max_number_of_words, len(tokenizer.word_index))
embedding_matrix = np.zeros((total_number_of_words+1,50))
for word, i in tokenizer.word_index.items():
    if i >= total_number_of_words: break
    if word in embedding_dictionary.keys():
        embedding_vector = embedding_dictionary[word]
        embedding_matrix[i] = embedding_vector

In [20]:
X_train = convertToSentenceVector(X_train)
X_test = convertToSentenceVector(X_test)

In [21]:
X_train = pd.DataFrame(X_train)
X_test = pd.DataFrame(X_test)

## Artificial Neural Network

In [22]:
model = Sequential()
model.add(Dense(100, input_shape = (50,), activation = "relu"))
model.add(Dense(1000, activation = "relu"))
model.add(Dropout(0.2))
model.add(Dense(1, activation = "sigmoid"))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 100)               5100      
_________________________________________________________________
dense_1 (Dense)              (None, 1000)              101000    
_________________________________________________________________
dropout (Dropout)            (None, 1000)              0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 1001      
Total params: 107,101
Trainable params: 107,101
Non-trainable params: 0
_________________________________________________________________


2022-11-20 08:03:25.949777: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.


In [23]:
model.fit(X_train, y_train, batch_size=32, epochs=10, validation_split=0.1)

2022-11-20 08:03:26.185506: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fbfe771a650>

In [24]:
y_pred = model.predict(X_test)
y_pred = y_pred.round()

In [25]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.81      0.78      0.79      2553
           1       0.78      0.80      0.79      2447

    accuracy                           0.79      5000
   macro avg       0.79      0.79      0.79      5000
weighted avg       0.79      0.79      0.79      5000



The classification report produced by the artificial neural network is not very good because the ANN doesn't work well with sequential data. To get a better result, we will now use a Bi-directional Long Short-Term Memory (LSTM) network on the same dataset. A Bi-directional LSTM is a type of recurrent neural network and can handle inputs of varying sizes.

## Bi-directional LSTM

In [26]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=0)

In [27]:
max_number_of_words = 50000
max_length = 100

In [28]:
tokenizer = Tokenizer(num_words=max_number_of_words)
tokenizer.fit_on_texts(X_train)
X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)
X_train = pad_sequences(X_train, maxlen=max_length)
X_test = pad_sequences(X_test, maxlen=max_length)

In [29]:
total_number_of_words = min(max_number_of_words, len(tokenizer.word_index))
embedding_matrix = np.zeros((total_number_of_words+1,50))
for word, i in tokenizer.word_index.items():
    if i >= total_number_of_words: break
    if word in embedding_dictionary.keys():
        embedding_vector = embedding_dictionary[word]
        embedding_matrix[i] = embedding_vector

In [30]:
model = Sequential()
model.add(Embedding(max_number_of_words+1, 50, input_shape = (100,), weights=[embedding_matrix]))
model.add(Bidirectional(LSTM(50, return_sequences=True, dropout=0.1, recurrent_dropout=0.1)))
model.add(GlobalMaxPool1D())
model.add(Dense(50, activation="relu"))
model.add(Dropout(0.1))
model.add(Dense(1, activation="sigmoid"))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 100, 50)           2500050   
_________________________________________________________________
bidirectional (Bidirectional (None, 100, 100)          40400     
_________________________________________________________________
global_max_pooling1d (Global (None, 100)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 50)                5050      
_________________________________________________________________
dropout_1 (Dropout)          (None, 50)                0         
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 51        
Total params: 2,545,551
Trainable params: 2,545,551
Non-trainable params: 0
____________________________________________

In [31]:
model.fit(X_train, y_train, batch_size=32, epochs=10, validation_split=0.1);

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [32]:
y_pred = model.predict(X_test)
y_pred = y_pred.round()

In [33]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.85      0.88      0.86      2553
           1       0.87      0.84      0.85      2447

    accuracy                           0.86      5000
   macro avg       0.86      0.86      0.86      5000
weighted avg       0.86      0.86      0.86      5000



Woohoo! We got a better classification report.<br> 
Now that you are more knowledgeable about GloVe embedding, you are ready to apply it to different NLP issues.