To run this notebook, first download the "glove.6B.50d" file (you can find it with a Google search), place it in the sample folder, and set the path to '/content/sample_data/glove.6B.50d.txt'. A copy of the file is also available in our laptop's local storage.

This notebook is set up to demonstrate how to use pre-trained GloVe word embeddings for a given vocabulary.


This cell imports the libraries the notebook needs: Tokenizer from tensorflow.keras.preprocessing.text for text preprocessing, and numpy for numerical operations. The pad_sequences import (from tensorflow.keras.preprocessing.sequence) is commented out because this notebook does not use it. For reference, pad_sequences makes all sequences the same length by adding zeros to the beginning or end of each one (known as 'padding'); a sequence longer than the desired fixed length is truncated instead. These tools are fundamental for tasks involving natural language processing (NLP).
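To make the padding and truncation idea concrete, here is a minimal pure-Python sketch. The helper name pad_to_length is hypothetical and is not part of Keras; it pads and truncates at the front of each sequence, which, as far as I know, matches the default behaviour of pad_sequences (padding='pre', truncating='pre').

```python
def pad_to_length(sequences, maxlen, value=0):
    """Hypothetical helper: force every sequence to exactly `maxlen` items.

    Short sequences get `value` added at the front; long sequences
    lose items from the front. This is an illustration only, not the
    Keras implementation.
    """
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # Truncate: keep only the last `maxlen` items.
            seq = seq[len(seq) - maxlen:]
        # Pad: add as many fill values as are still missing.
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


result = pad_to_length([[1, 2, 3], [4], [5, 6, 7, 8, 9]], maxlen=4)
print(result)
```

Every inner list in the result has length 4, so the batch can be stored as a rectangular array.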

In [5]:
# code for Glove word embedding
from tensorflow.keras.preprocessing.text import Tokenizer
# from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

This cell defines a Python function called embedding_for_vocab. This function is crucial for creating an embedding matrix. It reads a GloVe embedding file (a file containing words and their corresponding dense vector representations), and for each word in your provided word_index (which maps words to unique integers), it populates an embedding_matrix_vocab with the pre-trained GloVe vector for that word. If a word isn't found in the GloVe file, its vector will remain zeros.
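Because missing words keep an all-zero row, it can be useful to check which vocabulary words were not found. The sketch below uses a small hand-written matrix (plain lists with made-up values and a made-up word, not real GloVe data); the same comprehension should also work on a NumPy matrix, since it only iterates over one row at a time.

```python
# Toy vocabulary and matrix; all values are invented for illustration.
word_index = {"the": 1, "text": 2, "qwertyzz": 3}
matrix = [
    [0.0, 0.0, 0.0],   # row 0: reserved index, never assigned to a word
    [0.1, 0.2, 0.3],   # row 1: "the"
    [0.4, -0.5, 0.6],  # row 2: "text"
    [0.0, 0.0, 0.0],   # row 3: "qwertyzz" was not found, so it stays zero
]

# A word counts as missing when every component of its row is zero.
missing = [word for word, idx in word_index.items() if not any(matrix[idx])]
print("Words without a pre-trained vector:", missing)
```

Note the assumption: a word whose genuine vector happened to be all zeros would also be reported, which is unlikely for real embeddings but worth keeping in mind.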

vocab_size = len(word_index) + 1: This line sets the number of rows in the embedding matrix. It takes the length of word_index (which maps each unique word to an integer) and adds 1. The + 1 is needed because the Keras Tokenizer numbers words starting from 1 and reserves index 0 (commonly used for padding). The largest word index therefore equals len(word_index), and the matrix needs one extra row so that this index is valid.
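The effect of the + 1 can be checked with plain Python. This sketch reuses the word_index printed later in this notebook and a shortened, hypothetical embedding_dim of 3 so the rows stay readable.

```python
word_index = {'the': 1, 'text': 2, 'language': 3,
              'natural': 4, 'prime': 5, 'leader': 6}

vocab_size = len(word_index) + 1      # 6 words + reserved row 0
embedding_dim = 3                     # shortened for illustration only

# One row per index 0..6; plain lists stand in for the NumPy matrix.
matrix = [[0.0] * embedding_dim for _ in range(vocab_size)]

largest_index = max(word_index.values())
print("rows:", len(matrix), "largest index:", largest_index)

# Without the + 1 this assignment would raise an IndexError.
matrix[largest_index] = [1.0, 1.0, 1.0]
```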

In [24]:
def embedding_for_vocab(filepath, word_index,
						embedding_dim):
	# Add 1 because index 0 is reserved (the Tokenizer numbers words from 1)
	vocab_size = len(word_index) + 1
	print('vocab_size', vocab_size)
	embedding_matrix_vocab = np.zeros((
			vocab_size, embedding_dim))
	print('embedding_matrix_vocab',embedding_matrix_vocab)

	with open(filepath, encoding="utf8") as f:
		for line in f:
			word, *vector = line.split()
			if word in word_index:
				idx = word_index[word]
				embedding_matrix_vocab[idx] = np.array(
					vector, dtype=np.float32)[:embedding_dim]

	return embedding_matrix_vocab

line.split(): This part of the code takes the line (a string read from the GloVe file) and splits it into a list of substrings. By default, split() divides the string on any run of whitespace (spaces, tabs, newlines) and discards empty strings, giving a list of strings: the word first, followed by the vector components as text.

word, *vector = ...: This is Python's extended iterable unpacking, which uses the asterisk (*) operator. It works like this:

The first element returned by line.split() is assigned to the variable word.
All the remaining elements from the line.split() list are collected into a new list, which is then assigned to the variable vector.
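The unpacking described above can be tried on a single line. The line below is invented for illustration (it is not taken from the GloVe file) but follows the same "word followed by numbers" layout.

```python
# A made-up line in the same layout as a GloVe file entry.
line = "cat 0.1 -0.2 0.3\n"

# split() drops the trailing newline along with the other whitespace.
word, *vector = line.split()

# The components are still strings at this point; convert them to floats.
values = [float(v) for v in vector]

print(word)     # the first token
print(vector)   # the remaining tokens, as a list of strings
print(values)   # the same tokens as numbers
```

In the function above, the string-to-number conversion is done by np.array(vector, dtype=np.float32) rather than by a list comprehension.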

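tokenizer.fit_on_texts(x): This is the "training" step for the tokenizer. You pass it your text data (in this case, the set x, which contains words like 'text', 'the', 'leader', etc.). Because x is a Python set, a repeated entry such as a second 'the' is dropped before the tokenizer sees it, and the iteration order of a set is not guaranteed, so the exact index given to each word may differ between runs. The fit_on_texts method then: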

Scans through all the words in x.
Identifies all unique words.
Assigns a unique integer index to each unique word, giving lower indices to more frequent words. This mapping is stored in the tokenizer object's word_index attribute (the dictionary printed in the cell output). The indices start from 1; index 0 is reserved and is commonly used for padding.
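The three steps above can be approximated in a few lines of plain Python. This is a simplified sketch, not the Keras implementation: the function name build_word_index is hypothetical, and it only lowercases and splits on whitespace, whereas the real Tokenizer also filters punctuation by default.

```python
from collections import Counter


def build_word_index(texts):
    """Hypothetical stand-in for the indexing step of fit_on_texts."""
    counts = Counter()
    for text in texts:
        # Scan the words and count how often each unique word appears.
        counts.update(text.lower().split())
    # Most frequent word gets index 1; ties keep first-seen order.
    return {word: idx
            for idx, (word, _) in enumerate(counts.most_common(), start=1)}


index = build_word_index(["the cat sat", "The dog"])
print(index)
```

A list is used as input here so that the result is reproducible; with a set, as in this notebook, ties between equally frequent words are broken by the set's iteration order.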

In [27]:
# x = {'text', 'the', 'the', 'prime',
# 	'natural', 'language'}

# # create the dict.
# tokenizer = Tokenizer()
# tokenizer.fit_on_texts(x)

# # number of unique words in dict.
# print("Number of unique words in dictionary=",
# 	len(tokenizer.word_index))
# print("Dictionary is = ", tokenizer.word_index)

Number of unique words in dictionary= 5
Dictionary is =  {'the': 1, 'text': 2, 'language': 3, 'natural': 4, 'prime': 5}


In [26]:
x = {'text', 'the', 'leader', 'prime',
	'natural', 'language'}

# create the dict.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(x)

# number of unique words in dict.
print("Number of unique words in dictionary=",
	len(tokenizer.word_index))
print("Dictionary is = ", tokenizer.word_index)



# matrix for vocab: word_index
embedding_dim = 50
embedding_matrix_vocab = embedding_for_vocab(
	'/content/sample_data/glove.6B.50d.txt', tokenizer.word_index,
embedding_dim)

print("Dense vector is => ",
	embedding_matrix_vocab)

print("Dense vector for first word is => ",
	embedding_matrix_vocab[1])


Number of unique words in dictionary= 6
Dictionary is =  {'the': 1, 'text': 2, 'language': 3, 'natural': 4, 'prime': 5, 'leader': 6}
vocab_size 7
embedding_matrix_vocab [[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.