
# Using Pre-trained Word Embeddings

## 1. Setup and Imports

First, let's install and import the necessary libraries:


In [1]:
!pip install transformers torch gensim numpy scipy matplotlib

import torch
from transformers import AutoTokenizer, AutoModel
import gensim.downloader as api
import numpy as np
from scipy.spatial.distance import cosine
import matplotlib.pyplot as plt





## 2. Sample Sentences

We'll use these sentences for our demonstrations:



In [2]:
sentences = [
    "The cat sat on the mat.",
    "Dogs are man's best friend.",
    "It's raining cats and dogs.",
    "The early bird catches the worm.",
    "Actions speak louder than words.",
    "A picture is worth a thousand words.",
    "Don't judge a book by its cover.",
    "The apple doesn't fall far from the tree.",
    "Time flies like an arrow.",
    "All that glitters is not gold."
]

## 3. BERT Embeddings

Let's use BERT to get contextual embeddings. We average the final-layer token vectors to obtain one vector per sentence:




In [4]:
# Load pre-trained BERT model and tokenizer
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert_model = AutoModel.from_pretrained("bert-base-uncased")

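def get_bert_embedding(sentence):
    # Tokenize one sentence into PyTorch tensors (adds [CLS] and [SEP])
    inputs = bert_tokenizer(sentence, return_tensors="pt", padding=True, truncation=True)
    # Inference only, so skip gradient tracking
    with torch.no_grad():
        outputs = bert_model(**inputs)
    # Mean-pool the final-layer token vectors into a single sentence vector
    return outputs.last_hidden_state.mean(dim=1).squeeze().numpy()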

# Get BERT embeddings for all sentences
bert_embeddings = [get_bert_embedding(sent) for sent in sentences]

print("BERT embedding shape:", bert_embeddings[0].shape)
print(bert_embeddings[0])

BERT embedding shape: (768,)
[-1.81803331e-01 -2.66178459e-01 -2.18866497e-01  2.10887864e-01
  2.84733891e-01 -1.71848714e-01 -1.65881395e-01  5.09736776e-01
 -1.27144992e-01 -1.69706762e-01  3.03522944e-02 -4.69049990e-01
 -3.58982049e-02  1.33982167e-01 -1.17660999e-01 -2.40768567e-01
  1.20715722e-01  5.91536611e-02 -3.91021132e-01  1.07807025e-01
  2.31676280e-01 -2.06526875e-01 -5.21814585e-01  9.92321670e-02
  2.94130504e-01 -2.43794993e-01  7.10860491e-02 -1.43256009e-01
 -5.07237762e-02 -2.29964089e-02  2.10268632e-01 -5.67063875e-02
 -1.49750292e-01 -2.79531151e-01  5.43978959e-02 -8.85230005e-02
  2.97807395e-01  3.17300677e-01 -5.46831906e-01  2.36233160e-01
 -3.62854600e-01 -1.80200204e-01  2.56574489e-02  5.81901133e-01
  4.08479095e-01 -2.32060194e-01  4.06711966e-01 -2.07815111e-01
  6.34876370e-01  1.69491053e-01 -6.29979789e-01  3.35770339e-01
 -2.82218382e-02  1.09117158e-01 -3.31124179e-02  6.67784333e-01
  2.23814592e-01 -3.44666481e-01 -1.05968425e-02  3.83142501e
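
`get_bert_embedding` averages every token vector, which works when sentences are encoded one at a time. If several sentences were batched together with `padding=True`, the padded positions would also enter the mean. The following is a minimal sketch of masked mean pooling. It uses small NumPy arrays as stand-ins for `last_hidden_state` and the tokenizer's `attention_mask`, and the helper name `masked_mean_pool` is hypothetical, not part of the notebook.

```python
import numpy as np

def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per sentence, ignoring padded positions.

    hidden_states: array of shape (batch, seq_len, dim)
    attention_mask: array of shape (batch, seq_len); 1 = real token, 0 = padding
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)  # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(axis=1)                     # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1.0, None)                   # avoid division by zero
    return summed / counts

# Toy stand-in data: 2 sentences, 3 token positions, 2 dimensions.
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # third position is padding
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],      # no padding
])
mask = np.array([[1, 1, 0],
                 [1, 1, 1]])

pooled = masked_mean_pool(hidden, mask)
print(pooled)
```

The padded position in the first toy sentence does not affect its average. The same arithmetic could be applied to the PyTorch tensors returned by the model and tokenizer, under the assumption that sentences are encoded as a batch.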

## 4. GloVe Embeddings

Now let's use pre-trained GloVe embeddings. GloVe assigns one fixed vector per word, so we average the word vectors to represent a sentence:

In [5]:
# Load pre-trained GloVe embeddings
glove = api.load("glove-wiki-gigaword-100")

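def get_glove_embedding(sentence):
    # Whitespace split keeps punctuation attached (e.g. "mat."), so such tokens
    # may not be found in the vocabulary and are then skipped
    words = sentence.lower().split()
    word_embeddings = [glove[word] for word in words if word in glove]
    # Average the word vectors; fall back to a zero vector if no word is known
    return np.mean(word_embeddings, axis=0) if word_embeddings else np.zeros(glove.vector_size)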

# Get GloVe embeddings for all sentences
glove_embeddings = [get_glove_embedding(sent) for sent in sentences]

print("GloVe embedding shape:", glove_embeddings[0].shape)
print(glove_embeddings[0])

GloVe embedding shape: (100,)
[-4.3515600e-02 -2.9748004e-02  5.9098202e-01 -2.4206194e-01
 -9.6999198e-02  3.5600525e-01 -1.7048240e-01  3.8083202e-01
 -4.9104899e-01 -2.9986900e-01  2.0948200e-01  1.4106001e-02
  3.8480201e-01 -1.0425195e-02  2.0291999e-01  2.9041598e-02
  4.3107802e-01 -1.0969679e-01 -1.7927799e-01 -3.6845797e-01
  1.6791800e-01 -3.5751998e-02  3.2425421e-01 -9.8839596e-02
  5.5196798e-01 -7.1866393e-02 -2.4172981e-01 -3.6806980e-01
 -1.7447004e-01  4.5199994e-02  6.3351199e-02  3.1589383e-01
  2.1115419e-01  2.6616600e-01 -3.8537998e-02  3.2990918e-01
 -2.8460804e-02  4.0716800e-01  3.5629180e-01 -9.2135206e-02
 -4.0600601e-01 -2.9943421e-01  2.1148400e-01 -1.8706401e-01
  5.9060035e-03  1.6642201e-01 -1.4540000e-01 -1.3599999e-01
  7.7965088e-02 -4.8302460e-01 -1.1683513e-01  4.7059981e-03
  2.7543801e-01  1.0413139e+00 -6.7047000e-01 -2.3747602e+00
 -5.9869200e-02 -1.7457598e-01  1.5867101e+00  5.2350998e-01
 -5.2707992e-02  8.7856799e-01 -1.8929639e-01  2.619840
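
Splitting on whitespace leaves punctuation attached to words, so a token such as `mat.` is looked up with its trailing period and will typically be skipped. Below is a minimal sketch of a tokenizer that strips punctuation before the lookup. The regular expression, the `tokenize` helper, and the toy vocabulary are assumptions for illustration; the toy set only stands in for the loaded GloVe vocabulary.

```python
import re

def tokenize(sentence):
    """Lowercase, then keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", sentence.lower())

# Hypothetical toy vocabulary standing in for the loaded GloVe vectors.
toy_vocab = {"the", "cat", "sat", "on", "mat"}

sentence = "The cat sat on the mat."
naive_tokens = sentence.lower().split()   # last token is "mat."
clean_tokens = tokenize(sentence)         # last token is "mat"

# Count how many tokens each approach can look up.
naive_hits = [w for w in naive_tokens if w in toy_vocab]
clean_hits = [w for w in clean_tokens if w in toy_vocab]

print(naive_tokens)
print(clean_tokens)
print(len(naive_hits), len(clean_hits))
```

Swapping a tokenizer like this into `get_glove_embedding` would change the averaged vectors, so the printed values in this section would no longer match.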

## 5. FastText Embeddings

Next, let's use pre-trained FastText embeddings, averaged per sentence in the same way:

In [6]:
# Load pre-trained FastText embeddings
fasttext = api.load("fasttext-wiki-news-subwords-300")

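def get_fasttext_embedding(sentence):
    # Same averaging scheme as for GloVe, applied to the FastText vectors
    words = sentence.lower().split()
    # Only tokens found in the vocabulary contribute to the average
    word_embeddings = [fasttext[word] for word in words if word in fasttext]
    return np.mean(word_embeddings, axis=0) if word_embeddings else np.zeros(fasttext.vector_size)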

# Get FastText embeddings for all sentences
fasttext_embeddings = [get_fasttext_embedding(sent) for sent in sentences]

print("FastText embedding shape:", fasttext_embeddings[0].shape)
print(fasttext_embeddings[0])


FastText embedding shape: (300,)
[-1.07459910e-03 -3.25630009e-02  2.54664011e-02  6.10740017e-03
  2.71326397e-02 -2.97526922e-02  2.36619990e-02 -1.69375986e-01
 -2.46400796e-02  3.74299996e-02 -4.69814017e-02 -1.43874720e-01
  6.30912036e-02 -1.86050013e-02  1.07885990e-02  1.56834014e-02
  1.31402612e-01 -7.43899960e-03  7.43768066e-02 -2.35512014e-02
 -2.01811995e-02 -1.78665612e-02  1.87217183e-02  2.70860586e-02
  5.58306053e-02 -4.29212023e-03  6.18999964e-03  2.90717967e-02
  7.63084963e-02 -1.44822001e-02 -1.06967408e-02 -2.07982007e-02
  1.90843996e-02  1.43086817e-02 -5.06734028e-02 -7.34577924e-02
  1.83796026e-02 -3.22851241e-02  1.50565337e-02 -4.59660590e-02
 -2.22694799e-02 -7.13122040e-02 -3.31973806e-02  3.85112013e-03
  5.12128044e-03  6.67500049e-02  1.05209192e-02  1.95667129e-02
 -2.47640023e-03  2.23735608e-02  6.31771535e-02  4.27426770e-02
  3.72897983e-02 -2.54235603e-02 -1.22278549e-01 -1.91269591e-02
 -4.00395989e-02 -8.88400059e-03 -1.19473197e-01 -1.88320
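
`get_glove_embedding` and `get_fasttext_embedding` differ only in the lookup table they use. One possible way to share the logic is sketched below. The helper name `average_word_vectors` and the two-dimensional toy vectors are hypothetical; a plain dict stands in for the loaded vectors, since it supports the same `in` and `[]` operations the notebook relies on.

```python
import numpy as np

def average_word_vectors(sentence, vectors, dim):
    """Mean of the vectors for known words; zero vector if none are known."""
    words = sentence.lower().split()
    found = [vectors[w] for w in words if w in vectors]
    return np.mean(found, axis=0) if found else np.zeros(dim)

# Hypothetical 2-dimensional toy vectors, not real GloVe or FastText values.
toy_vectors = {
    "cat": np.array([1.0, 0.0]),
    "dog": np.array([0.0, 1.0]),
}

known = average_word_vectors("cat dog", toy_vectors, dim=2)
unknown = average_word_vectors("xyzzy", toy_vectors, dim=2)
print(known)
print(unknown)
```

With the notebook's objects, a call such as `average_word_vectors(sent, glove, glove.vector_size)` should behave like the original function. Note that cosine similarity is undefined for the all-zero fallback vector, so a sentence with no known words needs separate handling before comparison.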


## 6. Comparing Embeddings

Let's compare the similarities between sentences using different embedding types:

In [7]:
def cosine_similarity(v1, v2):
    # scipy's cosine() returns a distance, so subtract it from 1
    return 1 - cosine(v1, v2)

def print_most_similar(embeddings, sentences):
    # Full pairwise similarity matrix
    similarities = [[cosine_similarity(e1, e2) for e2 in embeddings] for e1 in embeddings]
    for i, sent in enumerate(sentences):
        # Exclude the sentence itself when picking the best match
        most_similar = max(range(len(sentences)), key=lambda x: similarities[i][x] if x != i else float("-inf"))
        print(f"'{sent}' is most similar to '{sentences[most_similar]}'")

print("BERT similarities:")
print_most_similar(bert_embeddings, sentences)

print("\nGloVe similarities:")
print_most_similar(glove_embeddings, sentences)

print("\nFastText similarities:")
print_most_similar(fasttext_embeddings, sentences)

BERT similarities:
'The cat sat on the mat.' is most similar to 'It's raining cats and dogs.'
'Dogs are man's best friend.' is most similar to 'Actions speak louder than words.'
'It's raining cats and dogs.' is most similar to 'The cat sat on the mat.'
'The early bird catches the worm.' is most similar to 'The apple doesn't fall far from the tree.'
'Actions speak louder than words.' is most similar to 'A picture is worth a thousand words.'
'A picture is worth a thousand words.' is most similar to 'Actions speak louder than words.'
'Don't judge a book by its cover.' is most similar to 'A picture is worth a thousand words.'
'The apple doesn't fall far from the tree.' is most similar to 'All that glitters is not gold.'
'Time flies like an arrow.' is most similar to 'A picture is worth a thousand words.'
'All that glitters is not gold.' is most similar to 'A picture is worth a thousand words.'

GloVe similarities:
'The cat sat on the mat.' is most similar to 'The early bird catches the wor
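
The nested list comprehension in `print_most_similar` calls SciPy once per sentence pair. For larger collections, the same comparison can be computed in one matrix product. The sketch below assumes all embeddings have the same length; the helper names and the toy vectors are hypothetical and not part of the notebook.

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarities for a list of equal-length vectors."""
    X = np.asarray(embeddings, dtype=float)           # (n, dim)
    norms = np.linalg.norm(X, axis=1, keepdims=True)  # (n, 1)
    X_unit = X / np.clip(norms, 1e-12, None)          # guard against zero vectors
    return X_unit @ X_unit.T                          # (n, n)

def most_similar_indices(sim):
    """For each row, index of the highest similarity, excluding the row itself."""
    sim = sim.copy()
    np.fill_diagonal(sim, -np.inf)
    return sim.argmax(axis=1)

# Toy vectors, not real embeddings.
toy = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
]
sim = cosine_similarity_matrix(toy)
nearest = most_similar_indices(sim)
print(nearest)
```

Applied to the notebook's data, `most_similar_indices(cosine_similarity_matrix(bert_embeddings))` would give, for each sentence, the index of its closest neighbour in `sentences`.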

## 8. Conclusion

This notebook demonstrates how to use pre-trained BERT, GloVe, and FastText models to generate embeddings for sentences. We've seen how to load these models, generate embeddings, compare similarities, and visualize the results. The three models produce sentence vectors of different sizes (768, 100, and 300 dimensions, respectively), and, as the output in Section 6 shows, they can disagree about which sentences are most similar.