<a href="https://colab.research.google.com/github/anshupandey/Generative-AI-for-Professionals/blob/main/Hugging_Face_Embeddings_Guide.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Working with Hugging Face Embeddings

## Step 1: Installation of Libraries
First, install `transformers` and `torch`, which are needed to load and run the pre-trained model. Step 5 also uses `scikit-learn` for clustering, so it is installed here as well.

In [1]:
!pip install transformers torch scikit-learn --quiet
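
To confirm the install worked before loading a model, the installed versions can be read back with the standard library. This is a minimal sketch; `installed_versions` is a hypothetical helper name, not something the libraries provide.

```python
from importlib import metadata

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

# Report the libraries this guide relies on
for name, version in installed_versions(["transformers", "torch"]).items():
    print(f"{name}: {version or 'not installed'}")
```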

## Step 2: Load a Pre-trained Model
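Load `bert-base-uncased`, the base BERT model trained on lowercased text. As the printed architecture below shows, it has 12 encoder layers and produces 768-dimensional hidden states, which are the raw material for the embeddings.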

In [2]:
from transformers import BertModel, BertTokenizer
import torch

# Initialize tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Switch the model to evaluation mode, which disables dropout
model.eval()

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
  

## Step 3: Function to Generate Embeddings
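Define a function that tokenizes a text, runs it through the model, and averages the last-layer token vectors into a single 768-dimensional embedding (mean pooling).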

In [3]:
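def get_embedding(text):
    """Return a (1, 768) mean-pooled BERT embedding for a single text."""
    # Tokenize; inputs longer than BERT's 512-token limit are truncated
    encoded_input = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=512)
    # Inference only, so skip gradient tracking
    with torch.no_grad():
        output = model(**encoded_input)
    # Average the last-layer token vectors along the sequence dimension
    embeddings = output.last_hidden_state.mean(dim=1)
    return embeddings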
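
The mean over dimension 1 averages every token position. That is fine for one text at a time, as in this guide, but if several texts were batched with `padding=True`, the padding positions would be averaged in as well. The sketch below uses plain Python and toy numbers to show the mask-weighted average that is commonly used in that case. `masked_mean` is a hypothetical helper for illustration, not part of `transformers`.

```python
def masked_mean(token_vectors, attention_mask):
    """Average only the token vectors whose mask entry is 1.

    token_vectors: list of equal-length lists of floats, one per token position
    attention_mask: list of 0/1 ints, 1 for real tokens and 0 for padding
    """
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# Toy example: two real tokens followed by one padding position
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)
naive = masked_mean(tokens, [1, 1, 1])  # what an unmasked mean would give

print(pooled)  # [2.0, 3.0]
print(naive)   # pulled toward the padding vector
```

In a real batch the same idea would be applied to `last_hidden_state` using the `attention_mask` returned by the tokenizer.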

## Step 4: Using Embeddings for Similarity Comparison
Use cosine similarity to compare how close the texts are in meaning. Scores nearer 1 indicate more similar embeddings.

In [4]:
from torch.nn.functional import cosine_similarity

# Define texts
text1 = 'I love machine learning'
text2 = 'I adore artificial intelligence'
text3 = 'The weather is sunny'

# Get embeddings
embed1 = get_embedding(text1)
embed2 = get_embedding(text2)
embed3 = get_embedding(text3)

# Compute similarity
similarity12 = cosine_similarity(embed1, embed2)
similarity13 = cosine_similarity(embed1, embed3)

print('Similarity between text 1 and text 2:', similarity12.item())
print('Similarity between text 1 and text 3:', similarity13.item())

Similarity between text 1 and text 2: 0.7946856021881104
Similarity between text 1 and text 3: 0.5228810906410217
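
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The following plain-Python sketch shows the computation on toy 2-D vectors. It omits the small `eps` guard that the PyTorch function applies against zero-length vectors, and `cosine` is a hypothetical name used only here.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
print(cosine([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
print(cosine([1.0, 2.0], [2.0, 4.0]))  # parallel vectors, approximately 1
```

Because the measure depends only on direction, scaling a vector does not change its similarity to another. Note also that the unrelated pair above still scores about 0.52, so the scores are most useful relative to each other.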


## Step 5: Clustering Texts Using Embeddings
Apply k-means clustering to group the texts by the similarity of their embeddings.

In [5]:
from sklearn.cluster import KMeans

# Extend the earlier examples with three more texts
texts = [text1, text2, text3, 'Exploring space', 'I enjoy deep learning', "It's a sunny day"]
# Each embedding has shape (1, 768); concatenate them into a (6, 768) matrix
embeddings = torch.cat([get_embedding(text) for text in texts], dim=0)

# Convert embeddings to NumPy for clustering
embeddings_np = embeddings.numpy()

# Cluster embeddings
kmeans = KMeans(n_clusters=3, random_state=0).fit(embeddings_np)

# Print cluster labels
for text, label in zip(texts, kmeans.labels_):
    print(f'Text: {text}, Cluster: {label}')

Text: I love machine learning, Cluster: 0
Text: I adore artificial intelligence, Cluster: 0
Text: The weather is sunny, Cluster: 1
Text: Exploring space, Cluster: 2
Text: I enjoy deep learning, Cluster: 0
Text: It's a sunny day, Cluster: 1
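
The cluster numbers themselves are arbitrary labels; only which texts share a label matters. To illustrate what k-means does, here is a toy plain-Python version of its assign-and-update loop on 2-D points. It is a sketch under simplifying assumptions: it starts from fixed, hand-picked centroids rather than scikit-learn's initialization, and the function names are hypothetical.

```python
def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    labels = []
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        labels.append(dists.index(min(dists)))
    return labels

def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if not members:
            new_centroids.append(old)  # keep an empty cluster's centroid in place
            continue
        dim = len(old)
        new_centroids.append([sum(m[i] for m in members) / len(members) for i in range(dim)])
    return new_centroids

def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update until the labels stop changing."""
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
labels, centroids = kmeans(points, centroids=[[0.0, 0.0], [5.0, 5.0]])
print(labels)     # [0, 0, 1, 1]
print(centroids)  # [[0.0, 0.5], [5.0, 5.5]]
```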




## Thank You