<a href="https://colab.research.google.com/github/anshupandey/Working_with_Large_Language_models/blob/main/WWL_C8_Implementing_Language_Model_with_Sentence_Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise: Implementing a Language Model Using Sentence Transformers


In this notebook, we will walk through the process of implementing a language model using Sentence Transformers. Sentence Transformers is a framework for sentence, paragraph, and image embeddings using BERT-like models. This framework makes it easy to generate embeddings for sentences and paragraphs which can then be used in various NLP tasks such as semantic search, clustering, and classification.



## Prerequisites

Before we start, make sure you have Python installed on your machine. You will also need to install the following packages:

```bash
pip install sentence-transformers
pip install numpy
pip install torch
```


In [1]:
!pip install sentence-transformers --quiet

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m227.1/227.1 kB[0m [31m1.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m21.3/21.3 MB[0m [31m36.3 MB/s[0m eta [36m0:00:00[0m
[?25h

## Step 1: Importing Necessary Libraries

In [2]:
from sentence_transformers import SentenceTransformer, util
import numpy as np
import torch
import tensorflow as tf
from tensorflow.keras import models,layers

  from tqdm.autonotebook import tqdm, trange


## Step 2: Loading the Model

In [3]:
# Load a pre-trained Sentence Transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.7k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]



config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

## Step 3: Encoding Sentences

In [4]:

# Example sentences
sentences = [
    "This is an example sentence.",
    "Each sentence is converted into a fixed-size vector.",
    "Sentence transformers are useful for various NLP tasks.",
    "Setence Transformers can be used to represent sentences/documents as vectors",
    "multiple powerful sentence transformer models are available at Hugging Face"
]

# Encode the sentences into embeddings
embeddings = model.encode(sentences)

# Display the embeddings
for sentence, embedding in zip(sentences, embeddings):
    print(f"Sentence: {sentence}")
    print(f"Embedding: {embedding[:5]}...")  # Displaying first 5 dimensions for brevity
    print()


Sentence: This is an example sentence.
Embedding: [0.0981246  0.06781266 0.06252308 0.09508482 0.0366476 ]...

Sentence: Each sentence is converted into a fixed-size vector.
Embedding: [0.07381862 0.05663023 0.02722433 0.02096546 0.03240976]...

Sentence: Sentence transformers are useful for various NLP tasks.
Embedding: [-0.05859482  0.01600869  0.0545337   0.03032776  0.02576441]...

Sentence: Setence Transformers can be used to represent sentences/documents as vectors
Embedding: [-0.06910557  0.03544828 -0.03661951  0.00562107 -0.02724299]...

Sentence: multiple powerful sentence transformer models are available at Hugging Face
Embedding: [-0.04939263 -0.04512243  0.02534697  0.01934744 -0.00156226]...



## Step 4: Finding Similar Sentences

In [5]:

# Define a query sentence
query_sentence = "How are sentence embeddings generated?"

# Encode the query sentence
query_embedding = model.encode(query_sentence)

print(query_embedding.shape)

# Compute cosine similarity between the query sentence and the other sentences
cosine_scores = util.pytorch_cos_sim(query_embedding, embeddings)

# Display the results
for sentence, score in zip(sentences, cosine_scores[0]):
    print(f"Sentence: {sentence}")
    print(f"Similarity Score: {score.item():.4f}")
    print()


(384,)
Sentence: This is an example sentence.
Similarity Score: 0.3304

Sentence: Each sentence is converted into a fixed-size vector.
Similarity Score: 0.5168

Sentence: Sentence transformers are useful for various NLP tasks.
Similarity Score: 0.5501

Sentence: Setence Transformers can be used to represent sentences/documents as vectors
Similarity Score: 0.5568

Sentence: multiple powerful sentence transformer models are available at Hugging Face
Similarity Score: 0.4443



## Step 5: Clustering Sentences

In [6]:
# Example sentences
sentences = [
    "This is an example sentence.",
    "Each sentence is converted into a fixed-size vector.",
    "Sentence transformers are useful for various NLP tasks.",
    "Setence Transformers can be used to represent sentences/documents as vectors",
    "multiple powerful sentence transformer models are available at Hugging Face",
    "Manila is a capital city of Philippines",
    "The capital of the Philippines is Manila",
    "Kuala Lumpur is the capital city of Malaysia",
    "The capital of Malaysia is Kuala Lumpur",
    "Beijing is the capital city of China",
    "The capital of China is Beijing",
    "London is the capital city of England",
    "The capital of England is London",
    "Paris is the capital city of France",
    "The capital of France is Paris",
    "Tokyo is the capital city of Japan",
    "The capital of Japan is Tokyo",
    "Sydney is the capital city of Australia",
    "The capital of Australia is Sydney",
]

# Encode the sentences into embeddings
embeddings = model.encode(sentences)


In [7]:

from sklearn.cluster import KMeans

# Define the number of clusters
num_clusters = 2

# Perform K-Means clustering
clustering_model = KMeans(n_clusters=num_clusters)
clustering_model.fit(embeddings)
cluster_assignment = clustering_model.labels_

# Display the clustering results
for i in range(num_clusters):
    print(f"Cluster {i}:")
    cluster_sentences = [sentences[j] for j in range(len(sentences)) if cluster_assignment[j] == i]
    for sentence in cluster_sentences:
        print(f"  - {sentence}")
    print()


Cluster 0:
  - This is an example sentence.
  - Each sentence is converted into a fixed-size vector.
  - Sentence transformers are useful for various NLP tasks.
  - Setence Transformers can be used to represent sentences/documents as vectors
  - multiple powerful sentence transformer models are available at Hugging Face

Cluster 1:
  - Manila is a capital city of Philippines
  - The capital of the Philippines is Manila
  - Kuala Lumpur is the capital city of Malaysia
  - The capital of Malaysia is Kuala Lumpur
  - Beijing is the capital city of China
  - The capital of China is Beijing
  - London is the capital city of England
  - The capital of England is London
  - Paris is the capital city of France
  - The capital of France is Paris
  - Tokyo is the capital city of Japan
  - The capital of Japan is Tokyo
  - Sydney is the capital city of Australia
  - The capital of Australia is Sydney





## Text Generation

## Step 6: Define a TensorFlow Function for Sentence Embeddings

In [8]:
# Define a function to get embeddings for all sentences
def get_embeddings(sentences):
    embeddings = model.encode(sentences)
    return np.array(embeddings, dtype=np.float32)

## Step 7: Build LSTM Model

In [9]:
# Function to build an LSTM model
def build_model(embedding_dim, lstm_units, output_dim):
    model = models.Sequential([
        layers.Input(shape=(embedding_dim,)),
        layers.Reshape((1, embedding_dim)),
        layers.LSTM(lstm_units, return_sequences=True),
        layers.LSTM(lstm_units),
        layers.Dense(output_dim, activation='softmax')
    ])
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

## Step 8: Prepare the Data

In [10]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [11]:

# Example data preprocessing (assuming you have a dataset of sentences)
# Example sentences
sentences = [
    "This is an example sentence.",
    "Each sentence is converted into a fixed-size vector.",
    "Sentence transformers are useful for various NLP tasks.",
    "Setence Transformers can be used to represent sentences/documents as vectors",
    "multiple powerful sentence transformer models are available at Hugging Face",
    "Manila is a capital city of Philippines",
    "The capital of the Philippines is Manila",
    "Kuala Lumpur is the capital city of Malaysia",
    "The capital of Malaysia is Kuala Lumpur",
    "Beijing is the capital city of China",
    "The capital of China is Beijing",
    "London is the capital city of England",
    "The capital of England is London",
    "Paris is the capital city of France",
    "The capital of France is Paris",
    "Tokyo is the capital city of Japan",
    "The capital of Japan is Tokyo",
    "Sydney is the capital city of Australia",
    "The capital of Australia is Sydney",
]


x = []
y = []
import string
punctuations = string.punctuation

from nltk.tokenize import word_tokenize
def get_data(sentences):
  for sent in sentences:
    words = word_tokenize(sent.strip().lower()) # converting sentence to list of words
    words = [w for w in words if w not in punctuations] # removing punctuations
    for i in range(len(words)-1):
      x.append(" ".join(words[0:i+1])) # storing words from 0 to i
      y.append(words[i+1]) # storing next word as output



In [12]:
get_data(sentences)
len(x),len(y)

(116, 116)

In [13]:
for i in range(len(x)):
  print(x[i],"||",y[i])

this || is
this is || an
this is an || example
this is an example || sentence
each || sentence
each sentence || is
each sentence is || converted
each sentence is converted || into
each sentence is converted into || a
each sentence is converted into a || fixed-size
each sentence is converted into a fixed-size || vector
sentence || transformers
sentence transformers || are
sentence transformers are || useful
sentence transformers are useful || for
sentence transformers are useful for || various
sentence transformers are useful for various || nlp
sentence transformers are useful for various nlp || tasks
setence || transformers
setence transformers || can
setence transformers can || be
setence transformers can be || used
setence transformers can be used || to
setence transformers can be used to || represent
setence transformers can be used to represent || sentences/documents
setence transformers can be used to represent sentences/documents || as
setence transformers can be used to represen

In [14]:
xvec = get_embeddings(x) # generate vectors for inputs (x) (sentences)
print(xvec.shape)
# Tokenize the sentences
tokenizer = tf.keras.preprocessing.text.Tokenizer() # initialized a tokenizer for target decoding (prediction decoding)
tokenizer.fit_on_texts(sentences)
sequences = tokenizer.texts_to_sequences(sentences)
vocab_size = len(tokenizer.word_index) + 1

y_encoded = tokenizer.texts_to_sequences(y) # decoded y from word to index
print(len(y_encoded))

(116, 384)
116


In [15]:
print(tokenizer.word_index)

{'is': 1, 'capital': 2, 'of': 3, 'the': 4, 'city': 5, 'sentence': 6, 'a': 7, 'transformers': 8, 'are': 9, 'manila': 10, 'philippines': 11, 'kuala': 12, 'lumpur': 13, 'malaysia': 14, 'beijing': 15, 'china': 16, 'london': 17, 'england': 18, 'paris': 19, 'france': 20, 'tokyo': 21, 'japan': 22, 'sydney': 23, 'australia': 24, 'this': 25, 'an': 26, 'example': 27, 'each': 28, 'converted': 29, 'into': 30, 'fixed': 31, 'size': 32, 'vector': 33, 'useful': 34, 'for': 35, 'various': 36, 'nlp': 37, 'tasks': 38, 'setence': 39, 'can': 40, 'be': 41, 'used': 42, 'to': 43, 'represent': 44, 'sentences': 45, 'documents': 46, 'as': 47, 'vectors': 48, 'multiple': 49, 'powerful': 50, 'transformer': 51, 'models': 52, 'available': 53, 'at': 54, 'hugging': 55, 'face': 56}


In [16]:
print(y)
print(y_encoded)

['is', 'an', 'example', 'sentence', 'sentence', 'is', 'converted', 'into', 'a', 'fixed-size', 'vector', 'transformers', 'are', 'useful', 'for', 'various', 'nlp', 'tasks', 'transformers', 'can', 'be', 'used', 'to', 'represent', 'sentences/documents', 'as', 'vectors', 'powerful', 'sentence', 'transformer', 'models', 'are', 'available', 'at', 'hugging', 'face', 'is', 'a', 'capital', 'city', 'of', 'philippines', 'capital', 'of', 'the', 'philippines', 'is', 'manila', 'lumpur', 'is', 'the', 'capital', 'city', 'of', 'malaysia', 'capital', 'of', 'malaysia', 'is', 'kuala', 'lumpur', 'is', 'the', 'capital', 'city', 'of', 'china', 'capital', 'of', 'china', 'is', 'beijing', 'is', 'the', 'capital', 'city', 'of', 'england', 'capital', 'of', 'england', 'is', 'london', 'is', 'the', 'capital', 'city', 'of', 'france', 'capital', 'of', 'france', 'is', 'paris', 'is', 'the', 'capital', 'city', 'of', 'japan', 'capital', 'of', 'japan', 'is', 'tokyo', 'is', 'the', 'capital', 'city', 'of', 'australia', 'capita

In [17]:
y_encoded = [k[0] for k in y_encoded]
y_encoded = np.array(y_encoded)
y_encoded.shape

(116,)

In [18]:
from tensorflow.keras.utils import to_categorical
y = to_categorical(y_encoded, num_classes=vocab_size)

## Step 6: Train the Model

In [19]:
embedding_dim = embeddings.shape[1]
lstm_units = 128

language_model = build_model(embedding_dim, lstm_units, vocab_size)
language_model.fit(xvec, y_encoded, epochs=1000, batch_size=50, verbose = False, shuffle = True)

<keras.src.callbacks.History at 0x7d62dda7b6d0>

## Step 7: Predict Next Word

In [20]:

# Function to predict the next word given a model, tokenizer, and sentence
def predict_next_word(language_model, tokenizer, sentence):
    embedding = get_embeddings([sentence])
    #print(embedding.shape)
    prediction = language_model.predict(embedding, verbose=0)
    #print(prediction)
    predicted_word_index = np.argmax(prediction[0])
    #print(predicted_word_index)
    predicted_word = tokenizer.index_word[predicted_word_index+1]
    return predicted_word

# Example prediction
sentence = "This is an example"
predicted_word = predict_next_word(language_model, tokenizer, sentence)
print(f"Next word prediction: {predicted_word}")


Next word prediction: a


In [21]:
userinput = "This"
generation_length = 10

for i in range(generation_length):
    predicted_word = predict_next_word(language_model, tokenizer, userinput)
    userinput += " " + predicted_word
    print(userinput)

This capital
This capital this
This capital this this
This capital this this this
This capital this this this sentence
This capital this this this sentence sentence
This capital this this this sentence sentence sentence
This capital this this this sentence sentence sentence sentence
This capital this this this sentence sentence sentence sentence sentence
This capital this this this sentence sentence sentence sentence sentence sentence



## Conclusion

In this notebook, we have covered the basics of implementing a language model using Sentence Transformers. We have shown how to encode sentences into embeddings, find similar sentences using cosine similarity, and cluster sentences using K-Means clustering. Sentence Transformers provide a powerful and flexible way to work with sentence embeddings for various NLP tasks.

You can further explore Sentence Transformers by trying out different models, tweaking hyperparameters, and applying these techniques to your specific use cases. Happy coding!
