<a href="https://colab.research.google.com/github/anshupandey/Working_with_Large_Language_models/blob/main/WWL_C8_Implementing_Language_Model_with_Sentence_Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise: Implementing a Language Model Using Sentence Transformers


In this notebook, we will walk through the process of implementing a language model using Sentence Transformers. Sentence Transformers is a framework for sentence, paragraph, and image embeddings using BERT-like models. This framework makes it easy to generate embeddings for sentences and paragraphs which can then be used in various NLP tasks such as semantic search, clustering, and classification.



## Prerequisites

Before we start, make sure you have Python installed on your machine. You will also need to install the following packages:

```bash
pip install sentence-transformers
pip install numpy
pip install torch
```


In [2]:
!pip install sentence-transformers --quiet

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m227.1/227.1 kB[0m [31m2.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m21.3/21.3 MB[0m [31m50.2 MB/s[0m eta [36m0:00:00[0m
[?25h

## Step 1: Importing Necessary Libraries

In [11]:

from sentence_transformers import SentenceTransformer, util
import numpy as np
import torch
import tensorflow as tf
from tensorflow.keras import models,layers



## Step 2: Loading the Model

In [22]:

# Load a pre-trained Sentence Transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')




## Step 3: Encoding Sentences

In [23]:

# Example sentences
sentences = [
    "This is an example sentence.",
    "Each sentence is converted into a fixed-size vector.",
    "Sentence transformers are useful for various NLP tasks."
]

# Encode the sentences into embeddings
embeddings = model.encode(sentences)

# Display the embeddings
for sentence, embedding in zip(sentences, embeddings):
    print(f"Sentence: {sentence}")
    print(f"Embedding: {embedding[:5]}...")  # Displaying first 5 dimensions for brevity
    print()


Sentence: This is an example sentence.
Embedding: [0.0981246  0.0678127  0.06252313 0.09508476 0.03664764]...

Sentence: Each sentence is converted into a fixed-size vector.
Embedding: [0.07381859 0.05663022 0.02722435 0.02096546 0.03240977]...

Sentence: Sentence transformers are useful for various NLP tasks.
Embedding: [-0.05859482  0.0160087   0.05453373  0.03032776  0.02576445]...



## Step 4: Finding Similar Sentences

In [24]:

# Define a query sentence
query_sentence = "How are sentence embeddings generated?"

# Encode the query sentence
query_embedding = model.encode(query_sentence)

# Compute cosine similarity between the query sentence and the other sentences
cosine_scores = util.pytorch_cos_sim(query_embedding, embeddings)

# Display the results
for sentence, score in zip(sentences, cosine_scores[0]):
    print(f"Sentence: {sentence}")
    print(f"Similarity Score: {score.item():.4f}")
    print()


Sentence: This is an example sentence.
Similarity Score: 0.3304

Sentence: Each sentence is converted into a fixed-size vector.
Similarity Score: 0.5168

Sentence: Sentence transformers are useful for various NLP tasks.
Similarity Score: 0.5501



## Step 5: Clustering Sentences

In [25]:

from sklearn.cluster import KMeans

# Define the number of clusters
num_clusters = 2

# Perform K-Means clustering
clustering_model = KMeans(n_clusters=num_clusters)
clustering_model.fit(embeddings)
cluster_assignment = clustering_model.labels_

# Display the clustering results
for i in range(num_clusters):
    print(f"Cluster {i}:")
    cluster_sentences = [sentences[j] for j in range(len(sentences)) if cluster_assignment[j] == i]
    for sentence in cluster_sentences:
        print(f"  - {sentence}")
    print()


Cluster 0:
  - Each sentence is converted into a fixed-size vector.
  - Sentence transformers are useful for various NLP tasks.

Cluster 1:
  - This is an example sentence.





## Step 6: Define a TensorFlow Function for Sentence Embeddings

In [26]:
# Define a TensorFlow function that uses the Sentence Transformer to get embeddings
def sentence_embedding(sentence):
    embedding = model.encode(sentence)
    return tf.constant(embedding, dtype=tf.float32)


## Step 7: Build LSTM Model

In [31]:

# Function to build an LSTM model
def build_model(vocab_size, embedding_dim, lstm_units):
    language_model = models.Sequential([
        layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim, mask_zero=True),
        layers.LSTM(lstm_units, return_sequences=True),
        layers.LSTM(lstm_units),
        layers.Dense(vocab_size, activation='softmax')
    ])
    language_model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return language_model


## Step 8: Prepare the Data

In [32]:

# Example data preprocessing (assuming you have a dataset of sentences)
sentences = ["This is an example sentence.", "Sentence transformers are useful.", "This is another sentence."]
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(sentences)
sequences = tokenizer.texts_to_sequences(sentences)
vocab_size = len(tokenizer.word_index) + 1

# Padding sequences to the same length
max_length = max(len(seq) for seq in sequences)
sequences = tf.keras.preprocessing.sequence.pad_sequences(sequences, maxlen=max_length, padding='post')

# Split data into inputs and labels
X, y = sequences[:, :-1], sequences[:, -1]


## Step 6: Train the Model

In [33]:

embedding_dim = 128
lstm_units = 128

language_model = build_model(vocab_size, embedding_dim, lstm_units)
language_model.fit(X, y, epochs=10, batch_size=2,verbose=False)


<keras.src.callbacks.History at 0x7f8a4e3e5150>

## Step 7: Predict Next Word

In [41]:

# Function to predict the next word given a model, tokenizer, and sentence
def predict_next_word(language_model, tokenizer, sentence, max_length):
    embedding = sentence_embedding(sentence)
    tokenized_sentence = tokenizer.texts_to_sequences([sentence])[0]
    tokenized_sentence = tf.keras.preprocessing.sequence.pad_sequences([tokenized_sentence], maxlen=max_length-1, padding='post')
    prediction = language_model.predict(tokenized_sentence, verbose=0)
    print(prediction)
    predicted_word_index = np.argmax(prediction[0])
    print(predicted_word_index)
    predicted_word = tokenizer.index_word[predicted_word_index+1]
    return predicted_word

# Example prediction
sentence = "This is an example"
predicted_word = predict_next_word(language_model, tokenizer, sentence, max_length)
print(f"Next word prediction: {predicted_word}")


[[0.5851038  0.1786701  0.02727927 0.03728561 0.03317305 0.0246282
  0.03144044 0.02969067 0.02466745 0.0280614 ]]
0
Next word prediction: sentence



## Conclusion

In this notebook, we have covered the basics of implementing a language model using Sentence Transformers. We have shown how to encode sentences into embeddings, find similar sentences using cosine similarity, and cluster sentences using K-Means clustering. Sentence Transformers provide a powerful and flexible way to work with sentence embeddings for various NLP tasks.

You can further explore Sentence Transformers by trying out different models, tweaking hyperparameters, and applying these techniques to your specific use cases. Happy coding!
