<a href="https://colab.research.google.com/github/atul-ai/prompt-engineering-class/blob/main/SimpleVectorDB_FIASS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Faiss Example: Sentence Similarity Search

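This notebook demonstrates how to use Faiss for similarity search on sentence embeddings: it embeds ten sentences with `bert-base-uncased`, stores the vectors in a flat L2 index, and retrieves the nearest neighbors of a query sentence.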

## 1. Install Required Libraries

First, let's install the necessary libraries. This may take a few moments.

In [1]:
!pip install faiss-cpu numpy transformers torch

Collecting faiss-cpu
  Downloading faiss_cpu-1.8.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.7 kB)
Downloading faiss_cpu-1.8.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.0 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.8.0.post1


## 2. Import Libraries and Define Sentences

In [2]:
import numpy as np
import faiss
from transformers import AutoTokenizer, AutoModel
import torch

# Define a list of sentences
sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "A journey of a thousand miles begins with a single step.",
    "To be or not to be, that is the question.",
    "All that glitters is not gold.",
    "Actions speak louder than words.",
    "Where there's a will, there's a way.",
    "The early bird catches the worm.",
    "A picture is worth a thousand words.",
    "When in Rome, do as the Romans do.",
    "The pen is mightier than the sword."
]

print(f"Number of sentences: {len(sentences)}")

Number of sentences: 10


## 3. Load Pre-trained Model and Define Embedding Function

In [3]:
# Load a pre-trained model and tokenizer
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

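# Generate a fixed-size embedding for one sentence by mean-pooling
# BERT's final-layer token vectors (special tokens included).
def get_embedding(sentence):
    inputs = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():  # inference only, so skip gradient tracking
        outputs = model(**inputs)
    # last_hidden_state has shape (1, seq_len, hidden_size); average over tokens
    return outputs.last_hidden_state.mean(dim=1).squeeze().numpy()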

print("Model loaded and embedding function defined.")

Model loaded and embedding function defined.
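
`get_embedding` averages every token vector, which works when sentences are embedded one at a time because no padding is added. If sentences were instead embedded in a padded batch, the padding positions would need to be left out of the average. The sketch below illustrates such a masked mean with NumPy on small made-up arrays. `masked_mean_pool` is a hypothetical helper, not part of this notebook or of `transformers`; its inputs only mimic the shapes of `last_hidden_state` and `attention_mask`.

```python
import numpy as np

def masked_mean_pool(hidden, mask):
    """Average token vectors per sentence, ignoring padded positions.

    hidden: (batch, seq_len, dim) array, shaped like last_hidden_state
    mask:   (batch, seq_len) array of 1 (real token) or 0 (padding)
    """
    mask = mask[:, :, None].astype(hidden.dtype)   # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)           # sum over real tokens only
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # avoid division by zero
    return summed / counts

# Toy batch: 2 sentences, 3 token slots, 2-dimensional vectors.
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last slot is padding
    [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]],      # no padding
], dtype=np.float32)
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(hidden, mask)
print(pooled)  # rows: [2. 3.] and [4. 4.]
```

The padded slot in the first sentence does not affect its average, which is the property a plain `.mean(dim=1)` would lose on padded batches.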


## 4. Generate Embeddings and Create Faiss Index

In [4]:
# Generate embeddings for all sentences
vectors = np.array([get_embedding(sent) for sent in sentences]).astype('float32')

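# Create a flat Faiss index: exact, exhaustive search with L2 distance.
# Note that IndexFlatL2 reports squared Euclidean distances.
dim = vectors.shape[1]  # dimensionality of our embeddings
index = faiss.IndexFlatL2(dim)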

# Add vectors to the index
index.add(vectors)

print(f"Faiss index created with {index.ntotal} vectors of dimension {dim}.")

Faiss index created with 10 vectors of dimension 768.
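
`IndexFlatL2` performs an exact, exhaustive search: it compares the query with every stored vector and reports squared Euclidean distances. As a rough illustration of that computation (not Faiss's actual implementation), the NumPy sketch below does the same thing on toy data. `brute_force_l2_search` is a hypothetical name introduced here, and the points are made up.

```python
import numpy as np

def brute_force_l2_search(database, queries, k):
    """Return (distances, indices) of the k nearest rows in `database`
    for each row of `queries`, using squared Euclidean distance."""
    # Pairwise differences with shape (n_queries, n_database, dim)
    diff = queries[:, None, :] - database[None, :, :]
    sq_dist = (diff ** 2).sum(axis=2)            # (n_queries, n_database)
    order = np.argsort(sq_dist, axis=1)[:, :k]   # k smallest per query
    return np.take_along_axis(sq_dist, order, axis=1), order

# Toy data: four 2-D points and a single query.
database = np.array([[0, 0], [1, 0], [0, 3], [3, 3]], dtype=np.float32)
queries = np.array([[1, 1]], dtype=np.float32)

toy_distances, toy_indices = brute_force_l2_search(database, queries, k=3)
print(toy_indices)    # [[1 0 2]]
print(toy_distances)  # [[1. 2. 5.]]
```

The return layout, one row per query with neighbors sorted from nearest to farthest, matches what `index.search` returns in the next section.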


## 5. Perform Similarity Search

In [5]:
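# Search the index for the sentences closest to a query
k = 3  # number of nearest neighbors to retrieve
query_sentence = "Knowledge is power."
# Faiss expects a 2-D float32 array of shape (n_queries, dim)
query_vector = get_embedding(query_sentence).reshape(1, -1).astype('float32')
# distances holds squared L2 values; smaller means more similar
distances, indices = index.search(query_vector, k)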

print(f"Query: '{query_sentence}'")
print(f"\nTop {k} most similar sentences:")
for i, (idx, distance) in enumerate(zip(indices[0], distances[0])):
    print(f"{i+1}. '{sentences[idx]}' (Distance: {distance:.4f})")

Query: 'Knowledge is power.'

Top 3 most similar sentences:
1. 'Actions speak louder than words.' (Distance: 32.8151)
2. 'A picture is worth a thousand words.' (Distance: 50.3357)
3. 'All that glitters is not gold.' (Distance: 51.8222)
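
The distances above are squared L2 values computed on unnormalized BERT vectors, so their scale is hard to interpret on its own. One common alternative, which this notebook does not use, is to L2-normalize the vectors first. Squared L2 distance then becomes a simple function of cosine similarity: `||a - b||^2 = 2 - 2 * cos(a, b)`. The NumPy sketch below checks that identity on random stand-in vectors rather than real embeddings; `l2_normalize` is a hypothetical helper name.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row of x to unit Euclidean length."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)

# Random stand-ins for two 768-dimensional embeddings.
rng = np.random.default_rng(0)
a = l2_normalize(rng.normal(size=(1, 768)).astype(np.float32))
b = l2_normalize(rng.normal(size=(1, 768)).astype(np.float32))

cosine = float((a * b).sum())        # dot product of unit vectors
sq_l2 = float(((a - b) ** 2).sum())  # squared Euclidean distance

# For unit vectors, squared L2 distance equals 2 - 2 * cosine similarity.
print(np.isclose(sq_l2, 2 - 2 * cosine, atol=1e-5))
```

With normalized vectors, distances fall between 0 and 4, which makes results from different queries easier to compare.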


## 6. Interactive Query (Optional)

The helper function below wraps the search step so you can try different queries. Smaller distances indicate closer matches.

In [10]:
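def query_similar_sentences(query, k=3):
    """Print the k indexed sentences closest to `query` (squared L2 distance)."""
    # Faiss fills missing results with index -1, so never ask for more than exist
    k = min(k, index.ntotal)
    query_vector = get_embedding(query).reshape(1, -1)
    distances, indices = index.search(query_vector, k)

    print(f"\n\nQuery: '{query}'")
    print(f"\nTop {k} most similar sentences:")
    for i, (idx, distance) in enumerate(zip(indices[0], distances[0])):
        print(f"{i+1}. '{sentences[idx]}' (Distance: {distance:.4f})")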

# Example usage:
query_similar_sentences("Doing is better than talking")
query_similar_sentences("Beauty lies in the eyes of beholder")
query_similar_sentences("Life sucks!")

# You can try your own queries by calling the function with different inputs
# For example: query_similar_sentences("Your query here", k=5)


Query: 'Doing is better than talking'

Top 3 most similar sentences:
1. 'Actions speak louder than words.' (Distance: 29.8676)
2. 'A picture is worth a thousand words.' (Distance: 48.0573)
3. 'A journey of a thousand miles begins with a single step.' (Distance: 61.4174)

Query: 'Beauty lies in the eyes of beholder'

Top 3 most similar sentences:
1. 'All that glitters is not gold.' (Distance: 54.0545)
2. 'A journey of a thousand miles begins with a single step.' (Distance: 63.6879)
3. 'To be or not to be, that is the question.' (Distance: 65.1216)

Query: 'Life sucks!'

Top 3 most similar sentences:
1. 'A picture is worth a thousand words.' (Distance: 66.8055)
2. 'Actions speak louder than words.' (Distance: 69.3681)
3. 'All that glitters is not gold.' (Distance: 73.2825)
