

# Day 6: Embeddings & Indexing Documents - An Advanced AI/ML Tutorial

Welcome to this advanced tutorial on generating text embeddings and creating searchable vector indexes. This notebook will guide you through:

1.  **Understanding Embeddings:** What they are and why they are crucial in modern NLP.
2.  **Generating Embeddings:** Using the powerful `sentence-transformers` library to convert text into meaningful vector representations.
3.  **Vector Indexing with FAISS:** Building efficient search indexes for your embeddings using Facebook AI Similarity Search (FAISS).
4.  **Practical Application:** Building a simple semantic search engine on a dataset.

This tutorial is aimed at ML engineering professionals looking to deepen their understanding and practical skills in these areas. We will focus on open-source tools and provide detailed, runnable code examples.

**Prerequisites:**
*   Basic understanding of Python and machine learning concepts.
*   Familiarity with Jupyter Notebooks.

Let's get started!



## 1. Setup: Installing and Importing Libraries

First, install the required libraries with pip. The command below is the plain shell form; inside a notebook cell, prefix it with `!`:

```bash
pip install sentence-transformers faiss-cpu datasets pandas torch torchvision torchaudio
```

Now, let's import the libraries we'll be using throughout this tutorial.


In [None]:
!pip install sentence-transformers faiss-cpu datasets pandas torch torchvision torchaudio


In [None]:
import numpy as np
from sentence_transformers import SentenceTransformer
import faiss
import os
import time
from datasets import load_dataset
import pandas as pd # For potential data display, though not heavily used in the script

print("Libraries imported successfully!")


Libraries imported successfully!



## 2. Understanding Embeddings

Text embeddings are dense vector representations of text in a high-dimensional space. The key idea is that similar texts will have vectors that are close to each other in this space, while dissimilar texts will have vectors that are far apart. This property makes embeddings incredibly useful for tasks like semantic search, clustering, classification, and more.

**Why are embeddings powerful?**

*   **Semantic Meaning:** They capture the semantic meaning of words and sentences, going beyond simple keyword matching.
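*   **Compact Representation:** They map variable-length text to fixed-length dense vectors (384 dimensions for the model used in this tutorial), a numerical format that downstream algorithms can compare and index directly.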
*   **Transfer Learning:** Pre-trained embedding models, trained on vast amounts of text data, can be readily used for various downstream tasks, often achieving excellent performance with minimal task-specific data.
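
To make the "similar texts have nearby vectors" idea concrete, here is a minimal sketch that uses only the standard library. The three-dimensional vectors are hand-picked, hypothetical stand-ins for real embeddings (they do not come from any model), and `cosine_similarity` is an illustrative helper rather than part of `sentence-transformers`.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings", chosen by hand for illustration only.
toy_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "stock market": [0.0, 0.2, 0.9],
}

sim_related = cosine_similarity(toy_embeddings["cat"], toy_embeddings["kitten"])
sim_unrelated = cosine_similarity(toy_embeddings["cat"], toy_embeddings["stock market"])

print(f"cat vs kitten:       {sim_related:.3f}")
print(f"cat vs stock market: {sim_unrelated:.3f}")
```

The related pair scores close to 1 while the unrelated pair scores close to 0, which is the property semantic search relies on.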

**SentenceTransformers**

`sentence-transformers` is a Python framework for state-of-the-art sentence, text, and image embeddings. It provides an easy way to use a wide variety of pre-trained models and fine-tune them on your own data.

In this section, we'll explore how to use `sentence-transformers` to generate embeddings.



### 2.1. Loading a Pre-trained Model

We'll start by loading a pre-trained model. `all-MiniLM-L6-v2` is a popular choice known for its balance of speed and performance.


In [None]:

# Load a pre-trained SentenceTransformer model
model_name_st = 'all-MiniLM-L6-v2'
print(f"Loading SentenceTransformer model: {model_name_st}...")
model_st = SentenceTransformer(model_name_st)
print(f"Model '{model_name_st}' loaded successfully.")

# Get the embedding dimension
dimension_st = model_st.get_sentence_embedding_dimension()
print(f"The embedding dimension of this model is: {dimension_st}")


Loading SentenceTransformer model: all-MiniLM-L6-v2...


Model 'all-MiniLM-L6-v2' loaded successfully.
The embedding dimension of this model is: 384



### 2.2. Generating Embeddings for Sentences

Once the model is loaded, generating embeddings is straightforward.


In [None]:

# Sentences for which we want to generate embeddings
sentences = [
    "This is an example sentence.",
    "Each sentence is converted to a vector.",
    "AI is revolutionizing many fields.",
    "Natural Language Processing is a fascinating area of study."
]

print(f"Generating embeddings for {len(sentences)} sentences...")
sentence_embeddings = model_st.encode(sentences, convert_to_numpy=True)

print("Embeddings generated successfully.")
print(f"Shape of the embeddings array: {sentence_embeddings.shape}") # (Number of sentences, Embedding dimension)

print("First embedding vector (first 5 dimensions):")
print(sentence_embeddings[0][:5])


Generating embeddings for 4 sentences...
Embeddings generated successfully.
Shape of the embeddings array: (4, 384)
First embedding vector (first 5 dimensions):
[0.09812461 0.06781267 0.06252321 0.09508484 0.03664759]



### 2.3. Generating Embeddings for a Batch of Texts (for FAISS demo)

For indexing with FAISS, we'll often work with batches of documents.


In [None]:
# Batch of texts for FAISS demonstration later
batch_of_texts_st = [
    "The weather is sunny and warm today.",
    "AI is transforming many industries across the globe.",
    "Learning new skills is important for professional development.",
    "This is the first document for FAISS.",
    "This document is the second document for FAISS.",
    "And this is the third one for FAISS.",
    "Is this the first document for FAISS?" # A query-like sentence
]

print(f"Generating embeddings for a batch of {len(batch_of_texts_st)} texts for FAISS demo...")
# FAISS's Python API works on float32 arrays, so cast explicitly
db_vectors_from_st = model_st.encode(batch_of_texts_st, convert_to_numpy=True).astype(np.float32)

# Normalize embeddings for cosine similarity search with IndexFlatIP
# faiss.normalize_L2(db_vectors_from_st) # We will do this explicitly before adding to IndexFlatIP

print(f"Batch Embeddings for FAISS demo shape: {db_vectors_from_st.shape}")

# Example query for FAISS demo
query_texts_st = ["What is AI?"]
query_vectors_st = model_st.encode(query_texts_st, convert_to_numpy=True).astype(np.float32)
# faiss.normalize_L2(query_vectors_st) # We will do this explicitly before searching with IndexFlatIP

print("SentenceTransformers embedding generation part completed.")


Generating embeddings for a batch of 7 texts for FAISS demo...
Batch Embeddings for FAISS demo shape: (7, 384)
SentenceTransformers embedding generation part completed.



## 3. Vector Indexing and Search with FAISS

FAISS (Facebook AI Similarity Search) is a library for efficient similarity search and clustering of dense vectors. It's particularly useful when dealing with millions or even billions of vectors.

**Key Concepts in FAISS:**

*   **Index:** A data structure that stores the vectors and allows for efficient searching.
*   **Metric Types:** FAISS supports various similarity metrics, including L2 (Euclidean) distance and Inner Product (cosine similarity for normalized vectors).
*   **Index Types:** FAISS offers a variety of index types, each with different trade-offs in terms of search speed, memory usage, and accuracy.
    *   `IndexFlatL2`: Exact search using L2 distance. Simple but can be slow for large datasets.
    *   `IndexFlatIP`: Exact search using Inner Product. Suitable for cosine similarity if vectors are normalized.
    *   `IndexIVFFlat`: An inverted file index that partitions the vector space into cells (Voronoi cells). At query time it visits only the `nprobe` cells closest to the query and searches the vectors inside them exhaustively. Faster than `IndexFlat` for large datasets, but approximate.
    *   `IndexHNSWFlat`: A graph-based index (Hierarchical Navigable Small World) known for excellent speed-accuracy trade-offs.
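
As a rough mental model of what a flat inner-product index computes, the sketch below scores every stored vector against the query and keeps the top `k`. It is plain Python, not FAISS code; `flat_ip_search` and the toy 2-D vectors are hypothetical and exist only for illustration.

```python
import heapq

def flat_ip_search(database, query, k):
    """Exhaustive inner-product search: score every vector, keep the top k."""
    scored = (
        (sum(q * x for q, x in zip(query, vec)), idx)
        for idx, vec in enumerate(database)
    )
    top = heapq.nlargest(k, scored)  # highest score first
    scores = [score for score, _ in top]
    indices = [idx for _, idx in top]
    return scores, indices

# Toy 2-D database; real embeddings would have hundreds of dimensions.
database = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.6, 0.8],
    [-1.0, 0.0],
]
query = [0.8, 0.6]

scores, indices = flat_ip_search(database, query, k=2)
print("indices:", indices)
print("scores:", scores)
```

The cost grows linearly with the number of stored vectors, which is why the approximate index types in the list above exist.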

We will use the embeddings generated in the previous step.


In [None]:
# Using embeddings from SentenceTransformers section
db_vectors_faiss = db_vectors_from_st
query_vectors_faiss = query_vectors_st
dimension_faiss = db_vectors_from_st.shape[1] # Should be 384 for all-MiniLM-L6-v2
num_db_vectors_faiss = db_vectors_faiss.shape[0]
num_query_vectors_faiss = query_vectors_faiss.shape[0]

print(f"Using {num_db_vectors_faiss} database vectors and {num_query_vectors_faiss} query vectors from SentenceTransformers section for FAISS.")
print(f"Vector dimension: {dimension_faiss}")


Using 7 database vectors and 1 query vectors from SentenceTransformers section for FAISS.
Vector dimension: 384



### 3.1. Building a Basic FAISS Index: `IndexFlatIP`

`IndexFlatIP` performs an exact search based on the inner product. If the vectors are L2-normalized, maximizing the inner product is equivalent to minimizing the L2 distance and maximizing cosine similarity.

**Note on Normalization:** For `IndexFlatIP` scores to equal cosine similarity, both the database and query vectors *must* be L2-normalized. `all-MiniLM-L6-v2` includes a normalization layer, so its outputs are already unit length, but many other models do not, and `encode` only normalizes on request (`normalize_embeddings=True`). Normalizing explicitly is a cheap safeguard.
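
The equivalence between inner product and L2 distance for unit vectors can be checked numerically. The sketch below uses only the standard library; the helper names and the random 8-dimensional vectors are illustrative assumptions, not part of FAISS.

```python
import math
import random

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

rng = random.Random(0)  # fixed seed so the run is repeatable
a = l2_normalize([rng.gauss(0, 1) for _ in range(8)])
b = l2_normalize([rng.gauss(0, 1) for _ in range(8)])

# For unit vectors: ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b = 2 - 2 a.b
lhs = squared_l2(a, b)
rhs = 2.0 - 2.0 * dot(a, b)
print(f"||a-b||^2 = {lhs:.6f}   2 - 2*a.b = {rhs:.6f}")
```

Because the squared distance is a decreasing function of the inner product, ranking by largest inner product and ranking by smallest L2 distance give the same order once vectors are normalized.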

Let's normalize our vectors before adding them to `IndexFlatIP`.


In [None]:
# Normalize vectors for IndexFlatIP (cosine similarity)
db_vectors_faiss_normalized = db_vectors_faiss.copy()
faiss.normalize_L2(db_vectors_faiss_normalized)

query_vectors_faiss_normalized = query_vectors_faiss.copy()
faiss.normalize_L2(query_vectors_faiss_normalized)

index_flat_ip = faiss.IndexFlatIP(dimension_faiss)
print(f"Is the index trained? {index_flat_ip.is_trained}") # IndexFlatIP does not require training
index_flat_ip.add(db_vectors_faiss_normalized)
print(f"Number of vectors in IndexFlatIP: {index_flat_ip.ntotal}")

# Search the index
k_faiss = 3 # Number of nearest neighbors to retrieve
print(f"Searching IndexFlatIP for {k_faiss} nearest neighbors...")
D_ip, I_ip = index_flat_ip.search(query_vectors_faiss_normalized, k_faiss)

print("Query Text:", query_texts_st[0])
print("Results (IndexFlatIP):")
for i in range(k_faiss):
    print(f"  Rank {i+1}: Index={I_ip[0][i]}, Score (Inner Product)={D_ip[0][i]:.4f}, Text: {batch_of_texts_st[I_ip[0][i]]}")


Is the index trained? True
Number of vectors in IndexFlatIP: 7
Searching IndexFlatIP for 3 nearest neighbors...
Query Text: What is AI?
Results (IndexFlatIP):
  Rank 1: Index=1, Score (Inner Product)=0.6581, Text: AI is transforming many industries across the globe.
  Rank 2: Index=2, Score (Inner Product)=0.1436, Text: Learning new skills is important for professional development.
  Rank 3: Index=4, Score (Inner Product)=0.1163, Text: This document is the second document for FAISS.



### 3.2. Advanced FAISS Indexing: `IndexIVFFlat`

`IndexIVFFlat` is an inverted file index. It first clusters the dataset vectors into `nlist` cells using k-means. When searching, it identifies the `nprobe` closest cells to the query vector and then performs an exact search (like `IndexFlatIP` or `IndexFlatL2`) only within those cells.

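This approach is much faster for large datasets but is approximate: a true nearest neighbor that sits in an unprobed cell is missed. The `nprobe` parameter controls the trade-off between speed and accuracy. Note that the seven-vector demo below ends up with `nlist = 1`, so every vector lands in a single cell and the search is effectively exhaustive, which is why its results match `IndexFlatIP` exactly.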

**Steps:**
1.  Define a `quantizer`: This is a flat index used to find the cluster centroids.
2.  Create the `IndexIVFFlat` index, specifying the quantizer, dimension, number of clusters (`nlist`), and metric.
3.  Train the index on the database vectors (this is where k-means clustering happens).
4.  Add the database vectors to the trained index.
5.  Set `nprobe` for searching.
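
The idea behind these steps can be sketched in plain Python. This is a toy illustration of inverted-file search, not how FAISS implements it: the centroids are fixed by hand instead of learned with k-means, and `build_inverted_lists` and `ivf_search` are hypothetical helper names.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_inverted_lists(vectors, centroids):
    """Assign each vector id to the cell of its highest-scoring centroid."""
    cells = {c: [] for c in range(len(centroids))}
    for idx, vec in enumerate(vectors):
        best_cell = max(range(len(centroids)), key=lambda c: dot(vec, centroids[c]))
        cells[best_cell].append(idx)
    return cells

def ivf_search(query, vectors, centroids, cells, k, nprobe):
    """Scan only the nprobe cells whose centroids score highest for the query."""
    ranked_cells = sorted(
        range(len(centroids)), key=lambda c: dot(query, centroids[c]), reverse=True
    )
    candidates = [idx for c in ranked_cells[:nprobe] for idx in cells[c]]
    scored = sorted(((dot(query, vectors[i]), i) for i in candidates), reverse=True)
    return scored[:k]

# Toy 2-D unit vectors; centroids are hand-picked (an assumption replacing k-means).
vectors = [[1.0, 0.0], [0.96, 0.28], [0.0, 1.0], [0.28, 0.96], [0.6, 0.8]]
centroids = [[1.0, 0.0], [0.0, 1.0]]
cells = build_inverted_lists(vectors, centroids)

query = [0.8, 0.6]
approx = ivf_search(query, vectors, centroids, cells, k=2, nprobe=1)
exact = ivf_search(query, vectors, centroids, cells, k=2, nprobe=2)

print("cells:", cells)
print("nprobe=1 ids:", [i for _, i in approx])
print("nprobe=2 ids:", [i for _, i in exact])
```

With `nprobe=1` the best match (vector 4) is missed because it lives in the cell that was not probed; probing both cells recovers it. That is the speed versus accuracy trade-off `nprobe` controls.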


In [None]:
# Parameters for IndexIVFFlat
num_db_vectors_faiss = db_vectors_faiss_normalized.shape[0]
# Number of clusters (Voronoi cells). Kept tiny because this demo has only a handful
# of vectors; larger datasets typically use many more cells (roughly on the order of sqrt(N)).
nlist = max(1, min(num_db_vectors_faiss // 4, 10))

if num_db_vectors_faiss < nlist:
    print(f"Warning: Number of database vectors ({num_db_vectors_faiss}) is less than nlist ({nlist}). Adjusting nlist.")
    nlist = max(1, num_db_vectors_faiss // 2) if num_db_vectors_faiss > 1 else 1

print(f"Using nlist = {nlist}")

quantizer = faiss.IndexFlatIP(dimension_faiss)  # Using Inner Product for the quantizer as well
index_ivf_flat = faiss.IndexIVFFlat(quantizer, dimension_faiss, nlist, faiss.METRIC_INNER_PRODUCT)

print(f"Is the IVF index trained? {index_ivf_flat.is_trained}")

if num_db_vectors_faiss >= nlist and nlist > 0: # Check if there are enough vectors to train and nlist is positive
    print("Training IndexIVFFlat...")
    index_ivf_flat.train(db_vectors_faiss_normalized) # Train on normalized vectors
    print(f"Is the IVF index trained after training? {index_ivf_flat.is_trained}")
    index_ivf_flat.add(db_vectors_faiss_normalized)    # Add normalized vectors
    print(f"Number of vectors in IndexIVFFlat: {index_ivf_flat.ntotal}")

    # Set nprobe (number of cells to search)
    index_ivf_flat.nprobe = max(1, nlist // 2)  # Demo value; raise nprobe for better recall at the cost of speed
    print(f"Searching IndexIVFFlat with nprobe = {index_ivf_flat.nprobe}...")
    D_ivf, I_ivf = index_ivf_flat.search(query_vectors_faiss_normalized, k_faiss)

    print("Query Text:", query_texts_st[0])
    print("Results (IndexIVFFlat):")
    for i in range(k_faiss):
        print(f"  Rank {i+1}: Index={I_ivf[0][i]}, Score (Inner Product)={D_ivf[0][i]:.4f}, Text: {batch_of_texts_st[I_ivf[0][i]]}")
else:
    print(f"Skipping IndexIVFFlat example due to insufficient data ({num_db_vectors_faiss} vectors) for the chosen nlist ({nlist}).")
    print("IndexIVFFlat requires at least nlist training vectors.")



Using nlist = 1
Is the IVF index trained? False
Training IndexIVFFlat...
Is the IVF index trained after training? True
Number of vectors in IndexIVFFlat: 7
Searching IndexIVFFlat with nprobe = 1...
Query Text: What is AI?
Results (IndexIVFFlat):
  Rank 1: Index=1, Score (Inner Product)=0.6581, Text: AI is transforming many industries across the globe.
  Rank 2: Index=2, Score (Inner Product)=0.1436, Text: Learning new skills is important for professional development.
  Rank 3: Index=4, Score (Inner Product)=0.1163, Text: This document is the second document for FAISS.



### 3.3. Managing FAISS Indexes: Saving and Loading

FAISS allows you to save trained indexes to disk and load them later, which is essential for production systems.


In [None]:
index_dir = "/content/faiss_indexes_tutorial"
os.makedirs(index_dir, exist_ok=True)  # no-op if the directory already exists

index_flat_ip_path = os.path.join(index_dir, "index_flat_ip.faiss")
faiss.write_index(index_flat_ip, index_flat_ip_path)
print(f"IndexFlatIP saved to {index_flat_ip_path}")

# Load the index
loaded_index_flat_ip = faiss.read_index(index_flat_ip_path)
print(f"Loaded IndexFlatIP. Number of vectors: {loaded_index_flat_ip.ntotal}")

# Verify search with loaded index
D_loaded, I_loaded = loaded_index_flat_ip.search(query_vectors_faiss_normalized, k_faiss)
print("Search results from loaded IndexFlatIP:")
print("Distances (Inner Product):")
print(D_loaded)
print("Indices:")
print(I_loaded)

# Clean up the created directory and file (optional)
# import shutil
# shutil.rmtree(index_dir)
# print(f"Cleaned up directory: {index_dir}")


IndexFlatIP saved to /content/faiss_indexes_tutorial/index_flat_ip.faiss
Loaded IndexFlatIP. Number of vectors: 7
Search results from loaded IndexFlatIP:
Distances (Inner Product):
[[0.6581109  0.14355162 0.11632239]]
Indices:
[[1 2 4]]



## 4. Practical Application: Building a Simple Semantic Search Engine

Let's put it all together by building a simple semantic search engine. We'll use a small subset of the AG News dataset.

**Steps:**
1.  Load the dataset.
2.  Generate embeddings for all documents in the dataset using SentenceTransformers.
3.  Build a FAISS index with these embeddings.
4.  Implement a search function that takes a query, embeds it, and searches the FAISS index.



### 4.1. Load Dataset

We'll use the `datasets` library to load the AG News dataset. For this demo, we'll only use a small subset to keep things fast.


In [None]:
print("Loading AG News dataset (a small subset for the application)...")
try:
    # Using a very small subset for quick demonstration
    ag_news_dataset = load_dataset("ag_news", split="train[:200]") # Increased to 200 for a bit more data
    print(f"AG News dataset loaded. Number of samples: {len(ag_news_dataset)}")
    documents = ag_news_dataset["text"]
    labels = ag_news_dataset["label"] # Optional: for context or potential evaluation
except Exception as e:
    print(f"Error loading AG News dataset: {e}. Using dummy documents for application part.")
    documents = [
        "Global markets rally on positive economic news.",
        "New breakthrough in cancer research announced by scientists.",
        "Tech company launches innovative smartphone with advanced AI features.",
        "The local sports team celebrated a stunning victory last night.",
        "Debate on climate change policies intensifies in parliament.",
        "Stock prices soar as investors gain confidence.",
        "Medical researchers discover a new treatment for a rare disease.",
        "Artificial intelligence is reshaping the future of work.",
        "The home team secured a crucial win in the championship finals.",
        "Governments worldwide discuss strategies for sustainable development."
    ] * 20 # 200 dummy documents
    labels = [i % 4 for i in range(len(documents))] # Dummy labels

print(f"Using {len(documents)} documents for the semantic search engine.")
print("First 3 documents:")
for i in range(3):
    print(f"- {documents[i][:100]}...")


Loading AG News dataset (a small subset for the application)...


AG News dataset loaded. Number of samples: 200
Using 200 documents for the semantic search engine.
First 3 documents:
- Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\b...
- Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,...
- Oil and Economy Cloud Stocks' Outlook (Reuters) Reuters - Soaring crude prices plus worries\about th...



### 4.2. Generate Embeddings for the Dataset

We'll use the same `all-MiniLM-L6-v2` model to embed all documents.


In [None]:
print(f"Generating embeddings for {len(documents)} documents...")
start_time_app = time.time()
document_embeddings_app = model_st.encode(documents, show_progress_bar=True, convert_to_numpy=True).astype(np.float32)

# Normalize for IndexFlatIP (cosine similarity)
faiss.normalize_L2(document_embeddings_app)

end_time_app = time.time()
print(f"Application embeddings generated in {end_time_app - start_time_app:.2f} seconds.")
print(f"Shape of document embeddings: {document_embeddings_app.shape}")


Generating embeddings for 200 documents...


Application embeddings generated in 0.54 seconds.
Shape of document embeddings: (200, 384)



### 4.3. Build FAISS Index

We'll use `IndexFlatIP` for simplicity in this application, as our dataset is small. For larger datasets, `IndexIVFFlat` or `IndexHNSWFlat` would be more appropriate.


In [None]:
dimension_app = document_embeddings_app.shape[1]
index_app = faiss.IndexFlatIP(dimension_app)
index_app.add(document_embeddings_app)
print(f"FAISS index built for the application. Total vectors: {index_app.ntotal}")


FAISS index built for the application. Total vectors: 200



### 4.4. Implement Search Function

This function will take a query, embed it, and search the FAISS index to find the most similar documents.


In [None]:
import time
import numpy as np
import faiss

def semantic_search(query_text, model, index, documents_list, k=5):
    """Performs semantic search for a given query.

    Args:
        query_text (str): The search query.
        model: The SentenceTransformer model.
        index: The FAISS index.
        documents_list (list): The list of original documents.
        k (int): Number of top results to retrieve.

    Returns:
        list: A list of tuples, where each tuple contains (score, document_text).
    """
    print(f'Searching for: "{query_text}"')
    start_time = time.time()

    # Encode and normalize the query
    query_embedding = model.encode(
        [query_text],
        convert_to_numpy=True,  # return a NumPy array, which FAISS can consume
    ).astype(np.float32)
    faiss.normalize_L2(query_embedding)  # normalize to unit length

    # Search in the index
    distances, indices = index.search(query_embedding, k)
    elapsed = time.time() - start_time
    print(f"Search completed in {elapsed:.4f} seconds.\n")

    results = []
    num_results = min(k, index.ntotal)  # the index may hold fewer than k vectors
    print(f"Top {num_results} results:")
    for rank in range(num_results):
        doc_idx = int(indices[0][rank])
        if doc_idx < 0:  # FAISS pads missing neighbors with -1
            break
        score = distances[0][rank]
        snippet = documents_list[doc_idx][:200].replace("\n", " ")
        print(f"  Rank {rank+1}: Score={score:.4f}, Index={doc_idx}")
        print(f"    Text: {snippet}...\n")
        results.append((score, documents_list[doc_idx]))

    return results

# Example usage:
# (Assuming `model_st`, `index_app`, and `documents` are already defined)
search_queries = [
    "latest developments in artificial intelligence",
    "global economic outlook and stock market trends",
    "exciting football match results"
]

for q in search_queries:
    semantic_search(q, model_st, index_app, documents)


Searching for: "latest developments in artificial intelligence"
Search completed in 0.0149 seconds.

Top 5 results:
  Rank 1: Score=0.3358, Index=109
    Text: New NASA Supercomputer to Aid Theorists and Shuttle Engineers (SPACE.com) SPACE.com - NASA researchers have teamed up with a pair of Silicon Valley firms to build \  a supercomputer that ranks alongsi...

  Rank 2: Score=0.2465, Index=69
    Text: Autodesk tackles project collaboration Autodesk  this week unwrapped an updated version of its hosted project collaboration service targeted at the construction and manufacturing industries. Autodesk ...

  Rank 3: Score=0.2406, Index=197
    Text: 'Invisible' technology for Olympics Getting the technology in place for Athens 2004 is an Olympic task in itself....

  Rank 4: Score=0.2318, Index=135
    Text: Canadian Robot a Candidate to Save Hubble (AP) AP - NASA said Tuesday it is moving ahead with plans to send a robot to the rescue of the aging Hubble Space Telescope....

  Rank 5: 


## 5. Advanced Considerations for ML Engineers

While the examples above cover the basics, real-world applications often require more sophisticated approaches.

*   **Scalability:** For very large datasets (millions or billions of vectors), `IndexFlatIP` or `IndexFlatL2` become too slow. Consider:
    *   `IndexIVFPQ`: Inverted File Index with Product Quantization. Offers a good balance of speed, memory, and accuracy. Requires training.
    *   `IndexHNSWFlat` or `IndexHNSWPQ`: Graph-based indexes that are very fast and accurate, but can have higher build times and memory usage.
    *   Distributed FAISS: For datasets that don't fit on a single machine.
*   **Performance Optimization:**
    *   **Embedding Generation:** Use GPUs for `sentence-transformers` if available. Process texts in batches.
    *   **FAISS Search:** Experiment with `nprobe` for IVF-based indexes. Higher `nprobe` means better accuracy but slower search. For HNSW, `efSearch` is a key parameter.
    *   **Hardware:** Utilizing GPUs for FAISS can significantly speed up both index building and searching for certain index types.
*   **Evaluation:**
    *   How do you know your search is good? Define metrics relevant to your task (e.g., Recall@K, Mean Reciprocal Rank).
    *   Create a labeled dataset of queries and relevant documents for evaluation.
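*   **Updating the Index:** FAISS indexes support appending vectors with `add`, but deletions and in-place updates are limited. If you have frequently changing data:
    *   Rebuild the index periodically.
    *   Wrap the index in `IndexIDMap` to attach your own IDs (`add_with_ids`). Removal via `remove_ids` is available on some index types (e.g., flat and IVF indexes) but can be inefficient, and HNSW indexes do not support it.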
    *   Consider solutions like ChromaDB or Weaviate for dynamic vector databases if frequent updates are critical.
*   **Hybrid Search:** Combine vector search with traditional keyword search (e.g., BM25) for improved relevance, especially for queries with rare keywords or specific entities.
*   **Fine-tuning Embedding Models:** For domain-specific tasks, fine-tuning a pre-trained SentenceTransformer model on your own data can significantly improve embedding quality and search relevance.
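
For the evaluation point above, here is a minimal sketch of Recall@K and Mean Reciprocal Rank in plain Python. The `eval_set` below is hypothetical labeled data invented for illustration; in practice the `retrieved` IDs would come from your index and the `relevant` IDs from human judgments.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant id, or 0.0 if none was retrieved."""
    relevant_set = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant_set:
            return 1.0 / rank
    return 0.0

# Hypothetical labeled queries: ranked retrieved ids and the ids judged relevant.
eval_set = [
    {"retrieved": [7, 3, 9, 1, 4], "relevant": [3, 4]},
    {"retrieved": [2, 8, 5, 6, 0], "relevant": [5]},
    {"retrieved": [1, 2, 3, 4, 5], "relevant": [9]},
]

mean_recall_at_3 = sum(
    recall_at_k(q["retrieved"], q["relevant"], 3) for q in eval_set
) / len(eval_set)
mrr = sum(
    reciprocal_rank(q["retrieved"], q["relevant"]) for q in eval_set
) / len(eval_set)

print(f"Recall@3 = {mean_recall_at_3:.3f}")
print(f"MRR      = {mrr:.3f}")
```

Tracking these numbers while varying parameters such as `nprobe` shows how much accuracy an approximate index gives up relative to an exact one.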



## 6. Conclusion & Further Learning

In this tutorial, we've covered:

*   The concept of text embeddings and their importance.
*   Generating high-quality text embeddings using `sentence-transformers`.
*   Building and searching vector indexes with `FAISS`, including basic (`IndexFlatIP`) and more advanced (`IndexIVFFlat`) types.
*   A practical example of building a semantic search engine.
*   Key considerations for deploying and scaling these techniques in real-world applications.

**Further Learning:**

*   **SentenceTransformers Documentation:** [https://www.sbert.net](https://www.sbert.net)
*   **FAISS GitHub & Wiki:** [https://github.com/facebookresearch/faiss/wiki](https://github.com/facebookresearch/faiss/wiki)
*   **ChromaDB:** [https://www.trychroma.com/](https://www.trychroma.com/) - An open-source embedding database that simplifies management.
*   **Weaviate:** [https://weaviate.io/](https://weaviate.io/) - Another powerful open-source vector database.
*   Explore different pre-trained models from SentenceTransformers for various tasks and languages.
*   Dive deeper into FAISS index types and their parameters for optimization.
*   Consider fine-tuning embedding models for your specific domain.

This field is rapidly evolving, so continuous learning is key. Happy embedding and indexing!



## 7. References

*   Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. *Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing*.
*   Johnson, J., Douze, M., & Jégou, H. (2019). Billion-scale similarity search with GPUs. *IEEE Transactions on Big Data*.
*   Official documentation for SentenceTransformers, FAISS, and Datasets libraries.
