# Creating an Index

This Jupyter notebook builds a FAISS-based index from the docstrings of [scikit-learn]() estimators. The index is then used to simulate how a vector database and LLMs can be integrated into DataRobot.

As a first step, let's gather all of the docstrings from scikit-learn.


In [14]:
import inspect
import os
import json
import numpy as np
import time
import faiss
from sklearn.utils.discovery import all_estimators

In [15]:
estimators = all_estimators()

docstrings = []
class_names = []

for name, estimator in estimators:
    # Check if it's actually an estimator (has fit method)
    if hasattr(estimator, 'fit') and inspect.isclass(estimator):
        doc = estimator.__doc__
        if doc is not None and len(doc.strip()) > 0:
            docstrings.append(doc)
            class_names.append(name)
            
print(f"Extracted {len(docstrings)} docstrings")
print(f"Here is a sample of {class_names[0]} \n")
print(docstrings[0][0:200])

Extracted 207 docstrings
Here is a sample of ARDRegression 

Bayesian ARD regression.

Fit the weights of a regression model, using an ARD prior. The weights of
the regression model are assumed to be in Gaussian distributions.
Also estimate the parameters lambd
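The raw `__doc__` strings collected above can carry a leading blank line and source indentation, which then ends up in the text that gets embedded. As a minimal sketch, assuming normalized text is wanted, the standard library's `inspect.getdoc` returns a cleaned version of the same docstring. `ToyEstimator` is a hypothetical stand-in class used only for illustration; it is not part of scikit-learn or of this notebook.

```python
import inspect


class ToyEstimator:
    """
    Toy estimator used only for illustration.

        Indented detail line.
    """

    def fit(self, X, y=None):
        # Hypothetical no-op fit so the class looks like an estimator
        return self


# Raw docstring: keeps the leading newline written in the source
raw_doc = ToyEstimator.__doc__

# Cleaned docstring: leading/trailing blank lines and common indentation removed
clean_doc = inspect.getdoc(ToyEstimator)

print(repr(raw_doc[:20]))
print(repr(clean_doc[:20]))
```

Whether to swap `estimator.__doc__` for `inspect.getdoc(estimator)` in the loop is a judgment call: the cleaned text is tidier, but it changes what the embedding model sees, so the index would need to be rebuilt.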


In [16]:
# Create the embeddings
from transformers import AutoTokenizer, AutoModel
import torch


MODEL_DIR = "embedding_model"
model_name = "prajjwal1/bert-tiny"

os.makedirs(MODEL_DIR, exist_ok=True)
print(f"Loading model from/to: {MODEL_DIR}")
tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=MODEL_DIR)
model = AutoModel.from_pretrained(model_name, cache_dir=MODEL_DIR)


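def mean_pooling(model_output, attention_mask):
    """Average the token embeddings of each sequence, ignoring padding positions."""
    token_embeddings = model_output[0]  # First element contains token embeddings
    # Expand the mask to the embedding shape so padded tokens contribute zero
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Divide by the number of real tokens; the clamp guards against division by zero
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)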

# Process in batches to avoid memory issues
batch_size = 64  # Larger batch size for speed
embeddings_list = []

start_time = time.time()

for i in range(0, len(docstrings), batch_size):
    batch = docstrings[i:i+batch_size]
    
    # Tokenize
    encoded_input = tokenizer(batch, padding=True, truncation=True, 
                             max_length=256, return_tensors='pt')  # Reduced max_length for speed
    
    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input)
    
    # Apply mean pooling
    batch_embeddings = mean_pooling(model_output, encoded_input['attention_mask']).numpy()
    embeddings_list.append(batch_embeddings)
    
    # Show progress sparingly for speed
    if (i // batch_size) % 20 == 0:
        print(f"Processed {i+len(batch)}/{len(docstrings)} documents...")

# Concatenate all batches
embeddings = np.vstack(embeddings_list)

elapsed_time = time.time() - start_time

print(f"Created {embeddings.shape[1]}-dimensional embeddings for {len(docstrings)} documents")
print(f"Processing took {elapsed_time:.2f} seconds ({len(docstrings)/elapsed_time:.2f} docs/sec)")
    

Loading model from/to: embedding_model
Processed 64/207 documents...
Created 128-dimensional embeddings for 207 documents
Processing took 0.45 seconds (460.04 docs/sec)
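To make the pooling step concrete, here is a small NumPy sketch of the same masked average that `mean_pooling` computes with PyTorch. The toy arrays are hypothetical values chosen for easy arithmetic; they are not real model output.

```python
import numpy as np

# Hypothetical toy batch: 2 sequences, 3 token positions, 2-dimensional embeddings
token_embeddings = np.array([
    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],  # third position is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],  # no padding
])
attention_mask = np.array([
    [1, 1, 0],
    [1, 1, 1],
])

# Add a trailing axis so the mask broadcasts over the embedding dimension
mask = attention_mask[..., None].astype(np.float64)  # shape (2, 3, 1)

# Sum only the real tokens, then divide by how many real tokens there are
summed = (token_embeddings * mask).sum(axis=1)        # shape (2, 2)
counts = np.clip(mask.sum(axis=1), 1e-9, None)        # shape (2, 1)
pooled = summed / counts

print(pooled)
```

The padded position `[9.0, 9.0]` in the first sequence has no effect on the result, which is the point of multiplying by the attention mask before averaging.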


In [17]:
# Create the FAISS Index
INDEX_PATH = "sklearn_docs.index"
d = embeddings.shape[1]
    
# Create a Flat index (exact search)
index = faiss.IndexFlatL2(d)

# Add embeddings to the index
index.add(embeddings.astype(np.float32))
faiss.write_index(index, INDEX_PATH)
print(f"Created index with {index.ntotal} documents")

Created index with 207 documents
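A flat index performs an exhaustive comparison against every stored vector, and `IndexFlatL2` reports squared Euclidean distances. The NumPy sketch below illustrates that idea on a hypothetical toy dataset. `brute_force_l2_search` is an illustrative helper written for this example, not a FAISS function, and it makes no attempt to match FAISS's performance.

```python
import numpy as np


def brute_force_l2_search(vectors, queries, k):
    """Return the k smallest squared L2 distances and their row indices per query."""
    # Pairwise differences, shape (n_queries, n_vectors, dim)
    diffs = queries[:, None, :] - vectors[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)
    # Sort each row by distance and keep the k nearest
    order = np.argsort(sq_dists, axis=1, kind="stable")[:, :k]
    return np.take_along_axis(sq_dists, order, axis=1), order


# Hypothetical toy data: four 2-dimensional vectors and one query
vectors = np.array([[0, 0], [1, 0], [0, 3], [4, 4]], dtype=np.float32)
queries = np.array([[0.9, 0.1]], dtype=np.float32)

distances, indices = brute_force_l2_search(vectors, queries, k=2)
print(indices)
print(distances)
```

Because the search is exact, the cost grows linearly with the number of stored vectors. That is unproblematic for a couple of hundred docstrings, which is why a flat index is a reasonable choice here.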


## Test a Query

In [18]:
query = "clustering algorithm for large datasets"

encoded_input = tokenizer([query], padding=True, truncation=True, 
                             max_length=256, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded_input)
    
# Reuse the pooling helper defined earlier so the query is embedded like the documents
query_embedding = mean_pooling(model_output, encoded_input['attention_mask']).numpy()

k = 3  # Return top 3 results
distances, indices = index.search(query_embedding, k)
print(f"\nTop {k} results for query: '{query}'")
for i, (idx, distance) in enumerate(zip(indices[0], distances[0])):
    print(f"{i+1}. {class_names[idx]} (Distance: {distance:.4f})")
    print(f"   {docstrings[idx][:150]}...\n")


Top 3 results for query: 'clustering algorithm for large datasets'
1. OPTICS (Distance: 37.9485)
   Estimate clustering structure from vector array.

OPTICS (Ordering Points To Identify the Clustering Structure), closely
related to DBSCAN, finds core...

2. EllipticEnvelope (Distance: 38.2346)
   An object for detecting outliers in a Gaussian distributed dataset.

Read more in the :ref:`User Guide <outlier_detection>`.

Parameters
----------
st...

3. IsolationForest (Distance: 38.2359)
   
Isolation Forest Algorithm.

Return the anomaly score of each sample using the IsolationForest algorithm

The IsolationForest 'isolates' observations...
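The raw 150-character snippets printed above include blank lines and section headers, which makes the results harder to scan. As a minimal sketch, the helper below turns one row of search output into `(name, distance, summary)` tuples, using the first non-empty docstring line as the summary. `summarize_hits` and the toy inputs are hypothetical; the check for negative indices reflects that FAISS fills unused result slots with `-1` when fewer than `k` neighbours are available.

```python
def summarize_hits(indices_row, distances_row, names, docs, width=80):
    """Pair each valid hit with its name, distance and first docstring line."""
    hits = []
    for idx, dist in zip(indices_row, distances_row):
        if idx < 0:
            # Placeholder slot: no neighbour was returned for this rank
            continue
        # First non-empty line of the docstring, trimmed to `width` characters
        lines = [line.strip() for line in docs[idx].splitlines() if line.strip()]
        summary = lines[0][:width] if lines else ""
        hits.append((names[idx], float(dist), summary))
    return hits


# Hypothetical toy inputs standing in for class_names, docstrings and search output
names = ["OPTICS", "IsolationForest"]
docs = [
    "Estimate clustering structure from vector array.\n\nMore text.",
    "\n    Isolation Forest Algorithm.\n\n    More text.",
]
hits = summarize_hits([0, 1, -1], [1.5, 2.5, 0.0], names, docs)

for rank, (name, dist, summary) in enumerate(hits, start=1):
    print(f"{rank}. {name} (Distance: {dist:.4f}) {summary}")
```

In the notebook this could be called as `summarize_hits(indices[0], distances[0], class_names, docstrings)`, assuming those variables are defined as in the cells above.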

