# Simple RAG example using Facebook AI Similarity Search (FAISS)

In this example, we'll demonstrate how to use **FAISS** for similarity-based document retrieval. We will create a small **mock dataset** of fictional documents and use **Sentence Transformers** to encode them into vectors. We will then build a **FAISS index** over those vectors to enable fast similarity search. Finally, we will simulate a **RAG** system that retrieves the most relevant documents and passes them to a language model to generate an answer.


## Steps Overview:
1. **Create a mock dataset**: We create a small set of fictional documents.
2. **Generate embeddings**: We use the **SentenceTransformer** model to convert these documents into vector embeddings.
3. **Build FAISS index**: We build a **FAISS index** that will store these vector embeddings for fast similarity search.
4. **Search and retrieve**: We perform a similarity search based on a query and retrieve the most relevant document(s).
5. **Answer generation**: Using the retrieved document(s), we simulate a **RAG pipeline** that generates a response through the OpenAI-compatible API that Aitta provides.

This example demonstrates how **FAISS** can be used for efficient document retrieval, and how **RAG** can help generate contextually relevant answers from these documents.
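
The five steps above can be sketched end to end without any external libraries. This is a toy illustration only: a bag-of-words counter stands in for the SentenceTransformer encoder, a linear scan stands in for the FAISS index, and `toy_embed`, `retrieve`, and `build_prompt` are hypothetical helper names rather than part of any library. The generation step is left out.

```python
import math
import re
from collections import Counter

documents = [
    "Cacapapadadas are grey, 10cm long worms.",
    "The moon is actually made of a soft cheese.",
]

def toy_embed(text):
    """Stand-in for a real encoder: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, docs, k=1):
    """Score every document against the query and return the k best."""
    q = toy_embed(query)
    scored = sorted(((cosine(q, toy_embed(d)), d) for d in docs), reverse=True)
    return [d for _, d in scored[:k]]

def build_prompt(query, context_docs):
    """Place the retrieved text in front of the question."""
    context = "\n".join(context_docs)
    return (
        "Given the following document, answer the question:\n\n"
        f"Document: {context}\n\nQuestion: {query}\nAnswer:"
    )

query = "What are Cacapapadadas?"
top_docs = retrieve(query, documents, k=1)
prompt = build_prompt(query, top_docs)
print(prompt)
```

The rest of the notebook replaces each stand-in with the real component: dense embeddings, a FAISS index, and a call to the language model.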


In [1]:
#!pip install sentence-transformers faiss-cpu openai aitta-client
!pip install -r requirements.txt

Collecting openai==1.66.5 (from -r requirements.txt (line 1))
  Downloading openai-1.66.5-py3-none-any.whl.metadata (24 kB)
Collecting langchain (from -r requirements.txt (line 2))
  Downloading langchain-0.3.21-py3-none-any.whl.metadata (7.8 kB)
Collecting sentence-transformers (from -r requirements.txt (line 3))
  Downloading sentence_transformers-3.4.1-py3-none-any.whl.metadata (10 kB)
Collecting aitta-client (from -r requirements.txt (line 5))
  Downloading aitta_client-0.2.0-py3-none-any.whl.metadata (5.3 kB)
Collecting tf-keras (from -r requirements.txt (line 6))
  Downloading tf_keras-2.19.0-py3-none-any.whl.metadata (1.8 kB)
Collecting faiss-cpu (from -r requirements.txt (line 7))
  Downloading faiss_cpu-1.10.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (4.4 kB)
Collecting jiter<1,>=0.4.0 (from openai==1.66.5->-r requirements.txt (line 1))
  Downloading jiter-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.2 kB)
Collecting pydantic<3,>=1.9.0 (fr

In [1]:
import numpy as np
import openai

import faiss
from sentence_transformers import SentenceTransformer
from aitta_client import Model, Client, StaticAccessTokenSource

2025-03-24 15:38:30.151089: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-03-24 15:38:30.153515: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2025-03-24 15:38:30.156769: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2025-03-24 15:38:30.166001: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1742830710.192803    1168 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1742830710.19

In [2]:
api_key = ""
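
Hard-coding a token in a notebook makes it easy to leak when the notebook is shared. A minimal sketch, assuming the token has been exported as an environment variable; the variable name `AITTA_API_KEY` and the helper `load_api_key` are hypothetical, not something Aitta prescribes.

```python
import os

def load_api_key(var_name="AITTA_API_KEY"):
    """Return the token from the environment, or raise a clear error if it is missing."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable to your access token.")
    return key

# Usage (commented out so the cell does not fail when the variable is unset):
# api_key = load_api_key()
```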

In [3]:
# configure Client instance with API URL and access token
token_source = StaticAccessTokenSource(api_key)
aitta_client = Client("https://api-staging-aitta.2.rahtiapp.fi", token_source)

# load the LumiOpen/Poro-34B-chat model
poro_model = Model.load("LumiOpen/Poro-34B-chat", aitta_client)
print(poro_model.description)

# configure OpenAI client to use the Aitta OpenAI compatibility endpoints
client = openai.OpenAI(api_key=token_source.get_access_token(), base_url=poro_model.openai_api_url)


Poro 34B chat is a chat-tuned version of Poro 34B trained to follow instructions in both Finnish and English.

Poro was created in a collaboration between SiloGen from Silo AI, the TurkuNLP group of the University of Turku, and High Performance Language Technologies (HPLT). Training was conducted on the LUMI supercomputer, using compute resources generously provided by CSC - IT Center for Science, Finland.

This project is part of an ongoing effort to create open source large language models for non-English and especially low resource languages like Finnish. Through the combination of English and Finnish training data we get a model that outperforms previous Finnish only models, while also being fluent in English and code, and capable of basic translation between English and Finnish.



In [4]:
# Create a mock dataset as a list of "documents"
documents = ["Cacapapadadas are grey, 10cm long worms.",
"The moon is actually made of a soft cheese."]


In [5]:
from sentence_transformers import SentenceTransformer

#  Initialize the SentenceTransformer model as encoder and generate vector embeddings
encoder = SentenceTransformer("all-MiniLM-L6-v2")
vectors = encoder.encode(documents)
#type(vectors)



In [86]:
vectors.shape

(2, 384)

In [18]:
# Build a FAISS index from vectors

import faiss

# Determine the dimensionality of the vector embeddings
vector_dimension = vectors.shape[1]

# Initialize a flat (exhaustive) FAISS index that scores by Inner Product (IP)
index = faiss.IndexFlatIP(vector_dimension)  # IP equals cosine similarity once the vectors are L2-normalized
# Alternatively, you could use IndexFlatL2 for Euclidean distance-based similarity


# Normalize the vectors in place to unit length so that the inner product equals cosine similarity
faiss.normalize_L2(vectors)


# Add the vectors to the FAISS index
index.add(vectors)

# Check the type of the index to ensure it's properly created
type(index)


faiss.swigfaiss_avx512.IndexFlatIP
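
Why normalize before adding vectors to an `IndexFlatIP`? For unit-length vectors the inner product is the same number as the cosine similarity. A dependency-free sketch of that identity with made-up 3-dimensional vectors; the helper names are illustrative, and FAISS itself performs the normalization in place on float32 arrays.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )

a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

ip_raw = inner_product(a, b)                               # depends on vector lengths
ip_unit = inner_product(l2_normalize(a), l2_normalize(b))  # length-independent
cos = cosine_similarity(a, b)
print(ip_raw, ip_unit, cos)
```

The raw inner product grows with the vector lengths, while the normalized one matches the cosine similarity, which is what makes the scores comparable across documents.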

In [25]:
# Create a search vector

import numpy as np

# Define the query text for searching in the FAISS index
search_text = 'What is the moon made of?'

# Convert the query text into an embedding (vector)
search_vector = encoder.encode(search_text)

# Reshape the query embedding into a 2D array of shape (1, dimension), as FAISS expects, and normalize it
search_vector = np.array([search_vector])
faiss.normalize_L2(search_vector)

# Search the FAISS index for the k most similar documents
# Here k is the total number of documents (2), so we see how similar each one is to the query
k = index.ntotal
distances, indices = index.search(search_vector, k=k)  # Perform the search


# Print the similarity scores and corresponding indices of the retrieved documents (best match first)
print(distances)
print(indices)

[[0.69921315 0.16020189]]
[[1 0]]
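
Conceptually, a flat index compares the query against every stored vector and returns the k best scores, best first. A plain Python sketch of that exhaustive search; it illustrates the idea only and is not how FAISS is implemented internally, and `flat_ip_search` is a hypothetical name.

```python
def flat_ip_search(stored, query, k):
    """Return (scores, indices) of the k stored vectors with the largest inner product."""
    scored = [
        (sum(q * s for q, s in zip(query, vec)), i)
        for i, vec in enumerate(stored)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest score first
    top = scored[:k]
    return [score for score, _ in top], [i for _, i in top]

# Three made-up unit vectors and a query that points along the second axis
stored = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
query = [0.0, 1.0]

scores, indices = flat_ip_search(stored, query, k=2)
print(scores, indices)
```

As in the FAISS output above, the result is a list of scores and a parallel list of positions into the original document list.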


In [None]:
# Print each retrieved document along with its similarity score
for i, idx in enumerate(indices[0]):
    print(f"Rank {i+1}:")
    print("Text:", documents[idx])  # Retrieve the document text by its index
    print("Distance:", distances[0][i])  # Inner-product (cosine) score, despite the label: higher means more similar
    print("-" * 50)


Rank 1:
Text: The moon is actually made of a soft cheese.
Distance: 0.69921315
--------------------------------------------------
Rank 2:
Text: Cacapapadadas are grey, 10cm long worms.
Distance: 0.16020189
--------------------------------------------------
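
With `IndexFlatIP`, a higher score means a closer match, whereas `IndexFlatL2` returns squared Euclidean distances, where lower is closer. For unit-length vectors the two are linked by ‖a − b‖² = 2 − 2·(a·b), so both metrics rank the documents in the same order. A small check of that relation, using made-up unit vectors and the two scores printed above (the conversion is approximate for real float32 embeddings):

```python
import math

# Verify the identity on two made-up unit vectors
a = [0.6, 0.8]
b = [1.0, 0.0]
ip = sum(x * y for x, y in zip(a, b))
sq_l2 = sum((x - y) ** 2 for x, y in zip(a, b))
identity_holds = math.isclose(sq_l2, 2.0 - 2.0 * ip)

# Convert the inner-product scores from the search above to squared L2 distances
ip_scores = [0.69921315, 0.16020189]
squared_l2 = [2.0 - 2.0 * s for s in ip_scores]

# Rank by descending inner product and by ascending distance
rank_by_ip = sorted(range(len(ip_scores)), key=lambda i: ip_scores[i], reverse=True)
rank_by_l2 = sorted(range(len(squared_l2)), key=lambda i: squared_l2[i])
print(identity_holds, squared_l2, rank_by_ip, rank_by_l2)
```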


In [27]:
input_query = "What are Cacapapadadas?"


# Embed the query
query_embedding = encoder.encode(input_query)

query_embedding = np.array([query_embedding])  # FAISS expects a 2D array; without this, an IndexError (tuple index out of range) is raised
faiss.normalize_L2(query_embedding)

# Perform similarity search on the FAISS index
k = 1  # Number of nearest neighbors to retrieve
distances, indices = index.search(query_embedding, k)

# Retrieve the document(s) corresponding to the top index
retrieved_documents = [documents[i] for i in indices[0]]
print(retrieved_documents)

# Show the index and similarity score of the best match
print("Most similar document index:", indices)
print("Distance:", distances)


# Prepare the prompt, joining the retrieved document(s) into plain text
context = "\n".join(retrieved_documents)
prompt = f"Given the following document, answer the question:\n\nDocument: {context}\n\nQuestion: {input_query}\nAnswer:"




['Cacapapadadas are grey, 10cm long worms.']
Most similar document index: [[0]]
Distance: [[0.6765009]]
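
When more than one document is retrieved (k > 1), it helps to present each passage separately in the prompt rather than as a Python list representation. A sketch of a small helper that numbers the passages; `format_rag_prompt` is a hypothetical name and the template wording simply follows the prompt used in this notebook.

```python
def format_rag_prompt(question, passages):
    """Build a prompt whose context lists each retrieved passage on its own numbered line."""
    context = "\n".join(f"[{n}] {text}" for n, text in enumerate(passages, start=1))
    return (
        "Given the following documents, answer the question:\n\n"
        f"Documents:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

example = format_rag_prompt(
    "What are Cacapapadadas?",
    [
        "Cacapapadadas are grey, 10cm long worms.",
        "The moon is actually made of a soft cheese.",
    ],
)
print(example)
```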


In [22]:
input_query

'What are Cacapapadadas?'

In [None]:
# Call the model through Aitta's OpenAI-compatible API
response = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": prompt
        }
    ],
    model=poro_model.id,
    stream=False  # Aitta does not currently support response streaming, so the full response arrives in one go
)

# Display the answer
answer = response.choices[0].message.content
print("Answer:", answer)

## LLM usage without RAG

Now, let's test how the model responds to the query without relying on an external data source.

In [None]:
input_query = "What are Cacapapadadas?"


# Call the model through Aitta's OpenAI-compatible API
response = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": input_query
        }
    ],
    model=poro_model.id,
    stream=False  # Aitta does not currently support response streaming, so the full response arrives in one go
)

# Display the answer
answer = response.choices[0].message.content
print("Answer:", answer)

Answer: Cacapapadada is a fictional character from the animated series "Archer", voiced by Jessica Walter. She is the mother of the main character, Archer, and is known for her strict and demanding personality.


## Did the model hallucinate? 

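"Cacapapadadas" is a made-up term, so without the retrieved document the model can only fall back on patterns in its training data, and the confident answer above is fabricated. To try to reduce the chance of such hallucinations, we can make the prompt more restrictive and ask the model to answer only when it is sure.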

In [None]:
input_query = "What are Cacapapadadas?"

prompt = f"Answer the query only if you know the answer for sure. Query: {input_query}"

# Call the model through Aitta's OpenAI-compatible API
response = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": prompt
        }
    ],
    model=poro_model.id,
    stream=False  # Aitta does not currently support response streaming, so the full response arrives in one go
)

# Display the answer
answer = response.choices[0].message.content
print("Answer:", answer)
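
To compare the three prompting strategies used in this notebook side by side, the calls can be wrapped in one loop. A sketch assuming a hypothetical `ask(prompt)` callable that would wrap `client.chat.completions.create`; a stub is used here so the snippet runs without network access, and its canned reply is not real model output.

```python
def build_variants(question, retrieved_document):
    """Return the three prompt styles used in this notebook, keyed by name."""
    return {
        "plain": question,
        "cautious": f"Answer the query only if you know the answer for sure. Query: {question}",
        "rag": (
            "Given the following document, answer the question:\n\n"
            f"Document: {retrieved_document}\n\nQuestion: {question}\nAnswer:"
        ),
    }

def compare(ask, question, retrieved_document):
    """Send each prompt variant through `ask` and collect the answers by variant name."""
    variants = build_variants(question, retrieved_document)
    return {name: ask(prompt) for name, prompt in variants.items()}

def stub_ask(prompt):
    # Placeholder for the real API call; it only reports the prompt length.
    return f"<stub reply to a {len(prompt)}-character prompt>"

results = compare(
    stub_ask,
    "What are Cacapapadadas?",
    "Cacapapadadas are grey, 10cm long worms.",
)
for name, answer in results.items():
    print(f"{name}: {answer}")
```

Swapping `stub_ask` for a function that calls the model would print the three real answers together, which makes it easier to see how much the retrieved document changes the response.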