#  Building a Question-Answering System with LlamaIndex

##  **Introduction**  


In this notebook, we will build an **AI-powered document retrieval system** using **LlamaIndex**, **FAISS**, and **Google Flan-T5 Large**. Instead of using OpenAI's GPT models, we use **Flan-T5 Large**, an open model from Google hosted on Hugging Face, to generate responses from the retrieved text.  

## **Objectives:**  
1. **Set up the environment** → Install and configure essential libraries.  
2. **Load documents** → Extract text from PDFs using `SimpleDirectoryReader`.  
3. **Segment the documents** → Split text into sentence-level nodes, each with a window of surrounding sentences, using `SentenceWindowNodeParser`.  
4. **Generate embeddings** → Convert document segments into numerical representations using `HuggingFaceEmbedding`.  
5. **Store embeddings in FAISS** → Index document embeddings for fast similarity search.  
6. **Perform document retrieval** → Retrieve the most relevant document segments from FAISS and pass them to **Flan-T5 Large** to generate a response.  

By implementing this pipeline, we create an efficient **semantic search engine** that can process and retrieve relevant information from large document collections using a **local LLM (Flan-T5 Large)** instead of cloud-based OpenAI models. 🚀  


In [None]:
!pip install --upgrade llama-index
!pip install llama-index-llms-huggingface llama-index-llms-huggingface-api
!pip install transformers accelerate bitsandbytes sentencepiece
!pip install llama-index-embeddings-instructor

Collecting llama-index
  Downloading llama_index-0.12.24-py3-none-any.whl.metadata (12 kB)
Collecting llama-index-agent-openai<0.5.0,>=0.4.0 (from llama-index)
  Downloading llama_index_agent_openai-0.4.6-py3-none-any.whl.metadata (727 bytes)
Collecting llama-index-cli<0.5.0,>=0.4.1 (from llama-index)
  Downloading llama_index_cli-0.4.1-py3-none-any.whl.metadata (1.5 kB)
Collecting llama-index-core<0.13.0,>=0.12.24 (from llama-index)
  Downloading llama_index_core-0.12.24.post1-py3-none-any.whl.metadata (2.5 kB)
Collecting llama-index-embeddings-openai<0.4.0,>=0.3.0 (from llama-index)
  Downloading llama_index_embeddings_openai-0.3.1-py3-none-any.whl.metadata (684 bytes)
Collecting llama-index-indices-managed-llama-cloud>=0.4.0 (from llama-index)
  Downloading llama_index_indices_managed_llama_cloud-0.6.9-py3-none-any.whl.metadata (3.6 kB)
Collecting llama-index-llms-openai<0.4.0,>=0.3.0 (from llama-index)
  Downloading llama_index_llms_openai-0.3.25-py3-none-any.whl.metadata (3.3 kB)





In [None]:
!pip install llama-index-embeddings-huggingface

Collecting llama-index-embeddings-huggingface
  Downloading llama_index_embeddings_huggingface-0.5.2-py3-none-any.whl.metadata (767 bytes)
Downloading llama_index_embeddings_huggingface-0.5.2-py3-none-any.whl (8.9 kB)
Installing collected packages: llama-index-embeddings-huggingface
Successfully installed llama-index-embeddings-huggingface-0.5.2


In [None]:
!pip install llama-index faiss-cpu transformers accelerate sentencepiece pymupdf


Collecting faiss-cpu
  Downloading faiss_cpu-1.10.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (4.4 kB)
Collecting pymupdf
  Downloading pymupdf-1.25.4-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB)
Downloading faiss_cpu-1.10.0-cp311-cp311-manylinux_2_28_x86_64.whl (30.7 MB)
Downloading pymupdf-1.25.4-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (20.0 MB)
Installing collected packages: pymupdf, faiss-cpu
Successfully installed faiss-cpu-1.10.0 pymupdf-1.25.4


In [None]:
!pip install llama-index-vector-stores-faiss

Collecting llama-index-vector-stores-faiss
  Downloading llama_index_vector_stores_faiss-0.3.0-py3-none-any.whl.metadata (658 bytes)
Downloading llama_index_vector_stores_faiss-0.3.0-py3-none-any.whl (3.9 kB)
Installing collected packages: llama-index-vector-stores-faiss
Successfully installed llama-index-vector-stores-faiss-0.3.0


In [None]:
from google.colab import drive
import os

# Mount Google Drive
drive.mount('/content/gdrive')


Mounted at /content/gdrive


In [None]:
!pip show llama-index

Name: llama-index
Version: 0.12.24
Summary: Interface between LLMs and your data
Home-page: https://llamaindex.ai
Author: Jerry Liu
Author-email: jerry@llamaindex.ai
License: MIT
Location: /usr/local/lib/python3.11/dist-packages
Requires: llama-index-agent-openai, llama-index-cli, llama-index-core, llama-index-embeddings-openai, llama-index-indices-managed-llama-cloud, llama-index-llms-openai, llama-index-multi-modal-llms-openai, llama-index-program-openai, llama-index-question-gen-openai, llama-index-readers-file, llama-index-readers-llama-parse, nltk
Required-by: 


In [None]:
# Import the necessary libraries (no OpenAI modules are needed for this pipeline)
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext, Settings
from llama_index.llms.huggingface import HuggingFaceLLM
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.faiss import FaissVectorStore

import faiss
import numpy as np

In [None]:
pdf_folder_path = "/content/pdfs"  # Google Colab uses '/content/' as the working directory

# Check and create folder if not exists
if not os.path.exists(pdf_folder_path):
    os.makedirs(pdf_folder_path)
    print(f" Directory '{pdf_folder_path}' created. Please upload PDFs to continue.")
else:
    print(f" Directory '{pdf_folder_path}' already exists.")


 Directory '/content/pdfs' already exists.


#### Load and Read the PDF Documents


In [None]:
# Define the directory path where PDFs are stored in Google Drive
pdf_folder_path = "/content/gdrive/MyDrive/GenAI-ML-BootCamp-PSTB"

# Define the path to a specific PDF file within the directory
pdf_file_path = "/content/gdrive/MyDrive/GenAI-ML-BootCamp-PSTB/arxiv_doc.pdf"
pdf_file_path

'/content/gdrive/MyDrive/GenAI-ML-BootCamp-PSTB/arxiv_doc.pdf'

#### Load PDFs Using SimpleDirectoryReader

In [None]:
from llama_index.core import SimpleDirectoryReader

# Load all PDFs from the specified directory
documents = SimpleDirectoryReader(pdf_folder_path).load_data()

# Check if documents were successfully loaded
if documents:
    print(f" Successfully loaded {len(documents)} documents.")
    print(documents[0].text[:1000])  # Show first 1000 characters from the first document
else:
    print(" No documents found. Make sure PDFs are in the folder.")


 Successfully loaded 38 documents.
# GenAI-ML-BootCamp


In [None]:
print(documents[1].text[:1000])

Towards Generalist Biomedical AI
Tao Tu∗, ‡, 1, Shekoofeh Azizi∗, ‡, 2,
Danny Driess2, Mike Schaekermann1, Mohamed Amin1, Pi-Chuan Chang1, Andrew Carroll1,
Chuck Lau1, Ryutaro Tanno2, Ira Ktena2, Basil Mustafa2, Aakanksha Chowdhery2, Yun Liu1,
Simon Kornblith2, David Fleet2, Philip Mansfield1, Sushant Prakash1, Renee Wong1, Sunny Virmani1,
Christopher Semturs1, S Sara Mahdavi2, Bradley Green1, Ewa Dominowska1, Blaise Aguera y Arcas1,
Joelle Barral2, Dale Webster1, Greg S. Corrado1, Yossi Matias1, Karan Singhal1, Pete Florence2,
Alan Karthikesalingam†, ‡,1 and Vivek Natarajan†, ‡,1
1Google Research,2Google DeepMind
Medicine is inherently multimodal, with rich data modalities spanning text, imaging, genomics, and more.
Generalist biomedical artificial intelligence (AI) systems that flexibly encode, integrate, and interpret
this data at scale can potentially enable impactful applications ranging from scientific discovery to care
delivery. To enable the development of these models, we firs

Since `documents[0]` contains only a heading and no real text, we remove it before building the index.

In [None]:
if documents[0].text.strip() == "# GenAI-ML-BootCamp":
    documents = documents[1:]  # Remove the first document

#### Convert Documents into Text Segments Using SentenceWindowNodeParser

In [None]:
from llama_index.core.node_parser import SentenceWindowNodeParser

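# Initialize the node parser. Each sentence becomes its own node; window_size is the
# number of neighbouring sentences on each side stored in the node's metadata as context
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=512  # counted in sentences, not tokens (library default: 3); can be adjusted
)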

# Convert loaded documents into structured text segments (nodes)
nodes = node_parser.get_nodes_from_documents(documents)

# Check how many segments were created
print(f" Number of document segments created: {len(nodes)}")

# Preview the first 3 segments to inspect how the text has been divided
for i, node in enumerate(nodes[:3]):
    print(f"\n Segment {i+1}:\n{node.text[:500]}")


 Number of document segments created: 1156

 Segment 1:
Towards Generalist Biomedical AI
Tao Tu∗, ‡, 1, Shekoofeh Azizi∗, ‡, 2,
Danny Driess2, Mike Schaekermann1, Mohamed Amin1, Pi-Chuan Chang1, Andrew Carroll1,
Chuck Lau1, Ryutaro Tanno2, Ira Ktena2, Basil Mustafa2, Aakanksha Chowdhery2, Yun Liu1,
Simon Kornblith2, David Fleet2, Philip Mansfield1, Sushant Prakash1, Renee Wong1, Sunny Virmani1,
Christopher Semturs1, S Sara Mahdavi2, Bradley Green1, Ewa Dominowska1, Blaise Aguera y Arcas1,
Joelle Barral2, Dale Webster1, Greg S. Corrado1, Yossi Matias1

 Segment 2:
Generalist biomedical artificial intelligence (AI) systems that flexibly encode, integrate, and interpret
this data at scale can potentially enable impactful applications ranging from scientific discovery to care
delivery. 

 Segment 3:
To enable the development of these models, we first curate MultiMedBench, a new multimodal
biomedical benchmark. 
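
The output above shows that each segment is a single sentence. The surrounding sentences are kept alongside it as context. The following is a simplified pure-Python sketch of that idea, not the library's implementation: `sentence_windows` is a hypothetical helper, and the regex splitter is far cruder than the tokenizer LlamaIndex uses.

```python
import re


def sentence_windows(text, window_size=1):
    """Return one record per sentence, each with its neighbouring sentences."""
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    records = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        records.append({"text": sentence, "window": " ".join(sentences[start:end])})
    return records


sample = "FAISS stores vectors. It searches them quickly. LlamaIndex wraps it."
records = sentence_windows(sample, window_size=1)
for record in records:
    print(record["text"], "|", record["window"])
```

Only the single sentence is embedded and matched against the query, while the wider window remains available as context for the answer.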


#### Store Hugging Face API Key Securely


In [None]:
from getpass import getpass

huggingface_token = getpass("Enter your Hugging Face token: ")
os.environ["HUGGING_FACE_HUB_TOKEN"] = huggingface_token

print("Hugging Face token set successfully!")


####  Initialize a Hugging Face Language Model (Flan-T5)

Initialize the Large Language Model (LLM)

In this project, we use google/flan-t5-large, a pre-trained sequence-to-sequence transformer model, as our LLM. This model is responsible for understanding natural language and generating responses based on retrieved information.

Role of the LLM in this project:
  *  Processes user queries in natural language.
  *  Generates structured, coherent answers based on retrieved documents.

Now, let's initialize the Flan-T5 model and its tokenizer.

In [None]:
# Initialize the LLM
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Wrap the Hugging Face model (Flan-T5) for use with LlamaIndex
llm = HuggingFaceLLM(
    model_name="google/flan-t5-large",  # Pre-trained Hugging Face model
    tokenizer_name="google/flan-t5-large",  # Tokenizer corresponding to the model
    context_window=2048,  # Maximum number of tokens LlamaIndex packs into a prompt
    max_new_tokens=256,  # Maximum number of tokens generated in a response
    # Sampling options go in generate_kwargs, which are passed to model.generate();
    # model_kwargs only apply when HuggingFaceLLM loads the model itself
    generate_kwargs={"do_sample": True, "temperature": 0.1},
    device_map="auto",  # Only used when HuggingFaceLLM loads the model itself
    model=AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large"),  # Load the pre-trained model
    tokenizer=AutoTokenizer.from_pretrained("google/flan-t5-large")  # Load the corresponding tokenizer
)

#### Initialize the Embedding Model for FAISS Indexing

In addition to our LLM, we also need an embedding model to convert text into numerical vector representations. These embeddings will be stored in FAISS, enabling fast and efficient semantic search.

Role of the Embedding Model in this project:  
  * Converts text into embeddings (numerical vectors).
  * Enables similarity search to retrieve relevant documents.
  * Works with FAISS to speed up document retrieval.

For this purpose, we use sentence-transformers/all-MiniLM-L6-v2, a lightweight but powerful embedding model. Let's initialize it below.

In [None]:
# Load the embedding model
embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")

# embed_model = HuggingFaceEmbedding(model_name="jinaai/jina-embeddings-v2-base-en")


# Generate embeddings for each node
for node in nodes:
    node.embedding = embed_model.get_text_embedding(node.text)

print(" Embeddings successfully generated!")


 Embeddings successfully generated!
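
To make the link between embeddings and similarity search concrete, here is a small pure-Python sketch using made-up 3-dimensional vectors in place of real embeddings. It assumes the vectors are unit-normalized, as the `normalize=True` field in the embedding model's settings printout suggests. Under that assumption the squared L2 distance equals `2 - 2 * cosine`, so ranking by L2 distance and ranking by cosine similarity give the same order.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy "embeddings" with made-up values, for illustration only
query_vec = normalize([1.0, 0.0, 1.0])
related_vec = normalize([0.9, 0.1, 0.8])
unrelated_vec = normalize([0.0, 1.0, 0.0])

for name, vec in [("related", related_vec), ("unrelated", unrelated_vec)]:
    cos = cosine_similarity(query_vec, vec)
    dist = squared_l2(query_vec, vec)
    print(f"{name}: cosine={cos:.3f}, squared L2={dist:.3f}, 2 - 2*cos={2 - 2 * cos:.3f}")
```
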


#### Set Global Configuration to Use Hugging Face Instead of OpenAI

In [None]:
# Set global configuration to avoid using OpenAI by default
Settings.llm = llm  #  Define the language model (Flan-T5 from Hugging Face)
Settings.embed_model = embed_model  #  Define the embedding model (MiniLM)

# Print the applied settings for verification
# print(Settings.llm)
print(Settings.embed_model)


model_name='sentence-transformers/all-MiniLM-L6-v2' embed_batch_size=10 callback_manager=<llama_index.core.callbacks.base.CallbackManager object at 0x7cd92e11d390> num_workers=None max_length=256 normalize=True query_instruction=None text_instruction=None cache_folder=None


In [None]:
print(f"LLM Model: {Settings.llm.model_name}")
print(f"Embedding Model: {Settings.embed_model.model_name}")


LLM Model: google/flan-t5-large
Embedding Model: sentence-transformers/all-MiniLM-L6-v2


#### Create and Save FAISS Index

In [None]:
# Define the directory where the FAISS index will be stored
faiss_dir = "./faiss_index"

# Create the directory if it does not exist yet
os.makedirs(faiss_dir, exist_ok=True)

# The FAISS index is built and saved in the next cell, once faiss_index exists


In [None]:
# Create a FAISS index whose dimension matches the embedding vectors
embedding_dim = len(nodes[0].embedding)
faiss_index = faiss.IndexFlatL2(embedding_dim)
vector_store = FaissVectorStore(faiss_index=faiss_index)

# Convert embeddings to float32 (required by FAISS)
for node in nodes:
    node.embedding = np.array(node.embedding, dtype=np.float32)

# Associate FAISS with LlamaIndex
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Create the FAISS-backed index
index = VectorStoreIndex(nodes, storage_context=storage_context)

# Verify the number of vectors in FAISS
print(f" Number of vectors in FAISS after indexing: {faiss_index.ntotal}")

# Save the FAISS index
faiss_index_path = "./faiss_index/index.faiss"
faiss.write_index(faiss_index, faiss_index_path)


print(" FAISS index created and saved successfully!")


 Number of vectors in FAISS after indexing: 1156
 FAISS index created and saved successfully!
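
`IndexFlatL2` is a flat index: it compares the query against every stored vector and returns the closest ones by squared L2 distance. The pure-Python sketch below illustrates that behaviour on toy 2-dimensional vectors. `flat_l2_search` is a hypothetical helper written for this illustration and says nothing about how FAISS implements the search internally.

```python
def flat_l2_search(vectors, query, k=2):
    """Exhaustive nearest-neighbour search by squared L2 distance."""
    distances = [
        (sum((x - y) ** 2 for x, y in zip(vec, query)), idx)
        for idx, vec in enumerate(vectors)
    ]
    distances.sort()  # smallest distance first
    return distances[:k]


# Toy stored vectors with made-up values
stored = [[0.0, 0.0], [1.0, 1.0], [0.9, 1.2], [5.0, 5.0]]
hits = flat_l2_search(stored, query=[1.0, 1.0], k=2)
for dist, idx in hits:
    print(f"vector {idx}: squared L2 distance {dist:.2f}")
```

Exhaustive search is exact, and its cost grows linearly with the number of stored vectors, which is unproblematic for the 1156 segments indexed here.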


#### Querying the FAISS Index  

Now that we have indexed our document segments into FAISS, we can perform **semantic search** to retrieve relevant information.  

####  Key Actions:  
1. **Define the query** → Specify the question we want to answer.  
2. **Use FAISS-based retrieval** → The query engine will match the query with the most relevant document segments.  
3. **Display the results** → Extract and print the top-ranked retrieved segments.  


In [None]:
#  Define the query text
query_text = "What is Med-PaLM Multimodal?"

# Create a query engine from the FAISS-backed index
query_engine = index.as_query_engine()

# Perform the search using the query engine
response = query_engine.query(query_text)

# Check if the search returned results
if response is None or not response.source_nodes:
    print(" The search returned no results. Check the FAISS indexing.")
else:
    print("\n Search Results:")

    # Loop through the retrieved segments and display relevant excerpts.
    # With IndexFlatL2 the score is a squared L2 distance, so lower means closer.
    for i, node in enumerate(response.source_nodes):
        print(f"\n Result {i+1} (Score: {node.score:.4f}):")

        # Display the first 500 characters of the retrieved text segment
        print(node.text[:500])



 Search Results:

 Result 1 (Score: 0.4583):
In particular, Med-PaLM M is a flexible multimodal sequence-to-sequence architecture
that can easily incorporate and interleave various types of multimodal biomedical information. 

 Result 2 (Score: 0.5728):
Med-PaLM M is a large multimodal generative model that flexibly
encodes and interprets biomedical data including clinical language, imaging, and genomics with the same
set of model weights. 
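
The retrieved segments become useful to the LLM once they are combined with the question into a single prompt. The sketch below shows one way to do that. The `build_prompt` helper and its template wording are hypothetical, written for illustration, and are not the template LlamaIndex uses internally.

```python
def build_prompt(question, segments):
    """Combine retrieved segments and the question into one prompt string."""
    context = "\n\n".join(f"[{i}] {text}" for i, text in enumerate(segments, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# The two segments retrieved above
retrieved = [
    "In particular, Med-PaLM M is a flexible multimodal sequence-to-sequence "
    "architecture that can easily incorporate and interleave various types of "
    "multimodal biomedical information.",
    "Med-PaLM M is a large multimodal generative model that flexibly encodes and "
    "interprets biomedical data including clinical language, imaging, and genomics "
    "with the same set of model weights.",
]
prompt = build_prompt("What is Med-PaLM Multimodal?", retrieved)
print(prompt)
```

In the notebook itself, `query_engine.query` performs this step internally, and `print(response)` displays the answer the LLM generated from the retrieved context.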
