<a href="https://colab.research.google.com/github/avikumart/LLM-GenAI-Transformers-Notebooks/blob/main/TMLC_LLM_projects/RAG/LlamaIndex_%2B_Qdrant_%2B_Open_source_Rerank_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Installing Necessary Libraries

In [1]:
!pip install accelerate einops sentence-transformers transformers qdrant-client \
llama-index llama-index-cli llama-index-core llama-index-embeddings-fastembed llama-index-legacy \
llama-index-llms-huggingface llama-index-vector-stores-qdrant -q


## Loading a sample OpenAI GPT-4 paper

Creating a directory and saving the document under that directory

In [2]:
# Create a directory named 'Data' in the current working directory
! mkdir Data

# Download the GPT-4 paper PDF from the given URL
# Save the downloaded file into the 'Data' directory with the name 'gpt-4.pdf'
! wget "https://cdn.openai.com/papers/gpt-4.pdf" -O Data/gpt-4.pdf

--2025-01-17 09:15:15--  https://cdn.openai.com/papers/gpt-4.pdf
Resolving cdn.openai.com (cdn.openai.com)... 13.107.246.69, 2620:1ec:bdf::69
Connecting to cdn.openai.com (cdn.openai.com)|13.107.246.69|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5229908 (5.0M) [application/pdf]
Saving to: ‘Data/gpt-4.pdf’


2025-01-17 09:15:16 (11.2 MB/s) - ‘Data/gpt-4.pdf’ saved [5229908/5229908]



## Importing libraries

In [3]:
import torch
from qdrant_client import QdrantClient # Import QdrantClient to interact with a Qdrant vector database, often used for vector search and similarity matching
from transformers import AutoModelForCausalLM, AutoTokenizer # Hugging Face classes for loading causal language models and their tokenizers
from llama_index.core import SimpleDirectoryReader # Used to read and load documents from a directory into the application
from llama_index.embeddings.fastembed import FastEmbedEmbedding # Provides a method for generating embeddings using the FastEmbed model
from llama_index.core import Settings # Used to configure global or specific settings for the Llama Index library
from llama_index.core import PromptTemplate # Used to define and manage templates for prompts when interacting with language models
from llama_index.llms.huggingface import HuggingFaceLLM # A wrapper to integrate Hugging Face language models (LLMs) with Llama Index
from llama_index.core import VectorStoreIndex # A class to create and manage an index of vector embeddings for documents
from llama_index.core import StorageContext # Used to manage storage configurations and contexts for storing vectors or metadata
from llama_index.vector_stores.qdrant import QdrantVectorStore # A wrapper for integrating Qdrant as the backend vector store for Llama Index
from llama_index.core.postprocessor import SentenceTransformerRerank # Used for re-ranking results based on similarity using a SentenceTransformer model
from llama_index.core.memory import ChatMemoryBuffer # Keeps recent chat history within a token limit so the chat engine can use it as conversation memory

## Data Loading

In [4]:
# Load data from a directory using SimpleDirectoryReader
documents = SimpleDirectoryReader("/content/Data").load_data()

In [5]:
# Initialize an embedding model using FastEmbedEmbedding
# - model_name: Specifies the pre-trained embedding model to be used.
#   In this case, "BAAI/bge-small-en-v1.5" is a model optimized for embedding text in English.
embed_model = FastEmbedEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Assign the initialized embedding model to the global Settings object
# - Settings.embed_model: A configuration parameter for Llama Index that sets the embedding model to be used for generating vector representations of text data.
Settings.embed_model = embed_model

# Set the chunk size for text processing within the global Settings object
# - Settings.chunk_size: Defines the maximum size of each text chunk, measured in tokens.
#   A chunk size of 1024 ensures that text is divided into manageable pieces before embedding or indexing.
Settings.chunk_size = 1024

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
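
To make the effect of `Settings.chunk_size = 1024` concrete, here is a minimal sketch of fixed-size chunking. It is a simplification, not LlamaIndex's splitter: the helper `chunk_tokens` is hypothetical, and whitespace-separated words stand in for real tokenizer tokens. The library's own splitter is more sophisticated (for example, it tries to respect sentence boundaries).

```python
def chunk_tokens(tokens, chunk_size, overlap=0):
    """Split a token list into windows of at most chunk_size tokens.

    Hypothetical helper for illustration only; consecutive windows
    share `overlap` tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window already reaches the end of the input
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Assumption: 2500 placeholder "tokens" stand in for a parsed document
words = [f"w{i}" for i in range(2500)]
chunks = chunk_tokens(words, chunk_size=1024)
print([len(c) for c in chunks])  # [1024, 1024, 452]
```

Every chunk except the last is exactly `chunk_size` long, so a larger value means fewer, longer chunks to embed and retrieve.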



In [9]:
# Define a system-level prompt for the AI assistant
# - The prompt specifies the assistant's role and behavior:
#   Generate brief, clear, and concise responses to user queries based on the provided context.
system_prompt = "You are an intelligent AI assistant that generates brief, clear and concise response to user queries based on context provided"

# Define a query wrapper prompt using a PromptTemplate
# - The PromptTemplate is used to structure how user queries are wrapped for processing.
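# - In this case, the user query (`{query_str}`) is wrapped between the markers <|USER|> and <|ASSISTANT|>.
#   These markers follow a StableLM-style convention rather than the Llama 3 chat template, so the Llama tokenizer
#   likely treats them as plain text; they still separate the user input from the assistant response.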
query_wrapper_prompt = PromptTemplate("<|USER|>{query_str}<|ASSISTANT|>")
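
To see what the wrapped query looks like before it reaches the model, the sketch below fills the same template string with plain Python string formatting. This is a stand-in for the `PromptTemplate` object, and `wrap_query` is a hypothetical helper used only for illustration.

```python
# Same template text as query_wrapper_prompt above
QUERY_WRAPPER = "<|USER|>{query_str}<|ASSISTANT|>"


def wrap_query(query_str: str) -> str:
    """Substitute the user query into the wrapper template."""
    return QUERY_WRAPPER.format(query_str=query_str)


wrapped = wrap_query("What is GPT-4?")
print(wrapped)  # <|USER|>What is GPT-4?<|ASSISTANT|>
```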

In [7]:
# Log in to the Hugging Face Hub to ensure the Llama model can be accessed
from google.colab import userdata
from huggingface_hub import login
login(userdata.get('hf_token'))

In [10]:
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")  # Load the tokenizer for the Llama-3.2-3B-Instruct model

stopping_ids = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")]  # Define stopping conditions (EOS token and custom <|eot_id|>)

llm = HuggingFaceLLM(  # Initialize HuggingFaceLLM for text generation
    context_window=4096,  # Set the context window size (maximum tokens considered in a single input)
    max_new_tokens=2048,  # Set the maximum number of tokens to generate in the output
    generate_kwargs={"temperature": 0.2, "do_sample": False},  # Generation parameters: do_sample=False selects greedy (deterministic) decoding, so the temperature value has no effect unless sampling is enabled
    system_prompt=system_prompt,  # Pass the system-level prompt for guiding the model's behavior
    query_wrapper_prompt=query_wrapper_prompt,  # Wrap user queries with a consistent format
    tokenizer_name="meta-llama/Llama-3.2-3B-Instruct",  # Specify tokenizer name
    model_name="meta-llama/Llama-3.2-3B-Instruct",  # Specify model name
    device_map="auto",  # Automatically map model layers to available devices (CPU/GPU)
    stopping_ids=stopping_ids,  # Define token IDs that stop the generation process
    tokenizer_kwargs={"max_length": 4096},  # Set maximum token length for the tokenizer
    model_kwargs={"torch_dtype": torch.float16}  # Set model to use half-precision (float16) for faster computation and less memory usage
)

Settings.llm = llm  # Assign the initialized LLM to the global Settings object for use within Llama Index
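
As a rough check on why `torch.float16` matters, the arithmetic below estimates the memory needed for the model weights. The two checkpoint shards downloaded for this model are reported as 4.97 GB and 1.46 GB. The sketch assumes the checkpoint stores 2 bytes per parameter and that "GB" means 10^9 bytes; `weight_memory_gb` is a hypothetical helper, and the estimate ignores activations and the KV cache.

```python
BYTES_PER_PARAM_FP16 = 2  # half precision
BYTES_PER_PARAM_FP32 = 4  # single precision


def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Approximate memory for the weights alone, in units of 10^9 bytes."""
    return num_params * bytes_per_param / 1e9


# Shard sizes as reported during the download (assumed to be 16-bit weights)
shard_sizes_gb = [4.97, 1.46]
total_bytes = sum(shard_sizes_gb) * 1e9
approx_params = total_bytes / BYTES_PER_PARAM_FP16

print(f"Approximate parameter count: {approx_params / 1e9:.1f}B")
print(f"float16 weights: {weight_memory_gb(approx_params, BYTES_PER_PARAM_FP16):.1f} GB")
print(f"float32 weights: {weight_memory_gb(approx_params, BYTES_PER_PARAM_FP32):.1f} GB")
```

Under these assumptions the weights need roughly twice as much memory in float32 as in float16, which is the difference between fitting comfortably on a free Colab GPU and not.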


In [11]:
client = QdrantClient(location=":memory:")  # Initialize QdrantClient in memory mode for fast and lightweight experiments (no deployment needed)

vector_store = QdrantVectorStore(client=client, collection_name="test")  # Create a vector store using Qdrant and the in-memory client with a collection name "test"
storage_context = StorageContext.from_defaults(vector_store=vector_store)  # Create a storage context with the default settings and the defined vector store
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)  # Create an index from the provided documents using the storage context
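
Conceptually, the index stores one embedding vector per chunk, and a query is answered by finding the stored vectors closest to the query embedding. The sketch below illustrates that idea in pure Python, assuming cosine similarity as the metric. The names `cosine_similarity`, `top_k`, and `toy_store` are illustrative, and the three-dimensional vectors are made up; real embeddings have hundreds of dimensions, and Qdrant uses much more efficient search structures.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k):
    """Return the ids of the k stored vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in stored.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Made-up embeddings for three chunks
toy_store = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.9, 0.1, 0.0],
    "chunk_c": [0.0, 1.0, 0.0],
}
print(top_k([1.0, 0.0, 0.0], toy_store, k=2))  # ['chunk_a', 'chunk_b']
```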



In [12]:
rerank = SentenceTransformerRerank(model="cross-encoder/ms-marco-MiniLM-L-2-v2", top_n=3)  # Initialize the SentenceTransformerRerank with a pre-trained model for re-ranking results (returns top 3)
query_engine = index.as_query_engine(similarity_top_k=10, node_postprocessors=[rerank])  # Create a query engine from the index, retrieve top 10 similar results and apply reranking
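
The query engine above works in two stages: a fast first pass fetches 10 candidates, then a slower and more accurate model rescores them and keeps the best 3. The sketch below shows only the control flow. Both scoring functions are toy stand-ins chosen for illustration (word overlap for vector search, Jaccard similarity for the cross-encoder), and `retrieve_then_rerank` is a hypothetical helper, not a LlamaIndex API.

```python
def retrieve_then_rerank(query, candidates, first_stage_score, rerank_score,
                         top_k, top_n):
    """Shortlist top_k candidates cheaply, then rescore and keep top_n."""
    shortlisted = sorted(candidates,
                         key=lambda c: first_stage_score(query, c),
                         reverse=True)[:top_k]
    reranked = sorted(shortlisted,
                      key=lambda c: rerank_score(query, c),
                      reverse=True)
    return reranked[:top_n]


def word_overlap(query, text):
    """Toy first-stage score: number of shared words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def jaccard(query, text):
    """Toy second-stage score: shared words divided by all distinct words."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0


docs = [
    "gpt-4 accepts image and text inputs",
    "gpt-4 is a model and the report describes the model in text and figures across many pages",
    "the weather is sunny",
    "text inputs",
]
result = retrieve_then_rerank("gpt-4 text inputs", docs, word_overlap, jaccard,
                              top_k=3, top_n=2)
print(result)  # ['text inputs', 'gpt-4 accepts image and text inputs']
```

Note that the second stage changes the ordering of the shortlist: the candidate ranked first by the cheap score ends up second after reranking. Only `top_k` candidates ever reach the expensive scorer, which keeps its cost bounded.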


In [13]:
import warnings
warnings.filterwarnings("ignore")

In [14]:
response = query_engine.query("What are the key points about GPT-4 that differentiates it from GPT-3?")
print(f"Response Generated: {response}")

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Response Generated: GPT-4 is a large-scale, multimodal model that can accept image and text inputs and produce text outputs. It exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 has similar limitations to earlier GPT models, such as not being fully reliable, having a limited context window, and not learning from experience. Despite these limitations, GPT-4 has made significant progress on public benchmarks like TruthfulQA, which tests the model's ability to separate fact from an adversarially-selected set of incorrect statements.

In addition to its performance capabilities, GPT-4 has been designed with scalability in mind. The model's architecture and training infrastructure have been optimized to predictably scale up to large training runs, allowing for more accurate predictions of performance on smaller models. This scalability enables the model to be deployed

In [15]:
response = query_engine.query("What were the benchmark results of GPT-4 compared to other models?")
print(f"Response Generated: {response}")

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Response Generated: GPT-4 outperformed GPT-3.5 on most exams tested, with the lower end of the range of percentiles, but this created some artifacts on the AP exams with very wide scoring bins. On the majority of professional and academic exams, GPT-4 exhibited human-level performance. It also passed a simulated version of the Uniform Bar Examination with a score in the top 10% of test takers. GPT-4 demonstrated strong performance on traditional NLP benchmarks and outperformed state-of-the-art systems on these tests. Its capabilities on exams were primarily due to the pre-training process, and not significantly affected by RLHF. On traditional benchmarks, GPT-4 considerably outperformed existing language models and previously state-of-the-art systems, often with benchmark-specific crafting or additional training protocols.


## Memory in LlamaIndex

In [16]:
memory = ChatMemoryBuffer.from_defaults(token_limit=4000)  # Initialize a chat memory buffer with a token limit of 4000 for storing the chat context

chat_engine = index.as_chat_engine(  # Create a chat engine from the index to handle conversational interactions
    chat_mode="context",  # Set the chat mode to 'context': relevant text is retrieved from the index for each user message and passed to the LLM along with the conversation history
    memory=memory,  # Assign the memory buffer to store and manage the conversation history
    system_prompt=(  # Provide the system prompt to guide the AI assistant's behavior
        "You are an AI assistant who answers the user questions"
    )
)
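
The `token_limit` on the memory buffer caps how much conversation history is sent back to the model. The sketch below shows one plausible trimming policy: keep the most recent messages that fit within the limit. It is an assumption for illustration, not the actual `ChatMemoryBuffer` implementation; the class `TokenLimitedBuffer` is hypothetical and counts whitespace-separated words instead of real tokens.

```python
from collections import deque


class TokenLimitedBuffer:
    """Hypothetical chat buffer that returns only the newest messages that fit."""

    def __init__(self, token_limit):
        self.token_limit = token_limit
        self.messages = deque()

    @staticmethod
    def count_tokens(text):
        # Assumption: words approximate tokens for this illustration
        return len(text.split())

    def put(self, role, content):
        self.messages.append((role, content))

    def get(self):
        """Walk backwards from the newest message until the limit is reached."""
        kept = []
        used = 0
        for role, content in reversed(self.messages):
            cost = self.count_tokens(content)
            if used + cost > self.token_limit:
                break
            kept.append((role, content))
            used += cost
        return list(reversed(kept))


buffer = TokenLimitedBuffer(token_limit=6)
buffer.put("user", "one two three")
buffer.put("assistant", "four five")
buffer.put("user", "six seven eight")
history = buffer.get()
print(history)  # [('assistant', 'four five'), ('user', 'six seven eight')]
```

With a limit of 6, the oldest message no longer fits and is left out of the returned history, although it is still stored in the buffer.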

In [17]:
response = chat_engine.chat("Give the abstract of the GPT-4 introduction paper within 5 sentences")
print(response.response)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


The abstract of the GPT-4 introduction paper states that GPT-4 is a large-scale, multimodal model that can process image and text inputs and produce text outputs. It was developed to improve its ability to understand and generate natural language text, particularly in complex and nuanced scenarios. GPT-4 was evaluated on various exams originally designed for humans and achieved a score that falls in the top 10% of test takers, outperforming GPT-3.5. The model's capabilities are attributed to its pre-training process, and it outperforms existing language models and state-of-the-art systems on traditional benchmarks. GPT-4's performance is also influenced by its ability to predict aspects of its performance based on models trained with significantly less compute.


In [18]:
response = chat_engine.chat("Add more points on comparison with GPT-3")
print(response.response)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


The abstract of the GPT-4 introduction paper states that GPT-4 is a large-scale, multimodal model that can process image and text inputs and produce text outputs. It was developed to improve its ability to understand and generate natural language text, particularly in complex and nuanced scenarios. GPT-4 was evaluated on various exams originally designed for humans and achieved a score that falls in the top 10% of test takers, outperforming GPT-3.5. In comparison to GPT-3.5, GPT-4 exhibits significant improvements in its ability to reason, understand context, and generate coherent text. GPT-4 also outperforms GPT-3.5 in terms of its ability to handle multi-step reasoning, with a 20% increase in its ability to reason about complex scenarios. Additionally, GPT-4's performance is attributed to its pre-training process, which involves a larger dataset and more compute resources than GPT-3.5. GPT-4's capabilities are also influenced by its ability to predict aspects of its performance based

In [19]:
response = chat_engine.chat("Summarize all these into points")
print(response.response)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Here are the key points summarizing the abstract of the GPT-4 introduction paper:

**Key Features of GPT-4:**

1. **Multimodal model**: Can process image and text inputs and produce text outputs.
2. **Improved language understanding**: Can understand and generate natural language text, particularly in complex and nuanced scenarios.
3. **Top 10% performance**: Achieved a score that falls in the top 10% of test takers on various exams.

**Comparison with GPT-3.5:**

1. **Improved reasoning**: GPT-4 exhibits significant improvements in its ability to reason and understand context.
2. **20% increase in multi-step reasoning**: Can reason about complex scenarios with a 20% increase in ability compared to GPT-3.5.
3. **Better performance**: Outperforms GPT-3.5 in terms of text generation, coherence, and overall performance.

**Advantages:**

1. **Larger dataset and more compute resources**: GPT-4's pre-training process involves a larger dataset and more compute resources than GPT-3.5.
2. **Ab