<a href="https://colab.research.google.com/github/towardsai/ai-tutor-rag-system/blob/main/notebooks/12-Improve_Query.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install Packages and Setup Variables


In [None]:
!pip install -q llama-index==0.10.57 openai==1.37.0 llama-index-finetuning llama-index-embeddings-huggingface llama-index-embeddings-cohere llama-index-readers-web cohere==5.6.2 tiktoken==0.7.0 chromadb==0.5.5 html2text sentence_transformers pydantic llama-index-vector-stores-chroma==0.1.10 kaleido==0.2.1 llama-index-llms-gemini==0.1.11

In [None]:
import os

# Set the API keys as environment variables; the OpenAI and Gemini clients read them later.
os.environ["OPENAI_API_KEY"] = "<YOUR_OPENAI_KEY>"
os.environ["GOOGLE_API_KEY"] = "<YOUR_API_KEY>"

# from google.colab import userdata
# os.environ["OPENAI_API_KEY"] = userdata.get('openai_api_key')
# os.environ["GOOGLE_API_KEY"] = userdata.get('Google_api_key')
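
Before running the cells that call OpenAI or Gemini, it can help to confirm that the keys were actually replaced. The sketch below is only an illustration: `missing_api_keys` and `PLACEHOLDERS` are hypothetical names, not part of any library, and the example uses a plain dictionary instead of the real environment.

```python
import os

# Placeholder strings as they appear in the setup cell above.
PLACEHOLDERS = {"<YOUR_OPENAI_KEY>", "<YOUR_API_KEY>"}


def missing_api_keys(names, env=None):
    """Return the names whose value is unset, empty, or still a placeholder."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name) or env.get(name) in PLACEHOLDERS]


# Example with an explicit mapping instead of the real environment.
example_env = {"OPENAI_API_KEY": "<YOUR_OPENAI_KEY>", "GOOGLE_API_KEY": "abc123"}
print(missing_api_keys(["OPENAI_API_KEY", "GOOGLE_API_KEY"], example_env))
```

Calling `missing_api_keys(["OPENAI_API_KEY", "GOOGLE_API_KEY"])` without the second argument checks the real environment.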

In [None]:
# Allows running asyncio in environments with an existing event loop, like Jupyter notebooks.

import nest_asyncio

nest_asyncio.apply()

# Load a Model


In [None]:
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings

Settings.llm = OpenAI(temperature=0, model="gpt-4o-mini")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

**Note: You can create a vector store from scratch using the code below, or you can load it from Hugging Face using the code provided in this notebook.**

## Download the Dataset (JSON)


In [None]:
from huggingface_hub import hf_hub_download
file_path = hf_hub_download(
    repo_id="jaiganesan/ai_tutor_knowledge",
    filename="ai_tutor_knowledge.jsonl",
    repo_type="dataset",
    local_dir="/content",
)

ai_tutor_knowledge.jsonl:   0%|          | 0.00/6.96M [00:00<?, ?B/s]

## Read File


In [None]:
import json
with open(file_path, "r") as file:
    ai_tutor_knowledge = [json.loads(line) for line in file]

len(ai_tutor_knowledge)

762
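
The conversion step that follows reads the keys `doc_id`, `content`, `url`, `name`, `tokens`, and `source` from every record, so a record missing one of them would raise a `KeyError`. A minimal sketch of a pre-check is shown here; `find_incomplete_records` is a hypothetical helper and the two sample lines are made up for illustration.

```python
import json

# Keys that the Document conversion step reads from each record.
REQUIRED_KEYS = ("doc_id", "content", "url", "name", "tokens", "source")


def find_incomplete_records(lines):
    """Return (line_number, missing_keys) for each JSONL line lacking a required key."""
    problems = []
    for line_number, line in enumerate(lines, start=1):
        record = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            problems.append((line_number, missing))
    return problems


# Made-up sample data: the second record is deliberately incomplete.
sample_lines = [
    json.dumps({"doc_id": "a", "content": "text", "url": "u", "name": "n", "tokens": 1, "source": "s"}),
    json.dumps({"doc_id": "b", "content": "text"}),
]
print(find_incomplete_records(sample_lines))
```

With the real file, the same function can be applied to the open file handle, since iterating over a file yields one line at a time.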

# Convert to Document Objects


In [None]:
from typing import List
from llama_index.core import Document

def create_docs_from_list(data_list: List[dict]) -> List[Document]:
    documents = []
    for data in data_list:
        documents.append(
            Document(
                doc_id=data["doc_id"],
                text=data["content"],
                metadata={  # type: ignore
                    "url": data["url"],
                    "title": data["name"],
                    "tokens": data["tokens"],
                    "source": data["source"],
                },
                excluded_llm_metadata_keys=[
                    "title",
                    "tokens",
                    "source",
                ],
                excluded_embed_metadata_keys=[
                    "url",
                    "tokens",
                    "source",
                ],
            )
        )
    return documents

doc = create_docs_from_list(ai_tutor_knowledge)
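
The two exclusion lists control which metadata fields are shown to the LLM and which are included in the text that gets embedded. The sketch below only imitates that filtering with plain dictionaries; `visible_metadata` is a hypothetical helper and the metadata values are invented, so treat it as an illustration of the idea rather than LlamaIndex's own logic.

```python
def visible_metadata(metadata, excluded_keys):
    """Return the metadata entries that remain after dropping the excluded keys."""
    return {key: value for key, value in metadata.items() if key not in excluded_keys}


# Invented example values using the same four keys as the cell above.
metadata = {"url": "https://example.com/doc", "title": "Example", "tokens": 120, "source": "blog"}

llm_view = visible_metadata(metadata, ["title", "tokens", "source"])
embed_view = visible_metadata(metadata, ["url", "tokens", "source"])

print("LLM sees:", llm_view)
print("Embedding model sees:", embed_view)
```

With these lists, only `url` remains visible to the LLM and only `title` is embedded alongside the text.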

# Transforming


In [None]:
from llama_index.core.text_splitter import TokenTextSplitter

text_splitter = TokenTextSplitter(separator=" ", chunk_size=512, chunk_overlap=128)
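
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window sketch. It treats list items as stand-ins for tokens and is not how `TokenTextSplitter` is implemented internally (the real splitter counts tokenizer tokens and splits on the separator); `chunk_with_overlap` is a hypothetical helper.

```python
def chunk_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window already reaches the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Small numbers so the overlap is easy to see.
words = [f"w{i}" for i in range(10)]
for chunk in chunk_with_overlap(words, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Each window starts `chunk_size - chunk_overlap` tokens after the previous one, so with the notebook's settings (512 and 128) consecutive chunks would start 384 tokens apart under this simplified model.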

In [None]:
import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core.extractors import (
    SummaryExtractor,
    QuestionsAnsweredExtractor,
    KeywordExtractor,
)
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.ingestion import IngestionPipeline

# Create a persistent Chroma client so the collection is saved to disk
db = chromadb.PersistentClient(path="/content/ai_tutor_knowledge")
chroma_collection = db.get_or_create_collection("ai_tutor_knowledge")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

pipeline = IngestionPipeline(
    transformations=[
        text_splitter,
        QuestionsAnsweredExtractor(questions=3, llm=Settings.llm),
        SummaryExtractor(summaries=["prev", "self"], llm=Settings.llm),
        KeywordExtractor(keywords=10, llm=Settings.llm),
        Settings.embed_model,  # reuse the query-time embedding model (text-embedding-3-small)
    ],
    vector_store=vector_store,
)

nodes = pipeline.run(documents=doc, show_progress=True)

Parsing nodes:   0%|          | 0/14 [00:00<?, ?it/s]

Parsing nodes: 100%|██████████| 14/14 [00:00<00:00, 28.28it/s]
100%|██████████| 108/108 [01:36<00:00,  1.12it/s]
100%|██████████| 108/108 [01:22<00:00,  1.30it/s]
100%|██████████| 108/108 [00:29<00:00,  3.72it/s]
Generating embeddings: 100%|██████████| 108/108 [00:02<00:00, 38.77it/s]


In [None]:
!zip -r vectorstore.zip ai_tutor_knowledge

updating: mini-llama-articles/ (stored 0%)
updating: mini-llama-articles/chroma.sqlite3 (deflated 65%)
  adding: mini-llama-articles/aaac4d54-4f82-40da-b769-a6aecfa59eb0/ (stored 0%)
  adding: mini-llama-articles/aaac4d54-4f82-40da-b769-a6aecfa59eb0/data_level0.bin (deflated 96%)
  adding: mini-llama-articles/aaac4d54-4f82-40da-b769-a6aecfa59eb0/length.bin (deflated 35%)
  adding: mini-llama-articles/aaac4d54-4f82-40da-b769-a6aecfa59eb0/link_lists.bin (stored 0%)
  adding: mini-llama-articles/aaac4d54-4f82-40da-b769-a6aecfa59eb0/header.bin (deflated 61%)


# Load Indexes


**Note: If you created the vector store from scratch, please comment out the three code blocks/cells below.**

In [None]:
from huggingface_hub import hf_hub_download
vectorstore = hf_hub_download(
    repo_id="jaiganesan/ai_tutor_knowledge",
    filename="vectorstore.zip",
    repo_type="dataset",
    local_dir="/content",
)

vectorstore.zip:   0%|          | 0.00/97.2M [00:00<?, ?B/s]

In [None]:
!unzip vectorstore.zip

Archive:  vectorstore.zip
   creating: ai_tutor_knowledge/
   creating: ai_tutor_knowledge/684af133-f877-4230-bde4-575cf53b6688/
  inflating: ai_tutor_knowledge/684af133-f877-4230-bde4-575cf53b6688/length.bin  
  inflating: ai_tutor_knowledge/684af133-f877-4230-bde4-575cf53b6688/index_metadata.pickle  
  inflating: ai_tutor_knowledge/684af133-f877-4230-bde4-575cf53b6688/link_lists.bin  
  inflating: ai_tutor_knowledge/684af133-f877-4230-bde4-575cf53b6688/header.bin  
  inflating: ai_tutor_knowledge/684af133-f877-4230-bde4-575cf53b6688/data_level0.bin  
  inflating: ai_tutor_knowledge/chroma.sqlite3  


In [None]:
import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore

# Load the persisted Chroma collection
db = chromadb.PersistentClient(path="./ai_tutor_knowledge")
chroma_collection = db.get_or_create_collection("ai_tutor_knowledge")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

In [None]:
# Create your index
from llama_index.core import VectorStoreIndex

vector_index = VectorStoreIndex.from_vector_store(vector_store)

# Multi-Step Query Engine


## GPT-4o-mini


In [None]:
# ServiceContext is deprecated
# from llama_index.core import ServiceContext
# service_context_gpt4o_mini = ServiceContext.from_defaults()

In [None]:
from llama_index.core.indices.query.query_transform.base import (
    StepDecomposeQueryTransform,
)

step_decompose_transform_gpt4o = StepDecomposeQueryTransform(verbose=True, llm=Settings.llm)

In [None]:
from llama_index.core.query_engine.multistep_query_engine import MultiStepQueryEngine

# Default query engine
query_engine_gpt4o_mini = vector_index.as_query_engine()

# Multi-step query engine
multi_step_query_engine = MultiStepQueryEngine(
    query_engine=query_engine_gpt4o_mini,
    query_transform=step_decompose_transform_gpt4o,
    index_summary="Used to answer questions about RAG, Machine Learning, Deep Learning, and Generative AI",
)
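
Conceptually, a multi-step engine asks one follow-up question at a time, answers it against the index, and feeds the accumulated question/answer pairs into the next decomposition step before synthesizing a final answer. The sketch below shows that loop with stub functions only. All names (`multi_step_answer`, `decompose`, `answer_sub_question`, `synthesize`) are hypothetical, no LLM or index is called, and the cap of three steps is an assumption chosen to mirror the three sub-queries seen in this notebook's logs.

```python
def multi_step_answer(question, decompose, answer_sub_question, synthesize, num_steps=3):
    """Run a decompose -> answer loop, then combine the sub-answers."""
    sub_qa = []
    for _ in range(num_steps):
        # The next sub-question may depend on what has been answered so far.
        sub_question = decompose(question, sub_qa)
        if sub_question is None:
            break
        sub_qa.append((sub_question, answer_sub_question(sub_question)))
    return synthesize(question, sub_qa), sub_qa


# Stubs standing in for the LLM-based transform and the query engine.
topics = ["Llama 3.1", "BERT", "PEFT"]


def decompose(question, sub_qa):
    return f"What is {topics[len(sub_qa)]}?" if len(sub_qa) < len(topics) else None


def answer_sub_question(sub_question):
    return f"(stub answer to: {sub_question})"


def synthesize(question, sub_qa):
    return " ".join(answer for _, answer in sub_qa)


final_answer, steps = multi_step_answer(
    "Write about Llama 3.1 Model, BERT and PEFT", decompose, answer_sub_question, synthesize
)
for sub_question, sub_answer in steps:
    print(sub_question, "->", sub_answer)
```

Because each step sees the earlier pairs, the steps run sequentially, which is one reason this approach can be slower than answering independent sub-questions in parallel.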

# Query Dataset

## Default

In [None]:
# Default query engine
query_engine = vector_index.as_query_engine()
res = query_engine.query("Write about Llama 3.1 Model, BERT and PEFT")

In [None]:
res.response

'The LLaMA model is a foundational model designed for various natural language processing tasks, and it can be fine-tuned using the Parameter-Efficient Fine-Tuning (PEFT) library. This library provides methods such as LoRA (Low-Rank Adaptation) for efficient fine-tuning, allowing users to adapt the model with minimal additional parameters while preserving its pre-trained knowledge.\n\nBERT, another prominent model in the NLP landscape, is known for its bidirectional training approach, which enhances its understanding of context in text. While BERT and LLaMA serve different purposes and architectures, both can benefit from techniques like PEFT to optimize their performance on specific tasks without the need for extensive retraining.\n\nPEFT methods, including Llama-Adapter, focus on integrating new capabilities into existing models efficiently. For instance, Llama-Adapter allows the LLaMA model to follow instructions by adding learnable adaptation prompts while keeping the core model fr

In [None]:
for src in res.source_nodes:
    print("Node ID\t", src.node_id)
    print("Title\t", src.metadata["title"])
    print("Text\t", src.text)
    print("Score\t", src.score)
    print("-_" * 20)

Node ID	 781b7b12-eca2-47c0-a66e-9d6be670e951
Title	 LLaMA
Text	 on how to fine-tune LLaMA model using LoRA method via the 🤗 PEFT library with intuitive UI. 🌎 - A [notebook](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/jumpstart-foundation-models/text-generation-open-llama.ipynb) on how to deploy Open-LLaMA model for text generation on Amazon SageMaker. 🌎 ## LlamaConfig[[autodoc]] LlamaConfig## LlamaTokenizer[[autodoc]] LlamaTokenizer    - build_inputs_with_special_tokens    - get_special_tokens_mask    - create_token_type_ids_from_sequences    - save_vocabulary## LlamaTokenizerFast[[autodoc]] LlamaTokenizerFast    - build_inputs_with_special_tokens    - get_special_tokens_mask    - create_token_type_ids_from_sequences    - update_post_processor    - save_vocabulary## LlamaModel[[autodoc]] LlamaModel    - forward## LlamaForCausalLM[[autodoc]] LlamaForCausalLM    - forward## LlamaForSequenceClassification[[autodoc]] LlamaForSequenceClassif

## GPT-4o-mini Multi-Step


In [None]:
response = multi_step_query_engine.query("Write about Llama 3.1 Model, BERT and PEFT")

> Current query: Write concepts about Llama 3.1 Model, BERT and PEFT
> New query: What are the key features of the Llama 3.1 Model?
> Current query: Write concepts about Llama 3.1 Model, BERT and PEFT
> New query: What are the key features of the Llama 3.1 Model?
> Current query: Write concepts about Llama 3.1 Model, BERT and PEFT
> New query: What are the key features of the BERT model?

In [None]:
response.response

'Llama 3.1 is an advanced AI model developed by Meta, recognized for its significant scale and capabilities. It is the largest in the Llama series, having been trained on over 15 trillion tokens with the help of more than 16,000 H100 GPUs. One of its standout features is a context length of 128K, which allows it to process longer inputs effectively. The model excels in reasoning, coding, and multilingual processing, demonstrating strong logical reasoning and problem-solving skills. It also supports zero-shot tool use, enabling it to perform tasks without prior specific training, and has shown superior performance in benchmark tests compared to other models, particularly in areas like mathematical reasoning and complex text processing.\n\nBERT, or Bidirectional Encoder Representations from Transformers, is another influential model in the field of natural language processing. Developed by Google, BERT introduced a novel approach by using a bidirectional training method, allowing the mod

In [None]:
# Use distinct loop names so the final `response` object is not overwritten.
for sub_query, sub_response in response.metadata['sub_qa']:
    print(f"**{sub_query}**\n{sub_response}\n")

**What are the key features of the Llama 3.1 Model?**
The Llama 3.1 model boasts several key features, including:

1. **Model Scale and Training**: It is the largest model from Meta, trained on over 15 trillion tokens using more than 16,000 H100 GPUs.

2. **Extended Context Length**: The model supports a context length of 128K, enhancing its ability to process and understand longer texts.

3. **Enhanced Reasoning and Coding Capabilities**: Llama 3.1 excels in generating high-quality code and demonstrates strong logical reasoning, problem-solving, and analytical skills.

4. **Multilingual Processing**: Approximately 50% of its training data consists of multilingual tokens, allowing it to effectively understand and process multiple languages.

5. **Zero-shot Tool Use**: The model supports zero-shot tool use and can develop agentic behaviors, showcasing its versatility in various applications.

6. **Benchmark Performance**: It outperforms competing models like GPT-4o and Claude 3.5 Sonnet

In [None]:
for src in response.source_nodes:
    print("Node ID\t", src.node_id)
    print("Text\t", src.text)
    print("Score\t", src.score)
    print("-_" * 20)

Node ID	 6be88fa3-2f8b-43e7-aba0-d874b39809fc
Text	 # FourierFT: Discrete Fourier Transformation Fine-Tuning[FourierFT](https://huggingface.co/papers/2405.03003) is a parameter-efficient fine-tuning technique that leverages Discrete Fourier Transform to compress the model's tunable weights. This method outperforms LoRA in the GLUE benchmark and common ViT classification tasks using much less parameters.FourierFT currently has the following constraints:- Only `nn.Linear` layers are supported.- Quantized layers are not supported.If these constraints don't work for your use case, consider other methods instead.The abstract from the paper is:> Low-rank adaptation (LoRA) has recently gained much interest in fine-tuning foundation models. It effectively reduces the number of trainable parameters by incorporating low-rank matrices A and B to represent the weight change, i.e., Delta W=BA. Despite LoRA's progress, it faces storage challenges when handling extensive customization adaptations or 

# Test gemini-1.5-flash Multi-Step


In [None]:
from llama_index.core.indices.query.query_transform.base import (
    StepDecomposeQueryTransform,
)
from llama_index.core.query_engine.multistep_query_engine import MultiStepQueryEngine

from llama_index.llms.gemini import Gemini

llm = Gemini(model="models/gemini-1.5-flash")

step_decompose_transform = StepDecomposeQueryTransform(llm=llm, verbose=True)

# ServiceContext is deprecated; pass the LLM directly to the query engine instead.
base_query_engine_gemini = vector_index.as_query_engine(llm=llm)
query_engine_gemini = MultiStepQueryEngine(
    query_engine=base_query_engine_gemini,
    query_transform=step_decompose_transform,
    index_summary="Used to answer questions about RAG, Machine Learning, Deep Learning, and Generative AI",
)


In [None]:
response_gemini = query_engine_gemini.query("Write about Llama 3.1 Model, BERT and PEFT")

> Current query: Write about Llama 3.1 Model, BERT and PEFT
> New query: New question: **What are Llama 3.1 Model, BERT, and PEFT?**
> Current query: Write about Llama 3.1 Model, BERT and PEFT
> New query: New question: **What are the key differences between Llama 3.1 Model, BERT, and PEFT?**
> Current query: Write about Llama 3.1 Model, BERT and PEFT
> New query: New question: **What are some examples of how Llama 3.1 Model, BERT, and PEFT are used in practice?**

In [None]:
response_gemini.response

'The LLaMA model is a foundational model tailored for a variety of natural language processing tasks, emphasizing efficient fine-tuning techniques. It is designed to be adaptable for applications such as text classification, question answering, and causal language modeling.\n\nBERT, which stands for Bidirectional Encoder Representations from Transformers, is a significant model in the NLP domain. It excels in understanding the context of words within sentences through its bidirectional training approach. BERT is widely used for tasks like sentiment analysis, named entity recognition, and text summarization.\n\nPEFT, or Parameter-Efficient Fine-Tuning, is a methodology that enhances the fine-tuning process of large pretrained models like LLaMA and BERT. It focuses on adapting these models in a resource-efficient manner by fine-tuning only a small number of additional parameters. This approach allows for effective model adaptation for various tasks, including image classification and aut

## Test Retriever on Multistep


In [None]:
# import llama_index
# from llama_index.core.indices.query.schema import QueryBundle

# t = QueryBundle("How does Retrieval Augmented Generation (RAG) work?")
# query_engine_gemini.retrieve(t)

NotImplementedError: This query engine does not support retrieve, use query directly

## Subquestion Query Engine

In [None]:
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.core.query_engine import SubQuestionQueryEngine

query_engine = vector_index.as_query_engine()

query_engine_tools = [
    QueryEngineTool(
        query_engine=query_engine,
        metadata=ToolMetadata(
            name="LlamaIndex",
            description="Used to answer questions about RAG, Machine Learning, Deep Learning, and Generative AI",
        ),
    ),
]

sub_question_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools,
    use_async=True,
)

response = sub_question_engine.query("Write about Llama 3.1 Model, BERT and PEFT")


Generated 5 sub questions.
[LlamaIndex] Q: What are the key features and improvements of the Llama 3.1 model compared to its predecessors?
[LlamaIndex] Q: How does BERT work and what are its main applications in natural language processing?
[LlamaIndex] Q: What is PEFT (Parameter-Efficient Fine-Tuning) and how does it enhance the performance of models like BERT?
[LlamaIndex] Q: What are the differences in architecture and training methods between Llama 3.1 and BERT?
[LlamaIndex] Q: What are the use cases for Llama 3.1, BERT, and PEFT in real-world applications?
[LlamaIndex] A: The provided information does not detail the architecture and training methods of BERT, so a direct comparison cannot be made. However, it highlights that Llama 3.1, particularly the 405 billion parameter version, is trained on over 15 trillion tokens using more

In [None]:
response.response

'Llama 3.1 is a state-of-the-art language model developed by Meta, notable for being the largest model in its series, with a training scale that includes over 15 trillion tokens and the use of more than 16,000 H100 GPUs. This model features a 128K context length, which allows it to process extensive inputs effectively. It demonstrates enhanced reasoning and coding capabilities, excels in multilingual processing, and supports zero-shot tool use, making it versatile for various applications. Llama 3.1 outperforms its predecessors in benchmark tests, particularly in mathematical reasoning and long text processing.\n\nBERT, or Bidirectional Encoder Representations from Transformers, utilizes a transformer architecture that processes text bidirectionally, capturing the full context of sentences. It is pre-trained on tasks like Masked Language Modeling and Next Sentence Prediction, which help it generate contextualized word representations. BERT is widely applied in natural language processi
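
Unlike the multi-step engine, the sub-question engine generates all sub-questions up front, so with `use_async=True` they can be answered concurrently before the final synthesis. The sketch below illustrates only that fan-out pattern with `asyncio.gather` and stub coroutines; `answer_sub_question` and `answer_all` are hypothetical names and no query engine is called.

```python
import asyncio


async def answer_sub_question(tool_name, sub_question):
    """Stub for routing one sub-question to a named tool."""
    await asyncio.sleep(0)  # stands in for the network call to the query engine
    return f"[{tool_name}] {sub_question} -> stub answer"


async def answer_all(sub_questions):
    """Answer every (tool_name, sub_question) pair concurrently, preserving order."""
    tasks = [answer_sub_question(tool, question) for tool, question in sub_questions]
    return await asyncio.gather(*tasks)


# Shortened versions of the kind of sub-questions shown in the log above.
sub_questions = [
    ("LlamaIndex", "What are the key features of Llama 3.1?"),
    ("LlamaIndex", "How does BERT work?"),
    ("LlamaIndex", "What is PEFT?"),
]
answers = asyncio.run(answer_all(sub_questions))
for answer in answers:
    print(answer)
```

`asyncio.gather` returns results in the order the tasks were passed in, so each answer can be matched back to its sub-question.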

# HyDE Transform


In [None]:
query_engine = vector_index.as_query_engine()

In [None]:
from llama_index.core.indices.query.query_transform import HyDEQueryTransform
from llama_index.core.query_engine.transform_query_engine import TransformQueryEngine

hyde = HyDEQueryTransform(include_original=True)
hyde_query_engine = TransformQueryEngine(query_engine, hyde)
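
HyDE asks the LLM to write a hypothetical answer passage and uses that passage for retrieval; with `include_original=True` the original query is kept alongside it. The toy sketch below assumes the two texts are embedded separately and averaged, which is my understanding of the default aggregation and should be checked against the library. The bag-of-words `embed` function, the vocabulary, and the example strings are all invented for illustration and stand in for a real embedding model.

```python
import math
from collections import Counter


def embed(text, vocabulary):
    """Toy bag-of-words embedding: one count per vocabulary word."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(values) / len(vectors) for values in zip(*vectors)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


vocabulary = ["llama", "bert", "peft", "context", "bidirectional", "lora"]
query = "llama bert peft"
# Stand-in for the passage an LLM would generate for the query.
hypothetical_doc = "llama context bert bidirectional peft lora"
stored_chunk = "peft lora"

query_only = embed(query, vocabulary)
hyde_vector = mean_vector([embed(hypothetical_doc, vocabulary), query_only])
chunk_vector = embed(stored_chunk, vocabulary)

print("query only:", cosine(query_only, chunk_vector))
print("HyDE + original:", cosine(hyde_vector, chunk_vector))
```

In this toy setup the hypothetical passage adds the word `lora`, which the short query lacks, so the combined vector lands closer to the stored chunk. Real embeddings behave less mechanically, and a poor hypothetical passage can also pull retrieval in the wrong direction.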

In [None]:
response = hyde_query_engine.query("Write about Llama 3.1 Model, BERT and PEFT")

In [None]:
response.response

"Llama 3.1 405B is a significant advancement in the field of AI, developed by Meta. It stands out as the largest open-source model to date, trained on over 15 trillion tokens using more than 16,000 H100 GPUs. This extensive training has enabled it to achieve a remarkable 128K context length, which enhances its ability to handle complex reasoning, long document summarization, and context-dependent applications. The model has been optimized to perform on par with leading proprietary models in various areas, including general knowledge, steerability, mathematical reasoning, tool use, and multilingual translation.\n\nIn terms of features, Llama 3.1 includes enhanced programming capabilities, allowing it to generate high-quality code and demonstrate strong logical reasoning and problem-solving skills. It also supports multilingual processing, with about 50% of its training data consisting of multilingual tokens, enabling effective understanding and processing of multiple languages.\n\nWhile

In [None]:
for src in response.source_nodes:
    print("Node ID\t", src.node_id)
    print("Text\t", src.text)
    print("Score\t", src.score)
    print("-_" * 20)

Node ID	 5624cdc8-2997-4e4d-82d1-c7383d389215
Text	 3.1 405B is Metas largest model  trained with over 15 trillion tokens. For this  Meta optimized the entire training stack and trained it on more than 16 000 H100 GPUs  making it the first Llama model trained at this scale.   According to Meta  this version of the original model (Llama 1 and Llama 2) has 128K context length  improved reasoning and coding capabilities. Meta has also upgraded both multilingual 8B and 70B models.   Key Features of Llama 3.1 40 5B:Llama 3.1 comes with a host of features and capabilities that appeal to The users  such as:   RAG & tool use  Meta states that you can use Llama system components to extend the model using zero-shot tool use and build agentic behaviors with RAG.   Multi-lingual  Llama 3 naturally supports multilingual processing. The pre-training data includes about 50% multilingual tokens and can process and understand multiple languages.   Programming and Reasoning  Llama 3 has powerful program

In [None]:
query_bundle = hyde("Write about Llama 3.1 Model, BERT and PEFT")

In [None]:
hyde_doc = query_bundle.embedding_strs[0]

In [None]:
hyde_doc

"The Llama 3.1 model, developed by Meta, represents a significant advancement in the field of natural language processing (NLP). It builds upon the foundation laid by its predecessors, Llama 1 and Llama 2, by incorporating more extensive training data and improved architectural designs. Llama 3.1 is designed to enhance performance in various NLP tasks, including text generation, comprehension, and summarization, making it a versatile tool for developers and researchers alike.\n\nBERT, or Bidirectional Encoder Representations from Transformers, is another influential model in the NLP landscape, introduced by Google in 2018. BERT revolutionized the way models understand context in language by utilizing a bidirectional approach, allowing it to consider the entire context of a word based on the words that come before and after it. This capability significantly improved the performance of models on tasks such as question answering and sentiment analysis. BERT's architecture is based on the 