### Querying 

Now that you've loaded your data, built an index, and stored that index for later, you're ready for the most significant part of an LLM application: querying.

At its simplest, querying is just a prompt call to an LLM: it can be a question that gets an answer, a request for summarization, or a much more complex instruction.

More complex querying could involve repeated/chained prompt + LLM calls, or even a reasoning loop across multiple components.
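As a rough illustration of chained prompt + LLM calls, the sketch below feeds the output of one call into the next. It does not use LlamaIndex; `LLM`, `summarize_then_answer`, and `fake_llm` are hypothetical names made up for this example, and `fake_llm` is only a stand-in for a real model.

```python
from typing import Callable

# Hypothetical alias: any function that maps a prompt string to a completion string.
LLM = Callable[[str], str]


def summarize_then_answer(llm: LLM, document: str, question: str) -> str:
    """Two chained calls: condense the document, then answer from the summary."""
    summary = llm(f"Summarize the following text:\n{document}")
    return llm(f"Context: {summary}\nQuestion: {question}\nAnswer:")


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model so the chain can run without downloading anything."""
    return f"<completion for {len(prompt)}-char prompt>"


answer = summarize_then_answer(fake_llm, "GPT is a language model.", "What is GPT?")
print(answer)
```

Swapping `fake_llm` for a function that wraps a real model would leave the chaining logic unchanged.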

In [4]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.huggingface import HuggingFaceLLM
from transformers import AutoTokenizer

# local embedding
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Load the tokenizer so we can look up a pad token id for generation;
# the end-of-text token is reused as the pad token
tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/phi-2",
    padding_side="right",
    pad_token="<|endoftext|>",
)
# Register the pad token explicitly as a special token
tokenizer.add_special_tokens({'pad_token': '<|endoftext|>'})

# local LLM
Settings.llm = HuggingFaceLLM(
    model_name="microsoft/phi-2",  # a small (2.7B-parameter) model that can run locally
    tokenizer_name="microsoft/phi-2",
    context_window=2048,
    max_new_tokens=1024,  # leaves roughly 1024 tokens of the window for the prompt
    generate_kwargs={"temperature": 0.2, "do_sample": True, "pad_token_id": tokenizer.pad_token_id},
    device_map="auto",
)
documents = SimpleDirectoryReader("../../data").load_data()
index = VectorStoreIndex.from_documents(documents, show_progress=True)

Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00,  1.12it/s]
Parsing nodes: 100%|██████████| 40/40 [00:00<00:00, 506.33it/s]
Generating embeddings: 100%|██████████| 83/83 [00:03<00:00, 25.46it/s]


In [2]:
query_engine = index.as_query_engine(
    # similarity_top_k=1,
    response_mode="tree_summarize",
)

response = query_engine.query(
    "write me short essay based on the given document about GPT"
)

This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (2048). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.


In [3]:
print(response)


GPT (Generative Pre-trained Transformer) is a language model that has been developed by OpenAI. It is an artificial intelligence system that can generate human-like text based on the input it receives. GPT has been used in various applications, including chatbots, content creation, and research writing.

GPT has several advantages in the healthcare sector. It can be used to generate personalized medical reports, which can help healthcare professionals make more informed decisions about patient care. GPT can also be used to generate medical research papers, which can help researchers share their findings more quickly and efficiently. Additionally, GPT can be used to generate patient education materials, which can help patients better understand their medical conditions and treatment options.

However, there are also challenges to consider when using GPT in healthcare. One challenge is ensuring the accuracy and reliability of the information generated by GPT. Another challenge is ensuri

### Stages of querying
However, there is more to querying than initially meets the eye. Querying consists of three distinct stages:

- <b>Retrieval</b> is when you find and return the most relevant documents for your query from your Index. As previously discussed in indexing, the most common type of retrieval is "top-k" semantic retrieval, but there are many other retrieval strategies.
- <b>Postprocessing</b> is when the Nodes retrieved are optionally reranked, transformed, or filtered, for instance by requiring that they have specific metadata such as keywords attached.
- <b>Response synthesis</b> is when your query, your most-relevant data and your prompt are combined and sent to your LLM to return a response.
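The three stages above can be sketched in plain Python. This is a toy model under stated assumptions, not LlamaIndex's implementation: `ToyNode`, `retrieve`, `postprocess`, and `synthesize` are hypothetical names, the two-dimensional embeddings are invented, and the final stage only builds the prompt that would be sent to an LLM.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    text: str
    embedding: list[float]
    keywords: set[str] = field(default_factory=set)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_emb, nodes, top_k):
    """Stage 1: rank nodes by similarity to the query and keep the top_k."""
    scored = [(cosine(query_emb, n.embedding), n) for n in nodes]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def postprocess(scored, required_keyword):
    """Stage 2: keep only nodes whose metadata contains the required keyword."""
    return [(s, n) for s, n in scored if required_keyword in n.keywords]


def synthesize(query, scored):
    """Stage 3: combine query and retrieved text into the prompt for the LLM."""
    context = "\n".join(n.text for _, n in scored)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


nodes = [
    ToyNode("GPT generates text.", [1.0, 0.0], {"gpt"}),
    ToyNode("GPT can draft medical reports.", [0.9, 0.1], {"gpt", "healthcare"}),
    ToyNode("Bananas are yellow.", [0.0, 1.0], {"fruit"}),
]

top = retrieve([1.0, 0.0], nodes, top_k=2)
kept = postprocess(top, "healthcare")
prompt = synthesize("How is GPT used in healthcare?", kept)
print(prompt)
```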

### Customizing the stages of querying

LlamaIndex features a low-level composition API that gives you granular control over your querying.

In this example, we customize our retriever to use a different value for top_k and add a postprocessing step that requires retrieved nodes to reach a minimum similarity score to be included. This gives you a lot of data when you have relevant results, but potentially no data if nothing is relevant.
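To see why this combination can return many nodes or none at all, here is a small sketch of the filtering logic on its own. The function `top_k_with_cutoff` and the scores are hypothetical, and the sketch assumes that a node is kept when its score is at or above the cutoff.

```python
def top_k_with_cutoff(scores: dict[str, float], top_k: int, cutoff: float) -> list[str]:
    """Keep the top_k highest-scoring ids, then drop any that fall below the cutoff."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]
    return [node_id for node_id, score in ranked if score >= cutoff]


# Hypothetical similarity scores for an on-topic and an off-topic query.
on_topic = {"a": 0.91, "b": 0.84, "c": 0.78, "d": 0.40}
off_topic = {"a": 0.35, "b": 0.22, "c": 0.18, "d": 0.05}

print(top_k_with_cutoff(on_topic, top_k=10, cutoff=0.7))   # ['a', 'b', 'c']
print(top_k_with_cutoff(off_topic, top_k=10, cutoff=0.7))  # []
```

An empty result is worth handling explicitly, since the response synthesizer then has no context to work from.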

In [None]:

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings, get_response_synthesizer
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.huggingface import HuggingFaceLLM
from transformers import AutoTokenizer

# local embedding
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Load the tokenizer so we can look up a pad token id for generation;
# the end-of-text token is reused as the pad token
tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/phi-2",
    padding_side="right",
    pad_token="<|endoftext|>",
)
# Register the pad token explicitly as a special token
tokenizer.add_special_tokens({'pad_token': '<|endoftext|>'})

# local LLM
Settings.llm = HuggingFaceLLM(
    model_name="microsoft/phi-2",  # a small (2.7B-parameter) model that can run locally
    tokenizer_name="microsoft/phi-2",
    context_window=2048,
    max_new_tokens=1024,  # leaves roughly 1024 tokens of the window for the prompt
    generate_kwargs={"temperature": 0.2, "do_sample": True, "pad_token_id": tokenizer.pad_token_id},
    device_map="auto",
)
documents = SimpleDirectoryReader("../../data").load_data()


# build index
index = VectorStoreIndex.from_documents(documents)

# configure retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=10,
)

# configure response synthesizer
response_synthesizer = get_response_synthesizer()

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.7)],
)

# query
response = query_engine.query("What is the document about?")
print(response)

Loading checkpoint shards: 100%|██████████| 2/2 [00:02<00:00,  1.18s/it]
This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (2048). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.
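The warning above most likely means that the prompt plus the requested `max_new_tokens` did not fit in the 2048-token window configured for phi-2. The arithmetic can be sketched as follows, assuming the limit is simply prompt tokens plus new tokens; `prompt_budget` and `fits` are hypothetical helpers, not part of LlamaIndex or transformers.

```python
def prompt_budget(context_window: int, max_new_tokens: int) -> int:
    """Tokens left for the prompt once room for the completion is reserved."""
    if max_new_tokens >= context_window:
        raise ValueError("max_new_tokens must be smaller than context_window")
    return context_window - max_new_tokens


def fits(prompt_tokens: int, context_window: int, max_new_tokens: int) -> bool:
    """True when the prompt plus the requested completion stays within the window."""
    return prompt_tokens + max_new_tokens <= context_window


print(prompt_budget(2048, 1024))  # 1024
print(fits(1500, 2048, 1024))     # False
print(fits(1500, 2048, 256))      # True
```

Under this assumption, lowering `max_new_tokens`, retrieving fewer nodes, or using a model with a larger context window are the usual ways to stay inside the limit.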


Testing the Phi-3.5 instruct model instead of the Phi-2 model

In [1]:

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings, get_response_synthesizer
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.huggingface import HuggingFaceLLM
from transformers import AutoTokenizer


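# Embedding model. Note: Phi-3.5-mini-instruct is a causal LLM, not a
# sentence-transformers embedding model, so the loader falls back to mean pooling
# (see the warning in the output). A dedicated embedding model such as
# BAAI/bge-small-en-v1.5, used earlier, is usually a better fit here.
Settings.embed_model = HuggingFaceEmbedding(model_name="microsoft/Phi-3.5-mini-instruct")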

# Create the tokenizer for Phi-3.5
tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/Phi-3.5-mini-instruct",
    padding_side="right",
    pad_token="<|endoftext|>"
)
tokenizer.add_special_tokens({'pad_token': '<|endoftext|>'})

# Set up the Phi-3.5 model
Settings.llm = HuggingFaceLLM(
    model_name="microsoft/Phi-3.5-mini-instruct",
    tokenizer_name="microsoft/Phi-3.5-mini-instruct",
    context_window=128000,  # Phi-3.5 supports 128K context length
    max_new_tokens=1024,
    generate_kwargs={"temperature": 0.2, "do_sample": True, "pad_token_id": tokenizer.pad_token_id},
    device_map="auto",
)

# Load documents
documents = SimpleDirectoryReader("../../data").load_data()

# build index
index = VectorStoreIndex.from_documents(documents)

# configure retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=10,
)

# configure response synthesizer
response_synthesizer = get_response_synthesizer()

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.7)],
)

# query
response = query_engine.query("Please write me essay about advantages and disadvantages of GPT? ")
print(response)

No sentence-transformers model found with name microsoft/Phi-3.5-mini-instruct. Creating a new one with mean pooling.
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Downloading shards: 100%|██████████| 2/2 [03:01<00:00, 90.98s/it] 
Loading checkpoint shards: 100%|██████████| 2/2 [00:02<00:00,  1.13s/it]
Downloading shards: 100%|██████████| 2/2 [03:02<00:00, 91.38s/it] 
Loading checkpoint shards: 100%|██████████| 2/2 [00:03<00:00,  1.77s/it]




Advantages and Disadvantages of GPT

Generative Pre-trained Transformers (GPT) have revolutionized the field of natural language processing (NLP) and have become a cornerstone in the development of AI-driven applications. These models offer a plethora of advantages, but they also come with certain disadvantages that need to be considered.

**Advantages of GPT:**

1. **Natural Language Understanding and Generation:**
   GPT models excel at understanding and generating human-like text, making them highly effective for a wide range of applications, including chatbots, content creation, and language translation.

2. **Data Efficiency:**
   GPT models can be fine-tuned with relatively small datasets, which is beneficial for organizations with limited data resources.

3. **Multimodal Capabilities:**
   Some GPT models, like GPT-3, have multimodal capabilities, allowing them to understand and generate text based on visual inputs, which is a significant step towards more comprehensive AI sys

### Configuring retriever

In [None]:
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=10,
)

There is a huge variety of retrievers; you can learn about them in our module guide on retrievers: https://docs.llamaindex.ai/en/stable/module_guides/querying/retriever/.