### Querying 

Now that you've loaded your data, built an index, and stored that index for later, you're ready for the most significant part of an LLM application: querying.

At its simplest, querying is just a prompt call to an LLM: a question that gets an answer, a request for summarization, or a much more complex instruction.

More complex querying could involve repeated/chained prompt + LLM calls, or even a reasoning loop across multiple components.
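
To make the distinction concrete, here is a minimal sketch of a single prompt call next to a chained one. It makes no real model call: `fake_llm` is a hypothetical stand-in that only echoes a tagged string, and `simple_query` and `chain_query` are illustrative helpers, not part of LlamaIndex.

```python
from typing import Callable


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call: returns a tagged echo of the prompt."""
    return f"<answer to: {prompt}>"


def simple_query(llm: Callable[[str], str], question: str) -> str:
    """Simplest form of querying: one prompt in, one completion out."""
    return llm(question)


def chain_query(llm: Callable[[str], str], question: str, steps: list[str]) -> str:
    """Chained querying: each step's prompt embeds the previous step's output."""
    result = llm(question)
    for instruction in steps:
        result = llm(f"{instruction}\n\n{result}")
    return result


single = simple_query(fake_llm, "What is GPT?")
chained = chain_query(fake_llm, "What is GPT?", ["Summarize in one sentence:"])
print(single)
print(chained)
```

Swapping `fake_llm` for a function that calls a real model leaves the control flow unchanged, which is why the chaining logic is kept separate from the model call.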

In [4]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.huggingface import HuggingFaceLLM
from transformers import AutoTokenizer

# Local embedding model
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Load the tokenizer, reusing Phi-2's end-of-text token as the pad token.
# Passing pad_token here already registers it, so no separate
# add_special_tokens call is needed.
tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/phi-2",
    padding_side="right",
    pad_token="<|endoftext|>",
)

# Local LLM: Phi-2, a small 2.7B-parameter model
Settings.llm = HuggingFaceLLM(
    model_name="microsoft/phi-2",
    tokenizer_name="microsoft/phi-2",
    context_window=2048,  # Phi-2's maximum sequence length (prompt + generated tokens)
    max_new_tokens=1024,  # upper bound on tokens generated per call
    generate_kwargs={
        "temperature": 0.2,
        "do_sample": True,
        "pad_token_id": tokenizer.pad_token_id,
    },
    device_map="auto",
)

# Load the documents and build a vector index over them
documents = SimpleDirectoryReader("../../data").load_data()
index = VectorStoreIndex.from_documents(documents, show_progress=True)

Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00,  1.12it/s]
Parsing nodes: 100%|██████████| 40/40 [00:00<00:00, 506.33it/s]
Generating embeddings: 100%|██████████| 83/83 [00:03<00:00, 25.46it/s]
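
The `context_window` and `max_new_tokens` values above draw on the same budget: a causal model's maximum sequence length covers the prompt plus the generated tokens. The sketch below shows only the arithmetic. `prompt_budget` and `fits_context` are hypothetical helpers for illustration, not LlamaIndex or Transformers APIs, and the prompt token counts are made-up examples.

```python
def prompt_budget(context_window: int, max_new_tokens: int) -> int:
    """Tokens left for the prompt once room for generation is reserved."""
    return max(context_window - max_new_tokens, 0)


def fits_context(prompt_tokens: int, context_window: int, max_new_tokens: int) -> bool:
    """True when the prompt plus the requested generation stays within the window."""
    return prompt_tokens + max_new_tokens <= context_window


budget = prompt_budget(context_window=2048, max_new_tokens=1024)
print(budget)  # 1024
print(fits_context(900, 2048, 1024))   # True
print(fits_context(1500, 2048, 1024))  # False
```

With the settings in this cell, a prompt longer than 1024 tokens no longer leaves room for the full `max_new_tokens`; lowering `max_new_tokens` is one way to widen the prompt budget.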


In [2]:
query_engine = index.as_query_engine(
    # similarity_top_k=1,
    response_mode="tree_summarize",
)

response = query_engine.query(
    "write me short essay based on the given document about GPT"
)

This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (2048). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.
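
The `tree_summarize` response mode combines the retrieved chunks bottom-up: groups of chunks are summarized against the query, and those summaries are summarized again until a single response remains. The following is a simplified sketch of that idea, not LlamaIndex's implementation. `toy_summarize` is a hypothetical stand-in for the LLM call, and `fan_in` is an assumed fixed group size chosen for simplicity.

```python
from typing import Callable


def toy_summarize(query: str, texts: list[str]) -> str:
    """Hypothetical stand-in for an LLM summary call: joins inputs in brackets."""
    return "[" + " + ".join(texts) + "]"


def tree_summarize_sketch(
    query: str,
    chunks: list[str],
    summarize: Callable[[str, list[str]], str],
    fan_in: int = 2,
) -> str:
    """Summarize chunks in groups of `fan_in`, level by level, until one remains."""
    if not chunks:
        return ""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    level = list(chunks)
    while True:
        # Collapse each group of up to `fan_in` texts into one summary.
        level = [
            summarize(query, level[i : i + fan_in])
            for i in range(0, len(level), fan_in)
        ]
        if len(level) == 1:
            return level[0]


root = tree_summarize_sketch("essay about GPT", ["a", "b", "c"], toy_summarize)
print(root)  # [[a + b] + [c]]
```

The nested brackets in the output show the tree shape: `a` and `b` are merged first, `c` forms its own group, and the two intermediate summaries are merged at the root.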


In [3]:
print(response)


GPT (Generative Pre-trained Transformer) is a language model that has been developed by OpenAI. It is an artificial intelligence system that can generate human-like text based on the input it receives. GPT has been used in various applications, including chatbots, content creation, and research writing.

GPT has several advantages in the healthcare sector. It can be used to generate personalized medical reports, which can help healthcare professionals make more informed decisions about patient care. GPT can also be used to generate medical research papers, which can help researchers share their findings more quickly and efficiently. Additionally, GPT can be used to generate patient education materials, which can help patients better understand their medical conditions and treatment options.

However, there are also challenges to consider when using GPT in healthcare. One challenge is ensuring the accuracy and reliability of the information generated by GPT. Another challenge is ensuri