### Query Expansion
This section covers a retrieval optimization technique called query expansion. It adds more context and tags to the user query before retrieval, so that the retrieved chunks, and therefore the final output, are more accurate.
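
To make the idea concrete before the full pipeline, here is a minimal sketch of why expanding a query can change what gets retrieved. It is a toy under stated assumptions: the two documents, the word-overlap scorer, and `fake_generate` (a stand-in for an LLM call) are all hypothetical, and the notebook itself uses embeddings rather than word overlap.

```python
import re
from typing import Callable


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Rank documents by how many tokens they share with the query."""
    return sorted(docs, key=lambda d: len(tokenize(query) & tokenize(d)), reverse=True)[:k]


def expand_query(question: str, generate: Callable[[str], str]) -> str:
    """Append generated text to the question (one possible expansion style)."""
    return question + "\n" + generate(question)


def fake_generate(question: str) -> str:
    # Hypothetical stand-in for an LLM that drafts a plausible answer.
    return "The evaluation measured success rate on a benchmark of tasks."


docs = [
    "The robot arm is mounted on a mobile base.",
    "Success rate was measured on a simulation benchmark across three tasks.",
]
question = "How did the authors evaluate the robot?"

plain_hit = retrieve(question, docs)[0]
expanded_hit = retrieve(expand_query(question, fake_generate), docs)[0]
print("plain query    ->", plain_hit)
print("expanded query ->", expanded_hit)
```

In this toy setup the plain question matches the first document on the words "the" and "robot", while the expanded query shares more vocabulary with the document that actually describes the evaluation.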

In [1]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai.chat_models import ChatOpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter, SentenceTransformersTokenTextSplitter

In [2]:
from dotenv import load_dotenv
load_dotenv()

True

In [3]:
mistral7b = ChatOpenAI(model="mistralai/Mistral-7B-Instruct-v0.2")

In [4]:
query_expansion_prompt = """
[INST]
You are an intelligent and insightful assistant.
Generate a potential response to the following question:
Question:
{question}
Response:
[/INST]
"""

query_expansion_template = ChatPromptTemplate.from_template(query_expansion_prompt)

In [5]:
query_expansion_chain = query_expansion_template | mistral7b | StrOutputParser()

In [6]:
answer_summary_prompt = """ 
[INST]
You are an intelligent and insightful assistant.
Generate a summary for the given context.
Context:
{expanded_query}
[/INST]
"""

answer_summary_template = ChatPromptTemplate.from_template(answer_summary_prompt)

In [8]:
answer_summary_chain = answer_summary_template | mistral7b | StrOutputParser()

In [10]:
context_answering_prompt = """ 
[INST]
You are a helpful assistant. You give grounded and accurate answers based on the context provided within the message.
Here is the context
Context: 
{context}


Use the provided context to answer the following question:
Question:
{question}

Provide only the answer and no other explanation
[/INST]
"""
context_answering_prompt = ChatPromptTemplate.from_template(context_answering_prompt)

In [11]:
context_answering_chain = context_answering_prompt | mistral7b | StrOutputParser()

In [12]:
from chromadb.api.client import Client
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction
from pypdf import PdfReader

In [13]:
# Read the PDF and extract the text of each page
doc = PdfReader("MIV2 - LLM paper.pdf")
docs = [page.extract_text().strip() for page in doc.pages]

In [14]:
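# Note: separators are tried in order. With "." first and "" second, the
# entries after "" are effectively never used. The library default order
# is ["\n\n", "\n", " ", ""].
recursive_splitter = RecursiveCharacterTextSplitter(
    separators=[".", "", " ", "\n", "\n\n"],
    chunk_size=1000,
    chunk_overlap=200,
)
texts = recursive_splitter.split_text("\n\n".join(docs))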
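
As an illustration of what `chunk_size` and `chunk_overlap` mean, here is a minimal fixed-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter cuts on separators and then merges the pieces, so its boundaries will differ). The function name `chunk_with_overlap` is hypothetical and only demonstrates the two parameters.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghij"
chunks = chunk_with_overlap(sample, chunk_size=4, chunk_overlap=2)
print(chunks)
```

Each chunk starts `chunk_size - chunk_overlap` characters after the previous one, so consecutive chunks repeat the last `chunk_overlap` characters. The overlap reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.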

In [15]:
# Re-split each character chunk so that it fits within 256 tokens
sentence_t_splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0, tokens_per_chunk=256)
split_texts = []
for text in texts:
    split_texts += sentence_t_splitter.split_text(text)


In [16]:
embedding_function = SentenceTransformerEmbeddingFunction()

In [17]:
client = Client()
collection = client.create_collection("collection2", embedding_function=embedding_function)

In [18]:
ids = [str(i) for i in range(len(split_texts))]

In [19]:
len(ids), len(split_texts)

(84, 84)

In [20]:
collection.add(ids, documents=split_texts)


In [21]:
collection.count()

84

In [24]:
response = query_expansion_chain.invoke({"question": "what evaluations were used in the research paper?"})

In [26]:
from pprint import pprint

In [28]:
pprint(response)

('In the research paper, various evaluation methods were employed to ensure '
 'the validity and reliability of the findings. Here are some of the '
 'evaluations mentioned:\n'
 '\n'
 '1. Content analysis: This method was used to analyze the data collected from '
 'various sources, such as articles and books, to identify patterns and '
 'trends.\n'
 '2. Surveys: The researchers used self-report surveys to gather data from '
 'participants about their attitudes, beliefs, and behaviors related to the '
 'research topic.\n'
 '3. Interviews: In-depth interviews were conducted with key informants to '
 'gain a deeper understanding of the research topic and to validate the '
 'findings from the surveys and content analysis.\n'
 '4. Observations: The researchers observed participants in their natural '
 'settings to gain insights into their behaviors and interactions.\n'
 '5. Document analysis: The researchers analyzed various documents related to '
 'the research topic, such as policies and 

In [29]:
relevant_docs = collection.query(query_texts=response, n_results=5)["documents"]

In [30]:
pprint(relevant_docs)

[['. negative results are observed primarily in tasks prompted with a single '
  'modality, aligning with the inherent limitations associated with '
  'insufficient contextual information. observations during experiments reveal '
  'consistent plan generation errors by the model for specific instructions, '
  'and even after feedback is provided for a particular mistake, the model '
  'tends to repeat 10 table 2 : vimabench simulated environment evaluation '
  'single modality1multi modality2 project task 1 task 2 task 3 task 1 task 2 '
  'task 3 instruct2act 0. 0 % 70 % 90 % 40 % 63. 3 % 76. 66 % instruct2act w / '
  'feedback 0. 0 % 66 % 90 % 46. 66 % 73. 33 % 93. 3 % note : the success rate '
  'is calculated by finding the average success of task completion for each '
  'task across the 3 partitions. 1for single modality only textual data is '
  'provided to the context of the model. 2for multi modality both texts and '
  'images are provided as contexts to the model. these errors 

In [32]:
context = "\n\n".join(relevant_docs[0])

In [34]:
response = context_answering_chain.invoke({"context": context, "question": "what evaluations were used in the research paper?"})

In [35]:
pprint(response)

('The research paper conducted evaluations in both simulated and physical '
 'environments using the Vimabench simulation benchmark and human user '
 'interactions.')


In [37]:
relevant_docs = collection.query(query_texts=["what evaluations were used in the research paper?"])["documents"][0]

In [38]:
context = "\n\n".join(relevant_docs)

In [39]:
response = context_answering_chain.invoke({"context": context, "question": "what evaluations were used in the research paper?"})
pprint(response)

('The research paper conducted evaluations in both simulated and physical '
 'environments using a series of experiments. In the simulated evaluation, '
 'they used the Vimabench simulation benchmark, while in the physical '
 'evaluation, they used the VILD1SAM-CLIP methodology. The findings indicate '
 'an 8.6% performance increase in simulated evaluations and a 14% increase in '
 'physical evaluations. The system demonstrated an ability to comprehend and '
 'respond to a diverse range of language instructions, with valuable insights '
 'into its performance and areas for improvement.')
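
The two runs above (expanded query versus the plain question) can also be compared at the retrieval level, not only by reading the final answers. The helper below is a hypothetical sketch: `compare_retrievals` is not part of the notebook, and the chunk strings are placeholders. To use it with the notebook, the two retrieved lists would have to be stored under separate names, since `relevant_docs` is reused.

```python
def compare_retrievals(expanded: list[str], baseline: list[str]) -> dict[str, list[str]]:
    """Split two retrieved chunk lists into shared and run-specific chunks, keeping order."""
    expanded_set, baseline_set = set(expanded), set(baseline)
    return {
        "shared": [doc for doc in expanded if doc in baseline_set],
        "only_expanded": [doc for doc in expanded if doc not in baseline_set],
        "only_baseline": [doc for doc in baseline if doc not in expanded_set],
    }


# Placeholder chunk texts standing in for real query results
expanded_docs = ["chunk A", "chunk B", "chunk C"]
baseline_docs = ["chunk B", "chunk D"]

diff = compare_retrievals(expanded_docs, baseline_docs)
for name, chunks_in_group in diff.items():
    print(name, chunks_in_group)
```

Chunks that appear only in the expanded run are the ones the expansion contributed, which helps judge whether a different answer comes from better context or only from a different number of results.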


#### Using query expansion by generating extra questions

In [46]:
question_generation_prompt = """
[INST]
You are a helpful research assistant.
Generate 5 additional related questions to the one provided below to help the user get the information they need faster.
Question:
{question}
[/INST]
"""

question_generation_template = ChatPromptTemplate.from_template(question_generation_prompt)

In [47]:
new_question_expansion_chain = question_generation_template | mistral7b | StrOutputParser()

In [48]:
more_questions = new_question_expansion_chain.invoke({"question": "what evaluations were used in the research paper?"})

In [49]:
pprint(more_questions)

('1. What specific evaluation methods or techniques were employed in the '
 'research study mentioned in the paper?\n'
 '2. Can you provide details about the data collection and analysis methods '
 'used in the evaluation process?\n'
 '3. Were any standardized evaluation tools or scales used in the study and if '
 'so, which ones?\n'
 '4. How were the evaluation results interpreted and what conclusions were '
 'drawn from them?\n'
 '5. Were there any limitations or challenges encountered during the '
 'evaluation process and how were they addressed?')


In [50]:
more_questions += "\n6. what evaluations were used in the research paper?"

In [52]:
pprint(more_questions)

('1. What specific evaluation methods or techniques were employed in the '
 'research study mentioned in the paper?\n'
 '2. Can you provide details about the data collection and analysis methods '
 'used in the evaluation process?\n'
 '3. Were any standardized evaluation tools or scales used in the study and if '
 'so, which ones?\n'
 '4. How were the evaluation results interpreted and what conclusions were '
 'drawn from them?\n'
 '5. Were there any limitations or challenges encountered during the '
 'evaluation process and how were they addressed?\n'
 '6. what evaluations were used in the research paper?')
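
The notebook sends all six questions to the collection as one combined string. A possible alternative, sketched here under assumptions, is to split the numbered list into separate questions, query each one, and merge the results without duplicates. The helper names `parse_numbered_questions` and `merge_unique` are hypothetical, the sample text is an abbreviated version of the output above, and the parser assumes one question per line.

```python
import re


def parse_numbered_questions(text: str) -> list[str]:
    """Extract the question text from lines such as '1. ...' or '2) ...'."""
    questions = []
    for line in text.splitlines():
        match = re.match(r"^\s*\d+[.)]\s*(.+)$", line)
        if match:
            questions.append(match.group(1).strip())
    return questions


def merge_unique(result_lists: list[list[str]]) -> list[str]:
    """Flatten per-question result lists, keeping the first occurrence of each chunk."""
    seen: set[str] = set()
    merged = []
    for results in result_lists:
        for doc in results:
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


# Abbreviated sample in the same shape as the generated list
sample = (
    "1. What specific evaluation methods or techniques were employed?\n"
    "2. Were any standardized evaluation tools or scales used?\n"
    "3. what evaluations were used in the research paper?"
)
questions = parse_numbered_questions(sample)
print(questions)

# Placeholder per-question results, one inner list per question
merged = merge_unique([["chunk A", "chunk B"], ["chunk B", "chunk C"]])
print(merged)
```

Chroma's `collection.query` accepts a list for `query_texts` and returns one list of documents per query, so passing `questions` and then applying `merge_unique` to the `"documents"` entry would give a single deduplicated context. Whether this retrieves better chunks than the combined string is something to check on the actual data.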


In [54]:
relevant_docs = collection.query(query_texts=[more_questions], n_results=5)["documents"][0]

In [55]:
context = "\n\n".join(relevant_docs)

In [70]:
final_answer = context_answering_chain.invoke({"question":  "what evaluations were used in the research paper?", "context": context})

In [71]:
pprint(final_answer)

('The research paper conducted evaluations in both simulated and physical '
 'environments using the Vimabench simulation benchmark and a solitary human '
 "participant. The system's performance in comprehending and responding to a "
 'diverse range of language instructions was assessed, and the approach was '
 'found to be adaptable and continuously improving with real-time feedback and '
 'adjustment. The system demonstrated a 8.6% performance increase in simulated '
 'evaluations and a 14% increase in physical evaluations. Additional related '
 'projects and methods were also evaluated, including tinybot, Palm-E, '
 'code-as-policies, progprompt, and LLM-Brain.')
