# Query Extension

Users often do not know exactly what they want to ask, or cannot phrase the question precisely. The **Query Extension** step addresses this by generating several similar queries that supplement the original one. The technique is particularly useful for improving the retrieval component.


In [1]:
import ast
import os

# langchain
from langchain_core.messages import HumanMessage
from langchain_openai import AzureChatOpenAI
from langchain_community.embeddings import SentenceTransformerEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import NLTKTextSplitter
from langchain_community.vectorstores import Chroma

In [2]:
# Initialise the chat model (Azure OpenAI here; any other chat model, including an open-source one, would also work)

AZURE_OPENAI_API_KEY = os.environ.get("AZURE_OPENAI_API_KEY")
AZURE_OPENAI_ENDPOINT = os.environ.get("AZURE_OPENAI_ENDPOINT")
AZURE_OPENAI_VERSION = os.environ.get("AZURE_OPENAI_VERSION")
AZURE_OPENAI_DEPLOYMENT_NAME = os.environ.get("AZURE_OPENAI_DEPLOYMENT_NAME")

oai = AzureChatOpenAI(
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
    api_key=AZURE_OPENAI_API_KEY,
    openai_api_version=AZURE_OPENAI_VERSION,
    azure_deployment=AZURE_OPENAI_DEPLOYMENT_NAME,
    temperature=0,
)

In [3]:
# Basic retriever: load the PDFs and split them into chunks

data_path = "../data"
files = ["fishingguide1.pdf", "fishingguide2.pdf"]

# Each loader returns one list of pages per file
data = [PyPDFLoader(os.path.join(data_path, file)).load() for file in files]
# Flatten the per-file lists into a single list of documents
docs_list = [item for sublist in data for item in sublist]
text_splitter = NLTKTextSplitter()
doc_chunks = text_splitter.split_documents(docs_list)

print("TOTAL NO. OF CHUNKS: ", len(doc_chunks))

TOTAL NO. OF CHUNKS:  66


In [4]:
emb_model = SentenceTransformerEmbeddings(model_name="thenlper/gte-large")
db = Chroma.from_documents(documents=doc_chunks, embedding=emb_model,
                           collection_metadata={"hnsw:space": "cosine"})
retriever = db.as_retriever(search_type="mmr")

In [5]:
# OPTION 1: generate similar queries from the input query alone

def generate_similar_queries(initial_query):
    """Ask the LLM for four similar queries and return them together with the original query."""

    message = HumanMessage(
        content=f"""

            You are a helpful search assistant. Your task is to generate four similar search queries based on a single input query.
            Always use the provided output structure for your response. Be concise and constructive. Do not deviate from the context.
            Initial single input query: {initial_query}
            Output structure: ["{initial_query}", "search query 1", "search query 2", "search query 3", "search query 4"]

        """,

    )
    response = oai.invoke([message])
    # The reply must be a valid Python list literal; literal_eval raises if the model deviates from it
    queries = ast.literal_eval(response.content.strip())

    return queries


user_query = "How to catch fish USA"
similar_queries = generate_similar_queries(initial_query=user_query)
similar_queries

['How to catch fish USA',
 'Fishing techniques in USA',
 'Best ways to catch fish in USA',
 'USA fishing guide',
 'How to fish in America']
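Parsing the model reply with `ast.literal_eval` only works when the model returns a clean list literal. The following is a minimal sketch of a more forgiving parser, not part of the original notebook: `parse_query_list` and its `fallback_query` parameter are hypothetical names introduced here for illustration. It tries JSON first, then a Python literal, and falls back to the original query when neither succeeds.

```python
import ast
import json


def parse_query_list(raw_response, fallback_query):
    """Parse an LLM reply that is expected to contain a list of query strings."""
    text = raw_response.strip()

    # Keep only the outermost [...] in case the model wrapped the list in prose.
    start, end = text.find("["), text.rfind("]")
    if start != -1 and end > start:
        text = text[start:end + 1]

    # Try strict JSON first, then a Python list literal (e.g. single quotes).
    for parser in (json.loads, ast.literal_eval):
        try:
            parsed = parser(text)
        except (ValueError, SyntaxError, TypeError):
            continue
        if isinstance(parsed, (list, tuple)):
            queries = [str(q).strip() for q in parsed if str(q).strip()]
            if queries:
                return queries

    # Nothing usable was found: search with the original query only.
    return [fallback_query]
```

In the cell above, this could stand in for the `ast.literal_eval` call, for example `parse_query_list(response.content, initial_query)`.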

In [6]:
# Calculate the average cosine-based relevance score of the chunks retrieved for each query

def calculate_avg_similarity(queries):
    """Return {query: mean relevance score of its retrieved chunks}, sorted from highest to lowest."""

    averages = {}

    for query in queries:
        # Each result is a (Document, relevance_score) tuple
        results = db.similarity_search_with_relevance_scores(query=query)
        scores = [score for _, score in results]
        if scores:
            averages[query] = round(sum(scores) / len(scores), 5)
        else:
            averages[query] = 0

    sorted_avgs = dict(
        sorted(averages.items(), key=lambda item: item[1], reverse=True))
    return sorted_avgs


scores = calculate_avg_similarity(similar_queries)
scores

{'How to catch fish USA': 0.87898,
 'How to fish in America': 0.87423,
 'USA fishing guide': 0.87053,
 'Fishing techniques in USA': 0.86862,
 'Best ways to catch fish in USA': 0.86521}
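To make the scores above easier to interpret, here is a small sketch of cosine similarity in plain Python. It assumes that, for a collection created with `"hnsw:space": "cosine"`, the relevance score is derived as 1 minus the cosine distance; that mapping is an assumption about the library's default and worth checking against the installed version. The two-dimensional vectors are made-up toy values, not real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for a zero vector; treat as 0
    return dot / (norm_a * norm_b)


def relevance_from_cosine_distance(distance):
    # Assumption: relevance = 1 - cosine distance (see the note above).
    return 1.0 - distance


# Toy vectors standing in for a query embedding and a chunk embedding.
query_vec = [1.0, 0.0]
chunk_vec = [1.0, 1.0]

similarity = cosine_similarity(query_vec, chunk_vec)
distance = 1.0 - similarity
relevance = relevance_from_cosine_distance(distance)
print(f"similarity={similarity:.4f} distance={distance:.4f} relevance={relevance:.4f}")
```

Under that assumption, an average of about 0.87 means the retrieved chunks have a mean cosine similarity of about 0.87 to the query embedding.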

In [7]:
# OPTION 2: generate similar queries & provide retrieved context for quality enhancement

def generate_similar_queries_with_docs(query, documents):
    """Ask the LLM for four similar queries grounded in the retrieved documents."""

    message = HumanMessage(
        content=f"""
            You are a helpful search assistant. Your task is to generate four similar search queries in relation to the provided documents based on a single input query.
            Always use the provided output structure for your response. Be concise and constructive. Do not deviate from the context of the provided documents.


            Initial single input query: {query}
            Documents: {documents}
            Output structure: ["{query}", "search query 1", "search query 2", "search query 3", "search query 4"]

        """,

    )
    response = oai.invoke([message])
    # The reply must be a valid Python list literal; literal_eval raises if the model deviates from it
    queries = ast.literal_eval(response.content.strip())

    return queries


In [8]:
user_query = "How to catch fish USA"
# Retrieve documents for the original query first, then pass them to the LLM as context
initial_results = retriever.invoke(input=user_query)
similar_queries_w_docs = generate_similar_queries_with_docs(query=user_query, documents=initial_results)
similar_queries_w_docs

['How to catch fish USA',
 'Fishing safety tips in the USA',
 'Types of baits for fishing in the USA',
 'How to handle and clean fish in the USA',
 'Fishing regulations in the USA']

In [9]:
scores_w_docs = calculate_avg_similarity(similar_queries_w_docs)
scores_w_docs

{'How to handle and clean fish in the USA': 0.88943,
 'Fishing safety tips in the USA': 0.88455,
 'Fishing regulations in the USA': 0.88429,
 'Types of baits for fishing in the USA': 0.88086,
 'How to catch fish USA': 0.87898}
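Once the extended queries exist, one way to use them for retrieval is to run each query and merge the results without duplicates. The sketch below is an illustration under stated assumptions rather than part of the original notebook: `retrieve_for_queries`, `search_fn`, and `max_docs` are hypothetical names, and `toy_index` is made-up data standing in for the vector store.

```python
def retrieve_for_queries(queries, search_fn, max_docs=8):
    """Run every query through `search_fn` and merge the results, dropping duplicates.

    `search_fn` takes a query string and returns a list of chunk texts.
    Results keep the order in which they were first seen.
    """
    merged, seen = [], set()
    for query in queries:
        for text in search_fn(query):
            if text in seen:
                continue
            seen.add(text)
            merged.append(text)
            if len(merged) >= max_docs:
                return merged
    return merged


# Toy stand-in for the vector store, for illustration only.
toy_index = {
    "How to catch fish USA": ["chunk about casting", "chunk about bait"],
    "Fishing regulations in the USA": ["chunk about bait", "chunk about licences"],
}

merged_chunks = retrieve_for_queries(
    queries=list(toy_index),
    search_fn=lambda q: toy_index.get(q, []),
)
print(merged_chunks)
```

With the notebook's retriever, `search_fn` could be `lambda q: [doc.page_content for doc in retriever.invoke(q)]`.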

## Outcomes

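Supplying the initially retrieved documents as context when generating similar queries produces reformulations that sit closer to the indexed content. All four document-grounded queries score higher (0.88086 to 0.88943) than the original query (0.87898), whereas every query generated without documents scores lower than the original (0.86521 to 0.87423). Simply adjusting how the user query is formulated is therefore enough to raise the average relevance score of the retrieved chunks. Note that the document-grounded queries also shift towards topics covered in the guides (safety, bait, regulations), so a higher score reflects closeness to the corpus and not necessarily to the user's original intent.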

**Comparison of Outcomes without and with Documents:**

| Query (without documents)       | Avg. score  | Query (with documents)                  | Avg. score  |
|---------------------------------|-------------|-----------------------------------------|-------------|
| How to catch fish USA           | **0.87898** | How to handle and clean fish in the USA | **0.88943** |
| How to fish in America          | **0.87423** | Fishing safety tips in the USA          | **0.88455** |
| USA fishing guide               | 0.87053     | Fishing regulations in the USA          | 0.88429     |
| Fishing techniques in USA       | 0.86862     | Types of baits for fishing in the USA   | 0.88086     |
| Best ways to catch fish in USA  | 0.86521     | How to catch fish USA                   | 0.87898     |
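The comparison can also be summarised with a single number per option. This sketch simply averages the scores reported above for each option; the dictionary names are chosen here for illustration, and the values are copied from the notebook outputs.

```python
from statistics import mean

# Scores copied from the outputs of cells 6 and 9.
scores_without_docs = {
    "How to catch fish USA": 0.87898,
    "How to fish in America": 0.87423,
    "USA fishing guide": 0.87053,
    "Fishing techniques in USA": 0.86862,
    "Best ways to catch fish in USA": 0.86521,
}
scores_with_docs = {
    "How to handle and clean fish in the USA": 0.88943,
    "Fishing safety tips in the USA": 0.88455,
    "Fishing regulations in the USA": 0.88429,
    "Types of baits for fishing in the USA": 0.88086,
    "How to catch fish USA": 0.87898,
}

mean_without = mean(list(scores_without_docs.values()))
mean_with = mean(list(scores_with_docs.values()))

print(f"mean without documents: {mean_without:.5f}")
print(f"mean with documents:    {mean_with:.5f}")
print(f"difference:             {mean_with - mean_without:+.5f}")
```

Both sets include the original query, so the difference comes entirely from the four generated queries in each option.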

**IDEAS:**
- Improve the system prompt for generating similar queries
- Test different similarity metrics
- Add a keyword-based scoring/re-ranking mechanism
- Vary `k`, the number of retrieved chunks, to see whether results improve or worsen
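As a starting point for the keyword-based re-ranking idea, the following sketch blends a vector score with a simple keyword-overlap score. It is one possible approach under stated assumptions, not a tested improvement: `keyword_overlap`, `rerank`, and the `keyword_weight` value of 0.3 are hypothetical, and the candidate texts and scores are made up.

```python
import re


def keyword_overlap(query, text):
    """Fraction of distinct query terms that also occur in the text."""
    query_terms = set(re.findall(r"\w+", query.lower()))
    text_terms = set(re.findall(r"\w+", text.lower()))
    if not query_terms:
        return 0.0
    return len(query_terms & text_terms) / len(query_terms)


def rerank(query, scored_texts, keyword_weight=0.3):
    """Re-sort (text, vector_score) pairs by a weighted mix of both scores."""
    combined = [
        (text, (1 - keyword_weight) * score + keyword_weight * keyword_overlap(query, text))
        for text, score in scored_texts
    ]
    return sorted(combined, key=lambda pair: pair[1], reverse=True)


# Made-up candidates: the second has a slightly lower vector score but matches the keywords.
candidates = [
    ("Boat maintenance checklist", 0.88),
    ("How to catch fish with live bait", 0.87),
]
reranked = rerank("catch fish", candidates)
print(reranked[0][0])
```

With the notebook's vector store, the `(text, score)` pairs could be built from `db.similarity_search_with_relevance_scores(query)` by taking `doc.page_content` from each returned `(Document, score)` tuple.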