# 🧠 Week 07-08 · Notebook 09 · Query Transformation & Decomposition

Transform technician questions into decomposed, enriched queries that boost RAG accuracy and auditability.

## 🎯 Learning Objectives
- Decompose complex maintenance requests into sub-questions.
- Apply rewrite techniques (HyDE, paraphrasing) to improve retrieval.
- Track transformation lineage for compliance.
- Evaluate transformation impact on response quality.

## 🧩 Scenario
Technicians often bundle several requests (diagnosis, spare parts, safety check) into a single message. Decompose each query so that every pipeline step receives only the context it needs.

In [None]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field
from typing import List

# --- 1. Define the desired output structure ---
# Use Pydantic to define a structured output for the sub-questions.
class DecomposedQuestions(BaseModel):
    """A list of simple, self-contained questions decomposed from a complex user query."""
    questions: List[str] = Field(description="A list of sub-questions.")

# --- 2. Create a parser ---
# The parser will automatically generate formatting instructions for the LLM.
parser = PydanticOutputParser(pydantic_object=DecomposedQuestions)

# --- 3. Create the prompt ---
# The prompt template now includes the format instructions from the parser.
decomposition_prompt = PromptTemplate(
    template="""Decompose the following user query into a list of simple, self-contained sub-questions.
Focus on breaking down the query into individual tasks like diagnosis, inventory checks, or safety procedures.

{format_instructions}

User Query:
{query}
""",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()}
)

# --- 4. Build and run the chain ---
llm = ChatOpenAI(model='gpt-4o-mini', temperature=0)
decompose_chain = decomposition_prompt | llm | parser

query = 'Spindle 4 is vibrating again after the bearing swap. I need to know the root cause, check the spare parts inventory for replacement bearings, and find the correct safety procedure for this repair.'
sub_questions = decompose_chain.invoke({"query": query})

print("--- Original Query ---")
print(query)
print("\n--- Decomposed Sub-Questions ---")
for q in sub_questions.questions:
    print(f"- {q}")
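
Once the query is decomposed, each sub-question can be retrieved against separately and the hits merged. The sketch below is one possible way to do that, not part of the notebook's pipeline: `retrieve_per_sub_question`, `merge_unique`, `toy_retrieve`, and the `doc-*` identifiers are all hypothetical names, and the keyword lookup is only a stand-in for a real retriever.

```python
from typing import Callable, Dict, List


def retrieve_per_sub_question(
    sub_questions: List[str],
    retrieve: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Run every sub-question through the retriever and key the hits by question."""
    return {question: retrieve(question) for question in sub_questions}


def merge_unique(results: Dict[str, List[str]]) -> List[str]:
    """Flatten per-question hits, keeping first-seen order and dropping duplicates."""
    seen = set()
    merged = []
    for hits in results.values():
        for doc_id in hits:
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


# Hypothetical toy index: keyword fragment -> made-up document IDs.
TOY_INDEX = {
    "vibrat": ["doc-a", "doc-b"],
    "bearing": ["doc-c", "doc-a"],
    "safety": ["doc-d"],
}


def toy_retrieve(question: str) -> List[str]:
    """Stand-in retriever: return IDs for every keyword fragment found in the question."""
    lowered = question.lower()
    hits = []
    for keyword, doc_ids in TOY_INDEX.items():
        if keyword in lowered:
            hits.extend(doc_ids)
    return hits


# Hand-written sub-questions standing in for `sub_questions.questions`.
demo_questions = [
    "What causes spindle vibration after a bearing swap?",
    "Are replacement bearings in stock?",
    "What is the safety procedure for the repair?",
]

per_question = retrieve_per_sub_question(demo_questions, toy_retrieve)
merged_ids = merge_unique(per_question)
print(merged_ids)
```

Keeping the per-question mapping, rather than only the merged list, preserves which sub-question surfaced which document, which is useful for the lineage tracking discussed later.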

### 🧪 Hypothetical Document Embeddings (HyDE)
Generate a synthetic answer to the query and embed that answer, instead of the raw question, to improve retrieval coverage.

In [None]:
from langchain_core.output_parsers import StrOutputParser

# --- HyDE: Hypothetical Document Embeddings ---
# This technique generates a hypothetical answer to the user's query first,
# then uses the embedding of that *answer* to find similar real documents.
# This can improve retrieval because a hypothetical answer usually resembles
# real answer documents more closely than a short question does.

# The prompt to generate the hypothetical document
hyde_prompt = PromptTemplate.from_template(
    """Generate a concise, factual, hypothetical answer to the following question.
This answer will be used to find similar real documents.
Cite potential SOP IDs or document numbers.

Question: {query}
"""
)

# The chain to generate the synthetic document
hyde_chain = hyde_prompt | llm | StrOutputParser()

# A shortened, single-intent version of the earlier query (root cause only)
query = 'Spindle 4 is vibrating again after the bearing swap. What is the root cause?'

# Generate the hypothetical document
synthetic_doc = hyde_chain.invoke({"query": query})

print("--- Original Query ---")
print(query)
print("\n--- Generated Hypothetical Document (for embedding) ---")
print(synthetic_doc)

# In a full RAG pipeline, you would then:
# 1. Embed `synthetic_doc`.
# 2. Use that embedding to perform a similarity search on your vector store.
# 3. Pass the retrieved *real* documents to the final answer-generation LLM.
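
The first two steps listed above can be sketched without any external service. This is a minimal illustration under stated assumptions: `toy_embed` is a bag-of-words stand-in for a real embedding model, the in-memory `corpus` dict stands in for a vector store, and the document texts and the hypothetical answer are hand-written placeholders rather than model output.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def toy_embed(text: str) -> Counter:
    """Stand-in for an embedding model: a term-count vector over lowercase tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hyde_search(
    hypothetical_doc: str, corpus: Dict[str, str], k: int = 2
) -> List[Tuple[str, float]]:
    """Embed the hypothetical answer (not the question) and rank real documents by similarity."""
    query_vec = toy_embed(hypothetical_doc)
    scored = [(doc_id, cosine(query_vec, toy_embed(text))) for doc_id, text in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up corpus standing in for the real document store.
corpus = {
    "doc-1": "Bearing preload set incorrectly after replacement causes spindle vibration.",
    "doc-2": "Quarterly coolant filter replacement schedule.",
    "doc-3": "Lockout tagout steps before opening the spindle housing.",
}

# Hand-written placeholder for `synthetic_doc`.
hypothetical_answer = (
    "Vibration after a bearing swap is often caused by incorrect bearing preload "
    "or misalignment of the spindle."
)

results = hyde_search(hypothetical_answer, corpus, k=2)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

In a real pipeline the retrieved documents, not the hypothetical answer, would then be passed to the answer-generation step, as the third step above states.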

## 🧾 Transformation Log
| Stage | Output | Hash |
| --- | --- | --- |

- Record each stage in the governance store so that every transformed query can be traced back to the original request.
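
One way to populate the Stage, Output, and Hash columns is sketched below. It assumes SHA-256 over the stage output as the hash; the `parent_hash` field, the helper names, and the stage labels are hypothetical additions for illustration, and the in-memory list stands in for whatever governance store is actually used.

```python
import hashlib
from typing import Dict, List


def make_log_entry(stage: str, output: str, parent_hash: str = "") -> Dict[str, str]:
    """Build one log row: stage name, its output, and a SHA-256 digest of that output."""
    digest = hashlib.sha256(output.encode("utf-8")).hexdigest()
    return {"stage": stage, "output": output, "hash": digest, "parent_hash": parent_hash}


def append_stage(log: List[Dict[str, str]], stage: str, output: str) -> None:
    """Append a row that points back at the previous row's hash (empty for the first row)."""
    parent = log[-1]["hash"] if log else ""
    log.append(make_log_entry(stage, output, parent))


def to_markdown(log: List[Dict[str, str]]) -> str:
    """Render the log in the Stage / Output / Hash layout, truncating hashes for display."""
    lines = ["| Stage | Output | Hash |", "| --- | --- | --- |"]
    for row in log:
        lines.append(f"| {row['stage']} | {row['output']} | {row['hash'][:12]} |")
    return "\n".join(lines)


# In-memory list standing in for the governance store.
transformation_log: List[Dict[str, str]] = []
append_stage(
    transformation_log,
    "original_query",
    "Spindle 4 is vibrating again after the bearing swap. What is the root cause?",
)
# Placeholder text; in the notebook this would be the value of `synthetic_doc`.
append_stage(transformation_log, "hyde_document", "placeholder for the generated hypothetical document")

print(to_markdown(transformation_log))
```

Linking each row to its parent hash makes it possible to walk from any transformed query back to the technician's original wording.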

## 🧪 Lab Assignment
1. Build a `HypotheticalDocumentEmbedder` pipeline that indexes synthetic docs alongside real data.
2. Run an ablation study: baseline vs. decomposition + HyDE on the incident QA dataset.
3. Present the findings to reliability engineering, highlighting any change in recall.
4. Update the change log with the new transformation policy.
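
For the ablation in step 2, recall@k is one reasonable metric to compare the two configurations. The sketch below assumes each question has a set of gold-relevant document IDs; the function names are hypothetical, and the IDs and rankings are toy values that only exercise the functions, not measured results.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k retrieved list."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(
    runs: Dict[str, List[str]], gold: Dict[str, Set[str]], k: int
) -> float:
    """Average recall@k over every question in the gold set (missing runs count as zero)."""
    scores = [recall_at_k(runs.get(qid, []), relevant, k) for qid, relevant in gold.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Toy data only: question ID -> relevant doc IDs, and ranked results per configuration.
gold = {"q1": {"d1", "d2"}, "q2": {"d3"}}
baseline_run = {"q1": ["d1", "d9", "d8"], "q2": ["d7", "d6", "d3"]}
transformed_run = {"q1": ["d1", "d2", "d9"], "q2": ["d3", "d7", "d6"]}

baseline_recall = mean_recall_at_k(baseline_run, gold, k=2)
transformed_recall = mean_recall_at_k(transformed_run, gold, k=2)
print(f"baseline recall@2:    {baseline_recall:.2f}")
print(f"transformed recall@2: {transformed_recall:.2f}")
```

Running both configurations over the same gold set and the same `k` keeps the comparison fair; reporting per-question scores alongside the mean helps show where a transformation hurts as well as where it helps.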

## ✅ Checklist
- [ ] Decomposition chain implemented
- [ ] HyDE augmentation validated
- [ ] Transformation log captured
- [ ] Lab artefacts submitted

## 📚 References
- LangChain Query Transformation Guide
- HyDE Paper (Gao et al., 2022)
- Week 09 Evaluation Metrics Notebook