**3. Question-Answering (QA) Chain + Evaluation**

**Key Objectives Delivered**:
1) **Embeddings (MiniLM-L6-v2)**: Converts the text transcripts of the YouTube videos and blog posts into vectors that capture meaning.
2) **Retriever/Vector Store Set-up (Chroma DB)**: The database that stores our vectors and allows fast similarity search.
3) **Configuration of Ranker Retriever**: Reorders the retrieved chunks from most to least relevant to the user query.
4) **LLM Configuration**: The brain that reads the retrieved documents and generates the answers.
5) **Multi-Query Retriever / Helper function Evaluation (ask_rag) & QA chain**: Expands the search with paraphrased queries and wraps the full chain in a helper function for testing.
6) **Testing**: Ran a list of questions to test accuracy and hallucination resistance.


3.1 Importing the corresponding libraries we are going to use 

In [None]:
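import os
from getpass import getpass

# The key is omitted here; prompt for it instead of hard-coding it in the notebook.
os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")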

In [36]:
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings

embedding_function = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

knowledge_vectorstore = Chroma(
    collection_name="prophero_knowledge",
    persist_directory="data/chroma_db",
    embedding_function=embedding_function,
)

print("Number of documents in Chroma collection:", 
      knowledge_vectorstore._collection.count())

Number of documents in Chroma collection: 50


3.2 Creating retrievers. This step defines how the system retrieves information from the Chroma database.

In [37]:
retriever = knowledge_vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 10}
)
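To make the idea of a top-k similarity search concrete, here is a minimal sketch in plain Python. It is not what Chroma runs internally: the three-dimensional vectors and the chunk names are made up for illustration, and cosine similarity is assumed as the measure, whereas the metric Chroma actually uses depends on how the collection is configured.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the ids of the k stored vectors most similar to the query."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine_similarity(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical toy "embeddings"; real all-MiniLM-L6-v2 vectors have 384 dimensions.
toy_vectors = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.8, 0.6, 0.0],
    "chunk_c": [0.0, 0.0, 1.0],
}

top_two = top_k([1.0, 0.1, 0.0], toy_vectors, k=2)
print(top_two)
```

With k=10, as configured above, the retriever would return the ten closest chunks instead of two.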

In [38]:
{
  'title': 'The Most Common Property Investment Mistakes That Can Easily Be Avoided',
  'video_id': 'blog_mistakes',
  'url': 'https://www.prophero.com/the-most-common-property-investment-mistakes-that-can-easily-be-avoided/',
  'source_type': 'blog'
}

{'title': 'The Most Common Property Investment Mistakes That Can Easily Be Avoided',
 'video_id': 'blog_mistakes',
 'url': 'https://www.prophero.com/the-most-common-property-investment-mistakes-that-can-easily-be-avoided/',
 'source_type': 'blog'}

In [39]:
# 1. Ground truth based on video_id / blog_id

ground_truth = {
    "What does PropHero do?": [
        "prophero_video_1",
        "prophero_video_2",
        "prophero_video_3",
        "blog_mistakes",
    ],
    "Does PropHero use AI?": [
        "prophero_video_3",
        "blog_mistakes",
        "blog_property_vs_shares",
    ],
    "Who founded PropHero?": [
        "blog_capital_gains",
        "blog_mistakes",
    ],
}

In [40]:
import numpy as np

# 2. Measuring key metrics for our retriever: precision@k, recall@k, MRR

def precision_at_k(relevant_ids, retrieved_ids, k=5):
    retrieved_k = retrieved_ids[:k]
    if len(retrieved_k) == 0:
        return 0.0
    hits = sum(doc_id in relevant_ids for doc_id in retrieved_k)
    return hits / len(retrieved_k)


def recall_at_k(relevant_ids, retrieved_ids, k=5):
    if len(relevant_ids) == 0:
        return 0.0
    retrieved_k = retrieved_ids[:k]
    hits = sum(doc_id in relevant_ids for doc_id in retrieved_k)
    return hits / len(relevant_ids)


def mrr_score(relevant_ids, retrieved_ids):
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def evaluate_retriever(queries, retriever, ground_truth, top_k=5):
    precision_scores = []
    recall_scores = []
    mrr_scores = []

    for q in queries:
    
        docs = retriever.invoke(q)

        retrieved_ids = [doc.metadata.get("video_id") for doc in docs]

        relevant = ground_truth[q]

        # Compute metrics
        precision_scores.append(precision_at_k(relevant, retrieved_ids, k=top_k))
        recall_scores.append(recall_at_k(relevant, retrieved_ids, k=top_k))
        mrr_scores.append(mrr_score(relevant, retrieved_ids))

    return {
        "precision@k": float(np.mean(precision_scores)),
        "recall@k": float(np.mean(recall_scores)),
        "mrr": float(np.mean(mrr_scores)),
    }


In [41]:
queries = list(ground_truth.keys())

results_old = evaluate_retriever(
    queries=queries,
    retriever=retriever,
    ground_truth=ground_truth,
    top_k=5,
)

results_old

{'precision@k': 0.6, 'recall@k': 1.0, 'mrr': 1.0}
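One thing to keep in mind when reading these numbers: the ids come from chunk metadata, so several retrieved chunks can share the same video_id. The sketch below uses a made-up retrieval list (blog_x and blog_y are hypothetical ids) to show how repeated ids can push recall above 1.0, and how de-duplicating before scoring avoids that. Whether this affects the results above depends on which chunks were actually retrieved.

```python
def dedupe_keep_order(ids):
    """Drop repeated ids, keeping the first (highest-ranked) occurrence."""
    seen = set()
    unique = []
    for doc_id in ids:
        if doc_id not in seen:
            seen.add(doc_id)
            unique.append(doc_id)
    return unique

def precision_at_k(relevant, retrieved, k):
    top = retrieved[:k]
    return sum(d in relevant for d in top) / len(top) if top else 0.0

def recall_at_k(relevant, retrieved, k):
    top = retrieved[:k]
    return sum(d in relevant for d in top) / len(relevant) if relevant else 0.0

def mrr(relevant, retrieved):
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

relevant = ["blog_capital_gains", "blog_mistakes"]
# Hypothetical retrieval: three chunks come from the same blog post.
retrieved = ["blog_x", "blog_mistakes", "blog_mistakes", "blog_mistakes", "blog_y"]

raw_recall = recall_at_k(relevant, retrieved, k=5)        # 3 hits / 2 relevant = 1.5
unique = dedupe_keep_order(retrieved)                     # 3 distinct ids remain
clean_recall = recall_at_k(relevant, unique, k=5)         # 1 hit / 2 relevant = 0.5
clean_precision = precision_at_k(relevant, unique, k=5)   # 1 hit / 3 retrieved
reciprocal_rank = mrr(relevant, unique)                   # first hit at rank 2 -> 0.5

print(raw_recall, clean_recall, clean_precision, reciprocal_rank)
```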

In [42]:
import json

with open("results_old.json", "w") as f:
    json.dump(results_old, f, indent=4)

print("Old evaluation saved!")

Old evaluation saved!


3.3 Defining the LLM (the brain that will generate the answers)

In [43]:
from langchain.chat_models import ChatOpenAI

In [44]:
llm = ChatOpenAI(
    model_name="gpt-4o-mini", 
    temperature=0.3          
)

3.4 MultiQuery Retriever: Helps us expand the search. The LLM generates multiple paraphrased queries, runs a similarity search for each, and combines the results. "Ask the question in several smart ways --> gather more possible answers."


In [45]:
try:
    from langchain.retrievers.multi_query import MultiQueryRetriever
except ImportError:
    from langchain_community.retrievers import MultiQueryRetriever

multiquery_retriever = MultiQueryRetriever.from_llm(
    retriever=retriever,  
    llm=llm,          
    include_original=True 
)

multiquery_retriever

MultiQueryRetriever(retriever=VectorStoreRetriever(tags=['Chroma', 'HuggingFaceEmbeddings'], vectorstore=<langchain_community.vectorstores.chroma.Chroma object at 0x0000026F4DD62490>, search_kwargs={'k': 10}), llm_chain=PromptTemplate(input_variables=['question'], input_types={}, partial_variables={}, template='You are an AI language model assistant. Your task is\n    to generate 3 different versions of the given user\n    question to retrieve relevant documents from a vector  database.\n    By generating multiple perspectives on the user question,\n    your goal is to help the user overcome some of the limitations\n    of distance-based similarity search. Provide these alternative\n    questions separated by newlines. Original question: {question}')
| ChatOpenAI(client=<openai.resources.chat.completions.completions.Completions object at 0x0000026F4E56AC50>, async_client=<openai.resources.chat.completions.completions.AsyncCompletions object at 0x0000026F4E56B190>, model_name='gpt-4o-mi
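The "ask in several ways, then combine" idea can be sketched without any LLM. In the snippet below, fake_paraphrase and fake_index are hypothetical stand-ins for the LLM and the vector store, and multi_query_retrieve is an illustrative function, not LangChain's implementation.

```python
def multi_query_retrieve(question, paraphrase, search, include_original=True):
    """Run one search per query variant and merge the results without duplicates."""
    queries = list(paraphrase(question))
    if include_original:
        queries.append(question)

    merged, seen = [], set()
    for q in queries:
        for doc in search(q):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

# Hypothetical stand-in for the LLM that rewrites the question.
def fake_paraphrase(question):
    return ["What services does PropHero offer?", "What is PropHero's business?"]

# Hypothetical stand-in for the vector store: query -> retrieved chunk ids.
fake_index = {
    "What does PropHero do?": ["chunk_1", "chunk_2"],
    "What services does PropHero offer?": ["chunk_2", "chunk_3"],
    "What is PropHero's business?": ["chunk_1", "chunk_4"],
}

merged = multi_query_retrieve(
    "What does PropHero do?",
    fake_paraphrase,
    lambda q: fake_index.get(q, []),
)
print(merged)
```

The original question alone would only have surfaced chunk_1 and chunk_2; the paraphrases add chunk_3 and chunk_4 to the pool.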

3.5 Ranker Retriever: Filters and prioritizes. It reorders the chunks from most to least relevant to the user query and keeps only the top ones.

In [46]:
from typing import List
from sentence_transformers import CrossEncoder


try:
    from langchain_core.retrievers import BaseRetriever
    from langchain_core.documents import Document
except ImportError:
   
    from langchain.retrievers import BaseRetriever
    from langchain.schema import Document


class CrossEncoderRerankRetriever(BaseRetriever):
    """
    LangChain-compatible retriever that:
    1) Uses a base retriever (Chroma) to get k_initial docs
    2) Uses a CrossEncoder reranker to score (query, doc)
    3) Returns the top k_final docs
    """

    base_retriever: BaseRetriever
    reranker: CrossEncoder
    k_initial: int = 10
    k_final: int = 4

    def _get_relevant_documents(self, query: str) -> List[Document]:
        
        docs = self.base_retriever.invoke(query)
        if not docs:
            return []

        docs = docs[: self.k_initial]
    
        pairs = [(query, d.page_content) for d in docs]

        scores = self.reranker.predict(pairs)

        scored_docs = sorted(zip(scores, docs), key=lambda x: x[0], reverse=True)
        top_docs = [doc for score, doc in scored_docs[: self.k_final]]
        return top_docs

    async def _aget_relevant_documents(self, query: str) -> List[Document]:
     
        return self._get_relevant_documents(query)

reranker_model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

rerank_retriever = CrossEncoderRerankRetriever(
    base_retriever=retriever,   
    reranker=reranker_model,
    k_initial=10,
    k_final=4,
)
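The retrieve-then-rerank logic of the class above can be illustrated with a toy scorer. Here overlap_score is a hypothetical word-overlap function used only so the example runs without downloading a model; a real cross-encoder scores each (query, document) pair with a neural network instead.

```python
def rerank(query, docs, score_fn, k_initial=10, k_final=4):
    """Score the first k_initial docs against the query and keep the k_final best."""
    candidates = docs[:k_initial]
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k_final]]

def overlap_score(query, doc):
    """Toy relevance score: number of query words that also appear in the doc."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words)

# Made-up chunks for illustration.
docs = [
    "rental yield is income divided by price",
    "prophero uses ai to find property",
    "capital gains tax applies on sale",
]

top = rerank("does prophero use ai", docs, overlap_score, k_final=2)
print(top)
```

The second chunk moves to the front because it shares the most words with the query, even though the base retriever returned it in second position.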


3.6 QA Chain. This controls how the LLM uses the context.

In [47]:
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=multiquery_retriever,   
    return_source_documents=True
)

In [48]:
from langchain.chains import LLMChain
from langchain.chains.combine_documents.stuff import StuffDocumentsChain

# 1) LLMChain that uses your PromptTemplate (here it is actually respected)
# Note: qa_prompt is defined in cell In [50] below, so run that cell first
llm_chain = LLMChain(
    llm=llm,
    prompt=qa_prompt,
)

# 2) CombineDocumentsChain that joins the documents with your prompt
doc_chain = StuffDocumentsChain(
    llm_chain=llm_chain,
    document_variable_name="context",
)

# 3) Build the RetrievalQA manually
qa_chain = RetrievalQA(
    retriever=multiquery_retriever,
    combine_documents_chain=doc_chain,
    return_source_documents=True,
)




3.7 Helper RAG Function for testing. It gives us a shortcut that sends the question through the full RAG chain (qa_chain), prints a clear answer, and optionally prints the sources.

In [49]:
def ask_rag(question: str, show_sources: bool = True):
    """
    Sends a question to the RAG system and prints the full-sentence answer.
    Optionally shows sources and chunk previews.
    """
    response = qa_chain.invoke({"query": question})

    print("\n=== ANSWER ===\n")
    print(response["result"])

    if show_sources:
        print("\n=== SOURCES USED ===\n")
        for i, doc in enumerate(response["source_documents"], start=1):
            meta = doc.metadata
            preview = doc.page_content[:250].replace("\n", " ")

            print(f"Source {i}:")
            print(f"  - Type: {meta.get('source_type')}")
            print(f"  - Title: {meta.get('title')}")
            print(f"  - URL: {meta.get('url')}")
            print(f"  - Chunk Preview: {preview}...")
            print()


In [50]:
from langchain.prompts import PromptTemplate

qa_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template="""
You are a helpful and friendly AI assistant for property investors.

Your task is to answer ONLY using the information in the context below.

❗ IMPORTANT: Your answer MUST follow these rules:
- Maximum **1 sentence**
- Short, clear and to the point
- No long explanations
- No extra information not in the context
- If the answer is not in the context, say: "Sorry, I couldn't find information about that."

---

Context:
{context}

Question:
{question}

Short answer (max 1 sentence):
"""
)
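To see what the LLM ends up receiving, the sketch below fills a shortened version of the template by hand. It only illustrates the "stuff" idea of pasting every retrieved chunk into one context string; the blank-line separator, the shortened template, and the example chunks are assumptions for illustration, not LangChain internals.

```python
# Shortened, hypothetical version of the prompt template above.
TEMPLATE = (
    "Context:\n{context}\n\n"
    "Question:\n{question}\n\n"
    "Short answer (max 1 sentence):"
)

def stuff_prompt(chunks, question, separator="\n\n"):
    """Join all chunks into one context string and fill the template."""
    context = separator.join(chunks)
    return TEMPLATE.format(context=context, question=question)

# Made-up chunks standing in for retrieved documents.
filled = stuff_prompt(
    ["PropHero helps investors buy property.", "It uses data models."],
    "What does PropHero do?",
)
print(filled)
```

Because every chunk is pasted in, the prompt grows with the number of retrieved documents, which is one reason to rerank and keep only a few.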

3.8 Testing 

In [51]:
test_questions = [
    "Is PropHero a company that focuses on property investment?",
    "Does PropHero use artificial intelligence for decision-making?",
    "Does PropHero specialize in commercial properties?",
    "Who is one of the co-founders of PropHero?",
    "What kind of platform is PropHero described as?",
    "What problem does PropHero aim to solve for investors?",
    "What does PropHero send to users after they invest?",
    "What is the capital of Brazil according to PropHero?",  # hallucination test
    "Does the text describe safari tourism in Africa?"        # totally off-topic
]


In [52]:
for i, question in enumerate(test_questions, start=1):
    print(f"\n{'=' * 60}")
    print(f"❓ Question {i}: {question}\n")
    
    ask_rag(question, show_sources=True)
    
    print("\n") 



❓ Question 1: Is PropHero a company that focuses on property investment?


=== ANSWER ===

Yes, PropHero is a company that focuses on property investment.

=== SOURCES USED ===

Source 1:
  - Type: blog
  - Title: Are You Calculating Rental Yield Correctly?
  - URL: https://www.prophero.com/are-you-calculating-rental-yield-correctly/
  - Chunk Preview: into account other expense associate with property ownership , such as property management fee , maintenance cost , property taxis , and insurance . net rental yield this calculation consider the net income from the property after deduct all associat...

Source 2:
  - Type: video
  - Title: PropHero – Intro Video 3
  - URL: https://www.youtube.com/watch?v=5Kca3nOrefY
  - Chunk Preview: portfolio use each successful investment to power the next , create a snowball of wealth over time . and that be why we create prop hero to help everyday aussie find , buy and manage high return low risk investment property . unlike other platform th...
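For the hallucination and off-topic questions, the expected behaviour is the refusal sentence from the prompt. A small check like the one below could score that automatically instead of reading each answer by eye. The helper names and the two sample answers are hypothetical, not outputs of the chain.

```python
REFUSAL = "Sorry, I couldn't find information about that."

def is_refusal(answer):
    """True if the answer is the refusal sentence, ignoring case, quotes and the final period."""
    normalized = answer.strip().strip('"').rstrip(".").lower()
    return normalized == REFUSAL.rstrip(".").lower()

def refusal_rate(answers):
    """Fraction of answers to unanswerable questions that are refusals."""
    if not answers:
        return 0.0
    return sum(is_refusal(a) for a in answers) / len(answers)

# Hypothetical answers to two unanswerable questions: one refusal, one hallucination.
hypothetical_answers = [
    "Sorry, I couldn't find information about that.",
    "The capital of Brazil is Brasilia.",
]

rate = refusal_rate(hypothetical_answers)
print(rate)
```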



3.9 Evaluation ROUGE & BLEU

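**ROUGE** --> Did the model capture the most important ideas? It is recall-oriented: how much of the expected answer was recovered in the generated one? The code below reports the ROUGE-L F-measure, which balances precision and recall over the longest common subsequence.

**BLEU** --> Did the model use the same words as the correct answer? It is precision-oriented: what share of the generated n-grams also appear in the expected answer? sacrebleu reports it on a 0-100 scale.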
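To make precision and recall concrete, here is a simplified ROUGE-L computed by hand. It is a sketch that assumes lowercase whitespace tokens and no stemming, so its numbers will not necessarily match the rouge_score library.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(reference, candidate):
    """Return (precision, recall, f_measure) based on the LCS of the two texts."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(cand)   # share of generated tokens that match
    recall = lcs / len(ref)       # share of reference tokens recovered
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

precision, recall, f_measure = rouge_l(
    "prophero focuses on property investment",
    "prophero is a company that focuses on property investment",
)
print(precision, recall, f_measure)
```

Here every reference word is recovered, so recall is 1.0, but the extra words in the generated answer pull precision down to 5/9.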

In [53]:
!pip install rouge-score sacrebleu



In [54]:
from rouge_score import rouge_scorer
import sacrebleu

def evaluate_rag_generation(system_answers, reference_answers):
    """
    system_answers: list of answers generated by your RAG system
    reference_answers: list of expected correct answers (YOU must define them)
    """

    bleu = sacrebleu.corpus_bleu(system_answers, [reference_answers]).score

    scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
    rouge_scores = [
        scorer.score(ref, sys)["rougeL"].fmeasure
        for sys, ref in zip(system_answers, reference_answers)
    ]
    
    rougeL = sum(rouge_scores) / len(rouge_scores)

    return {"BLEU": bleu, "ROUGE-L": rougeL}

In [55]:
test_questions = [
    "Is PropHero a company that focuses on property investment?",
    "Does PropHero use artificial intelligence for decision-making?",
    "What kind of platform is PropHero described as?",
    "Who is one of the co-founders of PropHero?"
]

reference_answers = [
    "Yes, PropHero focuses on property investment.",
    "Yes, PropHero uses AI to support decision-making.",
    "PropHero is described as a smart digital property investment platform.",
    "One of the co-founders of PropHero is Pablo Cebrián."
]

In [56]:
system_answers = []

for q in test_questions:
    result = qa_chain.invoke({"query": q})
    system_answers.append(result["result"])

system_answers

['Yes, PropHero is a company that focuses on property investment.',
 'Yes, PropHero uses AI-powered data models to identify high return properties and guide investment decisions.',
 'PropHero is described as an expert property investment partner that helps everyday Australians find, buy, and manage high return low risk investment properties.',
 'One of the co-founders of PropHero is Michael Roger.']

In [57]:
metrics = evaluate_rag_generation(system_answers, reference_answers)
metrics

{'BLEU': 26.299515953323684, 'ROUGE-L': 0.5938852813852814}

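Output:
- BLEU = 26.3 (on a 0-100 scale) --> Modest word overlap with the reference answers
- ROUGE-L = 0.59 --> Good, captures meaning but not perfectly
- Neither metric checks facts: the fourth answer names Michael Roger, while the reference names Pablo Cebrián.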

3.10 Evaluation RAGAS (Faithfulness, answer relevancy, context recall, context precision, answer correctness)

In [58]:
!pip install -q "ragas>=0.1.7" datasets

In [59]:
pip install --upgrade huggingface_hub datasets ragas

Collecting huggingface_hub
  Using cached huggingface_hub-1.1.5-py3-none-any.whl.metadata (13 kB)
Using cached huggingface_hub-1.1.5-py3-none-any.whl (516 kB)
Installing collected packages: huggingface_hub
  Attempting uninstall: huggingface_hub
    Found existing installation: huggingface-hub 0.36.0
    Uninstalling huggingface-hub-0.36.0:
      Successfully uninstalled huggingface-hub-0.36.0
Successfully installed huggingface_hub-1.1.5
Note: you may need to restart the kernel to use updated packages.


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
langchain-huggingface 1.0.1 requires huggingface-hub<1.0.0,>=0.33.4, but you have huggingface-hub 1.1.5 which is incompatible.
langchain-huggingface 1.0.1 requires langchain-core<2.0.0,>=1.0.3, but you have langchain-core 0.3.79 which is incompatible.
transformers 4.57.1 requires huggingface-hub<1.0,>=0.34.0, but you have huggingface-hub 1.1.5 which is incompatible.


In [60]:
from datasets import Dataset

from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,   # how well the answer matches the question
    faithfulness,       # hallucinations vs context
    context_precision,  # how much of retrieved context is actually useful
    context_recall,     # did we retrieve the needed info?
)

In [61]:
def build_ragas_dataset(questions, reference_answers, rag_chain, max_contexts: int = 5):
    """
    Run your RAG chain on a list of questions and build a RAGAS dataset.

    questions: list[str]
    reference_answers: list[str] – gold answers you wrote manually
    rag_chain: your RetrievalQA chain (qa_chain)
    max_contexts: how many retrieved chunks per question to keep
    """
    assert len(questions) == len(reference_answers), "questions and reference_answers must have same length"

    rows = []

    for q, gt in zip(questions, reference_answers):
        # 1) Call your RAG system
        result = rag_chain.invoke({"query": q})

        # 2) Extract answer from the chain output
        answer = result["result"]

        # 3) Extract textual contexts from retrieved documents
        source_docs = result.get("source_documents", [])
        contexts = [doc.page_content for doc in source_docs[:max_contexts]]

        # 4) Add row for RAGAS
        rows.append(
            {
                "question": q,
                "answer": answer,
                "contexts": contexts,
                "ground_truth": gt,
            }
        )

    # Convert to HF Dataset
    return Dataset.from_list(rows)

In [62]:
ragas_dataset = build_ragas_dataset(
    questions=test_questions,
    reference_answers=reference_answers,
    rag_chain=qa_chain,
    max_contexts=5,   # you can tune this
)

ragas_dataset

Dataset({
    features: ['question', 'answer', 'contexts', 'ground_truth'],
    num_rows: 4
})