# RAGAS Evaluation

#### Choose an advanced retrieval technique that you believe will improve your application’s ability to retrieve the most appropriate context. Write 1-2 sentences on why you believe it will be useful for your use case.

This evaluation compares the performance of **Hybrid (RRF only)** vs **Hybrid + Cohere Reranking**.

Hybrid search with Reciprocal Rank Fusion (RRF) is the baseline retrieval strategy for my use case, since I need both keyword and semantic search to meet my users' expectations. I first measure the performance of that baseline, then measure the same pipeline with Cohere reranking added. I believe reranking is useful for my use case because it adds a layer of refinement that can improve retrieval precision. Precise retrieval is extremely important to my users: surfacing the right asset quickly is their primary goal. The result of this assessment will help me decide whether to add reranking to my final application.
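
For readers unfamiliar with RRF, here is a minimal sketch of how two ranked lists can be fused. It is an illustration only: the Picosearch backend's actual fusion code is not shown in this notebook. The function name `reciprocal_rank_fusion`, the image IDs, and the two input lists are hypothetical, and `k = 60` is just the commonly used default constant.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked ID lists into one list sorted by RRF score.

    Each item earns 1 / (k + rank) from every list it appears in,
    where rank starts at 1 for the top result of that list.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical result lists from a keyword search and a semantic search
keyword_hits = ["img_3", "img_1", "img_7"]
semantic_hits = ["img_1", "img_9", "img_3"]

fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

Items that appear in both lists accumulate two contributions, so they tend to outrank items found by only one retriever. RRF uses ranks only, not the underlying scores, which is one reason a reranker that reads the query and description together may still add precision on top of it.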

In [12]:
import os
from dotenv import load_dotenv

load_dotenv("../../.env")
print("OpenAI key loaded:", bool(os.getenv("OPENAI_API_KEY")))

OpenAI key loaded: True


In [13]:
# Test dataset - queries with ground truth answers
EVAL_DATASET = [
    {
        "question": "sunset over water",
        "ground_truth": "Images showing sunsets with water, ocean, lake, or sea. Warm orange and pink colors in the sky reflected on water."
    },
    {
        "question": "portrait of woman",
        "ground_truth": "Portrait photographs of women, headshots or upper body, with focus on the face."
    },
    {
        "question": "running dog",
        "ground_truth": "Action shots of dogs running, playing, or in motion outdoors."
    },
    {
        "question": "bowl of fruit",
        "ground_truth": "Still life images of fruit in bowls or arranged on tables."
    },
    {
        "question": "cozy autumn",
        "ground_truth": "Fall imagery with warm colors - oranges, reds, browns. Cozy atmosphere, leaves, sweaters, warm drinks."
    },
    {
        "question": "strong contrast",
        "ground_truth": "High contrast images with dramatic lighting, deep shadows, bright highlights. Black and white or bold tonal range."
    },
]

print(f"Loaded {len(EVAL_DATASET)} test cases")

Loaded 6 test cases


In [14]:
import httpx
import time

API_URL = "http://localhost:8000"

def search(query: str, mode: str = "hybrid", limit: int = 5, rerank: bool = True) -> list[dict]:
    """Call Picosearch API."""
    resp = httpx.post(
        f"{API_URL}/search",
        json={"query": query, "mode": mode, "limit": limit, "rerank": rerank},
        timeout=60.0
    )
    resp.raise_for_status()
    return resp.json()["results"]

# Test connection
try:
    health = httpx.get(f"{API_URL}/health").json()
    print("API connected:", health)
except Exception as e:
    print(f"API not running: {e}")
    print("Start the backend: cd backend && uv run uvicorn main:app")

API connected: {'status': 'ok'}


In [15]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

ANSWER_PROMPT = """Based on the following image descriptions from a search, answer the user's query.
Describe what images were found and how well they match the query.

Query: {query}

Retrieved image descriptions:
{contexts}

Answer (2-3 sentences):"""

def generate_answer(query: str, contexts: list[str]) -> str:
    """Generate answer from retrieved contexts."""
    context_str = "\n".join(f"- {c}" for c in contexts)
    prompt = ANSWER_PROMPT.format(query=query, contexts=context_str)
    return llm.invoke(prompt).content

In [None]:
# BASELINE: Hybrid WITHOUT reranking

baseline_results = []

for row in EVAL_DATASET:
    print(f"Processing: {row['question']}")

    results = search(row["question"], mode="hybrid", limit=5, rerank=False)
    contexts = [r["description"] for r in results if r.get("description")]
    answer = generate_answer(row["question"], contexts) if contexts else "No results found."

    baseline_results.append({
        "user_input": row["question"],
        "retrieved_contexts": contexts,
        "response": answer,
        "reference": row["ground_truth"],
    })
    time.sleep(1)

print(f"\nCollected {len(baseline_results)} baseline results")

Processing: sunset over water
Processing: portrait of woman
Processing: running dog
Processing: bowl of fruit
Processing: cozy autumn
Processing: strong contrast

Collected 6 baseline results


In [None]:
# IMPROVED: Hybrid WITH Cohere reranking

rerank_results = []

for row in EVAL_DATASET:
    print(f"Processing: {row['question']}")

    results = search(row["question"], mode="hybrid", limit=5, rerank=True)
    contexts = [r["description"] for r in results if r.get("description")]
    answer = generate_answer(row["question"], contexts) if contexts else "No results found."

    rerank_results.append({
        "user_input": row["question"],
        "retrieved_contexts": contexts,
        "response": answer,
        "reference": row["ground_truth"],
    })
    time.sleep(1)

print(f"\nCollected {len(rerank_results)} reranked results")

Processing: sunset over water
Processing: portrait of woman
Processing: running dog
Processing: bowl of fruit
Processing: cozy autumn
Processing: strong contrast

Collected 6 reranked results


In [18]:
from ragas import EvaluationDataset, evaluate, SingleTurnSample
from ragas.metrics import LLMContextRecall, Faithfulness, ResponseRelevancy, LLMContextPrecisionWithoutReference
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))

metrics = [
    LLMContextPrecisionWithoutReference(llm=evaluator_llm),
    LLMContextRecall(llm=evaluator_llm),
    Faithfulness(llm=evaluator_llm),
    ResponseRelevancy(llm=evaluator_llm),
]


In [19]:
# Evaluate BASELINE (no rerank)
baseline_samples = [
    SingleTurnSample(
        user_input=r["user_input"],
        retrieved_contexts=r["retrieved_contexts"],
        response=r["response"],
        reference=r["reference"],
    )
    for r in baseline_results
]

baseline_dataset = EvaluationDataset(samples=baseline_samples)
baseline_eval = evaluate(dataset=baseline_dataset, metrics=metrics)
baseline_eval

Evaluating:  38%|███▊      | 9/24 [00:06<00:08,  1.83it/s]LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.
Evaluating:  42%|████▏     | 10/24 [00:07<00:08,  1.64it/s]LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.
Evaluating: 100%|██████████| 24/24 [00:17<00:00,  1.40it/s]


{'llm_context_precision_without_reference': 0.3250, 'context_recall': 0.1667, 'faithfulness': 0.5661, 'answer_relevancy': 0.2822}

In [20]:
# Evaluate IMPROVED (with rerank)
rerank_samples = [
    SingleTurnSample(
        user_input=r["user_input"],
        retrieved_contexts=r["retrieved_contexts"],
        response=r["response"],
        reference=r["reference"],
    )
    for r in rerank_results
]

rerank_dataset = EvaluationDataset(samples=rerank_samples)
rerank_eval = evaluate(dataset=rerank_dataset, metrics=metrics)
rerank_eval

Evaluating: 100%|██████████| 24/24 [00:16<00:00,  1.42it/s]


{'llm_context_precision_without_reference': 0.4500, 'context_recall': 0.4167, 'faithfulness': 0.5736, 'answer_relevancy': 0.1350}

## Results / Reflection

| Metric | Hybrid (RRF only) | Hybrid + Rerank | Delta |
|--------|-------------------|-----------------|-------|
| Context Precision | 0.325 | 0.450 | +0.125 |
| Context Recall | 0.167 | 0.417 | +0.250 |
| Faithfulness | 0.566 | 0.574 | +0.008 |
| Response Relevancy | 0.282 | 0.135 | -0.147 |

#### What conclusions can you draw about the performance and effectiveness of your original pipeline?
The initial approach (hybrid with RRF) has fairly low context precision (0.325) and context recall (0.167). This indicates that the retrieved images often don't match the user's query well, and that relevant images may be missing from the results. Faithfulness (0.566) is moderate, which is expected since the generated answers are based directly on retrieved descriptions. The low recall suggests that hybrid search with RRF alone isn't surfacing all relevant results.
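
Context precision is rank-aware, so the order of the retrieved items matters as well as their relevance. The sketch below assumes the formula described in the RAGAS documentation (the mean of precision@k over the ranks that hold a relevant item) and assumes binary relevance verdicts; in RAGAS itself those verdicts come from the evaluator LLM. The function name `context_precision` and both verdict lists are hypothetical and are not taken from this evaluation's data.

```python
def context_precision(relevance: list[int]) -> float:
    """Rank-aware precision over a ranked list of 0/1 relevance verdicts.

    For every rank k that holds a relevant item, add precision@k
    (relevant items seen so far / k), then divide by the number of
    relevant items. Returns 0.0 when nothing is relevant.
    """
    hits = 0
    total = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0


# Hypothetical verdicts: the same two relevant items at different ranks
relevant_low = [0, 0, 1, 1, 0]   # relevant items at ranks 3 and 4
relevant_top = [1, 1, 0, 0, 0]   # relevant items at ranks 1 and 2

print(context_precision(relevant_low))
print(context_precision(relevant_top))
```

Both lists contain the same two relevant items, yet the second scores higher because those items sit at the top. This is the kind of change a reranker is meant to produce.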

#### How does the performance compare to your original RAG application? 
Context precision improved by ~38% (0.325 to 0.450) and context recall more than doubled (0.167 to 0.417)! This supports the idea that reranking re-orders the retrieved results so that the more relevant ones rise to the top. The metrics overall are still low, but I believe some of that is due to the small test set (six queries) and my artificial "answer generation" step, since my actual app returns images to the user, not text. I believe this may also be the cause of the drop in response relevancy with reranking (0.282 to 0.135). Overall, Cohere reranking improved retrieval quality significantly on this test set, and I will keep this step in my final application.
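
The deltas and relative changes discussed above can be recomputed from the two score dictionaries printed by the evaluation cells. This is a small sketch; the helper name `compare` and the variable names are my own, and the numbers are copied from the outputs above.

```python
# Aggregate scores copied from the two evaluate() outputs in this notebook
baseline = {
    "llm_context_precision_without_reference": 0.3250,
    "context_recall": 0.1667,
    "faithfulness": 0.5661,
    "answer_relevancy": 0.2822,
}
reranked = {
    "llm_context_precision_without_reference": 0.4500,
    "context_recall": 0.4167,
    "faithfulness": 0.5736,
    "answer_relevancy": 0.1350,
}


def compare(before: dict[str, float], after: dict[str, float]) -> dict[str, dict[str, float]]:
    """Return the absolute and relative change for every metric in `before`."""
    rows = {}
    for name, old in before.items():
        new = after[name]
        rows[name] = {"delta": new - old, "relative": (new - old) / old}
    return rows


comparison = compare(baseline, reranked)
for name, row in comparison.items():
    print(f"{name:42s} {row['delta']:+.3f} ({row['relative']:+.1%})")
```

Reading the relative column next to the absolute one makes the trade-off easier to see: context precision rises by about 38%, while answer relevancy falls by roughly half.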