### LAB | Relevance Scoring and Rerankers for Trustworthy AI & Repository AI Literacy
Dina Bosma-Buczynska

**Step 1: Setup and Data Preparation**

In [None]:
# Install required libraries
!pip install openai pinecone cohere python-dotenv langchain langchain-openai langchain-community langchain-text-splitters tiktoken pypdf requests

In [2]:
# Load API keys and set up clients
import os
from dotenv import load_dotenv
from openai import OpenAI
import cohere
from pinecone import Pinecone, ServerlessSpec

# Load environment variables from .env file
load_dotenv()

# Set up clients
openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
cohere_client = cohere.Client(os.environ["COHERE_API_KEY"])
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

print("OpenAI client ready:", openai_client is not None)
print("Cohere client ready:", cohere_client is not None)
print("Pinecone client ready:", pc is not None)

OpenAI client ready: True
Cohere client ready: True
Pinecone client ready: True


Load Documents

In [3]:
from langchain_community.document_loaders import PyPDFLoader

pdf_loader = PyPDFLoader("Living_Repository_AI_Literacy_Practices_Update_16042025_UqmogIt2HpLVokdcuzJL4mDvHk8_112203.pdf")
pdf_documents = pdf_loader.load()

print(len(pdf_documents))


73


In [4]:
from langchain_community.document_loaders import TextLoader

text_loader = TextLoader("chunked_transcription.txt")
podcast_documents = text_loader.load()

print(len(podcast_documents))


1


Add Metadata

In [5]:
for doc in pdf_documents:
    doc.metadata["source"] = "eu_ai_act"
    doc.metadata["document_type"] = "legal"


In [6]:
for doc in podcast_documents:
    doc.metadata["source"] = "podcast"
    doc.metadata["document_type"] = "transcript"


*Chunk Documents:*

Text Splitter

In [7]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)


Split documents

In [8]:
pdf_chunks = text_splitter.split_documents(pdf_documents)
podcast_chunks = text_splitter.split_documents(podcast_documents)

all_chunks = pdf_chunks + podcast_chunks

print("Total chunks:", len(all_chunks))


Total chunks: 320


Generate embeddings

In [9]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()


**Step 2: Generate Embeddings and Initial Retrieval**

Create the Pinecone Index

In [10]:
import time

index_name = "trustworthy-ai-rag"

# Create index if it doesn't exist yet
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )
    # Wait until index is ready before proceeding
    while not pc.describe_index(index_name).status["ready"]:
        time.sleep(1)

index = pc.Index(index_name)
print("Index ready")
print(index.describe_index_stats())



Index ready
{'_response_info': {'raw_headers': {'connection': 'keep-alive',
                                    'content-length': '186',
                                    'content-type': 'application/json',
                                    'date': 'Thu, 19 Feb 2026 16:50:27 GMT',
                                    'grpc-status': '0',
                                    'server': 'envoy',
                                    'x-envoy-upstream-service-time': '60',
                                    'x-pinecone-request-latency-ms': '60',
                                    'x-pinecone-response-duration-ms': '62'}},
 'dimension': 1536,
 'index_fullness': 0.0,
 'memoryFullness': 0.0,
 'metric': 'cosine',
 'namespaces': {'__default__': {'vector_count': 960}},
 'storageFullness': 0.0,
 'total_vector_count': 960,
 'vector_type': 'dense'}


Embed All Chunks and Upload to Pinecone

In [11]:
import hashlib

def get_embedding(text):
    response = openai_client.embeddings.create(
        input=text,
        model="text-embedding-ada-002"
    )
    return response.data[0].embedding

# Clear the index first so re-running this cell never creates duplicates
index.delete(delete_all=True)
print("Index cleared.")

# Build vectors with a stable content-based ID to prevent duplicates
vectors = []
for i, chunk in enumerate(all_chunks):
    embedding = get_embedding(chunk.page_content)
    chunk_id = hashlib.md5(chunk.page_content.encode()).hexdigest()
    vectors.append({
        "id": chunk_id,
        "values": embedding,
        "metadata": {
            "text": chunk.page_content,
            "source": chunk.metadata.get("source", ""),
            "document_type": chunk.metadata.get("document_type", ""),
            "page": chunk.metadata.get("page", 0)
        }
    })
    if (i + 1) % 20 == 0:
        print(f"Embedded {i+1}/{len(all_chunks)} chunks...")

# Upsert to Pinecone in batches of 100
batch_size = 100
for i in range(0, len(vectors), batch_size):
    index.upsert(vectors=vectors[i:i+batch_size])
    print(f"Upserted batch {i//batch_size + 1}/{-(-len(vectors)//batch_size)}")

print("\nDone! Index stats:")
print(index.describe_index_stats())

Index cleared.
Embedded 20/320 chunks...
Embedded 40/320 chunks...
Embedded 60/320 chunks...
Embedded 80/320 chunks...
Embedded 100/320 chunks...
Embedded 120/320 chunks...
Embedded 140/320 chunks...
Embedded 160/320 chunks...
Embedded 180/320 chunks...
Embedded 200/320 chunks...
Embedded 220/320 chunks...
Embedded 240/320 chunks...
Embedded 260/320 chunks...
Embedded 280/320 chunks...
Embedded 300/320 chunks...
Embedded 320/320 chunks...
Upserted batch 1/4
Upserted batch 2/4
Upserted batch 3/4
Upserted batch 4/4

Done! Index stats:
{'_response_info': {'raw_headers': {'connection': 'keep-alive',
                                    'content-length': '186',
                                    'content-type': 'application/json',
                                    'date': 'Thu, 19 Feb 2026 16:51:25 GMT',
                                    'grpc-status': '0',
                                    'server': 'envoy',
                                    'x-envoy-upstream-service-time': '41',
 

Baseline retrieval

In [12]:
def retrieve(query, top_k=10):
    query_embedding = get_embedding(query)
    results = index.query(
        vector=query_embedding,
        top_k=top_k,
        include_metadata=True
    )
    return results["matches"]

# Test baseline retrieval
query = "What are the requirements for trustworthy AI?"
baseline_results = retrieve(query, top_k=10)

print(f"Retrieved {len(baseline_results)} results for: '{query}'\n")
for i, match in enumerate(baseline_results[:5]):
    print(f"Result {i+1} | Score: {match['score']:.4f} | Source: {match['metadata']['source']}")
    print(f"  {match['metadata']['text'][:200]}...")
    print()

Retrieved 10 results for: 'What are the requirements for trustworthy AI?'

Result 1 | Score: 0.8561 | Source: eu_ai_act
  for overseeing the ethical and responsible use of AI across all projects , ensuring compliance with 
regulatory requirements, and mitigating potential risks....

Result 2 | Score: 0.8822 | Source: podcast
  me. It says, trust is something we give to people. We trust people. We rely on machines. That's the crux of it, isn't it? If an AI is a black box that we can't explain, we can't truly trust it, we're ...

Result 3 | Score: 0.8533 | Source: podcast
  to make it concrete, the experts derive four ethical principles from those rights. These are the non-negotiables. Okay, let's run through them. Principle one. Respect for human autonomy. This means hu...

Result 4 | Score: 0.8605 | Source: eu_ai_act
  use of the system(s). 
The SAS course on “ Responsible Innovation and Trustworthy AI ” is designed for anyone who wants to 
gain a deeper understanding about the importa

**Step 3: Implement Relevance Scoring via an LLM**

Use GPT to score each retrieved chunk's relevance to the query, combine with the similarity score, then reorder results.



Score Each Chunk With GPT

In [13]:
import json as json_lib

def llm_score_chunk(query, chunk_text):
    """Ask GPT to rate how relevant a chunk is to the query (0-10)."""
    prompt = (
        f"Rate the relevance of the following text to the query on a scale of 0-10.\n"
        f"Return ONLY a JSON object with a single key 'score' and an integer value.\n\n"
        f"Query: {query}\n"
        f"Text: {chunk_text[:500]}\n\n"
        f"Response (JSON only):"
    )
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )
    try:
        result = json_lib.loads(response.choices[0].message.content)
        return result["score"] / 10  # normalise to 0-1
    except Exception:
        return 0.0

def llm_relevance_rerank(query, matches, top_n=5):
    """Score each chunk with LLM, combine with similarity score, reorder."""
    scored = []
    for m in matches:
        llm_score = llm_score_chunk(query, m["metadata"]["text"])
        combined = 0.5 * m["score"] + 0.5 * llm_score
        scored.append({
            "text": m["metadata"]["text"],
            "source": m["metadata"]["source"],
            "similarity_score": m["score"],
            "llm_score": llm_score,
            "combined_score": combined
        })
    scored.sort(key=lambda x: x["combined_score"], reverse=True)
    return scored[:top_n]

# Test LLM relevance scoring
query = "What are the requirements for trustworthy AI?"
print("Scoring chunks with LLM... (this may take a moment)")
llm_reranked = llm_relevance_rerank(query, baseline_results, top_n=5)

print(f"\n{'='*60}")
print("LLM RELEVANCE SCORING (top 5)")
print("="*60)
for i, r in enumerate(llm_reranked):
    print(f"{i+1}. Combined: {r['combined_score']:.4f} | Similarity: {r['similarity_score']:.4f} | LLM: {r['llm_score']:.4f}")
    print(f"   Source: {r['source']}")
    print(f"   {r['text'][:200]}...")
    print()

Scoring chunks with LLM... (this may take a moment)

LLM RELEVANCE SCORING (top 5)
1. Combined: 0.7767 | Similarity: 0.8533 | LLM: 0.7000
   Source: podcast
   to make it concrete, the experts derive four ethical principles from those rights. These are the non-negotiables. Okay, let's run through them. Principle one. Respect for human autonomy. This means hu...

2. Combined: 0.7281 | Similarity: 0.8561 | LLM: 0.6000
   Source: eu_ai_act
   for overseeing the ethical and responsible use of AI across all projects , ensuring compliance with 
regulatory requirements, and mitigating potential risks....

3. Combined: 0.6803 | Similarity: 0.8606 | LLM: 0.5000
   Source: podcast
   those details with a frighteningly high accuracy. So even if I keep my data private, the AI just guesses it anyway. And once it guesses, it treats that inference as a fact. The guidelines are clear th...

4. Combined: 0.6802 | Similarity: 0.8605 | LLM: 0.5000
   Source: eu_ai_act
   use of the system(s). 
The SAS co
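
The combined score in `llm_relevance_rerank` averages two quantities on very different scales: cosine similarities that all sit near 0.85 to 0.86 here, and LLM scores that range over 0 to 1. With the values printed above, the similarity term adds roughly 0.43 to every chunk, so the ordering is decided almost entirely by the LLM score. A minimal sketch of one alternative, min-max normalising the similarities before combining, is shown below. `min_max` and `combine` are hypothetical helpers, not part of the notebook, and the numbers are the four rows visible in the output.

```python
def min_max(values):
    """Rescale values to [0, 1]; return 0.5 for all if they are identical."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def combine(similarities, llm_scores, weight=0.5, normalise=True):
    """Weighted sum of similarity and LLM score, one value per chunk."""
    sims = min_max(similarities) if normalise else list(similarities)
    return [weight * s + (1 - weight) * l for s, l in zip(sims, llm_scores)]


# Similarity and LLM scores of the four chunks printed in the Step 3 output.
similarities = [0.8533, 0.8561, 0.8606, 0.8605]
llm_scores = [0.7, 0.6, 0.5, 0.5]

raw = combine(similarities, llm_scores, normalise=False)
scaled = combine(similarities, llm_scores, normalise=True)

print([round(x, 4) for x in raw])     # close to the notebook's combined scores
print([round(x, 4) for x in scaled])  # similarity now influences the ordering
```

With these four values the normalised version ranks the third chunk first instead of the first one. That shows how sensitive the final order is to the weighting scheme; it does not show that either order is the correct one.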

**Step 4: Implement Reranker (Cohere or Cross-Encoder) (Optional - Advanced)**

Use Cohere's dedicated reranker model to re-score the top-10 candidates and compare with the LLM scoring approach above.

In [14]:
def rerank(query, matches, top_n=5):
    texts = [m["metadata"]["text"] for m in matches]
    reranked = cohere_client.rerank(
        query=query,
        documents=texts,
        top_n=top_n,
        model="rerank-v3.5"
    )
    results = []
    for item in reranked.results:
        results.append({
            "text": matches[item.index]["metadata"]["text"],
            "source": matches[item.index]["metadata"]["source"],
            "original_score": matches[item.index]["score"],
            "rerank_score": item.relevance_score
        })
    return results

query = "What are the requirements for trustworthy AI?"
reranked_results = rerank(query, baseline_results, top_n=5)

print("=" * 60)
print("COHERE RERANKER (top 5)")
print("=" * 60)
for i, r in enumerate(reranked_results):
    print(f"{i+1}. Rerank: {r['rerank_score']:.4f} | Original: {r['original_score']:.4f} | Source: {r['source']}")
    print(f"   {r['text'][:200]}...")
    print()

# Quick comparison summary
print("=" * 60)
print("COMPARISON: Baseline vs LLM Scoring vs Cohere Reranker")
print("=" * 60)
print("\nBaseline top-5 sources:", [m['metadata']['source'] for m in baseline_results[:5]])
print("LLM-scored top-5 sources:", [r['source'] for r in llm_reranked])
print("Cohere top-5 sources:", [r['source'] for r in reranked_results])

COHERE RERANKER (top 5)
1. Rerank: 0.8516 | Original: 0.8895 | Source: podcast
   this framework. It was the moment the conversation shifted from, you know, how do we make AI powerful to how do we make AI worthy of trust? And our mission for this deep dive is to figure out how to o...

2. Rerank: 0.7708 | Original: 0.8527 | Source: podcast
   these things are the first line of defense. They have to be. If an engineer sees that a system is behaving unethically, they need a safe channel to report it without getting fired. NGOs and trade unio...

3. Rerank: 0.6353 | Original: 0.8822 | Source: podcast
   me. It says, trust is something we give to people. We trust people. We rely on machines. That's the crux of it, isn't it? If an AI is a black box that we can't explain, we can't truly trust it, we're ...

4. Rerank: 0.6013 | Original: 0.8640 | Source: podcast
   phase. Today we are digging into the blueprint for that trust. We're unpacking a document that is arguably the Magna Carta for et

**Step 5: Add Metadata Filtering**

Filter results by source type so you can query each document collection independently.

In [15]:
def retrieve_with_filter(query, filter_dict, top_k=5):
    """Retrieve from Pinecone with a metadata filter."""
    query_embedding = get_embedding(query)
    results = index.query(
        vector=query_embedding,
        top_k=top_k,
        include_metadata=True,
        filter=filter_dict
    )
    return results["matches"]

query = "What are the requirements for trustworthy AI?"

print("=" * 60)
print("EU AI ACT ONLY (source = eu_ai_act)")
print("=" * 60)
legal_results = retrieve_with_filter(query, {"source": {"$eq": "eu_ai_act"}}, top_k=5)
for i, m in enumerate(legal_results):
    print(f"{i+1}. Score: {m['score']:.4f} | {m['metadata']['text'][:180]}...")
    print()

print("=" * 60)
print("PODCAST ONLY (source = podcast)")
print("=" * 60)
podcast_results = retrieve_with_filter(query, {"source": {"$eq": "podcast"}}, top_k=5)
for i, m in enumerate(podcast_results):
    print(f"{i+1}. Score: {m['score']:.4f} | {m['metadata']['text'][:180]}...")
    print()

EU AI ACT ONLY (source = eu_ai_act)
1. Score: 0.8605 | use of the system(s). 
The SAS course on “ Responsible Innovation and Trustworthy AI ” is designed for anyone who wants to 
gain a deeper understanding about the importance of trus...

2. Score: 0.8561 | for overseeing the ethical and responsible use of AI across all projects , ensuring compliance with 
regulatory requirements, and mitigating potential risks....

3. Score: 0.8500 | trusted advisors for our Compliance team. 
Also, the AI Risk Assessment tool operates autonomously to identify the risk levels of AI systems, allowing 
trained personnel to perform...

4. Score: 0.8495 | ethical, and reputational risks associated with AI deployment. 
The tool supports a range of industries, including healthcare, finance, and employment, enabling 
organisations to a...

5. Score: 0.8494 | and racial disparities in automated speech recognition). The course was also tested for accessibili ty. 
How does the practice take into account  the te
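
Pinecone filters can also combine conditions. Below is a minimal sketch of a helper that builds a filter dictionary for `retrieve_with_filter`, assuming the metadata keys `source` and `document_type` set earlier in this notebook. `build_filter` is a hypothetical name; `$eq`, `$in` and `$and` are operators from Pinecone's metadata filter syntax.

```python
def build_filter(source=None, document_types=None):
    """Build a Pinecone metadata filter, or None when nothing is requested."""
    clauses = []
    if source is not None:
        clauses.append({"source": {"$eq": source}})
    if document_types:
        clauses.append({"document_type": {"$in": list(document_types)}})

    if not clauses:
        return None          # no filter: search the whole index
    if len(clauses) == 1:
        return clauses[0]    # a single condition needs no $and wrapper
    return {"$and": clauses}


podcast_only = build_filter(source="podcast")
legal_pdf = build_filter(source="eu_ai_act", document_types=["legal"])
print(podcast_only)
print(legal_pdf)
```

Returning `None` for an empty filter matches the `filter_dict=None` default that `rag_pipeline` in Step 6 already treats as "no filter".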

**Step 6: Complete RAG Pipeline with Reranking**

Integrate retrieval + reranking into a single pipeline. The `use_reranker` flag is the only difference between the two versions: it switches a single step of the pipeline in or out.

In [16]:
def rag_pipeline(query, use_reranker=True, filter_dict=None, top_k=10, top_n=5):
    """Complete RAG pipeline: retrieve → (optional rerank) → generate."""

    # 1. Retrieve
    if filter_dict:
        matches = retrieve_with_filter(query, filter_dict, top_k=top_k)
    else:
        matches = retrieve(query, top_k=top_k)

    # 2. Rerank (the 1-line change)
    if use_reranker and matches:
        ranked = rerank(query, matches, top_n=top_n)      # ← swap in/out
        context_chunks = [r["text"] for r in ranked]
    else:
        context_chunks = [m["metadata"]["text"] for m in matches[:top_n]]

    # 3. Generate answer
    context = "\n\n---\n\n".join(context_chunks)
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "You are an expert on AI regulation and trustworthy AI. "
                "Answer questions based only on the provided context."
            )},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}
        ],
        temperature=0
    )
    return response.choices[0].message.content

query = "What are the key principles of trustworthy AI according to the EU AI Act?"

print("=== WITHOUT RERANKER ===")
print(rag_pipeline(query, use_reranker=False))

print("\n=== WITH COHERE RERANKER ===")
print(rag_pipeline(query, use_reranker=True))

=== WITHOUT RERANKER ===
The key principles of trustworthy AI according to the EU AI Act, as derived from the ethics guidelines for trustworthy AI, include:

1. **Respect for Human Autonomy**: This principle emphasizes the importance of allowing individuals to make their own choices and decisions, ensuring that AI systems support rather than undermine human agency.

2. **Prevention of Harm**: AI systems should be designed to avoid causing harm to individuals or society, addressing both intentional and unintentional harm.

3. **Fairness**: AI must be developed and deployed in a manner that is fair, avoiding discrimination and ensuring equitable treatment for all users.

4. **Accountability**: There should be clear accountability for AI systems, meaning that the people and organizations behind the AI must be responsible for its outcomes and decisions.

These principles are foundational to ensuring that AI systems are lawful, ethical, and robust throughout their lifecycle.

=== WITH COHER

**Step 7: Evaluate Performance**

Manually compare retrieval quality before and after reranking across multiple queries. Mark each answer correct/incorrect in the output.

In [17]:
test_queries = [
    "What obligations do providers of high-risk AI systems have?",
    "How does the EU AI Act define prohibited AI practices?",
    "What does trustworthy AI mean according to the podcast?",
    "What are the transparency requirements for AI systems?"
]

print("RETRIEVAL EVALUATION: Baseline vs Cohere Reranker")
print("=" * 70)

for query in test_queries:
    print(f"\nQuery: {query}")
    print("-" * 70)

    # Baseline top-3
    baseline = retrieve(query, top_k=10)
    print("BASELINE TOP 3:")
    for i, m in enumerate(baseline[:3]):
        print(f"  {i+1}. [{m['metadata']['source']}] score={m['score']:.4f}")
        print(f"     {m['metadata']['text'][:120]}...")

    # Reranked top-3
    reranked = rerank(query, baseline, top_n=3)
    print("RERANKED TOP 3:")
    for i, r in enumerate(reranked):
        print(f"  {i+1}. [{r['source']}] rerank={r['rerank_score']:.4f} (was {r['original_score']:.4f})")
        print(f"     {r['text'][:120]}...")
    print()

RETRIEVAL EVALUATION: Baseline vs Cohere Reranker

Query: What obligations do providers of high-risk AI systems have?
----------------------------------------------------------------------
BASELINE TOP 3:
  1. [eu_ai_act] score=0.8645
     will be provided to our larger clients under commercial agreements to help them make continuing AI risk 
and compliance ...
  2. [eu_ai_act] score=0.8504
     trusted advisors for our Compliance team. 
Also, the AI Risk Assessment tool operates autonomously to identify the risk ...
  3. [eu_ai_act] score=0.8457
     for overseeing the ethical and responsible use of AI across all projects , ensuring compliance with 
regulatory requirem...
RERANKED TOP 3:
  1. [eu_ai_act] rerank=0.7815 (was 0.8420)
     on the level of risk they pose to health, safety, and fundamental rights. High-risk AI systems are subject 
to stringent...
  2. [eu_ai_act] rerank=0.6131 (was 0.8645)
     will be provided to our larger clients under commercial agreements to help them 

**Reflection**

After manually reviewing the outputs above:

- **Baseline** retrieves by cosine similarity — fast but sometimes returns topically adjacent chunks that miss the point.
- **LLM Relevance Scoring** (Step 3) gives a nuanced judgment per chunk but is slow and expensive (one API call per chunk).
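- **Cohere Reranker** (Step 4) is the best trade-off here: it re-scores all candidates with a cross-encoder model in a single API call. In the four-query evaluation in the Comparison Report it improved the top result for one query, improved the lower ranks for another, and made none worse.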

**When reranking helps most:** queries requiring precise legal interpretation (EU AI Act), questions whose keywords appear in many chunks, and mixed-corpus queries where source type matters.

**Recommendation:** Use Cohere reranking in production RAG pipelines. Reserve LLM scoring for cases where you need explanations alongside scores.

---
#### Comparison Report



##### 1. Retrieval Results: Before vs After Reranking

**Query:** *"What are the requirements for trustworthy AI?"*

| Rank | Baseline — cosine similarity | Score | Cohere Reranked (`rerank-v3.5`) | Rerank score | Original score |
|------|------------------------------|-------|---------------------------------|--------------|----------------|
| 1 | eu_ai_act — *"overseeing the ethical and responsible use of AI across all projects, ensuring compliance with regulatory requirements..."* | 0.8561 | podcast — *"the moment the conversation shifted from how do we make AI powerful to how do we make AI worthy of trust..."* | 0.8516 | 0.8895 |
| 2 | podcast — *"trust is something we give to people. We trust people. We rely on machines. If an AI is a black box we can't truly trust it..."* | 0.8822 | podcast — *"these things are the first line of defense. If an engineer sees that a system is behaving unethically, they need a safe channel to report it..."* | 0.7708 | 0.8527 |
| 3 | podcast — *"the experts derive four ethical principles from those rights: Respect for human autonomy, Prevention of harm, Fairness, Explicability..."* | 0.8533 | podcast — *"trust is something we give to people. We trust people. We rely on machines..."* | 0.6353 | 0.8822 |
| 4 | eu_ai_act — *"The SAS course on Responsible Innovation and Trustworthy AI is designed for anyone who wants to gain a deeper understanding..."* | 0.8605 | podcast — *"Today we are digging into the blueprint for that trust. The ethics guidelines for trustworthy AI..."* | 0.6013 | 0.8640 |
| 5 | podcast — *"Today we are digging into the blueprint for that trust. The ethics guidelines for trustworthy AI — the Magna Carta for ethical computing..."* | 0.8640 | eu_ai_act — *"The SAS course on Responsible Innovation and Trustworthy AI..."* | 0.5909 | 0.8605 |

**Observations:**

- **No duplicates:** After fixing vector IDs to use content hashes, all 320 unique chunks were stored correctly and every result is distinct.
- **Source shift:** Baseline returned a mixed set [eu_ai_act, podcast, podcast, eu_ai_act, podcast]. The Cohere reranker shifted the top 4 entirely to podcast [podcast, podcast, podcast, podcast, eu_ai_act], which suggests it treats the podcast's conversational language as a closer match to this query.
- **New chunk promoted:** The reranker's rank 2 result (*"first line of defense / safe reporting channel"*) did not appear in the baseline top 5 at all (original score 0.8527). This is reranking doing its job — surfacing a highly relevant chunk that cosine similarity buried.
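- **eu_ai_act demoted:** The two eu_ai_act chunks scored about 0.86 on cosine (0.8561 and 0.8605), but Cohere gave the course chunk only 0.59 and left the job-role chunk out of its top 5. While they are semantically similar to the query words, they don't actually explain *requirements for trustworthy AI*: they describe a training course and a job role. The reranker correctly deprioritises them.
- **LLM scoring disagreement with Cohere:** LLM scoring ranked the *"four ethical principles"* chunk #1 (combined 0.777), but that chunk (cosine 0.8533) does not appear in Cohere's top 5 at all; Cohere's #3 (0.635) is the *"trust is something we give to people"* chunk. LLM scoring gave 0.0 to the chunk Cohere ranked #1. Because `llm_score_chunk` also returns 0.0 whenever the model's reply fails to parse as JSON, that 0.0 may reflect a parsing failure rather than a relevance judgment. Either way, the two approaches appear to use different definitions of relevance (the LLM seems to favour information density; Cohere, query-answer fit).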
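
The promotions and demotions described in the observations above can be read off mechanically by comparing the two orderings. The sketch below assumes each chunk has a stable identifier. The short labels are shorthand for the chunks in the table, and `rank_changes` is a hypothetical helper, not part of the notebook.

```python
def rank_changes(baseline_ids, reranked_ids):
    """Map each reranked id to (baseline rank or None, reranked rank), 1-based."""
    baseline_rank = {cid: i for i, cid in enumerate(baseline_ids, start=1)}
    return {
        cid: (baseline_rank.get(cid), new_rank)
        for new_rank, cid in enumerate(reranked_ids, start=1)
    }


# Top-5 orderings from the table in section 1 (labels are shorthand).
baseline_top5 = ["job role", "trust people", "four principles", "SAS course", "blueprint"]
cohere_top5 = ["worthy of trust", "reporting channel", "trust people", "blueprint", "SAS course"]

changes = rank_changes(baseline_top5, cohere_top5)
for chunk, (old, new) in changes.items():
    origin = "outside baseline top 5" if old is None else f"baseline #{old}"
    print(f"Cohere #{new}: {chunk} ({origin})")
```

Chunks with `None` as their old rank came from positions 6 to 10 of the candidate list, which is where reranking adds information that the baseline top 5 does not show.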


##### 2. Performance Metrics

Evaluation method: **manual relevance judgement** — does the #1 retrieved chunk actually answer the query? ✅ Relevant / ❌ Not Relevant / ⚠️ Partially relevant.

| Query | Baseline #1 result | Relevant? | Reranked #1 result | Relevant? | Improvement? |
|-------|-------------------|-----------|-------------------|-----------|--------------|
| What obligations do providers of high-risk AI systems have? | eu_ai_act · 0.8645 · *"AI risk and compliance..."* (describes a commercial compliance service) | ❌ | eu_ai_act · 0.7815 · *"High-risk AI systems are subject to stringent regulatory requirements for accuracy, transparency and oversight..."* | ✅ | **Yes — reranker promoted the right chunk from rank ~7** |
| How does the EU AI Act define prohibited AI practices? | eu_ai_act · 0.8600 · *"interactive whiteboard featuring legal provisions..."* (describes a teaching tool, not the provisions themselves) | ❌ | eu_ai_act · 0.6915 · *"AI literacy approach, partially rolled-out..."* (still meta-level, not Article 5 content) | ❌ | No — source document lacks the actual prohibited practices text |
| What does trustworthy AI mean according to the podcast? | podcast · 0.8746 · *"trust is something we give to people. We trust people. We rely on machines..."* | ✅ | podcast · 0.8313 · *(same chunk, confirmed correct)* | ✅ | Neutral — both correct; reranker added a useful #2 (*"the moment the conversation shifted..."*) |
| What are the transparency requirements for AI systems? | podcast · 0.8953 · *"AI guesses private details with frightening accuracy..."* (about inference, not transparency requirements) | ❌ | podcast · 0.8519 · *(same chunk stays #1)* but eu_ai_act · 0.7859 and 0.7265 surface at #2 and #3 with more relevant legal content | ⚠️ | Partial — #1 is still wrong but #2 and #3 are now better |

**Score: Baseline 1/4 · Reranked 2/4 (+ 1 partial improvement)**
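
The score line above can be reproduced from the manual judgements in the table. The sketch below is illustrative only: the short query labels and the `hit_rate_at_1` helper are not part of the notebook, and the ⚠️ partial result is counted as not relevant, matching the 2/4 figure.

```python
# Manual judgement of the #1 result per query, copied from the table above.
judgements = {
    "provider obligations": {"baseline": False, "reranked": True},
    "prohibited practices": {"baseline": False, "reranked": False},
    "trustworthy AI (podcast)": {"baseline": True, "reranked": True},
    "transparency requirements": {"baseline": False, "reranked": False},  # partial
}


def hit_rate_at_1(judgements, system):
    """Fraction of queries whose top result was judged relevant."""
    hits = sum(1 for j in judgements.values() if j[system])
    return hits / len(judgements)


baseline_rate = hit_rate_at_1(judgements, "baseline")
reranked_rate = hit_rate_at_1(judgements, "reranked")
print(f"Baseline hit rate@1: {baseline_rate:.2f}")
print(f"Reranked hit rate@1: {reranked_rate:.2f}")
```

Four queries is a very small sample, so these rates describe this lab run rather than the reranker's behaviour in general.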

**Key insight:** The biggest win came on the *provider obligations* query: the reranker pulled up a chunk that cosine similarity had buried at rank ~7 (original score 0.8420) and made it rank 1 with a rerank score of 0.7815. That chunk directly answers the question. The baseline's top result (score 0.8645) looked similar numerically but was off-topic. This illustrates why cosine scores alone are not reliable relevance signals here: the gap between 0.84 and 0.86 did not separate relevant from irrelevant chunks, whereas the reranker's scores ranged from 0.44 to 0.78 and did.
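
To make the difference in spread concrete, the sketch below compares the range of cosine and rerank scores using the top-5 values reported in the section 1 table for the trustworthy AI query. `score_spread` is a hypothetical helper, and max minus min is only one simple way to quantify spread.

```python
def score_spread(scores):
    """Difference between the highest and lowest score in a result list."""
    return max(scores) - min(scores)


# Top-5 scores from the section 1 table (trustworthy AI query).
cosine_top5 = [0.8561, 0.8822, 0.8533, 0.8605, 0.8640]
rerank_top5 = [0.8516, 0.7708, 0.6353, 0.6013, 0.5909]

cosine_spread = score_spread(cosine_top5)
rerank_spread = score_spread(rerank_top5)
print(f"Cosine spread: {cosine_spread:.4f}")
print(f"Rerank spread: {rerank_spread:.4f}")
```

For this query the rerank scores cover a range roughly nine times wider than the cosine scores, which is what makes thresholding on them plausible.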


##### 3. Analysis: When Does Reranking Help Most?

From our experiments across two document types (the AI literacy practices PDF, labelled `eu_ai_act` in the metadata, + podcast transcript):

**Reranking helped clearly when:**
- **The query is specific and legal** (*"What obligations do providers of high-risk AI systems have?"*) — cosine similarity ranked a compliance services marketing chunk at #1 (0.8645). The reranker demoted it to #2 (0.6131) and promoted a chunk that actually describes the legal obligations (0.7815). The 0.84 vs 0.86 cosine range gave no signal; the 0.44–0.78 rerank range made the distinction obvious.
- **A relevant chunk was buried** — the reranker's rank-2 result for the trustworthy AI query (*"first line of defense / safe reporting channel"*) had cosine score 0.8527, buried below rank 5. The reranker surfaced it because it recognised it as an example of *how* trustworthy AI is enforced in practice.

**Reranking partially helped when:**
- **The query is mixed-source** (*"What are the transparency requirements for AI systems?"*) — the baseline top result was off-topic (podcast chunk about AI inferring private data). The reranker kept it at #1 but promoted two eu_ai_act chunks to #2 and #3 (rerank scores 0.79 and 0.73), which are more on-topic. The full answer becomes better even if the #1 chunk didn't change.

**Reranking could not help when:**
- **The relevant content is simply not in the corpus** (*"How does the EU AI Act define prohibited AI practices?"*) — the source PDF used is a literacy practices repository, not the actual EU AI Act legal text. Neither cosine similarity nor the reranker can retrieve content that was never ingested. This is a data preparation problem, not a retrieval problem. Lesson: **garbage in, garbage out — a reranker cannot compensate for a missing document**.

**Cohere vs LLM scoring in practice:**
The two approaches disagreed most on the *"this framework / worthy of trust"* chunk: Cohere ranked it #1 (0.8516), LLM gave it 0.0 and buried it. One reading is that the LLM penalised it because it is framing/introduction language rather than factual content, while Cohere rewarded it as the best overall match to the query. Because `llm_score_chunk` also returns 0.0 on a parsing failure, that 0.0 should be checked before drawing this conclusion. Step 6 only compared the baseline against the Cohere reranker, so the LLM-scored ranking was never tested end to end in the RAG pipeline.
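
`llm_score_chunk` returns 0.0 both when the model judges a chunk irrelevant and when its reply cannot be parsed, so the two cases look identical downstream. A minimal sketch of a stricter parser follows. `parse_llm_score` is a hypothetical helper, not part of the notebook, and it assumes the model may wrap its JSON in a Markdown code fence.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker
FENCE_PATTERN = re.compile(rf"^{FENCE}(?:json)?\s*|\s*{FENCE}$")


def parse_llm_score(raw):
    """Return a relevance score in [0, 1], or None if the reply is unusable."""
    # Assumption: the reply may be wrapped in a fenced block; strip it first.
    cleaned = FENCE_PATTERN.sub("", raw.strip())
    try:
        score = float(json.loads(cleaned)["score"])
    except (ValueError, KeyError, TypeError):
        return None  # parse failure, kept distinct from a genuine 0
    # Clamp out-of-range values before normalising to 0-1.
    return max(0.0, min(10.0, score)) / 10


print(parse_llm_score('{"score": 7}'))
print(parse_llm_score("Sorry, I cannot rate this."))
```

Returning `None` lets the caller decide whether to retry, skip the chunk, or fall back to the similarity score alone.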


##### 4. Example Queries and Answers Demonstrating Improved Quality

**Query A: "What obligations do providers of high-risk AI systems have under the EU AI Act?"**

| | Without reranker | With Cohere reranker |
|--|-----------------|----------------------|
| **Answer** | *"Providers are obligated to ensure a sufficient level of AI literacy among their staff... providers must evaluate high-risk systems and recommend mitigations."* | *"1. Compliance with regulatory requirements — high-risk AI systems must meet stringent requirements. 2. Risk Assessment. 3. Accuracy, Transparency, and Oversight. 4. AI Literacy."* |
| **Quality** | Vague — built from a training-course chunk, gives only the literacy angle | Structured and multi-faceted — built from the chunk that actually lists the obligations |
| **Verdict** | ⚠️ Partially useful | ✅ Clearly better |

**Query B: "What does the podcast say about making AI trustworthy in practice?"**

| | Without reranker | With Cohere reranker |
|--|-----------------|----------------------|
| **Answer** | Mentions 3 lifecycle components, transparency, human oversight, and a socio-technical perspective. | Same themes, with slightly sharper emphasis on the three-component framework (lawful, ethical, robust) and the socio-technical framing. |
| **Quality** | Good — both use relevant podcast chunks | Good — marginally better structured |
| **Verdict** | ✅ Correct | ✅ Correct, slightly cleaner |

**Overall:** The reranker makes the biggest difference when the question has a precise factual answer buried in the corpus (Query A). For conversational/conceptual questions that match the podcast well (Query B), both approaches perform similarly because the cosine similarity already surfaces the right content.


In [18]:
example_queries = [
    "What obligations do providers of high-risk AI systems have under the EU AI Act?",
    "What does the podcast say about making AI trustworthy in practice?"
]

for q in example_queries:
    print("=" * 70)
    print(f"QUERY: {q}")
    print("=" * 70)

    answer_baseline = rag_pipeline(q, use_reranker=False)
    print("\n--- WITHOUT RERANKER ---")
    print(answer_baseline)

    answer_reranked = rag_pipeline(q, use_reranker=True)
    print("\n--- WITH COHERE RERANKER ---")
    print(answer_reranked)
    print()

QUERY: What obligations do providers of high-risk AI systems have under the EU AI Act?

--- WITHOUT RERANKER ---
Under the EU AI Act, providers of high-risk AI systems are obligated to take measures to ensure a sufficient level of AI literacy among their staff and other individuals involved in the operation and use of AI systems. This includes considering the technical knowledge, experience, education, and training of these individuals, as well as the context in which the AI systems are to be used. Additionally, providers must evaluate high-risk systems and recommend mitigations to reduce potential effects on health, safety, and fundamental rights as protected by Union law.

--- WITH COHERE RERANKER ---
Providers of high-risk AI systems under the EU AI Act have several obligations to ensure safety and compliance. These include:

1. **Compliance with Regulatory Requirements**: High-risk AI systems must meet stringent regulatory requirements to ensure their safety and compliance with the

##### 5. Recommendations: When to Use Reranking

| Situation | Recommendation | Evidence from this lab |
|-----------|---------------|------------------------|
| Legal / compliance queries requiring precise answers | ✅ Always rerank | Provider obligations query: reranker promoted the correct chunk from about rank 7 to rank 1 |
| Mixed-source corpus (legal + conversational) | ✅ Rerank — bridges the style gap | Source distribution shifted from mixed to podcast-dominant for a conversational query; eu_ai_act surfaced for legal queries |
| Cosine scores are tightly clustered (0.84–0.89 range) | ✅ Rerank: it produces a much wider spread | Baseline: all scores between 0.84–0.90, too close together to threshold. Reranked: 0.44–0.85, a usable signal |
| Content is missing from the corpus entirely | ❌ Reranking cannot help | Prohibited practices query failed for both — the actual EU AI Act legal text was not in the PDF |
| Conversational/conceptual queries matching the dominant corpus style | ⚠️ Optional — cosine similarity already works | Trustworthy AI meaning: baseline and reranker both returned the same correct chunk at #1 |
| Real-time applications with strict latency budgets | ⚠️ Profile first | Cohere reranking adds one API call but processes all candidates in a single batch; latency was not measured in this lab |
| Cost-sensitive applications | ⚠️ Prefer Cohere over LLM scoring | LLM scoring = 1 API call per chunk (10 calls for top_k=10). Cohere = 1 call for all candidates |

**Overall recommendation for this legal-tech use case:**  
Use a **two-stage pipeline**: retrieve `top_k=10` with Pinecone cosine similarity, then rerank to `top_n=3–5` with Cohere `rerank-v3.5`. Because the reranker's scores are spread much more widely than cosine's compressed 0.84–0.90 range, they also enable a **confidence threshold**: if the top rerank score is below a cut-off, the system should respond *"I could not find a reliable answer in the provided documents"* rather than hallucinate from poor context. Treat 0.5 as a starting point to tune on labelled queries: in this lab the irrelevant #1 result for the prohibited practices query still scored 0.69, so a 0.5 cut-off alone would not have caught it. This is especially important for legal documents where a wrong answer carries real risk.
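
The confidence threshold described above could be sketched as a small gate in front of the generation step. This assumes `ranked` has the shape returned by `rerank` in Step 4 (a list of dicts with `"text"` and `"rerank_score"` keys). `select_context`, `answer_or_abstain` and the `generate` callback are hypothetical names, and the 0.5 default is a value to tune, not a recommendation.

```python
FALLBACK = "I could not find a reliable answer in the provided documents"


def select_context(ranked, min_score=0.5):
    """Return chunk texts for generation, or None when confidence is too low."""
    if not ranked:
        return None
    top_score = max(r["rerank_score"] for r in ranked)
    if top_score < min_score:
        return None
    return [r["text"] for r in ranked]


def answer_or_abstain(ranked, generate, min_score=0.5):
    """Call `generate(context_chunks)` only when the gate passes."""
    context = select_context(ranked, min_score)
    return FALLBACK if context is None else generate(context)


# Usage with a stand-in generator; a real one would call the chat model.
weak = [{"text": "off-topic chunk", "rerank_score": 0.31}]
strong = [{"text": "relevant chunk", "rerank_score": 0.78}]
fake_generate = lambda chunks: f"Answer built from {len(chunks)} chunk(s)"

print(answer_or_abstain(weak, fake_generate))
print(answer_or_abstain(strong, fake_generate))
```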

**On LLM scoring vs Cohere:**  
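LLM scoring is useful when you need an *explanation* of why a chunk was ranked as it was, or when you want to customise the scoring rubric with a prompt. For standard retrieval quality improvement, Cohere's dedicated cross-encoder needs one API call instead of one per chunk, and in our tests the reranked pipeline produced a more factually complete answer than the baseline (section 4, Query A). The LLM-scored ranking was not run through the generation step, so the two rerankers were not compared on answer quality.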
