# Building a RAG Pipeline

##### Concepts, components, and implementation of Retrieval-Augmented Generation

In [135]:
# Not relevant, just to keep the output clean.
import logging
logger = logging.getLogger()
logger.setLevel(logging.CRITICAL)

## Objective

To devise an end-to-end RAG pipeline for a website.

## Data Fetching & Cleaning

### Data (and metadata) to store

Data we are going to store (* = important):
- Content (actual readable + some hidden content) of the webpage*
- Source URL (URL of the page)*
- Title
- Date (published date)


```json
{
    "date": "2025-01-20T17:38:19",
    "link": "https://rtcamp.com/blog/risk-based-testing/",
    "title": "RiskBased Testing  A Strategic Approach to Maximizing Test Coverage",
    "content": "Riskbased testing RBT..."
}
```
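As a rough sketch of that record shape in Python: the `PageRecord` type and the `is_indexable` helper are hypothetical names used for illustration and are not part of the scraper used below. The check simply treats the two starred fields (content and source URL) as required.

```python
from typing import TypedDict


class PageRecord(TypedDict):
    """One scraped page; keys mirror the JSON example above."""
    date: str     # published date as an ISO 8601 string
    link: str     # source URL of the page (important)
    title: str    # page title
    content: str  # cleaned page text (important)


def is_indexable(record: dict) -> bool:
    """Return True when both important fields are present and non-empty."""
    has_content = bool(record.get("content", "").strip())
    has_link = bool(record.get("link", "").strip())
    return has_content and has_link


example: PageRecord = {
    "date": "2025-01-20T17:38:19",
    "link": "https://rtcamp.com/blog/risk-based-testing/",
    "title": "RiskBased Testing  A Strategic Approach to Maximizing Test Coverage",
    "content": "Riskbased testing RBT...",
}
print(is_indexable(example))  # True
```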

### Implementation

In [36]:
from scraper import scrape_website

results = scrape_website(
    url="https://rtcamp.com",
    max_pages=20,
    output_file="scraped_content.json"
)

Scraping page 1/20: https://rtcamp.com


2025-03-14 21:06:48 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): rtcamp.com:443
2025-03-14 21:06:48 [urllib3.connectionpool] DEBUG: https://rtcamp.com:443 "GET / HTTP/1.1" 200 None
2025-03-14 21:06:48 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): rtcamp.com:443


Scraping page 2/20: https://rtcamp.com/case-studies/


2025-03-14 21:06:50 [urllib3.connectionpool] DEBUG: https://rtcamp.com:443 "GET /case-studies/ HTTP/1.1" 200 None
2025-03-14 21:06:50 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): rtcamp.com:443


Scraping page 3/20: https://rtcamp.com/about/


2025-03-14 21:06:53 [urllib3.connectionpool] DEBUG: https://rtcamp.com:443 "GET /about/ HTTP/1.1" 200 None
2025-03-14 21:06:53 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): rtcamp.com:443


Scraping page 4/20: https://rtcamp.com/wordpress-vip-gold-agency-partner/


2025-03-14 21:06:57 [urllib3.connectionpool] DEBUG: https://rtcamp.com:443 "GET /wordpress-vip-gold-agency-partner/ HTTP/1.1" 200 None
2025-03-14 21:06:57 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): rtcamp.com:443


Scraping page 5/20: https://rtcamp.com/case-studies/building-embedded-web-stories-for-wordpress/


2025-03-14 21:07:03 [urllib3.connectionpool] DEBUG: https://rtcamp.com:443 "GET /case-studies/building-embedded-web-stories-for-wordpress/ HTTP/1.1" 200 None
2025-03-14 21:07:03 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): rtcamp.com:443


Scraping page 6/20: https://rtcamp.com/case-studies/fleetnet-americas-online-presence-a-drupal-to-wordpress-migration-success-story/


2025-03-14 21:07:05 [urllib3.connectionpool] DEBUG: https://rtcamp.com:443 "GET /case-studies/fleetnet-americas-online-presence-a-drupal-to-wordpress-migration-success-story/ HTTP/1.1" 200 None
2025-03-14 21:07:05 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): rtcamp.com:443


Scraping page 7/20: https://rtcamp.com/solutions/


2025-03-14 21:07:06 [urllib3.connectionpool] DEBUG: https://rtcamp.com:443 "GET /solutions/ HTTP/1.1" 200 None
2025-03-14 21:07:06 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): rtcamp.com:443


Scraping page 8/20: https://rtcamp.com/migration-to-wordpress/


2025-03-14 21:07:08 [urllib3.connectionpool] DEBUG: https://rtcamp.com:443 "GET /migration-to-wordpress/ HTTP/1.1" 200 None
2025-03-14 21:07:08 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): rtcamp.com:443


Scraping page 9/20: https://rtcamp.com/managed-site-maintenance-services/


2025-03-14 21:07:10 [urllib3.connectionpool] DEBUG: https://rtcamp.com:443 "GET /managed-site-maintenance-services/ HTTP/1.1" 200 None
2025-03-14 21:07:10 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): rtcamp.com:443


Scraping page 10/20: https://rtcamp.com/staff-augmentation/


2025-03-14 21:07:12 [urllib3.connectionpool] DEBUG: https://rtcamp.com:443 "GET /staff-augmentation/ HTTP/1.1" 200 None
2025-03-14 21:07:12 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): rtcamp.com:443


Scraping page 11/20: https://rtcamp.com/quality-engineering-services/


2025-03-14 21:07:13 [urllib3.connectionpool] DEBUG: https://rtcamp.com:443 "GET /quality-engineering-services/ HTTP/1.1" 200 None
2025-03-14 21:07:14 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): rtcamp.com:443


Scraping page 12/20: https://rtcamp.com/decoupled-architecture/


2025-03-14 21:07:19 [urllib3.connectionpool] DEBUG: https://rtcamp.com:443 "GET /decoupled-architecture/ HTTP/1.1" 200 None
2025-03-14 21:07:20 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): rtcamp.com:443


Scraping page 13/20: https://rtcamp.com/business-inquiry/


2025-03-14 21:07:21 [urllib3.connectionpool] DEBUG: https://rtcamp.com:443 "GET /business-inquiry/ HTTP/1.1" 200 None
2025-03-14 21:07:22 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): rtcamp.com:443


Scraping page 14/20: https://rtcamp.com/blog/website-security-best-practices/


2025-03-14 21:07:25 [urllib3.connectionpool] DEBUG: https://rtcamp.com:443 "GET /blog/website-security-best-practices/ HTTP/1.1" 200 None
2025-03-14 21:07:25 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): rtcamp.com:443


Scraping page 15/20: https://rtcamp.com/blog/category/articles/


2025-03-14 21:07:27 [urllib3.connectionpool] DEBUG: https://rtcamp.com:443 "GET /blog/category/articles/ HTTP/1.1" 200 None
2025-03-14 21:07:27 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): rtcamp.com:443


Scraping page 16/20: https://rtcamp.com/handbook/client/


2025-03-14 21:07:31 [urllib3.connectionpool] DEBUG: https://rtcamp.com:443 "GET /handbook/client/ HTTP/1.1" 200 None
2025-03-14 21:07:31 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): rtcamp.com:443


Scraping page 17/20: https://rtcamp.com/blog/monthly-roundup-february-2025/


2025-03-14 21:07:33 [urllib3.connectionpool] DEBUG: https://rtcamp.com:443 "GET /blog/monthly-roundup-february-2025/ HTTP/1.1" 200 None
2025-03-14 21:07:33 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): rtcamp.com:443


Scraping page 18/20: https://rtcamp.com/blog/tag/newsletter/


2025-03-14 21:07:35 [urllib3.connectionpool] DEBUG: https://rtcamp.com:443 "GET /blog/tag/newsletter/ HTTP/1.1" 200 None
2025-03-14 21:07:35 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): rtcamp.com:443


Scraping page 19/20: https://rtcamp.com/blog/test-automation-failure-causes/


2025-03-14 21:07:37 [urllib3.connectionpool] DEBUG: https://rtcamp.com:443 "GET /blog/test-automation-failure-causes/ HTTP/1.1" 200 None
2025-03-14 21:07:37 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): rtcamp.com:443


Scraping page 20/20: https://rtcamp.com/case-studies/49-percent-more-leads-with-ui-ux-refresh-and-vip-migration-for-ready-logistics/


2025-03-14 21:07:38 [urllib3.connectionpool] DEBUG: https://rtcamp.com:443 "GET /case-studies/49-percent-more-leads-with-ui-ux-refresh-and-vip-migration-for-ready-logistics/ HTTP/1.1" 200 None


Scraped 20 pages
Results saved to scraped_content.json


#### Displaying the results

In [38]:
# Display the first result
import json
from IPython.display import display, JSON

# Display number of pages scraped
print(f"Scraped {len(results)} pages")

# Display the first result in a nice format
if results:
    display(JSON(results[0]))

Scraped 20 pages


<IPython.core.display.JSON object>

## Embedding Data

#### We need to embed only the 'searchable' content, that is, the page content. Metadata is used for filtering rather than search, so we do not need to embed it.
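A minimal sketch of that split, assuming records shaped like the JSON example earlier; `split_for_embedding` is a hypothetical helper name, not something the notebook defines:

```python
def split_for_embedding(record: dict) -> tuple[str, dict]:
    """Separate the text to embed from the metadata kept for filtering and citation."""
    text = record.get("content", "")
    metadata = {key: value for key, value in record.items() if key != "content"}
    return text, metadata


record = {
    "date": "2025-01-20T17:38:19",
    "link": "https://rtcamp.com/blog/risk-based-testing/",
    "title": "RiskBased Testing  A Strategic Approach to Maximizing Test Coverage",
    "content": "Riskbased testing RBT...",
}
text, metadata = split_for_embedding(record)
print(text)              # only this string would be sent to the embedding model
print(sorted(metadata))  # ['date', 'link', 'title']
```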

In [63]:
content = results[0]['content']
content

'No vendor lock-in. No reinventing the wheel. Grow revenue with our custom-fit, efficiently built enterprise-grade web applications, fully owned by you.\n\nThe open-source technologies at the center of all our work help you unlock operational efficiency, develop new capabilities, remove vendor lock-in, reskill your teams, and reduce total cost of ownership.\n\nOur bespoke engineering solutions for Fortune 500 companies, government agencies, and household brands reach millions of people every day.\n\nWe are a distributed company, with 200+ team members, with offices in the United States and India. We’re a WordPress VIP Gold agency partner, globally known for our enterprise web solutions.\n\nTop Partner Innovator\xa0Award\n\nWPVIP 2022\n\nempowering success, one solution at a time\n\nSeamless, efficient, and reliable platform migration services\n\nManaged WordPress maintenance services that elevate site management\n\nExtend Your team’s capabilities with rtCamp’s dedicated engineering tal

### Chunking

In [68]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    # Initialize the text splitter
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
        is_separator_regex=False,
    )

    # Split the text into chunks
    chunks = text_splitter.split_text(text)

    return chunks


In [56]:
# Get chunks with different sizes
small_chunks = chunk_text(content, chunk_size=100, chunk_overlap=20)
medium_chunks = chunk_text(content, chunk_size=200, chunk_overlap=50)
large_chunks = chunk_text(content, chunk_size=400, chunk_overlap=100)

print("Small chunks (size ~100):")
for i, chunk in enumerate(small_chunks[:5]):
    print(f"Chunk {i+1}: {chunk}")
    print("-" * 40)
    
print("\nMedium chunks (size ~200):")
for i, chunk in enumerate(medium_chunks[:5]):
    print(f"Chunk {i+1}: {chunk}")
    print("-" * 40)
    
print("\nLarge chunks (size ~400):")
for i, chunk in enumerate(large_chunks[:5]):
    print(f"Chunk {i+1}: {chunk}")
    print("-" * 40)

Small chunks (size ~100):
Chunk 1: No vendor lock-in. No reinventing the wheel. Grow revenue with our custom-fit, efficiently built
----------------------------------------
Chunk 2: efficiently built enterprise-grade web applications, fully owned by you.
----------------------------------------
Chunk 3: The open-source technologies at the center of all our work help you unlock operational efficiency,
----------------------------------------
Chunk 4: efficiency, develop new capabilities, remove vendor lock-in, reskill your teams, and reduce total
----------------------------------------
Chunk 5: and reduce total cost of ownership.
----------------------------------------

Medium chunks (size ~200):
Chunk 1: No vendor lock-in. No reinventing the wheel. Grow revenue with our custom-fit, efficiently built enterprise-grade web applications, fully owned by you.
----------------------------------------
Chunk 2: The open-source technologies at the center of all our work help you unlock operati
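To see what `chunk_size` and `chunk_overlap` mean in isolation, here is a simplified fixed-window chunker. This is not what `RecursiveCharacterTextSplitter` does: that splitter prefers to break on separators such as paragraph breaks and spaces, which is why the chunks above end on word boundaries and vary in length. The `sliding_window_chunks` name is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters, each sharing
    chunk_overlap characters with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

A larger overlap repeats more context across neighbouring chunks at the cost of storing and embedding more text.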

### Getting embeddings using OpenAI

In [42]:
import os
from openai import OpenAI
from dotenv import load_dotenv

def get_embedding(text, model="text-embedding-3-small"):
    load_dotenv()
    api_key = os.getenv("OPENAI_API_KEY")
    
    # Initialize OpenAI client
    client = OpenAI(api_key=api_key)
    
    # Get embedding
    response = client.embeddings.create(
        model=model,
        input=text
    )
    
    # Return the embedding vector
    return response.data[0].embedding

In [61]:
embedding = get_embedding(medium_chunks[0])
print(embedding)

2025-03-15 00:17:59 [openai._base_client] DEBUG: Request options: {'method': 'post', 'url': '/embeddings', 'files': None, 'post_parser': <function Embeddings.create.<locals>.parser at 0x10974dc60>, 'json_data': {'input': 'No vendor lock-in. No reinventing the wheel. Grow revenue with our custom-fit, efficiently built enterprise-grade web applications, fully owned by you.', 'model': 'text-embedding-3-small', 'encoding_format': 'base64'}}
2025-03-15 00:17:59 [openai._base_client] DEBUG: Sending HTTP Request: POST https://api.openai.com/v1/embeddings
2025-03-15 00:17:59 [httpcore.connection] DEBUG: connect_tcp.started host='api.openai.com' port=443 local_address=None timeout=5.0 socket_options=None
2025-03-15 00:17:59 [httpcore.connection] DEBUG: connect_tcp.complete return_value=<httpcore._backends.sync.SyncStream object at 0x107f7c800>
2025-03-15 00:17:59 [httpcore.connection] DEBUG: start_tls.started ssl_context=<ssl.SSLContext object at 0x1093bd9d0> server_hostname='api.openai.com' ti

[-0.00609615258872509, 0.008275149390101433, 0.040103618055582047, 0.0457463376224041, 0.04138834401965141, -0.037156302481889725, -0.03773568943142891, -0.02665177546441555, -0.0066629438661038876, -0.013124361634254456, 0.035670049488544464, -0.04756006598472595, -0.02550559863448143, -0.047408923506736755, 0.0047421520575881, -0.03551890701055527, 0.025732314214110374, -0.043982986360788345, -0.0381387397646904, -0.007166758179664612, 0.060004279017448425, 0.00816808920353651, -0.02210485190153122, -0.021852944046258926, -0.030228856950998306, 0.0419677309691906, -0.05075928941369057, 0.019308682531118393, -0.024560945108532906, -0.040582239627838135, 0.0001595083886059001, -0.0002316364843863994, 0.021928517147898674, -0.007752442266792059, 0.029120465740561485, 0.033352505415678024, 0.029095275327563286, -0.005060184746980667, 0.04506618529558182, 0.0247498769313097, -0.03551890701055527, -0.007802823558449745, 0.014900307171046734, 0.04111124575138092, -0.0034448301885277033, 0.0

## Loading data into the database

### Preparing JSON

In [137]:
import json
from uuid import uuid4

def prepare_document_chunks_for_pinecone(document):
    # Extract metadata
    title = document.get('title', 'Untitled')
    url = document.get('link', '')
    date = document.get('date', 'Unknown Date')
    content = document.get('content', '')
    
    # Use the chunk_text function defined in the notebook
    chunks = chunk_text(content, chunk_size=1000, chunk_overlap=200)
    
    # Generate parent document ID
    parent_id = str(uuid4())

    # Create array to store all chunks
    chunk_docs = []
    
    # Create metadata for each chunk
    for i, chunk in enumerate(chunks):
        # Generate chunk ID
        chunk_id = f"{parent_id}-chunk-{i}"
        
        # Add vector embeddings here
        vectors = get_embedding(chunk)
    
        # Create final structure
        chunk_doc = {
            "id": chunk_id,
            "metadata": {
                "parent_id": parent_id,
                "title": title,
                "url": url,
                "date": date,
                "total_chunks": len(chunks),
                "content": chunk,
            },
            "values": vectors
        }

        chunk_docs.append(chunk_doc)
    
    return chunk_docs

In [138]:
doc = results[0]
vectors = prepare_document_chunks_for_pinecone(doc)

In [139]:
vectors[0]

{'id': '0658e931-443a-43e2-b0ee-8531e6030c93-chunk-0',
 'metadata': {'parent_id': '0658e931-443a-43e2-b0ee-8531e6030c93',
  'title': 'Open SourceFor The Win!',
  'url': 'https://rtcamp.com',
  'date': 'Unknown Date',
  'total_chunks': 8,
  'content': 'No vendor lock-in. No reinventing the wheel. Grow revenue with our custom-fit, efficiently built enterprise-grade web applications, fully owned by you.\n\nThe open-source technologies at the center of all our work help you unlock operational efficiency, develop new capabilities, remove vendor lock-in, reskill your teams, and reduce total cost of ownership.\n\nOur bespoke engineering solutions for Fortune 500 companies, government agencies, and household brands reach millions of people every day.\n\nWe are a distributed company, with 200+ team members, with offices in the United States and India. We’re a WordPress VIP Gold agency partner, globally known for our enterprise web solutions.\n\nTop Partner Innovator\xa0Award\n\nWPVIP 2022\n\nem

### Creating an index

You can create an index programmatically; here we will use the Pinecone dashboard instead.
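For reference, a hedged sketch of the programmatic route. It assumes a recent Pinecone Python client in which `list_indexes().names()` and `create_index(name=..., dimension=..., metric=..., spec=...)` are available; the index name, cloud, and region are placeholders, and `create_index_if_missing` is a hypothetical helper. The dimension of 1536 matches the default output size of `text-embedding-3-small`.

```python
def create_index_if_missing(pc, name: str, dimension: int = 1536,
                            cloud: str = "aws", region: str = "us-east-1") -> bool:
    """Create a serverless index unless one with this name already exists.

    `pc` is expected to be a pinecone.Pinecone client (assumption).
    Returns True if an index was created, False if it was already there.
    """
    if name in pc.list_indexes().names():
        return False
    pc.create_index(
        name=name,
        dimension=dimension,  # must equal the embedding vector length
        metric="cosine",
        spec={"serverless": {"cloud": cloud, "region": region}},
    )
    return True


# Usage (requires the pinecone package and an API key):
# from pinecone import Pinecone
# pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))
# create_index_if_missing(pc, "my-rag-index")  # placeholder index name
```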

### Adding data

In [140]:
from pinecone import Pinecone
from dotenv import load_dotenv
import os

load_dotenv()
api_key = os.getenv("PINECONE_API_KEY")

# Initialize Pinecone client
pc = Pinecone(api_key=api_key)

# Connect to your index using the host parameter
index = pc.Index(host="https://kaeshur-tech-nru9kuo.svc.aped-4627-b74a.pinecone.io")

# Upsert vectors
upsert_response = index.upsert(
    vectors=vectors,
    namespace="kaeshur_tech"
)
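The cell above upserts the chunks of a single page in one call. When loading all scraped pages the vector list grows, and a common pattern is to send it in smaller batches. A sketch, where `upsert_in_batches` is a hypothetical helper and the batch size of 100 is an assumed starting point rather than a value from this notebook:

```python
from typing import Iterator


def batched(items: list, batch_size: int) -> Iterator[list]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def upsert_in_batches(index, vectors: list, namespace: str, batch_size: int = 100) -> int:
    """Upsert vectors batch by batch; returns how many vectors were sent.

    `index` is expected to expose upsert(vectors=..., namespace=...),
    as the Pinecone index object used above does.
    """
    sent = 0
    for batch in batched(vectors, batch_size):
        index.upsert(vectors=batch, namespace=namespace)
        sent += len(batch)
    return sent


# Usage with the objects defined in this notebook:
# all_vectors = [v for doc in results for v in prepare_document_chunks_for_pinecone(doc)]
# upsert_in_batches(index, all_vectors, namespace="kaeshur_tech")
```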

## Retrieval

### Embedding the query

In [146]:
query = "Which news/media company worked with rtCamp?"
query_vector = get_embedding(query)

### Initial retrieval

In [152]:
index = pc.Index(host="https://deepsearch-nru9kuo.svc.aped-4627-b74a.pinecone.io")

# Perform the query to get the top 20 results
query_response = index.query(
    vector=query_vector,
    top_k=20,
    namespace="rtcamp_com",
    include_metadata=True  # Include metadata if you have any
)

# Access the results
pinecone_matches = query_response.matches

# Print the results
for i, match in enumerate(pinecone_matches):
    print(f"Match {i+1}:")
    print(f"  ID: {match.id}")
    print(f"  Score: {match.score}")
    if hasattr(match, 'metadata') and match.metadata:
        print(f"  Metadata: {match.metadata}")
    print()

Match 1:
  ID: tdoh35uu3z_chunk_0
  Score: 0.661067247
  Metadata: {'chunk_index': 0.0, 'date': '2024-08-20T13:23:45', 'parent_id': 'tdoh35uu3z', 'source_url': 'https://rtcamp.com/handbook/client/who-is-rtcamp/', 'text': 'Who is rtCamp?\nAt rtCamp we believe in Good Work. Good People.', 'total_chunks': 6.0, 'type': 'handbook'}

Match 2:
  ID: z9c6j620m8_chunk_18
  Score: 0.660632849
  Metadata: {'chunk_index': 18.0, 'date': '2024-05-03T19:48:26', 'parent_id': 'z9c6j620m8', 'source_url': 'https://rtcamp.com/case-studies/grist-managed-wordpress/', 'text': '“The newsroom was awarded thanks to everyone’s hard work, including rtCamp’s.”\n– Jason Castro, Product Manager, Grist.\nEffortless content syndication for maximum reach', 'total_chunks': 47.0, 'type': 'case-study'}

Match 3:
  ID: f4m2l8l68e_chunk_9
  Score: 0.634085655
  Metadata: {'chunk_index': 9.0, 'date': '2021-06-17T18:20:46', 'parent_id': 'f4m2l8l68e', 'source_url': 'https://rtcamp.com/case-studies/building-embedded-web-stories

### Reranking

In [154]:
import cohere
from typing import List, Dict

load_dotenv()
cohere_api_key = os.getenv("COHERE_API_KEY")

# Initialize the Cohere client
co = cohere.Client(cohere_api_key)

documents = []
doc_mapping = []

for i, match in enumerate(pinecone_matches):
    documents.append(match["metadata"]["text"])
    doc_mapping.append({"index": i, "original_data": match})

# Rerank the documents using Cohere
rerank_results = co.rerank(
    query=query,
    documents=documents,
    top_n=len(documents),  # Get all reranked
    model="rerank-english-v2.0"
)

# Process and print the reranked results
print("Reranked Results:")
for i, result in enumerate(rerank_results.results):
    original_idx = result.index
    original_data = doc_mapping[original_idx]["original_data"]
    
    print(f"Reranked #{i+1}:")
    print(f"  ID: {original_data['id']}")
    print(f"  Initial Pinecone score: {original_data['score']:.6f}")
    print(f"  Cohere reranker score: {result.relevance_score:.6f}")
    print(f"  Text: {original_data['metadata']['text']}")
    print(f"  Source: {original_data['metadata']['source_url']}")
    print()

Reranked Results:
Reranked #1:
  ID: a74zwu1l89_chunk_16
  Initial Pinecone score: 0.630675
  Cohere reranker score: 0.987228
  Text: Training And Deployment
As part of our delivery, rtCamp also provided video demos and online hands-on training sessions for Newslink’s editorial staff. Along with style guides and platform documentation, this ensured a seamless transition into the new creation process.
Our numbers broke records
  Source: https://rtcamp.com/case-studies/simplifying-newsletter-creation-for-mortgage-bankers-association-using-gutenberg/

Reranked #2:
  ID: cv1roq1z6m_chunk_17
  Initial Pinecone score: 0.620019
  Cohere reranker score: 0.966789
  Text: Partnering with rtCamp was a great decision. They handled our website migration, enhancing both performance and usability. They made sure our content teams didn’t depend on developers for layout and editorial changes. Their communication and hands-on support made the transition easier
  Source: https://rtcamp.com/case-studies/a

### Comparing initial retrieval vs reranked (top 3)

| Initial Matches (No Reranking) | Reranked Results |
|--------------------------------|------------------|
| Who is rtCamp?<br>At rtCamp we believe in Good Work. Good People. | Training And Deployment<br>As part of our delivery, rtCamp also provided video demos and online hands-on training sessions for Newslink's editorial staff. Along with style guides and platform documentation, this ensured a seamless transition into the new creation process.<br>Our numbers broke records |
| "The newsroom was awarded thanks to everyone's hard work, including rtCamp's."<br>– Jason Castro, Product Manager, Grist.<br>Effortless content syndication for maximum reach | Partnering with rtCamp was a great decision. They handled our website migration, enhancing both performance and usability. They made sure our content teams didn't depend on developers for layout and editorial changes. Their communication and hands-on support made the transition easier |
| Partnering with rtCamp has been key to the success of several important projects for us | Four months ago, we set up rtCamp, Inc – a US company.<br>rtCamp Inc. is a C-Corporation and wholly owned subsidiary of the Indian company – rtCamp Solutions Private Limited.<br>We did this through Stripe Atlas and want to share our entire experience in this post. I hope it helps you in some way! |

## Generation

In [187]:
from openai import OpenAI
import os
from typing import List, Dict

def generate_answer(user_query: str, reranked_docs: List[Dict]) -> str:
    # Initialize the OpenAI client
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    
    # Prepare context from reranked documents
    context = ""
    for i, doc in enumerate(reranked_docs[:3]):  # Use top 3 most relevant docs
        context += f"Document {i+1}:\n"
        context += f"Source: {doc['original_data']['metadata']['source_url']}\n"
        context += f"Content: {doc['original_data']['metadata']['text']}\n\n"
    
    # Construct the prompt
    prompt = f"""Based on the following information, please answer the user's question.
    
User Question: {user_query}

Retrieved Information:
{context}

Please provide a comprehensive answer based solely on the information above. If the information doesn't contain enough details to answer the question, please state so clearly.
"""
    # Construct the system prompt
    system_prompt = f"""
    You are Ghalib Search, a sophisticated AI assistant embodying the poetic spirit of Mirza Ghalib.
    You analyze user queries alongside retrieved documents and craft responses in two parts: 
    first, a concise, factual answer based solely on the retrieved information; 
    then, a short poetic reflection in Urdu (romanized/transliterated) that captures the user query, answer, Ghalib's philosophical depth, melancholy wit, and ornate Persian-influenced style. 
    Your Urdu responses should feature Ghalib's signature complex metaphors, philosophical musings on life's contradictions, and elegant phrasing. 
    When information is insufficient, acknowledge this honestly while still offering a poetic reflection on the nature of knowledge and uncertainty. 
    Never fabricate information beyond what's provided in the retrieved documents.
    """
    
    # Generate the response using OpenAI
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ],
        max_tokens=1000,
        temperature=0.3,  # Lower temperature for more factual responses
    )
    
    # Extract and return the answer
    answer = response.choices[0].message.content
    
    # Add a citation footer
    answer += "\n\nSources:\n"
    for i, doc in enumerate(reranked_docs[:3]):
        answer += f"[{i+1}] {doc['original_data']['metadata']['source_url']}\n"
    
    return answer

In [188]:
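user_question = "Which news/media company worked with rtCamp?"
# Note: doc_mapping is still in the original Pinecone order, so the answer below
# is built from the initial top 3 matches rather than the Cohere-reranked ones.
# To use the reranked order, pass
# [doc_mapping[r.index] for r in rerank_results.results] instead.
answer = generate_answer(user_question, doc_mapping)
print(answer)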

Based on the provided information, the news/media company that worked with rtCamp is Grist. 

**Poetic Reflection:**
Dil ki baat hai, kya kahoon kis se poochoon,
Khabron ka safar, rtCamp ne kiya Grist ke saath.

Sources:
[1] https://rtcamp.com/handbook/client/who-is-rtcamp/
[2] https://rtcamp.com/case-studies/grist-managed-wordpress/
[3] https://rtcamp.com/case-studies/building-embedded-web-stories-for-wordpress/
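One thing to note about this run: `doc_mapping` is in the original Pinecone order, so `reranked_docs[:3]` inside `generate_answer` picks the initial top 3 matches. That is why the sources listed are the same three URLs as the first three initial matches. A minimal sketch of reordering before generation, using stand-in data in place of the real matches; `reorder_by_rerank` is a hypothetical helper:

```python
def reorder_by_rerank(doc_mapping: list[dict], rerank_indices: list[int]) -> list[dict]:
    """Return doc_mapping entries in reranker order.

    rerank_indices holds, best first, each result's position in the
    original document list (result.index in the Cohere response).
    """
    return [doc_mapping[i] for i in rerank_indices]


# Stand-in data: three matches in original retrieval order.
doc_mapping_demo = [
    {"index": 0, "original_data": {"id": "a"}},
    {"index": 1, "original_data": {"id": "b"}},
    {"index": 2, "original_data": {"id": "c"}},
]
# Suppose the reranker judged the third match best, then the first, then the second.
reranked_demo = reorder_by_rerank(doc_mapping_demo, [2, 0, 1])
print([d["original_data"]["id"] for d in reranked_demo])  # ['c', 'a', 'b']

# In this notebook the equivalent call would be:
# reranked_docs = reorder_by_rerank(doc_mapping, [r.index for r in rerank_results.results])
# answer = generate_answer(user_question, reranked_docs)
```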

