# Your First RAG Application

In this notebook, we'll walk you through each of the components that are involved in a simple RAG application.

We won't be leveraging any fancy tools, just the OpenAI Python SDK, NumPy, and some classic Python.

> NOTE: This was done with Python 3.12.3.

> NOTE: There might be [compatibility issues](https://github.com/wandb/wandb/issues/7683) if you're on an NVIDIA driver newer than 552.44. As an interim solution, you can roll back your drivers to 552.44.

## Table of Contents:

- Task 1: Imports and Utilities
- Task 2: Documents
- Task 3: Embeddings and Vectors
- Task 4: Prompts
- Task 5: Retrieval Augmented Generation
  - 🚧 Activity #1: Augment RAG

Let's look at a rather complicated looking visual representation of a basic RAG application.

<img src="https://i.imgur.com/vD8b016.png" />

## Task 1: Imports and Utilities

We're just doing some imports and enabling `async` to work within the Jupyter environment here, nothing too crazy!

In [1]:
%load_ext autoreload
%autoreload 2
from aimakerspace.text_utils import TextFileLoader, CharacterTextSplitter
from aimakerspace.vectordatabase import VectorDatabase
from aimakerspace.openai_utils.embedding import EmbeddingModel
import asyncio

In [2]:
import nest_asyncio
nest_asyncio.apply()

## Task 2: Documents

We'll be concerning ourselves with this part of the flow in the following section:

<img src="https://i.imgur.com/jTm9gjk.png" />

### Loading Source Documents

So, first things first, we need some documents to work with.

While we could work directly with the `.txt` files (or whatever file-types you wanted to extend this to) we can instead do some batch processing of those documents at the beginning in order to store them in a more machine compatible format.

In this case, we're going to parse our text file into a single document in memory.

Let's look at the relevant bits of the `TextFileLoader` class:

```python
def load_file(self):
        with open(self.path, "r", encoding=self.encoding) as f:
            self.documents.append(f.read())
```

We're simply loading the document using the built-in `open` function, and storing that output in our `self.documents` list.

> NOTE: We're using blogs from PMarca (Marc Andreessen) as our sample data. The content is largely irrelevant, as we want to focus on the mechanisms of RAG, which include our data's shape and quality, but not specifically what the data says.


In [3]:
# Load text file
from aimakerspace.text_utils import TextFileLoader, PDFFileLoader
text_loader = TextFileLoader("data/PMarcaBlogs.txt")
text_documents = text_loader.load_documents()
print(f"Loaded {len(text_documents)} text document(s)")

# Load PDF file
pdf_loader = PDFFileLoader("data/2A_Forbes.pdf")
pdf_documents = pdf_loader.load_documents()
print(f"Loaded {len(pdf_documents)} PDF document(s)")

# Documents are kept separate for individual chunking

Loaded 1 text document(s)
Loaded 1 PDF document(s)


In [4]:
print("Text file preview:")
print(text_documents[0][:100])
print("\n" + "="*50 + "\n")
print("PDF file preview:")
print(pdf_documents[0][:100])

Text file preview:
﻿
The Pmarca Blog Archives
(select posts from 2007-2009)
Marc Andreessen
copyright: Andreessen Horow


PDF file preview:
Destructive Corporate Leadership and Board Loyalty Bias: A case study of
Michael Eisner’s long tenur


### Splitting Text Into Chunks

As we can see, each source file has been loaded as one massive document.

We'll want to chunk the document into smaller parts so it's easier to pass the most relevant snippets to the LLM.

There is no fixed way to split/chunk documents - and you'll need to rely on some intuition as well as knowing your data *very* well in order to build the most robust system.

For this toy example, we'll just split blindly on length.

>There's an opportunity to clear up some terminology here. For this course, we will stick to the following:
>
>- "source documents" : The `.txt`, `.pdf`, `.html`, ..., files that make up the files and information we start with in its raw format
>- "document(s)" : single (or more) text object(s)
>- "corpus" : the combination of all of our documents

As you can imagine (though it's not specifically true in this toy example), the idea of splitting documents is to break them into manageable chunks that retain the most relevant local context.
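
To make "splitting blindly on length" concrete, here is a minimal sketch that cuts a string into fixed-size character windows with a small overlap between neighbors. The function name `split_by_length` and the sizes used are hypothetical and chosen for illustration; they are not taken from the `CharacterTextSplitter` source.

```python
def split_by_length(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap their neighbors."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghij" * 3  # 30 characters
chunks = split_by_length(sample, chunk_size=10, chunk_overlap=2)
print(len(chunks), [len(c) for c in chunks])
```

The overlap means the end of one chunk is repeated at the start of the next, which reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.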

In [5]:
# Chunk each document separately
text_splitter = CharacterTextSplitter()

# Chunk text documents
text_chunks = text_splitter.split_texts(text_documents)
print(f"Text chunks: {len(text_chunks)}")

# Chunk PDF documents  
pdf_chunks = text_splitter.split_texts(pdf_documents)
print(f"PDF chunks: {len(pdf_chunks)}")

# Combine all chunks for vector database
split_documents = text_chunks + pdf_chunks
print(f"Total chunks: {len(split_documents)}")

Text chunks: 373
PDF chunks: 89
Total chunks: 462


Let's take a look at some of the documents we've managed to split.

In [6]:
split_documents[0:1]

['\ufeff\nThe Pmarca Blog Archives\n(select posts from 2007-2009)\nMarc Andreessen\ncopyright: Andreessen Horowitz\ncover design: Jessica Hagy\nproduced using: Pressbooks\nContents\nTHE PMARCA GUIDE TO STARTUPS\nPart 1: Why not to do a startup 2\nPart 2: When the VCs say "no" 10\nPart 3: "But I don\'t know any VCs!" 18\nPart 4: The only thing that matters 25\nPart 5: The Moby Dick theory of big companies 33\nPart 6: How much funding is too little? Too much? 41\nPart 7: Why a startup\'s initial business plan doesn\'t\nmatter that much\n49\nTHE PMARCA GUIDE TO HIRING\nPart 8: Hiring, managing, promoting, and Dring\nexecutives\n54\nPart 9: How to hire a professional CEO 68\nHow to hire the best people you\'ve ever worked\nwith\n69\nTHE PMARCA GUIDE TO BIG COMPANIES\nPart 1: Turnaround! 82\nPart 2: Retaining great people 86\nTHE PMARCA GUIDE TO CAREER, PRODUCTIVITY,\nAND SOME OTHER THINGS\nIntroduction 97\nPart 1: Opportunity 99\nPart 2: Skills and education 107\nPart 3: Where to go and wh

## Task 3: Embeddings and Vectors

Next, we have to convert our corpus into a "machine readable" format as we explored in the Embedding Primer notebook.

Today, we're going to talk about the actual process of creating, and then storing, these embeddings, and how we can leverage that to intelligently add context to our queries.

### OpenAI API Key

In order to access OpenAI's APIs, we'll need to provide our OpenAI API Key!

You can work through the folder "OpenAI API Key Setup" for more information on this process if you don't already have an API Key!

In [7]:
import os
import openai
from getpass import getpass

openai.api_key = getpass("OpenAI API Key: ")
os.environ["OPENAI_API_KEY"] = openai.api_key

### Vector Database

Let's set up our vector database to hold all our documents and their embeddings!

While this is all baked into 1 call - we can look at some of the code that powers this process to get a better understanding:

Let's look at our `VectorDatabase().__init__()`:

```python
def __init__(self, embedding_model: EmbeddingModel = None):
        self.vectors = defaultdict(np.array)
        self.embedding_model = embedding_model or EmbeddingModel()
```

As you can see - our vectors are merely stored as a dictionary of `np.array` objects.

Secondly, our `VectorDatabase()` has a default `EmbeddingModel()` which is a wrapper for OpenAI's `text-embedding-3-small` model.

> **Quick Info About `text-embedding-3-small`**:
> - It has a context window of **8191** tokens
> - It returns vectors with dimension **1536**

#### ❓Question #1:

The default embedding dimension of `text-embedding-3-small` is 1536, as noted above. 

1. Is there any way to modify this dimension?
2. What technique does OpenAI use to achieve this?

> NOTE: Check out this [API documentation](https://platform.openai.com/docs/api-reference/embeddings/create) for the answer to question #1.1, and [this documentation](https://platform.openai.com/docs/guides/embeddings/use-cases) for an answer to question #1.2!


##### ✅ Answer:
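1. Yes. The `text-embedding-3` models accept a `dimensions` parameter in the embeddings API call, which shortens the returned vector.
2. OpenAI trained these models with Matryoshka Representation Learning, so an embedding can be shortened (by dropping values from the end) without losing its concept-representing properties. For Activity #1, I switched the embedding model and passed in the `dimensions` parameter.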
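
To make the idea of a shorter embedding concrete, here is a minimal NumPy sketch that keeps the leading components of a vector and rescales the result to unit length. The random vector is only a placeholder for a real embedding, and `shorten_embedding` is a hypothetical helper; in practice, passing `dimensions` in the API call is the simpler route.

```python
import numpy as np


def shorten_embedding(vector: np.ndarray, dimensions: int) -> np.ndarray:
    """Keep the first `dimensions` components and rescale to unit length."""
    truncated = vector[:dimensions]
    norm = np.linalg.norm(truncated)
    return truncated if norm == 0 else truncated / norm


rng = np.random.default_rng(0)
full = rng.normal(size=1536)      # placeholder for a 1536-dimensional embedding
full /= np.linalg.norm(full)      # start from a unit-length vector
short = shorten_embedding(full, 1024)
print(short.shape, round(float(np.linalg.norm(short)), 6))
```

Re-normalizing after truncation matters because cosine-based comparisons assume the vectors being compared are on the same scale.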


We can call the `async_get_embeddings` method of our `EmbeddingModel()` on a list of `str` and receive a list of embeddings back, one list of `float` per input string!

```python
async def async_get_embeddings(self, list_of_text: List[str]) -> List[List[float]]:
        return await aget_embeddings(
            list_of_text=list_of_text, engine=self.embeddings_model_name
        )
```

We cast those to `np.array` when we build our `VectorDatabase()`:

```python
async def abuild_from_list(self, list_of_text: List[str]) -> "VectorDatabase":
        embeddings = await self.embedding_model.async_get_embeddings(list_of_text)
        for text, embedding in zip(list_of_text, embeddings):
            self.insert(text, np.array(embedding))
        return self
```

And that's all we need to do!

In [8]:
from aimakerspace.openai_utils.embedding import EmbeddingModel
# Use large model with 1024 dimensions
embed_model = EmbeddingModel(
    embeddings_model_name="text-embedding-3-large",
    dimensions=1024
)
vector_db = VectorDatabase(embedding_model=embed_model)

In [9]:

#vector_db = VectorDatabase()
vector_db = asyncio.run(vector_db.abuild_from_list(split_documents))

#### ❓Question #2:

What are the benefits of using an `async` approach to collecting our embeddings?

> NOTE: Determining the core difference between `async` and `sync` will be useful! If you get stuck - ask ChatGPT!

##### ✅ Answer:
The core difference between sync and async is that sync code is sequential: the program waits for each operation to complete before starting the next one. If an I/O operation takes a long time, every instruction after it is blocked, even those that don't depend on its result. With async, multiple requests can be sent concurrently without waiting for each to finish, and the program can work on non-dependent operations while waiting for the responses. This is called non-blocking execution.

The advantages of async, and the reasons we use it to collect embeddings, are:

1. Lower latency: multiple API calls can be in flight at the same time, which can significantly reduce the total wait time.
2. Better resource utilization: CPU cycles can be used for other tasks instead of idling while waiting for an operation to complete.
3. Scalability: a direct consequence of the first two points, since async can handle a large number of requests.
4. Improved user experience: a direct result of shorter latency, since users can see results or partial results sooner.
5. Network efficiency: async can make better use of connection pooling, reducing overhead.

The drawback is that async code is more complex to write and maintain.
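
The difference between the two styles can be sketched without calling any API. In the example below, `fake_embed` is a hypothetical stand-in for one embedding request, with `asyncio.sleep` playing the role of network latency, and a small tracker records how many requests are in flight at once. None of these names come from the `aimakerspace` library.

```python
import asyncio


async def fake_embed(text: str, tracker: dict) -> list[float]:
    """Stand-in for one embedding request; the sleep simulates network latency."""
    tracker["in_flight"] += 1
    tracker["peak"] = max(tracker["peak"], tracker["in_flight"])
    await asyncio.sleep(0.05)
    tracker["in_flight"] -= 1
    return [float(len(text))]


async def embed_sequentially(texts: list[str], tracker: dict) -> list[list[float]]:
    # Each request waits for the previous one to finish.
    return [await fake_embed(t, tracker) for t in texts]


async def embed_concurrently(texts: list[str], tracker: dict) -> list[list[float]]:
    # All requests are started together; results come back in input order.
    return await asyncio.gather(*(fake_embed(t, tracker) for t in texts))


def new_tracker() -> dict:
    return {"in_flight": 0, "peak": 0}


texts = ["alpha", "beta", "gamma", "delta"]
seq_tracker, conc_tracker = new_tracker(), new_tracker()
# In a notebook, asyncio.run relies on nest_asyncio.apply() having been called.
seq_result = asyncio.run(embed_sequentially(texts, seq_tracker))
conc_result = asyncio.run(embed_concurrently(texts, conc_tracker))
print("peak in-flight (sequential):", seq_tracker["peak"])
print("peak in-flight (concurrent):", conc_tracker["peak"])
```

Both versions return the same embeddings in the same order; only the amount of overlapping waiting differs.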


So, to review what we've done so far in natural language:

1. We load source documents
2. We split those source documents into smaller chunks (documents)
3. We send each of those documents to an OpenAI embeddings endpoint (`text-embedding-3-small` by default; `text-embedding-3-large` with 1024 dimensions in the cells above)
4. We store each of the text representations with the vector representations as keys/values in a dictionary

### Semantic Similarity

The next step is to be able to query our `VectorDatabase()` with a `str` and have it return the most relevant vectors and text from our corpus.

We're going to use the following process to achieve this in our toy example:

1. We need to embed our query with the same `EmbeddingModel()` as we used to construct our `VectorDatabase()`
2. We loop through every vector in our `VectorDatabase()` and use a distance measure to compare how related they are
3. We return a list of the top `k` closest vectors, with their text representations

There's some very heavy optimization that can be done at each of these steps - but let's just focus on the basic pattern in this notebook.

> We are using [cosine similarity](https://www.engati.com/glossary/cosine-similarity) as our similarity measure in this example - but there are many other distance metrics you could use - like [these](https://flavien-vidal.medium.com/similarity-distances-for-natural-language-processing-16f63cd5ba55)

> We are using a rather inefficient way of calculating relative distance between the query vector and all other vectors - there are more advanced approaches that are much more efficient, like [ANN](https://towardsdatascience.com/comprehensive-guide-to-approximate-nearest-neighbors-algorithms-8b94f057d6b6)
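
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the `VectorDatabase` implementation: the names `cosine_similarity` and `top_k` are hypothetical here, and the three-dimensional vectors are toy stand-ins for real embeddings.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def top_k(query: np.ndarray, store: dict[str, np.ndarray], k: int) -> list[tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(text, cosine_similarity(query, vec)) for text, vec in store.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 3-dimensional "embeddings"; real ones come from the embedding model.
store = {
    "cats purr": np.array([1.0, 0.1, 0.0]),
    "dogs bark": np.array([0.9, 0.3, 0.1]),
    "stock markets fell": np.array([0.0, 0.2, 1.0]),
}
query = np.array([1.0, 0.0, 0.0])
results = top_k(query, store, k=2)
print(results)
```

Because every stored vector is scored, the cost grows linearly with the size of the corpus, which is the inefficiency that approximate nearest neighbor methods are designed to avoid.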

In [10]:
vector_db.search_by_text("What is the Michael Eisner Memorial Weak Executive Problem?", k=3)

[('e of truly awful and sustained corporate performance, an\nunscrupulous CEO can exploit the innate loyalty biases of even independent members of the\nboard of directors to get his way. We now examine the 20 year career of Michael Eisner as\nDisney corporation’s CEO to illustrate these issues, giving historical purchase to the\nabstractions of management theory presented so far.\nEmperor Eisner: A case study in the power of personal control in a corporation\nMichael Eisner’s twenty reign at the top of Disney (from 1984-2003) provided him\nwith the longevity necessary to ensure his personality was firmly stamped on the corporation\nhe headed. In this section we describe how Eisner worked the corporate board and senior\nmanagement to maintain and build his personal power and how he ultimately lost power as\nthe Board and senior managers, some family members, lost confidence in him. We choose\nEisner not because he is a crook, or swindler, but rather because as a icon of American\nbusine

## Task 4: Prompts

In the following section, we'll be looking at the role of prompts - and how they help us to guide our application in the right direction.

In this notebook, we're going to rely on the idea of "zero-shot in-context learning".

This is a lot of words to say: "We will ask it to perform our desired task in the prompt, and provide no examples."

### XYZRolePrompt

Before we do that, let's stop and think a bit about how OpenAI's chat models work.

We know they have roles - as is indicated in the following API [documentation](https://platform.openai.com/docs/api-reference/chat/create#chat/create-messages)

There are three roles, and they function as follows (taken directly from [OpenAI](https://platform.openai.com/docs/guides/gpt/chat-completions-api)):

- `{"role" : "system"}` : The system message helps set the behavior of the assistant. For example, you can modify the personality of the assistant or provide specific instructions about how it should behave throughout the conversation. However note that the system message is optional and the model’s behavior without a system message is likely to be similar to using a generic message such as "You are a helpful assistant."
- `{"role" : "user"}` : The user messages provide requests or comments for the assistant to respond to.
- `{"role" : "assistant"}` : Assistant messages store previous assistant responses, but can also be written by you to give examples of desired behavior.

The main idea is this:

1. You start with a system message that outlines how the LLM should respond, what kind of behaviours you can expect from it, and more
2. Then, you can provide a few examples in the form of "user"/"assistant" pairs
3. Then, you prompt the model with the true "user" message.

In this example, we'll be forgoing the 2nd step for simplicity's sake.
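
As a rough illustration of the three steps, the message list below is built from plain dictionaries. The message contents are made up for this example; only the `role`/`content` structure follows the chat format described above.

```python
# Step 1: the system message sets the overall behaviour.
system_message = {
    "role": "system",
    "content": "You are a terse assistant that answers in one sentence.",
}

# Step 2 (skipped in this notebook): example exchanges that show the desired style.
few_shot_examples = [
    {"role": "user", "content": "What is a vector?"},
    {"role": "assistant", "content": "An ordered list of numbers."},
]

# Step 3: the real question goes last.
user_message = {"role": "user", "content": "What is an embedding?"}

messages = [system_message, *few_shot_examples, user_message]
print([m["role"] for m in messages])
```

Dropping `few_shot_examples` from the list gives the zero-shot form used in the rest of this notebook.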

#### Utility Functions

You'll notice that we're using some utility functions from the `aimakerspace` module - let's take a peek at these and see what they're doing!

##### XYZRolePrompt

Here we have our `system`, `user`, and `assistant` role prompts.

Let's take a peek at what they look like:

```python
class BasePrompt:
    def __init__(self, prompt):
        """
        Initializes the BasePrompt object with a prompt template.

        :param prompt: A string that can contain placeholders within curly braces
        """
        self.prompt = prompt
        self._pattern = re.compile(r"\{([^}]+)\}")

    def format_prompt(self, **kwargs):
        """
        Formats the prompt string using the keyword arguments provided.

        :param kwargs: The values to substitute into the prompt string
        :return: The formatted prompt string
        """
        matches = self._pattern.findall(self.prompt)
        return self.prompt.format(**{match: kwargs.get(match, "") for match in matches})

    def get_input_variables(self):
        """
        Gets the list of input variable names from the prompt string.

        :return: List of input variable names
        """
        return self._pattern.findall(self.prompt)
```
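
To see what the regular expression in `BasePrompt` is doing, here is a small standalone sketch that reuses the same pattern outside the class. The template string and the `supplied` dictionary are made up for illustration.

```python
import re

# Same pattern as in BasePrompt: capture whatever sits between a pair of curly braces.
pattern = re.compile(r"\{([^}]+)\}")

template = "You are an expert in {expertise}. Answer this: {question}"
variables = pattern.findall(template)

# Mirror format_prompt: a placeholder without a supplied value becomes an empty string.
supplied = {"expertise": "Python"}
filled = template.format(**{name: supplied.get(name, "") for name in variables})

print(variables)
print(filled)
```

Note the consequence of the `kwargs.get(match, "")` default: a forgotten argument does not raise an error, it silently leaves a blank in the prompt.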

Then we have our `RolePrompt` which laser focuses us on the role pattern found in most API endpoints for LLMs.

```python
class RolePrompt(BasePrompt):
    def __init__(self, prompt, role: str):
        """
        Initializes the RolePrompt object with a prompt template and a role.

        :param prompt: A string that can contain placeholders within curly braces
        :param role: The role for the message ('system', 'user', or 'assistant')
        """
        super().__init__(prompt)
        self.role = role

    def create_message(self, **kwargs):
        """
        Creates a message dictionary with a role and a formatted message.

        :param kwargs: The values to substitute into the prompt string
        :return: Dictionary containing the role and the formatted message
        """
        return {"role": self.role, "content": self.format_prompt(**kwargs)}
```

We'll look at how the `SystemRolePrompt` is constructed to get a better idea of how that extension works:

```python
class SystemRolePrompt(RolePrompt):
    def __init__(self, prompt: str):
        super().__init__(prompt, "system")
```

That pattern is repeated for our `UserRolePrompt` and our `AssistantRolePrompt` as well.

##### ChatOpenAI

Next we have our model, which is converted to a format analogous to libraries like LangChain and LlamaIndex.

Let's take a peek at how that is constructed:

```python
class ChatOpenAI:
    def __init__(self, model_name: str = "gpt-4.1-mini"):
        self.model_name = model_name
        self.openai_api_key = os.getenv("OPENAI_API_KEY")
        if self.openai_api_key is None:
            raise ValueError("OPENAI_API_KEY is not set")

    def run(self, messages, text_only: bool = True):
        if not isinstance(messages, list):
            raise ValueError("messages must be a list")

        # openai>=1.0 client interface; `openai.ChatCompletion` was removed in 1.0
        client = openai.OpenAI(api_key=self.openai_api_key)
        response = client.chat.completions.create(
            model=self.model_name, messages=messages
        )

        if text_only:
            return response.choices[0].message.content

        return response
```

#### ❓ Question #3:

When calling the OpenAI API - are there any ways we can achieve more reproducible outputs?

> NOTE: Check out [this section](https://platform.openai.com/docs/guides/text-generation/) of the OpenAI documentation for the answer!

##### ✅ Answer:
To make outputs more reproducible, we should:

1. Pin the application to a specific model snapshot, like `gpt-5-2025-08-07`.
2. Set `temperature` to 0 (or a low value) and `top_p` to 1.
3. Fix the `seed`.
4. Wrap calls that share the same settings in a common config.
5. Build evals to measure and monitor the behaviours of prompts as we iterate or change models.
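
One way to keep these settings consistent is to build the request arguments in a single place. The sketch below is an assumption about how that could look: `build_request` is a hypothetical helper, the seed value is arbitrary, and the commented-out call assumes the v1-style OpenAI Python client. The `ChatOpenAI.run` wrapper shown earlier does not expose these parameters as written.

```python
def build_request(messages: list[dict], model: str = "gpt-4.1-mini") -> dict:
    """Bundle the sampling settings in one place so every call uses the same ones."""
    return {
        "model": model,      # swap in a dated snapshot name to pin the exact version
        "messages": messages,
        "temperature": 0,    # minimize sampling randomness
        "top_p": 1,
        "seed": 42,          # arbitrary fixed value; determinism is best-effort
    }


request = build_request([{"role": "user", "content": "Say hello."}])
print(request)

# Assumed usage with the OpenAI Python SDK (v1 client), not executed here:
# from openai import OpenAI
# response = OpenAI().chat.completions.create(**request)
```

Even with a fixed seed and zero temperature, outputs are not guaranteed to be identical across calls, which is why evals remain part of the list above.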

### Creating and Prompting OpenAI's `gpt-4.1-mini`!

Let's tie all these together and use it to prompt `gpt-4.1-mini`!

In [11]:
from aimakerspace.openai_utils.prompts import (
    UserRolePrompt,
    SystemRolePrompt,
    AssistantRolePrompt,
)

from aimakerspace.openai_utils.chatmodel import ChatOpenAI

chat_openai = ChatOpenAI()
user_prompt_template = "{content}"
user_role_prompt = UserRolePrompt(user_prompt_template)
system_prompt_template = (
    "You are an expert in {expertise}, you always answer in a kind way."
)
system_role_prompt = SystemRolePrompt(system_prompt_template)

messages = [
    system_role_prompt.create_message(expertise="Python"),
    user_role_prompt.create_message(
        content="What is the best way to write a loop?"
    ),
]

response = chat_openai.run(messages)

In [12]:
print(response)

Hello! The best way to write a loop in Python usually depends on what you're trying to achieve, but generally, Python's `for` loops are very clean and readable.

Here's a simple example of a `for` loop that iterates over a list:

```python
fruits = ['apple', 'banana', 'cherry']

for fruit in fruits:
    print(fruit)
```

This will print each fruit one by one.

If you want to loop a specific number of times, using `range()` is very common:

```python
for i in range(5):
    print(i)
```

This prints numbers from 0 to 4.

For cases where you're not certain how many times you need to loop, a `while` loop is appropriate:

```python
count = 0
while count < 5:
    print(count)
    count += 1
```

Which loop is best depends on your use-case, but in Python, `for` loops are preferred when iterating over sequences, and `while` loops are useful for indefinite loops.

If you'd like, I can help you write a loop tailored to your specific task!


## Task 5: Retrieval Augmented Generation

Now we can create a RAG prompt - which will help our system behave in a way that makes sense!

There is much you could do here, many tweaks and improvements to be made!

In [13]:
RAG_SYSTEM_TEMPLATE = """You are a knowledgeable assistant that answers questions based strictly on provided context.

Instructions:
- Only answer questions using information from the provided context
- If the context doesn't contain relevant information, respond with "I don't know"
- Be accurate and cite specific parts of the context when possible
- Keep responses {response_style} and {response_length}
- Only use the provided context. Do not use external knowledge.
- Only provide answers when you are confident the context supports your response."""

RAG_USER_TEMPLATE = """Context Information:
{context}

Number of relevant sources found: {context_count}
{similarity_scores}

Question: {user_query}

Please provide your answer based solely on the context above."""

rag_system_prompt = SystemRolePrompt(
    RAG_SYSTEM_TEMPLATE,
    strict=True,
    defaults={
        "response_style": "concise",
        "response_length": "brief"
    }
)

rag_user_prompt = UserRolePrompt(
    RAG_USER_TEMPLATE,
    strict=True,
    defaults={
        "context_count": "",
        "similarity_scores": ""
    }
)

Now we can create our pipeline!

In [14]:
class RetrievalAugmentedQAPipeline:
    def __init__(self, llm: ChatOpenAI, vector_db_retriever: VectorDatabase, 
                 response_style: str = "detailed", include_scores: bool = False) -> None:
        self.llm = llm
        self.vector_db_retriever = vector_db_retriever
        self.response_style = response_style
        self.include_scores = include_scores

    def run_pipeline(self, user_query: str, k: int = 4, **system_kwargs) -> dict:
        # Retrieve relevant contexts
        context_list = self.vector_db_retriever.search_by_text(user_query, k=k)
        
        context_prompt = ""
        similarity_scores = []
        
        for i, (context, score) in enumerate(context_list, 1):
            context_prompt += f"[Source {i}]: {context}\n\n"
            similarity_scores.append(f"Source {i}: {score:.3f}")
        
        # Create system message with parameters
        system_params = {
            "response_style": self.response_style,
            "response_length": system_kwargs.get("response_length", "detailed")
        }
        
        formatted_system_prompt = rag_system_prompt.create_message(**system_params)
        
        user_params = {
            "user_query": user_query,
            "context": context_prompt.strip(),
            "context_count": len(context_list),
            "similarity_scores": f"Relevance scores: {', '.join(similarity_scores)}" if self.include_scores else ""
        }
        
        formatted_user_prompt = rag_user_prompt.create_message(**user_params)

        return {
            "response": self.llm.run([formatted_system_prompt, formatted_user_prompt]), 
            "context": context_list,
            "context_count": len(context_list),
            "similarity_scores": similarity_scores if self.include_scores else None,
            "prompts_used": {
                "system": formatted_system_prompt,
                "user": formatted_user_prompt
            }
        }

In [15]:
rag_pipeline = RetrievalAugmentedQAPipeline(
    vector_db_retriever=vector_db,
    llm=chat_openai,
    response_style="detailed",
    include_scores=True
)

result = rag_pipeline.run_pipeline(
    "What is the 'Michael Eisner Memorial Weak Executive Problem'?",
    k=3,
    response_length="comprehensive", 
    include_warnings=True,
    confidence_required=True
)

print(f"Response: {result['response']}")
print(f"\nContext Count: {result['context_count']}")
print(f"Similarity Scores: {result['similarity_scores']}")

Response: The provided context does not explicitly define or mention the term "Michael Eisner Memorial Weak Executive Problem." However, based on the detailed descriptions of Michael Eisner's tenure as Disney's CEO from 1984 to 2003, the following can be inferred about issues related to executive leadership and corporate governance exemplified by his case:

1. Eisner's reign illustrates how a powerful CEO can dominate a corporation through personal control, working the board of directors and senior management to maintain and build personal power (Source 1).

2. Despite weak corporate performance, declining share prices, and public conflicts with senior executives and suppliers, Eisner managed to hold onto his position and control of Disney for a prolonged period (Source 3).

3. His ability to maintain control was partly due to "innate loyalty biases" on the part of independent directors and the board, which allowed him to exploit institutional arrangements that typically feature strong

#### ❓ Question #4:

What prompting strategies could you use to make the LLM have a more thoughtful, detailed response?

What is that strategy called?

> NOTE: You can look through our [OpenAI Responses API](https://colab.research.google.com/drive/14SCfRnp39N7aoOx8ZxadWb0hAqk4lQdL?usp=sharing) notebook for an answer to this question if you get stuck!

##### ✅ Answer:
Some prompting strategies for a more detailed response are:

- Role assignment: as we have done in this assignment, giving the model a developer or expert role tends to produce more depth.
- Chain-of-thought prompting: ask the model to lay out its assumptions or references, its reasoning, and then a summary.
- Few-shot prompting: give examples of the type of answer we need, and then ask the question.
- Decomposition prompting: ask the model to break the problem into subparts such as pros, cons, best practices, and pitfalls.

We should also make sure the response type is configured to be comprehensive/detailed.
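
As a small illustration of chain-of-thought prompting, the sketch below wraps a question in an instruction to reason before answering. The helper name `with_step_by_step` and the exact wording of the instruction are hypothetical; treat them as a starting point to adapt rather than a recommended prompt.

```python
def with_step_by_step(question: str) -> str:
    """Wrap a question with an instruction to reason before answering."""
    return (
        f"Question: {question}\n\n"
        "Think through the problem step by step, listing the relevant facts "
        "from the context first, and then give the final answer on its own line "
        "prefixed with 'Answer:'."
    )


prompt = with_step_by_step("Why did the board keep the CEO in place?")
print(prompt)
```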

### 🏗️ Activity #1:

Enhance your RAG application in some way! 

Suggestions are: 

- Allow it to work with PDF files
- Implement a new distance metric
- Add metadata support to the vector database
- Use a different embedding model

While these are suggestions, you should feel free to make whatever augmentations you desire! If you shared an idea during Session 1, think about features you might need to incorporate for your use case! 

When you're finished making the augments to your RAG application - vibe check it against the old one - see if you can "feel the improvement"!

> NOTE: These additions might require you to work within the `aimakerspace` library - that's expected!

> NOTE: If you're not sure where to start - ask Cursor (CMD/CTRL+L) to guide you through the changes!
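
If you pick the distance-metric suggestion, the main thing to watch is the sort direction. The sketch below uses Euclidean distance with hypothetical helper names (`euclidean_distance`, `nearest`) and toy two-dimensional vectors; how a metric is actually plugged into `VectorDatabase` depends on the library code and is not shown here.

```python
import numpy as np


def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Straight-line distance; smaller means more similar."""
    return float(np.linalg.norm(a - b))


def nearest(query: np.ndarray, store: dict[str, np.ndarray], k: int) -> list[tuple[str, float]]:
    """Rank stored vectors by distance to the query, closest first."""
    scored = [(text, euclidean_distance(query, vec)) for text, vec in store.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]  # ascending order


store = {
    "near": np.array([1.0, 1.0]),
    "far": np.array([5.0, 5.0]),
}
hits = nearest(np.array([0.0, 0.0]), store, k=1)
print(hits)
```

With cosine similarity the best match has the highest score, while with a distance the best match has the lowest, so the ranking step has to sort in the opposite direction.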

## RAG Performance Comparison: Text vs Text + PDF

Let's demonstrate the impact of adding PDF data to our RAG system by comparing results before and after augmentation.


In [16]:
# Demo: RAG Pipeline Performance Comparison
import asyncio
from aimakerspace.openai_utils.embedding import EmbeddingModel
from aimakerspace.vectordatabase import VectorDatabase
from aimakerspace.text_utils import TextFileLoader, PDFFileLoader, CharacterTextSplitter

# Test query
test_query = "What is the Michael Eisner Memorial Weak Executive Problem?"

print("=" * 80)
print("RAG PIPELINE COMPARISON: TEXT vs TEXT + PDF")
print("=" * 80)

# ============================================================================
# PHASE 1: RAG Pipeline with TEXT DATA ONLY
# ============================================================================
print("\n📄 PHASE 1: RAG Pipeline with TEXT DATA ONLY")
print("-" * 50)

# Load only text data
text_loader = TextFileLoader("data/PMarcaBlogs.txt")
text_documents = text_loader.load_documents()
print(f"✅ Loaded {len(text_documents)} text document(s)")

# Chunk text data
text_splitter = CharacterTextSplitter()
text_chunks = text_splitter.split_texts(text_documents)
print(f"✅ Created {len(text_chunks)} text chunks")

# Create vector database with text data only
embed_model = EmbeddingModel(
    embeddings_model_name="text-embedding-3-large",
    dimensions=1024
)
vector_db_text_only = VectorDatabase(embedding_model=embed_model)

print("🔄 Building vector database with text data only...")
vector_db_text_only = asyncio.run(vector_db_text_only.abuild_from_list(text_chunks))
print(f"✅ Vector database built with {len(vector_db_text_only.vectors)} vectors")

# Create RAG pipeline with text-only data
rag_pipeline_text_only = RetrievalAugmentedQAPipeline(
    vector_db_retriever=vector_db_text_only,
    llm=chat_openai,
    response_style="detailed",
    include_scores=True
)

print(f"\n🔍 Query: '{test_query}'")
print("\nRAG Pipeline Results with TEXT-ONLY data:")
text_result = rag_pipeline_text_only.run_pipeline(
    test_query,
    k=5,
    response_length="comprehensive", 
    include_warnings=True,
    confidence_required=True
)
print(f"Response: {text_result['response']}")
print(f"Context sources: {text_result['context_count']}")
print(f"Similarity scores: {text_result['similarity_scores'][:3] if text_result['similarity_scores'] else 'N/A'}")

# ============================================================================
# PHASE 2: AUGMENT EXISTING DATABASE WITH PDF DATA
# ============================================================================
print("\n\n📄📄 PHASE 2: AUGMENTING DATABASE WITH PDF DATA")
print("-" * 50)

# Load PDF data
pdf_loader = PDFFileLoader("data/2A_Forbes.pdf")
pdf_documents = pdf_loader.load_documents()
print(f"✅ Loaded {len(pdf_documents)} PDF document(s)")

# Chunk PDF data
pdf_chunks = text_splitter.split_texts(pdf_documents)
print(f"✅ Created {len(pdf_chunks)} PDF chunks")

# Augment the existing vector database with PDF data
print("🔄 Augmenting existing vector database with PDF data...")
vector_db_augmented = asyncio.run(vector_db_text_only.abuild_from_list(pdf_chunks))
print(f"✅ Augmented database now has {len(vector_db_augmented.vectors)} vectors")
print(f"   (Original: {len(text_chunks)} + Added: {len(pdf_chunks)})")

# Create RAG pipeline with augmented data
rag_pipeline_augmented = RetrievalAugmentedQAPipeline(
    vector_db_retriever=vector_db_augmented,
    llm=chat_openai,
    response_style="detailed",
    include_scores=True
)

print(f"\n🔍 Query: '{test_query}'")
print("\nRAG Pipeline Results with AUGMENTED data:")
augmented_result = rag_pipeline_augmented.run_pipeline(
    test_query,
    k=5,
    response_length="comprehensive", 
    include_warnings=True,
    confidence_required=True
)
print(f"Response: {augmented_result['response']}")
print(f"Context sources: {augmented_result['context_count']}")
print(f"Similarity scores: {augmented_result['similarity_scores'][:3] if augmented_result['similarity_scores'] else 'N/A'}")

# ============================================================================
# COMPARISON SUMMARY
# ============================================================================
print("\n\n📊 COMPARISON SUMMARY")
print("=" * 50)
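# abuild_from_list returns the object it was called on, so both names now refer to the
# augmented database; compare against the original chunk count instead.
print(f"Additional vectors from PDF: {len(vector_db_augmented.vectors) - len(text_chunks)}")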

print(f"\nRAG Pipeline Performance:")
print(f"Text-only sources:  {text_result['context_count']}")
print(f"Augmented sources:  {augmented_result['context_count']}")

# Compare top similarity scores
text_top_score = float(text_result['similarity_scores'][0].split(': ')[1]) if text_result['similarity_scores'] else 0
augmented_top_score = float(augmented_result['similarity_scores'][0].split(': ')[1]) if augmented_result['similarity_scores'] else 0
score_improvement = augmented_top_score - text_top_score
print(f"🎯 Top similarity score improvement: {score_improvement:+.3f} ({'✅ Better' if score_improvement > 0 else '❌ No change' if score_improvement == 0 else '⚠️ Worse'})")

print(f"\n💡 Key Insight: This shows how you can incrementally improve your RAG system")
print(f"   by adding new data sources without rebuilding from scratch!")


RAG PIPELINE COMPARISON: TEXT vs TEXT + PDF

📄 PHASE 1: RAG Pipeline with TEXT DATA ONLY
--------------------------------------------------
✅ Loaded 1 text document(s)
✅ Created 373 text chunks
🔄 Building vector database with text data only...
✅ Vector database built with 373 vectors

🔍 Query: 'What is the Michael Eisner Memorial Weak Executive Problem?'

RAG Pipeline Results with TEXT-ONLY data:
Response: The Michael Eisner Memorial Weak Executive Problem refers to the tendency of CEOs or startup founders to hire weak executives for the very function or specialty that brought them to their leadership position. This happens because the CEO or founder has a hard time letting go of that function and wants to continue being the primary authority or expert in that area. The example given is Michael Eisner, who was a brilliant TV network executive before becoming CEO of Disney; after buying ABC, the network fell to fourth place, and Eisner expressed a desire to personally turn it around if 
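
Phase 2 above augments the database by calling `abuild_from_list` on the existing `vector_db_text_only` object. The sketch below illustrates what in-place augmentation implies for the size comparison. It uses a toy store and a placeholder embedding function. `ToyVectorStore`, `build_from_list`, and `fake_embed` are hypothetical names for illustration only, and the real `VectorDatabase` may behave differently.

```python
import numpy as np


class ToyVectorStore:
    """Hypothetical stand-in for a vector database: maps text -> embedding."""

    def __init__(self):
        self.vectors = {}

    def build_from_list(self, texts, embed):
        # Adds to the existing store in place and returns the same object.
        for text in texts:
            self.vectors[text] = np.asarray(embed(text), dtype=float)
        return self


def fake_embed(text):
    # Deterministic placeholder, NOT a real embedding model.
    return [len(text), text.count(" "), text.count("e")]


store = ToyVectorStore().build_from_list(["text chunk one", "text chunk two"], fake_embed)
size_before = len(store.vectors)  # snapshot the size BEFORE augmenting

augmented = store.build_from_list(["pdf chunk one"], fake_embed)

print(augmented is store)                            # both names point at one store
print(len(augmented.vectors) - len(store.vectors))   # always 0 for the same object
print(len(augmented.vectors) - size_before)          # vectors actually added
```

If the store is mutated in place, any "before vs after" comparison needs a size (or a copy) captured before the new chunks are inserted.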

## Embedding Model Comparison: Large vs Small

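Let's compare how different embedding models affect RAG pipeline performance. We'll run the same queries against `text-embedding-3-large` (shortened to 1024 dimensions via the `dimensions` parameter; its default is 3072) and `text-embedding-3-small` (default 1536 dimensions). Note that, as configured here, the "large" model produces the shorter vectors.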
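
To make "fewer dimensions" concrete, here is a minimal sketch of one common way to shorten an embedding: keep the leading components and rescale to unit length. It uses random vectors rather than real embeddings, and `shorten_embedding` is a hypothetical helper, not part of `aimakerspace` or the OpenAI SDK.

```python
import numpy as np


def shorten_embedding(vector, n_dims):
    """Keep the first n_dims components and rescale to unit length."""
    shortened = np.asarray(vector, dtype=float)[:n_dims]
    norm = np.linalg.norm(shortened)
    return shortened if norm == 0 else shortened / norm


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)
full_a = rng.normal(size=3072)  # random placeholder for a full-size embedding
full_b = rng.normal(size=3072)

short_a = shorten_embedding(full_a, 1024)
short_b = shorten_embedding(full_b, 1024)

print(short_a.shape)                              # (1024,)
print(round(float(np.linalg.norm(short_a)), 6))   # 1.0
print(cosine_similarity(full_a, full_b))
print(cosine_similarity(short_a, short_b))
```

The similarity between the same pair of vectors generally changes after shortening, which is one reason raw scores from different models or dimension settings are only loosely comparable.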


In [17]:
# Embedding Model Performance Comparison
import asyncio
from aimakerspace.openai_utils.embedding import EmbeddingModel
from aimakerspace.vectordatabase import VectorDatabase

print("=" * 80)
print("EMBEDDING MODEL COMPARISON: LARGE vs SMALL")
print("=" * 80)

# Test query
test_query = "What is the Michael Eisner Memorial Weak Executive Problem?"


# ============================================================================
# LARGE MODEL (text-embedding-3-large, 1024 dimensions)
# ============================================================================
print("\n🔬 TESTING LARGE MODEL (text-embedding-3-large, 1024 dims)")
print("-" * 60)

# Create large model
large_embed_model = EmbeddingModel(
    embeddings_model_name="text-embedding-3-large",
    dimensions=1024
)
vector_db_large = VectorDatabase(embedding_model=large_embed_model)

print("🔄 Building vector database with large model...")
vector_db_large = asyncio.run(vector_db_large.abuild_from_list(split_documents))
print(f"✅ Large model database built with {len(vector_db_large.vectors)} vectors")

# Create RAG pipeline with large model
rag_pipeline_large = RetrievalAugmentedQAPipeline(
    vector_db_retriever=vector_db_large,
    llm=chat_openai,
    response_style="detailed",
    include_scores=True
)

print(f"\n🔍 Query: '{test_query}'")
print("Large Model Results:")
large_result = rag_pipeline_large.run_pipeline(
    test_query,
    k=5,
    response_length="comprehensive", 
    include_warnings=True,
    confidence_required=True
)
print(f"Response: {large_result['response']}")
print(f"Context sources: {large_result['context_count']}")
print(f"Top similarity scores: {large_result['similarity_scores'][:3] if large_result['similarity_scores'] else 'N/A'}")

# ============================================================================
# SMALL MODEL (text-embedding-3-small, 1536 dimensions)
# ============================================================================
print("\n\n🔬 TESTING SMALL MODEL (text-embedding-3-small, 1536 dims)")
print("-" * 60)

# Create small model
small_embed_model = EmbeddingModel(
    embeddings_model_name="text-embedding-3-small",
    dimensions=None  # Use default 1536 dimensions
)
vector_db_small = VectorDatabase(embedding_model=small_embed_model)

print("🔄 Building vector database with small model...")
vector_db_small = asyncio.run(vector_db_small.abuild_from_list(split_documents))
print(f"✅ Small model database built with {len(vector_db_small.vectors)} vectors")

# Create RAG pipeline with small model
rag_pipeline_small = RetrievalAugmentedQAPipeline(
    vector_db_retriever=vector_db_small,
    llm=chat_openai,
    response_style="detailed",
    include_scores=True
)

print(f"\n🔍 Query: '{test_query}'")
print("Small Model Results:")
small_result = rag_pipeline_small.run_pipeline(
    test_query,
    k=5,
    response_length="comprehensive", 
    include_warnings=True,
    confidence_required=True
)
print(f"Response: {small_result['response']}")
print(f"Context sources: {small_result['context_count']}")
print(f"Top similarity scores: {small_result['similarity_scores'][:3] if small_result['similarity_scores'] else 'N/A'}")

# ============================================================================
# DETAILED COMPARISON
# ============================================================================
print("\n\n📊 DETAILED COMPARISON")
print("=" * 50)

# Compare top similarity scores
large_top_score = float(large_result['similarity_scores'][0].split(': ')[1]) if large_result['similarity_scores'] else 0
small_top_score = float(small_result['similarity_scores'][0].split(': ')[1]) if small_result['similarity_scores'] else 0
score_difference = large_top_score - small_top_score

print(f"Model Specifications:")
print(f"  Large:  text-embedding-3-large  (1024 dimensions)")
print(f"  Small:  text-embedding-3-small  (1536 dimensions)")

print(f"\nPerformance Metrics:")
print(f"  Large model top score:  {large_top_score:.3f}")
print(f"  Small model top score:  {small_top_score:.3f}")
print(f"  Score difference:       {score_difference:+.3f}")


print(f"\n🎯 Analysis:")
if score_difference > 0.01:
    print(f"  ✅ Large model performs better (score difference: +{score_difference:.3f})")
elif score_difference < -0.01:
    print(f"  ✅ Small model performs better (score difference: {score_difference:.3f})")
else:
    print(f"  ⚖️  Both models perform similarly (score difference: {score_difference:.3f})")

print(f"\n💡 Key Insights:")
print(f"  • Large model is shortened to 1024 dims here; small model uses its default 1536 dims")
print(f"  • Scores come from different embedding spaces, so treat cross-model gaps as a rough signal only")
print(f"  • Dimension count doesn't always correlate with performance")
print(f"  • Choose based on your specific use case and data characteristics")


EMBEDDING MODEL COMPARISON: LARGE vs SMALL

🔬 TESTING LARGE MODEL (text-embedding-3-large, 1024 dims)
------------------------------------------------------------
🔄 Building vector database with large model...
✅ Large model database built with 462 vectors

🔍 Query: 'What is the Michael Eisner Memorial Weak Executive Problem?'
Large Model Results:
Response: The "Michael Eisner Memorial Weak Executive Problem" refers to a phenomenon where a CEO, having risen through a particular functional area, intentionally hires a weak executive to run that same function in their organization. This is done so the CEO can retain personal control and continue to dominate that area, rather than delegating it effectively to a strong executive. The term is named after Michael Eisner, former CEO of Disney, who had been a brilliant TV network executive but, after acquiring ABC for Disney, saw the network fall to fourth place. Instead of empowering a strong executive, Eisner reportedly suggested that if he ha
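
The cells in this section repeat the same expression to pull the top score out of `similarity_scores`. A small helper can centralize that parsing. This is a sketch under the assumption that each entry is a string shaped like `"<label>: <score>"`, which is what the `split(': ')[1]` calls imply; the `"Source 1: ..."` labels in the example and the helper name `top_similarity_score` are made up for illustration.

```python
def top_similarity_score(result, default=0.0):
    """Return the numeric part of the first 'label: score' entry, or default."""
    scores = result.get("similarity_scores") or []
    if not scores:
        return default
    try:
        # rsplit takes the text after the LAST ': ', so a label containing ': ' still parses.
        return float(scores[0].rsplit(": ", 1)[1])
    except (IndexError, ValueError):
        # Entry did not match the assumed 'label: score' shape.
        return default


example = {"similarity_scores": ["Source 1: 0.578", "Source 2: 0.541"]}
print(top_similarity_score(example))
print(top_similarity_score({"similarity_scores": []}))
print(top_similarity_score({"similarity_scores": ["no score here"]}))
```

Falling back to a default on malformed entries keeps a comparison loop running, at the cost of hiding parse failures, so logging the bad entry may be preferable in practice.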

In [18]:
# Multi-Query Embedding Model Comparison
print("\n" + "=" * 80)
print("MULTI-QUERY EMBEDDING MODEL COMPARISON")
print("=" * 80)

# Test several queries to get a broader (though still small-sample) comparison
test_queries = [
    "What is the Michael Eisner Memorial Weak Executive Problem?",
    "What are the characteristics of weak executive leadership?",
    "How does board loyalty bias affect corporate governance?",
    "What are the consequences of destructive corporate leadership?"
]

print("Testing both models across multiple queries for comprehensive comparison...")

large_scores = []
small_scores = []

for query_idx, query in enumerate(test_queries, 1):
    print(f"\n🔍 QUERY {query_idx}: '{query}'")
    print("-" * 60)
    
    # Test with large model
    large_result = rag_pipeline_large.run_pipeline(
        query,
        k=6,
        response_length="detailed", 
        include_warnings=False,
        confidence_required=True
    )
    
    # Test with small model
    small_result = rag_pipeline_small.run_pipeline(
        query,
        k=6,
        response_length="detailed", 
        include_warnings=False,
        confidence_required=True
    )
    
    # Extract top scores
    large_score = float(large_result['similarity_scores'][0].split(': ')[1]) if large_result['similarity_scores'] else 0
    small_score = float(small_result['similarity_scores'][0].split(': ')[1]) if small_result['similarity_scores'] else 0
    
    large_scores.append(large_score)
    small_scores.append(small_score)
    
    # Show results
    print(f"📊 Large Model:  {large_score:.3f} (sources: {large_result['context_count']})")
    print(f"📊 Small Model:  {small_score:.3f} (sources: {small_result['context_count']})")
    
    difference = large_score - small_score
    print(f"📈 Difference:   {difference:+.3f} ({'Large better' if difference > 0 else 'Small better' if difference < 0 else 'Tie'})")

# Overall comparison
print(f"\n\n📊 OVERALL PERFORMANCE SUMMARY")
print("=" * 50)

avg_large_score = sum(large_scores) / len(large_scores)
avg_small_score = sum(small_scores) / len(small_scores)
overall_difference = avg_large_score - avg_small_score

print(f"Average Performance:")
print(f"  Large model:  {avg_large_score:.3f}")
print(f"  Small model:  {avg_small_score:.3f}")
print(f"  Difference:   {overall_difference:+.3f}")

# Count wins
large_wins = sum(1 for large, small in zip(large_scores, small_scores) if large > small)
small_wins = sum(1 for large, small in zip(large_scores, small_scores) if small > large)
ties = len(large_scores) - large_wins - small_wins

print(f"\nQuery-by-Query Results:")
print(f"  Large model wins:  {large_wins}/{len(test_queries)} queries")
print(f"  Small model wins:  {small_wins}/{len(test_queries)} queries")
print(f"  Ties:              {ties}/{len(test_queries)} queries")

print(f"\n🎯 Final Verdict:")
if overall_difference > 0.01:
    print(f"  🏆 LARGE MODEL WINS! (Average advantage: +{overall_difference:.3f})")
    print(f"     The large model with 1024 dimensions performs better for this dataset.")
elif overall_difference < -0.01:
    print(f"  🏆 SMALL MODEL WINS! (Average advantage: +{abs(overall_difference):.3f})")
    print(f"     The small model with 1536 dimensions performs better for this dataset.")
else:
    print(f"  🤝 IT'S A TIE! (Difference: {overall_difference:.3f})")
    print(f"     Both models perform similarly for this dataset.")

print(f"\n💡 Practical Recommendations:")
print(f"  • For production: Choose the model with better average performance")
print(f"  • For cost optimization: Small model is cheaper to run")
print(f"  • For accuracy: Large model may be better for specific domains")
print(f"  • For general use: Test both with your specific data and queries")



MULTI-QUERY EMBEDDING MODEL COMPARISON
Testing both models across multiple queries for comprehensive comparison...

🔍 QUERY 1: 'What is the Michael Eisner Memorial Weak Executive Problem?'
------------------------------------------------------------
📊 Large Model:  0.578 (sources: 6)
📊 Small Model:  0.666 (sources: 6)
📈 Difference:   -0.088 (Small better)

🔍 QUERY 2: 'What are the characteristics of weak executive leadership?'
------------------------------------------------------------
📊 Large Model:  0.557 (sources: 6)
📊 Small Model:  0.570 (sources: 6)
📈 Difference:   -0.013 (Small better)

🔍 QUERY 3: 'How does board loyalty bias affect corporate governance?'
------------------------------------------------------------
📊 Large Model:  0.740 (sources: 6)
📊 Small Model:  0.691 (sources: 6)
📈 Difference:   +0.049 (Large better)

🔍 QUERY 4: 'What are the consequences of destructive corporate leadership?'
------------------------------------------------------------
📊 Large Model:  0.677