# RAG
Remember this image?

<p align="center">
    <img src="./img/pipeline.png" alt="drawing" style="width:800px;"/>
</p>

Let's implement it using the methods we have learnt so far.

We will build a very basic RAG system over the Instructor documentation. Previously we used BM25 for retrieval; this time we will use ChromaDB, a vector database, instead.

In [None]:
from openai import OpenAI

api_key = "abcd1234"  # placeholder: replace with your own OpenAI API key

client = OpenAI(api_key = api_key)

Let's first find out if `gpt-4o-mini` can help us extract receipt information using the Instructor library.

In [None]:
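def generate(text: str, **kwargs):
    """Send a single user message to gpt-4o-mini and return the full response object."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "user", "content": text},
        ],
        **kwargs,
    )

    # The caller reads the text from response.choices[0].message.content
    return response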

response = generate("I am trying to extract some data from a picture of a receipt using the Instructor library. Can you help me with that?")

In [None]:
print(response.choices[0].message.content)

Obviously this is not using the Instructor library. Does `gpt-4o-mini` even know about Instructor?

In [None]:
response = generate("Are you aware of the python Instructor library?")
print(response.choices[0].message.content)

OK, so it doesn't. The `data/docs` folder contains the documentation for the Instructor library. We will use this documentation to build a RAG system.

We are going to follow a similar pattern to the `BM25Search` class. In the class definition below, fill in the missing components.

As a reminder, the `BM25Search` class had the following methods:

```python
class BM25Search

    def __init__()

    def read_file()

    def load_documents()

    def fit_from_directory()

    def search()
```


```python
class ChromaSearch
    # Same named methods as BM25Search
    def __init__()

    def read_file()

    def load_documents()

    def fit_from_directory()

    def search()

    # New methods
    def get_embedding()

    def embedding_fn()

    def clear_index()

    def format_context()

    def search_and_format()

```

So there is a little bit of work to do.

In the `utils.py` file, we have the `TemplateManager` class which we have already met, but we have also defined a `BaseSearcher` class:

```python
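import os
from abc import ABC, abstractmethod
from pathlib import Path
from typing import List, Tuple, Union

# SimpleDirectoryReader is assumed to come from llama_index,
# e.g. `from llama_index.core import SimpleDirectoryReader`.

class BaseSearcher(ABC):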
    """Abstract base class defining the interface for search implementations."""
    
    def __init__(self):
        self.doc_paths = []
        self.documents = []
    
    @abstractmethod
    def fit_from_directory(self, directory: Union[str, Path], **kwargs):
        """Load and index documents from a directory."""
        pass
    
    @abstractmethod
    def search(self, query: str, n: int = 5) -> List[Tuple[float, str, str]]:
        """Search for documents matching the query."""
        pass

    def read_file(self, file_path):
        with open(file_path, 'r', encoding='utf-8') as file:
            return file.read()

    def load_documents(self, directory: Union[str, Path]) -> List[str]:
        files = SimpleDirectoryReader(directory, recursive=True).load_data()
        cwd = os.getcwd()
        
        unique_files = {}
        for doc in files:
            path = "."+doc.metadata['file_path'][len(cwd):]
            if path not in unique_files:
                unique_files[path] = doc.text
        
        self.doc_paths = list(unique_files.keys())
        return list(unique_files.values())
    
    def format_context(self, results: List[Tuple[float, str, str]]) -> str:
        """Format search results into a structured context string."""
        formatted_text = ""
        for score, file_path, content in results:
            # Extract title from file path (remove extension and path)
            title = os.path.splitext(os.path.basename(file_path))[0]
            
            formatted_text += f"title: {title}\n"
            formatted_text += f"similarity score: {score:.4f}\n"
            formatted_text += f"content: {content}\n\n"
        
        return formatted_text.strip()
    
    def search_and_format(self, query: str, n: int = 5) -> str:
        """Search and format results in one call."""
        results = self.search(query, n)
        return self.format_context(results)
```
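
To make the output of `format_context` concrete, here is a minimal sketch that copies its logic into a standalone function and runs it on two made-up results. The file paths, scores, and contents are hypothetical and only illustrate the `(score, file_path, content)` tuple shape.

```python
import os


def format_context(results: list[tuple[float, str, str]]) -> str:
    """Standalone copy of BaseSearcher.format_context, for illustration only."""
    formatted_text = ""
    for score, file_path, content in results:
        # The title is the file name without its directory or extension
        title = os.path.splitext(os.path.basename(file_path))[0]
        formatted_text += f"title: {title}\n"
        formatted_text += f"similarity score: {score:.4f}\n"
        formatted_text += f"content: {content}\n\n"
    return formatted_text.strip()


# Hypothetical search results: (score, file_path, content)
sample_results = [
    (0.8123, "./data/docs/hub/anthropic.md", "Using Instructor with Anthropic..."),
    (0.7456, "./data/docs/concepts/models.md", "Defining response models..."),
]

context = format_context(sample_results)
print(context)
```

Each result becomes a three-line block (title, score, content), and blocks are separated by a blank line. This string is what later gets passed to the model as the search results.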

The BM25 and Chroma searchers share the concrete methods defined here (`read_file`, `load_documents`, `format_context` and `search_and_format`). Because `fit_from_directory` and `search` are marked with `@abstractmethod`, every searcher must implement both of them.
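
A small sketch of what that requirement means in practice. The classes below are simplified, hypothetical stand-ins for `BaseSearcher` and its subclasses (they are not the ones in `utils.py`); they only show that Python refuses to instantiate a subclass that leaves an abstract method unimplemented.

```python
from abc import ABC, abstractmethod


class Searcher(ABC):
    """Simplified stand-in for BaseSearcher."""

    @abstractmethod
    def fit_from_directory(self, directory: str) -> None:
        ...

    @abstractmethod
    def search(self, query: str, n: int = 5) -> list[tuple[float, str, str]]:
        ...


class IncompleteSearcher(Searcher):
    """Implements only one of the two required methods."""

    def fit_from_directory(self, directory: str) -> None:
        pass


class CompleteSearcher(Searcher):
    """Implements both required methods (with trivial bodies)."""

    def fit_from_directory(self, directory: str) -> None:
        pass

    def search(self, query: str, n: int = 5) -> list[tuple[float, str, str]]:
        return []


try:
    IncompleteSearcher()
    instantiated = True
except TypeError as err:
    # Raised because `search` is still abstract
    instantiated = False
    print(f"Cannot instantiate: {err}")

complete = CompleteSearcher()  # works: both abstract methods are implemented
```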

In `utils` we have also refactored the `BM25Search` class to inherit from `BaseSearcher`. We will now construct a new class, `ChromaSearch`, that also inherits from `BaseSearcher`.

In [None]:
from utils import BaseSearcher, BM25Search

import chromadb

from pathlib import Path
import os

So we **must** implement the `fit_from_directory` and `search` methods in the `ChromaSearch` class. We also need a method that gets embeddings from the OpenAI API.

In [None]:
class ChromaSearch(BaseSearcher):
    def __init__(self, collection_name: str = "documents", model_name: str = "text-embedding-3-small"):
        super().__init__()
        self.model_name = model_name
        # Initialize the database client
        self.db_client = chromadb.PersistentClient(path=f"./databases/{collection_name}/")

        # Get or create the collection
        self.collection = self.db_client.get_or_create_collection(
            name=collection_name,
            metadata={"hnsw:space": "cosine"}
        )

        self.openai = OpenAI(api_key = api_key)


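    def get_embedding(self, text: str | list[str]):
        """Request embeddings from the OpenAI API and return the raw response.

        The vectors are in `response.data[i].embedding`, one per input string.
        """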
        # Ensure text is in the correct format
        if isinstance(text, list):
            input_text = text
        else:
            input_text = [text]
        
        # Get the embeddings
        response = self.openai.embeddings.create(
            model=self.model_name,
            input=input_text
        )
        return response


    def fit_from_directory(self, directory: str | Path):
        self.documents = self.load_documents(directory)
        # We will add items in batches of 100
        for i in range(0, len(self.documents), 100):
            batch_docs = self.documents[i:i+100]
            batch_paths = self.doc_paths[i:i+100]
            
            embeddings = self.get_embedding(batch_docs)
            embeddings = [embedding.embedding for embedding in embeddings.data]
            batch_ids = [str(j) for j in range(i, i + len(batch_docs))]

            # Add the embeddings to the collection
            self.collection.add(
                embeddings=embeddings,
                documents=batch_docs,
                metadatas=[{"file_path": path} for path in batch_paths],
                ids=batch_ids
            )
        
        print(f"Processed {len(self.documents)} documents")


    def search(self, query: str, n: int = 5) -> list[tuple[float, str, str]]:
        query_embedding = self.get_embedding(query).data[0].embedding

        # Search the collection
        results = self.collection.query(
            query_embeddings=[query_embedding],
            n_results=n,
            include=["documents", "metadatas", "distances"]
        )
        
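        # Convert each hit into a (score, file_path, text) tuple
        formatted_results = []
        for doc, metadata, distance in zip(
            results['documents'][0],
            results['metadatas'][0],
            results['distances'][0]
        ):
            # The collection uses cosine distance, so similarity = 1 - distance
            score = 1 - distance
            file_path = metadata['file_path']
            # Read the file contents from disk instead of using the stored `doc`
            text = self.read_file(file_path)
            formatted_results.append((score, file_path, text))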
            
        return sorted(formatted_results, key=lambda x: x[0], reverse=True)
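
The `score = 1 - distance` line in `search` depends on the collection being created with `"hnsw:space": "cosine"`. As far as we understand Chroma's cosine space, the reported distance is one minus the cosine similarity, so subtracting it from 1 recovers the similarity. Here is a minimal pure-Python sketch of that relationship, using made-up two-dimensional vectors instead of real embeddings:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings (hypothetical values)
query_vec = [1.0, 0.0]
candidates = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

scores = {}
for name, vec in candidates.items():
    similarity = cosine_similarity(query_vec, vec)
    distance = 1 - similarity  # assumed form of the cosine distance
    scores[name] = 1 - distance  # what `search` stores as the score
    print(f"{name}: distance={distance:.2f}, score={scores[name]:.2f}")
```

Under this assumption a score close to 1 means the document embedding points in nearly the same direction as the query embedding, and vector length plays no role.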

In [None]:
# Initialize the search
searcher = ChromaSearch(collection_name="instructor-docs", model_name="text-embedding-3-small")
searcher.fit_from_directory("data/docs")

bm25_searcher = BM25Search()
bm25_searcher.fit_from_directory("data/docs")
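
Behind the `fit_from_directory` call, `ChromaSearch` embeds the documents 100 at a time and gives each one a string ID equal to its position in the list. The sketch below shows how the slices and IDs line up. It uses a hypothetical `make_batches` helper and dummy documents, and makes no API or database calls.

```python
from typing import Iterator


def make_batches(items: list[str], batch_size: int = 100) -> Iterator[tuple[list[str], list[str]]]:
    """Yield (ids, batch) pairs, mirroring the loop in fit_from_directory."""
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        # IDs are the positions of the documents in the full list, as strings
        ids = [str(j) for j in range(start, start + len(batch))]
        yield ids, batch


# 250 dummy documents give two full batches and one partial batch
docs = [f"doc {k}" for k in range(250)]
batches = list(make_batches(docs))

sizes = [len(batch) for _, batch in batches]
last_ids, last_batch = batches[-1]
print(sizes)
print(last_ids[0], last_ids[-1])
```

Because the IDs are positional, they depend on the order in which the documents are loaded.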

Let's compare which documents each searcher returns for the same query.

In [None]:
query = "I am trying to extract some data from a picture of a receipt using the Instructor library. Can you help me with that?"

chroma_results = searcher.search(query)
bm25_results = bm25_searcher.search(query)

In [None]:
chroma_links = [result[1] for result in chroma_results]
bm25_links = [result[1] for result in bm25_results]

print("Chroma Results:")
for link in chroma_links:
    print(link)

print("\nBM25 Results:")
for link in bm25_links:
    print(link)
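
Reading two printed lists side by side gets tedious, so one option is to compute the overlap directly. This is a sketch only: the two path lists below are hypothetical placeholders, not the actual output of either searcher.

```python
# Hypothetical result paths, in ranked order (not real search output)
chroma_example = [
    "./data/docs/examples/receipts.md",
    "./data/docs/concepts/models.md",
    "./data/docs/hub/anthropic.md",
]
bm25_example = [
    "./data/docs/concepts/models.md",
    "./data/docs/index.md",
    "./data/docs/examples/receipts.md",
]

bm25_set = set(bm25_example)
chroma_set = set(chroma_example)

# List comprehensions keep the original ranking order
shared = [p for p in chroma_example if p in bm25_set]
only_chroma = [p for p in chroma_example if p not in bm25_set]
only_bm25 = [p for p in bm25_example if p not in chroma_set]

print("Returned by both:", shared)
print("Only Chroma:", only_chroma)
print("Only BM25:", only_bm25)
```

With the real results, `chroma_links` and `bm25_links` from the cell above can be passed in place of the example lists.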

## Chat integration
We want to be able to ask questions about our documentation. We have two prompts:

#### System Prompt
---
```jinja
You are an expert on the Instructor library. Instructor makes it easy to get structured data like JSON from LLMs like GPT-3.5, GPT-4, GPT-4-Vision, and open-source models including Mistral/Mixtral, Anyscale, Ollama, and llama-cpp-python.

You will be asked queries from a user. The query from the user will be used to query a database of Instructor documentation in markdown format. You will be provided with the top N search results from this database. Your task is to provide a response to the user based on the search results.
```
---

<br>

#### User Prompt
---
```jinja
Here are the search results from the database:

## Search Results ##
{{ results }}

Here is the user query:

## User Query ##
{{ query }}
```
---

In [None]:
from utils import TemplateManager

# Initialize the template manager
# template_manager = ...
template_manager = TemplateManager('./prompts')


In [None]:
def get_links(results: list[tuple[float, str, str]]) -> str:
    links = "\n\nHere are some links that might be useful:\n\n"
    base = "https://python.useinstructor.com"
    for result in results:
        # get the link name by removing the "docs" prefix and removing .md extension
        path = result[1].replace("./data/docs", "")
        path = path.replace(".md", "")
        links += f" - {base}{path}\n"

    return links


def generate_response(query: str, searcher: BaseSearcher, model: str = "gpt-4o-mini", n: int = 3) -> str:
    # First get the search results
    results = searcher.search(query, n=n)
    
    # Format the results using format_context
    formatted_results = searcher.format_context(results)

    # Get the system prompt and user prompt
    system_prompt = template_manager.render('system_prompt.jinja')
    user_input = template_manager.render('user_prompt.jinja', results=formatted_results, query=query)

    # Generate the response
    response = client.chat.completions.create(
        model=model,
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_input},
        ]
    ).choices[0].message.content

    links = get_links(results)
    response = f"{response}\n\n{links}"

    return response
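
To see what `get_links` does to a single path, here is a small sketch of the path-to-URL mapping. The `path_to_url` helper is hypothetical and is a variant of the notebook's code, not a copy: it uses `str.removeprefix` and `str.removesuffix` (Python 3.9+), which only strip the two ends, whereas `str.replace` removes every occurrence of the substring.

```python
def path_to_url(file_path: str, base: str = "https://python.useinstructor.com") -> str:
    """Map a local docs path such as ./data/docs/hub/anthropic.md to a docs URL."""
    path = file_path.removeprefix("./data/docs").removesuffix(".md")
    return f"{base}{path}"


def path_to_url_replace(file_path: str, base: str = "https://python.useinstructor.com") -> str:
    """The same mapping written with str.replace, as in get_links."""
    path = file_path.replace("./data/docs", "").replace(".md", "")
    return f"{base}{path}"


typical = "./data/docs/hub/anthropic.md"
# Contrived, hypothetical file name with ".md" in the middle
tricky = "./data/docs/concepts/my.md_notes.md"

print(path_to_url(typical))
print(path_to_url(tricky))
print(path_to_url_replace(tricky))
```

For typical paths both versions agree. They differ only when `.md` or the docs prefix also appears in the middle of a path.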

In [None]:
query = "Can you give me an example of how to use the Instructor library with Anthropic models?"

chroma_response = generate_response(query, searcher)
bm25_response = generate_response(query, bm25_searcher)

# Chroma Response

Certainly! Here’s an example of how to integrate the Instructor library with Anthropic models to create structured outputs using Pydantic models. This example demonstrates how to set up the client and generate a user model with specific properties.

### Step 1: Install Required Libraries

First, you need to install the necessary libraries. You can do this via pip:

```bash
pip install anthropic instructor[anthropic]
```

### Step 2: Set Up Your Python Script

Here’s a complete example:

```python
from pydantic import BaseModel
from typing import List
import anthropic
import instructor

# Patch the Anthropic client with the Instructor for enhanced capabilities
anthropic_client = instructor.from_anthropic(
    create=anthropic.Anthropic()
)

# Define your Pydantic models
class Properties(BaseModel):
    name: str
    value: str

class User(BaseModel):
    name: str
    age: int
    properties: List[Properties]

# Use the patched client to generate structured output
user_response = anthropic_client(
    model="claude-3-haiku-20240307",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": "Create a user for a model with a name, age, and properties.",
        }
    ],
    response_model=User,
)

print(user_response.model_dump_json(indent=2))
```

### Explanation of the Code

1. **Imports**: The script imports necessary classes and functions from Pydantic, along with the Anthropics and Instructor libraries.
2. **Client Setup**: It patches the Anthropic client with the Instructor client, enabling enhanced capabilities for generating structured outputs.
3. **Model Definitions**: Pydantic models `Properties` and `User` are defined to structure the content received from the model.
4. **Make a Request**: A request is made to the Anthropic model (in this case, “claude-3-haiku-20240307”) to create a user with specified attributes.
5. **Output**: The response is printed in a structured JSON format.

### Sample Output

The execution of this script would yield a structured output similar to the following:

```json
{
  "name": "John Doe",
  "age": 35,
  "properties": [
    {
      "name": "City",
      "value": "New York"
    },
    {
      "name": "Occupation",
      "value": "Software Engineer"
    }
  ]
}
```

This approach allows you to efficiently extract structured data from the Anthropic model's responses using the Instructor library. Happy coding!



Here are some links that might be useful:

 - https://python.useinstructor.com/hub/anthropic
 - https://python.useinstructor.com/blog/posts/anthropic
 - https://python.useinstructor.com/blog/posts/structured-output-anthropic



# BM25 Response

Certainly! You can use the Instructor library with Anthropic models to create user models with complex properties. Here’s an example that demonstrates how to integrate the Anthropic client with the Instructor client in Python:

### Installation
Make sure you have the Anthropic library installed:
```bash
pip install anthropic
```

### Example Code
Here's how you can set up the Instructor client with an Anthropic model:

```python
from pydantic import BaseModel
from typing import List
import anthropic
import instructor

# Patching the Anthropic client with the instructor for enhanced capabilities
client = instructor.from_anthropic(
    anthropic.Anthropic(),
)

# Define your data models
class Properties(BaseModel):
    name: str
    value: str

class User(BaseModel):
    name: str
    age: int
    properties: List[Properties]

# Make a request to create a user with specific attributes
user_response = client.chat.completions.create(
    model="claude-3-haiku-20240307",
    max_tokens=1024,
    max_retries=0,
    messages=[
        {
            "role": "user",
            "content": "Create a user for a model with a name, age, and properties.",
        }
    ],
    response_model=User,
)

# Print the resulting user model in JSON format
print(user_response.model_dump_json(indent=2))
```

### Expected Output
When you run this code, you might get a response similar to the following (the actual output will depend on the model's response):

```json
{
  "name": "John Doe",
  "age": 35,
  "properties": [
    {
      "name": "City",
      "value": "New York"
    },
    {
      "name": "Occupation",
      "value": "Software Engineer"
    }
  ]
}
```

### Explanation
- **Properties Class**: This defines a structure for properties that can be linked to a user, such as name and value pairs.
- **User Class**: This encapsulates the user's details, including their basic information and a list of properties.
- **Request to the Anthropic Model**: You send a request to the Anthropic model (replace `"claude-3-haiku-20240307"` with the model you wish to use), and specify that you want a response conforming to the `User` model structure.
- **Response Handling**: The `model_dump_json` method allows you to output the structured response as formatted JSON.

This integration enhances the capabilities of your application by allowing you to define detailed user profiles while leveraging the powerful language generation capabilities of Anthropic models.



Here are some links that might be useful:

 - https://python.useinstructor.com/prompting/zero_shot/role_prompting
 - https://python.useinstructor.com/prompting/thought_generation/chain_of_thought_few_shot/auto_cot
 - https://python.useinstructor.com/hub/anthropic