# Final Exercise

This exercise brings together what you have learnt in the previous exercises.

The `source_documents` folder contains a number of CVs of fictional people.

The first task is to embed this data into a vector database. The second is to create retrieval and generation functions that make use of this data.

You can make the RAG step as simple or as complex as you like - consider some of the following questions:

* how best to chunk the data?
* what information to pass to the generation step? Is any preprocessing/augmentation needed?
* are there any metrics you can incorporate into your pipeline at runtime? (NOTE: avoid using recall as this can become computationally expensive)
* what possible data poisoning attacks might be relevant to this exercise? How can you protect against them?
* ... be creative and think of other improvements you may like to implement
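
As one possible starting point for the chunking question, the sketch below splits a Markdown document at its headings and falls back to overlapping fixed-size windows for long sections. It assumes the CVs use `#`-style headings, which the exercise does not state. The function name `chunk_markdown` and the parameters `max_chars` and `overlap` are illustrative and not part of the course code.

```python
import re


def chunk_markdown(document, max_chars=800, overlap=100):
    """Split Markdown text into heading-based chunks (hypothetical helper).

    Assumes sections start with '#'-style headings. Sections longer than
    max_chars are cut into overlapping fixed-size windows.
    """
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")

    # Split just before every line that starts with 1-6 '#' characters and a space.
    sections = re.split(r"(?m)^(?=#{1,6} )", document)

    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if len(section) <= max_chars:
            chunks.append(section)
            continue
        # Long section: slide a window, stepping by (max_chars - overlap).
        step = max_chars - overlap
        for start in range(0, len(section), step):
            chunks.append(section[start:start + max_chars])
            if start + max_chars >= len(section):
                break
    return chunks
```

Each chunk could then be stored under its own ID, for example by combining the document ID with the chunk index. Whether the collection accepts such IDs is an assumption to check against `src.chroma_db`.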



In [None]:
# Initial setup
import glob
import os
import sys

import dotenv
import openai
from IPython.display import display, Markdown

# Add the project root to sys.path *before* importing from src,
# so the imports below resolve when the notebook is run from its own folder.
project_root = os.path.abspath(os.path.join(os.getcwd(), '..', '..'))
if project_root not in sys.path:
    sys.path.insert(0, project_root)

from src.utils import OPENAI_API_KEY
from src.chroma_db import VectorCollection, VectorDBItem, OpenAIEmbeddingModel, get_chromadb_client, remove_collection


OPENAI_MODEL = 'gpt-4-turbo'  # 128,000-token context window
SCHEMA_NAME = "final_exercise_embeddings"
COLLECTION_NAME = "final_exercise_collection"

In [6]:
chroma_client = get_chromadb_client(SCHEMA_NAME)
client_openai = openai.OpenAI(
    api_key=OPENAI_API_KEY
)

collection = VectorCollection(
    name=COLLECTION_NAME,
    client=chroma_client, 
    token=OPENAI_API_KEY
)

In [None]:


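def load_documents():
    """Read every Markdown CV in source_documents into a list of (doc_id, content) tuples."""
    documents = []
    # sorted() keeps the load order stable between runs
    for file_path in sorted(glob.glob("source_documents/*.md")):
        with open(file_path, 'r', encoding='utf-8') as f:
            content = f.read()
        # Use the file name without its extension as the document ID (works on any OS)
        doc_id = os.path.splitext(os.path.basename(file_path))[0]
        documents.append((doc_id, content))
    return documents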


documents = load_documents()
for doc_id, content in documents:
    collection.add_item(content, doc_id)

In [None]:

"""
Try to implement one of the chunking methods to improve the performance of your RAG!
"""


def chunk(document):
    """Split a document into smaller pieces for embedding (left for you to implement)."""
    pass




In [None]:

def retrieve(query, top_k):
    """ Retrieve the top_k items most similar to the given query and return their text """

    results = collection.similar_items(query, n_results=top_k)
    return [result.text for result in results]



def generate(query, context, max_tokens=500):
    """ Wrapper on the OpenAI generation method, which combines the retrieved context together with the user query """

    prompt = f"Context: {context}\n\nQuestion: {query}\n\nAnswer:"
    response = client_openai.chat.completions.create(
        model=OPENAI_MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens
    )

    return response.choices[0].message.content


def rag(query, top_k=3):
    """ The entire RAG pipeline (Retrieve from VectorDB -> Generate) """
    
    retrieved_docs = retrieve(query, top_k)
    context = "\n\n".join(retrieved_docs)

    return generate(query, context)

In [20]:
answer = rag("My car broke down. Who can help me?")

display(Markdown(answer))

Given your situation with a car breakdown, David "Dusty" Miller would be best suited to help you. With his background as a mechanic and his practical skills in vehicle maintenance and repair, Dusty has the expertise necessary to diagnose and fix problems with cars, especially given his experience and passion for restoring vintage vehicles. You can be confident that Dusty's mechanical skills will come in handy in getting your car back up and running.
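
One of the questions above asks about runtime metrics. A cheap, label-free option is a lexical groundedness score: the share of the answer's words that also appear in the retrieved context. The sketch below is an illustration and not part of the course code. The name `groundedness` and the small stop-word list are hypothetical, and word overlap is only a rough proxy for faithfulness.

```python
import re

# Tiny illustrative stop-word list (an assumption; extend it for real use).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "with",
              "for", "be", "would", "you", "your"}


def _content_words(text):
    """Lowercase the text and return its set of non-stop-word tokens."""
    return {w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOP_WORDS}


def groundedness(answer, context):
    """Return the fraction of the answer's content words present in the context (0.0 to 1.0)."""
    answer_words = _content_words(answer)
    if not answer_words:
        return 0.0
    return len(answer_words & _content_words(context)) / len(answer_words)
```

To use it, `rag` would need to return or log the context alongside the answer so that `groundedness(answer, context)` can be computed per query. A low score suggests the answer leans on information that was not in the retrieved CVs.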