# Requirements
This notebook uses only three libraries:
- OpenAI: to call the LLM and the embedding model
- PyPDF2: to extract the text from each PDF
- ChromaDB: to create a vector database that stores the document chunks and their embeddings


In [30]:
import os
import chromadb
from chromadb.config import Settings
import re
from PyPDF2 import PdfReader
from openai import OpenAI

# Set up the OpenAI API key. Prefer exporting OPENAI_API_KEY in your shell
# over hardcoding it in the notebook.
os.environ["OPENAI_API_KEY"] = "sk-...."

client = OpenAI()

## Configure Chroma Vector Database

In [2]:
# Initialize a persistent Chroma client so the database is saved to disk
chroma_client = chromadb.PersistentClient(
    path="./chroma_db"
)

# Create the collection, or reuse it if this notebook has already been run
collection = chroma_client.get_or_create_collection("pdf_collection")

## Data processing

The idea here is to:
- Extract all the text from the PDFs
- Split the text into chunks
- Compute a vector representation (embedding) for each chunk
- Store each chunk and its embedding in ChromaDB
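
To make the chunking step concrete, here is a minimal sketch of overlapping word windows on a toy input. The function name `sliding_word_chunks` and the small window sizes are assumptions chosen for illustration, not part of the pipeline. This variant also rejects an overlap that is not smaller than the chunk size, and stops once a window reaches the end of the text so that the last chunk is never a pure repeat of the previous one.

```python
import re
from typing import List


def sliding_word_chunks(text: str, words_per_chunk: int = 5, overlap: int = 2) -> List[str]:
    """Split text into word windows where consecutive windows share `overlap` words."""
    if not 0 <= overlap < words_per_chunk:
        raise ValueError("overlap must be >= 0 and smaller than words_per_chunk")
    words = re.findall(r"\S+", text)
    step = words_per_chunk - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + words_per_chunk]))
        # Stop once this window has reached the last word
        if start + words_per_chunk >= len(words):
            break
    return chunks


# Toy input (hypothetical): eleven placeholder words w0 ... w10
toy_text = " ".join(f"w{i}" for i in range(11))
for chunk in sliding_word_chunks(toy_text):
    print(chunk)
```

With 11 words, a window of 5 and an overlap of 2, this prints three chunks, each starting with the last two words of the previous one.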

In [5]:
def extract_text_from_pdf(pdf_path):
    """
    Extract all the text from a PDF file.
    """
    with open(pdf_path, 'rb') as file:
        reader = PdfReader(file)
        text = ""
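        # Iterate through all pages; fall back to an empty string for pages
        # with no extractable text, and separate pages with a newline
        for page in reader.pages:
            text += (page.extract_text() or "") + "\n"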
    return text

def split_text_into_chunks(text, words_per_chunk=500, overlap=50):
    """
    Split text into overlapping chunks of at most `words_per_chunk` words;
    consecutive chunks share `overlap` words.
    """
    # Split text into whitespace-separated words
    words = re.findall(r'\S+', text)
    chunks = []
    # Create overlapping chunks
    for i in range(0, len(words), words_per_chunk - overlap):
        chunk = ' '.join(words[i:i + words_per_chunk])
        chunks.append(chunk)
    return chunks

def get_embedding(text):
    """
    Call OpenAI API to create embeddings for a given text.
    """
    # Call OpenAI API to generate embedding

    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding


def process_pdfs(folder_path):
    """
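    Process every PDF in `folder_path`:
    1. Extract the text
    2. Split the text into chunks
    3. Get an embedding for each chunk
    4. Store the chunks and their embeddings in the Chroma collection

    Note: the embeddings endpoint also accepts a list of inputs, so chunks
    could be sent in batches instead of one request per chunk.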
    """

    doc_id = 0
    for filename in os.listdir(folder_path):
        if filename.endswith('.pdf'):
            print('file: ', filename)
            pdf_path = os.path.join(folder_path, filename)
            
            # Extract text from PDF
            print('Extract text from pdf')
            text = extract_text_from_pdf(pdf_path)
            
            # Split text into chunks
            print('Split text')
            chunks = split_text_into_chunks(text)
            print('Chunks size : ', len(chunks))
            for i, chunk in enumerate(chunks):
                # Create embeddings using OpenAI
                embedding = get_embedding(chunk)
                
                # Add to Chroma one chunk at a time (collection.add also
                # accepts lists, so inserts could be batched)
                collection.add(
                    embeddings=[embedding],
                    documents=[chunk],
                    metadatas=[{"source": filename, "chunk": i}],
                    ids=[f"{filename}_chunk_{i}"]
                )
                doc_id += 1

    print(f"Processed and added {collection.count()} chunks to Chroma.")
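
The docstring above mentions that requests to OpenAI can be batched. A minimal sketch of that idea follows. The helper names (`batched`, `embed_in_batches`, `openai_embed_batch`) and the batch size of 64 are assumptions for illustration. The embedding call itself is injected as a function so the batching logic can be tried without network access.

```python
from typing import Callable, Iterator, List, Sequence

Embedding = List[float]


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[Embedding]],
    batch_size: int = 64,  # assumed value, tune to your rate limits
) -> List[Embedding]:
    """Embed all texts, one request per batch, preserving input order."""
    embeddings: List[Embedding] = []
    for batch in batched(texts, batch_size):
        embeddings.extend(embed_batch(batch))
    return embeddings


def openai_embed_batch(client, model: str = "text-embedding-3-small"):
    """Wrap an OpenAI client into a function that embeds a list of texts."""
    def _embed(batch: List[str]) -> List[Embedding]:
        response = client.embeddings.create(input=batch, model=model)
        # Each result carries the index of its input; sort to keep order explicit
        ordered = sorted(response.data, key=lambda item: item.index)
        return [item.embedding for item in ordered]
    return _embed
```

With the notebook's `client`, this could be used as `embed_in_batches(chunks, openai_embed_batch(client))`, and the resulting list passed to a single `collection.add(...)` call per file together with matching lists of documents, metadatas and ids.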



## Run the pipeline

In [8]:
data_folder = "./data"  # Replace with the path to your PDF folder
process_pdfs(data_folder)

file:  31555_1a_maj1_diagnostic.pdf
Extract text from pdf
Split text
Chunks size :  182
file:  at105_avril24_0.pdf
Extract text from pdf
Split text
Chunks size :  21
file:  a_toulouse_mars_2024.pdf
Extract text from pdf
Split text
Chunks size :  15
file:  livret_code_de_la_rue_170x230-v2.pdf
Extract text from pdf
Split text
Chunks size :  9
Processed and added 227 chunks to Chroma.


## Assemble everything

Once ChromaDB is populated, each user question goes through the following steps:
- Generate the embedding of the question
- Query ChromaDB with the question embedding to retrieve the most relevant chunks
- Send the retrieved chunks and the original question to the LLM to formulate an answer based on the provided context
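
To illustrate what "most relevant chunks" means in the retrieval step, here is a brute-force sketch that ranks chunks by cosine similarity to the question embedding. It is only an illustration: Chroma relies on its own index and configured distance metric, which may differ, and the two-dimensional vectors and function names below are invented for the example.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(
    query_embedding: Sequence[float],
    chunk_embeddings: List[Sequence[float]],
    chunks: List[str],
    k: int = 2,
) -> List[Tuple[float, str]]:
    """Return the k chunks whose embeddings are most similar to the query."""
    scored = [
        (cosine_similarity(query_embedding, emb), chunk)
        for emb, chunk in zip(chunk_embeddings, chunks)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy 2-dimensional "embeddings", invented for illustration
toy_chunks = ["composting rules", "bike lanes", "garden waste"]
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]]
toy_query = [1.0, 0.0]

for score, chunk in top_k_chunks(toy_query, toy_embeddings, toy_chunks):
    print(f"{score:.2f}  {chunk}")
```

A vector database performs the same kind of ranking, but with an index that avoids comparing the query against every stored vector.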

In [24]:
def query_chroma(query_embedding, n_results=5):
    """
    Query ChromaDB for relevant documents using the query embedding.
    """
    # Query ChromaDB
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results
    )
    # Return the first list of documents
    return results['documents'][0]

def generate_answer(question, context):
    """
    Generate an answer using OpenAI's API with the given context.
    """
    # Call OpenAI API
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"You are a helpful assistant. Use the following context to answer the user's question: {context}\n\n --------"},
            {"role": "user", "content": question}
        ]
    )
    # Extract and return the generated answer
    return response.choices[0].message.content

In [25]:
def rag_pipeline(question):
    """
    Executes the RAG pipeline.
    """
    # Generate embedding for the question
    question_embeddings = get_embedding(question)

    # Retrieve relevant chunks from ChromaDB
    relevant_chunks = query_chroma(question_embeddings)

    # Combine chunks into a single context string
    context = "\n\n".join(relevant_chunks)

    # Generate answer using the question and context
    answer = generate_answer(question, context)

    return answer
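
The pipeline above joins the retrieved chunks without saying where they came from. Since each chunk is stored with `source` and `chunk` metadata, one possible refinement is to label each chunk in the context so the model can cite its sources. The sketch below assumes the query result also exposes the metadata of the retrieved chunks (`results['metadatas'][0]`); the function name and the sample file names are hypothetical.

```python
from typing import Dict, List


def format_context_with_sources(documents: List[str], metadatas: List[Dict]) -> str:
    """Prefix each retrieved chunk with its source file and chunk index."""
    blocks = []
    for doc, meta in zip(documents, metadatas):
        source = meta.get("source", "unknown")
        chunk_index = meta.get("chunk", "?")
        blocks.append(f"[source: {source}, chunk: {chunk_index}]\n{doc}")
    return "\n\n".join(blocks)


# Sample data shaped like the per-query lists of a retrieval result
# (file names are placeholders, not files from this notebook)
sample_documents = ["First retrieved chunk.", "Second retrieved chunk."]
sample_metadatas = [
    {"source": "example_a.pdf", "chunk": 3},
    {"source": "example_b.pdf", "chunk": 0},
]
print(format_context_with_sources(sample_documents, sample_metadatas))
```

The returned string could replace the plain `"\n\n".join(relevant_chunks)` context, at the cost of a few extra tokens per chunk.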

## Call the RAG pipeline

In [29]:
prompt = "How to get a composteur gratuit"
r = rag_pipeline(prompt)
print(r)

To obtain a free composteur (composter) in Toulouse, you need to follow these steps:

1. **Complete the Biodéchets Sorting Training:**
   - The primary condition for receiving a free composter is to complete a training session on sorting biodéchets (biodegradable waste) provided by Toulouse Métropole.

2. **Attend the Training Based on Your Situation:**
   - If you have a garden and live in a house or a ground-floor apartment with a garden:
     - Participate in an individual training session that lasts half a day to learn how to compost in your garden.
   - If you live near a public garden and want to start a collective composting project:
     - Fill out a form to assess the feasibility of your project.
     - Attend a training session that lasts two days and a half-day to become equipped to help families sort biodéchets.
   - If you have encouraged your neighbors in your residence to collectively sort biodéchets:
     - Gather 10 or more families in your residence for the initiative