#**Retrieval-Augmented Generation (RAG)**
Retrieval-Augmented Generation (RAG) is a technique that improves the responses of Large Language Models (LLMs) by retrieving relevant documents at query time and passing them to the model as context. Grounding generation in retrieved text improves the accuracy, relevance, and factual consistency of the responses.



###**How RAG Works**
RAG consists of two main components:

Retrieval Component (Retriever)

1. Searches a database (such as a vector database like ChromaDB, Pinecone, or FAISS) to fetch relevant documents based on the user's query.
2. Uses embeddings to perform similarity search over stored knowledge.
3. Ensures that the model has access to up-to-date and relevant information, beyond its pre-trained knowledge.

Generation Component (LLM)

1. Receives the retrieved documents as context (the LLM can be, e.g., OpenAI GPT, Mistral, Llama 2, or a model hosted on Hugging Face).
2. Generates a response based on both the retrieved information and the knowledge it acquired during pre-training.
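
The interaction between the two components can be illustrated without any external service. The following is a minimal sketch, not the implementation used later in this notebook: it replaces embedding similarity with a simple shared-word count and replaces the LLM with a stub that echoes its prompt. The names `retrieve`, `build_prompt`, `answer`, `echo_llm`, and the sample `KNOWLEDGE_BASE` texts are all hypothetical.

```python
import re

# Hypothetical in-memory "knowledge base"; a real system stores embeddings in a vector database.
KNOWLEDGE_BASE = [
    "Employees may work remotely up to two days per week.",
    "Expense reports must be submitted within thirty days.",
    "The office cafeteria is open from 9 am to 5 pm.",
]

def tokens(text):
    """Lowercase word set, used as a toy stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, documents, k=1):
    """Retriever: rank documents by the number of words they share with the query."""
    q = tokens(query)
    ranked = sorted(documents, key=lambda d: len(q & tokens(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, context_docs):
    """Place the retrieved text in front of the question."""
    context = "\n".join(context_docs)
    return f"Context: {context}\n\nQuestion: {query}\n\nAnswer:"

def answer(query, documents, llm):
    """RAG flow: retrieve, build the prompt, then call any function that maps a prompt to text."""
    return llm(build_prompt(query, retrieve(query, documents)))

def echo_llm(prompt):
    """Fake LLM that returns its prompt, so the flow runs without an API key."""
    return prompt

result = answer("When must expense reports be submitted?", KNOWLEDGE_BASE, echo_llm)
print(result)
```

Passing the model as a plain callable keeps the retrieval step independent of any particular LLM provider.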

###**Steps to perform RAG**

1. Load Dataset (the documents we will run RAG over)
2. Read Files
3. Split Text
4. Create Index
5. Load Index
6. Similarity Search and Response Generation

###**1. Load Dataset: Download the `hrdataset.zip` File from the CloudYuga GitHub Repo**

The file is saved in the notebook's current working directory (e.g., /content/ in Google Colab).

In [1]:
!wget https://github.com/cloudyuga/mastering-genai-w-python/raw/refs/heads/main/hrdataset.zip

--2025-05-23 12:04:14--  https://github.com/cloudyuga/mastering-genai-w-python/raw/refs/heads/main/hrdataset.zip
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/cloudyuga/mastering-genai-w-python/refs/heads/main/hrdataset.zip [following]
--2025-05-23 12:04:14--  https://raw.githubusercontent.com/cloudyuga/mastering-genai-w-python/refs/heads/main/hrdataset.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9530 (9.3K) [application/zip]
Saving to: ‘hrdataset.zip.2’


2025-05-23 12:04:14 (71.0 MB/s) - ‘hrdataset.zip.2’ saved [9530/9530]



###**Unzip `hrdataset.zip` file**
- This creates an **`hrdataset`** folder in the current working directory (/content/ in Google Colab).

In [2]:
# -o overwrites existing files without prompting, so the cell can be re-run safely
!unzip -o hrdataset.zip

In [3]:
# Recursively list the extracted files (Markdown and CSV)
!ls -R hrdataset/


hrdataset/:
employees  policies  surveys

hrdataset/employees:
101_Priya_Sharma.md  105_Sunita_Patil.md     109_Meera_Iyer.md
102_Rohit_Mehra.md   106_Neha_Malhotra.md    110_Aditya_Jain.md
103_Anjali_Das.md    107_Amit_Verma.md	     payroll_information.md
104_Karan_Kapoor.md  108_Rajesh_Kulkarni.md

hrdataset/policies:
employee_benefits.md  holiday_calendar.md  training_and_development.md
events_calendar.md    leave_policies.md

hrdataset/surveys:
Employee_Culture_Survey_Responses.csv
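
Outside a notebook, where `!unzip` and `!ls` are not available, the same extract-and-list step can be sketched with the standard library. This is an illustrative alternative, not part of the original notebook; `extract_and_list` and the demo file names are hypothetical, and the demo builds its own throwaway archive so that it runs anywhere.

```python
import os
import tempfile
import zipfile

def extract_and_list(zip_path, target_dir):
    """Extract a zip archive (overwriting existing files) and return the sorted relative file paths."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
    found = []
    for root, _dirs, files in os.walk(target_dir):
        for name in files:
            found.append(os.path.relpath(os.path.join(root, name), target_dir))
    return sorted(found)

# Demonstration with a throwaway archive containing hypothetical files.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("demo/policies/example_policy.md", "# Example policy\n")
        archive.writestr("demo/employees/000_Example.md", "# Example employee\n")
    listing = extract_and_list(demo_zip, os.path.join(tmp, "out"))
    print(listing)
```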


####**Install Dependencies**

In [4]:
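!pip install langchain_community
!pip install langchain_text_splitters
# HuggingFaceEmbeddings (used below) depends on sentence-transformers; install it in case the environment lacks it
!pip install sentence-transformers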



In [5]:
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [6]:
employee_files_path = "hrdataset/employees"
policy_files_path = "hrdataset/policies"
persist_directory = "hr_vector_index"

###**2. Read the Files and Convert Them to Text**

In [7]:
import os
def read_markdown_files(directory):
    """Read and load content from all Markdown files in a directory."""
    documents = []
    for filename in os.listdir(directory):
        if filename.endswith(".md"):
            filepath = os.path.join(directory, filename)
            with open(filepath, 'r', encoding='utf-8') as f:
                documents.append({"filename": filename, "content": f.read()})
    return documents

In [8]:
employee_docs = read_markdown_files(employee_files_path)
policy_docs = read_markdown_files(policy_files_path)
all_docs = employee_docs + policy_docs
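
Note that `os.listdir` returns entries in arbitrary order, so the position of a given document (and therefore of its chunks, such as `chunks[5]` printed later) can differ between runs and machines. A sketch of a deterministic variant using `pathlib` follows; `read_markdown_files_sorted` is a hypothetical name, and the demo files are made up.

```python
import tempfile
from pathlib import Path

def read_markdown_files_sorted(directory):
    """Read all .md files in a directory, in sorted filename order."""
    documents = []
    for path in sorted(Path(directory).glob("*.md")):
        documents.append({"filename": path.name, "content": path.read_text(encoding="utf-8")})
    return documents

# Demonstration on a temporary directory with hypothetical files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "b_notes.md").write_text("# B", encoding="utf-8")
    Path(tmp, "a_notes.md").write_text("# A", encoding="utf-8")
    Path(tmp, "ignored.csv").write_text("x,y", encoding="utf-8")
    docs = read_markdown_files_sorted(tmp)
    print([d["filename"] for d in docs])
```

The returned dictionaries have the same `filename` and `content` keys as those produced by `read_markdown_files`, so the rest of the pipeline would not need to change.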

###**3. Split the text into chunks**

In [9]:
def split_text(documents, chunk_size=1000, chunk_overlap=20):
    """Split text documents into manageable chunks."""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
        is_separator_regex=False
    )
    chunks = []
    for doc in documents:
        # Split the document into chunks
        doc_chunks = text_splitter.create_documents([doc["content"]])
        # Add metadata (e.g., filename) to each chunk
        for chunk in doc_chunks:
            chunk.metadata = {"filename": doc["filename"]}
        chunks.extend(doc_chunks)
    return chunks

In [10]:
# Split all documents into chunks
chunks = split_text(all_docs)

In [11]:
print(chunks[5])

page_content='# Employee Profile: Anjali Das

## Basic Information
- **Employee ID:** 103
- **Name:** Anjali Das
- **Role:** HR Executive
- **Department:** Human Resources
- **Manager:** Ramesh Nair
- **Contact:** +91-9988776655
- **Joining Date:** 2021-05-10
- **Date of Birth:** 1995-01-15
- **Hobbies:** Cooking, Gardening

## Performance Ratings
- **2021:** 4.2
- **2022:** 4.3
- **2023:** 4.5

## Onboarding Status
- N/A' metadata={'filename': '103_Anjali_Das.md'}
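
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified fixed-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to break on separators such as paragraph and line boundaries); it only shows how the two parameters interact. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=10, chunk_overlap=3):
    """Split text into windows of at most chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so consecutive windows share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks

pieces = sliding_window_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(pieces)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The chunk printed above contains an entire employee profile, which suggests that with `chunk_size=1000` many of the small files in this dataset fit into a single chunk.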


In [12]:
!pip install chromadb



###**4. Create Index**

In [13]:
def create_chroma_index(chunks, persist_directory):
    """Create and persist a ChromaDB index."""
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    vectordb = Chroma.from_documents(documents=chunks, embedding=embeddings, persist_directory=persist_directory)
    print(f"ChromaDB index created and saved in {persist_directory}.")
    return vectordb

In [14]:
create_chroma_index(chunks, persist_directory)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


ChromaDB index created and saved in hr_vector_index.


<langchain_community.vectorstores.chroma.Chroma at 0x7c9f0ebdc5d0>

###**5. Load Index**

In [15]:
def load_chroma_index(persist_directory=persist_directory):
    """Load an existing ChromaDB index."""
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    vectordb = Chroma(persist_directory=persist_directory, embedding_function=embeddings)
    print(f"ChromaDB index loaded from {persist_directory}.")
    return vectordb
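
The create-then-load pattern separates the expensive step (embedding every chunk once) from the cheap step (reopening the stored vectors for each query). The toy index below sketches that idea with the standard library only. It is an assumption-laden illustration, not how ChromaDB stores data: word counts stand in for the embedding model, a JSON file stands in for the persisted index, and the names `embed`, `create_index`, `load_index`, `similarity_search`, and the sample texts are hypothetical.

```python
import json
import math
import os
import re
import tempfile
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return dict(Counter(re.findall(r"[a-z0-9]+", text.lower())))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(w, 0) for w, v in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def create_index(texts, path):
    """Embed every text once and persist the texts plus vectors to disk."""
    records = [{"text": t, "vector": embed(t)} for t in texts]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f)

def load_index(path):
    """Load the persisted records without re-embedding the texts."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

def similarity_search(index, query, k=1):
    """Return the k stored texts whose vectors are closest to the query vector."""
    q = embed(query)
    ranked = sorted(index, key=lambda r: cosine(q, r["vector"]), reverse=True)
    return [r["text"] for r in ranked[:k]]

with tempfile.TemporaryDirectory() as tmp:
    index_path = os.path.join(tmp, "toy_index.json")
    create_index(["Parking permits are renewed every January.",
                  "Laptops are replaced every three years."], index_path)
    index = load_index(index_path)
    hits = similarity_search(index, "How often are laptops replaced?")
    print(hits)
```

As in the notebook, the query must be embedded with the same function that was used when the index was created; otherwise the similarity scores are meaningless.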

###**6. Similarity Search and Response Generation using LLM**

In [16]:
!pip install openai



In [17]:
from openai import OpenAI

#####**Retrieve the API Key from Secrets and Set It as an Environment Variable**

In [18]:
# Retrieve the API key from Colab's secrets
from google.colab import userdata
OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')

In [19]:
# Expose the key to the OpenAI client through the OPENAI_API_KEY environment variable
import os
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY
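
`google.colab.userdata` is only available inside Colab. In other environments, one possible approach is to read the key from an environment variable and fall back to an interactive prompt. The helper below is a sketch under that assumption; `get_api_key` and the `DEMO_API_KEY` variable are hypothetical names.

```python
import os
from getpass import getpass

def get_api_key(name="OPENAI_API_KEY", prompt_if_missing=True):
    """Return the key from the environment; optionally prompt for it (without echo) if it is not set."""
    key = os.environ.get(name)
    if not key and prompt_if_missing:
        key = getpass(f"Enter {name}: ")
        os.environ[name] = key  # make it visible to libraries that read the variable
    if not key:
        raise RuntimeError(f"{name} is not set")
    return key

# Demonstration with a placeholder value (not a real key), so no prompt is shown.
os.environ["DEMO_API_KEY"] = "not-a-real-key"
print(get_api_key("DEMO_API_KEY", prompt_if_missing=False))
```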

In [20]:
client = OpenAI()

In [21]:
def generate_response(context, question):
    """Generate a response using OpenAI."""
    try:
        messages = [
            {"role": "system", "content": "You are an assistant that answers questions based on the provided content."},
            {"role": "user", "content": f"Context: {context}\n\nQuestion: {question}\n\nAnswer:"}
        ]
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # Replace with preferred model
            messages=messages,
            max_tokens=150,
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        return f"Error generating response: {e}"

In [22]:
def process_question(question):
    """Answer a question using the most similar chunk from the ChromaDB index as context."""
    if not question:
        return "Please provide a question."

    # Step 1: load ChromaDB
    vectordb = load_chroma_index(persist_directory=persist_directory)

    # Step 2: Perform similarity search
    try:
        docs = vectordb.similarity_search(question)
        if not docs:
            return "No relevant information found."

        # Step 3: Generate a response using the retrieved context
        context = docs[0].page_content
        response = generate_response(context, question)
        return response
    except Exception as e:
        return f"Error during similarity search or response generation: {str(e)}"
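
`process_question` passes only the top-ranked chunk (`docs[0]`) to the model, although `similarity_search` returns several chunks (LangChain vector stores accept a `k` argument for the number of results). When an answer spans more than one file, it may help to combine several chunks. The following is a sketch of one way to do that; `build_context` and `max_chars` are hypothetical, and `SimpleNamespace` objects stand in for the retrieved documents so that the example runs without an index.

```python
from types import SimpleNamespace

def build_context(docs, max_chars=3000):
    """Join retrieved chunks into one context string, labelling each with its source file."""
    parts = []
    used = 0
    for doc in docs:
        source = doc.metadata.get("filename", "unknown")
        part = f"[Source: {source}]\n{doc.page_content}"
        if parts and used + len(part) > max_chars:
            break  # keep the prompt within a rough size budget
        parts.append(part)
        used += len(part)
    return "\n\n".join(parts)

# Stand-ins with the same attributes as the retrieved chunks (hypothetical content).
fake_docs = [
    SimpleNamespace(page_content="First chunk.", metadata={"filename": "a.md"}),
    SimpleNamespace(page_content="Second chunk.", metadata={"filename": "b.md"}),
]
context = build_context(fake_docs)
print(context)
```

In `process_question`, this would amount to replacing `context = docs[0].page_content` with `context = build_context(docs)`. Longer contexts cost more tokens, so the size budget is a trade-off rather than a fixed recommendation.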

In [23]:
print("Response:", process_question("Give me a summary of the leave policy in 20 words"))



ChromaDB index loaded from hr_vector_index.
Response: Annual leave: 18 days; sick leave: 12 days; maternity: 6 months; paternity: 15 days; compensatory leave for extra work.
