# Simple RAG Application with Groq and ChromaDB

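This notebook builds a small Retrieval-Augmented Generation (RAG) system over student feedback stored in an Excel file. Each feedback row is embedded with a HuggingFace sentence-transformer model, stored in ChromaDB, and retrieved by a Groq-hosted LLM through a LangChain agent tool.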

In [82]:
%pip install -q langchain langchain-groq langchain-community sentence-transformers chromadb pandas openpyxl python-dotenv langchain-huggingface langchain-chroma

Note: you may need to restart the kernel to use updated packages.


In [83]:
import os
import pandas as pd
from dotenv import load_dotenv
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_groq import ChatGroq
from langchain.agents import create_agent
from langchain.tools import tool

# Load environment variables from .env
load_dotenv(override=True)

groq_api_key = os.getenv("GROQ_API_KEY")
if not groq_api_key:
    print("WARNING: GROQ_API_KEY not found in .env file. Please set it.")
else:
    print("Groq API Key found.")

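# Optional: HuggingFaceEmbeddings (used below) runs the model locally, so a public
# model such as all-MiniLM-L6-v2 can be downloaded and used without this key.
huggingface_api_key = os.getenv("HUGGINGFACE_API_KEY")
if not huggingface_api_key:
    print("WARNING: HUGGINGFACE_API_KEY not found in .env file. Please set it.")
else:
    print("Huggingface API Key found.")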

Groq API Key found.
Huggingface API Key found.
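
The check above only prints a warning, so a missing key surfaces later as an authentication error. A stricter variant is sketched below; `require_env` is a hypothetical helper name, not part of `python-dotenv` or LangChain, and it assumes `load_dotenv` has already populated the environment.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable or fail immediately."""
    value = os.getenv(name)
    if not value:
        # Raising here stops the notebook before any API call is attempted.
        raise RuntimeError(f"{name} is not set; add it to your .env file.")
    return value

# Demonstration with a variable name that is assumed to be unset.
try:
    require_env("HYPOTHETICAL_UNSET_KEY")
except RuntimeError as err:
    print(err)
```

In this notebook it would be called as `groq_api_key = require_env("GROQ_API_KEY")`.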


## 1. Load Data

In [84]:
file_path = "summarized_student_feedback.xlsx"

if os.path.exists(file_path):
    df = pd.read_excel(file_path)
    print("Data loaded. First 5 rows:")
    display(df.head())
else:
    print(f"File not found: {file_path}. Please make sure the file is in the same directory.")

Data loaded. First 5 rows:


| Unnamed: 0 | z_last_digits | student_writeup |
|---|---|---|
| 0 | 607947 | As i mentioned the first class, I'm an archite... |
| 1 | 571257 | I expect to learn basic ideas of Artificial In... |
| 2 | 607605 | I will develop high practical skills in data a... |
| 3 | 236137 | My primary goal for this course is to develop ... |
| 4 | 561111 | I want to gain practical knowledge on how to i... |


## 2. Prepare Vector Database

In [85]:
# Convert DataFrame rows to Documents
documents = []

# Iterate over rows and convert to text format
for index, row in df.iterrows():
    # Convert the entire row to a string representation
    content = "\n".join([f"{col}: {val}" for col, val in row.items() if pd.notna(val)])
    
    doc = Document(page_content=content, metadata={"row": index, "source": file_path})
    documents.append(doc)

print(f"Processed {len(documents)} documents.")

Processed 37 documents.
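
To make the row-to-text step concrete, here is a minimal, dependency-free sketch that uses a plain dictionary in place of a DataFrame row. `row_to_text` and its `skip_columns` parameter are hypothetical names, the sample values are made up for illustration, and skipping the `Unnamed: 0` index column is an optional variation on the cell above rather than what that cell does.

```python
import math

def row_to_text(row, skip_columns=("Unnamed: 0",)):
    """Serialize one record as 'column: value' lines, skipping empty cells."""
    lines = []
    for col, val in row.items():
        if col in skip_columns:
            continue
        # Treat None and float NaN as missing, similar to pd.notna for these cases.
        if val is None or (isinstance(val, float) and math.isnan(val)):
            continue
        lines.append(f"{col}: {val}")
    return "\n".join(lines)

# Made-up sample row; "notes" is an invented column used to show NaN handling.
sample_row = {
    "Unnamed: 0": 0,
    "z_last_digits": 607947,
    "student_writeup": "I want to learn Python.",
    "notes": float("nan"),
}
text = row_to_text(sample_row)
print(text)
```

Leaving the index column out keeps the embedded text focused on the feedback itself; the row number is still available through the document metadata.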


### Create a Vector Database

In [86]:
# Initialize Embeddings (using a model from HuggingFace)
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Create the ChromaDB vector store.
#
# Because persist_directory is set, the collection is written to ./chroma_db and
# survives kernel restarts. Omit persist_directory to get a temporary in-memory
# store that is cleared when the kernel is restarted.
#
# from_documents takes a list of Document objects and the embedding model;
# collection_name is optional. Re-running this cell against the same
# persist_directory adds the documents again, so delete ./chroma_db first
# (or pass stable ids) to avoid duplicate entries.

vectorstore = Chroma.from_documents(
    documents=documents,
    embedding=embeddings,
    collection_name="student_feedback",
    persist_directory="./chroma_db"
)

print("Vector Database created.")

Vector Database created.
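
Because the collection is persisted, re-running the cell can store the same rows twice. One way to guard against that is to give each document a deterministic ID. The sketch below derives one from the source file name and row index; `make_doc_id` is a hypothetical helper, and passing the result through the `ids` argument of `Chroma.from_documents` is an assumption worth checking against the installed `langchain-chroma` version.

```python
import hashlib

def make_doc_id(source: str, row_index: int) -> str:
    """Build a stable ID from the source file name and row index."""
    key = f"{source}:{row_index}".encode("utf-8")
    # A 16-character hex prefix is short but still very unlikely to collide here.
    return hashlib.sha256(key).hexdigest()[:16]

# The same (source, row) pair always maps to the same ID across runs.
ids = [make_doc_id("summarized_student_feedback.xlsx", i) for i in range(3)]
print(ids)
```

With stable IDs, a second run would overwrite existing entries instead of appending new copies, provided the store upserts on matching IDs.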


## 3. Set Up an Agent Using LangChain

In [87]:
# Retrieval tool: fetches the feedback entries most similar to the query
@tool
def get_feedback(query: str) -> str:
    """Search the student feedback vector store and return the five entries
    most similar to the query."""
    docs = vectorstore.similarity_search(query, k=5)
    # Join the retrieved entries into one string, separated by blank lines
    response_content = "\n\n".join([d.page_content for d in docs])
    return response_content
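
Conceptually, `similarity_search` embeds the query and returns the `k` stored entries whose vectors lie closest to it. The toy sketch below shows that idea as a brute-force nearest-neighbour search using squared Euclidean distance over hand-written 3-dimensional vectors. It is an illustration only: the vectors and texts are made up, `top_k` is a hypothetical helper, and Chroma's actual index and distance metric depend on how the collection is configured.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def top_k(query_vec, store, k=2):
    """Return the k texts whose vectors are closest to query_vec.

    store is a list of (text, vector) pairs.
    """
    ranked = sorted(store, key=lambda item: squared_l2(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]

# Made-up entries with toy embeddings.
toy_store = [
    ("wants to learn Python", [0.9, 0.1, 0.0]),
    ("interested in architecture", [0.0, 0.2, 0.9]),
    ("wants practical data skills", [0.8, 0.3, 0.1]),
]
results = top_k([1.0, 0.0, 0.0], toy_store, k=2)
print(results)
```

A real vector store replaces the full sort with an approximate index so that lookups stay fast as the collection grows.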


In [88]:
# Create the Agent

# Initialize the LLM
llm = ChatGroq(model_name="meta-llama/llama-4-scout-17b-16e-instruct", api_key=groq_api_key)

# Initialize the tools
tools = [get_feedback]

# Create the agent
agent = create_agent(llm, tools, system_prompt="You are a helpful assistant that can answer questions about student feedback.")


In [89]:
# Run the agent for results
result = agent.invoke({"messages": [{"role": "user", "content": "What are students saying about the course difficulty?"}]})
final_answer = result['messages'][-1].content

print(final_answer)

Students have mentioned that the course will enhance their technical base as well as problem-solving attitude. They also want to master Python programming, use libraries such as NumPy, Pandas, and scikit-learn, and learn how to analyze results and explain information in a straightforward manner. 

However, I couldn't find any direct feedback on the course difficulty. If you want to know more about the course difficulty, I suggest checking the course reviews or ratings. 

Would you like to know more about the course or is there anything else I can help you with?
