
AI RAG evaluation project using Ragas. Includes RAG metrics (precision, recall, faithfulness), retrieval diagnostics, and prompt testing examples for fintech/banking LLM systems. Designed as an AI QA Specialist portfolio project.


alinaleo27/ai-rag-eval-qa


# 📘 AI RAG Evaluation – QA Test Suite

This project demonstrates how an AI QA Specialist can evaluate a RAG (Retrieval-Augmented Generation) system using Ragas.
The repository covers:

- LLM / RAG quality evaluation
- retrieval error analysis (missing / wrong / irrelevant context)
- automated RAG metrics (precision, recall, faithfulness)
- basic prompt testing for a banking / fintech chatbot

## 🧩 Project structure

```
ai-rag-eval-qa/
├── README.md
├── requirements.txt
├── .gitignore
├── data/
│   └── rag_eval_dataset.jsonl
├── notebooks/
│   └── ragas_evaluation.py
└── prompts/
    └── prompt_tests.md
```

βš™οΈ Installation
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

## 🔒 API Key Configuration (.env)

This project uses OpenAI for semantic evaluation.
Create a `.env` file in the project root:

```
OPENAI_API_KEY=your_api_key_here
```

The `.env` file is listed in `.gitignore`, so your API key stays on your machine only.
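Projects like this often load the key with `python-dotenv`; as an illustration, here is a dependency-free sketch of reading a `.env` file with only the standard library (the parsing rules are an assumption: simple `KEY=VALUE` lines, `#` comments ignored):

```python
import os

def load_env(path=".env"):
    """Minimal .env parser: KEY=VALUE lines; blank lines and '#' comments are skipped."""
    values = {}
    try:
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#") or "=" not in line:
                    continue
                key, _, value = line.partition("=")
                values[key.strip()] = value.strip()
    except FileNotFoundError:
        pass  # no .env file: fall back to the process environment
    return values

# Prefer the .env value, fall back to an already-exported variable.
api_key = load_env().get("OPENAI_API_KEY") or os.environ.get("OPENAI_API_KEY")
```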

## 🧪 Running the evaluation

```bash
python3 notebooks/ragas_evaluation.py
```

The script loads the dataset and evaluates:

- context_precision
- context_recall
- faithfulness

(You can also run in offline mode by disabling LLM usage.)
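To build intuition for what the retrieval metrics measure, here is a deliberately simplified set-overlap sketch. This is *not* the Ragas implementation (Ragas judges relevance with an LLM, not exact matching); the function names are hypothetical:

```python
def toy_context_precision(retrieved, relevant):
    """Fraction of retrieved chunks that are actually relevant to the question."""
    if not retrieved:
        return 0.0
    return sum(1 for c in retrieved if c in relevant) / len(retrieved)

def toy_context_recall(retrieved, relevant):
    """Fraction of the relevant chunks that the retriever managed to return."""
    if not relevant:
        return 0.0
    return sum(1 for c in relevant if c in retrieved) / len(relevant)

retrieved = ["fee policy", "card limits", "weather report"]   # one irrelevant hit
relevant = ["fee policy", "card limits", "kyc rules"]         # one missed chunk
precision = toy_context_precision(retrieved, relevant)  # 2/3: "weather report" is noise
recall = toy_context_recall(retrieved, relevant)        # 2/3: "kyc rules" was missed
```

Low precision points at noisy retrieval (wrong/irrelevant context); low recall points at missing context, which matches the error categories listed above.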

## ⚠️ Note on OpenAI quota

The full evaluation requires an active OpenAI quota.
If the account has no credits or quota, you will see:

```
openai.RateLimitError: insufficient_quota
```

This is expected behavior, not an error in the project.
To run offline (no API calls):

```python
results = evaluate(dataset, metrics=metrics, llm=None)
```

Note: `faithfulness` requires an LLM and will not run in offline mode.

## 📂 Dataset

`data/rag_eval_dataset.jsonl` contains 10 fintech/banking examples for RAG evaluation.
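A small loader with schema validation can catch malformed records before they reach the evaluator. The field names below (`question`, `answer`, `contexts`, `ground_truth`) follow the common Ragas dataset schema and are an assumption; adjust them to the actual file:

```python
import json

# Assumed Ragas-style schema; change to match rag_eval_dataset.jsonl.
REQUIRED_FIELDS = {"question", "answer", "contexts", "ground_truth"}

def load_jsonl(path):
    """Load a JSONL file, failing fast on records that miss required fields."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            row = json.loads(line)
            missing = REQUIRED_FIELDS - row.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
            rows.append(row)
    return rows
```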

## 💬 Prompt tests

Located in `prompts/prompt_tests.md`. The suite includes:

- JSON output validation
- jailbreak attempts
- safety tests
- consistency checks
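The JSON output validation check can be sketched as a small helper: parse the model reply and verify it is an object with the expected keys. The schema (`intent`, `confidence`) is hypothetical, chosen only to illustrate the pattern:

```python
import json

def check_json_reply(reply, required_keys=("intent", "confidence")):
    """Return (ok, reason) for a chatbot reply that should be a JSON object."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc}"
    if not isinstance(data, dict):
        return False, "top-level value is not an object"
    missing = [k for k in required_keys if k not in data]
    if missing:
        return False, f"missing keys: {missing}"
    return True, "ok"
```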

## 🎯 Purpose

This repository serves as a compact example of:

- RAG evaluation
- LLM QA
- retrieval diagnostics
- prompt testing

Designed for AI QA / LLM QA roles.
