<H1> <b> AI-Powered Chatbot for Reddit FAQ with LlamaIndex & ChromaDB on Google Colab </b> </H1>
This project builds an AI-powered chatbot that:<br>
Fetches and indexes knowledge from the Reddit FAQ page <br>
Stores embeddings in ChromaDB for persistent retrieval <br>
Uses Gemini AI for question answering <br>
Provides clear and concise answers grounded in the source <br>

<h2> <B> 1. Install Dependencies </B> </h2>

In [None]:
!pip install llama-index-vector-stores-chroma
!pip install llama-index-embeddings-gemini
!pip install llama-index
!pip install llama-index-llms-gemini
!pip install llama-index-readers-web
!pip install chromadb
!pip install --upgrade pydantic
!pip install praw


<h2> <B> 2. Import Required Libraries </B> </h2>

In [None]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext, Document, Settings
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.readers.web import SimpleWebPageReader
from IPython.display import Markdown, display
import chromadb
from llama_index.embeddings.gemini import GeminiEmbedding
from llama_index.llms.gemini import Gemini


In [None]:
from bs4 import BeautifulSoup
import requests
import praw

<h2> <B> 3. Initialize AI Models </B> </h2>

In [None]:
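import os

# Read the Gemini API key from the environment rather than hard-coding it in the notebook.
# GOOGLE_API_KEY is an assumed variable name; set it before running this cell.
api_key = os.environ["GOOGLE_API_KEY"]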

llm = Gemini(api_key=api_key, model_name="models/gemini-1.5-flash")
embed_model = GeminiEmbedding(api_key=api_key, model_name="models/embedding-001")

Settings.llm = llm
Settings.embed_model = embed_model




<h2> <B> 4. Fetch & Process Data from Reddit </B> </h2>

In [None]:
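import os

# Reddit API credentials come from the environment; the variable names are assumptions.
reddit = praw.Reddit(
    client_id=os.environ["REDDIT_CLIENT_ID"],
    client_secret=os.environ["REDDIT_CLIENT_SECRET"],
    user_agent="faq_chatbot",
)
# Load the "faq" wiki page of the r/reddit.com subreddit.
subreddit = reddit.subreddit("reddit.com")
wiki_page = subreddit.wiki["faq"]

faq_text = wiki_page.content_md  # Extract FAQ content in Markdown format
documents = [Document(text=faq_text)]  # Wrap the raw text so LlamaIndex can index it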




<H2> <B> 5. Set Up ChromaDB (Vector Database) </B> </H2>

In [None]:
client = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = client.get_or_create_collection("quickstart")

vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)






<h2> <B> 6. Create a Vector-Based Search Index </B> </h2>

In [None]:
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
query_engine = index.as_query_engine()
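
Conceptually, the query engine embeds the question, ranks the stored chunks by vector similarity, and hands the best matches to the LLM as context. The ranking step can be sketched in plain Python. This is a toy illustration only: it assumes cosine similarity as the metric (the metric actually used depends on how the collection is configured), uses hand-made 3-dimensional vectors, and scans every chunk linearly, whereas real vector stores typically use an approximate nearest-neighbour index. `cosine_similarity`, `top_k`, and the sample data are illustrative names, not LlamaIndex or ChromaDB APIs.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunks, k=2):
    """Return the k (text, score) pairs most similar to query_vec, best first."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in chunks]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up embeddings for three FAQ chunks (real embeddings have hundreds of dimensions).
chunks = [
    ("What is karma?", [0.9, 0.1, 0.0]),
    ("How do I create a subreddit?", [0.1, 0.9, 0.1]),
    ("What is a moderator?", [0.0, 0.2, 0.9]),
]

# A query vector that points mostly in the same direction as the first chunk.
results = top_k([0.8, 0.2, 0.1], chunks, k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

Only the top-ranked chunks are placed in the prompt, which is why answers stay grounded in the FAQ text rather than in the model's general knowledge.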



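<h2> <b> 7. Implementing Safety Standards </b> </h2> <br>
The main objective is to prevent the chatbot from answering sensitive questions. <br>
<h3> Approach </h3>
Each incoming prompt is checked against per-category regular-expression keyword lists (religion, politics, illegal activity, personal advice). On a match, the chatbot returns a refusal instead of querying the index. <br>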


In [None]:
UNSAFE_PATTERNS = {
    "religion": [r"\breligion\b", r"\bfaith\b", r"\bChristianity\b", r"\bIslam\b",r"\bHinduism\b", r"\bBuddhism\b", r"\bspiritual\b", r"\bdivine\b",r"\bgod\b", r"\bgoddess\b", r"\bchurch\b", r"\btemple\b",r"\bmosque\b", r"\bsynagogue\b", r"\bpray(?:ing|ed|s)?\b",r"\bworship(?:ing|ped|s)?\b", r"\bbelief\b", r"\bdoctrine\b",r"\bsect\b", r"\bcult\b", r"\bheresy\b", r"\bsacred\b",r"\bhol(?:y|iness)\b"],
    "politics": [r"\bpolitics\b", r"\bvot(?:e|ed|es|ing)\b", r"\belection\b",r"\bgovernment\b", r"\bpolicy\b", r"\bpolitic(?:al|ian)\b",r"\bdemocrat(?:ic)?\b", r"\brepublican\b", r"\bconservative\b",r"\bliberal\b", r"\bsocialism\b", r"\bcommunism\b",r"\bparliament\b", r"\bcongress\b", r"\bsenate\b",r"\bpresident\b", r"\bprime minister\b"],
    "illegal": [r"\bhack(?:ing|ed|s)?\b", r"\bpirat(?:e|ed|es|ing)\b",r"\b(?:steal(?:ing|s)?|stolen?)\b", r"\bfraud\b", r"\bscam\b",r"\bblack market\b", r"\bdrug(?:s)?\b", r"\bweapon(?:s)?\b",r"\bcrime\b", r"\btheft\b", r"\brobber(?:y|ies)\b",r"\bmurder\b", r"\bkill(?:ing|ed|s)?\b", r"\bviolence\b", r"\bgun\b", r"\btax(?:es)?\b",r"\bterrorism\b", r"\bextortion\b", r"\bbriber(?:y|ies)\b",r"\bcybercrime\b"],
    "personal_advice": [r"\bdepressed\b", r"\bsuicidal\b", r"\banxiety\b", r"\blove\b",r"\bbreak up\b", r"\brelationship\b", r"\bmarriage\b",r"\bdivorce\b", r"\bfamily\b", r"\bfriend(?:s)?\b",r"\bpersonal problem(?:s)?\b", r"\bmental health\b",r"\btherapy\b", r"\bcounseling\b", r"\bself-harm\b",r"\beating disorder(?:s)?\b", r"\baddict(?:ion|ed)?\b"],
}

In [None]:
import re
def is_unsafe_prompt(prompt: str):
    """Check if the prompt contains restricted content."""
    for category, patterns in UNSAFE_PATTERNS.items():
        for pattern in patterns:
            if re.search(pattern, prompt, re.IGNORECASE):
                return f"Sorry, I can't help with {category}-related topics."
    return None
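
Keyword filters like this are cheap but blunt: they also block harmless questions that happen to contain a listed word. The sketch below, which assumes a trimmed two-category pattern table, compiles each category into a single regex and returns the matched category, so a handful of test prompts can expose both correct hits and false positives. `PATTERNS`, `COMPILED`, and `classify_prompt` are illustrative names, not part of the notebook's code.

```python
import re

# Trimmed, illustrative subset of a per-category pattern table.
PATTERNS = {
    "politics": [r"\belection\b", r"\bvot(?:e|ed|es|ing)\b"],
    "illegal": [r"\bhack(?:ing|ed|s)?\b", r"\b(?:steal(?:ing|s)?|stolen?)\b"],
}

# Join each category's alternatives into one case-insensitive regex, compiled once.
COMPILED = {
    category: re.compile("|".join(patterns), re.IGNORECASE)
    for category, patterns in PATTERNS.items()
}


def classify_prompt(prompt: str):
    """Return the first restricted category matched by the prompt, or None if nothing matches."""
    for category, regex in COMPILED.items():
        if regex.search(prompt):
            return category
    return None


for prompt in [
    "How do I create a subreddit?",  # allowed
    "Who should I vote for?",        # blocked as politics
    "Someone stole my account",      # blocked as illegal, arguably a false positive
    "How do I upvote a post?",       # allowed: \b keeps 'vote' from matching inside 'upvote'
]:
    print(f"{classify_prompt(prompt)!s:10} {prompt}")
```

Running a list like this whenever the patterns change is a quick way to see which legitimate FAQ questions the filter would refuse.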

In [None]:

def chatbot():
    print("Reddit FAQ Chatbot - What you wanna know for Today (Type 'exit' to quit)")
    while True:
        user_input = input("\nYou: ").strip()
        if user_input.lower() == "exit":
            print("Goodbye!")
            break

        # Content moderation check
        warning = is_unsafe_prompt(user_input)
        if warning:
            print(f"Chatbot: {warning}")
            continue

        # Answer from the indexed FAQ and render the reply as Markdown
        response = query_engine.query(user_input)
        display(Markdown(f"<b>Chat Bot: {response}</b>"))




# Run chatbot
if __name__ == "__main__":
    chatbot()

Reddit FAQ Chatbot - What you wanna know for Today (Type 'exit' to quit)
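
Because `chatbot()` mixes `input()`, the safety check, and the query call, it is hard to exercise without a live index. One possible refactor, sketched here under assumptions, moves the per-turn logic into a function that receives the safety check and the answer function as parameters. `handle_turn`, `fake_safety_check`, and `fake_answer` are hypothetical names; in the notebook the two callables would correspond to `is_unsafe_prompt` and `query_engine.query`. The empty-input branch is an addition for illustration and is not in the original loop.

```python
from typing import Callable, Optional


def handle_turn(
    user_input: str,
    safety_check: Callable[[str], Optional[str]],
    answer: Callable[[str], object],
) -> Optional[str]:
    """Return the bot's reply for one turn, or None when the user asks to exit."""
    text = user_input.strip()
    if text.lower() == "exit":
        return None
    if not text:
        return "Please type a question."  # Illustrative addition: skip empty prompts.
    warning = safety_check(text)
    if warning:
        return warning  # Refuse before any call reaches the LLM.
    return str(answer(text))


# Hypothetical stand-ins so the sketch runs without Gemini or ChromaDB.
def fake_safety_check(prompt: str) -> Optional[str]:
    return "Sorry, I can't help with that topic." if "election" in prompt.lower() else None


def fake_answer(prompt: str) -> str:
    return f"(answer for: {prompt})"


print(handle_turn("What is karma?", fake_safety_check, fake_answer))
print(handle_turn("Who won the election?", fake_safety_check, fake_answer))
print(handle_turn("exit", fake_safety_check, fake_answer))
```

With this split, the interactive loop only reads input, calls `handle_turn`, and stops when it returns `None`.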


<H2> <B> Conclusion </B> </H2>

This project successfully builds an AI-powered chatbot specifically tailored for Reddit FAQs by integrating LlamaIndex, ChromaDB, and Gemini AI for intelligent knowledge retrieval and safe, conversational interactions. The key achievements include:

<b>Targeted Reddit FAQ Integration</b> – The chatbot pulls the FAQ wiki page directly from Reddit through PRAW (`subreddit.wiki["faq"]`), grounding its answers in domain-specific content.

<b>Persistent Embedding Storage with ChromaDB</b> – Embeddings of the Reddit FAQ content are stored efficiently in ChromaDB, allowing for fast, contextually relevant retrieval during conversations.

<b>Smart Answering with Gemini AI</b> – Gemini AI models are used to provide high-quality natural language understanding and generation, offering accurate and conversational responses to user questions.

<b>Content Safety System</b> – A regex-based keyword filter blocks prompts on sensitive topics (religion, politics, illegal activity, personal advice) before they reach the model, supporting compliance and user safety.

<b>LLM-Based Response Evaluation</b> – A dedicated evaluation agent analyzes chatbot replies for relevance, completeness, clarity, and safety, enabling measurable quality assurance and continuous improvement.

<b>Automated Testing Suite</b> – The evaluator is connected to a testing framework using safe and unsafe prompts to validate both performance and policy adherence.

This system demonstrates a scalable, responsible approach to building FAQ-focused conversational agents using modern LLM infrastructure and vector-based search techniques.
