## 📦 Package Installation

This section installs and upgrades all the required Python libraries used to build and run the chatbot system. These tools are essential for handling document processing, language model interaction, UI rendering, and environment management.


In [17]:
%pip install -qU langchain langchain-core langchain-community langchain-text-splitters langgraph
%pip install -qU "langchain[google-genai]" langchain-huggingface
%pip install -qU azure-ai-documentintelligence pypdf
%pip install -q python-dotenv gradio python-pptx

Note: you may need to restart the kernel to use updated packages.


## 🧠 Module Imports and Environment Setup

This section imports the Python modules needed for the chatbot's backend logic and calls `load_dotenv()` to read variables from a local `.env` file into the process environment.


In [1]:
import os

import gradio as gr
from dotenv import load_dotenv
from langchain import hub
from langchain.chat_models import init_chat_model
from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader, PyPDFLoader
from langchain_core.documents import Document
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langgraph.graph import START, StateGraph
from pptx import Presentation
from typing_extensions import List, TypedDict

load_dotenv()


True

## 🔐 Environment Variable Setup

This block sets the required environment variables and reads sensitive keys loaded from the `.env` file. These keys authenticate with third-party services: LangSmith, Google Gemini, and Azure Document Intelligence. It also sets the path of the document to index (`test.pptx`).

In [18]:
os.environ["LANGSMITH_TRACING"] = "true"
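# load_dotenv() has already copied the .env values into os.environ, so there is
# nothing to reassign; os.environ[...] = None would raise a TypeError anyway.
for key in ("LANGSMITH_API_KEY", "GOOGLE_API_KEY"):
    if not os.getenv(key):
        raise RuntimeError(f"{key} is not set; add it to your .env file")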
document_intelligence_key = os.getenv("DOCUMENTAI_API_KEY")
file_path = "test.pptx"
endpoint = os.getenv("DOCUMENTAI_ENDPOINT")

### Chat Model - Gemini GenAI

In [3]:
llm = init_chat_model("gemini-2.0-flash", model_provider="google_genai")

### Embeddings Model - HuggingFace

In [4]:
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

### In-Memory Vector Store - LangChain Core

In [5]:
vector_store = InMemoryVectorStore(embeddings)

## 📄 Document Loading, Splitting, Vectorization & RAG Pipeline Setup

This section performs four major tasks:
1. Loads and parses the input document, either with Azure Document Intelligence (shown commented out) or with a local PDF/PPTX loader.
2. Splits the parsed content into overlapping chunks.
3. Embeds the chunks and adds them to the in-memory vector store.
4. Builds a retrieval-augmented generation (RAG) pipeline using LangGraph.
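
To give a feel for what chunk size and chunk overlap mean in the splitting step, here is a deliberately simplified sketch of fixed-width sliding-window chunking. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraph breaks and newlines); the function name `sliding_window_chunks` is hypothetical and the sketch only illustrates how consecutive chunks share text.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


if __name__ == "__main__":
    for chunk in sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1):
        print(chunk)
```

With a chunk size of 4 and an overlap of 1, each chunk starts with the last character of the previous one. The overlap exists so that a sentence cut at a chunk boundary still appears intact in at least one chunk.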

### Azure AI Document Intelligence - Document Loading

In [None]:
# loader = AzureAIDocumentIntelligenceLoader(
#     api_endpoint=endpoint, api_key=document_intelligence_key, file_path=file_path, api_model="prebuilt-layout"
# )

# docs = loader.load()
# print(f"Loaded {len(docs)} pages")

### Manual Document Loading

In [None]:
def load_documents(file_path: str) -> list[Document]:
    """Load a PPTX or PDF into a list of LangChain Document objects."""
    docs: list[Document] = []
    ext = os.path.splitext(file_path)[1].lower()

    if ext == ".pdf":
        loader = PyPDFLoader(file_path)
        docs = loader.load()
        print(f"Loaded {len(docs)} pages into {len(docs)} Document objects.")

    elif ext == ".pptx":  # python-pptx reads only .pptx, not the legacy .ppt format
        prs = Presentation(file_path)
        for i, slide in enumerate(prs.slides):
            slide_texts = []
            for shape in slide.shapes:
                # Only shapes with a text frame carry text (pictures and charts do not).
                if shape.has_text_frame and shape.text_frame.text.strip():
                    slide_texts.append(shape.text_frame.text.strip())
            full_text = "\n".join(slide_texts)
            if full_text:
                docs.append(Document(page_content=full_text, metadata={"slide_index": i}))
        print(f"Loaded {len(docs)} slides into {len(docs)} Document objects.")

    else:
        raise ValueError(f"Unsupported file type: {ext}")

    return docs

In [20]:
docs = load_documents(file_path)

text_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=100)
all_splits = text_splitter.split_documents(docs)

_ = vector_store.add_documents(documents=all_splits)

prompt = hub.pull("rlm/rag-prompt")

class State(TypedDict):
    question: str
    context: List[Document]
    answer: str

def retrieve(state: State):
    """Fetch the chunks most similar to the question from the vector store."""
    retrieved_docs = vector_store.similarity_search(state["question"])
    return {"context": retrieved_docs}


def generate(state: State):
    """Fill the RAG prompt with the retrieved context and ask the LLM for an answer."""
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    messages = prompt.invoke({"question": state["question"], "context": docs_content})
    response = llm.invoke(messages)
    return {"answer": response.content}


# Run retrieve, then generate; each node returns a partial update to State.
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()

Loaded 20 slides into 20 Document objects.
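
Conceptually, the retrieval step embeds the question and returns the stored chunks whose embeddings are closest to it. The sketch below shows that idea with hand-written toy vectors and cosine similarity in plain Python. It is an illustration under stated assumptions, not the vector store's actual implementation: the names `cosine_similarity` and `top_k` are hypothetical, and the two-dimensional vectors stand in for real embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 4) -> list[str]:
    """Return the ids of the k documents most similar to the query vector."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine_similarity(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    # Toy "embeddings" (assumed values, for illustration only).
    toy_docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    print(top_k([1.0, 0.0], toy_docs, k=2))
```

In the notebook, the chunks returned by `retrieve` play the role of the top-ranked ids here: their text is joined and passed to the prompt as context.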


## 💬 Gradio Interface for Chatbot Interaction

This section defines the function used to process user questions and launches a Gradio interface for easy interaction with the chatbot via a web-based UI.

In [16]:

def chat_fn(question: str) -> str:
    result = graph.invoke({"question": question})
    return result["answer"]
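
The function above assumes every question is non-empty and that the pipeline never fails. A slightly more defensive variant might look like the sketch below. The wrapper `make_safe_chat_fn` and its fallback messages are hypothetical; it takes any callable with the same shape as `graph.invoke`, so it can be exercised with a stub instead of the real graph.

```python
from typing import Callable


def make_safe_chat_fn(invoke: Callable[[dict], dict]) -> Callable[[str], str]:
    """Wrap an invoke-style callable so the UI always gets a string back."""

    def safe_chat_fn(question: str) -> str:
        question = (question or "").strip()
        if not question:
            return "Please enter a question."
        try:
            result = invoke({"question": question})
        except Exception as exc:
            # Surface the failure in the UI instead of raising inside the handler.
            return f"Sorry, something went wrong: {exc}"
        return result.get("answer", "No answer was produced.")

    return safe_chat_fn


if __name__ == "__main__":
    # Stub standing in for graph.invoke (assumption: same input/output shape).
    echo = make_safe_chat_fn(lambda state: {"answer": state["question"].upper()})
    print(echo("what is on slide 3?"))
```

In the notebook this could be wired up as `chat_fn = make_safe_chat_fn(graph.invoke)` and passed to `gr.Interface` unchanged.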

In [21]:
iface = gr.Interface(
    fn=chat_fn,
    inputs=gr.Textbox(lines=3, placeholder="Type your question here…"),
    outputs=gr.Textbox(label="Answer"),
    title="📄 CTSE Document Q & A Chatbot",
    description="Ask away!"
)

iface.launch(share=True, inline=True)

* Running on local URL:  http://127.0.0.1:7863

Could not create share link. Please check your internet connection or our status page: https://status.gradio.app.


