# Building a Documentation Chatbot with LangChain

This script demonstrates how to build an intelligent chatbot that queries documentation using LangChain. 
The chatbot can:
- Parse and preprocess Markdown files.
- Embed document content for efficient similarity-based retrieval.
- Answer detailed, context-aware queries from users.

In [9]:
import os
import logging
import helpers.hdbg as hdbg
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.chains import RetrievalQA
from langchain_utils import (
    list_markdown_files,
    parse_markdown_files,
    split_documents,
    create_vector_store,
    build_retriever,
    watch_folder_for_changes,
    update_vector_store
)
# Configure logging.
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# Set the OpenAI API key.
os.environ["OPENAI_API_KEY"] = "your_openai_api_key_here"
# Initialize the chat model
chat_model = ChatOpenAI(model="gpt-3.5-turbo-0125", temperature=0)

In [10]:
hdbg.init_logger(verbosity=logging.INFO)

_LOG = logging.getLogger(__name__)



In [12]:
import langchain_utils as lang_utils 


## Define Config

In [18]:
config = {
    "open_ai_api_key": "your_api_key_here",
    # Define language model arguments.
    "language_model": {
        # Define your model here.
        "model": "gpt-4o-mini",
        "temperature": 0,
    },
    # Define input directory path containing documents.
    "source_directory": "../../docs",
    "parse_data_into_chunks": {
        "chunk_size": 500,
        "chunk_overlap": 50,
    },
}

## Setting Up

We'll begin by importing the required libraries and configuring the environment. The chatbot will use:
- An OpenAI chat model, selected in `config["language_model"]`, as the core language model.
- FAISS for fast document retrieval.
- LangChain utilities for document parsing, text splitting, and chaining.

In [20]:
# Set the OpenAI API key.
os.environ["OPENAI_API_KEY"] = config["open_ai_api_key"]
# Initialize the chat model.
chat_model = ChatOpenAI(**config["language_model"])

## Parse and Preprocess Documentation

Markdown files serve as the primary data source for this chatbot. 
We'll parse the files into LangChain `Document` objects and split them into manageable chunks to ensure efficient retrieval.
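
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size character chunking with overlap. It is an illustration only: `split_text` is a hypothetical helper, not the implementation behind `lang_utils.parse_data_into_chunks`, and real splitters typically also try to break on separators such as paragraph boundaries.

```python
def split_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split `text` into windows of at most `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters.
    Hypothetical helper for illustration only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each new window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# A 1200-character sample string made of repeating digits.
sample = "".join(str(i % 10) for i in range(1200))
chunks = split_text(sample, chunk_size=500, chunk_overlap=50)
print([len(c) for c in chunks])
```

The overlap means the last 50 characters of one chunk reappear at the start of the next, so a sentence that straddles a boundary is still seen whole in at least one chunk.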

In [22]:
split_documents = lang_utils.parse_data_into_chunks(
    dir_path = config["source_directory"],
    **config["parse_data_into_chunks"],
)
_LOG.info("Processed and chunked %d documents.", len(split_documents))
# Log a sample of the chunked documents.
for doc in split_documents[:5]:
    _LOG.info("Source: %s", doc.metadata["source"])
    _LOG.info("Content: %s", doc.page_content)


## Create a FAISS Vector Store

To enable fast document retrieval, we'll embed the document chunks using OpenAI's embeddings and store them in a FAISS vector store.
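
As a rough picture of what similarity-based retrieval does, the sketch below ranks chunks by cosine similarity to a query. It is a toy stand-in under stated assumptions: `embed` uses word counts instead of OpenAI embeddings, and `top_k` scans every chunk, whereas FAISS uses dense vectors and an index built for fast search. All function names here are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real setup calls an embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: list[str], k: int = 2) -> list[tuple[float, str]]:
    """Return the `k` chunks most similar to `query`, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


corpus = [
    "how to set up a new project",
    "coding style guidelines for python",
    "how to write tutorials",
]
for score, chunk in top_k("set up a new project", corpus, k=2):
    print(f"{score:.2f}  {chunk}")
```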

In [64]:
# Initialize OpenAI embeddings.
embeddings = OpenAIEmbeddings()
# Create a FAISS vector store.
vector_store = create_vector_store(split_documents, embeddings)
logger.info("FAISS vector store created with %d documents.", len(split_documents))



## Build a QA Chain

The `RetrievalQA` chain combines document retrieval with the configured OpenAI chat model for question answering.
It retrieves the most relevant document chunks and uses them as context to generate answers.
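
Conceptually, "using chunks as context" means placing the retrieved text into the prompt ahead of the question. The sketch below shows that idea with a hypothetical `build_prompt` helper. The prompt wording is an assumption for illustration and is not the template LangChain uses internally.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a single prompt from retrieved chunks and the user question.

    Hypothetical helper; the wording is illustrative, not LangChain's template.
    """
    # Separate chunks with blank lines so the model can tell them apart.
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


retrieved = [
    "Each tutorial conceptually corresponds to a blog entry.",
    "Each tutorial corresponds to a directory in the tutorials repo.",
]
prompt = build_prompt("How are tutorials organized?", retrieved)
print(prompt)
```

Because every retrieved chunk is placed in one prompt, the number and size of chunks must fit within the model's context window, which is one reason the chunks are kept small.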

In [19]:
# Build the retriever from the vector store
retriever = build_retriever(vector_store)

# Create the RetrievalQA chain
qa_chain = RetrievalQA.from_chain_type(llm=chat_model, retriever=retriever, return_source_documents=True)

logger.info("RetrievalQA chain initialized.")

## Query the Chatbot

Let's interact with the chatbot! We'll ask it questions based on the documentation. 
The chatbot will retrieve relevant chunks and generate context-aware responses.

In [9]:
# Define a user query
query = "What are the guidelines for setting up a new project?"

# Query the chatbot
response = qa_chain({"query": query})

# Display the answer and source documents
print(f"Answer:\n{response['result']}\n")
print("Source Documents:")
for doc in response['source_documents']:
    print(f"- Source: {doc.metadata['source']}")
    print(f"  Excerpt: {doc.page_content[:200]}")

## Dynamic Updates

What if the documentation changes? We'll handle this by monitoring the folder for new or modified files.
The vector store will be updated dynamically to ensure the chatbot stays up-to-date.
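
One simple way to detect new or modified files is to compare each file's modification time against the value recorded on the previous scan. The sketch below assumes that approach; `detect_changes` is a hypothetical function and may differ from how `watch_folder_for_changes` in `langchain_utils` actually works.

```python
import os
import tempfile


def detect_changes(dir_path: str, known_files: dict[str, float]) -> dict[str, list[str]]:
    """Report Markdown files that are new or modified since the last scan.

    `known_files` maps path -> last seen modification time and is updated in place.
    Hypothetical helper for illustration only.
    """
    changes = {"new": [], "modified": []}
    for root, _dirs, files in os.walk(dir_path):
        for name in sorted(files):
            if not name.endswith(".md"):
                continue
            path = os.path.join(root, name)
            mtime = os.path.getmtime(path)
            if path not in known_files:
                changes["new"].append(path)
            elif mtime > known_files[path]:
                changes["modified"].append(path)
            # Remember the latest modification time for the next scan.
            known_files[path] = mtime
    return changes


with tempfile.TemporaryDirectory() as tmp:
    doc = os.path.join(tmp, "guide.md")
    with open(doc, "w") as f:
        f.write("# Guide\n")
    seen = {}
    first = detect_changes(tmp, seen)   # guide.md is reported as new
    second = detect_changes(tmp, seen)  # nothing changed since the first scan
    # Simulate a later edit by moving the modification time forward.
    os.utime(doc, (seen[doc] + 10, seen[doc] + 10))
    third = detect_changes(tmp, seen)   # guide.md is reported as modified
```

Note that a scan like this does not notice deleted files. Removing stale chunks from the vector store would need separate handling.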

In [10]:
# Monitor the folder for changes and update the vector store
known_files = {}
changes = watch_folder_for_changes(config["source_directory"], known_files)

if changes["new"] or changes["modified"]:
    # Parse and process the changed files
    new_documents = parse_markdown_files(changes["new"] + changes["modified"])
    update_vector_store(vector_store, new_documents, embeddings)
    logger.info("Vector store updated with new/modified documents.")

## Enhancements: Personalization

We can extend the chatbot to include personalized responses:
- Filter documents by metadata (e.g., tags, categories).
- Customize responses based on user preferences.

For example, users can ask for specific sections of the documentation or request summaries tailored to their needs.
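
Metadata filtering can be sketched as a pre-selection step over the documents before or after retrieval. The example below is a minimal illustration: the `Doc` class mimics the `page_content` and `metadata` attributes used in this notebook, while the `category` key and the `filter_by_metadata` helper are hypothetical, since the documents shown here carry only a `source` field.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a document with text and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def filter_by_metadata(docs: list[Doc], **criteria) -> list[Doc]:
    """Keep only documents whose metadata matches every key/value in `criteria`."""
    return [
        d for d in docs
        if all(d.metadata.get(key) == value for key, value in criteria.items())
    ]


docs = [
    Doc("Welcome to the team.", {"source": "onboarding.md", "category": "onboarding"}),
    Doc("Use 4-space indentation.", {"source": "style.md", "category": "coding"}),
    Doc("Request repo access first.", {"source": "access.md", "category": "onboarding"}),
]
onboarding_docs = filter_by_metadata(docs, category="onboarding")
print([d.metadata["source"] for d in onboarding_docs])
```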

In [31]:
# Example query with personalized intent
personalized_query = "Show me onboarding guidelines for new employees."

# Query the chatbot
personalized_response = qa_chain({"query": personalized_query})

# Display the personalized response
print(f"Answer:\n{personalized_response['result']}\n")
print("Source Documents:")
for doc in personalized_response['source_documents']:
    print(f"- Source: {doc.metadata['source']}")
    print(f"  Excerpt: {doc.page_content[:200]}")

INFO  Source: {'../../docs/all.how_write_tutorials.how_to_guide.md'}
INFO  Content: {'<!-- toc -->\n\n- [Tutorials "Learn X in 60 minutes"](#tutorials-learn-x-in-60-minutes)\n  * [What are the goals for each tutorial](#what-are-the-goals-for-each-tutorial)\n\n<!-- tocstop -->\n\n# Tutorials "Learn X in 60 minutes"\n\nThe goal is to give everything needed for one person to become familiar with a\nBig data / AI / LLM / data science technology in 60 minutes.\n\n- Each tutorial conceptually corresponds to a blog entry.'}
INFO  Source: {'../../docs/all.how_write_tutorials.how_to_guide.md'}
INFO  Content: {'Each tutorial corresponds to a directory in the `//tutorials` repo\n[https://github.com/causify-ai/tutorials](https://github.com/causify-ai/tutorials)\nwith'}
INFO  Source: {'../../docs/all.how_write_tutorials.how_to_guide.md'}
INFO  Content: {'- A markdown \\`XYZ.API.md\\` about the API and the software layer written by us\n  on top of the native API\n- A markdown `XYZ.example.md` with a

## Summary

In this script, we:
1. Parsed and processed Markdown documentation.
2. Embedded document chunks into a FAISS vector store for efficient retrieval.
3. Built a RetrievalQA chain for context-aware question answering.
4. Enabled dynamic updates to handle changing documentation.
5. Enhanced the chatbot with personalized query handling.

This showcases how LangChain can be used to build intelligent, flexible chatbots tailored for specific tasks.