# Build a custom chat bot for question answering

## Dataset: 

This project uses a dataset containing fictional character descriptions from theater, television, and film. Each entry provides a brief profile of an imagined character, including personality traits, background, and setting. The dataset was chosen for its creative depth and potential to support a chatbot that can respond in character-driven or narrative-based scenarios. This customization is especially useful in contexts like storytelling, creative writing, or ideation sessions where access to diverse character types can enhance the experience.

## Approach for customization:

The customization uses a Retrieval-Augmented Generation (RAG) setup built with LangChain. Character data is converted into text documents, embedded using OpenAI's embedding model, and stored in a FAISS vector store for similarity-based retrieval. LangChain’s ConversationalRetrievalChain ties the retriever to the gpt-3.5-turbo model. When a question is asked, relevant context is retrieved and passed into the model to generate responses grounded in the dataset.

# 1. Imports

In [1]:
import pandas as pd
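# Community integrations now live in langchain_community; the langchain.* paths are deprecated
from langchain_community.document_loaders import DataFrameLoader
from langchain_community.vectorstores import FAISS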
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains import ConversationalRetrievalChain
from dotenv import load_dotenv
import os
import openai

# 2. Load environment variables

In [2]:
# Load environment variables explicitly from .env
load_dotenv()

True

# 3. Set OpenAI base URL and key

In [3]:
openai.api_base = "https://openai.vocareum.com/v1"
# Store the key on the openai module so later cells can read openai.api_key
openai.api_key = os.getenv("OPENAI_API_KEY")
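
If the key is missing from `.env`, `os.getenv` quietly returns `None` and the failure only shows up later as an authentication error. A minimal sketch of an early check is shown below; `require_env` is a hypothetical helper (not part of `dotenv` or `openai`), and the in-memory mapping stands in for the real environment.

```python
import os


def require_env(name, env=None):
    """Return a non-empty environment value or raise a clear error.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    value = (env.get(name) or "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value


# An in-memory mapping stands in for os.environ in this example
fake_env = {"OPENAI_API_KEY": "sk-example"}
api_key = require_env("OPENAI_API_KEY", fake_env)
print(api_key)
```

In the notebook, the same check could be called as `require_env("OPENAI_API_KEY")` right after `load_dotenv()`.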

# 4. Load and prepare the dataset

This step loads the dataset and stitches each character's name, description, medium, and setting into one clean text column for the chatbot to use.

In [4]:

df = pd.read_csv('data/character_descriptions.csv')
df['text'] = df['Name'] + ": " + df['Description'] + " (" + df['Medium'] + ", " + df['Setting'] + ")"
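
One thing to watch with plain string concatenation: if any of the four fields is missing in a row, pandas propagates the missing value and the whole `text` entry becomes `NaN`. The sketch below shows one way to build the same string defensively. `build_text` is a hypothetical helper, and the example row is a shortened stand-in written as a plain dict rather than a real row from the CSV.

```python
def build_text(row):
    """Compose 'Name: Description (Medium, Setting)' from a row-like mapping."""

    def clean(value):
        # Treat None and NaN as missing (NaN is the only value not equal to itself)
        if value is None or value != value:
            return ""
        return str(value).strip()

    name = clean(row.get("Name"))
    description = clean(row.get("Description"))
    # Keep only the context fields that are actually present
    parts = [clean(row.get("Medium")), clean(row.get("Setting"))]
    context = ", ".join(part for part in parts if part)
    text = f"{name}: {description}"
    return f"{text} ({context})" if context else text


row = {
    "Name": "Emily",
    "Description": "An aspiring actress.",
    "Medium": "Play",
    "Setting": "England",
}
print(build_text(row))  # Emily: An aspiring actress. (Play, England)
```

Because a pandas `Series` also supports `.get`, this could be applied with `df.apply(build_text, axis=1)` if the dataset turns out to have gaps.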

In [5]:
# Display the dataset

pd.set_option('display.max_colwidth', None)
df[['Name', "Description", "text"]].head()

Unnamed: 0,Name,Description,text
0,Emily,"A young woman in her early 20s, Emily is an aspiring actress and Alice's daughter. She has a bubbly personality and a quick wit, but struggles with self-doubt and insecurity. She's also in a relationship with George.","Emily: A young woman in her early 20s, Emily is an aspiring actress and Alice's daughter. She has a bubbly personality and a quick wit, but struggles with self-doubt and insecurity. She's also in a relationship with George. (Play, England)"
1,Jack,"A middle-aged man in his 40s, Jack is a successful businessman and Sarah's boss. He has a no-nonsense attitude, but is fiercely loyal to his friends and family. He's married to Alice.","Jack: A middle-aged man in his 40s, Jack is a successful businessman and Sarah's boss. He has a no-nonsense attitude, but is fiercely loyal to his friends and family. He's married to Alice. (Play, England)"
2,Alice,"A woman in her late 30s, Alice is a warm and nurturing mother of two, including Emily. She's kind-hearted and empathetic, but can be overly protective of her children and prone to worrying. She's married to Jack.","Alice: A woman in her late 30s, Alice is a warm and nurturing mother of two, including Emily. She's kind-hearted and empathetic, but can be overly protective of her children and prone to worrying. She's married to Jack. (Play, England)"
3,Tom,"A man in his 50s, Tom is a retired soldier and John's son. He has a no-nonsense approach to life, but is haunted by his experiences in combat and struggles with PTSD. He's also in a relationship with Rachel.","Tom: A man in his 50s, Tom is a retired soldier and John's son. He has a no-nonsense approach to life, but is haunted by his experiences in combat and struggles with PTSD. He's also in a relationship with Rachel. (Play, England)"
4,Sarah,"A woman in her mid-20s, Sarah is a free-spirited artist and Jack's employee. She's creative, unconventional, and passionate about her work. However, she can also be flighty and impulsive at times.","Sarah: A woman in her mid-20s, Sarah is a free-spirited artist and Jack's employee. She's creative, unconventional, and passionate about her work. However, she can also be flighty and impulsive at times. (Play, England)"


In [6]:
df.shape

(55, 5)

# 5. Compare the performance of the chatbot with and without customization

This section compares how the chatbot responds with and without access to the character dataset. The first version just uses the base model, so it’s answering based on general knowledge. The second one pulls in relevant info using LangChain's retrieval setup, which makes the responses more grounded and specific to the dataset. This helps show the impact of customization and why feeding in the right context actually makes a difference.

## 5.1 Define function to retrieve responses from the base model without customization

In [7]:
# Define helper function to retrieve responses without customization

def chatbot_without_customization(question):
    client = openai.OpenAI(
        api_key=openai.api_key,
        base_url=openai.api_base
    )

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": question}
        ],
        max_tokens=150
    )
    return response.choices[0].message.content.strip()



## 5.2 Custom Chatbot Setup with LangChain + RAG: Embeddings, Retrieval, and Chat Model Integration

This block sets up the full flow for a retrieval-augmented chatbot using LangChain and OpenAI. The dataset is first loaded into a document format compatible with LangChain, a framework that helps manage the different parts of a language model workflow—like connecting to models, managing context, and plugging in retrieval systems like vector stores. 

LangChain was the chosen framework as it strikes a good balance between flexibility and ease of use. It supports both quick prototypes and more complex use cases, with built-in tools for retrieval, memory, chains, and model integrations. For this project, LangChain made it easy to connect the OpenAI model with the character dataset using embeddings and FAISS, without needing to write a custom retrieval pipeline from scratch.

Embeddings are generated using OpenAI’s embedding model, which turns each character description into a high-dimensional vector that captures its meaning. These vectors are stored in a FAISS vector store, which allows fast similarity searches when a question comes in. Instead of matching keywords, the system embeds the user’s question and compares it to the character descriptions by vector similarity, so it can retrieve the most relevant entries even when the wording doesn’t match exactly. OpenAI’s text-embedding-ada-002 model was used here because it’s lightweight, reliable, and integrates well with LangChain.
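
To make the similarity search concrete, here is a toy sketch of ranking documents by vector similarity. The 3-dimensional vectors are made up for illustration (real embeddings have many more dimensions), and cosine similarity is only one common metric; the FAISS store in this notebook may use a different one, such as Euclidean distance.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the names of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Made-up vectors standing in for embedded character descriptions
doc_vecs = {
    "Jack": [0.9, 0.1, 0.0],
    "Alice": [0.1, 0.9, 0.1],
    "Tom": [0.0, 0.2, 0.9],
}
# Made-up vector standing in for an embedded question
query_vec = [0.8, 0.3, 0.0]

print(top_k(query_vec, doc_vecs, k=2))  # ['Jack', 'Alice']
```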

The gpt-3.5-turbo model was selected because it offers a strong balance between performance, cost, and speed. It’s optimized for chat-like interactions, making it well-suited for conversational tasks like this custom chatbot. Compared to older models (like text-davinci-003), it handles dialogue better, supports longer context windows, and is more efficient in terms of token usage. Since the goal of this project is to create a chatbot that can respond naturally and contextually based on a dataset, gpt-3.5-turbo was the most practical and capable choice available.

The gpt-3.5-turbo model is connected using LangChain’s ChatOpenAI, and the retrieval and model are tied together through ConversationalRetrievalChain. That setup handles everything from pulling context to generating a response.

A helper function wraps it all into a simple call, making it easier to trigger responses based on user questions. The result is a chatbot that’s not just guessing—it’s pulling in actual context from the dataset and responding accordingly.

In [None]:

# Load documents into LangChain
loader = DataFrameLoader(df, page_content_column='text')
documents = loader.load()

# Use embeddings with the Vocareum custom endpoint
embeddings = OpenAIEmbeddings(
    openai_api_key=openai.api_key,
    openai_api_base=openai.api_base
)

# Create FAISS vectorstore 
vectorstore = FAISS.from_documents(documents, embeddings)

# Setup the chat model with custom endpoint
chat_model = ChatOpenAI(
    model_name="gpt-3.5-turbo",
    openai_api_key=openai.api_key,
    openai_api_base=openai.api_base
)

# Setup LangChain RAG 
retriever = vectorstore.as_retriever()
qa_chain = ConversationalRetrievalChain.from_llm(chat_model, retriever)


# Define helper function to retrieve responses with customization leveraging LangChain
def chatbot_with_langchain(question, chat_history=None):
    # Reuse the qa_chain built above; None avoids a shared mutable default list
    result = qa_chain.invoke({"question": question, "chat_history": chat_history or []})
    return result['answer']
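
Conceptually, a conversational retrieval chain does three things: it condenses a follow-up into a standalone question, retrieves matching passages, and places them in the prompt. The sketch below illustrates that flow with stand-in functions. It is not LangChain's implementation; `answer_with_context`, `fake_retrieve`, and `fake_generate` are hypothetical names, and the prompt wording is invented for the example.

```python
def answer_with_context(question, chat_history, retrieve, generate):
    """Condense the question, retrieve context, then generate an answer."""
    # Step 1: fold earlier turns into a standalone question
    if chat_history:
        history_text = "\n".join(f"Q: {q}\nA: {a}" for q, a in chat_history)
        standalone = generate(
            "Rewrite the follow-up as a standalone question.\n"
            f"{history_text}\nFollow-up: {question}"
        )
    else:
        standalone = question

    # Step 2: fetch the passages most relevant to the standalone question
    passages = retrieve(standalone)

    # Step 3: put the passages in the prompt so the answer is grounded in them
    context = "\n\n".join(passages)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {standalone}"
    return generate(prompt)


def fake_retrieve(query):
    # Stand-in for a vector store lookup
    return ["Jack: A successful businessman and Sarah's boss."]


def fake_generate(prompt):
    # Stand-in for a chat model call
    return f"[model reply to a {len(prompt)}-character prompt]"


print(answer_with_context("Who is Jack?", [], fake_retrieve, fake_generate))
```

With an empty history the model is called once; with prior turns it is called twice, once to rewrite the question and once to answer it.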

## 5.3 Retrieve responses and compare model performance with and without customization



Here’s a quick test using a few example questions to compare how the chatbot performs without any customization versus when it's using the custom character data with LangChain. This helps show the difference in response quality and how much context matters when answering character-specific questions.

In [9]:
questions = [
    "Who is Jack and what issues is he facing?",
    "How is Sarah related to Jack?",
    "Who is Alice?"
]

for q in questions:
    print(f"Question: {q}\n")
    
    print("Answer WITHOUT customization:")
    print(chatbot_without_customization(q))
    print("---")
    
    print("Answer WITH LangChain RAG customization:")
    print(chatbot_with_langchain(q))
    print("="*80 + "\n")

Question: Who is Jack and what issues is he facing?

Answer WITHOUT customization:
It is not clear who "Jack" is without more context. This could refer to anyone in the world who is facing a variety of issues. Some common issues that people named Jack or anyone else may face include financial struggles, relationship problems, mental health issues, work-related stress, and health concerns, among others. Without more information, it is difficult to determine specifically what issues Jack may be facing.
---
Answer WITH LangChain RAG customization:
Jack is a middle-aged successful businessman who is Sarah's boss and married to Alice. He has a no-nonsense attitude but is loyal to his friends and family. Based on the context provided, there is no specific information about Jack facing any particular issues.

Question: How is Sarah related to Jack?

Answer WITHOUT customization:
It depends on the specific context or information provided. Sarah could be Jack's sister, daughter, wife, cousin, n

The comparison makes it pretty clear that once the chatbot is connected to the custom character dataset using LangChain and RAG, the quality of responses improves. Without customization, the answers are vague and generic—basically what you'd expect from a model guessing without context. With customization, the responses are more specific, accurate, and tied directly to the characters in the dataset.

This shows how useful embedding-based retrieval is when the goal is to answer questions based on a specific set of information rather than relying on the model's general training.
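
To put a rough number on the difference, one option is to check how many expected facts each answer mentions. The sketch below is a crude keyword-coverage score, not an established evaluation metric. `keyword_coverage` is a hypothetical helper, the expected keywords are chosen by hand for the Jack question, and the two answers are shortened from the outputs above.

```python
def keyword_coverage(answer, expected_keywords):
    """Fraction of expected keywords found in the answer (case-insensitive)."""
    if not expected_keywords:
        return 0.0
    answer_lower = answer.lower()
    hits = sum(1 for keyword in expected_keywords if keyword.lower() in answer_lower)
    return hits / len(expected_keywords)


# Shortened versions of the two answers to "Who is Jack and what issues is he facing?"
baseline = 'It is not clear who "Jack" is without more context.'
customized = "Jack is a middle-aged successful businessman who is Sarah's boss and married to Alice."

# Hand-picked facts from Jack's dataset entry (an assumption for this example)
expected = ["businessman", "Sarah", "Alice"]

print(keyword_coverage(baseline, expected))    # 0.0
print(keyword_coverage(customized, expected))  # 1.0
```

A score like this only checks for surface mentions, so it is best treated as a quick sanity check alongside reading the answers.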

# 6. Next Steps that could help get better results

* Add memory or chat history tracking to allow for more natural, multi-turn conversations.

* Improve dataset structure by including clearer labels or tags (e.g. roles, traits, relationships) to enhance retrieval quality.

* Switch to GPT-4 for even more nuanced understanding (if available via the API key/environment).
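
As a starting point for the first item, chat history can be tracked with a small wrapper that records each turn and passes earlier turns back in. The sketch below assumes an answer function that takes `(question, chat_history)` with history as a list of `(question, answer)` tuples. `ChatSession` and `echo_bot` are hypothetical names, and `echo_bot` stands in for a real model call.

```python
class ChatSession:
    """Keep (question, answer) turns and pass them to the answer function."""

    def __init__(self, answer_fn):
        self.answer_fn = answer_fn
        self.history = []

    def ask(self, question):
        # Pass a copy so the answer function cannot mutate the stored history
        answer = self.answer_fn(question, list(self.history))
        self.history.append((question, answer))
        return answer


def echo_bot(question, chat_history):
    # Stand-in for a real chatbot; reports which turn it is answering
    return f"turn {len(chat_history) + 1}: {question}"


session = ChatSession(echo_bot)
print(session.ask("Who is Jack?"))           # turn 1: Who is Jack?
print(session.ask("Who is he married to?"))  # turn 2: Who is he married to?
```

Since `chatbot_with_langchain` takes a question and a chat history, `ChatSession(chatbot_with_langchain)` should slot in the same way, though that has not been run here.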

