# Support Chatbot (Retrieval-based)


We'll build a simple retrieval-based chatbot: fit a TF-IDF vector store over historical user messages, find the stored message most similar to a new query by cosine similarity,
and return the canned agent reply attached to that message.


In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("data/support_dataset_10k.csv")

In [3]:
df.head(5)

| Unnamed: 0 | id | timestamp | customer_id | message | issue_type | priority | agent_reply |
|---|---|---|---|---|---|---|---|
| 0 | sup_0 | 2025-10-12T12:22:08.631911 | cust_4527 | cust_4527: I have an issue: missing statement.... | missing statement | low | Statements are available under Documents > Sta... |
| 1 | sup_1 | 2025-10-29T12:22:08.631911 | cust_61 | cust_61: I have an issue: need KYC help. Need ... | need KYC help | low | To complete KYC, upload ID and address proof. ... |
| 2 | sup_2 | 2025-09-20T12:22:08.631911 | cust_3451 | cust_3451: I have an issue: how to change bank... | how to change bank details | medium | Go to Payments > Bank details. Update and comp... |
| 3 | sup_3 | 2025-09-06T12:22:08.631911 | cust_2913 | cust_2913: I have an issue: deposit not credit... | deposit not credited | high | Deposits normally show within 1-3 business day... |
| 4 | sup_4 | 2025-10-22T12:22:08.631911 | cust_4621 | cust_4621: I have an issue: account locked. Ur... | account locked | low | We're sorry — your account is locked. Please p... |


# Build retrieval engine



We vectorize the `message` column with TF-IDF, then compute the cosine similarity between a new user query and every stored message.
We return the `agent_reply` of the most similar stored message.
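To make the two steps above concrete, here is a from-scratch sketch of TF-IDF weighting and cosine similarity using only the standard library. It is an illustration, not scikit-learn's implementation: the whitespace tokenizer, the helper names, and the three-message toy corpus are all assumptions made for this example, and stop-word removal is left out. The weighting follows what scikit-learn documents as its defaults (smoothed IDF, rows scaled to unit length).

```python
import math
from collections import Counter

def tokenize(text):
    # Deliberately crude tokenizer for illustration: lowercase, split on whitespace.
    return text.lower().split()

def fit_idf(docs):
    """Smoothed IDF per term: ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

def tfidf_vector(text, idf):
    """Sparse TF-IDF vector as {term: weight}, scaled to unit length."""
    # Terms never seen while fitting are dropped, as a fitted vectorizer would do.
    counts = Counter(t for t in tokenize(text) if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

def cosine(a, b):
    # Both vectors are unit length, so their dot product is the cosine similarity.
    return sum(w * b.get(t, 0.0) for t, w in a.items())

# Hypothetical toy corpus, not taken from the dataset.
toy_messages = [
    "how to reset password",
    "deposit not credited",
    "account locked",
]
idf = fit_idf(toy_messages)
stored = [tfidf_vector(m, idf) for m in toy_messages]

query = tfidf_vector("password reset please", idf)
scores = [cosine(query, s) for s in stored]
best = max(range(len(scores)), key=scores.__getitem__)
print(toy_messages[best], round(scores[best], 3))
```

The query shares two of its terms with the first toy message and none with the others, so only that message gets a non-zero score.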


In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np




In [5]:
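# Prepare the corpus; missing messages become empty strings
corpus = df['message'].fillna("").tolist()
# Keep the 3,000 most frequent terms and drop English stop words
vec = TfidfVectorizer(max_features=3000, stop_words='english')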
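The sample rows show each message starting with a customer ID such as `cust_4527:`. With scikit-learn's default token pattern, an ID like this is kept as a single token, so IDs may occupy vocabulary slots and add weight that has nothing to do with the issue itself. One possible cleanup, sketched below, is to strip that prefix before vectorizing. The prefix pattern is inferred only from the rows displayed in this notebook, and `strip_customer_prefix` is a hypothetical helper, so treat both as assumptions to check against the full dataset.

```python
import re

# Assumed prefix shape, inferred from the sample rows: "cust_<digits>: ".
CUSTOMER_PREFIX = re.compile(r"^\s*cust_\d+:\s*")

def strip_customer_prefix(message):
    """Remove a leading customer-ID tag; leave all other text untouched."""
    return CUSTOMER_PREFIX.sub("", message)

sample = "cust_132: I have an issue: how to reset password. Please help"
print(strip_customer_prefix(sample))
```

If the assumption holds, the helper could be applied with `corpus = [strip_customer_prefix(m) for m in corpus]` before fitting the vectorizer.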


In [6]:
X = vec.fit_transform(corpus)

In [7]:
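def respond(user_text, top_k=1):
    """Return the top_k most similar stored messages with their canned replies.

    Each result holds the reply, the cosine similarity score, the matched
    historical message, and its issue type, ordered best match first.
    """
    if top_k < 1:
        return []  # a slice of [-0:] would otherwise return every row
    v = vec.transform([user_text])             # project the query into the fitted TF-IDF space
    sims = cosine_similarity(v, X).flatten()   # one score per stored message
    top_idx = np.argsort(sims)[-top_k:][::-1]  # indices of the highest scores, best first
    results = []
    for idx in top_idx:
        row = df.iloc[idx]  # look the row up once
        results.append({
            "reply": row['agent_reply'],
            "similarity": float(sims[idx]),
            "matched_message": row['message'],
            "issue_type": row['issue_type']
        })
    return results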
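Because `respond` always returns the nearest stored message, however weak the match, a caller may want to fall back to a generic answer when the best score is low. The sketch below shows one way to do that. The helper name `pick_reply`, the 0.5 threshold, and the fallback wording are all assumptions for illustration; a sensible threshold would need tuning on real queries.

```python
def pick_reply(results, min_similarity=0.5,
               fallback="Sorry, I'm not sure about that. Connecting you to an agent."):
    """Return the best reply if its score clears the threshold, else a fallback.

    `results` is a list of dicts with "reply" and "similarity" keys, ordered
    best first, i.e. the shape returned by `respond`.
    """
    if not results or results[0]["similarity"] < min_similarity:
        return fallback
    return results[0]["reply"]

# Hypothetical results, hand-written to exercise both branches.
strong = [{"reply": "Visit Account > Reset Password.", "similarity": 0.89}]
weak = [{"reply": "Check your balance.", "similarity": 0.31}]
print(pick_reply(strong))
print(pick_reply(weak))
```

In the notebook this would be called as `pick_reply(respond(q, top_k=1))`.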

In [8]:
# Quick tests
queries = [
    "I forgot my password, how to reset?",
    "My deposit hasn't shown up for 5 days",
    "I am unable to place trades - error occurred"
]

In [9]:
for q in queries:
    print("Query:", q)
    print("Top reply:", respond(q, top_k=1)[0])
    print("----")

Query: I forgot my password, how to reset?
Top reply: {'reply': "Please follow the reset link in the app or visit Account > Reset Password. If locked out, use 'Forgot password'.", 'similarity': 0.8923694880472149, 'matched_message': 'cust_132: I have an issue: how to reset password. Please help', 'issue_type': 'how to reset password'}
----
Query: My deposit hasn't shown up for 5 days
Top reply: {'reply': 'Deposits normally show within 1-3 business days. If delayed, send transaction reference and we will investigate.', 'similarity': 0.631725667054492, 'matched_message': 'cust_3354: I have an issue: deposit not credited. Please help', 'issue_type': 'deposit not credited'}
----
Query: I am unable to place trades - error occurred
Top reply: {'reply': 'Trades may be blocked due to insufficient funds or restrictions. Check your balance and ensure KYC is complete.', 'similarity': 0.4519972654502085, 'matched_message': 'cust_252: I have an issue: unable to trade. Please help', 'issue_type': 'u