# Day-77: RNN Project: Build a Basic Chatbot Using RNNs

Welcome back to Day 77 of our 100 Days of Data Science Challenge!
In the last few days, we’ve been diving deep into RNNs, LSTMs, and GRUs, understanding how these networks handle sequential data.

Today, we’re going hands-on: we’ll build a simple chatbot using an RNN model.
This chatbot won’t be as advanced as ChatGPT, but it will show how RNNs can learn patterns in conversation data to generate human-like responses.

## Topics Covered:

- Understanding the Sequence-to-Sequence (Seq2Seq) Architecture

- Data Preprocessing for Chatbots (Tokenization, Padding, Integer Encoding)

- The Role of RNN/LSTM/GRU Cells in Sequence Modeling

- Implementing the Encoder

- Implementing the Decoder

- Training the Chatbot Model

## Understanding the Sequence-to-Sequence (Seq2Seq) Architecture

Imagine you're a translator. When someone speaks a sentence in Language A, you first listen to the entire sentence and understand its core meaning. Then, you formulate and speak the translated sentence in Language B.

`Analogy`: The Encoder is the part that listens and understands the input sentence. The Decoder is the part that formulates and speaks the output sentence (the response). The core meaning passed between them is the Context Vector (or thought vector).

- Example:

    - Input (Encoder): "What time is it now?"

    - Context Vector: Captures the intent (the user is asking for the current time).

    - Output (Decoder): "It is currently 7 PM."

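In a Seq2Seq chatbot, the encoder processes the user's message and the decoder generates the reply. LSTMs (Long Short-Term Memory) or GRUs (Gated Recurrent Units) are typically used instead of simple RNNs because their gates capture long-term dependencies better and mitigate (though do not fully eliminate) the vanishing gradient problem. The notebook below implements a simplified variant of this idea: a bidirectional LSTM encodes the message, and a softmax classifier over intent tags takes the place of the decoder, with the reply drawn from that intent's predefined responses.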
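To make the context-vector idea concrete, here is a minimal pure-Python sketch of an encoder: a single-layer tanh RNN that folds a sequence of any length into one fixed-size hidden state. The weights are random and untrained, and the names `rnn_step`, `encode`, `EMBED`, and `HIDDEN`, along with the toy inputs, are illustrative assumptions rather than anything from the notebook.

```python
import math
import random

random.seed(0)
EMBED, HIDDEN = 4, 3  # toy sizes, chosen only for illustration

# Random, untrained weights: input-to-hidden and hidden-to-hidden.
W_xh = [[random.uniform(-0.5, 0.5) for _ in range(EMBED)] for _ in range(HIDDEN)]
W_hh = [[random.uniform(-0.5, 0.5) for _ in range(HIDDEN)] for _ in range(HIDDEN)]

def rnn_step(x, h):
    """One recurrent step: h_new = tanh(W_xh @ x + W_hh @ h)."""
    return [
        math.tanh(
            sum(w * xi for w, xi in zip(W_xh[j], x))
            + sum(w * hi for w, hi in zip(W_hh[j], h))
        )
        for j in range(HIDDEN)
    ]

def encode(sequence):
    """Run the RNN over the whole sequence; the final state is the context vector."""
    h = [0.0] * HIDDEN
    for x in sequence:
        h = rnn_step(x, h)
    return h

# Two "sentences" of different lengths (random stand-ins for word embeddings).
short_sentence = [[random.random() for _ in range(EMBED)] for _ in range(3)]
long_sentence = [[random.random() for _ in range(EMBED)] for _ in range(9)]

context_short = encode(short_sentence)
context_long = encode(long_sentence)
print(len(context_short), len(context_long))  # both equal HIDDEN
```

Whatever the input length, the encoder hands over a vector of the same size. A decoder would start from that vector and emit the reply one token at a time.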

## Data Preprocessing for Chatbots (Tokenization, Padding, Integer Encoding)

In [66]:
! pip install kagglehub[pandas-datasets] scikit-learn





In [67]:
import json
import os

import kagglehub
import pandas as pd

# Download the latest version of the dataset
path = kagglehub.dataset_download("niraliivaghani/chatbot-dataset")

print("Path to dataset files:", path)

file_path = os.path.join(path, "intents.json")

def load_json_file(filename):
    # Parse the JSON file and return its contents as a dict
    with open(filename) as f:
        return json.load(f)

intents = load_json_file(file_path)
intents


Path to dataset files: C:\Users\amey9\.cache\kagglehub\datasets\niraliivaghani\chatbot-dataset\versions\1


{'intents': [{'tag': 'greeting',
   'patterns': ['Hi',
    'How are you?',
    'Is anyone there?',
    'Hello',
    'Good day',
    "What's up",
    'how are ya',
    'heyy',
    'whatsup',
    '??? ??? ??'],
   'responses': ['Hello!',
    'Good to see you again!',
    'Hi there, how can I help?'],
   'context_set': ''},
  {'tag': 'goodbye',
   'patterns': ['cya',
    'see you',
    'bye bye',
    'See you later',
    'Goodbye',
    'I am Leaving',
    'Bye',
    'Have a Good day',
    'talk to you later',
    'ttyl',
    'i got to go',
    'gtg'],
   'responses': ['Sad to see you go :(',
    'Talk to you later',
    'Goodbye!',
    'Come back soon'],
   'context_set': ''},
  {'tag': 'creator',
   'patterns': ['what is the name of your developers',
    'what is the name of your creators',
    'what is the name of the developers',
    'what is the name of the creators',
    'who created you',
    'your developers',
    'your creators',
    'who are your developers',
    'developers',
  

In [68]:
def create_df():
    df = pd.DataFrame({
        'Pattern' : [],
        'Tag' : []
    })
    return df

df = create_df()

df.head()

| | Pattern | Tag |
|---|---|---|


In [69]:
def extract_json_info(json_file, df):
    # Append one (pattern, tag) row per example utterance
    for intent in json_file['intents']:
        for pattern in intent['patterns']:
            sentence_tag = [pattern, intent['tag']]
            df.loc[len(df.index)] = sentence_tag
    return df

df = extract_json_info(intents, df)
df.head()

| | Pattern | Tag |
|---|---|---|
| 0 | Hi | greeting |
| 1 | How are you? | greeting |
| 2 | Is anyone there? | greeting |
| 3 | Hello | greeting |
| 4 | Good day | greeting |
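Appending rows one at a time with `df.loc` works, but each append grows the frame again. A common alternative, sketched here as a suggestion rather than the notebook's method, is to collect all `(pattern, tag)` pairs in a list first. The `sample_intents` dict is a hypothetical miniature of the real file, and `flatten_intents` is an illustrative name.

```python
# Hypothetical miniature of the intents structure, for illustration only.
sample_intents = {
    "intents": [
        {"tag": "greeting", "patterns": ["Hi", "Hello"]},
        {"tag": "goodbye", "patterns": ["Bye"]},
    ]
}

def flatten_intents(data):
    """Return one (pattern, tag) pair per example utterance."""
    return [
        (pattern, intent["tag"])
        for intent in data["intents"]
        for pattern in intent["patterns"]
    ]

rows = flatten_intents(sample_intents)
print(rows)
# [('Hi', 'greeting'), ('Hello', 'greeting'), ('Bye', 'goodbye')]
```

Passing `rows` to `pd.DataFrame(rows, columns=["Pattern", "Tag"])` would then build the frame in a single step.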


In [70]:
df.tail()

| | Pattern | Tag |
|---|---|---|
| 400 | ragging history | ragging |
| 401 | ragging incidents | ragging |
| 402 | hod | hod |
| 403 | hod name | hod |
| 404 | who is the hod | hod |


In [71]:
# --- 1) build responses map from your loaded 'intents' dict ---
# keep links in web UI; strip for CLI later
tag_to_responses = {it["tag"]: it.get("responses", []) for it in intents["intents"]}

In [72]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# --- 2) tokenize texts ---
texts = df["Pattern"].astype(str).tolist()
tags  = df["Tag"].astype(str).tolist()

MAX_VOCAB = 5000
OOV_TOKEN = "<OOV>"
tokenizer = Tokenizer(num_words=MAX_VOCAB, oov_token=OOV_TOKEN)
tokenizer.fit_on_texts(texts)
X = tokenizer.texts_to_sequences(texts)
X = pad_sequences(X, padding="post") 
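To see what tokenization, integer encoding, and post-padding do to the text, here is a simplified pure-Python imitation of the steps above. It is a sketch, not the Keras implementation: the real `Tokenizer` has its own filtering rules and a `num_words` cap. The helper names `tokenize`, `fit_vocab`, `encode_text`, and `pad_post` are made up for this example. As in Keras, id 0 is reserved for padding and id 1 for the OOV token.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())

def fit_vocab(texts, oov_token="<OOV>"):
    """Map each word to an integer id: 0 is padding, 1 is OOV, then by frequency."""
    counts = Counter(word for text in texts for word in tokenize(text))
    vocab = {oov_token: 1}
    for word, _ in counts.most_common():  # most frequent words get the smallest ids
        vocab[word] = len(vocab) + 1
    return vocab

def encode_text(text, vocab, oov_token="<OOV>"):
    """Replace each word by its id; unseen words fall back to the OOV id."""
    return [vocab.get(word, vocab[oov_token]) for word in tokenize(text)]

def pad_post(sequences):
    """Right-pad every sequence with zeros up to the longest length."""
    maxlen = max(len(seq) for seq in sequences)
    return [seq + [0] * (maxlen - len(seq)) for seq in sequences]

sample_texts = ["Hi", "How are you?", "how are ya"]
vocab = fit_vocab(sample_texts)
padded = pad_post([encode_text(t, vocab) for t in sample_texts])

print(padded)                                # [[4, 0, 0], [2, 3, 5], [2, 3, 6]]
print(encode_text("where are you", vocab))   # [1, 3, 5]
```

The word "where" never appeared during fitting, so it maps to the OOV id 1 instead of being dropped silently.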

In [73]:
from sklearn.preprocessing import LabelEncoder
# --- 3) encode labels ---
le = LabelEncoder()
y_int = le.fit_transform(tags)
num_classes = len(le.classes_)
y = tf.keras.utils.to_categorical(y_int, num_classes=num_classes)
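The two label steps can be sketched in plain Python: `LabelEncoder` assigns integer ids to the sorted unique tags, and `to_categorical` turns each id into a one-hot row. The helpers `fit_labels` and `one_hot` and the toy tag list are illustrative assumptions.

```python
def fit_labels(tags):
    """Assign integer ids to the sorted unique tags."""
    classes = sorted(set(tags))
    index = {tag: i for i, tag in enumerate(classes)}
    return classes, index

def one_hot(i, n):
    """Return a length-n row with a 1.0 at position i."""
    return [1.0 if j == i else 0.0 for j in range(n)]

toy_tags = ["greeting", "greeting", "goodbye", "hod"]
classes, index = fit_labels(toy_tags)
toy_y_int = [index[t] for t in toy_tags]
toy_y = [one_hot(i, len(classes)) for i in toy_y_int]

print(classes)    # ['goodbye', 'greeting', 'hod']
print(toy_y_int)  # [1, 1, 0, 2]
print(toy_y[2])   # [1.0, 0.0, 0.0]
```

One-hot rows pair with the `categorical_crossentropy` loss. Keeping the integer ids and using `sparse_categorical_crossentropy` instead would avoid the one-hot step altogether.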

In [74]:
from sklearn.model_selection import train_test_split
# --- 4) split ---
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.15, random_state=42, stratify=y_int
)
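`stratify=y_int` keeps each intent represented in both splits in roughly the same proportion. The sketch below shows the idea with per-class rounding. It is a simplification and not scikit-learn's actual algorithm, and `stratified_split` and the toy labels are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.15, seed=42):
    """Return (train_idx, val_idx), sampling each class at roughly test_size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)  # per-class share of validation rows
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

toy_labels = ["greeting"] * 20 + ["goodbye"] * 20
train_idx, val_idx = stratified_split(toy_labels)
print(len(train_idx), len(val_idx))  # 34 6
```

With only a handful of patterns per intent, a 15% share can round down to zero for small classes, so validation accuracy on such data should be read with some caution.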

In [85]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Bidirectional, LSTM, Dense, Dropout

EMBED_DIM = 128
RNN_UNITS = 96
DROPOUT = 0.35

model = tf.keras.Sequential([
    tf.keras.Input(shape=(X.shape[1],), dtype="int32", name="tokens"),
    tf.keras.layers.Embedding(input_dim=MAX_VOCAB, output_dim=EMBED_DIM),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(RNN_UNITS)),
    tf.keras.layers.Dropout(DROPOUT),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",   # use this if y is one-hot
    metrics=["accuracy"]
)


model.summary()
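The parameter counts that `model.summary()` reports can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers with the hyperparameters from the cell above. The helper function names are made up for this example, and the output layer is left out because the notebook does not print `num_classes`.

```python
def embedding_params(vocab_size, embed_dim):
    """One embed_dim-sized vector per vocabulary entry."""
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    """4 gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """Weight matrix plus one bias per unit."""
    return input_dim * units + units

MAX_VOCAB, EMBED_DIM, RNN_UNITS = 5000, 128, 96

emb = embedding_params(MAX_VOCAB, EMBED_DIM)    # 640000
bilstm = 2 * lstm_params(EMBED_DIM, RNN_UNITS)  # 172800 (forward + backward LSTM)
hidden = dense_params(2 * RNN_UNITS, 64)        # 12352 (BiLSTM outputs 2 * 96 features)
print(emb, bilstm, hidden)
```

The embedding table dominates the total. Since `MAX_VOCAB` is likely far larger than the vocabulary of a few hundred short patterns, most of those rows would never be trained.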

In [86]:
callbacks = [
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=2, min_lr=1e-5),
    tf.keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=6, restore_best_weights=True)
]

history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=40, batch_size=16, callbacks=callbacks, verbose=1
)
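The learning-rate schedule follows a simple rule: if `val_loss` has not improved for `patience` epochs, multiply the rate by `factor`, but never go below `min_lr`. Here is a simplified simulation of that rule with the settings above. It ignores the callback's `min_delta` and `cooldown` options, and the validation losses are invented for illustration.

```python
def simulate_reduce_lr(val_losses, lr=1e-3, factor=0.5, patience=2, min_lr=1e-5):
    """Return the learning rate in effect after each epoch."""
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best:
            best = loss   # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)  # cut the rate, respecting the floor
                wait = 0
        history.append(lr)
    return history

# Made-up validation losses: two stagnant epochs trigger one reduction.
lrs = simulate_reduce_lr([3.0, 2.9, 2.95, 2.96, 2.8])
print(lrs)  # [0.001, 0.001, 0.001, 0.0005, 0.0005]
```

`EarlyStopping` applies the same patience idea to `val_accuracy`, but stops training and restores the best weights instead of lowering the rate.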


Epoch 1/40
22/22 ━━━━━━━━━━━━━━━━━━━━ 3s 26ms/step - accuracy: 0.0581 - loss: 3.6285 - val_accuracy: 0.0656 - val_loss: 3.6014 - learning_rate: 0.0010
Epoch 2/40
22/22 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - accuracy: 0.0727 - loss: 3.5710 - val_accuracy: 0.0820 - val_loss: 3.5270 - learning_rate: 0.0010
Epoch 3/40
22/22 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - accuracy: 0.0930 - loss: 3.4883 - val_accuracy: 0.1475 - val_loss: 3.4542 - learning_rate: 0.0010
Epoch 4/40
22/22 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - accuracy: 0.1047 - loss: 3.2939 - val_accuracy: 0.1311 - val_loss: 3.1979 - learning_rate: 0.0010
Epoch 5/40
22/22 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - accuracy: 0.1686 - loss: 2.9825 - val_accuracy: 0.1967 - val_loss: 2.9476 - learning_rate: 0.0010
Epoch 6/40
22/22 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step -

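A quick sanity check on the first epoch: a classifier that guesses uniformly over K classes has a cross-entropy of ln K. The sketch below evaluates that baseline for a few values of K. The notebook does not print `num_classes`, so the values tried here, including 38, are guesses for illustration only.

```python
import math

def uniform_cross_entropy(num_classes):
    """Cross-entropy (in nats) of predicting 1/K for each of K classes."""
    return -math.log(1.0 / num_classes)

for k in (10, 38, 100):
    print(k, round(uniform_cross_entropy(k), 4))
```

An epoch-1 loss of about 3.63 is close to what a uniform guess over a few dozen classes would give. That is the expected starting point for an untrained softmax, so it does not indicate a bug.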
In [87]:
import numpy as np
import random, re, html, json
import pickle

# --- 6) inference helpers ---
CONF_THRESHOLD = 0.55
FALLBACKS = [
    "Hmm, not sure I got that—could you rephrase?",
    "Sorry, I’m still learning. Can you try asking another way?",
]

def predict_intent(user_text: str):
    seq = tokenizer.texts_to_sequences([user_text])
    padded = pad_sequences(seq, maxlen=X.shape[1], padding="post")
    probs = model.predict(padded, verbose=0)[0]
    idx = int(np.argmax(probs))
    conf = float(probs[idx])
    tag  = le.inverse_transform([idx])[0]
    return tag, conf

# (optional) clean HTML/placeholder in responses for CLI
def strip_html(s: str) -> str:
    # quick & simple: remove tags
    return re.sub(r"<[^>]+>", " ", html.unescape(s)).strip()

PLACEHOLDERS = {
    r"\bNUMBER\b": "+91-99999-99999",
    r"ADD YOU GOOGLE MAP LINK": "https://maps.google.com/?q=Your+College+Name"
}
def fill_placeholders(text: str) -> str:
    out = text
    for pat, repl in PLACEHOLDERS.items():
        out = re.sub(pat, repl, out, flags=re.IGNORECASE)
    return out

def reply(user_text: str, as_html: bool = False) -> str:
    tag, conf = predict_intent(user_text)
    if conf < CONF_THRESHOLD or not tag_to_responses.get(tag):
        return random.choice(FALLBACKS) + f" (conf={conf:.2f})"
    resp = random.choice(tag_to_responses[tag])
    resp = fill_placeholders(resp)
    return resp if as_html else strip_html(resp)

# smoke test
for t in ["Hi", "fees details", "where is the college", "bye", "who is the HOD"]:
    print(f"You: {t}\nBot:", reply(t), "\n")

# --- 7) save artifacts ---
os.makedirs("artifacts_day77", exist_ok=True)
# The file name says "gru", but the saved model is the bidirectional LSTM defined above.
model.save("artifacts_day77/gru_intent_model.keras")
with open("artifacts_day77/tokenizer.pkl", "wb") as f:
    pickle.dump(tokenizer, f)
with open("artifacts_day77/label_encoder.pkl", "wb") as f:
    pickle.dump(le, f)
with open("artifacts_day77/tag_to_responses.json", "w", encoding="utf-8") as f:
    json.dump(tag_to_responses, f, ensure_ascii=False, indent=2)

You: Hi
Bot: Hi there, how can I help? 

You: fees details
Bot: For Fee detail visit   here 

You: where is the college
Bot: here 

You: bye
Bot: Sad to see you go :( 

You: who is the HOD
Bot: Sorry, I’m still learning. Can you try asking another way? (conf=0.24) 
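The replies "For Fee detail visit   here" and "here" look like responses whose link tags were removed by `strip_html`, taking the URL with them. Assuming those responses contain HTML anchor tags (the raw text is not shown above), a variant that keeps the link target might look like the sketch below. `html_to_text`, `LINK_RE`, and the sample response with its placeholder URL are all hypothetical.

```python
import html
import re

# Matches <a ... href="URL" ...>label</a> and captures the URL and the label.
LINK_RE = re.compile(
    r'<a\b[^>]*\bhref="([^"]*)"[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)

def html_to_text(s: str) -> str:
    """Turn <a href="URL">label</a> into 'label (URL)', then drop remaining tags."""
    s = html.unescape(s)
    s = LINK_RE.sub(lambda m: f"{m.group(2).strip()} ({m.group(1)})", s)
    s = re.sub(r"<[^>]+>", " ", s)
    return re.sub(r"\s+", " ", s).strip()

# Hypothetical response in the style of the dataset; the URL is a placeholder.
sample = 'For Fee detail visit <a target="_blank" href="https://example.com/fees"> here</a>'
print(html_to_text(sample))
# For Fee detail visit here (https://example.com/fees)
```

Using a helper like this in place of `strip_html` inside `reply` would let the command-line bot show where "here" points. The last test message fell back at `conf=0.24`, which suggests that intent needs more training patterns or a lower `CONF_THRESHOLD`.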

