<a href="https://colab.research.google.com/github/chris-bhaila/ANAIS-2025/blob/main/Day%208%20-%20Tiny_AI_Chatbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🤖 Tiny AI Chatbot  
### A Hands-On Workshop on How AI Learns to Talk

![AI Chatbot Banner](https://iili.io/fNdUQ72.png)

---

## 📌 Introduction

Artificial Intelligence (AI) can now **understand and generate human-like language**.  
From chatbots and virtual assistants to automated customer support, AI systems are learning how to *talk*. This notebook shows **how that actually works**, step by step.

In this notebook, we build a **simple AI chatbot** and train it using real text data. No advanced math or deep AI background is required.

For a deeper understanding of how Large Language Models (LLMs) work under the hood, watch this excellent talk by Andrej Karpathy:  
👉 https://www.youtube.com/watch?v=bZQun8Y4L2A

---

## 🎯 What You Will Learn

By the end of this session, you will understand:

- How AI learns language from **text data**
- What a **language model** is and how it works
- The difference between **raw data** and **trained intelligence**
- How chatbots learn to:
  - Hold conversations  
  - Tell stories  
  - Solve basic math problems
- How a trained AI model can be used through a **simple chat interface**

---

## 🧠 What We Will Build

We will build a **Tiny AI Chatbot** that can:

- Chat with users naturally
- Tell short stories
- Answer math questions
- Remember short conversation history

---

## 🛠 Tools & Technologies Used

- **Python** – programming language
- **Hugging Face Datasets** – for text data
- **Transformers (GPT-2)** – the AI language model
- **Gradio** – to create a chat interface



👉 Let’s begin our journey!

### Install required libraries
- **datasets** – to download and manage text datasets
- **transformers** – to load and train the GPT-2 language model
- **accelerate** – to optimize training performance
- **torch** – deep learning framework
- **gradio** – to create an interactive chat interface


In [None]:
!pip install -q datasets==2.16.0 transformers==4.57.0 accelerate==1.10.1 torch gradio==5.49.0

In [None]:
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

## Download and view Datasets  
**Datasets:**  
- **DailyDialog**: https://huggingface.co/datasets/roskoN/dailydialog  
- **TinyStories**: https://huggingface.co/datasets/roneneldan/TinyStories  
- **AI-MO / NuminaMath-CoT**: https://huggingface.co/datasets/AI-MO/NuminaMath-CoT  

In [None]:
from datasets import load_dataset

# Download dialog dataset
chat_ds = load_dataset("roskoN/dailydialog", trust_remote_code=True)

print(chat_ds)
print()
chat_ds["train"][0]

In [None]:
# Download story dataset
story_ds = load_dataset("roneneldan/TinyStories")

print(story_ds)
print()
story_ds['train'][0]["text"]

In [None]:
# Download math dataset
math_ds = load_dataset("AI-MO/NuminaMath-CoT")

print(math_ds)
print()
math_ds['train'][0]

## Preparing the Chat Dataset

The chat dataset is enhanced by injecting stories and math problems into conversations.

### Key Concepts
- Alternating **User** and **Bot** messages
- Random insertion of:
  - Story prompts and responses
  - Math problems and solutions
- **END_TOKEN** added after each bot response
- All prepared datasets are combined into a single training dataset.
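
The format described above can be sketched on a hand-written conversation. This is only an illustration, not the notebook's actual pipeline: the `format_turns` helper and the sample utterances are hypothetical, and `END_TOKEN` is assumed to be a newline, as in the config cell. In this sketch the extra exchange is placed before a user turn so that User and Bot turns keep alternating.

```python
END_TOKEN = "\n"  # assumed end-of-reply marker (same value as the config cell)

def format_turns(utterances, extra_exchange=None, insert_idx=0):
    """Label turns as User/Bot and optionally inject one extra exchange.

    `extra_exchange` is a hypothetical (prompt, response) pair that is placed
    before the user turn at `insert_idx` (expected to be an even index).
    """
    lines = []
    for i, sentence in enumerate(utterances):
        if extra_exchange is not None and i == insert_idx:
            prompt, response = extra_exchange
            lines.append(f"User: {prompt}")
            lines.append(f"Bot: {response} {END_TOKEN}")
        if i % 2 == 0:
            lines.append(f"User: {sentence}")
        else:
            lines.append(f"Bot: {sentence} {END_TOKEN}")
    return "\n".join(lines).strip()

sample = format_turns(
    ["Hi, how are you?", "I'm fine, thanks!"],
    extra_exchange=("Tell a story.", "Once upon a time, a cat found a hat."),
    insert_idx=0,
)
print(sample)
```

Every training example ends up as plain text with `User:` and `Bot:` labels, which is the same shape the chat interface later feeds to the model.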

In [None]:
from datasets import load_dataset, concatenate_datasets
import random
import re

# Config
# ----------------------------
CHAT_SAMPLES  = 2000
STORY_SAMPLES = 2000
MATH_SAMPLES  = 2000

SEED = 42

STORY_INSERT_PROB = 0.3
MATH_INSERT_PROB = 0.3

END_TOKEN = "\n"

random.seed(SEED)

STORY_PROMPTS = [
    "Write a story.",
    "Tell a story.",
    "Create a story.",
    "Write a short story.",
    "Make up a story.",
    "Share a story.",
    "Invent a story.",
    "Compose a story.",
    "Write a narrative.",
    "Create a narrative.",
    "Tell a tale.",
    "Write a fictional story.",
    "Create a short narrative.",
    "Tell an original story.",
    "Write an imaginative story.",
    "Compose a short tale.",
    "Create an original story.",
    "Write a creative story.",
    "Tell a short tale.",
    "Make up a short story."
]


# Math LaTeX Normalizer (Gradio-safe)
# -------------------------------------------------
def normalize_math_tex(text: str) -> str:
    # Convert \[ ... \] → $$ ... $$
    text = re.sub(r"\\\[(.*?)\\\]", r"$$\1$$", text, flags=re.DOTALL)

    # Convert single $...$ → $$...$$ (ignore existing $$)
    text = re.sub(
        r"(?<!\$)\$(?!\$)(.+?)(?<!\$)\$(?!\$)",
        r"$$\1$$",
        text,
        flags=re.DOTALL,
    )

    return text.strip()


# Load Story Dataset
# -------------------------------------------------
story_raw_ds = load_dataset(
    "roneneldan/TinyStories",
    split=f"train[:{STORY_SAMPLES}]"
)

stories = [
    s["text"].replace("\n\n", " ").strip()
    for s in story_raw_ds
]


# Load Math Dataset
# -------------------------------------------------
math_raw_ds = load_dataset(
    "AI-MO/NuminaMath-CoT",
    split=f"train[:{MATH_SAMPLES}]"
)

math_problems = [
    (
        normalize_math_tex(m["problem"]),
        normalize_math_tex(m["solution"])
    )
    for m in math_raw_ds
]


# Chat Formatter (Story + Math Injection)
# -------------------------------------------------
def format_chat(example):
    text = ""
    utterances = example["utterances"]

    insert_story = random.random() < STORY_INSERT_PROB
    insert_math = random.random() < MATH_INSERT_PROB

    insert_idx = random.randrange(0, len(utterances), 2)

    for i, sentence in enumerate(utterances):
        # Inject the extra exchange *before* the chosen user turn (insert_idx is
        # always even) so that User and Bot turns keep alternating.
        if i == insert_idx:
            if insert_story:
                prompt = random.choice(STORY_PROMPTS)
                story = random.choice(stories)
                text += f"User: {prompt}\n"
                text += f"Bot: {story} {END_TOKEN}\n"

            ## Uncomment this if you want to train with math dataset
            # elif insert_math:
            #     problem, solution = random.choice(math_problems)
            #     text += f"User: {problem}\n"
            #     text += f"Bot: {solution} {END_TOKEN}\n"

        if i % 2 == 0:
            text += f"User: {sentence}\n"
        else:
            text += f"Bot: {sentence} {END_TOKEN}\n"

    return {"text": text.strip()}


# Story-only Formatter
# -------------------------------------------------
def format_story(example):
    prompt = random.choice(STORY_PROMPTS)
    # Flatten paragraph breaks (as done for `stories` above) so a story stays on one line
    story = example["text"].replace("\n\n", " ").strip()
    return {
        "text": f"User: {prompt}\nBot: {story} {END_TOKEN}"
    }


# Math-only Formatter
# -------------------------------------------------
def format_math(example):
    return {
        "text": (
            f"User: {normalize_math_tex(example['problem'])}\n"
            f"Bot: {normalize_math_tex(example['solution'])} {END_TOKEN}"
        )
    }


# Load & Prepare Chat Dataset
# -------------------------------------------------
chat_ds = load_dataset(
    "roskoN/dailydialog",
    split=f"train[:{CHAT_SAMPLES}]",
    trust_remote_code=True
)

chat_ds = chat_ds.map(
    format_chat,
    remove_columns=chat_ds.column_names
)


# Prepare Story-only Dataset
# -------------------------------------------------
story_ds = story_raw_ds.map(
    format_story,
    remove_columns=story_raw_ds.column_names
)


# Prepare Math-only Dataset
# -------------------------------------------------
math_ds = math_raw_ds.map(
    format_math,
    remove_columns=math_raw_ds.column_names
)


# Merge + Shuffle
# -------------------------------------------------
dataset = concatenate_datasets([chat_ds, story_ds])           # Train with Chat + Story dataset
#dataset = concatenate_datasets([chat_ds, story_ds, math_ds])  # Train with Chat + Story + Math dataset

dataset = dataset.shuffle(seed=SEED)

# -------------------------------------------------
# Sanity Check
print(dataset)
for i in range(15):
    print("\n--- SAMPLE ---\n")
    print(dataset[i]["text"])

## Loading the Base Language Model

A pre-trained GPT-2 model is used as the foundation.

### Why GPT-2?
- Already understands basic language patterns
- Small enough for fast training

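GPT-2 defines no padding token of its own, so the tokenizer reuses the end-of-sequence token for padding. Calling `resize_token_embeddings` keeps the model's embedding matrix the same size as the tokenizer's vocabulary, which matters if custom tokens are added.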

In [None]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel

import warnings
warnings.filterwarnings("ignore")

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))

#### Alternative models

You can also experiment with alternative models. Uncomment the code in the following cell to use one of them.

- **microsoft/DialoGPT-small** (GPT-2 based but chat-trained)
- **meta-llama/Llama-2-Chat**
- **mistralai/Mistral-7B-Instruct**
- **Qwen/Qwen-Chat**

In [None]:
# from transformers import AutoTokenizer, AutoModelForCausalLM

##  Uncomment this code to use alternative models

# tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
# tokenizer.pad_token = tokenizer.eos_token  # needed for padding="max_length" during tokenization
# model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")

## Tokenization

Text data must be converted into numbers for the model to understand.

#### Tokenization Steps
- Convert text into token IDs
- Truncate or pad sequences to a fixed length
- Create labels for supervised learning
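
The three steps above can be sketched with a toy word-level tokenizer. This is a simplification for illustration: GPT-2 actually uses byte-pair encoding over subword pieces, and the `build_vocab` and `encode` helpers here are hypothetical. Marking padded positions with `-100` in the labels is a common convention for telling the loss to skip them, and it is an assumption of this sketch.

```python
def build_vocab(texts, pad_token="<pad>"):
    """Map each distinct whitespace-separated word to an integer ID."""
    vocab = {pad_token: 0}
    for text in texts:
        for word in text.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(text, vocab, max_length, pad_id=0):
    """Convert text to IDs, then truncate or pad to exactly max_length."""
    ids = [vocab[word] for word in text.split()][:max_length]
    attention_mask = [1] * len(ids)
    padding = max_length - len(ids)
    ids += [pad_id] * padding
    attention_mask += [0] * padding
    # Labels copy the inputs; -100 marks padded positions to skip in the loss
    labels = [tok if keep else -100 for tok, keep in zip(ids, attention_mask)]
    return {"input_ids": ids, "attention_mask": attention_mask, "labels": labels}

vocab = build_vocab(["User: hello there", "Bot: hello"])
encoded = encode("Bot: hello", vocab, max_length=4)
print(encoded)
```

Short texts are padded up to `max_length` and long ones are cut off, so every example has the same length and can be stacked into a batch.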

In [None]:
def tokenize(example):
    tokens = tokenizer(
        example["text"],
        truncation=True,
        padding="max_length",
        max_length=128
    )

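    # Use the input IDs as labels, but set padded positions to -100 so the
    # loss ignores them (pad_token is the EOS token, so rely on attention_mask)
    tokens["labels"] = [
        token_id if mask == 1 else -100
        for token_id, mask in zip(tokens["input_ids"], tokens["attention_mask"])
    ]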
    return tokens

tokenized_ds = dataset.map(tokenize, remove_columns=["text"])

## Training base model on dataset
The model is trained using supervised learning.
- The AI predicts the next word in a sentence
- Predictions are compared with the correct answer
- Errors are used to improve the model
- Training runs for multiple epochs

The process gradually improves the chatbot’s responses.
To improve the chatbot’s responses further, increase the number of training epochs or add more data.
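
To make "predictions are compared with the correct answer" concrete, here is a small sketch of the cross-entropy loss for a single next-word prediction. The three-word vocabulary and the score values are made up for illustration, and real training averages this loss over every token position in a batch.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(scores, correct_index):
    """Cross-entropy loss: small when the correct token gets high probability."""
    probs = softmax(scores)
    return -math.log(probs[correct_index])

vocab = ["cat", "dog", "mat"]    # hypothetical 3-word vocabulary
correct = vocab.index("mat")     # true next word after "The cat sat on the"

loss_before = next_token_loss([1.0, 1.0, 1.0], correct)  # model is unsure
loss_after = next_token_loss([0.5, 0.5, 3.0], correct)   # model favours "mat"
print(f"loss before: {loss_before:.3f}, loss after: {loss_after:.3f}")
```

Training nudges the model's weights so that the score of the correct next token rises, which is what drives the loss down from one step to the next.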

In [None]:
import torch
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./chatbot",
    per_device_train_batch_size=8,
    num_train_epochs=2,
    logging_steps=100,
    save_steps=500,
    fp16=torch.cuda.is_available(),  # mixed precision needs a CUDA GPU; use full precision on CPU
    report_to="none"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_ds
)

trainer.train()

## Save our trained model for later use

This code saves our trained model on disk so we can load and chat with it later.

In [None]:
save_directory = "./my_gpt2_model"  # any folder you like
model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)

## Let's chat with our AI

#### Interface Features
- Text-based chat interaction
- Maintains short conversation history
- Displays AI-generated responses instantly

#### Conversation Memory Handling
The chatbot remembers recent messages to maintain context.

#### How It Works
- Keeps a limited number of past interactions
- Formats them into a structured prompt
- Feeds the prompt into the model for response generation

This makes conversations feel more natural.
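
The steps above can be tried without a model. The sketch below assumes a history stored as `(user, bot)` pairs and uses a deliberately small window of two exchanges so the trimming is easy to see. The sample conversation is invented.

```python
MAX_HISTORY = 2      # assumed small window for this demo
END_TOKEN = "\n"     # assumed end-of-reply marker

def build_prompt(message, history, max_history=MAX_HISTORY):
    """Format the last `max_history` (user, bot) pairs plus the new message."""
    prompt = ""
    for user, bot in history[-max_history:]:
        prompt += f"User: {user}\nBot: {bot} {END_TOKEN}\n"
    # End with an open "Bot:" so the model continues with its reply
    return prompt + f"User: {message}\nBot:"

history = [
    ("Hi", "Hello!"),
    ("What is your name?", "I am Tiny Bot."),
    ("Do you like stories?", "Yes, I do."),
]
prompt = build_prompt("Tell a story.", history)
print(prompt)
```

The oldest exchange is dropped from the prompt, which keeps the input short enough for a small model while still giving it the most recent context.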

In [None]:
import gradio as gr

MAX_HISTORY = 10

def build_prompt(message, history):
    history = history[-MAX_HISTORY:]

    prompt = ""
    for user, bot in history:
        prompt += f"User: {user}\nBot: {bot} {END_TOKEN}\n"

    prompt += f"User: {message}\nBot:"
    return prompt


def chat(message, history):
    history = history[-MAX_HISTORY:]

    prompt = build_prompt(message, history)

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    end_token_id = tokenizer.encode(END_TOKEN, add_special_tokens=False)[0]

    output = model.generate(
        **inputs,
        max_new_tokens=350,
        temperature=0.7,
        do_sample=True,
        eos_token_id=end_token_id,
        pad_token_id=tokenizer.eos_token_id,
    )

    decoded = tokenizer.decode(output[0], skip_special_tokens=True)

    # Extract only the last bot reply
    bot_reply = decoded.split("Bot:")[-1]
    bot_reply = bot_reply.split(END_TOKEN)[0].strip()

    return bot_reply


ui_interface = gr.ChatInterface(
    fn=chat,
    title="Tiny AI Bot",
    description="Trained on our dataset",
    theme="soft"
)

ui_interface.launch()