
# Chatbot (Refactored) — Qwen 2.5 + Hugging Face Transformers

This notebook modernizes the original chatbot to use **Qwen 2.5 Instruct** models via `transformers` and adds **CUDA memory management** using your `utils.py` helpers.

**Highlights**:
- Uses **Qwen 2.5 Instruct** models (configurable via `model_id`).
- Robust **device selection** and **CUDA memory cleanup** hooks (from `utils.py`).
- Clean chat helper using the model’s **chat template**.
- Optional fine-tuning scaffold with `Trainer` (commented by default).

> Note: Internet is required on your machine to download models from the Hugging Face Hub. If you keep `trust_remote_code=False`, use official model repos that are fully supported by Transformers. For Qwen chat template support, you may need `trust_remote_code=True` depending on your `transformers` version.
>Note: Internet access is required to download models from the Hugging Face Hub. Before downloading models, you must:

    Create an account at Hugging Face if you don't have one

    Generate an access token in your Hugging Face settings

    Set the HUGGING_FACE_HUB_TOKEN environment variable with your token:

    export HUGGING_FACE_HUB_TOKEN="your_token_here"

> If you keep trust_remote_code=False, use only official model repositories that are fully supported by Transformers. For Qwen chat template support, you may need trust_remote_code=True depending on your transformers version.


In [1]:

# Environment & version checks
import sys, os, platform
print(f"Python: {sys.version}")
try:
    import torch, transformers
    print("PyTorch:", torch.__version__)
    print("Transformers:", transformers.__version__)
except Exception as e:
    print("You likely need to install torch/transformers:", e)

# Optionally: pip installs (uncomment if running in a fresh env)
# %pip install --upgrade pip
# %pip install --upgrade torch transformers accelerate datasets peft bitsandbytes


Python: 3.12.11 (main, Jul 23 2025, 00:34:44) [Clang 20.1.4 ]
PyTorch: 2.8.0+cu128
Transformers: 4.55.4


In [2]:

# Import CUDA utils from parent folder (preferred), fallback to local
import sys, os
from pathlib import Path

# Try parent directory first (ideal location)
parent_dir = str(Path.cwd().parent)
if parent_dir not in sys.path:
    sys.path.insert(0, parent_dir)

try:
    import utils  # expected at ../utils.py
except Exception:
    # Fallback: current working directory
    curr_dir = str(Path.cwd())
    if curr_dir not in sys.path:
        sys.path.insert(0, curr_dir)
    import utils  # tries ./utils.py

print("Loaded utils from:", utils.__file__)
# Set memory env & show current device
utils.setup_memory_environment(expandable_segments=True)
device = utils.get_device()
print("Selected device:", device)


Loaded utils from: /mnt/nfs/workspace/courses/PyTorch/Building-Transformer-Models-with-PyTorch-2.0/utils.py
Memory environment configured
Selected device: cuda



## Model Selection

Choose your Qwen 2.5 Instruct model. Smaller models are easier to run locally.  
- Good starters: `Qwen/Qwen2.5-0.5B-Instruct` or `Qwen/Qwen2.5-1.5B-Instruct`  
- Larger (needs strong GPU): `Qwen/Qwen2.5-7B-Instruct`, `Qwen/Qwen2.5-14B-Instruct`


In [6]:

model_id = "Qwen/Qwen2.5-0.5B-Instruct"  # change as needed
# model_id = "Qwen/Qwen2.5-1.5B-Instruct"
# model_id = "Qwen/Qwen2.5-7B-Instruct"
# model_id = "Qwen/Qwen2.5-14B-Instruct"

# Generation defaults
gen_kwargs = {
    "max_new_tokens": 256,
    "temperature": 0.7,
    "top_p": 0.9,
    "do_sample": True
}

use_bfloat16 = False
try:
    import torch
    if device == "cuda" and torch.cuda.is_available():
        # bfloat16 is generally stable on newer GPUs; fallback to float16 if needed
        use_bfloat16 = torch.cuda.is_bf16_supported()
except Exception as e:
    print("dtype probe failed:", e)


In [7]:

from transformers import AutoTokenizer, AutoModelForCausalLM

# Clear GPU before loading a big model
utils.clear_gpu_memory(aggressive=False)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
dtype = torch.bfloat16 if use_bfloat16 else (torch.float16 if device == "cuda" else None)

# device_map='auto' will spread across available devices; remove if you want manual .to(device)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=dtype if dtype is not None else None,
    device_map=("auto" if device in ("cuda", "mps") else None),
)

model.eval()
print("Model loaded.")
utils.print_cuda_memory(verbose=True)


GPU memory cleared
Model loaded.
=== GPU Memory Usage ===
Allocated: 0.92 GB
Reserved:  0.95 GB
Free:      14.56 GB
Total:     15.48 GB


{'allocated': 0.9202370643615723,
 'reserved': 0.94921875,
 'total': 15.47686767578125,
 'free': 14.556630611419678}


## Chat Helper

We use the model's chat template (`tokenizer.apply_chat_template`) to format multi-turn dialogs.


In [8]:

from typing import List, Dict, Optional

def build_inputs(messages: List[Dict[str, str]]):
    """
    messages: list of {"role": "user"|"assistant"|"system", "content": str}
    """
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    return inputs

@torch.inference_mode()
def chat(
    history: List[Dict[str, str]],
    gen_params: Optional[Dict] = None,
) -> str:
    """
    history: running list of messages including system/user/assistant turns.
    gen_params: overrides for generation (max_new_tokens, temperature, etc).
    """
    params = {**gen_kwargs, **(gen_params or {})}
    inputs = build_inputs(history)
    outputs = model.generate(**inputs, **params)
    text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

    # Extract only the assistant's last message if template returns full convo
    # Often the assistant content is after the last 'assistant' tag; naive split:
    lower = text.lower()
    if "assistant" in lower:
        idx = lower.rfind("assistant")
        response = text[idx + len("assistant"):].strip()
    else:
        # Fallback: remove the input prompt
        input_text = tokenizer.decode(inputs["input_ids"][0], skip_special_tokens=True)
        response = text[len(input_text):].strip()

    return response


In [9]:

# Minimal demo
history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "In one sentence, what is Qwen 2.5?"},
]

reply = chat(history)
print("Assistant:", reply)


Assistant: Qwen 2.5 is an AI language model developed by Alibaba Cloud, focusing on natural language understanding and generation.


In [10]:

# Optional: interactive loop
for _ in range(3):
    user_msg = input("You: ")
    history.append({"role": "user", "content": user_msg})
    reply = chat(history)
    print("Assistant:", reply)
    history.append({"role": "assistant", "content": reply})


You:  


Assistant: Qwen 2.5 is a new generation of language models created by Alibaba Cloud.


You:  nice name


Assistant: I'm glad you like it! Let me know if there's anything else I can help with.


You:  do you know python programming language?


Assistant: Yes, Python is a high-level programming language that is widely used for web development and data analysis. It has a simple syntax and extensive standard library that makes it easy to learn and use. Python is also known for its readability and ease of debugging, which make it a popular choice among developers.



## Manual training loop with accelerate + tqdm progress bar


In [31]:
import math
import torch
import numpy as np
from torch.optim import AdamW
from transformers import get_scheduler, DataCollatorForLanguageModeling
from accelerate import Accelerator
from tqdm import tqdm
from utils import *
from datasets import load_dataset
from torch.utils.data import DataLoader

# Load better dataset - RECOMMENDED: Use conversation data
dataset = load_dataset("OpenAssistant/oasst1", split="train[:2%]")
# dataset = load_dataset("tatsu-lab/alpaca", split="train[:2%]")  # Alternative

def to_tokenized(ex):
    # Format for OASST dataset
    if "text" in ex and "Assistant:" in ex["text"]:
        parts = ex["text"].split("Assistant:")
        user_msg = parts[0].replace("Human:", "").strip()
        assistant_msg = parts[1].strip()
        
        messages = [
            {"role": "user", "content": user_msg},
            {"role": "assistant", "content": assistant_msg}
        ]
    else:
        # Fallback for other formats
        messages = [
            {"role": "user", "content": "Hello, how are you?"},
            {"role": "assistant", "content": "I'm doing well, thank you for asking!"}
        ]
    
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=False,
    )
    
    # Tokenize with padding to ensure consistent lengths
    tokenized = tokenizer(
        prompt,
        truncation=True,
        max_length=512,
        padding="max_length",  # This fixes the batching issue
        return_tensors=None,
    )
    
    tokenized["labels"] = tokenized["input_ids"].copy()
    
    return tokenized

print("Tokenizing dataset...")
tokenized = dataset.map(
    to_tokenized,
    remove_columns=dataset.column_names,
    desc="Tokenizing"
)

# Filter out empty samples
tokenized = tokenized.filter(lambda x: len(x["input_ids"]) > 0 and sum(x["attention_mask"]) > 10)

# Simple collator since we already padded
def simple_collator(batch):
    return {
        "input_ids": torch.stack([torch.tensor(x["input_ids"]) for x in batch]),
        "attention_mask": torch.stack([torch.tensor(x["attention_mask"]) for x in batch]),
        "labels": torch.stack([torch.tensor(x["labels"]) for x in batch]),
    }

# Setup memory env
setup_memory_environment()
clear_gpu_memory()

# Accelerator
accelerator = Accelerator()

# Model and optimizer
optimizer = AdamW(model.parameters(), lr=5e-5)
num_epochs = 1
accumulation_steps = 4

# DataLoader with simple collator
train_dataloader = DataLoader(
    tokenized, shuffle=True, batch_size=2, collate_fn=simple_collator
)

num_training_steps = num_epochs * len(train_dataloader) // accumulation_steps

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)

# Prepare for accelerate
model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    model, optimizer, train_dataloader, lr_scheduler
)

# Training loop
for epoch in range(num_epochs):
    model.train()
    total_loss = 0.0
    
    bar_length = get_optimal_bar_length()
    epoch_progress = tqdm(
        train_dataloader,
        desc=f"Epoch {epoch+1}/{num_epochs}",
        unit="batch",
        ncols=bar_length,
        bar_format='{l_bar}{bar}| {n_fmt}/{total_fmt} [{elapsed}<{remaining}]'
    )

    for step, batch in enumerate(epoch_progress):
        outputs = model(**batch)
        loss = outputs.loss / accumulation_steps

        # Backward
        accelerator.backward(loss)
        total_loss += loss.item() * accumulation_steps

        # Optimizer + scheduler
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()

        # Update progress
        avg_loss = total_loss / (step + 1)
        epoch_progress.set_postfix({
            "loss": f"{avg_loss:.4f}",
            "lr": f"{lr_scheduler.get_last_lr()[0]:.2e}",
        })

    epoch_progress.close()
    print(f"Epoch {epoch+1} - Avg Loss: {total_loss / len(train_dataloader):.4f}")
    clear_gpu_memory()

print("Training complete!")


README.md: 0.00B [00:00, ?B/s]

(…)-00000-of-00001-b42a775f407cee45.parquet:   0%|          | 0.00/39.5M [00:00<?, ?B/s]

(…)-00000-of-00001-134b8fd0c89408b6.parquet:   0%|          | 0.00/2.08M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/84437 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/4401 [00:00<?, ? examples/s]

Tokenizing dataset...


Tokenizing:   0%|          | 0/1689 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1689 [00:00<?, ? examples/s]

Memory environment configured
GPU memory cleared


Epoch 1/1: 100%|███████████| 845/845 [02:05<00:00]


Epoch 1 - Avg Loss: 0.1181
GPU memory cleared
Training complete!



## (Optional) Fine-tuning Scaffold (LoRA or Full Fine-tune) using simple Hugging Face Trainer

Below is a minimal example scaffold for fine-tuning with `Trainer`.  
We call `utils.clear_gpu_memory()` **before** starting training to reduce fragmentation.


In [32]:
from datasets import load_dataset
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling
from transformers import DataCollatorForLanguageModeling
import math

# Example tiny dataset (replace with your own)
dataset = load_dataset("Abirate/english_quotes", split="train[:1%]")

def to_messages(ex):
    # Convert raw text to a trivial chat turn
    return [{"role": "user", "content": ex["quote"]}]
    
def to_tokenized(ex):
    messages = to_messages(ex)
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
    
    # Tokenize with padding to ensure consistent lengths
    out = tokenizer(
        prompt,
        truncation=True,
        max_length=512,
        padding="max_length",  # This fixes the batching issue
        return_tensors=None,
    )
    out["labels"] = out["input_ids"].copy()
    return out
    
tokenized = dataset.map(to_tokenized, remove_columns=dataset.column_names)

# Filter out empty samples
tokenized = tokenized.filter(lambda x: len(x["input_ids"]) > 0 and sum(x["attention_mask"]) > 10)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
    pad_to_multiple_of=8  # good for Tensor Cores; optional
)

print(tokenized[0])
print(tokenizer.decode(tokenized[0]["input_ids"]))

# Batch size selection using your utils (conservative)
try:
    sample_batch = tokenized[:8]
    bsz = utils.calculate_optimal_batch_size_simple(model, sample_batch, max_batch_size=16)
except Exception:
    bsz = 2

from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(
    output_dir="./qwen25-refac-ft",
    per_device_train_batch_size=bsz,
    gradient_accumulation_steps=1,
    num_train_epochs=1,
    learning_rate=2e-5,
    fp16=(device == "cuda" and not torch.cuda.is_bf16_supported()),
    bf16=(device == "cuda" and torch.cuda.is_bf16_supported()),
    logging_steps=10,
    save_steps=200,
    save_total_limit=2,
    report_to="none",
)

utils.clear_gpu_memory(aggressive=False)  # <<< clean before training
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=data_collator,
)

train_out = trainer.train()
utils.print_detailed_memory()

eval_loss = train_out.training_loss if hasattr(train_out, "training_loss") else None
print("Training done. Loss:", eval_loss)


Map:   0%|          | 0/25 [00:00<?, ? examples/s]

Filter:   0%|          | 0/25 [00:00<?, ? examples/s]

{'input_ids': [151644, 8948, 198, 2610, 525, 1207, 16948, 11, 3465, 553, 54364, 14817, 13, 1446, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 2073, 3430, 6133, 26, 5019, 770, 374, 2669, 4429, 1987, 151645, 198, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643,

OutOfMemoryError: CUDA out of memory. Tried to allocate 2.32 GiB. GPU 0 has a total capacity of 15.48 GiB of which 865.94 MiB is free. Including non-PyTorch memory, this process has 14.61 GiB memory in use. Of the allocated memory 14.34 GiB is allocated by PyTorch, and 108.05 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In [None]:

# Post-training inference (if you fine-tuned)
# utils.clear_gpu_memory(aggressive=False)
# history_ft = [
#     {"role": "system", "content": "You are a concise assistant."},
#     {"role": "user", "content": "Give me a haiku about GPUs."},
# ]
# print("Assistant (FT):", chat(history_ft, {"max_new_tokens": 64, "temperature": 0.6}))


In [None]:

# Final memory report
utils.print_cuda_memory(verbose=True)
