# MiniLLM 0.2B Model Training (A100 GPU)

This notebook trains a ~200M-parameter MiniLLM model on Google Colab with an A100 GPU.

**Requirements:**
- Colab Pro/Pro+ with A100 GPU
- ~30GB disk space for data

**Before running:** If you have installed other packages that might conflict, run **Runtime -> Factory reset runtime** first.

## 1. Environment Check

In [None]:
# Check GPU
!nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv

import torch
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")


## 2. Clone Repository & Install Dependencies

In [None]:
import os

# Clone repo
if not os.path.exists('/content/mini-llm'):
    !git clone https://github.com/ai-clarify/mini-llm.git /content/mini-llm
else:
    print('Repository already cloned')
    !cd /content/mini-llm && git pull

%cd /content/mini-llm

# ============================================================
# Install dependencies (Colab-optimized)
# ============================================================
# IMPORTANT: Don't reinstall torch - use Colab's pre-installed version
# which has proper CUDA library linkage

print("Installing additional dependencies...")

# Fix potential CUDA library issues
!pip install -q nvidia-cusparselt-cu12 2>/dev/null || true

# Core training dependencies
!pip install -q einops safetensors modelscope datasets

# Verify environment
print("\n" + "="*50)
print("Environment check:")
print("="*50)

import torch
print(f"✓ PyTorch: {torch.__version__}")
print(f"✓ CUDA: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"✓ GPU: {torch.cuda.get_device_name(0)}")

import transformers
print(f"✓ Transformers: {transformers.__version__}")

# Test model import
import sys
sys.path.insert(0, '/content/mini-llm')
from model.model_minillm import MiniLLMConfig, MiniLLMForCausalLM
print("✓ MiniLLM model loaded successfully")

## 3. Model Configuration (0.2B Parameters)

We configure the model to have approximately 200M parameters:
- `hidden_size=1024`: Embedding dimension
- `num_hidden_layers=16`: Number of transformer blocks
- `num_attention_heads=16`: Number of attention heads
- `intermediate_size=2816`: FFN hidden dimension
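
As a rough sanity check on the ~200M figure, the sizes listed above can be plugged into a back-of-envelope estimate. This is only a sketch: the helper `estimate_params` is hypothetical, and it assumes plain multi-head attention, a gated three-matrix FFN, and tied embeddings rather than the MLA layers the model actually uses, so the measured count printed by the next cell will differ somewhat.

```python
def estimate_params(hidden_size, num_layers, intermediate_size, vocab_size):
    """Rough transformer parameter count (ignores norms and biases)."""
    # ASSUMPTION: standard attention with Q, K, V and output projections (not MLA).
    attention = 4 * hidden_size * hidden_size
    # ASSUMPTION: gated FFN with three weight matrices (gate, up, down).
    ffn = 3 * hidden_size * intermediate_size
    # ASSUMPTION: input and output embeddings share a single matrix.
    embeddings = vocab_size * hidden_size
    return num_layers * (attention + ffn) + embeddings


total = estimate_params(
    hidden_size=1024, num_layers=16, intermediate_size=2816, vocab_size=6400
)
print(f"Estimated parameters: {total / 1e6:.1f}M")  # Estimated parameters: 212.1M
```

Nearly all of the estimate comes from the 16 transformer blocks; with a vocabulary of only 6400 tokens, the embedding table contributes about 6.6M.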

In [None]:
import sys
sys.path.insert(0, '/content/mini-llm')

from model.model_minillm import MiniLLMConfig, MiniLLMForCausalLM

# 0.2B Configuration
MODEL_CONFIG = {
    'hidden_size': 1024,
    'num_hidden_layers': 16,
    'num_attention_heads': 16,
    'intermediate_size': 2816,
    'q_lora_rank': 384,
    'kv_lora_rank': 192,
    'max_position_embeddings': 2048,
    'use_moe': False,
    'vocab_size': 6400,
}

# Verify parameter count
config = MiniLLMConfig(**MODEL_CONFIG)
model = MiniLLMForCausalLM(config)

total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Total parameters: {total_params / 1e6:.1f}M")
print(f"Trainable parameters: {trainable_params / 1e6:.1f}M")
print(f"\nModel config:")
for k, v in MODEL_CONFIG.items():
    print(f"  {k}: {v}")

del model  # Free memory

## 4. Prepare Training Data

Download the pretraining data from ModelScope, or generate a small synthetic sample set (the default, controlled by `USE_SAMPLE_DATA`).

In [None]:
import os
import json

DATA_DIR = '/content/mini-llm/data'
os.makedirs(DATA_DIR, exist_ok=True)

PRETRAIN_FILE = f'{DATA_DIR}/pretrain.jsonl'

# Sample data is the default. Set USE_SAMPLE_DATA = False to download the full
# dataset from ModelScope; if the download fails, the cell falls back to sample data.
USE_SAMPLE_DATA = True

if not USE_SAMPLE_DATA:
    try:
        from modelscope import snapshot_download
        print("Downloading pretrain data from ModelScope...")
        dataset_path = snapshot_download(
            'gongjy/minimind_dataset',
            cache_dir='/content/cache'
        )
        PRETRAIN_FILE = os.path.join(dataset_path, 'pretrain_hq.jsonl')
        print(f"Using dataset: {PRETRAIN_FILE}")
    except Exception as e:
        print(f"Download failed: {e}")
        USE_SAMPLE_DATA = True

if USE_SAMPLE_DATA:
    print("Creating sample training data...")
    
    # Sample Chinese texts for demonstration
    sample_texts = [
        "人工智能是计算机科学的一个分支，它致力于创建能够执行通常需要人类智能的任务的系统。",
        "深度学习是机器学习的一个子领域，它使用多层神经网络来学习数据的复杂表示。",
        "自然语言处理是人工智能的一个重要领域，研究计算机如何理解和生成人类语言。",
        "Transformer模型使用自注意力机制来处理序列数据，它是现代大型语言模型的基础架构。",
        "预训练语言模型通过在大规模文本数据上进行无监督学习，学习语言的通用表示。",
        "强化学习从人类反馈中学习可以帮助语言模型更好地遵循人类指令和偏好。",
        "注意力机制允许模型在处理输入时关注最相关的部分，提高了模型的表达能力。",
        "梯度下降是一种优化算法，通过迭代地调整模型参数来最小化损失函数。",
        "词嵌入将单词映射到连续的向量空间，使得语义相似的词在空间中距离更近。",
        "语言模型的困惑度是衡量模型预测能力的指标，困惑度越低表示模型越好。",
    ]
    
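    # Repeat the sentences so the run has enough steps for a smoke test.
    # The file contains only 30 distinct texts, so it cannot teach the model much.
    num_samples = 50000  # 50K samples for demo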
    
    with open(PRETRAIN_FILE, 'w', encoding='utf-8') as f:
        for i in range(num_samples):
            text = sample_texts[i % len(sample_texts)]
            # Add some variation
            if i % 3 == 0:
                text = text + "这是一个重要的概念。"
            elif i % 3 == 1:
                text = "在现代AI研究中，" + text
            f.write(json.dumps({'text': text}, ensure_ascii=False) + '\n')
    
    print(f"Created {num_samples} training samples at {PRETRAIN_FILE}")

# Count lines
with open(PRETRAIN_FILE, 'r', encoding='utf-8') as f:
    num_lines = sum(1 for _ in f)
print(f"Total training samples: {num_lines:,}")
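
To see how much variety the sample-data branch really produces, the cycling logic can be reproduced in isolation. The sketch below uses placeholder strings instead of the ten Chinese sentences, and the `make_sample` helper is hypothetical; it only mirrors the `i % len(sample_texts)` and `i % 3` pattern from the cell above.

```python
import json

# Hypothetical stand-ins for the ten base sentences.
base_texts = [f"sentence {j}" for j in range(10)]


def make_sample(i):
    """Build the i-th JSONL line using the same cycling rule as the data cell."""
    text = base_texts[i % len(base_texts)]
    if i % 3 == 0:
        text = text + " [suffix]"    # stands in for the appended sentence
    elif i % 3 == 1:
        text = "[prefix] " + text    # stands in for the prepended phrase
    return json.dumps({"text": text}, ensure_ascii=False)


lines = [make_sample(i) for i in range(50000)]
distinct = len(set(lines))
print(f"{len(lines)} lines, {distinct} distinct")  # 50000 lines, 30 distinct
```

Because 10 and 3 share no common factor, every sentence is paired with every variation, which gives 10 x 3 = 30 distinct records. That is enough to exercise the pipeline end to end, but the loss curve on this data says little about real pretraining.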

## 5. Training Configuration

Settings chosen for an A100 80GB GPU. The A100 also ships in a 40GB variant, so lower `batch_size` if you hit out-of-memory errors:

In [None]:
# Training hyperparameters for A100
TRAIN_CONFIG = {
    'batch_size': 64,           # A100 can handle larger batches
    'accumulation_steps': 4,    # Effective batch = 64 * 4 = 256
    'learning_rate': 5e-4,
    'epochs': 2,
    'max_seq_len': 512,
    'dtype': 'bfloat16',        # A100 supports bf16 natively
    'grad_clip': 1.0,
    'log_interval': 50,
    'save_interval': 200,
    'num_workers': 4,
}

print("Training configuration:")
for k, v in TRAIN_CONFIG.items():
    print(f"  {k}: {v}")

effective_batch = TRAIN_CONFIG['batch_size'] * TRAIN_CONFIG['accumulation_steps']
print(f"\nEffective batch size: {effective_batch}")
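
The same numbers also determine how many optimizer updates a run performs. The following sketch works that out for the 50,000-line sample file. The `training_steps` helper is hypothetical, and it assumes the trainer keeps the last partial batch and still steps on a partial accumulation group; the actual trainer may round differently.

```python
import math


def training_steps(num_samples, batch_size, accumulation_steps, max_seq_len):
    """Return (micro-batches, optimizer steps, max tokens per step) for one epoch."""
    # ASSUMPTION: the final partial batch is kept (no drop_last).
    micro_batches = math.ceil(num_samples / batch_size)
    # ASSUMPTION: a trailing partial accumulation group still triggers an update.
    optimizer_steps = math.ceil(micro_batches / accumulation_steps)
    # Upper bound: short sequences are padded or truncated to max_seq_len.
    tokens_per_step = batch_size * accumulation_steps * max_seq_len
    return micro_batches, optimizer_steps, tokens_per_step


micro, steps, tokens = training_steps(
    num_samples=50000, batch_size=64, accumulation_steps=4, max_seq_len=512
)
print(f"micro-batches per epoch: {micro}")    # 782
print(f"optimizer steps per epoch: {steps}")  # 196
print(f"max tokens per step: {tokens:,}")     # 131,072
```

With fewer than 200 optimizer steps per epoch on the sample data, it is worth checking whether the trainer counts `log_interval` and `save_interval` in micro-batches or in optimizer steps before relying on intermediate checkpoints.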

## 6. Run Pretraining

In [None]:
import os

# Set output directory
OUT_DIR = '/content/mini-llm/out/pretrain_0_2b'
TB_DIR = '/content/mini-llm/out/pretrain_0_2b/tb'
os.makedirs(OUT_DIR, exist_ok=True)
os.makedirs(TB_DIR, exist_ok=True)

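# Build training command
# Note: only hidden_size and num_hidden_layers from MODEL_CONFIG are passed here.
# The other fields (e.g. intermediate_size) are not on the command line, so the
# trainer uses its own values for them.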
train_cmd = f"""
python trainer/train_pretrain.py \
    --data_path {PRETRAIN_FILE} \
    --hidden_size {MODEL_CONFIG['hidden_size']} \
    --num_hidden_layers {MODEL_CONFIG['num_hidden_layers']} \
    --max_seq_len {TRAIN_CONFIG['max_seq_len']} \
    --batch_size {TRAIN_CONFIG['batch_size']} \
    --accumulation_steps {TRAIN_CONFIG['accumulation_steps']} \
    --learning_rate {TRAIN_CONFIG['learning_rate']} \
    --epochs {TRAIN_CONFIG['epochs']} \
    --dtype {TRAIN_CONFIG['dtype']} \
    --grad_clip {TRAIN_CONFIG['grad_clip']} \
    --log_interval {TRAIN_CONFIG['log_interval']} \
    --save_interval {TRAIN_CONFIG['save_interval']} \
    --num_workers {TRAIN_CONFIG['num_workers']} \
    --out_dir {OUT_DIR} \
    --tensorboard_dir {TB_DIR} \
    --device cuda:0
"""

print("Training command:")
print(train_cmd)
print("\n" + "="*60)
print("Starting training...")
print("="*60 + "\n")

!{train_cmd}
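
The command above is assembled by hand inside an f-string. An alternative, shown here only as a sketch, is to build it from a dictionary with `shlex.join`, which quotes any value containing spaces or shell metacharacters. The `build_command` helper and the example path are hypothetical; the flag names are the ones already used above.

```python
import shlex


def build_command(script, args):
    """Turn a {flag: value} mapping into a shell-safe command string."""
    parts = ["python", script]
    for flag, value in args.items():
        parts += [f"--{flag}", str(value)]
    # shlex.join quotes each part only where the shell requires it.
    return shlex.join(parts)


cmd = build_command(
    "trainer/train_pretrain.py",
    {
        "batch_size": 64,
        "data_path": "/content/my data/pretrain.jsonl",  # hypothetical path with a space
    },
)
print(cmd)
```

The resulting string can be run the same way as before, with `!{cmd}` in a notebook cell.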

## 7. Monitor Training with TensorBoard

In [None]:
%load_ext tensorboard
%tensorboard --logdir /content/mini-llm/out/pretrain_0_2b/tb

## 8. Test Inference

In [None]:
import torch
from transformers import AutoTokenizer
from model.model_minillm import MiniLLMConfig, MiniLLMForCausalLM

# Load trained model
config = MiniLLMConfig(**MODEL_CONFIG)
model = MiniLLMForCausalLM(config)

checkpoint_path = f"{OUT_DIR}/pretrain_{MODEL_CONFIG['hidden_size']}.pth"

if os.path.exists(checkpoint_path):
    state_dict = torch.load(checkpoint_path, map_location='cuda')
    model.load_state_dict(state_dict)
    print(f"Loaded checkpoint from {checkpoint_path}")
else:
    print(f"Warning: checkpoint not found at {checkpoint_path}; "
          "generation below will use randomly initialized weights")

model = model.cuda().eval()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained('/content/mini-llm/model/')

# Test generation
test_prompts = [
    "人工智能",
    "深度学习是",
    "在现代科技发展中",
]

print("\n" + "="*60)
print("Generation Test")
print("="*60)

for prompt in test_prompts:
    inputs = tokenizer(prompt, return_tensors='pt').to('cuda')
    
    with torch.no_grad():
        outputs = model.generate(
            inputs.input_ids,
            max_new_tokens=64,
            do_sample=True,
            temperature=0.8,
            top_p=0.9,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )
    
    generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"\nPrompt: {prompt}")
    print(f"Output: {generated}")
    print("-" * 40)
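
The generation call above samples with `temperature=0.8` and `top_p=0.9`. To illustrate what those two settings do to a next-token distribution, here is a minimal pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering. The `top_p_filter` helper and the example logits are hypothetical, and this is not the code path `generate` itself runs.

```python
import math


def top_p_filter(logits, temperature=0.8, top_p=0.9):
    """Return {token_index: probability} for the smallest set reaching top_p mass."""
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Walk tokens from most to least likely until the cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize over the kept tokens so the result is a valid distribution.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


nucleus = top_p_filter([2.0, 1.0, 0.0, -3.0])
print(sorted(nucleus))  # [0, 1]
```

In this example the two most likely tokens already cover more than 90% of the probability mass, so the remaining two are never sampled.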

## 9. Save & Download Model

In [None]:
import shutil

# Create export package
export_dir = '/content/minillm_0_2b_export'
os.makedirs(export_dir, exist_ok=True)

# Copy model checkpoint
if os.path.exists(checkpoint_path):
    shutil.copy(checkpoint_path, f'{export_dir}/model.pth')
    print("Copied model checkpoint")
else:
    print(f"Warning: no checkpoint at {checkpoint_path}; export will not contain model.pth")

# Copy tokenizer files
tokenizer_files = ['tokenizer.json', 'tokenizer_config.json']
for f in tokenizer_files:
    src = f'/content/mini-llm/model/{f}'
    if os.path.exists(src):
        shutil.copy(src, f'{export_dir}/{f}')

# Save config
import json
with open(f'{export_dir}/config.json', 'w') as f:
    json.dump(MODEL_CONFIG, f, indent=2)

# Create zip
!cd /content && zip -r minillm_0_2b.zip minillm_0_2b_export/

# List contents
print("\nExport contents:")
!ls -lh /content/minillm_0_2b_export/
print("\nZip file size:")
!ls -lh /content/minillm_0_2b.zip

In [None]:
# Download the model (uncomment to download)
# from google.colab import files
# files.download('/content/minillm_0_2b.zip')

## 10. (Optional) Continue with SFT Training

After pretraining, you can fine-tune on instruction data:

In [None]:
# To run SFT, remove the surrounding triple quotes
"""
# Prepare SFT data
sft_samples = [
    {"conversations": [{"role": "user", "content": "什么是人工智能？"}, 
                       {"role": "assistant", "content": "人工智能是计算机科学的一个分支，致力于创建能够执行通常需要人类智能的任务的系统。"}]},
    {"conversations": [{"role": "user", "content": "解释深度学习"}, 
                       {"role": "assistant", "content": "深度学习是机器学习的一个子领域，使用多层神经网络来学习数据的复杂表示。"}]},
]

SFT_FILE = '/content/mini-llm/data/sft.jsonl'
with open(SFT_FILE, 'w', encoding='utf-8') as f:
    for sample in sft_samples * 1000:
        f.write(json.dumps(sample, ensure_ascii=False) + '\n')

# Run SFT
!python trainer/train_full_sft.py \
    --data_path {SFT_FILE} \
    --hidden_size {MODEL_CONFIG['hidden_size']} \
    --num_hidden_layers {MODEL_CONFIG['num_hidden_layers']} \
    --pretrained_path {checkpoint_path} \
    --batch_size 32 \
    --epochs 3 \
    --out_dir /content/mini-llm/out/sft_0_2b
"""
print("SFT training code is disabled. Remove the triple quotes to run it.")
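
Before launching SFT on your own data, it can help to check that every JSONL line follows the `conversations` layout shown above. The sketch below is one way to do that. The `validate_sft_line` helper is hypothetical, and the rule that turns alternate between `user` and `assistant` is an assumption based on the two sample records, not a documented requirement of the trainer.

```python
import json


def validate_sft_line(line):
    """Parse one JSONL line and check the conversation layout; return the record."""
    record = json.loads(line)
    turns = record["conversations"]
    if not turns or len(turns) % 2 != 0:
        raise ValueError("expected a non-empty, even number of turns")
    for index, turn in enumerate(turns):
        # ASSUMPTION: turns alternate user, assistant, user, assistant, ...
        expected = "user" if index % 2 == 0 else "assistant"
        if turn.get("role") != expected:
            raise ValueError(f"turn {index}: expected role {expected!r}")
        if not isinstance(turn.get("content"), str) or not turn["content"]:
            raise ValueError(f"turn {index}: content must be a non-empty string")
    return record


sample = json.dumps(
    {"conversations": [
        {"role": "user", "content": "What is deep learning?"},
        {"role": "assistant", "content": "A subfield of machine learning."},
    ]},
    ensure_ascii=False,
)
print(len(validate_sft_line(sample)["conversations"]))  # 2
```

Running the check over the whole file before training surfaces malformed records early, instead of partway through an epoch.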

---

## Summary

This notebook trained a 0.2B parameter MiniLLM model with:
- **Architecture**: DeepSeek-V3 style with MLA (Multi-head Latent Attention)
- **Parameters**: ~200M
- **Training**: 2 epochs on sample data
- **Hardware**: A100 80GB GPU

For production training, use the full dataset from ModelScope by setting `USE_SAMPLE_DATA = False`.