# 🚀 Amharic LLM Training on Google Colab

This notebook trains an Amharic language model using your collected dataset.

**Setup Instructions:**
1. Runtime → Change runtime type → GPU (T4)
2. Run all cells in order
3. Training takes roughly 5-60 minutes depending on model size (see the estimates below)

**Models Available:**
- `distilgpt2` (82M) - Ultra fast (5-10 min)
- `gpt2` (124M) - Fast (10-15 min)
- `bloom-560m` (560M) - Balanced (20-30 min)
- `bloom-1b1` (1.1B) - Quality (45-60 min)
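
As a rough guide to choosing from this list, the sketch below encodes the parameter counts and upper-bound time estimates quoted above and returns the largest model that fits a given time budget. The `MODELS` table and the `pick_model` helper are illustrative names, not part of `scripts/fast_training.py`, and the timings are this notebook's estimates rather than measured values.

```python
# Illustrative only: sizes and time estimates are copied from the list above.
MODELS = [
    # (name, parameters in millions, estimated max training minutes)
    ("distilgpt2", 82, 10),
    ("gpt2", 124, 15),
    ("bloom-560m", 560, 30),
    ("bloom-1b1", 1100, 60),
]


def pick_model(budget_minutes):
    """Return the largest listed model whose estimated time fits the budget."""
    candidates = [m for m in MODELS if m[2] <= budget_minutes]
    if not candidates:
        raise ValueError(f"No listed model trains within {budget_minutes} minutes")
    # Largest parameter count among the models that fit.
    return max(candidates, key=lambda m: m[1])[0]


print(pick_model(30))
```

With a 30-minute budget this selects `bloom-560m`, which matches the "Balanced" option.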

In [None]:
# Check GPU availability
!nvidia-smi

import torch
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name()}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

In [None]:
# Mount Google Drive for data persistence
from google.colab import drive
drive.mount('/content/drive')

# Create directory for models
!mkdir -p /content/drive/MyDrive/amharic_models

In [None]:
# Setup data - choose ONE of the options below
# (Option 2 is enabled by default; comment it out if you use another option)

# Option 1: Upload to Google Drive (RECOMMENDED)
# 1. Upload your 'amharic-llm-data' folder to Google Drive
# 2. Uncomment and run the lines below:
# !cp -r '/content/drive/MyDrive/amharic-llm-data' /content/
# %cd /content/amharic-llm-data

# Option 2: Clone from GitHub (if you've pushed the data)
!git clone https://github.com/Yosef-Ali/amharic-llm-data.git
%cd amharic-llm-data

# Option 3: Direct file upload
# from google.colab import files
# uploaded = files.upload()  # Upload your dataset files manually

In [None]:
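# Install required packages
!pip install transformers datasets peft accelerate bitsandbytes
# Colab GPU runtimes already ship a CUDA-enabled PyTorch, so reinstalling it is normally unnecessary.
# Uncomment only if the GPU check above fails, then restart the runtime before continuing:
# !pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118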

In [None]:
# Check dataset
!ls -la data/processed/

# Show dataset statistics
import json
with open('data/dataset_statistics.json', 'r') as f:
    stats = json.load(f)
    print(f"Total examples: {stats['total_examples']}")
    print(f"Train: {stats['train_size']}, Val: {stats['validation_size']}, Test: {stats['test_size']}")
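
A quick sanity check on these statistics is to confirm that the three splits add up to the total and to see what share each split takes. The sketch below assumes only the four keys read in the cell above; `split_fractions` is a hypothetical helper and the numbers in `example` are made up for illustration.

```python
def split_fractions(stats):
    """Return each split's share of the total, plus any unaccounted examples."""
    total = stats["total_examples"]
    if total <= 0:
        raise ValueError("total_examples must be positive")
    sizes = {
        "train": stats["train_size"],
        "validation": stats["validation_size"],
        "test": stats["test_size"],
    }
    fractions = {name: size / total for name, size in sizes.items()}
    # A non-zero leftover means the splits do not cover the whole dataset.
    leftover = total - sum(sizes.values())
    return fractions, leftover


# Made-up numbers for illustration; real values come from dataset_statistics.json.
example = {"total_examples": 1000, "train_size": 800,
           "validation_size": 100, "test_size": 100}
fractions, leftover = split_fractions(example)
print(fractions, leftover)
```

In the notebook, the same function could be called as `split_fractions(stats)` right after loading the JSON file.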

In [None]:
# Show training options
!python scripts/fast_training.py --options

In [None]:
# ULTRA FAST Training (5-10 minutes)
# Good for testing the pipeline

!python scripts/fast_training.py --train --model distilgpt2 --steps 100 --output models/amharic-distilgpt2

# Copy to Google Drive
!cp -r models/amharic-distilgpt2 /content/drive/MyDrive/amharic_models/

In [None]:
# BALANCED Training (20-30 minutes)
# Good balance of speed and quality

!python scripts/fast_training.py --train --model bloom-560m --steps 300 --output models/amharic-bloom560m

# Copy to Google Drive
!cp -r models/amharic-bloom560m /content/drive/MyDrive/amharic_models/

In [None]:
# QUALITY Training (45-60 minutes)
# Best quality of the models offered here; evaluate the result before any production use

!python scripts/fast_training.py --train --model bloom-1b1 --steps 500 --output models/amharic-bloom1b1

# Copy to Google Drive
!cp -r models/amharic-bloom1b1 /content/drive/MyDrive/amharic_models/
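
Each training cell above backs up its output with `!cp -r`. A failed shell command in a notebook does not raise a Python exception, so a missing model directory can go unnoticed. A minimal sketch of a stricter alternative follows; `backup_model` is a hypothetical helper, not something provided by the training script.

```python
import shutil
from pathlib import Path


def backup_model(model_dir, drive_root):
    """Copy a trained model directory into drive_root and return the new path."""
    src = Path(model_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"No model directory at {src}; did training finish?")
    dest = Path(drive_root) / src.name
    # dirs_exist_ok lets a rerun overwrite an earlier backup (Python 3.8+).
    shutil.copytree(src, dest, dirs_exist_ok=True)
    return dest
```

For example, `backup_model("models/amharic-bloom560m", "/content/drive/MyDrive/amharic_models")` would mirror the copy step of the balanced training cell, but stop with an error if training produced no output.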

In [None]:
# Test the trained model (point --output at the model you actually trained)
!python scripts/fast_training.py --test --output models/amharic-bloom560m

In [None]:
# Interactive testing
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

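# Load your trained model
model_path = "models/amharic-bloom560m"  # Change this to your model
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_path)
# Move the model to the GPU when one is available; otherwise generation runs on the CPU
model = AutoModelForCausalLM.from_pretrained(model_path).to(device)

def generate_response(instruction):
    prompt = f"Instruction: {instruction}\nResponse:"
    # Inputs must be on the same device as the model
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
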
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=100,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
    
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response.split("Response:")[-1].strip()

# Test with Amharic instructions
test_instructions = [
    "የአማርኛ ቋንቋ ምንድን ነው?",  # What is the Amharic language?
    "ኢትዮጵያ የት ትገኛለች?",  # Where is Ethiopia located?
    "የኢትዮጵያ ዋና ከተማ ምንድን ነው?"  # What is the capital of Ethiopia?
]

for instruction in test_instructions:
    response = generate_response(instruction)
    print(f"Q: {instruction}")
    print(f"A: {response}")
    print("-" * 50)
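
The cell above recovers the answer by splitting the decoded text on the last `Response:` marker. If the model happens to generate `Response:` again inside its own answer, that split discards everything before the repeat. The sketch below shows one possible alternative that strips the echoed prompt first and keeps the marker split only as a fallback. `build_prompt` and `extract_response` are hypothetical helper names, and the decoded string is simulated text, not real model output.

```python
def build_prompt(instruction):
    """Format an instruction the same way generate_response does."""
    return f"Instruction: {instruction}\nResponse:"


def extract_response(decoded, prompt):
    """Strip the echoed prompt from decoded output, else fall back to a marker split."""
    if decoded.startswith(prompt):
        return decoded[len(prompt):].strip()
    # Fallback: the tokenizer may not reproduce the prompt character for character.
    return decoded.split("Response:")[-1].strip()


prompt = build_prompt("ኢትዮጵያ የት ትገኛለች?")
# Simulated output: the prompt followed by a continuation that repeats the marker.
decoded = prompt + " Sample answer. Response: extra text"
print(extract_response(decoded, prompt))
```

The fallback matters because decoding with `skip_special_tokens=True` is not guaranteed to return the prompt unchanged for every tokenizer.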

In [None]:
# Upload to Hugging Face Hub (optional)
# First, login to Hugging Face

from huggingface_hub import notebook_login
notebook_login()

# Upload model
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "models/amharic-bloom560m"
hub_model_name = "your-username/amharic-bloom-560m"  # Change this

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

tokenizer.push_to_hub(hub_model_name)
model.push_to_hub(hub_model_name)

## 🎉 Training Complete!

Your Amharic language model has been trained successfully. Here's what you can do next:

### 📁 Your Models
- Models are saved in Google Drive: `/content/drive/MyDrive/amharic_models/`
- You can download them or use them in other notebooks
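
To find these models from another notebook after mounting Drive, a small listing helper can be enough. This is a sketch under an assumption: each saved model directory is taken to contain a `config.json` (full model) or an `adapter_config.json` (PEFT adapter). What `scripts/fast_training.py` actually writes is not shown here, and `list_saved_models` is a hypothetical name.

```python
from pathlib import Path

# Assumed marker files; adjust to whatever the training script actually saves.
MARKER_FILES = ("config.json", "adapter_config.json")


def list_saved_models(root):
    """Return sorted names of subdirectories under root that look like saved models."""
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(
        p.name
        for p in root.iterdir()
        if p.is_dir() and any((p / marker).is_file() for marker in MARKER_FILES)
    )
```

Calling `list_saved_models("/content/drive/MyDrive/amharic_models")` returns the directory names, and any of them can then be passed as `model_path` to `from_pretrained` in the same way as the interactive testing cell.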

### 🚀 Next Steps
1. **Test More**: Try different prompts and instructions
2. **Deploy**: Create a Gradio demo or API
3. **Improve**: Collect more data and retrain
4. **Share**: Upload to Hugging Face Hub
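
For the Deploy step, a minimal Gradio sketch might look like the following. It assumes `gradio` is installed (`pip install gradio`) and that a function like the notebook's `generate_response` is available. `make_responder` and `build_demo` are hypothetical names, and this is one possible layout rather than a recommended deployment.

```python
def make_responder(generate_fn):
    """Wrap a text-generation callable so blank input skips the model call."""
    def respond(instruction):
        instruction = (instruction or "").strip()
        if not instruction:
            return "Please enter an instruction."
        return generate_fn(instruction)
    return respond


def build_demo(generate_fn):
    """Build a text-in, text-out Gradio interface around generate_fn."""
    import gradio as gr  # imported lazily so the helper above works without gradio

    return gr.Interface(
        fn=make_responder(generate_fn),
        inputs=gr.Textbox(label="Instruction"),
        outputs=gr.Textbox(label="Response"),
        title="Amharic LLM demo",
    )
```

In the notebook, `build_demo(generate_response).launch(share=True)` would start the demo and print a temporary public link.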

### 📊 Model Comparison
- **DistilGPT2** (82M): Fastest, good for testing the pipeline
- **GPT-2** (124M): Fast, slightly larger than DistilGPT2
- **BLOOM-560M**: Balanced, good for most use cases
- **BLOOM-1B1**: Best quality, slowest training

### 🔗 Useful Links
- [Hugging Face Hub](https://huggingface.co)
- [Gradio Documentation](https://gradio.app)
- [Transformers Documentation](https://huggingface.co/docs/transformers)