# 🚀 Amharic LLM Training on Kaggle

This notebook fine-tunes pretrained language models on Amharic instruction data using Kaggle's free GPU resources.

## 📋 Setup Instructions

1. **Enable GPU**: Settings → Accelerator → GPU
2. **Enable Internet**: Settings → Internet → On
3. **Upload Dataset**: Create a Kaggle dataset with your Amharic data
4. **Add Dataset**: Add your dataset to this notebook
5. **Run All Cells**: Execute cells in order
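
Step 4 can be checked from code before running the rest of the notebook. The following is a minimal sketch, assuming attached datasets appear as folders under `/kaggle/input` (the path the data-setup cell copies from). `list_attached_datasets` is a hypothetical helper, not part of the project scripts.

```python
from pathlib import Path

def list_attached_datasets(input_root="/kaggle/input"):
    """Return the sorted names of dataset folders under input_root (empty if none)."""
    root = Path(input_root)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir() if p.is_dir())

datasets = list_attached_datasets()
if datasets:
    print("Attached datasets:", ", ".join(datasets))
else:
    print("No datasets attached yet (see step 4).")
```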

## 🎯 Training Options

- **Quick Test** (10-15 min): DistilGPT2 with 100 steps
- **Balanced** (30-45 min): Bloom-560M with 300 steps
- **High Quality** (1-2 hours): Bloom-1B1 with 500 steps
- **Production** (2-4 hours): Phi-3.5-mini with full training
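
The first three options differ only in model name, step count, and output folder, so they can be expressed as data. The sketch below assumes the `--train`, `--model`, `--steps`, and `--output` flags used in the training cells of this notebook. `PRESETS` and `build_command` are illustrative names, and the Production option is left out because it runs a different script.

```python
# Presets mirroring the options above (values taken from the training cells).
PRESETS = {
    "quick": {"model": "distilgpt2", "steps": 100,
              "output": "models/amharic-distilgpt2-test"},
    "balanced": {"model": "bloom-560m", "steps": 300,
                 "output": "models/amharic-bloom560m-finetuned"},
    "high_quality": {"model": "bloom-1b1", "steps": 500,
                     "output": "models/amharic-bloom1b1"},
}

def build_command(preset_name):
    """Build the argument list for scripts/fast_training.py from a preset."""
    preset = PRESETS[preset_name]
    return [
        "python", "scripts/fast_training.py", "--train",
        "--model", preset["model"],
        "--steps", str(preset["steps"]),
        "--output", preset["output"],
    ]

print(" ".join(build_command("balanced")))
```

The returned list can be passed to `subprocess.run` as is, which avoids shell quoting problems.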

In [None]:
# Check GPU availability
import torch
import subprocess

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name()}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    print(f"CUDA version: {torch.version.cuda}")

# Check nvidia-smi
try:
    result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)
    print("\nGPU Status:")
    print(result.stdout)
except FileNotFoundError:  # raised by subprocess.run when the binary is missing
    print("nvidia-smi not available")

In [None]:
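# Install required packages (-q keeps the pip output short)
!pip install -q transformers datasets peft accelerate bitsandbytes
# Kaggle GPU images already ship with PyTorch, and the first cell has already imported it,
# so a reinstall only takes effect after a kernel restart.
# Uncomment the next line only if you specifically need the CUDA 11.8 build.
# !pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118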

In [None]:
# Setup data - Choose one of the options below
import os
os.chdir('/kaggle/working')

# Option 1: Upload as Kaggle Dataset (RECOMMENDED)
# 1. Create a new Kaggle dataset
# 2. Upload your entire 'amharic-llm-data' folder
# 3. Add the dataset to this notebook
# 4. Uncomment and run the line below:
# !cp -r /kaggle/input/amharic-llm-data/* .

# Option 2: Clone from GitHub (if you've pushed the data)
# !git clone https://github.com/Yosef-Ali/amharic-llm-data.git
# %cd amharic-llm-data

# Option 3: Manual upload
# Upload your data files directly to /kaggle/working/

# Check current directory
!pwd
!ls -la

In [None]:
# Check dataset
!ls -la data/processed/

# Show dataset statistics
import json

try:
    with open('data/dataset_statistics.json', 'r') as f:
        stats = json.load(f)
        print(f"📊 Dataset Statistics:")
        print(f"Total examples: {stats['total_examples']}")
        print(f"Train: {stats['train_size']}")
        print(f"Validation: {stats['validation_size']}")
        print(f"Test: {stats['test_size']}")
        print(f"Average instruction length: {stats['avg_instruction_length']:.1f}")
        print(f"Average response length: {stats['avg_response_length']:.1f}")
except FileNotFoundError:
    print("⚠️  Dataset statistics not found. Make sure data is properly uploaded.")
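
If `data/dataset_statistics.json` is missing, comparable numbers can be recomputed from the raw examples. This is a sketch under two assumptions that the notebook does not confirm: each example is a dict with `instruction` and `response` string fields, and length is measured in characters. `summarize_examples` is a hypothetical helper.

```python
def summarize_examples(examples):
    """Compute example count and average field lengths (in characters)."""
    n = len(examples)
    if n == 0:
        return {"total_examples": 0,
                "avg_instruction_length": 0.0,
                "avg_response_length": 0.0}
    return {
        "total_examples": n,
        "avg_instruction_length": sum(len(e["instruction"]) for e in examples) / n,
        "avg_response_length": sum(len(e["response"]) for e in examples) / n,
    }

# Tiny in-memory sample; replace with records loaded from your own files.
sample = [
    {"instruction": "ሰላም", "response": "ሰላም ነው"},
    {"instruction": "ኢትዮጵያ", "response": "አዲስ አበባ"},
]
print(summarize_examples(sample))
```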

In [None]:
# Show training options
!python scripts/fast_training.py --options

In [None]:
# QUICK TEST - DistilGPT2 (10-15 minutes)
# Good for testing the pipeline quickly

import time
start_time = time.time()

!python scripts/fast_training.py --train --model distilgpt2 --steps 100 --output models/amharic-distilgpt2-test

end_time = time.time()
print(f"\n⏱️  Training completed in {(end_time - start_time)/60:.1f} minutes")
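
Every training cell repeats the same start/stop timing code. One way to factor it out is a small context manager. This is a sketch, and `timed` is an illustrative name rather than something the project scripts provide.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Print how long the wrapped block took, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        minutes = (time.perf_counter() - start) / 60
        print(f"{label} finished in {minutes:.1f} minutes")

with timed("Quick test"):
    time.sleep(0.01)  # stand-in for the training call
```

Using `time.perf_counter()` instead of `time.time()` makes the measurement immune to system clock adjustments.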

In [None]:
# BALANCED TRAINING - Bloom-560M (30-45 minutes)
# Good balance of quality and speed

import time
start_time = time.time()

!python scripts/fast_training.py --train --model bloom-560m --steps 300 --output models/amharic-bloom560m-finetuned

end_time = time.time()
print(f"\n⏱️  Training completed in {(end_time - start_time)/60:.1f} minutes")

In [None]:
# HIGH QUALITY TRAINING - Bloom-1B1 (1-2 hours)
# Higher quality than Bloom-560M; suitable for production use

import time
start_time = time.time()

!python scripts/fast_training.py --train --model bloom-1b1 --steps 500 --output models/amharic-bloom1b1

end_time = time.time()
print(f"\n⏱️  Training completed in {(end_time - start_time)/60:.1f} minutes")

In [None]:
# PRODUCTION TRAINING - Phi-3.5-mini (2-4 hours)
# Highest quality, use only if you have time

import time
start_time = time.time()

# Use the original training script for better quality
!python scripts/train_example.py --train --model microsoft/Phi-3.5-mini-instruct --output models/amharic-phi35-production

end_time = time.time()
print(f"\n⏱️  Training completed in {(end_time - start_time)/60:.1f} minutes")

In [None]:
# Test the trained model
model_to_test = "models/amharic-bloom560m-finetuned"  # Change this to your trained model

!python scripts/fast_training.py --test --output {model_to_test}

In [None]:
# Interactive testing with custom prompts
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load your best trained model
model_path = "models/amharic-bloom560m-finetuned"  # Change this

print(f"Loading model from: {model_path}")
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
print(f"Model loaded on: {device}")

def generate_amharic_response(instruction, max_new_tokens=100):
    # Wrap the instruction in the "Instruction/Response" prompt template
    prompt = f"Instruction: {instruction}\nResponse:"
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            repetition_penalty=1.1
        )
    
    # Decode only the newly generated tokens so the prompt is not echoed back
    prompt_length = inputs["input_ids"].shape[1]
    response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
    return response.strip()

# Test with various Amharic instructions
test_cases = [
    "የአማርኛ ቋንቋ ምንድን ነው?",
    "ኢትዮጵያ የት ትገኛለች?",
    "የኢትዮጵያ ዋና ከተማ ምንድን ነው?",
    "የኢትዮጵያ ባንዲራ ምን ቀለም ነው?",
    "ስለ ኢትዮጵያ ታሪክ ንገረኝ",
    "የአማርኛ ፊደላት ስንት ናቸው?"
]

print("🧪 Testing Amharic Language Model:\n")
print("=" * 60)

for i, instruction in enumerate(test_cases, 1):
    print(f"Test {i}:")
    print(f"❓ Question: {instruction}")
    
    response = generate_amharic_response(instruction)
    print(f"🤖 Response: {response}")
    print("-" * 60)
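
Reading every response by eye does not scale, so a quick automatic sanity check can help: does the model answer in Ethiopic script at all? The sketch below is a rough heuristic, not a quality metric. It counts characters in the basic Ethiopic Unicode block (U+1200 to U+137F), and `ethiopic_ratio` is a hypothetical helper name.

```python
def ethiopic_ratio(text):
    """Fraction of non-whitespace characters in the basic Ethiopic block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    ethiopic = sum(1 for c in chars if "\u1200" <= c <= "\u137f")
    return ethiopic / len(chars)

for sample in ["ኢትዮጵያ የት ትገኛለች?", "Ethiopia is in East Africa."]:
    print(f"{ethiopic_ratio(sample):.2f}  {sample}")
```

A response with a ratio near zero suggests the model fell back to another language or script, which is worth inspecting before comparing models further.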

In [None]:
# Model evaluation and comparison
import os

print("📊 Trained Models Summary:\n")

models_dir = "models"
if os.path.exists(models_dir):
    for model_name in os.listdir(models_dir):
        model_path = os.path.join(models_dir, model_name)
        if os.path.isdir(model_path):
            # Get model size
            size = sum(os.path.getsize(os.path.join(dirpath, filename))
                      for dirpath, dirnames, filenames in os.walk(model_path)
                      for filename in filenames)
            size_mb = size / (1024 * 1024)
            
            print(f"🤖 {model_name}:")
            print(f"   Size: {size_mb:.1f} MB")
            print(f"   Path: {model_path}")
            print()

# Show training recommendations
print("🎯 Recommendations:")
print("• DistilGPT2: Good for quick testing and prototyping")
print("• Bloom-560M: Best balance of quality and speed")
print("• Bloom-1B1: Higher quality, good for production")
print("• Phi-3.5-mini: Highest quality, best for final deployment")

In [None]:
# Save models and create download links
import os
import zipfile

# Create a zip file with all models
def create_model_archive():
    if os.path.exists("models"):
        print("📦 Creating model archive...")
        
        with zipfile.ZipFile('amharic_models.zip', 'w', zipfile.ZIP_DEFLATED) as zipf:
            for root, dirs, files in os.walk("models"):
                for file in files:
                    file_path = os.path.join(root, file)
                    arcname = os.path.relpath(file_path, ".")
                    zipf.write(file_path, arcname)
        
        print("✅ Model archive created: amharic_models.zip")
        
        # Show file size
        size = os.path.getsize('amharic_models.zip') / (1024 * 1024)
        print(f"📁 Archive size: {size:.1f} MB")
        
        return 'amharic_models.zip'
    else:
        print("❌ No models directory found")
        return None

archive_path = create_model_archive()

if archive_path:
    print("\n🎉 Training Complete!")
    print("\nYour trained Amharic language models are ready!")
    print(f"Download the archive: {archive_path}")
    print("\nNext steps:")
    print("1. Download the models")
    print("2. Test them locally")
    print("3. Deploy to production")
    print("4. Create a demo with Gradio")
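
Before downloading a large archive it is worth confirming that it is readable. The sketch below uses the standard library's `ZipFile.testzip()`, which returns the name of the first corrupt member or `None`. `verify_archive` is an illustrative helper, and the demo builds a throwaway zip so the example runs without any trained models.

```python
import os
import tempfile
import zipfile

def verify_archive(path):
    """Return (file_count, first_bad_member); the second item is None if all CRCs pass."""
    with zipfile.ZipFile(path) as zf:
        bad_member = zf.testzip()
        return len(zf.namelist()), bad_member

# Demo on a temporary archive; point this at 'amharic_models.zip' in practice.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.zip")
with zipfile.ZipFile(demo_path, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("models/example/config.json", "{}")

count, bad = verify_archive(demo_path)
print(f"{count} file(s) in archive, corrupt member: {bad}")
```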

## 🎉 Training Complete!

Congratulations! You have successfully trained Amharic language models using Kaggle's free GPU resources.

### 📊 What You've Accomplished

- ✅ Trained multiple Amharic language models
- ✅ Tested model performance with Amharic instructions
- ✅ Packaged the trained models into `amharic_models.zip`
- ✅ Optimized for different use cases (speed vs quality)

### 🚀 Next Steps

1. **Download Models**: Save the `amharic_models.zip` file
2. **Local Testing**: Test models on your local machine
3. **Create Demo**: Build a Gradio or Streamlit demo
4. **Deploy**: Use Hugging Face Spaces or other platforms
5. **Improve**: Collect more data and retrain

### 📈 Model Recommendations

| Model | Use Case | Speed | Quality |
|-------|----------|-------|---------|
| DistilGPT2 | Testing, Prototyping | ⚡⚡⚡ | ⭐⭐ |
| Bloom-560M | General Use | ⚡⚡ | ⭐⭐⭐ |
| Bloom-1B1 | Production | ⚡ | ⭐⭐⭐⭐ |
| Phi-3.5-mini | High-end Production | ⚡ | ⭐⭐⭐⭐⭐ |

### 🔗 Useful Resources

- [Hugging Face Hub](https://huggingface.co) - Share your models
- [Gradio](https://gradio.app) - Create interactive demos
- [Streamlit](https://streamlit.io) - Build web apps
- [Transformers Docs](https://huggingface.co/docs/transformers) - Learn more

### 💡 Tips for Better Models

- **More Data**: Collect additional Amharic text data
- **Longer Training**: Increase training steps for better quality
- **Hyperparameter Tuning**: Experiment with learning rates
- **Evaluation**: Create proper evaluation metrics
- **Fine-tuning**: Adapt models for specific tasks
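
For the evaluation tip, perplexity on held-out text is a common starting point for language models. It is the exponential of the average negative log-probability per token. The sketch below works on a plain list of natural-log token probabilities; how you obtain them is up to you, and `perplexity` is an illustrative helper rather than part of the project scripts.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(-mean log-probability), using natural logarithms."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# A model that assigns probability 0.25 to every token has perplexity close to 4.
uniform = [math.log(0.25)] * 8
print(round(perplexity(uniform), 3))
```

Hugging Face causal language models return a mean cross-entropy loss when `labels` are passed, so `math.exp(loss)` gives the same quantity on a validation batch. Lower is better, and values are only comparable between models that share a tokenizer.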