# Moroccan Darija SmolLM2 Fine-tuning

A fine-tuning notebook that adapts **SmolLM2-135M-Instruct** to understand and generate Moroccan Darija (the Arabic dialect spoken in Morocco) using Q&A datasets.

## Overview

This project fine-tunes the `HuggingFaceTB/SmolLM2-135M-Instruct` model on **2,378 training examples**: question-answer pairs from the `Lyte/Moroccan-Darija-QA` dataset plus 8 hand-written conversational examples. The model learns to respond to questions in Darija across multiple domains, including business, culture, health, food, technology, and daily conversation.
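The example counts reported in the run log at the end of this notebook add up as sketched below. The variable names are illustrative and not part of the notebook; the 200-example cap mirrors the `sample_size` used in the preparation code.

```python
# Example counts taken from the run log of this notebook.
default_pairs = 2026      # "default" configuration
reasoning_pairs = 144     # "reasoning" configuration
translated_pairs = 1300   # "translated" configuration (only a sample is used)
conversational = 8        # hand-written greeting and small-talk examples

# The notebook keeps at most 200 translated pairs to preserve the Darija focus.
translated_sample = min(200, translated_pairs)

total_examples = default_pairs + reasoning_pairs + translated_sample + conversational
print(f"Total training examples: {total_examples}")
```

The total equals the dataset size the log reports, so the short-answer filter (answers under 20 characters) removed nothing in this run.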


In [None]:
!nvidia-smi

Mon Sep 15 13:32:46 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   47C    P8              9W /   70W |       2MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
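The training cell below sizes its batches from the GPU memory reported here. A minimal, torch-free sketch of that rule follows. `choose_batch_config` is a hypothetical helper name and the 40 GB entry is an invented comparison point, while the 16 GB threshold and the batch values are the ones the notebook uses.

```python
def choose_batch_config(gpu_memory_gb=None):
    """Return (per_device_batch_size, gradient_accumulation_steps).

    gpu_memory_gb is None when no GPU is available.
    """
    if gpu_memory_gb is None or gpu_memory_gb < 16:
        # CPU or a small GPU such as the Colab T4: tiny batches, more accumulation
        return 1, 8
    return 2, 4


for label, memory in [("CPU", None), ("T4", 15.8), ("40 GB GPU", 40.0)]:
    batch_size, grad_accum = choose_batch_config(memory)
    # Effective batch size = per-device batch size x accumulation steps
    print(f"{label}: batch={batch_size}, accumulation={grad_accum}, "
          f"effective batch={batch_size * grad_accum}")
```

The effective batch size is 8 in every branch, so the choice only trades memory use against throughput and does not change the optimization schedule.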
                                                

In [None]:
# Moroccan Darija Fine-tuning for SmolLM2-135M using Lyte/Moroccan-Darija-QA
# Optimized for Google Colab with high-quality Q&A dataset

import random

import torch
from datasets import Dataset, concatenate_datasets, load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

# ============================================================================
# MOROCCAN DARIJA Q&A DATASET INTEGRATION
# ============================================================================

def load_moroccan_darija_qa_dataset():
    """Load the high-quality Moroccan Darija Q&A dataset from Lyte"""

    print("🇲🇦 Loading Moroccan Darija Q&A Dataset...")

    try:
        # Load different configurations
        dataset_default = load_dataset("Lyte/Moroccan-Darija-QA", name="default")
        dataset_translated = load_dataset("Lyte/Moroccan-Darija-QA", name="translated")
        dataset_reasoning = load_dataset("Lyte/Moroccan-Darija-QA", name="reasoning")

        print(f"✅ Default config loaded: {len(dataset_default['train'])} examples")
        print(f"✅ Translated config loaded: {len(dataset_translated['train'])} examples")
        print(f"✅ Reasoning config loaded: {len(dataset_reasoning['train'])} examples")

        # Preview the first three examples without iterating the whole split
        print("\n📊 Dataset Preview:")
        train_split = dataset_default['train']
        for example in train_split.select(range(min(3, len(train_split)))):
            print(f"Question: {example['question']}")
            print(f"Answer: {example['answer'][:100]}...")
            print(f"Category: {example['category']}")
            print("-" * 50)

        return {
            'default': dataset_default,
            'translated': dataset_translated,
            'reasoning': dataset_reasoning
        }

    except Exception as e:
        print(f"❌ Error loading Q&A dataset: {e}")
        return None

def prepare_darija_qa_for_training(qa_datasets):
    """Convert Q&A datasets to instruction-following format for training"""

    if qa_datasets is None:
        return None

    training_examples = []

    # Process default configuration (main dataset)
    print("\n🔄 Processing default Q&A dataset...")
    default_data = qa_datasets['default']['train']

    for example in default_data:
        question = example['question'].strip()
        answer = example['answer'].strip()
        category = example['category']

        # Skip very short answers
        if len(answer) < 20:
            continue

        # Create instruction format
        training_examples.append({
            'prompt': question,
            'response': answer,
            'category': category
        })

    # Process reasoning configuration (adds thinking process)
    print("🧠 Processing reasoning Q&A dataset...")
    reasoning_data = qa_datasets['reasoning']['train']

    for example in reasoning_data:
        question = example['question'].strip()
        answer = example['answer'].strip()
        category = example['category']

        # Skip very short answers
        if len(answer) < 20:
            continue

        training_examples.append({
            'prompt': question,
            'response': answer,
            'category': f"{category}_reasoning"
        })

    # Optionally include some translated examples for diversity
    print("🌍 Adding translated examples for variety...")
    translated_data = qa_datasets['translated']['train']

    # Take a sample of translated data (not all, to maintain the Darija focus).
    # A seeded generator keeps the sample reproducible across runs.
    sample_size = min(200, len(translated_data))
    translated_sample = random.Random(42).sample(list(translated_data), sample_size)

    for example in translated_sample:
        question = example['question'].strip()
        answer = example['answer'].strip()
        category = example['category']

        training_examples.append({
            'prompt': question,
            'response': answer,
            'category': f"{category}_translated"
        })

    print(f"✅ Total training examples prepared: {len(training_examples)}")

    # Show category distribution
    categories = {}
    for ex in training_examples:
        cat = ex['category']
        categories[cat] = categories.get(cat, 0) + 1

    print("\n📈 Category Distribution:")
    for cat, count in sorted(categories.items()):
        print(f"  {cat}: {count} examples")

    return Dataset.from_list(training_examples)

def add_conversational_examples():
    """Add some basic conversational examples to complement the Q&A data"""

    darija_conversations = [
        {
            "prompt": "كيف داير؟",
            "response": "الحمد لله، بخير. أنت كيف داير؟",
            "category": "greeting"
        },
        {
            "prompt": "شنو كتعمل اليوم؟",
            "response": "كنخدم فالبيت وكنقرا شوية. أنت شنو كتعمل؟",
            "category": "daily_conversation"
        },
        {
            "prompt": "واش بغيتي تشرب أتاي؟",
            "response": "إيه بغيت، شكرا ليك. أتاي بالنعناع زين بزاف.",
            "category": "hospitality"
        },
        {
            "prompt": "شكرا ليك بزاف",
            "response": "العفو، ماشي مشكل. كلشي ديال الخير.",
            "category": "politeness"
        },
        {
            "prompt": "بصحتك",
            "response": "الله يعطيك الصحة. شكرا ليك.",
            "category": "well_wishes"
        },
        {
            "prompt": "أشنو كتحب فالماكلة المغريبية؟",
            "response": "كنحب الطاجين والكسكس والپاستيطا. والحلويات مغريبية زينين بزاف.",
            "category": "food_conversation"
        },
        {
            "prompt": "واش كتعرف تطيب؟",
            "response": "إيه كنعرف نطيب شوية. كنطيب الطاجين والحريرة مزيان.",
            "category": "cooking"
        },
        {
            "prompt": "السلام عليكم",
            "response": "وعليكم السلام ورحمة الله وبركاته. أهلا وسهلا.",
            "category": "religious_greeting"
        }
    ]

    print(f"➕ Added {len(darija_conversations)} conversational examples")
    return Dataset.from_list(darija_conversations)

# ============================================================================
# IMPROVED DATASET PREPARATION
# ============================================================================

def prepare_complete_darija_dataset():
    """Combine Q&A dataset with conversational examples"""

    # Load the main Q&A dataset
    qa_datasets = load_moroccan_darija_qa_dataset()
    if qa_datasets is None:
        print("❌ Failed to load Q&A dataset")
        return None

    # Prepare Q&A data for training
    qa_training_data = prepare_darija_qa_for_training(qa_datasets)

    # Add conversational examples
    conversation_data = add_conversational_examples()

    # Combine datasets
    if qa_training_data and conversation_data:
        combined_dataset = concatenate_datasets([qa_training_data, conversation_data])
    else:
        combined_dataset = qa_training_data or conversation_data

    print(f"\n🎯 Final dataset size: {len(combined_dataset)} examples")
    return combined_dataset

# ============================================================================
# TRAINING SETUP FOR DARIJA Q&A
# ============================================================================

def setup_darija_qa_training():
    """Setup training configuration optimized for Darija Q&A and Colab"""

    # Check GPU memory
    if torch.cuda.is_available():
        gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
        print(f"🚀 Available GPU memory: {gpu_memory:.1f}GB")

        # Adjust batch size based on memory
        if gpu_memory < 16:  # T4 in Colab
            batch_size = 1
            grad_accumulation = 8
        else:
            batch_size = 2
            grad_accumulation = 4
    else:
        print("⚠️  Running on CPU - training will be slower")
        batch_size = 1
        grad_accumulation = 8

    # Mixed precision and the 8-bit optimizer both require a CUDA GPU
    # (adamw_8bit also needs the bitsandbytes package), so fall back on CPU.
    use_cuda = torch.cuda.is_available()

    # Adjusted for larger, higher-quality dataset
    training_args = TrainingArguments(
        per_device_train_batch_size=batch_size,
        gradient_accumulation_steps=grad_accumulation,
        warmup_steps=100,  # More warmup for larger dataset
        max_steps=500,     # More steps for comprehensive Q&A learning
        learning_rate=1e-5,  # Lower LR for stability with quality data
        fp16=use_cuda,
        logging_steps=20,
        optim="adamw_8bit" if use_cuda else "adamw_torch",
        weight_decay=0.01,
        lr_scheduler_type="cosine",
        seed=42,
        output_dir="darija_qa_model_outputs",
        save_steps=150,
        dataloader_drop_last=True,
        report_to="none",
        remove_unused_columns=False,
        save_total_limit=2,
    )

    return training_args

# ============================================================================
# IMPROVED TOKENIZATION FOR Q&A FORMAT
# ============================================================================

def prepare_darija_qa_tokenization(dataset, tokenizer, max_length):
    """Prepare tokenization specifically for Darija Q&A format"""

    def tokenize_qa_examples(examples):
        """Tokenize Q&A examples with proper chat format"""

        texts = []
        for prompt, response in zip(examples["prompt"], examples["response"]):
            # Create chat format for Q&A
            conversation = [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": response}
            ]

            # Apply chat template
            text = tokenizer.apply_chat_template(
                conversation,
                tokenize=False,
                add_generation_prompt=False
            )
            texts.append(text)

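        # Tokenize without padding: SFTTrainer's default data collator pads
        # each batch dynamically, so pad tokens are not baked into every
        # example. return_tensors is omitted because datasets.map stores lists.
        return tokenizer(
            texts,
            truncation=True,
            max_length=max_length,
        )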

    # Apply tokenization
    tokenized_dataset = dataset.map(
        tokenize_qa_examples,
        batched=True,
        remove_columns=dataset.column_names
    )

    return tokenized_dataset

# ============================================================================
# MAIN TRAINING FUNCTION
# ============================================================================

def train_darija_qa_smollm():
    """Main function to train SmolLM2 on high-quality Darija Q&A data"""

    print("🇲🇦 Starting Moroccan Darija Q&A Fine-tuning for SmolLM2-135M")
    print("=" * 70)

    # 1. Load model and tokenizer
    print("\n1️⃣ Loading base model...")
    model_name = "HuggingFaceTB/SmolLM2-135M-Instruct"
    model = AutoModelForCausalLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Ensure pad token is set
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    print(f"✅ Model loaded: {model_name}")
    print(f"✅ Vocabulary size: {len(tokenizer)}")

    # 2. Prepare dataset
    print("\n2️⃣ Preparing Darija Q&A datasets...")
    darija_dataset = prepare_complete_darija_dataset()

    if darija_dataset is None:
        print("❌ Failed to prepare dataset")
        return None

    # 3. Tokenize dataset
    print("\n3️⃣ Tokenizing data...")
    max_seq_length = 512  # Increased for Q&A format
    tokenized_dataset = prepare_darija_qa_tokenization(darija_dataset, tokenizer, max_seq_length)

    print(f"✅ Tokenization complete. Dataset size: {len(tokenized_dataset)}")

    # 4. Setup training
    print("\n4️⃣ Setting up training...")
    training_args = setup_darija_qa_training()

    # Initialize the trainer (still part of step 4)
    trainer = SFTTrainer(
        model=model,
        processing_class=tokenizer,
        train_dataset=tokenized_dataset,
        args=training_args,
    )

    # 5. Start training
    print("\n5️⃣ Starting training...")
    print("🕐 Training will take approximately 25-35 minutes on Colab T4")
    print("📚 Learning from 2000+ Darija Q&A pairs...")

    trainer.train()

    print("✅ Training completed!")
    return trainer

# ============================================================================
# ENHANCED TESTING FOR Q&A MODEL
# ============================================================================

def test_darija_qa_model(trainer):
    """Test the fine-tuned model on various Darija Q&A scenarios"""

    from transformers import pipeline

    # Create pipeline
    darija_pipe = pipeline(
        "text-generation",
        model=trainer.model,
        tokenizer=trainer.processing_class,
        device=0 if torch.cuda.is_available() else -1
    )

    # Test prompts covering different categories from the dataset
    test_prompts = [
        # Business
        "واش التجارة مربحة فالمغرب؟",
        "كيفاش نفتح مقاولة صغيرة؟",

        # Health
        "شنو هوما فوائد الرياضة؟",
        "كيفاش نحافظ على صحتي؟",

        # Food
        "كيفاش نطيب الطاجين؟",
        "شنو أحسن ماكلة مغريبية؟",

        # Culture
        "شنو هوما التقاليد المغريبية؟",
        "علاش عيد الفطر مهم؟",

        # Daily conversation
        "كيف داير؟",
        "شنو كتعمل النهار؟",

        # Technology
        "شنو هو الذكاء الاصطناعي؟",
        "كيفاش نستعمل الهاتف الذكي؟"
    ]

    print("\n🧪 Testing Darija Q&A Model:")
    print("=" * 50)

    for i, prompt in enumerate(test_prompts, 1):
        # Format as chat
        messages = [{"role": "user", "content": prompt}]
        formatted_prompt = trainer.processing_class.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        # Generate response
        try:
            response = darija_pipe(
                formatted_prompt,
                max_new_tokens=120,
                do_sample=True,
                temperature=0.7,
                top_p=0.9,
                pad_token_id=trainer.processing_class.eos_token_id
            )

            generated = response[0]['generated_text'][len(formatted_prompt):].strip()

            print(f"\n{i}. 🙋‍♂️ {prompt}")
            print(f"   🤖 {generated}")
            print("-" * 40)

        except Exception as e:
            print(f"❌ Error generating response for '{prompt}': {e}")

# ============================================================================
# EXECUTION
# ============================================================================

if __name__ == "__main__":

    print("🇲🇦 Moroccan Darija SmolLM2 Fine-tuning with Q&A Dataset")
    print("📚 Using Lyte/Moroccan-Darija-QA - High Quality Dataset")
    print("🎯 2000+ Question-Answer pairs covering 10 categories")
    print("=" * 70)

    # Train the model
    trainer = train_darija_qa_smollm()

    if trainer:
        # Test Q&A capabilities
        test_darija_qa_model(trainer)

        # Save the model
        save_path = "smollm_darija_qa_finetuned"
        trainer.model.save_pretrained(save_path)
        trainer.processing_class.save_pretrained(save_path)

        print(f"\n✅ Darija Q&A fine-tuning complete!")
        print(f"📁 Model saved to: {save_path}")
        print(f"🇲🇦 Your SmolLM2 now speaks Moroccan Darija with Q&A capabilities!")
        print(f"📊 Trained on {len(trainer.train_dataset)} examples")

        # Model usage instructions
        print("\n📖 Usage Instructions:")
        print("from transformers import pipeline")
        print(f"pipe = pipeline('text-generation', model='{save_path}')")
        print("response = pipe('شنو هوما فوائد الرياضة؟')")

    else:
        print("❌ Training failed. Please check the error messages above.")

🇲🇦 Moroccan Darija SmolLM2 Fine-tuning with Q&A Dataset
📚 Using Lyte/Moroccan-Darija-QA - High Quality Dataset
🎯 2000+ Question-Answer pairs covering 10 categories
🇲🇦 Starting Moroccan Darija Q&A Fine-tuning for SmolLM2-135M

1️⃣ Loading base model...
✅ Model loaded: HuggingFaceTB/SmolLM2-135M-Instruct
✅ Vocabulary size: 49152

2️⃣ Preparing Darija Q&A datasets...
🇲🇦 Loading Moroccan Darija Q&A Dataset...
✅ Default config loaded: 2026 examples
✅ Translated config loaded: 1300 examples
✅ Reasoning config loaded: 144 examples

📊 Dataset Preview:
Question: واش التجارة مربحة فالمغرب؟
Answer: التجارة مربحة ملي كتعرف السوق مزيان وكتختار المنتج المناسب. ولكن كتحتاج رأس مال وخبرة. أحسن شي تبدا ...
Category: business
--------------------------------------------------
Question: كيفاش نفتح مقاولة صغيرة؟
Answer: باش تفتح مقاولة صغيرة، مش للسجل التجاري وسجل فالضرائب. خود رخصة من البلدية ملي كانت حاجة تحتاج ليها....
Category: business
--------------------------------------------------
Question: شحال


✅ Tokenization complete. Dataset size: 2378
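Tokenized examples have different lengths, so they must be padded before they can be stacked into a batch. The sketch below shows the general idea of padding per batch and masking pad positions out of the loss with the label value -100, which PyTorch's cross-entropy loss ignores by default. It is a plain-Python illustration with a hypothetical `pad_batch` helper and made-up token ids, not the collator code that TRL actually runs.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def pad_batch(sequences, pad_token_id):
    """Right-pad token-id lists to the longest sequence in the batch."""
    longest = max(len(seq) for seq in sequences)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for seq in sequences:
        n_pad = longest - len(seq)
        batch["input_ids"].append(seq + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(seq) + [0] * n_pad)
        # Real tokens are their own labels; padding never contributes to the loss
        batch["labels"].append(seq + [IGNORE_INDEX] * n_pad)
    return batch


# Made-up token ids for two examples of different length
batch = pad_batch([[5, 6, 7, 8], [9, 10]], pad_token_id=2)
print(batch["input_ids"])
print(batch["attention_mask"])
print(batch["labels"])
```

Padding to the longest sequence in each batch, rather than to a fixed length, keeps batches of short Q&A pairs cheap to process.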

4️⃣ Setting up training...
🚀 Available GPU memory: 15.8GB




5️⃣ Starting training...
🕐 Training will take approximately 25-35 minutes on Colab T4
📚 Learning from 2000+ Darija Q&A pairs...


| Step | Training Loss |
|------|---------------|
| 20   | 1.8413        |
| 40   | 1.4053        |
| 60   | 1.1049        |
| 80   | 0.9305        |
| 100  | 0.8958        |
| 120  | 0.8559        |
| 140  | 0.8193        |
| 160  | 0.8147        |
| 180  | 0.7936        |
| 200  | 0.844         |
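To put the logged values in context, the sketch below converts each training loss to perplexity (the exponential of the mean cross-entropy) and estimates how many passes over the data the configured 500 steps would make. The numbers are copied from this notebook; the effective batch size of 8 assumes the T4 branch of the setup code (batch size 1, 8 accumulation steps).

```python
import math

# (step, training loss) pairs copied from the log above
loss_log = [(20, 1.8413), (40, 1.4053), (60, 1.1049), (80, 0.9305),
            (100, 0.8958), (120, 0.8559), (140, 0.8193), (160, 0.8147),
            (180, 0.7936), (200, 0.844)]

for step, loss in loss_log:
    # Perplexity is exp(mean token-level cross-entropy)
    print(f"step {step:3d}: loss={loss:.4f}  perplexity={math.exp(loss):.2f}")

best_step, best_loss = min(loss_log, key=lambda pair: pair[1])

# Rough epoch count: examples seen / dataset size
max_steps = 500
effective_batch = 1 * 8   # per-device batch size x gradient accumulation
dataset_size = 2378
epochs = max_steps * effective_batch / dataset_size

print(f"lowest logged loss: {best_loss} at step {best_step}")
print(f"approximate epochs at {max_steps} steps: {epochs:.2f}")
```

Under these assumptions the full schedule covers the dataset between one and two times.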


Device set to use cuda:0


✅ Training completed!

🧪 Testing Darija Q&A Model:

1. 🙋‍♂️ واش التجارة مربحة فالمغرب؟
   🤖 التجارة كيحتاج بالمغرب، كيمكن يحتاج العادات المغربية ولا ديال كاملة والمواقع. إذا كانت فالمغرب ماتجاه من الأنت.
----------------------------------------

2. 🙋‍♂️ كيفاش نفتح مقاولة صغيرة؟
   🤖 المقاولة صغيرة من المغرب التغييرية. تطبيق مع المغرب التغييرية والتصلي. كيخلي فالمغرب التغييرية والمغرب العرفية. وماكيف مع المغرب التغ
----------------------------------------

3. 🙋‍♂️ شنو هوما فوائد الرياضة؟
   🤖 الأساسية كتعلم بالطبخ المغربية المغربية والتواصل المغربية. كذلك الرياضة كتسعو بالطبخ المغربية، والمغرب كتعلم بالطبخ المغربية الم
----------------------------------------

4. 🙋‍♂️ كيفاش نحافظ على صحتي؟
   🤖 بدات التجارة تحافظ على صحتي، خدمة كتجارة تجارة تأكد معها حيت خدمة وضعيف، واخر المعاهدات.
----------------------------------------

5. 🙋‍♂️ كيفاش نطيب الطاجين؟
   🤖 الطاجين كيستعمل الجداد، ولكن من نقدر مزيان. الطاجين كيعطي كيستعمل الأكثرات والجداد، ولكن أنها تقدر من الماء والعمر. كيستعمل الج
-----

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset



10. 🙋‍♂️ شنو كتعمل النهار؟
   🤖 النهار، ديال الصفة تحديد التقليد. الشعب الإصلي، كيجب عندو الأرخاصة والتعليم. الأصلي الثقافي المغربي.
----------------------------------------

11. 🙋‍♂️ شنو هو الذكاء الاصطناعي؟
   🤖 الذكاء الاصطناعي كتأثر معمولة المغرب والتساعد، والشيخ كتأثر تأثر عندها حيت تجنب جنوب تأثر كل مجيء.
----------------------------------------

12. 🙋‍♂️ كيفاش نستعمل الهاتف الذكي؟
   🤖 الهاتف الذكي كتعرف فيها الطبقة الذكية. كيستعمل شي تقدر وضعفات حسب الشباب. كتخلص المغرب والحضرة. كتتغير على طبقة إلى طبقة تق
----------------------------------------

✅ Darija Q&A fine-tuning complete!
📁 Model saved to: smollm_darija_qa_finetuned
🇲🇦 Your SmolLM2 now speaks Moroccan Darija with Q&A capabilities!
📊 Trained on 2378 examples

📖 Usage Instructions:
from transformers import pipeline
pipe = pipeline('text-generation', model='smollm_darija_qa_finetuned')
response = pipe('شنو هوما فوائد الرياضة؟')
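Because training wraps every example in the model's chat template, inference is likely to work best when the prompt goes through the same template. The sketch below separates message construction from generation so it can be run without downloading a model. `build_messages`, `generate_reply`, and `fake_pipe` are hypothetical helpers, and `generate_reply` assumes a text-generation pipeline that accepts a list of chat messages and returns the conversation with the new assistant message appended, as recent `transformers` releases do.

```python
def build_messages(question):
    """Wrap a raw question in the chat structure used during fine-tuning."""
    return [{"role": "user", "content": question.strip()}]


def generate_reply(pipe, question, max_new_tokens=120):
    """Ask a chat-capable text-generation pipeline and return the reply text.

    Assumes pipe(messages) returns [{"generated_text": [...messages...]}]
    with the assistant reply as the last message.
    """
    outputs = pipe(build_messages(question), max_new_tokens=max_new_tokens)
    return outputs[0]["generated_text"][-1]["content"]


def fake_pipe(messages, max_new_tokens=120):
    """Stand-in for a real pipeline so the sketch runs offline."""
    reply = {"role": "assistant", "content": "(model reply goes here)"}
    return [{"generated_text": messages + [reply]}]


print(build_messages("  شنو هوما فوائد الرياضة؟ "))
print(generate_reply(fake_pipe, "شنو هوما فوائد الرياضة؟"))
```

With the saved model, a pipeline created as `pipeline('text-generation', model='smollm_darija_qa_finetuned')` can be passed in place of `fake_pipe`.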
