# Task 4: Model Comparison & Selection

This notebook compares three multilingual transformer models for Amharic Named Entity Recognition (NER) and selects the best one based on accuracy, efficiency, and deployment considerations.

## Models to Compare:
- **XLM-RoBERTa-base**: Cross-lingual language model with robust multilingual support
- **BERT-base-multilingual-cased**: Google's multilingual BERT model
- **DistilBERT-base-multilingual-cased**: Distilled version of multilingual BERT (faster, smaller)

## Evaluation Criteria:
- **Accuracy Metrics**: F1-score, Precision, Recall, Accuracy
- **Efficiency Metrics**: Training time, Inference time, Model size
- **Business Considerations**: Resource requirements, deployment feasibility
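
The accuracy metrics above can be computed per token or per entity, and the two can diverge. The notebook's `compute_metrics` is defined elsewhere, so the following is only a minimal sketch, assuming BIO tags and exact-match entity scoring. The function names and the `PRODUCT`/`PRICE`/`LOC` labels are illustrative and not taken from the project.

```python
def extract_entities(tags):
    """Return a set of (entity_type, start, end) spans from a BIO tag sequence.

    Assumption: an I- tag with no matching B- before it is ignored.
    Some scoring libraries treat such tags differently.
    """
    entities = set()
    start, ent_type = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # trailing "O" flushes the last span
        continues_span = tag.startswith("I-") and ent_type == tag[2:]
        if not continues_span and ent_type is not None:
            entities.add((ent_type, start, i - 1))
            start, ent_type = None, None
        if tag.startswith("B-"):
            start, ent_type = i, tag[2:]
    return entities


def entity_prf(true_tags, pred_tags):
    """Entity-level precision, recall, and F1 using exact span matches."""
    gold = extract_entities(true_tags)
    pred = extract_entities(pred_tags)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


def token_accuracy(true_tags, pred_tags):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    matches = sum(t == p for t, p in zip(true_tags, pred_tags))
    return matches / len(true_tags)


# Hypothetical eight-token sentence with three gold entities
gold_tags = ["B-PRODUCT", "I-PRODUCT", "O", "B-PRICE", "I-PRICE", "O", "B-LOC", "I-LOC"]
# The prediction truncates the PRICE span by one token
pred_tags = ["B-PRODUCT", "I-PRODUCT", "O", "B-PRICE", "O", "O", "B-LOC", "I-LOC"]

precision, recall, f1 = entity_prf(gold_tags, pred_tags)
print(f"token accuracy: {token_accuracy(gold_tags, pred_tags):.3f}")
print(f"entity precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy sentence a single wrong token leaves token accuracy high while one of the three entities is lost, which is why entity-level F1 is usually the more informative number for NER.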

## 1. Import Required Libraries

Setting up the necessary libraries for model comparison, visualization, and analysis.

In [None]:
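import json
from datetime import datetime
from time import time
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer, TrainingArguments)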

## 2. Model Configuration

Define the models we want to compare and initialize the results storage structure.

In [None]:
# List of models to compare
models_to_compare = [
    "xlm-roberta-base",
    "bert-base-multilingual-cased", 
    "distilbert-base-multilingual-cased"
]

# Dictionary to store results
model_results = {}

## 3. Model Training and Evaluation Function

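This function runs the complete training and evaluation pipeline for one model. It relies on objects that must already be defined in the session (`label_list`, `id_to_label`, `label_to_id`, `tokenize_and_align_labels`, and `compute_metrics`). It: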

- Loads the specified model and tokenizer
- Tokenizes and prepares datasets
- Configures training parameters
- Trains the model
- Evaluates performance on test set
- Measures timing and resource metrics
- Saves the trained model

In [None]:
def train_and_evaluate_model(model_name, train_dataset, val_dataset, test_dataset, 
                           tokenizer_name=None, epochs=3):
    """Train and evaluate a specific model"""
    print(f"\n{'='*50}")
    print(f"Training model: {model_name}")
    print(f"{'='*50}")
    
    # Use same tokenizer name as model if not specified
    if tokenizer_name is None:
        tokenizer_name = model_name
    
    # Load tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    model = AutoModelForTokenClassification.from_pretrained(
        model_name, 
        num_labels=len(label_list),
        id2label=id_to_label,
        label2id=label_to_id
    )
    
    # Tokenize datasets
    start_time = time()
    train_tokenized = train_dataset.map(
        lambda x: tokenize_and_align_labels(x, tokenizer, label_to_id),
        batched=True
    )
    val_tokenized = val_dataset.map(
        lambda x: tokenize_and_align_labels(x, tokenizer, label_to_id),
        batched=True
    )
    test_tokenized = test_dataset.map(
        lambda x: tokenize_and_align_labels(x, tokenizer, label_to_id),
        batched=True
    )
    tokenization_time = time() - start_time
    
    # Data collator
    data_collator = DataCollatorForTokenClassification(
        tokenizer=tokenizer, 
        padding=True
    )
    
    # Training arguments
    training_args = TrainingArguments(
        output_dir=f"./results_{model_name.replace('/', '_')}",
        learning_rate=2e-5,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        num_train_epochs=epochs,
        weight_decay=0.01,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        logging_dir=f'./logs_{model_name.replace("/", "_")}',
        logging_steps=10,
        load_best_model_at_end=True,
        metric_for_best_model="f1",
        greater_is_better=True,
        save_total_limit=1,
        report_to="none",  # the string "none" disables logging integrations; None can fall back to all installed ones
        dataloader_pin_memory=False,
    )
    
    # Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_tokenized,
        eval_dataset=val_tokenized,
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics,
    )
    
    # Train
    start_time = time()
    trainer.train()
    training_time = time() - start_time
    
    # Evaluate on test set
    start_time = time()
    test_results = trainer.evaluate(test_tokenized)
    inference_time = time() - start_time
    
    # Count parameters (in millions) as a proxy for model size
    model_size = sum(p.numel() for p in model.parameters()) / 1e6
    
    # Store results
    results = {
        'model_name': model_name,
        'test_f1': test_results['eval_f1'],
        'test_precision': test_results['eval_precision'],
        'test_recall': test_results['eval_recall'],
        'test_accuracy': test_results['eval_accuracy'],
        'training_time': training_time,
        'inference_time': inference_time,
        'tokenization_time': tokenization_time,
        'model_size_millions': model_size,
        'test_loss': test_results['eval_loss']
    }
    
    # Save model
    model_save_path = f"./{model_name.replace('/', '_')}_amharic_ner"
    model.save_pretrained(model_save_path)
    tokenizer.save_pretrained(model_save_path)
    
    return results, model, tokenizer
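
The function above calls `tokenize_and_align_labels`, which is defined outside this notebook, so its exact behavior is not shown here. A common approach with subword tokenizers is to label only the first subword of each word and mask the rest. The sketch below assumes that approach and works on a plain list of word indices, such as the one returned by a fast tokenizer's `word_ids()`. The `align_labels` helper and the sample values are hypothetical.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def align_labels(word_ids, word_labels, label_all_subwords=False):
    """Expand word-level label ids to subword tokens.

    word_ids: one entry per token; None marks special tokens such as <s> or [CLS].
    word_labels: one label id per original word.
    """
    aligned = []
    previous_word_id = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)          # special token: never scored
        elif word_id != previous_word_id:
            aligned.append(word_labels[word_id])  # first subword carries the label
        elif label_all_subwords:
            aligned.append(word_labels[word_id])  # optionally repeat the label
        else:
            aligned.append(IGNORE_INDEX)          # later subwords are masked
        previous_word_id = word_id
    return aligned


# Hypothetical sentence of three words split into 2, 1, and 3 subword tokens
example_word_ids = [None, 0, 0, 1, 2, 2, 2, None]
example_labels = [1, 0, 3]

print(align_labels(example_word_ids, example_labels))
print(align_labels(example_word_ids, example_labels, label_all_subwords=True))
```

Masking later subwords keeps the loss and the metrics at one decision per word, so models whose tokenizers split Amharic words into different numbers of pieces remain comparable.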

## 4. Execute Model Training and Evaluation

Train and evaluate each model in our comparison list. This process will:

- Train each model for the specified number of epochs
- Collect performance metrics for each model
- Test sample predictions to verify model functionality
- Handle any training errors gracefully

In [None]:
# Train and evaluate all models
for model_name in models_to_compare:
    try:
        results, trained_model, trained_tokenizer = train_and_evaluate_model(
            model_name, train_dataset, val_dataset, test_dataset
        )
        model_results[model_name] = results
        
        # Test prediction
        test_text = "LIFESTAR Android ሪሲቨር ዋጋ 7000 ብር አዲስ አበባ ውስጥ"
        predictions = predict_entities(test_text, trained_model, trained_tokenizer, id_to_label)
        model_results[model_name]['sample_prediction'] = predictions
        
    except Exception as e:
        print(f"Error training {model_name}: {str(e)}")
        continue
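
The timing metrics collected for each model are raw wall-clock totals, so they are only comparable when every model is evaluated on the same data and hardware. A minimal sketch of one way to record them, assuming `time.perf_counter` (a monotonic clock) and a per-example normalization. The `timed` helper, the `sleep` stand-in, and the test-set size are hypothetical.

```python
from contextlib import contextmanager
from time import perf_counter, sleep


@contextmanager
def timed(store, key):
    """Record the elapsed seconds of the enclosed block in store[key]."""
    start = perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so partial timings are still recorded
        store[key] = perf_counter() - start


timings = {}
with timed(timings, "inference_time"):
    sleep(0.05)  # stand-in for trainer.evaluate(test_tokenized)

num_test_examples = 200  # hypothetical test-set size
timings["seconds_per_example"] = timings["inference_time"] / num_test_examples
print(timings)
```

Dividing by the number of examples gives a latency figure that stays meaningful if the test set changes size between runs.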

## 5. Display Training Results

Present a comprehensive overview of all model performance metrics in a structured format.

In [None]:
# Display results
print("\n" + "="*80)
print("MODEL COMPARISON RESULTS")
print("="*80)

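# Keep only numeric metrics; mixed-type columns are stored as objects and would not be rounded
results_df = pd.DataFrame(model_results).T.drop(columns=['model_name', 'sample_prediction'], errors='ignore')
results_df = results_df.apply(pd.to_numeric, errors='coerce')
print(results_df.round(4))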

## 6. Performance Visualizations

Create comprehensive visualizations to compare models across different dimensions:

- **F1 Score Comparison**: Primary accuracy metric
- **Training Time Analysis**: Efficiency during training
- **Model Size Comparison**: Resource requirements
- **Precision vs Recall**: Balance between metrics
- **Radar Chart**: Overall performance profile

In [None]:
# Create visualizations
plt.figure(figsize=(15, 12))

# 1. F1 Score Comparison
plt.subplot(2, 3, 1)
models = list(model_results.keys())
f1_scores = [model_results[model]['test_f1'] for model in models]
bars = plt.bar(models, f1_scores, color=['skyblue', 'lightcoral', 'lightgreen'])
plt.title('F1 Score Comparison')
plt.ylabel('F1 Score')
plt.xticks(rotation=45)
for i, bar in enumerate(bars):
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height + 0.01,
             f'{f1_scores[i]:.3f}', ha='center', va='bottom')

# 2. Training Time Comparison
plt.subplot(2, 3, 2)
training_times = [model_results[model]['training_time']/60 for model in models]  # Convert to minutes
bars = plt.bar(models, training_times, color=['orange', 'purple', 'brown'])
plt.title('Training Time Comparison')
plt.ylabel('Training Time (minutes)')
plt.xticks(rotation=45)
for i, bar in enumerate(bars):
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height + 0.5,
             f'{training_times[i]:.1f}m', ha='center', va='bottom')

# 3. Model Size Comparison
plt.subplot(2, 3, 3)
model_sizes = [model_results[model]['model_size_millions'] for model in models]
bars = plt.bar(models, model_sizes, color=['red', 'blue', 'green'])
plt.title('Model Size Comparison')
plt.ylabel('Parameters (Millions)')
plt.xticks(rotation=45)
for i, bar in enumerate(bars):
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height + 1,
             f'{model_sizes[i]:.0f}M', ha='center', va='bottom')

# 4. Precision vs Recall
plt.subplot(2, 3, 4)
precisions = [model_results[model]['test_precision'] for model in models]
recalls = [model_results[model]['test_recall'] for model in models]
colors = ['skyblue', 'lightcoral', 'lightgreen']
for i, model in enumerate(models):
    plt.scatter(recalls[i], precisions[i], s=100, c=colors[i], label=model.split('/')[-1])
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision vs Recall')
plt.legend()
plt.grid(True, alpha=0.3)

# 5. Overall Performance Radar Chart
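# Use a polar axis in the fifth grid slot; calling plt.subplots() here would open
# a separate figure and leave the radar chart out of the saved 2x3 grid
ax = plt.subplot(2, 3, 5, projection='polar')
categories = ['F1 Score', 'Precision', 'Recall', 'Speed\n(1/time)', 'Efficiency\n(1/size)']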

angles = np.linspace(0, 2 * np.pi, len(categories), endpoint=False).tolist()
angles += angles[:1]  # Complete the circle

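# Scale speed and size relative to the best model so both axes fall within (0, 1]
min_time = min(model_results[m]['training_time'] for m in models)
min_size = min(model_results[m]['model_size_millions'] for m in models)

for i, model in enumerate(models):
    values = [
        model_results[model]['test_f1'],
        model_results[model]['test_precision'],
        model_results[model]['test_recall'],
        min_time / model_results[model]['training_time'],  # 1.0 = fastest to train
        min_size / model_results[model]['model_size_millions']  # 1.0 = fewest parameters
    ]
    values += values[:1]  # Complete the circle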
    
    ax.plot(angles, values, 'o-', linewidth=2, label=model.split('/')[-1])
    ax.fill(angles, values, alpha=0.25)

ax.set_xticks(angles[:-1])
ax.set_xticklabels(categories)
ax.set_ylim(0, 1)
plt.title('Overall Model Performance', size=16, y=1.1)
plt.legend(loc='upper right', bbox_to_anchor=(1.2, 1.0))

plt.tight_layout()
plt.savefig('model_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

## 7. Detailed Analysis and Recommendations

Identify the best model on each individual criterion (accuracy, training speed, and model size) as input to the final recommendation.

In [None]:
# Detailed analysis and recommendation
print("\n" + "="*80)
print("DETAILED ANALYSIS AND RECOMMENDATIONS")
print("="*80)

# Find best model for different criteria
best_f1_model = max(model_results.keys(), key=lambda x: model_results[x]['test_f1'])
fastest_model = min(model_results.keys(), key=lambda x: model_results[x]['training_time'])
smallest_model = min(model_results.keys(), key=lambda x: model_results[x]['model_size_millions'])

print(f"\n📊 PERFORMANCE ANALYSIS:")
print(f"• Best F1 Score: {best_f1_model} ({model_results[best_f1_model]['test_f1']:.4f})")
print(f"• Fastest Training: {fastest_model} ({model_results[fastest_model]['training_time']/60:.1f} minutes)")
print(f"• Smallest Model: {smallest_model} ({model_results[smallest_model]['model_size_millions']:.0f}M parameters)")
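
Parameter count can be turned into a rough memory footprint for the weights by multiplying by the bytes used per parameter. A minimal sketch, assuming dense weights stored at a single precision. It ignores activations, optimizer state, and tokenizer files, and the helper name and the 100M example are hypothetical.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weights_size_mb(params_millions, dtype="float32"):
    """Approximate size of the model weights in MiB for a given precision."""
    total_bytes = params_millions * 1e6 * BYTES_PER_PARAM[dtype]
    return total_bytes / 1024 ** 2


# Hypothetical model with 100 million parameters
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weights_size_mb(100, dtype):.1f} MiB")
```

This makes the "smallest model" comparison easier to relate to deployment limits such as available GPU or container memory.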

## 8. Composite Score Calculation

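Calculate a weighted composite score from three factors:
- **Accuracy (50%)**: Test-set F1 score, the primary measure of NER quality
- **Speed (30%)**: Training time, used here as a proxy for compute cost; inference time is recorded separately and is the more relevant figure for production latency
- **Efficiency (20%)**: Parameter count, a proxy for memory and storage requirements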

In [None]:
# Calculate composite score
print(f"\n🏆 COMPOSITE SCORE CALCULATION:")
composite_scores = {}
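# Scale speed and size relative to the best model so all three terms lie in (0, 1];
# otherwise the 50/30/20 weights would be distorted by the units of each metric
min_time = min(r['training_time'] for r in model_results.values())
min_size = min(r['model_size_millions'] for r in model_results.values())

for model in model_results:
    # Normalize metrics (higher is better)
    f1_norm = model_results[model]['test_f1']
    speed_norm = min_time / model_results[model]['training_time']  # 1.0 = fastest
    size_norm = min_size / model_results[model]['model_size_millions']  # 1.0 = smallest
    
    # Weighted composite score (adjust weights based on business priorities)
    composite_score = (0.5 * f1_norm) + (0.3 * speed_norm) + (0.2 * size_norm)
    composite_scores[model] = composite_score
    
    print(f"• {model}: {composite_score:.4f}")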

best_overall_model = max(composite_scores.keys(), key=lambda x: composite_scores[x])

print(f"\n🎯 FINAL RECOMMENDATION:")
print(f"Based on the composite analysis considering accuracy (50%), speed (30%), and efficiency (20%):")
print(f"RECOMMENDED MODEL: {best_overall_model}")
print(f"Composite Score: {composite_scores[best_overall_model]:.4f}")
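
Because the weights encode business priorities, it can help to check how sensitive the recommendation is to them. The sketch below is a minimal illustration, assuming speed and size are scaled relative to the best model so each term lies in (0, 1]. The `score_models` helper, the two models, and all numbers are hypothetical and are not results from this notebook.

```python
def score_models(results, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of F1, relative training speed, and relative compactness."""
    w_f1, w_speed, w_size = weights
    min_time = min(r["training_time"] for r in results.values())
    min_size = min(r["model_size_millions"] for r in results.values())
    scores = {}
    for name, r in results.items():
        speed = min_time / r["training_time"]       # 1.0 for the fastest model
        size = min_size / r["model_size_millions"]  # 1.0 for the smallest model
        scores[name] = w_f1 * r["test_f1"] + w_speed * speed + w_size * size
    return scores


# Made-up metrics for two models: one more accurate, one smaller and faster
hypothetical_results = {
    "model_a": {"test_f1": 0.90, "training_time": 330.0, "model_size_millions": 250.0},
    "model_b": {"test_f1": 0.60, "training_time": 300.0, "model_size_millions": 130.0},
}

weight_profiles = {
    "accuracy-first": (0.5, 0.3, 0.2),
    "speed-first": (0.2, 0.5, 0.3),
}

for profile, weights in weight_profiles.items():
    scores = score_models(hypothetical_results, weights)
    winner = max(scores, key=scores.get)
    print(f"{profile}: winner={winner}, scores={scores}")
```

If the winner changes under plausible weight profiles, as it does in this toy example, the final recommendation should be reported together with the weights that produced it.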

## 9. Business Case Analysis

Provide context-specific recommendations for EthioMart's e-commerce platform and Telegram channel analysis use case.

In [None]:
# Business case analysis
print(f"\n💼 BUSINESS CASE ANALYSIS:")
print(f"For EthioMart's e-commerce platform:")

if best_overall_model == "xlm-roberta-base":
    print("• XLM-RoBERTa provides the best balance of accuracy and multilingual support")
    print("• Suitable for production deployment with high accuracy requirements")
    print("• Recommended for comprehensive entity extraction across diverse Telegram channels")
elif best_overall_model == "distilbert-base-multilingual-cased":
    print("• DistilBERT offers good performance with faster inference")
    print("• Ideal for real-time processing of high-volume Telegram messages")
    print("• Cost-effective solution for resource-constrained environments")
else:
    print("• BERT provides solid baseline performance")
    print("• Good option for balanced accuracy and resource usage")

## 10. Save Analysis Results

Export comprehensive results for future reference and reporting. This includes:
- Complete model comparison metrics
- Composite scores and rankings
- Final recommendations
- Timestamp recording when the analysis was run

In [None]:
# Save detailed results
final_results = {
    'model_comparison': model_results,
    'composite_scores': composite_scores,
    'recommendations': {
        'best_accuracy': best_f1_model,
        'fastest': fastest_model,
        'most_efficient': smallest_model,
        'best_overall': best_overall_model
    },
    'analysis_date': datetime.now().isoformat()
}

with open('model_comparison_results.json', 'w', encoding='utf-8') as f:
    json.dump(final_results, f, ensure_ascii=False, indent=2)

print(f"\n✅ Analysis complete! Results saved to 'model_comparison_results.json'")
print(f"📈 Visualization saved as 'model_comparison.png'")
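
`json.dump` raises a `TypeError` for values it cannot encode. Whether that applies here depends on what `predict_entities` returns, which is not shown in this notebook, so the following is only a hedged sketch of a fallback encoder. It assumes that array-like values expose a `tolist()` method, as NumPy scalars and arrays do. The `to_jsonable` helper, the `FakeScalar` class, and the payload are hypothetical.

```python
import json
from datetime import datetime


def to_jsonable(value):
    """Fallback passed to json.dumps(default=...) for otherwise unsupported types."""
    if hasattr(value, "tolist"):  # NumPy scalars and arrays provide tolist()
        return value.tolist()
    if isinstance(value, set):
        return sorted(value)
    return str(value)             # last resort: store the string representation


class FakeScalar:
    """Stand-in for a NumPy scalar so the example needs no third-party import."""

    def __init__(self, number):
        self.number = number

    def tolist(self):
        return self.number


payload = {
    "score": FakeScalar(0.91),
    "labels": {"LOC", "PRICE"},
    "analysis_date": datetime(2024, 1, 1),  # fixed, made-up date
}

encoded = json.dumps(payload, default=to_jsonable, ensure_ascii=False, indent=2)
decoded = json.loads(encoded)
print(decoded)
```

Passing the same `default=` function to the `json.dump` call would let the export succeed even if a metric or prediction arrives as a non-native type.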

## Conclusion

This comparison gives a data-driven basis for selecting a transformer model for Amharic NER. It weighs accuracy, efficiency, and business requirements so that the selected model meets both the technical and the operational needs of EthioMart's e-commerce platform.

### Key Deliverables:
- **Trained Models**: All three models trained and saved for deployment
- **Performance Metrics**: Comprehensive evaluation across multiple dimensions
- **Visualizations**: Clear charts for stakeholder communication
- **Business Recommendations**: Context-specific guidance for model selection
- **Exportable Results**: JSON format for integration with other systems