# Machine Learning Algorithms: Comprehensive Comparison and Analysis

## A Complete Guide to Classification and Regression Algorithms

This notebook provides an in-depth comparison of popular machine learning algorithms across different types of problems. We'll analyze their performance, strengths, weaknesses, and use cases.

### Covered Algorithms:
**Classification:**
- Logistic Regression
- Decision Trees
- Random Forest
- Support Vector Machines
- Naive Bayes
- K-Nearest Neighbors
- Gradient Boosting
- Neural Networks

**Regression:**
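- Linear Regression
- Polynomial Regression
- Ridge/Lasso/Elastic Net Regression
- Decision Tree Regression
- Random Forest Regression
- Support Vector Regression (SVR)
- K-Nearest Neighbors Regression
- Gradient Boosting Regression
- Neural Network Regression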

In [None]:
# Import all necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
from sklearn.model_selection import train_test_split, cross_val_score, learning_curve, validation_curve
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import time
import warnings
warnings.filterwarnings('ignore')

# Classification and regression algorithms
from sklearn.linear_model import LogisticRegression, Ridge, Lasso, ElasticNet, LinearRegression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.svm import SVC, SVR
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.neural_network import MLPClassifier, MLPRegressor
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

# Set style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("All libraries imported successfully!")

## 1. Dataset Preparation

We'll use multiple datasets to comprehensively evaluate algorithm performance.

In [None]:
# Load and prepare datasets
def load_datasets():
    """
    Load various datasets for comprehensive algorithm testing.
    """
    datasets_dict = {}
    
    # 1. Iris Dataset (Multi-class classification)
    iris = datasets.load_iris()
    datasets_dict['iris'] = {
        'X': iris.data,
        'y': iris.target,
        'type': 'classification',
        'description': 'Iris flower species classification (3 classes)',
        'features': iris.feature_names,
        'targets': iris.target_names
    }
    
    # 2. Wine Dataset (Multi-class classification)
    wine = datasets.load_wine()
    datasets_dict['wine'] = {
        'X': wine.data,
        'y': wine.target,
        'type': 'classification',
        'description': 'Wine classification (3 classes)',
        'features': wine.feature_names,
        'targets': wine.target_names
    }
    
    # 3. Breast Cancer Dataset (Binary classification)
    cancer = datasets.load_breast_cancer()
    datasets_dict['breast_cancer'] = {
        'X': cancer.data,
        'y': cancer.target,
        'type': 'classification',
        'description': 'Breast cancer diagnosis (2 classes)',
        'features': cancer.feature_names,
        'targets': cancer.target_names
    }
    
    # 4. Synthetic Housing Dataset (Regression)
    # Synthetic data with the same shape as the Boston housing dataset (506 samples, 13 features)
    np.random.seed(42)
    n_samples = 506
    n_features = 13
    
    X_housing = np.random.randn(n_samples, n_features)
    # Create realistic housing prices
    y_housing = (
        X_housing[:, 0] * 5 +  # Room number effect
        X_housing[:, 1] * -3 +  # Crime rate effect (negative)
        X_housing[:, 2] * 2 +   # Accessibility effect
        np.random.normal(0, 2, n_samples) + 25  # Base price + noise
    )
    y_housing = np.maximum(y_housing, 5)  # Minimum price
    
    datasets_dict['housing'] = {
        'X': X_housing,
        'y': y_housing,
        'type': 'regression',
        'description': 'Synthetic housing price prediction',
        'features': [f'feature_{i}' for i in range(n_features)],
        'targets': None
    }
    
    # 5. Diabetes Dataset (Regression)
    diabetes = datasets.load_diabetes()
    datasets_dict['diabetes'] = {
        'X': diabetes.data,
        'y': diabetes.target,
        'type': 'regression',
        'description': 'Diabetes progression prediction',
        'features': diabetes.feature_names,
        'targets': None
    }
    
    return datasets_dict

# Load all datasets
datasets_dict = load_datasets()

print("Available datasets:")
print("=" * 50)
for name, data in datasets_dict.items():
    print(f"{name}: {data['description']}")
    print(f"  - Samples: {data['X'].shape[0]}, Features: {data['X'].shape[1]}")
    print(f"  - Type: {data['type']}")
    print()
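
Because the housing target in `load_datasets()` is built as a linear combination of the first three features plus Gaussian noise, an ordinary linear model should recover coefficients close to 5, -3, and 2, with near-zero weights on the other ten features. The sketch below checks this on an independent draw. The generator seed and variable names are assumptions for illustration, so the numbers will not match the notebook's `np.random.seed(42)` data exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Assumption: a fresh generator, so the draws differ from the notebook's seeded data
rng = np.random.default_rng(0)
n_samples, n_features = 506, 13

# Same construction as in load_datasets(): three informative features, ten noise features
X = rng.standard_normal((n_samples, n_features))
y = 5 * X[:, 0] - 3 * X[:, 1] + 2 * X[:, 2] + rng.normal(0, 2, n_samples) + 25
y = np.maximum(y, 5)  # same price floor as in load_datasets()

model = LinearRegression().fit(X, y)

print("Informative coefficients:", np.round(model.coef_[:3], 2))
print("Largest |coefficient| among noise features:", np.round(np.abs(model.coef_[3:]).max(), 2))
print("Intercept:", np.round(model.intercept_, 2))
```

A consequence worth keeping in mind when reading the regression results: on this dataset the linear models are fitting the true data-generating process, so they have a built-in advantage over tree-based and kernel methods.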

## 2. Algorithm Definitions and Configuration

In [None]:
# Define algorithms for classification
classification_algorithms = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM (RBF)': SVC(kernel='rbf', random_state=42),
    'SVM (Linear)': SVC(kernel='linear', random_state=42),
    'Naive Bayes': GaussianNB(),
    'K-NN (k=5)': KNeighborsClassifier(n_neighbors=5),
    'K-NN (k=3)': KNeighborsClassifier(n_neighbors=3),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'Neural Network': MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=42)
}

# Define algorithms for regression
regression_algorithms = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(alpha=1.0),
    'Lasso Regression': Lasso(alpha=1.0),
    'Elastic Net': ElasticNet(alpha=1.0, l1_ratio=0.5),
    'Polynomial (degree=2)': Pipeline([
        ('poly', PolynomialFeatures(degree=2, include_bias=False)),
        ('linear', LinearRegression())
    ]),
    'Decision Tree': DecisionTreeRegressor(random_state=42),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'SVR (RBF)': SVR(kernel='rbf'),
    'SVR (Linear)': SVR(kernel='linear'),
    'K-NN Regression': KNeighborsRegressor(n_neighbors=5),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, random_state=42),
    'Neural Network': MLPRegressor(hidden_layer_sizes=(100,), max_iter=1000, random_state=42)
}

print(f"Classification algorithms: {len(classification_algorithms)}")
print(f"Regression algorithms: {len(regression_algorithms)}")
print("\nClassification algorithms:")
for i, name in enumerate(classification_algorithms.keys(), 1):
    print(f"{i:2d}. {name}")

print("\nRegression algorithms:")
for i, name in enumerate(regression_algorithms.keys(), 1):
    print(f"{i:2d}. {name}")

## 3. Comprehensive Algorithm Evaluation

In [None]:
def evaluate_classification_algorithms(X, y, algorithms, dataset_name):
    """
    Evaluate classification algorithms on a dataset.
    """
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Scale features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    results = []
    
    for name, algorithm in algorithms.items():
        # Use scaled data for algorithms that are sensitive to feature scale
        if name in ['Logistic Regression', 'SVM (RBF)', 'SVM (Linear)', 'K-NN (k=5)', 'K-NN (k=3)', 'Neural Network']:
            X_fit, X_eval = X_train_scaled, X_test_scaled
        else:
            X_fit, X_eval = X_train, X_test
        
        # Time only the fit so 'Training Time (s)' excludes prediction and cross-validation
        start_time = time.time()
        algorithm.fit(X_fit, y_train)
        end_time = time.time()
        
        y_pred = algorithm.predict(X_eval)
        cv_scores = cross_val_score(algorithm, X_fit, y_train, cv=5)
        
        # Calculate metrics
        accuracy = accuracy_score(y_test, y_pred)
        precision = precision_score(y_test, y_pred, average='weighted')
        recall = recall_score(y_test, y_pred, average='weighted')
        f1 = f1_score(y_test, y_pred, average='weighted')
        
        results.append({
            'Dataset': dataset_name,
            'Algorithm': name,
            'Accuracy': accuracy,
            'Precision': precision,
            'Recall': recall,
            'F1-Score': f1,
            'CV Mean': cv_scores.mean(),
            'CV Std': cv_scores.std(),
            'Training Time (s)': end_time - start_time
        })
    
    return pd.DataFrame(results)

def evaluate_regression_algorithms(X, y, algorithms, dataset_name):
    """
    Evaluate regression algorithms on a dataset.
    """
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Scale features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    results = []
    
    for name, algorithm in algorithms.items():
        # Use scaled data for algorithms that are sensitive to feature scale
        if name in ['Linear Regression', 'Ridge Regression', 'Lasso Regression', 'Elastic Net', 
                   'SVR (RBF)', 'SVR (Linear)', 'K-NN Regression', 'Neural Network']:
            X_fit, X_eval = X_train_scaled, X_test_scaled
        else:
            X_fit, X_eval = X_train, X_test
        
        # Time only the fit so 'Training Time (s)' excludes prediction and cross-validation
        start_time = time.time()
        algorithm.fit(X_fit, y_train)
        end_time = time.time()
        
        y_pred = algorithm.predict(X_eval)
        cv_scores = cross_val_score(algorithm, X_fit, y_train, cv=5, scoring='r2')
        
        # Calculate metrics
        mse = mean_squared_error(y_test, y_pred)
        rmse = np.sqrt(mse)
        mae = mean_absolute_error(y_test, y_pred)
        r2 = r2_score(y_test, y_pred)
        
        results.append({
            'Dataset': dataset_name,
            'Algorithm': name,
            'R² Score': r2,
            'RMSE': rmse,
            'MAE': mae,
            'CV R² Mean': cv_scores.mean(),
            'CV R² Std': cv_scores.std(),
            'Training Time (s)': end_time - start_time
        })
    
    return pd.DataFrame(results)

# Run evaluations
print("Running comprehensive algorithm evaluation...")
print("=" * 50)

all_classification_results = []
all_regression_results = []

for dataset_name, dataset_info in datasets_dict.items():
    print(f"Evaluating on {dataset_name} dataset...")
    
    if dataset_info['type'] == 'classification':
        results = evaluate_classification_algorithms(
            dataset_info['X'], dataset_info['y'], 
            classification_algorithms, dataset_name
        )
        all_classification_results.append(results)
    
    elif dataset_info['type'] == 'regression':
        results = evaluate_regression_algorithms(
            dataset_info['X'], dataset_info['y'], 
            regression_algorithms, dataset_name
        )
        all_regression_results.append(results)

# Combine all results
if all_classification_results:
    classification_df = pd.concat(all_classification_results, ignore_index=True)
if all_regression_results:
    regression_df = pd.concat(all_regression_results, ignore_index=True)

print("\nEvaluation completed!")
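
The evaluation functions above scale the training split once and then pass the already scaled array to `cross_val_score`, so each validation fold has contributed to the scaler's mean and variance. On datasets of this size the effect is usually small, but one way to avoid it is to put the scaler and the model in a single pipeline so that scaling is refit inside every fold. The sketch below illustrates the idea on the wine data with an RBF SVM; the variable names are illustrative and this is not the notebook's evaluation code.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The scaler is refit on the training part of every fold,
# so the held-out fold never influences the scaling statistics.
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", random_state=42))
cv_scores = cross_val_score(scaled_svm, X_train, y_train, cv=5)
print(f"CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")

# Final fit on the full training split, scored once on the untouched test split
scaled_svm.fit(X_train, y_train)
test_accuracy = scaled_svm.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Passing unscaled data to a pipeline also removes the need for the name-based list that decides which algorithms receive scaled inputs.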

## 4. Classification Results Analysis

In [None]:
# Display classification results
print("CLASSIFICATION RESULTS")
print("=" * 80)

# Overall performance summary
classification_summary = classification_df.groupby('Algorithm').agg({
    'Accuracy': ['mean', 'std'],
    'F1-Score': ['mean', 'std'],
    'Training Time (s)': ['mean', 'std']
}).round(4)

classification_summary.columns = ['Acc_Mean', 'Acc_Std', 'F1_Mean', 'F1_Std', 'Time_Mean', 'Time_Std']
classification_summary = classification_summary.sort_values('Acc_Mean', ascending=False)

print("\nOverall Classification Performance (averaged across datasets):")
print(classification_summary)

# Detailed results by dataset
print("\n\nDetailed Results by Dataset:")
print("=" * 50)
for dataset in classification_df['Dataset'].unique():
    dataset_results = classification_df[classification_df['Dataset'] == dataset]
    dataset_results = dataset_results.sort_values('Accuracy', ascending=False)
    
    print(f"\n{dataset.upper()} Dataset:")
    print(dataset_results[['Algorithm', 'Accuracy', 'F1-Score', 'CV Mean', 'Training Time (s)']].round(4))

In [None]:
# Visualize classification results
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. Accuracy comparison
accuracy_pivot = classification_df.pivot(index='Algorithm', columns='Dataset', values='Accuracy')
sns.heatmap(accuracy_pivot, annot=True, cmap='RdYlGn', fmt='.3f', ax=axes[0, 0])
axes[0, 0].set_title('Accuracy Scores by Algorithm and Dataset')
axes[0, 0].tick_params(axis='x', rotation=45)

# 2. F1-Score comparison
f1_pivot = classification_df.pivot(index='Algorithm', columns='Dataset', values='F1-Score')
sns.heatmap(f1_pivot, annot=True, cmap='RdYlGn', fmt='.3f', ax=axes[0, 1])
axes[0, 1].set_title('F1-Scores by Algorithm and Dataset')
axes[0, 1].tick_params(axis='x', rotation=45)

# 3. Training time comparison
time_data = classification_df.groupby('Algorithm')['Training Time (s)'].mean().sort_values()
sns.barplot(x=time_data.values, y=time_data.index, ax=axes[1, 0], palette='viridis')
axes[1, 0].set_title('Average Training Time by Algorithm')
axes[1, 0].set_xlabel('Training Time (seconds)')

# 4. Overall performance ranking
overall_score = classification_df.groupby('Algorithm')['Accuracy'].mean().sort_values(ascending=False)
sns.barplot(x=overall_score.values, y=overall_score.index, ax=axes[1, 1], palette='coolwarm')
axes[1, 1].set_title('Overall Classification Performance Ranking')
axes[1, 1].set_xlabel('Average Accuracy')

plt.tight_layout()
plt.show()
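
The precision, recall, and F1 values above use `average='weighted'`, which averages the per-class scores weighted by each class's number of true samples. The sketch below reproduces the weighted F1 by hand on a small made-up label vector (the labels are invented for illustration) and compares it with the macro average, which treats all classes equally.

```python
import numpy as np
from sklearn.metrics import f1_score

# Invented labels for illustration: three classes with unequal support
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 2, 2, 0, 2])

per_class_f1 = f1_score(y_true, y_pred, average=None)  # one F1 per class
support = np.bincount(y_true)                          # true samples per class

# Weighted average: each class contributes in proportion to its support
manual_weighted = np.sum(per_class_f1 * support / support.sum())
sklearn_weighted = f1_score(y_true, y_pred, average="weighted")
macro = f1_score(y_true, y_pred, average="macro")

print("Per-class F1:", np.round(per_class_f1, 3))
print("Weighted F1 (manual): ", round(manual_weighted, 4))
print("Weighted F1 (sklearn):", round(sklearn_weighted, 4))
print("Macro F1:             ", round(macro, 4))
```

On imbalanced data the weighted average can hide poor performance on a rare class, so the macro average is a useful second view.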

## 5. Regression Results Analysis

In [None]:
# Display regression results
print("REGRESSION RESULTS")
print("=" * 80)

# Overall performance summary
regression_summary = regression_df.groupby('Algorithm').agg({
    'R² Score': ['mean', 'std'],
    'RMSE': ['mean', 'std'],
    'Training Time (s)': ['mean', 'std']
}).round(4)

regression_summary.columns = ['R2_Mean', 'R2_Std', 'RMSE_Mean', 'RMSE_Std', 'Time_Mean', 'Time_Std']
regression_summary = regression_summary.sort_values('R2_Mean', ascending=False)

print("\nOverall Regression Performance (averaged across datasets):")
print(regression_summary)

# Detailed results by dataset
print("\n\nDetailed Results by Dataset:")
print("=" * 50)
for dataset in regression_df['Dataset'].unique():
    dataset_results = regression_df[regression_df['Dataset'] == dataset]
    dataset_results = dataset_results.sort_values('R² Score', ascending=False)
    
    print(f"\n{dataset.upper()} Dataset:")
    print(dataset_results[['Algorithm', 'R² Score', 'RMSE', 'CV R² Mean', 'Training Time (s)']].round(4))

In [None]:
# Visualize regression results
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. R² Score comparison
r2_pivot = regression_df.pivot(index='Algorithm', columns='Dataset', values='R² Score')
sns.heatmap(r2_pivot, annot=True, cmap='RdYlGn', fmt='.3f', ax=axes[0, 0])
axes[0, 0].set_title('R² Scores by Algorithm and Dataset')
axes[0, 0].tick_params(axis='x', rotation=45)

# 2. RMSE comparison (lower is better)
rmse_pivot = regression_df.pivot(index='Algorithm', columns='Dataset', values='RMSE')
sns.heatmap(rmse_pivot, annot=True, cmap='RdYlGn_r', fmt='.2f', ax=axes[0, 1])
axes[0, 1].set_title('RMSE by Algorithm and Dataset (Lower is Better)')
axes[0, 1].tick_params(axis='x', rotation=45)

# 3. Training time comparison
time_data = regression_df.groupby('Algorithm')['Training Time (s)'].mean().sort_values()
sns.barplot(x=time_data.values, y=time_data.index, ax=axes[1, 0], palette='viridis')
axes[1, 0].set_title('Average Training Time by Algorithm')
axes[1, 0].set_xlabel('Training Time (seconds)')

# 4. Overall performance ranking
overall_score = regression_df.groupby('Algorithm')['R² Score'].mean().sort_values(ascending=False)
sns.barplot(x=overall_score.values, y=overall_score.index, ax=axes[1, 1], palette='coolwarm')
axes[1, 1].set_title('Overall Regression Performance Ranking')
axes[1, 1].set_xlabel('Average R² Score')

plt.tight_layout()
plt.show()
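
For reference, the three regression metrics reported above can be computed directly from the residuals. The sketch below does this with NumPy on a few invented values and checks the results against the scikit-learn functions used in the evaluation.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Invented values for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

residuals = y_true - y_pred

# RMSE: square root of the mean squared residual, in the units of the target
rmse = np.sqrt(np.mean(residuals ** 2))
# MAE: mean absolute residual, less sensitive to large errors than RMSE
mae = np.mean(np.abs(residuals))
# R²: 1 minus the ratio of residual variation to total variation around the mean
r2 = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"RMSE={rmse:.4f}, MAE={mae:.4f}, R²={r2:.4f}")
```

Because RMSE and MAE are expressed in target units, averaging them across datasets with different target scales (as the summary table does) is dominated by the dataset with the larger scale; R² is the more comparable quantity across datasets.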

## 6. Algorithm Strengths and Weaknesses Analysis

In [None]:
# Create comprehensive algorithm analysis
def create_algorithm_profile():
    """
    Create detailed profiles for each algorithm.
    """
    algorithm_profiles = {
        'Logistic Regression': {
            'Type': 'Linear',
            'Complexity': 'Low',
            'Interpretability': 'High',
            'Scalability': 'High',
            'Overfitting Risk': 'Low',
            'Best For': 'Linearly separable data, baseline model',
            'Strengths': 'Fast, interpretable, probabilistic output',
            'Weaknesses': 'Assumes linear relationship, sensitive to outliers'
        },
        'Decision Tree': {
            'Type': 'Tree-based',
            'Complexity': 'Medium',
            'Interpretability': 'High',
            'Scalability': 'Medium',
            'Overfitting Risk': 'High',
            'Best For': 'Non-linear data, feature selection',
            'Strengths': 'Interpretable, handles non-linear relationships',
            'Weaknesses': 'Prone to overfitting, unstable'
        },
        'Random Forest': {
            'Type': 'Ensemble',
            'Complexity': 'Medium-High',
            'Interpretability': 'Medium',
            'Scalability': 'Medium',
            'Overfitting Risk': 'Low',
            'Best For': 'General purpose, feature importance',
            'Strengths': 'Robust, works well with default settings, provides feature importance',
            'Weaknesses': 'Less interpretable, can overfit with noise'
        },
        'SVM (RBF)': {
            'Type': 'Kernel-based',
            'Complexity': 'High',
            'Interpretability': 'Low',
            'Scalability': 'Low',
            'Overfitting Risk': 'Medium',
            'Best For': 'High-dimensional data, non-linear patterns',
            'Strengths': 'Effective in high dimensions, memory efficient',
            'Weaknesses': 'Slow on large datasets, requires scaling'
        },
        'Naive Bayes': {
            'Type': 'Probabilistic',
            'Complexity': 'Low',
            'Interpretability': 'Medium',
            'Scalability': 'High',
            'Overfitting Risk': 'Low',
            'Best For': 'Text classification, small datasets',
            'Strengths': 'Fast, works with small data, few hyperparameters',
            'Weaknesses': 'Strong independence assumption; GaussianNB assumes normally distributed features'
        },
        'K-NN (k=5)': {
            'Type': 'Instance-based',
            'Complexity': 'Low',
            'Interpretability': 'High',
            'Scalability': 'Low',
            'Overfitting Risk': 'Medium',
            'Best For': 'Local patterns, recommendation systems',
            'Strengths': 'Simple, no assumptions about data distribution',
            'Weaknesses': 'Slow prediction, sensitive to irrelevant features'
        },
        'Gradient Boosting': {
            'Type': 'Ensemble',
            'Complexity': 'High',
            'Interpretability': 'Low',
            'Scalability': 'Medium',
            'Overfitting Risk': 'Medium',
            'Best For': 'Complex patterns, competitions',
            'Strengths': 'High performance, handles mixed data types',
            'Weaknesses': 'Prone to overfitting, requires tuning'
        },
        'Neural Network': {
            'Type': 'Neural',
            'Complexity': 'High',
            'Interpretability': 'Low',
            'Scalability': 'High',
            'Overfitting Risk': 'High',
            'Best For': 'Complex non-linear patterns, large datasets',
            'Strengths': 'Universal approximator, handles complex patterns',
            'Weaknesses': 'Black box, requires large data, many hyperparameters'
        }
    }
    
    return pd.DataFrame(algorithm_profiles).T

# Create and display algorithm profiles
algorithm_profile_df = create_algorithm_profile()
print("ALGORITHM CHARACTERISTICS PROFILE")
print("=" * 80)
print(algorithm_profile_df)

# Create a visual comparison matrix
numeric_characteristics = {
    'Complexity': {'Low': 1, 'Medium': 2, 'Medium-High': 2.5, 'High': 3},
    'Interpretability': {'Low': 1, 'Medium': 2, 'High': 3},
    'Scalability': {'Low': 1, 'Medium': 2, 'High': 3},
    'Overfitting Risk': {'Low': 1, 'Medium': 2, 'High': 3}
}

# Convert to numeric for heatmap
numeric_df = algorithm_profile_df[['Complexity', 'Interpretability', 'Scalability', 'Overfitting Risk']].copy()
for col, mapping in numeric_characteristics.items():
    numeric_df[col] = numeric_df[col].map(mapping)

plt.figure(figsize=(12, 8))
sns.heatmap(numeric_df, annot=True, cmap='RdYlGn', cbar_kws={'label': 'Score (1=Low, 3=High)'})
plt.title('Algorithm Characteristics Comparison', fontsize=16, pad=20)
plt.ylabel('Algorithms')
plt.xlabel('Characteristics')
plt.tight_layout()
plt.show()

## 7. Performance vs Complexity Analysis

In [None]:
# Create performance vs complexity analysis
def create_performance_complexity_analysis():
    """
    Analyze the trade-off between performance and complexity.
    """
    # Get average performance scores
    classification_perf = classification_df.groupby('Algorithm')['Accuracy'].mean()
    regression_perf = regression_df.groupby('Algorithm')['R² Score'].mean()
    
    # Get training times
    classification_time = classification_df.groupby('Algorithm')['Training Time (s)'].mean()
    regression_time = regression_df.groupby('Algorithm')['Training Time (s)'].mean()
    
    # Complexity mapping
    complexity_map = {'Low': 1, 'Medium': 2, 'Medium-High': 2.5, 'High': 3}
    algorithm_complexity = algorithm_profile_df['Complexity'].map(complexity_map)
    
    fig, axes = plt.subplots(1, 2, figsize=(16, 6))
    
    # Classification: Performance vs Complexity
    for alg in classification_perf.index:
        if alg in algorithm_complexity.index:
            axes[0].scatter(algorithm_complexity[alg], classification_perf[alg], 
                          s=100, alpha=0.7, label=alg)
    
    axes[0].set_xlabel('Algorithm Complexity')
    axes[0].set_ylabel('Average Accuracy')
    axes[0].set_title('Classification: Performance vs Complexity')
    axes[0].grid(True, alpha=0.3)
    axes[0].legend(bbox_to_anchor=(1.05, 1), loc='upper left')
    
    # Regression: Performance vs Complexity
    for alg in regression_perf.index:
        # Map regression algorithms to base algorithm names
        base_alg = alg.split(' (')[0]  # Remove kernel specification
        if base_alg in ('Linear Regression', 'Ridge Regression', 'Lasso Regression', 'Elastic Net'):
            base_alg = 'Logistic Regression'  # Linear models: use similar complexity
        elif base_alg == 'Polynomial':
            base_alg = 'Neural Network'  # Higher complexity
        elif base_alg == 'SVR':
            base_alg = 'SVM (RBF)'
        elif base_alg == 'K-NN Regression':
            base_alg = 'K-NN (k=5)'
        
        if base_alg in algorithm_complexity.index:
            axes[1].scatter(algorithm_complexity[base_alg], regression_perf[alg], 
                          s=100, alpha=0.7, label=alg)
    
    axes[1].set_xlabel('Algorithm Complexity')
    axes[1].set_ylabel('Average R² Score')
    axes[1].set_title('Regression: Performance vs Complexity')
    axes[1].grid(True, alpha=0.3)
    axes[1].legend(bbox_to_anchor=(1.05, 1), loc='upper left')
    
    plt.tight_layout()
    plt.show()
    
    return classification_perf, regression_perf

class_perf, reg_perf = create_performance_complexity_analysis()

## 8. Algorithm Selection Guidelines

In [None]:
# Create algorithm selection guide
def create_selection_guide():
    """
    Create a comprehensive algorithm selection guide.
    """
    selection_guide = {
        'Scenario': [
            'Small dataset (< 1K samples)',
            'Large dataset (> 100K samples)',
            'High interpretability required',
            'Maximum performance priority',
            'Fast training required',
            'Fast prediction required',
            'Mixed data types',
            'High-dimensional data',
            'Non-linear relationships',
            'High risk of overfitting',
            'Text classification',
            'Image classification',
            'Time series prediction',
            'Anomaly detection'
        ],
        'Classification Recommendation': [
            'Naive Bayes, K-NN',
            'Logistic Regression, Neural Network',
            'Decision Tree, Logistic Regression',
            'Random Forest, Gradient Boosting',
            'Logistic Regression, Naive Bayes',
            'Naive Bayes, Logistic Regression',
            'Random Forest, Decision Tree',
            'SVM, Neural Network',
            'SVM (RBF), Random Forest',
            'Random Forest, SVM',
            'Naive Bayes, SVM (Linear)',
            'Neural Network, SVM (RBF)',
            'Random Forest, Gradient Boosting',
            'SVM (RBF), Random Forest'
        ],
        'Regression Recommendation': [
            'Linear Regression, K-NN',
            'Linear Regression, Neural Network',
            'Linear Regression, Decision Tree',
            'Random Forest, Gradient Boosting',
            'Linear Regression, Ridge',
            'Linear Regression, K-NN',
            'Random Forest, Decision Tree',
            'SVR, Neural Network',
            'SVR (RBF), Random Forest',
            'Ridge, Lasso',
            'N/A',
            'Neural Network, SVR (RBF)',
            'Random Forest, Gradient Boosting',
            'SVR (RBF), Random Forest'
        ]
    }
    
    return pd.DataFrame(selection_guide)

# Display selection guide
selection_df = create_selection_guide()
print("ALGORITHM SELECTION GUIDE")
print("=" * 80)
print(selection_df.to_string(index=False))

# Summary recommendations
print("\n\nQUICK SELECTION SUMMARY")
print("=" * 50)
print(" BEST OVERALL CLASSIFICATION:")
top_3_class = class_perf.nlargest(3)
for i, (alg, score) in enumerate(top_3_class.items(), 1):
    print(f"   {i}. {alg}: {score:.3f}")

print("\n BEST OVERALL REGRESSION:")
top_3_reg = reg_perf.nlargest(3)
for i, (alg, score) in enumerate(top_3_reg.items(), 1):
    print(f"   {i}. {alg}: {score:.3f}")

print("\n FASTEST ALGORITHMS:")
fastest_class = classification_df.groupby('Algorithm')['Training Time (s)'].mean().nsmallest(3)
for i, (alg, elapsed) in enumerate(fastest_class.items(), 1):  # 'elapsed' avoids shadowing the time module
    print(f"   {i}. {alg}: {elapsed:.4f}s")

print("\n MOST INTERPRETABLE:")
interpretable = algorithm_profile_df[algorithm_profile_df['Interpretability'] == 'High'].index.tolist()
for i, alg in enumerate(interpretable, 1):
    print(f"   {i}. {alg}")

print("\n GENERAL PURPOSE RECOMMENDATIONS:")
print("   • Start with: Random Forest (good balance of performance and robustness)")
print("   • For interpretability: Logistic Regression or Decision Tree")
print("   • For maximum performance: Gradient Boosting or Neural Network")
print("   • For fast prototyping: Naive Bayes or K-NN")
print("   • For high-dimensional data: SVM or Neural Network")

## 9. Key Insights and Conclusions

### Performance Analysis Summary

Based on the evaluation across five small datasets (three classification, two regression), several patterns stand out:

#### Classification Champions
1. **Random Forest** is a consistently strong performer across datasets
2. **Gradient Boosting** achieves the highest peak performance but requires careful tuning
3. **SVM** excels on high-dimensional and complex datasets

#### Regression Leaders
1. **Random Forest** again shows robust performance across different problem types
2. **Gradient Boosting** provides the best performance for complex relationships
3. **Linear methods** (Ridge/Lasso) work well for linear relationships and high-dimensional data

#### Trade-offs Observed

**Performance vs Interpretability:**
- High interpretability: Logistic Regression, Decision Trees
- High performance: Random Forest, Gradient Boosting, Neural Networks
- Best balance: Random Forest (medium interpretability, high performance)

**Performance vs Speed:**
- Fastest: Naive Bayes, Logistic Regression
- Good balance: Random Forest, K-NN
- Slowest but powerful: SVM, Neural Networks

#### Algorithm-Specific Insights

- **Random Forest:** Most versatile, handles mixed data types, provides feature importance
- **Gradient Boosting:** Best for competitions and maximum accuracy
- **SVM:** Excellent for high-dimensional data and complex boundaries
- **Neural Networks:** Best for very large datasets and complex patterns
- **Logistic Regression:** Excellent baseline, fast, interpretable
- **Decision Trees:** Most interpretable but prone to overfitting
- **Naive Bayes:** Surprisingly effective for text and small datasets
- **K-NN:** Simple but effective for local pattern recognition

### Practical Recommendations

1. **Start with Random Forest** as a strong default, alongside a simple baseline such as Logistic Regression - it's robust and performs well across various scenarios
2. **Use cross-validation** to get reliable performance estimates
3. **Consider the problem context** - interpretability might be more important than marginal performance gains
4. **Experiment with ensemble methods** for production systems
5. **Remember the no-free-lunch theorem** - no single algorithm works best for all problems
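
As one way to act on the ensemble recommendation, the sketch below combines three of the classifiers from this notebook with scikit-learn's `VotingClassifier` and scores the combination with 5-fold cross-validation on the breast cancer data. The choice of members, soft voting, and the short estimator names are assumptions made for illustration, not a tuned configuration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Soft voting averages the predicted class probabilities of the members.
# Logistic regression is wrapped with a scaler; the other two do not need one.
ensemble = VotingClassifier(
    estimators=[
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("nb", GaussianNB()),
    ],
    voting="soft",
)

scores = cross_val_score(ensemble, X, y, cv=5)
print(f"Voting ensemble CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Whether the ensemble beats its best single member depends on the dataset, so it is worth comparing it against each member under the same cross-validation splits.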

### Future Considerations

- Modern ensemble methods (XGBoost, LightGBM, CatBoost)
- Deep learning for unstructured data
- AutoML for automated algorithm selection
- Specialized algorithms for specific domains (time series, NLP, computer vision)

This analysis provides a solid foundation for algorithm selection in practical machine learning projects.