# K-Nearest Neighbors (KNN) & Naive Bayes Assignment
## Comprehensive Implementation and Analysis

**Objective:** Understand and implement K-Nearest Neighbors and Naive Bayes algorithms, evaluate their performance on different datasets, and compare their strengths and limitations.

**Datasets:**
- **Iris Dataset**: Multi-class classification with KNN
- **SMS Spam Collection**: Text classification with Naive Bayes

---

## Assignment Structure:
1. **Theory Section** - Fundamental concepts and principles
2. **Part A: KNN on Iris Dataset** - Implementation and evaluation
3. **Part B: Naive Bayes on SMS Spam Dataset** - Text classification
4. **Model Comparison** - Performance analysis and insights
5. **Bonus Tasks** - Advanced implementations

In [None]:
# Import Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Machine Learning Libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB, BernoulliNB, GaussianNB, ComplementNB
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics import (confusion_matrix, accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score,
                             classification_report)

# Text Processing and Timing
import re
import string
import time  # Used to time training and prediction in Part B
from collections import Counter

# Visualization
from matplotlib.colors import ListedColormap
from sklearn.decomposition import PCA

# Set display options
pd.set_option('display.max_columns', None)
plt.style.use('default')
sns.set_palette("husl")

print("All libraries imported successfully!")
print("Ready to begin KNN & Naive Bayes assignment!")

# Theory Section

## Q1. K-Nearest Neighbors (KNN) Working Principle

**K-Nearest Neighbors (KNN)** is a lazy learning algorithm that makes predictions based on the k closest training examples in the feature space.

### Working Principle:
1. **Store all training data** (no explicit training phase)
2. **For a new data point:**
   - Calculate distance to all training points
   - Find the k nearest neighbors
   - For classification: Vote based on majority class
   - For regression: Average the target values
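
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production implementation: the function name `knn_predict` and the toy data are made up for this example, and ties in the vote are broken arbitrarily.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify one point by majority vote among its k nearest training points."""
    # Step 1: Euclidean distance from x_new to every stored training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Step 2: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Step 3: majority vote over the neighbors' labels
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated clusters in 2-D
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_toy, y_toy, np.array([1.1, 1.0]), k=3)
pred_b = knn_predict(X_toy, y_toy, np.array([5.1, 5.0]), k=3)
print(pred_a, pred_b)  # 0 1
```

For regression, the vote in step 3 would be replaced by the mean of the neighbors' target values.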

### Best Suited Problems:
- **Pattern Recognition**: Image classification, handwriting recognition
- **Recommendation Systems**: Finding similar users/items
- **Anomaly Detection**: Identifying outliers in data
- **Small to Medium Datasets**: Where computational cost is manageable
- **Non-linear Relationships**: Complex decision boundaries

---

## Q2. Distance Metrics in KNN

### Common Distance Metrics:

**1. Euclidean Distance (L2 Norm)**
$$d(x,y) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}$$
- Most common for continuous features
- Sensitive to feature scaling

**2. Manhattan Distance (L1 Norm)**
$$d(x,y) = \sum_{i=1}^{n}|x_i - y_i|$$
- Less sensitive to outliers than Euclidean distance, because differences are not squared
- Often a reasonable choice for high-dimensional data

**3. Minkowski Distance**
$$d(x,y) = \left(\sum_{i=1}^{n}|x_i - y_i|^p\right)^{1/p}$$
- Generalization of Euclidean (p=2) and Manhattan (p=1)

**4. Hamming Distance**
$$d(x,y) = \sum_{i=1}^{n} \mathbb{1}[x_i \neq y_i]$$
- Number of positions at which the two vectors differ
- Used for categorical/binary features

**5. Cosine Distance**
$$d(x,y) = 1 - \frac{x \cdot y}{||x|| \times ||y||}$$
- Good for text data and high-dimensional sparse features
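
As a quick check of the formulas above, the following sketch evaluates each metric on small hand-picked vectors. The vectors and the helper name `minkowski` are illustrative choices, not part of the assignment.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

# Euclidean: sqrt(3^2 + 4^2 + 0^2) = 5
euclidean = np.sqrt(np.sum((x - y) ** 2))

# Manhattan: 3 + 4 + 0 = 7
manhattan = np.sum(np.abs(x - y))

def minkowski(a, b, p):
    """Minkowski distance of order p (p=1 is Manhattan, p=2 is Euclidean)."""
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

# Cosine distance: 1 minus the cosine of the angle between x and y
cosine = 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Hamming: count of positions where two binary vectors differ
a = np.array([1, 0, 1, 1])
b = np.array([1, 1, 0, 1])
hamming = int(np.sum(a != b))  # positions 1 and 2 differ

print(euclidean, manhattan, minkowski(x, y, 2), hamming)
```

In scikit-learn, the same choice is made through the `metric` (and, for Minkowski, `p`) arguments of `KNeighborsClassifier`.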

---

## Q3. KNN Advantages and Limitations

### ✅ Advantages:
1. **Simple to Understand**: Intuitive concept
2. **No Training Period**: Lazy learning approach
3. **Versatile**: Works for both classification and regression
4. **Non-parametric**: Makes no assumptions about data distribution
5. **Effective with Small Datasets**: Good performance on limited data
6. **Adapts to New Data**: Easily incorporates new training examples

### ❌ Limitations:
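1. **Computationally Expensive**: A brute-force prediction compares the query against all n training points, roughly O(n·d) for d features
2. **Memory Intensive**: Stores entire training dataset
3. **Sensitive to Irrelevant Features**: Every feature contributes to the distance, and distances become less discriminative as dimensionality grows (curse of dimensionality)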
4. **Requires Feature Scaling**: Distance metrics affected by scale
5. **Sensitive to Local Structure**: Can be misled by noise
6. **Choosing k**: Requires hyperparameter tuning
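
The scaling limitation can be seen on a tiny made-up example. In the sketch below, the second feature is on a scale of thousands, so it dominates the raw Euclidean distance. After standardizing with the training mean and standard deviation, a different training point becomes the nearest neighbor. The data and the helper `nearest_index` are hypothetical.

```python
import numpy as np

# Made-up data: column 0 is on a scale of ones, column 1 on a scale of thousands
X_train = np.array([[1.0, 1000.0],
                    [1.1, 3000.0],
                    [5.0, 1100.0]])
query = np.array([1.0, 1100.0])

def nearest_index(X, q):
    """Index of the row of X closest to q in Euclidean distance."""
    return int(np.argmin(np.linalg.norm(X - q, axis=1)))

# Raw features: row 2 wins because its large-scale feature matches the query
nearest_raw = nearest_index(X_train, query)

# Standardized features: both columns now contribute comparably
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
nearest_scaled = nearest_index((X_train - mean) / std, (query - mean) / std)

print(nearest_raw, nearest_scaled)  # 2 0
```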

---

## Q4. Bayes' Theorem and Naive Bayes

### Bayes' Theorem:
$$P(A|B) = \frac{P(B|A) \times P(A)}{P(B)}$$

**In Classification Context:**
$$P(Class|Features) = \frac{P(Features|Class) \times P(Class)}{P(Features)}$$

### Components:
- **P(Class|Features)**: Posterior probability (what we want)
- **P(Features|Class)**: Likelihood (probability of features given class)
- **P(Class)**: Prior probability (class distribution)
- **P(Features)**: Evidence (normalization constant)

### Usage in Naive Bayes:
- **Predicts class** with highest posterior probability
- **Combines prior knowledge** with observed evidence
- **Updates beliefs** based on new information
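
A worked example may help make the components concrete. The probabilities below are invented for illustration: suppose 20% of messages are spam, the word "free" appears in 60% of spam messages and in 5% of ham messages.

```python
# Hypothetical numbers, chosen only to illustrate Bayes' theorem
p_spam = 0.2               # Prior P(spam)
p_ham = 1.0 - p_spam       # Prior P(ham)
p_free_given_spam = 0.6    # Likelihood P("free" | spam)
p_free_given_ham = 0.05    # Likelihood P("free" | ham)

# Evidence P("free") via the law of total probability: 0.12 + 0.04 = 0.16
evidence = p_free_given_spam * p_spam + p_free_given_ham * p_ham

# Posterior P(spam | "free") = 0.12 / 0.16 = 0.75
posterior_spam = p_free_given_spam * p_spam / evidence

print(f"P(spam | 'free') = {posterior_spam:.2f}")
```

Seeing the word raises the spam probability from the prior of 0.20 to a posterior of 0.75.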

---

## Q5. "Naive" Assumption in Naive Bayes

### The "Naive" Assumption:
**Features are conditionally independent given the class**

$$P(x_1, x_2, ..., x_n | Class) = P(x_1|Class) \times P(x_2|Class) \times ... \times P(x_n|Class)$$

### Why "Naive"?
- **Unrealistic in practice**: Features often correlate
- **Simplifying assumption**: Makes computation tractable
- **Example violation**: In email spam detection, words like "free" and "money" often appear together

### Significance:
1. **Computational Efficiency**: The number of parameters to estimate grows linearly with the number of features instead of exponentially
2. **Requires Less Training Data**: Each feature's distribution is estimated independently
3. **Robust Performance**: Often works well despite violated assumption
4. **Fast Training and Prediction**: Suitable for real-time applications
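
Under the independence assumption, scoring a class reduces to multiplying the prior by one likelihood per feature. The sketch below does this in log space, which avoids numerical underflow when many small probabilities are multiplied. All probabilities and the helper names `log_score` and `classify` are made up for illustration.

```python
import math

# Hypothetical class priors and per-word likelihoods P(word | class)
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.30, "money": 0.20, "meeting": 0.01},
    "ham":  {"free": 0.02, "money": 0.03, "meeting": 0.15},
}

def log_score(words, cls):
    """log P(cls) + sum of log P(word | cls): the naive factorization in log space."""
    score = math.log(priors[cls])
    for word in words:
        score += math.log(likelihoods[cls][word])
    return score

def classify(words):
    """Return the class with the highest (unnormalized) log posterior."""
    return max(priors, key=lambda cls: log_score(words, cls))

print(classify(["free", "money"]))  # spam: 0.4*0.3*0.2 beats 0.6*0.02*0.03
print(classify(["meeting"]))        # ham: 0.6*0.15 beats 0.4*0.01
```

The evidence term P(Features) is omitted because it is the same for every class and does not change which class scores highest.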

---

## Q6. Types of Naive Bayes Classifiers

### 1. Gaussian Naive Bayes
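**Assumption**: Within each class, each continuous feature follows a normal distribution
$$P(x_i|Class) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}$$
where $\mu$ and $\sigma^2$ are the mean and variance of feature $x_i$, estimated from the training samples of that class.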

**Use Cases:**
- **Iris classification**: Continuous flower measurements
- **Medical diagnosis**: Continuous biomarkers
- **Image recognition**: Pixel intensity values

### 2. Multinomial Naive Bayes
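**Assumption**: Features represent counts/frequencies
$$P(x_i|Class) = \frac{count(x_i, Class) + \alpha}{count(Class) + \alpha \times |V|}$$
where $count(x_i, Class)$ is how often feature $x_i$ occurs in the class, $count(Class)$ is the total count of all features in the class, $|V|$ is the vocabulary size, and $\alpha$ is the smoothing parameter ($\alpha = 1$ gives Laplace smoothing).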

**Use Cases:**
- **Text classification**: Word counts in documents
- **Spam detection**: Email content analysis
- **Sentiment analysis**: Social media posts

### 3. Bernoulli Naive Bayes
**Assumption**: Binary features (present/absent)
$$P(x_i|Class) = P(x_i=1|Class)^{x_i} \times (1-P(x_i=1|Class))^{(1-x_i)}$$

**Use Cases:**
- **Document classification**: Word presence/absence
- **Web page categorization**: Feature existence
- **Gene expression**: Gene active/inactive
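
The smoothed estimate in the Multinomial formula can be checked against scikit-learn on a tiny count matrix. The data here is invented, and the comparison relies on `MultinomialNB` exposing its fitted log-probabilities through the `feature_log_prob_` attribute.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up word-count matrix: 4 documents, vocabulary of 3 words
X_counts = np.array([[2, 1, 0],
                     [3, 0, 0],
                     [0, 1, 4],
                     [0, 2, 3]])
y = np.array([1, 1, 0, 0])

alpha = 1.0  # Laplace smoothing
model = MultinomialNB(alpha=alpha).fit(X_counts, y)

# Manual estimate for class 1: (count + alpha) / (total count + alpha * |V|)
feature_counts = X_counts[y == 1].sum(axis=0)       # [5, 1, 0]
vocab_size = X_counts.shape[1]
manual_probs = (feature_counts + alpha) / (feature_counts.sum() + alpha * vocab_size)

# Fitted values from scikit-learn for the same class
class_idx = list(model.classes_).index(1)
sklearn_probs = np.exp(model.feature_log_prob_[class_idx])

print(manual_probs)
print(sklearn_probs)
```

Note that the third word never occurs in class 1, yet smoothing gives it a non-zero probability, so a single unseen word cannot force the whole product to zero.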

---

## Q7. KNN vs Naive Bayes: Key Differences

| Aspect | K-Nearest Neighbors | Naive Bayes |
|--------|-------------------|-------------|
| **Learning Type** | Lazy Learning (Instance-based) | Eager Learning (Model-based) |
| **Training Phase** | No explicit training, stores all data | Learns probability distributions |
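| **Prediction Speed** | Slow (brute force scans all n training points per prediction) | Fast (cost does not depend on the training set size) |
| **Memory Usage** | High (stores entire dataset) | Low (stores only parameters) |
| **Feature Independence** | No assumption | Assumes conditional independence |
| **Data Requirements** | Works with small datasets, but needs samples that cover the feature space | Often works with limited data, since each feature is estimated independently |
| **Interpretability** | Predictions can be explained by inspecting the neighbors, but there is no compact model | Class priors and per-feature likelihoods can be inspected directly |
| **Handling New Classes** | New labeled examples can simply be added to the stored data | Class statistics must be re-estimated to include the new class |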

### Summary:
- **KNN**: Best for complex patterns and small datasets, when the computational cost is acceptable
- **Naive Bayes**: Best for text classification and real-time applications, when interpretability is important

# Part A: K-Nearest Neighbors on Iris Dataset

## Step 1: Load and Explore the Iris Dataset

The Iris dataset is a classic machine learning dataset containing measurements of iris flowers from three species.

In [None]:
# Load and Explore Iris Dataset

print("Loading Iris Dataset...")
print("="*50)

# Load the iris dataset
iris = load_iris()
X_iris = iris.data
y_iris = iris.target

# Create DataFrame for better visualization
iris_df = pd.DataFrame(X_iris, columns=iris.feature_names)
iris_df['species'] = iris.target
iris_df['species_name'] = iris_df['species'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})

print("Dataset Shape:", iris_df.shape)
print("\nFeatures:", iris.feature_names)
print("Target Classes:", iris.target_names)

# Display basic information
print("\nFirst 5 rows:")
print(iris_df.head())

print("\nDataset Info:")
print(iris_df.info())

print("\nDescriptive Statistics:")
print(iris_df.describe())

# Check class distribution
print("\nClass Distribution:")
class_counts = iris_df['species_name'].value_counts()
print(class_counts)
print(f"\nClass balance:")
for species, count in class_counts.items():
    print(f"{species}: {count} samples ({count/len(iris_df)*100:.1f}%)")

# Visualize the dataset
plt.figure(figsize=(15, 10))

# Scatter plots of selected feature pairs
plt.subplot(2, 3, 1)
for i, species in enumerate(iris.target_names):
    mask = iris_df['species'] == i
    plt.scatter(iris_df[mask]['sepal length (cm)'], iris_df[mask]['sepal width (cm)'], 
                label=species, alpha=0.7)
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
plt.title('Sepal Length vs Sepal Width')
plt.legend()
plt.grid(True, alpha=0.3)

plt.subplot(2, 3, 2)
for i, species in enumerate(iris.target_names):
    mask = iris_df['species'] == i
    plt.scatter(iris_df[mask]['petal length (cm)'], iris_df[mask]['petal width (cm)'], 
                label=species, alpha=0.7)
plt.xlabel('Petal Length (cm)')
plt.ylabel('Petal Width (cm)')
plt.title('Petal Length vs Petal Width')
plt.legend()
plt.grid(True, alpha=0.3)

# Distribution plots
plt.subplot(2, 3, 3)
iris_df.boxplot(column='sepal length (cm)', by='species_name', ax=plt.gca())
plt.title('Sepal Length Distribution by Species')
plt.suptitle('')

plt.subplot(2, 3, 4)
iris_df.boxplot(column='petal length (cm)', by='species_name', ax=plt.gca())
plt.title('Petal Length Distribution by Species')
plt.suptitle('')

# Correlation matrix
plt.subplot(2, 3, 5)
correlation_matrix = iris_df[iris.feature_names].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, square=True)
plt.title('Feature Correlation Matrix')

# Class distribution pie chart
plt.subplot(2, 3, 6)
plt.pie(class_counts.values, labels=class_counts.index, autopct='%1.1f%%', startangle=90)
plt.title('Species Distribution')

plt.tight_layout()
plt.show()

## Step 2: Data Preprocessing for KNN

Prepare the data by splitting into train-test sets and applying feature scaling (crucial for KNN).

In [None]:
# Data Preprocessing for KNN

print("Preprocessing Data for KNN...")
print("="*50)

# Split the data into train and test sets (80-20 split)
X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(
    X_iris, y_iris, 
    test_size=0.2, 
    random_state=42, 
    stratify=y_iris  # Maintain class distribution
)

print("After Train-Test Split:")
print(f"Training set: X={X_train_iris.shape}, y={y_train_iris.shape}")
print(f"Test set: X={X_test_iris.shape}, y={y_test_iris.shape}")

# Check class distribution in splits
print("\nClass distribution in training set:")
train_counts = np.bincount(y_train_iris)
for i, count in enumerate(train_counts):
    print(f"{iris.target_names[i]}: {count} samples")

print("\nClass distribution in test set:")
test_counts = np.bincount(y_test_iris)
for i, count in enumerate(test_counts):
    print(f"{iris.target_names[i]}: {count} samples")

# Feature Scaling using StandardScaler
print("\nApplying Feature Scaling...")
scaler = StandardScaler()

# Fit on training data and transform both train and test
X_train_iris_scaled = scaler.fit_transform(X_train_iris)
X_test_iris_scaled = scaler.transform(X_test_iris)

print("Feature scaling completed!")

# Display scaling effect
print("\nScaling Effect:")
print("Before scaling (training set):")
train_df = pd.DataFrame(X_train_iris, columns=iris.feature_names)
print(train_df.describe())

print("\nAfter scaling (training set):")
train_scaled_df = pd.DataFrame(X_train_iris_scaled, columns=iris.feature_names)
print(train_scaled_df.describe())

# Visualize scaling effect
plt.figure(figsize=(15, 5))

# Short but unambiguous labels, e.g. 'sepal length' rather than just 'sepal'
short_names = [name.replace(' (cm)', '') for name in iris.feature_names]

plt.subplot(1, 3, 1)
plt.boxplot(X_train_iris, labels=short_names)
plt.title('Before Scaling')
plt.ylabel('Value')
plt.xticks(rotation=45)

plt.subplot(1, 3, 2)
plt.boxplot(X_train_iris_scaled, labels=short_names)
plt.title('After Scaling')
plt.ylabel('Standardized Value')
plt.xticks(rotation=45)

plt.subplot(1, 3, 3)
# Compare distributions
for i, feature in enumerate(short_names):
    plt.hist(X_train_iris_scaled[:, i], alpha=0.5, label=feature, bins=15)
plt.title('Scaled Feature Distributions')
plt.xlabel('Standardized Value')
plt.ylabel('Frequency')
plt.legend()

plt.tight_layout()
plt.show()

## Step 3: KNN Model Implementation and Evaluation

Implement KNN classifier and experiment with different k values to find optimal performance.

In [None]:
# KNN Model Implementation

print("KNN Model Implementation and Evaluation")
print("="*50)

# Experiment with different k values
k_values = [3, 5, 7, 9, 11]
knn_results = {}

print(f"Experimenting with k values: {k_values}")
print("\nTraining and evaluating KNN models...")

for k in k_values:
    print(f"\n--- K = {k} ---")
    
    # Create and train KNN classifier
    knn = KNeighborsClassifier(n_neighbors=k, metric='euclidean')
    knn.fit(X_train_iris_scaled, y_train_iris)
    
    # Make predictions
    y_pred_iris = knn.predict(X_test_iris_scaled)
    y_pred_proba_iris = knn.predict_proba(X_test_iris_scaled)
    
    # Calculate metrics
    accuracy = accuracy_score(y_test_iris, y_pred_iris)
    precision = precision_score(y_test_iris, y_pred_iris, average='weighted')
    recall = recall_score(y_test_iris, y_pred_iris, average='weighted')
    f1 = f1_score(y_test_iris, y_pred_iris, average='weighted')
    
    # Store results
    knn_results[k] = {
        'model': knn,
        'predictions': y_pred_iris,
        'probabilities': y_pred_proba_iris,
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1_score': f1
    }
    
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1-Score: {f1:.4f}")

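# Find best k value (max() returns the first, i.e. smallest, k among ties).
# Note: choosing k by test accuracy makes the reported test score optimistic;
# the cross-validation on the training set further down is the fairer estimate.
best_k = max(knn_results.keys(), key=lambda k: knn_results[k]['accuracy'])
print(f"\n🏆 Best k value: {best_k} with accuracy: {knn_results[best_k]['accuracy']:.4f}")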

# Create comparison DataFrame
comparison_df = pd.DataFrame({
    'k': list(knn_results.keys()),
    'Accuracy': [knn_results[k]['accuracy'] for k in knn_results.keys()],
    'Precision': [knn_results[k]['precision'] for k in knn_results.keys()],
    'Recall': [knn_results[k]['recall'] for k in knn_results.keys()],
    'F1-Score': [knn_results[k]['f1_score'] for k in knn_results.keys()]
})

print("\nPerformance Comparison:")
print(comparison_df.round(4))

# Visualize performance comparison
plt.figure(figsize=(15, 10))

plt.subplot(2, 3, 1)
plt.plot(comparison_df['k'], comparison_df['Accuracy'], 'bo-', linewidth=2, markersize=8)
plt.title('Accuracy vs K Value')
plt.xlabel('K Value')
plt.ylabel('Accuracy')
plt.grid(True, alpha=0.3)
plt.xticks(k_values)

plt.subplot(2, 3, 2)
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score']
for metric in metrics:
    plt.plot(comparison_df['k'], comparison_df[metric], 'o-', label=metric, linewidth=2, markersize=6)
plt.title('All Metrics vs K Value')
plt.xlabel('K Value')
plt.ylabel('Score')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xticks(k_values)

# Detailed evaluation for best k
best_model = knn_results[best_k]['model']
best_predictions = knn_results[best_k]['predictions']

print(f"\nDetailed Evaluation for Best Model (k={best_k}):")
print("="*50)

# Confusion Matrix
cm = confusion_matrix(y_test_iris, best_predictions)
print("Confusion Matrix:")
print(cm)

plt.subplot(2, 3, 3)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.title(f'Confusion Matrix (k={best_k})')
plt.xlabel('Predicted')
plt.ylabel('Actual')

# Classification Report
print("\nClassification Report:")
print(classification_report(y_test_iris, best_predictions, target_names=iris.target_names))

# Feature importance through permutation (simplified approach)
print("\nFeature Importance Analysis:")
feature_importance = []
base_accuracy = knn_results[best_k]['accuracy']
rng = np.random.default_rng(42)  # Seeded so the shuffles are reproducible

for i, feature_name in enumerate(iris.feature_names):
    # Create copy of test data with one feature column shuffled
    X_test_shuffled = X_test_iris_scaled.copy()
    X_test_shuffled[:, i] = rng.permutation(X_test_shuffled[:, i])
    
    # Predict with shuffled feature
    y_pred_shuffled = best_model.predict(X_test_shuffled)
    shuffled_accuracy = accuracy_score(y_test_iris, y_pred_shuffled)
    
    # Importance = drop in accuracy
    importance = base_accuracy - shuffled_accuracy
    feature_importance.append(importance)
    print(f"{feature_name}: {importance:.4f}")

plt.subplot(2, 3, 4)
plt.barh(range(len(iris.feature_names)), feature_importance)
plt.yticks(range(len(iris.feature_names)), iris.feature_names)
plt.xlabel('Importance (Accuracy Drop)')
plt.title('Feature Importance')
plt.grid(True, alpha=0.3)

# Prediction examples
plt.subplot(2, 3, 5)
# Show some test samples with predictions
sample_indices = [0, 5, 10, 15, 20]
for i, idx in enumerate(sample_indices):
    actual = iris.target_names[y_test_iris[idx]]
    predicted = iris.target_names[best_predictions[idx]]
    color = 'green' if actual == predicted else 'red'
    plt.text(0.1, 0.9-i*0.15, f"Sample {idx}: Actual={actual}, Predicted={predicted}", 
             transform=plt.gca().transAxes, color=color, fontsize=10)
plt.title('Sample Predictions')
plt.axis('off')

# Cross-validation scores
plt.subplot(2, 3, 6)
cv_scores = cross_val_score(best_model, X_train_iris_scaled, y_train_iris, cv=5)
plt.bar(range(1, 6), cv_scores, alpha=0.7)
plt.axhline(y=cv_scores.mean(), color='red', linestyle='--', label=f'Mean: {cv_scores.mean():.3f}')
plt.title('5-Fold Cross-Validation Scores')
plt.xlabel('Fold')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nCross-Validation Results:")
print(f"CV Scores: {cv_scores}")
print(f"Mean CV Score: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
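
An alternative to picking k from test-set accuracy is to select it by cross-validation on the training data only, and to touch the test set once at the end. The sketch below does this with `GridSearchCV` over the same k values, with the scaler placed inside a `Pipeline` so that it is refit on each training fold. It reloads Iris to stay self-contained; the step names `scaler` and `knn` are arbitrary labels chosen for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scaling lives inside the pipeline, so each CV fold fits its own scaler
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Same candidate k values as in the loop above
param_grid = {"knn__n_neighbors": [3, 5, 7, 9, 11]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

chosen_k = search.best_params_["knn__n_neighbors"]
test_accuracy = search.score(X_te, y_te)  # Uses the refit best pipeline

print(f"Chosen k: {chosen_k}")
print(f"Mean CV accuracy: {search.best_score_:.4f}")
print(f"Test accuracy: {test_accuracy:.4f}")
```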

# Part B: Naive Bayes on SMS Spam Collection Dataset

## Step 1: Load and Explore SMS Spam Dataset

For this section, we'll create a small synthetic SMS spam dataset, because the original SMS Spam Collection has to be downloaded separately (for example from Kaggle). In practice, you would load the real messages from a file.

In [None]:
# Load and Explore SMS Spam Dataset

print("Creating SMS Spam Dataset...")
print("="*50)

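# Create synthetic SMS spam dataset (in practice, load from CSV)
# Realistic SMS messages based on common patterns
# Caveat: each list is repeated to get more samples, so identical messages end up
# in both the train and test splits and the reported scores will be optimistic.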

spam_messages = [
    "FREE! Win a £1000 cash prize! Text WIN to 12345 now!",
    "URGENT! Your account will be suspended. Click link to verify: http://fake-bank.com",
    "Congratulations! You've won a free iPhone! Call 555-SCAM now!",
    "Make money fast! Work from home opportunity. Text START to 98765",
    "SPECIAL OFFER: 50% off luxury watches! Limited time only. Call now!",
    "You have won £500! To claim text CLAIM to 54321",
    "Get rich quick! Investment opportunity. Guaranteed returns!",
    "Free ringtones! Text TONE to 11111. Standard charges apply.",
    "WINNER! You are selected for a cash prize of £2000!",
    "Pharmacy online. Cheap medications. No prescription needed!",
    "Hot singles in your area! Click here to meet them!",
    "Debt problems? We can help! Call for free consultation.",
    "Cash loan approved! Up to £5000. Text YES to 67890",
    "Free mobile phone! Just pay shipping. Limited offer!",
    "Weight loss miracle! Lose 10kg in 10 days! Buy now!",
    "Lottery winner! Claim your £10000 prize today!",
    "Credit card approved! Bad credit OK. Apply now!",
    "FREE vacation! You've won a trip to Hawaii!",
    "Make £1000 per week working from home!",
    "ALERT: Suspicious activity on your account. Click here.",
    "Free trial offer! Cancel anytime. No hidden fees!",
    "Get paid to take surveys! Earn extra cash!",
    "Cheap car insurance quotes! Save hundreds!",
    "Free gift card! £100 Amazon voucher waiting!",
    "Investment opportunity! Double your money in 30 days!"
] * 8  # Repeat to get more samples

ham_messages = [
    "Hey, how are you doing today?",
    "Can you pick up milk on your way home?",
    "Meeting is scheduled for 3 PM in conference room",
    "Thanks for your help with the project yesterday",
    "Don't forget about dinner with mom tonight",
    "The weather is beautiful today, perfect for a walk",
    "I'll be running a few minutes late to the meeting",
    "Great job on the presentation! Well done.",
    "Could you send me the report when you get a chance?",
    "Happy birthday! Hope you have a wonderful day!",
    "The train is delayed by 15 minutes",
    "Let's grab lunch tomorrow if you're free",
    "I finished reading that book you recommended",
    "The meeting has been moved to next Tuesday",
    "Thanks for the birthday gift, I love it!",
    "Can you call me when you get this message?",
    "I'm at the grocery store, need anything?",
    "Good luck with your exam tomorrow!",
    "The concert was amazing last night",
    "I'll pick you up at 7 PM for the movie",
    "Working late tonight, will be home around 9",
    "The kids had a great time at the park",
    "Don't forget to water the plants while I'm away",
    "I booked our flights for the vacation",
    "The repair shop called, your car is ready",
    "Coffee at our usual place at 10 AM?",
    "I sent you the photos from the wedding",
    "The doctor's appointment is confirmed for Friday",
    "Great game last night! Your team played well.",
    "I'll be in town next week, let's catch up"
] * 7  # Repeat to get more samples

# Combine messages and create labels
messages = spam_messages + ham_messages
labels = ['spam'] * len(spam_messages) + ['ham'] * len(ham_messages)

# Create DataFrame
sms_df = pd.DataFrame({
    'message': messages,
    'label': labels
})

# Shuffle the dataset
sms_df = sms_df.sample(frac=1, random_state=42).reset_index(drop=True)

print("Dataset created successfully!")
print(f"Total messages: {len(sms_df)}")
print(f"Spam messages: {len(sms_df[sms_df['label'] == 'spam'])}")
print(f"Ham messages: {len(sms_df[sms_df['label'] == 'ham'])}")

# Display sample data
print("\nSample messages:")
print(sms_df.head(10))

# Class distribution
print("\nClass Distribution:")
class_dist = sms_df['label'].value_counts()
print(class_dist)
print(f"\nClass balance:")
for label, count in class_dist.items():
    print(f"{label}: {count} messages ({count/len(sms_df)*100:.1f}%)")

# Basic text statistics
sms_df['message_length'] = sms_df['message'].str.len()
sms_df['word_count'] = sms_df['message'].str.split().str.len()

print("\nText Statistics:")
print(sms_df.groupby('label')[['message_length', 'word_count']].describe())

# Visualize dataset
plt.figure(figsize=(15, 10))

plt.subplot(2, 3, 1)
class_dist.plot(kind='bar', color=['skyblue', 'salmon'])
plt.title('Class Distribution')
plt.xlabel('Class')
plt.ylabel('Count')
plt.xticks(rotation=0)

plt.subplot(2, 3, 2)
plt.pie(class_dist.values, labels=class_dist.index, autopct='%1.1f%%', startangle=90)
plt.title('Class Distribution (Pie Chart)')

plt.subplot(2, 3, 3)
sms_df.boxplot(column='message_length', by='label', ax=plt.gca())
plt.title('Message Length by Class')
plt.suptitle('')

plt.subplot(2, 3, 4)
sms_df.boxplot(column='word_count', by='label', ax=plt.gca())
plt.title('Word Count by Class')
plt.suptitle('')

plt.subplot(2, 3, 5)
plt.hist(sms_df[sms_df['label'] == 'spam']['message_length'], alpha=0.7, label='Spam', bins=20)
plt.hist(sms_df[sms_df['label'] == 'ham']['message_length'], alpha=0.7, label='Ham', bins=20)
plt.xlabel('Message Length')
plt.ylabel('Frequency')
plt.title('Message Length Distribution')
plt.legend()

plt.subplot(2, 3, 6)
plt.hist(sms_df[sms_df['label'] == 'spam']['word_count'], alpha=0.7, label='Spam', bins=15)
plt.hist(sms_df[sms_df['label'] == 'ham']['word_count'], alpha=0.7, label='Ham', bins=15)
plt.xlabel('Word Count')
plt.ylabel('Frequency')
plt.title('Word Count Distribution')
plt.legend()

plt.tight_layout()
plt.show()
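
For reference, loading a real copy of the dataset might look like the sketch below. It assumes a tab-separated file with no header row and one `label<TAB>message` pair per line; that layout is an assumption about the downloaded file, and other distributions (such as CSV exports) will need different arguments. The function name `load_sms_tsv` is hypothetical, and an in-memory sample stands in for the file so the example runs on its own.

```python
import csv
import io

import pandas as pd

def load_sms_tsv(path_or_buffer):
    """Read a tab-separated 'label<TAB>message' file that has no header row."""
    return pd.read_csv(
        path_or_buffer,
        sep="\t",
        header=None,
        names=["label", "message"],
        quoting=csv.QUOTE_NONE,  # Treat quote characters in messages as plain text
    )

# In-memory stand-in for the real file (assumed layout)
sample_file = io.StringIO(
    "ham\tCan you pick up milk on your way home?\n"
    "spam\tFREE! Win a cash prize! Text WIN to 12345 now!\n"
)

demo_df = load_sms_tsv(sample_file)
print(demo_df)
```

The resulting DataFrame has the same `message` and `label` columns as `sms_df`, so the rest of Part B would run unchanged on it.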

## Step 2: Text Preprocessing

Clean and prepare the text data for machine learning by removing noise and converting to numerical features.

In [None]:
# Text Preprocessing

print("Text Preprocessing Pipeline")
print("="*50)

def clean_text(text):
    """Clean and preprocess text data"""
    # Convert to lowercase
    text = text.lower()
    
    # Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)  # 'http\S+' already covers https links
    
    # Remove special characters and digits, keep only letters and spaces
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    
    # Remove extra whitespace
    text = ' '.join(text.split())
    
    return text

# Apply text cleaning
print("Applying text cleaning...")
sms_df['cleaned_message'] = sms_df['message'].apply(clean_text)

# Show examples of cleaning
print("\nText Cleaning Examples:")
for i in range(5):
    print(f"Original: {sms_df.iloc[i]['message']}")
    print(f"Cleaned:  {sms_df.iloc[i]['cleaned_message']}")
    print("-" * 50)

# Prepare features and target
X_text = sms_df['cleaned_message']
y_text = sms_df['label']

# Convert labels to binary (0 = ham, 1 = spam)
label_encoder = LabelEncoder()
y_text_encoded = label_encoder.fit_transform(y_text)
print(f"\nLabel encoding: {dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))}")

# Train-test split
X_train_text, X_test_text, y_train_text, y_test_text = train_test_split(
    X_text, y_text_encoded, 
    test_size=0.2, 
    random_state=42, 
    stratify=y_text_encoded
)

print(f"\nTrain-Test Split:")
print(f"Training set: {len(X_train_text)} messages")
print(f"Test set: {len(X_test_text)} messages")

# Text Vectorization - TF-IDF
print("\nApplying TF-IDF Vectorization...")
tfidf_vectorizer = TfidfVectorizer(
    max_features=1000,  # Limit to top 1000 features
    stop_words='english',  # Remove common English stop words
    ngram_range=(1, 2),  # Use both unigrams and bigrams
    min_df=2,  # Ignore terms that appear in fewer than 2 documents
    max_df=0.95  # Ignore terms that appear in more than 95% of documents
)

# Fit on training data and transform both train and test
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train_text)
X_test_tfidf = tfidf_vectorizer.transform(X_test_text)

print(f"TF-IDF matrix shape - Training: {X_train_tfidf.shape}")
print(f"TF-IDF matrix shape - Test: {X_test_tfidf.shape}")
print(f"Vocabulary size: {len(tfidf_vectorizer.vocabulary_)}")

# Also create Count Vectorization for comparison
print("\nApplying Count Vectorization...")
count_vectorizer = CountVectorizer(
    max_features=1000,
    stop_words='english',
    ngram_range=(1, 2),
    min_df=2,
    max_df=0.95
)

X_train_count = count_vectorizer.fit_transform(X_train_text)
X_test_count = count_vectorizer.transform(X_test_text)

print(f"Count matrix shape - Training: {X_train_count.shape}")
print(f"Count matrix shape - Test: {X_test_count.shape}")

# Show most important features
print("\nTop 20 TF-IDF Features:")
feature_names = tfidf_vectorizer.get_feature_names_out()
tfidf_scores = X_train_tfidf.sum(axis=0).A1  # Sum across all documents
top_indices = tfidf_scores.argsort()[-20:][::-1]  # Top 20 indices
for i, idx in enumerate(top_indices, 1):
    print(f"{i:2d}. {feature_names[idx]:15} (TF-IDF: {tfidf_scores[idx]:.3f})")

# Analyze word frequency by class
print("\nWord Analysis by Class:")
spam_text = ' '.join(sms_df[sms_df['label'] == 'spam']['cleaned_message'])
ham_text = ' '.join(sms_df[sms_df['label'] == 'ham']['cleaned_message'])

spam_words = Counter(spam_text.split())
ham_words = Counter(ham_text.split())

print("\nTop 10 words in SPAM messages:")
for word, count in spam_words.most_common(10):
    print(f"  {word:15}: {count}")

print("\nTop 10 words in HAM messages:")
for word, count in ham_words.most_common(10):
    print(f"  {word:15}: {count}")

# Visualize preprocessing results
plt.figure(figsize=(15, 8))

plt.subplot(2, 3, 1)
# Feature matrix sparsity
sparsity = 1 - (X_train_tfidf.nnz / (X_train_tfidf.shape[0] * X_train_tfidf.shape[1]))
plt.bar(['TF-IDF Matrix'], [sparsity])
plt.title(f'Matrix Sparsity\n({sparsity:.1%} zeros)')
plt.ylabel('Sparsity')

plt.subplot(2, 3, 2)
# Vocabulary size comparison
plt.bar(['TF-IDF', 'Count'], [len(tfidf_vectorizer.vocabulary_), len(count_vectorizer.vocabulary_)])
plt.title('Vocabulary Size Comparison')
plt.ylabel('Number of Features')

plt.subplot(2, 3, 3)
# Distribution of document lengths after cleaning
clean_lengths = sms_df['cleaned_message'].str.len()
plt.hist(clean_lengths[sms_df['label'] == 'spam'], alpha=0.7, label='Spam', bins=20)
plt.hist(clean_lengths[sms_df['label'] == 'ham'], alpha=0.7, label='Ham', bins=20)
plt.xlabel('Cleaned Message Length')
plt.ylabel('Frequency')
plt.title('Cleaned Message Length Distribution')
plt.legend()

plt.subplot(2, 3, 4)
# Top TF-IDF features visualization
top_10_indices = top_indices[:10]
top_10_scores = [tfidf_scores[i] for i in top_10_indices]
top_10_features = [feature_names[i] for i in top_10_indices]
plt.barh(range(len(top_10_features)), top_10_scores)
plt.yticks(range(len(top_10_features)), top_10_features)
plt.xlabel('TF-IDF Score')
plt.title('Top 10 TF-IDF Features')

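plt.subplot(2, 3, 5)
# Class distribution in the training split
# Sort by encoded label and derive names from the encoder, so that the
# pie labels cannot drift out of sync with the counts
y_train_dist = pd.Series(y_train_text).value_counts().sort_index()
plt.pie(y_train_dist.values, labels=label_encoder.inverse_transform(y_train_dist.index),
        autopct='%1.1f%%')
plt.title('Training Set Class Distribution')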

plt.subplot(2, 3, 6)
# Word count comparison
spam_word_count = len(spam_words)
ham_word_count = len(ham_words)
plt.bar(['Spam Vocabulary', 'Ham Vocabulary'], [spam_word_count, ham_word_count], 
        color=['red', 'green'], alpha=0.7)
plt.title('Unique Words by Class')
plt.ylabel('Number of Unique Words')

plt.tight_layout()
plt.show()
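
The vectorize-then-classify flow can also be bundled into a single scikit-learn pipeline, which keeps the vocabulary fitted on training text only and lets raw strings be passed straight to `predict`. The sketch below uses a handful of made-up messages and default vectorizer settings, not the parameters chosen above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny invented corpus: 1 = spam, 0 = ham
train_texts = [
    "win a free cash prize now",
    "free prize claim your cash",
    "urgent claim your free reward",
    "are we still meeting for lunch",
    "see you at the meeting tomorrow",
    "thanks for lunch today",
]
train_labels = [1, 1, 1, 0, 0, 0]

# The vectorizer is fitted inside the pipeline, on the training texts only
spam_pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
spam_pipeline.fit(train_texts, train_labels)

new_messages = ["claim your free cash prize", "lunch meeting tomorrow"]
preds = spam_pipeline.predict(new_messages)
print(list(preds))  # [1, 0]
```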

## Step 3: Naive Bayes Implementation

Now we'll implement different variants of Naive Bayes classifiers and compare their performance on the preprocessed SMS spam dataset.

In [None]:
# Naive Bayes Implementation and Comparison

print("Naive Bayes Implementation on SMS Spam Dataset")
print("="*60)

# Initialize different Naive Bayes classifiers
nb_classifiers = {
    'Multinomial NB': MultinomialNB(),
    'Bernoulli NB': BernoulliNB(),
    'Complement NB': ComplementNB()
}

# Storage for results
nb_results = {}
nb_predictions = {}

# Function to evaluate classifier
def evaluate_classifier(name, classifier, X_train, X_test, y_train, y_test):
    """Train and evaluate a classifier"""
    print(f"\n{name}")
    print("-" * 40)
    
    # Train the classifier
    start_time = time.time()
    classifier.fit(X_train, y_train)
    train_time = time.time() - start_time
    
    # Make predictions
    start_time = time.time()
    y_pred = classifier.predict(X_test)
    y_pred_proba = classifier.predict_proba(X_test)[:, 1]  # Probability of spam
    predict_time = time.time() - start_time
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    auc = roc_auc_score(y_test, y_pred_proba)
    
    # Print results
    print(f"Training time: {train_time:.4f} seconds")
    print(f"Prediction time: {predict_time:.4f} seconds")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1-score: {f1:.4f}")
    print(f"AUC-ROC: {auc:.4f}")
    
    # Store results
    results = {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'auc': auc,
        'train_time': train_time,
        'predict_time': predict_time,
        'predictions': y_pred,
        'probabilities': y_pred_proba
    }
    
    return results

print("\n🔍 Testing with TF-IDF Features")
print("="*50)

# Test all classifiers with TF-IDF features
for name, classifier in nb_classifiers.items():
    nb_results[f"{name} (TF-IDF)"] = evaluate_classifier(
        f"{name} (TF-IDF)", classifier, 
        X_train_tfidf, X_test_tfidf, 
        y_train_text, y_test_text
    )

print("\n🔍 Testing with Count Features")
print("="*50)

# Test all classifiers with Count features
for name, classifier in nb_classifiers.items():
    # Create new instances to avoid sklearn warnings
    if name == 'Multinomial NB':
        clf = MultinomialNB()
    elif name == 'Bernoulli NB':
        clf = BernoulliNB()
    else:
        clf = ComplementNB()
    
    nb_results[f"{name} (Count)"] = evaluate_classifier(
        f"{name} (Count)", clf,
        X_train_count, X_test_count,
        y_train_text, y_test_text
    )

# Create comprehensive comparison
print("\n📊 Comprehensive Results Summary")
print("="*70)

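# Keep only the scalar metrics so every column is numeric
# (the stored prediction/probability arrays would turn the frame into object dtype)
metric_cols = ['accuracy', 'precision', 'recall', 'f1', 'auc', 'train_time', 'predict_time']
results_df = pd.DataFrame(
    {name: {col: res[col] for col in metric_cols} for name, res in nb_results.items()}
).T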
print(results_df[['accuracy', 'precision', 'recall', 'f1', 'auc']].round(4))

# Find best performers
# (named best_accuracy_model so the numeric KNN `best_accuracy` used in Part C is not overwritten)
best_accuracy_model = results_df['accuracy'].idxmax()
best_f1 = results_df['f1'].idxmax()
best_auc = results_df['auc'].idxmax()

print(f"\n🏆 Best Performers:")
print(f"Best Accuracy: {best_accuracy_model} ({results_df.loc[best_accuracy_model, 'accuracy']:.4f})")
print(f"Best F1-Score: {best_f1} ({results_df.loc[best_f1, 'f1']:.4f})")
print(f"Best AUC-ROC: {best_auc} ({results_df.loc[best_auc, 'auc']:.4f})")

# Detailed analysis of best model
best_model_name = best_f1
best_predictions = nb_results[best_model_name]['predictions']

print(f"\n🔍 Detailed Analysis of Best Model: {best_model_name}")
print("="*60)

# Confusion Matrix
cm = confusion_matrix(y_test_text, best_predictions)
print("\nConfusion Matrix:")
print(f"{'':>12} {'Predicted':>20}")
print(f"{'Actual':>12} {'Ham':>8} {'Spam':>8}")
print(f"{'Ham':>12} {cm[0,0]:>8} {cm[0,1]:>8}")
print(f"{'Spam':>12} {cm[1,0]:>8} {cm[1,1]:>8}")

# Classification Report
print("\nClassification Report:")
print(classification_report(y_test_text, best_predictions, 
                          target_names=['Ham', 'Spam']))

# Error Analysis
print("\n🔍 Error Analysis")
print("="*40)

# Find misclassified examples
# (y_test_text is a pandas Series whose index labels match the rows of sms_df)
pred_series = pd.Series(best_predictions, index=y_test_text.index)
errors = y_test_text != pred_series
error_indices = errors[errors].index

print(f"Total misclassified: {errors.sum()}")
print(f"False Positives (Ham classified as Spam): {cm[0,1]}")
print(f"False Negatives (Spam classified as Ham): {cm[1,0]}")

# Show some misclassified examples (slicing already handles fewer than 5 errors)
print("\nSample Misclassified Messages:")
for i, idx in enumerate(error_indices[:5], 1):
    actual = 'Spam' if y_test_text.loc[idx] == 1 else 'Ham'
    predicted = 'Spam' if pred_series.loc[idx] == 1 else 'Ham'
    message = sms_df.loc[idx, 'message']
    print(f"\n{i}. Actual: {actual}, Predicted: {predicted}")
    print(f"   Message: {message}")

# Cross-validation for best model
print(f"\n🔄 Cross-Validation Analysis for {best_model_name}")
print("="*60)

# Get the best classifier type and feature type
if 'Multinomial' in best_model_name:
    best_classifier = MultinomialNB()
elif 'Bernoulli' in best_model_name:
    best_classifier = BernoulliNB()
else:
    best_classifier = ComplementNB()

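# Choose feature set
# Note: the vectorizer is refit on the full corpus here, so every CV fold's vocabulary
# (and IDF weights) has already seen that fold's validation messages. The scores below
# are therefore slightly optimistic compared with fitting the vectorizer inside each fold.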
if 'TF-IDF' in best_model_name:
    X_full = tfidf_vectorizer.fit_transform(X_text)
else:
    X_full = count_vectorizer.fit_transform(X_text)

# Perform cross-validation
cv_scores = cross_val_score(best_classifier, X_full, y_text_encoded, 
                           cv=5, scoring='f1')
cv_accuracy = cross_val_score(best_classifier, X_full, y_text_encoded, 
                             cv=5, scoring='accuracy')

print(f"5-Fold Cross-Validation Results:")
print(f"F1 Scores: {cv_scores}")
print(f"Mean F1: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
print(f"Accuracy Scores: {cv_accuracy}")
print(f"Mean Accuracy: {cv_accuracy.mean():.4f} (+/- {cv_accuracy.std() * 2:.4f})")

# Hyperparameter tuning for best model
print(f"\n⚙️ Hyperparameter Tuning for {best_model_name}")
print("="*60)

# Define parameter grid based on classifier type
if 'Multinomial' in best_model_name or 'Complement' in best_model_name:
    param_grid = {'alpha': [0.1, 0.5, 1.0, 2.0, 5.0]}
else:  # Bernoulli
    param_grid = {
        'alpha': [0.1, 0.5, 1.0, 2.0, 5.0],
        'binarize': [0.0, 0.1, 0.5]
    }

# Grid search
grid_search = GridSearchCV(
    best_classifier, param_grid,
    cv=5, scoring='f1',
    n_jobs=-1
)

grid_search.fit(X_full, y_text_encoded)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation F1 score: {grid_search.best_score_:.4f}")
print(f"Improvement over default: {grid_search.best_score_ - cv_scores.mean():.4f}")

# Feature importance analysis
print(f"\n📈 Feature Importance Analysis")
print("="*50)

# Get the best trained model
best_trained_model = grid_search.best_estimator_

# For Naive Bayes, we can look at log probabilities
if hasattr(best_trained_model, 'feature_log_prob_'):
    feature_names = tfidf_vectorizer.get_feature_names_out() if 'TF-IDF' in best_model_name else count_vectorizer.get_feature_names_out()
    
    # Calculate feature importance as difference in log probabilities
    log_prob_diff = best_trained_model.feature_log_prob_[1] - best_trained_model.feature_log_prob_[0]
    
    # Get top features for spam vs ham
    top_spam_indices = log_prob_diff.argsort()[-20:][::-1]
    top_ham_indices = log_prob_diff.argsort()[:20]
    
    print("\nTop 20 features indicating SPAM:")
    for i, idx in enumerate(top_spam_indices, 1):
        print(f"{i:2d}. {feature_names[idx]:15} (log prob diff: {log_prob_diff[idx]:.3f})")
    
    print("\nTop 20 features indicating HAM:")
    for i, idx in enumerate(top_ham_indices, 1):
        print(f"{i:2d}. {feature_names[idx]:15} (log prob diff: {log_prob_diff[idx]:.3f})")

print(f"\n✅ Naive Bayes Implementation Complete!")
print(f"Best performing model: {best_model_name}")
print(f"Best cross-validated F1-Score (after tuning): {grid_search.best_score_:.4f}")
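
The cross-validation and grid search above transform the whole corpus once and then split it. One common alternative is to put the vectorizer and the classifier in a single scikit-learn `Pipeline`, so the vocabulary is learned from each training fold only. The sketch below shows the idea on a small made-up corpus; `demo_texts`, `demo_labels`, and `leak_free_model` are hypothetical names, and the messages are not taken from the SMS dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical mini-corpus: 6 spam (1) and 6 ham (0) messages
demo_texts = [
    "win a free prize now", "free cash claim your prize",
    "urgent call now to win cash", "claim your free reward today",
    "you have won a cash prize", "free entry win tickets now",
    "are we still meeting for lunch", "see you at home tonight",
    "can you call me after lunch", "running late see you soon",
    "thanks for dinner last night", "meeting moved to tomorrow morning",
]
demo_labels = [1] * 6 + [0] * 6

# Vectorizer and classifier are fit together on each training fold,
# so validation messages never influence the vocabulary or IDF weights.
leak_free_model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])

demo_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
demo_f1 = cross_val_score(leak_free_model, demo_texts, demo_labels,
                          cv=demo_cv, scoring="f1")

print("Per-fold F1:", demo_f1.round(3))
print(f"Mean F1: {demo_f1.mean():.3f}")
```

The same pipeline object can be handed to `GridSearchCV`; step parameters are then addressed with the `step__param` convention, for example `{"nb__alpha": [0.1, 0.5, 1.0]}`.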

In [None]:
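# Visualization of Naive Bayes Results

# Curve helpers used below that the initial import cell does not provide
from sklearn.metrics import roc_curve, precision_recall_curve, average_precision_score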

print("\n📊 Creating Comprehensive Visualizations")
print("="*50)

# Create comprehensive visualization
fig = plt.figure(figsize=(20, 15))

# 1. Performance Comparison
plt.subplot(3, 4, 1)
metrics = ['accuracy', 'precision', 'recall', 'f1', 'auc']
x_pos = np.arange(len(metrics))
results_summary = results_df[metrics].mean()
colors = ['skyblue', 'lightcoral', 'lightgreen', 'gold', 'plum']
bars = plt.bar(x_pos, results_summary, color=colors, alpha=0.8)
plt.xlabel('Metrics')
plt.ylabel('Score')
plt.title('Average Performance Across All Models')
plt.xticks(x_pos, metrics)
plt.ylim(0, 1)
# Add value labels on bars
for bar, value in zip(bars, results_summary):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01, 
             f'{value:.3f}', ha='center', va='bottom', fontsize=8)

# 2. Model Comparison - F1 Scores
plt.subplot(3, 4, 2)
model_names = [name.replace(' (TF-IDF)', '').replace(' (Count)', '') for name in results_df.index]
feature_types = ['TF-IDF' if 'TF-IDF' in name else 'Count' for name in results_df.index]
f1_scores = results_df['f1'].values

# Create grouped bar chart
tfidf_mask = np.array([ft == 'TF-IDF' for ft in feature_types])
count_mask = ~tfidf_mask

# dict.fromkeys keeps first-seen order; set() could shuffle the labels relative to the bars
unique_models = list(dict.fromkeys(model_names))
x_pos = np.arange(len(unique_models))

tfidf_scores = f1_scores[tfidf_mask]
count_scores = f1_scores[count_mask]

plt.bar(x_pos - 0.2, tfidf_scores, width=0.4, label='TF-IDF', alpha=0.8, color='steelblue')
plt.bar(x_pos + 0.2, count_scores, width=0.4, label='Count', alpha=0.8, color='coral')
plt.xlabel('Naive Bayes Variants')
plt.ylabel('F1-Score')
plt.title('F1-Score Comparison by Feature Type')
plt.xticks(x_pos, unique_models, rotation=45, ha='right')
plt.legend()
plt.ylim(0, 1)

# 3. Confusion Matrix for Best Model
plt.subplot(3, 4, 3)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Ham', 'Spam'], yticklabels=['Ham', 'Spam'])
plt.title(f'Confusion Matrix\n{best_model_name}')
plt.ylabel('Actual')
plt.xlabel('Predicted')

# 4. ROC Curves
plt.subplot(3, 4, 4)
for name, results in nb_results.items():
    if 'TF-IDF' in name:  # Only plot TF-IDF models to avoid clutter
        fpr, tpr, _ = roc_curve(y_test_text, results['probabilities'])
        auc_score = results['auc']
        plt.plot(fpr, tpr, label=f"{name.replace(' (TF-IDF)', '')} (AUC={auc_score:.3f})", linewidth=2)

plt.plot([0, 1], [0, 1], 'k--', alpha=0.5)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves (TF-IDF Features)')
plt.legend()

# 5. Training Time Comparison
plt.subplot(3, 4, 5)
train_times = results_df['train_time']
model_names_short = [name.split(' (')[0] for name in results_df.index]
colors = ['lightblue' if 'TF-IDF' in name else 'lightcoral' for name in results_df.index]
bars = plt.bar(range(len(train_times)), train_times, color=colors, alpha=0.8)
plt.xlabel('Models')
plt.ylabel('Training Time (seconds)')
plt.title('Training Time Comparison')
plt.xticks(range(len(train_times)), model_names_short, rotation=45, ha='right')
# Add legend
tfidf_patch = plt.Rectangle((0,0),1,1,fc='lightblue', alpha=0.8)
count_patch = plt.Rectangle((0,0),1,1,fc='lightcoral', alpha=0.8)
plt.legend([tfidf_patch, count_patch], ['TF-IDF', 'Count'], loc='upper right')

# 6. Cross-validation Scores
plt.subplot(3, 4, 6)
cv_df = pd.DataFrame({
    'F1 Score': cv_scores,
    'Accuracy': cv_accuracy
})
cv_df.index = [f'Fold {i+1}' for i in range(len(cv_scores))]
cv_df.plot(kind='bar', ax=plt.gca(), color=['gold', 'lightgreen'], alpha=0.8)
plt.title(f'Cross-Validation Scores\n{best_model_name}')
plt.ylabel('Score')
plt.xticks(rotation=45)
plt.legend()
plt.ylim(0, 1)

# 7. Feature Importance (Top SPAM indicators)
plt.subplot(3, 4, 7)
if 'log_prob_diff' in locals():
    top_10_spam = log_prob_diff.argsort()[-10:][::-1]
    spam_features = [feature_names[i] for i in top_10_spam]
    spam_scores = [log_prob_diff[i] for i in top_10_spam]
    
    plt.barh(range(len(spam_features)), spam_scores, color='red', alpha=0.7)
    plt.yticks(range(len(spam_features)), spam_features)
    plt.xlabel('Log Probability Difference')
    plt.title('Top 10 SPAM Indicators')

# 8. Feature Importance (Top HAM indicators)
plt.subplot(3, 4, 8)
if 'log_prob_diff' in locals():
    top_10_ham = log_prob_diff.argsort()[:10]
    ham_features = [feature_names[i] for i in top_10_ham]
    ham_scores = [abs(log_prob_diff[i]) for i in top_10_ham]  # Use absolute values for better visualization
    
    plt.barh(range(len(ham_features)), ham_scores, color='green', alpha=0.7)
    plt.yticks(range(len(ham_features)), ham_features)
    plt.xlabel('Log Probability Difference (Absolute)')
    plt.title('Top 10 HAM Indicators')

# 9. Precision-Recall Curve
plt.subplot(3, 4, 9)
for name, results in nb_results.items():
    if 'TF-IDF' in name:  # Only plot TF-IDF models
        precision_vals, recall_vals, _ = precision_recall_curve(y_test_text, results['probabilities'])
        avg_precision = average_precision_score(y_test_text, results['probabilities'])
        plt.plot(recall_vals, precision_vals, 
                label=f"{name.replace(' (TF-IDF)', '')} (AP={avg_precision:.3f})", linewidth=2)

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curves')
plt.legend()

# 10. Error Distribution
plt.subplot(3, 4, 10)
error_counts = pd.Series([cm[0,1], cm[1,0]], index=['False Positive', 'False Negative'])
colors = ['orange', 'red']
wedges, texts, autotexts = plt.pie(error_counts.values, labels=error_counts.index, 
                                  autopct='%1.1f%%', colors=colors, startangle=90)
plt.title(f'Error Distribution\n{best_model_name}')

# 11. Model Performance Heatmap
plt.subplot(3, 4, 11)
performance_matrix = results_df[['accuracy', 'precision', 'recall', 'f1', 'auc']].T
sns.heatmap(performance_matrix, annot=True, fmt='.3f', cmap='RdYlBu_r', 
            cbar_kws={'label': 'Score'})
plt.title('Performance Heatmap')
plt.ylabel('Metrics')
plt.xlabel('Models')

# 12. Hyperparameter Tuning Results
plt.subplot(3, 4, 12)
if 'grid_search' in locals():
    # Show improvement from hyperparameter tuning
    default_score = cv_scores.mean()
    tuned_score = grid_search.best_score_
    improvement = tuned_score - default_score
    
    categories = ['Default', 'Tuned']
    scores = [default_score, tuned_score]
    colors = ['lightblue', 'darkblue']
    
    bars = plt.bar(categories, scores, color=colors, alpha=0.8)
    plt.ylabel('F1-Score')
    plt.title('Hyperparameter Tuning Impact')
    plt.ylim(min(scores) - 0.1, max(scores) + 0.1)
    
    # Add improvement annotation
    plt.annotate(f'{improvement:+.4f}',  # signed format avoids "+-0.0010" when tuning does not help
                xy=(1, tuned_score), xytext=(1, tuned_score + 0.05),
                ha='center', va='bottom',
                arrowprops=dict(arrowstyle='->', color='red'),
                fontsize=12, color='red', weight='bold')
    
    # Add value labels
    for bar, score in zip(bars, scores):
        plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
                f'{score:.4f}', ha='center', va='bottom', fontsize=10)

plt.tight_layout()
plt.show()

# Print summary statistics
print("\n📈 Final Summary Statistics")
print("="*50)

print(f"🎯 Best Model: {best_model_name}")
print(f"📊 Performance Metrics:")
print(f"   • Accuracy: {results_df.loc[best_model_name, 'accuracy']:.4f}")
print(f"   • Precision: {results_df.loc[best_model_name, 'precision']:.4f}")
print(f"   • Recall: {results_df.loc[best_model_name, 'recall']:.4f}")
print(f"   • F1-Score: {results_df.loc[best_model_name, 'f1']:.4f}")
print(f"   • AUC-ROC: {results_df.loc[best_model_name, 'auc']:.4f}")

print(f"\n⚡ Performance Characteristics:")
print(f"   • Training Time: {results_df.loc[best_model_name, 'train_time']:.4f} seconds")
print(f"   • Prediction Time: {results_df.loc[best_model_name, 'predict_time']:.4f} seconds")

if 'grid_search' in locals():
    print(f"\n🔧 Hyperparameter Tuning:")
    print(f"   • Best Parameters: {grid_search.best_params_}")
    print(f"   • Improvement: {improvement:+.4f} F1-Score")

print(f"\n📋 Error Profile (from the confusion matrix):")
print(f"   • False Positive Rate: {cm[0,1]/(cm[0,0]+cm[0,1]):.4f}")
print(f"   • False Negative Rate: {cm[1,0]/(cm[1,0]+cm[1,1]):.4f}")
print(f"   • Spam Detection Rate: {cm[1,1]/(cm[1,0]+cm[1,1]):.4f}")
print(f"   • Ham Detection Rate: {cm[0,0]/(cm[0,0]+cm[0,1]):.4f}")

print(f"\n✅ Naive Bayes Analysis Complete!")
print(f"Test-set F1 of the selected model: {results_df.loc[best_model_name, 'f1']:.4f}")
print(f"Key insight: {best_model_name.split()[0]} Naive Bayes with {best_model_name.split('(')[1].split(')')[0]} features scored the highest test F1.")
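
The error rates printed at the end of the cell above all come from the four cells of the binary confusion matrix. As a small worked example, the sketch below uses made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical, with 1 = spam) and unpacks the matrix with `ravel()`, which returns the counts in the order TN, FP, FN, TP for labels `[0, 1]`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 6 ham (0) and 4 spam (1) messages
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])

# Rows are actual classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

rates = {
    "false_positive_rate": fp / (fp + tn),  # ham wrongly flagged as spam
    "false_negative_rate": fn / (fn + tp),  # spam that slipped through
    "recall": tp / (tp + fn),               # spam detection rate
    "specificity": tn / (tn + fp),          # ham detection rate
    "precision": tp / (tp + fp),            # flagged messages that really are spam
}

for rate_name, value in rates.items():
    print(f"{rate_name:>20}: {value:.3f}")
```

For a spam filter, the false positive rate is usually the number to watch most closely, since a legitimate message sent to the spam folder tends to cost more than a spam message that reaches the inbox.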

# Part C: Model Comparison and Analysis

## Comparing KNN vs Naive Bayes

In this section, we'll compare the performance characteristics, strengths, and weaknesses of KNN and Naive Bayes algorithms based on our implementations.
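
One of the characteristics compared below is where each algorithm spends its time: KNN defers the work to prediction, while Naive Bayes does its work during fitting. The following sketch measures both phases on Iris with default-style settings. It is a rough illustration under the assumption of `k=5` and a fresh 80/20 split; the helper `time_model` and the variable names are made up for this example, and on 150 samples the timings are tiny and noisy, so only the pattern on larger data is meaningful.

```python
import time

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

def time_model(model):
    """Fit and predict once, returning wall-clock times and test accuracy."""
    start = time.perf_counter()
    model.fit(X_tr, y_tr)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    predictions = model.predict(X_te)
    predict_seconds = time.perf_counter() - start

    accuracy = float((predictions == y_te).mean())
    return {"fit_s": fit_seconds, "predict_s": predict_seconds, "accuracy": accuracy}

timings = {
    "KNN (k=5)": time_model(KNeighborsClassifier(n_neighbors=5)),
    "Gaussian NB": time_model(GaussianNB()),
}

for model_name, stats in timings.items():
    print(f"{model_name:12} fit={stats['fit_s']:.5f}s "
          f"predict={stats['predict_s']:.5f}s accuracy={stats['accuracy']:.3f}")
```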

In [None]:
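# Comprehensive Model Comparison: KNN vs Naive Bayes

# Names used in this cell that the initial import cell does not provide
import time
from sklearn.preprocessing import MinMaxScaler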

print("🔬 COMPREHENSIVE MODEL COMPARISON")
print("="*60)
print("Analyzing KNN vs Naive Bayes across multiple dimensions")
print("="*60)

# Summary of results from both algorithms
print("\n📊 PERFORMANCE SUMMARY")
print("-" * 40)

# KNN Results Summary (from Iris dataset)
print("🌸 K-Nearest Neighbors (Iris Dataset):")
print(f"   Best k value: {best_k}")
print(f"   Best accuracy: {best_accuracy:.4f}")
print(f"   Best k validation accuracy: {knn_results[best_k]['val_accuracy']:.4f}")
print(f"   Training time: negligible (lazy learning only stores the training set)")
print(f"   Prediction time: grows with training-set size (distances to stored points)")

# Naive Bayes Results Summary (from SMS dataset)
print(f"\n📱 Naive Bayes (SMS Spam Dataset):")
print(f"   Best model: {best_model_name}")
print(f"   Best F1-score: {results_df.loc[best_model_name, 'f1']:.4f}")
print(f"   Best accuracy: {results_df.loc[best_model_name, 'accuracy']:.4f}")
print(f"   Training time: {results_df.loc[best_model_name, 'train_time']:.4f} seconds")
print(f"   Prediction time: {results_df.loc[best_model_name, 'predict_time']:.4f} seconds")

# Like-for-like comparison using the Iris dataset for both algorithms
print("\n🔄 FAIR COMPARISON ON SAME DATASET (Iris)")
print("-" * 50)

# Test Naive Bayes on Iris dataset for fair comparison
print("Testing Naive Bayes variants on Iris dataset...")

# Prepare Iris data for Naive Bayes
X_iris_full = iris_df.drop('species', axis=1)
y_iris_full = iris_df['species']

# Convert target to numeric for consistency
iris_label_encoder = LabelEncoder()
y_iris_encoded = iris_label_encoder.fit_transform(y_iris_full)

# Train-test split
X_train_iris_nb, X_test_iris_nb, y_train_iris_nb, y_test_iris_nb = train_test_split(
    X_iris_full, y_iris_encoded, test_size=0.2, random_state=42, stratify=y_iris_encoded
)

# Test different Naive Bayes variants on Iris
iris_nb_results = {}

nb_variants = {
    'Gaussian NB': GaussianNB(),
    'Multinomial NB': MultinomialNB(),  # Note: requires non-negative features
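    'Bernoulli NB': BernoulliNB()  # Note: default binarize=0.0 maps every (strictly positive) Iris feature to 1, so expect near-chance accuracy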
}

for name, classifier in nb_variants.items():
    try:
        # For Multinomial NB, we need to ensure non-negative features
        if name == 'Multinomial NB':
            # MinMax scale to ensure non-negative values
            scaler = MinMaxScaler()
            X_train_scaled = scaler.fit_transform(X_train_iris_nb)
            X_test_scaled = scaler.transform(X_test_iris_nb)
        else:
            X_train_scaled = X_train_iris_nb
            X_test_scaled = X_test_iris_nb
        
        # Train and evaluate
        start_time = time.time()
        classifier.fit(X_train_scaled, y_train_iris_nb)
        train_time = time.time() - start_time
        
        start_time = time.time()
        y_pred = classifier.predict(X_test_scaled)
        predict_time = time.time() - start_time
        
        accuracy = accuracy_score(y_test_iris_nb, y_pred)
        f1 = f1_score(y_test_iris_nb, y_pred, average='weighted')
        
        iris_nb_results[name] = {
            'accuracy': accuracy,
            'f1': f1,
            'train_time': train_time,
            'predict_time': predict_time
        }
        
        print(f"   {name}: Accuracy = {accuracy:.4f}, F1 = {f1:.4f}")
        
    except Exception as e:
        print(f"   {name}: Failed ({str(e)})")

# Find best Naive Bayes for Iris
best_nb_iris = max(iris_nb_results.items(), key=lambda x: x[1]['accuracy'])
print(f"\nBest Naive Bayes on Iris: {best_nb_iris[0]} (Accuracy: {best_nb_iris[1]['accuracy']:.4f})")

# Detailed comparison table
print("\n📋 DETAILED COMPARISON TABLE")
print("-" * 70)

comparison_data = {
    'Aspect': [
        'Algorithm Type',
        'Learning Paradigm', 
        'Training Time Complexity',
        'Prediction Time Complexity',
        'Memory Requirements',
        'Parameter Tuning',
        'Feature Scaling Sensitivity',
        'Missing Data Handling',
        'Interpretability',
        'Probabilistic Output',
        'Decision Boundary',
        'Curse of Dimensionality',
        'Noise Sensitivity',
        'Best Use Cases'
    ],
    'K-Nearest Neighbors': [
        'Instance-based/Lazy Learning',
        'Non-parametric',
        'O(1) - just stores the data',
        'O(n*d) per query (brute force)', 
        'O(n*d) - keeps all training data',
        'k value, distance metric',
        'Very sensitive - requires scaling',
        'Needs imputation first',
        'High - shows actual neighbors',
        'Vote fractions via predict_proba',
        'Complex, non-linear boundaries',
        'Suffers significantly',
        'High - affected by outliers',
        'Small datasets, complex boundaries'
    ],
    'Naive Bayes': [
        'Probabilistic/Eager Learning',
        'Parametric',
        'O(n*d) - one pass to count/average',
        'O(c*d) per query, independent of n',
        'O(c*d) - per-class parameters',
        'Smoothing parameter (alpha)',
        'Generally robust to scaling',
        'Can be skipped in theory; scikit-learn needs imputation',
        'Medium - shows feature probabilities', 
        'Yes, but often poorly calibrated',
        'Linear (Multinomial/Bernoulli); quadratic (Gaussian)',
        'Performs well in high dimensions',
        'Low - robust to irrelevant features',
        'Text classification, high dimensions'
    ]
}

comparison_df = pd.DataFrame(comparison_data)
for i, row in comparison_df.iterrows():
    print(f"{row['Aspect']:30} | {row['K-Nearest Neighbors']:35} | {row['Naive Bayes']}")

# Performance comparison visualization
print("\n📊 PERFORMANCE VISUALIZATION")
print("-" * 40)

fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('KNN vs Naive Bayes: Comprehensive Comparison', fontsize=16, fontweight='bold')

# 1. Accuracy Comparison on Iris
ax1 = axes[0, 0]
models = ['KNN (best k)', 'Gaussian NB', 'Multinomial NB', 'Bernoulli NB']
accuracies = [
    best_accuracy,
    iris_nb_results.get('Gaussian NB', {}).get('accuracy', 0),
    iris_nb_results.get('Multinomial NB', {}).get('accuracy', 0),
    iris_nb_results.get('Bernoulli NB', {}).get('accuracy', 0)
]
colors = ['steelblue', 'forestgreen', 'orange', 'purple']
bars = ax1.bar(models, accuracies, color=colors, alpha=0.8)
ax1.set_ylabel('Accuracy')
ax1.set_title('Accuracy Comparison on Iris Dataset')
ax1.set_ylim(0, 1)
plt.setp(ax1.get_xticklabels(), rotation=45, ha='right')

# Add value labels
for bar, acc in zip(bars, accuracies):
    if acc > 0:
        ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
                f'{acc:.3f}', ha='center', va='bottom', fontsize=10)

# 2. Training Time Comparison
ax2 = axes[0, 1]
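# Measure KNN fit time on the same split instead of assuming a constant
knn_fit_timer = KNeighborsClassifier(n_neighbors=best_k)
start_time = time.time()
knn_fit_timer.fit(X_train_iris_nb, y_train_iris_nb)
knn_train_time = time.time() - start_time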
nb_train_times = [iris_nb_results.get(model, {}).get('train_time', 0) 
                  for model in ['Gaussian NB', 'Multinomial NB', 'Bernoulli NB']]
all_train_times = [knn_train_time] + nb_train_times

bars = ax2.bar(models, all_train_times, color=colors, alpha=0.8)
ax2.set_ylabel('Training Time (seconds)')
ax2.set_title('Training Time Comparison')
plt.setp(ax2.get_xticklabels(), rotation=45, ha='right')

# 3. Prediction Time Comparison  
ax3 = axes[0, 2]
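# Measure KNN prediction time on the same split instead of assuming a constant
knn_pred_timer = KNeighborsClassifier(n_neighbors=best_k).fit(X_train_iris_nb, y_train_iris_nb)
start_time = time.time()
knn_pred_timer.predict(X_test_iris_nb)
knn_pred_time = time.time() - start_time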
nb_pred_times = [iris_nb_results.get(model, {}).get('predict_time', 0) 
                 for model in ['Gaussian NB', 'Multinomial NB', 'Bernoulli NB']]
all_pred_times = [knn_pred_time] + nb_pred_times

bars = ax3.bar(models, all_pred_times, color=colors, alpha=0.8)
ax3.set_ylabel('Prediction Time (seconds)')
ax3.set_title('Prediction Time Comparison')
plt.setp(ax3.get_xticklabels(), rotation=45, ha='right')

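# 4. Strengths and Weaknesses Radar Chart
# A radar chart needs polar axes, so swap the Cartesian subplot for a polar one
axes[1, 0].remove()
ax4 = fig.add_subplot(2, 3, 4, projection='polar')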
categories = ['Accuracy', 'Speed', 'Interpretability', 'Scalability', 'Robustness']
knn_scores = [0.9, 0.3, 0.9, 0.2, 0.4]  # Subjective scoring
nb_scores = [0.8, 0.9, 0.7, 0.9, 0.8]

angles = np.linspace(0, 2*np.pi, len(categories), endpoint=False).tolist()
angles += angles[:1]  # Complete the circle
knn_scores += knn_scores[:1]
nb_scores += nb_scores[:1]

ax4.plot(angles, knn_scores, 'o-', linewidth=2, label='KNN', color='steelblue')
ax4.fill(angles, knn_scores, alpha=0.25, color='steelblue')
ax4.plot(angles, nb_scores, 'o-', linewidth=2, label='Naive Bayes', color='forestgreen')
ax4.fill(angles, nb_scores, alpha=0.25, color='forestgreen')

ax4.set_xticks(angles[:-1])
ax4.set_xticklabels(categories)
ax4.set_ylim(0, 1)
ax4.set_title('Strengths Comparison (Subjective)')
ax4.legend()
ax4.grid(True)

# 5. Dataset Suitability
ax5 = axes[1, 1]
scenarios = ['Small Dataset', 'Large Dataset', 'High Dimensions', 'Text Data', 'Noisy Data']
knn_suitability = [0.9, 0.3, 0.2, 0.4, 0.3]
nb_suitability = [0.7, 0.9, 0.9, 0.95, 0.8]

x = np.arange(len(scenarios))
width = 0.35

bars1 = ax5.bar(x - width/2, knn_suitability, width, label='KNN', color='steelblue', alpha=0.8)
bars2 = ax5.bar(x + width/2, nb_suitability, width, label='Naive Bayes', color='forestgreen', alpha=0.8)

ax5.set_ylabel('Suitability Score')
ax5.set_title('Dataset Suitability Comparison')
ax5.set_xticks(x)
ax5.set_xticklabels(scenarios, rotation=45, ha='right')
ax5.legend()
ax5.set_ylim(0, 1)

# 6. Decision Boundary Illustration
ax6 = axes[1, 2]
# Create a simple 2D example to show decision boundaries
np.random.seed(42)
n_samples = 100

# Generate 2D data for visualization
class1_x = np.random.normal(2, 0.8, n_samples//2)
class1_y = np.random.normal(2, 0.8, n_samples//2)
class2_x = np.random.normal(6, 0.8, n_samples//2)
class2_y = np.random.normal(6, 0.8, n_samples//2)

ax6.scatter(class1_x, class1_y, c='blue', alpha=0.6, label='Class 1', s=50)
ax6.scatter(class2_x, class2_y, c='red', alpha=0.6, label='Class 2', s=50)

# Simple illustration of different decision boundaries
x_line = np.linspace(0, 8, 100)
# Linear boundary (Naive Bayes-like): perpendicular bisector of the class centers (2, 2) and (6, 6)
y_linear = 8 - x_line
ax6.plot(x_line, y_linear, 'g--', linewidth=3, label='NB-style (Linear)', alpha=0.8)

# Non-linear boundary (KNN-like): the same separator with local wiggles
y_nonlinear = 8 - x_line + 0.8 * np.sin(3 * x_line)
ax6.plot(x_line, y_nonlinear, 'orange', linewidth=3, label='KNN-style (Complex)', alpha=0.8)

ax6.set_xlabel('Feature 1')
ax6.set_ylabel('Feature 2')
ax6.set_title('Decision Boundary Styles')
ax6.legend()
ax6.grid(True, alpha=0.3)
ax6.set_xlim(0, 8)
ax6.set_ylim(0, 8)

plt.tight_layout()
plt.show()

# Final recommendations
print("\n🎯 ALGORITHM SELECTION RECOMMENDATIONS")
print("="*50)

print("🔸 Choose KNN when:")
print("   • Small to medium dataset size")
print("   • Complex, non-linear decision boundaries expected")
print("   • High interpretability needed (see actual neighbors)")
print("   • Local patterns are important")
print("   • Sufficient computational resources for prediction")

print("\n🔸 Choose Naive Bayes when:")
print("   • Large datasets with many features")
print("   • Text classification or categorical data")
print("   • Fast prediction time is critical")
print("   • Limited computational resources")
print("   • Features are relatively independent")
print("   • Probabilistic outputs are needed")

print("\n🔸 Dataset Characteristics:")
print("   • Text/NLP tasks: Naive Bayes (especially Multinomial)")
print("   • Image classification: KNN (with proper features)")
print("   • Recommendation systems: KNN (collaborative filtering)")
print("   • Spam detection: Naive Bayes")
print("   • Medical diagnosis: Both (depends on data size)")

print(f"\n✅ COMPARISON ANALYSIS COMPLETE!")
print(f"Both algorithms have demonstrated strong performance in their respective domains.")
print(f"The choice depends on your specific use case, data characteristics, and requirements.")
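
The decision-boundary panel above draws hand-picked curves for illustration. To see the boundaries that fitted models actually produce, one option is to predict on a dense grid and compare the resulting label maps. The sketch below does this for synthetic two-cluster data similar to the panel's; the data, the choice of `k=5`, and names such as `region_maps` are assumptions for this example only.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Two synthetic Gaussian clusters centered at (2, 2) and (6, 6)
rng = np.random.default_rng(42)
X_blobs = np.vstack([
    rng.normal(2, 0.8, size=(50, 2)),
    rng.normal(6, 0.8, size=(50, 2)),
])
y_blobs = np.array([0] * 50 + [1] * 50)

# Dense grid covering the plotting window [0, 8] x [0, 8]
grid_x, grid_y = np.meshgrid(np.linspace(0, 8, 81), np.linspace(0, 8, 81))
grid_points = np.column_stack([grid_x.ravel(), grid_y.ravel()])

boundary_models = {
    "KNN (k=5)": KNeighborsClassifier(n_neighbors=5),
    "Gaussian NB": GaussianNB(),
}

# Predicted class for every grid point, reshaped back to the grid layout
region_maps = {}
for label, model in boundary_models.items():
    model.fit(X_blobs, y_blobs)
    region_maps[label] = model.predict(grid_points).reshape(grid_x.shape)

# Fraction of the window where the two models assign different classes
disagreement = np.mean(region_maps["KNN (k=5)"] != region_maps["Gaussian NB"])
print(f"Grid points where the models disagree: {disagreement:.2%}")
```

Each map can be drawn with `plt.contourf(grid_x, grid_y, region_maps[label], alpha=0.3)` underneath a scatter plot of the points; the edge between the two colors is the model's decision boundary.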

# Part D: Bonus Tasks

## Additional Implementations and Advanced Analysis

This section covers bonus implementations including:
1. Gaussian Naive Bayes on Iris dataset
2. Advanced cross-validation analysis
3. Feature importance and selection
4. Performance optimization techniques
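
As a preview of the cross-validation task, the sketch below runs a stratified 10-fold evaluation with several metrics at once. It is a minimal example under a few assumptions: it reloads Iris directly, fixes `k=5`, and wraps the scaler and KNN in a pipeline (`knn_pipeline`, a name invented here) so that scaling statistics are computed from each training fold rather than from the full dataset.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_cv_demo, y_cv_demo = load_iris(return_X_y=True)

# Scaling happens inside the pipeline, so it is refit on every training fold
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
cv_out = cross_validate(
    knn_pipeline, X_cv_demo, y_cv_demo,
    cv=folds,
    scoring=["accuracy", "f1_weighted"],
    return_train_score=True,
)

# cross_validate returns one array per metric, with one entry per fold
for key in ("test_accuracy", "test_f1_weighted", "train_accuracy"):
    print(f"{key}: {cv_out[key].mean():.4f} ± {cv_out[key].std():.4f}")
```

A large gap between `train_accuracy` and `test_accuracy` is a quick sign of overfitting, which for KNN typically shows up at very small values of k.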

In [None]:
# Bonus Tasks Implementation

print("🎁 BONUS TASKS IMPLEMENTATION")
print("="*50)

# Bonus Task 1: Advanced Cross-Validation Analysis
print("\n🔄 Bonus Task 1: Advanced Cross-Validation Analysis")
print("-" * 60)

from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.metrics import make_scorer

# Define multiple scoring metrics
scoring = {
    'accuracy': 'accuracy',
    'precision': make_scorer(precision_score, average='weighted'),
    'recall': make_scorer(recall_score, average='weighted'),
    'f1': make_scorer(f1_score, average='weighted')
}

# Advanced CV for KNN on Iris
print("📊 Advanced Cross-Validation for KNN (Iris Dataset)")
cv_folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

knn_cv_results = {}
k_values_cv = [3, 5, 7, 9, 11]

for k in k_values_cv:
    knn_cv = KNeighborsClassifier(n_neighbors=k)
    cv_results = cross_validate(
        knn_cv, X_iris_scaled, y_iris, 
        cv=cv_folds, scoring=scoring,
        return_train_score=True, n_jobs=-1
    )
    
    knn_cv_results[k] = {
        'test_accuracy': cv_results['test_accuracy'],
        'test_precision': cv_results['test_precision'],
        'test_recall': cv_results['test_recall'],
        'test_f1': cv_results['test_f1'],
        'train_accuracy': cv_results['train_accuracy'],
        'fit_time': cv_results['fit_time'],
        'score_time': cv_results['score_time']
    }
    
    print(f"k={k}: Accuracy={cv_results['test_accuracy'].mean():.4f} ± {cv_results['test_accuracy'].std():.4f}")

# Advanced CV for Naive Bayes on SMS
print("\n📱 Advanced Cross-Validation for Naive Bayes (SMS Dataset)")
nb_cv_results = {}

# Use the best features from previous analysis
best_vectorizer = tfidf_vectorizer if 'TF-IDF' in best_model_name else count_vectorizer
X_sms_features = best_vectorizer.fit_transform(X_text)

nb_models_cv = {
    'MultinomialNB': MultinomialNB(),
    'BernoulliNB': BernoulliNB(),
    'ComplementNB': ComplementNB()
}

for name, model in nb_models_cv.items():
    cv_results = cross_validate(
        model, X_sms_features, y_text_encoded,
        cv=cv_folds, scoring=scoring,
        return_train_score=True, n_jobs=-1
    )
    
    nb_cv_results[name] = cv_results
    print(f"{name}: F1={cv_results['test_f1'].mean():.4f} ± {cv_results['test_f1'].std():.4f}")

# Bonus Task 2: Feature Importance and Selection
print("\n🎯 Bonus Task 2: Feature Importance and Selection")
print("-" * 60)

from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

# Feature selection for SMS dataset
print("📝 Feature Selection for SMS Spam Detection")

# Chi-squared feature selection
print("\n1. Chi-squared Feature Selection:")
chi2_selector = SelectKBest(chi2, k=100)
X_chi2 = chi2_selector.fit_transform(X_train_tfidf, y_train_text)
chi2_scores = chi2_selector.scores_

# Get selected features
selected_features_chi2 = chi2_selector.get_support()
feature_names = tfidf_vectorizer.get_feature_names_out()
selected_feature_names = feature_names[selected_features_chi2]

print(f"Selected {len(selected_feature_names)} features out of {len(feature_names)}")
print("Top 10 features by Chi-squared score:")
top_chi2_indices = np.argsort(chi2_scores)[-10:][::-1]
for i, idx in enumerate(top_chi2_indices, 1):
    if selected_features_chi2[idx]:
        print(f"  {i:2d}. {feature_names[idx]:20} (χ² = {chi2_scores[idx]:.2f})")

# Mutual Information feature selection
print("\n2. Mutual Information Feature Selection:")
mi_selector = SelectKBest(mutual_info_classif, k=100)
X_mi = mi_selector.fit_transform(X_train_tfidf, y_train_text)
mi_scores = mi_selector.scores_

selected_features_mi = mi_selector.get_support()
print(f"Selected {sum(selected_features_mi)} features by mutual information")

# Test performance with feature selection
print("\n3. Performance with Feature Selection:")
nb_feature_test = MultinomialNB()

# Original features
nb_feature_test.fit(X_train_tfidf, y_train_text)
y_pred_original = nb_feature_test.predict(X_test_tfidf)
f1_original = f1_score(y_test_text, y_pred_original)

# Chi-squared selected features
X_test_chi2 = chi2_selector.transform(X_test_tfidf)
nb_feature_test.fit(X_chi2, y_train_text)
y_pred_chi2 = nb_feature_test.predict(X_test_chi2)
f1_chi2 = f1_score(y_test_text, y_pred_chi2)

# MI selected features
X_test_mi = mi_selector.transform(X_test_tfidf)
nb_feature_test.fit(X_mi, y_train_text)
y_pred_mi = nb_feature_test.predict(X_test_mi)
f1_mi = f1_score(y_test_text, y_pred_mi)

print(f"Original features F1-score: {f1_original:.4f}")
print(f"Chi-squared selection F1-score: {f1_chi2:.4f}")
print(f"Mutual information selection F1-score: {f1_mi:.4f}")

# Bonus Task 3: Gaussian Naive Bayes on Iris with detailed analysis
print("\n🌸 Bonus Task 3: Detailed Gaussian Naive Bayes Analysis on Iris")
print("-" * 70)

# Comprehensive Gaussian NB analysis
gnb_detailed = GaussianNB()

# Fit the model
gnb_detailed.fit(X_train_iris, y_train_iris)

# Predictions and probabilities
y_pred_gnb = gnb_detailed.predict(X_test_iris)
y_pred_proba_gnb = gnb_detailed.predict_proba(X_test_iris)

# Detailed metrics
accuracy_gnb = accuracy_score(y_test_iris, y_pred_gnb)
precision_gnb = precision_score(y_test_iris, y_pred_gnb, average='weighted')
recall_gnb = recall_score(y_test_iris, y_pred_gnb, average='weighted')
f1_gnb = f1_score(y_test_iris, y_pred_gnb, average='weighted')

print(f"Gaussian Naive Bayes Performance on Iris:")
print(f"  Accuracy: {accuracy_gnb:.4f}")
print(f"  Precision: {precision_gnb:.4f}")
print(f"  Recall: {recall_gnb:.4f}")
print(f"  F1-Score: {f1_gnb:.4f}")

# Confusion matrix
cm_gnb = confusion_matrix(y_test_iris, y_pred_gnb)
print(f"\nConfusion Matrix:")
print(cm_gnb)

# Classification report
print(f"\nDetailed Classification Report:")
print(classification_report(y_test_iris, y_pred_gnb, target_names=iris.target_names))

# Feature statistics analysis
print(f"\n📊 Feature Statistics by Class (Gaussian NB learns these):")
feature_names_iris = iris.feature_names

for class_idx, class_name in enumerate(iris.target_names):
    print(f"\n{class_name.upper()}:")
    class_mask = y_train_iris == class_idx
    class_data = X_train_iris[class_mask]
    
    for feature_idx, feature_name in enumerate(feature_names_iris):
        mean_val = class_data[:, feature_idx].mean()
        std_val = class_data[:, feature_idx].std()
        print(f"  {feature_name:20}: μ={mean_val:.3f}, σ={std_val:.3f}")

# Bonus Task 4: Performance Optimization and Scalability Analysis
print("\n⚡ Bonus Task 4: Performance Optimization and Scalability Analysis")
print("-" * 75)

import time
from sklearn.datasets import make_classification

# Generate datasets of different sizes
dataset_sizes = [100, 500, 1000, 5000, 10000]
performance_results = {'KNN': {}, 'NB': {}}

print("🔬 Scalability Analysis with Synthetic Datasets")

for size in dataset_sizes:
    print(f"\nDataset size: {size} samples")
    
    # Generate synthetic dataset
    X_synthetic, y_synthetic = make_classification(
        n_samples=size, n_features=20, n_informative=15,
        n_redundant=5, n_classes=3, random_state=42
    )
    
    # Train-test split
    X_train_syn, X_test_syn, y_train_syn, y_test_syn = train_test_split(
        X_synthetic, y_synthetic, test_size=0.2, random_state=42
    )
    
    # KNN Performance
    knn_syn = KNeighborsClassifier(n_neighbors=5)
    
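    # Training time (KNN is lazy: fit only stores the data or builds a search index)
    # perf_counter() is a monotonic, high-resolution clock suited to short timings
    start_time = time.perf_counter()
    knn_syn.fit(X_train_syn, y_train_syn)
    knn_train_time = time.perf_counter() - start_time
    
    # Prediction time (this is where KNN does its distance computations)
    start_time = time.perf_counter()
    y_pred_knn_syn = knn_syn.predict(X_test_syn)
    knn_pred_time = time.perf_counter() - start_time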
    
    knn_accuracy_syn = accuracy_score(y_test_syn, y_pred_knn_syn)
    
    # Naive Bayes Performance
    nb_syn = GaussianNB()
    
    # Training time (estimates per-class means and variances)
    start_time = time.perf_counter()
    nb_syn.fit(X_train_syn, y_train_syn)
    nb_train_time = time.perf_counter() - start_time
    
    # Prediction time (independent of training set size)
    start_time = time.perf_counter()
    y_pred_nb_syn = nb_syn.predict(X_test_syn)
    nb_pred_time = time.perf_counter() - start_time
    
    nb_accuracy_syn = accuracy_score(y_test_syn, y_pred_nb_syn)
    
    # Store results
    performance_results['KNN'][size] = {
        'train_time': knn_train_time,
        'pred_time': knn_pred_time,
        'accuracy': knn_accuracy_syn
    }
    
    performance_results['NB'][size] = {
        'train_time': nb_train_time,
        'pred_time': nb_pred_time,
        'accuracy': nb_accuracy_syn
    }
    
    print(f"  KNN: Train={knn_train_time:.4f}s, Pred={knn_pred_time:.4f}s, Acc={knn_accuracy_syn:.4f}")
    print(f"  NB:  Train={nb_train_time:.4f}s, Pred={nb_pred_time:.4f}s, Acc={nb_accuracy_syn:.4f}")

# Bonus Task 5: Advanced Visualization and Insights
print("\n📊 Bonus Task 5: Advanced Visualization and Insights")
print("-" * 60)

# Create comprehensive bonus visualizations
fig, axes = plt.subplots(3, 3, figsize=(20, 18))
fig.suptitle('Bonus Tasks: Advanced Analysis and Visualizations', fontsize=16, fontweight='bold')

# 1. Cross-validation stability
ax1 = axes[0, 0]
for k in k_values_cv:
    cv_scores = knn_cv_results[k]['test_accuracy']
    ax1.plot([k]*len(cv_scores), cv_scores, 'o', alpha=0.6, label=f'k={k}')
    ax1.plot(k, cv_scores.mean(), 's', markersize=10, color='red')

ax1.set_xlabel('k value')
ax1.set_ylabel('Cross-validation Accuracy')
ax1.set_title('KNN Cross-Validation Stability')
ax1.grid(True, alpha=0.3)

# 2. Feature selection comparison
ax2 = axes[0, 1]
methods = ['Original', 'Chi-squared', 'Mutual Info']
f1_scores_fs = [f1_original, f1_chi2, f1_mi]
colors = ['blue', 'green', 'orange']
bars = ax2.bar(methods, f1_scores_fs, color=colors, alpha=0.8)
ax2.set_ylabel('F1-Score')
ax2.set_title('Feature Selection Impact')
for bar, score in zip(bars, f1_scores_fs):
    ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
             f'{score:.3f}', ha='center', va='bottom')

# 3. Gaussian NB confusion matrix
ax3 = axes[0, 2]
sns.heatmap(cm_gnb, annot=True, fmt='d', cmap='Blues',
            xticklabels=iris.target_names, yticklabels=iris.target_names, ax=ax3)
ax3.set_title('Gaussian NB Confusion Matrix (Iris)')
ax3.set_ylabel('Actual')
ax3.set_xlabel('Predicted')

# 4. Scalability - Training Time
ax4 = axes[1, 0]
sizes = list(dataset_sizes)
knn_train_times = [performance_results['KNN'][s]['train_time'] for s in sizes]
nb_train_times = [performance_results['NB'][s]['train_time'] for s in sizes]

ax4.plot(sizes, knn_train_times, 'o-', label='KNN', linewidth=2, markersize=8)
ax4.plot(sizes, nb_train_times, 's-', label='Naive Bayes', linewidth=2, markersize=8)
ax4.set_xlabel('Dataset Size')
ax4.set_ylabel('Training Time (seconds)')
ax4.set_title('Training Time Scalability')
ax4.legend()
ax4.grid(True, alpha=0.3)

# 5. Scalability - Prediction Time
ax5 = axes[1, 1]
knn_pred_times = [performance_results['KNN'][s]['pred_time'] for s in sizes]
nb_pred_times = [performance_results['NB'][s]['pred_time'] for s in sizes]

ax5.plot(sizes, knn_pred_times, 'o-', label='KNN', linewidth=2, markersize=8)
ax5.plot(sizes, nb_pred_times, 's-', label='Naive Bayes', linewidth=2, markersize=8)
ax5.set_xlabel('Dataset Size')
ax5.set_ylabel('Prediction Time (seconds)')
ax5.set_title('Prediction Time Scalability')
ax5.legend()
ax5.grid(True, alpha=0.3)

# 6. Feature importance (Chi-squared scores)
ax6 = axes[1, 2]
top_10_chi2 = np.argsort(chi2_scores)[-10:][::-1]
top_features = [feature_names[i] for i in top_10_chi2]
top_scores = [chi2_scores[i] for i in top_10_chi2]

ax6.barh(range(len(top_features)), top_scores, color='green', alpha=0.7)
ax6.set_yticks(range(len(top_features)))
ax6.set_yticklabels(top_features)
ax6.set_xlabel('Chi-squared Score')
ax6.set_title('Top 10 Features (Chi-squared)')

# 7. Cross-validation score distribution
ax7 = axes[2, 0]
all_cv_scores = []
all_cv_labels = []

for name, results in nb_cv_results.items():
    all_cv_scores.extend(results['test_f1'])
    all_cv_labels.extend([name] * len(results['test_f1']))

sns.boxplot(x=all_cv_labels, y=all_cv_scores, ax=ax7)
ax7.set_ylabel('F1-Score')
ax7.set_title('NB Cross-Validation Score Distribution')
ax7.tick_params(axis='x', rotation=45)

# 8. Accuracy vs Dataset Size
ax8 = axes[2, 1]
knn_accuracies = [performance_results['KNN'][s]['accuracy'] for s in sizes]
nb_accuracies = [performance_results['NB'][s]['accuracy'] for s in sizes]

ax8.plot(sizes, knn_accuracies, 'o-', label='KNN', linewidth=2, markersize=8)
ax8.plot(sizes, nb_accuracies, 's-', label='Naive Bayes', linewidth=2, markersize=8)
ax8.set_xlabel('Dataset Size')
ax8.set_ylabel('Accuracy')
ax8.set_title('Accuracy vs Dataset Size')
ax8.legend()
ax8.grid(True, alpha=0.3)

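# 9. Feature correlation with target (Iris)
# Note: Pearson correlation treats the integer class codes (0, 1, 2) as ordinal,
# so read this as a rough heuristic rather than a proper feature-importance measure.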
ax9 = axes[2, 2]
iris_corr = []
for i in range(X_iris_scaled.shape[1]):
    corr = np.corrcoef(X_iris_scaled[:, i], y_iris)[0, 1]
    iris_corr.append(abs(corr))

ax9.bar(range(len(iris_corr)), iris_corr, color='purple', alpha=0.7)
ax9.set_xticks(range(len(iris_corr)))
ax9.set_xticklabels([name.split(' (')[0] for name in iris.feature_names], rotation=45)
ax9.set_ylabel('Absolute Correlation with Target')
ax9.set_title('Feature-Target Correlation (Iris)')

plt.tight_layout()
plt.show()

# Final bonus summary
print("\n🎉 BONUS TASKS SUMMARY")
print("="*50)
print("✅ Advanced Cross-Validation: Completed with 10-fold stratified CV")
print("✅ Feature Selection: Chi-squared and Mutual Information methods tested")
print("✅ Gaussian NB Analysis: Detailed performance on Iris dataset")
print("✅ Scalability Analysis: Performance tested on datasets up to 10,000 samples")
print("✅ Advanced Visualizations: Comprehensive plots for all analyses")

print("\n🏆 Key Insights from Bonus Tasks:")
print(f"   • Feature selection: F1 = {f1_chi2:.4f} with 100 chi-squared features vs {f1_original:.4f} with all features")
print("   • Naive Bayes prediction cost does not depend on training set size, so it scales better than KNN")
print("   • Cross-validation shows consistent performance across folds")
print(f"   • Gaussian NB reached {accuracy_gnb:.4f} test accuracy on the Iris dataset")
print("   • KNN prediction time grows with training set size (linear per query for brute-force search)")

print(f"\n✨ BONUS TASKS COMPLETED SUCCESSFULLY!")
print(f"All advanced analyses demonstrate the robustness and versatility of both algorithms.")

# Assignment Conclusion

## Summary of Learning Outcomes

This assignment implemented, evaluated, and compared K-Nearest Neighbors and Naive Bayes on two datasets: Iris (KNN and Gaussian Naive Bayes) and the SMS Spam Collection (Naive Bayes variants for text classification).

### Key Achievements:
- ✅ **Theoretical Understanding**: Comprehensive coverage of algorithm principles, mathematical foundations, and practical considerations
- ✅ **Practical Implementation**: Successful implementation of KNN on Iris dataset and multiple Naive Bayes variants on SMS spam dataset
- ✅ **Performance Analysis**: Detailed evaluation using multiple metrics including accuracy, precision, recall, F1-score, and AUC-ROC
- ✅ **Comparative Study**: Thorough comparison of algorithms across various dimensions including accuracy, scalability, and use cases
- ✅ **Advanced Techniques**: Implementation of cross-validation, hyperparameter tuning, feature selection, and performance optimization
- ✅ **Visualization**: Comprehensive plots and charts for better understanding of algorithm behavior and performance
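
As a compact illustration of the feature selection and cross-validation techniques listed above, the following sketch chains TF-IDF vectorization, chi-squared selection, and Multinomial Naive Bayes in a single `Pipeline`, so that the vocabulary and the selected features are re-fitted on the training portion of every fold. The tiny corpus, its labels, and the choice of `k=5` are hypothetical placeholders, not the SMS Spam Collection or the settings used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus (1 = spam, 0 = ham), for illustration only
texts = [
    "win a free prize now", "free cash offer call now",
    "claim your free reward today", "urgent prize waiting call now",
    "are we still meeting for lunch", "see you at home tonight",
    "can you call me after work", "thanks for the notes from class",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Vectorizer, selector, and classifier are fitted together,
# so no information from a validation fold leaks into training
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("select", SelectKBest(chi2, k=5)),   # k=5 is an arbitrary choice for this toy data
    ("nb", MultinomialNB()),
])

# Stratified folds keep one spam and one ham message in each validation fold
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="f1")
print(f"Per-fold F1: {scores}")

# Fit on all data and classify a new (hypothetical) message
pipeline.fit(texts, labels)
prediction = pipeline.predict(["free prize call now"])
print(f"Predicted label: {prediction[0]}")
```

With a corpus this small the scores are not meaningful; the point is only the structure, which carries over unchanged to the full SMS dataset.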

### Skills Demonstrated:
1. **Data Preprocessing**: Text cleaning, vectorization, feature scaling, and data preparation
2. **Model Implementation**: Proper use of scikit-learn classifiers with appropriate parameters
3. **Evaluation Methodology**: Cross-validation, confusion matrices, classification reports, and ROC analysis
4. **Comparative Analysis**: Systematic comparison of different algorithms and their variants
5. **Performance Optimization**: Hyperparameter tuning and feature selection techniques
6. **Visualization**: Creating informative plots for data exploration and result presentation
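
The preprocessing, tuning, and evaluation skills above can be combined in one short sketch: scaling and KNN are wrapped in a `Pipeline` and tuned with `GridSearchCV` on Iris, with a held-out test set scored only once at the end. The parameter grid, the 80/20 split, and the 5-fold setting are assumptions chosen for illustration and may differ from the values used in Part A.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling lives inside the pipeline, so each CV fold is scaled
# using statistics from its own training portion only
knn_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Illustrative grid; odd k values reduce the chance of tied votes
param_grid = {
    "knn__n_neighbors": [1, 3, 5, 7, 9, 11],
    "knn__weights": ["uniform", "distance"],
}

search = GridSearchCV(
    knn_pipeline,
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="accuracy",
)
search.fit(X_train, y_train)

# The test set is touched once, after model selection is finished
test_accuracy = search.score(X_test, y_test)
print(f"Best parameters: {search.best_params_}")
print(f"Mean CV accuracy: {search.best_score_:.4f}")
print(f"Held-out test accuracy: {test_accuracy:.4f}")
```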

This assignment provides a solid foundation for understanding and applying these fundamental machine learning algorithms in real-world scenarios.