# Assignment 10: Support Vector Machine (SVM)

## Dataset: Mushroom Classification

**Objective:** Classify mushrooms as edible or poisonous using SVM.

**Topics Covered:**
- SVM Classifier
- Different Kernels (Linear, RBF, Polynomial, Sigmoid)
- Hyperparameter Tuning
- Model Comparison

---
## Step 1: Import Libraries and Load Data

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report

# Load the dataset
df = pd.read_csv('mushroom.csv')

print("Dataset loaded successfully!")
print("Shape:", df.shape)
print("\nFirst 5 rows:")
df.head()

---
## Step 2: Exploratory Data Analysis

In [None]:
# Check data info
print("=== Data Types ===")
print(df.dtypes)

print("\n=== Missing Values ===")
print(df.isnull().sum().sum())

In [None]:
# Target distribution
print("=== Class Distribution ===")
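# value_counts() orders by frequency; sort_index() gives a fixed alphabetical order
# (edible before poisonous) so the bar labels below match the counts
class_counts = df['class'].value_counts().sort_index()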
print(class_counts)

plt.figure(figsize=(8, 5))
plt.bar(['Edible', 'Poisonous'], class_counts.values, color=['green', 'red'])
plt.xlabel('Mushroom Class')
plt.ylabel('Count')
plt.title('Distribution of Mushroom Classes')
plt.savefig('class_distribution.png')
plt.show()

In [None]:
# Numerical columns distribution
numerical_cols = ['stalk_height', 'cap_diameter']

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

for i in range(len(numerical_cols)):
    col = numerical_cols[i]
    axes[i].hist(df[col], bins=30, color='steelblue', edgecolor='black')
    axes[i].set_title('Distribution of ' + col)
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Frequency')

plt.tight_layout()
plt.savefig('numerical_distributions.png')
plt.show()

In [None]:
# Boxplots
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

for i in range(len(numerical_cols)):
    col = numerical_cols[i]
    axes[i].boxplot(df[col].dropna())
    axes[i].set_title('Boxplot of ' + col)
    axes[i].set_ylabel(col)

plt.tight_layout()
plt.savefig('boxplots.png')
plt.show()

---
## Step 3: Data Preprocessing

In [None]:
# Drop unnamed column if exists
if 'Unnamed: 0' in df.columns:
    df = df.drop('Unnamed: 0', axis=1)

# Identify categorical columns
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
categorical_cols.remove('class')  # Remove target

print("Categorical columns:", len(categorical_cols))
print("Numerical columns:", len(numerical_cols))

In [None]:
# Encode categorical variables as integer codes
# (note: this imposes an arbitrary order on nominal categories)
print("=== Encoding Categorical Variables ===")

df_encoded = df.copy()
label_encoders = {}

for col in categorical_cols:
    le = LabelEncoder()
    df_encoded[col] = le.fit_transform(df_encoded[col].astype(str))
    label_encoders[col] = le

# Encode target
le_target = LabelEncoder()
df_encoded['class'] = le_target.fit_transform(df_encoded['class'])
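# LabelEncoder assigns codes in sorted order; print the actual mapping instead of assuming it
print("Target encoding:", {str(c): i for i, c in enumerate(le_target.classes_)})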

print("\nEncoding complete!")
df_encoded.head()

In [None]:
# Prepare features and target
feature_cols = [col for col in df_encoded.columns if col != 'class']

X = df_encoded[feature_cols]
y = df_encoded['class']

print("Features shape:", X.shape)
print("Target shape:", y.shape)

In [None]:
# Split data (stratified so train and test keep the original class ratio)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print("Training set:", len(X_train))
print("Testing set:", len(X_test))

In [None]:
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Features scaled!")

---
## Step 4: SVM Implementation

In [None]:
# Train SVM with default parameters (RBF kernel)
print("=== Training SVM Model (RBF Kernel) ===")

svm_rbf = SVC(kernel='rbf', random_state=42)
svm_rbf.fit(X_train_scaled, y_train)

# Predict
y_pred_rbf = svm_rbf.predict(X_test_scaled)

# Evaluate
accuracy = accuracy_score(y_test, y_pred_rbf)
precision = precision_score(y_test, y_pred_rbf)
recall = recall_score(y_test, y_pred_rbf)
f1 = f1_score(y_test, y_pred_rbf)

print("\nRBF Kernel Results:")
print("Accuracy:", round(accuracy, 4))
print("Precision:", round(precision, 4))
print("Recall:", round(recall, 4))
print("F1-Score:", round(f1, 4))

In [None]:
# Confusion Matrix
print("=== Confusion Matrix ===")

cm = confusion_matrix(y_test, y_pred_rbf)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Edible', 'Poisonous'],
            yticklabels=['Edible', 'Poisonous'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix (RBF Kernel)')
plt.savefig('confusion_matrix.png')
plt.show()

---
## Step 5: Compare Different Kernels

In [None]:
# Compare different kernels
print("=== Comparing SVM Kernels ===")

kernels = ['linear', 'rbf', 'poly', 'sigmoid']
results = []

for kernel in kernels:
    print("\nTraining with", kernel, "kernel...")
    
    # Train model
    svm_model = SVC(kernel=kernel, random_state=42)
    svm_model.fit(X_train_scaled, y_train)
    
    # Predict
    y_pred = svm_model.predict(X_test_scaled)
    
    # Calculate metrics
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)
    f1_val = f1_score(y_test, y_pred)
    
    results.append({
        'Kernel': kernel,
        'Accuracy': acc,
        'Precision': prec,
        'Recall': rec,
        'F1-Score': f1_val
    })
    
    print("  Accuracy:", round(acc, 4))

# Create comparison dataframe
results_df = pd.DataFrame(results)
print("\n=== Kernel Comparison ===")
results_df

In [None]:
# Plot kernel comparison
plt.figure(figsize=(10, 6))
x_pos = range(len(kernels))
plt.bar(x_pos, results_df['Accuracy'], color='steelblue', edgecolor='black')
plt.xticks(x_pos, results_df['Kernel'])
plt.xlabel('Kernel Type')
plt.ylabel('Accuracy')
plt.title('SVM Accuracy by Kernel Type')
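# Start the y-axis just below the lowest score so no bar is clipped,
# and leave headroom above 1.0 for the value labels
plt.ylim(max(0, results_df['Accuracy'].min() - 0.05), 1.02)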

for i in range(len(results_df)):
    plt.text(i, results_df['Accuracy'].iloc[i] + 0.005, 
             str(round(results_df['Accuracy'].iloc[i], 4)), ha='center')

plt.savefig('kernel_comparison.png')
plt.show()

---
## Step 6: Hyperparameter Tuning

In [None]:
# Hyperparameter tuning for RBF kernel
# Each candidate is scored with 5-fold cross-validation on the training set,
# so the test set stays untouched until the final evaluation.
from sklearn.model_selection import cross_val_score

print("=== Hyperparameter Tuning (RBF Kernel) ===")

C_values = [0.1, 1, 10, 100]
gamma_values = ['scale', 'auto', 0.1, 1]

best_accuracy = 0
best_params = {}

for C in C_values:
    for gamma in gamma_values:
        svm_model = SVC(kernel='rbf', C=C, gamma=gamma, random_state=42)
        cv_scores = cross_val_score(svm_model, X_train_scaled, y_train,
                                    cv=5, scoring='accuracy')
        acc = cv_scores.mean()
        
        if acc > best_accuracy:
            best_accuracy = acc
            best_params = {'C': C, 'gamma': gamma}

print("\nBest Parameters:")
print("  C:", best_params['C'])
print("  gamma:", best_params['gamma'])
print("  Best CV Accuracy:", round(best_accuracy, 4))

In [None]:
# Train final model with best parameters
print("=== Training Final Model ===")

final_svm = SVC(kernel='rbf', C=best_params['C'], gamma=best_params['gamma'], random_state=42)
final_svm.fit(X_train_scaled, y_train)

y_pred_final = final_svm.predict(X_test_scaled)

print("\n=== Final Model Evaluation ===")
print(classification_report(y_test, y_pred_final, target_names=['Edible', 'Poisonous']))

---
## Step 7: Analysis and Discussion

### SVM Strengths:
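1. **Effective in high-dimensional spaces** - Can still perform well when the number of features exceeds the number of samples
2. **Memory efficient** - The decision function depends only on the support vectors, not on the full training set
3. **Versatile** - The kernel (linear, RBF, polynomial, sigmoid) can be chosen to match the shape of the decision boundary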
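
The memory-efficiency point can be checked directly on a fitted model. The following is a minimal sketch on synthetic data from `make_classification` (a stand-in, not the mushroom dataset); the names `X_demo`, `y_demo`, and `demo_svm` are illustrative only. It counts how many training samples the fitted `SVC` keeps as support vectors.

```python
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the mushroom dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     n_informative=5, random_state=42)
X_demo = StandardScaler().fit_transform(X_demo)

demo_svm = SVC(kernel='rbf', random_state=42)
demo_svm.fit(X_demo, y_demo)

# n_support_ holds the number of support vectors for each class
n_sv = int(demo_svm.n_support_.sum())

print("Training samples:", len(X_demo))
print("Support vectors:", n_sv)
print("Fraction kept:", round(n_sv / len(X_demo), 3))
```

Only the rows in `demo_svm.support_vectors_` are needed at prediction time. How small that fraction is depends on the data and on `C` and `gamma`.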

### SVM Weaknesses:
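1. **Computationally expensive** for large datasets - Training time for kernel SVMs grows at least quadratically with the number of samples
2. **Sensitive to feature scaling** - Features should be standardized before training (as done with `StandardScaler` in Step 3)
3. **Difficult to interpret** - With non-linear kernels there are no per-feature coefficients to inspect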
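
The scaling sensitivity can be illustrated with a small experiment. This is a sketch under stated assumptions: the data is synthetic, one feature is deliberately multiplied by 1000, and the names `unscaled`, `scaled`, `acc_unscaled`, and `acc_scaled` are illustrative. It compares cross-validated accuracy of an RBF SVM with and without `StandardScaler`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the mushroom dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=6,
                                     n_informative=4, random_state=0)
X_demo[:, 0] *= 1000  # put one feature on a much larger scale than the rest

# Same classifier, with and without standardization
unscaled = SVC(kernel='rbf', random_state=42)
scaled = make_pipeline(StandardScaler(), SVC(kernel='rbf', random_state=42))

# 5-fold cross-validated accuracy; the pipeline refits the scaler inside each fold
acc_unscaled = cross_val_score(unscaled, X_demo, y_demo, cv=5).mean()
acc_scaled = cross_val_score(scaled, X_demo, y_demo, cv=5).mean()

print("Accuracy without scaling:", round(acc_unscaled, 4))
print("Accuracy with scaling:", round(acc_scaled, 4))
```

Because the RBF kernel is distance-based, a feature on a much larger scale tends to dominate the kernel values. The scaled pipeline typically scores higher here, though the exact gap depends on the data.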

### Practical Implications:
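- SVM suits binary classification tasks such as mushroom edibility
- The RBF kernel is a reasonable default when the class boundary may be non-linear
- C controls how heavily misclassified training points are penalized and gamma controls how far each sample's influence reaches, so both are worth tuning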
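
One common way to tune C and gamma is scikit-learn's `GridSearchCV` wrapped around a `Pipeline`, so that scaling is refit inside every cross-validation fold. The sketch below uses synthetic data and illustrative names (`pipe`, `param_grid`, `search`); it is one possible setup, not the procedure used in the notebook cells above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the mushroom dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42)

# Scaling and the classifier are tuned together as one estimator
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(kernel='rbf', random_state=42)),
])

# The 'svm__' prefix routes each parameter to the pipeline step named 'svm'
param_grid = {
    'svm__C': [0.1, 1, 10, 100],
    'svm__gamma': ['scale', 'auto', 0.1, 1],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Mean CV accuracy:", round(search.best_score_, 4))
print("Held-out test accuracy:", round(search.score(X_te, y_te), 4))
```

The test split is used only once, after the search has picked its parameters, so the reported test accuracy is not influenced by the tuning.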

---
## Summary

In this assignment, we:

1. **Explored** the Mushroom dataset
2. **Preprocessed** data with label encoding and scaling
3. **Implemented** SVM with multiple kernels
4. **Compared** Linear, RBF, Polynomial, and Sigmoid kernels
5. **Tuned** hyperparameters C and gamma
6. **Analyzed** SVM strengths and weaknesses