# SVM & Naive Bayes | Assignment

**Question 1: What is a Support Vector Machine (SVM), and how does it work?**

**Answer:**

**What is an SVM?**

Support Vector Machine (SVM) is a powerful supervised machine learning algorithm primarily used for classification, though it can also handle regression tasks. Its main goal is to find the best boundary (called a hyperplane) that separates different classes in your data.

Think of it like drawing a line on a 2D plot to separate red dots from blue dots—but it works in much higher dimensions too.

**How does it work?**

1. **Finding the Best Separating Hyperplane** Imagine you have two classes of data points scattered on a graph. SVM tries to find a line (in 2D) or a hyperplane (in higher dimensions) that splits these classes apart.

2. **Maximizing the Margin** Instead of just any line that separates the classes, SVM picks the one that maximizes the margin — the distance between the closest points of each class to the hyperplane. Those closest points are called support vectors because they "support" or define the position of the boundary.

3. **Handling Non-linearly Separable Data** What if you can’t draw a straight line to separate the classes? SVM uses the kernel trick: a kernel function measures similarity between points as if they had been mapped into a higher-dimensional space, where a linear separator is more likely to exist, without ever computing that mapping explicitly. Common kernels include:

- Linear
- Polynomial
- Radial Basis Function (RBF)

4. **Soft Margin for Noisy Data** Real-world data is messy and might overlap. SVM allows some points to be on the wrong side of the margin with a soft margin parameter (often called C), balancing margin size and classification errors.

**Why is SVM powerful?**

- Works well in high-dimensional spaces.
- Effective even when the number of features exceeds the number of samples.
- Less prone to overfitting than many classifiers, because maximizing the margin acts as a form of regularization.
- Versatile: kernels let it model non-linear decision boundaries.
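The margin idea described above can be checked numerically. The following is a minimal sketch, assuming a hypothetical six-point 2D dataset invented here for illustration. It fits a linear SVM with a very large `C` (to approximate a hard margin on separable data) and computes the margin width as $2 / \|w\|$.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two well-separated clusters in 2D
X = np.array([[0, 0], [1, 0], [0, 1], [3, 3], [4, 3], [3, 4]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard margin when the data is separable
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]          # normal vector of the separating hyperplane
b = clf.intercept_[0]     # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)

print("w =", w, "b =", b)
print("margin width =", margin_width)
print("support vectors:")
print(clf.support_vectors_)
```

Only the points closest to the boundary show up in `support_vectors_`; moving any of the other points (without crossing the margin) would leave the fitted hyperplane unchanged.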

**Question 2: Explain the difference between Hard Margin and Soft Margin SVM.**

**Answer:**

The difference between Hard Margin and Soft Margin SVM is pretty central to how SVM handles data — especially when the data isn’t perfectly clean or linearly separable.

Here’s the lowdown:

**Hard Margin SVM**

- What it is: Hard Margin SVM tries to find a perfectly clean hyperplane that strictly separates the two classes with no errors at all.

- Key idea: The margin must be wide, and no data points are allowed to lie inside the margin or on the wrong side of the boundary.

- When it works: Only when the data is linearly separable without any overlap or noise.

- Limitations:

  - Doesn’t tolerate misclassified points or noise.
  - Not practical for most real-world datasets where classes overlap or have outliers.

- Mathematically: The optimization requires every training point to satisfy

  $y_i(w \cdot x_i + b) \geq 1$

  with no violations allowed.

**Soft Margin SVM**

- What it is: Soft Margin SVM allows some flexibility by tolerating some misclassifications or points inside the margin.

- Key idea: It tries to find a hyperplane that balances maximizing the margin and minimizing classification errors.

- How it does this: Introduces slack variables ($\xi_i$) that measure how much each point violates the margin.

- Trade-off controlled by the parameter $C$:

  - A large $C$ means "penalize errors heavily": fewer misclassifications but possibly a smaller margin (closer to hard margin).
  - A small $C$ means "allow more errors" to get a wider margin (often better generalization).
- When it works: Best for noisy or overlapping data where perfect separation isn’t possible.

Mathematically, the optimization becomes:

$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i$

subject to:

$y_i(w \cdot x_i + b) \geq 1 - \xi_i$, where $\xi_i \geq 0$.

                
**In simple terms:**

| **Feature**            | **Hard Margin**                     | **Soft Margin**                             |
| ---------------------- | ----------------------------------- | ------------------------------------------- |
| **Error allowed?**     | No — zero misclassification allowed | Yes — allows some misclassifications        |
| **Data type**          | Perfectly separable                 | Overlapping or noisy data                   |
| **Margin flexibility** | Fixed and strict                    | Flexible — balances margin width and errors |
| **Use case**           | Rare in real-world data             | Most practical and commonly used SVM        |
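To make the role of $C$ concrete, here is a small sketch under stated assumptions: the overlapping two-cluster dataset is synthetic (generated with `make_blobs` using centers and spread chosen here for illustration), and the three values of `C` are arbitrary. It fits a linear soft-margin SVM for each value and reports the margin width $2 / \|w\|$ and the number of support vectors.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic, overlapping classes (assumed centers and spread, for illustration)
X, y = make_blobs(
    n_samples=200, centers=[[0, 0], [3, 3]], cluster_std=1.5, random_state=0
)

results = {}
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin_width = 2.0 / np.linalg.norm(clf.coef_[0])
    n_support = int(clf.n_support_.sum())
    results[C] = (margin_width, n_support)
    print(f"C={C:<6} margin width={margin_width:.3f} support vectors={n_support}")
```

A smaller `C` should yield a wider margin, and typically more support vectors, because more points are allowed to sit inside the margin.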





**Question 3: What is the Kernel Trick in SVM? Give one example of a kernel and explain its use case.**

**Answer:**

**The Kernel Trick in SVM**

The Kernel Trick is a powerful technique used in SVM that allows it to handle non-linearly separable data without explicitly transforming the data into a higher-dimensional space. Instead of computing the coordinates of the data points in the higher dimension, it computes the dot product of the data points in that higher dimension using a kernel function. This is computationally much less expensive.

In essence, the kernel function acts as a shortcut to calculate the similarity between data points in a higher dimension, making it possible to find a linear decision boundary in that space, which corresponds to a non-linear boundary in the original space.

**Example of a Kernel: Radial Basis Function (RBF) Kernel**

*   **Mathematical Form:** The RBF kernel between two data points $x$ and $x'$ is given by:

    $K(x, x') = \exp(-\gamma \|x - x'\|^2)$

    where $\gamma$ is a parameter that controls the influence of a single training example.

*   **Use Case:** The RBF kernel is one of the most commonly used kernels in SVM. It is particularly effective for non-linear classification tasks where the relationship between features and the target variable is complex and cannot be captured by a linear boundary in the original feature space. It can create complex decision boundaries by mapping the data into an infinite-dimensional space.

    For example, if one class forms a ring around the other (concentric circles), no straight line can separate them, so a linear kernel fails. The RBF kernel can find a circular boundary in the original space by implicitly mapping the data to a higher-dimensional space where a linear separation is possible.

In summary, the Kernel Trick, especially with kernels like the RBF kernel, significantly enhances the power and applicability of SVM to a wide range of complex, non-linear classification problems.
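The RBF formula and the circular-data example can be tried directly. This is a minimal sketch, assuming a synthetic concentric-circles dataset from `make_circles` and an arbitrary `gamma = 1.0`; the helper `rbf_manual` is a name introduced here, not part of scikit-learn. It first checks the hand-written formula against `sklearn.metrics.pairwise.rbf_kernel`, then compares a linear and an RBF SVM on the same data.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC


def rbf_manual(a, b, gamma):
    """K(a, b) = exp(-gamma * ||a - b||^2), written out by hand."""
    return np.exp(-gamma * np.sum((a - b) ** 2))


# Synthetic data: one class forms a ring around the other
X, y = make_circles(n_samples=200, noise=0.05, factor=0.4, random_state=0)
gamma = 1.0  # arbitrary value chosen for this illustration

k_manual = rbf_manual(X[0], X[1], gamma)
k_sklearn = rbf_kernel(X[:1], X[1:2], gamma=gamma)[0, 0]
print("manual:", k_manual, "sklearn:", k_sklearn)

# Training accuracy of a linear vs. an RBF SVM on the same data
linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)
rbf_acc = SVC(kernel="rbf", gamma=gamma).fit(X, y).score(X, y)
print(f"linear kernel accuracy: {linear_acc:.2f}")
print(f"RBF kernel accuracy:    {rbf_acc:.2f}")
```

Note that the kernel value is computed from the original 2D points only; the higher-dimensional mapping is never constructed.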

**Question 4: What is a Naïve Bayes Classifier, and why is it called “naïve”?**

**Answer:**

Naïve Bayes is a probabilistic machine learning algorithm used mainly for classification tasks. It’s based on Bayes’ Theorem, which helps you update the probability estimate for a hypothesis as more evidence comes in.

In simple terms, it predicts the class of a data point by calculating the probability that it belongs to each class and then picking the class with the highest probability.

**How does it work?**

1.  It looks at the features (attributes) of your data.
2.  Uses Bayes’ theorem to calculate the probability of the data belonging to each class.
3.  Assigns the class with the highest posterior probability.

**Why is it called “Naïve”?**

The “naïve” part comes from a strong assumption it makes:

*   It assumes all features are independent of each other, given the class label.

In reality, features often influence each other (they’re correlated), but Naïve Bayes ignores this and treats each feature as if it stands alone.

This assumption is what makes it “naïve” — it simplifies the math drastically, making the algorithm fast and efficient, even if the assumption isn’t perfectly true.

**Why does it still work well?**

Surprisingly, even when the independence assumption is violated, Naïve Bayes often performs really well in practice, especially for:

*   Text classification (like spam filtering or sentiment analysis)
*   Medical diagnosis
*   Any problem where the independence assumption is “close enough”

**Quick Bayes Theorem refresher:**

$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$

Where:

*   $P(A|B)$ = probability of class A given data B (what we want)
*   $P(B|A)$ = probability of data B given class A
*   $P(A)$ = prior probability of class A
*   $P(B)$ = probability of data B (normalizing constant)
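A short worked example may help here. All probabilities below are hypothetical numbers chosen only for illustration. The sketch applies Bayes' theorem to one word, then uses the naïve independence assumption to combine two words by multiplying their per-class likelihoods.

```python
# Hypothetical probabilities, invented for this illustration
p_spam = 0.2
p_ham = 1 - p_spam

p_free_given_spam = 0.6      # P("free" | spam)
p_free_given_ham = 0.05      # P("free" | not spam)

# Bayes' theorem for a single word: P(spam | "free")
evidence = p_free_given_spam * p_spam + p_free_given_ham * p_ham
posterior_spam = p_free_given_spam * p_spam / evidence
print(f"P(spam | 'free') = {posterior_spam:.3f}")

# Naive assumption: words are independent given the class,
# so per-word likelihoods are simply multiplied together.
p_meeting_given_spam = 0.1   # P("meeting" | spam)
p_meeting_given_ham = 0.4    # P("meeting" | not spam)

joint_spam = p_spam * p_free_given_spam * p_meeting_given_spam
joint_ham = p_ham * p_free_given_ham * p_meeting_given_ham
posterior_spam_two_words = joint_spam / (joint_spam + joint_ham)
print(f"P(spam | 'free', 'meeting') = {posterior_spam_two_words:.3f}")
```

With these numbers, seeing "free" alone pushes the posterior toward spam, while also seeing "meeting" pulls it back, which is the kind of evidence updating the theorem describes.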

**Question 5: Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants. When would you use each one?**

**Answer:**

1.  **Gaussian Naïve Bayes**
    *   What it assumes: The features follow a normal (Gaussian) distribution.
    *   How it works: For each feature and class, it estimates the mean and variance, then calculates the likelihood using the Gaussian probability density function.
    *   Use case: When your features are continuous numerical data (like height, weight, temperature, or any real-valued measurements).
    *   Example: Predicting if a patient has a disease based on continuous lab test results.

2.  **Multinomial Naïve Bayes**
    *   What it assumes: Features represent counts or frequencies (non-negative integers).
    *   How it works: It models the probability of the features as a multinomial distribution — basically, how often each feature occurs.
    *   Use case: Commonly used in text classification, where features are word counts or frequencies in documents.
    *   Example: Spam detection or sentiment analysis based on how often certain words appear in emails or reviews.

3.  **Bernoulli Naïve Bayes**
    *   What it assumes: Features are binary (0/1) — indicating presence or absence.
    *   How it works: It models features with a Bernoulli distribution, focusing on whether a feature is present or not in a sample.
    *   Use case: When your features are binary indicators, such as whether a word occurs or doesn’t in a document (ignoring how many times it appears).
    *   Example: Document classification where you only care if a word is present, not its frequency.

**Quick comparison summary:**

| Variant        | Data type             | Distribution Assumed    | Typical Use Case                                    |
| :------------- | :-------------------- | :---------------------- | :-------------------------------------------------- |
| Gaussian NB    | Continuous numerical  | Gaussian (Normal)       | Numeric measurements                                |
| Multinomial NB | Count data (integers) | Multinomial             | Text classification with word counts                |
| Bernoulli NB   | Binary (0/1) features | Bernoulli (binary)      | Text classification with binary features (presence/absence) |

**When to pick which?**

*   If your features are continuous real numbers, go with Gaussian NB.
*   If your features are counts or frequencies, pick Multinomial NB.
*   If your features are binary indicators, choose Bernoulli NB.
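The three variants map onto three scikit-learn classes. The sketch below uses tiny hypothetical datasets, invented here, where each feature type matches the variant's assumption; it is meant to show which class fits which kind of input, not to be a realistic benchmark.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])

# Continuous feature (e.g. a height in cm): Gaussian NB
X_continuous = np.array([[150.0], [155.0], [180.0], [185.0]])
gaussian_pred = GaussianNB().fit(X_continuous, y).predict([[152.0], [183.0]])

# Count features (e.g. occurrences of two words): Multinomial NB
X_counts = np.array([[3, 0], [2, 1], [0, 3], [1, 2]])
multinomial_pred = MultinomialNB().fit(X_counts, y).predict([[4, 1]])

# Binary features (word present or absent): Bernoulli NB
X_binary = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
bernoulli_pred = BernoulliNB().fit(X_binary, y).predict([[1, 0]])

print("Gaussian NB:   ", gaussian_pred)
print("Multinomial NB:", multinomial_pred)
print("Bernoulli NB:  ", bernoulli_pred)
```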

**Question 6: Write a Python program to:**

*   Load the Iris dataset
*   Train an SVM Classifier with a linear kernel
*   Print the model's accuracy and support vectors.

**Answer:**

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split dataset into train and test (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train SVM with linear kernel
svm_clf = SVC(kernel='linear', random_state=42)
svm_clf.fit(X_train, y_train)

# Predict on test data
y_pred = svm_clf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

# Print support vectors
print("Support Vectors:")
print(svm_clf.support_vectors_)
```

Output:

```
Model Accuracy: 1.00
Support Vectors:
[[4.8 3.4 1.9 0.2]
 [5.1 3.3 1.7 0.5]
 [4.5 2.3 1.3 0.3]
 [5.6 3.  4.5 1.5]
 [5.4 3.  4.5 1.5]
 [6.7 3.  5.  1.7]
 [5.9 3.2 4.8 1.8]
 [5.1 2.5 3.  1.1]
 [6.  2.7 5.1 1.6]
 [6.3 2.5 4.9 1.5]
 [6.1 2.9 4.7 1.4]
 [6.5 2.8 4.6 1.5]
 [6.9 3.1 4.9 1.5]
 [6.3 2.3 4.4 1.3]
 [6.3 2.5 5.  1.9]
 [6.3 2.8 5.1 1.5]
 [6.3 2.7 4.9 1.8]
 [6.  3.  4.8 1.8]
 [6.  2.2 5.  1.5]
 [6.2 2.8 4.8 1.8]
 [6.5 3.  5.2 2. ]
 [7.2 3.  5.8 1.6]
 [5.6 2.8 4.9 2. ]
 [5.9 3.  5.1 1.8]
 [4.9 2.5 4.5 1.7]]
```


**Question 7: Write a Python program to:**

*   Load the Breast Cancer dataset
*   Train a Gaussian Naïve Bayes model
*   Print its classification report including precision, recall, and F1-score.

**Answer:**

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Load Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train Gaussian Naive Bayes model
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Predict on test set
y_pred = gnb.predict(X_test)

# Print classification report
report = classification_report(y_test, y_pred, target_names=data.target_names)
print(report)
```

Output:

```
              precision    recall  f1-score   support

   malignant       1.00      0.93      0.96        43
      benign       0.96      1.00      0.98        71

    accuracy                           0.97       114
   macro avg       0.98      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114
```



**Question 8: Write a Python program to:**

*   Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best C and gamma.
*   Print the best hyperparameters and accuracy.

**Answer:**

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load Wine dataset
data = load_wine()
X = data.data
y = data.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define SVM model
svm = SVC()

# Hyperparameter grid
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': ['scale', 'auto', 0.001, 0.01, 0.1, 1],
    'kernel': ['rbf']  # Using RBF kernel to include gamma parameter
}

# Setup GridSearchCV
grid_search = GridSearchCV(svm, param_grid, cv=5, n_jobs=-1, verbose=1)

# Train with hyperparameter tuning
grid_search.fit(X_train, y_train)

# Best parameters
print("Best Hyperparameters:", grid_search.best_params_)

# Evaluate on test data
y_pred = grid_search.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test Set Accuracy: {accuracy:.2f}")
```

Output:

```
Fitting 5 folds for each of 24 candidates, totalling 120 fits
Best Hyperparameters: {'C': 100, 'gamma': 'scale', 'kernel': 'rbf'}
Test Set Accuracy: 0.83
```
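
The search above runs on unscaled features. RBF kernels depend on distances between points, so features with large numeric ranges can dominate the kernel value, and the Wine features differ widely in scale. As a possible refinement, and not part of the original solution, the sketch below adds a `StandardScaler` step inside a `Pipeline` so that scaling is refit within each cross-validation fold. The step names `scaler` and `svc` are arbitrary labels chosen here.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaling lives inside the pipeline, so each CV fold fits its own scaler
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC(kernel="rbf"))])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": ["scale", 0.001, 0.01, 0.1, 1],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

scaled_accuracy = search.score(X_test, y_test)
print("Best Hyperparameters:", search.best_params_)
print(f"Test Set Accuracy (scaled features): {scaled_accuracy:.2f}")
```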


**Question 9: Write a Python program to:**

*   Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using sklearn.datasets.fetch_20newsgroups).
*   Print the model's ROC-AUC score for its predictions.

**Answer:**

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_auc_score
from sklearn.multiclass import OneVsRestClassifier
import numpy as np

# Load subset of 20 Newsgroups dataset (to keep it manageable)
categories = ['alt.atheism', 'comp.graphics', 'sci.med']
newsgroups = fetch_20newsgroups(subset='all', categories=categories)

X = newsgroups.data
y = newsgroups.target

# Convert text to feature vectors (word counts)
# Note: the vectorizer is fitted before the split, so test-set vocabulary
# influences the features; fit on the training split only to avoid this.
vectorizer = CountVectorizer(stop_words='english', max_features=5000)
X_vect = vectorizer.fit_transform(X)

# Binarize labels for multi-class ROC-AUC
y_bin = label_binarize(y, classes=np.arange(len(categories)))

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_vect, y_bin, test_size=0.2, random_state=42)

# Wrap in OneVsRest so one binary classifier is trained per column of the binarized labels
clf = OneVsRestClassifier(MultinomialNB())
clf.fit(X_train, y_train)

# Predict probabilities
y_proba = clf.predict_proba(X_test)

# Calculate ROC-AUC score (macro average across classes)
roc_auc = roc_auc_score(y_test, y_proba, average='macro')

print(f"ROC-AUC Score (macro-average): {roc_auc:.3f}")
```

Output:

```
ROC-AUC Score (macro-average): 0.993
```


**Question 10: Imagine you’re working as a data scientist for a company that handles email communications.**

**Your task is to automatically classify emails as Spam or Not Spam. The emails may contain:**

*   Text with diverse vocabulary
*   Potential class imbalance (far more legitimate emails than spam)
*   Some incomplete or missing data

Explain the approach you would take to:

*   Preprocess the data (e.g., text vectorization, handling missing data)
*   Choose and justify an appropriate model (SVM vs. Naïve Bayes)
*   Address class imbalance
*   Evaluate the performance of your solution with suitable metrics

And explain the business impact of your solution.

**Answer:**

1.  **Data Preprocessing**
    *   **Handling Text Data**
        *   **Text Cleaning:** Remove HTML tags, punctuation, convert to lowercase, remove stopwords, and possibly do stemming/lemmatization to reduce vocabulary size.
        *   **Vectorization:** Use TF-IDF vectorization or CountVectorizer to convert text into numeric features. TF-IDF is great because it weighs down very common words and highlights distinctive words.
    *   **Handling Missing Data:**
        *   Missing or incomplete emails (empty body, missing subject) can be treated as empty strings or filled with placeholders.
        *   Ensure vectorizer handles empty inputs gracefully.
        *   If metadata is missing (like sender info), you could either drop those features or impute reasonable defaults.

2.  **Choosing the Model: SVM vs Naïve Bayes**
    *   **Naïve Bayes:**
        *   Fast and effective for text classification, especially spam filtering.
        *   Handles high-dimensional sparse data well (like word counts).
        *   Robust to noisy features and works well with smaller datasets.
    *   **SVM:**
        *   Can achieve higher accuracy with proper tuning, especially with kernels like linear or RBF.
        *   More computationally intensive but often better at complex boundaries.
    *   **Recommendation:** Start with Multinomial Naïve Bayes as a baseline because it’s simple, fast, and proven effective in spam filtering. If accuracy needs improvement, move to linear SVM with TF-IDF features.

3.  **Addressing Class Imbalance**
    *   Why important? Spam is usually a small fraction compared to legitimate emails, so models might get biased toward the majority class.
    *   **Approaches:**
        *   **Resampling:**
            *   Oversampling the minority class (e.g., using SMOTE)
            *   Undersampling the majority class
        *   **Class Weights:**
            *   SVMs accept class weights that penalize misclassifying the minority class more heavily; Naïve Bayes can instead be given adjusted class priors.
        *   **Threshold Tuning:**
            *   Adjust classification thresholds to improve recall on spam without hurting precision too much.

4.  **Evaluating Performance**
    *   **Metrics to track:**
        *   **Precision:** How many predicted spam emails are actually spam (important to avoid false alarms).
        *   **Recall:** How many actual spam emails are detected (important to catch as much spam as possible).
        *   **F1-score:** Balance between precision and recall.
        *   **ROC-AUC / PR-AUC:** For overall discrimination ability, especially useful with class imbalance.
        *   **Confusion Matrix:** To understand types of errors.
    *   Why precision & recall? In spam detection, false positives (legitimate emails marked as spam) can annoy users, while false negatives (spam missed) reduce effectiveness.

5.  **Business Impact**
    *   **Improved User Experience:** Automatically filtering spam reduces clutter in users’ inboxes, improving satisfaction and productivity.
    *   **Security:** Effective spam detection helps block phishing, malware, and fraud attempts, protecting company and users.
    *   **Cost Efficiency:** Reduces manual moderation and support requests related to spam.
    *   **Reputation:** Maintaining high-quality communication channels preserves trust and brand integrity.

**Summary:**

| Step                | Approach                                                 |
| :------------------ | :------------------------------------------------------- |
| Preprocessing       | Clean text, TF-IDF vectorization, handle missing fields gracefully |
| Model Choice        | Start with Multinomial Naïve Bayes; move to SVM if needed |
| Class Imbalance     | Use class weights, resampling, threshold tuning          |
| Evaluation Metrics  | Precision, Recall, F1-score, ROC-AUC, Confusion Matrix   |
| Business Value      | Better UX, security, cost savings, and reputation        |
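
The steps in the table can be sketched end to end. This is a minimal illustration under several assumptions: the eight emails and their labels are hypothetical, the helper `clean` is a name introduced here, and the tiny dataset is far too small to give meaningful metrics. It treats missing bodies as empty strings, builds TF-IDF features, fits a Multinomial Naïve Bayes baseline and a class-weighted linear SVM, and prints a classification report.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy emails; None marks a missing body. 0 = not spam, 1 = spam.
emails = [
    "Meeting moved to 3pm, agenda attached",
    "Please review the quarterly report",
    "Lunch tomorrow with the project team?",
    "Your invoice for last month is attached",
    None,
    "Notes from today's planning call",
    "Win a free prize now, click here",
    "Cheap loans, free money, act now",
]
labels = [0, 0, 0, 0, 0, 0, 1, 1]


def clean(texts):
    """Replace missing email bodies with empty strings."""
    return [t if isinstance(t, str) else "" for t in texts]


# Baseline: TF-IDF features with Multinomial Naive Bayes
nb_model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())

# Alternative: linear SVM with class weights to counter the imbalance
svm_model = make_pipeline(
    TfidfVectorizer(stop_words="english"), LinearSVC(class_weight="balanced")
)

nb_model.fit(clean(emails), labels)
svm_model.fit(clean(emails), labels)

new_emails = clean(["Claim your free prize now", None])
nb_pred = nb_model.predict(new_emails)
svm_pred = svm_model.predict(new_emails)
print("Naive Bayes predictions:", nb_pred)
print("Linear SVM predictions: ", svm_pred)

# Report on the training data only, to show the metric layout
print(
    classification_report(
        labels,
        svm_model.predict(clean(emails)),
        target_names=["not spam", "spam"],
        zero_division=0,
    )
)
```

In a real project the report would be computed on a held-out set, and the vectorizer would stay inside the pipeline (as here) so it is fitted on training data only.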