# SVM & Naive Bayes Assignment

Question 1: What is a Support Vector Machine (SVM), and how does it work?

Answer:

A Support Vector Machine (SVM) is a powerful supervised machine learning algorithm used for both classification and regression tasks. Its primary goal is to find the best possible boundary (called a hyperplane) that separates different classes in the data.

Here's a breakdown of how it works:

Finding the Optimal Hyperplane:

In a 2D space, the hyperplane is a line; in 3D, it is a plane; in general, it is a flat surface with one dimension fewer than the feature space.
The SVM aims to find the hyperplane that has the largest margin between the data points of different classes. The margin is the distance between the hyperplane and the nearest data points from each class.
These nearest data points are called support vectors. They are the most critical points in determining the position and orientation of the hyperplane.

Maximizing the Margin:

Maximizing the margin improves the model's ability to generalize. A larger margin makes the model less sensitive to noisy data points and outliers, which leads to better performance on unseen data.

Handling Non-linearly Separable Data (The Kernel Trick):

Sometimes, the data cannot be perfectly separated by a straight line or a simple hyperplane in its original space.
SVMs use a technique called the kernel trick to handle such cases. The kernel trick maps the data into a higher-dimensional space where it might be linearly separable.
Instead of explicitly transforming the data, the kernel function calculates the dot product of the data points in the higher-dimensional space, which is computationally more efficient. Common kernel functions include the linear kernel, polynomial kernel, and Radial Basis Function (RBF) kernel.

Soft Margin SVM (Handling Outliers and Overlapping Data):

In real-world scenarios, data is often not perfectly separable, and there might be some overlapping or outliers.
Soft margin SVM allows for some misclassifications by introducing a regularization parameter (often denoted by C). This parameter controls the trade-off between maximizing the margin and minimizing misclassifications. A smaller C allows for more misclassifications but results in a larger margin, while a larger C aims to minimize misclassifications but may lead to a smaller margin.
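
The effect of C described above can be checked with a small sketch. It assumes a synthetic, overlapping two-class dataset generated with `make_blobs` (not part of the assignment data) and compares the margin width, computed as 2 / ||w|| for a linear kernel, at a small and a large value of C.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic, overlapping two-class data (an assumption for illustration).
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

margins = {}
for C in (0.01, 100.0):
    model = SVC(kernel="linear", C=C).fit(X, y)
    w = model.coef_[0]                      # weight vector of the separating hyperplane
    margins[C] = 2.0 / np.linalg.norm(w)    # margin width for a linear SVM
    n_sv = model.support_vectors_.shape[0]
    print(f"C={C}: margin width={margins[C]:.3f}, support vectors={n_sv}")
```

The smaller C should report the wider margin, since it penalizes margin violations less.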


Question 2: Explain the difference between Hard Margin and Soft Margin SVM.

Answer:

Hard Margin SVM: Tries to draw a line that perfectly separates the groups with the widest possible gap, and insists that no dot crosses the line or falls within the gap. This only works if the groups are perfectly separated.

Soft Margin SVM: Still tries to draw a line that creates a wide gap, but allows a few dots to cross the line or fall within the gap if it helps to find a better overall separation that is less affected by these few "noisy" dots.
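
In optimization terms, the two variants can be written as follows. This is the standard textbook formulation, with $w$ the weight vector, $b$ the bias, $y_i \in \{-1, +1\}$ the class labels, $\xi_i$ the slack variables, and $C$ the regularization parameter; the notation is introduced here for illustration.

Hard margin:

$$\min_{w, b} \ \frac{1}{2} \lVert w \rVert^2 \quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1 \ \text{for all } i$$

Soft margin:

$$\min_{w, b, \xi} \ \frac{1}{2} \lVert w \rVert^2 + C \sum_i \xi_i \quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i, \ \ \xi_i \ge 0 \ \text{for all } i$$

Each slack variable $\xi_i$ measures how far point $i$ falls inside the gap or on the wrong side of the line, and $C$ sets the price paid for those violations. As $C$ grows very large, violations become too expensive and the soft margin problem approaches the hard margin one.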


Question 3: What is the Kernel Trick in SVM? Give one example of a kernel and
explain its use case.

Answer:

The Kernel Trick is a fundamental concept in Support Vector Machines (SVMs) that allows them to handle non-linearly separable data without explicitly transforming the data into a higher-dimensional space.

Here's how it works:

The Problem: In many real-world datasets, the classes are not linearly separable in their original feature space. This means you cannot draw a straight line (or a hyperplane in higher dimensions) to perfectly separate the data points of different classes.
The Idea: The core idea behind the kernel trick is that while the data might not be linearly separable in the original space, it might become linearly separable in a higher-dimensional space.
The Trick: Instead of explicitly calculating the coordinates of the data points in the higher-dimensional space (which can be computationally expensive or even infinite-dimensional), the kernel trick uses a kernel function. The kernel function calculates the dot product of the data points in the higher-dimensional space directly from their coordinates in the original space.
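
Mathematically:

If $\phi(x)$ is the transformation that maps a data point $x$ from the original space to the higher-dimensional space, the kernel function $K(x_i, x_j)$ is defined as:

$K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$

The SVM algorithm, which depends on the data only through dot products between data points, can then work with these kernel function outputs without ever needing to know the explicit form of $\phi(x)$ or the coordinates in the higher-dimensional space.

Example: the Radial Basis Function (RBF) kernel, $K(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2)$, corresponds to an infinite-dimensional feature space. It is a common choice when the class boundary is curved, for instance when one class forms a ring around another, and its $\gamma$ parameter controls how far the influence of a single training point reaches.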
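
The identity between a kernel value and a dot product in a higher-dimensional space can be verified numerically. The sketch below uses a degree-2 homogeneous polynomial kernel on 2D points as an illustrative choice; the feature map `phi` and the sample points are assumptions made for this example.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2D point (x1, x2)."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: K(x, z) = (x . z)^2."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

explicit = np.dot(phi(x), phi(z))  # dot product computed in the 3D feature space
implicit = poly_kernel(x, z)       # same quantity computed in the original 2D space
print(explicit, implicit)
```

Both values agree up to floating-point rounding, even though `poly_kernel` never builds the 3D vectors.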


Question 4: What is a Naïve Bayes Classifier, and why is it called “naïve”?

Answer:

A Naïve Bayes classifier is a probabilistic supervised learning algorithm that uses Bayes' Theorem to calculate the probability of a data point belonging to each class, given its features. Bayes' Theorem states:

$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$

In the context of Naïve Bayes, $A$ is the class and $B$ is the set of observed features:

$P(\text{Class} | \text{Features})$ is the posterior: the probability of the data point belonging to a specific class given its features (what we want to find).
$P(\text{Features} | \text{Class})$ is the likelihood: the probability of observing the given features given that the data point belongs to that class.
$P(\text{Class})$ is the prior probability of that class.
$P(\text{Features})$ is the marginal probability of the observed features (the evidence). It is the same for every class, so it can be ignored when comparing classes.

How it Works: The algorithm calculates the posterior probability for each class and predicts the class with the highest probability.

Why it's called naive:

The naive part comes from the simplifying assumption that the features are conditionally independent given the class. This means the algorithm assumes that the presence of a particular feature in a class is independent of the presence of any other feature in the same class.
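
The independence assumption is what lets the likelihood be written as a product of per-feature probabilities. A minimal sketch of that calculation, using made-up priors and word likelihoods for a hypothetical two-class spam example (none of these numbers come from real data):

```python
# Hypothetical, made-up probabilities for illustration only.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.5, "meeting": 0.1},
    "ham": {"free": 0.05, "meeting": 0.4},
}
observed = ["free", "meeting"]  # words present in the email to classify

# Naive assumption: P(features | class) is the product of per-feature terms.
unnormalized = {}
for cls, prior in priors.items():
    score = prior
    for word in observed:
        score *= likelihoods[cls][word]
    unnormalized[cls] = score

# Dividing by the evidence P(features) turns the scores into posteriors.
evidence = sum(unnormalized.values())
posteriors = {cls: score / evidence for cls, score in unnormalized.items()}
prediction = max(posteriors, key=posteriors.get)
print(posteriors, prediction)
```

Under these assumed numbers the posterior for spam is 0.02 / 0.032 = 0.625, so the email is classified as spam.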


Question 5: Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants.
When would you use each one?

Answer:

Gaussian Naïve Bayes:

Assumption: Assumes that the continuous features for each class are distributed according to a Gaussian (normal) distribution.
How it Works: It calculates the mean and standard deviation of each feature for each class during the training phase. When classifying a new data point, it uses these parameters to calculate the probability of observing the feature values given the class, based on the Gaussian probability density function.
Use Case: Primarily used for continuous features. For example, if you have features like height, weight, or temperature, and you assume they follow a normal distribution within each class. It's often applied in tasks like medical diagnosis (predicting disease based on continuous measurements) or classifying data with numerical attributes.

Multinomial Naïve Bayes:

Assumption: Assumes that the features represent counts or frequencies of events. It's typically used for discrete features.
How it Works: It's commonly used for text classification where features are word counts or term frequencies. It calculates the probability of each word appearing in a document given the class, based on the frequency of words in the training data for that class.
Use Case: Ideal for tasks involving document classification, spam filtering, and other problems where the data can be represented as counts or proportions of categories. For example, classifying news articles into topics based on the frequency of words in each article.

Bernoulli Naïve Bayes:

Assumption: Assumes that the features are binary (Boolean) variables, meaning they can only take two values (e.g., 0 or 1, True or False, presence or absence).
How it Works: It calculates the probability of each feature being present or absent given the class. It's also often used in text classification, but instead of using word counts, it uses whether a word is present or absent in a document.
Use Case: Suitable for tasks where features are indicators of presence or absence. In text classification, it can be useful when the focus is on which words are present rather than how many times they appear. It can also be applied to other domains with binary features, such as classifying user behavior based on whether they clicked on certain links (click or no click).
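
The pairing of each variant with its feature type can be sketched with scikit-learn. The example assumes the Iris measurements as continuous features and a tiny hypothetical corpus (1 = spam, 0 = not spam) for the count and presence/absence features; it is an illustration rather than a benchmark.

```python
from sklearn.datasets import load_iris
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Gaussian NB: continuous measurements (Iris sepal and petal sizes).
X_iris, y_iris = load_iris(return_X_y=True)
gaussian_acc = GaussianNB().fit(X_iris, y_iris).score(X_iris, y_iris)

# Hypothetical toy corpus; labels: 1 = spam, 0 = not spam.
docs = [
    "win a free prize now",
    "free free prize inside",
    "meeting agenda attached",
    "lunch meeting tomorrow",
]
labels = [1, 1, 0, 0]

# Multinomial NB: word counts (how many times each word appears).
counts = CountVectorizer().fit(docs)
multinomial = MultinomialNB().fit(counts.transform(docs), labels)

# Bernoulli NB: binary indicators (whether each word appears at all).
presence = CountVectorizer(binary=True).fit(docs)
bernoulli = BernoulliNB().fit(presence.transform(docs), labels)

new_doc = ["free prize meeting"]
multi_pred = multinomial.predict(counts.transform(new_doc))
bern_pred = bernoulli.predict(presence.transform(new_doc))
print(f"GaussianNB training accuracy on Iris: {gaussian_acc:.2f}")
print(f"MultinomialNB prediction: {multi_pred[0]}, BernoulliNB prediction: {bern_pred[0]}")
```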


Question 6:   Write a Python program to:
● Load the Iris dataset
● Train an SVM Classifier with a linear kernel
● Print the model's accuracy and support vectors.

In [1]:
# Import necessary libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and train the SVM classifier with a linear kernel
svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = svm_model.predict(X_test)

# Calculate and print the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

# Print the support vectors
# Note: svm_model.support_vectors_ holds the training samples chosen as support vectors
print("\nSupport Vectors:")
for i, sv in enumerate(svm_model.support_vectors_):
    print(f"  Support Vector {i+1}: {sv}")

Model Accuracy: 1.00

Support Vectors:
  Support Vector 1: [4.8 3.4 1.9 0.2]
  Support Vector 2: [5.1 3.3 1.7 0.5]
  Support Vector 3: [4.5 2.3 1.3 0.3]
  Support Vector 4: [5.6 3.  4.5 1.5]
  Support Vector 5: [5.4 3.  4.5 1.5]
  Support Vector 6: [6.7 3.  5.  1.7]
  Support Vector 7: [5.9 3.2 4.8 1.8]
  Support Vector 8: [5.1 2.5 3.  1.1]
  Support Vector 9: [6.  2.7 5.1 1.6]
  Support Vector 10: [6.3 2.5 4.9 1.5]
  Support Vector 11: [6.1 2.9 4.7 1.4]
  Support Vector 12: [6.5 2.8 4.6 1.5]
  Support Vector 13: [6.9 3.1 4.9 1.5]
  Support Vector 14: [6.3 2.3 4.4 1.3]
  Support Vector 15: [6.3 2.8 5.1 1.5]
  Support Vector 16: [6.3 2.7 4.9 1.8]
  Support Vector 17: [6.  3.  4.8 1.8]
  Support Vector 18: [6.  2.2 5.  1.5]
  Support Vector 19: [6.2 2.8 4.8 1.8]
  Support Vector 20: [6.5 3.  5.2 2. ]
  Support Vector 21: [7.2 3.  5.8 1.6]
  Support Vector 22: [5.6 2.8 4.9 2. ]
  Support Vector 23: [5.9 3.  5.1 1.8]
  Support Vector 24: [4.9 2.5 4.5 1.7]


Question 7:  Write a Python program to:
● Load the Breast Cancer dataset
● Train a Gaussian Naïve Bayes model
● Print its classification report including precision, recall, and F1-score.

In [2]:
# Import necessary libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Load the Breast Cancer dataset
breast_cancer = datasets.load_breast_cancer()
X = breast_cancer.data
y = breast_cancer.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and train the Gaussian Naïve Bayes model
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Make predictions on the test set
y_pred = gnb.predict(X_test)

# Print the classification report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=breast_cancer.target_names))

Classification Report:
              precision    recall  f1-score   support

   malignant       0.93      0.90      0.92        63
      benign       0.95      0.96      0.95       108

    accuracy                           0.94       171
   macro avg       0.94      0.93      0.94       171
weighted avg       0.94      0.94      0.94       171



Question 8: Write a Python program to:
● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best
C and gamma.
● Print the best hyperparameters and accuracy.

In [3]:
# Import necessary libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Wine dataset
wine = datasets.load_wine()
X = wine.data
y = wine.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the parameter grid for GridSearchCV
# We'll tune the 'C' and 'gamma' parameters
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [1, 0.1, 0.01, 0.001],
    'kernel': ['rbf'] # Using the RBF kernel as it's common for tuning gamma
}

# Initialize GridSearchCV with the SVM classifier and parameter grid
grid_search = GridSearchCV(SVC(), param_grid, refit=True, verbose=2, cv=5)

# Fit the grid search to the training data
grid_search.fit(X_train, y_train)

# Print the best hyperparameters found
print("Best hyperparameters found by GridSearchCV:")
print(grid_search.best_params_)

# Make predictions on the test set using the best model
best_svm_model = grid_search.best_estimator_
y_pred = best_svm_model.predict(X_test)

# Calculate and print the accuracy of the best model
accuracy = accuracy_score(y_test, y_pred)
print(f"\nModel Accuracy with best hyperparameters: {accuracy:.2f}")

Fitting 5 folds for each of 16 candidates, totalling 80 fits
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   0.0s
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   0.0s
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   0.0s
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   0.0s
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   0.0s
[CV] END .......................C=0.1, gamma=0.1, kernel=rbf; total time=   0.0s
[CV] END .......................C=0.1, gamma=0.1, kernel=rbf; total time=   0.0s
[CV] END .......................C=0.1, gamma=0.1, kernel=rbf; total time=   0.0s
[CV] END .......................C=0.1, gamma=0.1, kernel=rbf; total time=   0.0s
[CV] END .......................C=0.1, gamma=0.1, kernel=rbf; total time=   0.0s
[CV] END ......................C=0.1, gamma=0.01, kernel=rbf; total time=   0.0s
[CV] END ......................C=0.1, gamma=0.01
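
The grid search above is run on the raw Wine features, whose scales differ widely (proline is in the hundreds to thousands, while several other features are close to 1). Because the RBF kernel depends on distances between points, a common variant is to standardize the features inside a pipeline so that scaling is re-fit on each cross-validation fold. The sketch below is one such variant, not the original answer; the `svc__` parameter prefixes come from the step name that `make_pipeline` assigns, and `stratify=y` is an added assumption.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Scaling lives inside the pipeline, so each CV fold is scaled using only its own training part.
pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [1, 0.1, 0.01, 0.001],
}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print("Best hyperparameters:", search.best_params_)
print(f"Test accuracy: {test_accuracy:.2f}")
```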

Question 9: Write a Python program to:
● Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using
sklearn.datasets.fetch_20newsgroups).
● Print the model's ROC-AUC score for its predictions.

In [4]:
# Import necessary libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score

# Load the 20 Newsgroups dataset
# We'll select a subset of categories for simplicity
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
newsgroups_data = fetch_20newsgroups(subset='all', categories=categories, shuffle=True, random_state=42)

X = newsgroups_data.data
y = newsgroups_data.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Convert text data to feature vectors using TF-IDF
vectorizer = TfidfVectorizer()
X_train_vectors = vectorizer.fit_transform(X_train)
X_test_vectors = vectorizer.transform(X_test)

# Initialize and train the Multinomial Naïve Bayes model
# Multinomial Naive Bayes is designed for count features; fractional TF-IDF values also tend to work in practice
mnb = MultinomialNB()
mnb.fit(X_train_vectors, y_train)

# Predict class probabilities for every class (one column per class)
# ROC-AUC requires probability estimates or scores, not hard labels
# For multi-class problems, the per-class AUCs are averaged into a single score
y_prob = mnb.predict_proba(X_test_vectors)

# Calculate and print the ROC-AUC score
# For multi-class, roc_auc_score requires 'multi_class' parameter
# 'ovr' (One vs Rest) calculates AUC for each class against the rest
# 'weighted' average accounts for class imbalance
roc_auc = roc_auc_score(y_test, y_prob, multi_class='ovr', average='weighted')

print(f"Model ROC-AUC Score (weighted average): {roc_auc:.2f}")

Model ROC-AUC Score (weighted average): 0.99


Question 10: Imagine you’re working as a data scientist for a company that handles
email communications.
Your task is to automatically classify emails as Spam or Not Spam. The emails may
contain:
● Text with diverse vocabulary
● Potential class imbalance (far more legitimate emails than spam)
● Some incomplete or missing data
Explain the approach you would take to:
● Preprocess the data (e.g. text vectorization, handling missing data)
● Choose and justify an appropriate model (SVM vs. Naïve Bayes)
● Address class imbalance
● Evaluate the performance of your solution with suitable metrics
And explain the business impact of your solution.

Answer:

Here's an approach to the email spam classification task:

**1. Data Preprocessing:**

*   **Text Vectorization:** Email data is text-based, so we need to convert it into numerical feature vectors that machine learning models can understand. Common techniques include:
    *   **Bag-of-Words (BoW):** Creates a vocabulary of all unique words in the dataset and represents each email as a vector of word counts.
    *   **TF-IDF (Term Frequency-Inverse Document Frequency):** Similar to BoW but weighs words based on their importance in a document relative to the entire corpus. This helps to downweight common words like "the" and "a" and give more importance to words that are more specific to spam or legitimate emails. TF-IDF is generally preferred for text classification.
    *   **Word Embeddings (e.g., Word2Vec, GloVe):** These techniques learn dense vector representations of words that capture semantic relationships. While more complex, they can be powerful for capturing context.
    *   **Choice:** For this task, **TF-IDF** is a good starting point due to its effectiveness and interpretability.
*   **Handling Missing Data:** Emails might have missing subject lines, body text, or sender information.
    *   **Identify Missingness:** Determine which features have missing values.
    *   **Imputation:** For text fields, you could replace missing values with an empty string or a special token like "[MISSING]". For numerical features (if any), you could use mean, median, or mode imputation, or more advanced techniques.
    *   **Dropping Data:** If a significant portion of an email's content is missing, you might consider dropping that email, but be cautious as this can lead to data loss.
*   **Text Cleaning:** Before vectorization, clean the text data by:
    *   Converting text to lowercase.
    *   Removing punctuation, numbers, and special characters.
    *   Removing stop words (common words that don't carry much meaning, like "is," "in," "and").
    *   Stemming or lemmatization (reducing words to their root form).
*   **Feature Engineering (Optional but Recommended):** Create new features from the email data that might be useful for classification, such as:
    *   Length of the email.
    *   Number of links or attachments.
    *   Presence of specific keywords (e.g., "urgent," "win," "prize").
    *   Sender's domain reputation.
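
The missing-data handling and vectorization steps above can be sketched as follows. The email records and field names (`subject`, `body`) are hypothetical; missing fields are replaced with empty strings before TF-IDF is applied.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical email records; None marks a missing field.
emails = [
    {"subject": "Win a FREE prize", "body": "Click now!!!"},
    {"subject": None, "body": "Agenda for Monday's meeting"},
    {"subject": "Invoice", "body": None},
]

def to_text(email):
    """Join subject and body, treating missing fields as empty strings."""
    parts = [email.get("subject") or "", email.get("body") or ""]
    return " ".join(parts).strip()

texts = [to_text(email) for email in emails]

# Lowercasing and English stop-word removal are handled by the vectorizer itself.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(texts)
print(texts)
print(X.shape)
```

In a real pipeline the vectorizer would be fit on the training split only and then applied to the test split with `transform`.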

**2. Choosing and Justifying an Appropriate Model (SVM vs. Naïve Bayes):**

Both SVM and Naïve Bayes are commonly used for text classification, and each has its strengths:

*   **Naïve Bayes (Multinomial Naïve Bayes):**
    *   **Justification:**
        *   **Simplicity and Speed:** Naïve Bayes is computationally efficient and fast to train, even on large datasets.
        *   **Good for Text Data:** Multinomial Naïve Bayes is designed for count features such as word counts, and in practice it also works with fractional TF-IDF scores.
        *   **Handles High-Dimensional Data:** Performs well in high-dimensional feature spaces, which is typical for text data after vectorization.
    *   **Considerations:** The "naïve" independence assumption might not hold true for all word combinations, which could impact performance in some cases.
*   **Support Vector Machine (SVM):**
    *   **Justification:**
        *   **Effective with High-Dimensional Data:** SVMs are known for their ability to handle high-dimensional data effectively.
        *   **Finds Optimal Hyperplane:** Aims to find the optimal decision boundary with the largest margin, which can lead to good generalization.
        *   **Kernel Trick:** Can handle non-linear relationships in the data using different kernels (e.g., RBF kernel).
    *   **Considerations:** SVMs can be computationally more expensive to train than Naïve Bayes, especially on very large datasets. Choosing the right kernel and hyperparameters is crucial for optimal performance.

**Recommendation:**

Start with **Multinomial Naïve Bayes** as a baseline model. It's simple, fast, and often performs surprisingly well on text classification tasks. Then train an **SVM** and compare its performance. A linear kernel is the usual first choice for high-dimensional, sparse TF-IDF features; an RBF kernel is worth trying if the linear model underfits. The SVM might achieve higher accuracy, at the cost of longer training and more hyperparameter tuning.
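
A baseline comparison of this kind can be set up with two pipelines that share the same vectorizer settings. The sketch below uses a tiny hypothetical corpus (1 = spam, 0 = not spam), so the printed scores only show the mechanics and say nothing about real-world performance; the F1 scoring and two-fold cross-validation are illustrative choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus; labels: 1 = spam, 0 = not spam.
texts = [
    "win a free prize now",
    "claim your free reward today",
    "urgent prize waiting click here",
    "free money click now",
    "meeting agenda for monday",
    "please review the attached report",
    "lunch tomorrow with the team",
    "project status report and agenda",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

models = {
    "MultinomialNB": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "Linear SVM": make_pipeline(TfidfVectorizer(), SVC(kernel="linear")),
}

mean_scores = {}
for name, model in models.items():
    # Each fold re-fits the vectorizer, so no vocabulary leaks from the held-out part.
    scores = cross_val_score(model, texts, labels, cv=2, scoring="f1")
    mean_scores[name] = scores.mean()
    print(f"{name}: mean F1 = {mean_scores[name]:.2f}")
```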

**3. Addressing Class Imbalance:**

Since there are likely many more legitimate emails than spam emails, the dataset is imbalanced. Training a model on an imbalanced dataset can lead to a model that is biased towards the majority class (legitimate emails) and performs poorly on the minority class (spam). Here are ways to address this:

*   **Resampling Techniques:**
    *   **Oversampling:** Increase the number of instances in the minority class (spam) by duplicating existing instances or generating synthetic ones (e.g., SMOTE).
    *   **Undersampling:** Decrease the number of instances in the majority class (legitimate) by randomly removing instances.
*   **Using Class Weights:** Many machine learning algorithms let you assign higher weights to the minority class during training, which tells the model to pay more attention to classifying those instances correctly. In scikit-learn, SVM classifiers expose this as the `class_weight` parameter; `MultinomialNB` has no `class_weight`, but its `fit` method accepts `sample_weight` and its constructor accepts `class_prior`.
*   **Choosing Appropriate Evaluation Metrics:** Accuracy alone is not a good metric for imbalanced datasets. Use metrics that are more sensitive to the performance on the minority class (discussed below).

**Recommendation:**

Start by trying **class weights** as it's often the easiest to implement and can be effective. If the imbalance is severe, consider combining class weights with **oversampling** techniques like SMOTE.
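
A minimal sketch of the class-weight option, assuming a synthetic dataset from `make_classification` with roughly 5% positives standing in for spam. It compares minority-class recall for an unweighted linear SVM and one trained with `class_weight="balanced"`.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic imbalanced data (an assumption): about 95% class 0 and 5% class 1.
X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

recalls = {}
for name, weight in (("unweighted", None), ("balanced", "balanced")):
    model = SVC(kernel="linear", class_weight=weight).fit(X_train, y_train)
    # Recall on class 1, the minority class that stands in for spam.
    recalls[name] = recall_score(y_test, model.predict(X_test))
    print(f"{name}: minority-class recall = {recalls[name]:.2f}")
```

Balanced weights typically raise minority-class recall at the expense of some precision, so both should be checked before choosing a setting.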

**4. Evaluating the Performance of Your Solution with Suitable Metrics:**

For an imbalanced classification task like spam detection, simply looking at accuracy can be misleading. Consider these metrics:

*   **Confusion Matrix:** A table that summarizes the results of your classification, showing:
    *   **True Positives (TP):** Correctly classified spam emails.
    *   **True Negatives (TN):** Correctly classified legitimate emails.
    *   **False Positives (FP):** Legitimate emails incorrectly classified as spam (Type I error).
    *   **False Negatives (FN):** Spam emails incorrectly classified as legitimate (Type II error).
*   **Precision:** Out of all emails predicted as spam, what percentage were actually spam? (TP / (TP + FP)). High precision means fewer legitimate emails are marked as spam.
*   **Recall (Sensitivity):** Out of all actual spam emails, what percentage were correctly identified? (TP / (TP + FN)). High recall means fewer spam emails are missed.
*   **F1-Score:** The harmonic mean of precision and recall. It provides a single score that balances both metrics. Useful when you need a balance between precision and recall.
*   **ROC Curve and AUC (Area Under the ROC Curve):** The ROC curve plots the True Positive Rate (Recall) against the False Positive Rate (FP / (FP + TN)) at various threshold settings. The AUC represents the overall ability of the classifier to distinguish between the classes. A higher AUC indicates better performance.
*   **Specificity:** Out of all actual legitimate emails, what percentage were correctly identified? (TN / (TN + FP)).

**Recommendation:**

Focus on **Precision, Recall, F1-Score, and AUC**. The relative importance of precision and recall depends on the business goal. For spam filtering, you might prioritize high precision (to avoid marking legitimate emails as spam) even if it means slightly lower recall (missing a few spam emails).
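
These metrics can be computed directly from predictions. The sketch below uses a small set of hypothetical labels (1 = spam, 0 = legitimate) chosen so the confusion matrix is easy to check by hand.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels: 1 = spam, 0 = legitimate.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
specificity = tn / (tn + fp)                 # TN / (TN + FP)

print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}, specificity={specificity:.2f}")
```

Here TP = 3, TN = 5, FP = 1, and FN = 1, so precision, recall, and F1 are all 0.75 and specificity is 5/6.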

**5. Business Impact of Your Solution:**

Implementing an effective spam classification solution can have significant business impacts:

*   **Improved User Experience:** Users receive fewer unwanted spam emails, leading to a cleaner inbox and reduced frustration.
*   **Increased Productivity:** Users spend less time sifting through spam, allowing them to focus on important communications.
*   **Reduced Security Risks:** Spam emails often contain phishing attempts or malware. An effective filter reduces the likelihood of users falling victim to these threats.
*   **Reduced Bandwidth and Storage Costs:** Less spam means less data to transmit and store.
*   **Enhanced Reputation:** For a company providing email services, a robust spam filter builds trust and enhances its reputation.
*   **Compliance:** In some industries, there are regulatory requirements related to email filtering.