# SVM AND NAIVE BAYES ASSIGNMENT

# Question 1: What is a Support Vector Machine (SVM), and how does it work?

A Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression tasks, though it is mainly applied in classification problems. The main goal of SVM is to find the best decision boundary (hyperplane) that separates data points of different classes with the maximum margin.

SVM works by representing each sample as a point in an n-dimensional space (where n is the number of features) and identifying the hyperplane that divides the classes. The data points closest to the hyperplane are called support vectors; they alone determine the position and orientation of the hyperplane.

If the data is not linearly separable, SVM uses a technique called the kernel trick, which implicitly maps the data into a higher-dimensional space where a separating hyperplane can be found. Common kernels include linear, polynomial, and radial basis function (RBF).

Example:

Suppose we want to classify emails as spam or not spam.
SVM will find a boundary (hyperplane) that separates spam emails from non-spam ones based on their features (like word frequency, sender, etc.) with the largest possible margin.
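
The idea of a maximum-margin hyperplane can be sketched on a tiny example. The data below is hypothetical and chosen only so the geometry is easy to check by hand; a very large `C` is used as an assumption to approximate a hard margin.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two linearly separable clusters in 2-D.
X = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
              [4.0, 4.0], [4.0, 5.0], [5.0, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard margin on separable data.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]        # normal vector of the hyperplane w.x + b = 0
b = clf.intercept_[0]   # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)  # distance between the two margin boundaries

print("Support vectors:\n", clf.support_vectors_)
print("Margin width:", margin_width)
```

Only the points nearest the boundary appear in `support_vectors_`; removing any of the other points would leave the fitted hyperplane unchanged.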

# Question 2: Explain the difference between Hard Margin and Soft Margin SVM.


In Support Vector Machines (SVM), the concept of margin refers to the distance between the separating hyperplane and the nearest data points (support vectors).
Depending on how strictly the SVM separates the classes, there are two types — Hard Margin and Soft Margin SVM.

1. Hard Margin SVM

Used when the data is perfectly linearly separable.

The algorithm finds a hyperplane that strictly separates the data into classes without any misclassification.

It ensures a maximum margin between classes, but it is very sensitive to noise and outliers.

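If the classes are not perfectly separable (for example, because a single outlier lies on the wrong side), no hyperplane satisfies the constraints and the optimization has no feasible solution.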

Example:

 Ideal for clean, noise-free datasets where classes are clearly separable.

2. Soft Margin SVM

Used when data is not perfectly separable.

It allows some misclassifications to achieve better generalization and performance on unseen data.

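Introduces a penalty parameter (C) that controls the trade-off between maximizing the margin and minimizing classification error: a large C penalizes margin violations heavily (narrower margin, fewer training errors), while a small C tolerates more violations in exchange for a wider margin.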

More robust and practical for real-world datasets containing noise or overlapping classes.

 Example:

  Works well on datasets where classes overlap slightly or contain outliers.
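
As a rough illustration of the role of C, the sketch below fits a linear SVM on hypothetical overlapping data for several values of C and reports how many support vectors each fit uses. The dataset, the C values, and the variable names are assumptions made for this example.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Hypothetical overlapping two-class data.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

sv_counts = {}
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # n_support_ holds the number of support vectors per class.
    sv_counts[C] = int(clf.n_support_.sum())
    print(f"C={C:>6}: support vectors={sv_counts[C]}, "
          f"training accuracy={clf.score(X, y):.3f}")
```

A smaller C usually widens the margin, so more points fall inside it and become support vectors; a larger C narrows the margin and fits the training data more tightly.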

# Question 3: What is the Kernel Trick in SVM? Give one example of a kernel and explain its use case.

The Kernel Trick in Support Vector Machine (SVM) is a mathematical technique used to handle non-linear data.
It allows SVM to transform the input data into a higher-dimensional space where it becomes easier to separate the classes using a linear boundary.

Instead of explicitly calculating the coordinates of the data in that higher-dimensional space, the kernel trick computes the inner products between the transformed data points directly — saving time and computation.
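
A small numerical check can make this concrete. The sketch below uses a degree-2 polynomial kernel (chosen here for illustration because its feature map is easy to write out) and shows that the kernel value computed in the original 2-D space equals the inner product in the explicit 3-D feature space. The helper names `phi` and `poly2_kernel` are hypothetical.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D input (illustrative)."""
    x1, x2 = v
    return np.array([x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2])

def poly2_kernel(a, b):
    """Homogeneous polynomial kernel of degree 2: K(a, b) = (a . b)^2."""
    return np.dot(a, b) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

explicit = np.dot(phi(a), phi(b))  # inner product in the 3-D feature space
implicit = poly2_kernel(a, b)      # same value, computed in the 2-D input space
print(explicit, implicit)
```

Both values come out to 121 (up to floating-point rounding), but the kernel version never constructs the higher-dimensional vectors.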

How it Works:

In real-world problems, data is often not linearly separable in its original form.

The kernel function maps this data into a higher dimension where a linear separator (hyperplane) can be found.

Example:

Kernel: Radial Basis Function (RBF) Kernel (also known as Gaussian Kernel)

Formula:

$$K(x, x') = \exp\left(-\gamma \lVert x - x' \rVert^2\right)$$

Use Case:

The RBF kernel is commonly used when data shows non-linear relationships.
For example, in image classification or handwriting recognition, where patterns are curved or complex, RBF helps SVM create flexible decision boundaries to classify data accurately.
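
The following sketch first evaluates the RBF formula by hand and compares it with scikit-learn's `rbf_kernel`, then compares a linear and an RBF SVM on hypothetical ring-shaped data. The value of `gamma`, the dataset, and the split are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# 1) The RBF formula computed by hand matches scikit-learn's implementation.
gamma = 0.5  # illustrative value
x = np.array([[1.0, 2.0]])
x_prime = np.array([[2.0, 0.0]])
manual = np.exp(-gamma * np.sum((x - x_prime) ** 2))
library = rbf_kernel(x, x_prime, gamma=gamma)[0, 0]
print("manual:", manual, "library:", library)

# 2) Hypothetical non-linear data: one class forms a ring around the other.
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

linear_acc = SVC(kernel="linear").fit(X_train, y_train).score(X_test, y_test)
rbf_acc = SVC(kernel="rbf").fit(X_train, y_train).score(X_test, y_test)
print("linear:", linear_acc, "rbf:", rbf_acc)
```

No straight line separates a ring from its center, so the linear kernel should do little better than chance here, while the RBF kernel can draw a closed boundary around the inner class.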

# Question 4: What is a Naïve Bayes Classifier, and why is it called “naïve”?

A Naïve Bayes Classifier is a probabilistic machine learning algorithm based on Bayes’ Theorem, used for classification tasks.
It predicts the probability that a data point belongs to a particular class based on prior knowledge and the likelihood of features.

Bayes’ Theorem:

$$P(C \mid X) = \frac{P(X \mid C) \cdot P(C)}{P(X)}$$

Where:

P(C∣X): Posterior probability (probability of the class given the features)

P(X∣C): Likelihood (probability of the features given the class)

P(C): Prior probability of the class

P(X): Evidence (marginal probability of the features)

>>Why it is called “Naïve”:


It is called “naïve” because it assumes that all features are independent of each other — meaning one feature’s presence does not affect another.
In reality, this assumption is rarely true, but it still performs surprisingly well in many practical cases.

>Example Use Case:

Email Spam Detection:
Each word in an email is treated as an independent feature.
The model predicts whether the email is spam or not spam based on the probability of words appearing in each category.
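
The calculation can be sketched by hand. All probabilities below are hypothetical numbers picked for illustration, and `posterior` is a helper written for this example, not a library function.

```python
# Hypothetical probabilities, chosen only for illustration.
prior = {"spam": 0.4, "ham": 0.6}
likelihood = {
    "spam": {"free": 0.30, "meeting": 0.02},
    "ham": {"free": 0.03, "meeting": 0.20},
}

def posterior(words):
    """Return P(class | words) under the naive independence assumption."""
    scores = {}
    for label in prior:
        score = prior[label]
        for word in words:
            # Independence: multiply the per-word likelihoods together.
            score *= likelihood[label][word]
        scores[label] = score
    evidence = sum(scores.values())  # P(X), the normalizing constant
    return {label: score / evidence for label, score in scores.items()}

result = posterior(["free"])
print(result)
```

With these numbers, an email containing "free" gets a spam posterior of 0.12 / 0.138, about 0.87. Adding more words just multiplies in more per-word likelihoods, which is exactly the "naïve" step.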

# Question 5: Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants. When would you use each one?


The Naïve Bayes Classifier has different variants depending on the type of data it works with.
The three most common types are Gaussian, Multinomial, and Bernoulli Naïve Bayes.

1. Gaussian Naïve Bayes

Used when features are continuous (numerical) and follow a normal (Gaussian) distribution.

It assumes that the continuous values associated with each class are distributed according to a Gaussian (bell-shaped) curve.

>Use Case:

Suitable for datasets with real-valued features such as height, weight, age, or temperature.

Example: Iris flower classification, where petal length and width are continuous values.

2. Multinomial Naïve Bayes

Used when features represent discrete counts or frequencies.

It assumes that the features follow a Multinomial distribution (common in text data).

>Use Case:

Works best for text classification problems, such as spam detection or document categorization, where features are word counts or term frequencies.

Example: Email spam classification based on word occurrences.

3. Bernoulli Naïve Bayes

Used when features are binary (0 or 1) — representing the presence or absence of a feature.

It models data as a series of yes/no features rather than counts.

>Use Case:

Works well for binary text classification, where only the presence or absence of words matters, not their frequency.

Example: Sentiment analysis — checking if certain positive or negative words exist in a review.
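
To show how each variant pairs with a feature type, the sketch below generates hypothetical continuous, count, and binary features and fits the matching scikit-learn class to each. The synthetic data and the binarization threshold are assumptions for this example.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 50)

# Continuous measurements (e.g. lengths): GaussianNB
X_cont = np.vstack([rng.normal(0.0, 1.0, (50, 2)),
                    rng.normal(3.0, 1.0, (50, 2))])

# Word counts per document: MultinomialNB
X_counts = np.vstack([rng.poisson([5, 1, 1], (50, 3)),
                      rng.poisson([1, 1, 5], (50, 3))])

# Binary indicators (illustrative threshold: term appears at least 3 times): BernoulliNB
X_bin = (X_counts >= 3).astype(int)

scores = {
    "GaussianNB": GaussianNB().fit(X_cont, y).score(X_cont, y),
    "MultinomialNB": MultinomialNB().fit(X_counts, y).score(X_counts, y),
    "BernoulliNB": BernoulliNB().fit(X_bin, y).score(X_bin, y),
}
for name, acc in scores.items():
    print(f"{name}: training accuracy = {acc:.2f}")
```

The three classes share the same `fit`/`predict` interface; what differs is the distribution each one assumes for the features within a class.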

# Question 6: Write a Python program to:
● Load the Iris dataset

● Train an SVM Classifier with a linear kernel

● Print the model's accuracy and support vectors.

In [1]:
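from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Iris dataset (150 samples, 4 features, 3 classes)
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train an SVM classifier with a linear kernel
svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)

# Evaluate accuracy on the held-out test set
y_pred = svm_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

# Print the training points that define the decision boundaries
print("Support Vectors:\n", svm_model.support_vectors_)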


Model Accuracy: 1.0
Support Vectors:
 [[4.8 3.4 1.9 0.2]
 [5.1 3.3 1.7 0.5]
 [4.5 2.3 1.3 0.3]
 [5.6 3.  4.5 1.5]
 [5.4 3.  4.5 1.5]
 [6.7 3.  5.  1.7]
 [5.9 3.2 4.8 1.8]
 [5.1 2.5 3.  1.1]
 [6.  2.7 5.1 1.6]
 [6.3 2.5 4.9 1.5]
 [6.1 2.9 4.7 1.4]
 [6.5 2.8 4.6 1.5]
 [6.9 3.1 4.9 1.5]
 [6.3 2.3 4.4 1.3]
 [6.3 2.5 5.  1.9]
 [6.3 2.8 5.1 1.5]
 [6.3 2.7 4.9 1.8]
 [6.  3.  4.8 1.8]
 [6.  2.2 5.  1.5]
 [6.2 2.8 4.8 1.8]
 [6.5 3.  5.2 2. ]
 [7.2 3.  5.8 1.6]
 [5.6 2.8 4.9 2. ]
 [5.9 3.  5.1 1.8]
 [4.9 2.5 4.5 1.7]]


# Question 7: Write a Python program to:
● Load the Breast Cancer dataset

● Train a Gaussian Naïve Bayes model

● Print its classification report including precision, recall, and F1-score.

In [2]:
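from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Load the Breast Cancer dataset (target 0 = malignant, 1 = benign)
data = load_breast_cancer()
X = data.data
y = data.target

# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Gaussian Naïve Bayes model
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Predict on the held-out test set
y_pred = gnb.predict(X_test)

# Report precision, recall, and F1-score per class
print("Classification Report:")
print(classification_report(y_test, y_pred))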


Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.93      0.96        43
           1       0.96      1.00      0.98        71

    accuracy                           0.97       114
   macro avg       0.98      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114



# Question 8: Write a Python program to:
● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best C and gamma.

● Print the best hyperparameters and accuracy.


In [3]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Wine dataset (178 samples, 13 features, 3 classes)
wine = load_wine()
X = wine.data
y = wine.target

# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the SVM model
svm = SVC()

# Candidate values for C and gamma, searched with an RBF kernel
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [1, 0.1, 0.01, 0.001],
    'kernel': ['rbf']
}

# Cross-validated grid search (5-fold by default); refit=True retrains
# the best model on the full training set
grid = GridSearchCV(svm, param_grid, refit=True, verbose=0)
grid.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out test set
y_pred = grid.predict(X_test)

print("Best Hyperparameters:", grid.best_params_)
print("Test Accuracy:", accuracy_score(y_test, y_pred))


Best Hyperparameters: {'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}
Test Accuracy: 0.8333333333333334


# Question 9: Write a Python program to:
● Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using sklearn.datasets.fetch_20newsgroups).

● Print the model's ROC-AUC score for its predictions.

In [4]:
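from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Load two newsgroup categories, stripping headers, footers, and quoted replies
data = fetch_20newsgroups(subset='all', categories=['rec.sport.baseball', 'sci.med'], remove=('headers', 'footers', 'quotes'))
X = data.data
y = data.target

# Convert the raw text to TF-IDF features (top 5000 terms, English stop words removed).
# Note: the vectorizer is fitted on all documents before the split, so test-set term
# statistics influence the features; fit it on the training split only for a stricter evaluation.
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X_tfidf = vectorizer.fit_transform(X)

# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)

# Train a Multinomial Naïve Bayes classifier
nb = MultinomialNB()
nb.fit(X_train, y_train)

# Predicted class probabilities; column 1 is the probability of class 1 (sci.med)
y_prob = nb.predict_proba(X_test)

# Compute ROC-AUC score (the labels are already binary, so no binarization is needed)
roc_auc = roc_auc_score(y_test, y_prob[:, 1])

print("ROC-AUC Score:", roc_auc)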


ROC-AUC Score: 0.9971052765222691


# Question 10: Imagine you’re working as a data scientist for a company that handles email communications. Your task is to automatically classify emails as Spam or Not Spam. The emails may contain:

● Text with diverse vocabulary

● Potential class imbalance (far more legitimate emails than spam)

● Some incomplete or missing data

Explain the approach you would take to:

● Preprocess the data (e.g. text vectorization, handling missing data)

● Choose and justify an appropriate model (SVM vs. Naïve Bayes)

● Address class imbalance

● Evaluate the performance of your solution with suitable metrics
And explain the business impact of your solution.

1. Data Preprocessing

To classify emails as Spam or Not Spam, data preprocessing is a critical first step. The main tasks include:

>>Handling Missing Data

Emails may have missing text or metadata (like subject line or sender).

Replace missing values with an empty string ("") or use imputation techniques for numerical features (if any).

Remove irrelevant fields such as message IDs or timestamps if they don’t contribute to the classification.

>>Text Cleaning and Tokenization

Convert all text to lowercase.

Remove punctuation, numbers, and special characters.

Remove stopwords (like “is”, “the”, “and”) using NLTK or SpaCy.

Apply stemming or lemmatization to reduce words to their root form (e.g., “running” → “run”).

>>Text Vectorization

Convert text into numerical format using:

TF-IDF (Term Frequency–Inverse Document Frequency) – gives importance to rare but meaningful words.

Or Count Vectorizer – simple word frequency representation.

TF-IDF is preferred for spam detection because it reduces the weight of common words like “email”, “please”, etc.
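
The preprocessing steps above can be sketched as follows. The sample emails are hypothetical, and the sketch relies on `TfidfVectorizer`'s built-in lowercasing, tokenization, and English stop-word list rather than NLTK or SpaCy; stemming and lemmatization are left out to keep it short.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical raw emails; None stands for a missing body.
raw_emails = [
    "WIN a FREE prize now!!!",
    None,
    "Meeting moved to 3 pm tomorrow",
    "Free entry: claim your prize",
]

# Replace missing bodies with an empty string so the vectorizer can process them.
emails = [text if isinstance(text, str) else "" for text in raw_emails]

# Lowercase the text, drop punctuation via the default tokenizer,
# and remove English stop words.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(emails)

print("Matrix shape:", X.shape)
print("Vocabulary:", sorted(vectorizer.vocabulary_))
```

The email with the missing body becomes an all-zero row, so it stays in the dataset without contributing any word evidence.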

2. Model Selection and Justification

>>Naïve Bayes Classifier

Works well for text classification in practice. Its feature-independence assumption does not strictly hold for natural language, where words are correlated, but the predicted class is often still correct.

Multinomial Naïve Bayes performs well with word counts or TF-IDF values.

Fast, efficient, and performs well even on smaller datasets.

>>SVM (Support Vector Machine)

A linear SVM handles high-dimensional, sparse text features well and is a strong alternative when accuracy matters more than training speed.

Kernel functions let it separate non-linear data, but training a kernel SVM becomes slow on very large text datasets.

Choice:

For this task, Multinomial Naïve Bayes is preferred due to its simplicity, speed, and effectiveness for high-dimensional sparse text data.
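
A minimal sketch of this choice, assuming a handful of hypothetical labelled emails, chains TF-IDF vectorization and Multinomial Naïve Bayes in a single scikit-learn pipeline:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical labelled emails (1 = spam, 0 = not spam).
train_texts = [
    "win a free prize now",
    "free cash offer click here",
    "claim your free reward today",
    "project meeting at noon",
    "please review the attached report",
    "lunch tomorrow with the team",
]
train_labels = [1, 1, 1, 0, 0, 0]

# The pipeline fits the vectorizer and the classifier together,
# so new emails are transformed with the training vocabulary.
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(train_texts, train_labels)

predictions = model.predict(["free prize offer", "team meeting report"])
print(predictions)
```

On this toy data the first message is classified as spam and the second as not spam. Keeping the vectorizer inside the pipeline also means it is fitted only on training data during cross-validation.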

3. Addressing Class Imbalance

If there are more legitimate emails than spam, we can handle imbalance by:

Resampling Techniques:

Oversampling minority class (e.g., using SMOTE)

Undersampling majority class

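Class weights: Adjust model parameters to give more importance to the minority (spam) class, for example `class_weight='balanced'` in scikit-learn's SVC or adjusted class priors (`class_prior`) in Multinomial Naïve Bayes.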

Threshold tuning: Adjust decision thresholds to reduce false negatives (missed spam emails).
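
Threshold tuning can be sketched as below. The data is synthetic (about 10% positives standing in for spam), and Gaussian Naïve Bayes is used only because the synthetic features are continuous; both are assumptions for this example, as are the two threshold values.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Hypothetical imbalanced data: about 10% positives stand in for spam.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
# stratify=y keeps the class ratio the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

clf = GaussianNB().fit(X_train, y_train)
spam_prob = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

recalls = {}
for threshold in (0.5, 0.2):
    # Flag an email as spam when its probability reaches the threshold.
    y_pred = (spam_prob >= threshold).astype(int)
    recalls[threshold] = recall_score(y_test, y_pred)
    print(f"threshold={threshold}: recall={recalls[threshold]:.3f}")
```

Lowering the threshold can only flag more emails as spam, so recall never decreases; the cost is more legitimate emails flagged, which is why precision should be tracked alongside it.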

4. Model Evaluation Metrics

To evaluate performance, use metrics suitable for imbalanced data:

| Metric | Purpose |
| --- | --- |
| Accuracy | Basic performance indicator, but can be misleading if classes are imbalanced. |
| Precision | Measures how many emails predicted as spam are actually spam. |
| Recall (Sensitivity) | Measures how many actual spam emails were correctly identified. |
| F1-Score | Harmonic mean of Precision and Recall; good for imbalanced datasets. |
| ROC-AUC Score | Measures overall model discrimination between spam and not spam. |
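
These metrics can be computed with scikit-learn as in the sketch below. The labels, predictions, and scores are hypothetical values picked so the results are easy to verify by hand (3 true positives, 1 false positive, 1 false negative).

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Hypothetical ground truth (1 = spam), hard predictions, and spam scores.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.6, 0.9, 0.8, 0.7, 0.4]

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
roc_auc = roc_auc_score(y_true, y_score)  # uses scores, not hard predictions

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f} roc_auc={roc_auc:.3f}")
```

Here accuracy is 0.8 while precision, recall, and F1 are all 0.75, and ROC-AUC is 23/24 (about 0.958), since 23 of the 24 spam/non-spam pairs are ranked correctly.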

>Preferred metrics: Precision, Recall, F1-Score, and ROC-AUC (not just accuracy).

5. Business Impact

Implementing this spam classification system can have a significant positive impact:

>Improves productivity: Employees spend less time sorting or deleting spam emails.

>Enhances cybersecurity: Reduces phishing and malware risks.

>Increases efficiency: Ensures legitimate emails are prioritized, improving communication flow.

>Saves costs: Less risk of data breaches or wasted resources due to spam attacks.