# SVM & Naive Bayes: Assignment

Q-1 : What is a Support Vector Machine (SVM), and how does it work?


A-1 : Support Vector Machine (SVM) –
A Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression, especially binary classification.

SVM works by finding the hyperplane that separates the classes with the maximum margin, i.e. the largest possible distance between the hyperplane and the nearest points of each class. Those nearest points are called support vectors; they alone determine where the hyperplane lies.

If data is not linearly separable, SVM uses kernel functions (like RBF or polynomial) to map data into higher dimensions where separation is possible. It also supports a soft margin for handling noisy data.

SVM is effective for high-dimensional data and is widely used in text classification, face detection, and bioinformatics.
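
To make the ideas of hyperplane, margin, and support vectors concrete, here is a minimal sketch on a tiny hand-made 2-D dataset. The data points and the very large `C` (used to approximate a hard margin) are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable 2-D data (made up for illustration)
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard margin on separable data
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]        # normal vector of the hyperplane w.x + b = 0
b = clf.intercept_[0]   # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)

print("Support vectors:\n", clf.support_vectors_)
print("Margin width:", margin_width)

# Support vectors sit on the margin, where |w.x + b| is approximately 1
sv_decision = clf.decision_function(clf.support_vectors_)
print("Decision values at support vectors:", sv_decision)
```

Only the points closest to the boundary appear in `support_vectors_`; removing any of the other points would leave the fitted hyperplane unchanged.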



Q-2 : Explain the difference between Hard Margin and Soft Margin SVM.

A-2 : In Hard Margin SVM, the algorithm assumes that the data is perfectly linearly separable, meaning no data points fall inside the margin or on the wrong side of the hyperplane. It aims to find the hyperplane with the maximum margin that strictly separates the two classes without allowing any errors. However, this approach is very sensitive to noise: a single outlier can shift the boundary drastically, and if the classes overlap at all, no valid solution exists.

On the other hand, Soft Margin SVM allows the model to tolerate some misclassifications or violations of the margin to achieve better performance on noisy or non-linearly separable data. It introduces a trade-off between maximizing the margin and minimizing classification errors, controlled by the regularization parameter C: a large C penalizes violations heavily (approaching hard-margin behavior), while a small C accepts more violations in exchange for a wider margin. This makes Soft Margin SVM more flexible and practical for real-world applications where data is often messy or overlapping.
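
The margin/error trade-off can be observed by varying scikit-learn's regularization parameter `C` (large values penalize margin violations heavily, small values tolerate them). The sketch below is illustrative only; the cluster centers, spread, and the three `C` values are arbitrary assumptions.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two clusters drawn close together so that they overlap (assumed toy setup)
X, y = make_blobs(n_samples=200, centers=[[0, 0], [3, 3]],
                  cluster_std=2.0, random_state=0)

n_sv = {}  # number of support vectors for each value of C
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    n_sv[C] = int(clf.n_support_.sum())
    print(f"C={C:<6} support vectors={n_sv[C]:3d} "
          f"train accuracy={clf.score(X, y):.2f}")
```

A smaller `C` generally widens the margin, so more points fall inside it and become support vectors.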



Q-3 : What is the Kernel Trick in SVM? Give one example of a kernel and
explain its use case.


A-3 : The Kernel Trick in SVM is a technique used to solve problems where data is not linearly separable. Instead of explicitly mapping data to a higher-dimensional space, the kernel trick uses a kernel function to compute the dot product of the transformed data directly in the original space. This makes computations efficient while allowing SVM to learn non-linear decision boundaries.

One commonly used kernel is the Radial Basis Function (RBF) kernel, also known as the Gaussian kernel, defined as K(x, z) = exp(-gamma * ||x - z||^2). It implicitly corresponds to an infinite-dimensional feature space and is useful when the data has complex, non-linear patterns.

Use Case: The RBF kernel is widely used in image classification, handwriting recognition, and medical diagnosis, where the data cannot be separated by a straight line.
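
The kernel trick itself can be checked numerically. Because the RBF feature space is infinite-dimensional, the sketch below swaps in a degree-2 polynomial kernel, whose explicit feature map is small enough to write out. The helper names `phi` and `poly_kernel` and the sample vectors are hypothetical, introduced only for this illustration.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector (illustrative)."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: K(x, z) = (x . z)^2."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

explicit = np.dot(phi(x), phi(z))  # dot product in the 3-D feature space
implicit = poly_kernel(x, z)       # same value, computed in the original 2-D space

print(explicit, implicit)
```

Both numbers agree, which is the point of the trick: the kernel returns the feature-space dot product without ever constructing `phi(x)`.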

Q-4 : What is a Naïve Bayes Classifier, and why is it called “naïve”?

A-4 : A Naïve Bayes Classifier is a supervised machine learning algorithm based on Bayes' Theorem, used mainly for classification tasks. It calculates the probability that a given data point belongs to a particular class, based on the values of its features.

The classifier is called “naïve” because it assumes that all features are independent of each other given the class label — an assumption that is rarely true in real-world data. Despite this unrealistic assumption, the algorithm performs surprisingly well in many applications, especially with large datasets.

Naïve Bayes is commonly used in spam detection, sentiment analysis, and document classification, due to its simplicity, speed, and effectiveness.
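
A small worked calculation shows how the independence assumption turns Bayes' Theorem into a product of per-feature probabilities. All priors and likelihoods below are made-up numbers, and the `posterior` helper is hypothetical, written only for this sketch.

```python
# Hypothetical probabilities for a two-word spam example (made up for illustration)
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.30, "meeting": 0.02},
    "ham":  {"free": 0.03, "meeting": 0.20},
}

def posterior(words):
    """Return P(class | words) under the naive independence assumption."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for word in words:
            # Naive step: multiply per-word likelihoods as if words were independent
            score *= likelihoods[label][word]
        scores[label] = score
    total = sum(scores.values())  # normalizing constant P(words)
    return {label: score / total for label, score in scores.items()}

result = posterior(["free"])
print(result)
```

With these numbers, an email containing "free" is assigned to spam because 0.4 × 0.30 outweighs 0.6 × 0.03 before normalization.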

Q-5 : Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants.
When would you use each one?

A-5 : Gaussian, Multinomial, and Bernoulli Naïve Bayes – Description and Use Cases
Gaussian Naïve Bayes:
This variant assumes that the features follow a normal (Gaussian) distribution. It is best suited for continuous numerical data.
Use Case: Useful when features are continuous measurements, such as medical diagnosis (e.g., height, weight) or iris flower classification (sepal and petal dimensions).

Multinomial Naïve Bayes:
It is designed for discrete data, particularly for count-based features like word frequencies.
Use Case: Ideal for text classification, such as spam detection or document categorization.

Bernoulli Naïve Bayes:
This variant works with binary/boolean features (0 or 1), representing the presence or absence of a feature.
Use Case: Suitable for tasks like email classification, where features are binary (e.g., presence of specific words).
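
The three variants map directly onto three scikit-learn classes. The following is a minimal sketch pairing each one with the kind of feature it expects; the tiny arrays are invented for illustration and far too small for real use.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 0, 1, 1])

# Continuous measurements -> GaussianNB
X_continuous = np.array([[1.0, 2.1], [1.2, 1.9], [3.8, 4.0], [4.1, 3.9]])
# Word counts per document -> MultinomialNB
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])
# Word present / absent -> BernoulliNB
X_binary = (X_counts > 0).astype(int)

predictions = {
    "gaussian": GaussianNB().fit(X_continuous, y).predict(X_continuous),
    "multinomial": MultinomialNB().fit(X_counts, y).predict(X_counts),
    "bernoulli": BernoulliNB().fit(X_binary, y).predict(X_binary),
}

for name, pred in predictions.items():
    print(name, pred)
```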

Q-6 : Write a Python program to:
● Load the Iris dataset
● Train an SVM Classifier with a linear kernel
● Print the model's accuracy and support vectors.
(Include your Python code and output in the code box below.)

In [1]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train an SVM classifier with a linear kernel
clf = SVC(kernel='linear')
clf.fit(X_train, y_train)

# Predict on the test set
y_pred = clf.predict(X_test)

# Print accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

# Print support vectors
print("Support Vectors:")
print(clf.support_vectors_)


Model Accuracy: 1.00
Support Vectors:
[[4.8 3.4 1.9 0.2]
 [5.1 3.3 1.7 0.5]
 [4.5 2.3 1.3 0.3]
 [5.6 3.  4.5 1.5]
 [5.4 3.  4.5 1.5]
 [6.7 3.  5.  1.7]
 [5.9 3.2 4.8 1.8]
 [5.1 2.5 3.  1.1]
 [6.  2.7 5.1 1.6]
 [6.3 2.5 4.9 1.5]
 [6.1 2.9 4.7 1.4]
 [6.5 2.8 4.6 1.5]
 [6.9 3.1 4.9 1.5]
 [6.3 2.3 4.4 1.3]
 [6.3 2.5 5.  1.9]
 [6.3 2.8 5.1 1.5]
 [6.3 2.7 4.9 1.8]
 [6.  3.  4.8 1.8]
 [6.  2.2 5.  1.5]
 [6.2 2.8 4.8 1.8]
 [6.5 3.  5.2 2. ]
 [7.2 3.  5.8 1.6]
 [5.6 2.8 4.9 2. ]
 [5.9 3.  5.1 1.8]
 [4.9 2.5 4.5 1.7]]


Q-7 : Write a Python program to:
● Load the Breast Cancer dataset
● Train a Gaussian Naïve Bayes model
● Print its classification report including precision, recall, and F1-score.
(Include your Python code and output in the code box below.)

In [2]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Gaussian Naïve Bayes model
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Predict on test data
y_pred = gnb.predict(X_test)

# Print classification report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))


Classification Report:
              precision    recall  f1-score   support

   malignant       1.00      0.93      0.96        43
      benign       0.96      1.00      0.98        71

    accuracy                           0.97       114
   macro avg       0.98      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114



Q-8 : Write a Python program to:
● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best
C and gamma.
● Print the best hyperparameters and accuracy.
(Include your Python code and output in the code box below.)

In [3]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Wine dataset
data = load_wine()
X = data.data
y = data.target

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define parameter grid for GridSearchCV
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [1, 0.1, 0.01, 0.001],
    'kernel': ['rbf']
}

# Create SVM classifier
svc = SVC()

# Perform Grid Search with 5-fold cross-validation
grid = GridSearchCV(svc, param_grid, cv=5)
grid.fit(X_train, y_train)

# Predict on test set
y_pred = grid.predict(X_test)

# Print best parameters and accuracy
print("Best Hyperparameters:", grid.best_params_)
print("Test Accuracy: {:.2f}".format(accuracy_score(y_test, y_pred)))


Best Hyperparameters: {'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}
Test Accuracy: 0.83
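
The result above was obtained on unscaled features. The RBF kernel depends on distances between samples, so a feature with a large numeric range (such as proline in the Wine data) can dominate the others. A possible variant, sketched here under the assumption that standardization is acceptable for this task, places a `StandardScaler` in front of the SVM inside a pipeline; the names `pipe` and `scaled_grid` are introduced only for this example.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaling lives inside the pipeline, so each CV fold is scaled
# using statistics from its own training portion only
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# make_pipeline names each step after its lowercased class, hence "svc__"
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [1, 0.1, 0.01, 0.001],
}

scaled_grid = GridSearchCV(pipe, param_grid, cv=5)
scaled_grid.fit(X_train, y_train)

test_accuracy = scaled_grid.score(X_test, y_test)
print("Best Hyperparameters:", scaled_grid.best_params_)
print("Test Accuracy: {:.2f}".format(test_accuracy))
```

Keeping the scaler inside the pipeline, rather than scaling the whole dataset up front, avoids leaking test-fold statistics into the cross-validation scores.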


Q-9 : Write a Python program to:
● Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using
sklearn.datasets.fetch_20newsgroups).
● Print the model's ROC-AUC score for its predictions.
(Include your Python code and output in the code box below.)

In [4]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score

# Load binary text classification dataset (2 categories)
categories = ['rec.sport.hockey', 'sci.space']
newsgroups = fetch_20newsgroups(subset='all', categories=categories)

# Vectorize the text data using TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(newsgroups.data)
y = newsgroups.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Naive Bayes classifier
nb = MultinomialNB()
nb.fit(X_train, y_train)

# Predict probabilities for ROC-AUC
y_probs = nb.predict_proba(X_test)[:, 1]

# Calculate and print ROC-AUC score
roc_auc = roc_auc_score(y_test, y_probs)
print(f"ROC-AUC Score: {roc_auc:.2f}")


ROC-AUC Score: 1.00


Q-10 : Imagine you’re working as a data scientist for a company that handles
email communications.
Your task is to automatically classify emails as Spam or Not Spam. The emails may
contain:
● Text with diverse vocabulary
● Potential class imbalance (far more legitimate emails than spam)
● Some incomplete or missing data
Explain the approach you would take to:
● Preprocess the data (e.g. text vectorization, handling missing data)
● Choose and justify an appropriate model (SVM vs. Naïve Bayes)
● Address class imbalance
● Evaluate the performance of your solution with suitable metrics
And explain the business impact of your solution.
(Include your Python code and output in the code box below.)
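
For the class-imbalance point, one option that stays within the Naïve Bayes family is scikit-learn's `ComplementNB`, which its documentation describes as particularly suited to imbalanced datasets. The sketch below is only an illustration: the eight emails and their labels are hand-written, hypothetical examples, and missing bodies are replaced with empty strings before vectorization.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline

# Hypothetical, hand-written emails: 6 legitimate (0) vs. 2 spam (1)
emails = [
    "meeting moved to friday", "please review the attached report",
    "lunch tomorrow?", None, "quarterly budget figures attached",
    "project status update", "win a free prize now", "free money click now",
]
labels = [0, 0, 0, 0, 0, 0, 1, 1]

# Missing bodies become empty strings so the vectorizer can process them
emails = [text if text is not None else "" for text in emails]

# Vectorizer and classifier in one pipeline, fitted together
model = make_pipeline(TfidfVectorizer(), ComplementNB())
model.fit(emails, labels)

preds = model.predict(["claim your free prize", "report for the friday meeting"])
print(preds)
```

Other common choices include `class_weight="balanced"` on an SVM or resampling the training set; whichever is used, precision and recall on the spam class are more informative than overall accuracy.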

In [6]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, roc_auc_score
import numpy as np

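# Simulate a spam vs. ham dataset using two distinct newsgroup categories.
# fetch_20newsgroups assigns labels in sorted category order, so
# rec.autos -> 0 (simulated 'ham') and talk.politics.misc -> 1 (simulated 'spam').
categories = ['talk.politics.misc', 'rec.autos']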
data = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))

# Add missing data simulation
texts = np.array(data.data, dtype=object)
texts[5] = None  # Inject a missing email

# Handle missing data by replacing None with an empty string before vectorization
texts = np.array([text if text is not None else '' for text in texts])

# Vectorization
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(texts)
y = data.target  # 0 = ham, 1 = spam (simulated)

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Train model
model = MultinomialNB()
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

# Evaluation
print("Classification Report:")
print(classification_report(y_test, y_pred))
print("ROC-AUC Score: {:.2f}".format(roc_auc_score(y_test, y_prob)))

# Business Impact explanation (as requested in the Markdown cell)
print("\nBusiness Impact:")
print("Implementing this email classification system can significantly improve efficiency by automatically filtering spam, reducing the time employees spend sorting emails.")
print("It can also enhance security by minimizing exposure to malicious content often found in spam.")
print("Accurate classification leads to a better user experience and ensures important communications are not missed.")
print("The chosen metrics (precision, recall, F1-score, ROC-AUC) are relevant for evaluating a spam filter. High precision minimizes legitimate emails being marked as spam, while high recall minimizes spam emails reaching the inbox. ROC-AUC provides an overall measure of the model's ability to distinguish between classes.")

Classification Report:
              precision    recall  f1-score   support

           0       0.92      0.98      0.95       198
           1       0.98      0.89      0.93       155

    accuracy                           0.94       353
   macro avg       0.95      0.94      0.94       353
weighted avg       0.95      0.94      0.94       353

ROC-AUC Score: 0.99

Business Impact:
Implementing this email classification system can significantly improve efficiency by automatically filtering spam, reducing the time employees spend sorting emails.
It can also enhance security by minimizing exposure to malicious content often found in spam.
Accurate classification leads to a better user experience and ensures important communications are not missed.
The chosen metrics (precision, recall, F1-score, ROC-AUC) are relevant for evaluating a spam filter. High precision minimizes legitimate emails being marked as spam, while high recall minimizes spam emails reaching the inbox. ROC-AUC provide