# Assignment Code: DA-AG-013
# SVM & Naive Bayes | Assignment

Question 1: What is a Support Vector Machine (SVM), and how does it work?

Answer:  
Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression tasks. It works by finding the optimal hyperplane that separates the data points of different classes with the maximum margin. The data points closest to the hyperplane are called support vectors, and they are critical for defining the boundary. SVM can efficiently handle high-dimensional data and is effective in cases where the number of features exceeds the number of samples.
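
As a rough illustration of the maximum-margin idea, the sketch below fits a linear SVM to six hand-picked 2-D points (hypothetical data, not part of the assignment) and recovers the margin width as 2 / ‖w‖. The very large C is an assumption used to approximate a hard margin.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data (two small clusters).
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard-margin SVM on separable data.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]          # normal vector of the separating hyperplane
b = clf.intercept_[0]     # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)

print("w =", w, "b =", b)
print("Margin width:", margin_width)
print("Support vectors:\n", clf.support_vectors_)
```

For this toy set the closest points of the two classes lie 5/√2 ≈ 3.54 apart, so the printed margin width should be close to that value.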

Question 2: Explain the difference between Hard Margin and Soft Margin SVM.

Answer:  

Hard Margin SVM assumes that the data is linearly separable and tries to find a hyperplane that perfectly separates the classes without any misclassification. It does not allow any margin violations.

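Soft Margin SVM allows some misclassification and margin violations to achieve better generalization. It introduces a penalty parameter C that controls the trade-off between maximizing the margin and minimizing classification errors: a large C penalizes violations heavily and approaches hard-margin behavior, while a small C tolerates more violations in exchange for a wider margin. Soft Margin is more practical and robust, especially when data is noisy or overlapping.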
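
One way to see the role of C is to fit the same linear SVM with several values and count the support vectors. The sketch below uses two overlapping synthetic clusters; the centers, spread, and C values are arbitrary choices made for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters, so no hyperplane separates them perfectly.
X, y = make_blobs(n_samples=200, centers=[[0, 0], [2, 2]],
                  cluster_std=1.5, random_state=0)

n_support = {}
for C in (0.01, 1, 100):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Points on or inside the margin become support vectors.
    n_support[C] = int(clf.n_support_.sum())
    print(f"C={C:>6}: support vectors={n_support[C]}, "
          f"train accuracy={clf.score(X, y):.3f}")
```

A small C typically gives a wide margin with many support vectors, while a large C narrows the margin and keeps fewer of them.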

Question 3: What is the Kernel Trick in SVM? Give one example of a kernel and explain its use case.

Answer:  
The Kernel Trick is a technique in SVM that allows it to operate in a high-dimensional feature space without explicitly computing the coordinates of the data in that space. Instead, it uses a kernel function to calculate the inner product between two data points in the transformed space.

Example – RBF (Radial Basis Function) Kernel:
The RBF kernel is useful for handling non-linear relationships. It scores the similarity of two points by their distance, K(x, z) = exp(-γ‖x − z‖²), which corresponds to an implicit mapping into an infinite-dimensional feature space and lets the SVM find complex decision boundaries. It is commonly used when the data is not linearly separable.
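
The kernel trick can be checked numerically. Because the RBF feature map is infinite-dimensional, the sketch below swaps in a degree-2 polynomial kernel, whose explicit feature map is small enough to write out, and then computes one RBF value by hand. The vectors `x`, `z` and the value of `gamma` are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def phi(v):
    """Explicit feature map for the kernel (x . z)^2 in two dimensions."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

# Inner product computed in the transformed space...
explicit = phi(x) @ phi(z)
# ...matches the kernel evaluated directly in the original space.
implicit = (x @ z) ** 2
print(explicit, implicit)

# RBF kernel: no finite feature map, but the kernel value is easy to compute.
gamma = 0.5
rbf_manual = np.exp(-gamma * np.sum((x - z) ** 2))
rbf_sklearn = rbf_kernel(x.reshape(1, -1), z.reshape(1, -1), gamma=gamma)[0, 0]
print(rbf_manual, rbf_sklearn)
```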

Question 4: What is a Naïve Bayes Classifier, and why is it called “naïve”?

Answer:
A Naïve Bayes Classifier is a probabilistic machine learning model based on Bayes’ Theorem. It is primarily used for classification tasks, especially text classification problems such as spam detection and sentiment analysis.

The classifier assumes that the presence (or absence) of one feature is independent of the presence (or absence) of any other feature, given the class label. This is a strong assumption and often not true in real-world data, which is why the algorithm is called "naïve".
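
The independence assumption means the class score is simply the prior multiplied by one likelihood per feature. The worked example below uses made-up probabilities (all numbers are hypothetical) for an email containing the words "free" and "meeting".

```python
# Hypothetical priors and per-word likelihoods, chosen only for illustration.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.5, "meeting": 0.1},
    "ham": {"free": 0.1, "meeting": 0.4},
}
email_words = ["free", "meeting"]

# Naive assumption: P(words | class) is the product of per-word likelihoods.
scores = {}
for label, prior in priors.items():
    score = prior
    for word in email_words:
        score *= likelihoods[label][word]
    scores[label] = score

# Normalize the unnormalized scores into posterior probabilities.
total = sum(scores.values())
posterior = {label: s / total for label, s in scores.items()}
print(posterior)
```

Here the unnormalized scores are 0.4 × 0.5 × 0.1 = 0.020 for spam and 0.6 × 0.1 × 0.4 = 0.024 for ham, so the email is labeled ham with posterior probability of about 0.55.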

Question 5: Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants.
When would you use each one?

Answer:
Naïve Bayes classifiers are a family of probabilistic models based on Bayes’ Theorem with the assumption of conditional independence between features. There are several variants of Naïve Bayes, each tailored to different types of data. The most common ones are Gaussian, Multinomial, and Bernoulli Naïve Bayes.

1. Gaussian Naïve Bayes

Description:
Assumes that, within each class, every feature follows a normal (Gaussian) distribution. The likelihood of a feature value is computed from the mean and variance of that feature, estimated separately for each class in the training data.

Use case:
Suitable for continuous numeric data where features are assumed to be normally distributed.

Example:
Predicting whether a patient has a disease based on lab measurements like blood pressure, cholesterol levels, etc.

2. Multinomial Naïve Bayes

Description:
Designed for discrete count data. It models the data using multinomial distributions and is typically used when features represent the frequencies or counts of events.

Use case:
Commonly used in text classification tasks (like spam detection or sentiment analysis) where features represent word counts or term frequencies (e.g., in a bag-of-words model).

Example:
Classifying emails as spam or not based on word occurrence counts.

3. Bernoulli Naïve Bayes

Description:
Assumes binary features, where each feature is either present (1) or absent (0). It models the data using Bernoulli distributions.

Use case:
Also used in text classification, but when features are binary—i.e., whether a word appears in a document or not, regardless of frequency.

Example:
Detecting whether a document belongs to a topic based on the presence or absence of certain keywords.

When to Use Each:

Use Gaussian NB when your features are continuous and roughly normally distributed.

Use Multinomial NB when your features are discrete counts (e.g., word frequencies).

Use Bernoulli NB when your features are binary indicators (e.g., presence/absence of a feature).
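
The pairing of variant and feature type can be sketched on tiny hand-made arrays. All values below are hypothetical and only meant to show which estimator fits which kind of input.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 0, 1, 1, 1])

# Continuous measurements -> Gaussian NB.
X_cont = np.array([[5.1, 120.0], [4.9, 118.0], [5.0, 121.0],
                   [7.2, 160.0], [7.0, 158.0], [7.4, 163.0]])
gnb_pred = GaussianNB().fit(X_cont, y).predict([[5.0, 119.0]])

# Word counts per document -> Multinomial NB.
X_counts = np.array([[3, 0, 1], [2, 0, 0], [4, 1, 0],
                     [0, 3, 1], [0, 2, 2], [1, 4, 0]])
mnb_pred = MultinomialNB().fit(X_counts, y).predict([[3, 0, 0]])

# Presence/absence of each word -> Bernoulli NB.
X_bin = (X_counts > 0).astype(int)
bnb_pred = BernoulliNB().fit(X_bin, y).predict([[1, 0, 0]])

print("Gaussian:", gnb_pred, "Multinomial:", mnb_pred, "Bernoulli:", bnb_pred)
```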

Question 6:   Write a Python program to:
● Load the Iris dataset
● Train an SVM Classifier with a linear kernel
● Print the model's accuracy and support vectors.
(Include your Python code and output in the code box below.)  

Dataset Info:
● You can use any suitable datasets like Iris, Breast Cancer, or Wine from
sklearn.datasets or a CSV file you have.  
Answer:



In [1]:
# Import required libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the SVM classifier with a linear kernel
svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)

# Make predictions
y_pred = svm_model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print the results
print("SVM Model Accuracy:", accuracy)
print("\nSupport Vectors:")
print(svm_model.support_vectors_)


SVM Model Accuracy: 1.0

Support Vectors:
[[4.8 3.4 1.9 0.2]
 [5.1 3.3 1.7 0.5]
 [4.5 2.3 1.3 0.3]
 [5.6 3.  4.5 1.5]
 [5.4 3.  4.5 1.5]
 [6.7 3.  5.  1.7]
 [5.9 3.2 4.8 1.8]
 [5.1 2.5 3.  1.1]
 [6.  2.7 5.1 1.6]
 [6.3 2.5 4.9 1.5]
 [6.1 2.9 4.7 1.4]
 [6.5 2.8 4.6 1.5]
 [6.9 3.1 4.9 1.5]
 [6.3 2.3 4.4 1.3]
 [6.3 2.5 5.  1.9]
 [6.3 2.8 5.1 1.5]
 [6.3 2.7 4.9 1.8]
 [6.  3.  4.8 1.8]
 [6.  2.2 5.  1.5]
 [6.2 2.8 4.8 1.8]
 [6.5 3.  5.2 2. ]
 [7.2 3.  5.8 1.6]
 [5.6 2.8 4.9 2. ]
 [5.9 3.  5.1 1.8]
 [4.9 2.5 4.5 1.7]]


Question 7:  Write a Python program to:
● Load the Breast Cancer dataset
● Train a Gaussian Naïve Bayes model
● Print its classification report including precision, recall, and F1-score.
(Include your Python code and output in the code box below.)  
Answer:

In [2]:
# Import required libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Load the Breast Cancer dataset
cancer = datasets.load_breast_cancer()
X = cancer.data
y = cancer.target

# Split the dataset into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the Gaussian Naïve Bayes model
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Predict on test data
y_pred = gnb.predict(X_test)

# Print classification report
print(classification_report(y_test, y_pred, target_names=cancer.target_names))


              precision    recall  f1-score   support

   malignant       1.00      0.93      0.96        43
      benign       0.96      1.00      0.98        71

    accuracy                           0.97       114
   macro avg       0.98      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114



Question 8: Write a Python program to:
● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best
C and gamma.
● Print the best hyperparameters and accuracy.
(Include your Python code and output in the code box below.)


Answer:

In [3]:
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load Wine dataset
wine = datasets.load_wine()
X = wine.data
y = wine.target

# Split dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the SVM classifier
svm = SVC()

# Define parameter grid for GridSearchCV
param_grid = {
    'C': [0.1, 1, 10, 100],        # Regularization parameter
    'gamma': ['scale', 0.001, 0.01, 0.1, 1],  # Kernel coefficient
    'kernel': ['rbf']               # gamma has no effect on a linear kernel, so search with RBF
}

# Initialize GridSearchCV
grid_search = GridSearchCV(svm, param_grid, cv=5, verbose=1)

# Fit GridSearchCV
grid_search.fit(X_train, y_train)

# Best hyperparameters
print("Best Hyperparameters:", grid_search.best_params_)

# Predict on test data using the best estimator
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Test Set Accuracy:", accuracy)


Fitting 5 folds for each of 20 candidates, totalling 100 fits
Best Hyperparameters: {'C': 100, 'gamma': 'scale', 'kernel': 'rbf'}
Test Set Accuracy: 0.8333333333333334
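
The 0.83 test accuracy above may partly reflect unscaled inputs: the Wine features span very different ranges, and the RBF kernel depends on distances between samples. The sketch below is one possible variant, not part of the original answer, that standardizes the features inside a pipeline so that scaling is refit on each cross-validation fold. The `svc__` prefixes are the step names that `make_pipeline` generates automatically.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Scaling lives inside the pipeline, so each CV fold is scaled
# using statistics from its own training portion only.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": ["scale", 0.001, 0.01, 0.1, 1],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

scaled_accuracy = accuracy_score(y_test, search.predict(X_test))
print("Best Hyperparameters:", search.best_params_)
print("Test Set Accuracy (scaled):", scaled_accuracy)
```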


Question 9: Write a Python program to:
● Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using
sklearn.datasets.fetch_20newsgroups).
● Print the model's ROC-AUC score for its predictions.
(Include your Python code and output in the code box below.)   
Answer:

In [4]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Load the dataset (choose only two categories to make it binary classification)
categories = ['comp.graphics', 'rec.sport.baseball']
newsgroups = fetch_20newsgroups(subset='all', categories=categories)

X = newsgroups.data
y = newsgroups.target  # binary labels: 0 or 1

# Split dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert text data to TF-IDF features
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train Multinomial Naïve Bayes classifier
nb = MultinomialNB()
nb.fit(X_train_tfidf, y_train)

# Predict probabilities for ROC-AUC
y_probs = nb.predict_proba(X_test_tfidf)[:, 1]  # Probability for positive class

# Calculate ROC-AUC score
roc_auc = roc_auc_score(y_test, y_probs)

print(f"ROC-AUC score: {roc_auc:.4f}")


ROC-AUC score: 0.9999


Question 10: Imagine you’re working as a data scientist for a company that handles
email communications.  
Your task is to automatically classify emails as Spam or Not Spam. The emails may
contain:   
● Text with diverse vocabulary   
● Potential class imbalance (far more legitimate emails than spam)   
● Some incomplete or missing data   
Explain the approach you would take to:   
● Preprocess the data (e.g. text vectorization, handling missing data)   
● Choose and justify an appropriate model (SVM vs. Naïve Bayes)   
● Address class imbalance   
● Evaluate the performance of your solution with suitable metrics
And explain the business impact of your solution.   
(Include your Python code and output in the code box below.)   
Answer:  
Approach to Spam Email Classification:
1. Preprocessing the Data

Text Vectorization:
Use TF-IDF Vectorizer to convert emails into numerical features that capture term importance and reduce the impact of common words. This is effective for diverse vocabulary in emails.

Handling Missing Data:

Fill missing email bodies with empty strings, or drop those rows, because the vectorizer expects string input and the body text carries most of the signal.

For missing metadata, either impute or exclude those features.

Once missing entries are filled, TF-IDF produces a sparse matrix, which both Naïve Bayes and linear SVMs handle efficiently.

Text Cleaning:
Lowercase the text, remove punctuation and stop words, and optionally apply stemming or lemmatization to reduce noise.
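
The preprocessing steps above can be sketched on a few invented emails. The `raw_emails` list is hypothetical, and stemming or lemmatization is left out because it would need an extra library.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical raw inputs; None stands for a missing email body.
raw_emails = [
    "WIN a FREE prize now!!!",
    None,
    "Meeting moved to 3pm, see agenda attached.",
    "Free free FREE offer, click now",
]

# Replace missing bodies with empty strings so the vectorizer gets strings only.
clean_emails = [e if isinstance(e, str) else "" for e in raw_emails]

# The vectorizer lowercases, drops punctuation via its token pattern,
# and removes English stop words.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X_tfidf = vectorizer.fit_transform(clean_emails)

print("Matrix shape:", X_tfidf.shape)
print("Vocabulary:", sorted(vectorizer.vocabulary_))
print("Non-zero entries for the missing email:", X_tfidf[1].nnz)
```

The filled-in email becomes an all-zero row, so it stays in the dataset without contributing any terms.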

2. Model Choice: SVM vs. Naïve Bayes

Naïve Bayes:

Fast, simple, and performs well with high-dimensional sparse text data.

Good baseline, especially with bag-of-words or TF-IDF features.

Assumes feature independence, which is not always true but works well in practice for spam detection.

SVM:

Can find complex decision boundaries and often outperforms Naïve Bayes on text classification when tuned properly.

Can handle large feature spaces but may require more computation and careful tuning (e.g., kernel, regularization).

Allows use of class weights to address imbalance.

Recommendation: Start with Multinomial Naïve Bayes for speed and simplicity; move to SVM if more accuracy is needed.

3. Addressing Class Imbalance

Use resampling techniques such as SMOTE or random oversampling to balance the classes. Apply them to the training split only, so the test set keeps the real class distribution.

Alternatively or additionally, use class weights in models (e.g., class_weight='balanced' in SVM).

Tune decision thresholds to balance precision and recall depending on business priorities (e.g., minimize false negatives).
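
Class weighting and threshold tuning can be sketched on synthetic imbalanced data. The example below assumes a 95/5 class split generated with `make_classification` as a stand-in for real email features, treats class 1 as spam, and uses arbitrary thresholds. SMOTE is not shown because it requires the separate imbalanced-learn package.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for vectorized emails: about 5% positives (class 1).
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# class_weight='balanced' reweights errors inversely to class frequency.
clf = LinearSVC(class_weight="balanced", max_iter=10000, random_state=0)
clf.fit(X_train, y_train)

# Signed distance to the hyperplane; the default decision threshold is 0.
scores = clf.decision_function(X_test)

results = {}
for threshold in (0.0, -0.5):
    y_pred = (scores > threshold).astype(int)
    results[threshold] = (
        precision_score(y_test, y_pred, zero_division=0),
        recall_score(y_test, y_pred),
    )
    print(f"threshold={threshold:+.1f}  "
          f"precision={results[threshold][0]:.2f}  "
          f"recall={results[threshold][1]:.2f}")
```

Lowering the threshold can only keep or raise recall for the positive class, usually at the cost of precision, so the right setting depends on which error is more expensive for the business.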

4. Evaluation Metrics

Recall: Important to catch as many spam emails as possible (minimize false negatives).

Precision: Important to avoid marking legitimate emails as spam (minimize false positives).

F1-Score: Balance between precision and recall.

ROC-AUC: Overall model discrimination ability.

Confusion Matrix: For detailed error analysis.

5. Business Impact

Better spam filtering improves user experience, reduces risk from phishing/malicious emails, and increases trust in the email platform.

Automating spam classification saves manual labor and reduces operational costs.

Helps maintain brand reputation and customer satisfaction.  

Example Python Code:

In [13]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix

# Select two categories to simulate Spam vs Not Spam
categories = ['rec.sport.hockey', 'talk.politics.misc']

# Load the data
data = fetch_20newsgroups(subset='all', categories=categories, remove=('headers','footers','quotes'))

X = data.data
y = data.target  # 0 or 1

# Split dataset with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Build a pipeline: TF-IDF vectorizer + Linear SVM with class_weight balanced
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', max_features=5000)),
    ('svm', LinearSVC(class_weight='balanced', random_state=42))
])

# Train the model
pipeline.fit(X_train, y_train)

# Make predictions
y_pred = pipeline.predict(X_test)

# Evaluate
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))


Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.91      0.94       200
           1       0.89      0.98      0.93       155

    accuracy                           0.94       355
   macro avg       0.94      0.94      0.94       355
weighted avg       0.94      0.94      0.94       355

Confusion Matrix:
 [[181  19]
 [  3 152]]
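
The class 1 figures in the report can be recomputed by hand from the confusion matrix above, which is a useful check on how the metrics are defined. The sketch treats class 1 as the positive class.

```python
import numpy as np

# Confusion matrix copied from the output above: rows = true, columns = predicted.
cm = np.array([[181, 19],
               [3, 152]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)                           # 152 / 171
recall = tp / (tp + fn)                              # 152 / 155
f1 = 2 * precision * recall / (precision + recall)   # 304 / 326
accuracy = (tp + tn) / cm.sum()                      # 333 / 355

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} accuracy={accuracy:.2f}")
```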
