Q1. What is a Support Vector Machine (SVM), and how does it work?

A Support Vector Machine (SVM) is a supervised machine learning algorithm that separates data into different classes by finding the optimal separating boundary, or hyperplane, that maximizes the margin between the classes. SVMs are used for both classification and regression tasks and are particularly effective for high-dimensional data and complex, non-linear problems, which they handle using kernel functions.

* How SVMs Work

Finding the Hyperplane: SVMs find a hyperplane (a decision boundary) that best separates different classes of data points in a multi-dimensional space. 

Maximizing the Margin: The "best" hyperplane is the one that maximizes the margin, which is the distance between the hyperplane and the nearest data points (called "support vectors") from each class. 

Support Vectors: The support vectors are the crucial data points that lie closest to the separating hyperplane and define its position and orientation. 

Kernel Trick for Non-Linear Data: For data that cannot be separated by a linear hyperplane, SVMs use a kernel trick. This involves mapping the data into a higher-dimensional feature space where it becomes linearly separable, allowing for the creation of complex, non-linear decision boundaries. 
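
The geometry described above can be sketched in a few lines of plain Python. This is only an illustration: the weight vector `w`, bias `b`, and the helper names are assumptions chosen for the example, not values learned from data.

```python
import math

def decision_value(w, b, x):
    """Return w·x + b; its sign tells which side of the hyperplane x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def distance_to_hyperplane(w, b, x):
    """Signed geometric distance from x to the hyperplane w·x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return decision_value(w, b, x) / norm

def margin_width(w):
    """Distance between the planes w·x + b = +1 and w·x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical hyperplane and point, chosen only for illustration
w, b = (3.0, 4.0), -5.0
point = (3.0, 4.0)

print(decision_value(w, b, point))          # raw score
print(distance_to_hyperplane(w, b, point))  # distance in input units
print(margin_width(w))                      # margin the SVM tries to maximize
```

Because the margin width is 2/||w||, maximizing the margin is the same as minimizing the norm of `w`.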

Q2. Explain the difference between Hard Margin and Soft Margin SVM.


* Hard Margin SVM

Perfect Separation: Enforces a strict boundary that perfectly separates the classes, resulting in zero misclassifications on the training data. 

Sensitivity to Outliers: If a single outlier makes the classes overlap, no hard margin solution exists at all. Even when a solution exists, one point lying close to the other class can force a very narrow margin, so the boundary ends up fitted to noise.

Linearly Separable Data: Only works with datasets where the classes can be perfectly separated by a linear hyperplane. 

No Training Error: The goal is to achieve a margin with a 0% error rate on the training set. 

* Soft Margin SVM

Allows for Misclassifications: Introduces the concept of slack variables to allow for some data points to be misclassified or to fall on the wrong side of the margin. 

Addresses Outliers and Noise: More robust to outliers and overlapping classes, as it doesn't demand perfect separation. 

Balancing Act: Involves a trade-off between maximizing the margin (which reduces the risk of overfitting) and minimizing classification errors. 

Regularization Parameter (C): A parameter (often denoted as C) controls the balance between maximizing the margin and minimizing the classification errors; a higher C penalizes misclassifications more heavily. 
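
To make the role of slack variables and C concrete, here is a minimal sketch of the standard soft margin objective, 0.5·||w||² + C·Σ slack. The hyperplane and the four points are made up for illustration, and the function names are assumptions rather than any library's API.

```python
def slack(w, b, x, y):
    """Slack variable for one point: max(0, 1 - y * (w·x + b))."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - y * score)

def soft_margin_objective(w, b, points, labels, C):
    """Margin term plus C times the total margin violation."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    total_slack = sum(slack(w, b, x, y) for x, y in zip(points, labels))
    return margin_term + C * total_slack

# Hypothetical hyperplane x1 = 0 and four labelled points
w, b = (1.0, 0.0), 0.0
points = [(2.0, 0.0), (0.5, 0.0), (-1.0, 0.0), (-3.0, 0.0)]
labels = [1, 1, 1, -1]

# 0 = outside the margin, between 0 and 1 = inside the margin, above 1 = misclassified
slacks = [slack(w, b, x, y) for x, y in zip(points, labels)]
print(slacks)

for C in (1, 10):
    print(C, soft_margin_objective(w, b, points, labels, C))
```

The same violations cost more under the larger C, which is why a high C pushes the optimizer toward fewer training errors at the expense of a narrower margin.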

Q3. What is the Kernel Trick in SVM? Give one example of a kernel and explain its use case.

The “Kernel Trick” is a method used in Support Vector Machines (SVMs) to handle data that is not linearly separable by implicitly mapping it into a higher-dimensional feature space where a linear separation may exist.

This technique enables the SVM to identify a hyperplane that separates the data with the maximum margin, even when the data is not linearly separable in its original space. The kernel functions are used to compute the inner product between pairs of points in the transformed feature space without explicitly computing the transformation itself. This makes it computationally efficient to deal with high dimensional feature spaces.

The "Trick": The SVM optimization only ever needs dot products between pairs of data points. Replacing each dot product with a kernel function K(x, z) therefore gives the effect of working in the higher-dimensional space without ever constructing the transformed points.
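
A small check makes this tangible. The sketch below uses a degree-2 polynomial kernel (a different kernel from the RBF example that follows, chosen because its feature map is finite and easy to write out) and compares the kernel value with the dot product of explicitly mapped points. The helper names and sample points are assumptions for illustration.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit degree-2 feature map for a 2D point (3D output)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: (x·z)^2."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, 4.0)

explicit = dot(phi(x), phi(z))  # map first, then take the dot product
implicit = poly_kernel(x, z)    # stay in 2D and apply the kernel

print(explicit, implicit)
```

Both routes give the same number up to floating-point rounding, but the kernel never builds the 3D vectors.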

Example: The Radial Basis Function (RBF) Kernel

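Definition: The RBF (Gaussian) kernel measures the similarity between two data points x and z as K(x, z) = exp(-γ‖x - z‖²). The value is 1 when the points coincide and decays toward 0 as the distance between them grows; γ (gamma) controls how quickly it decays.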

Use Case: The RBF kernel is widely used for its ability to handle complex, non-linear data with intricate patterns. 

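Mechanism: The RBF kernel corresponds to an implicit feature space of infinite dimension, so the mapping is never computed explicitly. Intuitively, points that no straight line can separate in the original 2D space (for example, one class forming a ring around the other) can become separable by a hyperplane in the implicit space, which appears as a curved decision boundary in the original space.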
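
As a rough sketch, the RBF similarity can be computed directly from its usual formula, exp(-gamma · squared distance). The function name, the `gamma` default, and the sample points below are assumptions for illustration.

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """RBF similarity: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

origin = (0.0, 0.0)
same = rbf_kernel(origin, (0.0, 0.0))   # identical points
near = rbf_kernel(origin, (1.0, 1.0))   # squared distance 2
far = rbf_kernel(origin, (5.0, 5.0))    # squared distance 50

print(same, near, far)
```

A larger `gamma` makes the similarity fall off faster, so each training point influences a smaller neighbourhood.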

Q4. What is a Naïve Bayes Classifier, and why is it called “naïve”?

* Naive Bayes Classifier

The Naïve Bayes classifier is a supervised machine learning algorithm used for classification tasks such as text classification. It applies Bayes' theorem to estimate the probability of each class given the input features and predicts the most probable class.

It belongs to the family of generative learning algorithms, which means that it models the distribution of inputs for a given class or category. This approach is based on the assumption that the features of the input data are conditionally independent given the class, allowing the algorithm to make predictions quickly and accurately.

Why is it called "Naïve"?

The term "naïve" comes from the algorithm's core assumption of conditional independence among features. 

Unrealistic Assumption: In reality, features often have dependencies (e.g., in text classification, certain words are more likely to appear together). 

Simplified Calculation: This assumption simplifies the complex probability calculations required in Bayes' theorem, as the algorithm treats each feature's contribution to the class probability independently. 

Practical Success: Despite this oversimplified assumption, Naïve Bayes classifiers have proven to be effective in many real-world applications, such as spam filtering. Classification only requires the correct class to receive the highest score, not exact probabilities, and the independence assumption keeps training and prediction fast.
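
The simplification can be seen in a short sketch: under the independence assumption, the score of each class is just its prior multiplied by one likelihood per observed feature. All probabilities below are invented for illustration, and `naive_bayes_posterior` is a hypothetical helper, not a library function.

```python
def naive_bayes_posterior(priors, likelihoods, observed):
    """Posterior over classes, treating observed features as independent."""
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for feature in observed:
            score *= likelihoods[cls][feature]  # one factor per feature
        scores[cls] = score
    total = sum(scores.values())
    return {cls: score / total for cls, score in scores.items()}

# Made-up numbers: P(class) and P(word appears | class)
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.5, "win": 0.4},
    "ham": {"free": 0.1, "win": 0.05},
}

posterior = naive_bayes_posterior(priors, likelihoods, ["free", "win"])
print(posterior)
```

Without the assumption, the model would need the joint probability of every combination of words, which is impractical to estimate.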

Q5. Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants.
When would you use each one?

* Gaussian Naïve Bayes

In this variant, we assume that the features are continuous values (like age, weight, temperature, income, etc.) and that each feature for a given class follows a normal (Gaussian) distribution.

For example, if we are classifying whether a person is “fit” or “unfit” based on their height and weight, these numerical features are assumed to vary normally (bell-shaped curve) within each class.

When to Use:

When features are continuous and roughly normally distributed.

Examples:

Predicting whether a patient has a disease based on lab results.

Classifying iris flowers using petal and sepal measurements.

Sensor data analysis (like temperature, pressure, etc.).
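
A minimal sketch of the idea behind the Gaussian variant: estimate a mean and variance per class for a feature, then compare the Gaussian densities at a new value. The petal lengths below are made-up numbers, and the helper names are assumptions.

```python
import math

def fit_gaussian(values):
    """Estimate mean and (population) variance of one feature for one class."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical petal lengths (cm) for two classes
petal_length = {
    "setosa": [1.4, 1.5, 1.3, 1.6],
    "versicolor": [4.5, 4.7, 4.0, 4.4],
}
params = {cls: fit_gaussian(vals) for cls, vals in petal_length.items()}

new_value = 1.5
likelihood = {cls: gaussian_pdf(new_value, m, v) for cls, (m, v) in params.items()}
print(likelihood)
```

With several features, the per-feature densities are multiplied together along with the class prior.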

* Multinomial Naïve Bayes

This model is used when features represent discrete counts — such as how many times a word appears in a document.

It assumes that the features follow a multinomial distribution, which models the probability of observing a particular set of event counts.

It is one of the most common models used in text classification.

When to Use:

When features are discrete frequency counts (not binary or continuous).

It’s ideal for text and document classification where the features are word counts or term frequencies.

Example:

In spam detection, the number of times words like “free,” “offer,” or “win” appear are used as features.

The Multinomial Naïve Bayes model uses these counts to determine if the email is spam or not.
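
The spam example could be sketched with scikit-learn roughly as follows. The four training emails and the test email are invented for illustration; a real filter would need far more data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up corpus
emails = [
    "free offer win now",
    "win free prize",
    "meeting schedule tomorrow",
    "project report attached",
]
labels = ["spam", "spam", "ham", "ham"]

# Each column of X_counts is the number of times a word occurs in an email
vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(emails)

clf = MultinomialNB()  # default Laplace smoothing (alpha=1.0)
clf.fit(X_counts, labels)

new_email = ["free offer win"]
prediction = clf.predict(vectorizer.transform(new_email))
print(prediction[0])
```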

* Bernoulli Naïve Bayes

In this model, each feature is binary (0 or 1) — indicating whether a feature exists or not in a sample.

Instead of counting how many times a word appears, we only record whether the word is present.

It follows the Bernoulli distribution, which models yes/no (true/false) outcomes.

When to Use:

When features are binary (present/absent).

Common in text classification tasks using binary word occurrence features.

Example:

If we are classifying whether a movie review is positive or negative:

Feature “good” = 1 if the word appears, 0 otherwise

Feature “bad” = 1 if it appears, 0 otherwise

Bernoulli Naïve Bayes looks at presence/absence rather than how many times a word appears.
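
A rough scikit-learn sketch of the movie review example, assuming a tiny invented corpus. Passing `binary=True` to `CountVectorizer` records only presence or absence, so a repeated word counts once.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Tiny made-up corpus of reviews
reviews = [
    "good good great movie",
    "good acting",
    "bad boring movie",
    "bad bad plot",
]
sentiments = ["positive", "positive", "negative", "negative"]

# binary=True turns every non-zero count into 1
vectorizer = CountVectorizer(binary=True)
X_binary = vectorizer.fit_transform(reviews)

clf = BernoulliNB()
clf.fit(X_binary, sentiments)

prediction = clf.predict(vectorizer.transform(["good great"]))
print(prediction[0])
```

Unlike the multinomial variant, the Bernoulli model also uses the absence of a word as evidence.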

Q6. Write a Python program to:

● Load the Iris dataset

● Train an SVM Classifier with a linear kernel

● Print the model's accuracy and support vectors.

In [5]:
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target

In [6]:
# Train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=1)

In [8]:
# fit the model
from sklearn.svm import SVC
model = SVC(kernel='linear')
model.fit(X_train, y_train)

In [10]:
# Make prediction on test data
y_pred = model.predict(X_test)

In [11]:
# Accuracy Score
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

1.0

In [None]:
# Support vector
model.support_vectors_

array([[5.1, 3.3, 1.7, 0.5],
       [4.5, 2.3, 1.3, 0.3],
       [4.8, 3.4, 1.9, 0.2],
       [6. , 3.4, 4.5, 1.6],
       [5.7, 2.8, 4.5, 1.3],
       [6. , 2.7, 5.1, 1.6],
       [6.9, 3.1, 4.9, 1.5],
       [5.9, 3.2, 4.8, 1.8],
       [4.9, 2.4, 3.3, 1. ],
       [6.1, 2.9, 4.7, 1.4],
       [6.7, 3.1, 4.7, 1.5],
       [6.2, 2.2, 4.5, 1.5],
       [6.3, 2.5, 4.9, 1.5],
       [6.2, 2.8, 4.8, 1.8],
       [6.3, 2.7, 4.9, 1.8],
       [6.1, 3. , 4.9, 1.8],
       [6.5, 3.2, 5.1, 2. ],
       [6. , 3. , 4.8, 1.8],
       [5.9, 3. , 5.1, 1.8],
       [4.9, 2.5, 4.5, 1.7],
       [7.2, 3. , 5.8, 1.6],
       [6.3, 2.8, 5.1, 1.5]])
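
Besides the coordinates themselves, a fitted `SVC` exposes how many support vectors each class contributes (`n_support_`) and which training rows they are (`support_`). The sketch below repeats the same split and model so that it runs on its own; no output is shown because exact values can vary between scikit-learn versions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

model = SVC(kernel="linear").fit(X_train, y_train)

# Number of support vectors contributed by each class
print("Support vectors per class:", model.n_support_)
# Row indices of the support vectors within X_train
print("Indices in training set:", model.support_)
print("Total support vectors:", len(model.support_vectors_))
```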

Q7. Write a Python program to:

● Load the Breast Cancer dataset

● Train a Gaussian Naïve Bayes model

● Print its classification report including precision, recall, and F1-score

In [None]:
# Load the Breast Cancer dataset
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X = data.data
y = data.target

In [None]:
# Train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

In [16]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()

In [17]:
gnb.fit(X_train, y_train)

In [19]:
y_pred = gnb.predict(X_test)

In [21]:
from sklearn.metrics import classification_report, accuracy_score
accuracy_score(y_test, y_pred)

0.9473684210526315

In [23]:
# Display the classification report
print(classification_report(y_test, y_pred, target_names=data.target_names))

              precision    recall  f1-score   support

   malignant       0.94      0.92      0.93        63
      benign       0.95      0.96      0.96       108

    accuracy                           0.95       171
   macro avg       0.94      0.94      0.94       171
weighted avg       0.95      0.95      0.95       171



Q8. Write a Python program to:

● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best C and gamma.

● Print the best hyperparameters and accuracy.


In [24]:
from sklearn.datasets import load_wine
wine = load_wine()
X = wine.data 
y = wine.target

In [25]:
# Train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

In [26]:
from sklearn.svm import SVC
svm = SVC()

In [27]:
param_grid = {
    'C': [0.1, 1, 10, 100],          # Regularization parameter
    'gamma': [1, 0.1, 0.01, 0.001],  # Kernel coefficient for 'rbf'
    'kernel': ['rbf']                # Use RBF kernel
}

In [None]:
# Apply GridSearchCV to find the best hyperparameters
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(svm, param_grid, refit=True, verbose=1, cv=5)
grid.fit(X_train, y_train)

Fitting 5 folds for each of 16 candidates, totalling 80 fits


In [29]:
# Print the best hyperparameters found by the grid search
print("\nBest Hyperparameters found by GridSearchCV:")
print(grid.best_params_)


Best Hyperparameters found by GridSearchCV:
{'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}


In [30]:
# Predict using the best model
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)

In [31]:
# Evaluate accuracy
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)

0.7592592592592593
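
The Wine features have very different numeric ranges, and RBF SVMs are sensitive to feature scale, so a common variation is to standardize the features before the SVM. The sketch below is one possible way to do that with a `Pipeline`, so that the scaler is refit inside every cross-validation fold. It is a suggestion rather than part of the original answer, and no result is quoted because it was not run here.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Scale features to zero mean and unit variance, then fit the RBF SVM
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(kernel="rbf")),
])

# Pipeline parameters are addressed as <step name>__<parameter>
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [1, 0.1, 0.01, 0.001],
}

grid = GridSearchCV(pipeline, param_grid, cv=5)
grid.fit(X_train, y_train)

scaled_accuracy = grid.score(X_test, y_test)
print(grid.best_params_)
print(scaled_accuracy)
```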


Q9. Write a Python program to:

● Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using
sklearn.datasets.fetch_20newsgroups).

● Print the model's ROC-AUC score for its predictions.


In [32]:
# Import necessary libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

# Load the text dataset (subset for faster training)
categories = ['sci.space', 'rec.sport.baseball', 'comp.graphics']
newsgroups = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))

X = newsgroups.data    # Text documents
y = newsgroups.target  # Class labels

# Convert text data into TF-IDF feature vectors
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X_tfidf = vectorizer.fit_transform(X)

#  Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.3, random_state=42)

# Create and train a Multinomial Naïve Bayes classifier
model = MultinomialNB()
model.fit(X_train, y_train)

# Predict class probabilities on the test data
y_prob = model.predict_proba(X_test)

# Compute ROC-AUC score
# label_binarize one-hot encodes the labels, so each class is scored
# One-vs-Rest and the per-class AUCs are macro-averaged
y_test_bin = label_binarize(y_test, classes=[0, 1, 2])
roc_auc = roc_auc_score(y_test_bin, y_prob, multi_class='ovr')

# Print the ROC-AUC score
print("ROC-AUC Score of the Naïve Bayes Classifier: {:.4f}".format(roc_auc))


ROC-AUC Score of the Naïve Bayes Classifier: 0.9875
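
As a side note, `roc_auc_score` can also take the integer class labels directly when `multi_class='ovr'` is set, which makes the binarization step optional. The sketch below checks that both routes agree; it uses the bundled Iris data and `GaussianNB` purely so that it runs without downloading the newsgroups corpus, which is a substitution made for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import label_binarize

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Class probabilities, one column per class
y_prob = GaussianNB().fit(X_train, y_train).predict_proba(X_test)

# Route 1: one-hot labels, macro average over the columns
auc_binarized = roc_auc_score(label_binarize(y_test, classes=[0, 1, 2]), y_prob)

# Route 2: integer labels with an explicit One-vs-Rest strategy
auc_direct = roc_auc_score(y_test, y_prob, multi_class="ovr")

print(auc_binarized, auc_direct)
```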


Q10. Imagine you’re working as a data scientist for a company that handles email communications.

Your task is to automatically classify emails as Spam or Not Spam. The emails may contain:

● Text with diverse vocabulary

● Potential class imbalance (far more legitimate emails than spam)

● Some incomplete or missing data

Explain the approach you would take to:

● Preprocess the data (e.g. text vectorization, handling missing data)

● Choose and justify an appropriate model (SVM vs. Naïve Bayes)

● Address class imbalance

● Evaluate the performance of your solution with suitable metrics

And explain the business impact of your solution.

1. Data Preprocessing

Before training any machine learning model, the first step is to prepare and clean the data.

a) Handling Missing Data

Emails might have missing fields like subject lines or message bodies.
If an email has no content at all, it should be removed since it doesn’t provide useful information.
However, if only some parts are missing (like subject), we can replace them with a placeholder such as “unknown” or an empty string, so the dataset remains consistent.
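
These two rules can be sketched in plain Python. The record layout (`subject`, `body`, `label` keys) and the function name are assumptions made for this example.

```python
def prepare_emails(emails, placeholder="unknown"):
    """Drop emails with no content; fill partially missing fields."""
    cleaned = []
    for email in emails:
        subject = (email.get("subject") or "").strip()
        body = (email.get("body") or "").strip()
        if not subject and not body:
            continue  # nothing to learn from, so skip the record
        cleaned.append({
            "subject": subject or placeholder,
            "body": body or placeholder,
            "label": email["label"],
        })
    return cleaned

# Hypothetical raw records (1 = spam, 0 = not spam)
raw_emails = [
    {"subject": "Win now", "body": "Claim your prize", "label": 1},
    {"subject": None, "body": "Meeting at 10", "label": 0},
    {"subject": "", "body": None, "label": 0},
]

cleaned = prepare_emails(raw_emails)
print(cleaned)
```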

b) Text Cleaning

Email content is messy — it can include punctuation, numbers, links, and special characters.
So we need to clean the text by:

Converting all words to lowercase (so “Free” and “free” are treated the same).

Removing punctuation, numbers, and extra spaces.

Removing stop words (very common words like “the”, “and”, “is” that don’t carry much meaning).

Converting words to their base form using stemming or lemmatization (for example, lemmatization maps “running”, “runs”, and “ran” to “run”; a stemmer handles the regular forms but typically leaves “ran” unchanged).

This helps reduce noise and makes the text more meaningful for the model.
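
A minimal cleaning function covering the first three steps might look like the sketch below. The stop word list is deliberately tiny and hypothetical, URL removal is added because links are mentioned above, and stemming or lemmatization is left out because it needs an external library such as NLTK or spaCy.

```python
import re

# Small illustrative stop word list; real lists are much longer
STOP_WORDS = {"the", "and", "is", "a", "at", "to", "of"}

def clean_text(text):
    """Lowercase, strip links, punctuation and digits, drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove links
    text = re.sub(r"[^a-z\s]", " ", text)      # keep letters and whitespace only
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)  # split/join also collapses extra spaces

sample = "The FREE offer: win $1000 NOW at http://example.com!!"
print(clean_text(sample))
```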

c) Converting Text to Numbers

Machine learning models can’t directly understand text — they need numbers.
So we use a technique called vectorization.

A common and effective method for spam detection is TF-IDF (Term Frequency–Inverse Document Frequency).
It measures how important a word is in an email relative to all other emails.
For example, common words like “the” get low importance, while rare but meaningful words like “lottery” or “winner” get higher importance.
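
The weighting can be illustrated with the textbook formula, tf × log(N / df), on three invented documents. This is a simplified sketch; scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each document vector by default, so its numbers differ.

```python
import math

def tf_idf(docs):
    """Return one {term: weight} dict per document (textbook TF-IDF)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears
    df = {}
    for tokens in tokenized:
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1

    weights = []
    for tokens in tokenized:
        doc_weights = {}
        for term in set(tokens):
            tf = tokens.count(term) / len(tokens)
            idf = math.log(n_docs / df[term])
            doc_weights[term] = tf * idf
        weights.append(doc_weights)
    return weights

docs = ["the lottery winner", "the meeting is today", "the report is ready"]
weights = tf_idf(docs)
print(weights[0])
```

In the first document, "the" appears in every document and gets weight 0, while "lottery" appears in only one and gets a positive weight.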

2. Choosing the Right Model

Two popular algorithms are suitable here: Naïve Bayes and Support Vector Machine (SVM).

Naïve Bayes

Works extremely well for text classification.

It assumes that all words are independent of each other, which isn’t always true, but in practice it performs surprisingly well.

It’s fast, easy to train, and very effective for spam filtering.

SVM (Support Vector Machine)

Builds a boundary between “spam” and “not spam” emails in a high-dimensional space.

It’s powerful and can handle complex relationships, but it’s slower and needs parameter tuning.

Decision: Start with Naïve Bayes because it’s quick, efficient, and well-suited for text data. Later, you can compare it with an SVM model to see if accuracy improves.
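
One way to set up that comparison is to wrap each model in a pipeline with the same vectorizer, so both see identical features. The sketch below uses a handful of invented emails and `LinearSVC` as the SVM; in practice the two would be compared with cross-validation on real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny made-up training set (1 = spam, 0 = not spam)
train_texts = [
    "win a free prize now",
    "free offer click here",
    "claim your free lottery prize",
    "project meeting moved to tomorrow",
    "please review the attached report",
    "lunch with the team tomorrow",
]
train_labels = [1, 1, 1, 0, 0, 0]

candidates = {
    "naive_bayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "linear_svm": make_pipeline(TfidfVectorizer(), LinearSVC()),
}

test_texts = ["win a free prize", "project meeting tomorrow"]
predictions = {}
for name, model in candidates.items():
    model.fit(train_texts, train_labels)
    predictions[name] = model.predict(test_texts).tolist()

print(predictions)
```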

3. Handling Class Imbalance

In real-world data, the number of spam emails is usually much smaller than legitimate emails.
If not handled properly, the model might just predict “Not Spam” for everything to achieve high accuracy — but that’s not useful.

To fix this, you can:

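Oversample the minority class (Spam): Duplicate spam examples, or generate synthetic ones with techniques like SMOTE. SMOTE works on numeric feature vectors, so it is applied after vectorization and only to the training split.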

Undersample the majority class (Not Spam): Randomly remove some non-spam emails to balance the dataset.

Use class weights: Some models (like SVM) can automatically give more importance to the minority class by setting a higher penalty for misclassifying spam emails.

Adjust probability threshold: Instead of using the default 50% cutoff for classification, lower the threshold so the model catches more spam (increasing recall).
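
Two of these ideas are easy to show with plain Python. The weighting formula below, n_samples / (n_classes × class count), is the one scikit-learn documents for `class_weight="balanced"`; the labels and predicted probabilities are invented for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

def recall_at_threshold(y_true, spam_probs, threshold):
    """Share of actual spam caught when flagging probabilities >= threshold."""
    predicted = [1 if p >= threshold else 0 for p in spam_probs]
    true_pos = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p == 1)
    return true_pos / sum(y_true)

# 90 legitimate emails (0) and 10 spam emails (1)
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)  # spam errors are weighted more heavily

# Hypothetical predicted spam probabilities for four emails
y_true = [1, 1, 1, 0]
spam_probs = [0.9, 0.45, 0.35, 0.1]
print(recall_at_threshold(y_true, spam_probs, 0.5))
print(recall_at_threshold(y_true, spam_probs, 0.3))
```

Lowering the threshold catches more spam but usually flags more legitimate emails too, so it should be tuned against precision on a validation set.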

4. Evaluating the Model

Since the dataset is imbalanced, accuracy alone is not a good metric — it can be misleading.
Instead, we focus on:

Precision: How many of the emails predicted as spam are actually spam?
(We want high precision to avoid marking important emails as spam.)

Recall: How many of the actual spam emails were correctly identified?
(We want high recall so that spam doesn’t slip through.)

F1-Score: A balance between precision and recall.

ROC-AUC: Measures how well the model can distinguish between spam and not spam overall.

These metrics give a more complete picture of model performance.
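
To see why accuracy alone misleads, the sketch below computes the metrics by hand for a made-up set of ten emails and compares a reasonable classifier with one that labels everything as not spam. The function name and data are assumptions for illustration.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the positive class (1 = spam)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# 3 spam emails and 7 legitimate ones
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_model = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]  # catches 2 spam, 1 false alarm
y_all_ham = [0] * 10                       # never flags anything

model_scores = binary_metrics(y_true, y_model)
baseline_scores = binary_metrics(y_true, y_all_ham)
print(model_scores)
print(baseline_scores)
```

The do-nothing baseline still reaches 70% accuracy here while catching no spam at all, which only recall and F1 reveal.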

5. Business Impact

Implementing this system has major benefits for the company:

| Area                | Business Benefit                                                                  |
| ------------------- | --------------------------------------------------------------------------------- |
| **Security**        | Automatically blocks phishing, scams, and malware emails before they reach users. |
| **Productivity**    | Employees don’t waste time sorting through junk emails.                           |
| **User Experience** | Users trust the company’s email service more, improving reputation.               |
| **Cost Efficiency** | Reduces server load, storage, and manual support efforts.                         |
| **Compliance**      | Helps the company follow data protection and security policies.                   |
