Question 1: What is a Support Vector Machine (SVM), and how does it work?

Answer:

A Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression.
For classification, SVM finds the hyperplane that separates the classes with the maximum margin, that is, the largest possible distance between the hyperplane and the nearest training points of each class.


The points closest to the hyperplane are called support vectors. These points are critical in defining the decision boundary.


For non-linear data, SVM uses the kernel trick to map data into higher dimensions, where a linear separation is possible.


SVMs remain effective in high-dimensional spaces, and because the decision function depends only on the support vectors, the trained model is compact.
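
To make the margin and the support vectors concrete, here is a minimal sketch on a hypothetical six-point 2D dataset (the points and the very large C value are assumptions chosen for illustration). It fits a linear SVM, reads the hyperplane parameters w and b from the fitted model, and computes the margin width as 2 / ||w||.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two linearly separable clusters in 2D.
X = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
              [4.0, 4.0], [4.0, 5.0], [5.0, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard-margin SVM on separable data.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]            # normal vector of the separating hyperplane
b = clf.intercept_[0]       # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)

# Support vectors lie on the margin, where |w.x + b| is approximately 1.
sv_decision = clf.decision_function(clf.support_vectors_)

print("Support vectors:\n", clf.support_vectors_)
print("Decision values at support vectors:", sv_decision)
print("Margin width:", margin_width)
```

Only the points nearest the boundary appear in `support_vectors_`; moving any of the other points slightly away from the boundary would leave the fitted hyperplane unchanged.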


Question 2: Explain the difference between Hard Margin and Soft Margin SVM.

Answer:
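Hard Margin SVM:
● Assumes the data is perfectly linearly separable.
● Allows no training point inside the margin or on the wrong side of the hyperplane.
● Has no solution when the classes overlap, and a single outlier can shift the boundary sharply, so it generalizes poorly on noisy data.

Soft Margin SVM:
● Allows some margin violations and misclassifications; the penalty parameter C sets how heavily they are penalized.
● Balances maximizing the margin against minimizing classification error: a small C gives a wider margin and tolerates more violations, while a large C approaches hard-margin behavior.
● Works better on real-world noisy datasets.

Key Difference: Hard margin enforces strict separation, while soft margin introduces flexibility to improve generalization.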
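
The effect of C can be seen with a small experiment. The sketch below is an illustration under assumptions: the overlapping two-cluster dataset from `make_blobs` and the two C values are hypothetical choices, not part of the original answer. It fits a linear soft-margin SVM with a small and a large C and compares margin width and support-vector count.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Hypothetical overlapping clusters, so no hard-margin solution exists.
X, y = make_blobs(n_samples=200, centers=[[-2.0, 0.0], [2.0, 0.0]],
                  cluster_std=1.5, random_state=0)

results = {}
for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    results[C] = {
        # Margin width is 2 / ||w|| for a linear SVM.
        "margin_width": 2.0 / np.linalg.norm(clf.coef_[0]),
        # n_support_ holds the number of support vectors per class.
        "n_support_vectors": int(clf.n_support_.sum()),
        "train_accuracy": clf.score(X, y),
    }
    print(f"C={C}: {results[C]}")
```

With the small C the model accepts more margin violations in exchange for a wider margin; with the large C it narrows the margin to reduce violations on the training data.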


Question 3: What is the Kernel Trick in SVM? Give one example of a kernel and explain its use case.


Answer:
The Kernel Trick allows SVM to perform classification in higher-dimensional spaces without explicitly transforming the data. Instead, a kernel function returns the inner product (a similarity measure) of two points in that space, computed directly from their original coordinates.
Example: Radial Basis Function (RBF) Kernel


Formula:
$K(x, y) = \exp\left(-\gamma \lVert x - y \rVert^{2}\right)$
Use Case: Useful when data is not linearly separable in the original space, such as in image classification or complex pattern recognition tasks.
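
As a rough check of the RBF formula and its use case, the sketch below computes the kernel value by hand and compares it with scikit-learn's `rbf_kernel`, then compares a linear and an RBF SVM on concentric circles. The sample points, the gamma value, and the `make_circles` settings are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# 1. RBF kernel value from the formula versus the library function.
gamma = 0.5
x = np.array([[1.0, 2.0]])
z = np.array([[2.0, 0.0]])
manual = np.exp(-gamma * np.sum((x - z) ** 2))
library = rbf_kernel(x, z, gamma=gamma)[0, 0]
print("manual:", manual, "library:", library)

# 2. Concentric circles are not linearly separable in the original 2D space.
X, y = make_circles(n_samples=400, noise=0.05, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

linear_acc = SVC(kernel="linear").fit(X_train, y_train).score(X_test, y_test)
rbf_acc = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train).score(X_test, y_test)
print("linear accuracy:", linear_acc)
print("RBF accuracy:", rbf_acc)
```

The linear kernel cannot draw a straight line between an inner and an outer ring, while the RBF kernel separates them without the features ever being transformed explicitly.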


Question 4: What is a Naïve Bayes Classifier, and why is it called “naïve”?

Answer:
The Naïve Bayes Classifier is a probabilistic classifier based on Bayes’ Theorem. It assumes that all features are conditionally independent of one another given the class label.
It calculates the posterior probability of each class for a given instance, proportional to the class prior multiplied by the per-feature likelihoods, and selects the class with the highest probability.


It is called “naïve” because the assumption of feature independence is rarely true in real-world data. Despite this, it works well in practice, especially for text classification tasks like spam detection.
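
The calculation can be worked through by hand. In the sketch below, all priors and word likelihoods are hypothetical numbers chosen for illustration, not estimates from real data. The posterior for each class is the prior multiplied by one likelihood term per feature, which is exactly where the independence assumption enters.

```python
# Hypothetical class priors and P(word present | class) values.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.8, "meeting": 0.1},
    "ham": {"free": 0.1, "meeting": 0.5},
}

# Observed email: contains "free", does not contain "meeting".
email = {"free": True, "meeting": False}


def posterior(email, priors, likelihoods):
    """Return normalized class posteriors under the naive independence assumption."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for word, present in email.items():
            p = likelihoods[label][word]
            # Multiply by P(word present) or P(word absent) for this class.
            score *= p if present else (1.0 - p)
        scores[label] = score
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}


post = posterior(email, priors, likelihoods)
prediction = max(post, key=post.get)
print(post)
print("Predicted class:", prediction)
```

Here the unnormalized scores are 0.4 × 0.8 × 0.9 = 0.288 for spam and 0.6 × 0.1 × 0.5 = 0.03 for ham, so the email is assigned to spam.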


Question 5: Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants. When would you use each one?


Answer:
Gaussian Naïve Bayes:
● Assumes features follow a normal (Gaussian) distribution.
● Used for continuous data (e.g., Iris dataset with petal/sepal lengths).

Multinomial Naïve Bayes:
● Works with count-based features.
● Used for text classification (e.g., word counts in spam detection).

Bernoulli Naïve Bayes:
● Assumes binary features (0/1).
● Used when features represent presence/absence (e.g., word present or not in email).
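
The three variants map onto three scikit-learn classes. The following sketch pairs each one with the feature type described above; the four-document corpus and its labels are hypothetical and exist only to produce count and binary features.

```python
from sklearn.datasets import load_iris
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Gaussian NB: continuous measurements.
X_iris, y_iris = load_iris(return_X_y=True)
gnb = GaussianNB().fit(X_iris, y_iris)

# Hypothetical tiny corpus (1 = spam, 0 = not spam).
docs = ["free prize free money", "meeting at noon",
        "free money now", "project meeting notes"]
labels = [1, 0, 1, 0]

# Multinomial NB: word counts.
counts = CountVectorizer().fit_transform(docs)
mnb = MultinomialNB().fit(counts, labels)

# Bernoulli NB: word presence/absence (binary=True caps counts at 1).
binary = CountVectorizer(binary=True).fit_transform(docs)
bnb = BernoulliNB().fit(binary, labels)

print("GaussianNB training accuracy:", gnb.score(X_iris, y_iris))
print("MultinomialNB predictions:", mnb.predict(counts))
print("BernoulliNB predictions:", bnb.predict(binary))
```

The estimator should match how the features were produced: feeding raw continuous values to `MultinomialNB`, for example, violates its count-based assumption.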


Question 6: Write a Python program to:
● Load the Iris dataset
● Train an SVM Classifier with a linear kernel
● Print the model's accuracy and support vectors.

In [1]:
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train SVM with linear kernel
svm_linear = SVC(kernel="linear", random_state=42)
svm_linear.fit(X_train, y_train)

# Predictions
y_pred = svm_linear.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Model Accuracy:", accuracy)
print("Support Vectors:", svm_linear.support_vectors_)


Model Accuracy: 1.0
Support Vectors: [[4.8 3.4 1.9 0.2]
 [5.1 3.3 1.7 0.5]
 [4.5 2.3 1.3 0.3]
 [5.6 3.  4.5 1.5]
 [5.4 3.  4.5 1.5]
 [6.7 3.  5.  1.7]
 [5.9 3.2 4.8 1.8]
 [5.1 2.5 3.  1.1]
 [6.  2.7 5.1 1.6]
 [6.3 2.5 4.9 1.5]
 [6.1 2.9 4.7 1.4]
 [6.5 2.8 4.6 1.5]
 [6.9 3.1 4.9 1.5]
 [6.3 2.3 4.4 1.3]
 [6.3 2.8 5.1 1.5]
 [6.3 2.7 4.9 1.8]
 [6.  3.  4.8 1.8]
 [6.  2.2 5.  1.5]
 [6.2 2.8 4.8 1.8]
 [6.5 3.  5.2 2. ]
 [7.2 3.  5.8 1.6]
 [5.6 2.8 4.9 2. ]
 [5.9 3.  5.1 1.8]
 [4.9 2.5 4.5 1.7]]


Question 7: Write a Python program to:
● Load the Breast Cancer dataset
● Train a Gaussian Naïve Bayes model
● Print its classification report including precision, recall, and F1-score.

In [2]:

from sklearn.datasets import load_breast_cancer
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Train Gaussian Naïve Bayes
gnb = GaussianNB()
gnb.fit(X, y)

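# Predictions (made on the same data the model was trained on, so the report
# below is in-sample and likely optimistic compared with a held-out test set)
y_pred = gnb.predict(X)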

# Classification report
print(classification_report(y, y_pred, target_names=data.target_names))

              precision    recall  f1-score   support

   malignant       0.95      0.89      0.92       212
      benign       0.94      0.97      0.95       357

    accuracy                           0.94       569
   macro avg       0.94      0.93      0.94       569
weighted avg       0.94      0.94      0.94       569



Question 8: Write a Python program to:
● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best C and gamma.
● Print the best hyperparameters and accuracy.

In [3]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

# Load dataset
wine = load_wine()
X, y = wine.data, wine.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Parameter grid
param_grid = {'C': [0.1, 1, 10], 'gamma': [0.001, 0.01, 0.1], 'kernel': ['rbf']}

# GridSearch
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X_train, y_train)

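# best_score_ is the mean cross-validated accuracy of the best parameter
# combination on the training folds, not the accuracy on X_test
print("Best Parameters:", grid.best_params_)
print("Best Accuracy:", grid.best_score_)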


Best Parameters: {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
Best Accuracy: 0.6946666666666667
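
The Wine features are on very different scales (proline is in the hundreds to thousands, hue is close to 1), and the RBF kernel is distance-based, so unscaled features are a likely reason for the low cross-validated score above. The following is a sketch of one possible variation, not the original solution: it adds a `StandardScaler` inside a `Pipeline` so that scaling is refit on each training fold, and it reports held-out test accuracy alongside the cross-validated score. The step names `scale` and `svc` are arbitrary labels chosen here.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Scaling lives inside the pipeline, so each CV fold is scaled using
# statistics from its own training portion only.
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(kernel="rbf"))])

# Same C and gamma grid as above, addressed through the pipeline step name.
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": [0.001, 0.01, 0.1]}

grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

cv_accuracy = grid.best_score_
test_accuracy = grid.score(X_test, y_test)

print("Best Parameters:", grid.best_params_)
print("Mean CV Accuracy:", cv_accuracy)
print("Test Accuracy:", test_accuracy)
```

`grid.score` uses the best estimator refit on the full training split, so the test accuracy it prints comes from data that played no part in choosing C and gamma.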


Question 9: Write a Python program to:
● Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using sklearn.datasets.fetch_20newsgroups).
● Print the model's ROC-AUC score for its predictions.


In [4]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Load dataset
data = fetch_20newsgroups(subset='train', categories=['sci.space', 'rec.autos'], remove=('headers','footers','quotes'))
X, y = data.data, data.target

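# Text vectorization
# Note: the vectorizer is fit on all documents before the split, so test-set
# vocabulary and IDF statistics influence the features used for training
vectorizer = TfidfVectorizer()
X_vec = vectorizer.fit_transform(X)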

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_vec, y, test_size=0.3, random_state=42)

# Train Naive Bayes
nb = MultinomialNB()
nb.fit(X_train, y_train)

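# Predicted class probabilities; column 1 holds the probability of class 1
y_proba = nb.predict_proba(X_test)

# ROC-AUC is computed from the positive-class probability, not from hard labels
y_test_bin = label_binarize(y_test, classes=[0,1])
roc_auc = roc_auc_score(y_test_bin, y_proba[:,1])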

print("ROC-AUC Score:", roc_auc)


ROC-AUC Score: 0.9777370185314023


Question 10: Imagine you’re working as a data scientist for a company that handles email communications. Your task is to automatically classify emails as Spam or Not Spam. The emails may contain:
● Text with diverse vocabulary
● Potential class imbalance (far more legitimate emails than spam)
● Some incomplete or missing data
Explain the approach you would take to:
● Preprocess the data (e.g. text vectorization, handling missing data)
● Choose and justify an appropriate model (SVM vs. Naïve Bayes)
● Address class imbalance
● Evaluate the performance of your solution with suitable metrics
And explain the business impact of your solution.


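Approach:

Preprocessing:
● Handle missing data by replacing missing text with an empty string.
● Convert text into numerical features using TF-IDF vectorization.

Model Choice:
● Naïve Bayes (Multinomial) is preferred as a first model for spam detection because it trains quickly and performs well on sparse text features.
● A linear SVM could be used if higher accuracy is required, but it may be slower to train on large text corpora.

Address Class Imbalance:
● Use oversampling such as SMOTE, applied to the training split only, or class weighting. scikit-learn's SVC accepts class_weight directly; MultinomialNB does not, so with Naïve Bayes the equivalent is adjusting class priors or sample weights.

Evaluation Metrics:
● Precision, recall, F1-score, and ROC-AUC, reported for the spam class. Accuracy alone is misleading under class imbalance, because a model that labels every email as legitimate would still score highly.

Business Impact:
● Automatically filters spam, saving time and improving productivity.
● Protects against phishing and scams.
● Enhances customer trust in the company’s communication system.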


In [9]:
# Q10 - Spam Classification (Final Clean Version)

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix

# -----------------------------
# 1. Sample dataset
# -----------------------------
data = {
    "email": [
        "Win a free iPhone now",
        "Meeting tomorrow at 10am",
        "Congratulations, you won a lottery ticket",
        "Reminder: Submit project report",
        "Get cheap loans instantly",
        "Schedule lunch with client",
        "Earn money working from home",
        "Exclusive deal just for you",
        "Can we reschedule the call?",
        "URGENT: Your account has been suspended"
    ],
    "label": [1, 0, 1, 0, 1, 0, 1, 1, 0, 1]  # 1 = Spam, 0 = Not Spam
}

df = pd.DataFrame(data)

# -----------------------------
# 2. Preprocessing
# -----------------------------
df['email'] = df['email'].fillna("")

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(df['email'])
y = df['label']

# -----------------------------
# 3. Train-Test Split
# -----------------------------
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# -----------------------------
# 4. Train Naive Bayes Model
# -----------------------------
model = MultinomialNB()
model.fit(X_train, y_train)

# -----------------------------
# 5. Evaluation
# -----------------------------
y_pred = model.predict(X_test)
print("Classification Report:\n", classification_report(y_test, y_pred, zero_division=0))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

# -----------------------------
# 6. Business Impact
# -----------------------------
print("Business Impact:")
print("- Saves employees’ time by filtering spam automatically.")
print("- Prevents phishing attacks and fraud.")
print("- Ensures critical emails are delivered properly.")
print("- Improves organizational productivity and security.")





Classification Report:
               precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       0.67      1.00      0.80         2

    accuracy                           0.67         3
   macro avg       0.33      0.50      0.40         3
weighted avg       0.44      0.67      0.53         3

Confusion Matrix:
 [[0 1]
 [0 2]]
Business Impact:
- Saves employees’ time by filtering spam automatically.
- Prevents phishing attacks and fraud.
- Ensures critical emails are delivered properly.
- Improves organizational productivity and security.
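
The sample above has only ten emails, three of them in the test split, and it does not exercise the class-imbalance step from the approach. The sketch below shows one way that step might look, under stated assumptions: the templated emails are hypothetical, a linear SVM with `class_weight="balanced"` stands in for the class-weighting option, and average precision is added as an imbalance-aware metric. It is an illustration, not a tuned spam filter.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import average_precision_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical templated emails: 30 legitimate and 6 spam (imbalanced).
ham = [f"{a} {b}"
       for a in ("Meeting notes", "Project update", "Lunch plans",
                 "Quarterly report", "Code review", "Travel itinerary")
       for b in ("for Monday", "attached", "from the team",
                 "please review", "as discussed")]
spam = [f"{a} {b}"
        for a in ("Win a free prize", "Cheap loans", "Claim your reward")
        for b in ("now", "today only")]
emails = ham + spam
labels = [0] * len(ham) + [1] * len(spam)

X_train, X_test, y_train, y_test = train_test_split(
    emails, labels, test_size=0.33, random_state=42, stratify=labels
)

# TF-IDF is fit inside the pipeline, on the training split only.
# class_weight="balanced" reweights errors inversely to class frequency.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", SVC(kernel="linear", class_weight="balanced")),
])
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
spam_scores = model.decision_function(X_test)  # higher means more spam-like

report = classification_report(
    y_test, y_pred, target_names=["not spam", "spam"], zero_division=0
)
pr_auc = average_precision_score(y_test, spam_scores)

print(report)
print("Average precision (PR-AUC):", pr_auc)
```

Stratifying the split keeps the spam ratio the same in the training and test sets, which matters when the minority class has only a handful of examples.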
