1. What is a Support Vector Machine (SVM), and how does it work?
- A Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression tasks. For classification, SVM finds a decision boundary (hyperplane) that best separates classes by maximizing the margin — the distance between the hyperplane and the nearest data points from each class (these nearest points are called support vectors).
2. Explain the difference between Hard Margin and Soft Margin SVM.
- Hard Margin SVM: assumes data are perfectly linearly separable. It enforces zero training error and maximizes margin subject to no misclassification. Because it requires perfect separability, it’s sensitive to outliers and noise.

- Soft Margin SVM: allows some margin violations and misclassifications by introducing slack variables (ξᵢ ≥ 0). The objective trades off a wide margin against the total slack Σξᵢ, weighted by the penalty parameter C. Smaller C → wider margin, more tolerance for misclassification; larger C → fewer misclassifications, potentially narrower margin. Soft margin is used in practice because real data are often noisy or not perfectly separable.
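
The effect of C on the margin can be checked numerically. The sketch below is only an illustration under assumptions: the overlapping two-class data from `make_blobs`, the two C values, and the helper name `margin_and_sv_count` are all chosen here and are not part of the original answer. For a linear SVM the margin width is 2 / ||w||, so it can be read off the fitted coefficients.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Assumed toy data: two overlapping Gaussian blobs (not perfectly separable)
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

def margin_and_sv_count(C):
    """Fit a linear soft-margin SVM and return (margin width, number of support vectors)."""
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_)  # margin width = 2 / ||w||
    return margin, int(clf.n_support_.sum())

margin_small_c, n_sv_small_c = margin_and_sv_count(0.01)   # weak penalty on slack
margin_large_c, n_sv_large_c = margin_and_sv_count(100.0)  # strong penalty on slack

print(f"C=0.01 -> margin {margin_small_c:.3f}, support vectors {n_sv_small_c}")
print(f"C=100  -> margin {margin_large_c:.3f}, support vectors {n_sv_large_c}")
```

On data like this, the small-C model should report the wider margin, in line with the trade-off described above.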
3. What is the Kernel Trick in SVM? Give one example of a kernel and explain its use case.
- The Kernel Trick lets SVM compute dot products in a high-dimensional (possibly infinite-dimensional) feature space without explicitly mapping data into that space. Instead, a kernel function K(x, x') = φ(x) · φ(x') computes the inner product of transformed features φ(x). This allows SVM to find nonlinear decision boundaries efficiently.

- Common kernels:

  - Linear kernel: K(x,x') = x·x' — use when data are already linearly separable or features are high-dimensional and linear decision is appropriate.

  - Polynomial kernel: K(x,x') = (γ x·x' + r)^d — useful when interactions up to degree d matter.

  - Radial Basis Function (RBF, Gaussian) kernel: K(x,x') = exp(-γ ||x - x'||²) — widely used; handles many nonlinear relationships; has parameter γ controlling kernel width.

- Example use case: RBF kernel for classification of nonlinearly separable data such as complex biological measurements where class boundary is curved in original feature space.
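
The identity K(x, x') = φ(x) · φ(x') can be verified directly for a small case. The sketch below assumes a degree-2 polynomial kernel with γ = 1 and r = 1 on 2-dimensional inputs; the explicit feature map `phi` and the two sample points are chosen here for illustration only.

```python
import numpy as np

def poly_kernel(x, z):
    """Degree-2 polynomial kernel with gamma=1, r=1: (x.z + 1)^2."""
    return (np.dot(x, z) + 1.0) ** 2

def phi(x):
    """Explicit 6-dimensional feature map whose dot product equals poly_kernel (2-D inputs only)."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, x2 ** 2, s * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

kernel_value = poly_kernel(x, z)        # computed in the original 2-D space
explicit_value = np.dot(phi(x), phi(z)) # computed in the 6-D feature space

print(kernel_value, explicit_value)     # both equal 4.0 for these two points
```

The kernel evaluates one dot product in 2 dimensions, while the explicit route needs the 6-dimensional map; for the RBF kernel the corresponding map is infinite-dimensional, which is why the kernel form is the only practical one.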
4. What is a Naïve Bayes Classifier, and why is it called “naïve”?
- A Naïve Bayes classifier is a probabilistic classifier based on Bayes’ theorem. It models the posterior probability P(Class | Features) proportional to P(Class) * P(Features | Class). The key simplification is the naïve assumption: features are conditionally independent given the class. This reduces computation: P(x1, x2, ..., xn | class) = ∏ P(xi | class). Despite the strong (often false) independence assumption, Naïve Bayes performs well in many domains, especially text classification.

- Advantages:

  - Fast to train and predict

  - Works well with high-dimensional inputs (e.g., bag-of-words)

  - Requires relatively little training data

- Limitations:

  - The conditional independence assumption may be unrealistic; correlated features can hurt performance.

  - Continuous features need modeling (e.g., Gaussian assumption) or discretization.
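
The factorized posterior can be worked through by hand. In the sketch below every probability is a made-up number used only for illustration (two classes, two binary word features); none of it comes from a real dataset.

```python
# Hypothetical priors and per-class word probabilities (illustrative values only)
priors = {"spam": 0.4, "ham": 0.6}
p_word_given_class = {
    "spam": {"free": 0.8, "meeting": 0.1},
    "ham": {"free": 0.1, "meeting": 0.5},
}

# Observed email: contains "free", does not contain "meeting"
observed = {"free": True, "meeting": False}

# Unnormalized score: P(class) * product over features of P(x_i | class)
scores = {}
for cls, prior in priors.items():
    score = prior
    for word, present in observed.items():
        p = p_word_given_class[cls][word]
        score *= p if present else (1.0 - p)
    scores[cls] = score

# Normalize so the posteriors sum to 1
total = sum(scores.values())
posterior = {cls: s / total for cls, s in scores.items()}

print(posterior)  # spam: 0.288 / 0.318, ham: 0.030 / 0.318
```

With these assumed numbers the spam score is 0.4 × 0.8 × 0.9 = 0.288 and the ham score is 0.6 × 0.1 × 0.5 = 0.030, so the posterior for spam is about 0.91.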
5. Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants. When would you use each one?
- Gaussian Naïve Bayes: Assumes continuous features have a Gaussian (normal) distribution for each class. Use when features are continuous real-valued (e.g., height, weight, sensor measures) and roughly normally distributed. Implementation computes class-wise mean and variance.

- Multinomial Naïve Bayes: Models counts or frequency features where data represent counts per feature (e.g., word counts in documents). Typical for document classification with bag-of-words or TF counts. Works with integer count vectors or normalized frequencies.

- Bernoulli Naïve Bayes: Models binary occurrence features (0/1) indicating presence/absence of a feature (e.g., whether a word appears in a document). Use when features are binary indicators rather than counts.

- Rule of thumb:

  - Text classification with word counts → Multinomial NB.

  - Text classification with binary word indicators → Bernoulli NB.

  - Continuous numeric features → Gaussian NB.
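
The rule of thumb can be sketched with scikit-learn's three estimators, each fed the feature type it expects. The tiny arrays below are invented for illustration (4 training rows, 2 classes) and are far too small for real use.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 0, 1, 1])

# Continuous measurements -> Gaussian NB
X_cont = np.array([[1.0, 2.1], [1.2, 1.9], [3.0, 4.2], [3.1, 3.8]])
# Word counts per document -> Multinomial NB
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])
# Presence/absence of each word -> Bernoulli NB
X_binary = (X_counts > 0).astype(int)

gnb_pred = GaussianNB().fit(X_cont, y).predict([[1.1, 2.0]])
mnb_pred = MultinomialNB().fit(X_counts, y).predict([[2, 0, 1]])
bnb_pred = BernoulliNB().fit(X_binary, y).predict([[0, 1, 1]])

print("GaussianNB:", gnb_pred, "MultinomialNB:", mnb_pred, "BernoulliNB:", bnb_pred)
```

Note that Bernoulli NB also penalizes the absence of a word, whereas Multinomial NB only uses the words that occur, which is one reason the two can disagree on short documents.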

In [1]:
# 6. Write a Python program to:
# ● Load the Iris dataset
# ● Train an SVM Classifier with a linear kernel
# ● Print the model's accuracy and support vectors.
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np

iris = datasets.load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

clf = SVC(kernel='linear', C=1.0, random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f"Accuracy on test set: {acc:.4f}")

print("Number of support vectors for each class:", clf.n_support_)
print("Support vectors array shape:", clf.support_vectors_.shape)


Accuracy on test set: 1.0000
Number of support vectors for each class: [ 3 10  9]
Support vectors array shape: (22, 4)


In [2]:
# 7. Write a Python program to:
# ● Load the Breast Cancer dataset
# ● Train a Gaussian Naïve Bayes model
# ● Print its classification report including precision, recall, and F1-score.

from sklearn import datasets
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

data = datasets.load_breast_cancer()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

gnb = GaussianNB()
gnb.fit(X_train, y_train)

y_pred = gnb.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n")
print(classification_report(y_test, y_pred, target_names=data.target_names))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))


Accuracy: 0.9370629370629371

Classification Report:

              precision    recall  f1-score   support

   malignant       0.96      0.87      0.91        53
      benign       0.93      0.98      0.95        90

    accuracy                           0.94       143
   macro avg       0.94      0.92      0.93       143
weighted avg       0.94      0.94      0.94       143

Confusion Matrix:
 [[46  7]
 [ 2 88]]


In [3]:
# 8. Write a Python program to:
# ● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best C and gamma.
# ● Print the best hyperparameters and accuracy.
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

wine = datasets.load_wine()
X = wine.data
y = wine.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': ['scale', 'auto', 0.01, 0.1, 1],
    'kernel': ['rbf']
}

svc = SVC()
grid = GridSearchCV(svc, param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=1)
grid.fit(X_train, y_train)

print("Best hyperparameters:", grid.best_params_)
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
print("Test accuracy with best params:", accuracy_score(y_test, y_pred))


Fitting 5 folds for each of 20 candidates, totalling 100 fits
Best hyperparameters: {'C': 100, 'gamma': 'scale', 'kernel': 'rbf'}
Test accuracy with best params: 0.8222222222222222
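
The RBF kernel depends on distances between samples, so it is sensitive to feature scale, and the Wine features have very different ranges (proline is in the hundreds to thousands, hue is around 1). A possible variation, not part of the original cell, is to standardize the features inside a pipeline before the SVM. The sketch below assumes `StandardScaler` and a slightly reduced gamma grid; the names `scaled_svc`, `scaled_grid`, and `scaled_acc` are introduced here only for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Scaling lives inside the pipeline so each CV fold is scaled using only its own training part
scaled_svc = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(kernel="rbf")),
])
scaled_param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": ["scale", 0.01, 0.1, 1],
}

scaled_grid = GridSearchCV(scaled_svc, scaled_param_grid, cv=5, scoring="accuracy")
scaled_grid.fit(X_train, y_train)

scaled_acc = accuracy_score(y_test, scaled_grid.predict(X_test))
print("Best hyperparameters:", scaled_grid.best_params_)
print("Test accuracy with scaling:", scaled_acc)
```

Comparing this test accuracy with the unscaled run above gives a quick check of how much the result depends on feature scale.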


In [4]:
# 9. Write a Python program to:
# ● Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using sklearn.datasets.fetch_20newsgroups).
# ● Print the model's ROC-AUC score for its predictions.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, roc_curve, classification_report

categories = ['sci.space', 'rec.sport.hockey']
data = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))

X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

vect = TfidfVectorizer(stop_words='english', max_df=0.7)
X_train_tfidf = vect.fit_transform(X_train)
X_test_tfidf = vect.transform(X_test)

clf = MultinomialNB()
clf.fit(X_train_tfidf, y_train)

y_proba = clf.predict_proba(X_test_tfidf)[:, 1]
y_pred = clf.predict(X_test_tfidf)

auc = roc_auc_score(y_test, y_proba)
print(f"ROC-AUC score: {auc:.4f}\n")

print("Classification report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))


ROC-AUC score: 0.9957

Classification report:
                  precision    recall  f1-score   support

rec.sport.hockey       0.93      0.98      0.96       300
       sci.space       0.98      0.93      0.95       296

        accuracy                           0.95       596
       macro avg       0.96      0.95      0.95       596
    weighted avg       0.96      0.95      0.95       596



10. Imagine you’re working as a data scientist for a company that handles email communications. Your task is to automatically classify emails as Spam or Not Spam. The emails may contain:
● Text with diverse vocabulary
● Potential class imbalance (far more legitimate emails than spam)
● Some incomplete or missing data
Explain the approach you would take to:
● Preprocess the data (e.g. text vectorization, handling missing data)
● Choose and justify an appropriate model (SVM vs. Naïve Bayes)
● Address class imbalance
● Evaluate the performance of your solution with suitable metrics
And explain the business impact of your solution.
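- A baseline approach is sketched in the pipeline below: vectorize the email text with TF-IDF (unigrams and bigrams, English stop words removed), train a Multinomial Naïve Bayes classifier, tune its smoothing parameter alpha with a grid search scored on F1, and evaluate with the classification report, the confusion matrix, and PR-AUC, which is more informative than ROC-AUC when the classes are imbalanced.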

```
# Sample spam-detection pipeline (MultinomialNB baseline)
# Note: Replace `emails` & `labels` with your dataset (emails: list of text, labels: 0=ham,1=spam)
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_curve, roc_auc_score, average_precision_score
from sklearn.pipeline import Pipeline
import numpy as np

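# Example placeholder data - replace with a real dataset.
# Several examples per class are needed so that the stratified split
# and the 3-fold cross-validation below can run.
emails = [
    "Win a free phone now! Click here to claim your prize",
    "Monthly invoice attached, please review",
    "Limited time offer: cheap meds, no prescription",
    "Meeting agenda for tomorrow attached",
    "Congratulations, you won a cash prize! Claim it today",
    "Can we reschedule the project review to Friday?",
    "Cheap loans approved instantly, click to apply now",
    "Please find the quarterly report attached",
    "Free gift card waiting, click here to claim your offer",
    "Lunch with the team tomorrow at noon?",
    "Exclusive offer: win free tickets, limited time only",
    "Notes from the planning meeting are attached",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # 1=spam, 0=not spam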

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(emails, labels, test_size=0.25, random_state=42, stratify=labels)

# Pipeline: TF-IDF + MultinomialNB
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', max_df=0.8, ngram_range=(1,2))),
    ('clf', MultinomialNB())
])

# Optionally tune alpha using GridSearch
param_grid = {'clf__alpha': [0.1, 0.5, 1.0]}
grid = GridSearchCV(pipeline, param_grid, cv=3, scoring='f1', n_jobs=-1)
grid.fit(X_train, y_train)

print("Best params:", grid.best_params_)
y_pred = grid.predict(X_test)
y_proba = grid.predict_proba(X_test)[:, 1]

print("Classification report:\n", classification_report(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
# Use PR-AUC for imbalanced data
print("Average Precision (PR-AUC):", average_precision_score(y_test, y_proba))
# ROC-AUC (works but PR-AUC better for imbalance)
print("ROC-AUC:", roc_auc_score(y_test, y_proba))

```
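
The question also asks about missing data and class imbalance, which the baseline above does not handle explicitly. One possible way to cover both is sketched below under assumptions: missing email bodies are replaced with an empty string, and a linear SVM with `class_weight="balanced"` is swapped in for Naïve Bayes so that errors on the rare spam class weigh more. The data and the names `raw_emails`, `model`, and `train_ap` are hypothetical placeholders.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import average_precision_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical placeholder data: imbalanced (2 spam, 6 ham) with one missing body
raw_emails = [
    "Team lunch on Friday at noon",
    None,
    "Quarterly report attached for review",
    "Please confirm the meeting time",
    "Win a free prize, click now",
    "Project update and next steps",
    "Free cash offer, claim your prize now",
    "Notes from the planning meeting",
]
labels = np.array([0, 0, 0, 0, 1, 0, 1, 0])  # 1=spam, 0=not spam

# Handle missing data: treat a missing body as an empty document
emails = [text if isinstance(text, str) else "" for text in raw_emails]

# class_weight="balanced" reweights classes inversely to their frequency
model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LinearSVC(class_weight="balanced", C=1.0)),
])
model.fit(emails, labels)

# LinearSVC has no predict_proba; decision_function scores work for PR-AUC
scores = model.decision_function(emails)
train_ap = average_precision_score(labels, scores)
print("PR-AUC on the training data (smoke check only):", train_ap)
```

The score here is computed on the training data purely to show the calls; a real evaluation would use a held-out set or cross-validation, as in the baseline above.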


