Q1. What is an SVM and how does it work?

- Support Vector Machine (SVM) is a maximum-margin classifier. It finds the hyperplane that separates the classes with the largest margin (the distance from the hyperplane to the nearest training points). Only the points that lie on or inside the margin, the support vectors, determine the boundary. For linearly separable data it fits a flat hyperplane; for nonlinear data it uses kernels (see Q3). Training typically minimizes hinge loss plus an L2 penalty on the weight vector; shrinking the weights is what widens the margin.

Q2. Explain the difference between Hard Margin and Soft Margin SVM.
- Hard margin: assumes the data are perfectly linearly separable; no misclassifications are allowed. It maximizes the margin subject to every point being correctly classified. It is very sensitive to noise and outliers, and the optimization has no feasible solution when the data are not separable.

- Soft margin: allows margin violations through slack variables, penalized by a parameter C. Large C → fewer violations (tighter fit to the training data); small C → wider margin (more tolerant of outliers, often better generalization). This is the formulation used in practice.
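
A minimal sketch of how C changes the fit, assuming a synthetic two-cluster dataset from `make_blobs` (the data and the three C values are illustrative choices, not part of the original answer). A smaller C widens the margin, so more training points typically end up as support vectors.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Illustrative data: two overlapping 2-D clusters (an assumption for this sketch)
X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.5, random_state=0)

sv_counts = {}
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # n_support_ holds the number of support vectors per class
    sv_counts[C] = int(clf.n_support_.sum())
    print(f"C={C:<6} total support vectors: {sv_counts[C]}")
```

Comparing the counts across C values gives a quick feel for how strongly the penalty constrains the margin on a given dataset.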


Q3. What is the Kernel Trick in SVM? Give one example of a kernel and explain its use case.
- The kernel trick computes inner products in a high-dimensional feature space without explicitly mapping the features into it. Every ⟨x, x′⟩ in the SVM optimization is replaced with k(x, x′), so the model learns a linear separator in that space, which corresponds to a nonlinear boundary in the original space.

 Example of a kernel: the Radial Basis Function (RBF) kernel, also known as the Gaussian kernel:

   $k(x, x') = \exp\left(-\gamma \lVert x - x' \rVert^2\right)$

 - Use when the class boundary is curved or complex.

 - γ controls locality: high γ → tighter, wiggly decision boundary; low γ → smoother, broader influence.
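
The idea can be checked numerically. The sketch below uses the degree-2 polynomial kernel (x · z)², chosen here only because its feature map is small enough to write out by hand; the vectors `x`, `z` and the helper `phi` are hypothetical. It then evaluates the RBF formula above and compares it with scikit-learn's `rbf_kernel`.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

def phi(v):
    """Explicit 3-D feature map whose inner product equals (v . w)**2 for 2-D inputs."""
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

explicit = phi(x) @ phi(z)   # inner product computed after mapping to 3-D
via_kernel = (x @ z) ** 2    # same quantity computed from the 2-D inputs only
print(explicit, via_kernel)

# RBF kernel evaluated from its formula, checked against scikit-learn
gamma = 0.5
manual_rbf = np.exp(-gamma * np.sum((x - z) ** 2))
sklearn_rbf = rbf_kernel(x.reshape(1, -1), z.reshape(1, -1), gamma=gamma)[0, 0]
print(manual_rbf, sklearn_rbf)
```

For the RBF kernel the implicit feature space is infinite-dimensional, so only the kernel-side computation is available; that is the case where the trick is indispensable rather than merely convenient.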




Q4. What is a Naïve Bayes Classifier and why “naïve”?

- Naïve Bayes applies Bayes' theorem to compute the posterior $P(y \mid x)$ and predicts the class with the highest posterior. It assumes the features are conditionally independent given the class; that is the "naïve" part. Despite this simplification, it is fast to train and works well on high-dimensional data such as text.
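
Written out, the independence assumption turns the posterior into a product of per-feature terms. This is the standard statement of the model; the symbols $x_1, \dots, x_n$ for the individual features are notation introduced here, not used in the original answer.

$$
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$

The variants of Naïve Bayes differ only in the distribution they assume for each $P(x_i \mid y)$.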

Q5. Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants. When would you use each one?
- Gaussian NB: assumes each feature is normally distributed within a class.

 Use for continuous features (e.g., sensor measurements, real-valued attributes).

- Multinomial NB: models counts/frequencies (word counts).

 Use for text classification with bag-of-words counts. It is also commonly applied to TF-IDF weights, although those are fractional values rather than true counts.

- Bernoulli NB: models binary features (present/absent).

 Use for text with binary indicators (word present?), or other on/off features.
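
A small sketch matching each variant to its feature type. The toy arrays below are invented for illustration only; each one is deliberately easy to separate so the focus stays on which estimator fits which kind of input.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])  # toy class labels shared by all three examples

# Continuous measurements -> GaussianNB
X_cont = np.array([[1.0], [1.2], [5.0], [5.3]])
pred_gauss = GaussianNB().fit(X_cont, y).predict(np.array([[1.1], [5.1]]))

# Non-negative counts (e.g., word counts) -> MultinomialNB
X_counts = np.array([[3, 0], [2, 1], [0, 3], [1, 2]])
pred_multi = MultinomialNB().fit(X_counts, y).predict(np.array([[4, 0], [0, 4]]))

# Presence/absence flags -> BernoulliNB
X_binary = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
pred_bern = BernoulliNB().fit(X_binary, y).predict(np.array([[1, 0], [0, 1]]))

print(pred_gauss, pred_multi, pred_bern)
```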

# Practical questions


In [9]:
# 6: Write a Python program to: ● Load the Iris dataset ● Train an SVM Classifier with a linear kernel ● Print the model's accuracy and support vectors
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load data
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

# Linear SVM (random_state only affects SVC when probability=True; it is harmless here)
clf = SVC(kernel='linear', C=1.0, random_state=42)
clf.fit(X_train, y_train)

# Accuracy
y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f"Accuracy: {acc:.4f}")

# Support vectors
print("Number of support vectors per class:", clf.n_support_)
print("Support vectors (first 5 rows):\n", clf.support_vectors_[:5])


Accuracy: 1.0000
Number of support vectors per class: [ 3 10  9]
Support vectors (first 5 rows):
 [[5.1 3.8 1.9 0.4]
 [4.8 3.4 1.9 0.2]
 [5.  3.  1.6 0.2]
 [6.9 3.1 4.9 1.5]
 [6.  2.9 4.5 1.5]]


In [11]:
# 7.Write a Python program to: ● Load the Breast Cancer dataset ● Train a Gaussian Naïve Bayes model ● Print its classification report including precision, recall, and F1-score.
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

data = datasets.load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

gnb = GaussianNB()
gnb.fit(X_train, y_train)

y_pred = gnb.predict(X_test)
print(classification_report(y_test, y_pred, target_names=data.target_names))


              precision    recall  f1-score   support

   malignant       0.96      0.87      0.91        53
      benign       0.93      0.98      0.95        90

    accuracy                           0.94       143
   macro avg       0.94      0.92      0.93       143
weighted avg       0.94      0.94      0.94       143



In [12]:
# 8: Write a Python program to: ● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best C and gamma. ● Print the best hyperparameters and accuracy
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

wine = datasets.load_wine()
X, y = wine.data, wine.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': ['scale', 'auto', 0.01, 0.1, 1]
}

grid = GridSearchCV(
    SVC(kernel='rbf', random_state=42),
    param_grid=param_grid,
    scoring='accuracy',
    cv=5,
    n_jobs=-1
)

grid.fit(X_train, y_train)

best = grid.best_estimator_
y_pred = best.predict(X_test)
acc = accuracy_score(y_test, y_pred)

print("Best params:", grid.best_params_)
print(f"Test accuracy: {acc:.4f}")

Best params: {'C': 100, 'gamma': 'scale'}
Test accuracy: 0.8222
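
RBF SVMs are sensitive to feature scale, and the Wine features span very different ranges, which may explain the modest test accuracy above. The following is a sketch of the same search with standardization added, assuming a `Pipeline` with step names `scaler` and `svc` (names chosen here). Its results are not quoted because they were not part of the original run.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

# Scaling sits inside the pipeline so each CV fold is standardized
# using statistics from its own training portion only
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC(kernel="rbf"))])
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": ["scale", 0.01, 0.1, 1],
}

grid = GridSearchCV(pipe, param_grid=param_grid, scoring="accuracy", cv=5)
grid.fit(X_train, y_train)

scaled_acc = grid.score(X_test, y_test)  # accuracy of the refit best pipeline
print("Best params:", grid.best_params_)
print(f"Test accuracy (scaled): {scaled_acc:.4f}")
```

Keeping the scaler inside the pipeline, rather than scaling the whole dataset up front, avoids leaking test-fold statistics into the cross-validation scores.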


In [13]:
# 9: Write a Python program to: ● Train a Naïve Bayes Classifier on a synthetic text dataset ● Print the model's ROC-AUC score for its predictions.
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_auc_score

# Load text data (note: 20 Newsgroups is a real corpus downloaded on first use, not a synthetic dataset)
cats = None  # or choose a subset like ['sci.space','rec.sport.baseball',...]
newsgroups = fetch_20newsgroups(subset='all', categories=cats, remove=('headers','footers','quotes'))
X_text, y = newsgroups.data, newsgroups.target
classes = list(range(len(newsgroups.target_names)))

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X_text, y, test_size=0.25, random_state=42, stratify=y)

# Vectorize (word + char can help; here word n-grams as a simple baseline)
tfidf = TfidfVectorizer(ngram_range=(1,2), min_df=2)
X_train_t = tfidf.fit_transform(X_train)
X_test_t  = tfidf.transform(X_test)

# Model
clf = MultinomialNB(alpha=0.1)
clf.fit(X_train_t, y_train)

# ROC-AUC (multiclass, macro-avg): labels are binarized, so each class column is scored one-vs-rest and the results are averaged
probs = clf.predict_proba(X_test_t)
y_test_bin = label_binarize(y_test, classes=classes)
auc = roc_auc_score(y_test_bin, probs, average='macro', multi_class='ovr')
print(f"Macro ROC-AUC: {auc:.4f}")


Macro ROC-AUC: 0.9682
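
Since the question asks for a synthetic text dataset, here is a sketch that generates one instead of downloading a corpus. The vocabularies, the `make_doc` helper, and the 50/50 mix of class-specific and shared words are all assumptions made for illustration. Because the class-specific words never overlap, the task is much easier than real spam filtering.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(42)

# Hypothetical vocabularies, invented for this sketch
spam_words = ["free", "winner", "prize", "cash", "offer", "click"]
ham_words = ["meeting", "report", "project", "schedule", "invoice", "team"]
shared_words = ["please", "today", "email", "update", "thanks", "hello"]

def make_doc(label, n_words=8):
    """Build a document mixing class-specific and shared words at random."""
    specific = spam_words if label == 1 else ham_words
    words = [str(rng.choice(specific)) if rng.random() < 0.5
             else str(rng.choice(shared_words)) for _ in range(n_words)]
    return " ".join(words)

labels = rng.integers(0, 2, size=400)  # 1 = spam, 0 = ham
docs = [make_doc(int(label)) for label in labels]

X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.25, random_state=42, stratify=labels)

vec = CountVectorizer()
nb = MultinomialNB().fit(vec.fit_transform(X_train), y_train)

# Binary ROC-AUC uses the predicted probability of the positive class
spam_prob = nb.predict_proba(vec.transform(X_test))[:, 1]
synthetic_auc = roc_auc_score(y_test, spam_prob)
print(f"ROC-AUC on synthetic text: {synthetic_auc:.4f}")
```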


Q10. Imagine you’re working as a data scientist for a company that handles email communications. Your task is to automatically classify emails as Spam or Not Spam.

The emails may contain: ● Text with diverse vocabulary ● Potential class imbalance (far more legitimate emails than spam) ● Some incomplete or missing data

Explain the approach you would take to:
● Preprocess the data (e.g. text vectorization, handling missing data)
● Choose and justify an appropriate model (SVM vs. Naïve Bayes)
● Address class imbalance
● Evaluate the performance of your solution with suitable metrics
And explain the business impact of your solution.

-> Preprocessing

- Text normalization: lowercasing, strip HTML, remove URLs/usernames, normalize unicode.

- Tokenization & vectorization:

  - Start with TF-IDF: combine word n-grams (1-2) + character n-grams (3-5) to capture misspellings and obfuscations (e.g., “vi@gr@”).

  - Optional: lemmatization; usually not required with char n-grams.

- Handle missing/incomplete data: fill missing bodies/subjects with empty strings; concatenate subject + body.

- Feature selection: optionally use a χ² test to drop uninformative terms, or rely on the linear model's regularization.

- Meta features (optional): number of links, number of recipients, presence of attachments, ratio of uppercase, etc.
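
The preprocessing steps above can be sketched as follows. The three example emails, the `to_text` helper, and the step names `word` and `char` are hypothetical; the n-gram ranges follow the ones suggested in the list.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

# Hypothetical raw emails; None marks a missing subject or body
emails = [
    {"subject": "Team meeting", "body": "Agenda attached for Monday."},
    {"subject": None, "body": "Cheap vi@gr@ here, click now!"},
    {"subject": "Invoice", "body": None},
]

def to_text(email):
    """Replace missing fields with empty strings and join subject and body."""
    return f"{email.get('subject') or ''} {email.get('body') or ''}".strip()

texts = [to_text(e) for e in emails]

# Word n-grams capture phrases; character n-grams capture obfuscated spellings
features = FeatureUnion([
    ("word", TfidfVectorizer(lowercase=True, ngram_range=(1, 2))),
    ("char", TfidfVectorizer(lowercase=True, analyzer="char_wb", ngram_range=(3, 5))),
])
X = features.fit_transform(texts)
print(X.shape)
```

In a real system the `FeatureUnion` would be the first step of a pipeline fitted on training data only, so that vocabulary and IDF weights are not learned from the test set.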


-> Model choice (SVM vs. Naïve Bayes)

- Multinomial NB is a strong, fast baseline for text; it works very well with TF-IDF and massive vocabularies.

- Linear SVM (e.g., LinearSVC) typically yields higher accuracy/F1 for spam tasks because it optimizes a discriminative margin in high-dimensional sparse space.

- Practical plan: start with MultinomialNB for a baseline; then train LinearSVC (or SGDClassifier with loss='hinge' or loss='modified_huber') and compare. If calibrated probabilities are needed (for thresholding or business rules), wrap the SVM in CalibratedClassifierCV.

-> Address class imbalance
- Use stratified splits and class_weight='balanced' for SVM (or tune class weights).

- Adjust the decision threshold to optimize the business metric (e.g., maximize F1, or maximize recall subject to a cap on the false-positive rate).

- Consider downsampling the majority class or SMOTE on dense meta features; for raw sparse text, class weighting + thresholding usually suffices.

- Cost-sensitive evaluation: treat false positives (ham flagged as spam) as more costly than false negatives; reflect this via weights/thresholds.
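
A sketch of class weighting combined with threshold adjustment. It assumes `make_classification` output as a stand-in for vectorized emails (about 10% positives) and three illustrative thresholds on the SVM decision score; none of these values come from the original answer.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Stand-in for vectorized emails: roughly 10% positives ("spam")
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# class_weight="balanced" scales the penalty inversely to class frequency
clf = LinearSVC(class_weight="balanced", C=1.0, max_iter=5000).fit(X_train, y_train)
scores = clf.decision_function(X_test)

flagged = {}
for threshold in (0.0, 0.5, 1.0):
    pred = (scores >= threshold).astype(int)
    flagged[threshold] = int(pred.sum())  # number of emails flagged as spam
    p = precision_score(y_test, pred, zero_division=0)
    r = recall_score(y_test, pred)
    print(f"threshold={threshold:.1f} flagged={flagged[threshold]} "
          f"precision={p:.3f} recall={r:.3f}")
```

Raising the threshold flags fewer emails, which is the lever for trading recall against false positives once the relative costs are known.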

-> Evaluation

- Primary metrics:

  - Precision (Spam class): minimize false positives (avoid losing real mail).

  - Recall (Spam class): catch spam.

  - F1 (Spam), or Fβ with β<1 to emphasize precision if the business prefers.

  - ROC-AUC and PR-AUC (PR-AUC is more informative for imbalanced data).

- Validation: Stratified k-fold CV; keep a chronological holdout if data is time-dependent (concept drift).

- Calibration: if using thresholds or triage, use probability calibration (Platt scaling/Isotonic via CalibratedClassifierCV).

- Error analysis: inspect most-confused examples; look at top weighted features for each class to detect leakage or bias.
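
The metrics listed above map onto scikit-learn functions as in this sketch. The labels, scores, and the 0.5 threshold are invented for illustration; `average_precision_score` is used here as the PR-AUC summary.

```python
from sklearn.metrics import (average_precision_score, fbeta_score,
                             precision_score, recall_score, roc_auc_score)

# Hypothetical labels (1 = spam) and classifier scores, invented for this sketch
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.47, 0.45, 0.80, 0.90]

threshold = 0.5  # assumed operating point
y_pred = [int(s >= threshold) for s in scores]

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f_half = fbeta_score(y_true, y_pred, beta=0.5)    # beta < 1 weights precision more
pr_auc = average_precision_score(y_true, scores)  # threshold-free, focuses on the spam class
roc_auc = roc_auc_score(y_true, scores)           # threshold-free, uses both classes

print(f"precision={precision:.3f} recall={recall:.3f} F0.5={f_half:.3f}")
print(f"PR-AUC={pr_auc:.3f} ROC-AUC={roc_auc:.3f}")
```

Precision, recall, and Fβ depend on the chosen threshold, while the two AUC values describe the ranking quality of the scores themselves.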

-> Business impact

- Productivity: less manual triage; employees see more real mail, faster response times.

- Risk reduction: fewer phishing/spam reaching inbox; protect users and brand reputation.

- Deliverability & CSAT: better filtering reduces user complaints; configurable thresholds reduce false positives that could hide important emails.

- Operational efficiency: automated models plus explainable features (top tokens/links) make trust & safety reviews faster.