In [None]:
# Machine Learning: SVM & Naive Bayes Assignment

Question 1: What is a Support Vector Machine (SVM), and how does it work?


Answer:

A Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression.

It works by finding a hyperplane that best separates data points of different classes in a feature space.

The best hyperplane is the one that maximizes the margin, i.e., the distance between the hyperplane and the nearest data points from each class (called support vectors).

For non-linear problems, SVM uses kernel functions to implicitly map the data into a higher-dimensional space where it may become linearly separable.
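
The margin idea can be checked on a tiny example. The sketch below is illustrative only: the four points are hypothetical toy data, and a very large `C` is used as an assumption to approximate a hard-margin fit. It reads the hyperplane parameters from the fitted model and computes the distance from the hyperplane to the nearest points as `1 / ||w||`.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two classes separated along the first feature.
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([0, 0, 1, 1])

# A very large C approximates a hard-margin SVM on separable data.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]        # normal vector of the separating hyperplane
b = clf.intercept_[0]   # offset of the hyperplane

# Distance from the hyperplane to the closest points (the support vectors).
margin = 1.0 / np.linalg.norm(w)

print("w:", w, "b:", b)
print("margin:", margin)
print("support vectors:")
print(clf.support_vectors_)
```

For this data the separating line should sit near `x1 = 1`, halfway between the two columns of points, so the margin should come out close to 1.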

Question 2: Explain the difference between Hard Margin and Soft Margin SVM.

Answer:

Hard Margin SVM:

Assumes data is perfectly linearly separable.

No misclassification is allowed.

Sensitive to noise and outliers.

Soft Margin SVM:

Allows some margin violations and misclassifications, each penalized through the parameter C (a larger C penalizes violations more heavily).

Balances margin maximization with classification error.

More robust in real-world noisy datasets.
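
The trade-off controlled by C is usually written as the soft-margin objective `0.5 * ||w||^2 + C * sum(max(0, 1 - y_i * (w . x_i + b)))`, with labels in {-1, +1}. The sketch below evaluates that objective for a fixed hyperplane on hypothetical toy data; the function name `soft_margin_objective` and all values are illustrative assumptions, not part of the assignment.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """Return 0.5*||w||^2 + C * (sum of hinge losses); y must be in {-1, +1}."""
    margins = y * (X @ w + b)                # signed functional margins
    slack = np.maximum(0.0, 1.0 - margins)   # zero for points outside the margin
    return 0.5 * float(w @ w) + C * float(slack.sum())

# Hypothetical toy data: the third point lies on the wrong side of the hyperplane.
X = np.array([[0.0, 0.0], [2.0, 0.0], [1.2, 0.0]])
y = np.array([-1.0, 1.0, -1.0])
w = np.array([1.0, 0.0])
b = -1.0

low_c = soft_margin_objective(w, b, X, y, C=1.0)
high_c = soft_margin_objective(w, b, X, y, C=10.0)
print("objective with C=1: ", low_c)
print("objective with C=10:", high_c)
```

The same misclassified point costs ten times more when C is ten times larger, which is why a large C pushes the optimizer toward hard-margin behaviour and a small C tolerates violations in exchange for a wider margin.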

Question 3: What is the Kernel Trick in SVM? Give one example of a kernel and
explain its use case.


Answer:

The **Kernel Trick** allows SVMs to operate in a high-dimensional feature space without explicitly computing the coordinates of the data in that space. Instead, it computes the inner products between data points in the transformed space using a **kernel function**.

**Example — Radial Basis Function (RBF) kernel**:

$$K(x, x') = \exp\left(-\gamma \lVert x - x' \rVert^2\right)$$

The RBF kernel implicitly maps data into an infinite-dimensional feature space.

Use case: when decision boundaries are non-linear (e.g., classifying spirals or concentric circles).
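
The RBF feature map cannot be written out explicitly, so the sketch below illustrates the trick with a different kernel, the homogeneous degree-2 polynomial kernel, whose feature map is small enough to compute by hand. The function names and the `gamma` value are illustrative assumptions. It shows that the kernel value equals the inner product in the transformed space, and then evaluates the RBF formula above for comparison.

```python
import numpy as np

def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel: (x . z)^2."""
    return float(np.dot(x, z)) ** 2

def poly2_features(x):
    """Explicit feature map for 2-D input whose inner product equals poly2_kernel."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel exp(-gamma * ||x - z||^2); gamma is an arbitrary illustrative value."""
    diff = np.asarray(x) - np.asarray(z)
    return float(np.exp(-gamma * np.dot(diff, diff)))

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

implicit = poly2_kernel(x, z)                                   # computed in 2-D
explicit = float(np.dot(poly2_features(x), poly2_features(z)))  # computed in 3-D

print("kernel value:        ", implicit)
print("explicit dot product:", explicit)
print("RBF similarity:      ", rbf_kernel(x, z))
```

Both polynomial computations give the same number, but the kernel version never builds the 3-D vectors. The RBF value is 1 for identical points and decays toward 0 as points move apart.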

Question 4: What is a Naïve Bayes Classifier, and why is it called “naïve”?


Answer:

Naïve Bayes is a probabilistic classifier based on **Bayes’ Theorem**:

$$P(y \mid X) = \frac{P(X \mid y)\,P(y)}{P(X)}$$

It assumes conditional independence between features given the class label.

It is called “naïve” because this independence assumption rarely holds in real-world data, where features are often correlated. Even so, the model performs surprisingly well in many tasks, especially text classification.
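
A small worked example may make the independence assumption concrete. The priors and per-word likelihoods below are hypothetical numbers chosen only for illustration. Under the naïve assumption, P(X∣y) is simply the product of the per-word likelihoods.

```python
# Hypothetical probabilities, chosen only for illustration.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.5, "meeting": 0.1},
    "ham": {"free": 0.1, "meeting": 0.4},
}

def posterior(words):
    """Return P(class | words) under the naive independence assumption."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for word in words:
            score *= likelihoods[label][word]  # multiply per-feature likelihoods
        scores[label] = score
    evidence = sum(scores.values())  # P(X), the normalizing constant
    return {label: score / evidence for label, score in scores.items()}

result = posterior(["free", "meeting"])
print(result)
```

For an email containing both words, the unnormalized scores are 0.4 × 0.5 × 0.1 = 0.020 for spam and 0.6 × 0.1 × 0.4 = 0.024 for ham, so the posterior slightly favours ham.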


Question 5: Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants.
When would you use each one?


Answer:

**Gaussian Naïve Bayes:**

Assumes features follow a normal (Gaussian) distribution.

Use case: continuous data like Iris flower measurements.

**Multinomial Naïve Bayes:**

Assumes features are counts/frequencies.

Use case: text classification, word counts in documents.

**Bernoulli Naïve Bayes:**

Assumes binary features (0/1).

Use case: text classification with presence/absence of words (spam detection).
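
As a rough sketch of how the three variants line up with feature types, the example below fits each one on hypothetical toy data of the matching kind. The arrays are made up for illustration and are far too small to say anything about real performance.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])

# Continuous measurements -> GaussianNB
X_cont = np.array([[1.0, 2.1], [1.2, 1.9], [3.0, 4.2], [3.1, 3.9]])
gauss_pred = GaussianNB().fit(X_cont, y).predict(np.array([[1.1, 2.0]]))

# Word counts -> MultinomialNB
X_counts = np.array([[3, 0], [2, 0], [0, 3], [0, 2]])
multi_pred = MultinomialNB().fit(X_counts, y).predict(np.array([[4, 0]]))

# Word presence/absence (0/1) -> BernoulliNB
X_binary = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
bern_pred = BernoulliNB().fit(X_binary, y).predict(np.array([[1, 0]]))

print("GaussianNB:   ", gauss_pred)
print("MultinomialNB:", multi_pred)
print("BernoulliNB:  ", bern_pred)
```

Each query point resembles the class 0 training rows, so all three models should assign it to class 0.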

Question 6: Write a Python program to:

● Load the Iris dataset

● Train an SVM Classifier with a linear kernel

● Print the model's accuracy and support vectors.

(Include your Python code and output in the code box below.)


In [1]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Train SVM with linear kernel
clf = SVC(kernel="linear", random_state=42)
clf.fit(X_train, y_train)

# Accuracy
y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)

print(f"Accuracy: {acc:.4f}")
print("Support Vectors:")
print(clf.support_vectors_)


Accuracy: 1.0000
Support Vectors:
[[4.8 3.4 1.9 0.2]
 [4.5 2.3 1.3 0.3]
 [5.1 3.8 1.9 0.4]
 [5.1 2.5 3.  1.1]
 [6.2 2.2 4.5 1.5]
 [6.  2.9 4.5 1.5]
 [5.9 3.2 4.8 1.8]
 [6.9 3.1 4.9 1.5]
 [6.7 3.1 4.7 1.5]
 [6.8 2.8 4.8 1.4]
 [6.  2.7 5.1 1.6]
 [6.3 2.5 4.9 1.5]
 [6.5 3.2 5.1 2. ]
 [6.3 2.7 4.9 1.8]
 [6.3 2.5 5.  1.9]
 [6.  2.2 5.  1.5]
 [5.6 2.8 4.9 2. ]
 [5.9 3.  5.1 1.8]
 [6.2 2.8 4.8 1.8]
 [7.2 3.  5.8 1.6]]


Question 7: Write a Python program to:

● Load the Breast Cancer dataset

● Train a Gaussian Naïve Bayes model

● Print its classification report including precision, recall, and F1-score.

(Include your Python code and output in the code box below.)


In [2]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Load data
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Train Gaussian NB
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Predict & report
y_pred = gnb.predict(X_test)
print(classification_report(y_test, y_pred, target_names=cancer.target_names))


              precision    recall  f1-score   support

   malignant       0.97      0.89      0.93        64
      benign       0.94      0.98      0.96       107

    accuracy                           0.95       171
   macro avg       0.95      0.94      0.94       171
weighted avg       0.95      0.95      0.95       171



Question 8: Write a Python program to:

● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best C and gamma.

● Print the best hyperparameters and accuracy.

(Include your Python code and output in the code box below.)


In [3]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load data
wine = load_wine()
X, y = wine.data, wine.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Grid Search
param_grid = {"C": [0.1, 1, 10], "gamma": [0.001, 0.01, 0.1, 1]}
grid = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

# Best model
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)

print("Best parameters:", grid.best_params_)
print("Test Accuracy:", accuracy_score(y_test, y_pred))


Best parameters: {'C': 10, 'gamma': 0.001}
Test Accuracy: 0.7777777777777778
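
RBF kernels are sensitive to feature scale, and the Wine features span very different ranges, which may explain the modest test accuracy above. A possible variation, sketched below under that assumption, puts a `StandardScaler` in a `Pipeline` ahead of the SVM so that the grid search scales each cross-validation fold separately. The step names `scaler` and `svc` are arbitrary labels chosen here, and no output is shown because this variant was not part of the original run.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

wine = load_wine()
X, y = wine.data, wine.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Scaling lives inside the pipeline, so each CV fold is scaled
# using statistics from its own training part only.
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC(kernel="rbf"))])

# Same grid values as above; the "svc__" prefix routes each parameter to the SVC step.
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": [0.001, 0.01, 0.1, 1]}
grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

scaled_acc = accuracy_score(y_test, grid.predict(X_test))
print("Best parameters:", grid.best_params_)
print("Test accuracy with scaling:", scaled_acc)
```

Keeping the scaler inside the pipeline, rather than scaling the whole training set up front, avoids leaking validation-fold statistics into the fit.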


Question 9: Write a Python program to:

● Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using
sklearn.datasets.fetch_20newsgroups).

● Print the model's ROC-AUC score for its predictions.

(Include your Python code and output in the code box below.)


In [4]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Load dataset (subset for speed)
data = fetch_20newsgroups(subset="train", categories=['sci.space', 'rec.sport.baseball'], remove=('headers','footers','quotes'))
X, y = data.data, data.target

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Vectorize text
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train Naive Bayes
nb = MultinomialNB()
nb.fit(X_train_tfidf, y_train)

# Predict probabilities
y_prob = nb.predict_proba(X_test_tfidf)
roc_auc = roc_auc_score(y_test, y_prob[:,1])

print(f"ROC-AUC Score: {roc_auc:.4f}")


ROC-AUC Score: 0.9909


Question 10: Imagine you’re working as a data scientist for a company that handles email communications.

Your task is to automatically classify emails as Spam or Not Spam. The emails may contain:

● Text with diverse vocabulary

● Potential class imbalance (far more legitimate emails than spam)

● Some incomplete or missing data

Explain the approach you would take to:

● Preprocess the data (e.g. text vectorization, handling missing data)

● Choose and justify an appropriate model (SVM vs. Naïve Bayes)

● Address class imbalance

● Evaluate the performance of your solution with suitable metrics

And explain the business impact of your solution.

(Include your Python code and output in the code box below.)


Answer:

Approach:

Preprocessing:

Handle missing data: drop emails with no usable content, or replace a missing body with an empty string so the vectorizer can still process it.

Convert text to numerical features using TfidfVectorizer or CountVectorizer.

Optionally use n-grams (bi-grams) for more context.

Model Choice:

Naïve Bayes is usually preferred for spam detection (fast, works well with text).

SVM can be used for higher accuracy but is slower on large datasets.

Class Imbalance:

Use class_weight='balanced' in SVM.

Or use oversampling (SMOTE) / undersampling techniques.

Use stratified train-test splits.

Evaluation Metrics:

Accuracy alone is misleading (due to imbalance).

Use Precision, Recall, F1-score, ROC-AUC.

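Recall on the spam class is critical (don’t miss spam), but precision matters too, because a legitimate email wrongly flagged as spam can be costly.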

Business Impact:

Reduces manual filtering effort.

Protects employees/customers from phishing and malware.

Improves productivity and trust in company email system.

In [5]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, roc_auc_score

# Load subset as spam vs not spam (simulate with 2 categories)
data = fetch_20newsgroups(subset="train", categories=["rec.sport.hockey", "sci.med"], remove=("headers","footers","quotes"))
X, y = data.data, data.target

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Vectorize
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train Naive Bayes
nb = MultinomialNB()
nb.fit(X_train_tfidf, y_train)

# Predict
y_pred = nb.predict(X_test_tfidf)
y_prob = nb.predict_proba(X_test_tfidf)[:,1]

# Evaluation
print(classification_report(y_test, y_pred))
print("ROC-AUC:", roc_auc_score(y_test, y_prob))


              precision    recall  f1-score   support

           0       0.97      0.96      0.96       180
           1       0.96      0.97      0.96       179

    accuracy                           0.96       359
   macro avg       0.96      0.96      0.96       359
weighted avg       0.96      0.96      0.96       359

ROC-AUC: 0.9964928615766604
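
The two newsgroup categories above are roughly balanced and contain no missing values, so that run does not exercise the imbalance and missing-data steps from the approach. The sketch below shows one way those steps could be wired together. The emails and labels are hypothetical toy data, the step names are arbitrary, and an SVM with `class_weight="balanced"` is used only as one of the options listed earlier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy emails: 1 = spam, 0 = not spam; None marks a missing body.
emails = [
    "win a free prize now",
    "free money claim your prize",
    "meeting agenda for monday",
    "please review the attached report",
    "lunch tomorrow with the team",
    "project status update and meeting notes",
    None,
    "notes from the client call",
]
labels = [1, 1, 0, 0, 0, 0, 0, 0]

# Impute missing bodies with an empty string so the vectorizer can process them.
texts = [body if body is not None else "" for body in emails]

model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),               # unigrams + bigrams
    ("svm", SVC(kernel="linear", class_weight="balanced")),       # reweight the rare class
])
model.fit(texts, labels)

preds = model.predict(["win a free prize now", "agenda for tomorrow's meeting"])
print(preds)
```

With `class_weight="balanced"`, errors on the minority spam class are weighted more heavily during training, which counteracts the 2-to-6 class ratio in this toy set. On real data the same pipeline would be evaluated with the stratified split and metrics described above.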
