SVM & Naive Bayes ASSIGNMENT

1. What is a Support Vector Machine (SVM), and how does it work?
- A Support Vector Machine (SVM) is a powerful and versatile supervised machine learning algorithm used primarily for classification, and also for regression.
- It finds the optimal hyperplane that separates the data points into distinct classes while maximising the margin, i.e. the distance between the hyperplane and the nearest training points of each class (the support vectors).
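
As an illustration of the hyperplane and margin described above, here is a minimal sketch on a toy 2-D dataset. The data values are hypothetical, and the very large C is an assumption used only to approximate a hard margin on separable data.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable 2-D data (hypothetical values chosen for illustration)
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard margin on separable data
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]        # normal vector of the separating hyperplane w.x + b = 0
b = clf.intercept_[0]   # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)  # distance between the two margin boundaries

print("w:", w, "b:", b)
print("Margin width:", margin_width)
print("Support vectors:\n", clf.support_vectors_)
```

Only the points closest to the boundary end up in support_vectors_; moving any other point (without crossing the margin) would leave the hyperplane unchanged.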

2. Explain the difference between Hard Margin and Soft Margin SVM.
- The core difference between hard margin and soft margin SVMs lies in how they handle misclassification during training.
- Hard Margin SVM > Allows no margin violations, so it requires perfectly linearly separable data and is highly sensitive to outliers.
- Soft Margin SVM > Allows some points to fall inside the margin or be misclassified, with the penalty controlled by the parameter C. It handles noisy, overlapping data with outliers and is usually more robust on unseen data.
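
The trade-off can be sketched in scikit-learn, where the softness of the margin is set by the regularisation parameter C of SVC. The synthetic blobs and the three C values below are assumptions chosen for illustration: a small C tolerates many margin violations (wide, soft margin), while a large C penalises them heavily and behaves more like a hard margin.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters, so no hyperplane separates them perfectly
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

n_support = {}
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Every point on or inside the margin becomes a support vector
    n_support[C] = int(clf.n_support_.sum())
    print(f"C={C}: support vectors={n_support[C]}, "
          f"training accuracy={clf.score(X, y):.3f}")
```

A wider margin typically contains more points, so the support-vector count tends to shrink as C grows.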

3. What is the Kernel Trick in SVM? Give one example of a kernel and explain its use case.
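- The Kernel Trick is a crucial technique in Support Vector Machines (SVMs) that allows them to handle non-linearly separable data by implicitly mapping the data into a higher-dimensional space where it can be linearly separated. The mapping is never computed explicitly: a kernel function returns the inner product of two points in that space directly.
- The Radial Basis Function (RBF) Kernel, also known as the Gaussian kernel, is one of the most popular and versatile kernel functions used in SVMs. Its use case is data with a non-linear class boundary (for example, one class surrounding another), where no straight line separates the classes in the original feature space.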
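
To make the RBF kernel concrete, the sketch below computes K(x, z) = exp(-gamma * ||x - z||^2) by hand and compares it with scikit-learn's rbf_kernel. The three points and the gamma value are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf(x, z, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

# Arbitrary example points and gamma (assumptions for illustration)
X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])
gamma = 0.5

# Pairwise kernel matrix computed by hand and by the library
manual = np.array([[rbf(a, b, gamma) for b in X] for a in X])
library = rbf_kernel(X, gamma=gamma)

print(np.round(manual, 4))
print("Matches sklearn:", np.allclose(manual, library))
```

Nearby points get a kernel value close to 1 and distant points a value close to 0, which is why gamma controls how local the decision boundary is.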

4. What is a Naive Bayes Classifier, and why is it called “naive”?
- The Naive Bayes Classifier is a supervised machine learning algorithm primarily used for classification tasks.
- The term "naive" refers to a fundamental assumption the algorithm makes: that all features (or predictors) in a dataset are conditionally independent of each other given the class. This rarely holds exactly in real data, but it makes the model simple and fast to train.

5. Describe the Gaussian, Multinomial, and Bernoulli Naive Bayes variants. When would you use each one?
- Gaussian Naive Bayes > use when features are continuous and can be approximated by a normal distribution within each class.
 - Ex: Classification from continuous measurements such as weight and height.
- Multinomial Naive Bayes > use for text classification tasks, where features are typically word counts or frequencies.
 - Ex: Spam filtering and document categorisation.
- Bernoulli Naive Bayes > use when features are strictly binary and you're interested in the presence or absence of features rather than their counts or magnitude.
 - Ex: Medical diagnosis where each feature records whether a symptom is present.
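
A minimal sketch of matching each variant to its feature type follows. All data values are hypothetical and far too small for real use; they only show which kind of input each estimator expects.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 0, 1, 1])

# Continuous measurements (e.g. height in cm, weight in kg) -> GaussianNB
X_cont = np.array([[160.0, 55.0], [165.0, 60.0], [180.0, 80.0], [185.0, 90.0]])
# Word counts per document -> MultinomialNB
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])
# Presence/absence of each word -> BernoulliNB
X_binary = (X_counts > 0).astype(int)

predictions = {}
for name, model, X in [("Gaussian", GaussianNB(), X_cont),
                       ("Multinomial", MultinomialNB(), X_counts),
                       ("Bernoulli", BernoulliNB(), X_binary)]:
    model.fit(X, y)
    predictions[name] = model.predict(X)
    print(name, "->", predictions[name])
```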

6. Write a Python program to:
- Load the Iris dataset
- Train an SVM Classifier with a linear kernel
- Print the model's accuracy and support vectors.

In [1]:
#load iris dataset
from sklearn.datasets import load_iris
iris = load_iris()
x, y= iris.data, iris.target

In [2]:
#train test split
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

In [3]:
#train an SVM classifier with a linear kernel
from sklearn.svm import SVC
svm = SVC(kernel='linear')
svm.fit(x_train, y_train)

In [6]:
#predict on test data and calculate accuracy
from sklearn.metrics import accuracy_score
y_pred = svm.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

#support vectors
print("\nNumber of support vectors for each class:", svm.n_support_)
print("\nSupport Vectors:\n", svm.support_vectors_)

Accuracy: 1.0

Number of support vectors for each class: [ 3 11 11]

Support Vectors:
 [[4.8 3.4 1.9 0.2]
 [5.1 3.3 1.7 0.5]
 [4.5 2.3 1.3 0.3]
 [5.6 3.  4.5 1.5]
 [5.4 3.  4.5 1.5]
 [6.7 3.  5.  1.7]
 [5.9 3.2 4.8 1.8]
 [5.1 2.5 3.  1.1]
 [6.  2.7 5.1 1.6]
 [6.3 2.5 4.9 1.5]
 [6.1 2.9 4.7 1.4]
 [6.5 2.8 4.6 1.5]
 [6.9 3.1 4.9 1.5]
 [6.3 2.3 4.4 1.3]
 [6.3 2.5 5.  1.9]
 [6.3 2.8 5.1 1.5]
 [6.3 2.7 4.9 1.8]
 [6.  3.  4.8 1.8]
 [6.  2.2 5.  1.5]
 [6.2 2.8 4.8 1.8]
 [6.5 3.  5.2 2. ]
 [7.2 3.  5.8 1.6]
 [5.6 2.8 4.9 2. ]
 [5.9 3.  5.1 1.8]
 [4.9 2.5 4.5 1.7]]


7. Write a Python program to:
- Load the Breast Cancer dataset
- Train a Gaussian Naive Bayes model
- Print its classification report including precision, recall, and F1-score.

In [7]:
#load cancer dataset
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
x, y = cancer.data, cancer.target

#train test split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

In [8]:
#train gaussian naive bayes model
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(x_train, y_train)

In [9]:
#classification report
from sklearn.metrics import classification_report
y_pred = gnb.predict(x_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      0.93      0.96        43
           1       0.96      1.00      0.98        71

    accuracy                           0.97       114
   macro avg       0.98      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114



8. Write a Python program to:
- Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best C and gamma.
- Print the best hyperparameters and accuracy.


In [10]:
#load wine dataset
from sklearn.datasets import load_wine
wine = load_wine()
x, y = wine.data, wine.target

#train test split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

In [11]:
#parameter grid for GridSearchCV
#(gamma only affects the rbf kernel; it is ignored when kernel='linear')
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001],
              'kernel': ['linear', 'rbf']}

In [12]:
#train SVM with GridSearch CV
from sklearn.model_selection import GridSearchCV
svm = SVC()
grid_search = GridSearchCV(svm, param_grid, cv=5)
grid_search.fit(x_train, y_train)

In [13]:
#best model and prediction with accuracy
best_svm = grid_search.best_estimator_
y_pred = best_svm.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
print("Best Hyperparameters:", grid_search.best_params_)
print("Test Accuracy:", accuracy)

Best Hyperparameters: {'C': 0.1, 'gamma': 1, 'kernel': 'linear'}
Test Accuracy: 1.0


9. Write a Python program to:
- Train a Naive Bayes Classifier on a synthetic text dataset (e.g. using sklearn.datasets.fetch_20newsgroups).
- Print the model's ROC-AUC score for its predictions.

In [14]:
#load 20 news groups dataset
from sklearn.datasets import fetch_20newsgroups
categories = ['sci.space', 'rec.sport.baseball']
news_groups = fetch_20newsgroups(subset='all', categories=categories)
x, y = news_groups.data, news_groups.target

In [17]:
# Convert text to TF-IDF features
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
x_tfidf = vectorizer.fit_transform(x)

In [18]:
#split into train and test datasets
x_train, x_test, y_train, y_test = train_test_split(x_tfidf, y, test_size=0.2, random_state=42)

In [19]:
#train naive bayes classifier
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(x_train, y_train)

In [24]:
#predicted probabilities for the positive class (label 1)
y_pred_proba = nb.predict_proba(x_test)[:, 1]

In [25]:
#compute roc-auc
from sklearn.metrics import roc_auc_score
roc_auc = roc_auc_score(y_test, y_pred_proba)
print("ROC-AUC Score:", roc_auc)

ROC-AUC Score: 1.0


10. Imagine you're working as a data scientist for a company that handles email communications.

Your task is to automatically classify emails as Spam or Not Spam. The emails may contain:
- Text with diverse vocabulary
- Potential class imbalance (far more legitimate emails than spam)
- Some incomplete or missing data

Explain the approach you would take to:
- Preprocess the data (e.g. text vectorization, handling missing data)
- Choose and justify an appropriate model (SVM vs. Naive Bayes)
- Address class imbalance
- Evaluate the performance of your solution with suitable metrics

And explain the business impact of your solution.


In [29]:
#load 20 newsgroups dataset (two categories stand in for spam / not spam)
from sklearn.datasets import fetch_20newsgroups
categories = ['sci.space', 'talk.politics.misc']
news_groups = fetch_20newsgroups(subset='all', categories=categories)
x, y = news_groups.data, news_groups.target
# 0 = sci.space (not spam), 1 = politics (spam-like)

In [30]:
#train test split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

In [32]:
#Vectorization with TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000, ngram_range=(1,2))
x_train_tfidf = vectorizer.fit_transform(x_train)
x_test_tfidf = vectorizer.transform(x_test)

In [34]:
#Model 1: Naive Bayes
from sklearn.naive_bayes import MultinomialNB
nb_clf = MultinomialNB()
nb_clf.fit(x_train_tfidf, y_train)
y_pred_nb = nb_clf.predict(x_test_tfidf)
y_proba_nb = nb_clf.predict_proba(x_test_tfidf)[:, 1]

print("\n--- Naive Bayes Results ---")
print(classification_report(y_test, y_pred_nb, target_names=categories))
print("ROC-AUC Score:", roc_auc_score(y_test, y_proba_nb))


--- Naive Bayes Results ---
                    precision    recall  f1-score   support

         sci.space       0.97      0.98      0.98       193
talk.politics.misc       0.98      0.97      0.97       160

          accuracy                           0.98       353
         macro avg       0.98      0.98      0.98       353
      weighted avg       0.98      0.98      0.98       353

ROC-AUC Score: 0.9989637305699482


In [39]:
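#address class imbalance: compute balanced class weights
#(they only take effect if passed to a classifier, e.g. class_weight=class_weight_dict)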
import numpy as np
from sklearn.utils.class_weight import compute_class_weight
classes = np.unique(y_train)
class_weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
class_weight_dict = {cls: w for cls, w in zip(classes, class_weights)}

In [40]:
#Model 2: Linear SVM
svm_clf = SVC(kernel='linear', probability=True)
svm_clf.fit(x_train_tfidf, y_train)
y_pred_svm = svm_clf.predict(x_test_tfidf)
y_proba_svm = svm_clf.predict_proba(x_test_tfidf)[:, 1]

print("\n--- SVM Results ---")
print(classification_report(y_test, y_pred_svm, target_names=categories))


--- SVM Results ---
                    precision    recall  f1-score   support

         sci.space       0.99      0.98      0.99       193
talk.politics.misc       0.98      0.99      0.98       160

          accuracy                           0.99       353
         macro avg       0.99      0.99      0.99       353
      weighted avg       0.99      0.99      0.99       353



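Summary of the approach:
- Data: loads the 20 Newsgroups dataset with 2 categories (simulating spam vs not spam).
- Preprocessing: vectorizes the text with TF-IDF over unigrams and bigrams, fitting the vectorizer on the training split only. Missing email bodies should be replaced with empty strings before this step.
- Models:
 - Naive Bayes (fast baseline; probability output, so ROC-AUC is available).
 - Linear SVM (usually stronger on sparse text features).
- Class imbalance: balanced class weights are computed with compute_class_weight; they take effect only when passed to the classifier through its class_weight parameter.
- Evaluation: reports precision, recall, F1 and ROC-AUC (where available).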
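
The preprocessing and imbalance handling summarised above can be sketched as one pipeline. This is an illustrative sketch, not the code that produced the results in this notebook: the toy emails and labels are hypothetical, and LinearSVC with class_weight="balanced" is a swapped-in choice.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy emails; None marks a missing body
emails = ["Meeting moved to 3pm, see agenda attached",
          "Win a free prize now, click here",
          None,
          "Quarterly report draft for your review",
          "Lunch tomorrow?",
          "Free money!!! Claim your prize today",
          "Please review the attached invoice",
          "Project update and next steps"]
labels = np.array([0, 1, 0, 0, 0, 1, 0, 0])  # 1 = spam (minority class)

# Handle missing data: replace missing bodies with an empty string
texts = [e if isinstance(e, str) else "" for e in emails]

# TF-IDF + linear SVM; "balanced" reweights classes inversely to their frequency
model = make_pipeline(
    TfidfVectorizer(stop_words="english", ngram_range=(1, 2)),
    LinearSVC(class_weight="balanced"),
)
model.fit(texts, labels)
pred = model.predict(texts)

# Scored on the training data only to show the API; use a held-out split in practice
print(classification_report(labels, pred,
                            target_names=["not spam", "spam"], zero_division=0))
```

Keeping the vectorizer inside the pipeline means it is fitted only on whatever data is passed to fit, which avoids leaking test vocabulary into training.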