# SVM and Naive Bayes Assignment

1.  What is a Support Vector Machine (SVM), and how does it work?
A Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression, though it is most often used for classification.

At its core, an SVM finds the decision boundary, called a hyperplane, that separates data points of different classes with the maximum possible margin. The margin is the distance from the hyperplane to the closest training points of each class, and those closest points are the support vectors.
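As a rough illustration of the maximum-margin idea, the sketch below fits a linear SVM on a tiny toy dataset and reads off the hyperplane and margin width. The data points and the very large `C` value are assumptions chosen for the example, not part of the assignment.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data (not from the assignment)
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C leaves almost no room for margin violations,
# which approximates a hard-margin SVM on separable data
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]          # normal vector of the hyperplane w.x + b = 0
b = clf.intercept_[0]
margin_width = 2.0 / np.linalg.norm(w)  # distance between the two margin boundaries

print("w =", w, "b =", b)
print("Margin width:", margin_width)
print("Support vectors:\n", clf.support_vectors_)
```

Only the points closest to the boundary show up in `support_vectors_`; moving any other point (without crossing the margin) would leave the hyperplane unchanged.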

2. Explain the difference between Hard Margin and Soft Margin SVM.
A Hard Margin SVM assumes that the data is perfectly linearly separable, meaning there exists a straight line (or hyperplane) that divides the two classes without any misclassification.

It strictly enforces that every data point must be correctly classified and lie outside or on the margin boundaries.


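A Soft Margin SVM allows some flexibility. It accepts that a few points might be:

● On the wrong side of the margin, or even
● Misclassified entirely.

It introduces slack variables (ξᵢ ≥ 0) to measure how much each point violates the margin constraint. A regularization parameter C controls the trade-off: a large C penalizes violations heavily (closer to a hard margin), while a small C tolerates more violations in exchange for a wider margin.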
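The effect of the slack variables can be sketched numerically. The example below assumes synthetic overlapping clusters from `make_blobs` and three arbitrary values of C; it computes each point's slack as max(0, 1 − yᵢ·f(xᵢ)), where f is the fitted decision function.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters (synthetic, for illustration only)
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)
y_signed = np.where(y == 1, 1, -1)  # the SVM formulation uses labels in {-1, +1}

def fit_and_measure_slack(C):
    """Fit a linear SVM and return the slack of every training point."""
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # xi_i = max(0, 1 - y_i * f(x_i)); 0 means the point respects the margin
    return np.maximum(0.0, 1.0 - y_signed * clf.decision_function(X))

total_slack = {}
for C in (0.01, 1.0, 100.0):
    xi = fit_and_measure_slack(C)
    total_slack[C] = xi.sum()
    print(f"C={C:<6} total slack={xi.sum():8.2f}  "
          f"inside margin={(xi > 0).sum():3d}  misclassified={(xi > 1).sum():3d}")
```

A slack between 0 and 1 means the point is inside the margin but still on the correct side; a slack above 1 means it is misclassified.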


3. What is the Kernel Trick in SVM? Give one example of a kernel and
explain its use case.
The Kernel Trick allows SVMs to separate data that is not linearly separable by implicitly mapping the data into a higher-dimensional space, without ever computing that mapping directly.

Imagine you have data that looks like this in 2D:

🟠 Inner circle (Class 1)
🔵 Outer ring (Class 2)

You cannot draw a straight line to separate them in 2D.

But if you lift the data into a higher dimension (e.g., add a new feature like $z = x^2 + y^2$), the circular pattern becomes linearly separable in that higher-dimensional space.

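👉 The Kernel Trick does this implicitly, using a mathematical function called a kernel function, which computes inner products in the higher-dimensional space directly from the original features.

One example is the Radial Basis Function (RBF) kernel, $K(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)$. It is a common choice when the boundary between classes is curved and its shape is not known in advance, as with the inner circle and outer ring described here.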
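The circle-and-ring picture can be reproduced with a short sketch. It assumes synthetic data from `make_circles` (the `factor` and `noise` values are arbitrary) and compares three options: a linear SVM on the raw 2D points, a linear SVM after explicitly adding the feature z = x² + y², and an RBF-kernel SVM that never builds the extra feature.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Inner circle vs. outer ring (synthetic data, for illustration only)
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

# 1) Linear SVM in the original 2D space: no straight line separates the classes
acc_linear_2d = SVC(kernel="linear").fit(X, y).score(X, y)

# 2) Explicit lift: append z = x^2 + y^2 as a third feature, then use a linear SVM
z = (X ** 2).sum(axis=1, keepdims=True)
X_lifted = np.hstack([X, z])
acc_linear_3d = SVC(kernel="linear").fit(X_lifted, y).score(X_lifted, y)

# 3) Kernel trick: the RBF kernel works on the original 2D features directly
acc_rbf = SVC(kernel="rbf", gamma="scale").fit(X, y).score(X, y)

print(f"Linear SVM, 2D features:      {acc_linear_2d:.2f}")
print(f"Linear SVM, lifted features:  {acc_linear_3d:.2f}")
print(f"RBF-kernel SVM, 2D features:  {acc_rbf:.2f}")
```

These are training accuracies, reported only to show separability; they are not a substitute for a held-out evaluation.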


4. What is a Naïve Bayes Classifier, and why is it called “naïve”?
The Naïve Bayes Classifier is a probabilistic machine learning model based on Bayes’ Theorem.

It predicts the probability that a given sample belongs to a particular class based on the values of its features.

The algorithm is called “naïve” because it makes a strong (and often unrealistic) assumption:

It assumes that all features are independent of each other, given the class label.
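The independence assumption means the class score is just the prior multiplied by one likelihood term per feature. The sketch below works this out by hand on a made-up five-email dataset, treating each vocabulary word as a present/absent feature with add-one (Laplace) smoothing. The emails, labels, and helper name `posterior` are all hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training emails: (set of words, label)
train = [
    ({"free", "win", "money"}, "spam"),
    ({"free", "offer"}, "spam"),
    ({"meeting", "tomorrow"}, "ham"),
    ({"project", "meeting", "notes"}, "ham"),
    ({"win", "project"}, "ham"),
]

vocab = sorted(set().union(*(words for words, _ in train)))
class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)  # word_counts[label][word] = docs containing word
for words, label in train:
    word_counts[label].update(words)

def posterior(words):
    """Return P(class | words) under the naive independence assumption."""
    log_scores = {}
    for label, n_docs in class_counts.items():
        log_p = math.log(n_docs / len(train))  # log prior
        for w in vocab:
            # Smoothed P(word present | class); each word contributes independently
            p_present = (word_counts[label][w] + 1) / (n_docs + 2)
            log_p += math.log(p_present if w in words else 1.0 - p_present)
        log_scores[label] = log_p
    # Normalize the log scores into probabilities
    top = max(log_scores.values())
    unnorm = {k: math.exp(v - top) for k, v in log_scores.items()}
    total = sum(unnorm.values())
    return {k: v / total for k, v in unnorm.items()}

probs = posterior({"free", "money"})
print(probs)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.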

5.  Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants.
When would you use each one?

🧮 1. Gaussian Naïve Bayes
📘 Description:

Assumes that the continuous features (numeric values) for each class follow a normal (Gaussian) distribution.

Example Use Cases:

Iris flower classification (using petal/sepal length and width)

Medical diagnosis (continuous lab test values)

Sensor or signal data classification

🧮 2. Multinomial Naïve Bayes
📘 Description:

Designed for discrete features, typically count data (like word frequencies in text).

Assumes that the features represent the number of times a particular event occurs (e.g., how often a word appears in a document).
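The variants differ mainly in the kind of feature they expect. The sketch below pairs each scikit-learn class with a matching synthetic feature type: continuous values for `GaussianNB`, counts for `MultinomialNB`, and binary present/absent indicators for `BernoulliNB` (the variant suited to features such as "does this word appear at all?"). The data is randomly generated for illustration, and the distribution parameters are arbitrary assumptions.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 100)  # 100 samples per class

# Continuous measurements -> GaussianNB
X_cont = np.vstack([rng.normal(0.0, 1.0, (100, 3)),
                    rng.normal(3.0, 1.0, (100, 3))])

# Event counts (e.g. word frequencies) -> MultinomialNB
X_counts = np.vstack([rng.poisson([5, 1, 1], (100, 3)),
                      rng.poisson([1, 1, 5], (100, 3))])

# Binary indicators (feature present or absent) -> BernoulliNB
X_binary = (X_counts > 2).astype(int)

scores = {
    "GaussianNB (continuous)": GaussianNB().fit(X_cont, y).score(X_cont, y),
    "MultinomialNB (counts)": MultinomialNB().fit(X_counts, y).score(X_counts, y),
    "BernoulliNB (binary)": BernoulliNB().fit(X_binary, y).score(X_binary, y),
}
for name, acc in scores.items():
    print(f"{name}: training accuracy = {acc:.2f}")
```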


Dataset Info:
● You can use any suitable datasets like Iris, Breast Cancer, or Wine from
sklearn.datasets or a CSV file you have.
6. Write a Python program to:
● Load the Iris dataset
● Train an SVM Classifier with a linear kernel
● Print the model's accuracy and support vectors.
(Include your Python code and output in the code box below.)

# Import necessary libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data        # Feature matrix
y = iris.target      # Labels

# Split data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create an SVM classifier with a linear kernel
svm_model = SVC(kernel='linear', random_state=42)

# Train the model
svm_model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = svm_model.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print results
print("=== SVM Classifier Results ===")
print(f"Accuracy: {accuracy * 100:.2f}%")
print("\nSupport Vectors (all classes):")
print(svm_model.support_vectors_)
print("\nNumber of Support Vectors for each class:")
print(svm_model.n_support_)


=== SVM Classifier Results ===
Accuracy: 100.00%

Support Vectors (all classes):
[[4.8 3.4 1.9 0.2]
 [5.1 3.3 1.7 0.5]
 [4.5 2.3 1.3 0.3]
 [5.6 3.  4.5 1.5]
 [5.4 3.  4.5 1.5]
 [6.7 3.  5.  1.7]
 [5.9 3.2 4.8 1.8]
 [5.1 2.5 3.  1.1]
 [6.  2.7 5.1 1.6]
 [6.3 2.5 4.9 1.5]
 [6.1 2.9 4.7 1.4]
 [6.5 2.8 4.6 1.5]
 [6.9 3.1 4.9 1.5]
 [6.3 2.3 4.4 1.3]
 [6.3 2.5 5.  1.9]
 [6.3 2.8 5.1 1.5]
 [6.3 2.7 4.9 1.8]
 [6.  3.  4.8 1.8]
 [6.  2.2 5.  1.5]
 [6.2 2.8 4.8 1.8]
 [6.5 3.  5.2 2. ]
 [7.2 3.  5.8 1.6]
 [5.6 2.8 4.9 2. ]
 [5.9 3.  5.1 1.8]
 [4.9 2.5 4.5 1.7]]

Number of Support Vectors for each class:
[ 3 11 11]


7. Write a Python program to:
● Load the Breast Cancer dataset
● Train a Gaussian Naïve Bayes model
● Print its classification report including precision, recall, and F1-score.
(Include your Python code and output in the code box below.)

# Import necessary libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Load the Breast Cancer dataset
breast_cancer = datasets.load_breast_cancer()
X = breast_cancer.data     # Feature matrix
y = breast_cancer.target   # Labels (0 = malignant, 1 = benign)

# Split data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the Gaussian Naïve Bayes model
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Make predictions on the test data
y_pred = gnb.predict(X_test)

# Print the classification report
print("=== Gaussian Naïve Bayes Classifier Report ===")
print(classification_report(y_test, y_pred, target_names=breast_cancer.target_names))

=== Gaussian Naïve Bayes Classifier Report ===
              precision    recall  f1-score   support

   malignant       1.00      0.93      0.96        43
      benign       0.96      1.00      0.98        71

    accuracy                           0.97       114
   macro avg       0.98      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114



8. Write a Python program to:
● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best
C and gamma.
● Print the best hyperparameters and accuracy.
(Include your Python code and output in the code box below.)

# Import necessary libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Wine dataset
wine = datasets.load_wine()
X = wine.data      # Feature matrix
y = wine.target    # Labels

# Split data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the SVM model
svm = SVC()

# Define parameter grid for GridSearchCV
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [1, 0.1, 0.01, 0.001],
    'kernel': ['rbf']   # using RBF kernel for non-linear decision boundaries
}

# Perform Grid Search with 5-fold cross-validation
grid = GridSearchCV(svm, param_grid, refit=True, verbose=0, cv=5)
grid.fit(X_train, y_train)

# Make predictions using the best model
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print results
print("=== SVM Classifier with GridSearchCV ===")
print(f"Best Parameters: {grid.best_params_}")
print(f"Best Cross-Validation Accuracy: {grid.best_score_ * 100:.2f}%")
print(f"Test Set Accuracy: {accuracy * 100:.2f}%")

=== SVM Classifier with GridSearchCV ===
Best Parameters: {'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}
Best Cross-Validation Accuracy: 71.80%
Test Set Accuracy: 83.33%
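
The cross-validation accuracy above is modest for this dataset. One likely factor (an assumption, not verified against this run) is that the Wine features have very different scales, and an RBF kernel is sensitive to that because it is based on distances. A variant worth trying is to put a `StandardScaler` in a `Pipeline` so that scaling is refit inside every cross-validation fold; the sketch below reuses the same grid and split settings. The step names `scaler` and `svc` are arbitrary labels chosen for this example.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaling lives inside the pipeline, so each CV fold is scaled using
# statistics from its own training portion only (no leakage)
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(kernel="rbf")),
])

# Pipeline parameters are addressed as <step name>__<parameter>
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [1, 0.1, 0.01, 0.001],
}

grid_scaled = GridSearchCV(pipe, param_grid, cv=5)
grid_scaled.fit(X_train, y_train)
test_accuracy = grid_scaled.score(X_test, y_test)

print(f"Best Parameters: {grid_scaled.best_params_}")
print(f"Best Cross-Validation Accuracy: {grid_scaled.best_score_ * 100:.2f}%")
print(f"Test Set Accuracy: {test_accuracy * 100:.2f}%")
```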


9. Write a Python program to:
● Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using
sklearn.datasets.fetch_20newsgroups).
● Print the model's ROC-AUC score for its predictions.
(Include your Python code and output in the code box below.)

# Import necessary libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score

# Load a subset of the 20 Newsgroups dataset (binary classification for simplicity)
categories = ['sci.space', 'rec.autos']  # Two distinct categories
newsgroups = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))

X = newsgroups.data     # Text documents
y = newsgroups.target   # Labels (0 or 1)

# Split the raw documents first (80% train, 20% test) so the vectorizer never sees test text
X_train_text, X_test_text, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert text data into TF-IDF feature vectors, fitting on the training documents only
vectorizer = TfidfVectorizer(stop_words='english')
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)

# Train a Multinomial Naïve Bayes classifier
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

# Predict probabilities for the positive class
y_proba = nb_model.predict_proba(X_test)[:, 1]

# Compute ROC-AUC score
roc_auc = roc_auc_score(y_test, y_proba)

# Print results
print("=== Naïve Bayes Text Classification (ROC-AUC) ===")
print(f"ROC-AUC Score: {roc_auc:.4f}")

10. Imagine you’re working as a data scientist for a company that handles
email communications.
Your task is to automatically classify emails as Spam or Not Spam. The emails may
contain:
● Text with diverse vocabulary
● Potential class imbalance (far more legitimate emails than spam)
● Some incomplete or missing data
Explain the approach you would take to:
● Preprocess the data (e.g. text vectorization, handling missing data)
● Choose and justify an appropriate model (SVM vs. Naïve Bayes)
● Address class imbalance
● Evaluate the performance of your solution with suitable metrics
And explain the business impact of your solution.
(Include your Python code and output in the code box below.)

1️⃣ Data Preprocessing

Emails are text data that may contain missing fields. Steps include:

Handling missing data:

Drop emails with no content or fill missing text with a placeholder like an empty string.

Text cleaning & normalization:

Lowercasing, removing punctuation, numbers, and stopwords.

Optional: stemming or lemmatization.

Text vectorization:

Convert emails to numeric vectors for ML models using:

TF-IDF (TfidfVectorizer): weights each word by how informative it is across the corpus.

Count Vectorizer (CountVectorizer): counts word occurrences.
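
The preprocessing steps listed here can be sketched as follows. The sample emails and the helper name `clean` are hypothetical, and the cleaning rules (lowercase, letters only) are one simple choice among many.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical raw emails, including a missing (None) and an empty body
raw_emails = [
    "WIN a FREE prize now!!! Call 555-1234",
    None,
    "Meeting moved to 3pm, see agenda attached.",
    "",
]

def clean(text):
    """Replace missing text with '', lowercase, and keep letters only."""
    text = (text or "").lower()
    text = re.sub(r"[^a-z\s]", " ", text)      # drop punctuation and digits
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace

cleaned = [clean(t) for t in raw_emails]

# Stop-word removal is delegated to the vectorizer
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(cleaned)

print(cleaned)
print("Matrix shape (emails x vocabulary):", X.shape)
print("Vocabulary:", sorted(vectorizer.vocabulary_))
```

Emails with no usable text become all-zero rows; whether to keep or drop them is a judgment call that depends on how often they occur.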

2️⃣ Model Choice
Naïve Bayes

Works well with text classification, especially spam detection.

Handles high-dimensional sparse data efficiently.

Performs well even with small training sets.

SVM

Can produce higher accuracy with well-separated classes.

Works with kernels (linear for text data) but is slower on large datasets.

✅ Choice: Start with Multinomial Naïve Bayes, because spam detection is mostly about word presence and frequency, and NB handles sparse high-dimensional text efficiently.

3️⃣ Handling Class Imbalance

Spam is usually less frequent than legitimate emails:

Techniques:

Class weighting: Adjust the classifier to penalize misclassification of the minority class more (for example, through per-sample weights passed to fit).

Resampling: Oversample the minority class (e.g. SMOTE) or undersample the majority class.

Threshold tuning: Adjust the probability threshold for classifying spam.
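
Threshold tuning in particular is easy to illustrate. The sketch below uses made-up labels and predicted spam probabilities (standing in for the output of `predict_proba`) and shows how moving the cut-off trades precision against recall.

```python
import numpy as np

# Hypothetical ground truth (1 = spam) and predicted spam probabilities
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
spam_proba = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.55, 0.90])

def precision_recall_at(threshold):
    """Classify as spam when the probability reaches the threshold."""
    y_pred = (spam_proba >= threshold).astype(int)
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

for threshold in (0.3, 0.5, 0.7):
    precision, recall = precision_recall_at(threshold)
    print(f"threshold={threshold:.1f}  precision={precision:.2f}  recall={recall:.2f}")
```

A lower threshold catches more spam at the cost of flagging more legitimate email; the right setting depends on which error is more expensive for the business.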

4️⃣ Evaluation Metrics

For imbalanced datasets, accuracy is misleading. Prefer:

Precision: Fraction of predicted spam that is actually spam (avoids false positives).

Recall: Fraction of actual spam detected (avoids false negatives).

F1-score: Harmonic mean of precision and recall.

ROC-AUC: Probability that the model ranks a random spam email higher than a non-spam email.
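
To see why accuracy alone misleads on imbalanced data, consider a hypothetical test set of 90 legitimate emails and 10 spam emails. The sketch compares a model that never predicts spam with one that catches 8 of the 10 spam emails while raising 5 false alarms; both prediction vectors are invented for the example.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical imbalanced test set: 90 legitimate (0), 10 spam (1)
y_true = np.array([0] * 90 + [1] * 10)

# Model A: always predicts "not spam"
y_pred_a = np.zeros(100, dtype=int)

# Model B: 5 false positives, 8 of the 10 spam emails detected
y_pred_b = np.zeros(100, dtype=int)
y_pred_b[:5] = 1       # legitimate emails wrongly flagged
y_pred_b[90:98] = 1    # spam emails correctly flagged

results = {}
for name, y_pred in (("A (never spam)", y_pred_a), ("B (filter)", y_pred_b)):
    results[name] = {
        "accuracy": accuracy_score(y_true, y_pred),
        # zero_division=0 avoids a warning when no spam is predicted at all
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
    }
    print(name, {k: round(v, 3) for k, v in results[name].items()})
```

Model A reaches 90% accuracy without detecting a single spam email, which is why precision, recall, and F1 are the more informative numbers here.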

5️⃣ Business Impact

Reducing false negatives: Prevent spam from reaching users → better user experience, compliance.

Reducing false positives: Avoid misclassifying important emails → prevents loss of critical communication.

Efficiency: Automated filtering reduces manual inspection costs.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.utils.class_weight import compute_class_weight
import numpy as np

# Load dataset (simulate spam vs non-spam)
categories = ['rec.autos', 'sci.space']
emails = fetch_20newsgroups(subset='all', categories=categories, remove=('headers','footers','quotes'))
X, y = [text or "" for text in emails.data], emails.target  # Handle missing text

# Split and vectorize
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X_train_tfidf, X_test_tfidf = vectorizer.fit_transform(X_train), vectorizer.transform(X_test)

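# Compute per-sample weights to counter class imbalance
# (current scikit-learn requires `classes` and `y` as keyword arguments)
classes = np.unique(y_train)
class_weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
weights = dict(zip(classes, class_weights))
sample_weights = np.array([weights[label] for label in y_train])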

# Train and predict
model = MultinomialNB()
model.fit(X_train_tfidf, y_train, sample_weight=sample_weights)
y_pred, y_proba = model.predict(X_test_tfidf), model.predict_proba(X_test_tfidf)[:,1]

# Evaluation
print(classification_report(y_test, y_pred, target_names=emails.target_names))
print(f"ROC-AUC Score: {roc_auc_score(y_test, y_proba):.4f}")