<a href="https://colab.research.google.com/github/arunshi01/Assignment-Logistic-Regression/blob/main/SVM%26NAIVEBAYES.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

1) What is a Support Vector Machine (SVM), and how does it work?

A Support Vector Machine (SVM) is a powerful supervised machine learning algorithm used for classification (and sometimes regression) tasks. It’s particularly effective in high-dimensional spaces and works well when there is a clear margin of separation between classes.


How SVM Works – Step-by-Step

I) Plot the data in a feature space.

II) Find the hyperplane that separates the classes.
The goal is to maximize the margin, which is the distance between the hyperplane and the nearest data points from each class.

III) Use support vectors (the points on the edge of the margin) to define this hyperplane.

IV) When new data comes in, SVM determines which side of the hyperplane it lies on and classifies it accordingly.
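
The four steps above can be sketched with scikit-learn on a tiny dataset. The six points, the two new points, and the large value of C (chosen so the fit behaves roughly like a hard margin) are assumptions made up for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Step I: a tiny, made-up 2D dataset with two well-separated classes
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
              [5.0, 5.0], [5.5, 6.0], [6.0, 5.5]])
y = np.array([0, 0, 0, 1, 1, 1])

# Step II: fit a linear SVM; the large C is an assumption that
# makes margin violations very costly on this separable data
clf = SVC(kernel="linear", C=1e3)
clf.fit(X, y)

w = clf.coef_[0]        # normal vector of the hyperplane w.x + b = 0
b = clf.intercept_[0]
margin = 1.0 / np.linalg.norm(w)  # distance from hyperplane to nearest points

# Step III: the support vectors are the training points on the margin edge
print("Support vectors:\n", clf.support_vectors_)
print("Margin:", margin)

# Step IV: classify new points by the sign of the decision function
new_points = np.array([[2.0, 2.0], [5.0, 6.0]])
scores = clf.decision_function(new_points)
predictions = clf.predict(new_points)
print("Decision scores:", scores)
print("Predicted classes:", predictions)
```

A negative decision score puts a point on the class 0 side of the hyperplane, and a positive score puts it on the class 1 side.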

2) Explain the difference between Hard Margin and Soft Margin SVM.

Hard Margin SVM

Used when data is perfectly linearly separable — i.e., you can draw a straight line (or hyperplane) that separates the two classes with no misclassifications.

Idea:

-The SVM tries to find the hyperplane that maximizes the margin between the two classes.

-No data points are allowed to be on the wrong side of the hyperplane or even within the margin.

Soft Margin SVM

Used when data is not perfectly separable — which is common in real-world datasets.

Idea:

-Allows some misclassifications or margin violations to achieve a better overall boundary.

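-Introduces a penalty term (slack variables weighted by the regularization parameter C) for points that fall inside the margin or on the wrong side of the hyperplane. A larger C penalizes violations more heavily; a smaller C tolerates them in exchange for a wider margin.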
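
As a rough illustration of the difference, the sketch below fits a linear SVM twice on overlapping synthetic clusters, once with a small C and once with a large C. The dataset and both C values are assumptions. Note that C=100 only approximates hard-margin behavior, because a true hard margin has no solution when the classes overlap.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters, so some margin violations are unavoidable
X, y = make_blobs(n_samples=100, centers=[[0, 0], [2, 2]],
                  cluster_std=1.2, random_state=0)

margins = {}
n_support = {}
for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Distance from the hyperplane to the margin edge is 1 / ||w||
    margins[C] = 1.0 / np.linalg.norm(clf.coef_[0])
    n_support[C] = int(clf.n_support_.sum())
    print(f"C={C}: margin={margins[C]:.3f}, support vectors={n_support[C]}")
```

The small-C model accepts more violations and ends up with the wider margin; the large-C model narrows the margin to reduce violations.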

3) What is the Kernel Trick in SVM? Give one example of a kernel and
explain its use case.

The Kernel Trick allows an SVM to handle non-linearly separable data by implicitly mapping it into a higher-dimensional space — without actually computing that transformation.

This means SVM can find a linear boundary in that higher-dimensional space, which corresponds to a nonlinear boundary in the original space.
Example: the RBF (Gaussian) kernel, which suits curved/nonlinear boundaries.

Use Case:

Imagine classifying whether a patient has a disease based on two features (say, blood sugar and BMI).
If the relationship between healthy and sick patients is nonlinear (e.g., curved boundary), an RBF kernel can help the SVM learn that complex pattern effectively.
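
A minimal sketch of this effect, using synthetic two-ring data from make_circles as a stand-in for the two medical features (no real patient data is involved, and the dataset parameters are assumptions):

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data: one class forms a ring around the other,
# so no straight line can separate them
X, y = make_circles(n_samples=400, factor=0.4, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Same data, same algorithm, only the kernel differs
linear_acc = SVC(kernel="linear").fit(X_train, y_train).score(X_test, y_test)
rbf_acc = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train).score(X_test, y_test)

print("Linear kernel accuracy:", linear_acc)
print("RBF kernel accuracy:   ", rbf_acc)
```

The linear kernel can only draw a straight boundary through the rings, while the RBF kernel can learn the circular one.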

4) What is a Naïve Bayes Classifier, and why is it called “naïve”?

The Naïve Bayes Classifier is based on Bayes’ Theorem, which describes how to update the probability of a hypothesis based on new evidence.

It’s used to predict the class of a data point given its features by calculating posterior probabilities.

It’s called “naïve” because it assumes that all features are independent of each other given the class label.
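
The calculation can be worked through by hand. In the sketch below, every prior and per-word probability is a made-up number chosen for illustration; the point is only to show where the independence assumption enters.

```python
# All probabilities below are made-up numbers for illustration
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.8, "win": 0.5},
    "ham":  {"free": 0.1, "win": 0.05},
}
observed_words = ["free", "win"]

# Naive assumption: P(words | class) is the product of per-word probabilities
unnormalized = {}
for label, prior in priors.items():
    score = prior
    for word in observed_words:
        score *= likelihoods[label][word]
    unnormalized[label] = score

# Bayes' theorem: divide by the total evidence to get posterior probabilities
evidence = sum(unnormalized.values())
posteriors = {label: score / evidence for label, score in unnormalized.items()}

predicted = max(posteriors, key=posteriors.get)
print(posteriors)
print("Predicted class:", predicted)
```

Without the independence assumption, the model would need the joint probability of every word combination per class, which is rarely feasible to estimate.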


5) Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants.
When would you use each one?

Gaussian Naïve Bayes (GNB)
When to Use:
-When your features are continuous (numerical values).
-Common in datasets with measurements like height, weight, age, temperature, etc.

Assumption:

Each feature follows a normal (Gaussian) distribution within each class.

Multinomial Naïve Bayes (MNB)
When to Use:
-When your features are counts or frequencies.
-Most common for text classification, e.g., spam detection, sentiment analysis.

Assumption:

Features represent discrete counts (e.g., number of times a word appears).

Bernoulli Naïve Bayes (BNB)
When to Use:
-When your features are binary (0 or 1).
-Each feature indicates the presence or absence of something (e.g., a word), not how many times it appears.

Assumption:

Each feature is a binary variable (1 = present, 0 = absent).
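
The three variants map onto three feature types, which the sketch below pairs up. The four-document corpus, its labels, and the new document are made up for illustration; Iris is used only as a convenient source of continuous features.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Gaussian NB: continuous measurements (Iris flower dimensions)
X_iris, y_iris = load_iris(return_X_y=True)
gnb = GaussianNB().fit(X_iris, y_iris)
gnb_train_acc = gnb.score(X_iris, y_iris)
print("GaussianNB training accuracy:", gnb_train_acc)

# A tiny made-up corpus (1 = spam, 0 = not spam)
docs = ["win free prize now", "free free offer win",
        "meeting agenda for monday", "project report and agenda"]
labels = np.array([1, 1, 0, 0])
new_doc = ["free prize offer"]

# Multinomial NB: features are word counts
count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(docs)
mnb = MultinomialNB().fit(X_counts, labels)
mnb_pred = mnb.predict(count_vec.transform(new_doc))

# Bernoulli NB: features are binary (word present or absent)
binary_vec = CountVectorizer(binary=True)
X_binary = binary_vec.fit_transform(docs)
bnb = BernoulliNB().fit(X_binary, labels)
bnb_pred = bnb.predict(binary_vec.transform(new_doc))

print("MultinomialNB prediction:", mnb_pred)
print("BernoulliNB prediction:  ", bnb_pred)
```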


6) Write a Python program to:
● Load the Iris dataset
● Train an SVM Classifier with a linear kernel
● Print the model's accuracy and support vectors.

In [1]:
# Import necessary libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# 1️⃣ Load the Iris dataset
iris = datasets.load_iris()
X = iris.data       # Features
y = iris.target     # Labels

# 2️⃣ Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3️⃣ Train an SVM Classifier with a linear kernel
model = SVC(kernel='linear')
model.fit(X_train, y_train)

# 4️⃣ Make predictions on the test set
y_pred = model.predict(X_test)

# 5️⃣ Calculate and print the model's accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

# 6️⃣ Print the support vectors
print("\nSupport Vectors:\n", model.support_vectors_)


Model Accuracy: 1.0

Support Vectors:
 [[4.8 3.4 1.9 0.2]
 [5.1 3.3 1.7 0.5]
 [4.5 2.3 1.3 0.3]
 [5.6 3.  4.5 1.5]
 [5.4 3.  4.5 1.5]
 [6.7 3.  5.  1.7]
 [5.9 3.2 4.8 1.8]
 [5.1 2.5 3.  1.1]
 [6.  2.7 5.1 1.6]
 [6.3 2.5 4.9 1.5]
 [6.1 2.9 4.7 1.4]
 [6.5 2.8 4.6 1.5]
 [6.9 3.1 4.9 1.5]
 [6.3 2.3 4.4 1.3]
 [6.3 2.5 5.  1.9]
 [6.3 2.8 5.1 1.5]
 [6.3 2.7 4.9 1.8]
 [6.  3.  4.8 1.8]
 [6.  2.2 5.  1.5]
 [6.2 2.8 4.8 1.8]
 [6.5 3.  5.2 2. ]
 [7.2 3.  5.8 1.6]
 [5.6 2.8 4.9 2. ]
 [5.9 3.  5.1 1.8]
 [4.9 2.5 4.5 1.7]]


7) Write a Python program to:
● Load the Breast Cancer dataset
● Train a Gaussian Naïve Bayes model
● Print its classification report including precision, recall, and F1-score.


In [2]:
# Import required libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# 1️⃣ Load the Breast Cancer dataset
breast_cancer = datasets.load_breast_cancer()
X = breast_cancer.data      # Features
y = breast_cancer.target    # Labels

# 2️⃣ Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3️⃣ Train a Gaussian Naïve Bayes model
model = GaussianNB()
model.fit(X_train, y_train)

# 4️⃣ Make predictions
y_pred = model.predict(X_test)

# 5️⃣ Print the classification report
print("Classification Report:\n")
print(classification_report(y_test, y_pred, target_names=breast_cancer.target_names))


Classification Report:

              precision    recall  f1-score   support

   malignant       1.00      0.93      0.96        43
      benign       0.96      1.00      0.98        71

    accuracy                           0.97       114
   macro avg       0.98      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114



8) Write a Python program to:
● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best
C and gamma.
● Print the best hyperparameters and accuracy.


In [None]:
# Import required libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# 1️⃣ Load the Wine dataset
wine = datasets.load_wine()
X = wine.data       # Features
y = wine.target     # Labels

# 2️⃣ Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3️⃣ Define the SVM model
svm = SVC()

# 4️⃣ Define the parameter grid for GridSearchCV
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [1, 0.1, 0.01, 0.001],
    'kernel': ['rbf']   # using RBF kernel for non-linear mapping
}

# 5️⃣ Apply GridSearchCV to find the best parameters
grid = GridSearchCV(svm, param_grid, refit=True, cv=5, verbose=0)
grid.fit(X_train, y_train)

# 6️⃣ Print the best hyperparameters
print("Best Hyperparameters found:")
print(grid.best_params_)

# 7️⃣ Evaluate the model on test data
y_pred = grid.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("\nTest Accuracy:", accuracy)
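

RBF SVMs are sensitive to feature scale, and the Wine features span very different ranges, so a variant worth trying is the same search with standardized features. The sketch below is one possible setup, not part of the original answer: it wraps StandardScaler and SVC in a Pipeline so that scaling is re-fit on each cross-validation training fold. The step names "scaler" and "svm" are arbitrary labels chosen here.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Scaling lives inside the pipeline, so it is re-fit on each CV training fold
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", SVC(kernel="rbf")),
])

# Parameter names are prefixed with the pipeline step name ("svm__")
param_grid = {
    "svm__C": [0.1, 1, 10, 100],
    "svm__gamma": [1, 0.1, 0.01, 0.001],
}

grid = GridSearchCV(pipeline, param_grid, cv=5)
grid.fit(X_train, y_train)

scaled_accuracy = accuracy_score(y_test, grid.predict(X_test))
print("Best hyperparameters:", grid.best_params_)
print("Test accuracy (scaled features):", scaled_accuracy)
```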


9) Write a Python program to:
● Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using
sklearn.datasets.fetch_20newsgroups).
● Print the model's ROC-AUC score for its predictions.


In [3]:
# Import required libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score

# 1️⃣ Load the 20 Newsgroups dataset
categories = ['sci.space', 'rec.sport.baseball']  # using 2 categories for binary classification
newsgroups = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))

X = newsgroups.data
y = newsgroups.target

# 2️⃣ Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3️⃣ Convert text data into TF-IDF feature vectors
vectorizer = TfidfVectorizer(stop_words='english')
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# 4️⃣ Train a Multinomial Naïve Bayes classifier
model = MultinomialNB()
model.fit(X_train_tfidf, y_train)

# 5️⃣ Predict probabilities for the test set
y_probs = model.predict_proba(X_test_tfidf)[:, 1]  # probability of class 1 (newsgroups.target_names[1])

# 6️⃣ Compute the ROC-AUC score
roc_auc = roc_auc_score(y_test, y_probs)
print("ROC-AUC Score:", roc_auc)


ROC-AUC Score: 0.995617387759262


10) Imagine you’re working as a data scientist for a company that handles
email communications.
Your task is to automatically classify emails as Spam or Not Spam. The emails may
contain:
● Text with diverse vocabulary
● Potential class imbalance (far more legitimate emails than spam)
● Some incomplete or missing data
Explain the approach you would take to:
● Preprocess the data (e.g. text vectorization, handling missing data)
● Choose and justify an appropriate model (SVM vs. Naïve Bayes)
● Address class imbalance
● Evaluate the performance of your solution with suitable metrics
And explain the business impact of your solution.


Problem Context

We need to build an automated system to classify emails as Spam or Not Spam — a binary text classification problem.
Challenges include:

-Unstructured text data
-Class imbalance (more “Not Spam” than “Spam”)
-Missing or incomplete content

1) Data Preprocessing
a) Handling Missing Data
Some emails might have missing subject lines or bodies.
Approaches:
-Drop emails with completely missing text.
-Impute missing parts with placeholders (e.g., “no_subject”, “no_body”) to preserve record count.
b) Text Cleaning
-Convert all text to lowercase.
-Remove HTML tags, punctuation, stopwords, and extra spaces.
-Perform tokenization and optionally lemmatization (reducing words to their base form).
c) Text Vectorization
To convert raw text into numerical features, use TF-IDF (Term Frequency–Inverse Document Frequency), which:
-Captures how important a word is to a document relative to all documents.
-Reduces the impact of common words (“the”, “and”, etc.).
You can also try CountVectorizer for simpler models or word embeddings (Word2Vec, BERT) for deep learning approaches.
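
The preprocessing steps above could look roughly like this. The three emails, the dictionary layout, and the helper names clean_text and prepare are hypothetical; tokenization and stopword removal are delegated to TfidfVectorizer, and lemmatization is left out to keep the sketch dependency-free.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical raw emails; None marks a missing subject or body
raw_emails = [
    {"subject": "WIN a FREE prize!!!", "body": "<p>Click <b>here</b> now</p>"},
    {"subject": None, "body": "Please review the attached report."},
    {"subject": "Meeting on Monday", "body": None},
]

def clean_text(text):
    """Lowercase, strip HTML tags and punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)        # remove HTML tags
    text = re.sub(r"[^a-z0-9_\s]", " ", text)   # remove punctuation
    return re.sub(r"\s+", " ", text).strip()

def prepare(email):
    # Impute missing parts with placeholder tokens instead of dropping the row
    subject = email["subject"] if email["subject"] else "no_subject"
    body = email["body"] if email["body"] else "no_body"
    return clean_text(subject + " " + body)

documents = [prepare(e) for e in raw_emails]

# TF-IDF with built-in tokenization and English stopword removal
vectorizer = TfidfVectorizer(stop_words="english")
X_tfidf = vectorizer.fit_transform(documents)

print(documents)
print("TF-IDF matrix shape:", X_tfidf.shape)
```

Keeping the placeholders as single tokens lets the model learn whether a missing subject or body is itself a signal.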

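2) Model Choice: SVM vs. Naïve Bayes

-Multinomial Naïve Bayes trains very quickly, copes well with high-dimensional sparse TF-IDF or count features, and makes a strong baseline for spam filtering.
-A linear SVM often reaches higher accuracy on text because it does not rely on the feature-independence assumption, but it is slower to train on large corpora and needs calibration to output probabilities.
A reasonable approach is to start with Multinomial Naïve Bayes as the baseline and move to a linear SVM only if it shows a clear gain on the validation metrics.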

3) Handling Class Imbalance

Since spam is usually less frequent than legitimate emails:

Techniques:

i) Resampling:
-Oversampling (SMOTE) – synthetically generate spam samples.
-Undersampling – reduce non-spam samples.

ii) Class Weights:
-In SVM or logistic regression, set class_weight='balanced' to give more importance to the minority class.

iii) Threshold Tuning:
-Adjust classification probability threshold (e.g., label as spam if P(spam) > 0.4 instead of 0.5).
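
Class weights and threshold tuning can be sketched together as follows. The data comes from make_classification as a stand-in for vectorized emails (roughly 10% "spam"), which is an assumption; SMOTE is not shown because it requires the separate imbalanced-learn package.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: about 10% of samples are "spam" (class 1)
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Class weights: errors on the minority class cost more during training
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)

# Threshold tuning: lowering the threshold flags more emails as spam
spam_prob = clf.predict_proba(X_test)[:, 1]
recalls = {}
for threshold in (0.5, 0.4):
    y_pred = (spam_prob > threshold).astype(int)
    recalls[threshold] = recall_score(y_test, y_pred)
    print(f"threshold={threshold}: spam recall={recalls[threshold]:.3f}")
```

Lowering the threshold can only keep or raise spam recall, but it usually costs some precision, so the threshold is best chosen on a validation set.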

4) Model Evaluation
-Precision: How many predicted spam emails are actually spam (avoid false alarms).
-Recall: How many actual spam emails were correctly detected (avoid missing spam).
-F1-Score: Balances precision and recall.
-ROC-AUC Score: Measures how well the model separates the two classes.
-Confusion Matrix: Shows the counts of true positives, false positives, true negatives, and false negatives, so you can see which type of error dominates.
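
These metrics can all be computed with sklearn.metrics. The ten labels and scores below are made up for illustration, and the 0.5 threshold is an assumption.

```python
from sklearn.metrics import (confusion_matrix, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Made-up labels and scores for 10 emails (1 = spam, 0 = not spam)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.7, 0.9, 0.8, 0.6, 0.4]  # predicted P(spam)
y_pred = [1 if s > 0.5 else 0 for s in y_score]

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)   # uses scores, not hard predictions
cm = confusion_matrix(y_true, y_pred)  # rows: actual class, columns: predicted class

print("Precision:", precision)
print("Recall:   ", recall)
print("F1-score: ", f1)
print("ROC-AUC:  ", auc)
print("Confusion matrix:\n", cm)
```

In this toy example there is one false positive (a legitimate email scored 0.7) and one false negative (a spam email scored 0.4), so precision and recall are both 3/4.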

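5) Business Impact

Implementing this spam detection system provides clear business value: less spam and phishing reaches users' inboxes, employees spend less time sorting mail by hand, and keeping false positives low ensures that legitimate messages are not lost.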