Question 1: What is a Support Vector Machine (SVM), and how does it work?

--> A Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression tasks. It works by finding an optimal boundary, called a hyperplane, that separates data points into different classes with the largest possible margin.
How SVM works
The core objective of an SVM is to identify the hyperplane that best divides the data points of two classes (problems with more than two classes are handled by combining several binary SVMs). The best hyperplane is the one that achieves the maximum margin, that is, the largest distance between the hyperplane and the nearest data point from each class.
For linearly separable data
Optimal hyperplane: For data that can be separated by a straight line or plane, SVM finds the single hyperplane that provides the widest possible separation between the classes.
Support vectors: The data points closest to this optimal hyperplane are known as "support vectors." These points alone determine the position and orientation of the boundary; removing any of the other data points would leave the final model unchanged.
Maximum margin: By maximizing the distance between the support vectors and the hyperplane, the SVM creates a robust decision boundary that is less sensitive to small changes in the data.
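The three ideas above can be checked on a tiny example. The sketch below is illustrative only: the four data points are hypothetical values chosen so the answer is easy to verify by hand, and a large C is used to approximate a hard margin.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (hypothetical values chosen for illustration)
X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
y = np.array([0, 0, 1, 1])

# A large C approximates a hard margin on separable data
clf = SVC(kernel="linear", C=1000.0)
clf.fit(X, y)

w = clf.coef_[0]        # normal vector of the separating hyperplane
b = clf.intercept_[0]   # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)  # distance between the two margin boundaries

print("Support vectors:\n", clf.support_vectors_)
print("w =", w, "b =", b)
print("Margin width:", margin_width)
```

Only (1, 1) and (3, 3), the closest pair across the two classes, should be reported as support vectors, and the margin width should come out close to the distance between them (about 2.83).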
For non-linearly separable data
When data cannot be separated by a simple linear boundary, SVM uses a technique called the "kernel trick".
Kernel trick: This technique involves applying a kernel function to implicitly map the data into a higher-dimensional feature space, where it can become linearly (or nearly linearly) separable.
Higher-dimensional hyperplane: In this new, higher dimension, SVM can find a linear hyperplane to separate the classes. Common kernel functions for this purpose include the Polynomial Kernel and the Radial Basis Function (RBF) Kernel.
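To see why a kernel matters, the sketch below compares a linear and an RBF kernel on two concentric rings, a shape no straight line can separate. The dataset (scikit-learn's make_circles) and its settings are assumptions made for this illustration.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: not separable by any straight line
X, y = make_circles(n_samples=400, noise=0.05, factor=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# Same data, two different kernels
linear_acc = SVC(kernel="linear").fit(X_train, y_train).score(X_test, y_test)
rbf_acc = SVC(kernel="rbf").fit(X_train, y_train).score(X_test, y_test)

print(f"Linear kernel accuracy: {linear_acc:.2f}")
print(f"RBF kernel accuracy:    {rbf_acc:.2f}")
```

The linear kernel should do little better than guessing on this data, while the RBF kernel should separate the rings almost perfectly.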
Handling noise with soft margins
For real-world data that contains outliers or noise, a strict boundary can lead to overfitting. SVM can be adapted to be more flexible by using a "soft margin".
Allowing misclassifications: A soft margin allows for some data points to be misclassified, or to fall on the wrong side of the margin, by introducing a penalty for these errors.
The C-parameter: A regularization parameter, "C," controls the trade-off between a wider margin and minimizing misclassification errors. A low C allows more misclassifications for a wider margin, while a high C enforces a stricter boundary.
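One way to observe the effect of C is to count support vectors on overlapping data. The sketch below uses synthetic blobs (an assumption for illustration); with a small C the margin is wide and many points fall inside it, so more of them become support vectors.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters, so a perfect separation is not possible
X, y = make_blobs(
    n_samples=200, centers=[[0, 0], [2, 2]], cluster_std=1.0, random_state=0
)

n_support = {}
for C in [0.001, 0.1, 10]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    n_support[C] = int(clf.n_support_.sum())  # total support vectors over both classes
    print(f"C={C}: {n_support[C]} support vectors, "
          f"training accuracy {clf.score(X, y):.2f}")
```

The number of support vectors should shrink as C grows, reflecting a narrower margin that tolerates fewer violations.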
Key takeaways
Powerful and versatile: SVMs are effective for both linear and non-linear classification and regression problems, making them applicable in fields like image classification, bioinformatics, and text categorization.
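Memory efficiency: The decision boundary is defined only by the support vectors, which are typically a subset of the training data, so a trained SVM can be compact to store. On noisy or heavily overlapping data, however, a large share of the training points can end up as support vectors.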
Sensitive to hyperparameters: Performance can be highly dependent on the choice of kernel and regularization parameters, which may require careful tuning.


Question 2: Explain the difference between Hard Margin and Soft Margin SVM

--> Hard Margin SVM
Assumption: Data must be linearly separable, meaning a line (or hyperplane in higher dimensions) can perfectly separate the classes without any errors.
Goal: Maximize the margin between the classes while ensuring every single data point is correctly classified and falls outside the margin.
Sensitivity: Highly sensitive to outliers, as even a single outlier can significantly shift the decision boundary and lead to overfitting, or make the classes non-separable so that no valid solution exists.
Use Case: Best for clean, perfectly separable datasets but generally not practical for real-world scenarios.
Soft Margin SVM
Assumption: The data may not be perfectly linearly separable, or it might contain outliers.
Mechanism: Introduces slack variables to allow for some misclassifications and margin violations.
Goal: It finds a balance between maximizing the margin and minimizing the total amount of margin violation (the sum of the slack variables), with the parameter C setting how heavily violations are penalized.
Flexibility: More flexible and robust to outliers, making it a more practical choice for real-world, noisy datasets.
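The two formulations can be written side by side. The notation below is the standard textbook one and is an assumption here, since the text above does not define symbols: x_i are the training points, y_i in {-1, +1} their labels, w and b the hyperplane parameters, and xi_i the slack variables.

```latex
% Hard margin: no violations allowed
\min_{w,\,b} \ \tfrac{1}{2}\lVert w \rVert^2
\quad \text{s.t.} \quad y_i\,(w^\top x_i + b) \ge 1, \qquad i = 1, \dots, n

% Soft margin: slack variables \xi_i measure how far each point violates the margin
\min_{w,\,b,\,\xi} \ \tfrac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad y_i\,(w^\top x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0
```

Setting every xi_i to zero recovers the hard-margin problem, which is why the soft margin is the more general of the two.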


Question 3: What is the Kernel Trick in SVM? Give one example of a kernel and
explain its use case.

--> The kernel trick is a method used in Support Vector Machines (SVMs) to handle non-linearly separable data by implicitly mapping the data into a higher-dimensional space, where it can be separated by a linear hyperplane, without the computational cost of explicit transformation. A common example is the Radial Basis Function (RBF) kernel, which uses a Gaussian function to measure data point similarity, allowing SVMs to find complex, non-linear decision boundaries suitable for tasks like image classification or detecting localized patterns in data.

What is the Kernel Trick

Addressing Non-Linearity: SVMs are fundamentally linear classifiers. However, the kernel trick allows them to find non-linear boundaries for data that isn't easily separated by a straight line.
Implicit Mapping: Instead of explicitly mapping every data point to a higher-dimensional space, a kernel function computes the dot product in that transformed space directly from the original feature vectors. The transformation is therefore implicit, avoiding the computational expense of working with complex, high-dimensional data.
Computational Efficiency: The trick saves significant computational resources because it avoids the direct, often costly, calculations of coordinates in the new, high-dimensional space.
Non-Linear Decision Boundaries: By operating in a higher-dimensional space, a linear decision boundary (hyperplane) can effectively separate non-linear data, resulting in a non-linear boundary in the original feature space.
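The implicit mapping can be verified numerically for a simple case. The sketch below uses a degree-2 polynomial kernel, whose explicit feature map is small enough to write out, and also evaluates the RBF kernel formula. The function names and the gamma value are hypothetical choices for this illustration.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector (v1, v2)."""
    v1, v2 = v
    return np.array([v1 ** 2, np.sqrt(2) * v1 * v2, v2 ** 2])

def poly_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: K(x, z) = (x . z)^2."""
    return np.dot(x, z) ** 2

def rbf_kernel_value(x, z, gamma=0.5):
    """RBF (Gaussian) kernel: K(x, z) = exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

explicit = np.dot(phi(x), phi(z))  # dot product computed in the 3-D feature space
implicit = poly_kernel(x, z)       # same value computed in the original 2-D space

print("Explicit mapping:", explicit)
print("Kernel shortcut: ", implicit)
print("RBF similarity:  ", rbf_kernel_value(x, z))
```

Both routes give the same number, but the kernel never constructs the higher-dimensional vectors. For the RBF kernel the corresponding feature space is infinite-dimensional, so the shortcut is the only practical route.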


Question 4: What is a Naïve Bayes Classifier, and why is it called “naïve”?


--> A Naïve Bayes classifier is a probabilistic supervised machine learning algorithm that uses Bayes' Theorem to classify data, but it's called "naïve" because it makes the unrealistic assumption that all predictor features are independent of one another, given the class label. Despite this strong and often false assumption, the algorithm performs well in practice, especially for text classification tasks like spam detection, due to its simplicity and efficiency with large datasets.
What it is:
A Classification Algorithm: It's a type of supervised machine learning model used to assign data points to predefined categories or classes.
Based on Bayes' Theorem: The algorithm's core logic is derived from Bayes' Theorem, a fundamental principle in probability that allows for updating the probability of an event based on new information.
Probabilistic: It works by calculating the probability of an item belonging to a particular class and then choosing the class with the highest probability.
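The calculation can be done by hand for a small case. In the sketch below every probability is a hypothetical number chosen only for illustration; the score for each class is its prior multiplied by the per-word likelihoods, and the scores are then normalized to sum to one.

```python
# Hypothetical probabilities, chosen only for illustration
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.5, "meeting": 0.1},  # P(word | spam)
    "ham": {"free": 0.1, "meeting": 0.4},   # P(word | ham)
}

def naive_bayes_posterior(words):
    """Return P(class | words), multiplying per-word likelihoods independently."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for word in words:
            score *= likelihoods[label][word]
        scores[label] = score
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

posterior = naive_bayes_posterior(["free", "meeting"])
print(posterior)
print("Predicted class:", max(posterior, key=posterior.get))
```

For an email containing both "free" and "meeting", the unnormalized scores are 0.4 x 0.5 x 0.1 = 0.020 for spam and 0.6 x 0.1 x 0.4 = 0.024 for ham, so the email is classified as ham.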
Why it's called "naïve":
The Assumption of Independence: The "naïve" aspect refers to the algorithm's core assumption that each feature used to predict the outcome is conditionally independent of all other features, given the class label.
An Unrealistic Simplification: In real-world scenarios, features are often not truly independent; there are usually dependencies and correlations between them.
"Naïve" Despite Its Success: The term highlights the oversimplified nature of this assumption, which is a significant deviation from reality. However, this simplifying assumption lets the algorithm estimate each feature's distribution separately, which keeps it computationally efficient, and it still performs remarkably well in practice, particularly in text classification tasks such as spam detection.


Question 5: Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants.
When would you use each one?


--> Gaussian Naïve Bayes is for continuous data that is roughly normally distributed within each class, such as height or weight. Multinomial Naïve Bayes is for discrete count data, most often used for text classification with term frequencies. Bernoulli Naïve Bayes is for binary (presence/absence) features, suitable for tasks where you only care whether a feature is present, such as checking if a word occurs in a document.
Gaussian Naïve Bayes
Description: This variant assumes that your continuous features follow a normal (Gaussian) distribution.
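When to use it: It is ideal for continuous features that are roughly bell-shaped within each class, such as a person's height or weight. Strongly skewed features (income is a common example) may fit better after a transformation such as a logarithm. Sparse TF-IDF text vectors are usually better served by Multinomial Naïve Bayes.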
Multinomial Naïve Bayes
Description: This variant is used when features are discrete counts, meaning they represent how many times something occurred.
When to use it: It is best when each feature is a non-negative count, such as word frequencies in text classification tasks like spam detection or document classification. In practice it is also commonly applied to fractional weights such as TF-IDF.
Bernoulli Naïve Bayes
Description: This variant deals with binary, boolean, or presence/absence features.
When to use it: It is best for binary data where features represent the absence or presence of an item, like whether a specific word appears in a document, rather than its frequency.
Choosing the Right Variant
The most important factor in selecting a Naïve Bayes variant is the nature of your data.
If your data is continuous and normally distributed, Gaussian is a good choice.
If you have discrete counts, like word counts in text, use Multinomial.
If your data features are binary (yes/no, present/absent), use Bernoulli.
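The rules above map directly onto three scikit-learn classes. The sketch below is a minimal illustration: the four-document corpus and its labels are hypothetical, and the Iris data stands in for "continuous features".

```python
from sklearn.datasets import load_iris
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Gaussian: continuous measurements (Iris petal and sepal sizes)
X_iris, y_iris = load_iris(return_X_y=True)
gnb_acc = GaussianNB().fit(X_iris, y_iris).score(X_iris, y_iris)
print(f"GaussianNB training accuracy on Iris: {gnb_acc:.2f}")

# Hypothetical toy corpus: 1 = spam, 0 = not spam
docs = [
    "free prize money",
    "win free money now",
    "meeting agenda attached",
    "project meeting tomorrow",
]
labels = [1, 1, 0, 0]
new_doc = ["free money"]

# Multinomial: features are word counts
count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(docs)
mnb_pred = MultinomialNB().fit(X_counts, labels).predict(count_vec.transform(new_doc))
print("MultinomialNB prediction:", mnb_pred[0])

# Bernoulli: features are word presence/absence (binary=True discards the counts)
binary_vec = CountVectorizer(binary=True)
X_binary = binary_vec.fit_transform(docs)
bnb_pred = BernoulliNB().fit(X_binary, labels).predict(binary_vec.transform(new_doc))
print("BernoulliNB prediction:", bnb_pred[0])
```

Note that Bernoulli Naïve Bayes also penalizes the absence of words that are typical for a class, whereas Multinomial Naïve Bayes only looks at the words that do occur.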




● You can use any suitable datasets like Iris, Breast Cancer, or Wine from
sklearn.datasets or a CSV file you have.
Question 6: Write a Python program to:
● Load the Iris dataset
● Train an SVM Classifier with a linear kernel
● Print the model's accuracy and support vectors.

In [1]:
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train an SVM Classifier with a linear kernel
svm_classifier = SVC(kernel='linear')
svm_classifier.fit(X_train, y_train)

# Predict on the test set
y_pred = svm_classifier.predict(X_test)

# Calculate and print the model's accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

# Print the support vectors
print("Support Vectors:")
print(svm_classifier.support_vectors_)

Model Accuracy: 1.00
Support Vectors:
[[4.8 3.4 1.9 0.2]
 [5.1 3.3 1.7 0.5]
 [4.5 2.3 1.3 0.3]
 [5.6 3.  4.5 1.5]
 [5.4 3.  4.5 1.5]
 [6.7 3.  5.  1.7]
 [5.9 3.2 4.8 1.8]
 [5.1 2.5 3.  1.1]
 [6.  2.7 5.1 1.6]
 [6.3 2.5 4.9 1.5]
 [6.1 2.9 4.7 1.4]
 [6.5 2.8 4.6 1.5]
 [6.9 3.1 4.9 1.5]
 [6.3 2.3 4.4 1.3]
 [6.3 2.8 5.1 1.5]
 [6.3 2.7 4.9 1.8]
 [6.  3.  4.8 1.8]
 [6.  2.2 5.  1.5]
 [6.2 2.8 4.8 1.8]
 [6.5 3.  5.2 2. ]
 [7.2 3.  5.8 1.6]
 [5.6 2.8 4.9 2. ]
 [5.9 3.  5.1 1.8]
 [4.9 2.5 4.5 1.7]]


Question 7: Write a Python program to:
● Load the Breast Cancer dataset
● Train a Gaussian Naïve Bayes model
● Print its classification report including precision, recall, and F1-score

In [2]:
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Load the Breast Cancer dataset
breast_cancer = datasets.load_breast_cancer()
X = breast_cancer.data
y = breast_cancer.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a Gaussian Naïve Bayes model
gnb_model = GaussianNB()
gnb_model.fit(X_train, y_train)

# Predict on the test set
y_pred = gnb_model.predict(X_test)

# Print the classification report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=breast_cancer.target_names))

Classification Report:
              precision    recall  f1-score   support

   malignant       0.93      0.90      0.92        63
      benign       0.95      0.96      0.95       108

    accuracy                           0.94       171
   macro avg       0.94      0.93      0.94       171
weighted avg       0.94      0.94      0.94       171



Question 8: Write a Python program to:
● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best
C and gamma.
● Print the best hyperparameters and accuracy.


In [3]:
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load the Wine dataset
wine = datasets.load_wine()
X = wine.data
y = wine.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the parameter grid for GridSearchCV
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001], 'kernel': ['rbf']}

# Create a GridSearchCV object
grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=2)

# Fit the GridSearchCV object to the training data
grid.fit(X_train, y_train)

# Print the best parameters found by GridSearchCV
print("Best Hyperparameters:")
print(grid.best_params_)

# Predict on the test set using the best model
y_pred = grid.predict(X_test)

# Calculate and print the accuracy of the best model
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy with Best Hyperparameters: {accuracy:.2f}")

Fitting 5 folds for each of 16 candidates, totalling 80 fits
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   0.0s
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   0.0s
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   0.0s
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   0.0s
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   0.0s
[CV] END .......................C=0.1, gamma=0.1, kernel=rbf; total time=   0.0s
[CV] END .......................C=0.1, gamma=0.1, kernel=rbf; total time=   0.0s
[CV] END .......................C=0.1, gamma=0.1, kernel=rbf; total time=   0.0s
[CV] END .......................C=0.1, gamma=0.1, kernel=rbf; total time=   0.0s
[CV] END .......................C=0.1, gamma=0.1, kernel=rbf; total time=   0.0s
[CV] END ......................C=0.1, gamma=0.01, kernel=rbf; total time=   0.0s
[CV] END ......................C=0.1, gamma=0.01
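
The grid search above runs on unscaled features. The Wine features span very different ranges (proline is in the hundreds or thousands while hue is around 1), and the RBF kernel is distance-based, so the largest-valued features can dominate. A possible variant, sketched here as an assumption rather than as part of the original answer, places a StandardScaler in a Pipeline so that scaling is re-fitted inside every cross-validation fold. The step names "scaler" and "svc" are arbitrary.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Scaling lives inside the pipeline, so each CV fold is scaled using only its own training part
pipeline = Pipeline([("scaler", StandardScaler()), ("svc", SVC(kernel="rbf"))])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [1, 0.1, 0.01, 0.001],
}

grid = GridSearchCV(pipeline, param_grid, cv=5)
grid.fit(X_train, y_train)

scaled_accuracy = grid.score(X_test, y_test)
print("Best hyperparameters:", grid.best_params_)
print(f"Test accuracy with scaling: {scaled_accuracy:.2f}")
```

Leaving verbose at its default also keeps the per-fold log lines out of the output, so only the best hyperparameters and the accuracy are printed.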

Question 9: Write a Python program to:
● Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using
sklearn.datasets.fetch_20newsgroups).
● Print the model's ROC-AUC score for its predictions.


In [4]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score
import numpy as np

# Load a subset of the 20 Newsgroups dataset
# Using a smaller subset for demonstration purposes
categories = ['alt.atheism', 'soc.religion.christian']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)

X_train, y_train = newsgroups_train.data, newsgroups_train.target
X_test, y_test = newsgroups_test.data, newsgroups_test.target

# Convert text data to TF-IDF features
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train a Multinomial Naïve Bayes model
nb_model = MultinomialNB()
nb_model.fit(X_train_tfidf, y_train)

# Predict probabilities on the test set for ROC-AUC score
y_pred_proba = nb_model.predict_proba(X_test_tfidf)[:, 1]

# Calculate and print the ROC-AUC score
# Check if there are at least two classes present in the test set for ROC AUC calculation
if len(np.unique(y_test)) > 1:
    roc_auc = roc_auc_score(y_test, y_pred_proba)
    print(f"ROC-AUC Score: {roc_auc:.2f}")
else:
    print("Cannot calculate ROC-AUC score: Only one class present in the test set.")

ROC-AUC Score: 0.98
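
ROC-AUC depends only on how the predicted scores rank the examples, which is why the code above passes probabilities rather than hard labels. As a small illustration with hypothetical labels and scores, the value can be reproduced by counting the fraction of (positive, negative) pairs in which the positive example receives the higher score.

```python
from itertools import product

from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted scores, chosen only for illustration
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

pos = [s for s, t in zip(y_score, y_true) if t == 1]
neg = [s for s, t in zip(y_score, y_true) if t == 0]

# Fraction of (positive, negative) pairs ranked correctly (no tied scores in this example)
pairwise_auc = sum(p > n for p, n in product(pos, neg)) / (len(pos) * len(neg))
library_auc = roc_auc_score(y_true, y_score)

print("Pairwise AUC:", pairwise_auc)
print("sklearn AUC: ", library_auc)
```

Three of the four pairs are ordered correctly (only 0.35 versus 0.4 is not), so both calculations give 0.75.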


Question 10: Imagine you’re working as a data scientist for a company that handles
email communications.
Your task is to automatically classify emails as Spam or Not Spam. The emails may
contain:
● Text with diverse vocabulary
● Potential class imbalance (far more legitimate emails than spam)
● Some incomplete or missing data
Explain the approach you would take to:
● Preprocess the data (e.g. text vectorization, handling missing data)
● Choose and justify an appropriate model (SVM vs. Naïve Bayes)
● Address class imbalance
● Evaluate the performance of your solution with suitable metrics
And explain the business impact of your solution.
(Include your Python code and output in the code box below.)

In [5]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import classification_report, roc_auc_score
from imblearn.over_sampling import SMOTE
from collections import Counter
import numpy as np

# Load a two-category subset of 20 Newsgroups as a stand-in for spam / not-spam emails
# Note: these two categories are nearly balanced (see the class counts printed below),
# so SMOTE changes little here; real spam data is usually far more skewed
categories = ['comp.graphics', 'comp.windows.x']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)

X_train, y_train = newsgroups_train.data, newsgroups_train.target
X_test, y_test = newsgroups_test.data, newsgroups_test.target

# Convert text data to TF-IDF features
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000) # Limit features for simplicity
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Check class distribution before handling imbalance
print("Class distribution before oversampling:", Counter(y_train))

# Address class imbalance using SMOTE (Oversampling)
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_tfidf, y_train)

print("Class distribution after oversampling:", Counter(y_train_resampled))

# Train a Multinomial Naïve Bayes model on resampled data
print("\n--- Multinomial Naïve Bayes ---")
nb_model = MultinomialNB()
nb_model.fit(X_train_resampled, y_train_resampled)

# Predict on the test set
y_pred_nb = nb_model.predict(X_test_tfidf)
y_pred_proba_nb = nb_model.predict_proba(X_test_tfidf)[:, 1]

# Evaluate Naïve Bayes model
print("Classification Report (Naïve Bayes):")
print(classification_report(y_test, y_pred_nb, target_names=newsgroups_test.target_names))

# Calculate ROC-AUC for Naïve Bayes
if len(np.unique(y_test)) > 1:
    roc_auc_nb = roc_auc_score(y_test, y_pred_proba_nb)
    print(f"ROC-AUC Score (Naïve Bayes): {roc_auc_nb:.2f}")
else:
    print("Cannot calculate ROC-AUC for Naïve Bayes: Only one class present in the test set.")


# Train an SVM Classifier with RBF kernel on resampled data
print("\n--- Support Vector Machine (SVM) ---")
svm_model = SVC(kernel='rbf', probability=True, random_state=42) # probability=True to get predict_proba
svm_model.fit(X_train_resampled, y_train_resampled)

# Predict on the test set
y_pred_svm = svm_model.predict(X_test_tfidf)
y_pred_proba_svm = svm_model.predict_proba(X_test_tfidf)[:, 1]

# Evaluate SVM model
print("Classification Report (SVM):")
print(classification_report(y_test, y_pred_svm, target_names=newsgroups_test.target_names))

# Calculate ROC-AUC for SVM
if len(np.unique(y_test)) > 1:
    roc_auc_svm = roc_auc_score(y_test, y_pred_proba_svm)
    print(f"ROC-AUC Score (SVM): {roc_auc_svm:.2f}")
else:
    print("Cannot calculate ROC-AUC for SVM: Only one class present in the test set.")

Class distribution before oversampling: Counter({np.int64(1): 593, np.int64(0): 584})
Class distribution after oversampling: Counter({np.int64(0): 593, np.int64(1): 593})

--- Multinomial Naïve Bayes ---
Classification Report (Naïve Bayes):
                precision    recall  f1-score   support

 comp.graphics       0.84      0.94      0.89       389
comp.windows.x       0.93      0.83      0.88       395

      accuracy                           0.88       784
     macro avg       0.89      0.88      0.88       784
  weighted avg       0.89      0.88      0.88       784

ROC-AUC Score (Naïve Bayes): 0.94

--- Support Vector Machine (SVM) ---
Classification Report (SVM):
                precision    recall  f1-score   support

 comp.graphics       0.83      0.93      0.88       389
comp.windows.x       0.92      0.81      0.86       395

      accuracy                           0.87       784
     macro avg       0.87      0.87      0.87       784
  weighted avg       0.87      0.87  