## 1) What is a Support Vector Machine (SVM), and how does it work?

# Support Vector Machine (SVM)

A **Support Vector Machine (SVM)** is a supervised machine learning algorithm primarily used for classification, though it can also be adapted for regression. The goal of SVM is to find the optimal decision boundary (or **hyperplane**) that best separates the different classes of data points.

## How SVM works:

### 1. Linear SVM:
- In its simplest form, SVM works by finding a **hyperplane** that separates the data into different classes with the largest possible margin.
- This hyperplane is a line (in 2D), a plane (in 3D), or a hyperplane in higher-dimensional spaces. The objective is to place this boundary such that the distance between the closest data points of each class (these points are called **support vectors**) and the hyperplane is maximized.

### 2. Maximizing the Margin:
- The key idea is that a wider margin (the distance between the hyperplane and the closest data points from both classes) tends to generalize better to unseen data, because a small perturbation of a point is less likely to push it across the boundary. SVM therefore chooses the hyperplane that maximizes this margin.
- These closest points (support vectors) are the critical data points that influence the position and orientation of the hyperplane.

### 3. Handling Non-linear Data:
- In real-world problems, data may not always be linearly separable (i.e., you can't draw a straight line or hyperplane that separates the classes perfectly).
- To deal with this, SVM uses a technique called the **kernel trick**. Instead of working directly in the input space, SVM maps the data into a higher-dimensional space where it becomes easier to find a linear separating hyperplane. Common kernel functions include:
    - **Linear kernel** (no transformation)
    - **Polynomial kernel**
    - **Radial Basis Function (RBF) kernel**
    - **Sigmoid kernel**

### 4. Soft Margin (for non-linearly separable data):
- In practice, it's rare for data to be perfectly separable. To handle this, SVM introduces a concept called the **soft margin**. This allows some misclassifications, but penalizes them based on how far away they are from the decision boundary. The **C parameter** controls the trade-off between maximizing the margin and minimizing classification errors. A smaller C value allows for more misclassifications, whereas a larger C value forces the classifier to minimize misclassifications more aggressively.
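The trade-off controlled by C can be sketched as follows. This is a minimal illustration, assuming two overlapping synthetic clusters from `make_blobs`; the dataset, the two C values, and the variable names are illustrative choices rather than anything prescribed above.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping 2D clusters, so no hyperplane separates them perfectly
X, y = make_blobs(n_samples=200, centers=[[0, 0], [2, 2]],
                  cluster_std=1.2, random_state=0)

margins = {}
for C in (0.01, 100):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # For a linear SVM the margin width is 2 / ||w||
    margins[C] = 2 / np.linalg.norm(clf.coef_)
    print(f"C={C}: margin width={margins[C]:.3f}, "
          f"support vectors={clf.n_support_.sum()}")
```

The small C should report the wider margin, since it tolerates more points inside the margin in exchange for a smaller weight vector.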

## Key Concepts:

- **Support Vectors:** These are the data points that are closest to the hyperplane and have the most influence on its position. The model uses only these points to define the decision boundary.
  
- **Margin:** The distance between the hyperplane and the closest support vectors. SVM aims to maximize this margin to ensure better generalization.

- **Kernel Trick:** A mathematical technique that allows SVM to work in higher-dimensional spaces to separate data that is not linearly separable.

## Example (2D visualization):
Imagine you have a dataset with two classes of data points plotted on a 2D graph (e.g., circles and squares). The SVM algorithm tries to draw a straight line (hyperplane) that separates the circles from the squares. The goal is to position this line in such a way that the distance between the line and the nearest circle and square is as large as possible.

If the data is not linearly separable, SVM might map the data into a higher dimension (for example, turning the 2D data into a 3D space) so that the classes become separable by a plane.
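The separable 2D case can be tried on a tiny hand-made dataset. The sketch below assumes six hypothetical points (three "circles", three "squares") and uses a large C to approximate a hard margin; for these particular points the widest possible margin works out analytically to 5/√2 ≈ 3.54.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: three "circles" (label 0) and three "squares" (label 1)
X = np.array([[0, 0], [1, 0], [0, 1],
              [3, 3], [4, 3], [3, 4]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# A large C approximates a hard margin on separable data
clf = SVC(kernel="linear", C=1000).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
margin_width = 2 / np.linalg.norm(w)   # distance between the two margin lines

print("w =", w, "b =", b)
print("support vectors:\n", clf.support_vectors_)
print("margin width:", margin_width)
```

Only the points nearest the boundary should appear as support vectors; moving any of the other points (without crossing the margin) would leave the line unchanged.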

## Advantages of SVM:
- **Effective in high-dimensional spaces**: SVM performs well in scenarios with many features (high-dimensional data).
- **Robust to overfitting**: Especially with the right choice of the regularization parameter (C).
- **Versatile with kernels**: Can handle complex, non-linear relationships.

## Disadvantages:
- **Computationally expensive**: Training involves solving a quadratic optimization problem, and for kernel SVMs the training time grows at least quadratically with the number of samples, so very large datasets can become impractical.
- **Sensitive to parameter tuning**: The choice of kernel, regularization parameter (C), and kernel parameters can have a significant impact on the performance of the model.

## Conclusion:
In short, an SVM works by finding the best boundary to separate classes while maximizing the margin, and it uses kernels to handle non-linear separations. It is a powerful tool for classification tasks, especially when dealing with high-dimensional data.


## 2) Explain the difference between Hard Margin and Soft Margin SVM.

# Hard Margin vs Soft Margin SVM

The difference between **Hard Margin** and **Soft Margin** SVM lies in how they handle data that is not perfectly separable. Both methods aim to find an optimal hyperplane that separates data into classes, but they handle situations where data points might not align perfectly with the ideal decision boundary in different ways.

## 1. Hard Margin SVM

In **Hard Margin SVM**, the algorithm assumes that the data is **perfectly linearly separable**. In other words, it assumes that there exists a hyperplane that can completely separate the two classes with no errors. The goal is to find this hyperplane that maximizes the margin (the distance between the closest data points of each class).

### Key Points of Hard Margin SVM:
- **No Misclassification:** There is no allowance for misclassified data points. Every data point must be on the correct side of the hyperplane.
- **Strict Linear Separation:** It works only if the classes are perfectly separable, meaning there is no overlap between them.
- **Optimization Objective:** Maximize the margin between the support vectors while ensuring that no points fall on the wrong side of the hyperplane.

### When to use:
- When the data is known to be linearly separable without noise.

### Example:
Imagine two classes of data (e.g., circles and squares) that are perfectly separable with a straight line. A Hard Margin SVM would find the line that maximizes the gap between the two classes without allowing for any overlap.

#### Advantages:
- Provides an ideal hyperplane if the data is perfectly separable.

#### Disadvantages:
- **Not Robust to Outliers:** If the data is noisy or has outliers, the algorithm might overfit the model.
- **Doesn't work with non-linearly separable data.**

---

## 2. Soft Margin SVM

**Soft Margin SVM** is a more general and practical approach, especially useful when the data is **not perfectly linearly separable**. Instead of insisting on a perfect separation, Soft Margin SVM allows for some **misclassifications** (points that fall on the wrong side of the hyperplane) to create a more flexible model. It introduces a **penalty** for misclassifying points, and the optimization tries to balance between **maximizing the margin** and **minimizing misclassification errors**.

### Key Points of Soft Margin SVM:
- **Misclassifications Allowed:** The algorithm allows for some points to fall on the wrong side of the margin, but each misclassification is penalized.
- **Regularization Parameter (C):** The **C parameter** controls the trade-off between maximizing the margin and minimizing misclassifications. A small value of C allows for more misclassifications, while a large value of C forces fewer misclassifications at the cost of a smaller margin.
- **Optimization Objective:** The goal is to maximize the margin but with a regularization term that penalizes misclassifications, ensuring the model doesn't overfit.

### When to use:
- When data is noisy or contains outliers.
- When the data is not perfectly separable or contains overlapping classes.

### Example:
Imagine the same circles and squares dataset, but now some data points overlap or are incorrectly labeled. A Soft Margin SVM would allow for a few misclassifications, but still try to find a hyperplane that works well for the majority of the data.

#### Advantages:
- **Robust to Outliers:** It can handle noisy datasets or cases where classes are not linearly separable.
- **Flexibility:** Allows for a more flexible decision boundary.

#### Disadvantages:
- **Requires tuning of C:** Choosing the right value of C is important. Too small a C will allow too many misclassifications, and too large a C will cause the model to overfit.

---

## Visual Comparison:

### Hard Margin SVM:
- The decision boundary is a straight line (hyperplane) that separates the classes perfectly.
- All data points must be on the correct side of the line.

### Soft Margin SVM:
- The decision boundary may still be a straight line, but some data points will be misclassified (on the wrong side of the line).
- The margin is optimized in such a way that the majority of the points are correctly classified, but a small number of points may fall within the margin or even on the wrong side.

---

## Mathematical Difference:

### Hard Margin SVM:
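Maximizes the margin $2 / \|w\|$, which is equivalent to minimizing $\frac{1}{2} \|w\|^2$, subject to the constraint that all data points are correctly classified and lie on or outside the margin: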

$$
y_i (w \cdot x_i + b) \geq 1 \quad \forall i
$$

Where $y_i$ is the class label, $x_i$ is the data point, $w$ is the weight vector, and $b$ is the bias term.

### Soft Margin SVM:
Introduces a slack variable ($\xi_i \geq 0$) for each point to allow for margin violations and misclassifications:

$$
y_i (w \cdot x_i + b) \geq 1 - \xi_i \quad \forall i
$$

And the total cost (penalty) for misclassification is minimized:

$$
\text{Minimize} \quad \frac{1}{2} \|w\|^2 + C \sum_i \xi_i
$$

Where $C$ is the regularization parameter controlling the trade-off between margin size and misclassification penalties.
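To make the slack variables concrete, the sketch below fits a linear soft-margin SVM and recovers each $\xi_i$ from the fitted $w$ and $b$. It assumes synthetic data from `make_blobs` and labels recoded to -1/+1 to match the formulas; the dataset and variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y01 = make_blobs(n_samples=100, centers=[[0, 0], [2, 2]],
                    cluster_std=1.0, random_state=1)
y = 2 * y01 - 1          # relabel classes as -1 / +1, as in the formulas

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

# Slack: xi_i = max(0, 1 - y_i (w . x_i + b))
functional_margin = y * (X @ w + b)
xi = np.maximum(0.0, 1.0 - functional_margin)

# Value of the soft-margin objective at the fitted solution
objective = 0.5 * w @ w + C * xi.sum()
print("points inside the margin or misclassified:", int((xi > 0).sum()))
print("misclassified points (xi > 1):", int((xi > 1).sum()))
print("primal objective value:", round(float(objective), 3))
```

A point with $0 < \xi_i \leq 1$ sits inside the margin but on the correct side, while $\xi_i > 1$ means it is on the wrong side of the hyperplane.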

---

## Summary of Differences:

| **Feature**            | **Hard Margin SVM**                     | **Soft Margin SVM**                        |
| ---------------------- | ------------------------------------- | ---------------------------------------- |
| **Data Separability**  | Assumes perfect linear separability | Allows for some misclassifications       |
| **Outliers**           | Not robust to outliers              | Robust to outliers                       |
| **Penalty for Errors** | No allowance for errors             | Misclassification errors are penalized   |
| **Use Case**           | Ideal for perfectly separable data  | Used when data is noisy or overlapping   |
| **Regularization**     | Not applicable                      | Controlled by the parameter $C$          |

---

### In short:
- **Hard Margin SVM** is strict, requiring perfect separation, which can fail with noisy data.
- **Soft Margin SVM** is more flexible and robust, allowing errors for the sake of better generalization.


## 3) What is the Kernel Trick in SVM? Give one example of a kernel and explain its use case.

# The Kernel Trick in SVM

The **Kernel Trick** is a powerful technique used in Support Vector Machines (SVMs) that allows for **non-linear separation** of data by implicitly mapping the original input space into a higher-dimensional feature space without actually computing the mapping.

### Why is it needed?

In some cases, the data is **not linearly separable** in its original space, meaning that no straight hyperplane can separate the two classes. The Kernel Trick allows us to apply SVM to non-linearly separable data by mapping the data into a higher-dimensional space where the classes become separable with a linear hyperplane.

### How does it work?

The idea behind the **kernel trick** is to avoid computing the transformation explicitly. Instead of mapping the data to a higher-dimensional space and then calculating the dot product in that space, we use a **kernel function** that computes the inner product in the higher-dimensional space directly, using the original data.

Without the Kernel Trick, we would have to explicitly transform every data point into the higher-dimensional space and then compute **dot products** there, which is computationally expensive (or outright impossible) when that space is very high-dimensional or infinite-dimensional.

With the Kernel Trick, we replace the dot product with a **kernel function** that computes the inner product in the higher-dimensional space without having to explicitly transform the data.

### Mathematical Representation:

The kernel trick can be stated mathematically as follows.

Let the original feature vectors be $\mathbf{x}$ and $\mathbf{x'}$.

The kernel function $K(\mathbf{x}, \mathbf{x'})$ is equivalent to the inner product of the data points $\phi(\mathbf{x})$ and $\phi(\mathbf{x'})$ in the transformed feature space:

$$
K(\mathbf{x}, \mathbf{x'}) = \langle \phi(\mathbf{x}), \phi(\mathbf{x'}) \rangle
$$

Where $\phi(\mathbf{x})$ is a feature map that transforms the original data points into a higher-dimensional space.
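The identity above can be checked numerically on a kernel whose feature map is small enough to write out. The sketch below assumes the degree-2 polynomial kernel $K(\mathbf{x}, \mathbf{x'}) = (\mathbf{x} \cdot \mathbf{x'})^2$ on 2D inputs, which is a different kernel from the RBF example that follows; the function names `phi` and `poly_kernel` are illustrative.

```python
import numpy as np

def phi(v):
    """Explicit feature map for the kernel K(x, x') = (x . x')^2 in 2D."""
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def poly_kernel(a, b):
    """Same inner product, computed directly in the original 2D space."""
    return np.dot(a, b) ** 2

x = np.array([1.0, 2.0])
x_prime = np.array([3.0, -1.0])

explicit = np.dot(phi(x), phi(x_prime))   # map first, then take the dot product
implicit = poly_kernel(x, x_prime)        # kernel trick: no mapping needed

print(explicit, implicit)
```

Both values agree (up to floating-point rounding), even though `poly_kernel` never builds the 3D vectors. For kernels such as RBF the explicit map cannot be written out at all, which is why the implicit route matters.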

---

## Example of a Kernel: **Gaussian (RBF) Kernel**

One of the most commonly used kernels in SVMs is the **Radial Basis Function (RBF) kernel**, also known as the **Gaussian kernel**. It corresponds to an implicit mapping into an infinite-dimensional feature space, allowing SVM to classify data that is not linearly separable in its original space.

### Formula for the RBF Kernel:

$$
K(\mathbf{x}, \mathbf{x'}) = \exp\left( - \frac{\|\mathbf{x} - \mathbf{x'}\|^2}{2\sigma^2} \right)
$$

Where:
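- $\|\mathbf{x} - \mathbf{x'}\|^2$ is the squared Euclidean distance between the data points.
- $\sigma$ is a parameter that controls the width of the Gaussian function. scikit-learn expresses the same kernel through `gamma`, where $\gamma = \frac{1}{2\sigma^2}$, so a larger `gamma` means a narrower Gaussian.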
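As a quick check of the formula, the sketch below evaluates the RBF kernel by hand and compares it with scikit-learn's `rbf_kernel`, which takes `gamma` rather than $\sigma$. The two sample points and the value of `sigma` are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf(x, x_prime, sigma):
    """RBF kernel written exactly as in the formula above."""
    return np.exp(-np.sum((x - x_prime) ** 2) / (2 * sigma ** 2))

x = np.array([1.0, 2.0])
x_prime = np.array([2.0, 4.0])
sigma = 1.5

manual = rbf(x, x_prime, sigma)

# scikit-learn uses exp(-gamma * ||x - x'||^2), so gamma = 1 / (2 * sigma^2)
gamma = 1 / (2 * sigma ** 2)
library = rbf_kernel(x.reshape(1, -1), x_prime.reshape(1, -1), gamma=gamma)[0, 0]

print(manual, library)
```

The kernel value is 1 for identical points and decays toward 0 as the points move apart, so it behaves like a similarity score.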

---

### Use Case of the RBF Kernel:

The RBF kernel is widely used in problems where the decision boundary is highly non-linear. Some typical use cases include:

- **Image Classification**: In problems like facial recognition or object detection, where the data points (such as pixel values) are complex and the decision boundaries are not linear.
- **Handwriting Recognition**: In recognition systems that need to classify different handwriting styles, which may not be linearly separable.
- **Bioinformatics (Gene Expression Data)**: In problems involving gene expression profiles, where data points are high-dimensional and non-linearly separable.

---

## SVM with RBF Kernel Example in Python

Let's implement an SVM using the RBF kernel on a toy dataset.

```python
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load a toy dataset (e.g., a binary classification problem)
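# n_redundant=0 is required: the default (2) plus n_informative=2 would exceed n_features=2
X, y = datasets.make_classification(n_samples=100, n_features=2, n_informative=2, n_redundant=0, random_state=42)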

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train an SVM with RBF Kernel
svm_rbf = SVC(kernel='rbf', gamma='scale')  # 'scale' is a common choice for gamma
svm_rbf.fit(X_train, y_train)

# Make predictions
y_pred = svm_rbf.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"RBF Kernel SVM Accuracy: {accuracy:.2f}")

# Plot decision boundary
xx, yy = np.meshgrid(np.linspace(X[:, 0].min(), X[:, 0].max(), 100),
                     np.linspace(X[:, 1].min(), X[:, 1].max(), 100))
Z = svm_rbf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.75, cmap='coolwarm')
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k', marker='o', s=100, cmap='coolwarm')
plt.title("SVM with RBF Kernel Decision Boundary")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()
```


## 4) What is a Naïve Bayes Classifier, and why is it called “naïve”?

# Naïve Bayes Classifier

A **Naïve Bayes classifier** is a probabilistic machine learning model based on **Bayes' Theorem**, which describes the probability of a class given some feature values. It is called "naïve" because it makes a simplifying assumption that the features (or variables) are **independent** of each other, which is often not true in real-world data. Despite this, Naïve Bayes classifiers often perform surprisingly well, especially in text classification tasks like **spam detection** or **sentiment analysis**.

## Bayes' Theorem:

Bayes' Theorem is a principle from probability theory that relates conditional probabilities. It states:

$$P(C \mid X) = \frac{P(X \mid C) \cdot P(C)}{P(X)}$$

Where:
- $P(C \mid X)$ is the **posterior probability** of the class $C$ given the feature set $X$.
- $P(X \mid C)$ is the **likelihood**, the probability of observing the features $X$ given the class $C$.
- $P(C)$ is the **prior probability** of the class $C$.
- $P(X)$ is the **evidence**, the total probability of the features $X$ across all classes.

### Naïve Bayes in Practice:

In Naïve Bayes, the classifier uses Bayes' Theorem to compute the **posterior probability** for each class, given the features, and assigns the class with the highest posterior probability as the predicted class:

$$P(C \mid X) \propto P(C) \cdot \prod_{i=1}^{n} P(x_i \mid C)$$

Where:
- $x_i$ represents individual features (attributes) from the feature set $X$.
- $P(C)$ is the prior probability of the class.
- $P(x_i \mid C)$ is the likelihood of feature $x_i$ given class $C$.
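The proportionality above can be worked through by hand. The sketch below uses made-up priors and word likelihoods for a two-word vocabulary; all numbers and names are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Hypothetical priors and per-word likelihoods for a two-word vocabulary
prior = {"spam": 0.3, "ham": 0.7}
likelihood = {
    "spam": {"free": 0.8, "offer": 0.6},
    "ham":  {"free": 0.1, "offer": 0.2},
}

def posterior(words):
    """Return normalized P(C | words) under the naive independence assumption."""
    scores = {}
    for c in prior:
        score = prior[c]
        for word in words:
            score *= likelihood[c][word]   # product of P(x_i | C)
        scores[c] = score
    evidence = sum(scores.values())        # P(X), the normalizing constant
    return {c: s / evidence for c, s in scores.items()}

result = posterior(["free", "offer"])
print(result)
```

Here the unnormalized scores are 0.3 × 0.8 × 0.6 = 0.144 for spam and 0.7 × 0.1 × 0.2 = 0.014 for ham, so spam wins with a posterior of about 0.91 despite its lower prior.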

## Why is it called "Naïve"?

The "naïve" assumption refers to the **independence assumption**. The classifier assumes that all features are **independent** of each other, given the class. In other words, it assumes that the presence or absence of a feature does not depend on the presence or absence of any other feature, which is often not true in practice.

For example, in spam email classification, the presence of the word "free" in an email might be correlated with the presence of the word "offer." However, the Naïve Bayes classifier assumes these two words are independent, even if they are related in the real world.

### Types of Naïve Bayes Classifiers:
- **Gaussian Naïve Bayes**: Assumes that the features follow a **Gaussian (normal)** distribution. It's typically used when the features are **continuous**.
- **Multinomial Naïve Bayes**: Suitable for **discrete features**, commonly used for text classification problems where features (words) represent **counts** or **frequencies**.
- **Bernoulli Naïve Bayes**: Suitable for **binary/boolean features**. It assumes that each feature is binary, representing the presence or absence of some characteristic.

### Advantages:
- **Simple and fast**: Naïve Bayes is computationally efficient, making it suitable for large datasets.
- **Works well with high-dimensional data**: It's particularly good for problems like text classification, where there are many features (e.g., words in a document).
- **Good performance with less data**: It often performs well even when the dataset is small or when the independence assumption is violated to some extent.

### Disadvantages:
- **Strong independence assumption**: The assumption that features are independent rarely holds in real-world problems, which can reduce performance when features are highly correlated.
- **Poor with numerical data**: Naïve Bayes is not as effective with continuous features unless the appropriate distribution (e.g., Gaussian) is used.

## Example:

Consider a simple example where you want to classify emails as **Spam** or **Not Spam** based on the presence of certain words (features). The Naïve Bayes classifier would compute the probability of an email being spam given the presence or absence of these words, using **Bayes' Theorem**.

## 5) Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants. When would you use each one?

### Naïve Bayes Variants

Naïve Bayes classifiers come in different types, each tailored to specific types of data. The three most commonly used variants are:

1. **Gaussian Naïve Bayes**
2. **Multinomial Naïve Bayes**
3. **Bernoulli Naïve Bayes**

Each variant assumes different distributions or types of features and is suitable for different kinds of tasks. Here's a breakdown of each:

---

### 1. **Gaussian Naïve Bayes**

**Assumption**:
Gaussian Naïve Bayes assumes that the features are **continuous** and follow a **Gaussian (Normal) distribution**. This is suitable for problems where the data points can be represented by continuous values that follow a bell curve.

**Mathematical Representation**:
The likelihood $P(x_i | C)$ is modeled using the **normal (Gaussian) distribution**:

$$
P(x_i | C) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp \left( -\frac{(x_i - \mu)^2}{2\sigma^2} \right)
$$

Where:

* $\mu$ is the mean of the feature $x_i$ for class $C$,
* $\sigma^2$ is the variance of the feature $x_i$ for class $C$.

**When to use it**:

* Use **Gaussian Naïve Bayes** when the data is **continuous** and you suspect that it is distributed according to a Gaussian (Normal) distribution.
* Common in classification problems with numerical features, such as **medical diagnosis** from continuous measurements (e.g., height, weight, age, blood pressure) or **flower species identification** from petal and sepal measurements.

**Example**:

* Predicting whether a person has diabetes based on continuous features like **age**, **BMI**, **blood pressure**, etc.
* Classifying customers (e.g., likely buyer vs. non-buyer) from numerical features such as income or expenditure in marketing datasets.

---

### 2. **Multinomial Naïve Bayes**

**Assumption**:
Multinomial Naïve Bayes is used when the features represent **counts** or **frequencies** of events. It assumes that the features follow a **multinomial distribution**.

**Mathematical Representation**:
The likelihood $P(x_i | C)$ is modeled as:

$$
P(x_i | C) = \frac{(n_i + \alpha)}{(N + \alpha k)}
$$

Where:

* $n_i$ is the count of feature $i$ in class $C$,
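* $N$ is the total count of all features in class $C$ (e.g., the total number of word occurrences in documents of that class),
* $k$ is the number of unique features (e.g., the vocabulary size),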
* $\alpha$ is a smoothing parameter (usually $\alpha = 1$ for Laplace smoothing).
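The smoothed estimate can be reproduced on a tiny count matrix and compared with what `MultinomialNB` learns. The sketch below assumes a hypothetical four-document, three-word corpus; the counts are invented for illustration.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word-count matrix: rows are documents, columns are 3 vocabulary words
X = np.array([[2, 1, 0],
              [3, 0, 1],
              [0, 2, 3],
              [1, 1, 4]])
y = np.array([0, 0, 1, 1])
alpha = 1.0

n = np.array([X[y == c].sum(axis=0) for c in (0, 1)])  # n_i: count of word i per class
N = n.sum(axis=1, keepdims=True)                        # N: total word count per class
k = X.shape[1]                                          # k: vocabulary size

manual = (n + alpha) / (N + alpha * k)

clf = MultinomialNB(alpha=alpha).fit(X, y)
print(manual)
print(np.exp(clf.feature_log_prob_))
```

For class 0 the word counts are (5, 1, 1) with N = 7, giving (6/10, 2/10, 2/10). Smoothing keeps the third word's probability above zero even though it appears only once.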

**When to use it**:

* Use **Multinomial Naïve Bayes** when the features are **discrete counts** or represent **frequencies** of events.
* Most commonly used for text classification tasks, such as **spam detection** or **document classification**, where the features are word counts or term frequencies.

**Example**:

* **Text classification**: Classifying emails as spam or not spam, where the features are the frequencies of words in the email.
* **Document categorization**: Classifying articles into categories based on the number of times certain words appear (e.g., classifying news articles as sports, technology, health, etc.).

---

### 3. **Bernoulli Naïve Bayes**

**Assumption**:
Bernoulli Naïve Bayes assumes that the features are **binary** (i.e., either present or absent). It treats each feature as a Bernoulli random variable, meaning the feature can either be 0 (absent) or 1 (present).

**Mathematical Representation**:
The likelihood $P(x_i | C)$ for each binary feature $x_i$ is modeled as:

$$
P(x_i | C) =
\begin{cases}
P(x_i = 1 | C) & \text{if } x_i = 1 \\
P(x_i = 0 | C) & \text{if } x_i = 0
\end{cases}
$$

Where:

* $P(x_i = 1 | C)$ is the probability that feature $i$ is present in class $C$,
* $P(x_i = 0 | C)$ is the probability that feature $i$ is absent in class $C$.
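A short sketch of the Bernoulli likelihood follows, assuming a hypothetical five-email dataset with three binary keyword features. It takes the smoothed probabilities learned by `BernoulliNB` and rebuilds the posterior by hand; note that absent features contribute $1 - p$ rather than being ignored.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical binary features: does the email contain "free", "win", "offer"?
X = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 0, 1],
              [0, 0, 0],
              [0, 1, 0]])
y = np.array([1, 1, 0, 0, 0])   # 1 = spam, 0 = not spam

clf = BernoulliNB(alpha=1.0).fit(X, y)
p = np.exp(clf.feature_log_prob_)          # p[c, i] = P(x_i = 1 | C = c), smoothed

x_new = np.array([1, 0, 0])
# Each feature contributes p if present and (1 - p) if absent
likelihood = np.prod(p ** x_new * (1 - p) ** (1 - x_new), axis=1)
scores = np.exp(clf.class_log_prior_) * likelihood
manual_posterior = scores / scores.sum()

print(manual_posterior)
print(clf.predict_proba(x_new.reshape(1, -1))[0])
```

This explicit penalty for absent words is the main practical difference from the multinomial variant, which only looks at the words that occur.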

**When to use it**:

* Use **Bernoulli Naïve Bayes** when the features are **binary** (i.e., representing the presence or absence of certain characteristics).
* Commonly used in text classification tasks where features are binary (e.g., whether a word appears or not).

**Example**:

* **Spam classification**: Classifying emails as spam or not spam where the features are binary, representing the presence or absence of certain keywords (e.g., "free", "win", "offer").
* **Document classification**: Classifying text documents based on whether specific words are present (e.g., classifying whether a review is positive or negative based on the presence of words like "good", "excellent", or "bad").

---

### Summary of When to Use Each Variant

| Variant                     | Assumption/Use Case                             | Example Use Case                                                                              |
| --------------------------- | ----------------------------------------------- | --------------------------------------------------------------------------------------------- |
| **Gaussian Naïve Bayes**    | Continuous features with Gaussian distribution  | Classification from continuous measurements (e.g., disease diagnosis from health metrics)    |
| **Multinomial Naïve Bayes** | Discrete features (counts/frequencies)          | Text classification, document categorization, spam filtering, topic modeling                  |
| **Bernoulli Naïve Bayes**   | Binary features (presence/absence of a feature) | Text classification where words are present/absent, like spam detection or sentiment analysis |

---

### Conclusion:

Each Naïve Bayes variant is tailored to a specific type of data, making them versatile tools for classification tasks. Understanding the type of features in your dataset will guide you in choosing the appropriate variant:

* **Gaussian** for continuous, normally distributed features.
* **Multinomial** for discrete features that represent counts or frequencies.
* **Bernoulli** for binary features representing presence or absence.


## 6) Dataset Info:
● You can use any suitable datasets like Iris, Breast Cancer, or Wine from
sklearn.datasets or a CSV file you have.

Write a Python program to:

● Load the Iris dataset

● Train an SVM Classifier with a linear kernel

● Print the model's accuracy and support vectors

```python
# Import necessary libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data  # Features
y = iris.target  # Labels

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create an SVM classifier with a linear kernel
svm_classifier = SVC(kernel='linear')

# Train the SVM model
svm_classifier.fit(X_train, y_train)

# Predict on the test set
y_pred = svm_classifier.predict(X_test)

# Calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Model Accuracy: {accuracy * 100:.2f}%')

# Get the support vectors
support_vectors = svm_classifier.support_vectors_
print(f'Number of Support Vectors: {support_vectors.shape[0]}')
print(f'Support Vectors: \n{support_vectors}')
```

```
Model Accuracy: 100.00%
Number of Support Vectors: 24
Support Vectors: 
[[4.8 3.4 1.9 0.2]
 [5.1 3.3 1.7 0.5]
 [4.5 2.3 1.3 0.3]
 [5.6 3.  4.5 1.5]
 [5.4 3.  4.5 1.5]
 [6.7 3.  5.  1.7]
 [5.9 3.2 4.8 1.8]
 [5.1 2.5 3.  1.1]
 [6.  2.7 5.1 1.6]
 [6.3 2.5 4.9 1.5]
 [6.1 2.9 4.7 1.4]
 [6.5 2.8 4.6 1.5]
 [6.9 3.1 4.9 1.5]
 [6.3 2.3 4.4 1.3]
 [6.3 2.8 5.1 1.5]
 [6.3 2.7 4.9 1.8]
 [6.  3.  4.8 1.8]
 [6.  2.2 5.  1.5]
 [6.2 2.8 4.8 1.8]
 [6.5 3.  5.2 2. ]
 [7.2 3.  5.8 1.6]
 [5.6 2.8 4.9 2. ]
 [5.9 3.  5.1 1.8]
 [4.9 2.5 4.5 1.7]]
```


## 7) Write a Python program to:

● Load the Breast Cancer dataset

● Train a Gaussian Naïve Bayes model

● Print its classification report including precision, recall, and F1-score.

```python
# Import necessary libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Load the Breast Cancer dataset
cancer = datasets.load_breast_cancer()
X = cancer.data  # Features
y = cancer.target  # Labels

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a Gaussian Naïve Bayes classifier
gnb = GaussianNB()

# Train the model
gnb.fit(X_train, y_train)

# Predict on the test set
y_pred = gnb.predict(X_test)

# Print the classification report
print("Classification Report:\n")
print(classification_report(y_test, y_pred))
```

```
Classification Report:

              precision    recall  f1-score   support

           0       0.93      0.90      0.92        63
           1       0.95      0.96      0.95       108

    accuracy                           0.94       171
   macro avg       0.94      0.93      0.94       171
weighted avg       0.94      0.94      0.94       171
```



## 8) Write a Python program to:

● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best C and gamma.

● Print the best hyperparameters and accuracy.

```python
# Import necessary libraries
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Wine dataset
wine = datasets.load_wine()
X = wine.data  # Features
y = wine.target  # Labels

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create an SVM classifier
svm = SVC()

# Define the hyperparameter grid for GridSearchCV
param_grid = {
    'C': [0.1, 1, 10, 100],  # Regularization parameter
    'gamma': ['scale', 'auto', 0.001, 0.01, 0.1, 1]  # Kernel coefficient
}

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=svm, param_grid=param_grid, cv=5, verbose=1, n_jobs=-1)

# Train the model with GridSearchCV
grid_search.fit(X_train, y_train)

# Print the best hyperparameters
print("Best hyperparameters found by GridSearchCV:")
print(f"C: {grid_search.best_params_['C']}")
print(f"Gamma: {grid_search.best_params_['gamma']}")

# Evaluate the model with the best hyperparameters
best_svm = grid_search.best_estimator_
y_pred = best_svm.predict(X_test)

# Calculate and print accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy on the test set: {accuracy * 100:.2f}%")
```

```
Fitting 5 folds for each of 24 candidates, totalling 120 fits
Best hyperparameters found by GridSearchCV:
C: 100
Gamma: scale
Accuracy on the test set: 77.78%
```


## 9) Write a Python program to:

● Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using
sklearn.datasets.fetch_20newsgroups).

● Print the model's ROC-AUC score for its predictions.

```python
# Import necessary libraries
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelBinarizer

# Load the 20 Newsgroups dataset
newsgroups = fetch_20newsgroups(subset='all')

# Extract the text data and target labels
X = newsgroups.data
y = newsgroups.target

# Convert the text data to TF-IDF features
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X_tfidf = vectorizer.fit_transform(X)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.3, random_state=42)

# Initialize and train a Naïve Bayes classifier
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)

# Predict the probabilities for the test set
y_prob = nb_classifier.predict_proba(X_test)

# Binarize the true labels (for multi-class ROC-AUC)
lb = LabelBinarizer()
y_bin = lb.fit_transform(y_test)

# Calculate the ROC-AUC score
roc_auc = roc_auc_score(y_bin, y_prob, average='macro', multi_class='ovr')

# Print the ROC-AUC score
print(f"ROC-AUC Score: {roc_auc:.4f}")
```

```
ROC-AUC Score: 0.9909
```


## 10) Imagine you’re working as a data scientist for a company that handles email communications. Your task is to automatically classify emails as Spam or Not Spam. The emails may contain:

● Text with diverse vocabulary

● Potential class imbalance (far more legitimate emails than spam)

● Some incomplete or missing data

Explain the approach you would take to:

● Preprocess the data (e.g. text vectorization, handling missing data)

● Choose and justify an appropriate model (SVM vs. Naïve Bayes)

● Address class imbalance

● Evaluate the performance of your solution with suitable metrics

And explain the business impact of your solution.

### Approach for Automatically Classifying Emails as Spam or Not Spam

As a data scientist tasked with building a spam classification system, I would follow a structured approach to handle preprocessing, model selection, class imbalance, and evaluation. Below, I outline the steps in detail:

---

### **1. Preprocess the Data**

#### **Text Vectorization:**

The emails contain a diverse vocabulary, so effective text preprocessing and vectorization are crucial.

* **Text Cleaning:**

  * **Lowercasing**: Convert all text to lowercase to ensure consistency (e.g., "FREE" and "free" should be treated as the same word).
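  * **Removing Special Characters & Punctuation**: Strip out punctuation, numbers, and special characters that carry no signal. Some of these patterns (runs of "!!!", currency symbols, URLs) can themselves indicate spam, so it may be better to replace them with placeholder tokens than to delete them outright.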
  * **Removing Stopwords**: Stopwords like "the", "and", "is" don’t add much value for spam classification, so we remove them.
  * **Lemmatization**: Convert words to their base form (e.g., “running” becomes “run”) to standardize words.

* **Text Vectorization:**

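  * Use a **TF-IDF (Term Frequency-Inverse Document Frequency)** vectorizer to convert text into numerical features. TF-IDF weights each word by how often it appears in an email relative to how common it is across the entire dataset, so words that are frequent in one email but rare overall receive the highest weights. This helps distinguish spam from legitimate emails.
  * Alternatively, **Word2Vec** or **FastText** embeddings can be considered if sparse bag-of-words features prove insufficient. Dense embeddings can contain negative values, which Multinomial Naïve Bayes does not accept, so they pair better with models such as SVM.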

#### **Handling Missing Data:**

* **Handling Missing Text**: Emails with missing content may occur. We can impute missing text by assigning a placeholder value like `"<missing>"` or `"<empty>"`.
* **Missing Features**: If other fields (columns) contain missing values, we can either fill them with a neutral value (like "unknown") or drop the affected rows, depending on how important the feature is and how many rows would be lost.
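
As a small sketch of the placeholder idea, the helper below (the name `fill_missing_text` is hypothetical) replaces `None`, NaN, and whitespace-only bodies with a placeholder string before vectorization. It assumes the email bodies arrive as a plain Python list.

```python
import math


def fill_missing_text(texts, placeholder="<missing>"):
    """Replace None, NaN, or whitespace-only bodies with a placeholder string."""
    cleaned = []
    for text in texts:
        is_nan = isinstance(text, float) and math.isnan(text)
        if text is None or is_nan or not str(text).strip():
            cleaned.append(placeholder)
        else:
            cleaned.append(str(text))
    return cleaned


raw_bodies = ["Win a free prize", None, "   ", float("nan"), "Agenda for Monday"]
bodies = fill_missing_text(raw_bodies)
print(bodies)
```

Whether to use a visible placeholder or an empty string is a design choice: a placeholder lets the model learn that "missing body" is itself a signal, while an empty string simply produces an all-zero feature vector.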

#### **Handling Class Imbalance:**

* **Class Imbalance**: In most email datasets, there are far more legitimate emails (Not Spam) than spam emails. This can bias the model, making it less effective at detecting spam.

  * **Resampling Techniques**:

    * **Oversampling**: Use techniques like **SMOTE (Synthetic Minority Over-sampling Technique)** to create synthetic spam email samples to balance the classes.
    * **Undersampling**: Alternatively, randomly under-sample the legitimate email class (Not Spam) to reduce the imbalance, but this can lead to loss of important data.
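  * **Class Weights**: Some classifiers let us assign a higher weight to the minority class (Spam). In scikit-learn, SVM exposes a `class_weight` parameter; Naïve Bayes has no such parameter, but a similar effect can be achieved by setting `class_prior` or by passing per-sample weights to `fit`. Either way, the model is penalized more for misclassifying spam emails, which helps it focus on the minority class.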
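
The oversampling idea can be sketched as follows. Note that this uses plain random oversampling via `sklearn.utils.resample`, not SMOTE: SMOTE is provided by the separate imbalanced-learn package, which is not assumed to be installed here. The feature matrix and labels are hypothetical, and resampling should be applied to the training split only so that the test set keeps its real class distribution.

```python
import numpy as np
from sklearn.utils import resample

# Hypothetical training data: 8 legitimate emails (0) and 2 spam emails (1)
X_train = np.arange(20).reshape(10, 2)
y_train = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

X_ham = X_train[y_train == 0]
X_spam = X_train[y_train == 1]

# Draw spam rows with replacement until they match the majority-class count
X_spam_up = resample(X_spam, replace=True, n_samples=len(X_ham), random_state=42)

X_balanced = np.vstack([X_ham, X_spam_up])
y_balanced = np.concatenate([
    np.zeros(len(X_ham), dtype=int),
    np.ones(len(X_spam_up), dtype=int),
])

print(np.bincount(y_balanced))  # class counts after oversampling
```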

---

### **2. Choose and Justify the Model**

#### **Naïve Bayes Classifier:**

* **Why Naïve Bayes?**

  * **Text Classification Suitability**: Naïve Bayes, particularly **Multinomial Naïve Bayes**, is a strong baseline for text classification. Its assumption that words are conditionally independent given the class rarely holds exactly, but it simplifies learning and works well in practice with high-dimensional, sparse data like text.
  * **Handling Missing Data**: In principle, Naïve Bayes can skip a missing feature when computing class probabilities, because each feature contributes independently. In practice, scikit-learn's implementation expects complete numeric input, so missing text still has to be filled before vectorization; an empty email simply becomes an all-zero feature vector.
  * **Efficient**: Training and prediction are fast, making it suitable for real-time spam classification in email systems.
  * **Class Imbalance**: Naïve Bayes can remain effective on imbalanced classes if we adjust the class priors or weight spam samples more heavily.

#### **Support Vector Machine (SVM):**

* **Why SVM?**

  * **Effectiveness in High-Dimensional Spaces**: SVMs are known for their effectiveness in high-dimensional spaces (like text data), where they try to find the optimal hyperplane that separates spam and non-spam emails.
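  * **Non-linearity**: Using the **kernel trick**, SVMs can model non-linear decision boundaries, which could help if spam and legitimate emails are not separable by a hyperplane. For sparse, high-dimensional TF-IDF features, a linear kernel is usually a reasonable starting point and trains much faster than kernels such as RBF.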
  * **Class Imbalance**: SVMs handle class imbalance well if we use techniques like **class weighting** or **SMOTE**.

##### **Decision:**

* Given the simplicity, effectiveness, and efficiency of Naïve Bayes for text classification, I would choose **Naïve Bayes** as the first approach, especially for real-time email filtering. **SVM** is worth exploring as an alternative if Naïve Bayes underperforms, for example when the classes are not linearly separable and a kernel could help.
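
A minimal sketch of how the two candidates could be compared side by side, assuming a tiny hypothetical corpus (real model selection would use cross-validation on the actual email data). Wrapping the vectorizer and classifier in a pipeline means the TF-IDF vocabulary is learned from the training texts only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical labelled emails: 1 = spam, 0 = not spam
train_texts = [
    "win a free prize now",
    "claim your free reward today",
    "cheap loans click here",
    "project meeting agenda attached",
    "lunch tomorrow with the team",
    "quarterly report draft for review",
]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["free prize waiting for you", "agenda for the team meeting"]

models = {
    "naive_bayes": make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB()),
    "linear_svm": make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC()),
}

predictions = {}
for name, model in models.items():
    model.fit(train_texts, train_labels)          # vectorizer is fit on training text only
    predictions[name] = model.predict(test_texts)
    print(name, predictions[name])
```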

---

### **3. Address Class Imbalance**

Class imbalance can be a significant issue in spam detection because legitimate emails (Not Spam) vastly outnumber spam emails. Here’s how we can address it:

* **Resampling Techniques**:

  * **Oversampling (SMOTE)**: Generating synthetic spam samples can balance the dataset. This ensures that the model doesn’t bias predictions toward the majority class (legitimate emails).
  * **Undersampling**: By randomly undersampling the majority class, we can create a balanced dataset. However, this comes at the cost of potentially losing important data.
* **Class Weight Adjustment**:

  * Many models can be made more sensitive to the minority class (spam) through weighting. In scikit-learn, **SVM** accepts a `class_weight` parameter, while **Naïve Bayes** has no `class_weight` but accepts per-sample weights in `fit` or an explicit `class_prior`. Assigning a higher weight to the spam class makes the model "care more" about correctly predicting spam.
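
The weighting idea can be sketched as below, using hypothetical count features. With the `"balanced"` heuristic, each class receives weight `n_samples / (n_classes * class_count)`. For the SVM this is passed as `class_weight`; for Naïve Bayes the sketch uses per-sample weights instead, which is one possible way to get a comparable effect.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.utils.class_weight import compute_class_weight, compute_sample_weight

# Hypothetical training data: 8 legitimate emails (0), 2 spam emails (1)
rng = np.random.default_rng(0)
X_train = rng.integers(0, 5, size=(10, 6))   # non-negative word counts
y_train = np.array([0] * 8 + [1] * 2)

# "balanced" weights: 10 / (2 * 8) = 0.625 for class 0, 10 / (2 * 2) = 2.5 for class 1
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_train)
print(dict(zip([0, 1], weights)))

# SVM: class weighting is a constructor argument
svm = LinearSVC(class_weight="balanced").fit(X_train, y_train)

# Naive Bayes: no class_weight parameter, so weight each sample instead
sample_weight = compute_sample_weight("balanced", y_train)
nb = MultinomialNB().fit(X_train, y_train, sample_weight=sample_weight)

# With balanced sample weights the learned class priors become equal
print(np.exp(nb.class_log_prior_))
```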

---

### **4. Evaluate the Model’s Performance**

Since we are working with imbalanced data, accuracy alone isn’t sufficient to evaluate the performance of the model. Instead, we focus on the following metrics:

* **Precision, Recall, and F1-Score**:

  * **Precision**: The percentage of emails classified as spam that are actually spam. High precision means fewer false positives (legitimate emails misclassified as spam).
  * **Recall**: The percentage of actual spam emails that are correctly identified as spam. High recall means fewer false negatives (spam emails misclassified as legitimate).
  * **F1-Score**: The harmonic mean of precision and recall. It balances the two metrics and gives us a single value that reflects the performance of the classifier.

* **ROC-AUC**:

  * The **Receiver Operating Characteristic (ROC) curve** and the **Area Under the Curve (AUC)** are important for understanding the classifier’s ability to distinguish between classes. A high AUC score indicates that the classifier can distinguish spam from non-spam effectively.

* **Confusion Matrix**:

  * The confusion matrix will provide more granular details, including true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). It’s helpful for understanding the types of classification errors the model is making.
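
The metrics above can be computed with `sklearn.metrics`. The labels and scores below are made up purely for illustration (6 legitimate emails, 4 spam), and the 0.5 decision threshold is an assumption that would normally be tuned to the business cost of false positives versus false negatives.

```python
import numpy as np
from sklearn.metrics import (
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Hypothetical ground truth (1 = spam) and predicted spam probabilities
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.2, 0.9, 0.8, 0.7, 0.35])

# Turn probabilities into hard labels with an assumed 0.5 threshold
y_pred = (y_score >= 0.5).astype(int)

cm = confusion_matrix(y_true, y_pred)        # rows: true class, columns: predicted class
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)         # uses scores, not thresholded labels

print(cm)
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}, ROC-AUC: {auc:.3f}")
```

Here one legitimate email is flagged as spam and one spam email slips through, so precision, recall, and F1 all come out to 0.75, while ROC-AUC (computed from the raw scores) is about 0.917.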

---

### **5. Business Impact of the Solution**

The ultimate goal of spam classification is to improve the email communication experience for users. Here’s how the solution would impact the business:

* **Improved Productivity**:

  * By filtering out spam emails, employees can focus on important, legitimate communications. This leads to better time management and productivity, as less time is spent manually sorting through spam.

* **Enhanced User Experience**:

  * A robust spam filter can significantly enhance the user experience by keeping inboxes clean. This can lead to higher customer satisfaction, particularly in email services or platforms.

* **Cost Savings**:

  * With fewer employees wasting time dealing with spam emails and reduced risk of malware or phishing attempts (common in spam), the organization can save on costs related to cybersecurity and administrative time.

* **Compliance and Security**:

  * Spam emails often contain phishing attempts or malware. An effective spam filter reduces the risk of these attacks, which helps protect sensitive company data and supports regulatory compliance.

---

### **Summary of Approach**:

1. **Preprocessing**: Clean the data, handle missing values, and vectorize the text using TF-IDF.
2. **Model Choice**: Start with **Naïve Bayes** for text classification, with potential to explore **SVM**.
3. **Address Class Imbalance**: Use **SMOTE**, **class weighting**, or **undersampling**.
4. **Evaluation**: Focus on precision, recall, F1-score, and ROC-AUC to ensure balanced performance.
5. **Business Impact**: Improve productivity, enhance user experience, reduce costs, and ensure compliance.

This approach balances technical rigor and business needs, ultimately providing a robust spam filtering solution for the organization.
