# **SVM & Naive Bayes Assignment**


1. **What is a Support Vector Machine (SVM), and how does it work?**
- A support vector machine (SVM) is a supervised machine learning algorithm that classifies data by finding the line or hyperplane that maximizes the distance between the classes in an N-dimensional feature space.
- It looks for the best boundary, known as the hyperplane, that separates the classes in the data. It is especially useful for binary classification, such as spam vs. not spam or cat vs. dog.
- The main goal of SVM is to maximize the margin between the two classes. A larger margin generally helps the model generalize to new, unseen data.

 **How SVMs work**

- Finding the optimal hyperplane: SVM aims to identify a decision boundary (a line in 2D or a hyperplane in higher dimensions) that divides the data points of different classes with the largest possible margin.
- Maximizing the margin: The margin is the distance between the decision boundary and the closest data points from each class. A wider margin indicates better separation and improved generalization ability to new data.
- Support vectors: The data points closest to the hyperplane (those that define the margin) are called support vectors. These are crucial for determining the hyperplane's position and orientation.
- Handling non-linear data (kernel trick): When data isn't linearly separable, SVM uses a technique called the "kernel trick." This involves implicitly mapping the data into a higher-dimensional feature space where it becomes linearly separable, allowing a hyperplane to be found.
  - Kernel functions: These are mathematical functions (e.g., linear, polynomial, Radial Basis Function/RBF) that facilitate this transformation without explicitly computing the coordinates in the higher-dimensional space.
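
The ideas above (hyperplane, margin, support vectors) can be made concrete with a small sketch. The six points below are made up for illustration, and the very large `C` is an assumption used to approximate a hard margin on separable data; it is not part of the original answer.

```python
import numpy as np
from sklearn.svm import SVC

# Two small, linearly separable clusters (made-up points for illustration)
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
              [5.0, 5.0], [5.5, 6.0], [6.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard-margin SVM on separable data
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]                        # normal vector of the hyperplane
b = clf.intercept_[0]                   # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)  # distance between the two margin lines

print("Support vectors:\n", clf.support_vectors_)
print("Margin width:", margin_width)

# Support vectors lie on the margin, so |w.x + b| is approximately 1 for them
decision_values = clf.decision_function(clf.support_vectors_)
print("Decision values at support vectors:", decision_values)
```

Only the points returned in `support_vectors_` determine `w` and `b`; removing any of the other points would leave the boundary unchanged.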

2. **Explain the difference between Hard Margin and Soft Margin SVM.**
- Support Vector Machines (SVMs) are powerful supervised learning models used for classification and regression. The key idea behind SVMs is to find an optimal hyperplane that separates data points belonging to different classes with the widest possible margin.
- This margin signifies the distance between the decision boundary and the nearest data points, also known as support vectors. Two common approaches to handling this margin are Hard Margin SVM and Soft Margin SVM.

 **Hard Margin SVM**

- Objective: Finds a hyperplane that completely separates data points of different classes, maximizing the margin with no misclassifications.
- Requirements: Assumes data is perfectly linearly separable, meaning a clear line or plane can divide the classes without any errors.
- Strengths: Provides a clear and interpretable decision boundary.
- Limitations:
  - Highly sensitive to outliers: Even a single outlier can significantly affect the decision boundary.
  - Not suitable for noisy or overlapping datasets where perfect separation is impossible.
  - Fails when data is not linearly separable, because no perfectly separating hyperplane exists.

 **Soft Margin SVM**

- Objective: Allows for some misclassifications or margin violations to handle situations where data is not perfectly separable. It introduces a penalty term for misclassifications, balancing margin maximization with error minimization.
- Flexibility: Uses slack variables (ξi) to measure the extent of misclassification or how much a data point violates the margin constraint.
  - ξi = 0 if the data point is correctly classified and lies on or outside the margin.
  - 0 < ξi < 1 if the data point is correctly classified but lies within the margin.
  - ξi ≥ 1 if the data point is misclassified.
- Regularization Parameter (C): This hyperparameter controls the trade-off between maximizing the margin and minimizing the classification error.
  - High C value: Aims for fewer misclassifications on the training data, leading to a narrower margin but potentially overfitting.
  - Low C value: Allows more misclassifications for a wider margin and a smoother decision boundary, potentially preventing overfitting but risking underfitting.
- Strengths:
  - More robust to outliers and noisy data than a hard margin.
  - Combined with kernel functions, it can also handle data that is not linearly separable.
  - Often gives a more generalized model that performs better on unseen data.
- Limitations: Requires careful tuning of the regularization parameter C to achieve optimal performance.
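
The effect of C described above can be checked with a short experiment. This is a minimal sketch on synthetic data: the cluster centers, the spread, and the two C values are assumptions chosen only to make the classes overlap.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters, so some margin violations are unavoidable
X, y = make_blobs(n_samples=200, centers=[[0, 0], [3, 3]],
                  cluster_std=1.5, random_state=0)

results = {}
for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin_width = 2.0 / np.linalg.norm(clf.coef_[0])
    results[C] = {
        "margin_width": margin_width,
        "n_support_vectors": int(clf.n_support_.sum()),
    }
    print(f"C={C}: margin width={margin_width:.3f}, "
          f"support vectors={results[C]['n_support_vectors']}")
```

The low-C model should report the wider margin, since it pays less for each slack variable and can afford to let more points fall inside the margin.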

3. **What is the Kernel Trick in SVM? Give one example of a kernel and
explain its use case.**
- The kernel trick is a method used in SVMs to enable them to classify non-linear data using a linear classifier. By applying a kernel function, SVMs can implicitly map input data into a higher-dimensional space where a linear separator (hyperplane) can be used to divide the classes. This mapping is computationally efficient because it avoids the direct calculation of the coordinates in this higher space.
- Types of Kernel Function
  - Linear Kernel: No mapping is needed as the data is already assumed to be linearly separable.
  - Polynomial Kernel: Maps inputs into a polynomial feature space, enhancing the classifier's ability to capture interactions between features.
  - Radial Basis Function (RBF) Kernel: Also known as the Gaussian kernel, it is useful for capturing complex regions by considering the distance between points in the input space.
  - Sigmoid Kernel: Mimics the behavior of neural networks by using a sigmoid function as the kernel.

  **Example of a kernel: Radial Basis Function (RBF) Kernel**

- The Radial Basis Function (RBF) kernel, also known as the Gaussian kernel, is one of the most popular and versatile kernel functions used with SVMs. It's particularly useful when dealing with highly non-linear data distributions and when the relationship between features and class labels is complex and unknown.
It is defined as:
 **`K(x, y) = exp(-γ ||x - y||²)`**

 - **x** and **y** are the input vectors.
 - **γ (gamma)** is a positive parameter that controls how far the influence of each training example reaches, or the "spread" of the kernel. A larger gamma makes that influence more local and fits the training data more closely, which can lead to overfitting if not tuned properly.
 - **||x - y||²** represents the squared Euclidean distance between the input vectors **x** and **y**.
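
As a quick check of the formula, the sketch below evaluates it by hand and compares the result with scikit-learn's `rbf_kernel`. The helper name `rbf` and the two vectors are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf(x, y, gamma):
    """K(x, y) = exp(-gamma * ||x - y||^2) for two 1-D vectors."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

x = np.array([1.0, 2.0])
y = np.array([2.0, 4.0])   # squared distance to x is 1 + 4 = 5

for gamma in (0.1, 1.0):
    manual = rbf(x, y, gamma)
    library = rbf_kernel(x.reshape(1, -1), y.reshape(1, -1), gamma=gamma)[0, 0]
    print(f"gamma={gamma}: manual={manual:.6f}, sklearn={library:.6f}")
```

For the same pair of points, the similarity drops from about 0.6065 at gamma = 0.1 to about 0.0067 at gamma = 1.0, which is why a larger gamma makes each training example's influence more local.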

 **Use Cases**

- The RBF kernel is particularly useful for capturing complex, non-linear relationships within the data. It is widely used when the relationship between class labels and attributes is highly non-linear, allowing SVM to create intricate decision boundaries that adapt to the data's inherent structure.
- For instance, consider a dataset where different classes of data points are scattered in a way that forms a spiral pattern. A linear SVM would be unable to classify them. However, by employing the RBF kernel, the data is implicitly mapped to a higher-dimensional space where it becomes linearly separable, enabling a linear classifier to separate the points successfully, and creating a non-linear spiral-shaped decision boundary in the original space.
- In essence, the kernel trick, particularly with kernels like the RBF, equips SVM with the ability to effectively tackle a wide range of non-linear classification challenges that would otherwise be intractable with simple linear models.  
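
The difference between a linear and an RBF kernel on non-linear data can be sketched as follows. Concentric circles from `make_circles` are used here as a simpler stand-in for the spiral described above; the dataset parameters are assumptions.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# One class forms an inner circle, the other an outer ring
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scores = {}
for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    scores[kernel] = clf.score(X_test, y_test)
    print(f"{kernel} kernel test accuracy: {scores[kernel]:.2f}")
```

No straight line can separate an inner circle from an outer ring, so the linear kernel is expected to stay near chance level while the RBF kernel separates the two classes.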


4. **What is a Naïve Bayes Classifier, and why is it called “naïve”?**
- The Naive Bayes Classifier is a supervised machine learning algorithm based on Bayes' Theorem, commonly used for classification tasks. It calculates the probability of each class by assuming the features are independent of one another given the class, which makes it effective for text classification, spam detection, and more.
- The Naive Bayes Classifier relies on certain assumptions that simplify calculating probabilities and making predictions.
- Here are the main assumptions of Naive Bayes:
  - Feature independence: When classifying an item, we presume that, given the class, each feature does not influence any other feature.
  - Continuous features are assumed to follow a normal distribution: If a feature is continuous, it is considered normally distributed within each class (Gaussian variant).
  - Discrete features follow multinomial distributions: If a feature is discrete, it is presumed to exhibit a multinomial distribution for each class (Multinomial variant).
  - All features hold equal significance: It is assumed that every feature contributes uniformly to predicting the class label.
  - No missing data: The data should not contain missing values.

 **Why It’s Called Naive Bayes**

- The Naive Bayes classifier is termed "naive" because it assumes that all input variables are independent of each other given the class, a premise that is frequently unrealistic in actual data scenarios. This means that the presence or absence of one feature is considered unrelated to the presence or absence of any other feature when predicting the class.
- For example, when classifying a fruit as an apple based on color, roundness, and size, Naïve Bayes treats each of these characteristics as independently contributing to the probability of it being an apple, regardless of any potential correlation between them.
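
The fruit example can be worked through by hand. The sketch below multiplies a prior by one likelihood per feature, which is exactly what the independence assumption allows; all probabilities are hypothetical numbers chosen only to illustrate the arithmetic.

```python
# Hypothetical probabilities, chosen only to illustrate the arithmetic
priors = {"apple": 0.5, "orange": 0.5}
likelihoods = {
    "apple":  {"red": 0.8, "round": 0.9,  "medium": 0.7},
    "orange": {"red": 0.1, "round": 0.95, "medium": 0.6},
}
observed = ["red", "round", "medium"]

# Naive assumption: P(features | class) is the product of per-feature terms
unnormalized = {}
for label, prior in priors.items():
    score = prior
    for feature in observed:
        score *= likelihoods[label][feature]
    unnormalized[label] = score

# Normalize so the posteriors over all classes sum to 1
evidence = sum(unnormalized.values())
posterior = {label: score / evidence for label, score in unnormalized.items()}
print(posterior)
```

With these numbers the unnormalized scores are 0.252 for apple and 0.0285 for orange, so the posterior probability of apple is about 0.898.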

5. **Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants.
When would you use each one?**
- Naive Bayes classifiers come in several variants, each suited for different types of data. Gaussian Naive Bayes is used for continuous data assuming a normal distribution, Multinomial Naive Bayes is for discrete data with multiple outcomes like word counts, and Bernoulli Naive Bayes is for binary data where features represent presence or absence.

 **Gaussian Naïve Bayes**

- Description
  - Assumes that continuous features follow a normal (Gaussian) distribution.
  - The mean and standard deviation for each class's features are estimated from the training data.
  - Classifies new data points by finding the maximum value of the posterior probability for each class, assuming a Gaussian likelihood for features.
- When to use
  - Continuous data: Ideal for datasets where features have continuous, numerical values (e.g., age, height, temperature) that are believed to follow a normal distribution.
  - Spam detection and text classification: Can be used when text is represented by continuous features such as TF-IDF scores, although Multinomial Naïve Bayes is usually the more natural choice for word counts.
  - Medical Diagnosis: Can be used for disease prediction based on symptoms and medical test results, as these often involve continuous data points that may follow a normal distribution.
  - Credit scoring and fraud detection: Can be used to assess credit risk or identify fraudulent transactions based on numerical features like transaction amounts or credit scores.

 **Multinomial Naïve Bayes**

- Description
  - Designed for discrete features, especially when features represent counts or frequencies.
  - Commonly used in text classification, where features are often word counts or frequencies.
  - Models the likelihood of observing a specific set of counts for a fixed number of trials, using the multinomial distribution.
- When to use
  - Text classification: Widely used in Natural Language Processing (NLP) tasks such as spam filtering, sentiment analysis, document categorization, and topic modeling, where the features are word counts or frequencies.
  - Document classification: Particularly effective for categorizing documents into different classes based on their word content.
  - Spam filtering: Can be applied to classify emails as spam or not spam based on the frequency of words in the email.

 **Bernoulli Naïve Bayes**

- Description
  - Specifically designed for binary or Boolean features, where values are typically 0 or 1, indicating the presence or absence of an attribute.
  - Uses the Bernoulli distribution to calculate probabilities based on whether a feature is present or absent given the class.
- When to use
  - Binary features: Appropriate when every feature is binary (e.g., a word is present or absent, a test is positive or negative). The class label itself can be binary (e.g., spam/not spam, disease/no disease) or multi-class.
  - Text classification with presence/absence: Can be used in text classification where the presence or absence of a word in a document, rather than its frequency, is the key feature.
  - Fraud detection: Can be applied to detect fraudulent transactions where features might be binary, such as whether a particular type of transaction occurred or not.
  - Medical diagnosis (binary features): Can be used for tasks involving binary symptoms or test results (e.g., positive/negative test result).
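
The three variants map directly onto three scikit-learn classes. The following minimal sketch fits each one on a tiny made-up dataset of the matching feature type; the numbers carry no meaning beyond illustration.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Gaussian NB: one continuous feature (for example a height in cm)
X_cont = np.array([[150.0], [155.0], [160.0], [180.0], [185.0], [190.0]])
y_cont = np.array([0, 0, 0, 1, 1, 1])
gaussian_pred = GaussianNB().fit(X_cont, y_cont).predict([[152.0], [188.0]])

# Multinomial NB: word counts per document (columns = three vocabulary words)
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 2, 3], [0, 3, 1]])
y_docs = np.array([1, 1, 0, 0])
X_new = np.array([[2, 0, 0], [0, 2, 1]])
multinomial_pred = MultinomialNB().fit(X_counts, y_docs).predict(X_new)

# Bernoulli NB: only presence (1) or absence (0) of each word
bernoulli = BernoulliNB().fit((X_counts > 0).astype(int), y_docs)
bernoulli_pred = bernoulli.predict((X_new > 0).astype(int))

print("Gaussian:", gaussian_pred)
print("Multinomial:", multinomial_pred)
print("Bernoulli:", bernoulli_pred)
```

The Multinomial and Bernoulli models see the same documents, but the first uses how often each word occurs while the second only uses whether it occurs at all.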

6. **Write a Python program to:**

 ● **Load the Iris dataset**

 ● **Train an SVM Classifier with a linear kernel**

 ● **Print the model's accuracy and support vectors.**

In [1]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split dataset into train and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train an SVM classifier with linear kernel
svm_clf = SVC(kernel='linear')
svm_clf.fit(X_train, y_train)

# Predict on test set
y_pred = svm_clf.predict(X_test)

# Print accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

# Print support vectors
print("\nSupport Vectors:")
print(svm_clf.support_vectors_)


Model Accuracy: 1.0

Support Vectors:
[[4.8 3.4 1.9 0.2]
 [5.1 3.3 1.7 0.5]
 [4.5 2.3 1.3 0.3]
 [5.6 3.  4.5 1.5]
 [5.4 3.  4.5 1.5]
 [6.7 3.  5.  1.7]
 [5.9 3.2 4.8 1.8]
 [5.1 2.5 3.  1.1]
 [6.  2.7 5.1 1.6]
 [6.3 2.5 4.9 1.5]
 [6.1 2.9 4.7 1.4]
 [6.5 2.8 4.6 1.5]
 [6.9 3.1 4.9 1.5]
 [6.3 2.3 4.4 1.3]
 [6.3 2.5 5.  1.9]
 [6.3 2.8 5.1 1.5]
 [6.3 2.7 4.9 1.8]
 [6.  3.  4.8 1.8]
 [6.  2.2 5.  1.5]
 [6.2 2.8 4.8 1.8]
 [6.5 3.  5.2 2. ]
 [7.2 3.  5.8 1.6]
 [5.6 2.8 4.9 2. ]
 [5.9 3.  5.1 1.8]
 [4.9 2.5 4.5 1.7]]


7. **Write a Python program to:**

 ● **Load the Breast Cancer dataset**

 ● **Train a Gaussian Naïve Bayes model**

 ● **Print its classification report including precision, recall, and F1-score.**

In [2]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train a Gaussian Naïve Bayes model
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Predict on the test set
y_pred = gnb.predict(X_test)

# Print classification report
print("Classification Report:\n")
print(classification_report(y_test, y_pred, target_names=data.target_names))


Classification Report:

              precision    recall  f1-score   support

   malignant       1.00      0.93      0.96        43
      benign       0.96      1.00      0.98        71

    accuracy                           0.97       114
   macro avg       0.98      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114



8. **Write a Python program to:**

 ● **Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best C and gamma.**

 ● **Print the best hyperparameters and accuracy**

In [3]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# Split into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Define parameter grid for C and gamma
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],
    'kernel': ['rbf']  # using RBF kernel
}

# Perform Grid Search with cross-validation
grid_search = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Get the best model
best_model = grid_search.best_estimator_

# Predict on the test set
y_pred = best_model.predict(X_test)

# Print best hyperparameters and accuracy
print("Best Hyperparameters:", grid_search.best_params_)
print("Test Set Accuracy:", accuracy_score(y_test, y_pred))


Best Hyperparameters: {'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}
Test Set Accuracy: 0.8333333333333334
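
One possible reason for the modest test accuracy is that the Wine features sit on very different scales, and the RBF kernel depends on distances between points. The variant below is a sketch of the same search with a `StandardScaler` placed in a `Pipeline`; the pipeline step names are assumptions, and no result is claimed beyond what running it shows.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaling lives inside the pipeline, so each CV fold is scaled
# using only its own training portion
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(kernel="rbf")),
])
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}
scaled_search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
scaled_search.fit(X_train, y_train)

scaled_test_accuracy = scaled_search.score(X_test, y_test)
print("Best hyperparameters:", scaled_search.best_params_)
print("Test set accuracy:", scaled_test_accuracy)
```

Keeping the scaler inside the pipeline, rather than scaling the whole training set once, prevents the validation folds from influencing the scaling statistics.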


9. **Write a Python program to:**

 ● **Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using
sklearn.datasets.fetch_20newsgroups).**

 ● **Print the model's ROC-AUC score for its predictions.**

In [4]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score

# Load a subset of the 20 Newsgroups dataset (to keep it binary for ROC-AUC)
categories = ['sci.space', 'rec.sport.baseball']
newsgroups = fetch_20newsgroups(subset='all', categories=categories)

X = newsgroups.data
y = newsgroups.target

# Convert text data into TF-IDF features
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X_tfidf = vectorizer.fit_transform(X)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, y, test_size=0.2, random_state=42
)

# Train Naïve Bayes Classifier
nb = MultinomialNB()
nb.fit(X_train, y_train)

# Predict probabilities
y_prob = nb.predict_proba(X_test)[:, 1]

# Compute ROC-AUC
roc_auc = roc_auc_score(y_test, y_prob)

print("ROC-AUC Score:", roc_auc)


ROC-AUC Score: 1.0


10. **Imagine you’re working as a data scientist for a company that handles
email communications.
Your task is to automatically classify emails as Spam or Not Spam. The emails may contain:**

 ● **Text with diverse vocabulary**

 ● **Potential class imbalance (far more legitimate emails than spam)**

 ● **Some incomplete or missing data**

 **Explain the approach you would take to:**

 ● **Preprocess the data (e.g. text vectorization, handling missing data)**

 ● **Choose and justify an appropriate model (SVM vs. Naïve Bayes)**

 ● **Address class imbalance**

 ● **Evaluate the performance of your solution with suitable metrics, and explain the business impact of your solution.**

- Building an automated email spam classification system is an important task for a company that handles email communications. Given the characteristics of email data (diverse vocabulary, potential class imbalance, and incomplete data), a robust approach is required.

**Data Preprocessing**

- Text Vectorization: Convert emails into a numerical format. Consider these techniques:
  - Bag-of-Words (BoW): Count word occurrences in each email to create a numerical representation.
  - TF-IDF (Term Frequency-Inverse Document Frequency): This method refines BoW by assigning higher weights to words that are frequent in a specific email but rare across the entire dataset (corpus). This helps highlight the unique and important terms within each email.
  - Word Embeddings (e.g., Word2Vec, GloVe): Represent words as dense vectors to capture semantic and syntactic relationships. Although more computationally intensive, they offer the potential to capture more nuanced meaning and context within email.
  - Justification: For initial experimentation, TF-IDF is often a good starting point because it balances simplicity with the ability to account for word importance within a document and across the corpus. Word embeddings would be explored later if TF-IDF did not provide satisfactory results, particularly if capturing subtle semantic distinctions proves important for spam detection.
- Handling Incomplete or Missing Data: Email data may have missing values in fields like sender information or parts of the message body. Address these as follows:
  - Deletion: Remove rows with missing values if the amount of missing data is small and randomly distributed.
  - Imputation: Replace missing values with estimated ones using techniques like mean, median, or most frequent value. More advanced methods like k-Nearest Neighbors (KNN) or model-based imputation could also be explored.
  - Consider Missingness as a Feature: Create a new feature indicating whether a particular field was missing to capture useful patterns.  
  - Justification: The method used depends on the extent and nature of missing data. For text content, techniques that attempt to reconstruct missing parts based on context (e.g., using information from other parts of the email or similar emails) could be beneficial. If missingness is not random, treating it as a feature might be the best approach.
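
The preprocessing steps above could be combined roughly as follows. This is a sketch under stated assumptions: the four email bodies are hypothetical, missing bodies are represented as `None` or an empty string, and the missingness flag is appended as one extra feature column.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical email bodies; None and "" stand for missing content
emails = ["Win a FREE prize now", None, "Meeting moved to 3pm", ""]

# Missingness indicator: 1 when the body is None or empty
body_missing = np.array([[int(not text)] for text in emails])

# Replace missing bodies with an empty string so the vectorizer accepts them
cleaned = [text if text else "" for text in emails]

vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X_text = vectorizer.fit_transform(cleaned)

# Append the indicator as one extra column next to the TF-IDF features
X_features = hstack([X_text, csr_matrix(body_missing)]).tocsr()
print(X_features.shape)
```

In a real system the vectorizer would be fitted on the training split only and then applied to new emails with `transform`.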

**Choosing an Appropriate Model**

- Naïve Bayes (NB): This probabilistic classifier is simple, efficient, and performs well for text classification tasks, particularly with high-dimensional and sparse datasets typical of email data. It calculates the conditional probability of a class (spam or not spam) given the presence of certain words.
- Support Vector Machines (SVM): SVMs are powerful and effective at distinguishing between different categories. They find an optimal hyperplane to separate classes, even in high-dimensional feature spaces, and a soft margin lets them tolerate some outliers.
- Justification: Naïve Bayes offers simplicity and good performance for text classification, while SVMs provide more robustness and potentially higher accuracy in complex scenarios, especially when using a non-linear kernel. Experiment with both to determine which performs best on the specific email dataset. Naïve Bayes is simpler to implement initially, but if the data exhibits complex relationships, an SVM with an appropriate kernel could outperform it, as discussed on Stack Overflow. SVMs are also often reported to handle full-length content better than Naïve Bayes.
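
Trying both models can be done with the same cross-validation loop. The sketch below uses a tiny hypothetical corpus, so the scores it prints say nothing about real email data; it only shows the mechanics of a like-for-like comparison.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny hypothetical corpus (1 = spam, 0 = not spam), for illustration only
texts = [
    "win a free prize now", "claim your free cash reward",
    "cheap loans approved instantly", "free vacation winner click here",
    "urgent offer claim your prize", "exclusive deal win cash today",
    "meeting moved to friday afternoon", "please review the attached report",
    "lunch tomorrow with the team", "project deadline is next week",
    "notes from the client call", "can you send the updated slides",
]
labels = [1] * 6 + [0] * 6

candidates = {
    "naive_bayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "linear_svm": make_pipeline(TfidfVectorizer(), LinearSVC()),
}

cv_scores = {}
for name, model in candidates.items():
    # The vectorizer is refit inside each fold, which avoids test-fold leakage
    scores = cross_val_score(model, texts, labels, cv=3, scoring="f1")
    cv_scores[name] = scores.mean()
    print(f"{name}: mean F1 = {cv_scores[name]:.2f}")
```

A linear SVM is used here because TF-IDF features are high-dimensional and sparse; a non-linear kernel could be swapped in if the linear model underfits.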

**Addressing Class Imbalance**

- Since there is an imbalance (more legitimate emails than spam), special techniques are necessary to prevent the model from becoming biased towards the majority class and performing poorly on the minority class (spam).
- Resampling Techniques:
  - Oversampling: Increase the number of instances in the minority class (spam) by duplicating existing examples or generating synthetic ones (e.g., SMOTE).
  - Undersampling: Reduce the number of instances in the majority class (legitimate) by randomly removing examples. While simpler, this can lead to loss of valuable data.
- Class Weighting: Adjust the weights in the model's loss function to give more importance to the minority class.
- Tree-based Models: Decision trees and ensemble methods like Random Forest and Gradient Boosted Trees are often more robust to class imbalance than some other models, partly because their hierarchical splits can isolate small groups of minority-class examples.
- Anomaly Detection Models: These can be effective when the minority class (spam) is seen as anomalous behavior.
- Justification: Combining oversampling techniques like SMOTE with undersampling, or using tree-based models, can be particularly effective for handling class imbalance in email classification, notes Analytics Vidhya. Experimentation is key to finding the most suitable combination of techniques for the specific dataset.
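
Class weighting is the simplest of these options to try. The sketch below shows how scikit-learn's "balanced" weights are derived for a hypothetical 90/10 split of legitimate and spam emails.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 90 legitimate emails (0) and 10 spam emails (1)
y = np.array([0] * 90 + [1] * 10)

# "balanced" weights are n_samples / (n_classes * count_of_class)
classes = np.array([0, 1])
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weights = {int(c): float(w) for c, w in zip(classes, weights)}
print(class_weights)

# The same weighting can be requested directly from the estimator
weighted_svm = SVC(kernel="linear", class_weight="balanced")
```

Here the spam class receives a weight of 5.0 and the legitimate class about 0.56, so each misclassified spam email costs roughly nine times as much during training.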

**Evaluating Performance**

- Precision: Measures the proportion of correctly identified spam emails out of all emails predicted as spam. It is crucial when minimizing false positives (legitimate emails wrongly classified as spam) is important.
- Recall (Sensitivity): Measures the proportion of actual spam emails that the model correctly identified. It is important when minimizing false negatives (spam emails missed by the filter) is paramount.
- F1-Score: The harmonic mean of precision and recall, providing a balanced evaluation metric. It is particularly useful for imbalanced datasets.
- Confusion Matrix: A table summarizing the number of true positives, false positives, true negatives, and false negatives, providing a detailed view of the model's performance on each class.
- AUC ROC (Area Under the Receiver Operating Characteristic Curve): Measures the model's ability to distinguish between spam and legitimate emails across various classification thresholds. A higher AUC ROC indicates better discrimination power.

**Justification:** While accuracy might seem like a straightforward metric, it can be misleading in the presence of class imbalance. Instead, using precision, recall, and F1-score provides a more comprehensive evaluation, especially for spam detection where both false positives and false negatives have distinct business implications. Visualizing performance with the confusion matrix and AUC ROC curves offers a deeper understanding of the model's strengths and weaknesses.
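
A small worked example shows how these metrics can diverge from accuracy. The labels and predictions below are hypothetical: three spam emails among ten, with one spam missed and one legitimate email flagged.

```python
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score,
)

# Hypothetical labels and predictions: 1 = spam, 0 = not spam
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
matrix = confusion_matrix(y_true, y_pred)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
print(matrix)
```

Accuracy is 0.80, while precision, recall, and F1 for the spam class are each about 0.67, which is the gap that makes accuracy alone misleading on imbalanced data.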

  **Business Impact**

- Reduced User Frustration and Improved Productivity: An effective spam filter minimizes the number of unwanted emails that reach users' inboxes, leading to a less cluttered and more efficient email experience, notes Mailtrap. This saves valuable time users might spend sifting through junk mail.
- Enhanced Security: Accurate spam detection reduces the risk of phishing attacks, malware dissemination, and other cyber threats that often leverage email as an entry point. This protects user data and company systems.
- Improved Deliverability and Reputation: For businesses sending legitimate emails (e.g., transactional emails, newsletters), a good spam filter ensures these emails reach their intended recipients rather than being blocked as false positives. This strengthens client relationships and business reputation, says Paubox Email.
- Resource Optimization: Efficiently filtering out spam reduces the load on email servers and associated infrastructure, potentially leading to cost savings in storage and network bandwidth.
- Actionable Insights: Analyzing spam and legitimate emails can provide valuable insights into evolving spamming techniques and user behavior, which can be used to further refine and improve the spam filter and email security strategies.


