In [4]:
'''

## Question 1: What is a Support Vector Machine (SVM), and how does it work?
A Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression tasks. It works by finding the optimal hyperplane that best separates data points of different classes in a feature space. The goal is to maximize the margin between the hyperplane and the nearest data points from each class, known as support vectors.

In a simple two-dimensional case, the hyperplane is a line that divides the data into two classes. SVM selects the hyperplane that has the largest distance (margin) to the nearest training data point of any class. A larger margin is associated with better generalization and lower classification error on unseen data.
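
As a small illustration of the margin idea, the sketch below compares two candidate separating lines on four made-up 2D points. The points, labels, and the helper name `geometric_margin` are assumptions chosen for illustration and are not part of any library. The line that sits midway between the classes has the larger minimum distance, so it is the one a maximum-margin classifier would prefer.

```python
import math

def geometric_margin(w, b, points, labels):
    # Smallest signed distance from any point to the line w.x + b = 0.
    # A positive result means every point is on its correct side.
    norm = math.hypot(*w)
    return min(
        label * (w[0] * x[0] + w[1] * x[1] + b) / norm
        for x, label in zip(points, labels)
    )

# Hypothetical toy data: class -1 on the left, class +1 on the right
points = [(0.0, 0.0), (0.0, 1.0), (2.0, 0.0), (2.0, 1.0)]
labels = [-1, -1, 1, 1]

# Candidate 1: the vertical line x = 1 (midway between the classes)
margin_centered = geometric_margin((1.0, 0.0), -1.0, points, labels)
# Candidate 2: the vertical line x = 0.5 (closer to class -1)
margin_shifted = geometric_margin((1.0, 0.0), -0.5, points, labels)

print("centered line margin:", margin_centered)
print("shifted line margin:", margin_shifted)
```

Both lines classify every point correctly, but only the centered one keeps both classes equally far away.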

SVM can also handle non-linear classification using a technique called the kernel trick. A kernel function implicitly maps the inputs into a higher-dimensional space where a linear separator can be found, without computing the mapped coordinates explicitly. Common kernels include polynomial, radial basis function (RBF), and sigmoid.

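SVMs are effective in high-dimensional spaces and are memory-efficient because the decision function uses only a subset of the training points (the support vectors). However, training time grows quickly with the number of samples, so they scale poorly to very large datasets, and they are sensitive to the choice of kernel and of parameters such as C (regularization) and gamma (which controls how far the influence of a single training point reaches in the RBF kernel).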


## Question 2: Explain the difference between Hard Margin and Soft Margin SVM.

The difference between Hard Margin and Soft Margin SVM lies in how strictly the model separates the data.

Hard Margin SVM

Hard Margin SVM is used when the data is perfectly linearly separable. It finds the hyperplane that separates the two classes with the maximum possible margin, allowing no data points inside the margin and no misclassifications. While this approach leads to a clean separation, it is very sensitive to outliers and noise. Even a single outlier that lies on the wrong side of the boundary makes the data non-separable, in which case no hard-margin solution exists. This makes hard margin impractical for real-world, noisy datasets.

Soft Margin SVM

Soft Margin SVM allows some misclassifications or margin violations to improve the model’s generalization. It introduces a regularization parameter (C) that controls the trade-off between maximizing the margin and minimizing classification error. A small value of C allows more violations (leading to a wider margin), while a large C penalizes misclassifications more heavily (leading to a tighter fit).

Soft margin SVM is more flexible and robust, especially with overlapping classes or noisy data, making it the preferred choice in most practical applications. It balances the need for accuracy with the ability to generalize well on unseen data.
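
The effect of C can be checked with a short experiment. This is a minimal sketch on synthetic overlapping clusters (the dataset and the two C values are arbitrary choices for illustration). For a linear SVM the margin width is 2 / ||w||, so a smaller C should give a wider margin.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters (synthetic data, chosen only for illustration)
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

margins = {}
for C in (0.01, 100.0):
    model = SVC(kernel="linear", C=C).fit(X, y)
    # For a linear SVM the margin width is 2 / ||w||
    margins[C] = 2.0 / np.linalg.norm(model.coef_)
    print("C =", C,
          "| margin width =", round(float(margins[C]), 3),
          "| support vectors =", int(model.n_support_.sum()))
```

The small-C model tolerates more points inside its wider margin, which is why it typically ends up with more support vectors.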


## Question 3: What is the Kernel Trick in SVM? Give one example of a kernel and explain its use case.

The Kernel Trick in Support Vector Machines (SVM) allows the algorithm to classify data that is not linearly separable by working as if the data had been mapped into a higher-dimensional space. Instead of transforming the data explicitly, the kernel trick uses a kernel function to compute the inner products between data points as if they were in that higher-dimensional space, without actually performing the transformation. This keeps the computation tractable even when the implicit space is very high-dimensional.

Example: Radial Basis Function (RBF) Kernel

The RBF kernel, also known as the Gaussian kernel, is one of the most widely used kernels in SVM. It’s effective when the boundary between classes is curved or complex. The RBF kernel measures similarity based on the distance between data points, giving higher values to closer points.

Use Case:

The RBF kernel is commonly used in applications like image recognition, spam detection, or medical diagnosis, where data cannot be separated with a straight line. For instance, in classifying handwritten digits, the data points (pixel intensities) form patterns that are not linearly separable. The RBF kernel allows SVM to capture those intricate patterns and make accurate predictions.

In short, the kernel trick lets SVM learn non-linear decision boundaries while only ever evaluating kernel values between pairs of points, so the high-dimensional coordinates never have to be computed.
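
To make the idea concrete, here is a minimal sketch in plain Python. It uses a degree-2 polynomial kernel rather than RBF for the first check, because its explicit feature map is small enough to write out (the RBF kernel corresponds to an infinite-dimensional map). The function names and the `gamma=0.5` default are illustrative assumptions, not library APIs.

```python
import math

def poly2_kernel(x, z):
    # Degree-2 polynomial kernel: one dot product in the original 2D space
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def explicit_map(x):
    # The 3D feature map that the kernel above corresponds to
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def rbf_kernel(x, z, gamma=0.5):
    # RBF similarity: 1.0 for identical points, decaying with squared distance
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

x, z = (1.0, 2.0), (3.0, 4.0)

# Same number two ways: via the kernel, and via explicit mapping + dot product
kernel_value = poly2_kernel(x, z)
explicit_value = sum(a * b for a, b in zip(explicit_map(x), explicit_map(z)))
print(kernel_value, explicit_value)  # both equal 121 up to rounding

# The RBF kernel gives higher similarity to closer points
near = rbf_kernel((0.0, 0.0), (0.1, 0.0))
far = rbf_kernel((0.0, 0.0), (3.0, 0.0))
print(near > far)
```

The kernel gives the same value as the explicit 3D dot product while only touching the original two coordinates.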

## Question 4: What is a Naïve Bayes Classifier, and why is it called “naïve”?

A Naïve Bayes Classifier is a simple yet powerful probabilistic machine learning algorithm used for classification tasks. It is based on Bayes’ Theorem, which describes the probability of a class given some features. Naïve Bayes is particularly popular for text classification, such as spam detection, sentiment analysis, and document categorization.

The classifier works by calculating the probability of each class based on the input features and then selecting the class with the highest probability. Despite its simplicity, it often performs surprisingly well, especially with large datasets.

It is called “naïve” because it makes a strong assumption: that all features are independent of each other given the class label. In reality, this assumption is rarely true—for example, in text data, words often appear together in patterns. However, the model still works well in practice even when this assumption is violated.

The Naïve Bayes classifier is fast, requires little training data, and handles high-dimensional inputs well, making it a good choice for initial models or real-time systems. However, it may not perform as well when feature dependencies play a crucial role in determining the outcome, as its assumption of independence can limit accuracy in such cases.
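
The calculation can be traced by hand on a tiny example. The sketch below uses made-up priors and word likelihoods (every number is hypothetical) to score an email containing the words "free" and "win". The "naïve" step is the multiplication of per-word likelihoods as if the words were independent given the class.

```python
# All probabilities below are made-up numbers for illustration
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.5, "win": 0.4},
    "ham": {"free": 0.1, "win": 0.05},
}
email_words = ["free", "win"]

scores = {}
for label, prior in priors.items():
    score = prior
    for word in email_words:
        # Naive step: multiply per-word likelihoods as if words were independent
        score *= likelihoods[label][word]
    scores[label] = score

# Normalize the unnormalized scores to get posterior probabilities
total = sum(scores.values())
posteriors = {label: score / total for label, score in scores.items()}
prediction = max(posteriors, key=posteriors.get)

print(posteriors)
print("Predicted class:", prediction)
```

With these numbers the unnormalized scores are 0.08 for spam and 0.003 for ham, so the posterior for spam is about 0.96.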

## Question 5: Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants. When would you use each one?

Naïve Bayes has several variants, each designed for different types of data: Gaussian, Multinomial, and Bernoulli.

Gaussian Naïve Bayes

This variant assumes that, within each class, each feature follows a normal (Gaussian) distribution. It’s suitable for continuous data, such as age, height, or temperature. Gaussian Naïve Bayes is commonly used in problems like medical diagnosis or sensor data classification, where feature values are real numbers.

Multinomial Naïve Bayes

Multinomial Naïve Bayes is designed for discrete count data, such as the frequency of words in a document. It works well for text classification tasks like spam filtering, news categorization, and sentiment analysis. This model assumes that features (e.g., word counts) are drawn from a multinomial distribution and is particularly effective when dealing with document-term matrices.

Bernoulli Naïve Bayes

Bernoulli Naïve Bayes is used for binary/boolean features, where each feature represents whether a term is present or absent. It’s also used in text classification, but instead of using word frequency, it only considers whether a word appears at least once in a document. It's useful when the presence or absence of a feature matters more than its frequency.

When to use which:

Gaussian: Continuous features

Multinomial: Count data (e.g., word frequencies)

Bernoulli: Binary features (e.g., word presence or absence)
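
The mapping from data type to variant can be sketched with scikit-learn's three estimators. The tiny arrays below are invented purely to show which kind of input each variant expects. They are not a benchmark.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 0, 1, 1])

# Continuous measurements -> GaussianNB
X_cont = np.array([[1.0], [1.2], [5.0], [5.2]])
gauss_pred = GaussianNB().fit(X_cont, y).predict([[1.1], [5.1]])

# Word counts per document -> MultinomialNB
X_counts = np.array([[3, 0], [2, 0], [0, 3], [0, 2]])
multi_pred = MultinomialNB().fit(X_counts, y).predict([[4, 0], [0, 4]])

# Word present (1) or absent (0) -> BernoulliNB
X_binary = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
bern_pred = BernoulliNB().fit(X_binary, y).predict([[1, 0], [0, 1]])

print("Gaussian:", gauss_pred)
print("Multinomial:", multi_pred)
print("Bernoulli:", bern_pred)
```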



'''


In [5]:
#Question 6:   Write a Python program to:
#● Load the Iris dataset
#● Train an SVM Classifier with a linear kernel
#● Print the model's accuracy and support vectors.

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train an SVM classifier with a linear kernel
svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)

# Predict on the test set
y_pred = svm_model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Model Accuracy:", accuracy)
print("Number of Support Vectors for each class:", svm_model.n_support_)
print("Support Vectors:\n", svm_model.support_vectors_)

Model Accuracy: 1.0
Number of Support Vectors for each class: [ 3 11 11]
Support Vectors:
 [[4.8 3.4 1.9 0.2]
 [5.1 3.3 1.7 0.5]
 [4.5 2.3 1.3 0.3]
 [5.6 3.  4.5 1.5]
 [5.4 3.  4.5 1.5]
 [6.7 3.  5.  1.7]
 [5.9 3.2 4.8 1.8]
 [5.1 2.5 3.  1.1]
 [6.  2.7 5.1 1.6]
 [6.3 2.5 4.9 1.5]
 [6.1 2.9 4.7 1.4]
 [6.5 2.8 4.6 1.5]
 [6.9 3.1 4.9 1.5]
 [6.3 2.3 4.4 1.3]
 [6.3 2.5 5.  1.9]
 [6.3 2.8 5.1 1.5]
 [6.3 2.7 4.9 1.8]
 [6.  3.  4.8 1.8]
 [6.  2.2 5.  1.5]
 [6.2 2.8 4.8 1.8]
 [6.5 3.  5.2 2. ]
 [7.2 3.  5.8 1.6]
 [5.6 2.8 4.9 2. ]
 [5.9 3.  5.1 1.8]
 [4.9 2.5 4.5 1.7]]


In [6]:
#Question 7:  Write a Python program to:
#● Load the Breast Cancer dataset
#● Train a Gaussian Naïve Bayes model
#● Print its classification report including precision, recall, and F1-score.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the Gaussian Naïve Bayes model
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Predict on the test set
y_pred = gnb.predict(X_test)

# Print the classification report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))


Classification Report:
              precision    recall  f1-score   support

   malignant       1.00      0.93      0.96        43
      benign       0.96      1.00      0.98        71

    accuracy                           0.97       114
   macro avg       0.98      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114



In [7]:
#Question 8: Write a Python program to:
#● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best
#C and gamma.
#● Print the best hyperparameters and accuracy.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Wine dataset
data = load_wine()
X = data.data
y = data.target

# Split the dataset (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define parameter grid for GridSearchCV
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],
    'kernel': ['rbf']
}

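# Create the SVM model and GridSearchCV
# (features are not scaled here; RBF kernels are sensitive to feature scale,
# so adding a StandardScaler step is worth trying)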
svm = SVC()
grid_search = GridSearchCV(svm, param_grid, cv=5, scoring='accuracy')

# Fit GridSearchCV to the training data
grid_search.fit(X_train, y_train)

# Get the best model
best_model = grid_search.best_estimator_

# Predict on the test set
y_pred = best_model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print the results
print("Best Hyperparameters:", grid_search.best_params_)
print("Test Set Accuracy:", accuracy)

Best Hyperparameters: {'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}
Test Set Accuracy: 0.8333333333333334


In [8]:
#Question 9: Write a Python program to:
#● Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using
#sklearn.datasets.fetch_20newsgroups).
#● Print the model's ROC-AUC score for its predictions.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score

# Load a subset of the 20 Newsgroups dataset (binary classification)
categories = ['sci.space', 'rec.sport.baseball']
newsgroups = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))

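# Convert text to TF-IDF features
# (fitted on the full corpus here for brevity; fitting on the training split
# only would keep test-set vocabulary statistics out of training)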
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(newsgroups.data)
y = newsgroups.target  # 0 or 1

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Multinomial Naïve Bayes classifier
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

# Predict probabilities for ROC-AUC
y_proba = nb_model.predict_proba(X_test)[:, 1]  # Probabilities for the positive class

# Compute ROC-AUC score
roc_auc = roc_auc_score(y_test, y_proba)

# Print the result
print("ROC-AUC Score:", roc_auc)

ROC-AUC Score: 0.9964327574784692


In [10]:
''' Question 10: Imagine you’re working as a data scientist for a company that handles
email communications.
Your task is to automatically classify emails as Spam or Not Spam. The emails may
contain:
● Text with diverse vocabulary
● Potential class imbalance (far more legitimate emails than spam)
● Some incomplete or missing data
Explain the approach you would take to:
● Preprocess the data (e.g. text vectorization, handling missing data)
● Choose and justify an appropriate model (SVM vs. Naïve Bayes)
● Address class imbalance
● Evaluate the performance of your solution with suitable metrics
And explain the business impact of your solution.

1. Data Preprocessing

Text Cleaning & Vectorization:
Emails often contain noisy text — punctuation, stopwords, and diverse vocabulary. I’d start by cleaning the text (removing special characters, lowercasing, removing stopwords if needed) and then convert the text into numerical features using TF-IDF vectorization. TF-IDF helps emphasize important words and reduce the impact of very common words.

Handling Missing Data:
If some emails have missing parts (e.g., empty body or subject), I’d either:

Fill missing values with placeholders (like "missing") to retain info, or

Remove or flag incomplete emails depending on how critical the missing data is.

Feature Engineering:
I might also include additional metadata features like email length, sender reputation, or presence of links to improve the model.
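
The preprocessing steps above can be sketched as follows. This is a minimal example under stated assumptions: the three emails are invented, the helper `clean_email` and the `"missing"` placeholder are hypothetical names, and the link check is a simple substring test rather than a real URL parser.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def clean_email(text, placeholder="missing"):
    # Replace absent or empty emails with a placeholder token
    if text is None or not text.strip():
        return placeholder
    text = text.lower()
    # Replace anything that is not a letter, digit, or space with a space
    text = re.sub(r"[^a-z0-9 ]+", " ", text)
    # Collapse runs of spaces
    return re.sub(r" +", " ", text).strip()

# Hypothetical raw emails, one of them missing
emails = [
    "WIN a FREE prize!!! Click now http://example.com",
    None,
    "Meeting moved to 3pm, see agenda",
]

cleaned = [clean_email(e) for e in emails]

# Simple metadata features computed from the raw text
has_link = [int(e is not None and "http" in e.lower()) for e in emails]
length = [0 if e is None else len(e) for e in emails]

# TF-IDF features from the cleaned text
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(cleaned)

print(cleaned)
print("TF-IDF shape:", tfidf.shape)
print("has_link:", has_link, "length:", length)
```

In a real pipeline the vectorizer would be fitted on the training split only and then applied to new emails with `transform`.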

2. Model Choice: SVM vs. Naïve Bayes

Naïve Bayes is often the go-to for spam detection because it’s fast, handles high-dimensional text well, and performs surprisingly well despite its simplistic assumptions.

SVM with a linear kernel can also perform strongly, especially if the data is well-preprocessed and large enough, as it finds an optimal boundary with good generalization.

Given the class imbalance and diverse vocabulary, I’d start with Naïve Bayes because it’s robust with text data and easier to tune. If needed, I’d try SVM next for potentially improved accuracy.

3. Addressing Class Imbalance

Spam datasets usually have fewer spam emails compared to legitimate ones, which can bias the model.

To tackle this:

Use techniques like SMOTE (Synthetic Minority Over-sampling Technique) or random oversampling to balance the classes in training data.

Alternatively, apply class weighting (e.g., in SVM) to penalize misclassifying the minority class more.

Also, choose metrics sensitive to imbalance (more on that below).
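
As a rough sketch of class weighting, the snippet below computes "balanced" weights with the formula n_samples / (n_classes * class_count), which is the formula scikit-learn uses for `class_weight="balanced"`. The 90/10 split and the helper name `balanced_class_weights` are assumptions for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    # weight = n_samples / (n_classes * count_of_that_class)
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical imbalanced training labels: 90 legitimate emails, 10 spam
labels = ["ham"] * 90 + ["spam"] * 10
weights = balanced_class_weights(labels)
print(weights)
```

Here each spam example counts nine times as much as a legitimate one. A dictionary like this can be passed to `SVC` through its `class_weight` parameter, or `class_weight="balanced"` can be used directly.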

4. Performance Evaluation

Accuracy alone can be misleading due to imbalance, so I’d rely on:

Precision: How many emails flagged as spam are actually spam? (Avoid false positives)

Recall: How many actual spam emails are correctly identified? (Avoid missing spam)

F1-score: Balance between precision and recall.

ROC-AUC or PR-AUC: Measures overall model discrimination power, especially useful with imbalance.

I’d also monitor false positives carefully, since marking legitimate emails as spam can hurt user trust and business.
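
A worked example shows why accuracy alone is not enough. The labels and predictions below are invented: 20 emails, 5 of them spam (label 1), with a classifier that catches 3 spam emails and wrongly flags 1 legitimate email.

```python
# Hypothetical ground truth and predictions (1 = spam, 0 = not spam)
y_true = [1] * 5 + [0] * 15
y_pred = [1, 1, 1, 0, 0] + [1] + [0] * 14

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate flagged as spam
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # legitimate passed through

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("accuracy:", accuracy)
print("precision:", precision)
print("recall:", recall)
print("f1:", round(f1, 3))
```

Accuracy comes out at 0.85, yet recall is only 0.6, meaning 2 of the 5 spam emails got through. That gap is what precision, recall, and F1 are meant to expose.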

5. Business Impact

A reliable spam filter:

Improves user experience by keeping inboxes clean and focused.

Saves time and resources by reducing manual email sorting and spam-related risks.

Protects against phishing and malware, improving company security.

Builds user trust in the company’s communication system.

However, minimizing false positives is critical — incorrectly classifying legitimate emails as spam could result in lost opportunities, frustrated users, and damage to the company’s reputation.

'''

