Question 1: What is a Support Vector Machine (SVM), and how does it work?

Answer-1. A Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression tasks. SVM works by finding the hyperplane that maximally separates the classes in the feature space.

Key Concepts
1. Hyperplane: A decision boundary that separates the classes.
2. Support vectors: The data points that lie closest to the hyperplane and have a significant impact on the classification.
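3. Margin: The distance between the hyperplane and the closest data points of each class (the support vectors). SVM chooses the hyperplane that makes this distance as large as possible.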
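As a rough illustration of these three concepts, the sketch below fits a linear SVM on a tiny 2-D dataset and reads off the hyperplane, the support vectors, and the full margin width 2/||w|| (twice the hyperplane-to-support-vector distance). The four points and the very large C value (used to approximate a hard margin) are assumptions made for this example only.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data (not from the original text)
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([0, 0, 1, 1])

# A very large C approximates a hard-margin SVM
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]          # normal vector of the hyperplane w.x + b = 0
b = clf.intercept_[0]     # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)  # distance between the two margin boundaries

print("w =", w, "b =", b)
print("Support vectors:")
print(clf.support_vectors_)
print("Margin width:", margin_width)
```

The two classes sit on the vertical lines x = 0 and x = 2, so the separating hyperplane should land near x = 1 with a margin width close to 2.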

How SVM Works
1. Data preparation: Prepare the dataset by scaling and encoding the features.
2. Choose a kernel: Choose a suitable kernel function (e.g., linear, polynomial, or radial basis function (RBF)). A non-linear kernel implicitly maps the data into a higher-dimensional space, while a linear kernel works in the original feature space.
3. Find the optimal hyperplane: Use optimization techniques (e.g., quadratic programming) to find the hyperplane that maximizes the margin between the classes.
4. Make predictions: Use the trained SVM model to make predictions on new, unseen data.
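The four steps above can be sketched as a single scikit-learn pipeline. This is a minimal sketch, not a tuned model: the Iris dataset, the RBF kernel, and the values of C and gamma are assumptions picked for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 1 (scaling) and steps 2-3 (kernel choice and margin optimization)
# are chained so the scaler is fitted on the training data only
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X_train, y_train)

# Step 4: predict on unseen data and report accuracy
test_accuracy = model.score(X_test, y_test)
print("Test accuracy:", test_accuracy)
```

Putting the scaler inside the pipeline avoids leaking test-set statistics into training.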


Question 2: Explain the difference between Hard Margin and Soft Margin SVM.

Answer-2. Hard Margin SVM
1. Definition: Hard Margin SVM is a type of SVM that requires all data points to be classified correctly and lie on the correct side of the hyperplane.
2. Assumptions: The data is linearly separable, and there are no misclassifications.
3. Goal: Find the hyperplane that maximizes the margin between the classes.

Soft Margin SVM
1. Definition: Soft Margin SVM is a type of SVM that allows for some misclassifications and data points to lie on the wrong side of the hyperplane.
2. Assumptions: The data may not be linearly separable, and some misclassifications are acceptable.
3. Goal: Find the hyperplane that balances the margin between the classes and the misclassification error.

Key differences
1. Handling data that is not linearly separable: Soft Margin SVM can still be trained when the classes overlap, while Hard Margin SVM has no solution unless the data is linearly separable.
2. Misclassification tolerance: Soft Margin SVM allows for some misclassifications, while Hard Margin SVM does not tolerate any misclassifications.
3. Regularization parameter: Soft Margin SVM introduces a regularization parameter (C) that controls the trade-off between margin maximization and misclassification error.
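To make the role of C concrete, the sketch below trains a linear soft-margin SVM with a small and a large C on overlapping data and counts the support vectors in each case. The synthetic blobs, their centers, and the two C values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Hypothetical overlapping two-class data, so a hard margin would be infeasible
X, y = make_blobs(
    n_samples=200, centers=[[-1, -1], [1, 1]], cluster_std=1.0, random_state=0
)

n_support = {}
for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # n_support_ holds the number of support vectors per class
    n_support[C] = int(clf.n_support_.sum())
    print(f"C={C}: {n_support[C]} support vectors, "
          f"training accuracy {clf.score(X, y):.2f}")
```

A small C tolerates more margin violations and yields a wider margin, so more points end up as support vectors; a large C penalizes violations heavily and moves toward hard-margin behavior.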

Question 3: What is the Kernel Trick in SVM? Give one example of a kernel and explain its use case.

Answer-3. The Kernel Trick is a mathematical technique that lets a Support Vector Machine (SVM) behave as if the data had been mapped into a higher-dimensional space, where classes that are not linearly separable in the original space may become linearly separable. The mapping is never computed explicitly, which would be computationally expensive; instead, a kernel function K(x, y) returns the inner product of the two mapped points directly from the original feature vectors.
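A small numeric check can make the idea tangible. The sketch below uses the degree-2 polynomial kernel K(x, y) = (x . y)^2 as an extra illustration (it is not the example kernel discussed next) and shows that it equals the inner product of an explicit 3-D feature map. The function name phi and the sample vectors are hypothetical.

```python
import numpy as np

def phi(v):
    """Explicit feature map for the degree-2 polynomial kernel in 2-D."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])

explicit = phi(x) @ phi(y)   # inner product after mapping to 3-D
kernel = (x @ y) ** 2        # same value computed in the original 2-D space

print("Explicit mapping:", explicit)
print("Kernel function: ", kernel)
```

Both values agree, but the kernel version never builds the higher-dimensional vectors, which is the whole point of the trick.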

Example: Radial Basis Function (RBF) Kernel
The RBF kernel, also known as the Gaussian kernel, is a popular kernel function used in SVMs. It maps the data to an infinite-dimensional space and is defined as:

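K(x, y) = exp(-γ ||x - y||^2)

where ||x - y|| is the Euclidean distance between x and y, and γ > 0 is a hyperparameter that controls the width of the kernel: a larger γ gives a narrower kernel, so each training point influences a smaller region.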

Use Case
The RBF kernel is useful when the data is non-linearly separable in the original feature space. For example, in image classification tasks, the RBF kernel can help the SVM model capture complex relationships between pixel values and classify images accurately.
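As a quick sanity check of the RBF formula, the sketch below evaluates it by hand with NumPy and compares the result with scikit-learn's rbf_kernel. The helper name rbf, the two sample vectors, and gamma = 0.5 are assumptions made for this example.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf(x, y, gamma):
    """K(x, y) = exp(-gamma * ||x - y||^2), computed directly."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

x = np.array([1.0, 2.0])
y = np.array([2.0, 4.0])
gamma = 0.5

# Here ||x - y||^2 = 1 + 4 = 5, so the kernel value is exp(-2.5)
manual = rbf(x, y, gamma)
library = rbf_kernel(x.reshape(1, -1), y.reshape(1, -1), gamma=gamma)[0, 0]

print("Manual: ", manual)
print("Library:", library)
```

The value is close to 1 for nearby points and decays toward 0 as the points move apart, faster for larger gamma.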


Question 4: What is a Naïve Bayes Classifier, and why is it called “naïve”?

Answer-4. A Naïve Bayes Classifier is a probabilistic machine learning model based on Bayes' theorem. It is used for classification tasks and predicts the probability of a class given a set of features.

Why "Naïve"?
The Naïve Bayes Classifier is called "naïve" because it makes a strong assumption about the independence of features. It assumes that:

1. Features are conditionally independent: Given the class label, the features are independent of each other.
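2. Each feature contributes separately: Because of this, the joint likelihood of all features is simply the product of the per-feature likelihoods. The assumption rarely holds exactly in real data (for example, words in an email are correlated), which is why it is considered "naïve", yet the classifier often performs well in practice.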
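Written out, the independence assumption turns Bayes' theorem into a simple product. In the formula below, y denotes the class label and x_1, ..., x_n the feature values (symbols introduced here for illustration):

```latex
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

The classifier predicts the class with the largest product of the prior P(y) and the per-feature likelihoods P(x_i | y).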


Question 5: Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants. When would you use each one?

Answer-5. Gaussian Naïve Bayes
1. Assumes Gaussian distribution: Assumes that the features follow a Gaussian (normal) distribution.
2. Continuous features: Suitable for continuous features.
3. Use case: Use Gaussian Naïve Bayes for classification problems with continuous features, such as physical measurements (e.g., the Iris or Breast Cancer datasets).

Multinomial Naïve Bayes
1. Assumes multinomial distribution: Assumes that the features follow a multinomial distribution.
2. Discrete features: Suitable for discrete features, such as word counts.
3. Use case: Use Multinomial Naïve Bayes for text classification tasks, such as spam detection, sentiment analysis, or topic classification.

Bernoulli Naïve Bayes
1. Assumes Bernoulli distribution: Assumes that the features follow a Bernoulli distribution.
2. Binary features: Suitable for binary features, such as presence/absence of a word.
3. Use case: Use Bernoulli Naïve Bayes for binary feature classification tasks, such as text classification with binary features.

Choosing the right variant
1. Feature type: Choose the variant based on the type of features:
    - Gaussian for continuous features.
    - Multinomial for discrete features (e.g., word counts).
    - Bernoulli for binary features.
2. Data distribution: Consider the distribution of the features and choose the variant that best matches the data.
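The mapping from feature type to variant can be sketched as follows. The Iris data stands in for continuous measurements, and the four-document toy corpus with its labels is entirely hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Continuous measurements -> Gaussian Naive Bayes
X_iris, y_iris = load_iris(return_X_y=True)
gnb = GaussianNB().fit(X_iris, y_iris)

# Hypothetical toy corpus: 1 = spam, 0 = not spam
docs = ["win money now", "win a free prize now",
        "meeting at noon", "lunch meeting tomorrow"]
labels = [1, 1, 0, 0]

# Word counts -> Multinomial Naive Bayes
counts = CountVectorizer().fit(docs)
mnb = MultinomialNB().fit(counts.transform(docs), labels)

# Word presence/absence -> Bernoulli Naive Bayes
binary = CountVectorizer(binary=True).fit(docs)
bnb = BernoulliNB().fit(binary.transform(docs), labels)

new_doc = ["free money prize"]
mnb_pred = mnb.predict(counts.transform(new_doc))[0]
bnb_pred = bnb.predict(binary.transform(new_doc))[0]

print("GaussianNB training accuracy:", gnb.score(X_iris, y_iris))
print("MultinomialNB prediction:", mnb_pred)
print("BernoulliNB prediction:", bnb_pred)
```

The only difference between the two text models here is the vectorizer setting: raw counts for the multinomial variant and binary indicators for the Bernoulli variant.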

Dataset Options
You can use the following datasets from sklearn.datasets:

1. Iris Dataset: A classic multiclass classification dataset with 150 samples and 4 features.
2. Breast Cancer Dataset: A binary classification dataset with 569 samples and 30 features.
3. Wine Dataset: A multiclass classification dataset with 178 samples and 13 features.

Loading Datasets
You can load these datasets using the following code:


from sklearn.datasets import load_iris, load_breast_cancer, load_wine

# Load Iris dataset
iris = load_iris()

# Load Breast Cancer dataset
breast_cancer = load_breast_cancer()

# Load Wine dataset
wine = load_wine()


Using a CSV File
If you have a CSV file, you can load it using pandas:

import pandas as pd

# Load CSV file
data = pd.read_csv('your_data.csv')


Make sure to replace 'your_data.csv' with the path to your actual CSV file.

Choosing a Dataset
Choose a dataset that suits your needs and experiment with different machine learning algorithms to achieve your goals.

Question 6:   Write a Python program to:

 ● Load the Iris dataset

 ● Train an SVM Classifier with a linear kernel

  ● Print the model's accuracy and support vectors.

  Answer-6.

In [1]:
# Import necessary libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import metrics

# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train an SVM Classifier with a linear kernel
svm_classifier = svm.SVC(kernel='linear')
svm_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = svm_classifier.predict(X_test)

# Calculate and print the model's accuracy
accuracy = metrics.accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

# Print the support vectors
support_vectors = svm_classifier.support_vectors_
print("Number of Support Vectors:", len(support_vectors))
print("Support Vectors:")
print(support_vectors)




Model Accuracy: 1.0
Number of Support Vectors: 25
Support Vectors:
[[4.8 3.4 1.9 0.2]
 [5.1 3.3 1.7 0.5]
 [4.5 2.3 1.3 0.3]
 [5.6 3.  4.5 1.5]
 [5.4 3.  4.5 1.5]
 [6.7 3.  5.  1.7]
 [5.9 3.2 4.8 1.8]
 [5.1 2.5 3.  1.1]
 [6.  2.7 5.1 1.6]
 [6.3 2.5 4.9 1.5]
 [6.1 2.9 4.7 1.4]
 [6.5 2.8 4.6 1.5]
 [6.9 3.1 4.9 1.5]
 [6.3 2.3 4.4 1.3]
 [6.3 2.5 5.  1.9]
 [6.3 2.8 5.1 1.5]
 [6.3 2.7 4.9 1.8]
 [6.  3.  4.8 1.8]
 [6.  2.2 5.  1.5]
 [6.2 2.8 4.8 1.8]
 [6.5 3.  5.2 2. ]
 [7.2 3.  5.8 1.6]
 [5.6 2.8 4.9 2. ]
 [5.9 3.  5.1 1.8]
 [4.9 2.5 4.5 1.7]]


Question 7:  Write a Python program to:

 ● Load the Breast Cancer dataset

  ● Train a Gaussian Naïve Bayes model
  
   ● Print its classification report including precision, recall, and F1-score.

   Answer-7.

In [2]:
# Import necessary libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Load the Breast Cancer dataset
breast_cancer = datasets.load_breast_cancer()
X = breast_cancer.data
y = breast_cancer.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Gaussian Naïve Bayes model
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Make predictions on the test set
y_pred = gnb.predict(X_test)

# Print the classification report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=breast_cancer.target_names))


Classification Report:
              precision    recall  f1-score   support

   malignant       1.00      0.93      0.96        43
      benign       0.96      1.00      0.98        71

    accuracy                           0.97       114
   macro avg       0.98      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114



Question 8: Write a Python program to:

 ● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best C and gamma.

  ● Print the best hyperparameters and accuracy.

  Answer-8.

In [3]:
# Import necessary libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import svm
from sklearn.metrics import accuracy_score

# Load the Wine dataset
wine = datasets.load_wine()
X = wine.data
y = wine.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define hyperparameter tuning space
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': ['scale', 'auto', 0.1, 1, 10]
}

# Perform GridSearchCV
grid_search = GridSearchCV(svm.SVC(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Print the best hyperparameters
print("Best Hyperparameters:", grid_search.best_params_)

# Retrieve the SVM trained with the best hyperparameters
# (GridSearchCV refits it on the full training set by default, refit=True)
best_svm = grid_search.best_estimator_

# Make predictions on the test set
y_pred = best_svm.predict(X_test)

# Calculate and print the accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)


Best Hyperparameters: {'C': 100, 'gamma': 'scale'}
Accuracy: 0.8333333333333334
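The Wine features span very different numeric ranges, and RBF SVMs are sensitive to feature scale. As a hedged variation on the program above (not part of the original answer), the sketch below puts a StandardScaler in a Pipeline so that scaling is refitted inside each cross-validation fold. The step names "scaler" and "svc" are arbitrary labels chosen here.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features, then fit the SVM; both steps are tuned together
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": ["scale", "auto", 0.1, 1, 10],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

scaled_accuracy = search.score(X_test, y_test)
print("Best Hyperparameters:", search.best_params_)
print("Accuracy with scaling:", scaled_accuracy)
```

Comparing this accuracy with the unscaled result is a simple way to see how much feature scaling matters for this dataset.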


Question 9: Write a Python program to:

● Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using sklearn.datasets.fetch_20newsgroups).

 ● Print the model's ROC-AUC score for its predictions.

 Answer-9.

In [4]:
# Import necessary libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score

# Load the 20 Newsgroups dataset
newsgroups = fetch_20newsgroups(subset='all', categories=['alt.atheism', 'talk.religion.misc'])

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(newsgroups.data, newsgroups.target, test_size=0.2, random_state=42)

# Vectorize the text data using TF-IDF
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train a Multinomial Naïve Bayes Classifier
clf = MultinomialNB()
clf.fit(X_train_tfidf, y_train)

# Predict the probability of the positive class (label 1) on the test set
y_pred_proba = clf.predict_proba(X_test_tfidf)[:, 1]

# Calculate and print the ROC-AUC score; for a binary problem, roc_auc_score
# expects 1-D true labels and 1-D scores for the positive class
roc_auc = roc_auc_score(y_test, y_pred_proba)
print("ROC-AUC Score:", roc_auc)

Question 10: Imagine you’re working as a data scientist for a company that handles email communications. Your task is to automatically classify emails as Spam or Not Spam. The emails may contain:

 ● Text with diverse vocabulary
 ● Potential class imbalance (far more legitimate emails than spam)
 ● Some incomplete or missing data

Explain the approach you would take to:

 ● Preprocess the data (e.g. text vectorization, handling missing data)
 ● Choose and justify an appropriate model (SVM vs. Naïve Bayes)
 ● Address class imbalance
 ● Evaluate the performance of your solution with suitable metrics

And explain the business impact of your solution. (Include your Python code and output in the code box below.)

Answer-10. Preprocessing the Data
1. Text Vectorization: Use TF-IDF (Term Frequency-Inverse Document Frequency) to vectorize the text data, which takes into account the importance of words in the entire corpus.
2. Handling Missing Data: Remove or impute missing data, depending on the extent of missingness. For example, drop emails that have no label, and replace a missing email body with an empty string so that the vectorizer can still process it.

Choosing a Model
1. SVM vs. Naïve Bayes: Both SVM and Naïve Bayes can be effective for text classification. Naïve Bayes is fast to train and makes a strong baseline, but its feature-independence assumption rarely holds for words in an email. A linear SVM handles high-dimensional, sparse TF-IDF features well and supports class weighting directly, so it is the choice made here.
2. Justification: SVM's ability to find the optimal hyperplane that maximally separates the classes makes it suitable for text classification tasks, especially when dealing with high-dimensional data.

Addressing Class Imbalance
1. Class Weighting: Use class weighting in SVM to assign more weight to the minority class (Spam), which can help improve the model's performance on the minority class.
2. Oversampling: Alternatively, oversample the minority class or use techniques like SMOTE (Synthetic Minority Over-sampling Technique) to balance the classes.

Evaluating Performance
1. Metrics: Use precision, recall, F1-score, and ROC-AUC score to evaluate the model's performance. These metrics provide a comprehensive understanding of the model's ability to classify emails correctly.
2. Cross-Validation: Use cross-validation to ensure the model's performance is robust and not overfitting to the training data.
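The cross-validation step can be sketched on a self-contained toy corpus. Everything in the data below (the twelve emails, their labels, and the single missing entry) is hypothetical, and LinearSVC is used as a lightweight stand-in for a linear-kernel SVM. With so few examples the scores are not meaningful; the point is only the mechanics.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical emails: 1 = Spam, 0 = Not Spam (imbalanced, one missing body)
raw_emails = [
    "Win a free prize now, click here",
    "Free money, claim your prize today",
    "Click here to win cash now",
    "Claim your free cash prize",
    "Meeting moved to 3pm tomorrow",
    "Please review the attached report",
    "Lunch on Friday?",
    None,
    "Quarterly report is attached",
    "Can we reschedule the meeting",
    "Notes from today's project review",
    "See you at the team lunch",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# Handle missing data: treat a missing body as an empty string
emails = [text if text is not None else "" for text in raw_emails]

# TF-IDF and the classifier are fitted inside each fold, avoiding leakage;
# class_weight="balanced" up-weights the minority (Spam) class
pipeline = make_pipeline(TfidfVectorizer(), LinearSVC(class_weight="balanced"))

# Stratified folds keep the Spam/Not Spam ratio the same in every split
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
f1_scores = cross_val_score(pipeline, emails, labels, cv=cv, scoring="f1")

print("F1 per fold:", f1_scores)
print("Mean F1:", f1_scores.mean())
```

F1 is used as the scoring metric because plain accuracy can look high on imbalanced data even when most spam is missed.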


In [7]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

# Load the dataset (assuming a CSV file with 'email' and 'label' columns,
# where label is 1 for Spam and 0 for Not Spam)
df = pd.read_csv('email_data.csv')

# Handle missing data: drop rows without a label, treat missing text as empty
df = df.dropna(subset=['label'])
df['label'] = df['label'].astype(int)
df['email'] = df['email'].fillna('')

# Split the data into training and testing sets (stratified to keep the class ratio)
X_train, X_test, y_train, y_test = train_test_split(
    df['email'], df['label'], test_size=0.2, random_state=42, stratify=df['label']
)

# Vectorize the text data using TF-IDF
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train a linear SVM; class_weight='balanced' weights each class inversely
# to its frequency, so the minority class (Spam) counts for more
svm_model = svm.SVC(kernel='linear', class_weight='balanced')
svm_model.fit(X_train_tfidf, y_train)

# Make predictions on the test set
y_pred = svm_model.predict(X_test_tfidf)

# Evaluate the model's performance
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# ROC-AUC needs a continuous score: use the signed distance to the hyperplane
y_scores = svm_model.decision_function(X_test_tfidf)
roc_auc = roc_auc_score(y_test, y_scores)

print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)
print("ROC-AUC Score:", roc_auc)


FileNotFoundError: [Errno 2] No such file or directory: 'email_data.csv'
