#Assignment Code: DA-AG-013

#Theoretical & Practical Questions

Question 1: What is a Support Vector Machine (SVM), and how does it work?
* A Support Vector Machine (SVM) is a supervised learning algorithm used for both classification and regression. It works by finding the optimal hyperplane that separates data points of different classes with the maximum margin. The closest points to this boundary, called support vectors, play a key role in defining it. By maximizing the margin, SVM improves generalization and is especially effective in high-dimensional spaces.
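
As a rough illustration of the margin idea above, the sketch below fits a linear SVM on a tiny toy dataset and reads the margin width off the learned weights as 2 / ||w||. The six data points are made up for illustration, and the very large `C` is an assumption used to approximate a hard margin on separable data.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable 2-D data (hypothetical values chosen for illustration)
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard-margin SVM when the data is separable
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]          # normal vector of the separating hyperplane
b = clf.intercept_[0]     # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)

print("w:", w, "b:", b)
print("Margin width:", margin_width)
print("Support vectors:\n", clf.support_vectors_)
```

Only the points closest to the boundary appear in `support_vectors_`; moving any of the other points slightly would leave the hyperplane unchanged.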

Question 2: Explain the difference between Hard Margin and Soft Margin SVM.
* In a Hard Margin SVM, the goal is to find a hyperplane that separates the classes perfectly, with no misclassifications. This works only when the data is linearly separable and is highly sensitive to noise or outliers.
* In a Soft Margin SVM, some points are allowed to fall inside the margin or be misclassified. Slack variables measure these violations, and a penalty parameter (C in scikit-learn) controls the trade-off between a wide margin and few violations. This makes it more robust and suitable for real-world, noisy datasets.
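
The trade-off described above can be seen by varying the penalty parameter `C` on data that is not linearly separable. This is a minimal sketch; the synthetic blobs and the two `C` values are assumptions chosen only to make the contrast visible.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters, so no hyperplane separates them perfectly
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

def fit_and_measure(C):
    """Fit a linear SVM and return (margin width, number of support vectors)."""
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin_width = 2.0 / np.linalg.norm(clf.coef_[0])
    return margin_width, len(clf.support_vectors_)

# Small C: violations are cheap, so the margin is wide (softer margin)
margin_soft, n_sv_soft = fit_and_measure(C=0.001)
# Large C: violations are expensive, so the margin is narrower (closer to hard margin)
margin_hard, n_sv_hard = fit_and_measure(C=100.0)

print("C=0.001 -> margin:", margin_soft, "support vectors:", n_sv_soft)
print("C=100   -> margin:", margin_hard, "support vectors:", n_sv_hard)
```

A smaller `C` never gives a narrower margin than a larger one, and it typically leaves more points on or inside the margin as support vectors.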

Question 3: What is the Kernel Trick in SVM? Give one example of a kernel and explain its use case.
* The Kernel Trick in SVM is a method that enables the algorithm to separate data that is not linearly separable. It works by implicitly mapping the data into a higher-dimensional space using a kernel function, without having to perform the actual transformation. This makes computations efficient while allowing SVM to model complex decision boundaries.
* For example, the Radial Basis Function (RBF) kernel, K(x, z) = exp(-γ‖x − z‖²), corresponds to an implicit infinite-dimensional feature space, making it effective when the class boundary is non-linear. It is a common default for low- to medium-dimensional data such as image features; for sparse, high-dimensional text features a linear kernel is often sufficient.
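
To make the benefit of a kernel concrete, the sketch below compares a linear and an RBF SVM on concentric circles, a dataset where no straight line can separate the classes. The dataset and its parameters are assumptions picked for illustration.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Inner circle = one class, outer circle = the other: not linearly separable
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Same algorithm, different kernels
linear_acc = SVC(kernel="linear").fit(X_train, y_train).score(X_test, y_test)
rbf_acc = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train).score(X_test, y_test)

print("Linear kernel accuracy:", linear_acc)
print("RBF kernel accuracy:", rbf_acc)
```

The RBF model never builds the high-dimensional features explicitly; it only evaluates the kernel between pairs of points.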

Question 4: What is a Naïve Bayes Classifier, and why is it called “naïve”?
* A Naïve Bayes Classifier is a simple machine learning algorithm used for classification. It is based on Bayes’ Theorem, which helps us calculate the probability of something happening given some evidence. The classifier predicts the class that has the highest probability for the given features.
* It is called “naïve” because it assumes that all features are conditionally independent of each other given the class, which usually isn’t true in real life. Even with this simplifying assumption, it often works well in practice, for example in spam email detection or text classification.
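
A small worked example may help here. The sketch below applies Bayes' Theorem by hand: it multiplies the class prior by the per-word likelihoods (the "naïve" independence step) and normalizes. All probabilities are hypothetical numbers, not estimates from real data.

```python
# Hypothetical priors and per-word likelihoods P(word | class)
prior = {"spam": 0.4, "ham": 0.6}
likelihood = {
    "spam": {"prize": 0.30, "meeting": 0.02},
    "ham": {"prize": 0.01, "meeting": 0.20},
}

def posterior(words):
    """Return P(class | words) assuming words are independent given the class."""
    scores = {}
    for label in prior:
        score = prior[label]
        for word in words:
            score *= likelihood[label][word]  # naive independence assumption
        scores[label] = score
    total = sum(scores.values())              # P(words), the normalizing constant
    return {label: score / total for label, score in scores.items()}

result = posterior(["prize"])
predicted = max(result, key=result.get)
print(result, "->", predicted)
```

The classifier returns the class with the largest posterior, which for the single word "prize" is spam under these made-up numbers.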

Question 5: Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants. When would you use each one?
* The Naïve Bayes Classifier comes in different types depending on the kind of data we have:
     * Gaussian Naïve Bayes: Used when the features are numbers that follow a normal distribution (like height, weight, or exam scores).
     * Multinomial Naïve Bayes: Used when features are counts, such as how many times a word appears in a document. This is very common in text classification.
     * Bernoulli Naïve Bayes: Used when features are just yes/no or 0/1 values, like whether a word is present in a document or not.
* In short:
     * Gaussian → numbers/continuous data
     * Multinomial → counts (word frequency)
     * Bernoulli → yes/no features
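
The three variants map directly onto three scikit-learn classes. The sketch below fits each one on a tiny hand-made dataset of the matching feature type; the data values are hypothetical and exist only to show which input each variant expects.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 0, 1, 1])

# Gaussian: continuous measurements (e.g. a height-like feature)
X_cont = np.array([[1.0], [1.2], [3.0], [3.2]])
gnb_pred = GaussianNB().fit(X_cont, y).predict([[1.1], [3.1]])

# Multinomial: non-negative counts (e.g. word frequencies per document)
X_counts = np.array([[3, 0], [2, 1], [0, 3], [1, 2]])
mnb_pred = MultinomialNB().fit(X_counts, y).predict([[4, 0], [0, 4]])

# Bernoulli: binary presence/absence flags (e.g. word present or not)
X_bin = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
bnb_pred = BernoulliNB().fit(X_bin, y).predict([[1, 0], [0, 1]])

print("GaussianNB:", gnb_pred)
print("MultinomialNB:", mnb_pred)
print("BernoulliNB:", bnb_pred)
```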


Dataset Info:
* You can use any suitable datasets like Iris, Breast Cancer, or Wine from sklearn.datasets or a CSV file you have.

Question 6: Write a Python program to:
* Load the Iris dataset
* Train an SVM Classifier with a linear kernel
* Print the model's accuracy and support vectors.
* (Include your Python code and output in the code box below.)


In [1]:
# Answer
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
import pandas as pd

iris = datasets.load_iris()
df = pd.DataFrame(data = iris.data, columns=iris.feature_names)
df['target'] = iris.target

X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

model = SVC(kernel='linear')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Support Vectors:", model.support_vectors_)


Accuracy: 1.0
Support Vectors: [[5.1 3.3 1.7 0.5]
 [4.5 2.3 1.3 0.3]
 [4.8 3.4 1.9 0.2]
 [6.  3.4 4.5 1.6]
 [5.7 2.8 4.5 1.3]
 [6.  2.7 5.1 1.6]
 [6.9 3.1 4.9 1.5]
 [5.9 3.2 4.8 1.8]
 [4.9 2.4 3.3 1. ]
 [6.1 2.9 4.7 1.4]
 [6.7 3.1 4.7 1.5]
 [6.2 2.2 4.5 1.5]
 [6.3 2.5 4.9 1.5]
 [6.2 2.8 4.8 1.8]
 [6.3 2.7 4.9 1.8]
 [6.1 3.  4.9 1.8]
 [6.5 3.2 5.1 2. ]
 [6.  3.  4.8 1.8]
 [5.9 3.  5.1 1.8]
 [4.9 2.5 4.5 1.7]
 [7.2 3.  5.8 1.6]
 [6.3 2.8 5.1 1.5]]


Question 7: Write a Python program to:
* Load the Breast Cancer dataset
* Train a Gaussian Naïve Bayes model
* Print its classification report including precision, recall, and F1-score.
* (Include your Python code and output in the code box below.)

In [2]:
# Answer
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report
import pandas as pd

cancer = load_breast_cancer()

df = pd.DataFrame(data = cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target

X = df.drop('target', axis=1)
y = df['target']


X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)


model = GaussianNB()
model.fit(X_train, y_train)


y_pred = model.predict(X_test)


print("Classification Report:\n")
print(classification_report(y_test, y_pred, target_names=cancer.target_names))


Classification Report:

              precision    recall  f1-score   support

   malignant       0.94      0.92      0.93        63
      benign       0.95      0.96      0.96       108

    accuracy                           0.95       171
   macro avg       0.94      0.94      0.94       171
weighted avg       0.95      0.95      0.95       171



Question 8: Write a Python program to:
* Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best C and gamma.
* Print the best hyperparameters and accuracy.
* (Include your Python code and output in the code box below.)


In [None]:
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load Wine dataset
wine = datasets.load_wine()
X, y = wine.data, wine.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define SVM and parameter grid
param_grid = {'C': [0.1, 1, 10, 100],
              'gamma': [0.001, 0.01, 0.1, 1],
              'kernel': ['rbf']}

# Grid search
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X_train, y_train)

# Best model results
y_pred = grid.best_estimator_.predict(X_test)
print("Best Hyperparameters:", grid.best_params_)
print("Test Accuracy:", accuracy_score(y_test, y_pred))


Best Hyperparameters: {'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}
Test Accuracy: 0.8333333333333334
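
The Wine features are on very different scales, and an RBF SVM is sensitive to that because it relies on distances between points. As a possible refinement (a sketch, not the submitted answer), the grid search can be run over a pipeline that standardizes the features first. The `svc__` prefixes are the parameter names that `make_pipeline` generates for the `SVC` step; the grid values are the same as in the answer above.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaling lives inside the pipeline, so each CV fold is scaled
# using statistics from its own training portion only
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}

grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

test_accuracy = grid.score(X_test, y_test)
print("Best Hyperparameters:", grid.best_params_)
print("Test Accuracy:", test_accuracy)
```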


Question 9: Write a Python program to:
* Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using sklearn.datasets.fetch_20newsgroups).
* Print the model's ROC-AUC score for its predictions.
* (Include your Python code and output in the code box below.)

In [3]:
# Import needed libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# 1. Load the 20 Newsgroups text dataset
X, y = fetch_20newsgroups(subset='all',
                          remove=('headers','footers','quotes'),
                          return_X_y=True)

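# 2. Convert text into numerical features using TF-IDF
# Note: the vectorizer is fitted on the full corpus before the split, so IDF
# statistics from the test documents leak into training; fitting on the
# training split only would give a stricter estimate.
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X = vectorizer.fit_transform(X)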

# 3. Split dataset into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 4. Train a Naïve Bayes Classifier
model = MultinomialNB()
model.fit(X_train, y_train)

# 5. Evaluate using ROC-AUC score
y_prob = model.predict_proba(X_test)
print("ROC-AUC Score:", roc_auc_score(y_test, y_prob, multi_class='ovr'))


ROC-AUC Score: 0.9613427017995614
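
For a multi-class problem, `multi_class='ovr'` scores each class against all the others and, with the default macro averaging, takes the mean of those binary ROC-AUC values. The sketch below checks this by hand on a small made-up probability matrix (three classes, six samples; the numbers are hypothetical).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted class probabilities (rows sum to 1)
y_true = np.array([0, 1, 2, 0, 1, 2])
y_prob = np.array([
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
    [0.4, 0.5, 0.1],
    [0.3, 0.4, 0.3],
    [0.2, 0.2, 0.6],
])

# One-vs-rest by hand: treat class k as positive and everything else as negative
per_class_auc = [
    roc_auc_score((y_true == k).astype(int), y_prob[:, k]) for k in range(3)
]
manual_macro_auc = float(np.mean(per_class_auc))

# The same quantity computed directly by scikit-learn
sklearn_auc = roc_auc_score(y_true, y_prob, multi_class="ovr")

print("Per-class AUC:", per_class_auc)
print("Manual macro average:", manual_macro_auc)
print("roc_auc_score(ovr):", sklearn_auc)
```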


Question 10: Imagine you’re working as a data scientist for a company that handles email communications.
Your task is to automatically classify emails as Spam or Not Spam. The emails may contain:
* Text with diverse vocabulary
* Potential class imbalance (far more legitimate emails than spam)
* Some incomplete or missing data
Explain the approach you would take to:
* Preprocess the data (e.g. text vectorization, handling missing data)
* Choose and justify an appropriate model (SVM vs. Naïve Bayes)
* Address class imbalance
* Evaluate the performance of your solution with suitable metrics
And explain the business impact of your solution.
* (Include your Python code and output in the code box below.)

Answer:
Approach Explanation:
1. Preprocessing the Data
* Handle missing values: Replace missing emails with empty strings.
* Text Vectorization: Convert email text into numerical form using TF-IDF (Term Frequency – Inverse Document Frequency) so that ML models can understand it.
* Lowercasing, removing stopwords, punctuation, etc.

2. Model Choice
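* We’ll use Naïve Bayes because:
  * It suits high-dimensional, sparse text features such as word counts or TF-IDF.
  * It’s simple and fast to train, and has few hyperparameters to tune.
  * A linear SVM is a reasonable alternative and can be more accurate on some corpora, but it usually needs more tuning and does not give probabilities by default.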

3. Balance the Data
* Usually, there are more “Not Spam” emails than “Spam”.
* We use SMOTE to create extra spam samples, so the model learns fairly.

4. Check Performance
* Use Precision, Recall, and F1-score (important for spam detection).
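* Recall on the spam class shows how much spam we catch, and precision shows how often a flagged email really is spam → track both, since a legitimate email sent to the spam folder is a costly false alarm.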

5. Business Value
* Blocks spam → saves time and protects users from scams.
* Keeps important emails safe → avoids false alarms.
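
To illustrate why step 4 relies on precision and recall rather than accuracy, the sketch below scores a deliberately useless classifier that labels every email as not spam. The 95/5 class split is a hypothetical example of imbalance.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical imbalanced labels: 95 legitimate emails (0) and 5 spam (1)
y_true = [0] * 95 + [1] * 5
# A useless "classifier" that predicts not spam for every email
y_pred = [0] * 100

accuracy = accuracy_score(y_true, y_pred)
spam_recall = recall_score(y_true, y_pred, pos_label=1)
# zero_division=0 avoids a warning when no email is predicted as spam
spam_precision = precision_score(y_true, y_pred, pos_label=1, zero_division=0)

print("Accuracy:", accuracy)
print("Spam recall:", spam_recall)
print("Spam precision:", spam_precision)
```

Accuracy looks high only because the majority class dominates, while recall on the spam class exposes that no spam was caught.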

In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE

# Example dataset
data = {
    "text": [
        "Congratulations, you won a lottery prize!",
        "Meeting tomorrow at 10am",
        "Get cheap pills now!!!",
        "Project deadline next week",
        "You have been selected for prize money!!!",
        "Can we reschedule the call?"
    ],
    "label": ["spam", "ham", "spam", "ham", "spam", "ham"]
}
df = pd.DataFrame(data)

# Handle missing values
df["text"] = df["text"].fillna("")

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(df["text"], df["label"], test_size=0.3, random_state=42)

# Text vectorization
vectorizer = TfidfVectorizer(stop_words="english")
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

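# Fix class imbalance with SMOTE (applied to the training data only).
# Note: this toy training split is already balanced (2 ham, 2 spam), so SMOTE
# adds no samples here; on real data the default k_neighbors=5 needs at least
# 6 minority-class samples.
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train_vec, y_train)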

# Train model
model = MultinomialNB()
model.fit(X_train_res, y_train_res)

# Prediction
y_pred = model.predict(X_test_vec)

# Evaluation
print("Classification Report:\n", classification_report(y_test, y_pred))


Classification Report:
               precision    recall  f1-score   support

         ham       1.00      1.00      1.00         1
        spam       1.00      1.00      1.00         1

    accuracy                           1.00         2
   macro avg       1.00      1.00      1.00         2
weighted avg       1.00      1.00      1.00         2

