# ASSIGNMENT:

# Supervised Classification: Decision Trees, SVM, and Naïve Bayes


Question 1 : What is Information Gain, and how is it used in Decision Trees?

Answer 1 : Information Gain (IG) is the "yardstick" that decision tree algorithms such as ID3 use to decide which feature to split on at a given node.

Think of it as a measure of how much a specific question, like "Is the weather sunny?", reduces the uncertainty about the final answer.

1. The Relationship with Entropy
To understand Information Gain, you first need to understand Entropy. Entropy is a measure of "disorder" or "impurity" in a dataset.

* High Entropy: The data is a messy mix (e.g., 50% "Yes" and 50% "No"). You have no idea what the outcome will be.

* Low Entropy: The data is pure (e.g., 100% "Yes"). You are certain about the outcome.

Information Gain is simply the reduction in entropy after you split the data based on a specific attribute.
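
Written out, the two quantities take the following standard form. The symbols are notation introduced here for illustration, not taken from the assignment: $S$ is the dataset at the current node, $p_i$ is the fraction of $S$ belonging to class $i$, $A$ is the attribute being tested, and $S_v$ is the subset of $S$ where $A$ takes the value $v$.

$$H(S) = -\sum_{i} p_i \log_2 p_i$$

$$IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} \, H(S_v)$$

For a 50/50 binary mix, $H(S) = 1$ bit; for a pure node, $H(S) = 0$.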

2. How Decision Trees Use It
When a decision tree is being "trained," it follows these steps:

1) Calculate Entropy for the target variable in the current dataset.

2) Calculate Information Gain for every available feature (e.g., Age, Income, Location).

3) Compare results: The algorithm picks the feature with the highest Information Gain to be the decision node.

4) Split and Repeat: The data is split into subsets based on that feature, and the process repeats for each branch until the data is "pure" or a stopping condition is met.
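
The first three steps can be sketched in plain Python. This is a minimal illustration, not the ID3 implementation itself: the toy `rows` and `labels` are hypothetical, and the helper names `entropy` and `information_gain` are made up for this example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Parent entropy minus the size-weighted entropy of the child subsets."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted


# Hypothetical toy data: do we play outside?
rows = [
    {"weather": "sunny", "windy": "no"},
    {"weather": "sunny", "windy": "yes"},
    {"weather": "rainy", "windy": "no"},
    {"weather": "rainy", "windy": "yes"},
]
labels = ["yes", "yes", "no", "no"]

# Steps 2 and 3: score every feature, then pick the one with the highest gain
gains = {feature: information_gain(rows, labels, feature) for feature in ("weather", "windy")}
best_feature = max(gains, key=gains.get)

print(gains)
print("Best feature to split on:", best_feature)
```

In this toy data, "weather" separates the labels perfectly while "windy" tells us nothing, so "weather" is chosen for the decision node.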


Question 2: What is the difference between Gini Impurity and Entropy?

Answer 2 : Both Gini Impurity and Entropy are metrics used by Decision Trees to decide where to "split" the data. While they often lead to very similar results, they have different mathematical origins and slight behavioral differences.

**Gini Impurity**

* Strengths: Its main advantage is efficiency. Because it uses squared probabilities ($1 - \sum p_i^2$) and no logarithms, it is slightly cheaper to compute, which can add up on massive datasets with millions of rows.

* Weaknesses: It can sometimes be "lazy." It tends to isolate the most frequent class into its own branch, which might result in a slightly less balanced tree in complex scenarios.

* Best Use Case: Large datasets where speed is a priority, or as a general-purpose default.

**Entropy**

* Strengths: It is more sensitive to small changes in class probabilities. Because the log function grows steeply near zero, Entropy penalizes "impurity" more harshly than Gini does. This often leads to more balanced trees.


* Weaknesses: Computational cost. Calculating logarithms for every potential split at every node takes somewhat more CPU time than squaring probabilities.

* Best Use Case: Smaller datasets where you want the most "informative" split possible, or when you are performing exploratory data analysis and want a deeper look at uncertainty.
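
To make the comparison concrete, here is a small sketch that evaluates both measures on a few class distributions. The helper names `gini` and `entropy` and the example distributions are chosen for illustration only.

```python
from math import log2


def gini(probs):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in probs)


def entropy(probs):
    """Shannon entropy in bits; classes with zero probability contribute nothing."""
    return sum(-p * log2(p) for p in probs if p > 0)


# A pure node, a mostly pure node, and a perfectly mixed node
for probs in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]):
    print(probs, "gini =", round(gini(probs), 3), "entropy =", round(entropy(probs), 3))
```

Both measures are 0 for a pure node and reach their maximum for a 50/50 binary mix (0.5 for Gini, 1 bit for entropy), which is why they usually rank candidate splits in a similar order.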

Question 3: What is Pre-Pruning in Decision Trees?

Answer 3 : Pre-Pruning is the "stop early" rule for a decision tree.

When a decision tree grows, it naturally wants to keep splitting until every single data point is perfectly categorized. This often leads to a tree that is too complex, messy, and "overfitted" (meaning it memorizes the training data but fails to predict new data correctly).

Pre-Pruning prevents this by stopping the tree's growth before it becomes too complicated.

* How it Works:

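Think of it like a gardener trimming a hedge as it grows, rather than waiting for it to become a wild mess and cutting it back later. You set specific "stopping rules" before training, such as a maximum depth, a minimum number of samples required to split a node, or a minimum impurity decrease. If a split would violate one of these rules, the tree stops growing that branch and the node becomes a leaf.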
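
In scikit-learn, stopping rules of this kind are passed as constructor arguments to `DecisionTreeClassifier`. The sketch below compares an unrestricted tree with a pre-pruned one on the Iris dataset; the specific values (`max_depth=3`, `min_samples_split=10`, `min_samples_leaf=5`) are arbitrary choices for illustration, not recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Unrestricted tree: keeps splitting until the leaves are pure
full_tree = DecisionTreeClassifier(random_state=42).fit(X, y)

# Pre-pruned tree: a branch stops growing as soon as a stopping rule applies
pruned_tree = DecisionTreeClassifier(
    max_depth=3,           # never grow deeper than 3 levels
    min_samples_split=10,  # a node needs at least 10 samples to be split
    min_samples_leaf=5,    # every leaf must keep at least 5 samples
    random_state=42,
).fit(X, y)

print("Full tree   -> depth:", full_tree.get_depth(), "leaves:", full_tree.get_n_leaves())
print("Pre-pruned  -> depth:", pruned_tree.get_depth(), "leaves:", pruned_tree.get_n_leaves())
```

In practice these values are usually tuned with cross-validation rather than fixed by hand.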

Question 4: Write a Python program to train a Decision Tree Classifier using Gini Impurity as the criterion and print the feature importances (practical).

Hint: Use criterion='gini' in DecisionTreeClassifier and access .feature_importances_.

(Include your Python code and output in the code box below.)

Answer 4 : To build a Decision Tree using Gini Impurity, we use the scikit-learn library. In this example, I'll use the classic Iris Dataset (flower species) because it is built into the library and easy to visualize.

* Python Implementation




from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
import pandas as pd

# 1. Load the dataset
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names

# 2. Initialize the Classifier
# We specify 'gini' as the criterion (though it is the default)
clf = DecisionTreeClassifier(criterion='gini', random_state=42)

# 3. Train (fit) the model
clf.fit(X, y)

# 4. Get Feature Importances
importances = clf.feature_importances_

# 5. Create a clean output
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance Score': importances
}).sort_values(by='Importance Score', ascending=False)

print("Decision Tree Feature Importances (Gini):")
print(feature_importance_df)

Decision Tree Feature Importances (Gini):
             Feature  Importance Score
2  petal length (cm)          0.564056
3   petal width (cm)          0.422611
0  sepal length (cm)          0.013333
1   sepal width (cm)          0.000000


Question 5: What is a Support Vector Machine (SVM)?

Answer 5 : A Support Vector Machine (SVM) is a supervised machine learning algorithm used for both classification (sorting data into categories) and regression (predicting numerical values). For classification, it looks for the boundary that separates the classes with the widest possible gap.

* The Core Components:

To find the best boundary, an SVM uses three key concepts:

1) Hyperplane: This is the actual decision boundary (the line) that separates the classes. In 2D, it's a line; in 3D, it's a plane; and in higher dimensions, it’s called a "hyperplane."


2) Support Vectors: These are the data points located closest to the boundary. They are the most critical points because if you moved them, the boundary would move. They "support" the hyperplane.



3) Margin: This is the "no-man's land" or the gap between the hyperplane and the support vectors. SVM tries to make this gap as wide as possible to ensure the classes are clearly separated.
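
All three components can be read off a fitted linear `SVC`. The sketch below uses a hypothetical, linearly separable 2D dataset; the large `C` value is an assumption made here to approximate a hard margin, and the margin width is computed with the standard formula $2 / \|w\|$.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable 2D points (two classes)
X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C leaves almost no room for margin violations (near hard margin)
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# Hyperplane: all points x where w . x + b = 0
w = clf.coef_[0]
b = clf.intercept_[0]

# Margin: total width of the gap between the two classes
margin_width = 2.0 / np.linalg.norm(w)

print("Hyperplane weights:", w, "bias:", b)
print("Support vectors:\n", clf.support_vectors_)
print("Margin width:", margin_width)
```

Only the points listed in `support_vectors_` determine the boundary; removing any other training point and refitting would leave the hyperplane unchanged.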

Question 6: What is the Kernel Trick in SVM?

Answer 6 : The "Kernel Trick"

Real-world data is rarely perfectly separable by a straight line. Sometimes data points are mixed in a way that requires a curved boundary.

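* The Problem: In the original 2D space, the classes are mixed so that no straight line separates them.

* The Solution: SVM uses a kernel function to behave as if the data had been "lifted" into a higher dimension (like adding a 3rd dimension). The "trick" is that the kernel computes the similarity between points in that higher-dimensional space directly from the original coordinates, so the lifted coordinates never have to be calculated.

* The Result: In this higher dimension, the data becomes separable by a flat plane. When you project that plane back down to 2D, it appears as a curved boundary, such as a circle.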
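
A small worked example may help. The sketch below uses a degree-2 polynomial kernel, chosen here because its explicit 3D mapping is easy to write down (the RBF kernel corresponds to an infinite-dimensional mapping that cannot be listed out). The function names `lift`, `dot`, and `poly_kernel` and the sample points are hypothetical.

```python
from math import isclose, sqrt


def lift(point):
    """Explicit degree-2 feature map from 2D to 3D."""
    x1, x2 = point
    return (x1 * x1, sqrt(2) * x1 * x2, x2 * x2)


def dot(a, b):
    """Ordinary inner product of two equal-length tuples."""
    return sum(ai * bi for ai, bi in zip(a, b))


def poly_kernel(a, b):
    """Degree-2 polynomial kernel, computed entirely in the original 2D space."""
    return dot(a, b) ** 2


a, b = (1.0, 2.0), (3.0, 0.5)
explicit = dot(lift(a), lift(b))  # lift both points to 3D, then take the inner product
implicit = poly_kernel(a, b)      # same value, without ever leaving 2D
print("Kernel matches explicit lift:", isclose(explicit, implicit))

# In the lifted space, first + third coordinate equals the squared radius,
# so points inside a circle and points outside it are split by a flat plane.
inner = [(0.5, 0.0), (0.0, -0.5)]
outer = [(2.0, 0.0), (0.0, 2.0)]
radius_sq = lambda p: lift(p)[0] + lift(p)[2]
print("Separable by a plane:", all(radius_sq(p) < 1.0 for p in inner) and all(radius_sq(p) > 1.0 for p in outer))
```

The first check shows the "trick" itself: the kernel returns the lifted inner product without constructing the lifted vectors. The second shows why lifting helps: a circular boundary in 2D becomes a flat one in 3D.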

Question 7: Write a Python program to train two SVM classifiers with Linear and RBF kernels on the Wine dataset, then compare their accuracies.

Hint: Use SVC(kernel='linear') and SVC(kernel='rbf'), then compare accuracy scores after fitting on the same dataset.
(Include your Python code and output in the code box below.)

Answer 7 : To compare the performance of different SVM kernels, we use the Wine dataset, which contains chemical analysis results of wines grown in the same region in Italy.

* Python Implementation

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# 1. Load the Wine dataset
wine = load_wine()
X, y = wine.data, wine.target

# 2. Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Train SVM with Linear Kernel
linear_svc = SVC(kernel='linear', random_state=42)
linear_svc.fit(X_train, y_train)
linear_preds = linear_svc.predict(X_test)
linear_accuracy = accuracy_score(y_test, linear_preds)

# 4. Train SVM with RBF (Radial Basis Function) Kernel
rbf_svc = SVC(kernel='rbf', random_state=42)
rbf_svc.fit(X_train, y_train)
rbf_preds = rbf_svc.predict(X_test)
rbf_accuracy = accuracy_score(y_test, rbf_preds)

# 5. Compare the results
print(f"Accuracy with Linear Kernel: {linear_accuracy:.4f}")
print(f"Accuracy with RBF Kernel: {rbf_accuracy:.4f}")

Accuracy with Linear Kernel: 0.9815
Accuracy with RBF Kernel: 0.7593
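
One likely reason for the gap is that the features were not scaled. The RBF kernel is based on distances between points, and the Wine features span very different ranges, so the largest-valued features dominate. The sketch below is a possible follow-up rather than part of the required answer: it reuses the same split and puts a `StandardScaler` in front of the RBF classifier.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Same data and split as in the answer above
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# The pipeline fits the scaler on the training data only, then applies it to the test data
scaled_rbf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scaled_rbf.fit(X_train, y_train)

scaled_accuracy = scaled_rbf.score(X_test, y_test)
print(f"Accuracy with RBF Kernel (scaled features): {scaled_accuracy:.4f}")
```

Using a pipeline keeps the test set out of the scaling statistics, which avoids leaking information from test to train.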


Question 8: What is the Naïve Bayes classifier, and why is it called "Naïve"?

Answer 8 : The Naïve Bayes classifier is a popular machine learning algorithm used primarily for classification tasks, such as sorting emails into "Spam" or "Not Spam." It is based on Bayes' Theorem, a formula for updating the probability of an event given prior knowledge and new evidence.

**Why is it called "Naïve"?**

The "Naïve" part comes from a very bold (and usually unrealistic) assumption the algorithm makes: Independence.

The algorithm assumes that, once the class is known, every feature is independent of every other feature.

* In a "Non-Naïve" world: If you see the word "Credit," there is a high chance the next word is "Card." The words are dependent on each other.

* In the "Naïve" world: The algorithm treats "Credit" and "Card" as if they have absolutely nothing to do with each other. It ignores the context and the relationship between features.
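
The independence assumption is what makes the calculation simple: the score for each class is the prior multiplied by one per-word probability at a time. The sketch below shows this with hypothetical numbers invented purely for illustration; they do not come from any real dataset.

```python
# Hypothetical probabilities, invented for illustration only
prior = {"spam": 0.4, "ham": 0.6}
likelihood = {
    "spam": {"credit": 0.30, "card": 0.25},  # P(word | spam)
    "ham": {"credit": 0.05, "card": 0.04},   # P(word | ham)
}


def posterior(words):
    """Return P(class | words) under the naive independence assumption."""
    scores = {}
    for label in prior:
        score = prior[label]
        for word in words:
            # Each word contributes its own factor, ignoring all other words
            score *= likelihood[label][word]
        scores[label] = score
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


result = posterior(["credit", "card"])
print(result)
```

Note that "credit" and "card" are multiplied in separately, as if seeing one told us nothing about the other within a class.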

Question 9: Explain the differences between Gaussian Naïve Bayes, Multinomial Naïve
Bayes, and Bernoulli Naïve Bayes

Answer 9 : While all three algorithms use Bayes' Theorem, they are designed for different types of data. The main difference lies in the distribution (the shape or pattern) of the features you are feeding into the model.

**1. Gaussian Naïve Bayes (GNB)**:

This version is used when your features are continuous (numbers that can have decimals) and are assumed to follow a Normal (Gaussian) Distribution (the "bell curve") within each class.

* Data Type: Real-world measurements like height, weight, temperature, or blood pressure.

* How it works: It calculates the mean and variance of each feature for each class, then uses the bell curve to estimate how likely a new value is under each class.

Example: Predicting if a person is an athlete based on their height and weight.

**2. Multinomial Naïve Bayes (MNB)**:

This is the "go-to" algorithm for text classification. It is used when your data represents counts or frequencies.

* Data Type: Discrete counts (integers). In text, this is usually the "Word Count" (how many times the word "Win" appears in an email).

* How it works: It looks at the frequency of events. It doesn't care just if a word is there, but how many times it appears.

Example: Sorting news articles into categories like "Sports," "Politics," or "Tech" based on word frequencies.

**3. Bernoulli Naïve Bayes (BNB)**:

This version is used when your features are binary (Yes/No, 1/0). It only cares about whether a feature exists or not.

* Data Type: Boolean values. In text, this is "Presence vs. Absence" (is the word "Winner" in this email? Yes or No).

* How it works: It ignores how many times a word appears. It treats 1 occurrence the same as 10 occurrences.

Example: Simple spam filters where the mere presence of a "trigger word" is enough to flag the message, regardless of its frequency.
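
The three variants map onto three scikit-learn classes. The sketch below fits each one on tiny hypothetical arrays (made-up measurements and word counts) just to show which kind of input each expects. `binarize=0.0` is the setting that tells `BernoulliNB` to treat any count above zero as "present".

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])

# Gaussian: continuous measurements (hypothetical height in cm, weight in kg)
X_continuous = np.array([[160.0, 55.0], [165.0, 60.0], [185.0, 85.0], [190.0, 90.0]])
gnb = GaussianNB().fit(X_continuous, y)

# Multinomial: word counts per document (hypothetical 3-word vocabulary)
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])
mnb = MultinomialNB().fit(X_counts, y)

# Bernoulli: the same counts, reduced to presence (1) or absence (0)
bnb = BernoulliNB(binarize=0.0).fit(X_counts, y)

print("GaussianNB:", gnb.predict([[188.0, 88.0]]))
print("MultinomialNB:", mnb.predict([[0, 2, 0]]))

# Bernoulli sees 9 occurrences and 1 occurrence of a word as the same input
same = np.allclose(bnb.predict_proba([[0, 9, 0]]), bnb.predict_proba([[0, 1, 0]]))
print("BernoulliNB ignores frequency:", same)
```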

Question 10: Breast Cancer Dataset

Write a Python program to train a Gaussian Naïve Bayes classifier on the Breast Cancer
dataset and evaluate accuracy.

Hint: Use GaussianNB() from sklearn.naive_bayes and the Breast Cancer dataset from sklearn.datasets.

(Include your Python code and output in the code box below.)



from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# 1. Load the Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# 2. Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Initialize the Gaussian Naive Bayes classifier
model = GaussianNB()

# 4. Train the model
model.fit(X_train, y_train)

# 5. Make predictions and evaluate accuracy
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy on Breast Cancer Dataset: {accuracy:.4f}")

Accuracy on Breast Cancer Dataset: 0.9737
