
**Question 1: What is Information Gain, and how is it used in Decision Trees?**

Information Gain measures the reduction in uncertainty (entropy) about a dataset's target variable after splitting it on a particular feature. In decision trees, it is used at each node to select the attribute that provides the most "information," meaning the one that best separates the data into purer subsets, leading to the most effective split. The attribute with the highest information gain is chosen to create the new branches at that node.

How it's used in decision trees

Calculate entropy: First, the initial uncertainty (entropy) of the entire dataset is calculated.
Calculate gain for each feature: For every potential feature to split on, the algorithm calculates the weighted average entropy of the resulting subsets. The information gain is the difference between the original entropy and this new, averaged entropy.
Select the best feature: The algorithm selects the feature that yields the highest information gain. This is the feature that reduces the most uncertainty and creates the purest child nodes.
Repeat: This process is repeated recursively for each new node until a stopping criterion is met, such as a node becoming pure (e.g., all instances belong to the same class) or a maximum tree depth being reached.
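
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation of any particular library; the helper names `entropy` and `information_gain` and the toy `outlook`/`windy` features are hypothetical and chosen only for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return sum(
        -(count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(feature_values, labels):
    """Parent entropy minus the weighted average entropy of the subsets
    obtained by grouping the labels by feature value."""
    total = len(labels)
    subsets = {}
    for value, label in zip(feature_values, labels):
        subsets.setdefault(value, []).append(label)
    weighted = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted


# Hypothetical toy data: 'outlook' separates the classes perfectly,
# while 'windy' leaves both subsets as mixed as the parent.
labels = ["yes", "yes", "no", "no"]
outlook = ["sunny", "sunny", "rainy", "rainy"]
windy = [True, False, True, False]

gain_outlook = information_gain(outlook, labels)
gain_windy = information_gain(windy, labels)
print("Gain(outlook):", gain_outlook)
print("Gain(windy):", gain_windy)
```

A tree builder following the procedure above would split on `outlook` here, because it has the larger gain.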


**Question 2: What is the difference between Gini Impurity and Entropy?**

**Hint: Directly compares the two main impurity measures highlighting strengths, weaknesses, and appropriate use cases.**


1️⃣ Definition:

Gini Impurity: Measures how often a randomly chosen element would be incorrectly classified if it was randomly labeled according to the class distribution in that node.

Entropy: Measures the amount of disorder or uncertainty in the class labels within a node.

2️⃣ Formula:

Gini:

$$Gini = 1 - \sum_i p_i^2$$


Entropy:

$$Entropy = -\sum_i p_i \log_2(p_i)$$

3️⃣ Range:

Gini: Ranges from 0 (pure node) to 0.5 (maximum impurity for two classes).

Entropy: Ranges from 0 (pure node) to 1 (maximum impurity for two classes).

4️⃣ Computation Speed:

Gini: Faster to compute since it doesn't use logarithms.

Entropy: Slightly slower because it involves log calculations.

5️⃣ Interpretation:

Gini: Tends to favor splits that isolate the most frequent class in its own branch.

Entropy: An information-theoretic measure that, because of the logarithm, is somewhat more sensitive to small changes in class probabilities. In practice the two criteria usually produce very similar trees.

6️⃣ Common Use in Algorithms:

Gini: Used by CART (Classification and Regression Trees).

Entropy: Used by ID3 and C4.5 decision tree algorithms.

7️⃣ When to Use:

Gini: Prefer when speed and simplicity are priorities.

Entropy: Prefer when you want a more information-theoretic and detailed measure of impurity.
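
To make the comparison concrete, the short sketch below evaluates both measures on a few two-class distributions. The function names `gini_impurity` and `entropy_from_probs` are hypothetical helpers written for this example, and the probability lists are illustrative values rather than data from a real tree.

```python
import math


def gini_impurity(probs):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in probs)


def entropy_from_probs(probs):
    """Shannon entropy in bits; classes with probability 0 contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


# Pure node, mostly pure node, and maximally mixed node (two classes).
for probs in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]):
    print(probs, round(gini_impurity(probs), 4), round(entropy_from_probs(probs), 4))
```

Both measures are 0 for the pure node and reach their two-class maximum (0.5 for Gini, 1 for entropy) at the 50/50 split, which matches the ranges listed above.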

**Question 3: What is Pre-Pruning in Decision Trees?**

Pre-pruning, or early stopping, is a technique in which the growth of a decision tree is halted before the tree is fully developed. It reduces overfitting by applying stopping criteria at each node before it is split. Common stopping conditions include reaching a maximum depth, having too few samples in a node, or a candidate split failing to reduce impurity by a minimum amount.
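
As a rough illustration, scikit-learn's `DecisionTreeClassifier` exposes these stopping conditions as constructor parameters (`max_depth`, `min_samples_split`, `min_impurity_decrease`). The sketch below compares an unrestricted tree with a pre-pruned one on the Iris dataset; the specific threshold values are arbitrary assumptions for demonstration, not recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Unrestricted tree: grows until every leaf is pure.
full_tree = DecisionTreeClassifier(random_state=42).fit(X, y)

# Pre-pruned tree: growth stops early when any criterion is hit.
pruned_tree = DecisionTreeClassifier(
    max_depth=3,                 # stop at depth 3
    min_samples_split=10,        # do not split nodes with fewer than 10 samples
    min_impurity_decrease=0.01,  # skip splits that barely reduce impurity
    random_state=42,
).fit(X, y)

print("Full tree   depth:", full_tree.get_depth(), "leaves:", full_tree.get_n_leaves())
print("Pre-pruned  depth:", pruned_tree.get_depth(), "leaves:", pruned_tree.get_n_leaves())
```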

**Question 4: Write a Python program to train a Decision Tree Classifier using Gini Impurity as the criterion and print the feature importances (practical).**

**Hint: Use criterion='gini' in DecisionTreeClassifier and access .feature_importances_.**
**(Include your Python code and output in the code box below.)**


```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris dataset as an example
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Decision Tree Classifier with Gini impurity as the criterion
dt_classifier = DecisionTreeClassifier(criterion='gini', random_state=42)

# Train the classifier
dt_classifier.fit(X_train, y_train)

# Get the feature importances
feature_importances = dt_classifier.feature_importances_

# Print the feature importances
print("Feature Importances:")
for name, importance in zip(feature_names, feature_importances):
    print(f"{name}: {importance:.4f}")
```



Output:

```
Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0167
petal length (cm): 0.9061
petal width (cm): 0.0772
```


**Question 5: What is a Support Vector Machine (SVM)?**

A Support Vector Machine (SVM) is a supervised machine learning algorithm used for both classification and regression tasks. Its primary goal is to find the optimal decision boundary, known as a hyperplane, that separates data points of different classes with the largest possible margin.


Hyperplane: The decision boundary separating data points of different classes. This is a line in 2D, a plane in 3D, and a hyperplane in higher dimensions.

Margin: The distance between the hyperplane and the closest data points from each class. The algorithm aims to maximize this distance for better performance.

Support Vectors: The data points closest to the hyperplane that determine its position.

Kernel Trick: A method that uses kernel functions to implicitly map non-linearly separable data into a higher-dimensional space where a linear separation is possible, without computing the transformation explicitly.
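
The terms above can be inspected on a fitted model. The sketch below trains a linear `SVC` on a small, hypothetical 2D dataset (the six points are made up for illustration) and reads back the hyperplane, the support vectors, and the margin. The large `C` value is an assumption used to approximate a hard-margin classifier.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data: two clusters in 2D.
X = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0],
              [3.0, 3.0], [4.0, 4.0], [4.0, 3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1000.0).fit(X, y)

# Hyperplane: all points x with w . x + b = 0 (coef_ is available for linear kernels).
w = clf.coef_[0]
b = clf.intercept_[0]

# Margin as defined above: distance from the hyperplane to the closest points.
margin = 1.0 / np.linalg.norm(w)

print("w:", w, "b:", b)
print("Support vectors:\n", clf.support_vectors_)
print("Margin (hyperplane to closest point):", margin)
```

Only the points closest to the boundary appear in `support_vectors_`; moving any of the other points (without crossing the margin) would leave the hyperplane unchanged.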

**Question 6: What is the Kernel Trick in SVM?**

The Kernel Trick is a method used in Support Vector Machines (SVMs) that allows for the classification of non-linear data by implicitly mapping it to a higher-dimensional space. Instead of explicitly transforming the data, the trick uses a kernel function to compute the dot product of pairs of data points as if they were in that higher-dimensional space, which is computationally efficient. This allows the algorithm to find a linear separator in the higher-dimensional space that corresponds to a non-linear decision boundary in the original space.
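
A small numerical check can make this tangible. The sketch below uses the degree-2 polynomial kernel $K(x, z) = (x \cdot z)^2$ in 2D as an example (chosen here for simplicity; it is not the RBF kernel) and shows that it equals the dot product of an explicit three-dimensional feature map. The helper names `phi`, `dot`, and `poly2_kernel` are hypothetical.

```python
import math


def phi(v):
    """Explicit feature map for the degree-2 polynomial kernel in 2D:
    (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = v
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def dot(a, b):
    """Ordinary dot product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def poly2_kernel(a, b):
    """Kernel evaluated in the original 2D space: (a . b)^2."""
    return dot(a, b) ** 2


x = (1.0, 2.0)
z = (3.0, 4.0)

explicit = dot(phi(x), phi(z))  # map to 3D first, then take the dot product
implicit = poly2_kernel(x, z)   # never leaves the original 2D space

print("Explicit mapping:", explicit)
print("Kernel function: ", implicit)
```

Both routes give the same value, but the kernel route never constructs the higher-dimensional vectors, which is what makes kernels with very large (or infinite) feature spaces practical.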

**Question 7: Write a Python program to train two SVM classifiers with Linear and RBF kernels on the Wine dataset, then compare their accuracies.**

**Hint: Use SVC(kernel='linear') and SVC(kernel='rbf'), then compare accuracy scores after fitting on the same dataset.**
**(Include your Python code and output in the code box below.)**

```python
# Import necessary libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load Wine dataset
data = load_wine()
X = data.data
y = data.target

# Split data into 70% training and 30% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Feature scaling (important for SVM)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train SVM with Linear Kernel
svm_linear = SVC(kernel='linear', random_state=42)
svm_linear.fit(X_train, y_train)
y_pred_linear = svm_linear.predict(X_test)
acc_linear = accuracy_score(y_test, y_pred_linear)

# Train SVM with RBF Kernel
svm_rbf = SVC(kernel='rbf', random_state=42)
svm_rbf.fit(X_train, y_train)
y_pred_rbf = svm_rbf.predict(X_test)
acc_rbf = accuracy_score(y_test, y_pred_rbf)

# Print accuracy comparison
print("SVM with Linear Kernel Accuracy:", acc_linear)
print("SVM with RBF Kernel Accuracy:", acc_rbf)

# Compare which performed better
if acc_linear > acc_rbf:
    print("\nLinear Kernel performed better.")
elif acc_rbf > acc_linear:
    print("\nRBF Kernel performed better.")
else:
    print("\nBoth kernels performed equally well.")
```


Output:

```
SVM with Linear Kernel Accuracy: 0.9629629629629629
SVM with RBF Kernel Accuracy: 0.9814814814814815

RBF Kernel performed better.
```


**Question 8: What is the Naïve Bayes classifier, and why is it called "Naïve"?**

The Naïve Bayes classifier is a probability-based machine learning algorithm that uses Bayes' Theorem to classify data points. It is called "naïve" because it makes a strong and often unrealistic assumption that all features are independent of each other given the class, meaning that once the class is known, the value of one feature tells you nothing about the others.

What it is
A supervised classification algorithm: It is used for both binary and multi-class classification problems.
Based on Bayes' Theorem: The algorithm uses this theorem to calculate the probability of a class given a set of features. It predicts the class with the highest probability.
Effective in practice: Despite its simplistic assumption, it performs well in many real-world applications, such as spam filtering and text categorization.

Why it is called "naïve"

The "naïve" assumption: The core assumption is that all features used for classification are independent of each other, given the class.
Unrealistic in real-world data: This is often not true in reality, as features are frequently correlated. For example, in a text document, the presence of one word might make another word more likely to be present.

Simplifies computation: The independence assumption simplifies the complex probability calculations, making the algorithm easier and faster to compute, even though it is an oversimplification.
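
The effect of the independence assumption can be seen in a tiny hand-built example: the joint likelihood of several features reduces to a product of per-feature likelihoods. All probabilities below are made-up illustrative values, and the `posterior` helper is hypothetical.

```python
# Hypothetical priors P(class) and per-word likelihoods P(word | class).
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.8, "meeting": 0.1},
    "ham": {"free": 0.1, "meeting": 0.5},
}


def posterior(words):
    """Return P(class | words) under the naive independence assumption."""
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for word in words:
            # Independence given the class: multiply per-word likelihoods.
            score *= likelihoods[cls][word]
        scores[cls] = score
    evidence = sum(scores.values())  # normalizing constant P(words)
    return {cls: score / evidence for cls, score in scores.items()}


result = posterior(["free", "meeting"])
predicted = max(result, key=result.get)
print(result)
print("Predicted class:", predicted)
```

Without the independence assumption, the model would need an estimate of the joint probability of every word combination per class, which grows exponentially with the number of features.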

**Question 9: Explain the differences between Gaussian Naïve Bayes, Multinomial Naïve Bayes, and Bernoulli Naïve Bayes**

1️⃣ Gaussian Naïve Bayes

Used for: Continuous (real-valued) features
Assumption: Each feature follows a normal (Gaussian) distribution within each class.

Example use case:

Predicting based on continuous data like height, weight, or age.

Commonly used in iris classification, medical diagnosis, etc.

Formula:

$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right)$$

Key idea:
Uses the mean (μ) and variance (σ²) of each feature for each class.
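
As a minimal sketch of that idea, the function below evaluates the Gaussian likelihood for a single feature, using a mean and variance estimated from a handful of hypothetical per-class samples. The names `gaussian_likelihood` and `samples_class_a` are assumptions for this example; the population variance is used here as one reasonable choice of estimator.

```python
import math
import statistics


def gaussian_likelihood(x, mean, var):
    """Gaussian density of x for a class with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


# Hypothetical values of one feature for the training samples of one class.
samples_class_a = [1.0, 2.0, 3.0]
mean_a = statistics.fmean(samples_class_a)
var_a = statistics.pvariance(samples_class_a)

near = gaussian_likelihood(2.1, mean_a, var_a)  # close to the class mean
far = gaussian_likelihood(6.0, mean_a, var_a)   # far from the class mean
print("mean:", mean_a, "variance:", var_a)
print("likelihood near mean:", near)
print("likelihood far from mean:", far)
```

Values near the class mean get a much higher likelihood than values far from it, and the classifier multiplies these per-feature likelihoods together with the class prior.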

2️⃣ Multinomial Naïve Bayes

Used for: Discrete or count-based features
Assumption: Features represent frequency or count data (non-negative integers).

Example use case:

Text classification (spam detection, sentiment analysis).

Word counts in documents. TF-IDF scores are fractional rather than integer counts, but they often work in practice as well.

Key idea:
Estimates the probability of a feature (e.g., a word) occurring within a class, based on counts.

3️⃣ Bernoulli Naïve Bayes

Used for: Binary/Boolean features (0 or 1)
Assumption: Each feature represents a yes/no or present/absent condition.

Example use case:

Text classification where only the presence or absence of a word matters (not the count).

Suitable for binary feature vectors.
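
The three variants correspond to `GaussianNB`, `MultinomialNB`, and `BernoulliNB` in `sklearn.naive_bayes`. The sketch below pairs each one with the kind of feature it expects; all of the small arrays are hypothetical toy data invented for this illustration.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 0, 1, 1, 1])

# Continuous measurements -> Gaussian Naive Bayes.
X_continuous = np.array([[1.0], [1.2], [0.8], [3.0], [3.2], [2.8]])
gnb = GaussianNB().fit(X_continuous, y)
pred_gaussian = gnb.predict(np.array([[1.1], [2.9]]))

# Non-negative counts (e.g., word frequencies) -> Multinomial Naive Bayes.
X_counts = np.array([[3, 0], [2, 0], [4, 1], [0, 3], [1, 4], [0, 2]])
mnb = MultinomialNB().fit(X_counts, y)
pred_multinomial = mnb.predict(np.array([[3, 0], [0, 3]]))

# Binary presence/absence indicators -> Bernoulli Naive Bayes.
X_binary = np.array([[1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [1, 1]])
bnb = BernoulliNB().fit(X_binary, y)
pred_bernoulli = bnb.predict(np.array([[1, 0], [0, 1]]))

print("GaussianNB:   ", pred_gaussian)
print("MultinomialNB:", pred_multinomial)
print("BernoulliNB:  ", pred_bernoulli)
```

Choosing the variant is mostly a matter of matching the likelihood model to the feature type, since the rest of the classifier (priors and the product of per-feature likelihoods) is the same.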

**Question 10: Breast Cancer Dataset**
**Write a Python program to train a Gaussian Naïve Bayes classifier on the Breast Cancer dataset and evaluate accuracy.**

**Hint: Use GaussianNB() from sklearn.naive_bayes and the Breast Cancer dataset from sklearn.datasets.**
**(Include your Python code and output in the code box below.)**


```python
# Import necessary libraries
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load the Breast Cancer dataset
breast_cancer = load_breast_cancer()
X = breast_cancer.data  # Features
y = breast_cancer.target  # Target labels

# Split the dataset into training and testing sets (80% train, 20% test)
# We use a fixed random_state for reproducibility of results
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Gaussian Naive Bayes classifier
gnb = GaussianNB()

# Train the classifier on the training data
gnb.fit(X_train, y_train)

# Make predictions on the test data
y_pred = gnb.predict(X_test)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)

# Print the results
print(f"Number of samples in the dataset: {len(X)}")
print(f"Number of training samples: {len(X_train)}")
print(f"Number of testing samples: {len(X_test)}")
print("-" * 30)
print(f"Accuracy of the Gaussian Naive Bayes classifier: {accuracy:.4f}")
print("-" * 30)
```



Output:

```
Number of samples in the dataset: 569
Number of training samples: 455
Number of testing samples: 114
------------------------------
Accuracy of the Gaussian Naive Bayes classifier: 0.9737
------------------------------
```
