1. What is Information Gain, and how is it used in Decision Trees?


Information Gain measures how much "uncertainty" (entropy) is reduced when a dataset is split on a particular feature. It equals the entropy of the parent node minus the size-weighted average entropy of the child nodes. In decision trees, it is used to decide which feature to split on at each step: the split that produces the purest subsets is preferred.


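Decision Trees build classification models by recursively splitting data into subsets (regression trees work the same way but typically score splits by variance reduction rather than entropy). Information Gain guides this process: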
- Calculate Entropy of the Dataset
   - Example: If we’re classifying animals into "Mammal" vs. "Bird," entropy is high if the dataset is mixed.
- Evaluate Each Feature
   - For each candidate feature (e.g., "Has Wings," "Gives Milk"), calculate the Information Gain.
- Choose the Best Split
   - The feature with the highest Information Gain is chosen because it reduces uncertainty the most.
- Repeat Recursively
   - The process continues until subsets are pure (entropy = 0) or stopping criteria are met (e.g., max depth).
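
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not how any particular library implements it: the four-animal dataset and the "is_brown" feature are made up for the example, and `entropy` and `information_gain` are hypothetical helper names.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(labels, feature_values):
    """Parent entropy minus the size-weighted entropy of the subsets created by the split."""
    total = len(labels)
    subsets = {}
    for value, label in zip(feature_values, labels):
        subsets.setdefault(value, []).append(label)
    weighted = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted

# Toy data (assumed for illustration): two mammals and two birds
labels = ["Mammal", "Mammal", "Bird", "Bird"]
has_wings = [False, False, True, True]   # separates the classes perfectly
is_brown = [True, False, True, False]    # hypothetical feature, unrelated to the class

parent_entropy = entropy(labels)
gain_wings = information_gain(labels, has_wings)
gain_brown = information_gain(labels, is_brown)

print("Parent entropy:", parent_entropy)
print("Gain for 'Has Wings':", gain_wings)
print("Gain for 'is_brown':", gain_brown)
```

In this toy case "Has Wings" removes all of the uncertainty while "is_brown" removes none, so a tree would split on "Has Wings" first.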


2. What is the difference between Gini Impurity and Entropy?

Both Gini Impurity and Entropy measure how mixed the classes in a node are, and decision trees use either one to decide the best split. The key difference is the formula: Entropy sums -p·log2(p) over the classes, while Gini Impurity is 1 minus the sum of the squared class probabilities. Gini is slightly cheaper to compute because it avoids the logarithm.
- Gini:
   - It measures the probability that a randomly chosen sample would be misclassified if its label were assigned at random according to the distribution of classes in the node.
   - If Gini = 0.48, then randomly picking a sample and assigning its class based on the node’s distribution has a 48% chance of being wrong.
- Entropy:
   - It tells us how much information (in bits, when using log base 2) is required, on average, to describe the outcome of a random process.
   - For a two-class node, where the maximum is 1 bit, Entropy = 0.98 means the node is almost at maximum uncertainty, so nearly the highest amount of information is needed to resolve it.
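
As a quick check of the two formulas, the sketch below evaluates both measures on a two-class node. The 60/40 class split is an assumed example (it happens to give the Gini value of 0.48 quoted above), and `gini` and `entropy_bits` are illustrative helper names.

```python
import math

def gini(probs):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in probs)

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

node = [0.6, 0.4]      # assumed class proportions in a node
balanced = [0.5, 0.5]  # the most mixed two-class node

g = gini(node)
h = entropy_bits(node)

print(f"Gini    = {g:.3f} (max for two classes: {gini(balanced):.1f})")
print(f"Entropy = {h:.3f} (max for two classes: {entropy_bits(balanced):.1f})")
```

Both measures are 0 for a pure node and peak when the classes are evenly mixed; they differ only in scale and in how sharply they rise toward that peak.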





3. What is Pre-Pruning in Decision Trees?
- Pre-pruning (also called early stopping) is a method where the decision tree growth is stopped before it becomes overly complex. Instead of building a full tree and then trimming it, pre-pruning sets constraints during tree construction.
- Its purpose is to prevent overfitting by skipping splits that don’t significantly improve predictive accuracy.
- Decision tree algorithms use thresholds to decide when to stop splitting:
  - Maximum Depth: Restrict how deep the tree can grow.
  - Minimum Samples per Node: Stop splitting if a node has fewer than a set number of samples.
  - Minimum Information Gain: Only split if the reduction in impurity exceeds a threshold.
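
In scikit-learn, these three thresholds correspond to the `max_depth`, `min_samples_split`, and `min_impurity_decrease` parameters of `DecisionTreeClassifier`. The sketch below compares an unconstrained tree with a pre-pruned one on Iris; the specific threshold values are arbitrary choices for illustration, not recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Unconstrained tree: grows until every leaf is pure
full_tree = DecisionTreeClassifier(random_state=42).fit(X, y)

# Pre-pruned tree: growth stops as soon as any constraint is hit
pruned_tree = DecisionTreeClassifier(
    max_depth=3,                 # maximum depth
    min_samples_split=10,        # a node needs at least 10 samples to be split
    min_impurity_decrease=0.01,  # a split must reduce weighted impurity by at least 0.01
    random_state=42,
).fit(X, y)

print("Full tree:   depth", full_tree.get_depth(), "leaves", full_tree.get_n_leaves())
print("Pruned tree: depth", pruned_tree.get_depth(), "leaves", pruned_tree.get_n_leaves())
```

The pruned tree can be no deeper than 3 levels, which makes it smaller and easier to read, at the possible cost of some training accuracy.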


In [1]:
'''
4.Write a Python program to train a Decision Tree Classifier using Gini
Impurity as the criterion and print the feature importances (practical).
Hint: Use criterion='gini' in DecisionTreeClassifier and access .feature_importances_
'''


# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
import pandas as pd

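# Load the Iris dataset (4 numeric features, 3 classes)
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names

# Fit a decision tree that scores candidate splits with Gini impurity
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X, y)

# Each importance is the feature's normalized total impurity reduction; the values sum to 1
importances = clf.feature_importances_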

# Display feature importances in a neat format
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

print("Feature Importances using Gini Impurity:")
print(importance_df)

Feature Importances using Gini Impurity:
             Feature  Importance
2  petal length (cm)    0.564056
3   petal width (cm)    0.422611
0  sepal length (cm)    0.013333
1   sepal width (cm)    0.000000


5. What is a Support Vector Machine (SVM)?
- A Support Vector Machine (SVM) is a supervised machine learning algorithm that finds the optimal boundary (called a hyperplane) to separate data into different classes with the maximum margin.
- Idea of SVM
  - Classification: SVM tries to separate data points of different classes using a hyperplane.
  - Maximum Margin Principle: The chosen hyperplane is the one that maximizes the distance (margin) between itself and the nearest data points from each class. These nearest points are called support vectors.
  - Generalization: By maximizing the margin, SVM reduces the risk of misclassification on unseen data.
- Linear Case: If data is linearly separable, SVM finds a straight line (in 2D), plane (in 3D), or hyperplane (in higher dimensions) that best separates the classes.
- Nonlinear Case: If data isn’t linearly separable, SVM uses the kernel trick to project data into a higher-dimensional space where a linear separator can be found.
- Common kernels: Linear, Polynomial, Radial Basis Function (RBF), Sigmoid.
- Support Vectors: Only a subset of training points (the ones closest to the boundary) influences the decision function, making SVM memory-efficient.
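
The role of support vectors can be seen on a tiny, hand-made 2D dataset. The eight points below are invented for illustration; the sketch fits a linear `SVC` and inspects its `support_vectors_` attribute to see which training points define the boundary.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data (assumed): two well-separated clusters in 2D
X = np.array([
    [0, 0], [0, 1], [1, 0], [-1, -1],   # class 0
    [3, 3], [3, 4], [4, 3], [5, 5],     # class 1
], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Only the points nearest the boundary are kept as support vectors
n_support = len(clf.support_vectors_)
print("Training points:", len(X))
print("Support vectors:", n_support)
print(clf.support_vectors_)

# New points are classified by which side of the hyperplane they fall on
preds = clf.predict([[0.5, 0.5], [4.0, 4.0]])
print("Predictions:", preds)
```

Removing any training point that is not a support vector and refitting would leave the boundary unchanged, which is why only the support vectors need to be stored.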






6. What is the Kernel Trick in SVM?
- The Kernel Trick in SVM is a mathematical technique that allows Support Vector Machines to handle non-linear data by implicitly mapping it into a higher-dimensional space without explicitly computing the transformation.
- Why the Kernel Trick Is Needed:
  - Linear separability: A linear SVM can only separate classes with a straight line (or hyperplane).
  - Real-world data: Often, data is non-linear and cannot be separated with a simple line.
  - Solution: Map the data into a higher-dimensional space where it becomes linearly separable.
- How the Kernel Trick Works:
  - Instead of explicitly computing the coordinates in higher dimensions, SVM uses a kernel function to calculate the inner product between two points in that space.
  - This avoids the computational cost of working directly in high dimensions.
  - Key idea: You never need to know the actual mapping; the kernel function does the job implicitly.

Common kernels are:
- Linear (the basic one)
- Polynomial
- Gaussian (RBF)
- Sigmoid
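
To make the "implicit mapping" concrete, the sketch below uses the degree-2 polynomial kernel K(a, b) = (a·b)² on 2D inputs, for which the explicit feature map is known to be φ(x) = (x1², √2·x1·x2, x2²). The two input vectors are arbitrary, and `phi` and `poly_kernel` are illustrative names.

```python
import numpy as np

def phi(v):
    """Explicit map from 2D to 3D for the degree-2 polynomial kernel."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly_kernel(a, b):
    """Kernel evaluated in the original 2D space: (a . b)^2."""
    return np.dot(a, b) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

explicit = np.dot(phi(a), phi(b))  # map to 3D first, then take the inner product
implicit = poly_kernel(a, b)       # never leaves 2D

print("Explicit inner product in 3D:", explicit)
print("Kernel value computed in 2D: ", implicit)
```

Both routes give the same number, but the kernel route never builds the higher-dimensional vectors. For the RBF kernel the corresponding feature space is infinite-dimensional, so the kernel route is the only practical one.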



In [2]:
'''
7.
Write a Python program to train two SVM classifiers with Linear and RBF
kernels on the Wine dataset, then compare their accuracies.
Hint:Use SVC(kernel='linear') and SVC(kernel='rbf'), then compare accuracy scores after fitting
on the same dataset.
'''
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

wine = load_wine()
X = wine.data
y = wine.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train SVM with Linear kernel
svm_linear = SVC(kernel='linear', random_state=42)
svm_linear.fit(X_train, y_train)
y_pred_linear = svm_linear.predict(X_test)
accuracy_linear = accuracy_score(y_test, y_pred_linear)

# Train SVM with RBF kernel
# Note: the features are not scaled here, and the RBF kernel is sensitive to feature scale
svm_rbf = SVC(kernel='rbf', random_state=42)
svm_rbf.fit(X_train, y_train)
y_pred_rbf = svm_rbf.predict(X_test)
accuracy_rbf = accuracy_score(y_test, y_pred_rbf)


print("Accuracy with Linear Kernel:", accuracy_linear)
print("Accuracy with RBF Kernel:", accuracy_rbf)

if accuracy_linear > accuracy_rbf:
    print("Linear kernel performed better.")
elif accuracy_rbf > accuracy_linear:
    print("RBF kernel performed better.")
else:
    print("Both kernels performed equally well.")


Accuracy with Linear Kernel: 0.9444444444444444
Accuracy with RBF Kernel: 0.6944444444444444
Linear kernel performed better.
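
The large gap above is likely influenced by feature scale: the Wine features span very different ranges, and the RBF kernel depends on distances between points. The sketch below repeats the comparison with a `StandardScaler` in front of each classifier, on the same split. It is an optional follow-up under that assumption, not part of the original exercise, and no scores are quoted here because they have not been run as part of this notebook.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaled_scores = {}
for kernel in ("linear", "rbf"):
    # The scaler is fitted on the training split only, then applied to the test split
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    model.fit(X_train, y_train)
    scaled_scores[kernel] = model.score(X_test, y_test)
    print(f"Accuracy with {kernel} kernel (scaled features):", scaled_scores[kernel])
```

Putting the scaler inside the pipeline keeps test-set statistics out of training, so the comparison between kernels stays fair.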


8. What is the Naïve Bayes classifier, and why is it called "Naïve"?

- Bayes’ Theorem: It calculates the probability of a class given the features:

$$P(C \mid X) = \frac{P(X \mid C)\,P(C)}{P(X)}$$
- where:
  - P(C|X): Posterior probability of class C given features X
  - P(X|C): Likelihood of features given class C
  - P(C): Prior probability of class C
  - P(X): Evidence (probability of features)
- Classifier: Naïve Bayes uses this theorem to assign the most probable class to a given input.
- Bayes’ Theorem: General rule, works with dependent events.
- Naïve Bayes classifier: Applies Bayes’ Theorem but assumes the features are conditionally independent given the class, which makes the likelihood P(X|C) easy to compute.
- That’s why it’s called “naïve”: it ignores dependencies between features, even though real features are rarely fully independent.
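
Written out, the independence assumption turns the likelihood into a product of per-feature terms. Here the individual features are denoted x_1, ..., x_n (notation introduced for this illustration), and P(X) is dropped from the decision rule because it is the same for every class:

$$P(X \mid C) = \prod_{i=1}^{n} P(x_i \mid C)$$

$$\hat{C} = \arg\max_{C} \; P(C) \prod_{i=1}^{n} P(x_i \mid C)$$

Each factor P(x_i | C) can be estimated from the training data one feature at a time, which is what keeps the method cheap to train.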


9. Explain the differences between Gaussian Naïve Bayes, Multinomial Naïve Bayes, and Bernoulli Naïve Bayes.
- 1. Gaussian Naïve Bayes
   - Assumption: Features are continuous and follow a normal (Gaussian) distribution.
   - Use Case: Works well for datasets with continuous numerical values (e.g., height, weight, sensor readings).
- 2. Multinomial Naïve Bayes
   - Assumption: Features represent discrete counts (non-negative integers).
   - Use Case: Commonly used in text classification (spam detection, sentiment analysis) where features are word counts or term frequencies.
- 3. Bernoulli Naïve Bayes
   - Assumption: Features are binary (0 or 1), representing presence/absence of a feature.
   - Use Case: Also used in text classification, but instead of word counts, it considers whether a word appears at all (Yes/No).
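
The three variants share the same scikit-learn interface and differ mainly in the kind of input they expect. The sketch below fits each one to a matching toy feature matrix; all values are invented for illustration, and the binary matrix is simply the count matrix thresholded at zero.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])

# Continuous measurements -> GaussianNB
X_continuous = np.array([[1.2, 150.0], [1.4, 160.0], [3.1, 180.0], [3.3, 175.0]])

# Word counts per document -> MultinomialNB
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])

# Word present / absent -> BernoulliNB
X_binary = (X_counts > 0).astype(int)

models = {
    "GaussianNB": (GaussianNB(), X_continuous),
    "MultinomialNB": (MultinomialNB(), X_counts),
    "BernoulliNB": (BernoulliNB(), X_binary),
}

predictions = {}
for name, (model, X) in models.items():
    model.fit(X, y)
    predictions[name] = model.predict(X)
    print(name, "predictions on the training rows:", predictions[name])
```

Choosing a variant is therefore mostly a question of how the features are encoded: real-valued, counted, or binary.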


In [3]:
'''
10.
Write a Python program to train a Gaussian Naïve Bayes classifier on the Breast Cancer
dataset and evaluate accuracy.
Hint:Use GaussianNB() from sklearn.naive_bayes and the Breast Cancer dataset from
sklearn.datasets

'''
# Import necessary libraries
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

data = load_breast_cancer()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)

print("Accuracy of Gaussian Naïve Bayes on Breast Cancer dataset:", accuracy)

Accuracy of Gaussian Naïve Bayes on Breast Cancer dataset: 0.9385964912280702
