# ASSIGNMENT 9
Question 1 : What is Information Gain, and how is it used in Decision Trees?

Answer:

**Information Gain** is a metric used in **Decision Tree algorithms** to decide which attribute should be selected to split the data at each node. It measures how much **uncertainty or impurity** in the dataset is reduced after splitting the data based on a particular feature.

In decision trees, impurity is commonly measured using **Entropy**, which shows how mixed the class labels are. If all records belong to the same class, entropy is zero, which means there is no uncertainty. When the classes are evenly mixed, entropy is at its maximum (1 bit for two classes).

The formula for entropy is:

$$
\text{Entropy}(S) = -\sum_{i} p_i \log_2 p_i
$$

where $p_i$ is the probability of class $i$ in $S$.

Information Gain is calculated as the difference between the entropy of the parent node and the weighted average entropy of the child nodes after the split:

$$
\text{Information Gain} = \text{Entropy}(\text{Parent}) - \sum_{\text{Child}} \frac{|\text{Child}|}{|\text{Parent}|} \times \text{Entropy}(\text{Child})
$$

**Use of Information Gain in Decision Trees:**

1. Initially, the entropy of the complete dataset is calculated.
2. For each attribute, the dataset is split based on its possible values.
3. Entropy is calculated for each subset formed after the split.
4. Information Gain is computed for every attribute.
5. The attribute with the **highest Information Gain** is selected as the decision node.
6. This process is repeated recursively for each child node until the tree is fully constructed or stopping conditions are met.
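
The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a full ID3 implementation: it only selects the root attribute. It assumes the classic 14-row "Play Tennis" weather data as a stand-in dataset, and the helper names `entropy` and `information_gain` are chosen here for illustration.

```python
from collections import Counter
from math import log2

# Classic "Play Tennis" toy data (assumed here for illustration).
# Each row: Outlook, Temperature, Humidity, Wind, Play (class label).
COLUMNS = ["Outlook", "Temperature", "Humidity", "Wind"]
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, column_index):
    """Parent entropy minus the size-weighted entropy of the child subsets."""
    parent_labels = [row[-1] for row in rows]
    children = {}
    for row in rows:
        # Group class labels by the value of the chosen attribute.
        children.setdefault(row[column_index], []).append(row[-1])
    weighted = sum(len(c) / len(rows) * entropy(c) for c in children.values())
    return entropy(parent_labels) - weighted


# Steps 1-5: compute the gain of every attribute and pick the largest.
gains = {name: information_gain(ROWS, i) for i, name in enumerate(COLUMNS)}
best_attribute = max(gains, key=gains.get)

for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")
print("Chosen root attribute:", best_attribute)
```

On this data, Outlook has the largest gain (about 0.247), so it would be chosen for the root split. A full tree builder would repeat the same selection on each child subset.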

**Advantages:**

* Helps in selecting the most relevant attribute.
* Reduces uncertainty effectively.
* Improves classification accuracy.

**Example:**
If a dataset is split using the attribute **“Outlook”** and it results in a higher reduction of entropy compared to other attributes, then “Outlook” will be chosen as the root node of the decision tree.

**Conclusion:**
Information Gain plays a crucial role in decision tree construction by choosing the best attribute that provides maximum reduction in entropy, resulting in an efficient and accurate classification model.




Question 2: What is the difference between Gini Impurity and Entropy?
Hint: Directly compares the two main impurity measures, highlighting strengths,
weaknesses, and appropriate use cases.

Answer:
Gini Impurity and Entropy are **impurity measures** used in **Decision Tree algorithms** to decide how data should be split at each node. Both measure how mixed the classes are in a dataset, but they differ in calculation, interpretation, and usage.

### **1. Definition**

* **Gini Impurity:**
  Measures the probability that a randomly chosen data point would be incorrectly classified if it were randomly labeled according to the class distribution.

* **Entropy:**
  Measures the amount of uncertainty or randomness in the dataset.


### **2. Formula**

* **Gini Impurity:**

  $$
  \text{Gini} = 1 - \sum_{i} p_i^2
  $$

* **Entropy:**

  $$
  \text{Entropy} = -\sum_{i} p_i \log_2 p_i
  $$

where $p_i$ is the probability of class $i$.


### **3. Range of Values**

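* **Gini Impurity:**
  Ranges from **0 to 0.5** for binary classification (in general, from 0 to $1 - 1/k$ for $k$ classes).
  0 means a pure node.

* **Entropy:**
  Ranges from **0 to 1** for binary classification (in general, from 0 to $\log_2 k$ for $k$ classes).
  0 means a pure node.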
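
The two ranges can be checked numerically. The short sketch below evaluates both measures for a binary node as the proportion of one class varies; the function names `gini` and `entropy` are chosen here for illustration.

```python
from math import log2


def gini(probabilities):
    """Gini impurity: 1 - sum(p_i^2)."""
    return 1.0 - sum(p ** 2 for p in probabilities)


def entropy(probabilities):
    """Entropy in bits; terms with p == 0 are skipped (0 * log 0 is taken as 0)."""
    return sum(-p * log2(p) for p in probabilities if p > 0)


# Binary node: p is the proportion of one class, 1 - p the other.
for p in (0.0, 0.1, 0.3, 0.5):
    dist = [p, 1.0 - p]
    print(f"p={p:.1f}  gini={gini(dist):.3f}  entropy={entropy(dist):.3f}")
```

Both measures are 0 for a pure node and peak at the 50/50 split, where Gini reaches 0.5 and entropy reaches 1.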



### **4. Computational Complexity**

* **Gini Impurity:**
  Faster to compute because it does not involve logarithmic calculations.

* **Entropy:**
  Slightly slower because it requires logarithms, although the difference is rarely significant in practice.



### **5. Sensitivity to Class Distribution**

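* **Gini Impurity:**
  Somewhat less sensitive to small changes in class probabilities; tends to isolate the most frequent class in its own branch.

* **Entropy:**
  More sensitive to changes in class probabilities; tends to give slightly more balanced splits. In practice, the two criteria often produce very similar trees.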



### **6. Usage in Algorithms**

* **Gini Impurity:**
  Used in **CART (Classification and Regression Trees)** algorithm (default in scikit-learn).

* **Entropy:**
  Used in **ID3 and C4.5** algorithms.



### **7. Strengths**

* **Gini Impurity:**

  * Faster performance
  * Suitable for large datasets

* **Entropy:**

  * Theoretically sound (information theory based)
  * Produces informative splits



### **8. Weaknesses**

* **Gini Impurity:**

  * Slightly less informative in some edge cases

* **Entropy:**

  * Slightly more expensive to compute because of the logarithm



### **9. Use Cases**

* Use **Gini Impurity** when speed is important and the dataset is large.
* Use **Entropy** when an information-theoretic interpretation of the splits is preferred.
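
Since the choice often makes little practical difference, one reasonable approach is to try both criteria and compare. The sketch below does this with 5-fold cross-validation; the Iris dataset and the `cv=5` setting are assumptions made for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Train the same tree with each impurity criterion and record mean CV accuracy.
scores = {}
for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(criterion=criterion, random_state=42)
    scores[criterion] = cross_val_score(tree, X, y, cv=5).mean()

for criterion, score in scores.items():
    print(f"{criterion}: mean CV accuracy = {score:.3f}")
```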



Question 3: What is Pre-Pruning in Decision Trees?

Answer:

**Pre-Pruning**, also known as **Early Stopping**, is a technique used in Decision Trees to **control the growth of the tree during the training phase**. In this approach, the tree is stopped from growing further **before it becomes too complex**, even if further splits could improve training accuracy.

The main objective of pre-pruning is to **prevent overfitting**, which occurs when a decision tree learns noise and unnecessary details from the training data, resulting in poor performance on unseen data.

### **How Pre-Pruning Works**

During the construction of a decision tree, each potential split is evaluated. If a split does not satisfy certain predefined conditions, the algorithm **does not perform the split**, and the node becomes a **leaf node**.

### **Common Pre-Pruning Criteria**

1. **Maximum Depth**: Limit the maximum depth of the tree.

2. **Minimum Samples for Split**: A node must have a minimum number of samples to be split.

3. **Minimum Information Gain**: Split is allowed only if information gain exceeds a threshold.

4. **Minimum Samples in Leaf Node**: Ensures each leaf contains enough data points.

5. **Minimum Impurity Decrease**: Split only if impurity reduction is significant.
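
In scikit-learn, most of these criteria correspond to constructor parameters of `DecisionTreeClassifier`. The sketch below compares an unrestricted tree with a pre-pruned one; the Breast Cancer dataset and the specific threshold values are arbitrary choices for illustration, not recommended settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Baseline: the tree grows until its leaves are pure.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruned: growth stops as soon as any limit would be violated.
pruned_tree = DecisionTreeClassifier(
    max_depth=3,                 # maximum depth
    min_samples_split=20,        # minimum samples required to split a node
    min_samples_leaf=5,          # minimum samples required in each leaf
    min_impurity_decrease=0.01,  # minimum weighted impurity reduction
    random_state=42,
).fit(X_train, y_train)

for name, tree in (("unpruned", full_tree), ("pre-pruned", pruned_tree)):
    print(
        f"{name}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}, "
        f"test accuracy={tree.score(X_test, y_test):.3f}"
    )
```

The pre-pruned tree is guaranteed to be no deeper than `max_depth`. Whether it generalizes better depends on the data and the chosen thresholds.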

### **Advantages of Pre-Pruning**

* Reduces overfitting
* Improves generalization ability
* Produces simpler and more interpretable trees
* Reduces training time and memory usage

### **Disadvantages of Pre-Pruning**

* May stop tree growth too early
* Can lead to underfitting
* Requires careful selection of stopping criteria

### **Example**

If splitting a node results in very small information gain or the number of samples is below a set threshold, the algorithm **stops further splitting**, even if the node is not perfectly pure.



Question 4: Write a Python program to train a Decision Tree Classifier using Gini
Impurity as the criterion and print the feature importances (practical).

Hint: Use criterion='gini' in DecisionTreeClassifier and access .feature_importances_.

(Include your Python code and output in the code box below.)

Answer:

# Import required libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Load dataset
data = load_iris()
X = data.data      # Features
y = data.target    # Target labels

# Create Decision Tree Classifier with Gini Impurity
model = DecisionTreeClassifier(criterion='gini', random_state=42)

# Train the model
model.fit(X, y)

# Print feature importances
print("Feature Importances:")
for feature, importance in zip(data.feature_names, model.feature_importances_):
    print(f"{feature}: {importance}")

Output:

Feature Importances:
sepal length (cm): 0.013333333333333329
sepal width (cm): 0.0
petal length (cm): 0.5640559581320451
petal width (cm): 0.4226107085346215


Question 5: What is a Support Vector Machine (SVM)?

Answer:  

A **Support Vector Machine (SVM)** is a **supervised machine learning algorithm** used for **classification and regression** problems. Its main objective is to find an **optimal decision boundary (hyperplane)** that separates data points of different classes with the **maximum margin**.

### **Working Principle of SVM**

SVM works by identifying the **best hyperplane** that divides the data into classes. The data points closest to this hyperplane are called **support vectors**. These support vectors are the most important points because they directly influence the position and orientation of the hyperplane.

The margin is the distance between the hyperplane and the nearest data points from each class. SVM aims to **maximize this margin**, which helps in better generalization and reduces overfitting.

### **Key Concepts in SVM**

1. **Hyperplane**: A decision boundary that separates different classes.

2. **Support Vectors**: Data points closest to the hyperplane.

3. **Margin**: Distance between the hyperplane and support vectors.

4. **Kernel Function**: Used to handle non-linearly separable data by mapping it to a higher-dimensional space.
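
These concepts can be inspected directly on a fitted model. The sketch below fits a linear SVM on six hand-picked 2-D points (an assumed toy dataset) and reads off the hyperplane, the support vectors, and the margin. A very large `C` is used so that the soft-margin solver approximates a hard margin on this separable data.

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated clusters (toy data chosen for illustration).
X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# A large C approximates a hard-margin SVM on separable data.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]        # normal vector of the hyperplane w . x + b = 0
b = clf.intercept_[0]   # offset of the hyperplane
margin = 1.0 / np.linalg.norm(w)  # distance from hyperplane to nearest points

print("Hyperplane normal w:", w)
print("Intercept b:", b)
print("Support vectors:\n", clf.support_vectors_)
print("Margin (hyperplane to support vectors):", margin)
```

Only the points listed in `support_vectors_` determine the boundary. Moving any other point, as long as it stays outside the margin, leaves the hyperplane unchanged.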

### **Types of Kernels**

* **Linear Kernel** – for linearly separable data

* **Polynomial Kernel** – for curved boundaries

* **Radial Basis Function (RBF)** – most commonly used

* **Sigmoid Kernel**

### **Advantages of SVM**

* Effective in high-dimensional spaces

* Works well with small and medium-sized datasets

* Relatively robust to overfitting, especially in high-dimensional spaces, provided the regularization parameter is tuned

* Can handle non-linear data using kernels

### **Disadvantages of SVM**

* Computationally expensive for large datasets

* Choosing the right kernel and parameters is difficult

* Less interpretable compared to decision trees

### **Applications of SVM**

* Image and face recognition

* Text classification and spam detection

* Bioinformatics

* Handwriting recognition



Question 6: What is the Kernel Trick in SVM?

Answer:
The **Kernel Trick** is a powerful technique used in **Support Vector Machines (SVM)** to handle **non-linearly separable data**. It allows SVM to create **non-linear decision boundaries** without explicitly transforming the data into a higher-dimensional space.

### **Why Kernel Trick is Needed**

In many real-world problems, data cannot be separated using a straight line (or linear hyperplane). To separate such data, it must be mapped to a **higher-dimensional space** where it becomes linearly separable.
The Kernel Trick performs this mapping **implicitly**, making computation efficient.

### **How Kernel Trick Works**

Instead of explicitly computing the transformation of data points into higher dimensions, the kernel function computes the **inner product** the points would have in the transformed space, working directly from the original inputs.
This avoids ever constructing the high-dimensional vectors and reduces processing time.

Mathematically:

$$
K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)
$$

where:

* $\phi$ is a mapping function to higher dimensions
* $K$ is the kernel function
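
The identity can be verified by hand for a simple case. The sketch below uses the degree-2 polynomial kernel $(x \cdot z)^2$ on 2-D inputs, for which an explicit feature map is known; the function names `phi` and `poly2_kernel` are chosen here for illustration.

```python
import numpy as np


def phi(x):
    """Explicit feature map for the kernel (x . z)^2 on 2-D inputs."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])


def poly2_kernel(x, z):
    """Same inner product, computed without leaving the original 2-D space."""
    return np.dot(x, z) ** 2


x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

explicit = np.dot(phi(x), phi(z))  # map to 3-D first, then take the dot product
implicit = poly2_kernel(x, z)      # kernel trick: one dot product in 2-D

print("Explicit mapping:", explicit)
print("Kernel function: ", implicit)
```

Both routes give the same value (16 for these inputs), but the kernel never builds the 3-D vectors. For kernels such as RBF, the corresponding feature space is infinite-dimensional, so the implicit route is the only practical one.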

### **Common Kernel Functions**

1. **Linear Kernel**
   $K(x_i, x_j) = x_i \cdot x_j$
   Used when data is linearly separable.

2. **Polynomial Kernel**
   $K(x_i, x_j) = (x_i \cdot x_j + c)^d$
   Used for curved decision boundaries.

3. **Radial Basis Function (RBF) Kernel**
   $K(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2)$
   Most popular kernel for complex data.

4. **Sigmoid Kernel**
   $K(x_i, x_j) = \tanh(\alpha \, x_i \cdot x_j + c)$
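
The four formulas translate almost directly into NumPy. The sketch below is illustrative only: the function names and default parameter values are assumptions, and the RBF result is cross-checked against scikit-learn's `rbf_kernel`.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel


def linear(x, z):
    return np.dot(x, z)


def polynomial(x, z, c=1.0, d=3):
    return (np.dot(x, z) + c) ** d


def rbf(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2))


def sigmoid(x, z, alpha=0.1, c=0.0):
    return np.tanh(alpha * np.dot(x, z) + c)


x = np.array([1.0, 2.0])
z = np.array([2.0, 0.0])

print("linear:    ", linear(x, z))
print("polynomial:", polynomial(x, z))
print("rbf:       ", rbf(x, z))
print("sigmoid:   ", sigmoid(x, z))

# Cross-check the hand-written RBF against scikit-learn's implementation.
reference = rbf_kernel(x.reshape(1, -1), z.reshape(1, -1), gamma=0.5)[0, 0]
print("sklearn rbf:", reference)
```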

### **Advantages of Kernel Trick**

* Handles complex, non-linear data
* Avoids explicit high-dimensional transformation
* Computationally efficient
* Improves model flexibility

### **Limitations**

* Choosing the right kernel is challenging
* Kernel selection affects performance
* Risk of overfitting with complex kernels

### **Example**

If the classes are arranged in concentric circles and cannot be separated linearly, the **RBF kernel** implicitly maps the data to a higher-dimensional space where a linear hyperplane can separate it.
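
This case can be reproduced with synthetic data. The sketch below uses scikit-learn's `make_circles` to generate two concentric rings and compares a linear kernel with an RBF kernel; the sample size, noise level, and ring spacing are arbitrary choices for illustration.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: the inner ring is one class, the outer ring the other.
X, y = make_circles(n_samples=400, noise=0.05, factor=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

accuracies = {}
for kernel in ("linear", "rbf"):
    model = SVC(kernel=kernel).fit(X_train, y_train)
    accuracies[kernel] = model.score(X_test, y_test)

for kernel, accuracy in accuracies.items():
    print(f"{kernel} kernel test accuracy: {accuracy:.3f}")
```

The linear kernel cannot draw a straight line between the rings, while the RBF kernel separates them almost perfectly.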

### **Conclusion**

The Kernel Trick enables SVM to solve non-linear classification problems efficiently by implicitly mapping data to higher-dimensional space using kernel functions.





Question 7: Write a Python program to train two SVM classifiers with Linear and RBF
kernels on the Wine dataset, then compare their accuracies.

Hint: Use SVC(kernel='linear') and SVC(kernel='rbf'), then compare accuracy scores after fitting
on the same dataset.

(Include your Python code and output in the code box below.)

Answer:

# Import required libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train SVM with Linear Kernel
svm_linear = SVC(kernel='linear')
svm_linear.fit(X_train, y_train)
y_pred_linear = svm_linear.predict(X_test)
accuracy_linear = accuracy_score(y_test, y_pred_linear)

# Train SVM with RBF Kernel
svm_rbf = SVC(kernel='rbf')
svm_rbf.fit(X_train, y_train)
y_pred_rbf = svm_rbf.predict(X_test)
accuracy_rbf = accuracy_score(y_test, y_pred_rbf)

# Print the accuracies
print("Accuracy of SVM with Linear Kernel:", accuracy_linear)
print("Accuracy of SVM with RBF Kernel:", accuracy_rbf)

Output:

Accuracy of SVM with Linear Kernel: 0.9814814814814815
Accuracy of SVM with RBF Kernel: 0.7592592592592593
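
One likely reason for the lower RBF accuracy is that the Wine features are on very different scales, and the RBF kernel depends on Euclidean distances between points. The sketch below repeats the comparison with standardized features. `StandardScaler` and `make_pipeline` are additions that the program above does not use, and the resulting accuracies are not quoted here because they are not part of the recorded output.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

scaled_accuracies = {}
for kernel in ("linear", "rbf"):
    # The scaler is fitted on the training split only, inside the pipeline.
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    model.fit(X_train, y_train)
    scaled_accuracies[kernel] = model.score(X_test, y_test)

for kernel, accuracy in scaled_accuracies.items():
    print(f"Accuracy of scaled SVM with {kernel} kernel: {accuracy}")
```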


Question 8: What is the Naïve Bayes classifier, and why is it called "Naïve"?

Answer:
The **Naïve Bayes classifier** is a **supervised machine learning algorithm** used for **classification problems**. It is a **probabilistic classifier** based on **Bayes’ Theorem**, which calculates the probability of a class label given a set of input features. The classifier assigns a data point to the class with the **highest posterior probability**.



### Bayes’ Theorem

Naïve Bayes uses the following formula:

$$
P(C \mid X) = \frac{P(X \mid C)\, P(C)}{P(X)}
$$

Where:

* $P(C \mid X)$ = Posterior probability of class $C$ given features $X$
* $P(X \mid C)$ = Likelihood of features given class
* $P(C)$ = Prior probability of class
* $P(X)$ = Evidence (constant for all classes)



### Why is it called **“Naïve”**?

The classifier is called **“Naïve”** because it assumes that **all input features are conditionally independent of each other given the class label**. This means that, within a given class, the presence or absence of one feature does not affect the presence or absence of any other feature.

This assumption is **rarely true in real-world data**, hence the term *naïve*. However, even with this unrealistic assumption, the algorithm performs very well in many practical applications.
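
Under the independence assumption, the likelihood $P(X \mid C)$ factorizes into a product of per-feature terms, which keeps the computation simple. The sketch below works through a two-class example by hand; all of the probabilities are hypothetical numbers invented for illustration.

```python
# Hypothetical numbers for a two-class spam filter (illustrative only).
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.6, "win": 0.5},   # P(word | spam)
    "ham": {"free": 0.1, "win": 0.05},   # P(word | ham)
}
observed_words = ["free", "win"]

# Naive assumption: P(X | C) is the product of the per-word likelihoods.
unnormalized = {}
for label, prior in priors.items():
    score = prior
    for word in observed_words:
        score *= likelihoods[label][word]
    unnormalized[label] = score

# P(X) is the same for every class, so it only rescales the scores.
evidence = sum(unnormalized.values())
posteriors = {label: score / evidence for label, score in unnormalized.items()}
prediction = max(posteriors, key=posteriors.get)

for label, posterior in posteriors.items():
    print(f"P({label} | words) = {posterior:.4f}")
print("Predicted class:", prediction)
```

With these numbers the unnormalized scores are 0.12 for spam and 0.003 for ham, so the message is classified as spam with a posterior of about 0.976.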



### Types of Naïve Bayes Classifiers

1. **Gaussian Naïve Bayes** – Used for continuous data
2. **Multinomial Naïve Bayes** – Used for text data and word counts
3. **Bernoulli Naïve Bayes** – Used for binary features



### Advantages

* Simple and easy to understand
* Very fast training and prediction
* Works well with high-dimensional data
* Requires less training data
* Effective for text classification problems



### Limitations

* Strong independence assumption
* Less accurate when features are highly correlated
* Probability estimates may be poor



### Applications

* Spam email detection
* Sentiment analysis
* Document classification
* Medical diagnosis



### Conclusion

The **Naïve Bayes classifier** is called *naïve* due to its **feature independence assumption**, but it remains a **powerful, efficient, and widely used classification algorithm**, especially in text-based and large-scale applications.


Question 9: Explain the differences between Gaussian Naïve Bayes, Multinomial Naïve
Bayes, and Bernoulli Naïve Bayes

Answer:
Naïve Bayes is a **probabilistic classification algorithm** based on **Bayes’ Theorem**. Depending on the **type of data and feature distribution**, different variants of Naïve Bayes are used. The three most common types are **Gaussian Naïve Bayes**, **Multinomial Naïve Bayes**, and **Bernoulli Naïve Bayes**.



## 1. Gaussian Naïve Bayes

**Gaussian Naïve Bayes** is used when the input features are **continuous numerical values** and are assumed to follow a **normal (Gaussian) distribution**.

### Key Characteristics:

* Assumes features follow a **Gaussian distribution**
* Uses **mean and variance** for probability estimation
* Suitable for **real-valued data**

### Example Applications:

* Medical diagnosis (blood pressure, temperature)
* Iris dataset classification



## 2. Multinomial Naïve Bayes

**Multinomial Naïve Bayes** is mainly used for **discrete count data**, especially in **text classification** problems.

### Key Characteristics:

* Works with **frequency/count of features**
* Commonly used with **Bag-of-Words** or **TF-IDF**
* Feature values represent how often something occurs

### Example Applications:

* Spam detection
* News article classification
* Sentiment analysis



## 3. Bernoulli Naïve Bayes

**Bernoulli Naïve Bayes** is used when features are **binary (0 or 1)**, indicating the **presence or absence** of a feature.

### Key Characteristics:

* Features take **binary values**
* Considers both presence and absence of features
* Suitable for binary feature vectors

### Example Applications:

* Text classification with binary word occurrence
* Email spam detection (word present or not)



## Comparison Table

| Feature              | Gaussian NB       | Multinomial NB      | Bernoulli NB    |
| -------------------- | ----------------- | ------------------- | --------------- |
| Data Type            | Continuous        | Discrete counts     | Binary          |
| Distribution Assumed | Normal (Gaussian) | Multinomial         | Bernoulli       |
| Feature Values       | Real numbers      | Integer counts      | 0 or 1          |
| Common Use Case      | Numerical data    | Text frequency data | Binary features |
| Example              | Medical data      | Spam filtering      | Word presence   |
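
The three variants share the same scikit-learn interface and differ mainly in the feature representation they expect. The sketch below pairs each one with a matching input; the four-document corpus and its labels are made up for illustration, and training accuracy on Iris is used only as a quick sanity check.

```python
from sklearn.datasets import load_iris
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Gaussian NB: continuous measurements.
X_iris, y_iris = load_iris(return_X_y=True)
gaussian_accuracy = GaussianNB().fit(X_iris, y_iris).score(X_iris, y_iris)

# Tiny made-up corpus: 1 = spam, 0 = not spam.
docs = [
    "free free prize money",
    "win free money now",
    "meeting schedule today",
    "project meeting notes",
]
labels = [1, 1, 0, 0]
new_doc = ["free money"]

# Multinomial NB: word counts.
count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(docs)
multinomial_pred = (
    MultinomialNB().fit(X_counts, labels).predict(count_vec.transform(new_doc))[0]
)

# Bernoulli NB: word presence or absence (binary=True caps every count at 1).
binary_vec = CountVectorizer(binary=True)
X_binary = binary_vec.fit_transform(docs)
bernoulli_pred = (
    BernoulliNB().fit(X_binary, labels).predict(binary_vec.transform(new_doc))[0]
)

print("GaussianNB training accuracy on Iris:", gaussian_accuracy)
print("MultinomialNB prediction for 'free money':", multinomial_pred)
print("BernoulliNB prediction for 'free money':", bernoulli_pred)
```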



## Conclusion

* **Gaussian Naïve Bayes** is best for **continuous numerical data**.
* **Multinomial Naïve Bayes** is ideal for **text and count-based data**.
* **Bernoulli Naïve Bayes** is suitable for **binary feature data**.

Choosing the correct variant depends on the **nature of the dataset and feature representation**.


Question 10: Breast Cancer Dataset
Write a Python program to train a Gaussian Naïve Bayes classifier on the Breast Cancer
dataset and evaluate accuracy.

Hint: Use GaussianNB() from sklearn.naive_bayes and the Breast Cancer dataset from
sklearn.datasets.

(Include your Python code and output in the code box below.)

Answer:

# Import required libraries
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Create Gaussian Naïve Bayes model
gnb = GaussianNB()

# Train the model
gnb.fit(X_train, y_train)

# Make predictions
y_pred = gnb.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print accuracy
print("Accuracy of Gaussian Naïve Bayes classifier:", accuracy)

Output:

Accuracy of Gaussian Naïve Bayes classifier: 0.9415204678362573
