# Decision Tree | Assignment

**Question 1:  What is a Decision Tree, and how does it work in the context of
classification?**

**Answer:-** A Decision Tree is a supervised machine learning algorithm used for both classification and regression tasks, but is especially popular for classification. It works by splitting input data into subsets based on feature values, represented as a tree-like model with nodes and branches leading to a final decision or class label.

**Decision Tree Structure:-**

**Root Node:** The starting point, representing the entire dataset and a feature-based question.

**Branches:** Possible answers or outcomes from a decision, corresponding to specific feature values.

**Internal Nodes:** Decision points where another feature-based split is made.

**Leaf Nodes:** Final nodes that provide a class label (for classification) or a value (for regression).

**How Decision Trees Work for Classification:-**

>Starting at the root node, the data is split on a feature value, and each branch receives the matching subset of records.

>This splitting continues recursively, using criteria such as Gini impurity or information gain to determine the best feature for each split.

>The recursion ends when all records in a node belong to the same class or cannot be split further.

>The path from the root to a leaf represents a sequence of decisions, ultimately classifying each data point into a category.

**Example**
If predicting whether a person is fit or unfit based on age, exercise, and pizza consumption:

The tree may first split on age (e.g., "Is age > 30?").

If yes, check exercise habit ("Does the person exercise?").

Continue branching until reaching leaf nodes with outcomes like "fit" or "unfit," setting the class accordingly.
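The walk from root to leaf in this example can be written as nested conditions. The sketch below is hypothetical: the function name `classify_fitness`, the pizza question on the younger branch, and every outcome are assumptions chosen for illustration, not a tree learned from data.

```python
def classify_fitness(age: int, exercises: bool, eats_pizza_often: bool) -> str:
    """Walk a small hand-written decision tree and return a class label."""
    if age > 30:                 # root node: "Is age > 30?"
        if exercises:            # internal node: "Does the person exercise?"
            return "fit"         # leaf node
        return "unfit"           # leaf node
    # Branch for age <= 30 (assumed question, for illustration only)
    if eats_pizza_often:
        return "unfit"
    return "fit"


print(classify_fitness(age=45, exercises=True, eats_pizza_often=False))
print(classify_fitness(age=25, exercises=False, eats_pizza_often=True))
```

Each call follows exactly one path from the root to a leaf, which is why a trained tree can be read as a set of if-then rules.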


**Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?**

**Answer:-** Gini Impurity and Entropy are two core impurity measures in decision trees that guide how splits are chosen at each node. Both are designed to quantify the “impurity,” or disorder, in the data at a node, but use slightly different formulas and properties to do so.

**Gini Impurity:**
Gini Impurity measures the probability that a randomly chosen element from the set would be incorrectly labeled if it were randomly assigned a label according to the distribution of classes at that node.

$$\text{Gini} = 1 - \sum_{i=1}^{n} p_i^2$$

where $p_i$ is the probability of class i at that node.

>A Gini Impurity of 0 means all samples at the node belong to one class (perfectly “pure”), while higher values indicate more mixing of classes.

>When building the tree, decision trees evaluate all possible splits and select the one that produces child nodes with the lowest weighted average Gini Impurity (i.e., the purest split)
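As a rough illustration, Gini impurity can be computed directly from a list of class labels using the standard definition, one minus the sum of squared class probabilities. The helper names `gini_impurity` and `weighted_gini` are assumptions made for this sketch, not library functions.

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 - sum(p_i^2) over the classes present in `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def weighted_gini(left, right):
    """Weighted average impurity of the two child nodes of a split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


print(gini_impurity(["a", "a", "a", "a"]))      # pure node
print(gini_impurity(["a", "a", "b", "b"]))      # evenly mixed binary node
print(weighted_gini(["a", "a"], ["b", "b"]))    # split into two pure children
```

A candidate split is better the lower its weighted child impurity is.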

**Entropy:**
Entropy, rooted in information theory, measures the unpredictability or disorder in the set of class labels.

$$\text{Entropy} = -\sum_{i=1}^{n} p_i \log_2(p_i)$$

where $p_i$ is the probability of class i at that node.

>Entropy is 0 when the node is pure, and has higher values as the distribution of classes becomes more uniform.

>The decision tree chooses splits that maximize the “Information Gain,” which is the reduction in entropy from parent to child nodes.
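The entropy formula can be checked on small label lists. This is a minimal sketch; the function name `entropy` is an assumption, and classes that do not appear in the list are simply skipped, since their contribution is zero.

```python
import math
from collections import Counter


def entropy(labels):
    """Return -sum(p_i * log2(p_i)) over the classes present in `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return sum(-(count / n) * math.log2(count / n) for count in counts.values())


print(entropy(["a", "a", "a"]))          # pure node
print(entropy(["a", "a", "b", "b"]))     # evenly mixed binary node
print(entropy(["a", "b", "c"]))          # three equally likely classes
```

For n equally likely classes the value reaches its maximum of log2(n), which is why the three-class example scores higher than the binary one.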

**Impact on Splits in a Decision Tree:-**

>Both measures lead to selection of the feature and threshold that produce the largest reduction in impurity from parent to child nodes.

>Gini impurity is computationally faster (no logarithms) and used by default in many libraries (e.g., scikit-learn), while entropy might be more sensitive when classes are nearly equally mixed.

>Because both criteria reward the same kind of purity, they typically produce similar splits in practice, though there can be subtle differences in how balanced or “pure” the child nodes are.
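The search for the best threshold on a single numeric feature might look like the sketch below. It assumes Gini impurity, binary splits of the form `x <= t`, and candidate thresholds at midpoints between consecutive distinct values; the function names and the toy age data are hypothetical, and real libraries use more efficient implementations.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, impurity_decrease) of the best split `x <= threshold`."""
    n = len(labels)
    parent_impurity = gini(labels)
    best = (None, 0.0)
    distinct = sorted(set(values))
    # Candidate thresholds: midpoints between consecutive distinct values
    for low, high in zip(distinct, distinct[1:]):
        t = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
        decrease = parent_impurity - weighted
        if decrease > best[1]:
            best = (t, decrease)
    return best


# Toy data, invented for illustration
ages = [22, 25, 31, 38, 45, 52]
labels = ["fit", "fit", "fit", "unfit", "unfit", "unfit"]
threshold, decrease = best_threshold(ages, labels)
print("Best threshold:", threshold, "impurity decrease:", decrease)
```

Swapping `gini` for an entropy function turns the same loop into an information-gain search.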

**Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.**

**Answer:-** Pre-Pruning and Post-Pruning are two strategies to control the complexity of decision trees and prevent overfitting. The main difference is when and how they act during the decision tree's creation and refinement.

**Pre-Pruning (Early Stopping):-**
Pre-Pruning stops the growth of the decision tree during its construction, usually by setting limits such as maximum depth, minimum samples per leaf, or minimum information gain required to make a split.

>It avoids overly complex trees by halting splits that are unlikely to generalize well, resulting in smaller, less overfit trees from the start.

>**Practical Advantage:** Pre-Pruning speeds up model training and is more computationally efficient, especially for large datasets.

**Post-Pruning (Reduced Error/Cost Complexity Pruning):-**
Post-Pruning allows the tree to grow to its full size, potentially overfitting the data, and then prunes back branches or nodes that don’t improve performance, often using cross-validation or cost-complexity criteria.

>It generally yields trees that are better balanced between complexity and accuracy by removing unnecessary branches after fully exploring data patterns.

>**Practical Advantage:** Post-Pruning can achieve higher generalization accuracy by carefully removing parts of the tree that do not contribute to predictive power, especially for smaller or noisier datasets.
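In scikit-learn, the two strategies could be sketched roughly as follows. Pre-pruning is expressed through constructor limits such as `max_depth` and `min_samples_leaf`, and post-pruning through cost-complexity pruning (`ccp_alpha`). The specific limits and the choice of a mid-range alpha are assumptions for illustration; in practice alpha would normally be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Pre-pruning: growth is limited while the tree is being built
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree first, then prune with cost-complexity pruning
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Mid-range alpha picked only for illustration; clamp to guard against tiny negative values
alpha = max(float(path.ccp_alphas[len(path.ccp_alphas) // 2]), 0.0)
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)

for name, model in [("full", full_tree), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(f"{name:<12} leaves={model.get_n_leaves():<3} test accuracy={model.score(X_test, y_test):.3f}")
```

The printout shows how many leaves each strategy keeps, which is a quick proxy for model complexity.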



**Question 4: What is Information Gain in Decision Trees, and why is it important for
choosing the best split?**

**Answer:-** **Information Gain** in Decision Trees is a metric that measures how much a particular feature contributes to reducing the uncertainty (entropy) of the target variable when used for splitting the dataset. It calculates the difference between the entropy of the dataset before the split and the weighted average entropy after the dataset is split based on a specific feature.

**Information Gain (IG)** quantifies the reduction in entropy after a dataset is split on an attribute.

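$$IG(A) = H(D) - H(D \mid A)$$

where

>$H(D)$ is the entropy of the original dataset and

>$H(D \mid A)$ is the weighted entropy after splitting on attribute A, that is, the entropy of each resulting subset weighted by the fraction of records it contains.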

**Importance for Choosing the Best Split:-**

>At every node, the decision tree evaluates all possible features and calculates the information gain for splitting on each.

>The feature with the highest information gain is selected for the split because it leads to child nodes that are more “pure” (i.e., with less class mixing and less randomness/entropy).

>This process directly impacts the effectiveness and accuracy of the tree; by always choosing the split that provides the largest reduction in uncertainty about class membership, the tree tends to reach pure leaves in fewer splits, which keeps it shallower and easier to interpret.
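A small worked example may make the calculation concrete. The sketch below computes information gain as parent entropy minus the weighted entropy of the child nodes; the function names and the yes/no toy data are assumptions for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (base 2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum((len(child) / n) * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes"] * 5 + ["no"] * 5

# Split A separates the classes perfectly; split B barely changes the class mix
split_a = [["yes"] * 5, ["no"] * 5]
split_b = [["yes", "yes", "yes", "no", "no"], ["yes", "yes", "no", "no", "no"]]

gain_a = information_gain(parent, split_a)
gain_b = information_gain(parent, split_b)
print(f"Gain for split A: {gain_a:.3f}")
print(f"Gain for split B: {gain_b:.3f}")
```

The tree would prefer split A here, since it removes all uncertainty about the class, while split B removes almost none.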



**Question 5: What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?**

**Answer:-** **Decision trees** are widely used in various real-world domains for both classification and regression tasks. Their intuitive, rule-based nature makes them appealing for data-driven decision-making across industries.

**Common Applications:-**

>**Healthcare:** Assisting with medical diagnosis, treatment recommendations, and risk prediction by evaluating symptoms, test results, and patient history.

>**Finance:** Credit scoring, loan approval, and fraud detection by analyzing financial history and transactional data.

>**Marketing and Retail:** Customer segmentation, targeted marketing, product recommendations, and dynamic pricing based on user behavior and preferences.

>**Manufacturing:** Quality control and defect prediction through analyzing production variables and sensor data.

>**Business Decision-Making:** Strategic planning, risk assessment, and choosing optimal locations or actions.

>**Agriculture:** Crop yield prediction, pest management, and resource optimization in precision farming.

**Main Advantages:-**

>**Interpretability:** The tree structure and “if-then” rules are easy to understand, explain, and visualize, supporting transparent decision-making.

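>**Handling of Both Types of Data:** Decision trees can, in principle, accommodate both numerical and categorical input features without complex encoding, although some implementations (e.g., scikit-learn) still require categorical features to be encoded as numbers.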

>**No Need for Feature Scaling:** They do not require normalization or standardization of variables, unlike some other machine learning algorithms.

>**Versatility:** Suitable for classification, regression, and even multi-output problems across domains.
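The interpretability point can be illustrated by printing a fitted tree as text rules. This is a minimal sketch using scikit-learn's `export_text`; the depth limit of 2 is an assumption chosen only to keep the printout short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Shallow tree so the printed rules stay readable
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(iris.data, iris.target)

# Render the learned splits as indented if-then rules
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line is one test on a feature, and each `class:` line is a leaf, so the output can be read top to bottom as a chain of if-then rules.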

**Main Limitations:-**

>**Overfitting:** Decision trees are prone to creating overly complex structures that capture noise in the training data, reducing generalization ability.

>**Instability:** Small changes in the dataset can lead to very different tree structures, affecting reliability.

>**Bias Toward Dominant Features:** They may be biased toward features with more categories or values if not properly controlled.

>**Limited Expressiveness:** Single decision trees may struggle with very complex relationships unless combined into ensemble models (e.g., Random Forests, Gradient Boosting).

**Question 6:   Write a Python program to:**

**● Load the Iris Dataset**

**● Train a Decision Tree Classifier using the Gini criterion**

**● Print the model’s accuracy and feature importances**

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Decision Tree Classifier with Gini criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Predict and calculate accuracy
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Print accuracy and feature importances
print('Accuracy:', accuracy)
print('Feature Importances:', clf.feature_importances_)
```

Output:

```
Accuracy: 1.0
Feature Importances: [0.         0.01667014 0.90614339 0.07718647]
```
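The raw importance array is easier to read when paired with the feature names. The sketch below refits the same model and sorts the importances; the variable name `ranked` and the print formatting are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

clf = DecisionTreeClassifier(criterion="gini", random_state=42)
clf.fit(X_train, y_train)

# Pair each importance with its feature name and sort from most to least important
ranked = sorted(
    zip(iris.feature_names, clf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name:<20} {importance:.3f}")
```

The importances are normalized to sum to 1, so each value can be read as that feature's share of the total impurity reduction.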


**Question 7:  Write a Python program to:**

**● Load the Iris Dataset**

**● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree.**

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Decision Tree with max_depth=3
clf_depth3 = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_depth3.fit(X_train, y_train)

y_pred = clf_depth3.predict(X_test)
accuracy_depth3 = accuracy_score(y_test, y_pred)

# Train fully grown Decision Tree (no max_depth)
clf_full = DecisionTreeClassifier(random_state=42)
clf_full.fit(X_train, y_train)

y_pred_full = clf_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

print("Accuracy with max_depth=3:", accuracy_depth3)
print("Accuracy with fully grown tree:", accuracy_full)
```

Output:

```
Accuracy with max_depth=3: 1.0
Accuracy with fully grown tree: 1.0
```
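Since both trees score the same on this 30-sample test split, a cross-validated comparison may be more informative. The sketch below assumes 5-fold cross-validation on the full dataset and also reports the depth and leaf count of each tree; the `results` dictionary is a hypothetical helper for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

results = {}
for label, depth in [("max_depth=3", 3), ("fully grown", None)]:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    # Accuracy is the default scorer for classifiers
    scores = cross_val_score(clf, X, y, cv=5)
    # Fit once on all data only to inspect the size of the resulting tree
    clf.fit(X, y)
    results[label] = (scores.mean(), clf.get_depth(), clf.get_n_leaves())
    print(f"{label:<12} cv accuracy={scores.mean():.3f} "
          f"depth={clf.get_depth()} leaves={clf.get_n_leaves()}")
```

If the shallower tree matches the full tree's cross-validated accuracy, it is usually the better choice because it is simpler to explain.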


**Question 8: Write a Python program to:**

**● Load the California Housing dataset from sklearn**

**● Train a Decision Tree Regressor**

**● Print the Mean Squared Error (MSE) and feature importances**

```python
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Load California Housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# Predict on test set
y_pred = regressor.predict(X_test)

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)

# Print MSE and feature importances
print('Mean Squared Error:', mse)
print('Feature Importances:', regressor.feature_importances_)
```

Output:

```
Mean Squared Error: 0.495235205629094
Feature Importances: [0.52850909 0.05188354 0.05297497 0.02866046 0.03051568 0.13083768
 0.09371656 0.08290203]
```
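To make these numbers easier to interpret, the printed MSE can be converted to a root mean squared error and the importances paired with feature names. In this sketch the values are copied from the output above, and the feature names are hard-coded in the column order of scikit-learn's California Housing data (normally available as `housing.feature_names`) so the example runs without downloading the dataset.

```python
import math

# Values copied from the printed output of the regressor
mse = 0.495235205629094
importances = [0.52850909, 0.05188354, 0.05297497, 0.02866046,
               0.03051568, 0.13083768, 0.09371656, 0.08290203]

# Column order of the California Housing features (hard-coded assumption, see lead-in)
feature_names = ["MedInc", "HouseAge", "AveRooms", "AveBedrms",
                 "Population", "AveOccup", "Latitude", "Longitude"]

# RMSE is in the same units as the target (median house value in $100,000s)
rmse = math.sqrt(mse)
print(f"RMSE: {rmse:.3f} (roughly ${rmse * 100_000:,.0f})")

# Rank features from most to least important
ranked = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)
for name, value in ranked:
    print(f"{name:<12}{value:.3f}")
```

Median income (`MedInc`) accounts for about half of the total impurity reduction, with average occupancy and the two location features following.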


**Question 9: Write a Python program to:**

**● Load the Iris Dataset**

**● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV**

**● Print the best parameters and the resulting model accuracy**

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Set up parameter grid for max_depth and min_samples_split tuning
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 4, 6, 8]
}

# Initialize Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42)

# Setup GridSearchCV
grid_search = GridSearchCV(estimator=clf, param_grid=param_grid, cv=5, scoring='accuracy')

# Fit GridSearchCV
grid_search.fit(X_train, y_train)

# Make predictions and calculate accuracy for best estimator
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Best Parameters:", grid_search.best_params_)
print("Model Accuracy:", accuracy)
```

Output:

```
Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Model Accuracy: 1.0
```
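The test accuracy comes from a single 30-sample split, so it can help to look at the cross-validated score that the grid search actually optimized. The sketch below repeats the search with the same grid and reads `best_score_`; the variable names `cv_accuracy` and `test_accuracy` are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

param_grid = {
    "max_depth": [2, 3, 4, 5, None],
    "min_samples_split": [2, 4, 6, 8],
}
grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42), param_grid, cv=5, scoring="accuracy"
)
grid_search.fit(X_train, y_train)

# Mean cross-validated accuracy of the best parameter combination (training data only)
cv_accuracy = grid_search.best_score_
# Accuracy of the refit best estimator on the held-out test set
test_accuracy = grid_search.score(X_test, y_test)

print("Best Parameters:", grid_search.best_params_)
print(f"Mean CV accuracy: {cv_accuracy:.3f}")
print(f"Test accuracy:    {test_accuracy:.3f}")
```

A noticeable gap between the two numbers would suggest that the single test split is unusually easy or hard rather than that the model has changed.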
