Decision Tree

1. What is a Decision Tree, and how does it work in the context of
classification?
- A Decision Tree is a popular supervised machine learning algorithm used for both classification and regression tasks. In the context of classification, it is a model that predicts the class label of an instance by learning decision rules inferred from the features of the data.

How it Works for Classification

The process of building a decision tree for classification can be summarized as:

- Select the best feature to split on:

  The algorithm chooses the feature (and split point) that best separates the classes, using a splitting criterion such as:

  - Gini impurity
  - Entropy / information gain
  - Chi-square (as in CHAID-style trees)

- Split the dataset:

  The dataset is divided into subsets based on the values of the selected feature.

- Repeat recursively (recursive partitioning):

  For each subset, the selection and splitting steps above are repeated until:

  - all samples in a node belong to the same class, or
  - a stopping criterion is reached (such as maximum tree depth or minimum samples per leaf).

- Assign class labels:

  Each leaf node is assigned the class label that is most frequent among the training samples in that node.
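
As a small illustration of the decision rules such a tree learns, the sketch below fits a deliberately shallow tree on Iris and prints its rules with scikit-learn's `export_text`. The choice of `max_depth=2` and `random_state=0` is only an assumption made to keep the printout short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree (depth 2 is an arbitrary choice) keeps the rule listing readable
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# Each indented line is one test on a feature; lines starting with
# "class:" are leaves holding the majority class of that node
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Reading the printout from top to bottom follows the same path a sample takes at prediction time: one feature test per level until a leaf is reached.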

2.  Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?
- In decision trees, we need a way to decide which feature to split on at each node. This is done using impurity measures that quantify how “mixed” the classes are in a dataset. The two most common measures are Gini Impurity and Entropy.

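Gini Impurity measures the probability of misclassifying a randomly chosen sample from a node if its label were drawn at random from the class distribution in that node. For class proportions p_i, Gini = 1 - Σ p_i^2. It is 0 for a pure node and largest when the classes are evenly mixed.

Entropy measures the uncertainty (disorder) of the class distribution in a node: Entropy = -Σ p_i log2(p_i). It is also 0 for a pure node and reaches its maximum, log2(k) for k classes, when all classes are equally likely.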

How They Impact Splits in a Decision Tree

- Goal: Choose the split that reduces impurity the most (makes child nodes as “pure” as possible).

- Gini and Entropy are used as criteria for evaluating splits:

- Gini Impurity → tries to minimize the probability of misclassification

- Entropy → tries to maximize information gain (reduction in disorder)

- Both usually lead to similar splits. Gini avoids computing logarithms, so it is slightly cheaper to evaluate, and it is the default criterion of scikit-learn's DecisionTreeClassifier.
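
To make the two measures concrete, here is a minimal sketch that computes them from a list of class labels, using Gini = 1 - Σ p_i^2 and Entropy = -Σ p_i log2(p_i). The helper names (`class_proportions`, `gini_impurity`, `entropy`) are illustrative, not part of any library.

```python
import numpy as np

def class_proportions(labels):
    """Fraction of samples belonging to each class present in the node."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def gini_impurity(labels):
    p = class_proportions(labels)
    return float(1.0 - np.sum(p ** 2))

def entropy(labels):
    p = class_proportions(labels)
    # Adding 0.0 turns a possible -0.0 into 0.0 for pure nodes
    return float(-np.sum(p * np.log2(p))) + 0.0

pure = ["a", "a", "a", "a"]
mixed = ["a", "a", "b", "b"]

print(gini_impurity(pure), entropy(pure))    # 0.0 0.0
print(gini_impurity(mixed), entropy(mixed))  # 0.5 1.0
```

Both measures agree that the pure node has zero impurity and that the 50/50 node is the worst case for two classes; they differ only in scale and in how strongly they penalize intermediate mixtures.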

3. What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.

| Aspect                | Pre-Pruning                  | Post-Pruning                                            |
| --------------------- | ---------------------------- | ------------------------------------------------------- |
| **When Applied**      | During tree construction     | After full tree is built                                |
| **Overfitting**       | Prevented early              | Reduced after analysis                                  |
| **Computation**       | Faster (smaller tree)        | Slower (build full tree first)                          |
| **Accuracy**          | Might underfit if too strict | Usually more accurate if tuned well                     |
| **Example Parameter** | max_depth, min_samples_split | ccp_alpha (cost-complexity pruning), reduced error pruning |
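
The contrast can be sketched with scikit-learn, where pre-pruning corresponds to growth limits such as `max_depth` and post-pruning to cost-complexity pruning via `ccp_alpha`. The specific limits and the choice of the middle alpha from the pruning path are arbitrary assumptions for illustration; in practice alpha would be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Pre-pruning: stop growth early with limits set before training
pre = DecisionTreeClassifier(max_depth=3, min_samples_split=5, random_state=42)
pre.fit(X_train, y_train)

# Post-pruning: grow the full tree, then compute the candidate alphas
# at which subtrees would be pruned away
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)

# Illustrative choice only: take the middle alpha of the path
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
post.fit(X_train, y_train)

for name, model in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(name, "leaves:", model.get_n_leaves(),
          "test accuracy:", round(model.score(X_test, y_test), 2))
```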


4. What is Information Gain in Decision Trees, and why is it important for
choosing the best split?
- Information Gain measures the reduction in uncertainty (entropy) about the target variable after splitting the dataset based on a feature. In other words, it tells us how much “information” a feature gives us about the class labels. The feature with the highest information gain is chosen for the split.

Information Gain is important because it:

- Helps select the feature that best separates the classes.

- Reduces uncertainty as much as possible at each step, which tends to produce smaller, more accurate trees.

- Favors splits on the features that carry the most predictive power.
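
As a worked example, information gain can be computed as the parent's entropy minus the size-weighted average entropy of the children: IG = H(parent) - Σ (n_k / n) · H(child_k). The sketch below uses made-up labels and hypothetical helper names to compare a perfect split with a nearly useless one.

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p))) + 0.0

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Toy node with 5 "yes" and 5 "no" labels (entropy = 1 bit)
parent = ["yes"] * 5 + ["no"] * 5

# Split A separates the classes perfectly
split_a = [["yes"] * 5, ["no"] * 5]

# Split B leaves both children almost as mixed as the parent
split_b = [["yes", "yes", "yes", "no", "no"],
           ["yes", "yes", "no", "no", "no"]]

ig_a = information_gain(parent, split_a)
ig_b = information_gain(parent, split_b)
print(f"Split A: {ig_a:.3f}")  # 1.000
print(f"Split B: {ig_b:.3f}")  # 0.029
```

A tree using the entropy criterion would pick split A here, since it removes all of the uncertainty about the label.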

5. What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?
- Real-World Applications of Decision Trees

A. Classification Tasks

- **Healthcare diagnosis:** predicting whether a patient has a disease based on symptoms and test results, e.g., classifying whether a tumor is malignant or benign.
- **Credit risk assessment:** banks use decision trees to classify loan applicants as “low risk” or “high risk” based on financial history, income, and credit score.
- **Customer segmentation:** e-commerce sites classify customers into categories (e.g., likely to buy, occasional buyers, or churn risk).
- **Fraud detection:** classifying financial transactions as “fraudulent” or “legitimate” based on transaction patterns.

B. Regression Tasks

- **House price prediction:** using features like area, location, and number of bedrooms.
- **Sales forecasting:** predicting future sales of products based on historical data and seasonality.

C. Other Applications

- **Decision support systems:** helping managers make decisions in business or logistics.
- **Game AI:** simple AI in games uses decision trees for strategy or move selection.
- **Manufacturing:** predicting machine failure or quality issues based on sensor data.

| Advantage                            | Explanation                                                               |
| ------------------------------------ | ------------------------------------------------------------------------- |
| **Easy to Understand**               | Trees can be visualized; even non-experts can interpret them.             |
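| **No Need for Feature Scaling**      | Splits compare one feature against a threshold, so rescaling does not change the tree. Trees suit numerical and categorical data, though scikit-learn needs categories encoded as numbers. |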
| **Handles Non-linear Relationships** | Can model complex relationships between features and target.              |
| **Automatic Feature Selection**      | Splitting naturally chooses the most informative features.                |
| **Fast Prediction**                  | Once trained, predictions are made by simple traversal from root to leaf. |


| Limitation                                 | Explanation                                                                                      |
| ------------------------------------------ | ------------------------------------------------------------------------------------------------ |
| **Prone to Overfitting**                   | Can create very deep trees that fit noise in the training data.                                  |
| **Unstable**                               | Small changes in data can lead to very different trees.                                          |
| **Less Accurate than Ensembles**           | Often outperformed by Random Forests or Gradient Boosting.                                       |
| **Bias Towards Features with More Levels** | Features with many categories may be chosen for splits unfairly.                                 |
| **Greedy Splitting**                       | The algorithm makes local decisions (best split at each node) which may not be globally optimal. |


Datasets for the programming questions:
● Iris Dataset for classification tasks (sklearn.datasets.load_iris() or provided CSV).
● Boston Housing Dataset for regression tasks (sklearn.datasets.load_boston() or provided CSV).

6. Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier using the Gini criterion
● Print the model’s accuracy and feature importances

In [None]:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import pandas as pd

# Load the Iris data (150 samples, 4 features, 3 classes)
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train a decision tree using the Gini criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Decision Tree Accuracy: {accuracy:.2f}")

# Feature importances sum to 1; show them from most to least important
importances = clf.feature_importances_
feature_importances = pd.Series(importances, index=feature_names).sort_values(ascending=False)
print("\nFeature Importances:")
print(feature_importances)


7. Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree.
(Include your Python code and output in the code box below.)

In [None]:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

# Same 80/20 split for both models so the comparison is fair
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Pre-pruned tree: growth stops at depth 3
clf_depth3 = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_depth3.fit(X_train, y_train)
y_pred_depth3 = clf_depth3.predict(X_test)
accuracy_depth3 = accuracy_score(y_test, y_pred_depth3)

# Fully-grown tree: no depth limit (max_depth=None is the default)
clf_full = DecisionTreeClassifier(random_state=42)
clf_full.fit(X_train, y_train)
y_pred_full = clf_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

print(f"Accuracy with max_depth=3: {accuracy_depth3:.2f}")
print(f"Accuracy with fully-grown tree: {accuracy_full:.2f}")

Output:
Accuracy with max_depth=3: 0.94
Accuracy with fully-grown tree: 1.00


8. Write a Python program to:
● Load the Boston Housing Dataset
● Train a Decision Tree Regressor
● Print the Mean Squared Error (MSE) and feature importances
(Include your Python code and output in the code box below.)

In [None]:

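import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# load_boston() was removed in scikit-learn 1.2, so the data is read from the
# original source instead, following the workaround given in scikit-learn's
# deprecation notice (needs network access; the provided CSV also works).
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)

# Each record spans two lines in the file: 11 values, then 3 (B, LSTAT, MEDV)
X = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
y = raw_df.values[1::2, 2]
feature_names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                 "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train a regression tree (default criterion: squared error)
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

y_pred = regressor.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse:.2f}")

# Feature importances, sorted from most to least important
importances = regressor.feature_importances_
feat_imp = pd.Series(importances, index=feature_names).sort_values(ascending=False)
print("\nFeature Importances:")
print(feat_imp)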

Output:
Mean Squared Error (MSE): 10.93


9. Write a Python program to:
● Load the Iris Dataset
● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV
● Print the best parameters and the resulting model accuracy
(Include your Python code and output in the code box below.)

In [None]:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

# Keep 20% aside as a final test set; the grid search only sees the rest
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

dtree = DecisionTreeClassifier(random_state=42)

# Candidate values for the two hyperparameters being tuned
param_grid = {
    'max_depth': [1, 2, 3, 4, 5, None],
    'min_samples_split': [2, 5, 10, 15]
}

# 5-fold cross-validation over every combination in the grid
grid_search = GridSearchCV(
    estimator=dtree,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy'
)
grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_
print("Best Parameters:", best_params)

# best_estimator_ is refit on the full training set with the best parameters
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the best model: {accuracy:.2f}")

Output:
Best Parameters: {'max_depth': None, 'min_samples_split': 2}


10. Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world
setting.


In [None]:
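from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Assumes X_train and y_train hold the training split after missing values
# have been imputed and categorical features encoded.
param_grid = {
    'max_depth': [3, 5, 7, None],
    'min_samples_split': [2, 5, 10],
    'criterion': ['gini', 'entropy']
}

grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                           param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

best_model = grid_search.best_estimator_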
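
The steps listed in the question (imputation, encoding, training, tuning, evaluation) can be sketched end to end as a single scikit-learn pipeline. Everything about the data here is an assumption: the column names, the synthetic "patient" records, and the rule that generates the label are invented purely so the example runs. Scoring on recall is likewise just one plausible choice for a screening setting, where missed cases tend to be costly.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for real patient records
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "age": rng.integers(20, 80, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
})
# Invented labeling rule, used only to give the tree something to learn
df["disease"] = ((df["age"] > 50) & (df["smoker"] == "yes")).astype(int)
# Inject missing values into one numeric and one categorical column
df.loc[rng.choice(n, 20, replace=False), "age"] = np.nan
df.loc[rng.choice(n, 20, replace=False), "smoker"] = np.nan

numeric = ["age", "blood_pressure"]
categorical = ["smoker"]

# Steps 1-2: impute missing values, then one-hot encode the categories
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

# Step 3: preprocessing and the tree live in one pipeline, so imputation
# statistics are learned from the training folds only
model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df[numeric + categorical], df["disease"],
    test_size=0.2, stratify=df["disease"], random_state=42,
)

# Step 4: tune the tree's hyperparameters with 5-fold cross-validation
param_grid = {
    "tree__max_depth": [3, 5, None],
    "tree__min_samples_split": [2, 10],
    "tree__criterion": ["gini", "entropy"],
}
search = GridSearchCV(model, param_grid, cv=5, scoring="recall")
search.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set
y_pred = search.predict(X_test)
print("Best parameters:", search.best_params_)
print(classification_report(y_test, y_pred))
```

On the business side, a model like this could plausibly help flag higher-risk patients for earlier follow-up testing, and because a tree's rules are readable, clinicians can check whether its reasoning looks medically sensible before relying on it.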
