# Assignment Code: DA-AG-012
# Decision Tree | Assignment

Question 1: What is a Decision Tree, and how does it work in the context of classification?

Answer:

A Decision Tree is a supervised machine learning algorithm used for both classification and regression tasks. In the context of classification, it learns a hierarchy of if/else rules on the feature values that repeatedly split the dataset into smaller, more homogeneous groups.

Each internal node in the tree represents a test on a feature.

Each branch represents the outcome of the test.

Each leaf node represents a class label.

The model splits the dataset recursively based on the most informative features until the stopping criteria are met (like pure nodes or max depth). The goal is to reduce impurity in each split.
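The node, branch, and leaf structure described above can be inspected on a small fitted tree. The following is a minimal sketch, assuming scikit-learn is available; the choice of max_depth=2 and random_state=0 is only for a compact, repeatable printout and is not part of the original answer.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a deliberately shallow tree so the printed rules stay short
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# export_text renders the learned rules as indented text:
# lines containing "<=" or ">" are tests at internal nodes (one line per branch),
# lines containing "class:" are leaves with the predicted label
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Reading the printout from top to bottom follows one path from the root to a leaf, which is exactly how a single sample is classified.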

Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?

Answer:

Both Gini Impurity and Entropy are impurity measures used to decide how to split nodes in a Decision Tree.

Gini Impurity (formula):

Gini = 1 - (p1^2 + p2^2 + ... + pn^2)


Entropy (formula):

Entropy = - (p1 * log2(p1) + p2 * log2(p2) + ... + pn * log2(pn))


Where:

p1, p2, ..., pn are the probabilities of each class in the node.
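As a worked illustration of the two formulas, the sketch below evaluates them for a few class distributions. The helper names gini and entropy are hypothetical (written for this example, not library functions), and terms with p = 0 are skipped in the entropy sum because p * log2(p) is taken to be 0 there.

```python
import math

def gini(probs):
    """Gini = 1 - sum(p_i^2) over the class probabilities in a node."""
    return 1.0 - sum(p * p for p in probs)

def entropy(probs):
    """Entropy = -sum(p_i * log2(p_i)); zero-probability classes contribute 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# A pure node, a mostly pure node, and a maximally mixed two-class node
for probs in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]):
    print(probs, f"gini={gini(probs):.3f}", f"entropy={entropy(probs):.3f}")
```

Both measures are 0 for a pure node and reach their two-class maximum at a 50/50 mix (0.5 for Gini, 1.0 for entropy), which is why they usually rank candidate splits in a similar order.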

Impact on Splits:

The Decision Tree calculates the decrease in impurity after each possible split.

It selects the split that reduces impurity the most.

This results in purer child nodes and better classification.

Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.

Answer:

Pre-Pruning (Early Stopping):

Limits the growth of the tree during training using parameters like max_depth, min_samples_split, etc.

Advantage: Reduces overfitting and training time.

Post-Pruning:

Allows the tree to grow fully, then removes branches that don’t improve performance on validation data (scikit-learn offers cost-complexity pruning through the ccp_alpha parameter).

Advantage: Often leads to better generalization because pruning is based on real model performance.
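The contrast between the two approaches can be sketched in scikit-learn as follows. This is a minimal sketch under stated assumptions: post-pruning is shown with cost-complexity pruning (ccp_alpha), which is the mechanism scikit-learn provides rather than validation-set pruning, and the alpha is picked from the middle of the pruning path purely for illustration. In practice it would be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Pre-pruning: growth is constrained while the tree is being built
pre = DecisionTreeClassifier(max_depth=3, min_samples_split=4, random_state=42)
pre.fit(X_train, y_train)

# Post-pruning: grow the full tree first, then compute its pruning path
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)

# Illustrative choice only: take an alpha from the middle of the path
# (clamped at 0 to guard against tiny negative rounding errors)
alpha = max(path.ccp_alphas[len(path.ccp_alphas) // 2], 0.0)
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
post.fit(X_train, y_train)

for name, model in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(name, "leaves:", model.get_n_leaves(),
          "test accuracy:", model.score(X_test, y_test))
```

Larger ccp_alpha values remove more branches, so the post-pruned tree never has more leaves than the fully grown one.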

Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?

Answer:

Information Gain is the reduction in impurity (classically measured with entropy) achieved by splitting a node on a feature.

Formula:

Information Gain = Impurity(parent node)
                 - [ (n1/n) * Impurity(child1) + (n2/n) * Impurity(child2) + ... ]


Where:

n1, n2, ... are the numbers of samples in each child node

n is the number of samples in the parent node

Importance:

Helps the tree identify the feature that results in the most significant improvement in class purity.

Higher information gain = better feature for splitting.
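The formula can be checked on a toy node. The sketch below uses entropy as the impurity measure; the function names and the yes/no labels are made up for illustration and do not come from any dataset in this assignment.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Impurity(parent) minus the size-weighted impurity of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical parent node: 5 "yes" and 5 "no" samples
parent = ["yes"] * 5 + ["no"] * 5

# Split A separates the classes perfectly; split B leaves both children 50/50
perfect = [["yes"] * 5, ["no"] * 5]
useless = [["yes"] * 2 + ["no"] * 2, ["yes"] * 3 + ["no"] * 3]

print("Gain of perfect split:", information_gain(parent, perfect))
print("Gain of useless split:", information_gain(parent, useless))
```

The perfect split recovers the full 1 bit of parent entropy, while the split that leaves each child as mixed as the parent gains nothing, so the tree would prefer the first.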

Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?

Answer:

Applications:

Disease diagnosis

Loan approval

Fraud detection

Customer churn prediction

Credit scoring

Advantages:

Easy to understand and visualize

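Handles both numerical and categorical data (scikit-learn's implementation still requires categorical features to be encoded as numbers)

Requires little data preprocessing (no feature scaling or normalization is needed)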

Limitations:

Prone to overfitting if not pruned

High variance (small changes in data → different tree)

Often less accurate than ensemble methods such as Random Forest


Question 6: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier using the Gini criterion
● Print the model’s accuracy and feature importances
(Include your Python code and output in the code box below.)
Dataset Info:
● Iris Dataset for classification tasks (sklearn.datasets.load_iris() or provided CSV).
● Boston Housing Dataset for regression tasks (sklearn.datasets.load_boston() or provided CSV).
Answer:



In [1]:
# Question 6

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train classifier
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Predict and evaluate
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Feature Importances:", clf.feature_importances_)


Accuracy: 1.0
Feature Importances: [0.         0.01911002 0.89326355 0.08762643]


Question 7:  Write a Python program to:   
● Load the Iris Dataset   
● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree.   
(Include your Python code and output in the code box below.)   
Answer:

In [2]:
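# Question 7
# Reuses the imports and the Iris train/test split from the Question 6 cell

# Fully grown tree (no depth limit)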
clf_full = DecisionTreeClassifier(random_state=42)
clf_full.fit(X_train, y_train)
acc_full = accuracy_score(y_test, clf_full.predict(X_test))

# Depth-limited tree (pre-pruned with max_depth=3)
clf_pruned = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_pruned.fit(X_train, y_train)
acc_pruned = accuracy_score(y_test, clf_pruned.predict(X_test))

# Compare
print("Full Tree Accuracy:", acc_full)
print("Pruned Tree Accuracy (max_depth=3):", acc_pruned)


Full Tree Accuracy: 1.0
Pruned Tree Accuracy (max_depth=3): 1.0


Question 8: Write a Python program to:   
● Load the Boston Housing Dataset   
● Train a Decision Tree Regressor   
● Print the Mean Squared Error (MSE) and feature importances   
(Include your Python code and output in the code box below.)   
Answer:

In [3]:
# Question 8

# load_boston was removed in scikit-learn 1.2, so California Housing is used as a replacement
# (train_test_split is reused from the Question 6 cell)
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load dataset
data = fetch_california_housing()
X, y = data.data, data.target

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train regressor
reg = DecisionTreeRegressor(random_state=42)
reg.fit(X_train, y_train)

# Predict and evaluate
y_pred = reg.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

print("Mean Squared Error:", mse)
print("Feature Importances:", reg.feature_importances_)


Mean Squared Error: 0.5280096503174904
Feature Importances: [0.52345628 0.05213495 0.04941775 0.02497426 0.03220553 0.13901245
 0.08999238 0.08880639]


Question 9: Write a Python program to:   
● Load the Iris Dataset   
● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV   
● Print the best parameters and the resulting model accuracy   
(Include your Python code and output in the code box below.)   
Answer:  

In [6]:
# Question 9

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define hyperparameter grid
param_grid = {
    'max_depth': [2, 3, 4, 5],
    'min_samples_split': [2, 4, 6, 10]
}

# Set up GridSearchCV
grid_search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,
    scoring='accuracy'
)

# Fit GridSearchCV
grid_search.fit(X_train, y_train)

# Get best model and evaluate
best_model = grid_search.best_estimator_
test_accuracy = accuracy_score(y_test, best_model.predict(X_test))

# Print results
print("Best Parameters:", grid_search.best_params_)
print("Best Cross-Validated Accuracy:", grid_search.best_score_)
print("Test Accuracy with Best Parameters:", test_accuracy)


Best Parameters: {'max_depth': 4, 'min_samples_split': 6}
Best Cross-Validated Accuracy: 0.9428571428571428
Test Accuracy with Best Parameters: 1.0


Question 10: Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.   
Explain the step-by-step process you would follow to:  
● Handle the missing values   
● Encode the categorical features   
● Train a Decision Tree model   
● Tune its hyperparameters   
● Evaluate its performance   
And describe what business value this model could provide in the real-world
setting.   
Answer:   

Step-by-step process:

1. Handle Missing Values:

Use SimpleImputer(strategy='mean') for numeric features.

Use SimpleImputer(strategy='most_frequent') for categorical features.

2. Encode Categorical Features:

Use OneHotEncoder for nominal features.

Use ColumnTransformer to combine encoding with numerical columns.

3. Train Decision Tree Model:

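Split data using train_test_split() before fitting any imputer or encoder, so that preprocessing statistics come from the training set only.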

Train using DecisionTreeClassifier() from sklearn.tree.

4. Tune Hyperparameters:

Use GridSearchCV to find best max_depth, min_samples_split, etc.

5. Evaluate Performance:

Use accuracy_score, confusion_matrix, classification_report.

In healthcare, focus more on recall to minimize false negatives.
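The five steps above can be sketched as a single scikit-learn pipeline. This is only an illustrative sketch: the data is synthetic, and the column names (age, blood_pressure, smoker), the label rule, and the parameter grid are all assumptions made for the example, not a real patient dataset. Recall is used as the tuning metric to reflect the focus on false negatives.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical columns and label rule)
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "age": rng.integers(20, 80, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": pd.Series(rng.choice(["yes", "no"], n), dtype=object),
})
y = ((df["age"] > 50) & (df["smoker"] == "yes")).astype(int)

# Knock out some values to mimic missing data
df.loc[rng.choice(n, 30, replace=False), "age"] = np.nan
df.loc[rng.choice(n, 30, replace=False), "smoker"] = np.nan

numeric = ["age", "blood_pressure"]
categorical = ["smoker"]

# Steps 1 and 2: impute, then one-hot encode the categorical columns
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

# Step 3: preprocessing and the tree live in one pipeline, so the
# imputers and encoder are fitted on training folds only
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.3, stratify=y, random_state=42
)

# Step 4: tune the tree's hyperparameters, optimizing recall
param_grid = {
    "tree__max_depth": [2, 3, 4, 5],
    "tree__min_samples_split": [2, 4, 6, 10],
}
search = GridSearchCV(model, param_grid, cv=5, scoring="recall")
search.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set
y_pred = search.predict(X_test)
print("Best parameters:", search.best_params_)
print("Test recall:", recall_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```

Wrapping the preprocessing inside the pipeline means GridSearchCV refits the imputers and encoder within each cross-validation fold, which avoids leaking information from validation folds into training.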

Business Value:

Early disease detection

Better patient care decisions

Cost and time savings through automation

Explainable predictions (easy for doctors to understand)