Question 1: What is a Decision Tree, and how does it work in the context of classification?


Answer:

 A Decision Tree is a supervised machine learning algorithm used for both classification and regression tasks. It models decisions and their possible consequences in a tree-like structure.
Each internal node represents a feature (attribute) test.


Each branch represents an outcome of the test.


Each leaf node represents a class label (in classification) or a numerical value (in regression).


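In classification, the tree is built by recursively splitting the dataset into subsets. At each step, the algorithm picks the attribute (and, for numerical attributes, the threshold) whose split most reduces impurity, as measured by Gini Impurity or Entropy. The process continues until a stopping criterion is met, such as a maximum depth, a minimum number of samples per node, or all samples in a node sharing one class.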
Thus, the Decision Tree works by repeatedly partitioning the data and assigning class labels to unseen instances based on the learned splits.
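
As a minimal sketch of the structure described above, the hand-written tree below classifies a sample by walking from the root to a leaf. The feature names, thresholds, and class labels are illustrative assumptions chosen for this example; they are not values learned from data.

```python
# A tiny hand-built decision tree (hypothetical thresholds, for illustration only).
# Internal nodes test one feature, branches are the test outcomes, leaves hold a class label.
tree = {
    "feature": "petal_length", "threshold": 2.5,
    "left": {"label": "setosa"},                 # petal_length <= 2.5
    "right": {                                   # petal_length > 2.5
        "feature": "petal_width", "threshold": 1.7,
        "left": {"label": "versicolor"},         # petal_width <= 1.7
        "right": {"label": "virginica"},         # petal_width > 1.7
    },
}

def predict(node, sample):
    """Follow the branch chosen by each feature test until a leaf is reached."""
    while "label" not in node:
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]

print(predict(tree, {"petal_length": 1.4, "petal_width": 0.2}))  # setosa
print(predict(tree, {"petal_length": 5.1, "petal_width": 2.0}))  # virginica
```

A trained tree has the same shape; the learning algorithm's job is to choose the features and thresholds instead of having them written by hand.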


Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?


Answer:
Gini Impurity:
 Measures the probability of misclassifying a randomly chosen element if it were labeled at random according to the class distribution in the node.
 $$Gini = 1 - \sum_{i=1}^{n} p_i^2$$
 where $p_i$ is the probability of class $i$.


A Gini of 0 means perfect purity (all elements belong to one class).


The lower the weighted Gini of the child nodes, the better the split.


Entropy:
 Measures the amount of disorder or impurity in the dataset.
 $$Entropy = -\sum_{i=1}^{n} p_i \log_2(p_i)$$
Entropy is 0 when the data is perfectly pure.


Higher entropy means more randomness.


Impact on Splits:
 Both Gini and Entropy let the tree score candidate splits. At each node, the feature (and threshold) that produces the largest reduction in weighted impurity is chosen for the split; when the impurity measure is Entropy, this reduction is called Information Gain.
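
To make the two measures concrete, here is a small sketch in plain Python that computes both for a parent node and compares two candidate splits by the size-weighted impurity of their children. The labels and the two splits are made-up toy data, and the helper names (`gini`, `entropy`, `weighted_impurity`) are chosen for this example only.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over the classes present in `labels`."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: -sum(p_i * log2(p_i))."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def weighted_impurity(children, measure):
    """Size-weighted average impurity of the child nodes produced by a split."""
    total = sum(len(child) for child in children)
    return sum(len(child) / total * measure(child) for child in children)

parent = ["yes"] * 5 + ["no"] * 5
split_a = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]          # fairly pure children
split_b = [["yes"] * 3 + ["no"] * 2, ["yes"] * 2 + ["no"] * 3]  # almost as mixed as the parent

print(gini(parent), entropy(parent))  # 0.5 1.0
for name, split in [("A", split_a), ("B", split_b)]:
    print(name, weighted_impurity(split, gini), weighted_impurity(split, entropy))
```

Both measures rank split A above split B here, because its children are purer and so the impurity drops further relative to the parent.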


Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.


Answer:
Pre-Pruning (Early Stopping):
 The tree growth is stopped early during training when a condition is met (e.g., max depth, min samples per leaf).


Advantage: Saves time and reduces overfitting by keeping the tree simpler.


Post-Pruning:
 The tree is first fully grown, and then branches that do not significantly improve accuracy (typically measured on validation data or through a cost-complexity penalty) are removed.


Advantage: Produces a simpler tree with better generalization while still considering all possible splits first, so a useful split that sits below a weak one is not missed.
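
The two strategies can be sketched with scikit-learn as follows. Pre-pruning is expressed through constructor limits such as `max_depth` and `min_samples_leaf`; post-pruning uses scikit-learn's minimal cost-complexity pruning via `ccp_alpha`. The specific limits and the choice of the middle alpha on the pruning path are arbitrary assumptions for illustration; in practice they would be tuned, for example by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Unpruned baseline grown with scikit-learn's default settings.
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruning: growth stops early because of limits set before training.
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
pre.fit(X_train, y_train)

# Post-pruning: compute the cost-complexity path of the full tree, then refit with a
# non-zero alpha. Taking the middle alpha is an arbitrary choice for this sketch.
path = full.cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)

for name, model in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(name, "nodes:", model.tree_.node_count, "test accuracy:", model.score(X_test, y_test))
```

Comparing the node counts shows how much each strategy simplifies the tree relative to the unpruned baseline.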


Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?


Answer:
 Information Gain (IG) measures how much “information” about the class labels is gained by splitting on a feature.
$$IG(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \times Entropy(S_v)$$
S: dataset


A: attribute


S_v: subset of S in which attribute A takes the value v


Importance:
Higher IG means the split results in purer subsets.


Decision Trees use IG (or Gini reduction) to choose the best feature at each step.


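It steers the tree toward the most informative split at each node, which usually improves classification accuracy. Because the choice is greedy, it does not guarantee a globally optimal tree.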
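
As a worked example of the formula, the sketch below computes Information Gain for two categorical attributes on a six-row toy dataset and picks the better one. The dataset, its column names, and the helper names are hypothetical and exist only for this illustration.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """IG(S, A) = Entropy(S) - sum over v of |S_v|/|S| * Entropy(S_v)."""
    subsets = defaultdict(list)          # value v of A -> target labels in S_v
    for row in rows:
        subsets[row[attribute]].append(row[target])
    parent = [row[target] for row in rows]
    remainder = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy(parent) - remainder

# Hypothetical toy data: should we play outside?
rows = [
    {"outlook": "sunny",    "windy": "no",  "play": "no"},
    {"outlook": "sunny",    "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no",  "play": "yes"},
    {"outlook": "overcast", "windy": "yes", "play": "yes"},
    {"outlook": "rainy",    "windy": "no",  "play": "yes"},
    {"outlook": "rainy",    "windy": "yes", "play": "no"},
]

for attr in ("outlook", "windy"):
    print(attr, round(information_gain(rows, attr, "play"), 3))
# outlook 0.667
# windy 0.082

best = max(("outlook", "windy"), key=lambda a: information_gain(rows, a, "play"))
print("Best split:", best)  # Best split: outlook
```

Splitting on `outlook` leaves two of the three subsets perfectly pure, so it removes most of the parent's one bit of entropy, while `windy` leaves both subsets nearly as mixed as before.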


Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?

Answer:
Applications:
Healthcare: Predicting diseases.


Finance: Loan approval, fraud detection.


Marketing: Customer segmentation.


Manufacturing: Quality control.


Education: Predicting student performance.


Advantages:
Easy to understand and interpret.


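Handles both categorical and numerical data (some implementations, including scikit-learn, require categorical features to be encoded numerically first).


Requires little data preprocessing (no feature scaling or normalization is needed).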


Limitations:
Can easily overfit if not pruned.


Unstable: small changes in the data may produce a very different tree.


Impurity-based split selection is biased toward features with many distinct values.


Question 6: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier using the Gini criterion
● Print the model’s accuracy and feature importances

In [1]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Train Decision Tree with Gini
clf = DecisionTreeClassifier(criterion="gini", random_state=42)
clf.fit(X, y)

# Predictions & accuracy on the training data (no held-out split, so this is training accuracy)
y_pred = clf.predict(X)
accuracy = accuracy_score(y, y_pred)

print("Model Accuracy:", accuracy)
print("Feature Importances:", clf.feature_importances_)


Model Accuracy: 1.0
Feature Importances: [0.01333333 0.         0.56405596 0.42261071]


Question 7: Python Program – Max Depth Comparison
● Load the Iris Dataset
● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully grown tree


In [2]:
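from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Split dataset (X, y are the Iris arrays loaded in the previous cell)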
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fully grown tree
clf_full = DecisionTreeClassifier(random_state=42)
clf_full.fit(X_train, y_train)
acc_full = clf_full.score(X_test, y_test)

# Limited depth tree
clf_pruned = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_pruned.fit(X_train, y_train)
acc_pruned = clf_pruned.score(X_test, y_test)

print("Fully grown tree accuracy:", acc_full)
print("Max depth=3 tree accuracy:", acc_pruned)


Fully grown tree accuracy: 1.0
Max depth=3 tree accuracy: 1.0


Question 8: Write a Python program to:
● Load the California Housing dataset from sklearn
● Train a Decision Tree Regressor
● Print the Mean Squared Error (MSE) and feature importances

In [3]:
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Load dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
reg = DecisionTreeRegressor(random_state=42)
reg.fit(X_train, y_train)

# Predictions
y_pred = reg.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

print("Mean Squared Error:", mse)
print("Feature Importances:", reg.feature_importances_)


Mean Squared Error: 0.495235205629094
Feature Importances: [0.52850909 0.05188354 0.05297497 0.02866046 0.03051568 0.13083768
 0.09371656 0.08290203]


Question 9: Write a Python program to:
● Load the Iris Dataset
● Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV
● Print the best parameters and the resulting model accuracy

In [5]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris(as_frame=True)
X = iris.data
y = iris.target   # integer class labels (0, 1, 2)

# Split into train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define parameter grid for tuning
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 3, 4, 5, 10]
}

# GridSearch with DecisionTree
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)

# Best parameters
print("Best Parameters:", grid.best_params_)

# Evaluate on test set
y_pred = grid.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of the best model:", accuracy)


Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Accuracy of the best model: 1.0


Question 10: Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values. Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world setting.

Answer:

Step 1: Handle Missing Values
For numerical features: use mean or median imputation (the median is more robust to outliers).


For categorical features: use mode imputation or add an “unknown” category.


Step 2: Encode Categorical Features
Use One-Hot Encoding for nominal features (no natural order) and ordinal (integer) encoding for features whose categories have a natural order.


Step 3: Train Decision Tree Model
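Split the dataset into training and testing sets, and fit the imputers and encoders from Steps 1 and 2 on the training set only to avoid data leakage.


Train a Decision Tree Classifier with a chosen split criterion (Gini or Entropy).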


Step 4: Hyperparameter Tuning
Use GridSearchCV to tune parameters like max_depth, min_samples_split, and criterion.


Step 5: Evaluate Performance
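Use metrics such as Accuracy, Precision, Recall, F1-score, and ROC-AUC. For disease prediction, Recall deserves particular attention, since a missed case (false negative) is usually costlier than a false alarm, and Accuracy alone can mislead when the disease is rare.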
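
The five steps can be sketched end to end with a scikit-learn pipeline. Everything about the data here is an assumption: the patient records are synthetic, the column names (`age`, `blood_pressure`, `smoker`, `disease`) are invented, and the parameter grid and the choice to tune for recall are illustrative rather than recommended settings.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records (made-up columns and values).
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
})
df["disease"] = ((df["age"] > 60) & (df["smoker"] == "yes")).astype(int)
# Knock out some values to mimic missing data.
df.loc[rng.choice(n, 40, replace=False), "blood_pressure"] = np.nan
df.loc[rng.choice(n, 40, replace=False), "smoker"] = np.nan

X, y = df.drop(columns="disease"), df["disease"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Steps 1 and 2: imputation and encoding live inside the pipeline, so they are
# fitted on training folds only.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age", "blood_pressure"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="unknown")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), ["smoker"]),
])

# Steps 3 and 4: decision tree plus a small hyperparameter grid, tuned for recall.
model = Pipeline([("prep", preprocess), ("tree", DecisionTreeClassifier(random_state=42))])
param_grid = {
    "tree__max_depth": [3, 5, None],
    "tree__min_samples_split": [2, 10],
    "tree__criterion": ["gini", "entropy"],
}
grid = GridSearchCV(model, param_grid, cv=5, scoring="recall")
grid.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set.
auc = roc_auc_score(y_test, grid.predict_proba(X_test)[:, 1])
print("Best parameters:", grid.best_params_)
print(classification_report(y_test, grid.predict(X_test)))
print("ROC-AUC:", auc)
```

Wrapping the preprocessing and the model in one `Pipeline` means each cross-validation fold refits the imputers and the encoder on its own training portion, which keeps test-set statistics out of training.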


Business Value:
The model helps doctors identify high-risk patients quickly.


Enables early diagnosis and treatment, improving patient outcomes.


Helps reduce costs by focusing medical resources on high-risk cases.
