Theoretical Questions (1–15)
1. What is a Decision Tree, and how does it work?

Answer:

A Decision Tree is a supervised learning algorithm used for classification and regression. It recursively splits the data on feature values, choosing at each node the split that best separates the target according to an impurity measure such as Gini or Entropy, and makes its predictions at the leaves.

2. What are impurity measures in Decision Trees?

Answer:

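Impurity measures quantify how mixed the target classes are at a node; a good split produces child nodes with lower impurity than the parent:

Gini Impurity

Entropy

Information Gain (the reduction in Entropy produced by a split, used to score candidate splits)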

3. What is the mathematical formula for Gini impurity?

Answer:

$$\text{Gini} = 1 - \sum_{i=1}^{n} p_i^2$$

Where $p_i$ is the probability of class $i$.
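
As a quick check of the formula, the following minimal sketch computes Gini impurity from a list of class labels in plain Python. The function name `gini_impurity` and the sample labels are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini = 1 - sum(p_i^2) over the classes present in `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0  # convention for an empty node
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

pure = ["a", "a", "a", "a"]    # one class only
mixed = ["a", "a", "b", "b"]   # two classes, evenly split

print(gini_impurity(pure))   # 0.0
print(gini_impurity(mixed))  # 0.5
```

A pure node scores 0, and an even two-class split reaches the two-class maximum of 0.5.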

4. What is the mathematical formula for Entropy?

Answer:

$$\text{Entropy} = -\sum_{i=1}^{n} p_i \log_2(p_i)$$
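
The entropy formula can be sketched the same way. This is a minimal plain-Python version; the function name `entropy` and the sample labels are assumptions made for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits: -sum(p_i * log2(p_i)) over the classes present."""
    n = len(labels)
    if n == 0:
        return 0.0  # convention for an empty node
    # p * log2(1/p) equals -p * log2(p) and avoids printing -0.0 for pure nodes.
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

print(entropy(["a", "a", "a", "a"]))  # 0.0
print(entropy(["a", "a", "b", "b"]))  # 1.0
```

Classes with zero probability never appear in the `Counter`, so the undefined term `0 * log2(0)` is never evaluated.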

5. What is Information Gain, and how is it used in Decision Trees?

Answer:

Information Gain is the reduction in entropy after a dataset is split on an attribute. At each node, the tree evaluates the candidate splits and picks the one with the highest Information Gain:

$$IG = \text{Entropy}(parent) - \sum_{child} \frac{|child|}{|parent|} \cdot \text{Entropy}(child)$$
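
To make the weighting by child size concrete, here is a small plain-Python sketch of Information Gain. The helper names `entropy` and `information_gain` and the toy labels are assumptions for illustration only.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy(parent) minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["a", "a", "b", "b"]

# A split that separates the classes perfectly removes all entropy.
print(information_gain(parent, [["a", "a"], ["b", "b"]]))  # 1.0

# A split whose children are as mixed as the parent gains nothing.
print(information_gain(parent, [["a", "b"], ["a", "b"]]))  # 0.0
```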

6. What is the difference between Gini Impurity and Entropy?

Answer:

Gini is faster to compute.

Entropy involves logarithmic calculations.

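Both measure impurity and usually produce similar trees; Gini tends to isolate the most frequent class in its own branch, while Entropy tends to produce slightly more balanced trees.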

7. What is the mathematical explanation behind Decision Trees?

Answer:

It involves recursively choosing features based on highest Information Gain or lowest Gini to partition the data into subsets, creating a tree structure.
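
One step of that recursion can be sketched in plain Python: scan the candidate thresholds of a single numeric feature and keep the one with the lowest size-weighted Gini. The names `gini` and `best_threshold` and the toy measurements are hypothetical, and real implementations such as scikit-learn's search all features and are far more optimized.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best split `value <= threshold`."""
    n = len(values)
    best = (None, gini(labels))  # baseline: leave the node unsplit
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        t = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (t, score)
    return best

# Toy data: one feature and three classes (made-up values).
feature = [1.4, 1.3, 1.5, 4.7, 4.5, 5.1]
target = ["A", "A", "A", "B", "B", "C"]

threshold, score = best_threshold(feature, target)
print(threshold, round(score, 4))
```

A full tree would apply the same search to each resulting subset until a stopping rule is met or the nodes are pure.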

8. What is Pre-Pruning in Decision Trees?

Answer:

Stopping tree growth early using criteria like max_depth or min_samples_split.
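
A minimal sketch of pre-pruning with scikit-learn, assuming the Iris dataset and arbitrarily chosen limits (`max_depth=2`, `min_samples_split=10`) that are illustrative rather than tuned:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Unrestricted tree: grows until every leaf is pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Pre-pruned tree: growth stops at depth 2, and nodes with fewer
# than 10 samples are not split further.
pruned_tree = DecisionTreeClassifier(
    max_depth=2, min_samples_split=10, random_state=0
).fit(X, y)

print("Full tree   -> depth:", full_tree.get_depth(), "leaves:", full_tree.get_n_leaves())
print("Pre-pruned  -> depth:", pruned_tree.get_depth(), "leaves:", pruned_tree.get_n_leaves())
```

Comparing depth and leaf counts shows how the stopping criteria cap model complexity before the tree can memorize the training data.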

9. What is Post-Pruning in Decision Trees?

Answer:

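Letting the tree grow fully and then trimming branches that add little predictive value, for example with cost complexity pruning (ccp_alpha in scikit-learn), with the pruning strength chosen on a validation set.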

10. What is the difference between Pre-Pruning and Post-Pruning?

Answer:

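Pre-Pruning limits tree growth during training; it is cheaper, but it may stop before a useful split is found.

Post-Pruning removes overfitted branches after the full tree has been built; it costs more, but it judges each branch with the whole tree in view.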

11. What is a Decision Tree Regressor?

Answer:

A Decision Tree used for predicting continuous values (regression tasks) instead of classes.

12. What are the advantages and disadvantages of Decision Trees?

Answer:

Advantages:

Easy to understand

No data normalization needed

Works for both classification and regression

Disadvantages:

Prone to overfitting

Often less accurate than ensemble models such as Random Forests

Can be unstable with small changes in data

13. How does a Decision Tree handle missing values?

Answer:

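Scikit-learn releases before 1.3 do not support missing values in trees; from 1.3 on, DecisionTreeClassifier and DecisionTreeRegressor accept NaN with the default splitter. Common options: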

Imputation before training

Use models like XGBoost that handle missing data internally
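
The imputation option can be sketched as a scikit-learn pipeline. This is a minimal example on a made-up array; the median strategy is one reasonable choice among several, not a recommendation from the original text.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy data with missing entries (values are invented for illustration).
X = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [np.nan, 12.0],
    [8.0, 30.0],
    [9.0, np.nan],
    [np.nan, 32.0],
])
y = np.array([0, 0, 0, 1, 1, 1])

# The imputer learns column medians on the training data and fills NaN
# before the tree ever sees the features.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    DecisionTreeClassifier(random_state=0),
)
model.fit(X, y)

# New rows with missing values are imputed with the same learned medians.
print(model.predict(np.array([[np.nan, 31.0], [1.5, np.nan]])))
```

Keeping the imputer inside the pipeline means the medians are learned from the training split only, which avoids leaking test data into preprocessing.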

14. How does a Decision Tree handle categorical features?

Answer:

Convert categories to numerical values (e.g., One-Hot Encoding or Label Encoding).

Scikit-learn requires numeric inputs.
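
A minimal sketch of One-Hot Encoding a categorical column before fitting a tree. The `colors` column and its labels are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# One categorical feature with three levels (made-up data).
colors = np.array([["red"], ["green"], ["blue"], ["green"], ["red"], ["blue"]])
y = np.array([1, 0, 0, 0, 1, 0])  # class 1 only for "red"

# handle_unknown="ignore" encodes unseen categories as all zeros at predict time.
encoder = OneHotEncoder(handle_unknown="ignore")
X_encoded = encoder.fit_transform(colors).toarray()

clf = DecisionTreeClassifier(random_state=0).fit(X_encoded, y)

print(encoder.categories_)  # learned category levels, one array per column
print(clf.predict(encoder.transform([["red"], ["blue"]]).toarray()))
```

One-Hot Encoding avoids imposing an artificial order on the categories, at the cost of one extra column per level.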

15. What are some real-world applications of Decision Trees?

Answer:

Medical Diagnosis

Customer Churn Prediction

Credit Scoring

Loan Approval

Fraud Detection

# Decision Tree Practical Questions (Q16 to Q27)

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz, plot_tree, DecisionTreeRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import accuracy_score, mean_squared_error
import matplotlib.pyplot as plt
import numpy as np

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Q16 - Basic Decision Tree Classifier

clf_16 = DecisionTreeClassifier()
clf_16.fit(X_train, y_train)
y_pred_16 = clf_16.predict(X_test)
print("Q16 - Accuracy:", accuracy_score(y_test, y_pred_16))

# Q17 - Using Gini Impurity

clf_17 = DecisionTreeClassifier(criterion="gini")
clf_17.fit(X_train, y_train)
print("Q17 - Feature Importances:", clf_17.feature_importances_)

# Q18 - Using Entropy

clf_18 = DecisionTreeClassifier(criterion="entropy")
clf_18.fit(X_train, y_train)
y_pred_18 = clf_18.predict(X_test)
print("Q18 - Accuracy:", accuracy_score(y_test, y_pred_18))

# Q19 - Decision Tree Regressor on a housing dataset, evaluated with Mean Squared Error (MSE)
# Uses the California Housing dataset from sklearn.datasets.

from sklearn.datasets import fetch_california_housing

# Load the housing dataset under its own names so the Iris split
# (X_train, X_test, y_train, y_test) stays intact for Q20 onward;
# reusing those names would feed continuous targets to the classifiers below
housing = fetch_california_housing()
X_house = housing.data
y_house = housing.target

# Split into training and test sets
X_house_train, X_house_test, y_house_train, y_house_test = train_test_split(
    X_house, y_house, test_size=0.3, random_state=42
)

# Train the Decision Tree Regressor
regressor = DecisionTreeRegressor()
regressor.fit(X_house_train, y_house_train)

# Predict on test set
y_house_pred = regressor.predict(X_house_test)

# Evaluate using Mean Squared Error
mse = mean_squared_error(y_house_test, y_house_pred)
print("Q19 - Mean Squared Error on housing dataset:", mse)

# Q20 - Visualize tree using graphviz (text form)

dot_data = export_graphviz(clf_16, out_file=None, feature_names=iris.feature_names, class_names=iris.target_names)
print("Q20 - DOT data created for Graphviz visualization")

# Q21 - Max Depth 3 vs Full Tree

clf_21_full = DecisionTreeClassifier()
clf_21_full.fit(X_train, y_train)
acc_full = accuracy_score(y_test, clf_21_full.predict(X_test))

clf_21_limited = DecisionTreeClassifier(max_depth=3)
clf_21_limited.fit(X_train, y_train)
acc_limited = accuracy_score(y_test, clf_21_limited.predict(X_test))
print("Q21 - Full Accuracy:", acc_full, "| MaxDepth=3 Accuracy:", acc_limited)

# Q22 - min_samples_split=5 vs default

clf_22_default = DecisionTreeClassifier()
clf_22_default.fit(X_train, y_train)
acc_def = accuracy_score(y_test, clf_22_default.predict(X_test))

clf_22_custom = DecisionTreeClassifier(min_samples_split=5)
clf_22_custom.fit(X_train, y_train)
acc_custom = accuracy_score(y_test, clf_22_custom.predict(X_test))
print("Q22 - Default Accuracy:", acc_def, "| min_samples_split=5 Accuracy:", acc_custom)

# Q23 - Feature scaling before training

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

clf_23 = DecisionTreeClassifier()
clf_23.fit(X_train_scaled, y_train)
acc_scaled = accuracy_score(y_test, clf_23.predict(X_test_scaled))
print("Q23 - Accuracy without Scaling:", acc_def, "| With Scaling:", acc_scaled)

# Q24 - One-vs-Rest (OvR)

clf_24 = OneVsRestClassifier(DecisionTreeClassifier())
clf_24.fit(X_train, y_train)
print("Q24 - OvR Accuracy:", accuracy_score(y_test, clf_24.predict(X_test)))

# Q25 - Feature Importance Display

clf_25 = DecisionTreeClassifier()
clf_25.fit(X_train, y_train)
print("Q25 - Feature Importances:", clf_25.feature_importances_)

# Q26 - Decision Tree Regressor with max_depth=5

rng = np.random.default_rng(42)  # fixed seed so the synthetic data is reproducible
X_reg = rng.random((100, 1)) * 10
y_reg = np.sin(X_reg).ravel()
X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(X_reg, y_reg, test_size=0.3, random_state=42)

reg_full = DecisionTreeRegressor()
reg_full.fit(X_reg_train, y_reg_train)
mse_full = mean_squared_error(y_reg_test, reg_full.predict(X_reg_test))

reg_limited = DecisionTreeRegressor(max_depth=5)
reg_limited.fit(X_reg_train, y_reg_train)
mse_limited = mean_squared_error(y_reg_test, reg_limited.predict(X_reg_test))
print("Q26 - MSE Unrestricted:", mse_full, "| MSE MaxDepth=5:", mse_limited)

# Q27 - Cost Complexity Pruning

path = clf_16.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas
acc_list = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    acc_list.append((ccp_alpha, acc))
print("Q27 - Accuracy with different CCP alphas:")
for alpha, acc in acc_list:
    print(f"Alpha: {alpha:.4f}, Accuracy: {acc:.4f}")
