# ML Decision Tree Assignment

Question 1:  What is a Decision Tree, and how does it work in the context of classification?

A Decision Tree is a flowchart-like model that splits data into subsets based on feature values. Each internal node represents a decision, branches represent outcomes, and leaf nodes give the final class label. It works by repeatedly splitting the data into purer groups until a stopping condition is reached.
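To make the node, branch, and leaf structure concrete, here is a minimal sketch of a hand-built tree stored as nested dictionaries, with a function that walks it from the root to a leaf. The tree itself, its thresholds, and the names `TREE` and `predict` are illustrative assumptions, not a model fitted to data.

```python
# A hand-written toy tree (thresholds are illustrative, not learned from data).
# Internal nodes test one feature against a threshold; leaves hold a class label.
TREE = {
    "feature": "petal_length", "threshold": 2.5,
    "left": {"label": "setosa"},            # taken when petal_length <= 2.5
    "right": {                              # taken when petal_length > 2.5
        "feature": "petal_width", "threshold": 1.7,
        "left": {"label": "versicolor"},
        "right": {"label": "virginica"},
    },
}

def predict(node, sample):
    """Follow branches from the root until a leaf is reached."""
    while "label" not in node:
        go_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]

print(predict(TREE, {"petal_length": 1.4, "petal_width": 0.2}))
print(predict(TREE, {"petal_length": 5.1, "petal_width": 2.0}))
```

A learning algorithm builds the same kind of structure automatically by searching for the feature and threshold that give the purest child nodes at each step.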

Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?

Both are measures of how mixed a node is.

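1. Gini Impurity measures the probability of misclassifying a randomly chosen sample if it were labeled at random according to the node's class distribution: Gini = 1 - Σ p_i^2.

2. Entropy measures the level of uncertainty in a node's class distribution: Entropy = -Σ p_i log2(p_i).

Here p_i is the fraction of samples in the node that belong to class i. Lower values mean purer nodes, and both measures are 0 when a node contains a single class. The tree chooses the split that reduces impurity the most, leading to better class separation.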
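As a rough illustration of the two measures, the sketch below computes both from a list of class labels using only the standard library. The function names `gini` and `entropy` and the toy label lists are assumptions chosen for this example.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: sum of -p * log2(p) over the classes present."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

pure = ["a", "a", "a", "a"]     # one class only
mixed = ["a", "a", "b", "b"]    # evenly split between two classes

print("pure :", gini(pure), entropy(pure))
print("mixed:", gini(mixed), entropy(mixed))
```

For two classes, Gini peaks at 0.5 and entropy at 1 bit, both at an even split, which is why the two criteria usually rank candidate splits similarly.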

Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.

1. Pre-Pruning: Stop tree growth early using limits like max depth or minimum samples. Advantage → saves training time and reduces the risk of overfitting.

2. Post-Pruning: Grow the full tree first, then remove weak branches. Advantage → often gives a simpler, more generalizable model.
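The two strategies can be sketched in scikit-learn as follows. Pre-pruning is shown with `max_depth` and `min_samples_leaf`; post-pruning is shown with cost-complexity pruning (`ccp_alpha`), which is one post-pruning technique among several. The mid-path choice of `alpha` is an arbitrary assumption for illustration; in practice it would be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Pre-pruning: cap growth while the tree is being built.
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
pre.fit(X_train, y_train)

# Post-pruning: grow the full tree, compute its cost-complexity path,
# then refit with a larger alpha so the weakest branches are collapsed.
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)
# Arbitrary mid-path alpha (assumption); clamp guards against tiny negatives.
alpha = max(0.0, float(path.ccp_alphas[len(path.ccp_alphas) // 2]))
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
post.fit(X_train, y_train)

for name, model in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(name, "leaves:", model.get_n_leaves(),
          "test accuracy:", model.score(X_test, y_test))
```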

Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?

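Information Gain is the reduction in impurity (classically, entropy) achieved by a split: the entropy of the parent node minus the size-weighted average entropy of its child nodes. It shows how useful a feature is for classification. At each node the split with the highest gain is chosen, so the tree makes the most informative decision available at that step.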
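A small worked sketch of the calculation, using entropy as the impurity measure. The helper names `entropy` and `information_gain` and the toy "yes"/"no" labels are assumptions made for this example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes"] * 5 + ["no"] * 5

# A split that separates the classes completely.
perfect_split = [["yes"] * 5, ["no"] * 5]
# A split whose children are as mixed as the parent.
useless_split = [["yes", "yes", "no", "no"], ["yes"] * 3 + ["no"] * 3]

print("perfect split gain:", information_gain(parent, perfect_split))
print("useless split gain:", information_gain(parent, useless_split))
```

The first split recovers the full 1 bit of parent entropy, while the second gains essentially nothing, so a tree builder would prefer the first.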

Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?

1. Applications: credit scoring, medical diagnosis, churn prediction, fraud detection.

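2. Advantages: simple to interpret, handles both categorical and numerical features (scikit-learn's implementation still requires categorical features to be encoded numerically), no feature scaling needed.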

3. Limitations: prone to overfitting, unstable (small data changes → different tree), less accurate than ensembles.

Dataset Info:

● Iris Dataset for classification tasks (sklearn.datasets.load_iris() or provided CSV).

● Boston Housing Dataset for regression tasks (sklearn.datasets.load_boston() or provided CSV). Note that load_boston() was removed in scikit-learn 1.2, so recent versions need the provided CSV.

Question 6:   Write a Python program to:

● Load the Iris Dataset

● Train a Decision Tree Classifier using the Gini criterion

● Print the model’s accuracy and feature importances
(Include your Python code and output in the code box below.)

In [1]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))
print("Feature importances:", clf.feature_importances_)

Accuracy: 0.9333333333333333
Feature importances: [0.00625    0.02916667 0.5585683  0.40601504]


Question 7:  Write a Python program to:

● Load the Iris Dataset

● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree.
(Include your Python code and output in the code box below.)

In [2]:
clf_full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
clf_d3 = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

print("Full Tree Acc:", accuracy_score(y_test, clf_full.predict(X_test)))
print("Depth=3 Acc:", accuracy_score(y_test, clf_d3.predict(X_test)))

Full Tree Acc: 0.9333333333333333
Depth=3 Acc: 0.9666666666666667


Question 8: Write a Python program to:

● Load the Boston Housing Dataset

● Train a Decision Tree Regressor

● Print the Mean Squared Error (MSE) and feature importances

(Include your Python code and output in the code box below.)

In [3]:
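import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# load_boston() was removed in scikit-learn 1.2, so read the provided CSV.
# Assumed (hypothetical): file name "BostonHousing.csv", target column "MEDV".
df = pd.read_csv("BostonHousing.csv")
Xb, yb = df.drop(columns="MEDV"), df["MEDV"]
Xb_train, Xb_test, yb_train, yb_test = train_test_split(Xb, yb, test_size=0.2, random_state=42)
reg = DecisionTreeRegressor(random_state=42).fit(Xb_train, yb_train)
print("MSE:", mean_squared_error(yb_test, reg.predict(Xb_test)))
print("Feature importances:", reg.feature_importances_)

Output: not recorded for this cell. (The values MSE: 0.0667 and a four-element importance vector belong to a regressor fit on the Iris split from In [1]; Boston Housing has 13 features.)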


Question 9: Write a Python program to:

● Load the Iris Dataset

● Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV

● Print the best parameters and the resulting model accuracy

(Include your Python code and output in the code box below.)

In [4]:
from sklearn.model_selection import GridSearchCV
param_grid = {'max_depth':[2,3,4,None],'min_samples_split':[2,4,6]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid.fit(X, y)
print("Best Params:", grid.best_params_)
# best_score_ is the mean 5-fold cross-validated accuracy for the best parameters
print("Best Accuracy:", grid.best_score_)

Best Params: {'max_depth': 3, 'min_samples_split': 2}
Best Accuracy: 0.9733333333333334


Question 10: Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:

● Handle the missing values

● Encode the categorical features

● Train a Decision Tree model

● Tune its hyperparameters

● Evaluate its performance

And describe what business value this model could provide in the real-world
setting.

Answer:
**In an imbalanced marketing dataset (only 5% positive):**

Data handling: Clean missing values, encode categoricals, remove leakage.

Scaling: Standardize numeric features.

Balancing classes: Use class weights or resampling (e.g., SMOTE).

Model training: Logistic Regression with regularization.

Hyperparameter tuning: Adjust C and penalty using cross-validation.

Evaluation: Use precision, recall, F1, and PR-AUC instead of accuracy.

Business value: Helps target the right customers, reduces marketing cost, and improves campaign effectiveness.


**And for disease prediction in healthcare:**

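Handle missing values: Impute numeric features with the mean or median and categorical features with the mode, fitting the imputer on the training split only to avoid leakage.

Encode categorical features: Use one-hot encoding for nominal features, or ordinal (label) encoding for ordered categories.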

Train model: Decision Tree Classifier, possibly with balanced class weights.

Hyperparameter tuning: Optimize max_depth, min_samples_split, and criterion using GridSearchCV.

Evaluation: Use recall, precision, F1, and ROC-AUC to ensure reliable detection. Recall deserves particular weight, since a missed diagnosis is usually costlier than a false alarm.

Business value: Enables early disease detection, supports doctors in decision-making, reduces costs, and improves patient outcomes.
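The healthcare steps above can be sketched end to end as a single scikit-learn pipeline. Everything about the data here is an assumption: the patient table is synthetic, and the column names (`age`, `blood_pressure`, `smoker`, `sex`, `disease`) are hypothetical stand-ins for a real dataset. Recall is used as the tuning metric on the assumption that missed cases are the costlier error.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records (all columns are hypothetical).
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
    "sex": rng.choice(["F", "M"], n),
})
# The label depends on age and smoking so the tree has a signal to learn.
risk = 0.03 * (df["age"] - 50) + 1.0 * (df["smoker"] == "yes") + rng.normal(0, 1, n)
df["disease"] = (risk > 1.0).astype(int)
# Blank out some entries to simulate missing values.
df.loc[rng.choice(n, 40, replace=False), "blood_pressure"] = np.nan
df.loc[rng.choice(n, 40, replace=False), "smoker"] = np.nan

X, y = df.drop(columns="disease"), df["disease"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Imputation and encoding live inside the pipeline, so they are fit on
# training folds only during cross-validation (no leakage).
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age", "blood_pressure"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["smoker", "sex"]),
])
model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(class_weight="balanced", random_state=42)),
])

param_grid = {
    "tree__max_depth": [2, 3, 5, None],
    "tree__min_samples_split": [2, 10, 20],
    "tree__criterion": ["gini", "entropy"],
}
search = GridSearchCV(model, param_grid, cv=5, scoring="recall")
search.fit(X_train, y_train)

y_pred = search.predict(X_test)
auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print("Best params:", search.best_params_)
print(classification_report(y_test, y_pred))
print("ROC-AUC:", auc)
```

Keeping preprocessing and the model in one `Pipeline` means the tuned object in `search` can be applied directly to new patient records with the same raw columns.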