# Assignment - Machine Learning: Theory Questions & Answers

Q-1 : What is a Decision Tree, and how does it work in the context of
classification?

A-1 : A Decision Tree is a popular supervised machine learning algorithm used for classification and regression tasks. In the context of classification, it is used to predict the class or category of a given input based on its features.

What is a Decision Tree?
A Decision Tree is a flowchart-like structure where:

Internal nodes represent decisions or tests on a feature (e.g., "Is age > 30?").

Branches represent the outcomes of those decisions (Yes/No).

Leaf nodes represent the final output label (class) — for example, "Approved" or "Rejected".
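
To make the flowchart picture concrete, a very small tree can be written by hand as nested conditionals. This is only an illustrative sketch: the `income` feature, the 50,000 threshold, and the function name `classify_applicant` are hypothetical and were not learned from any data. A trained tree has the same shape, but its tests are chosen automatically from the training examples.

```python
def classify_applicant(age, income):
    """Hand-written stand-in for a learned decision tree (hypothetical rules)."""
    # Internal node: test on the "age" feature
    if age > 30:
        # Internal node: test on the "income" feature (assumed threshold)
        if income > 50_000:
            return "Approved"   # leaf node
        return "Rejected"       # leaf node
    return "Rejected"           # leaf node


# Each prediction follows exactly one path from the root to a leaf.
print(classify_applicant(age=45, income=60_000))
print(classify_applicant(age=25, income=90_000))
```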



Q-2 : Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?

A-2 : Gini Impurity and Entropy are impurity measures used to decide the best feature to split the data in a Decision Tree.

Gini Impurity:
Measures the probability of misclassifying a randomly chosen sample if it were labeled at random according to the class distribution in the node.

Formula:

$$\text{Gini} = 1 - \sum_{i=1}^{n} p_i^2$$

where $p_i$ is the probability of class $i$.

Lower Gini means purer nodes.
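
As a quick check of the Gini formula, the sketch below computes it for a list of class labels using only the standard library. The helper name `gini_impurity` and the toy labels are assumptions made for illustration.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini = 1 - sum(p_i^2), where p_i is the fraction of samples in class i."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


pure_node = ["yes", "yes", "yes", "yes"]
mixed_node = ["yes", "yes", "no", "no"]

print(gini_impurity(pure_node))   # a single class gives impurity 0
print(gini_impurity(mixed_node))  # a 50/50 binary mix gives 0.5
```

For two classes the value ranges from 0 (pure) to 0.5 (evenly mixed).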

Entropy:
Measures the amount of disorder or uncertainty in the data.

Formula:

$$\text{Entropy} = -\sum_{i=1}^{n} p_i \log_2(p_i)$$

Entropy is 0 when all samples belong to one class.
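
The entropy formula can be checked the same way. This is a minimal standard-library sketch; the helper name `entropy` and the toy labels are illustrative assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy = -sum(p_i * log2(p_i)) over the classes present in `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0
    total = 0.0
    for count in Counter(labels).values():
        p = count / n
        total += p * math.log2(p)
    # abs() avoids returning -0.0 for a pure node
    return abs(total)


print(entropy(["yes", "yes", "yes", "yes"]))  # one class: no uncertainty
print(entropy(["yes", "yes", "no", "no"]))    # 50/50 binary mix: 1 bit
```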

Impact on Splits:
Both are used to evaluate the quality of splits.

The split that reduces impurity the most is chosen: the one with the highest Information Gain when using Entropy, or the lowest weighted Gini Impurity across the child nodes when using Gini.
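
One way to picture split selection is to try every candidate threshold on a single numeric feature and keep the one with the lowest weighted Gini of the two children. The sketch below does exactly that; the function names and the four-sample dataset are hypothetical, and real libraries use a more efficient search.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of one node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best 'value <= threshold' split."""
    best = None
    # The largest value is skipped because it would leave the right child empty.
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = weighted_gini(left, right)
        if best is None or score < best[1]:
            best = (t, score)
    return best


values = [1, 2, 3, 4]           # toy feature values (hypothetical)
labels = ["a", "a", "b", "b"]   # toy class labels (hypothetical)
print(best_threshold(values, labels))
```

Here the threshold 2 separates the two classes perfectly, so its weighted Gini is 0 and it wins over the other candidates.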

Q-3 :  What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.


A-3 : Pre-Pruning and Post-Pruning are techniques used to prevent overfitting in Decision Trees by controlling their growth.

Pre-Pruning (Early Stopping):
Definition: Stop growing the tree before it becomes overly complex.

How: Set limits like max depth, minimum samples per split, or minimum impurity decrease.

Advantage:
Faster training and simpler trees, which are easier to interpret and deploy.

Post-Pruning (e.g., Reduced Error Pruning or Cost-Complexity Pruning):
Definition: Grow the full tree first, then cut back branches that don’t improve performance.

How: Remove branches using validation data or statistical tests.

Advantage:
Better generalization as pruning is guided by actual performance, not assumptions.
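
The two approaches can be compared in scikit-learn roughly as follows. Note the assumptions: scikit-learn does not implement reduced error pruning, so this sketch uses its built-in minimal cost-complexity pruning (`ccp_alpha`) as the post-pruning step. The Iris data, the 70/30 split, and choosing `ccp_alpha` by validation accuracy are illustrative choices as well.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Pre-pruning: limit growth up front with stopping rules.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_split=5, random_state=0)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree, then choose how much to cut back.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative values
    candidate = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    candidate.fit(X_train, y_train)
    score = candidate.score(X_val, y_val)
    if score >= best_score:  # on ties, prefer the larger alpha (smaller tree)
        best_alpha, best_score = alpha, score

post_pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha)
post_pruned.fit(X_train, y_train)

print("Full tree nodes:  ", full_tree.tree_.node_count)
print("Pre-pruned depth: ", pre_pruned.get_depth())
print("Post-pruned nodes:", post_pruned.tree_.node_count)
```

In practice the validation data used to pick `ccp_alpha` should be kept separate from the final test set.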

Q-4 : What is Information Gain in Decision Trees, and why is it important for
choosing the best split?

A-4 : Information Gain measures the reduction in entropy (impurity) achieved by splitting a dataset on a feature.

Formula:

$$\text{Information Gain} = \text{Entropy(Parent)} - \sum_{i} \frac{n_i}{n} \times \text{Entropy(Child}_i\text{)}$$

where:

$n$ = total samples

$n_i$ = samples in child node $i$

Why is it Important?
It helps choose the best feature to split the data by selecting the one that maximizes the reduction in uncertainty.

A higher Information Gain means the feature creates purer child nodes.
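
A small worked example, using only the standard library, shows how the formula separates a useful split from a useless one. The function names and the toy "yes"/"no" labels are assumptions for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits of the class distribution in `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0
    total = sum((c / n) * math.log2(c / n) for c in Counter(labels).values())
    return abs(total)  # abs() avoids returning -0.0 for a pure node


def information_gain(parent, children):
    """Entropy(parent) minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted_child_entropy = sum(
        len(child) / n * entropy(child) for child in children
    )
    return entropy(parent) - weighted_child_entropy


parent = ["yes", "yes", "no", "no"]

# Split A separates the classes perfectly; split B leaves both children mixed.
split_a = [["yes", "yes"], ["no", "no"]]
split_b = [["yes", "no"], ["yes", "no"]]

print(information_gain(parent, split_a))  # all uncertainty removed
print(information_gain(parent, split_b))  # nothing learned
```

Split A removes the full 1 bit of uncertainty, while split B removes none, so a tree builder would pick split A.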

Q-5 :  What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?

A-5 :  Real-World Applications of Decision Trees:
Medical Diagnosis

Predict diseases based on symptoms and test results.

Credit Scoring / Loan Approval

Assess if a person is likely to repay a loan based on income, credit history, etc.

Customer Churn Prediction

Identify which customers are likely to leave a service.

Fraud Detection

Detect unusual patterns in transactions.

Marketing and Recommendation Systems

Suggest products based on past behavior.

Advantages:
Easy to understand and interpret

Works with both numerical and categorical data

No need for feature scaling or normalization

Can model non-linear relationships

Limitations:
Prone to overfitting (especially with deep trees)

Unstable: small changes in the training data can produce a very different tree

Greedy algorithm — may not find the global optimal tree

Q-6 : Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier using the Gini criterion
● Print the model’s accuracy and feature importances
(Include your Python code and output in the code box below.)

In [1]:
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data  # Features
y = iris.target  # Labels

# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the Decision Tree Classifier using Gini criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Predict on test data
y_pred = clf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print accuracy and feature importances
print("Model Accuracy:", accuracy)
print("Feature Importances:")
for feature_name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"  {feature_name}: {importance:.4f}")


Model Accuracy: 1.0
Feature Importances:
  sepal length (cm): 0.0000
  sepal width (cm): 0.0167
  petal length (cm): 0.9061
  petal width (cm): 0.0772


Q-7 : Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree.
(Include your Python code and output in the code box below.)

In [2]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Decision Tree with max_depth=3
tree_limited = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)
tree_limited.fit(X_train, y_train)
y_pred_limited = tree_limited.predict(X_test)
accuracy_limited = accuracy_score(y_test, y_pred_limited)

# Fully-grown Decision Tree
tree_full = DecisionTreeClassifier(criterion='gini', random_state=42)
tree_full.fit(X_train, y_train)
y_pred_full = tree_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

# Print both accuracies
print(f"Accuracy with max_depth=3: {accuracy_limited:.2f}")
print(f"Accuracy with fully-grown tree: {accuracy_full:.2f}")


Accuracy with max_depth=3: 1.00
Accuracy with fully-grown tree: 1.00


Q-8 : Write a Python program to:
● Load the Boston Housing Dataset
● Train a Decision Tree Regressor
● Print the Mean Squared Error (MSE) and feature importances
(Include your Python code and output in the code box below.)

In [3]:
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

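# Load the California Housing dataset (used in place of Boston Housing,
# because load_boston was removed from scikit-learn in version 1.2)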
data = fetch_california_housing()
X = data.data
y = data.target

# Split the dataset (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# Predict and calculate MSE
y_pred = regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

# Print results
print("Mean Squared Error (MSE):", round(mse, 4))
print("Feature Importances:")
for name, importance in zip(data.feature_names, regressor.feature_importances_):
    print(f"  {name}: {importance:.4f}")


Mean Squared Error (MSE): 0.4952
Feature Importances:
  MedInc: 0.5285
  HouseAge: 0.0519
  AveRooms: 0.0530
  AveBedrms: 0.0287
  Population: 0.0305
  AveOccup: 0.1308
  Latitude: 0.0937
  Longitude: 0.0829


Q-9 : Write a Python program to:
● Load the Iris Dataset
● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV
● Print the best parameters and the resulting model accuracy
(Include your Python code and output in the code box below.)

In [4]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the Decision Tree model
dt = DecisionTreeClassifier(random_state=42)

# Define parameter grid
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 5, 10]
}

# Perform grid search with 5-fold cross-validation
grid_search = GridSearchCV(estimator=dt, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Get best model and evaluate on test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Best Parameters:", grid_search.best_params_)
print("Test Set Accuracy:", round(accuracy, 4))


Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Test Set Accuracy: 1.0


Q-10 : Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world
setting.

A-10 :  1. Handle Missing Values
Numerical Features:

Use mean or median imputation (e.g., SimpleImputer(strategy='mean')).

Categorical Features:

Use most frequent value or create a new category like "Unknown".

Tools: SimpleImputer from sklearn.impute

2. Encode Categorical Features
Use One-Hot Encoding for nominal (unordered) categories (e.g., gender, blood type).

Use Ordinal Encoding if categories have a meaningful order (e.g., mild → moderate → severe).

Tools: OneHotEncoder, OrdinalEncoder from sklearn.preprocessing
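
The imputation and encoding steps above can be combined into one preprocessing object, so that the same transformations are learned on the training data and reused on new patients. This is a sketch under stated assumptions: the column names `age` and `blood_type` and the four toy records are hypothetical, and pandas is assumed to be available.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical patient records with missing values in both columns
patients = pd.DataFrame({
    "age": [34.0, np.nan, 51.0, 29.0],
    "blood_type": ["A", "B", np.nan, "A"],
})

# Numerical feature: fill gaps with the median
numeric_steps = SimpleImputer(strategy="median")

# Categorical feature: fill gaps with "Unknown", then one-hot encode
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="Unknown")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer(
    [
        ("num", numeric_steps, ["age"]),
        ("cat", categorical_steps, ["blood_type"]),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

X_prepared = preprocess.fit_transform(patients)
print(X_prepared.shape)  # 1 numeric column + 3 one-hot columns (A, B, Unknown)
```

Fitting the preprocessor only on the training split, then calling `transform` on the test split, avoids leaking test-set statistics into the imputed values.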

3. Train a Decision Tree Model
Split the data into training and testing sets (e.g., 80/20).

Fit a DecisionTreeClassifier on the training data.

In [5]:
from sklearn.tree import DecisionTreeClassifier

# X_train and y_train are assumed to be the imputed, encoded training split
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)


4. Tune Hyperparameters
Use GridSearchCV to tune key parameters:

max_depth: Limits the tree depth to reduce overfitting.

min_samples_split: Minimum samples required to split a node.

criterion: "gini" or "entropy"

In [8]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'criterion': ['gini', 'entropy']
}

grid = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train)
best_model = grid.best_estimator_


5. Evaluate Model Performance
Use appropriate metrics:

Accuracy (for balanced datasets)

Precision, Recall, F1-Score (recall is especially important in healthcare, where a false negative means a missed diagnosis)

Confusion Matrix

ROC-AUC (for binary classification)

In [9]:
from sklearn.metrics import classification_report, confusion_matrix

y_pred = best_model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))


[[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
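
For a binary disease label, it helps to see why accuracy alone can mislead. The sketch below computes the metrics by hand on hypothetical predictions (1 = has the disease); the helper name `binary_metrics` and the eight toy labels are assumptions, not results from any real model.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical labels: 3 sick patients, 5 healthy ones
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]  # two sick patients are missed

for name, value in binary_metrics(y_true, y_pred).items():
    print(f"{name}: {value:.3f}")
```

In this toy case accuracy is 0.625, yet recall is only about 0.333 because two of the three sick patients were missed, which is the kind of error a healthcare model most needs to surface.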



Real-World Business Value in Healthcare
Early Disease Detection
→ Helps doctors prioritize high-risk patients and begin treatment earlier.

Personalized Treatment Plans
→ By identifying key features (age, blood markers, symptoms), doctors can tailor care.

Resource Optimization
→ Hospitals can reduce unnecessary testing by targeting high-risk individuals more accurately.

Scalability
→ Once trained, the model can evaluate thousands of records instantly — aiding telemedicine and rural health outreach.

Explainability
→ Decision Trees are easy to visualize and interpret, which builds trust with clinicians.