Question 1: What is a Decision Tree, and how does it work in the context of
classification?


A Decision Tree is a supervised learning algorithm used for both classification and regression, but it is most commonly applied in classification problems.

It looks like a tree structure, where:

Root Node: the first feature used to split the data.

Internal Nodes: decision points based on features.

Branches: possible outcomes of a decision.

Leaf Nodes: the final class label (prediction).

How it Works in Classification

i) Start at the root node: the algorithm selects the feature (and split point) that best separates the classes, using a metric such as Gini impurity or entropy-based Information Gain.
ii) Split the dataset: the data is divided into subsets based on the chosen feature’s values.
iii) Repeat recursively: each subset is split again using the same process.
iv) Stop condition: splitting stops when
-all data in a node belongs to the same class,
-no features are left to split on, or
-a maximum depth or other limit is reached.
v) Prediction: a new observation is classified by following the decisions from the root down to a leaf node.
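
The root-to-leaf walk described above can be made visible with scikit-learn. The snippet below is a minimal sketch, assuming the Iris data and an arbitrary max_depth=2 chosen only to keep the printed rules short; it is an illustration, not part of the original answer.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A deliberately shallow tree so the printed rules stay readable
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(iris.data, iris.target)

# Text rendering of the learned splits (root at the top, leaves indented)
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)

# Follow a single observation from the root to its leaf
sample = iris.data[:1]
node_indicator = tree.decision_path(sample)       # sparse matrix with one row
visited_nodes = node_indicator.indices.tolist()   # ids of the nodes on the path
leaf_id = int(tree.apply(sample)[0])              # id of the leaf that was reached
predicted = iris.target_names[tree.predict(sample)[0]]

print("Nodes visited:", visited_nodes)
print("Leaf reached:", leaf_id, "-> predicted class:", predicted)
```

The first node visited is always node 0 (the root), and the last one is the leaf whose majority class becomes the prediction.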

Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?

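Gini Impurity measures how often a randomly chosen sample would be misclassified if it were labeled at random according to the class distribution in a node. With class proportions p_i, Gini = 1 - sum(p_i^2); it is 0 for a pure node.

Entropy measures the amount of uncertainty or disorder in a node: Entropy = -sum(p_i * log2(p_i)). It is also 0 for a pure node and is largest when the classes are evenly mixed.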
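
To make the two measures concrete, here is a small sketch that computes both from a list of class labels. The helper names gini and entropy and the toy label lists are assumptions made for illustration.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over the class proportions p_i."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Entropy in bits: sum(p_i * log2(1 / p_i)) over the class proportions."""
    n = len(labels)
    return sum((count / n) * math.log2(n / count) for count in Counter(labels).values())

pure = ["a", "a", "a", "a"]      # one class only
mixed = ["a", "a", "b", "b"]     # two classes, evenly mixed

print("pure :", gini(pure), entropy(pure))
print("mixed:", gini(mixed), entropy(mixed))
```

Both measures are zero for the pure node; for the evenly mixed two-class node Gini reaches 0.5 and entropy reaches 1 bit.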

How They Impact Splits

A Decision Tree algorithm (like CART or ID3) uses these impurity measures to decide the best feature to split on:

For each possible split, it calculates the impurity (Gini or Entropy) of the child nodes.

It measures how much impurity decreases (this is called Information Gain when using Entropy).

The split that gives the largest reduction in impurity is chosen.
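
The selection procedure above can be sketched for a single numeric feature. This is a simplified illustration on hypothetical toy data, not how any particular library implements it: candidate thresholds are the midpoints between neighbouring values, and the one with the largest weighted Gini decrease wins.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

# Hypothetical toy data: one numeric feature and a binary label
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
labels = ["no", "no", "no", "yes", "yes", "yes"]

best_threshold, best_gain = None, -1.0
sorted_values = sorted(set(values))
for low, high in zip(sorted_values, sorted_values[1:]):
    threshold = (low + high) / 2   # midpoint between neighbouring values
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    gain = impurity_decrease(labels, left, right)
    if gain > best_gain:
        best_threshold, best_gain = threshold, gain

print("Best threshold:", best_threshold, "impurity decrease:", best_gain)
```

Here the threshold 3.5 separates the two classes perfectly, so both children are pure and the full parent impurity of 0.5 is removed.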


Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.

Pre-Pruning (Early Stopping)

Definition: Stop the tree from growing too deep during training.

How: Apply constraints like:

-Maximum depth

-Minimum samples per leaf/node

-Minimum impurity decrease

Goal: Prevent overfitting before it happens.

✅ Practical Advantage:
Faster training and simpler trees → useful when working with large datasets.
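
In scikit-learn these constraints are constructor arguments. The sketch below compares an unconstrained tree with a pre-pruned one on Iris; the specific limits (3, 5, 0.01) are arbitrary values picked for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# No limits: the tree grows until every leaf is pure
unconstrained = DecisionTreeClassifier(random_state=42).fit(X, y)

# Pre-pruned: growth stops as soon as any constraint would be violated
pre_pruned = DecisionTreeClassifier(
    max_depth=3,                  # maximum depth
    min_samples_leaf=5,           # minimum samples per leaf
    min_impurity_decrease=0.01,   # minimum impurity decrease for a split
    random_state=42,
).fit(X, y)

print("Unconstrained: depth", unconstrained.get_depth(),
      "leaves", unconstrained.get_n_leaves())
print("Pre-pruned:    depth", pre_pruned.get_depth(),
      "leaves", pre_pruned.get_n_leaves())
```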

Post-Pruning (Pruning after Full Growth)

Definition: First, grow the tree fully, then prune (cut back) branches that don’t add much predictive power.

How:

-Cost-complexity pruning (CART)

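-Reduced-error pruning (checks each subtree against a held-out validation set; C4.5 itself uses pessimistic, error-based pruning)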

Goal: Simplify the model while keeping accuracy.

✅ Practical Advantage:
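Often leads to better generalization, because each branch is judged after the full tree has been grown, so a split that looks weak on its own but enables useful splits below it is not discarded too early.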
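
Cost-complexity pruning is available in scikit-learn through the ccp_alpha parameter. The following is a minimal sketch, assuming a simple hold-out validation split on Iris to pick alpha; cross-validation would be a more robust way to choose it.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Step 1: grow the full tree and get the candidate pruning strengths
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Step 2: refit once per alpha and keep the one that validates best
best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)   # guard against tiny negative round-off
    candidate = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    candidate.fit(X_train, y_train)
    score = candidate.score(X_val, y_val)
    if score >= best_score:          # ties go to the larger alpha (simpler tree)
        best_alpha, best_score = alpha, score

pruned_tree = DecisionTreeClassifier(random_state=42, ccp_alpha=best_alpha)
pruned_tree.fit(X_train, y_train)

print("Leaves before pruning:", full_tree.get_n_leaves())
print("Leaves after pruning: ", pruned_tree.get_n_leaves())
print("Chosen ccp_alpha:", best_alpha, "validation accuracy:", best_score)
```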


Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?

Information Gain (IG)

Definition: Information Gain measures the reduction in impurity (uncertainty) after a dataset is split on a feature.

It tells us how much “information” a feature gives us about the target variable.

Why It’s Important

-Decision Trees choose splits that maximize Information Gain.

-Higher Information Gain = the split makes the child nodes purer (more homogeneous).

-Without IG (or similar impurity measures), the tree would split randomly and not learn useful patterns.
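
As a worked illustration, Information Gain is the parent entropy minus the size-weighted entropy of the children. The function names and the two toy splits below are assumptions made for this sketch.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum((count / n) * math.log2(n / count) for count in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the weighted entropy of its child nodes."""
    n = len(parent)
    weighted = sum((len(child) / n) * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes"] * 5 + ["no"] * 5

# A split that separates the classes completely
perfect_split = [["yes"] * 5, ["no"] * 5]

# A split whose children are just as mixed as the parent
useless_split = [["yes", "yes", "no", "no"],
                 ["yes", "yes", "yes", "no", "no", "no"]]

ig_perfect = information_gain(parent, perfect_split)
ig_useless = information_gain(parent, useless_split)
print("IG of perfect split:", ig_perfect)
print("IG of useless split:", ig_useless)
```

The perfect split recovers the full 1 bit of parent entropy, while the useless split gains nothing, which is why the tree would prefer the first.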


Question 5: What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?

Real-World Applications of Decision Trees

i) Healthcare: diagnosing diseases based on patient symptoms/test results.

ii) Finance: credit scoring, loan approval (predicting risk of default).

iii) Marketing: customer segmentation, predicting churn, targeting ads.

iv) Retail/E-commerce: product recommendation, demand forecasting.

v) Manufacturing: quality control, defect detection.

vi) Education: predicting student performance based on study habits, attendance, etc.

Advantages of Decision Trees

i) Easy to understand & interpret → Works like a flowchart, no complex math needed.

ii) Handles categorical & numerical data → Flexible with different data types.

iii) No feature scaling required → No need for normalization/standardization.

iv) Captures nonlinear relationships → Works well when the relationship between features and target is complex.

v) Can handle missing values → Depends on the implementation.

Limitations of Decision Trees

i) Overfitting: trees can become too deep and memorize the training data.

ii) Unstable: small changes in the data can lead to very different trees.

iii) Biased towards features with more levels: may favor categorical variables with many categories.

iv) Less accurate than ensemble methods: a single tree is usually weaker than a Random Forest or Gradient Boosted Trees.
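
One way to check the last limitation on a given dataset is to cross-validate a single tree against a Random Forest. The sketch below uses scikit-learn's built-in breast cancer dataset as a stand-in; the dataset choice and the settings are assumptions, and the gap between the two models varies from dataset to dataset.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "single tree": DecisionTreeClassifier(random_state=42),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# 5-fold cross-validated accuracy for each model
mean_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    mean_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```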

Dataset Info:
● Iris Dataset for classification tasks (sklearn.datasets.load_iris() or
provided CSV).
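● Boston Housing Dataset for regression tasks
(sklearn.datasets.load_boston() or provided CSV). load_boston() was removed in
scikit-learn 1.2, so Question 8 uses the California Housing dataset instead.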

Question 6: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier using the Gini criterion
● Print the model’s accuracy and feature importances

In [1]:
# Import libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data        # Features
y = iris.target      # Target labels

# Split dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train Decision Tree Classifier using Gini criterion
clf = DecisionTreeClassifier(criterion="gini", random_state=42)
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Print accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

# Print feature importances
print("\nFeature Importances:")
for feature, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{feature}: {importance:.4f}")


Model Accuracy: 1.0

Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0167
petal length (cm): 0.9061
petal width (cm): 0.0772


Question 7: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree.

In [2]:
# Import libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data        # Features
y = iris.target      # Target labels

# Split dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train fully-grown Decision Tree (no depth limit)
full_tree = DecisionTreeClassifier(criterion="gini", random_state=42)
full_tree.fit(X_train, y_train)
y_pred_full = full_tree.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

# Train Decision Tree with max_depth=3
limited_tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
limited_tree.fit(X_train, y_train)
y_pred_limited = limited_tree.predict(X_test)
accuracy_limited = accuracy_score(y_test, y_pred_limited)

# Print results
print("Accuracy of fully-grown tree: ", accuracy_full)
print("Accuracy of tree with max_depth=3: ", accuracy_limited)


Accuracy of fully-grown tree:  1.0
Accuracy of tree with max_depth=3:  1.0


Question 8: Write a Python program to:
● Load the California Housing dataset from sklearn
● Train a Decision Tree Regressor
● Print the Mean Squared Error (MSE) and feature importances


In [3]:
# Import libraries
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load the California Housing dataset
housing = fetch_california_housing()
X = housing.data        # Features
y = housing.target      # Target values (median house value)

# Split dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train a Decision Tree Regressor
regressor = DecisionTreeRegressor(criterion="squared_error", random_state=42)
regressor.fit(X_train, y_train)

# Make predictions
y_pred = regressor.predict(X_test)

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error (MSE):", mse)

# Print feature importances
print("\nFeature Importances:")
for feature, importance in zip(housing.feature_names, regressor.feature_importances_):
    print(f"{feature}: {importance:.4f}")


Question 9: Write a Python program to:
● Load the Iris Dataset
● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV
● Print the best parameters and the resulting model accuracy

In [5]:
# Import libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data        # Features
y = iris.target      # Target labels

# Split dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Define parameter grid for tuning
param_grid = {
    'max_depth': [2, 3, 4, 5, None],          # tree depth options
    'min_samples_split': [2, 4, 6, 8, 10]     # minimum samples to split a node
}

# Initialize Decision Tree Classifier
dt = DecisionTreeClassifier(criterion="gini", random_state=42)

# Perform GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(dt, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Get best parameters
print("Best Parameters:", grid_search.best_params_)

# Evaluate model with best parameters
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy with best parameters:", accuracy)


Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Model Accuracy with best parameters: 1.0


Question 10: Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world
setting.


Step 1: Handling Missing Values

Identify missing values in the dataset using isnull() or info().

Decide on a strategy depending on the data type:

Numerical features → fill with mean, median, or use advanced imputation (like KNN Imputer).

Categorical features → fill with mode or “Unknown” category.

Optionally, remove rows/columns with too many missing values if imputation is not meaningful.

Example:

from sklearn.impute import SimpleImputer
num_imputer = SimpleImputer(strategy='median')
cat_imputer = SimpleImputer(strategy='most_frequent')

Step 2: Encoding Categorical Features

Identify categorical features.

Use encoding techniques:

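Label (ordinal) Encoding → for ordinal categories. In scikit-learn, OrdinalEncoder is the feature-side tool; LabelEncoder is intended for the target column.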

One-Hot Encoding → for nominal categories with no order.
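
Steps 1 and 2 can be combined in one preprocessing object. The following is a minimal sketch on a made-up patient table; the column names, values, and category order are hypothetical and only illustrate the wiring of imputers and encoders.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical patient records with missing entries in every column
df = pd.DataFrame({
    "age": [34, 51, np.nan, 67, 45],
    "blood_type": ["A", "O", np.nan, "B", "A"],
    "severity": ["low", "high", "medium", np.nan, "low"],
})
# Keep the text columns as plain object dtype so np.nan marks missing values
df = df.astype({"blood_type": object, "severity": object})

# Numerical feature: fill with the median
numeric = Pipeline([("impute", SimpleImputer(strategy="median"))])

# Nominal feature: fill with an "Unknown" category, then one-hot encode
nominal = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="Unknown")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# Ordinal feature: fill with the mode, then encode using an explicit order
ordinal = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OrdinalEncoder(categories=[["low", "medium", "high"]])),
])

preprocess = ColumnTransformer([
    ("num", numeric, ["age"]),
    ("nom", nominal, ["blood_type"]),
    ("ord", ordinal, ["severity"]),
])

X_encoded = preprocess.fit_transform(df)
print(X_encoded.shape)   # 1 numeric + 4 one-hot + 1 ordinal column
```

Wrapping the steps this way lets the same imputation and encoding be fitted on the training data and reused unchanged on the test data.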

Step 3: Train a Decision Tree Model

Split the dataset into train/test sets.

Initialize a Decision Tree Classifier:

from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(criterion='gini', random_state=42)
dt.fit(X_train, y_train)


Make predictions:

y_pred = dt.predict(X_test)

Step 4: Tune Hyperparameters

Use GridSearchCV or RandomizedSearchCV to find the best parameters.

Key parameters to tune:

max_depth → controls tree depth to prevent overfitting

min_samples_split → minimum samples needed to split a node

min_samples_leaf → minimum samples at a leaf

criterion → gini or entropy
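
A randomized search over these four parameters might look like the sketch below. It uses scikit-learn's breast cancer dataset as a stand-in for the healthcare data, and the candidate values, n_iter, and the ROC-AUC scoring choice are all assumptions for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in binary medical dataset (not the company's real data)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Candidate values for the key parameters listed above
param_distributions = {
    "max_depth": [3, 5, 7, None],
    "min_samples_split": [2, 5, 10, 20],
    "min_samples_leaf": [1, 2, 5, 10],
    "criterion": ["gini", "entropy"],
}

# Sample 20 random combinations and score each with 5-fold cross-validation
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_distributions,
    n_iter=20,
    scoring="roc_auc",
    cv=5,
    random_state=42,
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best cross-validated ROC-AUC:", round(search.best_score_, 4))
```

RandomizedSearchCV tries only a sample of the grid, which keeps the cost manageable when the dataset is large or the grid has many combinations.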

Step 5: Evaluate Performance

Use metrics suited for classification:

Accuracy - overall correctness

Precision & Recall - recall is especially important in healthcare, where a false negative means a missed diagnosis

F1-score - balance between precision & recall

ROC-AUC - performance across thresholds

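from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
# ROC-AUC needs scores rather than hard labels; [:, 1] assumes a binary target
print(roc_auc_score(y_test, dt.predict_proba(X_test)[:, 1]))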

Step 6: Business Value

Early Detection - Identify high-risk patients before symptoms worsen.

Resource Optimization - Prioritize medical tests and interventions efficiently.

Personalized Care - Tailor treatment plans for patients based on risk prediction.

Cost Reduction - Avoid unnecessary procedures and hospitalizations.

