# Decision Tree | Assignment

## Question 1:  What is a Decision Tree, and how does it work in the context of classification?

A decision tree is a supervised learning algorithm used for both classification and regression tasks. It has a hierarchical tree structure consisting of a root node, branches, internal nodes, and leaf nodes. It works like a flowchart that makes decisions step by step, where:

- Internal nodes represent attribute tests
- Branches represent attribute values
- Leaf nodes represent final decisions or predictions.

Decision trees are widely used due to their interpretability, flexibility and low preprocessing needs.


### How a Decision Tree Works for Classification:

The process works like a series of questions and answers:

- Start with the Root Node: The entire dataset begins at the top, called the root node.

- Feature Selection (Splitting): The algorithm evaluates all available features to determine which one best splits the data into the most homogeneous groups with respect to the target variable (the class label). Common metrics used for this evaluation are Information Gain or the Gini Index.

- Recursive Partitioning: The process of splitting the data is repeated recursively for each new, resulting subset (child node). The goal at each step is to maximize the "purity" of the nodes, meaning that each resulting node contains as many instances of a single class as possible.

- Creating Decision (Internal) Nodes: Each split creates a decision node, which represents a test on an attribute (e.g., "Does the email contain the word 'free'?").

- Reaching Leaf Nodes: The process stops when a predefined criterion is met, such as when the nodes are pure (all data points belong to the same class), when a certain tree depth is reached, or when the number of data points in a node is below a minimum threshold. The final nodes that are not split further are called leaf nodes.

- Making a Prediction: Each leaf node represents a final class label (e.g., "Spam" or "Not Spam"). To classify a new data point, you start at the root node and follow the path down the tree by answering the questions at each decision node until you reach a leaf node, which gives the classification.

In essence, a decision tree mimics human decision-making by creating a set of IF-THEN rules to classify data.
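The prediction walk described above can be sketched in a few lines of plain Python. This is only an illustration: the tree below is written by hand, and the feature names (`num_links`, `contains_free`) and thresholds are hypothetical rather than learned from any dataset.

```python
# A hand-built toy tree (hypothetical features and thresholds, not learned from data).
# Internal nodes test one feature against a threshold; leaves hold a class label.
toy_tree = {
    "feature": "num_links",
    "threshold": 3,
    "left": {"label": "Not Spam"},          # num_links <= 3
    "right": {                              # num_links > 3
        "feature": "contains_free",
        "threshold": 0.5,
        "left": {"label": "Not Spam"},      # word "free" absent
        "right": {"label": "Spam"},         # word "free" present
    },
}


def classify(node, sample):
    """Walk from the root to a leaf, following one branch per attribute test."""
    while "label" not in node:
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]


email = {"num_links": 7, "contains_free": 1}
print(classify(toy_tree, email))  # Spam
```

Each root-to-leaf path corresponds to one IF-THEN rule, for example "IF num_links > 3 AND contains_free > 0.5 THEN Spam".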

## Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?



- Gini Impurity and Entropy are impurity measures used in decision trees to determine the best split by quantifying the disorder in a dataset. Gini Impurity measures the probability of misclassifying a random sample, aiming for a low score, while Entropy measures the disorder or uncertainty, seeking to reduce it to zero. The impact on splits is that the algorithm chooses the feature that results in the greatest reduction of either Gini Impurity or Entropy after the split, as this indicates a more "pure" or "certain" child node.

## Gini Impurity

- Concept: Measures the likelihood of a randomly chosen element being incorrectly classified if it were randomly labeled according to the class distribution of the node.

- Calculation: For a node with $k$ classes, the formula is $Gini = 1 - \sum_{i=1}^{k} p_i^2$

where \(p_i\) is the proportion of elements belonging to class \(i\).

- Goal: The goal is to achieve a Gini Impurity of 0, which indicates a "pure" node where all elements belong to the same class.

- Impact on Splits: A split is considered good if it results in a lower Gini Impurity in the child nodes compared to the parent node. The algorithm selects the split that maximizes the reduction in Gini Impurity (or "Gini Gain").
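As a worked illustration (the class counts are made up for the example), consider a node holding 6 samples of class A and 4 samples of class B, so the class proportions are 0.6 and 0.4:

$$
Gini = 1 - (0.6^2 + 0.4^2) = 1 - (0.36 + 0.16) = 0.48
$$

A pure node (10 of class A, none of class B) gives $1 - 1^2 = 0$, and an evenly mixed two-class node gives $1 - (0.25 + 0.25) = 0.5$, which is the largest value possible with two classes.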

## Entropy

- Concept: Measures the disorder or randomness in a dataset. A higher entropy value indicates greater uncertainty.

- Calculation: The formula for entropy is \(Entropy=-\sum _{i=1}^{k}p_{i}\log _{2}(p_{i})\), where \(p_{i}\) is the proportion of elements belonging to class \(i\).

- Goal: To reduce the entropy to 0, which signifies a "pure" node with no uncertainty about the class labels.

- Impact on Splits: The algorithm chooses the feature and split that results in the greatest reduction of entropy. This reduction is known as "Information Gain".

- Speed: Computing entropy involves logarithmic calculations, which can be computationally more expensive than Gini Impurity.
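Both measures can be computed directly from per-class counts. The sketch below uses plain Python; the helper names `gini` and `entropy` and the example counts are illustrative choices, not part of any library.

```python
from math import log2


def gini(counts):
    """Gini impurity from a list of per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Shannon entropy in bits; empty classes contribute 0 by convention."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


# Made-up nodes: pure, moderately mixed, and evenly mixed.
for counts in ([10, 0], [6, 4], [5, 5]):
    print(counts, round(gini(counts), 3), round(entropy(counts), 3))
```

Both functions return 0 for the pure node and reach their two-class maximum (0.5 for Gini, 1 bit for entropy) on the evenly mixed node, which is why the two criteria usually rank candidate splits similarly.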


## How they impact splits in a Decision Tree

- Scoring potential splits: For each potential split (e.g., splitting on a specific feature with a certain threshold), the impurity of the resulting child nodes is calculated using either Gini Impurity or Entropy.

- Calculating the gain: The impurity reduction is calculated for that split. For example, the Information Gain from a split is the parent node's impurity minus the weighted average impurity of the child nodes.

- Choosing the best split: The algorithm compares the impurity reduction for all possible splits. It then selects the split that results in the maximum impurity reduction (either Gini Gain or Information Gain).

- Iterative process: This process is repeated for each new node until a stopping criterion is met, such as reaching a maximum depth or a minimum number of samples in a node.
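The scoring loop can be sketched for a single numeric feature as follows. This is a simplified illustration rather than scikit-learn's actual implementation: it tries the midpoint between each pair of adjacent distinct values and keeps the threshold with the largest Gini reduction. The function names and the toy data are assumptions made for the example.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_threshold(values, labels):
    """Scan midpoints between sorted distinct values; return (threshold, gain)."""
    parent = gini(labels)
    n = len(labels)
    distinct = sorted(set(values))
    best = (None, 0.0)
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        # Weighted average impurity of the two children.
        weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
        gain = parent - weighted
        if gain > best[1]:
            best = (t, gain)
    return best


# Toy data (made up): petal length in cm and a two-class label.
petal_length = [1.4, 1.5, 1.3, 4.7, 4.5, 5.1]
species = ["setosa", "setosa", "setosa", "other", "other", "other"]
print(best_threshold(petal_length, species))  # (3.0, 0.5)
```

Here the threshold 3.0 separates the classes perfectly, so the gain equals the parent's full impurity of 0.5. A real tree repeats this scan over every feature and then recurses into each child.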

## Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.

- **Pre-pruning** stops a decision tree from growing by setting criteria like maximum depth, while post-pruning removes branches from a fully grown tree. A practical advantage of **pre-pruning** is its computational efficiency, as it avoids the cost of building a large, unnecessary tree, while **post-pruning's** advantage is its potentially higher accuracy, since it can make more informed decisions by evaluating the full tree structure before making cuts.

## Pre-pruning

- How it works: Prevents the tree from growing excessively by stopping the splitting process early based on predefined conditions.

- Conditions: These can include setting a maximum tree depth, a minimum number of samples required to split a node, or a minimum information gain.

- Practical advantage: Computational efficiency. It is more computationally efficient because it prevents the algorithm from building a full tree that might later be pruned, saving time and resources during training, especially for large datasets.

## Post-pruning
- How it works: A full decision tree is first constructed, and then branches are removed if they are deemed insignificant or if pruning improves the overall model performance.

- Conditions: A common approach is to use a validation set to evaluate the impact of removing a branch on the tree's accuracy.

- Practical advantage: Potentially higher accuracy. It can lead to better pruning decisions because it analyzes the complete tree structure, allowing for more robust evaluation of a branch's impact on performance and potentially leading to a more generalized model compared to pre-pruning which might stop too early.






   |  Feature        | Pre-Pruning               |    Post-Pruning                      |
   | -------------- | ------------------------- | --------------------------------- |
   | When applied   | During tree growth        | After full tree is grown          |
   | Goal           | Prevent overfitting early | Remove overfitting after building |
   | Complexity     | Lower                     | Higher                            |
   | Main advantage | Faster, cheaper training  | Better predictive performance     |


In [None]:
# Pre-pruning

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Pre-pruned tree
pre_pruned_tree = DecisionTreeClassifier(
    max_depth=3,           # limit depth
    min_samples_split=5,   # require at least 5 samples to split
    min_samples_leaf=2,    # require at least 2 samples per leaf
    random_state=42
)

pre_pruned_tree.fit(X_train, y_train)
pred = pre_pruned_tree.predict(X_test)
print("Pre-Pruned Accuracy:", accuracy_score(y_test, pred))


Pre-Pruned Accuracy: 1.0


In [None]:
# Post-pruning

#Step 1: Train tree fully

tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

#Step 2: Get effective alphas for pruning

path = tree.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas

# Step 3: Train several pruned trees and choose the best

trees = []
for alpha in ccp_alphas:
    t = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    t.fit(X_train, y_train)
    trees.append(t)

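# Step 4: Pick the best alpha by accuracy
# Note: the test set is used for selection here only for brevity, which makes
# the reported score optimistic; a separate validation split or
# cross-validation on the training data is preferable in practice.

import numpy as np

accuracies = [accuracy_score(y_test, t.predict(X_test)) for t in trees]
best_tree = trees[np.argmax(accuracies)]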

print("Best post-pruned accuracy:", max(accuracies))
print("Best alpha:", ccp_alphas[np.argmax(accuracies)])

Best post-pruned accuracy: 1.0
Best alpha: 0.0
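The selection step above scores the candidate trees on the test set, so the reported accuracy doubles as the selection criterion. One possible alternative, sketched here under the same split settings, is to choose `ccp_alpha` with 5-fold cross-validation on the training data and evaluate on the test set only once at the end. The variable names are illustrative, and no particular output is claimed.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Candidate alphas from the pruning path of a fully grown tree.
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(
    X_train, y_train
)
# Defensive clip: ccp_alpha must be non-negative.
ccp_alphas = np.clip(path.ccp_alphas, 0.0, None)

# Score each alpha with 5-fold cross-validation on the training data only.
cv_means = [
    cross_val_score(
        DecisionTreeClassifier(random_state=42, ccp_alpha=alpha),
        X_train, y_train, cv=5,
    ).mean()
    for alpha in ccp_alphas
]
best_alpha = ccp_alphas[int(np.argmax(cv_means))]

# Refit on the full training set and touch the test set exactly once.
final_tree = DecisionTreeClassifier(random_state=42, ccp_alpha=best_alpha)
final_tree.fit(X_train, y_train)
test_acc = accuracy_score(y_test, final_tree.predict(X_test))
print("Best alpha (CV):", best_alpha)
print("Test accuracy:", test_acc)
```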


## Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?

- Information Gain (IG) is a metric used in Decision Trees to measure how much a feature helps reduce uncertainty (impurity) in the target variable when splitting the data.

- Information Gain = Reduction in impurity after a dataset is split using a feature.

## Formula:

$$
IG = Impurity(parent) - \sum_{i=1}^{k} \frac{n_i}{n} \cdot Impurity(child_i)
$$

Where:

- $Impurity$ = Entropy or Gini
- $n$ = number of samples in the parent node
- $n_i$ = number of samples in child node $i$
- $k$ = number of child nodes
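The formula can be checked numerically with a short sketch. It uses entropy as the impurity measure; the function names and the class counts are made up for the illustration.

```python
from math import log2


def entropy(counts):
    """Shannon entropy in bits from per-class sample counts."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


def information_gain(parent_counts, children_counts):
    """Parent entropy minus the size-weighted entropy of the children."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted


# Made-up node: 5 positive and 5 negative samples, split two different ways.
perfect = information_gain([5, 5], [[5, 0], [0, 5]])  # children are pure
useless = information_gain([5, 5], [[3, 3], [2, 2]])  # children mirror the parent
print(perfect, useless)
```

The first split removes all uncertainty (gain of 1 bit), while the second leaves every child as mixed as the parent (gain of 0), so a tree would prefer the first.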




Information gain is important for choosing the best split in a decision tree because it quantifies how much a feature helps in separating the data into purer, more homogeneous groups. The feature with the highest information gain is selected as the split because it leads to the largest reduction in uncertainty (entropy) and improves the model's predictive accuracy.

- Measures impurity reduction: Information gain measures the decrease in entropy, a metric that quantifies impurity or uncertainty in a dataset. A higher information gain means the split based on that feature results in child nodes that are more homogeneous with respect to the target variable.
- Selects the best feature: To build an effective decision tree, the algorithm selects the feature that provides the most information gain at each node to make the split. This is done iteratively, with the feature that has the highest information gain being chosen to create the sub-nodes.
- Improves model accuracy: By consistently choosing splits that create the purest possible groups, information gain helps build a more accurate and efficient decision tree. It helps the model make better decisions by reducing the uncertainty about the final outcome with each split.
- Handles different data types: Information gain can be used for both categorical and numerical features. For numerical features, it calculates the information gain for different possible split points (thresholds) and chooses the one with the highest gain.



## Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?

### Common Real-World Applications of Decision Trees:

- Healthcare: Assisting in medical diagnosis by predicting diseases based on patient data like blood pressure and glucose levels.

- Finance: Evaluating creditworthiness for loan applications, conducting risk assessment, and detecting fraudulent transactions.

- Marketing: Predicting customer behavior, such as identifying customers likely to churn or respond to a marketing campaign, and segmenting customers based on their behavior and demographics.

- Business: Strategic planning, resource allocation, and evaluating new product launches or market expansions.

- Education: Predicting exam results based on factors like attendance and past grades to identify at-risk students.

- Automated systems: Guiding users through automated telephone systems or other interactive processes.

### Advantages:

- Easy to understand: The visual, flowchart-like structure is simple to comprehend, even for those without a deep analytical background.

- Handles different data types: They can be used with both numerical (e.g., income) and categorical (e.g., gender) data.

- Non-linear relationships: They can easily capture and model nonlinear relationships in data.

- Feature selection: They can help identify the most important variables for a problem, as they show the relationships between variables.

- Transparency: Each decision node can be traced, making the reasoning behind a prediction clear, which is crucial for gaining trust in fields like finance and healthcare.

### Limitations:

- Prone to overfitting: They can create overly complex trees that do not generalize well to new, unseen data.

- Instability: Small variations in the data can lead to the creation of a completely different tree.

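- Bias: Split criteria such as information gain tend to favor features with many distinct values (levels), even when those features are not truly more predictive.

- Class imbalance: With imbalanced classes, predictions can be biased towards the most frequent class.

- Limited for regression: A regression tree predicts a constant value in each leaf, so its output is a step function that approximates smooth trends only coarsely and cannot extrapolate beyond the target values seen in training.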
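One concrete regression limitation is that tree predictions are piecewise constant, so a tree cannot extrapolate a trend beyond its training range. The sketch below illustrates this on synthetic data following y = 2x, which is made up purely for the demonstration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data following y = 2x on x = 0..9 (made up for the illustration).
X = np.arange(10).reshape(-1, 1)
y = 2.0 * X.ravel()

reg = DecisionTreeRegressor(random_state=0).fit(X, y)

# Far outside the training range the tree returns the value of its outermost leaf.
print(reg.predict([[100]]))  # the true trend would give 200
```

The prediction at x = 100 is 18, the largest target seen in training, because the query simply falls into the right-most leaf.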

## Question 6:   Write a Python program to:
- Load the Iris Dataset
- Train a Decision Tree Classifier using the Gini criterion
- Print the model’s accuracy and feature importances

In [None]:
# Load libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Load the Iris Dataset
iris = load_iris()
X = iris.data
y = iris.target

# 2. Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=42)

# 3. Train Decision Tree Classifier with Gini criterion
model = DecisionTreeClassifier(criterion='gini', random_state=42)
model.fit(X_train, y_train)

# 4. Make predictions
y_pred = model.predict(X_test)

# 5. Print accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

# 6. Print feature importances
print("Feature Importances:")
for feature, importance in zip(iris.feature_names, model.feature_importances_):
    print(f"{feature}: {importance:.4f}")


Model Accuracy: 1.0
Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0167
petal length (cm): 0.9061
petal width (cm): 0.0772


## Question 7:  Write a Python program to:
- Load the Iris Dataset
- Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree.

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Decision Tree with max_depth = 3
tree_limited = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_limited.fit(X_train, y_train)
pred_limited = tree_limited.predict(X_test)
acc_limited = accuracy_score(y_test, pred_limited)

# Fully-grown Decision Tree
tree_full = DecisionTreeClassifier(random_state=42)
tree_full.fit(X_train, y_train)
pred_full = tree_full.predict(X_test)
acc_full = accuracy_score(y_test, pred_full)

acc_limited, acc_full


(1.0, 1.0)

## Question 8: Write a Python program to:
- Load the Boston Housing Dataset
- Train a Decision Tree Regressor
- Print the Mean Squared Error (MSE) and feature importances

In [None]:
# Load libraries
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# 1. Load the California Housing Dataset (used in place of Boston Housing,
#    whose loader load_boston was removed from scikit-learn in version 1.2)
data = fetch_california_housing()
X = data.data
y = data.target

# 2. Split the dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Train a Decision Tree Regressor
model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)

# 4. Predict on test data
y_pred = model.predict(X_test)

# 5. Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error (MSE):", mse)

# 6. Print Feature Importances
print("\nFeature Importances:")
for name, importance in zip(data.feature_names, model.feature_importances_):
    print(f"{name}: {importance:.4f}")


Mean Squared Error (MSE): 0.495235205629094

Feature Importances:
MedInc: 0.5285
HouseAge: 0.0519
AveRooms: 0.0530
AveBedrms: 0.0287
Population: 0.0305
AveOccup: 0.1308
Latitude: 0.0937
Longitude: 0.0829


## Question 9: Write a Python program to:
- Load the Iris Dataset
- Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV
- Print the best parameters and the resulting model accuracy

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Load the Iris Dataset
iris = load_iris()
X = iris.data
y = iris.target

# 2. Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Define Decision Tree model
dt = DecisionTreeClassifier(random_state=42)

# 4. Parameter grid for tuning
param_grid = {
    "max_depth": [1, 2, 3, 4, 5, None],
    "min_samples_split": [2, 3, 4, 5, 10]
}

# 5. GridSearchCV for hyperparameter tuning
grid = GridSearchCV(
    estimator=dt,
    param_grid=param_grid,
    cv=5,               # 5-fold cross-validation
    scoring="accuracy"
)

# Fit the grid search
grid.fit(X_train, y_train)

# 6. Evaluate best model
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# 7. Print results
print("Best Parameters:", grid.best_params_)
print("Model Accuracy:", accuracy)


Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Model Accuracy: 1.0


## Question 10: Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values.

Explain the step-by-step process you would follow to:

- Handle the missing values
- Encode the categorical features
- Train a Decision Tree model
- Tune its hyperparameters
- Evaluate its performance

Also describe the business value this model could provide in a real-world setting.


## 1. Handle the missing values

Goal: avoid data leakage, preserve signal in missingness, and keep preprocessing inside CV.

Steps:

1. Explore missingness

   - Percent missing per column, missingness patterns, correlation between missingness and target.

2. Classify columns (numeric / categorical / datetime / id).

3. Impute inside a pipeline

    - Numeric: SimpleImputer(strategy='median'), or the model-based IterativeImputer if many numeric columns are correlated. Pass add_indicator=True to append missing-value indicator columns for important features.

    - Categorical: SimpleImputer(strategy='constant', fill_value='__MISSING__') (or a dedicated “missing” category).

    - Time series / longitudinal: use forward/backward fill or feature engineering (e.g., “days since last visit”).

4. Keep imputation inside Pipeline/ColumnTransformer so imputation is fit only on training folds (no leakage).

5. If missingness is informative: add explicit binary indicator columns (is_missing_featureX).
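The median imputation and missing-indicator steps can be sketched on a tiny array. The values below are made up (they stand in for two numeric columns such as blood pressure and glucose); `add_indicator=True` is the `SimpleImputer` option that appends the binary indicator columns.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up numeric columns (e.g. blood pressure, glucose) with gaps.
X = np.array([
    [120.0, 90.0],
    [np.nan, 110.0],
    [140.0, np.nan],
    [130.0, 100.0],
])

# Median imputation plus one binary "was missing" column per feature with gaps.
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```

The result has four columns: the two imputed features followed by their two indicator columns. In a real project the imputer would be fitted inside a pipeline so that medians are learned from training folds only.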

## 2. Encode the categorical features

Goal: convert categories to numeric without introducing leakage or artificial ordering.

Rules of thumb:

- Low-cardinality (≤10 unique): OneHotEncoder(handle_unknown='ignore').

- Medium/high-cardinality:

    - OrdinalEncoder: acceptable for tree models, which can split on integer codes, but the codes imply an order that does not really exist, so use it with care.

    - Target (mean) encoding or frequency encoding for high-cardinality — must be done with CV-safe scheme (e.g., use category_encoders with K-fold target encoding inside the training folds to avoid leakage).

- Rare categories: group into 'other' if counts are small.

- Unseen categories: ensure encoder handles unknowns (handle_unknown='ignore').

Always include encoding in the same pipeline as imputation.

## 3. Train a Decision Tree model

Goal: reproducible, leakage-free training.

Steps:

1. Train/test split: train_test_split(..., stratify=y, test_size=0.2, random_state=SEED).

2. Build preprocessing ColumnTransformer:

    - Numeric pipeline: SimpleImputer(median) (+ StandardScaler only if mixing with models that need scaling; not required for trees).

    - Categorical pipeline: SimpleImputer(strategy='constant', fill_value='__MISSING__') + OneHotEncoder(handle_unknown='ignore') or other encoder.

3. Create Pipeline([('preproc', preproc), ('clf', DecisionTreeClassifier(random_state=SEED))]).

4. Fit the pipeline on training data.

5. If class imbalance: either class_weight='balanced' in the classifier or resampling inside CV (e.g., SMOTE via imblearn in a pipeline).
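The training steps above could be assembled roughly as follows. This is a minimal sketch on synthetic data: the column names (`glucose`, `age`, `smoker`), the toy target, and the injected gaps are all hypothetical stand-ins for real patient records.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

SEED = 42
rng = np.random.default_rng(SEED)
n = 200

# Synthetic stand-in for patient data; column names are hypothetical.
df = pd.DataFrame({
    "glucose": rng.normal(100, 20, n),
    "age": rng.integers(20, 80, n).astype(float),
    "smoker": rng.choice(["yes", "no"], n),
})
y = (df["glucose"] > 110).astype(int)  # toy target tied to glucose
# Inject missing values after the target has been computed.
df.loc[rng.choice(n, 20, replace=False), "glucose"] = np.nan
df.loc[rng.choice(n, 20, replace=False), "smoker"] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=SEED
)

# Imputation and encoding live inside the pipeline, so they are fit on training data only.
preproc = ColumnTransformer([
    ("num", SimpleImputer(strategy="median", add_indicator=True), ["glucose", "age"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="__MISSING__")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["smoker"]),
])

model = Pipeline([
    ("preproc", preproc),
    ("clf", DecisionTreeClassifier(max_depth=3, random_state=SEED)),
])
model.fit(X_train, y_train)
print("Held-out accuracy:", model.score(X_test, y_test))
```

Because preprocessing is part of the fitted pipeline, calling `model.predict` on new raw rows applies the same imputation and encoding automatically.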

## 4. Tune hyperparameters

Goal: control overfitting and optimize clinically-relevant objectives.

Important DecisionTree params:

* max_depth

* min_samples_split

* min_samples_leaf

* max_features

* criterion (gini or entropy)

* class_weight (for imbalance)

Tuning approach:

1. Choose metric aligned with business cost. For disease detection, often prioritize recall (sensitivity) to avoid missed cases, or use F1 / PR-AUC if false positives are costly too.

2. Use cross-validation: StratifiedKFold(n_splits=5).

3. Search method: RandomizedSearchCV for large search space; GridSearchCV for small grid.

4. Example param grid:

       param_grid = {
           'clf__max_depth': [3, 5, 8, None],
           'clf__min_samples_split': [2, 5, 10],
           'clf__min_samples_leaf': [1, 2, 4],
           'clf__criterion': ['gini', 'entropy']
       }


5. Avoid leakage: keep preprocessing inside pipeline passed to GridSearchCV.

6. Optionally use nested CV to get an unbiased estimate of performance when reporting.
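A possible way to wire these tuning steps together is sketched below. It assumes recall is the metric that matters most and uses a synthetic, imbalanced dataset from `make_classification` in place of real patient data; the pipeline here is deliberately reduced to an imputer plus the tree.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

SEED = 42

# Synthetic, imbalanced stand-in for the patient data (about 20% positives).
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=SEED
)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("clf", DecisionTreeClassifier(class_weight="balanced", random_state=SEED)),
])

param_grid = {
    "clf__max_depth": [3, 5, 8, None],
    "clf__min_samples_split": [2, 5, 10],
    "clf__min_samples_leaf": [1, 2, 4],
    "clf__criterion": ["gini", "entropy"],
}

search = GridSearchCV(
    pipe,
    param_grid,
    scoring="recall",  # assumes missed cases are the costlier error
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED),
)
search.fit(X, y)
print("Best parameters:", search.best_params_)
print("Mean CV recall:", round(search.best_score_, 3))
```

Passing the whole pipeline to `GridSearchCV` means the imputer is refitted inside every fold, which is what keeps the search free of leakage.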

## 5. Evaluate performance

Goal: evaluate both statistical performance and clinical usefulness.

Metrics to compute:

* Confusion matrix (TP, FP, TN, FN) — essential for clinical tradeoffs.

* Recall (sensitivity) — prioritized if missing disease is costly.

* Precision — important if follow-up tests are costly or invasive.

* F1-score — balance precision & recall.

* ROC AUC and PR AUC (PR AUC is more informative for imbalanced problems).

* Calibration (are predicted probabilities meaningful?) — use calibration plots / CalibratedClassifierCV.

* Decision-curve analysis or cost-sensitive metrics to choose an operating threshold aligned with business costs.

* Subgroup analysis: check performance across age, gender, hospital, etc., to detect bias.

Robustness & monitoring:

* Evaluate on a held-out test set (or temporal holdout if dataset is time-ordered).

* Perform bootstrapping or repeated CV to get confidence intervals for metrics.

* Monitor feature stability and concept drift after deployment.
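Several of the metrics listed above can be computed with `sklearn.metrics`, as in the sketch below. The labels and predicted probabilities are made up for ten imaginary patients, and the 0.5 threshold is only a hypothetical operating point.

```python
from sklearn.metrics import (
    average_precision_score,
    confusion_matrix,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Made-up labels and predicted probabilities for ten patients (1 = disease).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_prob = [0.1, 0.2, 0.3, 0.6, 0.2, 0.1, 0.9, 0.7, 0.4, 0.8]

threshold = 0.5  # hypothetical operating point; lower it to favor recall
y_pred = [int(p >= threshold) for p in y_prob]

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, FP, TN, FN:", tp, fp, tn, fn)
print("Recall:", recall_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("ROC AUC:", roc_auc_score(y_true, y_prob))
print("PR AUC (average precision):", average_precision_score(y_true, y_prob))
```

With these toy values the threshold yields 3 true positives, 1 false positive, and 1 false negative, so recall and precision are both 0.75. Lowering the threshold to 0.4 would catch the missed case at no extra cost in false positives here, which is the kind of tradeoff the confusion matrix makes visible.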


## 6. Business value in a real-world healthcare setting

- Early detection & triage: flag high-risk patients for follow-up tests or specialist review, enabling earlier intervention and improving outcomes.

- Resource optimization: prioritize costly diagnostic tests and specialist time for patients most likely to benefit.

- Cost savings: avoid unnecessary tests for low-risk patients while catching high-risk cases earlier (reduces downstream costs).

- Decision support & auditing: interpretable rules let clinicians understand and validate suggestions, enabling adoption and trust.

- Population health management: identify segments with high prevalence and direct preventive programs or screening.

- Clinical research: identify important predictors and hypotheses for further study.

Risks/limitations to manage:

- False negatives (missed disease) have high cost — tune thresholds to prefer sensitivity or add human review for negatives.

- Data bias or drift can harm subgroups — perform fairness checks and monitoring.

- Regulatory and ethical approvals may be required before clinical use.
