# Decision Tree

### Question 1: What is a Decision Tree, and how does it work in the context of classification?

A **Decision Tree** is a supervised learning algorithm used for **classification and regression tasks**. It works by **splitting data into branches** based on feature values, forming a tree-like structure where:

* **Root Node** → Represents the entire dataset.
* **Internal Nodes** → Represent decisions or tests on features.
* **Leaf Nodes** → Represent final outcomes or class labels.

In classification, the tree divides data based on measures like **Gini Impurity** or **Entropy**, aiming to create **pure subsets** where most instances belong to a single class.
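
To make the root / internal node / leaf structure concrete, here is a minimal sketch of a hand-built tree and the path a sample takes through it. The feature names and thresholds are illustrative assumptions, not values fitted to any dataset.

```python
# A hand-built tree: each internal node tests one feature against a threshold,
# each leaf stores a class label. Thresholds are hypothetical, not learned.
tree = {
    "feature": "petal_length", "threshold": 2.5,      # root node
    "left": {"label": "setosa"},                      # leaf node
    "right": {                                        # internal node
        "feature": "petal_width", "threshold": 1.7,
        "left": {"label": "versicolor"},
        "right": {"label": "virginica"},
    },
}

def predict(node, sample):
    """Walk from the root to a leaf, going left when value <= threshold."""
    while "label" not in node:
        go_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]

print(predict(tree, {"petal_length": 1.4, "petal_width": 0.2}))  # setosa
print(predict(tree, {"petal_length": 5.1, "petal_width": 2.0}))  # virginica
```

A fitted tree works the same way at prediction time; training is the part that chooses the features and thresholds.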


### Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?


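**Gini Impurity** and **Entropy** measure the **impurity (disorder)** of the class labels at a node in a Decision Tree. At each node, the tree evaluates candidate splits and keeps the one that most reduces the weighted impurity of the resulting child nodes, so these measures directly decide the **best split**.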

* **Gini Impurity:**
  Measures how often a randomly chosen sample would be **incorrectly classified** if labeled randomly based on class distribution.

  $$
  Gini = 1 - \sum_{i} p_i^2
  $$

  Lower Gini → purer node.

* **Entropy:**
  Measures the **amount of uncertainty** or randomness in data.

  $$
  Entropy = -\sum_{i} p_i \log_2(p_i)
  $$

  Lower entropy → higher purity.

Here $p_i$ is the proportion of samples at the node that belong to class $i$.
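
Both formulas can be checked by hand on small class counts. The sketch below is a plain-Python illustration (the helper names are my own, not part of any library): a 50/50 node is maximally impure for two classes, and a single-class node scores zero on both measures.

```python
from math import log2

def class_proportions(counts):
    """Convert per-class sample counts at a node into proportions p_i."""
    total = sum(counts)
    return [c / total for c in counts]

def gini(counts):
    """Gini = 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in class_proportions(counts))

def entropy(counts):
    """Entropy = -sum(p_i * log2(p_i)); empty classes contribute nothing."""
    return sum(-p * log2(p) for p in class_proportions(counts) if p > 0)

print(gini([5, 5]), entropy([5, 5]))    # 0.5 1.0
print(gini([10, 0]), entropy([10, 0]))  # both are zero for a pure node
```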



### Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.


**Pre-Pruning (Early Stopping):**
Pre-pruning stops the tree from growing once a certain condition is met (e.g., maximum depth, minimum samples per split).

* **Advantage:** Prevents overfitting early, making the model simpler and faster to train.

**Post-Pruning (e.g., Reduced Error Pruning):**
Post-pruning allows the tree to grow fully, then removes branches that don't improve accuracy on validation data. Reduced error pruning is one such method; cost-complexity pruning is another common one.

* **Advantage:** Tends to generalize better, because branches are trimmed based on measured performance rather than a limit chosen in advance.
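
The two approaches can be sketched side by side with scikit-learn. Note the swap: scikit-learn does not ship reduced error pruning, so this example uses its minimal cost-complexity pruning (`ccp_alpha`) as the post-pruning method and picks the pruning strength by validation accuracy. The split sizes and hyperparameter values are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Pre-pruning: growth is capped while the tree is being built.
pre_pruned = DecisionTreeClassifier(
    max_depth=2, min_samples_split=10, random_state=0
).fit(X_train, y_train)

# Post-pruning: compute the pruning path of the fully grown tree, refit one
# tree per candidate alpha, and keep the one that scores best on validation data.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train
)
alphas = [max(alpha, 0.0) for alpha in path.ccp_alphas]  # guard against round-off
candidates = [
    DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_train, y_train)
    for alpha in alphas
]
full_tree = candidates[0]  # smallest alpha, i.e. the least-pruned tree
post_pruned = max(candidates, key=lambda tree: tree.score(X_val, y_val))

print("pre-pruned leaves: ", pre_pruned.get_n_leaves())
print("full tree leaves:  ", full_tree.get_n_leaves())
print("post-pruned leaves:", post_pruned.get_n_leaves())
```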


### Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?


**Information Gain (IG)** measures how much **uncertainty (entropy)** in the target variable is reduced after splitting a dataset based on a feature. It helps determine **which feature provides the most useful information** for classification.

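$$
Information\ Gain = Entropy(Parent) - \sum_{i} \frac{N_i}{N} \times Entropy(Child_i)
$$

where $N$ is the number of samples at the parent node and $N_i$ is the number of samples in child $i$.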

* A **higher IG** means the feature provides a better split.
* Entropy-based trees (e.g., ID3, C4.5, or scikit-learn with `criterion='entropy'`) select the split with the **maximum Information Gain** at each node, so each split makes the child nodes as **pure** as possible.
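
As a worked example of the formula, the sketch below computes Information Gain for two candidate splits of the same eight-sample parent node. The labels and helper names are made up for illustration. A split that separates the classes perfectly recovers the full parent entropy, while a split that leaves each child as mixed as the parent gains nothing.

```python
from math import log2

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    proportions = [labels.count(c) / n for c in set(labels)]
    return sum(-p * log2(p) for p in proportions)

def information_gain(parent, children):
    """Entropy(parent) minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted_child_entropy = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted_child_entropy

parent = ["yes"] * 4 + ["no"] * 4
perfect_split = [["yes"] * 4, ["no"] * 4]
useless_split = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]

print(information_gain(parent, perfect_split))  # 1.0
print(information_gain(parent, useless_split))  # 0.0
```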


### Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?


**Applications:**

* **Finance:** Credit risk assessment and loan approval.
* **Healthcare:** Disease diagnosis based on symptoms.
* **Marketing:** Customer segmentation and churn prediction.
* **Manufacturing:** Quality control and fault detection.

**Advantages:**

* Easy to **understand, visualize, and interpret**.
* Requires little data preprocessing (no scaling or normalization).
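* Handles **numerical and categorical** data, although some implementations (including scikit-learn) require categorical features to be encoded as numbers first.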
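
The "no scaling" point can be checked directly: splits depend only on the ordering of feature values, so multiplying every feature by a positive constant should not change the predictions. A small sketch under that assumption, using an arbitrary scale factor:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scale = 1024.0  # arbitrary positive factor, chosen only for illustration

# Same model settings, one fit on raw features and one on rescaled features.
raw = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
scaled = DecisionTreeClassifier(random_state=42).fit(X_train * scale, y_train)

same = np.array_equal(raw.predict(X_test), scaled.predict(X_test * scale))
print("Identical predictions after rescaling:", same)
```

Distance-based or gradient-based models would generally not behave this way, which is why they usually need a scaler in front of them.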

**Limitations:**

* Prone to **overfitting** if not pruned.
* **Unstable** — small data changes can alter the tree structure.
* May **favor features** with more levels (biased splits).


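### Dataset Info:
* Iris Dataset for classification tasks (`sklearn.datasets.load_iris()` or provided CSV).
* Boston Housing Dataset for regression tasks (`sklearn.datasets.load_boston()` or provided CSV). Note that `load_boston()` was removed in scikit-learn 1.2, so the regression code below loads the data with `fetch_openml` instead.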

### Question 6: Write a Python program to:
* Load the Iris Dataset
* Train a Decision Tree Classifier using the Gini criterion
* Print the model’s accuracy and feature importances


In [1]:
# --- Import Libraries ---
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import pandas as pd

# --- Load the Iris Dataset ---
iris = load_iris()
X = iris.data
y = iris.target

# --- Split into Train and Test Sets ---
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# --- Train Decision Tree Classifier (using Gini) ---
model = DecisionTreeClassifier(criterion='gini', random_state=42)
model.fit(X_train, y_train)

# --- Predict and Evaluate ---
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# --- Display Results ---
print("Decision Tree Accuracy:", accuracy)
print("\nFeature Importances:")
for feature, importance in zip(iris.feature_names, model.feature_importances_):
    print(f"{feature}: {importance:.4f}")


Decision Tree Accuracy: 1.0

Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0167
petal length (cm): 0.9061
petal width (cm): 0.0772
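
An accuracy of 1.0 here comes from a single split with only 30 test samples, so it is an optimistic and noisy estimate. As a hedged complement, the sketch below runs 5-fold cross-validation on the same model settings; the fold count is an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(criterion="gini", random_state=42)

# 5-fold cross-validation: every sample is used for testing exactly once.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("Fold accuracies:", scores)
print(f"Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```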


### Question 7: Write a Python program to:
* Load the Iris Dataset
* Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree

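In [2]:
# --- Import Libraries (repeated so this cell runs on its own) ---
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# --- Load the Iris Dataset ---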
iris = load_iris()
X = iris.data
y = iris.target

# --- Split Data ---
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# --- Fully Grown Tree ---
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)
full_pred = full_tree.predict(X_test)
full_acc = accuracy_score(y_test, full_pred)

# --- Pruned Tree (max_depth=3) ---
pruned_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
pruned_tree.fit(X_train, y_train)
pruned_pred = pruned_tree.predict(X_test)
pruned_acc = accuracy_score(y_test, pruned_pred)

# --- Print Comparison ---
print("Accuracy of Fully Grown Tree:", full_acc)
print("Accuracy of Pruned Tree (max_depth=3):", pruned_acc)

Accuracy of Fully Grown Tree: 1.0
Accuracy of Pruned Tree (max_depth=3): 1.0
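
Both trees score 1.0 on this split, so accuracy alone does not show what `max_depth=3` changed. A small sketch that also reports each tree's size and training accuracy, using the same split and seeds as above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
pruned_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

# Report structure alongside accuracy so the effect of the depth cap is visible.
for name, tree in [("full", full_tree), ("max_depth=3", pruned_tree)]:
    print(
        f"{name}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}, "
        f"train acc={tree.score(X_train, y_train):.3f}, "
        f"test acc={tree.score(X_test, y_test):.3f}"
    )
```

If the shallower tree matches the full tree on held-out data, it is usually the better choice because it is simpler to read and less likely to have memorized noise.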


### Question 8: Write a Python program to:
* Load the Boston Housing Dataset
* Train a Decision Tree Regressor
* Print the Mean Squared Error (MSE) and feature importances

In [4]:
# --- Import Libraries ---
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# --- Load Boston Housing Dataset ---
boston = fetch_openml(name='boston', version=1, as_frame=True)
X = boston.data
y = boston.target

# --- Split Data ---
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# --- Train Decision Tree Regressor ---
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# --- Predict and Evaluate ---
y_pred = regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

# --- Display Results ---
print("Mean Squared Error (MSE):", mse)
print("\nFeature Importances:")
for feature, importance in zip(X.columns, regressor.feature_importances_):
    print(f"{feature}: {importance:.4f}")


Mean Squared Error (MSE): 10.416078431372549

Feature Importances:
CRIM: 0.0513
ZN: 0.0034
INDUS: 0.0058
CHAS: 0.0000
NOX: 0.0271
RM: 0.6003
AGE: 0.0136
DIS: 0.0707
RAD: 0.0019
TAX: 0.0125
PTRATIO: 0.0110
B: 0.0090
LSTAT: 0.1933
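
MSE is expressed in squared target units, which makes it hard to read. The Boston target (MEDV) is the median home value in thousands of dollars, so the square root of the MSE gives a typical error in those same units. A minimal sketch using the value printed above:

```python
from math import sqrt

mse = 10.416078431372549  # MSE copied from the output above
rmse = sqrt(mse)          # back in the target's own units (thousands of dollars)
print(f"RMSE: {rmse:.2f}")  # RMSE: 3.23
```

On this split the tree's predictions are therefore off by roughly 3.2 thousand dollars in the root-mean-square sense.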


### Question 9: Write a Python program to:
* Load the Iris Dataset
* Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV
* Print the best parameters and the resulting model accuracy

In [5]:
# --- Import Libraries ---
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# --- Load the Iris Dataset ---
iris = load_iris()
X = iris.data
y = iris.target

# --- Split Data ---
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# --- Define Parameter Grid ---
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 3, 4, 5, 10]
}

# --- Initialize Model and Grid Search ---
grid_search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,
    scoring='accuracy'
)

# --- Fit the Model ---
grid_search.fit(X_train, y_train)

# --- Best Parameters and Accuracy ---
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Best Parameters:", grid_search.best_params_)
print("Best Cross-Validation Accuracy:", grid_search.best_score_)
print("Test Set Accuracy:", accuracy)


Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Best Cross-Validation Accuracy: 0.9416666666666668
Test Set Accuracy: 1.0


### Question 10: Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values.
Explain the step-by-step process you would follow to:
* Handle the missing values
* Encode the categorical features
* Train a Decision Tree model
* Tune its hyperparameters
* Evaluate its performance

And describe what business value this model could provide in the real-world setting.


In [6]:
# --- Decision Tree Pipeline for Disease Prediction ---

# Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# ---------------------------
# Step 1: Simulate healthcare dataset
# ---------------------------
data = {
    'Age': [25, 40, 35, np.nan, 50, 45, 60, np.nan, 30, 55],
    'Gender': ['Male', 'Female', 'Female', 'Male', 'Male', 'Female', np.nan, 'Female', 'Male', 'Female'],
    'BloodPressure': [120, 130, np.nan, 110, 140, 125, 150, 135, np.nan, 145],
    'Cholesterol': ['High', 'Normal', 'High', 'Normal', np.nan, 'High', 'High', 'Normal', 'High', 'Normal'],
    'Disease': [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
}
df = pd.DataFrame(data)

# Separate features and target
X = df.drop('Disease', axis=1)
y = df['Disease']

# Identify column types
num_cols = ['Age', 'BloodPressure']
cat_cols = ['Gender', 'Cholesterol']

# ---------------------------
# Step 2: Preprocessing (handle missing values + encoding)
# ---------------------------
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median'))
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, num_cols),
        ('cat', categorical_transformer, cat_cols)
    ]
)

# ---------------------------
# Step 3: Model + Pipeline setup
# ---------------------------
dt = DecisionTreeClassifier(random_state=42)

pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', dt)])

# ---------------------------
# Step 4: Train/Test Split
# ---------------------------
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# ---------------------------
# Step 5: Hyperparameter Tuning
# ---------------------------
param_grid = {
    'classifier__max_depth': [2, 3, 4, 5, None],
    'classifier__min_samples_split': [2, 4, 6],
    'classifier__criterion': ['gini', 'entropy']
}

grid_search = GridSearchCV(pipeline, param_grid, cv=3, scoring='accuracy')
grid_search.fit(X_train, y_train)

# ---------------------------
# Step 6: Evaluation
# ---------------------------
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

print("Best Parameters:", grid_search.best_params_)
print("\nModel Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

# ---------------------------
# Step 7: Business Value Summary
# ---------------------------
print("""
✅ BUSINESS VALUE:
- Enables early detection of diseases through data-driven insights.
- Helps allocate medical resources efficiently.
- Improves diagnostic accuracy and supports personalized treatment plans.
- Provides healthcare professionals with interpretable, rule-based decisions.
""")


Best Parameters: {'classifier__criterion': 'gini', 'classifier__max_depth': 2, 'classifier__min_samples_split': 2}

Model Accuracy: 0.6666666666666666

Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.50      0.67         2
           1       0.50      1.00      0.67         1

    accuracy                           0.67         3
   macro avg       0.75      0.75      0.67         3
weighted avg       0.83      0.67      0.67         3

Confusion Matrix:
 [[1 1]
 [0 1]]

✅ BUSINESS VALUE:
- Enables early detection of diseases through data-driven insights.
- Helps allocate medical resources efficiently.
- Improves diagnostic accuracy and supports personalized treatment plans.
- Provides healthcare professionals with interpretable, rule-based decisions.
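
The "interpretable, rule-based decisions" point can be made tangible by printing a fitted tree as if/else rules with scikit-learn's `export_text`. The sketch below uses a separate, hypothetical toy dataset (two numeric columns, no missing values) rather than the pipeline above, purely to keep the example short.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: [Age, BloodPressure] -> disease flag (1 = disease).
X = [[25, 120], [40, 130], [35, 118], [50, 140], [45, 125], [60, 150]]
y = [0, 0, 0, 1, 0, 1]

clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

# Render the fitted tree as indented threshold rules a clinician can read.
rules = export_text(clf, feature_names=["Age", "BloodPressure"])
print(rules)
```

With only a handful of simulated rows, as in the cell above, such rules and the reported metrics demonstrate the workflow only; they say nothing about real diagnostic performance.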

