Decision Tree | Assignment

#1. What is a Decision Tree, and how does it work in the context of classification?
Decision Tree Basics
A Decision Tree is a supervised learning algorithm used for both classification and regression tasks. It is a tree-like model in which each internal node tests a feature, each branch represents an outcome of that test, and each leaf holds a prediction.

How it Works for Classification
1. Root Node: The top node, representing the entire dataset.
2. Splitting: Data is split based on feature values (e.g., age > 30).
3. Decision Nodes: Internal nodes that split data further.
4. Leaf Nodes: Terminal nodes predicting class labels.
5. Prediction: New data traverses the tree, following decisions to a leaf node.
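
The traversal in step 5 can be sketched in plain Python. The tree below is hand-written for illustration: the feature names, thresholds, and the `predict` helper are assumptions, not something learned from data.

```python
# A hand-built toy tree; feature names and thresholds are illustrative only.
tree = {
    "feature": "petal_length", "threshold": 2.5,      # root node
    "left": {"label": "setosa"},                      # leaf node
    "right": {                                        # decision node
        "feature": "petal_width", "threshold": 1.7,
        "left": {"label": "versicolor"},
        "right": {"label": "virginica"},
    },
}

def predict(node, sample):
    """Walk from the root to a leaf, going left when value <= threshold."""
    while "label" not in node:
        go_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]

print(predict(tree, {"petal_length": 1.4, "petal_width": 0.2}))  # setosa
print(predict(tree, {"petal_length": 5.0, "petal_width": 2.0}))  # virginica
```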

#2. Explain the concepts of Gini Impurity and Entropy as impurity measures.
Impurity Measures in Decision Trees
Both Gini Impurity and Entropy measure node impurity, guiding splits in Decision Trees.

Gini Impurity
- Formula: G = 1 - Σ(p_i²) where p_i is class i's proportion.
- Interpretation: Probability that a randomly chosen sample from the node is misclassified if it is labeled at random according to the node's class distribution.
- Range: 0 (pure) to 0.5 (max impurity for binary).

Entropy
- Formula: H = -Σ(p_i * log₂(p_i))
- Interpretation: Uncertainty in node.
- Range: 0 (pure) to 1 (max impurity for binary).

Impact on Splits
- Goal: Minimize impurity (maximize purity).
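- Split selection: Choose the feature/threshold that maximizes the impurity decrease (called Information Gain when Entropy is the measure):
    - Gini: ΔG = G_parent - (n_left/n * G_left + n_right/n * G_right)
    - Entropy: IG = H_parent - (n_left/n * H_left + n_right/n * H_right)
- Both usually lead to similar trees; Gini is slightly cheaper to compute because it needs no logarithm, while Entropy is sometimes said to give slightly more balanced splits.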

Example
For a node with 50% class A, 50% class B:
- Gini = 1 - (0.5² + 0.5²) = 0.5
- Entropy = -(0.5·log₂(0.5) + 0.5·log₂(0.5)) = 1
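
The two formulas can be checked with a few lines of Python. This is a minimal sketch; the helper names `gini` and `entropy` are chosen here for illustration and take a list of class proportions.

```python
import math

def gini(proportions):
    """Gini impurity: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in proportions)

def entropy(proportions):
    """Entropy in bits: sum(-p_i * log2(p_i)), skipping classes with p_i = 0."""
    return sum(-p * math.log2(p) for p in proportions if p > 0)

# The 50% / 50% node from the example above
print(gini([0.5, 0.5]))     # 0.5
print(entropy([0.5, 0.5]))  # 1.0

# A pure node has zero impurity under both measures
print(gini([1.0, 0.0]), entropy([1.0, 0.0]))
```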

#3 What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.
Pruning in Decision Trees
Pruning controls overfitting by limiting tree size, either by stopping growth early or by trimming a tree after it has been grown.

Pre-Pruning (Early Stopping)
- When: During tree growth.
- How: Stop splitting based on criteria (e.g., max depth, min samples).
- Advantage: Efficiency. Reduces computation by avoiding unnecessary splits.

Post-Pruning (e.g., Reduced Error Pruning)
- When: After tree is fully grown.
- How: Remove subtrees, replacing with leaf nodes.
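- Advantage: Better-informed trade-off. The full tree is seen before anything is cut, so useful deeper splits are not missed and the balance between complexity and accuracy can be tuned on held-out data.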

Example Criteria
- Pre-pruning: max_depth=3, min_samples_split=10
- Post-pruning: Prune based on validation set performance.
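
Both strategies can be sketched with scikit-learn. Note the substitution: scikit-learn does not ship reduced error pruning, so this sketch uses its cost-complexity pruning (`ccp_alpha`) and picks the alpha by validation accuracy. The 70/30 split and the tie-breaking rule are assumptions made for the illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

# Pre-pruning: growth stops early because of the limits below.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_split=10, random_state=42)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree, list candidate alphas, then keep the
# alpha whose pruned tree scores best on the validation set.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative round-off
    candidate = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
    score = candidate.fit(X_train, y_train).score(X_val, y_val)
    if score >= best_score:  # ties go to the larger alpha (smaller tree)
        best_alpha, best_score = alpha, score

post_pruned = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=42)
post_pruned.fit(X_train, y_train)

print("Leaves (full, pre-pruned, post-pruned):",
      full_tree.get_n_leaves(), pre_pruned.get_n_leaves(), post_pruned.get_n_leaves())
```

Selecting alpha on a validation split mirrors the "prune based on validation set performance" criterion; on a dataset as small as Iris, cross-validation would likely be a steadier choice.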

#4. What is Information Gain in Decision Trees, and why is it important for choosing the best split?
Information Gain in Decision Trees
Information Gain (IG) measures how much a split improves node purity.

Formula
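IG = H_parent - Σ_k (n_k / n) · H_child_k
- H: Entropy (Gini can be substituted, giving the Gini decrease ΔG)
- n_k / n: Fraction of the parent's samples that fall into child k, so the children's impurities are combined as a weighted sum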

Importance
- Split selection: Choose feature/split maximizing IG.
- Purity increase: Higher IG = better separation of classes.
- Tree construction: Guides tree growth, focusing on most informative features first.
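
As a small worked example, the sketch below computes IG for two candidate splits of the same parent node. The labels and the helper names `entropy` and `information_gain` are made up for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """IG = H_parent - weighted average of the children's entropies."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

parent = ["A", "A", "A", "B", "B", "B"]

# Split 1 separates the classes completely: both children are pure.
ig_perfect = information_gain(parent, ["A", "A", "A"], ["B", "B", "B"])

# Split 2 leaves both children at 50/50, so nothing is gained.
ig_useless = information_gain(parent, ["A", "B"], ["A", "A", "B", "B"])

print(f"perfect split IG = {ig_perfect:.3f}")
print(f"useless split IG = {abs(ig_useless):.3f}")
```

A tree builder would pick the first split, since it has the higher gain.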

#5 What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?
Decision Trees: Applications, Advantages & Limitations
Real-World Applications
1. Healthcare: Disease diagnosis (e.g., predicting diabetes, cancer risk)
2. Finance: Credit scoring, fraud detection
3. Marketing: Customer segmentation, churn prediction
4. Retail: Product recommendation systems
5. Manufacturing: Quality control, predictive maintenance

Main Advantages
1. Interpretability: Easy to understand and visualize
2. Non-parametric: No assumptions about data distribution
3. Handles mixed data: Categorical and numerical features (scikit-learn's trees require categorical features to be numerically encoded first)
4. Fast training: Simple and efficient
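
The interpretability point can be illustrated by printing a fitted tree as plain if/else rules. This is a minimal sketch using scikit-learn's `export_text`; the choice of `max_depth=2` is an assumption made only to keep the printout short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(iris.data, iris.target)

# Render the learned splits as indented, human-readable rules
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```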

Limitations
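1. Overfitting: Unpruned trees grow complex and fit noise; pruning or depth limits are needed
2. Bias towards dominant classes: With imbalanced data, splits and leaf predictions favor the majority class
3. Instability: Small data changes can alter tree structure
4. Greedy algorithm: Each split is chosen locally, so the globally optimal tree is not guaranteed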

#6 Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier using the Gini criterion
● Print the model’s accuracy and feature importances
(Include your Python code and output in the code box below.)



In [None]:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Decision Tree Classifier with Gini criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Predictions and accuracy
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

# Feature importances
print("Feature Importances:")
for feature, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{feature}: {importance:.2f}")




#7 Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree.
(Include your Python code and output in the code box below.)

In [None]:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Decision Tree with max_depth=3
clf_depth3 = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_depth3.fit(X_train, y_train)
y_pred_depth3 = clf_depth3.predict(X_test)
accuracy_depth3 = accuracy_score(y_test, y_pred_depth3)

# Train fully-grown tree
clf_full = DecisionTreeClassifier(random_state=42)
clf_full.fit(X_train, y_train)
y_pred_full = clf_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

# Compare accuracies
print(f"Accuracy (max_depth=3): {accuracy_depth3:.2f}")
print(f"Accuracy (fully-grown): {accuracy_full:.2f}")




#8 Write a Python program to:
● Load the Boston Housing Dataset
● Train a Decision Tree Regressor
● Print the Mean Squared Error (MSE) and feature importances
(Include your Python code and output in the code box below.)

In [2]:
# Import necessary libraries
# Note: load_boston was removed in scikit-learn 1.2 and raises an ImportError.
# The maintainers discourage the Boston dataset because of an ethical problem
# with how its "B" variable was constructed. The California housing dataset,
# one of the alternatives they suggest, is substituted here.
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load California housing dataset (downloaded on first use)
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Decision Tree Regressor
reg = DecisionTreeRegressor(random_state=42)
reg.fit(X_train, y_train)

# Predictions and MSE
y_pred = reg.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")

# Feature importances
print("Feature Importances:")
for feature, importance in zip(housing.feature_names, reg.feature_importances_):
    print(f"{feature}: {importance:.2f}")

#9: Write a Python program to:
● Load the Iris Dataset
● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV
● Print the best parameters and the resulting model accuracy
(Include your Python code and output in the code box below.)

In [3]:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define hyperparameter grid
param_grid = {
    'max_depth': [3, 5, None],
    'min_samples_split': [2, 5, 10]
}

# GridSearchCV
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Best parameters and accuracy
best_params = grid_search.best_params_
best_acc = grid_search.best_score_

print(f"Best Parameters: {best_params}")
print(f"Best Cross-Validation Accuracy: {best_acc:.2f}")

# Evaluate on test set
best_model = grid_search.best_estimator_
test_acc = accuracy_score(y_test, best_model.predict(X_test))
print(f"Test Accuracy: {test_acc:.2f}")



Best Parameters: {'max_depth': None, 'min_samples_split': 2}
Best Cross-Validation Accuracy: 0.94
Test Accuracy: 1.00
