# ***Decision Tree Assignment***



1. What is a Decision Tree, and how does it work in classification?
* A Decision Tree is a supervised learning algorithm that splits data into branches based on feature values. In classification, it recursively chooses the best feature to divide data into homogeneous groups until leaf nodes represent class labels.
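
The recursive splitting described above can be made visible by printing the rules of a small fitted tree. This is a minimal sketch, assuming scikit-learn is installed; `max_depth=2` and `random_state=0` are illustrative choices, not part of the assignment.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a deliberately shallow tree so the printed rules stay readable.
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# export_text renders each internal node as a threshold test on one feature
# and each leaf as the predicted class.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Each indented level in the printout is one split; the `class:` lines are the leaf nodes.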

2. Explain Gini Impurity and Entropy. How do they impact splits?
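* Gini Impurity is the probability that a randomly chosen sample in a node would be mislabeled if it were labeled at random according to the node's class distribution. Entropy measures the disorder or uncertainty of that class distribution. Both are zero for a pure node and largest when the classes are evenly mixed. At each node, the tree picks the split with the lowest weighted impurity in the child nodes, which means the largest Gini decrease or the highest Information Gain (entropy reduction).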
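
As a small worked example of the two measures, the sketch below computes them from a list of class labels using only the standard library. The function names `gini` and `entropy` and the toy label lists are made up for illustration.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: sum(p_k * log2(1 / p_k))."""
    n = len(labels)
    return sum((count / n) * log2(n / count) for count in Counter(labels).values())

pure = ["a", "a", "a", "a"]    # a single class: no impurity
mixed = ["a", "a", "b", "b"]   # two classes, evenly mixed: maximum impurity

print("pure :", gini(pure), entropy(pure))
print("mixed:", gini(mixed), entropy(mixed))
```

For two classes, Gini peaks at 0.5 and entropy at 1 bit, both at a 50/50 mix.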

3. Difference between Pre-Pruning and Post-Pruning. One advantage each.
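* Pre-pruning stops tree growth early by limiting depth, minimum samples per split, or similar constraints. Its advantage is that it curbs overfitting while keeping training fast. Post-pruning grows a full tree first and then removes branches that contribute little. Its advantage is that every split is evaluated before anything is cut, which tends to give a simpler model that generalizes better.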
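
The two approaches might look like this in scikit-learn, where post-pruning is available as cost-complexity pruning through `ccp_alpha`. This is a sketch under assumptions: the hyperparameter values are arbitrary, and the `ccp_alpha` is simply taken from the middle of the pruning path, whereas in practice it would be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Pre-pruning: constrain the tree while it is being grown.
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
pre.fit(X_train, y_train)

# Post-pruning: grow the full tree, then prune with cost-complexity pruning.
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # illustrative pick only
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
post.fit(X_train, y_train)

for name, tree in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(name, "leaves:", tree.get_n_leaves(), "test accuracy:", tree.score(X_test, y_test))
```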

4. What is Information Gain, and why is it important?
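* Information Gain measures how much uncertainty a split removes: the entropy of the parent node minus the weighted average entropy of its child nodes. It matters because it ranks candidate splits, so the tree chooses the feature and threshold that make the child nodes purest at each step.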
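
To make the definition concrete, here is a minimal sketch that computes Information Gain for two candidate splits of a toy node. The helper names and the label lists are hypothetical, chosen only to show the extremes.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return sum((count / n) * log2(n / count) for count in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes"] * 4 + ["no"] * 4

# A split that separates the classes completely removes all uncertainty.
perfect_split = [["yes"] * 4, ["no"] * 4]
# A split whose children keep the parent's 50/50 mix removes none.
useless_split = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]

print("perfect:", information_gain(parent, perfect_split))
print("useless:", information_gain(parent, useless_split))
```

The tree would prefer the first split, since it has the higher gain.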

5. Real-world applications, advantages, and limitations of Decision Trees.
* Applications include medical diagnosis, fraud detection, loan approval, and customer segmentation. Trees are easy to interpret and handle mixed data types but tend to overfit and may become unstable with small data changes.



In [1]:
# Q6. Write a Python program to:
# ● Load the Iris Dataset
# ● Train a Decision Tree Classifier using the Gini criterion
# ● Print the model’s accuracy and feature importances

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
model = DecisionTreeClassifier(criterion='gini')
model.fit(X_train, y_train)

# Accuracy & feature importances
pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, pred))
print("Feature Importances:", model.feature_importances_)


Accuracy: 1.0
Feature Importances: [0.         0.01667014 0.90614339 0.07718647]


In [2]:
# Q7. Write a Python program to:
# ● Load the Iris Dataset
# ● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Full tree
full = DecisionTreeClassifier()
full.fit(X_train, y_train)
full_acc = accuracy_score(y_test, full.predict(X_test))

# Depth-limited tree
limited = DecisionTreeClassifier(max_depth=3)
limited.fit(X_train, y_train)
limited_acc = accuracy_score(y_test, limited.predict(X_test))

print("Fully-grown Tree Accuracy:", full_acc)
print("Max Depth = 3 Accuracy:", limited_acc)


Fully-grown Tree Accuracy: 1.0
Max Depth = 3 Accuracy: 1.0


In [4]:
# Q8. Write a Python program to:
# ● Load the Boston Housing Dataset
# ● Train a Decision Tree Regressor
# ● Print the Mean Squared Error (MSE) and feature importances

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load the California Housing dataset as a substitute:
# load_boston was removed from scikit-learn in version 1.2
data = fetch_california_housing()
X, y = data.data, data.target

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
reg = DecisionTreeRegressor()
reg.fit(X_train, y_train)

# Predictions & metrics
pred = reg.predict(X_test)
print("MSE:", mean_squared_error(y_test, pred))
print("Feature Importances:", reg.feature_importances_)


MSE: 0.5122372264527132
Feature Importances: [0.52819451 0.05183078 0.05501545 0.02837502 0.02990242 0.13033425
 0.0931787  0.08316887]


In [5]:
# Q9. Write a Python program to:
# ● Load the Iris Dataset
# ● Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV
# ● Print the best parameters and the resulting model accuracy

from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Parameter grid
params = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 3, 4, 5]
}

# Grid search
grid = GridSearchCV(DecisionTreeClassifier(), params, cv=5)
grid.fit(X_train, y_train)

# Results
print("Best Parameters:", grid.best_params_)
pred = grid.predict(X_test)
print("Accuracy:", accuracy_score(y_test, pred))


Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Accuracy: 1.0


10. Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values. Explain the step-by-step process you would follow to:

● Handle the missing values

● Encode the categorical features

● Train a Decision Tree model

● Tune its hyperparameters

● Evaluate its performance

And describe what business value this model could provide in the real-world setting.


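* To handle missing values, impute numeric features with the mean or median and categorical features with the most frequent value (mode), fitting the imputers on the training split only so that no test information leaks in. Encode categorical variables with One-Hot Encoding for nominal features, or with ordinal (label) encoding where the categories have a natural order. Train a Decision Tree on the training split, then tune max_depth, min_samples_split, and similar hyperparameters with cross-validated GridSearchCV. Evaluate on a held-out test set using accuracy, precision, recall, and the confusion matrix; for disease prediction, recall often deserves particular attention because a missed case is usually costlier than a false alarm. Such a model helps hospitals detect diseases early, reduce manual workload, and support faster decision-making.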
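
The steps above could be wired together as a single scikit-learn pipeline. The sketch below is illustrative only: the patient data is synthetic and randomly generated, the column names (`age`, `blood_pressure`, `smoker`) and the rule that defines the label are invented, and the grid values and `scoring="recall"` are assumptions rather than recommendations.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a patient table (hypothetical columns and label rule).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 80, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
})
y = ((df["age"] > 55) & (df["smoker"] == "yes")).astype(int)

# Knock out some values to simulate missing data.
df.loc[rng.choice(n, 40, replace=False), "age"] = np.nan
df.loc[rng.choice(n, 40, replace=False), "smoker"] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=0
)

# Impute and encode inside the pipeline so both are fit on training folds only.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age", "blood_pressure"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["smoker"]),
])
model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

# Tune the tree's hyperparameters with 5-fold cross-validation.
param_grid = {
    "tree__max_depth": [2, 3, 4, None],
    "tree__min_samples_split": [2, 5, 10],
}
grid = GridSearchCV(model, param_grid, cv=5, scoring="recall")
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print(classification_report(y_test, grid.predict(X_test)))
```

Keeping imputation and encoding inside the pipeline means GridSearchCV refits them on each training fold, which avoids leaking information from the validation folds into preprocessing.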