# Decision Tree

```
Question 1: What is a Decision Tree, and how does it work in classification?
A. A Decision Tree is a supervised machine learning algorithm used for classification and regression.
It works like a flowchart:
 -> Each internal node represents a decision based on a feature
 -> Each branch represents an outcome of the decision
 -> Each leaf node represents the final prediction (class label)

 Classification:
 1. The algorithm selects the feature (and threshold) that best splits the dataset.
 2. The split divides the dataset into subsets that are more “pure” (each dominated by a single class label).
 3. Splitting continues recursively until all samples in a node belong to the same class, no further split improves purity, or a stopping condition is reached (such as max_depth).

The final class is the majority class among the training samples in the leaf node.
```
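To make the flowchart picture concrete, here is a minimal sketch that fits a shallow tree on Iris and prints its rules as text. The choice of `max_depth=2` is only an assumption to keep the printout short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A deliberately shallow tree (max_depth=2 is an illustrative choice).
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(iris.data, iris.target)

# Each "|---" line is a node: a feature test for internal nodes,
# "class: ..." for leaves.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)

# Predicting one sample follows a single root-to-leaf path.
sample = iris.data[:1]
predicted = iris.target_names[tree.predict(sample)[0]]
print("Predicted class:", predicted)
```

Reading the printed rules from top to bottom shows the path a sample takes from the root to a leaf.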

```
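Question 2: Explain Gini Impurity and Entropy. How do they impact Decision Tree splits?

Gini Impurity:
Measures the probability of misclassifying a randomly chosen sample if it were labeled at random according to the class distribution in the node.

$$ Gini = 1 - \sum_{i=1}^{k} p_i^2 $$

where p_i is the proportion of samples in the node that belong to class i.

* Value = 0 → perfectly pure node

* Maximum = 1 - 1/k (0.5 for two classes), reached when all classes are equally frequent

* Used by default in CART decision trees

Entropy:

Measures the amount of uncertainty or disorder in the node.

$$ Entropy = - \sum_{i=1}^{k} p_i \log_2(p_i) $$

* Value = 0 → pure node

* Maximum = log_2(k) (1 bit for two classes)

* Used in ID3 / C4.5 trees

**Impact on Splits**

The tree chooses the feature and threshold that produce the greatest reduction in impurity, measured as the parent's impurity minus the size-weighted impurity of the children.

More reduction in impurity → better split → more homogeneous child nodes.

In practice the two measures usually select similar splits; Gini is slightly cheaper to compute because it avoids the logarithm.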
```
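The two formulas can be checked by hand on a few small nodes. The sketch below is plain Python. The helper names `gini` and `entropy` are illustrative and not taken from any library.

```python
import math

def gini(counts):
    """Gini impurity of a node, given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Entropy in bits; classes with zero samples contribute nothing."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

# A pure node, a mostly pure node, and a maximally mixed two-class node.
for counts in ([10, 0], [8, 2], [5, 5]):
    print(counts, "gini =", round(gini(counts), 3), "entropy =", round(entropy(counts), 3))
```

Both measures are 0 for the pure node and reach their two-class maximum (0.5 and 1 bit) for the 50/50 node.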

```
Question 3: Difference between Pre-Pruning and Post-Pruning

Pre-Pruning (Early Stopping):- Stops the decision tree from growing during training, before it becomes too complex.
It applies constraints so that the tree does not grow unnecessarily.

Key points:

1) The tree stops splitting early based on constraints.
2) Prevents overfitting by limiting depth, number of leaves, samples per node, etc.
3) Faster to train.
4) Risk: might underfit if stopped too soon.

Examples of Pre-Pruning parameters (scikit-learn): max_depth, min_samples_split, min_samples_leaf, max_leaf_nodes, min_impurity_decrease

Post-Pruning:- Allows the tree to grow fully and then removes branches that do not improve model performance.

Key points:
1) First grow the full tree → then cut unnecessary branches.
2) Often generalizes better than pre-pruning, because each branch is judged after the full tree is known rather than cut off in advance.
3) Reduces overfitting after the tree is built.
4) More computationally expensive.

Post-Pruning methods: Cost Complexity Pruning (alpha pruning; ccp_alpha in scikit-learn), Reduced Error Pruning, Minimum Error Pruning

```
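A minimal sketch contrasting the two approaches in scikit-learn, assuming the Iris data and the same split used elsewhere in this notebook. The pre-pruning values (`max_depth=2`, `min_samples_leaf=5`) are arbitrary illustrative choices. Post-pruning uses cost complexity pruning via `ccp_alpha`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Pre-pruning: constraints stop growth during training.
pre_pruned = DecisionTreeClassifier(max_depth=2, min_samples_leaf=5, random_state=42)
pre_pruned.fit(X_train, y_train)
print("Pre-pruned leaves:", pre_pruned.get_n_leaves())

# Post-pruning: grow the full tree, then compute the alpha values at which
# successive subtrees would be pruned away.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Refit once per alpha; a larger alpha removes more branches.
leaf_counts = []
for alpha in path.ccp_alphas:
    alpha = max(alpha, 0.0)  # guard against tiny negative values from rounding
    pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)
    leaf_counts.append(pruned.get_n_leaves())
    print(f"alpha={alpha:.4f}  leaves={pruned.get_n_leaves()}  "
          f"test acc={pruned.score(X_test, y_test):.3f}")
```

In practice the alpha would be chosen by cross-validation on the training data rather than by looking at test accuracy.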

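**Question 4: What is Information Gain and why is it important?**

**Answer**

Information Gain measures the reduction in entropy achieved by splitting a dataset.

$$ IG = H(parent) - \sum_{j=1}^{m} \left( \frac{n_j}{n} \right) H(child_j) $$

where H is the entropy, n is the number of samples in the parent node, and n_j is the number of samples in child j (out of m children).

**Importance**

* Helps select the best feature for splitting.
* Higher Information Gain → more useful split.
* At each node the tree greedily picks the split with the highest gain, so the most informative features tend to end up near the root.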
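
The formula can be worked through on a toy node. The sketch below is plain Python. The function names and the example class counts are illustrative assumptions, not part of the assignment.

```python
import math

def entropy(counts):
    """Entropy in bits of a node, given its per-class sample counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """H(parent) minus the size-weighted average of H(child_j)."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

parent = [5, 5]                  # 5 samples of each class
perfect_split = [[5, 0], [0, 5]] # each child is pure
partial_split = [[4, 1], [1, 4]] # children are mostly pure
useless_split = [[3, 3], [2, 2]] # children are as mixed as the parent

for name, split in [("perfect", perfect_split),
                    ("partial", partial_split),
                    ("useless", useless_split)]:
    print(name, round(information_gain(parent, split), 3))
```

The perfect split recovers the full 1 bit of parent entropy, the useless split gains nothing, and the partial split falls in between.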

**Question 5: Real-world applications of Decision Trees + Advantages & Limitations**

**Answer**

**Applications**

* Medical diagnosis (disease prediction)

* Fraud detection

* Customer churn prediction

* Loan approval

* Recommendation systems

* Manufacturing quality checks

**Advantages**

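* Easy to understand and visualize
* Handles both numerical and categorical data (scikit-learn's implementation requires categorical features to be encoded as numbers first)
* Requires little data preprocessing (no feature scaling or normalization needed)
* Fast to train and to predict

**Limitations**

* Prone to overfitting if grown without constraints
* Unstable (small changes in the training data → very different trees)
* Biased toward the majority class when classes are imbalanced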
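
The overfitting limitation is easy to observe on noisy data. This is a minimal sketch on a synthetic dataset from `make_classification`. The sample size, noise level and `max_depth=3` are all illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data where 20% of the labels are reassigned at random (flip_y),
# so there is noise for an unconstrained tree to memorize.
X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

full_train = full.score(X_train, y_train)
full_test = full.score(X_test, y_test)
print(f"Full tree:   depth={full.get_depth()}, train acc={full_train:.3f}, test acc={full_test:.3f}")
print(f"max_depth=3: depth={shallow.get_depth()}, "
      f"train acc={shallow.score(X_train, y_train):.3f}, "
      f"test acc={shallow.score(X_test, y_test):.3f}")
```

The unconstrained tree fits the training set perfectly, noise included, which is why its training accuracy says little about how it will do on new data.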

**Dataset Info:**

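* Iris Dataset for classification tasks (sklearn.datasets.load_iris() or provided CSV).

* Boston Housing Dataset for regression tasks (sklearn.datasets.load_boston() or provided CSV). Note: load_boston() was removed in scikit-learn 1.2, so the regression example in Question 8 uses sklearn.datasets.fetch_california_housing() instead.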

**Question 6: Write a Python program to:**
* Load the Iris Dataset
* Train a Decision Tree Classifier using the Gini criterion
* Print the model’s accuracy and feature importances


```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model using Gini
model = DecisionTreeClassifier(criterion='gini', random_state=42)
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))

# Feature Importances
for name, importance in zip(iris.feature_names, model.feature_importances_):
    print(name, ":", importance)
```

```
Accuracy: 1.0
sepal length (cm) : 0.0
sepal width (cm) : 0.016670139612419255
petal length (cm) : 0.9061433868879218
petal width (cm) : 0.07718647349965893
```


**Question 7: Write a Python program to:**

* Load the Iris Dataset
* Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Full tree
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)
full_pred = full_tree.predict(X_test)

# Depth-limited tree
limited_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
limited_tree.fit(X_train, y_train)
limited_pred = limited_tree.predict(X_test)

print("Full Tree Accuracy:", accuracy_score(y_test, full_pred))
print("Max Depth=3 Accuracy:", accuracy_score(y_test, limited_pred))
```

```
Full Tree Accuracy: 1.0
Max Depth=3 Accuracy: 1.0
```
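
Both trees score identically on this 30-sample test split, so a single split cannot tell them apart. One way to get a steadier comparison is 5-fold cross-validation over the whole dataset. The sketch below assumes that setup and is not part of the original assignment.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Score each configuration on 5 different train/validation splits.
results = {}
for label, depth in (("Full tree", None), ("max_depth=3", 3)):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    results[label] = cross_val_score(tree, X, y, cv=5)
    print(f"{label}: mean accuracy {results[label].mean():.3f} "
          f"(+/- {results[label].std():.3f})")
```

The spread across folds gives a sense of how much of any accuracy difference is just split-to-split noise.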


**Question 8: Write a Python program to:**

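* Load the Boston Housing Dataset (the code below substitutes the California Housing dataset, because load_boston() was removed in scikit-learn 1.2)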
* Train a Decision Tree Regressor
* Print the Mean Squared Error (MSE) and feature importances

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load California Housing Dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Decision Tree Regressor
model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# MSE
print("Mean Squared Error (MSE):", mean_squared_error(y_test, y_pred))

# Feature importance
print("\nFeature Importances:")
for name, importance in zip(housing.feature_names, model.feature_importances_):
    print(name, ":", importance)
```

```
Mean Squared Error (MSE): 0.495235205629094

Feature Importances:
MedInc : 0.5285090936963706
HouseAge : 0.05188353710616045
AveRooms : 0.05297496833123543
AveBedrms : 0.02866045788296106
Population : 0.030515676373806224
AveOccup : 0.13083767753210346
Latitude : 0.09371656401749287
Longitude : 0.08290202505986989
```
Longitude : 0.08290202505986989


**Question 9: Write a Python program to:**

* Load the Iris Dataset
* Tune the Decision Tree's max_depth and min_samples_split using GridSearchCV
* Print the best parameters and the resulting model accuracy

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Parameter grid
param_grid = {
    "max_depth": [2, 3, 4, 5, None],
    "min_samples_split": [2, 3, 4, 5]
}

# Grid Search
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)

# Best parameters & accuracy
print("Best Parameters:", grid.best_params_)

best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
```

```
Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Accuracy: 1.0
```


**Question 10: Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values.**

Explain the step-by-step process you would follow to:
* Handle the missing values
* Encode the categorical features
* Train a Decision Tree model
* Tune its hyperparameters
* Evaluate its performance

And describe what business value this model could provide in the real-world setting.

**Answer**

**Healthcare Disease Prediction Project**

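**1. Handle Missing Values**

* Numerical features → mean or median imputation (median is more robust to outliers)

* Categorical features → most frequent value, or a separate "missing" category

* Use sklearn’s SimpleImputer, fitted on the training data only to avoid leakage

**2. Encode Categorical Features**

For Decision Trees:

* One-Hot Encoding for nominal features (preferred), or Ordinal Encoding for features with a natural order

* Use OneHotEncoder inside a ColumnTransformer.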

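**3. Train a Decision Tree Model**

Steps:

* Split data into train/test (stratified, so both sets keep the same disease rate)

* Apply preprocessing (fit on the training set, then transform both sets)

* Train DecisionTreeClassifier()

* Evaluate using accuracy, precision, recall, F1-score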

**4. Hyperparameter Tuning**

Use GridSearchCV with parameters:

* max_depth

* min_samples_split

* min_samples_leaf

* criterion (gini/entropy)
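
Steps 1 to 4 can be combined in a single scikit-learn pipeline. The sketch below is one possible arrangement, not a definitive implementation. The patient data is synthetic, every column name (`age`, `blood_pressure`, `smoker`, `sex`) is hypothetical, and tuning on F1 is an assumed choice.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical patient data: all columns and values are made up for illustration.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, size=n).astype(float),
    "blood_pressure": rng.normal(125, 15, size=n),
    "smoker": rng.choice(["yes", "no"], size=n),
    "sex": rng.choice(["F", "M"], size=n),
})
# Synthetic label loosely tied to age and smoking, plus noise.
risk = 0.03 * (df["age"] - 50) + 1.0 * (df["smoker"] == "yes") + rng.normal(0, 1, size=n)
y = (risk > 0.5).astype(int)

# Knock out about 10% of two columns to simulate missing values.
df.loc[rng.random(n) < 0.1, "blood_pressure"] = np.nan
df.loc[rng.random(n) < 0.1, "smoker"] = np.nan

numeric_cols = ["age", "blood_pressure"]
categorical_cols = ["smoker", "sex"]

# Steps 1 and 2: impute, then one-hot encode the categorical columns.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

# Step 3: preprocessing and the tree live in one pipeline, so imputers and
# encoders are fitted only on the training part of each CV fold.
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42)

# Step 4: grid search over the tree's hyperparameters.
param_grid = {
    "tree__max_depth": [3, 5, None],
    "tree__min_samples_leaf": [1, 5, 20],
    "tree__criterion": ["gini", "entropy"],
}
search = GridSearchCV(model, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

test_f1 = search.score(X_test, y_test)
print("Best parameters:", search.best_params_)
print("Test F1:", test_f1)
```

Keeping the preprocessing inside the pipeline means the same fitted transformations are applied automatically at prediction time.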

**5. Evaluate Performance**

Use:

* Confusion Matrix (false negatives are missed diagnoses, usually the costliest error here)

* ROC-AUC

* Classification Report
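
The three evaluation tools can be sketched as follows. The data is a synthetic stand-in generated with `make_classification` (about 20% positive cases). The class names `healthy` and `disease` and the tree settings are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for patient data.
X, y = make_classification(n_samples=1000, n_features=8, weights=[0.8, 0.2],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

clf = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10, random_state=0)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
y_score = clf.predict_proba(X_test)[:, 1]  # estimated probability of the positive class

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Per-class precision, recall and F1.
print(classification_report(y_test, y_pred, target_names=["healthy", "disease"]))

# ROC-AUC is computed from scores, not from hard predictions.
auc = roc_auc_score(y_test, y_score)
print("ROC-AUC:", auc)
```

With imbalanced classes, recall on the disease class is usually more informative than overall accuracy.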

**6. Business Value**

The model helps to:

* Detect disease early

* Reduce manual diagnosis workload

* Prioritize high-risk patients

* Save medical resources

* Improve patient outcomes

* Make data-driven healthcare decisions