# Decision Trees

---

## Q1: What is a Decision Tree, and how does it work in the context of classification?

A Decision Tree is a supervised learning algorithm used for both classification and regression. It recursively splits the data into subsets based on feature values, forming a tree in which each internal node tests a feature, each branch corresponds to an outcome of that test, and each leaf node holds a class label (for classification). At every step the algorithm picks the split that best separates the classes, and a new sample is classified by following the tests from the root down to a leaf.
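To make the structure concrete, here is a minimal sketch of how a tree classifies a sample. The tree is hand-written and hypothetical: the feature names, thresholds, and labels are illustrative assumptions, not values learned from data. It only shows the root-to-leaf traversal.

```python
# A hypothetical, hand-written tree. Internal nodes test one feature against a
# threshold; leaves hold a class label. The thresholds are illustrative only.
TREE = {
    "feature": "petal_length",
    "threshold": 2.5,
    "left": {"label": "setosa"},
    "right": {
        "feature": "petal_width",
        "threshold": 1.7,
        "left": {"label": "versicolor"},
        "right": {"label": "virginica"},
    },
}


def predict(node, sample):
    """Follow the feature tests from the root until a leaf is reached."""
    while "label" not in node:
        if sample[node["feature"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["label"]


print(predict(TREE, {"petal_length": 1.4, "petal_width": 0.2}))  # setosa
print(predict(TREE, {"petal_length": 5.1, "petal_width": 2.0}))  # virginica
```

A learned tree has the same shape; the training algorithm's job is to choose the features and thresholds automatically.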

---

## Q2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?

- **Gini Impurity** measures the probability of misclassifying a randomly chosen element if it were labeled at random according to the distribution of labels in the subset.
- **Entropy** quantifies the amount of disorder or uncertainty in the class distribution of the subset.

Both are used to evaluate the quality of splits: lower impurity means better separation. At each node the algorithm chooses the split that minimizes the size-weighted impurity (Gini or Entropy) of the child nodes, resulting in purer children. In practice the two criteria usually produce similar trees; Gini is slightly cheaper to compute because it avoids the logarithm.
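The two measures can be computed directly from class proportions. The sketch below uses the standard definitions (Gini as one minus the sum of squared proportions, entropy in bits); the helper names are illustrative and not part of any library.

```python
import math
from collections import Counter


def class_probabilities(labels):
    """Return the proportion of each class in the list of labels."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini(labels):
    """Gini impurity: 1 - sum(p_k ** 2)."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))


def entropy(labels):
    """Shannon entropy in bits: sum(-p_k * log2(p_k))."""
    return sum(-p * math.log2(p) for p in class_probabilities(labels))


pure = ["a", "a", "a", "a"]    # one class only: no impurity
mixed = ["a", "a", "b", "b"]   # 50/50 split: maximum impurity for two classes

print(gini(pure), entropy(pure))    # 0.0 0.0
print(gini(mixed), entropy(mixed))  # 0.5 1.0
```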

---

## Q3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.

- **Pre-Pruning** stops tree growth early (e.g., by setting `max_depth` or `min_samples_split`) to prevent overfitting.
    - *Advantage:* Reduces training cost as well as the risk of overfitting, because the full tree is never built.
- **Post-Pruning** grows the tree fully and then removes branches that do not improve performance on held-out data.
    - *Advantage:* Can yield a simpler, more generalizable model, and because the full tree is evaluated first it does not discard a useful split just because an earlier split looked weak.
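Both strategies can be sketched with scikit-learn. Pre-pruning uses growth limits at construction time; for post-pruning this example uses cost-complexity pruning (`ccp_alpha`), which is one post-pruning technique among several. Choosing the alpha on a single validation split is a simplifying assumption here; cross-validation would be a more robust choice.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Pre-pruning: limit growth up front.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_split=5, random_state=0)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree, then compute the candidate pruning strengths.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)
# Clip guards against tiny negative alphas caused by floating-point error.
alphas = np.unique(np.clip(path.ccp_alphas, 0.0, None))

best_alpha, best_score = 0.0, -1.0
for alpha in alphas:  # ascending order, so ties favor the simpler tree
    candidate = DecisionTreeClassifier(ccp_alpha=float(alpha), random_state=0)
    candidate.fit(X_train, y_train)
    score = candidate.score(X_val, y_val)
    if score >= best_score:
        best_alpha, best_score = float(alpha), score

post_pruned = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0)
post_pruned.fit(X_train, y_train)

print("Full tree nodes:       ", full.tree_.node_count)
print("Pre-pruned tree depth: ", pre_pruned.get_depth())
print("Post-pruned tree nodes:", post_pruned.tree_.node_count)
print("Chosen ccp_alpha:", best_alpha, "validation accuracy:", best_score)
```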

---

## Q4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?

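Information Gain measures the reduction in impurity after a dataset is split on a feature: the impurity of the parent node minus the size-weighted average impurity of the child nodes. It is classically defined with Entropy; the same calculation with Gini is often called Gini gain. At each node the tree selects the feature and threshold with the highest gain, so every split is the most informative one available at that point.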
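A small worked example, assuming the entropy-based definition. The function names and the toy labels are illustrative only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child subsets."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes"] * 5 + ["no"] * 5

# A split that separates the classes completely.
perfect_split = [["yes"] * 5, ["no"] * 5]
# A split that leaves both children almost as mixed as the parent.
weak_split = [["yes"] * 3 + ["no"] * 2, ["yes"] * 2 + ["no"] * 3]

print(information_gain(parent, perfect_split))  # 1.0
print(information_gain(parent, weak_split))     # roughly 0.029
```

The tree-building algorithm would prefer the first split because it removes all uncertainty about the class.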

---

## Q5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?

**Applications:** Medical diagnosis, credit scoring, customer segmentation, fraud detection.

**Advantages:** Easy to interpret, can handle both numerical and categorical data, and require little data preprocessing (for example, no feature scaling).

**Limitations:** Prone to overfitting, unstable (small changes in the data can produce a very different tree), and may not capture complex relationships as well as ensemble methods.

---

## Q6: Python Program - Iris Dataset, Decision Tree Classifier (Gini)

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris features and class labels
iris = load_iris()
X, y = iris.data, iris.target

# Fit a tree that uses Gini impurity as the split criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X, y)

# Note: predictions are made on the training data itself, so this is
# training accuracy and says nothing about generalization
y_pred = clf.predict(X)

print("Accuracy:", accuracy_score(y, y_pred))
print("Feature Importances:", clf.feature_importances_)
```

Output:

```
Accuracy: 1.0
Feature Importances: [0.01333333 0.         0.56405596 0.42261071]
```


## Q7: Python Program - Compare max_depth=3 vs Fully-Grown Tree

```python
from sklearn.model_selection import train_test_split

# Reuses X, y and DecisionTreeClassifier from the Q6 cell
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fully-grown tree (no depth limit)
clf_full = DecisionTreeClassifier(random_state=42)
clf_full.fit(X_train, y_train)
acc_full = clf_full.score(X_test, y_test)

# Pre-pruned tree (depth capped at 3)
clf_pruned = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_pruned.fit(X_train, y_train)
acc_pruned = clf_pruned.score(X_test, y_test)

print("Fully-grown tree accuracy:", acc_full)
print("max_depth=3 tree accuracy:", acc_pruned)
```

Output:

```
Fully-grown tree accuracy: 1.0
max_depth=3 tree accuracy: 1.0
```


## Q8: Python Program - California Housing, Decision Tree Regressor

```python
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load the California Housing features and target (median house value)
housing = fetch_california_housing()
X_housing, y_housing = housing.data, housing.target

reg = DecisionTreeRegressor(random_state=42)
reg.fit(X_housing, y_housing)

# Note: predictions are made on the training data, so the near-zero MSE
# shows that the unpruned tree memorized it, not how well it generalizes
y_pred = reg.predict(X_housing)

print("Mean Squared Error:", mean_squared_error(y_housing, y_pred))
print("Feature Importances:", reg.feature_importances_)
```

Output:

```
Mean Squared Error: 9.555001274479309e-32
Feature Importances: [0.52500998 0.05100488 0.05341707 0.02651485 0.03282405 0.1320936
 0.09387213 0.08526344]
```


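## Q9: Python Program - GridSearchCV for Decision Tree Hyperparameters

```python
from sklearn.model_selection import GridSearchCV

# Candidate values for each hyperparameter (5 x 3 = 15 combinations)
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 5, 10]
}

# Reuses X, y and DecisionTreeClassifier from the Q6 cell;
# every combination is scored with 5-fold cross-validation
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid.fit(X, y)

print("Best Parameters:", grid.best_params_)
# best_score_ is the mean cross-validated accuracy of the best combination
print("Best Accuracy:", grid.best_score_)
```

Output:

```
Best Parameters: {'max_depth': 3, 'min_samples_split': 2}
Best Accuracy: 0.9733333333333334
```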


## Q10: Step-by-Step Process for Healthcare Classification Task

1. **Handle Missing Values:** Impute numerical features with the mean or median, and categorical features with the mode or a separate "missing" category.
2. **Encode Categorical Features:** Use one-hot encoding for nominal features and ordinal encoding for ordered ones.
3. **Train Decision Tree Model:** Split the data into train/test sets and fit a Decision Tree classifier on the training set. Fit the imputation and encoding steps on the training set only, so no information leaks from the test set.
4. **Tune Hyperparameters:** Use `GridSearchCV` or `RandomizedSearchCV` to optimize parameters such as `max_depth` and `min_samples_split`.
5. **Evaluate Performance:** Assess accuracy, precision, recall, F1-score, and ROC-AUC on the held-out test data.
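The five steps above can be sketched as a single scikit-learn pipeline. The patient data below is synthetic, and the column names (`age`, `systolic_bp`, `smoker`) and the rule that generates the label are hypothetical stand-ins, not real clinical data. Wrapping the preprocessing in a `Pipeline` is one way to make sure imputers and encoders are fitted on training folds only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records (hypothetical columns and label rule).
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "systolic_bp": rng.normal(130, 20, n),
    "smoker": rng.choice(["yes", "no"], n),
})
y = ((df["age"] > 60) & (df["systolic_bp"] > 125)).astype(int).to_numpy()

# Knock out some values to simulate missing data.
df.loc[rng.choice(n, 30, replace=False), "systolic_bp"] = np.nan
df.loc[rng.choice(n, 30, replace=False), "smoker"] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.25, stratify=y, random_state=0
)

# Steps 1 and 2: impute, then encode the categorical column.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age", "systolic_bp"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), ["smoker"]),
])

# Step 3: the tree is the final stage of the pipeline.
model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

# Step 4: tune the tree's hyperparameters with 5-fold cross-validation.
param_grid = {
    "tree__max_depth": [2, 3, 5, None],
    "tree__min_samples_split": [2, 10],
}
grid = GridSearchCV(model, param_grid, cv=5, scoring="f1")
grid.fit(X_train, y_train)

# Step 5: evaluate once on the held-out test set.
y_pred = grid.predict(X_test)
y_prob = grid.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_prob)

print("Best parameters:", grid.best_params_)
print(classification_report(y_test, y_pred))
print("ROC-AUC:", auc)
```

Scoring the search with F1 rather than accuracy is a design choice for this sketch, on the assumption that the positive (high-risk) class is the minority.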

**Business Value:** The model can help identify high-risk patients, enabling early intervention, personalized treatment, and resource optimization, ultimately improving patient outcomes and reducing costs.