# Decision Tree Assignment Solutions

Generated by ChatGPT

## Question 1: What is a Decision Tree, and how does it work in the context of classification?

A decision tree is a flowchart-like model used for classification (and regression) that recursively splits the feature space into regions based on feature values. In classification, each internal node tests a feature, branches represent possible values/ranges, and leaf nodes assign class labels. The tree is built by selecting splits that maximize purity (e.g., information gain or reduction in impurity), leading to hierarchical decision rules that classify input samples.
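
The hierarchical rules described above can be inspected directly. The following is a minimal sketch, assuming scikit-learn is installed; the depth limit of 2 and the variable names are illustrative choices, not part of the assignment.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree (illustrative choice) keeps the printed rule set short.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each indented line is a feature test at an internal node;
# lines containing "class:" are leaves that assign a label.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Reading the output from top to bottom follows one path from the root to a leaf, which is how a single sample gets classified.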

## Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?

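- **Gini Impurity** measures the probability of incorrectly classifying a randomly chosen element if it were labeled according to the class distribution in the node. Formula: 1 - Σ p_i^2, where p_i is the proportion of samples in the node that belong to class i. Lower is purer; 0 means the node contains a single class.
- **Entropy** comes from information theory and quantifies the amount of disorder: -Σ p_i log2(p_i). Lower entropy means the node is more homogeneous; 0 means the node contains a single class.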

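When choosing splits, the algorithm evaluates how much each candidate split reduces impurity (Gini or Entropy). The split that yields the largest reduction (called information gain when entropy is used) is preferred, leading to more informative partitions of the data. In practice the two measures usually select similar splits; Gini is slightly cheaper to compute because it avoids the logarithm.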
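
To make the two formulas concrete, here is a small sketch that evaluates both measures on a few class distributions. The function names `gini` and `entropy` are hypothetical helpers written for this illustration, not library functions.

```python
import math

def gini(probs):
    """Gini impurity: 1 - sum of squared class proportions."""
    return 1.0 - sum(p * p for p in probs)

def entropy(probs):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# From a pure node to a maximally mixed two-class node.
for probs in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]):
    print(probs, "gini =", round(gini(probs), 3), "entropy =", round(entropy(probs), 3))
```

Both measures are 0 for the pure node and reach their two-class maximum (0.5 for Gini, 1 bit for entropy) at the 50/50 split.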

## Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.

- **Pre-Pruning**: Stops tree growth early based on criteria (e.g., max depth, minimum samples per leaf) to prevent overfitting. *Advantage:* faster training and simpler model by restricting complexity upfront.
- **Post-Pruning**: Grows a full tree, then removes branches that do not improve generalization, judged on validation data or by a cost-complexity penalty. *Advantage:* avoids stopping too early, because a split that looks weak on its own can enable strong splits beneath it; evaluating the full structure before trimming often yields more accurate final trees.
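
The two strategies can be compared side by side. The sketch below is one possible setup, assuming scikit-learn: pre-pruning uses `max_depth` and `min_samples_leaf`, while post-pruning uses scikit-learn's minimal cost-complexity pruning (`ccp_alpha`), which is a stand-in for validation-based pruning in general. The dataset and parameter values are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Pre-pruning: growth stops as soon as a limit is hit.
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
pre.fit(X_train, y_train)

# Post-pruning: grow the full tree, then compute the candidate pruning strengths.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)

# Pick the alpha with the best cross-validated accuracy on the training data.
best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative rounding errors
    candidate = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
    score = cross_val_score(candidate, X_train, y_train, cv=5).mean()
    if score > best_score:
        best_alpha, best_score = alpha, score

post = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0)
post.fit(X_train, y_train)

for name, model in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(name, "leaves:", model.get_n_leaves(),
          "test accuracy:", round(model.score(X_test, y_test), 3))
```

The pre-pruned tree is cheapest to train because it never grows past its limits; the post-pruned tree costs more to fit but is chosen from the full tree's own sequence of subtrees.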

## Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?

Information Gain is the reduction in impurity (usually entropy) achieved by partitioning the data on a feature. It is computed as the difference between the parent node's impurity and the weighted average impurity of the child nodes after the split. The split with the highest information gain is chosen because it most effectively separates the classes, leading to purer subsets.
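
As a worked example of this computation, the sketch below scores two candidate splits of the same ten-sample parent node. The helper names and the toy labels are hypothetical and exist only for this illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes"] * 5 + ["no"] * 5

# Split A separates the classes well; split B barely changes the class mix.
split_a = [["yes"] * 4 + ["no"] * 1, ["yes"] * 1 + ["no"] * 4]
split_b = [["yes"] * 3 + ["no"] * 2, ["yes"] * 2 + ["no"] * 3]

gain_a = information_gain(parent, split_a)
gain_b = information_gain(parent, split_b)
print("Gain A:", round(gain_a, 3))
print("Gain B:", round(gain_b, 3))
```

Split A has the higher gain, so a tree builder using entropy would choose it over split B.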

## Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?

**Applications:**
- Medical diagnosis (disease prediction)
- Credit scoring and loan approval
- Customer churn prediction
- Fraud detection
- Marketing segmentation

**Advantages:**
- Easy to interpret and visualize
- Handles both numerical and categorical data
- Requires little data preprocessing
- Can capture nonlinear relationships

**Limitations:**
- Prone to overfitting if unrestricted
- Unstable (small changes in data can lead to different trees)
- Impurity-based split selection is biased toward features with many distinct values
- Often less accurate than ensemble methods like Random Forest or Gradient Boosting

## Question 6: Load Iris, train Decision Tree Classifier (Gini), print accuracy and feature importances.


from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

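# Load the data and hold out 20% of it for testing.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Fit an unrestricted tree using Gini impurity as the split criterion.
clf_gini = DecisionTreeClassifier(criterion="gini", random_state=42)
clf_gini.fit(X_train, y_train)

y_pred = clf_gini.predict(X_test)
print("Accuracy (Gini):", accuracy_score(y_test, y_pred))

# Pair each importance with its feature name so the output is readable.
print("Feature importances:")
for name, importance in zip(iris.feature_names, clf_gini.feature_importances_):
    print(f"  {name}: {importance:.3f}")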


## Question 7: Compare Decision Tree with max_depth=3 vs fully-grown tree.


from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

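# Reuses X_train, X_test, y_train, y_test and the fully grown clf_gini from Question 6.
clf_depth3 = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_depth3.fit(X_train, y_train)

# Score both trees on the same held-out test set.
acc_depth3 = accuracy_score(y_test, clf_depth3.predict(X_test))
acc_full = accuracy_score(y_test, clf_gini.predict(X_test))
print("Accuracy with max_depth=3:", acc_depth3)
print("Accuracy fully-grown:", acc_full)
print("Depth of fully-grown tree:", clf_gini.get_depth())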


## Question 8: Load synthetic regression dataset (stand-in for Boston), train Decision Tree Regressor, print MSE and feature importances.


from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

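# load_boston was removed in scikit-learn 1.2, so a synthetic dataset with the
# same shape (506 samples, 13 features) stands in for it here.
X_reg, y_reg = make_regression(n_samples=506, n_features=13, noise=0.5, random_state=42)
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

# Fit an unrestricted regression tree (default criterion: squared error).
reg = DecisionTreeRegressor(random_state=42)
reg.fit(X_train_reg, y_train_reg)

y_pred_reg = reg.predict(X_test_reg)
print("MSE:", mean_squared_error(y_test_reg, y_pred_reg))
print("Feature importances:", reg.feature_importances_)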


## Question 9: Tune max_depth and min_samples_split using GridSearchCV on Iris dataset.


from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

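# Search over tree depth and the minimum number of samples required to split a node.
param_grid = {'max_depth': [2, 3, 4, 5, None], 'min_samples_split': [2, 4, 6, 8]}
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)

# Reuses X_train and y_train from Question 6; the test set stays out of the search.
grid_search.fit(X_train, y_train)
print("Best parameters:", grid_search.best_params_)
print("Best CV score:", grid_search.best_score_)

# GridSearchCV refits the best model on all of X_train, so it can be scored directly.
print("Test accuracy:", grid_search.score(X_test, y_test))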


## Question 10: Healthcare disease prediction pipeline

**1. Handle missing values:**
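- Analyze the missingness pattern (e.g., missing completely at random vs. missing for a reason related to the outcome).
- Impute numerical features (mean/median, KNN, or another model-based method) and categorical features (most frequent value or a dedicated "missing" category).
- Fit imputers on the training data only and apply them to validation and test data, so that no information leaks from held-out samples.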

**2. Encode categorical features:**
- Use one-hot encoding for nominal categories, or ordinal encoding if there is an intrinsic order. For high-cardinality features, consider target encoding with regularization.

**3. Train Decision Tree model:**
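- Split data into train/validation/test sets, stratified by the disease label.
- Skip feature scaling: tree splits depend only on the ordering of feature values, so scaling does not change the model.
- Initialize DecisionTreeClassifier; treat the criterion (gini/entropy) as a hyperparameter, since the two usually produce similar trees.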

**4. Tune hyperparameters:**
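- Use GridSearchCV or RandomizedSearchCV over parameters like max_depth, min_samples_split, and min_samples_leaf, plus class_weight to handle class imbalance.
- Run the search with stratified cross-validation on the training data and keep the test set untouched until the final evaluation, so the reported score is not inflated by tuning.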

**5. Evaluate performance:**
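- Metrics: precision, recall, F1-score, and ROC AUC for binary disease prediction. Report accuracy as well, but do not rely on it alone when the disease is rare, because a model that predicts "healthy" for everyone can still score high.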
- Use confusion matrix to understand types of errors.
- Calibrate probabilities if needed.
- Perform feature importance analysis and SHAP explanations for interpretability.
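
Steps 1 through 5 can be sketched as a single scikit-learn pipeline. This is a minimal illustration under stated assumptions: the patient data is synthetic, the column names (`age`, `blood_pressure`, `smoker`) and the risk formula are hypothetical, and the parameter grid is only an example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic patient data; columns and risk formula are made up.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
})
risk = (0.03 * (df["age"] - 50) + 0.02 * (df["blood_pressure"] - 120)
        + 0.8 * (df["smoker"] == "yes"))
y = (risk + rng.normal(0, 0.5, n) > 0.5).astype(int)

# Inject roughly 10% missing values after the label has been computed.
df.loc[rng.random(n) < 0.1, "blood_pressure"] = np.nan
df.loc[rng.random(n) < 0.1, "smoker"] = np.nan

# Steps 1-2: impute and encode inside the pipeline so they are fit on training folds only.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age", "blood_pressure"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["smoker"]),
])

# Step 3: the tree itself; no scaling step is needed.
model = Pipeline([("prep", preprocess), ("tree", DecisionTreeClassifier(random_state=0))])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.25, stratify=y, random_state=0
)

# Step 4: tune with cross-validation on the training data only.
param_grid = {
    "tree__max_depth": [2, 3, 4, 5],
    "tree__min_samples_leaf": [1, 5, 10],
    "tree__class_weight": [None, "balanced"],
}
grid = GridSearchCV(model, param_grid, cv=5, scoring="roc_auc")
grid.fit(X_train, y_train)

# Step 5: evaluate once on the untouched test set.
proba = grid.predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(y_test, proba)
print("Best parameters:", grid.best_params_)
print(classification_report(y_test, grid.predict(X_test)))
print("Test ROC AUC:", round(test_auc, 3))
```

Keeping imputation and encoding inside the pipeline means each cross-validation fold learns its own imputation values, which avoids leaking information from the validation fold into training.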

**Business value:**
- Early detection leading to timely treatment, reducing costs and improving outcomes.
- Risk stratification to allocate resources efficiently.
- Personalized care recommendations.
- Reduces unnecessary tests by focusing further testing on patients flagged as high-risk.