# Decision Tree

Question 1: What is a Decision Tree, and how does it work in classification?
- A Decision Tree is a flowchart-like model used for classification. It works by making sequential splits on data features, creating branches that lead down to leaf nodes, which represent the final predicted class.
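
The flowchart idea can be sketched with a tiny hand-built tree. This is only an illustration: the feature names, thresholds, and class labels below are assumed values, not rules learned from data.

```python
# Hypothetical hand-built tree; feature names and thresholds are assumptions for illustration.
tree = {
    "feature": "petal_length", "threshold": 2.5,
    "left": {"label": "setosa"},
    "right": {
        "feature": "petal_width", "threshold": 1.75,
        "left": {"label": "versicolor"},
        "right": {"label": "virginica"},
    },
}


def predict(node, sample):
    """Walk from the root to a leaf, going left when the feature value is <= the threshold."""
    while "label" not in node:
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]


small_flower = {"petal_length": 1.4, "petal_width": 0.2}
large_flower = {"petal_length": 5.1, "petal_width": 2.0}
print(predict(tree, small_flower))
print(predict(tree, large_flower))
```

Each internal node asks one question about one feature, and the leaf that the sample reaches supplies the predicted class.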

Question 2: What are Gini Impurity and Entropy?
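- Both are metrics that measure the "impurity" or "disorder" of the class mix within a node. Gini Impurity is 1 minus the sum of the squared class proportions, and Entropy is the negative sum of p * log2(p) over the class proportions p. Both are 0 for a pure node and largest when the classes are evenly mixed. The tree uses them to decide which feature split will create the "purest" possible child nodes.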
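
As a rough sketch of the two metrics, the functions below compute them from a list of class labels. The function names and the toy label lists are assumptions made for this example.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: sum of p * log2(1 / p) over class proportions p."""
    n = len(labels)
    return sum((count / n) * log2(n / count) for count in Counter(labels).values())


pure_node = ["a", "a", "a", "a"]   # only one class present
mixed_node = ["a", "a", "b", "b"]  # two classes, evenly mixed

print("pure :", gini(pure_node), entropy(pure_node))
print("mixed:", gini(mixed_node), entropy(mixed_node))
```

The pure node scores zero on both metrics, while the evenly mixed two-class node reaches the two-class maximum of each.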

Question 3: What is the difference between Pre-Pruning and Post-Pruning?
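- Pre-pruning stops the tree from growing during training by imposing limits up front (e.g., setting max_depth or min_samples_leaf), which is faster but can stop before a useful split is found. Post-pruning lets the tree grow fully and then removes branches that add little predictive value (e.g., cost-complexity pruning via ccp_alpha in scikit-learn), which can lead to a more accurate model at the cost of extra training time.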
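
A minimal sketch of both styles in scikit-learn, using Iris as a stand-in dataset. Picking the middle alpha from the pruning path is an arbitrary choice made for this example; in practice alpha would be selected by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Pre-pruning: cap the depth before training starts
pre_pruned = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

# Post-pruning: grow the full tree, then prune it back with a cost-complexity alpha
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # arbitrary mid-path value (assumption)
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)

for name, model in [("full", full_tree), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(f"{name:12s} leaves={model.get_n_leaves():2d} test accuracy={model.score(X_test, y_test):.4f}")
```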

Question 4: What is Information Gain?
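- Information Gain measures how much a split reduces impurity: the parent node's impurity minus the size-weighted average impurity of its child nodes. Strictly, the term refers to the reduction in Entropy, though the same calculation applies with Gini. It's important because at each node the tree greedily chooses the split that provides the highest Information Gain.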
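
The calculation can be sketched in a few lines of plain Python. The function names and the toy "sick"/"healthy" labels are assumptions for this example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return sum((count / n) * log2(n / count) for count in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the child nodes."""
    n = len(parent)
    weighted_child_entropy = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted_child_entropy


parent = ["sick"] * 4 + ["healthy"] * 4

# A split that separates the classes completely
perfect_split = [["sick"] * 4, ["healthy"] * 4]
# A split whose children have the same class mix as the parent
useless_split = [["sick", "sick", "healthy", "healthy"], ["sick", "sick", "healthy", "healthy"]]

print("perfect split gain:", information_gain(parent, perfect_split))
print("useless split gain:", information_gain(parent, useless_split))
```

A split that leaves the class mix unchanged gains nothing, which is why the tree would prefer the first candidate.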

Question 5: What is a key advantage and limitation of Decision Trees?
- A key advantage is that they are highly interpretable and easy to understand (a "white-box" model). A key limitation is their tendency to overfit the training data if not pruned.

In [None]:
# Question 6: Python code to load Iris, train with Gini, print accuracy and feature importances.

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris Dataset
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names

# Split data into training and testing sets to evaluate accuracy
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a Decision Tree Classifier using the Gini criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Print the model's accuracy
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.4f}\n")

# Print the feature importances
importances = clf.feature_importances_
feature_importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})
print("Feature Importances:")
print(feature_importance_df)

Model Accuracy: 1.0000

Feature Importances:
             Feature  Importance
0  sepal length (cm)    0.000000
1   sepal width (cm)    0.019110
2  petal length (cm)    0.893264
3   petal width (cm)    0.087626


In [None]:
# Question 7: Python code to load Iris, train with max_depth=3, and compare accuracy to a fully-grown tree.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris Dataset
X, y = load_iris(return_X_y=True)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a fully-grown tree (max_depth=None is the default)
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)
y_pred_full = full_tree.predict(X_test)
acc_full = accuracy_score(y_test, y_pred_full)

print(f"Accuracy of Fully-Grown Tree: {acc_full:.4f}")
print(f"Depth of Fully-Grown Tree: {full_tree.get_depth()}")

# Train a Decision Tree Classifier with max_depth=3
pruned_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
pruned_tree.fit(X_train, y_train)
y_pred_pruned = pruned_tree.predict(X_test)
acc_pruned = accuracy_score(y_test, y_pred_pruned)

print(f"\nAccuracy of max_depth=3 Tree: {acc_pruned:.4f}")
print(f"Depth of max_depth=3 Tree: {pruned_tree.get_depth()}")

# Compare using the computed scores rather than hard-coded values
print("\nComparison:")
if acc_pruned == acc_full:
    print(f"The pruned (max_depth=3) tree achieved the same accuracy ({acc_pruned:.4f}) as the fully-grown tree.")
    print("The extra depth of the fully-grown tree adds no test accuracy here; the key decisions are all made in the first 3 levels.")
else:
    print(f"The pruned (max_depth=3) tree scored {acc_pruned:.4f} versus {acc_full:.4f} for the fully-grown tree.")

Accuracy of Fully-Grown Tree: 1.0000
Depth of Fully-Grown Tree: 6

Accuracy of max_depth=3 Tree: 1.0000
Depth of max_depth=3 Tree: 3

Comparison:
The pruned (max_depth=3) tree achieved the same accuracy (1.0000) as the fully-grown tree.
The extra depth of the fully-grown tree adds no test accuracy here; the key decisions are all made in the first 3 levels.


In [None]:
# Question 8: Python code to load California Housing, train a regressor, print MSE and feature importances.

import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load the California Housing Dataset
housing = fetch_california_housing()
X = housing.data
y = housing.target
feature_names = housing.feature_names

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# Print the Mean Squared Error (MSE)
y_pred = regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse:.4f}\n")

# Print the feature importances
importances = regressor.feature_importances_
feature_importance_df = pd.DataFrame(
    {'Feature': feature_names, 'Importance': importances}
)
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

print("Feature Importances:")
print(feature_importance_df)

Mean Squared Error (MSE): 0.5280

Feature Importances:
      Feature  Importance
0      MedInc    0.523456
5    AveOccup    0.139012
6    Latitude    0.089992
7   Longitude    0.088806
1    HouseAge    0.052135
2    AveRooms    0.049418
4  Population    0.032206
3   AveBedrms    0.024974


In [1]:
# Question 9: Python code to load Iris, tune hyperparameters with GridSearchCV, print best params and accuracy.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris Dataset
X, y = load_iris(return_X_y=True)

# Split data into train and test sets
# We tune on the training set and validate the final model on the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Tune hyperparameters using GridSearchCV
# Define the model
dt = DecisionTreeClassifier(random_state=42)

# Define the parameter grid to search
param_grid = {
    'max_depth': [None, 3, 5, 7, 10],
    'min_samples_split': [2, 5, 10, 20],
    'min_samples_leaf': [1, 2, 5, 10],
    'criterion': ['gini', 'entropy']
}

# Set up GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(
    estimator=dt,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1  # Use all available cores
)

# Run the grid search on the training data
grid_search.fit(X_train, y_train)

# Print the best parameters
print(f"Best Parameters found by GridSearchCV:\n{grid_search.best_params_}\n")

# Print the resulting model accuracy
# Get the best model found by the search
best_model = grid_search.best_estimator_

# Evaluate the best model on the held-out test set
y_pred = best_model.predict(X_test)
final_accuracy = accuracy_score(y_test, y_pred)

print(f"Resulting Model Accuracy on Test Set: {final_accuracy:.4f}")

Best Parameters found by GridSearchCV:
{'criterion': 'gini', 'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 10}

Resulting Model Accuracy on Test Set: 1.0000


Question 10: Step-by-step process for the healthcare scenario.

Here is the step-by-step process I would follow to build a predictive model for the healthcare company, along with the business value it provides.

Step 1: Handle Missing Values

Given a large dataset with mixed data types and missing values, preprocessing is critical.

- Analyze Missingness: First, I would visualize the missing data using a library like missingno to look for patterns that hint at whether the missingness is completely random (MCAR) or related to other variables or to the missing value itself (MAR/MNAR).
- Numerical Features (e.g., 'Age', 'BloodPressure'):
  - Imputation: I would likely impute missing values using the median, which is robust to outliers, or using a more advanced method like KNNImputer, which fills in values based on the "nearest" (most similar) patients in the dataset.
- Categorical Features (e.g., 'BloodType', 'Symptom'):
  - Imputation: I would impute missing values with the mode (the most frequent category).
  - "Missing" as a Category: If the fact that data is "missing" is itself predictive (e.g., a test was "not performed"), I would create a new category called "Missing" or "Unknown".
- Row/Column Deletion: Since the dataset is large, if a small number of patients are missing many features, I might drop those rows. If a specific feature is missing for >50% of patients, I might drop the entire feature column as it's unreliable.
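
The imputation choices above can be sketched with pandas on a made-up patient table. The column names and values are hypothetical; in a scikit-learn pipeline, SimpleImputer or KNNImputer would play the same role.

```python
import numpy as np
import pandas as pd

# Hypothetical patient records with gaps in every column
df = pd.DataFrame({
    "Age": [34.0, np.nan, 61.0, 47.0, 120.0],
    "BloodType": ["A", "O", np.nan, "O", "B"],
    "TestResult": ["pos", np.nan, "neg", np.nan, "neg"],
})

imputed = df.copy()

# Numerical feature: the median is not pulled upward by the outlier (120)
imputed["Age"] = imputed["Age"].fillna(imputed["Age"].median())

# Categorical feature: fill with the most frequent category (the mode)
imputed["BloodType"] = imputed["BloodType"].fillna(imputed["BloodType"].mode()[0])

# Categorical feature where absence may be informative: keep it as its own category
imputed["TestResult"] = imputed["TestResult"].fillna("Missing")

print(imputed)
```

To avoid leaking information, the median and mode should be computed on the training split only and then reused on the test split.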

Step 2: Encode Categorical Features

The scikit-learn Decision Tree model requires all inputs to be numeric.

- Ordinal Features (e.g., 'PainLevel' as 'Low', 'Medium', 'High'): I would use OrdinalEncoder to map these to integers that preserve their order (e.g., 0, 1, 2).
- Nominal Features (e.g., 'Gender', 'BloodType'): I would use OneHotEncoder (or pd.get_dummies). This creates new binary (0/1) columns for each category, preventing the model from assuming an incorrect order (e.g., that 'BloodType B' > 'BloodType A').

Step 3: Train a Decision Tree Model

- Split Data: I would split the preprocessed data into a training set and a testing set (e.g., 80% train, 20% test) using train_test_split. I would also ensure stratify=y is set to maintain the same percentage of diseased and healthy patients in both splits, which is crucial for imbalanced medical data.
- Initial Training: I would train a default DecisionTreeClassifier on the training data (X_train, y_train) to establish a baseline performance.
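
A minimal sketch of the stratified split and baseline, using synthetic data in place of the real patient records. The 10% disease rate and the random features are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for preprocessed patient data: 1000 patients, 10% diseased
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = np.array([1] * 100 + [0] * 900)

# stratify=y keeps the diseased/healthy ratio the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(f"Positive rate in train: {y_train.mean():.3f}")
print(f"Positive rate in test:  {y_test.mean():.3f}")

# Default tree as a baseline to compare tuned models against
baseline = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print(f"Baseline test accuracy: {baseline.score(X_test, y_test):.4f}")
```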

Step 4: Tune its Hyperparameters

To prevent overfitting and find the best model, I would perform hyperparameter tuning using GridSearchCV or RandomizedSearchCV with 5-fold or 10-fold cross-validation on the training set.

Parameters to Tune:

- criterion: ['gini', 'entropy']
- max_depth: [3, 5, 7, 10, None] (to control complexity)
- min_samples_split: [10, 20, 50] (to prevent splitting on small nodes)
- min_samples_leaf: [5, 10, 20] (to ensure leaves are not too specific)
- class_weight: ['balanced'] (This is critical if the disease is rare, because it weights errors on the minority 'diseased' class more heavily during training.)
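
The grid above could be wired into GridSearchCV roughly as follows. The synthetic imbalanced dataset and the choice of scoring='recall' are assumptions made for this sketch, not part of the original plan.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the patient dataset (about 10% positives)
X, y = make_classification(n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 7, 10, None],
    "min_samples_split": [10, 20, 50],
    "min_samples_leaf": [5, 10, 20],
    "class_weight": ["balanced"],
}

# scoring="recall" is an assumption here: it selects the model that misses the fewest positives
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="recall",
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated recall: {search.best_score_:.4f}")
```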

Step 5: Evaluate its Performance

After finding the best model from GridSearchCV, I would evaluate its performance on the held-out test set (X_test, y_test).

Key Metrics: For disease prediction, accuracy is often misleading. The most important metrics would be:

- Confusion Matrix: To see the raw count of True Positives, True Negatives, False Positives, and False Negatives.
- Recall (Sensitivity): TP/(TP+FN). This is the most critical metric in this scenario. It answers: "Of all the patients who actually have the disease, what percentage did our model correctly identify?" We must minimize False Negatives (missing a sick patient), as the cost is extremely high.
- Precision: TP/(TP+FP). "Of all the patients the model predicted as sick, how many were correct?" This is important for avoiding unnecessary, costly, or invasive follow-up tests.
- F1-Score: The harmonic mean of Precision and Recall, providing a good balance.
- ROC-AUC: The area under the ROC curve shows how well the model can distinguish between the 'diseased' and 'healthy' classes.
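
To make the formulas concrete, here is a small worked example with made-up confusion-matrix counts. The counts are hypothetical and chosen only to show how accuracy can look strong while recall lags.

```python
# Hypothetical confusion-matrix counts for 1000 patients, 50 of whom are actually sick
tp, fn = 40, 10    # sick patients: correctly flagged vs. missed
fp, tn = 20, 930   # healthy patients: wrongly flagged vs. correctly cleared

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)        # share of sick patients the model caught
precision = tp / (tp + fp)     # share of flagged patients who are actually sick
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"Precision: {precision:.4f}")
print(f"F1-Score:  {f1:.4f}")
```

With these counts the accuracy is 0.97 even though one in five sick patients is missed (recall 0.80), which is why recall is examined separately. ROC-AUC is not shown because it needs predicted probabilities rather than counts.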

Real-World Business Value

This model would provide immense value to the healthcare company in several ways:

Early-Stage Screening & Risk Stratification: The model can serve as a fast, low-cost, automated tool to screen a large patient population and flag individuals at high risk for the disease.

Decision Support for Clinicians: It acts as a "second opinion" or decision-support tool for doctors. By showing why a patient is flagged (e.g., "Age > 60 and Symptom-X = True"), it helps doctors prioritize who needs immediate, and often more expensive, diagnostic testing (like MRIs or biopsies).

Resource Optimization: By identifying high-risk patients more effectively, the company can allocate its limited resources (specialist appointments, diagnostic equipment, patient support programs) more efficiently to those who need them most.

Interpretability and Trust: A key value of the Decision Tree is its "white-box" nature. Doctors and regulators can audit the model's logic. This transparency builds trust and allows clinical experts to validate that the rules the model learned are medically sound, which is a requirement that "black-box" models (like complex neural networks) struggle to meet.
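
The idea of showing why a patient is flagged can be sketched by reading the decision path of a single prediction. Iris is used here as a stand-in for patient data, and the sample index is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(iris.data, iris.target)

sample = iris.data[[100]]                 # one row, kept 2D for the scikit-learn API
node_indicator = clf.decision_path(sample)  # sparse matrix of the nodes this sample visits
leaf_id = clf.apply(sample)[0]            # id of the leaf the sample ends in
tree = clf.tree_

# Turn every internal node on the path into a readable rule
rules = []
for node_id in node_indicator.indices:
    if node_id == leaf_id:
        continue  # the leaf holds the prediction, not a test
    feature_index = tree.feature[node_id]
    threshold = tree.threshold[node_id]
    op = "<=" if sample[0, feature_index] <= threshold else ">"
    rules.append(f"{iris.feature_names[feature_index]} {op} {threshold:.2f}")

predicted = iris.target_names[clf.predict(sample)[0]]
print(f"Predicted '{predicted}' because " + " and ".join(rules))
```

The same pattern would yield statements like the "Age > 60" example above once the model is trained on clinical features.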