# Decision Tree for HR Attrition Prediction

This notebook demonstrates how to use a Decision Tree Classifier in Python to predict employee attrition using HR data. It is designed for learners with basic Python skills and includes step-by-step code with explanations.

## 🎯 Learning Objectives
By the end of this notebook, you will be able to:
- Understand the intuition and structure behind Decision Trees
- Train a Decision Tree Classifier using `scikit-learn`
- Interpret model output to identify key attrition drivers
- Connect insights to real-world HR decisions

Step 1: Set up the dependencies

In [None]:
# See requirements.txt for the packages this notebook depends on.
# To install a single package, run e.g.: !pip install seaborn --user

# To install everything listed in requirements.txt at once, run:
#!pip install -r requirements.txt

import pandas as pd
import os

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

Step 2: Upload the dataset

In [None]:
# Option A: if the CSV is already uploaded in your environment, load it directly
#df = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")
#df.head()

# Option B: pick the file through a file dialog

import tkinter as tk
from tkinter import filedialog

# Only works if GUI is available
root = tk.Tk()
root.withdraw()  # Hide the main window

file_path = filedialog.askopenfilename(title="Select your CSV file")
if file_path:
    df = pd.read_csv(file_path)
    print("✅ File loaded successfully!")
    display(df.head())
else:
    print("❌ No file selected.")
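The file dialog above needs a desktop GUI, so it may fail on a headless server or a hosted notebook. A minimal fallback sketch that reads from an explicit path instead. `load_hr_data` is a hypothetical helper (not part of the original notebook), and the demo writes a tiny made-up CSV only so the block runs on its own.

```python
import os
import tempfile

import pandas as pd


def load_hr_data(csv_path):
    """Read the HR CSV from an explicit path (hypothetical helper)."""
    if not os.path.isfile(csv_path):
        raise FileNotFoundError(f"CSV not found: {csv_path}")
    return pd.read_csv(csv_path)


# Demo with a tiny made-up file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_hr.csv")
    pd.DataFrame({"Attrition": ["Yes", "No"], "Age": [41, 49]}).to_csv(
        demo_path, index=False
    )
    demo_df = load_hr_data(demo_path)

print(demo_df.shape)
```

In the notebook itself, the call would be `df = load_hr_data("WA_Fn-UseC_-HR-Employee-Attrition.csv")`, assuming the file sits next to the notebook.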

Step 3: Select the features (variables/levers) and encode the Yes/No columns as 1/0

In [None]:
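# .copy() gives an independent DataFrame, which avoids pandas' SettingWithCopyWarning below
df = df[["Attrition", "Age", "JobSatisfaction", "MonthlyIncome", "DistanceFromHome", "OverTime"]].copy()
# Encode the Yes/No text columns as 1/0 so the tree can use them
df['Attrition'] = df['Attrition'].map({'Yes': 1, 'No': 0})
df['OverTime'] = df['OverTime'].map({'Yes': 1, 'No': 0})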
df.head()
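`Series.map` silently turns any value missing from the dictionary into `NaN`, so a stray spelling such as `"yes "` would slip through unnoticed. A small sketch of a sanity check, using made-up rows rather than the HR data; the names `toy`, `yes_no` and `unmapped` are illustrative only.

```python
import pandas as pd

# Made-up rows for illustration; the last OverTime value is deliberately malformed.
toy = pd.DataFrame({
    "Attrition": ["Yes", "No", "No"],
    "OverTime": ["Yes", "No", "yes "],
})

yes_no = {"Yes": 1, "No": 0}
mapped = toy.copy()
for col in ["Attrition", "OverTime"]:
    mapped[col] = toy[col].map(yes_no)

# Count the values per column that the mapping could not translate.
unmapped = mapped.isna().sum()
print(unmapped)
```

A non-zero count for a column means some rows need cleaning before training.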

Step 4: Assign the target (aka outcome) as y and the drivers (features) as X

In [None]:
X = df.drop("Attrition", axis=1)
y = df["Attrition"]

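Step 5: Set up the data for training and testing
Note that X and y are the same as in Step 4: X holds the features and y holds the target. train_test_split divides both into a training portion (X_train, y_train) and a testing portion (X_test, y_test).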

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#Note for Step 5 a: [test_size] is the fraction of the data set aside for testing (to evaluate the model on unseen data). Here 0.2 means 20%.

#Discussion: Why does test_size matter?

#Note for Step 5 b: [random_state] is the seed for the random shuffle applied before the split. The same seed always produces the same split.

#Discussion: Why do we set the seed to 42? Would any other number work just as well?

#Is it for reproducibility? For debugging? For fair comparison?
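To make the role of `random_state` concrete, here is a minimal sketch on made-up data (not the HR dataset). It also shows the optional `stratify` argument, which the notebook does not use but which keeps the class ratio the same in both portions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 20% positive labels.
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([1] * 20 + [0] * 80)

# Same seed -> identical split on every run.
_, test_a, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
_, test_b, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
same_split = np.array_equal(test_a, test_b)

# stratify=y_demo keeps the share of positives equal in train and test.
_, _, _, y_test_strat = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(same_split, y_test_strat.mean())
```

The number 42 itself carries no special meaning; any fixed integer gives a reproducible split.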

Step 6: Set up the decision tree model and train it

In [None]:
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)

#Discussion: What is [max_depth]? Why do we pick 3?
#Discussion: Is increasing max_depth good or bad?
#Discussion: Why do we set [random_state] to 42 here as well?
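One way to explore the `max_depth` question is to compare training and test accuracy at several depths. The sketch below uses synthetic stand-in data from `make_classification`, not the HR dataset, so the numbers it prints say nothing about attrition; only the pattern matters.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the HR dataset.
X_syn, y_syn = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

depth_scores = {}
for depth in [1, 3, 5, None]:  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_tr, y_tr)
    depth_scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))

for depth, (train_acc, test_acc) in depth_scores.items():
    print(f"max_depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}")
```

A growing gap between training and test accuracy as depth increases is the usual sign of overfitting. A shallow tree is also easier to read, which matters for the interpretation step.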

Step 7: Interpreting the output of the model (Explainability / Interpretability)

In [None]:
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
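plot_tree(
    model,
    feature_names=list(X.columns),  # a plain list is accepted across scikit-learn versions
    class_names=["Stay", "Leave"],  # order matches the encoded classes 0 and 1
    filled=True,
    impurity=False,
)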
plt.title("Decision Tree for Employee Attrition")
plt.show()

#Discussion: What's the business interpretation of OverTime ≤ 0.5 being the first split?
#Discussion: If you were HR, what action would you take for employees working overtime and earning < $2,500?
#Discussion: If these are actions that you can take, what is your next thought? Is the model accurate?
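The same tree can also be read as plain if/else rules, which some readers find easier to discuss than the plot. A minimal sketch using scikit-learn's `export_text` on made-up toy rows in which overtime alone decides the label; the toy data is an assumption for illustration, not the HR dataset.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up toy rows in which OverTime alone decides the label.
toy_X = pd.DataFrame({
    "OverTime": [0, 0, 0, 1, 1, 1],
    "MonthlyIncome": [3000, 5000, 4000, 3500, 4500, 5200],
})
toy_y = [0, 0, 0, 1, 1, 1]

toy_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(toy_X, toy_y)

# Each indented line is one condition on the path from the root to a leaf.
rules = export_text(toy_tree, feature_names=list(toy_X.columns))
print(rules)
```

In the notebook, the equivalent call would be `export_text(model, feature_names=list(X.columns))`.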

Step 8: Evaluating the accuracy of the model

In [None]:
y_pred = model.predict(X_test)
print("Accuracy Score:", accuracy_score(y_test, y_pred))

#Discussion: How many predictions are correct?
#Discussion: What is the limitation of this test?
#Discussion: Accuracy is easy to understand, so why is it not always the best metric?
#Discussion: If the model is accurate, can we explain the outcome of the model?
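A quick way to see the limitation of accuracy is to score a "model" that never predicts attrition. The labels below are made up for illustration (84 stay, 16 leave) and are not taken from the HR dataset.

```python
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: 84 employees stay (0), 16 leave (1).
y_true = [0] * 84 + [1] * 16

# A "model" that always predicts Stay.
y_always_stay = [0] * 100

baseline_acc = accuracy_score(y_true, y_always_stay)
# Recall for the Leave class: the share of actual leavers that were caught.
baseline_recall = recall_score(y_true, y_always_stay, zero_division=0)
print(baseline_acc, baseline_recall)
```

The accuracy looks respectable even though not a single leaver is identified, so the tree's accuracy is only meaningful when compared against this kind of baseline.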

Step 9: Using the confusion matrix as a second check

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import confusion_matrix

# The model trained in Step 6 is reused here. Uncomment to retrain it:
#model = DecisionTreeClassifier(max_depth=3, random_state=42)
#model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)

# Fixed background color per cell (by position):
# green = True Stay, blue = False Leave, yellow = False Stay, red = True Leave
cell_colors = np.array([["green", "blue"],
                        ["yellow", "red"]])

# Plot square patches with fixed colors
fig, ax = plt.subplots(figsize=(6, 5))
for i in range(2):
    for j in range(2):
        ax.add_patch(plt.Rectangle((j, i), 1, 1, color=cell_colors[i][j]))

# Overlay numeric confusion matrix values
for i in range(2):
    for j in range(2):
        ax.text(j + 0.5, i + 0.3, cm[i, j], ha='center', va='center',
                fontsize=16, color='black', fontweight='bold')

# Custom descriptive labels
labels = [["True Stay", "False Leave"],
          ["False Stay", "True Leave"]]

for i in range(2):
    for j in range(2):
        ax.text(j + 0.5, i + 0.75, labels[i][j],
                ha='center', va='center', fontsize=10, color='white', fontweight='bold')

# Axis setup
ax.set_xticks([0.5, 1.5])
ax.set_yticks([0.5, 1.5])
ax.set_xticklabels(["Stay", "Leave"])
ax.set_yticklabels(["Stay", "Leave"])
ax.set_xlabel("Predicted")
ax.set_ylabel("Actual")
ax.set_title("Confusion Matrix – Color Coded with Labels")
ax.set_xlim(0, 2)
ax.set_ylim(0, 2)
ax.invert_yaxis()
ax.set_aspect('equal')
plt.grid(False)
plt.tight_layout()
plt.show()

#Discussion: What does this confusion matrix help us validate?
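The four cells of the matrix can be turned into precision and recall for the "Leave" class, which answer different questions than accuracy does. A small sketch on made-up labels (0 = Stay, 1 = Leave), not on the notebook's predictions.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: 0 = Stay, 1 = Leave.
y_actual = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_predicted = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]

# For binary labels, ravel() returns the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_actual, y_predicted).ravel()

precision = tp / (tp + fp)  # of those flagged as Leave, how many really left
recall = tp / (tp + fn)     # of those who really left, how many were flagged
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

For HR, a missed leaver (False Stay) and a false alarm (False Leave) usually carry different costs, which is why the two numbers are worth reading separately.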

Step 10: Identifying the key drivers

In [None]:
#Rank the drivers / levers by the tree's feature importances (higher = used more to separate Stay from Leave)

importance = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print("Feature Importances:\n", importance)

#Discussion: How does understanding these drivers help us formulate better strategies or policies?
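`feature_importances_` reflects how the tree was built, so it can be worth cross-checking with a second measure. The sketch below uses permutation importance, a different technique from the one in the cell above: it shuffles one column at a time and records how much the score drops. The data is synthetic, with one signal column and one noise column, and the scores are computed on the training data only to keep the example short.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: only the first column carries signal.
rng = np.random.default_rng(0)
X_perm = rng.normal(size=(200, 2))
y_perm = (X_perm[:, 0] > 0).astype(int)

perm_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_perm, y_perm)

# Shuffle each column 10 times and average the drop in accuracy.
perm = permutation_importance(perm_tree, X_perm, y_perm, n_repeats=10, random_state=42)
for name, score in zip(["signal", "noise"], perm.importances_mean):
    print(f"{name}: {score:.3f}")
```

In the notebook, one would more likely pass `X_test` and `y_test`, so that the importances describe behavior on unseen data.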

Step 11: Varying the test size and seed to check how stable the model is

In [None]:
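# Prepare experiment
# Note: the loop below overwrites X_train, X_test, y_train, y_test, model and y_pred from the earlier steps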
results = []
test_sizes = [0.1, 0.2, 0.3]
random_states = [1, 42, 99]

# Loop over combinations
for test_size in test_sizes:
    for rand in random_states:
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=rand)
        model = DecisionTreeClassifier(max_depth=3, random_state=rand)
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        acc = accuracy_score(y_test, y_pred)
        fi = model.feature_importances_

        results.append({
            "test_size": test_size,
            "random_state": rand,
            "accuracy": round(acc, 3),
            "Age_importance": round(fi[0], 3),
            "JobSatisfaction_importance": round(fi[1], 3),
            "MonthlyIncome_importance": round(fi[2], 3),
            "DistanceFromHome_importance": round(fi[3], 3),
            "OverTime_importance": round(fi[4], 3)
        })

# Convert to DataFrame
results_df = pd.DataFrame(results)
results_df


#💡 What students learn from this:
# - How model accuracy varies with the test split and the seed
# - Which features consistently matter most
# - Why tuning and reproducibility matter in ML
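A related way to measure how much accuracy depends on the split is k-fold cross-validation, which rotates the test portion through the whole dataset. This is a sketch of an alternative to the loop above, not a replacement for it, and it runs on synthetic stand-in data rather than the HR dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with an imbalanced target, not the HR dataset.
X_cv, y_cv = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.84], random_state=0,
)

# Five folds, each keeping the class ratio of the full dataset.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(
    DecisionTreeClassifier(max_depth=3, random_state=42),
    X_cv, y_cv, cv=folds, scoring="accuracy",
)
print("fold accuracies:", cv_scores.round(3))
print("mean:", round(cv_scores.mean(), 3), "std:", round(cv_scores.std(), 3))
```

A small standard deviation across folds suggests the accuracy figure does not hinge on one lucky split.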

## 🧠 Reflection Questions
- Which features are most influential in predicting attrition?
- How would you act on this insight as an HR leader?
- Would you use this model to automate decisions or guide conversations?
