# 👩‍💻 HR Attrition Prediction with Decision Trees and Random Forests

## 📋 Overview
In this lab, you'll use decision trees and random forests to predict employee attrition based on HR analytics data. You'll build and compare these models, then identify the key factors contributing to attrition. This will show how tree-based algorithms can help human resources understand and improve employee retention.

## 🎯 Learning Outcomes
By the end of this lab, you will be able to:
- Clean and prepare HR attrition data for machine learning.
- Use decision trees and random forests to predict attrition, then evaluate and compare their performance.
- Analyze feature importance to identify the key indicators driving employee attrition, and interpret these insights in an organizational context.
- Explain how tree-based algorithms can be applied to identify risk factors and potentially improve employee outcomes.

## 🚀 Starting Point
Start with the provided starter code, which includes the necessary imports and loads the data:


In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, ConfusionMatrixDisplay

# Load the dataset
df = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")

print("Dataset loaded successfully.")
print("First 5 records:")
df.head()

Dataset loaded successfully.
First 5 records:


| | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | ... | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | ... | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | ... | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 5 | ... | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 7 | ... | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |


## Task 1: Data Exploration
**Context:** Before building any machine learning model, it's essential to understand the dataset's structure and characteristics, especially in HR, where data quality directly affects company outcomes.

**Steps:**

1. Display the first few rows of the dataset using `head()` to understand its structure
2. Examine the statistical summary of the data using `describe()`
3. Check for missing values using the `isnull().sum()` method
4. Explore the balance of the target variable (use `Attrition` as the target)


In [2]:
# Your data exploration code here


**💡 Tip:** Pay special attention to the distribution of the target variable to determine if your dataset is balanced or imbalanced, as this will impact your model evaluation approach.
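
As a rough illustration of the class-balance check, here is a minimal sketch on a made-up ten-row frame. The `toy` data and the `majority_baseline` name are assumptions for illustration only; on the lab data you would run the same calls on `df["Attrition"]`.

```python
import pandas as pd

# Hypothetical toy data standing in for the HR dataset's target column
toy = pd.DataFrame({"Attrition": ["No"] * 8 + ["Yes"] * 2})

# Absolute counts and relative shares of each class
counts = toy["Attrition"].value_counts()
shares = toy["Attrition"].value_counts(normalize=True)

print(counts)
print(shares.round(2))

# Accuracy of a "model" that always predicts the most frequent class.
# Any real model should be judged against this number, not against 0.
majority_baseline = shares.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```

If one class dominates, accuracy alone will look flattering, so it is worth also reading per-class precision and recall later on.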

**⚙️ Test Your Work:**

- Execute your data exploration code and verify you can answer: How many features are in the dataset? Are there any missing values?


## Task 2: Data Cleaning and Preparation
**Context:** HR datasets often contain missing values or require feature engineering. Proper preprocessing ensures your model makes accurate predictions.

**Steps:**

1. Handle any missing values in the dataset (if present)
2. Encode categorical features, if any (e.g., using one-hot encoding with `pandas.get_dummies`)
3. Select relevant features for your model. This step matters: look for features that carry no explanatory power, such as columns with a single constant value or unique identifiers, and drop them.
4. Separate features (X) and target variable (y)
5. Split the data into training and testing sets using `train_test_split()`
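
The steps above can be sketched on a small made-up frame. The column names mimic the HR dataset, but the values, the `toy` frame, and the `constant_cols` helper list are assumptions for illustration, not the lab's required solution.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame; values are made up
toy = pd.DataFrame({
    "Age": [41, 49, 37, 33, 27, 32, 59, 30, 38, 36],
    "Department": ["Sales", "R&D", "R&D", "R&D", "R&D",
                   "Sales", "R&D", "Sales", "R&D", "Sales"],
    "StandardHours": [80] * 10,            # constant column: no information
    "EmployeeNumber": list(range(1, 11)),  # unique identifier: not a real feature
    "Attrition": ["Yes", "No", "Yes", "No", "No", "No", "No", "No", "Yes", "No"],
})

# Flag columns that hold a single unique value
constant_cols = [c for c in toy.columns if toy[c].nunique() == 1]
print("Constant columns:", constant_cols)

# Separate the target (encoded as 0/1) from the features
y = toy["Attrition"].map({"Yes": 1, "No": 0})
X = toy.drop(columns=["Attrition", "EmployeeNumber"] + constant_cols)

# One-hot encode the remaining categorical columns
X = pd.get_dummies(X, drop_first=True)

# stratify=y keeps the class proportions similar in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```

Checking `nunique()` is only one way to spot uninformative columns; identifier columns still need to be recognized by reading the column descriptions.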

In [3]:
# Your data cleaning and preparation code here


**💡 Tip:** For HR data, carefully consider how to handle missing values. Simple imputation with mean values might not always be appropriate depending on the HR context.
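
To make the imputation point concrete, here is a minimal sketch comparing mean and median on a made-up column with one outlier. The `toy` frame and its values are assumptions; the lab dataset may not need any imputation at all.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column and one salary outlier
toy = pd.DataFrame({
    "MonthlyIncome": [2000.0, 2500.0, np.nan, 3000.0, 20000.0],
    "BusinessTravel": ["Travel_Rarely", None, "Travel_Rarely",
                       "Non-Travel", "Travel_Frequently"],
})

print("Mean income:  ", toy["MonthlyIncome"].mean())
print("Median income:", toy["MonthlyIncome"].median())

filled = toy.copy()
# The median is less sensitive to the outlier than the mean
filled["MonthlyIncome"] = filled["MonthlyIncome"].fillna(filled["MonthlyIncome"].median())
# For a categorical column, the most frequent value is one simple choice
filled["BusinessTravel"] = filled["BusinessTravel"].fillna(filled["BusinessTravel"].mode()[0])

print(filled)
```

In a full workflow, the fill values would normally be computed on the training split only and then applied to the test split, so that no information leaks from the test data.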

**⚙️ Test Your Work:**

Check the shape of your X_train, X_test, y_train, and y_test to confirm proper splitting


## Task 3: Decision Tree Model Implementation
**Context:** Decision trees offer an interpretable way to explain attrition, which makes them valuable when understanding the model's decision process is crucial.

**Steps:**

1. Initialize a DecisionTreeClassifier with appropriate parameters to prevent overfitting (consider setting `max_depth`)
2. Fit the model to your training data
3. Make predictions on the test set
4. Calculate the accuracy using `accuracy_score()`
5. Generate a classification report using `classification_report()`
6. Visualize the confusion matrix using `ConfusionMatrixDisplay` (imported from `sklearn.metrics`)


In [4]:
# Your decision tree implementation code here


**💡 Tip:** Try different values of max_depth to find a balance between model complexity and accuracy.
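
One way to explore this trade-off is to compare training and test accuracy across a few depths. The sketch below uses synthetic data and an arbitrary list of depths; both are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

results = {}
for depth in [2, 4, 8, None]:  # None lets the tree grow until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    results[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"max_depth={str(depth):>4}  "
          f"train={results[depth][0]:.3f}  test={results[depth][1]:.3f}")
```

A large gap between training and test accuracy at higher depths is a typical sign of overfitting. Selecting the depth on the test set itself leaks information, so a validation split or cross-validation is the more careful choice.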

**⚙️ Test Your Work:**

Your decision tree should output an accuracy score and display a confusion matrix


## Task 4: Random Forest Model Implementation
**Context:** Random Forest models often provide better predictive performance than single decision trees by reducing overfitting, which is crucial for reliable attrition risk prediction.

**Steps:**

1. Initialize a RandomForestClassifier with an appropriate number of estimators
2. Train the model on your training data
3. Generate predictions on the test set
4. Evaluate model performance using accuracy and a classification report
5. Create a confusion matrix visualization
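
A minimal sketch of these steps on synthetic data is shown below. The data, the variable names, and `n_estimators=100` are assumptions for illustration; which model scores higher will depend on the data and settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=4,
    weights=[0.85, 0.15], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# n_estimators is the number of trees in the forest
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
y_pred_rf = forest.predict(X_test)

rf_acc = accuracy_score(y_test, y_pred_rf)
print(f"Accuracy: {rf_acc:.4f}")
print(classification_report(y_test, y_pred_rf, zero_division=0))

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred_rf)
print(cm)
```

The `cm` array can be passed to `ConfusionMatrixDisplay(confusion_matrix=cm).plot()` to draw it in the same style as the decision tree's matrix.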


In [5]:
# Your random forest implementation code here


**💡 Tip:** With default parameters, random forests typically outperform single decision trees, but you can still experiment with `n_estimators` and `max_depth` to improve performance.

**⚙️ Test Your Work:**

Compare the accuracy of your random forest model with the decision tree model


## Task 5: Feature Importance Analysis
**Context:** Understanding which factors most strongly influence attrition helps the relevant teams act proactively to prevent it.

**Steps:**

1. Extract feature importance scores from your Random Forest model
2. Sort features by their importance
3. Create a bar plot visualizing the importance of each feature
4. Analyze which variables have the highest importance in terms of explaining the target variable.
5. Add a comparison between the two models.
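
The extraction, sorting, and plotting steps can be sketched as follows, again on synthetic data. The `feature_{i}` column names and the choice of the top five features are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical feature names
X_arr, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=3, random_state=1
)
feature_names = [f"feature_{i}" for i in range(8)]
X_demo = pd.DataFrame(X_arr, columns=feature_names)

forest = RandomForestClassifier(n_estimators=100, random_state=1)
forest.fit(X_demo, y_demo)

# Pair each importance score with its column name, then sort descending
importances = pd.Series(
    forest.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
top_n = importances.head(5)
print(top_n)

# Horizontal bars read best with the largest value at the top
top_n.sort_values().plot(kind="barh")
plt.xlabel("Importance")
plt.title("Top 5 Random Forest Feature Importances (synthetic data)")
plt.tight_layout()
plt.show()
```

These impurity-based scores are normalized to sum to 1 and describe what the model relied on, not what causes the outcome, so they are best read as leads for further investigation.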


In [6]:
# Your feature importance analysis code here


**💡 Tip:** Connect the feature importance findings back to real-world HR implications.

**⚙️ Test Your Work:** Your visualization should clearly show which features have the highest importance scores

## ✅ Success Checklist
- Dataset is properly loaded and explored
- Data cleaning steps are implemented appropriately
- Decision Tree model is trained and evaluated
- Random Forest model is trained and compared to the Decision Tree
- Feature importances are extracted and visualized
- All visualizations are properly labeled and interpretable
- Program runs without errors

## 🔍 Common Issues & Solutions
**Problem:** Low model accuracy 
- **Solution:** Try adjusting hyperparameters like max_depth for Decision Trees or n_estimators for Random Forests

**Problem:** Feature importance visualization is unclear 
- **Solution:** Sort features by importance and limit to top N features if you have many variables

## 🔑 Key Points
- Decision trees provide interpretable models but may suffer from overfitting
- Random forests typically offer better performance by averaging multiple decision trees
- Feature importance analysis shows how strongly each feature influences the model's predictions, both for this dataset and for datasets in general. Interpreting it is an iterative process.
- Model evaluation should consider multiple metrics beyond just accuracy, especially for HR applications
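
To see why accuracy alone can mislead, consider a minimal sketch with made-up labels in which a "model" always predicts that the employee stays. The label counts below are hypothetical.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels: 1 = employee left, 0 = employee stayed
y_true = [0] * 84 + [1] * 16
y_naive = [0] * 100  # always predicts "stayed"

acc = accuracy_score(y_true, y_naive)
# zero_division=0 returns 0 instead of warning when nothing is predicted positive
prec = precision_score(y_true, y_naive, zero_division=0)
rec = recall_score(y_true, y_naive, zero_division=0)

print(f"Accuracy:  {acc:.2f}")
print(f"Precision: {prec:.2f}")
print(f"Recall:    {rec:.2f}")
```

The accuracy looks respectable, yet recall for the attrition class is zero: the model never flags a single departing employee, which is exactly the case HR cares about.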

## ➡️ Next Steps
In future labs, you'll explore more advanced ensemble methods and apply these techniques to more complex datasets. You'll also learn how to perform hyperparameter tuning to optimize your models further.


## Exemplar Solution
After completing this activity (or if you get stuck!), take a moment to review the exemplar solution. This sample solution can offer insights into different techniques and approaches. Reflect on what you can learn from the exemplar solution to improve your coding skills. Remember, multiple solutions can exist for some problems; the goal is to learn and grow as a programmer by exploring various approaches. Use the exemplar solution as a learning tool to enhance your understanding and refine your approach to coding challenges.

<details>

<summary><strong>Click HERE to see an exemplar solution</strong></summary>    
    
```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load the dataset
df = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")

print("Dataset loaded successfully.")
print("First 5 records:")
print(df.head())

print("\n--- Dataset Info ---")
print(df.info())

print("\n--- Dataset Description ---")
print(df.describe())

print("\n--- Missing Values ---")
print(df.isnull().sum())

# For this HR dataset, let's look at the target variable distribution
if 'Attrition' in df.columns:
    print("\n--- Attrition Distribution ---")
    print(df['Attrition'].value_counts())
    sns.countplot(x='Attrition', data=df)
    plt.title('Distribution of Attrition')
    plt.show()

# Identify features (X) and target (y)
# For the IBM HR Analytics dataset, 'Attrition' is the target variable.
# We'll drop 'EmployeeCount', 'StandardHours', 'Over18', and 'EmployeeNumber'
# as they are constant or unique identifiers with no predictive power.
if 'Attrition' not in df.columns:
    # Stop early: nothing below can work without the target column
    raise KeyError("'Attrition' column not found. Please ensure the correct dataset is loaded.")

y = df['Attrition']
# errors='ignore' skips any listed column that is not present in the DataFrame
X = df.drop(['Attrition', 'EmployeeCount', 'StandardHours', 'Over18', 'EmployeeNumber'], axis=1, errors='ignore')


# Convert 'Attrition' target to numerical (Yes=1, No=0)
if y.dtype == 'object':
    y = y.map({'Yes': 1, 'No': 0})
    print("Converted 'Attrition' to numerical (Yes=1, No=0).")

# One-hot encode categorical features in X
X = pd.get_dummies(X, drop_first=True)
print("\n--- Features after One-Hot Encoding ---")
print(X.head())
print(f"Number of features after encoding: {X.shape[1]}")

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
# Using stratify=y helps maintain the same proportion of target classes in both train and test sets,
# which is important for imbalanced datasets like this one.

print(f"\nTraining set shape: {X_train.shape}, Test set shape: {X_test.shape}")

print("\n--- Running Decision Tree Classifier ---")
# Initialize and train the Decision Tree Classifier
dt_classifier = DecisionTreeClassifier(random_state=42)
dt_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred_dt = dt_classifier.predict(X_test)

# Evaluate the Decision Tree model
print(f"Accuracy: {accuracy_score(y_test, y_pred_dt):.4f}")
print("Classification Report:")
print(classification_report(y_test, y_pred_dt))

print("\n--- Running Random Forest Classifier ---")
# Initialize and train the Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42) # n_estimators is the number of trees
rf_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred_rf = rf_classifier.predict(X_test)

# Evaluate the Random Forest model
print(f"Accuracy: {accuracy_score(y_test, y_pred_rf):.4f}")
print("Classification Report:")
print(classification_report(y_test, y_pred_rf))

# Feature Importance for Random Forest
print("\n--- Random Forest Feature Importance ---")
feature_importances = pd.Series(rf_classifier.feature_importances_, index=X.columns)
# Get top 10 features
top_10_features = feature_importances.nlargest(10)
print(top_10_features)

plt.figure(figsize=(12, 7))
top_10_features.sort_values(ascending=True).plot(kind='barh')
plt.title('Top 10 Random Forest Feature Importances')
plt.xlabel('Importance')
plt.show()

print("\n--- Model Comparison ---")
dt_accuracy = accuracy_score(y_test, y_pred_dt)
rf_accuracy = accuracy_score(y_test, y_pred_rf)

print(f"Decision Tree Accuracy: {dt_accuracy:.4f}")
print(f"Random Forest Accuracy: {rf_accuracy:.4f}")


print("--Insights--")

if rf_accuracy > dt_accuracy:
    print("\nThe **Random Forest Classifier** performed better than the Decision Tree in terms of accuracy.")
    print("This is often expected, as Random Forest is an ensemble method that combines multiple decision trees, reducing overfitting and improving generalization.")
elif dt_accuracy > rf_accuracy:
    print("\nIn this particular run, the **Decision Tree Classifier** had slightly higher accuracy.")
    print("While less common, this can happen depending on the dataset characteristics or if the Random Forest hyperparameters aren't optimally tuned.")
else:
    print("\nBoth models achieved similar accuracy.")

    
```