# HR Attrition Prediction Project: Full Pipeline

This notebook details the end-to-end process of building and evaluating machine learning models to predict employee attrition. It covers data loading, comprehensive preprocessing and feature engineering, model training, evaluation, and final data export for Tableau visualization.

---

## 1. Environment Setup & Data Loading

This section sets up the necessary libraries and loads the pre-processed dataset.

### 1.1 Import Libraries

We begin by importing all necessary Python libraries for data manipulation, machine learning, visualization, and utility functions. Consolidating imports at the top ensures all dependencies are clear and available throughout the notebook.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, ConfusionMatrixDisplay
from imblearn.over_sampling import SMOTE
import xgboost as xgb
import shap
import warnings # To manage warnings if needed

warnings.filterwarnings('ignore') # Suppress warnings for cleaner output during development

### 1.2 Load Cleaned Data

The `full summary.csv` dataset, which contains employee snapshot data, is loaded into a pandas DataFrame. The `snapshot_date` column is parsed as a datetime (day/month/year format) to enable time-based calculations.

In [2]:
full_data = pd.read_csv('full summary.csv')
full_data['snapshot_date'] = pd.to_datetime(full_data['snapshot_date'], format='%d/%m/%Y')
print(full_data.head())
print(full_data.info())
print(full_data.describe())

   employee_id snapshot_date  age        department business_unit  \
0            1    2021-04-30   35  Customer Success    Commercial   
1            1    2021-05-31   35  Customer Success    Commercial   
2            1    2021-06-30   35  Customer Success    Commercial   
3            1    2021-07-31   35  Customer Success    Commercial   
4            1    2021-08-31   35  Customer Success    Commercial   

        job_title     location  base_salary  bonus_eligible  bonus_pct  ...  \
0  Senior Analyst  Seattle, WA        77073               0        0.0  ...   
1  Senior Analyst  Seattle, WA        77073               0        0.0  ...   
2  Senior Analyst  Seattle, WA        77073               0        0.0  ...   
3  Senior Analyst  Seattle, WA        77073               0        0.0  ...   
4  Senior Analyst  Seattle, WA        77073               0        0.0  ...   

   job_level  last_training_date vacation_leave sick_leave personal_leave  \
0          1           30/3/2021 
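
The day-first `format='%d/%m/%Y'` string matters for values such as the `30/3/2021` visible in `last_training_date`. A minimal sketch, using made-up strings rather than rows from the dataset, of how `pd.to_datetime` handles such values and how `errors='coerce'` turns unparseable entries into `NaT`:

```python
import pandas as pd

# Hypothetical sample values, not taken from the dataset
raw = pd.Series(["30/3/2021", "01/12/2020", "not a date"])

# Day-first parsing; entries that do not match the format become NaT
parsed = pd.to_datetime(raw, format="%d/%m/%Y", errors="coerce")
print(parsed)
```

Counting `parsed.isna().sum()` after such a conversion is a quick way to see how many values failed to parse.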

### 1.3 Initial Data Checks & Preparation

Basic checks are performed to understand the dataset's dimensions and the distribution of the target variable (`target_variable`). The data is then sorted by `employee_id` and `snapshot_date` (descending) to facilitate operations that require the latest employee records.

In [3]:
print(f"Dataset shape: {full_data.shape}")
# 'target_variable' is the label column reported by full_data.info()
print(f"Target variable distribution:\n{full_data['target_variable'].value_counts()}")

# Sort data by employee and date - crucial for later filtering of latest snapshots
full_data_sorted = full_data.sort_values(by=['employee_id', 'snapshot_date'], ascending=[True, False])
print("\nData sorted by employee_id and snapshot_date.")

Dataset shape: (18232, 43)
Target variable distribution:
target_variable
0    17234
1      998
Name: count, dtype: int64

Data sorted by employee_id and snapshot_date.


## 2. Feature Engineering & Preprocessing

This section focuses on transforming and preparing the data for machine learning, including creating a key feature and handling missing values.

### 2.1 Feature Engineering: Employee Tenure
`tenure_months` is calculated to represent the duration an employee has been with the company at each snapshot date. This is a crucial feature for attrition prediction.

In [4]:
# Ensure 'hire_date' is also a datetime object
full_data['hire_date'] = pd.to_datetime(full_data['hire_date'], format='%d/%m/%Y', errors='coerce')

# Days elapsed between each snapshot and the employee's hire date
time_difference_days = (full_data['snapshot_date'] - full_data['hire_date']).dt.days

# Convert days to whole months using the average month length (~30.44 days), truncating.
# The integer cast assumes every hire_date parsed; a NaT would yield NaN and make astype(int) fail.
full_data['tenure_months'] = (time_difference_days / 30.44).astype(int)

print(full_data[['employee_id', 'snapshot_date', 'hire_date', 'tenure_months']].head())

   employee_id snapshot_date  hire_date  tenure_months
0            1    2021-04-30 2020-10-30              5
1            1    2021-05-31 2020-10-30              6
2            1    2021-06-30 2020-10-30              7
3            1    2021-07-31 2020-10-30              9
4            1    2021-08-31 2020-10-30             10
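
Dividing by 30.44 and truncating can skip or repeat a month, as in the output above where tenure moves from 7 to 9 between consecutive month-end snapshots. One alternative, sketched here as an assumption rather than the notebook's method, counts completed calendar months instead. The helper name `completed_months` is hypothetical.

```python
import pandas as pd

def completed_months(snapshot: pd.Series, hire: pd.Series) -> pd.Series:
    """Count whole calendar months between hire date and snapshot date."""
    months = (snapshot.dt.year - hire.dt.year) * 12 + (snapshot.dt.month - hire.dt.month)
    # Subtract one month when the snapshot day falls before the hire day-of-month
    not_yet_complete = (snapshot.dt.day < hire.dt.day).astype(int)
    return months - not_yet_complete

# Same dates as the first five rows printed above
demo = pd.DataFrame({
    "snapshot_date": pd.to_datetime(
        ["2021-04-30", "2021-05-31", "2021-06-30", "2021-07-31", "2021-08-31"]
    ),
    "hire_date": pd.to_datetime(["2020-10-30"] * 5),
})
demo["tenure_months"] = completed_months(demo["snapshot_date"], demo["hire_date"])
print(demo)
```

On these dates the calendar-based count increases by exactly one per monthly snapshot (6 through 10), whereas the day-based approximation yields 5, 6, 7, 9, 10.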


### 2.2 Handling Missing Values
Missing values in key numerical features (`performance_rating`, `engagement_score`) are imputed with their median values to maintain data completeness and prevent errors during model training.

In [5]:
# Impute missing 'performance_rating' and 'engagement_score' with their medians
for col in ['performance_rating', 'engagement_score']:
    if col in full_data.columns and full_data[col].isnull().any():
        median_val = full_data[col].median()
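        # Assign the result back: inplace=True on a column selection is deprecated
        # in recent pandas versions and may not modify the DataFrame
        full_data[col] = full_data[col].fillna(median_val)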
        print(f"Filled missing values in '{col}' with median: {median_val}")
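
The loop above computes each median over the whole dataset, before the train/test split in Section 2.4. A hedged alternative, assuming scikit-learn's `SimpleImputer` is acceptable for this project, learns the medians from training rows only and reuses them on other data. The toy frames below are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical training and test rows with gaps
train = pd.DataFrame({
    "performance_rating": [3.0, np.nan, 4.0, 5.0],
    "engagement_score": [0.2, 0.4, np.nan, 0.8],
})
test = pd.DataFrame({
    "performance_rating": [np.nan, 2.0],
    "engagement_score": [np.nan, 0.5],
})

imputer = SimpleImputer(strategy="median")

# Medians are learned from the training rows only...
train_imputed = pd.DataFrame(
    imputer.fit_transform(train), columns=train.columns, index=train.index
)
# ...and then reused, unchanged, on the test rows
test_imputed = pd.DataFrame(
    imputer.transform(test), columns=test.columns, index=test.index
)
print(test_imputed)
```

The same imputer could be placed ahead of `StandardScaler` in the numeric pipeline so that imputation and scaling are fitted together.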

### 2.3 Defining Features and Target Variable
The feature set (`X`) is created by dropping the target variable and identifier columns. The target variable (`y`) is explicitly set to `target_variable`. Numerical and categorical features are then separated for distinct preprocessing steps.

In [6]:
# Drop the target, the identifier, the raw date columns, and the pre-computed risk score
X = full_data.drop(['target_variable', 'employee_id', 'risk_of_exit_score', 'termination_date', 'snapshot_date', 'hire_date'], axis=1)
y = full_data['target_variable'] # Assign 'target_variable' to y

numerical_features = X.select_dtypes(include=np.number).columns.tolist()
categorical_features = X.select_dtypes(include='object').columns.tolist()

print(f"Numerical Features selected for preprocessing: {numerical_features}")
print(f"Categorical Features selected for preprocessing: {categorical_features}")

Numerical Features selected for preprocessing: ['age', 'base_salary', 'bonus_eligible', 'bonus_pct', 'equity_grant', 'equity_pct', 'veteran_status', 'disability_status', 'fte', 'aihr_certified', 'promotion_count', 'tenure_months', 'performance_rating', 'engagement_score', 'current_salary', 'training_count', 'job_level', 'vacation_leave', 'sick_leave', 'personal_leave', 'parental_leave', 'months_since_last_training', 'ever_terminated_flag']
Categorical Features selected for preprocessing: ['department', 'business_unit', 'job_title', 'location', 'employment_type', 'ethnicity', 'marital_status', 'education_level', 'pay_frequency', 'cost_center', 'exemption_status', 'high_potential_flag', 'succession_plan_status', 'last_training_date']
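
Every column in the categorical list is one-hot encoded later, including `last_training_date`, which holds date strings. A quick check of how many distinct values each non-numeric column has can flag columns that would expand into many dummy columns. The sketch below uses toy data with hypothetical values.

```python
import pandas as pd

# Toy frame with hypothetical values; column names mirror the dataset
toy = pd.DataFrame({
    "department": ["Sales", "Sales", "HR", "HR"],
    "last_training_date": ["30/3/2021", "15/4/2021", "2/5/2021", "9/6/2021"],
    "age": [35, 41, 29, 50],
})

# Distinct values per non-numeric column, largest first
non_numeric = toy.select_dtypes(exclude="number")
cardinality = non_numeric.nunique().sort_values(ascending=False)
print(cardinality)
```

A date-like column with one category per distinct date is usually a candidate for conversion to a numeric feature or for exclusion, depending on the modelling goal.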


### 2.4 Data Splitting: Train and Test Sets
The dataset is divided into training (80%) and testing (20%) sets. Stratified sampling is used to ensure the proportion of the target variable is maintained in both sets, which is essential for imbalanced datasets. `random_state` ensures reproducibility.

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")
print(f"Train target distribution:\n{y_train.value_counts(normalize=True)}")
print(f"Test target distribution:\n{y_test.value_counts(normalize=True)}")

X_train shape: (14585, 37), y_train shape: (14585,)
X_test shape: (3647, 37), y_test shape: (3647,)
Train target distribution:
target_variable
0    0.945286
1    0.054714
Name: proportion, dtype: float64
Test target distribution:
target_variable
0    0.94516
1    0.05484
Name: proportion, dtype: float64
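
To illustrate what `stratify=y` does, the toy example below uses synthetic labels with a 5% positive class (not the HR data) and shows both splits keeping the same positive rate.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 950 negatives and 50 positives (5% positive class)
y_toy = np.array([0] * 950 + [1] * 50)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Positive rate in each split
print(f"train: {y_tr.mean():.3f}, test: {y_te.mean():.3f}")
```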


### 2.5 Preprocessing Pipelines with ColumnTransformer
A `ColumnTransformer` wraps one small `Pipeline` per feature type, so that numerical and categorical features receive different preprocessing:

- **Numerical Features:** Standardized with `StandardScaler` (zero mean, unit variance).

- **Categorical Features:** Converted to numerical format using `OneHotEncoder`; categories not seen during fitting are ignored (`handle_unknown='ignore'`).

The preprocessor is fitted on the training set only and then applied to the test set, which keeps test-set statistics out of the scaling and encoding steps.

In [8]:
numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Diagnostic: Print the columns in X_train before preprocessing
print(f"\nDEBUG (2.5 Start): Columns in X_train before preprocessing: {X_train.columns.tolist()}")
print(f"DEBUG (2.5 Start): Identified numerical_features: {numerical_features}")
print(f"DEBUG (2.5 Start): Identified categorical_features: {categorical_features}")

# Construct the list of transformers.
# Only include transformers if their corresponding feature lists are not empty.
transformers_list = []
if numerical_features:
    transformers_list.append(('num', numeric_transformer, numerical_features))
if categorical_features:
    transformers_list.append(('cat', categorical_transformer, categorical_features))

# If for some reason both lists are empty, ColumnTransformer might rely solely on 'remainder'.
if not transformers_list:
    print("WARNING: No numerical or categorical features found for specific transformations. ColumnTransformer will rely solely on 'remainder'.")

# OneHotEncoder produces sparse output, so ColumnTransformer can return a SciPy sparse matrix.
# Passing that matrix to pd.DataFrame(..., columns=...) fails with
# "ValueError: Shape of passed values is (14585, 1), indices imply (14585, 227)".
# sparse_threshold=0 makes the transformer always return a dense NumPy array.
preprocessor = ColumnTransformer(
    transformers=transformers_list,
    remainder='passthrough',  # Passes through any columns not explicitly handled
    sparse_threshold=0
)

# Fit and transform the training data
X_train_processed = preprocessor.fit_transform(X_train)

print(f"DEBUG (X_train_processed dtype): {X_train_processed.dtype}")

# Transform the test data using the *fitted* preprocessor
X_test_processed = preprocessor.transform(X_test)

print(f"DEBUG (X_test_processed dtype): {X_test_processed.dtype}")

# Shapes of the arrays immediately after transformation
print(f"DEBUG (Post-Transform): Shape of X_train_processed (array): {X_train_processed.shape}")
print(f"DEBUG (Post-Transform): Shape of X_test_processed (array): {X_test_processed.shape}")

# get_feature_names_out() returns every output column name, including columns
# passed through by 'remainder', so the names match the array's column count.
all_feature_names = preprocessor.get_feature_names_out()

print(f"DEBUG (Feature Names): Length of all_feature_names: {len(all_feature_names)}")

# Convert processed arrays back to DataFrames for easier handling and SHAP compatibility
X_train_processed_df = pd.DataFrame(X_train_processed, columns=all_feature_names, index=X_train.index)
X_test_processed_df = pd.DataFrame(X_test_processed, columns=all_feature_names, index=X_test.index)

print(f"Processed X_train shape: {X_train_processed_df.shape}")
print(f"Processed X_test shape: {X_test_processed_df.shape}")


DEBUG (2.5 Start): Columns in X_train before preprocessing: ['age', 'department', 'business_unit', 'job_title', 'location', 'base_salary', 'bonus_eligible', 'bonus_pct', 'equity_grant', 'equity_pct', 'employment_type', 'ethnicity', 'marital_status', 'education_level', 'pay_frequency', 'veteran_status', 'disability_status', 'cost_center', 'fte', 'exemption_status', 'high_potential_flag', 'succession_plan_status', 'aihr_certified', 'promotion_count', 'tenure_months', 'performance_rating', 'engagement_score', 'current_salary', 'training_count', 'job_level', 'last_training_date', 'vacation_leave', 'sick_leave', 'personal_leave', 'parental_leave', 'months_since_last_training', 'ever_terminated_flag']
DEBUG (2.5 Start): Identified numerical_features: ['age', 'base_salary', 'bonus_eligible', 'bonus_pct', 'equity_grant', 'equity_pct', 'veteran_status', 'disability_status', 'fte', 'aihr_certified', 'promotion_count', 'tenure_months', 'performance_rating', 'engagement_score', 'current_salary', 

### 2.6 Addressing Class Imbalance with SMOTE
Attrition datasets are typically highly imbalanced (many more employees stay than leave); here only about 5.5% of snapshots carry a positive label. To prevent models from being biased towards the majority class, we apply `SMOTE` (Synthetic Minority Over-sampling Technique) to the training data only. `SMOTE` generates synthetic samples of the minority class (`target_variable == 1`), balancing the class distribution.

In [None]:
# Oversample the minority class in the training set only; the test set is left untouched
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_processed_df, y_train)

print(f"Original training target distribution:\n{y_train.value_counts()}")
print(f"\nResampled training target distribution:\n{y_train_resampled.value_counts()}")
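
To make "synthetic samples" concrete, the sketch below reproduces only the core interpolation idea behind SMOTE: pick a minority row, pick one of its nearest minority neighbours, and place a new point somewhere on the line between them. It is an illustration under stated assumptions, not imbalanced-learn's implementation. The function name `smote_like_samples` and the random toy data are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_samples(X_minority, n_new, k=5, seed=42):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    k = min(k, len(X_minority) - 1)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    # Querying the fitted points returns each point itself in column 0
    # (assuming no duplicate rows), so neighbours start at column 1
    _, neighbor_idx = nn.kneighbors(X_minority)

    base = rng.integers(0, len(X_minority), size=n_new)   # random minority rows
    neighbor_col = rng.integers(1, k + 1, size=n_new)     # one of their k neighbours
    neighbors = neighbor_idx[base, neighbor_col]
    lam = rng.random((n_new, 1))                          # interpolation weight in [0, 1)

    return X_minority[base] + lam * (X_minority[neighbors] - X_minority[base])

# Hypothetical minority-class rows with two numeric features
X_min = np.random.default_rng(0).normal(size=(10, 2))
synthetic = smote_like_samples(X_min, n_new=20)
print(synthetic.shape)
```

Because each synthetic row lies between two real minority rows, one-hot columns can take fractional values after resampling, which is worth keeping in mind when reading the resampled data.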

## 3. Model Training & Evaluation
This section trains three different machine learning models and evaluates their performance using various classification metrics.

### 3.1 Model Selection Rationale
For this HR attrition prediction project, a strategic choice of machine learning models was made, prioritizing a balance of interpretability, robustness, and high predictive performance typical for tabular binary classification problems.

- **Logistic Regression (Interpretable Baseline):** Chosen as the foundational linear model. Its interpretability, allowing for direct understanding of feature impacts on attrition likelihood, is invaluable for HR stakeholders. It also provides a strong, efficient baseline for comparison against more complex models.
- **Random Forest (Robust Ensemble):** This ensemble method was selected for its ability to capture complex, non-linear relationships and interactions between features without extensive manual engineering. Its ensemble nature significantly reduces the risk of overfitting, making it a robust choice, and it naturally provides insights into feature importance.
- **XGBoost (High-Performance Powerhouse):** As a state-of-the-art gradient boosting framework, XGBoost was included for its exceptional predictive performance and efficiency. It excels at handling large datasets and complex relationships, often delivering industry-leading accuracy in tabular data competitions. Its built-in regularization also helps prevent overfitting.

**Why Other Common Models Were Not Prioritised:**

While many other machine learning algorithms exist, several were not the primary focus for this project due to specific trade-offs relative to the problem's requirements:

- **Individual Decision Trees:** Prone to severe overfitting, which is effectively mitigated by ensemble methods like Random Forest and XGBoost.

- **Support Vector Machines (SVMs):** Can be computationally expensive on larger datasets and sensitive to feature scaling. Interpretability, especially with non-linear kernels, is also a significant challenge for business understanding.

- **K-Nearest Neighbors (KNN):** Computationally intensive for prediction on substantial datasets, sensitive to feature scaling, and can suffer from the "curse of dimensionality" in higher-dimensional feature spaces.

- **Naive Bayes:** Relies on a strong assumption of feature independence, which is rarely true in correlated HR datasets, often leading to suboptimal performance compared to tree-based models.

- **Deep Learning / Neural Networks:** While powerful, they are typically overkill and less interpretable for structured tabular data of this nature compared to tree-based models, often requiring more data and computational resources to achieve comparable results. The priority for HR insights leans towards interpretability, making simpler yet powerful models more suitable.

### 3.2 Logistic Regression
The Logistic Regression model is trained on the resampled training data. Its performance is evaluated using accuracy, precision, recall, F1-score, and ROC AUC, along with a confusion matrix.

In [None]:
# Train Logistic Regression on the resampled training data; evaluate on the untouched test set
log_reg_model = LogisticRegression(random_state=42, solver='liblinear')  # liblinear solver for this binary problem
log_reg_model.fit(X_train_resampled, y_train_resampled)
y_pred_log_reg = log_reg_model.predict(X_test_processed_df)
y_prob_log_reg = log_reg_model.predict_proba(X_test_processed_df)[:, 1]

print(f"\n--- Logistic Regression Performance ---")
print(f"Accuracy: {accuracy_score(y_test, y_pred_log_reg):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_log_reg):.4f}")
print(f"Recall: {recall_score(y_test, y_pred_log_reg):.4f}")
print(f"F1-Score: {f1_score(y_test, y_pred_log_reg):.4f}")
print(f"ROC AUC: {roc_auc_score(y_test, y_prob_log_reg):.4f}")

cm_log_reg = confusion_matrix(y_test, y_pred_log_reg)
disp_log_reg = ConfusionMatrixDisplay(confusion_matrix=cm_log_reg, display_labels=['Stay', 'Terminated'])
disp_log_reg.plot()
plt.title('Confusion Matrix for Logistic Regression')
plt.show()
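
As a worked example of how these metrics relate to the confusion matrix, the sketch below uses ten made-up labels (not model output) and derives precision, recall, and F1 by hand before comparing them with scikit-learn's functions.

```python
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score
)

# Made-up labels: 1 = terminated, 0 = stayed
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 0, 0, 1, 0, 1, 1, 0, 0]

# For binary labels, ravel() yields counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of those flagged as leavers, how many actually left
recall = tp / (tp + fn)      # of those who left, how many were flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```

With a positive class this rare, accuracy alone can look high even when recall is poor, which is why the notebook reports all five metrics.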

### 3.3 Random Forest Classifier
A Random Forest model is trained, leveraging its ensemble nature to achieve robust predictions. Its performance is evaluated using the same set of metrics.

In [None]:
# Train Random Forest on the resampled training data; evaluate on the untouched test set
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train_resampled, y_train_resampled)
y_pred_rf = rf_model.predict(X_test_processed_df)
y_prob_rf = rf_model.predict_proba(X_test_processed_df)[:, 1]

print(f"\n--- Random Forest Performance ---")
print(f"Accuracy: {accuracy_score(y_test, y_pred_rf):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_rf):.4f}")
print(f"Recall: {recall_score(y_test, y_pred_rf):.4f}")
print(f"F1-Score: {f1_score(y_test, y_pred_rf):.4f}")
print(f"ROC AUC: {roc_auc_score(y_test, y_prob_rf):.4f}")

cm_rf = confusion_matrix(y_test, y_pred_rf)
disp_rf = ConfusionMatrixDisplay(confusion_matrix=cm_rf, display_labels=['Stay', 'Terminated'])
disp_rf.plot()
plt.title('Confusion Matrix for Random Forest')
plt.show()

### 3.4 XGBoost Classifier
XGBoost, an efficient gradient boosting algorithm, is trained on the same resampled data and evaluated with the same metrics. Gradient-boosted trees often give a good balance of precision and recall on imbalanced tabular problems, so its precision and recall deserve particular attention.

In [None]:
# Train XGBoost on the resampled training data; evaluate on the untouched test set
xgb_model = xgb.XGBClassifier(random_state=42, eval_metric='logloss')
xgb_model.fit(X_train_resampled, y_train_resampled)
y_pred_xgb = xgb_model.predict(X_test_processed_df)
y_prob_xgb = xgb_model.predict_proba(X_test_processed_df)[:, 1]

print(f"\n--- XGBoost Performance ---")
print(f"Accuracy: {accuracy_score(y_test, y_pred_xgb):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_xgb):.4f}")
print(f"Recall: {recall_score(y_test, y_pred_xgb):.4f}")
print(f"F1-Score: {f1_score(y_test, y_pred_xgb):.4f}")
print(f"ROC AUC: {roc_auc_score(y_test, y_prob_xgb):.4f}")

cm_xgb = confusion_matrix(y_test, y_pred_xgb)
disp_xgb = ConfusionMatrixDisplay(confusion_matrix=cm_xgb, display_labels=['Stay', 'Terminated'])
disp_xgb.plot()
plt.title('Confusion Matrix for XGBoost')
plt.show()

## 4. Model Interpretability with SHAP
Understanding why a model makes certain predictions is crucial for actionable insights. SHAP (SHapley Additive exPlanations) values are used to explain the output of the XGBoost model by showing the contribution of each feature to the prediction.

### 4.1 SHAP Summary Plot
The SHAP summary plot provides a global view of feature importance, indicating which features have the largest impact on the model's output across the entire dataset, and whether their impact is positive (increasing attrition probability) or negative (decreasing attrition probability).

In [None]:
# Explain the XGBoost model's test-set predictions with TreeExplainer
explainer = shap.TreeExplainer(xgb_model)
shap_values = explainer.shap_values(X_test_processed_df)
if isinstance(shap_values, list): # For multi-output models
    shap_values = shap_values[1] # Use SHAP values for the positive class

shap.summary_plot(shap_values, X_test_processed_df, feature_names=all_feature_names, show=False)
plt.title('SHAP Summary Plot for XGBoost Model')
plt.show()
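
For reference, the additivity property that makes these per-feature contributions interpretable can be written as below, in standard SHAP notation. One assumption to verify against the installed SHAP version: with `TreeExplainer`'s default settings, the explained output $f(x)$ for an XGBoost classifier is typically the raw margin (log-odds) rather than a probability.

$$
f(x) = \phi_0 + \sum_{i=1}^{M} \phi_i(x)
$$

Here $\phi_0$ is the expected model output (`explainer.expected_value`), $\phi_i(x)$ is the contribution of feature $i$ for row $x$ (one entry of `shap_values`), and $M$ is the number of columns in `X_test_processed_df`.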

## 5. Exporting Predictions for Tableau
This final section prepares and exports two key datasets for visualization and further analysis in Tableau: `main_for_tableau.csv` (for historical context and model comparison) and `current_snapshot.csv` (for identifying currently at-risk employees).

### 5.1 Generate Predictions for Full Dataset
We apply all trained models (Logistic Regression, Random Forest, XGBoost) to the entire preprocessed dataset to generate predictions and probabilities for every employee snapshot. This comprehensive dataset will be used for historical analysis and dashboarding in Tableau.

In [None]:
# Transform the full feature set with the already-fitted preprocessor so every
# snapshot is encoded the same way as the training data.
# Note: these rows include the training data, so the predictions below are meant
# for dashboarding rather than for measuring model performance.
X_full_processed = preprocessor.transform(X)  # X is the full feature set from Section 2.3
X_full_processed_df = pd.DataFrame(X_full_processed, columns=all_feature_names, index=X.index)


full_data_with_preds = full_data.copy()

full_data_with_preds['logreg_prediction'] = log_reg_model.predict(X_full_processed_df)
full_data_with_preds['logreg_probability'] = log_reg_model.predict_proba(X_full_processed_df)[:, 1]

full_data_with_preds['random_forest_prediction'] = rf_model.predict(X_full_processed_df)
full_data_with_preds['random_forest_probability'] = rf_model.predict_proba(X_full_processed_df)[:, 1]

full_data_with_preds['xgboost_prediction'] = xgb_model.predict(X_full_processed_df)
full_data_with_preds['xgboost_probability'] = xgb_model.predict_proba(X_full_processed_df)[:, 1]

# Save the primary dataset for Tableau
full_data_with_preds.to_csv('main_for_tableau.csv', index=False)
print("Exported 'main_for_tableau.csv' with all historical snapshots and model predictions.")

### 5.2 Filter & Export Current Snapshot for Active Employees
To provide actionable insights for immediate intervention, we filter the dataset to create `current_snapshot.csv`. This file contains only the latest snapshot for each active employee (those not already terminated) whose last snapshot date is recent (e.g., from 2025-01-01 onwards). This ensures that HR focuses on current, relevant attrition risks.

In [None]:
# Keep only the most recent snapshot for each employee
latest_per_employee_data_all = (
    full_data_with_preds
    .sort_values(by=['employee_id', 'snapshot_date'], ascending=[True, False])
    .drop_duplicates(subset=['employee_id'], keep='first')
    .copy()
)

recent_snapshot_threshold = pd.to_datetime('2025-01-01') # Adjust this date as needed
current_snapshot_df = latest_per_employee_data_all[
    (latest_per_employee_data_all['ever_terminated_flag'] == 0) & # Filter for active employees
    (latest_per_employee_data_all['snapshot_date'] >= recent_snapshot_threshold) # Filter for recent snapshots
].copy()

current_snapshot_df.to_csv('current_snapshot.csv', index=False)
print("Exported 'current_snapshot.csv' with the final snapshot for truly current, active employees.")
print(f"Number of employees in current_snapshot.csv after recency filter: {len(current_snapshot_df)}")
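
The filtering logic can be checked on a toy frame. The sketch below uses three hypothetical employees and the same column names to show which rows survive each step: latest snapshot per employee, then active status, then recency.

```python
import pandas as pd

# Hypothetical snapshots for three employees
toy = pd.DataFrame({
    "employee_id": [1, 1, 2, 2, 3],
    "snapshot_date": pd.to_datetime(
        ["2024-12-31", "2025-01-31", "2024-11-30", "2024-12-31", "2025-02-28"]
    ),
    "ever_terminated_flag": [0, 0, 0, 0, 1],
})

# Step 1: newest snapshot first within each employee, then keep the first row
latest = (
    toy.sort_values(["employee_id", "snapshot_date"], ascending=[True, False])
       .drop_duplicates(subset=["employee_id"], keep="first")
)

# Step 2: keep active employees whose latest snapshot is recent enough
threshold = pd.Timestamp("2025-01-01")
current = latest[
    (latest["ever_terminated_flag"] == 0) & (latest["snapshot_date"] >= threshold)
]
print(current)
```

Employee 2 drops out because the latest snapshot predates the threshold, and employee 3 drops out because of the termination flag, leaving only employee 1.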

## Next Steps:

With the `main_for_tableau.csv` and `current_snapshot.csv` files generated, the next phase involves building interactive dashboards in Tableau to visualize these insights for HR stakeholders.