## HR Attrition Predictive Model for a Fintech Startup

This project demonstrates the end-to-end process of building a machine learning model to predict and mitigate employee attrition. Using a synthetic dataset modeled on a high-growth fintech company like Tabby, I identified key drivers of attrition and provided actionable, data-driven recommendations to improve employee retention.

### Problem Statement
Employee turnover is a major challenge for high-growth companies, leading to significant costs in recruitment, training, and lost productivity. The goal of this project was to develop a predictive model that could identify employees at high risk of leaving, allowing the People Team to intervene proactively.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.impute import SimpleImputer
import numpy as np

### Load dataset

In [2]:
df = pd.read_csv('synthetic_hr_data.csv')
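
Before imputing, it can help to check which columns actually contain missing values. The following is a minimal sketch on a tiny hypothetical frame: the column names are borrowed from the dataset, but the frame `toy` and the missing entries are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data; the NaN positions are illustrative only
toy = pd.DataFrame({
    "EngagementScore": [7.0, np.nan, 1.0, 4.0],
    "WorkLifeBalance": [4.0, 3.0, np.nan, np.nan],
    "SalaryUSD": [99638, 114072, 112309, 169724],
})

# Count missing values per column, then keep only the columns that have any
missing_counts = toy.isnull().sum()
cols_with_missing = missing_counts[missing_counts > 0].index.tolist()

print(missing_counts)
print(cols_with_missing)
```

Running the same two lines on `df` would give the list of columns to pass to the imputer, instead of typing the names by hand.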

### 1. Handling Missing Values

In [6]:
# We'll use a median imputer for numerical columns with missing data
imputer = SimpleImputer(strategy='median')
numerical_cols_to_impute = ['PerformanceRating', 'EngagementScore', 'JobSatisfactionScore', 'WorkLifeBalance']
df[numerical_cols_to_impute] = imputer.fit_transform(df[numerical_cols_to_impute])

print("After Missing Value Imputation")
print(df[numerical_cols_to_impute].isnull().sum())
print("\n")

After Missing Value Imputation
PerformanceRating       0
EngagementScore         0
JobSatisfactionScore    0
WorkLifeBalance         0
dtype: int64
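
To make the median strategy concrete, here is a small sketch on a made-up column (the values are not from the dataset). `SimpleImputer` learns the median of the observed values during `fit` and uses it to fill the gaps.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column with one missing value; the numbers are made up
scores = np.array([[1.0], [3.0], [np.nan], [9.0]])

imputer_demo = SimpleImputer(strategy="median")
filled = imputer_demo.fit_transform(scores)

# The median of the observed values (1, 3, 9) is 3, so the NaN is replaced by 3
print(imputer_demo.statistics_)
print(filled.ravel())
```

The median is less sensitive to extreme values than the mean, which is one common reason to prefer it for rating-style columns.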




### 2. Feature Engineering

In [7]:
# Create a new feature: Salary Per Tenure
df['SalaryPerTenure'] = df['SalaryUSD'] / (df['TenureMonths'] + 1) # Add 1 to avoid division by zero
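
As a worked example of the formula, the sketch below recomputes the feature for the first two rows of the `df.head()` output at the end of this notebook, plus one hypothetical new hire with zero months of tenure (that third row is an assumption added to show the `+ 1` guard).

```python
import pandas as pd

# First two rows come from the df.head() output; the third row is hypothetical
demo = pd.DataFrame({
    "SalaryUSD": [99638, 114072, 90000],
    "TenureMonths": [99, 3, 0],
})

# Same formula as above: salary divided by (tenure in months + 1)
demo["SalaryPerTenure"] = demo["SalaryUSD"] / (demo["TenureMonths"] + 1)
print(demo)
```

Note that the ratio is dominated by tenure: the employee with 3 months of tenure scores far higher than the one with 99 months, even though their salaries are similar.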

### 3. Encoding Categorical Variables

In [8]:
# We need to drop EmployeeID as it's not a useful feature for the model.
df = df.drop('EmployeeID', axis=1)

In [9]:
# Manually encode binary 'OverTime' column
df['OverTime'] = df['OverTime'].map({'Yes': 1, 'No': 0})

In [10]:
# Separate features (X) and target (y)
X = df.drop('Attrition', axis=1)
y = df['Attrition']

In [11]:
# Use LabelEncoder for the binary target variable 'Attrition'.
# Classes are sorted alphabetically, so 'No' becomes 0 and 'Yes' becomes 1.
le = LabelEncoder()
y = le.fit_transform(y)

In [12]:
# Get list of nominal categorical columns to one-hot encode
nominal_cols = X.select_dtypes(include=['object']).columns

In [13]:
# One-Hot Encoding using pandas get_dummies()
X_encoded = pd.get_dummies(X, columns=nominal_cols, drop_first=True)
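
The effect of `drop_first=True` is easiest to see on a small example. This sketch uses a hypothetical three-row frame with department names taken from the dataset; the column list is passed explicitly here to keep the example short.

```python
import pandas as pd

# Hypothetical three-row frame; department names are borrowed from the dataset
toy_X = pd.DataFrame({
    "Age": [50, 36, 29],
    "Department": ["People", "Product", "Growth"],
})

# Categories are ordered alphabetically (Growth, People, Product) and the
# first one is dropped, so 'Growth' becomes the implicit baseline
encoded = pd.get_dummies(toy_X, columns=["Department"], drop_first=True)
print(encoded.columns.tolist())
```

A row with zeros in both `Department_People` and `Department_Product` is therefore a Growth employee. Dropping one level avoids perfectly collinear dummy columns, which matters for linear models.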

### 4. Feature Scaling

In [14]:
# Standardize the numeric features to a mean of 0 and a standard deviation of 1.
# This matters for scale-sensitive models like Logistic Regression.
scaler = StandardScaler()
# Take the numeric column names from X (before encoding) so the one-hot
# columns are excluded from scaling regardless of the dtype get_dummies returns
numerical_features = X.select_dtypes(include=np.number).columns
X_encoded[numerical_features] = scaler.fit_transform(X_encoded[numerical_features])
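
Which columns `select_dtypes(include=np.number)` returns depends on dtype: boolean columns are not matched, while small-integer columns such as `uint8` are. The sketch below shows this on a hypothetical frame (the column names are made up) and also what standardization does to a single column.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical frame with one float column and the same dummy stored two ways
toy = pd.DataFrame({
    "Age": [30.0, 40.0, 50.0],
    "Dept_Product_bool": [True, False, True],
    "Dept_Product_int": np.array([1, 0, 1], dtype="uint8"),
})

# bool is not treated as numeric by select_dtypes; uint8 is
numeric_cols = toy.select_dtypes(include=np.number).columns.tolist()
print(numeric_cols)

# Standardizing 'Age' centers it on 0 with unit (population) standard deviation
scaled_age = StandardScaler().fit_transform(toy[["Age"]])
print(scaled_age.ravel())
```

Printing `X_encoded.dtypes` before scaling is a quick way to confirm which columns will be picked up in a given environment.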

### 5. Splitting the Data

In [17]:
# Split the data into 80% training and 20% testing sets
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42, stratify=y)

print("Data Splitting & Final Shapes")
print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_test: {y_test.shape}")

Data Splitting & Final Shapes
Shape of X_train: (2000, 49)
Shape of X_test: (500, 49)
Shape of y_train: (2000,)
Shape of y_test: (500,)
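
In the cells above, the imputer and scaler are fitted on the full dataset before the split, so statistics from the test rows influence the training features. A common alternative is to split first and fit the preprocessing on the training set only. The sketch below shows one way to do that with scikit-learn's `Pipeline` and `ColumnTransformer`. It runs on a small random dataset: the column names mirror the notebook, but the values, the injected missing entries, and the names `toy`, `target`, and `preprocess` are assumptions for illustration, not the project's actual pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Small made-up dataset; values are random, not taken from the real data
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "EngagementScore": rng.integers(1, 11, n).astype(float),
    "SalaryUSD": rng.integers(60_000, 180_000, n).astype(float),
    "Department": rng.choice(["People", "Product", "Growth"], n),
})
toy.loc[::10, "EngagementScore"] = np.nan  # inject some missing values
target = rng.choice([0, 1], n, p=[0.8, 0.2])

# Split first, so the test rows never influence the fitted statistics
X_tr, X_te, y_tr, y_te = train_test_split(
    toy, target, test_size=0.2, random_state=42, stratify=target
)

numeric = ["EngagementScore", "SalaryUSD"]
categorical = ["Department"]
preprocess = ColumnTransformer([
    # Numeric columns: median imputation followed by standardization
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric),
    # Categorical columns: one-hot encoding with the first level dropped
    ("cat", OneHotEncoder(drop="first"), categorical),
])

# Fit on the training set only, then apply the same transformation to the test set
X_tr_prepared = preprocess.fit_transform(X_tr)
X_te_prepared = preprocess.transform(X_te)
print(X_tr_prepared.shape, X_te_prepared.shape)
```

The same `preprocess` object can later be placed in front of a classifier inside a single `Pipeline`, so that cross-validation refits the preprocessing within each fold.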


In [18]:
df.head()

| | Age | TenureMonths | PerformanceRating | EngagementScore | SalaryUSD | PromotionsLastYear | YearsSinceLastPromotion | WorkLifeBalance | OverTime | DistanceFromHome | HasStockOptions | JobSatisfactionScore | CultureAlignmentScore | Department | JobRole | Attrition | SalaryPerTenure |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50 | 99 | 3.0 | 7.0 | 99638 | 0 | 3 | 4.0 | 0 | 22 | 0 | 9.0 | 4 | People | People Partner | Yes | 996.38 |
| 1 | 36 | 3 | 3.0 | 3.0 | 114072 | 0 | 0 | 3.0 | 1 | 50 | 0 | 2.0 | 1 | Product | Product Manager | No | 28518.0 |
| 2 | 29 | 22 | 3.0 | 1.0 | 112309 | 0 | 0 | 4.0 | 1 | 48 | 0 | 9.0 | 4 | Growth | Growth Analyst | No | 4883.0 |
| 3 | 42 | 4 | 4.0 | 4.0 | 169724 | 0 | 0 | 2.0 | 1 | 53 | 1 | 8.0 | 8 | Growth | Growth Analyst | No | 33944.8 |
| 4 | 40 | 20 | 3.0 | 6.0 | 172843 | 0 | 0 | 2.0 | 0 | 59 | 0 | 9.0 | 8 | Finance | Controller | No | 8230.619048 |
