# RetainAI: Employee Retention Intelligence


## Background

Employee attrition is a critical challenge for organizations, as losing valuable talent impacts productivity, team morale, and recruitment costs. This project aims to develop a predictive model to identify employees at risk of leaving the company, enabling HR teams to take proactive retention measures. By analyzing various employee-related factors, such as job satisfaction, performance metrics, work-life balance, and compensation, the model will predict whether an employee is likely to leave the organization or stay.

The objective of the model is to accurately classify employee attrition as a binary classification problem. This is a supervised machine learning task, where the model is trained using labeled historical data containing various employee attributes and their attrition status. The project focuses on building a reliable, explainable solution that can assist HR departments in identifying high-risk employees and addressing potential concerns before attrition occurs, ultimately improving organizational stability and employee satisfaction.

The employee attrition prediction model will be evaluated using both business use cases and model performance metrics. From a business perspective, the model’s success will be determined by its ability to accurately identify employees who are likely to leave, enabling HR teams to take timely actions. The primary evaluation metrics will include accuracy, precision, recall, and the F1-score, with a focus on recall to ensure that the model minimizes false negatives (employees who are at risk but not flagged by the model).

The data source is the [Employee Attrition Dataset](https://www.kaggle.com/datasets/stealthtechnologies/employee-attrition-dataset/data?select=train.csv) from Kaggle, which includes a variety of features such as demographic details, job roles, performance metrics, and compensation. The dataset requires data preparation steps such as handling missing values, encoding categorical variables, and normalizing numerical features. Exploratory data analysis will involve identifying patterns and correlations in the data to understand the key drivers of attrition. Based on the data, it is hypothesized that features like job satisfaction, monthly income, years at the company, work-life balance, and overtime status will be significant predictors of attrition. A random forest classifier will be used initially due to its robustness in handling both categorical and numerical data. Additionally, other models, such as logistic regression, gradient boosting machines, support vector machines, and neural networks, will be explored to compare performance.

In [6]:
# Data Analysis
import kagglehub
import pandas as pd

## Exploratory Data Analysis

### Getting Started with Our Data

In [7]:
# Download latest version
employee_dataset = kagglehub.dataset_download("stealthtechnologies/employee-attrition-dataset")

print("Path to dataset files:", employee_dataset)

Path to dataset files: /root/.cache/kagglehub/datasets/stealthtechnologies/employee-attrition-dataset/versions/2


In [13]:
TRAIN_PATH, TEST_PATH = f"..{employee_dataset }/test.csv", f"..{employee_dataset }/train.csv"


train_df, test_df = pd.read_csv(TRAIN_PATH), pd.read_csv(TEST_PATH)

train_df.head()

Unnamed: 0,Employee ID,Age,Gender,Years at Company,Job Role,Monthly Income,Work-Life Balance,Job Satisfaction,Performance Rating,Number of Promotions,...,Number of Dependents,Job Level,Company Size,Company Tenure,Remote Work,Leadership Opportunities,Innovation Opportunities,Company Reputation,Employee Recognition,Attrition
0,52685,36,Male,13,Healthcare,8029,Excellent,High,Average,1,...,1,Mid,Large,22,No,No,No,Poor,Medium,Stayed
1,30585,35,Male,7,Education,4563,Good,High,Average,1,...,4,Entry,Medium,27,No,No,No,Good,High,Left
2,54656,50,Male,7,Education,5583,Fair,High,Average,3,...,2,Senior,Medium,76,No,No,Yes,Good,Low,Stayed
3,33442,58,Male,44,Media,5525,Fair,Very High,High,0,...,4,Entry,Medium,96,No,No,No,Poor,Low,Left
4,15667,39,Male,24,Education,4604,Good,High,Average,0,...,6,Mid,Large,45,Yes,No,No,Good,High,Stayed


## Data Preparation

## Modeling