# **HR Attrition Project**

# **Introduction and Methodology**

The objective of this capstone project is to synthesize and apply the diverse skills and knowledge acquired throughout the coursework by developing an end-to-end data science solution focused on predicting employee attrition. The project integrates data exploration, feature engineering, machine learning, evaluation, and deployment, while emphasizing clear communication of results for business stakeholders.

This is the first notebook of this repository. The structure implements a full pipeline from business problem framing to deployment: data ingestion and cleaning, exploratory data analysis, feature selection, model training (Logistic Regression, Decision Tree, Random Forest, XGBoost, Neural Networks), threshold optimization with business-focused recall targets, and a Gradio app (via app.py and space.yaml) for serving predictions.


Group 10:
- Filipe Brandão Carmo 20240828
- João Silva 20241655
- Rita Marques 20242019
- Sara Henriques 20242070



# **The Business Problem**

High turnover rates are costly and disruptive, making it essential for HR to anticipate which employees are likely to leave. We are tasked with assisting a multinational consultancy firm in predicting employee attrition.

The primary goal is to predict whether an employee will leave the company, based on the data provided. Secondary goals include identifying key factors influencing attrition and recommending strategies to retain valuable employees.

The HR department and executive management will be the primary consumers of our insights, and they expect actionable recommendations based on our analysis.

# **Data Collection and Initial Processing**

The main goal of this notebook is to load the dataset provided by our client, which consists of a single .csv file.
This dataset contains various attributes related to employee demographics, job satisfaction, work experience, and compensation.

### **1. Import libraries**

The following code cell imports the libraries used throughout the project.

In [14]:
import numpy as np ## pip install numpy==2.1 #need this to run ydata-profiling
import pandas as pd
import seaborn as sns
from scipy.stats import chi2_contingency # filter method
from sklearn.preprocessing import MinMaxScaler
from ydata_profiling import ProfileReport 
import matplotlib.pyplot as plt
from pathlib import Path



import sys
!{sys.executable} -m pip install -U ydata-profiling[notebook]
%pip install jupyter-contrib-nbextensions
!jupyter nbextension enable --py widgetsnbextension

Note: you may need to restart the kernel to use updated packages.


Jupyter command `jupyter-nbextension` not found.


## **2. Import dataset**

The data for the project is available in this repository:
- Path: EDSB25_10\EDSB25_10\data\raw\
- Filename: HR_Attrition_Dataset.csv

In [15]:
#dataset available at: \EDSB25_10\EDSB25_10\data\raw\HR_Attrition_Dataset.csv
project_root = Path.cwd().parent

data_path = project_root / "data" / "raw" / "HR_Attrition_Dataset.csv"

HR = pd.read_csv(data_path)

#### Result DataFrame: HR
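The loading step above assumes the CSV is present at the expected location. A minimal sketch of a loader that fails early with a readable message is shown below; the helper name `load_hr_data` and the tiny stand-in file are hypothetical and only serve to make the example self-contained.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_hr_data(csv_path: Path) -> pd.DataFrame:
    """Read the raw CSV, raising a clear error if the file is missing."""
    if not csv_path.is_file():
        raise FileNotFoundError(f"Dataset not found at {csv_path.resolve()}")
    return pd.read_csv(csv_path)


# Demonstration on a tiny stand-in file (hypothetical values, not the client data).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    demo_path.write_text("EmployeeNumber,Age,Attrition\n1,41,Yes\n2,49,No\n")
    demo = load_hr_data(demo_path)

print(demo.shape)
```

In the notebook, the same helper could be called with `data_path` in place of the stand-in file.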

## **3. Data Description**

The dataset consists of 1,470 rows and 35 columns. The following commands indicate that there are no missing values.
Regarding data types, the columns include integers, floats and objects. Cross-checking the data types against the column descriptions shows that they are consistent.

In [16]:
print(f"The table has the following format: {HR.shape}")
HR.info(max_cols=35)  # info() prints its summary directly and returns None, so no print() wrapper is needed

The table has the following format: (1470, 35)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   int64 
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement            1470
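One way to cross-check the data types against the column descriptions is to list the distinct levels of every non-numeric column. The sketch below does this on a small hypothetical frame (`toy` and its values are illustrative, not taken from the client data).

```python
import pandas as pd

# Hypothetical stand-in for the HR DataFrame.
toy = pd.DataFrame(
    {
        "Age": [41, 49, 37],
        "Attrition": ["Yes", "No", "Yes"],
        "BusinessTravel": ["Travel_Rarely", "Travel_Frequently", "Travel_Rarely"],
    }
)

# Keep only the non-numeric columns and collect their distinct values.
categorical = toy.select_dtypes(exclude="number")
levels = {col: sorted(categorical[col].unique()) for col in categorical.columns}

for col, values in levels.items():
    print(f"{col}: {values}")
```

Applied to `HR`, the same loop would show at a glance whether each object column holds the categories its description promises.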

## **4. Explore data with ydata-profiling**

This section presents a data profile report about the different variables across the dataset.

In [17]:
#profile = ProfileReport(HR, title="Profiling Report") 
#profile

The dataset contains variables that can be divided into 4 main categories:
1. **Employee demographics** - Age, Distance from Home, Gender, Marital Status
2. **Job characteristics** - Department, Business Travel (frequency), Daily Rate, Hourly Rate, Monthly Rate, Environment Satisfaction (on a scale from 1 to 4, where 1 means low), Job Involvement, Job Level, Job Role, Job Satisfaction, Overtime, Relationship Satisfaction at work, Training Times Last Year, Work-Life Balance
3. **Work experience** - Education, Education Field, Number of Companies Worked, Total Working Years, Years at the Company, Years in the Current Role, Years Since Last Promotion, Years with the Current Manager
4. **Compensation** - Monthly Income, Percent Salary Hike, Performance Rating, Stock Option Level

Exploring the profiling report quickly shows that some features carry no useful information for the study because they are constant. It also confirms that there are no missing values.
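Constant features can also be found without the profiling report by counting distinct values per column. The following is a minimal sketch on a hypothetical frame; the helper name `constant_columns` is an assumption introduced for illustration.

```python
import pandas as pd


def constant_columns(df: pd.DataFrame) -> list:
    """Return the names of columns that hold a single distinct value."""
    return [col for col in df.columns if df[col].nunique(dropna=False) <= 1]


# Hypothetical stand-in for the HR DataFrame.
toy = pd.DataFrame(
    {
        "Age": [41, 49, 37],
        "EmployeeCount": [1, 1, 1],
        "StandardHours": [80, 80, 80],
        "Over18": ["Y", "Y", "Y"],
    }
)

print(constant_columns(toy))
```

Passing `dropna=False` counts missing values as a level of their own, so a column mixing one value with gaps is not reported as constant.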

## **5. Preprocessing for Exploratory Data Analysis**

### **5.1. Set index**

The previous analysis showed that EmployeeNumber can serve as the index. Since its values are unique, it is used for that purpose.

In [18]:
unique_values = HR['EmployeeNumber'].is_unique
print(unique_values)

True


In [19]:
# Added to 
#pd.set_option('display.max_columns', None)
#HR.head(5)

In [20]:
# set EmployeeNumber as the index, then show the first 5 rows of the dataset
HR.set_index('EmployeeNumber', inplace = True)
HR.head(5)

| EmployeeNumber | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EnvironmentSatisfaction | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 2 | ... | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 2 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 3 | ... | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 4 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | ... | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 5 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 4 | ... | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 7 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 1 | ... | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |
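As an alternative sketch, the uniqueness check and the index assignment can be combined in one call: `set_index` accepts `verify_integrity=True` and raises a `ValueError` when the new index contains duplicates. The frames below are hypothetical and only illustrate the behaviour.

```python
import pandas as pd

# Hypothetical frame with unique employee numbers.
toy = pd.DataFrame({"EmployeeNumber": [1, 2, 4], "Age": [41, 49, 37]})
indexed = toy.set_index("EmployeeNumber", verify_integrity=True)

# Hypothetical frame where the same employee number appears twice.
clashing = pd.DataFrame({"EmployeeNumber": [1, 1], "Age": [41, 49]})
try:
    clashing.set_index("EmployeeNumber", verify_integrity=True)
    duplicate_detected = False
except ValueError:
    duplicate_detected = True

print(indexed.index.name, duplicate_detected)
```

This variant returns a new DataFrame instead of modifying the original in place, which keeps the cell safe to re-run.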


### **5.1. Check duplicates**

The command below looks for duplicated rows; none were found.

In [21]:
HR[HR.duplicated()]

| EmployeeNumber | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EnvironmentSatisfaction | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
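Note that `DataFrame.duplicated()` compares column values only and ignores the index, so once EmployeeNumber is the index the check asks whether two employees share identical attribute values. A small sketch on hypothetical data, which also shows how to count duplicates and list every copy:

```python
import pandas as pd

# Hypothetical frame: employees 1 and 4 have identical attribute values.
toy = pd.DataFrame(
    {"Age": [41, 49, 41], "Attrition": ["Yes", "No", "Yes"]},
    index=pd.Index([1, 2, 4], name="EmployeeNumber"),
)

# Number of rows that repeat an earlier row.
n_duplicates = int(toy.duplicated().sum())

# keep=False flags every member of a duplicated group, not only the repeats.
all_copies = toy[toy.duplicated(keep=False)]

print(n_duplicates)
print(all_copies.index.tolist())
```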


### **5.2. Check missing values**

As indicated before, there are no missing values in the features provided; the following command verifies this explicitly.

In [22]:
HR.isna().sum()

Age                         0
Attrition                   0
BusinessTravel              0
DailyRate                   0
Department                  0
DistanceFromHome            0
Education                   0
EducationField              0
EmployeeCount               0
EnvironmentSatisfaction     0
Gender                      0
HourlyRate                  0
JobInvolvement              0
JobLevel                    0
JobRole                     0
JobSatisfaction             0
MaritalStatus               0
MonthlyIncome               0
MonthlyRate                 0
NumCompaniesWorked          0
Over18                      0
OverTime                    0
PercentSalaryHike           0
PerformanceRating           0
RelationshipSatisfaction    0
StandardHours               0
StockOptionLevel            0
TotalWorkingYears           0
TrainingTimesLastYear       0
WorkLifeBalance             0
YearsAtCompany              0
YearsInCurrentRole          0
YearsSinceLastPromotion     0
YearsWithC
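With 30+ columns, a long list of zeros is easy to misread. A possible sketch, shown on hypothetical data with deliberately injected gaps, reduces the check to a single total and lists only the columns that actually have missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing entries.
toy = pd.DataFrame(
    {
        "Age": [41, np.nan, 37],
        "Attrition": ["Yes", "No", None],
        "DailyRate": [1102, 279, 1373],
    }
)

missing_per_column = toy.isna().sum()
columns_with_gaps = missing_per_column[missing_per_column > 0]
total_missing = int(missing_per_column.sum())

print(total_missing)
print(columns_with_gaps.to_dict())
```

For a dataset expected to be complete, the total could be wrapped in an `assert total_missing == 0` so the notebook stops if a future data refresh introduces gaps.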

### **5.3. Remove uninformative features**

The analysis in section 4 of this notebook found constant features (EmployeeCount, StandardHours and Over18). Because they add no value, they are removed before moving on.

In [23]:
cleaned_for_eda = HR.drop(columns=['EmployeeCount', 'StandardHours', 'Over18'])

## **6. Export the dataset after initial preprocessing**

The processed data is saved as a CSV file in the data\processed directory, which is created under the current working directory (in the run below, the notebooks folder).

In [24]:
# 1. Get the current working directory (e.g., C:\...\EDSB25_10\EDSB25_10\notebooks)
current_dir = Path.cwd()

# 2. Build the output directory path relative to the current working directory
processed_dir = current_dir / "data" / "processed"

# 3. Create the directory structure if it doesn't exist
processed_dir.mkdir(parents=True, exist_ok=True)

# Export to CSV
output_path = processed_dir / "cleaned_for_eda.csv"
cleaned_for_eda.to_csv(output_path, index=True)

print(f"Saved cleaned_for_eda to: {output_path}")



Saved cleaned_for_eda to: c:\Projects\EDSB25_10\EDSB25_10\notebooks\data\processed\cleaned_for_eda.csv


#### Result DataFrame: cleaned_for_eda
#### Result File: cleaned_for_eda.csv
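Because the export writes the index as a regular column, the file must be read back with `index_col` to restore EmployeeNumber as the index. The round trip can be sketched as follows, using a hypothetical frame and a temporary directory rather than the project folders:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for cleaned_for_eda.
toy = pd.DataFrame(
    {"Age": [41, 49], "Attrition": ["Yes", "No"]},
    index=pd.Index([1, 2], name="EmployeeNumber"),
)

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "processed" / "cleaned_for_eda.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    toy.to_csv(out_path, index=True)
    # Restore the index when reading the file back.
    reloaded = pd.read_csv(out_path, index_col="EmployeeNumber")

print(reloaded.index.name, reloaded.shape)
```

Without `index_col`, EmployeeNumber would come back as an ordinary feature column and could leak into later modelling steps.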

The preprocessing steps are available in preprocessing_for_EDA.py.