# HR Dataset Analysis Project

## Context
The transition from higher education to employment is a critical phase for graduates. Institutions in Singapore, such as universities and specialized colleges, produce a diverse pool of talent each year. However, employment outcomes (including employment rates and salaries) vary significantly across fields of study, universities, and individual demographic factors. Recent news has reported retrenchments at many companies, especially in the tech sector. Meanwhile, a growing number of graduates are not finding jobs, as reported by The Straits Times. Although our chosen dataset is not a local one, understanding these trends is essential for enhancing educational programs, supporting graduates, and aligning their skills with market demands. We chose this dataset for its extensive records and diverse predictors, which let us examine as many factors as possible that may help those seeking employment.

## Problem Statement
What factors significantly influence graduate employment outcomes amid a more competitive job market?

## Objective
To address this gap, we aim to use predictive analytics and machine learning techniques to analyze the factors influencing graduate employment outcomes. This project seeks to identify key trends and predictors that can be used to forecast the following:
1. **Attrition**: Whether an employee has left the company, regardless of cause (e.g. retrenchment or resignation).
2. **MonthlyIncome**: The monthly income of graduates.

In this notebook, we will:
- **Load and inspect** the HR dataset.
- **Clean and prepare** the data (including type conversion and handling duplicates).
- **Detect outliers** in numerical features.
- **Engineer new features** (for example, creating tenure buckets).
- **Perform Exploratory Data Analysis (EDA)** including univariate, categorical, and bivariate analyses.
- **Save the cleaned data** for further modeling if needed.

The dataset includes features like Age, Attrition, BusinessTravel, DailyRate, Department, DistanceFromHome, and many more.


## Data Loading & Initial Inspection

We start by importing the necessary libraries and loading the dataset from a CSV file. Then we inspect the first few rows and check basic information.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set plotting style and default figure size
sns.set(style="whitegrid")
plt.rcParams["figure.figsize"] = (12, 8)

# Load the dataset (ensure 'data.csv' is in your working directory)
df = pd.read_csv("data.csv")

# Display the first few rows
print("Head of the DataFrame:")
df.head()


Head of the DataFrame:


Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2
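
The `Unnamed: 0` column in the output above typically appears when a CSV was written together with its index and then read back without declaring it. A minimal sketch of one way to avoid it, using a small in-memory CSV as a stand-in for `data.csv` (the sample rows and the `csv_text` name are made up for illustration):

```python
import io

import pandas as pd

# Hypothetical stand-in for data.csv: the first, unnamed column is a saved index
csv_text = ",Age,Attrition\n0,41,Yes\n1,49,No\n"

# Default read: the unnamed index column becomes a regular column "Unnamed: 0"
raw = pd.read_csv(io.StringIO(csv_text))

# Declaring the first column as the index keeps it out of the feature set
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(clean.columns))
```

Whether the real file was produced this way is an assumption; dropping the column after loading with `df.drop(columns="Unnamed: 0")` would be an equivalent alternative.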


## Data Inspection

We examine the dataset’s structure, check data types, and look for missing values.


In [5]:
# DataFrame basic information
print("\nDataFrame Info:")
print(df.info())

# Summary statistics for numerical features
print("\nSummary Statistics (numerical features):")
print(df.describe())

# Check for missing values in each column
print("\nMissing values by column:")
print(df.isnull().sum())



DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   int64 
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement            1470 non-null   int64 
 14  JobLeve
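
In the first rows shown earlier, `EmployeeCount` and `StandardHours` hold the same value in every row. Whether that holds for the full dataset is an assumption worth checking, since a column with a single distinct value adds nothing to a model. A minimal sketch on a toy frame (column names mirror the dataset, values are made up):

```python
import pandas as pd

# Toy frame; values are illustrative only
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "EmployeeCount": [1, 1, 1],
    "StandardHours": [80, 80, 80],
})

# Count distinct values per column and keep the columns that have exactly one
n_unique = toy.nunique()
constant_cols = n_unique[n_unique == 1].index.tolist()

print(constant_cols)
```

Running the same two lines on `df` would list any constant columns in the real data.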

## Removing Duplicates

If there are any duplicate rows, we remove them to ensure data quality.


In [None]:
df.drop_duplicates(inplace=True)
print("Shape after removing duplicates:", df.shape)


Shape after removing duplicates: (1470, 36)
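
Before dropping, it can help to count how many duplicates exist. Because the dataset carries an identifier column (`EmployeeNumber`), fully identical rows may be rare even when an ID is repeated. The sketch below, on made-up rows, compares the two checks:

```python
import pandas as pd

# Toy frame: one fully duplicated row, plus one repeated ID with a different Age
toy = pd.DataFrame({
    "EmployeeNumber": [1, 2, 2, 4, 4],
    "Age": [41, 49, 49, 33, 34],
})

# Rows identical across every column (what drop_duplicates() removes by default)
n_full = int(toy.duplicated().sum())

# Rows that repeat an ID, regardless of the other columns
n_by_id = int(toy.duplicated(subset=["EmployeeNumber"]).sum())

deduped = toy.drop_duplicates()
print(n_full, n_by_id, deduped.shape)
```

Which definition of "duplicate" is appropriate depends on the data; the notebook's cell uses the full-row definition.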


## Type Conversion

We convert the string-valued columns to the pandas `category` dtype.


In [8]:
# List of columns you want to treat as categories
cat_cols = ["Attrition", "BusinessTravel", "Department", 
            "EducationField", "Gender", "MaritalStatus", 
            "Over18", "OverTime", "JobRole"]

# Convert each to 'category' dtype
for col in cat_cols:
    df[col] = df[col].astype("category")

# Verify the new dtypes
print(df.dtypes)


Age                            int64
Attrition                   category
BusinessTravel              category
DailyRate                      int64
Department                  category
DistanceFromHome               int64
Education                      int64
EducationField              category
EmployeeCount                  int64
EmployeeNumber                 int64
EnvironmentSatisfaction        int64
Gender                      category
HourlyRate                     int64
JobInvolvement                 int64
JobLevel                       int64
JobRole                     category
JobSatisfaction                int64
MaritalStatus               category
MonthlyIncome                  int64
MonthlyRate                    int64
NumCompaniesWorked             int64
Over18                      category
OverTime                    category
PercentSalaryHike              int64
PerformanceRating              int64
RelationshipSatisfaction       int64
StandardHours                  int64
S
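
Outlier detection is listed among this notebook's steps. One common approach, shown here only as a sketch and not necessarily the method used in this project, is the 1.5 × IQR rule. The income values below are made up for illustration:

```python
import pandas as pd

# Made-up values standing in for a numerical column such as MonthlyIncome
income = pd.Series([2000, 2500, 3000, 3500, 4000, 20000], name="MonthlyIncome")

# Interquartile range and the conventional 1.5 * IQR fences
q1 = income.quantile(0.25)
q3 = income.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the fences are flagged as potential outliers
outliers = income[(income < lower) | (income > upper)]
print(outliers.tolist())
```

Flagged values are candidates for review rather than automatic removal; high incomes, for instance, may be genuine.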

## Feature Engineering

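We add a categorical feature, `TenureBucket`, that groups employees by their years at the company. Note that `pd.cut` uses right-closed intervals by default, so the bins are (0, 3], (3, 6], (6, 10], (10, 20], and (20, ∞): the `<3` label includes employees with exactly 3 years, and rows with `YearsAtCompany` equal to 0 fall outside the first bin and receive no bucket (NaN). This is consistent with the counts below summing to 1426 rather than 1470.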


In [9]:
# Define bins and labels for tenure buckets
bins = [0, 3, 6, 10, 20, np.inf]
labels = ["<3", "3-6", "6-10", "10-20", "20+"]
df["TenureBucket"] = pd.cut(df["YearsAtCompany"], bins=bins, labels=labels)
df["TenureBucket"] = df["TenureBucket"].astype('category')

# Display the value counts for the new feature
print("\nValue counts for TenureBucket:")
print(df["TenureBucket"].value_counts())



Value counts for TenureBucket:
TenureBucket
<3       426
3-6      382
6-10     372
10-20    180
20+       66
Name: count, dtype: int64
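
The bucket counts above sum to 1426, not 1470. With its default `right=True`, `pd.cut` builds intervals that are open on the left, so a value of 0 does not fall into a first bin that starts at 0. The sketch below, on made-up tenure values, compares the default with `right=False`, which makes the bins left-closed and matches the labels more literally. Treat this as one possible adjustment, not the notebook's method:

```python
import numpy as np
import pandas as pd

# Made-up tenure values, including the boundary cases 0, 3, 6 and 10
years = pd.Series([0, 1, 3, 6, 10, 25])
bins = [0, 3, 6, 10, 20, np.inf]
labels = ["<3", "3-6", "6-10", "10-20", "20+"]

# Default: intervals (0, 3], (3, 6], ... so 0 is left unassigned (NaN)
default_cut = pd.cut(years, bins=bins, labels=labels)

# right=False: intervals [0, 3), [3, 6), ... so 0 lands in "<3" and 3 in "3-6"
left_closed = pd.cut(years, bins=bins, labels=labels, right=False)

print(default_cut.tolist())
print(left_closed.tolist())
```

Passing `include_lowest=True` instead would keep the right-closed bins and only pull 0 into the first bucket.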


## Save Data

Finally, we save the cleaned data for use in part 2 of our EDA.

In [12]:
df.to_csv("hr_data_cleaned.csv", index=False)
print("Cleaned data saved to 'hr_data_cleaned.csv'.")

Cleaned data saved to 'hr_data_cleaned.csv'.
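
CSV files store only text, so the `category` dtypes set earlier are not preserved when the file is read back. A minimal sketch of restoring them at load time, using an in-memory buffer in place of `hr_data_cleaned.csv` and made-up rows:

```python
import io

import pandas as pd

# Toy frame with one categorical column, mirroring the cleaned data
toy = pd.DataFrame({"Attrition": ["Yes", "No"], "Age": [41, 49]})
toy["Attrition"] = toy["Attrition"].astype("category")

# Round-trip through CSV text (the buffer stands in for the saved file)
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# A plain read returns strings, not categories
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Passing dtype= restores the categorical type on load
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={"Attrition": "category"})

print(reloaded["Attrition"].dtype, restored["Attrition"].dtype)
```

The same `dtype=` mapping could be built from the `cat_cols` list in part 2. Note that the category order of `TenureBucket` would need to be re-specified separately.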
