# INX Future Inc – Employee Performance Analysis

## Data Processing

### Project Objective
This notebook performs initial data processing for the INX Future Inc Employee Performance project.
It includes loading the dataset, inspecting data quality, preprocessing features, and saving a
cleaned dataset for further analysis.


### Project Background
INX Future Inc is a leading data analytics and automation solutions provider with over 15 years
of global presence. Although the company is known for employee-friendly HR practices, recent
performance indicators have shown a decline, affecting service delivery and client satisfaction.

To address this issue without impacting employee morale, INX Future Inc initiated a data science
project to analyze employee data, identify key performance drivers, and support informed HR
decisions.


### Scope of This Notebook
This notebook focuses on data processing activities required before analysis and modeling.
The tasks performed include:
- Loading the raw employee dataset
- Inspecting data structure and quality
- Handling irrelevant features
- Encoding categorical variables
- Saving a processed dataset for further use


### Import Required Libraries
Python libraries are imported to support data manipulation and numerical operations.


In [1]:
import pandas as pd
import numpy as np


### Load the Dataset
The employee performance dataset is loaded from the raw data folder.
The dataset contains employee demographics, job attributes, satisfaction levels,
and performance ratings.


In [2]:
file_path = "../../data/raw/INX_Future_Inc_Employee_Performance_CDS_Project2_Data_V1.8.csv"
df = pd.read_csv(file_path)

### Dataset Description
The dataset contains employee demographic, job-related, satisfaction,
and experience attributes along with performance ratings.
The goal is to prepare this data for analysis and machine learning.


In [3]:
df.head()

Unnamed: 0,EmpNumber,Age,Gender,EducationBackground,MaritalStatus,EmpDepartment,EmpJobRole,BusinessTravelFrequency,DistanceFromHome,EmpEducationLevel,...,EmpRelationshipSatisfaction,TotalWorkExperienceInYears,TrainingTimesLastYear,EmpWorkLifeBalance,ExperienceYearsAtThisCompany,ExperienceYearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,Attrition,PerformanceRating
0,E1001000,32,Male,Marketing,Single,Sales,Sales Executive,Travel_Rarely,10,3,...,4,10,2,2,10,7,0,8,No,3
1,E1001006,47,Male,Marketing,Single,Sales,Sales Executive,Travel_Rarely,14,4,...,4,20,2,3,7,7,1,7,No,3
2,E1001007,40,Male,Life Sciences,Married,Sales,Sales Executive,Travel_Frequently,5,4,...,3,20,2,3,18,13,1,12,No,4
3,E1001009,41,Male,Human Resources,Divorced,Human Resources,Manager,Travel_Rarely,10,4,...,2,23,2,2,21,6,12,6,No,3
4,E1001010,60,Male,Marketing,Single,Sales,Sales Executive,Travel_Rarely,16,4,...,4,10,1,3,2,2,2,2,No,3


In [4]:
df.tail()

Unnamed: 0,EmpNumber,Age,Gender,EducationBackground,MaritalStatus,EmpDepartment,EmpJobRole,BusinessTravelFrequency,DistanceFromHome,EmpEducationLevel,...,EmpRelationshipSatisfaction,TotalWorkExperienceInYears,TrainingTimesLastYear,EmpWorkLifeBalance,ExperienceYearsAtThisCompany,ExperienceYearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,Attrition,PerformanceRating
1195,E100992,27,Female,Medical,Divorced,Sales,Sales Executive,Travel_Frequently,3,1,...,2,6,3,3,6,5,0,4,No,4
1196,E100993,37,Male,Life Sciences,Single,Development,Senior Developer,Travel_Rarely,10,2,...,1,4,2,3,1,0,0,0,No,3
1197,E100994,50,Male,Medical,Married,Development,Senior Developer,Travel_Rarely,28,1,...,3,20,3,3,20,8,3,8,No,3
1198,E100995,34,Female,Medical,Single,Data Science,Data Scientist,Travel_Rarely,9,3,...,2,9,3,4,8,7,7,7,No,3
1199,E100998,24,Female,Life Sciences,Single,Sales,Sales Executive,Travel_Rarely,3,2,...,1,4,3,3,2,2,2,0,Yes,2


### Dataset Shape
`df.shape` returns the number of rows (employee records) and columns (features)
in the dataset.


In [5]:
df.shape

(1200, 28)

In [6]:
df.describe()

Unnamed: 0,Age,DistanceFromHome,EmpEducationLevel,EmpEnvironmentSatisfaction,EmpHourlyRate,EmpJobInvolvement,EmpJobLevel,EmpJobSatisfaction,NumCompaniesWorked,EmpLastSalaryHikePercent,EmpRelationshipSatisfaction,TotalWorkExperienceInYears,TrainingTimesLastYear,EmpWorkLifeBalance,ExperienceYearsAtThisCompany,ExperienceYearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,PerformanceRating
count,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0
mean,36.918333,9.165833,2.8925,2.715833,65.981667,2.731667,2.0675,2.7325,2.665,15.2225,2.725,11.33,2.785833,2.744167,7.0775,4.291667,2.194167,4.105,2.948333
std,9.087289,8.176636,1.04412,1.090599,20.211302,0.707164,1.107836,1.100888,2.469384,3.625918,1.075642,7.797228,1.263446,0.699374,6.236899,3.613744,3.22156,3.541576,0.518866
min,18.0,1.0,1.0,1.0,30.0,1.0,1.0,1.0,0.0,11.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,2.0
25%,30.0,2.0,2.0,2.0,48.0,2.0,1.0,2.0,1.0,12.0,2.0,6.0,2.0,2.0,3.0,2.0,0.0,2.0,3.0
50%,36.0,7.0,3.0,3.0,66.0,3.0,2.0,3.0,2.0,14.0,3.0,10.0,3.0,3.0,5.0,3.0,1.0,3.0,3.0
75%,43.0,14.0,4.0,4.0,83.0,3.0,3.0,4.0,4.0,18.0,4.0,15.0,3.0,3.0,10.0,7.0,3.0,7.0,3.0
max,60.0,29.0,5.0,4.0,100.0,4.0,5.0,4.0,9.0,25.0,4.0,40.0,6.0,4.0,40.0,18.0,15.0,17.0,4.0


### Dataset Information
This step provides details about the number of records, data types,
and non-null values in the dataset.


In [7]:
df.info()

<class 'pandas.DataFrame'>
RangeIndex: 1200 entries, 0 to 1199
Data columns (total 28 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   EmpNumber                     1200 non-null   str  
 1   Age                           1200 non-null   int64
 2   Gender                        1200 non-null   str  
 3   EducationBackground           1200 non-null   str  
 4   MaritalStatus                 1200 non-null   str  
 5   EmpDepartment                 1200 non-null   str  
 6   EmpJobRole                    1200 non-null   str  
 7   BusinessTravelFrequency       1200 non-null   str  
 8   DistanceFromHome              1200 non-null   int64
 9   EmpEducationLevel             1200 non-null   int64
 10  EmpEnvironmentSatisfaction    1200 non-null   int64
 11  EmpHourlyRate                 1200 non-null   int64
 12  EmpJobInvolvement             1200 non-null   int64
 13  EmpJobLevel                   1200 non-null 

### Check for Missing Values
The dataset is checked for any missing or null values.


In [8]:
df.isnull().sum()


EmpNumber                       0
Age                             0
Gender                          0
EducationBackground             0
MaritalStatus                   0
EmpDepartment                   0
EmpJobRole                      0
BusinessTravelFrequency         0
DistanceFromHome                0
EmpEducationLevel               0
EmpEnvironmentSatisfaction      0
EmpHourlyRate                   0
EmpJobInvolvement               0
EmpJobLevel                     0
EmpJobSatisfaction              0
NumCompaniesWorked              0
OverTime                        0
EmpLastSalaryHikePercent        0
EmpRelationshipSatisfaction     0
TotalWorkExperienceInYears      0
TrainingTimesLastYear           0
EmpWorkLifeBalance              0
ExperienceYearsAtThisCompany    0
ExperienceYearsInCurrentRole    0
YearsSinceLastPromotion         0
YearsWithCurrManager            0
Attrition                       0
PerformanceRating               0
dtype: int64

- The dataset does not contain missing values. Hence, no imputation is required.
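
Null counts are only one aspect of data quality. As a minimal sketch of a few further checks (duplicate rows and blank strings), the helper below may be useful; the name `quality_report` and the toy frame are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

def quality_report(frame: pd.DataFrame) -> dict:
    """Return simple data-quality counts for a DataFrame (hypothetical helper)."""
    # Treat a column as text when it has values and every non-null value is a str.
    # This avoids depending on a particular pandas string dtype.
    text_cols = [
        c for c in frame.columns
        if frame[c].notna().any()
        and frame[c].dropna().map(lambda v: isinstance(v, str)).all()
    ]
    # Count strings that are empty or contain only whitespace.
    blank = sum(
        int(frame[c].dropna().str.strip().eq("").sum()) for c in text_cols
    )
    return {
        "missing_cells": int(frame.isnull().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
        "blank_strings": blank,
    }

# Toy data (assumed values) with one missing cell, one duplicate row, one blank string
toy = pd.DataFrame({
    "Gender": ["Male", "Female", " ", "Male"],
    "Age": [32, None, 40, 32],
})
report = quality_report(toy)
print(report)
```

Applied to the real data, this would be called as `quality_report(df)` right after the null check.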

### Rename Columns for Consistency
To improve readability and maintain consistency, column names are renamed using
snake_case format. This makes the dataset easier to understand and use in
analysis, visualization, and machine learning models.


In [9]:
df.rename(columns={
    'EmpNumber': 'employee_id',
    'Age': 'age',
    'Gender': 'gender',
    'EducationBackground': 'education_background',
    'MaritalStatus': 'marital_status',
    'EmpDepartment': 'department',
    'EmpJobRole': 'job_role',
    'BusinessTravelFrequency': 'business_travel',
    'DistanceFromHome': 'distance_from_home',
    'EmpEducationLevel': 'education_level',
    'EmpEnvironmentSatisfaction': 'environment_satisfaction',
    'EmpHourlyRate': 'hourly_rate',
    'EmpJobInvolvement': 'job_involvement',
    'EmpJobLevel': 'job_level',
    'EmpJobSatisfaction': 'job_satisfaction',
    'NumCompaniesWorked': 'num_companies_worked',
    'OverTime': 'overtime',
    'EmpLastSalaryHikePercent': 'last_salary_hike_percent',
    'EmpRelationshipSatisfaction': 'relationship_satisfaction',
    'TotalWorkExperienceInYears': 'total_experience_years',
    'TrainingTimesLastYear': 'training_times_last_year',
    'EmpWorkLifeBalance': 'work_life_balance',
    'ExperienceYearsAtThisCompany': 'years_at_company',
    'ExperienceYearsInCurrentRole': 'years_in_current_role',
    'YearsSinceLastPromotion': 'years_since_last_promotion',
    'YearsWithCurrManager': 'years_with_current_manager',
    'Attrition': 'attrition',
    'PerformanceRating': 'performance_rating'
}, inplace=True)
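
The mapping above is written by hand, which allows prefixes such as `Emp` to be dropped. Where a purely mechanical conversion is acceptable, a regular expression can produce snake_case names. This is a sketch only: `to_snake_case` is a hypothetical helper, and its output differs from the manual mapping (for example, it yields `emp_number` rather than `employee_id`).

```python
import re

def to_snake_case(name: str) -> str:
    """Convert a CamelCase column name to snake_case (hypothetical helper)."""
    # Insert an underscore at every boundary between a lowercase letter or
    # digit and a following capital letter, then lowercase the result.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

columns = ["EmpNumber", "DistanceFromHome", "OverTime", "PerformanceRating"]
converted = [to_snake_case(c) for c in columns]
print(converted)
```

With a DataFrame, the same function could be passed directly as `df.rename(columns=to_snake_case)`.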


### Feature Selection Justification
The column **employee_id** is removed because it is a unique identifier
and carries no information about employee performance.
If kept, a model may memorize identifier values instead of learning patterns that generalize.


In [None]:
df.drop(columns=['employee_id'], inplace=True)
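
Before dropping a column as an identifier, it can be worth confirming that its values really are unique. The sketch below uses a small toy frame (an assumption for illustration) rather than the full dataset.

```python
import pandas as pd

# Toy frame standing in for the renamed dataset (illustrative values)
toy = pd.DataFrame({
    "employee_id": ["E1001000", "E1001006", "E1001007"],
    "age": [32, 47, 40],
})

# A column behaves as a pure identifier when every row has a distinct value.
is_identifier = toy["employee_id"].is_unique

# errors="ignore" keeps the cell safe to re-run after the column is gone.
if is_identifier:
    toy = toy.drop(columns=["employee_id"], errors="ignore")

remaining = list(toy.columns)
print(is_identifier, remaining)
```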

### Separating Categorical and Numerical Features

In [None]:
cat_features = df.select_dtypes(include=['object','str']).columns
num_features = df.select_dtypes(exclude=['object','str']).columns
cat_features, num_features
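
The `'str'` selector relies on the newer pandas string dtype (the `df.info()` output shows `str` columns), and older pandas versions may reject it. A version-independent alternative, sketched here on a toy frame with assumed values, is to select the numeric columns and treat the rest as categorical.

```python
import pandas as pd

# Toy frame with two text columns and one numeric column (illustrative values)
toy = pd.DataFrame({
    "gender": ["Male", "Female"],
    "age": [32, 47],
    "attrition": ["No", "Yes"],
})

# "number" matches all integer and float dtypes.
num_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything that is not numeric is treated as categorical here.
cat_cols = [c for c in toy.columns if c not in num_cols]
print(cat_cols, num_cols)
```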

### Encoding Categorical Variables
- Machine learning models require numerical inputs, so categorical features such as department, job role, and gender
  are converted to integer codes using Label Encoding.
- Label Encoding is chosen because:
  - The dataset contains many categorical variables
  - Tree-based models work well with encoded labels
  - It keeps the pipeline simple and efficient
- The integer codes carry no real ordering, so models that interpret magnitude (linear or distance-based models)
  may be better served by one-hot encoding.


In [None]:
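from sklearn.preprocessing import LabelEncoder

# Fit one encoder per column and keep it so the codes can be decoded later
encoders = {}
for col in cat_features:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])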
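
Integer codes are only interpretable if the code-to-label mapping is kept. As a minimal sketch using pandas alone, the example below encodes columns through the `category` dtype, which (like `LabelEncoder`) assigns codes in sorted label order, and stores the mapping for decoding. The toy frame and the `mappings` dictionary are assumptions for illustration.

```python
import pandas as pd

# Toy frame with two categorical columns (illustrative values)
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "business_travel": ["Travel_Rarely", "Travel_Frequently", "Travel_Rarely"],
})

mappings = {}
for col in ["gender", "business_travel"]:
    as_cat = toy[col].astype("category")
    # Categories are sorted, so codes follow the alphabetical order of labels.
    mappings[col] = dict(enumerate(as_cat.cat.categories))
    toy[col] = as_cat.cat.codes

# Decode a column back to its original labels using the stored mapping.
decoded_gender = toy["gender"].map(mappings["gender"]).tolist()
print(mappings["gender"], decoded_gender)
```

Saving such a mapping alongside the processed file would let later notebooks translate codes back into readable labels.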

### Final Processed Dataset Overview

In [None]:
df.head()

### Data Transformation Summary
After preprocessing:
- All features are numeric
- The dataset is ready for analysis and modeling
- The processed dataset is saved for reuse in the final step of this notebook


In [None]:
df.info()

### Column Overview
The dataset contains demographic, job-related, satisfaction, and experience
features along with the target variable performance rating.


### Saving Processed Dataset
The cleaned and processed dataset is saved for use in exploratory data analysis and model training.

In [None]:
processed_path = "../../data/processed/employee_performance_processed.csv"
df.to_csv(processed_path, index=False)
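
A quick round-trip check can confirm that the saved file reloads with the same shape and values. The sketch below writes a toy frame (assumed values) to a temporary directory instead of the project path, so it can be run anywhere.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the processed dataset (illustrative values)
toy = pd.DataFrame({"age": [32, 47], "performance_rating": [3, 4]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "processed.csv")
    # index=False avoids writing the row index as an extra unnamed column.
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

same_shape = reloaded.shape == toy.shape
same_values = reloaded.equals(toy)
print(same_shape, same_values)
```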


### Summary of Data Processing
- Loaded the raw employee performance dataset
- Verified data quality and structure
- Renamed columns for consistency
- Removed non-informative identifier column
- Encoded categorical variables
- Saved the processed dataset for further analysis

This processed dataset will be used for exploratory data analysis,
visualization, model training, and performance prediction.
