
# Project Name :- INX Future Inc. Employee Performance Analysis.
# Project Code :- 10281
# --------------------------------------------------------------------------------------------------------------

## DATA PREPROCESSING =>> Data preprocessing is the critical step in data preparation, involving cleaning, transforming, and organizing raw data to enhance its quality and usability for analysis and machine learning applications.

## IMPORT ALL IMPORTANT LIBRARIES :-

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
from sklearn.preprocessing import LabelEncoder

## READING THE DATASET :-

In [2]:
data=pd.read_excel("Original_data.xls")

In [3]:
data.head()

Unnamed: 0,EmpNumber,Age,Gender,EducationBackground,MaritalStatus,EmpDepartment,EmpJobRole,BusinessTravelFrequency,DistanceFromHome,EmpEducationLevel,...,EmpRelationshipSatisfaction,TotalWorkExperienceInYears,TrainingTimesLastYear,EmpWorkLifeBalance,ExperienceYearsAtThisCompany,ExperienceYearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,Attrition,PerformanceRating
0,E1001000,32,Male,Marketing,Single,Sales,Sales Executive,Travel_Rarely,10,3,...,4,10,2,2,10,7,0,8,No,3
1,E1001006,47,Male,Marketing,Single,Sales,Sales Executive,Travel_Rarely,14,4,...,4,20,2,3,7,7,1,7,No,3
2,E1001007,40,Male,Life Sciences,Married,Sales,Sales Executive,Travel_Frequently,5,4,...,3,20,2,3,18,13,1,12,No,4
3,E1001009,41,Male,Human Resources,Divorced,Human Resources,Manager,Travel_Rarely,10,4,...,2,23,2,2,21,6,12,6,No,3
4,E1001010,60,Male,Marketing,Single,Sales,Sales Executive,Travel_Rarely,16,4,...,4,10,1,3,2,2,2,2,No,3


### Here we use pd.read_excel to read the original dataset, which is provided as an Excel file.
### data.head() shows the first five rows of the dataset by default.

## DOMAIN ANALYSIS :-

### As per the project description, INX Future Inc. is one of the leading data analytics and automation solutions providers, with over 15 years of global business presence.
### INX has consistently been rated among the top 20 best employers for the past 5 years. Its human resource policies are considered employee friendly and are widely perceived as best practices in the industry.
### In recent years, employee performance ratings have not been healthy, and this is becoming a growing concern among the top management.
### Escalations on service delivery have increased, and client satisfaction levels have come down by 8 percentage points. The CEO, Mr. Brain, knows the issues but is reluctant to penalize non-performing employees, as this would affect the morale of all employees in general and may further reduce performance.
### It could also hurt the market perception of INX as a best employer and, with it, the company's ability to attract the best talent.
### Mr. Brain decided to initiate a data science project that analyses the current employee data and finds the core underlying causes of these performance issues.
### Mr. Brain, being a data scientist himself, expects the findings of this project to help him take the right course of action.
### He also expects clear indicators of non-performing employees, so that any penalization of a non-performing employee, if required, does not significantly affect the morale of other employees.

## FEATURES DESCRIPTION :-

### 1.EmpNumber: 
##### An Employee ID, sometimes referred to as an Employee Number or Employee Code, is a unique number that has been assigned to each individual staff member within a company.
 
### 2.Age:
##### Age of employee in years.

### 3.Gender:
##### Gender of the employee [Male/Female].

### 4.EducationBackground:
##### The field in which the employee was educated, for example Marketing, Life Sciences, Medical, or Technical Degree.

### 5.MaritalStatus:
##### Marital status describes a person's relationship with a significant other [Single/Married/Divorced].

### 6.EmpDepartment:
##### Department in which the employee works.

### 7.EmpJobRole:
##### Job role means the key responsibility of a job profile or job position. 

### 8.BusinessTravelFrequency:
##### How often the employee travels for company business purposes.

### 9.DistanceFromHome:
##### Distance between the employee's home and the company.

### 10.EmpEducationLevel:
##### Employee Education level means the academic qualification. For example, it could be a diploma, degree, masters or PhD.

### 11.EmpEnvironmentSatisfaction:
##### How satisfied employees are with elements like their jobs, their employee experience, and the organization they work for.

### 12.EmpHourlyRate:
##### Hourly Rate means the amount paid to an employee for each hour worked.

### 13.EmpJobInvolvement:
##### Job involvement refers to a state of psychological identification with work—or the degree to which a job is central to a person's identity. From an organizational perspective, it has been regarded as the key to unlocking employee motivation and increasing productivity.

### 14.EmpJobLevel:
##### Job levels, also known as job grades and classifications, set the responsibility level and expectations.

### 15.EmpJobSatisfaction:
##### Level of contentment employees feel with their job.

### 16.NumCompaniesWorked:
##### Number of companies the employee has worked for.

### 17.OverTime:
##### Whether the employee works overtime [Yes/No].

### 18.EmpLastSalaryHikePercent:
##### Percentage salary hike the employee received last year.

### 19.EmpRelationshipSatisfaction:
##### Healthy relationships may motivate employees and increase morale. When employees cast aside relationship issues, they can focus on work tasks more effectively.

### 20.TotalWorkExperienceInYears:
##### Total work experience of the employee in years.

### 21.TrainingTimesLastYear:
##### Number of trainings the employee completed last year.

### 22.EmpWorkLifeBalance:
##### “Work-life balance” typically means the achievement by employees of equality between time spent working and personal life. A good work-life balance for employees can improve staff motivation, increase staff retention rates, reduce absence, attract new talent, and reduce employee stress.

### 23.ExperienceYearsAtThisCompany:
##### Total years of experience at the current company.

### 24.ExperienceYearsInCurrentRole:
##### Total years of experience in the current job role.

### 25.YearsSinceLastPromotion:
##### Number of years since the employee's last promotion.

### 26.YearsWithCurrManager:
##### Number of years the employee has worked with the current manager.

### 27.Attrition:
##### Employee attrition is the naturally occurring, voluntary departure of employees from a company, whether for personal reasons or professional motivation.

### 28.PerformanceRating:
##### This is the target feature: the overall performance rating of the employee in the company.

## BASIC CHECKS :-

In [4]:
data.shape

(1200, 28)

### There are 1200 rows and 28 columns in the given dataset.
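
### As a minimal sketch, the shape tuple can be unpacked into named values. The three-row frame below is a hypothetical stand-in for the employee data, built from the first rows shown by data.head().

```python
import pandas as pd

# Hypothetical stand-in for the employee data (values taken from data.head()).
df = pd.DataFrame({
    "EmpNumber": ["E1001000", "E1001006", "E1001007"],
    "Age": [32, 47, 40],
})

# shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")
```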

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1200 entries, 0 to 1199
Data columns (total 28 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   EmpNumber                     1200 non-null   object
 1   Age                           1200 non-null   int64 
 2   Gender                        1200 non-null   object
 3   EducationBackground           1200 non-null   object
 4   MaritalStatus                 1200 non-null   object
 5   EmpDepartment                 1200 non-null   object
 6   EmpJobRole                    1200 non-null   object
 7   BusinessTravelFrequency       1200 non-null   object
 8   DistanceFromHome              1200 non-null   int64 
 9   EmpEducationLevel             1200 non-null   int64 
 10  EmpEnvironmentSatisfaction    1200 non-null   int64 
 11  EmpHourlyRate                 1200 non-null   int64 
 12  EmpJobInvolvement             1200 non-null   int64 
 13  EmpJobLevel       

### data.info() summarises the dataset: the number of entries, the non-null count of each column, and the datatype of each column.

In [6]:
data.dtypes

EmpNumber                       object
Age                              int64
Gender                          object
EducationBackground             object
MaritalStatus                   object
EmpDepartment                   object
EmpJobRole                      object
BusinessTravelFrequency         object
DistanceFromHome                 int64
EmpEducationLevel                int64
EmpEnvironmentSatisfaction       int64
EmpHourlyRate                    int64
EmpJobInvolvement                int64
EmpJobLevel                      int64
EmpJobSatisfaction               int64
NumCompaniesWorked               int64
OverTime                        object
EmpLastSalaryHikePercent         int64
EmpRelationshipSatisfaction      int64
TotalWorkExperienceInYears       int64
TrainingTimesLastYear            int64
EmpWorkLifeBalance               int64
ExperienceYearsAtThisCompany     int64
ExperienceYearsInCurrentRole     int64
YearsSinceLastPromotion          int64
YearsWithCurrManager     

In [7]:
data.describe()

Unnamed: 0,Age,DistanceFromHome,EmpEducationLevel,EmpEnvironmentSatisfaction,EmpHourlyRate,EmpJobInvolvement,EmpJobLevel,EmpJobSatisfaction,NumCompaniesWorked,EmpLastSalaryHikePercent,EmpRelationshipSatisfaction,TotalWorkExperienceInYears,TrainingTimesLastYear,EmpWorkLifeBalance,ExperienceYearsAtThisCompany,ExperienceYearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,PerformanceRating
count,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0
mean,36.918333,9.165833,2.8925,2.715833,65.981667,2.731667,2.0675,2.7325,2.665,15.2225,2.725,11.33,2.785833,2.744167,7.0775,4.291667,2.194167,4.105,2.948333
std,9.087289,8.176636,1.04412,1.090599,20.211302,0.707164,1.107836,1.100888,2.469384,3.625918,1.075642,7.797228,1.263446,0.699374,6.236899,3.613744,3.22156,3.541576,0.518866
min,18.0,1.0,1.0,1.0,30.0,1.0,1.0,1.0,0.0,11.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,2.0
25%,30.0,2.0,2.0,2.0,48.0,2.0,1.0,2.0,1.0,12.0,2.0,6.0,2.0,2.0,3.0,2.0,0.0,2.0,3.0
50%,36.0,7.0,3.0,3.0,66.0,3.0,2.0,3.0,2.0,14.0,3.0,10.0,3.0,3.0,5.0,3.0,1.0,3.0,3.0
75%,43.0,14.0,4.0,4.0,83.0,3.0,3.0,4.0,4.0,18.0,4.0,15.0,3.0,3.0,10.0,7.0,3.0,7.0,3.0
max,60.0,29.0,5.0,4.0,100.0,4.0,5.0,4.0,9.0,25.0,4.0,40.0,6.0,4.0,40.0,18.0,15.0,17.0,4.0


### data.describe() returns summary statistics for the numerical columns, such as the count, mean, standard deviation, minimum, quartiles, and maximum.
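
### To make the describe() rows concrete, the sketch below checks two of them by hand on a small series. The series holds only the five ages shown by data.head(), so its numbers are illustrative and do not match the full-dataset statistics above.

```python
import pandas as pd

# The five ages from data.head(); a tiny sample used only for illustration.
ages = pd.Series([32, 47, 40, 41, 60], name="Age")
summary = ages.describe()
print(summary)

# The "50%" row is the median of the column.
median_matches = summary["50%"] == ages.median()

# The "std" row is the sample standard deviation (ddof=1), not the population one.
std_matches = abs(summary["std"] - ages.std(ddof=1)) < 1e-9

print(median_matches, std_matches)
```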

In [8]:
data.isnull().sum()

EmpNumber                       0
Age                             0
Gender                          0
EducationBackground             0
MaritalStatus                   0
EmpDepartment                   0
EmpJobRole                      0
BusinessTravelFrequency         0
DistanceFromHome                0
EmpEducationLevel               0
EmpEnvironmentSatisfaction      0
EmpHourlyRate                   0
EmpJobInvolvement               0
EmpJobLevel                     0
EmpJobSatisfaction              0
NumCompaniesWorked              0
OverTime                        0
EmpLastSalaryHikePercent        0
EmpRelationshipSatisfaction     0
TotalWorkExperienceInYears      0
TrainingTimesLastYear           0
EmpWorkLifeBalance              0
ExperienceYearsAtThisCompany    0
ExperienceYearsInCurrentRole    0
YearsSinceLastPromotion         0
YearsWithCurrManager            0
Attrition                       0
PerformanceRating               0
dtype: int64

### There are no null values in this dataset.
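
### Since this dataset has no nulls, the sketch below uses a hypothetical frame with deliberately missing entries to show what the same check reports when values are absent, along with the share of missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing entries inserted on purpose.
df = pd.DataFrame({
    "Age": [32, np.nan, 40, 41],
    "Gender": ["Male", "Male", None, "Male"],
    "Attrition": ["No", "No", "No", "No"],
})

missing_count = df.isnull().sum()        # number of nulls per column
missing_pct = df.isnull().mean() * 100   # share of nulls per column, in percent

report = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
# Show only the columns that actually have missing values.
print(report[report["missing"] > 0])
```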

In [9]:
data.columns

Index(['EmpNumber', 'Age', 'Gender', 'EducationBackground', 'MaritalStatus',
       'EmpDepartment', 'EmpJobRole', 'BusinessTravelFrequency',
       'DistanceFromHome', 'EmpEducationLevel', 'EmpEnvironmentSatisfaction',
       'EmpHourlyRate', 'EmpJobInvolvement', 'EmpJobLevel',
       'EmpJobSatisfaction', 'NumCompaniesWorked', 'OverTime',
       'EmpLastSalaryHikePercent', 'EmpRelationshipSatisfaction',
       'TotalWorkExperienceInYears', 'TrainingTimesLastYear',
       'EmpWorkLifeBalance', 'ExperienceYearsAtThisCompany',
       'ExperienceYearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager', 'Attrition', 'PerformanceRating'],
      dtype='object')

In [10]:
data.duplicated().sum()

0

### There are also no duplicate rows in this dataset.
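
### duplicated() without arguments only flags rows that match in every column. A minimal sketch on a hypothetical frame shows how the subset parameter can also catch a repeated employee ID whose other values differ.

```python
import pandas as pd

# Hypothetical frame: the ID E1001006 appears twice with different ages.
df = pd.DataFrame({
    "EmpNumber": ["E1001000", "E1001006", "E1001006", "E1001007"],
    "Age": [32, 47, 48, 40],
})

# No row is identical in all columns, so the full-row check finds nothing.
full_row_dupes = int(df.duplicated().sum())

# Restricting the check to the ID column reveals the repeated employee.
id_dupes = int(df.duplicated(subset=["EmpNumber"]).sum())

# Keep the first occurrence of each ID.
deduped = df.drop_duplicates(subset=["EmpNumber"], keep="first")
print(full_row_dupes, id_dupes, len(deduped))
```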

In [11]:
data.describe(include=['O'])

| | EmpNumber | Gender | EducationBackground | MaritalStatus | EmpDepartment | EmpJobRole | BusinessTravelFrequency | OverTime | Attrition |
|---|---|---|---|---|---|---|---|---|---|
| count | 1200 | 1200 | 1200 | 1200 | 1200 | 1200 | 1200 | 1200 | 1200 |
| unique | 1200 | 2 | 6 | 3 | 6 | 19 | 3 | 2 | 2 |
| top | E1001000 | Male | Life Sciences | Married | Sales | Sales Executive | Travel_Rarely | No | No |
| freq | 1 | 725 | 492 | 548 | 373 | 270 | 846 | 847 | 1022 |


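### data.describe(include=['O']) gives summary statistics for the non-numerical (object) columns: the count, the number of unique values, the most frequent value, and its frequency.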
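
### The non-numerical columns can also be inspected in a loop rather than one print statement per column. This is a sketch on a hypothetical three-row frame; select_dtypes(exclude="number") is used here to pick every column that is not numeric.

```python
import pandas as pd

# Hypothetical stand-in for the employee data.
df = pd.DataFrame({
    "Gender": ["Male", "Male", "Female"],
    "MaritalStatus": ["Single", "Married", "Single"],
    "Age": [32, 47, 40],
})

# Select every column that is not numeric.
categorical_cols = df.select_dtypes(exclude="number").columns

for col in categorical_cols:
    print(f"{col}: {df[col].nunique()} unique values")
    print(df[col].value_counts())
    print("-" * 40)
```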

In [12]:
print("EmpNumber: ", data.EmpNumber.unique())
print("-"*40)
print("Gender: ", data.Gender.unique())
print("-"*40)
print("EducationBackground: ", data.EducationBackground.unique())
print("-"*40)
print("MaritalStatus: ", data.MaritalStatus.unique())
print("-"*40)
print("EmpDepartment: ", data.EmpDepartment.unique())
print("-"*40)
print("EmpJobRole: ", data.EmpJobRole.unique())
print("-"*40)
print("BusinessTravelFrequency: ", data.BusinessTravelFrequency.unique())
print("-"*40)
print("OverTime: ", data.OverTime.unique())
print("-"*40)
print("Attrition: ", data.Attrition.unique())
print("-"*40)

EmpNumber:  ['E1001000' 'E1001006' 'E1001007' ... 'E100994' 'E100995' 'E100998']
----------------------------------------
Gender:  ['Male' 'Female']
----------------------------------------
EducationBackground:  ['Marketing' 'Life Sciences' 'Human Resources' 'Medical' 'Other'
 'Technical Degree']
----------------------------------------
MaritalStatus:  ['Single' 'Married' 'Divorced']
----------------------------------------
EmpDepartment:  ['Sales' 'Human Resources' 'Development' 'Data Science'
 'Research & Development' 'Finance']
----------------------------------------
EmpJobRole:  ['Sales Executive' 'Manager' 'Developer' 'Sales Representative'
 'Human Resources' 'Senior Developer' 'Data Scientist'
 'Senior Manager R&D' 'Laboratory Technician' 'Manufacturing Director'
 'Research Scientist' 'Healthcare Representative' 'Research Director'
 'Manager R&D' 'Finance Manager' 'Technical Architect' 'Business Analyst'
 'Technical Lead' 'Delivery Manager']
-------------------------------------

In [13]:
print("EmpNumber :",data.EmpNumber.value_counts(),sep = '\n')
print("-"*40)
print("Gender :",data.Gender.value_counts(),sep = '\n')
print("-"*40)
print("EducationBackground :",data.EducationBackground.value_counts(),sep = '\n')
print("-"*40)
print("MaritalStatus :",data.MaritalStatus.value_counts(),sep = '\n')
print("-"*40)
print("EmpDepartment :",data.EmpDepartment.value_counts(),sep = '\n')
print("-"*40)
print("EmpJobRole :",data.EmpJobRole.value_counts(),sep = '\n')
print("-"*40)
print("BusinessTravelFrequency:",data.BusinessTravelFrequency.value_counts(),sep = '\n')
print("-"*40)
print("OverTime :",data.OverTime.value_counts(),sep = '\n')
print("-"*40)
print("PerformanceRating:",data.PerformanceRating.value_counts(),sep = '\n')
print("-"*40)
print("Attrition:",data.Attrition.value_counts(),sep = '\n')
print("-"*40)

EmpNumber :
E1001000    1
E100346     1
E100342     1
E100341     1
E100340     1
           ..
E1001718    1
E1001717    1
E1001716    1
E1001713    1
E100998     1
Name: EmpNumber, Length: 1200, dtype: int64
----------------------------------------
Gender :
Male      725
Female    475
Name: Gender, dtype: int64
----------------------------------------
EducationBackground :
Life Sciences       492
Medical             384
Marketing           137
Technical Degree    100
Other                66
Human Resources      21
Name: EducationBackground, dtype: int64
----------------------------------------
MaritalStatus :
Married     548
Single      384
Divorced    268
Name: MaritalStatus, dtype: int64
----------------------------------------
EmpDepartment :
Sales                     373
Development               361
Research & Development    343
Human Resources            54
Finance                    49
Data Science               20
Name: EmpDepartment, dtype: int64
----------------------------

## LABEL ENCODING :-

### Label Encoding  =>>  Label encoding is a data preprocessing technique that assigns unique numerical values to categorical labels, enabling machine learning algorithms to process them.

In [14]:
LE=LabelEncoder()

In [15]:
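# Age is already numeric; encoding it replaces each age with its rank among
# the observed ages (for example, 32 becomes 14 in the output further down).
data.Age = LE.fit_transform(data.Age)

# Encode the categorical (object) columns. Every fit_transform call refits LE,
# so only the mapping of the last encoded column is retained afterwards.
data.Attrition = LE.fit_transform(data.Attrition)
data.OverTime = LE.fit_transform(data.OverTime)
data.BusinessTravelFrequency = LE.fit_transform(data.BusinessTravelFrequency)
data.EmpJobRole = LE.fit_transform(data.EmpJobRole)
data.EmpDepartment = LE.fit_transform(data.EmpDepartment)
data.MaritalStatus = LE.fit_transform(data.MaritalStatus)
data.EducationBackground = LE.fit_transform(data.EducationBackground)
data.Gender = LE.fit_transform(data.Gender)

# EmpNumber is a unique identifier, so its codes are simply 0 to 1199.
data.EmpNumber = LE.fit_transform(data.EmpNumber)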

### Here we encode the categorical data into numerical data.
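
### A single LabelEncoder that is refit for every column only remembers the last column it saw. As an alternative sketch (not the approach used in this notebook), one encoder per column can be kept in a dictionary so each mapping can be inspected or reversed later. The frame and the names encoders and gender_mapping are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for two of the categorical columns.
df = pd.DataFrame({
    "Gender": ["Male", "Male", "Female"],
    "OverTime": ["No", "Yes", "No"],
})

encoders = {}
for col in ["Gender", "OverTime"]:
    enc = LabelEncoder()
    df[col] = enc.fit_transform(df[col])
    encoders[col] = enc  # keep the fitted encoder for this column

# classes_ is sorted, so the position of each label is its integer code.
gender_mapping = {str(label): code
                  for code, label in enumerate(encoders["Gender"].classes_)}
print(gender_mapping)

# The stored encoder can turn the codes back into the original labels.
restored = [str(v) for v in encoders["Gender"].inverse_transform(df["Gender"])]
print(restored)
```

### Because the classes are sorted alphabetically, Female is coded 0 and Male 1, which agrees with the Gender values in the encoded output of this notebook.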

In [16]:
data.dtypes

EmpNumber                       int32
Age                             int64
Gender                          int32
EducationBackground             int32
MaritalStatus                   int32
EmpDepartment                   int32
EmpJobRole                      int32
BusinessTravelFrequency         int32
DistanceFromHome                int64
EmpEducationLevel               int64
EmpEnvironmentSatisfaction      int64
EmpHourlyRate                   int64
EmpJobInvolvement               int64
EmpJobLevel                     int64
EmpJobSatisfaction              int64
NumCompaniesWorked              int64
OverTime                        int32
EmpLastSalaryHikePercent        int64
EmpRelationshipSatisfaction     int64
TotalWorkExperienceInYears      int64
TrainingTimesLastYear           int64
EmpWorkLifeBalance              int64
ExperienceYearsAtThisCompany    int64
ExperienceYearsInCurrentRole    int64
YearsSinceLastPromotion         int64
YearsWithCurrManager            int64
Attrition   

### This shows that every feature now has an integer datatype, which is suitable for further processing.
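
### Instead of reading the dtypes listing by eye, the same check can be sketched in code. The frame below is hypothetical and deliberately leaves Attrition unencoded so that the check has something to report.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical frame: Attrition was left as text on purpose.
df = pd.DataFrame({
    "Age": [14, 29],
    "Gender": [1, 1],
    "Attrition": ["No", "No"],
})

# Collect every column that is still not numeric.
non_numeric = [col for col in df.columns if not is_numeric_dtype(df[col])]
print(non_numeric)
```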

In [17]:
data

Unnamed: 0,EmpNumber,Age,Gender,EducationBackground,MaritalStatus,EmpDepartment,EmpJobRole,BusinessTravelFrequency,DistanceFromHome,EmpEducationLevel,...,EmpRelationshipSatisfaction,TotalWorkExperienceInYears,TrainingTimesLastYear,EmpWorkLifeBalance,ExperienceYearsAtThisCompany,ExperienceYearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,Attrition,PerformanceRating
0,0,14,1,2,2,5,13,2,10,3,...,4,10,2,2,10,7,0,8,0,3
1,1,29,1,2,2,5,13,2,14,4,...,4,20,2,3,7,7,1,7,0,3
2,2,22,1,1,1,5,13,1,5,4,...,3,20,2,3,18,13,1,12,0,4
3,3,23,1,0,0,3,8,2,10,4,...,2,23,2,2,21,6,12,6,0,3
4,4,42,1,2,2,5,13,2,16,4,...,4,10,1,3,2,2,2,2,0,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1195,1195,9,0,3,0,5,13,1,3,1,...,2,6,3,3,6,5,0,4,0,4
1196,1196,19,1,1,2,1,15,2,10,2,...,1,4,2,3,1,0,0,0,0,3
1197,1197,32,1,3,1,1,15,2,28,1,...,3,20,3,3,20,8,3,8,0,3
1198,1198,16,0,3,2,0,1,2,9,3,...,2,9,3,4,8,7,7,7,0,3


In [18]:
data.to_excel("Final_data.xlsx")

### NOTE :- This saves the data in Excel format (.xlsx), the current Excel file format. The original data has now been preprocessed into processed data that is ready for further programming.
### It is saved as Final_data.xlsx.
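
### By default, to_excel also writes the row index as the first column, which appears as "Unnamed: 0" when the file is read back. Passing index=False avoids this. The sketch below shows the effect with to_csv and an in-memory buffer so that it runs without an Excel engine; to_excel accepts the same index parameter. The frame is hypothetical.

```python
import io
import pandas as pd

# Hypothetical stand-in for the processed data.
df = pd.DataFrame({"Age": [14, 29], "Gender": [1, 1]})

# Default behaviour: the row index is written as an unnamed first column.
buffer_with_index = io.StringIO()
df.to_csv(buffer_with_index)
buffer_with_index.seek(0)
cols_with_index = pd.read_csv(buffer_with_index).columns.tolist()

# index=False: only the data columns are written.
buffer_without_index = io.StringIO()
df.to_csv(buffer_without_index, index=False)
buffer_without_index.seek(0)
cols_without_index = pd.read_csv(buffer_without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

### For the Excel file itself, the equivalent call would be data.to_excel("Final_data.xlsx", index=False).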