<div style="background-color: #FFD700; color: #333333; padding: 10px;">

# **Human Resources Analysis: Predicting Attrition**

</div>


<div style="background-color: #f0f0f0; color: #000000; padding: 10px;">
  <strong>Project Contributors</strong>
  <ul>
    <li>Gonçalo Alves</li>
    <li>Gonçalo Eloy</li>
    <li>Maria Beatriz Amado</li>
    <li>Mariana Pereira</li>
  </ul>
</div>


<div style="background-color: #3399CC; color: white; padding: 10px;">
<a id='scrub'>
<h3 style="color: white;"><strong>Introduction</strong></h3>
</a>
</div>


<div style="background-color: #333333; color: #ffffff; padding: 1px;">
This project aims to predict employee attrition within a single company's workforce.

As data scientists, our goal is to build a model that estimates how likely an employee is to stay or quit, based on their characteristics.

Although the problem is complex and the data are incomplete, the model will help the HR team take preventive measures to reduce attrition, such as adjusting salaries, assigning more engaging projects, and offering remote work options. The project is also an opportunity to explore feature extraction from diverse data sources, and it addresses key employee-retention questions for a multinational consultancy firm.

Our main objectives are descriptive analytics, to uncover correlations, and predictive analytics, to build classification models that predict attrition.

</div>

In [22]:
#pip install --upgrade pip
#pip install pandas
#pip install matplotlib
#pip install seaborn

<a id='toc'></a>

### Table of Contents
* [1. Obtain Data](#obtain-data)
    * [1.1. Load Libraries](#lib)
    * [1.2. Import data](#import)
    * [1.3. Dimensionality of the dataframe](#dim)
    * [1.4. Check missing values](#miss)

* [2. Scrub data](#scrub)
    * [2.1. Information about columns](#info)
    * [2.2. Checking duplicate values](#duplicates)
    * [2.3. Missing values](#var)
        * [2.3.1. Categorical Variables](#cat)
        * [2.3.2. Numerical Variables](#num)
    * [2.4. Data Cleaning: Outliers](#out)

* [3. Explore data](#explore)
    * [3.1. Basic Exploration](#exp)
    * [3.2. Data Visualization](#vis)
    * [3.3. Statistical Exploration](#stat)


* [4. Model](#model)
* [5. Interpret](#int)





<div id="obtain-data"></div>
<div style="background-color: #3399CC; color: white; padding: 10px;">
<a id='scrub'>
<h3 style="color: white;"><strong>1. Obtain Data</strong></h3>
</a>
</div>
    


**__`1.1`__ Load Libraries** 

<a id='lib'></a>


In [21]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

**__`1.2`__ Import data**

<a id='import'></a>

In [11]:
data = pd.read_csv("https://raw.githubusercontent.com/beatrizamado/HR-Analysis/main/HR_DS.csv")
data

| | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | ... | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | ... | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | ... | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 5 | ... | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 7 | ... | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1465 | 36 | No | Travel_Frequently | 884 | Research & Development | 23 | 2 | Medical | 1 | 2061 | ... | 3 | 80 | 1 | 17 | 3 | 3 | 5 | 2 | 0 | 3 |
| 1466 | 39 | No | Travel_Rarely | 613 | Research & Development | 6 | 1 | Medical | 1 | 2062 | ... | 1 | 80 | 1 | 9 | 5 | 3 | 7 | 7 | 1 | 7 |
| 1467 | 27 | No | Travel_Rarely | 155 | Research & Development | 4 | 3 | Life Sciences | 1 | 2064 | ... | 2 | 80 | 1 | 6 | 0 | 3 | 6 | 2 | 0 | 3 |
| 1468 | 49 | No | Travel_Frequently | 1023 | Sales | 2 | 3 | Medical | 1 | 2065 | ... | 4 | 80 | 0 | 17 | 3 | 2 | 9 | 6 | 0 | 8 |


**__`1.3`__ Dimensionality of the dataframe**

<a id='dim'></a>

In [12]:
data.shape

(1470, 35)

<span style="font-size:small">With 35 columns, the dataset has high dimensionality, so we will need to reduce the number of features before modelling.</span>
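
One inexpensive first step toward reducing dimensionality is to drop columns that hold the same value in every row, since they cannot help separate employees who leave from those who stay. The sketch below runs on a small made-up frame (`toy`) that only borrows column names from the dataset; the helper name `drop_constant_columns` is an assumption for illustration and is not part of the original notebook.

```python
import pandas as pd


def drop_constant_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without columns that contain a single distinct value."""
    constant_cols = [col for col in df.columns if df[col].nunique(dropna=False) <= 1]
    return df.drop(columns=constant_cols)


# Toy values for illustration only (not a sample drawn from HR_DS.csv).
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "EmployeeCount": [1, 1, 1],
    "StandardHours": [80, 80, 80],
    "Over18": ["Y", "Y", "Y"],
})

reduced = drop_constant_columns(toy)
print(reduced.columns.tolist())  # ['Age']
```

On the real data the same call would be `drop_constant_columns(data)`. The summary statistics in this notebook show `EmployeeCount` with a standard deviation of 0 and `Over18` with a single unique value, which makes both of them candidates for removal.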

**__`1.4`__ Checking missing values**

<a id='miss'></a>

In [13]:
data.isna().sum()

Age                         0
Attrition                   0
BusinessTravel              0
DailyRate                   0
Department                  0
DistanceFromHome            0
Education                   0
EducationField              0
EmployeeCount               0
EmployeeNumber              0
EnvironmentSatisfaction     0
Gender                      0
HourlyRate                  0
JobInvolvement              0
JobLevel                    0
JobRole                     0
JobSatisfaction             0
MaritalStatus               0
MonthlyIncome               0
MonthlyRate                 0
NumCompaniesWorked          0
Over18                      0
OverTime                    0
PercentSalaryHike           0
PerformanceRating           0
RelationshipSatisfaction    0
StandardHours               0
StockOptionLevel            0
TotalWorkingYears           0
TrainingTimesLastYear       0
WorkLifeBalance             0
YearsAtCompany              0
YearsInCurrentRole          0
YearsSince

<div style="background-color: #3399CC; color: white; padding: 10px;">
<a id='scrub'>
<h3 style="color: white;"><strong>2. Describe Data</strong></h3>
</a>
</div>
   

**__`2.1`__ Information about columns**

<a id='info'></a>

In [14]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   int64 
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement            1470 non-null   int64 
 14  JobLevel                

In [15]:
data.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Age | 1470.0 | 36.92381 | 9.135373 | 18.0 | 30.0 | 36.0 | 43.0 | 60.0 |
| DailyRate | 1470.0 | 802.485714 | 403.5091 | 102.0 | 465.0 | 802.0 | 1157.0 | 1499.0 |
| DistanceFromHome | 1470.0 | 9.192517 | 8.106864 | 1.0 | 2.0 | 7.0 | 14.0 | 29.0 |
| Education | 1470.0 | 2.912925 | 1.024165 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
| EmployeeCount | 1470.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| EmployeeNumber | 1470.0 | 1024.865306 | 602.024335 | 1.0 | 491.25 | 1020.5 | 1555.75 | 2068.0 |
| EnvironmentSatisfaction | 1470.0 | 2.721769 | 1.093082 | 1.0 | 2.0 | 3.0 | 4.0 | 4.0 |
| HourlyRate | 1470.0 | 65.891156 | 20.329428 | 30.0 | 48.0 | 66.0 | 83.75 | 100.0 |
| JobInvolvement | 1470.0 | 2.729932 | 0.711561 | 1.0 | 2.0 | 3.0 | 3.0 | 4.0 |
| JobLevel | 1470.0 | 2.063946 | 1.10694 | 1.0 | 1.0 | 2.0 | 3.0 | 5.0 |


**__`2.2`__ Checking duplicates**

<a id='duplicates'></a>

In [16]:
duplicates = data[data.duplicated()]  # save all duplicate rows
print("There are {} duplicates.".format(len(duplicates)))
data = data.drop_duplicates().reset_index(drop=True)  # remove duplicates and rebuild the index

print("\n")
print("All the duplicates were saved in a dataframe named: 'duplicates'")

There are 0 duplicates.


All the duplicates were saved in a dataframe named: 'duplicates'


**__`2.3`__ Missing values**

<a id='var'></a>

In [17]:
# Split the columns by dtype: object columns are treated as categorical
categoricalVar = data.select_dtypes(include=['object']).columns.tolist()
numericalVar = data.select_dtypes(exclude=['object']).columns.tolist()

print("\nThe numerical variables are: \n{}.".format(numericalVar))
print("\nThe non-numerical variables are:\n{}.".format(categoricalVar))

print("\nIn summary:")
pd.DataFrame([categoricalVar, numericalVar], index=['Categorical Variables', 'Numerical Variables']).T


The numerical variables are: 
['Age', 'DailyRate', 'DistanceFromHome', 'Education', 'EmployeeCount', 'EmployeeNumber', 'EnvironmentSatisfaction', 'HourlyRate', 'JobInvolvement', 'JobLevel', 'JobSatisfaction', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked', 'PercentSalaryHike', 'PerformanceRating', 'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager'].

The non-numerical variables are:
['Attrition', 'BusinessTravel', 'Department', 'EducationField', 'Gender', 'JobRole', 'MaritalStatus', 'Over18', 'OverTime'].

In summary:


| | Categorical Variables | Numerical Variables |
|---|---|---|
| 0 | Attrition | Age |
| 1 | BusinessTravel | DailyRate |
| 2 | Department | DistanceFromHome |
| 3 | EducationField | Education |
| 4 | Gender | EmployeeCount |
| 5 | JobRole | EmployeeNumber |
| 6 | MaritalStatus | EnvironmentSatisfaction |
| 7 | Over18 | HourlyRate |
| 8 | OverTime | JobInvolvement |
| 9 | | JobLevel |


**__`2.3.1.`__ Categorical Variables**

<a id='cat'></a>

In [18]:
pd.DataFrame(data[categoricalVar]).describe()

| | Attrition | BusinessTravel | Department | EducationField | Gender | JobRole | MaritalStatus | Over18 | OverTime |
|---|---|---|---|---|---|---|---|---|---|
| count | 1470 | 1470 | 1470 | 1470 | 1470 | 1470 | 1470 | 1470 | 1470 |
| unique | 2 | 3 | 3 | 6 | 2 | 9 | 3 | 1 | 2 |
| top | No | Travel_Rarely | Research & Development | Life Sciences | Male | Sales Executive | Married | Y | No |
| freq | 1233 | 1043 | 961 | 606 | 882 | 326 | 673 | 1470 | 1054 |


**__`2.3.2.`__ Pandas Profiling**

<a id='cat'></a>

In [None]:
pip install pandas-profiling

In [None]:
pip install pydantic-settings

In [1]:
from pydantic_settings import BaseSettings

In [None]:
from pandas_profiling import ProfileReport

**__`2.3.3.`__ Cardinality**

<a id='cat'></a>
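
Cardinality here means the number of distinct values in each column. As a minimal sketch, `DataFrame.nunique()` returns that count per column; the frame `toy` below is a made-up stand-in that reuses three column names from the dataset, and its values are for illustration only.

```python
import pandas as pd

# Toy values for illustration only (not a sample drawn from HR_DS.csv).
toy = pd.DataFrame({
    "Attrition": ["Yes", "No", "Yes", "No"],
    "BusinessTravel": ["Travel_Rarely", "Travel_Frequently", "Travel_Rarely", "Travel_Rarely"],
    "Over18": ["Y", "Y", "Y", "Y"],
})

# Number of distinct values per column, highest first.
cardinality = toy.nunique().sort_values(ascending=False)
print(cardinality)
```

On the real data, `data[categoricalVar].nunique()` should reproduce the `unique` row of the categorical summary above. A column with a cardinality of 1 carries no information for the model.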

**__`2.3.4.`__ Correlations**

<a id='cat'></a>
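
A possible starting point for the correlation analysis is the Pearson correlation matrix of the numerical columns, with the target encoded as 0/1 so that it can be included. The sketch below uses the first five rows shown in the data preview of section 1.2, restricted to four columns; the column name `AttritionFlag` is an assumption introduced here, and five rows are far too few to draw conclusions from.

```python
import pandas as pd

# First five preview rows from section 1.2, restricted to four columns.
toy = pd.DataFrame({
    "Age": [41, 49, 37, 33, 27],
    "TotalWorkingYears": [8, 10, 7, 8, 6],
    "YearsAtCompany": [6, 10, 0, 8, 2],
    "Attrition": ["Yes", "No", "Yes", "No", "No"],
})

# Encode the target as 1 (left) / 0 (stayed); AttritionFlag is a hypothetical name.
toy["AttritionFlag"] = toy["Attrition"].map({"Yes": 1, "No": 0})

# Pearson correlation between every pair of numerical columns.
corr = toy.select_dtypes(include="number").corr()

# Correlation of each feature with the target, sorted from most negative to most positive.
print(corr["AttritionFlag"].drop("AttritionFlag").sort_values())
```

With the full `data` frame the same steps apply, and `sns.heatmap(corr, annot=True)` would draw the matrix using the seaborn import from section 1.1.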

In [None]:
pip install 

<div style="background-color: #3399CC; color: white; padding: 10px;">
<a id='scrub'>
<h3 style="color: white;"><strong>3. Feature Engineering</strong></h3>
</a>
</div>


##### **New Feature Creation**

**Variable: PercentSalaryHike**

Consider binning it into the following categories:
- Small Increase
- Medium Increase
- High Increase
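
The binning above can be sketched with `pd.cut`. The cut points (13 and 18) and the sample values are assumptions chosen only for illustration; the real thresholds should be picked after inspecting the distribution of `PercentSalaryHike`.

```python
import pandas as pd

# Toy salary-hike percentages for illustration only.
hikes = pd.Series([11, 13, 15, 18, 22, 25], name="PercentSalaryHike")

# Assumed cut points: (0, 13] small, (13, 18] medium, (18, inf) high.
bins = [0, 13, 18, float("inf")]
labels = ["Small Increase", "Medium Increase", "High Increase"]

# pd.cut uses right-closed intervals by default, so 13 is "Small" and 18 is "Medium".
hike_category = pd.cut(hikes, bins=bins, labels=labels)
print(hike_category.tolist())
```

Applied to the notebook's frame, this would read `data["SalaryHikeCategory"] = pd.cut(data["PercentSalaryHike"], bins=bins, labels=labels)`, where `SalaryHikeCategory` is a hypothetical column name.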

<div style="background-color: #3399CC; color: white; padding: 10px;">
<a id='model'>
<h3 style="color: white;"><strong>4. Model</strong></h3>
</a>
</div>


**Logistic Regression**
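
As a rough baseline, a logistic regression can be wrapped in a scikit-learn pipeline that scales numerical features and one-hot encodes categorical ones. The sketch below is not the project's final model: it assumes scikit-learn is installed (the notebook does not import it yet), uses a made-up frame `toy` with only two features, and sets `class_weight="balanced"` because the categorical summary shows 1233 of 1470 employees with `Attrition = No`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy values for illustration only (not a sample drawn from HR_DS.csv).
toy = pd.DataFrame({
    "Age": [41, 49, 37, 33, 27, 36, 39, 27],
    "OverTime": ["Yes", "No", "Yes", "Yes", "No", "No", "No", "No"],
    "Attrition": ["Yes", "No", "Yes", "No", "No", "No", "No", "No"],
})

X = toy[["Age", "OverTime"]]
y = toy["Attrition"].map({"Yes": 1, "No": 0})  # 1 = employee left

# Scale the numerical column and one-hot encode the categorical one.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["OverTime"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
model.fit(X, y)

# Estimated probability of leaving for each row.
leave_probability = model.predict_proba(X)[:, 1]
print(leave_probability.round(2))
```

On the real data the features would come from `numericalVar` and `categoricalVar` (minus the target), and the model should be evaluated on a held-out split, for example with `sklearn.model_selection.train_test_split`, rather than on the rows it was trained on.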

**Neural Network (NN)**

<div style="background-color: #3399CC; color: white; padding: 10px;">
<a id='int'>
<h3 style="color: white;"><strong>5. Interpret</strong></h3>
</a>
</div>
