## Exploratory Data Analysis - Employee Attrition and Factors

This notebook performs an exploratory analysis of a comprehensive dataset that gives a detailed view of an organization's employees. The dataset covers a range of parameters, including personal, job-related, and financial characteristics. By exploring areas such as employee attrition and personal and professional factors, we can identify meaningful insights and patterns in workforce dynamics.

With this information, the analysis aims to identify patterns, trends, and correlations that can contribute to a deeper understanding of workforce management philosophies. As the analysis progresses, we hope to extract knowledge that can inform management strategies that are more effective and better aligned with the ongoing changes in workplace dynamics.

The dataset was obtained from Kaggle, a platform well known among data scientists and other data professionals, and also known for its Machine Learning competitions. The dataset is available at the link below:

Dataset link -> https://www.kaggle.com/datasets/thedevastator/employee-attrition-and-factors

Let's begin the analysis...

### 1. Importing Libraries

In [1]:
# Data manipulation and cleaning
import pandas as pd
import numpy as np

# EDA - data visualization
import matplotlib.pyplot as pyplot
import plotly.graph_objects as go
import plotly.express as px
import plotly.subplots as ps

# Table formatting
from tabulate import tabulate

# Silence warnings
import warnings
warnings.filterwarnings("ignore")

# Show every column when displaying a pandas DataFrame
pd.set_option("display.max_columns", None)

### 2. Dataset and Analysis

In [2]:
df = pd.read_csv("./HR_Analytics.csv", sep=',')

df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,JobRole,JobSatisfaction,MaritalStatus,MonthlyIncome,MonthlyRate,NumCompaniesWorked,Over18,OverTime,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,2,Female,94,3,2,Sales Executive,4,Single,5993,19479,8,Y,Yes,11,3,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,3,Male,61,2,2,Research Scientist,2,Married,5130,24907,1,Y,No,23,4,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,4,Male,92,2,1,Laboratory Technician,3,Single,2090,2396,6,Y,Yes,15,3,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,4,Female,56,3,1,Research Scientist,3,Married,2909,23159,1,Y,Yes,11,3,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,1,Male,40,3,1,Laboratory Technician,2,Married,3468,16632,9,Y,No,12,3,4,80,1,6,3,3,2,2,2,2


In [48]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   int64 
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement            1470 non-null   int64 
 14  JobLevel                
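
In the rows displayed by `df.head()`, the columns `EmployeeCount`, `Over18`, and `StandardHours` show the same value for every employee. A minimal sketch of how such constant columns could be detected follows. It uses a small made-up frame (`sample`, a hypothetical name) rather than the real file, so the result only illustrates the technique and says nothing about the full dataset.

```python
import pandas as pd

# Tiny illustrative frame. The values mirror the first rows shown by df.head(),
# but this is NOT the full dataset.
sample = pd.DataFrame({
    "Age": [41, 49, 37],
    "EmployeeCount": [1, 1, 1],
    "Over18": ["Y", "Y", "Y"],
    "StandardHours": [80, 80, 80],
})

# A column with a single distinct value cannot help explain attrition.
n_unique = sample.nunique()
constant_columns = n_unique[n_unique == 1].index.tolist()
print(constant_columns)

# Missing values per column, as a complement to the non-null counts of df.info().
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

On the real data, the same two checks could be run on `df` directly before deciding which columns to plot.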

In [3]:
fig = go.Figure(
    data=[
        go.Bar(name="Yes", x=['Yes'], y=[df['Attrition'].value_counts()['Yes']], marker_color='#9FE2BF'),
        go.Bar(name="No", x=['No'], y=[df['Attrition'].value_counts()['No']], marker_color='salmon'),
    ]
)

# Update the layout
fig.update_layout(title='Answer counts of Attrition',
                    xaxis_title='Answer',
                    yaxis_title='Count')

# Display the chart
fig.show()
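
The chart above shows raw counts; the share of each answer is often easier to compare. A minimal sketch using `value_counts(normalize=True)` is shown below. The Series `attrition` is made-up illustrative data, not the notebook's column, so the percentages are not results for this dataset.

```python
import pandas as pd

# Illustrative answers only; replace with df['Attrition'] in the notebook.
attrition = pd.Series(
    ["Yes", "No", "No", "No", "Yes", "No", "No", "No"], name="Attrition"
)

# normalize=True returns proportions instead of counts.
share = attrition.value_counts(normalize=True).mul(100).round(2)
print(share)
```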

In [4]:
df['BusinessTravel'].value_counts()

BusinessTravel
Travel_Rarely        1043
Travel_Frequently     277
Non-Travel            150
Name: count, dtype: int64

In [5]:
fig = go.Figure(
    data=[
        go.Bar(name="Travel_Rarely", x=['Travel_Rarely'], y=[df['BusinessTravel'].value_counts()['Travel_Rarely']], marker_color='#9FE2BF'),
        go.Bar(name="Travel_Frequently", x=['Travel_Frequently'], y=[df['BusinessTravel'].value_counts()['Travel_Frequently']], marker_color='salmon'),
        go.Bar(name="Non-Travel", x=['Non-Travel'], y=[df['BusinessTravel'].value_counts()['Non-Travel']], marker_color='#D7BDE2'),
    ]
)

# Update the layout
fig.update_layout(title='Count of employees who go on Business Travel',
                    xaxis_title='Business travel frequency',
                    yaxis_title='Count')

# Display the chart
fig.show()


In [6]:
df

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,JobRole,JobSatisfaction,MaritalStatus,MonthlyIncome,MonthlyRate,NumCompaniesWorked,Over18,OverTime,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,2,Female,94,3,2,Sales Executive,4,Single,5993,19479,8,Y,Yes,11,3,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,3,Male,61,2,2,Research Scientist,2,Married,5130,24907,1,Y,No,23,4,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,4,Male,92,2,1,Laboratory Technician,3,Single,2090,2396,6,Y,Yes,15,3,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,4,Female,56,3,1,Research Scientist,3,Married,2909,23159,1,Y,Yes,11,3,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,1,Male,40,3,1,Laboratory Technician,2,Married,3468,16632,9,Y,No,12,3,4,80,1,6,3,3,2,2,2,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1465,36,No,Travel_Frequently,884,Research & Development,23,2,Medical,1,2061,3,Male,41,4,2,Laboratory Technician,4,Married,2571,12290,4,Y,No,17,3,3,80,1,17,3,3,5,2,0,3
1466,39,No,Travel_Rarely,613,Research & Development,6,1,Medical,1,2062,4,Male,42,2,3,Healthcare Representative,1,Married,9991,21457,4,Y,No,15,3,1,80,1,9,5,3,7,7,1,7
1467,27,No,Travel_Rarely,155,Research & Development,4,3,Life Sciences,1,2064,2,Male,87,4,2,Manufacturing Director,2,Married,6142,5174,1,Y,Yes,20,4,2,80,1,6,0,3,6,2,0,3
1468,49,No,Travel_Frequently,1023,Sales,2,3,Medical,1,2065,4,Male,63,2,2,Sales Executive,2,Married,5390,13243,2,Y,No,14,3,4,80,0,17,3,2,9,6,0,8


In [7]:
df['EnvironmentSatisfaction'].value_counts()

EnvironmentSatisfaction
3    453
4    446
2    287
1    284
Name: count, dtype: int64

In [8]:
fig = go.Figure(
    data=[
        go.Bar(name="Research & Development", x=['Research & Development'], y=[df['Department'].value_counts()['Research & Development']], marker_color='#9FE2BF'),
        go.Bar(name="Sales", x=['Sales'], y=[df['Department'].value_counts()['Sales']], marker_color='salmon'),
        go.Bar(name="Human Resources", x=['Human Resources'], y=[df['Department'].value_counts()['Human Resources']], marker_color='#D7BDE2'),
])

# Update the layout
fig.update_layout(title='Employees by department',
                    xaxis_title='Departments',
                    yaxis_title='Count')

# Display the chart
fig.show()

In [9]:
education_field_data = df['EducationField'].value_counts().reset_index()
education_field_data.columns = ['EducationField', 'Count']

fig = go.Figure(
    go.Bar(
    x=education_field_data['EducationField'],
    y=education_field_data['Count'],
    text=education_field_data['Count'],
    textposition='outside',            
    marker_color=['#7DCEA0', '#82E0AA', '#D7BDE2', '#9FE2BF', '#40E0D0', '#DAF7A6']
))

# Configure the chart layout
fig.update_layout(
    title="The field of study for the employee's education",
    xaxis_title='Education field',
    yaxis_title='Count',
    height=550
)

# Show the chart
fig.show()

In [10]:
# Count occurrences for the 'Gender' feature
gender_counts = df['Gender'].value_counts().reset_index()
gender_counts.columns = ['Gender', 'count']

# Bar chart for the counts
fig_bar = go.Figure(go.Bar(
    x=gender_counts['Gender'],
    y=gender_counts['count'],
    text=gender_counts['count'],
    textposition='outside',
    marker_color=['#40E0D0', '#F08080'],
))

fig_bar.update_layout(
    title='The gender of the employee.',
    xaxis_title='Gender',
    yaxis_title='Count',
    height=550
)

# Pie chart for the percentages
fig_pie = px.pie(gender_counts, values='count', names='Gender', title='Gender distribution', color_discrete_sequence=['#40E0D0', '#F08080'], hole=0.2)

# Show the charts
fig_bar.show()
fig_pie.show()

In [11]:
jobrole_counts = df['JobRole'].value_counts().reset_index()
jobrole_counts.columns = ['JobRole', 'Count']

colors = [
    '#FFB6C1',  # Light pink
    '#FFD700',  # Golden yellow
    '#98FB98',  # Light green
    '#87CEFA',  # Sky blue
    '#FFA07A',  # Light salmon
    '#FF69B4',  # Hot pink
    '#ADD8E6',  # Light blue
    '#FF6347',  # Tomato red
    '#D2B48C'   # Light brown
]

# Bar chart for the counts
fig_bar = go.Figure(go.Bar(
    x=jobrole_counts['JobRole'],
    y=jobrole_counts['Count'],
    text=jobrole_counts['Count'],
    textposition='outside',
    marker_color=colors,
))

fig_bar.update_layout(
    title='The role of the employee in the organization',
    xaxis_title='Job role',
    yaxis_title='Count',
    height=550
)

# Pie chart for the percentages
fig_pie = px.pie(jobrole_counts, values='Count', names='JobRole', title='JobRole distribution', color_discrete_sequence=colors, hole=0.1)

# Show the charts
fig_bar.show()
fig_pie.show()

#### Creating a list of colors to use throughout the analysis

In [12]:
colors = [
    '#FF6347', 
    '#FF7F50', 
    '#FFA07A', 
    '#FFB6C1', 
    '#FFC0CB', 
    '#FFD700', 
    '#FFDAB9', 
    '#FFFACD', 
    '#FAFAD2', 
    '#F0FFF0'
]


In [13]:
df_leaves_company = df[df['Attrition'] == 'Yes']
df_stays_company = df[df['Attrition'] == 'No']

In [14]:
df_leaves_company

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,JobRole,JobSatisfaction,MaritalStatus,MonthlyIncome,MonthlyRate,NumCompaniesWorked,Over18,OverTime,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,2,Female,94,3,2,Sales Executive,4,Single,5993,19479,8,Y,Yes,11,3,1,80,0,8,0,1,6,4,0,5
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,4,Male,92,2,1,Laboratory Technician,3,Single,2090,2396,6,Y,Yes,15,3,2,80,0,7,3,3,0,0,0,0
14,28,Yes,Travel_Rarely,103,Research & Development,24,3,Life Sciences,1,19,3,Male,50,2,1,Laboratory Technician,3,Single,2028,12947,5,Y,Yes,14,3,2,80,0,6,4,3,4,2,0,3
21,36,Yes,Travel_Rarely,1218,Sales,9,4,Life Sciences,1,27,3,Male,82,2,1,Sales Representative,1,Single,3407,6986,7,Y,No,23,4,2,80,0,10,4,3,5,3,0,3
24,34,Yes,Travel_Rarely,699,Research & Development,6,1,Medical,1,31,2,Male,83,3,1,Research Scientist,1,Single,2960,17102,2,Y,No,11,3,3,80,0,8,2,3,4,2,1,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1438,23,Yes,Travel_Frequently,638,Sales,9,3,Marketing,1,2023,4,Male,33,3,1,Sales Representative,1,Married,1790,26956,1,Y,No,19,3,1,80,1,1,3,2,1,0,1,0
1442,29,Yes,Travel_Rarely,1092,Research & Development,1,4,Medical,1,2027,1,Male,36,3,1,Research Scientist,4,Married,4787,26124,9,Y,Yes,14,3,2,80,3,4,3,4,2,2,2,2
1444,56,Yes,Travel_Rarely,310,Research & Development,7,2,Technical Degree,1,2032,4,Male,72,3,1,Laboratory Technician,3,Married,2339,3666,8,Y,No,11,3,4,80,1,14,4,1,10,9,9,8
1452,50,Yes,Travel_Frequently,878,Sales,1,4,Life Sciences,1,2044,2,Male,94,3,2,Sales Executive,3,Divorced,6728,14255,7,Y,No,12,3,4,80,2,12,3,3,6,3,0,1


In [15]:
leaves_company_travel = df_leaves_company['BusinessTravel']

leaves_company_travel_counts = leaves_company_travel.value_counts().reset_index()
leaves_company_travel_counts.columns = ['BusinessTravel', 'Count']

fig = go.Figure(
    go.Bar(
        x=leaves_company_travel_counts['BusinessTravel'],
        y=leaves_company_travel_counts['Count'],
        text=round((leaves_company_travel_counts['Count'] / leaves_company_travel_counts['Count'].sum()) * 100 , 2).apply(lambda x: f'{x:.2f}%'),
        textposition='outside',
        marker_color=colors
    )
)

fig.update_layout(
    title='Frequency of Business Travels by departing employees',
    xaxis_title='Business Travel',
    yaxis_title='Count',
    height=550
)

fig.show()
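
The percentages in this chart are shares of the departing employees, so they largely follow the size of each group: the `value_counts()` output earlier shows that 1043 of the 1470 employees are `Travel_Rarely`. A complementary view is the attrition rate inside each travel category. The sketch below is one way to compute it with `pd.crosstab`; the frame `toy` is made-up data chosen so that both groups have the same number of leavers, and its numbers are not results for this dataset.

```python
import pandas as pd

# Made-up data: 8 rare travelers and 4 frequent travelers, 2 leavers in each group.
toy = pd.DataFrame({
    "BusinessTravel": ["Travel_Rarely"] * 8 + ["Travel_Frequently"] * 4,
    "Attrition": ["Yes", "Yes", "No", "No", "No", "No", "No", "No",
                  "Yes", "Yes", "No", "No"],
})

# normalize="index" makes each row sum to 1, giving the rate within each category.
rates = (
    pd.crosstab(toy["BusinessTravel"], toy["Attrition"], normalize="index")
    .mul(100)
    .round(2)
)
print(rates)
```

In this toy frame the two groups contribute equally to the leavers, yet the rate among frequent travelers is twice as high. In the notebook, the same call on `df['BusinessTravel']` and `df['Attrition']` would give the real rates.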

In [16]:
leaves_company_department = df_leaves_company['Department']

leaves_company_department_counts = leaves_company_department.value_counts().reset_index()
leaves_company_department_counts.columns = ['Department', 'Count']

fig = go.Figure(
    go.Bar(
        x=leaves_company_department_counts['Department'],
        y=leaves_company_department_counts['Count'],
        text=round((leaves_company_department_counts['Count'] / leaves_company_department_counts['Count'].sum()) * 100 , 2).apply(lambda x: f'{x:.2f}%'),
        textposition='outside',
        marker_color=[
                '#FFB6C1',  # Light pink
                '#87CEFA',  # Sky blue
                '#98FB98',  # Light green
                '#FFD700',  # Golden yellow
                '#FFA07A',  # Light salmon
                '#FF69B4',  # Hot pink
                '#ADD8E6',  # Light blue
                '#FF6347',  # Tomato red
                '#D2B48C',  # Light brown
                '#F08080'   # Light coral
            ]
    )
)

fig.update_layout(
    title='Department distribution by departing employees',
    xaxis_title='Department',
    yaxis_title='Count',
    height=550
)

fig.show()

In [17]:
# Map each gender category to a color
color_map = {'Female': '#FFB6C1', 'Male': '#87CEFA'}

fig = px.histogram(
    df_leaves_company, 
    x='Age', 
    color='Gender',
    marginal="box",
    hover_data=df_leaves_company.columns,
    color_discrete_map=color_map  # Apply the color mapping
)

fig.show()


In [18]:
fig = px.histogram(
    df_leaves_company,
    x='DistanceFromHome', 
    marginal="box",
    hover_data=df_leaves_company.columns,
    nbins=30,
    color_discrete_sequence=['#5D6D7E']
)

fig.show()

In [19]:
color_map_overtime = {'No': '#FD008A', 'Yes': '#007BFD'}

fig = px.histogram(
    df_leaves_company, 
    x="Gender", 
    color='OverTime',
    barmode='group',
    height=400,
    color_discrete_map=color_map_overtime
)

fig.show()

In [30]:
jobsatisfaction = df_leaves_company['JobSatisfaction'].value_counts().reset_index()
jobsatisfaction.columns = ['JobSatisfaction', 'Count']

fig = go.Figure(
    go.Bar(
        x = jobsatisfaction['JobSatisfaction'],
        y = jobsatisfaction['Count'],
        text = round((jobsatisfaction['Count'] / jobsatisfaction['Count'].sum()) * 100, 2).apply(lambda x : f'{x:.2f}%'),
        textposition = 'outside',
        marker_color=['#1F497D', '#4F81BD', '#85A3E0', '#B8CCE4']
    )
)

fig.update_layout(
    title='Distribution of JobSatisfaction by departing employees',
    xaxis_title='JobSatisfaction',
    yaxis_title='Count',
    height=550
)

fig.show()
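
`value_counts()` orders its result by frequency, so the list passed to `marker_color` is assigned to the most frequent rating first rather than to level 1 first. For an ordinal scale such as `JobSatisfaction`, sorting by the rating itself may read more naturally. A minimal sketch with `sort_index()` follows; the Series `ratings` is made-up illustrative data.

```python
import pandas as pd

# Illustrative ratings only; replace with df_leaves_company['JobSatisfaction'].
ratings = pd.Series([3, 1, 3, 4, 2, 3, 1, 4, 3], name="JobSatisfaction")

# Default order: most frequent rating first.
by_frequency = ratings.value_counts()

# Ordinal order: rating 1, 2, 3, 4, so colors can follow the scale.
by_level = ratings.value_counts().sort_index()
print(by_level)
```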

In [41]:
fig = px.histogram(
    df_leaves_company,
    x='YearsAtCompany', 
    marginal="violin",
    hover_data=df_leaves_company.columns,
    nbins=40,
    color_discrete_sequence=['#660C3D']
)

fig.show()


In [47]:
fig = px.histogram(
    df_leaves_company,
    x='YearsSinceLastPromotion', 
    marginal="violin",
    hover_data=df_leaves_company.columns,
    nbins=15,
    color_discrete_sequence=['#2E8DAF']
)

fig.show()
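
The histograms above cover only `df_leaves_company`; `df_stays_company` is created earlier but not plotted. One way to put the two groups side by side is a grouped summary. The sketch below uses a made-up frame (`toy`) and the median as the summary statistic, which is an assumption for illustration; its numbers are not results for this dataset.

```python
import pandas as pd

# Made-up data with the same column names as the notebook's DataFrame.
toy = pd.DataFrame({
    "Attrition": ["Yes", "Yes", "Yes", "No", "No", "No", "No"],
    "YearsAtCompany": [1, 2, 6, 5, 10, 8, 7],
    "YearsSinceLastPromotion": [0, 1, 0, 1, 3, 2, 0],
})

# Median of each tenure-related column for leavers ("Yes") and stayers ("No").
summary = toy.groupby("Attrition")[
    ["YearsAtCompany", "YearsSinceLastPromotion"]
].median()
print(summary)
```

Running the same `groupby` on `df` would show whether the patterns seen for departing employees differ from those of employees who stay.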
