<a href="https://colab.research.google.com/github/auramolina/Analitica-en-recursos-humanos/blob/main/giberto1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## <font color='0C2054'> **Libraries** </font>

In [4]:
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer
import matplotlib.pyplot as plt
import seaborn as sns

import warnings


warnings.filterwarnings("ignore")

from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import silhouette_score, calinski_harabasz_score
!pip install kneed
from kneed import KneeLocator

from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans, DBSCAN
from sklearn.mixture import GaussianMixture
from sklearn.decomposition import PCA, FactorAnalysis

!pip install factor_analyzer
from factor_analyzer import FactorAnalyzer



## <font color='0c2054'> **DATA IMPORT** </font>

In [5]:
# Read the data
data_general = pd.read_csv("https://raw.githubusercontent.com/auramolina/Analitica-en-recursos-humanos/main/general_data.csv", delimiter=';')
manager_survey = pd.read_csv("https://raw.githubusercontent.com/auramolina/Analitica-en-recursos-humanos/main/manager_survey.csv", delimiter=';')
retirement_info = pd.read_csv("https://raw.githubusercontent.com/auramolina/Analitica-en-recursos-humanos/main/retirement_info.csv", delimiter=';')
employee_survey = pd.read_csv("https://raw.githubusercontent.com/auramolina/Analitica-en-recursos-humanos/main/employee_survey_data.csv", delimiter=';')

In [6]:
# Configure pandas to display all columns
pd.set_option('display.max_columns', None)

## <font color='0c2054'> **Dataset 1** </font>
data_general




### <font color='0C2054'> **Feature selection** </font>

In [7]:
data_general.sort_values(by=['EmployeeID'], ascending=True)

Unnamed: 0,Age,BusinessTravel,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeID,Gender,JobLevel,JobRole,MaritalStatus,MonthlyIncome,NumCompaniesWorked,Over18,PercentSalaryHike,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,YearsAtCompany,YearsSinceLastPromotion,YearsWithCurrManager,InfoDate
0,51,Travel_Rarely,Sales,6,2,Life Sciences,1,1,Female,1,Healthcare Representative,Married,131160,1.0,Y,11,8,0,1.0,6,1,0,0,31/12/2015
4410,51,Travel_Rarely,Sales,6,2,Life Sciences,1,1,Female,1,Healthcare Representative,Married,131160,1.0,Y,11,8,0,1.0,6,1,0,0,31/12/2016
1,31,Travel_Frequently,Research & Development,10,1,Life Sciences,1,2,Female,1,Research Scientist,Single,41890,0.0,Y,23,8,1,6.0,3,5,1,4,31/12/2015
2,32,Travel_Frequently,Research & Development,17,4,Other,1,3,Male,4,Sales Executive,Married,193280,1.0,Y,15,8,3,5.0,2,5,0,3,31/12/2015
4412,32,Travel_Frequently,Research & Development,17,4,Other,1,3,Male,4,Sales Executive,Married,193280,1.0,Y,15,8,3,5.0,2,5,0,3,31/12/2016
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8798,33,Travel_Rarely,Sales,1,3,Life Sciences,1,8799,Male,2,Manager,Married,51470,7.0,Y,11,8,0,13.0,2,9,1,7,31/12/2016
8801,32,Travel_Rarely,Sales,23,1,Life Sciences,1,8802,Male,3,Healthcare Representative,Single,24680,0.0,Y,11,8,0,4.0,2,3,1,2,31/12/2016
4391,32,Travel_Rarely,Sales,23,1,Life Sciences,1,8802,Male,3,Healthcare Representative,Single,24680,0.0,Y,11,8,0,4.0,2,3,1,2,31/12/2015
8812,37,Travel_Frequently,Sales,2,3,Marketing,1,8813,Male,1,Laboratory Technician,Divorced,40010,6.0,Y,11,8,1,17.0,2,1,0,0,31/12/2016


In [8]:
for column in data_general.columns:
    print(f"Frequency in column: {column}")
    print(data_general[column].value_counts())
    print("\n")

Frequency in column: Age
Age
35    468
34    462
31    414
36    414
29    408
32    366
30    360
38    348
33    348
40    342
37    300
28    288
27    288
42    276
39    252
45    246
41    240
26    234
46    198
44    198
43    192
50    180
24    156
25    156
49    144
47    144
55    132
48    114
51    114
53    114
52    108
54    108
22     96
58     84
23     84
56     84
21     78
20     66
59     60
19     54
18     48
60     30
57     24
Name: count, dtype: int64


Frequency in column: BusinessTravel
BusinessTravel
Travel_Rarely        6258
Travel_Frequently    1662
Non-Travel            900
Name: count, dtype: int64


Frequency in column: Department
Department
Research & Development    5766
Sales                     2676
Human Resources            378
Name: count, dtype: int64


Frequency in column: DistanceFromHome
DistanceFromHome
2     1266
1     1248
10     516
9      510
3      504
7      504
8      480
5      390
4      384
6      354
16     

While looking for variables that carry no useful information, we examined the value frequencies and found that EmployeeCount, Over18, and StandardHours hold the same value in every record:

* Over18: the value always indicates that the person is over 18, which is a requirement for employment.

* StandardHours: the value is always 8 hours, the standard working day.

* EmployeeCount: the value is always 1, the employee count for each record.

Because these values are constant and add no variability to the data, we drop these variables from the analysis.
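The same check can be done programmatically instead of reading the frequency tables by eye. The snippet below is a minimal sketch on a toy frame (the `toy` DataFrame and its values are illustrative, not the real `data_general`): a column is flagged as constant when it holds a single distinct value.

```python
import pandas as pd

# Toy frame standing in for data_general (illustrative values only)
toy = pd.DataFrame({
    "Age": [51, 31, 32],
    "EmployeeCount": [1, 1, 1],
    "Over18": ["Y", "Y", "Y"],
    "StandardHours": [8, 8, 8],
})

# Count distinct values per column, treating NaN as a value of its own
n_unique = toy.nunique(dropna=False)

# Columns with exactly one distinct value carry no variability
constant_cols = n_unique[n_unique == 1].index.tolist()
print(constant_cols)
```

Applied to the real data, the resulting list could be passed directly to `drop(columns=...)`.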

In [9]:
# Drop non-informative variables
data_general.drop(['EmployeeCount','Over18','StandardHours'], axis=1, inplace=True)

# Dataset size
data_general.shape

(8820, 21)

In [10]:
# General information about the dataset
data_general.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8820 entries, 0 to 8819
Data columns (total 21 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Age                      8820 non-null   int64  
 1   BusinessTravel           8820 non-null   object 
 2   Department               8820 non-null   object 
 3   DistanceFromHome         8820 non-null   int64  
 4   Education                8820 non-null   int64  
 5   EducationField           8820 non-null   object 
 6   EmployeeID               8820 non-null   int64  
 7   Gender                   8820 non-null   object 
 8   JobLevel                 8820 non-null   int64  
 9   JobRole                  8820 non-null   object 
 10  MaritalStatus            8820 non-null   object 
 11  MonthlyIncome            8820 non-null   int64  
 12  NumCompaniesWorked       8782 non-null   float64
 13  PercentSalaryHike        8820 non-null   int64  
 14  StockOptionLevel        

In [11]:
# Convert the date variable to datetime
data_general['InfoDate'] = pd.to_datetime(data_general['InfoDate'], format='%d/%m/%Y')
data_general['InfoDate']

Unnamed: 0,InfoDate
0,2015-12-31
1,2015-12-31
2,2015-12-31
3,2015-12-31
4,2015-12-31
...,...
8815,2016-12-31
8816,2016-12-31
8817,2016-12-31
8818,2016-12-31


In [12]:
# Check for null values in the dataset
data_general.isnull().sum()

Unnamed: 0,0
Age,0
BusinessTravel,0
Department,0
DistanceFromHome,0
Education,0
EducationField,0
EmployeeID,0
Gender,0
JobLevel,0
JobRole,0


In [13]:
# Filter the data for 2016
dg2016 = data_general[data_general['InfoDate'].dt.year == 2016]

dg2016.sort_values(by=['EmployeeID'], ascending=True)

Unnamed: 0,Age,BusinessTravel,Department,DistanceFromHome,Education,EducationField,EmployeeID,Gender,JobLevel,JobRole,MaritalStatus,MonthlyIncome,NumCompaniesWorked,PercentSalaryHike,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,YearsAtCompany,YearsSinceLastPromotion,YearsWithCurrManager,InfoDate
4410,51,Travel_Rarely,Sales,6,2,Life Sciences,1,Female,1,Healthcare Representative,Married,131160,1.0,11,0,1.0,6,1,0,0,2016-12-31
4412,32,Travel_Frequently,Research & Development,17,4,Other,3,Male,4,Sales Executive,Married,193280,1.0,15,3,5.0,2,5,0,3,2016-12-31
4413,38,Non-Travel,Research & Development,2,5,Life Sciences,4,Male,3,Human Resources,Married,83210,3.0,11,3,13.0,5,8,7,5,2016-12-31
4414,32,Travel_Rarely,Research & Development,10,1,Medical,5,Male,1,Sales Executive,Single,23420,4.0,12,2,9.0,2,6,0,4,2016-12-31
4415,46,Travel_Rarely,Research & Development,8,3,Life Sciences,6,Female,4,Research Director,Married,40710,3.0,13,0,28.0,5,7,7,7,2016-12-31
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8791,29,Travel_Rarely,Research & Development,7,1,Life Sciences,8792,Female,1,Research Scientist,Single,21800,1.0,21,0,4.0,2,4,0,1,2016-12-31
8796,33,Travel_Rarely,Sales,11,4,Marketing,8797,Male,1,Research Scientist,Married,71400,5.0,21,0,8.0,2,5,0,4,2016-12-31
8798,33,Travel_Rarely,Sales,1,3,Life Sciences,8799,Male,2,Manager,Married,51470,7.0,11,0,13.0,2,9,1,7,2016-12-31
8801,32,Travel_Rarely,Sales,23,1,Life Sciences,8802,Male,3,Healthcare Representative,Single,24680,0.0,11,0,4.0,2,3,1,2,2016-12-31


There are nulls in the variables "NumCompaniesWorked" and "TotalWorkingYears". Because each of these variables can provide information about the other, we first inspect the entries with nulls in each variable and then decide how to impute the data.
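One way to inspect those entries is to select every row that has a null in either variable, keeping the related columns visible for comparison. This is a minimal sketch on a toy frame; the `toy` DataFrame and its values are illustrative, not taken from the real data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for data_general (illustrative values only)
toy = pd.DataFrame({
    "EmployeeID": [1, 2, 3, 4],
    "NumCompaniesWorked": [1.0, np.nan, 3.0, 0.0],
    "TotalWorkingYears": [1.0, 6.0, np.nan, 4.0],
    "YearsAtCompany": [1, 5, 8, 3],
})

cols = ["NumCompaniesWorked", "TotalWorkingYears"]

# Keep rows where at least one of the two variables is null
rows_with_nulls = toy[toy[cols].isnull().any(axis=1)]
print(rows_with_nulls)
```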

In [14]:
#data_general[data_general['NumCompaniesWorked'].isnull()]

In [15]:
#data_general[data_general['TotalWorkingYears'].isnull()]

Some entries report many years of experience that do not coincide with the years they have spent at the company, so we cannot assume those employees have worked at only one company. We therefore decided to impute the nulls in "NumCompaniesWorked" with the median of the variable.

For "TotalWorkingYears", some entries report having worked at several companies and having spent several years at their current company, so we cannot assume they have no experience. We therefore impute these nulls with the median, so that the imputed value is not affected by extreme values.

In [16]:
## Impute data

#data_general['NumCompaniesWorked'].fillna(data_general['NumCompaniesWorked'].median(), inplace=True)
#data_general['TotalWorkingYears'].fillna(data_general['TotalWorkingYears'].median(), inplace=True)

#data_general.isnull().sum()
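The median imputation described above can be sketched as follows. The example uses a toy frame with illustrative values and assigns the result back to the column rather than calling `fillna(..., inplace=True)` on a column selection, a pattern that recent pandas versions warn about.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for data_general (illustrative values only)
toy = pd.DataFrame({
    "NumCompaniesWorked": [1.0, np.nan, 3.0, 0.0, 7.0],
    "TotalWorkingYears": [1.0, 6.0, np.nan, 4.0, 13.0],
})

# Replace the nulls in each column with that column's median
for col in ["NumCompaniesWorked", "TotalWorkingYears"]:
    toy[col] = toy[col].fillna(toy[col].median())

# Confirm that no nulls remain
print(toy.isnull().sum())
```

The median is computed from the non-null values only, so the imputed value does not depend on how many entries are missing.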

## <font color='0c2054'> **Dataset 2** </font>

### <font color='0C2054'> **Feature selection** </font>

In [17]:
manager_survey.columns

Index(['EmployeeID', 'JobInvolvement', 'PerformanceRating', 'SurveyDate'], dtype='object')

In [18]:
# Inspect the data
manager_survey.sort_values(by=['EmployeeID'], ascending=True)

Unnamed: 0,EmployeeID,JobInvolvement,PerformanceRating,SurveyDate
0,1,3,3,31/12/2015
4410,1,3,3,31/12/2016
1,2,2,4,31/12/2015
2,3,3,3,31/12/2015
4412,3,3,3,31/12/2016
...,...,...,...,...
8798,8799,3,3,31/12/2016
8801,8802,3,3,31/12/2016
4391,8802,3,3,31/12/2015
8812,8813,3,3,31/12/2016


In [19]:
# General information about the dataset
manager_survey.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8820 entries, 0 to 8819
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   EmployeeID         8820 non-null   int64 
 1   JobInvolvement     8820 non-null   int64 
 2   PerformanceRating  8820 non-null   int64 
 3   SurveyDate         8820 non-null   object
dtypes: int64(3), object(1)
memory usage: 275.8+ KB


In [20]:
# Check for null values in the dataset
manager_survey.isnull().sum()

Unnamed: 0,0
EmployeeID,0
JobInvolvement,0
PerformanceRating,0
SurveyDate,0


In [21]:
# Convert the survey date to datetime so that the .dt accessor can be used
manager_survey['SurveyDate'] = pd.to_datetime(manager_survey['SurveyDate'], format='%d/%m/%Y')

# Filter the data for 2016
ms2016 = manager_survey[manager_survey['SurveyDate'].dt.year == 2016]

ms2016.sort_values(by=['EmployeeID'], ascending=True)

## <font color='0c2054'> **Dataset 3** </font>

### <font color='0c2054'> **FEATURE SELECTION** </font>

In [None]:
print(retirement_info.columns)

In [None]:
# Inspect the data
retirement_info.sort_values(by=['EmployeeID'], ascending=True)

In [None]:
# General information about the dataset
retirement_info.info()

In [None]:
# Check for null values in the dataset
retirement_info.isnull().sum()

We will ask the professor how to handle these data.

In [None]:
retirement_info['retirementDate'] = pd.to_datetime(retirement_info['retirementDate'])

# Filter the data for 2016 and the retirement type 'Resignation'
retirement_info_2016 = retirement_info[
    (retirement_info['retirementDate'].dt.year == 2016) &
    (retirement_info['retirementType'] == "Resignation")
]

# Inspect the data
retirement_info_2016.sort_values(by=['EmployeeID'], ascending=True)

In [None]:
# Check for null values in the dataset
retirement_info.isnull().sum()

## <font color='0c2054'> **Dataset 4** </font>

### <font color='0c2054'> **FEATURE SELECTION** </font>

In [None]:
print(employee_survey.columns)

In [None]:
# Inspect the data
employee_survey.sort_values(by=['EmployeeID'], ascending=True)

In [None]:
# General information about the dataset
employee_survey.info()

In [None]:
# Check for null values in the dataset
employee_survey.isnull().sum()

In [None]:
employee_survey['DateSurvey'] = pd.to_datetime(employee_survey['DateSurvey'], format='%d/%m/%Y')
employee_survey['DateSurvey']

In [None]:
# Filter the data for 2016 (copy so that later changes do not touch employee_survey)
es2016 = employee_survey[employee_survey['DateSurvey'].dt.year == 2016].copy()

es2016.sort_values(by=['EmployeeID'], ascending=True)

In [None]:
# Drop non-informative variables (errors='ignore' skips any listed column that is absent)
es2016 = es2016.drop(columns=['DateSurvey', 'InfoDate'], errors='ignore')

## <font color='0c2054'> **Complete dataset** </font>

2016 dataset

In [None]:
# Join the DataFrames (left joins on EmployeeID)
d2016 = (dg2016
         .merge(es2016, on='EmployeeID', how='left')
         .merge(ms2016, on='EmployeeID', how='left')
         .merge(retirement_info_2016, on='EmployeeID', how='left'))

d2016.sort_values(by=['EmployeeID'], ascending=True)
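A left join keeps every row of the left table and fills the right-hand columns with NaN where no match exists. The sketch below illustrates this on two toy frames (`general` and `retired` are hypothetical stand-ins with illustrative values). The `validate` and `indicator` arguments of `DataFrame.merge` are optional checks: the first raises an error if `EmployeeID` is duplicated on either side, and the second records where each row came from.

```python
import pandas as pd

# Toy stand-ins for dg2016 and retirement_info_2016 (illustrative values only)
general = pd.DataFrame({"EmployeeID": [1, 2, 3], "Age": [51, 31, 32]})
retired = pd.DataFrame({"EmployeeID": [2], "retirementType": ["Resignation"]})

# Left join: all employees are kept, unmatched ones get NaN in retirementType
merged = general.merge(
    retired,
    on="EmployeeID",
    how="left",
    validate="one_to_one",  # fail early if a key is duplicated on either side
    indicator=True,         # adds a _merge column: 'both' or 'left_only'
)
print(merged)
```

With one row per employee per year in each table, the row count after the join should equal the row count of the left table.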

In [None]:
# Convert the variables 'Attrition' and 'retirementType' into dummy variables
d2016_dummis = pd.get_dummies(d2016, columns=['Attrition', 'retirementType'], drop_first=True)

d2016_dummis.sort_values(by=['EmployeeID'], ascending=True)
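One point worth checking here: after filtering to "Resignation" and left-joining, `retirementType` holds a single non-null category plus NaN. In that situation `pd.get_dummies` with `drop_first=True` drops the only level and produces no dummy column at all. The sketch below shows this on a toy frame (illustrative values) and one possible alternative, a 0/1 flag built from `notna()`; the column name `resigned` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for d2016 after the left join (illustrative values only)
toy = pd.DataFrame({
    "EmployeeID": [1, 2, 3],
    "retirementType": [np.nan, "Resignation", np.nan],
})

# With a single non-null category, drop_first=True leaves no dummy column
dummies = pd.get_dummies(toy, columns=["retirementType"], drop_first=True)
print(dummies.columns.tolist())

# Alternative sketch: flag employees that matched a resignation record
toy["resigned"] = toy["retirementType"].notna().astype(int)
print(toy)
```

Whether the same applies to `Attrition` depends on the values that column takes in `retirement_info`, which are not shown here.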

In [None]:
# Check for null values in the dataset
d2016.isnull().sum()