<p align="center">
<img src="https://www.python.org/static/community_logos/python-powered-h-70x91.png?raw=true" width="100" height="">
</p>

 # **<font color="#07a8ed">Exploratory Data Analysis and Data Preprocessing</font>**




## **<font color="#07a8ed">Libraries</font>**

### **<font color="#07a8ed">For data analysis</font>**

In [1]:
import pandas as pd
import numpy as np

### **<font color="#07a8ed">For data preprocessing</font>**

In [2]:
from sklearn.preprocessing import LabelEncoder

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html


In [3]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html


https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html


https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html


 ## **<font color="#07a8ed">Getting the Data</font>**

In [4]:
url = "https://raw.githubusercontent.com/LucaAPiattelli/Diplomatura_Business_Analytics_UDA/main/Modulo_08_Aprendizaje_Automatico/visualizacion.csv"

In [5]:
analisis = pd.read_csv(url)  # read the dataset from the URL
analisis.head()  # show the first five rows

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
0,0.38,0.53,2,157,3,0,1,0,sales,low
1,0.8,0.86,5,262,6,0,1,0,sales,medium
2,0.11,0.88,7,272,4,0,1,0,sales,medium
3,0.72,0.87,5,223,5,0,1,0,sales,low
4,0.37,0.52,2,159,3,0,1,0,sales,low


The columns are renamed to make them easier to interpret.

In [6]:
analisis.info()  # dataset summary: columns, non-null counts and dtypes

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   satisfaction_level     14999 non-null  float64
 1   last_evaluation        14999 non-null  float64
 2   number_project         14999 non-null  int64  
 3   average_montly_hours   14999 non-null  int64  
 4   time_spend_company     14999 non-null  int64  
 5   Work_accident          14999 non-null  int64  
 6   left                   14999 non-null  int64  
 7   promotion_last_5years  14999 non-null  int64  
 8   sales                  14999 non-null  object 
 9   salary                 14999 non-null  object 
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB


In [7]:
analisis.rename(columns={"satisfaction_level":"niveldesatisfaccion",  # rename the columns to Spanish
                         "last_evaluation":"ultimaevaluacion",
                         "number_project":"numerosdeproyectos",
                         "average_montly_hours":"horasmensualespromedio",
                         "time_spend_company":"tiempoenlaempresa",
                         "Work_accident":"accidentedetrabajo",
                         "left":"abandono",
                         "promotion_last_5years":"promocionultimos5años",
                         "sales":"ventas",
                         "salary":"sueldo"}, inplace= True)

In [8]:
analisis.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   niveldesatisfaccion     14999 non-null  float64
 1   ultimaevaluacion        14999 non-null  float64
 2   numerosdeproyectos      14999 non-null  int64  
 3   horasmensualespromedio  14999 non-null  int64  
 4   tiempoenlaempresa       14999 non-null  int64  
 5   accidentedetrabajo      14999 non-null  int64  
 6   abandono                14999 non-null  int64  
 7   promocionultimos5años   14999 non-null  int64  
 8   ventas                  14999 non-null  object 
 9   sueldo                  14999 non-null  object 
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB


In [9]:
###analisis.sueldo.value_counts()

 ## **<font color="#07a8ed">Data cleaning and transformation</font>**

 ### **<font color="#07a8ed">Filtering data</font>**

In [10]:
analisis.head(2)

Unnamed: 0,niveldesatisfaccion,ultimaevaluacion,numerosdeproyectos,horasmensualespromedio,tiempoenlaempresa,accidentedetrabajo,abandono,promocionultimos5años,ventas,sueldo
0,0.38,0.53,2,157,3,0,1,0,sales,low
1,0.8,0.86,5,262,6,0,1,0,sales,medium


In [11]:
analisis.ventas.value_counts()

Unnamed: 0_level_0,count
ventas,Unnamed: 1_level_1
sales,4140
technical,2720
support,2229
IT,1227
product_mng,902
marketing,858
RandD,787
accounting,767
hr,739
management,630


In [12]:
analisis.rename(columns={"ventas":"sector"}, inplace=True)  # rename the ventas column to sector

In [13]:
analisis.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   niveldesatisfaccion     14999 non-null  float64
 1   ultimaevaluacion        14999 non-null  float64
 2   numerosdeproyectos      14999 non-null  int64  
 3   horasmensualespromedio  14999 non-null  int64  
 4   tiempoenlaempresa       14999 non-null  int64  
 5   accidentedetrabajo      14999 non-null  int64  
 6   abandono                14999 non-null  int64  
 7   promocionultimos5años   14999 non-null  int64  
 8   sector                  14999 non-null  object 
 9   sueldo                  14999 non-null  object 
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB


In [14]:
analisis.filter(["sector", "sueldo"])  # keep only the sector and sueldo columns

Unnamed: 0,sector,sueldo
0,sales,low
1,sales,medium
2,sales,medium
3,sales,low
4,sales,low
...,...,...
14994,support,low
14995,support,low
14996,support,low
14997,support,low


In [16]:
analisis.sueldo.value_counts()

Unnamed: 0_level_0,count
sueldo,Unnamed: 1_level_1
low,7316
medium,6446
high,1237


In [17]:
###analisis.sueldo.value_counts()/len(analisis.sueldo)  ### target variable?

In [18]:
analisis.filter([0,1,2,5,7,19], axis=0)  # select rows by index label (axis=0)

Unnamed: 0,niveldesatisfaccion,ultimaevaluacion,numerosdeproyectos,horasmensualespromedio,tiempoenlaempresa,accidentedetrabajo,abandono,promocionultimos5años,sector,sueldo
0,0.38,0.53,2,157,3,0,1,0,sales,low
1,0.8,0.86,5,262,6,0,1,0,sales,medium
2,0.11,0.88,7,272,4,0,1,0,sales,medium
5,0.41,0.5,2,153,3,0,1,0,sales,low
7,0.92,0.85,5,259,5,0,1,0,sales,low
19,0.76,0.89,5,262,5,0,1,0,sales,low


In [19]:
analisis[analisis.sector=="support"]  # keep the rows whose sector is "support"

Unnamed: 0,niveldesatisfaccion,ultimaevaluacion,numerosdeproyectos,horasmensualespromedio,tiempoenlaempresa,accidentedetrabajo,abandono,promocionultimos5años,sector,sueldo
46,0.40,0.55,2,147,3,0,1,0,support,low
47,0.57,0.70,3,273,6,0,1,0,support,low
48,0.40,0.54,2,148,3,0,1,0,support,low
49,0.43,0.47,2,147,3,0,1,0,support,low
50,0.13,0.78,6,152,2,0,1,0,support,low
...,...,...,...,...,...,...,...,...,...,...
14994,0.40,0.57,2,151,3,0,1,0,support,low
14995,0.37,0.48,2,160,3,0,1,0,support,low
14996,0.37,0.53,2,143,3,0,1,0,support,low
14997,0.11,0.96,6,280,4,0,1,0,support,low


 ### **<font color="#07a8ed">Missing values</font>**

In [20]:
url2 = "https://raw.githubusercontent.com/LucaAPiattelli/Diplomatura_Business_Analytics_UDA/main/Modulo_08_Aprendizaje_Automatico/empleados.csv"

In [21]:
empleados = pd.read_csv(url2)

In [22]:
empleados  # the empleados dataset

Unnamed: 0,nombre,edad,sueldo,sexo,sector,nivel,performance
0,Alan Smith,45.0,,,Operations,G3,723
1,Sandro Kumar,,16000.0,F,Finance,G0,520
2,Jacinto Morgan,32.0,35000.0,M,Finance,G2,674
3,Ernesto Chin,45.0,65000.0,F,Sales,G3,556
4,Fernanda Patel,30.0,42000.0,F,Operations,G2,711
5,Samara Sharma,,62000.0,,Sales,G3,649
6,Joaquin Fleiman,54.0,,F,Operations,G3,53
7,Juana Wilkis,54.0,52000.0,F,Finance,G3,901
8,Leonardo Doberti,23.0,98000.0,M,Sales,G4,709
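
Before dropping or filling anything, it can help to count how many values are missing. The sketch below is not part of the original notebook: it rebuilds three columns of the table above in a hypothetical `muestra` DataFrame (so it runs without downloading the CSV) and counts the missing cells per column and the incomplete rows.

```python
import pandas as pd

# Hypothetical inline copy of three columns from the table above.
# None is stored as NaN in the numeric columns.
muestra = pd.DataFrame({
    "edad":   [45, None, 32, 45, 30, None, 54, 54, 23],
    "sueldo": [None, 16000, 35000, 65000, 42000, 62000, None, 52000, 98000],
    "sexo":   [None, "F", "M", "F", "F", None, "F", "F", "M"],
})

# isna() marks each missing cell as True; sum() counts them per column.
faltantes = muestra.isna().sum()
print(faltantes)

# Number of rows with at least one missing value,
# i.e. the rows that dropna(how="any") would remove.
filas_incompletas = muestra.isna().any(axis=1).sum()
print(filas_incompletas)
```

Each of the three columns has two missing values, spread over four rows, which is why only five of the nine rows survive a `dropna(how="any")`.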


In [23]:
###empleados.info()

 ### **<font color="#07a8ed">Dropping NaN values</font>**


In [24]:
empleados.dropna(how="any", inplace=True)  # drop every row that has at least one NaN

In [25]:
empleados

Unnamed: 0,nombre,edad,sueldo,sexo,sector,nivel,performance
2,Jacinto Morgan,32.0,35000.0,M,Finance,G2,674
3,Ernesto Chin,45.0,65000.0,F,Sales,G3,556
4,Fernanda Patel,30.0,42000.0,F,Operations,G2,711
7,Juana Wilkis,54.0,52000.0,F,Finance,G3,901
8,Leonardo Doberti,23.0,98000.0,M,Sales,G4,709


In [26]:
##empleados.info()

 ### **<font color="#07a8ed">Filling NaN values</font>**

In [27]:
empleados = pd.read_csv(url2)
empleados

Unnamed: 0,nombre,edad,sueldo,sexo,sector,nivel,performance
0,Alan Smith,45.0,,,Operations,G3,723
1,Sandro Kumar,,16000.0,F,Finance,G0,520
2,Jacinto Morgan,32.0,35000.0,M,Finance,G2,674
3,Ernesto Chin,45.0,65000.0,F,Sales,G3,556
4,Fernanda Patel,30.0,42000.0,F,Operations,G2,711
5,Samara Sharma,,62000.0,,Sales,G3,649
6,Joaquin Fleiman,54.0,,F,Operations,G3,53
7,Juana Wilkis,54.0,52000.0,F,Finance,G3,901
8,Leonardo Doberti,23.0,98000.0,M,Sales,G4,709


In [28]:
empleados["edad"] = empleados.edad.fillna(empleados.edad.mean())  # fill the missing ages with the column mean
empleados

Unnamed: 0,nombre,edad,sueldo,sexo,sector,nivel,performance
0,Alan Smith,45.0,,,Operations,G3,723
1,Sandro Kumar,40.428571,16000.0,F,Finance,G0,520
2,Jacinto Morgan,32.0,35000.0,M,Finance,G2,674
3,Ernesto Chin,45.0,65000.0,F,Sales,G3,556
4,Fernanda Patel,30.0,42000.0,F,Operations,G2,711
5,Samara Sharma,40.428571,62000.0,,Sales,G3,649
6,Joaquin Fleiman,54.0,,F,Operations,G3,53
7,Juana Wilkis,54.0,52000.0,F,Finance,G3,901
8,Leonardo Doberti,23.0,98000.0,M,Sales,G4,709


In [29]:
empleados = pd.read_csv(url2)
empleados["edad"] = empleados.edad.fillna(empleados.edad.mean()).round(0)  # impute edad with the mean, rounded to a whole number
empleados

Unnamed: 0,nombre,edad,sueldo,sexo,sector,nivel,performance
0,Alan Smith,45.0,,,Operations,G3,723
1,Sandro Kumar,40.0,16000.0,F,Finance,G0,520
2,Jacinto Morgan,32.0,35000.0,M,Finance,G2,674
3,Ernesto Chin,45.0,65000.0,F,Sales,G3,556
4,Fernanda Patel,30.0,42000.0,F,Operations,G2,711
5,Samara Sharma,40.0,62000.0,,Sales,G3,649
6,Joaquin Fleiman,54.0,,F,Operations,G3,53
7,Juana Wilkis,54.0,52000.0,F,Finance,G3,901
8,Leonardo Doberti,23.0,98000.0,M,Sales,G4,709


In [30]:
empleados = pd.read_csv(url2)
empleados["edad"] = empleados.edad.fillna(empleados.edad.median())  # impute edad with the median
empleados

Unnamed: 0,nombre,edad,sueldo,sexo,sector,nivel,performance
0,Alan Smith,45.0,,,Operations,G3,723
1,Sandro Kumar,45.0,16000.0,F,Finance,G0,520
2,Jacinto Morgan,32.0,35000.0,M,Finance,G2,674
3,Ernesto Chin,45.0,65000.0,F,Sales,G3,556
4,Fernanda Patel,30.0,42000.0,F,Operations,G2,711
5,Samara Sharma,45.0,62000.0,,Sales,G3,649
6,Joaquin Fleiman,54.0,,F,Operations,G3,53
7,Juana Wilkis,54.0,52000.0,F,Finance,G3,901
8,Leonardo Doberti,23.0,98000.0,M,Sales,G4,709


In [31]:
empleados["sueldo"] = empleados.sueldo.fillna(empleados.sueldo.median())  # impute sueldo with the median
empleados

Unnamed: 0,nombre,edad,sueldo,sexo,sector,nivel,performance
0,Alan Smith,45.0,52000.0,,Operations,G3,723
1,Sandro Kumar,45.0,16000.0,F,Finance,G0,520
2,Jacinto Morgan,32.0,35000.0,M,Finance,G2,674
3,Ernesto Chin,45.0,65000.0,F,Sales,G3,556
4,Fernanda Patel,30.0,42000.0,F,Operations,G2,711
5,Samara Sharma,45.0,62000.0,,Sales,G3,649
6,Joaquin Fleiman,54.0,52000.0,F,Operations,G3,53
7,Juana Wilkis,54.0,52000.0,F,Finance,G3,901
8,Leonardo Doberti,23.0,98000.0,M,Sales,G4,709
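
The cells above fill the numeric columns, but `sexo` is categorical, so a mean or median does not apply. One common option, shown here only as a sketch and not taken from the original notebook, is to fill with the most frequent category (the mode). Whether that is appropriate depends on the analysis. The `sexo` Series below is a hypothetical inline copy of the column.

```python
import pandas as pd

# Hypothetical inline copy of the sexo column from the table above.
sexo = pd.Series([None, "F", "M", "F", "F", None, "F", "F", "M"], name="sexo")

# mode() ignores missing values and returns a Series (there can be ties),
# so take its first entry.
valor_mas_frecuente = sexo.mode().iloc[0]

# Replace the missing entries with the most frequent category.
sexo_imputado = sexo.fillna(valor_mas_frecuente)
print(sexo_imputado.tolist())
```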


 ## **<font color="#07a8ed">Encoding categorical variables</font>**

 ### **<font color="#07a8ed">The <code>get_dummies</code> function</font>**

<p align="justify">
The pandas function <code>get_dummies</code> converts categorical variables into dummy (indicator) variables, one column per category, so they can be used by machine learning algorithms that require numeric input.
</p>

In [32]:
empleados = pd.read_csv(url2)
empleados.dropna(how="any", inplace=True)
empleados.reset_index(drop=True, inplace=True)
empleados

Unnamed: 0,nombre,edad,sueldo,sexo,sector,nivel,performance
0,Jacinto Morgan,32.0,35000.0,M,Finance,G2,674
1,Ernesto Chin,45.0,65000.0,F,Sales,G3,556
2,Fernanda Patel,30.0,42000.0,F,Operations,G2,711
3,Juana Wilkis,54.0,52000.0,F,Finance,G3,901
4,Leonardo Doberti,23.0,98000.0,M,Sales,G4,709


In [33]:
empleadoscodificados = pd.get_dummies(empleados["sexo"])  # DataFrame with one dummy column per category of sexo

In [34]:
empleadoscodificados

Unnamed: 0,F,M
0,False,True
1,True,False
2,True,False
3,True,False
4,False,True


In [36]:
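empleados = empleados.join(empleadoscodificados)  # append the dummy columns to empleados

ValueError: columns overlap but no suffix specified: Index(['F', 'M'], dtype='object')

This error comes from running the cell a second time: the first run had already added the <code>F</code> and <code>M</code> columns (see the next output), so the repeated join finds them duplicated.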

In [37]:
empleados

Unnamed: 0,nombre,edad,sueldo,sexo,sector,nivel,performance,F,M
0,Jacinto Morgan,32.0,35000.0,M,Finance,G2,674,False,True
1,Ernesto Chin,45.0,65000.0,F,Sales,G3,556,True,False
2,Fernanda Patel,30.0,42000.0,F,Operations,G2,711,True,False
3,Juana Wilkis,54.0,52000.0,F,Finance,G3,901,True,False
4,Leonardo Doberti,23.0,98000.0,M,Sales,G4,709,False,True


In [38]:
##empleados.info()

In [39]:
empleados = empleados.drop(columns=["sexo"])  # drop the original sexo column

In [40]:
empleados

Unnamed: 0,nombre,edad,sueldo,sector,nivel,performance,F,M
0,Jacinto Morgan,32.0,35000.0,Finance,G2,674,False,True
1,Ernesto Chin,45.0,65000.0,Sales,G3,556,True,False
2,Fernanda Patel,30.0,42000.0,Operations,G2,711,True,False
3,Juana Wilkis,54.0,52000.0,Finance,G3,901,True,False
4,Leonardo Doberti,23.0,98000.0,Sales,G4,709,False,True


In [41]:
empleados = pd.read_csv(url2)
empleados.dropna(how="any", inplace=True)
empleados.reset_index(drop=True, inplace=True)
empleados

Unnamed: 0,nombre,edad,sueldo,sexo,sector,nivel,performance
0,Jacinto Morgan,32.0,35000.0,M,Finance,G2,674
1,Ernesto Chin,45.0,65000.0,F,Sales,G3,556
2,Fernanda Patel,30.0,42000.0,F,Operations,G2,711
3,Juana Wilkis,54.0,52000.0,F,Finance,G3,901
4,Leonardo Doberti,23.0,98000.0,M,Sales,G4,709


In [42]:
empleados = empleados.join(pd.get_dummies(empleados["sexo"], prefix="sexo"))
empleados.drop(columns = "sexo", inplace = True)

In [43]:
empleados

Unnamed: 0,nombre,edad,sueldo,sector,nivel,performance,sexo_F,sexo_M
0,Jacinto Morgan,32.0,35000.0,Finance,G2,674,False,True
1,Ernesto Chin,45.0,65000.0,Sales,G3,556,True,False
2,Fernanda Patel,30.0,42000.0,Operations,G2,711,True,False
3,Juana Wilkis,54.0,52000.0,Finance,G3,901,True,False
4,Leonardo Doberti,23.0,98000.0,Sales,G4,709,False,True


A more direct way:

In [44]:
empleados = pd.read_csv(url2)
empleados.dropna(how="any", inplace=True)
empleados.reset_index(drop=True, inplace=True)
empleados

Unnamed: 0,nombre,edad,sueldo,sexo,sector,nivel,performance
0,Jacinto Morgan,32.0,35000.0,M,Finance,G2,674
1,Ernesto Chin,45.0,65000.0,F,Sales,G3,556
2,Fernanda Patel,30.0,42000.0,F,Operations,G2,711
3,Juana Wilkis,54.0,52000.0,F,Finance,G3,901
4,Leonardo Doberti,23.0,98000.0,M,Sales,G4,709


In [45]:
empleados = pd.get_dummies(data=empleados, columns=['sexo'])
empleados

Unnamed: 0,nombre,edad,sueldo,sector,nivel,performance,sexo_F,sexo_M
0,Jacinto Morgan,32.0,35000.0,Finance,G2,674,False,True
1,Ernesto Chin,45.0,65000.0,Sales,G3,556,True,False
2,Fernanda Patel,30.0,42000.0,Operations,G2,711,True,False
3,Juana Wilkis,54.0,52000.0,Finance,G3,901,True,False
4,Leonardo Doberti,23.0,98000.0,Sales,G4,709,False,True
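
The dummy columns above are booleans, and `sexo_F` and `sexo_M` carry the same information. As a sketch, assuming 0/1 integers and a single column are wanted, `get_dummies` accepts `dtype` and `drop_first`. The `muestra` DataFrame is a hypothetical inline copy of two columns of the five complete rows.

```python
import pandas as pd

# Hypothetical inline copy of two columns from the five complete rows above.
muestra = pd.DataFrame({
    "nombre": ["Jacinto Morgan", "Ernesto Chin", "Fernanda Patel",
               "Juana Wilkis", "Leonardo Doberti"],
    "sexo": ["M", "F", "F", "F", "M"],
})

# dtype=int yields 0/1 instead of True/False.
# drop_first=True drops the first category (sexo_F), which is redundant
# once sexo_M is known.
codificado = pd.get_dummies(muestra, columns=["sexo"], dtype=int, drop_first=True)
print(codificado)
```

Dropping one category avoids perfectly collinear columns, which matters for some linear models. For tree-based models it is usually optional.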


### **<font color="#07a8ed">The <code>LabelEncoder</code> class</font>**

<p align="justify">
The scikit-learn class <code>LabelEncoder</code> encodes categorical labels as integers. It takes a list of labels and assigns each distinct label an integer from 0 to the number of distinct labels minus 1, following the sorted order of the labels.
</p>

In [46]:
empleados = pd.read_csv(url2)
empleados

Unnamed: 0,nombre,edad,sueldo,sexo,sector,nivel,performance
0,Alan Smith,45.0,,,Operations,G3,723
1,Sandro Kumar,,16000.0,F,Finance,G0,520
2,Jacinto Morgan,32.0,35000.0,M,Finance,G2,674
3,Ernesto Chin,45.0,65000.0,F,Sales,G3,556
4,Fernanda Patel,30.0,42000.0,F,Operations,G2,711
5,Samara Sharma,,62000.0,,Sales,G3,649
6,Joaquin Fleiman,54.0,,F,Operations,G3,53
7,Juana Wilkis,54.0,52000.0,F,Finance,G3,901
8,Leonardo Doberti,23.0,98000.0,M,Sales,G4,709


In [47]:
etiqueta_ordenada = LabelEncoder()

In [48]:
empleados["nivel_cod"] = etiqueta_ordenada.fit_transform(empleados["nivel"])

In [49]:
empleados

Unnamed: 0,nombre,edad,sueldo,sexo,sector,nivel,performance,nivel_cod
0,Alan Smith,45.0,,,Operations,G3,723,2
1,Sandro Kumar,,16000.0,F,Finance,G0,520,0
2,Jacinto Morgan,32.0,35000.0,M,Finance,G2,674,1
3,Ernesto Chin,45.0,65000.0,F,Sales,G3,556,2
4,Fernanda Patel,30.0,42000.0,F,Operations,G2,711,1
5,Samara Sharma,,62000.0,,Sales,G3,649,2
6,Joaquin Fleiman,54.0,,F,Operations,G3,53,2
7,Juana Wilkis,54.0,52000.0,F,Finance,G3,901,2
8,Leonardo Doberti,23.0,98000.0,M,Sales,G4,709,3
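
To make the mapping in `nivel_cod` explicit, the sketch below reproduces it by hand with plain Python: it sorts the distinct labels and numbers them from 0, which is the ordering `LabelEncoder` uses. The names `niveles`, `clases` and `codigo` are illustrative and do not appear in the notebook.

```python
# Values of the nivel column, in row order, copied from the table above.
niveles = ["G3", "G0", "G2", "G3", "G2", "G3", "G3", "G3", "G4"]

# Sort the distinct labels and number them from 0.
clases = sorted(set(niveles))
codigo = {etiqueta: i for i, etiqueta in enumerate(clases)}

# Encode every row with its label's position in the sorted list.
nivel_cod = [codigo[n] for n in niveles]
print(codigo)
print(nivel_cod)
```

Because no row has level `G1`, `G2` is encoded as 1: the codes reflect the rank among the labels actually present, not the number inside the label.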


 ## **<font color="#07a8ed">Scaling numeric variables</font>**

### **<font color="#07a8ed">The <code>StandardScaler</code> class</font>**

<p align="justify">
The scikit-learn class <code>StandardScaler</code> standardizes numeric variables: it subtracts the mean of each variable and divides by its standard deviation, so each variable ends up with mean zero and standard deviation one. Putting the variables on the same scale can improve the performance of many machine learning algorithms.
</p>

In [50]:
escala = StandardScaler()

In [51]:
escala.fit(empleados[["performance"]])

In [52]:
empleados["escala_perform"]=escala.transform(empleados[["performance"]])

In [53]:
empleados

Unnamed: 0,nombre,edad,sueldo,sexo,sector,nivel,performance,nivel_cod,escala_perform
0,Alan Smith,45.0,,,Operations,G3,723,2,0.505565
1,Sandro Kumar,,16000.0,F,Finance,G0,520,0,-0.408053
2,Jacinto Morgan,32.0,35000.0,M,Finance,G2,674,1,0.285037
3,Ernesto Chin,45.0,65000.0,F,Sales,G3,556,2,-0.246032
4,Fernanda Patel,30.0,42000.0,F,Operations,G2,711,1,0.451558
5,Samara Sharma,,62000.0,,Sales,G3,649,2,0.172522
6,Joaquin Fleiman,54.0,,F,Operations,G3,53,2,-2.509823
7,Juana Wilkis,54.0,52000.0,F,Finance,G3,901,2,1.306668
8,Leonardo Doberti,23.0,98000.0,M,Sales,G4,709,3,0.442557


In [54]:
empleados.escala_perform.describe()

Unnamed: 0,escala_perform
count,9.0
mean,1.912051e-16
std,1.06066
min,-2.509823
25%,-0.2460317
50%,0.2850367
75%,0.4515581
max,1.306668
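
The summary above shows a standard deviation of about 1.06 rather than exactly 1. The sketch below, which uses only the standard library so that the arithmetic stays visible, recomputes the standardization by hand and suggests why: `StandardScaler` divides by the population standard deviation (dividing by n), while `describe()` reports the sample standard deviation (dividing by n - 1). The variable names are illustrative.

```python
import statistics

# performance column, copied from the table above.
performance = [723, 520, 674, 556, 711, 649, 53, 901, 709]

media = statistics.fmean(performance)
# Population standard deviation (divides by n), as StandardScaler does.
desvio_poblacional = statistics.pstdev(performance)

# z-score of every row.
z = [(x - media) / desvio_poblacional for x in performance]
print(round(z[0], 6))  # first row (Alan Smith)

# Sample standard deviation of the z-scores (divides by n - 1),
# which is what describe() reports; it equals sqrt(n / (n - 1)).
desvio_muestral = statistics.stdev(z)
print(round(desvio_muestral, 5))
```

With nine rows the ratio is sqrt(9 / 8), which matches the `std` line of the summary. The gap shrinks as the number of rows grows.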


### **<font color="#07a8ed">The <code>MinMaxScaler</code> class</font>**

<p align="justify">
The scikit-learn class <code>MinMaxScaler</code> rescales each feature to a given range, by default 0 to 1, mapping the minimum to the lower bound and the maximum to the upper bound. It is useful when you want to keep the shape of the original distribution while fitting the values into a fixed range.
</p>

In [55]:
escalaminmax = MinMaxScaler()

In [56]:
escalaminmax.fit(empleados[["performance"]])

In [57]:
empleados["escala_minmax"]=escalaminmax.transform(empleados[["performance"]])

In [58]:
empleados

Unnamed: 0,nombre,edad,sueldo,sexo,sector,nivel,performance,nivel_cod,escala_perform,escala_minmax
0,Alan Smith,45.0,,,Operations,G3,723,2,0.505565,0.790094
1,Sandro Kumar,,16000.0,F,Finance,G0,520,0,-0.408053,0.550708
2,Jacinto Morgan,32.0,35000.0,M,Finance,G2,674,1,0.285037,0.732311
3,Ernesto Chin,45.0,65000.0,F,Sales,G3,556,2,-0.246032,0.59316
4,Fernanda Patel,30.0,42000.0,F,Operations,G2,711,1,0.451558,0.775943
5,Samara Sharma,,62000.0,,Sales,G3,649,2,0.172522,0.70283
6,Joaquin Fleiman,54.0,,F,Operations,G3,53,2,-2.509823,0.0
7,Juana Wilkis,54.0,52000.0,F,Finance,G3,901,2,1.306668,1.0
8,Leonardo Doberti,23.0,98000.0,M,Sales,G4,709,3,0.442557,0.773585
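
The `escala_minmax` column can be checked by hand. This sketch, in plain Python with illustrative variable names, applies the default 0 to 1 formula, (x - min) / (max - min), to the same `performance` values.

```python
# performance column, copied from the table above.
performance = [723, 520, 674, 556, 711, 649, 53, 901, 709]

minimo, maximo = min(performance), max(performance)

# Map the minimum to 0 and the maximum to 1, linearly in between.
escala_minmax = [(x - minimo) / (maximo - minimo) for x in performance]
print([round(v, 6) for v in escala_minmax])
```

The single low score of 53 becomes 0 and pushes every other row above 0.55, which illustrates how sensitive this scaling is to outliers.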


### **<font color="#07a8ed">The <code>RobustScaler</code> class</font>**

<p align="justify">
The scikit-learn class <code>RobustScaler</code> scales variables using statistics that are robust to outliers: it subtracts the median and divides by the interquartile range, which makes it less sensitive to outliers than <code>StandardScaler</code> or <code>MinMaxScaler</code>.
</p>

In [59]:
escala_robusta = RobustScaler()

In [60]:
escala_robusta.fit(empleados[["performance"]])

In [61]:
empleados["escala_robusta"]=escala_robusta.transform(empleados[["performance"]])

In [62]:
empleados

Unnamed: 0,nombre,edad,sueldo,sexo,sector,nivel,performance,nivel_cod,escala_perform,escala_minmax,escala_robusta
0,Alan Smith,45.0,,,Operations,G3,723,2,0.505565,0.790094,0.316129
1,Sandro Kumar,,16000.0,F,Finance,G0,520,0,-0.408053,0.550708,-0.993548
2,Jacinto Morgan,32.0,35000.0,M,Finance,G2,674,1,0.285037,0.732311,0.0
3,Ernesto Chin,45.0,65000.0,F,Sales,G3,556,2,-0.246032,0.59316,-0.76129
4,Fernanda Patel,30.0,42000.0,F,Operations,G2,711,1,0.451558,0.775943,0.23871
5,Samara Sharma,,62000.0,,Sales,G3,649,2,0.172522,0.70283,-0.16129
6,Joaquin Fleiman,54.0,,F,Operations,G3,53,2,-2.509823,0.0,-4.006452
7,Juana Wilkis,54.0,52000.0,F,Finance,G3,901,2,1.306668,1.0,1.464516
8,Leonardo Doberti,23.0,98000.0,M,Sales,G4,709,3,0.442557,0.773585,0.225806
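
As a check on the `escala_robusta` column, this sketch recomputes it with the standard library: subtract the median and divide by the interquartile range. It assumes the default 25th and 75th percentiles, and uses `method="inclusive"` so the quartiles are interpolated linearly over the data. Variable names are illustrative.

```python
import statistics

# performance column, copied from the table above.
performance = [723, 520, 674, 556, 711, 649, 53, 901, 709]

mediana = statistics.median(performance)

# Quartiles with linear interpolation over the observed data.
q1, _, q3 = statistics.quantiles(performance, n=4, method="inclusive")
iqr = q3 - q1

# Center on the median and scale by the interquartile range.
escala_robusta = [(x - mediana) / iqr for x in performance]
print([round(v, 6) for v in escala_robusta])
```

The row at the median (674) maps to 0, and the outlier 53 stays far from the rest instead of compressing it, because it influences neither the median nor the quartiles.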


In [63]:
empleados1 = empleados.copy()

In [64]:
empleados1['escala'] = empleados1.escala_perform - empleados1.escala_minmax - empleados1.escala_robusta  # standard scale minus min-max scale minus robust scale

In [65]:
empleados1

Unnamed: 0,nombre,edad,sueldo,sexo,sector,nivel,performance,nivel_cod,escala_perform,escala_minmax,escala_robusta,escala
0,Alan Smith,45.0,,,Operations,G3,723,2,0.505565,0.790094,0.316129,-0.600658
1,Sandro Kumar,,16000.0,F,Finance,G0,520,0,-0.408053,0.550708,-0.993548,0.034788
2,Jacinto Morgan,32.0,35000.0,M,Finance,G2,674,1,0.285037,0.732311,0.0,-0.447275
3,Ernesto Chin,45.0,65000.0,F,Sales,G3,556,2,-0.246032,0.59316,-0.76129,-0.077902
4,Fernanda Patel,30.0,42000.0,F,Operations,G2,711,1,0.451558,0.775943,0.23871,-0.563095
5,Samara Sharma,,62000.0,,Sales,G3,649,2,0.172522,0.70283,-0.16129,-0.369018
6,Joaquin Fleiman,54.0,,F,Operations,G3,53,2,-2.509823,0.0,-4.006452,1.496628
7,Juana Wilkis,54.0,52000.0,F,Finance,G3,901,2,1.306668,1.0,1.464516,-1.157848
8,Leonardo Doberti,23.0,98000.0,M,Sales,G4,709,3,0.442557,0.773585,0.225806,-0.556834


### **<font color="#07a8ed">Scale analysis</font>**

In [66]:
import plotly.express as px

In [67]:
px.line(empleados1, x="escala_robusta")

In [68]:
px.line(empleados1, x="escala_minmax")

In [69]:
px.line(empleados1, x="escala_perform")

In [70]:
px.line(empleados1, x="escala")

## **<font color="#07a8ed">Converting numeric variables to categorical</font>**

In [71]:
empleados = pd.read_csv(url2)

In [72]:
def grado_performance(performance):
    """Map a numeric performance score to a letter grade (A, B or C)."""
    if performance >= 700:
        return "A"
    elif performance >= 500:  # reached only when performance < 700
        return "B"
    else:
        return "C"

In [73]:
empleados["grado_performance"] = empleados.performance.apply(grado_performance)

In [74]:
empleados

Unnamed: 0,nombre,edad,sueldo,sexo,sector,nivel,performance,grado_performance
0,Alan Smith,45.0,,,Operations,G3,723,A
1,Sandro Kumar,,16000.0,F,Finance,G0,520,B
2,Jacinto Morgan,32.0,35000.0,M,Finance,G2,674,B
3,Ernesto Chin,45.0,65000.0,F,Sales,G3,556,B
4,Fernanda Patel,30.0,42000.0,F,Operations,G2,711,A
5,Samara Sharma,,62000.0,,Sales,G3,649,B
6,Joaquin Fleiman,54.0,,F,Operations,G3,53,C
7,Juana Wilkis,54.0,52000.0,F,Finance,G3,901,A
8,Leonardo Doberti,23.0,98000.0,M,Sales,G4,709,A
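
The same grading can be done without a custom function. As an alternative sketch, not the approach used in the notebook, `pd.cut` assigns each score to a bin. The `performance` Series below is a hypothetical inline copy of the column, and the bin edges mirror the thresholds of `grado_performance`.

```python
import pandas as pd

# performance column, copied from the table above.
performance = pd.Series([723, 520, 674, 556, 711, 649, 53, 901, 709])

# right=False makes each bin closed on the left:
# [-inf, 500) -> C, [500, 700) -> B, [700, inf) -> A.
grados = pd.cut(
    performance,
    bins=[float("-inf"), 500, 700, float("inf")],
    labels=["C", "B", "A"],
    right=False,
)
print(grados.tolist())
```

The result is a categorical Series, so the grades keep their order (C < B < A), which a column of plain strings would not.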


In [75]:
empleados = pd.read_csv(url2)

In [76]:
empleados["performance"] = empleados.performance.apply(grado_performance)  # overwrite the scores with their grades

In [77]:
empleados

Unnamed: 0,nombre,edad,sueldo,sexo,sector,nivel,performance
0,Alan Smith,45.0,,,Operations,G3,A
1,Sandro Kumar,,16000.0,F,Finance,G0,B
2,Jacinto Morgan,32.0,35000.0,M,Finance,G2,B
3,Ernesto Chin,45.0,65000.0,F,Sales,G3,B
4,Fernanda Patel,30.0,42000.0,F,Operations,G2,A
5,Samara Sharma,,62000.0,,Sales,G3,B
6,Joaquin Fleiman,54.0,,F,Operations,G3,C
7,Juana Wilkis,54.0,52000.0,F,Finance,G3,A
8,Leonardo Doberti,23.0,98000.0,M,Sales,G4,A


In [78]:
empleados_nivel = pd.get_dummies(empleados["nivel"])

In [79]:
empleados_nivel

Unnamed: 0,G0,G2,G3,G4
0,False,False,True,False
1,True,False,False,False
2,False,True,False,False
3,False,False,True,False
4,False,True,False,False
5,False,False,True,False
6,False,False,True,False
7,False,False,True,False
8,False,False,False,True


In [80]:
empleados = empleados.join(empleados_nivel)

In [81]:
empleados

Unnamed: 0,nombre,edad,sueldo,sexo,sector,nivel,performance,G0,G2,G3,G4
0,Alan Smith,45.0,,,Operations,G3,A,False,False,True,False
1,Sandro Kumar,,16000.0,F,Finance,G0,B,True,False,False,False
2,Jacinto Morgan,32.0,35000.0,M,Finance,G2,B,False,True,False,False
3,Ernesto Chin,45.0,65000.0,F,Sales,G3,B,False,False,True,False
4,Fernanda Patel,30.0,42000.0,F,Operations,G2,A,False,True,False,False
5,Samara Sharma,,62000.0,,Sales,G3,B,False,False,True,False
6,Joaquin Fleiman,54.0,,F,Operations,G3,C,False,False,True,False
7,Juana Wilkis,54.0,52000.0,F,Finance,G3,A,False,False,True,False
8,Leonardo Doberti,23.0,98000.0,M,Sales,G4,A,False,False,False,True
