<a href="https://colab.research.google.com/github/cristiandarioortegayubro/accenture/blob/main/accenturedatascience01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![Accenture logo](https://www.accenture.com/t20180820T081710Z__w__/il-en/_acnmedia/Accenture/Dev/Redesign/Acc_Logo_Black_Purple_RGB.PNG)

# Data Science / Datamining
~~~python
Cdor. Cristian Darío Ortega Yubro
~~~

# Modules

## Installing modules

In [1]:
!pip install --upgrade plotly



## For data analysis

In [2]:
import pandas as pd
import numpy as np

## For preprocessing and modeling

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


In [44]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_selector
from sklearn.model_selection import cross_validate

In [4]:
import pickle
import os

## For plots

In [32]:
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go

## Data acquisition and dataframe creation

In [6]:
datos = "https://raw.githubusercontent.com/cristiandarioortegayubro/accenture/main/train.csv"
datos_no_vistos = "https://raw.githubusercontent.com/cristiandarioortegayubro/accenture/main/test.csv"

In [7]:
df = pd.read_csv(datos)
df

| | ID | nivel_de_satisfaccion | ultima_evaluacion | cantidad_proyectos | promedio_horas_mensuales_trabajadas | años_en_la_empresa | tuvo_un_accidente_laboral | promociones_ultimos_5_anios | area | salario | se_fue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2876.0 | 0.63 | 0.84 | 3 | 269 | 2 | 0 | 0 | gestión de productos | bajo | no |
| 1 | 7883.0 | 0.11 | 0.93 | 7 | 284 | 4 | 0 | 0 | tecnica | bajo | si |
| 2 | 4089.0 | 0.60 | 0.42 | 2 | 109 | 6 | 0 | 0 | ventas | bajo | no |
| 3 | 8828.0 | 0.38 | 0.49 | 4 | 196 | 3 | 0 | 1 | dirección | alto | no |
| 4 | 9401.0 | 0.11 | 0.83 | 6 | 244 | 4 | 0 | 0 | contabilidad | bajo | si |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7995 | 8701.0 | 0.63 | 0.85 | 2 | 156 | 3 | 1 | 0 | RRHH | medio | no |
| 7996 | 501.0 | 0.62 | 0.85 | 3 | 237 | 3 | 1 | 0 | TI | medio | no |
| 7997 | 2834.0 | 0.86 | 1.00 | 5 | 257 | 5 | 0 | 0 | tecnica | medio | si |
| 7998 | 8245.0 | 0.88 | 0.51 | 3 | 208 | 3 | 0 | 0 | RRHH | medio | no |


In [8]:
df1 = pd.read_csv(datos_no_vistos)
df1.shape

(2000, 11)

# Exploratory data analysis

## Data type summary

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8000 entries, 0 to 7999
Data columns (total 11 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   ID                                   8000 non-null   float64
 1   nivel_de_satisfaccion                8000 non-null   float64
 2   ultima_evaluacion                    8000 non-null   float64
 3   cantidad_proyectos                   8000 non-null   int64  
 4   promedio_horas_mensuales_trabajadas  8000 non-null   int64  
 5   años_en_la_empresa                   8000 non-null   int64  
 6   tuvo_un_accidente_laboral            8000 non-null   int64  
 7   promociones_ultimos_5_anios          8000 non-null   int64  
 8   area                                 8000 non-null   object 
 9   salario                              8000 non-null   object 
 10  se_fue                               8000 non-null   object 
dtypes: float64(3), int64(5), object(3)

The dataframe has no null values. Of its eleven variables, three are non-numeric. The target variable is "se_fue". The ID variable is only a row identifier, carries no information relevant to the model, and is therefore dropped.

In [10]:
df.drop(columns="ID",inplace=True)
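
The checks described above can be reproduced with a few pandas calls. The following is a minimal sketch on a small made-up frame; the `toy` dataframe and its values are illustrative assumptions, not rows taken from `train.csv`.

```python
import pandas as pd

# Hypothetical miniature of the training data; values are illustrative only.
toy = pd.DataFrame({
    "ID": [1.0, 2.0, 3.0],
    "nivel_de_satisfaccion": [0.63, 0.11, 0.60],
    "area": ["ventas", "tecnica", "ventas"],
    "se_fue": ["no", "si", "no"],
})

# Count missing values per column; all zeros means there is nothing to impute.
nulos_por_columna = toy.isna().sum()

# Everything that is not a numeric dtype is treated as a non-numeric variable.
no_numericas = toy.select_dtypes(exclude="number").columns.tolist()

# Drop the identifier, returning a new frame instead of mutating in place.
toy_sin_id = toy.drop(columns="ID")

print(nulos_por_columna.to_dict())
print(no_numericas)
print(toy_sin_id.columns.tolist())
```

Returning a new frame instead of using `inplace=True` is only a style choice here; both approaches remove the column.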

## Handling duplicates

In [11]:
df.drop_duplicates(inplace=True)
df.shape

(7069, 10)

Duplicate rows are removed from the dataframe before the model is developed, leaving 7069 of the original 8000 rows.
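
Before dropping duplicates it can be useful to count them. A minimal sketch, assuming a small illustrative frame (the `toy` data is made up), shows how `duplicated()` and `drop_duplicates()` relate:

```python
import pandas as pd

# Illustrative data: the third row repeats the first one.
toy = pd.DataFrame({
    "cantidad_proyectos": [3, 7, 3, 2],
    "salario": ["bajo", "bajo", "bajo", "alto"],
})

# duplicated() flags each repeat of an earlier row; the first occurrence stays False.
n_duplicados = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of every distinct row.
sin_duplicados = toy.drop_duplicates()

print(n_duplicados)
print(sin_duplicados.shape)
```

The number of flagged rows equals the difference in row count before and after `drop_duplicates()`.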

## Histograms of the variables

In [12]:
df.columns

Index(['nivel_de_satisfaccion', 'ultima_evaluacion', 'cantidad_proyectos',
       'promedio_horas_mensuales_trabajadas', 'años_en_la_empresa',
       'tuvo_un_accidente_laboral', 'promociones_ultimos_5_anios', 'area',
       'salario', 'se_fue'],
      dtype='object')

In [13]:
for name in df.columns:
  fig=px.histogram(df, 
                   x=name, 
                   template="gridon", 
                   title="Histograma "+name, 
                   marginal="box")
  fig.update_layout(bargap=0.1)
  fig.show()

## Descriptive analysis of numeric variables

In [14]:
round(df.describe(),2).T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| nivel_de_satisfaccion | 7069.0 | 0.62 | 0.24 | 0.09 | 0.45 | 0.65 | 0.82 | 1.0 |
| ultima_evaluacion | 7069.0 | 0.72 | 0.17 | 0.36 | 0.57 | 0.72 | 0.87 | 1.0 |
| cantidad_proyectos | 7069.0 | 3.8 | 1.19 | 2.0 | 3.0 | 4.0 | 5.0 | 7.0 |
| promedio_horas_mensuales_trabajadas | 7069.0 | 200.73 | 49.35 | 96.0 | 156.0 | 201.0 | 244.0 | 310.0 |
| años_en_la_empresa | 7069.0 | 3.42 | 1.41 | 2.0 | 3.0 | 3.0 | 4.0 | 10.0 |
| tuvo_un_accidente_laboral | 7069.0 | 0.15 | 0.36 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| promociones_ultimos_5_anios | 7069.0 | 0.02 | 0.14 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |


## Descriptive analysis of non-numeric variables

In [15]:
df.select_dtypes(include=['object']).describe().T

| | count | unique | top | freq |
|---|---|---|---|---|
| area | 7069 | 10 | ventas | 1915 |
| salario | 7069 | 3 | bajo | 3401 |
| se_fue | 7069 | 2 | no | 5698 |
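
The `se_fue` row above implies that the two classes are not balanced. A short calculation using only the figures from that output makes the proportion explicit; the variable names are illustrative.

```python
# Figures taken from the describe() output above.
total = 7069       # rows after removing duplicates
no_se_fue = 5698   # frequency of the most common label, "no"

# se_fue has only two labels, so the remaining rows are "si".
si_se_fue = total - no_se_fue
proporcion_no = no_se_fue / total

print(si_se_fue)                 # → 1371
print(round(proporcion_no, 3))   # → 0.806
```

Roughly four out of five employees in the training data stayed, which is worth keeping in mind when reading accuracy figures later on.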


## Correlation matrix

In [16]:
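# Correlation is only defined for numeric columns, so select them explicitly
# (recent pandas versions raise an error if text columns are passed to corr()).
df_corr = df.select_dtypes(include="number").corr()
correlacion = round(df_corr, 2).values
nombres = list(df_corr.columns.values)
transposicion = correlacion[::-1]  # reverse the rows so the first variable is drawn at the top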

In [35]:
fig=ff.create_annotated_heatmap(transposicion, 
                                x=nombres,
                                y=nombres[::-1], 
                                colorscale='inferno')
fig.update_xaxes(side="bottom")

fig.show()
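
Because `se_fue` is stored as text, it does not appear in the correlation matrix. As an optional variation, and assuming the encoding "si" = 1 and "no" = 0, the target can be mapped to a number first. The sketch below uses a made-up four-row frame rather than the real data.

```python
import pandas as pd

# Illustrative values only.
toy = pd.DataFrame({
    "nivel_de_satisfaccion": [0.9, 0.8, 0.2, 0.1],
    "se_fue": ["no", "no", "si", "si"],
})

# Assumed encoding of the two labels seen in the data.
toy["se_fue_num"] = toy["se_fue"].map({"no": 0, "si": 1})

# Restrict to numeric columns before computing Pearson correlations.
corr = toy.select_dtypes(include="number").corr()

print(round(corr.loc["nivel_de_satisfaccion", "se_fue_num"], 2))
```

In this toy frame the correlation is negative because the low-satisfaction rows are the ones labelled "si"; the real data may or may not show the same pattern.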

# Splitting the dataset

In [18]:
y = df["se_fue"]  # define the target variable
X = df.drop(columns=["se_fue"])  # remove the target variable from the input features

In [19]:
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size = 0.20, 
                                                    random_state = 2021,
                                                    shuffle=True)
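
The split above shuffles the rows but does not stratify them. As an optional variation, assuming one wants the train and test sets to keep the same class ratio, `train_test_split` accepts a `stratify` argument. The sketch uses a made-up target with an 80/20 imbalance.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative target: 80 "no" and 20 "si".
y_toy = pd.Series(["no"] * 80 + ["si"] * 20)
X_toy = pd.DataFrame({"x": range(100)})

# stratify=y_toy asks for the same label proportions in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy,
    y_toy,
    test_size=0.20,
    random_state=2021,
    shuffle=True,
    stratify=y_toy,
)

print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```

Whether stratification changes the results on the real data is not something this sketch establishes; it only shows the mechanics.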

# Data preprocessing

In [31]:
numericas=X_train.select_dtypes(include=['float64', 'int']).columns.to_list()
cualitativas=X_train.select_dtypes(include=['object', 'category']).columns.to_list()

In [37]:
preprocessor = ColumnTransformer([('scale', StandardScaler(), numericas),
                                  ('onehot', OneHotEncoder(handle_unknown='ignore'), cualitativas)],
                                 remainder='passthrough')

In [38]:
X_train_prep = preprocessor.fit_transform(X_train)
X_test_prep  = preprocessor.transform(X_test)
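
To see what this `ColumnTransformer` does to each kind of column, here is a minimal sketch on a made-up two-column frame. The data is illustrative, and `sparse_threshold=0` is an added assumption that forces a dense array so the result prints directly.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative data: one numeric and one categorical column.
toy = pd.DataFrame({
    "cantidad_proyectos": [2, 4, 6],
    "salario": ["bajo", "medio", "alto"],
})

pre = ColumnTransformer(
    [
        ("scale", StandardScaler(), ["cantidad_proyectos"]),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["salario"]),
    ],
    sparse_threshold=0,  # assumption for this sketch: always return a dense array
)

salida = pre.fit_transform(toy)
# One scaled column plus one indicator column per salary level.
print(salida.shape)

# With handle_unknown="ignore", a category not seen during fit
# is encoded as all zeros instead of raising an error.
nuevo = pd.DataFrame({"cantidad_proyectos": [4], "salario": ["desconocido"]})
codificado = pre.transform(nuevo)
print(codificado)
```

Fitting on the training partition and only calling `transform` on the test partition, as the cell above does, keeps the scaler statistics and the category list from being influenced by test rows.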

In [41]:
# get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2
encoded_cat = preprocessor.named_transformers_['onehot'].get_feature_names_out(cualitativas)
labels = np.concatenate([numericas, encoded_cat])
datos_train_prep = preprocessor.transform(X_train)
datos_train_prep = pd.DataFrame(datos_train_prep, columns=labels)
datos_train_prep.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5655 entries, 0 to 5654
Data columns (total 20 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   nivel_de_satisfaccion                5655 non-null   float64
 1   ultima_evaluacion                    5655 non-null   float64
 2   cantidad_proyectos                   5655 non-null   float64
 3   promedio_horas_mensuales_trabajadas  5655 non-null   float64
 4   años_en_la_empresa                   5655 non-null   float64
 5   tuvo_un_accidente_laboral            5655 non-null   float64
 6   promociones_ultimos_5_anios          5655 non-null   float64
 7   area_ImásD                           5655 non-null   float64
 8   area_RRHH                            5655 non-null   float64
 9   area_TI                              5655 non-null   float64
 10  area_contabilidad                    5655 non-null   float64
 11  area_dirección                



In [42]:
numericas = X_train.select_dtypes(include=['float64', 'int']).columns.to_list()
categoricas = X_train.select_dtypes(include=['object', 'category']).columns.to_list()

In [45]:
numeric_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                                      ('scaler', StandardScaler())])

In [46]:
categorical_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')),
                                          ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [47]:
preprocessor = ColumnTransformer(transformers=[('numeric', numeric_transformer, numericas),
                                               ('cat', categorical_transformer, categoricas)],
                                 remainder='passthrough')

In [48]:
X_train_prep = preprocessor.fit_transform(X_train)
X_test_prep  = preprocessor.transform(X_test)
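
According to `df.info()`, the training data has no null values, so the imputers in these pipelines act as a safeguard for unseen data rather than changing the training set. A minimal sketch of the numeric branch on a made-up column with one missing value shows what they would do:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative column with one missing value.
toy = pd.DataFrame(
    {"promedio_horas_mensuales_trabajadas": [100.0, 200.0, np.nan, 300.0]}
)

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

resultado = numeric_transformer.fit_transform(toy)

# The median of the observed values (100, 200, 300) is 200.
mediana = numeric_transformer.named_steps["imputer"].statistics_[0]
print(mediana)

# After imputation the column mean is also 200, so the filled row scales to 0.
print(resultado.ravel())
```

The categorical branch follows the same idea, with `most_frequent` filling gaps before one-hot encoding.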

In [49]:
# get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2
encoded_cat = preprocessor.named_transformers_['cat']['onehot'].get_feature_names_out(categoricas)
labels = np.concatenate([numericas, encoded_cat])
datos_train_prep = preprocessor.transform(X_train)
datos_train_prep = pd.DataFrame(datos_train_prep, columns=labels)
datos_train_prep.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5655 entries, 0 to 5654
Data columns (total 20 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   nivel_de_satisfaccion                5655 non-null   float64
 1   ultima_evaluacion                    5655 non-null   float64
 2   cantidad_proyectos                   5655 non-null   float64
 3   promedio_horas_mensuales_trabajadas  5655 non-null   float64
 4   años_en_la_empresa                   5655 non-null   float64
 5   tuvo_un_accidente_laboral            5655 non-null   float64
 6   promociones_ultimos_5_anios          5655 non-null   float64
 7   area_ImásD                           5655 non-null   float64
 8   area_RRHH                            5655 non-null   float64
 9   area_TI                              5655 non-null   float64
 10  area_contabilidad                    5655 non-null   float64
 11  area_dirección                



In [50]:
from sklearn import set_config
set_config(display='diagram')

preprocessor

In [51]:
set_config(display='text')