# Building the Logistic Regression Model


In this separate notebook we load the already preprocessed dataset and build the machine learning model that will predict the values. We use logistic regression because it is the model that best fits our features and targets.

Logistic regression is a classification model, so to apply it we classify people according to whether their absenteeism is excessive or moderate.
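As a rough illustration of why logistic regression yields a classification, the sketch below maps a real-valued score to a probability with the logistic (sigmoid) function and then applies a cutoff. The helper names `sigmoid` and `classify` and the example scores are hypothetical, chosen only for this illustration.

```python
import math


def sigmoid(z: float) -> float:
    """Map a real-valued score (log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def classify(z: float, threshold: float = 0.5) -> int:
    """Return 1 when the probability exceeds the threshold, otherwise 0."""
    return 1 if sigmoid(z) > threshold else 0


# Hypothetical scores: strongly negative, neutral, strongly positive
scores = [-2.0, 0.0, 2.0]
probabilities = [sigmoid(z) for z in scores]
labels = [classify(z) for z in scores]
print(probabilities, labels)
```

A score of 0 corresponds to a probability of exactly 0.5, so the sign of the score decides the class.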

We start by loading the data from the CSV file 'Absentcion_preprocesada.csv'.

In [1]:
import pandas as pd
import numpy as np

In [2]:
data_preprocessed = pd.read_csv('Absentcion_preprocesada.csv')

In [3]:
data_preprocessed.head()

Unnamed: 0,razon_0,razon_1,razon_2,razon_3,Month,Day,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,4
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,2
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,4
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,2


### Creating the targets

To create the targets we rely on the 'Absenteeism Time in Hours' feature, which gives the number of hours a person was absent on a given day.

In [4]:
data_preprocessed['Absenteeism Time in Hours'].median()

3.0

The median is 3 hours. To keep the dataset roughly balanced, we use this value as the cutoff for deciding whether a person is absent more than expected: values above the median are labeled 1 (True) and all others 0 (False). This gives nearly the same number of 1s and 0s, which balances our targets.

In [5]:
targets = np.where(data_preprocessed['Absenteeism Time in Hours'] > 
                   data_preprocessed['Absenteeism Time in Hours'].median(), 1, 0)

In [6]:
# Inspect the newly created variable
targets

array([1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0,
       1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1,
       0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0,
       0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0,
       0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0,

Now we add our new variable to the dataset.

In [7]:
data_preprocessed['Excessive Absenteeism'] = targets

In [8]:
data_preprocessed.head()

Unnamed: 0,razon_0,razon_1,razon_2,razon_3,Month,Day,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Excessive Absenteeism
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,4,1
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,2,0
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,4,1
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,2,0


It is good practice to check whether the dataset is balanced with respect to the targets, that is, whether the two classes split roughly 50/50.

In [9]:
targets.sum() / targets.shape[0]

0.45571428571428574
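Because the comparison against the median is strict (`>`), rows whose value equals the median fall into class 0, which is consistent with a share of 1s somewhat below 50%. A minimal sketch of that effect, using a made-up `hours` array that is not taken from the dataset:

```python
import numpy as np

# Illustrative absence hours with several values tied at the median
hours = np.array([0, 2, 3, 3, 3, 4, 8, 8])

median = np.median(hours)                  # 3.0 for this toy array
toy_targets = np.where(hours > median, 1, 0)

# Share of class 1 and the count of each class
share_of_ones = toy_targets.mean()
class_counts = np.bincount(toy_targets)    # index 0 -> zeros, index 1 -> ones
print(median, toy_targets, share_of_ones, class_counts)
```

Using `>=` instead would move the tied rows into class 1, so the direction of the inequality matters whenever many rows sit exactly on the median.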

#### Dropping unnecessary variables

We create a checkpoint of our dataset, dropping the variables we do not want.

In [10]:
data_with_targets = data_preprocessed.drop(['Absenteeism Time in Hours',"Day","Distance to Work","Daily Work Load Average"],axis=1)

In [11]:
data_with_targets

Unnamed: 0,razon_0,razon_1,razon_2,razon_3,Month,Transportation Expense,Age,Body Mass Index,Education,Children,Pets,Excessive Absenteeism
0,0,0,0,1,7,289,33,30,0,2,1,1
1,0,0,0,0,7,118,50,31,0,1,0,0
2,0,0,0,1,7,179,38,31,0,0,0,0
3,1,0,0,0,7,279,39,24,0,2,0,1
4,0,0,0,1,7,289,33,30,0,2,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,5,179,40,22,1,2,0,1
696,1,0,0,0,5,225,28,24,0,1,2,0
697,1,0,0,0,5,330,28,25,1,0,0,1
698,0,0,0,1,5,235,32,25,1,0,0,0


In [12]:
# Check that it is not the same DataFrame object
data_with_targets is data_preprocessed

False

In [13]:
data_with_targets.head()

Unnamed: 0,razon_0,razon_1,razon_2,razon_3,Month,Transportation Expense,Age,Body Mass Index,Education,Children,Pets,Excessive Absenteeism
0,0,0,0,1,7,289,33,30,0,2,1,1
1,0,0,0,0,7,118,50,31,0,1,0,0
2,0,0,0,1,7,179,38,31,0,0,0,0
3,1,0,0,0,7,279,39,24,0,2,0,1
4,0,0,0,1,7,289,33,30,0,2,1,0


### Selecting the features


In [14]:
data_with_targets.shape

(700, 12)

We include every variable except the target ('Excessive Absenteeism').

In [15]:
data_with_targets.iloc[:,:-1]

Unnamed: 0,razon_0,razon_1,razon_2,razon_3,Month,Transportation Expense,Age,Body Mass Index,Education,Children,Pets
0,0,0,0,1,7,289,33,30,0,2,1
1,0,0,0,0,7,118,50,31,0,1,0
2,0,0,0,1,7,179,38,31,0,0,0
3,1,0,0,0,7,279,39,24,0,2,0
4,0,0,0,1,7,289,33,30,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,5,179,40,22,1,2,0
696,1,0,0,0,5,225,28,24,0,1,2
697,1,0,0,0,5,330,28,25,1,0,0
698,0,0,0,1,5,235,32,25,1,0,0


We create a variable to hold the inputs.

In [16]:
unscaled_inputs = data_with_targets.iloc[:,:-1]

## Standardizing the variables

To work with variables that are measured on different scales, we need to standardize their values. Standardization rescales each variable so that differences in magnitude (some values very large, others very small) do not make a variable look more or less relevant to the model than it really is.

For this standardization we use the StandardScaler class from sklearn.
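As a small sketch of what standardization does to a single variable, the snippet below subtracts the mean and divides by the population standard deviation, which is the same formula StandardScaler applies per column. The `ages` list reuses the five Age values shown in the preview above and is only illustrative.

```python
import statistics

# Age values from the first five rows of the preview
ages = [33, 50, 38, 39, 33]

mean = statistics.fmean(ages)
std = statistics.pstdev(ages)  # population standard deviation (ddof = 0)

# z-score: how many standard deviations each value lies from the mean
z_scores = [(age - mean) / std for age in ages]
print([round(z, 3) for z in z_scores])
```

After this transformation the values have mean 0 and standard deviation 1, whatever their original units were.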

In [17]:
from sklearn.preprocessing import StandardScaler

From it we create an object that holds the scaling method, so that we can apply the process to whichever data we choose.

In [18]:
absenteeism_scaler = StandardScaler()

In [19]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

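class CustomScaler(BaseEstimator,TransformerMixin):
    """Standardize only the given columns and leave the others untouched."""

    def __init__(self,columns):
        self.scaler = StandardScaler()
        self.columns = columns
        self.mean_ = None
        self.var_ = None

    def fit(self, X, y=None):
        # Learn the scaling statistics from the selected columns only
        self.scaler.fit(X[self.columns], y)
        # Per-column mean and population variance (ddof=0, as np.var uses)
        self.mean_ = X[self.columns].mean(axis=0)
        self.var_ = X[self.columns].var(axis=0, ddof=0)
        return self

    def transform(self, X, y=None, copy=None):
        init_col_order = X.columns
        # Keep X's index so that pd.concat aligns the rows correctly
        X_scaled = pd.DataFrame(self.scaler.transform(X[self.columns]), columns=self.columns, index=X.index)
        X_not_scaled = X.loc[:,~X.columns.isin(self.columns)]
        # Reassemble the columns in their original order
        return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]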
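The idea behind CustomScaler (standardize the numeric columns, pass the dummy columns through unchanged) can be sketched with NumPy alone. This is not the class above, only an illustration of the same idea. The small matrix `X` and the index list `cols_to_scale` are made up for the example.

```python
import numpy as np

# Column 0 is a 0/1 dummy; columns 1 and 2 are numeric features
X = np.array([[1, 289.0, 33.0],
              [0, 118.0, 50.0],
              [1, 179.0, 38.0],
              [0, 279.0, 39.0]])
cols_to_scale = [1, 2]

# Statistics are computed on the selected columns only
col_mean = X[:, cols_to_scale].mean(axis=0)
col_std = X[:, cols_to_scale].std(axis=0)

# Copy first so the original matrix is left intact
X_scaled = X.copy()
X_scaled[:, cols_to_scale] = (X[:, cols_to_scale] - col_mean) / col_std
print(X_scaled)
```

Leaving the dummies at 0 and 1 keeps their coefficients readable as the effect of the category being present.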

In [20]:
unscaled_inputs.columns.values

array(['razon_0', 'razon_1', 'razon_2', 'razon_3', 'Month',
       'Transportation Expense', 'Age', 'Body Mass Index', 'Education',
       'Children', 'Pets'], dtype=object)

In [21]:
columns_to_omit = ['razon_0', 'razon_1', 'razon_2', 'razon_3','Education']

In [22]:
columns_to_scale = [x for x in unscaled_inputs.columns.values if x not in columns_to_omit]
len(columns_to_scale)

6

In [23]:
absenteeism_scaler = CustomScaler(columns_to_scale)

In [24]:
absenteeism_scaler.fit(unscaled_inputs)



In [25]:
scaled_inputs = absenteeism_scaler.transform(unscaled_inputs)

In [26]:
scaled_inputs

Unnamed: 0,razon_0,razon_1,razon_2,razon_3,Month,Transportation Expense,Age,Body Mass Index,Education,Children,Pets
0,0,0,0,1,0.182726,1.005844,-0.536062,0.767431,0,0.880469,0.268487
1,0,0,0,0,0.182726,-1.574681,2.130803,1.002633,0,-0.019280,-0.589690
2,0,0,0,1,0.182726,-0.654143,0.248310,1.002633,0,-0.919030,-0.589690
3,1,0,0,0,0.182726,0.854936,0.405184,-0.643782,0,0.880469,-0.589690
4,0,0,0,1,0.182726,1.005844,-0.536062,0.767431,0,0.880469,0.268487
...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,-0.388293,-0.654143,0.562059,-1.114186,1,0.880469,-0.589690
696,1,0,0,0,-0.388293,0.040034,-1.320435,-0.643782,0,-0.019280,1.126663
697,1,0,0,0,-0.388293,1.624567,-1.320435,-0.408580,1,-0.919030,-0.589690
698,0,0,0,1,-0.388293,0.190942,-0.692937,-0.408580,1,-0.919030,-0.589690


In [27]:
scaled_inputs.shape

(700, 11)

### Splitting the data with train-test split

In [28]:
from sklearn.model_selection import train_test_split

### Split

In [29]:
train_test_split(scaled_inputs, targets)

[     razon_0  razon_1  razon_2  razon_3     Month  Transportation Expense  \
 377        1        0        0        0 -1.244823               -0.654143   
 395        0        0        0        1 -0.959313                0.190942   
 111        0        0        1        0  1.610276                0.356940   
 144        1        0        0        0 -1.244823                2.499833   
 367        0        0        0        1 -1.530333               -0.654143   
 ..       ...      ...      ...      ...       ...                     ...   
 360        0        0        0        1 -1.530333               -1.016322   
 339        1        0        0        0  1.610276                0.040034   
 272        0        0        1        0  0.753746                1.005844   
 170        0        0        0        1 -0.959313                0.568211   
 421        0        0        0        1 -0.673803                1.171843   
 
           Age  Body Mass Index  Education  Children      Pets

In [30]:
x_train, x_test, y_train, y_test = train_test_split(scaled_inputs, targets,test_size = 0.2, random_state = 20)

In [31]:
print (x_train.shape, y_train.shape)

(560, 11) (560,)


In [32]:
print (x_test.shape, y_test.shape)

(140, 11) (140,)
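The 560/140 shapes follow directly from `test_size = 0.2` applied to 700 rows. A rough NumPy sketch of a shuffled 80/20 split is shown below. It illustrates the idea only and will not reproduce the exact rows that `train_test_split` selects. The seed value is arbitrary.

```python
import numpy as np

n_samples = 700
test_size = 0.2

# Shuffle the row positions with a fixed seed so the split is repeatable
rng = np.random.default_rng(20)
shuffled = rng.permutation(n_samples)

n_test = int(round(n_samples * test_size))
test_idx = shuffled[:n_test]
train_idx = shuffled[n_test:]
print(len(train_idx), len(test_idx))
```

Fixing the seed plays the same role as `random_state`: rerunning the cell gives the same partition, so the scores stay comparable between runs.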


## Logistic Regression with sklearn


In [33]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

### Training the model

In [34]:
reg = LogisticRegression()

In [35]:
reg.fit(x_train,y_train)

In [36]:
reg.score(x_train,y_train)

0.7732142857142857

### Checking the metrics

In [37]:
model_outputs = reg.predict(x_train)
model_outputs

array([0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1,
       1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0,
       0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0,
       1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1,
       1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1,
       0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0,
       0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0,

In [38]:
y_train

array([0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1,
       1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1,
       1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1,
       1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0,

In [39]:
model_outputs == y_train

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True, False,  True, False, False,  True,  True,  True,  True,
       False,  True, False,  True, False, False,  True,  True,  True,
       False,  True,  True,  True,  True,  True,  True,  True,  True,
       False, False, False, False,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,  True,  True,  True,  True, False,  True,  True,
        True,  True,  True, False,  True,  True,  True,  True,  True,
       False,  True, False,  True,  True, False, False, False,  True,
        True,  True,  True,  True,  True,  True,  True, False,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True, False,  True,  True,  True,  True,
       False,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,

In [40]:
np.sum((model_outputs==y_train))

433

In [41]:
model_outputs.shape[0]

560

In [42]:
np.sum((model_outputs==y_train)) / model_outputs.shape[0]

0.7732142857142857
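The manual calculation above is the definition of accuracy: correct predictions divided by total predictions. The sketch below repeats it on two short made-up arrays and also splits the hits and misses into the four confusion-matrix counts. The arrays `y_true` and `y_pred` are hypothetical.

```python
import numpy as np

# Made-up labels and predictions for eight observations
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Accuracy: fraction of positions where prediction and label agree
accuracy = np.mean(y_pred == y_true)

# Confusion-matrix counts
tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # true positives
tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # true negatives
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # false positives
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # false negatives
print(accuracy, tp, tn, fp, fn)
```

Accuracy alone does not say which kind of error dominates, so the four counts are a useful complement.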

### Extracting the coefficients

In [43]:
reg.intercept_

array([-1.6474549])

In [44]:
reg.coef_

array([[ 2.80019733,  0.95188356,  3.11555338,  0.83900082,  0.1589299 ,
         0.60528415, -0.16989096,  0.27981088, -0.21053312,  0.34826214,
        -0.27739602]])
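To see how the intercept and coefficients turn into a prediction, the sketch below computes the log-odds and the probability by hand for one observation. The coefficients are copied from the outputs above, and `first_row` is the first row of `scaled_inputs` as displayed earlier (rounded to six decimals), so the result is approximate.

```python
import numpy as np

# Values copied from reg.intercept_ and reg.coef_ above
intercept = -1.6474549
coefficients = np.array([2.80019733, 0.95188356, 3.11555338, 0.83900082,
                         0.1589299, 0.60528415, -0.16989096, 0.27981088,
                         -0.21053312, 0.34826214, -0.27739602])

# First row of scaled_inputs as shown in the table (rounded)
first_row = np.array([0, 0, 0, 1, 0.182726, 1.005844, -0.536062,
                      0.767431, 0, 0.880469, 0.268487])

# Linear part: intercept plus the weighted sum of the features
log_odds = intercept + coefficients @ first_row

# The sigmoid converts log-odds into P(Excessive Absenteeism = 1)
probability = 1.0 / (1.0 + np.exp(-log_odds))
print(f"log-odds = {log_odds:.3f}, probability = {probability:.3f}")
```

For a binary model this is the quantity that `reg.predict_proba` reports in its second column.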

In [45]:
unscaled_inputs.columns.values

array(['razon_0', 'razon_1', 'razon_2', 'razon_3', 'Month',
       'Transportation Expense', 'Age', 'Body Mass Index', 'Education',
       'Children', 'Pets'], dtype=object)

In [46]:
feature_name = unscaled_inputs.columns.values

In [47]:
summary_table = pd.DataFrame (columns=['Feature name'], data = feature_name)

summary_table['Coefficient'] = np.transpose(reg.coef_)

summary_table

Unnamed: 0,Feature name,Coefficient
0,razon_0,2.800197
1,razon_1,0.951884
2,razon_2,3.115553
3,razon_3,0.839001
4,Month,0.15893
5,Transportation Expense,0.605284
6,Age,-0.169891
7,Body Mass Index,0.279811
8,Education,-0.210533
9,Children,0.348262


In [48]:
summary_table.index = summary_table.index + 1

summary_table.loc[0] = ['Intercept', reg.intercept_[0]]

summary_table = summary_table.sort_index()
summary_table

Unnamed: 0,Feature name,Coefficient
0,Intercept,-1.647455
1,razon_0,2.800197
2,razon_1,0.951884
3,razon_2,3.115553
4,razon_3,0.839001
5,Month,0.15893
6,Transportation Expense,0.605284
7,Age,-0.169891
8,Body Mass Index,0.279811
9,Education,-0.210533


### Interpreting the coefficients

In [49]:
summary_table['Odds_ratio'] = np.exp(summary_table.Coefficient)

In [50]:
summary_table

Unnamed: 0,Feature name,Coefficient,Odds_ratio
0,Intercept,-1.647455,0.192539
1,razon_0,2.800197,16.447892
2,razon_1,0.951884,2.590585
3,razon_2,3.115553,22.545903
4,razon_3,0.839001,2.314054
5,Month,0.15893,1.172256
6,Transportation Expense,0.605284,1.831773
7,Age,-0.169891,0.843757
8,Body Mass Index,0.279811,1.32288
9,Education,-0.210533,0.810152


In [51]:
summary_table.sort_values('Odds_ratio', ascending=False)

Unnamed: 0,Feature name,Coefficient,Odds_ratio
3,razon_2,3.115553,22.545903
1,razon_0,2.800197,16.447892
2,razon_1,0.951884,2.590585
4,razon_3,0.839001,2.314054
6,Transportation Expense,0.605284,1.831773
10,Children,0.348262,1.416604
8,Body Mass Index,0.279811,1.32288
5,Month,0.15893,1.172256
7,Age,-0.169891,0.843757
9,Education,-0.210533,0.810152
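An odds ratio is the exponential of a coefficient, and it tells by what factor the odds of excessive absenteeism are multiplied when the feature increases by one unit. For the standardized features, one unit means one standard deviation. A short sketch using three coefficients copied from the table above:

```python
import math

# Coefficients taken from the summary table
coefficients = {
    "Transportation Expense": 0.605284,
    "Age": -0.169891,
    "razon_3": 0.839001,
}

for name, coef in coefficients.items():
    odds_ratio = math.exp(coef)
    # Percentage change in the odds for a one-unit increase in the feature
    pct_change = (odds_ratio - 1.0) * 100.0
    print(f"{name}: odds ratio = {odds_ratio:.3f} ({pct_change:+.1f}% odds)")

odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}
```

An odds ratio above 1 raises the odds, a value below 1 lowers them, and a value close to 1 means the feature has little effect on the prediction.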


# Testing the model

In [52]:
reg.score(x_test,y_test)

0.75

In [53]:
predict=reg.predict_proba(x_test)

In [54]:
predict

array([[0.71340413, 0.28659587],
       [0.58724228, 0.41275772],
       [0.44020821, 0.55979179],
       [0.78159464, 0.21840536],
       [0.08410854, 0.91589146],
       [0.33487603, 0.66512397],
       [0.29984576, 0.70015424],
       [0.13103971, 0.86896029],
       [0.78625404, 0.21374596],
       [0.74903632, 0.25096368],
       [0.49397598, 0.50602402],
       [0.22484913, 0.77515087],
       [0.07129151, 0.92870849],
       [0.73178133, 0.26821867],
       [0.30934135, 0.69065865],
       [0.5471671 , 0.4528329 ],
       [0.55052275, 0.44947725],
       [0.5392707 , 0.4607293 ],
       [0.40201117, 0.59798883],
       [0.05361575, 0.94638425],
       [0.7003009 , 0.2996991 ],
       [0.78159464, 0.21840536],
       [0.42037128, 0.57962872],
       [0.42037128, 0.57962872],
       [0.24783565, 0.75216435],
       [0.74566259, 0.25433741],
       [0.51017274, 0.48982726],
       [0.85690195, 0.14309805],
       [0.20349733, 0.79650267],
       [0.78159464, 0.21840536],
       [0.
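Each row of the `predict_proba` output holds the probability of class 0 followed by the probability of class 1, and the two add up to 1. The sketch below takes the first five rows shown above and converts them to labels with the usual 0.5 cutoff. The alternative cutoff of 0.4 is an arbitrary value chosen for illustration.

```python
import numpy as np

# First five rows of the predict_proba output above
proba = np.array([[0.71340413, 0.28659587],
                  [0.58724228, 0.41275772],
                  [0.44020821, 0.55979179],
                  [0.78159464, 0.21840536],
                  [0.08410854, 0.91589146]])

# The second column is P(Excessive Absenteeism = 1)
p_excessive = proba[:, 1]

# Default decision rule: class 1 when the probability exceeds 0.5
labels_default = (p_excessive > 0.5).astype(int)

# A lower cutoff flags more people as excessive
labels_lower_cutoff = (p_excessive > 0.4).astype(int)
print(labels_default, labels_lower_cutoff)
```

Keeping the probabilities rather than only the labels makes it possible to adjust the cutoff later without retraining the model.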

### Exporting the model

In [55]:
import pickle

In [56]:
with open('model','wb') as file:
    pickle.dump(reg,file)

In [57]:
with open('scalar','wb') as file:
    pickle.dump(absenteeism_scaler,file)
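Loading the files back mirrors the export: open in binary read mode and call `pickle.load`. The sketch below shows the round trip with a stand-in dictionary instead of the trained model, written to a temporary directory so it does not touch the real 'model' and 'scalar' files.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be reg or absenteeism_scaler
stand_in_model = {"intercept": -1.6474549, "n_features": 11}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model")

    # Export: binary write mode
    with open(path, "wb") as file:
        pickle.dump(stand_in_model, file)

    # Import: binary read mode
    with open(path, "rb") as file:
        restored = pickle.load(file)

print(restored == stand_in_model)
```

When loading the real objects, the CustomScaler class must be importable in the loading script, and pickle files should only be opened when they come from a trusted source.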