# Keyrus Data Science Technical Test

## Part 1: Exploratory Analysis

- [x] Variable types
- [x] Measures of position
- [x] Measures of dispersion
- [x] Handling of missing values
- [x] Plots
- [x] Outlier analysis

## Part 2: Statistics

- [x] Descriptive statistics
- [x] Identification of the variables' distributions

## Part 3: Modeling

- [x] Prediction models
- [x] Choice of the best model
- [x] Evaluation of results
- [x] Metrics

## Imports

In [1]:
# Data analysis and data wrangling
import numpy as np
import pandas as pd

# Plotting
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno # missing values

# Preprocessing
from sklearn.preprocessing import LabelEncoder

# Machine Learning
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Metrics
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

# Other
from IPython.display import Image
import configparser
import warnings
import os
import time
import pprint

## Preparing the Working Directory

In [2]:
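def prepare_directory_work(end_directory: str='notebooks'):
    # "__file__" is a plain string here, so this resolves to the current working directory
    curr_dir = os.path.dirname(os.path.realpath("__file__"))

    # If the notebook runs from inside `end_directory`, move up to the project root
    if curr_dir.endswith(end_directory):
        os.chdir('..')
        return curr_dir

    return f'Current working directory: {curr_dir}'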

In [3]:
prepare_directory_work(end_directory='notebooks')

'/home/campos/projetos/challenges/challenge-keyrus/notebooks'
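The directory check above can also be written with `pathlib`. The sketch below is an alternative, not the notebook's own helper: `move_to_project_root` is a hypothetical name, and it compares the exact directory name rather than a string suffix, so a folder such as `my_notebooks` would not trigger the move.

```python
import os
from pathlib import Path


def move_to_project_root(end_directory: str = 'notebooks') -> Path:
    """Step up one level when the current directory is `end_directory`."""
    cwd = Path.cwd()
    if cwd.name == end_directory:
        os.chdir(cwd.parent)
    # Return the directory we ended up in, whether or not we moved
    return Path.cwd()


project_root = move_to_project_root()
print(f'Current working directory: {project_root}')
```

Returning the final working directory in both branches keeps the return type consistent, which makes the result easier to reuse when building data paths.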

## Cell Format

In [4]:
config = configparser.ConfigParser()
config.read('src/visualization/plot_config.ini')

figure_titlesize = config['figure']['figure_titlesize']
figure_figsize_large = int(config['figure']['figure_figsize_large'])
figure_figsize_width = int(config['figure']['figure_figsize_width'])
figure_dpi = int(config['figure']['figure_dpi'])
figure_facecolor = config['figure']['figure_facecolor']
figure_autolayout = config['figure'].getboolean('figure_autolayout')  # bool('False') would be True

font_family = config['font']['font_family']
font_size = int(config['font']['font_size'])

legend_loc = config['legend']['legend_loc']
legend_fontsize = int(config['legend']['legend_fontsize'])
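`configparser` returns every value as a string, and `bool()` on any non-empty string is `True`, so a flag written as `False` in the file needs `getboolean` to be parsed correctly. The sketch below illustrates this with a made-up INI snippet. The real contents of `plot_config.ini` are not shown in this notebook, so the values here are assumptions.

```python
import configparser

# Hypothetical stand-in for src/visualization/plot_config.ini (values are made up)
sample_ini = """
[figure]
figure_dpi = 100
figure_autolayout = False
"""

config = configparser.ConfigParser()
config.read_string(sample_ini)

raw_value = config['figure']['figure_autolayout']   # the string 'False'
as_bool_cast = bool(raw_value)                      # truthy, because the string is non-empty
as_bool_parsed = config['figure'].getboolean('figure_autolayout')  # parsed as a boolean

# getint parses integers the same way int(config[...][...]) does
figure_dpi = config['figure'].getint('figure_dpi')

print(repr(raw_value), as_bool_cast, as_bool_parsed, figure_dpi)
```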

In [5]:
# Customizing file matplotlibrc

# Figure
plt.rcParams['figure.titlesize'] = figure_titlesize
plt.rcParams['figure.figsize'] = [figure_figsize_large, figure_figsize_width] 
plt.rcParams['figure.dpi'] = figure_dpi
plt.rcParams['figure.facecolor'] = figure_facecolor
plt.rcParams['figure.autolayout'] = figure_autolayout

# Font
plt.rcParams['font.family'] = font_family
plt.rcParams['font.size'] = font_size

# Legend
plt.rcParams['legend.loc'] = legend_loc
plt.rcParams['legend.fontsize'] = legend_fontsize

In [6]:
# Render plots inline in Jupyter
%matplotlib inline

# Load the "autoreload" extension so that code can change
%load_ext autoreload

# Show floats in all tables with 6 significant digits
pd.set_option('display.float_format', '{:.6}'.format)

# Show all rows and columns
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# Suppress unnecessary warnings so that the presentation looks clean
warnings.filterwarnings('ignore')

## Loading the Data

In [7]:
%%time

df_callcenter = pd.read_csv('data/cleansing/callcenter_marketing_clenning.csv', 
                            encoding='utf8',
                            delimiter=',',
                            verbose=True)

Tokenization took: 29.40 ms
Type conversion took: 21.56 ms
Parser memory cleanup took: 0.01 ms
CPU times: user 55 ms, sys: 7.73 ms, total: 62.7 ms
Wall time: 61.2 ms


In [8]:
df_callcenter.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41172 entries, 0 to 41171
Data columns (total 15 columns):
idade                          41172 non-null int64
profissao                      41172 non-null int64
educacao                       41172 non-null int64
meio_contato                   41172 non-null int64
mes                            41172 non-null int64
dia_da_semana                  41172 non-null int64
duracao                        41172 non-null int64
dias_ultimo_contato            41172 non-null int64
qtd_contatos_total             41172 non-null int64
campanha_anterior              41172 non-null int64
indice_precos_consumidor       41172 non-null float64
indice_confianca_consumidor    41172 non-null float64
euribor3m                      41172 non-null float64
numero_empregados              41172 non-null int64
resultado                      41172 non-null int64
dtypes: float64(3), int64(12)
memory usage: 4.7 MB


Note: this file loads in almost half the time of the original version of the CSV file.

---

## Global Variables

In [9]:
# Lists that will be manipulated in the data processing
list_columns = []
list_categorical_col = []
list_numerical_col = []
list_without_target_col = []

In [10]:
def get_col(df: 'dataframe' = None,
            type_descr: 'numpy' = np.number) -> list:
    """
    Return the list of column names of the given dtype(s).

    Args:
    type_descr
        [object, np.number] -> list with all columns
        np.number           -> list of numerical columns
        object              -> list of object columns
    """
    try:
        col = df.describe(include=type_descr).columns  # pandas Index
    except ValueError:
        # describe() raises ValueError when no column matches the requested dtype
        print(f'Dataframe not contains {type_descr} columns !', end='\n')
    else:
        return col.tolist()

In [11]:
def get_col_without_target(df: 'dataframe',
                           list_columns: list,
                           target_col: str) -> list:
    """Return a copy of `list_columns` with `target_col` removed."""
    col_target = list_columns.copy()
    col_target.remove(target_col)
    print(type(col_target))

    return col_target

In [12]:
# `object` replaces the `np.object` alias, which was removed in NumPy 1.24
list_numerical_col = get_col(df=df_callcenter,
                             type_descr=np.number)
list_categorical_col = get_col(df=df_callcenter,
                               type_descr=object)
list_columns = get_col(df=df_callcenter,
                       type_descr=[object, np.number])
list_without_target_col = get_col_without_target(df=df_callcenter,
                                                 list_columns=list_columns,
                                                 target_col='resultado')

Dataframe not contains <class 'object'> columns !
<class 'list'>
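The same column lists can be built with `DataFrame.select_dtypes`, which returns an empty selection instead of raising when no column matches. This is a sketch on a toy frame: the column names mirror the dataset, but the values (including the string-typed `profissao`) are made up for illustration.

```python
import pandas as pd

# Toy frame; column names mirror the dataset but the values are made up
df_demo = pd.DataFrame({
    'idade': [56, 37, 40],
    'euribor3m': [4.857, 4.857, 4.856],
    'profissao': ['housemaid', 'services', 'admin'],
    'resultado': [0, 0, 1],
})

# Numeric columns, and everything that is not numeric
numerical_cols = df_demo.select_dtypes(include='number').columns.tolist()
categorical_cols = df_demo.select_dtypes(exclude='number').columns.tolist()

# All columns except the target
feature_cols = [c for c in df_demo.columns if c != 'resultado']

# A frame with only numeric columns yields an empty list, not an exception
no_categorical = df_demo[numerical_cols].select_dtypes(exclude='number').columns.tolist()

print(numerical_cols, categorical_cols, feature_cols, no_categorical)
```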


## Training and Testing Dataset
- Metric: cross score (ROC AUC computed with `cross_val_score`)

In [13]:
def cross_val_model(X, y, model, n_splits=3):
    """Split the dataset into stratified folds and print a ROC AUC cross score per fold."""
    print("Begin training", end='\n\n')
    start = time.time()

    X = np.array(X)
    y = np.array(y)
    # Fixed seed so the folds are reproducible
    folds = list(StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=2017).split(X, y))

    for j, (train_idx, test_idx) in enumerate(folds):
        X_train = X[train_idx]
        y_train = y[train_idx]
        X_holdout = X[test_idx]
        y_holdout = y[test_idx]

        print("Fit %s fold %d" % (str(model).split('(')[0], j+1))
        model.fit(X_train, y_train)
        # cross_val_score clones the model and refits it on 3 sub-folds of the holdout split,
        # so the reported score does not use the fit on X_train above
        cross_score = cross_val_score(model, X_holdout, y_holdout, cv=3, scoring='roc_auc')
        print("\tcross_score: %.5f" % cross_score.mean())

    end = time.time()
    print("\nTraining done! Time Elapsed:", end - start, " seconds.")
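Because `cross_val_score` clones the estimator and refits it on the data it receives, the score printed by `cross_val_model` comes from models trained on parts of the holdout split, not from the model fitted on `X_train`. A minimal sketch of the alternative, scoring the model fitted on each training split against its own holdout, could look like the following. `holdout_auc_per_fold` is a hypothetical helper, and the data is synthetic, generated only to show the call.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold


def holdout_auc_per_fold(X, y, model, n_splits=3, random_state=2017):
    """Fit on each training split and score ROC AUC on the matching holdout."""
    X = np.asarray(X)
    y = np.asarray(y)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        model.fit(X[train_idx], y[train_idx])
        # Classifiers expose class probabilities; regressors only have predict()
        if hasattr(model, 'predict_proba'):
            y_score = model.predict_proba(X[test_idx])[:, 1]
        else:
            y_score = model.predict(X[test_idx])
        scores.append(roc_auc_score(y[test_idx], y_score))
    return scores


# Synthetic data, only to show the call
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
demo_model = RandomForestClassifier(n_estimators=20, random_state=0)
fold_scores = holdout_auc_per_fold(X_demo, y_demo, demo_model)
print([round(s, 3) for s in fold_scores])
```

With this variant each score reflects a model trained on roughly two thirds of the data and evaluated on the remaining third, which is the usual meaning of a per-fold holdout score.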

In [14]:
# Features and target

X = df_callcenter[list_without_target_col]
y = df_callcenter['resultado'] # target

---

## Prediction Models
- Baseline model
- Benchmarks

### Baseline Model
- I start with a baseline that is as simple as possible.

#### Linear Regression

In [15]:
# Visualize params

LinearRegression()

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [16]:
# create model
lr_model = LinearRegression(n_jobs=-1, normalize=False)

# features and target
X = df_callcenter[list_without_target_col]
y = df_callcenter['resultado']

In [17]:
# split dataset and calculate cross_score
cross_val_model(X, y, lr_model)

Begin training

Fit LinearRegression fold 1
	cross_score: 0.80385
Fit LinearRegression fold 2
	cross_score: 0.82735
Fit LinearRegression fold 3
	cross_score: 0.81495

Training done! Time Elapsed: 0.15801525115966797  seconds.


### Benchmarks

#### RandomForest

In [18]:
# Visualize params

RandomForestClassifier()

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators='warn',
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [19]:
# RandomForest params dict
rf_params_one = {}

rf_params_one['n_estimators'] = 10
rf_params_one['max_depth'] = 10
rf_params_one['min_samples_split'] = 10
rf_params_one['min_samples_leaf'] = 10 # each leaf must hold at least 10 samples
rf_params_one['n_jobs'] = -1 # use all available CPU cores

In [20]:
# create model
rf_model_one = RandomForestClassifier(**rf_params_one)

# features and target
X = df_callcenter[list_without_target_col]
y = df_callcenter['resultado']

In [24]:
# split dataset and calculate cross_score
cross_val_model(X, y, rf_model_one)

Begin training

Fit RandomForestClassifier fold 1
	cross_score: 0.22498
Fit RandomForestClassifier fold 2
	cross_score: 0.29388
Fit RandomForestClassifier fold 3
	cross_score: 0.19864

Training done! Time Elapsed: 1.6323716640472412  seconds.


In [25]:
# RandomForest params dict
rf_params_two = {}

rf_params_two['n_estimators'] = 1
rf_params_two['max_depth'] = len(list_numerical_col)*2
rf_params_two['min_samples_split'] = len(list_numerical_col)
rf_params_two['min_samples_leaf'] = len(list_numerical_col)
rf_params_two['n_jobs'] = -1 # use all available CPU cores

In [26]:
# create model
rf_model = RandomForestClassifier(**rf_params_two)

# features and target
X = df_callcenter[list_without_target_col]
y = df_callcenter['resultado']

In [27]:
# split dataset and calculate cross_score
cross_val_model(X, y, rf_model)

Begin training

Fit RandomForestClassifier fold 1
	cross_score: 0.33051
Fit RandomForestClassifier fold 2
	cross_score: 0.39882
Fit RandomForestClassifier fold 3
	cross_score: 0.30938

Training done! Time Elapsed: 0.4703516960144043  seconds.


#### Random Forest Regressor

In [28]:
# Visualize params

RandomForestRegressor()

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators='warn',
                      n_jobs=None, oob_score=False, random_state=None,
                      verbose=0, warm_start=False)

In [29]:
# 1st model Random Forest
rf_regressor_one = RandomForestRegressor(n_jobs = -1,
                                         verbose = 0)

In [30]:
# split dataset and calculate cross_score
cross_val_model(X, y, rf_regressor_one)

Begin training

Fit RandomForestRegressor fold 1
	cross_score: 0.70386
Fit RandomForestRegressor fold 2
	cross_score: 0.75172
Fit RandomForestRegressor fold 3
	cross_score: 0.78349

Training done! Time Elapsed: 2.127944231033325  seconds.


In [31]:
# 2nd model Random Forest
rf_regressor_two = RandomForestRegressor(n_estimators = 1000,
                                         max_leaf_nodes = len(list_numerical_col)*8,
                                         min_samples_leaf = len(list_numerical_col),
                                         max_depth = len(list_numerical_col)*4,
                                         n_jobs = -1,
                                         verbose = 0)

In [32]:
# split dataset and calculate cross_score
cross_val_model(X, y, rf_regressor_two)

Begin training

Fit RandomForestRegressor fold 1
	cross_score: 0.78663
Fit RandomForestRegressor fold 2
	cross_score: 0.82629
Fit RandomForestRegressor fold 3
	cross_score: 0.84985

Training done! Time Elapsed: 37.89920616149902  seconds.


---

## Choosing the Best Model

Based on the cross score, the chosen model is the **random forest regressor** with the parameters of the 2nd model, which scored above 0.84 on its best fold (about 0.82 on average across the three folds).
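To compare the models on more than a single fold, the per-fold scores printed above can be summarized by their mean and standard deviation. The sketch below copies the scores from the cell outputs of this notebook. The dictionary labels are informal names chosen here for readability.

```python
from statistics import mean, stdev

# Fold scores copied from the cell outputs above
fold_scores = {
    'LinearRegression': [0.80385, 0.82735, 0.81495],
    'RandomForestClassifier (params one)': [0.22498, 0.29388, 0.19864],
    'RandomForestClassifier (params two)': [0.33051, 0.39882, 0.30938],
    'RandomForestRegressor (1st)': [0.70386, 0.75172, 0.78349],
    'RandomForestRegressor (2nd)': [0.78663, 0.82629, 0.84985],
}

# Mean and sample standard deviation per model
summary = {name: (mean(s), stdev(s)) for name, s in fold_scores.items()}

# Print the models from highest to lowest mean score
for name, (m, sd) in sorted(summary.items(), key=lambda kv: kv[1][0], reverse=True):
    print(f'{name:<38} mean={m:.5f} std={sd:.5f}')

best_model = max(summary, key=lambda name: summary[name][0])
print('Best mean score:', best_model)
```

Printing the spread alongside the mean helps judge whether the gap between the two top models is larger than the fold-to-fold variation.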

---

#### Copyright
<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">
    <img alt="Creative Commons License" align="right" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" />
</a><br />This work by 
    <span xmlns:cc="http://creativecommons.org/ns#" property="cc:attributionName">Bruno A. R. M. Campos</span> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.