# HR Analytics

<img src = 'https://datahack-prod.s3.ap-south-1.amazonaws.com/__sized__/contest_cover/hr_1920x480_s5WuoZs-thumbnail-1200x1200-90.jpg'>

Practice Problem: https://datahack.analyticsvidhya.com/contest/wns-analytics-hackathon-2018-1/

## HR Analytics

HR analytics is revolutionising the way human resources departments operate, leading to higher efficiency and better results overall. Human resources has been using analytics for years. However, the collection, processing and analysis of data has been largely manual, and given the nature of human resources dynamics and HR KPIs, the approach has been constraining HR. Therefore, it is surprising that HR departments woke up to the utility of machine learning so late in the game. Here is an opportunity to try predictive analytics in identifying the employees most likely to get promoted.

## Problem Statement

Your client is a large MNC and they have 9 broad verticals across the organisation. One of the problem your client is facing is around identifying the right people for promotion *(only for manager position and below)* and prepare them in time. Currently the process, they are following is:

* They first identify a set of employees based on recommendations/ past performance
* Selected employees go through the separate training and evaluation program for each vertical. These programs are based on the required skill of each vertical
* At the end of the program, based on various factors such as training performance, KPI completion (only employees with KPIs completed greater than 60% are considered) etc., employee gets promotion

For above mentioned process, the final promotions are only announced after the evaluation and this leads to delay in transition to their new roles. Hence, company needs your help in identifying the eligible candidates at a particular checkpoint so that they can expedite the entire promotion cycle. 

<img src = 'https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2018/09/wns_hack_im_1.jpg'>

They have provided multiple attributes around Employee's past and current performance along with demographics. Now, The task is to predict whether a potential promotee at checkpoint in the test set will be promoted or not after the evaluation process.

## Evaluation Metric

The evaluation metric for this competition is F1 Score.

## Public and Private Split

Test data is further randomly divided into Public (40%) and Private (60%) data.

Your initial responses will be checked and scored on the Public data.
The final rankings would be based on your private score which will be published once the competition is over.

## Entorno

In [60]:
import sys
sys.version

'3.7.9 (default, Aug 31 2020, 17:10:11) [MSC v.1916 64 bit (AMD64)]'

In [61]:
!conda info --envs

# conda environments:
#
base                  *  C:\Users\antho\Anaconda3



## Paquetes

In [62]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import os
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm, tqdm_notebook
from pathlib import Path
import random
import warnings
import pickle

warnings.filterwarnings('ignore')


seed = 2020
random.seed(seed)

pd.set_option('display.max_columns', 1000)
pd.set_option('display.max_rows', 400)
sns.set()

DATA = Path('../../data') 
RAW  = DATA/'raw'
PROCESSED = DATA/'processed'
SUBMISSIONS = DATA/'submissions'    

MODEL = Path('../../model') 

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [63]:
pd.__version__

'1.1.3'

In [64]:
np.__version__

'1.19.2'

In [65]:
sklearn.__version__

'0.23.2'

In [66]:
id_columns = 'employee_id'
target = 'is_promoted'

## Lectura de datos

In [67]:
os.listdir(f'{PROCESSED}')

['.DS_Store',
 'preprocess_v1_capping_values.pkl',
 'preprocess_v1_impute_values.pkl',
 'preprocess_v1_ohe.pkl',
 'preprocess_v1_ohe_columns.pkl',
 'preprocess_v1_over50_train.csv',
 'preprocess_v1_scaler.pkl',
 'preprocess_v1_smote20_train.csv',
 'preprocess_v1_smote50_train.csv',
 'preprocess_v1_smoteTomek20_train.csv',
 'preprocess_v1_smoteTomek50_train.csv',
 'preprocess_v1_train.csv',
 'preprocess_v1_under50_train.csv',
 'preprocess_v1_val.csv',
 'preprocess_v2_capping_values.pkl',
 'preprocess_v2_knnimputation.pkl',
 'preprocess_v2_ohe.pkl',
 'preprocess_v2_ohe_columns.pkl',
 'preprocess_v2_over50_train.csv',
 'preprocess_v2_scaler.pkl',
 'preprocess_v2_scalerimputation.pkl',
 'preprocess_v2_smote20_train.csv',
 'preprocess_v2_smote50_train.csv',
 'preprocess_v2_smoteTomek20_train.csv',
 'preprocess_v2_smoteTomek50_train.csv',
 'preprocess_v2_test.csv',
 'preprocess_v2_train.csv',
 'preprocess_v2_under50_train.csv',
 'preprocess_v2_val.csv']

## Entrenamiento V1 sin balanceo

In [68]:
from sklearn.linear_model import LogisticRegression 
from sklearn.metrics import precision_recall_curve, roc_auc_score, f1_score

In [69]:
preproc_train = [file for file in os.listdir(f'{PROCESSED}') if file.endswith('train.csv')]
preproc_train

['preprocess_v1_over50_train.csv',
 'preprocess_v1_smote20_train.csv',
 'preprocess_v1_smote50_train.csv',
 'preprocess_v1_smoteTomek20_train.csv',
 'preprocess_v1_smoteTomek50_train.csv',
 'preprocess_v1_train.csv',
 'preprocess_v1_under50_train.csv',
 'preprocess_v2_over50_train.csv',
 'preprocess_v2_smote20_train.csv',
 'preprocess_v2_smote50_train.csv',
 'preprocess_v2_smoteTomek20_train.csv',
 'preprocess_v2_smoteTomek50_train.csv',
 'preprocess_v2_train.csv',
 'preprocess_v2_under50_train.csv']

In [70]:
preproc_val = [file for file in os.listdir(f'{PROCESSED}') if file.endswith('val.csv')]
preproc_val

['preprocess_v1_val.csv', 'preprocess_v2_val.csv']

In [71]:
for train_file in sorted(preproc_train):
    df_train = pd.read_csv(f'{PROCESSED}/{train_file}', compression = 'zip')
    df_val = pd.read_csv(f'{PROCESSED}/{preproc_val[0]}', compression = 'zip')
    
    print(f'label: {train_file:35} \tnrows: {len(df_train)} \t%target train: {df_train[target].mean():.4f} \t%target val: {df_val[target].mean():.4f}')

label: preprocess_v1_over50_train.csv      	nrows: 80224 	%target train: 0.5000 	%target val: 0.0852
label: preprocess_v1_smote20_train.csv     	nrows: 48134 	%target train: 0.1667 	%target val: 0.0852
label: preprocess_v1_smote50_train.csv     	nrows: 80224 	%target train: 0.5000 	%target val: 0.0852
label: preprocess_v1_smoteTomek20_train.csv 	nrows: 46412 	%target train: 0.1543 	%target val: 0.0852
label: preprocess_v1_smoteTomek50_train.csv 	nrows: 79638 	%target train: 0.5000 	%target val: 0.0852
label: preprocess_v1_train.csv             	nrows: 43846 	%target train: 0.0852 	%target val: 0.0852
label: preprocess_v1_under50_train.csv     	nrows: 7468 	%target train: 0.5000 	%target val: 0.0852
label: preprocess_v2_over50_train.csv      	nrows: 80224 	%target train: 0.5000 	%target val: 0.0852
label: preprocess_v2_smote20_train.csv     	nrows: 48134 	%target train: 0.1667 	%target val: 0.0852
label: preprocess_v2_smote50_train.csv     	nrows: 80224 	%target train: 0.5000 	%target v

In [88]:
train_file = 'preprocess_v1_train.csv'
val_file = 'preprocess_v1_val.csv'
df_val.head()

Unnamed: 0,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,na_previous_year_rating,na_education,department_Analytics,department_Finance,department_HR,department_Legal,department_Operations,department_Procurement,department_R&D,department_Sales & Marketing,department_Technology,region_region_1,region_region_10,region_region_11,region_region_12,region_region_13,region_region_14,region_region_15,region_region_16,region_region_17,region_region_18,region_region_19,region_region_2,region_region_20,region_region_21,region_region_22,region_region_23,region_region_24,region_region_25,region_region_26,region_region_27,region_region_28,region_region_29,region_region_3,region_region_30,region_region_31,region_region_32,region_region_33,region_region_34,region_region_4,region_region_5,region_region_6,region_region_7,region_region_8,region_region_9,education_Bachelor's,education_Below Secondary,education_Master's & above,gender_f,gender_m,recruitment_channel_other,recruitment_channel_referred,recruitment_channel_sourcing,is_promoted
0,-0.431999,-1.019219,0.550807,-0.93923,-0.737109,-0.153163,0.42047,-0.284445,-0.212536,-0.32894,-0.220673,-0.21574,-0.137364,-0.509846,2.58252,-0.135368,-0.666307,-0.387881,-0.106419,-0.110615,-0.155332,-0.094979,-0.222429,-0.125042,-0.232522,-0.166592,-0.122189,-0.024823,-0.126818,-0.537077,-0.125886,-0.085878,-0.363769,6.719649,-0.096073,-0.123338,-0.205222,-0.177348,-0.157475,-0.137364,-0.080456,-0.10934,-0.193472,-0.130933,-0.071013,-0.071175,-0.180403,-0.12103,-0.111771,-0.31312,-0.112084,-0.089315,0.634349,-0.120154,-0.612337,-0.651463,0.651463,0.89273,-0.145991,-0.85558,0
1,-0.431999,1.326548,-0.273747,2.990067,-0.737109,-0.153163,0.270731,-0.284445,-0.212536,-0.32894,-0.220673,-0.21574,-0.137364,-0.509846,2.58252,-0.135368,-0.666307,-0.387881,-0.106419,-0.110615,-0.155332,-0.094979,-0.222429,-0.125042,4.300672,-0.166592,-0.122189,-0.024823,-0.126818,-0.537077,-0.125886,-0.085878,-0.363769,-0.148817,-0.096073,-0.123338,-0.205222,-0.177348,-0.157475,-0.137364,-0.080456,-0.10934,-0.193472,-0.130933,-0.071013,-0.071175,-0.180403,-0.12103,-0.111771,-0.31312,-0.112084,-0.089315,-1.576419,-0.120154,1.633087,-0.651463,0.651463,0.89273,-0.145991,-0.85558,0
2,1.325254,0.153664,-1.0983,-0.693649,-0.737109,-0.153163,1.393774,-0.284445,-0.212536,3.040066,-0.220673,-0.21574,-0.137364,-0.509846,-0.387219,-0.135368,-0.666307,-0.387881,-0.106419,-0.110615,-0.155332,-0.094979,-0.222429,-0.125042,-0.232522,-0.166592,-0.122189,-0.024823,-0.126818,-0.537077,-0.125886,-0.085878,-0.363769,-0.148817,-0.096073,-0.123338,-0.205222,-0.177348,-0.157475,-0.137364,-0.080456,-0.10934,-0.193472,7.637508,-0.071013,-0.071175,-0.180403,-0.12103,-0.111771,-0.31312,-0.112084,-0.089315,0.634349,-0.120154,-0.612337,-0.651463,0.651463,0.89273,-0.145991,-0.85558,0
3,-0.431999,-0.888899,1.37536,-0.202487,1.356651,-0.153163,0.869688,-0.284445,-0.212536,-0.32894,-0.220673,-0.21574,-0.137364,-0.509846,-0.387219,-0.135368,-0.666307,2.578113,9.396775,-0.110615,-0.155332,-0.094979,-0.222429,-0.125042,-0.232522,-0.166592,-0.122189,-0.024823,-0.126818,-0.537077,-0.125886,-0.085878,-0.363769,-0.148817,-0.096073,-0.123338,-0.205222,-0.177348,-0.157475,-0.137364,-0.080456,-0.10934,-0.193472,-0.130933,-0.071013,-0.071175,-0.180403,-0.12103,-0.111771,-0.31312,-0.112084,-0.089315,0.634349,-0.120154,-0.612337,1.535007,-1.535007,-1.120159,-0.145991,1.168798,0
4,-0.431999,-0.106976,-1.0983,1.025418,1.356651,-0.153163,-0.852312,-0.284445,-0.212536,-0.32894,-0.220673,-0.21574,-0.137364,-0.509846,-0.387219,-0.135368,1.50081,-0.387881,-0.106419,-0.110615,-0.155332,-0.094979,-0.222429,-0.125042,-0.232522,-0.166592,-0.122189,-0.024823,-0.126818,-0.537077,-0.125886,-0.085878,-0.363769,-0.148817,-0.096073,-0.123338,-0.205222,-0.177348,-0.157475,-0.137364,-0.080456,-0.10934,-0.193472,-0.130933,-0.071013,-0.071175,-0.180403,-0.12103,-0.111771,3.193664,-0.112084,-0.089315,0.634349,-0.120154,-0.612337,1.535007,-1.535007,-1.120159,-0.145991,1.168798,0


In [85]:
#df_train.drop(target, axis = 1)

dele= ['education','age','no_of_tra','recruitm','length','gender','empl']
for a in dele :
    df_val=df_val[df_train.columns.drop(list(df_train.filter(regex=a)))].head()


In [81]:
from sklearn.model_selection import ParameterGrid

In [82]:
cv_grid = {'penalty': ['l1','l2'],'solver': ['liblinear','saga'],
              'C': [0.001,0.01,0.1,1,10,100,1000],
            'random_state': [seed]}

params_grid = list(ParameterGrid(cv_grid))

In [89]:
df_results = pd.DataFrame(columns = ['preproc_label', 'model_label', 'método', 'parámetros', 'columnas_out',
                                     'auc_train', 'auc_val', 'threshold','f1_train', 'f1_val'])


for xgb_params in tqdm(params_grid):
    
    for train_file in sorted(preproc_train):

        preproc_label = train_file.split('_train')[0]

        print('----------------------------------------------------------------------')
        print(xgb_params)
        print(train_file)
        print('----------------------------------------------------------------------')

        df_train = pd.read_csv(f'{PROCESSED}/{train_file}', compression = 'zip')
        df_val = pd.read_csv(f'{PROCESSED}/{preproc_val[0]}', compression = 'zip')

        
        dele= ['education','age','no_of_tra','recruitm','length','gender','empl']
        for a in dele :
            df_train=df_train[df_train.columns.drop(list(df_train.filter(regex=a)))]
            df_val=df_val[df_val.columns.drop(list(df_val.filter(regex=a)))]
        
        
        X_train, y_train = df_train.drop(target, axis = 1), df_train[target]
        X_val, y_val = df_val.drop(target, axis = 1), df_val[target]

        logi = LogisticRegression(solver = xgb_params["solver"],penalty =xgb_params["penalty"], C =xgb_params["C"], random_state= xgb_params["random_state"])
        logi_fit = logi.fit( X_train,y_train )
                        
        
        #xgb_params_export = xgb_params.copy()
        #xgb_params_export.update(logi_fit.attributes())

        probs_train = logi_fit.predict(X_train)
        probs_val = logi_fit.predict(X_val)

        auc_train = roc_auc_score(y_train, probs_train)
        auc_val = roc_auc_score(y_val, probs_val)

        #best threshold
        prec, recall, threshold = precision_recall_curve(y_train, probs_train)
        prec_recall = pd.DataFrame({'prec': prec[:-1], 'recall': recall[:-1], 'threshold': threshold})
        prec_recall['f1'] = 2*prec_recall['prec']*prec_recall['recall'] / (prec_recall['prec'] + prec_recall['recall'])
        prec_recall = prec_recall.sort_values(by = 'f1', ascending = False).head(1)

        #f1 scores
        best_threshold = prec_recall['threshold'].values[0]
        f1_train = prec_recall['f1'].values[0]

        labels_val = np.where(probs_val >= best_threshold, 1, 0)
        f1_val = f1_score(y_val, labels_val)

        print(f'auc_train: {auc_train:.6f} \tauc_val: {auc_val:.6f} \tf1_train: {f1_train:.6f} \tf1_val: {f1_val:.6f}')

        results = [preproc_label, 'xgboost_baseline', 'fit', xgb_params, '',
                  auc_train, auc_val, best_threshold, f1_train, f1_val]


        df_results.loc[len(df_results)] = results

  0%|          | 0/28 [00:00<?, ?it/s]

----------------------------------------------------------------------
{'C': 0.001, 'penalty': 'l1', 'random_state': 2020, 'solver': 'liblinear'}
preprocess_v1_over50_train.csv
----------------------------------------------------------------------
auc_train: 0.772961 	auc_val: 0.779456 	f1_train: 0.784061 	f1_val: 0.346537
----------------------------------------------------------------------
{'C': 0.001, 'penalty': 'l1', 'random_state': 2020, 'solver': 'liblinear'}
preprocess_v1_smote20_train.csv
----------------------------------------------------------------------
auc_train: 0.543655 	auc_val: 0.551785 	f1_train: 0.285704 	f1_val: 0.157028
----------------------------------------------------------------------
{'C': 0.001, 'penalty': 'l1', 'random_state': 2020, 'solver': 'liblinear'}
preprocess_v1_smote50_train.csv
----------------------------------------------------------------------
auc_train: 0.787844 	auc_val: 0.781960 	f1_train: 0.801094 	f1_val: 0.349023
-----------------------

  4%|▎         | 1/28 [00:18<08:29, 18.87s/it]

auc_train: 0.713846 	auc_val: 0.712636 	f1_train: 0.723866 	f1_val: 0.286592
----------------------------------------------------------------------
{'C': 0.001, 'penalty': 'l1', 'random_state': 2020, 'solver': 'saga'}
preprocess_v1_over50_train.csv
----------------------------------------------------------------------
auc_train: 0.766117 	auc_val: 0.777858 	f1_train: 0.773844 	f1_val: 0.351086
----------------------------------------------------------------------
{'C': 0.001, 'penalty': 'l1', 'random_state': 2020, 'solver': 'saga'}
preprocess_v1_smote20_train.csv
----------------------------------------------------------------------
auc_train: 0.540414 	auc_val: 0.548337 	f1_train: 0.285704 	f1_val: 0.157028
----------------------------------------------------------------------
{'C': 0.001, 'penalty': 'l1', 'random_state': 2020, 'solver': 'saga'}
preprocess_v1_smote50_train.csv
----------------------------------------------------------------------
auc_train: 0.780515 	auc_val: 0.774334

  7%|▋         | 2/28 [00:40<08:32, 19.73s/it]

auc_train: 0.713444 	auc_val: 0.711941 	f1_train: 0.722726 	f1_val: 0.286892
----------------------------------------------------------------------
{'C': 0.001, 'penalty': 'l2', 'random_state': 2020, 'solver': 'liblinear'}
preprocess_v1_over50_train.csv
----------------------------------------------------------------------
auc_train: 0.779355 	auc_val: 0.782002 	f1_train: 0.789291 	f1_val: 0.352409
----------------------------------------------------------------------
{'C': 0.001, 'penalty': 'l2', 'random_state': 2020, 'solver': 'liblinear'}
preprocess_v1_smote20_train.csv
----------------------------------------------------------------------
auc_train: 0.588831 	auc_val: 0.606799 	f1_train: 0.303479 	f1_val: 0.335743
----------------------------------------------------------------------
{'C': 0.001, 'penalty': 'l2', 'random_state': 2020, 'solver': 'liblinear'}
preprocess_v1_smote50_train.csv
----------------------------------------------------------------------
auc_train: 0.794251 	au

 11%|█         | 3/28 [00:58<08:01, 19.26s/it]

auc_train: 0.757633 	auc_val: 0.753203 	f1_train: 0.773297 	f1_val: 0.316137
----------------------------------------------------------------------
{'C': 0.001, 'penalty': 'l2', 'random_state': 2020, 'solver': 'saga'}
preprocess_v1_over50_train.csv
----------------------------------------------------------------------
auc_train: 0.774107 	auc_val: 0.777861 	f1_train: 0.781784 	f1_val: 0.353050
----------------------------------------------------------------------
{'C': 0.001, 'penalty': 'l2', 'random_state': 2020, 'solver': 'saga'}
preprocess_v1_smote20_train.csv
----------------------------------------------------------------------
auc_train: 0.581638 	auc_val: 0.598931 	f1_train: 0.285704 	f1_val: 0.157028
----------------------------------------------------------------------
{'C': 0.001, 'penalty': 'l2', 'random_state': 2020, 'solver': 'saga'}
preprocess_v1_smote50_train.csv
----------------------------------------------------------------------
auc_train: 0.788891 	auc_val: 0.776318

 14%|█▍        | 4/28 [01:18<07:47, 19.47s/it]

auc_train: 0.740761 	auc_val: 0.741713 	f1_train: 0.743304 	f1_val: 0.322536
----------------------------------------------------------------------
{'C': 0.01, 'penalty': 'l1', 'random_state': 2020, 'solver': 'liblinear'}
preprocess_v1_over50_train.csv
----------------------------------------------------------------------
auc_train: 0.788579 	auc_val: 0.797681 	f1_train: 0.795525 	f1_val: 0.378208
----------------------------------------------------------------------
{'C': 0.01, 'penalty': 'l1', 'random_state': 2020, 'solver': 'liblinear'}
preprocess_v1_smote20_train.csv
----------------------------------------------------------------------
auc_train: 0.642546 	auc_val: 0.655469 	f1_train: 0.432858 	f1_val: 0.424284
----------------------------------------------------------------------
{'C': 0.01, 'penalty': 'l1', 'random_state': 2020, 'solver': 'liblinear'}
preprocess_v1_smote50_train.csv
----------------------------------------------------------------------
auc_train: 0.803961 	auc_v

 18%|█▊        | 5/28 [02:06<10:44, 28.01s/it]

auc_train: 0.769282 	auc_val: 0.777467 	f1_train: 0.780649 	f1_val: 0.347144
----------------------------------------------------------------------
{'C': 0.01, 'penalty': 'l1', 'random_state': 2020, 'solver': 'saga'}
preprocess_v1_over50_train.csv
----------------------------------------------------------------------
auc_train: 0.787794 	auc_val: 0.797109 	f1_train: 0.794351 	f1_val: 0.378339
----------------------------------------------------------------------
{'C': 0.01, 'penalty': 'l1', 'random_state': 2020, 'solver': 'saga'}
preprocess_v1_smote20_train.csv
----------------------------------------------------------------------
auc_train: 0.641811 	auc_val: 0.653963 	f1_train: 0.431300 	f1_val: 0.421629
----------------------------------------------------------------------
{'C': 0.01, 'penalty': 'l1', 'random_state': 2020, 'solver': 'saga'}
preprocess_v1_smote50_train.csv
----------------------------------------------------------------------
auc_train: 0.803089 	auc_val: 0.799043 	f

 21%|██▏       | 6/28 [02:53<12:17, 33.53s/it]

auc_train: 0.760846 	auc_val: 0.770775 	f1_train: 0.768172 	f1_val: 0.346925
----------------------------------------------------------------------
{'C': 0.01, 'penalty': 'l2', 'random_state': 2020, 'solver': 'liblinear'}
preprocess_v1_over50_train.csv
----------------------------------------------------------------------
auc_train: 0.788592 	auc_val: 0.797492 	f1_train: 0.795854 	f1_val: 0.376109
----------------------------------------------------------------------
{'C': 0.01, 'penalty': 'l2', 'random_state': 2020, 'solver': 'liblinear'}
preprocess_v1_smote20_train.csv
----------------------------------------------------------------------
auc_train: 0.634182 	auc_val: 0.648037 	f1_train: 0.414751 	f1_val: 0.411397
----------------------------------------------------------------------
{'C': 0.01, 'penalty': 'l2', 'random_state': 2020, 'solver': 'liblinear'}
preprocess_v1_smote50_train.csv
----------------------------------------------------------------------
auc_train: 0.804622 	auc_v

 25%|██▌       | 7/28 [03:13<10:24, 29.73s/it]

auc_train: 0.780396 	auc_val: 0.779551 	f1_train: 0.790870 	f1_val: 0.351869
----------------------------------------------------------------------
{'C': 0.01, 'penalty': 'l2', 'random_state': 2020, 'solver': 'saga'}
preprocess_v1_over50_train.csv
----------------------------------------------------------------------
auc_train: 0.788991 	auc_val: 0.797569 	f1_train: 0.796073 	f1_val: 0.377413
----------------------------------------------------------------------
{'C': 0.01, 'penalty': 'l2', 'random_state': 2020, 'solver': 'saga'}
preprocess_v1_smote20_train.csv
----------------------------------------------------------------------
auc_train: 0.632374 	auc_val: 0.647066 	f1_train: 0.410729 	f1_val: 0.409756
----------------------------------------------------------------------
{'C': 0.01, 'penalty': 'l2', 'random_state': 2020, 'solver': 'saga'}
preprocess_v1_smote50_train.csv
----------------------------------------------------------------------
auc_train: 0.804635 	auc_val: 0.799516 	f

 29%|██▊       | 8/28 [03:37<09:20, 28.01s/it]

auc_train: 0.773299 	auc_val: 0.774116 	f1_train: 0.781520 	f1_val: 0.352363
----------------------------------------------------------------------
{'C': 0.1, 'penalty': 'l1', 'random_state': 2020, 'solver': 'liblinear'}
preprocess_v1_over50_train.csv
----------------------------------------------------------------------
auc_train: 0.788056 	auc_val: 0.798233 	f1_train: 0.794100 	f1_val: 0.381631
----------------------------------------------------------------------
{'C': 0.1, 'penalty': 'l1', 'random_state': 2020, 'solver': 'liblinear'}
preprocess_v1_smote20_train.csv
----------------------------------------------------------------------
auc_train: 0.667528 	auc_val: 0.681509 	f1_train: 0.480626 	f1_val: 0.458463
----------------------------------------------------------------------
{'C': 0.1, 'penalty': 'l1', 'random_state': 2020, 'solver': 'liblinear'}
preprocess_v1_smote50_train.csv
----------------------------------------------------------------------
auc_train: 0.803101 	auc_val:

 32%|███▏      | 9/28 [05:22<16:09, 51.04s/it]

auc_train: 0.787895 	auc_val: 0.799862 	f1_train: 0.795929 	f1_val: 0.381650
----------------------------------------------------------------------
{'C': 0.1, 'penalty': 'l1', 'random_state': 2020, 'solver': 'saga'}
preprocess_v1_over50_train.csv
----------------------------------------------------------------------
auc_train: 0.788081 	auc_val: 0.798233 	f1_train: 0.794119 	f1_val: 0.381631
----------------------------------------------------------------------
{'C': 0.1, 'penalty': 'l1', 'random_state': 2020, 'solver': 'saga'}
preprocess_v1_smote20_train.csv
----------------------------------------------------------------------
auc_train: 0.667528 	auc_val: 0.681509 	f1_train: 0.480626 	f1_val: 0.458463
----------------------------------------------------------------------
{'C': 0.1, 'penalty': 'l1', 'random_state': 2020, 'solver': 'saga'}
preprocess_v1_smote50_train.csv
----------------------------------------------------------------------
auc_train: 0.803039 	auc_val: 0.793806 	f1_t

 36%|███▌      | 10/28 [06:41<17:49, 59.43s/it]

auc_train: 0.787225 	auc_val: 0.798605 	f1_train: 0.794941 	f1_val: 0.381115
----------------------------------------------------------------------
{'C': 0.1, 'penalty': 'l2', 'random_state': 2020, 'solver': 'liblinear'}
preprocess_v1_over50_train.csv
----------------------------------------------------------------------
auc_train: 0.788941 	auc_val: 0.798655 	f1_train: 0.795394 	f1_val: 0.381209
----------------------------------------------------------------------
{'C': 0.1, 'penalty': 'l2', 'random_state': 2020, 'solver': 'liblinear'}
preprocess_v1_smote20_train.csv
----------------------------------------------------------------------
auc_train: 0.664498 	auc_val: 0.677425 	f1_train: 0.474955 	f1_val: 0.451857
----------------------------------------------------------------------
{'C': 0.1, 'penalty': 'l2', 'random_state': 2020, 'solver': 'liblinear'}
preprocess_v1_smote50_train.csv
----------------------------------------------------------------------
auc_train: 0.803425 	auc_val:

 39%|███▉      | 11/28 [07:04<13:42, 48.41s/it]

auc_train: 0.787895 	auc_val: 0.800048 	f1_train: 0.795876 	f1_val: 0.381393
----------------------------------------------------------------------
{'C': 0.1, 'penalty': 'l2', 'random_state': 2020, 'solver': 'saga'}
preprocess_v1_over50_train.csv
----------------------------------------------------------------------
auc_train: 0.788779 	auc_val: 0.797784 	f1_train: 0.795170 	f1_val: 0.380789
----------------------------------------------------------------------
{'C': 0.1, 'penalty': 'l2', 'random_state': 2020, 'solver': 'saga'}
preprocess_v1_smote20_train.csv
----------------------------------------------------------------------
auc_train: 0.664062 	auc_val: 0.677425 	f1_train: 0.474078 	f1_val: 0.451857
----------------------------------------------------------------------
{'C': 0.1, 'penalty': 'l2', 'random_state': 2020, 'solver': 'saga'}
preprocess_v1_smote50_train.csv
----------------------------------------------------------------------
auc_train: 0.803400 	auc_val: 0.794092 	f1_t

 43%|████▎     | 12/28 [07:32<11:18, 42.43s/it]

auc_train: 0.787493 	auc_val: 0.800161 	f1_train: 0.795358 	f1_val: 0.382209
----------------------------------------------------------------------
{'C': 1, 'penalty': 'l1', 'random_state': 2020, 'solver': 'liblinear'}
preprocess_v1_over50_train.csv
----------------------------------------------------------------------
auc_train: 0.787208 	auc_val: 0.798296 	f1_train: 0.792916 	f1_val: 0.382360
----------------------------------------------------------------------
{'C': 1, 'penalty': 'l1', 'random_state': 2020, 'solver': 'liblinear'}
preprocess_v1_smote20_train.csv
----------------------------------------------------------------------
auc_train: 0.669572 	auc_val: 0.681881 	f1_train: 0.484224 	f1_val: 0.456401
----------------------------------------------------------------------
{'C': 1, 'penalty': 'l1', 'random_state': 2020, 'solver': 'liblinear'}
preprocess_v1_smote50_train.csv
----------------------------------------------------------------------
auc_train: 0.803039 	auc_val: 0.793

 46%|████▋     | 13/28 [10:27<20:30, 82.04s/it]

auc_train: 0.787627 	auc_val: 0.800573 	f1_train: 0.794984 	f1_val: 0.383595
----------------------------------------------------------------------
{'C': 1, 'penalty': 'l1', 'random_state': 2020, 'solver': 'saga'}
preprocess_v1_over50_train.csv
----------------------------------------------------------------------
auc_train: 0.787208 	auc_val: 0.798296 	f1_train: 0.792916 	f1_val: 0.382360
----------------------------------------------------------------------
{'C': 1, 'penalty': 'l1', 'random_state': 2020, 'solver': 'saga'}
preprocess_v1_smote20_train.csv
----------------------------------------------------------------------
auc_train: 0.669572 	auc_val: 0.681881 	f1_train: 0.484224 	f1_val: 0.456401
----------------------------------------------------------------------
{'C': 1, 'penalty': 'l1', 'random_state': 2020, 'solver': 'saga'}
preprocess_v1_smote50_train.csv
----------------------------------------------------------------------
auc_train: 0.803027 	auc_val: 0.793756 	f1_train: 

 50%|█████     | 14/28 [11:06<16:10, 69.31s/it]

auc_train: 0.787627 	auc_val: 0.800623 	f1_train: 0.794984 	f1_val: 0.383690
----------------------------------------------------------------------
{'C': 1, 'penalty': 'l2', 'random_state': 2020, 'solver': 'liblinear'}
preprocess_v1_over50_train.csv
----------------------------------------------------------------------
auc_train: 0.787345 	auc_val: 0.798296 	f1_train: 0.793082 	f1_val: 0.382360
----------------------------------------------------------------------
{'C': 1, 'penalty': 'l2', 'random_state': 2020, 'solver': 'liblinear'}
preprocess_v1_smote20_train.csv
----------------------------------------------------------------------
auc_train: 0.669447 	auc_val: 0.681881 	f1_train: 0.483978 	f1_val: 0.456401
----------------------------------------------------------------------
{'C': 1, 'penalty': 'l2', 'random_state': 2020, 'solver': 'liblinear'}
preprocess_v1_smote50_train.csv
----------------------------------------------------------------------
auc_train: 0.802964 	auc_val: 0.793

 54%|█████▎    | 15/28 [11:26<11:48, 54.50s/it]

auc_train: 0.787627 	auc_val: 0.802129 	f1_train: 0.795037 	f1_val: 0.384691
----------------------------------------------------------------------
{'C': 1, 'penalty': 'l2', 'random_state': 2020, 'solver': 'saga'}
preprocess_v1_over50_train.csv
----------------------------------------------------------------------
auc_train: 0.787345 	auc_val: 0.798296 	f1_train: 0.793082 	f1_val: 0.382360
----------------------------------------------------------------------
{'C': 1, 'penalty': 'l2', 'random_state': 2020, 'solver': 'saga'}
preprocess_v1_smote20_train.csv
----------------------------------------------------------------------
auc_train: 0.669348 	auc_val: 0.681881 	f1_train: 0.483811 	f1_val: 0.456401
----------------------------------------------------------------------
{'C': 1, 'penalty': 'l2', 'random_state': 2020, 'solver': 'saga'}
preprocess_v1_smote50_train.csv
----------------------------------------------------------------------
auc_train: 0.802952 	auc_val: 0.793606 	f1_train: 

 57%|█████▋    | 16/28 [11:52<09:09, 45.75s/it]

auc_train: 0.787493 	auc_val: 0.802179 	f1_train: 0.794882 	f1_val: 0.384785
----------------------------------------------------------------------
{'C': 10, 'penalty': 'l1', 'random_state': 2020, 'solver': 'liblinear'}
preprocess_v1_over50_train.csv
----------------------------------------------------------------------
auc_train: 0.787208 	auc_val: 0.798296 	f1_train: 0.792916 	f1_val: 0.382360
----------------------------------------------------------------------
{'C': 10, 'penalty': 'l1', 'random_state': 2020, 'solver': 'liblinear'}
preprocess_v1_smote20_train.csv
----------------------------------------------------------------------
auc_train: 0.669821 	auc_val: 0.683238 	f1_train: 0.484495 	f1_val: 0.457846
----------------------------------------------------------------------
{'C': 10, 'penalty': 'l1', 'random_state': 2020, 'solver': 'liblinear'}
preprocess_v1_smote50_train.csv
----------------------------------------------------------------------
auc_train: 0.802989 	auc_val: 0.

 61%|██████    | 17/28 [12:42<08:36, 46.97s/it]

auc_train: 0.788029 	auc_val: 0.800623 	f1_train: 0.795346 	f1_val: 0.383690
----------------------------------------------------------------------
{'C': 10, 'penalty': 'l1', 'random_state': 2020, 'solver': 'saga'}
preprocess_v1_over50_train.csv
----------------------------------------------------------------------
auc_train: 0.787221 	auc_val: 0.798345 	f1_train: 0.792925 	f1_val: 0.382454
----------------------------------------------------------------------
{'C': 10, 'penalty': 'l1', 'random_state': 2020, 'solver': 'saga'}
preprocess_v1_smote20_train.csv
----------------------------------------------------------------------
auc_train: 0.669834 	auc_val: 0.683238 	f1_train: 0.484534 	f1_val: 0.457846
----------------------------------------------------------------------
{'C': 10, 'penalty': 'l1', 'random_state': 2020, 'solver': 'saga'}
preprocess_v1_smote50_train.csv
----------------------------------------------------------------------
auc_train: 0.802977 	auc_val: 0.793756 	f1_trai

 64%|██████▍   | 18/28 [13:14<07:05, 42.52s/it]

auc_train: 0.787627 	auc_val: 0.800623 	f1_train: 0.794984 	f1_val: 0.383690
----------------------------------------------------------------------
{'C': 10, 'penalty': 'l2', 'random_state': 2020, 'solver': 'liblinear'}
preprocess_v1_over50_train.csv
----------------------------------------------------------------------
auc_train: 0.787221 	auc_val: 0.798345 	f1_train: 0.792925 	f1_val: 0.382454
----------------------------------------------------------------------
{'C': 10, 'penalty': 'l2', 'random_state': 2020, 'solver': 'liblinear'}
preprocess_v1_smote20_train.csv
----------------------------------------------------------------------
auc_train: 0.669846 	auc_val: 0.682753 	f1_train: 0.484574 	f1_val: 0.457178
----------------------------------------------------------------------
{'C': 10, 'penalty': 'l2', 'random_state': 2020, 'solver': 'liblinear'}
preprocess_v1_smote50_train.csv
----------------------------------------------------------------------
auc_train: 0.802977 	auc_val: 0.

 68%|██████▊   | 19/28 [13:35<05:24, 36.03s/it]

auc_train: 0.787761 	auc_val: 0.800623 	f1_train: 0.795087 	f1_val: 0.383690
----------------------------------------------------------------------
{'C': 10, 'penalty': 'l2', 'random_state': 2020, 'solver': 'saga'}
preprocess_v1_over50_train.csv
----------------------------------------------------------------------
auc_train: 0.787221 	auc_val: 0.798345 	f1_train: 0.792925 	f1_val: 0.382454
----------------------------------------------------------------------
{'C': 10, 'penalty': 'l2', 'random_state': 2020, 'solver': 'saga'}
preprocess_v1_smote20_train.csv
----------------------------------------------------------------------
auc_train: 0.669846 	auc_val: 0.682753 	f1_train: 0.484574 	f1_val: 0.457178
----------------------------------------------------------------------
{'C': 10, 'penalty': 'l2', 'random_state': 2020, 'solver': 'saga'}
preprocess_v1_smote50_train.csv
----------------------------------------------------------------------
auc_train: 0.802989 	auc_val: 0.793756 	f1_trai

 71%|███████▏  | 20/28 [14:02<04:26, 33.34s/it]

auc_train: 0.787761 	auc_val: 0.800623 	f1_train: 0.795140 	f1_val: 0.383690
----------------------------------------------------------------------
{'C': 100, 'penalty': 'l1', 'random_state': 2020, 'solver': 'liblinear'}
preprocess_v1_over50_train.csv
----------------------------------------------------------------------
auc_train: 0.787221 	auc_val: 0.798345 	f1_train: 0.792925 	f1_val: 0.382454
----------------------------------------------------------------------
{'C': 100, 'penalty': 'l1', 'random_state': 2020, 'solver': 'liblinear'}
preprocess_v1_smote20_train.csv
----------------------------------------------------------------------
auc_train: 0.669834 	auc_val: 0.683238 	f1_train: 0.484534 	f1_val: 0.457846
----------------------------------------------------------------------
{'C': 100, 'penalty': 'l1', 'random_state': 2020, 'solver': 'liblinear'}
preprocess_v1_smote50_train.csv
----------------------------------------------------------------------
auc_train: 0.802977 	auc_val:

 75%|███████▌  | 21/28 [14:34<03:51, 33.12s/it]

auc_train: 0.788029 	auc_val: 0.800623 	f1_train: 0.795346 	f1_val: 0.383690
----------------------------------------------------------------------
{'C': 100, 'penalty': 'l1', 'random_state': 2020, 'solver': 'saga'}
preprocess_v1_over50_train.csv
----------------------------------------------------------------------
auc_train: 0.787221 	auc_val: 0.798345 	f1_train: 0.792925 	f1_val: 0.382454
----------------------------------------------------------------------
{'C': 100, 'penalty': 'l1', 'random_state': 2020, 'solver': 'saga'}
preprocess_v1_smote20_train.csv
----------------------------------------------------------------------
auc_train: 0.669834 	auc_val: 0.683238 	f1_train: 0.484534 	f1_val: 0.457846
----------------------------------------------------------------------
{'C': 100, 'penalty': 'l1', 'random_state': 2020, 'solver': 'saga'}
preprocess_v1_smote50_train.csv
----------------------------------------------------------------------
auc_train: 0.802977 	auc_val: 0.793756 	f1_t

 79%|███████▊  | 22/28 [15:08<03:19, 33.25s/it]

auc_train: 0.787761 	auc_val: 0.800623 	f1_train: 0.795087 	f1_val: 0.383690
----------------------------------------------------------------------
{'C': 100, 'penalty': 'l2', 'random_state': 2020, 'solver': 'liblinear'}
preprocess_v1_over50_train.csv
----------------------------------------------------------------------
auc_train: 0.787221 	auc_val: 0.798345 	f1_train: 0.792925 	f1_val: 0.382454
----------------------------------------------------------------------
{'C': 100, 'penalty': 'l2', 'random_state': 2020, 'solver': 'liblinear'}
preprocess_v1_smote20_train.csv
----------------------------------------------------------------------
auc_train: 0.669834 	auc_val: 0.683238 	f1_train: 0.484534 	f1_val: 0.457846
----------------------------------------------------------------------
{'C': 100, 'penalty': 'l2', 'random_state': 2020, 'solver': 'liblinear'}
preprocess_v1_smote50_train.csv
----------------------------------------------------------------------
auc_train: 0.802977 	auc_val:

 82%|████████▏ | 23/28 [15:30<02:29, 29.96s/it]

auc_train: 0.788029 	auc_val: 0.800623 	f1_train: 0.795346 	f1_val: 0.383690
----------------------------------------------------------------------
{'C': 100, 'penalty': 'l2', 'random_state': 2020, 'solver': 'saga'}
preprocess_v1_over50_train.csv
----------------------------------------------------------------------
auc_train: 0.787221 	auc_val: 0.798345 	f1_train: 0.792925 	f1_val: 0.382454
----------------------------------------------------------------------
{'C': 100, 'penalty': 'l2', 'random_state': 2020, 'solver': 'saga'}
preprocess_v1_smote20_train.csv
----------------------------------------------------------------------
auc_train: 0.669834 	auc_val: 0.683238 	f1_train: 0.484534 	f1_val: 0.457846
----------------------------------------------------------------------
{'C': 100, 'penalty': 'l2', 'random_state': 2020, 'solver': 'saga'}
preprocess_v1_smote50_train.csv
----------------------------------------------------------------------
auc_train: 0.802977 	auc_val: 0.793756 	f1_t

 86%|████████▌ | 24/28 [15:58<01:57, 29.49s/it]

auc_train: 0.787761 	auc_val: 0.800623 	f1_train: 0.795087 	f1_val: 0.383690
----------------------------------------------------------------------
{'C': 1000, 'penalty': 'l1', 'random_state': 2020, 'solver': 'liblinear'}
preprocess_v1_over50_train.csv
----------------------------------------------------------------------
auc_train: 0.787221 	auc_val: 0.798345 	f1_train: 0.792925 	f1_val: 0.382454
----------------------------------------------------------------------
{'C': 1000, 'penalty': 'l1', 'random_state': 2020, 'solver': 'liblinear'}
preprocess_v1_smote20_train.csv
----------------------------------------------------------------------
auc_train: 0.669834 	auc_val: 0.683238 	f1_train: 0.484534 	f1_val: 0.457846
----------------------------------------------------------------------
{'C': 1000, 'penalty': 'l1', 'random_state': 2020, 'solver': 'liblinear'}
preprocess_v1_smote50_train.csv
----------------------------------------------------------------------
auc_train: 0.802977 	auc_v

 89%|████████▉ | 25/28 [16:30<01:29, 29.98s/it]

auc_train: 0.788029 	auc_val: 0.800623 	f1_train: 0.795346 	f1_val: 0.383690
----------------------------------------------------------------------
{'C': 1000, 'penalty': 'l1', 'random_state': 2020, 'solver': 'saga'}
preprocess_v1_over50_train.csv
----------------------------------------------------------------------
auc_train: 0.787221 	auc_val: 0.798345 	f1_train: 0.792925 	f1_val: 0.382454
----------------------------------------------------------------------
{'C': 1000, 'penalty': 'l1', 'random_state': 2020, 'solver': 'saga'}
preprocess_v1_smote20_train.csv
----------------------------------------------------------------------
auc_train: 0.669834 	auc_val: 0.683238 	f1_train: 0.484534 	f1_val: 0.457846
----------------------------------------------------------------------
{'C': 1000, 'penalty': 'l1', 'random_state': 2020, 'solver': 'saga'}
preprocess_v1_smote50_train.csv
----------------------------------------------------------------------
auc_train: 0.802977 	auc_val: 0.793756 	f

 93%|█████████▎| 26/28 [17:02<01:01, 30.56s/it]

auc_train: 0.787761 	auc_val: 0.800623 	f1_train: 0.795087 	f1_val: 0.383690
----------------------------------------------------------------------
{'C': 1000, 'penalty': 'l2', 'random_state': 2020, 'solver': 'liblinear'}
preprocess_v1_over50_train.csv
----------------------------------------------------------------------
auc_train: 0.787221 	auc_val: 0.798345 	f1_train: 0.792925 	f1_val: 0.382454
----------------------------------------------------------------------
{'C': 1000, 'penalty': 'l2', 'random_state': 2020, 'solver': 'liblinear'}
preprocess_v1_smote20_train.csv
----------------------------------------------------------------------
auc_train: 0.669896 	auc_val: 0.683238 	f1_train: 0.484657 	f1_val: 0.457846
----------------------------------------------------------------------
{'C': 1000, 'penalty': 'l2', 'random_state': 2020, 'solver': 'liblinear'}
preprocess_v1_smote50_train.csv
----------------------------------------------------------------------
auc_train: 0.802977 	auc_v

 96%|█████████▋| 27/28 [17:23<00:27, 27.75s/it]

auc_train: 0.788029 	auc_val: 0.800623 	f1_train: 0.795346 	f1_val: 0.383690
----------------------------------------------------------------------
{'C': 1000, 'penalty': 'l2', 'random_state': 2020, 'solver': 'saga'}
preprocess_v1_over50_train.csv
----------------------------------------------------------------------
auc_train: 0.787221 	auc_val: 0.798345 	f1_train: 0.792925 	f1_val: 0.382454
----------------------------------------------------------------------
{'C': 1000, 'penalty': 'l2', 'random_state': 2020, 'solver': 'saga'}
preprocess_v1_smote20_train.csv
----------------------------------------------------------------------
auc_train: 0.669834 	auc_val: 0.683238 	f1_train: 0.484534 	f1_val: 0.457846
----------------------------------------------------------------------
{'C': 1000, 'penalty': 'l2', 'random_state': 2020, 'solver': 'saga'}
preprocess_v1_smote50_train.csv
----------------------------------------------------------------------
auc_train: 0.802977 	auc_val: 0.793756 	f

100%|██████████| 28/28 [17:51<00:00, 38.25s/it]

auc_train: 0.787761 	auc_val: 0.800623 	f1_train: 0.795087 	f1_val: 0.383690





In [49]:
df_results

Unnamed: 0,preproc_label,model_label,método,parámetros,columnas_out,auc_train,auc_val,threshold,f1_train,f1_val
0,preprocess_v1_over50,xgboost_baseline,fit,"{'C': 0.001, 'penalty': 'l1', 'random_state': ...",,0.772961,0.779456,1,0.784061,0.346537
1,preprocess_v1_smote20,xgboost_baseline,fit,"{'C': 0.001, 'penalty': 'l1', 'random_state': ...",,0.543655,0.551785,0,0.285704,0.157028
2,preprocess_v1_smote50,xgboost_baseline,fit,"{'C': 0.001, 'penalty': 'l1', 'random_state': ...",,0.788791,0.781288,1,0.802107,0.34891
3,preprocess_v1_smoteTomek20,xgboost_baseline,fit,"{'C': 0.001, 'penalty': 'l1', 'random_state': ...",,0.540606,0.548873,0,0.267336,0.157028
4,preprocess_v1_smoteTomek50,xgboost_baseline,fit,"{'C': 0.001, 'penalty': 'l1', 'random_state': ...",,0.792838,0.782408,1,0.805856,0.349722


In [90]:
MODELS = DATA/'models'

In [91]:
df_results.to_csv(f'{MODELS}/logistic_base_MIN_IV.csv', index = False)

In [92]:
df_results.to_csv('logistic_base_MIN_IV.csv')

In [93]:
df_results.to_excel(f'{MODELS}/logistic_base_MIN_IV.xlsx', index = False)