# Hypothesis testing

During our exploratory data analysis (EDA), we observed that some features appear to have no relationship with employee resignation. Here we test that hypothesis feature by feature, so that we can later train the model with and without the features that show no significant association and compare its performance.

A significance level of 0.01 is used: we reject the null hypothesis only when the p-value is below 0.01, which corresponds to a 99% confidence level.

## Preparing environment

In [84]:
import numpy as np
import pandas as pd
import pingouin
import sys
sys.path.append('../high_performance_employee_resign_prediction')
from utils import paths

In [85]:
import warnings
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=RuntimeWarning)

## Importing data

In [86]:
train_df = pd.read_csv(paths.data_interim_dir('train_clean.csv'))

## Generating hypothesis testing functions

### Chi2 independence test

In [87]:
def chi2_independence_test(data: pd.DataFrame, main_cat: str, sec_cat: str, alpha=0.01):
    """
    Perform a Chi-square test of independence between two categorical variables.

    Parameters:
    data (pd.DataFrame): The DataFrame containing the data to be tested.
    main_cat (str): The name of the first categorical variable (column).
    sec_cat (str): The name of the second categorical variable (column).
    alpha (float, optional): The significance level to be used for the test. Defaults to 0.01.

    Returns:
    None

    This function prints the Chi-square test statistics and whether the null hypothesis
    of independence between the two categorical variables is rejected or not.

    Example:
    >>> df = pd.DataFrame({'A': ['a', 'b', 'a', 'b'], 'B': ['x', 'x', 'y', 'y']})
    >>> chi2_independence_test(df, 'A', 'B')
    """
    
    expected, observed, stats = pingouin.chi2_independence(data=data, x=main_cat, y=sec_cat)
    
    stats = pd.DataFrame(stats)
    
    print(stats)
    print('-'*80)
    
    stat_sign = stats['pval'] < alpha
    
    if stat_sign.any():
        print('Reject null hypothesis: There is a statistically significant difference between categories.')
    else:
        print('Failed to reject null hypothesis: There is no statistically significant difference between categories.')
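To make the `pearson` row of the output easier to read, here is a minimal pure-Python sketch of how the Pearson chi-square statistic and Cramér's V are obtained from a contingency table. The function name `pearson_chi2` and the toy counts are illustrative assumptions, not part of the notebook. The sketch applies no continuity correction, whereas pingouin, to our understanding, applies one to 2x2 tables by default, so values for such tables may differ slightly.

```python
import math


def pearson_chi2(observed):
    """Return (chi2, dof, cramers_v) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (count - expected) ** 2 / expected

    n_rows, n_cols = len(row_totals), len(col_totals)
    dof = (n_rows - 1) * (n_cols - 1)
    # Cramér's V rescales chi2 to the range [0, 1].
    cramers_v = math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))
    return chi2, dof, cramers_v


# Hypothetical counts: rows = resign (0/1), columns = a binary feature.
toy_table = [[10, 20],
             [20, 10]]
chi2, dof, cramers_v = pearson_chi2(toy_table)
```

Cramér's V is the effect-size column (`cramer`) in the outputs below: a very small p-value can still come with a weak association.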

### Kruskal-Wallis test

In [88]:
def kruskal_wallis_test(data: pd.DataFrame, cat_feat: str, num_feat: str, alpha=0.01):
    """
    Perform the Kruskal-Wallis H-test for independent samples to determine if there are 
    statistically significant differences between the groups of a categorical variable 
    on a continuous variable.

    Parameters:
    data (pd.DataFrame): The DataFrame containing the data to be tested.
    cat_feat (str): The name of the categorical variable (column).
    num_feat (str): The name of the numerical variable (column).
    alpha (float, optional): The significance level to be used for the test. Defaults to 0.01.

    Returns:
    None

    This function prints the Kruskal-Wallis test results and whether the null hypothesis
    of equal medians among groups is rejected or not.

    Example:
    >>> df = pd.DataFrame({'Category': ['A', 'A', 'B', 'B', 'C', 'C'], 'Values': [5, 6, 7, 8, 9, 10]})
    >>> kruskal_wallis_test(df, 'Category', 'Values')
    """
    
    results = pingouin.kruskal(data=data, dv=num_feat, between=cat_feat)
    
    print(results)
    print('-'*80)
    
    if results['p-unc'].iloc[0] < alpha:
        print(f'Reject null hypothesis: There is a statistically significant difference in {num_feat} between {cat_feat}.')
    else:
        print(f'Failed to reject null hypothesis: There is no statistically significant difference in {num_feat} between {cat_feat}')
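As a reference for the `H` column printed by this function, the following is a minimal pure-Python sketch of the Kruskal-Wallis H statistic with the usual tie correction. The helper names `average_ranks` and `kruskal_h` are assumptions made for illustration; the notebook itself relies on `pingouin.kruskal`.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg_rank
        pos = end + 1
    return ranks


def kruskal_h(groups):
    """Kruskal-Wallis H statistic (tie-corrected) for a list of samples."""
    values = [v for group in groups for v in group]
    n = len(values)
    ranks = average_ranks(values)

    total = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    h = 12 / (n * (n + 1)) * total - 3 * (n + 1)

    # Tie correction: 1 - sum(t^3 - t) / (n^3 - n) over groups of tied values.
    tie_counts = {}
    for v in values:
        tie_counts[v] = tie_counts.get(v, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in tie_counts.values()) / (n ** 3 - n)
    return h / correction


# Same data as the docstring example: categories A, B and C.
h_example = kruskal_h([[5, 6], [7, 8], [9, 10]])
```

With a binary grouping variable such as `resign`, the test has one degree of freedom, which matches the `ddof1` column in the outputs further down.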

In [89]:
no_infl_cat = []
no_infl_num = []

## Categorical tests: Chi2 independence tests between categorical features and target.

### id_last_boss_employee

- H<sub>0</sub>: There is no statistically significant association between the employee's last boss and employee resignation

- H<sub>1</sub>: There is a statistically significant association between the employee's last boss and employee resignation

In [90]:
chi2_independence_test(train_df, 'resign', 'id_last_boss_employee')

                 test    lambda        chi2    dof          pval    cramer  \
0             pearson  1.000000  264.698574  170.0  4.457220e-06  0.350715   
1        cressie-read  0.666667  270.056103  170.0  1.596952e-06  0.354247   
2      log-likelihood  0.000000  294.549127  170.0  1.000626e-08  0.369963   
3       freeman-tukey -0.500000         NaN  170.0           NaN       NaN   
4  mod-log-likelihood -1.000000         inf  170.0  0.000000e+00       inf   
5              neyman -2.000000         NaN  170.0           NaN       NaN   

   power  
0    1.0  
1    1.0  
2    1.0  
3    NaN  
4    NaN  
5    NaN  
--------------------------------------------------------------------------------
Reject null hypothesis: There is a statistically significant difference between categories.


id_last_boss_employee has an influence on our target.
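The NaN and inf entries in the freeman-tukey, mod-log-likelihood and neyman rows are what these statistics typically produce when the contingency table contains empty cells, which is plausible here given the 170 degrees of freedom. The sketch below is a minimal pure-Python check for such sparsity. The function `sparse_cell_report`, the toy counts and the threshold of 5 expected observations (a common rule of thumb) are illustrative assumptions, not part of the notebook.

```python
def sparse_cell_report(observed, min_expected=5.0):
    """Count empty observed cells and cells whose expected count is below min_expected."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    empty_cells = 0
    low_expected_cells = 0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count under independence.
            expected = row_totals[i] * col_totals[j] / n
            if count == 0:
                empty_cells += 1
            if expected < min_expected:
                low_expected_cells += 1
    return {"empty_cells": empty_cells, "low_expected_cells": low_expected_cells}


# Hypothetical counts: rows = resign (0/1), columns = three boss IDs.
toy_table = [[12, 3, 40],
             [9, 0, 36]]
report = sparse_cell_report(toy_table)
```

In this case the Pearson row alone already has a p-value below 0.01, so the conclusion does not depend on the degenerate rows.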

### seniority_employee

- H<sub>0</sub>: There is no statistically significant association between employee seniority and employee resignation

- H<sub>1</sub>: There is a statistically significant association between employee seniority and employee resignation

In [91]:
chi2_independence_test(train_df, 'resign', 'seniority_employee')

                 test    lambda       chi2  dof          pval    cramer  \
0             pearson  1.000000  36.374241  1.0  1.628412e-09  0.130010   
1        cressie-read  0.666667  36.565411  1.0  1.476282e-09  0.130351   
2      log-likelihood  0.000000  37.810045  1.0  7.797921e-10  0.132551   
3       freeman-tukey -0.500000  39.605439  1.0  3.108164e-10  0.135661   
4  mod-log-likelihood -1.000000  42.291719  1.0  7.862479e-11  0.140187   
5              neyman -2.000000  51.216367  1.0  8.272602e-13  0.154271   

      power  
0  0.999977  
1  0.999978  
2  0.999986  
3  0.999993  
4  0.999997  
5  1.000000  
--------------------------------------------------------------------------------
Reject null hypothesis: There is a statistically significant difference between categories.


seniority_employee has an influence on our target.

### work_modality_employee

- H<sub>0</sub>: There is no statistically significant association between employee work modality and employee resignation

- H<sub>1</sub>: There is a statistically significant association between employee work modality and employee resignation

In [92]:
chi2_independence_test(train_df, 'resign', 'work_modality_employee')

                 test    lambda      chi2  dof      pval    cramer     power
0             pearson  1.000000  7.961430  1.0  0.004778  0.060824  0.805557
1        cressie-read  0.666667  7.956675  1.0  0.004791  0.060806  0.805325
2      log-likelihood  0.000000  7.950230  1.0  0.004808  0.060781  0.805010
3       freeman-tukey -0.500000  7.948070  1.0  0.004814  0.060773  0.804905
4  mod-log-likelihood -1.000000  7.948196  1.0  0.004814  0.060773  0.804911
5              neyman -2.000000  7.955293  1.0  0.004795  0.060800  0.805258
--------------------------------------------------------------------------------
Reject null hypothesis: There is a statistically significant difference between categories.


work_modality_employee has an influence on our target.

### gender_employee

- H<sub>0</sub>: There is no statistically significant association between employee gender and employee resignation

- H<sub>1</sub>: There is a statistically significant association between employee gender and employee resignation

In [93]:
chi2_independence_test(train_df, 'resign', 'gender_employee')

                 test    lambda      chi2  dof      pval    cramer     power
0             pearson  1.000000  3.902991  1.0  0.048200  0.042587  0.506279
1        cressie-read  0.666667  3.903170  1.0  0.048195  0.042588  0.506297
2      log-likelihood  0.000000  3.903928  1.0  0.048173  0.042592  0.506373
3       freeman-tukey -0.500000  3.904849  1.0  0.048147  0.042597  0.506466
4  mod-log-likelihood -1.000000  3.906070  1.0  0.048112  0.042604  0.506589
5              neyman -2.000000  3.909419  1.0  0.048016  0.042622  0.506927
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference between categories.


gender_employee has no significant influence on our target at the 0.01 level (the p-values are around 0.048).

In [94]:
no_infl_cat.append('gender_employee')

### recruitment_channel_employee

- H<sub>0</sub>: There is no statistically significant association between employee recruitment channel and employee resignation

- H<sub>1</sub>: There is a statistically significant association between employee recruitment channel and employee resignation

In [95]:
chi2_independence_test(train_df, 'resign', 'recruitment_channel_employee')

                 test    lambda       chi2  dof          pval    cramer  \
0             pearson  1.000000  43.645651  4.0  7.600404e-09  0.142413   
1        cressie-read  0.666667  43.801927  4.0  7.053200e-09  0.142668   
2      log-likelihood  0.000000  44.227223  4.0  5.755041e-09  0.143359   
3       freeman-tukey -0.500000  44.649401  4.0  4.702424e-09  0.144041   
4  mod-log-likelihood -1.000000  45.165170  4.0  3.673663e-09  0.144871   
5              neyman -2.000000  46.500171  4.0  1.937891e-09  0.146996   

      power  
0  0.999942  
1  0.999945  
2  0.999951  
3  0.999957  
4  0.999963  
5  0.999976  
--------------------------------------------------------------------------------
Reject null hypothesis: There is a statistically significant difference between categories.


recruitment_channel_employee has an influence on our target.

### marital_estatus_employee

- H<sub>0</sub>: There is no statistically significant association between employee marital status and employee resignation

- H<sub>1</sub>: There is a statistically significant association between employee marital status and employee resignation

In [96]:
chi2_independence_test(train_df, 'resign', 'marital_estatus_employee')

                 test    lambda       chi2  dof      pval    cramer     power
0             pearson  1.000000  20.900890  3.0  0.000110  0.098551  0.980173
1        cressie-read  0.666667  20.918306  3.0  0.000109  0.098592  0.980261
2      log-likelihood  0.000000  20.968675  3.0  0.000107  0.098711  0.980515
3       freeman-tukey -0.500000  21.020234  3.0  0.000104  0.098832  0.980771
4  mod-log-likelihood -1.000000  21.083810  3.0  0.000101  0.098981  0.981082
5              neyman -2.000000  21.247888  3.0  0.000094  0.099366  0.981864
--------------------------------------------------------------------------------
Reject null hypothesis: There is a statistically significant difference between categories.


marital_estatus_employee has an influence on our target.

### join_year_employee

- H<sub>0</sub>: There is no statistically significant association between the employee's joining year and employee resignation

- H<sub>1</sub>: There is a statistically significant association between the employee's joining year and employee resignation

In [97]:
chi2_independence_test(train_df, 'resign', 'join_year_employee')

                 test    lambda       chi2   dof          pval    cramer  \
0             pearson  1.000000  48.316797  11.0  1.253445e-06  0.149840   
1        cressie-read  0.666667  48.351606  11.0  1.235634e-06  0.149894   
2      log-likelihood  0.000000  48.619469  11.0  1.106703e-06  0.150309   
3       freeman-tukey -0.500000  48.998801  11.0  9.465954e-07  0.150894   
4  mod-log-likelihood -1.000000  49.537969  11.0  7.577378e-07  0.151722   
5              neyman -2.000000  51.132013  11.0  3.913587e-07  0.154144   

      power  
0  0.999681  
1  0.999684  
2  0.999704  
3  0.999730  
4  0.999764  
5  0.999841  
--------------------------------------------------------------------------------
Reject null hypothesis: There is a statistically significant difference between categories.


join_year_employee has an influence on our target.

### join_month_employee

- H<sub>0</sub>: There is no statistically significant association between the employee's joining month and employee resignation

- H<sub>1</sub>: There is a statistically significant association between the employee's joining month and employee resignation

In [98]:
chi2_independence_test(train_df, 'resign', 'join_month_employee')

                 test    lambda      chi2   dof      pval    cramer     power
0             pearson  1.000000  7.293563  11.0  0.774836  0.058217  0.379021
1        cressie-read  0.666667  7.290816  11.0  0.775067  0.058206  0.378872
2      log-likelihood  0.000000  7.289974  11.0  0.775138  0.058203  0.378826
3       freeman-tukey -0.500000  7.293408  11.0  0.774849  0.058216  0.379012
4  mod-log-likelihood -1.000000  7.300327  11.0  0.774267  0.058244  0.379388
5              neyman -2.000000  7.324659  11.0  0.772216  0.058341  0.380710
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference between categories.


join_month_employee has no significant influence on our target.

In [99]:
no_infl_cat.append('join_month_employee')

### performance_employee

- H<sub>0</sub>: There is no statistically significant association between employee performance and employee resignation

- H<sub>1</sub>: There is a statistically significant association between employee performance and employee resignation

In [100]:
chi2_independence_test(train_df, 'resign', 'performance_employee')

                 test    lambda       chi2  dof          pval    cramer  power
0             pearson  1.000000  69.057825  1.0  9.561839e-17  0.179137    1.0
1        cressie-read  0.666667  69.289208  1.0  8.503360e-17  0.179437    1.0
2      log-likelihood  0.000000  69.937619  1.0  6.120989e-17  0.180274    1.0
3       freeman-tukey -0.500000  70.592867  1.0  4.391019e-17  0.181117    1.0
4  mod-log-likelihood -1.000000  71.400207  1.0  2.916415e-17  0.182150    1.0
5              neyman -2.000000  73.503201  1.0  1.004731e-17  0.184813    1.0
--------------------------------------------------------------------------------
Reject null hypothesis: There is a statistically significant difference between categories.


performance_employee has an influence on our target.

### id_last_boss_boss

- H<sub>0</sub>: There is no statistically significant association between the boss's last boss and employee resignation

- H<sub>1</sub>: There is a statistically significant association between the boss's last boss and employee resignation

In [101]:
chi2_independence_test(train_df, 'resign', 'id_last_boss_boss')

                 test    lambda        chi2   dof      pval    cramer  \
0             pearson  1.000000  133.221417  92.0  0.003229  0.248809   
1        cressie-read  0.666667  135.458640  92.0  0.002172  0.250889   
2      log-likelihood  0.000000  145.605041  92.0  0.000315  0.260116   
3       freeman-tukey -0.500000         NaN  92.0       NaN       NaN   
4  mod-log-likelihood -1.000000         inf  92.0  0.000000       inf   
5              neyman -2.000000         NaN  92.0       NaN       NaN   

      power  
0  0.999999  
1  1.000000  
2  1.000000  
3       NaN  
4       NaN  
5       NaN  
--------------------------------------------------------------------------------
Reject null hypothesis: There is a statistically significant difference between categories.


id_last_boss_boss has an influence on our target.

### work_modality_boss

- H<sub>0</sub>: There is no statistically significant association between boss work modality and employee resignation

- H<sub>1</sub>: There is a statistically significant association between boss work modality and employee resignation

In [102]:
chi2_independence_test(train_df, 'resign', 'work_modality_boss')

                 test    lambda      chi2  dof      pval    cramer     power
0             pearson  1.000000  0.924852  1.0  0.336204  0.020731  0.160815
1        cressie-read  0.666667  0.924137  1.0  0.336391  0.020723  0.160727
2      log-likelihood  0.000000  0.922925  1.0  0.336708  0.020709  0.160577
3       freeman-tukey -0.500000  0.922207  1.0  0.336896  0.020701  0.160489
4  mod-log-likelihood -1.000000  0.921651  1.0  0.337042  0.020695  0.160421
5              neyman -2.000000  0.921028  1.0  0.337205  0.020688  0.160344
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference between categories.


work_modality_boss has no significant influence on our target.

In [103]:
no_infl_cat.append('work_modality_boss')

### gender_boss

- H<sub>0</sub>: There is no statistically significant association between boss gender and employee resignation

- H<sub>1</sub>: There is a statistically significant association between boss gender and employee resignation

In [104]:
chi2_independence_test(train_df, 'resign', 'gender_boss')

                 test    lambda      chi2  dof      pval    cramer     power
0             pearson  1.000000  2.940271  1.0  0.086396  0.036963  0.403254
1        cressie-read  0.666667  2.940088  1.0  0.086406  0.036962  0.403233
2      log-likelihood  0.000000  2.939959  1.0  0.086413  0.036961  0.403218
3       freeman-tukey -0.500000  2.940069  1.0  0.086407  0.036962  0.403231
4  mod-log-likelihood -1.000000  2.940356  1.0  0.086392  0.036964  0.403263
5              neyman -2.000000  2.941462  1.0  0.086333  0.036971  0.403388
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference between categories.


gender_boss has no significant influence on our target.

In [105]:
no_infl_cat.append('gender_boss')

### recruitment_channel_boss

- H<sub>0</sub>: There is no statistically significant association between boss recruitment channel and employee resignation

- H<sub>1</sub>: There is a statistically significant association between boss recruitment channel and employee resignation

In [106]:
chi2_independence_test(train_df, 'resign', 'recruitment_channel_boss')

                 test    lambda      chi2  dof      pval    cramer     power
0             pearson  1.000000  1.502063  4.0  0.826276  0.026419  0.137539
1        cressie-read  0.666667  1.500585  4.0  0.826538  0.026406  0.137442
2      log-likelihood  0.000000  1.499660  4.0  0.826702  0.026398  0.137380
3       freeman-tukey -0.500000  1.500745  4.0  0.826510  0.026408  0.137452
4  mod-log-likelihood -1.000000  1.503363  4.0  0.826046  0.026431  0.137625
5              neyman -2.000000  1.513262  4.0  0.824290  0.026518  0.138281
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference between categories.


recruitment_channel_boss has no significant influence on our target.

In [107]:
no_infl_cat.append('recruitment_channel_boss')

### marital_estatus_boss

- H<sub>0</sub>: There is no statistically significant association between boss marital status and employee resignation

- H<sub>1</sub>: There is a statistically significant association between boss marital status and employee resignation

In [108]:
chi2_independence_test(train_df, 'resign', 'marital_estatus_boss')

                 test    lambda      chi2  dof      pval    cramer     power
0             pearson  1.000000  0.932356  3.0  0.817613  0.020815  0.110812
1        cressie-read  0.666667  0.932976  3.0  0.817463  0.020822  0.110857
2      log-likelihood  0.000000  0.934325  3.0  0.817138  0.020837  0.110953
3       freeman-tukey -0.500000  0.935430  3.0  0.816870  0.020849  0.111032
4  mod-log-likelihood -1.000000  0.936617  3.0  0.816584  0.020862  0.111116
5              neyman -2.000000  0.939236  3.0  0.815950  0.020891  0.111303
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference between categories.


marital_estatus_boss has no significant influence on our target.

In [109]:
no_infl_cat.append('marital_estatus_boss')

### join_year_boss

- H<sub>0</sub>: There is no statistically significant association between the boss's joining year and employee resignation

- H<sub>1</sub>: There is a statistically significant association between the boss's joining year and employee resignation

In [110]:
chi2_independence_test(train_df, 'resign', 'join_year_boss')

                 test    lambda       chi2   dof      pval    cramer     power
0             pearson  1.000000  19.041420  11.0  0.060354  0.094065  0.857897
1        cressie-read  0.666667  19.040042  11.0  0.060379  0.094062  0.857866
2      log-likelihood  0.000000  19.065956  11.0  0.059920  0.094126  0.858447
3       freeman-tukey -0.500000  19.110599  11.0  0.059137  0.094236  0.859443
4  mod-log-likelihood -1.000000  19.177053  11.0  0.057988  0.094400  0.860916
5              neyman -2.000000  19.376574  11.0  0.054660  0.094889  0.865262
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference between categories.


join_year_boss has no significant influence on our target.

In [111]:
no_infl_cat.append('join_year_boss')

### join_month_boss

- H<sub>0</sub>: There is no statistically significant association between the boss's joining month and employee resignation

- H<sub>1</sub>: There is a statistically significant association between the boss's joining month and employee resignation

In [112]:
chi2_independence_test(train_df, 'resign', 'join_month_boss')

                 test    lambda       chi2   dof      pval    cramer     power
0             pearson  1.000000  14.417242  11.0  0.210759  0.081850  0.720273
1        cressie-read  0.666667  14.426933  11.0  0.210263  0.081878  0.720637
2      log-likelihood  0.000000  14.455854  11.0  0.208787  0.081960  0.721720
3       freeman-tukey -0.500000  14.485953  11.0  0.207260  0.082045  0.722844
4  mod-log-likelihood -1.000000  14.523329  11.0  0.205375  0.082151  0.724236
5              neyman -2.000000  14.620227  11.0  0.200551  0.082424  0.727821
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference between categories.


join_month_boss has no significant influence on our target.

In [113]:
no_infl_cat.append('join_month_boss')

### performance_boss

- H<sub>0</sub>: There is no statistically significant association between boss performance and employee resignation

- H<sub>1</sub>: There is a statistically significant association between boss performance and employee resignation

In [114]:
chi2_independence_test(train_df, 'resign', 'performance_boss')

                 test    lambda      chi2  dof      pval    cramer     power
0             pearson  1.000000  2.341648  1.0  0.125956  0.032987  0.333941
1        cressie-read  0.666667  2.341074  1.0  0.126003  0.032983  0.333873
2      log-likelihood  0.000000  2.340121  1.0  0.126080  0.032976  0.333760
3       freeman-tukey -0.500000  2.339577  1.0  0.126124  0.032972  0.333696
4  mod-log-likelihood -1.000000  2.339179  1.0  0.126156  0.032969  0.333648
5              neyman -2.000000  2.338820  1.0  0.126185  0.032967  0.333606
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference between categories.


performance_boss has no significant influence on our target.

In [115]:
no_infl_cat.append('performance_boss')

### joined_after_boss

- H<sub>0</sub>: There is no statistically significant association between having joined the company after the boss and employee resignation

- H<sub>1</sub>: There is a statistically significant association between having joined the company after the boss and employee resignation

In [116]:
chi2_independence_test(train_df, 'resign', 'joined_after_boss')

                 test    lambda      chi2  dof      pval    cramer     power
0             pearson  1.000000  2.751312  1.0  0.097175  0.035756  0.381758
1        cressie-read  0.666667  2.751297  1.0  0.097176  0.035756  0.381756
2      log-likelihood  0.000000  2.751467  1.0  0.097165  0.035757  0.381775
3       freeman-tukey -0.500000  2.751770  1.0  0.097147  0.035759  0.381810
4  mod-log-likelihood -1.000000  2.752224  1.0  0.097119  0.035762  0.381862
5              neyman -2.000000  2.753584  1.0  0.097037  0.035771  0.382018
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference between categories.


joined_after_boss has no significant influence on our target.

In [117]:
no_infl_cat.append('joined_after_boss')

### younger_than_boss

- H<sub>0</sub>: There is no statistically significant association between being younger than the boss and employee resignation

- H<sub>1</sub>: There is a statistically significant association between being younger than the boss and employee resignation

In [118]:
chi2_independence_test(train_df, 'resign', 'younger_than_boss')

                 test    lambda      chi2  dof      pval    cramer     power
0             pearson  1.000000  2.751312  1.0  0.097175  0.035756  0.381758
1        cressie-read  0.666667  2.751297  1.0  0.097176  0.035756  0.381756
2      log-likelihood  0.000000  2.751467  1.0  0.097165  0.035757  0.381775
3       freeman-tukey -0.500000  2.751770  1.0  0.097147  0.035759  0.381810
4  mod-log-likelihood -1.000000  2.752224  1.0  0.097119  0.035762  0.381862
5              neyman -2.000000  2.753584  1.0  0.097037  0.035771  0.382018
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference between categories.


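younger_than_boss has no significant influence on our target. Its test statistics are identical to those of joined_after_boss, which suggests that the two columns may split the employees in the same way; this is worth checking before modelling.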

In [119]:
no_infl_cat.append('younger_than_boss')

In [120]:
print(f'Categories considered to discard {no_infl_cat}')

Categories considered to discard ['gender_employee', 'join_month_employee', 'work_modality_boss', 'gender_boss', 'recruitment_channel_boss', 'marital_estatus_boss', 'join_year_boss', 'join_month_boss', 'performance_boss', 'joined_after_boss', 'younger_than_boss']
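The discard list depends directly on the chosen significance level. The sketch below illustrates this with three Pearson p-values copied from the outputs above; the helper `not_rejected` and the comparison at 0.05 are assumptions added for illustration and are not part of the notebook.

```python
# Pearson p-values copied from the chi-square outputs above.
pearson_pvals = {
    "work_modality_employee": 0.004778,
    "gender_employee": 0.048200,
    "join_month_employee": 0.774836,
}


def not_rejected(pvals, alpha):
    """Return the features whose null hypothesis is not rejected at level alpha."""
    return [name for name, p in pvals.items() if p >= alpha]


discard_at_001 = not_rejected(pearson_pvals, alpha=0.01)
discard_at_005 = not_rejected(pearson_pvals, alpha=0.05)
```

At 0.01 both gender_employee and join_month_employee stay in the list, while at 0.05 only join_month_employee would, so gender_employee is a borderline case that the with/without comparison of the model can help settle.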


## Numerical tests: Kruskal-Wallis tests between numerical features and target.

### office_distance_employee

- H<sub>0</sub>: There is no statistically significant difference in employee distance to the office between employees who resigned and those who did not

- H<sub>1</sub>: There is a statistically significant difference in employee distance to the office between employees who resigned and those who did not

In [121]:
kruskal_wallis_test(train_df, 'resign', 'office_distance_employee')

         Source  ddof1         H     p-unc
Kruskal  resign      1  0.215454  0.642526
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in office_distance_employee between resign


office_distance_employee has no significant influence on our target.

In [122]:
no_infl_num.append('office_distance_employee')
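Because `resign` has two groups, the Kruskal-Wallis statistic is compared against a chi-square distribution with one degree of freedom (`ddof1 = 1`). In that special case the p-value has a closed form, which the following sketch uses to reproduce the `p-unc` column from `H` with only the standard library. The function name `p_value_one_dof` is an assumption for illustration, and the shortcut is only valid for one degree of freedom.

```python
import math


def p_value_one_dof(h):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    # For 1 dof, P(X > h) = erfc(sqrt(h / 2)).
    return math.erfc(math.sqrt(h / 2))


# H statistic for office_distance_employee, copied from the output above.
p_office_distance = p_value_one_dof(0.215454)
```

The result agrees with the reported `p-unc` of 0.642526 up to rounding, well above the 0.01 threshold.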

### low_health_days_employee

- H<sub>0</sub>: There is no statistically significant difference in employee sick days between employees who resigned and those who did not

- H<sub>1</sub>: There is a statistically significant difference in employee sick days between employees who resigned and those who did not

In [123]:
kruskal_wallis_test(train_df, 'resign', 'low_health_days_employee')

         Source  ddof1          H         p-unc
Kruskal  resign      1  44.826429  2.152970e-11
--------------------------------------------------------------------------------
Reject null hypothesis: There is a statistically significant difference in low_health_days_employee between resign.


low_health_days_employee has an influence on our target.

### average_permanence_employee

- H<sub>0</sub>: There is no statistically significant difference in employee average permanence between employees who resigned and those who did not

- H<sub>1</sub>: There is a statistically significant difference in employee average permanence between employees who resigned and those who did not

In [124]:
kruskal_wallis_test(train_df, 'resign', 'average_permanence_employee')

         Source  ddof1         H    p-unc
Kruskal  resign      1  0.672763  0.41209
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in average_permanence_employee between resign


average_permanence_employee has no significant influence on our target.

In [125]:
no_infl_num.append('average_permanence_employee')

### salary_employee

- H<sub>0</sub>: There is no statistically significant difference in employee salary between employees who resigned and those who did not

- H<sub>1</sub>: There is a statistically significant difference in employee salary between employees who resigned and those who did not

In [126]:
kruskal_wallis_test(train_df, 'resign', 'salary_employee')

         Source  ddof1         H     p-unc
Kruskal  resign      1  0.235091  0.627774
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in salary_employee between resign


salary_employee has no significant influence on our target.

In [127]:
no_infl_num.append('salary_employee')

### performance_score_employee

- H<sub>0</sub>: There is no statistically significant difference in employee performance score between employees who resigned and those who did not

- H<sub>1</sub>: There is a statistically significant difference in employee performance score between employees who resigned and those who did not

In [128]:
kruskal_wallis_test(train_df, 'resign', 'performance_score_employee')

         Source  ddof1           H         p-unc
Kruskal  resign      1  160.008373  1.126728e-36
--------------------------------------------------------------------------------
Reject null hypothesis: There is a statistically significant difference in performance_score_employee between resign.


performance_score_employee has an influence on our target.

### psi_score_employee

- H<sub>0</sub>: There is no statistically significant difference in employee psi score between employees who resigned and those who did not

- H<sub>1</sub>: There is a statistically significant difference in employee psi score between employees who resigned and those who did not

In [129]:
kruskal_wallis_test(train_df, 'resign', 'psi_score_employee')

         Source  ddof1         H     p-unc
Kruskal  resign      1  0.592649  0.441396
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in psi_score_employee between resign


psi_score_employee has no significant influence on our target.

In [130]:
no_infl_num.append('psi_score_employee')

### join_age_employee

- H<sub>0</sub>: There is no statistically significant difference in employee age at joining between employees who resigned and those who did not

- H<sub>1</sub>: There is a statistically significant difference in employee age at joining between employees who resigned and those who did not

In [131]:
kruskal_wallis_test(train_df, 'resign', 'join_age_employee')

         Source  ddof1         H     p-unc
Kruskal  resign      1  0.080111  0.777146
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in join_age_employee between resign


join_age_employee has no significant influence on our target.

In [132]:
no_infl_num.append('join_age_employee')

### office_distance_boss

- H<sub>0</sub>: There is no statistically significant difference in boss office distance between employees who resigned and those who did not

- H<sub>1</sub>: There is a statistically significant difference in boss office distance between employees who resigned and those who did not

In [133]:
kruskal_wallis_test(train_df, 'resign', 'office_distance_boss')

         Source  ddof1         H     p-unc
Kruskal  resign      1  0.009346  0.922987
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in office_distance_boss between resign


office_distance_boss has no significant influence on our target.

In [134]:
no_infl_num.append('office_distance_boss')

### low_health_days_boss

- H<sub>0</sub>: There is no statistically significant difference in boss low health days between employees who resigned and those who did not

- H<sub>1</sub>: There is a statistically significant difference in boss low health days between employees who resigned and those who did not

In [135]:
kruskal_wallis_test(train_df, 'resign', 'low_health_days_boss')

         Source  ddof1         H     p-unc
Kruskal  resign      1  2.831318  0.092442
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in low_health_days_boss between resign


low_health_days_boss shows no statistically significant relationship with our target (p = 0.092442 > 0.01).

In [136]:
no_infl_num.append('low_health_days_boss')

### average_permanence_boss

- H<sub>0</sub>: Boss average permanence does not differ significantly between employees who resigned and those who did not

- H<sub>1</sub>: Boss average permanence differs significantly between employees who resigned and those who did not

In [137]:
kruskal_wallis_test(train_df, 'resign', 'average_permanence_boss')

         Source  ddof1         H     p-unc
Kruskal  resign      1  1.228961  0.267609
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in average_permanence_boss between resign


average_permanence_boss shows no statistically significant relationship with our target (p = 0.267609 > 0.01).

In [138]:
no_infl_num.append('average_permanence_boss')

### salary_boss

- H<sub>0</sub>: Boss salary does not differ significantly between employees who resigned and those who did not

- H<sub>1</sub>: Boss salary differs significantly between employees who resigned and those who did not

In [139]:
kruskal_wallis_test(train_df, 'resign', 'salary_boss')

         Source  ddof1         H     p-unc
Kruskal  resign      1  1.300907  0.254048
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in salary_boss between resign


salary_boss shows no statistically significant relationship with our target (p = 0.254048 > 0.01).

In [140]:
no_infl_num.append('salary_boss')

### performance_score_boss

- H<sub>0</sub>: Boss performance score does not differ significantly between employees who resigned and those who did not

- H<sub>1</sub>: Boss performance score differs significantly between employees who resigned and those who did not

In [141]:
kruskal_wallis_test(train_df, 'resign', 'performance_score_boss')

         Source  ddof1         H     p-unc
Kruskal  resign      1  4.155453  0.041501
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in performance_score_boss between resign


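performance_score_boss shows no statistically significant relationship with our target at the 0.01 level (p = 0.041501), although the same result would be significant at the more common 0.05 level.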

In [142]:
no_infl_num.append('performance_score_boss')

### psi_score_boss

- H<sub>0</sub>: Boss psi score does not differ significantly between employees who resigned and those who did not

- H<sub>1</sub>: Boss psi score differs significantly between employees who resigned and those who did not

In [143]:
kruskal_wallis_test(train_df, 'resign', 'psi_score_boss')

         Source  ddof1         H     p-unc
Kruskal  resign      1  0.357086  0.550129
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in psi_score_boss between resign


psi_score_boss shows no statistically significant relationship with our target (p = 0.550129 > 0.01).

In [144]:
no_infl_num.append('psi_score_boss')

### join_age_boss

- H<sub>0</sub>: Boss join age does not differ significantly between employees who resigned and those who did not

- H<sub>1</sub>: Boss join age differs significantly between employees who resigned and those who did not

In [145]:
kruskal_wallis_test(train_df, 'resign', 'join_age_boss')

         Source  ddof1         H     p-unc
Kruskal  resign      1  1.792592  0.180611
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in join_age_boss between resign


join_age_boss shows no statistically significant relationship with our target (p = 0.180611 > 0.01).

In [146]:
no_infl_num.append('join_age_boss')

### office_distance_diff

- H<sub>0</sub>: The office distance difference between employee and boss does not differ significantly between employees who resigned and those who did not

- H<sub>1</sub>: The office distance difference between employee and boss differs significantly between employees who resigned and those who did not

In [147]:
kruskal_wallis_test(train_df, 'resign', 'office_distance_diff')

         Source  ddof1         H    p-unc
Kruskal  resign      1  0.092422  0.76112
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in office_distance_diff between resign


office_distance_diff shows no statistically significant relationship with our target (p = 0.76112 > 0.01).

In [148]:
no_infl_num.append('office_distance_diff')

### low_health_days_diff

- H<sub>0</sub>: The low health days difference between employee and boss does not differ significantly between employees who resigned and those who did not

- H<sub>1</sub>: The low health days difference between employee and boss differs significantly between employees who resigned and those who did not

In [149]:
kruskal_wallis_test(train_df, 'resign', 'low_health_days_diff')

         Source  ddof1          H     p-unc
Kruskal  resign      1  17.002684  0.000037
--------------------------------------------------------------------------------
Reject null hypothesis: There is a statistically significant difference in low_health_days_diff between resign.


low_health_days_diff shows a statistically significant relationship with our target (p = 0.000037 < 0.01), so it is not added to the discard list.

### average_permanence_diff

- H<sub>0</sub>: The average permanence difference between employee and boss does not differ significantly between employees who resigned and those who did not

- H<sub>1</sub>: The average permanence difference between employee and boss differs significantly between employees who resigned and those who did not

In [150]:
kruskal_wallis_test(train_df, 'resign', 'average_permanence_diff')

         Source  ddof1        H    p-unc
Kruskal  resign      1  0.00138  0.97037
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in average_permanence_diff between resign


average_permanence_diff shows no statistically significant relationship with our target (p = 0.97037 > 0.01).

In [151]:
no_infl_num.append('average_permanence_diff')

### salary_diff

- H<sub>0</sub>: The salary difference between employee and boss does not differ significantly between employees who resigned and those who did not

- H<sub>1</sub>: The salary difference between employee and boss differs significantly between employees who resigned and those who did not

In [152]:
kruskal_wallis_test(train_df, 'resign', 'salary_diff')

         Source  ddof1         H     p-unc
Kruskal  resign      1  4.290983  0.038315
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in salary_diff between resign


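salary_diff shows no statistically significant relationship with our target at the 0.01 level (p = 0.038315), although the same result would be significant at the more common 0.05 level.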

In [153]:
no_infl_num.append('salary_diff')
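
Because the discard decision depends on the chosen significance level, it can help to see which features would change sides under a looser threshold. The sketch below takes the p-values printed in the cells above (from psi_score_employee through salary_diff) and compares the decision at 0.01 with the decision at 0.05. The helper `rejected` is a hypothetical name, and 0.05 is used only as a common reference point, not as a level this analysis adopts.

```python
# p-values copied from the Kruskal-Wallis outputs above
p_values = {
    'psi_score_employee': 0.441396,
    'join_age_employee': 0.777146,
    'office_distance_boss': 0.922987,
    'low_health_days_boss': 0.092442,
    'average_permanence_boss': 0.267609,
    'salary_boss': 0.254048,
    'performance_score_boss': 0.041501,
    'psi_score_boss': 0.550129,
    'join_age_boss': 0.180611,
    'office_distance_diff': 0.76112,
    'low_health_days_diff': 0.000037,
    'average_permanence_diff': 0.97037,
    'salary_diff': 0.038315,
}

def rejected(p_values, alpha):
    """Names of the features whose p-value is below alpha."""
    return [name for name, p in p_values.items() if p < alpha]

strict = rejected(p_values, alpha=0.01)
lenient = rejected(p_values, alpha=0.05)

# Features whose decision flips between the two thresholds
borderline = [name for name in lenient if name not in strict]
print(borderline)
```

The features listed in `borderline` are reasonable candidates to re-check when comparing model performance with and without the discarded features.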

### join_days_diff

- H<sub>0</sub>: The join days difference between employee and boss does not differ significantly between employees who resigned and those who did not

- H<sub>1</sub>: The join days difference between employee and boss differs significantly between employees who resigned and those who did not

In [154]:
kruskal_wallis_test(train_df, 'resign', 'join_days_diff')

         Source  ddof1         H     p-unc
Kruskal  resign      1  8.372586  0.003809
--------------------------------------------------------------------------------
Reject null hypothesis: There is a statistically significant difference in join_days_diff between resign.


join_days_diff shows a statistically significant relationship with our target (p = 0.003809 < 0.01), so it is not added to the discard list.

### age_diff

- H<sub>0</sub>: The age difference between employee and boss does not differ significantly between employees who resigned and those who did not

- H<sub>1</sub>: The age difference between employee and boss differs significantly between employees who resigned and those who did not

In [155]:
kruskal_wallis_test(train_df, 'resign', 'age_diff')

         Source  ddof1         H     p-unc
Kruskal  resign      1  0.144712  0.703641
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in age_diff between resign


age_diff shows no statistically significant relationship with our target (p = 0.703641 > 0.01).

In [156]:
no_infl_num.append('age_diff')

In [157]:
print(f'Category features considered to discard {no_infl_cat}')
print(f'Numerical features considered to discard {no_infl_num}')

Category features considered to discard ['gender_employee', 'join_month_employee', 'work_modality_boss', 'gender_boss', 'recruitment_channel_boss', 'marital_estatus_boss', 'join_year_boss', 'join_month_boss', 'performance_boss', 'joined_after_boss', 'younger_than_boss']
Numerical features considered to discard ['office_distance_employee', 'average_permanence_employee', 'salary_employee', 'psi_score_employee', 'join_age_employee', 'office_distance_boss', 'low_health_days_boss', 'average_permanence_boss', 'salary_boss', 'performance_score_boss', 'psi_score_boss', 'join_age_boss', 'office_distance_diff', 'average_permanence_diff', 'salary_diff', 'age_diff']
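
Since the goal stated at the start is to compare model performance with and without these features, the two lists can be turned into two versions of the training frame. The following is a minimal sketch on a toy DataFrame with made-up values; the shortened lists stand in for the real `no_infl_cat` and `no_infl_num`, and the names `full_df` and `reduced_df` are hypothetical.

```python
import pandas as pd

# Toy frame reusing a few of the notebook's column names; the values are made up
toy_df = pd.DataFrame({
    'resign': [0, 1, 0, 1],
    'gender_employee': ['F', 'M', 'M', 'F'],
    'salary_boss': [10.0, 12.5, 11.0, 9.5],
    'low_health_days_diff': [1, -2, 0, 3],
})

# Shortened stand-ins for the lists printed above
no_infl_cat = ['gender_employee']
no_infl_num = ['salary_boss']

to_discard = no_infl_cat + no_infl_num

full_df = toy_df                              # every feature kept
reduced_df = toy_df.drop(columns=to_discard)  # discard candidates removed

print(list(reduced_df.columns))
```

Training once on each frame keeps the comparison fair, because both versions share the same rows and target column.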
