# Hypothesis testing

During our exploratory data analysis (EDA), we observed that some features appear to have no relationship with employee resignation. We now test this observation formally to determine whether these features are indeed unrelated to our target variable, so that the model can later be trained and evaluated both with and without them.

A significance level of 0.01 is used throughout (a 99% confidence level): the null hypothesis is rejected only when the p-value is below 0.01.
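
The decision rule can be sketched in a few lines. This is only an illustration: the helper name `is_significant` and the p-values in the loop are assumptions made for the example, not values computed from the dataset.

```python
ALPHA = 0.01  # significance level used in this notebook


def is_significant(p_value: float, alpha: float = ALPHA) -> bool:
    """Return True when the null hypothesis is rejected at level alpha."""
    return p_value < alpha


# Illustrative p-values only (hypothetical, not taken from the data).
for p in (0.004, 0.04, 0.4):
    print(f"p = {p:.3f} -> reject H0 at alpha = {ALPHA}: {is_significant(p)}")
```

Note that a p-value such as 0.04 would be significant at the more common 0.05 level but not at the stricter 0.01 level chosen here.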

## Preparing environment

In [6]:
import numpy as np
import pandas as pd
import pingouin
import sys
sys.path.append('../high_performance_employee_resign_prediction')
from utils import paths

In [7]:
import warnings
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=RuntimeWarning)

## Importing data

In [8]:
train_df = pd.read_csv(paths.data_interim_dir('train_clean.csv'))

## Generating hypothesis testing functions

### Chi2 independence test

In [9]:
def chi2_independence_test(data: pd.DataFrame, main_cat: str, sec_cat: str, alpha=0.01):
    """
    Perform a Chi-square test of independence between two categorical variables.

    Parameters:
    data (pd.DataFrame): The DataFrame containing the data to be tested.
    main_cat (str): The name of the first categorical variable (column).
    sec_cat (str): The name of the second categorical variable (column).
    alpha (float, optional): The significance level to be used for the test. Defaults to 0.01.

    Returns:
    None

    This function prints the Chi-square test statistics and whether the null hypothesis
    of independence between the two categorical variables is rejected or not.

    Example:
    >>> df = pd.DataFrame({'A': ['a', 'b', 'a', 'b'], 'B': ['x', 'x', 'y', 'y']})
    >>> chi2_independence_test(df, 'A', 'B')
    """
    
    expected, observed, stats = pingouin.chi2_independence(data=data, x=main_cat, y=sec_cat)
    
    stats = pd.DataFrame(stats)
    
    print(stats)
    print('-'*80)
    
    stat_sign = stats['pval'] < alpha
    
    if stat_sign.any():
        print('Reject null hypothesis: There is a statistically significant difference between categories.')
    else:
        print('Failed to reject null hypothesis: There is no statistically significant difference between categories.')
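
To make the test more concrete, here is a minimal sketch of the Pearson variant computed with `scipy.stats.chi2_contingency` on a contingency table built by `pd.crosstab`. It is not the function used in this notebook: the helper name `pearson_chi2` and the toy data are assumptions, and the continuity correction is switched off here, so results for 2x2 tables may differ slightly from what `pingouin` reports with its defaults.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def pearson_chi2(data: pd.DataFrame, x: str, y: str):
    """Pearson chi-square test of independence between two columns."""
    # Observed counts for every combination of categories.
    observed = pd.crosstab(data[x], data[y])
    # correction=False disables Yates' continuity correction (assumption).
    chi2, pval, dof, _expected = chi2_contingency(observed, correction=False)
    return chi2, pval, dof


# Toy data (hypothetical): the feature is clearly related to the target.
toy = pd.DataFrame({
    "resign": [0] * 40 + [1] * 40,
    "feature": ["a"] * 30 + ["b"] * 10 + ["a"] * 10 + ["b"] * 30,
})

chi2, pval, dof = pearson_chi2(toy, "resign", "feature")
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {pval:.2e}")
```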

### Kruskal-Wallis test

In [10]:
def kruskal_wallis_test(data: pd.DataFrame, cat_feat: str, num_feat: str, alpha=0.01):
    """
    Perform the Kruskal-Wallis H-test for independent samples to determine if there are 
    statistically significant differences between the groups of a categorical variable 
    on a continuous variable.

    Parameters:
    data (pd.DataFrame): The DataFrame containing the data to be tested.
    cat_feat (str): The name of the categorical variable (column).
    num_feat (str): The name of the numerical variable (column).
    alpha (float, optional): The significance level to be used for the test. Defaults to 0.01.

    Returns:
    None

    This function prints the Kruskal-Wallis test results and whether the null hypothesis
    of equal medians among groups is rejected or not.

    Example:
    >>> df = pd.DataFrame({'Category': ['A', 'A', 'B', 'B', 'C', 'C'], 'Values': [5, 6, 7, 8, 9, 10]})
    >>> kruskal_wallis_test(df, 'Category', 'Values')
    """
    
    results = pingouin.kruskal(data=data, dv=num_feat, between=cat_feat)
    
    print(results)
    print('-'*80)
    
    if results['p-unc'].iloc[0] < alpha:
        print(f'Reject null hypothesis: There is a statistically significant difference in {num_feat} between {cat_feat}.')
    else:
        print(f'Failed to reject null hypothesis: There is no statistically significant difference in {num_feat} between {cat_feat}')
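
As a cross-check of what the test does, the same H statistic can be sketched with `scipy.stats.kruskal`, which takes one array per group. The helper name `kruskal_by_group` and the toy data are assumptions made for illustration; this is not the implementation used above.

```python
import pandas as pd
from scipy.stats import kruskal


def kruskal_by_group(data: pd.DataFrame, cat_feat: str, num_feat: str):
    """Kruskal-Wallis H-test of num_feat across the groups of cat_feat."""
    # One array of numeric values per category.
    groups = [grp[num_feat].to_numpy() for _, grp in data.groupby(cat_feat)]
    h_stat, p_value = kruskal(*groups)
    return h_stat, p_value


# Toy data (hypothetical): the two groups do not overlap at all.
toy = pd.DataFrame({
    "resign": [0] * 5 + [1] * 5,
    "score": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
})

h_stat, p_value = kruskal_by_group(toy, "resign", "score")
print(f"H = {h_stat:.4f}, p = {p_value:.4f}")
```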

In [11]:
no_infl_cat = []
no_infl_num = []
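
The notebook runs one cell per feature and appends to these lists by hand. As an alternative sketch, the same bookkeeping could be driven by a loop. The helper `features_without_influence` is hypothetical, relies on `scipy` rather than `pingouin`, and uses only the Pearson statistic without continuity correction, so it is not guaranteed to reproduce the decisions made in the cells of this notebook.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def features_without_influence(data, target, features, alpha=0.01):
    """Return the features whose Pearson chi-square p-value is >= alpha."""
    not_significant = []
    for feat in features:
        observed = pd.crosstab(data[target], data[feat])
        _, pval, _, _ = chi2_contingency(observed, correction=False)
        if pval >= alpha:
            not_significant.append(feat)
    return not_significant


# Toy data (hypothetical): one related and one unrelated feature.
toy = pd.DataFrame({
    "resign": [0] * 40 + [1] * 40,
    "related": ["a"] * 30 + ["b"] * 10 + ["a"] * 10 + ["b"] * 30,
    "unrelated": (["x"] * 20 + ["y"] * 20) * 2,
})

discard = features_without_influence(toy, "resign", ["related", "unrelated"])
print(discard)
```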

## Categorical tests: Chi2 independence tests between categorical features and target.

### id_last_boss_employee

- H<sub>0</sub>: There is no statistically significant relationship between the employee's last boss and employee resignation

- H<sub>1</sub>: There is a statistically significant relationship between the employee's last boss and employee resignation

In [12]:
chi2_independence_test(train_df, 'resign', 'id_last_boss_employee')

                 test    lambda        chi2    dof          pval    cramer  \
0             pearson  1.000000  253.683979  170.0  3.318702e-05  0.343341   
1        cressie-read  0.666667  258.145263  170.0  1.497057e-05  0.346347   
2      log-likelihood  0.000000  278.098851  170.0  3.227638e-07  0.359483   
3       freeman-tukey -0.500000         NaN  170.0           NaN       NaN   
4  mod-log-likelihood -1.000000         inf  170.0  0.000000e+00       inf   
5              neyman -2.000000         NaN  170.0           NaN       NaN   

   power  
0    1.0  
1    1.0  
2    1.0  
3    NaN  
4    NaN  
5    NaN  
--------------------------------------------------------------------------------
Reject null hypothesis: There is a statistically significant difference between categories.


id_last_boss_employee has a statistically significant influence on our target.

### seniority_employee

- H<sub>0</sub>: There is no statistically significant relationship between employee seniority and employee resignation

- H<sub>1</sub>: There is a statistically significant relationship between employee seniority and employee resignation

In [13]:
chi2_independence_test(train_df, 'resign', 'seniority_employee')

                 test    lambda       chi2  dof          pval    cramer  \
0             pearson  1.000000  36.374241  1.0  1.628412e-09  0.130010   
1        cressie-read  0.666667  36.565411  1.0  1.476282e-09  0.130351   
2      log-likelihood  0.000000  37.810045  1.0  7.797921e-10  0.132551   
3       freeman-tukey -0.500000  39.605439  1.0  3.108164e-10  0.135661   
4  mod-log-likelihood -1.000000  42.291719  1.0  7.862479e-11  0.140187   
5              neyman -2.000000  51.216367  1.0  8.272602e-13  0.154271   

      power  
0  0.999977  
1  0.999978  
2  0.999986  
3  0.999993  
4  0.999997  
5  1.000000  
--------------------------------------------------------------------------------
Reject null hypothesis: There is a statistically significant difference between categories.


seniority_employee has a statistically significant influence on our target.

### work_modality_employee

- H<sub>0</sub>: There is no statistically significant relationship between employee work modality and employee resignation

- H<sub>1</sub>: There is a statistically significant relationship between employee work modality and employee resignation

In [14]:
chi2_independence_test(train_df, 'resign', 'work_modality_employee')

                 test    lambda      chi2  dof      pval    cramer     power
0             pearson  1.000000  7.961430  1.0  0.004778  0.060824  0.805557
1        cressie-read  0.666667  7.956675  1.0  0.004791  0.060806  0.805325
2      log-likelihood  0.000000  7.950230  1.0  0.004808  0.060781  0.805010
3       freeman-tukey -0.500000  7.948070  1.0  0.004814  0.060773  0.804905
4  mod-log-likelihood -1.000000  7.948196  1.0  0.004814  0.060773  0.804911
5              neyman -2.000000  7.955293  1.0  0.004795  0.060800  0.805258
--------------------------------------------------------------------------------
Reject null hypothesis: There is a statistically significant difference between categories.


work_modality_employee has a statistically significant influence on our target.

### gender_employee

- H<sub>0</sub>: There is no statistically significant relationship between employee gender and employee resignation

- H<sub>1</sub>: There is a statistically significant relationship between employee gender and employee resignation

In [15]:
chi2_independence_test(train_df, 'resign', 'gender_employee')

                 test    lambda      chi2  dof      pval    cramer     power
0             pearson  1.000000  3.902991  1.0  0.048200  0.042587  0.506279
1        cressie-read  0.666667  3.903170  1.0  0.048195  0.042588  0.506297
2      log-likelihood  0.000000  3.903928  1.0  0.048173  0.042592  0.506373
3       freeman-tukey -0.500000  3.904849  1.0  0.048147  0.042597  0.506466
4  mod-log-likelihood -1.000000  3.906070  1.0  0.048112  0.042604  0.506589
5              neyman -2.000000  3.909419  1.0  0.048016  0.042622  0.506927
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference between categories.


gender_employee has no statistically significant influence on our target at the 0.01 level.

In [16]:
no_infl_cat.append('gender_employee')

### recruitment_channel_employee

- H<sub>0</sub>: There is no statistically significant relationship between employee recruitment channel and employee resignation

- H<sub>1</sub>: There is a statistically significant relationship between employee recruitment channel and employee resignation

In [17]:
chi2_independence_test(train_df, 'resign', 'recruitment_channel_employee')

                 test    lambda       chi2  dof          pval    cramer  \
0             pearson  1.000000  43.645651  4.0  7.600404e-09  0.142413   
1        cressie-read  0.666667  43.801927  4.0  7.053200e-09  0.142668   
2      log-likelihood  0.000000  44.227223  4.0  5.755041e-09  0.143359   
3       freeman-tukey -0.500000  44.649401  4.0  4.702424e-09  0.144041   
4  mod-log-likelihood -1.000000  45.165170  4.0  3.673663e-09  0.144871   
5              neyman -2.000000  46.500171  4.0  1.937891e-09  0.146996   

      power  
0  0.999942  
1  0.999945  
2  0.999951  
3  0.999957  
4  0.999963  
5  0.999976  
--------------------------------------------------------------------------------
Reject null hypothesis: There is a statistically significant difference between categories.


recruitment_channel_employee has a statistically significant influence on our target.

### marital_estatus_employee

- H<sub>0</sub>: There is no statistically significant relationship between employee marital status and employee resignation

- H<sub>1</sub>: There is a statistically significant relationship between employee marital status and employee resignation

In [18]:
chi2_independence_test(train_df, 'resign', 'marital_estatus_employee')

                 test    lambda       chi2  dof      pval    cramer     power
0             pearson  1.000000  20.900890  3.0  0.000110  0.098551  0.980173
1        cressie-read  0.666667  20.918306  3.0  0.000109  0.098592  0.980261
2      log-likelihood  0.000000  20.968675  3.0  0.000107  0.098711  0.980515
3       freeman-tukey -0.500000  21.020234  3.0  0.000104  0.098832  0.980771
4  mod-log-likelihood -1.000000  21.083810  3.0  0.000101  0.098981  0.981082
5              neyman -2.000000  21.247888  3.0  0.000094  0.099366  0.981864
--------------------------------------------------------------------------------
Reject null hypothesis: There is a statistically significant difference between categories.


marital_estatus_employee has a statistically significant influence on our target.

### join_year_employee

- H<sub>0</sub>: There is no statistically significant relationship between the employee's joining year and employee resignation

- H<sub>1</sub>: There is a statistically significant relationship between the employee's joining year and employee resignation

In [19]:
chi2_independence_test(train_df, 'resign', 'join_year_employee')

                 test    lambda       chi2   dof          pval    cramer  \
0             pearson  1.000000  48.316797  11.0  1.253445e-06  0.149840   
1        cressie-read  0.666667  48.351606  11.0  1.235634e-06  0.149894   
2      log-likelihood  0.000000  48.619469  11.0  1.106703e-06  0.150309   
3       freeman-tukey -0.500000  48.998801  11.0  9.465954e-07  0.150894   
4  mod-log-likelihood -1.000000  49.537969  11.0  7.577378e-07  0.151722   
5              neyman -2.000000  51.132013  11.0  3.913587e-07  0.154144   

      power  
0  0.999681  
1  0.999684  
2  0.999704  
3  0.999730  
4  0.999764  
5  0.999841  
--------------------------------------------------------------------------------
Reject null hypothesis: There is a statistically significant difference between categories.


join_year_employee has a statistically significant influence on our target.

### join_month_employee

- H<sub>0</sub>: There is no statistically significant relationship between the employee's joining month and employee resignation

- H<sub>1</sub>: There is a statistically significant relationship between the employee's joining month and employee resignation

In [20]:
chi2_independence_test(train_df, 'resign', 'join_month_employee')

                 test    lambda      chi2   dof      pval    cramer     power
0             pearson  1.000000  7.293563  11.0  0.774836  0.058217  0.379021
1        cressie-read  0.666667  7.290816  11.0  0.775067  0.058206  0.378872
2      log-likelihood  0.000000  7.289974  11.0  0.775138  0.058203  0.378826
3       freeman-tukey -0.500000  7.293408  11.0  0.774849  0.058216  0.379012
4  mod-log-likelihood -1.000000  7.300327  11.0  0.774267  0.058244  0.379388
5              neyman -2.000000  7.324659  11.0  0.772216  0.058341  0.380710
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference between categories.


join_month_employee has no statistically significant influence on our target.

In [21]:
no_infl_cat.append('join_month_employee')

### performance_employee

- H<sub>0</sub>: There is no statistically significant relationship between employee performance and employee resignation

- H<sub>1</sub>: There is a statistically significant relationship between employee performance and employee resignation

In [22]:
chi2_independence_test(train_df, 'resign', 'performance_employee')

                 test    lambda        chi2  dof          pval    cramer  \
0             pearson  1.000000  107.647958  1.0  3.210224e-25  0.223657   
1        cressie-read  0.666667  108.260367  1.0  2.356907e-25  0.224292   
2      log-likelihood  0.000000  110.031645  1.0  9.643885e-26  0.226119   
3       freeman-tukey -0.500000  111.871485  1.0  3.812389e-26  0.228002   
4  mod-log-likelihood -1.000000  114.188210  1.0  1.185090e-26  0.230351   
5              neyman -2.000000  120.429820  1.0  5.093727e-28  0.236562   

   power  
0    1.0  
1    1.0  
2    1.0  
3    1.0  
4    1.0  
5    1.0  
--------------------------------------------------------------------------------
Reject null hypothesis: There is a statistically significant difference between categories.


performance_employee has a statistically significant influence on our target.

### id_last_boss_boss

- H<sub>0</sub>: There is no statistically significant relationship between the boss's last boss and employee resignation

- H<sub>1</sub>: There is a statistically significant relationship between the boss's last boss and employee resignation

In [23]:
chi2_independence_test(train_df, 'resign', 'id_last_boss_boss')

                 test    lambda        chi2   dof          pval    cramer  \
0             pearson  1.000000  165.432658  86.0  5.831360e-07  0.277261   
1        cressie-read  0.666667  167.240706  86.0  3.689875e-07  0.278772   
2      log-likelihood  0.000000  175.090931  86.0  4.801039e-08  0.285240   
3       freeman-tukey -0.500000         NaN  86.0           NaN       NaN   
4  mod-log-likelihood -1.000000         inf  86.0  0.000000e+00       inf   
5              neyman -2.000000         NaN  86.0           NaN       NaN   

   power  
0    1.0  
1    1.0  
2    1.0  
3    NaN  
4    NaN  
5    NaN  
--------------------------------------------------------------------------------
Reject null hypothesis: There is a statistically significant difference between categories.


id_last_boss_boss has a statistically significant influence on our target.

### work_modality_boss

- H<sub>0</sub>: There is no statistically significant relationship between boss work modality and employee resignation

- H<sub>1</sub>: There is a statistically significant relationship between boss work modality and employee resignation

In [24]:
chi2_independence_test(train_df, 'resign', 'work_modality_boss')

                 test    lambda      chi2  dof      pval    cramer     power
0             pearson  1.000000  1.659943  1.0  0.197611  0.027773  0.251507
1        cressie-read  0.666667  1.658377  1.0  0.197823  0.027760  0.251315
2      log-likelihood  0.000000  1.655912  1.0  0.198156  0.027739  0.251012
3       freeman-tukey -0.500000  1.654645  1.0  0.198328  0.027729  0.250857
4  mod-log-likelihood -1.000000  1.653876  1.0  0.198432  0.027722  0.250762
5              neyman -2.000000  1.653827  1.0  0.198439  0.027722  0.250756
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference between categories.


work_modality_boss has no statistically significant influence on our target.

In [25]:
no_infl_cat.append('work_modality_boss')

### gender_boss

- H<sub>0</sub>: There is no statistically significant relationship between boss gender and employee resignation

- H<sub>1</sub>: There is a statistically significant relationship between boss gender and employee resignation

In [26]:
chi2_independence_test(train_df, 'resign', 'gender_boss')

                 test    lambda      chi2  dof      pval    cramer     power
0             pearson  1.000000  1.536760  1.0  0.215101  0.026723  0.236357
1        cressie-read  0.666667  1.536675  1.0  0.215113  0.026722  0.236347
2      log-likelihood  0.000000  1.536571  1.0  0.215129  0.026721  0.236334
3       freeman-tukey -0.500000  1.536549  1.0  0.215132  0.026721  0.236331
4  mod-log-likelihood -1.000000  1.536576  1.0  0.215128  0.026721  0.236335
5              neyman -2.000000  1.536774  1.0  0.215099  0.026723  0.236359
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference between categories.


gender_boss has no statistically significant influence on our target.

In [27]:
no_infl_cat.append('gender_boss')

### recruitment_channel_boss

- H<sub>0</sub>: There is no statistically significant relationship between boss recruitment channel and employee resignation

- H<sub>1</sub>: There is a statistically significant relationship between boss recruitment channel and employee resignation

In [28]:
chi2_independence_test(train_df, 'resign', 'recruitment_channel_boss')

                 test    lambda      chi2  dof      pval    cramer     power
0             pearson  1.000000  0.825301  4.0  0.935024  0.019583  0.094927
1        cressie-read  0.666667  0.824912  4.0  0.935077  0.019579  0.094904
2      log-likelihood  0.000000  0.824357  4.0  0.935153  0.019572  0.094871
3       freeman-tukey -0.500000  0.824137  4.0  0.935183  0.019569  0.094857
4  mod-log-likelihood -1.000000  0.824084  4.0  0.935190  0.019569  0.094854
5              neyman -2.000000  0.824479  4.0  0.935136  0.019574  0.094878
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference between categories.


recruitment_channel_boss has no statistically significant influence on our target.

In [29]:
no_infl_cat.append('recruitment_channel_boss')

### marital_estatus_boss

- H<sub>0</sub>: There is no statistically significant relationship between boss marital status and employee resignation

- H<sub>1</sub>: There is a statistically significant relationship between boss marital status and employee resignation

In [30]:
chi2_independence_test(train_df, 'resign', 'marital_estatus_boss')

                 test    lambda      chi2  dof      pval    cramer     power
0             pearson  1.000000  0.896464  3.0  0.826281  0.020410  0.108262
1        cressie-read  0.666667  0.896960  3.0  0.826161  0.020416  0.108297
2      log-likelihood  0.000000  0.898037  3.0  0.825901  0.020428  0.108373
3       freeman-tukey -0.500000  0.898922  3.0  0.825688  0.020438  0.108436
4  mod-log-likelihood -1.000000  0.899872  3.0  0.825459  0.020449  0.108503
5              neyman -2.000000  0.901970  3.0  0.824952  0.020473  0.108652
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference between categories.


marital_estatus_boss has no statistically significant influence on our target.

In [31]:
no_infl_cat.append('marital_estatus_boss')

### join_year_boss

- H<sub>0</sub>: There is no statistically significant relationship between the boss's joining year and employee resignation

- H<sub>1</sub>: There is a statistically significant relationship between the boss's joining year and employee resignation

In [32]:
chi2_independence_test(train_df, 'resign', 'join_year_boss')

                 test    lambda       chi2   dof      pval    cramer     power
0             pearson  1.000000  20.426235  11.0  0.039827  0.097426  0.886328
1        cressie-read  0.666667  20.422087  11.0  0.039877  0.097416  0.886250
2      log-likelihood  0.000000  20.444204  11.0  0.039608  0.097468  0.886663
3       freeman-tukey -0.500000  20.487517  11.0  0.039086  0.097572  0.887468
4  mod-log-likelihood -1.000000  20.553948  11.0  0.038298  0.097730  0.888693
5              neyman -2.000000  20.757416  11.0  0.035973  0.098212  0.892375
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference between categories.


join_year_boss has no statistically significant influence on our target at the 0.01 level.

In [33]:
no_infl_cat.append('join_year_boss')

### join_month_boss

- H<sub>0</sub>: There is no statistically significant relationship between the boss's joining month and employee resignation

- H<sub>1</sub>: There is a statistically significant relationship between the boss's joining month and employee resignation

In [34]:
chi2_independence_test(train_df, 'resign', 'join_month_boss')

                 test    lambda       chi2   dof      pval    cramer     power
0             pearson  1.000000  14.911043  11.0  0.186606  0.083240  0.738385
1        cressie-read  0.666667  14.921631  11.0  0.186113  0.083270  0.738765
2      log-likelihood  0.000000  14.953765  11.0  0.184623  0.083359  0.739913
3       freeman-tukey -0.500000  14.987542  11.0  0.183068  0.083453  0.741116
4  mod-log-likelihood -1.000000  15.029708  11.0  0.181141  0.083571  0.742612
5              neyman -2.000000  15.139634  11.0  0.176193  0.083876  0.746485
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference between categories.


join_month_boss has no statistically significant influence on our target.

In [35]:
no_infl_cat.append('join_month_boss')

### performance_boss

- H<sub>0</sub>: There is no statistically significant relationship between boss performance and employee resignation

- H<sub>1</sub>: There is a statistically significant relationship between boss performance and employee resignation

In [36]:
chi2_independence_test(train_df, 'resign', 'performance_boss')

                 test    lambda      chi2  dof      pval    cramer     power
0             pearson  1.000000  3.750363  1.0  0.052796  0.041746  0.490723
1        cressie-read  0.666667  3.749348  1.0  0.052828  0.041740  0.490618
2      log-likelihood  0.000000  3.747803  1.0  0.052877  0.041732  0.490459
3       freeman-tukey -0.500000  3.747067  1.0  0.052900  0.041728  0.490384
4  mod-log-likelihood -1.000000  3.746692  1.0  0.052912  0.041726  0.490345
5              neyman -2.000000  3.747027  1.0  0.052902  0.041727  0.490380
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference between categories.


performance_boss has no statistically significant influence on our target.

In [37]:
no_infl_cat.append('performance_boss')

### joined_after_boss

- H<sub>0</sub>: There is no statistically significant relationship between whether the employee joined the company after the boss and employee resignation

- H<sub>1</sub>: There is a statistically significant relationship between whether the employee joined the company after the boss and employee resignation

In [38]:
chi2_independence_test(train_df, 'resign', 'joined_after_boss')

                 test    lambda      chi2  dof      pval    cramer     power
0             pearson  1.000000  3.208977  1.0  0.073235  0.038616  0.433143
1        cressie-read  0.666667  3.208975  1.0  0.073235  0.038616  0.433142
2      log-likelihood  0.000000  3.209243  1.0  0.073223  0.038617  0.433172
3       freeman-tukey -0.500000  3.209683  1.0  0.073204  0.038620  0.433220
4  mod-log-likelihood -1.000000  3.210328  1.0  0.073175  0.038624  0.433291
5              neyman -2.000000  3.212234  1.0  0.073090  0.038635  0.433500
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference between categories.


joined_after_boss has no statistically significant influence on our target.

In [39]:
no_infl_cat.append('joined_after_boss')

### younger_than_boss

- H<sub>0</sub>: There is no statistically significant relationship between whether the employee is younger than the boss and employee resignation

- H<sub>1</sub>: There is a statistically significant relationship between whether the employee is younger than the boss and employee resignation

In [40]:
chi2_independence_test(train_df, 'resign', 'younger_than_boss')

                 test    lambda      chi2  dof      pval    cramer     power
0             pearson  1.000000  3.208977  1.0  0.073235  0.038616  0.433143
1        cressie-read  0.666667  3.208975  1.0  0.073235  0.038616  0.433142
2      log-likelihood  0.000000  3.209243  1.0  0.073223  0.038617  0.433172
3       freeman-tukey -0.500000  3.209683  1.0  0.073204  0.038620  0.433220
4  mod-log-likelihood -1.000000  3.210328  1.0  0.073175  0.038624  0.433291
5              neyman -2.000000  3.212234  1.0  0.073090  0.038635  0.433500
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference between categories.


younger_than_boss has no statistically significant influence on our target.

In [41]:
no_infl_cat.append('younger_than_boss')

In [42]:
print(f'Categories considered to discard {no_infl_cat}')

Categories considered to discard ['gender_employee', 'join_month_employee', 'work_modality_boss', 'gender_boss', 'recruitment_channel_boss', 'marital_estatus_boss', 'join_year_boss', 'join_month_boss', 'performance_boss', 'joined_after_boss', 'younger_than_boss']
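
Since the goal is to compare the model with and without these features, a reduced feature set can be derived from the list of candidates. The following is a minimal sketch under assumptions: the helper name `drop_features` and the toy DataFrame are hypothetical, and `errors="ignore"` is used so that names absent from the DataFrame are skipped silently.

```python
import pandas as pd


def drop_features(df: pd.DataFrame, to_drop) -> pd.DataFrame:
    """Return a copy of df without the listed columns (missing names are ignored)."""
    return df.drop(columns=list(to_drop), errors="ignore")


# Toy frame (hypothetical) standing in for the training data.
toy = pd.DataFrame({
    "resign": [0, 1],
    "gender_boss": ["F", "M"],
    "salary_employee": [1000, 1200],
})

reduced = drop_features(toy, ["gender_boss", "join_month_boss"])
print(list(reduced.columns))
```

The original DataFrame is left untouched, so both the full and the reduced versions remain available for the comparison.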


## Numerical tests: Kruskal-Wallis tests between numerical features and target.

### office_distance_employee

- H<sub>0</sub>: There is no statistically significant relationship between the employee's distance to the office and employee resignation

- H<sub>1</sub>: There is a statistically significant relationship between the employee's distance to the office and employee resignation

In [43]:
kruskal_wallis_test(train_df, 'resign', 'office_distance_employee')

         Source  ddof1         H     p-unc
Kruskal  resign      1  0.215454  0.642526
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in office_distance_employee between resign


office_distance_employee has no statistically significant influence on our target.

In [44]:
no_infl_num.append('office_distance_employee')

### low_health_days_employee

- H<sub>0</sub>: There is no statistically significant relationship between employee sick days and employee resignation

- H<sub>1</sub>: There is a statistically significant relationship between employee sick days and employee resignation

In [45]:
kruskal_wallis_test(train_df, 'resign', 'low_health_days_employee')

         Source  ddof1          H         p-unc
Kruskal  resign      1  44.826429  2.152970e-11
--------------------------------------------------------------------------------
Reject null hypothesis: There is a statistically significant difference in low_health_days_employee between resign.


low_health_days_employee has a statistically significant influence on our target.

### average_permanence_employee

- H<sub>0</sub>: There is no statistically significant relationship between employee average permanence and employee resignation

- H<sub>1</sub>: There is a statistically significant relationship between employee average permanence and employee resignation

In [46]:
kruskal_wallis_test(train_df, 'resign', 'average_permanence_employee')

         Source  ddof1         H    p-unc
Kruskal  resign      1  0.672763  0.41209
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in average_permanence_employee between resign


average_permanence_employee has no statistically significant influence on our target.

In [47]:
no_infl_num.append('average_permanence_employee')

### salary_employee

- H<sub>0</sub>: There is no statistically significant relationship between employee salary and employee resignation

- H<sub>1</sub>: There is a statistically significant relationship between employee salary and employee resignation

In [48]:
kruskal_wallis_test(train_df, 'resign', 'salary_employee')

         Source  ddof1         H     p-unc
Kruskal  resign      1  0.235091  0.627774
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in salary_employee between resign


salary_employee has no statistically significant influence on our target.

In [49]:
no_infl_num.append('salary_employee')

### performance_score_employee

- H<sub>0</sub>: There is no statistically significant relationship between employee performance score and employee resignation

- H<sub>1</sub>: There is a statistically significant relationship between employee performance score and employee resignation

In [50]:
kruskal_wallis_test(train_df, 'resign', 'performance_score_employee')

         Source  ddof1           H         p-unc
Kruskal  resign      1  201.720451  8.798300e-46
--------------------------------------------------------------------------------
Reject null hypothesis: There is a statistically significant difference in performance_score_employee between resign.


performance_score_employee has a statistically significant influence on our target.

### psi_score_employee

- H<sub>0</sub>: There is no statistically significant relationship between employee psi score and employee resignation

- H<sub>1</sub>: There is a statistically significant relationship between employee psi score and employee resignation

In [51]:
kruskal_wallis_test(train_df, 'resign', 'psi_score_employee')

         Source  ddof1         H     p-unc
Kruskal  resign      1  0.592649  0.441396
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in psi_score_employee between resign


psi_score_employee has no statistically significant influence on our target.

In [52]:
no_infl_num.append('psi_score_employee')

### join_age_employee

- H<sub>0</sub>: There is no statistically significant relationship between the employee's age at joining and employee resignation

- H<sub>1</sub>: There is a statistically significant relationship between the employee's age at joining and employee resignation

In [53]:
kruskal_wallis_test(train_df, 'resign', 'join_age_employee')

         Source  ddof1         H     p-unc
Kruskal  resign      1  0.080111  0.777146
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in join_age_employee between resign


join_age_employee has no statistically significant influence on our target.

In [54]:
no_infl_num.append('join_age_employee')

### office_distance_boss

- H<sub>0</sub>: There is no statistically significant relationship between the boss's distance to the office and employee resignation

- H<sub>1</sub>: There is a statistically significant relationship between the boss's distance to the office and employee resignation

In [55]:
kruskal_wallis_test(train_df, 'resign', 'office_distance_boss')

         Source  ddof1         H     p-unc
Kruskal  resign      1  0.525837  0.468362
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in office_distance_boss between resign


office_distance_boss has no statistically significant influence on our target.

In [56]:
no_infl_num.append('office_distance_boss')

### low_health_days_boss

- H<sub>0</sub>: There is no statistically significant relationship between boss sick days and employee resignation

- H<sub>1</sub>: There is a statistically significant relationship between boss sick days and employee resignation

In [57]:
kruskal_wallis_test(train_df, 'resign', 'low_health_days_boss')

         Source  ddof1         H     p-unc
Kruskal  resign      1  3.210622  0.073162
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in low_health_days_boss between resign


low_health_days_boss shows no statistically significant influence on our target

In [58]:
no_infl_num.append('low_health_days_boss')

### average_permanence_boss

- H<sub>0</sub>: There is no statistically significant difference in boss average permanence across the `resign` groups

- H<sub>1</sub>: There is a statistically significant difference in boss average permanence across the `resign` groups

In [59]:
kruskal_wallis_test(train_df, 'resign', 'average_permanence_boss')

         Source  ddof1         H     p-unc
Kruskal  resign      1  1.500904  0.220532
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in average_permanence_boss between resign


average_permanence_boss shows no statistically significant influence on our target

In [60]:
no_infl_num.append('average_permanence_boss')

### salary_boss

- H<sub>0</sub>: There is no statistically significant difference in boss salary across the `resign` groups

- H<sub>1</sub>: There is a statistically significant difference in boss salary across the `resign` groups

In [61]:
kruskal_wallis_test(train_df, 'resign', 'salary_boss')

         Source  ddof1         H     p-unc
Kruskal  resign      1  2.380555  0.122854
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in salary_boss between resign


salary_boss shows no statistically significant influence on our target

In [62]:
no_infl_num.append('salary_boss')

### performance_score_boss

- H<sub>0</sub>: There is no statistically significant difference in boss performance score across the `resign` groups

- H<sub>1</sub>: There is a statistically significant difference in boss performance score across the `resign` groups

In [63]:
kruskal_wallis_test(train_df, 'resign', 'performance_score_boss')

         Source  ddof1         H     p-unc
Kruskal  resign      1  5.618567  0.017771
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in performance_score_boss between resign


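performance_score_boss shows no statistically significant influence on our target at the 0.01 level, although its p-value (0.017771) would be significant at 0.05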

In [64]:
no_infl_num.append('performance_score_boss')
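
Because every test here has `ddof1 = 1`, the `p-unc` column can be checked by hand: the upper tail of a chi-square distribution with one degree of freedom at H equals erfc(sqrt(H / 2)). The sketch below reproduces two p-values from the tables above using only the standard library. It assumes the reported p-value comes from the usual chi-square approximation, and `p_value_from_h` is a hypothetical helper name, not part of this notebook.

```python
import math


def p_value_from_h(h: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(h / 2.0))


ALPHA = 0.01  # significance level used throughout this notebook

# H statistics copied from the Kruskal-Wallis outputs above
observed_h = {
    "psi_score_employee": 0.592649,
    "performance_score_boss": 5.618567,
}

for feature, h in observed_h.items():
    p = p_value_from_h(h)
    print(f"{feature}: H={h:.6f}, p={p:.6f}, reject H0 at {ALPHA}: {p < ALPHA}")
```

The recomputed values should agree with the `p-unc` column up to rounding, which makes it easy to see how close a borderline feature such as performance_score_boss is to the 0.01 threshold.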

### psi_score_boss

- H<sub>0</sub>: There is no statistically significant difference in boss psi score across the `resign` groups

- H<sub>1</sub>: There is a statistically significant difference in boss psi score across the `resign` groups

In [65]:
kruskal_wallis_test(train_df, 'resign', 'psi_score_boss')

         Source  ddof1         H     p-unc
Kruskal  resign      1  0.027217  0.868962
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in psi_score_boss between resign


psi_score_boss shows no statistically significant influence on our target

In [66]:
no_infl_num.append('psi_score_boss')

### join_age_boss

- H<sub>0</sub>: There is no statistically significant difference in boss join age across the `resign` groups

- H<sub>1</sub>: There is a statistically significant difference in boss join age across the `resign` groups

In [67]:
kruskal_wallis_test(train_df, 'resign', 'join_age_boss')

         Source  ddof1         H     p-unc
Kruskal  resign      1  3.150868  0.075887
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in join_age_boss between resign


join_age_boss shows no statistically significant influence on our target

In [68]:
no_infl_num.append('join_age_boss')

### office_distance_diff

- H<sub>0</sub>: There is no statistically significant difference in the employee-boss office distance difference across the `resign` groups

- H<sub>1</sub>: There is a statistically significant difference in the employee-boss office distance difference across the `resign` groups

In [69]:
kruskal_wallis_test(train_df, 'resign', 'office_distance_diff')

         Source  ddof1         H   p-unc
Kruskal  resign      1  0.422081  0.5159
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in office_distance_diff between resign


office_distance_diff shows no statistically significant influence on our target

In [70]:
no_infl_num.append('office_distance_diff')

### low_health_days_diff

- H<sub>0</sub>: There is no statistically significant difference in the employee-boss low health days difference across the `resign` groups

- H<sub>1</sub>: There is a statistically significant difference in the employee-boss low health days difference across the `resign` groups

In [71]:
kruskal_wallis_test(train_df, 'resign', 'low_health_days_diff')

         Source  ddof1          H     p-unc
Kruskal  resign      1  15.452513  0.000085
--------------------------------------------------------------------------------
Reject null hypothesis: There is a statistically significant difference in low_health_days_diff between resign.


low_health_days_diff has a statistically significant influence on our target, so it is kept

### average_permanence_diff

- H<sub>0</sub>: There is no statistically significant difference in the employee-boss average permanence difference across the `resign` groups

- H<sub>1</sub>: There is a statistically significant difference in the employee-boss average permanence difference across the `resign` groups

In [72]:
kruskal_wallis_test(train_df, 'resign', 'average_permanence_diff')

         Source  ddof1         H     p-unc
Kruskal  resign      1  0.004402  0.947099
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in average_permanence_diff between resign


average_permanence_diff shows no statistically significant influence on our target

In [73]:
no_infl_num.append('average_permanence_diff')

### salary_diff

- H<sub>0</sub>: There is no statistically significant difference in the employee-boss salary difference across the `resign` groups

- H<sub>1</sub>: There is a statistically significant difference in the employee-boss salary difference across the `resign` groups

In [74]:
kruskal_wallis_test(train_df, 'resign', 'salary_diff')

         Source  ddof1         H     p-unc
Kruskal  resign      1  4.301208  0.038085
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in salary_diff between resign


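salary_diff shows no statistically significant influence on our target at the 0.01 level, although its p-value (0.038085) would be significant at 0.05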

In [75]:
no_infl_num.append('salary_diff')

### join_days_diff

- H<sub>0</sub>: There is no statistically significant difference in the employee-boss join days difference across the `resign` groups

- H<sub>1</sub>: There is a statistically significant difference in the employee-boss join days difference across the `resign` groups

In [76]:
kruskal_wallis_test(train_df, 'resign', 'join_days_diff')

         Source  ddof1         H     p-unc
Kruskal  resign      1  8.617638  0.003329
--------------------------------------------------------------------------------
Reject null hypothesis: There is a statistically significant difference in join_days_diff between resign.


join_days_diff has a statistically significant influence on our target, so it is kept

### age_diff

- H<sub>0</sub>: There is no statistically significant difference in the employee-boss age difference across the `resign` groups

- H<sub>1</sub>: There is a statistically significant difference in the employee-boss age difference across the `resign` groups

In [77]:
kruskal_wallis_test(train_df, 'resign', 'age_diff')

         Source  ddof1         H    p-unc
Kruskal  resign      1  0.299892  0.58395
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in age_diff between resign


age_diff shows no statistically significant influence on our target

In [78]:
no_infl_num.append('age_diff')

### avg_od_epb

- H<sub>0</sub>: There is no statistically significant difference in employee average office distance by boss across the `resign` groups

- H<sub>1</sub>: There is a statistically significant difference in employee average office distance by boss across the `resign` groups

In [80]:
kruskal_wallis_test(train_df, 'resign', 'avg_od_epb')

         Source  ddof1        H     p-unc
Kruskal  resign      1  0.05205  0.819533
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in avg_od_epb between resign


avg_od_epb shows no statistically significant influence on our target

In [81]:
no_infl_num.append('avg_od_epb')

### avg_lhd_epb

- H<sub>0</sub>: There is no statistically significant difference in employee average low health days by boss across the `resign` groups

- H<sub>1</sub>: There is a statistically significant difference in employee average low health days by boss across the `resign` groups

In [82]:
kruskal_wallis_test(train_df, 'resign', 'avg_lhd_epb')

         Source  ddof1         H     p-unc
Kruskal  resign      1  1.251547  0.263257
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in avg_lhd_epb between resign


avg_lhd_epb shows no statistically significant influence on our target

In [83]:
no_infl_num.append('avg_lhd_epb')

### avg_avgp_epb

- H<sub>0</sub>: There is no statistically significant difference in employee average permanence by boss across the `resign` groups

- H<sub>1</sub>: There is a statistically significant difference in employee average permanence by boss across the `resign` groups

In [84]:
kruskal_wallis_test(train_df, 'resign', 'avg_avgp_epb')

         Source  ddof1         H     p-unc
Kruskal  resign      1  1.807231  0.178841
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in avg_avgp_epb between resign


avg_avgp_epb shows no statistically significant influence on our target

In [88]:
no_infl_num.append('avg_avgp_epb')

### avg_sal_epb

- H<sub>0</sub>: There is no statistically significant difference in employee average salary by boss across the `resign` groups

- H<sub>1</sub>: There is a statistically significant difference in employee average salary by boss across the `resign` groups

In [90]:
kruskal_wallis_test(train_df, 'resign', 'avg_sal_epb')

         Source  ddof1         H     p-unc
Kruskal  resign      1  0.040999  0.839539
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in avg_sal_epb between resign


avg_sal_epb shows no statistically significant influence on our target

In [91]:
no_infl_num.append('avg_sal_epb')

### avg_ps_epb

- H<sub>0</sub>: There is no statistically significant difference in employee average performance score by boss across the `resign` groups

- H<sub>1</sub>: There is a statistically significant difference in employee average performance score by boss across the `resign` groups

In [92]:
kruskal_wallis_test(train_df, 'resign', 'avg_ps_epb')

         Source  ddof1          H         p-unc
Kruskal  resign      1  40.844558  1.648316e-10
--------------------------------------------------------------------------------
Reject null hypothesis: There is a statistically significant difference in avg_ps_epb between resign.


avg_ps_epb has a statistically significant influence on our target, so it is kept

### avg_psis_epb

- H<sub>0</sub>: There is no statistically significant difference in employee average psi score by boss across the `resign` groups

- H<sub>1</sub>: There is a statistically significant difference in employee average psi score by boss across the `resign` groups

In [93]:
kruskal_wallis_test(train_df, 'resign', 'avg_psis_epb')

         Source  ddof1        H     p-unc
Kruskal  resign      1  0.03522  0.851135
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in avg_psis_epb between resign


avg_psis_epb shows no statistically significant influence on our target

In [94]:
no_infl_num.append('avg_psis_epb')

### avg_ja_epb

- H<sub>0</sub>: There is no statistically significant difference in employee average join age by boss across the `resign` groups

- H<sub>1</sub>: There is a statistically significant difference in employee average join age by boss across the `resign` groups

In [95]:
kruskal_wallis_test(train_df, 'resign', 'avg_ja_epb')

         Source  ddof1         H     p-unc
Kruskal  resign      1  8.804674  0.003005
--------------------------------------------------------------------------------
Reject null hypothesis: There is a statistically significant difference in avg_ja_epb between resign.


avg_ja_epb has a statistically significant influence on our target, so it is kept

### avg_od_bpb

- H<sub>0</sub>: There is no statistically significant difference in boss average office distance by its boss across the `resign` groups

- H<sub>1</sub>: There is a statistically significant difference in boss average office distance by its boss across the `resign` groups

In [96]:
kruskal_wallis_test(train_df, 'resign', 'avg_od_bpb')

         Source  ddof1         H     p-unc
Kruskal  resign      1  1.368042  0.242149
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in avg_od_bpb between resign


avg_od_bpb shows no statistically significant influence on our target

In [97]:
no_infl_num.append('avg_od_bpb')

### avg_lhd_bpb

- H<sub>0</sub>: There is no statistically significant difference in boss average low health days by its boss across the `resign` groups

- H<sub>1</sub>: There is a statistically significant difference in boss average low health days by its boss across the `resign` groups

In [98]:
kruskal_wallis_test(train_df, 'resign', 'avg_lhd_bpb')

         Source  ddof1         H     p-unc
Kruskal  resign      1  2.707232  0.099894
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in avg_lhd_bpb between resign


avg_lhd_bpb shows no statistically significant influence on our target

In [99]:
no_infl_num.append('avg_lhd_bpb')

### avg_avgp_bpb

- H<sub>0</sub>: There is no statistically significant difference in boss average permanence by its boss across the `resign` groups

- H<sub>1</sub>: There is a statistically significant difference in boss average permanence by its boss across the `resign` groups

In [100]:
kruskal_wallis_test(train_df, 'resign', 'avg_avgp_bpb')

         Source  ddof1         H     p-unc
Kruskal  resign      1  1.679664  0.194969
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in avg_avgp_bpb between resign


avg_avgp_bpb shows no statistically significant influence on our target

In [101]:
no_infl_num.append('avg_avgp_bpb')

### avg_sal_bpb

- H<sub>0</sub>: There is no statistically significant difference in boss average salary by its boss across the `resign` groups

- H<sub>1</sub>: There is a statistically significant difference in boss average salary by its boss across the `resign` groups

In [102]:
kruskal_wallis_test(train_df, 'resign', 'avg_sal_bpb')

         Source  ddof1        H     p-unc
Kruskal  resign      1  1.05641  0.304035
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in avg_sal_bpb between resign


avg_sal_bpb shows no statistically significant influence on our target

In [103]:
no_infl_num.append('avg_sal_bpb')

### avg_ps_bpb

- H<sub>0</sub>: There is no statistically significant difference in boss average performance score by its boss across the `resign` groups

- H<sub>1</sub>: There is a statistically significant difference in boss average performance score by its boss across the `resign` groups

In [104]:
kruskal_wallis_test(train_df, 'resign', 'avg_ps_bpb')

         Source  ddof1         H     p-unc
Kruskal  resign      1  6.054435  0.013871
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in avg_ps_bpb between resign


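avg_ps_bpb shows no statistically significant influence on our target at the 0.01 level, although its p-value (0.013871) would be significant at 0.05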

In [105]:
no_infl_num.append('avg_ps_bpb')

### avg_psis_bpb

- H<sub>0</sub>: There is no statistically significant difference in boss average psi score by its boss across the `resign` groups

- H<sub>1</sub>: There is a statistically significant difference in boss average psi score by its boss across the `resign` groups

In [106]:
kruskal_wallis_test(train_df, 'resign', 'avg_psis_bpb')

         Source  ddof1         H     p-unc
Kruskal  resign      1  0.253722  0.614466
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in avg_psis_bpb between resign


avg_psis_bpb shows no statistically significant influence on our target

In [107]:
no_infl_num.append('avg_psis_bpb')

### avg_ja_bpb

- H<sub>0</sub>: There is no statistically significant difference in boss average join age by its boss across the `resign` groups

- H<sub>1</sub>: There is a statistically significant difference in boss average join age by its boss across the `resign` groups

In [108]:
kruskal_wallis_test(train_df, 'resign', 'avg_ja_bpb')

         Source  ddof1         H     p-unc
Kruskal  resign      1  1.746788  0.186281
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in avg_ja_bpb between resign


avg_ja_bpb shows no statistically significant influence on our target

In [109]:
no_infl_num.append('avg_ja_bpb')

### boss_employees_in_charge

- H<sub>0</sub>: There is no statistically significant difference in the number of employees in the boss's charge across the `resign` groups

- H<sub>1</sub>: There is a statistically significant difference in the number of employees in the boss's charge across the `resign` groups

In [110]:
kruskal_wallis_test(train_df, 'resign', 'boss_employees_in_charge')

         Source  ddof1         H     p-unc
Kruskal  resign      1  2.245334  0.134018
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in boss_employees_in_charge between resign


boss_employees_in_charge shows no statistically significant influence on our target

In [111]:
no_infl_num.append('boss_employees_in_charge')

### bob_bosses_in_charge

- H<sub>0</sub>: There is no statistically significant difference in the number of bosses in the charge of the boss's boss across the `resign` groups

- H<sub>1</sub>: There is a statistically significant difference in the number of bosses in the charge of the boss's boss across the `resign` groups

In [112]:
kruskal_wallis_test(train_df, 'resign', 'bob_bosses_in_charge')

         Source  ddof1         H     p-unc
Kruskal  resign      1  2.280466  0.131013
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in bob_bosses_in_charge between resign


bob_bosses_in_charge shows no statistically significant influence on our target

In [113]:
no_infl_num.append('bob_bosses_in_charge')

In [114]:
print(f'Category features considered to discard {no_infl_cat}')
print(f'Numerical features considered to discard {no_infl_num}')

Category features considered to discard ['gender_employee', 'join_month_employee', 'work_modality_boss', 'gender_boss', 'recruitment_channel_boss', 'marital_estatus_boss', 'join_year_boss', 'join_month_boss', 'performance_boss', 'joined_after_boss', 'younger_than_boss']
Numerical features considered to discard ['office_distance_employee', 'average_permanence_employee', 'salary_employee', 'psi_score_employee', 'join_age_employee', 'office_distance_boss', 'low_health_days_boss', 'average_permanence_boss', 'salary_boss', 'performance_score_boss', 'psi_score_boss', 'join_age_boss', 'office_distance_diff', 'average_permanence_diff', 'salary_diff', 'age_diff', 'avg_lhd_epb', 'avg_od_epb', 'avg_avgp_epb', 'avg_sal_epb', 'avg_psis_epb', 'avg_od_bpb', 'avg_lhd_bpb', 'avg_avgp_bpb', 'avg_sal_bpb', 'avg_ps_bpb', 'avg_psis_bpb', 'avg_ja_bpb', 'boss_employees_in_charge', 'bob_bosses_in_charge']
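
The bookkeeping behind the numerical list can be condensed into a single pass over the p-values. The sketch below is an illustration, not the notebook's own code: `split_by_alpha` and `ALPHA` are hypothetical names, the p-values are a subset copied from the outputs above, and it assumes the null hypothesis is rejected when `p-unc` is strictly below the significance level.

```python
ALPHA = 0.01  # significance level used in this notebook

# p-unc values copied from the Kruskal-Wallis outputs above (subset for illustration)
p_values = {
    "low_health_days_diff": 0.000085,
    "join_days_diff": 0.003329,
    "avg_ps_epb": 1.648316e-10,
    "avg_ja_epb": 0.003005,
    "performance_score_boss": 0.017771,
    "salary_diff": 0.038085,
    "avg_ps_bpb": 0.013871,
    "age_diff": 0.58395,
}


def split_by_alpha(p_values, alpha):
    """Return (kept, discarded) feature names, keeping those with p < alpha."""
    kept = [name for name, p in p_values.items() if p < alpha]
    discarded = [name for name, p in p_values.items() if p >= alpha]
    return kept, discarded


kept, discarded = split_by_alpha(p_values, ALPHA)

# Features discarded at 0.01 that would have been kept at the more common 0.05 level
borderline = [name for name in discarded if p_values[name] < 0.05]

print("Kept:", kept)
print("Discarded:", discarded)
print("Borderline (0.01 <= p < 0.05):", borderline)
```

Listing the borderline features separately is a design choice: they are the first candidates to reinstate if the model trained without the discarded features performs worse.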
