# Hypothesis testing

During our exploratory data analysis (EDA), we observed that some features appear to have no relationship with employee resignation. Here we test that observation formally, so that we can later train the model with and without those features and compare its performance.

A significance level of 0.05 will be used: we reject the null hypothesis when the p-value is below 0.05, accepting a 5% risk of a false positive when a feature is actually unrelated to the target.
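To make the meaning of the 0.05 level concrete, the sketch below simulates two independent binary variables many times and counts how often a chi-square test rejects at alpha = 0.05. Everything here is synthetic: the sample size, the number of simulations and the seed are arbitrary assumptions, not values from the project. The rejection rate should land near 5%, which is the false-positive risk the level controls.

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)          # arbitrary seed (assumption)
alpha, n_sims, n_obs = 0.05, 1000, 200  # arbitrary simulation settings

rejections = 0
for _ in range(n_sims):
    x = rng.integers(0, 2, size=n_obs)
    y = rng.integers(0, 2, size=n_obs)  # independent of x by construction

    # Build the 2x2 contingency table by counting (x, y) pairs.
    table = np.zeros((2, 2), dtype=int)
    np.add.at(table, (x, y), 1)

    _, pval, _, _ = chi2_contingency(table, correction=False)
    rejections += int(pval < alpha)

false_positive_rate = rejections / n_sims
print(f'Share of true null hypotheses rejected: {false_positive_rate:.3f}')
```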

## Preparing environment

In [1]:
import numpy as np
import pandas as pd
import pingouin
import sys
sys.path.append('../high_performance_employee_resign_prediction')
from utils import paths

In [21]:
import warnings
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=RuntimeWarning)

## Importing data

In [2]:
train_df = pd.read_csv(paths.data_interim_dir('train_clean.csv'))

## Generating hypothesis testing functions

### Chi2 independence test

In [17]:
def chi2_independence_test(data: pd.DataFrame, main_cat: str, sec_cat: str, alpha=0.05):
    """
    Perform a Chi-square test of independence between two categorical variables.

    Parameters:
    data (pd.DataFrame): The DataFrame containing the data to be tested.
    main_cat (str): The name of the first categorical variable (column).
    sec_cat (str): The name of the second categorical variable (column).
    alpha (float, optional): The significance level to be used for the test. Defaults to 0.05.

    Returns:
    None

    This function prints the Chi-square test statistics and whether the null hypothesis
    of independence between the two categorical variables is rejected or not.

    Example:
    >>> df = pd.DataFrame({'A': ['a', 'b', 'a', 'b'], 'B': ['x', 'x', 'y', 'y']})
    >>> chi2_independence_test(df, 'A', 'B')
    """
    
    expected, observed, stats = pingouin.chi2_independence(data=data, x=main_cat, y=sec_cat)
    
    stats = pd.DataFrame(stats)
    
    print(stats)
    print('-'*80)
    
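    # Decide on the Pearson row only: the other power-divergence variants can
    # return NaN or inf when the contingency table has empty cells.
    pearson_pval = stats.loc[stats['test'] == 'pearson', 'pval'].iloc[0]
    
    if pearson_pval < alpha: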
        print('Reject null hypothesis: There is a statistically significant difference between categories.')
    else:
        print('Failed to reject null hypothesis: There is no statistically significant difference between categories.')
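As a cross-check of what the helper reports, a Pearson chi-square test can also be reproduced with `pd.crosstab` and `scipy.stats.chi2_contingency`. This is a minimal sketch on made-up data: the columns `resigned` and `remote` are hypothetical and do not come from the dataset. Yates' continuity correction is switched on because, as far as I know, pingouin applies it by default for 2x2 tables. Cramér's V is added as a common effect size for this kind of table.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data; 'resigned' and 'remote' are illustrative names only.
toy = pd.DataFrame({
    'resigned': [0] * 60 + [1] * 40,
    'remote': ['yes'] * 40 + ['no'] * 20 + ['yes'] * 10 + ['no'] * 30,
})

# Observed contingency table (rows: resigned, columns: remote).
observed = pd.crosstab(toy['resigned'], toy['remote'])

# Pearson chi-square test with Yates' correction (only applied when dof == 1).
chi2, pval, dof, expected = chi2_contingency(observed, correction=True)

# Cramér's V: chi2 scaled by sample size and the smaller table dimension.
n = observed.to_numpy().sum()
cramers_v = np.sqrt(chi2 / (n * (min(observed.shape) - 1)))

print(observed)
print(f'chi2={chi2:.3f}, dof={dof}, pval={pval:.5f}, cramer={cramers_v:.3f}')
```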

### Kruskal-Wallis test

In [18]:
def kruskal_wallis_test(data: pd.DataFrame, cat_feat: str, num_feat: str, alpha=0.05):
    """
    Perform the Kruskal-Wallis H-test for independent samples to determine if there are 
    statistically significant differences between the groups of a categorical variable 
    on a continuous variable.

    Parameters:
    data (pd.DataFrame): The DataFrame containing the data to be tested.
    cat_feat (str): The name of the categorical variable (column).
    num_feat (str): The name of the numerical variable (column).
    alpha (float, optional): The significance level to be used for the test. Defaults to 0.05.

    Returns:
    None

    This function prints the Kruskal-Wallis test results and whether the null hypothesis
    of equal medians among groups is rejected or not.

    Example:
    >>> df = pd.DataFrame({'Category': ['A', 'A', 'B', 'B', 'C', 'C'], 'Values': [5, 6, 7, 8, 9, 10]})
    >>> kruskal_wallis_test(df, 'Category', 'Values')
    """
    
    results = pingouin.kruskal(data=data, dv=num_feat, between=cat_feat)
    
    print(results)
    print('-'*80)
    
    if results['p-unc'].iloc[0] < alpha:
        print(f'Reject null hypothesis: There is a statistically significant difference in {num_feat} between {cat_feat}.')
    else:
        print(f'Failed to reject null hypothesis: There is no statistically significant difference in {num_feat} between {cat_feat}')
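For intuition about what the helper computes, the same H-test is available as `scipy.stats.kruskal`. The sketch below uses two small made-up groups (`stayed` and `left` are hypothetical names and values) and applies the same decision rule as the function above. With a binary target there are only two groups, so the statistic has one degree of freedom.

```python
from scipy.stats import kruskal

# Made-up measurements for two groups of a binary target (illustrative only).
stayed = [1.2, 3.4, 2.2, 5.1, 4.0, 2.9]
left = [6.3, 7.1, 5.8, 8.0, 6.9, 7.4]

# The test works on ranks, so it does not assume normally distributed values.
h_stat, p_value = kruskal(stayed, left)

alpha = 0.05
reject = p_value < alpha
print(f'H={h_stat:.4f}, p={p_value:.4f}, reject H0: {reject}')
```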

In [24]:
no_infl_cat = []
no_infl_num = []

## Categorical tests: Chi2 independence tests between categorical features and target.

### id_ultimo_jefe

- H<sub>0</sub>: There is no statistically significant association between the employee's last boss and resignation.

- H<sub>1</sub>: There is a statistically significant association between the employee's last boss and resignation.

In [22]:
chi2_independence_test(train_df, 'abandono_6meses', 'id_ultimo_jefe')

                 test    lambda        chi2    dof          pval    cramer  \
0             pearson  1.000000  290.927616  171.0  2.902585e-08  0.367681   
1        cressie-read  0.666667  296.670742  171.0  8.362260e-09  0.371293   
2      log-likelihood  0.000000  323.194818  171.0  1.832113e-11  0.387535   
3       freeman-tukey -0.500000         NaN  171.0           NaN       NaN   
4  mod-log-likelihood -1.000000         inf  171.0  0.000000e+00       inf   
5              neyman -2.000000         NaN  171.0           NaN       NaN   

   power  
0    1.0  
1    1.0  
2    1.0  
3    NaN  
4    NaN  
5    NaN  
--------------------------------------------------------------------------------
Reject null hypothesis: There is a statistically significant difference between categories.


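id_ultimo_jefe is significantly associated with the target. The NaN and inf rows most likely come from empty cells in the contingency table (171 degrees of freedom), so the chi-square approximation should be read with caution here.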
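With 171 degrees of freedom the contingency table has many cells, and some of them may hold very few observations. A common rule of thumb is that the chi-square approximation is reliable when most expected counts (often quoted as at least 80%) are 5 or more. The sketch below shows one way to check this; `expected_counts` is a hypothetical helper and the toy columns `target` and `boss` are made up, standing in for abandono_6meses and id_ultimo_jefe.

```python
import numpy as np
import pandas as pd

def expected_counts(data: pd.DataFrame, row: str, col: str):
    """Return observed counts and expected counts under independence."""
    observed = pd.crosstab(data[row], data[col])
    row_tot = observed.sum(axis=1).to_numpy()
    col_tot = observed.sum(axis=0).to_numpy()
    # Expected count of a cell = row total * column total / grand total.
    expected = np.outer(row_tot, col_tot) / observed.to_numpy().sum()
    expected = pd.DataFrame(expected, index=observed.index, columns=observed.columns)
    return observed, expected

# Hypothetical toy data with a high-cardinality category and few rows.
toy = pd.DataFrame({
    'target': [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    'boss': ['a', 'a', 'a', 'b', 'b', 'c', 'a', 'a', 'b', 'd'],
})

observed, expected = expected_counts(toy, 'target', 'boss')
share_small = (expected < 5).to_numpy().mean()   # share of cells with expected < 5
n_empty = int((observed == 0).to_numpy().sum())  # cells with no observations

print(f'Cells with expected count < 5: {share_small:.0%}, empty cells: {n_empty}')
```

If the share of small cells is high on the real data, grouping rare bosses into an "other" level before testing is one option to consider.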

### seniority

- H<sub>0</sub>: There is no statistically significant association between seniority and resignation.

- H<sub>1</sub>: There is a statistically significant association between seniority and resignation.

In [23]:
chi2_independence_test(train_df, 'abandono_6meses', 'seniority')

                 test    lambda       chi2  dof          pval    cramer  \
0             pearson  1.000000  36.374241  1.0  1.628412e-09  0.130010   
1        cressie-read  0.666667  36.565411  1.0  1.476282e-09  0.130351   
2      log-likelihood  0.000000  37.810045  1.0  7.797921e-10  0.132551   
3       freeman-tukey -0.500000  39.605439  1.0  3.108164e-10  0.135661   
4  mod-log-likelihood -1.000000  42.291719  1.0  7.862479e-11  0.140187   
5              neyman -2.000000  51.216367  1.0  8.272602e-13  0.154271   

      power  
0  0.999977  
1  0.999978  
2  0.999986  
3  0.999993  
4  0.999997  
5  1.000000  
--------------------------------------------------------------------------------
Reject null hypothesis: There is a statistically significant difference between categories.


seniority is significantly associated with the target.

### modalidad_trabajo

- H<sub>0</sub>: There is no statistically significant association between work modality and resignation.

- H<sub>1</sub>: There is a statistically significant association between work modality and resignation.

In [26]:
chi2_independence_test(train_df, 'abandono_6meses', 'modalidad_trabajo')

                 test    lambda      chi2  dof      pval    cramer     power
0             pearson  1.000000  7.961430  1.0  0.004778  0.060824  0.805557
1        cressie-read  0.666667  7.956675  1.0  0.004791  0.060806  0.805325
2      log-likelihood  0.000000  7.950230  1.0  0.004808  0.060781  0.805010
3       freeman-tukey -0.500000  7.948070  1.0  0.004814  0.060773  0.804905
4  mod-log-likelihood -1.000000  7.948196  1.0  0.004814  0.060773  0.804911
5              neyman -2.000000  7.955293  1.0  0.004795  0.060800  0.805258
--------------------------------------------------------------------------------
Reject null hypothesis: There is a statistically significant difference between categories.


modalidad_trabajo is significantly associated with the target, although the effect is small (Cramér's V ≈ 0.06).

### genero

- H<sub>0</sub>: There is no statistically significant association between gender and resignation.

- H<sub>1</sub>: There is a statistically significant association between gender and resignation.

In [27]:
chi2_independence_test(train_df, 'abandono_6meses', 'genero')

                 test    lambda      chi2  dof      pval    cramer     power
0             pearson  1.000000  3.902991  1.0  0.048200  0.042587  0.506279
1        cressie-read  0.666667  3.903170  1.0  0.048195  0.042588  0.506297
2      log-likelihood  0.000000  3.903928  1.0  0.048173  0.042592  0.506373
3       freeman-tukey -0.500000  3.904849  1.0  0.048147  0.042597  0.506466
4  mod-log-likelihood -1.000000  3.906070  1.0  0.048112  0.042604  0.506589
5              neyman -2.000000  3.909419  1.0  0.048016  0.042622  0.506927
--------------------------------------------------------------------------------
Reject null hypothesis: There is a statistically significant difference between categories.


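genero is significantly associated with the target at the 0.05 level, but only marginally (p ≈ 0.048, Cramér's V ≈ 0.04, power ≈ 0.51).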

### canal_reclutamiento

- H<sub>0</sub>: There is no statistically significant association between recruitment channel and resignation.

- H<sub>1</sub>: There is a statistically significant association between recruitment channel and resignation.

In [28]:
chi2_independence_test(train_df, 'abandono_6meses', 'canal_reclutamiento')

                 test    lambda       chi2  dof          pval    cramer  \
0             pearson  1.000000  43.645651  4.0  7.600404e-09  0.142413   
1        cressie-read  0.666667  43.801927  4.0  7.053200e-09  0.142668   
2      log-likelihood  0.000000  44.227223  4.0  5.755041e-09  0.143359   
3       freeman-tukey -0.500000  44.649401  4.0  4.702424e-09  0.144041   
4  mod-log-likelihood -1.000000  45.165170  4.0  3.673663e-09  0.144871   
5              neyman -2.000000  46.500171  4.0  1.937891e-09  0.146996   

      power  
0  0.999942  
1  0.999945  
2  0.999951  
3  0.999957  
4  0.999963  
5  0.999976  
--------------------------------------------------------------------------------
Reject null hypothesis: There is a statistically significant difference between categories.


canal_reclutamiento is significantly associated with the target.

### estado_civil

- H<sub>0</sub>: There is no statistically significant association between marital status and resignation.

- H<sub>1</sub>: There is a statistically significant association between marital status and resignation.

In [29]:
chi2_independence_test(train_df, 'abandono_6meses', 'estado_civil')

                 test    lambda       chi2  dof      pval    cramer     power
0             pearson  1.000000  20.900890  3.0  0.000110  0.098551  0.980173
1        cressie-read  0.666667  20.918306  3.0  0.000109  0.098592  0.980261
2      log-likelihood  0.000000  20.968675  3.0  0.000107  0.098711  0.980515
3       freeman-tukey -0.500000  21.020234  3.0  0.000104  0.098832  0.980771
4  mod-log-likelihood -1.000000  21.083810  3.0  0.000101  0.098981  0.981082
5              neyman -2.000000  21.247888  3.0  0.000094  0.099366  0.981864
--------------------------------------------------------------------------------
Reject null hypothesis: There is a statistically significant difference between categories.


estado_civil is significantly associated with the target.

### join_year

- H<sub>0</sub>: There is no statistically significant association between joining year and resignation.

- H<sub>1</sub>: There is a statistically significant association between joining year and resignation.

In [30]:
chi2_independence_test(train_df, 'abandono_6meses', 'join_year')

                 test    lambda       chi2   dof          pval    cramer  \
0             pearson  1.000000  48.316797  11.0  1.253445e-06  0.149840   
1        cressie-read  0.666667  48.351606  11.0  1.235634e-06  0.149894   
2      log-likelihood  0.000000  48.619469  11.0  1.106703e-06  0.150309   
3       freeman-tukey -0.500000  48.998801  11.0  9.465954e-07  0.150894   
4  mod-log-likelihood -1.000000  49.537969  11.0  7.577378e-07  0.151722   
5              neyman -2.000000  51.132013  11.0  3.913587e-07  0.154144   

      power  
0  0.999681  
1  0.999684  
2  0.999704  
3  0.999730  
4  0.999764  
5  0.999841  
--------------------------------------------------------------------------------
Reject null hypothesis: There is a statistically significant difference between categories.


join_year is significantly associated with the target.

### join_month

- H<sub>0</sub>: There is no statistically significant association between joining month and resignation.

- H<sub>1</sub>: There is a statistically significant association between joining month and resignation.

In [31]:
chi2_independence_test(train_df, 'abandono_6meses', 'join_month')

                 test    lambda      chi2   dof      pval    cramer     power
0             pearson  1.000000  7.293563  11.0  0.774836  0.058217  0.379021
1        cressie-read  0.666667  7.290816  11.0  0.775067  0.058206  0.378872
2      log-likelihood  0.000000  7.289974  11.0  0.775138  0.058203  0.378826
3       freeman-tukey -0.500000  7.293408  11.0  0.774849  0.058216  0.379012
4  mod-log-likelihood -1.000000  7.300327  11.0  0.774267  0.058244  0.379388
5              neyman -2.000000  7.324659  11.0  0.772216  0.058341  0.380710
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference between categories.


We found no evidence that join_month is associated with the target.

In [32]:
no_infl_cat.append('join_month')

### performance

- H<sub>0</sub>: There is no statistically significant association between performance and resignation.

- H<sub>1</sub>: There is a statistically significant association between performance and resignation.

In [33]:
chi2_independence_test(train_df, 'abandono_6meses', 'performance')

                 test    lambda       chi2  dof          pval    cramer  power
0             pearson  1.000000  69.057825  1.0  9.561839e-17  0.179137    1.0
1        cressie-read  0.666667  69.289208  1.0  8.503360e-17  0.179437    1.0
2      log-likelihood  0.000000  69.937619  1.0  6.120989e-17  0.180274    1.0
3       freeman-tukey -0.500000  70.592867  1.0  4.391019e-17  0.181117    1.0
4  mod-log-likelihood -1.000000  71.400207  1.0  2.916415e-17  0.182150    1.0
5              neyman -2.000000  73.503201  1.0  1.004731e-17  0.184813    1.0
--------------------------------------------------------------------------------
Reject null hypothesis: There is a statistically significant difference between categories.


performance is significantly associated with the target.

In [37]:
print(f'Categories considered to discard {no_infl_cat}')

Categories considered to discard ['join_month']
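Nine tests were run against the same target, so the chance of at least one false positive is higher than 5%. One optional safeguard, which is not part of the original analysis, is a multiple-comparison adjustment. The sketch below applies Holm's step-down method to the Pearson p-values printed in the outputs above; `holm_adjust` is a hypothetical helper written for this illustration.

```python
def holm_adjust(pvals: dict) -> dict:
    """Holm step-down adjustment: returns adjusted p-values keyed by name."""
    ordered = sorted(pvals.items(), key=lambda kv: kv[1])
    m = len(ordered)
    adjusted, running_max = {}, 0.0
    for rank, (name, p) in enumerate(ordered):
        # Multiply the k-th smallest p-value by (m - k), cap at 1 and
        # enforce that adjusted values never decrease along the ordering.
        running_max = max(running_max, min(1.0, (m - rank) * p))
        adjusted[name] = running_max
    return adjusted

# Pearson p-values copied from the chi-square outputs in this notebook.
pearson_pvals = {
    'id_ultimo_jefe': 2.902585e-08,
    'seniority': 1.628412e-09,
    'modalidad_trabajo': 0.004778,
    'genero': 0.048200,
    'canal_reclutamiento': 7.600404e-09,
    'estado_civil': 0.000110,
    'join_year': 1.253445e-06,
    'join_month': 0.774836,
    'performance': 9.561839e-17,
}

alpha = 0.05
holm = holm_adjust(pearson_pvals)
not_significant = sorted(name for name, p in holm.items() if p >= alpha)
print(not_significant)
```

Under this adjustment genero joins join_month among the features that are no longer significant, which is consistent with its borderline unadjusted p-value.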


## Numerical tests: Kruskal-Wallis tests between numerical features and target.

### distancia_oficina

- H<sub>0</sub>: There is no statistically significant difference in distance to the office between employees who resigned and those who did not.

- H<sub>1</sub>: There is a statistically significant difference in distance to the office between employees who resigned and those who did not.

In [34]:
kruskal_wallis_test(train_df, 'abandono_6meses', 'distancia_oficina')

                  Source  ddof1         H     p-unc
Kruskal  abandono_6meses      1  0.215454  0.642526
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in distancia_oficina between abandono_6meses


We found no evidence that distancia_oficina differs between employees who resigned and those who did not.

In [35]:
no_infl_num.append('distancia_oficina')

### dias_baja_salud

- H<sub>0</sub>: There is no statistically significant difference in sick days between employees who resigned and those who did not.

- H<sub>1</sub>: There is a statistically significant difference in sick days between employees who resigned and those who did not.

In [36]:
kruskal_wallis_test(train_df, 'abandono_6meses', 'dias_baja_salud')

                  Source  ddof1          H         p-unc
Kruskal  abandono_6meses      1  44.826429  2.152970e-11
--------------------------------------------------------------------------------
Reject null hypothesis: There is a statistically significant difference in dias_baja_salud between abandono_6meses.


dias_baja_salud differs significantly between employees who resigned and those who did not.

### permanencia_promedio

- H<sub>0</sub>: There is no statistically significant difference in average permanence between employees who resigned and those who did not.

- H<sub>1</sub>: There is a statistically significant difference in average permanence between employees who resigned and those who did not.

In [38]:
kruskal_wallis_test(train_df, 'abandono_6meses', 'permanencia_promedio')

                  Source  ddof1         H    p-unc
Kruskal  abandono_6meses      1  0.672763  0.41209
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in permanencia_promedio between abandono_6meses


We found no evidence that permanencia_promedio differs between employees who resigned and those who did not.

In [39]:
no_infl_num.append('permanencia_promedio')

### salario

- H<sub>0</sub>: There is no statistically significant difference in salary between employees who resigned and those who did not.

- H<sub>1</sub>: There is a statistically significant difference in salary between employees who resigned and those who did not.

In [40]:
kruskal_wallis_test(train_df, 'abandono_6meses', 'salario')

                  Source  ddof1         H     p-unc
Kruskal  abandono_6meses      1  0.235091  0.627774
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in salario between abandono_6meses


We found no evidence that salario differs between employees who resigned and those who did not.

In [41]:
no_infl_num.append('salario')

### performance_score

- H<sub>0</sub>: There is no statistically significant difference in performance score between employees who resigned and those who did not.

- H<sub>1</sub>: There is a statistically significant difference in performance score between employees who resigned and those who did not.

In [42]:
kruskal_wallis_test(train_df, 'abandono_6meses', 'performance_score')

                  Source  ddof1           H         p-unc
Kruskal  abandono_6meses      1  160.008373  1.126728e-36
--------------------------------------------------------------------------------
Reject null hypothesis: There is a statistically significant difference in performance_score between abandono_6meses.


performance_score differs significantly between employees who resigned and those who did not.

### psi_score

- H<sub>0</sub>: There is no statistically significant difference in psi score between employees who resigned and those who did not.

- H<sub>1</sub>: There is a statistically significant difference in psi score between employees who resigned and those who did not.

In [43]:
kruskal_wallis_test(train_df, 'abandono_6meses', 'psi_score')

                  Source  ddof1         H     p-unc
Kruskal  abandono_6meses      1  0.592649  0.441396
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in psi_score between abandono_6meses


We found no evidence that psi_score differs between employees who resigned and those who did not.

In [44]:
no_infl_num.append('psi_score')

### age

- H<sub>0</sub>: There is no statistically significant difference in age between employees who resigned and those who did not.

- H<sub>1</sub>: There is a statistically significant difference in age between employees who resigned and those who did not.

In [45]:
kruskal_wallis_test(train_df, 'abandono_6meses', 'age')

                  Source  ddof1         H     p-unc
Kruskal  abandono_6meses      1  0.080111  0.777146
--------------------------------------------------------------------------------
Failed to reject null hypothesis: There is no statistically significant difference in age between abandono_6meses


We found no evidence that age differs between employees who resigned and those who did not.

In [46]:
no_infl_num.append('age')

In [47]:
print(f'Category features considered to discard {no_infl_cat}')
print(f'Numerical features considered to discard {no_infl_num}')

Category features considered to discard ['join_month']
Numerical features considered to discard ['distancia_oficina', 'permanencia_promedio', 'salario', 'psi_score', 'age']
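To compare models with and without these features, as outlined in the introduction, the two lists can be combined and passed to `DataFrame.drop`. The sketch below uses a two-row toy frame with made-up values in place of train_df; the names `toy_df`, `to_drop` and `reduced_df` are assumptions introduced for the illustration.

```python
import pandas as pd

# Lists as produced by the tests in this notebook.
no_infl_cat = ['join_month']
no_infl_num = ['distancia_oficina', 'permanencia_promedio', 'salario', 'psi_score', 'age']

# Toy stand-in for train_df (values are made up).
toy_df = pd.DataFrame({
    'join_month': [1, 5],
    'distancia_oficina': [3.2, 10.5],
    'permanencia_promedio': [2.0, 4.5],
    'salario': [1000, 2000],
    'psi_score': [70, 85],
    'age': [30, 41],
    'performance_score': [88.0, 92.5],
    'abandono_6meses': [0, 1],
})

to_drop = no_infl_cat + no_infl_num

# errors='ignore' avoids a KeyError if a listed column is already absent.
reduced_df = toy_df.drop(columns=to_drop, errors='ignore')
print(list(reduced_df.columns))
```

Keeping both the full and the reduced frame makes it straightforward to train the same model on each and compare the metrics side by side.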
