# Hypothesis testing

During our exploratory data analysis (EDA), we observed that some features appear to have no relationship with employee resignation. We will test this hypothesis to determine whether those features are indeed unrelated to the target variable, so that we can later train the model with and without them and compare its performance.

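A significance level of 0.01 will be used, which corresponds to a 99% confidence level when deciding whether a feature influences the target. Note that the test functions defined below default to `alpha=0.05`, so this level has to be passed explicitly through the `alpha` argument.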

## Preparing environment

In [130]:
import numpy as np
import pandas as pd
import pingouin
import sys
sys.path.append('../high_performance_employee_resign_prediction')
from utils import paths

In [131]:
import warnings
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=RuntimeWarning)

## Importing data

In [132]:
train_df = pd.read_csv(paths.data_interim_dir('train_clean.csv'))

## Generating hypothesis testing functions

### Chi2 independence test

In [133]:
def chi2_independence_test(data: pd.DataFrame, main_cat: str, sec_cat: str, sig_col: list, no_sig_col: list, alpha=0.05):
    """
    Perform a chi-squared test of independence between two categorical variables.

    Parameters:
    -----------
    data : pd.DataFrame
        The DataFrame containing the data.
    main_cat : str
        The name of the main categorical variable (column) in the DataFrame.
    sec_cat : str
        The name of the secondary categorical variable (column) in the DataFrame.
    sig_col : list
        A list to append the name of the secondary categorical variable if the test is significant.
    no_sig_col : list
        A list to append the name of the secondary categorical variable if the test is not significant.
    alpha : float, optional (default=0.05)
        The significance level for the test.

    Returns:
    --------
    None

    Prints:
    -------
    - The chi-squared test statistics.
    - A message indicating whether the null hypothesis is rejected or not.
    - Appends the secondary categorical variable name to `sig_col` if the null hypothesis is rejected.
    - Appends the secondary categorical variable name to `no_sig_col` if the null hypothesis is not rejected.

    Notes:
    ------
    - The function uses the `pingouin` library to perform the chi-squared test of independence.
    - The null hypothesis is that the two categorical variables are independent.
    - If the p-value is less than the significance level (`alpha`), the null hypothesis is rejected, indicating a statistically significant association between the two categorical variables.
    """
    
    expected, observed, stats = pingouin.chi2_independence(data=data, x=main_cat, y=sec_cat)
    
    stats = pd.DataFrame(stats)
    
    print('-'*80)
    print(stats)
        
    # `stats` holds one row per test variant (pearson, cressie-read, ...).
    # Note: H0 is rejected as soon as ANY variant has a p-value below alpha.
    stat_sign = stats['pval'] < alpha
    
    if stat_sign.any():
        print(f'Reject null hypothesis: There is a statistically significant difference between {sec_cat} and {main_cat}')
        sig_col.append(sec_cat)
    else:
        print(f'Failed to reject null hypothesis: There is no statistically significant difference between {sec_cat} and {main_cat}')
        no_sig_col.append(sec_cat)
    print('-'*80)
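
To make the quantities in the `pingouin` output easier to read, here is a minimal standard-library sketch of what the Pearson variant of the test computes for a 2x2 table: the chi-squared statistic, its p-value, and Cramér's V. The function name `pearson_chi2_2x2` and the counts in `toy_table` are hypothetical and not taken from the dataset. No continuity correction is applied here, so the numbers may differ from what `pingouin` reports for 2x2 tables.

```python
import math


def pearson_chi2_2x2(table):
    """Pearson chi-squared statistic, p-value and Cramér's V for a 2x2 table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    # With 1 degree of freedom the chi-squared survival function is erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    # Cramér's V = sqrt(chi2 / (n * min(rows - 1, cols - 1))); the minimum is 1 for a 2x2 table
    cramers_v = math.sqrt(chi2 / n)
    return chi2, p_value, cramers_v


# Hypothetical counts: rows = resign (0 / 1), columns = two levels of a categorical feature
toy_table = [[30, 10], [10, 30]]
chi2, p_value, cramers_v = pearson_chi2_2x2(toy_table)
print(f"chi2={chi2:.3f}, p-value={p_value:.2e}, Cramér's V={cramers_v:.3f}")
```

A larger Cramér's V means a stronger association, which is useful context because a very small p-value alone does not say how strong the relationship is.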

### Kruskal-Wallis test

In [134]:
def kruskal_wallis_test(data: pd.DataFrame, cat_feat: str, num_feat: str, sig_col: list, no_sig_col: list, alpha=0.05):
    """
    Perform a Kruskal-Wallis test to determine if there are statistically significant differences in a numerical feature across categories of a categorical feature.

    Parameters:
    -----------
    data : pd.DataFrame
        The DataFrame containing the data.
    cat_feat : str
        The name of the categorical feature (column) in the DataFrame.
    num_feat : str
        The name of the numerical feature (column) in the DataFrame.
    sig_col : list
        A list to append the name of the numerical feature if the test is significant.
    no_sig_col : list
        A list to append the name of the numerical feature if the test is not significant.
    alpha : float, optional (default=0.05)
        The significance level for the test.

    Returns:
    --------
    None

    Prints:
    -------
    - The Kruskal-Wallis test results.
    - A message indicating whether the null hypothesis is rejected or not.
    - Appends the numerical feature name to `sig_col` if the null hypothesis is rejected.
    - Appends the numerical feature name to `no_sig_col` if the null hypothesis is not rejected.

    Notes:
    ------
    - The function uses the `pingouin` library to perform the Kruskal-Wallis test.
    - The null hypothesis is that the distributions of the numerical feature are the same across the categories of the categorical feature.
    - If the p-value is less than the significance level (`alpha`), the null hypothesis is rejected, indicating a statistically significant difference in the numerical feature between the categories of the categorical feature.
    """
    
    results = pingouin.kruskal(data=data, dv=num_feat, between=cat_feat)
    
    print('-'*80)
    print(results)
    
    if results['p-unc'].iloc[0] < alpha:
        print(f'Reject null hypothesis: There is a statistically significant difference in {num_feat} between {cat_feat}.')
        sig_col.append(num_feat)
    else:
        print(f'Failed to reject null hypothesis: There is no statistically significant difference in {num_feat} between {cat_feat}')
        no_sig_col.append(num_feat)
    print('-'*80)
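
The `H` and `p-unc` columns printed by `pingouin.kruskal` can be illustrated with a small standard-library sketch for the two-group case used here (`resign` is binary, so `ddof1` is 1). The function name `kruskal_wallis_two_groups` and the sample values are hypothetical. The correction for ties is omitted, so this only matches a full implementation when all values are distinct.

```python
import math


def kruskal_wallis_two_groups(group_a, group_b):
    """Kruskal-Wallis H and p-value for two groups (tie correction omitted)."""
    pooled = sorted(group_a + group_b)
    n = len(pooled)

    # Assign each distinct value the average of the 1-based ranks it occupies
    rank_of = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j

    # H = 12 / (n (n + 1)) * sum(R_g^2 / n_g) - 3 (n + 1)
    weighted = 0.0
    for group in (group_a, group_b):
        rank_sum = sum(rank_of[value] for value in group)
        weighted += rank_sum ** 2 / len(group)
    h = 12 / (n * (n + 1)) * weighted - 3 * (n + 1)
    h = max(h, 0.0)  # guard against tiny negative values from rounding

    # Two groups give 1 degree of freedom, so the p-value is erfc(sqrt(H / 2))
    p_value = math.erfc(math.sqrt(h / 2))
    return h, p_value


# Hypothetical values of a numerical feature for employees who stayed and who resigned
stayed = [1, 2, 3, 4, 5]
resigned = [6, 7, 8, 9, 10]
h, p_value = kruskal_wallis_two_groups(stayed, resigned)
print(f"H={h:.4f}, p-value={p_value:.4f}")
```

Because the test works on ranks rather than raw values, it does not assume that the numerical feature is normally distributed.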

## Categorical tests: Chi2 independence tests between categorical features and target.

- H<sub>0</sub>: There is no statistically significant association between the categorical features and employee resignation

- H<sub>1</sub>: There is a statistically significant association between the categorical features and employee resignation

In [135]:
cat_cols = ['id_last_boss_employee', 'id_last_boss_boss', 'seniority_employee', 'work_modality_employee', 'gender_employee', 'recruitment_channel_employee',
            'marital_estatus_employee', 'join_year_employee', 'join_month_employee', 'performance_employee',
            'work_modality_boss', 'gender_boss', 'recruitment_channel_boss', 'marital_estatus_boss',
            'join_year_boss', 'join_month_boss', 'performance_boss', 'joined_after_boss', 'younger_than_boss']

In [136]:
influence_feat = []
no_influence_feat = []

for col in cat_cols:
    chi2_independence_test(train_df, 'resign', col, influence_feat, no_influence_feat)

--------------------------------------------------------------------------------
                 test    lambda        chi2    dof          pval    cramer  \
0             pearson  1.000000  253.683979  170.0  3.318702e-05  0.343341   
1        cressie-read  0.666667  258.145263  170.0  1.497057e-05  0.346347   
2      log-likelihood  0.000000  278.098851  170.0  3.227638e-07  0.359483   
3       freeman-tukey -0.500000         NaN  170.0           NaN       NaN   
4  mod-log-likelihood -1.000000         inf  170.0  0.000000e+00       inf   
5              neyman -2.000000         NaN  170.0           NaN       NaN   

   power  
0    1.0  
1    1.0  
2    1.0  
3    NaN  
4    NaN  
5    NaN  
Reject null hypothesis: There is a statistically significant difference between id_last_boss_employee and resign
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
                 tes
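
In the first output table above, the `freeman-tukey` and `neyman` rows are `NaN` and the `mod-log-likelihood` row reports an infinite statistic with a p-value of 0. Since `chi2_independence_test` rejects the null hypothesis when any row is below `alpha`, such a degenerate row would be enough on its own to trigger a rejection. The sketch below rebuilds that table from the printed values and compares the "any variant" rule with a decision based on the Pearson row only. Restricting the decision to the Pearson row is an assumption about what may be preferable, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Values copied from the first output table above (id_last_boss_employee vs resign)
stats = pd.DataFrame({
    "test": ["pearson", "cressie-read", "log-likelihood",
             "freeman-tukey", "mod-log-likelihood", "neyman"],
    "chi2": [253.683979, 258.145263, 278.098851, np.nan, np.inf, np.nan],
    "pval": [3.318702e-05, 1.497057e-05, 3.227638e-07, np.nan, 0.0, np.nan],
})

alpha = 0.01

# Rule used by chi2_independence_test: reject if any variant is significant
any_variant_significant = bool((stats["pval"] < alpha).any())

# Alternative rule (assumption): look at the Pearson row only
pearson_pval = float(stats.loc[stats["test"] == "pearson", "pval"].iloc[0])
pearson_significant = pearson_pval < alpha

# Variants whose statistic is NaN or infinite and therefore not informative
degenerate = stats.loc[~np.isfinite(stats["chi2"]), "test"].tolist()

print(f"Any variant significant: {any_variant_significant}")
print(f"Pearson significant:     {pearson_significant}")
print(f"Degenerate variants:     {degenerate}")
```

For this feature both rules agree, because the Pearson p-value is itself below 0.01.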

## Numerical tests: Kruskal-Wallis tests between numerical features and target.

- H<sub>0</sub>: There is no statistically significant difference in the numerical features between employees who resigned and those who did not

- H<sub>1</sub>: There is a statistically significant difference in the numerical features between employees who resigned and those who did not

In [137]:
num_cols = ['office_distance_employee', 'low_health_days_employee', 'average_permanence_employee', 'salary_employee',
            'performance_score_employee', 'psi_score_employee', 'join_age_employee', 'office_distance_boss',
            'low_health_days_boss', 'average_permanence_boss', 'salary_boss', 'performance_score_boss',
            'psi_score_boss', 'join_age_boss', 'office_distance_diff', 'low_health_days_diff', 'average_permanence_diff',
            'salary_diff', 'join_days_diff', 'age_diff', 'avg_od_epb', 'avg_lhd_epb', 'avg_avgp_epb', 'avg_sal_epb',
            'avg_ps_epb', 'avg_psis_epb', 'avg_ja_epb', 'avg_od_bpb', 'avg_lhd_bpb', 'avg_avgp_bpb', 'avg_sal_bpb',
            'avg_ps_bpb', 'avg_psis_bpb', 'avg_ja_bpb', 'boss_employees_in_charge', 'bob_bosses_in_charge']

In [138]:
for col in num_cols:
    kruskal_wallis_test(train_df, 'resign', col, influence_feat, no_influence_feat)

--------------------------------------------------------------------------------
         Source  ddof1         H     p-unc
Kruskal  resign      1  0.215454  0.642526
Failed to reject null hypothesis: There is no statistically significant difference in office_distance_employee between resign
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
         Source  ddof1          H         p-unc
Kruskal  resign      1  44.826429  2.152970e-11
Reject null hypothesis: There is a statistically significant difference in low_health_days_employee between resign.
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
         Source  ddof1         H    p-unc
Kruskal  resign      1  0.672763  0.41209
Failed to reject null hypothesis: There is no statistically significant difference in a

In [139]:
print(f'Features that influence the target: {influence_feat}')
print(f'Features considered to discard: {no_influence_feat}')

Features that influence the target: ['id_last_boss_employee', 'id_last_boss_boss', 'seniority_employee', 'work_modality_employee', 'gender_employee', 'recruitment_channel_employee', 'marital_estatus_employee', 'join_year_employee', 'performance_employee', 'join_year_boss', 'low_health_days_employee', 'performance_score_employee', 'performance_score_boss', 'low_health_days_diff', 'salary_diff', 'join_days_diff', 'avg_ps_epb', 'avg_ja_epb', 'avg_ps_bpb']
Features considered to discard: ['join_month_employee', 'work_modality_boss', 'gender_boss', 'recruitment_channel_boss', 'marital_estatus_boss', 'join_month_boss', 'performance_boss', 'joined_after_boss', 'younger_than_boss', 'office_distance_employee', 'average_permanence_employee', 'salary_employee', 'psi_score_employee', 'join_age_employee', 'office_distance_boss', 'low_health_days_boss', 'average_permanence_boss', 'salary_boss', 'psi_score_boss', 'join_age_boss', 'office_distance_diff', 'average_permanence_diff', 'age_diff', 'avg_od_
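
The two lists can then be used to build the feature sets for the "with and without" comparison mentioned at the start of this notebook. The following is a minimal sketch on a toy DataFrame: the column names come from the lists above, but the values, `toy_df`, and `toy_discard` are hypothetical and stand in for `train_df` and `no_influence_feat`.

```python
import pandas as pd

# Toy stand-in for train_df (values are made up for illustration)
toy_df = pd.DataFrame({
    "resign": [0, 1, 0, 1],
    "low_health_days_employee": [1, 7, 0, 9],
    "office_distance_employee": [3.2, 3.1, 8.4, 8.0],
    "gender_boss": ["F", "M", "M", "F"],
})

# Toy stand-in for no_influence_feat
toy_discard = ["office_distance_employee", "gender_boss"]

y = toy_df["resign"]
X_full = toy_df.drop(columns=["resign"])

# errors="ignore" skips names that are not present in the DataFrame
X_reduced = X_full.drop(columns=toy_discard, errors="ignore")

print(f"Full feature set:    {list(X_full.columns)}")
print(f"Reduced feature set: {list(X_reduced.columns)}")
```

Training the same model on `X_full` and on `X_reduced` then shows whether discarding the non-significant features changes performance.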