# Hypothesis testing

During the exploratory data analysis (EDA), we observed that some features appear to have no relationship with employee resignation. Here we test that observation formally, one feature at a time, so that the model can later be trained and evaluated both with and without the features that show no significant relationship with the target variable.

A significance level of 0.05 is used (the default `alpha` of the test functions defined below), which corresponds to a 95% confidence level.

## Preparing environment

In [1]:
import numpy as np
import pandas as pd
import pingouin
import sys
sys.path.append('../high_performance_employee_resign_prediction')
from utils import paths

In [2]:
import warnings
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=RuntimeWarning)

## Importing data

In [3]:
train_df = pd.read_csv(paths.data_interim_dir('train_clean.csv'))

## Generating hypothesis testing functions

### Chi2 independence test

In [4]:
def chi2_independence_test(data: pd.DataFrame, main_cat: str, sec_cat: str, sig_col: list, no_sig_col: list, alpha=0.05):
    """
    Perform a chi-squared test of independence between two categorical variables.

    Parameters:
    -----------
    data : pd.DataFrame
        The DataFrame containing the data.
    main_cat : str
        The name of the main categorical variable (column) in the DataFrame.
    sec_cat : str
        The name of the secondary categorical variable (column) in the DataFrame.
    sig_col : list
        A list to append the name of the secondary categorical variable if the test is significant.
    no_sig_col : list
        A list to append the name of the secondary categorical variable if the test is not significant.
    alpha : float, optional (default=0.05)
        The significance level for the test.

    Returns:
    --------
    None

    Prints:
    -------
    - The chi-squared test statistics.
    - A message indicating whether the null hypothesis is rejected or not.
    - Appends the secondary categorical variable name to `sig_col` if the null hypothesis is rejected.
    - Appends the secondary categorical variable name to `no_sig_col` if the null hypothesis is not rejected.

    Notes:
    ------
    - The function uses the `pingouin` library to perform the chi-squared test of independence.
    - The null hypothesis is that the two categorical variables are independent.
    - If the p-value is less than the significance level (`alpha`), the null hypothesis is rejected, indicating a statistically significant difference between the categories.
    """
    
    expected, observed, stats = pingouin.chi2_independence(data=data, x=main_cat, y=sec_cat)
    
    stats = pd.DataFrame(stats)
    
    print('-'*80)
    print(stats)
        
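    # pingouin reports several test variants (Pearson, log-likelihood, ...).
    # NaN p-values compare as False, and the null hypothesis is rejected
    # if ANY of the reported variants has a p-value below alpha.
    stat_sign = stats['pval'] < alpha
    
    if stat_sign.any():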
        print(f'Reject null hypothesis: There is a statistically significant difference between {sec_cat} and {main_cat}')
        sig_col.append(sec_cat)
    else:
        print(f'Failed to reject null hypothesis: There is no statistically significant difference between {sec_cat} and {main_cat}')
        no_sig_col.append(sec_cat)
    print('-'*80)
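
To make the statistic behind the function above more concrete, the following is a minimal sketch of the Pearson chi-squared computation on a small contingency table, using only the standard library. The function name `pearson_chi2` and the toy counts are illustrative assumptions, not part of the project. The sketch applies no continuity correction, so its result may differ from what `pingouin` reports for 2x2 tables, where a correction is applied by default.

```python
import math


def pearson_chi2(observed):
    """Return (chi2, dof, p-value, Cramer's V) for a table of counts.

    Assumes every row and every column contains at least one observation.
    The p-value uses a closed form that is only valid for dof == 1;
    for larger tables it is returned as None.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (obs - expected) ** 2 / expected

    n_rows, n_cols = len(observed), len(observed[0])
    dof = (n_rows - 1) * (n_cols - 1)
    # Survival function of the chi-squared distribution with 1 degree of freedom
    pval = math.erfc(math.sqrt(chi2 / 2)) if dof == 1 else None
    cramer_v = math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))
    return chi2, dof, pval, cramer_v


# Hypothetical counts: rows are two categories of a feature,
# columns are "resigned" and "stayed".
toy_table = [[30, 10], [20, 40]]
chi2, dof, pval, cramer_v = pearson_chi2(toy_table)
print(f"chi2={chi2:.4f}, dof={dof}, pval={pval:.2e}, cramer={cramer_v:.4f}")
```

A p-value below `alpha` leads to rejecting independence, and Cramér's V gives a rough sense of how strong the association is.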

### Kruskal-Wallis test

In [5]:
def kruskal_wallis_test(data: pd.DataFrame, cat_feat: str, num_feat: str, sig_col: list, no_sig_col: list, alpha=0.05):
    """
    Perform a Kruskal-Wallis test to determine if there are statistically significant differences in a numerical feature across categories of a categorical feature.

    Parameters:
    -----------
    data : pd.DataFrame
        The DataFrame containing the data.
    cat_feat : str
        The name of the categorical feature (column) in the DataFrame.
    num_feat : str
        The name of the numerical feature (column) in the DataFrame.
    sig_col : list
        A list to append the name of the numerical feature if the test is significant.
    no_sig_col : list
        A list to append the name of the numerical feature if the test is not significant.
    alpha : float, optional (default=0.05)
        The significance level for the test.

    Returns:
    --------
    None

    Prints:
    -------
    - The Kruskal-Wallis test results.
    - A message indicating whether the null hypothesis is rejected or not.
    - Appends the numerical feature name to `sig_col` if the null hypothesis is rejected.
    - Appends the numerical feature name to `no_sig_col` if the null hypothesis is not rejected.

    Notes:
    ------
    - The function uses the `pingouin` library to perform the Kruskal-Wallis test.
    - The null hypothesis is that the distributions of the numerical feature are the same across the categories of the categorical feature.
    - If the p-value is less than the significance level (`alpha`), the null hypothesis is rejected, indicating a statistically significant difference in the numerical feature between the categories of the categorical feature.
    """
    
    results = pingouin.kruskal(data=data, dv=num_feat, between=cat_feat)
    
    print('-'*80)
    print(results)
    
    if results['p-unc'].iloc[0] < alpha:
        print(f'Reject null hypothesis: There is a statistically significant difference in {num_feat} between {cat_feat}.')
        sig_col.append(num_feat)
    else:
        print(f'Failed to reject null hypothesis: There is no statistically significant difference in {num_feat} between {cat_feat}')
        no_sig_col.append(num_feat)
    print('-'*80)
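
As an illustration of what the Kruskal-Wallis test measures, here is a minimal standard-library sketch of the tie-corrected H statistic. The function name `kruskal_wallis_h` and the toy values are assumptions made for this example. The p-value shortcut at the end is valid only when two groups are compared (one degree of freedom), which matches a binary target.

```python
import math
from collections import Counter


def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H statistic for a list of numeric groups.

    Assumes the pooled values are not all identical.
    """
    pooled = sorted(value for group in groups for value in group)
    n = len(pooled)

    # Assign every distinct value the average of the ranks it occupies
    ranks = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j

    rank_term = sum(sum(ranks[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12 / (n * (n + 1)) * rank_term - 3 * (n + 1)

    # Correction factor for tied values
    tie_sum = sum(t ** 3 - t for t in Counter(pooled).values())
    return h / (1 - tie_sum / (n ** 3 - n))


# Hypothetical values of one numerical feature, split by the target
resigned = [1, 2, 3, 4]
stayed = [5, 6, 7, 8]
h_stat = kruskal_wallis_h([resigned, stayed])
# With two groups, H is compared to a chi-squared distribution with 1 dof
p_value = math.erfc(math.sqrt(h_stat / 2))
print(f"H={h_stat:.4f}, p={p_value:.4f}")
```

Because the statistic depends only on ranks, it makes no normality assumption about the feature, which is the usual reason to prefer it over a one-way ANOVA.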

## Categorical tests: Chi2 independence tests between categorical features and target.

- H<sub>0</sub>: The categorical feature and employee resignation are independent (no statistically significant association).

- H<sub>1</sub>: The categorical feature and employee resignation are not independent (there is a statistically significant association).

In [6]:
cat_cols = ['id_last_boss', 'seniority', 'work_modality', 'gender', 'recruitment_channel',
            'marital_estatus', 'join_year', 'join_month', 'performance', 'join_age_group',
            'work_modality_boss', 'gender_boss', 'recruitment_channel_boss', 'marital_estatus_boss',
            'join_year_boss', 'join_month_boss', 'performance_boss', 'join_age_group_boss', 'joined_after_boss', 'younger_than_boss']

In [7]:
influence_feat = []
no_influence_feat = []

for col in cat_cols:
    chi2_independence_test(train_df, 'resign', col, influence_feat, no_influence_feat)

--------------------------------------------------------------------------------
                 test    lambda        chi2    dof          pval    cramer  \
0             pearson  1.000000  263.805759  172.0  8.342189e-06  0.350123   
1        cressie-read  0.666667  268.750146  172.0  3.312762e-06  0.353389   
2      log-likelihood  0.000000  291.571126  172.0  3.332500e-08  0.368088   
3       freeman-tukey -0.500000         NaN  172.0           NaN       NaN   
4  mod-log-likelihood -1.000000         inf  172.0  0.000000e+00       inf   
5              neyman -2.000000         NaN  172.0           NaN       NaN   

   power  
0    1.0  
1    1.0  
2    1.0  
3    NaN  
4    NaN  
5    NaN  
Reject null hypothesis: There is a statistically significant difference between id_last_boss and resign
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
                 test    lamb
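
In the table printed above, the `mod-log-likelihood` row has an infinite statistic and a p-value of 0, presumably because some cells of the contingency table are empty. Since `chi2_independence_test` rejects the null hypothesis when any row is significant, such a row can trigger a rejection on its own. The sketch below contrasts that rule with one that looks only at the Pearson row. The rows and p-values are invented for illustration and do not come from the data.

```python
import math

# Hypothetical rows mimicking the layout of the stats table;
# these p-values are made up for the example.
stats_rows = [
    {"test": "pearson", "pval": 0.20},
    {"test": "cressie-read", "pval": 0.19},
    {"test": "log-likelihood", "pval": 0.15},
    {"test": "freeman-tukey", "pval": math.nan},
    {"test": "mod-log-likelihood", "pval": 0.0},
    {"test": "neyman", "pval": math.nan},
]
alpha = 0.05

# Rule used in the notebook: reject if any variant is significant.
# Comparisons with NaN evaluate to False, so NaN rows never reject.
any_rule = any(row["pval"] < alpha for row in stats_rows)

# Alternative rule: base the decision on the Pearson row only.
pearson_pval = next(row["pval"] for row in stats_rows if row["test"] == "pearson")
pearson_rule = pearson_pval < alpha

print(f"any-variant rule rejects H0: {any_rule}")
print(f"Pearson-only rule rejects H0: {pearson_rule}")
```

Whether the stricter rule is preferable depends on the analysis. It is shown here only as one possible way to keep degenerate rows from driving the decision.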

## Numerical tests: Kruskal-Wallis tests between numerical features and target.

- H<sub>0</sub>: The distribution of the numerical feature is the same for employees who resigned and those who did not.

- H<sub>1</sub>: The distribution of the numerical feature differs between employees who resigned and those who did not.

In [8]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2152 entries, 0 to 2151
Data columns (total 47 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   id_employee               2152 non-null   int64  
 1   id_last_boss              2152 non-null   int64  
 2   seniority                 2152 non-null   int64  
 3   work_modality             2152 non-null   object 
 4   office_distance           2152 non-null   float64
 5   low_health_days           2152 non-null   int64  
 6   gender                    2152 non-null   object 
 7   recruitment_channel       2152 non-null   object 
 8   average_permanence        2152 non-null   int64  
 9   salary                    2152 non-null   int64  
 10  performance_score         2152 non-null   int64  
 11  psi_score                 2152 non-null   int64  
 12  marital_estatus           2152 non-null   object 
 13  join_age                  2152 non-null   int64  
 14  join_yea

In [9]:
num_cols = ['office_distance', 'low_health_days', 'average_permanence', 'salary', 'performance_score', 
            'psi_score', 'join_age', 'office_distance_boss', 'low_health_days_boss', 'average_permanence_boss', 
            'salary_boss', 'performance_score_boss', 'psi_score_boss', 'join_age_boss', 'salary_diff',
            'join_days_diff', 'age_diff', 'avg_od_epb', 'avg_lhd_epb', 'avg_avgp_epb', 
            'avg_sal_epb', 'avg_ps_epb', 'avg_psis_epb', 'avg_ja_epb', 'boss_employees_in_charge']

In [10]:
for col in num_cols:
    kruskal_wallis_test(train_df, 'resign', col, influence_feat, no_influence_feat)

--------------------------------------------------------------------------------
         Source  ddof1         H     p-unc
Kruskal  resign      1  0.215454  0.642526
Failed to reject null hypothesis: There is no statistically significant difference in office_distance between resign
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
         Source  ddof1          H         p-unc
Kruskal  resign      1  44.826429  2.152970e-11
Reject null hypothesis: There is a statistically significant difference in low_health_days between resign.
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
         Source  ddof1         H    p-unc
Kruskal  resign      1  0.672763  0.41209
Failed to reject null hypothesis: There is no statistically significant difference in average_permanence 

In [11]:
print(f'Features that influence the target: {influence_feat}')
print(f'Features considered to discard: {no_influence_feat}')

Features that influence the target: ['id_last_boss', 'seniority', 'work_modality', 'gender', 'recruitment_channel', 'marital_estatus', 'join_year', 'performance', 'join_age_group', 'performance_boss', 'joined_after_boss', 'low_health_days', 'performance_score', 'low_health_days_boss', 'performance_score_boss', 'salary_diff', 'join_days_diff', 'avg_avgp_epb', 'avg_ps_epb', 'avg_ja_epb']
Features considered to discard: ['join_month', 'work_modality_boss', 'gender_boss', 'recruitment_channel_boss', 'marital_estatus_boss', 'join_year_boss', 'join_month_boss', 'join_age_group_boss', 'younger_than_boss', 'office_distance', 'average_permanence', 'salary', 'psi_score', 'join_age', 'office_distance_boss', 'average_permanence_boss', 'salary_boss', 'psi_score_boss', 'join_age_boss', 'age_diff', 'avg_od_epb', 'avg_lhd_epb', 'avg_sal_epb', 'avg_psis_epb', 'boss_employees_in_charge']
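
To compare model performance with and without the discarded features, the two lists above can be turned into two feature sets. The following is a minimal sketch. The helper name `split_feature_sets` is hypothetical, and the short lists use only a few of the feature names reported above.

```python
def split_feature_sets(all_features, discard):
    """Return (full, reduced) feature lists, preserving the original order.

    Raises ValueError if a name in `discard` is not a known feature,
    which helps catch typos in column names.
    """
    discard_set = set(discard)
    unknown = discard_set - set(all_features)
    if unknown:
        raise ValueError(f"Unknown features in discard list: {sorted(unknown)}")
    reduced = [name for name in all_features if name not in discard_set]
    return list(all_features), reduced


# A small subset of the feature names reported above
all_features = ["seniority", "join_month", "salary", "low_health_days"]
no_influence_subset = ["join_month", "salary"]

full_set, reduced_set = split_feature_sets(all_features, no_influence_subset)
print(f"Full feature set:    {full_set}")
print(f"Reduced feature set: {reduced_set}")
```

With the real DataFrame, the reduced set could be obtained with `train_df.drop(columns=no_influence_feat)`, and each set would then be passed to the same training routine so that the results are comparable.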
