## Hypothesis Testing Phase:
1. People tend to be absent for no-disease reasons more than for disease reasons. (reject the null hypothesis)
2. Being a social smoker does not affect whether an absence is for a disease or a no-disease reason. (fail to reject the null hypothesis)
3. Being a social drinker does not correlate positively with being absent for no-disease reasons. (fail to reject the null hypothesis)
4. High school appears more often than any other education level in the dataset. (reject the null hypothesis)
5. Mid-career professionals are the top career level for absence. (fail to reject the null hypothesis)
6. Drinkers are absent for longer durations. (reject the null hypothesis)
7. Smoking does not affect the duration of absence. (fail to reject the null hypothesis)
8. Weight correlates positively with duration. (fail to reject the null hypothesis)
9. Older employees are more often absent for disease reasons. (fail to reject the null hypothesis)
10. Highly educated employees tend to have short durations.
11. There is no correlation between Distance from Residence to Work and Absenteeism in hours. (reject the null hypothesis)
12. There is a slightly positive correlation between Transportation expense and Absenteeism in hours. (fail to reject the null hypothesis)
13. There is a slightly negative correlation between Service time and Absenteeism in hours. (fail to reject the null hypothesis)
14. The fewest absences occur on Thursday, and the same holds for duration.
15. July has the longest duration. (fail to reject the null hypothesis)
16. Having no children seems to correlate negatively with both duration and absence itself. (fail to reject the null hypothesis)
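
The decisions in the list above follow from comparing each p-value with a significance level. The notebook does not state the level, so the sketch below assumes the conventional 0.05; `decide` is a hypothetical helper, not part of the original analysis.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Map a p-value to the decision wording used in the summary list.

    alpha = 0.05 is an assumption; the notebook does not state its level.
    """
    if p_value < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"


# p-values copied from two of the test outputs in this notebook
print(decide(0.038))  # drinkers vs. non-drinkers duration
print(decide(0.668))  # weight vs. duration
```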

In [119]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import warnings
warnings.filterwarnings('ignore')

In [120]:
df = pd.read_csv('preprocessed_data.csv')
df.head()

| Unnamed: 0 | ID | Reason for absence | Month of absence | Day of the week | Seasons | Transportation expense | Distance from Residence to Work | Service time | Age | Work load Average/day | ... | Social drinker | Social smoker | Pet | Weight | Height | Body mass index | Absenteeism time in hours | Disease | BMI category | Career level |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11 | 26 | July | Tuesday | Spring | 289 | 36 | 13 | 33 | 239.554 | ... | Yes | No | 1 | 90 | 172 | 30 | 4 | No | Obese | Mid young adult |
| 1 | 36 | 0 | July | Tuesday | Spring | 118 | 13 | 18 | 50 | 239.554 | ... | Yes | No | 0 | 98 | 178 | 31 | 0 | No | Obese | Late career professional |
| 2 | 3 | 23 | July | Wednesday | Spring | 179 | 51 | 18 | 38 | 239.554 | ... | Yes | No | 0 | 89 | 170 | 31 | 2 | No | Obese | Mid career professional |
| 3 | 7 | 7 | July | Thursday | Spring | 279 | 5 | 14 | 39 | 239.554 | ... | Yes | Yes | 0 | 68 | 168 | 24 | 4 | Yes | Normal Weight | Mid career professional |
| 4 | 11 | 23 | July | Thursday | Spring | 289 | 36 | 13 | 33 | 239.554 | ... | Yes | No | 1 | 90 | 172 | 30 | 2 | No | Obese | Mid young adult |


#### People tend to be absent for no-disease reasons more than for disease reasons.
- H_0 : disease >= no-disease
- H_A : disease < no-disease

In [121]:
# count the number of successes and total trials in each group
count = df['Disease'].value_counts().values
nobs = df['Disease'].value_counts().sum()

# calculate the sample proportions
prop1 = count[0] / nobs
prop2 = count[1] / nobs

# perform two-sample t-test assuming equal variances
t_stat, p_value = stats.ttest_ind_from_stats(prop1,
                                             np.sqrt(prop1*(1-prop1)),
                                             nobs,
                                             prop2,
                                             np.sqrt(prop2*(1-prop2)),
                                             nobs,
                                             equal_var=True,
                                             alternative='greater')

print(f'statistic is : {t_stat:.003f}, p-value: {p_value:.003f}')

statistic is : 11.741, p-value: 0.000
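
The two proportions compared above come from the same sample and sum to one, so an alternative framing is a one-sample test of whether the no-disease share exceeds one half. The sketch below assumes `scipy.stats.binomtest` is available (SciPy 1.7 or later) and uses made-up counts, not the dataset's.

```python
from scipy import stats

# Hypothetical counts for illustration only (not taken from the dataset)
n_no_disease = 300
n_disease = 100
n_total = n_no_disease + n_disease

# H_0: P(no-disease) <= 0.5    H_A: P(no-disease) > 0.5
result = stats.binomtest(n_no_disease, n_total, p=0.5, alternative='greater')

print(f'no-disease share: {n_no_disease / n_total:.3f}, p-value: {result.pvalue:.3g}')
```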


#### Being a social smoker doesn't affect whether an absence is for a disease or a no-disease reason.
- H_0 : smoker in disease = smoker in no-disease
- H_A : smoker in disease != smoker in no-disease

In [122]:
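# split the records into disease and no-disease absences
mask = df['Disease'] == 'Yes'

disease = df[mask]
no_disease = df[~mask]

# value_counts() sorts by frequency, so position 1 is the less frequent label
count1 = disease['Social smoker'].value_counts()
count2 = no_disease['Social smoker'].value_counts()

nobs = count1.iloc[1] + count2.iloc[1]

# calculate the sample proportions
# NOTE: prop2 is computed from count1, not count2, so prop1 == prop2 and the
# test below returns t = 0, p = 1 by construction; comparing the two groups
# would need count2.iloc[1] here
prop1 = count1.iloc[1] / nobs
prop2 = count1.iloc[1] / nobs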

# perform two-sample t-test assuming equal variances
t_stat, p_value = stats.ttest_ind_from_stats(prop1,
                                             np.sqrt(prop1*(1-prop1)),
                                             nobs,
                                             prop2,
                                             np.sqrt(prop2*(1-prop2)),
                                             nobs,
                                             equal_var=True)
print(f'statistic is : {t_stat:.003f}, p-value: {p_value:.003f}')

statistic is : 0.000, p-value: 1.000
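
One common way to check whether two categorical columns such as `Disease` and `Social smoker` are related is a chi-square test of independence on their contingency table. The sketch below is an alternative to the cell above, not the notebook's method, and it runs on a small made-up frame whose column names mirror the notebook's DataFrame.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data; only the column names mirror the notebook
toy = pd.DataFrame({
    'Disease':       ['Yes'] * 40 + ['No'] * 60,
    'Social smoker': ['Yes'] * 8 + ['No'] * 32 + ['Yes'] * 6 + ['No'] * 54,
})

# 2x2 contingency table: rows = Disease, columns = Social smoker
table = pd.crosstab(toy['Disease'], toy['Social smoker'])

# H_0: the two columns are independent
chi2, p_value, dof, expected = stats.chi2_contingency(table)

print(table)
print(f'chi2: {chi2:.3f}, p-value: {p_value:.3f}, dof: {dof}')
```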


#### Being a social drinker correlates positively with being absent for no-disease reasons.
- H_0 : drinker in disease = drinker in no-disease
- H_A : drinker in disease != drinker in no-disease

In [123]:
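# split the records into disease and no-disease absences
mask = df['Disease'] == 'Yes'

disease = df[mask]
no_disease = df[~mask]

# value_counts() sorts by frequency, so position 0 is the most frequent label
count1 = disease['Social drinker'].value_counts()
count2 = no_disease['Social drinker'].value_counts()

nobs = count1.iloc[0] + count2.iloc[0]

# calculate the sample proportions
# NOTE: prop2 is computed from count1, not count2, so prop1 == prop2 and the
# test below returns t = 0, p = 1 by construction; comparing the two groups
# would need count2.iloc[0] here
prop1 = count1.iloc[0] / nobs
prop2 = count1.iloc[0] / nobs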

# perform two-sample t-test assuming equal variances
t_stat, p_value = stats.ttest_ind_from_stats(prop1,
                                             np.sqrt(prop1*(1-prop1)),
                                             nobs,
                                             prop2,
                                             np.sqrt(prop2*(1-prop2)),
                                             nobs)

print(f'statistic is : {t_stat:.003f}, p-value: {p_value:.003f}')

statistic is : 0.000, p-value: 1.000
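
A direct way to compare the share of drinkers in two groups is a pooled two-proportion z-test, where the sign of `z` also shows the direction of the difference. This is a sketch of that standard test, not the notebook's method; `two_proportion_ztest` is a hypothetical helper and the counts are made up.

```python
import numpy as np
from scipy import stats


def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided pooled z-test for H_0: p1 == p2.

    x1, x2 are the counts of 'successes' and n1, n2 the group sizes.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = np.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * stats.norm.sf(abs(z))
    return z, p_value


# Hypothetical counts: drinkers among disease (30 of 50)
# and no-disease (45 of 100) absences
z, p_value = two_proportion_ztest(x1=30, n1=50, x2=45, n2=100)
print(f'z: {z:.3f}, p-value: {p_value:.3f}')
```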


#### High school grade appears more than any other grade in the dataset.
- H_0 : high school <= other grades
- H_A : high school > other grades

In [124]:
# count the number of successes and total trials in each group
count = df['Education'].value_counts().values
nobs = df['Education'].value_counts().sum()

# calculate the sample proportions
prop1 = count[0] / nobs
prop2 = 1 - prop1

# perform two-sample t-test assuming equal variances
t_stat, p_value = stats.ttest_ind_from_stats(prop1,
                                             np.sqrt(prop1*(1-prop1)),
                                             nobs,
                                             prop2,
                                             np.sqrt(prop2*(1-prop2)),
                                             nobs,
                                             equal_var=True,
                                             alternative='greater')

print(f'statistic is : {t_stat:.003f}, p-value: {p_value:.003f}')

statistic is : 33.024, p-value: 0.000


#### Mid-Career professionals are the top career level for absence.
- H_0 : Mid-Career professionals <= other career levels
- H_A : Mid-Career professionals > other career levels

In [125]:
# count the number of successes and total trials in each group
count = df['Career level'].value_counts().values
nobs = df['Career level'].value_counts().sum()

# calculate the sample proportions
prop1 = count[0] / nobs
prop2 = 1 - prop1

# perform two-sample t-test assuming equal variances
t_stat, p_value = stats.ttest_ind_from_stats(prop1,
                                             np.sqrt(prop1*(1-prop1)),
                                             nobs,
                                             prop2,
                                             np.sqrt(prop2*(1-prop2)),
                                             nobs,
                                             equal_var=True,
                                             alternative='greater')

print(f'statistic is : {t_stat:.003f}, p-value: {p_value:.003f}')

statistic is : -0.520, p-value: 0.698


#### Drinkers are absent for longer durations
- H_0 : non-Drinkers duration >= Drinkers duration
- H_A : non-Drinkers duration < Drinkers duration

In [126]:
mask = df['Social drinker'] == 'Yes'

drinkers_duration = df[mask]['Absenteeism time in hours']
non_drinkers_duration = df[~mask]['Absenteeism time in hours']

t_stat, p_value = stats.ttest_ind(non_drinkers_duration,
                                  drinkers_duration,
                                  alternative='less')

print(f'statistic is : {t_stat:.003f}, p-value: {p_value:.003f}')

statistic is : -1.771, p-value: 0.038
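
Absence durations are often right-skewed, so a rank-based test can serve as a robustness check next to the t-test above. The sketch below uses `scipy.stats.mannwhitneyu` on synthetic durations; the exponential data and the group sizes are assumptions for illustration, not the notebook's data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical right-skewed durations in hours (not the notebook's data)
non_drinkers = rng.exponential(scale=4.0, size=200)
drinkers = rng.exponential(scale=10.0, size=200)

# H_A: non-drinkers' durations tend to be shorter than drinkers'
u_stat, p_value = stats.mannwhitneyu(non_drinkers, drinkers, alternative='less')

print(f'U statistic: {u_stat:.1f}, p-value: {p_value:.3g}')
```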


#### Smoking doesn't affect durations
- H_0 : non-smokers duration >= smokers duration
- H_A : non-smokers duration < smokers duration

In [127]:
mask = df['Social smoker'] == 'Yes'

smokers_duration = df[mask]['Absenteeism time in hours']
non_smokers_duration = df[~mask]['Absenteeism time in hours']

t_stat, p_value = stats.ttest_ind(non_smokers_duration,
                                  smokers_duration,
                                  alternative='less')

print(f'statistic is : {t_stat:.003f}, p-value: {p_value:.003f}')

statistic is : 0.243, p-value: 0.596


#### Weight correlates positively with Duration
- H_0 : there is no correlation
- H_A : there is a correlation

In [128]:
# extract the two columns of interest
col1 = df['Weight']
col2 = df['Absenteeism time in hours']

# calculate the Pearson correlation coefficient and p-value
corr_coef, p_value = stats.pearsonr(col1, col2)

print(f'statistic is : {corr_coef:.003f}, p-value: {p_value:.003f}')

statistic is : 0.016, p-value: 0.668
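
Pearson's coefficient only measures linear association, so a rank correlation can be a useful cross-check when one variable is skewed. The sketch below compares `stats.pearsonr` with `stats.spearmanr` on synthetic data; the distributions are assumptions and the variables are generated independently of each other.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical data: weight in kg and a skewed duration unrelated to it
weight = rng.normal(loc=80, scale=12, size=300)
duration = rng.exponential(scale=6.0, size=300)

pearson_r, pearson_p = stats.pearsonr(weight, duration)
spearman_rho, spearman_p = stats.spearmanr(weight, duration)

print(f'Pearson  r:   {pearson_r:.3f}, p-value: {pearson_p:.3f}')
print(f'Spearman rho: {spearman_rho:.3f}, p-value: {spearman_p:.3f}')
```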


#### Older employees are more often absent for disease reasons.
- H_0: ages are the same for disease and no-disease absences
- H_1: ages are not the same for disease and no-disease absences

In [129]:
"""
get Age entries for employees with Disease == Yes and Disease == No
"""
disease_mask = df["Disease"] == "Yes"
disease_ages = df["Age"][disease_mask]
no_disease_ages = df["Age"][~disease_mask]

# perform hypothesis test for equality of means
test_res = stats.ttest_ind(disease_ages, no_disease_ages)
print(f"Test for equality of means: statistic={test_res[0]:0.3f}, pvalue={test_res[1]:0.3f}")

# test equality of distributions via Kolmogorov-Smirnov test
ks_res = stats.ks_2samp(disease_ages, no_disease_ages)
print(f"KS test for equality of distributions: statistic={ks_res[0]:0.3f}, pvalue={ks_res[1]:0.3f}")

Test for equality of means: statistic=0.630, pvalue=0.529
KS test for equality of distributions: statistic=0.057, pvalue=0.619


#### There is no correlation between Distance from Residence to Work and Absenteeism in hours.
- H_0: there is no correlation
- H_1: there is a correlation

In [130]:
# extract the two columns of interest
col1 = df['Distance from Residence to Work']
col2 = df['Absenteeism time in hours']

# calculate the Pearson correlation coefficient and p-value
corr_coef, p_value = stats.pearsonr(col1, col2)

print(f'statistic is : {corr_coef:.003f}, p-value: {p_value:.003f}')

statistic is : -0.088, p-value: 0.016
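
To put the coefficient above in perspective, squaring it gives the share of variance in absence hours that a linear relationship with distance would account for. A small sketch using the reported value:

```python
# Pearson coefficient reported in the output above
r = -0.088

# coefficient of determination for a simple linear relationship
r_squared = r ** 2

print(f'r^2 = {r_squared:.4f}')
```

So although the p-value is below 0.05, the correlation corresponds to under 1% of the variance.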


#### There is a slightly positive correlation between Transportation expense and Absenteeism in hours.
- H_0: there is no correlation
- H_1: there is a correlation

In [131]:
# extract the two columns of interest
col1 = df['Transportation expense']
col2 = df['Absenteeism time in hours']

# calculate the Pearson correlation coefficient and p-value
corr_coef, p_value = stats.pearsonr(col1, col2)

print(f'statistic is : {corr_coef:.003f}, p-value: {p_value:.003f}')

statistic is : 0.028, p-value: 0.454


#### There is a slightly negative correlation between Service time and Absenteeism in hours.
- H_0: there is no correlation
- H_1: there is a correlation

In [132]:
# extract the two columns of interest
col1 = df['Service time']
col2 = df['Absenteeism time in hours']

# calculate the Pearson correlation coefficient and p-value
corr_coef, p_value = stats.pearsonr(col1, col2)

print(f'statistic is : {corr_coef:.003f}, p-value: {p_value:.003f}')

statistic is : 0.019, p-value: 0.605


#### July has the longest duration
- H_0: other_months durations >= july duration
- H_1: other_months durations < july duration

In [133]:
mask = df['Month of absence'] == 'July'

july_duration = df[mask]['Absenteeism time in hours']
other_months_duration = df[~mask]['Absenteeism time in hours']

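# NOTE: alternative='greater' tests mean(other months) > mean(July), the
# opposite direction of H_1 above; alternative='less' would match H_1
t_stat, p_value = stats.ttest_ind(other_months_duration,
                                  july_duration,
                                  alternative='greater')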

print(f'statistic is : {t_stat:.003f}, p-value: {p_value:.003f}')

statistic is : -2.605, p-value: 0.995
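
In `stats.ttest_ind`, the `alternative` argument describes the first sample relative to the second, so swapping `'less'` and `'greater'` flips the p-value around one half. The sketch below illustrates this on synthetic data; the group means and sizes are assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical durations: group b has the larger mean
a = rng.normal(loc=5.0, scale=2.0, size=100)
b = rng.normal(loc=7.0, scale=2.0, size=100)

# 'less' tests mean(a) < mean(b); 'greater' tests mean(a) > mean(b)
t_less, p_less = stats.ttest_ind(a, b, alternative='less')
t_greater, p_greater = stats.ttest_ind(a, b, alternative='greater')

print(f't: {t_less:.3f}, p (less): {p_less:.3g}, p (greater): {p_greater:.3g}')
```

The statistic is the same in both calls, and the two one-sided p-values sum to one, so a negative statistic paired with a p-value near one usually means the opposite direction was tested.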


#### Having no children seems to correlate negatively with both duration and absence itself.
- H_0: children durations >= no_children duration
- H_1: children durations < no_children duration

In [134]:
mask = df['Son'] == 0

no_children = df[mask]['Absenteeism time in hours']
children = df[~mask]['Absenteeism time in hours']

t_stat, p_value = stats.ttest_ind(children,
                                  no_children,
                                  alternative='less')

print(f'statistic is : {t_stat:.003f}, p-value: {p_value:.003f}')

statistic is : 2.588, p-value: 0.995
