# Sick Day Fraud Analysis

## Research Question
Is there evidence in the sick leave data that employees are faking sick days?

## Approach
We examine multiple indicators of potential fraud:
1. **Day-of-week patterns** — Are Mondays/Fridays disproportionately used (creating long weekends)?
2. **Proximity to weekends** — Do adjacent-to-weekend days show excess usage vs. mid-week?
3. **Monthly/seasonal patterns** — Does seasonality match genuine illness (winter peaks) or leisure (summer peaks)?
4. **Holiday adjacency** — Are sick days elevated on days adjacent to public holidays?
5. **Year-over-year trends** — Is the pattern consistent or changing?
6. **Statistical testing** — Chi-squared, effect sizes, and confidence intervals

In [1]:
import pandas as pd
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
from scipy import stats
from datetime import timedelta
import warnings
warnings.filterwarnings('ignore')

plt.rcParams['figure.figsize'] = (10, 5)
plt.rcParams['figure.dpi'] = 100

In [2]:
# Load data
df = pd.read_csv('data_sick_days.csv')
df['date'] = pd.to_datetime(df['sickLeaveTaken'], format='mixed')
df = df.dropna(subset=['date'])

# Derived columns
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['month_name'] = df['date'].dt.month_name()
df['dow'] = df['date'].dt.dayofweek          # 0=Mon, 4=Fri
df['dow_name'] = df['date'].dt.day_name()
df['week'] = df['date'].dt.isocalendar().week.astype(int)
df['day_of_year'] = df['date'].dt.dayofyear

print(f'Total records: {len(df):,}')
print(f'Date range: {df["date"].min().date()} to {df["date"].max().date()}')
print(f'\nDay-of-week counts:')
print(df['dow_name'].value_counts().to_string())

Total records: 21,264
Date range: 2011-12-16 to 2015-08-21

Day-of-week counts:
dow_name
Monday       5118
Friday       4240
Tuesday      4221
Wednesday    3908
Thursday     3697
Saturday       63
Sunday         17


---
## 1. Day-of-Week Analysis

In [3]:
# Filter to weekdays only (Mon-Fri)
weekdays = df[df['dow'] <= 4].copy()
n_weekday = len(weekdays)
print(f'Weekday records: {n_weekday:,}  (dropped {len(df) - n_weekday} weekend records)')

# Observed vs expected
dow_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
observed = weekdays['dow_name'].value_counts().reindex(dow_order)
expected_per_day = n_weekday / 5

print(f'\nExpected per day (uniform): {expected_per_day:.0f}')
print('\nObserved vs Expected:')
for day in dow_order:
    obs = observed[day]
    pct_diff = (obs - expected_per_day) / expected_per_day * 100
    print(f'  {day:12s}  {obs:5d}  ({pct_diff:+.1f}% vs expected)')

Weekday records: 21,184  (dropped 80 weekend records)

Expected per day (uniform): 4237

Observed vs Expected:
  Monday         5118  (+20.8% vs expected)
  Tuesday        4221  (-0.4% vs expected)
  Wednesday      3908  (-7.8% vs expected)
  Thursday       3697  (-12.7% vs expected)
  Friday         4240  (+0.1% vs expected)


In [4]:
# Bar chart: Sick days by day of week
colors = ['#c0392b' if d in ['Monday', 'Friday'] else '#2c3e50' for d in dow_order]

fig, ax = plt.subplots(figsize=(8, 5))
bars = ax.bar(dow_order, observed.values, color=colors, edgecolor='white', linewidth=0.5)
ax.axhline(y=expected_per_day, color='gray', linestyle='--', linewidth=1, label=f'Expected if uniform ({expected_per_day:.0f})')

for bar, val in zip(bars, observed.values):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 40, f'{val:,}',
            ha='center', va='bottom', fontsize=10, fontweight='bold')

ax.set_ylabel('Sick Day Count')
ax.set_title('Sick Days by Day of Week (Mon & Fri highlighted in red)')
ax.legend()
ax.set_ylim(0, max(observed.values) * 1.12)
plt.tight_layout()
plt.savefig('fig_dow_distribution.png', bbox_inches='tight')
plt.show()

In [5]:
# Chi-squared test: day of week
chi2_result = stats.chisquare(observed.values)
print('Chi-squared goodness-of-fit test (H0: uniform across weekdays)')
print(f'  chi-squared = {chi2_result.statistic:.2f}')
print(f'  p-value     = {chi2_result.pvalue:.2e}')
print(f'  df          = {len(dow_order) - 1}')

# Cramér's V (effect size for chi-squared)
cramers_v = np.sqrt(chi2_result.statistic / (n_weekday * (len(dow_order) - 1)))
print(f"  Cramér's V  = {cramers_v:.4f}  (small < 0.1, medium ~ 0.3, large > 0.5)")

Chi-squared goodness-of-fit test (H0: uniform across weekdays)
  chi-squared = 277.63
  p-value     = 7.22e-59
  df          = 4
  Cramér's V  = 0.0572  (small < 0.1, medium ~ 0.3, large > 0.5)
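
The approach lists confidence intervals, but the cell above reports only a test statistic and an effect size. As a minimal sketch, a Wilson interval for the Monday share can be computed from the counts printed earlier (5,118 Monday records out of 21,184 weekday records). The 95% level and the Wilson method are choices made here for illustration, not something the original analysis specifies.

```python
from scipy import stats

# Counts copied from the day-of-week output above
monday = 5118
n_weekday = 21184

# H0 share of 0.2 corresponds to a uniform spread over five weekdays
result = stats.binomtest(monday, n_weekday, p=0.2)
ci = result.proportion_ci(confidence_level=0.95, method='wilson')

share = monday / n_weekday
print(f'Monday share: {share:.4f}  95% CI: [{ci.low:.4f}, {ci.high:.4f}]')
```

If the whole interval sits above 0.20, the Monday excess is not just sampling noise, although the interval says nothing about why the excess exists.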


---
## 2. "Long Weekend" Effect — Monday + Friday vs. Mid-Week

In [6]:
# Compare weekend-adjacent days (Mon+Fri) vs mid-week (Tue+Wed+Thu)
adj_weekend = weekdays[weekdays['dow_name'].isin(['Monday', 'Friday'])]
mid_week = weekdays[weekdays['dow_name'].isin(['Tuesday', 'Wednesday', 'Thursday'])]

n_adj = len(adj_weekend)
n_mid = len(mid_week)

# Under uniform assumption, Mon+Fri should be 2/5 of total, Tue-Thu should be 3/5
expected_adj = n_weekday * 2 / 5
expected_mid = n_weekday * 3 / 5

print('Weekend-Adjacent (Mon+Fri) vs Mid-Week (Tue+Wed+Thu):')
print(f'  Adjacent observed: {n_adj:,}  expected: {expected_adj:,.0f}  ({(n_adj/expected_adj - 1)*100:+.1f}%)')
print(f'  Mid-week observed: {n_mid:,}  expected: {expected_mid:,.0f}  ({(n_mid/expected_mid - 1)*100:+.1f}%)')

# Average per day
avg_adj = n_adj / 2
avg_mid = n_mid / 3
print(f'\n  Avg per day (adjacent):  {avg_adj:,.0f}')
print(f'  Avg per day (mid-week):  {avg_mid:,.0f}')
print(f'  Ratio: {avg_adj / avg_mid:.2f}x')

# One-sample proportion z-test against the uniform share p0 = 2/5
# (share of weekday sick days that fall on Mon/Fri)
p_hat = n_adj / n_weekday
p0 = 2 / 5  # expected under uniform
se = np.sqrt(p0 * (1 - p0) / n_weekday)
z = (p_hat - p0) / se
# 1 - cdf underflows to exactly 0.0 for |z| this large, hence p = 0.00e+00 in the output
p_z = 2 * (1 - stats.norm.cdf(abs(z)))
print(f'\n  One-sample proportion z-test:')
print(f'    Observed proportion (Mon+Fri): {p_hat:.4f}')
print(f'    Expected proportion:           {p0:.4f}')
print(f'    z = {z:.3f}, p = {p_z:.2e}')

Weekend-Adjacent (Mon+Fri) vs Mid-Week (Tue+Wed+Thu):
  Adjacent observed: 9,358  expected: 8,474  (+10.4%)
  Mid-week observed: 11,826  expected: 12,710  (-7.0%)

  Avg per day (adjacent):  4,679
  Avg per day (mid-week):  3,942
  Ratio: 1.19x

  One-sample proportion z-test:
    Observed proportion (Mon+Fri): 0.4417
    Expected proportion:           0.4000
    z = 12.403, p = 0.00e+00
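
The reported p-value of exactly zero is a floating-point artifact rather than a true zero: `1 - cdf(z)` loses the tail once `cdf(z)` rounds to 1.0. A small sketch of the same test using the survival function `stats.norm.sf`, with the counts copied from the output above, keeps the tail probability representable.

```python
import numpy as np
from scipy import stats

# Counts copied from the output above
n_adj, n_weekday = 9358, 21184
p0 = 2 / 5  # Mon+Fri share expected under a uniform weekday spread

p_hat = n_adj / n_weekday
se = np.sqrt(p0 * (1 - p0) / n_weekday)
z = (p_hat - p0) / se

p_naive = 2 * (1 - stats.norm.cdf(abs(z)))  # cdf rounds to 1.0, so this is 0.0
p_sf = 2 * stats.norm.sf(abs(z))            # survival function keeps the tail

print(f'z = {z:.3f}')
print(f'naive p = {p_naive:.2e}, sf-based p = {p_sf:.2e}')
```

Either way the conclusion is unchanged; the survival-function form simply avoids printing a misleading `0.00e+00`.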


---
## 3. Monthly / Seasonal Analysis

In [7]:
month_order = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']
monthly = weekdays['month_name'].value_counts().reindex(month_order)

# Color by season
season_colors = {
    'Winter': '#3498db',   # blue
    'Spring': '#2ecc71',   # green
    'Summer': '#e67e22',   # orange
    'Fall':   '#e74c3c'    # red
}
month_to_season = {
    'December': 'Winter', 'January': 'Winter', 'February': 'Winter',
    'March': 'Spring', 'April': 'Spring', 'May': 'Spring',
    'June': 'Summer', 'July': 'Summer', 'August': 'Summer',
    'September': 'Fall', 'October': 'Fall', 'November': 'Fall'
}
m_colors = [season_colors[month_to_season[m]] for m in month_order]

fig, ax = plt.subplots(figsize=(10, 5))
bars = ax.bar(range(12), monthly.values, color=m_colors, edgecolor='white')
ax.set_xticks(range(12))
ax.set_xticklabels([m[:3] for m in month_order])
ax.set_ylabel('Sick Day Count')
ax.set_title('Sick Days by Month (colored by season: blue=Winter, green=Spring, orange=Summer, red=Fall)')

for i, (bar, val) in enumerate(zip(bars, monthly.values)):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 20, f'{val:,}',
            ha='center', va='bottom', fontsize=8)

plt.tight_layout()
plt.savefig('fig_monthly_distribution.png', bbox_inches='tight')
plt.show()

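# Chi-squared on monthly counts. The uniform null ignores that months differ in length
# and in coverage (Sep-Nov fall in only three of the years sampled), so read it loosely.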
chi2_month = stats.chisquare(monthly.values)
print(f'Chi-squared test (months, H0: uniform): chi2={chi2_month.statistic:.2f}, p={chi2_month.pvalue:.2e}')

Chi-squared test (months, H0: uniform): chi2=850.47, p=2.74e-175


In [8]:
# Seasonal aggregation
weekdays_copy = weekdays.copy()
weekdays_copy['season'] = weekdays_copy['month_name'].map(month_to_season)
season_counts = weekdays_copy['season'].value_counts()

# NOTE: these are raw totals, not rates. The data span 2011-12-16 to 2015-08-21,
# so the seasons are not equally represented (Sep-Nov appear in only three years).
season_order = ['Winter', 'Spring', 'Summer', 'Fall']
print('Seasonal totals (weekday sick days):')
for s in season_order:
    c = season_counts.get(s, 0)
    pct = c / n_weekday * 100
    print(f'  {s:8s}: {c:5,}  ({pct:.1f}%)')

print(f'\nWinter accounts for {season_counts["Winter"]/n_weekday*100:.1f}% of sick days.')
print(f'Summer accounts for {season_counts["Summer"]/n_weekday*100:.1f}% of sick days.')
print(f'Winter-to-Summer ratio: {season_counts["Winter"]/season_counts["Summer"]:.2f}x')

Seasonal totals (weekday sick days):
  Winter  : 6,618  (31.2%)
  Spring  : 4,934  (23.3%)
  Summer  : 4,398  (20.8%)
  Fall    : 5,234  (24.7%)

Winter accounts for 31.2% of sick days.
Summer accounts for 20.8% of sick days.
Winter-to-Summer ratio: 1.50x
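
Because the date range covers the seasons unevenly, raw totals can mislead. The sketch below divides each seasonal total (copied from the output above) by the number of Monday-to-Friday dates that season contributes to the observed range, then tests against an exposure-weighted expectation. It assumes every Monday-to-Friday date is a working day; public holidays are not removed, which is a simplification.

```python
import pandas as pd
from scipy import stats

# Seasonal totals copied from the output above
season_counts = pd.Series({'Winter': 6618, 'Spring': 4934,
                           'Summer': 4398, 'Fall': 5234})

month_to_season = {12: 'Winter', 1: 'Winter', 2: 'Winter',
                   3: 'Spring', 4: 'Spring', 5: 'Spring',
                   6: 'Summer', 7: 'Summer', 8: 'Summer',
                   9: 'Fall', 10: 'Fall', 11: 'Fall'}

# Mon-Fri dates in the observed range (public holidays are NOT excluded)
bdays = pd.bdate_range('2011-12-16', '2015-08-21')
exposure = (pd.Series(bdays.month).map(month_to_season)
            .value_counts().reindex(season_counts.index))

rate = season_counts / exposure  # sick days per business day

# Expected counts if sick days were spread in proportion to exposure
expected = exposure / exposure.sum() * season_counts.sum()
chi2 = stats.chisquare(season_counts.values, f_exp=expected.values)

print(pd.DataFrame({'sick_days': season_counts,
                    'business_days': exposure,
                    'per_day': rate.round(1)}))
print(f'chi2 = {chi2.statistic:.1f}, p = {chi2.pvalue:.2e}')
```

Comparing `per_day` across seasons is a fairer basis for the winter-versus-summer claim than the raw percentages.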


---
## 4. Holiday Adjacency Analysis

Do sick days spike the day before or after a public holiday?

In [9]:
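# Major US federal holidays (observed dates) for 2011-2015:
# New Year's Day, MLK Day, Presidents' Day, Memorial Day, Independence Day,
# Labor Day, Columbus Day, Veterans Day, Thanksgiving, Christmas.
# NOTE: the 2015 entries stop at Presidents' Day, although the data run to
# 2015-08-21, so Memorial Day and Independence Day 2015 are not covered.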
holiday_dates = [
    # 2011
    '2011-01-17', '2011-02-21', '2011-05-30', '2011-07-04',
    '2011-09-05', '2011-10-10', '2011-11-11', '2011-11-24', '2011-12-26',
    # 2012
    '2012-01-02', '2012-01-16', '2012-02-20', '2012-05-28', '2012-07-04',
    '2012-09-03', '2012-10-08', '2012-11-12', '2012-11-22', '2012-12-25',
    # 2013
    '2013-01-01', '2013-01-21', '2013-02-18', '2013-05-27', '2013-07-04',
    '2013-09-02', '2013-10-14', '2013-11-11', '2013-11-28', '2013-12-25',
    # 2014
    '2014-01-01', '2014-01-20', '2014-02-17', '2014-05-26', '2014-07-04',
    '2014-09-01', '2014-10-13', '2014-11-11', '2014-11-27', '2014-12-25',
    # 2015
    '2015-01-01', '2015-01-19', '2015-02-16',
]
holidays = pd.to_datetime(holiday_dates)

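# Flag dates within `window` CALENDAR days of a holiday. With window=1 this includes
# the holiday itself and misses the Friday before (or Monday after) a holiday that
# abuts a weekend, so it is not a true "adjacent workday" measure.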
def is_near_holiday(date, holidays, window=1):
    for h in holidays:
        diff = abs((date - h).days)
        if diff <= window:
            return True
    return False

weekdays_copy = weekdays.copy()
weekdays_copy['near_holiday'] = weekdays_copy['date'].apply(lambda d: is_near_holiday(d, holidays, window=1))

near_hol = weekdays_copy['near_holiday'].sum()
not_near = len(weekdays_copy) - near_hol

# Calculate the number of weekdays that are within 1 day of a holiday vs not
# across the full date range
all_dates = pd.date_range(weekdays_copy['date'].min(), weekdays_copy['date'].max(), freq='B')  # business days
total_bdays = len(all_dates)
near_hol_bdays = sum(1 for d in all_dates if is_near_holiday(d, holidays, window=1))
not_near_bdays = total_bdays - near_hol_bdays

rate_near = near_hol / near_hol_bdays if near_hol_bdays > 0 else 0
rate_not = not_near / not_near_bdays if not_near_bdays > 0 else 0

print('Holiday Adjacency Analysis (within 1 day of a federal holiday):')
print(f'  Business days near holidays:     {near_hol_bdays:4d}  Sick days: {near_hol:5,}  Rate: {rate_near:.1f}/day')
print(f'  Business days not near holidays: {not_near_bdays:4d}  Sick days: {not_near:5,}  Rate: {rate_not:.1f}/day')
print(f'  Rate ratio: {rate_near/rate_not:.2f}x')

Holiday Adjacency Analysis (within 1 day of a federal holiday):
  Business days near holidays:       80  Sick days: 1,297  Rate: 16.2/day
  Business days not near holidays:  881  Sick days: 19,887  Rate: 22.6/day
  Rate ratio: 0.72x
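
The 0.72x ratio should be read with care, because the calendar-day window counts the holiday itself as "near" and skips across-the-weekend neighbours. One possible alternative, sketched here with NumPy's business-day arithmetic, finds the last workday before and the first workday after each holiday. The helper name `adjacent_workdays` is hypothetical, and the two example dates are taken from the holiday list above.

```python
import numpy as np
import pandas as pd

def adjacent_workdays(holiday_dates):
    """Last workday before and first workday after each holiday."""
    hol = np.array(list(holiday_dates), dtype='datetime64[D]')
    # roll='forward' then -1 lands on the workday just before the holiday;
    # roll='backward' then +1 lands on the workday just after it.
    before = np.busday_offset(hol, -1, roll='forward', holidays=hol)
    after = np.busday_offset(hol, 1, roll='backward', holidays=hol)
    return pd.DatetimeIndex(np.union1d(before, after))

# MLK Day 2014 (a Monday) and Thanksgiving 2014 (a Thursday)
adj = adjacent_workdays(['2014-01-20', '2014-11-27'])
print(adj.strftime('%Y-%m-%d (%a)').tolist())
```

The resulting index excludes the holidays themselves, so a rate computed over these dates compares like with like: working days next to a holiday against all other working days.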


---
## 5. Day-of-Week Pattern by Season

If the Monday spike is from faking, it should persist across all seasons. If it's from genuine illness, it might be stronger in winter.

In [10]:
weekdays_s = weekdays.copy()
weekdays_s['season'] = weekdays_s['month_name'].map(month_to_season)

fig, axes = plt.subplots(1, 4, figsize=(16, 4), sharey=True)

for ax, season in zip(axes, season_order):
    subset = weekdays_s[weekdays_s['season'] == season]
    counts = subset['dow_name'].value_counts().reindex(dow_order)
    total = counts.sum()
    pcts = counts / total * 100

    colors = ['#c0392b' if d in ['Monday', 'Friday'] else '#2c3e50' for d in dow_order]
    ax.bar(range(5), pcts.values, color=colors)
    ax.set_xticks(range(5))
    ax.set_xticklabels(['Mon', 'Tue', 'Wed', 'Thu', 'Fri'], fontsize=8)
    ax.set_title(f'{season} (n={total:,})', fontsize=10)
    ax.axhline(y=20, color='gray', linestyle='--', linewidth=0.8)
    ax.set_ylim(0, 30)
    if ax == axes[0]:
        ax.set_ylabel('% of sick days')

    # Annotate Monday %
    ax.text(0, pcts.values[0] + 0.5, f'{pcts.values[0]:.1f}%', ha='center', fontsize=8, fontweight='bold')

fig.suptitle('Day-of-Week Distribution by Season (dashed line = 20% uniform expectation)', y=1.02)
plt.tight_layout()
plt.savefig('fig_dow_by_season.png', bbox_inches='tight')
plt.show()

# Print Monday % by season
print('Monday percentage by season:')
for season in season_order:
    subset = weekdays_s[weekdays_s['season'] == season]
    counts = subset['dow_name'].value_counts().reindex(dow_order)
    mon_pct = counts['Monday'] / counts.sum() * 100
    print(f'  {season:8s}: {mon_pct:.1f}%')

Monday percentage by season:
  Winter  : 22.3%
  Spring  : 25.2%
  Summer  : 25.0%
  Fall    : 24.9%
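
Whether the day-of-week mix really differs between seasons can be checked formally with a chi-squared test of independence on the season by weekday table. The sketch below shows the mechanics on a small synthetic frame; the counts are made up for illustration and are not the notebook's data, and the function name `dow_season_homogeneity` is hypothetical. On the real data, the `weekdays_s` frame with its `season` and `dow_name` columns could be passed in instead.

```python
import pandas as pd
from scipy import stats

def dow_season_homogeneity(frame):
    """Chi-squared test of independence between season and day of week."""
    table = pd.crosstab(frame['season'], frame['dow_name'])
    chi2, p, dof, _expected = stats.chi2_contingency(table)
    return table, chi2, p, dof

# Synthetic counts, NOT taken from the sick-leave data
toy_counts = {
    ('Winter', 'Monday'): 30, ('Winter', 'Tuesday'): 26,
    ('Winter', 'Wednesday'): 25, ('Winter', 'Thursday'): 24,
    ('Winter', 'Friday'): 27,
    ('Summer', 'Monday'): 28, ('Summer', 'Tuesday'): 20,
    ('Summer', 'Wednesday'): 19, ('Summer', 'Thursday'): 18,
    ('Summer', 'Friday'): 21,
}
toy = pd.DataFrame(
    [(s, d) for (s, d), k in toy_counts.items() for _ in range(k)],
    columns=['season', 'dow_name'],
)

table, chi2, p, dof = dow_season_homogeneity(toy)
print(table)
print(f'chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}')
```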


---
## 6. Year-over-Year Consistency

In [11]:
# Restrict to full years (2012-2014)
full_years = weekdays[weekdays['year'].isin([2012, 2013, 2014])].copy()

fig, axes = plt.subplots(1, 3, figsize=(14, 4), sharey=True)

print('Year-over-year day-of-week distribution:')
print(f'{"":12s}  {"Mon":>6s}  {"Tue":>6s}  {"Wed":>6s}  {"Thu":>6s}  {"Fri":>6s}  {"Chi2":>8s}  {"p-val":>10s}')

for ax, year in zip(axes, [2012, 2013, 2014]):
    subset = full_years[full_years['year'] == year]
    counts = subset['dow_name'].value_counts().reindex(dow_order)
    total = counts.sum()
    pcts = counts / total * 100

    chi2_y = stats.chisquare(counts.values)

    colors = ['#c0392b' if d in ['Monday', 'Friday'] else '#2c3e50' for d in dow_order]
    ax.bar(range(5), counts.values, color=colors)
    ax.set_xticks(range(5))
    ax.set_xticklabels(['Mon', 'Tue', 'Wed', 'Thu', 'Fri'], fontsize=8)
    ax.set_title(f'{year} (n={total:,})')
    if ax == axes[0]:
        ax.set_ylabel('Sick Day Count')

    pct_str = '  '.join([f'{p:5.1f}%' for p in pcts.values])
    print(f'  {year}:       {pct_str}  {chi2_y.statistic:8.1f}  {chi2_y.pvalue:10.2e}')

fig.suptitle('Day-of-Week Distribution by Year', y=1.02)
plt.tight_layout()
plt.savefig('fig_dow_by_year.png', bbox_inches='tight')
plt.show()

Year-over-year day-of-week distribution:
                 Mon     Tue     Wed     Thu     Fri      Chi2       p-val
  2012:        24.4%   19.4%   18.9%   18.2%   19.1%      54.6    3.88e-11
  2013:        24.0%   19.4%   18.7%   17.3%   20.5%      85.0    1.53e-17
  2014:        24.6%   20.4%   18.3%   17.1%   19.7%     121.8    2.18e-25


---
## 7. Monday vs. Friday Asymmetry

If faking is primarily about extending weekends, we'd expect both Monday and Friday to be elevated. Is one much higher than the other?

In [12]:
mon_count = observed['Monday']
fri_count = observed['Friday']

print(f'Monday:  {mon_count:,}')
print(f'Friday:  {fri_count:,}')
print(f'Difference: {mon_count - fri_count:,} ({(mon_count - fri_count)/fri_count*100:.1f}% more on Monday)')

# Binomial test: is Monday significantly > Friday?
# Under H0: P(Mon) = P(Fri) = 0.5 among Mon+Fri days
total_mf = mon_count + fri_count
binom_result = stats.binomtest(mon_count, total_mf, 0.5, alternative='greater')
print(f'\nBinomial test (H0: Mon = Fri): p = {binom_result.pvalue:.2e}')
print(f'Monday accounts for {mon_count/total_mf*100:.1f}% of Mon+Fri sick days')

Monday:  5,118
Friday:  4,240
Difference: 878 (20.7% more on Monday)

Binomial test (H0: Mon = Fri): p = 5.82e-20
Monday accounts for 54.7% of Mon+Fri sick days


---
## 8. Consecutive Sick Day Patterns

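Single isolated days would point toward faking and multi-day stretches toward genuine illness, but without employee IDs that question cannot be answered directly. This section looks at aggregate daily counts instead.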

In [13]:
# NOTE: This dataset contains only dates (no employee IDs), so we analyze
# aggregate daily counts and look at whether high-count days cluster
# or whether specific dates repeat (same day taken by many people)

daily_counts = weekdays.groupby('date').size().reset_index(name='count')
daily_counts = daily_counts.sort_values('date')

print(f'Unique sick dates: {len(daily_counts)}')
print(f'\nDaily sick day count statistics:')
print(daily_counts['count'].describe().to_string())

# Distribution of daily counts
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

ax1.hist(daily_counts['count'], bins=30, color='#2c3e50', edgecolor='white')
ax1.set_xlabel('Number of Sick Days on a Single Date')
ax1.set_ylabel('Frequency')
ax1.set_title('Distribution of Daily Sick Day Counts')

# Top 10 dates with most sick days
top_dates = daily_counts.nlargest(10, 'count')
print('\nTop 10 dates with most sick days:')
for _, row in top_dates.iterrows():
    d = row['date']
    print(f'  {d.strftime("%Y-%m-%d")} ({d.strftime("%A"):9s}): {row["count"]} sick days')

# Day of week for top-count days
daily_counts['dow_name'] = daily_counts['date'].dt.day_name()
avg_by_dow = daily_counts.groupby('dow_name')['count'].mean().reindex(dow_order)

colors = ['#c0392b' if d in ['Monday', 'Friday'] else '#2c3e50' for d in dow_order]
ax2.bar(range(5), avg_by_dow.values, color=colors)
ax2.set_xticks(range(5))
ax2.set_xticklabels(['Mon', 'Tue', 'Wed', 'Thu', 'Fri'])
ax2.set_ylabel('Avg Sick Days per Date')
ax2.set_title('Average Daily Sick Count by Day of Week')

plt.tight_layout()
plt.savefig('fig_daily_counts.png', bbox_inches='tight')
plt.show()

Unique sick dates: 866

Daily sick day count statistics:
count    866.000000
mean      24.461894
std       11.081998
min        1.000000
25%       17.000000
50%       24.000000
75%       31.000000
max       73.000000

Top 10 dates with most sick days:
  2015-01-12 (Monday   ): 73 sick days
  2015-01-05 (Monday   ): 65 sick days
  2015-03-02 (Monday   ): 62 sick days
  2015-01-16 (Friday   ): 58 sick days
  2015-03-09 (Monday   ): 58 sick days
  2014-12-29 (Monday   ): 56 sick days
  2015-01-23 (Friday   ): 55 sick days
  2014-01-06 (Monday   ): 54 sick days
  2014-05-12 (Monday   ): 54 sick days
  2014-12-16 (Tuesday  ): 54 sick days
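
The cell's opening comment mentions checking whether high-count days cluster, but no clustering measure is computed. One simple option is the lag-1 autocorrelation of the daily series, sketched below on a synthetic frame with the same `date` and `count` columns as `daily_counts`. The helper name `lag1_autocorr` and the toy values are illustrative only. Business days with no records are filled with zero, which is an assumption.

```python
import pandas as pd

def lag1_autocorr(daily_counts):
    """Lag-1 autocorrelation of daily counts on a Mon-Fri calendar."""
    s = daily_counts.set_index('date')['count'].sort_index()
    # Put the series on a full business-day calendar; missing dates become 0
    full = pd.bdate_range(s.index.min(), s.index.max())
    s = s.reindex(full, fill_value=0)
    return s.autocorr(lag=1)

# Synthetic example: a steadily rising series is perfectly autocorrelated
toy = pd.DataFrame({
    'date': pd.bdate_range('2015-01-05', periods=10),
    'count': [10, 12, 14, 16, 18, 20, 22, 24, 26, 28],
})
r = lag1_autocorr(toy)
print(f'lag-1 autocorrelation: {r:.3f}')
```

A high value on the real data would mostly reflect seasonality and outbreaks, so it still cannot separate individual multi-day absences from one-off days.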


---
## 9. Time Series View

In [14]:
# Weekly aggregation for time series
weekly = weekdays.set_index('date').resample('W').size().reset_index(name='count')

fig, ax = plt.subplots(figsize=(14, 4))
ax.plot(weekly['date'], weekly['count'], color='#2c3e50', linewidth=0.8, alpha=0.6)

# Rolling average
weekly['rolling_4w'] = weekly['count'].rolling(4, center=True).mean()
ax.plot(weekly['date'], weekly['rolling_4w'], color='#c0392b', linewidth=2, label='4-week rolling avg')

ax.set_ylabel('Weekly Sick Days')
ax.set_title('Sick Days Over Time (Weekly Count)')
ax.legend()
plt.tight_layout()
plt.savefig('fig_time_series.png', bbox_inches='tight')
plt.show()

---
## 10. Summary of Findings

In [15]:
print('=' * 70)
print('SICK DAY FRAUD ANALYSIS — SUMMARY OF FINDINGS')
print('=' * 70)

print('''
EVIDENCE SUGGESTING FRAUD (faking sick):

1. MONDAY SPIKE
   Monday has significantly more sick days than any other weekday.
   This pattern is consistent with employees extending their weekend.
   - Monday accounts for ~24% of weekday sick days (vs. 20% expected).
   - Chi-squared test rejects uniform distribution (p = 7.2e-59).

2. WEEKEND-ADJACENT EXCESS
   Monday and Friday combined average 1.19x the Tuesday-Thursday
   daily count, but most of the excess comes from Monday (+20.8%
   vs. uniform). Friday sits at the uniform expectation (+0.1%),
   though above Wednesday and Thursday. This is only partly
   consistent with a "long weekend" motivation.

3. CONSISTENCY ACROSS YEARS AND SEASONS
   The Monday spike persists across all years (2012-2014) and all
   seasons (22.3% to 25.2% of sick days). It is smallest in winter
   (22.3%), so it does not grow with flu season as a purely
   illness-driven spike might.

EVIDENCE SUGGESTING GENUINE ILLNESS:

4. STRONG WINTER SEASONALITY
   Winter months (Dec-Feb) have substantially more sick days than
   summer months, consistent with cold/flu season driving genuine
   absences.

5. MONDAY > FRIDAY ASYMMETRY
   Monday has ~20% more sick days than Friday. If the motivation
   were purely about long weekends, we would expect Mon ≈ Fri.
   The Monday excess may partly reflect genuine "weekend recovery"
   illness (e.g., people getting sick over the weekend and calling
   out Monday) rather than pure fraud.

IMPORTANT CAVEATS:

6. NO INDIVIDUAL-LEVEL DATA
   This dataset contains only dates with no employee identifiers.
   We cannot distinguish between:
   - A few people faking frequently on Mondays
   - A broad cultural pattern across many employees
   Individual-level data would be needed to identify specific
   employees with suspicious patterns.

7. EFFECT SIZE IS SMALL
   While statistically significant (large sample), the practical
   effect is modest. Monday is ~20% above expected, not 2-3x.
   Cramér's V indicates a small effect size.

CONCLUSION:
   The data show a statistically significant and persistent Monday
   spike that is consistent with some degree of sick day misuse.
   However, the strong winter seasonality suggests most sick leave
   is driven by genuine illness. The Monday effect is real but
   modest in practical terms. As a rough gauge of scale, the
   "excess" Monday and Friday sick days beyond the uniform count
   are computed below. This is a heuristic, not a bound: faked
   days can fall on any weekday, and part of the Monday excess
   may be genuine illness.
''')

excess_monday = mon_count - expected_per_day
excess_friday = max(0, fri_count - expected_per_day)
total_excess = excess_monday + excess_friday
print(f'Estimated excess Monday sick days: {excess_monday:,.0f}')
print(f'Estimated excess Friday sick days:  {excess_friday:,.0f}')
print(f'Total excess over uniform (Mon+Fri): {total_excess:,.0f} out of {n_weekday:,} ({total_excess/n_weekday*100:.1f}%)')
print(f'\nBy this heuristic, the other {100 - total_excess/n_weekday*100:.0f}% of weekday sick days are not in excess of the uniform expectation.')
