# Table of Contents
- Data Preprocessing
- Exploratory Data Analysis
  - Waiting Time vs Abandonment Rate
  - Talk Duration by Day of Week
  - Answer Rate vs Wait Time & Answer Speed
  - Daily Volume vs 7-day Rolling Mean
  - Weekend vs Weekday Talk Duration
  - Misc

# Data Preprocessing

In [None]:
# Import statements
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats as stats
import seaborn as sns
from pandas.tseries.holiday import USFederalHolidayCalendar
from scipy.stats import f_oneway, ttest_ind

# Load data
df = pd.read_csv("call-center-data-v2-daily.csv")
df["Date"] = pd.to_datetime(df["Date"])

In [None]:
# Convert time-related columns from "HH:MM:SS" strings to numeric seconds
time_cols = ["Answer Speed (AVG)", "Talk Duration (AVG)", "Waiting Time (AVG)"]
for col in time_cols:
    df[col] = pd.to_timedelta(df[col]).dt.total_seconds()
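
As a quick check of the conversion above, the sketch below runs the same `pd.to_timedelta(...).dt.total_seconds()` chain on a few made-up values. The sample strings are hypothetical and only assume the "HH:MM:SS" format stated in the comment.

```python
import pandas as pd

# Hypothetical sample values in the "HH:MM:SS" format the notebook assumes
sample = pd.Series(["00:00:17", "00:02:14", "01:00:00"])

# to_timedelta parses each string; total_seconds() returns float seconds
seconds = pd.to_timedelta(sample).dt.total_seconds()
print(seconds.tolist())  # [17.0, 134.0, 3600.0]
```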

In [None]:
# Temporal features
df["day_of_week"] = df["Date"].dt.day_name()
df["day_of_week_num"] = df["Date"].dt.weekday
df["is_weekend"] = df["day_of_week_num"] >= 5
df["month"] = df["Date"].dt.month
df["quarter"] = df["Date"].dt.quarter
df["year"] = df["Date"].dt.year

holidays = USFederalHolidayCalendar().holidays(
    start=df["Date"].min(), end=df["Date"].max()
)
df["is_holiday"] = df["Date"].isin(holidays).astype(int)

In [None]:
# Lag and rolling features
df["incoming_lag1"] = df["Incoming Calls"].shift(1)
df["incoming_roll7"] = df["Incoming Calls"].rolling(window=7).mean()
df["incoming_roll30"] = df["Incoming Calls"].rolling(window=30).mean()

# Derived rates
df["answer_rate"] = df["Answered Calls"] / df["Incoming Calls"]
df["abandonment_rate"] = df["Abandoned Calls"] / df["Incoming Calls"]

In [None]:
# Missing values
print("Missing values:\n", df.isna().sum(), sep="")

In [None]:
df.head(5)

# Exploratory Data Analysis

## Waiting Time vs Abandonment Rate
- $H_0$: Average waiting time is not positively associated with abandonment rate.


In [None]:
# Pearson correlation between average wait time and abandonment rate
# (drop incomplete rows first, since pearsonr does not skip NaN values)
pair = df[["Waiting Time (AVG)", "abandonment_rate"]].dropna()
wt = pair["Waiting Time (AVG)"]
ab = pair["abandonment_rate"]
corr_wt_ab, pval_wt_ab = stats.pearsonr(wt, ab)
print(f"Pearson r = {corr_wt_ab:.3f}, p-value = {pval_wt_ab:.4g}")

# Scatter plot with regression line
plt.figure(figsize=(6, 4))
sns.regplot(x=wt, y=ab, scatter_kws={"alpha": 0.5, "s": 30})
plt.xlabel("Average Waiting Time (sec)")
plt.ylabel("Abandonment Rate")
plt.title("Waiting Time vs Abandonment Rate")
plt.tight_layout()
plt.show()

**Insights**
- Strong positive correlation between average waiting time and abandonment rate ($r \approx 0.79$)
- Given $p < 0.05$, we reject $H_0$
- Longer average wait times are significantly associated with higher abandonment rates
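
Because $H_0$ here is directional ("not positively associated"), a one-sided test is arguably the closer match. `stats.pearsonr` reports a two-sided p-value by default. The sketch below is a minimal illustration on synthetic data, not the notebook's results. It assumes SciPy 1.9 or later, where `pearsonr` accepts an `alternative` argument, and the variable names are made up.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for waiting time (sec) and abandonment rate
rng = np.random.default_rng(0)
wait = rng.uniform(10, 120, size=200)
abandon = 0.001 * wait + rng.normal(0, 0.02, size=200)

# Two-sided test (the default) and the one-sided "greater" variant
r, p_two_sided = stats.pearsonr(wait, abandon)
_, p_one_sided = stats.pearsonr(wait, abandon, alternative="greater")

print(f"r = {r:.3f}")
print(f"two-sided p = {p_two_sided:.3g}, one-sided p = {p_one_sided:.3g}")
```

When the sample correlation is positive, the one-sided p-value is no larger than the two-sided one, so a rejection under the two-sided test carries over.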

## Talk Duration by Day of Week
- $H_0$: Mean talk duration is the same across all days of the week

Note: One-way ANOVA (analysis of variance) tests the null hypothesis that all group means are equal against the alternative hypothesis that at least one mean differs.
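
To make the note concrete, the F-statistic is the ratio of between-group to within-group mean squares. The sketch below computes it by hand on three small made-up groups and compares it with `f_oneway`. The numbers are illustrative only and are not taken from the call center data.

```python
import numpy as np
from scipy.stats import f_oneway

# Three hypothetical groups (e.g. talk durations in seconds for three days)
groups = [
    np.array([118.0, 125.0, 131.0, 122.0]),
    np.array([127.0, 133.0, 129.0, 138.0]),
    np.array([121.0, 119.0, 126.0, 124.0]),
]

k = len(groups)
n_total = sum(len(g) for g in groups)
grand_mean = np.concatenate(groups).mean()

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))

f_scipy, p_scipy = f_oneway(*groups)
print(f"manual F = {f_manual:.4f}, scipy F = {f_scipy:.4f}, p = {p_scipy:.4g}")
```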

In [None]:
# Prepare data groups by day of week in calendar order
weekday_order = [
    "Monday",
    "Tuesday",
    "Wednesday",
    "Thursday",
    "Friday",
    "Saturday",
    "Sunday",
]
groups = [
    df.loc[df["day_of_week"] == day, "Talk Duration (AVG)"].dropna()
    for day in weekday_order
]

# Perform one-way ANOVA
f_stat, p_val = f_oneway(*groups)
print(f"ANOVA F-statistic = {f_stat:.3f}, p-value = {p_val:.4g}")

# Boxplot of Talk Duration by Day
plt.figure(figsize=(6, 4))
sns.boxplot(
    x="day_of_week", y="Talk Duration (AVG)", data=df, order=weekday_order
)
plt.xlabel("Day of Week")
plt.ylabel("Talk Duration (sec)")
plt.title("Talk Duration by Day of Week")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

**Insights**
- Given $p > 0.05$, we fail to reject $H_0$
- Talk duration is similar across days
- Sunday shows noticeably larger variation than the other days, which is worth a closer look
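
One possible way to follow up on the Sunday spread is a test for equal variances across days. The sketch below uses `scipy.stats.levene` on synthetic data in which one group is deliberately more dispersed. This is a suggested check, not something the notebook runs, and all values and names are made up.

```python
import numpy as np
from scipy.stats import levene

rng = np.random.default_rng(1)

# Synthetic talk durations (sec): six days with similar spread, one wider
narrow_days = [rng.normal(130, 5, size=50) for _ in range(6)]
wide_day = rng.normal(130, 15, size=50)

# Median-centered Levene test is less sensitive to non-normal data
stat, p_val = levene(*narrow_days, wide_day, center="median")
print(f"Levene statistic = {stat:.3f}, p-value = {p_val:.4g}")
```

A small p-value would suggest that at least one day has a different variance. On the real data, the per-day groups built for the ANOVA could be passed in the same way.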

## Answer Rate vs Wait Time & Answer Speed
- $H_0$: Answer rate is independent of waiting time and answer speed

In [None]:
# Correlation matrix for answer rate, wait time, answer speed, and abandon rate
sub = df[
    [
        "answer_rate",
        "Waiting Time (AVG)",
        "Answer Speed (AVG)",
        "abandonment_rate",
    ]
].dropna()
corr_matrix = sub.corr().round(2)
print(corr_matrix)

# Heatmap of correlations
plt.figure(figsize=(5, 4))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation: Answer Rate, Wait & Answer Speed, Abandon Rate")
plt.tight_layout()
plt.show()

**Insights**
- Answer rate is negatively correlated with average waiting time
- Days with higher answer rate have shorter wait times and faster answering speeds
- Answer rate and abandonment rate move in opposite directions: the more calls are answered, the fewer are abandoned
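
The correlation matrix reports coefficients but no p-values, so on its own it does not test $H_0$. The sketch below shows one way to get a p-value per pair with `stats.pearsonr`. It runs on a synthetic frame whose column names mirror the notebook's but whose values are invented for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(2)
n = 150
wait = rng.uniform(10, 120, size=n)

# Synthetic frame: column names mirror the notebook's, values are made up
demo = pd.DataFrame(
    {
        "Waiting Time (AVG)": wait,
        "Answer Speed (AVG)": 0.5 * wait + rng.normal(0, 5, size=n),
        "answer_rate": 0.98 - 0.001 * wait + rng.normal(0, 0.01, size=n),
    }
)

# Pearson r and two-sided p-value for answer rate against each predictor
results = {}
for col in ["Waiting Time (AVG)", "Answer Speed (AVG)"]:
    r, p = stats.pearsonr(demo["answer_rate"], demo[col])
    results[col] = (r, p)
    print(f"answer_rate vs {col}: r = {r:.3f}, p = {p:.3g}")
```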

## Daily Volume vs 7-day Rolling Mean

In [None]:
raw = df["Incoming Calls"].dropna()
roll7 = df["incoming_roll7"].dropna()
# Align lengths (rolling produces NaN for first 6 days)
common_idx = raw.index.intersection(roll7.index)
raw_common = raw.loc[common_idx]
roll7_common = roll7.loc[common_idx]

# Plot raw vs rolling series
plt.figure(figsize=(7, 4))
plt.plot(df["Date"], df["Incoming Calls"], alpha=0.5, label="Daily Incoming")
plt.plot(
    df["Date"], df["incoming_roll7"], color="red", label="7-day Rolling Mean"
)
plt.legend()
plt.xlabel("Date")
plt.ylabel("Incoming Calls")
plt.title("Daily Incoming Calls vs 7-day Rolling Mean")
plt.tight_layout()
plt.show()

**Insights**
- Huge spike in incoming calls in early 2024
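
For intuition on how the 7-day rolling mean behaves around a spike, and why its first six values are NaN, here is a small sketch on a made-up series. The numbers are hypothetical. `min_periods=1` is shown only as an option for filling the leading gap and is not used in the notebook.

```python
import pandas as pd

# Hypothetical daily call counts with a single spike (400) on the 8th day
calls = pd.Series([100, 120, 90, 110, 130, 95, 105, 400, 115, 98])

# Default: a value appears only once a full 7-day window is available
roll7 = calls.rolling(window=7).mean()
print(int(roll7.isna().sum()))  # 6

# The spike lifts the 7-day mean from about 107.1 to 150.0
print(round(roll7.iloc[6], 1), round(roll7.iloc[7], 1))  # 107.1 150.0

# Optional: allow partial windows so the first rows are not NaN
roll7_partial = calls.rolling(window=7, min_periods=1).mean()
print(roll7_partial.iloc[0])  # 100.0
```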

## Weekend vs Weekday Talk Duration
- $H_0$: Mean talk duration is the same on weekends and weekdays

In [None]:
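# Split talk duration into weekend and weekday samples
weekend = df.loc[df["is_weekend"], "Talk Duration (AVG)"].dropna()
# Use ~ for element-wise negation; Python's `not` raises on a Series
weekday = df.loc[~df["is_weekend"], "Talk Duration (AVG)"].dropna()
# Welch's t-test (equal_var=False) does not assume equal variances
t_stat, p_val = ttest_ind(weekend, weekday, equal_var=False)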
print(f"t-statistic = {t_stat:.3f}, p-value = {p_val:.4g}")

# Boxplot for weekend vs weekday
plt.figure(figsize=(6, 4))
sns.boxplot(
    x=df["is_weekend"].map({False: "Weekday", True: "Weekend"}),
    y=df["Talk Duration (AVG)"],
)
plt.title("Talk Duration: Weekday vs Weekend")
plt.ylabel("Talk Duration (sec)")
plt.tight_layout()
plt.show()

**Insights**
- Average talk duration is similar on weekends and weekdays
- We fail to reject $H_0$: the data show no statistically significant difference in mean talk duration between weekends and weekdays
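
Setting `equal_var=False` in `ttest_ind` selects Welch's t-test, which uses an unpooled standard error. The sketch below reproduces the statistic by hand on synthetic samples and compares it with SciPy's result. The sample sizes and distributions are made up for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(3)

# Synthetic talk durations (sec) with unequal sizes and spreads
weekend_demo = rng.normal(130, 12, size=60)
weekday_demo = rng.normal(131, 8, size=150)

# Welch's t: difference in means over the unpooled standard error
var_a = weekend_demo.var(ddof=1) / len(weekend_demo)
var_b = weekday_demo.var(ddof=1) / len(weekday_demo)
t_manual = (weekend_demo.mean() - weekday_demo.mean()) / np.sqrt(var_a + var_b)

# Welch-Satterthwaite approximation to the degrees of freedom
dof = (var_a + var_b) ** 2 / (
    var_a**2 / (len(weekend_demo) - 1) + var_b**2 / (len(weekday_demo) - 1)
)

t_scipy, p_scipy = ttest_ind(weekend_demo, weekday_demo, equal_var=False)
print(f"manual t = {t_manual:.4f}, scipy t = {t_scipy:.4f}, dof = {dof:.1f}")
```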

## Misc

In [None]:
# Mean incoming calls per weekday (reuses weekday_order defined above)
plt.figure(figsize=(6, 4))
sns.barplot(
    x="day_of_week",
    y="Incoming Calls",
    data=df,
    estimator=np.mean,
    order=weekday_order,
)
plt.title("Average Incoming Calls by Day of Week")
plt.ylabel("Avg Incoming Calls")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
corr = df[
    [
        "Incoming Calls",
        "Answered Calls",
        "Abandoned Calls",
        "Answer Speed (AVG)",
        "Talk Duration (AVG)",
        "Waiting Time (AVG)",
        "answer_rate",
        "abandonment_rate",
    ]
].corr()
plt.figure(figsize=(6, 5))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation of Call Center Metrics")
plt.tight_layout()
plt.show()

In [None]:
pivot = df.pivot_table(
    index="day_of_week",
    columns="month",
    values="Incoming Calls",
    aggfunc="mean",
)
# Put rows in calendar order (reuses weekday_order defined above)
pivot = pivot.reindex(weekday_order)
plt.figure(figsize=(6, 5))
sns.heatmap(pivot, annot=True, fmt=".0f", cmap="YlGnBu")
plt.title("Avg Calls by Day-of-Week and Month")
plt.ylabel("Day of Week")
plt.xlabel("Month")
plt.tight_layout()
plt.show()

In [None]:
# Add rolling averages to detect change visually
df.set_index("Date")["Incoming Calls"].rolling(window=30).mean().plot(
    figsize=(12, 6)
)
plt.title("30-Day Rolling Average of Call Volume")
plt.legend()
plt.show()