# Merged Jupyter Notebook: arXiv Usage and Posting Analysis

# Section 1: Introduction

"""
This notebook summarizes the key findings from three separate notebooks:
1. `data-collection.ipynb`: collected daily usage statistics from arXiv.
2. `analysis.ipynb`: investigated whether arXiv usage depends on the day of the week.
3. `correlation.ipynb`: examined the correlation between arXiv usage and the number of postings per day.

We aim to understand whether usage follows a weekly (day-of-week) pattern and how it relates to posting behavior, which may inform future modeling and recommendations.
"""

# Section 2: Data Loading

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr
import statsmodels.api as sm

# Load data
usage = pd.read_parquet("../../data/arxiv-usage.parquet")
totals = pd.read_parquet("../../data/arxiv-totals.parquet")

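# Preprocessing: name the two usage columns and index the table by calendar date
usage.columns = ["date", "connections"]
usage["date"] = pd.to_datetime(usage["date"])
usage = usage.set_index("date").sort_index()

# Restrict postings to the study window (label slicing is only reliable on a sorted index)
totals.index = pd.to_datetime(totals.index)
totals = totals.sort_index().loc["2024-01-01":"2025-04-10"]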


# Section 3: Usage Analysis

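# Daily usage distribution by day of week, plotted in calendar order
# (without `order`, the violins follow the order of first appearance in the data)
weekday_order = [
    "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"
]
usage["weekday"] = usage.index.day_name()
sns.violinplot(data=usage, x="weekday", y="connections", order=weekday_order)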
plt.title("Daily arXiv Connections by Weekday")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
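
# A numeric companion to the violin plot. This is a minimal sketch, not taken
# from the source notebooks: `weekday_summary` is a hypothetical helper that
# assumes a DataFrame with a DatetimeIndex and a "connections" column, as
# built in Section 2. It tabulates count, mean, median and standard deviation
# per weekday so the visual impression can be checked against numbers.
import pandas as pd


def weekday_summary(df, value_col="connections"):
    """Return per-weekday descriptive statistics, ordered Monday to Sunday."""
    order = [
        "Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday",
    ]
    # Group rows by the weekday name derived from the DatetimeIndex
    stats = df.groupby(df.index.day_name())[value_col].agg(
        ["count", "mean", "median", "std"]
    )
    # Reindex so rows follow the calendar week; absent weekdays become NaN rows
    return stats.reindex(order)


# Example call on the notebook's data:
# print(weekday_summary(usage).round(1))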


# Section 4: Weekday Effect Hypothesis Test

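# One-hot encode weekdays. drop_first=True leaves one weekday out as the
# baseline (the alphabetically first level, Friday when all seven days are
# present); keeping all seven dummies plus a constant would make the design
# matrix perfectly collinear.
# dtype=float avoids the boolean columns that recent pandas versions return.
weekday_dummies = pd.get_dummies(usage["weekday"], drop_first=True, dtype=float)
usage_with_dummies = pd.concat([usage["connections"], weekday_dummies], axis=1)

# OLS regression with weekday predictors. Each coefficient is the mean
# difference from the baseline weekday; the F-statistic in the summary is the
# joint test for any weekday effect.
X = usage_with_dummies.drop(columns=["connections"])
X = sm.add_constant(X)
y = usage_with_dummies["connections"]
model = sm.OLS(y, X).fit()
print(model.summary())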
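
# The hypothesis "weekday has no effect" is a joint statement about all
# weekdays at once. A minimal sketch of that joint test, using a one-way ANOVA
# (scipy.stats.f_oneway), which yields the same F-statistic as the overall
# F-test of an OLS fit on weekday dummies with an intercept.
# Assumptions, not from the source notebooks: `weekday_anova` and
# `weekday_anova_by_month` are hypothetical helpers; the per-month variant
# pools the same calendar month across years, which may differ from how the
# month-level results quoted in the conclusion were produced.
import pandas as pd
from scipy.stats import f_oneway


def weekday_anova(df, value_col="connections"):
    """One-way ANOVA of value_col across weekdays; returns (F, p_value)."""
    grouped = df[value_col].groupby(df.index.day_name())
    samples = [g.dropna().to_numpy() for _, g in grouped]
    samples = [s for s in samples if len(s) > 0]  # skip weekdays with no data
    f_stat, p_value = f_oneway(*samples)
    return float(f_stat), float(p_value)


def weekday_anova_by_month(df, value_col="connections"):
    """Repeat the weekday ANOVA within each calendar month (1 to 12)."""
    rows = {}
    for month, chunk in df.groupby(df.index.month):
        rows[month] = weekday_anova(chunk, value_col)
    return pd.DataFrame.from_dict(rows, orient="index", columns=["F", "p_value"])


# Example calls on the notebook's data:
# print(weekday_anova(usage))
# print(weekday_anova_by_month(usage))
#
# Note on reading twelve monthly p-values: if the tests were independent and
# there were no weekday effect at all, the chance of at least one month
# falling below 0.05 would be 1 - 0.95**12, roughly 46%, so a single
# significant month is weak evidence on its own.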

# Section 5: Correlation with Posting Activity

# Create daily totals (row sums)
totals_sum_df = pd.DataFrame({"total_appearances": totals.sum(axis=1)})

# Align with usage: an inner join on the date index keeps only days present in both tables
df_corr = pd.merge(
    totals_sum_df, usage[["connections"]],
    left_index=True, right_index=True, how="inner",
)

# Correlation
r, p_value = pearsonr(df_corr["total_appearances"], df_corr["connections"])
print(f"Pearson correlation: {r:.4f}, p-value: {p_value:.4e}")
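
# A point estimate and p-value say little about how precisely r is known or
# whether it is driven by a few extreme days. A minimal sketch of two common
# checks: an approximate confidence interval for Pearson's r via the Fisher
# z-transform, and Spearman's rank correlation as an outlier-robust companion.
# `pearson_fisher_ci` and `correlation_report` are hypothetical helpers, not
# part of the source notebooks; the interval assumes roughly bivariate-normal,
# independent observations, which daily time series may violate.
import numpy as np
from scipy.stats import norm, pearsonr, spearmanr


def pearson_fisher_ci(r, n, alpha=0.05):
    """Approximate (1 - alpha) confidence interval for Pearson r from n pairs."""
    z = np.arctanh(r)                     # Fisher z-transform of r
    half_width = norm.ppf(1 - alpha / 2) / np.sqrt(n - 3)
    return float(np.tanh(z - half_width)), float(np.tanh(z + half_width))


def correlation_report(x, y, alpha=0.05):
    """Pearson r with Fisher CI plus Spearman rho, after dropping NaN pairs."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = ~(np.isnan(x) | np.isnan(y))
    x, y = x[keep], y[keep]
    r, p_r = pearsonr(x, y)
    rho, p_rho = spearmanr(x, y)
    low, high = pearson_fisher_ci(r, len(x), alpha)
    return {
        "n": int(len(x)),
        "pearson_r": float(r), "pearson_p": float(p_r),
        "ci_low": low, "ci_high": high,
        "spearman_rho": float(rho), "spearman_p": float(p_rho),
    }


# Example call on the notebook's data:
# print(correlation_report(df_corr["total_appearances"], df_corr["connections"]))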

# Scatterplot
sns.scatterplot(
    x="total_appearances", y="connections", data=df_corr, alpha=0.6
)
plt.title("arXiv Connections vs Total Postings")
plt.xlabel("Total Appearances")
plt.ylabel("Connections")
plt.tight_layout()
plt.show()

# Section 6: Conclusion

"""
Key Findings:
- Usage is relatively stable across weekdays: the weekday effect is not statistically significant (p > 0.05) in any month except July.
- There is a statistically significant but weak positive correlation (r = 0.2482, p ≈ 5.7e-08) between daily arXiv usage and total article postings. With r² ≈ 0.06, postings account for only about 6% of the variance in daily connections.

This suggests that usage shows little day-of-week seasonality and only weakly reflects submission volumes, so it can be used in forecasting or recommendation systems only with caution.
"""
