In [1]:
# Load required libraries
# tidyverse attaches dplyr, ggplot2, purrr, readr, and lubridate (see startup message)
library(tidyverse)
library(janitor)
library(skimr)

# Source helper scripts
source("../../R/apply_factors.R")
source("../../R/analysis_helpers.R")
source("../../R/temporal_helpers.R")

# Load data
tables <- list(
  Orders  = readr::read_csv("../../data/processed/Orders.csv"),
  Returns = readr::read_csv("../../data/processed/Returns.csv"),
  People  = readr::read_csv("../../data/processed/People.csv")
)

# Apply factor transformations
tables <- apply_factors(tables)

# Extract tables
orders  <- tables$Orders
returns <- tables$Returns
people  <- tables$People

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.2.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Attaching package: ‘janitor’


The following objects are masked from ‘package:stats’:

    chisq.test, fisher.test


Rows: 51290 Columns:

# Pre/Post 2013 Performance

**Observed**

Sales and profit appear to exhibit a step change around 2013.

**Inference Question**

> Was there a statistically significant change in profitability after 2013?

**Method**

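Linear regression of profit on a post-2013 indicator:
- Outcome: profit
- Predictor: post_2013 (1 for orders placed in 2013 or later, 0 otherwise)

This framing treats 2013 as a quasi-experimental breakpoint: the `post_2013` coefficient estimates the difference in mean profit per row of `orders` between the two periods.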

In [3]:
# Derive the order year and a 0/1 indicator for orders placed in 2013 or later
orders <- orders |>
    mutate(
        year = lubridate::year(order_date),
        post_2013 = as.integer(year >= 2013)
    )

post2013_lm <- lm(
    profit ~ post_2013,
    data = orders
)

summary(post2013_lm)


Call:
lm(formula = profit ~ post_2013, data = orders)

Residuals:
    Min      1Q  Median      3Q     Max 
-6629.1   -28.9   -19.4     8.2  8370.8 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   27.874      1.235  22.577   <2e-16 ***
post_2013      1.258      1.580   0.796    0.426    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 174.4 on 51288 degrees of freedom
Multiple R-squared:  1.236e-05,	Adjusted R-squared:  -7.14e-06 
F-statistic: 0.6338 on 1 and 51288 DF,  p-value: 0.426
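
With a single 0/1 predictor, a regression like this is equivalent to a pooled-variance two-sample t-test, and the coefficient is easier to judge next to a confidence interval than from a p-value alone. The sketch below illustrates both points on synthetic data; the `toy` data frame, its group sizes, and the simulated effect are assumptions for illustration, not values taken from `orders`.

```r
set.seed(42)

# Synthetic stand-in for the pre/post split (hypothetical sizes and effect)
toy <- data.frame(post_2013 = rep(c(0L, 1L), times = c(400, 600)))
toy$profit <- 28 + 1.2 * toy$post_2013 + rnorm(nrow(toy), sd = 170)

fit <- lm(profit ~ post_2013, data = toy)

# 95% confidence interval for the post-2013 coefficient
ci <- confint(fit, "post_2013", level = 0.95)

# The coefficient is the difference in group means
mean_gap <- mean(toy$profit[toy$post_2013 == 1]) -
    mean(toy$profit[toy$post_2013 == 0])

# Same p-value as a two-sample t-test with equal variances assumed
tt   <- t.test(profit ~ post_2013, data = toy, var.equal = TRUE)
p_lm <- summary(fit)$coefficients["post_2013", "Pr(>|t|)"]

print(ci)
print(c(lm = p_lm, t_test = tt$p.value))
```

Running the same `confint()` call on `post2013_lm` would give the interval for the model fitted in this notebook.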


# Year-over-Year Growth

**Observed**

Some years exhibit stronger sales growth than others.

**Inference Question**

> Is long-run sales growth accelerating or decelerating over time?

**Method**

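Trend regression on log-transformed sales:
- Outcome: log1p(sales), i.e. log(1 + sales)
- Predictor: year

This model captures the average long-run growth trajectory: the `year` coefficient is the average change in log sales per year. A single linear term cannot show acceleration or deceleration on its own; that would require a curvature term such as year squared.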
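
One way to probe acceleration directly is to add a squared year term and compare the nested models. The following is a minimal sketch on synthetic data, not the notebook's analysis; the `toy` data frame, the 2011 to 2014 year range, and the centered variable `year_c` are assumptions introduced for illustration.

```r
set.seed(1)

# Synthetic row-level sales with a constant growth rate (hypothetical values)
toy <- data.frame(year = sample(2011:2014, 2000, replace = TRUE))
toy$year_c <- toy$year - mean(toy$year)   # centering reduces collinearity with the square
toy$sales  <- exp(5 + 0.05 * toy$year_c + rnorm(nrow(toy), sd = 1.4))

linear_fit <- lm(log1p(sales) ~ year_c, data = toy)
curved_fit <- lm(log1p(sales) ~ year_c + I(year_c^2), data = toy)

# F-test for the added curvature term: a positive, significant coefficient on
# I(year_c^2) would point to acceleration, a negative one to deceleration
curvature_test <- anova(linear_fit, curved_fit)
print(curvature_test)
```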

In [2]:
sales_trend_lm <- lm(
    log1p(sales) ~ year,
    data = orders
)
summary(sales_trend_lm)


Call:
lm(formula = log1p(sales) ~ year, data = orders)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.1430 -1.0540 -0.0568  1.0180  5.5135 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.890273  11.515596   0.598    0.550
year        -0.001182   0.005721  -0.207    0.836

Residual standard error: 1.424 on 51288 degrees of freedom
Multiple R-squared:  8.317e-07,	Adjusted R-squared:  -1.867e-05 
F-statistic: 0.04266 on 1 and 51288 DF,  p-value: 0.8364
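
Because the outcome is on a log scale, the `year` coefficient can be read as an approximate proportional change per year. The sketch below uses the estimate and standard error printed in the summary; the normal-approximation interval and the reading as a change in 1 + sales (a consequence of `log1p`) are simplifying assumptions.

```r
# Estimate and standard error for `year`, copied from the summary output
b_year  <- -0.001182
se_year <- 0.005721

# Approximate 95% Wald interval on the log scale
ci_log <- b_year + c(-1, 1) * qnorm(0.975) * se_year

# Convert to percent change per year in (1 + sales)
pct_change <- 100 * expm1(b_year)
pct_ci     <- 100 * expm1(ci_log)

print(round(c(estimate = pct_change, lower = pct_ci[1], upper = pct_ci[2]), 3))
```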


# Return Rate Over Time

**Observed**

Return rates vary across years.

**Inference Question**

> Is return risk increasing or decreasing over time?

**Method**

Logistic regression:
- Binary outcome: returned
- Predictor: year

In [5]:
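# Flag returned orders. Note: the many-to-many join adds a row for every
# duplicate order_id match in `returns`, so `orders` grows here
# (the model below is fit on 51,295 rows versus 51,290 before the join).
orders <- orders |>
    left_join(
        returns |>
            mutate(returned = 1L),
        by = "order_id",
        relationship = "many-to-many"
    ) |>
    mutate(
        # Rows without a match in `returns` were not returned
        returned = ifelse(is.na(returned), 0L, returned)
    )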

return_time_logit <- glm(
    returned ~ year,
    data = orders,
    family = binomial(link = "logit")
)

summary(return_time_logit)


Call:
glm(formula = returned ~ year, family = binomial(link = "logit"), 
    data = orders)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept) 57.70735   34.01044   1.697   0.0897 .
year        -0.03004    0.01690  -1.778   0.0754 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 23159  on 51294  degrees of freedom
Residual deviance: 23156  on 51293  degrees of freedom
AIC: 23160

Number of Fisher Scoring iterations: 5
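
Logit coefficients are on the log-odds scale, so exponentiating the `year` estimate gives the multiplicative change in the odds of a return per additional year. The sketch below uses the estimate and standard error printed in the summary and a Wald interval; a profile-likelihood interval from the fitted model could differ slightly.

```r
# Estimate and standard error for `year`, copied from the summary output
b_year  <- -0.03004
se_year <- 0.01690

# Odds ratio per additional year
odds_ratio <- exp(b_year)

# Approximate 95% Wald interval, computed on the log-odds scale and exponentiated
or_ci <- exp(b_year + c(-1, 1) * qnorm(0.975) * se_year)

print(round(c(odds_ratio = odds_ratio, lower = or_ci[1], upper = or_ci[2]), 4))
```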


# Discount Intensity Over Time

**Observed**

Discount usage appears to increase over time.

**Inference Question**

> Is discount intensity trending upward over time?

**Method**

Linear regression:
- Outcome: discount
- Predictor: year

In [4]:
discount_trend_lm <- lm(
    discount ~ year,
    data = orders
)
summary(discount_trend_lm)


Call:
lm(formula = discount ~ year, data = orders)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.14493 -0.14265 -0.14152  0.05848  0.70621 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.4345113  1.7167963   1.418    0.156
year        -0.0011385  0.0008529  -1.335    0.182

Residual standard error: 0.2123 on 51288 degrees of freedom
Multiple R-squared:  3.474e-05,	Adjusted R-squared:  1.524e-05 
F-statistic: 1.782 on 1 and 51288 DF,  p-value: 0.1819
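
A per-year summary is a useful cross-check on a trend slope like this one. With `year` as the only predictor, the row-level least-squares fit has the same coefficients as a regression of the yearly means weighted by the number of rows per year. The sketch below shows this on synthetic data; the `toy` data frame, the year range, and the discount levels are assumptions for illustration.

```r
set.seed(7)

# Synthetic discounts with many zeros (hypothetical levels and probabilities)
toy <- data.frame(year = sample(2011:2014, 3000, replace = TRUE))
toy$discount <- sample(
    c(0, 0.2, 0.4, 0.7), nrow(toy),
    replace = TRUE, prob = c(0.6, 0.2, 0.15, 0.05)
)

# Mean discount and row count per year (both ordered by year)
by_year   <- aggregate(discount ~ year, data = toy, FUN = mean)
by_year$n <- as.vector(table(toy$year))

# Row-level fit versus count-weighted fit on the yearly means
row_fit  <- lm(discount ~ year, data = toy)
mean_fit <- lm(discount ~ year, data = by_year, weights = n)

print(by_year)
print(rbind(row_level = coef(row_fit), yearly_means = coef(mean_fit)))
```

Printing `by_year` alongside the slope makes it easier to see whether a flat trend reflects stable yearly means or offsetting movements between years.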
