In [1]:
# Load required libraries
# (tidyverse 2.0.0 already attaches dplyr, ggplot2, purrr, and lubridate)
library(tidyverse)
library(janitor)
library(skimr)

# Source helper scripts
source("../../R/apply_factors.R")
source("../../R/analysis_helpers.R")
source("../../R/temporal_helpers.R")

# Load data
tables <- list(
  Orders  = readr::read_csv("../../data/processed/Orders.csv"),
  Returns = readr::read_csv("../../data/processed/Returns.csv"),
  People  = readr::read_csv("../../data/processed/People.csv")
)

# Apply factor transformations
tables <- apply_factors(tables)

# Extract tables
orders  <- tables$Orders
returns <- tables$Returns
people  <- tables$People

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.2.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Attaching package: ‘janitor’


The following objects are masked from ‘package:stats’:

    chisq.test, fisher.test


Rows: 51290 Columns:

# Profit Margin by Region

**Inference question**

Do average profit margins differ across regions?

**Model**

One-way ANOVA:

$\text{margin}_i = \mu + \alpha_{\text{region}(i)} + \varepsilon_i$

$H_0$: All regional mean margins are equal.

In [2]:
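# Profit margin by region
# Margin is undefined when sales are not positive, so those rows become NA
# (aov() drops NA rows by default)
orders <- orders %>%
  mutate(margin = if_else(sales > 0, profit / sales, NA_real_))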
anova_margin_region <- aov(margin ~ region, data = orders)
summary(anova_margin_region)

               Df Sum Sq Mean Sq F value Pr(>F)    
region         12    646   53.84   263.5 <2e-16 ***
Residuals   51277  10476    0.20                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
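
The F test shows that regional means differ, but not by how much. One common effect-size summary is eta-squared, the share of total variation in margin attributable to region. The sketch below plugs in the rounded sums of squares printed in the table above. `eta_squared()` is a hypothetical helper (it is not part of the sourced scripts) that computes the same quantity from any one-way `aov` fit, demonstrated here on R's built-in `PlantGrowth` data rather than on `orders`.

```r
# Sums of squares copied from the printed ANOVA table (rounded values)
ss_region   <- 646
ss_residual <- 10476

# Eta-squared: between-group sum of squares over total sum of squares
eta_sq <- ss_region / (ss_region + ss_residual)
round(eta_sq, 3)

# Hypothetical helper: the same ratio computed from a fitted one-way aov object
eta_squared <- function(fit) {
  ss <- summary(fit)[[1]][["Sum Sq"]]  # first entry = factor, last = residuals
  ss[1] / sum(ss)
}

# Demo on a built-in dataset; in this notebook the call would be
# eta_squared(anova_margin_region)
eta_demo <- eta_squared(aov(weight ~ group, data = PlantGrowth))
eta_demo
```

With the printed values the ratio is roughly 0.058, so region accounts for about 6% of the variance in margin: a modest effect despite the very small p-value.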

The ANOVA rejects the null hypothesis ($F = 263.5$ on 12 and 51,277 degrees of freedom, $p < 2 \times 10^{-16}$), so we run Tukey's HSD post-hoc test to identify which pairs of regions differ significantly.

In [3]:
TukeyHSD(anova_margin_region)

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = margin ~ region, data = orders)

$region
                                    diff           lwr          upr     p adj
Canada-Africa                0.391147682  0.3116029470  0.470692418 0.0000000
Caribbean-Africa             0.224591249  0.1819835689  0.267198928 0.0000000
Central-Africa               0.203824483  0.1775480440  0.230100922 0.0000000
Central Asia-Africa          0.290577081  0.2507837165  0.330370446 0.0000000
East-Africa                  0.310874680  0.2751535112  0.346595850 0.0000000
EMEA-Africa                  0.002417171 -0.0281539651  0.032988308 1.0000000
North-Africa                 0.268892533  0.2379518178  0.299833247 0.0000000
North Asia-Africa            0.323201491  0.2851525297  0.361250453 0.0000000
Oceania-Africa               0.227240799  0.1935994373  0.260882161 0.0000000
South-Africa                 0.197457677  0.1687144035  0.226200950 0.0000000
Southe
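
With 13 regions, Tukey's HSD produces `choose(13, 2)` = 78 pairwise comparisons, which is hard to scan as raw console output. A minimal sketch of one way to tidy the result: convert the comparison matrix to a data frame and keep only the significant pairs, sorted by the size of the difference. It runs on the built-in `PlantGrowth` data so that it is self-contained; in this notebook the fit would be `anova_margin_region` and the list element `$region`.

```r
# Demo fit on a built-in dataset (stand-in for anova_margin_region)
fit   <- aov(weight ~ group, data = PlantGrowth)
tukey <- TukeyHSD(fit)

# TukeyHSD() returns one matrix per factor; row names hold the pair labels
tukey_df <- as.data.frame(tukey$group)
tukey_df$pair <- rownames(tukey_df)
names(tukey_df) <- c("diff", "lwr", "upr", "p_adj", "pair")
rownames(tukey_df) <- NULL

# Keep pairs significant at the 5% family-wise level, largest gap first
sig <- tukey_df[tukey_df$p_adj < 0.05, ]
sig <- sig[order(-abs(sig$diff)), c("pair", "diff", "lwr", "upr", "p_adj")]
sig
```

Pairs whose interval `[lwr, upr]` contains zero, such as EMEA-Africa in the output above, drop out of the filtered table.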

# Return Rate by Region / Market

**Inference question**

Is the probability of an order being returned independent of region?

**Model**

$H_0$: $\text{returned} \perp \text{region}$ (return status is independent of region)

In [4]:
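# Flag each order line as returned (1) or not (0)
# NOTE: relationship = "many-to-many" silences dplyr's join warning; if Returns
# repeats an order_id, the matching order rows are duplicated by this join.
orders <- orders %>%
  left_join(
    returns %>% mutate(returned = 1L),
    by = "order_id",
    relationship = "many-to-many"
  ) %>%
  mutate(returned = coalesce(returned, 0L))

# Chi-square test of independence
return_region_table <- table(orders$returned, orders$region)
chisq.test(return_region_table)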


	Pearson's Chi-squared test

data:  return_region_table
X-squared = 2140, df = 12, p-value < 2.2e-16
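
The chi-squared statistic grows with sample size, so a small p-value alone does not show how strongly returns depend on region. Cramér's V rescales the statistic to the range 0 to 1. The sketch below defines `cramers_v()`, a hypothetical helper that is not part of the sourced scripts, and checks it on a small made-up 2 × 2 table. It calls `stats::chisq.test()` explicitly because `janitor` masks `chisq.test()` in this session.

```r
# Hypothetical helper: Cramér's V for a two-way contingency table
cramers_v <- function(tab) {
  # correct = FALSE disables the Yates correction that R applies to 2 x 2 tables
  test <- stats::chisq.test(tab, correct = FALSE)
  n <- sum(tab)              # total number of observations
  k <- min(dim(tab)) - 1     # min(rows - 1, columns - 1)
  unname(sqrt(test$statistic / (n * k)))
}

# Made-up counts (not from the Orders data): rows = returned, columns = region
toy_table <- matrix(
  c(90, 10, 70, 30),
  nrow = 2,
  dimnames = list(returned = c("0", "1"), region = c("A", "B"))
)

v <- cramers_v(toy_table)
v
```

In this notebook the call would be `cramers_v(return_region_table)`.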


To quantify regional risk differences, we also estimate a logistic regression model with region as a predictor.

In [5]:
# Logistic regression: return probability by region
return_region_logit <- glm(  
    returned ~ region,  
    data = orders,  
    family = binomial(link = "logit")
)
summary(return_region_logit)


Call:
glm(formula = returned ~ region, family = binomial(link = "logit"), 
    data = orders)

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)
(Intercept)          -1.957e+01  1.588e+02  -0.123    0.902
regionCanada         -3.596e-08  5.713e+02   0.000    1.000
regionCaribbean       1.630e+01  1.588e+02   0.103    0.918
regionCentral         1.672e+01  1.588e+02   0.105    0.916
regionCentral Asia    1.630e+01  1.588e+02   0.103    0.918
regionEast            1.668e+01  1.588e+02   0.105    0.916
regionEMEA           -3.585e-08  2.196e+02   0.000    1.000
regionNorth           1.763e+01  1.588e+02   0.111    0.912
regionNorth Asia      1.802e+01  1.588e+02   0.113    0.910
regionOceania         1.645e+01  1.588e+02   0.104    0.917
regionSouth           1.669e+01  1.588e+02   0.105    0.916
regionSoutheast Asia  1.660e+01  1.588e+02   0.105    0.917
regionWest            1.785e+01  1.588e+02   0.112    0.910

(Dispersion parameter for binomial family taken t
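
The coefficient table shows an intercept near -19.6 with standard errors around 158.8 on almost every term. That pattern is the usual signature of quasi-complete separation: the reference region (Africa) appears to have no returned orders, so its log-odds diverge toward minus infinity and the Wald tests become uninformative. Canada and EMEA, whose coefficients are essentially zero relative to Africa, look similar. The sketch below reproduces the situation on synthetic data (the counts are invented for illustration) and shows two hedged first steps: inspect the raw return rates, then refit with a reference region that does have returns.

```r
# Synthetic data (not the notebook's): region A has no returns at all
toy <- data.frame(
  region   = factor(rep(c("A", "B", "C"), each = 200)),
  returned = c(rep(0L, 200),                # A:  0 of 200 returned
               rep(c(1L, 0L), c(10, 190)),  # B: 10 of 200 returned
               rep(c(1L, 0L), c(20, 180)))  # C: 20 of 200 returned
)

# Step 1: raw return rate per region; a rate of exactly 0 or 1 signals separation
rates <- tapply(toy$returned, toy$region, mean)
rates

# Step 2: use a region with observed returns as the reference level
toy$region <- relevel(toy$region, ref = "C")
fit <- glm(returned ~ region, data = toy, family = binomial(link = "logit"))

# The intercept is now the finite log-odds of a return in region C
intercept <- unname(coef(fit)["(Intercept)"])
intercept
```

Releveling does not remove the separation: the coefficient for the zero-return region still diverges. It only makes the remaining contrasts interpretable. Penalized approaches such as Firth's logistic regression are a common remedy and are not shown here.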

# Discount Sensitivity of Profit by Region

**Inference question**

Does the effect of discounting on profit vary across regions?

**Model**

$\text{profit}_i = \beta_0 + \beta_1\,\text{discount}_i + \beta_{2,\text{region}(i)} + \beta_{3,\text{region}(i)}\,\text{discount}_i + \varepsilon_i$

The interaction term captures regional differences in discount sensitivity.
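
To make the interaction concrete: under R's default treatment coding, the discount slope for the reference region is the `discount` coefficient, and the slope for any other region is that coefficient plus the matching `discount:region` term. A minimal sketch on synthetic data, where the region labels, slopes, and noise level are all invented for illustration:

```r
set.seed(42)

# Synthetic data with a different true discount slope in each region
n <- 300
toy <- data.frame(
  region   = factor(rep(c("A", "B", "C"), each = n / 3)),
  discount = runif(n, min = 0, max = 0.8)
)
true_slope <- c(A = -100, B = -200, C = -50)
toy$profit <- 50 +
  unname(true_slope[as.character(toy$region)]) * toy$discount +
  rnorm(n, sd = 5)

fit <- lm(profit ~ discount * region, data = toy)
b   <- coef(fit)

# Region-specific slopes: baseline slope plus the interaction coefficient
slopes <- c(
  A = unname(b["discount"]),
  B = unname(b["discount"] + b["discount:regionB"]),
  C = unname(b["discount"] + b["discount:regionC"])
)
round(slopes, 1)

# Joint F test: do the slopes differ across regions at all?
slope_test <- anova(lm(profit ~ discount + region, data = toy), fit)
slope_test
```

The nested-model F test at the end assesses all interaction terms at once, which is usually more informative than reading the individual interaction p-values one by one.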


In [6]:
# Interaction regression: discount × region
discount_region_lm <- lm(
    profit ~ discount * region,
    data = orders
)
summary(discount_region_lm)


Call:
lm(formula = profit ~ discount * region, data = orders)

Residuals:
    Min      1Q  Median      3Q     Max 
-6469.0   -51.3   -25.6    32.7  8335.0 

Coefficients: (1 not defined because of singularities)
                               Estimate Std. Error t value Pr(>|t|)    
(Intercept)                     49.9511     2.7649  18.066  < 2e-16 ***
discount                      -195.1224     8.3653 -23.325  < 2e-16 ***
regionCanada                    -3.5516     8.8566  -0.401   0.6884    
regionCaribbean                 11.5672     5.8947   1.962   0.0497 *  
regionCentral                   15.0746     3.3514   4.498 6.87e-06 ***
regionCentral Asia              43.2226     4.8003   9.004  < 2e-16 ***
regionEast                      24.9316     4.7538   5.245 1.57e-07 ***
regionEMEA                      -0.9937     3.9471  -0.252   0.8012    
regionNorth                     15.5633     3.8883   4.003 6.27e-05 ***
regionNorth Asia                40.0096     4.5395   8.814  < 2e-16