In [2]:
# Load required libraries
# tidyverse already attaches dplyr, ggplot2, purrr, and lubridate,
# so only the packages outside the core set are loaded separately
library(tidyverse)
library(janitor)
library(skimr)

# Source helper scripts
source("../../R/apply_factors.R")
source("../../R/analysis_helpers.R")
source("../../R/temporal_helpers.R")

# Load data
tables <- list(
  Orders  = readr::read_csv("../../data/processed/Orders.csv"),
  Returns = readr::read_csv("../../data/processed/Returns.csv"),
  People  = readr::read_csv("../../data/processed/People.csv")
)

# Apply factor transformations
tables <- apply_factors(tables)

# Extract tables
orders  <- tables$Orders
returns <- tables$Returns
people  <- tables$People

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.2.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Attaching package: ‘janitor’


The following objects are masked from ‘package:stats’:

    chisq.test, fisher.test


Rows: 51290 Columns:

## Profitability Differences Across Segments

We begin by evaluating whether **average profit margins differ across customer segments** (Consumer, Corporate, Home Office). Profit margin is defined as:

$$
\text{Margin}_i = \frac{\text{Profit}_i}{\text{Sales}_i}
$$

Because this is a comparison of mean outcomes across more than two groups, we apply a **one-way ANOVA** framework.

**Hypotheses:**

- $H_0$: Mean profit margins are equal across segments  
- $H_A$: At least one segment has a different mean profit margin  

In [3]:
orders <- orders %>%
    # Margin per order line; non-finite values (for example from zero sales) are dropped
    mutate(margin = profit / sales) %>%
    filter(is.finite(margin))

segment_margin_aov <- aov(margin ~ segment, data = orders)
summary(segment_margin_aov)

               Df Sum Sq Mean Sq F value Pr(>F)
segment         2      0  0.1699   0.783  0.457
Residuals   51287  11122  0.2168               
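
As a quick check on how the columns of this table relate, the F statistic is the ratio of the two mean squares, and the p-value is the upper tail of an F distribution with the listed degrees of freedom. The sketch below uses the rounded values as printed, so small rounding differences from the table are expected.

```r
# Values copied from the ANOVA table above (rounded as printed)
ms_segment  <- 0.1699
ms_residual <- 0.2168
df_segment  <- 2
df_residual <- 51287

# F statistic: between-segment mean square over residual mean square
f_stat <- ms_segment / ms_residual

# Upper-tail probability under the null hypothesis of equal segment means
p_value <- pf(f_stat, df_segment, df_residual, lower.tail = FALSE)

round(c(F = f_stat, p = p_value), 3)
```

The residual degrees of freedom also imply 51287 + 2 + 1 = 51290 rows, matching the row count reported when the data was loaded.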

## Segment Differences in Return Behavior

The ANOVA gives no evidence that mean margins differ across segments ($F = 0.783$, $p = 0.457$), but segments may still differ in **operational behavior**, particularly in return rates. To test this, we model return probability using a **logistic regression**, where the dependent variable indicates whether an order was returned.
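
Building the return indicator requires joining Orders to Returns, and that join should not change the number of order lines. The model summary in this section reports 51294 null degrees of freedom (51295 rows), five more than the 51290 rows used in the ANOVA, which would be consistent with a few `order_id` values appearing more than once in Returns. The sketch below uses made-up toy data (the values and the `*_toy` names are hypothetical) to show the effect and one possible row-preserving alternative. It is not the join used in this notebook.

```r
library(dplyr)

# Toy data: order B has two order lines and is listed twice in the returns table
orders_toy  <- data.frame(order_id = c("A", "B", "B", "C"))
returns_toy <- data.frame(order_id = c("B", "B"))

# Joining as-is duplicates the order lines of B (2 lines x 2 matches = 4 rows)
naive <- orders_toy %>%
  left_join(
    returns_toy %>% mutate(returned = 1L),
    by = "order_id",
    relationship = "many-to-many"
  )

# Row-preserving alternative: keep one row per returned order_id before joining
returned_ids <- returns_toy %>%
  distinct(order_id) %>%
  mutate(returned = 1L)

safe <- orders_toy %>%
  left_join(returned_ids, by = "order_id") %>%
  mutate(returned = coalesce(returned, 0L))

c(original = nrow(orders_toy), naive = nrow(naive), safe = nrow(safe))
```

Comparing `nrow()` before and after the join is a cheap way to confirm that each order line is counted once.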

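The model specification is:

$$
\Pr(\text{Return}_i = 1) = \text{logit}^{-1}\left(\beta_0 + \beta_1 \cdot \mathbb{1}[\text{Segment}_i = \text{Corporate}] + \beta_2 \cdot \mathbb{1}[\text{Segment}_i = \text{Home Office}]\right)
$$

Segment is categorical, so it enters the model as indicator variables with Consumer as the reference level. Exponentiating $\beta_1$ and $\beta_2$ gives the **odds of a return relative to Consumer** for each of the other two segments.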


In [4]:
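orders <- orders %>%
  left_join(
    # Flag every row of Returns, then attach the flag to matching order lines
    returns %>%
      mutate(returned = 1L),
    by = "order_id",
    # NOTE: "many-to-many" silences dplyr's multiple-match warning; if Returns
    # repeats an order_id, the matching order lines are duplicated by this join
    relationship = "many-to-many"
  ) %>%
  # Order lines with no match in Returns were not returned
  mutate(returned = ifelse(is.na(returned), 0L, returned))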

segment_return_logit <- glm(
  returned ~ segment,
  data = orders,
  family = binomial(link = "logit")
)

summary(segment_return_logit)



Call:
glm(formula = returned ~ segment, family = binomial(link = "logit"), 
    data = orders)

Coefficients:
                   Estimate Std. Error  z value Pr(>|z|)    
(Intercept)        -2.75636    0.02591 -106.376   <2e-16 ***
segmentCorporate    0.05568    0.04208    1.323   0.1857    
segmentHome Office -0.11675    0.05277   -2.213   0.0269 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 23159  on 51294  degrees of freedom
Residual deviance: 23150  on 51292  degrees of freedom
AIC: 23156

Number of Fisher Scoring iterations: 5
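
The coefficients above are on the log-odds scale, so they need to be exponentiated before they can be read as relative odds. The sketch below works from the estimates and standard errors printed in the summary, and the intervals are approximate Wald intervals rather than profile-likelihood intervals.

```r
# Estimates and standard errors copied from the model summary above
intercept <- -2.75636
est <- c("Corporate" = 0.05568, "Home Office" = -0.11675)
se  <- c("Corporate" = 0.04208, "Home Office" = 0.05277)

# Baseline return probability for the Consumer reference level
p_consumer <- plogis(intercept)

# Odds ratios relative to Consumer
odds_ratio <- exp(est)

# Approximate 95% Wald intervals: built on the log-odds scale, then exponentiated
z <- qnorm(0.975)
ci_lower <- exp(est - z * se)
ci_upper <- exp(est + z * se)

round(p_consumer, 3)
round(cbind(odds_ratio, ci_lower, ci_upper), 3)
```

With the fitted object in memory, `exp(coef(segment_return_logit))` gives the same odds ratios without retyping the numbers.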


## Discount Sensitivity by Segment

We next evaluate whether **discounting impacts profitability differently across customer segments**. This is tested using a linear regression model that includes **interaction terms between discount level and segment**, allowing the marginal effect of discounts on profit to vary across segments.

The model is specified as:

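$$
\text{Profit}_i
=
\beta_0
+
\beta_1 \cdot \text{Discount}_i
+
\sum_{s} \beta_{2,s} \cdot \mathbb{1}[\text{Segment}_i = s]
+
\sum_{s} \beta_{3,s} \cdot \left(\text{Discount}_i \times \mathbb{1}[\text{Segment}_i = s]\right)
+
\varepsilon_i
$$

where $s$ ranges over Corporate and Home Office, with Consumer as the reference level.

The interaction terms $\beta_{3,s}$ test whether **discount sensitivity differs across segments**: each one is the difference between the discount slope in segment $s$ and the Consumer slope $\beta_1$.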


In [5]:
discount_segment_lm <- lm(
  profit ~ discount * segment,
  data = orders
)

summary(discount_segment_lm)
anova(discount_segment_lm)


Call:
lm(formula = profit ~ discount * segment, data = orders)

Residuals:
    Min      1Q  Median      3Q     Max 
-6483.3   -55.7   -25.1    30.5  8334.5 

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)    
(Intercept)                   65.6447     1.2263  53.529   <2e-16 ***
discount                    -260.3901     4.7818 -54.454   <2e-16 ***
segmentCorporate              -0.1787     2.0207  -0.088    0.930    
segmentHome Office             1.0863     2.3939   0.454    0.650    
discount:segmentCorporate      3.1491     7.8902   0.399    0.690    
discount:segmentHome Office   -2.8157     9.3936  -0.300    0.764    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 165.5 on 51289 degrees of freedom
Multiple R-squared:  0.1001,	Adjusted R-squared:    0.1 
F-statistic:  1141 on 5 and 51289 DF,  p-value: < 2.2e-16


,Df,Sum Sq,Mean Sq,F value,Pr(>F)
,<int>,<dbl>,<dbl>,<dbl>,<dbl>
discount,1,156192700.0,156192700.0,5704.93,0.0
segment,2,3388.866,1694.433,0.06188906,0.9399872
discount:segment,2,9772.031,4886.015,0.1784614,0.8365569
Residuals,51289,1404219000.0,27378.55,,
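
To read the interaction model in terms of per-segment slopes, each segment's discount slope is the reference (Consumer) slope plus that segment's interaction coefficient. The sketch below uses the coefficients printed in the regression summary above. The "10 percentage point" conversion assumes `discount` is stored as a fraction between 0 and 1, which the notebook does not state.

```r
# Coefficients copied from the regression summary above
beta_discount <- -260.3901   # discount slope for Consumer (reference level)
beta_interact <- c("Corporate" = 3.1491, "Home Office" = -2.8157)

# Segment-specific discount slopes: reference slope plus interaction term
discount_slope <- c("Consumer" = beta_discount, beta_discount + beta_interact)

# Implied change in profit per order line for a 10 percentage point higher
# discount (assumes discount is a fraction between 0 and 1)
profit_change_10pt <- 0.10 * discount_slope

round(rbind(discount_slope, profit_change_10pt), 2)
```

The three slopes differ by only a few units against a common slope of roughly -260, which is in line with the non-significant `discount:segment` row in the ANOVA table.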


## Sales Concentration and Product-Level Risk

Average outcomes may mask **concentration risk**, where revenue depends disproportionately on a small number of products. To quantify this risk, we compute the **Gini coefficient** of product-level sales within each customer segment.

The Gini coefficient is defined as:

$$
G = \frac{\sum_{i=1}^{n} (2i - n - 1)\, x_{(i)}}{n \sum_{i=1}^{n} x_i}
$$

where $x_{(i)}$ denotes product-level sales sorted in non-decreasing order.

Higher values of $G$ indicate **greater sales concentration** and higher exposure to product-level operational risk.
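
As a sanity check on the formula, two extreme cases are easy to verify by hand: identical sales across products should give 0, and a single product carrying all sales should give $(n - 1)/n$. The sketch below is for illustration only, and `gini_check` is a hypothetical standalone helper, not the function used in the analysis.

```r
# Direct transcription of the formula above
# (gini_check is a hypothetical name used only for this illustration)
gini_check <- function(x) {
  x <- sort(x)                # x_(i): values in non-decreasing order
  n <- length(x)
  sum((2 * seq_len(n) - n - 1) * x) / (n * sum(x))
}

equal_sales <- c(5, 5, 5, 5)    # every product sells the same amount
one_product <- c(0, 0, 0, 20)   # a single product carries all sales

gini_check(equal_sales)   # 0: no concentration
gini_check(one_product)   # (n - 1) / n = 0.75 for n = 4
```

Because the maximum is $(n - 1)/n$, it is effectively 1 when a segment contains thousands of products.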


In [6]:
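gini <- function(x) {
  # Keep finite, non-negative values only (a bare x >= 0 filter would pass NAs through)
  x <- x[is.finite(x) & x >= 0]
  # Undefined for empty input or zero total sales
  if (length(x) == 0 || sum(x) == 0) return(NA_real_)
  x <- sort(x)
  n <- length(x)
  # Rank-weighted sum of sorted values, scaled by n times total sales
  sum((2 * seq_len(n) - n - 1) * x) / (n * sum(x))
}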

segment_gini <- orders %>%
  group_by(segment, product_name) %>%
  summarise(total_sales = sum(sales), .groups = "drop") %>%
  group_by(segment) %>%
  summarise(
    gini_product_sales = gini(total_sales),
    n_products = n()
  )

segment_gini

segment,gini_product_sales,n_products
<fct>,<dbl>,<int>
Consumer,0.6862046,3635
Corporate,0.6822749,3346
Home Office,0.6792112,2922
