In [1]:
from pathlib import Path
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from scipy import stats

#### Test 1: Impact of Product Variety on Customer Value
##### 1. Business Question
Do customers who purchase a wider variety of products provide more future value to the business? This is a crucial question for inventory and marketing strategy. If product variety is a key driver of value, the business should focus on cross-selling and product discovery.
##### 2. Analytical Approach
Measure "product variety" by counting the number of unique StockCodes each customer has purchased. We will then split customers into two groups: "Low Variety" and "High Variety" (split by the median). Finally, use an independent t-test to determine if there is a statistically significant difference in the average 12-month CLV between these two groups.
##### 3. Hypotheses
- Null Hypothesis (H_0): There is no significant difference in the mean CLV between customers who purchase a low variety of products and those who purchase a high variety.
- Alternative Hypothesis (H_a): There is a significant difference in the mean CLV between the two groups.
##### 4. Significance Level
We use a standard significance level of α = 0.05. If the p-value is less than 0.05, we reject the null hypothesis.

In [2]:
# Load the customer-level RFM/CLV table
DATA_PATH = Path("../data/processed/modeling.csv")
rfm_with_clv = pd.read_csv(DATA_PATH)

# Load the cleaned transaction data and display it
df = pd.read_csv('../data/processed/master_data.csv', parse_dates=['InvoiceDate'])
df

Unnamed: 0,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country,TotalPrice,InvoiceMonth,DayOfWeek,Hour,CohortMonth,CohortIndex
0,489434,85048,15CM CHRISTMAS GLASS BALL 20 LIGHTS,12,2009-12-01 07:45:00,6.95,13085,United Kingdom,83.40,2009-12,Tuesday,7,2009-12,1
1,489434,79323P,PINK CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085,United Kingdom,81.00,2009-12,Tuesday,7,2009-12,1
2,489434,79323W,WHITE CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085,United Kingdom,81.00,2009-12,Tuesday,7,2009-12,1
3,489434,22041,"RECORD FRAME 7"" SINGLE SIZE",27,2009-12-01 07:45:00,2.10,13085,United Kingdom,56.70,2009-12,Tuesday,7,2009-12,1
4,489434,21232,STRAWBERRY CERAMIC TRINKET BOX,24,2009-12-01 07:45:00,1.25,13085,United Kingdom,30.00,2009-12,Tuesday,7,2009-12,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
765599,580501,23284,DOORMAT KEEP CALM AND COME IN,2,2011-12-04 13:00:00,7.50,14546,United Kingdom,15.00,2011-12,Sunday,13,2009-12,25
765600,580501,22507,MEMO BOARD RETROSPOT DESIGN,3,2011-12-04 13:00:00,4.95,14546,United Kingdom,14.85,2011-12,Sunday,13,2009-12,25
765601,580502,22469,HEART OF WICKER SMALL,3,2011-12-04 13:15:00,1.65,16931,United Kingdom,4.95,2011-12,Sunday,13,2010-11,14
765602,580502,23489,VINTAGE BELLS GARLAND,2,2011-12-04 13:15:00,2.89,16931,United Kingdom,5.78,2011-12,Sunday,13,2010-11,14


In [3]:
# Product variety for each customer
product_variety = df.groupby('Customer ID')['StockCode'].nunique().reset_index()
product_variety.rename(columns={'StockCode': 'ProductVariety'}, inplace=True)

# Merge into CLV dataframe
df_with_variety = rfm_with_clv.merge(product_variety, on='Customer ID')

# Create the 'Low Variety' and 'High Variety' groups
median_variety = df_with_variety['ProductVariety'].median()
low_variety_clv = df_with_variety[df_with_variety['ProductVariety'] < median_variety]['clv']
high_variety_clv = df_with_variety[df_with_variety['ProductVariety'] >= median_variety]['clv']

# Independent t-test
t_stat, p_value = stats.ttest_ind(high_variety_clv, low_variety_clv, equal_var=False)

# Results
print("Impact of Product Variety on Customer Lifetime Value")
print(f"Median Product Variety: {median_variety : .0f} unique items")
print(f"Average CLV for High-Variety Customers: ${high_variety_clv.mean():.2f}")
print(f"Average CLV for Low-Variety Customers: ${low_variety_clv.mean():.2f}")

print(f"\n T-statistic: {t_stat: .4f}")
print(f"\nP-value: {p_value: .4f}")

if p_value < 0.05:
    print("\n Result: Reject the null hypothesis (p < 0.05).")

else:
    print("\n Result: Fail to reject the null hypothesis (p >= 0.05)")

Impact of Product Variety on Customer Lifetime Value
Median Product Variety:  44 unique items
Average CLV for High-Variety Customers: $1576.97
Average CLV for Low-Variety Customers: $251.13

 T-statistic:  21.9870

P-value:  0.0000

 Result: Reject the null hypothesis (p < 0.05).


#### Inference
The test result is highly statistically significant (t = 21.99, p < 0.0001). We reject the null hypothesis: there is a strong association between product variety and future customer value.
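CLV distributions are often right-skewed, so it may be worth checking that the conclusion does not hinge on the t-test alone. The sketch below runs a Mann-Whitney U test and a Welch t-test on log-transformed values. It uses synthetic lognormal samples; the parameters and variable names are illustrative assumptions, not the notebook's data.

```python
import numpy as np
from scipy import stats

# Synthetic, right-skewed "CLV" samples (assumed parameters, for illustration only)
rng = np.random.default_rng(0)
low_variety = rng.lognormal(mean=5.0, sigma=1.0, size=500)
high_variety = rng.lognormal(mean=6.5, sigma=1.0, size=500)

# Rank-based test: makes no normality assumption
u_stat, p_mw = stats.mannwhitneyu(high_variety, low_variety, alternative='two-sided')

# Welch t-test on log1p-transformed values to dampen the long right tail
t_log, p_log = stats.ttest_ind(
    np.log1p(high_variety), np.log1p(low_variety), equal_var=False
)

print(f"Mann-Whitney U p-value: {p_mw:.3g}")
print(f"Welch t-test on log1p values, p-value: {p_log:.3g}")
```

If both checks agree with the original t-test on the real `high_variety_clv` and `low_variety_clv` series, the finding is unlikely to be an artifact of a few very large CLV values.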

##### Business Insight:
Customers who buy a wider range of products are dramatically more valuable to the business. On average, a "High-Variety" customer is predicted to be worth about 6.3 times as much as a "Low-Variety" customer ($1,576.97 vs. $251.13). This suggests that customer engagement across the product catalog is strongly associated with long-term value, although a group comparison like this one cannot by itself show that variety causes the higher value.

##### Strategic Recommendation:
The business should prioritize initiatives that encourage product discovery and cross-selling. Key actions could include:
* Implementing a "Frequently Bought Together" or "Customers Also Bought" feature on product pages.
* Creating targeted email campaigns that showcase products related to a customer's past purchases.
* Personalizing the user experience to expose existing customers to new product categories.

#### Test 2: Impact of Second Purchase Timeliness on Customer Value
##### 1. Business Question
Our cohort analysis showed a significant customer drop-off after the first purchase. This leads to a critical question: for the customers who do return, does the speed of their second purchase matter? Are customers who make their second purchase quickly more likely to become high-value customers in the long run?

##### 2. Analytical Approach
We will identify all customers who have made at least two purchases and calculate the number of days between their first and second purchase. We will then split these customers into two groups: "Quick Repeat" (those who returned in no more than the median number of days) and "Slow Repeat" (those who took longer). Finally, we will use a one-tailed independent (Welch) t-test to compare the average CLV of these two groups.

##### 3. Hypotheses
- Null Hypothesis (H_0): There is no significant difference in the mean CLV between customers who make a quick second purchase and those who make a slow second purchase.
- Alternative Hypothesis (H_a): The mean CLV of "Quick Repeat" customers is significantly higher than that of "Slow Repeat" customers.

##### 4. Significance Level
We will use α = 0.05.

In [7]:
#Calculate days to second purchase for repeat customers
# Ensure InvoiceDate is datetime
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
# Sort and group by customer
purchase_dates = (
    df.sort_values('InvoiceDate')
    .groupby('Customer ID')['InvoiceDate']
    .apply(list)
)

# Filter customers with at least two purchases
repeat_customers_dates = purchase_dates[purchase_dates.str.len() >= 2]

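# Compute days between the first and second recorded purchase rows.
# Note: df has one row per invoice line, so x[0] and x[1] can belong to the
# same invoice and share a timestamp, which yields a gap of 0 days.
days_to_second_purchase = repeat_customers_dates.apply(
    lambda x: (x[1] - x[0]).days
).reset_index(name='DaysToSecondPurchase')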

# Convert Customer ID to string
days_to_second_purchase['Customer ID'] = days_to_second_purchase['Customer ID'].astype(str)

#Prepare the RFM+CLV dataframe for merging

# Check if Customer ID is in index
if 'Customer ID' not in rfm_with_clv.columns:
    rfm_with_clv = rfm_with_clv.reset_index()

# Convert Customer ID to string
rfm_with_clv['Customer ID'] = rfm_with_clv['Customer ID'].astype(str)

#Merge and perform analysis

# Merge CLV data with days to second purchase
df_with_timing = rfm_with_clv.merge(days_to_second_purchase, on='Customer ID')

# Determine median
median_timing = df_with_timing['DaysToSecondPurchase'].median()

# Group customers by purchase speed
quick_repeat_clv = df_with_timing[df_with_timing['DaysToSecondPurchase'] <= median_timing]['clv']
slow_repeat_clv = df_with_timing[df_with_timing['DaysToSecondPurchase'] > median_timing]['clv']

# Perform one-tailed t-test (quick repeaters expected to have higher CLV)
t_stat, p_value = stats.ttest_ind(
    quick_repeat_clv,
    slow_repeat_clv,
    equal_var=False,
    alternative='greater'
)

# Results

print("Impact of Second Purchase Timeliness on CLV")
print(f"Median Time to Second Purchase: {median_timing:.0f} days")
print(f"Average CLV for Quick Repeat Customers: ${quick_repeat_clv.mean():.2f}")
print(f"Average CLV for Slow Repeat Customers: ${slow_repeat_clv.mean():.2f}")
print(f"\nT-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")

if p_value < 0.05:
    print("\nResult: Reject the null hypothesis (p < 0.05).")
else:
    print("\nResult: Fail to reject the null hypothesis (p >= 0.05).")


Impact of Second Purchase Timeliness on CLV
Median Time to Second Purchase: 0 days
Average CLV for Quick Repeat Customers: $942.18
Average CLV for Slow Repeat Customers: $763.89

T-statistic: 1.6935
P-value: 0.0463

Result: Reject the null hypothesis (p < 0.05).


#### Inference
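The test result is statistically significant at the 5% level (one-tailed p = 0.0463), so we reject the null hypothesis, although the margin is narrow. Note that the median time to second purchase is 0 days, so the "Quick Repeat" group consists of customers whose second recorded purchase row falls within 24 hours of their first. The evidence that customers who return quickly are more valuable in the long run is therefore suggestive rather than strong.

On average, a "Quick Repeat" customer has a predicted lifetime value of $942, about 23% higher than the $764 CLV of a "Slow Repeat" customer.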
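The transaction table has one row per invoice line, so several rows of a single order share the same timestamp. A minimal sketch of an invoice-level variant is shown below, assuming the `Customer ID`, `Invoice`, and `InvoiceDate` columns seen in the data preview; the helper name `days_to_second_order` and the toy values are hypothetical.

```python
import pandas as pd

def days_to_second_order(transactions: pd.DataFrame) -> pd.DataFrame:
    """Days between each customer's first and second distinct invoice."""
    # Collapse line items to one row per (customer, invoice)
    orders = (
        transactions.groupby(['Customer ID', 'Invoice'], as_index=False)['InvoiceDate']
        .min()
        .sort_values(['Customer ID', 'InvoiceDate'])
    )
    # Rank invoices within each customer: 0 = first order, 1 = second order
    orders['OrderRank'] = orders.groupby('Customer ID').cumcount()
    first = orders.loc[orders['OrderRank'] == 0].set_index('Customer ID')['InvoiceDate']
    second = orders.loc[orders['OrderRank'] == 1].set_index('Customer ID')['InvoiceDate']
    # Only customers with a second invoice appear in `second`
    gap = (second - first.reindex(second.index)).dt.days
    return gap.rename('DaysToSecondPurchase').reset_index()

# Toy data (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({
    'Customer ID': [1, 1, 1, 2, 2, 3],
    'Invoice': ['A', 'A', 'B', 'C', 'D', 'E'],
    'InvoiceDate': pd.to_datetime([
        '2010-01-01 09:00', '2010-01-01 09:00', '2010-01-11 10:00',
        '2010-02-01 12:00', '2010-02-04 12:00', '2010-03-01 08:00',
    ]),
})
result = days_to_second_order(toy)
print(result)
```

In the toy data, customer 1 has two line items on invoice A, yet the gap is measured to invoice B; customer 3 has a single invoice and is excluded. Re-running the test on such a measure would show whether the result holds when "second purchase" means a second order.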

##### Strategic Recommendation
A key business objective should be to convert new customers into repeat purchasers as quickly as possible. The time window immediately following a customer's first purchase appears to be an important period for establishing long-term value.

This insight justifies investing in post-purchase strategies aimed at accelerating the second sale, such as:

1. A "thank you" email series that includes a limited-time offer for their next purchase.
2. Targeted advertising to recent first-time buyers.
3. A welcome program designed to re-engage customers within the first 1-2 weeks.

#### Test 3: Do Different Customer Segments Have a Significantly Different Average Purchase Value (APV)?
##### 1. Purpose of the Test

This test will determine if our RFM segmentation has created groups that are not just conceptually different, but statistically distinct in their purchasing power. It answers the question: "Is the average spend per transaction of a 'Champion' truly different from that of a 'Loyal Customer', an 'At-Risk' customer, etc., when looking at all segments at once?" This is a crucial validation of our segmentation strategy.
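Because the transaction table stores one row per invoice line, "average spend per transaction" can be computed by first summing each invoice and then averaging the invoice totals per customer. The sketch below illustrates this on toy data; the helper name `average_purchase_value` and the values are hypothetical, and the column names are assumed from the data preview.

```python
import pandas as pd

def average_purchase_value(transactions: pd.DataFrame) -> pd.DataFrame:
    """Mean invoice total per customer."""
    # Step 1: total value of each invoice
    invoice_totals = transactions.groupby(['Customer ID', 'Invoice'])['TotalPrice'].sum()
    # Step 2: average those invoice totals for each customer
    apv = invoice_totals.groupby(level='Customer ID').mean()
    return apv.rename('APV').reset_index()

# Toy data (hypothetical values)
toy = pd.DataFrame({
    'Customer ID': [1, 1, 1, 2, 2],
    'Invoice': ['A', 'A', 'B', 'C', 'C'],
    'TotalPrice': [10.0, 20.0, 30.0, 5.0, 15.0],
})
apv_table = average_purchase_value(toy)
line_item_mean = toy.groupby('Customer ID')['TotalPrice'].mean()
print(apv_table)
print(line_item_mean)
```

For customer 1 the invoice-level APV is 30.0 while the mean per line item is 20.0, which shows how much the two definitions can diverge.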

##### 2. Analytical Approach

We will use a One-Way ANOVA to compare the mean Average Purchase Value (APV) across all our major customer segments. If the result is statistically significant, it means at least one segment's average spend is different from the others.
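As a cross-check on the approach, the same omnibus test can be run with `scipy.stats.f_oneway`, and Levene's test gives a rough check of the equal-variance assumption behind a classical one-way ANOVA. The sketch uses synthetic groups; the segment labels and parameters are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

# Synthetic APV samples for three hypothetical segments
rng = np.random.default_rng(1)
groups = {
    'Segment A': rng.normal(20, 4, 150),
    'Segment B': rng.normal(22, 4, 150),
    'Segment C': rng.normal(27, 4, 150),
}

# One-way ANOVA: tests whether all group means are equal
f_stat, p_anova = stats.f_oneway(*groups.values())

# Levene's test: tests whether the group variances are equal
w_stat, p_levene = stats.levene(*groups.values(), center='median')

print(f"ANOVA:  F = {f_stat:.2f}, p = {p_anova:.3g}")
print(f"Levene: W = {w_stat:.2f}, p = {p_levene:.3g}")
```

A small Levene p-value on the real data would suggest unequal variances, in which case the ANOVA p-value should be read with more caution.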

##### 3. Hypotheses

Null Hypothesis (H_0): The mean APV is the same across all customer segments.

Alternative Hypothesis (H_a): At least one customer segment has a different mean APV.

##### 4. Significance Level

We will use a standard significance level of α = 0.05.

In [5]:
#ANOVA Test: APV Across All Customer Segments

# Calculate Average Purchase Value (APV) for each customer
# Note: df has one row per invoice line, so this is the mean value per line item
customer_apv = df.groupby('Customer ID')['TotalPrice'].mean().reset_index()
customer_apv.rename(columns={'TotalPrice': 'APV'}, inplace=True)
# Merge APV into our RFM dataframe
rfm_with_clv['Customer ID'] = rfm_with_clv['Customer ID'].astype(str)
customer_apv['Customer ID'] = customer_apv['Customer ID'].astype(str)
df_for_test = rfm_with_clv.merge(customer_apv, on='Customer ID')

#Perform the One-Way ANOVA Test
# We use a formula notation: 'APV ~ C(Segment)'
# This models the APV as a function of the Segment category.
model = ols('APV ~ C(Segment)', data=df_for_test).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

print("ANOVA Results: APV vs. Customer Segment")
display(anova_table)

#Result
p_value = anova_table['PR(>F)'].iloc[0]
if p_value < 0.05:
    print(f"\n Result: Reject the null hypothesis (p-value = {p_value:.4f}).")
    print("Conclusion: There is a statistically significant difference in Average Purchase Value among the different customer segments.")
else:
    print(f"\n Result: Fail to reject the null hypothesis (p-value = {p_value:.4f}).")
    print("Conclusion: There is no statistically significant difference in APV among the segments.")

ANOVA Results: APV vs. Customer Segment


Unnamed: 0,sum_sq,df,F,PR(>F)
C(Segment),7086.314,9.0,3.509324,0.000241
Residual,1312533.0,5850.0,,



 Result: Reject the null hypothesis (p-value = 0.0002).
Conclusion: There is a statistically significant difference in Average Purchase Value among the different customer segments.


#### Inference:
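The ANOVA test is statistically significant (F = 3.51, p = 0.0002), indicating that the mean Average Purchase Value (APV) differs for at least one customer segment. The effect is small, however: segment membership accounts for roughly 0.5% of the variance in APV (a between-segment sum of squares of about 7,086 against a residual of about 1,312,533), so the significance likely owes more to the large sample than to a large practical difference.

It suggests that customer value is related to both visit patterns and spending per purchase. The ANOVA alone does not show which segments differ or in which direction, so whether a "Champion" has a higher-value shopping basket per transaction than less valuable segments remains to be confirmed with pairwise comparisons.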
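An omnibus ANOVA only says that at least one segment mean differs. A common follow-up is a post-hoc test such as Tukey's HSD, which compares every pair of segments while controlling the family-wise error rate. The sketch below uses `pairwise_tukeyhsd` from statsmodels on synthetic data; the three segment labels and their means are assumptions for illustration, not results from this notebook.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Synthetic APV values for three hypothetical segments
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    'Segment': np.repeat(['Champions', 'Loyal', 'At-Risk'], 200),
    'APV': np.concatenate([
        rng.normal(30.0, 5.0, 200),   # Champions
        rng.normal(22.0, 5.0, 200),   # Loyal
        rng.normal(21.5, 5.0, 200),   # At-Risk
    ]),
})

# Pairwise comparisons of segment means at a 5% family-wise error rate
tukey = pairwise_tukeyhsd(endog=toy['APV'], groups=toy['Segment'], alpha=0.05)
print(tukey.summary())
```

On the real data, the same call with `df_for_test['APV']` and `df_for_test['Segment']` would list which segment pairs differ and the sign of each mean difference.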

##### Strategic Implication: A Dual-Focus Approach
This finding is consistent with a two-pronged strategy to increase customer value:

1. Increase Purchase Frequency: Continue to use re-engagement campaigns and loyalty programs to encourage all segments to visit more often.

2. Increase Average Purchase Value: Target high-value segments like "Champions" and "Loyal Customers" with upselling and cross-selling initiatives, such as premium product recommendations and "exclusive collection" marketing, focusing on the segments whose APV proves to be higher.



#### Test 4: Does Product Variety Differ Across Segments?
##### 1. Purpose of the Test

Now that we know frequency is key, let's investigate another behavioral trait. Do our high-value segments, like "Champions," explore a wider range of products, or are they specialists who buy the same few items repeatedly? This test will determine if there's a significant difference in "product variety" across customer segments.

##### 2. Analytical Approach

We will first calculate the number of unique products each customer has purchased. Then, we will use a One-Way ANOVA to compare the mean product variety across all customer segments.

##### 3. Hypotheses

- Null Hypothesis (H_0): The mean product variety is the same across all customer segments.

- Alternative Hypothesis (H_a): At least one customer segment has a different mean product variety.

##### 4. Significance Level

We will use α = 0.05.



In [6]:
#ANOVA Test: Product Variety Across All Customer Segments
product_variety = df.groupby('Customer ID')['StockCode'].nunique().reset_index()
product_variety.rename(columns={'StockCode': 'ProductVariety'}, inplace=True)
# Merge variety into our RFM dataframe
product_variety['Customer ID'] = product_variety['Customer ID'].astype(str)
rfm_with_clv['Customer ID'] = rfm_with_clv['Customer ID'].astype(str)
df_for_test = rfm_with_clv.merge(product_variety, on='Customer ID')

# Perform the One-Way ANOVA Test
model = ols('ProductVariety ~ C(Segment)', data=df_for_test).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

print("ANOVA Results: Product Variety vs. Customer Segment")
display(anova_table)

# Result
p_value = anova_table['PR(>F)'].iloc[0]
if p_value < 0.05:
    print(f"\nResult: Reject the null hypothesis (p-value = {p_value:.4f}).")
    print("Conclusion: There is a statistically significant difference in Product Variety among the different customer segments.")
else:
    print(f"\n Result: Fail to reject the null hypothesis (p-value = {p_value:.4f}).")
    print("Conclusion: There is no statistically significant difference in Product Variety among the segments.")

ANOVA Results: Product Variety vs. Customer Segment


Unnamed: 0,sum_sq,df,F,PR(>F)
C(Segment),22146640.0,9.0,257.027631,0.0
Residual,56006880.0,5850.0,,



Result: Reject the null hypothesis (p-value = 0.0000).
Conclusion: There is a statistically significant difference in Product Variety among the different customer segments.


#### Inference
The ANOVA test is highly statistically significant (F = 257.03, p < 0.0001), and the effect is large: segment membership accounts for roughly 28% of the variance in product variety. Our customer segments differ substantially in the variety of products they purchase.
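Statistical significance says little about the size of an effect. A minimal sketch of eta squared (the share of total variance explained by the grouping factor) is shown below, computed from the rounded sums of squares printed in the two ANOVA tables above; the helper name `eta_squared` is an illustrative assumption.

```python
def eta_squared(ss_between: float, ss_residual: float) -> float:
    """Share of total variance explained by the grouping factor."""
    return ss_between / (ss_between + ss_residual)

# Sums of squares as printed in the ANOVA tables (rounded values)
apv_eta2 = eta_squared(7086.314, 1312533.0)          # APV vs. Segment
variety_eta2 = eta_squared(22146640.0, 56006880.0)   # ProductVariety vs. Segment

print(f"Eta squared, APV: {apv_eta2:.4f}")
print(f"Eta squared, Product Variety: {variety_eta2:.4f}")
```

The two tests are both significant, yet segments explain well under 1% of the variance in APV and more than a quarter of the variance in product variety.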

This result, combined with our previous findings, gives a three-dimensional picture of what makes a customer valuable. The pattern is consistent with our best customers, like "Champions," scoring higher on the key behavioral dimensions, although the omnibus tests do not by themselves identify which segments rank highest:

1. They visit more often (higher Frequency).
2. They spend more per visit (higher Average Purchase Value).
3. They explore more of the product catalog (higher Product Variety).

##### Strategic Implication
These results point to a three-pronged strategy designed to nurture customers across all three of these dimensions:

1. Encourage Repeat Visits through re-engagement campaigns and loyalty programs.
2. Increase Basket Value by upselling and recommending premium products to high-APV segments.
3. Promote Product Discovery with cross-selling features and personalized recommendations to increase the variety of items a customer purchases.