Step 2: Load the Final, Named Data

We will load the fully processed data, which includes our cluster labels and segment names.

In [4]:
# Import libraries
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.multivariate.manova import MANOVA
from scipy.stats import chi2

# Load the final dataset
try:
    customer_df = pd.read_csv('../data/customer_features_clustered_named.csv')
    customer_df.set_index('customer_unique_id', inplace=True)
    print("Final clustered and named data loaded successfully.")
except FileNotFoundError:
    print("ERROR: The file '../data/customer_features_clustered_named.csv' was not found.")
    print("Please make sure you have run the '04-statistical-validation-and-profiling.ipynb' notebook first.")
    raise  # Stop here: every later cell needs customer_df

customer_df.head()

Final clustered and named data loaded successfully.


customer_unique_id,Unnamed: 0,recency,frequency,monetary,avg_order_value,product_diversity,tenure,cluster,segment_name
0000366f3b9a7992bf8c76cfdf3221e2,0,112,1,141.9,141.9,1,0,1,New / One-Time Buyers
0000b849f77a49e4a4ce2b2a4ca5be3f,1,115,1,27.19,27.19,1,0,1,New / One-Time Buyers
0000f46a3911fa3c0805444483337064,2,537,1,86.22,86.22,1,0,0,At-Risk Customers
0000f6ccb0745a6a4b88665a16c9f078,3,321,1,43.62,43.62,1,0,0,At-Risk Customers
0004aac84e0df4da2b147fca70cf8255,4,288,1,196.89,196.89,1,0,0,At-Risk Customers
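The `Unnamed: 0` column in the preview is most likely the old row index that was written out when the CSV was saved. It is harmless here because we never use it as a predictor, but it is easy to drop. A minimal sketch, using a made-up two-row CSV (`csv_text` and `demo_df` are illustrative names, not part of the project):

```python
import io
import pandas as pd

# Hypothetical stand-in for the saved CSV, including the stray index column
# (the header starts with an empty field, which pandas names "Unnamed: 0").
csv_text = (
    ",customer_unique_id,recency,frequency\n"
    "0,aaa,112,1\n"
    "1,bbb,537,1\n"
)

demo_df = pd.read_csv(io.StringIO(csv_text))

# Drop any leftover "Unnamed: ..." columns, then index by customer ID.
leftover = [c for c in demo_df.columns if c.startswith("Unnamed")]
demo_df = demo_df.drop(columns=leftover).set_index("customer_unique_id")

print(demo_df.columns.tolist())
```

Saving with `to_csv(..., index=False)` in the earlier notebook would avoid the extra column in the first place, assuming the index there carries no information.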


Step 3: Prepare Data for Regression

We need to define our dependent and independent variables.

Dependent Variables (Y): What do we want to predict? Let's predict spending in a few key categories. To do this, we need to quickly create those features first.

Independent Variables (X): What will we use for prediction? We'll use two basic features, tenure and avg_order_value.

Categorical Variables: Our cluster variable is stored as a number, but the model needs to treat it as a category. We will use sm.add_constant to add the intercept and pd.get_dummies to create dummy variables for the clusters.

In [9]:
# --- Quick Feature Engineering for Spending by Category ---
# (This part would ideally be in the feature engineering notebook, but we add it here for completeness)
# Let's load the main cleaned df to get category data
main_df = pd.read_csv('../data/main_cleaned_data.csv') # Assuming you saved a cleaned df
spending_by_cat = main_df.groupby(['customer_unique_id', 'product_category_name_english'])['payment_value'].sum().unstack(fill_value=0)

# Select top categories to predict
top_categories = ['bed_bath_table', 'health_beauty', 'sports_leisure', 'computers_accessories']
spending_by_cat = spending_by_cat[top_categories]

# Merge this into our customer_df
customer_df = pd.merge(customer_df, spending_by_cat, on='customer_unique_id', how='left').fillna(0)


# --- Prepare variables for Regression ---
# Dependent Variables (What we want to predict)
Y = customer_df[['bed_bath_table', 'health_beauty']]

# Independent Variables (What we use to predict)
# Let's use some basic, non-redundant features
X = customer_df[['tenure', 'avg_order_value']]

# Add a constant (intercept) to our predictors
X = sm.add_constant(X)

# Create dummy variables for our cluster variable. drop_first=True drops cluster 0,
# which becomes the baseline and avoids perfect collinearity with the constant.
cluster_dummies = pd.get_dummies(customer_df['cluster'], prefix='cluster', drop_first=True)

# Combine our base features with the cluster dummies for the "Full Model"
X_full = pd.concat([X, cluster_dummies], axis=1)

print("Shape of Y:", Y.shape)
print("Shape of X (Reduced Model):", X.shape)
print("Shape of X_full (Full Model):", X_full.shape)
X_full.head()

Shape of Y: (93357, 2)
Shape of X (Reduced Model): (93357, 3)
Shape of X_full (Full Model): (93357, 7)


customer_unique_id,const,tenure,avg_order_value,cluster_1,cluster_2,cluster_3,cluster_4
0000366f3b9a7992bf8c76cfdf3221e2,1.0,0,141.9,True,False,False,False
0000b849f77a49e4a4ce2b2a4ca5be3f,1.0,0,27.19,True,False,False,False
0000f46a3911fa3c0805444483337064,1.0,0,86.22,False,False,False,False
0000f6ccb0745a6a4b88665a16c9f078,1.0,0,43.62,False,False,False,False
0004aac84e0df4da2b147fca70cf8255,1.0,0,196.89,False,False,False,False
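To see why one dummy has to be dropped when the model also has a constant, we can compare the rank of the design matrix with and without the full set of dummies. This is a small sketch on toy labels (the `cluster` array below is invented for illustration, not taken from our data):

```python
import numpy as np

# Toy cluster labels with three levels (assumed values, not the real data).
cluster = np.array([0, 1, 2, 0, 1, 2, 0, 1])
const = np.ones((cluster.size, 1))

# One indicator column per cluster level.
all_dummies = (cluster[:, None] == np.array([0, 1, 2])).astype(int)

# Intercept + all 3 dummies: the dummies sum to the constant column.
X_trap = np.hstack([const, all_dummies])
# Intercept + 2 dummies: cluster 0 is the baseline.
X_ok = np.hstack([const, all_dummies[:, 1:]])

rank_trap = np.linalg.matrix_rank(X_trap)
rank_ok = np.linalg.matrix_rank(X_ok)
print(X_trap.shape[1], rank_trap)  # 4 columns but only rank 3
print(X_ok.shape[1], rank_ok)      # 3 columns, rank 3
```

With the full set of dummies the matrix has more columns than its rank, so the coefficients are not uniquely determined. Dropping one level restores full column rank, and each remaining dummy coefficient is read as a difference from the baseline cluster.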


Step 4: Build and Compare Two Models using the Likelihood Ratio Test

This is the core of this phase. We will build two models:

Reduced Model: Predicts spending using only basic features.

Full Model: Predicts spending using basic features PLUS our discovered segment information.

The LRT will tell us whether the full model fits the data significantly better than the reduced model.
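For reference, the test statistic we compute in the next cell can be written as follows. Here \ell denotes a model's maximized log-likelihood and k is the number of extra parameters in the full model (the cluster dummies); the null hypothesis H_0 is that all of those extra coefficients are zero.

```latex
\[
\mathrm{LR} = 2\,\bigl(\ell_{\text{full}} - \ell_{\text{reduced}}\bigr),
\qquad
\mathrm{LR} \;\overset{H_0}{\sim}\; \chi^2_{k} \ \text{(approximately, for large samples)},
\qquad
p = P\!\left(\chi^2_{k} \ge \mathrm{LR}\right)
\]
```

A large LR value relative to the chi-square distribution with k degrees of freedom gives a small p-value, which is evidence against the reduced model.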

In [11]:
# --- Step 3 (Revised): Prepare Data for Regression ---
# Y and the base features are built as before; the only change is that the
# cluster dummies are cast to integers.
Y = customer_df[['bed_bath_table', 'health_beauty']]
X = customer_df[['tenure', 'avg_order_value']]
X = sm.add_constant(X)

# Create dummy variables
cluster_dummies = pd.get_dummies(customer_df['cluster'], prefix='cluster', drop_first=True)

# pd.get_dummies returned boolean (True/False) columns above.
# Cast them to integers (0/1) so statsmodels receives a purely numeric design matrix.
cluster_dummies = cluster_dummies.astype(int)

# Combine for the full model
X_full = pd.concat([X, cluster_dummies], axis=1)

# Now, let's check the data types to be sure
print("--- Data Types of X_full ---")
print(X_full.dtypes)

# --- Step 4: Model Building and Comparison ---

# Model 1: Reduced Model (basic features only)
model_reduced = sm.OLS(Y, X).fit()
loglik_reduced = model_reduced.llf
print(f"\nLog-Likelihood of Reduced Model: {loglik_reduced:.2f}")

# Model 2: Full Model
model_full = sm.OLS(Y, X_full).fit()
loglik_full = model_full.llf
print(f"Log-Likelihood of Full Model: {loglik_full:.2f}")

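# Likelihood Ratio Test: LR = 2 * (llf_full - llf_reduced), compared against a
# chi-square distribution with df = number of extra predictors in the full model.
# Caveat: Y has two columns, and OLS appears to pool the squared residuals of both
# into one log-likelihood, so treat this as an approximate test. A fully multivariate
# LRT would count the extra coefficients for every outcome (4 x 2 = 8 here).
LR_statistic = 2 * (loglik_full - loglik_reduced)
df = X_full.shape[1] - X.shape[1]
p_value = chi2.sf(LR_statistic, df)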

print(f"\nLR Statistic: {LR_statistic:.2f}")
print(f"Degrees of Freedom: {df}")
print(f"P-value: {p_value:.3g}")

--- Data Types of X_full ---
const              float64
tenure               int64
avg_order_value    float64
cluster_1            int64
cluster_2            int64
cluster_3            int64
cluster_4            int64
dtype: object

Log-Likelihood of Reduced Model: -602952.34
Log-Likelihood of Full Model: -602844.57

LR Statistic: 215.54
Degrees of Freedom: 4
P-value: 1.71e-45
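The cell above passes a two-column Y to OLS and counts 4 degrees of freedom. For comparison, the textbook multivariate version of the test compares the determinants of the residual covariance matrices and counts the extra coefficients once per outcome. Below is a minimal sketch on synthetic data, assuming Gaussian errors; the function name `multivariate_lrt` and all the toy variables are our own illustration, not part of statsmodels:

```python
import numpy as np
from scipy.stats import chi2

def multivariate_lrt(Y, X_reduced, X_full):
    """LRT for nested multivariate linear models (Gaussian errors assumed)."""
    n, p = Y.shape

    def logdet_resid_cov(X):
        # Least-squares fit for all outcomes at once, then the ML estimate
        # of the error covariance (residual cross-products divided by n).
        coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
        resid = Y - X @ coef
        _, logdet = np.linalg.slogdet(resid.T @ resid / n)
        return logdet

    lr = n * (logdet_resid_cov(X_reduced) - logdet_resid_cov(X_full))
    dof = (X_full.shape[1] - X_reduced.shape[1]) * p  # extra coefficients x outcomes
    return lr, dof, chi2.sf(lr, dof)

# Synthetic example: one numeric feature, a 3-level group, two outcomes.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
group = rng.integers(0, 3, size=n)

X_reduced = np.column_stack([np.ones(n), x1])
X_full = np.column_stack([X_reduced, group == 1, group == 2]).astype(float)
Y = np.column_stack([
    1.0 + 0.5 * x1 + 2.0 * (group == 1) + rng.normal(size=n),
    -0.5 * x1 + 1.5 * (group == 2) + rng.normal(size=n),
])

lr, dof, p_value = multivariate_lrt(Y, X_reduced, X_full)
print(f"LR = {lr:.2f}, df = {dof}, p = {p_value:.3g}")
```

On our real data this would be called with `Y.to_numpy()`, `X.to_numpy()` and `X_full.to_numpy()`. We have not run it on the project data here, so its result may differ from the numbers printed above.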


Interpreting the Result:

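A tiny p-value (e.g., < 0.001) is your victory lap. Here the p-value is 1.71e-45, so adding the cluster dummies improves the model's fit far more than chance would explain. It is the final piece of evidence that your Discover -> Prove -> Predict framework was successful: your customer segments carry statistically significant information about category spending beyond tenure and average order value. Keep in mind that with more than 93,000 customers even small effects reach significance, and that this is an in-sample result, so check the effect size and, ideally, held-out data before claiming that the segments predict future behavior.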
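Because a p-value says nothing about how large the improvement is, it helps to look at an effect-size measure as well. One simple option is the gain in R-squared when the extra predictors are added. The sketch below uses synthetic data with a deliberately small group effect; `r_squared` is a helper we define here for illustration:

```python
import numpy as np

def r_squared(y, X):
    """Share of variance in y explained by a least-squares fit on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic data: a large sample with a deliberately tiny group effect.
rng = np.random.default_rng(1)
n = 50_000
x1 = rng.normal(size=n)
flag = rng.integers(0, 2, size=n).astype(float)
y = 1.0 + 0.5 * x1 + 0.05 * flag + rng.normal(size=n)

X_reduced = np.column_stack([np.ones(n), x1])
X_full = np.column_stack([X_reduced, flag])

r2_reduced = r_squared(y, X_reduced)
r2_full = r_squared(y, X_full)
print(f"R^2 reduced: {r2_reduced:.4f}")
print(f"R^2 full:    {r2_full:.4f}")
print(f"Gain:        {r2_full - r2_reduced:.4f}")
```

Running the same comparison per spending category on our real Y, X and X_full would show how much of the variance the segments actually explain, which is the number a business audience usually cares about.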

Step 5: Analyze the Final Model

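Now you can look at the parameters of your best model (model_full) to get even more insights. In the output, columns 0 and 1 correspond to the two columns of Y, bed_bath_table and health_beauty, in that order. Each cluster coefficient is the difference in expected spending relative to the baseline cluster 0, holding tenure and avg_order_value fixed.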

In [12]:
# --- Analyze the Best Model ---
print("\n--- Summary of the Full Model ---")
# Statsmodels OLS doesn't have a direct multivariate summary,
# but we can inspect the parameters.
print(model_full.params)


--- Summary of the Full Model ---
                         0          1
const            12.254461   0.011022
tenure            0.071611  -0.008851
avg_order_value   0.027429   0.087059
cluster_1        -0.874731   4.901523
cluster_2        22.648972  22.971798
cluster_3        36.391392  21.692510
cluster_4        43.846994   3.633092



Final Step: The Business Impact and Presentation

You have all the analysis. The final, final step is to synthesize this into a compelling story for an interviewer. Prepare to talk about:

1.  **The Personas:** "We discovered 5 key customer types, including our 'VIP Champions' and our 'At-Risk Customers'."
2.  **The Proof:** "We didn't just find these groups; we used MANOVA to prove they were statistically different."
3.  **The Value:** "Most importantly, we demonstrated with a Likelihood Ratio Test that these segments have real predictive power, allowing the business to better forecast spending and tailor marketing."
4.  **The Action:** "My recommendation is to create a loyalty program for the 'VIP Champions' and a win-back campaign for the 'At-Risk' segment to maximize customer lifetime value."

Congratulations, brother! You have completed a truly impressive, end-to-end data science project. You should be very proud of this work.