# 1. Business Understanding (CRISP-DM Phase 1)

## 1.1. Project Context
This project utilizes the **RetailRocket E-commerce dataset**, capturing real-world implicit feedback (views, cart adds, transactions). The core challenge is to translate raw behavioral logs into a system that drives marketing decisions.

**Data Source:** RetailRocket E-commerce Dataset (Kaggle).

## 1.2. Business Goal
The primary objective is to build an end-to-end **Customer Segmentation & Personalization System**. The system aims to replace "mass marketing" with data-driven targeting strategies.

**Key Deliverables:**

1.  **Segmentation Model:** Group customers based on RFM (Recency, Frequency, Monetary) and behavioral patterns using K-Means/GMM.
2.  **CLV Prediction:** Estimate Customer Lifetime Value using probabilistic models (BG/NBD, Gamma-Gamma) to identify high-value users.
3.  **Actionable Rules (The Action Board):** A concrete mapping of "Segment -> Marketing Action" (e.g., *If VIP -> Grant Early Access*).
    * Define clear trigger conditions for each segment (e.g., VIP = High CLV & High Frequency).
    * Specify concrete marketing actions and channels (e.g., early-access email, personalized offers).
    * Link each action to measurable business KPIs (e.g., conversion rate, revenue per user). *(Tentative.)*
    * Make each rule testable via A/B testing so that its incremental impact can be measured. *(Tentative.)*
4.  **Experimentation Framework:** A design for A/B testing to validate the proposed strategies (Control Group vs. Treatment Group).
    * Control Group: continues with the existing strategy; the new strategies are not applied.
5.  **Personalization Dashboard:** A visual storytelling tool including:
    * *Treemaps* for segment sizing.
    * *Stacked Area Charts* for tracking segment migration over time (tentative).
    * *Line Charts* for product trends.
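
For the Experimentation Framework, one common way to split visitors into Control and Treatment is deterministic hash-based bucketing, so that a visitor always lands in the same group. The sketch below is only an illustration under assumptions: the function name `assign_group`, the experiment label `action_board_v1`, and the 50/50 split are hypothetical and not part of the project specification.

```python
import hashlib


def assign_group(visitor_id: str,
                 experiment: str = "action_board_v1",  # hypothetical experiment label
                 treatment_share: float = 0.5) -> str:
    """Deterministically assign a visitor to 'control' or 'treatment'.

    The same (experiment, visitor_id) pair always maps to the same group,
    so assignment does not need to be stored to be reproducible.
    """
    key = f"{experiment}:{visitor_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"


# Toy usage: bucket 1,000 synthetic visitor ids.
groups = {vid: assign_group(str(vid)) for vid in range(1000)}
treatment_fraction = sum(g == "treatment" for g in groups.values()) / len(groups)
```

Including the experiment label in the hash key means a new experiment reshuffles visitors instead of reusing the previous split.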

## 1.3. The DOC Framework (Decision - Options - Criteria)
To bridge the gap between technical models and business value:

### **D - Decision**
**"How should the marketing budget be allocated, and which promotional channel should be used for each customer group?"**

**"When should these strategies be applied (seasonally or monthly)?"**

### **O - Options (Hypothetical Actions)**
* **High Value, At Risk** (High Monetary, but a long time since the last purchase / low probability of still being active): Grant **High-value Vouchers** (20-30%) to win them back before they churn.
* **Loyal, Low Spend** (High Frequency, but Low Monetary): Trigger **Cross-sell Recommendations** (suggest combos) to increase the average order value.
* **Loyal VIP** (High Frequency & High Monetary): Offer **Early Access / Exclusive Service** (Avoid deep discounts to protect profit margins).
* **Window Shoppers** (High View count, No Purchase): Use **Flash Sales / Urgency Triggers** to motivate the first transaction.
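
The four options above can be read as a small rule table. The following is a minimal sketch of such an Action Board, assuming hypothetical thresholds (`HIGH_MONETARY`, `HIGH_FREQUENCY`, `HIGH_RECENCY_DAYS`, `HIGH_VIEWS`) and a hypothetical `CustomerProfile` record; in the project itself the segments are meant to come from clustering and CLV models rather than fixed cut-offs.

```python
from dataclasses import dataclass


@dataclass
class CustomerProfile:
    recency_days: int   # days since the last purchase
    frequency: int      # number of purchases
    monetary: float     # total spend
    views: int          # number of 'view' events


# Hypothetical thresholds, chosen only for illustration.
HIGH_MONETARY = 500.0
HIGH_FREQUENCY = 5
HIGH_RECENCY_DAYS = 60
HIGH_VIEWS = 20

ACTION_BOARD = {
    "High Value, At Risk": "High-value voucher (20-30%)",
    "Loyal, Low Spend": "Cross-sell recommendations",
    "Loyal VIP": "Early access / exclusive service",
    "Window Shopper": "Flash sale / urgency trigger",
    "Other": "No targeted action",
}


def assign_segment(p: CustomerProfile) -> str:
    """Map a profile to one of the segments listed in the Options."""
    if p.frequency == 0:
        return "Window Shopper" if p.views >= HIGH_VIEWS else "Other"
    high_m = p.monetary >= HIGH_MONETARY
    high_f = p.frequency >= HIGH_FREQUENCY
    # At-risk is checked first so a lapsed big spender is not treated as a VIP.
    if high_m and p.recency_days >= HIGH_RECENCY_DAYS:
        return "High Value, At Risk"
    if high_f and high_m:
        return "Loyal VIP"
    if high_f:
        return "Loyal, Low Spend"
    return "Other"


def recommend_action(p: CustomerProfile) -> str:
    return ACTION_BOARD[assign_segment(p)]


vip = CustomerProfile(recency_days=5, frequency=8, monetary=900.0, views=40)
lapsed = CustomerProfile(recency_days=90, frequency=8, monetary=900.0, views=40)
browser = CustomerProfile(recency_days=0, frequency=0, monetary=0.0, views=35)
```

The order of the checks is a design choice: evaluating the at-risk rule before the VIP rule gives win-back offers priority over loyalty perks.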

### **C - Criteria (Success Metrics)**
1.  **Model Performance:** Distinct clusters (Silhouette Score) and accurate CLV prediction (RMSE).
2.  **Dashboard Usability:** The final dashboard must clearly visualize the "story" from overview to specific actions.
3.  **Experiment Readiness:** The framework must define clear metrics (Conversion Rate, Retention Rate) to measure the uplift of the new strategy vs. the old one.
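
As an illustration of how Conversion Rate uplift between the two groups might be measured, the sketch below uses a pooled two-proportion z-test. This is one standard option, not necessarily the test the project will adopt; the function name `conversion_uplift` and the counts in the usage example are invented for demonstration.

```python
import math


def conversion_uplift(control_conv: int, control_n: int,
                      treat_conv: int, treat_n: int) -> dict:
    """Compare conversion rates of Control vs. Treatment.

    Returns absolute and relative uplift plus a two-sided p-value
    from a pooled two-proportion z-test.
    """
    p_c = control_conv / control_n
    p_t = treat_conv / treat_n
    pooled = (control_conv + treat_conv) / (control_n + treat_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / control_n + 1 / treat_n))
    z = (p_t - p_c) / se if se > 0 else 0.0
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return {
        "cr_control": p_c,
        "cr_treatment": p_t,
        "abs_uplift": p_t - p_c,
        "rel_uplift": (p_t - p_c) / p_c if p_c > 0 else float("nan"),
        "z": z,
        "p_value": p_value,
    }


# Invented counts: 200 of 10,000 convert in Control, 260 of 10,000 in Treatment.
result = conversion_uplift(200, 10_000, 260, 10_000)
```

The same structure could be reused for Retention Rate by replacing "converted" with "retained" counts.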

## 1.4. KPI Tree & Metric Decomposition

Applying the **5-step KPI Tree process**, we structure the metrics based on the available data constraints.

### **Step 1: North Star Metric**
**Target:** **GMV (Gross Merchandise Value)**.
* *Clarification:* Since the dataset lacks data on cancellations, returns, and basket-level discounts, we optimize for GMV (total value of merchandise sold) as a proxy for revenue given the observed data.

### **Step 2: Metric Decomposition (Drivers)**
Decomposition Formula:
$$GMV = \text{Traffic} \times \text{Conversion Rate (CR)} \times \text{Average Ticket Size (ATS)}$$

*Note: We use "Average Ticket Size" instead of AOV because order-level (basket) identifiers are unavailable; calculations are item-based.*
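
To see why the decomposition holds exactly under the visitor-level definitions used in this section, write $V$ for the number of unique visitors, $P$ for the number of paying visitors, and $S$ for the summed price of all transacted items (these symbols are introduced here only for the derivation), and treat CR as a fraction rather than a percentage:

$$\text{Traffic} \times CR \times ATS = V \times \frac{P}{V} \times \frac{S}{P} = S = GMV$$

With purely illustrative numbers, $V = 1000$, $P = 20$ and $S = 1500$ give $CR = 0.02$ and $ATS = 75$, so $1000 \times 0.02 \times 75 = 1500 = GMV$.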

#### **Driver 1: Traffic (Total Visitors)**
* **Metric:** Count of unique `visitorid`.
* *Scope:* Out of scope for this project (traffic cannot be influenced using this dataset).

#### **Driver 2: Conversion Rate (CR)**
* **Definition:** % of visitors who perform at least one `transaction`.
* **Formula:**
    $$CR = \frac{\text{Count(Visitors with ≥1 'transaction' event)}}{\text{Count(Total Visitors)}} \times 100\%$$

*Note: Conversion Rate in the KPI Tree is defined at the visitor level; funnel-level transition probabilities are used separately for behavioral analysis.*

#### **Driver 3: Average Ticket Size (ATS) / Monetary**
* **Definition:** Average value generated per paying visitor (proxy for AOV).
* **Formula:**
    $$ATS \approx \frac{\sum (\text{Price of items in 'transaction'})}{\text{Count(Paying Visitors)}}$$
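
The three drivers can be computed directly from an event log. The sketch below assumes each row carries `visitorid`, `event`, and `item_price` (the price field is the one named in the guardrails of this document); the rows themselves are toy data, and `kpi_drivers` is a hypothetical helper name.

```python
from collections import defaultdict


def kpi_drivers(events):
    """Compute Traffic, CR (as a fraction), ATS and GMV from event rows.

    Each row is a dict with keys 'visitorid', 'event' and 'item_price'.
    Only rows with event == 'transaction' contribute to spend.
    """
    visitors = set()
    spend = defaultdict(float)  # visitorid -> total value of transacted items
    for row in events:
        visitors.add(row["visitorid"])
        if row["event"] == "transaction":
            spend[row["visitorid"]] += row["item_price"]

    traffic = len(visitors)
    paying = len(spend)
    gmv = sum(spend.values())
    return {
        "traffic": traffic,
        "cr": paying / traffic if traffic else 0.0,
        "ats": gmv / paying if paying else 0.0,
        "gmv": gmv,
    }


# Toy event log: 4 visitors, 2 of whom buy.
events = [
    {"visitorid": 1, "event": "view", "item_price": 0.0},
    {"visitorid": 1, "event": "addtocart", "item_price": 0.0},
    {"visitorid": 1, "event": "transaction", "item_price": 30.0},
    {"visitorid": 1, "event": "transaction", "item_price": 20.0},
    {"visitorid": 2, "event": "view", "item_price": 0.0},
    {"visitorid": 3, "event": "view", "item_price": 0.0},
    {"visitorid": 3, "event": "transaction", "item_price": 50.0},
    {"visitorid": 4, "event": "view", "item_price": 0.0},
]

drivers = kpi_drivers(events)
# Sanity check of the decomposition: Traffic x CR x ATS should reproduce GMV.
reconstructed_gmv = drivers["traffic"] * drivers["cr"] * drivers["ats"]
```

Because ATS is divided by paying visitors rather than by orders, the two transactions of visitor 1 are summed into a single spend figure.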

### **Step 3: Business Levers (Within Project Scope)**
* **Segmentation Lever – RFM & Behavioral Clustering (K-Means / GMM):**
    * *Purpose:* Group customers with similar purchase recency, frequency, and monetary patterns.
    * *Application:* Enable differentiated treatment instead of mass marketing.

* **Prioritization Lever – CLV Estimation:**
    * *Purpose:* Identify high-value customers and allocate personalization efforts accordingly.
    * *Application:* Focus retention and engagement strategies on users with high predicted lifetime value.

* **Personalization Lever – Rule-based Action Board:**
    * *Purpose:* Translate segments into concrete marketing actions.
    * *Application:* Define rules such as “If Segment = At Risk & CLV High → Retention-oriented messaging”.

* *Future Extension (Out of Scope):*
    * Recommendation systems could be layered on top of this framework to increase Average Ticket Size in future work.
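
Since the segmentation lever starts from RFM features, here is a minimal sketch of how they might be derived from transaction rows. Two assumptions are made and should be revisited during Data Preparation: Frequency is counted as the number of distinct purchase days (the data has no order identifier), and Recency is measured in days relative to a chosen `snapshot_date`. The helper name `rfm_features` and the toy rows are hypothetical.

```python
from datetime import date


def rfm_features(transactions, snapshot_date):
    """Build per-visitor RFM features.

    transactions: iterable of (visitorid, purchase_date, item_price).
    Frequency is approximated by distinct purchase days (assumption),
    because individual orders cannot be identified.
    """
    acc = {}
    for visitor_id, purchase_date, price in transactions:
        row = acc.setdefault(
            visitor_id, {"last": purchase_date, "days": set(), "monetary": 0.0}
        )
        row["last"] = max(row["last"], purchase_date)
        row["days"].add(purchase_date)
        row["monetary"] += price

    return {
        vid: {
            "recency_days": (snapshot_date - row["last"]).days,
            "frequency": len(row["days"]),
            "monetary": row["monetary"],
        }
        for vid, row in acc.items()
    }


# Toy transactions (dates and prices are invented).
transactions = [
    (1, date(2015, 6, 1), 30.0),
    (1, date(2015, 6, 1), 20.0),
    (1, date(2015, 7, 10), 10.0),
    (2, date(2015, 5, 15), 50.0),
]
rfm = rfm_features(transactions, snapshot_date=date(2015, 8, 1))
```

The resulting dictionary has one row per paying visitor and could be scaled and passed to a clustering step such as K-Means or GMM.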


---

### **Step 4: Guardrails (Measurement Assumptions)**
1. **Transaction Validity Assumption:**
    * All `transaction` events are treated as successful purchases at the recorded `item_price`.
    * Returns, cancellations, and refunds are not observable in the dataset.
2. **Fixed Time Window Assumption:**
    * All KPIs are computed within a fixed historical time window.
    * No temporal feedback loop or post-intervention effects are measured.

### **Step 5: Dashboard Visualization & Review Cadence**
* **Metric 1:** Total GMV over time (Line Chart) — North Star tracking.
* **Metric 2:** Segment Migration (Stacked Area — e.g., how many "New" users became "VIP").
* **Metric 3:** RFM Distribution (Heatmap or Scatter).
* *Review Cadence:* Periodic analytical review (weekly/monthly), not real-time monitoring.
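
The Segment Migration view needs, for each pair of periods, a count of visitors moving from one segment to another. A minimal sketch of that aggregation follows; the segment labels, the `"Inactive"` fallback for visitors missing from the later period, and the helper name `segment_migration` are assumptions made for illustration.

```python
from collections import Counter


def segment_migration(before, after, missing_label="Inactive"):
    """Count transitions between segments across two periods.

    before / after: dicts mapping visitorid -> segment label.
    Visitors absent from `after` are counted under `missing_label`.
    """
    transitions = Counter()
    for visitor_id, old_segment in before.items():
        new_segment = after.get(visitor_id, missing_label)
        transitions[(old_segment, new_segment)] += 1
    return transitions


# Toy assignments for two consecutive review periods.
period_1 = {1: "New", 2: "New", 3: "VIP", 4: "At Risk"}
period_2 = {1: "VIP", 2: "New", 3: "VIP"}

migration = segment_migration(period_1, period_2)
new_to_vip = migration[("New", "VIP")]
```

Repeating this for each consecutive pair of periods yields the series that a stacked area chart of segment sizes would be built from.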


**Note: Why ATS instead of AOV?**

* In RetailRocket, each `transaction` event is tied only to a `visitorid` and an `itemid`; there is no `order_id`.
* It is therefore unknown how many items a single checkout contains or which items belong to the same order, so "one order" cannot be identified.
* ATS does not need an `order_id`: it relies entirely on `transaction`, `item_price`, and `visitorid`.
* ATS reflects the Monetary dimension of RFM.
* ATS is a valid proxy for a visitor's spending level, not for the value of each order.

## 1.5. Project Roadmap (CRISP-DM)

* **01_Business_Understanding:** Defining goals, DOC framework, and KPI Tree.
* **02_Data_Understanding:** Exploratory Data Analysis (EDA).
* **03_Data_Preparation:** Feature engineering (RFM) and Data Cleaning.
* **04.1_Modeling_Segmentation:** K-Means/GMM Clustering.
* **04.2_Modeling_CLV:** Probabilistic Prediction (BG/NBD + Gamma-Gamma).
* **05_Evaluation_Deployment:** Persona Profiling and Action Board.
* **06_Experimentation Framework & Dashboard:** A/B test design (Control vs. Treatment) and the Personalization Dashboard.