# Data Cleaning Notes
- Initial dataset: 541,909 rows, 8 columns.
- Dropped 135,080 rows (~25%) with missing CustomerID, since CustomerID is required for RFM and clustering.
- Filled 1,454 missing Descriptions with 'Unknown' for LLM compatibility.
- Removed rows with negative Quantity/UnitPrice (likely returns/cancellations).
- Created TotalPrice column (Quantity * UnitPrice) for revenue analysis.
- Converted CustomerID to string for categorical use.
- Fixed SettingWithCopyWarning using .copy() and .loc.
- Fixed TypeError by correcting logical operation syntax.
- Final shape: 392,692 rows, 9 columns.
- Saved cleaned data to data/processed/cleaned_data.csv.
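
The steps above can be sketched in pandas roughly as follows. This is a minimal sketch on a tiny made-up frame, not the project's actual script: the column names follow the notes, while the function name `clean_transactions`, the sample rows, and the integer cast before the string conversion are assumptions for illustration. The parenthesized comparison shows one common cause of a `TypeError` in boolean filters, which may or may not be the one fixed here.

```python
import pandas as pd


def clean_transactions(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps from the notes to a raw transactions frame."""
    # Drop rows without a CustomerID; .copy() avoids SettingWithCopyWarning
    # when columns are assigned on the filtered result.
    df = raw.dropna(subset=["CustomerID"]).copy()
    # Fill missing descriptions with a placeholder.
    df["Description"] = df["Description"].fillna("Unknown")
    # Remove negative quantities and prices. Each comparison needs its own
    # parentheses because & binds tighter than >=.
    df = df.loc[(df["Quantity"] >= 0) & (df["UnitPrice"] >= 0)].copy()
    df["TotalPrice"] = df["Quantity"] * df["UnitPrice"]
    # Assumption: cast to int first so the label reads "17850", not "17850.0".
    df["CustomerID"] = df["CustomerID"].astype(int).astype(str)
    return df


# Hypothetical sample rows, for illustration only.
raw = pd.DataFrame({
    "CustomerID": [17850.0, None, 13047.0, 12583.0],
    "Description": ["LANTERN", "MUG", "CLOCK", None],
    "Quantity": [6, 2, -1, 3],
    "UnitPrice": [2.55, 1.25, 4.95, 3.75],
})
cleaned = clean_transactions(raw)
print(cleaned)
```

Only the first and last sample rows survive: the second has no CustomerID and the third has a negative Quantity.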

# Exploratory Data Analysis (EDA)

## Objectives
- Analyze distributions of Quantity, UnitPrice, TotalPrice.
- Explore monthly revenue trends.
- Investigate revenue by country.
- Compute customer-level metrics for RFM prep.

## Findings
- **Distributions**: Quantity, UnitPrice, TotalPrice are right-skewed, indicating outliers (e.g., high-value orders).
- **Monthly Revenue**: Peaks in November/December, suggesting holiday-driven sales.
- **Country Revenue**: UK dominates (>80% revenue), followed by Germany, France.
- **Customer Metrics**: Median frequency ~2, monetary ~$674; some high-value customers (max $280k).
- **Implications**: Focus on UK for marketing, cap outliers for modeling, leverage seasonal trends for forecasting.
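
The monthly and country aggregations behind these findings can be sketched as below. This is an illustrative sketch on made-up rows: the `InvoiceDate` and `Country` column names are assumptions (the notes do not list the columns), and clipping at the 99th percentile is just one possible way to cap outliers, not necessarily the one used here.

```python
import pandas as pd

# Tiny made-up sample; the real input would be the cleaned data file.
sample = pd.DataFrame({
    "InvoiceDate": pd.to_datetime(
        ["2011-10-03", "2011-11-14", "2011-11-20", "2011-12-01"]
    ),
    "Country": ["United Kingdom", "United Kingdom", "Germany", "United Kingdom"],
    "TotalPrice": [100.0, 250.0, 50.0, 100.0],
})

# Revenue per calendar month.
month = sample["InvoiceDate"].dt.to_period("M")
monthly_revenue = sample.groupby(month)["TotalPrice"].sum()

# Each country's share of total revenue, largest first.
country_revenue = sample.groupby("Country")["TotalPrice"].sum()
country_share = (country_revenue / sample["TotalPrice"].sum()).sort_values(
    ascending=False
)

# One possible outlier cap: clip values above the 99th percentile.
cap = sample["TotalPrice"].quantile(0.99)
capped = sample["TotalPrice"].clip(upper=cap)

print(monthly_revenue)
print(country_share)
```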

## Visualizations
![Distributions](visualizations/distributions.png)
![Monthly Revenue](visualizations/monthly_revenue.png)
![Country Revenue](visualizations/country_revenue.png)

# RFM Analysis

## Objectives
- Calculate Recency, Frequency, Monetary metrics.
- Assign RFM scores and segments.
- Visualize segment distribution.

## Findings
- High-Value customers (RFM 555) make up 8.0% of the base (347 of 4,338).
- At-Risk customers (RFM ≤ 222) make up 14.6% (635 of 4,338) and need re-engagement.
- Loyal customers (e.g., RFM 544) show consistent spending.
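
One way to derive such scores is quintile binning with `pd.qcut`, sketched below on ten made-up customers. The segment rules are assumptions inferred from the labels in these notes (High-Value when the score is 555, At-Risk when R, F, and M are all 2 or lower); the function name `score_rfm` and the sample values are hypothetical, and the actual notebook may bin or label differently.

```python
import pandas as pd


def score_rfm(rfm: pd.DataFrame) -> pd.DataFrame:
    """Add 1-5 quintile scores, a combined RFM_Score, and a Segment label."""
    out = rfm.copy()
    # Rank first so ties cannot produce duplicate bin edges in qcut.
    # Low recency means a recent purchase, so its labels run from 5 down to 1.
    out["R_Score"] = pd.qcut(
        out["Recency"].rank(method="first"), 5, labels=[5, 4, 3, 2, 1]
    ).astype(int)
    out["F_Score"] = pd.qcut(
        out["Frequency"].rank(method="first"), 5, labels=[1, 2, 3, 4, 5]
    ).astype(int)
    out["M_Score"] = pd.qcut(
        out["Monetary"].rank(method="first"), 5, labels=[1, 2, 3, 4, 5]
    ).astype(int)
    out["RFM_Score"] = (
        out["R_Score"].astype(str)
        + out["F_Score"].astype(str)
        + out["M_Score"].astype(str)
    )
    # Assumed segment rules, inferred from the segment names in these notes.
    out["Segment"] = "Others"
    at_risk = (out[["R_Score", "F_Score", "M_Score"]] <= 2).all(axis=1)
    out.loc[at_risk, "Segment"] = "At-Risk"
    out.loc[out["RFM_Score"] == "555", "Segment"] = "High-Value"
    return out


# Hypothetical customers, ordered from most to least engaged.
rfm = pd.DataFrame({
    "Recency": [1, 5, 10, 20, 40, 60, 90, 150, 250, 320],
    "Frequency": [20, 15, 9, 8, 6, 5, 4, 3, 2, 1],
    "Monetary": [5000, 4000, 3000, 2500, 2000, 1500, 1000, 700, 300, 100],
})
scored = score_rfm(rfm)
print(scored[["RFM_Score", "Segment"]])
```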

## Visualization
![Segments](visualizations/rfm_segments.png)

# Customer Segmentation with Clustering

## Objectives
- Apply K-Means clustering to RFM data.
- Determine optimal number of clusters.
- Visualize and interpret segments.

## Findings
- Optimal clusters: 4 (based on elbow method).
- Segments identified: Low Activity, Regular, High Value, etc.
- Visualization shows distinct groups by Recency and Monetary.
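
The elbow method mentioned above can be sketched with scikit-learn as follows. This is a sketch on synthetic data, not the project's RFM table: the four blob centers, the noise levels, and the choice to standardize features with `StandardScaler` are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: four blobs in (Recency, Frequency, Monetary) space.
centers = np.array(
    [[30, 5, 1500], [250, 1, 400], [10, 60, 100000], [20, 20, 12000]],
    dtype=float,
)
X = np.vstack(
    [rng.normal(loc=c, scale=[5, 1, 0.05 * c[2]], size=(50, 3)) for c in centers]
)

# Put the three features on a comparable scale before computing distances.
X_scaled = StandardScaler().fit_transform(X)

# Elbow method: record the within-cluster sum of squares for each k.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(f"k={k}: inertia={inertia:.2f}")

# Fit the final model with the k chosen from the elbow plot.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_scaled)
```

The "elbow" is the k after which inertia stops dropping sharply; plotting `inertias` against k gives the kind of curve shown in the elbow plot.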

## Visualization
![Clusters](visualizations/customer_clusters.png)
![Elbow](visualizations/elbow_plot.png)

# Revenue Forecasting with Machine Learning

## Objectives
- Predict customer revenue using RFM and cluster data.
- Train and compare Linear Regression and Random Forest models.
- Visualize and evaluate predictions.

## Findings
- Linear Regression: MSE 68,142,695.62, R² 0.33.
- Random Forest: MSE 38,344,259.88, R² 0.63.
- Random Forest outperforms, capturing non-linear patterns better.
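
A comparison of this kind can be sketched with scikit-learn as below. The data is synthetic and the feature set (recency and frequency only) is an assumption, so the printed numbers will not match the figures above; the sketch only shows how MSE and R² are computed for both models on a held-out split.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 600
recency = rng.uniform(0, 365, n)
frequency = rng.integers(1, 21, n)
# Synthetic target with a threshold effect that a straight line cannot follow.
revenue = np.where(recency < 60, 200.0, 20.0) * frequency + rng.normal(0, 50, n)

X = np.column_stack([recency, frequency])
X_train, X_test, y_train, y_test = train_test_split(
    X, revenue, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Fit each model on the training split and score it on the held-out split.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "mse": mean_squared_error(y_test, pred),
        "r2": r2_score(y_test, pred),
    }
    print(f"{name}: MSE {results[name]['mse']:.2f}, R2 {results[name]['r2']:.2f}")
```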

## Visualization
![Predictions](visualizations/revenue_predictions.png)

# LLM-Powered Insight Generation

## Objectives
- Generate business insights from RFM, clustering, and revenue data.
- Provide actionable recommendations using LLM simulation.

## Findings
- High-Value customers: 347 (8.0%), At-Risk: 635 (14.6%).
- Cluster 0: 3,060 customers, avg revenue $1,352.75.
- Cluster 1: 1,061 customers, avg revenue $476.42.
- Cluster 2: 13 customers, avg revenue $127,187.96.
- Cluster 3: 204 customers, avg revenue $12,690.50.
- Average actual revenue per customer: $2,048.69; average predicted: $2,116.68 (3.3% above actual).
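
As a quick consistency check, the figures above can be recomputed from one another. The sketch below uses only numbers stated in these notes; the variable names and the templated summary string are illustrative and not taken from the project code.

```python
# (customer count, average revenue) per cluster, as listed above.
clusters = {
    0: (3060, 1352.75),
    1: (1061, 476.42),
    2: (13, 127187.96),
    3: (204, 12690.50),
}

total_customers = sum(count for count, _ in clusters.values())

# The customer-weighted mean of the cluster averages should match the
# overall average actual revenue.
weighted_avg = (
    sum(count * avg for count, avg in clusters.values()) / total_customers
)

actual_avg, predicted_avg = 2048.69, 2116.68
uplift_pct = 100 * (predicted_avg - actual_avg) / actual_avg

high_value, at_risk = 347, 635
summary = (
    f"Total customers: {total_customers}. "
    f"High-Value customers: {high_value} "
    f"({100 * high_value / total_customers:.1f}%), "
    f"At-Risk: {at_risk} ({100 * at_risk / total_customers:.1f}%)."
)

print(summary)
print(f"Weighted average revenue: ${weighted_avg:,.2f}")
print(f"Predicted vs actual: {uplift_pct:.1f}% higher")
```

The cluster sizes sum to 4,338 customers and the weighted average comes out at $2,048.69, so the cluster table and the overall averages agree.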

## Recommendations
- Invest in strategies to sustain predicted revenue growth.

## Visualization
![Insights](visualizations/insight_visualizations.png)

```python
import pandas as pd

# Preview the RFM table produced by the RFM analysis step.
df = pd.read_csv("../data/processed/rfm_data.csv")
print(df.head())

# Print the generated insight summary.
with open("../insights/insights.txt", "r") as f:
    print(f.read())
```

```text
   CustomerID         LastPurchase  Frequency  Monetary  Recency  R_Score  \
0       12346  2011-01-18 10:01:00          1  77183.60      324        1
1       12347  2011-12-07 15:52:00          7   4310.00        1        5
2       12348  2011-09-25 13:13:00          4   1797.24       74        2
3       12349  2011-11-21 09:51:00          1   1757.55       17        4
4       12350  2011-02-02 16:01:00          1    334.40      309        1

   F_Score  M_Score  RFM_Score     Segment
0        1        5        115      Others
1        5        5        555  High-Value
2        4        4        244      Others
3        1        4        414      Others
4        1        2        112     At-Risk
Total customers: 4338. High-Value customers: 347 (8.0%), At-Risk: 635 (14.6%).
Cluster 0 has 3060 customers with average revenue of $1352.75.
Cluster 1 has 1061 customers with average revenue of $476.42.
Cluster 3 has 204 customers with average revenue of $12690.50.
```
