# Data Cleaning Notes
- Initial dataset: 541,909 rows, 8 columns.
- Dropped 135,080 rows (~25%) with missing CustomerID, since CustomerID is required for RFM and clustering.
- Filled 1,454 missing Descriptions with 'Unknown' for LLM compatibility.
- Removed rows with negative Quantity or UnitPrice (likely returns or cancellations).
- Created TotalPrice column (Quantity * UnitPrice) for revenue analysis.
- Converted CustomerID to string for categorical use.
- Fixed SettingWithCopyWarning using .copy() and .loc.
- Fixed TypeError by correcting logical operation syntax.
- Final shape: 392,692 rows, 9 columns.
- Saved cleaned data to data/processed/cleaned_data.csv.
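
The steps above can be sketched as a single function. This is a minimal illustration and not the project's actual script: the function name `clean_transactions` is hypothetical, the `>= 0` filter follows the "negative" wording in the notes, and converting CustomerID through `int` assumes the IDs are whole numbers that were read as floats.

```python
import pandas as pd


def clean_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps from the notes to a raw transactions frame."""
    # Drop rows without a CustomerID. Taking an explicit copy means later
    # column assignments do not trigger SettingWithCopyWarning.
    out = df.dropna(subset=["CustomerID"]).copy()

    # Replace missing descriptions with a placeholder string.
    out["Description"] = out["Description"].fillna("Unknown")

    # Parenthesize each comparison: `&` binds tighter than `>=`, so leaving
    # the parentheses out is one plausible source of a TypeError here.
    keep = (out["Quantity"] >= 0) & (out["UnitPrice"] >= 0)
    out = out.loc[keep].copy()

    # Revenue per line item.
    out["TotalPrice"] = out["Quantity"] * out["UnitPrice"]

    # Assumption: IDs are whole numbers stored as floats (e.g. 17850.0).
    out["CustomerID"] = out["CustomerID"].astype(int).astype(str)
    return out
```

Saving the result would then be a call such as `clean_transactions(raw).to_csv("data/processed/cleaned_data.csv", index=False)`.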

# Exploratory Data Analysis (EDA)

## Objectives
- Analyze distributions of Quantity, UnitPrice, TotalPrice.
- Explore monthly revenue trends.
- Investigate revenue by country.
- Compute customer-level metrics for RFM prep.

## Findings
- **Distributions**: Quantity, UnitPrice, and TotalPrice are right-skewed, with a long tail of outliers (e.g., high-value orders).
- **Monthly Revenue**: Peaks in November/December, suggesting holiday-driven sales.
- **Country Revenue**: The UK dominates (>80% of revenue), followed by Germany and France.
- **Customer Metrics**: Median frequency ~2 and median monetary value ~$674; a few high-value customers reach up to $280k.
- **Implications**: Focus on UK for marketing, cap outliers for modeling, leverage seasonal trends for forecasting.
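
The monthly and per-country aggregations behind these findings could look roughly like the sketch below. The column names `InvoiceDate` and `Country` are assumptions (the notes do not list the dataset's columns), and both helper names are hypothetical.

```python
import pandas as pd


def monthly_revenue(df: pd.DataFrame) -> pd.Series:
    """Sum TotalPrice per calendar month."""
    # Assumed column name: InvoiceDate.
    months = pd.to_datetime(df["InvoiceDate"]).dt.to_period("M")
    return df.groupby(months)["TotalPrice"].sum()


def country_revenue_share(df: pd.DataFrame) -> pd.Series:
    """Fraction of total revenue per country, largest first."""
    # Assumed column name: Country.
    totals = df.groupby("Country")["TotalPrice"].sum()
    return (totals / totals.sum()).sort_values(ascending=False)
```

A share above 0.8 for the first entry of `country_revenue_share(...)` would correspond to the ">80% of revenue" figure reported for the UK.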

## Visualizations
![Distributions](visualizations/distributions.png)
![Monthly Revenue](visualizations/monthly_revenue.png)
![Country Revenue](visualizations/country_revenue.png)

# RFM Analysis

## Objectives
- Calculate Recency, Frequency, Monetary metrics.
- Assign RFM scores and segments.
- Visualize segment distribution.

## Findings
- High-Value customers (RFM 555) are ~5% of the base.
- At-Risk customers (RFM ≤ 222) are ~20%, needing re-engagement.
- Loyal customers (e.g., RFM 544) show consistent spending.
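
As a rough illustration of how three-digit codes such as 555 or 544 can be produced, the sketch below computes Recency, Frequency, and Monetary per customer and scores each on a 1 to 5 quintile scale. The column names `InvoiceDate` and `InvoiceNo`, the snapshot date, and the quintile scheme are assumptions, not confirmed details of this project.

```python
import pandas as pd


def rfm_table(df: pd.DataFrame, snapshot_date: pd.Timestamp) -> pd.DataFrame:
    """Per-customer Recency (days), Frequency (invoices), Monetary (sum)."""
    df = df.assign(InvoiceDate=pd.to_datetime(df["InvoiceDate"]))
    grouped = df.groupby("CustomerID")
    return pd.DataFrame({
        "Recency": (snapshot_date - grouped["InvoiceDate"].max()).dt.days,
        "Frequency": grouped["InvoiceNo"].nunique(),
        "Monetary": grouped["TotalPrice"].sum(),
    })


def quintile_score(series: pd.Series, higher_is_better: bool = True) -> pd.Series:
    """Score 1-5 by quintile; 5 is always the best score."""
    # Ranking first breaks ties, so qcut never sees duplicate bin edges.
    ranks = series.rank(method="first", ascending=higher_is_better)
    return pd.qcut(ranks, 5, labels=[1, 2, 3, 4, 5]).astype(int)


def add_scores(rfm: pd.DataFrame) -> pd.DataFrame:
    """Attach R, F, M scores and the combined three-digit RFM code."""
    out = rfm.copy()
    # A low recency (recent purchase) is good, so the scale is inverted.
    out["R"] = quintile_score(out["Recency"], higher_is_better=False)
    out["F"] = quintile_score(out["Frequency"])
    out["M"] = quintile_score(out["Monetary"])
    out["RFM"] = out["R"].astype(str) + out["F"].astype(str) + out["M"].astype(str)
    return out
```

Under this scheme, a customer coded 555 falls in the top quintile on all three measures.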

## Visualization
![Segments](visualizations/rfm_segments.png)

# Customer Segmentation with Clustering

## Objectives
- Apply K-Means clustering to RFM data.
- Determine optimal number of clusters.
- Visualize and interpret segments.

## Findings
- Optimal number of clusters: 4 (based on the elbow method).
- Segments identified: Low Activity, Regular, High Value, etc.
- Visualization shows distinct groups by Recency and Monetary.
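
The elbow method mentioned above compares K-Means inertia (within-cluster sum of squares) across candidate values of k and picks the point where the curve flattens. A minimal sketch with scikit-learn follows; standardizing the features first is an assumption, as are the helper names and the fixed `random_state`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def elbow_inertias(X, k_values, seed=42):
    """Return {k: inertia} for each candidate number of clusters."""
    # Assumption: features are standardized so that no single RFM
    # dimension dominates the Euclidean distance.
    X_scaled = StandardScaler().fit_transform(X)
    inertias = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X_scaled)
        inertias[k] = float(model.inertia_)
    return inertias


def fit_clusters(X, k, seed=42):
    """Fit K-Means with the chosen k and return one label per row."""
    X_scaled = StandardScaler().fit_transform(X)
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X_scaled)
```

Plotting the values of `elbow_inertias(X, range(1, 11))` against k gives the kind of curve shown in the elbow plot; `fit_clusters(X, 4)` would then assign the four segments.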

## Visualization
![Clusters](visualizations/customer_clusters.png)
![Elbow](visualizations/elbow_plot.png)

# Revenue Forecasting with Machine Learning

## Objectives
- Predict customer revenue using RFM and cluster data.
- Train and compare Linear Regression and Random Forest models.
- Visualize and evaluate predictions.

## Findings
- Linear Regression: MSE 68,142,695.62, R² 0.33
- Random Forest: MSE 38,344,259.88, R² 0.63
- Random Forest outperforms Linear Regression on both metrics (lower MSE, higher R²), which suggests it captures non-linear patterns better.
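
A comparison of this kind can be set up as in the sketch below, which fits both models on the same train/test split and reports MSE and R² on the held-out data. The 80/20 split, the hyperparameters, and the function name `compare_models` are assumptions for illustration; the notes do not state how the reported figures were produced.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def compare_models(X, y, seed=42):
    """Fit both regressors on one split and return test-set MSE and R2."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    models = {
        "Linear Regression": LinearRegression(),
        "Random Forest": RandomForestRegressor(n_estimators=100, random_state=seed),
    }
    results = {}
    for name, model in models.items():
        pred = model.fit(X_train, y_train).predict(X_test)
        results[name] = {
            "mse": float(mean_squared_error(y_test, pred)),
            "r2": float(r2_score(y_test, pred)),
        }
    return results
```

Evaluating both models on the same held-out split keeps the two MSE and R² values directly comparable.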

## Visualization
![Predictions](visualizations/revenue_predictions.png)