
# **Project Name** - Shopper Spectrum



**Project Type**: Unsupervised Machine Learning  
**Contribution**: Individual  
**Team Member**: Lakshmi Gayatri Balivada

# **Project Summary -**

This project explores transaction data from an e-commerce platform to:
- Segment customers using RFM (Recency, Frequency, Monetary) analysis
- Cluster customers using K-Means
- Recommend products using collaborative filtering
- Deploy a Streamlit interface for interactive analysis

# **GitHub Link -**

https://github.com/blgayatri/Shopper-Spectrum

# **Problem Statement**


The global e-commerce industry generates massive volumes of transaction data each day, providing a rich source of insights into customer purchasing behavior. Leveraging this data is crucial for understanding customer preferences, segmenting them effectively, and offering personalized recommendations to boost engagement and sales.

This project aims to analyze historical transaction data from an online retail store to:

* Identify distinct customer segments using RFM (Recency, Frequency, Monetary) analysis and unsupervised clustering techniques

* Build a collaborative filtering-based product recommendation system to suggest similar products based on past purchase behavior

The insights and systems developed can be directly applied to:

* Personalized marketing

* Customer retention strategies

* Product cross-selling and upselling

* Improved inventory management

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. For each chart, make sure the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
*   Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. For each algorithm, make sure the following format is answered.


*   Explain the ML model used and its performance using an evaluation metric score chart.


*   Cross-validation & hyperparameter tuning

*   Have you seen any improvement? Note down the improvement with an updated evaluation metric score chart.

*   Explain what each evaluation metric indicates for the business, and the business impact of the ML model used.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import cosine_similarity
import joblib
import warnings
warnings.filterwarnings("ignore")
from google.colab import files

# Upload file manually
uploaded = files.upload()

### Dataset Loading

In [None]:
import pandas as pd

# Load dataset (replace with exact uploaded filename)
df = pd.read_csv("online_retail.csv", encoding='ISO-8859-1')
df.head()




## ***2. Data Cleaning***

In [None]:
# ✅ Remove missing CustomerID
df = df.dropna(subset=['CustomerID'])

# ✅ Remove cancelled invoices
df = df[~df['InvoiceNo'].astype(str).str.startswith('C')]

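# ✅ Remove negative/zero quantity or price
# .copy() makes df an independent frame, so the column assignments below do not hit a chained-assignment warning
df = df[(df['Quantity'] > 0) & (df['UnitPrice'] > 0)].copy()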

# ✅ Convert date
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

# ✅ Create TotalPrice
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

# ✅ Final shape
print("Final data shape:", df.shape)


## ***3. EDA***




#### 1. Transactions by Country (excluding UK for contrast)

In [None]:
# 🌍 Transactions by Country (Top 10 excluding UK)
top_countries = df[df['Country'] != 'United Kingdom']['Country'].value_counts().head(10)

plt.figure(figsize=(10, 5))
sns.barplot(x=top_countries.values, y=top_countries.index, palette="viridis")
plt.title("Top 10 Countries by Transaction Volume (excluding UK)")
plt.xlabel("Number of Transactions")
plt.ylabel("Country")
plt.tight_layout()
plt.show()


#### 2. Top Selling Products by Quantity

In [None]:
top_products = df.groupby('Description')['Quantity'].sum().sort_values(ascending=False).head(10)

plt.figure(figsize=(10, 5))
sns.barplot(x=top_products.values, y=top_products.index, palette="magma")
plt.title("Top 10 Best-Selling Products")
plt.xlabel("Quantity Sold")
plt.ylabel("Product Description")
plt.show()

#### 3. Monthly Sales Trend

In [None]:
# 📆 Extract Year-Month for trend analysis
df['Month'] = df['InvoiceDate'].dt.to_period('M')
monthly_sales = df.groupby('Month')['TotalPrice'].sum()

plt.figure(figsize=(12, 6))
monthly_sales.plot(kind='line', marker='o')
plt.title("Monthly Revenue Trend")
plt.xlabel("Month")
plt.ylabel("Revenue (£)")
plt.grid(True)
plt.show()

#### 4. Distribution of Transaction Values

In [None]:
transaction_values = df.groupby('InvoiceNo')['TotalPrice'].sum()

plt.figure(figsize=(8, 4))
sns.histplot(transaction_values, bins=50, kde=True, color='teal')
plt.title("Distribution of Transaction Values")
plt.xlabel("Transaction Value (£)")
plt.ylabel("Frequency")
plt.xlim(0, 1000)
plt.show()

#### 5. Total Spend per Customer

In [None]:
customer_spend = df.groupby('CustomerID')['TotalPrice'].sum()

plt.figure(figsize=(8, 4))
sns.histplot(customer_spend, bins=50, kde=True, color='coral')
plt.title("Distribution of Total Spend per Customer")
plt.xlabel("Total Spend (£)")
plt.ylabel("Number of Customers")
plt.xlim(0, 3000)
plt.show()

#### 6. Number of Purchases per Customer

In [None]:
customer_orders = df.groupby('CustomerID')['InvoiceNo'].nunique()

plt.figure(figsize=(8, 4))
sns.histplot(customer_orders, bins=30, kde=True, color='steelblue')
plt.title("Distribution of Number of Purchases per Customer")
plt.xlabel("Number of Purchases")
plt.ylabel("Number of Customers")
plt.show()

## ***4. RFM Feature Engineering***

### 🎯 RFM Feature Engineering

We segment customers using the RFM model:

- **Recency**: How recently a customer made a purchase  
- **Frequency**: How often they purchase  
- **Monetary**: How much money they spend

This helps identify high-value, at-risk, and occasional customers.

#### RFM Calculation

In [None]:
# 📅 Reference date = 1 day after last invoice
import datetime as dt
reference_date = df['InvoiceDate'].max() + dt.timedelta(days=1)

# 🎯 Aggregate RFM metrics per CustomerID
rfm = df.groupby('CustomerID').agg({
    'InvoiceDate': lambda x: (reference_date - x.max()).days,   # Recency
    'InvoiceNo': 'nunique',                                     # Frequency
    'TotalPrice': 'sum'                                         # Monetary
}).reset_index()

# Rename columns
rfm.columns = ['CustomerID', 'Recency', 'Frequency', 'Monetary']

# Preview
rfm.head()
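
To make the aggregation concrete, here is a minimal sketch on a hypothetical four-row transaction table. The customer IDs, invoice numbers, dates, and amounts are invented for illustration and are not taken from the dataset. It applies the same three aggregations as the cell above, written with named aggregation.

```python
import pandas as pd

# Hypothetical toy transactions (invented values, not from the dataset)
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2],
    "InvoiceNo": ["A1", "A2", "B1", "B1"],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-01", "2011-01-10", "2011-01-05", "2011-01-05"]
    ),
    "TotalPrice": [10.0, 20.0, 5.0, 7.5],
})

# Reference date = 1 day after the last invoice in the toy data
toy_reference = toy["InvoiceDate"].max() + pd.Timedelta(days=1)

toy_rfm = toy.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda x: (toy_reference - x.max()).days),
    Frequency=("InvoiceNo", "nunique"),   # distinct invoices, not line items
    Monetary=("TotalPrice", "sum"),
).reset_index()

print(toy_rfm)
```

Customer 2 has two rows on the same invoice, so its Frequency is 1: `nunique` counts orders rather than line items.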

#### 7. RFM Value Distribution

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

sns.histplot(rfm['Recency'], ax=axes[0], kde=True, color='slateblue')
axes[0].set_title('Recency Distribution')

sns.histplot(rfm['Frequency'], ax=axes[1], kde=True, color='green')
axes[1].set_title('Frequency Distribution')

sns.histplot(rfm['Monetary'], ax=axes[2], kde=True, color='darkorange')
axes[2].set_title('Monetary Distribution')

plt.tight_layout()
plt.show()

## ***5. Standardization, Elbow Method & KMeans Clustering***

### Clustering Customers using KMeans

We apply KMeans clustering to group customers based on their RFM values. This helps identify meaningful segments like high-value, regular, or at-risk customers.

#### 1. Standardize RFM values

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
rfm_scaled = scaler.fit_transform(rfm[['Recency', 'Frequency', 'Monetary']])
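
Standardization matters here because KMeans relies on Euclidean distance, so a feature measured in large units (such as Monetary in pounds) would otherwise dominate the other two. The sketch below uses invented RFM rows to show what `StandardScaler` does: each column ends up with mean 0 and unit standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical RFM rows: [Recency (days), Frequency (orders), Monetary (£)]
toy_values = np.array([
    [5.0, 10.0, 5000.0],
    [40.0, 3.0, 800.0],
    [200.0, 1.0, 50.0],
    [90.0, 2.0, 300.0],
])

toy_scaled = StandardScaler().fit_transform(toy_values)

# After scaling, every column has mean ~0 and (population) std ~1
print(toy_scaled.mean(axis=0).round(6))
print(toy_scaled.std(axis=0).round(6))
```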

#### 2. Elbow Method: Find Optimal K

In [None]:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

inertia = []
K = range(1, 11)

for k in K:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(rfm_scaled)
    inertia.append(kmeans.inertia_)

plt.figure(figsize=(8, 4))
plt.plot(K, inertia, marker='o')
plt.title("Elbow Method to Find Optimal K")
plt.xlabel("Number of Clusters")
plt.ylabel("Inertia")
plt.grid(True)
plt.show()

#### 3. Silhouette Score

In [None]:
from sklearn.metrics import silhouette_score

for k in range(2, 7):
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(rfm_scaled)
    score = silhouette_score(rfm_scaled, labels)
    print(f"Silhouette Score for k={k}: {score:.3f}")
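
The printed scores can also be collected and compared programmatically. The following is a sketch on synthetic data (three well-separated groups generated with NumPy, standing in for `rfm_scaled`), with `n_init=10` set explicitly as an assumption. The highest silhouette score is only one criterion. The elbow plot and how interpretable the segments are also matter when choosing k.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for rfm_scaled: three well-separated groups (invented data)
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Score each candidate k and keep the results in a dict
scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    scores[k] = silhouette_score(points, model.fit_predict(points))

# Pick the k with the highest silhouette score
best_k = max(scores, key=scores.get)
print(f"Best k by silhouette: {best_k}")
```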

#### 4. Final Clustering with KMeans

In [None]:
kmeans = KMeans(n_clusters=4, random_state=42)
rfm['Cluster'] = kmeans.fit_predict(rfm_scaled)
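
`joblib` is imported at the top of the notebook, and a fitted scaler and model can be saved with it so that an app can assign a segment to new RFM inputs. The sketch below shows one way this might look. The toy data, the file names, and the temporary directory are assumptions for illustration, not the project's actual artifacts.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Invented RFM rows standing in for rfm[['Recency', 'Frequency', 'Monetary']]
toy_rfm = np.array([
    [2.0, 20.0, 6000.0], [5.0, 18.0, 5500.0],
    [300.0, 1.0, 40.0], [280.0, 2.0, 90.0],
])

toy_scaler = StandardScaler().fit(toy_rfm)
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
toy_kmeans.fit(toy_scaler.transform(toy_rfm))

# Save both objects: new inputs must be scaled exactly like the training data
out_dir = tempfile.mkdtemp()
scaler_path = os.path.join(out_dir, "rfm_scaler.pkl")   # hypothetical file name
model_path = os.path.join(out_dir, "rfm_kmeans.pkl")    # hypothetical file name
joblib.dump(toy_scaler, scaler_path)
joblib.dump(toy_kmeans, model_path)

# Later (for example inside an app): reload and assign a new customer
loaded_scaler = joblib.load(scaler_path)
loaded_kmeans = joblib.load(model_path)
new_customer = np.array([[3.0, 19.0, 5800.0]])  # [Recency, Frequency, Monetary]
new_cluster = int(loaded_kmeans.predict(loaded_scaler.transform(new_customer))[0])
print("Assigned cluster:", new_cluster)
```

Saving the scaler alongside the model is the key design point: predicting on unscaled inputs would place every new customer far from all centroids.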

#### 5. Interpret and Label Segments

In [None]:
# Label clusters based on RFM means
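# Average only the RFM columns (CustomerID is an identifier, and a text 'Segment' column would break .mean() on re-runs)
cluster_profile = rfm.groupby('Cluster')[['Recency', 'Frequency', 'Monetary']].mean().round(2)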
cluster_profile = cluster_profile.sort_values(by='Monetary', ascending=False)
cluster_profile

In [None]:
# Example based on profile table (adjust based on your results)
segment_map = {
    0: 'Regular',
    1: 'At-Risk',
    2: 'High-Value',
    3: 'Occasional'
}
rfm['Segment'] = rfm['Cluster'].map(segment_map)
rfm.head()
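
Because KMeans cluster numbers are arbitrary, a hard-coded mapping has to be rechecked against the profile table after each refit. As an alternative, the labels can be derived from the cluster means. The sketch below uses an invented profile table and a simple ranking heuristic. Both the numbers and the rule are assumptions for illustration, not a standard method, and the helper `label_clusters` is hypothetical and expects exactly four clusters.

```python
import pandas as pd

# Invented cluster means standing in for cluster_profile
profile = pd.DataFrame(
    {
        "Recency": [40.0, 250.0, 10.0, 120.0],
        "Frequency": [5.0, 1.5, 25.0, 2.0],
        "Monetary": [1500.0, 300.0, 12000.0, 500.0],
    },
    index=pd.Index([0, 1, 2, 3], name="Cluster"),
)

def label_clusters(profile: pd.DataFrame) -> dict:
    """Heuristic (an assumption, not a standard rule) for naming four clusters."""
    remaining = profile.copy()
    labels = {}
    # Biggest spenders first
    high_value = remaining["Monetary"].idxmax()
    labels[int(high_value)] = "High-Value"
    remaining = remaining.drop(high_value)
    # Of the rest, the longest-inactive cluster is treated as at-risk
    at_risk = remaining["Recency"].idxmax()
    labels[int(at_risk)] = "At-Risk"
    remaining = remaining.drop(at_risk)
    # The last two are split by purchase frequency
    regular = remaining["Frequency"].idxmax()
    labels[int(regular)] = "Regular"
    occasional = remaining.drop(regular).index[0]
    labels[int(occasional)] = "Occasional"
    return labels

segment_map_auto = label_clusters(profile)
print(segment_map_auto)
```

The resulting dictionary has the same shape as `segment_map`, so it could be passed to `rfm['Cluster'].map(...)` in the same way.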

#### 6. Visualize Clusters (2D)

In [None]:
plt.figure(figsize=(10, 6))
sns.scatterplot(data=rfm, x='Recency', y='Monetary', hue='Segment', palette='Set2', s=100)
plt.title('Customer Segments based on Recency and Monetary Value')
plt.show()

## ***6. Recommendation System – Item-Based Collaborative Filtering***

### 🎁 Product Recommendation System using Collaborative Filtering

We use item-based collaborative filtering by:
- Creating a customer–product matrix
- Calculating cosine similarity between products
- Recommending 5 most similar products for any selected item

#### 1. Prepare Customer–Product Matrix

In [None]:
# Pivot table: rows = CustomerID, columns = Product, values = Quantity
product_matrix = df.pivot_table(index='CustomerID', columns='Description', values='Quantity', aggfunc='sum').fillna(0)

# Transpose to get item-to-item similarity
product_matrix_T = product_matrix.T

#### 2. Compute Cosine Similarity Between Products

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Cosine similarity between items
similarity_matrix = cosine_similarity(product_matrix_T)

# Convert to DataFrame
similarity_df = pd.DataFrame(similarity_matrix, index=product_matrix_T.index, columns=product_matrix_T.index)
similarity_df.head()
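
With the similarity matrix in place, retrieving the most similar products is a matter of sorting one column. The sketch below shows this on an invented customer-product table. The product names, quantities, and the helper `recommend_similar` are hypothetical and only illustrate the lookup step.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Invented customer-product quantities standing in for product_matrix
toy_matrix = pd.DataFrame(
    {
        "MUG": [2, 0, 1, 0],
        "TEAPOT": [1, 0, 1, 0],
        "CANDLE": [0, 3, 0, 1],
        "LANTERN": [1, 1, 0, 2],
    },
    index=pd.Index([101, 102, 103, 104], name="CustomerID"),
)

# Item-to-item cosine similarity, labeled by product name
toy_similarity = pd.DataFrame(
    cosine_similarity(toy_matrix.T),
    index=toy_matrix.columns,
    columns=toy_matrix.columns,
)

def recommend_similar(product: str, similarity: pd.DataFrame, top_n: int = 5) -> list:
    """Return up to top_n products most similar to `product` (hypothetical helper)."""
    if product not in similarity.index:
        return []  # unknown product name: nothing to recommend
    scores = similarity[product].drop(labels=product)  # exclude the item itself
    return scores.sort_values(ascending=False).head(top_n).index.tolist()

print(recommend_similar("MUG", toy_similarity, top_n=2))
```

On the real data the same helper could be called with `similarity_df` and a product description. The name must match a `Description` value exactly, otherwise the function returns an empty list.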

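Note: the cell below lists the first 20 unique product descriptions, which is useful for checking the exact spelling of a product name before looking it up in `similarity_df`.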

In [None]:
df['Description'].unique()[:20]

# **Conclusion**

In this project, we successfully explored and analyzed transactional data from an online retail store to uncover patterns in customer behavior and product preferences.

Key accomplishments:

* Cleaned and preprocessed real-world e-commerce data by removing missing values, invalid transactions, and negative quantities.

* Conducted comprehensive Exploratory Data Analysis (EDA) to identify top-selling products, trends across countries, and purchasing patterns.

* Engineered RFM (Recency, Frequency, Monetary) features to quantify customer engagement and value.

* Applied KMeans clustering to segment customers into meaningful groups such as High-Value, Regular, Occasional, and At-Risk.

* Built a collaborative filtering-based recommendation system using cosine similarity to suggest similar products based on purchase history.

* Developed a user-friendly Streamlit web app with two main features:

  * Product Recommendation Module: Suggests 5 similar items for a given product.

  * Customer Segmentation Module: Predicts a customer segment based on RFM inputs.

This project demonstrates how machine learning and data science can be practically applied in e-commerce for:

* Targeted marketing

* Personalized customer experience

* Customer retention

* Inventory and product strategy

With further improvements like real-time data integration, hybrid recommender systems, and campaign A/B testing, this solution can be extended for real-world deployment in large-scale retail businesses.
