<a href="https://colab.research.google.com/github/VidhanNahata/Shopper_spectrum-/blob/main/Shopping_Spectrum.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Shopper Spectrum



##### **Project Type**    - Unsupervised
##### **Contribution**    - Individual


# **Project Summary -**

The rapid expansion of the global e-commerce industry has resulted in the generation of immense volumes of transactional data every day. This data not only reflects a company's sales activities but also holds significant potential for uncovering deep insights into consumer purchasing behavior. By systematically analyzing transaction data, businesses can identify customer preferences, buying habits, and emerging market trends, informing strategic decisions and providing tailored experiences that fuel both customer satisfaction and sustained business growth.

A central challenge within this context is to segment customers effectively and recommend products they are more likely to purchase. Segmentation is crucial because not all customers have the same value to a business. Understanding and classifying customers based on their interactions and monetary contributions allows companies to allocate resources more efficiently and design targeted marketing campaigns. One robust and widely used technique for this purpose is RFM (Recency, Frequency, and Monetary) analysis.

RFM analysis breaks down customer behavior into three key dimensions:

*   **Recency** refers to how recently a customer has made a purchase, signaling ongoing engagement and propensity to buy again.
*   **Frequency** measures how often a customer purchases within a given period, highlighting repeat buyers versus one-time shoppers.
*   **Monetary** evaluates the total amount a customer has spent, identifying high-value individuals who substantially impact revenue.

By scoring customers along these three axes and clustering them into meaningful segments (such as loyal customers, potential loyalists, and at-risk customers), businesses can prioritize retention efforts, upsell strategies, and personalized promotions with higher precision.
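
As a minimal sketch of the scoring step described above, the snippet below computes RFM values for a tiny, made-up transaction table and assigns a 1 to 3 score per dimension using rank-based tertiles. The toy table, the `Amount` column, the `score` helper, and the three-bin scheme are illustrative assumptions, not the project's data or its final method.

```python
import pandas as pd

# Hypothetical toy transactions (not taken from the project dataset)
tx = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 3, 3],
    "InvoiceNo": ["A1", "A2", "B1", "C1", "C2", "C3"],
    "InvoiceDate": pd.to_datetime(
        ["2024-01-05", "2024-03-01", "2023-11-20",
         "2024-02-10", "2024-02-25", "2024-03-09"]
    ),
    "Amount": [20.0, 35.0, 15.0, 50.0, 60.0, 40.0],
})

# Reference date: one day after the last purchase in the table
snapshot = tx["InvoiceDate"].max() + pd.Timedelta(days=1)

rfm_toy = tx.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),
    Frequency=("InvoiceNo", "nunique"),
    Monetary=("Amount", "sum"),
)

def score(series, higher_is_better=True, bins=3):
    """Rank the values, then cut the ranks into equal-sized bins labelled 1..bins."""
    ranks = series.rank(method="first", ascending=higher_is_better)
    return pd.qcut(ranks, bins, labels=list(range(1, bins + 1))).astype(int)

# A small Recency is good, so the ranking is reversed for R
rfm_toy["R"] = score(rfm_toy["Recency"], higher_is_better=False)
rfm_toy["F"] = score(rfm_toy["Frequency"])
rfm_toy["M"] = score(rfm_toy["Monetary"])
print(rfm_toy)
```

Ranking before binning avoids the duplicate-edge errors that `pd.qcut` can raise on heavily tied values such as Frequency.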

Beyond segmentation, personalized product recommendations have become integral to modern e-commerce platforms. Recommender systems help customers discover products they might not otherwise find, improving the shopping experience and encouraging additional purchases. Among the various techniques used, collaborative filtering stands out for its effectiveness. Collaborative filtering leverages the wisdom of the crowd by analyzing patterns in user behavior and identifying similarities between users or items.

This approach generally falls into two categories:

*   **User-based collaborative filtering:** Recommends products to a user based on what similar users have liked or purchased.
*   **Item-based collaborative filtering:** Suggests items similar to those a user has interacted with, based on aggregate patterns across the user base.

By applying collaborative filtering to e-commerce transaction data, platforms can deliver highly relevant product suggestions, even in the absence of explicit user preferences. These recommendations can increase conversion rates, average order values, and foster long-term customer loyalty.
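
To make the user-based variant concrete, here is a small sketch on a hypothetical purchase matrix. The item names, the counts, and the `recommend_user_based` helper are assumptions made for illustration; the notebook itself builds an item-based recommender further down.

```python
import numpy as np

# Hypothetical purchase counts: rows = users, columns = items
items = ["mug", "candle", "lamp", "frame"]
purchases = np.array([
    [2, 1, 0, 0],   # user 0 (target)
    [3, 1, 1, 0],   # user 1: similar taste to user 0
    [0, 0, 0, 4],   # user 2: different taste
], dtype=float)

def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def recommend_user_based(target, matrix, top_n=1):
    # Similarity of the target user to every user
    sims = np.array([cosine(matrix[target], row) for row in matrix])
    sims[target] = 0.0                      # ignore self-similarity
    scores = sims @ matrix                  # similarity-weighted purchase totals per item
    scores[matrix[target] > 0] = -np.inf    # exclude items the target already bought
    ranked = np.argsort(scores)[::-1][:top_n]
    return [items[i] for i in ranked]

print(recommend_user_based(0, purchases))
```

User 1 is the only user similar to the target, so the one item user 1 bought that the target has not is ranked first.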

Integrating RFM-based segmentation with collaborative filtering further refines personalization. For example, different customer segments may receive tailored product recommendations or marketing messages: high-value repeat buyers could receive early access to new arrivals, while recent but infrequent purchasers might get bundled offers to encourage repeat business. Such granularity increases the effectiveness of marketing interventions and ensures that resources are focused where they are likely to deliver maximum impact.

The effectiveness of these methods hinges on rigorous data preprocessing, including handling missing values, removing anomalies, and structuring the data for meaningful interpretation. Visualization techniques such as heatmaps, cluster plots, and time-series analyses can make sense of complex behavioral patterns and demonstrate the distinctiveness of discovered segments.

In summary, the project combines RFM segmentation with collaborative filtering on e-commerce transaction data. The aim is to improve the relevance of product recommendations and to help businesses identify and retain their most valuable customers, supporting both customer satisfaction and revenue growth. Refreshing these models with new data and feedback keeps the segments and recommendations current.

# **GitHub Link -**

https://github.com/VidhanNahata/Shopper_spectrum-

# **Problem Statement**


The global e-commerce industry generates vast amounts of transaction data daily, offering valuable insights into customer purchasing behaviors. Analyzing this data is essential for identifying meaningful customer segments and recommending relevant products to enhance customer experience and drive business growth. This project aims to examine transaction data from an online retail business to uncover patterns in customer purchase behavior, segment customers based on Recency, Frequency, and Monetary (RFM) analysis, and develop a product recommendation system using collaborative filtering techniques.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following questions are answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
*   Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure the following points are addressed for each algorithm.


*   Explain the ML model used and its performance using an evaluation metric score chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with an updated evaluation metric score chart.

*   Explain what each evaluation metric indicates for the business and the business impact of the ML model used.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
import pandas as pd
import numpy as np
import datetime as dt
from datetime import timedelta
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
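# latin1 matches the encoding used when the file is reloaded later in the notebook
df = pd.read_csv('online_retail.csv', encoding='latin1')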


### Dataset First View

In [None]:

df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
df.info()

In [None]:
df.describe(include='all')

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print(f"Number of duplicate rows: {df.duplicated().sum()}")

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print("Missing values per column:")
print(df.isnull().sum())

In [None]:
df.duplicated().sum()

In [None]:
df.columns

In [None]:
df['Country'].unique()

In [None]:
df['Country'].value_counts()

### What did you know about your dataset?

It contains 541,909 rows and 8 columns.

The columns are: InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, and Country.

There are 5,268 duplicate rows.

There are missing values in the Description (1,454 missing values) and CustomerID (135,080 missing values) columns.

The data types are a mix of object, int64, and float64. The InvoiceDate column is currently an object type and will need to be converted to a datetime object for time-based analysis.

The Quantity column contains negative values, indicating potential returns or cancellations.

The UnitPrice column contains zero and negative values, which may need investigation.

The Country column contains data for 38 different countries, with the 'United Kingdom' being the most frequent.


### DROPPING ROWS WITH MISSING CUSTOMERID

In [None]:
df.dropna(subset=['CustomerID'], inplace=True)

In [None]:
df

In [None]:
df.isnull().sum()

### DROPPING DUPLICATED DATA

In [None]:
df.duplicated().sum()

df = df.drop_duplicates()
df.duplicated().sum()

### EXCLUDE CANCELLED INVOICES (InvoiceNo STARTING WITH 'C')

In [None]:
df = df[~df['InvoiceNo'].astype(str).str.startswith('C')]
df['InvoiceNo'].value_counts()

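### REMOVE NEGATIVE OR ZERO QUANTITIES AND PRICES

In [None]:
# .copy() makes the filtered frame independent, so columns added later (Month, TotalPrice) do not write to a view
df = df[(df['Quantity'] > 0) & (df['UnitPrice'] > 0)].copy()
df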
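
The cleaning steps above (drop rows without a CustomerID, drop duplicates, exclude cancelled invoices, keep positive quantities and prices) could be collected into one reusable function so that later cells apply the same rules. This is only a sketch: the name `clean_transactions` and the toy frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def clean_transactions(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the notebook's cleaning rules and return a new DataFrame."""
    required = {"InvoiceNo", "Quantity", "UnitPrice", "CustomerID"}
    missing = required - set(raw.columns)
    if missing:
        raise KeyError(f"Missing required columns: {sorted(missing)}")

    # Drop anonymous rows and exact duplicates
    out = raw.dropna(subset=["CustomerID"]).drop_duplicates()
    # Exclude cancelled invoices (InvoiceNo starting with 'C')
    out = out[~out["InvoiceNo"].astype(str).str.startswith("C")]
    # Keep only positive quantities and prices
    out = out[(out["Quantity"] > 0) & (out["UnitPrice"] > 0)]
    return out.copy()

# Hypothetical rows covering each rule: a duplicate, a cancellation,
# a zero price, and a missing CustomerID
toy = pd.DataFrame({
    "InvoiceNo": ["1001", "1001", "C1002", "1003", "1004"],
    "Quantity": [2, 2, -1, 5, 3],
    "UnitPrice": [1.5, 1.5, 1.5, 0.0, 2.0],
    "CustomerID": [10.0, 10.0, 10.0, 11.0, None],
})
cleaned_toy = clean_transactions(toy)
print(cleaned_toy)
```

Only the first row of the toy frame survives all four rules.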

# EXPLORATORY DATA ANALYSIS(EDA)

In [None]:
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sns

### Analyze transaction volume by country


In [None]:
trans_by_country = df.groupby('Country')['InvoiceNo'].nunique().sort_values(ascending=False)
plt.figure(figsize=(12,6))
sns.barplot(x=trans_by_country.index, y=trans_by_country.values, palette='viridis')
plt.title('Number of Transactions by Country')
plt.xticks(rotation=90)
plt.ylabel('Number of Transactions')
plt.show()

### Identify top-selling products

In [None]:
top_products = df.groupby('Description')['Quantity'].sum().sort_values(ascending=False).head(20)
plt.figure(figsize=(12,6))
sns.barplot(y=top_products.index, x=top_products.values, palette='magma')
plt.title('Top 20 Selling Products by Quantity')
plt.xlabel('Total Quantity Sold')
plt.ylabel('Product Description')
plt.show()

### Visualize purchase trends over time

In [None]:
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
df['Month'] = df['InvoiceDate'].dt.to_period('M')
monthly_sales = df.groupby('Month')['Quantity'].sum()
monthly_transactions = df.groupby('Month')['InvoiceNo'].nunique()

plt.figure(figsize=(14,6))
sns.lineplot(x=monthly_sales.index.astype(str), y=monthly_sales.values, label='Total Quantity Sold')
sns.lineplot(x=monthly_transactions.index.astype(str), y=monthly_transactions.values, label='Total Transactions')
plt.xticks(rotation=45)
plt.title('Monthly Purchase Trends')
plt.ylabel('Count')
plt.legend()
plt.show()

### Inspect monetary distribution per transaction and customer


In [None]:
# Calculate TotalPrice for each row
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

# Calculate Revenue per Transaction
transaction_revenue = df.groupby('InvoiceNo')['TotalPrice'].sum()

# Calculate Total Revenue per Customer
customer_revenue = df.groupby('CustomerID')['TotalPrice'].sum()

monetary_data = [
    transaction_revenue.values,        # Revenue per Transaction
    customer_revenue.dropna().values   # Revenue per Customer
]

plt.figure(figsize=(10,6))
sns.violinplot(data=monetary_data, palette=['#1f77b4', '#ff7f0e'])
plt.xticks([0, 1], ['Revenue per Transaction', 'Total Revenue per Customer'])
plt.title('Monetary Distribution: Transactions vs Customers')
plt.ylabel('Revenue')
plt.show()

### RFM distributions


In [None]:
snapshot_date = df['InvoiceDate'].max() + pd.Timedelta(days=1)

rfm = df.groupby('CustomerID').agg({
    'InvoiceDate': lambda x: (snapshot_date - x.max()).days,
    'InvoiceNo': 'nunique',
    'TotalPrice': 'sum'
}).reset_index()

rfm.columns = ['CustomerID', 'Recency', 'Frequency', 'Monetary']

# Plot RFM distributions
plt.figure(figsize=(15,4))
plt.subplot(1,3,1)
sns.histplot(rfm['Recency'], bins=50, color='blue', kde=True)
plt.title('Recency Distribution')

plt.subplot(1,3,2)
sns.histplot(rfm['Frequency'], bins=50, color='green', kde=True)
plt.title('Frequency Distribution')

plt.subplot(1,3,3)
sns.histplot(rfm['Monetary'], bins=50, color='red', kde=True)
plt.title('Monetary Distribution')

plt.tight_layout()
plt.show()

### Elbow curve for cluster selection

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rfm_scaled = rfm[['Recency', 'Frequency', 'Monetary']].copy()

scaler = StandardScaler()
rfm_scaled = scaler.fit_transform(rfm_scaled)

sse = []
k_range = range(1, 11)
for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)  # explicit n_init keeps results consistent across scikit-learn versions
    kmeans.fit(rfm_scaled)
    sse.append(kmeans.inertia_)

plt.figure(figsize=(8,4))
plt.plot(k_range, sse, 'bx-')
plt.xlabel('Number of clusters K')
plt.ylabel('Sum of squared errors (SSE)')
plt.title('Elbow Curve for Optimal K')
plt.show()

### Customer cluster profiles


In [None]:
k_optimal = 4
kmeans = KMeans(n_clusters=k_optimal, random_state=42, n_init=10) # Added n_init for clarity and to avoid warning
rfm['Cluster'] = kmeans.fit_predict(rfm_scaled)

# Summary stats per cluster
cluster_summary = rfm.groupby('Cluster').agg({
    'Recency': ['mean', 'median'],
    'Frequency': ['mean', 'median'],
    'Monetary': ['mean', 'median', 'count']
}).round(1)
print(cluster_summary)

# Visualize clusters
plt.figure(figsize=(12,8))
sns.scatterplot(data=rfm, x='Recency', y='Monetary', hue='Cluster', palette='Set1', alpha=0.6)
plt.title('Customer Segments based on Recency and Monetary')
plt.xlabel('Recency (days)')
plt.ylabel('Monetary')
plt.show()

### Product recommendation heatmap / similarity matrix

In [None]:
customer_product = df.pivot_table(index='CustomerID', columns='Description', values='Quantity', aggfunc='sum', fill_value=0)

# Compute cosine similarity between products
from sklearn.metrics.pairwise import cosine_similarity
product_sim = cosine_similarity(customer_product.T)
product_sim_df = pd.DataFrame(product_sim, index=customer_product.columns, columns=customer_product.columns)

# For visualization, pick top 20 products by total sales
top_20_products = top_products.index.tolist()
sim_subset = product_sim_df.loc[top_20_products, top_20_products]

plt.figure(figsize=(12,10))
sns.heatmap(sim_subset, cmap='coolwarm', annot=False)
plt.title('Product Similarity Heatmap (Top 20 Products)')
plt.show()

# Clustering Methodology:


### Feature Engineering:

In [None]:
import pandas as pd
from datetime import timedelta

# Load data and preprocess
df = pd.read_csv('online_retail.csv', encoding='latin1')
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

# Remove cancelled transactions (InvoiceNo starting with 'C')
df = df[~df['InvoiceNo'].astype(str).str.startswith('C')]

# Set snapshot date for Recency calculation (one day after last purchase date)
snapshot_date = df['InvoiceDate'].max() + timedelta(days=1)

# Calculate line-level revenue, then aggregate the RFM metrics per customer
df = df.assign(TotalPrice=df['Quantity'] * df['UnitPrice'])

rfm = df.groupby('CustomerID').agg(
    Recency=('InvoiceDate', lambda x: (snapshot_date - x.max()).days),  # days since last purchase
    Frequency=('InvoiceNo', 'nunique'),                                 # number of distinct invoices
    Monetary=('TotalPrice', 'sum')                                      # total spend
)

print(rfm.head())

### Standardize/Normalize the RFM values

In [None]:
from sklearn.preprocessing import StandardScaler

# Standardize only the three RFM columns, so re-running this cell never scales cluster labels added later
scaler = StandardScaler()
rfm_scaled = scaler.fit_transform(rfm[['Recency', 'Frequency', 'Monetary']])

# Optionally, inspect the scaled data
import numpy as np
print(np.round(rfm_scaled[:5], 2))

### Choose Clustering Algorithm (KMeans, DBSCAN, Hierarchical, etc.)

### KMeans Clustering

In [None]:
from sklearn.cluster import KMeans

k = 5  # Update this with your chosen value based on elbow/silhouette analysis
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
kmeans.fit(rfm_scaled)
rfm['KMeans_Cluster'] = kmeans.labels_

print("KMeans Cluster assignments (first 5):")
print(rfm['KMeans_Cluster'].head())

# Optionally: Plot cluster centers (in scaled data)
print("KMeans cluster centers (scaled RFM):")
print(kmeans.cluster_centers_)

### DBSCAN clustering

In [None]:
from sklearn.cluster import DBSCAN

# You may need to tune eps and min_samples depending on your data distribution
dbscan = DBSCAN(eps=1, min_samples=5)
rfm['DBSCAN_Cluster'] = dbscan.fit_predict(rfm_scaled)

print("DBSCAN Cluster assignments (first 5):")
print(rfm['DBSCAN_Cluster'].head())
print("DBSCAN labels unique:", set(rfm['DBSCAN_Cluster']))
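
One common heuristic for choosing `eps` is to sort each point's distance to its `min_samples`-th nearest neighbour and look for the knee in that curve. The sketch below illustrates the idea on synthetic data. The helper name `k_distances`, the two synthetic blobs, and the use of the 90th percentile as a starting value are assumptions for illustration, not a tuned setting for this dataset.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_distances(X, min_samples=5):
    """Sorted distance from each point to its min_samples-th nearest neighbour.

    The query points are the training points, so each point counts itself as
    its first neighbour, which matches how DBSCAN counts min_samples.
    """
    nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
    distances, _ = nn.kneighbors(X)
    return np.sort(distances[:, -1])

# Synthetic stand-in for scaled RFM data: two compact groups in 3 dimensions
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(0, 0.3, size=(100, 3)),
    rng.normal(4, 0.3, size=(100, 3)),
])

kd = k_distances(X_demo, min_samples=5)
eps_candidate = float(np.percentile(kd, 90))
print(f"Candidate eps (90th percentile of k-distances): {eps_candidate:.2f}")
```

In this notebook the same function could be called as `k_distances(rfm_scaled, min_samples=5)`, and plotting the returned array shows where the curve bends.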

### Hierarchical clustering

In [None]:
from sklearn.cluster import AgglomerativeClustering

# Try different n_clusters; start with 4 to match the final KMeans model used later
agg = AgglomerativeClustering(n_clusters=4)
rfm['Hierarchical_Cluster'] = agg.fit_predict(rfm_scaled)

print("Hierarchical Cluster assignments (first 5):")
print(rfm['Hierarchical_Cluster'].head())

# Using the Elbow Method and Silhouette Score to decide the number of clusters

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score # Import silhouette_score

sse = []
silhouette_scores = []
k_range = range(2, 11)  # Testing cluster counts from 2 to 10

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(rfm_scaled)

    sse.append(kmeans.inertia_)  # Sum of squared distances (WCSS)

    labels = kmeans.labels_
    sil_score = silhouette_score(rfm_scaled, labels)
    silhouette_scores.append(sil_score)

# Plot Elbow Curve and Silhouette Scores side-by-side
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(k_range, sse, marker='o')
plt.title('Elbow Method: WCSS vs Number of Clusters')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Within-cluster Sum of Squares (WCSS)')

plt.subplot(1, 2, 2)
plt.plot(k_range, silhouette_scores, marker='s', color='orange')
plt.title('Silhouette Scores vs Number of Clusters')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Silhouette Score')

plt.tight_layout()
plt.show()
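
Reading the best `k` off a plot can be complemented by picking it programmatically. The sketch below selects the `k` with the highest silhouette score on synthetic, well-separated blobs. The helper `best_k_by_silhouette` and the synthetic data are assumptions for illustration; real RFM data is rarely this clean, and the elbow curve may suggest a different value.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(X, k_values, random_state=42):
    """Fit KMeans for each k and return (best k, {k: silhouette score})."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, random_state=random_state, n_init=10).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Synthetic data with three clearly separated groups
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[-8, -8], [8, 8], [-8, 8]],
    cluster_std=0.5,
    random_state=0,
)

best_k, scores = best_k_by_silhouette(X_demo, range(2, 7))
print("Best k by silhouette:", best_k)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
```

In the notebook, `best_k_by_silhouette(rfm_scaled, range(2, 11))` would apply the same selection to the scaled RFM features.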

# Run Clustering


### KMeans clustering

In [None]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
rfm['Cluster'] = kmeans.fit_predict(rfm_scaled)



In [None]:
print("First 10 customers with cluster assignment:")
print(rfm[['Recency', 'Frequency', 'Monetary', 'Cluster']].head(10))

# Print the number of customers in each cluster
print("\nCustomer count per cluster:")
print(rfm['Cluster'].value_counts())

# Print average RFM values per cluster to help with labeling
print("\nCluster RFM Averages:")
print(rfm.groupby('Cluster')[['Recency', 'Frequency', 'Monetary']].mean().round(2))

### DBSCAN clustering

In [None]:
print(rfm.groupby('DBSCAN_Cluster')[['Recency', 'Frequency', 'Monetary']].mean())


### Hierarchical clustering

In [None]:
print(rfm.groupby('Hierarchical_Cluster')[['Recency', 'Frequency', 'Monetary']].mean())
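
Cluster means alone do not say which algorithm separates customers best. One option, sketched below under stated assumptions, is to compare silhouette scores across the three label sets while leaving out the points DBSCAN marks as noise (`-1`). The helper `silhouette_ignoring_noise` and the toy points are hypothetical.

```python
import numpy as np
from sklearn.metrics import silhouette_score

def silhouette_ignoring_noise(X, labels):
    """Silhouette score over non-noise points; None if fewer than 2 clusters remain."""
    X = np.asarray(X)
    labels = np.asarray(labels)
    mask = labels != -1                      # DBSCAN labels noise points as -1
    if len(set(labels[mask])) < 2:
        return None
    return float(silhouette_score(X[mask], labels[mask]))

# Toy example: two tight pairs plus one far-away point treated as noise
X_toy = [[0, 0], [0, 1], [10, 10], [10, 11], [50, 50]]
score_two_clusters = silhouette_ignoring_noise(X_toy, [0, 0, 1, 1, -1])
score_one_cluster = silhouette_ignoring_noise(X_toy, [0, 0, 0, 0, -1])
print(score_two_clusters, score_one_cluster)
```

In the notebook this could be applied to `rfm_scaled` with each of `rfm['Cluster']`, `rfm['DBSCAN_Cluster']`, and `rfm['Hierarchical_Cluster']`. Scores computed on different subsets of points are only roughly comparable.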


In [None]:
# Compute relevant quantiles
recency_q = rfm['Recency'].quantile([0.25, 0.75])
frequency_q = rfm['Frequency'].quantile([0.25, 0.75])
monetary_q = rfm['Monetary'].quantile([0.25, 0.75])

def label_row(row):
    r, f, m = row['Recency'], row['Frequency'], row['Monetary']
    if r <= recency_q[0.25] and f >= frequency_q[0.75] and m >= monetary_q[0.75]:
        return 'High-Value'
    elif f >= frequency_q[0.25] and m >= monetary_q[0.25]:
        return 'Regular'
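    elif r >= recency_q[0.75] and f <= frequency_q[0.25] and m <= monetary_q[0.25]:
        # Assumed rule: low frequency and spend, and no purchase for a long time
        return 'At-Risk'
    elif f <= frequency_q[0.25] and m <= monetary_q[0.25]:
        # Assumed rule: low frequency and spend, but a more recent purchase
        return 'Occasional'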
    else:
        return 'Other'

rfm['Segment'] = rfm.apply(label_row, axis=1)


# Visualizing the clusters

### 2D Scatter Plot: Recency vs. Monetary

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10,6))
sns.scatterplot(data=rfm, x='Recency', y='Monetary', hue='Cluster', palette='Set1')
plt.title("Customer Segments: Recency vs. Monetary by KMeans Cluster")
plt.xlabel("Recency (days)")
plt.ylabel("Monetary Value")
plt.legend(title='Cluster')
plt.show()

### 3D Scatter Plot: Recency, Frequency, Monetary

In [None]:
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111, projection='3d')

scatter = ax.scatter(
    rfm['Recency'],
    rfm['Frequency'],
    rfm['Monetary'],
    c=rfm['Cluster'],
    cmap='Set1',
    alpha=0.7
)
ax.set_xlabel('Recency')
ax.set_ylabel('Frequency')
ax.set_zlabel('Monetary')
plt.title("Customer Segments (3D RFM Clusters) - KMeans")
plt.colorbar(scatter, label='Cluster')
plt.show()


# Recommendation System Approach:


In [None]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Load data
df = pd.read_csv('online_retail.csv', encoding='latin1')

# Preprocessing:
# Remove cancelled transactions (InvoiceNo starting with 'C')
df = df[~df['InvoiceNo'].astype(str).str.startswith('C')]

# Aggregate purchases by CustomerID and StockCode (or Description)
# Here we use Description for product name; StockCode can also be used.
df = df.dropna(subset=['CustomerID', 'Description'])
df['Quantity'] = df['Quantity'].clip(lower=0)  # Clip negative quantities to zero (rows are kept)

# Create a customer-product matrix
customer_product_matrix = df.pivot_table(
    index='CustomerID',
    columns='Description',
    values='Quantity',
    aggfunc='sum',
    fill_value=0
)

# Compute cosine similarity between products (columns)
product_similarity = cosine_similarity(customer_product_matrix.T)
product_similarity_df = pd.DataFrame(
    product_similarity,
    index=customer_product_matrix.columns,
    columns=customer_product_matrix.columns
)

def get_top_similar_products(product_name, top_n=5):
    """
    Given a product description, return the top N most similar products.
    """
    if product_name not in product_similarity_df.columns:
        return f"Product '{product_name}' not found in product list."

    # Retrieve similarity scores for the input product
    similarity_scores = product_similarity_df[product_name]

    # Exclude the input product itself and sort by similarity
    top_products = similarity_scores.drop(product_name).sort_values(ascending=False).head(top_n)

    return top_products

# Example usage:
input_product = "WHITE HANGING HEART T-LIGHT HOLDER"  # Replace with any valid product description
top_similar = get_top_similar_products(input_product, top_n=5)

print(f"Top 5 products similar to '{input_product}':")
print(top_similar)
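
The function above answers "what is similar to this product?". A natural extension is to recommend products for a customer by summing similarity scores over everything that customer has bought. The sketch below shows one way to do that on a hypothetical 4x4 matrix. The product names, customer IDs, and the helper `recommend_for_customer` are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical customer x product quantity matrix
matrix = pd.DataFrame(
    [[2, 1, 0, 0],
     [1, 1, 1, 0],
     [0, 0, 0, 3],
     [0, 1, 2, 0]],
    index=[101, 102, 103, 104],
    columns=["MUG", "CANDLE", "LAMP", "FRAME"],
)

# Item-item cosine similarity, as in the notebook's recommender
item_sim = pd.DataFrame(
    cosine_similarity(matrix.T),
    index=matrix.columns,
    columns=matrix.columns,
)

def recommend_for_customer(customer_id, matrix, item_sim, top_n=2):
    """Score unseen products by quantity-weighted similarity to past purchases."""
    if customer_id not in matrix.index:
        return pd.Series(dtype=float)        # unknown customer: nothing to recommend
    bought = matrix.loc[customer_id]
    scores = item_sim.dot(bought)            # sum of similarity * quantity bought
    scores = scores[(bought == 0) & (scores > 0)]
    return scores.sort_values(ascending=False).head(top_n)

recs = recommend_for_customer(101, matrix, item_sim)
print(recs)
```

With the notebook's objects, the call would be `recommend_for_customer(some_id, customer_product_matrix, product_similarity_df)`. Weighting by quantity favours bulk purchases, so a binary bought/not-bought matrix may be a reasonable alternative.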


In [None]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Load your dataset
df = pd.read_csv('online_retail.csv', encoding='latin1')

# Preprocessing: Remove cancelled transactions
df = df[~df['InvoiceNo'].astype(str).str.startswith('C')]
df = df.dropna(subset=['CustomerID', 'Description'])
df['Quantity'] = df['Quantity'].clip(lower=0)  # Clip negative quantities to zero (rows are kept)

# Create the customer-product matrix
customer_product_matrix = df.pivot_table(
    index='CustomerID',
    columns='Description',
    values='Quantity',
    aggfunc='sum',
    fill_value=0
)

# Compute cosine similarity between products (cosine_similarity is imported at the top of this cell)
product_similarity = cosine_similarity(customer_product_matrix.T)
product_similarity_df = pd.DataFrame(
    product_similarity,
    index=customer_product_matrix.columns,
    columns=customer_product_matrix.columns
)

# Save the similarity DataFrame to a pickle file for later use in your Streamlit app
product_similarity_df.to_pickle('product_similarity_df.pkl')
print("product_similarity_df.pkl created successfully.")


In [None]:
import pandas as pd

# Load dataset
df = pd.read_csv('online_retail.csv', encoding='latin1')

# Clean data: remove cancelled transactions (Invoice starting with 'C')
df = df[~df['InvoiceNo'].astype(str).str.startswith('C')]

# Drop rows with missing CustomerID or Description
df = df.dropna(subset=['CustomerID', 'Description'])

# Optional: clip negative quantities (returns) to zero; the rows themselves are kept
df['Quantity'] = df['Quantity'].clip(lower=0)

# Create the pivot table: customers x products
customer_product_matrix = df.pivot_table(
    index='CustomerID',
    columns='Description',  # or use 'StockCode' if preferred
    values='Quantity',
    aggfunc='sum',
    fill_value=0
)

# Save the DataFrame for later use
customer_product_matrix.to_pickle('customer_product_matrix.pkl')
print("Saved customer_product_matrix.pkl successfully.")


In [None]:
import pandas as pd

# Load dataset
df = pd.read_csv('online_retail.csv', encoding='latin1')

# Remove cancelled transactions
df = df[~df['InvoiceNo'].astype(str).str.startswith('C')]

# Drop rows with missing StockCode or Description
df = df.dropna(subset=['StockCode', 'Description'])

# Create mapping dictionary: StockCode -> Description
stockcode_to_name = df.drop_duplicates(subset='StockCode').set_index('StockCode')['Description'].to_dict()

# Save dict to a pickle file
import pickle
with open('stockcode_to_name.pkl', 'wb') as f:
    pickle.dump(stockcode_to_name, f)

print("stockcode_to_name.pkl created successfully.")


In [None]:
! pip install streamlit -q


In [None]:
!wget -q -O - ipv4.icanhazip.com

In [None]:
! streamlit run app.py & npx localtunnel --port 8501

# **Conclusion**

The Shopper Spectrum project demonstrates the power of leveraging e-commerce transaction data to gain actionable insights into customer purchasing behaviors and improve business outcomes. By applying RFM (Recency, Frequency, Monetary) analysis, the project successfully segments customers into meaningful groups such as High-Value, Regular, Occasional, and At-Risk, allowing targeted marketing and personalized engagement strategies.

The integration of an item-based collaborative filtering recommendation system, driven by product purchase histories and cosine similarity computations, enables the delivery of relevant product suggestions. This personalized recommendation enhances the user shopping experience and drives cross-selling opportunities.

Through exploratory data analysis, clustering, and cluster-selection diagnostics (the Elbow method and Silhouette scores), the project chooses a number of customer segments supported by the data. Because clustering is unsupervised, these diagnostics measure how well separated the segments are rather than predictive accuracy. The modular design facilitates deployment in interactive environments like Streamlit, giving business stakeholders easy-to-use tools for both customer segmentation and product recommendation.

Overall, this dual approach, combining customer segmentation with collaborative filtering recommendations, enables e-commerce platforms to foster stronger customer loyalty, improve marketing efficiency, and enhance revenue growth in an increasingly competitive digital marketplace.