## Customer Behaviour Analysis

This analysis examines customer purchasing behaviour to measure customer engagement, loyalty, and value.

### What We Will Explore:

### 1. Descriptive Statistics
- **Total Sales**: Overall revenue generated.
- **Average Transaction Value**: Average spend per transaction.
- **Unique Customers**: Number of distinct customers.

### 2. Customer Behaviour Metrics (RFM Analysis)
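- **Recency**: Days since the customer's most recent purchase, measured from a fixed snapshot date.
- **Frequency**: Number of distinct invoices (transactions) per customer.
- **Monetary**: Total amount spent per customer.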

### 3. Visualizations

To better understand the data distribution and patterns, we will use:

- **Matplotlib**: For basic histograms showing the distribution of Recency, Frequency, and Monetary values.
- **Seaborn**: For enhanced boxplots to visualize spread and outliers in customer behaviour metrics.
- **Plotly**: For interactive scatter plots or bar charts to explore relationships between customer segments and purchase behaviour dynamically.

Together, these charts give both static and interactive views of customer behaviour.

### 4. Customer Segmentation
- Group customers by their RFM scores for targeted marketing strategies.

This approach combines statistical summaries with effective visual tools to gain a comprehensive understanding of customer behaviour.


## Importing Libraries

Before we begin our customer behaviour analysis, we need to import the essential Python libraries that will help us load, process, and visualize the data.

### Libraries Used:

- **pandas**: For data manipulation and analysis (e.g., reading data, grouping, summarizing).
- **numpy**: For numerical operations and handling arrays.
- **datetime**: For working with dates, which is important when calculating recency.
- **matplotlib.pyplot**: A core Python library for creating static visualizations like histograms and bar charts.
- **seaborn**: A higher-level visualization library built on top of matplotlib, useful for more aesthetic charts like boxplots and distribution plots.
- **plotly.express**: For interactive and dynamic visualizations like scatter plots and bar charts that allow user interaction.

### Display Settings:

- `%matplotlib inline`: Renders Matplotlib charts directly below the notebook cell that creates them.
- `plt.style.use('ggplot')`: Applies Matplotlib's built-in `ggplot` style sheet.
- `sns.set(style='whitegrid')`: Applies Seaborn's `whitegrid` theme (white background with light gridlines). Because it runs after `plt.style.use`, its settings take precedence wherever the two overlap.

These libraries will provide the tools we need to perform data cleaning, exploratory analysis, and visual storytelling throughout this notebook.


In [24]:
# Data manipulation
import pandas as pd
import numpy as np

# Working with dates
from datetime import datetime

# Data visualization (static)
import matplotlib.pyplot as plt
import seaborn as sns

# Data visualization (interactive)
import plotly.express as px

# Display settings
%matplotlib inline
plt.style.use('ggplot')        # Use ggplot style for matplotlib
sns.set(style='whitegrid')     # Set seaborn plot style to white grid

# Optional: Ignore warning messages for cleaner output
import warnings
warnings.filterwarnings('ignore')


In [None]:
# Load the cleaned dataset.
# parse_dates converts the InvoiceDate strings to datetime values while reading the CSV.
df = pd.read_csv('../data/cleaned/cleaned_online_retail.csv', parse_dates=['InvoiceDate'])

# head() must be called with parentheses to return the first five rows
df.head()

  InvoiceNo StockCode                          ProductName  Quantity  \
0    536365    85123A   white hanging heart t-light holder         6   
1    536365     71053                  white metal lantern         6   
2    536365    84406B       cream cupid hearts coat hanger         8   
3    536365    84029G  knitted union flag hot water bottle         6   
4    536365    84029E       red woolly hottie white heart.         6   

               InvoiceDate  UnitPrice  Custom
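
To illustrate what `parse_dates` changes, the sketch below reads a tiny in-memory CSV twice, once without and once with the option. The two rows and the `csv_text` variable are made up for the example and are not taken from the dataset.

```python
import io

import pandas as pd

# Hypothetical two-row CSV, used only to demonstrate parse_dates
csv_text = (
    "InvoiceNo,InvoiceDate,TotalPrice\n"
    "A001,2011-01-05 08:30:00,15.30\n"
    "A002,2011-01-06 14:10:00,22.00\n"
)

# Without parse_dates the column is read as plain text
raw = pd.read_csv(io.StringIO(csv_text))

# With parse_dates the column is converted to datetime values
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["InvoiceDate"])

print(pd.api.types.is_datetime64_any_dtype(raw["InvoiceDate"]))     # False
print(pd.api.types.is_datetime64_any_dtype(parsed["InvoiceDate"]))  # True

# Datetime columns expose the .dt accessor (year, month, hour, ...)
print(parsed["InvoiceDate"].dt.hour.tolist())  # [8, 14]
```

Only a datetime column supports date arithmetic such as subtracting a purchase date from a snapshot date, which the Recency calculation relies on.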

In [25]:
#Explore cleaned dataset
df.info()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 524878 entries, 0 to 524877
Data columns (total 14 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    524878 non-null  object        
 1   StockCode    524878 non-null  object        
 2   ProductName  524878 non-null  object        
 3   Quantity     524878 non-null  int64         
 4   InvoiceDate  524878 non-null  datetime64[ns]
 5   UnitPrice    524878 non-null  float64       
 6   CustomerID   524878 non-null  int64         
 7   Country      524878 non-null  object        
 8   TotalPrice   524878 non-null  float64       
 9   Year         524878 non-null  int64         
 10  Month        524878 non-null  int64         
 11  Day          524878 non-null  int64         
 12  Hour         524878 non-null  int64         
 13  Weekday      524878 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(6), object(5)
memory usage: 56.1+ MB


In [26]:
#to see column names
df.columns

Index(['InvoiceNo', 'StockCode', 'ProductName', 'Quantity', 'InvoiceDate',
       'UnitPrice', 'CustomerID', 'Country', 'TotalPrice', 'Year', 'Month',
       'Day', 'Hour', 'Weekday'],
      dtype='object')

### 1. Descriptive Statistics
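- **Total Sales**: Overall revenue generated (sum of `TotalPrice`).
- **Average Transaction Value**: Average spend per invoice, where each invoice counts as one transaction.
- **Unique Customers**: Number of distinct customers.
- **Total Transactions**: Number of distinct invoices.
- **Total Quantity Sold**: Sum of `Quantity` across all rows.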


In [27]:
# Total Sales
total_sales = df['TotalPrice'].sum()

# Average Transaction Value per Invoice
avg_transaction_value = df.groupby('InvoiceNo')['TotalPrice'].sum().mean()

# Number of Unique Customers
unique_customers = df['CustomerID'].nunique()

# Total Number of Transactions
total_transactions = df['InvoiceNo'].nunique()

# Total Quantity Sold
total_quantity = df['Quantity'].sum()

# Display the results
print("📊 Descriptive Statistics:")
print(f"Total Sales: £{total_sales:,.2f}")
print(f"Average Transaction Value: £{avg_transaction_value:,.2f}")
print(f"Number of Unique Customers: {unique_customers}")
print(f"Total Number of Transactions: {total_transactions}")
print(f"Total Quantity Sold: {total_quantity}")


📊 Descriptive Statistics:
Total Sales: £10,642,110.80
Average Transaction Value: £533.12
Number of Unique Customers: 4338
Total Number of Transactions: 19962
Total Quantity Sold: 5572420
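
As a quick consistency check, the mean of per-invoice totals should equal total sales divided by the number of invoices, because every invoice counts once in the mean. The sketch below redoes that arithmetic from the printed figures; the two per-customer ratios are extra illustrations that the notebook itself does not compute.

```python
# Figures copied from the printed output above
total_sales = 10_642_110.80
total_transactions = 19_962
unique_customers = 4_338

# Mean of per-invoice totals = total revenue / number of invoices
avg_transaction_value = total_sales / total_transactions
print(f"Average Transaction Value: £{avg_transaction_value:,.2f}")  # £533.12

# Additional ratios derived from the same three numbers (illustrative only)
print(f"Average revenue per customer: £{total_sales / unique_customers:,.2f}")
print(f"Average invoices per customer: {total_transactions / unique_customers:.2f}")
```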


### 2. Customer Behaviour Metrics (RFM Analysis)

In [28]:
# RFM Analysis: Calculate Recency, Frequency, and Monetary metrics per customer

# 1. Define the snapshot date:
#    This is the reference date from which we measure Recency.
#    We set it one day after the most recent purchase date in the dataset,
#    so the most recent buyers get a Recency of 1 rather than 0.
snapshot_date = df['InvoiceDate'].max() + pd.Timedelta(days=1)

# 2. Calculate Recency: Number of days since the customer's last purchase
last_purchase = df.groupby('CustomerID')['InvoiceDate'].max()
recency = (snapshot_date - last_purchase).dt.days

# 3. Calculate Frequency: Number of unique invoices (transactions) made by the customer
frequency = df.groupby('CustomerID')['InvoiceNo'].nunique()

# 4. Calculate Monetary: Total amount spent by the customer
monetary = df.groupby('CustomerID')['TotalPrice'].sum()

# 5. Combine Recency, Frequency, and Monetary into a single DataFrame for analysis.
#    Passing the Series themselves (not .values) aligns them on CustomerID
#    instead of relying on all three having the same row order.
rfm = pd.DataFrame({
    'Recency': recency,
    'Frequency': frequency,
    'Monetary': monetary
}).reset_index()

# Display the first few rows of the RFM table
print(rfm.head())


   CustomerID  Recency  Frequency  Monetary
0       12346      326          1  77183.60
1       12347        2          7   4310.00
2       12348       75          4   1797.24
3       12349       19          1   1757.55
4       12350      310          1    334.40
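
The effect of adding one day to the snapshot date is easiest to see on a small example. The sketch below uses a made-up `toy` DataFrame with three hypothetical customers (none of these rows come from the dataset) and compares Recency with and without the one-day offset.

```python
import pandas as pd

# Hypothetical transactions for three made-up customers
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3],
    "InvoiceDate": pd.to_datetime([
        "2011-12-01 10:00",
        "2011-12-09 12:50",  # latest purchase in the toy data
        "2011-12-09 09:00",
        "2011-11-09 12:50",
    ]),
})

last_purchase = toy.groupby("CustomerID")["InvoiceDate"].max()

# With the offset: snapshot is one day after the latest purchase
snapshot_date = toy["InvoiceDate"].max() + pd.Timedelta(days=1)
recency = (snapshot_date - last_purchase).dt.days

# Without the offset: the most recent buyers would get a Recency of 0
recency_no_offset = (toy["InvoiceDate"].max() - last_purchase).dt.days

print(recency.tolist())            # [1, 1, 31]
print(recency_no_offset.tolist())  # [0, 0, 30]
```

`.dt.days` keeps only whole days, so customer 2, who bought a few hours before the latest purchase, gets the same Recency as customer 1.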


In [30]:
# Alternative: build the same RFM table with a single groupby/agg call.
# This replaces the rfm DataFrame from the previous cell and keeps CustomerID as the index.

# Create RFM table
rfm = df.groupby('CustomerID').agg({
    'InvoiceDate': lambda x: (snapshot_date - x.max()).days,
    'InvoiceNo': 'nunique',
    'TotalPrice': 'sum'
})

# Rename columns
rfm.rename(columns={
    'InvoiceDate': 'Recency',
    'InvoiceNo': 'Frequency',
    'TotalPrice': 'Monetary'
}, inplace=True)

# Preview
rfm.head()


            Recency  Frequency  Monetary
CustomerID                              
12346           326          1  77183.60
12347             2          7   4310.00
12348            75          4   1797.24
12349            19          1   1757.55
12350           310          1    334.40
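
The segmentation step in the outline groups customers by their RFM scores. One common way to derive such scores, shown here only as a sketch and not necessarily the scheme used later in this notebook, is to rank each metric and cut the ranks into quartiles. The `rfm_toy` table, the `quartile_score` helper, and the 1 to 4 scale are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical RFM table for eight made-up customers (not real results)
rfm_toy = pd.DataFrame(
    {
        "Recency": [5, 40, 120, 300, 15, 60, 200, 2],
        "Frequency": [10, 3, 1, 1, 6, 2, 1, 12],
        "Monetary": [2500.0, 400.0, 80.0, 50.0, 1200.0, 300.0, 60.0, 4000.0],
    },
    index=pd.Index(range(1, 9), name="CustomerID"),
)


def quartile_score(series, higher_is_better=True):
    """Score a metric from 1 (worst) to 4 (best) using quartiles of its rank."""
    # Ranking first (ties broken by order of appearance) avoids duplicate
    # bin edges when many customers share a value, e.g. Frequency == 1.
    ranks = series.rank(method="first", ascending=higher_is_better)
    return pd.qcut(ranks, q=4, labels=[1, 2, 3, 4]).astype(int)


scored = rfm_toy.assign(
    R=quartile_score(rfm_toy["Recency"], higher_is_better=False),  # recent is better
    F=quartile_score(rfm_toy["Frequency"]),
    M=quartile_score(rfm_toy["Monetary"]),
)

# Concatenate the three digits into a segment label such as "444"
scored["RFM_Score"] = (
    scored["R"].astype(str) + scored["F"].astype(str) + scored["M"].astype(str)
)
print(scored)
```

In this toy table, customer 8 (most recent, most frequent, highest spend) scores `444` and customer 4 scores `111`. Breaking ties by order of appearance is arbitrary, so customers with identical values can land in adjacent score bins.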
