# Retail Customer Segmentation & Sales Analysis

## Project Overview
This project analyzes retail transaction data to understand customer purchasing behaviour, identify high-level customer segments, generate actionable insights, and evaluate how factors such as discounts, promotions, store type, and season affect sales performance.

## Business Problem 
Retail businesses need to understand how customers behave across store formats, promotions, and seasons in order to improve targeting, identify high-value segments, and optimize discount and promotion strategies to increase overall revenue.

## Project Objectives
- Clean and preprocess raw transaction data
- Normalize transaction-level data into analysis-ready tables
- Create customer-level metrics and segments
- Analyze how customer behaviour varies by store type, city, season, discount, and promotion
- Build an interactive dashboard
- Generate actionable insights for business decision-making

## Dataset Description
The dataset contains retail transaction-level data with 13 columns describing customer, product, and transaction attributes.

### Dataset Columns

| Column Name | Description |
|-------------|-------------|
| transaction_id | Unique identifier for each transaction |
| date | Date and time of purchase |
| customer_name | Name of the customer |
| product | List of products purchased in the transaction |
| total_items | Total number of items in transaction |
| total_cost | Total monetary value of transaction |
| payment_method | Mode of payment (cash, card, etc.) |
| city | City where the transaction occurred |
| store_type | Type of store (warehouse, department, etc.) |
| discount_applied | Indicates whether a discount was applied (True/False) |
| customer_category | Type of customer (professional, homemaker, etc.) |
| season | Season during which the purchase occurred |
| promotion | Promotion applied (None, BOGO, etc.) |

## Key Business Questions
1. How can customers be segmented based on their spending behaviour?
2. How do discounts and promotions influence total transaction value?
3. Which store types and cities generate the highest revenue?
4. How does customer purchasing behaviour vary across seasons?
5. Which customer types contribute most to overall sales?

## Notebook Scope 
This notebook focuses on data loading, cleaning, validation and restructuring.

In [1]:
# Importing required libraries
import pandas as pd
import numpy as np

## Data Loading

In [2]:
file_path = r"C:\Users\abc\Documents\projects\Retail_Transactions_Dataset.csv"
df_raw = pd.read_csv(file_path)
df_raw.head()

,Transaction_ID,Date,Customer_Name,Product,Total_Items,Total_Cost,Payment_Method,City,Store_Type,Discount_Applied,Customer_Category,Season,Promotion
0,1000000000,1/21/2022 6:27,Stacey Price,"['Ketchup', 'Shaving Cream', 'Light Bulbs']",3,71.65,Mobile Payment,Los Angeles,Warehouse Club,True,Homemaker,Winter,
1,1000000001,3/1/2023 13:01,Michelle Carlson,"['Ice Cream', 'Milk', 'Olive Oil', 'Bread', 'P...",2,25.93,Cash,San Francisco,Specialty Store,True,Professional,Fall,BOGO (Buy One Get One)
2,1000000002,3/21/2024 15:37,Lisa Graves,['Spinach'],6,41.49,Credit Card,Houston,Department Store,True,Professional,Winter,
3,1000000003,10/31/2020 9:59,Mrs. Patricia May,"['Tissues', 'Mustard']",1,39.34,Mobile Payment,Chicago,Pharmacy,True,Homemaker,Spring,
4,1000000004,12/10/2020 0:59,Susan Mitchell,['Dish Soap'],10,16.42,Debit Card,Houston,Specialty Store,False,Young Adult,Winter,Discount on Selected Items


In [6]:
df_sample = df_raw.sample(n=1000, random_state=42)
df_sample.to_csv(r"C:\Users\abc\Documents\projects\Retail-Customer_Segmentation-and-Sales-Analysis\data\sample\Retail_Transaction_Sample.csv", index=False)

## Initial Data Inspection

In [7]:
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 13 columns):
 #   Column             Non-Null Count    Dtype  
---  ------             --------------    -----  
 0   Transaction_ID     1000000 non-null  int64  
 1   Date               1000000 non-null  object 
 2   Customer_Name      1000000 non-null  object 
 3   Product            1000000 non-null  object 
 4   Total_Items        1000000 non-null  int64  
 5   Total_Cost         1000000 non-null  float64
 6   Payment_Method     1000000 non-null  object 
 7   City               1000000 non-null  object 
 8   Store_Type         1000000 non-null  object 
 9   Discount_Applied   1000000 non-null  bool   
 10  Customer_Category  1000000 non-null  object 
 11  Season             1000000 non-null  object 
 12  Promotion          666057 non-null   object 
dtypes: bool(1), float64(1), int64(2), object(9)
memory usage: 92.5+ MB


#### **Key Insights**
1. The dataset contains 13 columns and 1,000,000 rows.
2. Most columns (9 of 13) are stored as object types.
3. The `Date` column is stored as object and will be converted to datetime format for time-based analysis.
4. The `Promotion` column has only 666,057 non-null values, so 333,943 entries are missing.

In [13]:
df_raw['Promotion'].isnull().sum()

np.int64(333943)
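
To put this count in proportion, the missing share can be computed with `isna().mean()`. The sketch below uses a small made-up Series in place of `df_raw['Promotion']` and then repeats the arithmetic with the counts reported above; the variable names are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for df_raw['Promotion']; the values are illustrative only.
promotion = pd.Series([
    "BOGO (Buy One Get One)", None, None,
    "Discount on Selected Items", None, "BOGO (Buy One Get One)",
])

missing_count = int(promotion.isna().sum())
missing_share = promotion.isna().mean()  # fraction of entries that are missing

print(f"{missing_count} missing ({missing_share:.1%})")

# Same arithmetic using the counts reported for the full dataset
full_share = 333_943 / 1_000_000
print(f"{full_share:.1%}")  # about one third of all transactions
```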

In [14]:
df_raw.describe()

,Transaction_ID,Total_Items,Total_Cost
count,1000000.0,1000000.0,1000000.0
mean,1000500000.0,5.495941,52.45522
std,288675.3,2.871654,27.416989
min,1000000000.0,1.0,5.0
25%,1000250000.0,3.0,28.71
50%,1000500000.0,5.0,52.42
75%,1000750000.0,8.0,76.19
max,1001000000.0,10.0,100.0


#### **Key Insights**
1. The maximum number of items purchased is 10 and the minimum is 1.
2. The highest transaction total is 100.00, whereas the lowest is 5.00.
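
The ranges reported by `describe()` can double as simple validation rules. This is a minimal sketch on a toy frame with the same column names. The bounds (1 to 10 items, 5.00 to 100.00 cost) come from the summary above, and the `checks` dictionary is a hypothetical helper rather than part of the original notebook.

```python
import pandas as pd

# Toy frame mirroring three of the dataset's numeric columns (values are illustrative).
df = pd.DataFrame({
    "Transaction_ID": [1000000000, 1000000001, 1000000002],
    "Total_Items": [3, 2, 6],
    "Total_Cost": [71.65, 25.93, 41.49],
})

# Each check evaluates to True when the data satisfies the rule.
checks = {
    "unique_transaction_ids": bool(df["Transaction_ID"].is_unique),
    "items_in_range": bool(df["Total_Items"].between(1, 10).all()),
    "cost_in_range": bool(df["Total_Cost"].between(5.0, 100.0).all()),
}
print(checks)
```

Running checks like these after each cleaning step can help catch rows that fall outside the expected ranges.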

## Data Cleaning and Transformation

#### Dealing with missing values

The `Promotion` column contains 333,943 missing values. Since the column description lists `None` as a possible value, these most likely represent transactions with no promotion, so they are filled with the label `No Promotion`.

In [15]:
df_raw['Promotion'] = df_raw['Promotion'].fillna('No Promotion')

In [16]:
df_raw['Promotion'].head(5)

0                  No Promotion
1        BOGO (Buy One Get One)
2                  No Promotion
3                  No Promotion
4    Discount on Selected Items
Name: Promotion, dtype: object

In [17]:
df_raw['Promotion'].isna().sum()

np.int64(0)

#### Date and Time processing

The transaction timestamp was originally stored in a single object column containing both date and time. This column was converted to datetime format and then split into separate date and time columns.

In [18]:
# Parse 'month/day/year hour:minute' strings; values that do not match the format become NaT
df_raw['Date'] = pd.to_datetime(df_raw['Date'], format='%m/%d/%Y %H:%M', errors='coerce')

In [23]:
df_raw['Transaction_date'] = df_raw['Date'].dt.date
df_raw['Transaction_time'] = df_raw['Date'].dt.time

In [24]:
df_raw = df_raw.drop(columns=['Date'])

In [31]:
df_raw[['Transaction_ID', 'Transaction_date', 'Transaction_time']].head(3)

,Transaction_ID,Transaction_date,Transaction_time
0,1000000000,2022-01-21,06:27:00
1,1000000001,2023-03-01,13:01:00
2,1000000002,2024-03-21,15:37:00
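
Because `errors='coerce'` turns unparseable values into `NaT` without raising, it can be worth counting failed parses before the source column is dropped. This is a minimal sketch on made-up timestamps in the same layout. The last value is deliberately malformed, and the variable names are assumptions for illustration.

```python
import pandas as pd

# Toy timestamps; the last one does not match the expected format.
raw = pd.Series(["1/21/2022 6:27", "3/1/2023 13:01", "2023-03-01 13:01"])

parsed = pd.to_datetime(raw, format="%m/%d/%Y %H:%M", errors="coerce")

# Values that failed to parse are NaT; count them before discarding the original column.
n_failed = int(parsed.isna().sum())
print(n_failed)

# .dt.date and .dt.time return Python date and time objects (object dtype)
dates = parsed.dt.date
times = parsed.dt.time
print(dates.iloc[0], times.iloc[0])
```

Note that the split columns are object-typed, so keeping a datetime column as well may be more convenient for later time-based grouping.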


#### Customer Identification

A customer identifier was created from customer names. Because the ID is derived from the name alone, different customers who share a name receive the same ID.

In [27]:
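# Assign a 1-based integer ID to each distinct customer name
df_raw["Customer_ID"] = pd.factorize(df_raw['Customer_Name'])[0] + 1
# Move the identifier columns to the front, keeping the remaining columns in order
id_cols = ['Transaction_ID', 'Customer_ID', 'Customer_Name']
cols = id_cols + [col for col in df_raw.columns if col not in id_cols]
df_raw = df_raw[cols]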

In [32]:
df_raw[['Transaction_ID','Customer_ID','Customer_Name']].head(5)

,Transaction_ID,Customer_ID,Customer_Name
0,1000000000,1,Stacey Price
1,1000000001,2,Michelle Carlson
2,1000000002,3,Lisa Graves
3,1000000003,4,Mrs. Patricia May
4,1000000004,5,Susan Mitchell
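
To illustrate how `pd.factorize` assigns these IDs, here is a small sketch on made-up names. The repeated name is included on purpose, and the variable names are assumptions for illustration.

```python
import pandas as pd

# Toy names; "Stacey Price" appears twice.
names = pd.Series(["Stacey Price", "Michelle Carlson", "Stacey Price", "Lisa Graves"])

# factorize returns integer codes (in order of first appearance) and the unique values
codes, uniques = pd.factorize(names)
customer_id = codes + 1  # shift to 1-based IDs

print(customer_id.tolist())  # [1, 2, 1, 3]
print(len(uniques))          # 3 distinct names
```

Every occurrence of a name maps to the same code, which is how repeat purchases are linked to one customer under this name-based scheme.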
