In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Introduction

In today's competitive retail landscape, understanding customer behavior and sales trends is crucial for crafting effective business strategies. This comprehensive analysis delves into key aspects of customer demographics, purchasing patterns, and seasonal influences, providing valuable insights for informed decision-making. By examining data on gender distribution, age range, purchase amounts, and category-specific spending, this analysis highlights significant trends and patterns. Additionally, the study explores seasonal sales peaks and declines, offering recommendations to optimize marketing strategies, pricing, and customer engagement throughout the year.

# 2. Dataset Overview

## 2.1 Raw Dataset

In [2]:
retail_sales = pd.read_csv("../data/raw/retail_sales_dataset.csv")

The raw dataset is sourced from [Kaggle](https://www.kaggle.com/datasets/mohammadtalib786/retail-sales-dataset/data). It describes a fictional retail landscape, with records of retail sales and customer demographics. It lets us analyze sales patterns and customer demographics to understand purchasing behavior, identify trends, gain insight into customer preferences, and explore key factors that affect retail performance.

In [3]:
retail_sales.head()

| | Transaction ID | Date | Customer ID | Gender | Age | Product Category | Quantity | Price per Unit | Total Amount |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2023-11-24 | CUST001 | Male | 34 | Beauty | 3 | 50 | 150 |
| 1 | 2 | 2023-02-27 | CUST002 | Female | 26 | Clothing | 2 | 500 | 1000 |
| 2 | 3 | 2023-01-13 | CUST003 | Male | 50 | Electronics | 1 | 30 | 30 |
| 3 | 4 | 2023-05-21 | CUST004 | Male | 37 | Clothing | 1 | 500 | 500 |
| 4 | 5 | 2023-05-06 | CUST005 | Male | 30 | Beauty | 2 | 50 | 100 |


### **Attributes**

- **`Transaction ID`**: Unique identifier for each transaction.  
- **`Date`**: The date on which the transaction occurred.  
- **`Customer ID`**: Unique identifier assigned to each customer.  
- **`Gender`**: Gender of the customer (e.g., Male, Female).  
- **`Age`**: Age of the customer at the time of the transaction.  
- **`Product Category`**: Category or type of product purchased.  
- **`Quantity`**: Number of units of the product purchased in a single transaction.  
- **`Price per Unit`**: Cost of a single unit of the product.  
- **`Total Amount`**: Total cost of the transaction, calculated as **`Quantity`** * **`Price per Unit`**.
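The relationship between `Total Amount`, `Quantity`, and `Price per Unit` described above can be checked directly. The following is a minimal sketch, assuming a toy DataFrame named `sample` (a hypothetical name) that holds only the five rows shown by `retail_sales.head()`; the same comparison could be run on the full dataset.

```python
import pandas as pd

# Toy sample: the five rows shown by retail_sales.head() above.
# `sample` is a hypothetical name used only for this illustration.
sample = pd.DataFrame({
    "Quantity": [3, 2, 1, 1, 2],
    "Price per Unit": [50, 500, 30, 500, 50],
    "Total Amount": [150, 1000, 30, 500, 100],
})

# Recompute the total from its two components.
expected_total = sample["Quantity"] * sample["Price per Unit"]

# Keep only rows where the stored total disagrees with the recomputed one.
mismatches = sample[sample["Total Amount"] != expected_total]
print(f"Rows where Total Amount != Quantity * Price per Unit: {len(mismatches)}")
```

An empty `mismatches` frame means the stored totals are consistent with the definition for the rows checked.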

## 2.2 Data Cleaning

### 2.2.1 Transform column names to lowercase snake_case

**Before Transformation**

In [4]:
retail_sales.columns

Index(['Transaction ID', 'Date', 'Customer ID', 'Gender', 'Age',
       'Product Category', 'Quantity', 'Price per Unit', 'Total Amount'],
      dtype='object')

**Transformation**

In [5]:
retail_sales.columns = retail_sales.columns.str.lower().str.replace(' ', '_')

**After Transformation**

In [6]:
retail_sales.columns

Index(['transaction_id', 'date', 'customer_id', 'gender', 'age',
       'product_category', 'quantity', 'price_per_unit', 'total_amount'],
      dtype='object')

The column names are transformed into lowercase snake_case (e.g., from `Product Category` to `product_category`). This standardization simplifies data wrangling by keeping the column names consistent, easy to reference, and compatible with common data manipulation workflows.
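If the raw header ever contained leading, trailing, or repeated spaces, a plain `' '` to `'_'` replacement would leave stray underscores. The sketch below is a slightly more defensive variant, not the transformation used in this notebook; it assumes the column names printed in the "Before Transformation" output and uses only standard pandas string methods.

```python
import pandas as pd

# Column names exactly as printed in the "Before Transformation" output.
columns = pd.Index([
    "Transaction ID", "Date", "Customer ID", "Gender", "Age",
    "Product Category", "Quantity", "Price per Unit", "Total Amount",
])

# Hypothetical variant: trim surrounding whitespace, lowercase, and collapse
# any run of whitespace into a single underscore.
snake_case = columns.str.strip().str.lower().str.replace(r"\s+", "_", regex=True)
print(list(snake_case))
```

For this dataset the result is the same as the simpler one-liner above, since the header has single spaces only.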

### 2.2.2 Convert column data types

In [7]:
# Change date data type to datetime
retail_sales["date"] = pd.to_datetime(retail_sales["date"], format="%Y-%m-%d")

# Change gender data type to category
retail_sales["gender"] = retail_sales["gender"].astype("category")

# Change product_category data type to category
retail_sales["product_category"] = retail_sales["product_category"].astype("category")

The objectives of the data type conversions are as follows:  
- **Convert `date` from string to `datetime64`**: This enables datetime operations (for example, extracting the month) and makes time-based trends easier to analyze.  
- **Convert `gender` and `product_category` from string to `category`**: This reduces memory usage and can speed up grouping, because these columns hold only a few distinct values.
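The effect of both conversions can be illustrated on synthetic data. This is a sketch under stated assumptions: the `gender` and `dates` series below are made-up stand-ins (1000 alternating values and three dates taken from the preview), not the real columns, and the exact byte counts depend on the pandas version.

```python
import pandas as pd

# Synthetic stand-ins for the real columns (assumption for illustration).
gender = pd.Series(["Male", "Female"] * 500, name="gender")
dates = pd.Series(["2023-11-24", "2023-02-27", "2023-01-13"], name="date")

# Same conversions as in the cell above.
parsed_dates = pd.to_datetime(dates, format="%Y-%m-%d")
gender_cat = gender.astype("category")

# Datetime values expose the .dt accessor, e.g. for extracting the month.
print(parsed_dates.dt.month.tolist())

# Compare the memory footprint before and after the category conversion.
bytes_before = gender.memory_usage(deep=True)
bytes_after = gender_cat.memory_usage(deep=True)
print(f"gender: {bytes_before} bytes as strings, {bytes_after} bytes as category")
```

A categorical column stores each distinct label once plus a small integer code per row, which is why the saving grows with the number of rows.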

## 2.3 Cleaned Dataset

In [8]:
retail_sales.head()

| | transaction_id | date | customer_id | gender | age | product_category | quantity | price_per_unit | total_amount |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2023-11-24 | CUST001 | Male | 34 | Beauty | 3 | 50 | 150 |
| 1 | 2 | 2023-02-27 | CUST002 | Female | 26 | Clothing | 2 | 500 | 1000 |
| 2 | 3 | 2023-01-13 | CUST003 | Male | 50 | Electronics | 1 | 30 | 30 |
| 3 | 4 | 2023-05-21 | CUST004 | Male | 37 | Clothing | 1 | 500 | 500 |
| 4 | 5 | 2023-05-06 | CUST005 | Male | 30 | Beauty | 2 | 50 | 100 |


### **Attributes**  

- **`transaction_id`**: Unique identifier for each transaction.  
- **`date`**: The date on which the transaction occurred.  
- **`customer_id`**: Unique identifier assigned to each customer.  
- **`gender`**: Gender of the customer (e.g., Male, Female).  
- **`age`**: Age of the customer at the time of the transaction.  
- **`product_category`**: Category or type of product purchased.  
- **`quantity`**: Number of units of the product purchased in a single transaction.  
- **`price_per_unit`**: Cost of a single unit of the product.  
- **`total_amount`**: Total cost of the transaction, calculated as **`quantity`** * **`price_per_unit`**.  

# 3. Exploratory Data Analysis (EDA) Insights

## 3.1 Customer Demographics

<img src="../plots/customers_gender_proportion.png" alt="Customer Gender Proportion" width="400" height="300">

- The blue section represents female customers, making up 51.00% of the total customer base.

- The orange section represents male customers, making up 49.00% of the total customer base.
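Percentages like the ones in the pie chart can be derived with `value_counts(normalize=True)`. The sketch below assumes a toy `sample` frame (hypothetical name) built from the five preview rows, so its shares differ from the 51% / 49% split of the full dataset; deduplicating on `customer_id` is an assumption made so that customers, not transactions, are counted.

```python
import pandas as pd

# Toy sample: customer and gender columns from the five preview rows.
sample = pd.DataFrame({
    "customer_id": ["CUST001", "CUST002", "CUST003", "CUST004", "CUST005"],
    "gender": ["Male", "Female", "Male", "Male", "Male"],
})

# Count each customer once, then convert counts to percentages.
customers = sample.drop_duplicates(subset="customer_id")
gender_share = customers["gender"].value_counts(normalize=True).mul(100).round(2)
print(gender_share)
```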

<img src="../plots/customers_age_distribution.png" alt="Customer Age Distribution" width="400" height="300">

In [9]:
retail_sales["age"].describe()

count    1000.00000
mean       41.39200
std        13.68143
min        18.00000
25%        29.00000
50%        42.00000
75%        53.00000
max        64.00000
Name: age, dtype: float64

- X-axis: Age groups in intervals of four years, ranging from 18 to 64.
- Y-axis: Frequency of customers in each age group, with values ranging from 0 to 100.
- The customer age distribution is relatively even, with the **youngest customer being 18 years old** and the **oldest customer being 64 years old**. The **mean customer age is about 41.4 years**, and the **median is 42 years**.

## 3.2 Product Description

<img src="../plots/total_quantity_by_product_category.png" alt="Total Quantity By Product Category Plot" width="400" height="300">

- The products are grouped into three main categories: clothing, electronics, and beauty products. In terms of popularity based on quantity sold, clothing tops the list with 894 units, followed by electronics with 849 units, and beauty products with 771 units.

In [10]:
retail_sales.groupby("product_category", observed=False)["price_per_unit"].describe()[["min", "25%", "50%", "75%", "max"]]

| product_category | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|
| Beauty | 25.0 | 30.0 | 50.0 | 300.0 | 500.0 |
| Clothing | 25.0 | 30.0 | 50.0 | 300.0 | 500.0 |
| Electronics | 25.0 | 30.0 | 50.0 | 300.0 | 500.0 |


- The price-per-unit summary statistics are identical across all three product categories: in each one, the minimum is 25, the quartiles are 30, 50, and 300, and the maximum is 500.

<img src="../plots/total_revenue_per_product_category.png" alt="Total Revenue Per Product Category Plot" width="400" height="300">

- Electronics generate the highest revenue at 156,905.00, followed by clothing at 155,580.00, and beauty products at 143,515.00.
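The quantity and revenue totals per category reported in this section are grouped sums. A minimal sketch of that aggregation follows, assuming a toy `sample` frame (hypothetical name) with the five preview rows; the resulting numbers are therefore much smaller than the figures quoted for the full dataset.

```python
import pandas as pd

# Toy sample: category, quantity, and total columns from the five preview rows.
sample = pd.DataFrame({
    "product_category": ["Beauty", "Clothing", "Electronics", "Clothing", "Beauty"],
    "quantity": [3, 2, 1, 1, 2],
    "total_amount": [150, 1000, 30, 500, 100],
})

# Named aggregation: one output column per (input column, function) pair.
category_totals = sample.groupby("product_category").agg(
    total_quantity=("quantity", "sum"),
    total_revenue=("total_amount", "sum"),
)

# Rank categories by revenue, highest first.
print(category_totals.sort_values("total_revenue", ascending=False))
```

Ranking by `total_quantity` instead of `total_revenue` can give a different order, which is what happens in the full dataset: clothing leads on units sold while electronics leads on revenue.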

## 3.3 Customer Purchase Behaviour

<img src="../plots/total_amount_distribution.png" alt="Total Amount Of Purchase Distribution" width="400" height="300">

- X-axis: Total amount of purchase.
- Y-axis: Frequency of each total purchase amount.
- The chart shows that the majority of purchases are clustered at the lower end of the scale, in the lowest bin, which covers purchase amounts from 25 to roughly 190 and has the highest frequency of 542. Frequency drops sharply as the purchase amount increases, and many of the higher amounts have a frequency of 0.

<img src="../plots/average_quantity_by_product_category_and_gender.png" alt="Average Quantity By Product Category And Gender" width="400" height="300">

- This bar graph shows the average quantity of products purchased in three categories: Beauty, Clothing, and Electronics, separated by gender (Female and Male).
  
Here are the specific values for each category:
- Beauty: Females (2.52), Males (2.50)
- Clothing: Females (2.53), Males (2.56)
- Electronics: Females (2.58), Males (2.38)

On average, females purchase slightly larger quantities of Beauty and Electronics products, while males purchase slightly larger quantities of Clothing products.

<img src="../plots/average_purchase_by_product_category_and_gender.png" alt="Average Purchase By Product Category And Gender" width="400" height="300">

This bar graph shows the average purchase amounts for three product categories (Beauty, Clothing, and Electronics), broken down by gender (Female and Male).

- Beauty Products: Males spend slightly more on average (194.58) compared to females (179.02).

- Clothing: Females spend more on average (184.30) compared to males (164.03).

- Electronics: Males spend more on average (195.54) compared to females (174.79).
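Per-category, per-gender averages such as those in the two bar graphs above can be computed with a two-key `groupby`. This is a sketch on a toy `sample` frame (hypothetical name) holding the five preview rows, so the means below are not the values reported for the full dataset, and combinations absent from the sample (for example Beauty / Female) do not appear.

```python
import pandas as pd

# Toy sample: the relevant columns from the five preview rows.
sample = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Male", "Male"],
    "product_category": ["Beauty", "Clothing", "Electronics", "Clothing", "Beauty"],
    "quantity": [3, 2, 1, 1, 2],
    "total_amount": [150, 1000, 30, 500, 100],
})

# Mean quantity and mean spend for every (category, gender) pair present.
# With categorical key columns, the `observed=` argument of groupby decides
# whether unobserved combinations are included in the result.
group_means = (
    sample.groupby(["product_category", "gender"])[["quantity", "total_amount"]]
    .mean()
    .round(2)
)
print(group_means)
```

Differences between group means of this size say nothing about statistical significance; a test or confidence interval would be needed before treating them as real effects.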

## 3.4 Seasonal Sales Pattern Analysis

<img src="../plots/monthly_sales_with_average.png" alt="Monthly Sales With Average" width="400" height="300">

**Peaks:**
- **February**: Increase in sales.
- **May**: Significant peak in sales.
- **October**: Sales rise again.
- **December**: Notable increase in sales.

**Drops:**
- **March**: Noticeable drop in sales.
- **September**: Significant decline in sales.

**Seasonal Patterns:**
- The peaks in **February**, **May**, **October**, and **December** suggest seasonal trends. These months might coincide with specific events, holidays, or marketing campaigns that drive higher sales.
- The significant sales increase in **December** could be attributed to holiday shopping and year-end promotions.
- The peak in **May** might be related to spring sales or special events like Mother's Day.

**Drops Analysis:**
- The **March** drop might indicate a post-holiday slump or the end of winter promotions.
- The decline in **September** could be due to the end of summer vacations and back-to-school expenses, causing lower discretionary spending.
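The peaks and drops listed above are read off the chart relative to the average line. One way to make that comparison explicit is sketched below, assuming a toy `sample` frame (hypothetical name) with the five preview rows; months without any transaction in the sample are simply absent, which would not be the case for the full dataset.

```python
import pandas as pd

# Toy sample: date and total columns from the five preview rows.
sample = pd.DataFrame({
    "date": pd.to_datetime(
        ["2023-11-24", "2023-02-27", "2023-01-13", "2023-05-21", "2023-05-06"]
    ),
    "total_amount": [150, 1000, 30, 500, 100],
})

# Label each transaction with its year-month, then sum sales per month.
month = sample["date"].dt.strftime("%Y-%m").rename("month")
monthly_sales = sample.groupby(month)["total_amount"].sum()

# The average of the monthly totals plays the role of the dashed line.
average_monthly = monthly_sales.mean()
above_average = monthly_sales[monthly_sales > average_monthly].index.tolist()
below_average = monthly_sales[monthly_sales < average_monthly].index.tolist()

print(f"Average monthly sales: {average_monthly:.2f}")
print("Above average:", above_average)
print("Below average:", below_average)
```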

<img src="../plots/monthly_sales_for_each_product_category.png" alt="Monthly Sales For Each Product" width="400" height="300">

**Electronics**:
- The peaks in April, July, and December suggest strong seasonal or promotional factors. For instance, April could be influenced by tax refunds, July by mid-year sales, and December by holiday shopping.

- The drops in March, June, and September might indicate off-peak periods where fewer promotions or events drive sales.

**Clothing**:
- The relatively stable sales indicate consistent demand throughout the year. Minor fluctuations could be attributed to seasonal changes in fashion trends or specific marketing campaigns.

**Beauty**:
- The peaks in February, July, and November suggest increased demand during specific times. February could be linked to Valentine's Day, July to summer promotions, and November to pre-holiday shopping.

- The drops in March, June, and September could indicate periods of lower consumer interest or fewer promotional activities.
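The per-category monthly series behind this chart can be built as a month-by-category table, from which each category's strongest month follows directly. The sketch below assumes a toy `sample` frame (hypothetical name) with the five preview rows, so the peak months it finds are not the ones described above for the full dataset.

```python
import pandas as pd

# Toy sample: the relevant columns from the five preview rows.
sample = pd.DataFrame({
    "date": pd.to_datetime(
        ["2023-11-24", "2023-02-27", "2023-01-13", "2023-05-21", "2023-05-06"]
    ),
    "product_category": ["Beauty", "Clothing", "Electronics", "Clothing", "Beauty"],
    "total_amount": [150, 1000, 30, 500, 100],
})
sample["month"] = sample["date"].dt.strftime("%Y-%m")

# Rows are months, columns are categories, cells are summed sales.
# Month/category pairs with no sales are filled with 0.
category_monthly = sample.pivot_table(
    index="month",
    columns="product_category",
    values="total_amount",
    aggfunc="sum",
    fill_value=0,
)

# For each category, the month with the highest sales.
peak_month = category_monthly.idxmax()
print(category_monthly)
print(peak_month)
```

Calling `category_monthly.idxmin()` in the same way would give each category's weakest month.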

# 4. Key Findings And Insights

### **Retail Products** 

The products are categorized into three main groups: clothing, electronics, and beauty products, with prices spanning a wide range from 25 to 500 across all categories.

### **Customer Behaviour** 

The customer base is well-balanced in terms of gender, with approximately half female and half male. Additionally, the age distribution is diverse, ranging from 18 to 64 years old. 

The majority of purchases are clustered at the lower end of the scale, specifically in the 25 to 190 range. Frequency drops sharply as the purchase amount increases, and many of the higher amounts have a frequency of 0. 

There are noticeable differences in purchasing behavior between males and females. On average, females tend to purchase slightly higher quantities of Beauty and Electronics products, while males show a preference for slightly higher quantities of Clothing products. 

In terms of spending, there are distinct patterns between males and females across product categories. For Beauty products, males spend slightly more on average (194.58) compared to females (179.02). Conversely, females outspend males on Clothing, with an average spend of 184.30 compared to 164.03. For Electronics, males again lead, spending an average of 195.54, while females spend 174.79 on average. These differences highlight varying preferences and purchasing behaviors between genders. 

### **Seasonal Findings**
The sales data reveals distinct trends and patterns over the year. Notable sales peaks occurred in February, May, October, and December, suggesting possible seasonal influences such as holidays, special events, or marketing campaigns. December stands out with a significant increase in sales, likely driven by holiday shopping and year-end promotions, while the May peak might coincide with spring sales or events like Mother's Day. Conversely, sales experienced notable drops in March and September. The March decline may reflect a post-holiday slump or the end of winter promotions, whereas September's drop could be attributed to reduced discretionary spending following summer vacations and back-to-school expenses. The red dashed line in the monthly sales chart marks the average monthly sales, 37,872.50, and helps contextualize these fluctuations by showing how each month's performance compares to the overall level. 

Sales patterns across the product categories show distinct seasonal trends. In the electronics category, there are significant sales spikes in April, July, and December, likely driven by seasonal or promotional factors: April might benefit from tax refunds, July from mid-year sales, and December from holiday shopping. Conversely, sales in March, June, and September decline noticeably, possibly reflecting off-peak periods with fewer promotions or events. In the clothing category, sales remain relatively stable throughout the year with only minor fluctuations, suggesting consistent demand with slight variations due to changing fashion trends or targeted marketing campaigns. The beauty category sees increased sales in February, July, and November, which could be tied to Valentine's Day, summer promotions, and pre-holiday shopping. However, like electronics, beauty sales dip in March, June, and September, potentially due to lower consumer interest or fewer promotions during these months. These insights highlight the role that seasonality and marketing activities may play in driving sales across product categories.

# 5. Recommendations For Business Strategy

### 1. **Gender-Specific Marketing Strategies:**
   - **Female Customers:**
     - **Beauty Products**: Females purchase slightly higher quantities of beauty products on average (2.52 vs. 2.50 for males), so targeted marketing campaigns for beauty products may be effective. Consider personalized offers, loyalty programs, and collaborations with influencers to attract and retain female customers.
     - **Clothing**: Given that females outspend males on clothing, the business should emphasize marketing clothing products to female customers. Seasonal promotions, trend-based collections, and exclusive discounts can enhance engagement.
   - **Male Customers:**
     - **Electronics**: Males spend more on electronics on average (195.54 vs. 174.79 for females). Implement marketing strategies focusing on electronics, such as tech reviews, product demonstrations, and bundles during peak sales periods like April, July, and December.
     - **Beauty Products**: Males also spend more on beauty products on average (194.58 vs. 179.02 for females), so consider marketing specific beauty products to men through targeted ads and promoting grooming and self-care routines.

### 2. **Pricing and Promotion Strategies:**
   - **Low Purchase Amounts**: Since the majority of purchases are clustered at lower purchase amounts (25 to 190), consider optimizing pricing strategies for this range to attract more customers. Implementing discounts, bundling products, and offering free shipping for purchases within this range can incentivize more sales.
   - **High Purchase Amounts**: For higher purchase amounts, introducing loyalty programs or incentives, such as cashback offers, membership benefits, or installment payment options, can encourage customers to spend more.

### 3. **Seasonal and Event-Based Promotions:**
   - **Peak Months**: Leverage the sales peaks in February, May, October, and December by running targeted marketing campaigns, special offers, and events. For example, during December, focus on holiday shopping and gift guides, while May can highlight spring sales or events like Mother's Day.
   - **Drop Months**: To address drops in March and September, introduce promotions or events to boost sales. For example, end-of-winter clearance sales in March and back-to-school promotions in September can drive customer interest.

### 4. **Category-Specific Strategies:**
   - **Electronics**: Focus on promotional events in April, July, and December when sales peak. Additionally, identify reasons for the drop in March, June, and September, and implement strategies to counteract these declines.
   - **Clothing**: Maintain consistent engagement with customers throughout the year. Introduce new collections, fashion trends, and limited-time offers to keep customer interest high.
   - **Beauty**: Increase marketing efforts in February (Valentine's Day), July (summer promotions), and November (pre-holiday shopping). Address drops in March, June, and September with targeted promotions and new product launches.

### 5. **Customer Engagement and Feedback:**
   - **Understanding Preferences**: Engage with customers through surveys, feedback forms, and social media interactions to better understand their preferences and purchasing behavior. Use this information to tailor marketing campaigns and product offerings.
   - **Personalization**: Utilize customer data to create personalized shopping experiences, recommendations, and offers. Personalized emails, targeted ads, and curated product suggestions can enhance customer satisfaction and loyalty.

By implementing these strategies, the business can optimize sales performance, enhance customer engagement, and better align product offerings with customer preferences throughout the year.

# 6. Challenges And Limitations

The dataset does not state the country in which the sales took place. Because holidays and seasons differ by country, this information would make the seasonal and time-series analysis more accurate. The currency is also unspecified, which makes the price data harder to interpret.

The product prices in the dataset are limited to only five unique values, which may not accurately reflect real-world pricing scenarios in retail.

Furthermore, more detailed information about the products would be beneficial, such as the brand and specific type of product, to provide a more comprehensive understanding of the data.