In [1]:
import pandas as pd
import random
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns

# Create Mock DataFrame

In [2]:
import pandas as pd
import random
from datetime import datetime

# Define parameters
years = [2021, 2022, 2023, 2024]
customer_segments = ["small", "medium", "large"]
product_segments = ["product-segment-1", "product-segment-2", "product-segment-3"]
customers_per_year = 10000  # used both as the customer pool size and as the number of orders generated per year
customer_ids = [f"CUST-{i:04d}" for i in range(1, customers_per_year + 1)]

# Function to generate random order values based on customer segment
def generate_order_value(segment):
    if segment == "small":
        return round(random.uniform(5, 200), 2)
    elif segment == "medium":
        return round(random.uniform(200, 500), 2)
    else:
        return round(random.uniform(500, 1000), 2)

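# Give each customer one fixed segment so a customer_id always maps to the same user_segment
segment_by_customer = {cid: random.choice(customer_segments) for cid in customer_ids}

# Generate the mock data: customers_per_year orders for each year
data = []
for year in years:
    for _ in range(customers_per_year):
        customer_id = random.choice(customer_ids)
        customer_segment = segment_by_customer[customer_id]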
        product_id = f"PROD-{random.randint(100, 999)}"
        product_segment = random.choice(product_segments)
        order_value = generate_order_value(customer_segment)
        order_date = datetime(year, random.randint(1, 12), random.randint(1, 28)).strftime('%Y-%m-%d')

        data.append({
            "order_date": order_date,
            "customer_id": customer_id,
            "user_segment": customer_segment,
            "product_id": product_id,
            "product_segment": product_segment,
            "order_year": year,
            "order_value_eur": order_value
        })

df_orders = pd.DataFrame(data)


In [3]:
df_orders.head()

,order_date,customer_id,user_segment,product_id,product_segment,order_year,order_value_eur
0,2021-12-16,CUST-9198,large,PROD-263,product-segment-1,2021,833.48
1,2021-12-14,CUST-5321,large,PROD-769,product-segment-1,2021,889.39
2,2021-07-11,CUST-4352,large,PROD-346,product-segment-3,2021,689.92
3,2021-04-05,CUST-1908,large,PROD-345,product-segment-1,2021,529.83
4,2021-03-16,CUST-3863,medium,PROD-318,product-segment-2,2021,288.24


In [4]:
# Save the data to a semicolon-separated CSV file
df_orders.to_csv('orders.csv', sep=';', index=False)
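
Because the file is written with `;` as the delimiter, the same `sep` has to be passed when reading it back, otherwise pandas would parse each row as a single column. The following is a minimal sketch of that round trip. It assumes a tiny stand-in frame (`sample`, a hypothetical name) and an in-memory buffer instead of `orders.csv`, so it runs without touching the disk.

```python
import io

import pandas as pd

# Hypothetical stand-in for df_orders; the two rows reuse values from the preview above.
sample = pd.DataFrame({
    "order_date": ["2021-12-16", "2021-03-16"],
    "customer_id": ["CUST-9198", "CUST-3863"],
    "order_value_eur": [833.48, 288.24],
})

# Write to an in-memory buffer with the same options used for orders.csv.
buffer = io.StringIO()
sample.to_csv(buffer, sep=";", index=False)
buffer.seek(0)

# The delimiter must match the one used when writing.
# parse_dates converts order_date from strings to datetimes.
restored = pd.read_csv(buffer, sep=";", parse_dates=["order_date"])
```

For the real file, the buffer would be replaced by the path, for example `pd.read_csv('orders.csv', sep=';', parse_dates=['order_date'])`.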

# **Orders Dataset Overview**

This mock dataset contains 40,000 orders (10,000 per year) placed from 2021 to 2024. It is structured to simulate real-world transactional data for analysis. Each field is explained below, followed by an example scenario showing how data like this might arise in practice.

## **Dataset Fields Explanation**

- **`order_date`**: 
  - *Type*: Date (YYYY-MM-DD)
  - *Description*: The date on which the order was placed.
  - *Example*: `2021-08-16`
  - **Use**: Useful for analysing seasonality in purchases, monthly trends, or specific promotions that influence buying behavior.

- **`customer_id`**: 
  - *Type*: String
  - *Description*: A unique identifier for each customer. The same customer ID can appear multiple times when a customer places more than one order, whether within a single year or across years.
  - *Example*: `CUST-0444`
  - **Use**: Important for tracking customer purchasing patterns over time, identifying repeat buyers, and segmenting customers based on order history.

- **`user_segment`**: 
  - *Type*: Categorical (small, medium, large)
  - *Description*: The segment to which the customer belongs, based on factors such as spending history, purchasing frequency, or account size.
  - *Example*: `large`
  - **Use**: Segmentation helps in targeted marketing strategies, understanding customer behavior by group, and offering personalised discounts or promotions.

- **`product_id`**: 
  - *Type*: String
  - *Description*: A unique identifier for each product in the catalog.
  - *Example*: `PROD-183`
  - **Use**: This allows product-level analysis to see which items are most popular, measure the performance of specific products, and manage inventory effectively.

- **`product_segment`**: 
  - *Type*: Categorical (product-segment-1, product-segment-2, product-segment-3)
  - *Description*: The category or type of product purchased. This can represent different product lines, departments, or item classifications.
  - *Example*: `product-segment-1`
  - **Use**: Used to understand the performance of different product lines, identify key drivers of revenue, and manage cross-category strategies.

- **`order_year`**: 
  - *Type*: Integer (2021, 2022, 2023, 2024)
  - *Description*: The year in which the order was placed.
  - *Example*: `2021`
  - **Use**: Enables annual analysis to detect year-on-year growth, assess the impact of market changes or events on sales, and forecast future demand.

- **`order_value_eur`**: 
  - *Type*: Float (EUR)
  - *Description*: The total value of the order in euros. The value ranges are based on the customer segment, with "small" spending between 5 and 200 EUR, "medium" between 200 and 500 EUR, and "large" between 500 and 1000 EUR.
  - *Example*: `937.59 EUR`
  - **Use**: The basis for revenue analysis and customer lifetime value (CLV) calculations and, when combined with cost data, for profit margins. Also useful for comparing price sensitivity across customer segments.
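
Several of the uses listed above reduce to short pandas aggregations. The sketch below illustrates four of them: yearly revenue with year-on-year growth, revenue by calendar month, repeat buyers, and average order value per segment. It assumes a small hand-written frame (`orders`, a hypothetical name with made-up rows) that shares its column names with the dataset. It is one possible approach, not a prescribed analysis.

```python
import pandas as pd

# Hypothetical mini-sample using the same column names as the orders dataset.
orders = pd.DataFrame({
    "order_date": ["2021-03-16", "2021-12-16", "2022-07-11", "2022-12-14", "2022-12-20"],
    "customer_id": ["CUST-0001", "CUST-0002", "CUST-0001", "CUST-0003", "CUST-0002"],
    "user_segment": ["medium", "large", "medium", "small", "large"],
    "order_year": [2021, 2021, 2022, 2022, 2022],
    "order_value_eur": [288.24, 833.48, 310.00, 45.50, 900.00],
})
orders["order_date"] = pd.to_datetime(orders["order_date"])

# order_year: total revenue per year and its relative change from the previous year.
yearly_revenue = orders.groupby("order_year")["order_value_eur"].sum()
yoy_growth = yearly_revenue.pct_change()

# order_date: revenue per calendar month (1-12), a first look at seasonality.
monthly_revenue = orders.groupby(orders["order_date"].dt.month)["order_value_eur"].sum()

# customer_id: customers who placed more than one order.
orders_per_customer = orders.groupby("customer_id").size()
repeat_buyers = orders_per_customer[orders_per_customer > 1].index.tolist()

# user_segment and order_value_eur: average order value per segment.
avg_by_segment = orders.groupby("user_segment")["order_value_eur"].mean()
```

To run the same aggregations on the full dataset, `orders` would be replaced by `df_orders`, with `order_date` converted via `pd.to_datetime` first because the generator stores it as a string.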

## **Example Use Case Scenario**

This dataset could represent orders collected by a **B2B SaaS platform** that provides software solutions to businesses of various sizes. These businesses may subscribe to different tiers of software packages or purchase additional services.

For instance:
- **Small business customers** could be purchasing basic subscription plans, which provide limited features suitable for startups or local operations.
- **Medium businesses** might opt for a more robust package that includes analytics, collaboration tools, and integrations with other systems.
- **Large enterprises** might purchase premium services such as dedicated customer support, advanced AI-driven features, or bespoke customisation of the platform.

