
<h1>Synthetic sales data generator for small businesses</h1>

Intended for decision support systems (DSS), analytics, and ML experimentation


In [None]:
import pandas as pd
import numpy as np
from pathlib import Path 


<strong><em>pandas (pd) :</em></strong> Handles dates, CSV files, DataFrames

<strong><em>numpy (np) :</em></strong> Used for random numbers, arrays, math, noise

<strong><em>Path from pathlib :</em></strong> Modern, safe way to handle file paths

<h2><em> 1. Configuration</em></h2>  

In [None]:
np.random.seed(42)


This ensures that every run of the script produces the same dataset.

Fixing the random seed → <strong>reproducible results</strong>.
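As a minimal sketch of what seeding provides, the snippet below reseeds the global generator and draws the same kind of values twice. The helper name `draw_sample` and the sample size are illustrative assumptions, not part of the generator script.

```python
import numpy as np

def draw_sample(seed, size=5):
    """Seed the global NumPy generator, then draw `size` integers."""
    np.random.seed(seed)
    return np.random.randint(2000, 12000, size=size)

first_run = draw_sample(42)
second_run = draw_sample(42)

# Same seed, same sequence of draws
print(np.array_equal(first_run, second_run))  # True
```

NumPy also offers `np.random.default_rng(seed)`, which keeps the random state in a local object rather than globally; this notebook uses the global seed throughout.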


In [None]:
N_MONTHS = 24  # 2 years of data
START_DATE = "2023-01-01"

In [None]:
OUTPUT_PATH = Path("data/raw")
#creates a Path object (the folder itself does not exist yet)

OUTPUT_PATH.mkdir(parents=True, exist_ok=True)
#parents=True : creates the parent folder 'data' if missing
#exist_ok=True : prevents crash if folder already exists


<h2><em>2. Generate Base Timeline</em></h2> 

2.1. Create a sequence of dates

In [None]:
dates = pd.date_range(start=START_DATE, periods=N_MONTHS, freq="ME") 
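#start(string) : starting date "2023-01-01"
#periods(number) : number of dates (24)
#freq(string) : monthly frequency; "ME" needs pandas 2.2 or newer (older versions use "M")
#with "ME" the first generated date is 2023-01-31, the end of the starting month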

<p><strong>Note</strong> : <br>
freq="ME" : stands for Month End<br>
freq="MS" : stands for Month Start</p>
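The difference between the two aliases can be sketched as follows. The helper `monthly_range` is a hypothetical wrapper added only so the example also runs on pandas versions older than 2.2, where the month-end alias is assumed to be `"M"` instead of `"ME"`.

```python
import pandas as pd

def monthly_range(start, periods, freq):
    """Build a monthly DatetimeIndex.

    Falls back to the legacy "M" alias when "ME" is rejected
    (assumed to happen on pandas versions older than 2.2).
    """
    try:
        return pd.date_range(start=start, periods=periods, freq=freq)
    except ValueError:
        if freq != "ME":
            raise
        return pd.date_range(start=start, periods=periods, freq="M")

month_end = monthly_range("2023-01-01", 3, "ME")
month_start = monthly_range("2023-01-01", 3, "MS")

print(month_end.strftime("%Y-%m-%d").tolist())    # ['2023-01-31', '2023-02-28', '2023-03-31']
print(month_start.strftime("%Y-%m-%d").tolist())  # ['2023-01-01', '2023-02-01', '2023-03-01']

# After formatting to "%Y-%m", both variants yield the same month labels
same_labels = list(month_end.strftime("%Y-%m")) == list(month_start.strftime("%Y-%m"))
print(same_labels)  # True
```

Because the dataset only keeps the year-month label, either alias would produce the same `month` column here.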

2.2. Format Dates

In [None]:
months = dates.strftime("%Y-%m")
#formats dates into strings : %Y : year and %m : month
#results : "2023-01" "2023-02" .....

<h2><em> 3. Business Features</em></h2>

3.1. Seasonality

In [None]:
seasons = []
for d in dates:
    if d.month in [12, 1, 2]:
        seasons.append("Winter")
    elif d.month in [3, 4, 5]:
        seasons.append("Spring")
    elif d.month in [6, 7, 8]:
        seasons.append("Summer")
    else:
        seasons.append("Autumn")

#d.month : extracts the month number
#then assigns a season label and appends it to the list


In [None]:
season_multiplier = {
    "Winter": 0.9,
    "Spring": 1.05,
    "Summer": 1.2,
    "Autumn": 1.0
}

This models <strong>business behavior</strong> by season to make the dataset more realistic :<br>
<ul > 
<li> Summer : higher demand </li>
<li>Winter : slower sales</li>
</ul>
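The season labels and multipliers above can also be expressed as two lookup tables. This is only a sketch of an alternative to the `if/elif` loop; the names `SEASON_BY_MONTH`, `SEASON_MULTIPLIER`, and `seasonal_factor` are illustrative and do not appear in the script.

```python
# Month number -> season label (same grouping as the if/elif chain)
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Autumn", 10: "Autumn", 11: "Autumn",
}

# Season label -> demand multiplier (values copied from the notebook)
SEASON_MULTIPLIER = {"Winter": 0.9, "Spring": 1.05, "Summer": 1.2, "Autumn": 1.0}

def seasonal_factor(month_number):
    """Return (season label, multiplier) for a month number 1..12."""
    season = SEASON_BY_MONTH[month_number]
    return season, SEASON_MULTIPLIER[season]

for month_number in (1, 4, 7, 10):
    print(month_number, seasonal_factor(month_number))
```

A dictionary lookup makes the month grouping visible at a glance and raises a `KeyError` for an invalid month number instead of silently labelling it "Autumn".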

3.2. Marketing Spend

In [None]:
marketing_spend = np.random.randint(2000, 12000, size=N_MONTHS)

#marketing_spend : random integers from 2,000 (inclusive) up to 12,000 (exclusive)
#size=N_MONTHS : one value per month

3.3. Website Visits

In [None]:
website_visits = (marketing_spend * np.random.uniform(2.5, 4.0, N_MONTHS)).astype(int)
#uniform(2.5, 4.0, N_MONTHS) : one random multiplier between 2.5 and 4.0 per month
#astype(int) : truncates to whole numbers, since visits cannot be fractional


Marketing spend drives traffic :<br>
Each $1 spent on marketing brings between 2.5 and 4 website visits<br>
 <strong> → visits scale with spend, while the ratio varies from month to month</strong>
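To make the spend-to-traffic relationship concrete, the sketch below computes the range of visits each spend level can produce. The three spend values and the local generator `rng` (seeded with 0) are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # local generator, seed chosen arbitrarily

spend = np.array([2000, 5000, 11999])                   # example monthly budgets
visits_per_dollar = rng.uniform(2.5, 4.0, size=spend.size)
visits = (spend * visits_per_dollar).astype(int)        # truncate to whole visits

# Bounds implied by the 2.5 to 4.0 visits-per-dollar range
lower = (spend * 2.5).astype(int)
upper = (spend * 4.0).astype(int)

for s, lo, v, hi in zip(spend, lower, visits, upper):
    print(f"spend={s}: {lo} <= {v} <= {hi}")
```

For example, a $5,000 month always lands between 12,500 and 20,000 visits, so higher budgets shift the whole range upward.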

3.4. Conversion rate

In [None]:
base_conversion = np.random.uniform(0.02, 0.06, N_MONTHS)

Base conversion rate between 2% and 6% :<br> Out of every 100 website visits, 2 to 6 become customers<br>
<strong>→ Realistic for small businesses</strong>

In [None]:
conversion_rate = [
    base_conversion[i] * season_multiplier[seasons[i]]
    for i in range(N_MONTHS)
]

Conversion rate affected by season : <br>
<ul>
<li>higher in summer </li>
<li>lower in winter</li>
</ul>
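The seasonal adjustment can be sketched with fixed numbers instead of random draws. The base rates, season labels, and visit counts below are made-up inputs; only the multipliers come from the notebook.

```python
import numpy as np

season_multiplier = {"Winter": 0.9, "Spring": 1.05, "Summer": 1.2, "Autumn": 1.0}

base_conversion = np.array([0.02, 0.04, 0.06])   # hypothetical base rates
seasons = ["Winter", "Summer", "Autumn"]         # hypothetical season per month

# Vectorized equivalent of the list comprehension used in the notebook
multipliers = np.array([season_multiplier[s] for s in seasons])
conversion_rate = base_conversion * multipliers
print(np.round(conversion_rate, 4))

# Customers = visits * conversion rate, truncated to whole customers
website_visits = np.array([10000, 10000, 10000])
num_customers = (website_visits * conversion_rate).astype(int)
print(num_customers)
```

Because `astype(int)` truncates, a product that lands just below a whole number through floating-point rounding is rounded down; `np.rint` before the cast would be one way to avoid that if exact counts matter.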

3.5. Number of customers

In [None]:
num_customers = (website_visits * conversion_rate).astype(int)

<strong> Number of customers </strong> = website visits × conversion rate, truncated to a whole number

3.6. Average order value (AOV):

In [None]:
avg_order_value = np.random.normal(loc=60, scale=10, size=N_MONTHS)
#normal(mean, std, size) --> generates random numbers following a normal distribution
#mean : loc=60 --> most values will be around 60 ( center of the distribution )
#std : scale=10 --> about 68% of values fall between 50 and 70 (natural variation)
#size=N_MONTHS : one value generated per month


In [None]:
avg_order_value = np.clip(avg_order_value, 35, 100)
#np.clip : limits each monthly average order value to the range $35 to $100

This prevents unrealistic values:<br>
<ul>
<li>No $5 orders</li>
<li>No $500 spikes</li>
</ul>
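A small sketch of how `np.clip` treats out-of-range values, using hand-picked inputs rather than random draws:

```python
import numpy as np

# Hand-picked average order values, including two unrealistic ones
raw_values = np.array([5.0, 42.5, 60.0, 99.9, 500.0])

# Values below 35 become 35, values above 100 become 100, the rest are unchanged
clipped = np.clip(raw_values, 35, 100)
print(clipped.tolist())  # [35.0, 42.5, 60.0, 99.9, 100.0]
```

With a mean of 60 and a standard deviation of 10, the lower bound sits 2.5 standard deviations below the mean and the upper bound 4 above it, so clipping should rarely change a value in practice.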

3.7.  Discounts (%)

In [None]:
discount_rate = np.random.choice([0, 5, 10, 15, 20], size=N_MONTHS)
#randomly assigns discount percentage per month


<h2><em>4. Revenue Calculation</em></h2>

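4.1. Apply real accounting logic :
 Customers × average order value × (1 − discount rate) × seasonal multiplier

 Note that the seasonal multiplier is also part of the conversion rate, so the seasonal effect enters revenue twice.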
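As a worked example of the formula for a single month, the sketch below uses made-up inputs (300 customers, a $60 average order, a 10% discount, and the summer multiplier of 1.2).

```python
# Hypothetical single-month inputs
customers = 300
avg_order_value = 60.0
discount_pct = 10          # percent, as stored in discount_rate
seasonal_multiplier = 1.2  # summer

# The discount is applied as the share of the price that is still paid
paid_share = 1 - discount_pct / 100

monthly_revenue = customers * avg_order_value * paid_share * seasonal_multiplier
print(round(monthly_revenue, 2))  # 19440.0
```

Step by step: 300 × 60 = 18,000 before discount, 16,200 after the 10% discount, and 19,440 after the seasonal multiplier.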


In [None]:
revenue = (
    num_customers
    * avg_order_value
    * (1 - discount_rate / 100)
    * [season_multiplier[s] for s in seasons]
)


4.2. Add Realistic Noise :

In [None]:
revenue = revenue + np.random.normal(0, 2000, size=N_MONTHS)
#loc = 0 → The noise is centered around zero (no systematic bias)
#scale = 2000 → This controls how big the random shocks are

In business terms, this represents:

<ul>
<li>refunds</li>
<li>late payments</li>
<li>invoice errors</li>
<li>stock issues</li>
<li>campaign hiccups</li>
<li>unexpected demand spikes/drops</li>
</ul>


In [None]:
revenue = revenue.clip(min=0).round(2)
#ensures there are no negative numbers 
#values rounded to cents
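
The effect of the noise and clipping steps can be sketched with fixed shocks in place of random ones. The base revenues and shock values below are hypothetical; they are chosen so that one month would go negative without the clip.

```python
import numpy as np

base_revenue = np.array([500.0, 8000.0, 20000.0])   # hypothetical monthly revenue
shocks = np.array([-1800.0, 950.123, -2400.5])      # stand-ins for normal(0, 2000) draws

noisy = base_revenue + shocks          # first month drops to -1300.0
cleaned = noisy.clip(min=0).round(2)   # floor at zero, then round to cents

print(cleaned.tolist())  # [0.0, 8950.12, 17599.5]
```

A shock of this size only matters for very low-revenue months; for months in the tens of thousands it shifts the figure by a few percent.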

<h2><em> 5. Build DataFrame</em></h2>

In [None]:
df = pd.DataFrame({
    "month": months,
    "season": seasons,
    "marketing_spend": marketing_spend,
    "website_visits": website_visits,
    "conversion_rate": np.round(conversion_rate, 4),
    "num_customers": num_customers,
    "avg_order_value": np.round(avg_order_value, 2),
    "discount_rate": discount_rate,
    "revenue": revenue
})
#creates tabular dataset: 
#each key : column name
#each value : column data

<h2><em> 6. Save Dataset</em></h2>

In [None]:
output_file = OUTPUT_PATH / "small_business_sales.csv"
#creates the path object

df.to_csv(output_file, index=False)
#saves csv file
#index=False : ensures there is no extra index column
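
One way to check the save step is a round trip: write a small frame, read it back, and compare. The sketch below uses a temporary directory and a two-row sample frame, both of which are assumptions made so the example does not touch the real `data/raw` folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Tiny stand-in for the generated dataset
sample = pd.DataFrame({
    "month": ["2023-01", "2023-02"],
    "revenue": [1234.5, 987.65],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = Path(tmp_dir) / "data" / "raw"
    out_dir.mkdir(parents=True, exist_ok=True)   # creates both nested folders

    csv_path = out_dir / "sample.csv"
    sample.to_csv(csv_path, index=False)         # no extra index column
    restored = pd.read_csv(csv_path)

print(list(restored.columns))  # ['month', 'revenue']
print(restored.shape)          # (2, 2)
```

If `index=False` were omitted, the file would gain an unnamed first column and the restored frame would have three columns instead of two.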

In [None]:
print("Synthetic dataset generated successfully.")
print(f"Saved to: {output_file}")