

# Time Series Dataset Construction (Deep Learning Phase)




## Constructing a time-series-ready dataset for deep learning forecasting (especially useful for LSTM models)

**GOAL (FOR NOW):** To build a time-indexed dataset to forecast product-level sales or demand, which can later inform dynamic pricing decisions.



**Before we dive in, let's clarify what we are going to predict and how the result will be interpreted.**
First, we need to decide what we are forecasting: overall demand, short-term demand (order count and quantity sold), or the impact of pricing over time.



In [None]:
import pandas as pd
data = pd.read_csv('cleaned_orders.csv')
data.columns

Index(['order_id', 'order_date', 'user_id', 'sku_id', 'quantity',
       'price_per_unit', 'discount_applied', 'order_channel', 'payment_method',
       'price_missing', 'discount_missing', 'discount_amount', 'net_price',
       'line_revenue', 'effective_discount_pct', 'order_year', 'order_month',
       'order_quarter', 'order_dayofweek', 'is_weekend', 'discount_factor',
       'line_revenue_capped', 'line_revenue_check'],
      dtype='object')

Objective: Forecast weekly order quantity per SKU, given historical behavior + product context + discounting signals.

This gives us:
1. Price elasticity estimates

2. Demand trends



In [None]:
df_orders = data.copy()
df_orders.head()

Unnamed: 0,order_id,order_date,user_id,sku_id,quantity,price_per_unit,discount_applied,order_channel,payment_method,price_missing,...,line_revenue,effective_discount_pct,order_year,order_month,order_quarter,order_dayofweek,is_weekend,discount_factor,line_revenue_capped,line_revenue_check
0,O000001,2023-08-27,U4418,P1477,2,794.7,30.0,App,Wallet,0,...,1112.58,0.3,2023,8,3,6,True,0.3,1112.58,1112.58
1,O000002,2024-08-06,U3995,P0935,5,1912.46,20.0,Mobile,UPI,0,...,7649.84,0.2,2024,8,3,1,False,0.2,7649.84,7649.84
2,O000003,2024-11-29,U5880,P1126,2,621.7,0.0,Mobile,UPI,0,...,1243.4,0.0,2024,11,4,4,False,0.0,1243.4,1243.4
3,O000004,2025-07-03,U1969,P1491,6,1679.62,0.0,App,UPI,0,...,10077.72,0.0,2025,7,3,3,False,0.0,8499.42171,10077.72
4,O000005,2024-04-20,U1925,P0274,2,658.59,20.0,App,Wallet,0,...,1053.744,0.2,2024,4,2,5,True,0.2,1053.744,1053.744


In [None]:
# df_orders['order_date'].dtype
df_orders['order_date'] = pd.to_datetime(df_orders['order_date'])

In [None]:
# week start from order_date
df_orders['week_start'] = df_orders['order_date'] - pd.to_timedelta(df_orders['order_date'].dt.dayofweek, unit='D')
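The Monday-alignment step above is reused for every dataset later in this notebook, so it can be wrapped in one helper. The sketch below is only an illustration: the name `to_week_start` is hypothetical, and `normalize()` is added on the assumption that some date columns (such as view timestamps) also carry a time of day.

```python
import pandas as pd

def to_week_start(dates: pd.Series) -> pd.Series:
    """Map each date to the Monday (at midnight) of its week."""
    dates = pd.to_datetime(dates)
    # dayofweek is 0 for Monday ... 6 for Sunday, so subtracting it lands on Monday
    return dates.dt.normalize() - pd.to_timedelta(dates.dt.dayofweek, unit="D")

# Two order dates and one view timestamp taken from the sample rows in this notebook
demo_dates = pd.Series(
    ["2023-08-27 00:00:00", "2024-08-06 00:00:00", "2024-11-17 20:03:29"]
)
demo_week_start = to_week_start(demo_dates)
print(demo_week_start.dt.strftime("%Y-%m-%d").tolist())
```

Dropping the time of day matters when `week_start` is later used as a merge key, because a Monday at 20:03:29 does not match a Monday at midnight.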

In [None]:
# Use the capped line revenue as the revenue column
df_orders['revenue'] = df_orders['line_revenue_capped']

In [None]:
# Aggregate per SKU per week
weekly_orders = df_orders.groupby(['sku_id', 'week_start']).agg({
    'quantity': 'sum',
    'revenue': 'sum'
}).reset_index()


In [None]:
# Renaming for clarity
weekly_orders.rename(columns={
    'quantity': 'weekly_order_qty',
    'revenue': 'weekly_revenue'
}, inplace=True)

In [None]:
weekly_orders.head()

Unnamed: 0,sku_id,week_start,weekly_order_qty,weekly_revenue
0,P0001,2023-09-25,3,2437.8075
1,P0001,2024-02-12,2,1670.244
2,P0001,2024-02-19,3,1540.2825
3,P0001,2024-03-04,3,2905.959
4,P0001,2024-03-18,1,626.178


In [None]:
weekly_orders.to_csv("weekly_orders.csv", index=False)
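The grouped output above only contains weeks in which a SKU actually had orders, so each SKU's series has gaps. Sequence models usually expect evenly spaced steps, so one option is to reindex every SKU onto a continuous weekly grid and treat missing weeks as zero sales. This is a minimal sketch under that assumption; `fill_missing_weeks` is a hypothetical helper, and whether a missing week really means zero demand (rather than a stockout or missing data) is a modeling decision.

```python
import pandas as pd

def fill_missing_weeks(weekly, value_cols=("weekly_order_qty", "weekly_revenue")):
    """Reindex each SKU onto a continuous weekly grid; missing weeks get 0."""
    value_cols = list(value_cols)
    parts = []
    for sku, grp in weekly.groupby("sku_id"):
        # Every 7 days from the SKU's first observed week to its last observed week
        grid = pd.date_range(grp["week_start"].min(), grp["week_start"].max(), freq="7D")
        grp = grp.set_index("week_start").reindex(grid)
        grp["sku_id"] = sku
        grp[value_cols] = grp[value_cols].fillna(0)
        parts.append(grp.rename_axis("week_start").reset_index())
    return pd.concat(parts, ignore_index=True)

gap_weeks_demo = pd.DataFrame({
    "sku_id": ["P0001", "P0001", "P0001"],
    "week_start": pd.to_datetime(["2024-02-12", "2024-02-19", "2024-03-04"]),
    "weekly_order_qty": [2, 3, 3],
    "weekly_revenue": [1670.244, 1540.2825, 2905.959],
})
dense_demo = fill_missing_weeks(gap_weeks_demo)
print(dense_demo)
```

In the demo, the week of 2024-02-26 has no orders and is inserted as a row with zero quantity and zero revenue.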

----------------------------------------------------------------------------------------------------------------------------------------------------------

## Now let's explore the products dataset and identify the columns we need

In [None]:
data2 = pd.read_csv('cleaned_products.csv')
df_products = data2.copy()


In [None]:
df_products.head()

Unnamed: 0,sku_id,product_name,category,brand,MRP,base_cost,launch_date,mrp_missing,base_cost_missing,mrp_outlier,base_cost_outlier,margin,negative_margin,product_age_months,is_new,is_stale,price_cost_ratio,high_margin,low_margin
0,P0001,Program Go,Electronics,No Brand,1308.75,823.9,2024-07-07,0,0,0,0,484.85,False,12,False,False,1.588482,True,False
1,P0002,Whole Max,Apparel,BrandD,1465.23,854.11,2023-09-02,0,0,0,0,611.12,False,22,False,True,1.715505,True,False
2,P0003,Happy Plus,Electronics,BrandE,537.82,353.3,2021-08-14,0,0,0,0,184.52,False,47,False,True,1.522276,False,False
3,P0004,Sure Go,Beauty,BrandA,532.78,328.46,2022-07-16,0,0,0,0,204.32,False,36,False,True,1.622054,False,False
4,P0005,Though Go,Sports,BrandD,1316.92,769.06,2022-06-18,0,0,0,0,547.86,False,37,False,True,1.712376,True,False


In [None]:
weekly_orders.head()

Unnamed: 0,sku_id,week_start,weekly_order_qty,weekly_revenue
0,P0001,2023-09-25,3.0,2437.8075
1,P0001,2024-02-12,2.0,1670.244
2,P0001,2024-02-19,3.0,1540.2825
3,P0001,2024-03-18,1.0,626.178
4,P0001,2024-11-18,3.0,4235.94


## Columns we need for TS forecasting from products dataset.
- brand
- category
- MRP
- base_cost
- launch_date
- margin



In [None]:
#  Enrich with product data
prod_features = ['sku_id', 'brand', 'category', 'MRP', 'base_cost', 'launch_date', 'margin']
weekly_orders_enriched = pd.merge(
    weekly_orders,
    df_products[prod_features],
    on='sku_id',
    how='left')
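A left merge silently keeps order rows whose SKU is missing from the products table, and it duplicates rows if a SKU appears twice there. As a hedged sketch on toy data (the SKU `P9999` is made up for illustration), `validate` and `indicator` can surface both problems:

```python
import pandas as pd

weekly_demo = pd.DataFrame({
    "sku_id": ["P0001", "P0001", "P9999"],
    "weekly_order_qty": [3, 2, 1],
})
products_demo = pd.DataFrame({
    "sku_id": ["P0001", "P0002"],
    "category": ["Electronics", "Apparel"],
})

merged_demo = weekly_demo.merge(
    products_demo,
    on="sku_id",
    how="left",
    validate="many_to_one",  # raises MergeError if sku_id is duplicated in products_demo
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)
unmatched_skus = (
    merged_demo.loc[merged_demo["_merge"] == "left_only", "sku_id"].unique().tolist()
)
print(unmatched_skus)
```

Rows flagged `left_only` will carry NaN product features, so it is worth checking how many there are before modeling.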


In [None]:
weekly_orders_enriched.head()

Unnamed: 0,sku_id,week_start,weekly_order_qty,weekly_revenue,brand,category,MRP,base_cost,launch_date,margin
0,P0001,2023-09-25,3,2437.8075,No Brand,Electronics,1308.75,823.9,2024-07-07,484.85
1,P0001,2024-02-12,2,1670.244,No Brand,Electronics,1308.75,823.9,2024-07-07,484.85
2,P0001,2024-02-19,3,1540.2825,No Brand,Electronics,1308.75,823.9,2024-07-07,484.85
3,P0001,2024-03-04,3,2905.959,No Brand,Electronics,1308.75,823.9,2024-07-07,484.85
4,P0001,2024-03-18,1,626.178,No Brand,Electronics,1308.75,823.9,2024-07-07,484.85


## One more feature to add: weeks since launch, i.e., how many full weeks have passed between the product's launch date and the start of each week

In [None]:
# At this week’s starting date,
# how many full weeks have passed since the product was first launched?
# Convert both to datetime
weekly_orders_enriched['week_start'] = pd.to_datetime(weekly_orders_enriched['week_start'])
weekly_orders_enriched['launch_date'] = pd.to_datetime(weekly_orders_enriched['launch_date'])

# weeks since launch
weekly_orders_enriched['weeks_since_launch'] = (
    (weekly_orders_enriched['week_start'] - weekly_orders_enriched['launch_date']).dt.days // 7
)

# Optional: Clamp negatives to 0
weekly_orders_enriched['weeks_since_launch'] = weekly_orders_enriched['weeks_since_launch'].clip(lower=0)
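Clamping to 0 hides the fact that some order weeks fall before the recorded launch date (for example, P0001 has orders in 2023 but a launch date of 2024-07-07). The sketch below keeps that information in a separate flag before clipping. The column name `sold_before_launch` is hypothetical and not part of the original pipeline.

```python
import pandas as pd

launch_demo = pd.DataFrame({
    "week_start": pd.to_datetime(["2023-09-25", "2024-11-18"]),
    "launch_date": pd.to_datetime(["2024-07-07", "2024-07-07"]),
})

# Full weeks between launch and the start of the week (negative if before launch)
raw_weeks = (launch_demo["week_start"] - launch_demo["launch_date"]).dt.days // 7

# Keep the "before launch" signal, then clamp negatives to 0 as in the cell above
launch_demo["sold_before_launch"] = (raw_weeks < 0).astype(int)
launch_demo["weeks_since_launch"] = raw_weeks.clip(lower=0)
print(launch_demo)
```

A high share of flagged rows may point to a data quality issue in `launch_date` rather than genuine pre-launch sales.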



In [None]:
weekly_orders_enriched[['sku_id', 'week_start', 'launch_date', 'weeks_since_launch']].head()


Unnamed: 0,sku_id,week_start,launch_date,weeks_since_launch
0,P0001,2023-09-25,2024-07-07,0.0
1,P0001,2024-02-12,2024-07-07,0.0
2,P0001,2024-02-19,2024-07-07,0.0
3,P0001,2024-03-04,2024-07-07,0.0
4,P0001,2024-03-18,2024-07-07,0.0


In [None]:
weekly_orders_enriched.to_csv("ts_base_with_product.csv", index=False)

_____________________________________________________________________________________________________

## We're done here

Next, we move on to the reviews dataset, which is the next one to merge.

## GOAL: To add average review rating and number of reviews per SKU per week to the time-series modeling dataset.

In [None]:
data3 = pd.read_csv('cleaned_reviews.csv')
df_reviews = data3.copy()

In [None]:
df_reviews.head()

Unnamed: 0,sku_id,user_id,rating,review_text,review_date
0,P0322,U5679,3.0,Decent for the price.,2024-11-22
1,P1059,U3760,4.0,Great value for money.,2024-10-21
2,P0625,U1008,4.0,Exceeded my expectations.,2023-11-19
3,P1142,U2632,4.0,Very satisfied with the quality.,2024-04-11
4,P0554,U1467,4.0,Highly recommend it!,2024-09-01


In [None]:
df_reviews['review_date'] = pd.to_datetime(df_reviews['review_date'])

In [None]:
df_reviews['week_start'] = df_reviews['review_date'] - pd.to_timedelta(df_reviews['review_date'].dt.dayofweek, unit='D')

In [None]:
# Aggregate: weekly average rating and count of reviews
weekly_reviews = df_reviews.groupby(['sku_id', 'week_start']).agg(
    avg_rating=('rating', 'mean'),
    num_reviews=('rating', 'count')
).reset_index()

In [None]:
# merging with weekly_orders_enriched dataset
weekly_orders_enriched = weekly_orders_enriched.merge(
    weekly_reviews,
    on=['sku_id', 'week_start'],
    how='left'
)

In [None]:
weekly_orders_enriched.head(10)

Unnamed: 0,sku_id,week_start,weekly_order_qty,weekly_revenue,brand,category,MRP,base_cost,launch_date,margin,weeks_since_launch,avg_rating,num_reviews
0,P0001,2023-09-25,3.0,2437.8075,No Brand,Electronics,1308.75,823.9,2024-07-07,484.85,0.0,,
1,P0001,2024-02-12,2.0,1670.244,No Brand,Electronics,1308.75,823.9,2024-07-07,484.85,0.0,,
2,P0001,2024-02-19,3.0,1540.2825,No Brand,Electronics,1308.75,823.9,2024-07-07,484.85,0.0,,
3,P0001,2024-03-18,1.0,626.178,No Brand,Electronics,1308.75,823.9,2024-07-07,484.85,0.0,,
4,P0001,2024-11-18,3.0,4235.94,No Brand,Electronics,1308.75,823.9,2024-07-07,484.85,19.0,5.0,1.0
5,P0001,2025-04-21,2.0,2841.278,No Brand,Electronics,1308.75,823.9,2024-07-07,484.85,41.0,,
6,P0002,2023-08-14,1.0,950.72,BrandD,Apparel,1465.23,854.11,2023-09-02,611.12,0.0,,
7,P0002,2023-12-18,2.0,1936.242,BrandD,Apparel,1465.23,854.11,2023-09-02,611.12,15.0,,
8,P0002,2024-05-13,1.0,745.885,BrandD,Apparel,1465.23,854.11,2023-09-02,611.12,36.0,,
9,P0002,2024-06-03,1.0,1884.43,BrandD,Apparel,1465.23,854.11,2023-09-02,611.12,39.0,,


In [None]:
# Filling NaNs (weeks with no reviews)
weekly_orders_enriched['avg_rating'] = weekly_orders_enriched['avg_rating'].fillna(0)
weekly_orders_enriched['num_reviews'] = weekly_orders_enriched['num_reviews'].fillna(0)
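Filling `avg_rating` with 0 makes a week without reviews look like a week with the worst possible rating. One way to keep the two cases apart is an explicit indicator column. This is a sketch on toy data, and `has_reviews` is a hypothetical column name.

```python
import numpy as np
import pandas as pd

reviews_demo = pd.DataFrame({
    "avg_rating": [np.nan, 5.0, np.nan],
    "num_reviews": [np.nan, 1.0, np.nan],
})

reviews_demo["num_reviews"] = reviews_demo["num_reviews"].fillna(0).astype(int)
# 1 when the SKU received at least one review that week, else 0
reviews_demo["has_reviews"] = (reviews_demo["num_reviews"] > 0).astype(int)
# The 0 fill is now only a placeholder; has_reviews tells the model to ignore it
reviews_demo["avg_rating"] = reviews_demo["avg_rating"].fillna(0)
print(reviews_demo)
```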


## Next, the user views dataset, used to derive view signals such as views, CTR, and view counts

### Goal: To aggregate weekly product view activity for each SKU and merge it into the weekly_orders_enriched dataset.

In [None]:
data4 = pd.read_csv('clean_user_views.csv')
df_user_views = data4.copy()

In [None]:
df_uv = pd.DataFrame(df_user_views)

In [None]:
column_names = [
    'user_id', 'sku_id', 'timestamp', 'session_id', 'device_type', 'referrer',
    'view_hour', 'view_dayofweek', 'is_weekend',
    'session_view_count', 'user_view_count', 'sku_total_views'
]
df_uv = pd.read_csv('clean_user_views.csv', skiprows=1, names=column_names)

In [None]:
df_uv.head()

Unnamed: 0,user_id,sku_id,timestamp,session_id,device_type,referrer,view_hour,view_dayofweek,is_weekend,session_view_count,user_view_count,sku_total_views
0,U3089,P1223,2024-11-17 20:03:29,S71948,mobile,paid search,20,6,1,5,33,115
1,U2658,P0448,2025-01-21 16:01:04,S71280,mobile,campaign,16,1,0,2,47,143
2,U3831,P1124,2024-03-23 09:41:11,S40100,app,social media,9,5,1,2,42,157
3,U2823,P0261,2023-10-17 13:32:16,S10259,app,campaign,13,1,0,2,41,136
4,U4688,P0354,2023-07-10 04:38:42,S70757,mobile,organic,4,0,0,7,48,139


In [None]:
def merge_weekly_user_views(base_df, user_views_df):
    # Step 1: Convert timestamp to datetime
    user_views_df['timestamp'] = pd.to_datetime(user_views_df['timestamp'])

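    # Step 2: Derive 'week_start' (the week's Monday at midnight). normalize() drops the
    # time of day so the key matches the date-only week_start in base_df.
    user_views_df['week_start'] = (
        user_views_df['timestamp'].dt.normalize()
        - pd.to_timedelta(user_views_df['timestamp'].dt.weekday, unit='D')
    )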

    # Step 3: Group by SKU and week_start to get weekly views
    weekly_views = (
        user_views_df
        .groupby(['sku_id', 'week_start'])['sku_total_views']
        .sum()
        .reset_index()
        .rename(columns={'sku_total_views': 'weekly_total_views'})
    )

    # Step 4: Merge with base modeling dataset
    enriched_df = pd.merge(base_df, weekly_views, on=['sku_id', 'week_start'], how='left')

    # Step 5: Fill NaNs (no view = 0 views)
    enriched_df['weekly_total_views'] = enriched_df['weekly_total_views'].fillna(0)

    return enriched_df

weekly_orders_enriched = merge_weekly_user_views(weekly_orders_enriched, df_uv)
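The function above sums `sku_total_views` within each week. If that column is a SKU-level total repeated on every view row (which the sample rows suggest, but which is an assumption here), the sum multiplies that total by the number of rows. A hedged alternative is to count view events per SKU and week. The helper name `weekly_view_counts` and the second demo timestamp are made up for illustration.

```python
import pandas as pd

def weekly_view_counts(views: pd.DataFrame) -> pd.DataFrame:
    """Count view events per SKU per week (one row in `views` = one view)."""
    ts = pd.to_datetime(views["timestamp"])
    # Monday at midnight of the week each view happened in
    week_start = ts.dt.normalize() - pd.to_timedelta(ts.dt.weekday, unit="D")
    return (
        views.assign(week_start=week_start)
        .groupby(["sku_id", "week_start"])
        .size()
        .reset_index(name="weekly_view_count")
    )

views_demo = pd.DataFrame({
    "sku_id": ["P1223", "P1223", "P0448"],
    "timestamp": ["2024-11-17 20:03:29", "2024-11-12 08:00:00", "2025-01-21 16:01:04"],
    "sku_total_views": [115, 115, 143],
})
view_counts_demo = weekly_view_counts(views_demo)
print(view_counts_demo)
```

Here P1223 gets a count of 2 for the week starting 2024-11-11, whereas summing `sku_total_views` would give 230.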

In [None]:
weekly_orders_enriched.head()

Unnamed: 0,sku_id,week_start,weekly_order_qty,weekly_revenue,brand,category,MRP,base_cost,launch_date,margin,weeks_since_launch,avg_rating,num_reviews,weekly_total_views
0,P0001,2023-09-25,3,2437.8075,No Brand,Electronics,1308.75,823.9,2024-07-07,484.85,0.0,0.0,0.0,0.0
1,P0001,2024-02-12,2,1670.244,No Brand,Electronics,1308.75,823.9,2024-07-07,484.85,0.0,0.0,0.0,0.0
2,P0001,2024-02-19,3,1540.2825,No Brand,Electronics,1308.75,823.9,2024-07-07,484.85,0.0,0.0,0.0,0.0
3,P0001,2024-03-04,3,2905.959,No Brand,Electronics,1308.75,823.9,2024-07-07,484.85,0.0,0.0,0.0,0.0
4,P0001,2024-03-18,1,626.178,No Brand,Electronics,1308.75,823.9,2024-07-07,484.85,0.0,0.0,0.0,0.0


## Competitor Dataset Integration


In [None]:
data5 = pd.read_csv('clean_competitor_prices.csv')
df_cp = data5.copy()
df_cp.head()

Unnamed: 0,sku_id,date,competitor_id,competitor_price
0,P0851,2024-12-11,c003,1843.81
1,P0953,2024-10-05,c002,1076.1
2,P0251,2025-04-22,c006,2248.44
3,P0901,2025-05-17,c007,1990.13
4,P1405,2025-01-26,c005,1721.84


In [None]:
def merge_weekly_comp_prices(base_df, comp_path):
    """
    Enrich base_df with weekly competitor price signals.
    Parameters
    ----------
    base_df : DataFrame
        Must contain ['sku_id', 'week_start', 'weekly_revenue', 'weekly_order_qty'].
    comp_path : str
        Path to clean_competitor_prices.csv.
    Returns
    -------
    DataFrame
        base_df + ['comp_avg_price', 'our_avg_price', 'price_gap_pct']"""

    # Load competitor data
    comp_df = pd.read_csv(comp_path)
    comp_df['date'] = pd.to_datetime(comp_df['date'])

    # Align to week_start (Monday of that week)
    comp_df['week_start'] = comp_df['date'] - pd.to_timedelta(
        comp_df['date'].dt.weekday, unit='D'
    )

    # Aggregate: average competitor price per SKU‑week
    weekly_comp = (
        comp_df.groupby(['sku_id', 'week_start'])['competitor_price']
        .mean()
        .reset_index()
        .rename(columns={'competitor_price': 'comp_avg_price'})
    )

    # Merge with modelling base
    df = base_df.merge(weekly_comp, on=['sku_id', 'week_start'], how='left')

    # Compute our average selling price per unit for each week:
    # our_avg_price = total weekly revenue / total weekly quantity (NaN when quantity is 0)
    qty = df['weekly_order_qty'].astype('float64')
    df['our_avg_price'] = df['weekly_revenue'] / qty.where(qty != 0)

    # Price gap % = ((competitor price - our price) / our price) * 100
    # Positive values mean competitors charged more than we did for the same SKU and week.
    df['price_gap_pct'] = (
        (df['comp_avg_price'] - df['our_avg_price']) / df['our_avg_price']
    ) * 100

    # Housekeeping: fill NaNs with 0. Note that 0 then also stands for "no competitor price observed".
    df[['comp_avg_price', 'price_gap_pct']] = df[['comp_avg_price', 'price_gap_pct']].fillna(0)

    return df


comp_path = "clean_competitor_prices.csv"
weekly_orders_enriched = merge_weekly_comp_prices(weekly_orders_enriched, comp_path)
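To make the price gap formula concrete, here is a small worked example. The revenue and quantity values come from the P0001 rows shown earlier; the competitor price of 900.0 is invented for illustration, and `has_comp_price` is a hypothetical flag that separates "no competitor data" from a true gap of 0.

```python
import numpy as np
import pandas as pd

gap_demo = pd.DataFrame({
    "weekly_revenue": [2437.8075, 1670.244],
    "weekly_order_qty": [3, 2],
    "comp_avg_price": [900.0, np.nan],  # second week: no competitor price observed
})

# Average price actually charged per unit in the week
gap_demo["our_avg_price"] = gap_demo["weekly_revenue"] / gap_demo["weekly_order_qty"]

# Record whether a competitor price existed before the NaN is filled
gap_demo["has_comp_price"] = gap_demo["comp_avg_price"].notna().astype(int)

# Price gap % = (competitor price - our price) / our price * 100
gap_demo["price_gap_pct"] = (
    (gap_demo["comp_avg_price"] - gap_demo["our_avg_price"])
    / gap_demo["our_avg_price"] * 100
).fillna(0)
print(gap_demo.round(3))
```

In the first row our average price is 812.6025, so a competitor price of 900.0 gives a gap of roughly 10.76%, meaning the competitor was more expensive that week.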


In [None]:
weekly_orders_enriched.head(3)

Unnamed: 0,sku_id,week_start,weekly_order_qty,weekly_revenue,brand,category,MRP,base_cost,launch_date,margin,weeks_since_launch,avg_rating,num_reviews,weekly_total_views,comp_avg_price,our_avg_price,price_gap_pct
0,P0001,2023-09-25,3,2437.8075,No Brand,Electronics,1308.75,823.9,2024-07-07,484.85,0.0,0.0,0.0,0.0,0.0,812.6025,0.0
1,P0001,2024-02-12,2,1670.244,No Brand,Electronics,1308.75,823.9,2024-07-07,484.85,0.0,0.0,0.0,0.0,0.0,835.122,0.0
2,P0001,2024-02-19,3,1540.2825,No Brand,Electronics,1308.75,823.9,2024-07-07,484.85,0.0,0.0,0.0,0.0,0.0,513.4275,0.0


In [None]:
df_uv.head()

Unnamed: 0,user_id,sku_id,timestamp,session_id,device_type,referrer,view_hour,view_dayofweek,is_weekend,session_view_count,user_view_count,sku_total_views
0,U5985,P0145,2025-05-05 06:01:09,S60346,app,email,6,0,0,1,38,131
1,U1349,P0785,2025-04-11 07:03:36,S62238,app,email,7,4,0,1,42,139
2,U4038,P0433,2024-04-19 14:44:16,S37323,app,organic,14,4,0,1,37,145
3,U2194,P0249,2024-01-11 16:56:48,S75629,desktop,campaign,16,3,0,2,42,139
4,U3998,P1135,2023-07-15 01:26:09,S52339,mobile,direct,1,5,1,1,37,126


In [None]:
weekly_orders_enriched.head()

Unnamed: 0,sku_id,week_start,weekly_order_qty,weekly_revenue,brand,category,MRP,base_cost,launch_date,margin,weeks_since_launch,avg_rating,num_reviews
0,P0001,2023-09-25,3.0,2437.8075,No Brand,Electronics,1308.75,823.9,2024-07-07,484.85,0.0,0.0,0.0
1,P0001,2024-02-12,2.0,1670.244,No Brand,Electronics,1308.75,823.9,2024-07-07,484.85,0.0,0.0,0.0
2,P0001,2024-02-19,3.0,1540.2825,No Brand,Electronics,1308.75,823.9,2024-07-07,484.85,0.0,0.0,0.0
3,P0001,2024-03-18,1.0,626.178,No Brand,Electronics,1308.75,823.9,2024-07-07,484.85,0.0,0.0,0.0
4,P0001,2024-11-18,3.0,4235.94,No Brand,Electronics,1308.75,823.9,2024-07-07,484.85,19.0,5.0,1.0


In [None]:
### **Goal:** To engineer weekly inventory metrics like stock availability, stockout signals, and average stock per SKU per week, and then merge into the weekly_orders_enriched base dataset.

In [None]:
data6 = pd.read_csv('clean_inventory.csv')
df_inv = data6.copy()
df_inv.head()

Unnamed: 0,sku_id,date,stock_level,restock_flag,supplier_lead_time,stock_level_capped,is_low_stock,days_until_restock,inventory_health_score
0,P0859,2024-02-27,25,False,4.0,25,False,0.0,5.0
1,P0900,2025-05-03,89,False,6.0,89,False,0.0,12.714286
2,P0671,2023-07-27,72,False,4.0,72,False,0.0,14.4
3,P0510,2023-09-21,65,False,10.0,65,False,0.0,5.909091
4,P0438,2024-06-04,0,False,8.0,0,True,0.0,0.0


In [None]:
df_inv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64999 entries, 0 to 64998
Data columns (total 9 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   sku_id                  64999 non-null  object 
 1   date                    64999 non-null  object 
 2   stock_level             64999 non-null  int64  
 3   restock_flag            64999 non-null  bool   
 4   supplier_lead_time      64999 non-null  float64
 5   stock_level_capped      64999 non-null  int64  
 6   is_low_stock            64999 non-null  bool   
 7   days_until_restock      64999 non-null  float64
 8   inventory_health_score  64999 non-null  float64
dtypes: bool(2), float64(3), int64(2), object(2)
memory usage: 3.6+ MB


In [None]:
df_inv.columns

Index(['sku_id', 'date', 'stock_level', 'restock_flag', 'supplier_lead_time',
       'stock_level_capped', 'is_low_stock', 'days_until_restock',
       'inventory_health_score'],
      dtype='object')

## **Goal:** Aggregate inventory stats weekly for each sku_id and merge them into weekly_orders_enriched using a modular, reusable function.

In [None]:
'''Feature name: meaning
avg_stock_level: mean stock level across the week
low_stock_pct: share of the week's inventory records flagged as low stock (0 to 1)
avg_inventory_health_score: average inventory health score
avg_days_until_restock: average expected days until restock'''


In [None]:
def merge_inventory_data(base_df, inventory_df):
    """
    Merge weekly inventory stats into weekly_orders_enriched dataset.

    Parameters:
        base_df (pd.DataFrame): The main time-series dataset (weekly_orders_enriched).
        inventory_df (pd.DataFrame): Cleaned inventory data.

    Returns:
        pd.DataFrame: Enriched base_df with weekly inventory features."""

    # Convert date to datetime and extract week
    inventory_df['date'] = pd.to_datetime(inventory_df['date'])
    inventory_df['week_start'] = inventory_df['date'] - pd.to_timedelta(inventory_df['date'].dt.dayofweek, unit='d')

    #  Aggregate weekly metrics per SKU
    weekly_inventory = inventory_df.groupby(['sku_id', 'week_start']).agg({
        'stock_level': 'mean',
        'is_low_stock': 'mean',  # % of time low stock
        'inventory_health_score': 'mean',
        'days_until_restock': 'mean'
    }).reset_index()

    # Rename columns for clarity
    weekly_inventory.rename(columns={
        'stock_level': 'avg_stock_level',
        'is_low_stock': 'low_stock_pct',
        'inventory_health_score': 'avg_inventory_health_score',
        'days_until_restock': 'avg_days_until_restock'
    }, inplace=True)

    # Merge with base dataset
    merged_df = pd.merge(base_df, weekly_inventory, on=['sku_id', 'week_start'], how='left')

    return merged_df


# Load inventory CSV
df_inventory = pd.read_csv("clean_inventory.csv")

# Merge into your enriched time series dataset
weekly_orders_enriched = merge_inventory_data(weekly_orders_enriched, df_inventory)

In [None]:
weekly_orders_enriched.head()

Unnamed: 0,sku_id,week_start,weekly_order_qty,weekly_revenue,brand,category,MRP,base_cost,launch_date,margin,...,avg_rating,num_reviews,weekly_total_views,comp_avg_price,our_avg_price,price_gap_pct,avg_stock_level,low_stock_pct,avg_inventory_health_score,avg_days_until_restock
0,P0001,2023-09-25,3,2437.8075,No Brand,Electronics,1308.75,823.9,2024-07-07,484.85,...,0.0,0.0,0.0,0.0,812.6025,0.0,,,,
1,P0001,2024-02-12,2,1670.244,No Brand,Electronics,1308.75,823.9,2024-07-07,484.85,...,0.0,0.0,0.0,0.0,835.122,0.0,,,,
2,P0001,2024-02-19,3,1540.2825,No Brand,Electronics,1308.75,823.9,2024-07-07,484.85,...,0.0,0.0,0.0,0.0,513.4275,0.0,18.0,0.0,1.285714,0.0
3,P0001,2024-03-04,3,2905.959,No Brand,Electronics,1308.75,823.9,2024-07-07,484.85,...,0.0,0.0,0.0,0.0,968.653,0.0,,,,
4,P0001,2024-03-18,1,626.178,No Brand,Electronics,1308.75,823.9,2024-07-07,484.85,...,0.0,0.0,0.0,0.0,626.178,0.0,,,,


In [None]:
# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which raises a FutureWarning and will stop working in pandas 3.0
weekly_orders_enriched['avg_stock_level'] = weekly_orders_enriched['avg_stock_level'].fillna(0)
weekly_orders_enriched['low_stock_pct'] = weekly_orders_enriched['low_stock_pct'].fillna(0)
weekly_orders_enriched['avg_inventory_health_score'] = weekly_orders_enriched['avg_inventory_health_score'].fillna(0)
# Optional: forward-fill restock days (note: this fills in row order, across SKU boundaries)
weekly_orders_enriched['avg_days_until_restock'] = weekly_orders_enriched['avg_days_until_restock'].ffill()

In [None]:
weekly_orders_enriched.head()

Unnamed: 0,sku_id,week_start,weekly_order_qty,weekly_revenue,brand,category,MRP,base_cost,launch_date,margin,...,avg_rating,num_reviews,weekly_total_views,comp_avg_price,our_avg_price,price_gap_pct,avg_stock_level,low_stock_pct,avg_inventory_health_score,avg_days_until_restock
0,P0001,2023-09-25,3,2437.8075,No Brand,Electronics,1308.75,823.9,2024-07-07,484.85,...,0.0,0.0,0.0,0.0,812.6025,0.0,0.0,0.0,0.0,
1,P0001,2024-02-12,2,1670.244,No Brand,Electronics,1308.75,823.9,2024-07-07,484.85,...,0.0,0.0,0.0,0.0,835.122,0.0,0.0,0.0,0.0,
2,P0001,2024-02-19,3,1540.2825,No Brand,Electronics,1308.75,823.9,2024-07-07,484.85,...,0.0,0.0,0.0,0.0,513.4275,0.0,18.0,0.0,1.285714,0.0
3,P0001,2024-03-04,3,2905.959,No Brand,Electronics,1308.75,823.9,2024-07-07,484.85,...,0.0,0.0,0.0,0.0,968.653,0.0,0.0,0.0,0.0,0.0
4,P0001,2024-03-18,1,626.178,No Brand,Electronics,1308.75,823.9,2024-07-07,484.85,...,0.0,0.0,0.0,0.0,626.178,0.0,0.0,0.0,0.0,0.0


In [None]:
weekly_orders_enriched['avg_days_until_restock'] = (
    weekly_orders_enriched['avg_days_until_restock']
    .fillna(weekly_orders_enriched['avg_days_until_restock'].mean())
)
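The forward fill used earlier runs over the whole column in row order, so the last value of one SKU can leak into the first weeks of the next SKU. A hedged alternative is to forward-fill within each SKU and only then fall back to the overall mean. The sketch below uses toy values and assumes rows are already sorted by week within each SKU.

```python
import numpy as np
import pandas as pd

restock_demo = pd.DataFrame({
    "sku_id": ["P0001", "P0001", "P0001", "P0002", "P0002"],
    "avg_days_until_restock": [np.nan, 2.0, np.nan, np.nan, 4.0],
})

# Forward-fill inside each SKU only, so P0001's value never reaches P0002
filled = restock_demo.groupby("sku_id")["avg_days_until_restock"].ffill()

# Anything still missing (no earlier value for that SKU) falls back to the overall mean
restock_demo["avg_days_until_restock"] = filled.fillna(filled.mean())
print(restock_demo)
```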

In [None]:
weekly_orders_enriched.head(3)

Unnamed: 0,sku_id,week_start,weekly_order_qty,weekly_revenue,brand,category,MRP,base_cost,launch_date,margin,...,avg_rating,num_reviews,weekly_total_views,comp_avg_price,our_avg_price,price_gap_pct,avg_stock_level,low_stock_pct,avg_inventory_health_score,avg_days_until_restock
0,P0001,2023-09-25,3,2437.8075,No Brand,Electronics,1308.75,823.9,2024-07-07,484.85,...,0.0,0.0,0.0,0.0,812.6025,0.0,0.0,0.0,0.0,1.500541
1,P0001,2024-02-12,2,1670.244,No Brand,Electronics,1308.75,823.9,2024-07-07,484.85,...,0.0,0.0,0.0,0.0,835.122,0.0,0.0,0.0,0.0,1.500541
2,P0001,2024-02-19,3,1540.2825,No Brand,Electronics,1308.75,823.9,2024-07-07,484.85,...,0.0,0.0,0.0,0.0,513.4275,0.0,18.0,0.0,1.285714,0.0


Return features to engineer:
- total_refund_amount: total refunds per SKU per week
- return_rate: approximate return ratio = refund amount ÷ estimated sales
- is_high_return_sku: binary flag for SKUs with unusually high return ratios
- top_return_reason (optional): dominant reason code per SKU per week
- avg_return_lag (future, optional): days between purchase and return (if we link to orders)

In [None]:
data7 = pd.read_csv('clean_return.csv')
df_return = data7.copy()
df_return.head()

Unnamed: 0,return_id,user_id,sku_id,return_date,return_reason,product_condition,refund_amount
0,R00001,U2402,P0654,2024-07-16,Missing parts,Used,1925.59
1,R00002,U4300,P0268,2024-10-18,Wrong product,Used,2386.87
2,R00003,U3013,P0037,2025-03-03,Poor quality,Opened,1773.01
3,R00004,U2035,P0603,2025-06-23,Unknown,Opened,1484.13
4,R00005,U4557,P0981,2024-08-27,Poor quality,New,159.07


In [None]:
df_return.columns

Index(['return_id', 'user_id', 'sku_id', 'return_date', 'return_reason',
       'product_condition', 'refund_amount'],
      dtype='object')

In [None]:
def merge_returns_data(base_df, returns_df):
    """
    Merge weekly return-related features into weekly_orders_enriched dataset.

    Parameters:
        base_df (pd.DataFrame): The main time-series dataset.
        returns_df (pd.DataFrame): Cleaned returns data.

    Returns:
        pd.DataFrame: Enriched base_df with returns info.
    """
    #  Preprocess return_date to week_start
    returns_df['return_date'] = pd.to_datetime(returns_df['return_date'])
    returns_df['week_start'] = returns_df['return_date'] - pd.to_timedelta(returns_df['return_date'].dt.dayofweek, unit='d')

    #  Aggregate return metrics
    weekly_returns = returns_df.groupby(['sku_id', 'week_start']).agg({
        'refund_amount': 'sum',
        'return_id': 'count',
        'return_reason': lambda x: x.mode().iloc[0] if not x.mode().empty else None
    }).reset_index()

    # Rename columns
    weekly_returns.rename(columns={
        'refund_amount': 'total_refund_amount',
        'return_id': 'total_returns',
        'return_reason': 'top_return_reason'
    }, inplace=True)

    #  Merge into base dataset
    merged_df = pd.merge(base_df, weekly_returns, on=['sku_id', 'week_start'], how='left')

    # Optional Step 4: Derive return_rate (proxy)
    merged_df['return_rate'] = merged_df['total_refund_amount'] / merged_df['weekly_revenue']
    merged_df['return_rate'] = merged_df['return_rate'].fillna(0).clip(0, 1)

    # Step 5: Flag high return SKUs
    merged_df['is_high_return_sku'] = (merged_df['return_rate'] > 0.3).astype(int)  # You can tune threshold

    return merged_df
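The return-rate proxy divides refunds booked in a week by revenue earned in the same week. Because a refund usually relates to a purchase from an earlier week, the raw ratio can exceed 1, which is why the function clips it. A small worked example on made-up numbers:

```python
import numpy as np
import pandas as pd

returns_demo = pd.DataFrame({
    "weekly_revenue": [2000.0, 1000.0, 500.0],
    "total_refund_amount": [np.nan, 250.0, 800.0],  # NaN = no returns that week
})

ratio = returns_demo["total_refund_amount"] / returns_demo["weekly_revenue"]
# No returns -> 0; refunds larger than the week's revenue are capped at 1
returns_demo["return_rate"] = ratio.fillna(0).clip(0, 1)
# Same 0.3 threshold as in the function above
returns_demo["is_high_return_sku"] = (returns_demo["return_rate"] > 0.3).astype(int)
print(returns_demo)
```

The third row has a raw ratio of 1.6, which is clipped to 1.0 and flagged as high return.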


In [None]:
# Load the returns data
df_returns = pd.read_csv("clean_return.csv")

# Merge into base dataset
weekly_orders_enriched = merge_returns_data(weekly_orders_enriched, df_returns)

In [None]:
weekly_orders_enriched[['sku_id', 'week_start', 'return_rate', 'total_refund_amount']].head()

Unnamed: 0,sku_id,week_start,return_rate,total_refund_amount
0,P0001,2023-09-25,0.0,
1,P0001,2024-02-12,0.0,
2,P0001,2024-02-19,0.0,
3,P0001,2024-03-04,0.0,
4,P0001,2024-03-18,0.0,


In [None]:
weekly_orders_enriched['total_refund_amount'] = weekly_orders_enriched['total_refund_amount'].fillna(0)


In [None]:
weekly_orders_enriched.head()

Unnamed: 0,sku_id,week_start,weekly_order_qty,weekly_revenue,brand,category,MRP,base_cost,launch_date,margin,...,price_gap_pct,avg_stock_level,low_stock_pct,avg_inventory_health_score,avg_days_until_restock,total_refund_amount,total_returns,top_return_reason,return_rate,is_high_return_sku
0,P0001,2023-09-25,3,2437.8075,No Brand,Electronics,1308.75,823.9,2024-07-07,484.85,...,0.0,0.0,0.0,0.0,1.500541,0.0,,,0.0,0
1,P0001,2024-02-12,2,1670.244,No Brand,Electronics,1308.75,823.9,2024-07-07,484.85,...,0.0,0.0,0.0,0.0,1.500541,0.0,,,0.0,0
2,P0001,2024-02-19,3,1540.2825,No Brand,Electronics,1308.75,823.9,2024-07-07,484.85,...,0.0,18.0,0.0,1.285714,0.0,0.0,,,0.0,0
3,P0001,2024-03-04,3,2905.959,No Brand,Electronics,1308.75,823.9,2024-07-07,484.85,...,0.0,0.0,0.0,0.0,0.0,0.0,,,0.0,0
4,P0001,2024-03-18,1,626.178,No Brand,Electronics,1308.75,823.9,2024-07-07,484.85,...,0.0,0.0,0.0,0.0,0.0,0.0,,,0.0,0


In [None]:
weekly_orders_enriched['total_returns'] = weekly_orders_enriched['total_returns'].fillna(0).astype(int)

# fill missing reasons with 'No Returns'
weekly_orders_enriched['top_return_reason'] = weekly_orders_enriched['top_return_reason'].fillna('No Returns')
# "No Returns" is not a real return reason; it is a placeholder that tells downstream steps no returns occurred for that SKU in that week.

In [None]:
weekly_orders_enriched.to_csv("final_weekly_dataset_raw_v1.csv", index=False)
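Since the stated target is an LSTM, the saved weekly table still has to be turned into fixed-length input windows per SKU. The sketch below shows one common way to do that with NumPy only. The helper `make_windows`, the lookback of 4 weeks, and the convention that column 0 holds the target are all assumptions for illustration, not choices made in this notebook.

```python
import numpy as np

def make_windows(values: np.ndarray, lookback: int, horizon: int = 1):
    """Slice a (weeks, features) array into LSTM-style samples.

    Returns X with shape (samples, lookback, features) and y with shape (samples,),
    where y is column 0 of the row `horizon` steps after each window.
    """
    X, y = [], []
    for start in range(len(values) - lookback - horizon + 1):
        X.append(values[start:start + lookback])
        y.append(values[start + lookback + horizon - 1, 0])
    return np.asarray(X), np.asarray(y)

# Toy series for one SKU: 10 weeks, 2 features (column 0 plays the role of weekly_order_qty)
series_demo = np.arange(20, dtype="float32").reshape(10, 2)
X_demo, y_demo = make_windows(series_demo, lookback=4)
print(X_demo.shape, y_demo.shape)
```

Windows should be built separately for each SKU on a gap-free weekly series so that a window never spans two products or skips weeks.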