# Pre-Processing and Training Data Development<a id='Pre-Processing_and_Training_Data_Development'></a>

## Contents<a id="Contents"></a>  
* [Introduction](#Introduction)
* [Imports](#Imports)
* [Loading the data](#Loading_the_data)
* [Train/Test Split](#Train/Test_Split)
* [Feature Engineering](#Feature_Engineering)
* [Scale/encode numeric/categorical features](#Scale/encode_numeric/categorical_features)
* [Save Preprocessed Data](#Save_Preprocessed_Data)


## Introduction<a id='Introduction'></a>

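The goal of this notebook is to produce a cleaned, model-ready dataset for the modeling step of
my project. The steps are a chronological train/test split, date-derived features, one-hot
encoding of the categorical columns, and standardization of the numeric columns.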

## Imports<a id='Imports'></a>

In [212]:
import pandas as pd
from sklearn.preprocessing import StandardScaler


## Loading the data<a id='Loading_the_data'></a>

In [215]:
data = pd.read_csv("../data/raw/Ecommerce_Sales_Prediction_Dataset.csv")
data.info()  # info() prints its summary directly and returns None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Date              1000 non-null   object 
 1   Product_Category  1000 non-null   object 
 2   Price             1000 non-null   float64
 3   Discount          1000 non-null   float64
 4   Customer_Segment  1000 non-null   object 
 5   Marketing_Spend   1000 non-null   float64
 6   Units_Sold        1000 non-null   int64  
dtypes: float64(3), int64(1), object(3)
memory usage: 54.8+ KB


In [217]:
# Convert to datetime
data['Date'] = pd.to_datetime(data['Date'], format='%d-%m-%Y')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Date              1000 non-null   datetime64[ns]
 1   Product_Category  1000 non-null   object        
 2   Price             1000 non-null   float64       
 3   Discount          1000 non-null   float64       
 4   Customer_Segment  1000 non-null   object        
 5   Marketing_Spend   1000 non-null   float64       
 6   Units_Sold        1000 non-null   int64         
dtypes: datetime64[ns](1), float64(3), int64(1), object(2)
memory usage: 54.8+ KB


In [219]:
data = data.sort_values("Date")  # Always sort by date first!
data = data.set_index("Date")

In [221]:
data.head()

            Product_Category   Price  Discount  Customer_Segment  Marketing_Spend  Units_Sold
Date
2023-01-01            Sports  932.80     35.82        Occasional          6780.38          32
2023-01-02              Toys  569.48      3.60           Premium          6807.56          16
2023-01-03        Home Decor  699.68      3.56           Premium          3793.91          27
2023-01-04              Toys  923.27      0.61           Premium          9422.75          29
2023-01-05              Toys  710.17     47.83           Premium          1756.83          17


## Train/Test Split<a id='Train/Test_Split'></a>

In [224]:
# Chronological train/test split: earliest 80% of rows for training, latest 20% for testing
train_size = int(len(data) * 0.8)
train = data.iloc[:train_size]
test = data.iloc[train_size:]
print(test.head())

           Product_Category   Price  Discount Customer_Segment  \
Date                                                             
2025-03-11       Home Decor  281.42     13.57          Premium   
2025-03-12       Home Decor   88.63     24.83          Regular   
2025-03-13           Sports   94.80     14.21          Regular   
2025-03-14             Toys  895.25      6.69       Occasional   
2025-03-15          Fashion  199.95     31.48          Premium   

            Marketing_Spend  Units_Sold  
Date                                     
2025-03-11           714.16          23  
2025-03-12          3459.12          24  
2025-03-13          5108.72          36  
2025-03-14          1703.04          31  
2025-03-15          6615.21          36  
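
A quick sanity check for a chronological split is that the last training date comes before the first test date. The sketch below uses a small hypothetical frame (`demo`, with made-up values) rather than the notebook's data, and applies the same positional 80/20 split.

```python
import pandas as pd

# Hypothetical stand-in for the sorted, date-indexed frame used in this notebook.
dates = pd.date_range("2023-01-01", periods=10, freq="D")
demo = pd.DataFrame({"Units_Sold": range(10)}, index=dates)

# Same positional 80/20 split as in the cell above.
split_at = int(len(demo) * 0.8)
demo_train = demo.iloc[:split_at]
demo_test = demo.iloc[split_at:]

# In a chronological split, every training date precedes every test date.
no_overlap = demo_train.index.max() < demo_test.index.min()
print(len(demo_train), len(demo_test), no_overlap)
```

The check only holds if the frame was sorted by date before slicing, which is why the sort comes first in this notebook.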


In [226]:
# Separate target and features
target_column = "Units_Sold"
X_train = train.drop(columns=[target_column])
y_train = train[target_column]
X_test = test.drop(columns=[target_column])
y_test = test[target_column]
print(X_train.head())

           Product_Category   Price  Discount Customer_Segment  \
Date                                                             
2023-01-01           Sports  932.80     35.82       Occasional   
2023-01-02             Toys  569.48      3.60          Premium   
2023-01-03       Home Decor  699.68      3.56          Premium   
2023-01-04             Toys  923.27      0.61          Premium   
2023-01-05             Toys  710.17     47.83          Premium   

            Marketing_Spend  
Date                         
2023-01-01          6780.38  
2023-01-02          6807.56  
2023-01-03          3793.91  
2023-01-04          9422.75  
2023-01-05          1756.83  


## Feature Engineering<a id="Feature_Engineering"></a>

In [229]:
# Date feature extraction
for X in [X_train, X_test]:
    X["month"] = X.index.month
    X["day"] = X.index.day
    X["weekday"] = X.index.weekday
print(X_train.head())

           Product_Category   Price  Discount Customer_Segment  \
Date                                                             
2023-01-01           Sports  932.80     35.82       Occasional   
2023-01-02             Toys  569.48      3.60          Premium   
2023-01-03       Home Decor  699.68      3.56          Premium   
2023-01-04             Toys  923.27      0.61          Premium   
2023-01-05             Toys  710.17     47.83          Premium   

            Marketing_Spend  month  day  weekday  
Date                                              
2023-01-01          6780.38      1    1        6  
2023-01-02          6807.56      1    2        0  
2023-01-03          3793.91      1    3        1  
2023-01-04          9422.75      1    4        2  
2023-01-05          1756.83      1    5        3  
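
The `weekday` values follow the pandas convention of Monday=0 through Sunday=6, which is why 2023-01-01 (a Sunday) maps to 6 in the output above. A minimal illustration on two of those dates:

```python
import pandas as pd

# Two dates taken from the head of the training set.
idx = pd.DatetimeIndex(["2023-01-01", "2023-01-02"])

weekday_numbers = idx.weekday.tolist()   # Monday=0 ... Sunday=6
weekday_names = idx.day_name().tolist()  # English day names by default

print(weekday_numbers)
print(weekday_names)
```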


## Scale/encode numeric/categorical features<a id="Scale/encode_numeric/categorical_features"></a>

In [232]:
# Columns to standardize (numeric) and columns to one-hot encode (categorical)
numeric_features = ['Price','Discount','Marketing_Spend', 'month', 'day', 'weekday']
categorical_features = ["Product_Category",'Customer_Segment']

In [234]:
# One-hot encode train and test separately
X_train = pd.get_dummies(X_train, columns=categorical_features, drop_first=False)
X_test = pd.get_dummies(X_test, columns=categorical_features, drop_first=False)

# Align test columns to the training columns so both sets share the same features;
# categories missing from the test set are filled with 0
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
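
To see why the alignment step matters, consider a test set that is missing some categories. The sketch below uses tiny hypothetical frames (`demo_train`, `demo_test`) and passes `dtype=int` only to keep the printed values compact; it is an illustration, not part of the pipeline.

```python
import pandas as pd

# Hypothetical data: the test set contains only one of the three categories.
demo_train = pd.DataFrame({"Product_Category": ["Toys", "Sports", "Fashion"]})
demo_test = pd.DataFrame({"Product_Category": ["Toys", "Toys"]})

train_enc = pd.get_dummies(demo_train, columns=["Product_Category"], dtype=int)
test_enc = pd.get_dummies(demo_test, columns=["Product_Category"], dtype=int)

# Before alignment the test frame has a single dummy column.
print(test_enc.columns.tolist())

# Reindexing to the training columns adds the missing dummies, filled with 0.
aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(aligned.columns.tolist())
```

Note that `reindex` also drops any dummy column that appears only in the test set, so the model never sees a feature it was not trained on.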


In [236]:
scaler = StandardScaler()
X_train[numeric_features] = scaler.fit_transform(X_train[numeric_features])
X_test[numeric_features] = scaler.transform(X_test[numeric_features])

# Inspect the transformed features and the targets
print("X_train:\n", X_train)
print("y_train:\n", y_train)
print("X_test:\n", X_test)
print("y_test:\n", y_test)


X_train:
                Price  Discount  Marketing_Spend     month       day   weekday  \
Date                                                                            
2023-01-01  1.469768  0.743485         0.649904 -1.429309 -1.654277  1.497662   
2023-01-02  0.214711 -1.503750         0.659487 -1.429309 -1.540795 -1.497662   
2023-01-03  0.664476 -1.506540        -0.403064 -1.429309 -1.427314 -0.998441   
2023-01-04  1.436848 -1.712292         1.581550 -1.429309 -1.313832 -0.499221   
2023-01-05  0.700712  1.581141        -1.121296 -1.429309 -1.200351  0.000000   
...              ...       ...              ...       ...       ...       ...   
2025-03-06  1.671990  1.048975        -0.829759 -0.869207 -1.086869  0.000000   
2025-03-07 -1.278873  0.172260        -1.285034 -0.869207 -0.973388  0.499221   
2025-03-08 -1.361917  0.059271         0.334910 -0.869207 -0.859906  0.998441   
2025-03-09  0.759196 -1.256847         1.677994 -0.869207 -0.746425  1.497662   
2025-03-10  0.2600
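
The scaler is fitted on the training set only and then reused on the test set, so the test data is standardized with the training mean and standard deviation. A minimal sketch of that behavior, with hypothetical prices and a separate `demo_scaler` so it does not interfere with the notebook's `scaler`:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical values, chosen so the statistics are easy to follow.
demo_train = pd.DataFrame({"Price": [100.0, 200.0, 300.0]})
demo_test = pd.DataFrame({"Price": [400.0]})

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(demo_train)  # learns mean and std from train
test_scaled = demo_scaler.transform(demo_test)        # reuses the training statistics

print("train mean:", demo_scaler.mean_, "train std:", demo_scaler.scale_)
print("scaled train:", train_scaled.ravel())
print("scaled test:", test_scaled.ravel())
```

The scaled training column has mean 0, while the test value lands outside the training range because it is measured against the training statistics rather than its own.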

In [238]:
# Convert boolean dummy columns (True/False) to integers (1/0)
for X in [X_train, X_test]:
    bool_cols = X.select_dtypes(include=['bool']).columns
    X[bool_cols] = X[bool_cols].astype(int)
print(X_train.head())   

               Price  Discount  Marketing_Spend     month       day   weekday  \
Date                                                                            
2023-01-01  1.469768  0.743485         0.649904 -1.429309 -1.654277  1.497662   
2023-01-02  0.214711 -1.503750         0.659487 -1.429309 -1.540795 -1.497662   
2023-01-03  0.664476 -1.506540        -0.403064 -1.429309 -1.427314 -0.998441   
2023-01-04  1.436848 -1.712292         1.581550 -1.429309 -1.313832 -0.499221   
2023-01-05  0.700712  1.581141        -1.121296 -1.429309 -1.200351  0.000000   

            Product_Category_Electronics  Product_Category_Fashion  \
Date                                                                 
2023-01-01                             0                         0   
2023-01-02                             0                         0   
2023-01-03                             0                         0   
2023-01-04                             0                         0   
2023-01-05  

## Save Preprocessed Data<a id="Save_Preprocessed_Data"></a>

In [241]:
import joblib

# Persist the splits and the fitted scaler for the modeling step
joblib.dump(X_train, '../data/processed/X_train.joblib')
joblib.dump(X_test, '../data/processed/X_test.joblib')
joblib.dump(y_train, '../data/processed/y_train.joblib')
joblib.dump(y_test, '../data/processed/y_test.joblib')
joblib.dump(scaler, '../data/processed/scaler.joblib')

['../data/processed/scaler.joblib']
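
In the modeling step, the saved objects can be read back with `joblib.load`. The sketch below round-trips a small hypothetical frame through a temporary directory instead of `../data/processed/`, so it can run without the project files.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd

# Hypothetical frame standing in for one of the saved splits.
demo = pd.DataFrame(
    {"Price": [1.0, 2.0]},
    index=pd.to_datetime(["2023-01-01", "2023-01-02"]),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = str(Path(tmp_dir) / "X_demo.joblib")
    joblib.dump(demo, path)       # same call used above for the real splits
    restored = joblib.load(path)  # returns the original Python object

# The restored frame keeps its values, dtypes, and datetime index.
print(restored.equals(demo))
```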