## 🧠 Feature Engineering Summary

This notebook transforms the cleaned Superstore dataset into a form suitable for regression modeling. The rationale behind each feature engineering step is summarized below:

### ✅ Date-Based Features (from `Order Date`)
- `Order_Year`, `Order_Month`, `Order_Week`, `Order_Weekday`
- Capture seasonal and weekly sales patterns

### ✅ Shipping Delay
- `Shipping_Delay = Ship Date - Order Date`
- Measured in whole days; a proxy for delivery efficiency and customer experience, which may in turn affect Sales or Profit

### ✅ Weekday Encoding
- `Order_Weekday` is ordinally encoded (Monday=0 through Sunday=6)
- Replaces day-name strings with integers so that models can use day-of-week as a numeric input

### ✅ One-Hot Encoding
- Applied to: `Ship Mode`, `Segment`, `Region`, `Category`, `Sub-Category`
- Converts each category level to a binary flag that a model can consume; the first level of each column is dropped (`drop_first=True`) and acts as the baseline

### ✅ Kept Numerical Features
- `Quantity`, `Discount`, `Profit`
- Important indicators of order size, pricing strategy, and margin

### ✅ Dropped Irrelevant Fields
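- Identifiers, high-cardinality text, and location fields (`Customer Name`, `Product ID`, `City`, etc.), plus the raw `Order Date` and `Ship Date` columns once features have been derived from them
- Reduces the risk of overfitting and keeps the feature matrix compact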

The resulting dataset contains only numeric and boolean columns and is ready for modeling.


In [9]:
# Set up paths
from pathlib import Path

# Project root is two levels up from the current working directory
# (assumed here to be the folder that contains this notebook)
PROJECT_ROOT = Path.cwd().parents[1]

# Define input/output paths
CLEANED_DATA_PATH = PROJECT_ROOT / "Data" / "Processed" / "cleaned_superstore.csv"
FEATURE_ENGINEERED_PATH = PROJECT_ROOT / "Data" / "Processed" / "feature_engineered_superstore.csv"
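
To make the `parents[1]` indexing concrete, here is a minimal sketch using a hypothetical directory layout. The path `/home/user/project/notebooks/features` is an assumption chosen for illustration only; the real project layout is not shown in this notebook.

```python
from pathlib import PurePosixPath

# Hypothetical working directory (illustrative only)
cwd = PurePosixPath("/home/user/project/notebooks/features")

# parents[0] is the immediate parent, parents[1] is one level above that
print(cwd.parents[0])
print(cwd.parents[1])

# Building a data path from the assumed project root
project_root = cwd.parents[1]
cleaned_path = project_root / "Data" / "Processed" / "cleaned_superstore.csv"
print(cleaned_path)
```

Because the root is derived from `Path.cwd()`, launching the notebook from a different directory changes where the paths point.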

In [10]:
# Load the cleaned dataset
import pandas as pd

# Read the CSV as ISO-8859-1 (Latin-1), the encoding used for this file
df = pd.read_csv(CLEANED_DATA_PATH, encoding='ISO-8859-1')
print("✅ Cleaned data loaded. Shape:", df.shape)
df.head()


✅ Cleaned data loaded. Shape: (9994, 21)


,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,...,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
0,1,CA-2016-152156,2016-11-08,2016-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,...,42420,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96,2,0.0,41.9136
1,2,CA-2016-152156,2016-11-08,2016-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,...,42420,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.94,3,0.0,219.582
2,3,CA-2016-138688,2016-06-12,2016-06-16,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,...,90036,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.62,2,0.0,6.8714
3,4,US-2015-108966,2015-10-11,2015-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,...,33311,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775,5,0.45,-383.031
4,5,US-2015-108966,2015-10-11,2015-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,...,33311,South,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.368,2,0.2,2.5164


In [11]:
# Create date-based features
# Parse the date columns as datetimes
df['Order Date'] = pd.to_datetime(df['Order Date'])
df['Ship Date'] = pd.to_datetime(df['Ship Date'])

# Extract features from Order Date
df['Order_Year'] = df['Order Date'].dt.year
df['Order_Month'] = df['Order Date'].dt.month
# ISO week number (1-53); dates near New Year can fall in a week of the adjacent ISO year
df['Order_Week'] = df['Order Date'].dt.isocalendar().week
# Day name (e.g. 'Tuesday'); converted to an ordinal code in a later cell
df['Order_Weekday'] = df['Order Date'].dt.day_name()
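
As a quick sanity check on the date features, the sketch below applies the same accessors to a toy frame. The `demo` frame and its second date are illustrative assumptions and not part of the notebook's data flow; the first date matches the first row of the preview above. It also shows a property of ISO week numbers worth keeping in mind: a date in early January can belong to the last ISO week of the previous year.

```python
import pandas as pd

# Toy frame (illustrative only)
demo = pd.DataFrame({"Order Date": pd.to_datetime(["2016-11-08", "2016-01-01"])})

demo["Order_Year"] = demo["Order Date"].dt.year
demo["Order_Month"] = demo["Order Date"].dt.month
# isocalendar() returns a frame with year/week/day columns; cast week to a plain int
demo["Order_Week"] = demo["Order Date"].dt.isocalendar().week.astype(int)
demo["Order_Weekday"] = demo["Order Date"].dt.day_name()

print(demo)
```

In this example `2016-01-01` keeps `Order_Year` 2016 but receives ISO week 53, so `Order_Year` and `Order_Week` together do not always identify a single calendar week.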



In [12]:
# Calculate shipping delay (delivery lead time) in days
df['Shipping_Delay'] = (df['Ship Date'] - df['Order Date']).dt.days
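
The subtraction above can be checked on a few hand-picked rows. This is a minimal sketch on a toy frame: the first two date pairs come from the preview above, while the third pair is a deliberately invalid, made-up row (ship date before order date) to show how such records could be flagged. Whether the real data contains any is not shown here.

```python
import pandas as pd

# Toy frame; the last row is a hypothetical bad record
demo = pd.DataFrame({
    "Order Date": pd.to_datetime(["2016-11-08", "2015-10-11", "2016-06-12"]),
    "Ship Date": pd.to_datetime(["2016-11-11", "2015-10-18", "2016-06-10"]),
})

# Subtracting datetimes yields timedeltas; .dt.days keeps the whole-day part
demo["Shipping_Delay"] = (demo["Ship Date"] - demo["Order Date"]).dt.days

# Flag rows where the ship date precedes the order date
invalid = demo[demo["Shipping_Delay"] < 0]
print(demo["Shipping_Delay"].tolist())
print(f"{len(invalid)} row(s) with a negative delay")
```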

In [13]:
# Encode Weekday as Ordinal
weekday_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
df['Order_Weekday'] = df['Order_Weekday'].astype(pd.CategoricalDtype(categories=weekday_order, ordered=True))
df['Order_Weekday'] = df['Order_Weekday'].cat.codes
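
The sketch below illustrates two properties of this encoding on toy data. First, the category codes agree with `Series.dt.weekday`, which also uses Monday=0 through Sunday=6, so that accessor would be an equivalent shortcut. Second, any label missing from the category list is coded as -1. The value `"Funday"` is a made-up bad label used only to show that behavior.

```python
import pandas as pd

weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]
weekday_dtype = pd.CategoricalDtype(categories=weekday_order, ordered=True)

# Dates taken from the preview above: a Tuesday and two Sundays
dates = pd.Series(pd.to_datetime(["2016-11-08", "2016-06-12", "2015-10-11"]))

# Route 1: day names -> ordered categorical -> integer codes
via_names = dates.dt.day_name().astype(weekday_dtype).cat.codes
# Route 2: pandas' built-in weekday number (Monday=0 ... Sunday=6)
via_weekday = dates.dt.weekday

print(via_names.tolist())
print(via_weekday.tolist())

# A label outside the category list (hypothetical) becomes -1
bad = pd.Series(["Monday", "Funday"]).astype(weekday_dtype).cat.codes
print(bad.tolist())
```

Checking for -1 after encoding is a cheap way to catch missing or malformed dates.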


In [14]:
# One-hot encode categorical features
categorical_cols = ['Ship Mode', 'Segment', 'Region', 'Category', 'Sub-Category']
# drop_first=True omits the first level of each column, which then serves as the baseline
df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)
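
To show what `drop_first=True` does, here is a minimal sketch on a toy frame with the four `Region` values. The `demo` frame is illustrative only. The `dtype=int` argument is an optional variation that produces 0/1 columns; the notebook itself does not pass it, which is why its output shows `True`/`False`.

```python
import pandas as pd

# Toy frame (illustrative only)
demo = pd.DataFrame({
    "Region": ["South", "West", "Central", "East"],
    "Quantity": [2, 3, 5, 1],
})

# Levels are sorted, so 'Central' is the first level and gets dropped
encoded = pd.get_dummies(demo, columns=["Region"], drop_first=True, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

The row whose region is `Central` has zeros in all three `Region_*` columns, which is how the baseline level is represented.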


In [15]:
# Drop identifier and high-cardinality text fields
drop_cols = [
    'Row ID', 'Order ID', 'Customer ID', 'Customer Name',
    'Product ID', 'Product Name', 'Country', 'City', 'State',
    'Postal Code', 'Order Date', 'Ship Date'
]
df = df.drop(columns=drop_cols)

print("✅ Final feature matrix shape:", df.shape)
df.head()


✅ Final feature matrix shape: (9994, 35)


,Sales,Quantity,Discount,Profit,Order_Year,Order_Month,Order_Week,Order_Weekday,Shipping_Delay,Ship Mode_Same Day,...,Sub-Category_Envelopes,Sub-Category_Fasteners,Sub-Category_Furnishings,Sub-Category_Labels,Sub-Category_Machines,Sub-Category_Paper,Sub-Category_Phones,Sub-Category_Storage,Sub-Category_Supplies,Sub-Category_Tables
0,261.96,2,0.0,41.9136,2016,11,45,1,3,False,...,False,False,False,False,False,False,False,False,False,False
1,731.94,3,0.0,219.582,2016,11,45,1,3,False,...,False,False,False,False,False,False,False,False,False,False
2,14.62,2,0.0,6.8714,2016,6,23,6,4,False,...,False,False,False,True,False,False,False,False,False,False
3,957.5775,5,0.45,-383.031,2015,10,41,6,7,False,...,False,False,False,False,False,False,False,False,False,True
4,22.368,2,0.2,2.5164,2015,10,41,6,7,False,...,False,False,False,False,False,False,False,True,False,False
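
One way to confirm that no text columns survived the drop is to list every column that is neither numeric nor boolean. The helper `non_numeric_columns` below is a hypothetical name introduced for this sketch, and the toy frame deliberately keeps a `City` column so the check has something to report.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def non_numeric_columns(frame: pd.DataFrame) -> list[str]:
    """Return the names of columns that are neither numeric nor boolean."""
    # is_numeric_dtype treats boolean columns as numeric
    return [col for col in frame.columns if not is_numeric_dtype(frame[col])]


# Toy frame (illustrative only); 'City' is left in on purpose
demo = pd.DataFrame({
    "Sales": [261.96, 731.94],
    "Quantity": [2, 3],
    "Ship Mode_Same Day": [False, True],
    "City": ["Henderson", "Los Angeles"],
})

leftover = non_numeric_columns(demo)
print(leftover)
```

An empty list from this check on the final feature matrix would support treating it as model-ready.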


In [16]:
# Save feature-ready dataset
FEATURE_ENGINEERED_PATH.parent.mkdir(parents=True, exist_ok=True)
df.to_csv(FEATURE_ENGINEERED_PATH, index=False)

print("✅ Feature-engineered dataset saved to:", FEATURE_ENGINEERED_PATH)


✅ Feature-engineered dataset saved to: /Users/nastaran/DSI_Project/C6_ML5/Data/Processed/feature_engineered_superstore.csv
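
A round trip through CSV can be sketched as follows, assuming a temporary directory and a toy frame in place of the real output path. The file name `demo_features.csv` is made up for the example. Writing with `index=False` means the reloaded frame has the same columns as the one saved, with no extra index column.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with float, integer, and boolean columns (illustrative only)
demo = pd.DataFrame({
    "Sales": [261.96, 14.62],
    "Order_Weekday": [1, 6],
    "Region_West": [False, True],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "Processed" / "demo_features.csv"
    # Create the target folder if it does not exist yet
    out_path.parent.mkdir(parents=True, exist_ok=True)
    demo.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(reloaded.shape)
print(reloaded.dtypes)
```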
