# Feature Engineering: FashionWorld Packaging Optimization

## 🎯 Objective

The goal of this notebook is to engineer new features that enhance the predictive power of our dataset for modeling packaging quality in FashionWorld operations.

We apply domain-specific transformations to extract temporal, textual, and categorical insights, followed by encoding and memory optimization to produce a clean, model-ready dataset.


In [None]:
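from google.colab import files

# files.upload() returns a dict of {filename: file bytes}, not a DataFrame;
# the uploaded file is also written to the current working directory
uploaded = files.upload()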

Saving df_merged.csv to df_merged.csv


In [None]:
import pandas as pd
df = pd.read_csv("df_merged.csv")

## 1. Target Encoding (Binary)

In [None]:
# Create a binary target variable from 'PackagingQuality_Clean'
# 'Good' → 1, 'Bad' → 0, and 'Uncertain' entries remain NaN initially
df['Target'] = df['PackagingQuality_Clean'].map({'Good': 1, 'Bad': 0})

# Check missing values in the target column
missing_target = df['Target'].isna().sum()
print(f"🔍 Missing values in Target: {missing_target}")

🔍 Missing values in Target: 1824


We created a new binary target variable called `Target`:
- `Good` → 1
- `Bad` → 0
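- `Uncertain` (and any other unmapped value) → `NaN` after the mapping above, which accounts for the 1,824 missing targets. Under the business assumption that unknown quality likely represents risk, these rows are meant to be imputed as bad (`0`); the cell above does not yet perform that imputation.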
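
The imputation of `Uncertain` rows described above could look roughly like the following. This is a minimal sketch on a hand-made toy frame, not the notebook's own cell: the `fillna(0)` step and the `int8` cast are assumptions about how the business rule would be applied.

```python
import pandas as pd

# Toy stand-in for the real data (values are illustrative only)
toy = pd.DataFrame({"PackagingQuality_Clean": ["Good", "Bad", "Uncertain", "Good"]})

# Same mapping as in the notebook: anything outside the dict becomes NaN
toy["Target"] = toy["PackagingQuality_Clean"].map({"Good": 1, "Bad": 0})
n_unmapped = int(toy["Target"].isna().sum())

# Assumed imputation step: treat unknown quality as bad (0),
# then store the now-complete column as a small integer
toy["Target"] = toy["Target"].fillna(0).astype("int8")

print(n_unmapped)
print(toy["Target"].tolist())
```

Counting the unmapped rows before filling keeps a record of how many labels were assumed rather than observed.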

## 2. Date-based Features

In [None]:
# Convert column to datetime
df['DateOfReport'] = pd.to_datetime(df['DateOfReport'], errors='coerce')
df['ReportMonth'] = df['DateOfReport'].dt.month
df['ReportQuarter'] = df['DateOfReport'].dt.quarter
df['Weekday'] = df['DateOfReport'].dt.day_name()

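# Map month into meteorological seasons
def get_season(month):
    if pd.isna(month):
        # Unparseable dates (NaT after errors='coerce') get no season
        # instead of falling through to 'Autumn'
        return None
    if month in [12, 1, 2]:
        return 'Winter'
    elif month in [3, 4, 5]:
        return 'Spring'
    elif month in [6, 7, 8]:
        return 'Summer'
    else:
        return 'Autumn'

df['Season'] = df['ReportMonth'].apply(get_season)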

From the `DateOfReport` column, we engineered several new temporal features to capture time-based effects on packaging performance:

- `ReportMonth`: The numeric month (1–12) extracted from DateOfReport.

- `ReportQuarter`: The numeric quarter of the year (1–4), highlighting potential seasonal cycles.

- `Weekday`: The day of the week (e.g., Monday, Tuesday), to account for weekly operational patterns.

- `Season`: A categorical variable grouping the month into meteorological seasons:

  - Winter: December, January, February
  - Spring: March, April, May
  - Summer: June, July, August
  - Autumn: September, October, November

These features help identify temporal trends, such as potential seasonal fluctuations in packaging quality or weekday-specific operational differences.
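
To make the temporal features concrete, here is a small sketch on three hand-made date strings, one of which cannot be parsed. The dictionary `SEASON_BY_MONTH` is a hypothetical alternative to an if/elif helper and is used here only so that a missing month stays missing.

```python
import pandas as pd

# Hypothetical lookup table: month number -> meteorological season
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Autumn", 10: "Autumn", 11: "Autumn",
}

raw = pd.Series(["2023-01-15", "2023-07-04", "not a date"])
parsed = pd.to_datetime(raw, errors="coerce")  # the third entry becomes NaT

toy = pd.DataFrame({
    "ReportMonth": parsed.dt.month,
    "ReportQuarter": parsed.dt.quarter,
    "Weekday": parsed.dt.day_name(),
})

# dict.get returns None for a missing month, so no season is invented for NaT rows
toy["Season"] = toy["ReportMonth"].map(lambda m: SEASON_BY_MONTH.get(m))
print(toy)
```

Because the unparseable date yields missing values in every derived column, it is worth checking how many rows are affected before one-hot encoding `Season` and `Weekday`.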


## 3. Text-based Feature

In [None]:
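# Create a numeric proxy for product name complexity
# Missing names count as length 0 (str(NaN) would otherwise be counted as the 3 characters of 'nan')
df['ProductName_Length'] = df['ProductName'].fillna('').astype(str).str.len()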

We computed `ProductName_Length` as a proxy for name complexity, which may relate to packaging variability or manual entry issues.
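
A short sketch of how a length feature behaves when a product name is missing. The names are invented for illustration; the point is only that converting a missing value with `str` yields the text `'nan'`, whereas filling with an empty string first yields a length of 0.

```python
import pandas as pd

# Invented product names; the second entry is missing, as it would be after read_csv
names = pd.Series(["Blue Denim Jacket", float("nan"), "Silk Scarf"])

# Converting with str() turns the missing value into the text 'nan'
naive_len = names.apply(lambda x: len(str(x)))

# Filling with an empty string first counts a missing name as length 0
safe_len = names.fillna("").str.len()

print(naive_len.tolist())
print(safe_len.tolist())
```

Which convention is preferable depends on whether a missing name should look like a short name or like no name at all to the model.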

## 4. Supplier History Score (Bucketed Risk Score)

In [None]:
# Quantify supplier performance based on risk features
from sklearn.preprocessing import KBinsDiscretizer

risk_cols = ['BadPackagingRate (%)', 'TotalIncidents', 'AverageCostPerIncident (€)']
risk_data = df[risk_cols].fillna(df[risk_cols].median())

discretizer = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile')
df['SupplierHistoryScore'] = discretizer.fit_transform(risk_data).mean(axis=1).round(0).astype(int)




In this step, we create a **SupplierHistoryScore**: a composite feature that quantifies how risky a supplier might be based on its historical performance.

First, we identified three risk-related features that reflect past performance:

* **BadPackagingRate (%)**: This measures how often packaging has been rated as bad for each supplier.
* **TotalIncidents**: This counts the total number of incidents linked to each supplier.
* **AverageCostPerIncident (€)**: This captures the average cost of incidents associated with each supplier.

To prepare these features for scoring, we fill any missing values with the median of each column. This ensures we don’t have gaps in our data when calculating risk scores.

Next, we use scikit-learn's `KBinsDiscretizer` to **discretize** each of these three features into five ordinal bins (0 to 4) using a quantile-based strategy, so each bin holds roughly a fifth of the rows. Because higher values of all three features indicate more risk, bin 0 contains the lowest-risk observations and bin 4 the riskiest. Each binned feature therefore reflects its relative risk level in the dataset.

After discretizing, we calculate the **SupplierHistoryScore** for each row by taking the average of these three binned risk scores. We round the result and convert it to an integer. This yields a single, easy-to-interpret number (from 0 to 4) that captures the overall risk profile for each supplier.

A lower `SupplierHistoryScore` means the supplier has a better track record (fewer incidents, lower costs, better packaging), while a higher score indicates a higher level of risk.

This feature is particularly useful as it consolidates several risk factors into one concise metric that can help inform models or analyses of packaging quality.
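
The scoring steps can be traced on a toy example. The sketch below assumes ten hypothetical suppliers whose three risk indicators rise together; the numbers are invented, and only the column names and the `KBinsDiscretizer` settings come from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

# Ten hypothetical suppliers, ordered from safest (row 0) to riskiest (row 9)
toy = pd.DataFrame({
    "BadPackagingRate (%)": np.linspace(2, 29, 10),
    "TotalIncidents": np.arange(10, 110, 10),
    "AverageCostPerIncident (€)": np.linspace(100, 1000, 10),
})

discretizer = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
binned = discretizer.fit_transform(toy)  # one ordinal bin (0.0 to 4.0) per feature

# Average the three per-feature bins, then round to a single integer score
scores = binned.mean(axis=1).round(0).astype(int)

print(discretizer.bin_edges_[0])  # learned quantile edges for the first feature
print(scores)
```

Inspecting `bin_edges_` is a quick way to see where the quantile cut points fall for each feature, which helps when explaining a given score to stakeholders.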

## 5. Categorical Encoding (One-Hot)

In [None]:
df_encoded = pd.get_dummies(df, columns=[
    'SupplierName_Clean',
    'GarmentType',
    'Material',
    'ProposedFoldingMethod_Clean',
    'ProposedLayout_Clean',
    'ProductReference_Format',
    'ProposedUnitsPerCarton_Format',
    'Size',
    'Collection',
    'Season',
    'Weekday'
], drop_first=True)

To convert categorical variables for modeling:
- We applied one-hot encoding to turn each categorical feature into binary indicator columns, a representation that works for both tree-based and linear models.
- With `drop_first=True`, one level per feature is dropped and serves as the implicit baseline, which avoids perfectly collinear indicator columns in linear models.
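
The effect of `drop_first=True` is easiest to see on a single toy column. In this sketch the four rows are invented; `Spring` sorts first alphabetically, so it is the level that gets dropped and is represented by all indicators being zero.

```python
import pandas as pd

toy = pd.DataFrame({"Season": ["Winter", "Summer", "Winter", "Spring"]})

# One indicator column per level, minus the first level in sorted order
encoded = pd.get_dummies(toy, columns=["Season"], drop_first=True)

print(encoded.columns.tolist())
print(encoded.astype(int).values.tolist())
```

The dropped level is chosen by sort order, not by frequency, so the baseline category is not necessarily the most common one.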

## 6. Final cleaning

In [None]:
features_to_drop = [
    'ReportID', 'ProductReference', 'ProductReference_Cleaned', 'DateOfReport', 'PackagingQuality',
    'PackagingQuality_Clean', 'ProductName'
]
df_encoded.drop(columns=features_to_drop, inplace=True)

In [None]:
# Check for any object (string) columns
print(df_encoded.dtypes.value_counts())
# Check which columns are still objects
object_cols = df_encoded.select_dtypes(include='object').columns
print("🧩 Object columns:", object_cols.tolist())
# Drop object columns not used in modeling
df_encoded = df_encoded.drop(columns=['SupplierName', 'ProposedFoldingMethod', 'ProposedLayout', 'Month'])
print(df_encoded.dtypes.value_counts())  # no more object!

bool       53
float64    12
int64       8
object      4
int32       2
Name: count, dtype: int64
🧩 Object columns: ['SupplierName', 'ProposedFoldingMethod', 'ProposedLayout', 'Month']
bool       53
float64    12
int64       8
int32       2
Name: count, dtype: int64


### 🧹 Final Cleanup

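We remove identifier columns (such as `ReportID` and `ProductReference`), raw columns that are redundant after cleaning or encoding, and the original quality labels (`PackagingQuality`, `PackagingQuality_Clean`), which would otherwise leak the target into the features.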
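
As a possible variation on this cleanup, leftover text columns can be found by dtype rather than listed by hand. The sketch below uses an invented four-column frame; `errors="ignore"` and the `select_dtypes` filter are one way to do it, not necessarily what the notebook intends.

```python
import pandas as pd

toy = pd.DataFrame({
    "ReportID": [101, 102],        # identifier, carries no signal
    "SupplierName": ["A", "B"],    # raw text left over after encoding its cleaned version
    "Weight": [0.35, 0.21],
    "Target": [1, 0],
})

# Drop known identifiers; errors="ignore" skips names that are not present
toy = toy.drop(columns=["ReportID", "ProductReference"], errors="ignore")

# Find whatever is neither numeric nor boolean instead of hard-coding names
leftover_text = toy.select_dtypes(exclude=["number", "bool"]).columns.tolist()
toy = toy.drop(columns=leftover_text)

print(leftover_text)
print(toy.columns.tolist())
```

Printing the dropped names before removing them keeps the step auditable if the upstream schema changes.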

In [None]:
# Downcast numeric types
# Identify float64 and int64 columns
float_cols = df_encoded.select_dtypes(include='float64').columns.tolist()
int_cols = df_encoded.select_dtypes(include='int64').columns.tolist()

print("\n🔢 Float64 columns (to downcast to float32):")
print(float_cols)

# Display first 5 rows of float64 columns
print("🔍 Float64 columns preview:")
display(df_encoded[float_cols].head())

# Downcast float64 to float32
for col in float_cols:
    df_encoded[col] = pd.to_numeric(df_encoded[col], downcast='float')

print("\n🔢 Int64 columns (to downcast to int16 where possible):")
print(int_cols)

# Display first 5 rows of int64 columns
print("🔍 Int64 columns preview:")
display(df_encoded[int_cols].head())

# Downcast int64 to int16 if within range
for col in int_cols:
    min_val, max_val = df_encoded[col].min(), df_encoded[col].max()
    if min_val >= -32768 and max_val <= 32767:
        df_encoded[col] = df_encoded[col].astype('int16')
    else:
        print(f"⚠️ Skipping '{col}' (out of int16 range): min={min_val}, max={max_val}")

# Summary of types after optimization
print("\n✅ Data types after downcasting:")
print(df_encoded.dtypes.value_counts())



🔢 Float64 columns (to downcast to float32):
['Weight', 'ProposedUnitsPerCarton', 'ProposedUnitsPerCarton_Pos', 'IncidentCount', 'AvgCostImpact', 'TotalCostImpact', 'BadPackagingRate (%)', 'AverageCostPerIncident (€)', 'OnTimeDeliveryRate (%)', 'TotalHistoricalIncidents', 'AvgIncidentCost', 'Target']
🔍 Float64 columns preview:


| | Weight | ProposedUnitsPerCarton | ProposedUnitsPerCarton_Pos | IncidentCount | AvgCostImpact | TotalCostImpact | BadPackagingRate (%) | AverageCostPerIncident (€) | OnTimeDeliveryRate (%) | TotalHistoricalIncidents | AvgIncidentCost | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.35 | 29.0 | 29.0 | 0.0 | 0.0 | 0.0 | 7.55 | 560.77 | 91.67 | 0.0 | 0.0 | 1.0 |
| 1 | 0.21 | 20.0 | 20.0 | 1.0 | 163.0 | 163.0 | 26.88 | 551.78 | 69.76 | 1.0 | 163.0 | 1.0 |
| 2 | 0.2 | 31.0 | 31.0 | 1.0 | 387.0 | 387.0 | 7.55 | 560.77 | 91.67 | 1.0 | 387.0 | 1.0 |
| 3 | 1.3 | 5.0 | 5.0 | 0.0 | 0.0 | 0.0 | 7.55 | 560.77 | 91.67 | 0.0 | 0.0 | 1.0 |
| 4 | 1.11 | 9.0 | 9.0 | 0.0 | 0.0 | 0.0 | 7.55 | 560.77 | 91.67 | 0.0 | 0.0 | 1.0 |



🔢 Int64 columns (to downcast to int16 where possible):
['X_in_name', 'PackagesHandled', 'TotalIncidents', 'AnomaliesDetected', 'FoldingMethodWasMissing', 'HadAnyIncident', 'ProductName_Length', 'SupplierHistoryScore']
🔍 Int64 columns preview:


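| | X_in_name | PackagesHandled | TotalIncidents | AnomaliesDetected | FoldingMethodWasMissing | HadAnyIncident | ProductName_Length | SupplierHistoryScore |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 7510 | 169 | 21 | 0 | 0 | 17 | 2 |
| 1 | 0 | 4126 | 173 | 27 | 0 | 1 | 15 | 2 |
| 2 | 0 | 7510 | 169 | 21 | 0 | 1 | 14 | 2 |
| 3 | 0 | 7510 | 169 | 21 | 0 | 0 | 14 | 2 |
| 4 | 0 | 7510 | 169 | 21 | 0 | 0 | 16 | 2 |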



✅ Data types after downcasting:
bool       53
float32    12
int16       8
int32       2
Name: count, dtype: int64


### 🧠 Memory Optimization

We reduce memory usage by:
- Converting `float64` columns to `float32`
- Converting `int64` columns to `int16` (when values are within safe range)

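This roughly halves the footprint of the float columns and cuts the downcast integer columns to a quarter of their original size, which lowers the risk of out-of-memory crashes in Colab and can speed up training. `float32` keeps about seven significant digits, which should be ample for the rates, weights, and costs in this dataset.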
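
The saving can be checked directly with `DataFrame.memory_usage`. The sketch below assumes a toy frame with one float and one integer column of 1,000 rows each; the column names are borrowed from the dataset, but the values are synthetic.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Weight": np.linspace(0.1, 2.0, 1000),               # float64 by default
    "PackagesHandled": np.arange(1000, dtype="int64"),
})
before = toy.memory_usage(index=False).sum()

# float64 -> float32 (to_numeric never goes below float32)
toy["Weight"] = pd.to_numeric(toy["Weight"], downcast="float")

# int64 -> int16 only if every value fits in the int16 range
bounds = np.iinfo(np.int16)
col = toy["PackagesHandled"]
if col.min() >= bounds.min and col.max() <= bounds.max:
    toy["PackagesHandled"] = col.astype("int16")

after = toy.memory_usage(index=False).sum()
print(f"{before} bytes -> {after} bytes")
```

Using `np.iinfo` for the bounds avoids hard-coding the int16 limits and makes it easy to try another target type.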

In [None]:
# Save final dataset
df_encoded.to_csv('feature_engineered_data.csv', index=False)
print("✅ Feature engineering complete. Data saved as 'feature_engineered_data.csv'.")

# Download from Colab
from google.colab import files
files.download("feature_engineered_data.csv")

✅ Feature engineering complete. Data saved as 'feature_engineered_data.csv'.



### 💾 Export

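The final model-ready dataset is exported to `feature_engineered_data.csv`. Note that CSV stores no dtype information, so the downcast types apply to the in-memory DataFrame only and have to be specified again when the file is read back.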
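
One practical consequence of exporting to CSV is that compact dtypes are not preserved on reload. The sketch below round-trips a two-column toy frame through an in-memory buffer (standing in for the file on disk) and shows one way to restore the types with the `dtype` argument of `pd.read_csv`; the column subset is chosen only for illustration.

```python
import io
import pandas as pd

toy = pd.DataFrame({"Weight": [0.35, 0.21], "TotalIncidents": [169, 173]})
toy = toy.astype({"Weight": "float32", "TotalIncidents": "int16"})

# Write to an in-memory buffer instead of a real file
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default read: pandas infers the widest common types again
buffer.seek(0)
default_read = pd.read_csv(buffer)

# Typed read: pass the compact dtypes explicitly
buffer.seek(0)
typed_read = pd.read_csv(buffer, dtype={"Weight": "float32", "TotalIncidents": "int16"})

print(default_read.dtypes.astype(str).tolist())
print(typed_read.dtypes.astype(str).tolist())
```

Keeping the dtype mapping alongside the CSV, or in the modeling notebook, lets the downstream step load the data in its compact form.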