# **02 – EDA and Feature Engineering (Olist Delivery Delays)**

This notebook loads `olist_model_data.csv` (downloaded as a gzip archive and decompressed locally), explores delivery delay patterns, and creates additional features for use in the modelling notebook.

### Step 1: Load prepared data and basic info

In [None]:
import os
import requests
import pandas as pd
import gzip
import shutil
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="whitegrid")

PROCESSED_URL = "https://raw.githubusercontent.com/aejae-da/bda-olist-project/main/data/processed/olist_model_data.csv.gz"

def download_and_decompress(url, local_path, decompressed_path):
    if not os.path.exists(decompressed_path):
        if not os.path.exists(local_path):
            print(f"Downloading compressed data from {url}...")
            # A timeout keeps the cell from hanging indefinitely on a stalled connection
            r = requests.get(url, stream=True, timeout=60)
            r.raise_for_status()
            with open(local_path, 'wb') as f:
                for chunk in r.iter_content(chunk_size=8192):
                    f.write(chunk)
            print(f"Downloaded {local_path}")

        print(f"Decompressing {local_path} to {decompressed_path}...")
        with gzip.open(local_path, 'rb') as f_in:
            with open(decompressed_path, 'wb') as f_out:
                shutil.copyfileobj(f_in, f_out)
        print(f"Decompressed to {decompressed_path}")
    else:
        print(f"Decompressed file {decompressed_path} already exists.")

# Paths for local files
compressed_file = "olist_model_data.csv.gz"
decompressed_file = "olist_model_data.csv"

# Run download and decompression
download_and_decompress(PROCESSED_URL, compressed_file, decompressed_file)

# Load the decompressed CSV into a dataframe with date parsing
df = pd.read_csv(
    decompressed_file,
    parse_dates=[
        "order_purchase_timestamp",
        "order_delivered_customer_date",
        "order_estimated_delivery_date"
    ]
)

print(f"Loaded {decompressed_file} with shape: {df.shape}")
df.head()
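
As a lighter alternative to the manual download-and-decompress step above, pandas can read gzip-compressed CSV files directly, inferring the compression from the `.gz` suffix. The sketch below shows this on a small toy frame written to a temporary directory; the toy columns and the file name `toy_model_data.csv.gz` are illustrative assumptions, not the real dataset.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the processed file (assumption: the real file has more columns).
toy = pd.DataFrame(
    {
        "order_id": ["a1", "b2"],
        "order_purchase_timestamp": ["2018-01-01 10:00:00", "2018-01-07 18:30:00"],
        "delay_days": [-3, 2],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, "toy_model_data.csv.gz")
    # compression defaults to "infer", so the ".gz" suffix selects gzip on write and read.
    toy.to_csv(gz_path, index=False)
    toy_loaded = pd.read_csv(gz_path, parse_dates=["order_purchase_timestamp"])

print(toy_loaded.dtypes)
```

The same `pd.read_csv` call should also accept a URL ending in `.csv.gz`, which would make the local decompressed copy optional; keeping the helper function is still reasonable if a cached plain CSV is wanted.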

### Step 2: Basic overview

In [12]:
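df.info()

# Only the last expression in a cell is displayed automatically,
# so intermediate results are printed explicitly.
print("\nSummary of delivery time and delay (days):")
print(df[['delivery_time_days', 'delay_days']].describe())

print("\nShare of on-time (0) vs late (1) deliveries:")
print(df['late_delivery_flag'].value_counts(normalize=True))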

print("\nMissing values per column:")
print(df.isna().sum())

print("\nValue counts of key categorical variables:")
print(df['product_category_name'].value_counts().head(10))
print(df['customer_state'].value_counts().head(10))
print(df['seller_state'].value_counts().head(10))


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94010 entries, 0 to 94009
Data columns (total 18 columns):
 #   Column                         Non-Null Count  Dtype         
---  ------                         --------------  -----         
 0   order_id                       94010 non-null  object        
 1   customer_id                    94010 non-null  object        
 2   customer_unique_id             94010 non-null  object        
 3   seller_id                      94010 non-null  object        
 4   product_id                     94010 non-null  object        
 5   order_purchase_timestamp       94010 non-null  datetime64[ns]
 6   order_delivered_customer_date  94010 non-null  datetime64[ns]
 7   order_estimated_delivery_date  94010 non-null  datetime64[ns]
 8   delivery_time_days             94010 non-null  int64         
 9   delay_days                     94010 non-null  int64         
 10  late_delivery_flag             94010 non-null  int64         
 11  price          

### Step 3: Explore delay patterns

In [None]:
#Some key plots
#Distribution of delay_days

plt.figure(figsize=(8,5))
sns.histplot(df['delay_days'], bins=40, kde=True)
plt.title('Distribution of Delay Days')
plt.xlabel('Delay (days)')
plt.ylabel('Count')
plt.show()

#Late vs on-time proportion

plt.figure(figsize=(5,4))
sns.countplot(data=df, x='late_delivery_flag')
plt.title('Late vs On-time Deliveries')
plt.xlabel('Late Delivery Flag (0 = on time/early, 1 = late)')
plt.ylabel('Number of Orders')
plt.show()

# Late rate by customer_state (top states)

late_by_state = (
    df.groupby('customer_state')['late_delivery_flag']
    .mean()
    .sort_values(ascending=False)
)

plt.figure(figsize=(10,6))
late_by_state.head(10).plot(kind='bar')
plt.title('Top 10 Customer States by Late Delivery Rate')
plt.xlabel('Customer State')
plt.ylabel('Late Delivery Rate')
plt.show()

# Late rate by product_category_name (optional)
# Note: this ranks all categories, including ones with very few orders.

late_by_cat = (
    df.groupby('product_category_name')['late_delivery_flag']
    .mean()
    .sort_values(ascending=False)
)

plt.figure(figsize=(10,6))
late_by_cat.head(10).plot(kind='bar')
plt.title('Top 10 Product Categories by Late Delivery Rate')
plt.xlabel('Product Category')
plt.ylabel('Late Delivery Rate')
plt.show()
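
Ranking groups by mean late rate can be dominated by categories or states with only a handful of orders. One way to guard against this, sketched below on toy data, is to compute the order count alongside the rate and keep only groups above a minimum size. The threshold `MIN_ORDERS` and the toy category names are assumptions for illustration; a sensible cutoff for the real data would need to be chosen by inspection.

```python
import pandas as pd

# Toy data: two reasonably sized categories and one with a single order.
toy = pd.DataFrame(
    {
        "product_category_name": ["toys"] * 4 + ["books"] * 4 + ["rare_item"],
        "late_delivery_flag": [0, 0, 1, 0, 0, 1, 1, 0, 1],
    }
)

MIN_ORDERS = 3  # hypothetical threshold, not taken from the notebook

# Late rate and order count per category in one pass.
cat_stats = toy.groupby("product_category_name")["late_delivery_flag"].agg(
    late_rate="mean", n_orders="count"
)

# Drop categories too small to give a stable rate, then rank the rest.
reliable = cat_stats[cat_stats["n_orders"] >= MIN_ORDERS].sort_values(
    "late_rate", ascending=False
)
print(reliable)
```

Here `rare_item` has a 100% late rate from one order and would top an unfiltered ranking; after filtering, only `books` and `toys` remain.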

### Step 4: Feature engineering – time features

In [None]:
#Year, month and day of week from purchase timestamp
df['purchase_year'] = df['order_purchase_timestamp'].dt.year
df['purchase_month'] = df['order_purchase_timestamp'].dt.month
df['purchase_dayofweek'] = df['order_purchase_timestamp'].dt.dayofweek # 0=Monday

#Quick check
df[['purchase_year', 'purchase_month', 'purchase_dayofweek']].head()
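
To make the `dayofweek` convention concrete, the short sketch below applies the same `.dt` accessors to three hand-picked dates (illustrative values, not rows from the dataset) and adds `day_name()` as a readable cross-check.

```python
import pandas as pd

# 2018-01-01 fell on a Monday, 2018-01-06 on a Saturday, 2018-01-07 on a Sunday.
dates = pd.Series(pd.to_datetime(["2018-01-01", "2018-01-06", "2018-01-07"]))

parts = pd.DataFrame(
    {
        "date": dates,
        "year": dates.dt.year,
        "month": dates.dt.month,
        "dayofweek": dates.dt.dayofweek,  # 0 = Monday ... 6 = Sunday
        "day_name": dates.dt.day_name(),
    }
)
print(parts)
```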

### Step 5: Encode categorical variables

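Label encoding converts each categorical variable into integer codes for modelling. We chose it over one-hot encoding because the high cardinality of these variables, product category in particular, would otherwise create a large number of features. The codes follow the sorted order of the category labels and carry no meaningful ranking, which tree-based models tolerate better than linear ones.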

In [None]:
from sklearn.preprocessing import LabelEncoder
df_fe = df.copy()

# Product category
le_product = LabelEncoder()
df_fe['product_category_encoded'] = le_product.fit_transform(
    df_fe['product_category_name'].fillna('unknown')
)

# Customer and seller states
le_cust_state = LabelEncoder()
df_fe['customer_state_encoded'] = le_cust_state.fit_transform(
    df_fe['customer_state'].fillna('unknown')
)

le_seller_state = LabelEncoder()
df_fe['seller_state_encoded'] = le_seller_state.fit_transform(
    df_fe['seller_state'].fillna('unknown')
)

df_fe[['product_category_encoded', 'customer_state_encoded', 'seller_state_encoded']].head()
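
To see what the encoders above produce, the sketch below fits a `LabelEncoder` on a few toy state values (illustrative, including one missing entry) and prints the label-to-code mapping. It shows that codes are assigned in sorted label order and that `inverse_transform` recovers the original label, which is useful when interpreting model output later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy state column with one missing value (assumption: real data may have none).
states = pd.Series(["SP", "RJ", None, "MG", "SP"])

le = LabelEncoder()
codes = le.fit_transform(states.fillna("unknown"))

# classes_ holds the labels in sorted order; a label's position is its code.
mapping = {label: int(code) for label, code in zip(le.classes_, le.transform(le.classes_))}
print(mapping)
print(codes)

# Codes can be mapped back to labels.
print(le.inverse_transform([0, 3]))
```

Because the codes are arbitrary integers, keeping the fitted encoders (or the `mapping` dictionaries) alongside the saved features would make the encoded columns easier to decode in the modelling notebook.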

### Step 6: Define final feature set for modelling

The final feature set excludes any direct measure of delivery delay to prevent target leakage. It includes price, freight value, encoded product category, encoded customer and seller states, and purchase time features (month and day of week). These features capture logistical, temporal, and product aspects that we hypothesize affect delivery punctuality.

In [None]:
# Define which columns will go into the model
feature_cols = [
    'price',
    'freight_value',
    'product_category_encoded',
    'customer_state_encoded',
    'seller_state_encoded',
    'purchase_month',
    'purchase_dayofweek'
]

X = df_fe[feature_cols]
y = df_fe['late_delivery_flag']

X.shape, y.shape
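
The leakage claim above can be backed by a small guard that fails loudly if a post-delivery column ever slips into the feature list. The sketch below is one possible check; the set `leaky_cols` is an assumption built from the column names shown in the `df.info()` output, and should be extended if other outcome-derived columns exist.

```python
# Feature list as defined in this step.
feature_cols = [
    "price",
    "freight_value",
    "product_category_encoded",
    "customer_state_encoded",
    "seller_state_encoded",
    "purchase_month",
    "purchase_dayofweek",
]

# Columns only known after delivery, plus the target itself (assumed list).
leaky_cols = {
    "order_delivered_customer_date",
    "delivery_time_days",
    "delay_days",
    "late_delivery_flag",
}

# Any overlap would mean the model could see the outcome it is predicting.
overlap = leaky_cols.intersection(feature_cols)
assert not overlap, f"Leaky columns in feature set: {sorted(overlap)}"
print("No leaky columns found in feature_cols.")
```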

### Step 7: Save feature-engineered dataset

In [None]:
#Combine features and target into one DataFrame
df_model = df_fe[feature_cols + ['late_delivery_flag']].copy()

df_model.to_csv('olist_model_features.csv', index=False)

print("Saved olist_model_features.csv with shape:", df_model.shape)

In [None]:
# Colab only: triggers a browser download of the saved file (skip when running locally)
from google.colab import files
files.download('olist_model_features.csv')

### Next Steps: Ready for Modelling

**Saved:** `olist_model_features.csv` (94K rows × 8 columns)

The next notebook `03_modelling.ipynb` will load the saved feature-engineered dataset `olist_model_features.csv` and build predictive models to classify late deliveries using Logistic Regression and Random Forest algorithms.

The feature set avoids leakage and focuses on realistic predictors available at order placement time.
