# Business Understanding

## Problem Statement
The large increase in shipping demand has not been matched by an increase in the capabilities of logistics companies.<br>
Delayed delivery is a risk in many sectors, one of which is e-commerce retail. Late delivery disrupts the product supply chain and reduces the retailer's credibility. Delays by the shipping carrier also disappoint buyers, which in turn can hurt the retailer.

## Goal
Build a **binary classification** machine learning model that can **predict delays** in logistics/product delivery in e-commerce with **high accuracy**.

## Objectives
1. Analyze the data and determine the target feature, a binary label (is_late, 1 or 0), in line with the problem statement (delay in delivery)
2. Carry out data processing to produce data that is free of noise
3. Carry out feature engineering, creating new features that expose additional patterns in the data and make classification easier for the model (with the expectation that accuracy will increase)
4. Select the most important features using feature selection and importance techniques (Pearson correlation matrix, KBest, chi-square, and SHAP) to reduce model complexity and computational load and to improve model performance
5. Carry out modeling using several baseline algorithms (Logistic Regression, SVM, and Decision Tree), as well as more advanced ensemble learning algorithms (XGBoost, LGBM, CatBoost, AdaBoost, and Random Forest)
6. Evaluate the model with accuracy metrics
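
As a rough illustration of objective 1, the sketch below derives an is_late label by comparing an actual delivery date with an estimated one. The toy data and the column names (`delivered_at`, `estimated_delivery`) are assumptions made for illustration only; the real dataset may name and structure these fields differently.

```python
import pandas as pd

# Toy orders table. The column names are hypothetical, not taken from the dataset.
orders = pd.DataFrame({
    "order_id": ["a1", "a2", "a3"],
    "delivered_at": pd.to_datetime(["2018-01-05", "2018-01-12", None]),
    "estimated_delivery": pd.to_datetime(["2018-01-07", "2018-01-10", "2018-01-15"]),
})

# Keep only orders that were actually delivered; an undelivered order has no label yet.
delivered = orders.dropna(subset=["delivered_at"]).copy()

# An order is late (1) when it arrived after the estimated date, otherwise on time (0).
delivered["is_late"] = (
    delivered["delivered_at"] > delivered["estimated_delivery"]
).astype(int)

# Share of late orders in the labeled data.
late_rate = delivered["is_late"].mean()
print(delivered[["order_id", "is_late"]])
print("Late rate:", late_rate)
```

Checking the late rate early is worthwhile because, if one class dominates, accuracy on its own can look high even for a model that rarely predicts the minority class.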

# Data Understanding

The data is divided into two parts: training data and testing data. Each part has five tables: df_Customers, df_OrderItems, df_Orders, df_Payments, and df_Products. <br>
Each table is described below:
1. df_Customers table <br>
This table contains data on customers who make product transactions on the e-commerce platform
2. df_OrderItems table <br>
This table maps the orders placed by customers to the products purchased
3. df_Orders table <br>
This table contains data on orders placed by each user
4. df_Payments table <br>
This table contains payments made by each user, including payment details and transaction value
5. df_Products table <br>
This table contains the list of products sold on the e-commerce platform that appear in transactions <br>

Let's look at the detailed contents of each table.
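
The relationships described above suggest that df_OrderItems acts as a bridge between orders and products. The sketch below shows one way such tables might be joined, using tiny made-up frames. The key columns (`order_id`, `customer_id`, `product_id`) and the `category` column are assumptions for illustration and should be checked against the actual schema.

```python
import pandas as pd

# Toy stand-ins for three of the tables; all column names are hypothetical.
orders = pd.DataFrame({"order_id": ["o1", "o2"], "customer_id": ["c1", "c2"]})
order_items = pd.DataFrame({
    "order_id": ["o1", "o1", "o2"],
    "product_id": ["p1", "p2", "p1"],
})
products = pd.DataFrame({"product_id": ["p1", "p2"], "category": ["toys", "books"]})

# Start from the bridge table and attach order and product attributes.
# validate="many_to_one" raises an error if a key is duplicated on the right side,
# which guards against accidentally multiplying rows.
merged = (
    order_items
    .merge(orders, on="order_id", how="left", validate="many_to_one")
    .merge(products, on="product_id", how="left", validate="many_to_one")
)
print(merged)
```

A left join from the bridge table keeps one row per purchased item, so the row count after merging should match the row count of the order items table.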

## Importing Common Libraries

In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

pd.set_option('display.max_columns', 99)


In [7]:
import warnings 
warnings.filterwarnings("ignore")

In [10]:
import kagglehub

path = kagglehub.dataset_download("bytadit/ecommerce-order-dataset")

print("Path to dataset files:", path)

Path to dataset files: /Users/arunekambaram/.cache/kagglehub/datasets/bytadit/ecommerce-order-dataset/versions/1


In [8]:
tables = ['Orders','Customers','Products','Payments','Order Items']