In [None]:
import sys
from pathlib import Path

project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

print("Project root set to:", project_root)


Project root set to: f:\projects\marketing


In [2]:
from src.config import get_paths
paths = get_paths(project_root)
paths

Paths(root=WindowsPath('F:/projects/marketing'), raw=WindowsPath('F:/projects/marketing/data/raw'), processed=WindowsPath('F:/projects/marketing/data/processed'))

In [6]:
import pandas as pd

In [3]:
from src.IO import read_csv

orders = read_csv(paths.raw / "olist_orders_dataset.csv")
orders.head()

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18 00:00:00
1,53cdb2fc8bc7dce0b6741e2150273451,b0830fb4747a6c6d20dea0b8c802d7ef,delivered,2018-07-24 20:41:37,2018-07-26 03:24:27,2018-07-26 14:31:00,2018-08-07 15:27:45,2018-08-13 00:00:00
2,47770eb9100c2d0c44946d9cf07ec65d,41ce2a54c0b03bf3443c3d931a367089,delivered,2018-08-08 08:38:49,2018-08-08 08:55:23,2018-08-08 13:50:00,2018-08-17 18:06:29,2018-09-04 00:00:00
3,949d5b44dbf5de918fe9c16f97b45f8a,f88197465ea7920adcdbec7375364d82,delivered,2017-11-18 19:28:06,2017-11-18 19:45:59,2017-11-22 13:39:59,2017-12-02 00:28:42,2017-12-15 00:00:00
4,ad21c59c0840e6cb83a9ceb5573f8159,8ab97904e6daea8866dbdbc4fb7aad2c,delivered,2018-02-13 21:18:39,2018-02-13 22:20:29,2018-02-14 19:46:34,2018-02-16 18:17:02,2018-02-26 00:00:00


### Orders table: column relevance for marketing budget reallocation

The following columns are required for the analysis:
- `order_id`: primary key, required for joins and aggregation
- `customer_id`: required to link orders to customers
- `order_purchase_timestamp`: required for cohorting, recency, and tenure
- `order_status`: required to interpret nulls and, where needed, to exclude canceled orders

The following columns are **not used** in downstream analysis:
- `order_approved_at`: operational timestamp, not relevant to marketing decisions
- `order_delivered_carrier_date`: logistics-related, no impact on acquisition or retention strategy
- `order_delivered_customer_date`: delivery performance, not used for CLV or budget allocation
- `order_estimated_delivery_date`: planning metadata, not required for this analysis

These columns are retained in raw data but excluded from the analytical dataset to reduce noise and focus on decision-relevant features.
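
The column selection described above can be sketched as follows. This is a minimal illustration rather than the project's actual code: `ANALYTICAL_COLS` and `select_analytical_columns` are hypothetical names, and a small toy frame stands in for the raw orders table.

```python
import pandas as pd

# Columns kept for the analytical dataset, taken from the "required" list above.
ANALYTICAL_COLS = [
    "order_id",
    "customer_id",
    "order_status",
    "order_purchase_timestamp",
]


def select_analytical_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy holding only the decision-relevant columns (hypothetical helper)."""
    missing = [c for c in ANALYTICAL_COLS if c not in df.columns]
    if missing:
        raise KeyError(f"Missing expected columns: {missing}")
    # Copy so later edits do not touch the raw frame.
    return df[ANALYTICAL_COLS].copy()


# Toy frame standing in for the raw orders table (assumption, not real data).
toy = pd.DataFrame({
    "order_id": ["a", "b"],
    "customer_id": ["c1", "c2"],
    "order_status": ["delivered", "canceled"],
    "order_purchase_timestamp": ["2017-10-02 10:56:33", "2018-07-24 20:41:37"],
    "order_approved_at": ["2017-10-02 11:07:15", None],
})

orders_analytical = select_analytical_columns(toy)
print(orders_analytical.columns.tolist())
```

Returning a copy leaves the raw frame untouched, which matches the intent of retaining the excluded columns in the raw data.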


In [4]:
orders.shape

(99441, 8)

In [5]:
orders.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99441 entries, 0 to 99440
Data columns (total 8 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   order_id                       99441 non-null  object
 1   customer_id                    99441 non-null  object
 2   order_status                   99441 non-null  object
 3   order_purchase_timestamp       99441 non-null  object
 4   order_approved_at              99281 non-null  object
 5   order_delivered_carrier_date   97658 non-null  object
 6   order_delivered_customer_date  96476 non-null  object
 7   order_estimated_delivery_date  99441 non-null  object
dtypes: object(8)
memory usage: 6.1+ MB


### Missing values assessment (orders table)

The following columns contain missing values:

- `order_approved_at`: 160 missing values (~0.16%)
- `order_delivered_carrier_date`: 1,783 missing values (~1.79%)
- `order_delivered_customer_date`: 2,965 missing values (~2.98%)
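
These counts follow from the non-null counts reported by `orders.info()`, for example 99,441 - 99,281 = 160 for `order_approved_at`. A small sketch of how such a summary might be computed directly is shown here; `null_summary` is a hypothetical helper and the toy frame is illustrative only.

```python
import pandas as pd


def null_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column (hypothetical helper)."""
    n_missing = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": (n_missing / len(df) * 100).round(2),
    })
    # Keep only the columns that actually contain nulls.
    return summary[summary["n_missing"] > 0]


# Toy frame standing in for the orders table (assumption, not real data).
toy = pd.DataFrame({
    "order_id": ["a", "b", "c", "d"],
    "order_approved_at": [
        "2017-10-02 11:07:15",
        None,
        "2018-08-08 08:55:23",
        "2017-11-18 19:45:59",
    ],
})

summary = null_summary(toy)
print(summary)
```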

Each affected column is missing in fewer than 3% of rows. The gaps are likely tied to
orders that were canceled or had not yet been approved, shipped, or delivered; this remains
an assumption until it is checked against `order_status`.
At this stage, rows with missing values are **retained** to avoid prematurely removing
potentially meaningful business events. Handling of these nulls will be informed by
`order_status` during later analysis.

No missing values are present in:
- `order_id`
- `customer_id`
- `order_status`
- `order_purchase_timestamp`
- `order_estimated_delivery_date`
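
Whether the nulls really track order status can be checked by counting missing delivery dates per `order_status` value. The sketch below uses a toy frame with made-up rows, so it shows only the shape of the check, not the actual distribution in the dataset.

```python
import pandas as pd

# Toy frame standing in for the orders table (assumption, not real data).
toy = pd.DataFrame({
    "order_status": ["delivered", "delivered", "canceled", "shipped"],
    "order_delivered_customer_date": [
        "2017-10-10 21:25:13",
        "2018-08-07 15:27:45",
        None,
        None,
    ],
})

# Count missing delivery dates within each order status.
missing_by_status = (
    toy["order_delivered_customer_date"]
    .isna()
    .groupby(toy["order_status"])
    .sum()
    .sort_values(ascending=False)
)
print(missing_by_status)
```

If most nulls fall under statuses other than `delivered`, that supports retaining the rows and filtering by status later.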


### Data type assessment

All columns in the orders table are currently stored as `object` dtype.

The following columns represent timestamps and should be converted to `datetime`:

- `order_purchase_timestamp`
- `order_approved_at`
- `order_delivered_carrier_date`
- `order_delivered_customer_date`
- `order_estimated_delivery_date`

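Converting these columns to datetime is required to support:
- time-based analysis (cohorting, recency, tenure)
- a consistent representation of missing values as `NaT`
- reliable downstream CLV calculations

Data type conversion is performed with `pd.to_datetime(..., errors="coerce")`. Coercion turns
any invalid or malformed value into `NaT` instead of raising an error, so non-null counts
should be compared before and after conversion to confirm that no values were silently lost.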
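
One way to confirm that coercion did not silently discard values is to flag entries that were non-null before parsing but are `NaT` afterwards. The sketch below uses a toy series with one deliberately malformed value; the variable names are illustrative.

```python
import pandas as pd

# Toy column: one valid timestamp, one genuine null, one malformed value.
raw = pd.Series(["2017-10-02 10:56:33", None, "not a date"])

# errors="coerce" maps unparseable values to NaT instead of raising.
parsed = pd.to_datetime(raw, errors="coerce")

# Present before parsing but NaT afterwards: these values were coerced.
coerced = raw.notna() & parsed.isna()
n_coerced = int(coerced.sum())
print(f"values lost to coercion: {n_coerced}")
```

A count of zero for every timestamp column would indicate that every non-null value parsed cleanly.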


In [7]:
datetime_cols = [
    "order_purchase_timestamp",
    "order_approved_at",
    "order_delivered_carrier_date",
    "order_delivered_customer_date",
    "order_estimated_delivery_date",
]

for col in datetime_cols:
    orders[col] = pd.to_datetime(orders[col], errors="coerce")


In [9]:
orders.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99441 entries, 0 to 99440
Data columns (total 8 columns):
 #   Column                         Non-Null Count  Dtype         
---  ------                         --------------  -----         
 0   order_id                       99441 non-null  object        
 1   customer_id                    99441 non-null  object        
 2   order_status                   99441 non-null  object        
 3   order_purchase_timestamp       99441 non-null  datetime64[ns]
 4   order_approved_at              99281 non-null  datetime64[ns]
 5   order_delivered_carrier_date   97658 non-null  datetime64[ns]
 6   order_delivered_customer_date  96476 non-null  datetime64[ns]
 7   order_estimated_delivery_date  99441 non-null  datetime64[ns]
dtypes: datetime64[ns](5), object(3)
memory usage: 6.1+ MB
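
With the timestamps stored as `datetime64[ns]`, the time-based features named earlier (cohort, recency, tenure) become simple to derive. The following is a minimal sketch on toy data, not the analysis itself: the snapshot date is assumed to be the latest purchase in the data, and grouping by `customer_id` is illustrative.

```python
import pandas as pd

# Toy purchases standing in for the converted orders table (assumption).
toy = pd.DataFrame({
    "customer_id": ["c1", "c1", "c2"],
    "order_purchase_timestamp": pd.to_datetime(
        ["2017-10-02 10:56:33", "2018-02-13 21:18:39", "2018-07-24 20:41:37"]
    ),
})

# Snapshot date: an assumption, here the latest purchase observed.
snapshot = toy["order_purchase_timestamp"].max()

# First and last purchase per customer.
per_customer = toy.groupby("customer_id")["order_purchase_timestamp"].agg(
    first_purchase="min", last_purchase="max"
)

# Cohort = calendar month of first purchase.
per_customer["cohort_month"] = per_customer["first_purchase"].dt.to_period("M")
# Recency = whole days since the last purchase; tenure = whole days since the first.
per_customer["recency_days"] = (snapshot - per_customer["last_purchase"]).dt.days
per_customer["tenure_days"] = (snapshot - per_customer["first_purchase"]).dt.days
print(per_customer)
```

None of these operations work on `object` columns, which is why the conversion step comes first.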
