# **E-Commerce Analysis: Customer Churn & Delivery Problems**
An analysis of Brazilian e-commerce transactions made at [Olist Store](www.olist.com).

**Created By:** *Fritz Immanuel & Gerard Louis Howan* (JCDS 2502)

## **> Introduction**

<hr>

[Olist](www.olist.com) is a Brazil-based technology company that provides a platform to help small and medium-sized businesses sell their products on major online marketplaces. Founded in 2015, Olist acts as a bridge between merchants and e-commerce platforms like Amazon, Mercado Libre, and Magalu, enabling sellers to manage their inventory, listings, and orders from a single interface.

By offering logistics, catalog optimization, and customer service tools, Olist simplifies the process of selling online for retailers who might otherwise struggle with the complexity of marketplace requirements. Its goal is to democratize access to e-commerce and boost the visibility of smaller sellers in competitive digital environments.

### **Context**

Olist noticed that something was not right and hired us, a team of data scientists, to take a deeper look. We were given this dataset and tasked with finding problems in the business. We decided to look at how their customers are doing and whether anything can be improved, and the most likely issue is customer churn: an e-commerce platform is nothing without its customers. For this analysis, we define any **customer who has not placed a second order within 6 months** as a **churned customer**.

**Target:**<br>
0 => Staying Customer / non-churn<br>
1 => Leaving Customer / Churn<br>
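The churn rule above can be sketched in pandas as follows. This is a minimal sketch, not the notebook's actual labeling code: it assumes that orders are grouped by `customer_unique_id` and that the 6-month window starts at each customer's first purchase. `label_churn` and the toy data are hypothetical names introduced for illustration.

```python
import pandas as pd


def label_churn(orders: pd.DataFrame, window_months: int = 6) -> pd.Series:
    """Return 1 (churn) or 0 (non-churn) per customer_unique_id.

    Assumption: a customer is non-churn only if a second order was placed
    within `window_months` of the first one.
    """
    ordered = orders.sort_values(["customer_unique_id", "order_purchase_timestamp"])
    # 0 for a customer's first order, 1 for the second, and so on.
    rank = ordered.groupby("customer_unique_id").cumcount()

    first = ordered[rank == 0].set_index("customer_unique_id")["order_purchase_timestamp"]
    second = ordered[rank == 1].set_index("customer_unique_id")["order_purchase_timestamp"]
    # One-time buyers have no second order, so they get NaT here.
    second = second.reindex(first.index)

    deadline = first + pd.DateOffset(months=window_months)
    returned_in_time = second <= deadline  # a NaT comparison evaluates to False
    return (~returned_in_time).astype(int).rename("churn")


# Toy data (hypothetical): A returns after 2 months, B after 8, C never.
toy_orders = pd.DataFrame({
    "customer_unique_id": ["A", "A", "B", "B", "C"],
    "order_purchase_timestamp": pd.to_datetime(
        ["2017-01-01", "2017-03-01", "2017-01-01", "2017-09-01", "2017-02-01"]
    ),
})
labels = label_churn(toy_orders)
print(labels)
```

One caveat with any rule of this kind: customers whose first order falls within the last 6 months of the data cannot be observed for a full window, so they may need to be excluded or handled separately.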

### **Problem Statement**

With the growth of e-commerce, competition has become tougher than ever. There are so many online stores today, and customers can easily find one that fits their needs. This, however, isn’t great news for the platforms that aren’t as popular. To stay ahead, companies have to work hard to provide the best experience and attract more customers.

One way to do this is through big marketing campaigns. While these campaigns can be very expensive, they don’t always give the best results. In fact, companies often spend a lot of money without seeing a clear return. Instead of spending so much on uncertain results, it might make more sense to focus on **keeping** the customers they already have. Why? Because keeping existing customers is often easier than finding new ones. Offering things like discount vouchers or paying attention to customer complaints can help keep customers loyal and encourage them to return.

### **Goals**

To improve customer retention and reduce revenue loss, the company wants to build a model that can **accurately predict which customers are likely to churn**. This capability will allow the business to allocate retention resources more effectively—targeting at-risk customers with tailored offers and interventions. Beyond prediction, the company is also focused on understanding the **underlying reasons behind churn**, so it can improve the overall customer experience and make data-driven enhancements to its platform.

**Our goal is to answer the following key questions:**

* How do delivery problems—such as late shipments or inconsistent delivery times—affect customer churn?
* Are there geolocation patterns (e.g., by city, state, or zip code) associated with poor delivery experiences and higher churn rates?
* What customer or product-level attributes are most strongly correlated with an increased risk of churn?

By answering these questions, the company can take both **proactive and strategic actions** to improve service quality and customer satisfaction.


### **Analytic Approach**

We will analyze the data for patterns that may help us identify the factors behind customer churn. We will then build a classification model to help us and the company determine which customers are more likely to churn.

### **Metric Evaluation**

![Confusion Matrix](./images/Confusion%20Matrix.png)

There are two kinds of errors, **False Positive** and **False Negative**, each with its own drawbacks.

#### **False Positive (FP)**<br>
The company spends resources on customers who are staying (not churning) because the model incorrectly predicts that they are leaving (churning).

#### **False Negative (FN)**<br>
The company loses a customer who actually churns, but the model fails to identify them as at risk, so little-to-no retention efforts are made.

#### **Cost Assumptions**<br>
To understand the scope of the financial loss for each type of error, we assume some potential costs. For **False Positives**, companies in most cases prioritize potential churners, spending extra resources to retain them with benefits such as discount vouchers. We assume that the company spends **R\$150/customer/month** on these efforts.

For **False Negatives**, the cost is typically higher because the company fails to act, resulting in the actual loss of a customer. For an e-commerce company like Olist, which operates a marketplace model supporting small and medium businesses, the value of a single active seller or buyer can be significant. With an approximate average of R\$800/customer/month in gross revenue, the company probably takes about 15-20% as net revenue, or roughly R\$120-160. Based on that, we assume an estimated loss of **R\$150/customer/month**.

#### **Evaluation Metric**<br>
Although both **False Positives** and **False Negatives** are estimated at R\$150 per customer per month, the nature of these costs is fundamentally different. **False Positive costs are potential and often inflated**: not all flagged customers will fully use retention offers such as discounts or perks, so the actual expense may be lower than estimated. **False Negatives**, on the other hand, **represent a concrete and irreversible loss**, since a churned customer translates directly into lost revenue and lost long-term value. Because of this asymmetry, and because the classes are imbalanced, accuracy alone can be misleading: a model can score well simply by predicting the majority class. A more suitable evaluation metric in this context is the **F1 Score**, the harmonic mean of precision and recall, which rewards the model for identifying true churners while limiting unnecessary retention efforts.
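To make the trade-off concrete, the sketch below computes precision, recall, F1, and the assumed monthly error cost from confusion-matrix counts. The counts are made-up illustration values, not results from this analysis, and `f1_and_cost` is a hypothetical helper. The R\$150 defaults are the cost assumptions stated above.

```python
def f1_and_cost(tp: int, fp: int, fn: int,
                fp_cost: float = 150.0, fn_cost: float = 150.0) -> dict:
    """Compute precision, recall, F1 and the assumed monthly error cost (R$)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "monthly_cost": fp * fp_cost + fn * fn_cost,
    }


# Hypothetical counts, for illustration only.
metrics = f1_and_cost(tp=80, fp=20, fn=40)
print(metrics)
```

Passing different `fp_cost` and `fn_cost` values is a quick way to see how sensitive the total cost is to the two assumptions.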


## **> Data Understanding & Cleaning**
<hr>

### **Data Source & Structure**

Dataset source: [Kaggle - Brazilian E-Commerce](https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce)

The dataset consists of 9 tables. Seven of them are described here; the customers and orders tables are described in their own sections further down.

<br>

**`olist_order_items_dataset.csv` (112,650 rows)**

| Column Name           | Data Type        | Description                                                |
| --------------------- | ---------------- | ---------------------------------------------------------- |
| `order_id`            | `object`/`string`         | Identifier linking to the order.                           |
| `order_item_id`       | `int64`          | Sequential number identifying items within the same order. |
| `product_id`          | `object`/`string`         | Identifier for the purchased product.                      |
| `seller_id`           | `object`/`string`         | Identifier for the seller of the product.                  |
| `shipping_limit_date` | `datetime64` | Latest date the seller should ship the item.               |
| `price`               | `float64`        | Price paid for the item.                                   |
| `freight_value`       | `float64`        | Shipping cost charged for the item.                        |

<br>

**`olist_order_payments_dataset.csv` (103,886 rows)**

| Column Name            | Data Type | Description                                                         |
| ---------------------- | --------- | ------------------------------------------------------------------- |
| `order_id`             | `object`/`string`  | Identifier linking to the order.                                    |
| `payment_sequential`   | `int64`   | Sequential number identifying multiple payments for the same order. |
| `payment_type`         | `object`/`string`  | Payment method used (e.g., credit card, boleto).                    |
| `payment_installments` | `int64`   | Number of installments for the payment.                             |
| `payment_value`        | `float64` | Total amount paid in the transaction.                               |

<br>

**`olist_order_reviews_dataset.csv` (100,000 rows)**

| Column Name               | Data Type        | Description                                           |
| ------------------------- | ---------------- | ----------------------------------------------------- |
| `review_id`               | `object`/`string`         | Unique identifier for each review.                    |
| `order_id`                | `object`/`string`         | Identifier linking to the order.                      |
| `review_score`            | `int64`          | Score given by the customer (1 to 5).                 |
| `review_comment_title`    | `object`/`string`         | Title of the review comment.                          |
| `review_comment_message`  | `object`/`string`         | Content of the review comment.                        |
| `review_creation_date`    | `datetime64` | Date when the review was created.                     |
| `review_answer_timestamp` | `datetime64` | Timestamp when the review was answered by the seller. |

<br>

**`olist_products_dataset.csv` (32,951 rows)**

| Column Name                  | Data Type | Description                                   |
| ---------------------------- | --------- | --------------------------------------------- |
| `product_id`                 | `object`/`string`  | Unique identifier for each product.           |
| `product_category_name`      | `object`/`string`  | Category of the product (in Portuguese).      |
| `product_name_lenght`        | `float64` | Length of the product name.                   |
| `product_description_lenght` | `float64` | Length of the product description.            |
| `product_photos_qty`         | `float64` | Number of photos associated with the product. |
| `product_weight_g`           | `float64` | Weight of the product in grams.               |
| `product_length_cm`          | `float64` | Length of the product package in centimeters. |
| `product_height_cm`          | `float64` | Height of the product package in centimeters. |
| `product_width_cm`           | `float64` | Width of the product package in centimeters.  |

<br>

**`olist_sellers_dataset.csv` (3,095 rows)**

| Column Name              | Data Type | Description                                 |
| ------------------------ | --------- | ------------------------------------------- |
| `seller_id`              | `object`/`string`  | Unique identifier for each seller.          |
| `seller_zip_code_prefix` | `int64`   | First five digits of the seller's zip code. |
| `seller_city`            | `object`/`string`  | City where the seller is located.           |
| `seller_state`           | `object`/`string`  | State where the seller is located.          |

<br>

**`olist_geolocation_dataset.csv` (1,000,016 rows)**

| Column Name                   | Data Type | Description                          |
| ----------------------------- | --------- | ------------------------------------ |
| `geolocation_zip_code_prefix` | `int64`   | First five digits of the zip code.   |
| `geolocation_lat`             | `float64` | Latitude coordinate.                 |
| `geolocation_lng`             | `float64` | Longitude coordinate.                |
| `geolocation_city`            | `object`/`string`  | City corresponding to the zip code.  |
| `geolocation_state`           | `object`/`string`  | State corresponding to the zip code. |

<br>

**`product_category_name_translation.csv` (71 rows)**

| Column Name                     | Data Type | Description                          |
| ------------------------------- | --------- | ------------------------------------ |
| `product_category_name`         | `object`/`string`  | Product category name in Portuguese. |
| `product_category_name_english` | `object`/`string`  | Product category name in English.    |



### **Context**

- Dataset is **imbalanced**
- Dataset contains **high-cardinality categorical features**

### **Import Libraries**

```python
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

import folium

import warnings
warnings.filterwarnings("ignore")

pd.set_option('display.max_columns', None)  # display all columns, without limits
pd.set_option('display.max_colwidth', None)  # do not truncate long cell values
```

### **Load Datasets**

```python
# load dataset
customers_df = pd.read_csv('dataset/raw/olist_customers_dataset.csv')

orders_df = pd.read_csv('dataset/raw/olist_orders_dataset.csv')
order_items_df = pd.read_csv('dataset/raw/olist_order_items_dataset.csv')
order_payments_df = pd.read_csv('dataset/raw/olist_order_payments_dataset.csv')
order_reviews_df = pd.read_csv('dataset/raw/olist_order_reviews_dataset.csv')

products_df = pd.read_csv('dataset/raw/olist_products_dataset.csv')
pcateg_translation_df = pd.read_csv('dataset/raw/product_category_name_translation.csv')

sellers_df = pd.read_csv('dataset/raw/olist_sellers_dataset.csv')

geolocation_df = pd.read_csv('dataset/raw/olist_geolocation_dataset.csv')
```

### **Function Library**

```python
def showUniqueValues(source: pd.DataFrame, limit: int) -> pd.DataFrame:
    """Summarize the unique values of every column in `source`.

    If `limit` > 0, only columns with at most `limit` unique values are kept.
    """
    rows = [
        [col, source[col].nunique(), source[col].sort_values().unique()]
        for col in source.columns
    ]
    summary = pd.DataFrame(rows, columns=['Column Name', 'Number of Unique', 'Unique Sample'])

    # Drop columns whose cardinality exceeds the limit (e.g. ID columns).
    if limit > 0:
        summary = summary[summary['Number of Unique'] <= limit]
    return summary.sort_values('Number of Unique', ascending=False)
```

### **Consolidate Tables (Optional)**

```python
# translate product categories
final_products_df = products_df.merge(pcateg_translation_df, on='product_category_name', how='left')
final_products_df = final_products_df.drop(columns=['product_category_name'])
final_products_df = final_products_df.rename(columns={'product_category_name_english': 'product_category_name'})
final_products_df
```

|       | product_id                       | product_name_lenght | product_description_lenght | product_photos_qty | product_weight_g | product_length_cm | product_height_cm | product_width_cm | product_category_name     |
| ----- | -------------------------------- | ------------------- | -------------------------- | ------------------ | ---------------- | ----------------- | ----------------- | ---------------- | ------------------------- |
| 0     | 1e9e8ef04dbcff4541ed26657ea517e5 | 40.0                | 287.0                      | 1.0                | 225.0            | 16.0              | 10.0              | 14.0             | perfumery                 |
| 1     | 3aa071139cb16b67ca9e5dea641aaa2f | 44.0                | 276.0                      | 1.0                | 1000.0           | 30.0              | 18.0              | 20.0             | art                       |
| 2     | 96bd76ec8810374ed1b65e291975717f | 46.0                | 250.0                      | 1.0                | 154.0            | 18.0              | 9.0               | 15.0             | sports_leisure            |
| 3     | cef67bcfe19066a932b7673e239eb23d | 27.0                | 261.0                      | 1.0                | 371.0            | 26.0              | 4.0               | 26.0             | baby                      |
| 4     | 9dc1a7de274444849c219cff195d0b71 | 37.0                | 402.0                      | 4.0                | 625.0            | 20.0              | 17.0              | 13.0             | housewares                |
| ...   | ...                              | ...                 | ...                        | ...                | ...              | ...               | ...               | ...              | ...                       |
| 32946 | a0b7d5a992ccda646f2d34e418fff5a0 | 45.0                | 67.0                       | 2.0                | 12300.0          | 40.0              | 40.0              | 40.0             | furniture_decor           |
| 32947 | bf4538d88321d0fd4412a93c974510e6 | 41.0                | 971.0                      | 1.0                | 1700.0           | 16.0              | 19.0              | 16.0             | construction_tools_lights |
| 32948 | 9a7c6041fa9592d9d9ef6cfe62a71f8c | 50.0                | 799.0                      | 1.0                | 1400.0           | 27.0              | 7.0               | 27.0             | bed_bath_table            |
| 32949 | 83808703fc0706a22e264b9d75f04a2e | 60.0                | 156.0                      | 2.0                | 700.0            | 31.0              | 13.0              | 20.0             | computers_accessories     |
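One thing to watch in the translation step: with a left merge, any category missing from the translation table ends up as `NaN` once the Portuguese column is dropped. The sketch below shows one possible safeguard, falling back to the original name. The toy frames and the `untranslated_category` value are hypothetical; whether such gaps exist in the real files should be checked first.

```python
import pandas as pd

# Toy stand-ins for products_df and pcateg_translation_df (hypothetical rows).
products = pd.DataFrame({
    "product_id": ["p1", "p2", "p3"],
    "product_category_name": ["perfumaria", "artes", "untranslated_category"],
})
translation = pd.DataFrame({
    "product_category_name": ["perfumaria", "artes"],
    "product_category_name_english": ["perfumery", "art"],
})

merged = products.merge(translation, on="product_category_name", how="left")
# Count rows that found no English match before overwriting anything.
n_missing = merged["product_category_name_english"].isna().sum()

# Fall back to the Portuguese name where no translation exists.
merged["product_category_name"] = (
    merged["product_category_name_english"].fillna(merged["product_category_name"])
)
merged = merged.drop(columns=["product_category_name_english"])
print(n_missing, merged["product_category_name"].tolist())
```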


```python
# order related data
final_order_df = orders_df.merge(order_items_df, on='order_id', how='left')
final_order_df = final_order_df.merge(order_payments_df, on='order_id', how='left')
# final_order_df = final_order_df.merge(order_reviews_df, on='order_id', how='left')
final_order_df = final_order_df.merge(final_products_df, on='product_id', how='left')
final_order_df
```

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date,order_item_id,product_id,seller_id,shipping_limit_date,price,freight_value,payment_sequential,payment_type,payment_installments,payment_value,product_name_lenght,product_description_lenght,product_photos_qty,product_weight_g,product_length_cm,product_height_cm,product_width_cm,product_category_name
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18 00:00:00,1.0,87285b34884572647811a353c7ac498a,3504c0cb71d7fa48d967e0e4c94d59d9,2017-10-06 11:07:15,29.99,8.72,1.0,credit_card,1.0,18.12,40.0,268.0,4.0,500.0,19.0,8.0,13.0,housewares
1,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18 00:00:00,1.0,87285b34884572647811a353c7ac498a,3504c0cb71d7fa48d967e0e4c94d59d9,2017-10-06 11:07:15,29.99,8.72,3.0,voucher,1.0,2.00,40.0,268.0,4.0,500.0,19.0,8.0,13.0,housewares
2,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18 00:00:00,1.0,87285b34884572647811a353c7ac498a,3504c0cb71d7fa48d967e0e4c94d59d9,2017-10-06 11:07:15,29.99,8.72,2.0,voucher,1.0,18.59,40.0,268.0,4.0,500.0,19.0,8.0,13.0,housewares
3,53cdb2fc8bc7dce0b6741e2150273451,b0830fb4747a6c6d20dea0b8c802d7ef,delivered,2018-07-24 20:41:37,2018-07-26 03:24:27,2018-07-26 14:31:00,2018-08-07 15:27:45,2018-08-13 00:00:00,1.0,595fac2a385ac33a80bd5114aec74eb8,289cdb325fb7e7f891c38608bf9e0962,2018-07-30 03:24:27,118.70,22.76,1.0,boleto,1.0,141.46,29.0,178.0,1.0,400.0,19.0,13.0,19.0,perfumery
4,47770eb9100c2d0c44946d9cf07ec65d,41ce2a54c0b03bf3443c3d931a367089,delivered,2018-08-08 08:38:49,2018-08-08 08:55:23,2018-08-08 13:50:00,2018-08-17 18:06:29,2018-09-04 00:00:00,1.0,aa4383b373c6aca5d8797843e5594415,4869f7a5dfa277a7dca6462dcf3b52b2,2018-08-13 08:55:23,159.90,19.22,1.0,credit_card,3.0,179.12,46.0,232.0,1.0,420.0,24.0,19.0,21.0,auto
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
118429,63943bddc261676b46f01ca7ac2f7bd8,1fca14ff2861355f6e5f14306ff977a7,delivered,2018-02-06 12:58:58,2018-02-06 13:10:37,2018-02-07 23:22:42,2018-02-28 17:37:56,2018-03-02 00:00:00,1.0,f1d4ce8c6dd66c47bbaa8c6781c2a923,1f9ab4708f3056ede07124aad39a2554,2018-02-12 13:10:37,174.90,20.10,1.0,credit_card,3.0,195.00,52.0,828.0,4.0,4950.0,40.0,10.0,40.0,baby
118430,83c1379a015df1e13d02aae0204711ab,1aa71eb042121263aafbe80c1b562c9c,delivered,2017-08-27 14:46:43,2017-08-27 15:04:16,2017-08-28 20:52:26,2017-09-21 11:24:17,2017-09-27 00:00:00,1.0,b80910977a37536adeddd63663f916ad,d50d79cb34e38265a8649c383dcffd48,2017-09-05 15:04:16,205.99,65.02,1.0,credit_card,5.0,271.01,51.0,500.0,2.0,13300.0,32.0,90.0,22.0,home_appliances_2
118431,11c177c8e97725db2631073c19f07b62,b331b74b18dc79bcdf6532d51e1637c1,delivered,2018-01-08 21:28:27,2018-01-08 21:36:21,2018-01-12 15:35:03,2018-01-25 23:32:54,2018-02-15 00:00:00,1.0,d1c427060a0f73f6b889a5c7c61f2ac4,a1043bafd471dff536d0c462352beb48,2018-01-12 21:36:21,179.99,40.59,1.0,credit_card,4.0,441.16,59.0,1893.0,1.0,6550.0,20.0,20.0,20.0,computers_accessories
118432,11c177c8e97725db2631073c19f07b62,b331b74b18dc79bcdf6532d51e1637c1,delivered,2018-01-08 21:28:27,2018-01-08 21:36:21,2018-01-12 15:35:03,2018-01-25 23:32:54,2018-02-15 00:00:00,2.0,d1c427060a0f73f6b889a5c7c61f2ac4,a1043bafd471dff536d0c462352beb48,2018-01-12 21:36:21,179.99,40.59,1.0,credit_card,4.0,441.16,59.0,1893.0,1.0,6550.0,20.0,20.0,20.0,computers_accessories
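The first three rows of the output above repeat the same order item once per payment record (`payment_sequential` 1, 3, and 2), which means a later sum over `price` or `freight_value` would count that item three times. One possible way to avoid this, sketched below on toy data, is to collapse payments to one row per order before merging. The aggregated column names (`total_payment`, `n_payments`) are assumptions made for this illustration, not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for order_items_df and order_payments_df.
items = pd.DataFrame({
    "order_id": ["o1", "o2"],
    "order_item_id": [1, 1],
    "price": [29.99, 118.70],
})
payments = pd.DataFrame({
    "order_id": ["o1", "o1", "o1", "o2"],
    "payment_sequential": [1, 2, 3, 1],
    "payment_value": [18.12, 18.59, 2.00, 141.46],
})

# Naive merge: the single item of o1 is repeated once per payment row.
naive = items.merge(payments, on="order_id", how="left")

# Collapse payments to one row per order first.
payments_per_order = payments.groupby("order_id", as_index=False).agg(
    total_payment=("payment_value", "sum"),
    n_payments=("payment_sequential", "count"),
)
# validate="many_to_one" raises a MergeError if the right side still has
# duplicate order_ids, so accidental fan-out cannot go unnoticed.
safe = items.merge(payments_per_order, on="order_id", how="left",
                   validate="many_to_one")
print(len(naive), len(safe))
```

The same idea applies to any one-to-many table joined onto order items: aggregate to the join key first, then merge.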


```python
final_order_df.info()
```

```text
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 118434 entries, 0 to 118433
Data columns (total 26 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   order_id                       118434 non-null  object 
 1   customer_id                    118434 non-null  object 
 2   order_status                   118434 non-null  object 
 3   order_purchase_timestamp       118434 non-null  object 
 4   order_approved_at              118258 non-null  object 
 5   order_delivered_carrier_date   116360 non-null  object 
 6   order_delivered_customer_date  115037 non-null  object 
 7   order_estimated_delivery_date  118434 non-null  object 
 8   order_item_id                  117604 non-null  float64
 9   product_id                     117604 non-null  object 
 10  seller_id                      117604 non-null  object 
 11  shipping_limit_date            117604 non-null  object 
 12  price                         
```

### **Customers Dataset**

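This table links each order-level `customer_id` to a `customer_unique_id` and to the customer's location (zip code prefix, city, and state).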

**`olist_customers_dataset.csv` (99,441 rows)**

| Column Name                | Data Type | Description                                                 |
| -------------------------- | --------- | ----------------------------------------------------------- |
| `customer_id`              | `object`/`string`  | Key to the orders table; each order is assigned its own `customer_id`. |
| `customer_unique_id`       | `object`/`string`  | Unique identifier for each customer across multiple orders. |
| `customer_zip_code_prefix` | `int64`   | First five digits of the customer's zip code.               |
| `customer_city`            | `object`/`string`  | City where the customer is located.                         |
| `customer_state`           | `object`/`string`  | State where the customer is located.                        |

#### **Unique Values**

```python
showUniqueValues(customers_df, 14994)  # the limit hides the '_id' columns, which are not informative here
```

Unnamed: 0,Column Name,Number of Unique,Unique Sample
2,customer_zip_code_prefix,14994,"[1003, 1004, 1005, 1006, 1007, 1008, 1009, 1011, 1012, 1013, 1014, 1015, 1016, 1017, 1019, 1020, 1021, 1022, 1023, 1024, 1025, 1026, 1027, 1030, 1031, 1032, 1033, 1035, 1036, 1037, 1038, 1041, 1042, 1043, 1044, 1045, 1046, 1047, 1048, 1049, 1050, 1102, 1103, 1105, 1106, 1107, 1108, 1120, 1121, 1122, 1123, 1124, 1125, 1127, 1129, 1131, 1132, 1134, 1135, 1136, 1137, 1138, 1139, 1140, 1141, 1144, 1150, 1151, 1152, 1153, 1154, 1155, 1156, 1201, 1202, 1203, 1204, 1205, 1206, 1207, 1208, 1209, 1210, 1211, 1213, 1214, 1215, 1216, 1217, 1218, 1219, 1220, 1221, 1222, 1223, 1224, 1225, 1226, 1227, 1228, ...]"
3,customer_city,4119,"[abadia dos dourados, abadiania, abaete, abaetetuba, abaiara, abaira, abare, abatia, abdon batista, abelardo luz, abrantes, abre campo, abreu e lima, acaiaca, acailandia, acajutiba, acarau, acari, acegua, acopiara, acreuna, acu, acucena, adamantina, adhemar de barros, adolfo, adrianopolis, adustina, afogados da ingazeira, afonso claudio, afranio, agisse, agrestina, agrolandia, agronomica, agua boa, agua branca, agua clara, agua comprida, agua doce, agua doce do norte, agua fria de goias, agua limpa, agua nova, agua preta, agua santa, aguai, aguas belas, aguas claras, aguas da prata, aguas de lindoia, aguas de santa barbara, aguas de sao pedro, aguas formosas, aguas frias, aguas lindas de goias, aguas mornas, aguas vermelhas, agudo, agudos, aguia branca, aimores, aiuaba, aiuruoca, ajapi, ajuricaba, alagoa, alagoa grande, alagoa nova, alagoinha, alagoinhas, alambari, alcantara, alcinopolis, alcobaca, alegre, alegrete, alegrete do piaui, alegria, alem paraiba, alexandra, alexandria, alexandrita, alexania, alfenas, alfredo chaves, alfredo marcondes, alfredo vasconcelos, alfredo wagner, alhandra, alianca, alianca do tocantins, almas, almenara, almino afonso, almirante tamandare, almirante tamandare do sul, alpercata, alpestre, alpinopolis, ...]"
4,customer_state,27,"[AC, AL, AM, AP, BA, CE, DF, ES, GO, MA, MG, MS, MT, PA, PB, PE, PI, PR, RJ, RN, RO, RR, RS, SC, SE, SP, TO]"


#### **Missing Values**

```python
customers_df.isna().sum()
```

```text
customer_id                 0
customer_unique_id          0
customer_zip_code_prefix    0
customer_city               0
customer_state              0
dtype: int64
```

No missing values (NaNs) need to be handled.

#### **Standardize Text-Case**

```python
customers_df['customer_city'] = customers_df['customer_city'].str.lower()
```

#### **Duplicates**

```python
customers_df.duplicated().sum()
```

```text
0
```

No duplicates found, no action needed.

### **Orders Dataset**

This table holds one row per order, with its status and the timestamps of each step from purchase to delivery.

**`olist_orders_dataset.csv` (99,441 rows)**

| Column Name                     | Data Type        | Description                                                       |
| ------------------------------- | ---------------- | ----------------------------------------------------------------- |
| `order_id`                      | `object`/`string`         | Unique identifier for each order.                                 |
| `customer_id`                   | `object`/`string`         | Unique identifier for the customer who placed the order.          |
| `order_status`                  | `object`/`string`         | Current status of the order (e.g., delivered, shipped, canceled). |
| `order_purchase_timestamp`      | `object`/`string` | Timestamp when the order was placed.                              |
| `order_approved_at`             | `object`/`string` | Timestamp when the order was approved.                            |
| `order_delivered_carrier_date`  | `object`/`string` | Timestamp when the order was handed over to the carrier.          |
| `order_delivered_customer_date` | `object`/`string` | Timestamp when the order was delivered to the customer.           |
| `order_estimated_delivery_date` | `object`/`string` | Estimated delivery date for the order.                            |
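Because these timestamp columns are read as plain strings, they have to be converted before any delivery-time analysis. The sketch below shows one way to do that and to derive a lateness flag. The `delay_days` and `is_late` column names are hypothetical, and the three sample rows are toy values chosen for illustration.

```python
import pandas as pd

# Toy stand-in for orders_df: one early, one late, one undelivered order.
orders_sample = pd.DataFrame({
    "order_id": ["o1", "o2", "o3"],
    "order_delivered_customer_date": ["2017-10-10 21:25:13", "2018-08-20 09:00:00", None],
    "order_estimated_delivery_date": ["2017-10-18 00:00:00", "2018-08-13 00:00:00", "2018-09-04 00:00:00"],
})

date_cols = ["order_delivered_customer_date", "order_estimated_delivery_date"]
for col in date_cols:
    # errors="coerce" turns unparseable values into NaT instead of raising.
    orders_sample[col] = pd.to_datetime(orders_sample[col], errors="coerce")

# Positive values mean the order arrived after the estimated date.
delay = (orders_sample["order_delivered_customer_date"]
         - orders_sample["order_estimated_delivery_date"])
orders_sample["delay_days"] = delay.dt.total_seconds() / 86400
# Undelivered orders have a NaN delay and are flagged as not late here.
orders_sample["is_late"] = orders_sample["delay_days"] > 0
print(orders_sample[["order_id", "delay_days", "is_late"]])
```

Treating undelivered orders as "not late" is only a convenience of this sketch; in the real analysis they probably deserve their own category.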

#### **Unique Values**

```python
showUniqueValues(orders_df, 98875)
```

Unnamed: 0,Column Name,Number of Unique,Unique Sample
3,order_purchase_timestamp,98875,"[2016-09-04 21:15:19, 2016-09-05 00:15:34, 2016-09-13 15:24:19, 2016-09-15 12:16:38, 2016-10-02 22:07:52, 2016-10-03 09:44:50, 2016-10-03 16:56:50, 2016-10-03 21:01:41, 2016-10-03 21:13:36, 2016-10-03 22:06:03, 2016-10-03 22:31:31, 2016-10-03 22:44:10, 2016-10-03 22:51:30, 2016-10-04 09:06:10, 2016-10-04 09:16:33, 2016-10-04 09:59:03, 2016-10-04 10:05:45, 2016-10-04 10:16:04, 2016-10-04 10:41:17, 2016-10-04 11:03:14, 2016-10-04 11:44:01, 2016-10-04 12:06:11, 2016-10-04 12:53:17, 2016-10-04 13:02:10, 2016-10-04 13:11:29, 2016-10-04 13:15:46, 2016-10-04 13:15:52, 2016-10-04 13:16:57, 2016-10-04 13:22:56, 2016-10-04 13:30:13, 2016-10-04 13:38:37, 2016-10-04 13:40:43, 2016-10-04 13:45:08, 2016-10-04 14:13:22, 2016-10-04 14:23:50, 2016-10-04 14:44:33, 2016-10-04 14:49:13, 2016-10-04 14:51:15, 2016-10-04 15:02:37, 2016-10-04 15:02:53, 2016-10-04 15:07:21, 2016-10-04 15:10:15, 2016-10-04 15:12:18, 2016-10-04 15:35:35, 2016-10-04 15:44:44, 2016-10-04 16:02:18, 2016-10-04 16:05:29, 2016-10-04 16:08:52, 2016-10-04 16:28:25, 2016-10-04 16:40:07, 2016-10-04 16:41:59, 2016-10-04 17:08:39, 2016-10-04 17:25:09, 2016-10-04 18:02:37, 2016-10-04 18:07:24, 2016-10-04 18:52:56, 2016-10-04 19:01:13, 2016-10-04 19:25:29, 2016-10-04 19:30:28, 2016-10-04 19:41:32, 2016-10-04 19:41:52, 2016-10-04 20:41:45, 2016-10-04 20:41:55, 2016-10-04 21:25:32, 2016-10-04 21:32:35, 2016-10-04 21:35:54, 2016-10-04 21:54:52, 2016-10-04 22:03:46, 2016-10-04 22:15:11, 2016-10-04 22:33:44, 2016-10-04 22:43:33, 2016-10-04 23:08:08, 2016-10-04 23:26:51, 2016-10-04 23:39:01, 2016-10-04 23:45:54, 2016-10-04 23:59:01, 2016-10-05 00:32:31, 2016-10-05 01:47:40, 2016-10-05 07:31:49, 2016-10-05 08:01:47, 2016-10-05 08:04:21, 2016-10-05 08:41:18, 2016-10-05 08:46:55, 2016-10-05 09:19:46, 2016-10-05 09:46:43, 2016-10-05 11:01:40, 2016-10-05 11:05:52, 2016-10-05 11:08:46, 2016-10-05 11:23:13, 2016-10-05 11:44:41, 2016-10-05 11:54:15, 2016-10-05 12:32:55, 2016-10-05 12:34:04, 2016-10-05 
12:41:38, 2016-10-05 12:44:09, 2016-10-05 13:12:43, 2016-10-05 13:22:20, 2016-10-05 14:16:28, 2016-10-05 14:36:55, 2016-10-05 14:40:44, ...]"
6,order_delivered_customer_date,95664,"[2016-10-11 13:46:32, 2016-10-11 14:46:49, 2016-10-13 03:10:34, 2016-10-13 07:45:48, 2016-10-13 15:44:27, 2016-10-13 15:44:57, 2016-10-13 15:45:44, 2016-10-13 15:49:48, 2016-10-13 15:56:11, 2016-10-13 15:56:28, 2016-10-13 16:00:43, 2016-10-13 16:03:06, 2016-10-13 16:03:33, 2016-10-13 16:03:46, 2016-10-13 16:51:46, 2016-10-13 19:31:39, 2016-10-14 02:49:22, 2016-10-14 03:10:07, 2016-10-14 08:29:50, 2016-10-14 09:09:13, 2016-10-14 10:16:04, 2016-10-14 11:03:10, 2016-10-14 12:13:52, 2016-10-14 12:14:57, 2016-10-14 12:15:24, 2016-10-14 15:07:11, 2016-10-14 15:59:10, 2016-10-14 15:59:26, 2016-10-14 16:08:00, 2016-10-14 19:28:40, 2016-10-14 19:29:13, 2016-10-14 22:15:33, 2016-10-15 01:01:29, 2016-10-15 03:51:25, 2016-10-15 04:17:21, 2016-10-15 05:02:06, 2016-10-15 05:02:16, 2016-10-15 11:00:25, 2016-10-15 11:02:24, 2016-10-15 13:19:54, 2016-10-15 13:22:13, 2016-10-15 13:42:50, 2016-10-15 15:09:06, 2016-10-15 16:34:44, 2016-10-15 18:32:03, 2016-10-15 18:34:07, 2016-10-15 18:54:23, 2016-10-15 20:38:26, 2016-10-15 22:02:14, 2016-10-16 10:41:50, 2016-10-16 14:36:00, 2016-10-16 14:36:59, 2016-10-16 14:57:02, 2016-10-16 15:35:21, 2016-10-16 15:55:15, 2016-10-16 16:57:14, 2016-10-16 17:51:52, 2016-10-17 02:55:39, 2016-10-17 11:25:59, 2016-10-17 12:03:07, 2016-10-17 12:03:19, 2016-10-17 12:03:34, 2016-10-17 12:03:38, 2016-10-17 13:02:12, 2016-10-17 13:02:21, 2016-10-17 13:02:46, 2016-10-17 14:01:34, 2016-10-17 15:36:53, 2016-10-17 15:42:00, 2016-10-17 16:47:46, 2016-10-17 17:43:18, 2016-10-17 17:43:19, 2016-10-17 18:39:46, 2016-10-17 19:08:17, 2016-10-17 19:29:03, 2016-10-17 19:31:23, 2016-10-17 19:41:12, 2016-10-17 20:24:25, 2016-10-18 01:37:54, 2016-10-18 05:56:37, 2016-10-18 06:02:45, 2016-10-18 06:03:07, 2016-10-18 09:57:48, 2016-10-18 13:24:17, 2016-10-18 17:13:12, 2016-10-18 17:21:51, 2016-10-18 18:04:27, 2016-10-18 18:34:50, 2016-10-18 19:17:36, 2016-10-18 20:14:38, 2016-10-18 20:23:49, 2016-10-18 20:37:33, 2016-10-18 22:35:47, 
2016-10-18 22:57:42, 2016-10-19 00:47:36, 2016-10-19 04:01:38, 2016-10-19 04:01:59, 2016-10-19 11:47:52, 2016-10-19 17:31:37, 2016-10-19 18:47:43, ...]"
4,order_approved_at,90733,"[2016-09-15 12:16:38, 2016-10-04 09:43:32, 2016-10-04 10:18:57, 2016-10-04 10:19:23, 2016-10-04 10:25:46, 2016-10-04 10:26:40, 2016-10-04 10:28:07, 2016-10-04 10:28:19, 2016-10-04 10:28:25, 2016-10-04 10:45:33, 2016-10-04 11:06:07, 2016-10-04 12:25:20, 2016-10-04 13:26:11, 2016-10-04 13:46:31, 2016-10-04 13:47:04, 2016-10-04 13:47:06, 2016-10-04 13:47:45, 2016-10-04 14:08:38, 2016-10-04 14:09:08, 2016-10-04 14:27:49, 2016-10-04 14:46:48, 2016-10-05 02:44:29, 2016-10-05 02:45:16, 2016-10-05 02:46:17, 2016-10-05 03:08:27, 2016-10-05 03:10:31, 2016-10-05 03:10:59, 2016-10-05 03:11:34, 2016-10-05 03:11:49, 2016-10-05 03:45:41, 2016-10-05 08:45:09, 2016-10-05 17:06:51, 2016-10-06 02:46:24, 2016-10-06 02:46:32, 2016-10-06 03:07:51, 2016-10-06 03:10:33, 2016-10-06 03:10:59, 2016-10-06 07:45:47, 2016-10-06 07:46:39, 2016-10-06 07:46:47, 2016-10-06 11:43:20, 2016-10-06 14:21:55, 2016-10-06 14:22:12, 2016-10-06 14:22:19, 2016-10-06 15:41:47, 2016-10-06 15:43:49, 2016-10-06 15:43:50, 2016-10-06 15:44:00, 2016-10-06 15:44:09, 2016-10-06 15:44:26, 2016-10-06 15:44:28, 2016-10-06 15:44:36, 2016-10-06 15:44:56, 2016-10-06 15:45:18, 2016-10-06 15:45:42, 2016-10-06 15:45:52, 2016-10-06 15:46:02, 2016-10-06 15:46:26, 2016-10-06 15:46:29, 2016-10-06 15:47:17, 2016-10-06 15:47:29, 2016-10-06 15:49:04, 2016-10-06 15:49:11, 2016-10-06 15:49:47, 2016-10-06 15:50:14, 2016-10-06 15:50:35, 2016-10-06 15:50:54, 2016-10-06 15:50:56, 2016-10-06 15:51:05, 2016-10-06 15:51:13, 2016-10-06 15:51:26, 2016-10-06 15:51:36, 2016-10-06 15:51:37, 2016-10-06 15:51:38, 2016-10-06 15:51:42, 2016-10-06 15:51:51, 2016-10-06 15:52:09, 2016-10-06 15:52:44, 2016-10-06 15:52:49, 2016-10-06 15:53:06, 2016-10-06 15:53:12, 2016-10-06 15:53:38, 2016-10-06 15:53:39, 2016-10-06 15:54:47, 2016-10-06 15:54:57, 2016-10-06 15:55:39, 2016-10-06 15:55:40, 2016-10-06 15:55:55, 2016-10-06 15:56:10, 2016-10-06 15:56:27, 2016-10-06 15:56:40, 2016-10-06 15:56:43, 2016-10-06 15:56:49, 2016-10-06 
15:57:05, 2016-10-06 15:57:10, 2016-10-06 15:57:13, 2016-10-06 15:57:38, 2016-10-06 15:57:59, 2016-10-06 15:58:16, 2016-10-06 15:58:44, ...]"
5,order_delivered_carrier_date,81018,"[2016-10-08 10:34:01, 2016-10-08 13:46:32, 2016-10-08 14:46:49, 2016-10-09 02:45:17, 2016-10-09 03:45:42, 2016-10-09 17:06:52, 2016-10-10 02:46:24, 2016-10-10 03:07:51, 2016-10-10 03:10:34, 2016-10-10 07:45:48, 2016-10-10 15:44:09, 2016-10-10 15:44:27, 2016-10-10 15:44:57, 2016-10-10 15:45:44, 2016-10-10 15:49:48, 2016-10-10 15:51:13, 2016-10-10 15:56:11, 2016-10-10 15:56:28, 2016-10-10 15:58:46, 2016-10-10 16:00:43, 2016-10-10 16:03:01, 2016-10-10 16:03:06, 2016-10-10 16:03:33, 2016-10-10 16:03:46, 2016-10-10 16:45:50, 2016-10-10 16:51:46, 2016-10-10 19:31:39, 2016-10-10 22:37:21, 2016-10-11 02:49:22, 2016-10-11 03:10:07, 2016-10-11 08:29:50, 2016-10-11 09:09:13, 2016-10-11 10:12:10, 2016-10-11 10:16:04, 2016-10-11 12:13:52, 2016-10-11 12:14:57, 2016-10-11 12:15:24, 2016-10-11 14:52:05, 2016-10-11 15:07:11, 2016-10-11 15:59:10, 2016-10-11 15:59:26, 2016-10-11 16:12:23, 2016-10-11 19:28:40, 2016-10-11 19:29:13, 2016-10-11 22:15:33, 2016-10-11 23:13:46, 2016-10-12 03:51:25, 2016-10-12 04:17:21, 2016-10-12 04:27:44, 2016-10-12 09:03:41, 2016-10-12 09:10:47, 2016-10-12 11:00:25, 2016-10-12 11:02:41, 2016-10-12 13:19:54, 2016-10-12 13:22:13, 2016-10-12 16:34:44, 2016-10-12 18:34:07, 2016-10-12 18:54:23, 2016-10-12 19:55:50, 2016-10-12 20:38:26, 2016-10-13 02:33:03, 2016-10-13 13:36:59, 2016-10-13 13:57:02, 2016-10-13 14:35:21, 2016-10-13 14:55:15, 2016-10-13 15:57:14, 2016-10-14 01:55:39, 2016-10-14 02:44:30, 2016-10-14 10:40:50, 2016-10-14 11:03:07, 2016-10-14 11:03:10, 2016-10-14 11:03:19, 2016-10-14 11:03:34, 2016-10-14 11:03:38, 2016-10-14 12:02:12, 2016-10-14 12:02:21, 2016-10-14 12:02:28, 2016-10-14 12:02:36, 2016-10-14 12:02:46, 2016-10-14 13:01:34, 2016-10-14 14:42:00, 2016-10-14 15:02:15, 2016-10-14 16:43:18, 2016-10-14 16:43:19, 2016-10-14 17:39:46, 2016-10-14 18:00:00, 2016-10-14 18:08:17, 2016-10-14 18:29:03, 2016-10-14 18:31:23, 2016-10-14 22:45:26, 2016-10-14 23:20:15, 2016-10-14 23:21:11, 2016-10-14 23:21:18, 
2016-10-14 23:21:27, 2016-10-15 04:56:37, 2016-10-15 05:02:45, 2016-10-15 05:03:07, 2016-10-15 05:03:18, 2016-10-15 08:57:48, 2016-10-15 09:57:52, ...]"
7,order_estimated_delivery_date,459,"[2016-09-30 00:00:00, 2016-10-04 00:00:00, 2016-10-20 00:00:00, 2016-10-24 00:00:00, 2016-10-25 00:00:00, 2016-10-27 00:00:00, 2016-10-28 00:00:00, 2016-11-07 00:00:00, 2016-11-14 00:00:00, 2016-11-16 00:00:00, 2016-11-17 00:00:00, 2016-11-18 00:00:00, 2016-11-23 00:00:00, 2016-11-24 00:00:00, 2016-11-25 00:00:00, 2016-11-28 00:00:00, 2016-11-29 00:00:00, 2016-11-30 00:00:00, 2016-12-01 00:00:00, 2016-12-02 00:00:00, 2016-12-05 00:00:00, 2016-12-06 00:00:00, 2016-12-07 00:00:00, 2016-12-08 00:00:00, 2016-12-09 00:00:00, 2016-12-12 00:00:00, 2016-12-13 00:00:00, 2016-12-14 00:00:00, 2016-12-16 00:00:00, 2016-12-19 00:00:00, 2016-12-20 00:00:00, 2016-12-23 00:00:00, 2016-12-30 00:00:00, 2017-01-09 00:00:00, 2017-01-11 00:00:00, 2017-01-19 00:00:00, 2017-02-01 00:00:00, 2017-02-07 00:00:00, 2017-02-09 00:00:00, 2017-02-10 00:00:00, 2017-02-13 00:00:00, 2017-02-14 00:00:00, 2017-02-15 00:00:00, 2017-02-16 00:00:00, 2017-02-17 00:00:00, 2017-02-20 00:00:00, 2017-02-21 00:00:00, 2017-02-22 00:00:00, 2017-02-23 00:00:00, 2017-02-24 00:00:00, 2017-02-27 00:00:00, 2017-02-28 00:00:00, 2017-03-01 00:00:00, 2017-03-02 00:00:00, 2017-03-03 00:00:00, 2017-03-06 00:00:00, 2017-03-07 00:00:00, 2017-03-08 00:00:00, 2017-03-09 00:00:00, 2017-03-10 00:00:00, 2017-03-13 00:00:00, 2017-03-14 00:00:00, 2017-03-15 00:00:00, 2017-03-16 00:00:00, 2017-03-17 00:00:00, 2017-03-20 00:00:00, 2017-03-21 00:00:00, 2017-03-22 00:00:00, 2017-03-23 00:00:00, 2017-03-24 00:00:00, 2017-03-27 00:00:00, 2017-03-28 00:00:00, 2017-03-29 00:00:00, 2017-03-30 00:00:00, 2017-03-31 00:00:00, 2017-04-03 00:00:00, 2017-04-04 00:00:00, 2017-04-05 00:00:00, 2017-04-06 00:00:00, 2017-04-07 00:00:00, 2017-04-10 00:00:00, 2017-04-11 00:00:00, 2017-04-12 00:00:00, 2017-04-13 00:00:00, 2017-04-14 00:00:00, 2017-04-17 00:00:00, 2017-04-18 00:00:00, 2017-04-19 00:00:00, 2017-04-20 00:00:00, 2017-04-24 00:00:00, 2017-04-25 00:00:00, 2017-04-26 00:00:00, 2017-04-27 00:00:00, 
2017-04-28 00:00:00, 2017-05-02 00:00:00, 2017-05-03 00:00:00, 2017-05-04 00:00:00, 2017-05-05 00:00:00, 2017-05-08 00:00:00, 2017-05-09 00:00:00, ...]"
2,order_status,8,"[approved, canceled, created, delivered, invoiced, processing, shipped, unavailable]"


#### **Handle Column Type**

In [34]:
orders_df['order_purchase_timestamp'] = pd.to_datetime(orders_df['order_purchase_timestamp'])
orders_df['order_approved_at'] = pd.to_datetime(orders_df['order_approved_at'])
orders_df['order_delivered_carrier_date'] = pd.to_datetime(orders_df['order_delivered_carrier_date'])
orders_df['order_delivered_customer_date'] = pd.to_datetime(orders_df['order_delivered_customer_date'])
orders_df['order_estimated_delivery_date'] = pd.to_datetime(orders_df['order_estimated_delivery_date'])
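
Since all five conversions follow the same pattern, they could also be written as a loop. The sketch below uses a small made-up frame (`toy_orders`) rather than the real `orders_df`, and `errors="coerce"` is an optional choice that the cell above does not use.

```python
import pandas as pd

# Toy stand-in for orders_df; the values are illustrative only.
toy_orders = pd.DataFrame({
    "order_purchase_timestamp": ["2017-10-02 10:56:33", "2018-07-24 20:41:37"],
    "order_approved_at": ["2017-10-02 11:07:15", None],
})

date_cols = ["order_purchase_timestamp", "order_approved_at"]
for col in date_cols:
    # errors="coerce" turns unparseable strings into NaT instead of raising.
    toy_orders[col] = pd.to_datetime(toy_orders[col], errors="coerce")

print(toy_orders.dtypes)
```

Missing values stay missing after the conversion (as `NaT`), so the missing-value counts in the next step are unaffected.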

#### **Missing Values**

In [35]:
orders_df.isna().sum()

order_id                            0
customer_id                         0
order_status                        0
order_purchase_timestamp            0
order_approved_at                 160
order_delivered_carrier_date     1783
order_delivered_customer_date    2965
order_estimated_delivery_date       0
dtype: int64

Three columns contain missing values. All three are timestamps tied to the order's progress, so a missing value may simply mean the order never reached that stage, and we can't blindly remove these rows. Instead, we check how the missing values are distributed across each `order_status`.

In [36]:
missing_pattern_by_status = orders_df.groupby('order_status').apply(
		lambda x: x[['order_approved_at', 'order_delivered_carrier_date', 'order_delivered_customer_date']].isna().astype(int)
		.groupby(list(x[['order_approved_at', 'order_delivered_carrier_date', 'order_delivered_customer_date']].columns)).size()
	).reset_index(name='count')
missing_pattern_by_status

Unnamed: 0,order_status,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,count
0,approved,0,1,1,2
1,canceled,0,0,0,6
2,canceled,0,0,1,69
3,canceled,0,1,1,409
4,canceled,1,1,1,141
5,created,1,1,1,5
6,delivered,0,0,0,96455
7,delivered,0,0,1,7
8,delivered,0,1,0,1
9,delivered,0,1,1,1
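
The same kind of pattern table can also be built without `groupby().apply()`, by turning the three timestamp columns into 0/1 missing flags and grouping on those flags together with `order_status`. This is a minimal sketch on a made-up frame (`toy`), not the notebook's actual data.

```python
import pandas as pd

# Toy stand-in for orders_df; the values are illustrative only.
toy = pd.DataFrame({
    "order_status": ["delivered", "delivered", "canceled", "canceled"],
    "order_approved_at": pd.to_datetime(
        ["2017-01-01", "2017-01-02", None, "2017-01-03"]),
    "order_delivered_carrier_date": pd.to_datetime(
        ["2017-01-02", "2017-01-03", None, None]),
    "order_delivered_customer_date": pd.to_datetime(
        ["2017-01-05", None, None, None]),
})

date_cols = [
    "order_approved_at",
    "order_delivered_carrier_date",
    "order_delivered_customer_date",
]

# 1 = value is missing, 0 = value is present.
flags = toy[date_cols].isna().astype(int)

# Count how often each (status, missing pattern) combination occurs.
pattern_counts = (
    flags.assign(order_status=toy["order_status"])
    .groupby(["order_status"] + date_cols)
    .size()
    .reset_index(name="count")
)
print(pattern_counts)
```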


After reviewing these patterns, we removed rows whose combination of `order_status` and missing timestamps does not match expected e-commerce behavior. Specifically, we removed:

1. **Canceled orders** with every timestamp present (`id 1`). This is inconsistent, since a canceled order should not have a customer delivery timestamp.
2. **Delivered orders** with missing or contradictory timestamps (`id 7, 8, 9, 10`). These rows are either incomplete or logically impossible, for example a delivered order with no customer delivery date.
3. **Unavailable orders** that were approved but have no carrier or customer delivery timestamps (`id 14`). These rows carry no delivery information, which makes them unclear or incomplete for our analysis.

In [37]:
orders_df = orders_df[~orders_df.apply(
  lambda row: (row['order_status'], pd.isna(row['order_approved_at']), pd.isna(row['order_delivered_carrier_date']), pd.isna(row['order_delivered_customer_date'])) in [
    ('canceled', False, False, False),   # id 1
    ('delivered', False, False, True),   # id 7
    ('delivered', False, True, False),   # id 8
    ('delivered', False, True, True),    # id 9
    ('delivered', True, False, False),   # id 10
    ('unavailable', False, True, True)   # id 14
], axis=1)]
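
A row-wise `apply` can be slow on a frame of this size. One possible alternative is to build the missing-value flags as columns and match them against the unwanted patterns with a left merge. The sketch below assumes a made-up frame (`toy`) and hypothetical flag column names (`approved_na`, `carrier_na`, `customer_na`); the pattern list is the same as in the cell above.

```python
import pandas as pd

# Toy stand-in for orders_df; the values are illustrative only.
toy = pd.DataFrame({
    "order_status": ["delivered", "delivered", "canceled", "unavailable"],
    "order_approved_at": pd.to_datetime(
        ["2017-01-01", "2017-01-02", "2017-01-03", "2017-01-04"]),
    "order_delivered_carrier_date": pd.to_datetime(
        ["2017-01-02", "2017-01-03", "2017-01-04", None]),
    "order_delivered_customer_date": pd.to_datetime(
        ["2017-01-05", None, "2017-01-06", None]),
})

# (status, approved missing?, carrier missing?, customer missing?)
bad_patterns = [
    ("canceled", False, False, False),
    ("delivered", False, False, True),
    ("delivered", False, True, False),
    ("delivered", False, True, True),
    ("delivered", True, False, False),
    ("unavailable", False, True, True),
]

flag_cols = ["order_status", "approved_na", "carrier_na", "customer_na"]
flags = pd.DataFrame({
    "order_status": toy["order_status"],
    "approved_na": toy["order_approved_at"].isna(),
    "carrier_na": toy["order_delivered_carrier_date"].isna(),
    "customer_na": toy["order_delivered_customer_date"].isna(),
})
bad = pd.DataFrame(bad_patterns, columns=flag_cols)

# A left merge keeps the original row order; "_merge" marks the matches.
merged = flags.merge(bad, on=flag_cols, how="left", indicator=True)
is_bad = (merged["_merge"] == "both").to_numpy()

cleaned = toy[~is_bad]
print(cleaned)
```

Because the pattern list has no repeated entries, the merge cannot duplicate rows, so `is_bad` lines up one-to-one with the rows of `toy`.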


In [38]:
orders_df

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18
1,53cdb2fc8bc7dce0b6741e2150273451,b0830fb4747a6c6d20dea0b8c802d7ef,delivered,2018-07-24 20:41:37,2018-07-26 03:24:27,2018-07-26 14:31:00,2018-08-07 15:27:45,2018-08-13
2,47770eb9100c2d0c44946d9cf07ec65d,41ce2a54c0b03bf3443c3d931a367089,delivered,2018-08-08 08:38:49,2018-08-08 08:55:23,2018-08-08 13:50:00,2018-08-17 18:06:29,2018-09-04
3,949d5b44dbf5de918fe9c16f97b45f8a,f88197465ea7920adcdbec7375364d82,delivered,2017-11-18 19:28:06,2017-11-18 19:45:59,2017-11-22 13:39:59,2017-12-02 00:28:42,2017-12-15
4,ad21c59c0840e6cb83a9ceb5573f8159,8ab97904e6daea8866dbdbc4fb7aad2c,delivered,2018-02-13 21:18:39,2018-02-13 22:20:29,2018-02-14 19:46:34,2018-02-16 18:17:02,2018-02-26
...,...,...,...,...,...,...,...,...
99436,9c5dedf39a927c1b2549525ed64a053c,39bd1228ee8140590ac3aca26f2dfe00,delivered,2017-03-09 09:54:05,2017-03-09 09:54:05,2017-03-10 11:18:03,2017-03-17 15:08:01,2017-03-28
99437,63943bddc261676b46f01ca7ac2f7bd8,1fca14ff2861355f6e5f14306ff977a7,delivered,2018-02-06 12:58:58,2018-02-06 13:10:37,2018-02-07 23:22:42,2018-02-28 17:37:56,2018-03-02
99438,83c1379a015df1e13d02aae0204711ab,1aa71eb042121263aafbe80c1b562c9c,delivered,2017-08-27 14:46:43,2017-08-27 15:04:16,2017-08-28 20:52:26,2017-09-21 11:24:17,2017-09-27
99439,11c177c8e97725db2631073c19f07b62,b331b74b18dc79bcdf6532d51e1637c1,delivered,2018-01-08 21:28:27,2018-01-08 21:36:21,2018-01-12 15:35:03,2018-01-25 23:32:54,2018-02-15


#### **Duplicates**

In [39]:
orders_df.duplicated().sum()

0

No duplicates found, no action needed.
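
Note that `duplicated()` without arguments only flags rows that are identical in every column. If `order_id` is expected to be unique (an assumption, not something verified above), a key-level check could look like this sketch on made-up data:

```python
import pandas as pd

# Toy frame: no fully identical rows, but one repeated order_id.
toy = pd.DataFrame({
    "order_id": ["a1", "a2", "a2"],
    "order_status": ["delivered", "shipped", "delivered"],
})

full_row_dupes = int(toy.duplicated().sum())
key_dupes = int(toy.duplicated(subset=["order_id"]).sum())

print(full_row_dupes, key_dupes)
```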

#### **Feature Engineering**

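To help us analyze the data further, we decompose each datetime column into its components (year, month, day, hour, minute, second) and calculate the duration, in seconds, between the stages of an order.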

In [None]:
datetime_columns = [
	'order_purchase_timestamp', 'order_approved_at', 
	'order_delivered_carrier_date', 'order_delivered_customer_date', 'order_estimated_delivery_date'
]

for col in datetime_columns:
	orders_df[f'{col}_year'] = orders_df[col].dt.year
	orders_df[f'{col}_month'] = orders_df[col].dt.month
	orders_df[f'{col}_day'] = orders_df[col].dt.day
	orders_df[f'{col}_hour'] = orders_df[col].dt.hour
	orders_df[f'{col}_minute'] = orders_df[col].dt.minute
	orders_df[f'{col}_second'] = orders_df[col].dt.second

# Calculate the durations between different stages
orders_df['purchase_to_approval'] = (orders_df['order_approved_at'] - orders_df['order_purchase_timestamp']).dt.total_seconds()
orders_df['approval_to_carrier'] = (orders_df['order_delivered_carrier_date'] - orders_df['order_approved_at']).dt.total_seconds()
orders_df['carrier_to_customer'] = (orders_df['order_delivered_customer_date'] - orders_df['order_delivered_carrier_date']).dt.total_seconds()
orders_df['purchase_to_customer'] = (orders_df['order_delivered_customer_date'] - orders_df['order_purchase_timestamp']).dt.total_seconds()
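
As a quick illustration of what these duration columns contain, the sketch below computes `purchase_to_customer` for the first order shown earlier plus a made-up undelivered order. The `purchase_to_customer_days` column is a hypothetical addition for readability and is not part of the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "order_purchase_timestamp": pd.to_datetime(
        ["2017-10-02 10:56:33", "2018-07-24 20:41:37"]),
    # The second order has no delivery date (NaT).
    "order_delivered_customer_date": pd.to_datetime(
        ["2017-10-10 21:25:13", None]),
})

# Subtracting datetimes gives a timedelta; total_seconds() makes it numeric.
toy["purchase_to_customer"] = (
    toy["order_delivered_customer_date"] - toy["order_purchase_timestamp"]
).dt.total_seconds()

# Hypothetical helper column: the same duration expressed in days.
toy["purchase_to_customer_days"] = toy["purchase_to_customer"] / 86400

print(toy[["purchase_to_customer", "purchase_to_customer_days"]])
```

Orders with a missing timestamp end up with `NaN` in the duration columns, so they need to be handled before any aggregation or modeling that cannot deal with missing values.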

In [None]:
orders_df

#### **1. Customers Dataset**

In [59]:
customers_df.isna().sum()

customer_id                 0
customer_unique_id          0
customer_zip_code_prefix    0
customer_city               0
customer_state              0
dtype: int64
