# üáßüá∑ Olist E-Commerce: Detailed Data Inspection

## 🎯 Objective
The goal of this notebook is to inspect **every single CSV file** in the Olist dataset to understand its context, schema, and relationships. This foundation is crucial for building robust ML models.

## 📂 Dataset Overview
The dataset consists of 9 files:
1.  `olist_orders_dataset.csv`: The core table connecting everything.
2.  `olist_order_items_dataset.csv`: Details of products within orders.
3.  `olist_order_payments_dataset.csv`: Payment methods and values.
4.  `olist_order_reviews_dataset.csv`: Customer satisfaction scores.
5.  `olist_customers_dataset.csv`: Customer IDs and geolocations.
6.  `olist_products_dataset.csv`: Product categories and dimensions.
7.  `olist_sellers_dataset.csv`: Seller information.
8.  `olist_geolocation_dataset.csv`: Lat/Lng for zip codes.
9.  `product_category_name_translation.csv`: Portuguese to English translations.

In [7]:
import os

import pandas as pd
from IPython.display import display  # explicit import so the helper also runs outside a notebook

# Config
DATA_DIR = "olist_data"

def inspect_csv(filename, description):
    """Load one CSV from DATA_DIR and print its shape, dtypes, head, and summary stats."""
    path = os.path.join(DATA_DIR, filename)
    print(f"\n{'='*80}\n📂 INSPECTING: {filename}\n{'='*80}")
    print(f"ℹ️ CONTEXT: {description}\n")

    try:
        df = pd.read_csv(path)
        print(f"Shape: {df.shape} (Rows, Cols)")
        print("\n--- 📋 Columns & Data Types ---")
        print(df.dtypes)
        print("\n--- 🔍 Head (First 3 Rows) ---")
        display(df.head(3))
        print("\n--- 📉 Basic Stats ---")
        # Transposed for readability; keep only the first 7 statistic columns
        display(df.describe(include='all').T.iloc[:, :7])
        return df
    except Exception as e:
        # Report the failure and return None so later cells can check for it
        print(f"❌ Error loading {filename}: {e}")
        return None
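
`inspect_csv` prints counts through `describe`, but it does not report missing values directly. A minimal sketch of a companion helper is shown below; `null_report` is a hypothetical name (not part of the notebook), and the sample frame is made up for illustration.

```python
import pandas as pd

def null_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column that has missing values (hypothetical helper)."""
    n_missing = df.isna().sum()
    report = pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": (n_missing / len(df) * 100).round(2),
    })
    # Keep only columns that actually have gaps, worst first
    return report[report["n_missing"] > 0].sort_values("n_missing", ascending=False)

# Made-up sample: column "a" has two gaps, column "b" has none
sample = pd.DataFrame({"a": [1.0, None, 3.0, None], "b": ["x", "y", "z", "w"]})
report = null_report(sample)
print(report)
```

Calling `null_report(df)` on any frame returned by `inspect_csv` would be one way to see the gaps at a glance.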

---

## 1. Orders Dataset (`olist_orders_dataset.csv`)
**Context:** This is the "parent" table. Every order has a unique `order_id`. It connects customers to the items they bought. Key columns are the timestamps (purchase, approval, carrier delivery, customer delivery, estimated delivery) and the `order_status`.

In [8]:
df_orders = inspect_csv(
    "olist_orders_dataset.csv",
    "Central table connecting all other datasets. Tracks the life-cycle of an order."
)


📂 INSPECTING: olist_orders_dataset.csv
ℹ️ CONTEXT: Central table connecting all other datasets. Tracks the life-cycle of an order.

Shape: (99441, 8) (Rows, Cols)

--- 📋 Columns & Data Types ---
order_id                         str
customer_id                      str
order_status                     str
order_purchase_timestamp         str
order_approved_at                str
order_delivered_carrier_date     str
order_delivered_customer_date    str
order_estimated_delivery_date    str
dtype: object

--- 🔍 Head (First 3 Rows) ---

| | order_id | customer_id | order_status | order_purchase_timestamp | order_approved_at | order_delivered_carrier_date | order_delivered_customer_date | order_estimated_delivery_date |
|---|---|---|---|---|---|---|---|---|
| 0 | e481f51cbdc54678b7cc49136f2d6af7 | 9ef432eb6251297304e76186b10a928d | delivered | 2017-10-02 10:56:33 | 2017-10-02 11:07:15 | 2017-10-04 19:55:00 | 2017-10-10 21:25:13 | 2017-10-18 00:00:00 |
| 1 | 53cdb2fc8bc7dce0b6741e2150273451 | b0830fb4747a6c6d20dea0b8c802d7ef | delivered | 2018-07-24 20:41:37 | 2018-07-26 03:24:27 | 2018-07-26 14:31:00 | 2018-08-07 15:27:45 | 2018-08-13 00:00:00 |
| 2 | 47770eb9100c2d0c44946d9cf07ec65d | 41ce2a54c0b03bf3443c3d931a367089 | delivered | 2018-08-08 08:38:49 | 2018-08-08 08:55:23 | 2018-08-08 13:50:00 | 2018-08-17 18:06:29 | 2018-09-04 00:00:00 |



--- 📉 Basic Stats ---

| | count | unique | top | freq |
|---|---|---|---|---|
| order_id | 99441 | 99441 | e481f51cbdc54678b7cc49136f2d6af7 | 1 |
| customer_id | 99441 | 99441 | 9ef432eb6251297304e76186b10a928d | 1 |
| order_status | 99441 | 8 | delivered | 96478 |
| order_purchase_timestamp | 99441 | 98875 | 2018-03-31 15:08:21 | 3 |
| order_approved_at | 99281 | 90733 | 2018-02-27 04:31:10 | 9 |
| order_delivered_carrier_date | 97658 | 81018 | 2018-05-09 15:48:00 | 47 |
| order_delivered_customer_date | 96476 | 95664 | 2018-05-14 20:02:44 | 3 |
| order_estimated_delivery_date | 99441 | 459 | 2017-12-20 00:00:00 | 522 |
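
All five timestamp columns load as strings, and the counts above show gaps: 160 orders have no `order_approved_at`, 1,783 have no carrier date, and 2,965 have no customer delivery date. A minimal sketch of parsing the dates and deriving a delivery time is shown below. It uses a small hand-made sample rather than the real file, and `delivery_days` and `delivered_late` are hypothetical column names chosen for illustration.

```python
import pandas as pd

# Hand-made sample mirroring the orders schema (the third row is hypothetical)
orders = pd.DataFrame({
    "order_id": ["a", "b", "c"],
    "order_status": ["delivered", "delivered", "shipped"],
    "order_purchase_timestamp": [
        "2017-10-02 10:56:33", "2018-07-24 20:41:37", "2018-08-08 08:38:49"],
    "order_delivered_customer_date": [
        "2017-10-10 21:25:13", "2018-08-07 15:27:45", None],
    "order_estimated_delivery_date": [
        "2017-10-18 00:00:00", "2018-08-13 00:00:00", "2018-09-04 00:00:00"],
})

date_cols = [
    "order_purchase_timestamp",
    "order_delivered_customer_date",
    "order_estimated_delivery_date",
]
for col in date_cols:
    # errors="coerce" turns unparseable values into NaT instead of raising
    orders[col] = pd.to_datetime(orders[col], errors="coerce")

# Whole days between purchase and delivery; NaN where delivery is missing
orders["delivery_days"] = (
    orders["order_delivered_customer_date"] - orders["order_purchase_timestamp"]
).dt.days

# Comparisons against NaT evaluate to False, so undelivered orders are not flagged
orders["delivered_late"] = (
    orders["order_delivered_customer_date"] > orders["order_estimated_delivery_date"]
)
print(orders[["order_id", "delivery_days", "delivered_late"]])
```

Because undelivered orders end up with `delivered_late == False`, it is probably safer to filter on `order_status` or on a non-null delivery date before computing any late-delivery rate.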


## 2. Order Items Dataset (`olist_order_items_dataset.csv`)
**Context:** A single order can contain multiple items. This table links `order_id` to `product_id` and `seller_id`. It provides the `price` and `freight_value` for each item. This is essential for revenue analysis.

In [9]:
df_items = inspect_csv(
    "olist_order_items_dataset.csv",
    "Contains the individual items (SKUs) within each order. Links orders to products and sellers."
)


📂 INSPECTING: olist_order_items_dataset.csv
ℹ️ CONTEXT: Contains the individual items (SKUs) within each order. Links orders to products and sellers.

Shape: (112650, 7) (Rows, Cols)

--- 📋 Columns & Data Types ---
order_id                   str
order_item_id            int64
product_id                 str
seller_id                  str
shipping_limit_date        str
price                  float64
freight_value          float64
dtype: object

--- 🔍 Head (First 3 Rows) ---

| | order_id | order_item_id | product_id | seller_id | shipping_limit_date | price | freight_value |
|---|---|---|---|---|---|---|---|
| 0 | 00010242fe8c5a6d1ba2dd792cb16214 | 1 | 4244733e06e7ecb4970a6e2683c13e61 | 48436dade18ac8b2bce089ec2a041202 | 2017-09-19 09:45:35 | 58.9 | 13.29 |
| 1 | 00018f77f2f0320c557190d7a144bdd3 | 1 | e5f2d52b802189ee658865ca93d83a8f | dd7ddc04e1b6c2c614352b383efe2d36 | 2017-05-03 11:05:13 | 239.9 | 19.93 |
| 2 | 000229ec398224ef6ca0657da4fc703e | 1 | c777355d18b72b67abbeef9df44fd0fd | 5b51032eddd242adc84c38acab88f23d | 2018-01-18 14:48:30 | 199.0 | 17.87 |



--- 📉 Basic Stats ---

| | count | unique | top | freq | mean | std | min |
|---|---|---|---|---|---|---|---|
| order_id | 112650.0 | 98666.0 | 8272b63d03f5f79c56e9e4120aec44ef | 21.0 | NaN | NaN | NaN |
| order_item_id | 112650.0 | NaN | NaN | NaN | 1.197834 | 0.705124 | 1.0 |
| product_id | 112650.0 | 32951.0 | aca2eb7d00ea1a7b8ebd4e68314663af | 527.0 | NaN | NaN | NaN |
| seller_id | 112650.0 | 3095.0 | 6560211a19b47992c3666cc44a7e94c0 | 2033.0 | NaN | NaN | NaN |
| shipping_limit_date | 112650.0 | 93318.0 | 2018-03-01 02:50:48 | 21.0 | NaN | NaN | NaN |
| price | 112650.0 | NaN | NaN | NaN | 120.653739 | 183.633928 | 0.85 |
| freight_value | 112650.0 | NaN | NaN | NaN | 19.99032 | 15.806405 | 0.0 |
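
The stats show 98,666 distinct `order_id` values here versus 99,441 in the orders table, so 775 orders have no item rows, and one order has as many as 21. For revenue analysis the item rows usually need to be rolled up to one row per order first. A minimal sketch on a hand-made sample follows; the output column names (`n_items`, `items_value`, `freight_total`, `order_value`) are assumptions for illustration.

```python
import pandas as pd

# Hand-made sample: order o1 has two items, order o2 has one
items = pd.DataFrame({
    "order_id": ["o1", "o1", "o2"],
    "order_item_id": [1, 2, 1],
    "price": [58.9, 58.9, 239.9],
    "freight_value": [13.29, 13.29, 19.93],
})

# One row per order: item count, summed item prices, summed freight
order_totals = (
    items.groupby("order_id")
    .agg(
        n_items=("order_item_id", "count"),
        items_value=("price", "sum"),
        freight_total=("freight_value", "sum"),
    )
    .reset_index()
)
order_totals["order_value"] = order_totals["items_value"] + order_totals["freight_total"]
print(order_totals)
```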


## 3. Order Payments Dataset (`olist_order_payments_dataset.csv`)
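**Context:** Shows how the order was paid (Credit Card, Boleto, Voucher, etc.) and the payment value. An order can be split across several payment methods, each on its own row numbered by `payment_sequential`, and a payment can be spread over installments (`payment_installments`).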

In [10]:
df_payments = inspect_csv(
    "olist_order_payments_dataset.csv",
    "Financial data: payment types (credit card, voucher), installments, and transaction values."
)


📂 INSPECTING: olist_order_payments_dataset.csv
ℹ️ CONTEXT: Financial data: payment types (credit card, voucher), installments, and transaction values.

Shape: (103886, 5) (Rows, Cols)

--- 📋 Columns & Data Types ---
order_id                    str
payment_sequential        int64
payment_type                str
payment_installments      int64
payment_value           float64
dtype: object

--- 🔍 Head (First 3 Rows) ---

| | order_id | payment_sequential | payment_type | payment_installments | payment_value |
|---|---|---|---|---|---|
| 0 | b81ef226f3fe1789b1e8b2acac839d17 | 1 | credit_card | 8 | 99.33 |
| 1 | a9810da82917af2d9aefd1278f1dcfa0 | 1 | credit_card | 1 | 24.39 |
| 2 | 25e8ea4e93396b6fa0d3dd708e76c1bd | 1 | credit_card | 1 | 65.71 |



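--- 📉 Basic Stats ---

| | count | unique | top | freq | mean | std | min |
|---|---|---|---|---|---|---|---|
| order_id | 103886.0 | 99440.0 | fa65dad1b0e818e3ccc5cb0e39231352 | 29.0 | NaN | NaN | NaN |
| payment_sequential | 103886.0 | NaN | NaN | NaN | 1.092679 | 0.706584 | 1.0 |
| payment_type | 103886.0 | 5.0 | credit_card | 76795.0 | NaN | NaN | NaN |
| payment_installments | 103886.0 | NaN | NaN | NaN | 2.853349 | 2.687051 | 0.0 |
| payment_value | 103886.0 | NaN | NaN | NaN | 154.10038 | 217.494064 | 0.0 |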
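
There are 103,886 payment rows for 99,440 distinct orders (one fewer than the orders table), and a single order has up to 29 payment rows. Joining this table to orders without aggregating first would therefore duplicate order rows. A minimal sketch of a per-order rollup on a hand-made sample follows; the output column names are assumptions for illustration.

```python
import pandas as pd

# Hand-made sample: order o1 is paid with a card plus a voucher
payments = pd.DataFrame({
    "order_id": ["o1", "o1", "o2"],
    "payment_sequential": [1, 2, 1],
    "payment_type": ["credit_card", "voucher", "boleto"],
    "payment_installments": [8, 1, 1],
    "payment_value": [99.33, 20.00, 65.71],
})

# Collapse to one row per order before merging with the orders table
per_order = (
    payments.groupby("order_id")
    .agg(
        n_payments=("payment_sequential", "count"),
        n_payment_types=("payment_type", "nunique"),
        max_installments=("payment_installments", "max"),
        total_paid=("payment_value", "sum"),
    )
    .reset_index()
)
print(per_order)
```

The minimum values of 0 for `payment_installments` and `payment_value` in the stats above may deserve a closer look before these columns are used as features.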


## 4. Order Reviews Dataset (`olist_order_reviews_dataset.csv`)
**Context:** The voice of the customer. Includes `review_score` (1-5) and text comments (`review_comment_message`). Crucial for Sentiment Analysis or measuring Customer Satisfaction (CSAT).

In [11]:
df_reviews = inspect_csv(
    "olist_order_reviews_dataset.csv",
    "Customer feedback: ratings (1-5) and free-text comments written by customers."
)


📂 INSPECTING: olist_order_reviews_dataset.csv
ℹ️ CONTEXT: Customer feedback: ratings (1-5) and free-text comments written by customers.

Shape: (99224, 7) (Rows, Cols)

--- 📋 Columns & Data Types ---
review_id                    str
order_id                     str
review_score               int64
review_comment_title         str
review_comment_message       str
review_creation_date         str
review_answer_timestamp      str
dtype: object

--- 🔍 Head (First 3 Rows) ---

| | review_id | order_id | review_score | review_comment_title | review_comment_message | review_creation_date | review_answer_timestamp |
|---|---|---|---|---|---|---|---|
| 0 | 7bc2406110b926393aa56f80a40eba40 | 73fc7af87114b39712e6da79b0a377eb | 4 | NaN | NaN | 2018-01-18 00:00:00 | 2018-01-18 21:46:59 |
| 1 | 80e641a11e56f04c1ad469d5645fdfde | a548910a1c6147796b98fdf73dbeba33 | 5 | NaN | NaN | 2018-03-10 00:00:00 | 2018-03-11 03:05:13 |
| 2 | 228ce5500dc1d8e020d8d1322874b6f0 | f9e4b658b201a9f2ecdecbb34bed034b | 5 | NaN | NaN | 2018-02-17 00:00:00 | 2018-02-18 14:36:24 |



--- 📉 Basic Stats ---

| | count | unique | top | freq | mean | std | min |
|---|---|---|---|---|---|---|---|
| review_id | 99224.0 | 98410.0 | c444278834184f72b1484dfe47de7f97 | 3.0 | NaN | NaN | NaN |
| order_id | 99224.0 | 98673.0 | c88b1d1b157a9999ce368f218a407141 | 3.0 | NaN | NaN | NaN |
| review_score | 99224.0 | NaN | NaN | NaN | 4.086421 | 1.347579 | 1.0 |
| review_comment_title | 11568.0 | 4527.0 | Recomendo | 423.0 | NaN | NaN | NaN |
| review_comment_message | 40977.0 | 36159.0 | Muito bom | 230.0 | NaN | NaN | NaN |
| review_creation_date | 99224.0 | 636.0 | 2017-12-19 00:00:00 | 463.0 | NaN | NaN | NaN |
| review_answer_timestamp | 99224.0 | 98248.0 | 2017-06-15 23:21:05 | 4.0 | NaN | NaN | NaN |
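
Neither `review_id` (98,410 unique of 99,224) nor `order_id` (98,673 unique) is a unique key here, and only 40,977 rows carry a comment message. One possible way to get a single review per order is to keep the most recently answered one; this is an assumption about what "the" review for an order should be, not a rule stated by the dataset. The sample below is hand-made, and `has_comment` is a hypothetical column name.

```python
import pandas as pd

# Hand-made sample: order o1 was reviewed twice
reviews = pd.DataFrame({
    "review_id": ["r1", "r2", "r3"],
    "order_id": ["o1", "o1", "o2"],
    "review_score": [2, 5, 4],
    "review_comment_message": [None, "Muito bom", None],
    "review_answer_timestamp": [
        "2018-01-18 21:46:59", "2018-03-11 03:05:13", "2018-02-18 14:36:24"],
})
reviews["review_answer_timestamp"] = pd.to_datetime(reviews["review_answer_timestamp"])

# Flag rows that contain free text (useful when sampling for sentiment analysis)
reviews["has_comment"] = reviews["review_comment_message"].notna()

# Keep the latest review per order (assumed tie-break rule)
latest = (
    reviews.sort_values("review_answer_timestamp")
    .drop_duplicates(subset="order_id", keep="last")
    .sort_values("order_id")
    .reset_index(drop=True)
)
print(latest[["order_id", "review_score", "has_comment"]])
```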


## 5. Customers Dataset (`olist_customers_dataset.csv`)
**Context:** Contains customer location data (zip code, city, state). **Important:** It maps `customer_id` (used in Orders table) to `customer_unique_id` (the distinct real human). If a person orders twice, they get 2 `customer_id`s but keep 1 `customer_unique_id`.

In [12]:
df_customers = inspect_csv(
    "olist_customers_dataset.csv",
    "Demographics: Links order-specific 'customer_id' to persistent 'customer_unique_id' and location."
)


📂 INSPECTING: olist_customers_dataset.csv
ℹ️ CONTEXT: Demographics: Links order-specific 'customer_id' to persistent 'customer_unique_id' and location.

Shape: (99441, 5) (Rows, Cols)

--- 📋 Columns & Data Types ---
customer_id                   str
customer_unique_id            str
customer_zip_code_prefix    int64
customer_city                 str
customer_state                str
dtype: object

--- 🔍 Head (First 3 Rows) ---

| | customer_id | customer_unique_id | customer_zip_code_prefix | customer_city | customer_state |
|---|---|---|---|---|---|
| 0 | 06b8999e2fba1a1fbc88172c00ba8bc7 | 861eff4711a542e4b93843c6dd7febb0 | 14409 | franca | SP |
| 1 | 18955e83d337fd6b2def6b18a428ac77 | 290c77bc529b7ac935b93aa66c333dc3 | 9790 | sao bernardo do campo | SP |
| 2 | 4e7b3e00288586ebd08712fdd0374a03 | 060e732b5b29e8181a18229c7b0b2b5e | 1151 | sao paulo | SP |



--- 📉 Basic Stats ---

| | count | unique | top | freq | mean | std | min |
|---|---|---|---|---|---|---|---|
| customer_id | 99441.0 | 99441.0 | 06b8999e2fba1a1fbc88172c00ba8bc7 | 1.0 | NaN | NaN | NaN |
| customer_unique_id | 99441.0 | 96096.0 | 8d50f5eadf50201ccdcedfb9e2ac8455 | 17.0 | NaN | NaN | NaN |
| customer_zip_code_prefix | 99441.0 | NaN | NaN | NaN | 35137.474583 | 29797.938996 | 1003.0 |
| customer_city | 99441.0 | 4119.0 | sao paulo | 15540.0 | NaN | NaN | NaN |
| customer_state | 99441.0 | 27.0 | SP | 41746.0 | NaN | NaN | NaN |
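
The 99,441 `customer_id` values map to 96,096 distinct `customer_unique_id` values, and the most frequent person appears 17 times. Counting repeat buyers therefore has to group on `customer_unique_id`. A minimal sketch on a hand-made sample follows; `orders_per_person` and `repeat_share` are hypothetical names.

```python
import pandas as pd

# Hand-made sample: person u1 placed two orders under two customer_ids
customers = pd.DataFrame({
    "customer_id": ["c1", "c2", "c3", "c4"],
    "customer_unique_id": ["u1", "u1", "u2", "u3"],
})

# Number of order-level ids attached to each real person
orders_per_person = customers.groupby("customer_unique_id")["customer_id"].nunique()

# Share of people with more than one order
repeat_share = (orders_per_person > 1).mean()
print(orders_per_person)
print(f"Repeat customers: {repeat_share:.1%}")
```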


## 6. Products Dataset (`olist_products_dataset.csv`)
**Context:** Attributes of the goods sold: category name, weight, and dimensions (L x W x H). Essential for freight calculation analysis.

In [13]:
df_products = inspect_csv(
    "olist_products_dataset.csv",
    "Catalog info: Category names, product weight, and dimensions."
)


📂 INSPECTING: olist_products_dataset.csv
ℹ️ CONTEXT: Catalog info: Category names, product weight, and dimensions.

Shape: (32951, 9) (Rows, Cols)

--- 📋 Columns & Data Types ---
product_id                        str
product_category_name             str
product_name_lenght           float64
product_description_lenght    float64
product_photos_qty            float64
product_weight_g              float64
product_length_cm             float64
product_height_cm             float64
product_width_cm              float64
dtype: object

--- 🔍 Head (First 3 Rows) ---

| | product_id | product_category_name | product_name_lenght | product_description_lenght | product_photos_qty | product_weight_g | product_length_cm | product_height_cm | product_width_cm |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1e9e8ef04dbcff4541ed26657ea517e5 | perfumaria | 40.0 | 287.0 | 1.0 | 225.0 | 16.0 | 10.0 | 14.0 |
| 1 | 3aa071139cb16b67ca9e5dea641aaa2f | artes | 44.0 | 276.0 | 1.0 | 1000.0 | 30.0 | 18.0 | 20.0 |
| 2 | 96bd76ec8810374ed1b65e291975717f | esporte_lazer | 46.0 | 250.0 | 1.0 | 154.0 | 18.0 | 9.0 | 15.0 |



--- 📉 Basic Stats ---

| | count | unique | top | freq | mean | std | min |
|---|---|---|---|---|---|---|---|
| product_id | 32951.0 | 32951.0 | 1e9e8ef04dbcff4541ed26657ea517e5 | 1.0 | NaN | NaN | NaN |
| product_category_name | 32341.0 | 73.0 | cama_mesa_banho | 3029.0 | NaN | NaN | NaN |
| product_name_lenght | 32341.0 | NaN | NaN | NaN | 48.476949 | 10.245741 | 5.0 |
| product_description_lenght | 32341.0 | NaN | NaN | NaN | 771.495285 | 635.115225 | 4.0 |
| product_photos_qty | 32341.0 | NaN | NaN | NaN | 2.188986 | 1.736766 | 1.0 |
| product_weight_g | 32949.0 | NaN | NaN | NaN | 2276.472488 | 4282.038731 | 0.0 |
| product_length_cm | 32949.0 | NaN | NaN | NaN | 30.815078 | 16.914458 | 7.0 |
| product_height_cm | 32949.0 | NaN | NaN | NaN | 16.937661 | 13.637554 | 2.0 |
| product_width_cm | 32949.0 | NaN | NaN | NaN | 23.196728 | 12.079047 | 6.0 |
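
The counts show 610 products without a category (32,951 rows, 32,341 non-null) and 2 without weight or dimensions; the minimum weight of 0.0 g also looks suspicious. Note that the `lenght` spelling in two column names comes from the dataset itself. A minimal sketch of deriving a volume feature and labelling missing categories follows; the sample is hand-made, and `volume_cm3` and the `"unknown"` placeholder are assumptions for illustration.

```python
import pandas as pd

# Hand-made sample: p2 lacks a category, p3 lacks weight and dimensions
products = pd.DataFrame({
    "product_id": ["p1", "p2", "p3"],
    "product_category_name": ["perfumaria", None, "artes"],
    "product_weight_g": [225.0, 1000.0, None],
    "product_length_cm": [16.0, 30.0, None],
    "product_height_cm": [10.0, 18.0, None],
    "product_width_cm": [14.0, 20.0, None],
})

# Count gaps per column before filling anything
missing_counts = products.isna().sum()

# Box volume in cubic centimetres; stays NaN when any dimension is missing
products["volume_cm3"] = (
    products["product_length_cm"]
    * products["product_height_cm"]
    * products["product_width_cm"]
)

# Explicit placeholder instead of silently dropping uncategorised products
products["product_category_name"] = products["product_category_name"].fillna("unknown")
print(products[["product_id", "product_category_name", "volume_cm3"]])
```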


## 7. Sellers Dataset (`olist_sellers_dataset.csv`)
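**Context:** Location of the merchants fulfilling orders. Neither this table nor the customers table stores coordinates, so estimating the seller-to-customer distance requires looking up both zip code prefixes in the geolocation table.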

In [14]:
df_sellers = inspect_csv(
    "olist_sellers_dataset.csv",
    "Merchant info: Seller zip codes, cities, and states."
)


📂 INSPECTING: olist_sellers_dataset.csv
ℹ️ CONTEXT: Merchant info: Seller zip codes, cities, and states.

Shape: (3095, 4) (Rows, Cols)

--- 📋 Columns & Data Types ---
seller_id                   str
seller_zip_code_prefix    int64
seller_city                 str
seller_state                str
dtype: object

--- 🔍 Head (First 3 Rows) ---

| | seller_id | seller_zip_code_prefix | seller_city | seller_state |
|---|---|---|---|---|
| 0 | 3442f8959a84dea7ee197c632cb2df15 | 13023 | campinas | SP |
| 1 | d1b65fc7debc3361ea86b5f14c68d2e2 | 13844 | mogi guacu | SP |
| 2 | ce3ad9de960102d0677a81f5d0bb7b2d | 20031 | rio de janeiro | RJ |



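--- 📉 Basic Stats ---

| | count | unique | top | freq | mean | std | min |
|---|---|---|---|---|---|---|---|
| seller_id | 3095.0 | 3095.0 | 3442f8959a84dea7ee197c632cb2df15 | 1.0 | NaN | NaN | NaN |
| seller_zip_code_prefix | 3095.0 | NaN | NaN | NaN | 32291.059451 | 32713.45383 | 1001.0 |
| seller_city | 3095.0 | 611.0 | sao paulo | 694.0 | NaN | NaN | NaN |
| seller_state | 3095.0 | 23.0 | SP | 1849.0 | NaN | NaN | NaN |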
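
Once seller and customer zip code prefixes have been mapped to coordinates, one common way to estimate the distance between them is the haversine (great-circle) formula. The sketch below is a generic implementation, not something the notebook defines: `haversine_km` is a hypothetical helper, the Earth radius of 6,371 km is the usual mean-radius approximation, and the Rio de Janeiro coordinates are approximate values used only for illustration.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (approximation)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; accepts scalars or NumPy arrays / pandas columns."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# A Sao Paulo point from the geolocation table vs. an approximate point in Rio de Janeiro
sp_lat, sp_lon = -23.545621, -46.639292
rj_lat, rj_lon = -22.9068, -43.1729  # approximate, for illustration only
distance = haversine_km(sp_lat, sp_lon, rj_lat, rj_lon)
print(f"Sao Paulo -> Rio de Janeiro: {distance:.0f} km")
```

Because the function is built from NumPy operations, it can be applied to whole DataFrame columns at once after the seller and customer coordinates have been merged onto the order items.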


## 8. Geolocation Dataset (`olist_geolocation_dataset.csv`)
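**Context:** Maps Brazilian zip code prefixes (`geolocation_zip_code_prefix`) to latitude/longitude coordinates. The mapping is not one-to-one: the same prefix can appear on many rows with slightly different coordinates. Allows visualizing data on a map.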

In [15]:
df_geo = inspect_csv(
    "olist_geolocation_dataset.csv",
    "Geospatial data: Mapping zip codes to Lat/Lng coordinates."
)


📂 INSPECTING: olist_geolocation_dataset.csv
ℹ️ CONTEXT: Geospatial data: Mapping zip codes to Lat/Lng coordinates.

Shape: (1000163, 5) (Rows, Cols)

--- 📋 Columns & Data Types ---
geolocation_zip_code_prefix      int64
geolocation_lat                float64
geolocation_lng                float64
geolocation_city                   str
geolocation_state                  str
dtype: object

--- 🔍 Head (First 3 Rows) ---

| | geolocation_zip_code_prefix | geolocation_lat | geolocation_lng | geolocation_city | geolocation_state |
|---|---|---|---|---|---|
| 0 | 1037 | -23.545621 | -46.639292 | sao paulo | SP |
| 1 | 1046 | -23.546081 | -46.64482 | sao paulo | SP |
| 2 | 1046 | -23.546129 | -46.642951 | sao paulo | SP |



--- 📉 Basic Stats ---

| | count | unique | top | freq | mean | std | min |
|---|---|---|---|---|---|---|---|
| geolocation_zip_code_prefix | 1000163.0 | NaN | NaN | NaN | 36574.166466 | 30549.33571 | 1001.0 |
| geolocation_lat | 1000163.0 | NaN | NaN | NaN | -21.176153 | 5.715866 | -36.605374 |
| geolocation_lng | 1000163.0 | NaN | NaN | NaN | -46.390541 | 4.269748 | -101.466766 |
| geolocation_city | 1000163.0 | 8011.0 | sao paulo | 135800.0 | NaN | NaN | NaN |
| geolocation_state | 1000163.0 | 27.0 | SP | 404268.0 | NaN | NaN | NaN |
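
With over a million rows and repeated prefixes (1046 appears twice in the head alone), joining this table directly onto customers or sellers would multiply rows. The minimum longitude of about -101.47 also appears to lie well outside Brazil. One possible cleanup is sketched below on a hand-made sample: drop points outside a rough bounding box, then take the median coordinate per prefix. The box limits are a loose assumption rather than an official boundary, the outlier row is made up, and `zip_prefix_str` is a hypothetical column that restores the leading zeros lost when prefixes are read as integers.

```python
import pandas as pd

# Hand-made sample: three rows from the head plus one made-up outlier
geo = pd.DataFrame({
    "geolocation_zip_code_prefix": [1037, 1046, 1046, 1046],
    "geolocation_lat": [-23.545621, -23.546081, -23.546129, -36.6],
    "geolocation_lng": [-46.639292, -46.64482, -46.642951, -101.5],
})

# Rough bounding box around Brazil (assumed limits, deliberately generous)
in_box = (
    geo["geolocation_lat"].between(-34.0, 6.0)
    & geo["geolocation_lng"].between(-75.0, -28.0)
)
geo_clean = geo[in_box]

# One representative point per prefix; the median is less sensitive to stray points
zip_centroids = geo_clean.groupby("geolocation_zip_code_prefix", as_index=False)[
    ["geolocation_lat", "geolocation_lng"]
].median()

# Restore the 5-character prefix, e.g. 1037 -> "01037"
zip_centroids["zip_prefix_str"] = (
    zip_centroids["geolocation_zip_code_prefix"].astype(str).str.zfill(5)
)
print(zip_centroids)
```

The resulting one-row-per-prefix table can then be merged onto `customer_zip_code_prefix` or `seller_zip_code_prefix` without duplicating rows.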


## 9. Category Translation (`product_category_name_translation.csv`)
**Context:** Helper table to translate Portuguese category names to English.

In [16]:
df_trans = inspect_csv(
    "product_category_name_translation.csv",
    "Translation Map: product_category_name (PT) -> product_category_name_english (EN)"
)


📂 INSPECTING: product_category_name_translation.csv
ℹ️ CONTEXT: Translation Map: product_category_name (PT) -> product_category_name_english (EN)

Shape: (71, 2) (Rows, Cols)

--- 📋 Columns & Data Types ---
product_category_name            str
product_category_name_english    str
dtype: object

--- 🔍 Head (First 3 Rows) ---

| | product_category_name | product_category_name_english |
|---|---|---|
| 0 | beleza_saude | health_beauty |
| 1 | informatica_acessorios | computers_accessories |
| 2 | automotivo | auto |



--- 📉 Basic Stats ---

| | count | unique | top | freq |
|---|---|---|---|---|
| product_category_name | 71 | 71 | beleza_saude | 1 |
| product_category_name_english | 71 | 71 | health_beauty | 1 |
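
The translation table has 71 categories while the products table has 73, so at least two Portuguese category names have no English counterpart. A left merge with `indicator=True` is one way to surface them. The sketch below uses a hand-made sample; `categoria_sem_traducao` is a made-up category name, and falling back to the Portuguese name is an assumption about how to handle the gaps.

```python
import pandas as pd

# Hand-made sample: the third category is made up and has no translation
products = pd.DataFrame({
    "product_id": ["p1", "p2", "p3"],
    "product_category_name": ["beleza_saude", "automotivo", "categoria_sem_traducao"],
})
translation = pd.DataFrame({
    "product_category_name": ["beleza_saude", "automotivo"],
    "product_category_name_english": ["health_beauty", "auto"],
})

# Left join keeps every product; _merge marks rows without a translation
merged = products.merge(translation, on="product_category_name", how="left", indicator=True)
untranslated = (
    merged.loc[merged["_merge"] == "left_only", "product_category_name"].unique().tolist()
)

# Fall back to the Portuguese name where no English name exists (assumed policy)
merged["product_category_name_english"] = merged["product_category_name_english"].fillna(
    merged["product_category_name"]
)
print(untranslated)
print(merged[["product_id", "product_category_name_english"]])
```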
