# Olist Bronze Data Exploration

This notebook provides a business-focused and technical overview of the Olist e-commerce platform’s bronze data layer. The bronze layer contains raw, but structured, data ingested from Olist’s marketplace operations, including customers, orders, products, payments, reviews, sellers, and geolocation information.

## Business Context
Olist is a leading Brazilian e-commerce marketplace. Understanding and exploring the bronze data layer is crucial for:
* Gaining insights into customer behavior, order trends, and product performance
* Supporting marketing, operations, and customer service teams with accurate, up-to-date data
* Enabling advanced analytics, reporting, and machine learning use cases

## Notebook Purpose
This notebook demonstrates how to access, inspect, and validate the core bronze tables. It helps ensure data quality and provides a foundation for deeper analysis and business intelligence.

## Workflow Summary
1. **Load Data:** Read all main bronze Delta tables into Spark DataFrames for easy manipulation and analysis.
2. **Schema Inspection:** Print the schema of key tables (here, customers and orders) to understand their structure and available fields, which is essential for building queries and reports.
3. **Data Validation:** Count the records in key tables (e.g., customers and orders), then check for nulls and duplicate keys, to assess completeness and catch issues such as dropped or missing records.

By following this workflow, business analysts, data scientists, and other team members can quickly get familiar with the available data, validate its integrity, and prepare for further exploration or modeling from a consistent starting point.

In [0]:
customers_df = spark.read.table('olist_ecommerce.bronze.brz_customers')

In [0]:
orders_df = spark.read.table('olist_ecommerce.bronze.brz_orders')
order_items_df = spark.read.table('olist_ecommerce.bronze.brz_order_items')
products_df = spark.read.table('olist_ecommerce.bronze.brz_products')
payments_df = spark.read.table('olist_ecommerce.bronze.brz_order_payments')
reviews_df = spark.read.table('olist_ecommerce.bronze.brz_order_reviews')
sellers_df = spark.read.table('olist_ecommerce.bronze.brz_sellers')
geolocation_df = spark.read.table('olist_ecommerce.bronze.brz_geolocation')
product_category_name_translation_df = spark.read.table('olist_ecommerce.bronze.brz_product_category_name_translation')

In [0]:
customers_df.printSchema()

In [0]:
orders_df.printSchema()

In [0]:
# Row counts: check for data loss or unexpectedly dropped records

print(f'Customers : {customers_df.count()} rows')
print(f'Orders : {orders_df.count()} rows')
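
The same count check could be extended to every table loaded above. The sketch below is a hypothetical helper (`count_rows` and `empty_tables` are not part of the original notebook); it only assumes that each value exposes a `.count()` method, as Spark DataFrames do.

```python
def count_rows(tables):
    """Return {table name: row count} for objects that expose .count()."""
    return {name: df.count() for name, df in tables.items()}


def empty_tables(counts):
    """Return the names of tables with zero rows, sorted alphabetically."""
    return sorted(name for name, n in counts.items() if n == 0)
```

In the notebook this might be called as `counts = count_rows({"customers": customers_df, "orders": orders_df})`, followed by `empty_tables(counts)` to list any table that loaded with zero rows.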

In [0]:
customers_df.columns

In [0]:
from pyspark.sql.functions import col

# Flag nulls per row for every column (show() prints only the first 20 rows)
customers_df.select([col(c).isNull().alias(c) for c in customers_df.columns]).show()

In [0]:
from pyspark.sql.functions import col, count, when

# Count nulls per column across the whole table
customers_df.select([count(when(col(c).isNull(), 1)).alias(c) for c in customers_df.columns]).show()


In [0]:
# Find customer_id values that appear more than once
duplicate_customer_ids_df = customers_df.groupBy("customer_id").count().filter(col("count") > 1)
display(duplicate_customer_ids_df)
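
Uniqueness is one half of key validation; the other is whether every order points at a known customer. A minimal sketch, assuming both tables share a `customer_id` column (the helper name `find_orphans` is hypothetical and not part of the original notebook):

```python
def find_orphans(child_df, parent_df, key):
    """Return rows of child_df whose `key` value has no match in parent_df.

    Uses a left anti join, which keeps only the left-side rows that find
    no partner on the right. Rows with a null key never match, so they
    are returned as well.
    """
    return child_df.join(parent_df.select(key), on=key, how="left_anti")
```

For example, `find_orphans(orders_df, customers_df, "customer_id").count()` would return 0 if every order references an existing customer.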

In [0]:
# Customer Distribution by state
customer_state_distribution_df = customers_df.groupBy("customer_state").count().orderBy("count", ascending=False)
display(customer_state_distribution_df)

In [0]:
orders_df.show()

In [0]:
# Order - Order status distribution
order_status_distribution_df = orders_df.groupBy("order_status").count().orderBy("count", ascending=False)
display(order_status_distribution_df)

In [0]:
# Payments

payments_df.show()

In [0]:
# Payment type distribution
payment_type_distribution_df = payments_df.groupBy('payment_type').count().orderBy('count', ascending=False)
display(payment_type_distribution_df)

# Top Selling Products

In [0]:
order_items_df.show()

In [0]:
from pyspark.sql.functions import sum as spark_sum  # alias avoids shadowing Python's built-in sum

# Sum item prices per product and display the top 20 products by total sales
top_products = order_items_df.groupBy('product_id').agg(spark_sum('price').alias('total_sales'))
display(top_products.orderBy('total_sales', ascending=False).limit(20))

In [0]:
# Average Delivery Time Analysis 

delivery_df = orders_df.select('order_id','order_purchase_timestamp','order_delivered_customer_date')

In [0]:
delivery_df.show()

In [0]:
from pyspark.sql.functions import datediff, col

# Delivery time in whole calendar days per order; null when an order has no delivery date
delivery_detail_df = delivery_df.withColumn(
    'delivery_time',
    datediff(col('order_delivered_customer_date'), col('order_purchase_timestamp'))
)

display(delivery_detail_df)
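
Spark's `datediff` compares calendar dates rather than elapsed time, and it returns null when either input is null. The plain-Python sketch below mimics that behaviour so the semantics are easy to see; `calendar_day_diff` and the sample timestamps are illustrative assumptions, not values from the Olist tables.

```python
from datetime import datetime


def calendar_day_diff(end, start):
    """Mimic datediff(end, start): difference between the calendar dates.

    The time of day is discarded before subtracting, and None is
    returned when either timestamp is missing.
    """
    if end is None or start is None:
        return None
    return (end.date() - start.date()).days


# Hypothetical order: purchased late at night, delivered 20 minutes later
purchase = datetime(2018, 1, 1, 23, 50)
delivered = datetime(2018, 1, 2, 0, 10)

days = calendar_day_diff(delivered, purchase)  # 1, because the date changed
elapsed = delivered - purchase                 # only 20 minutes of real time
undelivered = calendar_day_diff(None, purchase)  # None, like a null in Spark
```

If sub-day precision matters, one option is to subtract the timestamps as epoch seconds instead of using `datediff`.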

In [0]:
display(delivery_detail_df.orderBy('delivery_time', ascending=False))

In [0]:
delivery_detail_df.orderBy('delivery_time',ascending=False).show()
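
The cells above list per-order delivery times but stop short of the average itself. A minimal sketch of that last step, assuming `delivery_detail_df` has the `delivery_time` column created earlier (the function name `average_delivery_days` is hypothetical):

```python
def average_delivery_days(delivery_detail_df):
    """Return the mean of the delivery_time column, or None.

    Spark's avg ignores null values, so orders without a delivery date
    do not affect the result. None is returned when no row has a value.
    """
    row = delivery_detail_df.agg({"delivery_time": "avg"}).first()
    return row[0]
```

Calling `average_delivery_days(delivery_detail_df)` in the notebook would give a single number in days. Because undelivered orders are skipped, it describes only completed deliveries.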