# Olist Data Ingestion Notebook

This notebook demonstrates how Olist’s raw e-commerce data is ingested into Delta tables and organized for business use. The Olist dataset comes from a real Brazilian marketplace, and this workflow is designed to help business teams, analysts, and data scientists make data-driven decisions that improve customer experience, streamline operations, and drive growth.

## Business Value & Collaboration
- **Unified Data Foundation:** By automating the ingestion of all raw data files into structured Delta tables, we create a single source of truth for the business. This ensures that marketing, operations, finance, and analytics teams all work from consistent, up-to-date information.
- **Faster Insights:** Clean, well-structured data enables rapid reporting on key metrics such as order volume, customer retention, seller performance, and payment trends. This supports timely business decisions and competitive advantage.
- **Scalability & Trust:** Delta tables provide ACID transactions and table versioning, which keeps the data reliable as the business grows and reduces manual errors and rework for all teams.

## Workflow Overview
1. **Bulk Ingestion:** All main Olist CSV files are loaded into Spark DataFrames and saved as Delta tables in the bronze layer. This step lands the raw data in one consistent table format and prepares it for deeper analysis.
2. **Accessible Data:** The notebook demonstrates how to access and inspect these tables, making it easy for any team member to explore customer, order, and product data.
3. **Foundation for Analytics:** With this foundation, teams can build dashboards, run advanced analytics, and develop machine learning models to optimize marketing, logistics, and customer service.

This approach ensures that everyone in the organization—from business analysts to data engineers—can collaborate efficiently, trust the data, and focus on delivering value to customers and the business.
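
The bulk-ingestion step pairs each CSV file with a bronze table name, and the names used in this notebook follow a regular pattern: drop the `olist_` prefix and the `_dataset.csv` suffix, then add `brz_`. As a minimal sketch, the pairing could be derived instead of maintained as two parallel lists. The helper `bronze_table_name` is hypothetical and not part of the notebook; it assumes the naming pattern keeps holding for any files added later.

```python
def bronze_table_name(file_name, prefix="brz_"):
    """Derive a bronze table name from a raw CSV file name (hypothetical helper)."""
    stem = file_name
    # Drop the file extension
    if stem.endswith(".csv"):
        stem = stem[: -len(".csv")]
    # Drop the Olist-specific prefix and suffix when present
    if stem.startswith("olist_"):
        stem = stem[len("olist_"):]
    if stem.endswith("_dataset"):
        stem = stem[: -len("_dataset")]
    return prefix + stem


# Three of the file names ingested by this notebook
example_files = [
    "olist_customers_dataset.csv",
    "olist_order_items_dataset.csv",
    "product_category_name_translation.csv",
]

# Map each file name to its derived bronze table name
file_to_table = {name: bronze_table_name(name) for name in example_files}
for name, table in file_to_table.items():
    print(f"{name} -> {table}")
```

Deriving the names keeps files and tables from drifting out of step, since `zip` silently stops at the shorter list when two parallel lists differ in length.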

In [0]:
dataset_names = [
    "olist_customers_dataset.csv",
    "olist_orders_dataset.csv",
    "olist_order_items_dataset.csv",
    "olist_order_payments_dataset.csv",
    "olist_order_reviews_dataset.csv",
    "olist_products_dataset.csv",
    "olist_sellers_dataset.csv",
    "olist_geolocation_dataset.csv",
    "product_category_name_translation.csv"
]

catalog_name = "olist_ecommerce"

bronze_table_names = [
    "brz_customers",
    "brz_orders",
    "brz_order_items",
    "brz_order_payments",
    "brz_order_reviews",
    "brz_products",
    "brz_sellers",
    "brz_geolocation",
    "brz_product_category_name_translation"
]

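# Volume directory that holds the raw Olist CSV files
raw_path = f"/Volumes/{catalog_name}/olist_source_data/olist_raw"

# Load each CSV (header row, inferred column types) and save it as a bronze Delta table
for name, bronze_table in zip(dataset_names, bronze_table_names):
    df = spark.read.csv(f"{raw_path}/{name}", header=True, inferSchema=True)
    df.write.format("delta") \
        .mode("overwrite") \
        .option("mergeSchema", "true") \
        .saveAsTable(f"{catalog_name}.bronze.{bronze_table}")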

In [0]:
customers_df = spark.read.table('olist_ecommerce.bronze.brz_customers')

In [0]:
# printSchema() prints the schema itself and returns None, so it is not wrapped in print()
customers_df.printSchema()
customers_df.show(10)