# Olist E-commerce Data Exploration and Understanding

This notebook gives a first look at the Olist e-commerce dataset. It is meant to help new users and collaborators understand the structure and contents of the raw Olist data files and the initial steps for exploring them.

## Notebook Purpose
The main goal is to load, inspect, and explore the raw CSV files provided by Olist. This includes reading each dataset into a Spark DataFrame, printing its schema, and displaying sample records. These steps are essential for:
* Understanding the available data and its structure
* Identifying key tables and relationships
* Verifying data quality and schema inference
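
One way to act on the last point is to compare the types Spark inferred against the types you expect. The sketch below is a minimal, hypothetical helper: `schema_mismatches` and the `expected` mapping are not part of the original notebook. It takes the `(column, type)` pairs that a DataFrame's `dtypes` attribute returns, so it can be tried on plain Python data without a Spark session. The column names in the example are assumptions based on the public Olist dataset and should be checked against the printed schema.

```python
def schema_mismatches(actual_dtypes, expected):
    """Return {column: (expected_type, actual_type)} for every column that
    is missing or whose inferred type differs from the expected one."""
    actual = dict(actual_dtypes)
    mismatches = {}
    for column, expected_type in expected.items():
        actual_type = actual.get(column)  # None when the column is absent
        if actual_type != expected_type:
            mismatches[column] = (expected_type, actual_type)
    return mismatches


# Illustrative input in the shape of `df.dtypes`; the values are made up.
example_dtypes = [("customer_id", "string"), ("customer_zip_code_prefix", "int")]
# Assumption: zip code prefixes should stay strings so leading zeros survive.
expected = {"customer_id": "string", "customer_zip_code_prefix": "string"}

print(schema_mismatches(example_dtypes, expected))
```

In the notebook this could be called as `schema_mismatches(customers_df.dtypes, expected)` once the customers file has been loaded.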

## Workflow Summary
1. Load the customers dataset as a starting point and inspect its schema and sample records.
2. Loop through all main Olist raw datasets, loading each into a DataFrame, printing its schema, and displaying a sample of rows.
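
The two steps above could be wrapped in small helpers. This is a sketch, not the notebook's own code: `BASE_PATH`, `dataset_path`, and `load_datasets` are hypothetical names, and `spark` is assumed to be the active `SparkSession` that the notebook environment provides. Storing each DataFrame in a dictionary keeps it available after the loop finishes.

```python
# Directory of the raw files, taken from the paths used in this notebook.
BASE_PATH = "/Volumes/olist_ecommerce/olist_source_data/olist_raw"


def dataset_path(name, base_path=BASE_PATH):
    """Build the full path of one raw CSV file."""
    return f"{base_path.rstrip('/')}/{name}"


def load_datasets(spark, names, base_path=BASE_PATH):
    """Read each CSV with a header row and inferred column types.

    Returns a dict that maps each file name to its DataFrame.
    """
    return {
        name: spark.read.csv(dataset_path(name, base_path), header=True, inferSchema=True)
        for name in names
    }
```

A call such as `frames = load_datasets(spark, dataset_names)` would then allow `frames["olist_orders_dataset.csv"].printSchema()` for any single table.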

This process provides a foundation for further analytics, data cleaning, and transformation tasks. All code is annotated and organized for clarity, making it easy for anyone to follow and extend the analysis.

In [0]:
# Load the customers CSV: the first row holds the column names, and column types are inferred.
customers_df = spark.read.csv('/Volumes/olist_ecommerce/olist_source_data/olist_raw/olist_customers_dataset.csv', header=True, inferSchema=True)

In [0]:
# printSchema() prints the schema itself and returns None, so it is not wrapped in print().
customers_df.printSchema()
customers_df.show(10)

In [0]:
dataset_names = [
    "olist_orders_dataset.csv",
    "olist_order_items_dataset.csv",
    "olist_order_payments_dataset.csv",
    "olist_order_reviews_dataset.csv",
    "olist_products_dataset.csv",
    "olist_sellers_dataset.csv",
    "olist_geolocation_dataset.csv",
    "product_category_name_translation.csv"
]

# Load each raw file, then print its inferred schema and display a 10-row sample.
for name in dataset_names:
    print(f"Dataset: {name}")
    df = spark.read.csv(f'/Volumes/olist_ecommerce/olist_source_data/olist_raw/{name}', header=True, inferSchema=True)
    df.printSchema()
    display(df.limit(10))
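
With the schemas printed, a quick way to spot candidate join keys is to look for column names that appear in more than one table. The helper below is a hypothetical sketch (`shared_columns` is not part of the original notebook). It works on plain lists of column names, such as each DataFrame's `columns` attribute. A shared name only suggests a relationship and should be confirmed against the data. The table and column names in the example are placeholders.

```python
from collections import defaultdict


def shared_columns(columns_by_table):
    """Map each column name found in two or more tables to the sorted
    list of tables that contain it."""
    tables_by_column = defaultdict(set)
    for table, columns in columns_by_table.items():
        for column in columns:
            tables_by_column[column].add(table)
    return {
        column: sorted(tables)
        for column, tables in tables_by_column.items()
        if len(tables) > 1
    }


# Placeholder input; in the notebook each list would come from `df.columns`.
example = {
    "orders": ["order_id", "customer_id"],
    "order_items": ["order_id", "product_id"],
    "customers": ["customer_id", "customer_city"],
}
print(shared_columns(example))
```

In the loop above, `df.columns` could be stored under `name` in a dictionary on each iteration and passed to `shared_columns` afterwards.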