In [0]:
spark

## Bronze Layer – Raw Data Ingestion

The Bronze layer is responsible for ingesting raw e-commerce data from the source into the data lake without applying any transformations.  
This ensures that the original data is preserved for traceability, reprocessing, and auditing purposes.
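
One caveat when landing raw CSV files: with `inferSchema` enabled, Spark chooses column types itself, which may change how some values are represented (for example, a zip code prefix with a leading zero could be read as an integer). The following is a minimal sketch of a stricter raw read, assuming it is acceptable to land every column as a string. `RAW_READ_OPTIONS` and `read_raw_csv` are hypothetical names and are not part of this notebook.

```python
# Reader options for a raw read: keep the header row, but do not let Spark
# guess column types, so every column arrives as a string.
RAW_READ_OPTIONS = {"header": "true", "inferSchema": "false"}


def read_raw_csv(spark, path):
    """Read one CSV file with all columns as strings (hypothetical helper)."""
    return spark.read.options(**RAW_READ_OPTIONS).csv(path)
```

In a notebook cell this could be called as `read_raw_csv(spark, "abfss://olist-data@retailds.dfs.core.windows.net/raw/olist/olist_orders_dataset.csv")`. Type casting would then be deferred to downstream processing.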


In [0]:
# List the files in the olist-data container
dbutils.fs.ls("abfss://olist-data@retailds.dfs.core.windows.net/")

### Data Source
- Dataset: Olist Brazilian E-Commerce Dataset
- Storage: Azure Data Lake Storage Gen2
- Format: CSV files
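
Because the same container and storage account appear in every path in this notebook, the locations can be built in one place. This is only a sketch: the container, account, and folder names are copied from the paths used in the cells here, while `raw_path` and `bronze_path` are hypothetical helper names.

```python
# Values taken from the abfss:// paths used throughout this notebook.
CONTAINER = "olist-data"
STORAGE_ACCOUNT = "retailds"
BASE_URI = f"abfss://{CONTAINER}@{STORAGE_ACCOUNT}.dfs.core.windows.net"


def raw_path(table):
    """Path of a raw Olist CSV file, e.g. olist_orders_dataset.csv (hypothetical helper)."""
    return f"{BASE_URI}/raw/olist/olist_{table}_dataset.csv"


def bronze_path(table):
    """Path of the Bronze Delta folder for a table (hypothetical helper)."""
    return f"{BASE_URI}/bronze/{table}"
```

For example, `raw_path("orders")` yields the CSV path read for the orders dataset, and `bronze_path("orders")` yields the folder its Delta output is written to.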

In [0]:
# Read the orders CSV file from the raw/olist/ folder
df = spark.read \
  .option("header", "true") \
  .option("inferSchema", "true") \
  .csv("abfss://olist-data@retailds.dfs.core.windows.net/raw/olist/olist_orders_dataset.csv")

In [0]:
# Display the first 5 rows of orders dataframe
display(df.limit(5))

In [0]:
# Print the schema of orders dataframe
df.printSchema()

In [0]:
# Count the number of rows in orders dataframe
df.count()

### Bronze Storage Strategy
The raw data is stored in Delta format in the Bronze layer to enable schema enforcement, versioning, and efficient downstream processing.
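
To illustrate the versioning point, Delta keeps a transaction history for each table, and an earlier version can be read back with the `versionAsOf` reader option. The sketch below assumes the orders table has already been written to the `bronze/orders` path used in this notebook. `history_sql` and `read_version` are hypothetical helper names.

```python
# Bronze location of the orders table, as used elsewhere in this notebook.
BRONZE_ORDERS = "abfss://olist-data@retailds.dfs.core.windows.net/bronze/orders"


def history_sql(path):
    """Build a DESCRIBE HISTORY statement for a path-based Delta table."""
    return f"DESCRIBE HISTORY delta.`{path}`"


def read_version(spark, path, version):
    """Load one specific version of a Delta table (time travel)."""
    return spark.read.format("delta").option("versionAsOf", version).load(path)
```

In a notebook cell, `spark.sql(history_sql(BRONZE_ORDERS)).display()` would list the recorded writes, and `read_version(spark, BRONZE_ORDERS, 0)` would load the first one, provided that version is still retained.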


In [0]:
# Write orders dataframe into bronze/ folder
df.write \
  .format("delta") \
  .mode("overwrite") \
  .option("overwriteSchema", "true") \
  .save("abfss://olist-data@retailds.dfs.core.windows.net/bronze/orders")

In [0]:
# Read the remaining CSV files from the raw/olist/ folder and write each one to the bronze/ folder
tables = [
    "customers",
    "geolocation",
    "order_items",
    "order_payments",
    "order_reviews",
    "products",
    "sellers",
]
for table in tables:
    # Raw file names follow the pattern olist_<table>_dataset.csv
    df = spark.read \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .csv(f"abfss://olist-data@retailds.dfs.core.windows.net/raw/olist/olist_{table}_dataset.csv")

    # Overwrite the Delta table for this dataset under bronze/<table>
    df.write \
        .format("delta") \
        .mode("overwrite") \
        .option("overwriteSchema", "true") \
        .save(f"abfss://olist-data@retailds.dfs.core.windows.net/bronze/{table}")
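
Writing by path does not by itself create a table named `bronze.<table>` in the metastore, so SQL queries against names such as `bronze.orders` depend on the tables being registered somewhere. If that has not been done elsewhere, one possible approach is sketched below. It assumes a schema called `bronze` is wanted and that the workspace allows external tables at this location; depending on how the metastore is configured, additional setup may be needed. `register_statements` and `register_table` are hypothetical helper names.

```python
# Base location copied from the paths used in this notebook.
BASE_URI = "abfss://olist-data@retailds.dfs.core.windows.net"


def register_statements(table, schema="bronze"):
    """Return SQL that exposes a Bronze Delta folder as <schema>.<table>."""
    return [
        f"CREATE SCHEMA IF NOT EXISTS {schema}",
        f"CREATE TABLE IF NOT EXISTS {schema}.{table} "
        f"USING DELTA LOCATION '{BASE_URI}/bronze/{table}'",
    ]


def register_table(spark, table):
    """Run the registration statements (needs a live Spark session)."""
    for statement in register_statements(table):
        spark.sql(statement)
```

Calling `register_table(spark, "orders")` in a cell would then make `bronze.orders` resolvable by name.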


In [0]:
# Count the number of rows in the orders table
# (assumes bronze.orders is registered in the metastore over the bronze/orders Delta path)
spark.sql("""select count(*) from bronze.orders""").display()

In [0]:
# Display the first 5 rows of the orders table
# (assumes bronze.orders is registered in the metastore over the bronze/orders Delta path)
spark.sql("""select * from bronze.orders limit 5""").display()