## Bronze Layer: Data Ingestion and Raw Storage (The Source of Truth)

The **Bronze Layer** is the foundational layer of the Medallion Lakehouse Architecture. Its purpose is simple: to ingest the raw source data (our individual resume text files) with minimal transformation and store it reliably as a Delta Lake table.

The data here is kept as close to its original form as possible. If we ever need to re-process the data from scratch because of new cleansing requirements, we always return to the Bronze Layer. The output of this layer is the `bronze_resumes` Delta table.

### Data Ingestion: Reading Files from the Volume

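This section performs the initial data load. We use the PySpark `text` format to read all individual resume files (`*.txt`) from the secure Volume location. The `text` source places the content in a single string column named `value`, which we alias as `Resume_Text`. By default it emits one row per line, so the `wholetext` option must be enabled for each file to arrive as a single row holding the entire resume. We then add a unique ID column, `Resume_ID`, with `monotonically_increasing_id()`; these IDs are guaranteed to be unique and increasing, but not consecutive.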
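To see why `Resume_ID` values are unique but not consecutive, here is a pure-Python sketch of the layout that the Spark documentation describes for `monotonically_increasing_id`: the partition index in the upper 31 bits and the row's position within its partition in the lower 33 bits. The helper name `simulated_id` is hypothetical and only mimics that documented layout; it is not a Spark API, and the two-partition split is an illustrative assumption.

```python
def simulated_id(partition_index: int, row_in_partition: int) -> int:
    """Mimic the documented bit layout of monotonically_increasing_id."""
    # Upper 31 bits: partition index; lower 33 bits: row position in that partition.
    return (partition_index << 33) + row_in_partition


# Two partitions with three rows each (an illustrative split, not real data)
ids = [simulated_id(p, r) for p in range(2) for r in range(3)]
print(ids)  # [0, 1, 2, 8589934592, 8589934593, 8589934594]
```

The IDs jump at each partition boundary, and they can change if the files are re-read with a different partitioning, so `Resume_ID` is best treated as a surrogate key for a single load rather than a stable identifier.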

Crucially, the data is saved as a **Delta Lake** table. Delta Lake stores the data as Parquet files alongside a transaction log, which provides ACID transactions, schema enforcement, and table versioning, all core principles of the modern Lakehouse.
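Because the write in this notebook uses `mode("overwrite")`, the Delta transaction log is what keeps earlier loads recoverable. The sketch below assumes the notebook's `spark` session and the `bronze_resumes` table. The helper names `bronze_history` and `bronze_at_version` are hypothetical, and `VERSION AS OF` assumes a runtime whose SQL dialect supports Delta time travel, as Databricks does.

```python
def bronze_history(spark, table_name="bronze_resumes"):
    """Return one row per commit (version, timestamp, operation, ...)."""
    return spark.sql(f"DESCRIBE HISTORY {table_name}")


def bronze_at_version(spark, version, table_name="bronze_resumes"):
    """Return the table as it was at the given commit version."""
    # int() guards against anything other than a version number reaching the SQL text
    return spark.sql(f"SELECT * FROM {table_name} VERSION AS OF {int(version)}")
```

In the notebook, `bronze_history(spark).show(truncate=False)` would list the commits, and `bronze_at_version(spark, 0)` would read the table as first written.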

In [0]:
```python
# Cell 1: Ingest Text Data (Bronze Layer)

# Import PySpark functions and types
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType
import re

# 1. Configuration: glob matching every resume text file in the Volume
volume_dir_path = "/Volumes/workspace/default/raw_resumes/*.txt"

print(f"Attempting to read all text files from: {volume_dir_path}")

try:
    # 2. Read the text files into a PySpark DataFrame.
    # wholetext=true yields one row per file; the default is one row per line.
    raw_df = (
        spark.read.format("text")
        .option("wholetext", "true")
        .load(volume_dir_path)
    )

    # Assign a unique (not necessarily consecutive) ID and alias the text column
    bronze_df = raw_df.select(
        F.monotonically_increasing_id().alias("Resume_ID"),
        F.col("value").alias("Resume_Text"),
    )

    print(f"\n✅ Successfully loaded {bronze_df.count()} resumes.")

    # 3. Save the data to the Bronze Delta table, replacing any previous load
    bronze_df.write.format("delta").mode("overwrite").saveAsTable("bronze_resumes")
    print("\n💾 Bronze table 'bronze_resumes' created successfully.")

except Exception as e:
    print(f"\n❌ Error loading data. Please ensure your compute is running and the path is correct. Error: {e}")
```

Attempting to read all text files from: /Volumes/workspace/default/raw_resumes/*.txt

✅ Successfully loaded 541 resumes.

💾 Bronze table 'bronze_resumes' created successfully.
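
A quick check after the load can confirm that the table holds one non-empty row per resume. This is a minimal sketch assuming the notebook's `spark` session; the helper name `bronze_quality_summary` and the specific checks are illustrative assumptions, not part of the original notebook.

```python
def bronze_quality_summary(spark, table_name="bronze_resumes"):
    """Return a one-row DataFrame with basic row-count checks."""
    query = f"""
        SELECT
            COUNT(*) AS total_rows,
            COUNT(DISTINCT Resume_ID) AS distinct_ids,
            SUM(CASE WHEN TRIM(Resume_Text) = '' THEN 1 ELSE 0 END) AS empty_texts
        FROM {table_name}
    """
    return spark.sql(query)
```

Running `bronze_quality_summary(spark).show()` in the notebook should report `total_rows` equal to `distinct_ids` and `empty_texts` equal to 0 if the ingestion behaved as intended.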
