# Working with the Bronze layer: what it actually means

The Bronze layer is NOT for analytics.
It is for understanding, validating, and safely preparing raw data.

In projects, working with Bronze means:

- Validate data arrived correctly

- Understand structure & variability

- Profile resource types

- Prepare controlled views for Silver

- Detect schema drift

- Ensure nothing is lost


## Verify files inside the Volume (Sanity check)

Expected contents: 117 JSON files, one per patient.

In [0]:

dbutils.fs.ls("/Volumes/angad_kumar91/fhir_healthcare_analytics_rawdataset/raw_fhir")
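
The listing can also be checked programmatically rather than by eye. The sketch below is a minimal, plain-Python helper; the function name `check_json_listing` and the sample file names are hypothetical, and it assumes the entries returned by `dbutils.fs.ls` expose a `name` attribute.

```python
from typing import Iterable, List, Tuple


def check_json_listing(names: Iterable[str], expected_count: int) -> Tuple[bool, List[str]]:
    """Return (count_matches, unexpected_names) for a directory listing."""
    names = list(names)
    json_names = [n for n in names if n.lower().endswith(".json")]
    unexpected = [n for n in names if not n.lower().endswith(".json")]
    return len(json_names) == expected_count, unexpected


# In the notebook (assumption: each listing entry has a `name` attribute):
#   names = [f.name for f in dbutils.fs.ls(volume_path)]
#   ok, unexpected = check_json_listing(names, expected_count=117)

# Hypothetical sample listing, for illustration only
sample_names = ["patient_001.json", "patient_002.json", "_SUCCESS"]
ok, unexpected = check_json_listing(sample_names, expected_count=2)
print(ok, unexpected)
```

Returning the unexpected names, not just a boolean, makes it easier to see what else landed in the Volume.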


## Read RAW FHIR JSON from Volume

multiLine=true → FHIR JSON spans multiple lines

*.json → read all 117 patients

Spark creates one row per file

In [0]:
raw_df = spark.read \
    .option("multiLine", "true") \
    .json("/Volumes/angad_kumar91/fhir_healthcare_analytics_rawdataset/raw_fhir/*.json")


## Inspect raw structure

The printed schema should confirm:

Each row = one FHIR Bundle

entry contains all of the patient's resources (Patient, Encounter, Condition, etc.)

In [0]:
raw_df.printSchema()
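
For a spot check on a single file outside Spark, the same expectations can be expressed in plain Python. This is a sketch under assumptions: `bundle_problems` is a hypothetical helper, and the sample document is made up to show one well-formed and one malformed entry.

```python
import json
from typing import Any, Dict, List


def bundle_problems(doc: Dict[str, Any]) -> List[str]:
    """List structural problems that would break the one-Bundle-per-row assumption."""
    problems = []
    if doc.get("resourceType") != "Bundle":
        problems.append(f"resourceType is {doc.get('resourceType')!r}, expected 'Bundle'")
    entries = doc.get("entry")
    if not isinstance(entries, list) or not entries:
        problems.append("entry is missing or empty")
        return problems
    for i, entry in enumerate(entries):
        resource = entry.get("resource") if isinstance(entry, dict) else None
        if not isinstance(resource, dict) or "resourceType" not in resource:
            problems.append(f"entry[{i}] has no resource.resourceType")
    return problems


# Made-up document: the second entry lacks resourceType
sample_doc = json.loads("""
{
  "resourceType": "Bundle",
  "type": "transaction",
  "entry": [
    {"resource": {"resourceType": "Patient", "id": "p1"}},
    {"resource": {"id": "x"}}
  ]
}
""")
problems = bundle_problems(sample_doc)
print(problems)
```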


## Add ingestion metadata (Healthcare best practice)

In healthcare, every ingested record must be able to answer:

Which file did this record come from?

When was it ingested?

In [0]:
from pyspark.sql.functions import current_timestamp, col

bronze_ready_df = raw_df \
    .withColumn("source_file", col("_metadata.file_path")) \
    .withColumn("ingest_time", current_timestamp())


## Create SQL table

Registering the Bronze table in Unity Catalog means:

- The table can be queried using SQL

- Unity Catalog manages permissions

- The table follows enterprise Databricks standards

In [0]:
%sql
CREATE TABLE IF NOT EXISTS angad_kumar91.fhir_healthcare_analytics_bronze.fhir_bundle
USING DELTA;

### Understanding what a Bronze row looks like


resourceType → always "Bundle"

type → "transaction"

source_file → full path of the patient JSON file (taken from _metadata.file_path)

ingest_time → load timestamp

This confirms Bronze metadata is correct.



In [0]:
bronze_ready_df.write \
    .format("delta") \
    .option("mergeSchema", "true") \
    .mode("append") \
    .saveAsTable(
        "angad_kumar91.fhir_healthcare_analytics_bronze.fhir_bundle"
    )



In [0]:
%sql
-- DESCRIBE DETAIL angad_kumar91.fhir_healthcare_analytics_bronze.fhir_bundle;


DESCRIBE EXTENDED angad_kumar91.fhir_healthcare_analytics_bronze.fhir_bundle


In [0]:
# Python equivalent of the commented-out DESCRIBE DETAIL statement above
display(
    spark.sql(
        "DESCRIBE DETAIL angad_kumar91.fhir_healthcare_analytics_bronze.fhir_bundle"
    )
)

In [0]:
%sql
SELECT COUNT(*) 
FROM angad_kumar91.fhir_healthcare_analytics_bronze.fhir_bundle;


In [0]:
%sql
select * from angad_kumar91.fhir_healthcare_analytics_bronze.fhir_bundle;

### What needs a small correction or confirmation for Unity Catalog

1️⃣ STEP 2 – Querying Bronze table (✅ OK)

✔ Works in Unity Catalog
✔ Confirms metadata
✔ No change needed

In [0]:
%sql
select * from angad_kumar91.fhir_healthcare_analytics_bronze.fhir_bundle
limit 1;

In [0]:
%sql
SELECT
  resourceType,
  type,
  source_file,
  ingest_time
FROM angad_kumar91.fhir_healthcare_analytics_bronze.fhir_bundle
LIMIT 5;


2️⃣ STEP 3 – Inspect ENTRY array (✅ OK)

✔ Correct
✔ Unity Catalog compatible
✔ No change needed

In [0]:
bronze_df = spark.table(
    "angad_kumar91.fhir_healthcare_analytics_bronze.fhir_bundle"
)

bronze_df.selectExpr("size(entry) as resource_count").show(10)

3️⃣ STEP 4 – Explode ENTRY temporarily (✅ OK)

✔ Correct
✔ Temporary DataFrame only
✔ No storage
✔ No UC issues

In [0]:
from pyspark.sql.functions import explode, col

exploded_df = bronze_df \
    .select(
        col("source_file"),
        col("ingest_time"),
        explode("entry").alias("entry")
    )

4️⃣ STEP 5 – Identify resource types (✅ OK)

✔ Correct
✔ Confirms FHIR event-based design
✔ No change needed

In [0]:
resource_distribution = exploded_df \
    .select(col("entry.resource.resourceType").alias("resource_type")) \
    .groupBy("resource_type") \
    .count() \
    .orderBy("count", ascending=False)

resource_distribution.show(truncate=False)
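
To cross-check the Spark result against a single source file, the same distribution can be computed in plain Python. This is a minimal sketch: `resource_type_counts` is a hypothetical helper and the sample Bundle is made up.

```python
from collections import Counter
from typing import Any, Dict


def resource_type_counts(bundle: Dict[str, Any]) -> Counter:
    """Count resources per resourceType in one parsed FHIR Bundle."""
    return Counter(
        entry.get("resource", {}).get("resourceType", "<missing>")
        for entry in bundle.get("entry", [])
    )


# Made-up Bundle for illustration
sample_bundle = {
    "resourceType": "Bundle",
    "entry": [
        {"resource": {"resourceType": "Patient"}},
        {"resource": {"resourceType": "Encounter"}},
        {"resource": {"resourceType": "Observation"}},
        {"resource": {"resourceType": "Observation"}},
    ],
}
counts = resource_type_counts(sample_bundle)
print(counts.most_common())
```

Summing these per-file counts over all files should match the totals that `resource_distribution` reports.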

5️⃣ STEP 6 – Repetition per patient (✅ OK)

✔ Correct
✔ This is proper Bronze data-quality validation
✔ No change needed

In [0]:
per_patient_resources = exploded_df \
    .groupBy("source_file", "entry.resource.resourceType") \
    .count() \
    .orderBy("source_file")

per_patient_resources.show(20, truncate=False)

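6️⃣ STEP 7 – Validate FHIR references (⚠️ small readability fix)

Original code (works, but both reference columns come out named "reference", and Spark raises an AnalysisException if subject or encounter is absent from the inferred schema altogether)


In [0]:
exploded_df.select(
    col("entry.resource.resourceType"),
    col("entry.resource.subject.reference"),
    col("entry.resource.encounter.reference")
).show(20, truncate=False)

✅ Clearer UC-friendly version (recommended)

Some FHIR resources do not have subject or encounter. As long as the field exists somewhere in the schema, Spark returns null for those rows rather than failing, so the fix is to alias the columns:

✔ Same logic
✔ Cleaner output
✔ Avoids confusion when values are null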

In [0]:
exploded_df.select(
    col("entry.resource.resourceType").alias("resource_type"),
    col("entry.resource.subject.reference").alias("subject_ref"),
    col("entry.resource.encounter.reference").alias("encounter_ref")
).show(20, truncate=False)
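
Displaying the references does not show whether they point at anything. The sketch below checks, for one parsed Bundle in plain Python, that each subject and encounter reference matches an entry in the same Bundle. It assumes references resolve either against an entry's fullUrl or against "ResourceType/id"; `unresolved_references` and the sample data are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def unresolved_references(bundle: Dict[str, Any]) -> List[Tuple[str, str, str]]:
    """Return (resource_type, field, reference) for references with no target in the Bundle."""
    entries = bundle.get("entry", [])

    # Collect every identifier a reference could legitimately point at
    known = set()
    for entry in entries:
        resource = entry.get("resource", {})
        if entry.get("fullUrl"):
            known.add(entry["fullUrl"])
        if resource.get("resourceType") and resource.get("id"):
            known.add(f"{resource['resourceType']}/{resource['id']}")

    unresolved = []
    for entry in entries:
        resource = entry.get("resource", {})
        for field in ("subject", "encounter"):
            ref_obj = resource.get(field)
            # Resources without the field are skipped, mirroring the nulls Spark shows
            ref = ref_obj.get("reference") if isinstance(ref_obj, dict) else None
            if ref and ref not in known:
                unresolved.append((resource.get("resourceType", "<missing>"), field, ref))
    return unresolved


# Made-up Bundle: the Observation points at an encounter that is not present
sample_bundle = {
    "entry": [
        {"fullUrl": "urn:uuid:p1", "resource": {"resourceType": "Patient", "id": "p1"}},
        {"fullUrl": "urn:uuid:e1", "resource": {"resourceType": "Encounter", "id": "e1",
                                                "subject": {"reference": "urn:uuid:p1"}}},
        {"fullUrl": "urn:uuid:o1", "resource": {"resourceType": "Observation", "id": "o1",
                                                "subject": {"reference": "Patient/p1"},
                                                "encounter": {"reference": "urn:uuid:e9"}}},
    ]
}
dangling = unresolved_references(sample_bundle)
print(dangling)
```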


7️⃣ STEP 8 – Schema drift detection (✅ OK)

✔ Correct
✔ This huge schema is EXPECTED for FHIR
✔ Bronze must accept it

In [0]:
bronze_df.printSchema()
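
Printing the schema shows its current shape, but detecting drift requires comparing it with an earlier one. The sketch below works on the dictionary form of a Spark schema, as returned by `bronze_df.schema.jsonValue()`, and reports added and removed field paths. How the earlier snapshot is stored is left open; `field_paths`, `schema_drift`, and the sample schemas are hypothetical.

```python
from typing import Any, Dict, List, Set


def field_paths(dtype: Any, prefix: str = "") -> Set[str]:
    """Collect dotted leaf paths from a Spark schema in its JSON (dict) form."""
    if isinstance(dtype, dict):
        if dtype.get("type") == "struct":
            paths: Set[str] = set()
            for field in dtype["fields"]:
                name = f"{prefix}.{field['name']}" if prefix else field["name"]
                paths |= field_paths(field["type"], name)
            return paths
        if dtype.get("type") == "array":
            # Arrays add no path segment, matching Spark's entry.resource.x notation
            return field_paths(dtype["elementType"], prefix)
    # Primitive types (and maps, treated as leaves here) end the path
    return {prefix}


def schema_drift(old: Dict[str, Any], new: Dict[str, Any]) -> Dict[str, List[str]]:
    """Compare two schema dicts and list the field paths that appeared or disappeared."""
    old_paths, new_paths = field_paths(old), field_paths(new)
    return {"added": sorted(new_paths - old_paths), "removed": sorted(old_paths - new_paths)}


def _field(name: str, dtype: Any) -> Dict[str, Any]:
    """Build one field entry in the shape used by the schema JSON form."""
    return {"name": name, "type": dtype, "nullable": True, "metadata": {}}


def _struct(*fields: Dict[str, Any]) -> Dict[str, Any]:
    return {"type": "struct", "fields": list(fields)}


def _array(element: Any) -> Dict[str, Any]:
    return {"type": "array", "elementType": element, "containsNull": True}


# Made-up schemas: the newer load introduces entry.resource.subject.reference
old_schema = _struct(
    _field("resourceType", "string"),
    _field("entry", _array(_struct(_field("resource", _struct(_field("resourceType", "string")))))),
)
new_schema = _struct(
    _field("resourceType", "string"),
    _field("entry", _array(_struct(_field("resource", _struct(
        _field("resourceType", "string"),
        _field("subject", _struct(_field("reference", "string"))),
    ))))),
)
drift = schema_drift(old_schema, new_schema)
print(drift)
```

The same `field_paths` set can be used before a select to confirm that a nested path such as entry.resource.subject.reference exists in the schema.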

8️⃣ STEP 9 – Create Bronze exploded VIEW (✅ OK, UC compatible)

Your SQL is correct for Unity Catalog:

✔ View (no storage)
✔ UC-governed
✔ Perfect for Silver input

In [0]:
%sql
CREATE OR REPLACE VIEW angad_kumar91.fhir_healthcare_analytics_bronze.fhir_entry_view AS
SELECT
  source_file,
  ingest_time,
  explode(entry) AS entry
FROM angad_kumar91.fhir_healthcare_analytics_bronze.fhir_bundle;

9️⃣ STEP 10 – Validate Bronze VIEW (✅ OK)

✔ Correct
✔ Final Bronze validation
✔ Confirms ingestion correctness

In [0]:
%sql
SELECT
  entry.resource.resourceType,
  COUNT(*) AS cnt
FROM angad_kumar91.fhir_healthcare_analytics_bronze.fhir_entry_view
GROUP BY entry.resource.resourceType
ORDER BY cnt DESC;

🧾 FINAL CLEAN VERSION (YOU CAN KEEP THIS)

If you want a single clean reference, this is the final UC-correct Bronze working sequence:



In [0]:
from pyspark.sql.functions import explode, col

# Load the Bronze table registered in Unity Catalog
bronze_df = spark.table(
    "angad_kumar91.fhir_healthcare_analytics_bronze.fhir_bundle"
)

# Number of resources in each patient Bundle
bronze_df.selectExpr("size(entry) as resource_count").show(10)

# One row per Bundle entry, keeping the lineage columns
exploded_df = bronze_df.select(
    col("source_file"),
    col("ingest_time"),
    explode("entry").alias("entry")
)

# Resource type alongside its subject and encounter references
exploded_df.select(
    col("entry.resource.resourceType").alias("resource_type"),
    col("entry.resource.subject.reference").alias("subject_ref"),
    col("entry.resource.encounter.reference").alias("encounter_ref")
).show(20, truncate=False)

🧠 Final confirmation

✅ Bronze layer is now complete and correct

✅ Unity Catalog–compliant

✅ Medallion-aligned

✅ Healthcare-grade