## Pipeline: Bronze to Silver

## Data Source

- **Catalog Location:** `workspace.hospital_bronze.visits`
- **Format:** Delta Lake Table


## Destination

- **Catalog Location:** `workspace.hospital_silver.visits`
- **Format:** Delta Lake Table 

In [0]:
entity = "visits"

In [0]:
# Databricks Storage
catalog_name = "workspace"
schema_silver = "hospital_silver"
schema_bronze = "hospital_bronze"
schema_gold = "hospital_gold"

# data source path
data_source = "s3://buckethospitaldata/data_batching/"

# Streaming checkpoint location for the silver write (stored in the same S3 bucket as the data source)
checkpoint_location = f"s3://buckethospitaldata/pipeline_checkpoints/data_streaming/_checkpoints/silver/{entity}"

## Read Data from Bronze Layer

In [0]:
df = spark.readStream.table(f"{catalog_name}.{schema_bronze}.{entity}")

In [0]:
df.printSchema()

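## Matching Data Types

- Treatment_Cost, Medication_Cost, Insurance_Coverage: float
- Room_Charges_daily_rate: float
- Date_of_Visit: date (parsed from `M/d/yyyy`)
- Follow_Up_Visit_Date: date (parsed from `M/d/yyyy`)
- Admitted_Date: date (parsed from `M/d/yyyy`)
- Discharge_Date: date (parsed from `M/d/yyyy`)
- Emergency_Visit: boolean (`Yes` becomes true, `No` becomes false, anything else becomes null)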

In [0]:
from pyspark.sql.functions import col, to_date, when

df = df.withColumn("Room_Charges_daily_rate", col("Room_Charges_daily_rate").cast("float")) \
        .withColumn("Treatment_Cost", col("Treatment_Cost").cast("float")) \
        .withColumn("Medication_Cost", col("Medication_Cost").cast("float")) \
        .withColumn("Insurance_Coverage", col("Insurance_Coverage").cast("float")) \
        .withColumn("Date_of_Visit", to_date(col("Date_of_Visit"), "M/d/yyyy")) \
        .withColumn("Follow_Up_Visit_Date", to_date(col("Follow_Up_Visit_Date"), "M/d/yyyy")) \
        .withColumn("Admitted_Date", to_date(col("Admitted_Date"), "M/d/yyyy")) \
        .withColumn("Discharge_Date", to_date(col("Discharge_Date"), "M/d/yyyy")) \
        .withColumn("Emergency_Visit", when(col("Emergency_Visit") == "Yes", True)
                                        .when(col("Emergency_Visit") == "No", False)
                                        .otherwise(None))
df.printSchema()
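
The conversion rules above can be mirrored in plain Python for a single value, which may help when writing unit tests for expected outputs. This is only a sketch: the helper names `parse_visit_date` and `parse_yes_no` are hypothetical, and it assumes that unparseable dates should become null, which depends on the Spark parser settings in use.

```python
from datetime import date, datetime
from typing import Optional


def parse_visit_date(value: Optional[str]) -> Optional[date]:
    """Parse an M/d/yyyy string; return None for missing or malformed input."""
    if value is None:
        return None
    try:
        # %m and %d accept both padded and unpadded month/day values
        return datetime.strptime(value.strip(), "%m/%d/%Y").date()
    except ValueError:
        return None


def parse_yes_no(value: Optional[str]) -> Optional[bool]:
    """Map 'Yes'/'No' to True/False; any other value becomes None."""
    return {"Yes": True, "No": False}.get(value)


print(parse_visit_date("3/7/2024"))  # 2024-03-07
print(parse_yes_no("No"))            # False
```

As in the notebook cell, the `Yes`/`No` match is exact and case-sensitive, so a value such as `yes` maps to null rather than true.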

## Handling Missing Data

In [0]:
from pyspark.sql.functions import col, when

# Fill Room_Type with 'Outpatient' when it is missing
df = df.fillna({'Room_Type': 'Outpatient'})

# Fill Insurance_Coverage with 0.0 when it is missing (treated as no coverage)
df = df.fillna({'Insurance_Coverage': 0.0})

# Admitted_Date and Discharge_Date are left null because their missingness is meaningful;
# the flag columns below record it instead.
# Flag column: patients who were admitted
df = df.withColumn("Was_Admitted", when(col("Admitted_Date").isNull(), False).otherwise(True))
# Flag column: patients who have a follow-up visit
df = df.withColumn(
    "Has_Follow_Up",
    when(col("Follow_Up_Visit_Date").isNotNull(), True).otherwise(False)
)
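
The missing-value rules in this section can be sketched for a single record in plain Python, for example to document the expected output in a test. The function name `apply_missing_value_rules` and the dictionary representation of a row are assumptions made for illustration; they are not part of the pipeline.

```python
from typing import Any, Dict


def apply_missing_value_rules(row: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `row` with defaults filled in and flag columns added."""
    out = dict(row)
    # Missing room type is treated as an outpatient visit
    if out.get("Room_Type") is None:
        out["Room_Type"] = "Outpatient"
    # Missing insurance coverage is treated as no coverage
    if out.get("Insurance_Coverage") is None:
        out["Insurance_Coverage"] = 0.0
    # Dates stay as they are; flags record whether they were present
    out["Was_Admitted"] = out.get("Admitted_Date") is not None
    out["Has_Follow_Up"] = out.get("Follow_Up_Visit_Date") is not None
    return out


example = apply_missing_value_rules(
    {"Room_Type": None, "Insurance_Coverage": None,
     "Admitted_Date": None, "Follow_Up_Visit_Date": "2024-03-10"}
)
print(example["Room_Type"], example["Was_Admitted"], example["Has_Follow_Up"])
# Outpatient False True
```

Like `fillna`, the sketch only replaces missing (null) values; an empty string in `Room_Type` would be left unchanged.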


## Calculate Derived Columns

In [0]:
from pyspark.sql.functions import col, datediff, when


# Length of Stay
df = df.withColumn(
    "Length_of_stay",
    when(
        col("Discharge_Date").isNull() | col("Admitted_Date").isNull(),
        0
    ).otherwise(
        datediff(col("Discharge_Date"), col("Admitted_Date"))
    )
)
           
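# Room_Cost: daily room rate times the length of stay
# Visit_Cost: total charges net of insurance coverage
# Revenue_per_visit: total charges before insurance
# Note: a null in any input column propagates to the derived columns.
df = (df.withColumn("Room_Cost", col("Room_Charges_daily_rate") * col("Length_of_stay"))
        .withColumn("Visit_Cost", col("Room_Cost") + col("Medication_Cost") + col("Treatment_Cost") - col("Insurance_Coverage"))
        .withColumn("Revenue_per_visit", col("Room_Cost") + col("Medication_Cost") + col("Treatment_Cost")))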
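
A worked example of the derived-column arithmetic, written in plain Python for one visit. The function names `length_of_stay` and `visit_costs` and the sample figures are made up for illustration, and the sketch assumes every cost input is present.

```python
from datetime import date
from typing import Dict, Optional


def length_of_stay(admitted: Optional[date], discharged: Optional[date]) -> int:
    """Days between admission and discharge; 0 when either date is missing."""
    if admitted is None or discharged is None:
        return 0
    return (discharged - admitted).days


def visit_costs(daily_rate: float, stay_days: int, medication: float,
                treatment: float, insurance: float) -> Dict[str, float]:
    """Compute the three cost columns for a single visit."""
    room_cost = daily_rate * stay_days
    revenue = room_cost + medication + treatment
    return {
        "Room_Cost": room_cost,
        "Revenue_per_visit": revenue,
        "Visit_Cost": revenue - insurance,
    }


stay = length_of_stay(date(2024, 3, 1), date(2024, 3, 4))  # 3 days
print(visit_costs(200.0, stay, 50.0, 300.0, 400.0))
# {'Room_Cost': 600.0, 'Revenue_per_visit': 950.0, 'Visit_Cost': 550.0}
```

In Spark, arithmetic on a null yields null, so a visit with a null `Room_Charges_daily_rate` would get a null `Room_Cost` even when `Length_of_stay` is 0. Whether that case occurs depends on the bronze data.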

## Write Data to Silver Layer

In [0]:

(
    df.writeStream
        .format("delta")
        .option("checkpointLocation", checkpoint_location)
        .option("mergeSchema", "true")  # let new columns be added to the silver table schema
        .outputMode("append")
        .trigger(once=True)  # process the data available at start, then stop
        .table(f"{catalog_name}.{schema_silver}.{entity}")
)
