# Bronze layer

The goal of this notebook is to run the bronze-layer process: raw data (CSV) is saved as a Delta table with minimal transformation.

## 1. Notebook parameters

Parameters set by the calling pipeline.

Inputs:

* **in_parameter_run_id**: run identifier.
* **in_parameter_process_date**: process date and time, in `yyyy-MM-dd HH:mm:ss` format.
* **in_parameter_path_storage**: name of the project storage account.
* **in_parameter_path_container**: name of the project container.
* **in_parameter_bd**: database name.

Outputs:

* **out_parameter_count_processed**: count of processed rows.

In [1]:
in_parameter_run_id = 0
in_parameter_process_date = "1900-01-01 00:00:00"
in_parameter_path_storage = "datalake20251021"
in_parameter_path_container = "dajobcanada"
in_parameter_bd = "dajobcanada_db"
out_parameter_count_processed = 0
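Because the pipeline passes `in_parameter_process_date` as a string, it may help to check its format before any Spark work starts. The following is a minimal sketch in plain Python, assuming the `yyyy-MM-dd HH:mm:ss` pattern that the notebook applies later; `parse_process_date` and `PROCESS_DATE_FORMAT` are hypothetical names that are not part of the original notebook.

```python
from datetime import datetime

# Python equivalent of the Spark pattern "yyyy-MM-dd HH:mm:ss" (assumed format).
PROCESS_DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_process_date(value: str) -> datetime:
    """Hypothetical helper: parse the process date or fail early with a clear message."""
    try:
        return datetime.strptime(value, PROCESS_DATE_FORMAT)
    except ValueError as exc:
        raise ValueError(
            f"in_parameter_process_date must look like '1900-01-01 00:00:00', got {value!r}"
        ) from exc


# The default value from the parameters cell parses cleanly.
parsed = parse_process_date("1900-01-01 00:00:00")
print(parsed.date().isoformat())
```

Failing at this point would stop a run before a malformed value can silently turn into a null timestamp further down.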

## 2. Read raw data

In [22]:
from pyspark.sql.functions import col, to_timestamp, lit, date_format

In [23]:
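# Read every CSV file in the raw folder. No schema inference is requested,
# so all columns are loaded as strings.
df = spark.read.load(
    f"abfss://{in_parameter_path_container}@{in_parameter_path_storage}.dfs.core.windows.net/raw/*.csv",
    format="csv",
    header=True,
)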
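The same `abfss://` URI pattern is used for both the raw read and the bronze write, so it could be built in one place. This is a small sketch in plain Python; `abfss_path` is a hypothetical helper, and the container and storage account values are the defaults from the parameters cell.

```python
def abfss_path(container: str, storage_account: str, relative_path: str) -> str:
    """Hypothetical helper: build an ADLS Gen2 abfss:// URI from its parts."""
    return (
        f"abfss://{container}@{storage_account}.dfs.core.windows.net/"
        f"{relative_path.lstrip('/')}"
    )


# Paths used by this notebook, built from the default parameter values.
raw_path = abfss_path("dajobcanada", "datalake20251021", "raw/*.csv")
bronze_path = abfss_path("dajobcanada", "datalake20251021", "bronze/jobs/")
print(raw_path)
print(bronze_path)
```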

## 3. Basic transformations

Columns are renamed to snake_case.

In [24]:
df = df.select(col("Job ID").alias("job_id"),
                col("Job Title").alias("job_title"),
                col("Company Name").alias("company_name"),
                col("Language and Tools").alias("language_tools"),
                col("Job Salary").alias("job_salary"),
                col("City").alias("city"),
                col("Province").alias("province"),
                col("Job Link").alias("job_link"))
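The rename above depends on the CSV header containing all eight source columns. One way to make that explicit is to keep the mapping in a dictionary and check the header against it. The sketch below is plain Python; `COLUMN_MAPPING` and `missing_columns` are hypothetical names, and the sample header is invented for illustration.

```python
# Source header -> bronze column name, copied from the select in this section.
COLUMN_MAPPING = {
    "Job ID": "job_id",
    "Job Title": "job_title",
    "Company Name": "company_name",
    "Language and Tools": "language_tools",
    "Job Salary": "job_salary",
    "City": "city",
    "Province": "province",
    "Job Link": "job_link",
}


def missing_columns(header, mapping=COLUMN_MAPPING):
    """Hypothetical helper: list expected source columns that the header lacks."""
    present = set(header)
    return [name for name in mapping if name not in present]


# Illustrative header with some columns absent.
sample_header = ["Job ID", "Job Title", "Company Name", "City"]
print(missing_columns(sample_header))
```

In the notebook, `df.columns` could be passed as `header`, and the same dictionary could drive the rename with `df.select([col(src).alias(dst) for src, dst in COLUMN_MAPPING.items()])`.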

Each load receives its own identifier (`run_id`), a process timestamp (`process_datetime`), and a process date (`process_date`) used for partitioning. Each load is stored in the partition for its process day.

In [25]:
df = df.withColumn("run_id", lit(in_parameter_run_id))
df = df.withColumn("process_datetime", lit(in_parameter_process_date))
df = df.withColumn("process_datetime", to_timestamp("process_datetime", "yyyy-MM-dd HH:mm:ss"))
df = df.withColumn("process_date", date_format("process_datetime", "yyyy-MM-dd"))

## 4. Save results

Append the data to the bronze Delta table, partitioned by `process_date`.

In [26]:
(
    df.write
    .partitionBy("process_date")
    .mode("append")
    .format("delta")
    .save(f"abfss://{in_parameter_path_container}@{in_parameter_path_storage}.dfs.core.windows.net/bronze/jobs/")
)
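To see where a given load ends up, the sketch below derives the partition directory from the process timestamp in plain Python. It assumes the default Hive-style layout (`process_date=<value>` directories under the table root); some Delta table features can change the physical directory names. `partition_directory` is a hypothetical helper and the timestamp is an example value.

```python
from datetime import datetime


def partition_directory(process_datetime: str, table_root: str = "bronze/jobs") -> str:
    """Hypothetical helper: expected partition directory for one load."""
    parsed = datetime.strptime(process_datetime, "%Y-%m-%d %H:%M:%S")
    # Mirrors date_format(process_datetime, "yyyy-MM-dd") in the notebook.
    process_date = parsed.date().isoformat()
    return f"{table_root}/process_date={process_date}"


print(partition_directory("2025-10-21 08:30:00"))
```

Two loads on the same day share a partition directory and can still be told apart by `run_id`.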

Count the processed rows and return the count to the calling pipeline.

In [27]:
out_parameter_count_processed = df.count()

In [28]:
mssparkutils.notebook.exit(out_parameter_count_processed)