![./ImageLab.png](./Images/ImageLab.png "./ImageLab.png")

# Data Engineering with Lakeflow, Jobs, AutoLoader and more

### Create the catalog and schemas, dropping existing tables if present

First create a Databricks access connector in Azure, then follow the steps below to create the external location where the table data will be stored.

![Connector.png](./Images/Connector.png "Connector.png")

![Access.png](./Images/Access.png "Access.png")

![Access2.png](./Images/Access2.png "Access2.png")

![Access3.png](./Images/Access3.png "Access3.png")

![Connector2.png](./Images/Connector2.png "Connector2.png")

![Credential.png](./Images/Credential.png "Credential.png")

![Credential2.png](./Images/Credential2.png "Credential2.png")

Add a notebook parameter named param_location that holds the external location: choose Edit > Add parameter... and enter a URL of the form abfss://container@storageaccount.dfs.core.windows.net/
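Because later cells append sub-folders to this parameter, it may help to check and normalize the value before using it. The following is a minimal sketch, not part of the original notebook: `normalize_location` is a hypothetical helper that only checks the abfss scheme and the container@account form, then drops any trailing slash.

```python
def normalize_location(url: str) -> str:
    """Validate an ADLS Gen2 URL and remove any trailing slash (hypothetical helper)."""
    url = url.strip()
    if not url.startswith("abfss://"):
        raise ValueError(f"Expected an abfss:// URL, got: {url!r}")
    # The authority part should look like container@storageaccount.dfs.core.windows.net
    authority = url[len("abfss://"):].split("/", 1)[0]
    if "@" not in authority:
        raise ValueError("Expected abfss://container@storageaccount.dfs.core.windows.net/")
    # Without the trailing slash, f"{location}/catalog" does not produce "//"
    return url.rstrip("/")


print(normalize_location("abfss://container@storageaccount.dfs.core.windows.net/"))
```

In the notebook, the value returned by dbutils.widgets.get("param_location") could be passed through this helper before it is stored in LOCATION_VAR.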

In [0]:
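from pyspark.sql import functions as F

# External location URL from the notebook parameter; strip any trailing slash
# so the paths built in later cells do not contain "//".
LOCATION_VAR = dbutils.widgets.get("param_location").rstrip("/")

# Uncomment to drop the catalog and all of its tables before re-running.
# spark.sql("DROP CATALOG IF EXISTS medallion CASCADE")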

In [0]:
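try:
    spark.sql(f"""CREATE EXTERNAL LOCATION ext_location
URL '{LOCATION_VAR}'
WITH (STORAGE CREDENTIAL credentialbaraldi)
COMMENT 'External location on ADLS'""")
except Exception as e:
    # Usually the location already exists; print the error so other failures stay visible.
    print(f"External location was not created (it may already exist): {e}")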
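As an alternative to catching the exception, Databricks SQL accepts an IF NOT EXISTS clause on CREATE EXTERNAL LOCATION, which makes the statement safe to re-run. The sketch below only builds the SQL string; `external_location_sql` is a hypothetical helper, and the URL shown is the placeholder from the parameter description, not a real storage account.

```python
def external_location_sql(name: str, url: str, credential: str, comment: str = "") -> str:
    """Build an idempotent CREATE EXTERNAL LOCATION statement (hypothetical helper)."""
    sql = (
        f"CREATE EXTERNAL LOCATION IF NOT EXISTS {name} "
        f"URL '{url}' "
        f"WITH (STORAGE CREDENTIAL {credential})"
    )
    if comment:
        sql += f" COMMENT '{comment}'"
    return sql


stmt = external_location_sql(
    "ext_location",
    "abfss://container@storageaccount.dfs.core.windows.net",  # placeholder URL
    "credentialbaraldi",
    "External location on ADLS",
)
print(stmt)
# In the notebook: spark.sql(stmt)
```

With this form, a real failure (for example a missing storage credential) still raises an error instead of being reported as "already exists".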

In [0]:
spark.sql(f"CREATE CATALOG IF NOT EXISTS medallion MANAGED LOCATION '{LOCATION_VAR}/catalog'")
spark.sql(f"CREATE DATABASE IF NOT EXISTS medallion.bronze MANAGED LOCATION '{LOCATION_VAR}/catalog/bronze'")
spark.sql(f"CREATE DATABASE IF NOT EXISTS medallion.silver MANAGED LOCATION '{LOCATION_VAR}/catalog/silver'")
spark.sql(f"CREATE DATABASE IF NOT EXISTS medallion.gold MANAGED LOCATION '{LOCATION_VAR}/catalog/gold'")
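The four statements above follow one pattern: the catalog lives under /catalog and each medallion layer gets its own sub-folder. A minimal sketch of generating them in a loop, assuming a hypothetical helper `medallion_ddl` that is not part of the original notebook:

```python
from typing import List

LAYERS = ["bronze", "silver", "gold"]


def medallion_ddl(location: str, catalog: str = "medallion") -> List[str]:
    """Return the CREATE statements for the catalog and its layer schemas."""
    base = location.rstrip("/")  # avoid "//" if the parameter ends with a slash
    statements = [
        f"CREATE CATALOG IF NOT EXISTS {catalog} MANAGED LOCATION '{base}/catalog'"
    ]
    for layer in LAYERS:
        statements.append(
            f"CREATE DATABASE IF NOT EXISTS {catalog}.{layer} "
            f"MANAGED LOCATION '{base}/catalog/{layer}'"
        )
    return statements


for statement in medallion_ddl("abfss://container@storageaccount.dfs.core.windows.net/"):
    print(statement)
    # In the notebook: spark.sql(statement)
```

Keeping the layer names in one list means a new layer can be added without copying another line.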

## Loading the Dimension and Fact Tables

Because this is an incremental batch scenario, we first need to load the data. In this case we read it from object storage with a plain batch read (without AutoLoader checkpoints). Upload the data_medallion folder to ADLS and follow the steps below.

Customer Table

In [0]:
# Reading files for dim_customer
df_raw=spark.read \
  .option("header", True) \
  .format("csv") \
  .load(f"{LOCATION_VAR}/data_medallion/customer") \
  .selectExpr(
            "*",
            "_metadata.file_path as file_path",
            "_metadata.file_modification_time as file_mod_time"
        )


# Writing to the dim_customer table
df_raw.write \
  .mode("append") \
  .saveAsTable("medallion.bronze.dim_customer")

Product Table

In [0]:
# Reading files for dim_product
df_raw=spark.read \
  .option("header", True) \
  .format("csv") \
  .load(f"{LOCATION_VAR}/data_medallion/products") \
  .selectExpr(
            "*",
            "_metadata.file_path as file_path",
            "_metadata.file_modification_time as file_mod_time"
        )


# Writing to the dim_product table
df_raw.write \
  .mode("append") \
  .saveAsTable("medallion.bronze.dim_product")

Store Table

In [0]:
# Reading files for dim_store
df_raw=spark.read \
  .option("header", True) \
  .format("csv") \
  .load(f"{LOCATION_VAR}/data_medallion/stores") \
  .selectExpr(
            "*",
            "_metadata.file_path as file_path",
            "_metadata.file_modification_time as file_mod_time"
        )


# Writing to the dim_store table
df_raw.write \
  .mode("append") \
  .saveAsTable("medallion.bronze.dim_store")

Datetime Table

In [0]:
# Reading files for dim_datetime
df_raw=spark.read \
  .option("header", True) \
  .format("csv") \
  .load(f"{LOCATION_VAR}/data_medallion/datetime") \
  .selectExpr(
            "*",
            "_metadata.file_path as file_path",
            "_metadata.file_modification_time as file_mod_time"
        )


# Writing to the dim_datetime table
df_raw.write \
  .mode("append") \
  .saveAsTable("medallion.bronze.dim_datetime")

Fact Table

In [0]:
# Reading files for fact
df_raw=spark.read \
  .option("header", True) \
  .format("csv") \
  .load(f"{LOCATION_VAR}/data_medallion/fact") \
  .selectExpr(
            "*",
            "_metadata.file_path as file_path",
            "_metadata.file_modification_time as file_mod_time"
        )


# Writing to the fact table
df_raw.write \
  .mode("append") \
  .saveAsTable("medallion.bronze.fact")
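The five load cells differ only in the source folder and the target table, so they could be expressed as one function driven by a mapping. This is a sketch under that assumption: `SOURCES`, `source_path` and `load_csv_to_bronze` are hypothetical names, while the folders, table names and reader options are the ones used in the cells above.

```python
# Source folder under data_medallion -> target bronze table
SOURCES = {
    "customer": "medallion.bronze.dim_customer",
    "products": "medallion.bronze.dim_product",
    "stores": "medallion.bronze.dim_store",
    "datetime": "medallion.bronze.dim_datetime",
    "fact": "medallion.bronze.fact",
}


def source_path(location: str, folder: str) -> str:
    """Build the folder path for one source under data_medallion."""
    return f"{location.rstrip('/')}/data_medallion/{folder}"


def load_csv_to_bronze(spark, location: str, folder: str, table: str) -> None:
    """Read the CSV files of one folder and append them to a bronze table."""
    df_raw = (
        spark.read
        .format("csv")
        .option("header", True)
        .load(source_path(location, folder))
        # Keep the source file path and modification time for each row
        .selectExpr(
            "*",
            "_metadata.file_path as file_path",
            "_metadata.file_modification_time as file_mod_time",
        )
    )
    df_raw.write.mode("append").saveAsTable(table)


# In the notebook, where spark and LOCATION_VAR are defined:
# for folder, table in SOURCES.items():
#     load_csv_to_bronze(spark, LOCATION_VAR, folder, table)
```

Note that mode("append") adds the same rows again if the loop is re-run over unchanged files, exactly as the individual cells do.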