# DLT Pipeline Basics

This Delta Live Tables (DLT) notebook processes JSON source files and uses the Auto Loader functionality from the earlier stages of the course. 
It is the first part of the medallion architecture and loads data from the raw layer into bronze.  

* This notebook feeds the first layer, bronze, which holds the source data unprocessed.


The goals of this notebook are:
* Defining Delta Live Tables
* Loading data with Auto Loader
* Using DLT Pipelines parameters


## DLT Notebook

Serverless compute will not run DLT code; a full runtime is required. 
You therefore need to create a pipeline that runs the code.

## Parameters

When creating the pipeline, add the parameters in the Configuration field.


These parameters become part of the Spark configuration.

In Python, we can access these values using **`spark.conf.get()`**.
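As a rough sketch of how such lookups behave, the snippet below separates optional parameters (read with a default) from required ones (which should fail with a clear message when missing). `FakeConf`, `require_param`, and the values `"raw"` and `"amazon"` are hypothetical and only stand in for `spark.conf` so the sketch runs outside a pipeline; in the notebook you would pass `spark.conf` instead.

```python
class FakeConf:
    """Hypothetical stand-in for spark.conf, used only so this sketch runs anywhere."""

    def __init__(self, values):
        self._values = dict(values)

    def get(self, key, default=None):
        # Return the configured value, or the default when the key is not set.
        return self._values.get(key, default)


def require_param(conf, key):
    """Return a required pipeline parameter or raise a readable error."""
    value = conf.get(key, None)
    if not value:
        raise ValueError(f"Pipeline configuration is missing '{key}'")
    return value


# Assumed example configuration; real values come from the pipeline settings.
conf = FakeConf({"bronze_schema": "raw", "param_source_name": "amazon"})

schema = require_param(conf, "bronze_schema")             # required parameter
param_environment = conf.get("param_environment", "dev")  # optional, falls back to "dev"
```

Failing early on a missing required parameter tends to be easier to debug than an empty schema name surfacing later in a table definition.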



In [0]:
import dlt

param_environment = spark.conf.get("param_environment", "dev")
param_source_name = spark.conf.get("param_source_name", "")
schema = spark.conf.get("bronze_schema")



---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
File <command-2229313379879429>, line 1
----> 1 import dlt
      3 param_environment = spark.conf.get("param_environment", "dev")
      4 param_source_name = spark.conf.get("param_source_name", "")

File /databricks/python_shell/lib/dbruntime/autoreload/discoverability/hook.py:71, in AutoreloadDiscoverabilityHook._patched_import(self, name, *args, **kwargs)
     65 if not 

In [0]:
errors_path = "/Volumes/bronze/raw/log/errors"
checkpoint_path = "/Volumes/bronze/raw/log/checkpoint"
source_path = "/Volumes/bronze/raw/data/"

## Tables

* **Live tables** are materialized views for the lakehouse; they return the current results on every refresh 
* **Streaming live tables** are incremental tables based on a stream


Basic DLT definition:

**`@dlt.table`**<br/>
**`def <function-name>():`**<br/>
**`    return (<query>)`**<br/>

In [0]:
%python
import dlt
from pyspark.sql.functions import lit, col, current_timestamp

@dlt.table(
    name=f"bronze.{schema}.{param_source_name}", 
    comment="Raw data ingested from cloud storage"
)
def load_raw_files():
    # Auto Loader options: read JSON files and infer column types.
    cloudfile = {
        "cloudFiles.format": "json",
        "pathGlobFilter": "*.json",
        "cloudFiles.inferColumnTypes": "true",
        "cloudFiles.schemaLocation": checkpoint_path
    }
    return (
        spark.readStream.format("cloudFiles")
            .options(**cloudfile)
            .option("checkpointLocation", checkpoint_path)
            .option("badRecordsPath", errors_path)
            .option("multiline", True)
            .load(source_path)
            # Keep every source column plus the file metadata struct.
            .selectExpr("*", "_metadata")
            .withColumn("source_system", lit(param_source_name))
            .withColumn("file_path", col("_metadata.file_path"))
            # current_timestamp() already returns a Column, so lit() is not needed.
            .withColumn("inserted_at", current_timestamp())
    )

| Name | Type |
| --- | --- |
| ISBN10 | `string` |
| answered_questions | `bigint` |
| asin | `string` |
| availability | `string` |
| best_sellers_rank | `array<struct<category:string,rank:bigint>>` |
| brand | `string` |
| buybox_seller | `string` |
| categories | `array<string>` |
| currency | `string` |
| date_first_available | `string` |


## Validating, Enriching, and Transforming Data

DLT allows users to easily declare tables from results of any standard Spark transformations. DLT adds new functionality for data quality checks and provides a number of options to allow users to enrich the metadata for created tables.

Let's break down the syntax of the query below.

### Options for **`@dlt.table()`**

There are <a href="https://docs.databricks.com/data-engineering/delta-live-tables/delta-live-tables-python-ref.html#create-table" target="_blank">a number of options</a> that can be specified during table creation. Here, we use two of these to annotate our dataset.

##### **`comment`**

Table comments are a standard for relational databases. They can be used to provide useful information to users throughout your organization. In this example, we write a short human-readable description of the table that describes how data is being ingested and enforced (which could also be gleaned from reviewing other table metadata).

##### **`table_properties`**

This field can be used to pass any number of key/value pairs for custom tagging of data. Here, we set the value **`silver`** for the key **`quality`**.

Note that while this field allows custom tags to be set arbitrarily, it is also used to configure a number of settings that control how a table will perform. While reviewing table details, you may also encounter a number of settings that are turned on by default any time a table is created.
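The two options described above might be combined as in the following minimal sketch. The table names `orders_silver` and `orders_bronze` and the comment text are hypothetical. The `try`/`except` guard is there because the `dlt` module can only be imported when the code runs inside a DLT pipeline.

```python
# Options passed to @dlt.table; the name and comment text are hypothetical.
SILVER_OPTIONS = {
    "name": "orders_silver",
    "comment": "Cleaned orders, appended incrementally from the bronze table",
    # Custom tag: the key "quality" carries the value "silver".
    "table_properties": {"quality": "silver"},
}

try:
    import dlt  # only importable when the code runs inside a DLT pipeline
except ImportError:
    dlt = None

if dlt is not None:
    @dlt.table(**SILVER_OPTIONS)
    def orders_silver():
        # "orders_bronze" is a hypothetical upstream table in the same pipeline.
        return dlt.read_stream("orders_bronze")
```

Keeping the options in a dictionary is a design choice for this sketch only; passing `comment=` and `table_properties=` directly to the decorator is equivalent.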

### Data Quality Constraints

The Python version of DLT uses decorator functions to set <a href="https://docs.databricks.com/data-engineering/delta-live-tables/delta-live-tables-expectations.html#delta-live-tables-data-quality-constraints" target="_blank">data quality constraints</a>. We'll see a number of these throughout the course.

DLT uses simple boolean statements to allow quality enforcement checks on data. In the statement below, we:
* Declare a constraint named **`valid_date`**
* Define the conditional check that the field **`order_timestamp`** must contain a value greater than January 1, 2021
* Instruct DLT to fail the current transaction if any records violate the constraint by using the decorator **`@dlt.expect_or_fail()`**

Each constraint can have multiple conditions, and multiple constraints can be set for a single table. In addition to failing the update, constraint violation can also automatically drop records or just record the number of violations while still processing these invalid records.
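The three behaviours described above (fail the update, drop the record, or only count the violation) could look roughly like this. It is a sketch under assumptions: the table names and the upstream `orders_bronze` table are hypothetical, and the guard is needed because `dlt` is only importable inside a pipeline.

```python
# Constraint from the text as (name, SQL boolean expression).
VALID_DATE = ("valid_date", "order_timestamp > '2021-01-01'")

try:
    import dlt  # only importable when the code runs inside a DLT pipeline
except ImportError:
    dlt = None

if dlt is not None:
    @dlt.table(comment="Fails the update if any record violates valid_date")
    @dlt.expect_or_fail(*VALID_DATE)
    def orders_strict():
        return dlt.read_stream("orders_bronze")  # hypothetical upstream table

    @dlt.table(comment="Drops records that violate valid_date")
    @dlt.expect_or_drop(*VALID_DATE)
    def orders_filtered():
        return dlt.read_stream("orders_bronze")

    @dlt.table(comment="Keeps all records and only counts violations")
    @dlt.expect(*VALID_DATE)
    def orders_monitored():
        return dlt.read_stream("orders_bronze")
```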

### DLT Read Methods

The Python **`dlt`** module provides the **`read()`** and **`read_stream()`** methods to easily configure references to other tables and views in your DLT Pipeline. This syntax allows you to reference these datasets by name without any database reference. You can also use **`spark.table("LIVE.<table_name>")`**, where **`LIVE`** is a keyword substituted for the database being referenced in the DLT Pipeline.
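A minimal sketch of the three ways of referencing another dataset follows. The upstream name `orders_bronze` and the function names are hypothetical, and `spark` is assumed to be the session object that Databricks provides in notebooks.

```python
UPSTREAM = "orders_bronze"  # hypothetical table defined elsewhere in the pipeline

try:
    import dlt  # only importable when the code runs inside a DLT pipeline
except ImportError:
    dlt = None

if dlt is not None:
    @dlt.table
    def orders_snapshot():
        # Complete read of the upstream dataset.
        return dlt.read(UPSTREAM)

    @dlt.table
    def orders_incremental():
        # Streaming read of the upstream dataset.
        return dlt.read_stream(UPSTREAM)

    @dlt.table
    def orders_via_live_keyword():
        # Same reference through the LIVE keyword; `spark` comes from the notebook.
        return spark.table(f"LIVE.{UPSTREAM}")
```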

## Live Tables vs. Streaming Live Tables

The two functions we've reviewed so far have both created streaming live tables. Below, we see a simple function that returns a live table (or materialized view) of some aggregated data.

Spark has historically differentiated between batch queries and streaming queries. Live tables and streaming live tables have similar differences.

Note that these table types inherit the syntax (as well as some of the limitations) of the PySpark and Structured Streaming APIs.

Below are some of the differences between these types of tables.

### Live Tables
* Always "correct", meaning their contents will match their definition after any update.
* Return the same results as if the table had just been defined for the first time on all data.
* Should not be modified by operations external to the DLT Pipeline (you'll either get undefined answers or your change will just be undone).

### Streaming Live Tables
* Only support reading from "append-only" streaming sources.
* Only read each input batch once, no matter what (even if joined dimensions change, or if the query definition changes, etc).
* Can perform operations on the table outside the managed DLT Pipeline (append data, perform GDPR, etc).
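To make the contrast concrete, here is a hedged sketch of a live table (materialized view) that aggregates data. The upstream table `orders_silver`, the column `order_timestamp`, and the output column names are assumptions for illustration only.

```python
DAILY_COUNT_COLUMN = "total_daily_orders"  # hypothetical output column name

try:
    import dlt  # only importable when the code runs inside a DLT pipeline
    from pyspark.sql import functions as F
except ImportError:
    dlt = None

if dlt is not None:
    @dlt.table(comment="Number of orders per day, recomputed on every update")
    def orders_by_date():
        return (
            # dlt.read() performs a complete read, so the result always
            # reflects the full contents of the upstream table.
            dlt.read("orders_silver")
            .groupBy(F.to_date("order_timestamp").alias("order_date"))
            .agg(F.count("*").alias(DAILY_COUNT_COLUMN))
        )
```

Replacing `dlt.read()` with `dlt.read_stream()` would turn the definition into a streaming live table, which is subject to the Structured Streaming limitations listed above.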