# Platinum Layer Fact Delta Table - Generic notebook as a guide to the conventions

### Configuration
|Item|Value|
|---|---|
|Parameter|JSON object with all applicable parameters|
|Source File Identification|Not applicable, will read from a Delta table|
|Output|Delta table in the platinum database stored in the platinum container of the data lake|
|Data Wrangling|Business Rules addition, joining multiple source entities together to form one consolidated entity, applying rules and conventions.|
|Ingestion Method|Read the data from the source silver delta tables, only for the latest ingestion date time stamp. Then MERGE into the target delta table. |
|**Considerations**|If multiple tables are used as source, the new and edited records they contain in the latest ingested batch need to be in sync so that they can join together without misalignment. If this is not possible, the entire source silver tables will need to be included. |

# Import all libraries required

In [0]:
import json
from pyspark.sql.functions import expr

# Call Administration notebook to perform tasks before the data modelling can continue.

In [0]:
%run ../Administration/CreateDatabaseIfNotExists

# Call the applicable Helper Functions notebooks to include their functions for use in this Notebook

In [0]:
%run ../HelperFunctions/DataLakeHelperFunctions

# Parameters / Widgets

Here the single JSON parameter widget of the notebook will be generated.

Having it as JSON means we can send multiple values from the calling code such as Data Factory to the notebook at run time with a single widget / parameter. 

When we want to add more parameter values later, it doesn't require a notebook change to add more widgets OR a data factory change to add more parameters. Just change the value sent. 

The JSON object will be unpacked and the attributes needed will be extracted in to variables with the "p" prefix to show it was a notebook level parameter.
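
As a minimal sketch of that unpacking step, the helper below parses the JSON string and fails early when an expected attribute is missing. The function name `parse_notebook_parameters`, the `optional_defaults` argument and the `RetentionHours` attribute are illustrative assumptions, not part of this notebook. In the notebook the input string would come from `dbutils.widgets.get`.

```python
import json

# Attribute names taken from the widget definition in this notebook
REQUIRED_KEYS = (
    "SourceDataLakeContainer",
    "TargetDataLakeContainer",
    "TargetEntityName",
    "PartitionByField",
)


def parse_notebook_parameters(json_string, optional_defaults=None):
    """Parse the single JSON widget value into a dict (hypothetical helper)."""
    params = json.loads(json_string)
    if not isinstance(params, dict):
        raise ValueError("Expected a single JSON object, not an array or scalar")
    missing = [key for key in REQUIRED_KEYS if key not in params]
    if missing:
        raise KeyError("Missing parameter(s): " + ", ".join(missing))
    # Attributes added later can be given defaults so older callers keep working
    merged = dict(optional_defaults or {})
    merged.update(params)
    return merged


example = parse_notebook_parameters(
    '{"SourceDataLakeContainer": "silver", "TargetDataLakeContainer": "platinum", '
    '"TargetEntityName": "FactSales", "PartitionByField": "DimCalendarFK_OrderDate"}',
    optional_defaults={"RetentionHours": 168},  # hypothetical extra attribute
)
print(example["TargetEntityName"])
```

Giving new attributes a default is one way to keep older callers working when the JSON value grows over time.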

Run this when you want to re-initialise the widget in the following cell with new default values. 
Do not let this cell run as part of normal operations

In [0]:
#Run this when you want to re-initialise the widget in the following cell with new default values. 
#Do not let this cell run as part of normal operations
dbutils.widgets.removeAll()

### Define the widget
Note, once created, it stays attached to the notebook. This code is kept to run each time to ensure it exists. 

|Parameter Value|Description|
|--|--|
|"SourceDataLakeContainer"|Container/Area in the data flow where the source delta lake table is located|
|"TargetDataLakeContainer"|Container/Area in the data flow where the target delta lake table is located|
|"TargetEntityName"|Name of the entity being processed. Will become the name of the target delta table.|
|"PartitionByField"|Field in the final structure that the Delta table will be partitioned by. |

In [0]:
# Create the widget in the first place with a default value one can use for testing
# This iteration expects a single JSON Object, not an array
dbutils.widgets.text("widgetJSONString", 
'''
{
"SourceDataLakeContainer": "silver",
"TargetDataLakeContainer": "platinum",
"TargetEntityName": "FactSales",
"PartitionByField": "DimCalendarFK_OrderDate"
}
'''
)

### Transform parameter values received into usable format
* Data type conversion
* String manipulation
* Property extraction
* etc.

In [0]:
# At this stage, the value in the widget is still just a string, not a parsed JSON object. 
# Convert it to a Python dictionary using json.loads
pNotebookWidgetWithJSONString = json.loads(dbutils.widgets.get("widgetJSONString"))

# Print out full value received for logging purposes
print("pNotebookWidgetWithJSONString: " + str(pNotebookWidgetWithJSONString))

# Assign each attribute to the applicable variable to be used going forward

pSourceDataLakeContainer = pNotebookWidgetWithJSONString["SourceDataLakeContainer"]
pTargetDataLakeContainer = pNotebookWidgetWithJSONString["TargetDataLakeContainer"]
pTargetEntityName = pNotebookWidgetWithJSONString["TargetEntityName"]
pPartitionByField = pNotebookWidgetWithJSONString["PartitionByField"]

# Calculated parameters from the widget values
pTargetDatabaseTableName = 'datalakehouse_' + pTargetDataLakeContainer + '.' + pTargetEntityName

#Print the rest of the values for troubleshooting
print('pSourceDataLakeContainer: ' + pSourceDataLakeContainer)
print('pTargetDataLakeContainer: ' + pTargetDataLakeContainer)
print('pTargetEntityName: ' + pTargetEntityName)
print('pPartitionByField: ' + pPartitionByField)
print('pTargetDatabaseTableName: ' + pTargetDatabaseTableName)


# Ensure applicable data lake containers are mounted

After the first successful run, this should never have to run again. It is included in all notebooks for safety.

* Source Container: silver
* Target Container: platinum

In [0]:
mount_lake_container(pSourceDataLakeContainer)

In [0]:
mount_lake_container(pTargetDataLakeContainer)

# Get latest IngestionDateTimeStampUTC from source entities
Used to filter the source data to the latest batch received so we don't re-process all historical data. 

**Note**, this assumes all entities used as source have data for the same latest IngestionDateTimeStampUTC and records between them can join and are in sync. 

If this is not the case, additional logic is needed here - dependent on the situation at hand.
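
One possible form of that additional logic is a check that every source table reports the same latest timestamp before the batch is processed. The sketch below is an assumption for illustration: the function name `latest_common_ingestion_timestamp` and the sample values are hypothetical, and in the notebook each value would come from a `max(IngestionDateTimeStampUTC)` query per source table.

```python
from datetime import datetime


def latest_common_ingestion_timestamp(latest_per_table):
    """Return the shared latest timestamp, or raise if the sources disagree.

    latest_per_table maps a table name to its max IngestionDateTimeStampUTC.
    """
    if not latest_per_table:
        raise ValueError("No source tables supplied")
    distinct_values = set(latest_per_table.values())
    if len(distinct_values) > 1:
        details = ", ".join(
            f"{name}={value}" for name, value in sorted(latest_per_table.items())
        )
        raise ValueError("Source tables are not in sync: " + details)
    return distinct_values.pop()


# Hypothetical values for two source tables that received the same batch
in_sync = {
    "datalakehouse_silver.SalesOrderHeader": datetime(2022, 3, 10, 10, 0, 0),
    "datalakehouse_silver.SalesOrderDetail": datetime(2022, 3, 10, 10, 0, 0),
}
print(latest_common_ingestion_timestamp(in_sync))
```

What to do when the check fails, for example falling back to the entire source silver tables, still depends on the situation.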

In [0]:
# vIngestionDateTimeStampUTC is used as filter criteria later when reading from the silver layer
vIngestionDateTimeStampUTC = ''

#This is not automatic for now, because the scenarios are too wide to automate it. "Hardcoding" for now
vSourceSilverDeltaTableName = 'datalakehouse_silver.SalesOrderHeader'

# Create the dataframe that will be used to get the latest IngestionDateTimeStampUTC for the source silver table
SourceSilverTableDF = spark.table(vSourceSilverDeltaTableName)

# Filter the dataframe to get the max value for IngestionDateTimeStampUTC
# collect() brings the result into the driver node and returns it as a list of Row objects. 
# Using index notation, [0][0] selects the first column of the first row to get a scalar value back. 
vIngestionDateTimeStampUTC = SourceSilverTableDF.select(expr("max(IngestionDateTimeStampUTC)")).collect()[0][0]

# str() is used because the value may be a timestamp rather than a string
print('vIngestionDateTimeStampUTC : ' + str(vIngestionDateTimeStampUTC))

# Create single consolidated entity from all applicable silver entities

Rules to apply
* Keep the temporary view name standard - **SourceSilverData**
* Perform null value replacement with friendly value
  * Strings: 'Unknown'
  * Dates: '1900-01-01' or '9999-12-31', depending on the context
  * Boolean: null - don't assume false
* Perform lookups to other tables to get the names of IDs and Codes if not available in Dimensions
* Create calculated columns that will be used in the reporting layer
* Field Names Convention: PascalCase, no spaces
* When concatenating values together, use CONCAT_WS and use the pipe character '|' as the separator

**PK - Primary Key**
* Used as the value other Dimensions and Fact entities link to. 
* Other Dimensions and Facts will include the same fields in their Foreign Key field to this table. 
* This will also be the join column in the MERGE into target table.
* Always use CONCAT_WS with a '|' as the separator to concatenate the values before applying the SHA2 hashing algorithm
  * Do this even if only one column is included. It keeps the code standard, and INT values don't need an explicit cast to VARCHAR because CONCAT_WS already returns a string
  
**BK - Business Key**
* Business key(s) of the entity. 
* Fields that uniquely identify the record using the source system IDs and Codes

**FK - Foreign Key**
* Foreign keys to other tables
* Also need to be hashed the same way that the PK of the target table is hashed. 
  * EXCEPT Dates - they are not hashed - see **Table Attributes & Calculated Columns**
* Note, the source business keys that make up these foreign key values are included for troubleshooting - see **ID - Identification and Code fields** for how they are named

**ID - Identification and Code fields**
* The source business keys that make up the foreign key values are included for troubleshooting - they get the ID suffix rather than the BK suffix, because they are not the business keys of this table itself
* Don't alter the values, they are for viewing purposes. 
* Add the suffix ID to the field names if they are business keys of entities - but only use ID if the field is not in its primary source table. 
* E.g. if the CustomerCodeID value is in the FactSales table, it is outside its primary source, which is DimCustomer. Thus the value gets an ID suffix in the FactSales table, but it gets the BK suffix in DimCustomer. 
* If the source value doesn't have ID as a suffix already, add it. 
* If it has something like "Code" as the suffix, still add the ID to have consistency.

**Table Attributes & Calculated Columns**
* Date fields - keep as a proper date i.e. yyyy-MM-dd. Alias as DimCalendarFK_*UniqueDateFieldName*, because they will be foreign keys to the consolidated date dimension. This also enables incremental loading in Power BI. Since the Calendar's PK will also be a yyyy-MM-dd date, this doesn't require a hash. 
* Field names in general - do not include any abbreviations unless the meaning is very clear or clearly documented somewhere, e.g. Amt should be listed as Amount, Qty should be Quantity
* Monetary values - If possible, add the Amount suffix to the name to indicate this is a monetary aggregatable amount. 
* Numerical Count values - If possible, add the Quantity suffix to the name to indicate this is a non-monetary aggregatable amount. 
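
The naming rules above can be sketched as a few small string helpers. This is only an illustration: the helper names (`expand_abbreviations`, `with_suffix`, `calendar_fk_alias`) and the abbreviation list are assumptions, and in this notebook the aliases are written by hand in the SQL view.

```python
import re

# Abbreviations named in the conventions; the list is an assumption and can be extended
ABBREVIATIONS = {"Amt": "Amount", "Qty": "Quantity"}


def expand_abbreviations(field_name):
    """Replace a known abbreviation when it ends a PascalCase word."""
    for short, full in ABBREVIATIONS.items():
        # The lookahead avoids touching longer words that merely start with the abbreviation
        field_name = re.sub(short + r"(?![a-z])", full, field_name)
    return field_name


def with_suffix(field_name, suffix):
    """Append the suffix unless the name already ends with it."""
    return field_name if field_name.endswith(suffix) else field_name + suffix


def calendar_fk_alias(date_field_name):
    """Alias for a date field that links to the consolidated date dimension."""
    return "DimCalendarFK_" + date_field_name


print(with_suffix(expand_abbreviations("TaxAmt"), "Amount"))
print(with_suffix("SubTotal", "Amount"))
print(with_suffix(expand_abbreviations("OrderQty"), "Quantity"))
print(with_suffix("CustomerCode", "ID"))
print(calendar_fk_alias("OrderDate"))
```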

**Metadata**
* Information about the data we either get from source or manually define in this notebook such as change date times, deleted flags etc. 
* **LatestModifiedDateTimeUTC** - Set when the record is inserted or updated based on the join conditions, so it indicates when the record was last changed.

**HashChecksum**
* Hashed version of the source attributes and measures fields from source entities - unaltered
* Used in the MERGE statement to check whether a record has changed since the last load, so that only the edited records are altered in the target.
* Do not include any alterations to the fields used; the checksum should change only if a source field changes
* Exclude - PK, BK fields - the fields used in the join don't need to be included, but they can be; it won't affect the process.
* Include all other non-metadata fields, including FK, ID, string, date, numerical fields etc. 
* Hash Algorithm: Use the SHA2 function with 256 bits as default.
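
To make the hashing convention concrete, here is a plain Python sketch that mimics `SHA2(CONCAT_WS('|', ...), 256)`. It is an approximation for illustration only: the helper names are assumptions, and Python's `str()` may render some types (for example decimals or booleans) differently from Spark's string cast, so the digests are not guaranteed to match Spark's for every data type. It also shows one behaviour worth being aware of: CONCAT_WS skips NULL arguments, so a value that moves between two nullable columns can produce the same checksum.

```python
import hashlib


def concat_ws(separator, *values):
    """Mimic Spark SQL CONCAT_WS: NULL (None) arguments are skipped."""
    return separator.join(str(value) for value in values if value is not None)


def sha2_256(text):
    """Mimic SHA2(text, 256): hex digest of the UTF-8 encoded string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def hash_key(*values):
    """Pipe-separate the values, then hash them."""
    return sha2_256(concat_ws("|", *values))


primary_key = hash_key(71774, 110563)
print(len(primary_key))  # 64 hex characters

# NULLs are skipped, so both calls hash the string 'A'
print(hash_key("A", None) == hash_key(None, "A"))  # True

# Replacing NULL with a placeholder (here an empty string) keeps the positions distinct
print(hash_key("A", "") == hash_key("", "A"))  # False
```

If that collision matters for a given entity, wrapping nullable columns in COALESCE before concatenating is one option, at the cost of departing from the "unaltered" rule for the checksum.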

In [0]:
%sql
CREATE OR REPLACE TEMPORARY VIEW SourceSilverData
AS
SELECT 
  /*==================================================
  PK - Primary Key
  ==================================================*/
  SHA2(
    CONCAT_WS(
      '|'
      , `SalesOrderHeader`.`SalesOrderID`
      , `SalesOrderDetail`.`SalesOrderDetailID`
    )   
    , 256
  ) AS `FactSalesPK`
  
  /*==================================================
  BK - Business Key
  ==================================================*/
  , `SalesOrderHeader`.`SalesOrderID` AS `SalesOrderIDBK`
  , `SalesOrderDetail`.`SalesOrderDetailID` AS `SalesOrderDetailIDBK`
  
  /*==================================================
  FK - Foreign Key
  ==================================================*/
  , `SalesOrderHeader`.`OrderDate` AS `DimCalendarFK_OrderDate`
  , `SalesOrderHeader`.`DueDate` AS `DimCalendarFK_DueDate`
  , `SalesOrderHeader`.`ShipDate` AS `DimCalendarFK_ShipDate`
  , SHA2(CONCAT_WS('|', `SalesOrderHeader`.`CustomerID`), 256) AS `DimCustomerFK`
  , SHA2(CONCAT_WS('|', `SalesOrderHeader`.`ShipToAddressID`), 256) AS `DimAddressFK_ShipToAddress`
  , SHA2(CONCAT_WS('|', `SalesOrderHeader`.`BillToAddressID`), 256) AS `DimAddressFK_BillToAddress`
  , SHA2(CONCAT_WS('|', `SalesOrderDetail`.`ProductID`), 256) AS `DimProductFK`

  /*==================================================
  ID - Identification and Code fields
  ==================================================*/
  , `SalesOrderHeader`.`CustomerID`
  , `SalesOrderHeader`.`ShipToAddressID`
  , `SalesOrderHeader`.`BillToAddressID`
  , `SalesOrderDetail`.`ProductID`
  
  /*==================================================
  Table Attributes & Calculated Columns
  ==================================================*/
  , `SalesOrderHeader`.`RevisionNumber`
  , `SalesOrderHeader`.`Status` --Would require a case statement to give a meaningful name to the code
  , `SalesOrderHeader`.`OnlineOrderFlag` AS `IsOnelineOrder`
  , `SalesOrderHeader`.`SalesOrderNumber`
  , `SalesOrderHeader`.`PurchaseOrderNumber`
  , `SalesOrderHeader`.`AccountNumber`
  , `SalesOrderHeader`.`ShipMethod`
  , COALESCE(`SalesOrderHeader`.`CreditCardApprovalCode`, 'Unknown') AS `CreditCardApprovalCode`
  , `SalesOrderHeader`.`SubTotal` AS `SubTotalAmount` /*Note, since this is coming from the header, it will duplicate*/
  , `SalesOrderHeader`.`TaxAmt` AS `TaxAmount` /*Note, since this is coming from the header, it will duplicate*/
  , `SalesOrderHeader`.`Freight` AS `FreightAmount` /*Note, since this is coming from the header, it will duplicate*/
  , `SalesOrderHeader`.`TotalDue` AS `TotalDueAmount` /*Note, since this is coming from the header, it will duplicate*/
  , `SalesOrderHeader`.`Comment`

  , `SalesOrderDetail`.`OrderQty` AS `OrderQuantity`
  , `SalesOrderDetail`.`UnitPrice` AS `UnitPriceAmount`
  , `SalesOrderDetail`.`UnitPriceDiscount` AS `UnitPriceDiscountAmount`
  , `SalesOrderDetail`.`LineTotal` AS `LineTotalAmount`
  
  /*==================================================
  Metadata
  ==================================================*/
  , current_timestamp() AS `LatestModifiedDateTimeUTC`

  /*==================================================
  HashChecksum
  ==================================================*/
  , SHA2(
    CONCAT_WS(
      '|'
      , `SalesOrderHeader`.`OrderDate`
      , `SalesOrderHeader`.`DueDate`
      , `SalesOrderHeader`.`ShipDate`
      , `SalesOrderHeader`.`CustomerID`
      , `SalesOrderHeader`.`ShipToAddressID`
      , `SalesOrderHeader`.`BillToAddressID`
      , `SalesOrderDetail`.`ProductID`
      , `SalesOrderHeader`.`CustomerID`
      , `SalesOrderHeader`.`ShipToAddressID`
      , `SalesOrderHeader`.`BillToAddressID`
      , `SalesOrderDetail`.`ProductID`
      , `SalesOrderHeader`.`RevisionNumber`
      , `SalesOrderHeader`.`Status`
      , `SalesOrderHeader`.`OnlineOrderFlag`
      , `SalesOrderHeader`.`SalesOrderNumber`
      , `SalesOrderHeader`.`PurchaseOrderNumber`
      , `SalesOrderHeader`.`AccountNumber`
      , `SalesOrderHeader`.`ShipMethod`
      , `SalesOrderHeader`.`CreditCardApprovalCode`
      , `SalesOrderHeader`.`SubTotal`
      , `SalesOrderHeader`.`TaxAmt`
      , `SalesOrderHeader`.`Freight`
      , `SalesOrderHeader`.`TotalDue`
      , `SalesOrderHeader`.`Comment`
      , `SalesOrderDetail`.`OrderQty`
      , `SalesOrderDetail`.`UnitPrice`
      , `SalesOrderDetail`.`UnitPriceDiscount`
      , `SalesOrderDetail`.`LineTotal`
    ), 256 
  ) AS `HashChecksum`
  
FROM datalakehouse_silver.SalesOrderHeader
LEFT JOIN datalakehouse_silver.SalesOrderDetail
  ON SalesOrderHeader.SalesOrderID = SalesOrderDetail.SalesOrderID


Top 10 records view of the source data for reference

In [0]:
%sql
SELECT *
FROM SourceSilverData
LIMIT 10;

FactSalesPK,SalesOrderIDBK,SalesOrderDetailIDBK,DimCalendarFK_OrderDate,DimCalendarFK_DueDate,DimCalendarFK_ShipDate,DimCustomerFK,DimAddressFK_ShipToAddress,DimAddressFK_BillToAddress,DimProductFK,CustomerID,ShipToAddressID,BillToAddressID,ProductID,RevisionNumber,Status,IsOnelineOrder,SalesOrderNumber,PurchaseOrderNumber,AccountNumber,ShipMethod,CreditCardApprovalCode,SubTotalAmount,TaxAmount,FreightAmount,TotalDueAmount,Comment,OrderQuantity,UnitPriceAmount,UnitPriceDiscountAmount,LineTotalAmount,LatestModifiedDateTimeUTC,HashChecksum
dcdeafb4ac1f50f8c031d46bdda6439908b19c5888a2ab4c4daafc21e613fff1,71774,110563,2008-06-01,2008-06-13,2008-06-08,88241b234b35a8245558994c9177079327a0aac540a5fc05eda6367d27e62249,5f302d143dace627a6a87157fd1362b010874e4dc64609b17d87db648de0af3c,5f302d143dace627a6a87157fd1362b010874e4dc64609b17d87db648de0af3c,f391e014b2ee3a42955272b8fc78634de1d5833e0cacb412b180376f9c756e49,29847,1092,1092,822,2,5,False,SO71774,PO348186287,10-4020-000609,CARGO TRANSPORT 5,Unknown,880.3484,70.4279,22.0087,972.785,,1.0,356.898,0.0,356.898,2022-03-10T10:39:33.296+0000,7bd95dd5fc45c97026e5bfc33204559f968b7a11fa25c98f7f0a1252f183ecc6
cf6918afc978cfb327d111f3d73ed47b6486ba0d89fb7d1bbc004e1e1cf4bffc,71774,110562,2008-06-01,2008-06-13,2008-06-08,88241b234b35a8245558994c9177079327a0aac540a5fc05eda6367d27e62249,5f302d143dace627a6a87157fd1362b010874e4dc64609b17d87db648de0af3c,5f302d143dace627a6a87157fd1362b010874e4dc64609b17d87db648de0af3c,33eb7e4ae43f9873d9c84c0f07b055946b24a71ca27daa60acbbf95b44c7c5e0,29847,1092,1092,836,2,5,False,SO71774,PO348186287,10-4020-000609,CARGO TRANSPORT 5,Unknown,880.3484,70.4279,22.0087,972.785,,1.0,356.898,0.0,356.898,2022-03-10T10:39:33.296+0000,c21d2391676aca7a1253ccb8b706f5af0be15d9dc068a83eeb1d58b14e947b63
0e855d8d0e5fc570646caf7eecb86424ce8f0e76a9d9f708f1097ea911b34075,71776,110567,2008-06-01,2008-06-13,2008-06-08,96ff6611ed5904a4a36dc31993d735e7f932c1923cfcd6272341ecf3bfed56d9,3f1bb7c0da3c01e685edd592f3a3ca0b149a399d25b97c0da47118c24a39f59a,3f1bb7c0da3c01e685edd592f3a3ca0b149a399d25b97c0da47118c24a39f59a,c8c9cad7b920b50f713830b8dc55f59fffbbad98335d9f30e0bca8fab5dfeedd,30072,640,640,907,2,5,False,SO71776,PO19952192051,10-4020-000106,CARGO TRANSPORT 5,Unknown,78.81,6.3048,1.9703,87.0851,,1.0,63.9,0.0,63.9,2022-03-10T10:39:33.296+0000,593892141b9a52944bd1cfb934a34cddf9f3c3cc146348ff57db3e97ddb816c4
ef519b461f397cc54924c9c1bd5d0d0933872051b4fc152f0fe684af85a0c0f8,71780,110644,2008-06-01,2008-06-13,2008-06-08,44372afdce73ffc3f015ae258d57afd254801c84aeec73fe33e8827f2f2560b4,5f128c8385e577cd1539a0e5a758e4004f4b97e5986b00fb17d393a5ee5ed85d,5f128c8385e577cd1539a0e5a758e4004f4b97e5986b00fb17d393a5ee5ed85d,52efd2aad05d27e3eac3665b82f2bffa6da52351ce871c1c28e4ba69b40ea3e6,30113,653,653,880,2,5,False,SO71780,PO19604173239,10-4020-000340,CARGO TRANSPORT 5,Unknown,38418.6895,3073.4952,960.4672,42452.6519,,1.0,32.994,0.0,32.994,2022-03-10T10:39:33.296+0000,6d29f63647c0785f6ce8cde516d9af61ba6e3c2f28a0acbade8796541df52aeb
87935f57b347c8a72fa61f6391869bc40de878757eeeadc4816c12e0e1617ead,71780,110643,2008-06-01,2008-06-13,2008-06-08,44372afdce73ffc3f015ae258d57afd254801c84aeec73fe33e8827f2f2560b4,5f128c8385e577cd1539a0e5a758e4004f4b97e5986b00fb17d393a5ee5ed85d,5f128c8385e577cd1539a0e5a758e4004f4b97e5986b00fb17d393a5ee5ed85d,5b7c4e75c9485e2e988dce7c57bd9e9915a74217914e7d7a1f13955367db0899,30113,653,653,869,2,5,False,SO71780,PO19604173239,10-4020-000340,CARGO TRANSPORT 5,Unknown,38418.6895,3073.4952,960.4672,42452.6519,,7.0,41.994,0.0,293.958,2022-03-10T10:39:33.296+0000,ff5fde9ebf3cfb09a4458c898d9d29d57171ffce77e30c0a75189b0e9450ef8b
dec0b2ec70c3a39f715843ecbd581f5efee06bea85be65bb60ece661f310b9e4,71780,110642,2008-06-01,2008-06-13,2008-06-08,44372afdce73ffc3f015ae258d57afd254801c84aeec73fe33e8827f2f2560b4,5f128c8385e577cd1539a0e5a758e4004f4b97e5986b00fb17d393a5ee5ed85d,5f128c8385e577cd1539a0e5a758e4004f4b97e5986b00fb17d393a5ee5ed85d,51054b8a03281fd02034378a5570ae0c970fb1d5d64246e0eb981481c228c108,30113,653,653,925,2,5,False,SO71780,PO19604173239,10-4020-000340,CARGO TRANSPORT 5,Unknown,38418.6895,3073.4952,960.4672,42452.6519,,1.0,149.874,0.0,149.874,2022-03-10T10:39:33.296+0000,ba1da07445e10ec42cbd31e7445e7f892ffa3d052edab2d48b7e46b8d7810af0
231dc95884d84c919b3e33a3d71c710a251989b8cbd1bf8b940e62056aa3dde4,71780,110641,2008-06-01,2008-06-13,2008-06-08,44372afdce73ffc3f015ae258d57afd254801c84aeec73fe33e8827f2f2560b4,5f128c8385e577cd1539a0e5a758e4004f4b97e5986b00fb17d393a5ee5ed85d,5f128c8385e577cd1539a0e5a758e4004f4b97e5986b00fb17d393a5ee5ed85d,b064bdba191139689139124101c1c39926326a9b221bd8dfcd603f065c3dc3b8,30113,653,653,935,2,5,False,SO71780,PO19604173239,10-4020-000340,CARGO TRANSPORT 5,Unknown,38418.6895,3073.4952,960.4672,42452.6519,,2.0,24.294,0.0,48.588,2022-03-10T10:39:33.296+0000,7bd6fa8038f6b09ebcb1af020178c986f22d73b72bcd9156549763e7e1c0190a
2615eaf82985211926cde7010cd58e9c055688fb75677a75d2ec08daf56937be,71780,110640,2008-06-01,2008-06-13,2008-06-08,44372afdce73ffc3f015ae258d57afd254801c84aeec73fe33e8827f2f2560b4,5f128c8385e577cd1539a0e5a758e4004f4b97e5986b00fb17d393a5ee5ed85d,5f128c8385e577cd1539a0e5a758e4004f4b97e5986b00fb17d393a5ee5ed85d,5d85be4cc5af40a7cf2c4f0818d92689c185fdea6566745ef26305d80413f483,30113,653,653,810,2,5,False,SO71780,PO19604173239,10-4020-000340,CARGO TRANSPORT 5,Unknown,38418.6895,3073.4952,960.4672,42452.6519,,1.0,72.162,0.0,72.162,2022-03-10T10:39:33.296+0000,c1dd7025600e2f4a0f58ee2e18a64441c4b95ba7c2c4d16544d4574656f8dae6
7ab71539607a9eada4ef8daf7b3930d6dd5bbe04772b67fc0da6951ddf86c455,71780,110639,2008-06-01,2008-06-13,2008-06-08,44372afdce73ffc3f015ae258d57afd254801c84aeec73fe33e8827f2f2560b4,5f128c8385e577cd1539a0e5a758e4004f4b97e5986b00fb17d393a5ee5ed85d,5f128c8385e577cd1539a0e5a758e4004f4b97e5986b00fb17d393a5ee5ed85d,0b06d2ffebd5c025cf444cb95a73e1fff046569238eafd1e80f511ea2a807de3,30113,653,653,809,2,5,False,SO71780,PO19604173239,10-4020-000340,CARGO TRANSPORT 5,Unknown,38418.6895,3073.4952,960.4672,42452.6519,,3.0,37.152,0.0,111.456,2022-03-10T10:39:33.296+0000,72fd265d1e68fcf6b58af4c4b9cfedd52394bf48070e3ad280f02e2a388683f6
2087d79388f3a4f895c99493f250aa1fe6a45929e64213f73a3c1b9abcbe958d,71780,110638,2008-06-01,2008-06-13,2008-06-08,44372afdce73ffc3f015ae258d57afd254801c84aeec73fe33e8827f2f2560b4,5f128c8385e577cd1539a0e5a758e4004f4b97e5986b00fb17d393a5ee5ed85d,5f128c8385e577cd1539a0e5a758e4004f4b97e5986b00fb17d393a5ee5ed85d,5620e84be3e5141819e0d9e4ba10b782ba40e232e56352ed636dc0282161b543,30113,653,653,783,2,5,False,SO71780,PO19604173239,10-4020-000340,CARGO TRANSPORT 5,Unknown,38418.6895,3073.4952,960.4672,42452.6519,,5.0,1376.994,0.0,6884.97,2022-03-10T10:39:33.296+0000,3d8cb8a75ceefb4e50fb176520c248ea031751ab4470655f74fcabe21545c7be


# Output to target Delta entity

## Ensure the target Delta table exists in the expected location. Create if not.
* Use standardised location for where to place the files in the data lake. 
* Dimensions: /mnt/datalake_platinum/Dimensions
* Facts: /mnt/datalake_platinum/Fact

In [0]:
vTargetDeltaTablePath = '/mnt/datalake_platinum/Fact/' + pTargetEntityName
 
#Check values
print('vTargetDeltaTablePath: ' + vTargetDeltaTablePath)

Dynamic SQL Statement that will only create the table on the first run.

* If the table already exists, this will do nothing.
* Partitioning - Platinum Layer Logic
  * **Facts** - Partition by the most applicable Date field such as Sales Date or Order Date, one that is typically used to limit the data ingested into reporting. Don't partition on the ingestion date time stamp like in silver. 
    * **Performance of writes and reads needs to be checked. If performance is struggling, revise the partitioning level. It all depends on the volume of data per partition. Ideally the partitions should not be too small, because that impacts compression and data skipping. But files that are too large mean large data movements are needed at MERGE time.**
  * **Dimensions** - Do not partition dimensions that are small. Only partition if the number of rows becomes extremely large and filtering on a specific attribute would aid performance in reporting.

In [0]:
vDeltaTableCreateStatement = 'CREATE TABLE IF NOT EXISTS ' \
+ 'datalakehouse_' + pTargetDataLakeContainer + '.' + pTargetEntityName + ' \n' \
+ 'USING DELTA ' + '\n' \
+ 'LOCATION \'' + vTargetDeltaTablePath + '\' ' + '\n' \
+ 'PARTITIONED BY (`' + pPartitionByField + '`) ' + '\n' \
+ 'AS' + '\n' \
+ 'SELECT * FROM SourceSilverData'

#Check final output
print('vDeltaTableCreateStatement: ' + vDeltaTableCreateStatement)

#Execute the SQL
spark.sql(vDeltaTableCreateStatement)


Select top 10 records to check information

In [0]:
#Create the string that is the SQL query to execute
vSelectTop10RecordsSQLString = 'SELECT * ' + '\n' \
+ 'FROM datalakehouse_' + pTargetDataLakeContainer + '.' + pTargetEntityName + '\n' \
+ 'LIMIT 10;'

#Check the statement to be executed
print(vSelectTop10RecordsSQLString)

#Execute the SQL
vSelectTop10RecordsSQLString_resultDF = spark.sql(vSelectTop10RecordsSQLString)

# Check the output of the dynamic SQL query
display(vSelectTop10RecordsSQLString_resultDF)

FactSalesPK,SalesOrderIDBK,SalesOrderDetailIDBK,DimCalendarFK_OrderDate,DimCalendarFK_DueDate,DimCalendarFK_ShipDate,DimCustomerFK,DimAddressFK_ShipToAddress,DimAddressFK_BillToAddress,DimProductFK,CustomerID,ShipToAddressID,BillToAddressID,ProductID,RevisionNumber,Status,IsOnelineOrder,SalesOrderNumber,PurchaseOrderNumber,AccountNumber,ShipMethod,CreditCardApprovalCode,SubTotalAmount,TaxAmount,FreightAmount,TotalDueAmount,Comment,OrderQuantity,UnitPriceAmount,UnitPriceDiscountAmount,LineTotalAmount,LatestModifiedDateTimeUTC,HashChecksum
dcdeafb4ac1f50f8c031d46bdda6439908b19c5888a2ab4c4daafc21e613fff1,71774,110563,2008-06-01,2008-06-13,2008-06-08,,,,,29847,1092,1092,822,2,5,False,SO71774,PO348186287,10-4020-000609,CARGO TRANSPORT 5,Unknown,880.3484,70.4279,22.0087,972.785,,1.0,356.898,0.0,356.898,2022-03-10T10:06:45.691+0000,7bd95dd5fc45c97026e5bfc33204559f968b7a11fa25c98f7f0a1252f183ecc6
cf6918afc978cfb327d111f3d73ed47b6486ba0d89fb7d1bbc004e1e1cf4bffc,71774,110562,2008-06-01,2008-06-13,2008-06-08,,,,,29847,1092,1092,836,2,5,False,SO71774,PO348186287,10-4020-000609,CARGO TRANSPORT 5,Unknown,880.3484,70.4279,22.0087,972.785,,1.0,356.898,0.0,356.898,2022-03-10T10:06:45.691+0000,c21d2391676aca7a1253ccb8b706f5af0be15d9dc068a83eeb1d58b14e947b63
0e855d8d0e5fc570646caf7eecb86424ce8f0e76a9d9f708f1097ea911b34075,71776,110567,2008-06-01,2008-06-13,2008-06-08,,,,,30072,640,640,907,2,5,False,SO71776,PO19952192051,10-4020-000106,CARGO TRANSPORT 5,Unknown,78.81,6.3048,1.9703,87.0851,,1.0,63.9,0.0,63.9,2022-03-10T10:06:45.691+0000,593892141b9a52944bd1cfb934a34cddf9f3c3cc146348ff57db3e97ddb816c4
ef519b461f397cc54924c9c1bd5d0d0933872051b4fc152f0fe684af85a0c0f8,71780,110644,2008-06-01,2008-06-13,2008-06-08,,,,,30113,653,653,880,2,5,False,SO71780,PO19604173239,10-4020-000340,CARGO TRANSPORT 5,Unknown,38418.6895,3073.4952,960.4672,42452.6519,,1.0,32.994,0.0,32.994,2022-03-10T10:06:45.691+0000,6d29f63647c0785f6ce8cde516d9af61ba6e3c2f28a0acbade8796541df52aeb
87935f57b347c8a72fa61f6391869bc40de878757eeeadc4816c12e0e1617ead,71780,110643,2008-06-01,2008-06-13,2008-06-08,,,,,30113,653,653,869,2,5,False,SO71780,PO19604173239,10-4020-000340,CARGO TRANSPORT 5,Unknown,38418.6895,3073.4952,960.4672,42452.6519,,7.0,41.994,0.0,293.958,2022-03-10T10:06:45.691+0000,ff5fde9ebf3cfb09a4458c898d9d29d57171ffce77e30c0a75189b0e9450ef8b
dec0b2ec70c3a39f715843ecbd581f5efee06bea85be65bb60ece661f310b9e4,71780,110642,2008-06-01,2008-06-13,2008-06-08,,,,,30113,653,653,925,2,5,False,SO71780,PO19604173239,10-4020-000340,CARGO TRANSPORT 5,Unknown,38418.6895,3073.4952,960.4672,42452.6519,,1.0,149.874,0.0,149.874,2022-03-10T10:06:45.691+0000,ba1da07445e10ec42cbd31e7445e7f892ffa3d052edab2d48b7e46b8d7810af0
231dc95884d84c919b3e33a3d71c710a251989b8cbd1bf8b940e62056aa3dde4,71780,110641,2008-06-01,2008-06-13,2008-06-08,,,,,30113,653,653,935,2,5,False,SO71780,PO19604173239,10-4020-000340,CARGO TRANSPORT 5,Unknown,38418.6895,3073.4952,960.4672,42452.6519,,2.0,24.294,0.0,48.588,2022-03-10T10:06:45.691+0000,7bd6fa8038f6b09ebcb1af020178c986f22d73b72bcd9156549763e7e1c0190a
2615eaf82985211926cde7010cd58e9c055688fb75677a75d2ec08daf56937be,71780,110640,2008-06-01,2008-06-13,2008-06-08,,,,,30113,653,653,810,2,5,False,SO71780,PO19604173239,10-4020-000340,CARGO TRANSPORT 5,Unknown,38418.6895,3073.4952,960.4672,42452.6519,,1.0,72.162,0.0,72.162,2022-03-10T10:06:45.691+0000,c1dd7025600e2f4a0f58ee2e18a64441c4b95ba7c2c4d16544d4574656f8dae6
7ab71539607a9eada4ef8daf7b3930d6dd5bbe04772b67fc0da6951ddf86c455,71780,110639,2008-06-01,2008-06-13,2008-06-08,,,,,30113,653,653,809,2,5,False,SO71780,PO19604173239,10-4020-000340,CARGO TRANSPORT 5,Unknown,38418.6895,3073.4952,960.4672,42452.6519,,3.0,37.152,0.0,111.456,2022-03-10T10:06:45.691+0000,72fd265d1e68fcf6b58af4c4b9cfedd52394bf48070e3ad280f02e2a388683f6
2087d79388f3a4f895c99493f250aa1fe6a45929e64213f73a3c1b9abcbe958d,71780,110638,2008-06-01,2008-06-13,2008-06-08,,,,,30113,653,653,783,2,5,False,SO71780,PO19604173239,10-4020-000340,CARGO TRANSPORT 5,Unknown,38418.6895,3073.4952,960.4672,42452.6519,,5.0,1376.994,0.0,6884.97,2022-03-10T10:06:45.691+0000,3d8cb8a75ceefb4e50fb176520c248ea031751ab4470655f74fcabe21545c7be


# Merge the incoming source data into the output delta table

* Dimensions: Don't filter on the IngestionDateTimeStampUTC field for dimensions as they are not partitioned.
* Facts: Filter on IngestionDateTimeStampUTC to only process new data

Expected output to look like this 
```
MERGE INTO datalakehouse_platinum.FactSales as target

USING SourceSilverData as source
  ON target.FactSalesPK = source.FactSalesPK
    
WHEN MATCHED 
  AND target.HashChecksum <> source.HashChecksum
  THEN UPDATE SET *
  
WHEN NOT MATCHED 
  THEN INSERT *
;
```
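
The effect of the two MERGE clauses can be illustrated with a small in-memory simulation. This is a sketch of the logic only, not of how Delta Lake executes a MERGE: the function `simulate_merge` and the sample rows are hypothetical, and details such as duplicate keys in the source are ignored.

```python
def simulate_merge(target, source, key="FactSalesPK"):
    """Apply the MERGE rules to a dict keyed by primary key and return row counts."""
    updated = inserted = 0
    for row in source:
        existing = target.get(row[key])
        if existing is None:
            # WHEN NOT MATCHED THEN INSERT *
            target[row[key]] = dict(row)
            inserted += 1
        elif existing["HashChecksum"] != row["HashChecksum"]:
            # WHEN MATCHED AND the checksum differs THEN UPDATE SET *
            target[row[key]] = dict(row)
            updated += 1
        # Matched rows with an identical checksum are left untouched
    return {"num_updated_rows": updated, "num_inserted_rows": inserted}


target_table = {
    "pk1": {"FactSalesPK": "pk1", "HashChecksum": "aaa", "OrderQuantity": 1},
    "pk2": {"FactSalesPK": "pk2", "HashChecksum": "bbb", "OrderQuantity": 2},
}
source_batch = [
    {"FactSalesPK": "pk1", "HashChecksum": "aaa", "OrderQuantity": 1},  # unchanged
    {"FactSalesPK": "pk2", "HashChecksum": "ccc", "OrderQuantity": 5},  # changed
    {"FactSalesPK": "pk3", "HashChecksum": "ddd", "OrderQuantity": 7},  # new
]
counts = simulate_merge(target_table, source_batch)
print(counts)
```

Running the same batch a second time changes nothing, which is the behaviour to expect when the MERGE cell reports zero affected rows.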

In [0]:
# Use the table name calculated from the widget values rather than a hardcoded database name
vMergeIntoTargetDeltaTableSQLString = 'MERGE INTO ' + pTargetDatabaseTableName + ' as target' + '\n' \
+ 'USING SourceSilverData as source' + '\n' \
+ 'ON target.' + pTargetEntityName + 'PK = source.' + pTargetEntityName + 'PK' + '\n' \
+ 'WHEN MATCHED AND' + '\n' \
+ 'target.HashChecksum <> source.HashChecksum' + '\n' \
+ 'THEN UPDATE SET *' + '\n' \
+ 'WHEN NOT MATCHED' + '\n' \
+ 'THEN INSERT *'

#Check the statement to be executed
print('vMergeIntoTargetDeltaTableSQLString: ' + vMergeIntoTargetDeltaTableSQLString)

#Execute the statement
vMergeIntoTargetDeltaTableSQLString_Result = spark.sql(vMergeIntoTargetDeltaTableSQLString)

#Check the output of the dynamic sql query
display(vMergeIntoTargetDeltaTableSQLString_Result)

num_affected_rows,num_updated_rows,num_deleted_rows,num_inserted_rows
0,0,0,0


## Perform optimisations to the delta table for better performance

* ZORDER BY: Ensures the data is sorted on the keys you would join or filter on. Allows for data skipping on read and more efficient joins. 
  * Don't add too many columns to the comma-separated list; each column added reduces the effectiveness of the sorting. 
    * Only add the keys that will be used in joins to other tables i.e. the business keys or the Primary Key of the table. 
      * For Fact and Dimension tables, use the Primary Keys as they are used in the MERGE join. 
* Since no single partition will be altered at MERGE, we can't apply a consistent WHERE clause on the ZORDER BY. Thus we have to apply it on the whole table. 
  * If performance becomes an issue, perhaps ZORDER BY on the 10 latest partitions, as they are most likely to have had changes.
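
If limiting the optimisation to the latest partitions is ever needed, the statement could be built along these lines. This is a sketch under assumptions: the function name `build_optimize_statement` is hypothetical, and the list of partition values is assumed to have been fetched separately (for example with a distinct query on the partition column). Note that the WHERE clause of OPTIMIZE can only reference partition columns.

```python
def build_optimize_statement(table_name, zorder_field, partition_field=None,
                             partition_values=None, latest_n=10):
    """Build an OPTIMIZE ... ZORDER BY statement, optionally limited to recent partitions."""
    statement = "OPTIMIZE " + table_name
    if partition_field and partition_values:
        # ISO yyyy-MM-dd strings sort chronologically, so plain sorting is enough
        recent = sorted(set(partition_values))[-latest_n:]
        statement += "\nWHERE `" + partition_field + "` >= '" + recent[0] + "'"
    statement += "\nZORDER BY (`" + zorder_field + "`)"
    return statement


# Hypothetical partition values; latest_n is 2 here only to keep the example short
print(build_optimize_statement(
    "datalakehouse_platinum.FactSales",
    "FactSalesPK",
    partition_field="DimCalendarFK_OrderDate",
    partition_values=["2008-06-01", "2008-04-01", "2008-05-01"],
    latest_n=2,
))
```

Without a partition field the function returns the whole-table statement, which matches what this notebook runs today.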

In [0]:
# Field that will be used to sort the data in the delta table to ensure efficient data skipping on MERGE next time
vZOrderByFieldName = pTargetEntityName + 'PK'

#Create the string that is the SQL query to execute using the right entity name
vOptimiseTableWithZOrderClauseSQL = 'OPTIMIZE datalakehouse_' + pTargetDataLakeContainer + '.' + pTargetEntityName + '\n' \
+ 'ZORDER BY `' + vZOrderByFieldName + '`'

#Check the statement to be executed
print(vOptimiseTableWithZOrderClauseSQL)

#Execute the SQL
vOptimiseTableWithZOrderClauseSQL_resultDF = spark.sql(vOptimiseTableWithZOrderClauseSQL)

# Check the output of the dynamic SQL query
display(vOptimiseTableWithZOrderClauseSQL_resultDF)

path,metrics
dbfs:/mnt/datalake_platinum/Fact/FactSales,"List(0, 0, List(null, null, 0.0, 0, 0), List(null, null, 0.0, 0, 0), 1, List(minCubeSize(107374182400), List(0, 0), List(1, 95174), 0, List(0, 0), 0, null), 0, 1, 1, false)"


* VACUUM: Remove previous version files that are no longer needed if older than the specified number of hours. 168 hours = 7 days is the default. 
  * Note, this means time travel to before this period will not be possible. 
  * This ensures the table remains clean and as small as possible
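
The retention arithmetic can be sketched as a small helper that converts days to hours and builds the statement. The function name `build_vacuum_statement` and the guard against going below 7 days are assumptions for illustration; the guard simply mirrors the default retention mentioned above.

```python
DEFAULT_RETENTION_HOURS = 7 * 24  # 168 hours


def build_vacuum_statement(table_name, retain_days=7):
    """Build a VACUUM statement with the retention expressed in hours."""
    retain_hours = retain_days * 24
    if retain_hours < DEFAULT_RETENTION_HOURS:
        # Shorter retention removes files that time travel may still need
        raise ValueError("Retention below 168 hours is not allowed in this sketch")
    return f"VACUUM {table_name} RETAIN {retain_hours} HOURS"


print(build_vacuum_statement("datalakehouse_platinum.FactSales"))
```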

Sample query expected
```
VACUUM datalakehouse_platinum.FactSales RETAIN 168 HOURS
```

In [0]:
#Create the string that is the SQL query to execute
vVacuumTableSQLStatement = 'VACUUM datalakehouse_' + pTargetDataLakeContainer + '.' + pTargetEntityName + ' RETAIN 168 HOURS'

#Check the statement to be executed
print(vVacuumTableSQLStatement)

#Execute the SQL
vVacuumTableSQLStatement_resultDF = spark.sql(vVacuumTableSQLStatement)

# Check the output of the dynamic SQL query
display(vVacuumTableSQLStatement_resultDF)