# Platinum Layer Dimension Delta Table - Generic notebook to guide on the conventions

### Configuration
|Item|Value|
|---|---|
|Parameter|JSON object with all applicable parameters|
|Source File Identification|Not applicable, will read from a Delta table|
|Output|It will write the output to a Delta table in the platinum database stored in the platinum container of the data lake|
|Manipulations included|Business Rules addition, joining multiple source entities together to form one consolidated entity, applying dimension rules and conventions. |
|Slowly Changing Dimension Type|SCD 1 - Insert new records, Update existing records in place.|

# Import all libraries required

In [0]:
import json

# Call Administration notebook to perform tasks before the data modelling can continue. 

*Look into having a single admin notebook where we store all the admin tasks and always just call that one?*

In [0]:
%run ../Administration/CreateDatabaseIfNotExists

# Call the applicable Helper Functions notebooks to include their functions for use in this Notebook

In [0]:
%run ../HelperFunctions/DataLakeHelperFunctions

# Parameters / Widgets

Here the single JSON parameter widget of the notebook will be generated.

Having it as JSON means we can send multiple values from the calling code such as Data Factory to the notebook at run time with a single widget / parameter. 

When we want to add more parameter values later, no notebook change is needed to add more widgets and no Data Factory change is needed to add more parameters. Just change the value sent. 

The JSON object will be unpacked and the attributes needed will be extracted into variables with the "p" prefix to show they are notebook-level parameters.

Run this when you want to re-initialise the widget in the following cell with new default values. 
Do not let this cell run as part of normal operations

In [0]:
#Run this when you want to re-initialise the widget in the following cell with new default values. 
#Do not let this cell run as part of normal operations
dbutils.widgets.removeAll()

### Define the widget
Note: once created, the widget stays attached to the notebook. This code is kept so that it runs each time and ensures the widget exists. 
|Parameter Value|Description|
|--|--|
|"SourceDataLakeContainer"|Container/Area in the data flow where the source delta lake table is located|
|"TargetDataLakeContainer"|Container/Area in the data flow where the target delta lake table is located|
|"TargetEntityName"|Name of the entity being processed. Will become the name of the target delta table.|

In [0]:
# Create the widget in the first place with a default value one can use for testing
# This iteration expects a single JSON Object, not an array
dbutils.widgets.text("widgetJSONString", 
'''
{
"SourceDataLakeContainer": "silver",
"TargetDataLakeContainer": "platinum",
"TargetEntityName": "DimProduct"
}
'''
)

### Transform parameter values received into usable format
* Data type conversion
* String manipulation
* Property extraction
* etc.

In [0]:
# At this stage, the string in the variable is still just a string, not typed as JSON. 
# Convert it to a JSON typed value using json.loads
pNotebookWidgetWithJSONString = json.loads(dbutils.widgets.get("widgetJSONString"))

# Print out full value received for logging purposes
print("pNotebookWidgetWithJSONString: " + str(pNotebookWidgetWithJSONString))

# Assign each attribute to the applicable variable to be used going forward

pSourceDataLakeContainer = pNotebookWidgetWithJSONString["SourceDataLakeContainer"]
pTargetDataLakeContainer = pNotebookWidgetWithJSONString["TargetDataLakeContainer"]
pTargetEntityName = pNotebookWidgetWithJSONString["TargetEntityName"]

# Calculated parameters from the widget values
pTargetDatabaseTableName = 'datalakehouse_' + pTargetDataLakeContainer + '.' + pTargetEntityName

#Print the rest of the values for troubleshooting
print('pSourceDataLakeContainer: ' + pSourceDataLakeContainer)
print('pTargetDataLakeContainer: ' + pTargetDataLakeContainer)
print('pTargetEntityName: ' + pTargetEntityName)
print('pTargetDatabaseTableName: ' + pTargetDatabaseTableName)
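
The unpacking step above can also be wrapped in a small helper that fails early when an expected attribute is missing. This is only a sketch: the function name `parse_notebook_parameters` and the `REQUIRED_KEYS` tuple are hypothetical and not part of the existing helper notebooks. It also shows that extra attributes sent by the caller are tolerated without a notebook change.

```python
import json

# Attributes this notebook expects in the JSON object (taken from the widget table above)
REQUIRED_KEYS = ("SourceDataLakeContainer", "TargetDataLakeContainer", "TargetEntityName")


def parse_notebook_parameters(widget_json_string):
    """Parse the widget string and check that the required attributes exist."""
    parameters = json.loads(widget_json_string)

    # This iteration expects a single JSON object, not an array
    if not isinstance(parameters, dict):
        raise ValueError("Expected a single JSON object, got " + type(parameters).__name__)

    missing = [key for key in REQUIRED_KEYS if key not in parameters]
    if missing:
        raise ValueError("Missing parameter(s): " + ", ".join(missing))

    return parameters


# The caller sends one extra attribute; the notebook simply ignores it
example_parameters = parse_notebook_parameters(
    '{"SourceDataLakeContainer": "silver", '
    '"TargetDataLakeContainer": "platinum", '
    '"TargetEntityName": "DimProduct", '
    '"SomeFutureValue": "not used yet"}'
)
print("TargetEntityName: " + example_parameters["TargetEntityName"])
```

In the notebook itself the input string would come from `dbutils.widgets.get("widgetJSONString")`.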


In [0]:
#pSourceDataLakeContainer = 'silver'
#pTargetDataLakeContainer = 'platinum'
#pTargetEntityName = 'DimProduct'

# Ensure applicable data lake containers are mounted

Once a container has been mounted, this should never have to run again. It is included in all notebooks for safety.

* Source Container: silver
* Target Container: platinum

In [0]:
mount_lake_container(pSourceDataLakeContainer)

In [0]:
mount_lake_container(pTargetDataLakeContainer)

# Create single consolidated entity from all applicable silver entities

Rules to apply
* Keep the temporary view name standard - **SourceSilverData**
* Perform null value replacement with friendly value
  * Strings: 'Unknown'
  * Dates: '1900-01-01' or '9999-12-31', depending on the context
  * Boolean: null - don't assume false
* Perform lookups to other tables to get the names of IDs and Codes if not available in Dimensions
* Create calculated columns that will be used in reporting layer
* Field Names Convention: PascalCase, no spaces
* When concatenating values together, use CONCAT_WS and use the pipe character '|' as the separator
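
The null replacement rules above can be summarised in plain Python as a rough illustration. The function name `replace_null` and its `kind` / `open_ended` arguments are hypothetical; in the notebook the same effect is achieved with COALESCE in SQL.

```python
from datetime import date

UNKNOWN_STRING = "Unknown"
EARLIEST_DATE = date(1900, 1, 1)      # for dates that mark a start
OPEN_ENDED_DATE = date(9999, 12, 31)  # for dates that mark an end that has not happened yet


def replace_null(value, kind, open_ended=True):
    """Return a friendly value for nulls, following the rules listed above."""
    if value is not None:
        return value
    if kind == "string":
        return UNKNOWN_STRING
    if kind == "date":
        # Which default applies depends on the context of the field
        return OPEN_ENDED_DATE if open_ended else EARLIEST_DATE
    if kind == "boolean":
        # Don't assume false: a missing boolean stays null
        return None
    raise ValueError("Unsupported kind: " + kind)


print(replace_null(None, "string"))
print(replace_null(None, "date"))
print(replace_null(None, "boolean"))
```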

**PK - Primary Key**
* Used as the value other Dimensions and Fact entities link to. 
* Other Dimensions and Facts will include the same fields in their Foreign Key field to this table. 
* This will also be the join column in the MERGE into target table.
* Always use CONCAT_WS with a '|' as the separator to concatenate the values before applying the SHA2 hashing algorithm
  * Do this even if only one column is included. This keeps the code standard, and means INT values don't need an explicit cast to VARCHAR because CONCAT_WS already returns a string
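
To make the PK convention concrete, here is a minimal Python sketch of what `SHA2(CONCAT_WS('|', ...), 256)` does. The helper name `hash_key` is hypothetical. It assumes values are cast to their plain string form and that nulls are skipped, as CONCAT_WS does in Spark SQL; the string form of decimals or timestamps may differ between Python and Spark, so treat it as an illustration and not as a way to reproduce the stored hashes.

```python
import hashlib


def hash_key(*values):
    """Join the values with '|' (skipping nulls) and return the SHA-256 hex digest."""
    joined = "|".join(str(value) for value in values if value is not None)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# A single-column key: the INT value needs no explicit cast by the caller
single_column_key = hash_key(680)

# A composite key built from two hypothetical business key values
composite_key = hash_key("AW", 680)

print(len(single_column_key))
print(single_column_key != composite_key)
```

Because every Dimension and Fact hashes its keys the same way, a foreign key built from the same source values produces the same digest as the target table's PK.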
  
**BK - Business Key**
* Business key(s) of the entity. 
* Fields that uniquely identify the record using the source system IDs and Codes

**FK - Foreign Key**
* Foreign keys to other tables
* Also needs to be hashed the same way that the PK of the target tables are hashed. 
  * EXCEPT Dates - they are not hashed - see **Table Attributes & Calculated Columns**
* Note, the source business keys that make up these foreign key values are included for troubleshooting - they also get the BK suffix, even though they are not the business keys of this table itself

**ID - Identification and Code fields**
* The source business keys that make up the foreign key values are included for troubleshooting - they also get the BK suffix, even though they are not the business keys of this table itself
* Don't perform alteration, this is for viewing purposes. 
* Add the suffix ID to the field names if they are business keys of entities - but only use ID if the field is not in its primary source table. 
* E.g. if the CustomerCodeID value is in the FactSales table, it is outside the primary source which is DimCustomer. Thus the value gets an ID suffix in the FactSales table, but it gets the BK suffix in DimCustomer. 
* If the source value doesn't have ID as a suffix already, add it. 
* If it has something like "Code" as the suffix, still add the ID to have consistency.

**Table Attributes & Calculated Columns**
* Date fields - keep as proper date i.e. yyyy-MM-dd. Alias as DimCalendarFK_*UniqueDateFieldName*, because they will be foreign keys to the consolidated date dimension. This also enables incremental loading in Power BI. Since the Calendar's PK will also be a yyyy-MM-dd date, this doesn't require a hash. 
* Field names in general - do not include any abbreviations unless it is very clear or clearly documented somewhere e.g. Amt should be listed as Amount, Qty should be Quantity
* Monetary values - If possible, always try to add the Amount suffix to the name to indicate this is a monetary aggregateable amount. 
* Numerical Count values - If possible, always try to add the Quantity suffix to the name to indicate this is a non-monetary aggregateable amount. 

**Metadata**
* Information about the data we either get from source or manually define in this notebook such as change date times, deleted flags etc. 
* **LatestModifiedDateTimeUTC** - If this record is inserted or updated based on join conditions, this will be available to indicate when last a change was made on this record.

**HashChecksum**
* Hashed version of the source attributes and measures fields from source entities - unaltered
* Used to check in the MERGE statement if a record has changed since the last load so we only alter the edited records in target.
* Do not include any alterations to fields used, only if a source field changes does the checksum change
* Exclude - PK, BK fields - the fields used in the join don't need to be included, but they can be, it won't affect the process.
* Include all other non-metadata fields, including FK, ID, string, date, numerical fields etc. 
* Hash Algorithm: Use the SHA2 function with 256 bits as default.
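
Since the rest of this notebook builds SQL as strings, the HashChecksum expression could be generated the same way. The sketch below is an assumption about how one might do that; `build_hash_checksum_expression` is a hypothetical helper, and the column expressions passed in are the unaltered source fields (no COALESCE), as the rules above require.

```python
def build_hash_checksum_expression(source_columns, excluded_columns=()):
    """Return the SHA2/CONCAT_WS expression over all columns that are not excluded."""
    excluded = set(excluded_columns)
    included = [column for column in source_columns if column not in excluded]
    if not included:
        raise ValueError("At least one column is needed for the checksum")
    return "SHA2(CONCAT_WS('|', " + ", ".join(included) + "), 256) AS `HashChecksum`"


# The key field is left out, the attribute fields are kept
checksum_expression = build_hash_checksum_expression(
    ["`Product`.`ProductID`", "`Product`.`Name`", "`Product`.`Color`"],
    excluded_columns=["`Product`.`ProductID`"],
)
print(checksum_expression)
```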

In [0]:
%sql
CREATE OR REPLACE TEMPORARY VIEW SourceSilverData
AS
SELECT 
  /*==================================================
  PK - Primary Key
  ==================================================*/
  SHA2(CONCAT_WS('|', `Product`.`ProductID`), 256) AS `DimProductPK`
  
  /*==================================================
  BK - Business Key
  ==================================================*/
  , `Product`.`ProductID` AS `ProductIDBK`

  /*==================================================
  ID - Identification
  ==================================================*/
  
  /*==================================================
  Table Attributes & Calculated Columns
  ==================================================*/
  , `Product`.`Name`
  , `Product`.`ProductNumber`
  , `Product`.`Color`
  , `Product`.`StandardCost`
  , `Product`.`ListPrice`
  , COALESCE(`Product`.`Size`, 'Unknown') AS `Size`
  , `Product`.`Weight`
  , `Product`.`SellStartDate`
  , COALESCE(`Product`.`SellEndDate`, '9999-12-31') AS `SellEndDate`
  , COALESCE(`Product`.`DiscontinuedDate`, '9999-12-31') AS `DiscontinuedDate`
  , `ProductCategory`.`Name` AS `ProductCetegory`
  , `ProductCategory_Parent`.`Name` AS `ProductCetegoryParent`
  , `ProductModel`.`Name` AS `ProductModel`
  , COALESCE(`ProductModel`.`CatalogDescription`, 'Unknown') AS `CatalogDescription`
  , COALESCE(`ProductDescription`.`Description`, 'Unknown') AS `Description`
  
  /*==================================================
  Metadata
  ==================================================*/
  , current_timestamp() AS `LatestModifiedDateTimeUTC`

  /*==================================================
  HashChecksum
  ==================================================*/
  , SHA2(
    CONCAT_WS(
      '|'
      , `Product`.`Name`
      , `Product`.`ProductNumber`
      , `Product`.`Color`
      , `Product`.`StandardCost`
      , `Product`.`ListPrice`
      , `Product`.`Size`
      , `Product`.`Weight`
      , `Product`.`SellStartDate`
      , `Product`.`SellEndDate`
      , `Product`.`DiscontinuedDate`
      , `ProductCategory`.`Name` /* AS `ProductCetegory`*/
      , `ProductCategory_Parent`.`Name` /*AS `ProductCetegoryParent`*/
      , `ProductModel`.`Name` /*AS `ProductModel`*/
      , `ProductModel`.`CatalogDescription`
      , `ProductDescription`.`Description`
    ), 256 
  ) AS `HashChecksum`
  
FROM datalakehouse_silver.Product
LEFT JOIN datalakehouse_silver.ProductCategory
  ON Product.ProductCategoryID = ProductCategory.ProductCategoryID
/*Get the product category parent as well to form the full hierarchy*/
LEFT JOIN datalakehouse_silver.ProductCategory AS ProductCategory_Parent
  ON ProductCategory.ParentProductCategoryID = ProductCategory_Parent.ProductCategoryID
LEFT JOIN datalakehouse_silver.ProductModel
  ON Product.ProductModelID = ProductModel.ProductModelID
/*ProductModelProductDescription is a link table to get the correct description. No fields will be returned from this table. */
LEFT JOIN datalakehouse_silver.ProductModelProductDescription
  ON ProductModel.ProductModelID = ProductModelProductDescription.ProductModelID
    AND ProductModelProductDescription.Culture = 'en' /*English description*/
LEFT JOIN datalakehouse_silver.ProductDescription
  ON ProductDescription.ProductDescriptionID = ProductModelProductDescription.ProductDescriptionID


Top 10 records view of the source data for reference

In [0]:
%sql
SELECT *
FROM SourceSilverData
LIMIT 10;

DimProductPK,ProductIDBK,Name,ProductNumber,Color,StandardCost,ListPrice,Size,Weight,SellStartDate,SellEndDate,DiscontinuedDate,ProductCetegory,ProductCetegoryParent,ProductModel,CatalogDescription,Description,LatestModifiedDateTimeUTC,HashChecksum
a4c6af0cb6f02dff01ba174e4cf11f24f73d9ed16ca7a1e3c9d831c0139faa5c,680,"HL Road Frame - Black, 58",FR-R92B-58,Black,1059.31,1431.5,58,1016.04,2002-06-01,9999-12-31,9999-12-31,Road Frames,Components,HL Road Frame,Unknown,Unknown,2022-03-10T10:49:36.860+0000,c9375c2a30ab2a688a3fa02072db05285797949928af215227f070c3a1c1f120
35254aa9a21444e50349cebb5465b9b42cb4a625ebcbffe24504b178c35bcb85,706,"HL Road Frame - Red, 58",FR-R92R-58,Red,1059.31,1431.5,58,1016.04,2002-06-01,9999-12-31,9999-12-31,Road Frames,Components,HL Road Frame,Unknown,Unknown,2022-03-10T10:49:36.860+0000,a00a2abdadc182341d2d9a942c3ec989429ebda142c579422644543ac4d64f37
5b60f221d4a1852afd0194ad0857fae9c558608e35621dce43301e8c771b7877,707,"Sport-100 Helmet, Red",HL-U509-R,Red,13.0863,34.99,Unknown,,2005-07-01,9999-12-31,9999-12-31,Helmets,Accessories,Sport-100,Unknown,Unknown,2022-03-10T10:49:36.860+0000,95bd1566ad8c22b504307c935d91a28f76d954de0a8a1b105c306af90b3a2fba
1706be6c293444756e72b05e4afa9eb1038e552ac6ce058309451ef7ddad7748,708,"Sport-100 Helmet, Black",HL-U509,Black,13.0863,34.99,Unknown,,2005-07-01,9999-12-31,9999-12-31,Helmets,Accessories,Sport-100,Unknown,Unknown,2022-03-10T10:49:36.860+0000,dacc12ee26ae469870eda708164f0ddd17124b3e215fdbfc3400573cf18b52a6
92c5fd0421c1d619cbf1bdba83a207261f2c5f764aed46db9b4d2de03b72b654,709,"Mountain Bike Socks, M",SO-B909-M,White,3.3963,9.5,M,,2005-07-01,2006-06-30,9999-12-31,Socks,Clothing,Mountain Bike Socks,Unknown,Unknown,2022-03-10T10:49:36.860+0000,ba522656039ff7362ec0594c29422c434b6641548765ecf2d6d1f220ec79eb39
40f8d6d22b99ea3388538fd60bbf532256434b0eac401df1d9a2bdbb29354ae8,713,"Long-Sleeve Logo Jersey, S",LJ-0192-S,Multi,38.4923,49.99,S,,2005-07-01,9999-12-31,9999-12-31,Jerseys,Clothing,Long-Sleeve Logo Jersey,Unknown,Unknown,2022-03-10T10:49:36.860+0000,932a715bf83676a4ff6b3071395a901e99b1a2686747d455128f44743471bccc
35c71bd7eaf4607047bb7c186d17251942204229b897e033923b13dc8ce2d109,715,"Long-Sleeve Logo Jersey, L",LJ-0192-L,Multi,38.4923,49.99,L,,2005-07-01,9999-12-31,9999-12-31,Jerseys,Clothing,Long-Sleeve Logo Jersey,Unknown,Unknown,2022-03-10T10:49:36.860+0000,f401a10ee9618b062e1e0b642086cf456cf89d9dea27e57a19a4f1d2790a21ff
d536a8c1664fec0bc85615cf3cb2645871e8b2935c9642c534c67ac85315cd35,717,"HL Road Frame - Red, 62",FR-R92R-62,Red,868.6342,1431.5,62,1043.26,2005-07-01,9999-12-31,9999-12-31,Road Frames,Components,HL Road Frame,Unknown,Unknown,2022-03-10T10:49:36.860+0000,3f55b8b1fb1618ae5a0b2b2945118942a9081dd7527a1dfd4afedcc9ba2a5315
d829857eb1366e70be857a69886d1555af0d32681beab068afb93492c2e2b843,720,"HL Road Frame - Red, 52",FR-R92R-52,Red,868.6342,1431.5,52,997.9,2005-07-01,9999-12-31,9999-12-31,Road Frames,Components,HL Road Frame,Unknown,Unknown,2022-03-10T10:49:36.860+0000,c317c67bfc815d0bf603460ec52bedcdbdc31d50e515842a8bd655fc9043cb53
74de057f768beb42de17ffc4b8a56100f0bed85947ecacaef111e3d3ec997950,721,"HL Road Frame - Red, 56",FR-R92R-56,Red,868.6342,1431.5,56,1016.04,2005-07-01,9999-12-31,9999-12-31,Road Frames,Components,HL Road Frame,Unknown,Unknown,2022-03-10T10:49:36.860+0000,9fdb7830a65caf0c0b41f293bb2c9a5e2b2915a304a0179bdad2c9de07d36c3f


# Output to target Delta entity

## Ensure the target Delta table exists in the expected location. Create if not.
* Use standardised location for where to place the files in the data lake. 
* Dimensions: /mnt/datalake_platinum/Dimension
* Facts: /mnt/datalake_platinum/Facts

In [0]:
vTargetDeltaTablePath = '/mnt/datalake_platinum/Dimension/' + pTargetEntityName
 
#Check values
print('vTargetDeltaTablePath: ' + vTargetDeltaTablePath)

Dynamic SQL Statement that will only create the table on the first run.

* If the table already exists, this will do nothing.
* Partitioning - Platinum Layer Logic
  * **Facts** - Partition by the most applicable Date field such as Sales Date or Order Date. One that is typically used to limit data ingested into reporting. Don't partition on the ingestion date time stamp like in silver. 
  * **Dimensions** - Do not partition dimensions that are small. Only if the number of rows become extremely large and filtering on a specific attribute would aid performance in reporting.

In [0]:
# Example target entity name: DimProduct
vDeltaTableCreateStatement = 'CREATE TABLE IF NOT EXISTS ' \
+ 'datalakehouse_' + pTargetDataLakeContainer + '.' + pTargetEntityName + ' \n' \
+ 'USING DELTA ' + '\n' \
+ 'LOCATION \'' + vTargetDeltaTablePath + '\' ' + '\n' \
+ 'AS' + '\n' \
+ 'SELECT * FROM SourceSilverData'

#Check final output
print('vDeltaTableCreateStatement: ' + vDeltaTableCreateStatement)

#Execute the SQL
spark.sql(vDeltaTableCreateStatement)
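
The partitioning guidance above is not exercised by this dimension, so here is a hedged sketch of how the same CREATE statement could optionally carry a PARTITIONED BY clause for a fact table. `build_create_table_statement`, the `FactSales` table, its path and the `DimCalendarFK_OrderDate` column are all hypothetical names used for illustration only.

```python
def build_create_table_statement(table_name, table_path, source_view, partition_columns=None):
    """Build a CREATE TABLE IF NOT EXISTS ... AS SELECT statement for a Delta table."""
    lines = ["CREATE TABLE IF NOT EXISTS " + table_name, "USING DELTA"]
    if partition_columns:
        # Facts: partition by the most applicable date field
        lines.append("PARTITIONED BY (" + ", ".join(partition_columns) + ")")
    lines.append("LOCATION '" + table_path + "'")
    lines.append("AS")
    lines.append("SELECT * FROM " + source_view)
    return "\n".join(lines)


# Small dimension: no partitioning
dimension_statement = build_create_table_statement(
    "datalakehouse_platinum.DimProduct",
    "/mnt/datalake_platinum/Dimension/DimProduct",
    "SourceSilverData",
)

# Hypothetical fact table partitioned by an order date field
fact_statement = build_create_table_statement(
    "datalakehouse_platinum.FactSales",
    "/mnt/datalake_platinum/Facts/FactSales",
    "SourceSilverData",
    partition_columns=["DimCalendarFK_OrderDate"],
)
print(fact_statement)
```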


Select top 10 records to check information

In [0]:
#Create the string that is the SQL query to execute
vSelectTop10RecordsSQLString = 'SELECT * ' + '\n' \
+ 'FROM datalakehouse_' + pTargetDataLakeContainer + '.' + pTargetEntityName + '\n' \
+ 'LIMIT 10;'

#Check the statement to be executed
print(vSelectTop10RecordsSQLString)

#Execute the SQL
vSelectTop10RecordsSQLString_resultDF = spark.sql(vSelectTop10RecordsSQLString)

# Check the output of the dynamic SQL query
display(vSelectTop10RecordsSQLString_resultDF)

DimProductPK,ProductIDBK,Name,ProductNumber,Color,StandardCost,ListPrice,Size,Weight,SellStartDate,SellEndDate,DiscontinuedDate,ProductCetegory,ProductCetegoryParent,ProductModel,CatalogDescription,Description,LatestModifiedDateTimeUTC,HashChecksum
a4c6af0cb6f02dff01ba174e4cf11f24f73d9ed16ca7a1e3c9d831c0139faa5c,680,"HL Road Frame - Black, 58",FR-R92B-58,Black,1059.31,1431.5,58,1016.04,2002-06-01,9999-12-31,9999-12-31,Road Frames,Components,HL Road Frame,Unknown,Unknown,2022-03-10T10:48:38.245+0000,c9375c2a30ab2a688a3fa02072db05285797949928af215227f070c3a1c1f120
35254aa9a21444e50349cebb5465b9b42cb4a625ebcbffe24504b178c35bcb85,706,"HL Road Frame - Red, 58",FR-R92R-58,Red,1059.31,1431.5,58,1016.04,2002-06-01,9999-12-31,9999-12-31,Road Frames,Components,HL Road Frame,Unknown,Unknown,2022-03-10T10:48:38.245+0000,a00a2abdadc182341d2d9a942c3ec989429ebda142c579422644543ac4d64f37
5b60f221d4a1852afd0194ad0857fae9c558608e35621dce43301e8c771b7877,707,"Sport-100 Helmet, Red",HL-U509-R,Red,13.0863,34.99,Unknown,,2005-07-01,9999-12-31,9999-12-31,Helmets,Accessories,Sport-100,Unknown,Unknown,2022-03-10T10:48:38.245+0000,95bd1566ad8c22b504307c935d91a28f76d954de0a8a1b105c306af90b3a2fba
1706be6c293444756e72b05e4afa9eb1038e552ac6ce058309451ef7ddad7748,708,"Sport-100 Helmet, Black",HL-U509,Black,13.0863,34.99,Unknown,,2005-07-01,9999-12-31,9999-12-31,Helmets,Accessories,Sport-100,Unknown,Unknown,2022-03-10T10:48:38.245+0000,dacc12ee26ae469870eda708164f0ddd17124b3e215fdbfc3400573cf18b52a6
92c5fd0421c1d619cbf1bdba83a207261f2c5f764aed46db9b4d2de03b72b654,709,"Mountain Bike Socks, M",SO-B909-M,White,3.3963,9.5,M,,2005-07-01,2006-06-30,9999-12-31,Socks,Clothing,Mountain Bike Socks,Unknown,Unknown,2022-03-10T10:48:38.245+0000,ba522656039ff7362ec0594c29422c434b6641548765ecf2d6d1f220ec79eb39
40f8d6d22b99ea3388538fd60bbf532256434b0eac401df1d9a2bdbb29354ae8,713,"Long-Sleeve Logo Jersey, S",LJ-0192-S,Multi,38.4923,49.99,S,,2005-07-01,9999-12-31,9999-12-31,Jerseys,Clothing,Long-Sleeve Logo Jersey,Unknown,Unknown,2022-03-10T10:48:38.245+0000,932a715bf83676a4ff6b3071395a901e99b1a2686747d455128f44743471bccc
35c71bd7eaf4607047bb7c186d17251942204229b897e033923b13dc8ce2d109,715,"Long-Sleeve Logo Jersey, L",LJ-0192-L,Multi,38.4923,49.99,L,,2005-07-01,9999-12-31,9999-12-31,Jerseys,Clothing,Long-Sleeve Logo Jersey,Unknown,Unknown,2022-03-10T10:48:38.245+0000,f401a10ee9618b062e1e0b642086cf456cf89d9dea27e57a19a4f1d2790a21ff
d536a8c1664fec0bc85615cf3cb2645871e8b2935c9642c534c67ac85315cd35,717,"HL Road Frame - Red, 62",FR-R92R-62,Red,868.6342,1431.5,62,1043.26,2005-07-01,9999-12-31,9999-12-31,Road Frames,Components,HL Road Frame,Unknown,Unknown,2022-03-10T10:48:38.245+0000,3f55b8b1fb1618ae5a0b2b2945118942a9081dd7527a1dfd4afedcc9ba2a5315
d829857eb1366e70be857a69886d1555af0d32681beab068afb93492c2e2b843,720,"HL Road Frame - Red, 52",FR-R92R-52,Red,868.6342,1431.5,52,997.9,2005-07-01,9999-12-31,9999-12-31,Road Frames,Components,HL Road Frame,Unknown,Unknown,2022-03-10T10:48:38.245+0000,c317c67bfc815d0bf603460ec52bedcdbdc31d50e515842a8bd655fc9043cb53
74de057f768beb42de17ffc4b8a56100f0bed85947ecacaef111e3d3ec997950,721,"HL Road Frame - Red, 56",FR-R92R-56,Red,868.6342,1431.5,56,1016.04,2005-07-01,9999-12-31,9999-12-31,Road Frames,Components,HL Road Frame,Unknown,Unknown,2022-03-10T10:48:38.245+0000,9fdb7830a65caf0c0b41f293bb2c9a5e2b2915a304a0179bdad2c9de07d36c3f


# Merge the incoming source data into the output delta table

* Dimensions: Don't filter on the IngestionDateTimeStampUTC field for dimensions as they are not partitioned.
* Facts: Filter on IngestionDateTimeStampUTC to only process new data


Expected output to look like this 
```
MERGE INTO datalakehouse_platinum.DimProduct as target

USING SourceSilverData as source
  ON target.DimProductPK = source.DimProductPK
    
WHEN MATCHED 
  AND target.HashChecksum <> source.HashChecksum
  THEN UPDATE SET *
  
WHEN NOT MATCHED 
  THEN INSERT *
;
```
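
For readers less familiar with MERGE, the SCD 1 behaviour of the statement above can be mimicked in plain Python. This is a simplified model that uses in-memory dictionaries in place of Delta tables; `merge_scd1` and the sample rows are hypothetical.

```python
def merge_scd1(target, source_rows, key="DimProductPK"):
    """Apply SCD 1 logic: update changed rows in place, insert new rows, skip the rest."""
    counts = {"updated": 0, "inserted": 0}
    for row in source_rows:
        existing = target.get(row[key])
        if existing is None:
            # WHEN NOT MATCHED THEN INSERT *
            target[row[key]] = dict(row)
            counts["inserted"] += 1
        elif existing["HashChecksum"] != row["HashChecksum"]:
            # WHEN MATCHED AND the checksum differs THEN UPDATE SET *
            target[row[key]] = dict(row)
            counts["updated"] += 1
        # Matched rows with an identical checksum are left untouched
    return counts


target_table = {
    "pk1": {"DimProductPK": "pk1", "Color": "Red", "HashChecksum": "h1"},
    "pk2": {"DimProductPK": "pk2", "Color": "Black", "HashChecksum": "h2"},
}
incoming_rows = [
    {"DimProductPK": "pk1", "Color": "Red", "HashChecksum": "h1"},     # unchanged
    {"DimProductPK": "pk2", "Color": "Blue", "HashChecksum": "h2b"},   # changed
    {"DimProductPK": "pk3", "Color": "White", "HashChecksum": "h3"},   # new
]
merge_counts = merge_scd1(target_table, incoming_rows)
print(merge_counts)
```

Running the same source a second time would report no updates and no inserts, because every checksum then matches.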

In [0]:
vMergeIntoTargetDeltaTableSQLString = 'MERGE INTO datalakehouse_' + pTargetDataLakeContainer + '.' + pTargetEntityName + ' as target' + '\n' \
+ 'USING SourceSilverData as source' + '\n' \
+ 'ON target.' + pTargetEntityName + 'PK = source.' + pTargetEntityName + 'PK' + '\n' \
+ 'WHEN MATCHED AND' + '\n' \
+ 'target.HashChecksum <> source.HashChecksum' + '\n' \
+ 'THEN UPDATE SET *' + '\n' \
+ 'WHEN NOT MATCHED' + '\n' \
+ 'THEN INSERT *'

#Check the statement to be executed
print('vMergeIntoTargetDeltaTableSQLString: ' + vMergeIntoTargetDeltaTableSQLString)

#Execute the statement
vMergeIntoTargetDeltaTableSQLString_Result = spark.sql(vMergeIntoTargetDeltaTableSQLString)

#Check the output of the dynamic SQL query
display(vMergeIntoTargetDeltaTableSQLString_Result)


num_affected_rows,num_updated_rows,num_deleted_rows,num_inserted_rows
0,0,0,0


## Perform optimisations to the delta table for better performance

* ZORDER BY: To ensure the keys of the data you would join or filter on are sorted correctly. Allows for data skipping on read and more efficient joins. 
  * Don't add too many columns to the comma-separated list; each added column reduces the effectiveness of the sorting. 
    * Only add the keys that will be used in joins to other tables i.e. the business keys or the Primary Key of the table. 
      * For Fact and Dimension tables, use the Primary Keys as they are used in the MERGE join. 
* Since no single partition will be altered at MERGE, we can't apply a consistent WHERE clause on the ZORDER BY. Thus we have to apply it on the whole table. 
  * If performance becomes an issue, perhaps ZORDER BY on the 10 latest partitions, as they are most likely to have had changes.
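
If restricting the optimisation to recent partitions ever becomes necessary, the OPTIMIZE statement accepts an optional WHERE clause on partition columns. The sketch below only builds the SQL string; `build_optimize_statement`, the `FactSales` table and the `DimCalendarFK_OrderDate` filter are hypothetical and would need to match a real partitioned table.

```python
def build_optimize_statement(table_name, zorder_columns, partition_filter=None):
    """Build an OPTIMIZE ... ZORDER BY statement, optionally limited to some partitions."""
    if not zorder_columns:
        raise ValueError("At least one ZORDER BY column is required")
    lines = ["OPTIMIZE " + table_name]
    if partition_filter:
        # Only valid on partition columns of the table
        lines.append("WHERE " + partition_filter)
    quoted_columns = ", ".join("`" + column + "`" for column in zorder_columns)
    lines.append("ZORDER BY (" + quoted_columns + ")")
    return "\n".join(lines)


# Whole-table optimisation, as this notebook does for dimensions
whole_table_statement = build_optimize_statement(
    "datalakehouse_platinum.DimProduct", ["DimProductPK"]
)

# Hypothetical fact table limited to recent partitions
recent_partitions_statement = build_optimize_statement(
    "datalakehouse_platinum.FactSales",
    ["FactSalesPK"],
    partition_filter="DimCalendarFK_OrderDate >= '2022-03-01'",
)
print(recent_partitions_statement)
```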

In [0]:
# Field that will be used to sort the data in the delta table to ensure efficient data skipping on MERGE next time
vZOrderByFieldName = pTargetEntityName + 'PK'

#Create the string that is the SQL query to execute, using the right container and entity name
vOptimiseTableWithZOrderClauseSQL = 'OPTIMIZE datalakehouse_' + pTargetDataLakeContainer + '.' + pTargetEntityName + '\n' \
+ 'ZORDER BY `' + vZOrderByFieldName + '`'

#Check the statement to be executed
print(vOptimiseTableWithZOrderClauseSQL)

#Execute the SQL
vOptimiseTableWithZOrderClauseSQL_resultDF = spark.sql(vOptimiseTableWithZOrderClauseSQL)

# Check the output of the dynamic SQL query
display(vOptimiseTableWithZOrderClauseSQL_resultDF)

path,metrics
dbfs:/mnt/datalake_platinum/Dimension/DimProduct,"List(0, 0, List(null, null, 0.0, 0, 0), List(null, null, 0.0, 0, 0), 0, List(minCubeSize(107374182400), List(0, 0), List(1, 58426), 0, List(0, 0), 0, null), 0, 1, 1, false)"


* VACUUM: Remove previous version files that are no longer needed if older than the specified number of hours. The default is 168 hours (7 days). 
  * Note, this means time travel to before this period will not be possible. 
  * This ensures the table remains clean and as small as possible

Sample query expected
```
VACUUM datalakehouse_silver.SalesOrderHeader RETAIN 168 HOURS
```
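
A small helper can make the retention period explicit in days and convert it to the hours the statement expects. This is a sketch with a hypothetical name, `build_vacuum_statement`; refusing anything shorter than the 7-day default is an assumed team policy, chosen here because shorter periods reduce how far back time travel can reach.

```python
DEFAULT_RETENTION_HOURS = 7 * 24  # 168 hours


def build_vacuum_statement(table_name, retention_days=7):
    """Build a VACUUM statement, expressing the retention period in hours."""
    retention_hours = retention_days * 24
    if retention_hours < DEFAULT_RETENTION_HOURS:
        # Assumed policy: never go below the default retention period
        raise ValueError("Retention must be at least " + str(DEFAULT_RETENTION_HOURS) + " hours")
    return "VACUUM " + table_name + " RETAIN " + str(retention_hours) + " HOURS"


vacuum_statement = build_vacuum_statement("datalakehouse_platinum.DimProduct")
print(vacuum_statement)
```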

In [0]:
#Create the string that is the SQL query to execute
vVacuumTableSQLStatement = 'VACUUM datalakehouse_' + pTargetDataLakeContainer + '.' + pTargetEntityName + ' RETAIN 168 HOURS'

#Check the statement to be executed
print(vVacuumTableSQLStatement)

#Execute the SQL
vVacuumTableSQLStatement_resultDF = spark.sql(vVacuumTableSQLStatement)

# Check the output of the dynamic SQL query
display(vVacuumTableSQLStatement_resultDF)

path
dbfs:/mnt/datalake_platinum/Dimension/DimProduct
