# Generic notebook to guide on the conventions for reading from Raw or Bronze and generating parquet files in the silver layer

### Configuration
|Item|Value|
|---|---|
|Parameter|Expects a parameter with a JSON object that contains an attribute with a full file path to the source file.|
|Source File Identification|Retrieved from the URL received in the JSON parameter|
|Output|Writes the output to parquet files in the silver container of the data lake. The number of files depends on the Spark partition setting|
|Manipulations included|Data Type correction, normalise JSON, hard structure changes to enable platinum layer to function efficiently|

## Enhancement still to be done
1. Add the IngestionDateTimeUTC value from the source file into target files. 
2.

# Import all libraries required
Only import the libraries and functions that are actually used.

In [0]:
import json
import pyspark.sql

# Call Administration notebook to perform tasks before the data modelling can continue. 

*Look into having a single admin notebook where we store all the admin tasks and always just call that one??*

In [0]:
%run ../Administration/CreateDatabaseIfNotExists

# Call the applicable Helper Functions notebooks to include their functions for use in this Notebook

In [0]:
%run ../HelperFunctions/DataLakeHelperFunctions

# Parameters / Widgets

Here the single JSON typed parameter widget of the notebook will be called and the value stored in a variable. 

Having it as JSON means we can send multiple values from the calling code, such as Data Factory, to the notebook at run time with a single widget / parameter. 

When we want to add more parameter values later, it doesn't require a notebook change to add more widgets OR a Data Factory change to add more parameters. 

The JSON object will be unpacked and the attributes needed extracted into variables with the "p" prefix to show that each one was a notebook level parameter.
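
The unpacking described above can be sketched in plain Python as follows. This is a minimal sketch, not the notebook's actual code: the helper name `unpack_notebook_parameter`, the `REQUIRED_KEYS` list, and the example URL are assumptions made for illustration. Only `json.loads` and the attribute names match what this notebook uses.

```python
import json

# Hypothetical list of attributes the notebook expects in the JSON object.
REQUIRED_KEYS = (
    "SourceDataLakeContainer",
    "FileFullPath",
    "TargetDataLakeContainer",
    "TargetDataLakeDirectory",
)


def unpack_notebook_parameter(raw_value):
    """Parse the widget string and fail early if an attribute is missing."""
    params = json.loads(raw_value)
    if not isinstance(params, dict):
        raise ValueError("Expected a single JSON object, not an array or scalar")
    missing = [key for key in REQUIRED_KEYS if key not in params]
    if missing:
        raise KeyError("Missing parameter attributes: " + ", ".join(missing))
    return params


# Example value only; the URL is a placeholder, not a real storage account.
example = '''
{
"SourceDataLakeContainer": "rawdata",
"FileFullPath": "https://example.invalid/rawdata/dir/file.csv",
"TargetDataLakeContainer": "silver",
"TargetDataLakeDirectory": "dir"
}
'''
params = unpack_notebook_parameter(example)

# The "p" prefix marks values that arrived as notebook level parameters.
pFileFullPath = params["FileFullPath"]
print("pFileFullPath: " + pFileFullPath)
```

Checking the keys up front means a missing attribute is reported by name as soon as the parameter is read, rather than surfacing later as a bare `KeyError` on first use.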

Run this when you want to re-initialise the widget in the following cell with new default values. 
Do not let this cell run as part of normal operations

In [0]:
#Run this when you want to re-initialise the widget in the following cell with new default values. 
#Do not let this cell run as part of normal operations
#dbutils.widgets.removeAll()

### Define the widget

In [0]:
# Create the widget in the first place with a default value one can use for testing
# This iteration expects a single JSON Object, not an array
dbutils.widgets.text("widgetNotebookWidgetWithJSONString", 
'''
{
"SourceDataLakeContainer": "rawdata",
"FileFullPath": "https://dianrandddatalake.blob.core.windows.net/rawdata/DummyAutomatedDirectory/2022/02/28/16/30/wwi-dimstockitem.csv",
"TargetDataLakeContainer": "silver",
"TargetDataLakeDirectory": "DummyAutomatedDirectory/2022/02/28/16/30"
}
'''
)

### Transform raw parameter values received into usable format
* Data type conversion
* String manipulation
* Property extraction
* etc.

Currently this focusses on the file path, but will be extended to work on all parameter values received

In [0]:
# At this stage, the string in the variable is still just a string, not typed as JSON. 
# Convert it to a JSON typed value using json.loads
pNotebookWidgetWithJSONString = json.loads(dbutils.widgets.get("widgetNotebookWidgetWithJSONString"))

# Print out full value received for logging purposes
print("pNotebookWidgetWithJSONString: " + str(pNotebookWidgetWithJSONString))

# Assign each attribute to the applicable variable to be used going forward
pFileFullPath = pNotebookWidgetWithJSONString["FileFullPath"]
pSourceDataLakeContainer = pNotebookWidgetWithJSONString["SourceDataLakeContainer"]
pTargetDataLakeContainer = pNotebookWidgetWithJSONString["TargetDataLakeContainer"]
pTargetDataLakeDirectory = pNotebookWidgetWithJSONString["TargetDataLakeDirectory"]

# Use helper functions to convert the full file path received to the mount point path instead to be used going forward
vMountPointPath = convert_full_file_path_to_mount_point(pFileFullPath)

print("vMountPointPath: " + vMountPointPath)
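
The helper `convert_full_file_path_to_mount_point` is defined in the HelperFunctions notebook and is not shown here. Purely as an illustration of the kind of conversion involved, the sketch below maps a blob URL to a mount point path. It assumes the source container is mounted under the same `/mnt/datalake_<container>` convention that this notebook uses for its output path; the function name `sketch_full_path_to_mount_point` is hypothetical and the real helper may behave differently.

```python
from urllib.parse import urlparse


def sketch_full_path_to_mount_point(full_path, mount_prefix="/mnt/datalake_"):
    """Hypothetical stand-in for the real helper, for illustration only."""
    # Assumption: the URL path has the shape /<container>/<directories>/<file>
    parsed = urlparse(full_path)
    container, _, relative_path = parsed.path.lstrip("/").partition("/")
    if not container or not relative_path:
        raise ValueError("Could not find a container and file path in: " + full_path)
    return mount_prefix + container + "/" + relative_path


example_path = sketch_full_path_to_mount_point(
    "https://dianrandddatalake.blob.core.windows.net/rawdata/"
    "DummyAutomatedDirectory/2022/02/28/16/30/wwi-dimstockitem.csv"
)
print(example_path)
# /mnt/datalake_rawdata/DummyAutomatedDirectory/2022/02/28/16/30/wwi-dimstockitem.csv
```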

# Ensure applicable data lake containers are mounted

After the first successful run, the mount should never have to be created again. It is included in all notebooks as a safety measure.
* Source and Target containers should be mounted

In [0]:
mount_lake_container(pSourceDataLakeContainer)

In [0]:
mount_lake_container(pTargetDataLakeContainer)

# Import source data into Data Frame and create temporary views for use in this notebook

In [0]:
# Example CSV file - read into a temp view for easy manipulation
# This will be different for JSON or Parquet source files
rawSourceDF = spark.read.format("csv")\
.options(header='true', inferSchema='true', delimiter='|')\
.load(vMountPointPath)

# createOrReplaceTempView returns None, so call it separately to keep rawSourceDF as a DataFrame
rawSourceDF.createOrReplaceTempView("rawSourceDF")

# Apply transformations

Top 10 records view of the source data for reference

In [0]:
%sql
SELECT *
FROM rawSourceDF
LIMIT 10

Stock Item Key,WWI Stock Item ID,Stock Item,Color,Selling Package,Buying Package,Brand,Size,Lead Time Days,Quantity Per Outer,Is Chiller Stock,Barcode,Tax Rate,Unit Price,Recommended Retail Price,Typical Weight Per Unit,Valid From,Valid To,Lineage Key
0,0,Unknown,,,,,,0,0,False,,0.0,0.0,0.0,0.0,2013-01-01 00:00:00.0000000,9999-12-31 23:59:59.9999999,0
1,219,Void fill 400 L bag (White) 400L,,Each,Each,,400L,14,10,False,,14.0,50.0,74.75,1.0,2013-01-01 00:00:00.0000000,2016-05-31 23:00:00.0000000,5
2,218,Void fill 300 L bag (White) 300L,,Each,Each,,300L,14,10,False,,14.0,37.5,56.06,0.75,2013-01-01 00:00:00.0000000,2016-05-31 23:00:00.0000000,5
3,217,Void fill 200 L bag (White) 200L,,Each,Each,,200L,14,10,False,,14.0,25.0,37.38,0.5,2013-01-01 00:00:00.0000000,2016-05-31 23:00:00.0000000,5
4,216,Void fill 100 L bag (White) 100L,,Each,Each,,100L,14,10,False,,14.0,12.5,18.69,0.25,2013-01-01 00:00:00.0000000,2016-05-31 23:00:00.0000000,5
5,215,Air cushion machine (Blue),,Each,Each,,,20,1,False,,20.0,1899.0,2839.01,10.0,2013-01-01 00:00:00.0000000,2016-05-31 23:00:00.0000000,5
6,214,Air cushion film 200mmx200mm 325m,,Each,Each,,325m,14,1,False,,14.0,90.0,134.55,6.0,2013-01-01 00:00:00.0000000,2016-05-31 23:00:00.0000000,5
7,213,Air cushion film 200mmx100mm 325m,,Each,Each,,325m,14,1,False,,14.0,87.0,130.07,5.0,2013-01-01 00:00:00.0000000,2016-05-31 23:00:00.0000000,5
8,212,Large replacement blades 18mm,,Each,Each,,18mm,14,10,False,,14.0,4.3,6.43,0.8,2013-01-01 00:00:00.0000000,2016-05-31 23:00:00.0000000,5
9,211,Small 9mm replacement blades 9mm,,Each,Each,,9mm,14,10,False,,14.0,4.1,6.13,0.7,2013-01-01 00:00:00.0000000,2016-05-31 23:00:00.0000000,5


* Rename to fit to parquet naming standard
* Cast to appropriate data type
  * Data type reference: [cast function](https://docs.microsoft.com/en-us/azure/databricks/sql/language-manual/functions/cast)
* Remember to use the \` (backtick) character to enclose field names that contain spaces
* Save results to new temp view that is referenced either by the next transformation step, or by the step that writes the output to lake
* Note
  * It is more efficient to perform all new column additions and type transforms in a single SQL statement, because .withColumn and .withColumnRenamed each create a new DataFrame on every call.
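
For wide source files, the rename-and-cast statement described in the list above could be generated rather than typed by hand. The sketch below is one possible approach, not something this notebook currently does: the `column_types` mapping, the helper names, and the assumption that the naming standard simply removes spaces from the source column names are all illustrative.

```python
# Hypothetical mapping of source column name -> target SQL type (subset only).
column_types = {
    "Stock Item Key": "INT",
    "Stock Item": "VARCHAR(8000)",
    "Unit Price": "DECIMAL(19,4)",
    "Valid From": "TIMESTAMP",
}


def to_target_name(source_name):
    # Assumed naming standard: drop the spaces, keep the existing capitalisation.
    return source_name.replace(" ", "")


def build_cast_select(view_name, types):
    """Build one SELECT that renames and casts every mapped column."""
    casts = [
        "CAST(`{0}` AS {1}) AS `{2}`".format(name, sql_type, to_target_name(name))
        for name, sql_type in types.items()
    ]
    return "SELECT\n    " + "\n  , ".join(casts) + "\nFROM " + view_name


select_sql = build_cast_select("rawSourceDF", column_types)
print(select_sql)
```

The resulting string could then be passed to `spark.sql(...)` and registered as a temp view, which keeps all renames and casts in a single statement.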

In [0]:
%sql
CREATE OR REPLACE TEMPORARY VIEW transformedView
AS
SELECT 
    CAST(`Stock Item Key` AS INT ) AS `StockItemKey`
  , CAST(`WWI Stock Item ID` AS INT ) AS `WWIStockItemID`
  , CAST(`Stock Item` AS VARCHAR(8000) ) AS `StockItem`
  , CAST(`Color` AS VARCHAR(8000) ) AS `Color`
  , CAST(`Selling Package` AS VARCHAR(8000) ) AS `SellingPackage`
  , CAST(`Buying Package` AS VARCHAR(8000) ) AS `BuyingPackage`
  , CAST(`Brand` AS VARCHAR(8000) ) AS `Brand`
  , CAST(`Size` AS VARCHAR(8000) ) AS `Size`
  , CAST(`Lead Time Days` AS INT ) AS `LeadTimeDays`
  , CAST(`Quantity Per Outer` AS INT ) AS `QuantityPerOuter`
  , CAST(`Is Chiller Stock` AS BOOLEAN ) AS `IsChillerStock`
  , CAST(`Barcode` AS VARCHAR(8000) ) AS `Barcode`
  , CAST(`Tax Rate` AS DECIMAL(19,4) ) AS `TaxRate`
  , CAST(`Unit Price` AS DECIMAL(19,4) ) AS `UnitPrice`
  , CAST(`Recommended Retail Price` AS DECIMAL(19,4) ) AS `RecommendedRetailPrice`
  , CAST(`Typical Weight Per Unit` AS DECIMAL(19,4) ) AS `TypicalWeightPerUnit`
  , CAST(`Valid From` AS TIMESTAMP ) AS `ValidFrom`
  , CAST(`Valid To` AS TIMESTAMP ) AS `ValidTo`
  , CAST(`Lineage Key` AS INT ) AS `LineageKey`
FROM rawSourceDF

# Output data to lake

* For now just writing to a fixed location for testing
* See the things to do list at the top of the notebook on how this needs to be automated

### Define the new output path

* This can be made dynamic using input parameters
* This should be made dynamic using the yyyy/MM/dd/HH/mm of when the source file was received
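
As a rough sketch of the dynamic path mentioned in the list above, the directory could be derived from the timestamp at which the source file was received. The function name `build_output_path`, its parameters, and the way the timestamp is supplied are assumptions for illustration. The `/mnt/datalake_` prefix follows the convention used in the next cell.

```python
from datetime import datetime, timezone


def build_output_path(container, base_directory, received_utc,
                      mount_prefix="/mnt/datalake_"):
    """Hypothetical helper: append yyyy/MM/dd/HH/mm to the base directory."""
    # The strftime pattern %Y/%m/%d/%H/%M corresponds to yyyy/MM/dd/HH/mm
    date_part = received_utc.strftime("%Y/%m/%d/%H/%M")
    return mount_prefix + container + "/" + base_directory.strip("/") + "/" + date_part


# Example timestamp only; in practice it would come from the source file metadata.
received = datetime(2022, 2, 28, 16, 30, tzinfo=timezone.utc)
example_output_path = build_output_path("silver", "DummyAutomatedDirectory", received)
print(example_output_path)
# /mnt/datalake_silver/DummyAutomatedDirectory/2022/02/28/16/30
```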

In [0]:
# This assumes the values look like this (if the source parameters look different, this will need to be updated as well)
# "TargetDataLakeContainer": "silver",
# "TargetDataLakeDirectory": "DummyAutomatedDirectory/2022/02/28/16/30"
# The actual output files will then be directly in the "30" folder
vOutputPath = '/mnt/datalake_' + pTargetDataLakeContainer + '/' + pTargetDataLakeDirectory

print(vOutputPath)

### Convert temp view into data frame in order to write to target location

In [0]:
# spark.sql is the SparkSession equivalent of the legacy sqlContext.sql entry point
finalDF = spark.sql("SELECT * FROM transformedView")

## Write to output

* Note
  * The vOutputPath here will become the parent directory for the files generated
  * It will contain the actual parquet files, but also the metadata files such as _committed, _started, _SUCCESS

In [0]:
finalDF.write\
.format("parquet")\
.save(vOutputPath)
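
Because the output directory holds metadata files alongside the parquet files, any later step that lists the directory may need to tell the two apart. The sketch below shows one way to do that on a plain list of file names. The helper `is_data_file` and the example file names are illustrative assumptions, not actual output from this notebook.

```python
def is_data_file(file_name):
    """Treat names ending in .parquet that do not start with '_' or '.' as data files."""
    return file_name.endswith(".parquet") and not file_name.startswith(("_", "."))


# Illustrative listing only; real part-file names and commit identifiers will differ.
listing = [
    "_SUCCESS",
    "_started_123",
    "_committed_123",
    "part-00000-abc.snappy.parquet",
    "part-00001-abc.snappy.parquet",
]

data_files = [name for name in listing if is_data_file(name)]
print(data_files)
```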