####Here we load data from bronze to silver for one of the source systems; trainees are encouraged to load the data for the other source system themselves.<br> 
######We apply basic transformations such as column selection, filtering, deduplication, and schema enforcement.

In [0]:
#create schema for silver if it does not exist
spark.sql("""CREATE SCHEMA IF NOT EXISTS psl_salesdev.silver MANAGED LOCATION 'abfss://sales@stavikaslakefreetrail.dfs.core.windows.net/silver/'""")

DataFrame[]

####First, we read the order details bronze table directly through its catalog reference

In [0]:
df_orderdetails_retail=spark.table("psl_salesdev.bronze.ordersdetails_retail")
# Product ID:string
# Category:string
# Sub-Category:string
# Product Name:string

#Product table - Silver data load

Column pruning - We select only the required columns from the orders table

In [0]:
df_selected=df_orderdetails_retail.select("Product ID","Product Name","Sub-Category","Category","Source")

In [0]:
df_selected.limit(5).display()

Product ID,Product Name,Sub-Category,Category,Source
FUR-BO-10001798,Bush Somerset Collection Bookcase,Bookcases,Furniture,Retail CSV
FUR-CH-10000454,"Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back",Chairs,Furniture,Retail CSV
OFF-LA-10000240,Self-Adhesive Address Labels for Typewriters by Universal,Labels,Office Supplies,Retail CSV
FUR-TA-10000577,Bretford CR4500 Series Slim Rectangular Table,Tables,Furniture,Retail CSV
OFF-ST-10000760,Eldon Fold 'N Roll Cart System,Storage,Office Supplies,Retail CSV


####Standardise Column Names

In [0]:
df_selected_std=(
    df_selected
    .withColumnRenamed("Product ID","productID")
    .withColumnRenamed("Product Name","productName")
    .withColumnRenamed("Sub-Category","subCategory")
    .withColumnRenamed("Category","category")
    .withColumnRenamed("Source","source")
)
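The chained renames above can also be driven from a single mapping, which keeps the old and new names side by side. The sketch below is one possible way to do this. PRODUCT_COLUMN_MAP and rename_columns are illustrative names that do not appear in the original notebook, and the helper relies only on the withColumnRenamed method already used above.

```python
# Illustrative mapping (hypothetical name) from bronze column names to silver names.
PRODUCT_COLUMN_MAP = {
    "Product ID": "productID",
    "Product Name": "productName",
    "Sub-Category": "subCategory",
    "Category": "category",
    "Source": "source",
}


def rename_columns(df, mapping):
    """Rename each column listed in mapping; columns not listed are left unchanged."""
    for old_name, new_name in mapping.items():
        df = df.withColumnRenamed(old_name, new_name)
    return df


# Usage in the notebook (assumes df_selected from the cell above):
# df_selected_std = rename_columns(df_selected, PRODUCT_COLUMN_MAP)
```

Keeping the mapping in one place makes it easier to review the target names and to reuse them when loading the other source system.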

Remove duplicate rows from the source

In [0]:
#drop duplicates from the selected dataframe
df_deduplicated=df_selected_std.dropDuplicates()

In [0]:
df_deduplicated.display()

productID,productName,subCategory,category,source
TEC-PH-10000486,Plantronics HL10 Handset Lifter,Phones,Technology,Retail CSV
TEC-AC-10003832,Logitech P710e Mobile Speakerphone,Accessories,Technology,Retail CSV
TEC-AC-10002076,Microsoft Natural Keyboard Elite,Accessories,Technology,Retail CSV
TEC-PH-10004667,Cisco 8x8 Inc. 6753i IP Business Phone System,Phones,Technology,Retail CSV
OFF-BI-10000545,GBC Ibimaster 500 Manual ProClick Binding System,Binders,Office Supplies,Retail CSV
OFF-PA-10003543,Xerox 1985,Paper,Office Supplies,Retail CSV
OFF-PA-10000994,Xerox 1915,Paper,Office Supplies,Retail CSV
FUR-TA-10003008,"Lesro Round Back Collection Coffee Table, End Table",Tables,Furniture,Retail CSV
OFF-PA-10002464,HP Office Recycled Paper (20Lb. and 87 Bright),Paper,Office Supplies,Retail CSV
TEC-AC-10001109,Logitech Trackman Marble Mouse,Accessories,Technology,Retail CSV


Add current timestamp

In [0]:
from pyspark.sql.functions import current_timestamp

df_cleaned=df_deduplicated.withColumn("ingestionTimestamp", current_timestamp())

In [0]:
df_cleaned.limit(5).display()

productID,productName,subCategory,category,source,ingestionTimestamp
TEC-PH-10000486,Plantronics HL10 Handset Lifter,Phones,Technology,Retail CSV,2024-09-09T19:20:21.623Z
TEC-AC-10003832,Logitech P710e Mobile Speakerphone,Accessories,Technology,Retail CSV,2024-09-09T19:20:21.623Z
TEC-AC-10002076,Microsoft Natural Keyboard Elite,Accessories,Technology,Retail CSV,2024-09-09T19:20:21.623Z
TEC-PH-10004667,Cisco 8x8 Inc. 6753i IP Business Phone System,Phones,Technology,Retail CSV,2024-09-09T19:20:21.623Z
OFF-BI-10000545,GBC Ibimaster 500 Manual ProClick Binding System,Binders,Office Supplies,Retail CSV,2024-09-09T19:20:21.623Z


When a table is created with a CREATE TABLE statement in Databricks, it is created as a Delta table by default

In [0]:
spark.sql("""
CREATE TABLE IF NOT EXISTS psl_salesdev.silver.product_cleaned
(
  productID STRING NOT NULL COMMENT 'Unique identifier for each product',
  productName STRING NOT NULL COMMENT 'Name of the product',
  subCategory STRING COMMENT 'Sub-category of the product',
  category STRING COMMENT 'Category of the product',
  source STRING COMMENT 'Source system of the data',
  ingestionTimestamp TIMESTAMP COMMENT 'Timestamp when the data was ingested'
)
COMMENT 'This external table contains cleaned product data, partitioned by ingestion timestamp'
PARTITIONED BY (ingestionTimestamp)
LOCATION 'abfss://sales@stavikaslakefreetrail.dfs.core.windows.net/silver/product_cleaned'
""")


DataFrame[]

The mergeSchema option below helps with schema evolution as and when needed

In [0]:
df_cleaned.write.mode("overwrite").option("mergeSchema", True).partitionBy("ingestionTimestamp").saveAsTable("psl_salesdev.silver.product_cleaned")
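As a rough illustration of where mergeSchema matters, the sketch below wraps an append in a small helper. The helper name append_with_schema_evolution is hypothetical, and df_new_batch stands for an assumed later batch that may carry extra columns. It assumes the target is a Delta table, where the mergeSchema write option lets new columns in the incoming DataFrame be added to the table schema instead of failing the write.

```python
def append_with_schema_evolution(df, table_name):
    """Append df to an existing Delta table, allowing new columns to be added.

    Hypothetical helper: with mergeSchema enabled, columns present in df but
    missing from the table are added to the table schema during the write.
    """
    (
        df.write
        .mode("append")
        .option("mergeSchema", "true")
        .saveAsTable(table_name)
    )


# Usage (df_new_batch is an assumed DataFrame from a later load):
# append_with_schema_evolution(df_new_batch, "psl_salesdev.silver.product_cleaned")
```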

Let us now verify that our data write was successful

In [0]:
spark.table("psl_salesdev.silver.product_cleaned").display()

productID,productName,subCategory,category,source,ingestionTimestamp
TEC-PH-10000486,Plantronics HL10 Handset Lifter,Phones,Technology,Retail CSV,2024-09-09T19:36:52.229Z
TEC-AC-10003832,Logitech P710e Mobile Speakerphone,Accessories,Technology,Retail CSV,2024-09-09T19:36:52.229Z
TEC-AC-10002076,Microsoft Natural Keyboard Elite,Accessories,Technology,Retail CSV,2024-09-09T19:36:52.229Z
TEC-PH-10004667,Cisco 8x8 Inc. 6753i IP Business Phone System,Phones,Technology,Retail CSV,2024-09-09T19:36:52.229Z
OFF-BI-10000545,GBC Ibimaster 500 Manual ProClick Binding System,Binders,Office Supplies,Retail CSV,2024-09-09T19:36:52.229Z
OFF-PA-10003543,Xerox 1985,Paper,Office Supplies,Retail CSV,2024-09-09T19:36:52.229Z
OFF-PA-10000994,Xerox 1915,Paper,Office Supplies,Retail CSV,2024-09-09T19:36:52.229Z
FUR-TA-10003008,"Lesro Round Back Collection Coffee Table, End Table",Tables,Furniture,Retail CSV,2024-09-09T19:36:52.229Z
OFF-PA-10002464,HP Office Recycled Paper (20Lb. and 87 Bright),Paper,Office Supplies,Retail CSV,2024-09-09T19:36:52.229Z
TEC-AC-10001109,Logitech Trackman Marble Mouse,Accessories,Technology,Retail CSV,2024-09-09T19:36:52.229Z


------------------------------------------------------------------------------

#Sales table

Column pruning - We select only the columns required for the sales calculation

In [0]:
df_sales_selected=df_orderdetails_retail.select("Order ID","Order Date","Product ID","Sales","Quantity","Profit","Source")


We now want to standardise the column names using functions

In [0]:
# Function to convert a single column name to camelCase
def to_camel_case(column_name):
    words = column_name.split(" ")  # Split by space
    if len(words) == 1:  # If it's a single word or character, return it in lowercase
        return words[0].lower()
    else:
        return words[0].lower() + "".join(word.capitalize() for word in words[1:])

# Function to standardize column names using withColumnRenamed and a loop
def standardize_columns(df):
    for old_col in df.columns:
        new_col = to_camel_case(old_col)  # Apply the camelCase transformation
        df = df.withColumnRenamed(old_col, new_col)  # Rename each column
    return df

#Function to add an ingestion timestamp
from pyspark.sql.functions import current_timestamp, col, to_date
def add_ingestionTimesatmp(df):
    return df.withColumn("ingestionTimestamp", current_timestamp())
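Note that to_camel_case above splits only on spaces, so a name such as Sub-Category would come out as sub-category. The sales columns selected here contain no hyphens, so this notebook is unaffected, but a variant along the following lines could cover hyphens and underscores too. The name to_camel_case_any is hypothetical and the choice of separators is an assumption.

```python
import re


def to_camel_case_any(column_name):
    """Convert a column name to camelCase, splitting on spaces, hyphens and underscores."""
    # Split on runs of whitespace, hyphens or underscores and drop empty pieces
    words = [w for w in re.split(r"[\s\-_]+", column_name.strip()) if w]
    if not words:
        # Nothing to convert (empty or separator-only name): return it unchanged
        return column_name
    # First word in lowercase, remaining words capitalized and joined
    return words[0].lower() + "".join(w.capitalize() for w in words[1:])


print(to_camel_case_any("Order ID"))      # orderId
print(to_camel_case_any("Sub-Category"))  # subCategory
print(to_camel_case_any("Sales"))         # sales
```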

In [0]:
#standardize columns
df_sales_standarized=standardize_columns(df_sales_selected)

#add ingestion timestamp
df_sales_standarized=add_ingestionTimesatmp(df_sales_standarized)

#convert orderDate into Date format
df_sales_standarized=df_sales_standarized.withColumn("orderDate", to_date(col("orderDate"), "dd/MM/yy"))

%md
##to_date
The to_date function in PySpark converts a string column containing date-like values into a proper DateType column. You specify the format of the input date string so that the function can interpret and parse the date correctly.

Key Features:
- Input: The function takes a string column and, optionally, a date format as input.
- Format: Provide a date format (e.g., "MM/dd/yyyy", "dd-MM-yyyy") to tell the function how to interpret the string. If the format is omitted, Spark falls back to its default casting rules for dates.
- Output: It converts the string into a DateType, which is useful for date-related operations such as filtering, comparisons, or date arithmetic.
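To see why the format string matters, the same text can be parsed under two different patterns. The sketch below uses Python's standard datetime module rather than Spark, purely as an analogy: the Spark pattern dd/MM/yy corresponds to %d/%m/%y in strptime, and the sample value 08/11/16 is an assumed input, not a row taken from the bronze table.

```python
from datetime import datetime

sample = "08/11/16"  # assumed sample value, for illustration only

# Day first, then month (the analogue of Spark's "dd/MM/yy")
day_first = datetime.strptime(sample, "%d/%m/%y").date()

# Month first, then day (the analogue of Spark's "MM/dd/yy")
month_first = datetime.strptime(sample, "%m/%d/%y").date()

print(day_first)    # 2016-11-08
print(month_first)  # 2016-08-11
```

The two results differ by three months, which is why the pattern passed to to_date should match the layout of the source data.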

<br>
<b>Let us take a look at the modified dataframe</b>

In [0]:
df_sales_standarized.display()

orderId,orderDate,productId,sales,quantity,profit,source,ingestionTimestamp
CA-2016-152156,2016-11-08,FUR-BO-10001798,261.96,2.0,41.9136,Retail CSV,2024-09-10T07:30:24.932Z
CA-2016-152156,2016-11-08,FUR-CH-10000454,731.94,3.0,219.582,Retail CSV,2024-09-10T07:30:24.932Z
CA-2016-138688,2016-06-12,OFF-LA-10000240,14.62,2.0,6.8714,Retail CSV,2024-09-10T07:30:24.932Z
US-2015-108966,2015-10-11,FUR-TA-10000577,957.5775,5.0,-383.031,Retail CSV,2024-09-10T07:30:24.932Z
US-2015-108966,2015-10-11,OFF-ST-10000760,22.368,2.0,2.5164,Retail CSV,2024-09-10T07:30:24.932Z
CA-2014-115812,2014-06-09,FUR-FU-10001487,48.86,7.0,14.1694,Retail CSV,2024-09-10T07:30:24.932Z
CA-2014-115812,2014-06-09,OFF-AR-10002833,7.28,4.0,1.9656,Retail CSV,2024-09-10T07:30:24.932Z
CA-2014-115812,2014-06-09,TEC-PH-10002275,907.152,6.0,90.7152,Retail CSV,2024-09-10T07:30:24.932Z
CA-2014-115812,2014-06-09,OFF-BI-10003910,18.504,3.0,5.7825,Retail CSV,2024-09-10T07:30:24.932Z
CA-2014-115812,2014-06-09,OFF-AP-10002892,114.9,5.0,34.47,Retail CSV,2024-09-10T07:30:24.932Z


Now we will create an external table for the sales data in the silver layer

In [0]:
spark.sql("""
CREATE TABLE IF NOT EXISTS psl_salesdev.silver.sales_cleaned (
    orderId STRING COMMENT 'Unique identifier for each order',
    orderDate DATE COMMENT 'Date of the order',
    productId STRING COMMENT 'Unique identifier for each product',
    sales DOUBLE COMMENT 'Sales amount for the order',
    quantity DOUBLE COMMENT 'Quantity of products sold',
    profit DOUBLE COMMENT 'Profit made from the sale',
    source STRING COMMENT 'Source of the order (e.g., online, in-store)',
    ingestionTimestamp TIMESTAMP COMMENT 'Timestamp of data ingestion'
)
COMMENT 'This table stores sales order details'
PARTITIONED BY (ingestionTimestamp)
LOCATION 'abfss://sales@stavikaslakefreetrail.dfs.core.windows.net/silver/sales_cleaned/'
""")

DataFrame[]

Write the sales data to the silver layer

In [0]:
df_sales_standarized.write.mode("overwrite").saveAsTable("psl_salesdev.silver.sales_cleaned")

In [0]:
%sql
select * from psl_salesdev.silver.sales_cleaned

orderId,orderDate,productId,sales,quantity,profit,source,ingestionTimestamp
CA-2016-152156,2016-11-08,FUR-BO-10001798,261.96,2.0,41.9136,Retail CSV,2024-09-10T07:40:05.637Z
CA-2016-152156,2016-11-08,FUR-CH-10000454,731.94,3.0,219.582,Retail CSV,2024-09-10T07:40:05.637Z
CA-2016-138688,2016-06-12,OFF-LA-10000240,14.62,2.0,6.8714,Retail CSV,2024-09-10T07:40:05.637Z
US-2015-108966,2015-10-11,FUR-TA-10000577,957.5775,5.0,-383.031,Retail CSV,2024-09-10T07:40:05.637Z
US-2015-108966,2015-10-11,OFF-ST-10000760,22.368,2.0,2.5164,Retail CSV,2024-09-10T07:40:05.637Z
CA-2014-115812,2014-06-09,FUR-FU-10001487,48.86,7.0,14.1694,Retail CSV,2024-09-10T07:40:05.637Z
CA-2014-115812,2014-06-09,OFF-AR-10002833,7.28,4.0,1.9656,Retail CSV,2024-09-10T07:40:05.637Z
CA-2014-115812,2014-06-09,TEC-PH-10002275,907.152,6.0,90.7152,Retail CSV,2024-09-10T07:40:05.637Z
CA-2014-115812,2014-06-09,OFF-BI-10003910,18.504,3.0,5.7825,Retail CSV,2024-09-10T07:40:05.637Z
CA-2014-115812,2014-06-09,OFF-AP-10002892,114.9,5.0,34.47,Retail CSV,2024-09-10T07:40:05.637Z
