#### About Data:
This is the Sample Superstore data, which comes from the following sources:
1. CSV - OrderDetails.csv, People.csv, ReturnedOrders.csv
2. JSON - OrderDetails.json

The description of the data source tables is as follows:

1. Orders (Sample Superstore main table):  
This table includes all transactional details such as sales, products, customers, regions, etc.

2. Returns:  
Contains information about the orders that were returned. The fields typically include:
- Order ID: Unique identifier to match with the main Orders table.
- Returned: Indicates if the order was returned (yes/no).

3. People:  
Includes data about the employees responsible for different regions or sales activities. The fields typically include:
- Region: The region where the employee works (matches with the Orders table).
- Person: Name of the employee responsible for the region.


----------------------------------------------

## Read All Tables
In the Medallion architecture, we first ingest the data in its raw form. In the sections below, we will read each file one by one from our landing zone folder.

-------------------------------------------------------------------------------------------


## People
While we read the data in its raw form, take note of the number of jobs Spark is going to create during this process.


In [0]:
root_path = 'abfss://landingzone@stavikaslakefreetrail.dfs.core.windows.net/'



In [0]:
df_people_extracted=spark.read.options(header='true', inferSchema='true').csv(f'{root_path}csv_sourcesystem/People.csv')


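A total of 2 jobs are created here: with `header` and `inferSchema` both enabled, Spark typically runs one job to read the header row and another to scan the data and infer the column types.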

## ReturnedOrders
We will now manually define the schema for the source, ensuring it matches exactly as expected in the CSV source.
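Because a manually defined schema is applied to CSV columns by position, it can help to compare the file's header with the expected column order before reading. The following is a minimal sketch using only the Python standard library; `check_header`, `expected_columns`, and the two-line sample text are hypothetical and stand in for the real file.

```python
import csv
import io


def check_header(csv_text: str, expected_columns: list[str]) -> list[str]:
    """Return mismatch messages; an empty list means the header matches."""
    header = next(csv.reader(io.StringIO(csv_text)))
    problems = []
    if len(header) != len(expected_columns):
        problems.append(
            f"expected {len(expected_columns)} columns, found {len(header)}"
        )
    # Compare names position by position, since the schema is positional.
    for position, (found, expected) in enumerate(zip(header, expected_columns)):
        if found.strip() != expected:
            problems.append(
                f"position {position}: expected {expected!r}, found {found.strip()!r}"
            )
    return problems


# Hypothetical sample; the real header may differ.
sample = "Returned,Order ID\nYes,CA-2017-153822\n"
print(check_header(sample, ["Order ID", "Returned"]))  # two mismatches
print(check_header(sample, ["Returned", "Order ID"]))  # no mismatches
```

An empty result means the schema fields can be declared in that order without mislabeling any column.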


In [0]:
from pyspark.sql.types import IntegerType, StringType, FloatType, DateType, TimestampType, StructType, StructField, DoubleType

In [0]:
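# Create the schema using StructType and StructField objects.
# A user-supplied CSV schema is applied by position, so the fields follow
# the column order of the file (Returned first, then Order ID).
ro_schema = StructType([
    StructField('Returned', StringType(), nullable=True),
    StructField('Order ID', StringType(), nullable=True),
])

In [0]:
# Pass the schema through .schema(); options() only carries string-valued reader options.
df_rtorderds_extracted = spark.read.options(header='true').schema(ro_schema).csv(f'{root_path}csv_sourcesystem/ReturnedOrders.csv')

Looking at the cell above, we can see that this read triggers fewer jobs than the People read.<br>
<b>Why?</b> Schema inference is what forces the extra pass over the data. Once the column names and types are supplied up front, Spark has no reason to scan the file while the DataFrame is being defined, and that work is deferred until an action runs.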

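## Orders
We define the schema for the Orders source manually as well. Note that the date columns are read as strings at this stage.

In [0]: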
order_schema = StructType([
    StructField('Row ID', IntegerType(), nullable=False),
    StructField('Order ID', StringType(), nullable=False),
    StructField('Order Date', StringType(), nullable=True),
    StructField('Ship Date', StringType(), nullable=True),
    StructField('Ship Mode', StringType(), nullable=True),
    StructField('Customer ID', StringType(), nullable=False),
    StructField('Customer Name', StringType(), nullable=True),
    StructField('Segment', StringType(), nullable=True),
    StructField('Country', StringType(), nullable=True),
    StructField('City', StringType(), nullable=True),
    StructField('State', StringType(), nullable=True),
    StructField('Postal Code', IntegerType(), nullable=True),
    StructField('Region', StringType(), nullable=True),
    StructField('Product ID', StringType(), nullable=False),
    StructField('Category', StringType(), nullable=True),
    StructField('Sub-Category', StringType(), nullable=True),
    StructField('Product Name', StringType(), nullable=True),
    StructField('Sales', DoubleType(), nullable=True),
    StructField('Quantity', DoubleType(), nullable=True),
    StructField('Discount', DoubleType(), nullable=True),
    StructField('Profit', DoubleType(), nullable=True)
])

df_orders_extracted=spark.read.options(header='true').csv(f'{root_path}csv_sourcesystem/OrderDetails.csv',schema=order_schema)

## Adding Audit Columns to All DataFrames
1. Ingestion timestamp
2. Source system

### We will do this with the help of the functions/APIs below

### current_timestamp:
This function returns the timestamp at the start of query evaluation as a TimestampType column; every call within the same query returns the same value.<br>
It is often used to record when a row was inserted or updated. Because DataFrames are evaluated lazily, the value is computed each time an action runs, not when `withColumn` is called.<br>
Example:<br>
`df.withColumn('current_time', current_timestamp())`

### col:
The col function references a column of a DataFrame by name so that it can be used in expressions.<br>
It is commonly used for selecting columns, applying transformations, or filtering on column values.<br>
Example:<br>
`df.select(col('column_name'))`

### lit:
The lit function creates a column holding a constant (literal) value. It is useful when you need to add a column with a fixed value or use a constant in an expression.<br>
It ensures that the value is treated as a literal rather than as a column name or expression.<br>
Example:<br>
`df.withColumn('constant_column', lit(100))`

### withColumn:
The withColumn method adds a new column or replaces an existing one with the same name. Its second argument must be a column expression, such as the result of current_timestamp(), col(...), or lit(...).<br>
It returns a new DataFrame with the added or updated column and leaves the original DataFrame unchanged (DataFrames are immutable).<br>
Example:<br>
`df.withColumn('new_column', col('existing_column') * 2)`

These functions can be combined to create or manipulate columns in a flexible way.

In [0]:
from pyspark.sql.functions import current_timestamp,col,lit

df_orders_audited=df_orders_extracted.withColumn('ingestion_timestamp',current_timestamp()).withColumn('Source',lit('Retail CSV'))

df_rtorderds_audited=df_rtorderds_extracted.withColumn('ingestion_timestamp',current_timestamp()).withColumn('Source',lit('Retail CSV'))

df_people_audited=df_people_extracted.withColumn('ingestion_timestamp',current_timestamp()).withColumn('Source',lit('Retail CSV'))

In [0]:
# We will now display the content of all the tables we have created
print("Orders")
df_orders_audited.display()
print("Returned Orders")
df_rtorderds_audited.display()
print("People")
df_people_audited.display()

Orders


Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,State,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit,ingestion_timestamp,Source
1,CA-2016-152156,08/11/16,11/11/16,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96,2.0,0.0,41.9136,2024-09-11T05:48:27.022Z,Retail CSV
2,CA-2016-152156,08/11/16,11/11/16,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back",731.94,3.0,0.0,219.582,2024-09-11T05:48:27.022Z,Retail CSV
3,CA-2016-138688,12/06/16,16/06/16,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,California,90036,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters by Universal,14.62,2.0,0.0,6.8714,2024-09-11T05:48:27.022Z,Retail CSV
4,US-2015-108966,11/10/15,18/10/15,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775,5.0,0.45,-383.031,2024-09-11T05:48:27.022Z,Retail CSV
5,US-2015-108966,11/10/15,18/10/15,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311,South,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.368,2.0,0.2,2.5164,2024-09-11T05:48:27.022Z,Retail CSV
6,CA-2014-115812,09/06/14,14/06/14,Standard Class,BH-11710,Brosina Hoffman,Consumer,United States,Los Angeles,California,90032,West,FUR-FU-10001487,Furniture,Furnishings,"Eldon Expressions Wood and Plastic Desk Accessories, Cherry Wood",48.86,7.0,0.0,14.1694,2024-09-11T05:48:27.022Z,Retail CSV
7,CA-2014-115812,09/06/14,14/06/14,Standard Class,BH-11710,Brosina Hoffman,Consumer,United States,Los Angeles,California,90032,West,OFF-AR-10002833,Office Supplies,Art,Newell 322,7.28,4.0,0.0,1.9656,2024-09-11T05:48:27.022Z,Retail CSV
8,CA-2014-115812,09/06/14,14/06/14,Standard Class,BH-11710,Brosina Hoffman,Consumer,United States,Los Angeles,California,90032,West,TEC-PH-10002275,Technology,Phones,Mitel 5320 IP Phone VoIP phone,907.152,6.0,0.2,90.7152,2024-09-11T05:48:27.022Z,Retail CSV
9,CA-2014-115812,09/06/14,14/06/14,Standard Class,BH-11710,Brosina Hoffman,Consumer,United States,Los Angeles,California,90032,West,OFF-BI-10003910,Office Supplies,Binders,DXL Angle-View Binders with Locking Rings by Samsill,18.504,3.0,0.2,5.7825,2024-09-11T05:48:27.022Z,Retail CSV
10,CA-2014-115812,09/06/14,14/06/14,Standard Class,BH-11710,Brosina Hoffman,Consumer,United States,Los Angeles,California,90032,West,OFF-AP-10002892,Office Supplies,Appliances,Belkin F5C206VTEL 6 Outlet Surge,114.9,5.0,0.0,34.47,2024-09-11T05:48:27.022Z,Retail CSV


Returned Orders


Returned,Order ID,ingestion_timestamp,Source
Yes,CA-2017-153822,2024-09-11T05:48:29.575Z,Retail CSV
Yes,CA-2017-129707,2024-09-11T05:48:29.575Z,Retail CSV
Yes,CA-2014-152345,2024-09-11T05:48:29.575Z,Retail CSV
Yes,CA-2015-156440,2024-09-11T05:48:29.575Z,Retail CSV
Yes,US-2017-155999,2024-09-11T05:48:29.575Z,Retail CSV
Yes,CA-2014-157924,2024-09-11T05:48:29.575Z,Retail CSV
Yes,CA-2017-131807,2024-09-11T05:48:29.575Z,Retail CSV
Yes,CA-2016-124527,2024-09-11T05:48:29.575Z,Retail CSV
Yes,CA-2017-135692,2024-09-11T05:48:29.575Z,Retail CSV
Yes,CA-2014-123225,2024-09-11T05:48:29.575Z,Retail CSV


People


Person,Region,ingestion_timestamp,Source
Anna Andreadi,West,2024-09-11T05:48:29.928Z,Retail CSV
Chuck Magee,East,2024-09-11T05:48:29.928Z,Retail CSV
Kelly Williams,Central,2024-09-11T05:48:29.928Z,Retail CSV
Cassandra Brandow,South,2024-09-11T05:48:29.928Z,Retail CSV


### Read data from the API source: api_source

In [0]:
df_orders_extractedjson=spark.read.json(f'{root_path}api_source/OrderDetails.json',multiLine=True)
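
The `multiLine=True` option matters because Spark's JSON reader expects one complete JSON record per line (JSON Lines) by default. The standard-library sketch below illustrates the difference with a hypothetical two-record sample; the actual layout of OrderDetails.json is assumed, not known, and `parse_line_by_line` is an illustrative helper.

```python
import json

# Hypothetical sample: one JSON array pretty-printed across several lines.
multi_line_doc = """[
  {"Order ID": "CA-2016-122581", "Quantity": 7},
  {"Order ID": "CA-2015-104297", "Quantity": 3}
]"""


def parse_line_by_line(text: str) -> list:
    """Parse each non-empty line as its own JSON record (JSON Lines style)."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


try:
    parse_line_by_line(multi_line_doc)
    line_mode_ok = True
except json.JSONDecodeError:
    # The first line is just "[", which is not a complete JSON value.
    line_mode_ok = False

records = json.loads(multi_line_doc)  # parse the document as a whole
print(line_mode_ok, len(records))
```

Line-by-line parsing fails on such a file, while parsing the whole document succeeds. In its default permissive mode Spark does not raise on such lines but flags them as corrupt records, which is why a multi-line file needs `multiLine=True`.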

In [0]:
df_orders_json_audited=df_orders_extractedjson.withColumn('ingestion_timestamp',current_timestamp()).withColumn('Source',lit('API JSON'))

In [0]:
df_orders_json_audited.display()

Category,City,Country,Customer ID,Customer Name,Discount,Order Date,Order ID,Postal Code,Product ID,Product Name,Profit,Quantity,Region,Row ID,Sales,Segment,Ship Date,Ship Mode,State,Sub-Category,ingestion_timestamp,Source
Furniture,New York City,United States,JK-15370,Jay Kimmel,0.1,21/08/16,CA-2016-122581,10035,FUR-CH-10002961,"Leather Task Chair, Black",63.686,7,East,9880,573.174,Consumer,25/08/16,Standard Class,New York,Chairs,2024-09-11T05:48:53.718Z,API JSON
Office Supplies,Cleveland,United States,CC-12100,Chad Cunningham,0.2,29/05/15,CA-2015-104297,44105,OFF-PA-10000474,Easy-staple paper,28.7064,3,East,9881,85.056,Home Office,31/05/15,First Class,Ohio,Paper,2024-09-11T05:48:53.718Z,API JSON
Office Supplies,Woodstock,United States,LL-16840,Lauren Leatherbury,0.0,12/08/14,CA-2014-153927,30188,OFF-BI-10000138,Acco Translucent Poly Ring Binders,6.7392,3,South,9882,14.04,Consumer,13/08/14,First Class,Georgia,Binders,2024-09-11T05:48:53.718Z,API JSON
Technology,Woodstock,United States,LL-16840,Lauren Leatherbury,0.0,12/08/14,CA-2014-153927,30188,TEC-AC-10000023,"Maxell 74 Minute CD-R Spindle, 50/Pack",98.1396,13,South,9883,272.61,Consumer,13/08/14,First Class,Georgia,Accessories,2024-09-11T05:48:53.718Z,API JSON
Office Supplies,Los Angeles,United States,KE-16420,Katrina Edelman,0.0,03/04/14,CA-2014-112291,90008,OFF-EN-10001415,Staple envelope,5.58,2,West,9884,11.16,Corporate,08/04/14,Standard Class,California,Envelopes,2024-09-11T05:48:53.718Z,API JSON
Technology,Los Angeles,United States,KE-16420,Katrina Edelman,0.0,03/04/14,CA-2014-112291,90008,TEC-AC-10001284,Enermax Briskie RF Wireless Keyboard and Mouse Combo,22.4316,3,West,9885,62.31,Corporate,08/04/14,Standard Class,California,Accessories,2024-09-11T05:48:53.718Z,API JSON
Technology,Los Angeles,United States,KE-16420,Katrina Edelman,0.0,03/04/14,CA-2014-112291,90008,TEC-AC-10000736,Logitech G600 MMO Gaming Mouse,57.5928,2,West,9886,159.98,Corporate,08/04/14,Standard Class,California,Accessories,2024-09-11T05:48:53.718Z,API JSON
Office Supplies,Lafayette,United States,SG-20605,Speros Goranitis,0.0,23/01/14,CA-2014-146997,47905,OFF-FA-10003467,"Alliance Big Bands Rubber Bands, 12/Pack",0.0,3,Central,9887,5.94,Consumer,27/01/14,Standard Class,Indiana,Fasteners,2024-09-11T05:48:53.718Z,API JSON
Office Supplies,New York City,United States,CA-12265,Christina Anderson,0.0,12/10/17,CA-2017-169607,10024,OFF-PA-10000477,Xerox 1952,4.6812,2,East,9888,9.96,Consumer,15/10/17,First Class,New York,Paper,2024-09-11T05:48:53.718Z,API JSON
Technology,Utica,United States,RD-19585,Rob Dowd,0.0,08/08/15,CA-2015-127544,13501,TEC-AC-10000736,Logitech G600 MMO Gaming Mouse,28.7964,1,East,9889,79.99,Consumer,12/08/15,Standard Class,New York,Accessories,2024-09-11T05:48:53.718Z,API JSON


We will now save all the tables to the bronze layer in Parquet format.
[Learn more about Parquet](https://www.databricks.com/glossary/what-is-parquet)

In [0]:
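from datetime import datetime

# Get the current timestamp for this run.
# Caution: this rebinds the name current_timestamp, which until now referred to the
# PySpark function imported earlier; re-import that function before calling it again.
current_timestamp = datetime.now()

print(current_timestamp)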


2024-09-11 05:49:00.035387


In [0]:
# Define the bronze root
bronze_root = 'abfss://sales@stavikaslakefreetrail.dfs.core.windows.net/bronze/'

In [0]:
# CSV source
df_orders_audited.write.mode('overwrite').parquet(f'{bronze_root}ordersDetails/source=Retail CSV/{current_timestamp}/')

df_rtorderds_audited.write.mode('overwrite').parquet(f'{bronze_root}returnedOrders/source=Retail CSV/{current_timestamp}/')

df_people_audited.write.mode('overwrite').parquet(f'{bronze_root}people/source=Retail CSV/{current_timestamp}/')

# JSON source
df_orders_json_audited.write.mode('overwrite').parquet(f'{bronze_root}ordersDetails/source=API JSON/{current_timestamp}/')
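
The run folder in the paths above is the string form of `datetime.now()`, which contains a space and colons. One possible alternative, shown here only as a sketch and not what the notebook does, is to format the run timestamp into a compact segment. The helper name `build_bronze_path`, the `%Y%m%dT%H%M%S` layout, and the placeholder storage account are all assumptions.

```python
from datetime import datetime


def build_bronze_path(bronze_root: str, table: str, source: str, run_ts: datetime) -> str:
    """Compose <root>/<table>/source=<source>/<compact timestamp>/."""
    run_segment = run_ts.strftime("%Y%m%dT%H%M%S")
    return f"{bronze_root.rstrip('/')}/{table}/source={source}/{run_segment}/"


# Placeholder root and timestamp used only for illustration.
example_root = "abfss://container@account.dfs.core.windows.net/bronze/"
example_ts = datetime(2024, 9, 11, 5, 49, 0)
print(build_bronze_path(example_root, "ordersDetails", "Retail CSV", example_ts))
# abfss://container@account.dfs.core.windows.net/bronze/ordersDetails/source=Retail CSV/20240911T054900/
```

Building every path through one function also keeps the folder layout identical between the write step and any later table definitions.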

Let's create external table references for the bronze order tables within the bronze schema, following best practice.

In [0]:
spark.sql("drop table if exists psl_salesdev.bronze.ordersDetails_retail")

create_table=f"""create table if not exists psl_salesdev.bronze.ordersDetails_retail 
using parquet location '{bronze_root}ordersDetails/source=Retail CSV/{current_timestamp}/'"""
spark.sql(create_table)

spark.sql("drop table if exists psl_salesdev.bronze.ordersDetails_api")

create_table=f"""create table if not exists psl_salesdev.bronze.ordersDetails_api 
using parquet location '{bronze_root}ordersDetails/source=API JSON/{current_timestamp}/'"""

spark.sql(create_table)

DataFrame[]
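
The drop/create pair above is repeated for each table, so the statements could be generated by a small function instead. This is a sketch only: `external_table_statements` is a hypothetical helper, and the locations below are placeholders rather than the real bronze paths.

```python
def external_table_statements(table_name: str, location: str) -> list[str]:
    """Return DROP and CREATE statements for a Parquet-backed external table."""
    return [
        f"DROP TABLE IF EXISTS {table_name}",
        f"CREATE TABLE IF NOT EXISTS {table_name} USING PARQUET LOCATION '{location}'",
    ]


# Placeholder locations; in the notebook these would be the paths written earlier.
tables = {
    "psl_salesdev.bronze.ordersDetails_retail": "<bronze_root>ordersDetails/source=Retail CSV/<run>/",
    "psl_salesdev.bronze.ordersDetails_api": "<bronze_root>ordersDetails/source=API JSON/<run>/",
}
for name, location in tables.items():
    for statement in external_table_statements(name, location):
        print(statement)  # in the notebook, this would be spark.sql(statement)
```

Adding another bronze table then only requires one more entry in the `tables` dictionary.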

We will now check if our write to the bronze layer was successful, along with verifying the table references.

In [0]:
%sql
--drop table psl_salesdev.bronze.ordersdetails_api
select max(ingestion_timestamp) as max_ingestion_timestamp,
min(ingestion_timestamp) as min_ingestion_timestamp   from psl_salesdev.bronze.ordersDetails_retail

max_ingestion_timestamp,min_ingestion_timestamp
2024-09-11T05:49:21.368Z,2024-09-11T05:49:21.368Z
