**Bronze zone transformations**

In [1]:
dfOrder = spark.read.format("csv").load("Files/Landing/CSV/SalesOrderHeader.csv")
display(dfOrder)


Schema enforcement and filtering out unnecessary columns

In [2]:
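from pyspark.sql.functions import *

# The file was read without a header, so Spark named the columns _c0, _c1, ...
# Rename the ones we need and keep only those columns.
dfOrder = (
    dfOrder
    .withColumnRenamed("_c0", "SalesOrderID")
    .withColumnRenamed("_c2", "OrderDate")
    .withColumnRenamed("_c3", "DueDate")
    .withColumnRenamed("_c10", "CustomerID")
    .drop("_c12")  # redundant with the select below; kept to illustrate drop()
    .select("SalesOrderID", "CustomerID", "OrderDate", "DueDate", col("_c18").alias("TotalAmount"))
)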
display(dfOrder)


Filtering out data irregularities: **filter** and **where** methods

In [3]:
dfOrder = dfOrder.filter("OrderDate IS NOT NULL AND SalesOrderID IS NOT NULL").where("SalesOrderID <> 'SalesOrderID'")
display(dfOrder)

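The cell above relies on two things: the CSV was read without a header option, so the header line arrives as an ordinary data row, and `filter` and `where` are aliases of each other in PySpark. As a rough illustration only, the plain-Python sketch below applies the same three conditions to a handful of made-up rows; the sample values and the `keep_row` helper are hypothetical and not part of the notebook.

```python
# Made-up rows standing in for dfOrder; None plays the role of SQL NULL.
rows = [
    {"SalesOrderID": "SalesOrderID", "OrderDate": "OrderDate"},  # header line read as data
    {"SalesOrderID": "71774", "OrderDate": "2008-06-01"},
    {"SalesOrderID": None, "OrderDate": "2008-06-01"},           # missing key
    {"SalesOrderID": "71776", "OrderDate": None},                # missing date
]


def keep_row(row):
    """Hypothetical helper mirroring the filter/where conditions of the cell above."""
    return (
        row["OrderDate"] is not None
        and row["SalesOrderID"] is not None
        and row["SalesOrderID"] != "SalesOrderID"
    )


clean_rows = [row for row in rows if keep_row(row)]
print(clean_rows)
```

Only the row with both a key and a date survives; the header line is rejected by the last condition.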

Adding ingestion metadata: **withColumn** method

In [4]:
dfOrder = dfOrder.withColumn("SourceFilename", input_file_name()) \
    .withColumn("InsertedDateTime", current_timestamp())
display(dfOrder)


De-duplication: **dropDuplicates** method

In [5]:
dfOrder=dfOrder.dropDuplicates(subset=["SalesOrderID"])
display(dfOrder)

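`dropDuplicates` with a `subset` keeps one row per distinct `SalesOrderID`, but Spark does not promise which of the duplicate rows survives. The plain-Python sketch below models the idea by keeping the first row seen per key; that "first seen" rule, the sample rows, and the `drop_duplicates_by_key` name are assumptions made for illustration, not a description of Spark's internals.

```python
def drop_duplicates_by_key(rows, key):
    """Keep one row per distinct value of `key` (here: the first one encountered)."""
    seen = set()
    unique_rows = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique_rows.append(row)
    return unique_rows


# Made-up order rows; the third one repeats the first order's key.
orders = [
    {"SalesOrderID": "71774", "TotalAmount": "972.785"},
    {"SalesOrderID": "71776", "TotalAmount": "87.0851"},
    {"SalesOrderID": "71774", "TotalAmount": "972.785"},
]

deduped = drop_duplicates_by_key(orders, "SalesOrderID")
print(len(deduped))
```

When it matters which duplicate is kept (for example, the most recently ingested version of an order), ranking rows with a window function ordered by a timestamp is a common alternative.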

Writing into destination tables: **append** mode

In [6]:
dfOrder.write.format('delta').mode('append').saveAsTable('SalesOrderHeader')


Reading Customer file

In [7]:
dfCustomer = spark.read.format("csv").option('header','True').load("Files/Landing/CSV/Customer.csv")
display(dfCustomer)


Transform/Append to destination

In [14]:
dfCustomer\
.select('CustomerID','FirstName','LastName')\
.filter('CustomerID IS NOT NULL')\
.write.format('delta')\
.mode('append')\
.saveAsTable('Customer')


**Parsing JSON files**

In [12]:
dfJson = spark.read.option("multiline", "true").json("Files/Landing/JSON/SampleJson.json")
display(dfJson)


Exploring dataframe schema

In [13]:
dfJson.printSchema()


root
 |-- batters: struct (nullable = true)
 |    |-- batter: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- id: string (nullable = true)
 |    |    |    |-- type: string (nullable = true)
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- ppu: double (nullable = true)
 |-- topping: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- type: string (nullable = true)
 |-- type: string (nullable = true)



Referencing fields of a struct type with '.' notation

In [14]:
from pyspark.sql.functions import *
dfJson=dfJson.withColumn('batter',col('batters').batter)
display(dfJson)


Using the **explode** method to flatten an array

In [15]:
from pyspark.sql.functions import *
dfJson=dfJson.withColumn('batter_array',explode(col('batter')))
display(dfJson)

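`explode` emits one output row per array element and repeats the other columns alongside it; rows whose array is null or empty produce no output. A plain-Python model of that behavior follows, using invented records shaped like the schema printed earlier. The `explode_rows` helper and the sample values are hypothetical.

```python
def explode_rows(rows, array_col, out_col):
    """Model of explode: one output row per element of row[array_col]."""
    exploded = []
    for row in rows:
        for element in row.get(array_col) or []:  # null/empty arrays yield no rows
            new_row = dict(row)          # the other columns are repeated as-is
            new_row[out_col] = element   # the new column holds a single element
            exploded.append(new_row)
    return exploded


# Invented records: one with two batters, one with a null array.
records = [
    {"name": "Cake", "batter": [{"id": "1001", "type": "Regular"},
                                {"id": "1002", "type": "Chocolate"}]},
    {"name": "Plain", "batter": None},
]

result = explode_rows(records, "batter", "batter_array")
for row in result:
    print(row["name"], row["batter_array"])
```

The record with a null array disappears from the result. In PySpark, `explode_outer` is the variant that keeps such rows and fills the new column with null.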

Extracting parts of struct type

In [18]:
dfJson=dfJson.select('name','batter_array.*','batter_array')
display(dfJson)

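Selecting `batter_array.*` promotes each field of the struct to a top-level column, while selecting `batter_array` keeps the struct itself next to them. A small plain-Python sketch of that projection, with a made-up row and a hypothetical `select_with_struct_fields` helper:

```python
def select_with_struct_fields(row, keep_col, struct_col):
    """Model of select(keep_col, 'struct_col.*', struct_col) for a single row."""
    projected = {keep_col: row[keep_col]}
    projected.update(row[struct_col])        # struct fields become top-level columns
    projected[struct_col] = row[struct_col]  # the struct itself is kept as well
    return projected


# Made-up row; note that it also carries a top-level "id" that is not selected.
row = {"name": "Cake", "id": "0001", "batter_array": {"id": "1001", "type": "Regular"}}
flat = select_with_struct_fields(row, "name", "batter_array")
print(flat)
```

The `id` and `type` columns in the result come from the struct, not from the top-level `id` and `type` of the original record, which the select leaves out.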