
# Chapter 3 -> Spark ETL with Azure (Blob | ADLS)

Tasks:
1. Install the required Spark libraries
2. Create a connection to Azure Blob Storage
3. Read data from the blob into a DataFrame
4. Transform the data
5. Write the data to a Parquet file
6. Write the data to a CSV file

Reference:
https://learn.microsoft.com/en-us/azure/open-datasets/dataset-catalog

In [1]:
# Load the required library (the Spark session is started in the next cell)
from pyspark.sql import SparkSession

In [2]:
# Start the Spark session
spark = SparkSession.builder.appName("chapter3").getOrCreate()
# Show only errors, not warnings
spark.sparkContext.setLogLevel("ERROR")

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/03/09 21:58:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/03/09 21:58:18 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
23/03/09 21:58:18 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
23/03/09 21:58:18 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
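
Task 1 (installing the required libraries) is not shown in the notebook. The following is a minimal sketch, assuming PySpark was installed with `pip install pyspark` and that the `wasbs://` connector is provided by the `hadoop-azure` Maven artifact. The version `3.3.4` and the helper names `azure_spark_conf` and `create_session` are assumptions for illustration; the version should match the Hadoop build bundled with your Spark installation.

```python
# Sketch: Spark settings that pull in the Azure Blob (wasbs://) connector.
# ASSUMPTION: hadoop-azure 3.3.4 matches the Hadoop build bundled with your
# Spark installation; adjust the version if it does not.


def azure_spark_conf(hadoop_azure_version="3.3.4"):
    """Return Spark config entries that fetch the hadoop-azure connector at startup."""
    return {
        "spark.jars.packages": f"org.apache.hadoop:hadoop-azure:{hadoop_azure_version}",
    }


def create_session(app_name="chapter3", conf=None):
    """Build a SparkSession with the given config (hypothetical helper)."""
    # Imported lazily so the config helper above works without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in (conf or azure_spark_conf()).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Usage (requires PySpark and network access to download the package):
# spark = create_session()
```

`spark.jars.packages` only takes effect when it is set before the session is first created; if a session is already running, `getOrCreate()` returns the existing one and the package is not loaded.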


2. Create a connection to Azure Blob Storage

In [3]:
# Azure Open Datasets storage account holding the Public Holidays dataset;
# the SAS token is left empty
blob_account_name = "azureopendatastorage"
blob_container_name = "holidaydatacontainer"
blob_relative_path = "Processed"
blob_sas_token = r""

In [4]:

# Allow Spark to read from Blob Storage remotely
wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set('fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name), blob_sas_token)
print('Remote blob path: ' + wasbs_path)

Remote blob path: wasbs://holidaydatacontainer@azureopendatastorage.blob.core.windows.net/Processed
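
The path and the SAS configuration key built in the cell above follow fixed patterns, so they can be wrapped in small helpers. This is a sketch only: the function names are hypothetical, and `abfss_url` shows the ADLS Gen2 URL pattern mentioned in the chapter title, which the notebook itself never uses (whether this dataset is reachable over `abfss://` is not confirmed here).

```python
# Sketch: helpers for the storage URL patterns used with Spark.
# All function names are hypothetical, introduced for illustration.


def wasbs_url(container, account, relative_path=""):
    """Azure Blob Storage URL, as built in the cell above."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{relative_path.lstrip('/')}"


def sas_conf_key(container, account):
    """Spark config key under which the container's SAS token is stored."""
    return f"fs.azure.sas.{container}.{account}.blob.core.windows.net"


def abfss_url(container, account, relative_path=""):
    """ADLS Gen2 URL pattern (ASSUMPTION: not used anywhere in the notebook)."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{relative_path.lstrip('/')}"


# Usage with the values from the notebook:
# path = wasbs_url("holidaydatacontainer", "azureopendatastorage", "Processed")
# spark.conf.set(sas_conf_key("holidaydatacontainer", "azureopendatastorage"), "")
```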


3. Read data from the blob into a DataFrame

In [5]:
df = spark.read.parquet(wasbs_path)

                                                                                

In [6]:
df.printSchema()

root
 |-- countryOrRegion: string (nullable = true)
 |-- holidayName: string (nullable = true)
 |-- normalizeHolidayName: string (nullable = true)
 |-- isPaidTimeOff: boolean (nullable = true)
 |-- countryRegionCode: string (nullable = true)
 |-- date: timestamp (nullable = true)



In [7]:
df.show(n=2)

+---------------+--------------------+--------------------+-------------+-----------------+-------------------+
|countryOrRegion|         holidayName|normalizeHolidayName|isPaidTimeOff|countryRegionCode|               date|
+---------------+--------------------+--------------------+-------------+-----------------+-------------------+
|      Argentina|Año Nuevo [New Ye...|Año Nuevo [New Ye...|         null|               AR|1970-01-01 00:00:00|
|      Australia|      New Year's Day|      New Year's Day|         null|               AU|1970-01-01 00:00:00|
+---------------+--------------------+--------------------+-------------+-----------------+-------------------+
only showing top 2 rows



                                                                                

4. Transform the data

In [8]:
print('Register the DataFrame as a SQL temporary view: tempSource')
df.createOrReplaceTempView('tempSource')

Register the DataFrame as a SQL temporary view: tempSource


In [9]:
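print('Displaying top 10 rows: ')
# In a plain Jupyter notebook, display() shows only the DataFrame's column summary
# (see the output below); call .show() on the result to print the rows themselves.
display(spark.sql('SELECT * FROM tempSource LIMIT 10'))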

Displaying top 10 rows: 


DataFrame[countryOrRegion: string, holidayName: string, normalizeHolidayName: string, isPaidTimeOff: boolean, countryRegionCode: string, date: timestamp]

In [10]:
# Keep the first 10 rows as the DataFrame that is written out in the next step
newdf = spark.sql('SELECT * FROM tempSource LIMIT 10')
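
The transformation above only limits the row count. As a sketch of a slightly richer transformation over the same temporary view, the helper below builds a query that counts holidays per country, optionally restricted to one year. The function name, the `holidayCount` alias, and the parameters are hypothetical; the column names come from the schema printed earlier.

```python
import re

# Sketch: build a Spark SQL query string over the temporary view.
# The helper name and its parameters are hypothetical.

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def holidays_per_country_sql(view="tempSource", year=None, limit=10):
    """Return SQL that counts holidays per country, optionally for a single year."""
    # The view name is interpolated into the query, so accept plain identifiers only.
    if not _IDENTIFIER.match(view):
        raise ValueError(f"invalid view name: {view!r}")
    # int() rejects anything that is not a number before it reaches the query text.
    where = f"WHERE year(`date`) = {int(year)} " if year is not None else ""
    return (
        "SELECT countryOrRegion, COUNT(*) AS holidayCount "
        f"FROM {view} {where}"
        "GROUP BY countryOrRegion "
        f"ORDER BY holidayCount DESC LIMIT {int(limit)}"
    )


# Usage in the notebook (requires the running Spark session):
# spark.sql(holidays_per_country_sql(year=2020)).show()
```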

5. Write the data to a Parquet file
6. Write the data to a CSV file

In [11]:
newdf.write.format("parquet").option("compression","snappy").save("parquetholidaydata",mode='append')

                                                                                

In [12]:
newdf.write.format("csv").option("header","true").save("csvdata",mode='append')
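
JSON output can be written the same way as the Parquet and CSV outputs above. The sketch below uses `DataFrameWriter.json`; the helper name and the output directory `jsonholidaydata` are assumptions for illustration. It defaults to `overwrite` rather than `append`, because `append` adds new files on every re-run of the cell and so duplicates the rows.

```python
# Sketch: write a DataFrame as JSON (one JSON object per line).
# The helper name and the default output directory are hypothetical.


def write_json(df, path="jsonholidaydata", mode="overwrite"):
    """Write `df` as line-delimited JSON files under the directory `path`."""
    # mode="overwrite" replaces earlier output instead of adding to it.
    df.write.mode(mode).json(path)


# Usage in the notebook (requires the running Spark session):
# write_json(newdf)
```

Spark writes a directory of part files rather than a single `.json` file, as it does for the Parquet and CSV outputs.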

                                                                                

In [13]:
newdf.show()

+---------------+--------------------+--------------------+-------------+-----------------+-------------------+
|countryOrRegion|         holidayName|normalizeHolidayName|isPaidTimeOff|countryRegionCode|               date|
+---------------+--------------------+--------------------+-------------+-----------------+-------------------+
|      Argentina|Año Nuevo [New Ye...|Año Nuevo [New Ye...|         null|               AR|1970-01-01 00:00:00|
|      Australia|      New Year's Day|      New Year's Day|         null|               AU|1970-01-01 00:00:00|
|        Austria|             Neujahr|             Neujahr|         null|               AT|1970-01-01 00:00:00|
|        Belgium|       Nieuwjaarsdag|       Nieuwjaarsdag|         null|               BE|1970-01-01 00:00:00|
|         Brazil|            Ano novo|            Ano novo|         null|               BR|1970-01-01 00:00:00|
|         Canada|      New Year's Day|      New Year's Day|         null|               CA|1970-01-01 00

                                                                                

In [15]:
df.count()

                                                                                

69557