## Lab 052

In this lab we will look at manipulating a data lake using Spark tooling.

## Access data on Azure Data Lake Storage Gen2 (ADLS Gen2) with Synapse Spark

Azure Data Lake Storage Gen2 (ADLS Gen2) is used as the storage account associated with a Synapse workspace. A Synapse workspace can have a default ADLS Gen2 storage account and additional linked storage accounts.

You can access data on ADLS Gen2 with Synapse Spark via the following URL:
    
    abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/<path>
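
The URL pattern above can be assembled programmatically. The following is a minimal sketch; the helper name `build_abfss_path` and the placeholder container and account names are assumptions made for illustration and are not part of the lab.

```python
def build_abfss_path(container_name: str, account_name: str, path: str = "") -> str:
    """Return an abfss:// URL for a path inside an ADLS Gen2 container."""
    if not container_name or not account_name:
        raise ValueError("container_name and account_name are required")
    # Strip leading slashes so the URL never contains '//' after the host.
    relative = path.lstrip("/")
    return f"abfss://{container_name}@{account_name}.dfs.core.windows.net/{relative}"


# Placeholder names; substitute your own container, account, and folder.
example_path = build_abfss_path("mycontainer", "mystorageaccount", "/holidays")
print(example_path)
```

Keeping the formatting in one function makes it harder to swap the container and account names by accident, which is an easy mistake with positional string formatting.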

## Pre-requisites
Synapse uses Azure Active Directory (AAD) pass-through to access any ADLS Gen2 account (or folder) on which you have the **Storage Blob Data Contributor** role. No credentials or access tokens are required.

## Load sample data

Let's first load the [public holidays](https://azure.microsoft.com/en-us/services/open-datasets/catalog/public-holidays/) for the last 6 months from Azure Open Datasets as a sample.



In [6]:
from azureml.opendatasets import PublicHolidays

from datetime import datetime
from dateutil.relativedelta import relativedelta

# Query window: the six months ending today
end_date = datetime.today()
start_date = end_date - relativedelta(months=6)

# Load the holidays in that window into a Spark DataFrame
hol = PublicHolidays(start_date=start_date, end_date=end_date)
hol_df = hol.to_spark_dataframe()

StatementMeta(SparkPool01, 4, 1, Finished, Available)



In [15]:
# Display 5 rows
hol_df.show(5, truncate = False)

StatementMeta(SparkPool01, 4, 10, Finished, Available)

+---------------+----------------------+----------------------+-------------+-----------------+-------------------+
|countryOrRegion|holidayName           |normalizeHolidayName  |isPaidTimeOff|countryRegionCode|date               |
+---------------+----------------------+----------------------+-------------+-----------------+-------------------+
|Ukraine        |День захисника України|День захисника України|null         |UA               |2020-10-14 00:00:00|
|Norway         |Søndag                |Søndag                |null         |NO               |2020-10-18 00:00:00|
|Sweden         |Söndag                |Söndag                |null         |SE               |2020-10-18 00:00:00|
|Hungary        |Nemzeti ünnep         |Nemzeti ünnep         |null         |HU               |2020-10-23 00:00:00|
|Norway         |Søndag                |Søndag                |null         |NO               |2020-10-25 00:00:00|
+---------------+----------------------+----------------------+-------------+-----------------+-------------------+

## Write data to the default ADLS Gen2 storage

We are going to write the Spark DataFrame to your default ADLS Gen2 storage account.


In [8]:
from pyspark.sql import SparkSession
from pyspark.sql.types import *

# Primary storage info
account_name = 'Your primary storage account name' # fill in your primary account name
container_name = 'Your container name' # fill in your container name
relative_path = 'Your relative path' # fill in your relative folder path

# here is mine for reference only
#account_name = 'asaworkspacedavew891'
#container_name = 'wwi-02'
#relative_path = 'holidays'

adls_path = 'abfss://%s@%s.dfs.core.windows.net/%s' % (container_name, account_name, relative_path)
print('Primary storage account path: ' + adls_path)

StatementMeta(SparkPool01, 4, 3, Finished, Available)

Primary storage account path: abfss://wwi-02@asaworkspacedavew891.dfs.core.windows.net/holidays

### Save a dataframe as Parquet, JSON or CSV
If you have a dataframe, you can save it to Parquet, JSON, or CSV with the .write.parquet(), .write.json(), and .write.csv() methods, respectively.

Dataframes can be saved in any format, regardless of the input format.
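
Because the output format is independent of the input format, the three writes can also be expressed through the generic `DataFrameWriter` chain (`mode`, `format`, `option`, `save`). The sketch below is one possible arrangement, not the lab's own method; the function name `save_in_all_formats` and its `name` parameter are hypothetical.

```python
def save_in_all_formats(df, base_path, name="holidays"):
    """Write df under base_path as Parquet, JSON, and CSV; return the paths."""
    paths = {fmt: f"{base_path}/{name}.{fmt}" for fmt in ("parquet", "json", "csv")}
    for fmt, path in paths.items():
        writer = df.write.mode("overwrite").format(fmt)
        if fmt == "csv":
            # CSV carries no embedded schema, so keep the column names as a header row.
            writer = writer.option("header", True)
        writer.save(path)
    return paths
```

In this notebook it could be called as `save_in_all_formats(hol_df, adls_path)`. Each path becomes a folder of part files rather than a single file, which is how Spark writes distributed output.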

In [13]:
parquet_path = adls_path + '/holidays.parquet'
json_path = adls_path + '/holidays.json'
csv_path = adls_path + '/holidays.csv'
print('parquet file path: ' + parquet_path)
print('json file path: ' + json_path)
print('csv file path: ' + csv_path)



StatementMeta(SparkPool01, 4, 8, Finished, Available)

parquet file path: abfss://wwi-02@asaworkspacedavew891.dfs.core.windows.net/holidays/holidays.parquet
json file path: abfss://wwi-02@asaworkspacedavew891.dfs.core.windows.net/holidays/holidays.json
csv file path: abfss://wwi-02@asaworkspacedavew891.dfs.core.windows.net/holidays/holidays.csv

In [16]:
hol_df.write.parquet(parquet_path, mode = 'overwrite')

# go check your data lake and ensure the files were written

StatementMeta(SparkPool01, 4, 11, Finished, Available)

Py4JJavaError: An error occurred while calling o259.parquet.
: Status code: -1 error code: null error message: InvalidAbfsRestOperationExceptionjava.net.UnknownHostException: asaworkspacedavew891.dfs.core.windows.net
	at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.executeHttpOperation(AbfsRestOperation.java:225)
	at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:154)
	at org.apache.hadoop.fs.azurebfs.services.AbfsClient.getAclStatus(AbfsClient.java:574)
	at org.apache.hadoop.fs.azurebfs.services.AbfsClient.getAclStatus(AbfsClient.java:556)
	at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.getIsNamespaceEnabled(AzureBlobFileSystemStore.java:238)
	at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.getFileStatus(AzureBlobFileSystemStore.java:537)
	at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.getFileStatus(AzureBlobFileSystem.java:430)
	at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1627)
	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:93)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:134)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:130)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:158)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:155)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:105)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:103)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
	at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:566)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.UnknownHostException: asaworkspacedavew891.dfs.core.windows.net
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:607)
	at org.wildfly.openssl.OpenSSLSocket.connect(OpenSSLSocket.java:563)
	at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:463)
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
	at sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:264)
	at sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:367)
	at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:191)
	at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1162)
	at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1056)
	at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:177)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1570)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1498)
	at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
	at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:352)
	at org.apache.hadoop.fs.azurebfs.services.AbfsHttpOperation.processResponse(AbfsHttpOperation.java:267)
	at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.executeHttpOperation(AbfsRestOperation.java:210)
	... 40 more
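
The `java.net.UnknownHostException` in the trace above means the hostname `<storage_account_name>.dfs.core.windows.net` could not be resolved, which typically points to a misspelled or non-existent storage account name rather than a permissions problem. A quick DNS check before writing can surface this earlier. This is a minimal sketch using only the Python standard library; the helper names `adls_host` and `host_resolves` are hypothetical, and the check runs wherever the notebook code itself runs.

```python
import socket


def adls_host(account_name: str) -> str:
    """Return the ADLS Gen2 (dfs) endpoint hostname for a storage account."""
    return f"{account_name}.dfs.core.windows.net"


def host_resolves(host: str) -> bool:
    """Return True if DNS can resolve the host name, False otherwise."""
    try:
        socket.getaddrinfo(host, 443)
        return True
    except socket.gaierror:
        return False
```

In the notebook, `host_resolves(adls_host(account_name))` returning `False` would suggest re-checking the value of `account_name` before retrying the write.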


In [None]:
# write the same dataframe as JSON and as CSV (with a header row)
hol_df.write.json(json_path, mode = 'overwrite')
hol_df.write.csv(csv_path, mode = 'overwrite', header = True)

In [None]:
# here is how you would create a df from your data lake files
df_parquet = spark.read.parquet(parquet_path)

In [None]:
df_parquet.show(10)
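
The JSON and CSV copies can be read back in the same way as the Parquet one. The following sketch assumes all three writes succeeded and that the CSV was written with a header row; the function name `read_all_formats` is hypothetical, and `spark` is the session object that Synapse notebooks provide.

```python
def read_all_formats(spark, parquet_path, json_path, csv_path):
    """Read the three copies of the dataset back into DataFrames."""
    return {
        "parquet": spark.read.parquet(parquet_path),
        "json": spark.read.json(json_path),
        # header=True uses the first row as column names; inferSchema=True
        # asks Spark to guess column types instead of reading every column as a string.
        "csv": spark.read.csv(csv_path, header=True, inferSchema=True),
    }
```

Calling `read_all_formats(spark, parquet_path, json_path, csv_path)["csv"].printSchema()` is one way to compare the inferred CSV types against the schema stored in the Parquet copy.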