# Access data on Azure Data Lake Storage Gen2 (ADLS Gen2) with Synapse Spark

Azure Data Lake Storage Gen2 (ADLS Gen2) is used as the storage account associated with a Synapse workspace. A Synapse workspace can have one default ADLS Gen2 storage account and additional linked storage accounts.

You can access data on ADLS Gen2 with Synapse Spark through a URL of the following form:
    
    abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/<path>

This notebook provides examples of how to read data from an ADLS Gen2 account into a Spark context and how to write the output of Spark jobs directly to an ADLS Gen2 location.
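
To make the URL structure above concrete, here is a minimal sketch that builds such a URL from its parts and splits it back again. The helper names `build_abfss_url` and `split_abfss_url`, as well as the sample container and account names, are hypothetical and used only for illustration; the sketch relies solely on the Python standard library.

```python
from urllib.parse import urlparse

ENDPOINT_SUFFIX = "dfs.core.windows.net"


def build_abfss_url(container_name, account_name, path=""):
    """Return an abfss:// URL for a folder or file in an ADLS Gen2 container."""
    if not container_name or not account_name:
        raise ValueError("container_name and account_name are required")
    # Drop leading slashes so the URL has exactly one '/' after the host name.
    clean_path = path.lstrip("/")
    return f"abfss://{container_name}@{account_name}.{ENDPOINT_SUFFIX}/{clean_path}"


def split_abfss_url(url):
    """Return (container_name, account_name, path) parsed from an abfss:// URL."""
    parts = urlparse(url)
    if parts.scheme != "abfss" or not parts.username or not parts.hostname:
        raise ValueError(f"not an abfss URL: {url!r}")
    # The storage account name is the first label of the host name.
    account_name = parts.hostname.split(".", 1)[0]
    return parts.username, account_name, parts.path.lstrip("/")


# Hypothetical names, for illustration only.
url = build_abfss_url("mycontainer", "mystorageaccount", "bronze/holiday.parquet")
print(url)
print(split_abfss_url(url))
```

The container name sits in the user-info position of the URL (before `@`), which is why the standard URL parser exposes it as `username`.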

## Prerequisites
Synapse uses Azure Active Directory (AAD) pass-through to access any ADLS Gen2 account (or folder) on which you hold the **Storage Blob Data Contributor** role. No credentials or access tokens are required.

## Load sample data

First, load the [public holidays](https://azure.microsoft.com/en-us/services/open-datasets/catalog/public-holidays/) for the last 6 months from Azure Open Datasets as a sample.

In [5]:
from datetime import datetime

from azureml.opendatasets import PublicHolidays
from dateutil.relativedelta import relativedelta

# Query window: the six months ending today
end_date = datetime.today()
start_date = end_date - relativedelta(months=6)

# Download the holidays as a pandas DataFrame, then convert it to a Spark DataFrame
hol = PublicHolidays(start_date=start_date, end_date=end_date)
hol_df = spark.createDataFrame(hol.to_pandas_dataframe())


In [6]:
# Display 5 rows without truncating long column values
hol_df.show(5, truncate=False)

## Write data to the default ADLS Gen2 storage

Next, write the Spark DataFrame to your default ADLS Gen2 storage account.


In [2]:
# Primary storage info
account_name = 'dbbatch03synapse' # fill in your primary account name
container_name = 'fsbatch03synapse' # fill in your container name
relative_path = 'bronze' # fill in your relative folder path

adls_path = 'abfss://%s@%s.dfs.core.windows.net/%s' % (container_name, account_name, relative_path)
# End the path with exactly one '/' so that file names can be appended directly in later cells
adls_path = adls_path.rstrip('/') + '/'
print('Primary storage account path: ' + adls_path)

### Save a dataframe as Parquet, JSON or CSV
If you have a dataframe, you can save it as Parquet, JSON, or CSV with the .write.parquet(), .write.json(), and .write.csv() methods respectively.

Dataframes can be saved in any format, regardless of the input format.


In [3]:
parquet_path = adls_path + 'holiday.parquet'
json_path = adls_path + 'holiday.json'
csv_path = adls_path + 'holiday.csv'
print('parquet file path: ' + parquet_path)
print('json file path: ' + json_path)
print('csv file path: ' + csv_path)
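
Plain string concatenation only yields a valid path when the base folder ends with `/`. As a hedged alternative, the sketch below joins names onto a base URL with exactly one separator, whether or not the base ends with a slash. The helper `child_path` and the sample base URL are hypothetical; only the standard-library `posixpath` module is used.

```python
import posixpath


def child_path(base_url, *names):
    """Join names onto an abfss:// base URL with exactly one '/' between parts."""
    # Strip slashes from each name so no part is treated as an absolute path.
    cleaned = [name.strip("/") for name in names]
    return posixpath.join(base_url.rstrip("/"), *cleaned)


# Hypothetical base folder without a trailing slash.
base = "abfss://mycontainer@mystorageaccount.dfs.core.windows.net/bronze"

print(base + "holiday.parquet")             # separator is lost: .../bronzeholiday.parquet
print(child_path(base, "holiday.parquet"))  # .../bronze/holiday.parquet
```

Using `posixpath` rather than `os.path` keeps the separator as `/` on every platform, which is what the URL requires.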

In [7]:
# Write the same dataframe in three formats, replacing any existing output
hol_df.write.parquet(parquet_path, mode='overwrite')
hol_df.write.json(json_path, mode='overwrite')
hol_df.write.csv(csv_path, mode='overwrite', header=True)

### Save a dataframe as text files
If you have a dataframe that you want to save as text files, you must first convert it to an RDD and then save that RDD as text.


In [8]:
# Define the text file path
text_path = adls_path + 'holiday.txt'
print('text file path: ' + text_path)

In [9]:
# Convert the Spark dataframe into an RDD of Row objects
hol_RDD = hol_df.rdd
type(hol_RDD)

If you have an RDD, you can save it as a directory of text files like the following:


In [10]:
# Save the RDD as text files (this fails if text_path already exists)
hol_RDD.saveAsTextFile(text_path)
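
`saveAsTextFile` writes the string form of each RDD element, so an RDD of `Row` objects produces lines such as `Row(col=value, ...)`. If delimited lines are preferred, one option is to format each row before saving. The sketch below is one possible approach, not part of the original notebook: `row_to_line` is a hypothetical helper, and plain tuples stand in for Spark `Row` objects (which are tuple subclasses) so that the example runs without a Spark session.

```python
import csv
import io


def row_to_line(row, delimiter=","):
    """Format one row (any sequence of values) as a single delimited text line."""
    buffer = io.StringIO()
    # An empty line terminator is used because saveAsTextFile adds the newline itself.
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="")
    writer.writerow(["" if value is None else value for value in row])
    return buffer.getvalue()


# Made-up sample values, for illustration only.
sample_rows = [("Norway", "Søndag", None), ("Sweden", "Jul, annandag", True)]
for row in sample_rows:
    print(row_to_line(row))

# In the notebook, this could be applied before saving (not executed here):
# hol_df.rdd.map(row_to_line).saveAsTextFile(text_path)
```

The `csv` module quotes any value that contains the delimiter, so a holiday name with a comma in it still occupies a single field.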

## Read data from the default ADLS Gen2 storage


### Create a dataframe from Parquet files


In [11]:
df_parquet = spark.read.parquet(parquet_path)

### Create a dataframe from JSON files


In [12]:
df_json = spark.read.json(json_path)

### Create a dataframe from CSV files


In [13]:
df_csv = spark.read.csv(csv_path, header=True)

### Create an RDD from text files


In [14]:
text = sc.textFile(text_path)