---
title: Read parquet files from Azure Data Lake with Apache Spark
date: 2022-10-02 20:28:57 +0200
draft: false
tags: []
---
Azure Blob Storage - and especially Data Lake Storage Gen2, which is built on top of it and adds features like a hierarchical namespace - is a great place to store data in all kinds of formats: a data lake.
Here, I explain how to connect to this data lake with Apache Spark and read parquet files from it.
- Retrieve the access key from the Azure portal. Go to the storage account you want to connect to, choose "Access keys" under "Security + networking" and copy one of the keys.
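If you prefer not to copy the key from the portal by hand, here is a minimal sketch of fetching it programmatically; it assumes the `azure-identity` and `azure-mgmt-storage` packages are installed and that you fill in the placeholder subscription ID, resource group and storage account name:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

# Authenticate with whatever credential is available locally (CLI login, env vars, ...).
client = StorageManagementClient(DefaultAzureCredential(), "<subscription id>")

# List the account keys and pick the first one.
keys = client.storage_accounts.list_keys("<resource group>", "<storage account name>")
storage_account_key = keys.keys[0].value
```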
- When creating your Spark session in your application or notebook, you need to add the `hadoop-azure` package, which will be automatically downloaded and included in the session:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local") \
    .config('spark.jars.packages', 'org.apache.hadoop:hadoop-azure:3.3.1') \
    .getOrCreate()
```
- Hand over the key from the first step to the Spark config. Note that the key allows access to your data and is therefore a secret! You should handle it properly, for example with a local env file that is not checked into version control (see the sketch below), or better, something like a key vault.
```python
spark.conf.set(
    "fs.azure.account.key.<storage account name>.dfs.core.windows.net",
    "<storage account key>"
)
```
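As a minimal sketch of the env-file approach, assuming the `python-dotenv` package is installed and the key is stored in a local `.env` file under a variable called `AZURE_STORAGE_ACCOUNT_KEY` (the variable name is just an example):

```python
import os

from dotenv import load_dotenv  # requires the python-dotenv package

# Read variables from a local .env file (which should be listed in .gitignore).
load_dotenv()

# AZURE_STORAGE_ACCOUNT_KEY is an example name, not an official one.
storage_account_key = os.environ["AZURE_STORAGE_ACCOUNT_KEY"]

spark.conf.set(
    "fs.azure.account.key.<storage account name>.dfs.core.windows.net",
    storage_account_key
)
```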
- You can now read in parquet files:
Note that the container name comes before the "@"-sign but does not appear within the file path.
```python
df = spark.read.format("parquet") \
    .load("abfss://<container name>@<storage account name>.dfs.core.windows.net/<file path>")
```
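For a quick sanity check that the connection and path are correct, you can inspect the schema and a few rows of the resulting DataFrame:

```python
# Print the inferred schema and show the first rows of the parquet data.
df.printSchema()
df.show(5)
```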