Local env with spark+delta

Minimal example of a local Python setup with Spark and Delta Lake that allows unit testing of Spark/Delta code via pytest.

The setup is inspired by dbx by Databricks.

Delta

To include Delta in the Spark session created by pytest, the spark fixture in ./tests/conftest.py runs configure_spark_with_delta_pip and adds the following settings to the Spark config:

| key | value |
| --- | --- |
| spark.sql.extensions | io.delta.sql.DeltaSparkSessionExtension |
| spark.sql.catalog.spark_catalog | org.apache.spark.sql.delta.catalog.DeltaCatalog |

See https://docs.delta.io/3.2.0/quick-start.html#python for more info.
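
The actual fixture lives in ./tests/conftest.py; as a rough sketch (the app name, master setting, and fixture scope below are assumptions, not taken from the repository), such a fixture can look like this:

```python
import pytest
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    # Local Spark session with the Delta extension and Delta catalog enabled.
    builder = (
        SparkSession.builder.master("local[1]")  # assumption: single local executor
        .appName("pyspark-deltalake-tests")      # assumption: illustrative app name
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config(
            "spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog",
        )
    )
    # configure_spark_with_delta_pip adds the matching delta-spark jars
    # to spark.jars.packages so they are resolved when the session starts.
    spark = configure_spark_with_delta_pip(builder).getOrCreate()
    yield spark
    spark.stop()
```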

Development

Requirements:

Setup Virtual environment

The following commands create and activate a virtual environment. The [dev] extra also installs development tools, and --editable makes the CLI script available.

  • Makefile:
    make requirements
    source .venv/bin/activate
  • PowerShell:
    python -m venv .venv
    .venv\Scripts\Activate.ps1
    python -m pip install --upgrade pip
    pip install --editable .[dev]
  • Windows CMD:
    python -m venv .venv
    .venv\Scripts\activate.bat
    python -m pip install --upgrade pip
    pip install --editable .[dev]

Windows

I recommend using WSL instead, as even with the additional Hadoop libraries, Spark/Delta occasionally simply freezes on Windows.

To run this on Windows you need additional Hadoop libraries, see https://cwiki.apache.org/confluence/display/HADOOP2/WindowsProblems.

"In particular, %HADOOP_HOME%\BIN\WINUTILS.EXE must be locatable."

  1. Download the bin directory from https://github.com/steveloughran/winutils/tree/master/hadoop-3.0.0/bin (required files: hadoop.dll and winutils.exe)
  2. Set the environment variable HADOOP_HOME to the directory containing that bin directory

Run tests

  • Makefile:
    make test
  • Windows:
    pytest -vv
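
As an illustration of what a test using the spark fixture can look like, here is a hypothetical example (the table path and assertions are made up for this sketch, not taken from the repository) that round-trips a small DataFrame through a Delta table:

```python
from pyspark.sql import SparkSession


def test_delta_roundtrip(spark: SparkSession, tmp_path):
    # Write a small DataFrame as a Delta table into pytest's tmp_path...
    path = str(tmp_path / "people")
    df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
    df.write.format("delta").save(path)

    # ...then read it back and check the round trip worked.
    result = spark.read.format("delta").load(path)
    assert result.count() == 2
    assert set(result.columns) == {"id", "name"}
```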
