Overview

I write a fair number of Spark applications that run on Hadoop platforms and do a lot of data parsing, transformation, and loading into HDFS, Hive, and other data repositories. When I code my applications in Scala, I usually use Eclipse and the ScalaTest framework to test my work.

However, I like to write PySpark solutions, too, and I haven't found a great way to test them from an editor like VS Code. Recently, though, it occurred to me that I could simply test my code in a ready-made Hadoop environment, such as a Docker image for Hadoop. This project is a simple starter app for developing a PySpark application that you can easily test in a Docker container running Spark.
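
One common way to wire pytest to Spark is a shared SparkSession fixture. The sketch below shows the general shape of such a fixture; treat it as illustrative (the conftest.py file name and the spark fixture name are placeholders here, not necessarily what this repo uses).

```python
# tests/conftest.py -- illustrative sketch of a shared SparkSession fixture
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    """Create one SparkSession for the whole test run and stop it afterward."""
    session = (
        SparkSession.builder
        .master("local[2]")          # run Spark locally inside the container
        .appName("pyspark-unit-testing")
        .getOrCreate()
    )
    yield session
    session.stop()
```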

Setup

Set up a local virtual environment

To keep this project nice and isolated from my other Python projects, I like to set up a virtual environment for it (a quick smoke test you can use to verify the environment follows the list below).

  1. In my WSL2 command shell, navigate to my development folder: cd /mnt/c/Users/brad/dev
  2. Create a directory for my project: mkdir ./pyspark-unit-testing
  3. Enter the new project folder: cd pyspark-unit-testing
  4. Create a subdir for my tests: mkdir ./tests
  5. Create my virtual environment (note: I initially had some issues doing this in WSL2 and found this blog post helpful in overcoming them): python3 -m venv app-env
  6. Start the new virtual environment: source app-env/bin/activate
  7. Install the pytest package: pip3 install pytest --trusted-host pypi.org --trusted-host files.pythonhosted.org
  8. (Optional) Create a requirements doc of your development environment: pip3 freeze > requirements-dev.txt
  9. Leave the virtual environment: deactivate
  10. Now, fire up VS Code: code .
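
Before bringing Spark into the picture, a throwaway smoke test is a quick way to confirm that the virtual environment and pytest are wired up correctly. The file below is just an illustration; its name and contents are arbitrary.

```python
# tests/test_smoke.py -- throwaway check that pytest runs inside the new venv
def test_environment_sanity():
    # If this passes, pytest is installed and discovering tests in ./tests.
    assert 1 + 1 == 2
```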

The Databricks Spark-XML dependency

As an added bonus, this project demonstrates how to code and test against third-party APIs. In this case, I'm leveraging Databricks' Spark-XML API. For simplicity, I downloaded the JAR file directly into my project from its Maven repository.
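
For reference, reading XML through that JAR from PySpark looks roughly like the sketch below. The JAR path, the rowTag value, and the input file are placeholders; substitute whatever your project actually uses.

```python
# Illustrative sketch of parsing XML with the Databricks Spark-XML package.
# The JAR path, rowTag, and input file below are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark-xml-example")
    .config("spark.jars", "jars/spark-xml_2.11-0.9.0.jar")  # path to the downloaded JAR
    .getOrCreate()
)

df = (
    spark.read
    .format("com.databricks.spark.xml")  # format name provided by the Spark-XML package
    .option("rowTag", "record")          # XML element that maps to one DataFrame row
    .load("data/sample.xml")
)
df.printSchema()
```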

Set up the Docker container

For most of my Spark needs, I like to use the Jupyter "all-spark-notebook" image.

  1. Download the image (I like to use the "spark-2" version, but I'm sure the latest will work just fine, too): docker pull jupyter/all-spark-notebook:spark-2
  2. Use my Dockerfile to build my own image: docker build -t my_spark_image:v1 .

Running the Docker container and testing

When I start my Docker container, I like to include several options that aren't necessary for unit testing but are helpful for other uses of the container, like its Jupyter Notebook capabilities. Here's the command I normally use:

docker run -d -p 9000:8888 -e JUPYTER_ENABLE_LAB=yes -e GRANT_SUDO=yes -v /mnt/c/Users/brad/dev/pyspark-unit-testing:/home/jovyan/work my_spark_image:v1

Next, open a bash shell in your container: docker exec -it <container_id> bash

Finally, run your pytest unit tests with this command:

pytest tests/ -s --disable-pytest-warnings
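
A test picked up by that command might look something like the sketch below, which reuses the spark fixture sketched earlier; the file name, test name, and data are placeholders.

```python
# tests/test_transforms.py -- illustrative sketch; names and data are placeholders
def test_filter_keeps_only_large_ids(spark):
    # Build a tiny in-memory DataFrame and exercise a simple transformation.
    df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "letter"])
    filtered = df.filter(df.id > 1)
    assert filtered.count() == 2
```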

As an aside, if you're interested in the container's Jupyter Notebook capabilities, this command is helpful: jupyter notebook list
