Demo PySpark Library

This is a demo Python library to support talks around Big Data Engineering and DevOps.

The repository contains a small library for Spark 3 which provides some simple functions, such as parsing Excel dates and standardising column names.
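To give a feel for the kind of helpers described above, here is a minimal sketch in plain Python (no Spark). The function names excel_serial_to_datetime and standardise_column_name are hypothetical and are not taken from the library; the standardisation rules (lower case, underscores) are an assumption as well. The date conversion assumes Excel's 1900 date system and is only valid for serial numbers from 1 March 1900 onwards.

```python
import re
from datetime import datetime, timedelta

# Assumption: Excel's 1900 date system. Using 1899-12-30 as the epoch
# compensates for Excel treating 1900 as a leap year, so results are
# correct for dates from 1 March 1900 onwards.
EXCEL_EPOCH = datetime(1899, 12, 30)


def excel_serial_to_datetime(serial):
    """Convert an Excel serial number (days since the epoch) to a datetime."""
    # The fractional part of the serial represents the time of day.
    return EXCEL_EPOCH + timedelta(days=serial)


def standardise_column_name(name):
    """Hypothetical rule: lower case, with runs of other characters as '_'."""
    cleaned = re.sub(r"[^0-9A-Za-z]+", "_", name.strip())
    return cleaned.strip("_").lower()


print(excel_serial_to_datetime(44197))
print(standardise_column_name(" Order Date (UTC) "))
```

In a Spark job the same logic would normally be expressed with DataFrame column operations rather than row-by-row Python, but the rules being applied are the same.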

It is also meant to show the source code being stored in GitHub, with the pipelines, environments, and secrets configured in Azure DevOps.

Setting up locally

The code has been built against Python 3.7 (as this matches Databricks runtime 7), so this is the recommended version. However, no features are in use which would prevent the use of Python 3.6 and above.
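If you want to check which category your interpreter falls into, a small sketch like the following may help. The function name describe_python_version and the three labels are made up for illustration; the version numbers come from the paragraph above.

```python
import sys

# Versions taken from the text: 3.7 is recommended, 3.6 is the minimum.
MINIMUM_VERSION = (3, 6)
RECOMMENDED_VERSION = (3, 7)


def describe_python_version(version_info=sys.version_info):
    """Classify a Python version as unsupported, recommended or supported."""
    # Only the major and minor parts matter for this comparison.
    current = tuple(version_info[:2])
    if current < MINIMUM_VERSION:
        return "unsupported"
    if current == RECOMMENDED_VERSION:
        return "recommended"
    return "supported"


print(describe_python_version())
```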

You will need to create a virtual environment. The coverage configuration expects this to be at .venv, but it can be anywhere; just remember to update .coveragerc to reflect your virtual environment location.

Windows

# Create the Python virtual environment
python -m venv .venv

# Alternatively, if you have multiple versions of Python installed
py -3.7 -m venv .venv

# Activate the virtual environment
.\.venv\Scripts\activate # Or Activate.ps1 if using PowerShell

Linux/Mac

On a Linux environment you may need to install the Python venv package separately. On Debian-based distributions this might look something like:

sudo apt install python3-venv

python3 -m venv .venv
source .venv/bin/activate

Finally

N.B. To deactivate your virtual environment, just type deactivate and press Enter.
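If you are unsure whether a virtual environment is currently active, one way to check from Python is sketched below. It assumes the environment was created with the standard venv module, as in the steps above; the function name in_virtual_environment is made up for this example.

```python
import sys


def in_virtual_environment():
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Virtual environment active:", in_virtual_environment())
```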

Once your virtual environment is set up, make sure you update pip and install the package dependencies. These include the development dependencies as well.

python3 -m pip install --upgrade pip
python3 -m pip install -r requirements.txt

This project uses Flake8 with the pep8-naming package for code linting. It uses Coverage.py for producing code coverage reports, and it uses xmlrunner for running the unit tests and producing JUnit style reports for Azure Pipelines to consume.

All unit tests are written using the unittest framework, which is part of the standard library. This is just because I happen to like it.
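For anyone who has not used unittest before, a test case in that style might look roughly like the sketch below. The function under test, standardise_column_name, and the test class are hypothetical and defined inline so the example is self-contained; they are not copied from this repository's tests.

```python
import re
import unittest


def standardise_column_name(name):
    """Hypothetical function under test, defined here for illustration."""
    cleaned = re.sub(r"[^0-9A-Za-z]+", "_", name.strip())
    return cleaned.strip("_").lower()


class StandardiseColumnNameTests(unittest.TestCase):
    def test_spaces_become_underscores(self):
        self.assertEqual("order_date", standardise_column_name("Order Date"))

    def test_surrounding_whitespace_is_removed(self):
        self.assertEqual("total", standardise_column_name("  Total  "))


def run_tests():
    """Load and run the test case, returning the unittest result object."""
    loader = unittest.defaultTestLoader
    suite = loader.loadTestsFromTestCase(StandardiseColumnNameTests)
    return unittest.TextTestRunner(verbosity=2).run(suite)


if __name__ == "__main__":
    run_tests()
```

Files following this pattern and named test_*.py can be picked up by unittest discovery.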

Also in use is the Rope refactoring library, which is used by Visual Studio Code for performing refactoring activities such as variable renaming.

Visual Studio Code

The library was developed using Visual Studio Code, and so an extensions.json file exists to provide quick access to the extensions the author feels are needed for building on this library. To access these, open the Extensions panel and type @recommended in the search bar; this will show the workspace recommendations. These are not mandatory, but can make development a bit easier.

Running tests and producing coverage

# Inside the virtual environment

# Run tests and produce coverage information
python -m coverage run -m xmlrunner -o test-results discover -v -s ./tests -p "test_*.py"

# Generate coverage XML report
python -m coverage xml

# Generate coverage HTML report
python -m coverage html
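The XML report produced by Coverage.py uses a Cobertura-style layout, with the overall line coverage stored in the line-rate attribute of the root element. As a rough sketch of how that figure could be read back in Python, the example below parses a small hand-written sample rather than a real report; the function name line_coverage_percent is made up for illustration.

```python
import xml.etree.ElementTree as ET


def line_coverage_percent(xml_text):
    """Return the overall line coverage from a Cobertura-style report."""
    root = ET.fromstring(xml_text)
    # line-rate is a fraction between 0 and 1 on the root element.
    return float(root.attrib["line-rate"]) * 100


# Hand-written sample standing in for the contents of a real report.
sample_report = '<coverage line-rate="0.875" branch-rate="0"></coverage>'
print(line_coverage_percent(sample_report))
```

To use it against a real report, read the generated XML file into a string first and pass that in.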
