
Basic Python Template

To create a project from this template, please run the following command:

dbx init --template=python_basic

The new project will be located in a folder with the chosen project name.

The generated project will have some generic parts, and some parts that are specific to the chosen CI tool.

The clean project structure, without any CI-related files, looks like this:

.
├── .dbx
│   ├── lock.json  # please note that this file shall be ignored and not added to your git repository.
│   └── project.json
├── .gitignore
├── README.md
├── conf
│   ├── deployment.yml
│   └── test
│       └── sample.yml
├── pytest.ini
├── sample_project
│   ├── __init__.py # <- this is the root folder of your Python package
│   ├── common.py # <- this file contains a generic class called Job, which provides you all necessary tools, such as Spark and DBUtils
│   └── jobs
│       ├── __init__.py
│       └── sample
│           ├── __init__.py
│           └── entrypoint.py
├── setup.py
├── tests
│   ├── integration
│   │   └── sample_test.py
│   └── unit
│       └── sample_test.py
└── unit-requirements.txt

Here are some comments about this structure:

  • .dbx is an auxiliary folder where metadata about environments and execution context is located.
  • sample_project - a Python package with your code (the directory name will follow your project name)
  • tests - a directory with your package tests
  • conf/deployment.yml - the deployment configuration file. Please note that this file is used to configure job deployment properties, such as dependent libraries, tasks, cluster sizes, etc.

Please note that this project mostly follows the classical Python package structure as described here.
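The Job class in sample_project/common.py, mentioned in the tree above, can be thought of as an abstract base class that wires up the SparkSession, DBUtils and the job configuration. The following is only a minimal sketch of that idea, not the exact generated code; the constructor arguments and attribute names are illustrative:

.. code-block:: python

    from abc import ABC, abstractmethod

    from pyspark.sql import SparkSession


    class Job(ABC):
        """Illustrative sketch; the generated common.py is more complete."""

        def __init__(self, spark=None, init_conf=None):
            # Reuse the session provided on Databricks, or create one locally.
            self.spark = spark or SparkSession.builder.getOrCreate()
            # Job parameters; normally read from a conf file, overridable in tests.
            self.conf = init_conf or {}
            # The real class also exposes DBUtils; omitted here for brevity.

        @abstractmethod
        def launch(self):
            """Implemented by concrete jobs such as the sample entrypoint."""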

The conf/deployment.yml file is one of the main components that allows you to flexibly describe your job definitions.

All CI tools in this template follow the same concepts, described in this section.

Depending on your choices of the CI tool during the project creation, your project structure will look like this:

.. tabs::

    .. tab:: GitHub Actions

        .. code-block::

            <file structure as above>
            ├── .github
            │   └── workflows
            │       ├── onpush.yml
            │       └── onrelease.yml

        Some explanations regarding structure:

        * :code:`.github/workflows/` - workflow definitions for GitHub Actions, in particular:

            * :code:`.github/workflows/onpush.yml` defines the CI pipeline logic
            * :code:`.github/workflows/onrelease.yml` defines the CD pipeline logic

    .. tab:: Azure DevOps

        .. code-block::

            <file structure as above>
            ├── azure-pipelines.yml

        Some explanations regarding structure:

            * :code:`azure-pipelines.yml` - Azure DevOps Pipelines workflow definition with both CI and CD pipeline logic.

    .. tab:: GitLab

        .. code-block::

            <file structure as above>
            ├── .gitlab-ci.yml

        Some explanations regarding structure:

            * :code:`.gitlab-ci.yml` - GitLab CI/CD workflow definition with both CI and CD pipeline logic.

After generating the project, we'll need to set up the local development environment.

  • Create a new conda environment and activate it:

    conda create -n <your-environment-name> python=3.7.5
    conda activate <your-environment-name>

  • If you would like to run unit tests locally, you'll need a JDK. If you don't have one, it can be installed via:

    conda install -c anaconda "openjdk=8.0.152"

  • Change the shell to the project directory:

    cd <project_name>

  • Install the dependencies for local development:

    pip install -r unit-requirements.txt

  • Install the package in development mode, so your IDE can provide all required introspection:

    pip install -e .

At this stage, you have the following:

  • Configured Python package
  • Configured environment for local development

Now, you can open the project in your IDE. Don't forget to point the IDE to the given conda environment name for full code introspection.

Take a look at the code sample in <project_name>/jobs/sample/entrypoint.py. This entrypoint file contains an example of an implemented job, based on the abstract Job class. You can see that a configuration object named self.conf is referenced in this job - these parameters will be provided from the conf/test/sample.yml file during a Databricks run. In local tests, you can override this configuration - please find examples in tests/unit/sample_test.py.
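For illustration, such an entrypoint could look roughly like the sketch below. The import path and the Job base class come from the project structure above, while the configuration keys (output_size, output_path) and the output format are purely hypothetical examples of what conf/test/sample.yml might provide:

.. code-block:: python

    from sample_project.common import Job


    class SampleJob(Job):
        """Sketch of a concrete job built on the abstract Job class."""

        def launch(self):
            # self.spark and self.conf are provided by the Job base class;
            # output_size and output_path are hypothetical parameters that
            # would come from conf/test/sample.yml on a Databricks run.
            output_size = self.conf.get("output_size", 10)
            output_path = self.conf["output_path"]

            df = self.spark.range(0, output_size)
            df.write.mode("overwrite").parquet(output_path)


    if __name__ == "__main__":
        SampleJob().launch()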

To launch the local tests, simply use the pytest framework from the root directory of the project:

pytest tests/unit --cov <project_name>
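A local unit test that overrides the configuration could then look roughly like the sketch below; whether the Job base class accepts spark and init_conf constructor arguments is an assumption of this sketch, so check the generated tests/unit/sample_test.py for the actual pattern:

.. code-block:: python

    from pyspark.sql import SparkSession

    from sample_project.jobs.sample.entrypoint import SampleJob


    def test_sample_job(tmp_path):
        # Local SparkSession instead of the one provided by Databricks.
        spark = SparkSession.builder.master("local[1]").getOrCreate()

        # Override the job configuration instead of reading conf/test/sample.yml.
        job = SampleJob(
            spark=spark,
            init_conf={"output_size": 5, "output_path": str(tmp_path / "output")},
        )
        job.launch()

        result = spark.read.parquet(str(tmp_path / "output"))
        assert result.count() == 5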

At this stage, you have the following:

  • Configured Python package
  • Configured environment for local development
  • Python package is tested locally

Now, it's time to launch our code on Databricks clusters.

To upload your code from the local environment to Databricks and execute it, there are multiple options:

  • execute your code on an interactive (also called all-purpose) cluster
  • launch your code as a job on an automated cluster
  • launch your code as a job on an interactive cluster

The third option is generally a bad idea, for a very simple reason - your local package will be installed at the cluster-wide level, which means that:

  • other users won't be able to override your code unless you restart the interactive cluster
  • you won't be able to install another version of the same library unless you restart the interactive cluster

Therefore, we'll consider only the first two options.

Option #1 (execution on an interactive cluster) is well suited when you would like to run your code on an interactive cluster during the development process, to verify that the code works properly within a real environment. Your library will be installed in a separate context, which means that other users won't be affected, and you will still be able to install newer versions.

Use this command to execute a specific job on an interactive cluster:

dbx execute --job=<job-name> --cluster-name=<cluster-name>

Now, if you would like to launch your job on an automated cluster, you'll probably want to configure some specific cluster properties, such as size, environment, etc. To do this, please take a look at the conf/deployment.yml file. In general, this file follows the Databricks API structures, but it has some additional features, described throughout this documentation.
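As a rough illustration of such a configuration (the exact layout of the generated file may differ, and the cluster values below are placeholders following the Databricks Jobs API), a single-job definition could look like this:

.. code-block:: yaml

    # conf/deployment.yml - illustrative sketch only
    default:
      jobs:
        - name: "sample"
          new_cluster:
            spark_version: "9.1.x-scala2.12"
            node_type_id: "Standard_DS3_v2"
            num_workers: 1
          spark_python_task:
            python_file: "sample_project/jobs/sample/entrypoint.py"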

After setting the configuration in the deployment file, it's time to launch the job. However, we probably don't want to affect the real job object in our environment yet. Instead, we're going to perform something called a jobless deployment, by providing the --files-only option. Please take a look at this section for more details:

dbx deploy --jobs=<job-name> --files-only

Now the job can be launched in run-submit mode:

dbx launch --as-run-submit --job=<job-name>

At this stage, you have the following:

  • Configured Python package
  • Configured environment for local development
  • Python package is tested locally
  • Job has been launched on interactive cluster
  • Job has been deployed and launched in a jobless (also called ephemeral or run-submit) mode

Depending on your CI tool, please choose the instruction accordingly:

.. tabs::

    .. tab:: GitHub Actions

        Please do the following:

            * Create a new repository on GitHub
            * Configure :code:`DATABRICKS_HOST` and :code:`DATABRICKS_TOKEN` secrets for your project in `GitHub UI <https://docs.github.com/en/free-pro-team@latest/actions/reference/encrypted-secrets>`_
            * Add a remote origin to the local repo
            * Push the code
            * Open the GitHub Actions for your project to verify the state of the deployment pipeline

        .. warning::

            There is no need to manually create releases via the UI.
            The release pipeline will create the release automatically.

    .. tab:: Azure DevOps

        Please do the following:

            * Create a new repository on GitHub or in Azure DevOps (or in any Azure DevOps-compatible git system)
            * Configure :code:`DATABRICKS_HOST` and :code:`DATABRICKS_TOKEN` secrets for your project in `Azure DevOps <https://docs.microsoft.com/en-us/azure/devops/pipelines/release/azure-key-vault?view=azure-devops>`_. Note that secret variables must be mapped to env as mentioned in the `official documentation <https://docs.microsoft.com/en-us/azure/devops/pipelines/process/variables?view=azure-devops&tabs=yaml%2Cbatch#secret-variables>`_.
            * Push the code
            * Open the Azure DevOps UI to check the deployment status

    .. tab:: GitLab

        Please do the following:

            * Create a new repository on GitLab
            * Configure :code:`DATABRICKS_HOST` and :code:`DATABRICKS_TOKEN` secrets for your project in `GitLab UI <https://docs.gitlab.com/ee/ci/variables/#create-a-custom-variable-in-the-ui>`_
            * Add a remote origin to the local repo
            * Push the code
            * Open the GitLab CI/CD UI to check the deployment status

Please note that to create a release and deploy the job in normal mode, you need to tag the latest commit in the main branch and push the tags:

git fetch
git checkout main
git pull
git tag -a v0.0.1 -m "Release for version 0.0.1"
git push --tags