Simulation Data Engineering - Part 2

The first part of this project orchestrated simulation data generation, with the output eventually landing in Azure storage.

This part deploys a Databricks Asset Bundle (DAB), which is infrastructure as code for Databricks.

This allows for CI/CD integration - which has been set up here using GitHub Actions.

The actions produce two environments: development and production. A staging environment could also be added.

The deployment to Databricks includes an ELT pipeline developed using Delta Live Tables, a declarative ETL/ELT framework for the Databricks Data Intelligence Platform.

The ELT pipeline uses a medallion data design pattern and is in development.
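As a rough illustration of the medallion idea (raw bronze data, cleaned silver data, aggregated gold data), the sketch below walks a few records through the three layers in plain Python. It is not the project's pipeline code: the record layout and the `patient_id` and `wait_minutes` fields are hypothetical, chosen only to make the example concrete.

```python
from statistics import mean
from typing import Dict, List, Optional


def bronze(raw_lines: List[str]) -> List[Dict[str, str]]:
    """Bronze: ingest raw records as-is, keeping every value as a string."""
    rows = []
    for line in raw_lines:
        patient_id, wait_minutes = line.split(",")
        rows.append({"patient_id": patient_id, "wait_minutes": wait_minutes})
    return rows


def silver(bronze_rows: List[Dict[str, str]]) -> List[Dict[str, float]]:
    """Silver: cast to proper types and drop rows that cannot be parsed."""
    cleaned = []
    for row in bronze_rows:
        try:
            cleaned.append(
                {
                    "patient_id": int(row["patient_id"]),
                    "wait_minutes": float(row["wait_minutes"]),
                }
            )
        except ValueError:
            continue  # hypothetical rule: unparseable rows are discarded
    return cleaned


def gold(silver_rows: List[Dict[str, float]]) -> Dict[str, Optional[float]]:
    """Gold: aggregate the cleaned rows into a reporting-ready summary."""
    waits = [row["wait_minutes"] for row in silver_rows]
    return {
        "patients": len(silver_rows),
        "mean_wait_minutes": mean(waits) if waits else None,
    }


# Hypothetical raw input: the second record has a missing wait time.
raw = ["1,12.5", "2,", "3,30"]
summary = gold(silver(bronze(raw)))
print(summary)
```

In a Delta Live Tables pipeline each layer would typically be its own table rather than a function, but the flow of data from raw to aggregated is the same.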

ELT Structure

Delta Live Tables allows data quality checks to be declared as expectations. A failed expectation can trigger different behaviours, such as failing the pipeline or logging the failed rows. In this pipeline, for example, missing values in a field are logged.
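The different expectation behaviours can be sketched in plain Python. This is only an illustration of the semantics, not the DLT API and not this project's code: in a real pipeline the same choices are expressed with the `@dlt.expect`, `@dlt.expect_or_drop` and `@dlt.expect_or_fail` decorators. The `apply_expectation` helper and the `patient_id` field are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]


class ExpectationFailed(Exception):
    """Raised when a 'fail' expectation is violated."""


@dataclass
class ExpectationResult:
    kept: List[Row]   # rows passed on to the next step
    violations: int   # number of rows that broke the rule


def apply_expectation(
    rows: List[Row],
    name: str,
    predicate: Callable[[Row], bool],
    on_violation: str = "warn",
) -> ExpectationResult:
    """Check each row against a rule and react according to on_violation.

    warn: keep the row and only count the violation (logging behaviour)
    drop: discard the row and count the violation
    fail: stop immediately by raising ExpectationFailed
    """
    if on_violation not in {"warn", "drop", "fail"}:
        raise ValueError(f"unknown on_violation mode: {on_violation!r}")
    kept: List[Row] = []
    violations = 0
    for row in rows:
        if predicate(row):
            kept.append(row)
            continue
        violations += 1
        if on_violation == "fail":
            raise ExpectationFailed(f"expectation {name!r} violated by {row!r}")
        if on_violation == "warn":
            kept.append(row)
    return ExpectationResult(kept=kept, violations=violations)


# Hypothetical data: one row has a missing patient_id.
rows = [{"patient_id": 1}, {"patient_id": None}, {"patient_id": 3}]


def not_missing(row: Row) -> bool:
    return row["patient_id"] is not None


logged = apply_expectation(rows, "patient_id_present", not_missing, "warn")
dropped = apply_expectation(rows, "patient_id_present", not_missing, "drop")
print(logged.violations, len(logged.kept), len(dropped.kept))
```

The "warn" mode corresponds to the logging behaviour described above: the row stays in the output and the violation is only counted.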

The CI/CD workflow roughly follows the pattern below, with some minor adjustments. We don't have a staging area here, although one can easily be added, and we use GitHub Actions rather than Azure DevOps. In addition, for demo purposes we build the dev and prod environments in the same workspace. In a real deployment, we would use separate workspaces.
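To make the two-environment idea concrete, here is a minimal sketch of how a CI step might choose a bundle target and build the matching CLI calls. The branch-to-target rule (`main` deploys to prod, everything else to dev) is an assumption for illustration and is not taken from this repository's workflow files. The helper names are hypothetical. The commands themselves are the standard `databricks bundle validate` and `databricks bundle deploy`.

```python
from typing import List

# Targets named in this project; a staging target could be added here.
TARGETS = ("dev", "prod")


def target_for_branch(branch: str) -> str:
    """Hypothetical rule: the main branch deploys to prod, all others to dev."""
    return "prod" if branch == "main" else "dev"


def bundle_commands(target: str) -> List[List[str]]:
    """Return the CLI calls for one deployment: validate first, then deploy."""
    if target not in TARGETS:
        raise ValueError(f"unknown target: {target!r}")
    return [
        ["databricks", "bundle", "validate", "--target", target],
        ["databricks", "bundle", "deploy", "--target", target],
    ]


# Print the commands a CI job would run for a feature branch and for main.
for branch in ("feature/new-table", "main"):
    for command in bundle_commands(target_for_branch(branch)):
        print(" ".join(command))
```

Each command list could be passed to `subprocess.run(command, check=True)` in a runner where the Databricks CLI is installed and authenticated.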

Getting started

Below are some general DAB instructions. You will first need to set up a Databricks workspace and get its host name, set up a service principal, set permissions and authenticate.

Install the Databricks CLI from https://docs.databricks.com/dev-tools/cli/databricks-cli.html
Authenticate to your Databricks workspace, if you have not done so already:
```
$ databricks configure
```
To deploy a development copy of this project, type:
```
$ databricks bundle deploy --target dev
```
(Note that "dev" is the default target, so the --target parameter is optional here.)

This deploys everything that's defined for this project. For example, the default template would deploy a job called [dev yourname] ae_sim_job to your workspace. You can find that job by opening your workspace and clicking on Workflows.
Similarly, to deploy a production copy, type:
```
$ databricks bundle deploy --target prod
```
Note that the default job from the template has a schedule that runs every day (defined in resources/ae_sim_job.yml). The schedule is paused when deploying in development mode (see https://docs.databricks.com/dev-tools/bundles/deployment-modes.html).
To run a job or pipeline, use the "run" command:
```
$ databricks bundle run
```
Optionally, install developer tools such as the Databricks extension for Visual Studio Code from https://docs.databricks.com/dev-tools/vscode-ext.html.
For documentation on the Databricks asset bundles format used for this project, and for CI/CD configuration, see https://docs.databricks.com/dev-tools/bundles/index.html.

Ya5s3r/ae-sim-deployment-databricks
