# Software Engineering Best Practices, DevOps, and CI/CD Fundamentals

## Introduction to Software Engineering (SWE) Best Practices

![swe-best-practices](images/swe-best-practices.png)
![swe-best-practices-components](images/swe-best-practices-components.png)
![swe-databricks](images/swe-databricks.png)

## Introduction to Modularizing PySpark Code

![modularizing-pyspark-benefits](images/modularizing-pyspark-benefits.png)
![modularizing-pyspark-code](images/modularizing-pyspark-code.png)
![modularizing-pyspark-code-1](images/modularizing-pyspark-code-1.png)
![modularizing-pyspark-code-2](images/modularizing-pyspark-code-2.png)
![modularizing-pyspark-code-3](images/modularizing-pyspark-code-3.png)
![modularizing-pyspark-code-4](images/modularizing-pyspark-code-4.png)
![modularizing-pyspark-code-5](images/modularizing-pyspark-code-5.png)
![modularizing-pyspark-code-6](images/modularizing-pyspark-code-6.png)
![modularizing-pyspark-code-7](images/modularizing-pyspark-code-7.png)
![modularizing-pyspark-code-8](images/modularizing-pyspark-code-8.png)

## DevOps Fundamentals

![devops-fundamentals](images/devops-fundamentals.png)
![devops-life-cycle-components](images/devops-life-cycle-components.png)
![devops-life-cycle-process](images/devops-life-cycle-process.png)
![devops-data-engineering-machine-learning](images/devops-data-engineering-machine-learning.png)
![devops-dataops-mlops](images/devops-dataops-mlops.png)

## The Role of CI/CD in DevOps

![role-cicd](images/role-cicd.png)
![role-cicd-devops](images/role-cicd-devops.png)
![role-cicd-devops-ci](images/role-cicd-devops-ci.png)
![role-cicd-devops-cd](images/role-cicd-devops-cd.png)
![role-cicd-workflow](images/role-cicd-workflow.png)

# Continuous Integration

## Planning the Project

![planning-the-project-requirements](images/planning-the-project-requirements.png)
![planning-the-project-data-environments](images/planning-the-project-data-environments.png)
![planning-the-project-isolating-environments](images/planning-the-project-isolating-environments.png)
![planning-the-project-workspace-isolation](images/planning-the-project-workspace-isolation.png)
![planning-the-project-unity-catalog-isolation](images/planning-the-project-unity-catalog-isolation.png)
![planning-the-project-archtecture](images/planning-the-project-archtecture.png)

## Introduction to Unit Tests for PySpark
![unit-tests-PySpark-introduction](images/unit-tests-PySpark-introduction.png)
![unit-tests-PySpark-utils](images/unit-tests-PySpark-utils.png)
![unit-tests-PySpark-example](images/unit-tests-PySpark-example.png)
![unit-tests-PySpark-goal](images/unit-tests-PySpark-goal.png)
![unit-tests-PyTest-framework](images/unit-tests-PyTest-framework.png)

## Creating and Executing Unit Tests
![creating-and-executing-1](images/creating-and-executing-1.png)
![creating-and-executing-2](images/creating-and-executing-2.png)
![creating-and-executing-3](images/creating-and-executing-3.png)
![creating-and-executing-unit-tests](images/creating-and-executing-unit-tests.png)
![creating-and-executing-unit-tests-1](images/creating-and-executing-unit-tests-1.png)
![creating-and-executing-unit-tests-2](images/creating-and-executing-unit-tests-2.png)
![creating-and-executing-unit-tests-3](images/creating-and-executing-unit-tests-3.png)
![creating-and-executing-unit-tests-4](images/creating-and-executing-unit-tests-4.png)
![creating-and-executing-unit-tests-5](images/creating-and-executing-unit-tests-5.png)
![creating-and-executing-unit-tests-6](images/creating-and-executing-unit-tests-6.png)
![creating-and-executing-unit-tests-7](images/creating-and-executing-unit-tests-7.png)

## Executing Integration Tests with DLT and Workflows

![executing-integration-test-types](images/executing-integration-test-types.png)
![executing-integration-test-dlt](images/executing-integration-test-dlt.png)
![executing-integration-test-workflow](images/executing-integration-test-workflow.png)

## Performing Integration Tests with DLT and Workflows

![integration-with-workflow](images/integration-with-workflow.png)
![integration-with-workflow-dlt-pipeline](images/integration-with-workflow-dlt-pipeline.png)
![integration-with-workflow-dlt-pipeline-settings](images/integration-with-workflow-dlt-pipeline-settings.png)
![integration-with-workflow-dlt-pipeline-settings-2](images/integration-with-workflow-dlt-pipeline-settings-2.png)
![integration-with-workflow-dlt-pipeline-settings-yaml](images/integration-with-workflow-dlt-pipeline-settings-yaml.png)
![integration-with-workflow-dlt-pipeline-details](images/integration-with-workflow-dlt-pipeline-details.png)
![integration-with-workflow-dlt-pipeline-configuration](images/integration-with-workflow-dlt-pipeline-configuration.png)
![integration-with-workflow-dlt-pipeline-ingest](images/integration-with-workflow-dlt-pipeline-ingest.png)
![integration-with-workflow-dlt-pipeline-silver](images/integration-with-workflow-dlt-pipeline-silver.png)
![integration-with-workflow-dlt-pipeline-gold](images/integration-with-workflow-dlt-pipeline-gold.png)
![integration-with-workflow-dlt-pipeline-dictionary-create](images/integration-with-workflow-dlt-pipeline-dictionary-create.png)
![integration-with-workflow-dlt-pipeline-dictionary-materialized-view](images/integration-with-workflow-dlt-pipeline-dictionary-materialized-view.png)
![integration-with-workflow-dlt-pipeline-execute-integration-test](images/integration-with-workflow-dlt-pipeline-execute-integration-test.png)

## Version Control with Git Overview

Git version control matters because it provides a structured way to manage, track, and collaborate on code changes in software projects. It is a foundational tool for modern software development and DevOps.

### Complications with Version Control

A lack of centralized version control leads to development silos, which result in duplicate code, inconsistent standards, and lower software quality. Unstable versioning makes tracking, reverting, and auditing changes difficult, which increases risk when managing frequent updates. Branching, merging, and CI/CD integration also become challenging. Together, these problems limit how far a team and its codebase can scale.

### Secure Code Changes Through Branching

Starting from the first version, v0.1, on the main branch, work branches off into a feature branch, where changes can be tested properly in isolation.

![secure-code-changes-through-branching](images/secure-code-changes-through-branching.png)


### Overview of Git with Databricks

Git is a free and open source distributed version control system designed to track changes in source code during software development.

![overview-of-git-with-databricks](images/overview-of-git-with-databricks.png)


Version control enables tracking code changes and makes rollback and collaboration easier, while branching and merging allow multiple developers to work in parallel and integrate their changes efficiently. A distributed workflow gives each developer a full local repository, which improves flexibility and reliability. Git is also optimized for performance on large projects, and its security features use cryptographic integrity checks to detect data corruption.
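To make the integrity check concrete, here is a minimal sketch of how Git derives the ID of a file's contents (a "blob"). It assumes a repository using Git's default SHA-1 object format; the function name `git_blob_id` is illustrative and not part of any library.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a file's contents (a blob)."""
    # Git hashes a short header ("blob <size>" plus a NUL byte), then the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


original = b"hello world\n"
tampered = b"hello world!\n"

# Any change to the content produces a different ID, so corruption is detectable.
print(git_blob_id(original))
print(git_blob_id(original) == git_blob_id(tampered))
```

Running `git hash-object` on a file with the same bytes should print the same ID as `git_blob_id`, which is a quick way to check the sketch against a real Git installation.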

Users can work with remote Git repos while developing code inside Databricks notebooks, and the Repos REST API enables integration of data and AI projects into CI/CD pipelines, allowing users to automate Git workflows.

Finally, be deliberate about which third-party tools you integrate into your workflows, and use whatever makes sense for your organization's needs.



### Connecting to Databricks with GitHub PAT

Once you have configured the proper permissions for your GitHub repository, copy the personal access token (PAT) and switch over to Databricks.

![connecting-to-databricks-with-gitHub-PAT](images/connecting-to-databricks-with-gitHub-PAT.png)


### Git-Based Repos in Databricks


The Repos API provides programmatic access to Git-based repos. With the API, you can integrate Databricks repos with your CI/CD workflow.

You can programmatically create, update, and delete repos, perform Git operations, and specify Git versions when running jobs based on notebooks within the repo.  
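As a rough illustration, the sketch below builds (without sending) two Repos API requests: one that creates a repo from a Git URL and one that points an existing repo at a branch. The workspace URL, token, repository URL, workspace path, and repo ID `123` are all placeholders, and the helper `build_repo_request` is hypothetical.

```python
import json
import urllib.request


def build_repo_request(host, token, method, path, payload):
    """Build (but do not send) an authenticated JSON request to a workspace."""
    return urllib.request.Request(
        url=f"{host.rstrip('/')}{path}",
        data=json.dumps(payload).encode("utf-8"),
        method=method,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


HOST = "https://example.cloud.databricks.com"  # placeholder workspace URL
TOKEN = "example-token"  # placeholder; read real tokens from a secret store

# Create a repo in the workspace from a remote Git repository.
create_req = build_repo_request(HOST, TOKEN, "POST", "/api/2.0/repos", {
    "url": "https://github.com/example-org/example-repo.git",
    "provider": "gitHub",
    "path": "/Repos/ci/example-repo",
})

# Check out a specific branch in an existing repo (123 is a made-up repo ID).
update_req = build_repo_request(
    HOST, TOKEN, "PATCH", "/api/2.0/repos/123", {"branch": "main"}
)

print(create_req.get_method(), create_req.full_url)
print(update_req.get_method(), update_req.full_url)
```

In a CI/CD job, each request would be sent with `urllib.request.urlopen(...)`, typically right before triggering the job that runs notebooks from the repo.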

![git-folders-databricks](images/git-folders-databricks.png)

![git-based-repos-in-databricks](images/git-based-repos-in-databricks.png)

### Arbitrary Files Support in Repos
![arbitrary-files-support-in-repos](images/arbitrary-files-support-in-repos.png)

# Introduction to Continuous Deployment (CD)

## Deploying Databricks Assets Overview

### Deployment Options

When deploying Databricks assets, developers have three options: the REST API, the Databricks CLI, and the Databricks SDK.

The REST API is the most flexible option but also the most complex, which makes it best suited to custom integrations. The CLI simplifies REST API operations but offers less flexibility.

Finally, the SDK is the most developer-friendly and is best used for embedding Databricks functionality within applications.

![deployment-options](images/deployment-options.png)
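To give a feel for the trade-off, the sketch below expresses one operation, listing jobs, in two of the three styles. The REST version has to assemble the URL and auth header by hand, while the CLI version is a single command. The workspace URL and token are placeholders, both function names are hypothetical, and nothing is actually sent or executed.

```python
import urllib.parse
import urllib.request


def rest_list_jobs_request(host: str, token: str, limit: int = 25) -> urllib.request.Request:
    """REST API style: build the URL, query string, and auth header yourself."""
    query = urllib.parse.urlencode({"limit": limit})
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.1/jobs/list?{query}",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def cli_list_jobs_command() -> list[str]:
    """CLI style: the same operation as one command (auth comes from CLI config)."""
    return ["databricks", "jobs", "list"]


# SDK style, for comparison (requires the databricks-sdk package, so not run here):
#   from databricks.sdk import WorkspaceClient
#   for job in WorkspaceClient().jobs.list():
#       print(job.job_id)

req = rest_list_jobs_request("https://example.cloud.databricks.com", "example-token")
print(req.get_method(), req.full_url)
print(" ".join(cli_list_jobs_command()))
```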

### Databricks Asset Bundles

Databricks Asset Bundles, or DABs, are designed to facilitate the adoption of best practices in software engineering, particularly for data and AI projects.  Here, we will identify four core components of software engineering practices that are supported by DABs.

![databricks-asset-bundles](images/databricks-asset-bundles.png)

Databricks Asset Bundles provide a structured approach to managing Databricks projects while adhering to software engineering best practices. By combining infrastructure-as-code principles with automation capabilities, they streamline collaboration, improve quality assurance, and enable efficient delivery of data-driven solutions.

DABs integrate seamlessly with Git-based workflows, enabling users to version their Databricks resources alongside source code. By treating Databricks resources as code, DABs enable peer review through standard Git workflows like pull requests. Developers can use the Databricks CLI with DABs to run tests on bundles in isolated environments. This ensures that workflows behave as intended.

Finally, DABs integrate with CI/CD tools like GitHub Actions or Azure DevOps to automate validation, deployment, and execution of Databricks workflows. With this foundation in place, let's take a closer look at what DABs are.

![dbas](images/DBAs.png)

With DABs, you can create code that can be deployed across multiple environments without modification. This ensures consistency, reduces manual errors, and accelerates delivery by automating deployment processes.

DABs are a tool designed to streamline this process for Databricks projects. They enable developers to define Databricks resources like jobs, pipelines, and notebooks as source files and metadata in YAML format.
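As a rough sketch of what such a definition contains, the Python dictionary below mirrors the nesting of a bundle configuration: a `bundle` name, a `resources` mapping with one job, and `targets` for two environments. It is written as a dictionary only to keep the examples in one language; the real file is YAML, and the bundle name, job key, notebook path, and workspace URLs are all hypothetical.

```python
import json

# Hypothetical bundle definition, expressed as a dict that mirrors the YAML nesting.
bundle_config = {
    "bundle": {"name": "example_bundle"},
    "resources": {
        "jobs": {
            "example_job": {
                "name": "example_job",
                "tasks": [
                    {
                        "task_key": "run_etl",
                        "notebook_task": {"notebook_path": "./src/etl_notebook.py"},
                    }
                ],
            }
        }
    },
    # One entry per environment; the resources above are shared by all targets.
    "targets": {
        "dev": {
            "mode": "development",
            "default": True,
            "workspace": {"host": "https://dev.example.cloud.databricks.com"},
        },
        "prod": {
            "mode": "production",
            "workspace": {"host": "https://prod.example.cloud.databricks.com"},
        },
    },
}


def target_hosts(config: dict) -> dict:
    """Map each target name to the workspace host it deploys to."""
    return {name: t["workspace"]["host"] for name, t in config["targets"].items()}


print(json.dumps(target_hosts(bundle_config), indent=2))
```

The point of the layout is that the job is defined once and only the `targets` section varies per environment, which is what lets the same code deploy to several workspaces without modification.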

DABs work by first defining your resources and requirements in a `databricks.yml` file. You then validate the bundle with the Databricks CLI and deploy it to your chosen workspace. Once deployed, the workflows or pipelines described in the bundle can be executed.
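The validate, deploy, and run steps map onto three Databricks CLI subcommands. The sketch below only assembles and prints the commands; the target name `dev` and the resource key `example_job` are hypothetical, and the helper `bundle_commands` is illustrative.

```python
import shlex


def bundle_commands(target: str, resource_key: str) -> list[list[str]]:
    """Return the CLI invocations for one validate -> deploy -> run cycle."""
    return [
        ["databricks", "bundle", "validate", "--target", target],
        ["databricks", "bundle", "deploy", "--target", target],
        ["databricks", "bundle", "run", "--target", target, resource_key],
    ]


# Print the commands in the order a CI/CD pipeline would execute them.
for cmd in bundle_commands("dev", "example_job"):
    print(shlex.join(cmd))
```

To execute them from Python, each list could be passed to `subprocess.run(cmd, check=True)` from the bundle's root directory, so that a failed validation stops the pipeline before anything is deployed.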


### Development and CI/CD with DABs

![development-and-CI/CD-with-DABs](images/development-and-CI-CD-with-DABs.png)

To summarize, Databricks Asset Bundles are a tool designed to simplify the management and deployment of data and AI projects on the Databricks platform.  They follow an infrastructure as code approach, allowing users to define and manage Databricks resources such as jobs, pipelines, notebooks, and machine learning models through YAML configuration files.

These bundles streamline collaboration, testing, deployment, and version control across various environments.

![ci-cd-pipeline](images/ci-cd-pipeline.png)
![ci-cd-pipeline-job](images/ci-cd-pipeline-job.png)
![ci-cd-pipeline-job-workflow](images/ci-cd-pipeline-job-workflow.png)
![ci-cd-pipeline-job-workflow-delta](images/ci-cd-pipeline-job-workflow-delta.png)
![ci-cd-pipeline-job-workflow-visualization](images/ci-cd-pipeline-job-workflow-visualization.png)