Welcome to the repository for the Databricks Data Science Workshop.
This repository contains the notebooks that are used in the workshop to demonstrate the use of different Databricks tools in a Data Science environment.
- Introduction : Databricks Data Science Workshop
- Reading Resources
- Workshop Flow
- Setup / Requirements
The workshop consists of four interactive sections, each with its own notebook in the notebooks folder of this repository. The notebooks are run sequentially as we explore the capabilities of the Databricks Data Science platform, from Pandas integration to distributed modeling and CI/CD workflows.
Notebook | Summary |
---|---|
01-Koalas.py | Leveraging Koalas for Pandas workloads on Databricks |
02-MLFlow.scala | Designing a Segmentation Forecaster on Databricks |
03-Cluster Optimization.py | Debugging and understanding cluster performance |
04-Git Integration.py | CI/CD Workflow for Databricks Repos |
This workshop requires a running Databricks workspace. If you are an existing Databricks customer, you can use your existing workspace. Otherwise, the notebooks have also been tested on Databricks Community Edition.
The features used in this workshop require DBR 8.3 ML.
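To confirm that a cluster matches this requirement before running the notebooks, a small check along the following lines may help. It is a minimal sketch, not part of the workshop: the helper names `parse_runtime` and `is_expected_runtime` are hypothetical, and it assumes the runtime is reported as a string such as `8.3.x-cpu-ml-scala2.12`, with ML runtimes carrying an `-ml` marker.

```python
import re

# Runtime named in this README; treated here as an exact major.minor match.
EXPECTED_RUNTIME = (8, 3)


def parse_runtime(version: str):
    """Split a runtime string into ((major, minor), is_ml).

    Assumes strings shaped like '8.3.x-cpu-ml-scala2.12'.
    """
    match = re.match(r"^(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unrecognized runtime version: {version!r}")
    major_minor = (int(match.group(1)), int(match.group(2)))
    # Assumption: ML runtimes include an '-ml' marker in the version string.
    is_ml = "-ml" in version.lower()
    return major_minor, is_ml


def is_expected_runtime(version: str, expected=EXPECTED_RUNTIME) -> bool:
    """Return True when the string describes the expected ML runtime."""
    major_minor, is_ml = parse_runtime(version)
    return major_minor == expected and is_ml
```

In a notebook, the version string could be read from the cluster and passed to `is_expected_runtime`. One commonly used source is `spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion")`, though that key is an assumption here and is not documented by this repository.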
If you have Repos enabled on your Databricks workspace, you can import this repo directly and run the notebooks as is, skipping the DBC archive step.
Otherwise, download the DBC archive from releases and import it into your Databricks workspace.