Databricks Maven Plugin

This is the Databricks Maven plugin, which uses the Databricks REST API.

Build Status


API Overview

Javadocs

Prerequisites

For Users:

  • You have a Databricks account
  • You are somewhat familiar with Maven and have Maven installed
  • You have an S3 bucket (which we will call databricksRepo) that you will use to store your artifacts
  • You have AWS keys that can write to this S3 bucket
  • Databricks has access to an IAM role that can read from this bucket

For Contributors:

  • You need to be able to execute an integration test that will actually do things on your Databricks account.

Configuring

System Properties

For Databricks-specific properties we also support system properties. This can be useful when you don't want tokens or passwords stored in a pom or a script and instead want them to be available on a build server. Currently the following system properties are supported:

  • DB_URL -> my-databricks.cloud.databricks.com
  • DB_TOKEN -> dapiceexampletoken
  • DB_USER -> my_user
  • DB_PASSWORD -> my_password

We can continue to support more system properties in the future if users have a compelling reason for it.

AWS Credentials

For the upload mojo that uploads your artifact to S3, the default AWS credential provider chain is used. As long as the credentials on that chain have the appropriate permissions, it should just work.
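For example, one link in the default provider chain is the shared credentials file. A minimal ~/.aws/credentials might look like the following sketch (the key values shown are placeholders, not real credentials):

```ini
[default]
aws_access_key_id = AKIAEXAMPLEKEY
aws_secret_access_key = exampleSecretAccessKey
```

Environment variables (AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY) and EC2 instance profiles are other links in the same chain, so any of them will work on a build server.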

All other properties

For all other properties we support configuration in the following ways:

  1. via the configuration section of the mojo
  2. via property configuration on the command line or in the <properties> section of your pom
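For instance, option 2 in the pom might look like the following sketch; the property names databricks.repo and environment are taken from the CLI example in this README, and the values are placeholders:

```xml
<properties>
    <!-- Bucket used to store Databricks artifacts (placeholder value) -->
    <databricks.repo>my-repo</databricks.repo>
    <!-- Environment name used for conditional job configuration -->
    <environment>QA</environment>
</properties>
```

The same properties can instead be passed on the command line, e.g. -Ddatabricks.repo=my-repo -Denvironment=QA.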

Examples

If you would like to set up default profiles for users, you can take the following approach. NOTE: if you define the configuration like below, you cannot override it via CLI args unless you also use project properties.

         <!-- Databricks QA Credentials -->
         <profile>
             <id>QA</id>
             <build>
                 <plugins>
                     <plugin>
                         <groupId>com.edmunds</groupId>
                         <artifactId>databricks-maven-plugin</artifactId>
                         <version>${oss-databricks-maven-plugin-version}</version>
                         <configuration>
                             <databricksRepo>${which bucket you want to use to store databricks artifacts}</databricksRepo>
                             <!-- This is used to be able to allow for conditional configuration in job settings -->
                             <environment>QA</environment>
                             <host>${qa-host-here}</host>
                             <user>${qa-user-here}</user>
                             <password>${qa-password-here}</password>
                         </configuration>
                     </plugin>
                 </plugins>
             </build>
         </profile>

Yet another option is to provide all of your credentials when you call the plugin. You can even rely on system properties or the default AWS provider chain for the host/user/password OR token for the Databricks REST client. Please see the End-to-End testing section or the BaseDatabricksMojo for information on these system properties.

mvn databricks:upload-to-s3 -Ddatabricks.repo=my-repo -Denvironment=QA

Instructions

Use Case 1 - Uploading an Artifact to S3, for Databricks

#This approach will build, run tests and copy your artifacts to s3.
mvn clean deploy

#This approach will only upload your artifacts to s3.
mvn databricks:upload-to-s3
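If you would rather not invoke the goal by hand, you could bind upload-to-s3 to a lifecycle phase as an execution. A sketch (the phase and execution id here are assumptions; adjust them to your build):

```xml
<plugin>
    <groupId>com.edmunds</groupId>
    <artifactId>databricks-maven-plugin</artifactId>
    <version>${oss-databricks-maven-plugin-version}</version>
    <executions>
        <execution>
            <!-- Hypothetical execution id; runs the upload during deploy -->
            <id>upload-artifact</id>
            <phase>deploy</phase>
            <goals>
                <goal>upload-to-s3</goal>
            </goals>
        </execution>
    </executions>
</plugin>
```

With this in place, mvn clean deploy uploads the artifact without a separate command.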

Use Case 2 - Attaching a Jar to a Cluster

This will install a library on a Databricks cluster, taking care of any restarts necessary.

mvn clean install databricks:upload-to-s3 \
databricks:library -Dlibrary.command=INSTALL -Dclusters={myDatabricksCluster}

Use Case 3 - Exporting Notebooks to a Workspace

This command demonstrates exporting notebooks to a workspace as well as uploading a jar and attaching it to a cluster, which is a common operation when you have a notebook that also depends on library code.

mvn clean install databricks:upload-to-s3 \
databricks:library databricks:import-workspace \
-Dlibrary.command=INSTALL -Dclusters=sam_test

Use Case 4 - Upsert a Job to a Workspace

You must have a job definition file. This file should live in the resources directory, be named databricks-plugin/databricks-job-settings.json, and be a serialized array of JobSettingsDTO. Note that this file is a template that has access to both the Java system properties and the Maven project data; Freemarker is used to merge the template with that data.

[
  {
    // There are validation rules around job names based on groupId and artifactId; these can be turned off
    "name": "myTeam/myArtifact",
    "new_cluster": {
      "spark_version": "4.1.x-scala2.11",
      "aws_attributes": {
        "first_on_demand": 1,
        "availability": "SPOT_WITH_FALLBACK", // Can also be set to SPOT
        "instance_profile_arn": "yourArn",
        "spot_bid_price_percent": 100,
        "ebs_volume_type": "GENERAL_PURPOSE_SSD",
        "ebs_volume_count": 1,
        "ebs_volume_size": 100
      },
      "driver_node_type_id": "r4.xlarge",
      "node_type_id": "m4.large",
      "num_workers": 1,
      "autoscale": {
        "min_workers": 1,
        "max_workers": 3
      },
      "custom_tags": {
        "team": "myTeam"
      },
      "autotermination_minutes": 0,
      "enable_elastic_disk": false
    },
    "existing_cluster_id": null,
    "spark_conf": {
      "spark.databricks.delta.retentionDurationCheck.enabled": "false"
    },
    "timeout_seconds": 10800, // 3 hours
    "schedule": {
      "quartz_cron_expression": "0 0/30 * ? * * *",
      "timezone_id": "America/Los_Angeles"
    },
    "spark_jar_task": {
      "main_class_name": "com.edmunds.dwh.VehicleInventoryHistoryDriver"
    },
    // If you omit this section, it will automatically be added to your job
    "libraries": [
      {
        "jar": "s3://${projectProperties['databricks.repo']}/${projectProperties['databricks.repo.key']}"
      }
    ],
    "email_notifications": {
      "on_failure": [ "myEmail@email.com" ],
      "on_start": null,
      "on_success": null
    },
    "retry_on_timeout": false,
    "max_retries": 0,
    "min_retry_interval_millis": 0,
    "max_concurrent_runs": 1
  }
]

To upsert your job, run one of the following. You can invoke the mojo manually, like so, or attach it as an execution (see Use Case 2 for an example):

#deploys the current version
mvn databricks:upsert-job

#deploys a specific version
mvn databricks:upsert-job -Ddeploy-version=1.0

#deploys without validation
#If validation gets in your way, please create an issue and let us know where our validation rules are too specific
mvn databricks:upsert-job -Dvalidate=false

You can use freemarker templating like so:

      <#if environment == "QA" || environment == "DEV">
      "node_type_id": "r3.xlarge",
      "driver_node_type_id": "r3.xlarge",
      "num_workers": 5,
      <#else>
      "node_type_id": "r3.4xlarge",
      "driver_node_type_id": "r3.xlarge",
      "num_workers": 10,
      </#if>

For additional information, please consult https://docs.databricks.com/api/latest/jobs.html#create and the JobSettingsDTO in https://www.javadoc.io/doc/com.edmunds/databricks-rest-client/.

Use Case 5 - Control a Job (start, stop, restart)

You can control a job (stop it, start it, restart it) via this mojo. There is one required property: jobCommand. You can add it to your configuration section, or invoke the mojo manually, like so:

(Note: you can override the jobName in this example; it is otherwise derived from the job settings JSON file.)

mvn databricks:job -Djob.command=STOP
mvn databricks:job -Djob.command=START
mvn databricks:job -Djob.command=RESTART
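As mentioned, the job command can also live in the plugin's configuration section instead of being passed on the command line. A sketch (the element name follows the jobCommand property named above; verify it against the mojo documentation):

```xml
<plugin>
    <groupId>com.edmunds</groupId>
    <artifactId>databricks-maven-plugin</artifactId>
    <version>${oss-databricks-maven-plugin-version}</version>
    <configuration>
        <!-- STOP, START, or RESTART -->
        <jobCommand>RESTART</jobCommand>
    </configuration>
</plugin>
```

With this configuration in place, a plain mvn databricks:job would restart the job.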

Building, Installing and Running

How to build the project locally: mvn clean install

Running the tests

mvn clean test

End-to-End testing

Please have these set in your .bash_profile.

export DB_USER=myuser
export DB_PASSWORD=mypassword
export DB_URL=my-test-db-instance
export DB_TOKEN=my-db-token
export DB_REPO=my-s3-bucket/my-artifact-location
export INSTANCE_PROFILE_ARN=arn:aws:iam::123456789:instance-profile/MyDatabricksRole
mvn clean -P run-its install

Please note, this will:

  1. create a job if it does not exist, and delete it if it does
  2. start the job (i.e. run it once)
  3. wait for the job to finish and ensure its success

Releasing

Please see the Contributing section for how to release.

Contributing

Please read CONTRIBUTING.md for the process for merging code into master.