
Run databricks task locally #1951

Merged: 22 commits merged into master on Dec 8, 2023
Conversation

pingsutw (Member) commented Nov 10, 2023

TL;DR

This PR makes it possible to submit a Databricks job from a local machine and save intermediate data in a blob store. It simplifies testing and developing Databricks tasks locally.

There are two ways to run the Databricks job locally:

  1. pyflyte run databricks.py wf — runs the Spark task in the local process.
  2. pyflyte run --raw-output-data-prefix s3://databricks-agent/demo databricks.py wf — submits the task to the Databricks platform, falling back to 1 (local execution) if the agent raises an exception (see the sketch below).

Note: To submit a job from a local machine, you need AWS credentials and a Databricks access token in environment variables:

export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_SESSION_TOKEN=...
export DATABRICKS_TOKEN=...
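
A minimal sketch of the fallback behavior described above; the names below are hypothetical stand-ins, not the actual flytekit internals:

from typing import Any, Callable


def execute_with_fallback(
    submit_to_agent: Callable[..., Any],  # e.g. submits the job to Databricks
    run_locally: Callable[..., Any],      # e.g. runs the Spark task in-process
    **inputs: Any,
) -> Any:
    try:
        # First try the remote agent; intermediate data lands under the
        # configured --raw-output-data-prefix (e.g. s3://databricks-agent/demo).
        return submit_to_agent(**inputs)
    except Exception as exc:
        # If the agent raises, fall back to local execution (option 1).
        print(f"Agent submission failed ({exc!r}); falling back to local run.")
        return run_locally(**inputs)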

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

Example workflow (databricks_wf.py):

import datetime
import os
import random
from operator import add

from click.testing import CliRunner

import flytekit
from flytekit import Resources, Secret, task, workflow, ImageSpec
from flytekit.clis.sdk_in_container import pyflyte
from flytekitplugins.spark import Databricks

SECRET_GROUP = "token-info"
SECRET_NAME = "token_secret"

image = ImageSpec(base_image="pingsutw/databricks:v4", registry="pingsutw")

@task(
    task_config=Databricks(
        # this configuration is applied to the spark cluster
        spark_conf={
            "spark.driver.memory": "600M",
            "spark.executor.memory": "600M",
            "spark.executor.cores": "1",
            "spark.executor.instances": "1",
            "spark.driver.cores": "1",
        },
        executor_path="/databricks/python3/bin/python",
        applications_path="dbfs:///FileStore/tables/entrypoint.py",
        databricks_conf={
            "run_name": "flytekit databricks plugin example",
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "m6i.large",  # TODO: test m6i.large, i3.xlarge
                "num_workers": 3,
                "aws_attributes": {
                    "availability": "ON_DEMAND",
                    "instance_profile_arn": "arn:aws:iam::xxxxxx:instance-profile/databricks-agent",
                    "ebs_volume_type": "GENERAL_PURPOSE_SSD",
                    "ebs_volume_count": 1,
                    "ebs_volume_size": 100,
                },
            },
            "timeout_seconds": 3600,
            "max_retries": 1,
        },
        databricks_instance="xxxxxxx.cloud.databricks.com",
    ),
    limits=Resources(mem="2000M"),
    # container_image=image,
    container_image="pingsutw/databricks:v7"
)
def hello_spark(partitions: int) -> float:
    print("Starting Spark with Partitions: {}".format(partitions))

    n = 100000 * partitions
    sess = flytekit.current_context().spark_session
    count = (
        sess.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    )
    pi_val = 4.0 * count / n
    print("Pi val is :{}".format(pi_val))
    return pi_val


def f(_):
    x = random.random() * 2 - 1
    y = random.random() * 2 - 1
    return 1 if x**2 + y**2 <= 1 else 0


@task(cache_version="1")
def print_every_time(value_to_print: float, date_triggered: datetime.datetime) -> int:
    print("My printed value: {} @ {}".format(value_to_print, date_triggered))
    return 1


@workflow
def wf(
    triggered_date: datetime.datetime = datetime.datetime.now(),
) -> float:
    """
    Using the workflow is still as any other workflow. As image is a property of the task, the workflow does not care
    about how the image is configured.
    """
    pi = hello_spark(partitions=50)
    print_every_time(value_to_print=pi, date_triggered=triggered_date)
    return pi


if __name__ == '__main__':
    runner = CliRunner()
    result = runner.invoke(pyflyte.main,
                           ["run",
                            "--raw-output-data-prefix",
                            "s3://flyte-batch/spark/",
                            "/Users/kevin/git/flytekit/flyte-example/databricks_wf",
                            "wf"])
    print(result.output)
Dockerfile:

FROM databricksruntime/standard:13.3-LTS
LABEL org.opencontainers.image.source=https://github.com/flyteorg/flytesnacks

ENV PYTHONPATH /databricks/driver
ENV PATH="/databricks/python3/bin:$PATH"
USER 0

RUN sudo apt-get update && sudo apt-get install -y make build-essential libssl-dev git
RUN /databricks/python3/bin/pip install git+https://github.com/Future-Outlier/flytekit.git@master#subdirectory=plugins/flytekit-spark
RUN /databricks/python3/bin/pip install markupsafe==2.0.0

COPY flyte-example/databricks_wf.py /databricks/driver/
WORKDIR /databricks/driver
ENV PYTHONPATH /databricks/driver

Tracking Issue

flyteorg/flyte#3936

Follow-up issue

NA

codecov bot commented Nov 10, 2023

Codecov Report

Attention: 18 lines in your changes are missing coverage. Please review.

Comparison is base (9c8481e) 85.91% compared to head (9d23a1d) 85.90%.
Report is 1 commit behind head on master.

Files                                                   Patch %   Lines
flytekit/extend/backend/base_agent.py                   66.66%    12 Missing and 3 partials ⚠️
...ugins/flytekit-spark/flytekitplugins/spark/task.py   86.66%    2 Missing ⚠️
...gins/flytekit-spark/flytekitplugins/spark/agent.py   94.11%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1951      +/-   ##
==========================================
- Coverage   85.91%   85.90%   -0.02%     
==========================================
  Files         306      306              
  Lines       22818    22867      +49     
  Branches     3466     3470       +4     
==========================================
+ Hits        19605    19644      +39     
- Misses       2622     2629       +7     
- Partials      591      594       +3     


Future-Outlier (Member) commented Nov 10, 2023

Here's how I tried to specify the path.

if __name__ == '__main__':
    runner = CliRunner()
    result = runner.invoke(pyflyte.main,
                           ["run",
                            "--raw-output-data-prefix",
                            "s3://flyte-batch/spark/",
                            "/mnt/c/code/dev/example/plugins/databricks_wf",
                            "wf"])
    print(result.output)

Can you explain how to set --raw-output-data-prefix? I tested it in my local environment.

Environment variable:

# for flyte s3 minio
export FLYTE_AWS_ENDPOINT="http://localhost:30080/"
export FLYTE_AWS_ACCESS_KEY_ID="minio"
export FLYTE_AWS_SECRET_ACCESS_KEY="miniostorage"

My S3 storage: (screenshot omitted)

Error Message:

(dev) root@googler:/mnt/c/code/dev/example/plugins# python databricks_wf.py
Running Execution on local.
Failed with Exception Code: USER:AssertionError
Underlying Exception: Not Found
Failed to put data from /tmp/tmpbfgzko5e/script_mode.tar.gz to s3://flyte-batch/spark/025c7d20ac403c3c26629b35c0bca000/script_mode.tar.gz (recursive=False).

Original exception: Not Found

Future-Outlier (Member) commented Nov 10, 2023

How to Setup

Let's say you have a Python file called databricks_wf.py and you want to run it on the Databricks platform.

0. Databricks setup

(0) Set up your workspace
https://docs.flyte.org/en/latest/deployment/plugins/webapi/databricks.html#deployment-plugin-setup-webapi-databricks

(1) Enable BYOC (bring your own container)

curl -X PATCH -n \
  -H "Authorization: Bearer <your-personal-access-token>" \
  https://<databricks-instance>/api/2.0/workspace-conf \
  -d '{
    "enableDcs": "true"
    }'

Note: remember to use the token; this isn't mentioned in the docs.
Reference: https://docs.databricks.com/en/clusters/custom-containers.html
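
If you want to confirm the flag took effect, here is a small sketch (assuming the workspace-conf GET endpoint and the requests library; the instance URL is a placeholder):

import os

import requests

DATABRICKS_INSTANCE = "xxxxxxx.cloud.databricks.com"  # placeholder

# Read back the workspace setting that the PATCH above changed.
resp = requests.get(
    f"https://{DATABRICKS_INSTANCE}/api/2.0/workspace-conf",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    params={"keys": "enableDcs"},
)
resp.raise_for_status()
print(resp.json())  # expected: {"enableDcs": "true"}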
(2) Upload your entrypoint.py to DBFS (Databricks File System)
Copy the Python file from (0) and upload it to DBFS.

You can browse DBFS in Catalog to check that the file exists. One way to do the upload from Python is sketched below.
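
A hedged sketch of the upload, assuming the DBFS put REST endpoint (JSON body with base64-encoded contents, roughly 1 MB limit); the Databricks CLI would work just as well:

import base64
import os

import requests

DATABRICKS_INSTANCE = "xxxxxxx.cloud.databricks.com"  # placeholder

# Base64-encode the entrypoint script, as required by the JSON API.
with open("entrypoint.py", "rb") as f:
    contents = base64.b64encode(f.read()).decode()

resp = requests.post(
    f"https://{DATABRICKS_INSTANCE}/api/2.0/dbfs/put",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json={
        "path": "/FileStore/tables/entrypoint.py",  # matches applications_path in the task config
        "contents": contents,
        "overwrite": True,
    },
)
resp.raise_for_status()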

(3) Instance profile
(Kevin, please elaborate more; details are important, or please give an example.)
Reference: https://docs.databricks.com/en/aws/iam/instance-profile-tutorial.html

1. Build your Dockerfile (ImageSpec will be supported in the future)

FROM databricksruntime/standard:13.3-LTS
LABEL org.opencontainers.image.source=https://github.com/flyteorg/flytesnacks

ENV PYTHONPATH /databricks/driver
ENV PATH="/databricks/python3/bin:$PATH"
USER 0

RUN sudo apt-get update && sudo apt-get install -y make build-essential libssl-dev git
RUN /databricks/python3/bin/pip install git+https://github.com/Future-Outlier/flytekit.git@master#subdirectory=plugins/flytekit-spark
RUN /databricks/python3/bin/pip install markupsafe==2.0.0

COPY flyte-example/databricks_wf.py /databricks/driver/
WORKDIR /databricks/driver
ENV PYTHONPATH /databricks/driver

docker build -t pingsutw/databricks:v7 .

Note: you have to put your Python file on your PYTHONPATH.

2. Run the code

Locally

(Wait for Kevin's reply)

Remotely

You can use pyflyte register or pyflyte register --non-fast.
The second one skips zipping and uploading the package
(which means you don't need to download the input from S3, making the workflow faster).

pyflyte register databricks_wf.py --version DB-FIRST
pyflyte register --non-fast databricks_wf.py --version DB-SECOND

Now, you can run it!

Future-Outlier (Member) commented Nov 10, 2023

I ran it successfully in remote development. Since I don't have the AWS S3 secret, I can't put the data, but all functions work well!

Future-Outlier (Member) commented:

Is applications_path="local:///usr/local/bin/entrypoint.py" in the config necessary?

pingsutw (Member, Author) commented:

This is the command to run it locally.

pyflyte --verbose run --raw-output-data-prefix s3://flyte-batch/spark/ flyte-example/databricks_wf.py wf

kumare3 (Contributor) commented Nov 11, 2023

I thought about it some more; after deliberation I think this is confusing. Here is what I think makes more sense:

  1. By default, all local executions of tasks with an agent will automatically try to invoke remote services. For agents with PythonFunctionTask or AgentExecutorMixin, we should check whether raw-output-prefix is set. If not, we raise an error:
class AgentFunctionTaskExecutor:
    ...

    def execute(self):
        if ctx.raw_output_prefix is local:  # pseudocode: no remote prefix configured
            raise AssertionError(
                f"Using agent {self.name} locally needs to have a way to pass the data/code from local "
                "to remote. This needs the configuration of a common shared blob store like S3, GCS, etc. "
                "This can be achieved using `--raw-output-prefix` in `pyflyte run`. If you want to run the "
                "task code locally without invoking the remote service (e.g. testing), use the "
                "`--local-agent-emulation` flag in `pyflyte run`."
            )
        # ... continue to execution
  2. Thus a user has to specify pyflyte run --raw-output-prefix or pyflyte run --local-agent-emulation to run it correctly.
  3. This is almost self-documenting.

Review comment on this code:

def execute(self, **kwargs) -> Any:
    if isinstance(self.task_config, Databricks):
        # Since we only have databricks agent
        return AsyncAgentExecutorMixin.execute(self, **kwargs)
Contributor:

would this also automatically invoke the local method?

pingsutw marked this pull request as ready for review on November 24, 2023 at 11:19.
kumare3 (Contributor) left a review:

LGTM. Also, can I try and test this?

pingsutw merged commit 5a45657 into master on Dec 8, 2023. 72 of 74 checks passed.
Future-Outlier pushed a commit to Future-Outlier/flytekit that referenced this pull request Dec 12, 2023
RRap0so pushed a commit to RRap0so/flytekit that referenced this pull request Dec 15, 2023