
Add hadoop aws/gcp jar to the spark default image #1908

Merged: 5 commits into master on Nov 6, 2023

Conversation

pingsutw (Member)

TL;DR

Add the Hadoop S3 and GCS dependencies to the default Spark image. Spark needs these jars to read data from S3 / GCS.
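For context, this is the kind of access that breaks without the connectors on the classpath. A minimal sketch (the bucket path is a placeholder, and the task config mirrors the example below):

import flytekit
from flytekit import task
from flytekitplugins.spark import Spark


@task(task_config=Spark())
def count_rows() -> int:
    sess = flytekit.current_context().spark_session
    # Without hadoop-aws (or the GCS connector) on the classpath, reading
    # an s3a:// (or gs://) path fails with
    # java.lang.ClassNotFoundException: org.apache.hadoop.fs.s3a.S3AFileSystem
    df = sess.read.parquet("s3a://my-bucket/some/data")  # placeholder path
    return df.count()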

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

Spark example:

import datetime
import random
from operator import add

import flytekit
from flytekit import Resources, task, workflow
from flytekitplugins.spark import Spark
from flytekit.image_spec.image_spec import ImageSpec

spark_image = ImageSpec(base_image="pingsutw/spark-v2", registry="pingsutw")


@task(
    task_config=Spark(
        # this configuration is applied to the spark cluster
        spark_conf={
            "spark.driver.memory": "1000M",
            "spark.executor.memory": "1000M",
            "spark.executor.cores": "1",
            "spark.executor.instances": "2",
            "spark.driver.cores": "1",
        },
        executor_path="/usr/bin/python3",
        applications_path="local:///usr/local/bin/entrypoint.py",
    ),
    limits=Resources(mem="2000M"),
    cache_version="1",
    container_image=spark_image,
)
def hello_spark(partitions: int) -> float:
    print("Starting Sparkfk wifth Partitions: {}".format(partitions))
    n = 100000 * partitions
    sess = flytekit.current_context().spark_session
    count = (
        sess.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    )
    pi_val = 4.0 * count / n
    print("Pi val is :{}".format(pi_val))
    return pi_val


def f(_):
    x = random.random() * 2 - 1
    y = random.random() * 2 - 1
    return 1 if x**2 + y**2 <= 1 else 0


@task(cache_version="1", container_image=spark_image)
def print_every_time(value_to_print: float, date_triggered: datetime.datetime) -> int:
    print("My printed value: {} @ {}".format(value_to_print, date_triggered))
    return 1


@workflow
def wf(triggered_date: datetime.datetime = datetime.datetime.now()) -> float:
    """
    Using the workflow is still as any other workflow. As image is a property of the task, the workflow does not care
    about how the image is configured.
    """
    pi = hello_spark(partitions=50)
    print_every_time(value_to_print=pi, date_triggered=triggered_date)
    return pi


if __name__ == "__main__":
    print(f"Running {__file__} main...")
    print(
        f"Running wf(triggered_date=datetime.datetime.now()): {wf(triggered_date=datetime.datetime.now())}"
    )
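The example runs like any other Flyte workflow; with the jars baked into the base image, the same code can read from S3 or GCS without extra Spark configuration (registration and remote execution, e.g. via pyflyte run --remote, are unchanged).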

Tracking Issue

NA

Follow-up issue

NA

Signed-off-by: Kevin Su <pingsutw@apache.org>

codecov bot commented Oct 22, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison: base (63e6632) 62.81% vs. head (17ed538) 62.81%.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1908   +/-   ##
=======================================
  Coverage   62.81%   62.81%           
=======================================
  Files         307      307           
  Lines       22984    22984           
  Branches     3490     3490           
=======================================
  Hits        14438    14438           
  Misses       8124     8124           
  Partials      422      422           


@iaroslav-ciupin (Contributor) left a comment

Thanks for fixing! Appreciate it!

kumare3 (Contributor) commented Oct 22, 2023

This is a band-aid. How will this work on gcp? It should be at the user level imo
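For reference, a user-level approach would mean each task pulling the connectors itself, e.g. through spark.jars.packages; a hypothetical sketch, with illustrative version coordinates that would have to match the image's Hadoop version:

import flytekit
from flytekit import task
from flytekitplugins.spark import Spark


@task(
    task_config=Spark(
        spark_conf={
            # Illustrative coordinates only; versions must match the
            # Hadoop/Spark build in the container image.
            "spark.jars.packages": (
                "org.apache.hadoop:hadoop-aws:3.3.4,"
                "com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.17"
            ),
        },
    ),
)
def read_data() -> int:
    sess = flytekit.current_context().spark_session
    return sess.read.parquet("s3a://my-bucket/some/data").count()  # placeholder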

pingsutw changed the title from "Add hadoop-aws jar to the spark default image" to "Add hadoop aws/gcp jar to the spark default image" on Oct 23, 2023
pingsutw merged commit 145fd77 into master on Nov 6, 2023
71 checks passed
ringohoffman pushed a commit to ringohoffman/flytekit that referenced this pull request Nov 24, 2023
* Add hadoop-aws jar to the spark default image
* no-cache-dir

Signed-off-by: Kevin Su <pingsutw@apache.org>
RRap0so pushed a commit to RRap0so/flytekit that referenced this pull request Dec 15, 2023
Signed-off-by: Rafael Raposo <rafaelraposo@spotify.com>