
Why would I use sagemaker-spark if I have to use my Spark cluster anyway? #13

@Manelmc

Along the lines of "Building your own algorithm container": is it possible to run Spark code entirely (and in a distributed fashion) on SageMaker? What I get from the documentation is that I'm supposed to do ETL on my Spark cluster and then, when fitting the model, use a sagemaker_pyspark algorithm that creates a SageMaker training job, moving the DataFrame into S3 in protobuf format so it can be trained on a new cluster of SageMaker instances.
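For reference, a minimal sketch of the flow I mean, adapted from the sagemaker_pyspark examples (the role ARN, the S3 path, and the k/featureDim values are placeholders):

```python
from pyspark.sql import SparkSession
import sagemaker_pyspark
from sagemaker_pyspark import IAMRole
from sagemaker_pyspark.algorithms import KMeansSageMakerEstimator

# The SageMaker Spark JARs must be on the driver classpath.
classpath = ":".join(sagemaker_pyspark.classpath_jars())
spark = (SparkSession.builder
         .config("spark.driver.extraClassPath", classpath)
         .getOrCreate())

# ETL happens on the existing Spark cluster; training_df is assumed to
# end up with a "features" Vector column.
training_df = spark.read.parquet("s3://my-bucket/etl-output/")  # placeholder

estimator = KMeansSageMakerEstimator(
    sagemakerRole=IAMRole("arn:aws:iam::123456789012:role/SageMakerRole"),  # placeholder
    trainingInstanceType="ml.m4.xlarge",
    trainingInstanceCount=2,
    endpointInstanceType="ml.m4.xlarge",
    endpointInitialInstanceCount=1)
estimator.setK(10)
estimator.setFeatureDim(784)

# fit() serializes the DataFrame to S3 in protobuf format and launches a
# SageMaker training job on a *separate* cluster of ML instances.
model = estimator.fit(training_df)
```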

The question is: if I already have my DataFrame loaded in my distributed cluster, why would I want to use SageMaker? I might as well use Spark ML, which has broader algorithm support and avoids spinning up an additional cluster. Maybe I've got the whole thing wrong...
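For comparison, the in-cluster Spark ML alternative I have in mind (assuming the same training_df with a "features" column as above):

```python
from pyspark.ml.clustering import KMeans

# Trains directly on the executors of the existing cluster:
# no data export to S3, no additional instances.
kmeans = KMeans(k=10, featuresCol="features")
model = kmeans.fit(training_df)
```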
