Along the lines of "Building your own algorithm container": is it possible to run Spark code entirely (and in a distributed fashion) on SageMaker? What I gather from the documentation is that I'm supposed to do the ETL in my Spark cluster and then, when fitting the data to a model, use a sagemaker_pyspark estimator that creates a SageMaker training job, serializing the DataFrame to S3 in protobuf format and training on a separate cluster of SageMaker instances.
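If I understand the flow correctly, it would look roughly like the sketch below (the role ARN, instance types, K, feature dimension, and the `training_df` DataFrame produced by my ETL are all placeholders):

```python
from sagemaker_pyspark import IAMRole
from sagemaker_pyspark.algorithms import KMeansSageMakerEstimator

# Placeholder role ARN and instance types -- adjust to the actual account/setup
estimator = KMeansSageMakerEstimator(
    sagemakerRole=IAMRole("arn:aws:iam::123456789012:role/SageMakerRole"),
    trainingInstanceType="ml.m4.xlarge",
    trainingInstanceCount=2,
    endpointInstanceType="ml.m4.xlarge",
    endpointInitialInstanceCount=1,
)
estimator.setK(10)
estimator.setFeatureDim(784)

# fit() uploads the DataFrame to S3 in recordIO-protobuf format and then
# launches a SageMaker training job on a separate instance cluster
model = estimator.fit(training_df)
```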
The question is: if I already have my DataFrame loaded in my distributed cluster, why would I want to use SageMaker at all? I might as well use Spark ML, which has broader algorithm support and avoids spinning up an additional cluster. Maybe I've got the whole thing wrong...
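For contrast, the Spark ML equivalent (again assuming the same hypothetical `training_df` with a "features" column) trains in place on the cluster where the data already lives, with no S3 round trip and no extra instances:

```python
from pyspark.ml.clustering import KMeans

# Trains directly on the existing Spark executors
kmeans = KMeans(k=10, featuresCol="features")
spark_model = kmeans.fit(training_df)
```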