
EMR with spark-submit #108

@ehameyie

Description


System Information

  • Spark or PySpark: PySpark
  • SDK Version:
  • Spark Version: Spark 2.4.0
  • Algorithm (e.g. KMeans): XGBoost

Describe the problem

I am running into three errors when calling SageMakerModel.fromModelS3Path() from a PySpark script submitted to my EMR cluster:

Error #1:

AttributeError: 'Option' object has no attribute '_java_obj'
Exception ignored in: <bound method JavaWrapper.__del__ of <sagemaker_pyspark.wrapper.ScalaMap object at 0x7f3a02df6ac8>>
Traceback (most recent call last):

Error #2:

There was an error calling SageMaker: An error occurred while calling z:com.amazonaws.services.sagemaker.sparksdk.SageMakerModel.fromModelS3Path. Trace:
py4j.Py4JException: Method fromModelS3Path([class java.lang.String, class java.lang.String, class java.lang.String, class java.lang.String, class java.lang.Integer, class com.amazonaws.services.sagemaker.sparksdk.transformation.serializers.LibSVMRequestRowSerializer, class com.amazonaws.services.sagemaker.sparksdk.transformation.deserializers.XGBoostCSVRowDeserializer, class scala.collection.immutable.HashMap, class scala.Enumeration$Val, class com.amazonaws.services.sagemaker.AmazonSageMakerClient, class java.lang.Boolean, class com.amazonaws.services.sagemaker.sparksdk.RandomNamePolicy, class java.lang.String]) does not exist

Error #3:
When including --jars in spark-submit:

Exception in thread "main" java.io.FileNotFoundException: File file:/mnt/var/lib/hadoop/steps/s-xxxxxxx/SageMakerSparkApplicationJar.jar does not exist

I am able to run SageMakerModel.fromModelS3Path() without issue in a SageMaker notebook, but the same code fails on the EMR cluster.

Minimal repro / logs

See the logs above.

  • Exact command to reproduce:
    The command I use to submit the application to EMR follows the documentation:
  spark-submit \
  --packages com.amazonaws:aws-java-sdk:1.11.613 \
  --deploy-mode cluster \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --jars SageMakerSparkApplicationJar.jar,...

When I include --jars SageMakerSparkApplicationJar.jar, I get Error #3 above.

When I instead omit --jars SageMakerSparkApplicationJar.jar from spark-submit and add the following code to my main script, the job runs but produces Error #1 and Error #2 above:

import sagemaker_pyspark
from pyspark.sql import SparkSession
classpath = ":".join(sagemaker_pyspark.classpath_jars())
spark = SparkSession.builder.config("spark.driver.extraClassPath", classpath).getOrCreate()
