Conversation

@xiaoxshe (Contributor)

Issue #, if available:

Description of changes:

The motivation for this change is that Spark >= 3.2 added the pandas API (about 90% coverage of the pandas API), which makes Spark much easier for pandas users to adopt. Highlights of Spark 3.2 (a minimal usage sketch follows this list):

- Introduces the pandas API on Apache Spark to unify the small-data and big-data APIs (learn more here).
- Completes the ANSI SQL compatibility mode to simplify migration of SQL workloads.
- Productionizes adaptive query execution to speed up Spark SQL at runtime.
- Introduces the RocksDB state store to make state processing more scalable.
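
A minimal sketch (not code from this PR) exercising these features, assuming a pyspark >= 3.2 installation:

```python
from pyspark.sql import SparkSession
import pyspark.pandas as ps

spark = (
    SparkSession.builder.appName("spark-3.2-features")
    # ANSI SQL compatibility mode
    .config("spark.sql.ansi.enabled", "true")
    # adaptive query execution (enabled by default since Spark 3.2)
    .config("spark.sql.adaptive.enabled", "true")
    # RocksDB state store provider for structured streaming state
    .config(
        "spark.sql.streaming.stateStore.providerClass",
        "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider",
    )
    .getOrCreate()
)

# pandas API on Spark: pandas-style code running on the Spark engine
psdf = ps.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})
print(psdf.describe())
sdf = psdf.to_spark()  # drop down to a native Spark DataFrame when needed
```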

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@xiaoxshe (Contributor, Author)

Running test-unit:

test/unit/test_bootstrapper.py::test_recursive_deserialize_user_configuration PASSED
test/unit/test_bootstrapper.py::test_site_multiple_classifications PASSED
test/unit/test_bootstrapper.py::test_env_classification PASSED
test/unit/test_bootstrapper.py::test_copy_aws_jars PASSED
test/unit/test_bootstrapper.py::test_bootstrap_smspark_submit PASSED
test/unit/test_bootstrapper.py::test_bootstrap_history_server PASSED
test/unit/test_bootstrapper.py::test_wait_for_hadoop PASSED
test/unit/test_bootstrapper.py::test_copy_cluster_config PASSED
test/unit/test_bootstrapper.py::test_start_hadoop_daemons_on_primary PASSED
test/unit/test_bootstrapper.py::test_start_hadoop_daemons_on_worker PASSED
test/unit/test_bootstrapper.py::test_spark_standalone_primary PASSED
test/unit/test_bootstrapper.py::test_set_regional_configs PASSED
test/unit/test_bootstrapper.py::test_set_regional_configs_empty PASSED
test/unit/test_bootstrapper.py::test_get_regional_configs_cn PASSED
test/unit/test_bootstrapper.py::test_get_regional_configs_gov PASSED
test/unit/test_bootstrapper.py::test_get_regional_configs_us PASSED
test/unit/test_bootstrapper.py::test_get_regional_configs_missing_region PASSED
test/unit/test_bootstrapper.py::test_load_processing_job_config PASSED
test/unit/test_bootstrapper.py::test_load_processing_job_config_fallback PASSED
test/unit/test_bootstrapper.py::test_load_instance_type_info PASSED
test/unit/test_bootstrapper.py::test_set_yarn_spark_resource_config PASSED
test/unit/test_bootstrapper.py::test_set_yarn_spark_resource_config_fallback PASSED
test/unit/test_bootstrapper.py::test_get_yarn_spark_resource_config PASSED
test/unit/test_cli.py::test_submit[missing APP arg should fail] PASSED
test/unit/test_cli.py::test_submit[invalid spark options should fail] PASSED
test/unit/test_cli.py::test_submit[happy path should pass] PASSED
test/unit/test_cli.py::test_submit[valid spark option should pass] PASSED
test/unit/test_cli.py::test_submit[single local jar should pass0] PASSED
test/unit/test_cli.py::test_submit[list of local jars should pass0] PASSED
test/unit/test_cli.py::test_submit[s3 url to jar should pass0] PASSED
test/unit/test_cli.py::test_submit[s3a url to jar should pass0] PASSED
test/unit/test_cli.py::test_submit[multiple s3 urls to jar should pass0] PASSED
test/unit/test_cli.py::test_submit[mixed s3 urls to jars and local paths should pass0] PASSED
test/unit/test_cli.py::test_submit[relative paths should fail0] PASSED
test/unit/test_cli.py::test_submit[nonexistent paths should fail0] PASSED
test/unit/test_cli.py::test_submit[directory with no files should fail0] PASSED
test/unit/test_cli.py::test_submit[single local jar should pass1] PASSED
test/unit/test_cli.py::test_submit[list of local jars should pass1] PASSED
test/unit/test_cli.py::test_submit[s3 url to jar should pass1] PASSED
test/unit/test_cli.py::test_submit[s3a url to jar should pass1] PASSED
test/unit/test_cli.py::test_submit[multiple s3 urls to jar should pass1] PASSED
test/unit/test_cli.py::test_submit[mixed s3 urls to jars and local paths should pass1] PASSED
test/unit/test_cli.py::test_submit[relative paths should fail1] PASSED
test/unit/test_cli.py::test_submit[nonexistent paths should fail1] PASSED
test/unit/test_cli.py::test_submit[directory with no files should fail1] PASSED
test/unit/test_cli.py::test_submit[single local jar should pass2] PASSED
test/unit/test_cli.py::test_submit[list of local jars should pass2] PASSED
test/unit/test_cli.py::test_submit[s3 url to jar should pass2] PASSED
test/unit/test_cli.py::test_submit[s3a url to jar should pass2] PASSED
test/unit/test_cli.py::test_submit[multiple s3 urls to jar should pass2] PASSED
test/unit/test_cli.py::test_submit[mixed s3 urls to jars and local paths should pass2] PASSED
test/unit/test_cli.py::test_submit[relative paths should fail2] PASSED
test/unit/test_cli.py::test_submit[nonexistent paths should fail2] PASSED
test/unit/test_cli.py::test_submit[directory with no files should fail2] PASSED
test/unit/test_cli.py::test_submit[quotes are handled correctly] PASSED
test/unit/test_config.py::test_core_site_xml PASSED
test/unit/test_config.py::test_hadoop_env_sh PASSED
test/unit/test_config.py::test_hadoop_log4j PASSED
test/unit/test_config.py::test_hive_env PASSED
test/unit/test_config.py::test_hive_log4j PASSED
test/unit/test_config.py::test_hive_exec_log4j PASSED
test/unit/test_config.py::test_hive_site PASSED
test/unit/test_config.py::test_spark_defaults_conf PASSED
test/unit/test_config.py::test_spark_env PASSED
test/unit/test_config.py::test_spark_log4j_properties PASSED
test/unit/test_config.py::test_spark_hive_site PASSED
test/unit/test_config.py::test_spark_metrics_properties PASSED
test/unit/test_config.py::test_yarn_env PASSED
test/unit/test_config.py::test_yarn_size PASSED
test/unit/test_errors.py::test_algorithm_error PASSED
test/unit/test_errors.py::test_exit PASSED
test/unit/test_history_server_cli.py::test_run_history_server PASSED
test/unit/test_history_server_cli.py::test_submit[When arguments are set, should be passed job manager] PASSED
test/unit/test_history_server_utils.py::test_config_history_server_with_env_variable spark.history.fs.logDirectory=s3://bucket/spark-events
PASSED
test/unit/test_history_server_utils.py::test_config_history_server_without_env_variable PASSED
test/unit/test_history_server_utils.py::test_start_history_server PASSED
test/unit/test_nginx_utils.py::test_start_nginx PASSED
test/unit/test_nginx_utils.py::test_write_nginx_default_conf PASSED
test/unit/test_nginx_utils.py::test_write_nginx_default_conf_without_domain_name PASSED
test/unit/test_nginx_utils.py::test_copy_nginx_default_conf PASSED
test/unit/test_spark_event_logs_publisher.py::test_run_with_event_log_dir PASSED
test/unit/test_spark_event_logs_publisher.py::test_run_with_spark_events_s3_uri PASSED
test/unit/test_status.py::test_status_app PASSED
test/unit/test_status.py::test_status_server PASSED
test/unit/test_status.py::test_status_map_one_host PASSED
test/unit/test_status.py::test_status_map_multiple_hosts PASSED
test/unit/test_status.py::test_status_map_propagate_errors PASSED
test/unit/test_status.py::test_status_map_http_error PASSED
test/unit/test_waiter.py::test_waiter PASSED
test/unit/test_waiter.py::test_waiter_timeout PASSED
test/unit/test_waiter.py::test_waiter_pred_fn_errors PASSED

@xiaoxshe (Contributor, Author)

Running make install-container-library:

No known security vulnerabilities found.

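# Hadoop 3 start scripts require these variables to be set when starting the HDFS daemons as root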
ENV HDFS_DATANODE_USER="root"
ENV HDFS_SECONDARYNAMENODE_USER="root"


Contributor

This Dockerfile is missing a statement similar to this one: https://github.com/aws/sagemaker-spark-container/blob/master/spark/processing/3.1/py3/docker/py39/Dockerfile.cpu#L102. Can we make sure it's added?

Contributor Author

Hi Ajay, I was under the assumption that you wanted to keep this because that command removes the JndiLookup class from the log4j-core jar, mitigating the log4j vulnerability (CVE-2021-44228). See here: https://community.bmc.com/s/article/Log4j-CVE-2021-44228-REMEDIATION-Remove-JndiLookup-class-from-log4j-core-2-jar

But there is another approach: upgrading the version of log4j itself, which this PR gets as a side effect of the Hive version upgrade. I removed this line on purpose because keeping it means that every time you would have to look up the log4j version bundled in the Hive package to determine which jar to patch, which is a very cumbersome way to do it. As stated above, there is no need to add this line.
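
For context, here is a sketch of the kind of Dockerfile line being discussed; the jar path and version are illustrative guesses, not the exact line from this repo:

```dockerfile
# Mitigation for CVE-2021-44228: strip the JndiLookup class out of the
# log4j-core jar bundled with Hive (the path and version are assumptions).
RUN zip -q -d /usr/lib/hive/lib/log4j-core-2.*.jar \
    org/apache/logging/log4j/core/lookup/JndiLookup.class
```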

Contributor

Got it, thanks for the explanation.

@mahendruajay mahendruajay merged commit 3ccb729 into aws:master Aug 31, 2022
@cixuuz commented Sep 29, 2022

Hi, thanks for upgrading Spark to 3.2. Could you confirm which patch version of Spark this is? Is it Spark 3.2.1? Thanks!
