Conversation

@xiaoxshe (Contributor)

Issue #, if available:

Description of changes:

The motivation for this change is that Spark >= 3.2 added the pandas API (about 90% coverage of the pandas API), which makes Spark much easier for pandas users to adopt. Highlights of Spark 3.2 (a minimal usage sketch follows this list):

- Introduces the pandas API on Apache Spark to unify the small-data and big-data APIs (learn more here).
- Completes the ANSI SQL compatibility mode to simplify migration of SQL workloads.
- Productionizes adaptive query execution to speed up Spark SQL at runtime.
- Introduces the RocksDB state store to make state processing more scalable.
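
A minimal sketch (not code from this PR) exercising these features, assuming a pyspark >= 3.2 installation:

```python
from pyspark.sql import SparkSession
import pyspark.pandas as ps

spark = (
    SparkSession.builder.appName("spark-3.2-features")
    # ANSI SQL compatibility mode
    .config("spark.sql.ansi.enabled", "true")
    # adaptive query execution (enabled by default since Spark 3.2)
    .config("spark.sql.adaptive.enabled", "true")
    # RocksDB state store provider for structured streaming state
    .config(
        "spark.sql.streaming.stateStore.providerClass",
        "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider",
    )
    .getOrCreate()
)

# pandas API on Spark: pandas-style code running on the Spark engine
psdf = ps.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})
print(psdf.describe())
sdf = psdf.to_spark()  # drop down to a native Spark DataFrame when needed
```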

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@xiaoxshe (Contributor, Author)

Running test-unit:

test/unit/test_bootstrapper.py::test_recursive_deserialize_user_configuration PASSED
test/unit/test_bootstrapper.py::test_site_multiple_classifications PASSED
test/unit/test_bootstrapper.py::test_env_classification PASSED
test/unit/test_bootstrapper.py::test_copy_aws_jars PASSED
test/unit/test_bootstrapper.py::test_bootstrap_smspark_submit PASSED
test/unit/test_bootstrapper.py::test_bootstrap_history_server PASSED
test/unit/test_bootstrapper.py::test_wait_for_hadoop PASSED
test/unit/test_bootstrapper.py::test_copy_cluster_config PASSED
test/unit/test_bootstrapper.py::test_start_hadoop_daemons_on_primary PASSED
test/unit/test_bootstrapper.py::test_start_hadoop_daemons_on_worker PASSED
test/unit/test_bootstrapper.py::test_spark_standalone_primary PASSED
test/unit/test_bootstrapper.py::test_set_regional_configs PASSED
test/unit/test_bootstrapper.py::test_set_regional_configs_empty PASSED
test/unit/test_bootstrapper.py::test_get_regional_configs_cn PASSED
test/unit/test_bootstrapper.py::test_get_regional_configs_gov PASSED
test/unit/test_bootstrapper.py::test_get_regional_configs_us PASSED
test/unit/test_bootstrapper.py::test_get_regional_configs_missing_region PASSED
test/unit/test_bootstrapper.py::test_load_processing_job_config PASSED
test/unit/test_bootstrapper.py::test_load_processing_job_config_fallback PASSED
test/unit/test_bootstrapper.py::test_load_instance_type_info PASSED
test/unit/test_bootstrapper.py::test_set_yarn_spark_resource_config PASSED
test/unit/test_bootstrapper.py::test_set_yarn_spark_resource_config_fallback PASSED
test/unit/test_bootstrapper.py::test_get_yarn_spark_resource_config PASSED
test/unit/test_cli.py::test_submit[missing APP arg should fail] PASSED
test/unit/test_cli.py::test_submit[invalid spark options should fail] PASSED
test/unit/test_cli.py::test_submit[happy path should pass] PASSED
test/unit/test_cli.py::test_submit[valid spark option should pass] PASSED
test/unit/test_cli.py::test_submit[single local jar should pass0] PASSED
test/unit/test_cli.py::test_submit[list of local jars should pass0] PASSED
test/unit/test_cli.py::test_submit[s3 url to jar should pass0] PASSED
test/unit/test_cli.py::test_submit[s3a url to jar should pass0] PASSED
test/unit/test_cli.py::test_submit[multiple s3 urls to jar should pass0] PASSED
test/unit/test_cli.py::test_submit[mixed s3 urls to jars and local paths should pass0] PASSED
test/unit/test_cli.py::test_submit[relative paths should fail0] PASSED
test/unit/test_cli.py::test_submit[nonexistent paths should fail0] PASSED
test/unit/test_cli.py::test_submit[directory with no files should fail0] PASSED
test/unit/test_cli.py::test_submit[single local jar should pass1] PASSED
test/unit/test_cli.py::test_submit[list of local jars should pass1] PASSED
test/unit/test_cli.py::test_submit[s3 url to jar should pass1] PASSED
test/unit/test_cli.py::test_submit[s3a url to jar should pass1] PASSED
test/unit/test_cli.py::test_submit[multiple s3 urls to jar should pass1] PASSED
test/unit/test_cli.py::test_submit[mixed s3 urls to jars and local paths should pass1] PASSED
test/unit/test_cli.py::test_submit[relative paths should fail1] PASSED
test/unit/test_cli.py::test_submit[nonexistent paths should fail1] PASSED
test/unit/test_cli.py::test_submit[directory with no files should fail1] PASSED
test/unit/test_cli.py::test_submit[single local jar should pass2] PASSED
test/unit/test_cli.py::test_submit[list of local jars should pass2] PASSED
test/unit/test_cli.py::test_submit[s3 url to jar should pass2] PASSED
test/unit/test_cli.py::test_submit[s3a url to jar should pass2] PASSED
test/unit/test_cli.py::test_submit[multiple s3 urls to jar should pass2] PASSED
test/unit/test_cli.py::test_submit[mixed s3 urls to jars and local paths should pass2] PASSED
test/unit/test_cli.py::test_submit[relative paths should fail2] PASSED
test/unit/test_cli.py::test_submit[nonexistent paths should fail2] PASSED
test/unit/test_cli.py::test_submit[directory with no files should fail2] PASSED
test/unit/test_cli.py::test_submit[quotes are handled correctly] PASSED
test/unit/test_config.py::test_core_site_xml PASSED
test/unit/test_config.py::test_hadoop_env_sh PASSED
test/unit/test_config.py::test_hadoop_log4j PASSED
test/unit/test_config.py::test_hive_env PASSED
test/unit/test_config.py::test_hive_log4j PASSED
test/unit/test_config.py::test_hive_exec_log4j PASSED
test/unit/test_config.py::test_hive_site PASSED
test/unit/test_config.py::test_spark_defaults_conf PASSED
test/unit/test_config.py::test_spark_env PASSED
test/unit/test_config.py::test_spark_log4j_properties PASSED
test/unit/test_config.py::test_spark_hive_site PASSED
test/unit/test_config.py::test_spark_metrics_properties PASSED
test/unit/test_config.py::test_yarn_env PASSED
test/unit/test_config.py::test_yarn_size PASSED
test/unit/test_errors.py::test_algorithm_error PASSED
test/unit/test_errors.py::test_exit PASSED
test/unit/test_history_server_cli.py::test_run_history_server PASSED
test/unit/test_history_server_cli.py::test_submit[When arguments are set, should be passed job manager] PASSED
test/unit/test_history_server_utils.py::test_config_history_server_with_env_variable spark.history.fs.logDirectory=s3://bucket/spark-events
PASSED
test/unit/test_history_server_utils.py::test_config_history_server_without_env_variable PASSED
test/unit/test_history_server_utils.py::test_start_history_server PASSED
test/unit/test_nginx_utils.py::test_start_nginx PASSED
test/unit/test_nginx_utils.py::test_write_nginx_default_conf PASSED
test/unit/test_nginx_utils.py::test_write_nginx_default_conf_without_domain_name PASSED
test/unit/test_nginx_utils.py::test_copy_nginx_default_conf PASSED
test/unit/test_spark_event_logs_publisher.py::test_run_with_event_log_dir PASSED
test/unit/test_spark_event_logs_publisher.py::test_run_with_spark_events_s3_uri PASSED
test/unit/test_status.py::test_status_app PASSED
test/unit/test_status.py::test_status_server PASSED
test/unit/test_status.py::test_status_map_one_host PASSED
test/unit/test_status.py::test_status_map_multiple_hosts PASSED
test/unit/test_status.py::test_status_map_propagate_errors PASSED
test/unit/test_status.py::test_status_map_http_error PASSED
test/unit/test_waiter.py::test_waiter PASSED
test/unit/test_waiter.py::test_waiter_timeout PASSED
test/unit/test_waiter.py::test_waiter_pred_fn_errors PASSED

@xiaoxshe (Contributor, Author)

Running make install-container-library:

No known security vulnerabilities found.

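# Hadoop 3 start scripts require these variables to be set when starting the HDFS daemons as root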
ENV HDFS_DATANODE_USER="root"
ENV HDFS_SECONDARYNAMENODE_USER="root"


Contributor

This Dockerfile is missing a statement similar to this one: https://github.com/aws/sagemaker-spark-container/blob/master/spark/processing/3.1/py3/docker/py39/Dockerfile.cpu#L102. Can we make sure it's added?

Contributor Author

Hi Ajay, I was under the assumption that you wanted to keep this because that command removes the JndiLookup class from the log4j-core jar, mitigating the log4j vulnerability (CVE-2021-44228). See here: https://community.bmc.com/s/article/Log4j-CVE-2021-44228-REMEDIATION-Remove-JndiLookup-class-from-log4j-core-2-jar

But there is another approach: upgrading the version of log4j itself, which this PR gets as a side effect of the Hive version upgrade. I removed this line on purpose because keeping it means that every time you would have to look up the log4j version bundled in the Hive package to determine which jar to patch, which is a very cumbersome way to do it. As stated above, there is no need to add this line.
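
For context, here is a sketch of the kind of Dockerfile line being discussed; the jar path and version are illustrative guesses, not the exact line from this repo:

```dockerfile
# Mitigation for CVE-2021-44228: strip the JndiLookup class out of the
# log4j-core jar bundled with Hive (the path and version are assumptions).
RUN zip -q -d /usr/lib/hive/lib/log4j-core-2.*.jar \
    org/apache/logging/log4j/core/lookup/JndiLookup.class
```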

Contributor

Got it, thanks for the explanation.

@mahendruajay mahendruajay merged commit 3ccb729 into aws:master Aug 31, 2022
@cixuuz commented Sep 29, 2022

Hi, thanks for upgrading Spark to 3.2. Could you confirm which patch version of Spark this is? Is it Spark 3.2.1? Thanks!
