support Spark 3.2 and EMR 6.7 #98
Conversation
running test-unit: test/unit/test_bootstrapper.py::test_recursive_deserialize_user_configuration PASSED
make install-container-library: No known security vulnerabilities found.
ENV HDFS_DATANODE_USER="root"
ENV HDFS_SECONDARYNAMENODE_USER="root"
We are missing a statement similar to this: https://github.com/aws/sagemaker-spark-container/blob/master/spark/processing/3.1/py3/docker/py39/Dockerfile.cpu#L102 in this Dockerfile. Can we make sure it's added?
Hi Ajay, I was under the assumption that you wanted to keep this because this command removes the JndiLookup class from the jar file; log4j has a known vulnerability (CVE-2021-44228), so we want to strip the JndiLookup class from the jar. See here: https://community.bmc.com/s/article/Log4j-CVE-2021-44228-REMEDIATION-Remove-JndiLookup-class-from-log4j-core-2-jar
But if you look into it, there is another approach: update the log4j version itself, which we get piggybacked on the Hive version upgrade. I removed this line on purpose because otherwise you have to look up the log4j version in the Hive package every time to determine which jar to patch, which is a very cumbersome way to do it.
As stated above, there is no need to add this line.
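For context, a minimal Python sketch (not from this repo) of what that removed Dockerfile step accomplishes; the jar filename below is hypothetical, since the actual log4j-core version depends on what the Hive package bundles, which is exactly why maintaining this line is cumbersome:

```python
import shutil
import zipfile

# Hypothetical jar name: the real version is whatever Hive ships with.
jar = "log4j-core-2.x.jar"
patched = jar + ".patched"

# Rewrite the jar, dropping the vulnerable JndiLookup class (CVE-2021-44228).
with zipfile.ZipFile(jar) as src, zipfile.ZipFile(patched, "w") as dst:
    for info in src.infolist():
        if info.filename != "org/apache/logging/log4j/core/lookup/JndiLookup.class":
            dst.writestr(info, src.read(info.filename))

shutil.move(patched, jar)
```

The Dockerfile line being discussed does the same thing in one shell command (the standard CVE remediation is `zip -q -d` against the class path above).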
Got it, thanks for the explanation.
Hi, thanks for upgrading Spark to 3.2. I'd like to confirm which patch version of Spark this is. Is it Spark 3.2.1? Thanks!
Issue #, if available:
Description of changes:
The reason for this task is that Spark >= 3.2 adds the pandas API (covering roughly 90% of pandas use cases), which makes it easy for pandas users to adopt. Introducing the pandas API on Apache Spark unifies the small-data API and the big-data API (a short PySpark sketch follows this list).
Completing the ANSI SQL compatibility mode to simplify migration of SQL workloads.
Productionizing adaptive query execution to speed up Spark SQL at runtime.
Introducing the RocksDB state store to make stateful stream processing more scalable.
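As a quick illustration of the pandas API mentioned above, here is a minimal sketch (the data is made up; this is not code from this PR):

```python
import pyspark.pandas as ps  # pandas API on Spark, new in Spark 3.2

# A pandas-style DataFrame backed by a distributed Spark plan.
psdf = ps.DataFrame({"user": ["a", "b", "a"], "clicks": [3, 5, 2]})

# Familiar pandas operations, executed by Spark under the hood.
print(psdf.groupby("user")["clicks"].sum().sort_index())

# Drop down to a regular Spark DataFrame when needed.
sdf = psdf.to_spark()
sdf.show()
```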
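And a sketch of how the other listed features are switched on at the session level; the config keys are the Spark 3.2 option names, while the app name is illustrative:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("spark-3.2-features-demo")  # illustrative name
    # ANSI SQL compatibility mode (still opt-in in Spark 3.2).
    .config("spark.sql.ansi.enabled", "true")
    # Adaptive query execution (enabled by default as of Spark 3.2).
    .config("spark.sql.adaptive.enabled", "true")
    # RocksDB-backed state store for structured streaming (new in Spark 3.2).
    .config(
        "spark.sql.streaming.stateStore.providerClass",
        "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider",
    )
    .getOrCreate()
)

# Under ANSI mode, invalid casts and overflows raise errors instead of
# silently returning null; this valid cast succeeds.
spark.sql("SELECT CAST('1' AS INT)").show()
```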
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.