support Spark 3.2 and EMR 6.7 #98
Merged
Changes from all commits (8 commits):

3627f01  support Spark 3.2 and EMR 6.7 (xiaoxshe)
e0b042f  fix previous commit format error by running black (xiaoxshe)
b6c50bf  fix missing docstring in public module (xiaoxshe)
55ba750  fix epel release error (xiaoxshe)
556bd8b  fix epel release error (xiaoxshe)
2a538b4  revert back epel change (xiaoxshe)
6b41ebd  change for CR feedback (xiaoxshe)
3d1df4e  delete setup.py (xiaoxshe)
New Pipfile:

@@ -0,0 +1,41 @@
[[source]]
name = "pypi"
url = "https://pypi.org/simple"
verify_ssl = true

[dev-packages]

[packages]
tenacity = "==8.0.1"
psutil = "==5.9.0"
click = "==8.1.2"
watchdog = "==0.10.3"
waitress = "==2.1.2"
types-waitress = "==2.0.6"
requests = "==2.27.1"
types-requests = "==2.27.16"
rsa = "==4.3"
pyasn1 = "==0.4.8"
boto3 = "==1.21.33"
safety = "==1.10.3"
black = "==22.3.0"
mypy = "==0.942"
flake8 = "==4.0.1"
flake8-docstrings = "==1.5.0"
pytest = "==7.1.1"
pytest-cov = "==2.10.0"
pytest-xdist = "==2.5.0"
docker = "==5.0.3"
docker-compose = "==1.29.2"
cryptography = "==36.0.2"
typing-extensions = "==4.1.1"
sagemaker = "==2.83.0"
smspark = {editable = true, path = "."}
importlib-metadata = "==4.11.3"
pytest-parallel = "==0.1.1"
pytest-rerunfailures = "10.0"
numpy = "==1.22.2"
protobuf = "==3.20.1"

[requires]
python_version = "3.9"
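For local work outside the image, the same pinned dependency set can be installed with pipenv, mirroring what the Dockerfile below does inside the container. A minimal sketch, assuming Python 3.9 and pipenv are available and the command is run from the directory containing this Pipfile:

# Install the pipenv version pinned by the image, then install the Pipfile's
# packages into the system interpreter (no virtualenv), as the Dockerfile does.
python3.9 -m pip install pipenv==2022.4.8
PIPENV_PIPFILE=./Pipfile pipenv install --system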
(One large file diff in this pull request is not rendered by default.)
The build manifest entry changes the Spark version from "3.1.1" to "3.2" and sm_version from "1.3" to "1.0":

@@ -1,7 +1,7 @@
---
new_images:
  - spark: "3.2"
    use-case: "processing"
    processors: ["cpu"]
    python: ["py39"]
    sm_version: "1.0"
spark/processing/3.2/py3/container-bootstrap-config/bootstrap.sh (1 addition, 0 deletions):

@@ -0,0 +1 @@
echo "Not implemented"
New Dockerfile:

@@ -0,0 +1,128 @@
FROM 137112412989.dkr.ecr.us-west-2.amazonaws.com/amazonlinux:2
ARG REGION
ENV AWS_REGION ${REGION}

RUN yum clean all \
    && yum update -y \
    && yum install -y awscli bigtop-utils curl gcc gzip unzip zip gunzip tar wget liblapack* libblas* libopencv* libopenblas*

# Install python 3.9
ARG PYTHON_BASE_VERSION=3.9
ARG PYTHON_WITH_BASE_VERSION=python${PYTHON_BASE_VERSION}
ARG PIP_WITH_BASE_VERSION=pip${PYTHON_BASE_VERSION}
ARG PYTHON_VERSION=${PYTHON_BASE_VERSION}.12
RUN yum -y groupinstall 'Development Tools' \
    && yum -y install openssl-devel bzip2-devel libffi-devel sqlite-devel xz-devel \
    && wget https://www.python.org/ftp/python/${PYTHON_VERSION}/Python-${PYTHON_VERSION}.tgz \
    && tar xzf Python-${PYTHON_VERSION}.tgz \
    && cd Python-*/ \
    && ./configure --enable-optimizations \
    && make altinstall \
    && echo -e 'alias python3=python3.9\nalias pip3=pip3.9' >> ~/.bashrc \
    && ln -s $(which ${PYTHON_WITH_BASE_VERSION}) /usr/local/bin/python3 \
    && ln -s $(which ${PIP_WITH_BASE_VERSION}) /usr/local/bin/pip3 \
    && cd .. \
    && rm Python-${PYTHON_VERSION}.tgz \
    && rm -rf Python-${PYTHON_VERSION}

# Install nginx. amazonlinux:2.0.20200304.0 does not ship nginx, so epel-release must be installed first.
RUN wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
RUN yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
RUN yum install -y nginx

RUN rm -rf /var/cache/yum

ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
# http://blog.stuart.axelbrooke.com/python-3-on-spark-return-of-the-pythonhashseed
ENV PYTHONHASHSEED 0
ENV PYTHONIOENCODING UTF-8
ENV PIP_DISABLE_PIP_VERSION_CHECK 1

# Install EMR Spark/Hadoop
ENV HADOOP_HOME /usr/lib/hadoop
ENV HADOOP_CONF_DIR /usr/lib/hadoop/etc/hadoop
ENV SPARK_HOME /usr/lib/spark

COPY yum/emr-apps.repo /etc/yum.repos.d/emr-apps.repo

# Install hadoop/spark dependencies from EMR's yum repository for Spark optimizations.
# Replace the placeholder with the region in the repository URL.
RUN sed -i "s/REGION/${AWS_REGION}/g" /etc/yum.repos.d/emr-apps.repo
RUN adduser -N hadoop

# These packages are a subset of what EMR installs in a cluster with the
# "hadoop", "spark", and "hive" applications.
# They include EMR-optimized libraries and extras.
RUN yum install -y aws-hm-client \
    aws-java-sdk \
    aws-sagemaker-spark-sdk \
    emr-goodies \
    emr-ruby \
    emr-scripts \
    emr-s3-select \
    emrfs \
    hadoop \
    hadoop-client \
    hadoop-hdfs \
    hadoop-hdfs-datanode \
    hadoop-hdfs-namenode \
    hadoop-httpfs \
    hadoop-kms \
    hadoop-lzo \
    hadoop-yarn \
    hadoop-yarn-nodemanager \
    hadoop-yarn-proxyserver \
    hadoop-yarn-resourcemanager \
    hadoop-yarn-timelineserver \
    hive \
    hive-hcatalog \
    hive-hcatalog-server \
    hive-jdbc \
    hive-server2 \
    s3-dist-cp \
    spark-core \
    spark-datanucleus \
    spark-external \
    spark-history-server \
    spark-python

# Point Spark at the proper python binary
ENV PYSPARK_PYTHON=/usr/local/bin/python3.9

# Set up the Spark/Yarn/HDFS user as root
ENV PATH="/usr/bin:/opt/program:${PATH}"
ENV YARN_RESOURCEMANAGER_USER="root"
ENV YARN_NODEMANAGER_USER="root"
ENV HDFS_NAMENODE_USER="root"
ENV HDFS_DATANODE_USER="root"
ENV HDFS_SECONDARYNAMENODE_USER="root"

# Set up the bootstrapping program and Spark configuration
COPY hadoop-config /opt/hadoop-config
COPY nginx-config /opt/nginx-config
COPY aws-config /opt/aws-config
COPY Pipfile Pipfile.lock setup.py *.whl /opt/program/
ENV PIPENV_PIPFILE=/opt/program/Pipfile
# Use the --system flag so packages are installed into the system python
# rather than into a virtualenv, since docker containers do not need virtualenvs.
# pipenv > 2022.4.8 fails to build smspark.
RUN /usr/local/bin/python3.9 -m pip install pipenv==2022.4.8 \
    && pipenv install --system \
    && /usr/local/bin/python3.9 -m pip install /opt/program/*.whl

# Set up the container bootstrapper
COPY container-bootstrap-config /opt/container-bootstrap-config
RUN chmod +x /opt/container-bootstrap-config/bootstrap.sh \
    && /opt/container-bootstrap-config/bootstrap.sh

# With this config, the Spark history server will not run as a daemon; otherwise there
# would be no server process running and the container would terminate immediately.
ENV SPARK_NO_DAEMONIZE TRUE

WORKDIR $SPARK_HOME

ENTRYPOINT ["smspark-submit"]
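For reference, building an image from a Dockerfile like this would typically pass the target region through the REGION build arg, which is substituted into the EMR yum repository URL. A minimal sketch; the tag, file name, and region value are illustrative, not taken from this PR:

# Hypothetical local build; REGION feeds ARG REGION at the top of the Dockerfile.
docker build \
    --build-arg REGION=us-west-2 \
    -t sagemaker-spark-processing:3.2-cpu-py39 \
    -f Dockerfile.cpu .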
New Hadoop core-site.xml:

@@ -0,0 +1,26 @@
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://nn_uri/</value>
    <description>NameNode URI</description>
  </property>
  <property>
    <name>fs.s3a.aws.credentials.provider</name>
    <value>com.amazonaws.auth.DefaultAWSCredentialsProviderChain</value>
    <description>AWS S3 credential provider</description>
  </property>
  <property>
    <name>fs.s3.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
    <description>s3a filesystem implementation</description>
  </property>
  <property>
    <name>fs.AbstractFileSystem.s3a.imp</name>
    <value>org.apache.hadoop.fs.s3a.S3A</value>
    <description>s3a filesystem implementation</description>
  </property>
</configuration>
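With fs.s3.impl mapped to S3AFileSystem and the credential provider set to DefaultAWSCredentialsProviderChain, S3 paths resolve through the s3a connector using whatever credentials the chain finds (environment variables, instance profile, and so on). A minimal sketch; the bucket name and prefix are hypothetical:

# List an S3 prefix through the s3a filesystem configured above.
hadoop fs -ls s3a://example-bucket/input/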
New Hadoop hdfs-site.xml:

@@ -0,0 +1,67 @@
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///opt/amazon/hadoop/hdfs/datanode</value>
    <description>Comma separated list of paths on the local filesystem of a DataNode where it should store its blocks.</description>
  </property>

  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///opt/amazon/hadoop/hdfs/namenode</value>
    <description>Path on the local filesystem where the NameNode stores the namespace and transaction logs persistently.</description>
  </property>

  <!-- Fix for "Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try".
       From https://community.cloudera.com/t5/Support-Questions/Failed-to-replace-a-bad-datanode-on-the-existing-pipeline/td-p/207711
       This issue can be caused by continuous network issues or repeated packet drops. It happens in particular when data is
       being written to one of the DataNodes while that node is pipelining the data to the next DataNode, and a communication
       failure breaks the pipeline. We only see this issue in small regions. -->
  <property>
    <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
    <value>true</value>
    <description>
      If there is a datanode/network failure in the write pipeline,
      DFSClient will try to remove the failed datanode from the pipeline
      and then continue writing with the remaining datanodes. As a result,
      the number of datanodes in the pipeline is decreased. The feature is
      to add new datanodes to the pipeline.

      This is a site-wide property to enable/disable the feature.

      When the cluster size is extremely small, e.g. 3 nodes or less, cluster
      administrators may want to set the policy to NEVER in the default
      configuration file or disable this feature. Otherwise, users may
      experience an unusually high rate of pipeline failures since it is
      impossible to find new datanodes for replacement.

      See also dfs.client.block.write.replace-datanode-on-failure.policy
    </description>
  </property>

  <property>
    <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
    <value>ALWAYS</value>
    <description>
      This property is used only if the value of
      dfs.client.block.write.replace-datanode-on-failure.enable is true.

      ALWAYS: always add a new datanode when an existing datanode is
      removed.

      NEVER: never add a new datanode.

      DEFAULT:
      Let r be the replication number.
      Let n be the number of existing datanodes.
      Add a new datanode only if r is greater than or equal to 3 and either
      (1) floor(r/2) is greater than or equal to n; or
      (2) r is greater than n and the block is hflushed/appended.
    </description>
  </property>
</configuration>
spark/processing/3.2/py3/hadoop-config/spark-defaults.conf (10 additions, 0 deletions):

@@ -0,0 +1,10 @@
spark.driver.extraClassPath /usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar
spark.driver.extraLibraryPath /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
spark.executor.extraClassPath /usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar
spark.executor.extraLibraryPath /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
spark.driver.host=sd_host
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2

# Fix for "Uncaught exception: org.apache.spark.rpc.RpcTimeoutException: Cannot
# receive any reply from 10.0.109.30:35219 in 120 seconds."
spark.rpc.askTimeout=300s
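The same timeout can also be raised for a single job at submit time instead of baking it into spark-defaults.conf. A minimal sketch; the application path is hypothetical:

# Override the RPC ask timeout for one submission only.
spark-submit --conf spark.rpc.askTimeout=300s /opt/program/app.py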
New placeholder config file (intentionally empty apart from comments):

@@ -0,0 +1,3 @@
# EMPTY FILE TO AVOID OVERRIDING ENV VARS
# Specifically, without copying the empty file, SPARK_HISTORY_OPTS will be overridden,
# spark.history.ui.port defaults to 18082, and spark.eventLog.dir defaults to the local fs
New Hadoop yarn-site.xml:

@@ -0,0 +1,34 @@
<?xml version="1.0"?>
<!-- Site specific YARN configuration properties -->
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>rm_hostname</value>
    <description>The hostname of the RM.</description>
  </property>
  <property>
    <name>yarn.nodemanager.hostname</name>
    <value>nm_hostname</value>
    <description>The hostname of the NM.</description>
  </property>
  <property>
    <name>yarn.nodemanager.webapp.address</name>
    <value>nm_webapp_address</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>5</value>
    <description>Ratio between virtual memory to physical memory.</description>
  </property>
  <property>
    <name>yarn.resourcemanager.am.max-attempts</name>
    <value>1</value>
    <description>The maximum number of application attempts.</description>
  </property>
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,YARN_HOME,AWS_CONTAINER_CREDENTIALS_RELATIVE_URI,AWS_REGION</value>
    <description>Environment variable whitelist</description>
  </property>

</configuration>
This Dockerfile is missing a statement similar to this one: https://github.com/aws/sagemaker-spark-container/blob/master/spark/processing/3.1/py3/docker/py39/Dockerfile.cpu#L102. Can we make sure it's added?
Hi Ajay, I was under the impression that you wanted to keep this because the command removes the JndiLookup class from the jar file: log4j has a known vulnerability, so we want to strip the Lookup class out of the jar. See here: https://community.bmc.com/s/article/Log4j-CVE-2021-44228-REMEDIATION-Remove-JndiLookup-class-from-log4j-core-2-jar
But if you look into it, there is another approach: update the log4j version itself, which is piggybacked on the Hive version upgrade. I removed this line on purpose because having to look up the log4j version inside the Hive package every time, just to determine which jar to patch, is a very cumbersome way to do it.
As stated above, there is no need to add this line.
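For context, the removal discussed here is the standard CVE-2021-44228 mitigation of deleting the JndiLookup class from the log4j-core jar. A minimal sketch of that kind of command; the jar path is illustrative and version-dependent, which is exactly the maintenance burden described above:

# Strip the vulnerable JndiLookup class from the log4j-core jar in place.
zip -q -d /usr/lib/hive/lib/log4j-core-*.jar org/apache/logging/log4j/core/lookup/JndiLookup.class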
Got it, thanks for the explanation.