Run `cd infrastructure/data` and then `./download.sh`.
This project is configured in a way described in this post.
To use it:
- Make a copy of `infrastructure/config/default.deployment.template.yaml` to `infrastructure/config/default.deployment.yaml` and populate it with the corresponding values
- Create a file `infrastructure/config/aws-profile.txt` and store the name of the AWS profile to use (as named in `~/.aws/config`)
This is described in more detail in this post.
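As a rough sketch, the two configuration steps above could be scripted as follows. `init_config` is a hypothetical helper and not part of the repository; it assumes the first argument is the repository root and the second is a profile name that already exists in `~/.aws/config`.

```shell
# Hypothetical helper (not part of the repository): prepare the two
# local config files described above.
init_config() {
  local root="$1" profile="$2"
  local cfg="$root/infrastructure/config"
  # Copy the template only if no deployment config exists yet,
  # so values that were already filled in are not overwritten.
  if [ ! -f "$cfg/default.deployment.yaml" ]; then
    cp "$cfg/default.deployment.template.yaml" "$cfg/default.deployment.yaml" || return 1
  fi
  # Store the profile name exactly as it appears in ~/.aws/config.
  printf '%s\n' "$profile" > "$cfg/aws-profile.txt"
}
```

Calling `init_config . my-profile` from the repository root would create both files; `default.deployment.yaml` still has to be populated by hand.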
- Clone the spark repository
git clone git@github.com:apache/spark.git
- Check out a specific version
git checkout v3.1.1
- Create the ECR repositories `spark-operator/spark` and `spark-operator/spark-py`
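The repositories can be created in the AWS console or with the AWS CLI. The sketch below only prints the `aws ecr create-repository` commands for private repositories so they can be reviewed first; `ecr_create_commands` is a hypothetical helper, and the region passed to it is whatever region you deploy to.

```shell
# Hypothetical helper: print (do not run) the AWS CLI commands that would
# create the given private ECR repositories in the given region.
ecr_create_commands() {
  local region="$1"
  shift
  local repo
  for repo in "$@"; do
    printf 'aws ecr create-repository --repository-name %s --region %s\n' "$repo" "$region"
  done
}
```

For example, `ecr_create_commands eu-west-1 spark-operator/spark` prints one command, which can be piped to `sh` once it looks right. Public repositories are created with `aws ecr-public create-repository` in `us-east-1` instead.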
Log in to ECR with Docker. For a private repository:
aws ecr get-login-password --region <region> | docker login --username AWS --password-stdin <account id>.dkr.ecr.<region>.amazonaws.com
For a public repository:
aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws
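To avoid retyping the placeholders in the private login command, the registry hostname can be built from the account id and region. This is a minimal sketch; `ecr_registry` and `ecr_login_private` are hypothetical helper names, and the AWS CLI and Docker are assumed to be installed and configured.

```shell
# Hypothetical helper: build the private registry hostname
# <account id>.dkr.ecr.<region>.amazonaws.com from its two parts.
ecr_registry() {
  printf '%s.dkr.ecr.%s.amazonaws.com\n' "$1" "$2"
}

# Hypothetical helper: log Docker in to the private registry,
# using the same pipeline as the command shown above.
ecr_login_private() {
  local account_id="$1" region="$2"
  aws ecr get-login-password --region "$region" |
    docker login --username AWS --password-stdin "$(ecr_registry "$account_id" "$region")"
}
```

`ecr_login_private 123456789012 eu-west-1` would then log in to `123456789012.dkr.ecr.eu-west-1.amazonaws.com` (both values are placeholders).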
Build Spark with a specific Hadoop version
./build/mvn -Pkubernetes -Dhadoop.version=3.3.1 -DskipTests clean package
Build and tag the docker image
./bin/docker-image-tool.sh -r public.ecr.aws/z2m5w4m3/spark-operator -t v3.1.1-hadoop3.3.1 -p ./resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/python/Dockerfile build
Push the image
./bin/docker-image-tool.sh -r public.ecr.aws/z2m5w4m3/spark-operator -t v3.1.1-hadoop3.3.1 push
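The build, tag and push commands above share the same Spark and Hadoop versions, so they can be parameterised. The following is only a sketch: `image_tag` and `build_and_push` are hypothetical helpers, and `build_and_push` is assumed to be run from the root of the Spark checkout after logging in to ECR.

```shell
# Hypothetical helper: compose the image tag from the Spark and Hadoop
# versions, following the v<spark>-hadoop<hadoop> pattern used above.
image_tag() {
  printf 'v%s-hadoop%s\n' "$1" "$2"
}

# Hypothetical helper: build Spark, then build and push the images.
# Arguments: repository prefix, Spark version, Hadoop version.
build_and_push() {
  local repo="$1" spark_version="$2" hadoop_version="$3"
  local tag
  tag="$(image_tag "$spark_version" "$hadoop_version")"
  ./build/mvn -Pkubernetes -Dhadoop.version="$hadoop_version" -DskipTests clean package || return 1
  ./bin/docker-image-tool.sh -r "$repo" -t "$tag" \
    -p ./resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/python/Dockerfile \
    build || return 1
  ./bin/docker-image-tool.sh -r "$repo" -t "$tag" push
}
```

With the values from this README, the call would be `build_and_push public.ecr.aws/z2m5w4m3/spark-operator 3.1.1 3.3.1`.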
The `infrastructure/` directory contains an AWS CDK project written in TypeScript. You will need to set up your tooling (npm, TypeScript, etc.) and install packages with `npm install`.
Then run
./cdk.sh deploy
At the moment, I cannot deploy everything in one go (adding dependencies between constructs is not enough), so it needs to be done in two steps:
- Deploy first with all the switches in `DeployJobs` in `infrastructure/config/default.yaml` set to `'false'`
- Then deploy a second time with the job you wish to deploy switched on
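The two steps can be wrapped in a small script so the second deploy is not forgotten. This is a sketch under assumptions: `deploy_in_two_steps` is a hypothetical helper, it is assumed to be run from the directory that contains `cdk.sh`, and the config file still has to be edited by hand between the two deploys. The `CDK_CMD` variable is my own addition for dry runs.

```shell
# Hypothetical wrapper around the two-step deployment described above.
# CDK_CMD defaults to ./cdk.sh; overriding it is only meant for dry runs.
deploy_in_two_steps() {
  local cdk_cmd="${CDK_CMD:-./cdk.sh}"
  local reply
  echo "Step 1: deploying with every DeployJobs switch set to 'false'"
  "$cdk_cmd" deploy || return 1
  echo "Switch on the job to deploy in infrastructure/config/default.yaml"
  # Wait for confirmation only when attached to a terminal.
  if [ -t 0 ]; then
    printf 'Press Enter when the config is updated: '
    read -r reply
  fi
  echo "Step 2: deploying the selected job"
  "$cdk_cmd" deploy
}
```

Running `CDK_CMD=echo deploy_in_two_steps` shows the sequence without deploying anything.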
- Your Hadoop version will dictate what version of hadoop-aws you will use (it will be the same)
- This in turn will dictate what version of aws-java-sdk-bundle you need to use
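As an illustration of the version chain, the sketch below derives the `hadoop-aws` Maven coordinate from the Hadoop version and builds the Maven Central URL of the matching `hadoop-project` POM. Both function names are hypothetical, and it is an assumption on my part that the SDK bundle version is declared in that POM as the `aws-java-sdk.version` property, so check the file itself before relying on it.

```shell
# Hypothetical helper: hadoop-aws uses the same version as Hadoop itself.
hadoop_aws_coordinate() {
  printf 'org.apache.hadoop:hadoop-aws:%s\n' "$1"
}

# Hypothetical helper: URL of the hadoop-project POM on Maven Central for
# a given Hadoop version (standard Maven repository layout).
hadoop_project_pom_url() {
  printf 'https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-project/%s/hadoop-project-%s.pom\n' "$1" "$1"
}
```

For Hadoop 3.3.1, `curl -s "$(hadoop_project_pom_url 3.3.1)" | grep aws-java-sdk` should show the bundle version to pair with `org.apache.hadoop:hadoop-aws:3.3.1`, provided the property assumption holds.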