This project lets you create Cloud Optimized GeoTIFFs (COGs) from other raster files, using Apache Spark and AWS Elastic MapReduce (EMR).
In `src/create_cogs.py`, you'll need to define the `get_input_and_output_paths` method to create a list of `(input_uri, output_uri)` tuples that map input images to output paths. Either of these paths can be local or on S3.
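A minimal sketch of what that might look like, assuming the function takes no arguments and that inputs are listed from a hypothetical S3 bucket with boto3 (the real implementation can build the list however it likes):

```python
import os

import boto3  # only needed if you list inputs from S3, as in this sketch


def get_input_and_output_paths():
    """Return a list of (input_uri, output_uri) tuples for the COG job.

    This example lists GeoTIFFs under a hypothetical S3 prefix and writes
    the COGs under a separate prefix; replace with whatever mapping your
    data needs.
    """
    bucket = "my-raster-bucket"   # hypothetical bucket
    input_prefix = "rasters/"     # hypothetical input prefix
    output_prefix = "cogs/"       # hypothetical output prefix

    s3 = boto3.client("s3")
    paths = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=input_prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if not key.lower().endswith(".tif"):
                continue
            input_uri = "s3://{}/{}".format(bucket, key)
            output_uri = "s3://{}/{}{}".format(
                bucket, output_prefix, os.path.basename(key)
            )
            paths.append((input_uri, output_uri))
    return paths
```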
`gdal_cog_commands` is where the commands for creating a COG live. If you want to modify how the COG is made, e.g. to change the compression or resampling options, that's where you should make changes.
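As a rough illustration only (not the repository's actual implementation), a function like this could return, per image, the GDAL commands to run. The flags shown are standard `gdaladdo`/`gdal_translate` options; changing the `-r` resampling method or the `COMPRESS` creation option is the kind of edit described above:

```python
def gdal_cog_commands(local_input, local_output):
    """Illustrative sketch: build GDAL commands that turn a GeoTIFF into a COG.

    Uses the classic two-step recipe: add internal overviews, then make a
    tiled copy that preserves them.
    """
    return [
        # Build overviews; swap 'average' for another resampling method.
        ["gdaladdo", "-r", "average", local_input, "2", "4", "8", "16"],
        # Tiled copy with overviews and compression baked in.
        [
            "gdal_translate", local_input, local_output,
            "-co", "TILED=YES",
            "-co", "COPY_SRC_OVERVIEWS=YES",
            "-co", "COMPRESS=DEFLATE",
        ],
    ]
```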
Use `make` to spin up an EMR cluster using Terraform.
- Terraform 0.11 or later.
- aws-cli
- Set the environment variable `AWS_PROFILE` to your target profile (see the example below).
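For example (the profile name here is just a placeholder):

> export AWS_PROFILE=my-emr-profile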
`terraform/variables.tf` contains the full set of variables that can be specified to modify an EMR deployment. Only the variables without defaults must be specified; these can be found in `tfvars.tpl`. Be sure to make a copy of this template and remove the '.tpl' extension from the filename.
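For example, assuming the template sits in the `terraform/` directory:

> cp terraform/tfvars.tpl terraform/tfvars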
You'll also have to edit `COG_EMR_S3_PREFIX` in the `options.mk` file. This is the S3 location the Python script is uploaded to so that EMR can run it.
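For example, a line along the lines of `COG_EMR_S3_PREFIX = s3://your-bucket/cog-emr` (bucket and prefix are placeholders) tells `make upload-code` where to stage the script.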
Other settings in `options.mk` should also be edited as needed to tune Spark performance.
The Makefile commands you'll generally run are:
> make upload-code
> make create-cluster
> make run
> make proxy
> make terminate-cluster
`make upload-code` will upload the Python script and the `bootstrap.sh` script to the location specified in the Makefile as `COG_EMR_S3_PREFIX`.
`make create-cluster` will create the cluster and use the `bootstrap.sh` that was just uploaded.
`make run` will run the COG creation job.
`make proxy` will create an SSH tunnel, required to access the UIs as described here.
`make terminate-cluster` kills the cluster after you are done with it.
Here is a list of all the commands:
Command | Description |
---|---|
terraform-init | terraform init - Initialize terraform |
terraform-plan | terraform plan - Create the cluster plan. |
validate-cluster | terraform validate - Validate terraform |
create-cluster | Create the EMR cluster (terraform init is run automatically on the first run) |
upload-code | Upload the code so it can be run by EMR. |
run | Runs the pyspark job |
ssh | SSH into a running EMR cluster |
proxy | Creates an SSH tunnel to the EMR cluster, needed for the UIs |
terminate-cluster | Destroy a running EMR cluster |
print-vars | Print out env vars for diagnostic and debug purposes |
Long startup times (15 minutes or more) probably indicate that you have chosen a spot price that is too low.
If you want to see the UIs such as the Resource Manager, which can take you to the Spark UI for running jobs, you'll have to jump through some setup hoops, described here.
This cluster will have a running Zeppelin interface, through which you can run Python and Scala code.
Upload the code with `make upload-code` before you run, or after you make changes to the Python script. Use `make run` to run the COG creation job.
Forgetting to shut down the cluster happens a lot, so make sure to call `make terminate-cluster` to tear it down after use. Alternatively, you can terminate the cluster through the AWS UI.