
How to set custom EMR classification #193

Closed
OElesin opened this issue Apr 19, 2020 · 7 comments

OElesin commented Apr 19, 2020

Is your idea related to a problem? Please describe.
I have already made use of the library and it was super helpful. I tried setting custom EMR classifications in order to use EMR 6.0.0, but I could not find a way to set a custom classification. Is this currently possible, or does it have to be a feature request?
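
For context, an EMR "classification" is just a configuration block passed at cluster creation. A minimal sketch of what the request amounts to when calling boto3 directly, outside of awswrangler (the cluster name, instance sizes, and property values below are illustrative only):

import boto3

emr = boto3.client("emr")

# Illustrative only: the kind of Configurations block this issue asks to
# pass through wr.emr.create_cluster as a custom classification.
response = emr.run_job_flow(
    Name="my-demo-cluster",  # hypothetical name
    ReleaseLabel="emr-6.0.0",
    Instances={
        "Ec2SubnetId": "SUBNET_ID",
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1}
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    Configurations=[
        {
            "Classification": "livy-conf",  # any EMR classification
            "Properties": {"livy.server.session.timeout": "16h"},
        }
    ],
    Applications=[{"Name": "Spark"}, {"Name": "Livy"}],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])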

OElesin added the enhancement (New feature or request) label on Apr 19, 2020
igorborgest (Contributor) commented

Hi @OElesin! Thanks for reaching out; this is a really relevant topic.

Currently there is no support for the "container-executor" and "docker" classifications, nor for custom classifications in general.

But we will definitely address all of them.

igorborgest added the major release and minor release (Will be addressed in the next minor release) labels and then removed the major release label on Apr 20, 2020
igorborgest added this to the 1.1.0 milestone on Apr 20, 2020
igorborgest added a commit that referenced this issue Apr 25, 2020
igorborgest added the WIP (Work in progress) label on Apr 25, 2020
igorborgest (Contributor) commented

Hi @OElesin!

I just added support for Docker and Custom Classification.

Docker example:

import awswrangler as wr

# Create an EMR cluster with Spark-on-Docker enabled and add a step that
# loads ECR credentials so YARN can pull the image.
cluster_id = wr.emr.create_cluster(
    subnet_id="SUBNET_ID",
    spark_docker=True,
    spark_docker_image="{ACCOUNT_ID}.dkr.ecr.{REGION}.amazonaws.com/{IMAGE_NAME}:{TAG}",
    ecr_credentials_step=True,
)
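
As a side note, the {ACCOUNT_ID} and {REGION} placeholders in spark_docker_image can be resolved at runtime; a small sketch with boto3 (the image name and tag are hypothetical):

import boto3

# Resolve the current account id and region to build the ECR image URI.
# "my-spark-image:latest" is a placeholder, not a real repository.
account_id = boto3.client("sts").get_caller_identity()["Account"]
region = boto3.session.Session().region_name
spark_docker_image = f"{account_id}.dkr.ecr.{region}.amazonaws.com/my-spark-image:latest"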

Custom Classification example:

# Pass an arbitrary EMR configuration classification (here, Livy settings)
# straight through to the cluster configuration.
cluster_id = wr.emr.create_cluster(
    subnet_id="SUBNET_ID",
    custom_classifications=[
        {
            "Classification": "livy-conf",
            "Properties": {
                "livy.spark.master": "yarn",
                "livy.spark.deploy-mode": "cluster",
                "livy.server.session.timeout": "16h",
            },
        }
    ],
)
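
If Docker also needs registry settings, the same hook could presumably carry the container-executor classification described in the EMR documentation, assuming custom_classifications accepts nested Configurations blocks the same way the EMR API does. A sketch (the registry values are illustrative, not from this thread):

# Sketch only: mark ECR and local registries as trusted for YARN's Docker
# runtime, following the EMR docs' container-executor classification.
cluster_id = wr.emr.create_cluster(
    subnet_id="SUBNET_ID",
    custom_classifications=[
        {
            "Classification": "container-executor",
            "Configurations": [
                {
                    "Classification": "docker",
                    "Properties": {
                        "docker.trusted.registries": "local,centos,{ACCOUNT_ID}.dkr.ecr.{REGION}.amazonaws.com",
                        "docker.privileged-containers.registries": "local,centos,{ACCOUNT_ID}.dkr.ecr.{REGION}.amazonaws.com",
                    },
                }
            ],
        }
    ],
)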

I also created two new tutorials about it.

To install the related branch:
pip install git+https://github.com/awslabs/aws-data-wrangler.git@emr-6

Could you please test it and give us some feedback?

OElesin (Author) commented Apr 25, 2020

This is excellent! I will give this a try.

Is there a plan to add this to the master branch?

igorborgest added a commit that referenced this issue Apr 25, 2020
igorborgest (Contributor) commented Apr 25, 2020

@OElesin The plan is to release these features in version 1.1.0 next Friday!

It would be really nice if you could help us with some feedback. Thanks!

OElesin (Author) commented Apr 26, 2020

@igorborgest, thanks for this. I tested it under the following conditions:

  • Using your example, it worked, but it only started a cluster with a master instance.
  • Tested with a master instance and core instances; see below:
cluster_id = wr.emr.create_cluster(
    cluster_name="my-demo-cluster-v2",
    logging_s3_path="s3://my-logs-bucket/emr-logs/",
    emr_release="emr-6.0.0",
    subnet_id="SUBNET_ID",
    emr_ec2_role="EMR_EC2_DefaultRole",
    emr_role="EMR_DefaultRole",
    instance_type_master="m5.2xlarge",
    instance_type_core="m5.2xlarge",
    instance_ebs_size_master=50,
    instance_ebs_size_core=50,
    instance_num_on_demand_master=0,
    instance_num_on_demand_core=0,
    instance_num_spot_master=1,
    instance_num_spot_core=2,
    spot_bid_percentage_of_on_demand_master=50,
    spot_bid_percentage_of_on_demand_core=50,
    spot_provisioning_timeout_master=5,
    spot_provisioning_timeout_core=5,
    spot_timeout_to_on_demand_master=False,
    spot_timeout_to_on_demand_core=False,
    python3=True,
    ecr_credentials_step=True,
    spark_docker=True,
    spark_docker_image=DOCKER_IMAGE,
    spark_glue_catalog=True,
    hive_glue_catalog=True,
    presto_glue_catalog=True,
    debugging=True,
    applications=["Hadoop", "Spark", "Hive", "Zeppelin", "Livy"],
    visible_to_all_users=True,
    maximize_resource_allocation=True,
    keep_cluster_alive_when_no_steps=True,
    termination_protected=False,
    spark_pyarrow=True
)

Error message:

/bin/bash: docker: command not found
Command exiting with ret '127'

igorborgest (Contributor) commented Apr 26, 2020

Hi @OElesin, thanks a lot for the quick response!

You are right. I just figured out that EMR does not have Docker installed on the master node, only on the core nodes.
Because of that, we will not be able to refresh the ECR credentials programmatically without an external file on S3.

I revisited the implementation and the tutorial; the expected usage is now:

import awswrangler as wr

# Create the cluster with Docker support enabled.
cluster_id = wr.emr.create_cluster(subnet, docker=True)

# Submit a step that keeps the ECR credentials refreshed through a file on S3.
wr.emr.submit_ecr_credentials_refresh(cluster_id, path="s3://bucket/emr/")

# Submit a Spark step that runs app.py inside the Docker image.
wr.emr.submit_spark_step(
    cluster_id,
    "s3://bucket/app.py",
    docker_image=DOCKER_IMAGE
)

What do you think?

P.S. The custom_classifications usage stays the same.
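
For completeness, one way to monitor the submitted step afterwards is to poll plain boto3, assuming submit_spark_step returns the EMR step id (this polling loop is just a sketch, not part of the proposed API):

import time

import boto3

emr = boto3.client("emr")

# Assuming submit_spark_step returns the step id, poll until the step finishes.
step_id = wr.emr.submit_spark_step(
    cluster_id,
    "s3://bucket/app.py",
    docker_image=DOCKER_IMAGE
)

while True:
    state = emr.describe_step(ClusterId=cluster_id, StepId=step_id)["Step"]["Status"]["State"]
    if state in ("COMPLETED", "FAILED", "CANCELLED", "INTERRUPTED"):
        break
    time.sleep(30)

print(state)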

igorborgest (Contributor) commented

Available in version 1.1.0.

igorborgest removed the WIP (Work in progress) label on May 5, 2020