Skip to content

aws-samples/aws-systems-manager-create-emr-edge-node

Edge node with RStudio

Data scientists who are comfortable working with RStudio and want to use packages like sparklyr to work with Spark through R need an edge node on a Hadoop cluster. Edge nodes offer good performance compared to pulling data remotely via Livy and also offer convenient access to local Spark and Hive shells.

However, edge nodes are not easy to deploy. Edge nodes must have the same versions of Hadoop, Spark, Java, and other tools as the Hadoop cluster, and require the same Hadoop configuration as nodes in the cluster. In order to reduce the effort to deploy edge nodes, this module automatically deploys and edge node configured with a recent version of RStudio.

Workflow

We use a Systems Manager automation document to perform the following steps automatically:

  • Create an AMI from the master node of the EMR cluster.
  • Launch a new EC2 instance using that AMI.
  • Install the SSM agent on the new instance.
  • Install Ansible on the new instance.
  • Run an Ansible playbook to install RStudio.
  • Add a CloudWatch alarm to recover the node in case of a host failure.

Deployment

First we run this module to create the SSM document and some associated security settings. Start by reviewing the variables in vars.tf and set or override any settings you need.

tf init # one time
tf apply

Now that the document is created, you can execute it through the SSM console. You'll need to provide the instance ID of the EMR master node, a subnet to place the edge node, and a name for the node.

Using the edge node

RStudio is installed with a default user name of ruser. You must set the password by logging into the edge node itself and updating the Linux password for ruser.

Access is over port 8787. You'll need to open an SSH tunnel using Session Manager.

aws ssm start-session \
  --target instance-id \ # Get this from the output of the automation document
  --document-name AWS-StartPortForwardingSession \
  --parameters '{"portNumber":["8787"], "localPortNumber":["8787"]}'

Then you can access RStudio at http://localhost:8787.

License

Copyright 2020 Amazon.com, Inc. or its affiliates. All Rights Reserved. SPDX-License-Identifier: MIT-0

About

No description, website, or topics provided.

Resources

License

Code of conduct

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages