Installation script and instructions for setting up Tessera environment on Amazon Elastic MapReduce

Tessera Environment on Amazon EMR

Note: These scripts are experimental. We would appreciate users testing them and providing feedback or fixes.

Prerequisites

  • An Amazon AWS account with properly set up security groups and policies
  • An S3 bucket
  • AWS Command Line Interface

If you don't have these prerequisites, the sections below explain how to set them up.

Installation

You can install the scripts by cloning the GitHub repository.

git clone https://github.com/tesseradata/install-emr
cd install-emr

Usage

If you have the prerequisites and a bash shell, you can simply call tessera-emr.sh as follows:

./tessera-emr.sh -s <s3 bucket>

To see more options (number of workers, instance types, etc.):

./tessera-emr.sh -h

This script does the following:

  • Syncs the custom Tessera bootstrap scripts to a "scripts" folder in your S3 bucket
  • Creates a security group to allow RStudio Server to be served over port 80 (by default open to just your IP address)
  • Launches the EMR cluster and installs and configures all Tessera components
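Before launching, it can help to confirm that the AWS CLI is actually on your PATH. The following is a minimal sketch, not part of this repository: missing_commands and launch_if_ready are hypothetical helper names, and the only call taken from this README is ./tessera-emr.sh -s.

```shell
# Sketch: print the name of every given command that is not found on PATH.
# missing_commands is a hypothetical helper, not part of install-emr.
missing_commands() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}

# Sketch: launch the cluster only when the AWS CLI is available.
# $1 is the name of your S3 bucket.
launch_if_ready() {
  missing="$(missing_commands aws)"
  if [ -n "$missing" ]; then
    echo "Missing prerequisites: $missing" >&2
    return 1
  fi
  ./tessera-emr.sh -s "$1"
}
```

Run launch_if_ready <s3 bucket> from the install-emr directory. This only checks that the aws command exists; it does not verify your credentials or security policies.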

Once your cluster is up and running, if you need to install additional R packages on the nodes, there are some helper scripts for this:

# CRAN package
./install-package.sh <cluster id> <s3 bucket> rvest
# github package
./install-package-gh.sh <cluster id> <s3 bucket> bokeh/rbokeh
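If you need several CRAN packages, one option is to loop over the helper. This is a sketch that assumes install-package.sh takes one package per call, as in the example above; install_cran_packages and the INSTALLER variable are hypothetical names introduced here for illustration.

```shell
# Sketch: call install-package.sh once per package.
# INSTALLER is a hypothetical override so the helper path can be changed.
INSTALLER="${INSTALLER:-./install-package.sh}"

# Usage: install_cran_packages <cluster id> <s3 bucket> pkg1 [pkg2 ...]
install_cran_packages() {
  cluster_id="$1"
  bucket="$2"
  shift 2
  for pkg in "$@"; do
    # Stop at the first package that fails to install.
    "$INSTALLER" "$cluster_id" "$bucket" "$pkg" || return 1
  done
}
```

For example, install_cran_packages <cluster id> <s3 bucket> rvest dplyr would invoke the helper twice, once for each package.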

If you want finer control, take a look at tessera-emr.sh and modify the aws emr create-cluster command for your needs.

Please note that you are responsible for terminating any instances you have started when you are done, and for monitoring your resource usage. Familiarize yourself with the following resources and check them frequently.

  • AWS Console -> EMR (direct link) - you can view running EMR clusters and terminate them here
  • AWS Console -> EC2 -> Instances (direct link) - you can view running instances and terminate them here
  • AWS Console -> Menu Bar -> (username dropdown) -> Billing and Cost Management - you can view your account balance here
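The same checks can also be made from the command line. The sketch below wraps two AWS CLI commands, aws emr list-clusters and aws emr terminate-clusters; the function names are hypothetical, and the --query expression is just one way to shorten the output.

```shell
# Sketch: show the ID, name, and state of every cluster that is still active.
list_active_clusters() {
  aws emr list-clusters --active \
    --query 'Clusters[].[Id,Name,Status.State]' --output text
}

# Sketch: terminate one cluster by its ID.
terminate_cluster() {
  aws emr terminate-clusters --cluster-ids "$1"
}
```

Running list_active_clusters after you finish a session is a quick way to confirm nothing is still running. The console pages listed above remain the place to check billing.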

Set up an AWS account

If you don't already have an AWS account, go to http://aws.amazon.com and click the button that says "Create a Free Account". If you have logged in to the system before, the button will instead say something like "Sign in to the Console".

You can sign in with an existing amazon.com account or create a new one.

Set up account credentials

  • Sign in to the AWS management console
  • Click on "Identity and Access Management"
  • Click on "Users" and then click the "Create New Users" button and create your user
  • After you have created the user, click the "Download Credentials" button - this will give you a file, credentials.csv, with your user's key and secret key that will be used when we configure the AWS Command Line Interface
  • Click on "Groups" and click the "Create New Group" button
  • Call the group what you'd like, e.g. "tessera"
  • Attach the following two policies to the group: AmazonDynamoDBFullAccess and AmazonElasticMapReduceFullAccess (DynamoDB access is only required if you are going to use EMRFS with the -e option)
  • Now click "Groups" and click on the entry of the group you just created
  • Click the "Add Users to Group" button and select your user

Get an EC2 key pair

  • Sign in to the AWS management console
  • Click on "EC2"
  • Click on "Key Pairs" under "Network & Security"
  • Click the "Create Key Pair" button
  • Name it what you'd like, e.g. "tessera-emr"
  • A file with that name and a .pem extension will be downloaded
  • Keep track of the name of this file, as it will be the -k argument to tessera-emr.sh
  • You can put this file where you'd like but treat it with care (don't share with anyone or put it anywhere where others can get it)
  • You can put it in the emr-3.2.1 directory of this repo if you'd like (but don't check it in to git)
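One common way to treat the key file with care is to make it readable only by you, which ssh generally requires before it will use a private key. A minimal sketch, where protect_key is a hypothetical helper name:

```shell
# Sketch: restrict a downloaded .pem file to owner read-only (mode 400).
# Usage: protect_key path/to/tessera-emr.pem
protect_key() {
  chmod 400 "$1"
}
```

After this, only your user account can read the file, and even you cannot modify it without changing the permissions back.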

Set up an S3 bucket

We will use this to store the EMR startup scripts and you can also use it to store your HDFS data.

  • Sign in to the AWS management console
  • Click "S3"
  • Click the "Create Bucket" button and go through the steps
  • Enable logging for the bucket with the default prefix "logs/"
  • Make sure you make note of the Region you choose
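If you already have the AWS CLI installed and configured (covered in the following sections), the bucket can also be created from the command line with aws s3 mb. This is a sketch only: create_bucket is a hypothetical helper name, and it does not enable logging, which you would still set up separately.

```shell
# Sketch: create an S3 bucket in a specific region.
# Usage: create_bucket <bucket name> <region code>
create_bucket() {
  aws s3 mb "s3://$1" --region "$2"
}
```

Bucket names are shared across all AWS accounts, so the command fails if the name is already taken by someone else.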

Get the AWS command line interface

The AWS CLI uses Python so make sure you have that installed.

Instructions for how to install the AWS CLI can be found here.

Configure AWS CLI

Follow the instructions here to configure the AWS CLI.

Some notes:

  • Use the credentials.csv file you downloaded when you created the user to get your key and secret key
  • If you don't have this file, follow this guide.
  • To see the possibilities for "region", look at the codes here - it is a good idea to choose the same one as your s3 bucket
  • You can choose the default value for "output" - it doesn't matter which you choose
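Once aws configure has been run, you can check that the configured region matches the region of your bucket. The sketch below uses aws configure get region; region_matches is a hypothetical helper name, and the region code you pass in is whatever you noted when creating the bucket.

```shell
# Sketch: succeed only if the CLI's configured region equals the expected one.
# Usage: region_matches <region code of your S3 bucket>
region_matches() {
  configured="$(aws configure get region)"
  [ "$configured" = "$1" ]
}
```

For example, region_matches us-west-2 || echo "region differs from bucket" prints a warning when the two do not agree. Running aws configure list shows the full set of values the CLI is currently using.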

You should now be ready to run tessera-emr.sh as outlined at the beginning of this README.

Notes

  • m1.large or larger instance types must be used. Smaller instance types have caused issues where Hadoop is unable to start.
  • Each time a cluster is started, a new security group named TesseraEMR-xxxxxx is created. Periodically you may want to check your security groups and clean out old groups with this prefix.
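The cleanup of old security groups can be sketched with the AWS CLI as follows. The function names are hypothetical; the commands used are aws ec2 describe-security-groups with a JMESPath filter on the TesseraEMR- prefix, and aws ec2 delete-security-group.

```shell
# Sketch: list the ID and name of every security group whose name
# starts with the TesseraEMR- prefix.
list_tessera_groups() {
  aws ec2 describe-security-groups \
    --query "SecurityGroups[?starts_with(GroupName, 'TesseraEMR-')].[GroupId,GroupName]" \
    --output text
}

# Sketch: delete one security group by its ID.
delete_group() {
  aws ec2 delete-security-group --group-id "$1"
}
```

Review the output of list_tessera_groups before deleting anything, and only delete groups that belong to clusters you have already terminated.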