Commit

Changed main folder to lib. Made an executable and added a setup.py for easy install.
iph committed Jul 6, 2012
1 parent 8176a54 commit 136146c
Showing 20 changed files with 154 additions and 80 deletions.
60 changes: 45 additions & 15 deletions README.md
@@ -3,47 +3,77 @@ EMRio

Elastic MapReduce instance optimizer

EMRio helps you save money on Elastic MapReduce by using your last two
months of usage to estimate how many EC2 reserved instances you should buy
for the next year.

Introduction
------------
Elastic MapReduce is a service provided by Amazon that makes it easy to use MapReduce. EMR run on machines called EC2 instances. They come in many different flavors from heavy memory usage to heavy CPU usage. When businesses start using EMR, they use these services as a pay-as-you-go service. After some time, the amount of instances you use can become stable. If you utilize enough instances over time, it might make sense to switch from the pay-as-you-go service, or On-Demand service, to a pay-upfront service, or Reserved Instances service.
Elastic MapReduce is a service provided by Amazon that makes it easy to use
MapReduce. EMR runs on machines called EC2 instances. They come in many
different flavors, from heavy memory usage to heavy CPU usage. When businesses
start using EMR, they use these services as a pay-as-you-go service. After
some time, the number of instances you use can become stable. If you utilize
enough instances over time, it might make sense to switch from the
pay-as-you-go service, or On-Demand service, to a pay-upfront service, or
Reserved Instances service.

You can read how Reserved Instances work
[here](http://aws.amazon.com/ec2/reserved-instances/). If you think that
switching to reserved instances is a good plan, but don't know how many to
buy, that's what EMRio is for!

How Reserved Instances work can be read [here](http://aws.amazon.com/ec2/reserved-instances/). If you think that switching to reserved instances is a good plan, but don't know how many to buy, that's what EMRio is for!
How It Works
------------
EMRio first looks at your EMR history. That data has a two month limit. It then acts as if the job flow was reoccurring for a year. It has to estimate a year's worth of data for Reserved Instances to be worth the cost. It then simulates different configurations using the job flow history and will produce the best pool of instances to buy.
EMRio first looks at your EMR history. That data has a two-month limit. It
then acts as if the job flows were recurring for a year. It needs a year's
worth of estimated usage to judge whether Reserved Instances are worth the
cost. It then simulates different configurations using the job flow history
and produces the best pool of instances to buy.
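
The extrapolation step described above can be sketched roughly as follows. This is a minimal illustration and not EMRio's actual code: `estimate_yearly_hours` and its parameters are hypothetical names, and the sketch assumes the observed usage rate simply repeats unchanged for 365 days.

```python
from math import ceil


def estimate_yearly_hours(hours_used, interval_days, days_in_year=365):
    """Scale instance-hours observed over `interval_days` to a full year.

    Hypothetical helper: assumes the observed usage rate simply repeats.
    """
    if interval_days <= 0:
        raise ValueError("interval_days must be positive")
    # Round up so a partially used hour still counts as a billed hour.
    return int(ceil(hours_used * days_in_year / interval_days))


# 1,000 instance-hours observed over a 60-day window.
print(estimate_yearly_hours(1000, 60))
```

Rounding up is a design choice made here on the assumption that partial hours are billed as whole hours; the real tool may treat this differently.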

Dependencies
------------
-boto
-tzinfo
-matplotlib
* boto
* tzinfo
* matplotlib
How to Run EMRio
----------------
Once you have the dependencies installed, you need to set up your boto configuration file. Look at our boto config as an example. Once you fill in the AWS key information and region information, copy it to either /etc/boto.conf or ~/.boto
Once you have the dependencies installed, you need to set up your boto
configuration file. Look at our boto config as an example. Once you fill in
the AWS key information and region information, copy it to either
/etc/boto.conf or ~/.boto.

After that is setup, cd into emrio and run:
After that is set up, `cd` into `emrio` and run:

python EMRio.py

This should take a minute or two to grab the information off S3, do a few simulations, and output the resultant optimized instance pool.
This should take a minute or two to grab the information off S3, do a few
simulations, and output the resultant optimized instance pool.

If you want to see instance usage over time (how many instances are running at the same time), you run::
To see instance usage over time (how many instances are running at the same
time), run:

python EMRio.py --graph cost

After it calculates the same data, you will now see graphs of each instance-type's usage over time, like this::
After it calculates the same data, you will now see graphs of each instance
type's usage over time, like this:

IMAGE HERE

Now, re-calculating the optimal instances is kind of pointless on the same data, so in order to save and load optimal instance configurations, use this:
Now, re-calculating the optimal instances is kind of pointless on the same
data, so in order to save and load optimal instance configurations, use this:

python EMRio.py --save-optimized=output.txt
python EMRio.py --cache=output.txt

If you want to see how this is formatted, check out the tests folder where an example instance file can be found.
If you want to see how this is formatted, check out the tests folder where
an example instance file can be found.

This saves the results in output.txt. To load them later, run:

python EMRio.py --optimized=output.txt

If you want to see all the commands, try --help.
If you want to see all the commands, try `--help`.

python EMRio.py --help


3 changes: 1 addition & 2 deletions boto_example.config
@@ -8,9 +8,8 @@ emr_region_name = us-west-1
emr_region_endpoint = us-west-1.elasticmapreduce.amazonaws.com
ec2_region_endpoint = us-west-1.ec2.amazonaws.com

## Here are some examples of other regions tat you can use other than us-west-1
## Here are some examples of other regions that you can use other than us-west-1
## The list of possible regions are currently (June 28th 2012):
## us-east-1 (US EAST)
## us-west-1 (US WEST NORTH CALIFORNIA)
## us-west-2 (US WEST OREGON)
## eu-west-1 (EU IRELAND)
5 changes: 5 additions & 0 deletions emrio
@@ -0,0 +1,5 @@
#!/usr/bin/python
import emrio_lib
import sys
if __name__ == '__main__':
    emrio_lib.EMRio.main(sys.argv[1:])
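
Whether `emrio_lib.EMRio.main` resolves after a bare `import emrio_lib` depends on what `emrio_lib/__init__.py` imports, which this diff does not show. The sketch below builds a throwaway package (`demo_pkg` and `tool` are made-up names, not part of EMRio) to illustrate the general Python behavior: with an empty `__init__.py`, a submodule only becomes an attribute of the package once it is imported explicitly.

```python
import importlib
import os
import sys
import tempfile


def submodule_visible_after_plain_import():
    """Build a throwaway package and check when its submodule is reachable."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg_dir = os.path.join(tmp, "demo_pkg")
        os.mkdir(pkg_dir)
        # Empty __init__.py: the package itself imports nothing.
        open(os.path.join(pkg_dir, "__init__.py"), "w").close()
        with open(os.path.join(pkg_dir, "tool.py"), "w") as handle:
            handle.write("def main(args):\n    return list(args)\n")

        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            import demo_pkg
            before = hasattr(demo_pkg, "tool")   # plain package import only
            from demo_pkg import tool            # explicit submodule import
            after = hasattr(demo_pkg, "tool")
            return before, after, tool.main(["--help"])
        finally:
            # Leave the interpreter state as we found it.
            sys.path.remove(tmp)
            sys.modules.pop("demo_pkg", None)
            sys.modules.pop("demo_pkg.tool", None)
```

If `emrio_lib/__init__.py` turns out to be empty, writing `from emrio_lib import EMRio` in the launcher would be one way to sidestep the problem.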
12 changes: 6 additions & 6 deletions emrio/EMRio.py → emrio_lib/EMRio.py
@@ -16,6 +16,7 @@
import boto

from config import EC2
from ec2_cost import instance_types_in_pool
from graph_jobs import instance_usage_graph
from graph_jobs import total_hours_graph
from job_handler import get_job_flows, load_job_flows_from_amazon
@@ -55,9 +56,8 @@ def main(args):


def make_option_parser():
usage = '%prog [options]'
description = 'Print a giant report on EMR usage.'
option_parser = OptionParser(usage=usage, description=description)
option_parser = OptionParser(description=description)
option_parser.add_option(
'-v', '--verbose', dest='verbose', default=False, action='store_true',
help='print more messages to stderr')
@@ -84,7 +84,7 @@ def make_option_parser():
'starts before this day, it is discarded (e.g.: --max-days 2012/05/07)')
)
option_parser.add_option(
'-f', '--file', dest='file_inputs', type='string', default=None,
'--file', dest='file_inputs', type='string', default=None,
help="Input a file that has job flows JSON encoded. The format is 1 job "
"per line or comma separated jobs."
)
@@ -93,7 +93,7 @@ def make_option_parser():
help=("Uses a previously saved optimized pool instead of calculating it from"
" the job flows"))
option_parser.add_option(
'--save_optimized', dest='save', type='string', default=None,
'--cache', dest='save', type='string', default=None,
help='Save the optimized results so you dont calculate them multiple times')
option_parser.add_option(
'-g', '--graph', dest='graph', type='string', default='None',
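
To show how the renamed flag behaves, here is a cut-down sketch of the parser with only the two pool-related options. It is not the full `make_option_parser` from the commit: `--cache` and `dest='save'` come from the hunk above and `--optimized` from the README, but `dest='optimized_file'` is an assumed name because the diff does not show it.

```python
from optparse import OptionParser


def make_option_parser():
    """Build a parser with the two pool-related flags only."""
    parser = OptionParser(description='Print a giant report on EMR usage.')
    # --cache replaces --save_optimized but still stores into options.save.
    parser.add_option(
        '--cache', dest='save', type='string', default=None,
        help="Save the optimized results so you don't recalculate them")
    # dest='optimized_file' is an assumption; the diff does not show it.
    parser.add_option(
        '--optimized', dest='optimized_file', type='string', default=None,
        help='Use a previously saved optimized pool')
    return parser


options, leftover = make_option_parser().parse_args(['--cache=output.txt'])
print(options.save)
```

Because only the flag name changed and `dest` stayed `save`, code that reads `options.save` should not need to change.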
@@ -298,8 +298,8 @@ def output_statistics(log, pool, demand_log,):
owned_reserved_instances = get_owned_reserved_instances()
buy_instances = calculate_instances_to_buy(owned_reserved_instances, pool)

all_instances = EC2.instance_types_in_pool(pool)
all_instances.union(EC2.instance_types_in_pool(owned_reserved_instances))
all_instances = instance_types_in_pool(pool)
all_instances.union(instance_types_in_pool(owned_reserved_instances))

print "%20s %15s %15s %15s" % ('', 'Optimal', 'Owned', 'To Purchase')
for utilization_class in EC2.RESERVE_PRIORITIES:
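
One thing to watch in this hunk: `set.union()` returns a new set and leaves the receiver untouched, so a bare `all_instances.union(...)` statement discards the merged result. The sketch below uses made-up instance type sets to show the difference from an in-place merge.

```python
optimal_types = {'m1.small', 'm1.large'}
owned_types = {'c1.medium'}

merged = set(optimal_types)
merged.union(owned_types)        # result is discarded; merged is unchanged
after_union = set(merged)

merged.update(owned_types)       # in-place merge; `merged |= owned_types` also works
after_update = set(merged)

print(sorted(after_union))
print(sorted(after_update))
```

If the intent is for the report to include instance types that are owned but not in the optimal pool, `update` (or assigning the result of `union`) would be needed.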
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
88 changes: 44 additions & 44 deletions emrio/ec2_cost.py → emrio_lib/ec2_cost.py
@@ -21,6 +21,7 @@
import copy
from collections import defaultdict


class EC2Info(object):
"""This class is used to store EC2 info like costs from the config
file. All the functions in it use that config to build pools or
@@ -140,23 +141,6 @@ def init_reserve_costs(self, init_value):
reserve_costs[utilization_class] = init_value
return reserve_costs

@staticmethod
def instance_types_in_pool(pool):
"""Gets the set of all instance types in
a pool or log
Args:
pool: Instances currently owned for each utilization_classization type.
Returns:
A set of all the instances used for all utilization_classization types.
"""
instance_types = set()
for utilization_class in pool:
for instance_type in pool[utilization_class]:
instance_types.add(instance_type)
return instance_types

def is_reserve_type(self, instance_type):
"""This just returns if a utilization_classization type is
a reserve instance. If not, it is probably DEMAND type.
@@ -196,33 +180,49 @@ def color_scheme(self):
green = int(green + increment)
return colors

@staticmethod
def fill_instance_types(job_flows, pool):
"""Use this function to fill the instance pool
with all the instance types used in the job flows.
example: if the job_flows has m1.small, and m1.large
and we had 2 utils of LIGHT_UTIL and HEAVY_UTIL, the
resultant pool from the function will be:
pool = {
LIGHT_UTIL: {
'm1.small': 0, 'm1.large': 0
}
HEAVY_UTIL: {
'm1.small': 0, 'm1.large': 0
}

def fill_instance_types(job_flows, pool):
"""Use this function to fill the instance pool
with all the instance types used in the job flows.
example: if the job_flows has m1.small, and m1.large
and we had 2 utils of LIGHT_UTIL and HEAVY_UTIL, the
resultant pool from the function will be:
pool = {
LIGHT_UTIL: {
'm1.small': 0, 'm1.large': 0
}
Args:
pool: A dict of utilization level dictionaries with nothing in them.
HEAVY_UTIL: {
'm1.small': 0, 'm1.large': 0
}
}
Args:
pool: A dict of utilization level dictionaries with nothing in them.
Mutates:
pool: for each utilization type, it fills in all the instance_types
that any job uses.
"""
for job in job_flows:
for instance in job.get('instancegroups'):
instance_type = instance.get('instancetype')
for utilization_class in pool.keys():
pool[utilization_class][instance_type] = pool[utilization_class][instance_type]
Mutates:
pool: for each utilization type, it fills in all the instance_types
that any job uses.
"""
for job in job_flows:
for instance in job.get('instancegroups'):
instance_type = instance.get('instancetype')
for utilization_class in pool.keys():
pool[utilization_class][instance_type] = pool[utilization_class][instance_type]


def instance_types_in_pool(pool):
"""Gets the set of all instance types in
a pool or log
Args:
pool: Instances currently owned for each utilization_classization type.
Returns:
A set of all the instances used for all utilization_classization types.
"""
instance_types = set()
for utilization_class in pool:
for instance_type in pool[utilization_class]:
instance_types.add(instance_type)
return instance_types
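
For readability, here is a standalone sketch of the two helpers that this hunk moves to module level. It follows the logic shown in the diff with one deliberate change: the original self-assignment `pool[u][t] = pool[u][t]` only creates the key when the inner dictionaries are `defaultdict`s (an assumption about the callers), so the sketch uses `setdefault`, which also works with plain dicts. The job flow shape (`instancegroups`, `instancetype`) is taken from the diff.

```python
def fill_instance_types(job_flows, pool):
    """Add every instance type used by job_flows to each utilization class.

    Mutates pool in place; existing counts are left untouched.
    """
    for job in job_flows:
        for group in job.get('instancegroups', []):
            instance_type = group.get('instancetype')
            for utilization_class in pool:
                # setdefault works for plain dicts as well as defaultdicts.
                pool[utilization_class].setdefault(instance_type, 0)


def instance_types_in_pool(pool):
    """Return the set of instance types across all utilization classes."""
    instance_types = set()
    for utilization_class in pool:
        instance_types.update(pool[utilization_class])
    return instance_types


job_flows = [{'instancegroups': [{'instancetype': 'm1.small'},
                                 {'instancetype': 'm1.large'}]}]
pool = {'LIGHT_UTIL': {}, 'HEAVY_UTIL': {'m1.small': 3}}
fill_instance_types(job_flows, pool)
print(pool)
print(sorted(instance_types_in_pool(pool)))
```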
3 changes: 2 additions & 1 deletion emrio/graph_jobs.py → emrio_lib/graph_jobs.py
@@ -10,6 +10,7 @@
import matplotlib.pyplot as plt

from config import EC2
from ec2_cost import instance_types_in_pool
from simulate_jobs import Simulator, SimulationObserver

COLORS = EC2.color_scheme()
@@ -73,7 +74,7 @@ def graph_over_time(info_over_time,
if end_time.hour != 0:
end_time = end_time.replace(hour=0, day=(end_time.day + 1))

for instance_type in EC2.instance_types_in_pool(info_over_time):
for instance_type in instance_types_in_pool(info_over_time):
# Locators / Formatters to pretty up the graph.
hours = mdates.HourLocator(byhour=None, interval=1)
days = mdates.DayLocator(bymonthday=None, interval=1)
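
A note on the context lines at the top of this hunk: `end_time.replace(hour=0, day=(end_time.day + 1))` raises `ValueError` when `end_time` falls on the last day of a month, because `replace` does not roll over into the next month. A possible alternative, sketched here with a hypothetical helper name, adds a `timedelta` first and then resets the hour.

```python
import datetime


def round_up_to_next_day(end_time):
    """Move end_time to hour 0 of the following day unless its hour is already 0.

    Like the original expression, only the hour is reset; minutes and
    seconds are left as they were.
    """
    if end_time.hour != 0:
        next_day = end_time + datetime.timedelta(days=1)
        end_time = next_day.replace(hour=0)
    return end_time


# July 31st rolls over into August instead of raising ValueError.
print(round_up_to_next_day(datetime.datetime(2012, 7, 31, 13, 30)))
```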
File renamed without changes.
6 changes: 4 additions & 2 deletions emrio/optimizer.py → emrio_lib/optimizer.py
@@ -5,6 +5,8 @@
import logging
from math import ceil

from ec2_cost import instance_types_in_pool
from ec2_cost import fill_instance_types
from simulate_jobs import Simulator


@@ -32,8 +34,8 @@ def run(self, pre_existing_pool=None):

# Zero-ing the instances just makes it so the optimized pool
# knows all the instance_types the job flows use beforehand.
self.EC2.fill_instance_types(self.job_flows, optimized_pool)
for instance in self.EC2.instance_types_in_pool(optimized_pool):
fill_instance_types(self.job_flows, optimized_pool)
for instance in instance_types_in_pool(optimized_pool):
logging.debug("Finding optimal instances for %s", instance)
self.optimize_reserve_pool(instance, optimized_pool)
return optimized_pool
File renamed without changes.
37 changes: 37 additions & 0 deletions setup.py
@@ -0,0 +1,37 @@
import os
from setuptools import setup
setuptools_kwargs = {
    'install_requires': [
        'boto>=2.2.0',
        'PyYAML',
        'simplejson>=2.0.9',
    ],
    'provides': ['emrio'],
    'tests_require': ['unittest2'],
}


# Utility function to read the README file.
# Used for the long_description. It's nice, because now 1) we have a top level
# README file and 2) it's easier to type in the README file than to put a raw
# string in below ...
def read(fname):
    return open(os.path.join(os.path.dirname(__file__), fname)).read()

setup(
    name="emrio",
    version="0.0.1",
    author="Sean Myers",
    author_email="SeanMyers0608@gmail.com",
    description=("EMR instance optimizer will take your past EMR history and "
                 "attempt to optimize the max reserved instances for it"),
    license="Apache?",
    keywords="EMRio EMR Instance Optimizer Reserved Instances",
    url="http://github.com/Yelp/EMRio",
    packages=['emrio_lib', 'tests'],
    long_description=read('README.md'),
    classifiers=[
        "Development Status :: 3 - Alpha",
        "Topic :: Utilities",
    ],
)
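
As written, `setuptools_kwargs` is defined but never passed to `setup()`, so the dependency list has no effect on installation. One possible way to wire it in is sketched below. `build_setup_kwargs` is a hypothetical helper, and the `scripts` entry is an assumption that the new top-level `emrio` launcher is meant to be installed; the commit itself does not do either.

```python
def build_setup_kwargs():
    """Assemble keyword arguments for setuptools.setup()."""
    kwargs = {
        'name': 'emrio',
        'version': '0.0.1',
        'packages': ['emrio_lib', 'tests'],
        # Assumption: install the top-level `emrio` launcher as a script.
        'scripts': ['emrio'],
    }
    # Merge the dependency metadata so setup() actually receives it.
    kwargs.update({
        'install_requires': ['boto>=2.2.0', 'PyYAML', 'simplejson>=2.0.9'],
        'provides': ['emrio'],
        'tests_require': ['unittest2'],
    })
    return kwargs


# In setup.py this would be used as: setup(**build_setup_kwargs())
print(sorted(build_setup_kwargs()))
```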
2 changes: 1 addition & 1 deletion tests/test_ec2_cost.py
@@ -5,7 +5,7 @@
import unittest
from collections import defaultdict

from emrio.ec2_cost import EC2Info
from emrio_lib.ec2_cost import EC2Info
from test_prices import HEAVY_UTIL, MEDIUM_UTIL, LIGHT_UTIL, DEMAND
from test_prices import COST, RESERVE_PRIORITIES

4 changes: 2 additions & 2 deletions tests/test_emrio.py
@@ -1,7 +1,7 @@
"""Tests for the main EMRio module are here."""
import unittest
from emrio.ec2_cost import EC2Info
from emrio.EMRio import read_optimal_instances
from emrio_lib.ec2_cost import EC2Info
from emrio_lib.EMRio import read_optimal_instances
from test_prices import COST, RESERVE_PRIORITIES

EC2 = EC2Info(COST, RESERVE_PRIORITIES)
4 changes: 2 additions & 2 deletions tests/test_instance_predictor.py
@@ -3,8 +3,8 @@
import datetime
from unittest import TestCase

from emrio.ec2_cost import EC2Info
from emrio.simulate_jobs import Simulator
from emrio_lib.ec2_cost import EC2Info
from emrio_lib.simulate_jobs import Simulator
from test_prices import COST, HEAVY_UTIL, MEDIUM_UTIL, LIGHT_UTIL, RESERVE_PRIORITIES
from test_prices import DEMAND

6 changes: 3 additions & 3 deletions tests/test_job_handler.py
@@ -4,10 +4,10 @@

import pytz
# Setup a mock EC2 since west coast can be changed in the future.
from emrio.job_handler import no_date_filter, range_date_filter
from emrio.ec2_cost import EC2Info
from emrio_lib.job_handler import no_date_filter, range_date_filter
from emrio_lib.ec2_cost import EC2Info
from test_prices import COST, RESERVE_PRIORITIES
from emrio.config import TIMEZONE
from emrio_lib.config import TIMEZONE

EC2 = EC2Info(COST, RESERVE_PRIORITIES)

4 changes: 2 additions & 2 deletions tests/test_optimize.py
@@ -4,8 +4,8 @@
import copy
from math import ceil

from emrio.optimizer import Optimizer, convert_to_yearly_estimated_hours
from emrio import ec2_cost
from emrio_lib.optimizer import Optimizer, convert_to_yearly_estimated_hours
from emrio_lib import ec2_cost
from test_prices import *

EC2 = ec2_cost.EC2Info(COST, RESERVE_PRIORITIES)