Merge remote-tracking branch 'upstream/master'
alejandro-rivera committed Jul 13, 2015
2 parents 1e0a2b1 + 0b91e6c commit bed31b3
Showing 47 changed files with 1,228 additions and 1,952 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -14,3 +14,6 @@
/mrjob.egg-info
/docs/_build
.tox
.coverage
*,cover
j-*
29 changes: 28 additions & 1 deletion CHANGES.txt
@@ -1,4 +1,32 @@
v0.5.0, 2015-??-?? -- ???
* requires boto 2.35.0 or newer (#980)
* jobs:
* mr() no longer takes positional arguments (#814)
* removed jar() (use mrjob.step.JarStep)
* removed testing methods parse_counters() and parse_output()
* mrjob.step:
* JarStep only takes "args" and "main_class" keyword args
* removed MRJobStep (use MRStep)
* runners:
* All runners:
* removed IS_SUCCESSFUL cleanup option (use ALL)
* EMR:
* default AWS region is us-west-2 (#1025)
* aws_region is no longer inferred from s3_scratch_uri
* mrjob works correctly across AWS regions:
* connect to each S3 bucket on appropriate endpoint (#1028)
* create/select temp bucket in same region as EMR jobs (#687)
* added iam_endpoint option (#1067)
* removed s3_conn args from methods in EMRJobRunner and S3Filesystem
* removed iam_job_flow_role option (use iam_instance_profile)
* removed support for _$folder$ keys, which EMR no longer creates
* removed mrjob.compat.get_jobconf_value() (use jobconf_from_env())
* mrjob.util:
* renamed buffer_iterator_to_line_iterator() to to_lines()
* to_lines() no longer appends a newline to data (#819)
* removed extract_dir_for_tar()
* gunzip_stream() now yields chunks, not lines
* removed hash_object()

v0.4.5, 2015-??-?? -- IAM Fortified
* runners:
@@ -29,7 +57,6 @@ v0.4.3, 2015-04-08 -- SO many bugfixes
* --check-input-paths and --no-check-input-paths options (#864)
* skip (very slow) validation of s3 buckets if boto < 2.25.0 (#865)
* Fix for max_hours_idle bug that was terminating job flows early (#932)
* Job flows are visible to all IAM users by default (#922)
* --emr-api-param allows users to pass additional parameters to boto's
EMR API (#879)
* unset parameters with --no-emr-api-param
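
The v0.5.0 notes above say that `to_lines()` no longer appends a newline to data and that `gunzip_stream()` now yields chunks rather than lines. A minimal sketch of what chunk-to-line conversion with that behavior might look like follows. `lines_from_chunks` is a hypothetical stand-in written for illustration; it is not mrjob's implementation and makes no claim about how `mrjob.util.to_lines()` is written internally.

```python
def lines_from_chunks(chunks):
    """Yield newline-terminated lines from an iterable of byte chunks.

    Hypothetical helper for illustration only (not part of mrjob).
    """
    leftover = b''
    for chunk in chunks:
        data = leftover + chunk
        # Split on b'\n' only; the last piece has no newline yet.
        pieces = data.split(b'\n')
        leftover = pieces.pop()
        for piece in pieces:
            yield piece + b'\n'
    if leftover:
        # Final partial line is emitted as-is: no newline is appended.
        yield leftover


if __name__ == '__main__':
    # Lines may straddle chunk boundaries.
    print(list(lines_from_chunks([b'foo\nba', b'r\nb', b'az'])))
```

Callers that relied on every line ending in a newline would need to handle the last line themselves under this behavior.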
4 changes: 1 addition & 3 deletions docs/guides/configs-all-runners.rst
@@ -42,7 +42,7 @@ options related to file uploading.
:default: (automatic)

Should we automatically tar up the mrjob library and install it when we run
job? By default, we do unless :mrjob:`interpreter` is set.
job? By default, we do unless :mrjob-opt:`interpreter` is set.

Set this to ``False`` if you've already installed ``mrjob`` on your
Hadoop cluster or install it by some other method.
@@ -126,8 +126,6 @@ Temp files and cleanup
* ``'JOB'``: stop job if on EMR and the job is not done when cleanup runs
* ``'JOB_FLOW'``: terminate the job flow if on EMR and the job is not done
on cleanup
* ``'IF_SUCCESSFUL'`` (deprecated): same as ``ALL``. Not supported for
``cleanup_on_failure``.

In the config file::

4 changes: 2 additions & 2 deletions docs/guides/configs-hadoopy-runners.rst
@@ -68,7 +68,7 @@ Options available to hadoop and emr runners
:set: all
:default: script's module name, or ``no_script``

Description of this job to use as the part of its name.
Alternate label for the job

.. mrjob-opt::
:config: owner
@@ -77,7 +77,7 @@ Options available to hadoop and emr runners
:set: all
:default: :py:func:`getpass.getuser`, or ``no_user`` if that fails

Who is running this job. Used solely to set the job name.
Who is running this job (if different from the current user)

.. mrjob-opt::
:config: partitioner
2 changes: 1 addition & 1 deletion docs/guides/emr-bootstrap-cookbook.rst
@@ -151,7 +151,7 @@ into your newly upgraded version of Python. If you use other
bootstrap commands to install/upgrade Python libraries, you should also
run them *after* upgrading Python.

When to use bootsrap, and when to use setup
When to use bootstrap, and when to use setup
-------------------------------------------

You can use :mrjob-opt:`bootstrap` and :mrjob-opt:`setup` together.
53 changes: 35 additions & 18 deletions docs/guides/emr-opts.rst
@@ -48,8 +48,6 @@ about setting these options.
supposed to be secret! Use the environment variable
:envvar:`AWS_SECURITY_TOKEN` instead.

This option requires boto >= 2.5.0.

.. mrjob-opt::
:config: ec2_key_pair
:switch: --ec2-key-pair
@@ -115,11 +113,11 @@ Job flow creation and configuration
:switch: --aws-region
:type: :ref:`string <data-type-string>`
:set: emr
:default: infer from scrach bucket region
:default: ``'us-west-2'``

region to connect to S3 and EMR on (e.g. ``us-west-1``). If you want to
use separate regions for S3 and EMR, set :mrjob-opt:`emr_endpoint` and
:mrjob-opt:`s3_endpoint`.
region to run EMR jobs on (e.g. ``us-west-1``). Also used by mrjob
to create temporary buckets if you don't set :mrjob-opt:`s3_scratch_uri`
explicitly.

.. mrjob-opt::
:config: emr_api_params
@@ -169,9 +167,11 @@ Job flow creation and configuration
:set: emr
:default: infer from :mrjob-opt:`aws_region`

optional host to connect to when communicating with S3 (e.g.
Force mrjob to connect to EMR on this endpoint (e.g.
``us-west-1.elasticmapreduce.amazonaws.com``).

Mostly exists as a workaround for network issues.

.. mrjob-opt::
:config: emr_tags
:switch: --emr-tag
@@ -216,6 +216,18 @@ Job flow creation and configuration
the EMR instance, rather than to a local file or one on S3. Rarely
necessary to set this by hand.

.. mrjob-opt::
:config: iam_endpoint
:switch: --iam-endpoint
:type: :ref:`string <data-type-string>`
:set: emr
:default: (automatic)

Force mrjob to connect to IAM on this endpoint (e.g.
``iam.us-gov.amazonaws.com``).

Mostly exists as a workaround for network issues.

.. mrjob-opt::
:config: iam_instance_profile
:switch: --iam-instance-profile
@@ -640,23 +652,23 @@ Choosing/creating a job flow to join

S3 paths and options
--------------------
MRJob uses boto to manipulate/access S3. Older versions of boto prior to 2.25.0
would enumerate all keys in a bucket by default to validate existence, slowing
down MRJob and inflating costs. 2.25.0 and above use a HEAD request to validate
a bucket.

MRJob will validate a bucket using the constant in mrjob.utils.VALIDATE_BUCKET,
which is set to True if boto.Version >= '2.25.0'
MRJob uses boto to manipulate/access S3.

.. mrjob-opt::
:config: s3_endpoint
:switch: --s3-endpoint
:type: :ref:`string <data-type-string>`
:set: emr
:default: infer from :mrjob-opt:`aws_region`
:default: (automatic)

Host to connect to when communicating with S3 (e.g.
``s3-us-west-1.amazonaws.com``).
Force mrjob to connect to S3 on this endpoint, rather than letting it
choose the appropriate endpoint for each S3 bucket.

Mostly exists as a workaround for network issues.

.. warning:: If you set this to a region-specific endpoint
(e.g. ``'s3-us-west-1.amazonaws.com'``) mrjob will not
be able to access buckets located in other regions.

.. mrjob-opt::
:config: s3_log_uri
@@ -675,11 +687,16 @@ which is set to True if boto.Version >= '2.25.0'
:switch: --s3-scratch-uri
:type: :ref:`string <data-type-string>`
:set: emr
:default: ``tmp/mrjob`` in the first bucket belonging to you
:default: (automatic)

S3 directory (URI ending in ``/``) to use as scratch space, e.g.
``s3://yourbucket/tmp/``.

By default, mrjob looks for a bucket belonging to you whose name starts with
``mrjob-`` and which matches :mrjob-opt:`aws_region`. If it can't find
one, it creates one with a random name. This option is then set to ``tmp/``
in this bucket (e.g. ``s3://mrjob-01234567890abcdef/tmp/``).

.. mrjob-opt::
:config: s3_sync_wait_time
:switch: --s3-sync-wait-time
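
The new `s3_scratch_uri` default described in this file (find a bucket you own whose name starts with `mrjob-` in the region given by `aws_region`, otherwise make up a random name, then use `tmp/` inside it) can be sketched roughly as below. Everything here is an assumption for illustration: `choose_scratch_uri` is a hypothetical function, the `buckets` dict stands in for whatever the S3 API reports, the tie-break between several matching buckets and the random-name format are guesses, and no bucket is actually created.

```python
import binascii
import os


def choose_scratch_uri(buckets, aws_region):
    """Return a scratch URI following the documented default.

    buckets: hypothetical dict mapping bucket name -> region name.
    This sketch only computes a URI; it does not talk to S3.
    """
    # Prefer an existing mrjob- bucket in the same region as the EMR jobs.
    # Sorting is an arbitrary tie-break chosen for this sketch.
    for name in sorted(buckets):
        if name.startswith('mrjob-') and buckets[name] == aws_region:
            return 's3://%s/tmp/' % name

    # Otherwise invent a random bucket name (format assumed, not mrjob's).
    suffix = binascii.hexlify(os.urandom(8)).decode('ascii')
    return 's3://mrjob-%s/tmp/' % suffix


if __name__ == '__main__':
    owned = {'mrjob-abc': 'us-west-2', 'logs': 'us-west-2'}
    print(choose_scratch_uri(owned, 'us-west-2'))
    print(choose_scratch_uri(owned, 'eu-west-1'))
```

Keeping the temp bucket in the same region as the EMR jobs is the point of the change referenced as #687 in CHANGES.txt.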
3 changes: 0 additions & 3 deletions docs/job.rst
@@ -34,7 +34,6 @@ Multi-step jobs

.. automethod:: MRJob.steps
.. automethod:: MRJob.mr
.. automethod:: MRJob.jar

Running the job
---------------
@@ -124,5 +123,3 @@ Hooks for testing
.. currentmodule:: mrjob.job

.. automethod:: MRJob.sandbox
.. automethod:: MRJob.parse_output
.. automethod:: MRJob.parse_counters
1 change: 0 additions & 1 deletion docs/runners-emr.rst
@@ -26,5 +26,4 @@ S3 Utilities
.. automethod:: S3Filesystem.make_s3_conn
.. automethod:: S3Filesystem.get_s3_key
.. automethod:: S3Filesystem.get_s3_keys
.. automethod:: S3Filesystem.get_s3_folder_keys
.. automethod:: S3Filesystem.make_s3_key
12 changes: 8 additions & 4 deletions mrjob/aws.py
@@ -158,7 +158,7 @@
_S3_REGIONLESS_ENDPOINT = 's3.amazonaws.com'

# us-east-1 doesn't have its own endpoint or need bucket location constraints
_S3_REGIONS_WITH_NO_LOCATION_CONSTRAINT = ['us-east-1']
_S3_REGION_WITH_NO_LOCATION_CONSTRAINT = 'us-east-1'


# "EU" is an alias for the eu-west-1 region
@@ -198,10 +198,14 @@ def emr_ssl_host_for_region(region):


def s3_endpoint_for_region(region):
"""Get the host for S3 in the given AWS region."""
"""Get the host for S3 in the given AWS region.
This will accept ``''`` for region as well, so it's fine to
use location constraint in place of region.
"""
region = _fix_region(region)

if not region or region in _S3_REGIONS_WITH_NO_LOCATION_CONSTRAINT:
if not region or region == _S3_REGION_WITH_NO_LOCATION_CONSTRAINT:
return _S3_REGIONLESS_ENDPOINT
else:
return _S3_REGION_ENDPOINT % {'region': region}
@@ -212,7 +216,7 @@ def s3_location_constraint_for_region(region):
services can connect to it in the given region."""
region = _fix_region(region)

if not region or region in _S3_REGIONS_WITH_NO_LOCATION_CONSTRAINT:
if not region or region == _S3_REGION_WITH_NO_LOCATION_CONSTRAINT:
return ''
else:
return region
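
For reference, the two functions touched in `mrjob/aws.py` can be exercised with a self-contained sketch like the one below. The constants `'s3.amazonaws.com'` and `'us-east-1'` come from the diff. The endpoint format string and the `fix_region` helper are assumptions: the format is inferred from the `s3-us-west-1.amazonaws.com` example in the docs, and the alias handling from the "EU" comment in the hunk context. The real `_S3_REGION_ENDPOINT` and `_fix_region` are not shown in this commit.

```python
S3_REGIONLESS_ENDPOINT = 's3.amazonaws.com'
S3_REGION_WITH_NO_LOCATION_CONSTRAINT = 'us-east-1'

# Assumed format, inferred from the docs example; not shown in the diff.
S3_REGION_ENDPOINT = 's3-%(region)s.amazonaws.com'


def fix_region(region):
    """Hypothetical stand-in for _fix_region(): normalize None and 'EU'."""
    region = (region or '').lower()
    return 'eu-west-1' if region == 'eu' else region


def s3_endpoint_for_region(region):
    """Get the S3 host for a region (or a location constraint, incl. '')."""
    region = fix_region(region)
    if not region or region == S3_REGION_WITH_NO_LOCATION_CONSTRAINT:
        return S3_REGIONLESS_ENDPOINT
    return S3_REGION_ENDPOINT % {'region': region}


def s3_location_constraint_for_region(region):
    """Get the location constraint to use when creating a bucket."""
    region = fix_region(region)
    if not region or region == S3_REGION_WITH_NO_LOCATION_CONSTRAINT:
        return ''
    return region


if __name__ == '__main__':
    for r in ('', 'us-east-1', 'us-west-1', 'EU'):
        print(repr(r), s3_endpoint_for_region(r),
              repr(s3_location_constraint_for_region(r)))
```

Because `''` and `'us-east-1'` map to the same endpoint, a bucket's location constraint can be passed where a region is expected, which is what the new docstring points out.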
3 changes: 0 additions & 3 deletions mrjob/compat.py
@@ -594,9 +594,6 @@ def jobconf_from_env(variable, default=None):

return default

# old, deprecated name for get_jobconf_value().
get_jobconf_value = jobconf_from_env


def jobconf_from_dict(jobconf, name, default=None):
"""Get the value of a jobconf variable from the given dictionary.
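
With the `get_jobconf_value` alias removed, callers switch to `jobconf_from_env(variable, default=None)`, whose signature appears in the hunk header above. The sketch below illustrates only the basic idea of an environment lookup, assuming Hadoop Streaming's convention of exporting jobconf variables with dots replaced by underscores. `jobconf_from_env_sketch` and its `environ` parameter are hypothetical, and the real function may do more than this (its body is not shown in this commit).

```python
import os


def jobconf_from_env_sketch(variable, default=None, environ=None):
    """Look up a jobconf variable in the environment.

    Hypothetical simplification for illustration; not mrjob's code.
    environ is injectable so the lookup can be tested without
    touching the real process environment.
    """
    environ = os.environ if environ is None else environ
    # Hadoop Streaming exports e.g. mapred.job.id as mapred_job_id.
    name = variable.replace('.', '_')
    return environ.get(name, default)


if __name__ == '__main__':
    fake_env = {'mapred_job_id': 'job_201507130000_0001'}
    print(jobconf_from_env_sketch('mapred.job.id', environ=fake_env))
    print(jobconf_from_env_sketch('mapred.task.id', 'unknown', fake_env))
```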