Merge pull request #1811 from davidmarin/files-attr
added FILES, DIRS, and ARCHIVES attrs (fixes #1431)
David Marin committed Jul 30, 2018
2 parents 94e9f65 + 6abd9e2 commit 6139439
Showing 21 changed files with 868 additions and 26 deletions.
2 changes: 1 addition & 1 deletion LICENSE.txt
@@ -1,4 +1,4 @@
-Copyright 2009-2015 Yelp
+Copyright 2009-2018 Yelp and Contributors

 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
8 changes: 4 additions & 4 deletions docs/guides/setup-cookbook.rst
@@ -22,8 +22,8 @@ EMR, take a look at the :doc:`emr-bootstrap-cookbook`.
 Uploading your source tree
 --------------------------

-.. note:: This relies on a feature that was added in 0.5.8. See below
-   for how to do it in earlier versions.
+.. note:: If you're using mrjob 0.6.4 or later, check out
+   :ref:`uploading-modules-and-packages` first.

 mrjob can automatically tarball your source directory and include it
 in your job's working directory. We can use setup scripts to upload the
@@ -52,8 +52,8 @@ this in your :file:`mrjob.conf`:
 Uploading your source tree as an archive
 ----------------------------------------

-If you're using an earlier version of Python, you'll have to build the
-tarball yourself:
+Prior to mrjob 0.5.8, you had to archive directories yourself before uploading
+them.

 .. code-block:: sh
55 changes: 55 additions & 0 deletions docs/guides/writing-mrjobs.rst
@@ -649,6 +649,61 @@ before and after it by overriding :py:meth:`~mrjob.job.MRJob.pick_protocols`.
Best practice in this case is to put all your input into a single
directory and pass that as your input path.

.. _uploading-modules-and-packages:

Using other Python modules and packages
---------------------------------------

.. versionadded:: 0.6.4

If you want to run Python code outside of the file containing your
:py:class:`~mrjob.job.MRJob`, you'll need to make sure that code gets uploaded to
Hadoop.

The easiest way to do this is by setting the
:py:attr:`~mrjob.job.MRJob.DIRS` attribute in your job. Put the code you
want to import in one or more packages (directories with an
``__init__.py`` file), and point :py:attr:`~mrjob.job.MRJob.DIRS`
at them::

    class MRPackageUsingJob(MRJob):

        DIRS = ['mycode', '../someothercode']

        ...

And then import code from inside a mapper or reducer::

    def mapper(self, key, value):
        from mycode.custom import important_business_logic
        from someothercode import util_function
        ...

(If you want to import code from the top level of your script rather than
inside a method, make sure it's in your ``PYTHONPATH``, just like with
any other code.)

:py:attr:`~mrjob.job.MRJob.DIRS` is relative to the directory your script is
in (not the current working directory). This works inside Hadoop because the
current working directory is the same as the directory your script is in.
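
As a rough illustration of the path rule just described, the sketch below resolves entries against the directory containing the script rather than the current working directory. The helper ``resolve_relative_to_script`` and the ``/home/me/jobs`` layout are hypothetical and use only the standard library; this is not mrjob's internal implementation.

```python
import posixpath


def resolve_relative_to_script(script_path, relative_path):
    """Resolve relative_path against the directory containing script_path.

    Hypothetical helper for illustration only; not part of mrjob.
    """
    script_dir = posixpath.dirname(script_path)
    return posixpath.normpath(posixpath.join(script_dir, relative_path))


# Assumed layout: the job script lives in /home/me/jobs/
in_script_dir = resolve_relative_to_script('/home/me/jobs/mr_job.py', 'mycode')
in_parent_dir = resolve_relative_to_script(
    '/home/me/jobs/mr_job.py', '../someothercode')
print(in_script_dir)
print(in_parent_dir)
```

Under these assumptions the result does not depend on where the job is launched from, which is the point of anchoring paths to the script.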

If you want to access individual Python modules or other support code, you
can use :py:attr:`~mrjob.job.MRJob.FILES` to upload them to your job's working
directory inside Hadoop::

    class MRFileUsingJob(MRJob):

        FILES = ['mymodule.py', '../data/zipcodes.db']

        def mapper(self, key, value):
            from mymodule import open_zipcode_db
            with open_zipcode_db('zipcodes.db') as db:
                ...

For jobs with more complex dependencies (e.g. code that needs to be compiled),
you may need to use the :mrjob-opt:`setup` option. See :doc:`setup-cookbook`
for more information.

.. _writing-cl-opts:

Defining command line options
12 changes: 12 additions & 0 deletions docs/job.rst
@@ -85,6 +85,18 @@ configuration options.
.. automethod:: MRJob.load_args
.. automethod:: MRJob.is_task

.. _uploading-support-files:

Uploading support files
-----------------------

.. autoattribute:: MRJob.FILES
.. autoattribute:: MRJob.DIRS
.. autoattribute:: MRJob.ARCHIVES
.. automethod:: MRJob.files
.. automethod:: MRJob.dirs
.. automethod:: MRJob.archives

.. _job-configuration:

Job runner configuration
2 changes: 1 addition & 1 deletion docs/runners-runner.rst
@@ -14,7 +14,7 @@ Running your job
 .. automethod:: MRJobRunner.cat_output
 .. automethod:: MRJobRunner.stream_output
 .. automethod:: MRJobRunner.cleanup
-.. autodata:: mrjob.runner.CLEANUP_CHOICES
+.. autodata:: mrjob.options.CLEANUP_CHOICES

 Run Information
 ---------------
2 changes: 1 addition & 1 deletion docs/whats-new.rst
@@ -1240,7 +1240,7 @@ delay itself in case a job flow becomes available. Reference:

 The ``JOB`` and ``JOB_FLOW`` cleanup options tell mrjob to clean up the job
 and/or the job flow on failure (including Ctrl+C). See
-:py:data:`~mrjob.runner.CLEANUP_CHOICES` for more information.
+:py:data:`~mrjob.options.CLEANUP_CHOICES` for more information.

 0.3.3
 -----
24 changes: 20 additions & 4 deletions mrjob/examples/mr_most_used_word.py
@@ -1,6 +1,7 @@
 #!/usr/bin/python
 # Copyright 2009-2010 Yelp
 # Copyright 2013 David Marin
+# Copyright 2018 Yelp
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -13,7 +14,11 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-"""Determine the most used word in the input."""
+"""Determine the most used word in the input, ignoring common "stop" words.
+
+Shows how to do a multi-step job, and how to load a support file
+from the same directory.
+"""
 from mrjob.job import MRJob
 from mrjob.protocol import JSONValueProtocol
 from mrjob.step import MRStep
@@ -23,13 +28,20 @@


 class MRMostUsedWord(MRJob):
+    FILES = ['stop_words.txt']

     OUTPUT_PROTOCOL = JSONValueProtocol

+    def mapper_init(self):
+        with open('stop_words.txt') as f:
+            self.stop_words = set(line.strip() for line in f)
+
     def mapper_get_words(self, _, line):
         # yield each word in the line
         for word in WORD_RE.findall(line):
-            yield (word.lower(), 1)
+            word = word.lower()
+            if word not in self.stop_words:
+                yield (word.lower(), 1)

     def combiner_count_words(self, word, counts):
         # sum the words we've seen so far
@@ -44,11 +56,15 @@ def reducer_count_words(self, word, counts):
     def reducer_find_max_word(self, _, word_count_pairs):
         # each item of word_count_pairs is (count, word),
         # so yielding one results in key=counts, value=word
-        yield max(word_count_pairs)
+        try:
+            yield max(word_count_pairs)
+        except ValueError:
+            pass

     def steps(self):
         return [
-            MRStep(mapper=self.mapper_get_words,
+            MRStep(mapper_init=self.mapper_init,
+                   mapper=self.mapper_get_words,
                    combiner=self.combiner_count_words,
                    reducer=self.reducer_count_words),
             MRStep(reducer=self.reducer_find_max_word)
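
The logic this change adds to the example job can be sketched without Hadoop or mrjob. The function below is a plain-Python approximation, not the job itself: `most_used_word` is a hypothetical name, and the `WORD_RE` pattern is an assumption, since the hunks above do not show how it is defined. It filters stop words before counting, and it guards `max()` against an empty sequence, which is the case the new `try`/`except ValueError` handles.

```python
import re
from collections import Counter

# Assumed word pattern; the diff does not show the real WORD_RE definition.
WORD_RE = re.compile(r"[\w']+")


def most_used_word(lines, stop_words):
    """Return (count, word) for the most frequent non-stop word, or None."""
    counts = Counter()
    for line in lines:
        for word in WORD_RE.findall(line):
            word = word.lower()
            if word not in stop_words:
                counts[word] += 1
    try:
        # (count, word) pairs compare by count first, then by word
        return max((count, word) for word, count in counts.items())
    except ValueError:
        # max() raises ValueError on an empty sequence,
        # e.g. when every word in the input is a stop word
        return None


result = most_used_word(["the cat and the hat", "The Cat"], {"the", "and"})
nothing_left = most_used_word(["the and"], {"the", "and"})
```

Without the guard, input consisting only of stop words would make the final step raise instead of producing no output.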
91 changes: 91 additions & 0 deletions mrjob/examples/stop_words.txt
@@ -0,0 +1,91 @@
a
about
also
am
an
and
any
are
as
at
be
but
by
can
com
did
do
does
for
from
had
has
have
he
he'd
he'll
he's
her
here
hers
him
his
i
i'd
i'll
i'm
i've
if
in
into
is
it
it's
its
just
me
mine
my
of
on
or
org
our
ours
she
she'd
she'll
she's
some
than
that
the
their
them
then
there
these
they
they'd
they'll
they're
this
those
to
us
was
we
we'd
we'll
we're
were
what
where
which
who
will
with
would
you
your
yours
6 changes: 5 additions & 1 deletion mrjob/inline.py
@@ -19,12 +19,14 @@
 process. Useful for debugging."""
 import logging
 import os
+import sys

 from mrjob.job import MRJob
 #from mrjob.parse import parse_mr_job_stderr
 from mrjob.sim import SimMRJobRunner
 from mrjob.util import save_current_environment
 from mrjob.util import save_cwd
+from mrjob.util import save_sys_path

 log = logging.getLogger(__name__)

@@ -83,9 +85,11 @@ def _invoke_task_func(self, task_type, step_num, task_num):

         # Don't care about pickleability since this runs in the same process
         def invoke_task(stdin, stdout, stderr, wd, env):
-            with save_current_environment(), save_cwd():
+            with save_current_environment(), save_cwd(), save_sys_path():
                 # pretend we're running the script in the working dir
                 os.environ.update(env)
                 os.chdir(wd)
+                sys.path = [os.getcwd()] + sys.path

                 input_uri = None
                 try:
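
The hunk above relies on `save_sys_path()` to undo the `sys.path` change once the task finishes. Its implementation is not part of this diff, so the following is only a sketch of how such a context manager might look: `restore_sys_path` is a hypothetical stand-in, and `/tmp/fake-working-dir` is a made-up directory name.

```python
import sys
from contextlib import contextmanager


@contextmanager
def restore_sys_path():
    """Restore sys.path to its original contents on exit.

    Hypothetical stand-in for illustration; mrjob's own save_sys_path()
    is not shown in this diff and may differ.
    """
    original = list(sys.path)
    try:
        yield
    finally:
        sys.path = original


before = list(sys.path)
with restore_sys_path():
    # mimic the inline runner: put a task's working directory first
    sys.path = ['/tmp/fake-working-dir'] + sys.path
    inside = sys.path[0]
after = list(sys.path)
```

Because the inline runner executes tasks in the same process, restoring the path afterwards keeps one task's working directory from leaking into later imports.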