
Commit

Merge pull request #1784 from davidmarin/mapper-raw-docs
document mapper_raw() (fixes #1753)
David Marin committed May 24, 2018
2 parents 6c47af1 + c3089c2 commit 7b9718e
Showing 3 changed files with 55 additions and 15 deletions.
2 changes: 2 additions & 0 deletions docs/guides/setup-cookbook.rst
@@ -117,6 +117,8 @@ If you're a :mrjob-opt:`setup` purist, you can also do something like this:
since :command:`true` has no effect and ignores its arguments.

.. _using-a-virtualenv:

Using a virtualenv
------------------

67 changes: 52 additions & 15 deletions docs/guides/writing-mrjobs.rst
@@ -239,18 +239,15 @@ lines containing the string "kitty"::
    if __name__ == '__main__':
        KittyJob().run()

-Step commands are run without a shell. But if you'd like to use shell features
-such as pipes, you can use :py:func:`mrjob.util.bash_wrap()` to wrap your
-command in a call to ``bash``.
+Step commands are run without a shell, so if you want to use pipes, etc., you'll
+need to run them in a subshell. For example:

-::
+.. code-block:: python

-    from mrjob.util import bash_wrap

     class DemoJob(MRJob):

         def mapper_cmd(self):
-            return bash_wrap("grep 'blah blah' | wc -l")
+            return 'sh -c "grep blah | wc -l"'
.. note::

@@ -313,18 +310,17 @@ The output of the job should always be ``0``, since every line that gets to
:py:func:`test_for_kitty()` is filtered by :command:`grep` to have "kitty" in
it.

-Filter commands are run without a shell. But if you'd like to use shell
-features such as pipes, you can use :py:func:`mrjob.util.bash_wrap()` to wrap
-your command in a call to ``bash``. See :ref:`cmd-filters` for an example of
-:py:func:`mrjob.util.bash_wrap()`.
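With :py:func:`bash_wrap` gone, the ``sh -c`` subshell trick shown above works
for filter commands too. Here is a rough sketch using
:py:meth:`~mrjob.job.MRJob.mapper_pre_filter` (the job name and commands are
illustrative, not part of this diff)::

    from mrjob.job import MRJob


    class KittyJob(MRJob):

        def mapper_pre_filter(self):
            # a plain command needs no shell; for pipes you would
            # return something like 'sh -c "grep kitty | wc -l"'
            return 'grep kitty'

        def mapper(self, _, line):
            # every line reaching this point contains "kitty"
            yield None, line


    if __name__ == '__main__':
        KittyJob().run()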

.. _job-protocols:

Protocols
---------

-mrjob assumes that all data is newline-delimited bytes. It automatically
-serializes and deserializes these bytes using :term:`protocols <protocol>`.
+Hadoop streaming assumes that all data is newline-delimited bytes. By default,
+mrjob assumes all output is in JSON format, but it can actually read and write
+lines in any format by using protocols.
+
+(If you need to read non-line-based data, see :ref:`raw-input`, below.)

Each job has an :term:`input protocol`, an :term:`output protocol`, and an
:term:`internal protocol`.
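As a quick illustration of these three hooks (a sketch for orientation, not
part of this diff; ``RawValueProtocol`` and ``JSONProtocol`` are real classes
in :py:mod:`mrjob.protocol`, while the job name is made up)::

    from mrjob.job import MRJob
    from mrjob.protocol import JSONProtocol, RawValueProtocol


    class MyJob(MRJob):

        # read each input line as raw text (no deserialization)
        INPUT_PROTOCOL = RawValueProtocol

        # ship key/value pairs between steps as JSON
        INTERNAL_PROTOCOL = JSONProtocol

        # write final output as JSON
        OUTPUT_PROTOCOL = JSONProtocol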

@@ -550,11 +546,52 @@ You can improve performance significantly by caching the
serialization/deserialization results of keys. Look at the source code of
:py:mod:`mrjob.protocol` for an example.
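Roughly, the key-caching idea looks like this (a hypothetical protocol for
illustration only; mrjob's actual implementation lives in
:py:mod:`mrjob.protocol`)::

    import json


    class CachingJSONProtocol(object):

        def __init__(self):
            self._cached_raw_key = None
            self._cached_key = None

        def read(self, line):
            raw_key, raw_value = line.split(b'\t', 1)
            if raw_key != self._cached_raw_key:
                # reducer input arrives sorted, so the same serialized
                # key repeats on consecutive lines; only decode it when
                # it actually changes
                self._cached_raw_key = raw_key
                self._cached_key = json.loads(raw_key)
            return self._cached_key, json.loads(raw_value)

        def write(self, key, value):
            return b'%s\t%s' % (json.dumps(key).encode('utf_8'),
                                json.dumps(value).encode('utf_8'))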

.. _raw-input:

Passing entire files to the mapper
----------------------------------

Sometimes you need to read binary data (e.g. image files), or text-based
data that has records longer than one line.

By using :py:meth:`~mrjob.job.MRJob.mapper_raw`, you can pass entire files
to your mapper, and read them however you want. Each mapper gets one file,
and is passed both the path of a local copy of the file, and the URI where
the original file is located on Hadoop's filesystem.

For example, if you want to read ``.wet`` files from
`Common Crawl <http://commoncrawl.org/>`__ data, you could handle them like
this::

    class MRCrawler(MRJob):

        def mapper_raw(self, wet_path, wet_uri):
            from warcio.archiveiterator import ArchiveIterator

            with open(wet_path, 'rb') as f:
                for record in ArchiveIterator(f):
                    ...

To use a library like :py:mod:`warcio`, you'll need to ensure that it gets
installed on your cluster. See :ref:`using-a-virtualenv` for one way to do
this.
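For example, a :file:`mrjob.conf` along these lines (a sketch modeled on the
virtualenv cookbook; the runner and commands are assumptions, so adjust for
your cluster) would build a virtualenv with :py:mod:`warcio` on each node::

    runners:
      hadoop:
        setup:
        - virtualenv venv
        - . venv/bin/activate
        - pip install warcio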

Under the hood, mrjob passes an input manifest (a list of
URIs of input files) to Hadoop, and instructs Hadoop to send one line to
each mapper. In most cases, this should be seamless, even to the point of
telling you which file was being read when a task fails.

.. warning::

For all runners except EMR, mrjob uses :command:`hadoop fs` to download
files to the local filesystem, which means Hadoop has to invoke
itself. If your cluster has tightly tuned memory requirements, this can
sometimes cause an out-of-memory error.

.. _non-hadoop-streaming-jar-steps:

Jar steps
-^^^^^^^^^
+---------

You can run Java directly on Hadoop (bypassing Hadoop Streaming) by using
:py:class:`~mrjob.step.JarStep` instead of :py:class:`~mrjob.step.MRStep`.
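A minimal sketch (the jar path and args are hypothetical; ``INPUT`` and
``OUTPUT`` are the placeholder constants from :py:mod:`mrjob.step` that mrjob
swaps for the step's actual input and output paths)::

    from mrjob.job import MRJob
    from mrjob.step import INPUT, OUTPUT, JarStep


    class MRWordCountJar(MRJob):

        def steps(self):
            # run one jar step instead of a streaming step
            return [JarStep(
                jar='my-wordcount.jar',  # hypothetical jar
                args=['wordcount', INPUT, OUTPUT])]


    if __name__ == '__main__':
        MRWordCountJar().run()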
1 change: 1 addition & 0 deletions docs/job.rst
@@ -28,6 +28,7 @@ One-step jobs
.. automethod:: MRJob.mapper_pre_filter
.. automethod:: MRJob.reducer_pre_filter
.. automethod:: MRJob.combiner_pre_filter
.. automethod:: MRJob.mapper_raw
.. automethod:: MRJob.spark

Multi-step jobs
