
Glossary

.. glossary::

   blob
        An arbitrary file stored in :ref:`DDFS`.

        See also :ref:`blobs`.

   client
        The program which submits a :term:`job` to the :term:`master`.

   data locality
        Performing computation over a set of data near where the data
        is located.  Disco preserves *data locality* whenever
        possible, since transferring data over a network can be
        prohibitively expensive when operating on massive amounts of
        data.

        See `locality of reference <http://en.wikipedia.org/wiki/Locality_of_reference>`_.

   DDFS
        See :ref:`DDFS`.

   Erlang
        See `Erlang <http://en.wikipedia.org/wiki/Erlang_(programming_language)>`_.

   garbage collection (GC)
        DDFS is a tag-based filesystem, so a given blob can be
        addressed via multiple tags.  A blob can therefore only be
        deleted once the last reference to it is deleted.  DDFS uses
        a garbage collection procedure to detect and delete such
        unreferenced data.

   grouping
        A :term:`grouping` operation is performed on the inputs to a
        :term:`stage <stage>`; each resulting group becomes the input
        to a single :term:`task <task>` in that stage.  A grouping
        operation is what connects two adjacent stages in a Disco
        :ref:`pipeline <pipeline>` together.

        The available grouping operations are
        :term:`split`, :term:`group_node`, :term:`group_label`,
        :term:`group_node_label`, and :term:`group_all`.

   group_all
        A :term:`grouping` operation that groups all the inputs to a
        :term:`stage` into a single group, regardless of the labels
        and nodes of the inputs.

        This grouping is typically used to define :term:`reduce`
        stages that contain a single reduce task.

   group_label
        A :term:`grouping` operation that groups all the inputs with
        the same :term:`label` into a single group, regardless of the
        nodes the inputs reside on.  Thus, the number of tasks that
        run in a :term:`group_label` stage is controlled by the number
        of :term:`labels <label>` generated by the tasks in the
        previous stage.

        This grouping is typically used to define :term:`reduce`
        stages that contain a reduce task for each :term:`label`.

   group_node
        A :term:`grouping` operation that groups all the inputs on the
        same node into a single group, regardless of the labels of the
        inputs.  Thus, the number of tasks that run in a
        :term:`group_node` stage depends on the number of distinct
        cluster nodes on which the tasks in the previous stage that
        actually generated output were executed.

        This grouping can be used to condense the intermediate data
        generated on a cluster node by the tasks in a stage, in order
        to reduce the potential network resources used to transfer
        this data across the cluster to the tasks in the subsequent
        stage.

        This grouping is typically used to define :term:`shuffle`
        stages.

   group_node_label
        A :term:`grouping` operation that groups all the inputs with
        the same :term:`label` on the same node into a single group.

        This grouping can be used to condense the intermediate data
        generated on a cluster node by the tasks in a stage, in order
        to reduce the potential network resources used to transfer
        this data across the cluster to the tasks in the subsequent
        stage.

        This grouping is typically used to define :term:`shuffle`
        stages.

   split
        A :term:`grouping` operation that groups each single input
        into its own group, regardless of its label or the node it
        resides on.  Thus, the number of tasks that run in a
        :term:`split` stage is equal to the number of inputs to that
        stage.

        This grouping is typically used to define :term:`map` stages.

   immutable
        See `immutable object <http://en.wikipedia.org/wiki/Immutable_object>`_.

   job
        A set of map and/or reduce :term:`tasks <task>`, coordinated
        by the Disco :term:`master`.  When the master receives a
        :class:`disco.job.JobPack`, it assigns a unique name for the
        job, and assigns the tasks to :term:`workers <worker>` until
        they are all completed.

        See also :mod:`disco.job`.

   job functions
        Job functions are the functions that the user can specify for a
        :mod:`disco.worker.classic.worker`.
        For example,
        :func:`disco.worker.classic.func.map`,
        :func:`disco.worker.classic.func.reduce`,
        :func:`disco.worker.classic.func.combiner`, and
        :func:`disco.worker.classic.func.partition` are job functions.

   job dict
        The first field in a :term:`job pack`, which contains
        parameters needed by the master for job execution.

        See also :ref:`jobdict` and :attr:`disco.job.JobPack.jobdict`.

   job home
        The working directory in which a :term:`worker` is executed.
        The :term:`master` creates the *job home* from a :term:`job
        pack`, by unzipping the contents of its :ref:`jobhome
        <jobhome>` field.

        See also :ref:`jobhome` and :attr:`disco.job.JobPack.jobhome`.

   job pack
        The packed contents sent to the master when submitting a new
        job.  Includes the :term:`job dict` and :term:`job home`,
        among other things.

        See also :ref:`jobpack` and :class:`disco.job.JobPack`.

   JSON
        JavaScript Object Notation.

        See `Introducing JSON <http://www.json.org>`_.

   label
        Each output file created by a :term:`task` is annotated with
        an integer label chosen by the task.  This label is used by
        :term:`grouping` operations in the :ref:`pipeline <pipeline>`.

   map
        The first phase of a conventional :term:`mapreduce`
        :term:`job`, in which :term:`tasks <task>` are usually
        scheduled on the same node where their input data is hosted,
        so that local computation can be performed.

        Also refers to an individual task in this phase, which
        produces records that may be :term:`partitioned
        <partitioning>` and :term:`reduced <reduce>`.  Generally
        there is one map task per input.

   mapreduce
        A paradigm and associated framework for distributed computing,
        which decouples application code from the core challenges of
        fault tolerance and data locality.  The framework handles
        these issues so that :term:`jobs <job>` can focus on what is
        specific to their application.

        See `MapReduce <http://en.wikipedia.org/wiki/MapReduce>`_.

   master
        Distributed core that takes care of managing :term:`jobs
        <job>`, garbage collection for :term:`DDFS`, and other central
        processes.

        See also :ref:`overview`.

   partitioning
        The process of dividing output records into a set of labeled
        bins, much like :term:`tags <tag>` in :term:`DDFS`.
        Typically, the output of :term:`map` is partitioned, and each
        :term:`reduce` operates on a single partition.

   pid
        A process identifier.  In Disco this usually refers to the
        :term:`worker` *pid*.

        See `process identifier <http://en.wikipedia.org/wiki/Process_identifier>`_.

   pipeline
        The :ref:`structure <pipeline>` of a Disco job as a linear
        sequence of :term:`stages <stage>`.

   reduce
        The last phase of a conventional :term:`mapreduce`
        :term:`job`, in which non-local computation is usually
        performed.

        Also refers to an individual :term:`task` in this phase, which
        usually has access to all values for a given key produced by
        the :term:`map` phase.  Grouping data for reduce is achieved
        via :term:`partitioning`.

   replica
        Multiple copies (or replicas) of blobs are stored on different
        cluster nodes so that blobs remain available in spite of a
        small number of nodes going down.

   re-replication
        When a node goes down, the system tries to create additional
        replicas to replace the copies that were lost with the node.

   SSH
        Network protocol used by :term:`Erlang` to start :term:`slaves <slave>`.

        See `SSH <http://en.wikipedia.org/wiki/Secure_Shell>`_.

   shuffle
        The implicit middle phase of a conventional :term:`mapreduce`
        :term:`job`, in which a single logical input for a
        :term:`reduce` task is created for each :term:`label` from all
        the inputs with that label generated by the tasks in a
        :term:`map` stage.

        This phase typically creates intensive network activity
        between the cluster nodes.  This load on the network can be
        reduced in a Disco :ref:`pipeline <pipeline>` by judicious use
        of node-local grouping operations, by condensing the
        intermediate data generated on a node before it gets
        transmitted across the network.

   slave
        The process started by the :term:`Erlang` `slave module`_.

        .. _slave module: http://www.erlang.org/doc/man/slave.html

        See also :ref:`overview`.

   stage
        A stage consists of a task definition and a grouping
        operation.  The grouping operation is performed on the inputs
        of a stage; each resulting input group becomes the input to a
        single task.

   stdin
        The standard input file descriptor.  The :term:`master`
        responds to the :term:`worker` over *stdin*.

        See `standard streams <http://en.wikipedia.org/wiki/Standard_streams>`_.

   stdout
        The standard output file descriptor.  Initially redirected to
        :term:`stderr` for a Disco :term:`worker`.

        See `standard streams <http://en.wikipedia.org/wiki/Standard_streams>`_.

   stderr
        The standard error file descriptor.  The :term:`worker` sends
        messages to the :term:`master` over *stderr*.

        See `standard streams <http://en.wikipedia.org/wiki/Standard_streams>`_.

   tag
        A labeled collection of data in :term:`DDFS`.

        See also :ref:`tags`.

   task
        A *task* is essentially a unit of work, provided to a
        :term:`worker`.

        See also :mod:`disco.task`.

   worker
        A *worker* is responsible for carrying out a :term:`task`.  A
        Disco :term:`job` specifies the executable that is the worker.
        Workers are scheduled to run on the nodes, close to the data
        they are supposed to be processing.

        .. seealso::
           :mod:`The Python Worker module<disco.worker>`, and
           :ref:`worker_protocol`.

   ZIP
        Archive/compression format, used e.g. for the :term:`job
        home`.

        See `ZIP <http://en.wikipedia.org/wiki/ZIP_(file_format)>`_.
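
The five grouping operations defined in this glossary (``split``, ``group_node``, ``group_label``, ``group_node_label``, and ``group_all``) differ only in the key used to bucket a stage's inputs.  The following is a minimal sketch of that idea in plain Python, not Disco's actual implementation: the ``INPUTS`` triples, the ``group_inputs`` helper, and the node names are all hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical stage inputs as (label, node, name) triples.
# These values are illustrative and are not part of the Disco API.
INPUTS = [
    (0, "node1", "a"),
    (1, "node1", "b"),
    (0, "node2", "c"),
    (1, "node2", "d"),
    (0, "node2", "e"),
]


def group_inputs(inputs, grouping):
    """Return a list of input groups; each group would feed one task."""
    if grouping == "split":
        # Every input becomes its own group.
        return [[item] for item in inputs]
    if grouping == "group_all":
        # All inputs form a single group.
        return [list(inputs)] if inputs else []

    # The remaining groupings bucket inputs by a key.
    key_funcs = {
        "group_label": lambda item: item[0],
        "group_node": lambda item: item[1],
        "group_node_label": lambda item: (item[1], item[0]),
    }
    if grouping not in key_funcs:
        raise ValueError("unknown grouping: %r" % (grouping,))

    key = key_funcs[grouping]
    groups = defaultdict(list)
    for item in inputs:
        groups[key(item)].append(item)
    # Sort by key so the result is deterministic.
    return [groups[k] for k in sorted(groups)]


if __name__ == "__main__":
    for name in ("split", "group_node", "group_label",
                 "group_node_label", "group_all"):
        print(name, len(group_inputs(INPUTS, name)))
```

With these sample inputs, ``split`` yields five groups (one per input), ``group_node`` and ``group_label`` yield two each, ``group_node_label`` yields four, and ``group_all`` yields one.  The number of groups corresponds to the number of tasks that would run in a stage using that grouping.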