Docs dataframe joins (#4569)
* [skip ci] Add docs for random array creation (#4566)

* [skip ci] Move categoricals outside of design doc

* respond to feedback

* [skip ci] Apply suggestions from code review

Co-Authored-By: mrocklin <mrocklin@gmail.com>
mrocklin authored and jrbourbeau committed Mar 12, 2019
1 parent 0236183 commit ebcc44e
Showing 4 changed files with 151 additions and 83 deletions.
83 changes: 83 additions & 0 deletions docs/source/dataframe-categoricals.rst
@@ -0,0 +1,83 @@
Categoricals
============

Dask DataFrame divides `categorical data`_ into two types:

- Known categoricals have the ``categories`` known statically (on the ``_meta``
  attribute). Each partition **must** have the same categories as found on the
  ``_meta`` attribute
- Unknown categoricals don't know the categories statically, and may have
  different categories in each partition. Internally, unknown categoricals are
  indicated by the presence of ``dd.utils.UNKNOWN_CATEGORIES`` in the
  categories on the ``_meta`` attribute. Since most DataFrame operations
  propagate the categories, the known/unknown status should propagate through
  operations (similar to how ``NaN`` propagates)

When metadata is specified as a description (for example, a mapping of column
names to dtypes rather than an empty pandas object; see the ``meta`` keyword in
:doc:`dataframe-design`), unknown categoricals are created.
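
As a minimal sketch (the example frame here is hypothetical), a Dask DataFrame
built from a pandas frame that already carries a categorical dtype has known
categories, while a lazy ``astype`` produces unknown ones:

.. code-block:: python

   >>> import pandas as pd
   >>> import dask.dataframe as dd

   >>> pdf = pd.DataFrame({'col': pd.Categorical(['a', 'b', 'a'])})
   >>> ddf = dd.from_pandas(pdf, npartitions=2)
   >>> ddf.col.cat.known  # categories carried over from pandas
   True

   >>> ddf.col.astype(object).astype('category').cat.known  # astype is lazy
   False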

Certain operations are only available for known categoricals. For example,
``ddf.col.cat.categories`` only works if ``ddf.col`` has known categories,
since the categorical mapping is only known statically on the metadata of known
categoricals.
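
For example (a sketch, assuming a categorical column ``col``):

.. code-block:: python

   # Works only when ddf.col has known categories:
   >>> ddf.col.cat.categories

   # For an unknown categorical, convert to known first:
   >>> ddf.col.cat.as_known().cat.categories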

The known/unknown status for a categorical column can be found using the
``known`` property on the categorical accessor:

.. code-block:: python

   >>> ddf.col.cat.known
   False

Additionally, an unknown categorical can be converted to known using
``.cat.as_known()``. If you have multiple categorical columns in a DataFrame,
you may instead want to use ``ddf.categorize(columns=...)``, which will convert
all specified columns to known categoricals. Since getting the categories
requires a full scan of the data, using ``ddf.categorize()`` is more efficient
than calling ``.cat.as_known()`` for each column (which would result in
multiple scans):

.. code-block:: python

   >>> col_known = ddf.col.cat.as_known()  # use for single column
   >>> col_known.cat.known
   True
   >>> ddf_known = ddf.categorize()  # use for multiple columns
   >>> ddf_known.col.cat.known
   True

To convert a known categorical to an unknown categorical, there is also the
``.cat.as_unknown()`` method. This requires no computation as it's just a
change in the metadata.
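
A minimal sketch, reusing ``ddf_known`` from the example above:

.. code-block:: python

   >>> col_unknown = ddf_known.col.cat.as_unknown()  # metadata-only, no computation
   >>> col_unknown.cat.known
   False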

Non-categorical columns can be converted to categoricals in a few different
ways:

.. code-block:: python

   # astype operates lazily, and results in unknown categoricals
   ddf = ddf.astype({'mycol': 'category', ...})
   # or
   ddf['mycol'] = ddf.mycol.astype('category')

   # categorize requires computation, and results in known categoricals
   ddf = ddf.categorize(columns=['mycol', ...])

Additionally, with Pandas 0.19.2 and up, ``dd.read_csv`` and ``dd.read_table``
can read data directly into unknown categorical columns by specifying a column
dtype as ``'category'``:

.. code-block:: python

   >>> ddf = dd.read_csv(..., dtype={col_name: 'category'})

.. _`categorical data`: http://pandas.pydata.org/pandas-docs/stable/categorical.html

Moreover, with Pandas 0.21.0 and up, ``dd.read_csv`` and ``dd.read_table`` can read
data directly into *known* categoricals by specifying instances of
``pd.api.types.CategoricalDtype``:

.. code-block:: python

   >>> dtype = {'col': pd.api.types.CategoricalDtype(['a', 'b', 'c'])}
   >>> ddf = dd.read_csv(..., dtype=dtype)
84 changes: 1 addition & 83 deletions docs/source/dataframe-design.rst
@@ -71,90 +71,8 @@ This keyword is available on all functions/methods that take user provided
callables (e.g. ``DataFrame.map_partitions``, ``DataFrame.apply``, etc...), as
well as many creation functions (e.g. ``dd.from_delayed``).

(The ``Categoricals`` section deleted here is identical to the content added
above in ``docs/source/dataframe-categoricals.rst``.)

.. _dataframe-design-partitions:

Partitions
----------
65 changes: 65 additions & 0 deletions docs/source/dataframe-joins.rst
@@ -0,0 +1,65 @@
Joins
=====

DataFrame joins are a common and expensive computation that benefits from a
variety of optimizations in different situations. Understanding how your data
is laid out and what you're trying to accomplish can have a large impact on
performance. This page goes through the various options and their performance
impacts.

Large to Large Unsorted Joins
-----------------------------

In the worst-case scenario, you have two large tables with many partitions
each, and you want to join them along a column that may not be sorted.

This can be slow. In this case Dask DataFrame needs to move all of your data
around so that rows with matching values in the joining columns end up in the
same partition. This large-scale shuffle incurs communication costs and can
require a large amount of memory. If enough memory cannot be found, then Dask
will have to read and write data to disk, which may incur further performance
costs.

These joins are feasible, but they will be significantly slower than many
other operations. They are best avoided if possible.
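
For illustration, a sketch of such a join (the paths and the join column
``id`` here are hypothetical):

.. code-block:: python

   import dask.dataframe as dd

   left = dd.read_parquet('s3://bucket/left/')    # many partitions
   right = dd.read_parquet('s3://bucket/right/')  # many partitions

   # Neither side is sorted or indexed by 'id', so this merge must
   # shuffle both tables so matching rows land in the same partition
   result = left.merge(right, on='id')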

Large to Small Joins
--------------------

Many join or merge computations combine a large table with one small one. If
the small table is either a single-partition Dask DataFrame or even just a
normal Pandas DataFrame, then the computation can proceed in an embarrassingly
parallel way, where each partition of the large DataFrame is joined against the
single small table. This incurs almost no overhead relative to Pandas joins.

If your smaller table can easily fit in memory, then you might want to ensure
that it is a single partition with the following:

.. code-block:: python

   small = small.repartition(npartitions=1)
   result = big.merge(small)
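
A plain in-memory Pandas DataFrame also works directly as the small side (a
sketch; the lookup table is hypothetical):

.. code-block:: python

   import pandas as pd

   lookup = pd.DataFrame({'id': [1, 2, 3], 'label': ['a', 'b', 'c']})

   # Each partition of ``big`` is joined against ``lookup`` independently
   result = big.merge(lookup, on='id')
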
Sorted Joins
------------

The Pandas merge API supports the ``left_index=`` and ``right_index=`` options
to perform joins on the index. For Dask DataFrames these keyword options hold
special significance if the index has known divisions
(see :ref:`dataframe-design-partitions`).
In this case the DataFrame partitions are aligned along these divisions (which
is fast) and then an embarrassingly parallel Pandas join happens across
partition pairs. This is generally quite fast.

Sorted or indexed joins are a good solution to the large-large join problem.
If you plan to join against a dataset repeatedly then it may be worthwhile to
set the index ahead of time, and possibly store the data in a format that
maintains that index, like Parquet.

.. code-block:: python

   left = left.set_index('id').persist()

   left.merge(right_one, left_index=True, ...)
   left.merge(right_two, left_index=True, ...)
   ...
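
For example, a sketch of persisting the sorted index to Parquet so that later
sessions can reuse it (the path is hypothetical, and whether divisions are
recovered on read depends on the Parquet engine and its metadata):

.. code-block:: python

   left = left.set_index('id')
   left.to_parquet('left-indexed.parquet')

   # In a later session; divisions may be restored from the Parquet metadata
   left = dd.read_parquet('left-indexed.parquet')
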
2 changes: 2 additions & 0 deletions docs/source/dataframe.rst
@@ -10,7 +10,9 @@ DataFrame
dataframe-performance.rst
dataframe-design.rst
dataframe-groupby.rst
dataframe-joins.rst
dataframe-indexing.rst
dataframe-categoricals.rst
dataframe-extend.rst

A Dask DataFrame is a large parallel DataFrame composed of many smaller Pandas
