Docs dataframe joins (#4569)
* [skip ci] Add docs for random array creation (#4566)

* [skip ci] Move categoricals outside of design doc

* respond to feedback

* [skip ci] Apply suggestions from code review

Co-Authored-By: mrocklin <mrocklin@gmail.com>
mrocklin authored and jrbourbeau committed Mar 12, 2019
1 parent 0236183 commit ebcc44e
Showing 4 changed files with 151 additions and 83 deletions.
83 changes: 83 additions & 0 deletions docs/source/dataframe-categoricals.rst
@@ -0,0 +1,83 @@
Categoricals
============

Dask DataFrame divides `categorical data`_ into two types:

- Known categoricals have the ``categories`` known statically (on the ``_meta``
  attribute). Each partition **must** have the same categories as found on the
  ``_meta`` attribute
- Unknown categoricals don't know the categories statically, and may have
  different categories in each partition. Internally, unknown categoricals are
  indicated by the presence of ``dd.utils.UNKNOWN_CATEGORIES`` in the
  categories on the ``_meta`` attribute. Since most DataFrame operations
  propagate the categories, the known/unknown status should propagate through
  operations (similar to how ``NaN`` propagates)

When metadata is specified as a description (for example, a mapping of column
names to dtypes rather than an empty pandas object; see the ``meta`` keyword in
:doc:`dataframe-design`), unknown categoricals are created.
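
As a minimal sketch (the example frame here is hypothetical), a Dask DataFrame
built from a pandas frame that already carries a categorical dtype has known
categories, while a lazy ``astype`` produces unknown ones:

.. code-block:: python

   >>> import pandas as pd
   >>> import dask.dataframe as dd

   >>> pdf = pd.DataFrame({'col': pd.Categorical(['a', 'b', 'a'])})
   >>> ddf = dd.from_pandas(pdf, npartitions=2)
   >>> ddf.col.cat.known  # categories carried over from pandas
   True

   >>> ddf.col.astype(object).astype('category').cat.known  # astype is lazy
   False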

Certain operations are only available for known categoricals. For example,
``ddf.col.cat.categories`` only works if ``ddf.col`` has known categories,
since the categorical mapping is only known statically on the metadata of known
categoricals.
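
For example (a sketch, assuming a categorical column ``col``):

.. code-block:: python

   # Works only when ddf.col has known categories:
   >>> ddf.col.cat.categories

   # For an unknown categorical, convert to known first:
   >>> ddf.col.cat.as_known().cat.categories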

The known/unknown status for a categorical column can be found using the
``known`` property on the categorical accessor:

.. code-block:: python

   >>> ddf.col.cat.known
   False

Additionally, an unknown categorical can be converted to known using
``.cat.as_known()``. If you have multiple categorical columns in a DataFrame,
you may instead want to use ``ddf.categorize(columns=...)``, which will convert
all specified columns to known categoricals. Since getting the categories
requires a full scan of the data, using ``ddf.categorize()`` is more efficient
than calling ``.cat.as_known()`` for each column (which would result in
multiple scans):

.. code-block:: python

   >>> col_known = ddf.col.cat.as_known()  # use for single column
   >>> col_known.cat.known
   True
   >>> ddf_known = ddf.categorize()  # use for multiple columns
   >>> ddf_known.col.cat.known
   True

To convert a known categorical to an unknown categorical, there is also the
``.cat.as_unknown()`` method. This requires no computation as it's just a
change in the metadata.
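
A minimal sketch, reusing ``ddf_known`` from the example above:

.. code-block:: python

   >>> col_unknown = ddf_known.col.cat.as_unknown()  # metadata-only, no computation
   >>> col_unknown.cat.known
   False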

Non-categorical columns can be converted to categoricals in a few different
ways:

.. code-block:: python

   # astype operates lazily, and results in unknown categoricals
   ddf = ddf.astype({'mycol': 'category', ...})
   # or
   ddf['mycol'] = ddf.mycol.astype('category')

   # categorize requires computation, and results in known categoricals
   ddf = ddf.categorize(columns=['mycol', ...])

Additionally, with Pandas 0.19.2 and up, ``dd.read_csv`` and ``dd.read_table``
can read data directly into unknown categorical columns by specifying a column
dtype as ``'category'``:

.. code-block:: python

   >>> ddf = dd.read_csv(..., dtype={col_name: 'category'})

.. _`categorical data`: http://pandas.pydata.org/pandas-docs/stable/categorical.html

Moreover, with Pandas 0.21.0 and up, ``dd.read_csv`` and ``dd.read_table`` can read
data directly into *known* categoricals by specifying instances of
``pd.api.types.CategoricalDtype``:

.. code-block:: python

   >>> dtype = {'col': pd.api.types.CategoricalDtype(['a', 'b', 'c'])}
   >>> ddf = dd.read_csv(..., dtype=dtype)
84 changes: 1 addition & 83 deletions docs/source/dataframe-design.rst
@@ -71,90 +71,8 @@ This keyword is available on all functions/methods that take user provided
callables (e.g. ``DataFrame.map_partitions``, ``DataFrame.apply``, etc...), as
well as many creation functions (e.g. ``dd.from_delayed``).

(The ``Categoricals`` section deleted here is identical to the content added
above in ``docs/source/dataframe-categoricals.rst``.)

.. _dataframe-design-partitions:

Partitions
----------
65 changes: 65 additions & 0 deletions docs/source/dataframe-joins.rst
@@ -0,0 +1,65 @@
Joins
=====

DataFrame joins are a common and expensive computation that benefits from a
variety of optimizations in different situations. Understanding how your data
is laid out and what you're trying to accomplish can have a large impact on
performance. This page goes through the various options and their performance
impacts.

Large to Large Unsorted Joins
-----------------------------

In the worst-case scenario, you have two large tables with many partitions
each, and you want to join them along a column that may not be sorted.

This can be slow. In this case Dask DataFrame needs to move all of your data
around so that rows with matching values in the joining columns end up in the
same partition. This large-scale shuffle incurs communication costs and can
require a large amount of memory. If enough memory cannot be found, then Dask
will have to read and write data to disk, which may incur further performance
costs.

These joins are feasible, but they will be significantly slower than many
other operations. They are best avoided if possible.
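
For illustration, a sketch of such a join (the paths and the join column
``id`` here are hypothetical):

.. code-block:: python

   import dask.dataframe as dd

   left = dd.read_parquet('s3://bucket/left/')    # many partitions
   right = dd.read_parquet('s3://bucket/right/')  # many partitions

   # Neither side is sorted or indexed by 'id', so this merge must
   # shuffle both tables so matching rows land in the same partition
   result = left.merge(right, on='id')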

Large to Small Joins
--------------------

Many join or merge computations combine a large table with one small one. If
the small table is either a single-partition Dask DataFrame or even just a
normal Pandas DataFrame, then the computation can proceed in an embarrassingly
parallel way, where each partition of the large DataFrame is joined against the
single small table. This incurs almost no overhead relative to Pandas joins.

If your smaller table can easily fit in memory, then you might want to ensure
that it is a single partition with the following:

.. code-block:: python

   small = small.repartition(npartitions=1)
   result = big.merge(small)
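
A plain in-memory Pandas DataFrame also works directly as the small side (a
sketch; the lookup table is hypothetical):

.. code-block:: python

   import pandas as pd

   lookup = pd.DataFrame({'id': [1, 2, 3], 'label': ['a', 'b', 'c']})

   # Each partition of ``big`` is joined against ``lookup`` independently
   result = big.merge(lookup, on='id')
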
Sorted Joins
------------

The Pandas merge API supports the ``left_index=`` and ``right_index=`` options
to perform joins on the index. For Dask DataFrames these keyword options hold
special significance if the index has known divisions
(see :ref:`dataframe-design-partitions`).
In this case the DataFrame partitions are aligned along these divisions (which
is fast) and then an embarrassingly parallel Pandas join happens across
partition pairs. This is generally quite fast.

Sorted or indexed joins are a good solution to the large-large join problem.
If you plan to join against a dataset repeatedly then it may be worthwhile to
set the index ahead of time, and possibly store the data in a format that
maintains that index, like Parquet.

.. code-block:: python

   left = left.set_index('id').persist()

   left.merge(right_one, left_index=True, ...)
   left.merge(right_two, left_index=True, ...)
   ...
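
For example, a sketch of persisting the sorted index to Parquet so that later
sessions can reuse it (the path is hypothetical, and whether divisions are
recovered on read depends on the Parquet engine and its metadata):

.. code-block:: python

   left = left.set_index('id')
   left.to_parquet('left-indexed.parquet')

   # In a later session; divisions may be restored from the Parquet metadata
   left = dd.read_parquet('left-indexed.parquet')
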
2 changes: 2 additions & 0 deletions docs/source/dataframe.rst
@@ -10,7 +10,9 @@ DataFrame
dataframe-performance.rst
dataframe-design.rst
dataframe-groupby.rst
dataframe-joins.rst
dataframe-indexing.rst
dataframe-categoricals.rst
dataframe-extend.rst

A Dask DataFrame is a large parallel DataFrame composed of many smaller Pandas
