From ebcc44e9e93f41d15fac952853f8d3b67c1936e8 Mon Sep 17 00:00:00 2001
From: Matthew Rocklin
Date: Mon, 11 Mar 2019 19:53:12 -0700
Subject: [PATCH] Docs dataframe joins (#4569)

* [skip ci] Add docs for random array creation (#4566)

* [skip ci] Move categoricals outside of design doc

* respond to feedback

* [skip ci] Apply suggestions from code review

Co-Authored-By: mrocklin
---
 docs/source/dataframe-categoricals.rst | 83 +++++++++++++++++++++++++
 docs/source/dataframe-design.rst       | 84 +-------------------------
 docs/source/dataframe-joins.rst        | 65 ++++++++++++++++++++
 docs/source/dataframe.rst              |  2 +
 4 files changed, 151 insertions(+), 83 deletions(-)
 create mode 100644 docs/source/dataframe-categoricals.rst
 create mode 100644 docs/source/dataframe-joins.rst

diff --git a/docs/source/dataframe-categoricals.rst b/docs/source/dataframe-categoricals.rst
new file mode 100644
index 00000000000..f6bd1c6fbe1
--- /dev/null
+++ b/docs/source/dataframe-categoricals.rst
@@ -0,0 +1,83 @@
+Categoricals
+============
+
+Dask DataFrame divides `categorical data`_ into two types:
+
+- Known categoricals have the ``categories`` known statically (on the ``_meta``
+  attribute). Each partition **must** have the same categories as found on the
+  ``_meta`` attribute
+- Unknown categoricals don't know the categories statically, and may have
+  different categories in each partition. Internally, unknown categoricals are
+  indicated by the presence of ``dd.utils.UNKNOWN_CATEGORIES`` in the
+  categories on the ``_meta`` attribute. Since most DataFrame operations
+  propagate the categories, the known/unknown status should propagate through
+  operations (similar to how ``NaN`` propagates)
+
+When metadata is specified as a description (a listing of column names and
+dtypes rather than an empty example object, as described in the metadata
+section of :doc:`dataframe-design`), unknown categoricals are created.
+
+Certain operations are only available for known categoricals. For example,
+``df.col.cat.categories`` only works if ``df.col`` has known categories, since
+the categorical mapping is only known statically on the metadata of known
+categoricals.
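+
+For example (a minimal, hypothetical sketch: ``ddf`` is assumed to be a Dask
+DataFrame whose column ``col`` is a known categorical, and the output shown is
+illustrative):
+
+.. code-block:: python
+
+    >>> ddf.col.cat.categories   # works only when the categories are known
+    Index(['a', 'b', 'c'], dtype='object')
+
+The same attribute access fails for an unknown categorical, since the mapping
+is not available on the metadata.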
+
+The known/unknown status for a categorical column can be found using the
+``known`` property on the categorical accessor:
+
+.. code-block:: python
+
+    >>> ddf.col.cat.known
+    False
+
+Additionally, an unknown categorical can be converted to known using
+``.cat.as_known()``. If you have multiple categorical columns in a DataFrame,
+you may instead want to use ``df.categorize(columns=...)``, which will convert
+all specified columns to known categoricals. Since getting the categories
+requires a full scan of the data, using ``df.categorize()`` is more efficient
+than calling ``.cat.as_known()`` for each column (which would result in
+multiple scans):
+
+.. code-block:: python
+
+    >>> col_known = ddf.col.cat.as_known()  # use for single column
+    >>> col_known.cat.known
+    True
+    >>> ddf_known = ddf.categorize()        # use for multiple columns
+    >>> ddf_known.col.cat.known
+    True
+
+To convert a known categorical to an unknown categorical, there is also the
+``.cat.as_unknown()`` method. This requires no computation, as it's just a
+change in the metadata.
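+
+For example (a minimal sketch, reusing the hypothetical ``ddf_known`` from
+above):
+
+.. code-block:: python
+
+    >>> col_unknown = ddf_known.col.cat.as_unknown()
+    >>> col_unknown.cat.known
+    False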
+
+Non-categorical columns can be converted to categoricals in a few different
+ways:
+
+.. code-block:: python
+
+    # astype operates lazily, and results in unknown categoricals
+    ddf = ddf.astype({'mycol': 'category', ...})
+    # or
+    ddf['mycol'] = ddf.mycol.astype('category')
+
+    # categorize requires computation, and results in known categoricals
+    ddf = ddf.categorize(columns=['mycol', ...])
+
+Additionally, with Pandas 0.19.2 and up, ``dd.read_csv`` and ``dd.read_table``
+can read data directly into unknown categorical columns by specifying a column
+dtype as ``'category'``:
+
+.. code-block:: python
+
+    >>> ddf = dd.read_csv(..., dtype={col_name: 'category'})
+
+.. _`categorical data`: http://pandas.pydata.org/pandas-docs/stable/categorical.html
+
+Moreover, with Pandas 0.21.0 and up, ``dd.read_csv`` and ``dd.read_table`` can
+read data directly into *known* categoricals by specifying instances of
+``pd.api.types.CategoricalDtype``:
+
+.. code-block:: python
+
+    >>> dtype = {'col': pd.api.types.CategoricalDtype(['a', 'b', 'c'])}
+    >>> ddf = dd.read_csv(..., dtype=dtype)

diff --git a/docs/source/dataframe-design.rst b/docs/source/dataframe-design.rst
index c8e24e1a9f6..54874daf04d 100644
--- a/docs/source/dataframe-design.rst
+++ b/docs/source/dataframe-design.rst
@@ -71,90 +71,8 @@ This keyword is available on all functions/methods that take user provided
 callables (e.g. ``DataFrame.map_partitions``, ``DataFrame.apply``, etc...), as
 well as many creation functions (e.g. ``dd.from_delayed``).

-Categoricals
-------------
-
-Dask DataFrame divides `categorical data`_ into two types:
-
-- Known categoricals have the ``categories`` known statically (on the ``_meta``
-  attribute). Each partition **must** have the same categories as found on the
-  ``_meta`` attribute
-- Unknown categoricals don't know the categories statically, and may have
-  different categories in each partition. Internally, unknown categoricals are
-  indicated by the presence of ``dd.utils.UNKNOWN_CATEGORIES`` in the
-  categories on the ``_meta`` attribute. Since most DataFrame operations
-  propagate the categories, the known/unknown status should propagate through
-  operations (similar to how ``NaN`` propagates)
-
-For metadata specified as a description (option 2 above), unknown categoricals
-are created.
-
-Certain operations are only available for known categoricals. For example,
-``df.col.cat.categories`` would only work if ``df.col`` has known categories,
-since the categorical mapping is only known statically on the metadata of known
-categoricals.
-
-The known/unknown status for a categorical column can be found using the
-``known`` property on the categorical accessor:
-
-.. code-block:: python
-
-    >>> ddf.col.cat.known
-    False
-
-Additionally, an unknown categorical can be converted to known using
-``.cat.as_known()``. If you have multiple categorical columns in a DataFrame,
-you may instead want to use ``df.categorize(columns=...)``, which will convert
-all specified columns to known categoricals. Since getting the categories
-requires a full scan of the data, using ``df.categorize()`` is more efficient
-than calling ``.cat.as_known()`` for each column (which would result in
-multiple scans):
-
-.. code-block:: python
-
-    >>> col_known = ddf.col.cat.as_known()  # use for single column
-    >>> col_known.cat.known
-    True
-    >>> ddf_known = ddf.categorize()        # use for multiple columns
-    >>> ddf_known.col.cat.known
-    True
-
-To convert a known categorical to an unknown categorical, there is also the
-``.cat.as_unknown()`` method. This requires no computation as it's just a
-change in the metadata.
-
-Non-categorical columns can be converted to categoricals in a few different
-ways:
-
-.. code-block:: python
-
-    # astype operates lazily, and results in unknown categoricals
-    ddf = ddf.astype({'mycol': 'category', ...})
-    # or
-    ddf['mycol'] = ddf.mycol.astype('category')
-
-    # categorize requires computation, and results in known categoricals
-    ddf = ddf.categorize(columns=['mycol', ...])
-
-Additionally, with Pandas 0.19.2 and up, ``dd.read_csv`` and ``dd.read_table``
-can read data directly into unknown categorical columns by specifying a column
-dtype as ``'category'``:
-
-.. code-block:: python
-
-    >>> ddf = dd.read_csv(..., dtype={col_name: 'category'})
-
-.. _`categorical data`: http://pandas.pydata.org/pandas-docs/stable/categorical.html
-
-Moreover, with Pandas 0.21.0 and up, ``dd.read_csv`` and ``dd.read_table`` can read
-data directly into *known* categoricals by specifying instances of
-``pd.api.types.CategoricalDtype``:
-
-.. code-block:: python
-
-    >>> dtype = {'col': pd.api.types.CategoricalDtype(['a', 'b', 'c'])}
-    >>> ddf = dd.read_csv(..., dtype=dtype)
+
+.. _dataframe-design-partitions:

 Partitions
 ----------

diff --git a/docs/source/dataframe-joins.rst b/docs/source/dataframe-joins.rst
new file mode 100644
index 00000000000..fd34b3d81f6
--- /dev/null
+++ b/docs/source/dataframe-joins.rst
@@ -0,0 +1,65 @@
+Joins
+=====
+
+DataFrame joins are a common and expensive computation that benefits from a
+variety of optimizations in different situations. Understanding how your data
+is laid out and what you're trying to accomplish can have a large impact on
+performance. This documentation page goes through the various different
+options and their performance impacts.
+
+Large to Large Unsorted Joins
+-----------------------------
+
+In the worst case scenario you have two large tables with many partitions each
+and you want to join them along a column that may not be sorted.
+
+This can be slow. In this case Dask DataFrame will need to move all of your
+data around so that rows with matching values in the joining columns are in the
+same partition. This large-scale movement can create communication costs, and
+can require a large amount of memory. If enough memory cannot be found then
+Dask will have to read and write data to disk, which may cause other
+performance costs.
+
+These problems are solvable, but will be significantly slower than many other
+operations. They are best avoided if possible.
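+
+As a minimal sketch of such a join (the DataFrames ``large_one`` and
+``large_two`` and the column name ``'id'`` here are hypothetical; merging on
+an unsorted column is what leads to the shuffle described above):
+
+.. code-block:: python
+
+    # Neither side is indexed by 'id', so Dask must shuffle both inputs
+    result = large_one.merge(large_two, on='id', how='inner')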
+
+Large to Small Joins
+--------------------
+
+Many join or merge computations combine a large table with one small one. If
+the small table is either a single partition Dask DataFrame or even just a
+normal Pandas DataFrame then the computation can proceed in an embarrassingly
+parallel way, where each partition of the large DataFrame is joined against
+the single small table. This incurs almost no overhead relative to Pandas
+joins.
+
+If your smaller table can easily fit in memory, then you might want to ensure
+that it is a single partition with the following:
+
+.. code-block:: python
+
+    small = small.repartition(npartitions=1)
+    result = big.merge(small)
+
+Sorted Joins
+------------
+
+The Pandas merge API supports the ``left_index=`` and ``right_index=`` options
+to perform joins on the index. For Dask DataFrames these keyword options hold
+special significance if the index has known divisions
+(see :ref:`dataframe-design-partitions`).
+In this case the DataFrame partitions are aligned along these divisions and
+then an embarrassingly parallel Pandas join happens across partition pairs,
+which is generally quite fast.
+
+Sorted or indexed joins are a good solution to the large-large join problem.
+If you plan to join against a dataset repeatedly then it may be worthwhile to
+set the index ahead of time, and possibly store the data in a format that
+maintains that index, like Parquet:
+
+.. code-block:: python
+
+    left = left.set_index('id').persist()
+
+    left.merge(right_one, left_index=True, ...)
+    left.merge(right_two, left_index=True, ...)
+    ...

diff --git a/docs/source/dataframe.rst b/docs/source/dataframe.rst
index 84680866d73..dfe30e7aa91 100644
--- a/docs/source/dataframe.rst
+++ b/docs/source/dataframe.rst
@@ -10,7 +10,9 @@ DataFrame
    dataframe-performance.rst
    dataframe-design.rst
    dataframe-groupby.rst
+   dataframe-joins.rst
    dataframe-indexing.rst
+   dataframe-categoricals.rst
    dataframe-extend.rst

 A Dask DataFrame is a large parallel DataFrame composed of many smaller Pandas