Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
* [skip ci] Add docs for random array creation (#4566) * [skip ci] Move categoricals outside of design doc * respond to feedback * [skip ci] Apply suggestions from code review Co-Authored-By: mrocklin <mrocklin@gmail.com>
- Loading branch information
1 parent
0236183
commit ebcc44e
Showing
4 changed files
with
151 additions
and
83 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,83 @@ | ||
Categoricals | ||
============ | ||
|
||
Dask DataFrame divides `categorical data`_ into two types: | ||
|
||
- Known categoricals have the ``categories`` known statically (on the ``_meta`` | ||
attribute). Each partition **must** have the same categories as found on the | ||
``_meta`` attribute | ||
- Unknown categoricals don't know the categories statically, and may have | ||
different categories in each partition. Internally, unknown categoricals are | ||
indicated by the presence of ``dd.utils.UNKNOWN_CATEGORIES`` in the | ||
categories on the ``_meta`` attribute. Since most DataFrame operations | ||
propagate the categories, the known/unknown status should propagate through | ||
operations (similar to how ``NaN`` propagates) | ||
|
||
For metadata specified as a description (option 2 above), unknown categoricals | ||
are created. | ||
|
||
Certain operations are only available for known categoricals. For example, | ||
``df.col.cat.categories`` would only work if ``df.col`` has known categories, | ||
since the categorical mapping is only known statically on the metadata of known | ||
categoricals. | ||
|
||
The known/unknown status for a categorical column can be found using the | ||
``known`` property on the categorical accessor: | ||
|
||
.. code-block:: python | ||
>>> ddf.col.cat.known | ||
False | ||
Additionally, an unknown categorical can be converted to known using | ||
``.cat.as_known()``. If you have multiple categorical columns in a DataFrame, | ||
you may instead want to use ``df.categorize(columns=...)``, which will convert | ||
all specified columns to known categoricals. Since getting the categories | ||
requires a full scan of the data, using ``df.categorize()`` is more efficient | ||
than calling ``.cat.as_known()`` for each column (which would result in | ||
multiple scans): | ||
|
||
.. code-block:: python | ||
>>> col_known = ddf.col.cat.as_known() # use for single column | ||
>>> col_known.cat.known | ||
True | ||
>>> ddf_known = ddf.categorize() # use for multiple columns | ||
>>> ddf_known.col.cat.known | ||
True | ||
To convert a known categorical to an unknown categorical, there is also the | ||
``.cat.as_unknown()`` method. This requires no computation as it's just a | ||
change in the metadata. | ||
|
||
Non-categorical columns can be converted to categoricals in a few different | ||
ways: | ||
|
||
.. code-block:: python | ||
# astype operates lazily, and results in unknown categoricals | ||
ddf = ddf.astype({'mycol': 'category', ...}) | ||
# or | ||
ddf['mycol'] = ddf.mycol.astype('category') | ||
# categorize requires computation, and results in known categoricals | ||
ddf = ddf.categorize(columns=['mycol', ...]) | ||
Additionally, with Pandas 0.19.2 and up, ``dd.read_csv`` and ``dd.read_table`` | ||
can read data directly into unknown categorical columns by specifying a column | ||
dtype as ``'category'``: | ||
|
||
.. code-block:: python | ||
>>> ddf = dd.read_csv(..., dtype={col_name: 'category'}) | ||
.. _`categorical data`: http://pandas.pydata.org/pandas-docs/stable/categorical.html | ||
|
||
Moreover, with Pandas 0.21.0 and up, ``dd.read_csv`` and ``dd.read_table`` can read | ||
data directly into *known* categoricals by specifying instances of | ||
``pd.api.types.CategoricalDtype``: | ||
|
||
.. code-block:: python | ||
>>> dtype = {'col': pd.api.types.CategoricalDtype(['a', 'b', 'c'])} | ||
>>> ddf = dd.read_csv(..., dtype=dtype) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,65 @@ | ||
Joins | ||
===== | ||
|
||
DataFrame joins are a common and expensive computation that benefit from a | ||
variety of optimizations in different situations. Understanding how your data | ||
is laid out and what you're trying to accomplish can have a large impact on | ||
performance. This documentation page goes through the various different | ||
options and their performance impacts. | ||
|
||
Large to Large Unsorted Joins | ||
----------------------------- | ||
|
||
In the worst case scenario you have two large tables with many partitions each | ||
and you want to join them both along a column that may not be sorted. | ||
|
||
This can be slow. In this case Dask DataFrame will need to move all of your | ||
data around so that rows with matching values in the joining columns are in the | ||
same partition. This large-scale movement can create communication costs, and | ||
can require a large amount of memory. If enough memory can not be found then | ||
Dask will have to read and write data to disk, which may cause other | ||
performance costs. | ||
|
||
These problems are solvable, but will be significantly slower than many other | ||
operations. They are best avoided if possible. | ||
|
||
Large to Small Joins | ||
-------------------- | ||
|
||
Many join or merge computations combine a large table with one small one. If | ||
the small table is either a single partition Dask DataFrame or even just a | ||
normal Pandas DataFrame then the computation can proceed in an embarrassingly | ||
parallel way, where each partition of the large DataFrame is joined against the | ||
single small table. This incurs almost no overhead relative to Pandas joins. | ||
|
||
If your smaller table can easily fit in memory, then you might want to ensure | ||
that it is a single partition with the following | ||
|
||
.. code-block:: python | ||
small = small.repartition(npartitions=1) | ||
result = big.merge(small) | ||
Sorted Joins | ||
------------ | ||
|
||
The Pandas merge API supports the ``left_index=`` and ``right_index=`` options | ||
to perform joins on the index. For Dask DataFrames these keyword options hold | ||
special significance if the index has known divisions | ||
(see :ref:`dataframe-design-partitions`). | ||
In this case the DataFrame partitions are aligned along these divisions (which | ||
is generally fast) and then an embarrassingly parallel Pandas join happens | ||
across partition pairs. This is generally relatively fast. | ||
|
||
Sorted or indexed joins are a good solution to the large-large join problem. | ||
If you plan to join against a dataset repeatedly then it may be worthwhile to | ||
set the index ahead of time, and possibly store the data in a format that | ||
maintains that index, like Parquet. | ||
|
||
.. code-block:: python | ||
left = left.set_index('id').persist() | ||
left.merge(right_one, left_index=True, ...) | ||
left.merge(right_two, left_index=True, ...) | ||
... |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters