Categorical type (pandas-dev#16015)
TomAugspurger authored and alanbato committed Nov 10, 2017
1 parent e18cdca commit da7ad15
Showing 31 changed files with 1,092 additions and 288 deletions.
4 changes: 3 additions & 1 deletion doc/source/advanced.rst
Expand Up @@ -638,9 +638,11 @@ and allows efficient indexing and storage of an index with a large number of dup

.. ipython:: python
from pandas.api.types import CategoricalDtype
df = pd.DataFrame({'A': np.arange(6),
'B': list('aabbca')})
df['B'] = df['B'].astype('category', categories=list('cab'))
df['B'] = df['B'].astype(CategoricalDtype(list('cab')))
df
df.dtypes
df.B.cat.categories
5 changes: 4 additions & 1 deletion doc/source/api.rst
Expand Up @@ -646,7 +646,10 @@ strings and apply several methods to it. These can be accessed like
Categorical
~~~~~~~~~~~

If the Series is of dtype ``category``, ``Series.cat`` can be used to change the the categorical
.. autoclass:: api.types.CategoricalDtype
:members: categories, ordered

If the Series is of dtype ``CategoricalDtype``, ``Series.cat`` can be used to change the categorical
data. This accessor is similar to ``Series.dt`` or ``Series.str`` and has the
following usable methods and properties:

103 changes: 95 additions & 8 deletions doc/source/categorical.rst
Expand Up @@ -89,12 +89,22 @@ By passing a :class:`pandas.Categorical` object to a `Series` or assigning it to
df["B"] = raw_cat
df
You can also specify differently ordered categories or make the resulting data ordered, by passing these arguments to ``astype()``:
Wherever we passed the keyword ``dtype='category'`` above, we used the default behavior:

1. Categories are inferred from the data.
2. Categories are unordered.

To control those behaviors, instead of passing ``'category'``, use an instance
of :class:`~pandas.api.types.CategoricalDtype`.

.. ipython:: python
s = pd.Series(["a","b","c","a"])
s_cat = s.astype("category", categories=["b","c","d"], ordered=False)
from pandas.api.types import CategoricalDtype
s = pd.Series(["a", "b", "c", "a"])
cat_type = CategoricalDtype(categories=["b", "c", "d"],
ordered=True)
s_cat = s.astype(cat_type)
s_cat
Categorical data has a specific ``category`` :ref:`dtype <basics.dtypes>`:
Expand Down Expand Up @@ -133,6 +143,75 @@ constructor to save the factorize step during normal constructor mode:
splitter = np.random.choice([0,1], 5, p=[0.5,0.5])
s = pd.Series(pd.Categorical.from_codes(splitter, categories=["train", "test"]))
.. _categorical.categoricaldtype:

CategoricalDtype
----------------

.. versionchanged:: 0.21.0

A categorical's type is fully described by

1. ``categories``: a sequence of unique values with no missing values
2. ``ordered``: a boolean

This information can be stored in a :class:`~pandas.api.types.CategoricalDtype`.
The ``categories`` argument is optional. When it is omitted, the actual categories
are inferred from whatever is present in the data when the
:class:`pandas.Categorical` is created. The categories are assumed to be unordered
by default.

.. ipython:: python
from pandas.api.types import CategoricalDtype
CategoricalDtype(['a', 'b', 'c'])
CategoricalDtype(['a', 'b', 'c'], ordered=True)
CategoricalDtype()
A :class:`~pandas.api.types.CategoricalDtype` can be used anywhere pandas
expects a `dtype`, for example in :func:`pandas.read_csv`,
:func:`pandas.DataFrame.astype`, or the Series constructor.
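
The three entry points named above can be sketched as follows. This is a minimal illustration, not part of the original commit: the ``size`` column, its categories, and the in-memory CSV text are hypothetical values chosen for the example.

```python
import io

import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical dtype and CSV content, used only for illustration.
size_type = CategoricalDtype(categories=["small", "medium", "large"],
                             ordered=True)
csv_text = "size\nsmall\nlarge\nsmall\n"

# 1. read_csv: parse the column directly as the given categorical type.
from_csv = pd.read_csv(io.StringIO(csv_text), dtype={"size": size_type})

# 2. DataFrame.astype: convert an existing column.
frame = pd.DataFrame({"size": ["small", "large", "small"]})
converted = frame.astype({"size": size_type})

# 3. Series constructor: pass the dtype at creation time.
series = pd.Series(["small", "large", "small"], dtype=size_type)
```

In each case the resulting column should carry the full set of declared categories, including ``"medium"``, even though that value never appears in the data.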

.. note::

As a convenience, you can use the string ``'category'`` in place of a
:class:`~pandas.api.types.CategoricalDtype` when you want the default behavior of
the categories being unordered, and equal to the set of values present in the
array. In other words, ``dtype='category'`` is equivalent to
``dtype=CategoricalDtype()``.
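
A small sketch of the equivalence described in the note, assuming a toy Series of strings (the variable names are illustrative only):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

s = pd.Series(["a", "b", "c", "a"])

# Both spellings request the default behavior: inferred, unordered categories.
by_string = s.astype("category")
by_instance = s.astype(CategoricalDtype())

same_dtype = by_string.dtype == by_instance.dtype
inferred = list(by_string.cat.categories)
is_ordered = by_string.cat.ordered
```

Here ``same_dtype`` is expected to be true, with the categories inferred from the values present in ``s``.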

Equality Semantics
~~~~~~~~~~~~~~~~~~

Two instances of :class:`~pandas.api.types.CategoricalDtype` compare equal
whenever they have the same categories and orderedness. When comparing two
unordered categoricals, the order of the ``categories`` is not considered.

.. ipython:: python
c1 = CategoricalDtype(['a', 'b', 'c'], ordered=False)
# Equal, since order is not considered when ordered=False
c1 == CategoricalDtype(['b', 'c', 'a'], ordered=False)
# Unequal, since the second CategoricalDtype is ordered
c1 == CategoricalDtype(['a', 'b', 'c'], ordered=True)
All instances of ``CategoricalDtype`` compare equal to the string ``'category'``.

.. ipython:: python
c1 == 'category'
.. warning::

Since ``dtype='category'`` is essentially ``CategoricalDtype(None, False)``,
and since all instances of ``CategoricalDtype`` compare equal to ``'category'``,
all instances of ``CategoricalDtype`` compare equal to a
``CategoricalDtype(None, False)``, regardless of ``categories`` or
``ordered``.

Description
-----------

Expand Down Expand Up @@ -184,7 +263,7 @@ It's also possible to pass in the categories in a specific order:

.. ipython:: python
s = pd.Series(list('babc')).astype('category', categories=list('abcd'))
s = pd.Series(list('babc')).astype(CategoricalDtype(list('abcd')))
s
# categories
Expand Down Expand Up @@ -301,7 +380,9 @@ meaning and certain operations are possible. If the categorical is unordered, ``
s = pd.Series(pd.Categorical(["a","b","c","a"], ordered=False))
s.sort_values(inplace=True)
s = pd.Series(["a","b","c","a"]).astype('category', ordered=True)
s = pd.Series(["a","b","c","a"]).astype(
CategoricalDtype(ordered=True)
)
s.sort_values(inplace=True)
s
s.min(), s.max()
Expand Down Expand Up @@ -401,9 +482,15 @@ categories or a categorical with any list-like object, will raise a TypeError.

.. ipython:: python
cat = pd.Series([1,2,3]).astype("category", categories=[3,2,1], ordered=True)
cat_base = pd.Series([2,2,2]).astype("category", categories=[3,2,1], ordered=True)
cat_base2 = pd.Series([2,2,2]).astype("category", ordered=True)
cat = pd.Series([1,2,3]).astype(
CategoricalDtype([3, 2, 1], ordered=True)
)
cat_base = pd.Series([2,2,2]).astype(
CategoricalDtype([3, 2, 1], ordered=True)
)
cat_base2 = pd.Series([2,2,2]).astype(
CategoricalDtype(ordered=True)
)
cat
cat_base
11 changes: 8 additions & 3 deletions doc/source/merging.rst
Expand Up @@ -830,8 +830,10 @@ The left frame.

.. ipython:: python
from pandas.api.types import CategoricalDtype
X = pd.Series(np.random.choice(['foo', 'bar'], size=(10,)))
X = X.astype('category', categories=['foo', 'bar'])
X = X.astype(CategoricalDtype(categories=['foo', 'bar']))
left = pd.DataFrame({'X': X,
'Y': np.random.choice(['one', 'two', 'three'], size=(10,))})
Expand All @@ -842,8 +844,11 @@ The right frame.

.. ipython:: python
right = pd.DataFrame({'X': pd.Series(['foo', 'bar']).astype('category', categories=['foo', 'bar']),
'Z': [1, 2]})
right = pd.DataFrame({
'X': pd.Series(['foo', 'bar'],
dtype=CategoricalDtype(['foo', 'bar'])),
'Z': [1, 2]
})
right
right.dtypes
27 changes: 27 additions & 0 deletions doc/source/whatsnew/v0.21.0.txt
Expand Up @@ -10,6 +10,8 @@ users upgrade to this version.
Highlights include:

- Integration with `Apache Parquet <https://parquet.apache.org/>`__, including a new top-level :func:`read_parquet` and :func:`DataFrame.to_parquet` method, see :ref:`here <io.parquet>`.
- New user-facing :class:`pandas.api.types.CategoricalDtype` for specifying
categoricals independent of the data, see :ref:`here <whatsnew_0210.enhancements.categorical_dtype>`.

Check the :ref:`API Changes <whatsnew_0210.api_breaking>` and :ref:`deprecations <whatsnew_0210.deprecations>` before updating.

Expand Down Expand Up @@ -89,6 +91,31 @@ This does not raise any obvious exceptions, but also does not create a new colum

Setting a list-like data structure into a new attribute now raises a ``UserWarning`` about the potential for unexpected behavior. See :ref:`Attribute Access <indexing.attribute_access>`.

.. _whatsnew_0210.enhancements.categorical_dtype:

``CategoricalDtype`` for specifying categoricals
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:class:`pandas.api.types.CategoricalDtype` has been added to the public API and
expanded to include the ``categories`` and ``ordered`` attributes. A
``CategoricalDtype`` can be used to specify the set of categories and
orderedness of an array, independent of the data themselves. This can be useful,
e.g., when converting string data to a ``Categorical`` (:issue:`14711`,
:issue:`15078`, :issue:`16015`):

.. ipython:: python

from pandas.api.types import CategoricalDtype

s = pd.Series(['a', 'b', 'c', 'a']) # strings
dtype = CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=True)
s.astype(dtype)

The ``.dtype`` property of a ``Categorical``, ``CategoricalIndex`` or a
``Series`` with categorical type will now return an instance of ``CategoricalDtype``.
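
As a quick check of the statement above, the following sketch builds one object of each kind and inspects its ``.dtype``. The data and variable names are hypothetical.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# One Categorical, reused to build an index and a Series.
cat = pd.Categorical(["a", "b", "a"], categories=["a", "b", "c"])
idx = pd.CategoricalIndex(cat)
ser = pd.Series(cat)

# Each .dtype should be a CategoricalDtype carrying categories and ordered.
dtypes = [cat.dtype, idx.dtype, ser.dtype]
all_categorical = all(isinstance(d, CategoricalDtype) for d in dtypes)
```

The ``categories`` and ``ordered`` attributes are then available directly on the dtype, for example ``ser.dtype.categories``.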

See the :ref:`CategoricalDtype docs <categorical.categoricaldtype>` for more.

.. _whatsnew_0210.enhancements.other:

Other Enhancements