In [1]:
import numpy as np
import pandas as pd
import seaborn as sns

## Basics

- All values of a categorical variable are either in `categories` or `np.nan`.

- Order is defined by the order of `categories`, not the lexical order of the values.

- Internally, the data structure consists of a `categories` array and an integer array of `codes`, which point to the values in the `categories` array.

- The memory usage of a categorical variable is proportional to the number of categories plus the length of the data, while that for an object dtype is a constant times the length of the data. As the number of categories approaches the length of the data, memory usage approaches that of object type.

- Categories can be useful in the following scenarios:

    - To save memory (if number of categories small relative to number of rows)
    
    - If logical order differs from lexical order (e.g. 'small', 'medium', 'large')
    
    - To signal to libraries that column should be treated as a category (e.g. for plotting)
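
As a rough illustration of the `categories`/`codes` layout and the memory claim above, here is a minimal sketch on made-up data. The series `sizes` and its values are hypothetical and not part of the titanic dataset used below; actual byte counts depend on the pandas version, so none are shown.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three distinct values repeated many times.
sizes = pd.Series(["small", "medium", "large", "small"] * 25_000, dtype="object")
cat_sizes = sizes.astype("category")

# The categories array stores each distinct value once
# (sorted lexically when inferred from the data).
print(cat_sizes.cat.categories)

# The codes array stores one small integer per row, indexing into categories.
print(cat_sizes.cat.codes[:4].tolist())  # [2, 1, 0, 2]

# Missing values are not stored in categories; they get the sentinel code -1.
with_missing = pd.Series(["small", np.nan], dtype="category")
print(with_missing.cat.codes.tolist())  # [0, -1]

# Compare memory footprints; deep=True also counts the Python string objects.
object_bytes = sizes.memory_usage(deep=True)
category_bytes = cat_sizes.memory_usage(deep=True)
print(object_bytes, category_bytes)
```

With only three categories across 100,000 rows, the categorical version should come out much smaller than the object version.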

## General best practices

Based on [this](https://towardsdatascience.com/staying-sane-while-adopting-pandas-categorical-datatypes-78dbd19dcd8a) useful article.

- Operate on category values rather than column elements. E.g. to rename categories, use `df.catvar.cat.rename_categories(*args, **kwargs)`. If there is no `cat` method for what you need, consider operating on the categories directly via `df.catvar.cat.categories`.

- Merging on categories: the two key things to remember are that 1) Pandas treats categorical variables with different categories as different data types, and 2) category merge keys will only be categories in the merged dataframe if they are of the same data types (i.e. have the same categories), otherwise they will be converted back to objects.

- Grouping on categories: remember that with `observed=False` (the long-standing default, deprecated since pandas 2.1) we group on all categories, not just those present in the data. More often than not, you'll want to use `df.groupby(catvar, observed=True)` to only use categories observed in the data.
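
The merging and grouping points can be sketched on small made-up frames. All frame and column names here (`left`, `right`, `key`, `grade`, `n`) are hypothetical, and `observed` is passed explicitly so the result does not depend on the installed pandas version's default.

```python
import pandas as pd

# Hypothetical toy frames; names are illustrative only.
left = pd.DataFrame({"key": ["a", "b"], "x": [1, 2]})
right = pd.DataFrame({"key": ["a", "c"], "y": [3, 4]})

# Inferred categories differ (a, b vs. a, c), so the dtypes differ
# and the merged key falls back to a non-categorical dtype.
left_cat = left.astype({"key": "category"})
right_cat = right.astype({"key": "category"})
mismatched = left_cat.merge(right_cat, on="key", how="outer")
print(mismatched["key"].dtype)

# Casting both keys to one shared CategoricalDtype keeps the key categorical.
shared = pd.CategoricalDtype(["a", "b", "c"])
matched = left.astype({"key": shared}).merge(
    right.astype({"key": shared}), on="key", how="outer"
)
print(matched["key"].dtype)

# Grouping: category "c" is declared but never observed in the data.
grades = pd.DataFrame(
    {"grade": pd.Series(["a", "a", "b"], dtype=shared), "n": [1, 2, 3]}
)
all_groups = grades.groupby("grade", observed=False)["n"].sum()   # a, b, c
seen_groups = grades.groupby("grade", observed=True)["n"].sum()   # a, b
print(all_groups)
print(seen_groups)
```

The unobserved group `c` appears in `all_groups` with a sum of 0 and is absent from `seen_groups`.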

In [2]:
titanic = sns.load_dataset("titanic")
titanic.head(2)

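| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |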


## Operations I frequently use

### Renaming categories

In [3]:
titanic["class"].cat.rename_categories(str.upper)[:2]

0    THIRD
1    FIRST
Name: class, dtype: category
Categories (3, object): ['FIRST', 'SECOND', 'THIRD']

### Appending new categories

In [4]:
titanic["class"].cat.add_categories(["Fourth"]).cat.categories

Index(['First', 'Second', 'Third', 'Fourth'], dtype='object')

### Removing categories

In [5]:
titanic["class"].cat.remove_categories(["Third"]).cat.categories

Index(['First', 'Second'], dtype='object')

### Remove unused categories

In [6]:
titanic_small = titanic.iloc[:2]
titanic_small

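| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |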


In [7]:
titanic_small["class"].cat.remove_unused_categories().cat.categories

Index(['First', 'Third'], dtype='object')

### Remove and add categories simultaneously

In [8]:
titanic["class"].value_counts(dropna=False)

Third     491
First     216
Second    184
Name: class, dtype: int64

In [9]:
titanic["class"].cat.set_categories(["First", "Third", "Fourth"]).value_counts(
    dropna=False
)

Third     491
First     216
NaN       184
Fourth      0
Name: class, dtype: int64

### Using string and datetime accessors

This works as expected. If the number of distinct categories is small relative to the number of rows, operating on a categorical column is also faster: under the hood, pandas applies the operation to `categories` and constructs a new series from the result (see [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html#string-and-datetime-accessors)), so there is no need to do this manually, as I was inclined to.

In [10]:
cat_class = titanic["class"]
%timeit cat_class.str.contains('d')

str_class = titanic["class"].astype("object")
%timeit str_class.str.contains('d')

149 µs ± 7.84 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
398 µs ± 16.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
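
To make the mechanism concrete, here is a rough sketch of what doing this by hand could look like: evaluate the string method once per category, then expand the result to rows through the codes. This illustrates the idea only; it is not pandas' actual implementation. The small series is a hypothetical stand-in for the titanic `class` column, and it assumes there are no missing values (a code of -1 would index the wrong element).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the titanic "class" column (no missing values).
cat_class = pd.Series(["Third", "First", "Second", "Third"], dtype="category")

# Evaluate the predicate once per category rather than once per row...
per_category = np.asarray(cat_class.cat.categories.str.contains("d"), dtype=bool)

# ...then broadcast the per-category results to rows via the integer codes.
codes = cat_class.cat.codes.to_numpy()
manual = pd.Series(per_category[codes], index=cat_class.index)

print(manual.tolist())                         # [True, False, True, True]
print(cat_class.str.contains("d").tolist())    # same values via the accessor
```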


## Object creation

Convert *sex* and *who* to the same categorical type, with categories being the union of all unique values of both columns.

In [11]:
cols = ["sex", "who"]
unique_values = np.unique(titanic[cols].to_numpy().ravel())
categories = pd.CategoricalDtype(categories=unique_values)
titanic[cols] = titanic[cols].astype(categories)
print(titanic.sex.cat.categories)
print(titanic.who.cat.categories)

Index(['child', 'female', 'male', 'man', 'woman'], dtype='object')
Index(['child', 'female', 'male', 'man', 'woman'], dtype='object')


In [12]:
# restore sex and who to object types
titanic[cols] = titanic[cols].astype("object")

## Custom order 

In [13]:
df = pd.DataFrame({"quality": ["good", "excellent", "very good"]})
df.sort_values("quality")

| | quality |
|---|---|
| 1 | excellent |
| 0 | good |
| 2 | very good |


In [14]:
ordered_quality = pd.CategoricalDtype(["good", "very good", "excellent"], ordered=True)
df.quality = df.quality.astype(ordered_quality)
df.sort_values("quality")

| | quality |
|---|---|
| 0 | good |
| 2 | very good |
| 1 | excellent |
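
Beyond sorting, an ordered dtype also enables comparisons and `min`/`max` that follow the declared order. A small sketch, reusing the same category order as above on a standalone series (the variable `quality` is introduced here for illustration only):

```python
import pandas as pd

ordered_quality = pd.CategoricalDtype(["good", "very good", "excellent"], ordered=True)
quality = pd.Series(["good", "excellent", "very good"], dtype=ordered_quality)

# Comparisons follow the declared category order, not alphabetical order.
better_than_good = quality > "good"
print(better_than_good.tolist())  # [False, True, True]

# min and max also respect the declared order.
print(quality.min(), quality.max())  # good excellent
```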


## Unique values

In [15]:
small_titanic = titanic.iloc[:2]
small_titanic

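| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |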


`Series.unique` returns values in order of appearance, and only returns values that are present in the data.

In [16]:
small_titanic["class"].unique()

['Third', 'First']
Categories (3, object): ['First', 'Second', 'Third']

`Series.cat.categories` returns all category values.

In [17]:
small_titanic["class"].cat.categories

Index(['First', 'Second', 'Third'], dtype='object')

## References

- [Docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html#object-creation)

- [Useful Medium article](https://towardsdatascience.com/staying-sane-while-adopting-pandas-categorical-datatypes-78dbd19dcd8a)