# Categorical Data

Categorical data in pandas contain `categories` and `codes`.

In [46]:
import pandas as pd
import numpy as np

fruits = ['apple', 'orange', 'apple', 'apple'] * 2
N = len(fruits)
df = pd.DataFrame({'fruit': fruits,
                   'basket_id': np.arange(N),
                   'count': np.random.randint(3, 15, size=N),
                   'weight': np.random.uniform(0, 4, size=N)},
                  columns=['basket_id', 'fruit', 'count', 'weight'])

df

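|   | basket_id | fruit | count | weight |
|---|-----------|-------|-------|--------|
| 0 | 0 | apple | 6 | 1.697275 |
| 1 | 1 | orange | 10 | 2.963134 |
| 2 | 2 | apple | 4 | 0.252656 |
| 3 | 3 | apple | 6 | 2.678084 |
| 4 | 4 | apple | 7 | 3.693899 |
| 5 | 5 | orange | 10 | 2.551818 |
| 6 | 6 | apple | 14 | 0.787548 |
| 7 | 7 | apple | 3 | 3.735868 |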


We can convert the `fruit` column to categorical. Note that the converted column is still a `Series`; it is the underlying `values` array that is a `pandas.Categorical`.

In [47]:
fruit_cat = df['fruit'].astype('category')
print(type(fruit_cat))
print(type(fruit_cat.values))
c = fruit_cat.values

<class 'pandas.core.series.Series'>
<class 'pandas.core.arrays.categorical.Categorical'>


A `Categorical` exposes its distinct values as `categories` and the integer position of each element within them as `codes`.

In [48]:
c.categories, c.codes

(Index(['apple', 'orange'], dtype='object'),
 array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8))
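
The relationship between the two attributes can be checked directly: looking up each code in `categories` should give back the original values. The snippet below is a minimal sketch using its own small, made-up list rather than the `df` above.

```python
import pandas as pd

# Illustrative data, not the DataFrame used above.
values = pd.Categorical(['apple', 'orange', 'apple', 'apple'])

# Each code is the position of the corresponding value in `categories`.
codes = values.codes.tolist()          # [0, 1, 0, 0]

# Taking the categories at those positions reconstructs the labels.
decoded = values.categories.take(values.codes).tolist()
print(decoded)                         # ['apple', 'orange', 'apple', 'apple']
```

Storing small integer codes plus one copy of each distinct label is what makes this representation compact when a column repeats a few values many times.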

We can create categorical data from a list using `pandas.Categorical`.

In [49]:
my_categories = pd.Categorical(['foo', 'bar', 'baz', 'foo', 'bar'])
my_categories

[foo, bar, baz, foo, bar]
Categories (3, object): [bar, baz, foo]

The `pandas.Categorical` class has a `from_codes` method that creates categorical data from existing codes and categories.

In [50]:
codes = [0, 1, 0, 2, 2, 1]
categories = ['Klingon', 'Romulan', 'Vulcan']
star_trek_races = pd.Categorical.from_codes(codes, categories)
star_trek_races

[Klingon, Romulan, Klingon, Vulcan, Vulcan, Romulan]
Categories (3, object): [Klingon, Romulan, Vulcan]

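Categorical data can be ordered or unordered. Passing `ordered=True` gives the categories a meaningful order, which the output shows with `<`.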

In [51]:
star_trek_races = pd.Categorical.from_codes(codes, categories, ordered=True)
star_trek_races

[Klingon, Romulan, Klingon, Vulcan, Vulcan, Romulan]
Categories (3, object): [Klingon < Romulan < Vulcan]
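
An ordered categorical supports operations that depend on that order, such as comparisons, `min`/`max`, and sorting. The sketch below uses a hypothetical `sizes` example (not part of the data above) because its ordering is easier to interpret.

```python
import pandas as pd

# Hypothetical example: sizes have a natural order that differs
# from alphabetical order.
sizes = pd.Categorical(['medium', 'small', 'large', 'small'],
                       categories=['small', 'medium', 'large'],
                       ordered=True)

smallest = sizes.min()                  # 'small'
largest = sizes.max()                   # 'large'

# Comparison against a scalar uses the category order.
above_small = (sizes > 'small').tolist()   # [True, False, True, False]

# Sorting follows the category order rather than the alphabet.
sorted_sizes = pd.Series(sizes).sort_values().tolist()
print(sorted_sizes)                     # ['small', 'small', 'medium', 'large']
```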

We can reorder the categories with `reorder_categories`.

In [52]:
star_trek_races = star_trek_races.reorder_categories(
    new_categories=['Romulan', 'Klingon', 'Vulcan'])
star_trek_races

[Klingon, Romulan, Klingon, Vulcan, Vulcan, Romulan]
Categories (3, object): [Romulan < Klingon < Vulcan]

Binning functions such as `qcut` return categorical data, and we can assign our own labels to the resulting bins.

In [53]:
np.random.seed(12345)
draws = np.random.randn(1000)
bins = pd.qcut(draws, q=4)  # 4 bins
bins

[(-0.684, -0.0101], (-0.0101, 0.63], (-0.684, -0.0101], (-0.684, -0.0101], (0.63, 3.928], ..., (-0.0101, 0.63], (-0.684, -0.0101], (-2.9499999999999997, -0.684], (-0.0101, 0.63], (0.63, 3.928]]
Length: 1000
Categories (4, interval[float64]): [(-2.9499999999999997, -0.684] < (-0.684, -0.0101] < (-0.0101, 0.63] < (0.63, 3.928]]

In [54]:
bins = pd.qcut(draws, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
bins

[Q2, Q3, Q2, Q2, Q4, ..., Q3, Q2, Q1, Q3, Q4]
Length: 1000
Categories (4, object): [Q1 < Q2 < Q3 < Q4]

In [55]:
tmp = pd.Series(bins)
results = (pd.Series(draws)
           .groupby(bins)
           .agg(['count', 'min', 'max'])
           .reset_index())

In [56]:
results

|   | index | count | min | max |
|---|-------|-------|-----|-----|
| 0 | Q1 | 250 | -2.949343 | -0.685484 |
| 1 | Q2 | 250 | -0.683066 | -0.010115 |
| 2 | Q3 | 250 | -0.010032 | 0.628894 |
| 3 | Q4 | 250 | 0.634238 | 3.927528 |


## Categorical Methods

Just as string methods live under `Series.str`, categorical methods live under the `Series.cat` accessor. In the example below, `cat_s` is a Series, so `cat_s.categories` raises an `AttributeError`, while `cat_s.cat.categories` works.

In [57]:
s = pd.Series(list('abcd') * 2)
cat_s = s.astype('category')
cat_s

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): [a, b, c, d]

In [58]:
cat_s.cat.categories

Index(['a', 'b', 'c', 'd'], dtype='object')
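
The difference between the two spellings can be seen in a short, self-contained sketch. The flag name `direct_access_works` is only illustrative.

```python
import pandas as pd

cat_s = pd.Series(list('abcd') * 2).astype('category')

# A Series has no `categories` attribute of its own ...
try:
    cat_s.categories
    direct_access_works = True
except AttributeError:
    direct_access_works = False

# ... but the `cat` accessor exposes both categories and codes.
categories = list(cat_s.cat.categories)   # ['a', 'b', 'c', 'd']
codes = cat_s.cat.codes.tolist()          # [0, 1, 2, 3, 0, 1, 2, 3]
print(direct_access_works, categories, codes)
```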

We can redefine the categories, for example adding a new one, with `set_categories`.

In [59]:
cat_s2 = cat_s.cat.set_categories(['a', 'b', 'c', 'd', 'e'])
cat_s2

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (5, object): [a, b, c, d, e]

The same result can be obtained more simply with `Series.cat.add_categories()`. The new categories can be a single string or a list-like object.

In [61]:
cat_s2 = cat_s.cat.add_categories('e')
print(cat_s2)
print(cat_s2.value_counts())

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (5, object): [a, b, c, d, e]
d    2
c    2
b    2
a    2
e    0
dtype: int64


Some related methods are:

- rename_categories
- reorder_categories
- remove_categories
- remove_unused_categories
- set_categories


We can remove unused categories with `remove_unused_categories`.

In [81]:
print(cat_s2.cat.categories)
cat_s2 = cat_s2.cat.remove_unused_categories()
print(cat_s2.cat.categories)

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
Index(['a', 'b', 'c', 'd'], dtype='object')
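
Unused categories also appear after filtering, because selecting rows does not shrink the set of categories. The following sketch, which uses an illustrative `subset` variable, shows this common case.

```python
import pandas as pd

cat_s = pd.Series(list('abcd') * 2).astype('category')

# Keep only the rows whose value is 'a' or 'b'.
subset = cat_s[cat_s.isin(['a', 'b'])]

# The filtered Series still carries all four categories.
before = list(subset.cat.categories)      # ['a', 'b', 'c', 'd']

# Dropping the unused ones leaves only the categories present in the data.
trimmed = subset.cat.remove_unused_categories()
after = list(trimmed.cat.categories)      # ['a', 'b']
print(before, after)
```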


The table below summarizes the categorical methods available on a Series through the `cat` accessor.

| Method | Description |
|--------|-------------|
| `add_categories` | Append new (unused) categories at end of existing categories |
| `as_ordered` | Make categories ordered |
| `as_unordered` | Make categories unordered |
| `remove_categories` | Remove categories, setting any removed values to null |
| `remove_unused_categories` | Remove any category values which do not appear in the data |
| `rename_categories` | Replace categories with indicated set of new category names; cannot change the number of categories |
| `reorder_categories` | Change the order of the existing categories (the new list must contain exactly the same categories); can also make the result ordered |
| `set_categories` | Replace the categories with the indicated set of new categories; can add or remove categories |
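
A few of these methods are sketched below on a small, made-up Series. The variable names are only illustrative.

```python
import pandas as pd

s = pd.Series(list('abca')).astype('category')

# rename_categories: same number of categories, new labels.
renamed = s.cat.rename_categories(['x', 'y', 'z'])
print(renamed.tolist())                      # ['x', 'y', 'z', 'x']

# remove_categories: values in a removed category become null.
removed = s.cat.remove_categories(['c'])
print(removed.isna().tolist())               # [False, False, True, False]
print(list(removed.cat.categories))          # ['a', 'b']

# as_ordered: mark the existing categories as ordered.
ordered = s.cat.as_ordered()
print(ordered.cat.ordered)                   # True
```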

## The `transform` function

`transform` places the following constraints on the function passed to it. The function must:

1. produce either a scalar value that can be broadcast to the shape of the group,
2. or an object of the same shape as the input group; and
3. not mutate its input.
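
The first two cases can be illustrated with a minimal sketch on a small, made-up DataFrame (the column names `key` and `value` are assumptions for this example only). In both cases the result has the same length and index as the input.

```python
import pandas as pd

# Illustrative data: two groups, 'a' and 'b'.
df = pd.DataFrame({'key': ['a', 'b', 'a', 'b'],
                   'value': [1.0, 2.0, 3.0, 6.0]})
g = df.groupby('key')['value']

# Case 1: the function returns a scalar per group,
# which is broadcast to every row of that group.
group_mean = g.transform(lambda x: x.mean())
print(group_mean.tolist())        # [2.0, 4.0, 2.0, 4.0]

# Case 2: the function returns an object shaped like the group.
demeaned = g.transform(lambda x: x - x.mean())
print(demeaned.tolist())          # [-1.0, -2.0, 1.0, 2.0]
```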

In [83]:
df2 = pd.DataFrame(np.random.randint(0, 255, 2500).reshape(100, 25),
                   index=list('ABCD') * 25)
df2.index.name = 'group'
df2.head()

| group | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 63 | 229 | 15 | 10 | 102 | 138 | 167 | 200 | 219 | 17 | ... | 203 | 214 | 120 | 45 | 7 | 135 | 51 | 0 | 214 | 79 |
| B | 165 | 109 | 149 | 244 | 197 | 87 | 155 | 7 | 138 | 187 | ... | 78 | 202 | 233 | 116 | 118 | 148 | 20 | 203 | 39 | 240 |
| C | 42 | 218 | 62 | 221 | 221 | 206 | 130 | 236 | 26 | 165 | ... | 197 | 222 | 156 | 180 | 184 | 157 | 13 | 134 | 19 | 215 |
| D | 84 | 183 | 156 | 121 | 150 | 189 | 113 | 250 | 122 | 127 | ... | 113 | 108 | 196 | 228 | 145 | 180 | 206 | 240 | 77 | 144 |
| A | 107 | 202 | 245 | 34 | 33 | 130 | 222 | 52 | 101 | 170 | ... | 169 | 77 | 20 | 165 | 8 | 113 | 127 | 185 | 65 | 126 |


In [84]:
df2.reset_index(inplace=True)
gpd = df2.groupby('group')
transformed = gpd.transform(lambda x: (x - x.mean()) / x.std())

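This standardizes each column within each group by subtracting the group mean and dividing by the group standard deviation. As a check, every group's mean should be zero up to floating-point error and its standard deviation should be one.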

In [87]:
transformed.groupby(df2.group).agg(['mean', 'std']).T

| column | statistic | A | B | C | D |
|--------|-----------|---|---|---|---|
| 0 | mean | -8.770762e-17 | 8.881784e-18 | 6.217249e-17 | -1.065814e-16 |
| 0 | std | 1.0 | 1.0 | 1.0 | 1.0 |
| 1 | mean | 3.552714e-17 | -2.309264e-16 | 1.332268e-16 | 2.997602e-17 |
| 1 | std | 1.0 | 1.0 | 1.0 | 1.0 |
| 2 | mean | -7.993606e-17 | 1.065814e-16 | -1.776357e-17 | -8.65974e-17 |
| 2 | std | 1.0 | 1.0 | 1.0 | 1.0 |
| 3 | mean | 6.883383e-17 | 2.664535e-17 | -6.217249e-17 | -9.325873e-17 |
| 3 | std | 1.0 | 1.0 | 1.0 | 1.0 |
| 4 | mean | -1.265654e-16 | 9.769963e-17 | 6.217249e-17 | 1.909584e-16 |
| 4 | std | 1.0 | 1.0 | 1.0 | 1.0 |


`apply` gives the same result here, up to floating-point error.

In [89]:
tmp = gpd.apply(lambda x: (x - x.mean()) / x.std())

In [90]:
tmp.groupby(df2.group).agg(['mean', 'std']).T

| column | statistic | A | B | C | D |
|--------|-----------|---|---|---|---|
| 0 | mean | -1.026956e-16 | -1.776357e-17 | 8.881784e-17 | -1.110223e-16 |
| 0 | std | 1.0 | 1.0 | 1.0 | 1.0 |
| 1 | mean | 8.881784e-18 | -2.220446e-16 | 1.154632e-16 | 2.997602e-17 |
| 1 | std | 1.0 | 1.0 | 1.0 | 1.0 |
| 2 | mean | -8.881784e-17 | 1.065814e-16 | -6.217249e-17 | -8.65974e-17 |
| 2 | std | 1.0 | 1.0 | 1.0 | 1.0 |
| 3 | mean | 6.883383e-17 | 3.552714e-17 | -5.329071e-17 | -9.325873e-17 |
| 3 | std | 1.0 | 1.0 | 1.0 | 1.0 |
| 4 | mean | -1.265654e-16 | 1.24345e-16 | 6.217249e-17 | 1.909584e-16 |
| 4 | std | 1.0 | 1.0 | 1.0 | 1.0 |


In [91]:
gpd.apply(lambda x: x.rank(ascending=False))

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,15,16,17,18,19,20,21,22,23,24
0,20.0,2.0,25.0,24.0,17.0,15.0,10.0,9.0,4.0,24.0,...,9.5,6.0,11.0,19.5,25.0,11.0,20.0,25.0,5.0,17.0
1,5.0,21.0,11.0,3.0,5.0,18.5,10.0,24.0,11.0,7.0,...,16.0,7.0,4.0,19.0,13.0,12.0,24.0,4.0,20.0,2.0
2,20.0,4.0,23.0,6.5,4.0,5.0,16.0,4.5,18.0,9.0,...,7.0,6.0,10.0,8.0,6.5,13.0,22.0,11.0,23.0,8.0
3,18.0,7.0,7.0,12.0,10.0,9.0,12.0,2.0,17.0,11.0,...,14.0,12.0,7.5,7.0,13.0,10.0,4.0,3.0,18.0,11.0
4,13.0,5.0,2.0,20.0,23.0,16.0,5.0,21.0,15.0,4.0,...,13.0,19.0,24.0,9.0,24.0,16.0,12.0,8.0,20.0,11.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,7.0,21.0,14.0,11.0,16.0,7.0,4.0,24.0,24.0,8.0,...,9.0,21.0,19.0,3.0,3.0,7.0,23.0,4.0,1.0,21.0
96,12.0,24.0,6.0,15.0,9.0,19.0,13.0,11.0,14.0,15.0,...,25.0,24.0,19.0,7.0,21.0,6.0,14.0,3.0,25.0,1.0
97,10.0,2.0,3.0,1.0,3.0,7.0,11.0,3.0,10.0,17.0,...,17.5,12.0,21.0,12.0,24.0,21.0,14.0,16.0,18.0,12.0
98,3.0,3.0,6.5,8.5,19.0,7.0,11.5,16.5,5.0,23.0,...,10.0,24.0,20.0,22.0,8.0,1.0,2.0,4.0,10.0,19.0
