# Categorical Data

From chapter 12: special ways to represent categorical data that are more efficient in memory and better suited to ML.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

A column frequently holds a small set of distinct values, encoded as strings or other objects, that repeat many times. We've already seen some of this.

In [2]:
values = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2)
values

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
dtype: object

In [3]:
values.value_counts()

apple     6
orange    2
dtype: int64

When you know category values will repeat like this, a best practice is to implement a lookup table (sometimes called a *dimension table*) that contains the distinct values, and to store the primary observations as integer codes that reference it. This matters when your tables get really big.

In [4]:
values = pd.Series([0, 1, 0, 0] * 2)
dim = pd.Series(['apple', 'orange'])

We can use the `take` method to restore the original values. 

In [5]:
dim.take(values)

0     apple
1    orange
0     apple
0     apple
0     apple
1    orange
0     apple
0     apple
dtype: object
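The codes and the lookup table above were typed in by hand. As a minimal sketch, assuming the raw strings are available in a Series (the name `raw` is made up for this example), `pd.factorize` can derive both pieces:

```python
import pandas as pd

# Hypothetical raw observations, same as the fruit example above.
raw = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2)

# pd.factorize returns integer codes plus the distinct values,
# numbered in order of first appearance.
codes, uniques = pd.factorize(raw)

values = pd.Series(codes)   # encoded observations
dim = pd.Series(uniques)    # lookup ("dimension") table

# Round trip: decoding the codes should give back the original strings.
restored = dim.take(values).reset_index(drop=True)
print(restored.equals(raw))
```

The `reset_index(drop=True)` call is only there because `take` keeps the repeated lookup positions as index labels.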

## Categorical Type in pandas

pandas has a special `Categorical` type for holding data that uses the integer-based categorical representation or *encoding*.

In [7]:
fruits = ['apple', 'orange', 'apple', 'apple'] * 2
N = len(fruits)
df = pd.DataFrame({'fruit': fruits,
                   'basket_id': np.arange(N),
                   'count': np.random.randint(3, 15, size=N),
                   'weight': np.random.uniform(0, 4, size=N),
                  },
                 columns=['basket_id', 'fruit', 'count', 'weight'])
df

   basket_id   fruit  count    weight
0          0   apple      5  0.744492
1          1  orange     14  2.270337
2          2   apple     11  0.147969
3          3   apple     13  3.322908
4          4   apple      5  1.810910
5          5  orange      9  0.368495
6          6   apple      5  1.040929
7          7   apple     14  3.859535


We can convert the "fruit" column to the categorical type as follows.

In [9]:
fruit_cat = df['fruit'].astype('category')
fruit_cat

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): ['apple', 'orange']

The values of `fruit_cat` are not a NumPy array but an instance of `pandas.Categorical`.

In [11]:
c = fruit_cat.values
type(c)

pandas.core.arrays.categorical.Categorical

In [12]:
c.categories

Index(['apple', 'orange'], dtype='object')

In [13]:
c.codes

array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)

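The `categories` attribute is an `Index` of the distinct labels, while `codes` is a NumPy array of small integers (`int8` here), one per observation.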
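To make the relationship between `codes` and `categories` concrete, here is a small sketch that decodes one back into the other. The variable names `mapping` and `decoded` are just for illustration.

```python
import pandas as pd

c = pd.Categorical(['apple', 'orange', 'apple', 'apple'] * 2)

# Each integer code is a position in the categories Index.
mapping = dict(enumerate(c.categories))

# Indexing the categories with the codes recovers the labels.
decoded = c.categories.take(c.codes)

print(mapping)
print(list(decoded))
```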

You can convert a DataFrame column to categorical by re-assigning it.

In [15]:
df['fruit'] = df['fruit'].astype('category')
df

   basket_id   fruit  count    weight
0          0   apple      5  0.744492
1          1  orange     14  2.270337
2          2   apple     11  0.147969
3          3   apple     13  3.322908
4          4   apple      5  1.810910
5          5  orange      9  0.368495
6          6   apple      5  1.040929
7          7   apple     14  3.859535


In [16]:
type(df['fruit'])

pandas.core.series.Series
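The column is still a `Series`, so `type` does not show that anything changed. A short sketch of how one might confirm the conversion instead, using a small stand-in DataFrame rather than the random one above:

```python
import pandas as pd

# Small stand-in for the DataFrame above (values are illustrative).
df = pd.DataFrame({'fruit': ['apple', 'orange', 'apple'],
                   'count': [5, 14, 11]})
df['fruit'] = df['fruit'].astype('category')

# The Series dtype, not its Python type, carries the categorical information.
is_cat = isinstance(df['fruit'].dtype, pd.CategoricalDtype)

# The backing array is the pandas.Categorical object itself.
underlying = df['fruit'].array

print(df['fruit'].dtype, is_cat, type(underlying).__name__)
```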

You can also create a `pandas.Categorical` directly from other types of Python sequences.

In [18]:
my_categories = pd.Categorical(['foo', 'bar', 'baz', 'foo', 'bar'])
my_categories

['foo', 'bar', 'baz', 'foo', 'bar']
Categories (3, object): ['bar', 'baz', 'foo']

If your input data already has the categories and codes separately, you can use the `from_codes` constructor to build the object.

In [19]:
categories = ['foo', 'bar', 'baz']
codes = [0, 1, 2, 0, 1]
my_cats_2 = pd.Categorical.from_codes(codes, categories)
my_cats_2

['foo', 'bar', 'baz', 'foo', 'bar']
Categories (3, object): ['foo', 'bar', 'baz']
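One detail worth knowing about `from_codes`: the code `-1` is the sentinel for a missing value, and codes outside the valid range are rejected. The following sketch illustrates both behaviors; the specific inputs are made up.

```python
import pandas as pd

categories = ['foo', 'bar', 'baz']
codes = [0, -1, 2, 0, 1]   # -1 marks a missing observation

cats = pd.Categorical.from_codes(codes, categories)
missing_mask = cats.isna()
print(missing_mask)

# A code that does not point at an existing category raises ValueError.
rejected = False
try:
    pd.Categorical.from_codes([0, 3], categories)
except ValueError as exc:
    rejected = True
    print('rejected:', exc)
```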

By default, there is no order to the categories in a `Categorical`, but a user can specify that it be ordered.

In [21]:
ordered_cat = pd.Categorical.from_codes(codes, categories, ordered=True)
ordered_cat

['foo', 'bar', 'baz', 'foo', 'bar']
Categories (3, object): ['foo' < 'bar' < 'baz']

In [22]:
my_cats_2.as_ordered()

['foo', 'bar', 'baz', 'foo', 'bar']
Categories (3, object): ['foo' < 'bar' < 'baz']
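What does ordering buy you? As a sketch, with a hypothetical `sizes` Series whose categories have a natural order, an ordered categorical supports `min`, `max`, comparisons, and sorting by category order rather than alphabetically:

```python
import pandas as pd

# Hypothetical example data; the category order is small < medium < large.
sizes = pd.Series(pd.Categorical(
    ['small', 'large', 'medium', 'small'],
    categories=['small', 'medium', 'large'],
    ordered=True))

smallest = sizes.min()
largest = sizes.max()

# Comparisons follow the declared category order, not string order.
bigger_than_small = sizes > 'small'

# Sorting also follows the category order.
in_order = sizes.sort_values()

print(smallest, largest)
print(bigger_than_small.tolist())
print(in_order.tolist())
```

On an unordered categorical, `min`, `max`, and `>` raise a `TypeError` instead.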

We demonstrated all of this with strings, but it can really be any immutable type.

## Computations with Categoricals

Using a `Categorical` generally behaves the same as using the non-encoded version, such as an array of strings. Some operations, like `groupby`, are faster on categoricals, and some functions can make use of the `ordered` property.

Categoricals are also used under the hood: the `qcut` function, which bins values by quantile, returns a `Categorical`.

In [23]:
np.random.seed(12345)
draws = np.random.randn(1000)
# quartile intervals using qcut
bins = pd.qcut(draws, 4)
bins

[(-0.684, -0.0101], (-0.0101, 0.63], (-0.684, -0.0101], (-0.684, -0.0101], (0.63, 3.928], ..., (-0.0101, 0.63], (-0.684, -0.0101], (-2.9499999999999997, -0.684], (-0.0101, 0.63], (0.63, 3.928]]
Length: 1000
Categories (4, interval[float64]): [(-2.9499999999999997, -0.684] < (-0.684, -0.0101] < (-0.0101, 0.63] < (0.63, 3.928]]

It might be better to have names for these quartile categories, and we can do that.

In [25]:
bins = pd.qcut(draws, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
bins

['Q2', 'Q3', 'Q2', 'Q2', 'Q4', ..., 'Q3', 'Q2', 'Q1', 'Q3', 'Q4']
Length: 1000
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']

The representation is quite efficient: each label is stored as a small integer code.

In [26]:
bins.codes[:10]

array([1, 2, 1, 1, 3, 3, 2, 2, 3, 3], dtype=int8)

Now that labels have been given to the categories, the actual bin boundaries are no longer visible. We can recover the minimum and maximum of each bin using `groupby`. To do this, we also show how to turn a `Categorical` into a Series.

In [27]:
bins = pd.Series(bins, name='quartile')
results = pd.Series(draws).groupby(bins).agg(['count', 'min', 'max']).reset_index()
results

  quartile  count       min       max
0       Q1    250 -2.949343 -0.685484
1       Q2    250 -0.683066 -0.010115
2       Q3    250 -0.010032  0.628894
3       Q4    250  0.634238  3.927528


This "quartile" column is still categorical, and retains the efficiency.

In [28]:
results['quartile']

0    Q1
1    Q2
2    Q3
3    Q4
Name: quartile, dtype: category
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']
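The `groupby` approach above reports the observed minimum and maximum in each quartile. If the bin edges themselves are wanted, one option (a sketch, not something the chapter notes cover) is to ask `qcut` for them with `retbins=True`:

```python
import numpy as np
import pandas as pd

np.random.seed(12345)
draws = np.random.randn(1000)

# With retbins=True, qcut returns the labeled bins and the edge values.
bins, edges = pd.qcut(draws, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'],
                      retbins=True)

# Four bins need five edges: the minimum, three quartile cuts, the maximum.
print(len(edges))
print(edges)
```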

### Performance and Memory gains

The gains have been claimed several times already; here the author demonstrates them on a larger dataset.

In [29]:
N = 1000000
draws = pd.Series(np.random.randn(N))
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4))

Now we convert labels to categorical.

In [30]:
categories = labels.astype('category')

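The categorical version uses about an eighth of the memory by this measure. Note that `memory_usage()` without `deep=True` counts only the 8-byte pointers of an object column, not the strings they point to, so the real saving is larger.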

In [31]:
labels.memory_usage()

8000128

In [32]:
categories.memory_usage()

1000332
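For a fuller comparison, `memory_usage` accepts `deep=True`, which also counts the memory held by the string objects. A small sketch with a reduced `N` (chosen here only to keep the example quick):

```python
import pandas as pd

N = 100_000  # smaller than above, purely to keep the sketch fast
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4))
categories = labels.astype('category')

shallow = labels.memory_usage()            # pointers only for object columns
deep = labels.memory_usage(deep=True)      # includes the strings themselves
cat_deep = categories.memory_usage(deep=True)

print(shallow, deep, cat_deep)
```

The exact byte counts depend on the pandas version and platform, so no figures are quoted here.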

GroupBy operations on the categorical version are faster as well, as the two timings below show.

In [33]:
%time draws.groupby(labels).sum()

CPU times: user 59.7 ms, sys: 5.09 ms, total: 64.8 ms
Wall time: 66.6 ms


bar    676.000912
baz    134.358220
foo   -145.600549
qux    781.703568
dtype: float64

In [34]:
%time draws.groupby(categories).sum()

CPU times: user 9.6 ms, sys: 3.36 ms, total: 13 ms
Wall time: 9.49 ms


bar    676.000912
baz    134.358220
foo   -145.600549
qux    781.703568
dtype: float64
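`%time` only works inside IPython or Jupyter and measures a single run. A rough sketch of the same comparison with the standard `timeit` module follows; the repeat count and the reduced `N` are arbitrary choices for this example, and the timings will vary by machine.

```python
import timeit

import numpy as np
import pandas as pd

N = 100_000
draws = pd.Series(np.random.randn(N))
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4))
categories = labels.astype('category')

# observed=True is passed explicitly because the default for categorical
# groupers has changed across pandas versions.
t_obj = timeit.timeit(lambda: draws.groupby(labels).sum(), number=20)
t_cat = timeit.timeit(
    lambda: draws.groupby(categories, observed=True).sum(), number=20)
print(f'strings: {t_obj:.4f}s  categorical: {t_cat:.4f}s')

# Both groupings should produce the same sums.
by_label = draws.groupby(labels).sum()
by_cat = draws.groupby(categories, observed=True).sum()
```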

### Categorical Methods

Series containing categorical data have some special methods.

In [35]:
cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2).astype('category')
cat_s

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']

The special attribute `cat` provides access to the categorical methods in a way similar to the `str` attribute on string series.

In [36]:
cat_s.cat.codes

0    0
1    1
2    2
3    3
4    0
5    1
6    2
7    3
dtype: int8

In [39]:
cat_s.cat.categories

Index(['a', 'b', 'c', 'd'], dtype='object')

Suppose we know that the actual set of categories extends beyond the four values observed in the data. In that case, we can use `set_categories` to declare the full set.

In [40]:
actual_categories = ['a', 'b', 'c', 'd', 'e']
cat_s2 = cat_s.cat.set_categories(actual_categories)
cat_s2

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (5, object): ['a', 'b', 'c', 'd', 'e']

While the data appears unchanged, the new categories are reflected in operations that use them, such as `value_counts`.

In [41]:
cat_s2.value_counts()

a    2
b    2
c    2
d    2
e    0
dtype: int64
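`set_categories` can also shrink the set of categories, which has a side effect worth seeing: values that are no longer covered become missing. The sketch below contrasts that with `add_categories`, which only ever widens the set.

```python
import pandas as pd

cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')

# Narrowing: 'c' and 'd' are no longer valid categories, so they become NaN.
narrowed = cat_s.cat.set_categories(['a', 'b'])
n_missing = int(narrowed.isna().sum())

# Widening: existing values are untouched, 'e' is appended to the categories.
widened = cat_s.cat.add_categories(['e'])

print(n_missing)
print(list(widened.cat.categories))
```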

After you filter a large `DataFrame` or `Series`, many of the categories may no longer appear in the data. The `remove_unused_categories` method trims them.

In [42]:
cat_s3 = cat_s2[cat_s2.isin(['a', 'b'])]
cat_s3

0    a
1    b
4    a
5    b
dtype: category
Categories (5, object): ['a', 'b', 'c', 'd', 'e']

In [43]:
cat_s3.cat.remove_unused_categories()

0    a
1    b
4    a
5    b
dtype: category
Categories (2, object): ['a', 'b']
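The `cat` accessor exposes a few more methods along the same lines. As a brief sketch, `rename_categories` relabels categories without touching the codes, and `reorder_categories` changes their order. The new labels are invented for the example.

```python
import pandas as pd

cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')

# Relabel two categories; the integer codes stay the same.
renamed = cat_s.cat.rename_categories({'a': 'alpha', 'b': 'beta'})

# Reverse the category order and mark the result as ordered.
reordered = cat_s.cat.reorder_categories(['d', 'c', 'b', 'a'], ordered=True)

print(list(renamed.cat.categories))
print(reordered.min(), reordered.max())
```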

### Creating dummy variables for modeling

When using statistics or ML tools, you may need to transform categorical data into *dummy variables*, also known as *one-hot* encoding. This involves creating a DataFrame with one column for each distinct category. These columns contain 1 if a sample is in the category and 0 otherwise.

Note also that this time the Series is created directly with `dtype='category'` rather than converted afterwards with `astype`.

In [44]:
cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')
cat_s

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']

`pandas.get_dummies` converts this to dummies.

In [45]:
pd.get_dummies(cat_s)

   a  b  c  d
0  1  0  0  0
1  0  1  0  0
2  0  0  1  0
3  0  0  0  1
4  1  0  0  0
5  0  1  0  0
6  0  0  1  0
7  0  0  0  1
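In practice the dummy columns usually need recognizable names and have to sit next to the other features. A minimal sketch, assuming a made-up DataFrame with a categorical `key` column: `prefix` names the new columns, and `dtype=int` requests 0/1 integers, since recent pandas versions return booleans by default.

```python
import pandas as pd

# Hypothetical data: one categorical column and one numeric feature.
df = pd.DataFrame({
    'key': pd.Series(['a', 'b', 'c', 'a'], dtype='category'),
    'value': [1.0, 2.0, 3.0, 4.0],
})

# One 0/1 column per category, named key_a, key_b, key_c.
dummies = pd.get_dummies(df['key'], prefix='key', dtype=int)

# Attach the dummy columns to the remaining features.
df_with_dummies = df[['value']].join(dummies)
print(df_with_dummies)
```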
