# Chapter 12: Advanced pandas

## 12.1 Categorical Data

### 12.1.1 Background and Motivation

`unique` and `value_counts` let us extract the distinct values from an array and compute their frequencies.

In [2]:
import numpy as np
import pandas as pd

In [3]:
values = pd.Series(['apple', 'orange', 'apple',
                    'apple'] * 2)
values

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
dtype: object

In [4]:
pd.unique(values)

array(['apple', 'orange'], dtype=object)

In [6]:
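# The top-level pd.value_counts is deprecated in recent pandas releases;
# the Series method is the equivalent call
values.value_counts()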

apple     6
orange    2
dtype: int64

In [7]:
values = pd.Series([0, 1, 0, 0] * 2)
dim = pd.Series(['apple', 'orange'])
values

0    0
1    1
2    0
3    0
4    0
5    1
6    0
7    0
dtype: int64

In [8]:
dim

0     apple
1    orange
dtype: object

In [9]:
dim.take(values)

0     apple
1    orange
0     apple
0     apple
0     apple
1    orange
0     apple
0     apple
dtype: object

**Categorical representation** - Storing an array of repeated values as integers that index into a smaller array of distinct values, as `values` and `dim` do above.

**Categories** - The array of distinct values.

**Category codes** - The integer values that reference the categories.
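The three terms above can be seen together with `pd.factorize`, which splits an array into integer codes and its distinct values. This is a minimal sketch that reuses the fruit data from earlier; the names `fruit_names`, `codes`, `categories`, and `restored` are illustrative and do not come from the notebook.

```python
import pandas as pd

fruit_names = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2)

# factorize returns the integer codes and the distinct values,
# with the distinct values in order of first appearance
codes, categories = pd.factorize(fruit_names)

# Looking the codes up in the categories rebuilds the original values
restored = categories.take(codes)

print(codes)
print(categories)
```

Here `codes` plays the role of `values` and `categories` the role of `dim` in the cells above.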

### 12.1.2 Categorical Type in pandas

In [10]:
fruits = ['apple', 'orange', 'apple', 'apple'] * 2
N = len(fruits)
df = pd.DataFrame({'fruit': fruits,
                   'basket_id': np.arange(N),
                   'count': np.random.randint(3, 15, size=N),
                   'weight': np.random.uniform(0, 4, size=N)},
                  columns=['basket_id', 'fruit', 'count', 'weight'])
df

   basket_id   fruit  count    weight
0          0   apple      4  2.435154
1          1  orange      9  0.357263
2          2   apple     14  0.900945
3          3   apple      3  3.269101
4          4   apple      8  1.112761
5          5  orange     10  3.519473
6          6   apple      4  0.508921
7          7   apple      9  2.946574


In [11]:
fruit_cat = df['fruit'].astype('category')
fruit_cat

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): ['apple', 'orange']

In [12]:
# The values of fruit_cat are a pandas Categorical, not a NumPy array

c = fruit_cat.values
type(c)

pandas.core.arrays.categorical.Categorical

In [13]:
c.categories

Index(['apple', 'orange'], dtype='object')

In [14]:
c.codes

array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)
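One way to read the `categories` and `codes` attributes together is to pair each code with its category. The sketch below builds its own small `Categorical` so it runs on its own; `code_to_category` and `labels` are illustrative names.

```python
import pandas as pd

c = pd.Categorical(['apple', 'orange', 'apple', 'apple'] * 2)

# Pair each integer code with the category it refers to
code_to_category = dict(enumerate(c.categories))

# Indexing the categories with the codes recovers the original labels
labels = c.categories[c.codes]

print(code_to_category)
print(list(labels))
```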

In [15]:
df['fruit'] = df['fruit'].astype('category')
df.fruit

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): ['apple', 'orange']
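A common reason to convert a column like this is memory: each row stores a small integer code instead of a full string. The following is a rough sketch on made-up data (8,000 repeated labels), and the exact byte counts depend on the pandas version and platform, so none are quoted here.

```python
import pandas as pd

# Hypothetical data: many repeats of a few distinct labels
labels = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2000)
as_category = labels.astype('category')

# deep=True also counts the memory held by the string values themselves
string_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(string_bytes, category_bytes)
```

The saving grows with the number of rows and shrinks as the number of distinct values approaches the number of rows.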

In [16]:
my_categories = pd.Categorical(['foo', 'bar', 'baz', 'foo', 'bar'])
my_categories

['foo', 'bar', 'baz', 'foo', 'bar']
Categories (3, object): ['bar', 'baz', 'foo']
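Note that the categories in the output above are sorted (`bar`, `baz`, `foo`) rather than listed in order of appearance. If a particular order matters, the `categories` argument of `pd.Categorical` can set it explicitly. A minimal sketch, with `default_cat` and `explicit_cat` as illustrative names:

```python
import pandas as pd

data = ['foo', 'bar', 'baz', 'foo', 'bar']

# Default: the categories are the sorted distinct values
default_cat = pd.Categorical(data)

# Explicit: the categories keep the order given here
explicit_cat = pd.Categorical(data, categories=['foo', 'bar', 'baz'])

print(default_cat.categories, default_cat.codes)
print(explicit_cat.categories, explicit_cat.codes)
```

The values are the same in both cases; only the category order, and therefore the codes, differ.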

In [17]:
categories = ['foo', 'bar', 'baz']
codes = [0, 1, 2, 0, 0, 1]
my_cats_2 = pd.Categorical.from_codes(codes, categories)
my_cats_2

['foo', 'bar', 'baz', 'foo', 'foo', 'bar']
Categories (3, object): ['foo', 'bar', 'baz']

Unless explicitly specified, categorical conversions assume no specific ordering of the categories.

In [18]:
ordered_cat = pd.Categorical.from_codes(codes, categories,
                                        ordered=True)
ordered_cat

['foo', 'bar', 'baz', 'foo', 'foo', 'bar']
Categories (3, object): ['foo' < 'bar' < 'baz']

The output `['foo' < 'bar' < 'baz']` indicates that `'foo'` precedes `'bar'` in the ordering and so on.

In [19]:
# Unordered to ordered
my_cats_2.as_ordered()

['foo', 'bar', 'baz', 'foo', 'foo', 'bar']
Categories (3, object): ['foo' < 'bar' < 'baz']
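To illustrate what the ordering makes possible, the sketch below rebuilds the ordered categorical from the cells above and tries `min`, `max`, and a comparison against a category. The names `smallest`, `largest`, `below_baz`, and `unordered_min_failed` are illustrative.

```python
import pandas as pd

categories = ['foo', 'bar', 'baz']
codes = [0, 1, 2, 0, 0, 1]
ordered_cat = pd.Categorical.from_codes(codes, categories, ordered=True)

# With an ordering, min/max and comparisons follow foo < bar < baz
smallest = ordered_cat.min()
largest = ordered_cat.max()
below_baz = ordered_cat < 'baz'

# Sorting follows the category order, not alphabetical order
sorted_values = pd.Series(ordered_cat).sort_values().tolist()

# Without an ordering, min() is not defined and raises TypeError
unordered = ordered_cat.as_unordered()
try:
    unordered.min()
    unordered_min_failed = False
except TypeError:
    unordered_min_failed = True

print(smallest, largest, below_baz, sorted_values)
```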

### 12.1.3 Computations with Categoricals

### 12.1.4 Categorical Methods

## 12.2 Advanced GroupBy Use

### 12.2.1 Group Transforms and "Unwrapped" GroupBys

### 12.2.2 Grouped Time Resampling

## 12.3 Techniques for Method Chaining

### 12.3.1 The pipe Method