# [Categorical data](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html#categorical-data)

Categoricals are a pandas data type for categorical variables, similar to factors in R but with some differences.

In [1]:
import pandas as pd

In [2]:
s = pd.Series(["a", "b", "c", "a"], dtype="category")
s

0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']

In [3]:
df = pd.DataFrame({"A": ["a", "b", "c", "a"]})
df["B"] = df["A"].astype("category")
df

   A  B
0  a  a
1  b  b
2  c  c
3  a  a


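## `cat` accessor attributes

* **categories**: the categories of the series
* **ordered**: whether the categories are ordered
* **codes**: each category is assigned a unique integer code, determined by its position in `cat.categories`; the per-value codes are available through `codes`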

In [16]:
df = pd.read_csv('pd_data/learn_pandas.csv', usecols = ['Grade', 'Name', 'Gender', 'Height', 'Weight'])

In [17]:
df.Grade = df.Grade.astype('category')
df.dtypes

Grade     category
Name        object
Gender      object
Height     float64
Weight     float64
dtype: object

In [28]:
# categories are stored in an Index
df.Grade.cat.categories

Index(['Freshman', 'Junior', 'Senior', 'Sophomore'], dtype='object')

In [26]:
df.Grade.cat.ordered

False

In [27]:
df.Grade.cat.codes

0      0
1      0
2      2
3      3
4      3
      ..
195    1
196    2
197    2
198    2
199    3
Length: 200, dtype: int8
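
The relationship between `codes` and `categories` can be sketched on a small, self-contained series. The sample values below are hypothetical and are not taken from `learn_pandas.csv`; the sketch assumes the default behavior in which inferred categories are sorted and missing values receive the code `-1`.

```python
import pandas as pd

# Hypothetical sample data; None becomes a missing value.
s = pd.Series(["b", "a", None, "c", "a"], dtype="category")

categories = s.cat.categories  # inferred categories, sorted
codes = s.cat.codes            # position of each value in categories; -1 marks missing

print(list(categories))  # ['a', 'b', 'c']
print(codes.tolist())    # [1, 0, -1, 2, 0]

# Rebuild the original labels by looking each code up in categories.
decoded = [categories[c] if c != -1 else None for c in codes]
print(decoded)           # ['b', 'a', None, 'c', 'a']
```

Because the codes are positions into `categories`, changing the order of the categories changes the codes without changing the values themselves.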

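## Adding, modifying, and removing categories

* **add_categories**: adds categories
* **remove_categories**: removes categories; values in the series that belonged to a removed category are set to missing
* **remove_unused_categories**: removes categories that do not appear in the series
* **set_categories**: sets new categories; values whose category is not among the new categories are set to missing
* **rename_categories**: renames categories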

In [47]:
s = pd.Series(["a", "b", "c", "a"], dtype="category")

In [48]:
s.cat.add_categories('d')

0    a
1    b
2    c
3    a
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']

In [49]:
s = s.cat.add_categories('d')

In [50]:
s.cat.remove_unused_categories()

0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']

In [51]:
s.cat.remove_categories('a')

0    NaN
1      b
2      c
3    NaN
dtype: category
Categories (3, object): ['b', 'c', 'd']

In [52]:
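# Assigning to s.cat.categories was removed in pandas 2.0; rename_categories returns a new Series
s = s.cat.rename_categories(["Group %s" % g for g in s.cat.categories])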
s

0    Group a
1    Group b
2    Group c
3    Group a
dtype: category
Categories (4, object): ['Group a', 'Group b', 'Group c', 'Group d']

In [54]:
s.cat.rename_categories([1, 2, 3, 4])

0    1
1    2
2    3
3    1
dtype: category
Categories (4, int64): [1, 2, 3, 4]

In [55]:
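# remove_unused_categories returns a new Series; the result is discarded here, so s keeps 'Group d'
s.cat.remove_unused_categories()
s.cat.rename_categories({'Group a': "x", 'Group b': "y", 'Group c': "z"})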

0    x
1    y
2    z
3    x
dtype: category
Categories (4, object): ['x', 'y', 'z', 'Group d']

In [60]:
s.cat.set_categories(['Group a', 'e', 'f'])

0    Group a
1        NaN
2        NaN
3    Group a
dtype: category
Categories (3, object): ['Group a', 'e', 'f']
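
The difference between `rename_categories` and `set_categories` is easy to miss, so here is a minimal sketch that contrasts them on a fresh series. The variable names `renamed` and `reset` are illustrative only. It also shows that these methods return a new series rather than modifying the original.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# rename_categories relabels the existing categories by position; the codes stay the same.
renamed = s.cat.rename_categories(["x", "y", "z"])
print(renamed.tolist())        # ['x', 'y', 'z', 'x']

# set_categories matches by value; values absent from the new list become missing.
reset = s.cat.set_categories(["a", "e", "f"])
print(reset.isna().tolist())   # [False, True, True, False]

# Neither call changed s itself.
print(list(s.cat.categories))  # ['a', 'b', 'c']
```

To keep a change, assign the result back, for example `s = s.cat.set_categories([...])`.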

## Ordered categories