# 10 Minutes to pandas
http://crazygit.wiseturtles.com/2017/12/20/10-minutes-to-pandas/
http://pandas.pydata.org/pandas-docs/stable/10min.html

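## 1. Installation ##
## 2. Overview ##
The basic data structures in pandas:
* Series: one-dimensional data
* DataFrame: two-dimensional data
* Panel: three-dimensional data (deprecated since version 0.20.0)
* Panel4D, PanelND (deprecated)

A DataFrame is made up of Series: each column is a Series.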
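
The relationship between the two structures can be checked directly. This is a minimal sketch; the frame and its column names `x` and `y` are made up for illustration.

```python
import pandas as pd

# Hypothetical two-column frame, used only for illustration.
df = pd.DataFrame({"x": [1, 2, 3], "y": [0.5, 1.5, 2.5]})

col = df["x"]                        # selecting one column returns a Series
print(type(col).__name__)
print(col.index.equals(df.index))    # the column shares the frame's row index
```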

In [23]:
import pandas as pd
import numpy as np

## 3. Creating a Series ##
The simplest way to create a Series:

s = pd.Series(data, index=index)

data can be of several types:
* a Python dict
* an ndarray
* a scalar (for example, 5)

### 3.1. From ndarray
If data is an ndarray, index must have the same length as data. When no index argument is given, [0, ..., len(data) - 1] is used as the index.

In [90]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s

a   -0.450512
b    1.225196
c   -2.107209
d   -2.127904
e    0.887104
dtype: float64

In [91]:
s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [92]:
pd.Series(np.random.randn(5))

0    0.415735
1    1.093092
2    0.398761
3   -0.524604
4   -0.617224
dtype: float64
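
The length rule can be seen by passing an index that is too short. This is a small sketch with made-up values; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

values = np.arange(3)

# Without an explicit index, positions 0..len(data)-1 are used.
default_idx = pd.Series(values)
print(list(default_idx.index))

# An index whose length differs from the data is rejected.
try:
    pd.Series(values, index=["a", "b"])
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc
print(type(mismatch_error).__name__)
```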

In [94]:
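# Note: pandas does not require index labels to be unique. An operation that does not support duplicate labels raises an exception when it is attempted. Uniqueness is not enforced up front because many operations, such as GroupBy, never use the index.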
s = pd.Series(np.random.randn(5), index=['a', 'a', 'a', 'a', 'a'])
s

a    1.391142
a    0.555436
a    1.110584
a    0.088797
a    0.292238
dtype: float64

In [95]:
s.index

Index(['a', 'a', 'a', 'a', 'a'], dtype='object')
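
A short sketch of how duplicate labels behave in practice, using made-up data. Operations that ignore the index work as usual, a label lookup returns every match, and `reindex` is one example of an operation that refuses duplicates.

```python
import pandas as pd

s_dup = pd.Series([1.0, 2.0, 3.0], index=["a", "a", "b"])

print(s_dup.index.is_unique)   # duplicates are allowed, so this is False
print(s_dup.sum())             # the index plays no role here
print(s_dup["a"])              # label lookup returns all matching rows

# reindex needs unique labels and raises when they are not.
try:
    s_dup.reindex(["a", "b", "c"])
    reindex_error = None
except ValueError as exc:
    reindex_error = exc
print(type(reindex_error).__name__)
```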

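### 3.2. From dict
When data is a dict and an index argument is given, the values are looked up by those labels. Otherwise the index is built from the dict keys: sorted in the pandas version used here, and in insertion order on pandas 0.23+ with Python 3.6+.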

In [96]:
d = {'b': 0., 'a': 1., 'c': 2.}

# The index labels come out sorted here
pd.Series(d)

a    1.0
b    0.0
c    2.0
dtype: float64

In [97]:
# Labels that are not keys of the dict get the value NaN (Not a Number)
pd.Series(d, index=['b', 'c', 'd', 'a'])

b    0.0
c    2.0
d    NaN
a    1.0
dtype: float64

### 3.3. From scalar value
When data is a scalar, an index must be provided; the value is repeated to match the length of index.

In [98]:
pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])

a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

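## 4. Creating a DataFrame ##
  * A DataFrame is a two-dimensional data structure. It can be thought of as an Excel spreadsheet, a SQL table, or a dict of Series. Like a Series, a DataFrame can be created from several kinds of input:
    * a dict (of 1-D ndarrays, lists, dicts, or Series)
    * a 2-D ndarray
    * a structured ndarray
    * a Series
    * another DataFrame

  * Besides data, the constructor also accepts **index** and **columns** arguments, which specify the row labels and column labels respectively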


### 4.1 From dict of Series or dicts ###
  * The resulting index is the union of the indexes of the individual Series. If columns is not given, the dict keys are used as the column labels (sorted, in the pandas version used here).

In [5]:
d = {'one': pd.Series([1,2,3], index=['a', 'b', 'c']),
     'two': pd.Series([1,2,3,4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
df

Unnamed: 0,one,two
a,1.0,1
b,2.0,2
c,3.0,3
d,,4


In [6]:
pd.DataFrame(d, index=['d', 'b', 'a'])

Unnamed: 0,one,two
d,,4
b,2.0,2
a,1.0,1


In [7]:
pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])

Unnamed: 0,two,three
d,4,
b,2,
a,1,


In [8]:
df.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [9]:
df.columns

Index(['one', 'two'], dtype='object')
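
The index union also explains why column `one` prints as floats while `two` stays integer in the output above: the shorter Series has no value for label `d`, so pandas fills NaN, which needs a float dtype. A small sketch that rebuilds the same frame and checks this:

```python
import pandas as pd

one = pd.Series([1, 2, 3], index=["a", "b", "c"])
two = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])
df = pd.DataFrame({"one": one, "two": two})

print(list(df.index))          # union of both indexes
print(df.dtypes)               # 'one' is upcast to float64 to hold NaN
print(df["one"].isna().sum())  # number of labels missing from 'one'
```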

### 4.2. From dict of ndarrays / lists

In [105]:
d = {'one': [1,2,3,4], 'two': [4,3,2,1]}
pd.DataFrame(d)

Unnamed: 0,one,two
0,1,4
1,2,3
2,3,2
3,4,1


In [106]:
pd.DataFrame(d, index=['a', 'b', 'c', 'd'])

Unnamed: 0,one,two
a,1,4
b,2,3
c,3,2
d,4,1


### 4.3. From structured or record array

In [107]:
data = np.zeros((2, ), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'S10')])  # 'S10' is a 10-byte string; the older alias 'a10' is not accepted by recent NumPy
data

array([(0,  0., b''), (0,  0., b'')],
      dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])

In [108]:
data[:] = [(1, 2., 'Hello'), (2, 3., 'World')]
pd.DataFrame(data)

Unnamed: 0,A,B,C
0,1,2.0,b'Hello'
1,2,3.0,b'World'


In [109]:
pd.DataFrame(data, index=['first', 'second'])

Unnamed: 0,A,B,C
first,1,2.0,b'Hello'
second,2,3.0,b'World'


In [110]:
pd.DataFrame(data, index=['first', 'second'], columns=['C', 'A', 'B'])

Unnamed: 0,C,A,B
first,b'Hello',1,2.0
second,b'World',2,3.0


### 4.4. From a list of dicts

In [100]:
data2 = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]
pd.DataFrame(data2)

Unnamed: 0,a,b,c
0,1,2,
1,5,10,20.0


In [103]:
pd.DataFrame(data2, index=["first", "second"])

Unnamed: 0,a,b,c
first,1,2,
second,5,10,20.0


In [104]:
pd.DataFrame(data2, columns=["a", "b"])

Unnamed: 0,a,b
0,1,2
1,5,10


### 4.5. From a dict of tuples
A dict of tuples can be used to create a DataFrame with a MultiIndex

In [111]:
pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
              ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
              ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
              ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
              ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})

Unnamed: 0_level_0,Unnamed: 1_level_0,a,a,a,b,b
Unnamed: 0_level_1,Unnamed: 1_level_1,a,b,c,a,b
A,B,4.0,1.0,5.0,8.0,10.0
A,C,3.0,2.0,6.0,7.0,
A,D,,,,,9.0


### 4.6. From a Series

In [99]:
pd.DataFrame(pd.Series([1,2,3]))

Unnamed: 0,0
0,1
1,2
2,3


## 5. Viewing data ##

In [79]:
dates = pd.date_range('20130101', periods=6)
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [80]:
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
df

Unnamed: 0,A,B,C,D
2013-01-01,0.638738,0.21129,0.070305,-0.262013
2013-01-02,0.172626,-0.578972,0.421645,2.082302
2013-01-03,-0.71844,1.332991,1.216145,1.349412
2013-01-04,-0.603157,0.300095,2.37634,0.405867
2013-01-05,0.543936,-0.285008,0.099598,-1.22905
2013-01-06,-1.182834,-0.730433,1.474248,2.23893


In [81]:
# Get the first rows (5 by default)
df.head()

Unnamed: 0,A,B,C,D
2013-01-01,0.638738,0.21129,0.070305,-0.262013
2013-01-02,0.172626,-0.578972,0.421645,2.082302
2013-01-03,-0.71844,1.332991,1.216145,1.349412
2013-01-04,-0.603157,0.300095,2.37634,0.405867
2013-01-05,0.543936,-0.285008,0.099598,-1.22905


In [82]:
# Get the last 3 rows
df.tail(3)

Unnamed: 0,A,B,C,D
2013-01-04,-0.603157,0.300095,2.37634,0.405867
2013-01-05,0.543936,-0.285008,0.099598,-1.22905
2013-01-06,-1.182834,-0.730433,1.474248,2.23893


In [83]:
# Get the index
df.index

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [84]:
# Get the column labels
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [85]:
# Get the underlying data as a NumPy array
df.values

array([[ 0.63873774,  0.21129015,  0.07030501, -0.26201344],
       [ 0.17262643, -0.57897238,  0.42164531,  2.08230237],
       [-0.71844047,  1.3329914 ,  1.21614512,  1.34941194],
       [-0.60315708,  0.30009494,  2.37633991,  0.405867  ],
       [ 0.54393604, -0.28500779,  0.09959785, -1.22904988],
       [-1.18283398, -0.73043291,  1.47424789,  2.2389298 ]])

In [86]:
# Get a quick statistical summary
df.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,-0.191522,0.041661,0.943047,0.764241
std,0.747345,0.755132,0.911705,1.371806
min,-1.182834,-0.730433,0.070305,-1.22905
25%,-0.68962,-0.505481,0.18011,-0.095043
50%,-0.215265,-0.036859,0.818895,0.877639
75%,0.451109,0.277894,1.409722,1.89908
max,0.638738,1.332991,2.37634,2.23893


In [87]:
# Transpose
df.T

Unnamed: 0,2013-01-01 00:00:00,2013-01-02 00:00:00,2013-01-03 00:00:00,2013-01-04 00:00:00,2013-01-05 00:00:00,2013-01-06 00:00:00
A,0.638738,0.172626,-0.71844,-0.603157,0.543936,-1.182834
B,0.21129,-0.578972,1.332991,0.300095,-0.285008,-0.730433
C,0.070305,0.421645,1.216145,2.37634,0.099598,1.474248
D,-0.262013,2.082302,1.349412,0.405867,-1.22905,2.23893


In [88]:
# Sort by the values of a column
df.sort_values(by='B')

Unnamed: 0,A,B,C,D
2013-01-06,-1.182834,-0.730433,1.474248,2.23893
2013-01-02,0.172626,-0.578972,0.421645,2.082302
2013-01-05,0.543936,-0.285008,0.099598,-1.22905
2013-01-01,0.638738,0.21129,0.070305,-0.262013
2013-01-04,-0.603157,0.300095,2.37634,0.405867
2013-01-03,-0.71844,1.332991,1.216145,1.349412


## 6. Selecting data ##

### 6.1. Getting

### 6.2. Selection by label

### 6.3. Selection by position

### 6.4. Boolean indexing

### 6.5. Setting

## 7. Missing data ##
pandas uses **np.nan** to represent missing data. By default it is excluded from computations.

In [89]:
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1

Unnamed: 0,A,B,C,D,E
2013-01-01,0.638738,0.21129,0.070305,-0.262013,
2013-01-02,0.172626,-0.578972,0.421645,2.082302,
2013-01-03,-0.71844,1.332991,1.216145,1.349412,
2013-01-04,-0.603157,0.300095,2.37634,0.405867,
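
Once a frame contains NaN, the usual next steps are to drop, fill, or locate the missing cells. A minimal sketch on a small made-up frame (the values are not taken from the output above):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a partly empty column 'E'.
df1 = pd.DataFrame({"A": [1.0, 2.0, 3.0], "E": [1.0, np.nan, np.nan]})

complete_rows = df1.dropna(how="any")   # keep only rows without any NaN
filled = df1.fillna(value=5)            # replace NaN with a constant
mask = pd.isna(df1)                     # boolean mask of the missing cells

print(complete_rows)
print(filled)
print(mask)
```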


## 8. Operations ##

### 8.1. Stats ###
Operations generally exclude NaN values
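
A small sketch of what excluding NaN means for a reduction, using a made-up Series. The `skipna` argument turns the behaviour off.

```python
import numpy as np
import pandas as pd

s_nan = pd.Series([1.0, np.nan, 3.0])

mean_default = s_nan.mean()              # NaN skipped: (1 + 3) / 2
mean_strict = s_nan.mean(skipna=False)   # NaN propagates into the result
non_missing = s_nan.count()              # counts only non-missing values

print(mean_default, mean_strict, non_missing)
```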

In [49]:
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6,4),index=dates,columns=list('ABCD'))
df

Unnamed: 0,A,B,C,D
2013-01-01,0,1,2,3
2013-01-02,4,5,6,7
2013-01-03,8,9,10,11
2013-01-04,12,13,14,15
2013-01-05,16,17,18,19
2013-01-06,20,21,22,23


In [51]:
# Mean of each column
df.mean()          # same as df.mean(0) and df.mean(axis=0)

A    10.0
B    11.0
C    12.0
D    13.0
dtype: float64

In [52]:
# Mean of each row
df.mean(1)

2013-01-01     1.5
2013-01-02     5.5
2013-01-03     9.5
2013-01-04    13.5
2013-01-05    17.5
2013-01-06    21.5
Freq: D, dtype: float64

In [54]:
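# In the pandas version used here, axis=None behaves like axis=0.
# Since pandas 2.0, axis=None reduces over both axes and returns a single number.
df.mean(axis=None)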

A    10.0
B    11.0
C    12.0
D    13.0
dtype: float64

In [55]:
# shift(n) moves the values down by n positions along the index; the vacated slots become NaN
s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)
s

2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    5.0
2013-01-06    NaN
Freq: D, dtype: float64

In [58]:
# sub subtracts s from every column of df, matching on the row index (plain df - s would match s against the column labels instead)
df.sub(s, axis='index')      # same as df.sub(s, axis=0)

Unnamed: 0,A,B,C,D
2013-01-01,,,,
2013-01-02,,,,
2013-01-03,7.0,8.0,9.0,10.0
2013-01-04,9.0,10.0,11.0,12.0
2013-01-05,11.0,12.0,13.0,14.0
2013-01-06,,,,
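
The `axis` argument decides which labels the Series is matched against. The sketch below uses a made-up frame and offsets to contrast `sub(..., axis='index')` with the plain `-` operator, which matches the Series labels against the column labels.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(3, 2),
                  index=["r0", "r1", "r2"], columns=["A", "B"])
row_offsets = pd.Series([0, 10, 20], index=["r0", "r1", "r2"])

by_rows = df.sub(row_offsets, axis="index")  # one offset per row
by_cols = df - row_offsets                   # labels r0..r2 are not columns

print(by_rows)
print(by_cols)   # no labels in common, so every cell is NaN
```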


### 8.2. Apply ###

In [59]:
df

Unnamed: 0,A,B,C,D
2013-01-01,0,1,2,3
2013-01-02,4,5,6,7
2013-01-03,8,9,10,11
2013-01-04,12,13,14,15
2013-01-05,16,17,18,19
2013-01-06,20,21,22,23


In [62]:
# Cumulative sum down each column
df.apply(np.cumsum)      # same as df.apply(np.cumsum, axis='index'), df.apply(np.cumsum, axis=0)

Unnamed: 0,A,B,C,D
2013-01-01,0,1,2,3
2013-01-02,4,6,8,10
2013-01-03,12,15,18,21
2013-01-04,24,28,32,36
2013-01-05,40,45,50,55
2013-01-06,60,66,72,78


In [63]:
# Maximum minus minimum of each column; the result is a Series
df.apply(lambda x: x.max() - x.min())

A    20
B    20
C    20
D    20
dtype: int64
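
`apply` works column by column by default. Passing `axis=1` applies the same function to each row instead. A short sketch on the same 6x4 range of integers, with a default integer index for brevity:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24).reshape(6, 4), columns=list("ABCD"))

col_range = df.apply(lambda x: x.max() - x.min())          # one value per column
row_range = df.apply(lambda x: x.max() - x.min(), axis=1)  # one value per row

print(col_range)
print(row_range)
```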

### 8.3. Histogramming ###

In [67]:
s = pd.Series(np.random.randint(0, 7, size=10))
s

0    6
1    4
2    3
3    2
4    3
5    1
6    4
7    5
8    1
9    0
dtype: int64

In [68]:
# The index holds the distinct values; the values are how often each one occurs
s.value_counts()

4    2
3    2
1    2
6    1
5    1
2    1
0    1
dtype: int64
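
Two common variations on `value_counts`, sketched with the ten values shown above typed in as fixed data so the result is reproducible: relative frequencies via `normalize=True`, and ordering by value rather than by count via `sort_index()`.

```python
import pandas as pd

s = pd.Series([6, 4, 3, 2, 3, 1, 4, 5, 1, 0])

counts = s.value_counts()                 # ordered by count, descending
shares = s.value_counts(normalize=True)   # fractions that sum to 1
by_value = counts.sort_index()            # ordered by the value itself

print(by_value)
print(shares.sum())
```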

### 8.4. String methods ###

In [75]:
s = pd.Series(['A_B', 'C_D'])
s

0    A_B
1    C_D
dtype: object

In [76]:
s.str.lower()

0    a_b
1    c_d
dtype: object

In [77]:
s.str.split('_')

0    [A, B]
1    [C, D]
dtype: object

In [78]:
s.str.replace('_', '')

0    AB
1    CD
dtype: object
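
The `.str` methods leave missing values missing and can expand a split into columns. A minimal sketch with one made-up NaN entry; `regex=False` is passed so that the pattern is treated as a literal string whatever the default is in a given pandas version.

```python
import numpy as np
import pandas as pd

s = pd.Series(["A_B", np.nan, "C_D"])

lowered = s.str.lower()                          # the NaN entry stays missing
parts = s.str.split("_", expand=True)            # one column per piece
replaced = s.str.replace("_", "", regex=False)   # literal replacement

print(lowered)
print(parts)
print(replaced)
```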

## 9. Merging ##

### 9.1. Concat ###

In [24]:
df = pd.DataFrame(np.random.randn(10, 4))
df

Unnamed: 0,0,1,2,3
0,-0.236541,2.129663,0.040113,0.17616
1,-0.443854,0.340397,-0.838473,-0.635955
2,1.455616,-0.262421,0.736989,0.796192
3,0.653311,0.152395,-0.087053,0.130809
4,-0.639655,-1.370976,-0.146624,0.569646
5,0.855968,-0.968835,-0.361527,-0.946837
6,0.208828,-0.269765,-1.912707,-1.507191
7,0.361779,1.090555,0.015652,-0.205713
8,-0.430038,3.929388,1.05024,1.413717
9,0.674238,0.916587,0.39995,-0.256237


In [25]:
# Split into pieces
pieces = [df[:3], df[3:7], df[7:]]

# Concatenate the pieces back together
pd.concat(pieces)

Unnamed: 0,0,1,2,3
0,-0.236541,2.129663,0.040113,0.17616
1,-0.443854,0.340397,-0.838473,-0.635955
2,1.455616,-0.262421,0.736989,0.796192
3,0.653311,0.152395,-0.087053,0.130809
4,-0.639655,-1.370976,-0.146624,0.569646
5,0.855968,-0.968835,-0.361527,-0.946837
6,0.208828,-0.269765,-1.912707,-1.507191
7,0.361779,1.090555,0.015652,-0.205713
8,-0.430038,3.929388,1.05024,1.413717
9,0.674238,0.916587,0.39995,-0.256237
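
`concat` keeps the original row labels of each piece. The sketch below, on a small deterministic frame, shows two options that change this: `ignore_index=True` renumbers the rows, and `keys` adds an outer index level that records which piece a row came from. The key names `p0`, `p1`, `p2` are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(6, 2), columns=["x", "y"])
pieces = [df[:2], df[2:4], df[4:]]

rebuilt = pd.concat(pieces)                                       # same as df
renumbered = pd.concat([pieces[2], pieces[0]], ignore_index=True)
labelled = pd.concat(pieces, keys=["p0", "p1", "p2"])             # hypothetical keys

print(rebuilt.equals(df))
print(renumbered)
print(labelled.loc["p1"])
```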


### 9.2. Join ###

In [26]:
# Works like a join in a database
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
left

Unnamed: 0,key,lval
0,foo,1
1,foo,2


In [27]:
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
right

Unnamed: 0,key,rval
0,foo,4
1,foo,5


In [28]:
pd.merge(left, right, on='key')

Unnamed: 0,key,lval,rval
0,foo,1,4
1,foo,1,5
2,foo,2,4
3,foo,2,5


In [29]:
# Another example
left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
left

Unnamed: 0,key,lval
0,foo,1
1,bar,2


In [30]:
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
right

Unnamed: 0,key,rval
0,foo,4
1,bar,5


In [31]:
pd.merge(left, right, on='key')

Unnamed: 0,key,lval,rval
0,foo,1,4
1,bar,2,5
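
Both examples above use the default inner join. As in SQL, other join types are available through the `how` argument. The sketch below uses made-up keys that only partly overlap, and `indicator=True` to show which side each row came from.

```python
import pandas as pd

left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "baz"], "rval": [4, 5]})

inner = pd.merge(left, right, on="key")   # only keys present in both
outer = pd.merge(left, right, on="key", how="outer", indicator=True)

print(inner)
print(outer)   # the _merge column is left_only, right_only or both
```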


### 9.3. Append ###

In [32]:
df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
df

Unnamed: 0,A,B,C,D
0,0.108675,1.066178,0.431081,0.532731
1,-0.863063,0.860666,0.241755,0.925792
2,1.627093,-0.293025,-0.183325,0.877897
3,-0.328073,-0.113541,0.406766,0.400485
4,-0.954278,-1.459062,0.846754,0.083291
5,2.105786,-1.006416,0.823606,-1.873452
6,-0.870628,-0.578416,-0.390428,0.379792
7,-0.287156,0.72406,1.520279,0.162994


In [33]:
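s = df.iloc[3]
# DataFrame.append was deprecated in pandas 1.4 and removed in 2.0;
# on newer versions use pd.concat([df, s.to_frame().T], ignore_index=True)
df.append(s, ignore_index=True)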

Unnamed: 0,A,B,C,D
0,0.108675,1.066178,0.431081,0.532731
1,-0.863063,0.860666,0.241755,0.925792
2,1.627093,-0.293025,-0.183325,0.877897
3,-0.328073,-0.113541,0.406766,0.400485
4,-0.954278,-1.459062,0.846754,0.083291
5,2.105786,-1.006416,0.823606,-1.873452
6,-0.870628,-0.578416,-0.390428,0.379792
7,-0.287156,0.72406,1.520279,0.162994
8,-0.328073,-0.113541,0.406766,0.400485
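
On pandas 2.0 and later, where `DataFrame.append` no longer exists, the same result can be obtained with `pd.concat`. A sketch on a small deterministic frame; the row is first turned into a one-row DataFrame so that it is stacked as a row rather than as a column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(8).reshape(4, 2), columns=["A", "B"])
s = df.iloc[3]                      # a row, returned as a Series

one_row = s.to_frame().T            # Series -> DataFrame with a single row
appended = pd.concat([df, one_row], ignore_index=True)

print(appended)
```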


## 10. Grouping ##

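A group by operation involves one or more of the following steps
  * Splitting the data into groups based on some criteria (Splitting)
  * Applying a function to each group (Applying)
  * Combining the results (Combining)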
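
The three steps can be written out by hand to see what `groupby(...).sum()` does in a single call. This is only an illustrative sketch on a made-up frame, not how pandas implements it internally.

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "bar", "foo"], "C": [1, 2, 3]})

# Splitting: one sub-frame per distinct value of column A.
groups = {key: part for key, part in df.groupby("A")}

# Applying: run a function on each sub-frame.
sums = {key: part["C"].sum() for key, part in groups.items()}

# Combining: collect the per-group results into one Series.
manual = pd.Series(sums, name="C").sort_index()

builtin = df.groupby("A")["C"].sum()
print(manual)
print(builtin)
```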

In [34]:
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.arange(1, 9),
                   'D' : np.arange(2, 10)})
df

Unnamed: 0,A,B,C,D
0,foo,one,1,2
1,bar,one,2,3
2,foo,two,3,4
3,bar,three,4,5
4,foo,two,5,6
5,bar,two,6,7
6,foo,one,7,8
7,foo,three,8,9


In [35]:
# Sum within each group
df.groupby('A').sum()

Unnamed: 0_level_0,C,D
A,Unnamed: 1_level_1,Unnamed: 2_level_1
bar,12,15
foo,24,29


In [36]:
# Group by several columns
df.groupby(['A','B']).sum()

Unnamed: 0_level_0,Unnamed: 1_level_0,C,D
A,B,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,2,3
bar,three,4,5
bar,two,6,7
foo,one,8,10
foo,three,8,9
foo,two,8,10


In [37]:
b = df.groupby(['A','B']).sum()
# The result has a MultiIndex
b.index

MultiIndex(levels=[['bar', 'foo'], ['one', 'three', 'two']],
           labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]],
           names=['A', 'B'])

In [38]:
b.columns

Index(['C', 'D'], dtype='object')
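
A few ways to work with the MultiIndex that a multi-column group by produces, sketched on a smaller made-up frame: looking up one group with a tuple, flattening the index back into columns, and pivoting one level into columns.

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"],
                   "B": ["one", "one", "two", "two"],
                   "C": [1, 2, 3, 4]})
b = df.groupby(["A", "B"]).sum()

value = b.loc[("foo", "two"), "C"]   # one cell, addressed by a tuple of labels
flat = b.reset_index()               # index levels become ordinary columns
wide = b["C"].unstack("B")           # level B becomes the columns

print(value)
print(flat)
print(wide)
```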

## 11. Reshaping ##

## 12. Time series ##

pandas makes it easy to resample a time series to a different frequency, which is very useful in financial analysis

In [39]:
# Aggregate per-second data into 5-minute bins
rng = pd.date_range('1/1/2012', periods=100, freq='S')
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts.resample('5Min').sum()

2012-01-01    25210
Freq: 5T, dtype: int64
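
The 100 seconds above all fall into a single 5-minute bin, which is why the result has one row. With a longer series the binning becomes visible. The sketch below uses made-up constant data covering 10 minutes, and builds the frequency from a `pd.Timedelta` to avoid relying on frequency alias spellings that differ between pandas versions.

```python
import pandas as pd

# 600 one-second samples, each with the value 1.
rng = pd.date_range("2012-01-01", periods=600, freq=pd.Timedelta(seconds=1))
ts = pd.Series(1, index=rng)

per_5min = ts.resample("5min").sum()   # two bins of 300 seconds each
print(per_5min)
```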

In [40]:
# Build a daily series; time zone information is attached to it in the next step
rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')
ts = pd.Series(np.random.randn(len(rng)), rng)
ts

2012-03-06    0.997218
2012-03-07    0.385599
2012-03-08    0.508370
2012-03-09   -0.215921
2012-03-10   -0.381707
Freq: D, dtype: float64

In [41]:
ts_utc = ts.tz_localize('UTC')
ts_utc

2012-03-06 00:00:00+00:00    0.997218
2012-03-07 00:00:00+00:00    0.385599
2012-03-08 00:00:00+00:00    0.508370
2012-03-09 00:00:00+00:00   -0.215921
2012-03-10 00:00:00+00:00   -0.381707
Freq: D, dtype: float64
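
`tz_localize` attaches a time zone to naive timestamps, while `tz_convert` expresses the same instants in another zone. A minimal sketch with made-up values, assuming the time zone database is available in the environment:

```python
import pandas as pd

rng = pd.date_range("2012-03-06", periods=2, freq="D")
ts = pd.Series([1.0, 2.0], index=rng)

ts_utc = ts.tz_localize("UTC")                   # naive -> aware, clock unchanged
ts_shanghai = ts_utc.tz_convert("Asia/Shanghai") # same instants, shifted clock

print(ts_utc.index[0])
print(ts_shanghai.index[0])
```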

In [43]:
# Convert to another time zone
ts_utc.tz_convert('Asia/Shanghai')

# Convert between time span representations
rng = pd.date_range('1/1/2012', periods=5, freq='M')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts

2012-01-31   -0.946623
2012-02-29   -0.195326
2012-03-31    1.048476
2012-04-30    1.228735
2012-05-31   -1.866153
Freq: M, dtype: float64

In [44]:
ps = ts.to_period()
ps

2012-01   -0.946623
2012-02   -0.195326
2012-03    1.048476
2012-04    1.228735
2012-05   -1.866153
Freq: M, dtype: float64

In [45]:
ps.to_timestamp()

2012-01-01   -0.946623
2012-02-01   -0.195326
2012-03-01    1.048476
2012-04-01    1.228735
2012-05-01   -1.866153
Freq: MS, dtype: float64
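
As the output above shows, `to_timestamp()` places each period at its start by default. The sketch below, with made-up values, passes the period frequency explicitly and also shows `how="end"`, which places each period at its end instead.

```python
import pandas as pd

idx = pd.to_datetime(["2012-01-31", "2012-02-29", "2012-03-31"])
ts = pd.Series([1.0, 2.0, 3.0], index=idx)

ps = ts.to_period("M")                  # timestamps -> monthly periods
at_start = ps.to_timestamp()            # start of each month (default)
at_end = ps.to_timestamp(how="end")     # end of each month

print(ps)
print(at_start)
print(at_end)
```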

In [46]:
# Quarterly periods (fiscal year ending in November)
prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
ts = pd.Series(np.random.randn(len(prng)), prng)
ts.head()

1990Q1   -1.813945
1990Q2   -0.009036
1990Q3   -0.472445
1990Q4   -0.606948
1991Q1   -1.173856
Freq: Q-NOV, dtype: float64

In [47]:
# Quarter -> last month of the quarter -> following month -> first hour of that month -> 09:00
ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
ts.head()

1990-03-01 09:00   -1.813945
1990-06-01 09:00   -0.009036
1990-09-01 09:00   -0.472445
1990-12-01 09:00   -0.606948
1991-03-01 09:00   -1.173856
Freq: H, dtype: float64

## 13. Categoricals ##

In [11]:
df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
df

Unnamed: 0,id,raw_grade
0,1,a
1,2,b
2,3,b
3,4,a
4,5,a
5,6,e


In [12]:
# Convert the raw grades to a categorical data type
df["grade"] = df["raw_grade"].astype("category")
df

Unnamed: 0,id,raw_grade,grade
0,1,a,a
1,2,b,b
2,3,b,b
3,4,a,a
4,5,a,a
5,6,e,e


In [13]:
df["grade"]

0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): [a, b, e]

In [14]:
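# Rename the categories to more meaningful names
# (assigning to .cat.categories no longer works in pandas 2.0+; rename_categories returns a new Series)
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])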
df

Unnamed: 0,id,raw_grade,grade
0,1,a,very good
1,2,b,good
2,3,b,good
3,4,a,very good
4,5,a,very good
5,6,e,very bad


In [15]:
# Reorder the categories and add the missing ones at the same time (methods under Series.cat return a new Series by default)
df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
df

Unnamed: 0,id,raw_grade,grade
0,1,a,very good
1,2,b,good
2,3,b,good
3,4,a,very good
4,5,a,very good
5,6,e,very bad


In [16]:
df["grade"]

0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]

In [18]:
# Sort by category order, not alphabetical order
df.sort_values(by="grade")

Unnamed: 0,id,raw_grade,grade
5,6,e,very bad
1,2,b,good
2,3,b,good
0,1,a,very good
3,4,a,very good
4,5,a,very good
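
The difference between category order and alphabetical order can be seen by sorting the same labels with and without a categorical dtype. A small sketch with made-up grades and an explicitly chosen category order:

```python
import pandas as pd

raw = pd.Series(["good", "very bad", "very good", "bad"])
order = ["very bad", "bad", "good", "very good"]   # chosen category order

as_category = raw.astype(pd.CategoricalDtype(categories=order))

print(raw.sort_values().tolist())           # plain strings: alphabetical
print(as_category.sort_values().tolist())   # categorical: follows `order`
```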


In [19]:
# Group by a categorical column; empty categories are shown as well
df.groupby("grade").size()

grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64

In [20]:
df.groupby("grade").count()

Unnamed: 0_level_0,id,raw_grade
grade,Unnamed: 1_level_1,Unnamed: 2_level_1
very bad,1,1
bad,0,0
medium,0,0
good,2,2
very good,3,3
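
Whether empty categories appear in a group by result is controlled by the `observed` argument, which was added in pandas 0.23 and whose default differs between versions. Passing it explicitly keeps the result predictable. A sketch with made-up grades:

```python
import pandas as pd

grades = pd.Categorical(["a", "b", "b"], categories=["a", "b", "c"])
df = pd.DataFrame({"id": [1, 2, 3], "grade": grades})

all_categories = df.groupby("grade", observed=False).size()  # includes empty 'c'
seen_only = df.groupby("grade", observed=True).size()        # only 'a' and 'b'

print(all_categories)
print(seen_only)
```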


## 14. Plotting ##
## 15. Data In/Out ##
## 16. Further reading ##