# 7.3 Data Transformation

In [1]:
import pandas as pd
import numpy as np

# 7.3.6 Detecting and Filtering Outliers

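Filtering or transforming outliers is largely a matter of applying array operations. Consider a DataFrame with some normally distributed data: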

In [41]:
data = pd.DataFrame(np.random.randn(1000, 4))
data.describe()

                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean     -0.035313     0.021959     0.031389     0.020317
std       0.969247     0.971092     0.990622     1.008166
min      -2.971566    -2.975596    -3.122171    -2.952428
25%      -0.709442    -0.623380    -0.623692    -0.653202
50%      -0.031208     0.036752     0.006200     0.031741
75%       0.638806     0.666040     0.685398     0.698864
max       2.801499     3.006995     3.285299     3.262342


Suppose we want to find the values in one of the columns whose absolute value exceeds 3:

In [42]:
data.head()

          0         1         2         3
0  0.213161  1.063108 -0.065080  0.613593
1  1.080061 -0.278110 -0.113194  0.102974
2  0.095631  0.167631 -0.609479 -0.757723
3 -1.000688 -2.024424  1.197767  0.081444
4  0.147295  1.167497  2.023805 -0.605488


In [43]:
col = data[2]
col.head()

0   -0.065080
1   -0.113194
2   -0.609479
3    1.197767
4    2.023805
Name: 2, dtype: float64

In [44]:
col[np.abs(col) > 3]

22     3.116936
384    3.178319
605   -3.122171
647    3.204323
680    3.285299
Name: 2, dtype: float64

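To select all rows that contain a value whose absolute value exceeds 3, use the any method on a boolean DataFrame. Indexing with the boolean DataFrame alone keeps the original shape and replaces every non-matching value with NaN: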

In [45]:
data[(np.abs(data) > 3)].head()

    0   1   2   3
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN


In [46]:
data[(np.abs(data) > 3).any(axis=1)]  # axis=1 reduces across the columns, giving one boolean per row

            0         1         2         3
22   0.135428 -1.062371  3.116936  0.476789
266 -0.307996  0.390771 -0.503426  3.214715
384  0.048947 -0.170126  3.178319 -0.983170
605 -1.116989  0.919441 -3.122171  0.219854
647  1.430065  0.702229  3.204323  1.766474
666  0.360114  1.569596 -0.301235  3.262342
680  2.149453 -0.178407  3.285299  0.849912
988  0.583730  3.006995 -1.049895 -1.090622
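
The row selection above can be reproduced with a seeded generator, which makes the result repeatable. This is a minimal sketch: the seed value and the names `mask`, `row_has_outlier`, `outlier_rows`, and `per_column` are illustrative choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Seeded generator so the data is reproducible (the seed is arbitrary).
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.standard_normal((1000, 4)))

# Boolean DataFrame: True wherever |value| > 3.
mask = data.abs() > 3

# Reduce across the columns: True for rows holding at least one outlier.
row_has_outlier = mask.any(axis=1)
outlier_rows = data[row_has_outlier]

# Count of outliers in each column.
per_column = mask.sum()
```

Keeping the boolean mask in its own variable makes it easy to reuse, for example to count outliers per column before deciding how to treat them.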


Values can also be set based on these criteria. The following caps every value whose absolute value exceeds 3 at -3 or 3:

In [47]:
data[np.abs(data) > 3] = np.sign(data) * 3

In [48]:
data[21:23]

           0         1         2         3
21  1.083451 -0.541657 -0.384706  0.269380
22  0.135428 -1.062371  3.000000  0.476789


In [49]:
data.describe()

                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean     -0.035313     0.021952     0.030726     0.019840
std       0.969247     0.971070     0.987797     1.006696
min      -2.971566    -2.975596    -3.000000    -2.952428
25%      -0.709442    -0.623380    -0.623692    -0.653202
50%      -0.031208     0.036752     0.006200     0.031741
75%       0.638806     0.666040     0.685398     0.698864
max       2.801499     3.000000     3.000000     3.000000


np.sign(data) produces 1 or -1 depending on whether each value in data is positive or negative:

In [50]:
np.sign(data).head()

     0    1    2    3
0  1.0  1.0 -1.0  1.0
1  1.0 -1.0 -1.0  1.0
2  1.0  1.0 -1.0 -1.0
3 -1.0 -1.0  1.0  1.0
4  1.0  1.0  1.0 -1.0
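
As a point of comparison, the same capping can be expressed with the `clip` method. The sketch below assumes freshly generated data with an arbitrary seed, and the names `capped` and `clipped` are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # arbitrary seed
data = pd.DataFrame(rng.standard_normal((1000, 4)))

# Approach from the text: overwrite outliers with the sign of the value times 3.
capped = data.copy()
capped[capped.abs() > 3] = np.sign(capped) * 3

# Alternative: clip bounds every value to the interval [-3, 3].
clipped = data.clip(lower=-3, upper=3)
```

Both approaches leave values inside the interval untouched. The `np.sign` version generalizes more readily when the replacement should depend on something other than the bound itself.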


# 7.3.7 Permutation and Random Sampling

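Permuting (randomly reordering) the rows of a Series or DataFrame is easy to do with the numpy.random.permutation function. Calling permutation with the length of the axis you want to permute produces an array of integers indicating the new ordering: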

In [51]:
df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))
df

    0   1   2   3
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
4  16  17  18  19


In [52]:
sampler = np.random.permutation(5)
sampler

array([2, 1, 3, 4, 0])

That array can then be used in iloc-based indexing or with the equivalent take function:

In [53]:
df.take(sampler)

    0   1   2   3
2   8   9  10  11
1   4   5   6   7
3  12  13  14  15
4  16  17  18  19
0   0   1   2   3
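
The equivalence between `take` and `iloc`, and the same idea applied to columns, can be sketched as follows. The seed and the variable names (`row_order`, `col_order`, and so on) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

rng = np.random.default_rng(42)  # arbitrary seed

# A random ordering of the row positions 0..4.
row_order = rng.permutation(len(df))

# take and iloc both reorder rows by integer position.
by_take = df.take(row_order)
by_iloc = df.iloc[row_order]

# Permuting the columns instead: use the column count and axis=1.
col_order = rng.permutation(df.shape[1])
cols_shuffled = df.take(col_order, axis=1)
```

Because the permutation array holds positions rather than labels, it works the same way regardless of what the index labels are.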


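To select a random subset without replacement (each row can be drawn at most once), use the sample method on a Series or DataFrame. It returns a new object and leaves the original unchanged: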

In [54]:
df.sample(n=3)

    0   1   2   3
4  16  17  18  19
2   8   9  10  11
1   4   5   6   7


To generate a sample with replacement (allowing the same element to be drawn more than once), pass replace=True to sample:

In [55]:
choices = pd.Series([5, 7, -1, 6, 4])

draws = choices.sample(n=10, replace=True)

draws

2   -1
0    5
2   -1
3    6
3    6
2   -1
3    6
2   -1
0    5
1    7
dtype: int64
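
A few variations on `sample` are sketched below. The `random_state` argument makes the draws repeatable; the value 0 and the variable names are arbitrary choices for this illustration.

```python
import pandas as pd

choices = pd.Series([5, 7, -1, 6, 4])

# Without replacement: every index label appears at most once.
without = choices.sample(n=3, random_state=0)

# With replacement: n may exceed the length of the Series.
with_repl = choices.sample(n=10, replace=True, random_state=0)

# A fraction of the data instead of a fixed count (0.4 * 5 = 2 elements).
part = choices.sample(frac=0.4, random_state=0)
```

Requesting more elements than the Series holds while leaving `replace` at its default of `False` raises a `ValueError`.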

# 7.3.8 Computing Indicator/Dummy Variables

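> Dummy variables (also called indicator or nominal variables) are artificial variables that encode a qualitative attribute. They are quantified independent variables that usually take the value 0 or 1.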

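Another type of transformation for statistical modeling or machine learning applications is converting a categorical variable into a dummy or indicator matrix. If a column in a DataFrame has k distinct values, you can derive a matrix or DataFrame with k columns containing only 1s and 0s. pandas has a get_dummies function for this, though devising one yourself is not difficult. Here is an example: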

In [56]:
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})
df

   data1 key
0      0   b
1      1   b
2      2   a
3      3   c
4      4   a
5      5   b


In [57]:
pd.get_dummies(df['key'])

   a  b  c
0  0  1  0
1  0  1  0
2  1  0  0
3  0  0  1
4  1  0  0
5  0  1  0
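
To illustrate what get_dummies computes, here is one possible hand-written version based on NumPy broadcasting. It is only a sketch of the idea, not how pandas implements it internally, and the names `categories`, `indicator`, and `manual` are made up for this example.

```python
import numpy as np
import pandas as pd

key = pd.Series(['b', 'b', 'a', 'c', 'a', 'b'])

# The k distinct values, sorted so the column order is deterministic.
categories = sorted(key.unique())

# Compare every value against every category via broadcasting:
# shape (6, 1) == shape (1, 3) gives a (6, 3) boolean array.
values = key.to_numpy(dtype=object)
cats = np.array(categories, dtype=object)
indicator = (values[:, None] == cats[None, :]).astype(int)

manual = pd.DataFrame(indicator, index=key.index, columns=categories)
```

Each row of `manual` contains exactly one 1, in the column matching that row's category.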


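In some cases, you may want to add a prefix to the columns of the indicator DataFrame, which can then be joined with the other data. The prefix argument of pd.get_dummies does this: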

In [58]:
dummies = pd.get_dummies(df['key'], prefix='key')

In [59]:
df_with_dummy = df[['data1']].join(dummies)
df_with_dummy

   data1  key_a  key_b  key_c
0      0      0      1      0
1      1      0      1      0
2      2      1      0      0
3      3      0      0      1
4      4      1      0      0
5      5      0      1      0
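
The encode-then-join steps can also be done in a single call by passing the whole DataFrame together with the `columns` argument. Note that recent pandas versions (2.0 and later) return boolean indicator columns by default, so the sketch below passes `dtype=int` to obtain 0/1 values like those shown above. The name `encoded` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})

# Encode only the 'key' column; the remaining columns are kept as they are
# and the indicator columns are appended after them.
encoded = pd.get_dummies(df, columns=['key'], prefix='key', dtype=int)
```

This avoids the intermediate `dummies` object when the indicator columns are only needed as part of the combined table.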


If a row in a DataFrame belongs to multiple categories, things are a bit more complicated. Let's look at the MovieLens 1M dataset:

In [60]:
mnames = ['movie_id', 'title', 'genres']

In [61]:
movies = pd.read_table('../datasets/movielens/movies.dat', sep='::',
                       header=None, names=mnames, engine='python')
movies[:10]

FileNotFoundError: [Errno 2] No such file or directory: '../datasets/movielens/movies.dat'

Adding an indicator variable for each genre takes a little wrangling. First, we extract the list of distinct genres in the dataset:

In [None]:
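all_genres = []

# Collect every genre label by splitting the pipe-separated strings
for x in movies.genres:
    all_genres.extend(x.split('|'))

# Wrap the list in a Series; recent pandas versions expect an array-like here
genres = pd.unique(pd.Series(all_genres))
genres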

One way to construct the indicator DataFrame is to start with a DataFrame of all zeros:

In [None]:
zero_matrix = np.zeros((len(movies), len(genres)))

zero_matrix.shape

In [None]:
dummies = pd.DataFrame(zero_matrix, columns=genres)
dummies.head()

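Next, iterate through each movie and set the matching entries in each row of dummies to 1. To do this, use dummies.columns to compute the column indices for each genre: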

In [None]:
gen = movies.genres[0]
gen.split('|')

In [None]:
dummies.columns.get_indexer(gen.split('|'))

Then use .iloc to set the values based on these indices:

In [None]:
for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1

In [None]:
dummies.head()

Then, as before, we can combine this with movies:

In [None]:
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.iloc[0]

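For much larger datasets, this method of constructing indicator variables with multiple membership is not especially fast. It would be quicker to write a lower-level function that writes directly to a NumPy array and then wrap the result in a DataFrame.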
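
One way to write the indicators directly into a NumPy array is sketched below. Since movies.dat is not available here, the three-row `movies` frame is a hand-typed stand-in, and the helper names `split`, `col_of`, and `matrix` are assumptions for illustration. The last line shows the built-in `Series.str.get_dummies`, which handles delimiter-separated multi-membership strings in one call.

```python
import numpy as np
import pandas as pd

# Small hand-typed stand-in for the MovieLens movies table.
movies = pd.DataFrame({
    'movie_id': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Heat (1995)'],
    'genres': ["Animation|Children's|Comedy",
               "Adventure|Children's|Fantasy",
               'Action|Crime|Thriller'],
})

# One list of genre labels per movie.
split = movies['genres'].str.split('|')

# Distinct genres in order of first appearance, and a label -> column lookup.
genres = pd.unique(split.explode())
col_of = {g: j for j, g in enumerate(genres)}

# Fill a plain NumPy array, then wrap it in a DataFrame once at the end.
matrix = np.zeros((len(movies), len(genres)), dtype=np.int8)
for i, row_genres in enumerate(split):
    matrix[i, [col_of[g] for g in row_genres]] = 1

dummies = pd.DataFrame(matrix, index=movies.index, columns=genres)
movies_windic = movies.join(dummies.add_prefix('Genre_'))

# Built-in alternative: columns come back sorted alphabetically.
builtin = movies['genres'].str.get_dummies(sep='|')
```

Writing into the array avoids repeated DataFrame indexing inside the loop. Whether it is actually faster on a given dataset is something to measure rather than assume.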

A useful recipe for statistical applications is to combine get_dummies with a discretization function such as cut:

In [None]:
np.random.seed(12345)

In [None]:
values = np.random.rand(10)
values

In [None]:
bins = [0, 0.2, 0.4, 0.6, 0.8, 1.]

In [None]:
pd.cut(values, bins)

In [None]:
pd.get_dummies(pd.cut(values, bins))
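
The recipe can be checked end to end with a self-contained sketch. It uses NumPy's `default_rng` instead of `np.random.seed`, so the generated values will differ from those in the cells above; the names `binned` and `indicators` are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(12345)
values = rng.random(10)  # uniform draws from [0, 1)

bins = [0, 0.2, 0.4, 0.6, 0.8, 1.0]

# cut assigns each value to one of the five half-open intervals (a, b].
binned = pd.cut(values, bins)

# One indicator column per interval, including intervals with no members.
indicators = pd.get_dummies(binned, dtype=int)
```

Each row of `indicators` has a single 1 marking the bin the value fell into. A value of exactly 0 would fall outside the first interval `(0.0, 0.2]` and produce a row of zeros unless `include_lowest=True` is passed to `cut`.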