In [58]:
import numpy as np
import pandas as pd

## Indexing and selecting data

pandas supports three kinds of multi-axis indexing: \[\], .iloc, and .loc.

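.loc is primarily label based, but it can also be used with a boolean array.

.loc raises KeyError when a requested label is not found in the index.

.loc accepts the following inputs:

1. A single label, e.g. 5 or 'a' (note that 5 is interpreted as a label of the index, not as an integer position along it).

2. A list or array of labels, e.g. \['a', 'b', 'c'\].

3. A slice object with labels, e.g. 'a':'f' (note that, contrary to usual Python slices, both the start label and the stop label are included).

4. A boolean array.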
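The rules above can be checked on a small Series. This is a minimal sketch, not part of the original notebook; the Series `s` and its integer labels are hypothetical and chosen so that labels and positions differ.

```python
import pandas as pd

# Hypothetical Series whose integer labels deliberately differ from positions.
s = pd.Series([10, 20, 30], index=[5, 6, 7])

by_label = s.loc[5]      # looks up the label 5, not the sixth element
sliced = s.loc[5:6]      # label slice: both endpoints are included

# A label that is not in the index raises KeyError.
try:
    s.loc[0]
    missing_raises = False
except KeyError:
    missing_raises = True

# A boolean list of the same length keeps the entries marked True.
masked = s.loc[[True, False, True]]

print(by_label, sliced.tolist(), missing_raises, masked.index.tolist())
```

Here `s.loc[0]` fails even though position 0 exists, because .loc only ever consults the labels.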



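.iloc is primarily integer-position based (from 0 to length-1 of the axis), but it can also be used with a boolean array. .iloc raises IndexError if a requested indexer is out of bounds, except for slice indexers, which are allowed to run past the end (this matches Python / NumPy slice semantics). Allowed inputs are:

1. An integer, e.g. 5.

2. A list or array of integers, e.g. \[4, 3, 0\].

3. A slice object with integers, e.g. 1:7.

4. A boolean array.

5. A callable that takes one argument (the calling Series, DataFrame, or Panel) and returns valid output for indexing (one of the above).

Callable indexers are new in version 0.18.1.

For details, see Selection by Position, Advanced Indexing, and Advanced Hierarchical in the pandas documentation.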
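The out-of-bounds behaviour of .iloc is easy to get wrong, so here is a minimal sketch of it. The Series `s` is a hypothetical example, not data from this notebook.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=list('abc'))

first = s.iloc[0]          # the element at position 0
picked = s.iloc[[2, 0]]    # a list of positions, returned in the requested order
clipped = s.iloc[1:10]     # a slice past the end is clipped, as with Python lists

# A single out-of-range position raises IndexError.
try:
    s.iloc[10]
    out_of_range_raises = False
except IndexError:
    out_of_range_raises = True

print(first, picked.tolist(), clipped.tolist(), out_of_range_raises)
```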

.loc, .iloc, and \[\] indexing can all accept a callable as the indexer. See Selection By Callable in the pandas documentation for more information.
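As a rough illustration of callable indexers, the sketch below passes lambdas to .loc and .iloc. The frame `df_demo` and its values are assumptions made up for this example.

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [1, -2, 3], 'B': [4, 5, 6]}, index=list('xyz'))

# The callable receives the calling DataFrame and returns a boolean Series,
# which is a valid .loc input.
positive_rows = df_demo.loc[lambda d: d['A'] > 0]

# Here the callable returns a list of integer positions, a valid .iloc input.
last_row = df_demo.iloc[lambda d: [len(d) - 1]]

print(positive_rows.index.tolist(), last_row.index.tolist())
```

Callables are mostly useful in method chains, where the intermediate object has no name that a plain boolean expression could refer to.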

### \[  \]

Selecting data with \[ \] is the simplest approach, but its capabilities are limited.

In [59]:
dates = pd.date_range('1/1/2000', periods=8)

In [60]:
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])

In [61]:
df

Unnamed: 0,A,B,C,D
2000-01-01,-1.314162,-0.551005,0.773934,0.62451
2000-01-02,-0.275972,-0.142225,0.221955,0.363552
2000-01-03,0.55271,-0.119661,0.480205,-0.611139
2000-01-04,0.061236,-0.566644,0.476947,0.999314
2000-01-05,0.54177,-0.34493,-0.693993,0.687947
2000-01-06,0.745842,-1.123486,0.190824,-0.408939
2000-01-07,0.905149,-0.142425,0.448674,-0.653492
2000-01-08,-0.883286,0.12885,-1.252295,0.482245


In [62]:
s = df['A']

In [63]:
s

2000-01-01   -1.314162
2000-01-02   -0.275972
2000-01-03    0.552710
2000-01-04    0.061236
2000-01-05    0.541770
2000-01-06    0.745842
2000-01-07    0.905149
2000-01-08   -0.883286
Freq: D, Name: A, dtype: float64

In [64]:
s[0]

-1.3141621092540765

In [65]:
time = s.index[0]

In [66]:
time

Timestamp('2000-01-01 00:00:00', freq='D')

In [67]:
s[time]

-1.3141621092540765

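\[ \] is not very powerful; it exists mainly as a convenience. Beginners are better off using iloc and loc first, and studying \[ \] in detail only once they are comfortable with indexing.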
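One reason \[ \] deserves care is that its meaning on a DataFrame depends on the type of the key. The sketch below shows the common cases on a hypothetical frame `df_demo` (not from this notebook).

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=list('xyz'))

col = df_demo['A']                     # a single label selects a column (Series)
cols = df_demo[['A', 'B']]             # a list of labels selects columns (DataFrame)
rows = df_demo[0:2]                    # a slice selects rows instead
filtered = df_demo[df_demo['A'] > 1]   # a boolean Series filters rows

# The explicit spelling of the row slice says what is meant.
same_as_iloc = rows.equals(df_demo.iloc[0:2])

print(col.tolist(), rows.index.tolist(), filtered.index.tolist(), same_as_iloc)
```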

### loc

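.loc is the selection method used most often in practice.

.loc is primarily label based, but it can also be used with a boolean array.

.loc raises KeyError when a requested label is not found in the index.

.loc accepts the following inputs:

1. A single label, e.g. 5 or 'a' (note that 5 is interpreted as a label of the index, not as an integer position along it).

2. A list or array of labels, e.g. ['a', 'b', 'c'].

3. A slice object with labels, e.g. 'a':'f' (note that, contrary to usual Python slices, both the start label and the stop label are included).

4. A boolean array.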


In [68]:
s1 = pd.Series(np.random.randn(6),index=list('abcdef'))

In [69]:
s1

a   -0.550452
b   -0.166279
c   -0.005383
d   -0.297832
e    1.294682
f   -0.896679
dtype: float64

In [70]:
s1.loc['c']

-0.005382667766166223

In [71]:
s1.loc[['c']]  # compare with the previous cell: a list of labels returns a Series, not a scalar

c   -0.005383
dtype: float64

In [72]:
s1.loc[['c', 'd']]

c   -0.005383
d   -0.297832
dtype: float64

In [73]:
s1.loc['c': 'e']

c   -0.005383
d   -0.297832
e    1.294682
dtype: float64

In [74]:
bol1 = s1 > s1.mean()

In [75]:
bol1

a    False
b    False
c     True
d    False
e     True
f    False
dtype: bool

In [76]:
s1.loc[bol1]

c   -0.005383
e    1.294682
dtype: float64

Note that a boolean Series used as an indexer is also aligned by label, so reordering it does not change the selection.

In [77]:
bol2 = bol1.sort_index(ascending=False)

In [78]:
bol2

f    False
e     True
d    False
c     True
b    False
a    False
dtype: bool

In [79]:
s1.loc[bol2]

c   -0.005383
e    1.294682
dtype: float64

Once loc on a Series is clear, loc on a DataFrame is easy to understand.

In [80]:
df1 = pd.DataFrame(np.random.randn(6,4),
                   index=list('abcdef'),
                   columns=list('ABCD'))

In [81]:
df1

Unnamed: 0,A,B,C,D
a,0.499998,0.49382,-0.701505,0.249116
b,-1.389333,-1.170189,0.811325,-0.438201
c,0.957332,-0.972852,-0.895084,0.372848
d,-0.166218,-0.372639,-1.280283,1.554392
e,-0.476285,-0.450482,1.984062,1.267421
f,-0.007222,0.041043,-1.395735,1.247498


In [82]:
df1.loc["a", :]

A    0.499998
B    0.493820
C   -0.701505
D    0.249116
Name: a, dtype: float64

In [83]:
df1.loc[:, "A"]

a    0.499998
b   -1.389333
c    0.957332
d   -0.166218
e   -0.476285
f   -0.007222
Name: A, dtype: float64

In [84]:
df1.loc["a", "A"]

0.49999844029877755

In [85]:
df1.loc[["a"], :]

Unnamed: 0,A,B,C,D
a,0.499998,0.49382,-0.701505,0.249116


In [86]:
df1.loc[:, ["A"]]

Unnamed: 0,A
a,0.499998
b,-1.389333
c,0.957332
d,-0.166218
e,-0.476285
f,-0.007222


In [87]:
df1.loc[["a", "b"], :]

Unnamed: 0,A,B,C,D
a,0.499998,0.49382,-0.701505,0.249116
b,-1.389333,-1.170189,0.811325,-0.438201


A DataFrame also supports boolean indexing, and this is used very often in data analysis.

In [88]:
bool1 = df1.loc[:, "A"] > df1.loc[:, "A"].mean()

In [89]:
bool1

a     True
b    False
c     True
d    False
e    False
f     True
Name: A, dtype: bool

In [90]:
bool2 = df1.loc["a", :] > df1.loc["a", :].mean()

In [91]:
bool2

A     True
B     True
C    False
D     True
Name: a, dtype: bool

In [92]:
df1.loc[bool1, bool2]

Unnamed: 0,A,B,D
a,0.499998,0.49382,0.249116
c,0.957332,-0.972852,0.372848
f,-0.007222,0.041043,1.247498
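In practice the row mask usually combines several conditions. The following is a minimal sketch of that pattern; the frame `df_demo`, its values, and the thresholds are all invented for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'A': [1, -2, 3, -4], 'B': [-1, -1, 1, 1], 'C': [10, 20, 30, 40]},
    index=list('wxyz'),
)

# Each comparison needs its own parentheses because & binds tighter than > and <.
mask = (df_demo['A'] > 0) & (df_demo['B'] < 0)

# Rows come from the boolean mask, columns from a list of labels.
selected = df_demo.loc[mask, ['A', 'C']]

print(selected)
```

Use `&`, `|`, and `~` for element-wise logic; the Python keywords `and`, `or`, and `not` do not work on a Series.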


### iloc

With loc understood, iloc is straightforward: loc selects by label, whereas iloc selects by integer position.

In [93]:
df1

Unnamed: 0,A,B,C,D
a,0.499998,0.49382,-0.701505,0.249116
b,-1.389333,-1.170189,0.811325,-0.438201
c,0.957332,-0.972852,-0.895084,0.372848
d,-0.166218,-0.372639,-1.280283,1.554392
e,-0.476285,-0.450482,1.984062,1.267421
f,-0.007222,0.041043,-1.395735,1.247498


In [94]:
df1.iloc[0, :]

A    0.499998
B    0.493820
C   -0.701505
D    0.249116
Name: a, dtype: float64

In [95]:
df1.iloc[:, 0]

a    0.499998
b   -1.389333
c    0.957332
d   -0.166218
e   -0.476285
f   -0.007222
Name: A, dtype: float64

In [96]:
df1.iloc[:, 0:2]

Unnamed: 0,A,B
a,0.499998,0.49382
b,-1.389333,-1.170189
c,0.957332,-0.972852
d,-0.166218,-0.372639
e,-0.476285,-0.450482
f,-0.007222,0.041043


In [97]:
df1.iloc[0, 0]

0.49999844029877755

In [98]:
df1.iloc[[0], :]

Unnamed: 0,A,B,C,D
a,0.499998,0.49382,-0.701505,0.249116


Note, however, that iloc does not accept a labelled boolean indexer (a boolean Series); convert it to a plain array first.

In [99]:
bol1 = s1 > s1.mean()

In [100]:
bol1

a    False
b    False
c     True
d    False
e     True
f    False
dtype: bool

In [102]:
df1.iloc[bol1, :]

ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types

In [103]:
bolarray = bol1.values

In [104]:
bolarray

array([False, False,  True, False,  True, False])

In [105]:
df1.iloc[bolarray, :]

Unnamed: 0,A,B,C,D
c,0.957332,-0.972852,-0.895084,0.372848
e,-0.476285,-0.450482,1.984062,1.267421
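The same workaround can be written with `to_numpy()`, and .loc is an alternative when label alignment is acceptable. This is a small sketch on a hypothetical frame `df_demo`, not the data used above.

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 2, 3]}, index=list('xyz'))
mask = df_demo['A'] > 1                        # boolean Series labelled x, y, z

by_position = df_demo.iloc[mask.to_numpy()]    # strip the labels, then index by position
by_label = df_demo.loc[mask]                   # or let .loc align the mask on labels

print(by_position.index.tolist(), by_position.equals(by_label))
```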


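## Data alignment and arithmetic

Operations between DataFrame objects align automatically on both the column labels and the row labels. The resulting DataFrame carries the union of the column labels and the union of the row labels; positions present in only one operand become NaN.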

In [110]:
df = pd.DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])
df

Unnamed: 0,A,B,C,D
0,-0.267119,-0.531159,0.648965,0.412334
1,1.212974,0.721008,-1.144679,2.783355
2,-1.401605,1.033101,-0.637352,-1.863762
3,1.470259,0.862259,-0.667233,-0.246961
4,-0.086252,-2.168218,-1.81401,-1.917489
5,-0.656167,0.97348,-0.665872,0.798356
6,0.090406,1.503657,-1.03795,0.165396
7,1.220028,0.783355,1.31405,-0.867105
8,0.81258,-1.593367,-0.5835,-0.332158
9,-0.06854,1.614876,0.307755,-1.613589


In [111]:
df2 = pd.DataFrame(np.random.randn(7, 3), columns=['A', 'B', 'C'])
df2

Unnamed: 0,A,B,C
0,1.267391,-0.294642,-1.321651
1,0.347538,-0.133557,1.39941
2,-0.565898,-0.005639,-0.054753
3,0.810455,0.447193,0.776799
4,-2.269533,-1.426041,0.122747
5,-2.255188,-0.865349,-0.178483
6,0.39147,0.214126,-0.394714


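In [109]:
# This cell was executed before df and df2 were regenerated in In [110] and In [111],
# so the numbers below are not the sums of the two tables shown above.
# The NaN pattern is what matters: column D and rows 7-9 exist only in df.
df + df2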

Unnamed: 0,A,B,C,D
0,0.27731,1.198301,-0.691736,
1,-0.221725,-2.265799,2.664189,
2,-0.402637,0.952688,-2.197345,
3,1.434542,0.845121,-0.591081,
4,0.470982,-1.094138,0.689015,
5,1.280501,-1.156402,1.278614,
6,-1.041181,1.091867,-1.423891,
7,,,,
8,,,,
9,,,,
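When the NaN values produced by alignment are unwanted, the method form of the operator takes a `fill_value`. The sketch below uses two small hypothetical frames, `left` and `right`, so the result can be verified by hand.

```python
import pandas as pd

left = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [10.0, 20.0, 30.0]})
right = pd.DataFrame({'A': [100.0, 200.0]})

# Plain +: labels present in only one operand give NaN.
plain = left + right

# add(..., fill_value=0): a value missing from one operand is treated as 0
# before adding (a position missing from both would still be NaN).
filled = left.add(right, fill_value=0)

print(plain)
print(filled)
```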


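When an operation involves a DataFrame and a Series, the default behaviour is to align the Series index on the DataFrame columns, which broadcasts the Series across the rows. For example: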

In [113]:
s = df.iloc[0, :]
s

A   -0.267119
B   -0.531159
C    0.648965
D    0.412334
Name: 0, dtype: float64

In [114]:
df - s

Unnamed: 0,A,B,C,D
0,0.0,0.0,0.0,0.0
1,1.480093,1.252166,-1.793643,2.371021
2,-1.134487,1.564259,-1.286317,-2.276096
3,1.737377,1.393418,-1.316198,-0.659295
4,0.180866,-1.63706,-2.462975,-2.329823
5,-0.389049,1.504639,-1.314837,0.386022
6,0.357525,2.034815,-1.686915,-0.246938
7,1.487146,1.314514,0.665085,-1.279438
8,1.079698,-1.062209,-1.232465,-0.744492
9,0.198579,2.146035,-0.34121,-2.025923


Broadcasting down the columns instead (matching the Series index against the row labels) has to be requested explicitly.

In [115]:
s1 = df.iloc[:, 0]

In [116]:
s1

0   -0.267119
1    1.212974
2   -1.401605
3    1.470259
4   -0.086252
5   -0.656167
6    0.090406
7    1.220028
8    0.812580
9   -0.068540
Name: A, dtype: float64

In [117]:
df - s1

Unnamed: 0,A,B,C,D,0,1,2,3,4,5,6,7,8,9
0,,,,,,,,,,,,,,
1,,,,,,,,,,,,,,
2,,,,,,,,,,,,,,
3,,,,,,,,,,,,,,
4,,,,,,,,,,,,,,
5,,,,,,,,,,,,,,
6,,,,,,,,,,,,,,
7,,,,,,,,,,,,,,
8,,,,,,,,,,,,,,
9,,,,,,,,,,,,,,


In [118]:
df.sub(s1, axis=0)

Unnamed: 0,A,B,C,D
0,0.0,-0.26404,0.916083,0.679453
1,0.0,-0.491966,-2.357653,1.570381
2,0.0,2.434706,0.764253,-0.462156
3,0.0,-0.607999,-2.137492,-1.71722
4,0.0,-2.081966,-1.727758,-1.831237
5,0.0,1.629647,-0.009705,1.454523
6,0.0,1.41325,-1.128356,0.07499
7,0.0,-0.436673,0.094022,-2.087132
8,0.0,-2.405947,-1.39608,-1.144737
9,0.0,1.683416,0.376295,-1.545049


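df - s1 is equivalent to df.sub(s1, axis=1): the index of s1 is matched against the column labels of df, which is why In [117] produced nothing but NaN.

Alternatively, transpose before and after the operation: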

In [124]:
(df.T - s1).T

Unnamed: 0,A,B,C,D
0,0.0,-0.26404,0.916083,0.679453
1,0.0,-0.491966,-2.357653,1.570381
2,0.0,2.434706,0.764253,-0.462156
3,0.0,-0.607999,-2.137492,-1.71722
4,0.0,-2.081966,-1.727758,-1.831237
5,0.0,1.629647,-0.009705,1.454523
6,0.0,1.41325,-1.128356,0.07499
7,0.0,-0.436673,0.094022,-2.087132
8,0.0,-2.405947,-1.39608,-1.144737
9,0.0,1.683416,0.376295,-1.545049
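A typical use of column-wise broadcasting is subtracting each row's mean. The sketch below compares the `axis=0` form with the transpose form on a hypothetical frame `df_demo` whose values are easy to check by hand.

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [4.0, 6.0, 8.0]})

row_means = df_demo.mean(axis=1)               # one value per row: 2.5, 4.0, 5.5
centered = df_demo.sub(row_means, axis=0)      # match row_means against the row index
via_transpose = (df_demo.T - row_means).T      # same result through two transposes

print(centered)
```

The `axis=0` form is generally preferable to the double transpose because it states the intent directly.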


Arithmetic with scalars is straightforward: the operation is applied to every element.

In [125]:
df * 5 + 2

Unnamed: 0,A,B,C,D
0,0.664407,-0.655793,5.244824,4.06167
1,8.064871,5.605038,-3.723393,15.916775
2,-5.008027,7.165503,-1.186762,-7.318808
3,9.351294,6.311297,-1.336165,0.765196
4,1.568739,-8.841091,-7.07005,-7.587445
5,-1.280836,6.867401,-1.32936,5.991779
6,2.452031,9.518283,-3.189749,2.826979
7,8.10014,5.916776,8.570249,-2.335523
8,6.062898,-5.966837,-0.917501,0.339212
9,1.6573,10.074381,3.538775,-6.067947


In [126]:
1 / df

Unnamed: 0,A,B,C,D
0,-3.743656,-1.882677,1.540916,2.425219
1,0.82442,1.386948,-0.873608,0.359279
2,-0.713468,0.96796,-1.568991,-0.536549
3,0.680152,1.159744,-1.498727,-4.049225
4,-11.593898,-0.461208,-0.551265,-0.521515
5,-1.524002,1.027242,-1.50179,1.252574
6,11.061198,0.665045,-0.963438,6.0461
7,0.819653,1.27656,0.761006,-1.153263
8,1.230649,-0.627602,-1.713796,-3.010619
9,-14.590008,0.619243,3.249338,-0.619736


In [127]:
df ** 4

Unnamed: 0,A,B,C,D
0,0.005091,0.079597,0.177372,0.028907
1,2.164742,0.270246,1.716857,60.017012
2,3.859252,1.139123,0.165013,12.065949
3,4.672778,0.552779,0.198203,0.00372
4,5.5e-05,22.101004,10.828261,13.518596
5,0.185378,0.898066,0.196591,0.406243
6,6.7e-05,5.112044,1.160661,0.000748
7,2.215537,0.376561,2.981585,0.565309
8,0.435977,6.445607,0.115921,0.012172
9,2.2e-05,6.800752,0.008971,6.779103


In [131]:
df1 = pd.DataFrame({'a' : [1, 0, 1], 'b' : [0, 1, 1] }, dtype=bool)
df1

Unnamed: 0,a,b
0,True,False
1,False,True
2,True,True


In [132]:
df2 = pd.DataFrame({'a' : [0, 1, 1], 'b' : [1, 1, 0] }, dtype=bool)
df2

Unnamed: 0,a,b
0,False,True
1,True,True
2,True,False


In [133]:
df1 & df2

Unnamed: 0,a,b
0,False,False
1,False,True
2,True,False


In [134]:
df1 | df2

Unnamed: 0,a,b
0,True,True
1,True,True
2,True,True


In [138]:
~df1  # logical NOT; ~ is the idiomatic operator for boolean data

Unnamed: 0,a,b
0,False,True
1,True,False
2,False,False
