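Although pandas adopts many coding idioms from NumPy, the biggest difference is that pandas is designed for working with tabular or heterogeneous data. NumPy, by contrast, is best suited for working with homogeneously typed numerical array data.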

In [3]:
import numpy as np 
import pandas as pd 
from pandas import Series, DataFrame

rng = np.random.default_rng(seed = 12345)

# 5.1 Introduction to pandas Data Structures

## Series

In [2]:
obj = pd.Series([4, 7, -5, 3])  
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [4]:
obj.array

<NumpyExtensionArray>
[np.int64(4), np.int64(7), np.int64(-5), np.int64(3)]
Length: 4, dtype: int64

In [5]:
obj.index

RangeIndex(start=0, stop=4, step=1)

In [6]:
obj2 = pd.Series([3, 4, 2, -7], index=['d', 'b', 'c', 'a'])
obj2


d    3
b    4
c    2
a   -7
dtype: int64

In [7]:
obj2.index

Index(['d', 'b', 'c', 'a'], dtype='object')

In [None]:
print(obj2['a'])
print(obj2[['d', 'b']])  # pass a list of labels

-7
d    3
b    4
dtype: int64


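Filtering with a Boolean array, multiplying by a scalar, or applying math functions all preserve the link between index and value: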

In [12]:
obj2[obj2 > 0]

d    3
b    4
c    2
dtype: int64

In [13]:
obj2*2

d     6
b     8
c     4
a   -14
dtype: int64

In [14]:
np.exp(obj2)

d    20.085537
b    54.598150
c     7.389056
a     0.000912
dtype: float64

In [16]:
print('b' in obj2)
print('e' in obj2)

True
False


In [None]:
# If the data is already in a dict, pass the dict directly to create a Series
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [24]:
# Convert a Series back to a dict
obj3.to_dict()

{'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

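When you pass only a dict, the index of the resulting Series follows the order of the dict's keys (that is, their insertion order). You can override this by passing an index that lists the dict keys in the order you want them to appear in the resulting Series: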

In [25]:
states = ["California", "Ohio", "Oregon", "Texas"]
obj4 = pd.Series(sdata, index=states)
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

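The three values found in sdata were placed in the appropriate locations, but since no value for "California" was found, it appears as NaN (Not a Number), which pandas uses to mark missing or NA values. Since "Utah" was not included in states, it is excluded from the resulting object.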

In [26]:
print(pd.isna(obj4))  # check for missing values
print(pd.notna(obj4))  # check for non-missing values

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool


In [27]:
obj4.isna()  # also available as an instance method

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [None]:
print([obj3, obj4])
obj3 + obj4  # data alignment is discussed later

[Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64, California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64]


California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

In [None]:
# Give the Series and its index a name attribute
obj4.name = "state_population"
obj4.index.name = "state"
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: state_population, dtype: float64

In [None]:
print(obj)
obj.index = ["CA", "OH", "OR", "TX"]  # assign a new index in place
obj


0    4
1    7
2   -5
3    3
dtype: int64


CA    4
OH    7
OR   -5
TX    3
dtype: int64

## DataFrame

In [33]:
# from a dictionary of equal-length lists or NumPy arrays:
data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
frame

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


In [None]:
frame.head()  # head() returns the first five rows by default

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9


In [None]:
frame.tail()  # tail() returns the last five rows by default

Unnamed: 0,state,year,pop
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


In [37]:
frame2 = pd.DataFrame(data, columns=["year", "state"])  # keep only these columns, in this order
frame2

Unnamed: 0,year,state
0,2000,Ohio
1,2001,Ohio
2,2002,Ohio
3,2001,Nevada
4,2002,Nevada
5,2003,Nevada


In [38]:
frame2 = pd.DataFrame(data, columns=["year", "state", "pop", 'debt'])
frame2

Unnamed: 0,year,state,pop,debt
0,2000,Ohio,1.5,
1,2001,Ohio,1.7,
2,2002,Ohio,3.6,
3,2001,Nevada,2.4,
4,2002,Nevada,2.9,
5,2003,Nevada,3.2,


In [41]:
print(frame['state'])  # dict-like column access
print(frame.state)  # attribute access

0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
5    Nevada
Name: state, dtype: object
0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
5    Nevada
Name: state, dtype: object


In [46]:
print(frame2.loc[0])
print(frame2.iloc[-2:])

year     2000
state    Ohio
pop       1.5
debt      NaN
Name: 0, dtype: object
   year   state  pop debt
4  2002  Nevada  2.9  NaN
5  2003  Nevada  3.2  NaN


1. When you assign a list or array to a column, the value's length must match the length of the DataFrame.

In [50]:
frame2['debt'] = rng.random(6)
frame2

Unnamed: 0,year,state,pop,debt
0,2000,Ohio,1.5,0.598309
1,2001,Ohio,1.7,0.186734
2,2002,Ohio,3.6,0.672756
3,2001,Nevada,2.4,0.941803
4,2002,Nevada,2.9,0.248246
5,2003,Nevada,3.2,0.948881


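2. If you assign a Series, its labels are realigned exactly to the DataFrame's index, and missing values are inserted for any index values not present in the Series: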

In [51]:
val = pd.Series([-1.2, -1.5, -1.7], index = [2, 4, 5])
frame2['debt'] = val
frame2

Unnamed: 0,year,state,pop,debt
0,2000,Ohio,1.5,
1,2001,Ohio,1.7,
2,2002,Ohio,3.6,-1.2
3,2001,Nevada,2.4,
4,2002,Nevada,2.9,-1.5
5,2003,Nevada,3.2,-1.7


3. Assigning to a column that does not exist creates a new column.

In [61]:
frame2['eastern'] = frame2['state'] == 'Ohio'
frame2

Unnamed: 0,year,state,pop,debt,eastern
0,2000,Ohio,1.5,,True
1,2001,Ohio,1.7,,True
2,2002,Ohio,3.6,-1.2,True
3,2001,Nevada,2.4,,False
4,2002,Nevada,2.9,-1.5,False
5,2003,Nevada,3.2,-1.7,False


In [62]:
del frame2['eastern']
frame2.columns


Index(['year', 'state', 'pop', 'debt'], dtype='object')

If a nested dict is passed to DataFrame, pandas interprets the outer keys as the columns and the inner keys as the row index:

In [60]:
populations = {"Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6},
               "Nevada": {2000: 2.4, 2001: 2.9}}
frame3 = pd.DataFrame(populations)
frame3

Unnamed: 0,Ohio,Nevada
2000,1.5,2.4
2001,1.7,2.9
2002,3.6,


In [63]:
frame3.T

Unnamed: 0,2000,2001,2002
Ohio,1.5,1.7,3.6
Nevada,2.4,2.9,


In [None]:
pd.DataFrame(populations, index = [2010, 2011, 2012])
# An explicit index replaces the inner keys of the nested dict; labels not among the inner keys are filled with NaN

Unnamed: 0,Ohio,Nevada
2010,,
2011,,
2012,,


In [65]:
pdata = {"Ohio": frame3['Ohio'][:-1],
         "Nevada": frame3['Nevada'][:2]}
pd.DataFrame(pdata)


Unnamed: 0,Ohio,Nevada
2000,1.5,2.4
2001,1.7,2.9


A DataFrame's index (rows) and columns each have a name attribute

In [66]:
frame3.index.name = 'year'
frame3.columns.name = 'state'
frame3

state  Ohio  Nevada
year               
2000    1.5     2.4
2001    1.7     2.9
2002    3.6     NaN


In [67]:
frame3.to_numpy()

array([[1.5, 2.4],
       [1.7, 2.9],
       [3.6, nan]])

In [68]:
frame2.to_numpy()

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, nan],
       [2002, 'Ohio', 3.6, -1.2],
       [2001, 'Nevada', 2.4, nan],
       [2002, 'Nevada', 2.9, -1.5],
       [2003, 'Nevada', 3.2, -1.7]], dtype=object)

## Index Objects

In [70]:
obj = pd.Series(rng.standard_normal(4), index=['a', 'b', 'c', 'd'])
print(obj.index)
print(obj.index[:1])

Index(['a', 'b', 'c', 'd'], dtype='object')
Index(['a'], dtype='object')


Index objects are immutable

In [71]:
obj.index[1] = 'dom'

TypeError: Index does not support mutable operations

In [72]:
# Immutability makes it safer to share Index objects among data structures:
labels = pd.Index(np.arange(3))
obj2 = pd.Series([1.5, -2.5, 0], index=labels)
obj2.index is labels

True

In [75]:
print(frame3)
print(frame3.index)
print(frame3.columns)
print('Ohio' in frame3.columns)
print(2003 in frame3.index)

state  Ohio  Nevada
year               
2000    1.5     2.4
2001    1.7     2.9
2002    3.6     NaN
Index([2000, 2001, 2002], dtype='int64', name='year')
Index(['Ohio', 'Nevada'], dtype='object', name='state')
True
False


# 5.2 Essential Functionality

## Reindexing

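.reindex() essentially follows this logic:
   1. You give it a target index.
   2. It walks through every label in the target index.
   3. For each label in the target index, it looks for data with the same label in the original object.
       * If the label is found, the original value is copied to the corresponding position in the new object.
       * If it is not found (a label present in the target index but not in the original index), NaN
         (missing) is placed at that label's position in the new object.
   4. Data for labels in the original object that are not in the target index is dropped.
   5. Finally, it returns a brand-new `Series` or `DataFrame` whose index is the target index you specified.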
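
The steps above can be sketched by hand and compared against `reindex`. This is only an illustration of the lookup logic for unique labels, not how pandas implements it; the helper name `manual_reindex` is hypothetical.

```python
import numpy as np
import pandas as pd


def manual_reindex(ser, target):
    """Hypothetical helper that mimics Series.reindex for unique labels."""
    values = []
    for label in target:
        if label in ser.index:
            values.append(ser.loc[label])  # label found: copy the value
        else:
            values.append(np.nan)          # label missing: insert NaN
    # Labels of `ser` that are absent from `target` are never visited,
    # so their data does not appear in the result.
    return pd.Series(values, index=target)


obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])
target = ["a", "b", "e"]

by_hand = manual_reindex(obj, target)
built_in = obj.reindex(target)
print(built_in)
```

Here "a" and "b" keep their values, "e" becomes NaN, and "c" and "d" are dropped because they are not in the target index.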

In [85]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [87]:
obj2 = obj.reindex(['beijing', 'shanghai', 'wuxi', 'shenzhen', 'tokyo'])
obj2

beijing    NaN
shanghai   NaN
wuxi       NaN
shenzhen   NaN
tokyo      NaN
dtype: float64

In [88]:
obj3 = obj.reindex(['d', 'b', 'a', 'c', 'e'])
obj3

d    4.5
b    7.2
a   -5.3
c    3.6
e    NaN
dtype: float64

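Telling "mutate" from "replace":

  "What is on the left-hand side of the equals sign (`=`)?"

   1. If it is `name[index]` or `object.attribute[index]` (such as a[0], arr[1],
      frame.loc['d', 'Utah'])
       * This usually means you are trying to mutate the data inside the object.
       * If the object is mutable (list, ndarray, the data of a DataFrame), the operation succeeds.
       * If the object is immutable (tuple, string, pd.Index), the operation fails.

   2. If it is just `name` or `object.attribute` (such as a, arr, frame.columns)
       * This is a replace operation.
       * It means "make this name (or attribute) refer to the new object on the right-hand side".
       * It works regardless of whether the object previously bound to the name
         was mutable or immutable.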


In [None]:
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
print("--- Initial DataFrame ---")
print(frame)

# --- 1. Operations on an immutable object (Index) ---

# Trying to mutate the Index -> fails
try:
    print("\nTrying to modify frame.columns[1] ...")
    frame.columns[1] = "Utah"
except TypeError as e:
    print(f"Failed: {e}")  # Index objects do not support in-place modification

# Replacing the Index -> succeeds
print("\nReplacing frame.columns ...")
frame.columns = ['Ohio', 'Utah', 'California']  # assign a whole new list to the columns attribute
print("After replacing columns:\n", frame)
# The data values are unchanged; only the column labels are new

# --- 2. Operations on a mutable object (the data in a DataFrame) ---

# Mutating the data inside the DataFrame -> succeeds
print("\nModifying data inside the DataFrame ...")
frame.loc['d', 'Utah'] = 999  # locate a cell with .loc and modify it
print("After modifying data:\n", frame)

--- Initial DataFrame ---
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

Trying to modify frame.columns[1] ...
Failed: Index does not support mutable operations

Replacing frame.columns ...
After replacing columns:
    Ohio  Utah  California
a     0     1           2
c     3     4           5
d     6     7           8

Modifying data inside the DataFrame ...
After modifying data:
    Ohio  Utah  California
a     0     1           2
c     3     4           5
d     6   999           8


In [None]:
obj4 = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])
print(obj4)
obj4.reindex(np.arange(6), method = 'ffill')  # reindex returns a new object; obj4 itself is unchanged


0      blue
2    purple
4    yellow
dtype: object


0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

In [95]:
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
print(frame)

frame2 = frame.reindex(['a', 'b', 'c', 'd'])
print(frame2)

   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8
   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0


In [None]:
states = ["Texas", "Utah", "California"]
frame.reindex(columns=states)

Unnamed: 0,Texas,Utah,California
a,1,,2
c,4,,5
d,7,,8


In [None]:
frame.loc[["a", "d", "c"], ["California", "Texas"]]  # .loc selects rows and columns by label

Unnamed: 0,California,Texas
a,2,1
d,8,7
c,5,4


## Dropping Entries from an Axis

From a Series

In [3]:
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj


a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [6]:
new_obj = obj.drop('c')  # drop returns a new object; obj is unchanged
print(new_obj)
print(obj)

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64


In [7]:
obj.drop(['d', 'c'])

a    0.0
b    1.0
e    4.0
dtype: float64

From a DataFrame

In [8]:
data = pd.DataFrame(np.arange(16).reshape((4,4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data


Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [9]:
data.drop(['Colorado', 'Ohio'])

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15


In [11]:
data.drop(columns= ['four', 'one'])

Unnamed: 0,two,three
Ohio,1,2
Colorado,5,6
Utah,9,10
New York,13,14


drop with axis = 1/0 or 'columns'/'index'

In [None]:
data.drop('two', axis=1)  # as in NumPy: axis 0 = rows (index), axis 1 = columns

Unnamed: 0,one,three,four
Ohio,0,2,3
Colorado,4,6,7
Utah,8,10,11
New York,12,14,15


In [13]:
data.drop(['two','four'], axis='columns')

Unnamed: 0,one,three
Ohio,0,2
Colorado,4,6
Utah,8,10
New York,12,14


In [14]:
data.drop(['Utah', 'Ohio'], axis='index' )

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
New York,12,13,14,15


## Indexing, Selection, and Filtering

In [22]:
obj = pd.Series(np.arange(4.), index=["a", "b", "c", "d"])
print(obj)
print('---------')
print(obj['b'])
print(obj[1])
print('---------')
print(obj[2:4])
print(obj[['b', 'a', 'd']])
print(obj[[1, 3]])
print('---------')
print(obj[obj < 2])

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64
---------
1.0
1.0
---------
c    2.0
d    3.0
dtype: float64
b    1.0
a    0.0
d    3.0
dtype: float64
b    1.0
d    3.0
dtype: float64
---------
a    0.0
b    1.0
dtype: float64




loc => label-based indexing

iloc => integer position-based indexing

In [4]:
obj1 = pd.Series([1, 2, 3], index=[2, 0, 1])
obj2 = pd.Series([1, 2, 3], index=["a", "b", "c"])
print(obj1)
print('---------')
print(obj2)

2    1
0    2
1    3
dtype: int64
---------
a    1
b    2
c    3
dtype: int64


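The reason to prefer loc is that [] treats integers differently depending on the index. If the index contains integers, plain [] indexing treats integers as labels; otherwise it falls back to positions, so the behavior varies with the data type of the index. For example: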

In [6]:
# Indexing with []
print(obj1[[0, 1, 2]])  # integer index: 0, 1, 2 are treated as labels
print('---------')
print(obj2[[0,1,2]])

0    2
1    3
2    1
dtype: int64
---------
a    1
b    2
c    3
dtype: int64



In [None]:
obj2.loc[[0,1]]  # loc is label-based, so integer keys raise a KeyError here

KeyError: "None of [Index([0, 1], dtype='int64')] are in the [index]"

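Since the loc operator indexes exclusively with labels, there is also an iloc operator that indexes exclusively with integer positions and works consistently whether or not the index contains integers: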

In [7]:
print(obj1.iloc[[0,1,2]])  # positional, so the result differs from obj1[[0, 1, 2]]
print('---------')
print(obj2.iloc[[0,1,2]])


2    1
0    2
1    3
dtype: int64
---------
a    1
b    2
c    3
dtype: int64


In [None]:
obj2.loc['b':'d'] = 5
obj2  # modified in place, since a Series is mutable

a    1
b    5
c    5
dtype: int64

In [9]:
obj2[:2] = 3
obj2

a    3
b    3
c    5
dtype: int64

Indexing a DataFrame:
   - df[column name]: select a column
   - df.loc[row label]: select rows by label
   - df.iloc[row position]: select rows by position

In [13]:
data = pd.DataFrame(rng.standard_normal((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])
print(data)
print('-----------------')
print(data['two'])
print('-----------------')
print(data[['three', 'one']])
print('-----------------')
print(data.loc[['Ohio', 'Utah']])
print('-----------------')
print(data.loc[['Ohio', 'Utah']]['two'])

               one       two     three      four
Ohio      0.061144  0.070915  0.433655  0.277484
Colorado  0.530252  0.536721  0.618350 -0.795017
Utah      0.300031 -1.602702  0.266799 -1.261624
New York -0.071271  0.474050 -0.414854  0.097717
-----------------
Ohio        0.070915
Colorado    0.536721
Utah       -1.602702
New York    0.474050
Name: two, dtype: float64
-----------------
             three       one
Ohio      0.433655  0.061144
Colorado  0.618350  0.530252
Utah      0.266799  0.300031
New York -0.414854 -0.071271
-----------------
           one       two     three      four
Ohio  0.061144  0.070915  0.433655  0.277484
Utah  0.300031 -1.602702  0.266799 -1.261624
-----------------
Ohio    0.070915
Utah   -1.602702
Name: two, dtype: float64


In [15]:
print(data[:2])
print('-----------------')
print(data[data['three']>0])

               one       two     three      four
Ohio      0.061144  0.070915  0.433655  0.277484
Colorado  0.530252  0.536721  0.618350 -0.795017
-----------------
               one       two     three      four
Ohio      0.061144  0.070915  0.433655  0.277484
Colorado  0.530252  0.536721  0.618350 -0.795017
Utah      0.300031 -1.602702  0.266799 -1.261624


In [16]:
data >0

Unnamed: 0,one,two,three,four
Ohio,True,True,True,True
Colorado,True,True,True,False
Utah,True,False,True,False
New York,False,True,False,True


In [17]:
data[data < 0] = 0
data

Unnamed: 0,one,two,three,four
Ohio,0.061144,0.070915,0.433655,0.277484
Colorado,0.530252,0.536721,0.61835,0.0
Utah,0.300031,0.0,0.266799,0.0
New York,0.0,0.47405,0.0,0.097717


### Selection on DataFrame with loc and iloc

In [18]:
data

Unnamed: 0,one,two,three,four
Ohio,0.061144,0.070915,0.433655,0.277484
Colorado,0.530252,0.536721,0.61835,0.0
Utah,0.300031,0.0,0.266799,0.0
New York,0.0,0.47405,0.0,0.097717


select rows

In [19]:
data.loc['Colorado']

one      0.530252
two      0.536721
three    0.618350
four     0.000000
Name: Colorado, dtype: float64

In [20]:
data.loc[['Colorado', 'Ohio']]

Unnamed: 0,one,two,three,four
Colorado,0.530252,0.536721,0.61835,0.0
Ohio,0.061144,0.070915,0.433655,0.277484


In [21]:
data.loc['New York', ['four', 'one']]

four    0.097717
one     0.000000
Name: New York, dtype: float64

In [24]:
print(data.iloc[2])
print('---------')
print(data.iloc[2, 3])  # third row, fourth column
print('---------')
print(data.iloc[2, [3, 0, 1]])  # third row; fourth, first, and second columns
print('---------')
print(data.iloc[[1, 2], [3, 0, 1]])  # second and third rows; fourth, first, and second columns


one      0.300031
two      0.000000
three    0.266799
four     0.000000
Name: Utah, dtype: float64
---------
0.0
---------
four    0.000000
one     0.300031
two     0.000000
Name: Utah, dtype: float64
---------
          four       one       two
Colorado   0.0  0.530252  0.536721
Utah       0.0  0.300031  0.000000


In [25]:
data

Unnamed: 0,one,two,three,four
Ohio,0.061144,0.070915,0.433655,0.277484
Colorado,0.530252,0.536721,0.61835,0.0
Utah,0.300031,0.0,0.266799,0.0
New York,0.0,0.47405,0.0,0.097717


In [26]:
data.loc[:'Utah', 'two']

Ohio        0.070915
Colorado    0.536721
Utah        0.000000
Name: two, dtype: float64

In [None]:
data.iloc[:4, :3][data.Ohio > 0]

# Why this fails:
#   - data.Ohio is shorthand for data['Ohio'], which is column access
#   - Ohio is a row label (index), not a column name
#   - Attribute access on a DataFrame works only for column names, not row labels

AttributeError: 'DataFrame' object has no attribute 'Ohio'
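
One way to make a cell like this run, as a sketch: boolean filters on a DataFrame are built from columns, so the condition has to reference a column. The column `three` and the threshold 5 are chosen here purely for illustration, and the frame uses fixed values rather than the random data above.

```python
import numpy as np
import pandas as pd

# Illustrative data with fixed values (the notebook's `data` is random).
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

# Build the filter from a column, then select rows and columns in one .loc call.
mask = data["three"] > 5  # boolean Series aligned on the row index
subset = data.loc[mask, ["one", "two", "three"]]
print(subset)

# A row label belongs in .loc, not in attribute access.
ohio_row = data.loc["Ohio"]
print(ohio_row)
```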

### Integer indexing pitfalls

In [29]:
ser = pd.Series(np.arange(3.))
ser

0    0.0
1    1.0
2    2.0
dtype: float64

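With an integer index, integers are always treated as labels, so a label that does not exist (such as -1) raises a KeyError: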

In [None]:
ser[-1]



KeyError: -1

In [39]:
ser[1]

np.float64(1.0)

In [40]:
ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])
ser2

a    0.0
b    1.0
c    2.0
dtype: float64

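With a non-integer index, an integer key falls back to positional access (recent pandas versions warn that this fallback is deprecated):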

In [41]:
ser2[-1]



np.float64(2.0)
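
To keep the two cases consistent, a reasonable habit is to use `iloc` whenever a position is meant and `loc` whenever a label is meant. A minimal sketch:

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.))                          # integer index 0, 1, 2
ser2 = pd.Series(np.arange(3.), index=["a", "b", "c"])  # string index

# iloc is always positional, so -1 means "last element" for both.
last_int_index = ser.iloc[-1]
last_str_index = ser2.iloc[-1]

# loc is always label-based: 1 is a label in `ser`, "b" is a label in `ser2`.
by_label_int = ser.loc[1]
by_label_str = ser2.loc["b"]

print(last_int_index, last_str_index, by_label_int, by_label_str)
```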

### Pitfalls with chained indexing

In [42]:
data

Unnamed: 0,one,two,three,four
Ohio,0.061144,0.070915,0.433655,0.277484
Colorado,0.530252,0.536721,0.61835,0.0
Utah,0.300031,0.0,0.266799,0.0
New York,0.0,0.47405,0.0,0.097717


In [43]:
data.loc[:, 'one'] = 100
data

Unnamed: 0,one,two,three,four
Ohio,100.0,0.070915,0.433655,0.277484
Colorado,100.0,0.536721,0.61835,0.0
Utah,100.0,0.0,0.266799,0.0
New York,100.0,0.47405,0.0,0.097717


In [45]:
data.iloc[2] = 5
data

Unnamed: 0,one,two,three,four
Ohio,100.0,0.070915,0.433655,0.277484
Colorado,100.0,0.536721,0.61835,0.0
Utah,5.0,5.0,5.0,5.0
New York,100.0,0.47405,0.0,0.097717


In [48]:
data.loc[data['four'] > 3] = 66
# select the rows where 'four' is greater than 3
# every column in those rows is set to 66
data

Unnamed: 0,one,two,three,four
Ohio,100.0,0.070915,0.433655,0.277484
Colorado,100.0,0.536721,0.61835,0.0
Utah,66.0,66.0,66.0,66.0
New York,100.0,0.47405,0.0,0.097717


In [50]:
data.loc[data.four > 5, 'four'] = 5
# for rows where 'four' is greater than 5, set only the 'four' column to 5
data

Unnamed: 0,one,two,three,four
Ohio,100.0,0.070915,0.433655,0.277484
Colorado,100.0,0.536721,0.61835,0.0
Utah,66.0,66.0,66.0,5.0
New York,100.0,0.47405,0.0,0.097717


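  Whole-row versus single-column assignment:
   - data.loc[condition] = value → modifies every column of the matching rows
   - data.loc[condition, column] = value → modifies only that column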

In [51]:
data.loc[data['two'] == 66, 'three'] = 33
data

Unnamed: 0,one,two,three,four
Ohio,100.0,0.070915,0.433655,0.277484
Colorado,100.0,0.536721,0.61835,0.0
Utah,66.0,66.0,33.0,5.0
New York,100.0,0.47405,0.0,0.097717
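
The cells above all assign through a single `loc` or `iloc` call. The pitfall arises when the same assignment is written as two chained indexing steps: the first step can return a copy, and the assignment then lands on that temporary object. A sketch with illustrative fixed values (pandas normally emits a warning for the chained form; it is silenced here so the example runs quietly):

```python
import warnings

import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16.).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

# Chained form: data[mask] builds a new DataFrame, and the assignment
# modifies that temporary copy instead of `data`.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    data[data["four"] > 5]["four"] = 5
unchanged = data["four"].tolist()

# Single .loc call: rows and column are selected in one step,
# so the assignment reaches `data` itself.
data.loc[data["four"] > 5, "four"] = 5
changed = data["four"].tolist()

print(unchanged)
print(changed)
```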


## Arithmetic and Data Alignment

Alignment on the index of a Series

In [52]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=["a", "c", "e", "f", "g"])
print(s1)
print(s2)

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64
a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64


In [53]:
s1 + s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

For a DataFrame, alignment is performed on both the rows and the columns

In [57]:
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)),
                    columns=list('bcd'),
                    index=['Ohio', 'Texas', 'California'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                    columns=list('bde'),
                    index=["Utah", "Ohio", "Texas", "Oregon"])
print(df1)
print(df2)
print(df1 + df2)


              b    c    d
Ohio        0.0  1.0  2.0
Texas       3.0  4.0  5.0
California  6.0  7.0  8.0
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
              b   c     d   e
California  NaN NaN   NaN NaN
Ohio        3.0 NaN   6.0 NaN
Oregon      NaN NaN   NaN NaN
Texas       9.0 NaN  12.0 NaN
Utah        NaN NaN   NaN NaN


In [58]:
# Adding DataFrame objects with no column or row labels in common gives a result that is all nulls:
df1 = pd.DataFrame({"A": [1,2]})
df2 = pd.DataFrame({"B": [3,4]})
print(df1)
print(df2)
print(df1 + df2)

   A
0  1
1  2
   B
0  3
1  4
    A   B
0 NaN NaN
1 NaN NaN


### Arithmetic methods with fill values (fill_value)

In [None]:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
                   columns = list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
                   columns = list('abcde'))
df2.loc[1, 'b'] = np.nan  # introduce a missing value

print(df1)
print(df2)
print(df1 + df2)


     a    b     c     d
0  0.0  1.0   2.0   3.0
1  4.0  5.0   6.0   7.0
2  8.0  9.0  10.0  11.0
      a     b     c     d     e
0   0.0   1.0   2.0   3.0   4.0
1   5.0   NaN   7.0   8.0   9.0
2  10.0  11.0  12.0  13.0  14.0
3  15.0  16.0  17.0  18.0  19.0
      a     b     c     d   e
0   0.0   2.0   4.0   6.0 NaN
1   9.0   NaN  13.0  15.0 NaN
2  18.0  20.0  22.0  24.0 NaN
3   NaN   NaN   NaN   NaN NaN


In [None]:
df1.add(df2, fill_value=0)
# a value missing from one operand is treated as 0 before the addition

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,4.0
1,9.0,5.0,13.0,15.0,9.0
2,18.0,20.0,22.0,24.0,14.0
3,15.0,16.0,17.0,18.0,19.0
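
`fill_value` substitutes for a value that is missing on one side only; where both operands are missing, the result stays missing. In the DataFrame example above every position has a value on at least one side, so no NaN survives. A small Series sketch of the rule (the values are illustrative):

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, 1.0, 1.0, np.nan], index=["a", "b", "c", "d"])
b = pd.Series([1.0, np.nan, 1.0, np.nan], index=["a", "b", "d", "e"])

plain = a + b                    # any missing operand gives NaN
filled = a.add(b, fill_value=0)  # a one-sided gap is treated as 0

print(plain)
print(filled)
```

Label "e" stays NaN in `filled`: it is absent from `a` and NaN in `b`, so there is nothing to add the fill value to.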


In [62]:
1/df1

Unnamed: 0,a,b,c,d
0,inf,1.0,0.5,0.333333
1,0.25,0.2,0.166667,0.142857
2,0.125,0.111111,0.1,0.090909


In [63]:
df1.rdiv(1)

Unnamed: 0,a,b,c,d
0,inf,1.0,0.5,0.333333
1,0.25,0.2,0.166667,0.142857
2,0.125,0.111111,0.1,0.090909


In [None]:
df1.reindex(columns=df2.columns, fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,0
1,4.0,5.0,6.0,7.0,0
2,8.0,9.0,10.0,11.0,0


In [65]:
df1.reindex(index=df2.index, fill_value=0)

Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,5.0,6.0,7.0
2,8.0,9.0,10.0,11.0
3,0.0,0.0,0.0,0.0


### Operations between DataFrame and Series

In [66]:
arr = np.arange(12.).reshape((3,4))
print(arr)
print(arr-arr[0])

[[ 0.  1.  2.  3.]
 [ 4.  5.  6.  7.]
 [ 8.  9. 10. 11.]]
[[0. 0. 0. 0.]
 [4. 4. 4. 4.]
 [8. 8. 8. 8.]]


When we subtract arr[0] from arr, the subtraction is performed once for each row. This is referred to as broadcasting.

In [None]:
# By default, arithmetic between a DataFrame and a Series matches the Series index on the DataFrame's columns, broadcasting down the rows:
frame = pd.DataFrame(np.arange(12.).reshape((4,3)),
                     columns=list('abc'),
                     index=['Ohio', 'Colorado', 'Utah', 'New York'])
series = frame.iloc[0]
print(frame)
print(series)
print(frame - series)


            a     b     c
Ohio      0.0   1.0   2.0
Colorado  3.0   4.0   5.0
Utah      6.0   7.0   8.0
New York  9.0  10.0  11.0
a    0.0
b    1.0
c    2.0
Name: Ohio, dtype: float64
            a    b    c
Ohio      0.0  0.0  0.0
Colorado  3.0  3.0  3.0
Utah      6.0  6.0  6.0
New York  9.0  9.0  9.0


By default, pandas matches the index of the Series against the columns of the DataFrame and then broadcasts down the rows.

In [None]:
series2 = pd.Series(range(3), index=['b', 'e', 'c'])
print(series2)
print(frame + series2)

b    0
e    1
c    2
dtype: int64
           a     b     c   e
Ohio     NaN   1.0   4.0 NaN
Colorado NaN   4.0   7.0 NaN
Utah     NaN   7.0  10.0 NaN
New York NaN  10.0  13.0 NaN


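**What if you want to match on the rows and broadcast across the columns instead?**

Use one of the arithmetic methods (such as .sub() instead of -) and specify axis='index' or axis=0.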

In [75]:
print(frame)
series3 = frame['c']
print('-----------')
print(series3)

frame.sub(series3, axis='index')

            a     b     c
Ohio      0.0   1.0   2.0
Colorado  3.0   4.0   5.0
Utah      6.0   7.0   8.0
New York  9.0  10.0  11.0
-----------
Ohio         2.0
Colorado     5.0
Utah         8.0
New York    11.0
Name: c, dtype: float64


Unnamed: 0,a,b,c
Ohio,-2.0,-1.0,0.0
Colorado,-2.0,-1.0,0.0
Utah,-2.0,-1.0,0.0
New York,-2.0,-1.0,0.0
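
For comparison, the same row-wise subtraction can be written with NumPy broadcasting by turning the column into a column vector. This sketch only illustrates what `axis='index'` does; it ignores label alignment, which the pandas method handles.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list("abc"),
                     index=["Ohio", "Colorado", "Utah", "New York"])
series3 = frame["c"]

# pandas: match series3's index against the row labels, broadcast across columns.
by_pandas = frame.sub(series3, axis="index")

# NumPy: reshape the column to shape (4, 1) so it broadcasts across the 3 columns.
by_numpy = frame.to_numpy() - series3.to_numpy()[:, np.newaxis]

print(by_pandas)
print(by_numpy)
```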


## Function Application and Mapping