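Although pandas adopts many coding idioms from NumPy, the biggest difference is that pandas is designed for working with tabular or heterogeneous data. NumPy, by contrast, is best suited for working with homogeneously typed numerical array data.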
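
A minimal sketch of that difference, using made-up sample values (the names `arr` and `df` are illustrative and not part of the original notebook): a NumPy array carries one dtype for all of its elements, while a DataFrame keeps a dtype per column.

```python
import numpy as np
import pandas as pd

# A NumPy array has a single dtype shared by every element.
arr = np.array([1.5, 2.0, 3.25])
print(arr.dtype)

# A DataFrame stores each column with its own dtype, so integers,
# floats, Booleans, and strings can sit side by side in one table.
df = pd.DataFrame({
    "state": ["Ohio", "Nevada"],
    "year": [2000, 2001],
    "pop": [1.5, 2.4],
    "flag": [True, False],
})
print(df.dtypes)
```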

In [2]:
import numpy as np 
import pandas as pd 
from pandas import Series, DataFrame

rng = np.random.default_rng(seed = 12345)

# 5.1 Introduction to pandas Data Structures

## Series

In [2]:
obj = pd.Series([4, 7, -5, 3])  
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [4]:
obj.array

<NumpyExtensionArray>
[np.int64(4), np.int64(7), np.int64(-5), np.int64(3)]
Length: 4, dtype: int64

In [5]:
obj.index

RangeIndex(start=0, stop=4, step=1)

In [6]:
obj2 = pd.Series([3, 4, 2, -7], index=['d', 'b', 'c', 'a'])
obj2


d    3
b    4
c    2
a   -7
dtype: int64

In [7]:
obj2.index

Index(['d', 'b', 'c', 'a'], dtype='object')

In [None]:
print(obj2['a'])
print(obj2[['d', 'b']]) # pass a list of labels to select several values

-7
d    3
b    4
dtype: int64


Filtering with a Boolean array, scalar multiplication, or applying math functions preserves the index-value link:

In [12]:
obj2[obj2 > 0]

d    3
b    4
c    2
dtype: int64

In [13]:
obj2*2

d     6
b     8
c     4
a   -14
dtype: int64

In [14]:
np.exp(obj2)

d    20.085537
b    54.598150
c     7.389056
a     0.000912
dtype: float64

In [16]:
print('b' in obj2)
print('e' in obj2)

True
False


In [None]:
# If the data is already in a dict, pass the dict directly to create a Series
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [24]:
# Convert a Series back to a dict
obj3.to_dict()


{'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

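When you pass only a dict, the index of the resulting Series follows the order of the dict's keys (that is, their insertion order). You can override this by passing an index with the dict keys in the order you want them to appear in the resulting Series: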

In [25]:
states = ["California", "Ohio", "Oregon", "Texas"]
obj4 = pd.Series(sdata, index=states)
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

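The three values found in sdata were placed in the appropriate locations, but since no value for "California" was found, it appears as NaN (Not a Number), which pandas uses to mark missing or NA values. Since "Utah" was not included in states, it is excluded from the resulting object.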

In [26]:
print(pd.isna(obj4)) # check for missing values
print(pd.notna(obj4)) # check for non-missing values

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool


In [27]:
obj4.isna() # also available as an instance method

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [None]:
print([obj3, obj4])
obj3 + obj4 # the data alignment logic is discussed later

[Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64, California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64]


California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

In [None]:
# Give the Series and its index a name attribute
obj4.name = "state_population"
obj4.index.name = "state"
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: state_population, dtype: float64

In [None]:
print(obj)
obj.index = ["CA", "OH", "OR", "TX"] # reassign the index
obj


0    4
1    7
2   -5
3    3
dtype: int64


CA    4
OH    7
OR   -5
TX    3
dtype: int64

## DataFrame

In [33]:
# from a dictionary of equal-length lists or NumPy arrays:
data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
frame

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


In [None]:
frame.head() # the head method selects the first 5 rows

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9


In [None]:
frame.tail() # returns the last 5 rows

Unnamed: 0,state,year,pop
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


In [37]:
frame2 = pd.DataFrame(data, columns=["year", "state"]) # select a subset of columns, in this order
frame2

Unnamed: 0,year,state
0,2000,Ohio
1,2001,Ohio
2,2002,Ohio
3,2001,Nevada
4,2002,Nevada
5,2003,Nevada


In [38]:
frame2 = pd.DataFrame(data, columns=["year", "state", "pop", 'debt'])
frame2

Unnamed: 0,year,state,pop,debt
0,2000,Ohio,1.5,
1,2001,Ohio,1.7,
2,2002,Ohio,3.6,
3,2001,Nevada,2.4,
4,2002,Nevada,2.9,
5,2003,Nevada,3.2,


In [41]:
print(frame['state']) # dictionary-like notation
print(frame.state) # attribute access

0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
5    Nevada
Name: state, dtype: object
0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
5    Nevada
Name: state, dtype: object


In [46]:
print(frame2.loc[0])
print(frame2.iloc[-2:])

year     2000
state    Ohio
pop       1.5
debt      NaN
Name: 0, dtype: object
   year   state  pop debt
4  2002  Nevada  2.9  NaN
5  2003  Nevada  3.2  NaN


1. When you assign a list or array to a column, the value's length must match the length of the DataFrame.

In [50]:
frame2['debt'] = rng.random(6)
frame2

Unnamed: 0,year,state,pop,debt
0,2000,Ohio,1.5,0.598309
1,2001,Ohio,1.7,0.186734
2,2002,Ohio,3.6,0.672756
3,2001,Nevada,2.4,0.941803
4,2002,Nevada,2.9,0.248246
5,2003,Nevada,3.2,0.948881


2. If you assign a Series, its labels are realigned exactly to the DataFrame's index, and missing values are inserted for any index values not present:

In [51]:
val = pd.Series([-1.2, -1.5, -1.7], index = [2, 4, 5])
frame2['debt'] = val
frame2

Unnamed: 0,year,state,pop,debt
0,2000,Ohio,1.5,
1,2001,Ohio,1.7,
2,2002,Ohio,3.6,-1.2
3,2001,Nevada,2.4,
4,2002,Nevada,2.9,-1.5
5,2003,Nevada,3.2,-1.7


3. Assigning to a column that does not exist creates a new column.

In [61]:
frame2['eastern'] = frame2['state'] == 'Ohio'
frame2

Unnamed: 0,year,state,pop,debt,eastern
0,2000,Ohio,1.5,,True
1,2001,Ohio,1.7,,True
2,2002,Ohio,3.6,-1.2,True
3,2001,Nevada,2.4,,False
4,2002,Nevada,2.9,-1.5,False
5,2003,Nevada,3.2,-1.7,False


In [62]:
del frame2['eastern']
frame2.columns


Index(['year', 'state', 'pop', 'debt'], dtype='object')

If a nested dict is passed to DataFrame, pandas interprets the outer keys as the columns and the inner keys as the row index:

In [60]:
populations = {"Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6},
               "Nevada": {2000: 2.4, 2001: 2.9}}
frame3 = pd.DataFrame(populations)
frame3

Unnamed: 0,Ohio,Nevada
2000,1.5,2.4
2001,1.7,2.9
2002,3.6,


In [63]:
frame3.T

Unnamed: 0,2000,2001,2002
Ohio,1.5,1.7,3.6
Nevada,2.4,2.9,


In [None]:
pd.DataFrame(populations, index = [2010, 2011, 2012])
# An explicitly specified index takes precedence over the inner keys of the nested dict

Unnamed: 0,Ohio,Nevada
2010,,
2011,,
2012,,


In [65]:
pdata = {"Ohio": frame3['Ohio'][:-1],
         "Nevada": frame3['Nevada'][:2]}
pd.DataFrame(pdata)


Unnamed: 0,Ohio,Nevada
2000,1.5,2.4
2001,1.7,2.9


A DataFrame's index (rows) and columns each have a name attribute:

In [66]:
frame3.index.name = 'year'
frame3.columns.name = 'state'
frame3

state  Ohio  Nevada
year               
2000    1.5     2.4
2001    1.7     2.9
2002    3.6     NaN


In [67]:
frame3.to_numpy()

array([[1.5, 2.4],
       [1.7, 2.9],
       [3.6, nan]])

In [68]:
frame2.to_numpy()

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, nan],
       [2002, 'Ohio', 3.6, -1.2],
       [2001, 'Nevada', 2.4, nan],
       [2002, 'Nevada', 2.9, -1.5],
       [2003, 'Nevada', 3.2, -1.7]], dtype=object)

## Index Objects

In [70]:
obj = pd.Series(rng.standard_normal(4), index=['a', 'b', 'c', 'd'])
print(obj.index)
print(obj.index[:1])

Index(['a', 'b', 'c', 'd'], dtype='object')
Index(['a'], dtype='object')


Index objects are immutable

In [71]:
obj.index[1] = 'dom'

TypeError: Index does not support mutable operations

In [72]:
# Immutability makes it safer to share Index objects among data structures:
labels = pd.Index(np.arange(3))
obj2 = pd.Series([1.5, -2.5, 0], index=labels)
obj2.index is labels

True

In [75]:
print(frame3)
print(frame3.index)
print(frame3.columns)
print('Ohio' in frame3.columns)
print(2003 in frame3.index)

state  Ohio  Nevada
year               
2000    1.5     2.4
2001    1.7     2.9
2002    3.6     NaN
Index([2000, 2001, 2002], dtype='int64', name='year')
Index(['Ohio', 'Nevada'], dtype='object', name='state')
True
False


# 5.2 Essential Functionality

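## Reindexing

.reindex() basically follows this logic:
   1. You give it a target index.
   2. It walks through every label in the target index.
   3. For each label in the target index, it looks in the original object for data with the same label.
       * If the label is found, the original value is copied to the corresponding position in the new object.
       * If it is not found (a label that is in the target index but not in the original index), NaN
         (missing) is placed at that label's position in the new object.
   4. Labels in the original object that are not in the target index have their data dropped.
   5. Finally, it returns a brand-new `Series` or `DataFrame` whose index is the target index you specified.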
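
The steps above can be sketched as a small function. This is an illustrative re-implementation for a Series with unique labels, not how pandas implements `reindex` internally; the helper name `manual_reindex` is hypothetical.

```python
import numpy as np
import pandas as pd

def manual_reindex(ser, target):
    """Illustrative version of Series.reindex for unique labels."""
    values = []
    for label in target:                   # step 2: walk the target index
        if label in ser.index:             # step 3: look the label up
            values.append(ser.loc[label])  # found: copy the value
        else:
            values.append(np.nan)          # not found: mark as missing
    # Step 4 happens implicitly: labels not in target are never visited.
    return pd.Series(values, index=target) # step 5: a new object

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])
target = ["a", "b", "e"]

manual = manual_reindex(obj, target)
builtin = obj.reindex(target)
print(manual)
print(builtin)
```

Both results keep `a` and `b`, drop `c` and `d`, and hold NaN for the new label `e`.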

In [85]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [87]:
obj2 = obj.reindex(['beijing', 'shanghai', 'wuxi', 'shenzhen', 'tokyo'])
obj2

beijing    NaN
shanghai   NaN
wuxi       NaN
shenzhen   NaN
tokyo      NaN
dtype: float64

In [88]:
obj3 = obj.reindex(['d', 'b', 'a', 'c', 'e'])
obj3

d    4.5
b    7.2
a   -5.3
c    3.6
e    NaN
dtype: float64

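Deciding whether an assignment "mutates" or "replaces":

  "What is on the left-hand side of the equals sign (`=`)?"

   1. If it is `name[index]` or `object.attribute[index]` (such as a[0], arr[1],
      frame.loc['d', 'Utah'])
       * This usually means you are trying to mutate the data inside the object.
       * If the object is mutable (list, ndarray, DataFrame data), the operation succeeds.
       * If the object is immutable (tuple, string, pd.Index), the operation fails.

   2. If it is just a `name` or `object.attribute` (such as a, arr, frame.columns)
       * This is always a replace operation.
       * It means "make this name (or attribute) refer to the new object on the right-hand side".
       * It does not depend on whether the object previously referred to was mutable or immutable.
         For a plain variable name it always succeeds; an attribute such as frame.columns can still
         reject an invalid value (for example, a list of the wrong length).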


In [None]:
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
print("--- Initial DataFrame ---")
print(frame)

# --- 1. Operations on an immutable object (Index) ---

# Trying to mutate the Index -> fails!
try:
    print("\nTrying to modify frame.columns[1] ...")
    frame.columns[1] = "Utah"
except TypeError as e:
    print(f"Failed: {e}") # Index objects do not support in-place modification

# Replacing the Index -> succeeds!
print("\nReplacing frame.columns ...")
frame.columns = ['Ohio', 'Utah', 'California'] # replace the whole columns attribute with a new list
print("After replacing columns:\n", frame)
# The new labels are attached to the existing values; the data itself is unchanged
 
# --- 2. Operations on a mutable object (the data inside the DataFrame) ---

# Mutating the data inside the DataFrame -> succeeds!
print("\nModifying data inside the DataFrame ...")
frame.loc['d', 'Utah'] = 999 # locate the cell with .loc and modify it in place
print("After modifying the data:\n", frame)

--- Initial DataFrame ---
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

Trying to modify frame.columns[1] ...
Failed: Index does not support mutable operations

Replacing frame.columns ...
After replacing columns:
    Ohio  Utah  California
a     0     1           2
c     3     4           5
d     6     7           8

Modifying data inside the DataFrame ...
After modifying the data:
    Ohio  Utah  California
a     0     1           2
c     3     4           5
d     6   999           8


In [None]:
obj4 = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])
print(obj4)
obj4.reindex(np.arange(6), method = 'ffill') # reindex returns a new object; obj4 is not modified


0      blue
2    purple
4    yellow
dtype: object


0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

In [95]:
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
print(frame)

frame2 = frame.reindex(['a', 'b', 'c', 'd'])
print(frame2)

   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8
   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0


In [None]:
states = ["Texas", "Utah", "California"]
frame.reindex(columns=states)

Unnamed: 0,Texas,Utah,California
a,1,,2
c,4,,5
d,7,,8


In [None]:
frame.loc[["a", "d", "c"], ["California", "Texas"]] # .loc() => locate by label

Unnamed: 0,California,Texas
a,2,1
d,8,7
c,5,4


## Dropping Entries from an Axis

From Series

In [3]:
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj


a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [6]:
new_obj = obj.drop('c') # the drop method returns a new object; obj is unchanged
print(new_obj)
print(obj)

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64


In [7]:
obj.drop(['d', 'c'])

a    0.0
b    1.0
e    4.0
dtype: float64

From DataFrame

In [8]:
data = pd.DataFrame(np.arange(16).reshape((4,4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data


Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [9]:
data.drop(['Colorado', 'Ohio'])

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15


In [11]:
data.drop(columns= ['four', 'one'])

Unnamed: 0,two,three
Ohio,1,2
Colorado,5,6
Utah,9,10
New York,13,14


drop with axis = 1/0 or 'columns'/'index'

In [None]:
data.drop('two', axis=1) # like numpy, 0 = index, 1 =columns

Unnamed: 0,one,three,four
Ohio,0,2,3
Colorado,4,6,7
Utah,8,10,11
New York,12,14,15


In [13]:
data.drop(['two','four'], axis='columns')

Unnamed: 0,one,three
Ohio,0,2
Colorado,4,6
Utah,8,10
New York,12,14


In [14]:
data.drop(['Utah', 'Ohio'], axis='index' )

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
New York,12,13,14,15


## Indexing, Selection, and Filtering

In [22]:
obj = pd.Series(np.arange(4.), index=["a", "b", "c", "d"])
print(obj)
print('---------')
print(obj['b'])
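print(obj[1]) # an integer on a string index falls back to position; deprecated in recent pandas, prefer obj.iloc[1]
print('---------')
print(obj[2:4])
print(obj[['b', 'a', 'd']])
print(obj[[1, 3]]) # same positional fallback; prefer obj.iloc[[1, 3]]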
print('---------')
print(obj[obj < 2])

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64
---------
1.0
1.0
---------
c    2.0
d    3.0
dtype: float64
b    1.0
a    0.0
d    3.0
dtype: float64
b    1.0
d    3.0
dtype: float64
---------
a    0.0
b    1.0
dtype: float64




The loc method => label-based indexing

The iloc method => integer (position-based) indexing

In [4]:
obj1 = pd.Series([1, 2, 3], index=[2, 0, 1])
obj2 = pd.Series([1, 2, 3], index=["a", "b", "c"])
print(obj1)
print('---------')
print(obj2)

2    1
0    2
1    3
dtype: int64
---------
a    1
b    2
c    3
dtype: int64


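The reason to prefer loc is that [] treats integers differently depending on the index. If the index contains integers, regular []-based indexing treats integers as labels, so the behavior differs depending on the data type of the index. For example: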

In [6]:
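# Indexing with []
print(obj1[[0,1,2]]) # integer index: 0, 1, 2 are treated as labels
print('---------')
print(obj2[[0,1,2]]) # string index: integers fall back to positions (deprecated in recent pandas)

0    2
1    3
2    1
dtype: int64
---------
a    1
b    2
c    3
dtype: int64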


In [None]:
obj2.loc[[0,1]] # loc is label-based, so this raises a KeyError

KeyError: "None of [Index([0, 1], dtype='int64')] are in the [index]"

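Since the loc operator indexes exclusively with labels, there is also an iloc operator that indexes exclusively with integers, and it works consistently whether or not the index contains integers: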

In [7]:
print(obj1.iloc[[0,1,2]]) # differs from what [] indexing returns
print('---------')
print(obj2.iloc[[0,1,2]])


2    1
0    2
1    3
dtype: int64
---------
a    1
b    2
c    3
dtype: int64


In [None]:
obj2.loc['b':'d'] = 5
obj2 # the original object is modified in place, because a Series is mutable

a    1
b    5
c    5
dtype: int64

In [9]:
obj2[:2] = 3
obj2

a    3
b    3
c    5
dtype: int64

Indexing into a DataFrame:
   - df[column name]: select a column
   - df.loc[row label]: select rows by label
   - df.iloc[row position]: select rows by position

In [13]:
data = pd.DataFrame(rng.standard_normal((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])
print(data)
print('-----------------')
print(data['two'])
print('-----------------')
print(data[['three', 'one']])
print('-----------------')
print(data.loc[['Ohio', 'Utah']])
print('-----------------')
print(data.loc[['Ohio', 'Utah']]['two'])

               one       two     three      four
Ohio      0.061144  0.070915  0.433655  0.277484
Colorado  0.530252  0.536721  0.618350 -0.795017
Utah      0.300031 -1.602702  0.266799 -1.261624
New York -0.071271  0.474050 -0.414854  0.097717
-----------------
Ohio        0.070915
Colorado    0.536721
Utah       -1.602702
New York    0.474050
Name: two, dtype: float64
-----------------
             three       one
Ohio      0.433655  0.061144
Colorado  0.618350  0.530252
Utah      0.266799  0.300031
New York -0.414854 -0.071271
-----------------
           one       two     three      four
Ohio  0.061144  0.070915  0.433655  0.277484
Utah  0.300031 -1.602702  0.266799 -1.261624
-----------------
Ohio    0.070915
Utah   -1.602702
Name: two, dtype: float64


In [15]:
print(data[:2])
print('-----------------')
print(data[data['three']>0])

               one       two     three      four
Ohio      0.061144  0.070915  0.433655  0.277484
Colorado  0.530252  0.536721  0.618350 -0.795017
-----------------
               one       two     three      four
Ohio      0.061144  0.070915  0.433655  0.277484
Colorado  0.530252  0.536721  0.618350 -0.795017
Utah      0.300031 -1.602702  0.266799 -1.261624


In [16]:
data >0

Unnamed: 0,one,two,three,four
Ohio,True,True,True,True
Colorado,True,True,True,False
Utah,True,False,True,False
New York,False,True,False,True


In [17]:
data[data < 0] = 0
data

Unnamed: 0,one,two,three,four
Ohio,0.061144,0.070915,0.433655,0.277484
Colorado,0.530252,0.536721,0.61835,0.0
Utah,0.300031,0.0,0.266799,0.0
New York,0.0,0.47405,0.0,0.097717


### Selection on DataFrame with loc and iloc

In [18]:
data

Unnamed: 0,one,two,three,four
Ohio,0.061144,0.070915,0.433655,0.277484
Colorado,0.530252,0.536721,0.61835,0.0
Utah,0.300031,0.0,0.266799,0.0
New York,0.0,0.47405,0.0,0.097717


select rows

In [19]:
data.loc['Colorado']

one      0.530252
two      0.536721
three    0.618350
four     0.000000
Name: Colorado, dtype: float64

In [20]:
data.loc[['Colorado', 'Ohio']]

Unnamed: 0,one,two,three,four
Colorado,0.530252,0.536721,0.61835,0.0
Ohio,0.061144,0.070915,0.433655,0.277484


In [21]:
data.loc['New York', ['four', 'one']]

four    0.097717
one     0.000000
Name: New York, dtype: float64

In [24]:
print(data.iloc[2])
print('---------')
print(data.iloc[2, 3]) # third row, fourth column
print('---------')
print(data.iloc[2, [3, 0, 1]]) # third row; fourth, first, and second columns
print('---------')
print(data.iloc[[1, 2], [3, 0, 1]]) # second and third rows; fourth, first, and second columns


one      0.300031
two      0.000000
three    0.266799
four     0.000000
Name: Utah, dtype: float64
---------
0.0
---------
four    0.000000
one     0.300031
two     0.000000
Name: Utah, dtype: float64
---------
          four       one       two
Colorado   0.0  0.530252  0.536721
Utah       0.0  0.300031  0.000000


In [25]:
data

Unnamed: 0,one,two,three,four
Ohio,0.061144,0.070915,0.433655,0.277484
Colorado,0.530252,0.536721,0.61835,0.0
Utah,0.300031,0.0,0.266799,0.0
New York,0.0,0.47405,0.0,0.097717


In [26]:
data.loc[:'Utah', 'two']

Ohio        0.070915
Colorado    0.536721
Utah        0.000000
Name: two, dtype: float64

In [None]:
data.iloc[:4, :3][data.Ohio > 0]

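#  Why this fails:
#   - data.Ohio is equivalent to data['Ohio'], which is column-access syntax
#   - Ohio is a row label (index), not a column name (columns)
#   - attribute access on a DataFrame works only for column names, not for row labels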

AttributeError: 'DataFrame' object has no attribute 'Ohio'
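
If the intent was to filter rows by the values in a column (an assumption, since the failing cell does not say), the Boolean mask has to be built from a column such as `three`. A sketch on a small stand-in frame with made-up values; `demo`, `mask`, and `subset` are illustrative names:

```python
import pandas as pd

# Stand-in frame with fixed values (hypothetical data, not the
# notebook's random numbers).
demo = pd.DataFrame(
    [[0.1, 0.2, 0.4, 0.3],
     [0.5, 0.6, 0.0, 0.0],
     [0.3, 0.0, 0.2, 0.0]],
    index=["Ohio", "Colorado", "Utah"],
    columns=["one", "two", "three", "four"],
)

# Attribute access resolves column names, so demo.three works
# while demo.Ohio would raise AttributeError.
mask = demo.three > 0
subset = demo.iloc[:, :3][mask]
print(subset)

# A row label is reached through .loc instead.
print(demo.loc["Ohio"])
```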

### Integer indexing pitfalls

In [29]:
ser = pd.Series(np.arange(3.))
ser

0    0.0
1    1.0
2    2.0
dtype: float64

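With an integer index, an integer inside [] is treated as a label, so a label that does not exist (such as -1) cannot be looked up: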

In [None]:
ser[-1]



KeyError: -1

In [39]:
ser[1]

np.float64(1.0)

In [40]:
ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])
ser2

a    0.0
b    1.0
c    2.0
dtype: float64

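With a non-integer index, an integer inside [] falls back to positional access (deprecated in recent pandas versions):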

In [41]:
ser2[-1]



np.float64(2.0)
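
A short sketch of the unambiguous alternatives: `.iloc` is always positional and `.loc` is always label-based, so neither depends on the dtype of the index. The two Series are rebuilt here so the block runs on its own.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.))                          # integer labels 0, 1, 2
ser2 = pd.Series(np.arange(3.), index=["a", "b", "c"])  # string labels

# .iloc is always positional, so -1 means "last element" for both.
last_int_index = ser.iloc[-1]
last_str_index = ser2.iloc[-1]
print(last_int_index, last_str_index)

# .loc is always label-based.
print(ser.loc[1])
print(ser2.loc["b"])
```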

### Pitfalls with chained indexing

In [42]:
data

Unnamed: 0,one,two,three,four
Ohio,0.061144,0.070915,0.433655,0.277484
Colorado,0.530252,0.536721,0.61835,0.0
Utah,0.300031,0.0,0.266799,0.0
New York,0.0,0.47405,0.0,0.097717


In [43]:
data.loc[:, 'one'] = 100
data

Unnamed: 0,one,two,three,four
Ohio,100.0,0.070915,0.433655,0.277484
Colorado,100.0,0.536721,0.61835,0.0
Utah,100.0,0.0,0.266799,0.0
New York,100.0,0.47405,0.0,0.097717


In [45]:
data.iloc[2] = 5
data

Unnamed: 0,one,two,three,four
Ohio,100.0,0.070915,0.433655,0.277484
Colorado,100.0,0.536721,0.61835,0.0
Utah,5.0,5.0,5.0,5.0
New York,100.0,0.47405,0.0,0.097717


In [48]:
data.loc[data['four'] > 3] = 66
# select the rows where 'four' is greater than 3
# every column in those rows is set to 66, not only the 'four' column
data

Unnamed: 0,one,two,three,four
Ohio,100.0,0.070915,0.433655,0.277484
Colorado,100.0,0.536721,0.61835,0.0
Utah,66.0,66.0,66.0,66.0
New York,100.0,0.47405,0.0,0.097717


In [50]:
data.loc[data.four > 5, 'four'] = 5
# for rows where 'four' is greater than 5, set only the 'four' column to 5
data

Unnamed: 0,one,two,three,four
Ohio,100.0,0.070915,0.433655,0.277484
Colorado,100.0,0.536721,0.61835,0.0
Utah,66.0,66.0,66.0,5.0
New York,100.0,0.47405,0.0,0.097717


  Compare with whole-row assignment:
   - data.loc[condition] = value → modifies the entire row
   - data.loc[condition, column] = value → modifies only that column

In [51]:
data.loc[data['two'] == 66, 'three'] = 33
data

Unnamed: 0,one,two,three,four
Ohio,100.0,0.070915,0.433655,0.277484
Colorado,100.0,0.536721,0.61835,0.0
Utah,66.0,66.0,33.0,5.0
New York,100.0,0.47405,0.0,0.097717


## Arithmetic and Data Alignment

Index alignment for Series

In [52]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=["a", "c", "e", "f", "g"])
print(s1)
print(s2)

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64
a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64


In [53]:
s1 + s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64
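
One way to picture the alignment, as an illustrative sketch and not a description of pandas internals: the result index is the union of the two indexes, both operands are reindexed to it, and the values are then added position by position. The names `aligned_index` and `manual` are introduced here for illustration.

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=["a", "c", "e", "f", "g"])

# The union of both indexes becomes the index of the result.
aligned_index = s1.index.union(s2.index)

# Reindexing introduces NaN wherever a label is missing from one side,
# and NaN plus anything is NaN.
manual = s1.reindex(aligned_index) + s2.reindex(aligned_index)
direct = s1 + s2

print(manual)
print(direct)
```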

For a DataFrame, alignment is performed on both the rows and the columns

In [57]:
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)),
                    columns=list('bcd'),
                    index=['Ohio', 'Texas', 'California'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                    columns=list('bde'),
                    index=["Utah", "Ohio", "Texas", "Oregon"])
print(df1)
print(df2)
print(df1 + df2)


              b    c    d
Ohio        0.0  1.0  2.0
Texas       3.0  4.0  5.0
California  6.0  7.0  8.0
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
              b   c     d   e
California  NaN NaN   NaN NaN
Ohio        3.0 NaN   6.0 NaN
Oregon      NaN NaN   NaN NaN
Texas       9.0 NaN  12.0 NaN
Utah        NaN NaN   NaN NaN


In [58]:
# Adding DataFrame objects that share no column or row labels gives a result that is all NaN:
df1 = pd.DataFrame({"A": [1,2]})
df2 = pd.DataFrame({"B": [3,4]})
print(df1)
print(df2)
print(df1 + df2)

   A
0  1
1  2
   B
0  3
1  4
    A   B
0 NaN NaN
1 NaN NaN


### Arithmetic methods with fill values (fill_value)

In [None]:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
                   columns = list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
                   columns = list('abcde'))
df2.loc[1, 'b'] = np.nan # set one value to NaN (missing)

print(df1)
print(df2)
print(df1 + df2)


     a    b     c     d
0  0.0  1.0   2.0   3.0
1  4.0  5.0   6.0   7.0
2  8.0  9.0  10.0  11.0
      a     b     c     d     e
0   0.0   1.0   2.0   3.0   4.0
1   5.0   NaN   7.0   8.0   9.0
2  10.0  11.0  12.0  13.0  14.0
3  15.0  16.0  17.0  18.0  19.0
      a     b     c     d   e
0   0.0   2.0   4.0   6.0 NaN
1   9.0   NaN  13.0  15.0 NaN
2  18.0  20.0  22.0  24.0 NaN
3   NaN   NaN   NaN   NaN NaN


In [None]:
df1.add(df2, fill_value=0)
# a value missing from one operand is treated as 0 before adding

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,4.0
1,9.0,5.0,13.0,15.0,9.0
2,18.0,20.0,22.0,24.0,14.0
3,15.0,16.0,17.0,18.0,19.0


In [62]:
1/df1

Unnamed: 0,a,b,c,d
0,inf,1.0,0.5,0.333333
1,0.25,0.2,0.166667,0.142857
2,0.125,0.111111,0.1,0.090909


In [63]:
df1.rdiv(1)

Unnamed: 0,a,b,c,d
0,inf,1.0,0.5,0.333333
1,0.25,0.2,0.166667,0.142857
2,0.125,0.111111,0.1,0.090909


In [None]:
df1.reindex(columns=df2.columns, fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,0
1,4.0,5.0,6.0,7.0,0
2,8.0,9.0,10.0,11.0,0


In [65]:
df1.reindex(index=df2.index, fill_value=0)

Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,5.0,6.0,7.0
2,8.0,9.0,10.0,11.0
3,0.0,0.0,0.0,0.0


### Operations between DataFrame and Series

In [66]:
arr = np.arange(12.).reshape((3,4))
print(arr)
print(arr-arr[0])

[[ 0.  1.  2.  3.]
 [ 4.  5.  6.  7.]
 [ 8.  9. 10. 11.]]
[[0. 0. 0. 0.]
 [4. 4. 4. 4.]
 [8. 8. 8. 8.]]


When we subtract arr[0] from arr, the subtraction is performed once for each row. This is referred to as broadcasting.

In [None]:
# By default, arithmetic between a DataFrame and a Series matches the Series index on the DataFrame's columns, broadcasting down the rows:
frame = pd.DataFrame(np.arange(12.).reshape((4,3)),
                     columns=list('abc'),
                     index=['Ohio', 'Colorado', 'Utah', 'New York'])
series = frame.iloc[0]
print(frame)
print(series)
print(frame - series)


            a     b     c
Ohio      0.0   1.0   2.0
Colorado  3.0   4.0   5.0
Utah      6.0   7.0   8.0
New York  9.0  10.0  11.0
a    0.0
b    1.0
c    2.0
Name: Ohio, dtype: float64
            a    b    c
Ohio      0.0  0.0  0.0
Colorado  3.0  3.0  3.0
Utah      6.0  6.0  6.0
New York  9.0  9.0  9.0


By default, pandas matches the index of the Series against the columns of the DataFrame and then broadcasts down the rows.

In [None]:
series2 = pd.Series(range(3), index=['b', 'e', 'c'])
print(series2)
print(frame + series2)

b    0
e    1
c    2
dtype: int64
           a     b     c   e
Ohio     NaN   1.0   4.0 NaN
Colorado NaN   4.0   7.0 NaN
Utah     NaN   7.0  10.0 NaN
New York NaN  10.0  13.0 NaN


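**What if you want to match on the rows and broadcast across the columns?**

Use one of the arithmetic methods (such as .sub() instead of -) and specify axis='index' or axis=0.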

In [None]:
print(frame)
series3 = frame['c']
print('-----------')
print(series3)

frame.sub(series3, axis='index') # axis names the axis to match on: here, the row index

            a     b     c
Ohio      0.0   1.0   2.0
Colorado  3.0   4.0   5.0
Utah      6.0   7.0   8.0
New York  9.0  10.0  11.0
-----------
Ohio         2.0
Colorado     5.0
Utah         8.0
New York    11.0
Name: c, dtype: float64


Unnamed: 0,a,b,c
Ohio,-2.0,-1.0,0.0
Colorado,-2.0,-1.0,0.0
Utah,-2.0,-1.0,0.0
New York,-2.0,-1.0,0.0


## Function Application and Mapping

In [5]:
frame = pd.DataFrame(rng.standard_normal((4, 3)),
                     columns=list('bde'),
                     index=['Ohio', 'Texas', 'Utah', 'New York'])
print(frame)
print(np.abs(frame))

                 b         d         e
Ohio     -0.158189  0.449484 -1.343601
Texas    -0.081688  1.724740  2.618159
Utah      0.777361  0.828633 -0.958988
New York -1.209388 -1.412292  0.541547
                 b         d         e
Ohio      0.158189  0.449484  1.343601
Texas     0.081688  1.724740  2.618159
Utah      0.777361  0.828633  0.958988
New York  1.209388  1.412292  0.541547


### apply applies a function along an axis to each row or column of a `DataFrame`. The function receives a whole `Series` (each row or column is a Series).

In [None]:
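  ┌────────────────┬─────────────────────────────────────────┬───────────────────────────────────────┐
  │ Feature        │ DataFrame.apply                         │ DataFrame.applymap (or df.map)        │
  ├────────────────┼─────────────────────────────────────────┼───────────────────────────────────────┤
  │ Operates on    │ a row or column (Series)                │ a single element (scalar)             │
  │ Function input │ receives a Series object                │ receives a single value               │
  │ Main use       │ aggregation, context-aware transforms   │ element-wise formatting, conversion   │
  │ Dimensionality │ one-dimensional (along rows or columns) │ zero-dimensional (element by element) │
  └────────────────┴─────────────────────────────────────────┴───────────────────────────────────────┘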

In [None]:
def f1(x):
    return x.max() - x.min()

print(frame.apply(f1))
# f1 computes the difference between the maximum and minimum of a Series
# the default is axis='index', so f1 is called once per column
print(frame.apply(f1, axis='columns'))


b    1.986750
d    3.137032
e    3.961760
dtype: float64
Ohio        1.793085
Texas       2.699847
Utah        1.787622
New York    1.953839
dtype: float64


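**Core idea: the axis argument names the axis that gets "compressed" or "collapsed".**

You can picture operations like sum() and mean() as a steamroller. The axis argument tells the steamroller **which direction to flatten the data along**.

**How it works:**

1. **`axis=0` or `axis='index'`**:

* The **index (row) axis** is collapsed.

* To flatten all the **rows** into a single row, the steamroller has to roll over the data **from top to bottom**.

* The operation therefore runs independently **for each column**.

* The index of the result is the original column names.

2. **`axis=1` or `axis='columns'`**:

* The **column axis** is collapsed.

* To flatten all the **columns** into a single column, the steamroller has to roll over the data **from left to right**.

* The operation therefore runs independently **for each row**.

* The index of the result is the original row labels.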
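
A small sketch of the two directions with `sum()`, on a made-up 2×3 frame (the names `df`, `col_totals`, and `row_totals` are illustrative):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
                  index=["r1", "r2"],
                  columns=["x", "y", "z"])

# axis=0 collapses the row axis: one result per column.
col_totals = df.sum(axis=0)
print(col_totals)

# axis=1 collapses the column axis: one result per row.
row_totals = df.sum(axis=1)
print(row_totals)
```

`col_totals` is indexed by the column names and `row_totals` by the row labels, matching the rule above.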

In [10]:
def f2(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

print(frame.apply(f2))
# the default is axis='index', so f2 is called once per column
print(frame.apply(f2, axis='columns'))


            b         d         e
min -1.209388 -1.412292 -1.343601
max  0.777361  1.724740  2.618159
               min       max
Ohio     -1.343601  0.449484
Texas    -0.081688  2.618159
Utah     -0.958988  0.828633
New York -1.412292  0.541547


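### map applies a transformation rule to every element of a `Series` or `DataFrame`.

  1. Series.map: the transformer designed for Series

  Series.map is a method specific to Series. Its core job is to apply a transformation rule to every
  element of the `Series`. The rule can be a function, a dict that holds a mapping, or another
  Series.

  How it works:
   1. Walk the `Series`: it looks at each value in the Series in turn.
   2. Apply the rule:
        * If the rule is a function, it calls the function on each value.
        * If the rule is a dict or a
          `Series`, it treats each value as a key, looks up the corresponding value, and substitutes it.
   3. Return a new `Series`: it collects all the transformed results and returns them as a new Series.

Use case A: map with a function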

In [12]:
s = pd.Series([1, 2, 3, np.nan], index=['a', 'b', 'c', 'd'])

s.map(lambda x: f'{x: .2f}')

a     1.00
b     2.00
c     3.00
d      nan
dtype: object

In [14]:
s.map(lambda x: 'this is a nan value' if pd.isna(x) else f'{x: .2f}')

a                   1.00
b                   2.00
c                   3.00
d    this is a nan value
dtype: object

Use case B: map with a dict

In [None]:
mapping = {1: 'one', 2: 'two', 3: 'three'}

s.map(mapping) # look up each value as a dict key and replace it with the dict value

a      one
b      two
c    three
d      NaN
dtype: object
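
The third kind of rule mentioned above, another Series, works like the dict: the mapper's index plays the role of the keys. A minimal sketch, where `lookup` is a hypothetical mapper introduced for illustration:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, np.nan], index=["a", "b", "c", "d"])

# The index of the mapper Series acts as the keys,
# and its values act as the replacement values.
lookup = pd.Series(["one", "two", "three"], index=[1, 2, 3])

mapped = s.map(lookup)
print(mapped)
```

Values with no match in the mapper's index, including NaN, come out as missing.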

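  2. DataFrame.applymap: the element-wise transformer for a whole DataFrame

  DataFrame.applymap is a method specific to DataFrame. Its job is simple: apply a function to
  every individual element of the `DataFrame`.

  Core idea:
  You can think of applymap as a batch version of Series.map. It is equivalent to running map once
  on every column of the DataFrame (each column is a Series).

  How it works:
   1. Walk the `DataFrame`: it visits every cell in the DataFrame.
   2. Apply the function: it calls the given function on the value of each cell.
   3. Return a new `DataFrame`: it collects all the results into a new DataFrame of the same shape.

  Important: applymap accepts only a function, not a dict or a Series. This is a key difference
  from Series.map.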
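
The "one map per column" idea can be checked directly. This is a sketch on made-up values; `fmt`, `elementwise`, `whole_frame`, and `per_column` are illustrative names, and the `hasattr` check is only there because `DataFrame.map` is available in pandas 2.1 and later.

```python
import pandas as pd

df = pd.DataFrame({"b": [-0.158, 0.777], "d": [0.449, -1.412]},
                  index=["Ohio", "Utah"])

def fmt(x):
    # Format a single value with two decimal places.
    return f"{x:.2f}"

# Use DataFrame.map where it exists, otherwise fall back to applymap.
elementwise = df.map if hasattr(df, "map") else df.applymap
whole_frame = elementwise(fmt)

# The same result built column by column with Series.map.
per_column = df.apply(lambda col: col.map(fmt))

print(whole_frame)
print(per_column)
```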

In [None]:
def my_format(x):
    return f"{x: .2f}" # keep two decimal places
print(frame.applymap(my_format))

              b      d      e
Ohio      -0.16   0.45  -1.34
Texas     -0.08   1.72   2.62
Utah       0.78   0.83  -0.96
New York  -1.21  -1.41   0.54




In [None]:
frame.map(lambda x: f"{x: .2f}") # DataFrame.map is the recommended replacement for DataFrame.applymap, which is deprecated and may be removed in a future release.

Unnamed: 0,b,d,e
Ohio,-0.16,0.45,-1.34
Texas,-0.08,1.72,2.62
Utah,0.78,0.83,-0.96
New York,-1.21,-1.41,0.54


## Sorting and Ranking

In [None]:
obj = pd.Series(np.arange(4), index=["d", "a", "b", "c"])
obj.sort_index() # sort by index

a    1
b    2
c    3
d    0
dtype: int64

In [None]:
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=["three", "one"],
                     columns=["d", "a", "b", "c"])
print(frame.sort_index()) # sort by row index
print('--------------')
print(frame.sort_index(axis=1)) # sort by column labels
print('--------------')
print(frame.sort_index(axis=1, ascending=False)) # sort by column labels, descending
print(frame.sort_columns()) # DataFrame has no sort_columns method, so this raises AttributeError


       d  a  b  c
one    4  5  6  7
three  0  1  2  3
--------------
       a  b  c  d
three  1  2  3  0
one    5  6  7  4
--------------
       d  c  b  a
three  0  3  2  1
one    4  7  6  5


AttributeError: 'DataFrame' object has no attribute 'sort_columns'

In [23]:
obj = pd.Series([4,7,-3,2])
print(obj.sort_values())


2   -3
3    2
0    4
1    7
dtype: int64


In [None]:
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
print(obj.sort_values())
print('--------------')
print(obj.sort_values(na_position="first")) # put missing values first

4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64
--------------
1    NaN
3    NaN
4   -3.0
5    2.0
0    4.0
2    7.0
dtype: float64


sort a DataFrame by values

In [30]:
frame = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})
print(frame)
print('--------------')
print(frame.sort_values("b")) # sort by column b
print('--------------')
print(frame.sort_values(['a',"b"]))


   b  a
0  4  0
1  7  1
2 -3  0
3  2  1
--------------
   b  a
2 -3  0
3  2  1
0  4  0
1  7  1
--------------
   b  a
2 -3  0
0  4  0
3  2  1
1  7  1


In [None]:
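  Let's walk through how frame.sort_values(['a', 'b']) works, step by step.

  Original DataFrame:

   b  a
0  4  0
1  7  1
2 -3  0
3  2  1

  Step 1: sort by the primary key `'a'`

  pandas first looks at column 'a', [0, 1, 0, 1]. It places all rows with a=0 first and all rows
  with a=1 after them.

  After this, the DataFrame is conceptually split into two groups:

      b  a
   0  4  0  } the a=0 group
   2 -3  0  }
   -----------
   1  7  1  } the a=1 group
   3  2  1  }
  Note: at this point, the order of the rows inside the `a=0` group (row 0 vs row 2) is not yet decided. The same holds for the `a=1` group.

  Step 2: within each group, sort by the secondary key `'b'`

  pandas now uses column 'b' to decide the final order inside each group formed in the previous step.

   * For the `a=0` group:
       * There are two rows, with 'b' values 4 (index 0) and -3 (index 2).
       * In ascending order, -3 comes first and 4 comes second.
       * So the row with index 2 comes before the row with index 0.

   * For the `a=1` group:
       * There are two rows, with 'b' values 7 (index 1) and 2 (index 3).
       * In ascending order, 2 comes first and 7 comes second.
       * So the row with index 3 comes before the row with index 1.

  Final result:

  Putting the sorted groups back together gives the final output:

      b  a
   2 -3  0  <- a=0 group, sorted by b (-3 < 4)
   0  4  0
   3  2  1  <- a=1 group, sorted by b (2 < 7)
   1  7  1

  Summary: the rule is "sort by a in ascending order first; where a is equal, sort by b in ascending order".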
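
One way to check this rule, offered as an illustrative sketch and not as a description of how pandas sorts internally: sorting by the secondary key first and then stably by the primary key gives the same row order as the single multi-key call. The names `two_pass` and `one_call` are introduced here.

```python
import pandas as pd

frame = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})

# Pass 1 orders rows by b; pass 2 is a stable sort by a, so rows that
# tie on a keep the b order established in pass 1.
two_pass = frame.sort_values("b").sort_values("a", kind="stable")

# The single call with both keys.
one_call = frame.sort_values(["a", "b"])

print(two_pass.index.tolist())
print(one_call.index.tolist())
```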

## The rank Method

The rank method on a Series

In [38]:
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
obj_rank = obj.rank()
print(obj_rank)
print(pd.DataFrame({
    'ranking': obj_rank,
    'value': obj
}))


0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64
   ranking  value
0      6.5      7
1      1.0     -5
2      6.5      7
3      4.5      4
4      3.0      2
5      2.0      0
6      4.5      4


In [39]:
print(obj.rank(method = 'first'))
print(pd.DataFrame({
    'ranking': obj.rank(method = 'first'),
    'value': obj
}))
print('----------------')
print(obj.rank(method = 'first', ascending=False))


0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64
   ranking  value
0      6.0      7
1      1.0     -5
2      7.0      7
3      4.0      4
4      3.0      2
5      2.0      0
6      5.0      4
----------------
0    1.0
1    7.0
2    2.0
3    3.0
4    5.0
5    6.0
6    4.0
dtype: float64
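
The cells above show the default tie-breaking (`average`) and `first`. As a side-by-side sketch, the other tie-breaking options of `rank` (`min`, `max`, `dense`) can be collected into one table; the name `comparison` is introduced here for illustration:

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# One column per tie-breaking method
comparison = pd.DataFrame({
    "value": obj,
    "average": obj.rank(),            # default: mean rank of the tied group
    "min": obj.rank(method="min"),    # lowest rank in the tied group
    "max": obj.rank(method="max"),    # highest rank in the tied group
    "dense": obj.rank(method="dense"),  # like "min", but ranks increase by 1 between groups
})
print(comparison)
```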


The rank method of DataFrame

In [41]:
frame = pd.DataFrame({"b": [4.3, 7, -3, 2], "a": [0, 1, 0, 1],
                      "c": [-2, 5, 8, -2.5]})
print(frame)
print('----------------')
print(frame.rank())
print('----------------')
print(frame.rank(axis='columns'))


     b  a    c
0  4.3  0 -2.0
1  7.0  1  5.0
2 -3.0  0  8.0
3  2.0  1 -2.5
----------------
     b    a    c
0  3.0  1.5  2.0
1  4.0  3.5  3.0
2  1.0  1.5  4.0
3  2.0  3.5  1.0
----------------
     b    a    c
0  3.0  2.0  1.0
1  3.0  1.0  2.0
2  1.0  2.0  3.0
3  3.0  2.0  1.0


## Axis Indexes with Duplicate Labels

In [42]:
obj = pd.Series(np.arange(5), index=["a", "a", "b", "b", "c"])
obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

In [43]:
obj.index.is_unique

False

In [44]:
obj['a']

a    0
a    1
dtype: int64

In [45]:
df = pd.DataFrame(np.random.standard_normal((5, 3)),
                  index=["a", "a", "b", "b", "c"])
df

          0         1         2
a -0.728236 -0.903254  0.236219
a -0.128571  0.439533 -0.041958
b -0.814358 -0.628422  0.019379
b -1.241152 -0.528698 -0.128073
c -0.546206 -1.220203 -0.016400


In [47]:
print(df.loc['a'])
print('----------------')
print(df.loc['c'])

          0         1         2
a -0.728236 -0.903254  0.236219
a -0.128571  0.439533 -0.041958
----------------
0   -0.546206
1   -1.220203
2   -0.016400
Name: c, dtype: float64
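
Because selection on a duplicated label returns a Series while a unique label returns a scalar (or a single row), code that depends on the result type may want to deal with the duplicates first. A minimal sketch of two common options, with the names `first_only` and `totals` introduced here for illustration:

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5), index=["a", "a", "b", "b", "c"])

# The return type depends on whether the label is duplicated
print(type(obj["a"]))   # a Series, because "a" appears twice
print(type(obj["c"]))   # a NumPy scalar, because "c" appears once

# Option 1: keep only the first occurrence of each label
first_only = obj[~obj.index.duplicated(keep="first")]
print(first_only)

# Option 2: aggregate the values that share a label
totals = obj.groupby(level=0).sum()
print(totals)
```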


# 5.3 Summarizing and Computing Descriptive Statistics

In [48]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=["a", "b", "c", "d"],
                  columns=["one", "two"])
df

    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3


In [None]:
print(df.sum())  # default axis=0: collapses the index, giving one sum per column
print('----------------')
print(df.sum(axis='columns'))  # collapses the columns, giving one sum per row

one    9.25
two   -5.80
dtype: float64
----------------
a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64


In [None]:
print(df.sum(axis='index', skipna=False))  # skipna defaults to True
print('----------------')
print(df.sum(axis='columns', skipna=False))

one   NaN
two   NaN
dtype: float64
----------------
a     NaN
b    2.60
c     NaN
d   -0.55
dtype: float64
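
With the default `skipna=True`, the all-NaN row `c` sums to 0.00, which can hide the fact that no data was present. One way to keep such rows as NaN while still skipping isolated missing values is the `min_count` argument of `sum`; a small sketch, with the name `row_sums` introduced here for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=["a", "b", "c", "d"],
                  columns=["one", "two"])

# Require at least one non-missing value per row; otherwise the result is NaN
row_sums = df.sum(axis="columns", min_count=1)
print(row_sums)
```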


In [51]:
df.mean(axis='columns')

a    1.400
b    1.300
c      NaN
d   -0.275
dtype: float64

Some methods, such as idxmin and idxmax, return indirect statistics, for example the index label at which the minimum or maximum value occurs:

In [53]:
print(df.idxmax())
print('----------------')
print(df.idxmin())

one    b
two    d
dtype: object
----------------
one    d
two    b
dtype: object
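
Since `idxmax` returns a label rather than a value, a typical follow-up is to pass that label to `loc` to pull out the whole row. A brief sketch, with the names `label` and `row` introduced here for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=["a", "b", "c", "d"],
                  columns=["one", "two"])

# Label of the row where column "one" is largest (missing values are skipped)
label = df["one"].idxmax()
print(label)

# Use the label to retrieve the full row
row = df.loc[label]
print(row)
```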


In [54]:
df.cumsum()

    one  two
a  1.40  NaN
b  8.50 -4.5
c   NaN  NaN
d  9.25 -5.8


describe produces several summary statistics in one call:

In [None]:
df.describe()


            one       two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%    1.075000 -3.700000
50%    1.400000 -2.900000
75%    4.250000 -2.100000
max    7.100000 -1.300000


In [None]:
df.describe(axis='columns')  # describe has no axis parameter

TypeError: NDFrame.describe() got an unexpected keyword argument 'axis'
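
Since `describe` always summarizes column by column, one possible workaround for per-row summaries is to transpose first and transpose the result back. This is a sketch rather than a dedicated pandas feature; the name `row_summary` is introduced here for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=["a", "b", "c", "d"],
                  columns=["one", "two"])

# Transpose so each original row becomes a column, describe, then transpose back
row_summary = df.T.describe().T
print(row_summary)
```

Row `c` has no valid values, so its count is 0 and its other statistics are NaN.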

In [58]:
obj = pd.Series(["a", "a", "b", "c"] * 4)
print(obj)
print('---------')
print(obj.describe())

0     a
1     a
2     b
3     c
4     a
5     a
6     b
7     c
8     a
9     a
10    b
11    c
12    a
13    a
14    b
15    c
dtype: object
---------
count     16
unique     3
top        a
freq       8
dtype: object


## Correlation and Covariance

In [60]:
price = pd.read_pickle('/Users/d0m999/Desktop/_bot/data_analysis/pydata-book-3rd-edition/examples/yahoo_price.pkl')
volume = pd.read_pickle('/Users/d0m999/Desktop/_bot/data_analysis/pydata-book-3rd-edition/examples/yahoo_volume.pkl')
price.head()

                 AAPL        GOOG         IBM       MSFT
Date
2010-01-04  27.990226  313.062468  113.304536  25.884104
2010-01-05  28.038618  311.683844  111.935822  25.892466
2010-01-06  27.592626  303.826685  111.208683  25.733566
2010-01-07  27.541619  296.753749  110.823732  25.465944
2010-01-08  27.724725  300.709808  111.935822  25.641571


In [61]:
returns = price.pct_change()
returns.tail()

                AAPL      GOOG       IBM      MSFT
Date
2016-10-17 -0.000680  0.001837  0.002072 -0.003483
2016-10-18 -0.000681  0.019616 -0.026168  0.007690
2016-10-19 -0.002979  0.007846  0.003583 -0.002255
2016-10-20 -0.000512 -0.005652  0.001719 -0.004867
2016-10-21 -0.003930  0.003011 -0.012474  0.042096


In [None]:
returns['MSFT'].corr(returns['IBM'])
# Syntax note:
# returns is a pandas DataFrame,
# and [] (square brackets) is the main syntax a DataFrame uses for column selection

np.float64(0.49976361144151166)

In [63]:
returns['MSFT'].cov(returns['IBM'])

np.float64(8.870655479703549e-05)
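
The two numbers above are related: the Pearson correlation is the covariance divided by the product of the two standard deviations. The sketch below checks this on synthetic data, since the price pickle files are not assumed to be available; `x` and `y` are hypothetical stand-ins for two return series:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=12345)

# Hypothetical stand-ins for two return series (no missing values)
x = pd.Series(rng.standard_normal(250))
y = 0.5 * x + pd.Series(rng.standard_normal(250))

cov_xy = x.cov(y)     # sample covariance (ddof=1)
corr_xy = x.corr(y)   # Pearson correlation

# corr = cov / (std_x * std_y); Series.std also uses ddof=1 by default
manual = cov_xy / (x.std() * y.std())
print(corr_xy, manual)
```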

In [64]:
returns.corr()

          AAPL      GOOG       IBM      MSFT
AAPL  1.000000  0.407919  0.386817  0.389695
GOOG  0.407919  1.000000  0.405099  0.465919
IBM   0.386817  0.405099  1.000000  0.499764
MSFT  0.389695  0.465919  0.499764  1.000000


In [65]:
returns.cov()

          AAPL      GOOG       IBM      MSFT
AAPL  0.000277  0.000107  0.000078  0.000095
GOOG  0.000107  0.000251  0.000078  0.000108
IBM   0.000078  0.000078  0.000146  0.000089
MSFT  0.000095  0.000108  0.000089  0.000215


In [66]:
returns.corrwith(returns['IBM'])

AAPL    0.386817
GOOG    0.405099
IBM     1.000000
MSFT    0.499764
dtype: float64

In [69]:
returns.corrwith(volume)

AAPL   -0.075565
GOOG   -0.007067
IBM    -0.204849
MSFT   -0.092950
dtype: float64

In [71]:
returns.corrwith(volume, axis='columns').tail()

Date
2016-10-17   -0.881606
2016-10-18   -0.303369
2016-10-19   -0.970723
2016-10-20   -0.304414
2016-10-21    0.927824
dtype: float64

In [None]:
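returns.corrwith(volume.T).tail()
# Transposing volume does not produce row-wise correlations: corrwith aligns on labels,
# and the tickers no longer line up with the dates, so every entry is NaN.
# Pass axis='columns' (as in the previous cell) to correlate row by row.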

2016-10-17 00:00:00   NaN
2016-10-18 00:00:00   NaN
2016-10-19 00:00:00   NaN
2016-10-20 00:00:00   NaN
2016-10-21 00:00:00   NaN
dtype: float64
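
The difference between the three `corrwith` calls can be reproduced on a small synthetic example. The frames `toy_returns` and `toy_volume` and their string row labels are made up for illustration and are not the Yahoo data used above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=12345)
days = ["day1", "day2", "day3", "day4", "day5"]   # hypothetical row labels
tickers = ["AAPL", "GOOG", "IBM", "MSFT"]

toy_returns = pd.DataFrame(rng.standard_normal((5, 4)), index=days, columns=tickers)
toy_volume = pd.DataFrame(rng.standard_normal((5, 4)), index=days, columns=tickers)

# Default axis: one correlation per column label (ticker)
by_column = toy_returns.corrwith(toy_volume)

# axis="columns": one correlation per row label (day)
by_row = toy_returns.corrwith(toy_volume, axis="columns")

# Transposed argument: no row or column labels match, so nothing can be correlated
mismatched = toy_returns.corrwith(toy_volume.T)

print(by_column)
print(by_row)
print(mismatched)
```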

## Unique Values, Value Counts, and Membership

In [75]:
obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])
uniques = obj.unique()
print(uniques)
print('-------')
print(obj.value_counts())


['c' 'a' 'd' 'b']
-------
c    3
a    3
b    2
d    1
Name: count, dtype: int64


In [77]:
mask = obj.isin(["b", "c"])
print(mask)
print('-------')
print(obj[mask])


0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool
-------
0    c
5    b
6    b
7    c
8    c
dtype: object


In [78]:
to_match = pd.Series(["c", "a", "b", "b", "c", "a"])
unique_vals = pd.Series(["c", "b", "a"])

indices = pd.Index(unique_vals).get_indexer(to_match)
print(indices)


[0 2 1 1 0 2]


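  Let's trace how .get_indexer() works:

   * The "catalog" is: Index(['c', 'b', 'a']) (c=0, b=1, a=2)
   * The values to look up are: ["c", "a", "b", "b", "c", "a"]

   1. Look up the first element of to_match, 'c'. Its position in the catalog is `0`.
   2. Look up the second element of to_match, 'a'. Its position in the catalog is `2`.
   3. Look up the third element of to_match, 'b'. Its position in the catalog is `1`.
   4. Look up the fourth element of to_match, 'b'. Its position in the catalog is `1`.
   5. Look up the fifth element of to_match, 'c'. Its position in the catalog is `0`.
   6. Look up the sixth element of to_match, 'a'. Its position in the catalog is `2`.

  Putting these results together gives the final output:

   # In [301]: indices
   Out[301]: array([0, 2, 1, 1, 0, 2])

  Uses and advantages

   * Performance: the lookup runs in compiled code inside pandas, so on large datasets it is typically much faster than a Python-level loop.
   * Integer encoding: this is a classic way to convert categorical data (such as string labels) into integer codes, a common step in machine learning preprocessing.
   * Data alignment: the same kind of label-to-position lookup underlies many of pandas' internal alignment and merge operations.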
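
Two details are worth a quick sketch: `get_indexer` marks values that are missing from the catalog with -1, and `pd.factorize` performs a similar integer encoding when no catalog exists yet, numbering the labels in order of first appearance. The label "z" and the variable names below are made up for illustration:

```python
import pandas as pd

catalog = pd.Index(["c", "b", "a"])

# "z" is not in the catalog, so its position is reported as -1
codes = catalog.get_indexer(["c", "z", "a"])
print(codes)

# factorize builds the catalog on the fly, in order of first appearance
to_match = pd.Series(["c", "a", "b", "b", "c", "a"])
fact_codes, fact_uniques = pd.factorize(to_match)
print(fact_codes)
print(fact_uniques)
```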

In [79]:
data = pd.DataFrame({"Qu1": [1, 3, 4, 3, 4],
                     "Qu2": [2, 3, 1, 2, 3],
                     "Qu3": [1, 5, 2, 4, 4]})
print(data)

   Qu1  Qu2  Qu3
0    1    2    1
1    3    3    5
2    4    1    2
3    3    2    4
4    4    3    4


In [None]:
data['Qu1'].value_counts()  # count how often each value appears in column Qu1

Qu1
3    2
4    2
1    1
Name: count, dtype: int64

Counting the values in each column

In [None]:
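  ┌────────────────┬────────────────────────────────────────────────────────┬────────────────────────────────────────────┐
  │ Feature        │ DataFrame.apply                                        │ DataFrame.map (applymap before pandas 2.1) │
  ├────────────────┼────────────────────────────────────────────────────────┼────────────────────────────────────────────┤
  │ Operates on    │ A row or column (Series)                               │ A single element (scalar)                  │
  │ Function input │ Receives a Series                                      │ Receives a single value                    │
  │ Main use       │ Aggregation, transformation, context-aware computation │ Element-wise formatting and conversion     │
  │ Dimensionality │ One-dimensional (along rows or columns)                │ Zero-dimensional (element by element)      │
  └────────────────┴────────────────────────────────────────────────────────┴────────────────────────────────────────────┘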
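
The contrast in the table can be seen in a short sketch. It assumes the `data` frame defined earlier; the names `col_range`, `row_range`, `elementwise`, and `labels` are introduced here for illustration, and the `hasattr` check is only a convenience so the example runs on pandas versions with or without `DataFrame.map`:

```python
import pandas as pd

data = pd.DataFrame({"Qu1": [1, 3, 4, 3, 4],
                     "Qu2": [2, 3, 1, 2, 3],
                     "Qu3": [1, 5, 2, 4, 4]})

# apply: the function receives a whole column (a Series) and returns one value per column
col_range = data.apply(lambda col: col.max() - col.min())
print(col_range)

# apply with axis="columns": the function receives a whole row instead
row_range = data.apply(lambda row: row.max() - row.min(), axis="columns")
print(row_range)

# Element-wise: the function receives one scalar at a time
elementwise = data.map if hasattr(data, "map") else data.applymap
labels = elementwise(lambda v: f"score={v}")
print(labels)
```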

In [95]:
result = data.apply(pd.value_counts).fillna(0)  # pd.value_counts is deprecated and emits a FutureWarning
result


   Qu1  Qu2  Qu3
1  1.0  1.0  1.0
2  0.0  2.0  1.0
3  2.0  2.0  0.0
4  2.0  0.0  2.0
5  0.0  0.0  1.0


In [None]:
result1 = data.apply(lambda x: x.value_counts()).fillna(0)
# The pandas community encourages this more object-oriented method form
result1

   Qu1  Qu2  Qu3
1  1.0  1.0  1.0
2  0.0  2.0  1.0
3  2.0  2.0  0.0
4  2.0  0.0  2.0
5  0.0  0.0  1.0


Counting distinct rows

In [99]:
data = pd.DataFrame({"a": [1, 1, 1, 2, 2], "b": [0, 0, 1, 0, 0]})
data

   a  b
0  1  0
1  1  0
2  1  1
3  2  0
4  2  0


In [100]:
data.value_counts()

a  b
1  0    2
2  0    2
1  1    1
Name: count, dtype: int64
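
The result of `DataFrame.value_counts` is a Series indexed by the distinct rows, so it can be looked up by tuple, normalized to shares, or flattened back into a table. A minimal sketch follows; the names `shares`, `sizes`, and `flat` and the column name `n` are introduced here for illustration:

```python
import pandas as pd

data = pd.DataFrame({"a": [1, 1, 1, 2, 2], "b": [0, 0, 1, 0, 0]})

counts = data.value_counts()

# Look up the count of one specific (a, b) combination
print(counts.loc[(1, 0)])

# Relative frequencies instead of raw counts
shares = data.value_counts(normalize=True)
print(shares)

# The same counts via groupby, ordered by the group keys rather than by frequency
sizes = data.groupby(["a", "b"]).size()
print(sizes)

# Flatten the counts into an ordinary DataFrame with a count column named "n"
flat = counts.reset_index(name="n")
print(flat)
```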