# Pandas Study Notes: Indexing

* Version: 0.11
* Created: April 15, 2024
* Modified: May 21, 2024
* Data sources:
 * movies.csv http://boxofficemojo.com/daily/
 * iris.csv https://github.com/dsaber/py-viz-blog
 * titanic.csv https://github.com/dsaber/py-viz-blog
 * ts.csv https://github.com/dsaber/py-viz-blog
 * tips.csv https://github.com/pandas-dev/pandas/blob/master/doc/data/tips.csv

## Setup

In [1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

# Helper function: a 6x4 DataFrame of random values with a daily DatetimeIndex
def get_random_df():
    return pd.DataFrame(
        np.random.randn(6, 4),
        index=pd.date_range('20200101', periods=6),
        columns=list('ABCD'))
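
Because get_random_df draws fresh random numbers on every call, the outputs in these notes change from run to run. A seeded variant makes them reproducible. The sketch below is an optional alternative, not part of the original notes; the name get_seeded_df and its seed parameter are made up for illustration.

```python
import numpy as np
import pandas as pd

def get_seeded_df(seed=0):
    # Hypothetical helper: same shape and labels as get_random_df,
    # but the values are reproducible for a given seed.
    rng = np.random.default_rng(seed)
    return pd.DataFrame(
        rng.standard_normal((6, 4)),
        index=pd.date_range('20200101', periods=6),
        columns=list('ABCD'))

first = get_seeded_df(42)
second = get_seeded_df(42)   # identical to first because the seed matches
```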

## Make the row index start at 1

In [2]:
df = get_random_df()
df.index = range(1, len(df) + 1)
df.head()

,A,B,C,D
1,0.118502,-1.80874,0.997507,0.392562
2,-0.457779,0.160701,-0.454315,-0.794106
3,-0.064728,0.10722,-0.173583,-0.855021
4,-0.114218,0.033859,-1.801286,-0.278239
5,0.898737,-0.317048,0.381936,-1.160173
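
The reassignment above can be checked on a small fixed DataFrame. This is a minimal sketch with made-up data; it also shows pd.RangeIndex as an equivalent way to build the same labels.

```python
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30]})
df.index = range(1, len(df) + 1)          # labels become 1, 2, 3

# Equivalent: build the labels explicitly as a RangeIndex.
df2 = pd.DataFrame({'A': [10, 20, 30]})
df2.index = pd.RangeIndex(start=1, stop=len(df2) + 1)

first_value = df.loc[1, 'A']              # label 1 is now the first row
```

Note that .loc uses the new labels, while .iloc still counts positions from 0.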


## Set a column as the index

### By column name

In [3]:
# Build the DataFrame
df = get_random_df()
df

,A,B,C,D
2020-01-01,1.11145,0.182791,-0.937013,0.956929
2020-01-02,-0.304885,0.6978,-0.900396,-1.753596
2020-01-03,-0.84814,0.318127,-2.012993,-1.005838
2020-01-04,0.59724,1.817809,0.78658,0.232008
2020-01-05,0.655043,0.762338,-0.698655,1.151653
2020-01-06,0.184392,-1.333492,0.349873,0.151685


In [4]:
df.set_index(['A'])

A,B,C,D
1.11145,0.182791,-0.937013,0.956929
-0.304885,0.6978,-0.900396,-1.753596
-0.84814,0.318127,-2.012993,-1.005838
0.59724,1.817809,0.78658,0.232008
0.655043,0.762338,-0.698655,1.151653
0.184392,-1.333492,0.349873,0.151685
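
set_index returns a new DataFrame and leaves the original untouched unless the result is assigned back. The sketch below uses made-up data to illustrate this, along with the drop=False option and reset_index as the inverse operation.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

indexed = df.set_index('A')            # new DataFrame; df itself is unchanged
kept = df.set_index('A', drop=False)   # A becomes the index and stays a column
restored = indexed.reset_index()       # move the index back into a column
```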


### By column position

In [5]:
# Use the third column as the index
df = get_random_df()
df.set_index(df.columns[2])

C,A,B,D
-1.35115,1.378955,0.152973,0.919439
0.027135,-0.470867,0.009272,1.162112
1.812734,-0.528501,0.248176,-1.263733
0.891118,0.873387,-1.498628,-1.941507
0.531397,1.573426,-1.207346,0.016876
1.327886,-0.035997,-1.003411,1.246743
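
The positional variant works because df.columns[2] simply looks up the label at that position and passes it to set_index. A minimal sketch with made-up data, extended to several positions at once:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6], 'D': [7, 8]})

third = df.columns[2]               # positional lookup yields the label 'C'
by_position = df.set_index(third)

pair = list(df.columns[[0, 2]])     # labels at positions 0 and 2
multi = df.set_index(pair)          # two labels produce a two-level MultiIndex
```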


## Build a MultiIndex manually

In [6]:
m_index = pd.MultiIndex.from_arrays(
    [['level-one']*2, ['level-two-one', 'level-two-two']])
m_index

MultiIndex([('level-one', 'level-two-one'),
            ('level-one', 'level-two-two')],
           )

In [7]:
m_index = pd.MultiIndex.from_product(
    [['level-one'], ['level-two-one', 'level-two-two']])
m_index

MultiIndex([('level-one', 'level-two-one'),
            ('level-one', 'level-two-two')],
           )
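
The two constructors above give the same result only because the outer level has a single value. In general, from_arrays pairs the lists element by element, while from_product builds every combination. A small sketch with made-up labels shows the difference:

```python
import pandas as pd

outer = ['a', 'b']
inner = ['one', 'two']

# from_arrays zips the lists: one tuple per position.
paired = pd.MultiIndex.from_arrays([outer, inner])

# from_product builds the Cartesian product: every outer/inner combination.
crossed = pd.MultiIndex.from_product([outer, inner])
```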

## Rename columns

### Rename a single column

In [8]:
# Build the DataFrame
df = get_random_df()
df

,A,B,C,D
2020-01-01,0.994847,-0.628494,0.199592,-0.102407
2020-01-02,0.29014,0.351781,-0.300229,-0.962547
2020-01-03,0.291478,0.539801,1.129685,-0.380217
2020-01-04,-1.898179,0.699758,1.908884,-0.194987
2020-01-05,-0.234319,-1.815653,0.05534,-1.425894
2020-01-06,-0.726428,-1.344369,0.799225,0.853736


In [9]:
df.rename(columns={'A': 'AA'}, inplace=True)
df

,AA,B,C,D
2020-01-01,0.994847,-0.628494,0.199592,-0.102407
2020-01-02,0.29014,0.351781,-0.300229,-0.962547
2020-01-03,0.291478,0.539801,1.129685,-0.380217
2020-01-04,-1.898179,0.699758,1.908884,-0.194987
2020-01-05,-0.234319,-1.815653,0.05534,-1.425894
2020-01-06,-0.726428,-1.344369,0.799225,0.853736
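
One thing to watch with rename: by default, keys that match no column are silently ignored, so a typo in the mapping goes unnoticed. The sketch below uses made-up data and assumes a pandas version recent enough to support the errors parameter of rename.

```python
import pandas as pd

df = pd.DataFrame({'A': [1], 'B': [2]})

renamed = df.rename(columns={'A': 'AA'})     # returns a new DataFrame
unchanged = df.rename(columns={'Z': 'ZZ'})   # 'Z' does not exist: no error, no change

# errors='raise' turns a missing label into a KeyError.
try:
    df.rename(columns={'Z': 'ZZ'}, errors='raise')
    raised = False
except KeyError:
    raised = True
```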


### Rename all columns

Reposted from: https://stackoverflow.com/questions/11346283/renaming-columns-in-pandas

Pandas 0.21+ Answer

There have been some significant updates to column renaming in version 0.21.

The rename method has added the axis parameter which may be set to columns or 1. This update makes this method match the rest of the pandas API. It still has the index and columns parameters but you are no longer forced to use them.

The set_axis method with inplace set to False enables you to rename all the index or column labels with a list.

Examples for Pandas 0.21+

In [10]:
# Build the DataFrame
df = get_random_df()
df

,A,B,C,D
2020-01-01,-1.367012,2.318288,-0.325283,0.055034
2020-01-02,-0.049454,0.008833,-2.938136,-0.4009
2020-01-03,1.299508,-0.088676,0.829149,0.212462
2020-01-04,-1.018557,-0.186619,0.493114,-0.328296
2020-01-05,-0.067945,0.546336,-0.799995,-0.722019
2020-01-06,-0.958219,1.424339,1.399772,-0.041341


#### Method 1: use rename with axis='columns' or axis=1

In [11]:
df.rename({'A':'a', 'B':'b', 'C':'c', 'D':'d'}, axis='columns')

,a,b,c,d
2020-01-01,-1.367012,2.318288,-0.325283,0.055034
2020-01-02,-0.049454,0.008833,-2.938136,-0.4009
2020-01-03,1.299508,-0.088676,0.829149,0.212462
2020-01-04,-1.018557,-0.186619,0.493114,-0.328296
2020-01-05,-0.067945,0.546336,-0.799995,-0.722019
2020-01-06,-0.958219,1.424339,1.399772,-0.041341


In [12]:
# Same result as the previous statement
df.rename({'A':'a', 'B':'b', 'C':'c', 'D':'d'}, axis=1)

,a,b,c,d
2020-01-01,-1.367012,2.318288,-0.325283,0.055034
2020-01-02,-0.049454,0.008833,-2.938136,-0.4009
2020-01-03,1.299508,-0.088676,0.829149,0.212462
2020-01-04,-1.018557,-0.186619,0.493114,-0.328296
2020-01-05,-0.067945,0.546336,-0.799995,-0.722019
2020-01-06,-0.958219,1.424339,1.399772,-0.041341


In [13]:
# The older way; same result
df.rename(columns={'A':'a', 'B':'b', 'C':'c', 'D':'d'})

,a,b,c,d
2020-01-01,-1.367012,2.318288,-0.325283,0.055034
2020-01-02,-0.049454,0.008833,-2.938136,-0.4009
2020-01-03,1.299508,-0.088676,0.829149,0.212462
2020-01-04,-1.018557,-0.186619,0.493114,-0.328296
2020-01-05,-0.067945,0.546336,-0.799995,-0.722019
2020-01-06,-0.958219,1.424339,1.399772,-0.041341


In [14]:
# rename also accepts a function, which is applied to each column name.
df = get_random_df()
df.rename(lambda x: x.lower(), axis='columns')

,a,b,c,d
2020-01-01,-0.389596,-0.643607,-1.245727,-0.079882
2020-01-02,0.38928,-0.453595,-0.290197,-0.413748
2020-01-03,-3.39994,1.016301,-0.574126,1.070502
2020-01-04,-0.069537,-0.543435,0.125908,0.344096
2020-01-05,0.022832,-0.032724,0.792888,1.20614
2020-01-06,0.599666,1.70037,-1.169337,-0.899192


In [15]:
df = get_random_df()
df.rename(lambda x: x.lower(), axis=1)

,a,b,c,d
2020-01-01,-0.54859,0.479504,-0.638681,-0.729953
2020-01-02,-0.453389,-0.091043,-1.277966,1.226509
2020-01-03,-1.192221,0.756808,1.386562,-2.212451
2020-01-04,2.477589,-0.605837,1.069088,0.002167
2020-01-05,0.446367,-0.644324,-0.905393,0.130968
2020-01-06,-0.313847,1.163701,-0.240772,1.162398
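
Any callable that maps one label to another works here, not only a lambda. The sketch below, on made-up data, passes the built-in str.lower directly and shows the columns.str accessor as an alternative that assigns the new labels in place.

```python
import pandas as pd

df = pd.DataFrame({'A': [1], 'B': [2]})

lowered = df.rename(columns=str.lower)                    # built-in method as the mapper
suffixed = df.rename(columns=lambda name: name + '_raw')  # arbitrary transformation

# Alternative: transform the labels with the .str accessor and assign them back.
via_accessor = df.copy()
via_accessor.columns = via_accessor.columns.str.lower()
```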


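#### Method 2: use set_axis with a list of column names
The list must have the same length as the number of columns (or index labels). In pandas 0.24.2 the inplace parameter defaulted to True; later releases changed the default to False, and pandas 2.0 removed the parameter altogether. The calls below therefore pass copy=False (accepted since pandas 1.5) and rely on set_axis returning a new DataFrame.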

In [16]:
df.set_axis(['a', 'b', 'c', 'd'], axis='columns', copy=False)

,a,b,c,d
2020-01-01,-0.54859,0.479504,-0.638681,-0.729953
2020-01-02,-0.453389,-0.091043,-1.277966,1.226509
2020-01-03,-1.192221,0.756808,1.386562,-2.212451
2020-01-04,2.477589,-0.605837,1.069088,0.002167
2020-01-05,0.446367,-0.644324,-0.905393,0.130968
2020-01-06,-0.313847,1.163701,-0.240772,1.162398


In [17]:
df.set_axis(['a', 'b', 'c', 'd'], axis=1, copy=False)

,a,b,c,d
2020-01-01,-0.54859,0.479504,-0.638681,-0.729953
2020-01-02,-0.453389,-0.091043,-1.277966,1.226509
2020-01-03,-1.192221,0.756808,1.386562,-2.212451
2020-01-04,2.477589,-0.605837,1.069088,0.002167
2020-01-05,0.446367,-0.644324,-0.905393,0.130968
2020-01-06,-0.313847,1.163701,-0.240772,1.162398
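
The behavior of set_axis can be verified on a small example. This sketch uses made-up data and assumes a pandas version in which set_axis returns a new object by default (any 2.x release does); the copy keyword is left out so the call does not depend on it.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

relabeled = df.set_axis(['a', 'b'], axis='columns')   # df keeps its old labels

# A list of the wrong length is rejected with a ValueError.
try:
    df.set_axis(['a', 'b', 'c'], axis='columns')
    mismatch_error = False
except ValueError:
    mismatch_error = True
```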


#### Method 3: assign to the columns attribute

In [18]:
df.columns = ['a', 'b', 'c', 'd']
df

,a,b,c,d
2020-01-01,-0.54859,0.479504,-0.638681,-0.729953
2020-01-02,-0.453389,-0.091043,-1.277966,1.226509
2020-01-03,-1.192221,0.756808,1.386562,-2.212451
2020-01-04,2.477589,-0.605837,1.069088,0.002167
2020-01-05,0.446367,-0.644324,-0.905393,0.130968
2020-01-06,-0.313847,1.163701,-0.240772,1.162398


Why not use df.columns = ['a', 'b', 'c', 'd']?

There is nothing wrong with assigning columns directly like this. It is a perfectly good solution.

The advantage of using set_axis is that it can be used as part of a method chain and that it returns a new copy of the DataFrame.
Without it, you would have to store your intermediate steps of the chain to another variable before reassigning the columns.
```
# new for pandas 0.21+ (some_method1/2/3 are placeholders for real methods)
(df.some_method1()
   .some_method2()
   .set_axis(columns, axis='columns')
   .some_method3())

# old way
df1 = (df.some_method1()
         .some_method2())
df1.columns = columns
df1.some_method3()
```
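
To make the comparison concrete, here is a runnable sketch of both styles. The data and the particular methods in the chain (sort_values, reset_index, assign) are arbitrary stand-ins chosen for illustration, and it assumes a pandas version where set_axis returns a new DataFrame by default.

```python
import pandas as pd

df = pd.DataFrame({'A': [3, 1, 2], 'B': [30, 10, 20]})
new_names = ['a', 'b']

# Chained: set_axis sits in the middle of the pipeline.
chained = (
    df.sort_values('A')
      .reset_index(drop=True)
      .set_axis(new_names, axis='columns')
      .assign(total=lambda d: d['a'] + d['b'])
)

# Stepwise: the chain has to stop so that columns can be assigned.
step = df.sort_values('A').reset_index(drop=True)
step.columns = new_names
step = step.assign(total=step['a'] + step['b'])
```

Both versions produce the same DataFrame; the chained form just avoids the intermediate variable.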