# Pandas Study Notes: Indexing

* Version: 0.1
* Created: April 15, 2024
* Last modified: April 15, 2024
* Data sources:
 * movies.csv http://boxofficemojo.com/daily/
 * iris.csv https://github.com/dsaber/py-viz-blog
 * titanic.csv https://github.com/dsaber/py-viz-blog
 * ts.csv https://github.com/dsaber/py-viz-blog
 * tips.csv https://github.com/pandas-dev/pandas/blob/master/doc/data/tips.csv

## Preparation

In [1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import seaborn as sns
# Helper functions
def get_movie_df():
    """
    Load the movies DataFrame.
    """
    return pd.read_csv('datas/movies.csv', sep='\t', encoding='utf-8', thousands=',', escapechar='$')

def get_titanic_df():
    return pd.read_csv('datas/titanic.csv')

def get_iris_df():
    return pd.read_csv('datas/iris.csv')

def get_tips_df():
    return pd.read_csv('datas/tips.csv')

def get_random_df():
    return pd.DataFrame(
        np.random.randn(6, 4),
        index=pd.date_range('20200101', periods=6),
        columns=list('ABCD'))

## Setting a column as the index

### By column name

In [2]:
# Build the DataFrame
df = get_random_df();df

Unnamed: 0,A,B,C,D
2020-01-01,0.22785,1.417402,0.662716,-0.840174
2020-01-02,-0.908174,-0.161938,0.987568,0.332834
2020-01-03,-0.354602,-0.542344,0.237632,-1.161399
2020-01-04,-0.950604,-1.577008,-1.069262,-0.685116
2020-01-05,-0.120652,-0.045322,-0.964505,-0.167253
2020-01-06,-0.257626,0.629918,-0.001851,-0.463611


In [3]:
df.set_index(['A'])

A,B,C,D
0.22785,1.417402,0.662716,-0.840174
-0.908174,-0.161938,0.987568,0.332834
-0.354602,-0.542344,0.237632,-1.161399
-0.950604,-1.577008,-1.069262,-0.685116
-0.120652,-0.045322,-0.964505,-0.167253
-0.257626,0.629918,-0.001851,-0.463611
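
Note that `set_index` returns a new DataFrame and leaves `df` untouched, and that `reset_index` moves the index back into an ordinary column. The sketch below illustrates this on a small hand-built frame; the name `demo` and its values are made up for illustration and are not part of the original notes.

```python
import pandas as pd

# Hypothetical frame for illustration; the values are arbitrary.
demo = pd.DataFrame({'A': [10, 20, 30], 'B': [1.0, 2.0, 3.0]})

indexed = demo.set_index('A')      # new DataFrame; 'A' becomes the index
restored = indexed.reset_index()   # 'A' moves back to an ordinary column

print(list(demo.columns))      # demo itself still has both columns
print(indexed.index.name)      # name of the new index
print(list(restored.columns))  # columns after the round trip
```

To keep the change, assign the result back (`demo = demo.set_index('A')`).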


### By column position

In [4]:
# Use the third column as the index
df = get_random_df()
df.set_index(df.columns[2])

C,A,B,D
1.536455,0.466916,-0.366877,-1.215555
0.893713,0.276462,-1.462756,0.963352
0.256841,-1.738209,1.093039,0.342333
-0.576202,0.739173,-1.014447,-0.965608
0.226706,-2.169952,-0.992201,-1.446678
-0.853298,-1.05142,1.030935,-0.217549
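
`df.columns[2]` simply looks up the label stored at position 2, so the call above is equivalent to passing that label by name. The same idea extends to several positions. The following is a minimal sketch on a made-up frame (`demo` is a hypothetical name, not from the original notes):

```python
import numpy as np
import pandas as pd

# Hypothetical frame for illustration: 3 rows, columns A..D.
demo = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list('ABCD'))

third_label = demo.columns[2]            # label at position 2
by_third = demo.set_index(third_label)   # same as demo.set_index('C')

labels = list(demo.columns[[0, 2]])      # labels at positions 0 and 2
by_two = demo.set_index(labels)          # two columns give a two-level index

print(third_label)
print(list(by_two.index.names))
```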


## Building a MultiIndex by hand

In [5]:
m_index = pd.MultiIndex.from_arrays(
    [['level-one']*2, ['level-two-one', 'level-two-two']])
m_index

MultiIndex([('level-one', 'level-two-one'),
            ('level-one', 'level-two-two')],
           )

In [6]:
m_index = pd.MultiIndex.from_product(
    [['level-one'], ['level-two-one', 'level-two-two']])
m_index

MultiIndex([('level-one', 'level-two-one'),
            ('level-one', 'level-two-two')],
           )
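
The two constructors give the same result above only because the first level has a single value. In general `from_arrays` pairs the inputs element by element, while `from_product` builds every combination. A small sketch with made-up labels (`letters` and `numbers` are illustrative names):

```python
import pandas as pd

letters = ['a', 'b']
numbers = [1, 2]

# Element-wise pairing: one entry per position.
paired = pd.MultiIndex.from_arrays([letters, numbers])

# Cartesian product: every letter combined with every number.
crossed = pd.MultiIndex.from_product([letters, numbers])

print(list(paired))
print(list(crossed))
```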

## Renaming columns

### Renaming a single column

In [7]:
# Build the DataFrame
df = get_random_df();df

Unnamed: 0,A,B,C,D
2020-01-01,0.654426,-0.54075,0.661756,-1.746388
2020-01-02,-0.295946,-0.709389,-1.050192,-0.194055
2020-01-03,0.057393,-0.349799,1.134097,0.341846
2020-01-04,-1.328864,1.095307,-1.191685,-0.077589
2020-01-05,-0.425008,0.325361,-0.771963,-0.495083
2020-01-06,-1.364578,-0.262868,1.038787,0.184585


In [8]:
df.rename(columns={'A':'AA'}, inplace=True);df

Unnamed: 0,AA,B,C,D
2020-01-01,0.654426,-0.54075,0.661756,-1.746388
2020-01-02,-0.295946,-0.709389,-1.050192,-0.194055
2020-01-03,0.057393,-0.349799,1.134097,0.341846
2020-01-04,-1.328864,1.095307,-1.191685,-0.077589
2020-01-05,-0.425008,0.325361,-0.771963,-0.495083
2020-01-06,-1.364578,-0.262868,1.038787,0.184585
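
One thing to watch for with `rename`: by default, a key that matches no existing column is silently ignored, so a typo in the mapping leaves the frame unchanged. Passing `errors='raise'` turns that into a `KeyError`. The sketch below assumes a pandas version that supports the `errors` parameter (0.25 or later, as far as I know); `demo` is a hypothetical frame.

```python
import pandas as pd

# Hypothetical frame for illustration.
demo = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# 'Z' is not a column, so this rename changes nothing and raises nothing.
unchanged = demo.rename(columns={'Z': 'zz'})

# With errors='raise' the same mapping is rejected.
try:
    demo.rename(columns={'Z': 'zz'}, errors='raise')
    raised = False
except KeyError:
    raised = True

print(list(unchanged.columns))
print(raised)
```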


### Renaming all columns

Source: https://stackoverflow.com/questions/11346283/renaming-columns-in-pandas

Pandas 0.21+ Answer

There have been some significant updates to column renaming in version 0.21.

The rename method has added the axis parameter which may be set to columns or 1. This update makes this method match the rest of the pandas API. It still has the index and columns parameters but you are no longer forced to use them.

The set_axis method with the inplace set to False enables you to rename all the index or column labels with a list.

Examples for Pandas 0.21+

In [9]:
# Build the DataFrame
df = get_random_df();df

Unnamed: 0,A,B,C,D
2020-01-01,-0.994148,0.1437,-0.461555,-0.088175
2020-01-02,-2.035707,0.619382,0.105826,-0.399662
2020-01-03,2.208128,0.799443,1.465318,0.425646
2020-01-04,-1.045779,-0.61067,0.424734,-0.945671
2020-01-05,-1.316035,-1.304069,-0.000966,-0.705236
2020-01-06,0.885357,-0.163003,-2.424775,0.634127


#### Method 1: use rename with axis='columns' or axis=1

In [10]:
df.rename({'A':'a', 'B':'b', 'C':'c', 'D':'d'}, axis='columns')

Unnamed: 0,a,b,c,d
2020-01-01,-0.994148,0.1437,-0.461555,-0.088175
2020-01-02,-2.035707,0.619382,0.105826,-0.399662
2020-01-03,2.208128,0.799443,1.465318,0.425646
2020-01-04,-1.045779,-0.61067,0.424734,-0.945671
2020-01-05,-1.316035,-1.304069,-0.000966,-0.705236
2020-01-06,0.885357,-0.163003,-2.424775,0.634127


In [11]:
# Same result as the previous cell
df.rename({'A':'a', 'B':'b', 'C':'c', 'D':'d'}, axis=1)

Unnamed: 0,a,b,c,d
2020-01-01,-0.994148,0.1437,-0.461555,-0.088175
2020-01-02,-2.035707,0.619382,0.105826,-0.399662
2020-01-03,2.208128,0.799443,1.465318,0.425646
2020-01-04,-1.045779,-0.61067,0.424734,-0.945671
2020-01-05,-1.316035,-1.304069,-0.000966,-0.705236
2020-01-06,0.885357,-0.163003,-2.424775,0.634127


In [12]:
# The older style; same result
df.rename(columns={'A':'a', 'B':'b', 'C':'c', 'D':'d'})

Unnamed: 0,a,b,c,d
2020-01-01,-0.994148,0.1437,-0.461555,-0.088175
2020-01-02,-2.035707,0.619382,0.105826,-0.399662
2020-01-03,2.208128,0.799443,1.465318,0.425646
2020-01-04,-1.045779,-0.61067,0.424734,-0.945671
2020-01-05,-1.316035,-1.304069,-0.000966,-0.705236
2020-01-06,0.885357,-0.163003,-2.424775,0.634127


In [13]:
# rename also accepts a function, which is applied to each column name.
df = get_random_df()
df.rename(lambda x: x.lower(), axis='columns')

Unnamed: 0,a,b,c,d
2020-01-01,0.068075,-1.720987,1.049093,-0.406167
2020-01-02,0.619941,-0.551233,-2.222463,-0.297259
2020-01-03,-0.198128,-1.056249,0.046281,-0.709618
2020-01-04,0.735538,1.015683,2.115012,0.092454
2020-01-05,0.391768,0.083871,-0.342363,-1.23137
2020-01-06,0.374733,-0.112518,0.048934,-0.949842


In [14]:
df = get_random_df()
df.rename(lambda x: x.lower(), axis=1)

Unnamed: 0,a,b,c,d
2020-01-01,0.203286,-1.644424,-0.257422,0.194584
2020-01-02,-0.418832,1.907486,-1.903078,-1.572944
2020-01-03,0.416744,-1.518611,-0.381607,0.848238
2020-01-04,0.080954,0.075704,0.548198,2.778319
2020-01-05,0.799477,-2.418315,-1.005651,1.081426
2020-01-06,1.224705,0.208852,0.232456,-0.829433
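
Any callable that maps one label to another can be passed this way, not only a lambda. As a sketch, the example below uses a made-up helper `snake_case` and a hypothetical frame `demo`; it also shows that an existing function such as `str.lower` can be passed directly.

```python
import pandas as pd

# Hypothetical frame for illustration.
demo = pd.DataFrame({'First Name': ['Ada'], 'Last Name': ['Lovelace']})

def snake_case(label):
    """Lower-case a label and replace spaces with underscores."""
    return label.lower().replace(' ', '_')

renamed = demo.rename(snake_case, axis='columns')  # helper applied per label
lowered = demo.rename(columns=str.lower)           # built-in used directly

print(list(renamed.columns))
print(list(lowered.columns))
```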


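#### Method 2: use set_axis, passing a list as the new column names
The list length must match the number of columns (or index labels). In version 0.24.2 the inplace parameter defaulted to True; pandas 2.0 removed inplace from set_axis, so the cells below pass copy=False and use the returned DataFrame.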

In [15]:
df.set_axis(['a', 'b', 'c', 'd'], axis='columns', copy=False)

Unnamed: 0,a,b,c,d
2020-01-01,0.203286,-1.644424,-0.257422,0.194584
2020-01-02,-0.418832,1.907486,-1.903078,-1.572944
2020-01-03,0.416744,-1.518611,-0.381607,0.848238
2020-01-04,0.080954,0.075704,0.548198,2.778319
2020-01-05,0.799477,-2.418315,-1.005651,1.081426
2020-01-06,1.224705,0.208852,0.232456,-0.829433


In [16]:
df.set_axis(['a', 'b', 'c', 'd'], axis=1, copy=False)

Unnamed: 0,a,b,c,d
2020-01-01,0.203286,-1.644424,-0.257422,0.194584
2020-01-02,-0.418832,1.907486,-1.903078,-1.572944
2020-01-03,0.416744,-1.518611,-0.381607,0.848238
2020-01-04,0.080954,0.075704,0.548198,2.778319
2020-01-05,0.799477,-2.418315,-1.005651,1.081426
2020-01-06,1.224705,0.208852,0.232456,-0.829433
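
The length requirement can be checked directly. The sketch below assumes a reasonably recent pandas (1.0 or later), where `set_axis` returns a new DataFrame by default; `demo` is a hypothetical frame, and the `copy` argument is left out because it is not needed to show the behaviour.

```python
import pandas as pd

# Hypothetical frame for illustration: two columns.
demo = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# Matching length: a relabelled frame is returned, demo keeps its labels.
relabelled = demo.set_axis(['a', 'b'], axis='columns')

# Three labels for two columns: pandas rejects the call.
try:
    demo.set_axis(['a', 'b', 'c'], axis='columns')
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(list(relabelled.columns))
print(list(demo.columns))
print(mismatch_raised)
```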


#### Method 3: assign to the columns attribute

In [17]:
df.columns = ['a', 'b', 'c', 'd']
df

Unnamed: 0,a,b,c,d
2020-01-01,0.203286,-1.644424,-0.257422,0.194584
2020-01-02,-0.418832,1.907486,-1.903078,-1.572944
2020-01-03,0.416744,-1.518611,-0.381607,0.848238
2020-01-04,0.080954,0.075704,0.548198,2.778319
2020-01-05,0.799477,-2.418315,-1.005651,1.081426
2020-01-06,1.224705,0.208852,0.232456,-0.829433


Why not use df.columns = ['a', 'b', 'c', 'd']?

There is nothing wrong with assigning columns directly like this. It is a perfectly good solution.

The advantage of using set_axis is that it can be used as part of a method chain and that it returns a new copy of the DataFrame.
Without it, you would have to store your intermediate steps of the chain to another variable before reassigning the columns.
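```
    # new for pandas 0.21+
    # (some_method1..3 are placeholders; the outer parentheses let the chain span lines)
    (df.some_method1()
       .some_method2()
       .set_axis(columns, axis='columns')
       .some_method3())

    # old way
    df1 = (df.some_method1()
             .some_method2())
    df1.columns = columns
    df1.some_method3()
```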
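
For a concrete version of such a chain, here is a minimal sketch that swaps the placeholder steps for real methods (`sort_values`, `reset_index`, `assign`). The frame `demo` and the choice of steps are assumptions made for illustration, not part of the quoted answer.

```python
import pandas as pd

# Hypothetical frame for illustration.
demo = pd.DataFrame({'A': [3, 1, 2], 'B': [30.0, 10.0, 20.0]})

chained = (
    demo.sort_values('A')                          # order rows by column A
        .reset_index(drop=True)                    # discard the shuffled index
        .set_axis(['a', 'b'], axis='columns')      # rename every column mid-chain
        .assign(total=lambda d: d['a'] + d['b'])   # later steps see the new names
)

print(chained)
```

Because `set_axis` returns a DataFrame, the `assign` step can refer to the new names `a` and `b` without an intermediate variable.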