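#### What the pandas index is for
Data stored in an ordinary column can also be used for lookups, so what is gained by using an index?

Summary of what an index is for:
1. More convenient data lookups
2. Better lookup performance
3. Automatic data alignment
4. Support for more, and more powerful, data structures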

In [2]:
import pandas as pd

In [3]:
df=pd.read_csv("C:/Users/pxpxz_ct9p1p3/Downloads/Fortune_1000_Companies_by_Revenue.csv")

In [4]:
df.head()

| | rank | name | revenues | revenue_percent_change | profits | profits_percent_change | assets | market_value | change_in_rank | employees |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Walmart | $572,754 | 2.40% | $13,673 | 1.20% | $244,860 | $409,795 | - | 2300000 |
| 1 | 2 | Amazon | $469,822 | 21.70% | $33,364 | 56.40% | $420,549 | $1,658,807.30 | - | 1608000 |
| 2 | 3 | Apple | $365,817 | 33.30% | $94,680 | 64.90% | $351,002 | $2,849,537.60 | - | 154000 |
| 3 | 4 | CVS Health | $292,111 | 8.70% | $7,910 | 10.20% | $232,999 | $132,839.20 | - | 258000 |
| 4 | 5 | UnitedHealth Group | $287,597 | 11.80% | $17,285 | 12.20% | $212,206 | $479,830.30 | - | 350000 |


In [7]:
df.count()

rank                      1000
name                      1000
revenues                  1000
revenue_percent_change    1000
profits                   1000
profits_percent_change    1000
assets                    1000
market_value              1000
change_in_rank            1000
employees                 1000
dtype: int64

#### 1. Querying data by index

In [8]:
# drop=False keeps the index column available as an ordinary column too
df.set_index('rank', inplace=True, drop=False)

In [9]:
df.head()

| rank (index) | rank | name | revenues | revenue_percent_change | profits | profits_percent_change | assets | market_value | change_in_rank | employees |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | Walmart | $572,754 | 2.40% | $13,673 | 1.20% | $244,860 | $409,795 | - | 2300000 |
| 2 | 2 | Amazon | $469,822 | 21.70% | $33,364 | 56.40% | $420,549 | $1,658,807.30 | - | 1608000 |
| 3 | 3 | Apple | $365,817 | 33.30% | $94,680 | 64.90% | $351,002 | $2,849,537.60 | - | 154000 |
| 4 | 4 | CVS Health | $292,111 | 8.70% | $7,910 | 10.20% | $232,999 | $132,839.20 | - | 258000 |
| 5 | 5 | UnitedHealth Group | $287,597 | 11.80% | $17,285 | 12.20% | $212,206 | $479,830.30 | - | 350000 |


In [6]:
df.index

Index(['1', '2', '3', '4', '5', '6', '7', '8', '9', '10',
       ...
       '991', '992', '993', '994', '995', '996', '997', '998', '999', '1,000'],
      dtype='object', name='rank', length=1000)

In [11]:
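# Look up a row by its index label (the labels are strings, see df.index)
df.loc['500'].head(5)
# Without the index, the same row needs a boolean condition on the column.
# The result is the same, but the index lookup is clearly shorter:
# df.loc[df['rank'] == '500'].head()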

rank                          500
name                       Ameren
revenues                  $6,394 
revenue_percent_change     10.40%
profits                     $990 
Name: 500, dtype: object
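
To compare the two query styles side by side, here is a minimal sketch on a tiny hand-typed frame. The three rows and the name `df_small` are illustrative assumptions, not read from the CSV; the rank labels are kept as strings to mirror the `dtype='object'` index printed by `df.index`.

```python
import pandas as pd

# Hand-typed sample (assumption: mirrors the CSV, where rank is read as text)
df_small = pd.DataFrame({
    "rank": ["1", "2", "3"],
    "name": ["Walmart", "Amazon", "Apple"],
})
# drop=False keeps rank available as a column as well as the index
df_small = df_small.set_index("rank", drop=False)

# Lookup by index label returns the matching row as a Series
by_index = df_small.loc["2"]

# Lookup by column condition returns a one-row DataFrame
by_condition = df_small.loc[df_small["rank"] == "2"]

print(by_index["name"])
print(by_condition["name"].iloc[0])
```

Because the labels are strings, the lookup key must be a string too: `df_small.loc[2]` with an integer raises a `KeyError`.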

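#### 2. Using an index improves lookup performance
+ If the index is unique, pandas uses a hash table, so a lookup is O(1);
+ If the index is not unique but is sorted, pandas uses binary search, so a lookup is O(log N);
+ If the index is in completely random order (neither unique nor sorted), every lookup has to scan the whole index, so a lookup is O(N).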

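Experiment 1: lookup on an index in completely random order

Experiment 2: lookup after sorting the index

Comparing experiments 1 and 2 shows that lookups become much faster once the index is sorted.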
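
One way the two experiments could be reproduced is sketched below. It uses synthetic data rather than the Fortune 1000 file, and the index is made non-unique on purpose, since that is the case where ordering decides between a scan and a binary search. The names `df_random` and `df_sorted` and the sizes are assumptions chosen for illustration, and the measured times will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data (assumption): 200,000 rows sharing 1,000 distinct keys,
# so the index is non-unique and arrives in random order.
rng = np.random.default_rng(0)
keys = rng.integers(0, 1000, size=200_000)
df_random = pd.DataFrame({"value": np.arange(keys.size)}, index=keys)

# Experiment 1: lookup on the randomly ordered index
print(df_random.index.is_unique, df_random.index.is_monotonic_increasing)
t_random = timeit.timeit(lambda: df_random.loc[500], number=200)

# Experiment 2: the same lookup after sorting the index
df_sorted = df_random.sort_index()
print(df_sorted.index.is_unique, df_sorted.index.is_monotonic_increasing)
t_sorted = timeit.timeit(lambda: df_sorted.loc[500], number=200)

print(f"random order: {t_random:.4f}s, sorted: {t_sorted:.4f}s")
```

`Index.is_unique` and `Index.is_monotonic_increasing` are the two properties worth checking before deciding whether a `sort_index()` call is likely to help.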

#### 3. The index aligns data automatically
This applies to both Series and DataFrame.

In [12]:
# Create a Series
s1=pd.Series([1,2,3],index=list('abc'))

In [13]:
s1

a    1
b    2
c    3
dtype: int64

In [14]:
# Create a second Series whose index only partly overlaps with s1
s2=pd.Series([2,3,4],index=list('bcd'))

In [15]:
s2

b    2
c    3
d    4
dtype: int64

In [17]:
s1+s2

a    NaN
b    4.0
c    6.0
d    NaN
dtype: float64
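
Labels present on only one side produce `NaN`, as the output above shows. The sketch below illustrates two related points under small made-up inputs: `Series.add` with `fill_value` treats a missing label as a given value instead, and DataFrames align on both the row index and the column labels. The frames `df1` and `df2` are hypothetical examples, not taken from the dataset.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=list("abc"))
s2 = pd.Series([2, 3, 4], index=list("bcd"))

# Treat a label missing on one side as 0 instead of producing NaN
filled = s1.add(s2, fill_value=0)
print(filled)

# DataFrames align on the row index and on the column labels
df1 = pd.DataFrame({"x": [1, 2]}, index=["a", "b"])
df2 = pd.DataFrame({"x": [10, 20], "y": [1, 1]}, index=["b", "c"])
total = df1 + df2
print(total)
```

In `total`, only the cell at row `b`, column `x` has a value on both sides; every other cell is `NaN`, including the whole `y` column, because `df1` has no `y`.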

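#### 4. The index supports more, and more powerful, data structures
##### Several powerful index types
+ CategoricalIndex: an index based on categorical data, which improves performance;
+ MultiIndex: a multi-level index, used for example for the result of a multi-key groupby aggregation;
+ DatetimeIndex: a datetime-typed index with rich support for date and time operations.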
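
The three index types listed above can each be tried in a few lines. This is a minimal sketch on made-up data; the frames `cat_df` and `sales` and the series `daily` are hypothetical and only serve to show how each index is created and queried.

```python
import pandas as pd

# CategoricalIndex: labels drawn from a fixed set of categories
cat_df = pd.DataFrame(
    {"value": [1, 2, 3]},
    index=pd.CategoricalIndex(["low", "high", "low"], categories=["low", "high"]),
)
print(cat_df.loc["low"])

# MultiIndex: grouping by two columns yields a two-level index
sales = pd.DataFrame({
    "region": ["east", "east", "west"],
    "year": [2021, 2022, 2021],
    "amount": [10, 20, 30],
})
by_region_year = sales.groupby(["region", "year"])["amount"].sum()
print(by_region_year.loc[("east", 2022)])

# DatetimeIndex: a partial date string selects a whole month
daily = pd.Series(range(60), index=pd.date_range("2022-01-01", periods=60, freq="D"))
february = daily.loc["2022-02"]
print(len(february), february.sum())
```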