As we've seen, both Series and DataFrames can have indices applied to them. The index is essentially a row-level
label, and in pandas the rows correspond to axis zero. Indices can be either autogenerated, such as when
we create a new Series without an index, in which case we get numeric values, or set explicitly, like
when we use a dictionary object to create the Series, or when we load data from a CSV file and set
the appropriate parameters. Another option for setting an index is to use the set_index() function. This function
takes a column name, or a list of column names, and promotes those columns to an index. In this lecture we'll
explore more about how indices work in pandas.
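
To make the difference between an autogenerated and an explicit index concrete, here is a minimal sketch. The names and subjects are made-up illustration values, not part of the lecture's datasets.

```python
import pandas as pd

# No index given: pandas generates integer labels 0, 1, 2
auto = pd.Series(["Physics", "Chemistry", "English"])

# Built from a dictionary: the keys become the index labels
# (the names used here are hypothetical)
explicit = pd.Series({"Alice": "Physics", "Jack": "Chemistry", "Molly": "English"})

print(auto.index.tolist())      # [0, 1, 2]
print(explicit.index.tolist())  # ['Alice', 'Jack', 'Molly']
```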

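Using set_index() is a destructive process in the sense that the resulting DataFrame does not keep the current index. If you want to keep the current index, you need to manually create a new column and copy the values from the index attribute into it.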

In [1]:
# Let's import pandas and our admissions dataset
import pandas as pd
df = pd.read_csv("C:/Users/asus/Desktop/Coursera/Applied Data Science with Python/(1) Introduction to Data Science in Python/dataset/Admission_Predict.csv", index_col=0)
df.head()

Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


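Suppose we don't want the serial number to be the index of the DataFrame and would rather index by the chance of admit, while still keeping the serial number. To do that, we save the serial number in a new column and then use set_index() to make the chance of admit column the index.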

In [2]:
# So we copy the indexed data into its own column
df['Serial Number'] = df.index

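# Then we set the index to another column. The column name as written here
# ends with a trailing space; the string must match the CSV header exactly.
df = df.set_index('Chance of Admit ')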
df.head()

Chance of Admit,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Serial Number
0.92,337,118,4,4.5,4.5,9.65,1,1
0.76,324,107,4,4.0,4.5,8.87,1,2
0.72,316,104,3,3.0,3.5,8.0,1,3
0.8,322,110,3,3.5,2.5,8.67,1,4
0.65,314,103,2,2.0,3.0,8.21,0,5
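
The following is a small sketch of why the copy step matters. It uses a hypothetical three-row stand-in for the admissions table (the values are taken from the output above, but the frame itself is built by hand), so it runs without the CSV file.

```python
import pandas as pd

# Hypothetical toy frame standing in for the admissions table
toy = pd.DataFrame(
    {"GRE Score": [337, 324, 316], "Chance of Admit": [0.92, 0.76, 0.72]},
    index=pd.Index([1, 2, 3], name="Serial No."),
)

# Without a copy, the old index labels are not kept anywhere after set_index
lost = toy.set_index("Chance of Admit")
print("Serial No." in lost.columns)  # False

# Copying the index into a column first preserves the labels
kept = toy.copy()
kept["Serial Number"] = kept.index
kept = kept.set_index("Chance of Admit")
print(kept["Serial Number"].tolist())  # [1, 2, 3]
```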


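When we create a new index from an existing column, the index has a name, which is the name of that column. We can reset the index with the reset_index() function, which moves the existing index into a new column and creates a default numbered index.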

In [3]:
df = df.reset_index()
df.head()

,Chance of Admit,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Serial Number
0,0.92,337,118,4,4.5,4.5,9.65,1,1
1,0.76,324,107,4,4.0,4.5,8.87,1,2
2,0.72,316,104,3,3.0,3.5,8.0,1,3
3,0.8,322,110,3,3.5,2.5,8.67,1,4
4,0.65,314,103,2,2.0,3.0,8.21,0,5
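
As a quick check of the index name and of what reset_index() does, here is a minimal sketch on a hypothetical two-row frame (built by hand rather than loaded from the CSV):

```python
import pandas as pd

# Hypothetical two-row frame
toy = pd.DataFrame({"Chance of Admit": [0.92, 0.76], "GRE Score": [337, 324]})

# Promoting a column to the index gives the index that column's name
indexed = toy.set_index("Chance of Admit")
print(indexed.index.name)  # Chance of Admit

# reset_index moves the index back into a column and restores 0, 1, ...
restored = indexed.reset_index()
print(restored.columns.tolist())  # ['Chance of Admit', 'GRE Score']
print(restored.index.tolist())    # [0, 1]
```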


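One feature of pandas is multi-level indexing, which is similar to composite keys in relational databases (for example, a two-column key serving as the primary key).

To create a multi-level index, we call set_index() and give it a list of columns to promote to an index. Pandas goes through these columns in order, finds the distinct values, and forms a composite index from them.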
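
A minimal sketch of building a two-level index follows. The frame is a hypothetical toy example, and the POP values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy rows; the POP figures are invented
toy = pd.DataFrame({
    "STNAME": ["Alabama", "Alabama", "Michigan"],
    "CTYNAME": ["Autauga County", "Baldwin County", "Washtenaw County"],
    "POP": [100, 200, 300],
})

# Passing a list of columns to set_index builds a multi-level index
multi = toy.set_index(["STNAME", "CTYNAME"])

print(multi.index.nlevels)      # 2
print(list(multi.index.names))  # ['STNAME', 'CTYNAME']
print(multi.index[0])           # ('Alabama', 'Autauga County')
```

Each row label is now a tuple with one entry per level, which is what makes the comparison to composite keys apt.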

A typical example is geographical data categorized by region or demographics, so let's look at some census data.

In [4]:
# Let's import and see what the data looks like

df = pd.read_csv('C:\\Users\\asus\\Desktop\\Coursera\\Applied Data Science with Python\\(1) Introduction to Data Science in Python\\dataset\\census.csv')
df.head()

,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
0,40,3,6,1,0,Alabama,Alabama,4779736,4780127,4785161,...,0.002295,-0.193196,0.381066,0.582002,-0.467369,1.030015,0.826644,1.383282,1.724718,0.712594
1,50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.59227,-2.187333
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183193,...,14.83296,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
3,50,3,6,1,5,Alabama,Barbour County,27457,27457,27341,...,-4.728132,-2.50069,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
4,50,3,6,1,7,Alabama,Bibb County,22915,22919,22861,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861


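In this dataset there are two summary levels: one containing summary data for each whole state, and one containing summary data for each county within a state.

I often want a list of all the unique values in a given column. In this DataFrame, we can find the possible values of the SUMLEV column (40 and 50) with the unique() function, which is similar to the DISTINCT operator in SQL.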

In [5]:
# Here we can run unique on the sum level of our current DataFrame 
df['SUMLEV'].unique()

# This gives two distinct values, 40 and 50. Python's built-in set() could also be used.

array([40, 50], dtype=int64)
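
For comparison, here is a small sketch of unique() next to set() on a hypothetical Series of SUMLEV-like values:

```python
import pandas as pd

# Hypothetical SUMLEV-like values
levels = pd.Series([40, 50, 50, 50, 40, 50])

# unique() returns a NumPy array, in order of first appearance
print(levels.unique().tolist())  # [40, 50]

# set() also removes duplicates, but a set has no defined order
print(set(levels) == {40, 50})   # True
```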

In [6]:
# First, let's exclude the state-level rows and keep only the county-level rows.

df = df[df['SUMLEV'] == 50]
df.head()

,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
1,50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.59227,-2.187333
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183193,...,14.83296,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
3,50,3,6,1,5,Alabama,Barbour County,27457,27457,27341,...,-4.728132,-2.50069,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
4,50,3,6,1,7,Alabama,Bibb County,22915,22919,22861,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861
5,50,3,6,1,9,Alabama,Blount County,57322,57322,57373,...,1.807375,-1.177622,-1.748766,-2.062535,-1.36997,1.859511,-0.84858,-1.402476,-1.577232,-0.884411


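Suppose we now only care about the total population estimates and the total number of births. First, we create a list
containing the names of the columns we want to keep, then use the indexing operator to extract those columns into a new DataFrame.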

In [7]:
columns_to_keep = ['STNAME','CTYNAME','BIRTHS2010','BIRTHS2011','BIRTHS2012','BIRTHS2013',
                   'BIRTHS2014','BIRTHS2015','POPESTIMATE2010','POPESTIMATE2011',
                   'POPESTIMATE2012','POPESTIMATE2013','POPESTIMATE2014','POPESTIMATE2015']
df = df[columns_to_keep]
df.head()

,STNAME,CTYNAME,BIRTHS2010,BIRTHS2011,BIRTHS2012,BIRTHS2013,BIRTHS2014,BIRTHS2015,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,POPESTIMATE2013,POPESTIMATE2014,POPESTIMATE2015
1,Alabama,Autauga County,151,636,615,574,623,600,54660,55253,55175,55038,55290,55347
2,Alabama,Baldwin County,517,2187,2092,2160,2186,2240,183193,186659,190396,195126,199713,203709
3,Alabama,Barbour County,70,335,300,283,260,269,27341,27226,27159,26973,26815,26489
4,Alabama,Bibb County,44,266,245,259,247,253,22861,22733,22642,22512,22549,22583
5,Alabama,Blount County,183,744,710,646,618,603,57373,57711,57776,57734,57658,57673
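
The two steps above (filtering rows with a boolean mask, then keeping a list of columns) can be sketched on a hypothetical toy frame. The EXTRA column is invented here to stand in for the columns being dropped.

```python
import pandas as pd

# Hypothetical toy frame: one state-level row (40) and two county rows (50)
toy = pd.DataFrame({
    "SUMLEV": [40, 50, 50],
    "CTYNAME": ["Alabama", "Autauga County", "Baldwin County"],
    "EXTRA": ["x", "y", "z"],  # invented column that we do not want to keep
})

# Step 1: a boolean mask keeps only the rows where the condition is True
counties = toy[toy["SUMLEV"] == 50]

# Step 2: indexing with a list of names keeps only those columns
subset = counties[["SUMLEV", "CTYNAME"]]

print(subset.columns.tolist())  # ['SUMLEV', 'CTYNAME']
print(subset.index.tolist())    # [1, 2]
```

Note that the surviving rows keep their original labels 1 and 2, which is why the county output above starts at 1 rather than 0.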


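With the data loaded, we can set the index to be a combination of the state and county values and see how pandas handles it in a DataFrame.
First we create a list of the columns we want to use as the index, then we call set_index() with that list. We can see that this produces a dual index:
the first level is the state name and the second is the county name.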

In [8]:
df = df.set_index(['STNAME', 'CTYNAME'])
df.head()

STNAME,CTYNAME,BIRTHS2010,BIRTHS2011,BIRTHS2012,BIRTHS2013,BIRTHS2014,BIRTHS2015,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,POPESTIMATE2013,POPESTIMATE2014,POPESTIMATE2015
Alabama,Autauga County,151,636,615,574,623,600,54660,55253,55175,55038,55290,55347
Alabama,Baldwin County,517,2187,2092,2160,2186,2240,183193,186659,190396,195126,199713,203709
Alabama,Barbour County,70,335,300,283,260,269,27341,27226,27159,26973,26815,26489
Alabama,Bibb County,44,266,245,259,247,253,22861,22733,22642,22512,22549,22583
Alabama,Blount County,183,744,710,646,618,603,57373,57711,57776,57734,57658,57673


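How do we query this DataFrame? We saw previously that the loc attribute of a DataFrame can take multiple arguments and can query both rows and columns. When you use a MultiIndex, you must provide the index labels in order, one for each level you want to query.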

In [9]:
# If we want to see the population data for Washtenaw County in Michigan, the first argument is Michigan and the second is Washtenaw County.

df.loc['Michigan', 'Washtenaw County']

BIRTHS2010            977
BIRTHS2011           3826
BIRTHS2012           3780
BIRTHS2013           3662
BIRTHS2014           3683
BIRTHS2015           3709
POPESTIMATE2010    345563
POPESTIMATE2011    349048
POPESTIMATE2012    351213
POPESTIMATE2013    354289
POPESTIMATE2014    357029
POPESTIMATE2015    358880
Name: (Michigan, Washtenaw County), dtype: int64

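Suppose we want to compare two counties, for example Washtenaw and Wayne. We can pass a list of two tuples, where each tuple must have two items: the first is the label for the first index level and the second is the label for the second index level. In this example, both tuples have Michigan as the first item, and the second items are Washtenaw County and Wayne County.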

In [10]:
df.loc[ [('Michigan', 'Washtenaw County'), ('Michigan', 'Wayne County')] ]

STNAME,CTYNAME,BIRTHS2010,BIRTHS2011,BIRTHS2012,BIRTHS2013,BIRTHS2014,BIRTHS2015,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,POPESTIMATE2013,POPESTIMATE2014,POPESTIMATE2015
Michigan,Washtenaw County,977,3826,3780,3662,3683,3709,345563,349048,351213,354289,357029,358880
Michigan,Wayne County,5918,23819,23270,23377,23607,23586,1815199,1801273,1792514,1775713,1766008,1759335
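
The two query styles shown above can be tried without the census file. This is a sketch on a hypothetical three-row frame with made-up POP values.

```python
import pandas as pd

# Hypothetical toy frame with a two-level index; POP values are invented
toy = pd.DataFrame({
    "STNAME": ["Michigan", "Michigan", "Ohio"],
    "CTYNAME": ["Washtenaw County", "Wayne County", "Wayne County"],
    "POP": [10, 20, 30],
}).set_index(["STNAME", "CTYNAME"])

# One label per level selects a single row, returned as a Series
one = toy.loc["Michigan", "Washtenaw County"]
print(one["POP"])  # 10

# A list of tuples selects several rows, returned as a DataFrame
two = toy.loc[[("Michigan", "Washtenaw County"), ("Michigan", "Wayne County")]]
print(two["POP"].tolist())  # [10, 20]
```

The second level alone would be ambiguous here, since both Michigan and Ohio have a Wayne County in this toy frame. That is why each tuple spells out both levels.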


Okay, so that's how hierarchical indices work in a nutshell. They're a special part of the pandas library which I
think can make managing and reasoning about data easier. Of course, hierarchical labeling isn't just for rows.
For example, you can transpose this matrix and now have hierarchical column labels, and projecting a single
column that has these labels works exactly the way you would expect it to. Now, in reality, I don't tend to use
hierarchical indices very much, and instead just keep everything as columns and manipulate those. But it's a
unique and sophisticated aspect of pandas that is useful to know, especially when viewing your data in tabular
form.
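
As a small illustration of the transpose remark, here is a sketch on a hypothetical two-row frame (the BIRTHS and POP values are made up). After transposing, the two index levels become column labels, and a single column is projected with a tuple.

```python
import pandas as pd

# Hypothetical toy frame; BIRTHS and POP values are invented
toy = pd.DataFrame({
    "STNAME": ["Michigan", "Michigan"],
    "CTYNAME": ["Washtenaw County", "Wayne County"],
    "BIRTHS": [1, 2],
    "POP": [10, 20],
}).set_index(["STNAME", "CTYNAME"])

# Transposing turns the two-level row index into two-level column labels
flipped = toy.T
print(flipped.columns.nlevels)  # 2

# Projecting one column uses a tuple with one label per level
col = flipped[("Michigan", "Wayne County")]
print(col.index.tolist())  # ['BIRTHS', 'POP']
print(col.tolist())        # [2, 20]
```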