In [3]:
import pandas as pd, numpy as np

# Hierarchical Indexing

Suppose we need a Series that stores two-dimensional data. For concreteness, assume each data point is identified by two keys: a string (a state name) and a number (a year).

In [4]:
tupledindex = [('California', 2000), ('California', 2010),('New York', 2000), ('New York', 2010),('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,18976457, 19378102,20851820, 25145561]
Popz = pd.Series(populations, index=tupledindex)
Popz

(California, 2000)    33871648
(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
(Texas, 2010)         25145561
dtype: int64

With this tuple-based index, you can slice the Series in the usual way:

In [5]:
Popz[('California', 2010):('Texas', 2000)]

(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
dtype: int64

But suppose you want all the data for 2010. With a tuple-based index, that requires some awkward manipulation of the index:

In [6]:
Popz[[i for i in Popz.index if i[1]==2010]]

(California, 2010)    37253956
(New York, 2010)      19378102
(Texas, 2010)         25145561
dtype: int64

<b>There is a better way to store such data.</b>

Pandas provides the `MultiIndex` type for indexing exactly this sort of data:

In [7]:
index=pd.MultiIndex.from_tuples(tupledindex)
index

MultiIndex([('California', 2000),
            ('California', 2010),
            (  'New York', 2000),
            (  'New York', 2010),
            (     'Texas', 2000),
            (     'Texas', 2010)],
           )

Notice how `index` contains multiple levels of indexing, namely the state names and the years. Reindexing the Series `Popz` with this `MultiIndex` gives it a hierarchical index:

In [8]:
Popz=Popz.reindex(index)
Popz

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

The result is a Series in which the first two columns show the two index levels and the third column shows the population. A blank entry in the first column means the label is the same as in the row above. All the data for a given year can now be selected as follows:

In [9]:
Popz[:,2010]

California    37253956
New York      19378102
Texas         25145561
dtype: int64
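The same selection can also be written with `loc` or with the `xs` (cross-section) method. The following is a minimal sketch that rebuilds the population Series from scratch; the level names `state` and `year` are assumptions added here for readability and are not used in the cells above.

```python
import pandas as pd

# Rebuild the population Series with a MultiIndex.
# The level names 'state' and 'year' are illustrative assumptions.
index = pd.MultiIndex.from_tuples(
    [('California', 2000), ('California', 2010),
     ('New York', 2000), ('New York', 2010),
     ('Texas', 2000), ('Texas', 2010)],
    names=['state', 'year'],
)
pop = pd.Series(
    [33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
    index=index,
)

# Label-based selection: every state, year 2010.
by_slice = pop.loc[:, 2010]

# Cross-section on a named level; the selected level is dropped from the result.
by_xs = pop.xs(2010, level='year')

print(by_xs)
```

Both forms return a Series indexed by state only. Referring to a level by name tends to read better than a bare position once an index has more than two levels.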

## MultiIndex as extra dimension

We could just as easily have stored the same data in a two-dimensional `DataFrame`, with the states as rows and the years as columns. The `unstack` method converts the multi-indexed Series into exactly that:

In [10]:
PopzDF=Popz.unstack()
PopzDF

                2000      2010
California  33871648  37253956
New York    18976457  19378102
Texas       20851820  25145561


Conversely, the `stack` method performs the opposite operation:

In [11]:
PopzDF.stack()

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

Thus we have seen how two-dimensional data can be represented in one dimension using a multi-index. Likewise, higher-dimensional data can be represented in a lower-dimensional `Series` or `DataFrame` using `MultiIndex` indexing.
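As a quick check of the claim that `unstack` and `stack` are inverse views of the same data, here is a small self-contained sketch. It rebuilds the population Series rather than reusing `Popz`, and the variable names are hypothetical.

```python
import pandas as pd

# Same population data as above, rebuilt so the example is self-contained.
index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]]
)
pop = pd.Series(
    [33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
    index=index,
)

wide = pop.unstack()     # inner level (year) becomes the columns
restored = wide.stack()  # columns move back into the inner index level

print(wide.shape)
print(list(restored.index) == list(pop.index))
```

The round trip works cleanly here because every state has a value for every year. With missing combinations, `unstack` would introduce `NaN` entries and the round trip may not reproduce the original exactly.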

Say we want to add another column alongside the total population, so that each state and year has two features, `Under18` and `Total`:

In [12]:
PopzDF=pd.DataFrame({'Under18': [9267089, 9284094,4687374, 4318033,5906301, 6879014],'Total':Popz})
PopzDF

                 Under18     Total
California 2000  9267089  33871648
           2010  9284094  37253956
New York   2000  4687374  18976457
           2010  4318033  19378102
Texas      2000  5906301  20851820
           2010  6879014  25145561


In addition, the ufuncs and other element-wise operations discussed earlier under operations work with hierarchically indexed data as well:

In [13]:
PopzDF['U18Frac']=PopzDF['Under18']/PopzDF['Total']
PopzDF

                 Under18     Total   U18Frac
California 2000  9267089  33871648  0.273594
           2010  9284094  37253956  0.249211
New York   2000  4687374  18976457  0.247010
           2010  4318033  19378102  0.222831
Texas      2000  5906301  20851820  0.283251
           2010  6879014  25145561  0.273568
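To illustrate that element-wise operations keep the hierarchical index intact, the sketch below applies a NumPy ufunc and a division to a rebuilt copy of the table, then unstacks the resulting fraction for easier reading. The variable names are hypothetical.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]]
)
df = pd.DataFrame(
    {
        'Under18': [9267089, 9284094, 4687374, 4318033, 5906301, 6879014],
        'Total': [33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
    },
    index=index,
)

# A NumPy ufunc applied to a column returns a Series with the same MultiIndex.
log_total = np.log10(df['Total'])

# Arithmetic between columns aligns on the MultiIndex as well;
# unstacking afterwards puts the years side by side.
frac_by_year = (df['Under18'] / df['Total']).unstack()

print(frac_by_year)
```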


## Creating a Multi-Index

The simplest way to build a multi-indexed `DataFrame` or `Series` is to pass a list of two or more index arrays as the `index` argument:

In [14]:
Df=pd.DataFrame(np.random.rand(4,2),index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],columns=['data1','data2'])
Df

        data1     data2
a 1  0.441919  0.728648
  2  0.869134  0.958092
b 1  0.816355  0.013729
  2  0.553965  0.196688


Pandas also builds a `MultiIndex` automatically if a `Series` is created from a dictionary whose keys are tuples:

In [15]:
data={
    ('California', 2000): 33871648,
    ('California', 2010): 37253956,
    ('Texas', 2000): 20851820,
    ('Texas', 2010): 25145561,
    ('New York', 2000): 18976457,
    ('New York', 2010): 19378102
}
pd.Series(data)
# Remember how this is different from the way we defined the first Series
# pd.Series(populations,index=tupledindex)

California  2000    33871648
            2010    37253956
Texas       2000    20851820
            2010    25145561
New York    2000    18976457
            2010    19378102
dtype: int64

### Explicitly Defining MultiIndexes

To define a `MultiIndex` explicitly, you can use one of several class-method constructors.

In [16]:
pd.MultiIndex.from_tuples([
    ('a',1),
    ('a',2),
    ('b',3),
    ('b',4),
    ('b',5),
    ('c',6),
    ('c',7),
    ('c',8)
])


MultiIndex([('a', 1),
            ('a', 2),
            ('b', 3),
            ('b', 4),
            ('b', 5),
            ('c', 6),
            ('c', 7),
            ('c', 8)],
           )

In [17]:
pd.MultiIndex.from_arrays([
    [1,2,3,4,5,6,7,8],
    ['a','a','b','b','b','c','c','c']
])

MultiIndex([(1, 'a'),
            (2, 'a'),
            (3, 'b'),
            (4, 'b'),
            (5, 'b'),
            (6, 'c'),
            (7, 'c'),
            (8, 'c')],
           )

A `MultiIndex` can also be defined as the Cartesian product of several sets of labels:

In [18]:
pd.MultiIndex.from_product([['a','b','c'],[1,2,3,4]])

MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('a', 4),
            ('b', 1),
            ('b', 2),
            ('b', 3),
            ('b', 4),
            ('c', 1),
            ('c', 2),
            ('c', 3),
            ('c', 4)],
           )

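Apart from the class methods above, you can also construct a `MultiIndex` directly by passing `levels` (the unique labels of each level) and `codes` (for each entry, the position of its label within the corresponding level):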

In [19]:
pd.MultiIndex(
    levels=[['a','b','c'],[1,2,3,6]],
    codes=[[0, 0,0,0, 1, 1,1,2,2,2], [0, 1,2,3, 0, 2,3,1,2,3]]
)

MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('a', 6),
            ('b', 1),
            ('b', 3),
            ('b', 6),
            ('c', 2),
            ('c', 3),
            ('c', 6)],
           )
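The relationship between `levels` and `codes` can be made explicit by rebuilding the tuples by hand. This is only an illustrative sketch; the names `levels`, `codes`, `mi`, and `manual` are local to the example.

```python
import pandas as pd

levels = [['a', 'b', 'c'], [1, 2, 3, 6]]
codes = [[0, 0, 0, 0, 1, 1, 1, 2, 2, 2],
         [0, 1, 2, 3, 0, 2, 3, 1, 2, 3]]

mi = pd.MultiIndex(levels=levels, codes=codes)

# Each code is a position into the corresponding level, so entry k of the
# index is (levels[0][codes[0][k]], levels[1][codes[1][k]]).
manual = [(levels[0][i], levels[1][j]) for i, j in zip(codes[0], codes[1])]

print(manual == list(mi))
```

For example, the fourth entry has codes `(0, 3)`, which point at `'a'` in the first level and `6` in the second.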

These `MultiIndex` objects can be passed as the `index` argument when creating a `DataFrame` or a `Series`, or passed to the `reindex` method of an existing `Series` or `DataFrame`.
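The two routes behave differently, which the following sketch tries to show on a small made-up Series: the `index` argument attaches labels to values by position, while `reindex` looks values up by label and fills in `NaN` where a label is missing. All names and values here are illustrative.

```python
import numpy as np
import pandas as pd

# Route 1: pass the MultiIndex when constructing the Series.
mi = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])
s = pd.Series(np.arange(4), index=mi)

# Route 2: reindex an existing Series. Values are matched by label,
# so rows can be reordered and unknown labels produce NaN.
target = pd.MultiIndex.from_tuples([('b', 2), ('a', 1), ('c', 1)])
r = s.reindex(target)

print(r)
```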

### Naming Levels of a MultiIndex

Sometimes it is convenient to name the levels of a `MultiIndex`. This can be done with the `names` argument of the `MultiIndex` constructor (the class methods above accept it as well):

In [20]:
DemoIndexer=pd.MultiIndex(
    levels=[['a','b','c'],[1,2,3,6]],
    codes=[[0, 0,0,0, 1, 1,1,2,2,2], [0, 1,2,3, 0, 2,3,1,2,3]],
    names=['letters','numbers']
)
PlayableSeries=pd.Series(np.random.rand(10),index=DemoIndexer)
PlayableSeries

letters  numbers
a        1          0.103466
         2          0.338037
         3          0.315730
         6          0.450887
b        1          0.805237
         3          0.474443
         6          0.325222
c        2          0.391719
         3          0.860930
         6          0.449841
dtype: float64

Naming a level of an index can be thought of as naming a dimension, or a set of features, in a dataset.
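Names can also be attached to an existing index after the fact, and then used wherever a level is expected. A minimal sketch on a small made-up Series (the data and variable names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])
s = pd.Series(np.arange(4.0), index=idx)

# Attach names to the levels of an existing index.
s.index.names = ['letters', 'numbers']

# Level names can then replace positional level numbers.
wide = s.unstack(level='numbers')    # 'numbers' becomes the columns
ones = s.xs(1, level='numbers')      # all rows where numbers == 1

print(wide)
print(ones)
```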

### MultiIndex for columns

Just like rows, columns can also have a `MultiIndex`. Consider the following mock (and not very realistic) medical data:

In [21]:
# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# mock some data
data = np.round(np.random.randn(4, 6), 1)
print(data)
data[:, ::2] *= 10
data += 37

# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data

[[-0.4  0.3 -2.9 -2.3  0.5  0.1]
 [ 0.7  0.5  0.3  0.6 -1.4  0.5]
 [ 0.2  0.3  0.7  1.6  0.7  1.9]
 [ 1.3  0.6 -1.2  0.6 -0.   0.2]]


subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      33.0  37.3   8.0  34.7  42.0  37.1
     2      44.0  37.5  40.0  37.6  23.0  37.5
2014 1      39.0  37.3  44.0  38.6  44.0  38.9
     2      50.0  37.6  25.0  37.6  37.0  37.2


The data above is essentially four-dimensional: its dimensions are subject, measurement type, year, and visit number.

With this in place, indexing the top-level column by a person's name returns a DataFrame containing just that person's records:

In [22]:
health_data['Guido']

type          HR  Temp
year visit
2013 1       8.0  34.7
     2      40.0  37.6
2014 1      44.0  38.6
     2      25.0  37.6


Note that indexing a multi-indexed DataFrame follows the hierarchy of the levels. The command above runs without trouble because `'Guido'` is a label in the top column level, whereas `health_data['HR']` raises a `KeyError`, because `'HR'` belongs to the second level and is not a top-level label.
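When a selection on an inner column level is what is actually wanted, one option is the `xs` method with `axis=1`. The sketch below rebuilds a frame with the same layout as `health_data` but with placeholder values (`np.arange`), so the numbers are not the ones shown above.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# Placeholder values; only the layout matches the frame above.
health = pd.DataFrame(np.arange(24, dtype=float).reshape(4, 6),
                      index=index, columns=columns)

# Plain column indexing only looks at the top level.
try:
    health['HR']
except KeyError:
    print("'HR' is not a top-level column label")

# Cross-section on the inner column level: HR readings for every subject.
hr = health.xs('HR', axis=1, level='type')
print(hr)
```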

## Indexing and Slicing a MultiIndex

### in Series

In [23]:
Popz

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

Consider the multi-indexed Series `Popz`. A single element can be accessed by indexing with multiple terms, and indexing with just the first level returns a Series over the remaining level:

In [24]:
Popz['California',2000]

33871648

In [25]:
Popz['Texas']

2000    20851820
2010    25145561
dtype: int64

Partial slicing is also possible, as long as the `MultiIndex` is sorted:

In [26]:
Popz.loc['California':'New York']

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
dtype: int64

With sorted indices, partial indexing can be performed on lower levels by passing an empty slice in the first index:

In [27]:
Popz[:,2000]

California    33871648
New York      18976457
Texas         20851820
dtype: int64

Selection can also be based on a Boolean mask:

In [28]:
Popz[Popz>22000000]

California  2000    33871648
            2010    37253956
Texas       2010    25145561
dtype: int64

Fancy indexing, that is, selection with a list of labels, also works:

In [29]:
Popz[['California','New York']]

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
dtype: int64

### in DataFrames

In [30]:
health_data

subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      33.0  37.3   8.0  34.7  42.0  37.1
     2      44.0  37.5  40.0  37.6  23.0  37.5
2014 1      39.0  37.3  44.0  38.6  44.0  38.9
     2      50.0  37.6  25.0  37.6  37.0  37.2


Consider the case of the DataFrame defined before. 

Remember that columns are primary in DataFrames, and the syntax used for indexing in multi-indexed Series applies to columns here.

In [31]:
health_data['Guido','HR']

year  visit
2013  1         8.0
      2        40.0
2014  1        44.0
      2        25.0
Name: (Guido, HR), dtype: float64

Also, as with the single-index case, we can use the `loc` and `iloc` indexers introduced in Data Indexing and Selection (the older `ix` indexer has since been removed from pandas). For example:

In [32]:
health_data.iloc[:2,:2]

subject      Bob
type          HR  Temp
year visit
2013 1      33.0  37.3
     2      44.0  37.5


The `loc` and `iloc` indexers give a 2D-array-like view of the DataFrame, but with `loc` each individual index can be passed as a tuple of multiple labels:

In [33]:
health_data.loc[:,('Bob','HR')]

year  visit
2013  1        33.0
      2        44.0
2014  1        39.0
      2        50.0
Name: (Bob, HR), dtype: float64
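Slices inside such tuples are awkward to write directly, and `pd.IndexSlice` is one way around that. The following sketch again uses a frame with the same layout as `health_data` but placeholder values, and assumes both axes are sorted (which they are when built with `from_product` from sorted inputs).

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# Placeholder values; only the layout matches the frame above.
health = pd.DataFrame(np.arange(24, dtype=float).reshape(4, 6),
                      index=index, columns=columns)

idx = pd.IndexSlice

# Rows: every year, visit 1. Columns: every subject, HR only.
first_visit_hr = health.loc[idx[:, 1], idx[:, 'HR']]
print(first_visit_hr)
```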

## Rearranging MultiIndexes

There are multiple ways of rearranging a dataset while preserving all the information linked to its keys. We saw a brief example with `stack()` and `unstack()`.

### Sorted and Unsorted Indices

Many slicing operations fail if the index is not sorted. Let's explore this by creating a Series whose index is not in lexicographic order:

In [38]:
UnorderedIndex=pd.MultiIndex.from_product([['a','c','b'],[1,2]])
data=pd.Series(np.random.rand(6),index=UnorderedIndex)
data.index.names=['char','int']
data

char  int
a     1      0.242891
      2      0.624384
c     1      0.175055
      2      0.995686
b     1      0.197655
      2      0.149124
dtype: float64

Suppose we try to take a partial slice of this data:

In [39]:
data.loc['a':'b',:]

UnsortedIndexError: 'MultiIndex slicing requires the index to be lexsorted: slicing on levels [0], lexsort depth 0'

This error occurs because the index is not lexicographically sorted. The index can be sorted with the `sort_index` method, after which the partial slice works as expected:

In [43]:
data=data.sort_index()
data.loc['a':'b',]

char  int
a     1      0.242891
      2      0.624384
b     1      0.197655
      2      0.149124
dtype: float64
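Whether an index is sorted can be checked before slicing. The sketch below uses the `is_monotonic_increasing` property for that purpose, on a Series with the same index layout as above but with integer placeholder values.

```python
import numpy as np
import pandas as pd

unordered = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]],
                                       names=['char', 'int'])
s = pd.Series(np.arange(6), index=unordered)

# False: 'c' comes before 'b' in the first level.
print(s.index.is_monotonic_increasing)

# Slicing an unsorted MultiIndex is expected to fail;
# UnsortedIndexError is a subclass of KeyError.
try:
    s.loc['a':'b']
    slice_failed = False
except KeyError:
    slice_failed = True

# After sorting, the index is monotonic and the slice is well defined.
s_sorted = s.sort_index()
print(s_sorted.index.is_monotonic_increasing)
print(s_sorted.loc['a':'b'])
```

Note that `sort_index` returns a new object, so the result has to be assigned, as in the cell above.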

### Stacking and unstacking indices

As we saw briefly before, `unstack` and `stack` can be used to change the representation of a multi-indexed dataset. The `level` argument of `unstack` selects which index level is moved to the columns:

In [66]:
health_data

subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      33.0  37.3   8.0  34.7  42.0  37.1
     2      44.0  37.5  40.0  37.6  23.0  37.5
2014 1      39.0  37.3  44.0  38.6  44.0  38.9
     2      50.0  37.6  25.0  37.6  37.0  37.2


In [67]:
health_data.unstack(level=1)

subject   Bob                     Guido                     Sue
type       HR        Temp            HR        Temp          HR        Temp
visit       1     2     1     2       1     2     1     2     1     2     1     2
year
2013     33.0  44.0  37.3  37.5     8.0  40.0  34.7  37.6  42.0  23.0  37.1  37.5
2014     39.0  50.0  37.3  37.6    44.0  25.0  38.6  37.6  44.0  37.0  38.9  37.2


In [68]:
health_data.unstack(level=1)['Bob','HR',1]

year
2013    33.0
2014    39.0
Name: (Bob, HR, 1), dtype: float64
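The effect of the `level` argument is perhaps easier to see on the smaller population Series. This sketch rebuilds it with level names `state` and `year`, which are assumptions added for the example, and unstacks each level in turn.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'],  # illustrative level names
)
pop = pd.Series(
    [33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
    index=index,
)

# Move the outer level to the columns: rows are years, columns are states.
by_year = pop.unstack(level='state')

# Move the inner level to the columns: rows are states, columns are years.
by_state = pop.unstack(level='year')

print(by_year)
print(by_state)
```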