# MultiIndex

In [1]:
import pandas as pd

## This Module's Dataset

In [5]:
bigmac = pd.read_csv("bigmac.csv", parse_dates=["Date"], date_format="%Y-%m-%d")
bigmac.head()

Unnamed: 0,Date,Country,Price in US Dollars
0,2000-04-01,Argentina,2.5
1,2000-04-01,Australia,1.541667
2,2000-04-01,Brazil,1.648045
3,2000-04-01,Canada,1.938776
4,2000-04-01,Switzerland,3.470588


In [6]:
bigmac.dtypes

Date                   datetime64[ns]
Country                        object
Price in US Dollars           float64
dtype: object

In [7]:
bigmac.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1386 entries, 0 to 1385
Data columns (total 3 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Date                 1386 non-null   datetime64[ns]
 1   Country              1386 non-null   object        
 2   Price in US Dollars  1386 non-null   float64       
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 32.6+ KB


## Create a MultiIndex
- A **MultiIndex** is an index with multiple levels or layers.
- Pass the `set_index` method a list of column names to create a multi-index **DataFrame**.
- The order of the list's values will determine the order of the levels.
- Alternatively, we can pass the `read_csv` function's `index_col` parameter a list of columns.

In [8]:
bigmac = pd.read_csv("bigmac.csv", parse_dates=["Date"], date_format="%Y-%m-%d")
bigmac.head()

Unnamed: 0,Date,Country,Price in US Dollars
0,2000-04-01,Argentina,2.5
1,2000-04-01,Australia,1.541667
2,2000-04-01,Brazil,1.648045
3,2000-04-01,Canada,1.938776
4,2000-04-01,Switzerland,3.470588


In [11]:
bigmac.set_index(keys=["Date", "Country"]).sort_index()

Unnamed: 0_level_0,Unnamed: 1_level_0,Price in US Dollars
Date,Country,Unnamed: 2_level_1
2000-04-01,Argentina,2.500000
2000-04-01,Australia,1.541667
2000-04-01,Brazil,1.648045
2000-04-01,Britain,3.002000
2000-04-01,Canada,1.938776
...,...,...
2020-07-01,Ukraine,2.174714
2020-07-01,United Arab Emirates,4.015846
2020-07-01,United States,5.710000
2020-07-01,Uruguay,4.327418


In [13]:
bigmac.set_index(keys=["Country", "Date"]).sort_index()
# The outermost level of the MultiIndex should be the one with the fewest unique values
bigmac.nunique()
bigmac.set_index(keys=["Date", "Country"]).sort_index()

Unnamed: 0_level_0,Unnamed: 1_level_0,Price in US Dollars
Date,Country,Unnamed: 2_level_1
2000-04-01,Argentina,2.500000
2000-04-01,Australia,1.541667
2000-04-01,Brazil,1.648045
2000-04-01,Britain,3.002000
2000-04-01,Canada,1.938776
...,...,...
2020-07-01,Ukraine,2.174714
2020-07-01,United Arab Emirates,4.015846
2020-07-01,United States,5.710000
2020-07-01,Uruguay,4.327418


In [14]:
bigmac = pd.read_csv("bigmac.csv", parse_dates=["Date"], date_format="%Y-%m-%d", index_col=["Date", "Country"]).sort_index()
bigmac.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Price in US Dollars
Date,Country,Unnamed: 2_level_1
2000-04-01,Argentina,2.5
2000-04-01,Australia,1.541667
2000-04-01,Brazil,1.648045
2000-04-01,Britain,3.002
2000-04-01,Canada,1.938776


In [18]:
bigmac.index.names
bigmac.index[0]
type(bigmac.index[0]) # a tuple, i.e. a sequence that cannot be changed after creation

tuple

## Extract Index Level Values
- The `get_level_values` method extracts an **Index** with the values from one level in the **MultiIndex**.
- Invoke the `get_level_values` method on the **MultiIndex**, not the **DataFrame** itself.
- The method expects either the level's index position or its name.

In [21]:
bigmac = pd.read_csv("bigmac.csv", parse_dates=["Date"], date_format="%Y-%m-%d", index_col=["Date", "Country"]).sort_index()
bigmac.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Price in US Dollars
Date,Country,Unnamed: 2_level_1
2000-04-01,Argentina,2.5
2000-04-01,Australia,1.541667
2000-04-01,Brazil,1.648045
2000-04-01,Britain,3.002
2000-04-01,Canada,1.938776


In [25]:
bigmac.index.get_level_values("Date") # target the level by its name
bigmac.index.get_level_values(0) # or by the level's index position
bigmac.index.get_level_values("Country")
bigmac.index.get_level_values(1)

Index(['Argentina', 'Australia', 'Brazil', 'Britain', 'Canada', 'Chile',
       'China', 'Czech Republic', 'Denmark', 'Euro area',
       ...
       'Sweden', 'Switzerland', 'Taiwan', 'Thailand', 'Turkey', 'Ukraine',
       'United Arab Emirates', 'United States', 'Uruguay', 'Vietnam'],
      dtype='object', name='Country', length=1386)

## Rename Index Levels
- Invoke the `set_names` method on the **MultiIndex** to change one or more level names.
- Use the `names` and `level` parameters to rename the index level at a given position.
- Alternatively, pass `names` a list of strings to overwrite *all* level names.
- The `set_names` method returns a copy, so replace the original index to alter the **DataFrame**.

In [39]:
bigmac = pd.read_csv("bigmac.csv", parse_dates=["Date"], date_format="%Y-%m-%d", index_col=["Date", "Country"]).sort_index()
bigmac.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Price in US Dollars
Date,Country,Unnamed: 2_level_1
2000-04-01,Argentina,2.5
2000-04-01,Australia,1.541667
2000-04-01,Brazil,1.648045
2000-04-01,Britain,3.002
2000-04-01,Canada,1.938776


In [41]:
bigmac.index.set_names(names="Time", level=0) # returns a renamed copy; bigmac.index itself is unchanged
bigmac.index.set_names(names="Location", level=1) # renames only the inner level

bigmac.index = bigmac.index.set_names(names=["Time", "Location"])

In [43]:
bigmac.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Price in US Dollars
Time,Location,Unnamed: 2_level_1
2000-04-01,Argentina,2.5
2000-04-01,Australia,1.541667
2000-04-01,Brazil,1.648045
2000-04-01,Britain,3.002
2000-04-01,Canada,1.938776


## The sort_index Method on a MultiIndex DataFrame
- Using the `sort_index` method, we can target all levels or specific levels of the **MultiIndex**.
- To apply a different sort order to different levels, pass a list of Booleans.

In [44]:
bigmac = pd.read_csv("bigmac.csv", parse_dates=["Date"], date_format="%Y-%m-%d", index_col=["Date", "Country"])
bigmac.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Price in US Dollars
Date,Country,Unnamed: 2_level_1
2000-04-01,Argentina,2.5
2000-04-01,Australia,1.541667
2000-04-01,Brazil,1.648045
2000-04-01,Canada,1.938776
2000-04-01,Switzerland,3.470588


In [45]:
bigmac.sort_index() # by default, sorts every level from outermost to innermost in ascending order
# (smallest to largest / alphabetical / earliest to latest)

Unnamed: 0_level_0,Unnamed: 1_level_0,Price in US Dollars
Date,Country,Unnamed: 2_level_1
2000-04-01,Argentina,2.500000
2000-04-01,Australia,1.541667
2000-04-01,Brazil,1.648045
2000-04-01,Britain,3.002000
2000-04-01,Canada,1.938776
...,...,...
2020-07-01,Ukraine,2.174714
2020-07-01,United Arab Emirates,4.015846
2020-07-01,United States,5.710000
2020-07-01,Uruguay,4.327418


In [51]:
bigmac.sort_index(ascending=False) # every level in descending order
bigmac.sort_index(ascending=[False, True]) # Date descending, Country ascending

Unnamed: 0_level_0,Unnamed: 1_level_0,Price in US Dollars
Date,Country,Unnamed: 2_level_1
2020-07-01,Argentina,3.509232
2020-07-01,Australia,4.578450
2020-07-01,Azerbaijan,2.324897
2020-07-01,Bahrain,3.713035
2020-07-01,Brazil,3.913528
...,...,...
2000-04-01,Sweden,2.714932
2000-04-01,Switzerland,3.470588
2000-04-01,Taiwan,2.287582
2000-04-01,Thailand,1.447368


## Extract Rows from a MultiIndex DataFrame
- A **tuple** is an immutable sequence, similar to a list. It cannot be modified after creation.
- Create a tuple with a comma between elements. The community convention is to wrap the elements in parentheses.
- The `iloc` and `loc` accessors are available to extract rows by index position or label.
- For the `loc` accessor, pass a tuple to hold the labels from the index levels.

In [52]:
bigmac = pd.read_csv("bigmac.csv", parse_dates=["Date"], date_format="%Y-%m-%d", index_col=["Date", "Country"]).sort_index()
bigmac.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Price in US Dollars
Date,Country,Unnamed: 2_level_1
2000-04-01,Argentina,2.5
2000-04-01,Australia,1.541667
2000-04-01,Brazil,1.648045
2000-04-01,Britain,3.002
2000-04-01,Canada,1.938776


In [56]:
1, #this is a tuple
1, 2 #this is also a tuple
(1, 2) #another tuple. the parentheses are not required but nice for legibility

(1, 2)

In [59]:
bigmac.iloc[2] #notice that the name is a tuple

Price in US Dollars    1.648045
Name: (2000-04-01 00:00:00, Brazil), dtype: float64

In [75]:
bigmac.loc["2000-04-01"]

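bigmac.loc["2000-04-01", "Canada"] # here the second argument is a label from the Country level
bigmac.loc["2000-04-01", "Price in US Dollars"] # here the second argument is a column name
bigmac.loc[("2000-04-01", "Canada"), ] # a tuple makes it clear that both labels belong to the MultiIndex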

Price in US Dollars    1.938776
Name: (2000-04-01 00:00:00, Canada), dtype: float64

In [76]:
bigmac.loc[("2000-04-01", "Hungary"):("2000-04-01", "Poland")]

start = ("2000-04-01", "Hungary")
end = ("2000-04-01", "Poland")

bigmac.loc[start:end]
bigmac.loc[("2019-07-09", "Hungary"):]
bigmac.loc[:("2019-07-09", "Hungary")]
bigmac.loc[("2012-01-01", "Brazil"):("2013-07-01", "Turkey"), "Price in US Dollars"]

Date        Country    
2012-01-01  Brazil         5.678670
            Britain        3.823395
            Canada         4.632940
            Chile          4.050983
            China          2.438445
                             ...   
2013-07-01  Sweden         6.156874
            Switzerland    6.719041
            Taiwan         2.630834
            Thailand       2.845723
            Turkey         4.342384
Name: Price in US Dollars, Length: 160, dtype: float64

## The transpose Method
- The `transpose` method swaps the row and column axes of the **DataFrame**: the index becomes the columns and the columns become the index.

In [77]:
bigmac = pd.read_csv("bigmac.csv", parse_dates=["Date"], date_format="%Y-%m-%d", index_col=["Date", "Country"]).sort_index()
bigmac.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Price in US Dollars
Date,Country,Unnamed: 2_level_1
2000-04-01,Argentina,2.5
2000-04-01,Australia,1.541667
2000-04-01,Brazil,1.648045
2000-04-01,Britain,3.002
2000-04-01,Canada,1.938776


In [79]:
start = ("2018-01-01", "China")
end = ("2018-01-01", "Denmark") 

bigmac.loc[start:end].transpose()

Date,2018-01-01,2018-01-01,2018-01-01,2018-01-01,2018-01-01
Country,China,Colombia,Costa Rica,Czech Republic,Denmark
Price in US Dollars,3.171642,3.832468,4.027932,3.807779,4.93202


## The stack Method
- The `stack` method moves the column index to the row index.
- Pandas will return a **MultiIndex Series**.
- Think of it like "stacking" index levels for a **MultiIndex**.
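
As a minimal sketch of the points above, the cell below stacks a small frame. The `prices` frame is a hypothetical three-row stand-in built inline (values copied from the first rows of the dataset) so the example runs without `bigmac.csv`.

```python
import pandas as pd

# Hypothetical three-row stand-in for bigmac.csv (illustration only)
prices = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2000-04-01"] * 3),
        "Country": ["Argentina", "Australia", "Brazil"],
        "Price in US Dollars": [2.5, 1.541667, 1.648045],
    }
).set_index(["Date", "Country"])

# stack moves the column labels into a new innermost row level
stacked = prices.stack()

print(type(stacked).__name__)  # Series
print(stacked.index.nlevels)   # 3 levels: Date, Country, and the former column label
print(stacked)
```

With only one column the result has the same number of rows as the original; each extra column would add one more row per (Date, Country) pair.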

## The unstack Method
- The `unstack` method moves a level of the row index to the column index (the inverse of the `stack` method).
- By default, the `unstack` method will move the innermost level.
- We can choose which level to move with the `level` parameter.
- The `level` parameter accepts the level's index position or its name. It can also accept a list of positions/names.
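
A minimal sketch of the default and of the `level` parameter follows. The `sample` frame is a hypothetical four-row stand-in built inline (values copied from rows shown in this notebook) so the example runs without `bigmac.csv`.

```python
import pandas as pd

# Hypothetical four-row stand-in for bigmac.csv (illustration only)
sample = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2000-04-01", "2000-04-01", "2020-07-01", "2020-07-01"]),
        "Country": ["Argentina", "Brazil", "Argentina", "Brazil"],
        "Price in US Dollars": [2.5, 1.648045, 3.509232, 3.913528],
    }
).set_index(["Date", "Country"]).sort_index()

# Default: the innermost level (Country) moves to the columns
by_country = sample.unstack()

# Target a level by name...
by_date = sample.unstack(level="Date")

# ...or by index position (0 is the outermost level, Date)
by_date_pos = sample.unstack(level=0)

print(by_country)
print(by_date)
```

In both results the columns become a **MultiIndex**: the original column name on top, the moved level underneath.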

## The pivot Method
- The `pivot` method reshapes data from a tall format to a wide format.
- Ask yourself which direction the data will expand in if you add more entries.
- A tall/long format expands down. A wide format expands out.
- The `index` parameter sets the column whose values will be the row labels (the index) of the pivoted **DataFrame**.
- The `columns` parameter sets the column whose values will be the columns in the pivoted **DataFrame**.
- The `values` parameter sets the values of the pivoted **DataFrame**. Pandas will populate the correct values based on the index and column intersections.
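
A minimal sketch of the three parameters follows. The `tall` frame is a hypothetical four-row stand-in built inline (values copied from rows shown in this notebook) so the example runs without `bigmac.csv`.

```python
import pandas as pd

# Hypothetical tall data: one row per (Date, Country) observation
tall = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2000-04-01", "2000-04-01", "2020-07-01", "2020-07-01"]),
        "Country": ["Argentina", "Brazil", "Argentina", "Brazil"],
        "Price in US Dollars": [2.5, 1.648045, 3.509232, 3.913528],
    }
)

# One row per Date, one column per Country, prices at the intersections
wide = tall.pivot(index="Date", columns="Country", values="Price in US Dollars")

print(wide)
```

Adding another country to `tall` adds rows (it expands down), while in `wide` it adds a column (it expands out).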

## The melt Method
- The `melt` method is the inverse of the `pivot` method.
- It takes a 'wide' dataset and converts it to a 'tall' dataset.
- The `melt` method is ideal when you have multiple columns storing the *same* data point.
- Ask yourself whether the column's values are a *type* of the column header. If they're not, the data is likely stored in a wide format.
- The `id_vars` parameter accepts the identifier column(s) whose values will be repeated for every melted column.
- The `var_name` parameter sets the name of the new column for the varying values (the former column names).
- The `value_name` parameter sets the name of the new values column (holding the values from the original **DataFrame**).
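
A minimal sketch of `id_vars`, `var_name`, and `value_name` follows. The `wide` frame is a hypothetical layout built inline for illustration (prices copied from rows shown in this notebook, with one column per date); it is not how `bigmac.csv` is stored.

```python
import pandas as pd

# Hypothetical wide data: every date column stores the same kind of data point (a price)
wide = pd.DataFrame(
    {
        "Country": ["Argentina", "Brazil"],
        "2000-04-01": [2.5, 1.648045],
        "2020-07-01": [3.509232, 3.913528],
    }
)

# Country is repeated for every former column; the former headers land in "Date"
tall = wide.melt(
    id_vars="Country",
    var_name="Date",
    value_name="Price in US Dollars",
)

print(tall)
```

The two date columns collapse into a single `Date` column, so the two-row wide frame becomes a four-row tall frame.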

## The pivot_table Method
- The `pivot_table` method operates similarly to the Pivot Table feature in Excel.
- A pivot table is a table whose values are aggregations of groups of values from another table.
- The `values` parameter accepts the numeric column whose values will be aggregated.
- The `aggfunc` parameter declares the aggregation function (the default is mean/average).
- The `index` parameter sets the index labels of the pivot table. MultiIndexes are permitted.
- The `columns` parameter sets the column labels of the pivot table. MultiIndexes are permitted.
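
A minimal sketch of these parameters follows. The `tall` frame is a hypothetical four-row stand-in built inline (values copied from rows shown in this notebook) so the example runs without `bigmac.csv`.

```python
import pandas as pd

# Hypothetical four-row stand-in for bigmac.csv (illustration only)
tall = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2000-04-01", "2020-07-01", "2000-04-01", "2020-07-01"]),
        "Country": ["Argentina", "Argentina", "Brazil", "Brazil"],
        "Price in US Dollars": [2.5, 3.509232, 1.648045, 3.913528],
    }
)

# Average price per country (mean is also the default aggfunc)
mean_price = tall.pivot_table(
    values="Price in US Dollars", index="Country", aggfunc="mean"
)

# Maximum price per date, with one column per country
max_price = tall.pivot_table(
    values="Price in US Dollars", index="Date", columns="Country", aggfunc="max"
)

print(mean_price)
print(max_price)
```

Unlike `pivot`, `pivot_table` tolerates several rows per index/column combination because it aggregates them.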