# MultiIndex

In [1]:
import pandas as pd
import datetime as dt

## This Module's Dataset

In [2]:
bg = pd.read_csv("bigmac.csv")

In [3]:
bg.head()

,Date,Country,Price in US Dollars
0,2000-04-01,Argentina,2.5
1,2000-04-01,Australia,1.541667
2,2000-04-01,Brazil,1.648045
3,2000-04-01,Canada,1.938776
4,2000-04-01,Switzerland,3.470588


## Create a MultiIndex
- A **MultiIndex** is an index with multiple levels or layers.
- Pass the `set_index` method a list of column names to create a multi-index **DataFrame**.
- The order of the list's values will determine the order of the levels.
- Alternatively, we can pass the `read_csv` function's `index_col` parameter a list of columns.
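
As a minimal sketch of the two routes described above, the cell below builds the same MultiIndex both ways. The three-row CSV text is a made-up sample shaped like `bigmac.csv`, and the variable names are illustrative only.

```python
import io

import pandas as pd

# Hypothetical three-row sample in the same shape as bigmac.csv
csv_text = """Date,Country,Price in US Dollars
2000-04-01,Brazil,1.648045
2000-04-01,Argentina,2.5
2001-04-01,Argentina,2.5
"""

# Option 1: load first, then promote two columns to index levels
via_set_index = pd.read_csv(io.StringIO(csv_text)).set_index(["Date", "Country"])

# Option 2: build the MultiIndex while reading the file
via_index_col = pd.read_csv(io.StringIO(csv_text), index_col=["Date", "Country"])

# The level order follows the order of the list passed in
print(via_set_index.index.names)

# Both routes produce the same (Date, Country) row labels
print(list(via_set_index.index) == list(via_index_col.index))
```

Swapping the list to `["Country", "Date"]` would make `Country` the outer level in either route.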

In [4]:
bg["Date"] = pd.to_datetime(bg["Date"], format=r"%Y-%m-%d").dt.date

In [5]:
bg.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1386 entries, 0 to 1385
Data columns (total 3 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Date                 1386 non-null   object 
 1   Country              1386 non-null   object 
 2   Price in US Dollars  1386 non-null   float64
dtypes: float64(1), object(2)
memory usage: 32.6+ KB


In [6]:
bg = bg.set_index(keys=["Date", "Country"]).sort_index()

In [7]:
bg

Date,Country,Price in US Dollars
2000-04-01,Argentina,2.500000
2000-04-01,Australia,1.541667
2000-04-01,Brazil,1.648045
2000-04-01,Britain,3.002000
2000-04-01,Canada,1.938776
...,...,...
2020-07-01,Ukraine,2.174714
2020-07-01,United Arab Emirates,4.015846
2020-07-01,United States,5.710000
2020-07-01,Uruguay,4.327418


## Extract Index Level Values
- The `get_level_values` method extracts an **Index** with the values from one level in the **MultiIndex**.
- Invoke the `get_level_values` method on the **MultiIndex**, not the **DataFrame** itself.
- The method expects either the level's index position or its name.

In [8]:
# Extract Date level (level 0)
bg.index.get_level_values(0)

Index([2000-04-01, 2000-04-01, 2000-04-01, 2000-04-01, 2000-04-01, 2000-04-01,
       2000-04-01, 2000-04-01, 2000-04-01, 2000-04-01,
       ...
       2020-07-01, 2020-07-01, 2020-07-01, 2020-07-01, 2020-07-01, 2020-07-01,
       2020-07-01, 2020-07-01, 2020-07-01, 2020-07-01],
      dtype='object', name='Date', length=1386)

In [9]:
# Extract Country level by name
bg.index.get_level_values("Country")

Index(['Argentina', 'Australia', 'Brazil', 'Britain', 'Canada', 'Chile',
       'China', 'Czech Republic', 'Denmark', 'Euro area',
       ...
       'Sweden', 'Switzerland', 'Taiwan', 'Thailand', 'Turkey', 'Ukraine',
       'United Arab Emirates', 'United States', 'Uruguay', 'Vietnam'],
      dtype='object', name='Country', length=1386)

In [10]:
# Extract Country level (level 1)
bg.index.get_level_values(1)

Index(['Argentina', 'Australia', 'Brazil', 'Britain', 'Canada', 'Chile',
       'China', 'Czech Republic', 'Denmark', 'Euro area',
       ...
       'Sweden', 'Switzerland', 'Taiwan', 'Thailand', 'Turkey', 'Ukraine',
       'United Arab Emirates', 'United States', 'Uruguay', 'Vietnam'],
      dtype='object', name='Country', length=1386)

In [11]:
# Get unique countries
bg.index.get_level_values("Country").unique()

Index(['Argentina', 'Australia', 'Brazil', 'Britain', 'Canada', 'Chile',
       'China', 'Czech Republic', 'Denmark', 'Euro area', 'Hong Kong',
       'Hungary', 'Indonesia', 'Israel', 'Japan', 'Malaysia', 'Mexico',
       'New Zealand', 'Poland', 'Russia', 'Singapore', 'South Africa',
       'South Korea', 'Sweden', 'Switzerland', 'Taiwan', 'Thailand',
       'United States', 'Philippines', 'Norway', 'Peru', 'Turkey', 'Egypt',
       'Colombia', 'Costa Rica', 'Pakistan', 'Saudi Arabia', 'Sri Lanka',
       'Ukraine', 'Uruguay', 'UAE', 'India', 'Vietnam', 'Azerbaijan',
       'Bahrain', 'Croatia', 'Guatemala', 'Honduras', 'Jordan', 'Kuwait',
       'Lebanon', 'Moldova', 'Nicaragua', 'Oman', 'Qatar', 'Romania',
       'United Arab Emirates'],
      dtype='object', name='Country')

## Rename Index Levels
- Invoke the `set_names` method on the **MultiIndex** to change one or more level names.
- Use the `names` and `level` parameters to target a nested index at a given level.
- Alternatively, pass `names` a list of strings to overwrite *all* level names.
- The `set_names` method returns a copy, so replace the original index to alter the **DataFrame**.

In [12]:
# Rename a single level

bg.index = bg.index.set_names("Year-Month-Day", level=0)
bg

Year-Month-Day,Country,Price in US Dollars
2000-04-01,Argentina,2.500000
2000-04-01,Australia,1.541667
2000-04-01,Brazil,1.648045
2000-04-01,Britain,3.002000
2000-04-01,Canada,1.938776
...,...,...
2020-07-01,Ukraine,2.174714
2020-07-01,United Arab Emirates,4.015846
2020-07-01,United States,5.710000
2020-07-01,Uruguay,4.327418


In [13]:
# Rename all levels at once

bg.index = bg.index.set_names(["Date", "Nation"])
bg

Date,Nation,Price in US Dollars
2000-04-01,Argentina,2.500000
2000-04-01,Australia,1.541667
2000-04-01,Brazil,1.648045
2000-04-01,Britain,3.002000
2000-04-01,Canada,1.938776
...,...,...
2020-07-01,Ukraine,2.174714
2020-07-01,United Arab Emirates,4.015846
2020-07-01,United States,5.710000
2020-07-01,Uruguay,4.327418


In [14]:
# View the MultiIndex structure (not tuples)

print("Index names:", bg.index.names)

print("Index levels:", bg.index.nlevels)

Index names: ['Date', 'Nation']
Index levels: 2


In [15]:
# Reset to original names
bg.index = bg.index.set_names(["Date", "Country"])

## The sort_index Method on a MultiIndex DataFrame
- Using the `sort_index` method, we can target all levels or specific levels of the **MultiIndex**.
- To apply a different sort order to different levels, pass a list of Booleans.
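
The sketch below illustrates both points on a tiny, deliberately unordered frame. The data and the names `prices`, `mixed`, and `by_country` are invented for illustration; `level` and `ascending` are existing `sort_index` parameters.

```python
import pandas as pd

# Hypothetical miniature frame, deliberately out of order
idx = pd.MultiIndex.from_tuples(
    [
        ("2001-04-01", "Brazil"),
        ("2000-04-01", "Canada"),
        ("2001-04-01", "Argentina"),
        ("2000-04-01", "Brazil"),
    ],
    names=["Date", "Country"],
)
prices = pd.DataFrame({"Price": [1.0, 2.0, 3.0, 4.0]}, index=idx)

# One Boolean per level: newest date first, countries A-Z within each date
mixed = prices.sort_index(ascending=[False, True])

# Target a specific level by name: Country becomes the primary sort key
by_country = prices.sort_index(level="Country")

print(mixed)
print(by_country)
```

Note that sorting by a single level reorders the rows only; the level order of the MultiIndex itself (`Date`, then `Country`) is unchanged.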

In [16]:
bg = pd.read_csv("bigmac.csv", index_col=["Date", "Country"])
bg.index = bg.index.set_names(["Date", "Country"])
bg

Date,Country,Price in US Dollars
2000-04-01,Argentina,2.500000
2000-04-01,Australia,1.541667
2000-04-01,Brazil,1.648045
2000-04-01,Canada,1.938776
2000-04-01,Switzerland,3.470588
...,...,...
2020-07-01,Ukraine,2.174714
2020-07-01,Uruguay,4.327418
2020-07-01,United States,5.710000
2020-07-01,Vietnam,2.847282


In [17]:
bg.sort_index().head()

Date,Country,Price in US Dollars
2000-04-01,Argentina,2.5
2000-04-01,Australia,1.541667
2000-04-01,Brazil,1.648045
2000-04-01,Britain,3.002
2000-04-01,Canada,1.938776


In [18]:
bg.sort_index(ascending=[False, False]).head()

Date,Country,Price in US Dollars
2020-07-01,Vietnam,2.847282
2020-07-01,Uruguay,4.327418
2020-07-01,United States,5.71
2020-07-01,United Arab Emirates,4.015846
2020-07-01,Ukraine,2.174714


## Extract Rows from a MultiIndex DataFrame
- A **tuple** is an immutable list. It cannot be modified after creation.
- Create a tuple with a comma between elements. The community convention is to wrap the elements in parentheses.
- The `iloc` and `loc` accessors are available to extract rows by index position or label.
- For the `loc` accessor, pass a tuple to hold the labels from the index levels.
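
Here is a small, self-contained sketch of these points. The three-row frame and the variable names are assumptions made for the example, not part of the dataset.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [
        ("2000-04-01", "Argentina"),
        ("2000-04-01", "Brazil"),
        ("2001-04-01", "Argentina"),
    ],
    names=["Date", "Country"],
)
prices = pd.DataFrame({"Price": [2.5, 1.65, 2.5]}, index=idx)

key = "2000-04-01", "Brazil"         # the comma is what creates the tuple
same_key = ("2000-04-01", "Brazil")  # parentheses are the usual convention
assert key == same_key

by_position = prices.iloc[1]               # second row, regardless of its labels
by_label = prices.loc[same_key]            # one label per index level
one_value = prices.loc[same_key, "Price"]  # tuple for the row, then the column

print(by_label)
print(one_value)
```

Keeping the row labels together in one tuple leaves the second argument of `loc` free for the column label.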

In [19]:
bg = pd.read_csv("bigmac.csv", index_col=["Date", "Country"])
bg.index = bg.index.set_names(["Date", "Country"])
bg = bg.sort_index()

In [20]:
bg.iloc[2]

Price in US Dollars    1.648045
Name: (2000-04-01, Brazil), dtype: float64

In [21]:
bg.loc["2000-04-01", :]

Country,Price in US Dollars
Argentina,2.5
Australia,1.541667
Brazil,1.648045
Britain,3.002
Canada,1.938776
Chile,2.451362
China,1.195652
Czech Republic,1.390537
Denmark,3.078358
Euro area,2.3808


In [22]:
# Rename the column in one step instead of copying it and dropping the original
bg = bg.rename(columns={"Price in US Dollars": "Price"})
bg

Date,Country,Price
2000-04-01,Argentina,2.500000
2000-04-01,Australia,1.541667
2000-04-01,Brazil,1.648045
2000-04-01,Britain,3.002000
2000-04-01,Canada,1.938776
...,...,...
2020-07-01,Ukraine,2.174714
2020-07-01,United Arab Emirates,4.015846
2020-07-01,United States,5.710000
2020-07-01,Uruguay,4.327418


In [23]:
# Sort the MultiIndex before using tuple-based indexing
bg = bg.sort_index()
bg.loc[("2020-07-01", "United States"):]

Date,Country,Price
2020-07-01,United States,5.71
2020-07-01,Uruguay,4.327418
2020-07-01,Vietnam,2.847282


**Important Note about MultiIndex Sorting:**

Slicing a MultiIndex with tuple labels in `.loc` (like `bg.loc[("2020-07-01", "United States"):]`) requires the index to be **sorted**. On an unsorted index, pandas raises an `UnsortedIndexError`. Looking up a single tuple key can still work on an unsorted index, but pandas may emit a `PerformanceWarning` because it cannot use the faster sorted lookup.

Call `sort_index()` before tuple-based indexing to avoid both the error and the slowdown.
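
To make the failure mode concrete, the sketch below slices a small unsorted frame, catches the error, and then repeats the slice after sorting. The frame and the names `unsorted`, `slice_failed`, and `tail` are invented for this example.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [
        ("2001-04-01", "Brazil"),
        ("2000-04-01", "Canada"),
        ("2000-04-01", "Argentina"),
    ],
    names=["Date", "Country"],
)
unsorted = pd.DataFrame({"Price": [1.0, 2.0, 3.0]}, index=idx)

# Slicing from a tuple label on an unsorted MultiIndex fails
try:
    unsorted.loc[("2000-04-01", "Canada"):]
    slice_failed = False
except pd.errors.UnsortedIndexError:
    slice_failed = True

# After sorting, the same slice works
sorted_prices = unsorted.sort_index()
tail = sorted_prices.loc[("2000-04-01", "Canada"):]

print(slice_failed)
print(tail)
```

The `is_monotonic_increasing` attribute of the index is a quick way to check whether a sort is still needed.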

In [24]:
# Now we can use other tuple-based indexing operations
bg.loc[("2010-01-01", "United States")]

Price    3.58
Name: (2010-01-01, United States), dtype: float64

## The transpose Method
- The `transpose` method inverts/flips the horizontal and vertical axes of the **DataFrame**.

In [25]:
bg = pd.read_csv("bigmac.csv", index_col=["Date", "Country"])
bg = bg.rename(columns={"Price in US Dollars": "$"})
bg = bg.sort_index()

In [26]:
bg.index = bg.index.set_levels(pd.to_datetime(bg.index.levels[0]), level=0)

In [27]:
bg.loc[:].transpose()

Date,2000-04-01,2000-04-01,2000-04-01,2000-04-01,2000-04-01,2000-04-01,2000-04-01,2000-04-01,2000-04-01,2000-04-01,...,2020-07-01,2020-07-01,2020-07-01,2020-07-01,2020-07-01,2020-07-01,2020-07-01,2020-07-01,2020-07-01,2020-07-01
Country,Argentina,Australia,Brazil,Britain,Canada,Chile,China,Czech Republic,Denmark,Euro area,...,Sweden,Switzerland,Taiwan,Thailand,Turkey,Ukraine,United Arab Emirates,United States,Uruguay,Vietnam
$,2.5,1.541667,1.648045,3.002,1.938776,2.451362,1.195652,1.390537,3.078358,2.3808,...,5.755931,6.90571,2.444282,4.078381,2.039507,2.174714,4.015846,5.71,4.327418,2.847282


## The stack Method
- The `stack` method moves the column index to the row index.
- Pandas will return a **MultiIndex Series**.
- Think of it like "stacking" index levels for a **MultiIndex**.

In [40]:
ws = pd.read_csv("worldstats.csv", index_col=["year", "country"]).sort_index()

In [41]:
ws

year,country,Population,GDP
1960,Afghanistan,8.994793e+06,5.377778e+08
1960,Algeria,1.112489e+07,2.723638e+09
1960,Australia,1.027648e+07,1.856759e+10
1960,Austria,7.047539e+06,6.592694e+09
1960,"Bahamas, The",1.095260e+05,1.698023e+08
...,...,...,...
2015,Vietnam,9.170380e+07,1.935994e+11
2015,West Bank and Gaza,4.422143e+06,1.267740e+10
2015,World,7.346633e+09,7.343364e+13
2015,Zambia,1.621177e+07,2.120156e+10


In [42]:
ws.stack()

year  country                
1960  Afghanistan  Population    8.994793e+06
                   GDP           5.377778e+08
      Algeria      Population    1.112489e+07
                   GDP           2.723638e+09
      Australia    Population    1.027648e+07
                                     ...     
2015  World        GDP           7.343364e+13
      Zambia       Population    1.621177e+07
                   GDP           2.120156e+10
      Zimbabwe     Population    1.560275e+07
                   GDP           1.389294e+10
Length: 22422, dtype: float64

In [43]:
ws.columns

Index(['Population', 'GDP'], dtype='object')

## Understanding What Happens When We Stack
When we use `stack()`, the **DataFrame** gets converted to a **Series** with a **MultiIndex**. The column names don't disappear - they become a new level in the index!

In [44]:
# Let's examine the original DataFrame
ws = pd.read_csv("worldstats.csv", index_col=["year", "country"]).sort_index()
print("Original DataFrame:")
print("Type:", type(ws))
print("Shape:", ws.shape)
print("Columns:", ws.columns.tolist())
print("Index names:", ws.index.names)
print("Index levels:", ws.index.nlevels)

Original DataFrame:
Type: <class 'pandas.core.frame.DataFrame'>
Shape: (11211, 2)
Columns: ['Population', 'GDP']
Index names: ['year', 'country']
Index levels: 2


In [45]:
# Now let's see what happens after stacking
stacked_result = ws.stack()
print("After stacking:")
print("Type:", type(stacked_result))
print("Shape:", stacked_result.shape)
print("Index names:", stacked_result.index.names)
print("Index levels:", stacked_result.index.nlevels)

After stacking:
Type: <class 'pandas.core.series.Series'>
Shape: (22422,)
Index names: ['year', 'country', None]
Index levels: 3


In [46]:
# The original DataFrame still has its columns
print("Original ws still has columns:", ws.columns.tolist())
print()

# But the stacked result doesn't have columns - it's a Series!
print("Stacked result has no columns attribute (it's a Series)")
print("Instead, the former columns are now part of the index:")
print(
    "Level 2 values (former columns):",
    stacked_result.index.get_level_values(2).unique().tolist(),
)

Original ws still has columns: ['Population', 'GDP']

Stacked result has no columns attribute (it's a Series)
Instead, the former columns are now part of the index:
Level 2 values (former columns): ['Population', 'GDP']


In [47]:
# Visual comparison:
print("BEFORE STACK (DataFrame):")
print(ws.head())
print("\nAFTER STACK (Series with MultiIndex):")
print(stacked_result.head())

BEFORE STACK (DataFrame):
                   Population           GDP
year country                               
1960 Afghanistan    8994793.0  5.377778e+08
     Algeria       11124892.0  2.723638e+09
     Australia     10276477.0  1.856759e+10
     Austria        7047539.0  6.592694e+09
     Bahamas, The    109526.0  1.698023e+08

AFTER STACK (Series with MultiIndex):
year  country                
1960  Afghanistan  Population    8.994793e+06
                   GDP           5.377778e+08
      Algeria      Population    1.112489e+07
                   GDP           2.723638e+09
      Australia    Population    1.027648e+07
dtype: float64


In [48]:
# Key Point: The original DataFrame is unchanged!
print("ws.columns still shows:", ws.columns.tolist())
print("This is because stack() creates a NEW object, it doesn't modify ws")
print()

# The columns became the 3rd level of the stacked Series index
print("In the stacked result, 'Population' and 'GDP' became index level 2:")
print(
    "Level 2 in stacked result:",
    stacked_result.index.get_level_values(2).unique().tolist(),
)
print()

# Proof: They're the same values, just moved location!
print(
    "Original columns == New index level 2:",
    set(ws.columns) == set(stacked_result.index.get_level_values(2).unique()),
)

ws.columns still shows: ['Population', 'GDP']
This is because stack() creates a NEW object, it doesn't modify ws

In the stacked result, 'Population' and 'GDP' became index level 2:
Level 2 in stacked result: ['Population', 'GDP']

Original columns == New index level 2: True


## Accessing Values in MultiIndex by Position
The cells below show several ways to read a value by **position** from a MultiIndex **DataFrame**, using the pattern (0, 0, 1) as the running example:

In [61]:
# First, let's examine our MultiIndex structure
print("=== Understanding our MultiIndex structure ===")
print("DataFrame shape:", ws.shape)
print("Index names:", ws.index.names)
print("Index levels:", ws.index.nlevels)
print("\nFirst few index tuples:")
print(ws.index[:5].tolist())

=== Understanding our MultiIndex structure ===
DataFrame shape: (11211, 2)
Index names: ['year', 'country']
Index levels: 2

First few index tuples:
[(1960, 'Afghanistan'), (1960, 'Algeria'), (1960, 'Australia'), (1960, 'Austria'), (1960, 'Bahamas, The')]


In [None]:
# Method 1: Using iloc for positional indexing (the most direct option)
print("=== Method 1: Using iloc (positional indexing) ===")
print("Access row at position 0 (first row):")
print("ws.iloc[0] =", ws.iloc[0].tolist())
print("\nAccess specific value at row 0, column 1:")
print("ws.iloc[0, 1] =", ws.iloc[0, 1])
print("(This gets the 1st row, 2nd column - GDP value)")

In [62]:
# Method 2: Using loc with actual index labels (if you know the labels)
print("=== Method 2: Using loc (label-based indexing) ===")
# First, let's see what the actual labels are at position 0
first_index_tuple = ws.index[0]
print(f"Index at position 0: {first_index_tuple}")

# Now we can use loc with those labels
print(f"Using loc with labels: ws.loc{first_index_tuple} =")
print(ws.loc[first_index_tuple].tolist())

=== Method 2: Using loc (label-based indexing) ===
Index at position 0: (np.int64(1960), 'Afghanistan')
Using loc with labels: ws.loc(np.int64(1960), 'Afghanistan') =
[8994793.0, 537777811.911111]


In [63]:
# Method 3: For stacked Series with 3-level MultiIndex (the (0, 0, 1) pattern)
print("=== Method 3: Accessing stacked Series (3-level MultiIndex) ===")
stacked_series = ws.stack()
print("Stacked series shape:", stacked_series.shape)
print("Stacked series index names:", stacked_series.index.names)

# Access by position in the stacked series
print(f"\nstacked_series.iloc[0] = {stacked_series.iloc[0]}")
print(f"stacked_series.iloc[1] = {stacked_series.iloc[1]}")

# If you want position (0,0,1) in terms of (level0, level1, level2):
# This would be first year, first country, second metric
print(f"\nFor position pattern (0,0,1) - first year, first country, 2nd metric:")
print(
    f"stacked_series.iloc[1] = {stacked_series.iloc[1]}"
)  # This is often what (0,0,1) refers to

=== Method 3: Accessing stacked Series (3-level MultiIndex) ===
Stacked series shape: (22422,)
Stacked series index names: ['year', 'country', None]

stacked_series.iloc[0] = 8994793.0
stacked_series.iloc[1] = 537777811.911111

For position pattern (0,0,1) - first year, first country, 2nd metric:
stacked_series.iloc[1] = 537777811.911111


In [None]:
# How this translates to stacked Series positions
print("=== In Stacked Series (how positions map) ===")
stacked = ws.stack()

# The stacked series flattens everything into a single series
# Each combination becomes a separate position
print("Stacked series positions:")
for i in range(6):  # Show first 6 positions
    idx_tuple = stacked.index[i]
    print(f"Position {i}: {idx_tuple} = {stacked.iloc[i]}")

print("\nSo in the stacked series:")
print("• Position 0 = (year[0], country[0], metric[0])")
print("• Position 1 = (year[0], country[0], metric[1])")
print(
    "• Position 2 = (year[0], country[1], metric[0])"
)  # <- Different country = different row!
print("• Position 3 = (year[0], country[1], metric[1])")
print("• ...")

In [None]:
# Method 4: Understanding the index mapping for complex selection
print("=== Method 4: Understanding index positions ===")

# Get the unique values at each level to understand positions
level_0_values = ws.index.get_level_values(0).unique()
level_1_values = ws.index.get_level_values(1).unique()

print("Level 0 (year) values:", level_0_values[:5].tolist(), "...")
print("Level 1 (country) values:", level_1_values[:5].tolist(), "...")

# If you want year[0], country[0], column[1]:
year_0 = level_0_values[0]
country_0 = level_1_values[0]
column_1 = ws.columns[1]  # This would be 'GDP'

print(f"\nPosition [0,0,1] translates to:")
print(f"Year: {year_0}, Country: {country_0}, Column: {column_1}")
print(f"Value: {ws.loc[(year_0, country_0), column_1]}")

In [None]:
# Summary: Different interpretations of (0,0,1)
print("=== SUMMARY: Different ways to interpret (0,0,1) ===")
print()
print("🔹 If (0,0,1) means row 0, column 1 in DataFrame:")
print(f"   ws.iloc[0, 1] = {ws.iloc[0, 1]}")
print()
print("🔹 If (0,0,1) means 2nd element (position 1) in stacked Series:")
print(f"   ws.stack().iloc[1] = {ws.stack().iloc[1]}")
print()
print("🔹 If (0,0,1) means level positions [year=0, country=0, metric=1]:")
year_pos_0 = ws.index.get_level_values(0).unique()[0]
country_pos_0 = ws.index.get_level_values(1).unique()[0]
metric_pos_1 = ws.columns[1]
print(
    f"   ws.loc[({year_pos_0}, '{country_pos_0}'), '{metric_pos_1}'] = {ws.loc[(year_pos_0, country_pos_0), metric_pos_1]}"
)
print()
print("💡 Most common: ws.iloc[0, 1] for simple positional access!")

## Why Use MultiIndex vs Stacking?

### **MultiIndex Benefits:**
1. **Hierarchical organization** - Natural grouping of related data
2. **Efficient memory usage** - No duplicate index values stored
3. **Advanced selection** - Easy to slice across multiple levels
4. **Grouping operations** - Natural fit for group-by operations

### **Stacking Benefits:**
1. **Data analysis** - Better for statistical operations and aggregations
2. **Visualization** - Many plotting libraries prefer "tidy" (tall) format
3. **Machine learning** - Most ML algorithms expect tall format
4. **Database operations** - Easier to work with normalized data

In [49]:
# Example: MultiIndex is great for hierarchical selection
print("=== MultiIndex Benefits ===")
print("1. Easy hierarchical selection:")
print("Get all data for 2000:", ws.loc[2000].head())
print("\n2. Efficient multi-level grouping:")
print("Average GDP by year:", ws.groupby(level=0)["GDP"].mean().head())

=== MultiIndex Benefits ===
1. Easy hierarchical selection:
Get all data for 2000:                      Population           GDP
country                                      
Albania               3089027.0  3.632044e+09
Algeria              31183658.0  5.479006e+10
Andorra                 65399.0  1.401694e+09
Angola               15058638.0  9.129635e+09
Antigua and Barbuda     77648.0  7.838379e+08

2. Efficient multi-level grouping:
Average GDP by year: year
1960    7.348987e+10
1961    7.592604e+10
1962    8.033787e+10
1963    8.670495e+10
1964    9.547473e+10
Name: GDP, dtype: float64


In [53]:
# Example: Stacking is great for analysis and visualization
print("=== Stacking Benefits ===")
stacked_ws = ws.stack()

print("1. Better for statistical analysis:")
print("Overall statistics across all metrics:", stacked_ws.describe())
print("\n2. Easy to filter by metric type:")
print(
    "Just GDP values:", stacked_ws[stacked_ws.index.get_level_values(2) == "GDP"].head()
)

=== Stacking Benefits ===
1. Better for statistical analysis:
Overall statistics across all metrics: count    2.242200e+04
mean     4.836552e+11
std      3.129207e+12
min      0.000000e+00
25%      8.795337e+06
50%      5.351863e+08
75%      1.412686e+10
max      7.810634e+13
dtype: float64

2. Easy to filter by metric type:
Just GDP values: year  country          
1960  Afghanistan   GDP    5.377778e+08
      Algeria       GDP    2.723638e+09
      Australia     GDP    1.856759e+10
      Austria       GDP    6.592694e+09
      Bahamas, The  GDP    1.698023e+08
dtype: float64


In [51]:
# Real-world example: When to choose each format
print("=== When to Choose Which Format ===")
print("\n📊 CHOOSE MULTIINDEX WHEN:")
print("• You need to slice by country/year efficiently")
print("• You want to perform country-level or year-level aggregations")
print("• Data has natural hierarchical structure")
print("• You need to maintain the 'wide' format for reporting")

print("\n📈 CHOOSE STACKING WHEN:")
print("• Preparing data for machine learning (most algorithms expect tall format)")
print("• Creating visualizations with seaborn/plotly (they prefer tidy data)")
print("• You want to analyze patterns across ALL variables together")
print("• Preparing for database storage (normalized form)")

=== When to Choose Which Format ===

📊 CHOOSE MULTIINDEX WHEN:
• You need to slice by country/year efficiently
• You want to perform country-level or year-level aggregations
• Data has natural hierarchical structure
• You need to maintain the 'wide' format for reporting

📈 CHOOSE STACKING WHEN:
• Preparing data for machine learning (most algorithms expect tall format)
• Creating visualizations with seaborn/plotly (they prefer tidy data)
• You want to analyze patterns across ALL variables together
• Preparing for database storage (normalized form)


In [52]:
# Practical comparison: Same question, different approaches
print("=== Question: What's the correlation between GDP and Population? ===")

# Method 1: MultiIndex approach
print("\n1️⃣ MultiIndex approach:")
correlation_multi = ws["GDP"].corr(ws["Population"])
print(f"Correlation using MultiIndex: {correlation_multi:.3f}")

# Method 2: Stacked approach
print("\n2️⃣ Stacked approach:")
stacked_df = stacked_ws.unstack(level=2)  # Convert back to DataFrame for correlation
correlation_stacked = stacked_df["GDP"].corr(stacked_df["Population"])
print(f"Correlation using stacked then unstacked: {correlation_stacked:.3f}")

print(f"\n✅ Both give same result, but MultiIndex was more direct for this question!")

=== Question: What's the correlation between GDP and Population? ===

1️⃣ MultiIndex approach:
Correlation using MultiIndex: 0.552

2️⃣ Stacked approach:
Correlation using stacked then unstacked: 0.552

✅ Both give same result, but MultiIndex was more direct for this question!


## The unstack Method
- The `unstack` method moves a row index to the column index (the inverse of the `stack` method).
- By default, the `unstack` method will move the innermost index level.
- We can choose which level to move with the `level` parameter.
- The `level` parameter accepts the level's index position or its name. It can also accept a list of positions/names.
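
A minimal sketch of these options follows. The four-row frame uses made-up numbers in the shape of `worldstats.csv`, and the names `stats`, `by_country`, and `by_year` are illustrative.

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [[1960, 1961], ["Algeria", "Austria"]], names=["year", "country"]
)
stats = pd.DataFrame(
    {"Population": [11.1, 7.0, 11.4, 7.1], "GDP": [2.7, 6.6, 2.4, 7.3]},
    index=idx,
)

# Default: the innermost row level (country) moves to the columns
by_country = stats.unstack()

# Choose the level by name...
by_year = stats.unstack(level="year")

# ...or by position (level 0 is year here)
by_year_pos = stats.unstack(level=0)

print(by_country)
print(by_year)
```

After unstacking, the columns form a MultiIndex themselves, so a single value is addressed with a tuple such as `by_country.loc[1961, ("GDP", "Austria")]`.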

## The pivot Method
- The `pivot` method reshapes data from a tall format to a wide format.
- Ask yourself which direction the data will expand in if you add more entries.
- A tall/long format expands down. A wide format expands out.
- The `index` parameter sets the column whose values become the row labels (the index) of the pivoted **DataFrame**.
- The `columns` parameter sets the column whose values will be the columns in the pivoted **DataFrame**.
- The `values` parameter sets the values of the pivoted **DataFrame**. Pandas will populate the correct values based on the index and column intersections.
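
As a rough illustration of the three parameters, the cell below pivots a small tall table. The `sales` data and its column names are hypothetical and do not come from this module's datasets.

```python
import pandas as pd

# Hypothetical tall table: one row per (date, salesperson) pair
sales = pd.DataFrame(
    {
        "Date": ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"],
        "Salesman": ["Bob", "Dave", "Bob", "Dave"],
        "Revenue": [100, 250, 175, 300],
    }
)

# Dates become row labels, salespeople become columns, revenue fills the cells
wide = sales.pivot(index="Date", columns="Salesman", values="Revenue")

print(wide)
```

Adding a new salesperson to `sales` makes the tall table grow down by rows, while the pivoted result grows out by one column.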

## The melt Method
- The `melt` method is the inverse of the `pivot` method.
- It takes a 'wide' dataset and converts it to a 'tall' dataset.
- The `melt` method is ideal when you have multiple columns storing the *same* data point.
- Ask yourself whether the column's values are a *type* of the column header. If they're not, the data is likely stored in a wide format.
- The `id_vars` parameter accepts the column whose values will be repeated for every column.
- The `var_name` parameter sets the name of the new column for the varying values (the former column names).
- The `value_name` parameter sets the new name of the values column (holding the values from the original **DataFrame**).
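
The sketch below applies `melt` to a small wide table. The `quarters` data and the new column names `Quarter` and `Revenue` are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical wide table: each quarter column stores the same kind of value
quarters = pd.DataFrame(
    {
        "Salesman": ["Bob", "Dave"],
        "Q1": [100, 250],
        "Q2": [175, 300],
    }
)

# Salesman is repeated; the Q1/Q2 headers move into a Quarter column
tall = quarters.melt(id_vars="Salesman", var_name="Quarter", value_name="Revenue")

print(tall)
```

Here `Q1` and `Q2` are both revenue figures, so the headers are really values of a single "quarter" variable, which is the signal that the table is stored wide.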

## The pivot_table Method
- The `pivot_table` method operates similarly to the Pivot Table feature in Excel.
- A pivot table is a table whose values are aggregations of groups of values from another table.
- The `values` parameter accepts the numeric column whose values will be aggregated.
- The `aggfunc` parameter declares the aggregation function (the default is mean/average).
- The `index` parameter sets the index labels of the pivot table. MultiIndexes are permitted.
- The `columns` parameter sets the column labels of the pivot table. MultiIndexes are permitted.
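
A short sketch of these parameters, using an invented `sales` table with repeated (salesperson, region) pairs so that there is something to aggregate:

```python
import pandas as pd

sales = pd.DataFrame(
    {
        "Salesman": ["Bob", "Bob", "Dave", "Dave", "Dave"],
        "Region": ["East", "West", "East", "East", "West"],
        "Revenue": [100, 200, 50, 150, 300],
    }
)

# With no aggfunc given, each cell holds the mean of its group
avg = sales.pivot_table(values="Revenue", index="Salesman", columns="Region")

# Pass aggfunc to aggregate differently, here a sum per group
total = sales.pivot_table(
    values="Revenue", index="Salesman", columns="Region", aggfunc="sum"
)

print(avg)
print(total)
```

Dave has two East rows (50 and 150), so that cell is 100 in `avg` and 200 in `total`. Unlike `pivot`, `pivot_table` accepts such duplicate index/column pairs because it aggregates them.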