# MultiIndex

In [1]:
import pandas as pd

## This Module's Dataset

By default, the `Date` column loads with the generic `object` dtype. Passing `parse_dates` to `read_csv` parses it as a datetime column instead.
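
As a minimal sketch of the effect of `parse_dates`, the snippet below reads a two-row extract of the dataset from an in-memory string instead of `bigmac.csv`. The `csv_text`, `as_text` and `as_dates` names are made up for illustration.

```python
import io
import pandas as pd

# Two rows in the same layout as bigmac.csv (illustrative extract)
csv_text = """Date,Country,Price in US Dollars
2000-04-01,Argentina,2.5
2000-04-01,Australia,1.541667
"""

# Without parse_dates the Date column stays as text
as_text = pd.read_csv(io.StringIO(csv_text))

# With parse_dates the Date column is parsed into datetimes
as_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

print(as_text["Date"].dtype)   # a text dtype; the exact name depends on the pandas version
print(as_dates["Date"].dtype)  # a datetime64 dtype
```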

In [2]:
bigmac=pd.read_csv("bigmac.csv",parse_dates=["Date"],date_format="%Y-%m-%d")
bigmac.head()

Unnamed: 0,Date,Country,Price in US Dollars
0,2000-04-01,Argentina,2.5
1,2000-04-01,Australia,1.541667
2,2000-04-01,Brazil,1.648045
3,2000-04-01,Canada,1.938776
4,2000-04-01,Switzerland,3.470588


There are also no missing values: every column reports 1386 non-null entries.

In [3]:
bigmac.describe()
bigmac.info()
bigmac.nunique()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1386 entries, 0 to 1385
Data columns (total 3 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Date                 1386 non-null   datetime64[ns]
 1   Country              1386 non-null   object        
 2   Price in US Dollars  1386 non-null   float64       
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 32.6+ KB


Date                     33
Country                  57
Price in US Dollars    1350
dtype: int64

## Create a MultiIndex
- A **MultiIndex** is an index with multiple levels or layers.
- Pass the `set_index` method a list of column names to create a multi-index **DataFrame**.
- The order of the list's values will determine the order of the levels.
- Alternatively, we can pass the `read_csv` function's `index_col` parameter a list of columns.
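
The two routes in the list above can be sketched on a small in-memory extract. The `csv_text` string and the variable names are illustrative assumptions; the notebook itself reads `bigmac.csv`.

```python
import io
import pandas as pd

# Three rows in the same layout as bigmac.csv (illustrative extract)
csv_text = """Date,Country,Price in US Dollars
2000-04-01,Argentina,2.5
2000-04-01,Australia,1.541667
2020-07-01,Argentina,3.509232
"""

# Route 1: read first, then promote two columns to index levels
flat = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])
by_set_index = flat.set_index(["Date", "Country"])

# Route 2: build the MultiIndex while reading the file
by_index_col = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["Date"],
    index_col=["Date", "Country"],
)

print(by_set_index.index.nlevels)      # number of index levels
print(list(by_index_col.index.names))  # level names follow the list order
```

Either route leaves `Price in US Dollars` as the only remaining column.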

##### Multi-indexing can speed up lookups when filtering, because rows can be located by a combination of labels
Multiple index levels are useful where hierarchical categorization applies to the dataset.
<br><br>
This mostly occurs when a record needs more than one identifier. In this case, a price in USD is uniquely identified by the combination of date and country.

##### Start with the column that has fewer unique values, i.e. the one with the smaller count in `nunique()`
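
One possible way to apply this guideline in code is to let `nunique()` decide the level order. This is only a sketch on toy data; `level_order` and `indexed` are hypothetical names, and the notebook itself simply types the order by hand.

```python
import pandas as pd

# Toy data: 2 distinct dates, 3 distinct countries
df = pd.DataFrame({
    "Date": ["2000-04-01"] * 3 + ["2020-07-01"] * 3,
    "Country": ["Argentina", "Australia", "Brazil"] * 2,
    "Price in US Dollars": [2.5, 1.541667, 1.648045, 3.509232, 4.578450, 3.913528],
})

# Order the candidate key columns from fewest to most unique values
level_order = df[["Date", "Country"]].nunique().sort_values().index.tolist()

# Use that order for the index levels
indexed = df.set_index(level_order).sort_index()
print(level_order)
```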

In [4]:
# Setting a multi-index -- note that the order of the columns matters
bigmac.set_index(["Date","Country"])
bigmac.set_index(["Country","Date"])

Unnamed: 0_level_0,Unnamed: 1_level_0,Price in US Dollars
Country,Date,Unnamed: 2_level_1
Argentina,2000-04-01,2.500000
Australia,2000-04-01,1.541667
Brazil,2000-04-01,1.648045
Canada,2000-04-01,1.938776
Switzerland,2000-04-01,3.470588
...,...,...
Ukraine,2020-07-01,2.174714
Uruguay,2020-07-01,4.327418
United States,2020-07-01,5.710000
Vietnam,2020-07-01,2.847282


In [5]:
bigmac = bigmac.set_index(["Date","Country"])
bigmac.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Price in US Dollars
Date,Country,Unnamed: 2_level_1
2000-04-01,Argentina,2.5
2000-04-01,Australia,1.541667
2000-04-01,Brazil,1.648045
2000-04-01,Canada,1.938776
2000-04-01,Switzerland,3.470588


In [6]:
bigmac=pd.read_csv("bigmac.csv",parse_dates=["Date"],date_format="%Y-%m-%d",index_col=["Date","Country"]).sort_index()
bigmac

Unnamed: 0_level_0,Unnamed: 1_level_0,Price in US Dollars
Date,Country,Unnamed: 2_level_1
2000-04-01,Argentina,2.500000
2000-04-01,Australia,1.541667
2000-04-01,Brazil,1.648045
2000-04-01,Britain,3.002000
2000-04-01,Canada,1.938776
...,...,...
2020-07-01,Ukraine,2.174714
2020-07-01,United Arab Emirates,4.015846
2020-07-01,United States,5.710000
2020-07-01,Uruguay,4.327418


In [7]:
# bigmac.index
bigmac.index.names


FrozenList(['Date', 'Country'])

In [8]:
bigmac.columns

Index(['Price in US Dollars'], dtype='object')

## Extract Index Level Values
- The `get_level_values` method extracts an **Index** with the values from one level in the **MultiIndex**.
- Invoke the `get_level_values` on the **MultiIndex**, not the **DataFrame** itself.
- The method expects either the level's index position or its name.
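
A small sketch of both ways to address a level, using a hand-built index rather than the full dataset (the `by_position` and `by_name` names are illustrative):

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("2000-04-01", "Argentina"), ("2000-04-01", "Brazil"), ("2020-07-01", "Argentina")],
    names=["Date", "Country"],
)
df = pd.DataFrame({"Price in US Dollars": [2.5, 1.648045, 3.509232]}, index=index)

# The same level, addressed by position and by name
by_position = df.index.get_level_values(1)
by_name = df.index.get_level_values("Country")

print(by_position.equals(by_name))
print(by_name.tolist())  # one value per row, duplicates included
```

The result has one entry per row of the **DataFrame**, not one entry per distinct label.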

In [9]:
# Extracts outer-most index level 
# bigmac.index.get_level_values("Date")
bigmac.index.get_level_values(0)


DatetimeIndex(['2000-04-01', '2000-04-01', '2000-04-01', '2000-04-01',
               '2000-04-01', '2000-04-01', '2000-04-01', '2000-04-01',
               '2000-04-01', '2000-04-01',
               ...
               '2020-07-01', '2020-07-01', '2020-07-01', '2020-07-01',
               '2020-07-01', '2020-07-01', '2020-07-01', '2020-07-01',
               '2020-07-01', '2020-07-01'],
              dtype='datetime64[ns]', name='Date', length=1386, freq=None)

In [10]:
# Extracts inner-most level
bigmac.index.get_level_values(-1)
bigmac.index.get_level_values(1)


Index(['Argentina', 'Australia', 'Brazil', 'Britain', 'Canada', 'Chile',
       'China', 'Czech Republic', 'Denmark', 'Euro area',
       ...
       'Sweden', 'Switzerland', 'Taiwan', 'Thailand', 'Turkey', 'Ukraine',
       'United Arab Emirates', 'United States', 'Uruguay', 'Vietnam'],
      dtype='object', name='Country', length=1386)

## Rename Index Levels
- Invoke the `set_names` method on the **MultiIndex** to change one or more level names.
- Use the `names` and `level` parameters to target a nested index at a given level.
- Alternatively, pass `names` a list of strings to overwrite *all* level names.
- The `set_names` method returns a copy, so replace the original index to alter the **DataFrame**.
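
The copy-versus-assignment behaviour described above can be sketched on a two-row frame (illustrative data; `renamed` and `names_before` are hypothetical names):

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("2000-04-01", "Argentina"), ("2000-04-01", "Brazil")],
    names=["Date", "Country"],
)
df = pd.DataFrame({"Price in US Dollars": [2.5, 1.648045]}, index=index)

# Rename only the outer level; this returns a new index object
renamed = df.index.set_names(names="Timespan", level=0)
names_before = list(df.index.names)  # the DataFrame still has the old names

# Assign the result back to make the change stick
df.index = df.index.set_names(names=["Timespan", "Place"])

print(list(renamed.names))
print(names_before)
print(list(df.index.names))
```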

In [11]:
# Renaming a specific index level
bigmac.index.set_names(names="Timespan",level=0)

MultiIndex([('2000-04-01',            'Argentina'),
            ('2000-04-01',            'Australia'),
            ('2000-04-01',               'Brazil'),
            ('2000-04-01',              'Britain'),
            ('2000-04-01',               'Canada'),
            ('2000-04-01',                'Chile'),
            ('2000-04-01',                'China'),
            ('2000-04-01',       'Czech Republic'),
            ('2000-04-01',              'Denmark'),
            ('2000-04-01',            'Euro area'),
            ...
            ('2020-07-01',               'Sweden'),
            ('2020-07-01',          'Switzerland'),
            ('2020-07-01',               'Taiwan'),
            ('2020-07-01',             'Thailand'),
            ('2020-07-01',               'Turkey'),
            ('2020-07-01',              'Ukraine'),
            ('2020-07-01', 'United Arab Emirates'),
            ('2020-07-01',        'United States'),
            ('2020-07-01',              'Uruguay

In [12]:
# Renaming all the index levels by passing a list 
# note that its length should be equal to the no. of levels
bigmac.index.set_names(names=["Timespan","Place"])

MultiIndex([('2000-04-01',            'Argentina'),
            ('2000-04-01',            'Australia'),
            ('2000-04-01',               'Brazil'),
            ('2000-04-01',              'Britain'),
            ('2000-04-01',               'Canada'),
            ('2000-04-01',                'Chile'),
            ('2000-04-01',                'China'),
            ('2000-04-01',       'Czech Republic'),
            ('2000-04-01',              'Denmark'),
            ('2000-04-01',            'Euro area'),
            ...
            ('2020-07-01',               'Sweden'),
            ('2020-07-01',          'Switzerland'),
            ('2020-07-01',               'Taiwan'),
            ('2020-07-01',             'Thailand'),
            ('2020-07-01',               'Turkey'),
            ('2020-07-01',              'Ukraine'),
            ('2020-07-01', 'United Arab Emirates'),
            ('2020-07-01',        'United States'),
            ('2020-07-01',              'Uruguay

In [13]:
# The renaming can be assigned to the original dataframe
# Ensure that the new names are passed to the index & not the DataFrame itself
bigmac.index = bigmac.index.set_names(names=["Timespan","Place"])
bigmac.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Price in US Dollars
Timespan,Place,Unnamed: 2_level_1
2000-04-01,Argentina,2.5
2000-04-01,Australia,1.541667
2000-04-01,Brazil,1.648045
2000-04-01,Britain,3.002
2000-04-01,Canada,1.938776


## The sort_index Method on a MultiIndex DataFrame
- Using the `sort_index` method, we can target all levels or specific levels of the **MultiIndex**.
- To apply a different sort order to different levels, pass a list of Booleans.
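
As a sketch of a per-level sort order, the snippet below sorts a four-row frame with the outer level ascending and the inner level descending. The dates are kept as plain strings here to keep the example short.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [
        ("2000-04-01", "Argentina"),
        ("2000-04-01", "Brazil"),
        ("2020-07-01", "Argentina"),
        ("2020-07-01", "Brazil"),
    ],
    names=["Date", "Country"],
)
df = pd.DataFrame(
    {"Price in US Dollars": [2.5, 1.648045, 3.509232, 3.913528]}, index=index
)

# One Boolean per level: Date ascending, Country descending
mixed = df.sort_index(ascending=[True, False])
print(mixed.index.tolist())
```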

In [14]:
bigmac=pd.read_csv("bigmac.csv",parse_dates=["Date"],date_format="%Y-%m-%d",index_col=["Date","Country"]).sort_index()

In [15]:
bigmac.sort_index(ascending=[True,False])

Unnamed: 0_level_0,Unnamed: 1_level_0,Price in US Dollars
Date,Country,Unnamed: 2_level_1
2000-04-01,United States,2.510000
2000-04-01,Thailand,1.447368
2000-04-01,Taiwan,2.287582
2000-04-01,Switzerland,3.470588
2000-04-01,Sweden,2.714932
...,...,...
2020-07-01,Brazil,3.913528
2020-07-01,Bahrain,3.713035
2020-07-01,Azerbaijan,2.324897
2020-07-01,Australia,4.578450


In [16]:
bigmac.sort_index(ascending=[False,True])

Unnamed: 0_level_0,Unnamed: 1_level_0,Price in US Dollars
Date,Country,Unnamed: 2_level_1
2020-07-01,Argentina,3.509232
2020-07-01,Australia,4.578450
2020-07-01,Azerbaijan,2.324897
2020-07-01,Bahrain,3.713035
2020-07-01,Brazil,3.913528
...,...,...
2000-04-01,Sweden,2.714932
2000-04-01,Switzerland,3.470588
2000-04-01,Taiwan,2.287582
2000-04-01,Thailand,1.447368


## Extract Rows from a MultiIndex DataFrame
- A **tuple** is an immutable list. It cannot be modified after creation.
- Create a tuple with a comma between elements. The community convention is to wrap the elements in parentheses.
- The `iloc` and `loc` accessors are available to extract rows by index position or label.
- For the `loc` accessor, pass a tuple to hold the labels from the index levels.
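
A minimal sketch of the tuple-based lookup, on three rows taken from the dataset (dates kept as strings for brevity; `key`, `row` and `price` are illustrative names):

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("2000-04-01", "Argentina"), ("2000-04-01", "Brazil"), ("2000-04-01", "Canada")],
    names=["Date", "Country"],
)
df = pd.DataFrame({"Price in US Dollars": [2.5, 1.648045, 1.938776]}, index=index)

key = ("2000-04-01", "Canada")              # one label per index level
row = df.loc[key]                           # the matching row as a Series
price = df.loc[key, "Price in US Dollars"]  # tuple for the row, label for the column
third_row = df.iloc[2]                      # the same row, by position

print(price)
```

Keeping the row labels together in one tuple makes it clear which arguments address index levels and which address columns.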

In [17]:
bigmac.iloc[2]

Price in US Dollars    1.648045
Name: (2000-04-01 00:00:00, Brazil), dtype: float64

In [18]:
bigmac.loc["2000-04-01"]

Unnamed: 0_level_0,Price in US Dollars
Country,Unnamed: 1_level_1
Argentina,2.5
Australia,1.541667
Brazil,1.648045
Britain,3.002
Canada,1.938776
Chile,2.451362
China,1.195652
Czech Republic,1.390537
Denmark,3.078358
Euro area,2.3808


In [19]:
# Now filtering both indexes to fetch a particular dollar value
bigmac.loc["2000-04-01","Canada"]

Price in US Dollars    1.938776
Name: (2000-04-01 00:00:00, Canada), dtype: float64

#### Extending the concept of multi-indexing to bond dataset
We'll fetch films by Actor-Director combination.

In [20]:
# We'll make actor & director as the indexes here
bond = pd.read_csv("jamesbond.csv",index_col=["Actor","Director"]).sort_index()
bond.head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,Film,Year,Box Office,Budget,Bond Actor Salary
Actor,Director,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Daniel Craig,Cary Joji Fukunaga,No Time to Die,2021,774.2,301.0,25.0
Daniel Craig,Marc Forster,Quantum of Solace,2008,514.2,181.4,8.1
Daniel Craig,Martin Campbell,Casino Royale,2006,581.5,145.3,3.3
Daniel Craig,Sam Mendes,Skyfall,2012,943.5,170.2,14.5
Daniel Craig,Sam Mendes,Spectre,2015,726.7,206.3,30.0
David Niven,Ken Hughes,Casino Royale,1967,315.0,85.0,
George Lazenby,Peter R. Hunt,On Her Majesty's Secret Service,1969,291.5,37.3,0.6
Pierce Brosnan,Lee Tamahori,Die Another Day,2002,465.4,154.2,17.9
Pierce Brosnan,Martin Campbell,GoldenEye,1995,518.5,76.9,5.1
Pierce Brosnan,Michael Apted,The World Is Not Enough,1999,439.5,158.3,13.5


In [21]:
# We'll fetch the film & year using Actor-Director combination 
bond.loc[("Daniel Craig","Sam Mendes"),("Film","Year")]

Unnamed: 0_level_0,Unnamed: 1_level_0,Film,Year
Actor,Director,Unnamed: 2_level_1,Unnamed: 3_level_1
Daniel Craig,Sam Mendes,Skyfall,2012
Daniel Craig,Sam Mendes,Spectre,2015


In [22]:
bond.loc[("Sean Connery","Guy Hamilton"),("Film","Year")]

Unnamed: 0_level_0,Unnamed: 1_level_0,Film,Year
Actor,Director,Unnamed: 2_level_1,Unnamed: 3_level_1
Sean Connery,Guy Hamilton,Goldfinger,1964
Sean Connery,Guy Hamilton,Diamonds Are Forever,1971


#### Filtering records using only the 2nd-level index
John Glen is a director who has worked with more than one Bond actor.
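
The filter can be sketched on a three-film extract. The Boolean mask mirrors the approach used in this notebook; the `xs` call is an alternative that the notebook does not use, shown only for comparison.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [
        ("Roger Moore", "John Glen"),
        ("Sean Connery", "Guy Hamilton"),
        ("Timothy Dalton", "John Glen"),
    ],
    names=["Actor", "Director"],
)
films = pd.DataFrame(
    {"Film": ["Octopussy", "Goldfinger", "Licence to Kill"]}, index=index
)

# Boolean mask built from the inner level's values
mask = films.index.get_level_values("Director") == "John Glen"
by_mask = films[mask]

# Alternative: cross-section on a named level, keeping both levels
by_xs = films.xs("John Glen", level="Director", drop_level=False)

print(by_mask)
```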

In [23]:
bond[bond.index.get_level_values(1)=="John Glen"]

Unnamed: 0_level_0,Unnamed: 1_level_0,Film,Year,Box Office,Budget,Bond Actor Salary
Actor,Director,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Roger Moore,John Glen,For Your Eyes Only,1981,449.4,60.2,
Roger Moore,John Glen,Octopussy,1983,373.8,53.9,7.8
Roger Moore,John Glen,A View to a Kill,1985,275.2,54.5,9.1
Timothy Dalton,John Glen,The Living Daylights,1987,313.5,68.8,5.2
Timothy Dalton,John Glen,Licence to Kill,1989,250.9,56.7,7.9


In [24]:
bond[bond.index.get_level_values(0)=="Sean Connery"]
bond.loc["Sean Connery"]

Unnamed: 0_level_0,Film,Year,Box Office,Budget,Bond Actor Salary
Director,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Guy Hamilton,Goldfinger,1964,820.4,18.6,3.2
Guy Hamilton,Diamonds Are Forever,1971,442.5,34.7,5.8
Irvin Kershner,Never Say Never Again,1983,380.0,86.0,
Lewis Gilbert,You Only Live Twice,1967,514.2,59.9,4.4
Terence Young,Dr. No,1962,448.8,7.0,0.6
Terence Young,From Russia with Love,1963,543.8,12.6,1.6
Terence Young,Thunderball,1965,848.1,41.9,4.7


#### Fetching over a range of rows
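
Slicing between two tuples relies on the index being sorted, which is why the frame is loaded with `sort_index()`. A sketch on a four-film extract (`films` and `subset` are illustrative names):

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [
        ("Sean Connery", "Terence Young"),
        ("Sean Connery", "Guy Hamilton"),
        ("Sean Connery", "Lewis Gilbert"),
        ("Roger Moore", "John Glen"),
    ],
    names=["Actor", "Director"],
)
films = pd.DataFrame(
    {
        "Film": ["Dr. No", "Goldfinger", "You Only Live Twice", "Octopussy"],
        "Year": [1962, 1964, 1967, 1983],
    },
    index=index,
).sort_index()  # sort before slicing by label

start = ("Sean Connery", "Guy Hamilton")
end = ("Sean Connery", "Lewis Gilbert")

# Both endpoints are included in a label slice
subset = films.loc[start:end, ["Film", "Year"]]
print(subset)
```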

In [25]:
bond.loc[("Sean Connery","Guy Hamilton"):("Sean Connery","Lewis Gilbert")]

Unnamed: 0_level_0,Unnamed: 1_level_0,Film,Year,Box Office,Budget,Bond Actor Salary
Actor,Director,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Sean Connery,Guy Hamilton,Goldfinger,1964,820.4,18.6,3.2
Sean Connery,Guy Hamilton,Diamonds Are Forever,1971,442.5,34.7,5.8
Sean Connery,Irvin Kershner,Never Say Never Again,1983,380.0,86.0,
Sean Connery,Lewis Gilbert,You Only Live Twice,1967,514.2,59.9,4.4


In [26]:
bond.loc[("Sean Connery","Guy Hamilton"):("Sean Connery","Lewis Gilbert"),("Film","Year")]

Unnamed: 0_level_0,Unnamed: 1_level_0,Film,Year
Actor,Director,Unnamed: 2_level_1,Unnamed: 3_level_1
Sean Connery,Guy Hamilton,Goldfinger,1964
Sean Connery,Guy Hamilton,Diamonds Are Forever,1971
Sean Connery,Irvin Kershner,Never Say Never Again,1983
Sean Connery,Lewis Gilbert,You Only Live Twice,1967


____
# Reshaping DataFrames
Use the `transpose`, `stack`, `unstack`, `pivot`, `melt` and `pivot_table` methods to reshape an existing DataFrame.

## The transpose Method
- The `transpose` method inverts/flips the horizontal and vertical axes of the **DataFrame**.
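
A minimal sketch of the flip on a two-row frame; the `first` and `second` row labels are made up for the example.

```python
import pandas as pd

films = pd.DataFrame(
    {
        "Film": ["Goldfinger", "Diamonds Are Forever"],
        "Year": [1964, 1971],
        "Budget": [18.6, 34.7],
    },
    index=["first", "second"],
)

# Rows become columns and columns become rows; films.T is the shorthand
flipped = films.transpose()

print(films.shape, flipped.shape)
print(flipped.loc["Year", "first"])
```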

In [27]:
start = ("Sean Connery","Guy Hamilton")
end = ("Sean Connery","Lewis Gilbert")

In [28]:
bond.loc[start:end].transpose()

Actor,Sean Connery,Sean Connery,Sean Connery,Sean Connery
Director,Guy Hamilton,Guy Hamilton.1,Irvin Kershner,Lewis Gilbert
Film,Goldfinger,Diamonds Are Forever,Never Say Never Again,You Only Live Twice
Year,1964,1971,1983,1967
Box Office,820.4,442.5,380.0,514.2
Budget,18.6,34.7,86.0,59.9
Bond Actor Salary,3.2,5.8,,4.4


In [29]:
start = ("2000-04-01","Argentina")
end = ("2000-04-01","Canada")
bigmac.loc[start:end]
bigmac.loc[start:end].transpose()

Date,2000-04-01,2000-04-01,2000-04-01,2000-04-01,2000-04-01
Country,Argentina,Australia,Brazil,Britain,Canada
Price in US Dollars,2.5,1.541667,1.648045,3.002,1.938776


## The stack Method
- The `stack` method moves the column index to the row index, i.e. the column labels become a new innermost level of the row index.
- Pandas will return a **MultiIndex Series**.
- Think of it like "stacking" index levels for a **MultiIndex**.
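
The behaviour described above can be sketched on two rows. The GDP figures are the rounded values as displayed in this notebook, not exact data.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [(1960, "Algeria"), (1960, "Australia")], names=["year", "country"]
)
stats = pd.DataFrame(
    {
        "Population": [11124892.0, 10276477.0],
        "GDP": [2723638000.0, 18567590000.0],  # rounded display values
    },
    index=index,
)

# Column labels become a third, innermost index level
stacked = stats.stack()

print(type(stacked).__name__, stacked.index.nlevels)
print(stacked.loc[(1960, "Algeria", "GDP")])
```

Each original cell is now its own row, addressed by a three-part key.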

In [30]:
# Inspecting the data we can see that we can assign year then country as the indexes
worldstats = pd.read_csv("worldstats.csv",index_col=["year","country"]).sort_index()
worldstats.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Population,GDP
year,country,Unnamed: 2_level_1,Unnamed: 3_level_1
1960,Afghanistan,8994793.0,537777800.0
1960,Algeria,11124892.0,2723638000.0
1960,Australia,10276477.0,18567590000.0
1960,Austria,7047539.0,6592694000.0
1960,"Bahamas, The",109526.0,169802300.0


In [31]:
worldstats.nunique()

Population    11067
GDP           11065
dtype: int64

In [32]:
worldstats = worldstats.stack().to_frame()

##### Accessing data in a stacked dataframe

In [33]:
worldstats.loc[(1960,"Algeria","Population")]

0    11124892.0
Name: (1960, Algeria, Population), dtype: float64

In [34]:
# Accessing via multiple levels of indexing
worldstats.loc[(1960,"Algeria","GDP")]

0    2.723638e+09
Name: (1960, Algeria, GDP), dtype: float64

In [35]:
worldstats.loc[(1960,"Algeria"):(1960,"Belgium")]

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,0
year,country,Unnamed: 2_level_1,Unnamed: 3_level_1
1960,Algeria,Population,11124890.0
1960,Algeria,GDP,2723638000.0
1960,Australia,Population,10276480.0
1960,Australia,GDP,18567590000.0
1960,Austria,Population,7047539.0
1960,Austria,GDP,6592694000.0
1960,"Bahamas, The",Population,109526.0
1960,"Bahamas, The",GDP,169802300.0
1960,Bangladesh,Population,48200700.0
1960,Bangladesh,GDP,4274894000.0


In [36]:
worldstats.loc[(1960,"Algeria","GDP"):(1960,"Belgium","Population")]

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,0
year,country,Unnamed: 2_level_1,Unnamed: 3_level_1
1960,Algeria,GDP,2723638000.0
1960,Australia,Population,10276480.0
1960,Australia,GDP,18567590000.0
1960,Austria,Population,7047539.0
1960,Austria,GDP,6592694000.0
1960,"Bahamas, The",Population,109526.0
1960,"Bahamas, The",GDP,169802300.0
1960,Bangladesh,Population,48200700.0
1960,Bangladesh,GDP,4274894000.0
1960,Belgium,Population,9153489.0


In [37]:
pd.Series([1,2,3,4]).to_frame()

Unnamed: 0,0
0,1
1,2
2,3
3,4


In [38]:
ws = pd.read_csv("worldstats.csv",index_col=["year","country"]).sort_index()
ws.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Population,GDP
year,country,Unnamed: 2_level_1,Unnamed: 3_level_1
1960,Afghanistan,8994793.0,537777800.0
1960,Algeria,11124892.0,2723638000.0
1960,Australia,10276477.0,18567590000.0
1960,Austria,7047539.0,6592694000.0
1960,"Bahamas, The",109526.0,169802300.0


## The unstack Method
- The `unstack` method moves a row index to the column index (the inverse of the `stack` method).
- By default, the `unstack` method will move the innermost index.
- We can customize the moved index with the `level` parameter.
- The `level` parameter accepts the level's index position or its name. It can also accept a list of positions/names.
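
A sketch of the default behaviour and of the `level` parameter, starting from a small stacked **Series** (figures are the rounded values displayed in this notebook; the variable names are illustrative):

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [(1960, "Algeria"), (1960, "Australia")], names=["year", "country"]
)
stats = pd.DataFrame(
    {"Population": [11124892.0, 10276477.0], "GDP": [2723638000.0, 18567590000.0]},
    index=index,
)
stacked = stats.stack()

# Default: the innermost level (the former column labels) moves back to the columns
default = stacked.unstack()

# Named level: the countries move to the columns instead
by_country = stacked.unstack(level="country")

print(default)
print(by_country)
```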

In [39]:
worldstats.unstack()

Unnamed: 0_level_0,Unnamed: 1_level_0,0,0
Unnamed: 0_level_1,Unnamed: 1_level_1,Population,GDP
year,country,Unnamed: 2_level_2,Unnamed: 3_level_2
1960,Afghanistan,8.994793e+06,5.377778e+08
1960,Algeria,1.112489e+07,2.723638e+09
1960,Australia,1.027648e+07,1.856759e+10
1960,Austria,7.047539e+06,6.592694e+09
1960,"Bahamas, The",1.095260e+05,1.698023e+08
...,...,...,...
2015,Vietnam,9.170380e+07,1.935994e+11
2015,West Bank and Gaza,4.422143e+06,1.267740e+10
2015,World,7.346633e+09,7.343364e+13
2015,Zambia,1.621177e+07,2.120156e+10


In [40]:
worldstats = worldstats.unstack()
worldstats

Unnamed: 0_level_0,Unnamed: 1_level_0,0,0
Unnamed: 0_level_1,Unnamed: 1_level_1,Population,GDP
year,country,Unnamed: 2_level_2,Unnamed: 3_level_2
1960,Afghanistan,8.994793e+06,5.377778e+08
1960,Algeria,1.112489e+07,2.723638e+09
1960,Australia,1.027648e+07,1.856759e+10
1960,Austria,7.047539e+06,6.592694e+09
1960,"Bahamas, The",1.095260e+05,1.698023e+08
...,...,...,...
2015,Vietnam,9.170380e+07,1.935994e+11
2015,West Bank and Gaza,4.422143e+06,1.267740e+10
2015,World,7.346633e+09,7.343364e+13
2015,Zambia,1.621177e+07,2.120156e+10


In [41]:
# Dropping the '0' column header 
worldstats.columns = worldstats.columns.droplevel(level=0)
worldstats

Unnamed: 0_level_0,Unnamed: 1_level_0,Population,GDP
year,country,Unnamed: 2_level_1,Unnamed: 3_level_1
1960,Afghanistan,8.994793e+06,5.377778e+08
1960,Algeria,1.112489e+07,2.723638e+09
1960,Australia,1.027648e+07,1.856759e+10
1960,Austria,7.047539e+06,6.592694e+09
1960,"Bahamas, The",1.095260e+05,1.698023e+08
...,...,...,...
2015,Vietnam,9.170380e+07,1.935994e+11
2015,West Bank and Gaza,4.422143e+06,1.267740e+10
2015,World,7.346633e+09,7.343364e+13
2015,Zambia,1.621177e+07,2.120156e+10


## The pivot Method
- The `pivot` method reshapes data from a tall format to a wide format.
- Ask yourself which direction the data will expand in if you add more entries.
- A tall/long format expands down. A wide format expands out.
- The `index` parameter sets the column whose values become the row labels (the index) of the pivoted **DataFrame**.
- The `columns` parameter sets the column whose values will be the columns in the pivoted **DataFrame**.
- The `values` parameter sets the values of the pivoted **DataFrame**. Pandas will populate the correct values based on the index and column intersections.
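
The three parameters can be sketched on four rows taken from the sales data, built inline instead of read from `salesmen.csv`:

```python
import pandas as pd

# Tall format: one row per (date, salesman) pair
sales = pd.DataFrame({
    "Date": ["1/1/2025", "1/1/2025", "1/10/2025", "1/10/2025"],
    "Salesman": ["Dave", "Sharon", "Dave", "Sharon"],
    "Revenue": [1864, 7172, 7105, 7543],
})

# Wide format: one row per date, one column per salesman
wide = sales.pivot(index="Date", columns="Salesman", values="Revenue")

print(wide)
```

`pivot` expects each index/column pair to appear only once; when pairs repeat, an aggregating method such as `pivot_table` is the usual choice.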

In [44]:
sales = pd.read_csv("salesmen.csv").sort_values(["Date"])
sales

Unnamed: 0,Date,Salesman,Revenue
0,1/1/2025,Sharon,7172
1460,1/1/2025,Oscar,5250
730,1/1/2025,Dave,1864
365,1/1/2025,Ronald,2639
1095,1/1/2025,Alexander,4430
...,...,...,...
251,9/9/2025,Sharon,1231
616,9/9/2025,Ronald,8922
981,9/9/2025,Dave,4733
1346,9/9/2025,Alexander,6823


##### After pivoting, the DataFrame has fewer rows (one per date) and one column per salesman, 5 in total

In [51]:
sales.pivot(index="Date",columns="Salesman",values="Revenue")

Salesman,Alexander,Dave,Oscar,Ronald,Sharon
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1/1/2025,4430,1864,5250,2639,7172
1/10/2025,301,7105,7663,8267,7543
1/11/2025,9489,6851,8888,1340,1053
1/12/2025,8719,7147,3092,279,4362
1/13/2025,2349,6160,6139,7540,6812
...,...,...,...,...,...
9/5/2025,2439,211,7743,4252,992
9/6/2025,7585,7293,5072,1112,556
9/7/2025,6669,9774,5230,3608,6499
9/8/2025,3058,8194,7755,5762,9621


## The melt Method
- The `melt` method is the inverse of the `pivot` method.
- It takes a 'wide' dataset and converts it to a 'tall' dataset.
- The `melt` method is ideal when you have multiple columns storing the *same* data point.
- Ask yourself whether the column's values are a *type* of the column header. If they're not, the data is likely stored in a wide format.
- The `id_vars` parameter accepts the column (or columns) to keep as identifiers; its values are repeated for every melted column.
- The `var_name` parameter sets the name of the new column for the varying values (the former column names).
- The `value_name` parameter sets the name of the new values column (holding the values from the original **DataFrame**).
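
A sketch of the three parameters on a two-salesman, two-quarter extract. The `Quarter` and `Revenue` column names are chosen here for illustration; the notebook keeps the default `variable` and `value` names.

```python
import pandas as pd

# Wide format: one column per quarter
qrtr = pd.DataFrame({
    "Salesman": ["Boris", "Piers"],
    "Q1": [602908, 43790],
    "Q2": [233879, 514863],
})

# Tall format: one row per (salesman, quarter) pair
tall = qrtr.melt(id_vars="Salesman", var_name="Quarter", value_name="Revenue")

print(tall)
```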

In [59]:
# The dataset shows quarterly performance of each salesman 
# Assuming NO new salesman (row) is added, it's an example of a WIDE dataframe
qrtr = pd.read_csv("quarters.csv")
qrtr

Unnamed: 0,Salesman,Q1,Q2,Q3,Q4
0,Boris,602908,233879,354479,32704
1,Piers,43790,514863,297151,544493
2,Tommy,392668,113579,430882,247231
3,Travis,834663,266785,749238,570524
4,Cindy,580935,411379,110390,651572
5,Rob,656644,70803,375948,321388
6,Mike,486141,600753,742716,404995
7,Stacy,479662,742806,770712,2501
8,Alexandra,992673,879183,37945,293710


In [None]:
# Converting this wide dataframe to tall dataframe
qrtr.melt(id_vars="Salesman")#.set_index(["Salesman","variable"]).sort_index()



In [72]:
# This dataframe can then be optimized by multi-indexing, sorting, converting dtypes, etc.
qrtr.melt(id_vars="Salesman").set_index(["Salesman","variable"]).sort_index()

Unnamed: 0_level_0,Unnamed: 1_level_0,value
Salesman,variable,Unnamed: 2_level_1
Alexandra,Q1,992673
Alexandra,Q2,879183
Alexandra,Q3,37945
Alexandra,Q4,293710
Boris,Q1,602908
Boris,Q2,233879
Boris,Q3,354479
Boris,Q4,32704
Cindy,Q1,580935
Cindy,Q2,411379


## The pivot_table Method
- The `pivot_table` method operates similarly to the Pivot Table feature in Excel.
- A pivot table is a table whose values are aggregations of groups of values from another table.
- The `values` parameter accepts the numeric column whose values will be aggregated.
- The `aggfunc` parameter declares the aggregation function (the default is mean/average).
- The `index` parameter sets the index labels of the pivot table. MultiIndexes are permitted.
- The `columns` parameter sets the column labels of the pivot table. MultiIndexes are permitted.
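
The parameters above can be sketched on four made-up rows. The spend values are hypothetical and chosen so the aggregates are easy to check by hand.

```python
import pandas as pd

# Hypothetical toy rows in the same layout as foods.csv
foods = pd.DataFrame({
    "City": ["New York", "New York", "Stamford", "Stamford"],
    "Gender": ["Female", "Male", "Female", "Female"],
    "Spend": [10.0, 20.0, 30.0, 50.0],
})

# Default aggregation is the mean of each group
mean_by_city = foods.pivot_table(values="Spend", index="City")

# Sum per city (rows) and gender (columns)
total = foods.pivot_table(
    values="Spend", index="City", columns="Gender", aggfunc="sum"
)

print(mean_by_city)
print(total)
```

A city/gender combination with no rows, such as Stamford/Male here, shows up as a missing value in the result.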

In [None]:
foods = pd.read_csv("foods.csv")
foods

Unnamed: 0,First Name,Gender,City,Frequency,Item,Spend
0,Wanda,Female,Stamford,Weekly,Burger,15.66
1,Eric,Male,Stamford,Daily,Chalupa,10.56
2,Charles,Male,New York,Never,Sushi,42.14
3,Anna,Female,Philadelphia,Once,Ice Cream,11.01
4,Deborah,Female,Philadelphia,Daily,Chalupa,23.49
...,...,...,...,...,...,...
995,Donna,Female,New York,Monthly,Sushi,83.53
996,Albert,Male,Philadelphia,Daily,Sushi,72.88
997,Jean,Female,Stamford,Weekly,Donut,5.85
998,Jessica,Female,New York,Daily,Chalupa,43.19


In [83]:
foods.info()
foods["City"].value_counts()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   First Name  1000 non-null   object 
 1   Gender      1000 non-null   object 
 2   City        1000 non-null   object 
 3   Frequency   1000 non-null   object 
 4   Item        1000 non-null   object 
 5   Spend       1000 non-null   float64
dtypes: float64(1), object(5)
memory usage: 47.0+ KB


City
Philadelphia    359
Stamford        328
New York        313
Name: count, dtype: int64

In [80]:
# pivot_table aggregates values by the mentioned index
# Find average spend (values) by City (the index based on which aggregation is to be performed)
foods.pivot_table(values="Spend",index="City")

Unnamed: 0_level_0,Spend
City,Unnamed: 1_level_1
New York,50.509808
Philadelphia,49.678384
Stamford,50.077012


In [None]:
# Find average spend (values) by Gender (the index)
foods.pivot_table(values="Spend",index="Gender")

Unnamed: 0_level_0,Spend
Gender,Unnamed: 1_level_1
Female,50.709629
Male,49.397623


In [84]:
# Find total spend by city
foods.pivot_table(values="Spend",index="City",aggfunc="sum")

Unnamed: 0_level_0,Spend
City,Unnamed: 1_level_1
New York,15809.57
Philadelphia,17834.54
Stamford,16425.26


In [85]:
# Find total spend by city and by gender within each city
foods.pivot_table(values="Spend",index=["City","Gender"],aggfunc="sum")

Unnamed: 0_level_0,Unnamed: 1_level_0,Spend
City,Gender,Unnamed: 2_level_1
New York,Female,7543.26
New York,Male,8266.31
Philadelphia,Female,9632.69
Philadelphia,Male,8201.85
Stamford,Female,8787.38
Stamford,Male,7637.88


In [88]:
foods.pivot_table(values="Spend",index=["City","Item"],columns=["Gender"],aggfunc="sum")

Unnamed: 0_level_0,Gender,Female,Male
City,Item,Unnamed: 2_level_1,Unnamed: 3_level_1
New York,Burger,1239.04,1294.09
New York,Burrito,978.95,1399.4
New York,Chalupa,876.58,1227.77
New York,Donut,1446.78,1345.27
New York,Ice Cream,1521.62,1603.63
New York,Sushi,1480.29,1396.15
Philadelphia,Burger,1639.24,938.18
Philadelphia,Burrito,1458.76,1312.93
Philadelphia,Chalupa,1673.33,1114.23
Philadelphia,Donut,1639.26,1249.36


In [91]:
foods.pivot_table(values="Spend",index=["Gender","Item"],aggfunc="sum")

Unnamed: 0_level_0,Unnamed: 1_level_0,Spend
Gender,Item,Unnamed: 2_level_1
Female,Burger,4094.3
Female,Burrito,4257.82
Female,Chalupa,4152.26
Female,Donut,4743.0
Female,Ice Cream,4032.87
Female,Sushi,4683.08
Male,Burger,3671.43
Male,Burrito,4012.62
Male,Chalupa,3492.26
Male,Donut,4015.76


In [90]:
foods.pivot_table(values="Spend",index=["Gender","Item"],columns=["City"],aggfunc="sum")

Unnamed: 0_level_0,City,New York,Philadelphia,Stamford
Gender,Item,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Female,Burger,1239.04,1639.24,1216.02
Female,Burrito,978.95,1458.76,1820.11
Female,Chalupa,876.58,1673.33,1602.35
Female,Donut,1446.78,1639.26,1656.96
Female,Ice Cream,1521.62,1479.22,1032.03
Female,Sushi,1480.29,1742.88,1459.91
Male,Burger,1294.09,938.18,1439.16
Male,Burrito,1399.4,1312.93,1300.29
Male,Chalupa,1227.77,1114.23,1150.26
Male,Donut,1345.27,1249.36,1421.13
