# Reshaping and pivot tables


There are three main ways to reshape a given DataFrame (or Series):

- `pivot()`
- `stack()` and `unstack()`
- `melt()`

These functions help us transition between different shapes, back and forth.

In [15]:
import pandas as pd
import numpy as np

np.random.seed(0)

## Wide and Long format

Suppose you have a variable `a` that depends on two parameters `i` and `j`. 
There are two equivalent ways to represent it as a table:

- **wide** format is more appropriate for dense data
- **long** format is more appropriate for sparse data: when some values are
zero or missing, you can omit the corresponding rows.

<img src="./assets/imgs/wide_and_large.webp" width="500"/>
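As a minimal sketch of the two layouts, the snippet below builds the same variable `a` by hand in both formats. The labels (`0`, `1`, `"x"`, `"y"`) and the values are made up for illustration only.

```python
import pandas as pd

# Wide format: one row per i, one column per j, a[i, j] stored in the cells.
wide = pd.DataFrame(
    [[1.0, 0.0], [0.0, 4.0]],
    index=pd.Index([0, 1], name="i"),
    columns=pd.Index(["x", "y"], name="j"),
)

# Long format: one row per (i, j) pair; the zero entries are simply omitted.
long = pd.DataFrame({"i": [0, 1], "j": ["x", "y"], "a": [1.0, 4.0]})

# Every row of the long table corresponds to one cell of the wide table.
for row in long.itertuples(index=False):
    assert wide.loc[row.i, row.j] == row.a
```

Here the wide table stores four cells while the long table needs only two rows, which is why the long format tends to suit sparse data.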

## `pivot()`

`pivot()` helps us transition from a **long format** to a **wide format**. You
need to provide the following arguments:

`df.pivot(index, columns, values)`

- `index` is the column of the long format that becomes the index of the wide
format.
- `columns` is the column of the long format whose values become the columns of
the wide format.
- `values` is the column of the long format that fills the cells of the wide
format.

**Note:** If `values` is omitted and there are several value columns in the long
format, the columns of the result will be a multi-index whose top level indicates
the respective value column.

**Note:** If `index` is omitted, the index of the long format is used.

**Note:** `pivot()` raises `ValueError: Index contains duplicate entries, cannot
reshape` if an index/column pair is not unique. In this case, consider using
`pivot_table()`, a generalization of `pivot()` that can handle duplicate values
for one index/column pair by aggregating them.
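The following sketch illustrates the duplicate-entry case. The `dup` table and the choice of `aggfunc="sum"` are assumptions made for this example; any aggregation that suits your data could be used instead.

```python
import pandas as pd

# ("John", "bananas") appears twice, so the index/column pair is not unique.
dup = pd.DataFrame({
    "client": ["John", "John", "Silvia"],
    "product": ["bananas", "bananas", "oranges"],
    "quantity": [5, 3, 2],
})

try:
    dup.pivot(index="client", columns="product", values="quantity")
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print(err)

# pivot_table() aggregates the duplicates instead of raising.
summed = dup.pivot_table(
    index="client", columns="product", values="quantity", aggfunc="sum"
)
print(summed)
```

With `aggfunc="sum"`, the two rows for John's bananas are combined into a single cell, and combinations that never occur are left as NaN.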

In [16]:
data = {
    "client" : ["John", "John", "Silvia", "Silvia"],
    "product" : ["bananas", "oranges", "bananas", "oranges"],
    "quantity" : [5, 3, 4, 2],
    "price" : [1.5, 3, 2.5, 4]
}

df = pd.DataFrame(data)
df

   client  product  quantity  price
0    John  bananas         5    1.5
1    John  oranges         3    3.0
2  Silvia  bananas         4    2.5
3  Silvia  oranges         2    4.0


In [17]:
df.pivot(index='client', columns='product', values='quantity')

product  bananas  oranges
client
John           5        3
Silvia         4        2


**Note**: `client` and `product` become the index and the columns respectively.
The values come from `quantity`, and the column `price` is discarded.

**Note**: in the long format, each (client, product) combination appears only
once, with a single value in the `quantity` column, so the table can be pivoted.
If a combination were missing, the corresponding cell would be set to NaN and,
by default, integer data would be converted to float.
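A small sketch of the missing-combination case: the `sparse` table below is a hypothetical variant of the example data in which Silvia bought no oranges.

```python
import pandas as pd

# The (Silvia, oranges) combination is absent from the long format.
sparse = pd.DataFrame({
    "client": ["John", "John", "Silvia"],
    "product": ["bananas", "oranges", "bananas"],
    "quantity": [5, 3, 4],
})

wide = sparse.pivot(index="client", columns="product", values="quantity")
print(wide)
print(wide.dtypes)  # the integer quantities are now stored as floats
```

The missing cell is filled with NaN, and since NaN is a floating-point value, the default integer dtype cannot hold it and the columns are upcast to float.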

In [18]:
# more column values
df.pivot(index='client', columns='product')

        quantity             price
product  bananas  oranges  bananas  oranges
client
John           5        3      1.5      3.0
Silvia         4        2      2.5      4.0


In [19]:
df.pivot(index='client', columns='product')['quantity']

product  bananas  oranges
client
John           5        3
Silvia         4        2


## `stack()` and `unstack()`

Pandas doesn’t have `set_index` for columns. A common way of adding levels to columns is to `unstack` existing levels from the index:

- `stack()` transforms the inner-most column level into the inner-most index level.
- `unstack()` performs the inverse operation: it moves the inner-most index level to the inner-most column level.

<img src="./assets/imgs/stack_unstack.webp" width="500"/>

**Note:** you can also select the level that you want to stack or unstack, using
a name `stack(<'column_level_name'>)` or the level position `stack(<n_position>)`.

**Note:** you can end up with a Series if there are no more levels to stack.

**Note:** the `stack()` and `unstack()` methods implicitly sort the index levels involved.

**Note:** The columns must not contain duplicate values to be eligible for stacking (the same applies to the index when unstacking).
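Before the larger example, here is a minimal round-trip sketch on a tiny frame. The names `small`, `row`, and `col` are made up for illustration. Depending on the pandas version, `stack()` may emit a `FutureWarning` about its implementation; the resulting shapes shown here are unaffected.

```python
import pandas as pd

small = pd.DataFrame(
    [[1, 2], [3, 4]],
    index=pd.Index(["r1", "r2"], name="row"),
    columns=pd.Index(["c1", "c2"], name="col"),
)

# Only one column level exists, so stacking it leaves a Series
# indexed by (row, col).
stacked = small.stack()

# Unstacking the inner-most index level ("col") restores the original layout.
restored = stacked.unstack()

# Unstacking by name moves the "row" level to the columns instead,
# which gives the transposed table.
by_name = stacked.unstack("row")
```

Selecting the level by name, as in `unstack("row")`, is often easier to read than a position when the index has several levels.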

In [80]:
columns = pd.MultiIndex.from_product([["population", "area"],[2010,2020]],
                                     names=["features", "year"])

index = pd.MultiIndex.from_tuples(
    [
        ("Portland", "Maine"),
        ("Portland", "Oregon"),
        ("Springfield","Illinois"),
        ("Springfield", "Oregon")
    ]
)

data = np.array([[66194, 583776, 116250, 59403], 
                  [68408, 652503, 114394, 61851], 
                  [21.31, 133.43, 59.48, 15.74], 
                  [21.54, 133.45, 61.14, 15.85]]).T

df = pd.DataFrame(data, index=index, columns=columns)
df

features             population                area
year                       2010      2020      2010      2020
Portland    Maine       66194.0   68408.0     21.31     21.54
            Oregon     583776.0  652503.0    133.43    133.45
Springfield Illinois   116250.0  114394.0     59.48     61.14
            Oregon      59403.0   61851.0     15.74     15.85


In [83]:
# year becomes the last level of the index
df.stack()

features                      area  population
                     year
Portland    Maine    2010    21.31     66194.0
                     2020    21.54     68408.0
            Oregon   2010   133.43    583776.0
                     2020   133.45    652503.0
Springfield Illinois 2010    59.48    116250.0
                     2020    61.14    114394.0
            Oregon   2010    15.74     59403.0
                     2020    15.85     61851.0


In [84]:
# stacking twice removes all the column levels and returns a Series
# with a 4-level multi-index
df.stack().stack()

                       year  features  
Portland     Maine     2010  area              21.31
                             population     66194.00
                       2020  area              21.54
                             population     68408.00
             Oregon    2010  area             133.43
                             population    583776.00
                       2020  area             133.45
                             population    652503.00
Springfield  Illinois  2010  area              59.48
                             population    116250.00
                       2020  area              61.14
                             population    114394.00
             Oregon    2010  area              15.74
                             population     59403.00
                       2020  area              15.85
                             population     61851.00
dtype: float64

In [86]:
# control the level to stack
# alternative: df.stack(0) indicating the column level
df.stack('features')


year                                  2010      2020
                     features
Portland    Maine    area            21.31     21.54
                     population    66194.0   68408.0
            Oregon   area           133.43    133.45
                     population   583776.0  652503.0
Springfield Illinois area            59.48     61.14
                     population   116250.0  114394.0
            Oregon   area            15.74     15.85
                     population    59403.0   61851.0


In [85]:
# You can also stack or unstack more than one level at the same time by passing
# a list of levels; the result is as if each level were processed individually.
df.stack(['year', 'features'])

                       year  features  
Portland     Maine     2010  area              21.31
                             population     66194.00
                       2020  area              21.54
                             population     68408.00
             Oregon    2010  area             133.43
                             population    583776.00
                       2020  area             133.45
                             population    652503.00
Springfield  Illinois  2010  area              59.48
                             population    116250.00
                       2020  area              61.14
                             population    114394.00
             Oregon    2010  area              15.74
                             population     59403.00
                       2020  area              15.85
                             population     61851.00
dtype: float64

In [91]:
# stack or unstack can result in missing values if the subgroups do not have
# the same set of labels; the missing entries are filled with NaN

df.iloc[0:-1,:].unstack()

features    population                                                        area
year              2010                          2020                          2010                          2020
              Illinois     Maine    Oregon  Illinois     Maine    Oregon  Illinois     Maine    Oregon  Illinois     Maine    Oregon
Portland           NaN   66194.0  583776.0       NaN   68408.0  652503.0       NaN     21.31    133.43       NaN     21.54    133.45
Springfield   116250.0       NaN       NaN  114394.0       NaN       NaN     59.48       NaN       NaN     61.14       NaN       NaN


In [92]:
# you can fill the values using fill_value
df.iloc[0:-1,:].unstack(fill_value=1)

# NOTE: stack doesn't have fill_value

features    population                                                        area
year              2010                          2020                          2010                          2020
              Illinois     Maine    Oregon  Illinois     Maine    Oregon  Illinois     Maine    Oregon  Illinois     Maine    Oregon
Portland           1.0   66194.0  583776.0       1.0   68408.0  652503.0       1.0     21.31    133.43       1.0     21.54    133.45
Springfield   116250.0       1.0       1.0  114394.0       1.0       1.0     59.48       1.0       1.0     61.14       1.0       1.0


## `melt()`

If you have previously reset the index of a pivot table, you can use `melt()` to unpivot it.

In practice, it takes a wide format and transforms it into the long format using

`melt(id_vars, value_vars, var_name, value_name)`

- `id_vars` indicates the columns that are used as identifiers (the ones that
became the index with pivot).
- `value_vars` indicates the columns that hold the values (the ones that
became the columns with pivot).

**Note:** without `var_name` or `value_name`, the resulting DataFrame will have
two long-format columns named `variable` and `value`. These correspond to the
`columns` and `values` arguments used in pivot.

- `var_name` sets the name of variable column
- `value_name` sets the name of value column
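As a minimal sketch, the snippet below unpivots a wide table shaped like the earlier pivot result after `reset_index()`. The `wide` frame is rebuilt by hand here so that the example stands on its own.

```python
import pandas as pd

# Wide table: one row per client, one column per product.
wide = pd.DataFrame({
    "client": ["John", "Silvia"],
    "bananas": [5, 4],
    "oranges": [3, 2],
})

# Unpivot: "client" identifies each row, the product columns become values.
long_df = wide.melt(
    id_vars="client",
    value_vars=["bananas", "oranges"],
    var_name="product",
    value_name="quantity",
)
print(long_df)

# Without var_name / value_name the default column names are used.
default_names = list(wide.melt(id_vars="client").columns)
print(default_names)
```

The result has one row per (client, product) pair, which is the long format that `pivot()` started from.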

When learning about group by, see *Combining with stats and GroupBy*.
