# Reshaping

There are four main ways to reshape a given DataFrame (or Series):

- `pivot()`
- `pivot_table()`
- `stack()` and `unstack()`
- `melt()`

These functions let us move back and forth between different shapes.

In [64]:
import pandas as pd
import numpy as np

np.random.seed(0)

In [65]:
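## Handy functions
from IPython.display import display_html, display

def display_side_by_side(*args):
    """Render several DataFrames next to each other in a notebook."""
    html_str = ''
    for df in args:
        html_str += df.to_html()
    # Patch only the opening <table> tags so the tables render inline.
    display_html(html_str.replace('<table', '<table style="display:inline"'), raw=True)

def display_several(*args):
    """Display several objects (DataFrame or Series) one below the other."""
    for df in args:
        display(df)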

## Wide and Long format

Suppose you have a variable `a` that depends on two parameters `i` and `j`. 
There are two equivalent ways to represent it as a table:

- **wide** format is more appropriate for dense data
- **long** format is more appropriate for sparse data: when some values are
zero or missing, you can omit the corresponding rows.
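
As a minimal sketch of the two layouts, assume a made-up variable `a` with parameters `i` and `j` (the names `wide_df` and `long_df` and all values are illustrative only). The wide table reserves a cell for every `(i, j)` pair, while the long table keeps one row per pair that actually has a value.

```python
import numpy as np
import pandas as pd

# Wide format: rows are i, columns are j, cells hold a (NaN = missing).
wide_df = pd.DataFrame(
    [[1.0, np.nan], [np.nan, 4.0]],
    index=pd.Index([0, 1], name="i"),
    columns=pd.Index([0, 1], name="j"),
)

# Long format: one row per (i, j) pair that has a value; missing pairs are omitted.
long_df = pd.DataFrame({"i": [0, 1], "j": [0, 1], "a": [1.0, 4.0]})

print(wide_df)
print(long_df)
```

With only two of the four cells filled, the long table needs two rows, whereas the wide table still stores all four cells.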

<img src="./assets/imgs/wide_and_large.webp" width="500"/>

## `pivot()`

`pivot()` helps us transition from a **long format** to a **wide format**. You
need to provide the following arguments:

`df.pivot(index, columns, values)`

- `index` will be the column from the long format that will be transformed into the
index in the wide format.
- `columns` will be the column from the long format that will be transformed into
the columns in the wide format.
- `values` will be the column from the long format that will be the values in 
cells of the wide format.

<img src="./assets/imgs/reshaping_pivot.png" width="500"/>

Things to take into account when pivoting:

1. **values omitted**: If `values` is omitted and there are several value columns in the long format, the result has a multi-index in the columns indicating the respective value column.
2. **index omitted**: If `index` is omitted, the existing index of the long format is used.
3. **missing values**: If an index/column pair has no value, the corresponding cell is
filled with NaN.

**NOTE:** `pivot()` raises `ValueError: Index contains duplicate entries,
cannot reshape` **if an index/column pair is not unique**. In this case,
consider using `pivot_table()`, a generalization of `pivot()` that can
handle duplicate values for one index/column pair.
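
The note above can be reproduced with a small sketch. The frame `dup` is hypothetical data in which the John/bananas pair appears twice; `pivot()` refuses to reshape it, while `pivot_table()` aggregates the duplicates (here with `"sum"`, chosen only for illustration).

```python
import pandas as pd

# Hypothetical data: the John/bananas pair appears twice.
dup = pd.DataFrame({
    "client": ["John", "John", "Silvia"],
    "product": ["bananas", "bananas", "oranges"],
    "quantity": [3, 5, 2],
})

try:
    dup.pivot(index="client", columns="product", values="quantity")
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print(f"pivot() failed: {err}")

# pivot_table() aggregates the duplicated pair instead of failing.
summed = dup.pivot_table(index="client", columns="product",
                         values="quantity", aggfunc="sum")
print(summed)
```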

In [66]:
data = {
    "client" : ["John", "John", "Silvia", "Silvia"],
    "product" : ["bananas", "oranges", "bananas", "oranges"],
    "quantity" : [5, 3, 4, 2],
    "price" : [1.5, 3, 2.5, 4]
}

df = pd.DataFrame(data)
df

,client,product,quantity,price
0,John,bananas,5,1.5
1,John,oranges,3,3.0
2,Silvia,bananas,4,2.5
3,Silvia,oranges,2,4.0


In [67]:
# NOTE: `client` and `product` become the index and columns respectively.
# The values come from `quantity`; the `price` column is discarded.
result = df.pivot(index='client', columns='product', values='quantity')
display_side_by_side(df, result)

,client,product,quantity,price
0,John,bananas,5,1.5
1,John,oranges,3,3.0
2,Silvia,bananas,4,2.5
3,Silvia,oranges,2,4.0

product,bananas,oranges
client,,
John,5,3
Silvia,4,2


In [68]:
# 1. values omitted
result = df.pivot(index='client', columns='product')
display_side_by_side(df, result)

,client,product,quantity,price
0,John,bananas,5,1.5
1,John,oranges,3,3.0
2,Silvia,bananas,4,2.5
3,Silvia,oranges,2,4.0

,quantity,quantity,price,price
product,bananas,oranges,bananas,oranges
client,,,,
John,5,3,1.5,3.0
Silvia,4,2,2.5,4.0


In [69]:
# 2. index omitted
result = df.pivot(columns='product', values='quantity')
display_side_by_side(df, result)

,client,product,quantity,price
0,John,bananas,5,1.5
1,John,oranges,3,3.0
2,Silvia,bananas,4,2.5
3,Silvia,oranges,2,4.0

product,bananas,oranges
0,5.0,
1,,3.0
2,4.0,
3,,2.0


In [70]:
# 3. missing values (missing Silvia/oranges pair)
result = df.iloc[:-1, :].pivot(index='client', columns='product', values='quantity')
display_side_by_side(df.iloc[:-1, :], result)

,client,product,quantity,price
0,John,bananas,5,1.5
1,John,oranges,3,3.0
2,Silvia,bananas,4,2.5

product,bananas,oranges
client,,
John,5.0,3.0
Silvia,4.0,


## `pivot_table()`

The practice of grouping values and then pivoting the results is so common that `groupby` and `pivot` have been bundled together into a dedicated function (and a corresponding DataFrame method), `pivot_table`. It is especially useful when there are multiple values for an index/column pair.

`pivot_table(index, columns, values, aggfunc)`

- without the `columns` argument, it behaves similarly to `groupby`;
- when there are no duplicate rows to group by, it works just like `pivot`;
- otherwise, it does grouping and pivoting.
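
The first bullet can be checked with a short sketch. The frame `sales` is made-up data, and `"sum"` is just one possible aggregation; without `columns`, `pivot_table()` produces the same per-client totals as a plain `groupby`.

```python
import pandas as pd

# Hypothetical sales data with repeated clients.
sales = pd.DataFrame({
    "client": ["John", "Silvia", "Silvia", "John"],
    "quantity": [3, 2, 4, 5],
})

# No `columns` argument: the result is indexed by client only.
by_pivot_table = sales.pivot_table(index="client", values="quantity", aggfunc="sum")

# The equivalent groupby aggregation.
by_groupby = sales.groupby("client")[["quantity"]].sum()

print(by_pivot_table)
print(by_groupby)
```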

<img src="./assets/imgs/pivot_table.webp" width="500"/>

**NOTE:** `pivot_table()` works well when there are multiple values for an index/column pair. Conceptually, it performs a `groupby(<index/column names>).agg(<some agg func>).reset_index()` and then pivots the result.

Things to take into account when using `pivot_table()`:

1. You can add margins with `margins=True`, which adds an `All` row and column holding the aggregate per column, per row, and overall (the totals when `aggfunc="sum"`).
2. You can use `fill_value` to replace the NaN values in the pivot table.
3. As with `pivot()`, if you omit `values`, all the remaining columns are used as values.
4. The default `aggfunc` is `"mean"`.
5. You can pass a list of column names for `index`, `columns`, or `values`.

In [71]:
data = {
    "client" : ["John", "Silvia", "Silvia", "John", "Silvia"],
    "product" : ["bananas", "oranges", "bananas", "bananas", "oranges"],
    "quantity" : [3, 2, 4, 5, 7],
    "price" : [1.5, 3, 2.5, 1.5, 3]
}
df = pd.DataFrame(data)

# NOTE: multiple values for John/bananas and Silvia/oranges (index/column pairs)
df

,client,product,quantity,price
0,John,bananas,3,1.5
1,Silvia,oranges,2,3.0
2,Silvia,bananas,4,2.5
3,John,bananas,5,1.5
4,Silvia,oranges,7,3.0


In [72]:
# using groupby() and pivot()

aggregation = df.groupby(["client", "product"]).sum().reset_index()
result = aggregation.pivot(index="client", columns="product", values="quantity")
display_side_by_side(df, aggregation, result)

,client,product,quantity,price
0,John,bananas,3,1.5
1,Silvia,oranges,2,3.0
2,Silvia,bananas,4,2.5
3,John,bananas,5,1.5
4,Silvia,oranges,7,3.0

,client,product,quantity,price
0,John,bananas,8,3.0
1,Silvia,bananas,4,2.5
2,Silvia,oranges,9,6.0

product,bananas,oranges
client,,
John,8.0,
Silvia,4.0,9.0


In [73]:
# using pivot_table()
result = df.pivot_table(index="client", columns="product", values="quantity", aggfunc="sum")
display_side_by_side(df, result)

,client,product,quantity,price
0,John,bananas,3,1.5
1,Silvia,oranges,2,3.0
2,Silvia,bananas,4,2.5
3,John,bananas,5,1.5
4,Silvia,oranges,7,3.0

product,bananas,oranges
client,,
John,8.0,
Silvia,4.0,9.0


In [74]:
# 1. adding margins
df.pivot_table(index="client", 
               columns="product", 
               values="quantity", 
               aggfunc="sum", 
               margins=True)

product,bananas,oranges,All
client,,,
John,8.0,,8
Silvia,4.0,9.0,13
All,12.0,9.0,21


In [75]:
# 2. filling nans with fill_value
df.pivot_table(index="client", 
               columns="product", 
               values="quantity", 
               aggfunc="sum", 
               fill_value= 0)

product,bananas,oranges
client,,
John,8,0
Silvia,4,9


In [76]:
# 3. values not specified -> all the remaining columns are used as values
df.pivot_table(index="client", 
               columns="product", 
               aggfunc="sum")

,price,price,quantity,quantity
product,bananas,oranges,bananas,oranges
client,,,,
John,3.0,,8.0,
Silvia,2.5,6.0,4.0,9.0


In [77]:
# 4. aggfunc by default is mean
df.pivot_table(index="client", 
               columns="product", 
               values="quantity")

product,bananas,oranges
client,,
John,4.0,
Silvia,4.0,4.5


In [78]:
data = {
    "client" : ["John", "Silvia", "Silvia", "John", "Silvia", "Silvia", "Jhon"],
    "lastname" : ["Smith", "Smith", "Clark", "Wick", "Clark", "Wick", "Wick"],
    "product" : ["bananas", "oranges", "bananas", "bananas", "oranges", "oranges", "apples"],
    "quantity" : np.random.randint(10, size=7),
    "price" : np.random.randint(10, size=7)
}
df = pd.DataFrame(data)

df

,client,lastname,product,quantity,price
0,John,Smith,bananas,5,5
1,Silvia,Smith,oranges,0,2
2,Silvia,Clark,bananas,3,4
3,John,Wick,bananas,3,7
4,Silvia,Clark,oranges,7,6
5,Silvia,Wick,oranges,9,8
6,Jhon,Wick,apples,3,8


In [79]:
# 5. use multiple columns for index, columns or values in pivot_table()

df.pivot_table(index = ["client", "lastname"],
         columns = "product",
         values = "quantity",
         aggfunc = "sum",
         fill_value=0)

,product,apples,bananas,oranges
client,lastname,,,
Jhon,Wick,3,0,0
John,Smith,0,5,0
John,Wick,0,3,0
Silvia,Clark,0,3,7
Silvia,Smith,0,0,0
Silvia,Wick,0,0,9


In [80]:
df.pivot_table(index = "product",
         columns = ["client", "lastname"],
         values = "price",
         aggfunc = "sum",
         fill_value=0)

client,Jhon,John,John,Silvia,Silvia,Silvia
lastname,Wick,Smith,Wick,Clark,Smith,Wick
product,,,,,,
apples,8,0,0,0,0,0
bananas,0,5,7,4,0,0
oranges,0,0,0,6,2,8


In [81]:
df.pivot_table(index = "product",
         columns = "lastname",
         values = ["quantity", "price"],
         aggfunc = "sum",
         fill_value=0)

,price,price,price,quantity,quantity,quantity
lastname,Clark,Smith,Wick,Clark,Smith,Wick
product,,,,,,
apples,0,0,8,0,0,3
bananas,4,5,7,3,5,3
oranges,6,2,8,7,0,9


## `stack()` and `unstack()`

Pandas doesn’t have `set_index` for columns. A common way of adding levels to columns is to `unstack` existing levels from the index:

- `stack()` moves the innermost column level to the innermost index level.
- `unstack()` does the inverse: it moves the innermost index level to the innermost column level.
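
As a minimal sketch of the two operations undoing each other, consider a made-up two-by-two table (the names `wide`, `stacked`, and `restored` are illustrative only). Stacking its single column level leaves no columns, so the result is a Series with a two-level index; unstacking brings the labels back as columns.

```python
import pandas as pd

# Hypothetical wide table: one row per client, one column per product.
wide = pd.DataFrame(
    {"bananas": [5, 4], "oranges": [3, 2]},
    index=pd.Index(["John", "Silvia"], name="client"),
)

stacked = wide.stack()        # column labels become the innermost index level
restored = stacked.unstack()  # innermost index level goes back to the columns

print(stacked)
print(restored)
```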

<img src="./assets/imgs/stack_unstack.webp" width="500"/>

Things to take into account when stacking:

1. **Control the level to stack (or unstack)**: you can select the level that you want to stack or unstack, either by name, `stack(<'column_level_name'>)`, or by position, `stack(<n_position>)`.
2. **Stacking multiple levels**: you can stack or unstack more than one level at once by passing a list of levels; the result is as if each level were processed individually.
3. **Ending with a Series**: you end up with a Series if there are no more column levels left to stack.
4. **Missing values**: `stack` or `unstack` can produce missing values when subgroups do not share the same set of labels; those cells are filled with NaN. You can replace them using `fill_value=<number>`, but only in `unstack`.


**Note:** the `stack()` and `unstack()` methods implicitly sort the index levels involved.

**Note:** The columns index must not contain duplicate values to be eligible for stacking (the same applies to the index when unstacking).
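
The unstacking half of this note can be illustrated with a short sketch. The Series `s` is hypothetical and deliberately repeats the index pair `("a", "x")`; aggregating the duplicates first (with `sum`, chosen only for illustration) is one possible way to make the index unique again.

```python
import pandas as pd

# Hypothetical Series whose index contains the pair ("a", "x") twice.
idx = pd.MultiIndex.from_tuples(
    [("a", "x"), ("a", "x"), ("b", "y")],
    names=["outer", "inner"],
)
s = pd.Series([1, 2, 3], index=idx)

try:
    s.unstack()
    unstack_failed = False
except ValueError as err:
    unstack_failed = True
    print(f"unstack() failed: {err}")

# Aggregating the duplicated pairs makes the index unique, so unstack() works.
deduplicated = s.groupby(level=["outer", "inner"]).sum()
unstacked = deduplicated.unstack()
print(unstacked)
```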

In [82]:
columns = pd.MultiIndex.from_product([["population", "area"],[2010,2020]],
                                     names=["features", "year"])

index = pd.MultiIndex.from_tuples(
    [
        ("Portland", "Maine"),
        ("Portland", "Oregon"),
        ("Springfield","Illinois"),
        ("Springfield", "Oregon")
    ],
    names=["city", "region"]
)

data = np.array([[66194, 583776, 116250, 59403], 
                  [68408, 652503, 114394, 61851], 
                  [21.31, 133.43, 59.48, 15.74], 
                  [21.54, 133.45, 61.14, 15.85]]).T

df = pd.DataFrame(data, index=index, columns=columns)
df

,features,population,population,area,area
,year,2010,2020,2010,2020
city,region,,,,
Portland,Maine,66194.0,68408.0,21.31,21.54
Portland,Oregon,583776.0,652503.0,133.43,133.45
Springfield,Illinois,116250.0,114394.0,59.48,61.14
Springfield,Oregon,59403.0,61851.0,15.74,15.85


In [83]:
# year becomes the last level of the index
result = df.stack()
display_side_by_side(df, result)

,features,population,population,area,area
,year,2010,2020,2010,2020
city,region,,,,
Portland,Maine,66194.0,68408.0,21.31,21.54
Portland,Oregon,583776.0,652503.0,133.43,133.45
Springfield,Illinois,116250.0,114394.0,59.48,61.14
Springfield,Oregon,59403.0,61851.0,15.74,15.85

,,features,area,population
city,region,year,,
Portland,Maine,2010,21.31,66194.0
Portland,Maine,2020,21.54,68408.0
Portland,Oregon,2010,133.43,583776.0
Portland,Oregon,2020,133.45,652503.0
Springfield,Illinois,2010,59.48,116250.0
Springfield,Illinois,2020,61.14,114394.0
Springfield,Oregon,2010,15.74,59403.0
Springfield,Oregon,2020,15.85,61851.0


In [84]:
# 1. control the level to stack (or unstack)
# alternative: df.stack(0) indicating the column level
result = df.stack('features')
display_side_by_side(df, result)

,features,population,population,area,area
,year,2010,2020,2010,2020
city,region,,,,
Portland,Maine,66194.0,68408.0,21.31,21.54
Portland,Oregon,583776.0,652503.0,133.43,133.45
Springfield,Illinois,116250.0,114394.0,59.48,61.14
Springfield,Oregon,59403.0,61851.0,15.74,15.85

,,year,2010,2020
city,region,features,,
Portland,Maine,area,21.31,21.54
Portland,Maine,population,66194.0,68408.0
Portland,Oregon,area,133.43,133.45
Portland,Oregon,population,583776.0,652503.0
Springfield,Illinois,area,59.48,61.14
Springfield,Illinois,population,116250.0,114394.0
Springfield,Oregon,area,15.74,15.85
Springfield,Oregon,population,59403.0,61851.0


In [85]:
# 2. Stack (or unstack) multiple levels at once
result = df.stack(['year', 'features'])
display_several(df, result)

,features,population,population,area,area
,year,2010,2020,2010,2020
city,region,,,,
Portland,Maine,66194.0,68408.0,21.31,21.54
Portland,Oregon,583776.0,652503.0,133.43,133.45
Springfield,Illinois,116250.0,114394.0,59.48,61.14
Springfield,Oregon,59403.0,61851.0,15.74,15.85


city         region    year  features  
Portland     Maine     2010  area              21.31
                             population     66194.00
                       2020  area              21.54
                             population     68408.00
             Oregon    2010  area             133.43
                             population    583776.00
                       2020  area             133.45
                             population    652503.00
Springfield  Illinois  2010  area              59.48
                             population    116250.00
                       2020  area              61.14
                             population    114394.00
             Oregon    2010  area              15.74
                             population     59403.00
                       2020  area              15.85
                             population     61851.00
dtype: float64

In [86]:
# 3. Ending with a Series.
# NOTE: the previous example can also be done with two consecutive stacks
result = df.stack().stack()
display_several(df, result)

,features,population,population,area,area
,year,2010,2020,2010,2020
city,region,,,,
Portland,Maine,66194.0,68408.0,21.31,21.54
Portland,Oregon,583776.0,652503.0,133.43,133.45
Springfield,Illinois,116250.0,114394.0,59.48,61.14
Springfield,Oregon,59403.0,61851.0,15.74,15.85


city         region    year  features  
Portland     Maine     2010  area              21.31
                             population     66194.00
                       2020  area              21.54
                             population     68408.00
             Oregon    2010  area             133.43
                             population    583776.00
                       2020  area             133.45
                             population    652503.00
Springfield  Illinois  2010  area              59.48
                             population    116250.00
                       2020  area              61.14
                             population    114394.00
             Oregon    2010  area              15.74
                             population     59403.00
                       2020  area              15.85
                             population     61851.00
dtype: float64

In [87]:
# 4. Missing Values are filled with NaN

result = df.iloc[0:-1,:].unstack()
display_side_by_side(df.iloc[0:-1,:], result)


,features,population,population,area,area
,year,2010,2020,2010,2020
city,region,,,,
Portland,Maine,66194.0,68408.0,21.31,21.54
Portland,Oregon,583776.0,652503.0,133.43,133.45
Springfield,Illinois,116250.0,114394.0,59.48,61.14

features,population,population,population,population,population,population,area,area,area,area,area,area
year,2010,2010,2010,2020,2020,2020,2010,2010,2010,2020,2020,2020
region,Illinois,Maine,Oregon,Illinois,Maine,Oregon,Illinois,Maine,Oregon,Illinois,Maine,Oregon
city,,,,,,,,,,,,
Portland,,66194.0,583776.0,,68408.0,652503.0,,21.31,133.43,,21.54,133.45
Springfield,116250.0,,,114394.0,,,59.48,,,61.14,,


In [88]:
# You can fill the values using fill_value
# NOTE: stack doesn't have fill_value

result = df.iloc[0:-1,:].unstack(fill_value=1)
display_side_by_side(df.iloc[0:-1,:], result)

,features,population,population,area,area
,year,2010,2020,2010,2020
city,region,,,,
Portland,Maine,66194.0,68408.0,21.31,21.54
Portland,Oregon,583776.0,652503.0,133.43,133.45
Springfield,Illinois,116250.0,114394.0,59.48,61.14

features,population,population,population,population,population,population,area,area,area,area,area,area
year,2010,2010,2010,2020,2020,2020,2010,2010,2010,2020,2020,2020
region,Illinois,Maine,Oregon,Illinois,Maine,Oregon,Illinois,Maine,Oregon,Illinois,Maine,Oregon
city,,,,,,,,,,,,
Portland,1.0,66194.0,583776.0,1.0,68408.0,652503.0,1.0,21.31,133.43,1.0,21.54,133.45
Springfield,116250.0,1.0,1.0,114394.0,1.0,1.0,59.48,1.0,1.0,61.14,1.0,1.0


### Use `stack()` to unpivot

You can use `stack()` together with `reset_index()` to unpivot a wide-format
table, but keep in mind that this is not its main purpose.


In [89]:
data = {
    "client" : ["John", "John", "Silvia", "Silvia"],
    "product" : ["bananas", "oranges", "bananas", "oranges"],
    "quantity" : [5, 3, 4, 2],
    "price" : [1.5, 3, 2.5, 4]
}

df = pd.DataFrame(data)
df = df.pivot(index="client", columns="product", values="quantity")
df

product,bananas,oranges
client,,
John,5,3
Silvia,4,2


In [90]:
df.stack().reset_index()

,client,product,0
0,John,bananas,5
1,John,oranges,3
2,Silvia,bananas,4
3,Silvia,oranges,2


In [91]:
df = df.stack() 
df.reset_index(name = "quantity")

,client,product,quantity
0,John,bananas,5
1,John,oranges,3
2,Silvia,bananas,4
3,Silvia,oranges,2


## `melt()`

`melt()` takes a wide format and transforms it into the long format. If the data comes from a pivot table, we first need to **reset the index**, because `melt()` works with columns.


`melt(id_vars, value_vars, var_name, value_name)`

- `id_vars` indicates the columns that are used as identifiers (which were transformed to index with pivot).
- `value_vars` indicates the columns to unpivot (which were transformed to columns with pivot). If not specified, uses all columns that are not set as `id_vars`.
- `var_name` sets the name of the variable column (default: `variable`).
- `value_name` sets the name of the value column (default: `value`).


**Note:** the difference from a pure unpivot is that you can choose any
number of `id_vars`, which are **taken from the columns, not the index**. You can also
restrict `value_vars` to perform a kind of filtering.
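
A small sketch of this flexibility, using a made-up table with two identifier columns (`client` and the hypothetical `city`): both are passed as `id_vars`, and `value_vars` keeps only the `oranges` column.

```python
import pandas as pd

# Hypothetical wide table with two identifier columns.
wide = pd.DataFrame({
    "client": ["John", "Silvia"],
    "city": ["Lima", "Quito"],
    "bananas": [5, 3],
    "oranges": [4, 2],
})

# Two id_vars are repeated on every row; value_vars filters the columns to unpivot.
long_df = wide.melt(
    id_vars=["client", "city"],
    value_vars=["oranges"],
    var_name="product",
    value_name="quantity",
)
print(long_df)
```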

<img src="./assets/imgs/reshaping_melt.png" width="500"/>

**Note:** unpivoting is the main purpose of `melt()`, in contrast
to `stack()`. Although `stack()` can be used to unpivot, that is not its main
purpose.

**Note:** `melt()` ignores the index by default. Use `ignore_index=False`
to keep it.
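
The effect of `ignore_index` can be seen in a short sketch on a made-up table indexed by client (the names `dropped` and `kept` are illustrative). By default the result gets a fresh integer index; with `ignore_index=False` the original index labels are repeated for each melted column.

```python
import pandas as pd

# Hypothetical wide table whose index holds the client names.
wide = pd.DataFrame(
    {"bananas": [5, 3], "oranges": [4, 2]},
    index=pd.Index(["John", "Silvia"], name="client"),
)

# Default: the client index is discarded.
dropped = wide.melt(var_name="product", value_name="quantity")

# ignore_index=False: the client labels are kept, once per melted column.
kept = wide.melt(var_name="product", value_name="quantity", ignore_index=False)

print(dropped.index.tolist())
print(kept.index.tolist())
```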

In [92]:
data = { "bananas" : [5,3],
         "oranges" : [4, 2]}

index = pd.Index(["John", "Silvia"], name="client")

df = pd.DataFrame(data, index=index)
df

# Note: this is the wide format version

,bananas,oranges
client,,
John,5,4
Silvia,3,2


In [93]:
# We need to reset the index because melt() works on columns
df = df.reset_index()
df

,client,bananas,oranges
0,John,5,4
1,Silvia,3,2


In [94]:
# We get the long format
df.melt(id_vars=["client"])

,client,variable,value
0,John,bananas,5
1,Silvia,bananas,3
2,John,oranges,4
3,Silvia,oranges,2


In [95]:
# setting name to the new columns
df.melt(id_vars=["client"], var_name="product", value_name="quantity")

,client,product,quantity
0,John,bananas,5
1,Silvia,bananas,3
2,John,oranges,4
3,Silvia,oranges,2


In [96]:
# we can also select which value vars to use to perform a kind of filtering
df.melt(id_vars=["client"], value_vars=["bananas"], var_name="product", value_name="quantity")

,client,product,quantity
0,John,bananas,5
1,Silvia,bananas,3


# Use Cases for reshaping

### Combining with stats and groupby

It is possible to combine `pivot()`, `pivot_table()`, `stack()`, `unstack()`, and `melt()` with statistics and `groupby()` to filter and summarize data in different ways.

In [97]:
columns = pd.MultiIndex.from_product([["A", "B"], ["cat", "dog"]]
                                     , names=["letter", "animal"])

index = pd.MultiIndex.from_product([["bar", "baz", "foo", "qux"], ["one", "two"] ])

df = pd.DataFrame(
    data = np.random.randint(20, size=(8,4)),
    index= index,
    columns= columns
)
df

,letter,A,A,B,B
,animal,cat,dog,cat,dog
bar,one,12,1,6,7
bar,two,14,17,5,13
baz,one,8,9,19,16
baz,two,19,5,15,15
foo,one,0,18,3,17
foo,two,19,19,19,14
qux,one,7,0,1,9
qux,two,0,10,3,11


In [98]:
# Example 1: get the mean of letters A and B for cat and dog separately

# NOTE: you can get the same result in different ways
# NOTE: groupby(axis=1) is deprecated in recent pandas versions, so we
# transpose, group the rows by the `animal` level, and transpose back

result1 = df.T.groupby(level=1).mean().T
result2 = df.stack().mean(1).unstack()
display_side_by_side(df, result1, result2)

,letter,A,A,B,B
,animal,cat,dog,cat,dog
bar,one,12,1,6,7
bar,two,14,17,5,13
baz,one,8,9,19,16
baz,two,19,5,15,15
foo,one,0,18,3,17
foo,two,19,19,19,14
qux,one,7,0,1,9
qux,two,0,10,3,11

,animal,cat,dog
bar,one,9.0,4.0
bar,two,9.5,15.0
baz,one,13.5,12.5
baz,two,17.0,10.0
foo,one,1.5,17.5
foo,two,19.0,16.5
qux,one,4.0,4.5
qux,two,1.5,10.5

,animal,cat,dog
bar,one,9.0,4.0
bar,two,9.5,15.0
baz,one,13.5,12.5
baz,two,17.0,10.0
foo,one,1.5,17.5
foo,two,19.0,16.5
qux,one,4.0,4.5
qux,two,1.5,10.5


In [99]:
# Example 2: get the mean for each (letter, animal) pair
result1 = df.stack().groupby(level="animal").mean()
result2 = df.mean().unstack(0)
display_side_by_side(df, result1, result2)

,letter,A,A,B,B
,animal,cat,dog,cat,dog
bar,one,12,1,6,7
bar,two,14,17,5,13
baz,one,8,9,19,16
baz,two,19,5,15,15
foo,one,0,18,3,17
foo,two,19,19,19,14
qux,one,7,0,1,9
qux,two,0,10,3,11

letter,A,B
animal,,
cat,9.875,8.875
dog,9.875,12.75

letter,A,B
animal,,
cat,9.875,8.875
dog,9.875,12.75
