Link to Medium blog post: https://towardsdatascience.com/3-easy-ways-to-reshape-pandas-dataframe-5b2cbe73d60e

Data comes in different shapes and sizes. As professionals working with data, we often need to reshape the data to a form that is more suitable for the task at hand. In this post, we will look at 3 simple ways to reshape a DataFrame.

## 1. Transform wide to long format with melt()

In [12]:
import numpy as np
import pandas as pd
from seaborn import load_dataset
# Load sample data
wide = load_dataset('penguins')\
        .drop(columns=['sex', 'island'])\
        .sample(n=3, random_state=1).sort_index()\
        .reset_index().rename(columns={'index': 'id'})
wide

,id,species,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
0,291,Gentoo,46.4,15.6,221.0,5000.0
1,306,Gentoo,43.4,14.4,218.0,4600.0
2,341,Gentoo,50.4,15.7,222.0,5750.0


We can reshape the data to a long format with stack() like this:

In [13]:
long = wide.set_index('id').stack().to_frame().reset_index()\
           .rename(columns={'level_1': 'variable', 0: 'value'})
long

,id,variable,value
0,291,species,Gentoo
1,291,bill_length_mm,46.4
2,291,bill_depth_mm,15.6
3,291,flipper_length_mm,221.0
4,291,body_mass_g,5000.0
5,306,species,Gentoo
6,306,bill_length_mm,43.4
7,306,bill_depth_mm,14.4
8,306,flipper_length_mm,218.0
9,306,body_mass_g,4600.0


It gets the job done, but it is verbose and not very elegant. Luckily, melt() makes the same transformation easy:



In [14]:
long = wide.melt(id_vars='id')
long

,id,variable,value
0,291,species,Gentoo
1,306,species,Gentoo
2,341,species,Gentoo
3,291,bill_length_mm,46.4
4,306,bill_length_mm,43.4
5,341,bill_length_mm,50.4
6,291,bill_depth_mm,15.6
7,306,bill_depth_mm,14.4
8,341,bill_depth_mm,15.7
9,291,flipper_length_mm,221.0


Voila! It’s quite simple, isn’t it? Note that wide.melt(id_vars='id') can also be written as pd.melt(wide, id_vars='id').
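
As a quick check of that equivalence, here is a minimal sketch. It retypes the three sample rows by hand so that it runs without seaborn; the name `wide_demo` is only for illustration.

```python
import pandas as pd

# Hand-typed copy of the three sampled rows shown earlier
# (entered manually so the sketch does not depend on seaborn)
wide_demo = pd.DataFrame({
    'id': [291, 306, 341],
    'species': ['Gentoo', 'Gentoo', 'Gentoo'],
    'bill_length_mm': [46.4, 43.4, 50.4],
    'bill_depth_mm': [15.6, 14.4, 15.7],
    'flipper_length_mm': [221.0, 218.0, 222.0],
    'body_mass_g': [5000.0, 4600.0, 5750.0],
})

method_style = wide_demo.melt(id_vars='id')         # DataFrame method
function_style = pd.melt(wide_demo, id_vars='id')   # top-level function

# Compare the two results and show the shape of the long table
# (3 ids x 5 melted columns)
print(method_style.equals(function_style))
print(method_style.shape)
```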

It’s always important to apply what we learn to consolidate our knowledge. One of my favourite practical applications of melt(), which you may also find useful, is reformatting a correlation matrix. Although we only have three records in wide, let’s build a correlation table to illustrate the idea:

In [17]:
corr = wide.drop(columns=['id', 'species']).corr()
corr

,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
bill_length_mm,1.0,0.859391,0.93472,0.995803
bill_depth_mm,0.859391,1.0,0.985005,0.808989
flipper_length_mm,0.93472,0.985005,1.0,0.898272
body_mass_g,0.995803,0.808989,0.898272,1.0


This format is useful as we can turn the matrix into heatmaps to visualise the correlations. But often the matrix or the heatmap is not enough if you want to drill into specifics and find variables whose correlations are above a certain threshold. Turning the matrix into a long format makes that task a whole lot easier:

In [18]:
corr.reset_index().melt(id_vars='index')

,index,variable,value
0,bill_length_mm,bill_length_mm,1.0
1,bill_depth_mm,bill_length_mm,0.859391
2,flipper_length_mm,bill_length_mm,0.93472
3,body_mass_g,bill_length_mm,0.995803
4,bill_length_mm,bill_depth_mm,0.859391
5,bill_depth_mm,bill_depth_mm,1.0
6,flipper_length_mm,bill_depth_mm,0.985005
7,body_mass_g,bill_depth_mm,0.808989
8,bill_length_mm,flipper_length_mm,0.93472
9,bill_depth_mm,flipper_length_mm,0.985005


Now with this long data, we can easily filter by ‘value’ to find correlations within a desired range. We will format the data a bit more and keep the correlations between 0.9 and 1 (inclusive):

In [29]:
corr.reset_index().melt(id_vars='index')\
    .rename(columns={'index': 'variable1', 
                     'variable': 'variable2', 
                     'value': 'correlation'})\
    .sort_values('correlation', ascending=False)\
    .query('correlation.between(.9,1)', 
           engine='python') # workaround: the default engine may fail on method calls like between()

,variable1,variable2,correlation
0,bill_length_mm,bill_length_mm,1.0
5,bill_depth_mm,bill_depth_mm,1.0
10,flipper_length_mm,flipper_length_mm,1.0
15,body_mass_g,body_mass_g,1.0
3,body_mass_g,bill_length_mm,0.995803
12,bill_length_mm,body_mass_g,0.995803
6,flipper_length_mm,bill_depth_mm,0.985005
9,bill_depth_mm,flipper_length_mm,0.985005
2,flipper_length_mm,bill_length_mm,0.93472
8,bill_length_mm,flipper_length_mm,0.93472
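
A possible variation, sketched here on hand-typed values from the sample rows: melt() accepts var_name and value_name, which avoids part of the rename step, and a plain boolean mask can replace query() where the engine workaround is unwanted. The names `measurements`, `corr_demo`, `long_corr` and `strong` are illustrative only.

```python
import pandas as pd

# Hand-typed numeric columns from the three sample rows
measurements = pd.DataFrame({
    'bill_length_mm': [46.4, 43.4, 50.4],
    'bill_depth_mm': [15.6, 14.4, 15.7],
    'flipper_length_mm': [221.0, 218.0, 222.0],
    'body_mass_g': [5000.0, 4600.0, 5750.0],
})
corr_demo = measurements.corr()

# var_name and value_name label the melted columns directly
long_corr = (corr_demo.reset_index()
                      .melt(id_vars='index', var_name='variable2',
                            value_name='correlation')
                      .rename(columns={'index': 'variable1'}))

# A boolean mask filters the same range without going through query()
strong = (long_corr[long_corr['correlation'].between(0.9, 1)]
          .sort_values('correlation', ascending=False))
print(strong)
```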


## 2. Transform long to wide format with pivot()

On the other hand, sometimes the data comes in a long format and we need to reshape it to a wide format. Let’s now do the opposite of what we did previously. As in the previous section, we will start with the more verbose route, this time using unstack():

In [30]:
long.set_index(['id', 'variable']).unstack()

,value,value,value,value,value
variable,bill_depth_mm,bill_length_mm,body_mass_g,flipper_length_mm,species
id,,,,,
291,15.6,46.4,5000.0,221.0,Gentoo
306,14.4,43.4,4600.0,218.0,Gentoo
341,15.7,50.4,5750.0,222.0,Gentoo


The same transformation can be done using pivot() as below:

In [31]:
long.pivot(index='id', columns='variable', values='value')

variable,bill_depth_mm,bill_length_mm,body_mass_g,flipper_length_mm,species
id,,,,,
291,15.6,46.4,5000.0,221.0,Gentoo
306,14.4,43.4,4600.0,218.0,Gentoo
341,15.7,50.4,5750.0,222.0,Gentoo


This is not necessarily more concise, but it is probably a little easier to work with than unstack(). By now, you have probably noticed that melt() is to pivot() as stack() is to unstack().
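
To make the melt()/pivot() pairing concrete, here is a minimal round-trip sketch on two hand-typed numeric columns from the sample rows. The names `wide_demo`, `long_demo` and `back` are illustrative, and the tidy-up steps after pivot() are one way to restore the original layout, not the only one.

```python
import pandas as pd

# Two numeric columns from the sample rows, typed by hand
wide_demo = pd.DataFrame({
    'id': [291, 306, 341],
    'bill_length_mm': [46.4, 43.4, 50.4],
    'bill_depth_mm': [15.6, 14.4, 15.7],
})

# Wide -> long
long_demo = wide_demo.melt(id_vars='id')

# Long -> wide again; pivot() leaves 'id' in the index and names the
# column axis 'variable', so both are tidied up afterwards
back = (long_demo.pivot(index='id', columns='variable', values='value')
                 .reset_index()
                 .rename_axis(columns=None))

# pivot() orders the new columns alphabetically; restore the original order
back = back[wide_demo.columns]
print(back.equals(wide_demo))
```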

One practical use for reshaping to wide format is data stored in an Entity-Attribute-Value (a.k.a. EAV) format like this:

In [32]:
eav = pd.DataFrame({'entity': np.repeat([10,25,37, 49], 2),
                    'attribute': ['name', 'age']*4,
                    'value': ['Anna', 30, 'Jane', 40, 
                              'John', 20, 'Jim', 50]})
eav

,entity,attribute,value
0,10,name,Anna
1,10,age,30
2,25,name,Jane
3,25,age,40
4,37,name,John
5,37,age,20
6,49,name,Jim
7,49,age,50


Reshaping the data into a format where each row represents an entity (e.g. a customer) can be done using pivot():

In [33]:
eav.pivot(index='entity', columns='attribute', values='value')

attribute,age,name
entity,,
10,30,Anna
25,40,Jane
37,20,John
49,50,Jim
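
One thing to watch for after pivoting EAV data: because the value column mixes strings and numbers, the pivoted columns come out with the generic object dtype. A minimal sketch of one way to restore a numeric type follows; the names `eav_demo` and `people` are illustrative, and astype() is just one possible conversion.

```python
import numpy as np
import pandas as pd

eav_demo = pd.DataFrame({'entity': np.repeat([10, 25, 37, 49], 2),
                         'attribute': ['name', 'age'] * 4,
                         'value': ['Anna', 30, 'Jane', 40,
                                   'John', 20, 'Jim', 50]})

people = eav_demo.pivot(index='entity', columns='attribute', values='value')
print(people.dtypes)  # inspect the dtypes produced by pivot()

# Convert 'age' to integers so that numeric operations behave as expected
people = people.astype({'age': 'int64'})
print(people['age'].mean())
```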


## 3. Transform wide to long format with wide_to_long()

We learned how to reshape from wide to long with melt(). But in some instances, the wide_to_long() function makes reshaping easier than melt() does. Here’s one example:

In [34]:
pop = pd.DataFrame({'country':['Monaco', 'Liechtenstein', 
                               'San Marino'],         
                   'population_2016' : [38070, 37658, 33504],
                   'population_2017' : [38392, 37800, 33671],
                   'population_2018' : [38682, 37910, 33785]})
pop

,country,population_2016,population_2017,population_2018
0,Monaco,38070,38392,38682
1,Liechtenstein,37658,37800,37910
2,San Marino,33504,33671,33785


Using melt(), we can reshape the data and format it as follows:

In [35]:
new = pop.melt(id_vars='country')\
         .rename(columns={'variable': 'year', 
                          'value': 'population'})
new['year'] = new['year'].str.replace('population_', '')
new

,country,year,population
0,Monaco,2016,38070
1,Liechtenstein,2016,37658
2,San Marino,2016,33504
3,Monaco,2017,38392
4,Liechtenstein,2017,37800
5,San Marino,2017,33671
6,Monaco,2018,38682
7,Liechtenstein,2018,37910
8,San Marino,2018,33785


With wide_to_long(), it’s much simpler to get the same output:

In [36]:
pd.wide_to_long(pop, stubnames='population', i='country', j='year', 
                sep='_').reset_index()

,country,year,population
0,Monaco,2016,38070
1,Liechtenstein,2016,37658
2,San Marino,2016,33504
3,Monaco,2017,38392
4,Liechtenstein,2017,37800
5,San Marino,2017,33671
6,Monaco,2018,38682
7,Liechtenstein,2018,37910
8,San Marino,2018,33785
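
The two tables print identically, but there is a subtle difference worth checking: with the melt() route the year values are what remains of the column names, so they stay strings, whereas wide_to_long() converts numeric suffixes to numbers. A minimal sketch comparing the two follows; the names `pop_demo`, `via_melt` and `via_wtl` are illustrative.

```python
import pandas as pd

pop_demo = pd.DataFrame({'country': ['Monaco', 'Liechtenstein', 'San Marino'],
                         'population_2016': [38070, 37658, 33504],
                         'population_2017': [38392, 37800, 33671],
                         'population_2018': [38682, 37910, 33785]})

# melt() route: name the new columns directly, then strip the prefix by hand
via_melt = pop_demo.melt(id_vars='country', var_name='year',
                         value_name='population')
via_melt['year'] = via_melt['year'].str.replace('population_', '', regex=False)

# wide_to_long() route: the suffix is split off from the stub name for us
via_wtl = pd.wide_to_long(pop_demo, stubnames='population', i='country',
                          j='year', sep='_').reset_index()

# Compare the dtype of the year column produced by each route
print(via_melt['year'].dtype)
print(via_wtl['year'].dtype)

# An explicit cast makes the melt() result numeric as well
via_melt['year'] = via_melt['year'].astype('int64')
```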


When using the function, it helps to understand three terms: the stub name (stubnames), the suffix and the separator (sep). In the column name population_2017, population is the stub name, 2017 is the suffix and _ is the separator. The name for the new column that holds the suffixes is passed to parameter j, and the name of the unique identifier column is passed to parameter i. Without reset_index(), the output would look like the following, where the unique identifier and the suffix column are in the index:

In [37]:
pd.wide_to_long(pop, stubnames='population', i='country', j='year', 
                sep='_')

,,population
country,year,
Monaco,2016,38070
Liechtenstein,2016,37658
San Marino,2016,33504
Monaco,2017,38392
Liechtenstein,2017,37800
San Marino,2017,33671
Monaco,2018,38682
Liechtenstein,2018,37910
San Marino,2018,33785


By default, suffix matches numeric values only (the default pattern is '\d+'), which is why the previous example worked without specifying it. But the default will not work for data like this:

In [38]:
iris = load_dataset('iris').head()
iris


,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


This time, there are two stub names: sepal and petal. We will pass both in a list to stubnames when reshaping. The suffixes (i.e. length and width) are no longer numeric, so we need to specify their pattern with a regular expression in the suffix argument.

In [39]:
pd.wide_to_long(iris.reset_index(), stubnames=['sepal', 'petal'], 
                i='index', j='Measurement', sep='_', suffix=r'\D+')  # raw string avoids an invalid-escape warning

,,species,sepal,petal
index,Measurement,,,
0,length,setosa,5.1,1.4
1,length,setosa,4.9,1.4
2,length,setosa,4.7,1.3
3,length,setosa,4.6,1.5
4,length,setosa,5.0,1.4
0,width,setosa,3.5,0.2
1,width,setosa,3.0,0.2
2,width,setosa,3.2,0.2
3,width,setosa,3.1,0.2
4,width,setosa,3.6,0.2
