# Wide and long formats with pandas

## Get a dataset from the internet

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import pandas as pd

We download a dataset from www.worldometers.info. read_html returns a list with one dataframe per table found on the page, so we keep the first one:

In [3]:
dfs = pd.read_html("http://www.worldometers.info/world-population/population-by-country/")
df_pop = dfs[0]
df_pop.head()

,#,Country (or dependency),Population (2017),Yearly Change,Net Change,Density (P/Km²),Land Area (Km²),Migrants (net),Fert. Rate,Med. Age,Urban Pop %,World Share
0,1,China,1409517397,0.43 %,6017032,150,9388211,-339690.0,1.6,37,57 %,18.67 %
1,2,India,1339180127,1.13 %,15008773,450,2973190,-515643.0,2.4,27,32 %,17.74 %
2,3,U.S.,324459463,0.71 %,2279858,35,9147420,900000.0,1.9,38,83 %,4.30 %
3,4,Indonesia,263991379,1.10 %,2875923,146,1811570,-167000.0,2.5,28,53 %,3.50 %
4,5,Brazil,209288278,0.79 %,1635413,25,8358140,3185.0,1.8,31,84 %,2.77 %


## Converting to long format

This dataframe is in wide format: each variable has its own column. For some processing tasks, a long format is more convenient. In pandas, the conversion from wide to long is called unpivoting, or melting:

In [4]:
df_pop.melt().head()

,variable,value
0,#,1
1,#,2
2,#,3
3,#,4
4,#,5


The dataframe is transformed into a two-column dataframe: the variable column holds the name of the column in the original dataframe, and the value column holds the corresponding value.

In our case this is not very convenient: we would like to keep # and Country (or dependency) as columns. These are identifier columns, and we can tell melt so with the id_vars parameter:

In [5]:
df_melted = df_pop.melt(id_vars=["#", "Country (or dependency)"])
df_melted.head()

,#,Country (or dependency),variable,value
0,1,China,Population (2017),1409517397
1,2,India,Population (2017),1339180127
2,3,U.S.,Population (2017),324459463
3,4,Indonesia,Population (2017),263991379
4,5,Brazil,Population (2017),209288278
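
To make the effect of id_vars easier to see, here is a minimal sketch on a tiny hand-made dataframe. The data and the column names (country, population, area, indicator, amount) are hypothetical and are not taken from the downloaded table; var_name and value_name are optional melt parameters that rename the two generated columns.

```python
import pandas as pd

# Hypothetical toy data, not taken from the worldometers table.
df_toy = pd.DataFrame({
    "country": ["A", "B"],
    "population": [10, 20],
    "area": [1.5, 2.5],
})

# Without id_vars, every column (including "country") is melted.
melted_all = df_toy.melt()

# With id_vars, "country" is repeated on each row and only the
# remaining columns are melted.
melted_ids = df_toy.melt(id_vars="country", var_name="indicator", value_name="amount")

print(melted_all.shape)  # (6, 2)
print(melted_ids.shape)  # (4, 3)
print(melted_ids)
```

Each original row produces one long-format row per melted column, which is why the second result has 2 x 2 = 4 rows.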


We can take a closer look at the data for France:

In [6]:
df_melted[df_melted["Country (or dependency)"] == "France"]

,#,Country (or dependency),variable,value
21,22,France,Population (2017),64979548
254,22,France,Yearly Change,0.40 %
487,22,France,Net Change,258858
720,22,France,Density (P/Km²),119
953,22,France,Land Area (Km²),547557
1186,22,France,Migrants (net),72344
1419,22,France,Fert. Rate,2.0
1652,22,France,Med. Age,41
1885,22,France,Urban Pop %,80 %
2118,22,France,World Share,0.86 %


## Converting back to wide format, step by step

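The reverse operation of melt is pivot, which lets us get back to the original form of the dataframe.
To do so, we can start by grouping and aggregating the data. We group by the identifier columns we want to keep in the final dataframe, together with the variable column. Each combination should occur only once here, so the "first" aggregation simply keeps its single value: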

In [7]:
df_step1 = df_melted.groupby(['#', 'Country (or dependency)', 'variable']).aggregate("first")
df_step1.head()

#,Country (or dependency),variable,value
1,China,Density (P/Km²),150.0
1,China,Fert. Rate,1.6
1,China,Land Area (Km²),9388211.0
1,China,Med. Age,37.0
1,China,Migrants (net),-339690.0
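
The grouping step can be sketched on a small long-format dataframe. The data below is hypothetical; the point is only to show that the grouping columns become the levels of a multi-index and that the values pass through unchanged when each group holds a single row.

```python
import pandas as pd

# Hypothetical long-format data: one row per (country, variable) pair.
df_long_toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "variable": ["area", "population", "area", "population"],
    "value": [1.5, 10.0, 2.5, 20.0],
})

# Group by the identifier and the variable name, keeping the first value of each group.
grouped = df_long_toy.groupby(["country", "variable"]).aggregate("first")

print(list(grouped.index.names))                  # ['country', 'variable']
print(grouped.loc[("A", "population"), "value"])  # 10.0
```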


The result is a dataframe with a multi-level index. We need to unstack the variable level:

In [8]:
df_step2 = df_step1.unstack("variable")
df_step2.head()

,,value,value,value,value,value,value,value,value,value,value
,variable,Density (P/Km²),Fert. Rate,Land Area (Km²),Med. Age,Migrants (net),Net Change,Population (2017),Urban Pop %,World Share,Yearly Change
#,Country (or dependency),,,,,,,,,,
1,China,150,1.6,9388211,37,-339690,6017032,1409517397,57 %,18.67 %,0.43 %
2,India,450,2.4,2973190,27,-515643,15008773,1339180127,32 %,17.74 %,1.13 %
3,U.S.,35,1.9,9147420,38,900000,2279858,324459463,83 %,4.30 %,0.71 %
4,Indonesia,146,2.5,1811570,28,-167000,2875923,263991379,53 %,3.50 %,1.10 %
5,Brazil,25,1.8,8358140,31,3185,1635413,209288278,84 %,2.77 %,0.79 %


The columns now form a multi-index as well. We can get rid of the outermost level:

In [9]:
df_step3 = df_step2["value"]
df_step3.head()

,variable,Density (P/Km²),Fert. Rate,Land Area (Km²),Med. Age,Migrants (net),Net Change,Population (2017),Urban Pop %,World Share,Yearly Change
#,Country (or dependency),,,,,,,,,,
1,China,150,1.6,9388211,37,-339690,6017032,1409517397,57 %,18.67 %,0.43 %
2,India,450,2.4,2973190,27,-515643,15008773,1339180127,32 %,17.74 %,1.13 %
3,U.S.,35,1.9,9147420,38,900000,2279858,324459463,83 %,4.30 %,0.71 %
4,Indonesia,146,2.5,1811570,28,-167000,2875923,263991379,53 %,3.50 %,1.10 %
5,Brazil,25,1.8,8358140,31,3185,1635413,209288278,84 %,2.77 %,0.79 %
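
As a small illustration of these two steps, the sketch below unstacks a hypothetical toy dataframe and removes the outer column level in two ways: by selecting the "value" key, as above, and with droplevel, which is an alternative not used in this notebook.

```python
import pandas as pd

# Hypothetical long-format data, grouped as in the previous step.
df_long_toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "variable": ["area", "population", "area", "population"],
    "value": [1.5, 10.0, 2.5, 20.0],
})
grouped = df_long_toy.groupby(["country", "variable"]).aggregate("first")

# Move the "variable" index level to the columns: the columns get two levels.
unstacked = grouped.unstack("variable")
print(unstacked.columns.nlevels)   # 2
print(unstacked.columns.tolist())  # [('value', 'area'), ('value', 'population')]

# Two ways to drop the outer "value" level.
by_key = unstacked["value"]
by_droplevel = unstacked.droplevel(0, axis=1)
print(by_key.columns.tolist())        # ['area', 'population']
print(by_droplevel.columns.tolist())  # ['area', 'population']
```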


We then reset the index so that # and Country (or dependency) become columns again:

In [10]:
df_step4 = df_step3.reset_index()
df_step4.head()

variable,#,Country (or dependency),Density (P/Km²),Fert. Rate,Land Area (Km²),Med. Age,Migrants (net),Net Change,Population (2017),Urban Pop %,World Share,Yearly Change
0,1,China,150,1.6,9388211,37,-339690,6017032,1409517397,57 %,18.67 %,0.43 %
1,2,India,450,2.4,2973190,27,-515643,15008773,1339180127,32 %,17.74 %,1.13 %
2,3,U.S.,35,1.9,9147420,38,900000,2279858,324459463,83 %,4.30 %,0.71 %
3,4,Indonesia,146,2.5,1811570,28,-167000,2875923,263991379,53 %,3.50 %,1.10 %
4,5,Brazil,25,1.8,8358140,31,3185,1635413,209288278,84 %,2.77 %,0.79 %


To get closer to the dataframe we downloaded, we can also clear the name of the column index. Note that the variable columns are now sorted alphabetically, so their order still differs from the original:

In [11]:
df_step4.columns.name=""
df_step4.head()

,#,Country (or dependency),Density (P/Km²),Fert. Rate,Land Area (Km²),Med. Age,Migrants (net),Net Change,Population (2017),Urban Pop %,World Share,Yearly Change
0,1,China,150,1.6,9388211,37,-339690,6017032,1409517397,57 %,18.67 %,0.43 %
1,2,India,450,2.4,2973190,27,-515643,15008773,1339180127,32 %,17.74 %,1.13 %
2,3,U.S.,35,1.9,9147420,38,900000,2279858,324459463,83 %,4.30 %,0.71 %
3,4,Indonesia,146,2.5,1811570,28,-167000,2875923,263991379,53 %,3.50 %,1.10 %
4,5,Brazil,25,1.8,8358140,31,3185,1635413,209288278,84 %,2.77 %,0.79 %


Here we are!
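
For comparison, the whole round trip can be sketched with DataFrame.pivot on a hypothetical toy dataframe. This is an alternative to the step-by-step approach above, not the method used in this notebook, and it assumes two things: each index/variable pair occurs only once (pivot raises an error otherwise), and pandas is recent enough (1.1 or later) to accept a list for index.

```python
import pandas as pd

# Hypothetical toy data with two identifier columns.
df_toy = pd.DataFrame({
    "rank": [1, 2],
    "country": ["A", "B"],
    "population": [10, 20],
    "area": [1.5, 2.5],
})
df_long_toy = df_toy.melt(id_vars=["rank", "country"])

# pivot is the direct inverse of melt when index/variable pairs are unique.
df_back = df_long_toy.pivot(index=["rank", "country"], columns="variable", values="value")
df_back = df_back.reset_index()
df_back.columns.name = None

# The variable columns come back sorted alphabetically.
print(list(df_back.columns))  # ['rank', 'country', 'area', 'population']

# Restore the original column order before comparing. Melting mixed integer
# and float columns yields floats, hence check_dtype=False.
df_back = df_back[df_toy.columns]
pd.testing.assert_frame_equal(df_back, df_toy, check_dtype=False)
```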

## Converting back to wide format, with a custom cast function

We can chain all the previous steps into a custom cast function:

In [12]:
def cast(df, id_vars=None, columns=None, values=None):
    """
    - df: dataframe to be converted from long to wide format
    - id_vars: [list or str] identification columns that are kept as-is when converting to wide format
    - columns: [str] column holding the name of variables that will be converted into columns
    - values: [str] column holding the values of variables
    """
    # Check inputs and normalise id_vars to a list
    if id_vars is None:
        raise ValueError("No id_vars provided.")
    if isinstance(id_vars, str):
        id_vars = [id_vars]
    else:
        id_vars = list(id_vars)
    # By default, the last two columns hold the variable names and their values
    if columns is None:
        columns = df.columns[-2]
    if values is None:
        values = df.columns[-1]

    # Keep only the columns we are interested in
    output = df[id_vars + [columns, values]]

    # Group and aggregate: keep the first value of each (id_vars, variable) group
    output = output.groupby(id_vars + [columns]).aggregate("first")

    # Unstack the variable level and drop the outer column level
    output = output.unstack(columns)[values]

    # Turn the id_vars index levels back into regular columns
    output = output.reset_index()

    # Clear the name of the column index
    output.columns.name = ""

    return output

Let's try our function:

In [13]:
df_wide = cast(df_melted, id_vars=['#', 'Country (or dependency)'], columns="variable", values="value")
df_wide.head()

,#,Country (or dependency),Density (P/Km²),Fert. Rate,Land Area (Km²),Med. Age,Migrants (net),Net Change,Population (2017),Urban Pop %,World Share,Yearly Change
0,1,China,150,1.6,9388211,37,-339690,6017032,1409517397,57 %,18.67 %,0.43 %
1,2,India,450,2.4,2973190,27,-515643,15008773,1339180127,32 %,17.74 %,1.13 %
2,3,U.S.,35,1.9,9147420,38,900000,2279858,324459463,83 %,4.30 %,0.71 %
3,4,Indonesia,146,2.5,1811570,28,-167000,2875923,263991379,53 %,3.50 %,1.10 %
4,5,Brazil,25,1.8,8358140,31,3185,1635413,209288278,84 %,2.77 %,0.79 %


With fewer parameters:

In [14]:
df_wide = cast(df_melted, id_vars='Country (or dependency)')
df_wide.head()

,Country (or dependency),Density (P/Km²),Fert. Rate,Land Area (Km²),Med. Age,Migrants (net),Net Change,Population (2017),Urban Pop %,World Share,Yearly Change
0,Afghanistan,54,5.3,652860,17,89601.0,874049,35530081,25 %,0.47 %,2.52 %
1,Albania,107,1.7,27400,36,-18685.0,3839,2930187,64 %,0.04 %,0.13 %
2,Algeria,17,3.0,2381740,28,-28654.0,712090,41318142,71 %,0.55 %,1.75 %
3,American Samoa,278,N.A.,200,N.A.,,42,55641,87 %,0.00 %,0.08 %
4,Andorra,164,N.A.,470,N.A.,,-316,76965,90 %,0.00 %,-0.41 %


## Another method using pivot_table

In the previous example, the dataframe contained non-numeric fields. By default, pivot_table aggregates values with a mean, which only makes sense for numeric data, so these fields prevent us from using it as-is. We can try to convert all fields to numeric values:

In [15]:
def to_numeric(series):
    def to_float_or_none(value):
        # Values that cannot be parsed (e.g. "N.A.") become missing values
        try:
            return float(value)
        except (TypeError, ValueError):
            return None
    if series.name.lower() in ["yearly change", "urban pop %", "world share"]:
        # Percentage columns: strip the trailing " %" before converting
        output = series.map(lambda x: to_float_or_none(x.rstrip(' %')), na_action='ignore')
    elif series.name.lower() in ["#", "country (or dependency)"]:
        # Identifier columns are kept as-is
        output = series
    else:
        output = series.map(to_float_or_none)
    return output

df_num = df_pop.apply(to_numeric)
df_num.head()

,#,Country (or dependency),Population (2017),Yearly Change,Net Change,Density (P/Km²),Land Area (Km²),Migrants (net),Fert. Rate,Med. Age,Urban Pop %,World Share
0,1,China,1409517000.0,0.43,6017032.0,150.0,9388211.0,-339690.0,1.6,37.0,57.0,18.67
1,2,India,1339180000.0,1.13,15008773.0,450.0,2973190.0,-515643.0,2.4,27.0,32.0,17.74
2,3,U.S.,324459500.0,0.71,2279858.0,35.0,9147420.0,900000.0,1.9,38.0,83.0,4.3
3,4,Indonesia,263991400.0,1.1,2875923.0,146.0,1811570.0,-167000.0,2.5,28.0,53.0,3.5
4,5,Brazil,209288300.0,0.79,1635413.0,25.0,8358140.0,3185.0,1.8,31.0,84.0,2.77
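
As an alternative sketch (not the function used in this notebook), the same kind of cleaning can be done with pd.to_numeric and errors="coerce", which turns unparseable entries into NaN. The dataframe and its column names (country, share, median_age) are hypothetical and only mimic the shape of the values seen above.

```python
import pandas as pd

# Hypothetical raw data mimicking percentages and "N.A." markers.
df_raw = pd.DataFrame({
    "country": ["A", "B", "C"],
    "share": ["18.67 %", "4.30 %", None],
    "median_age": ["37", "N.A.", "28"],
})

df_clean = df_raw.copy()
# Strip the trailing " %" with the vectorised string method, then convert.
df_clean["share"] = pd.to_numeric(df_clean["share"].str.rstrip(" %"), errors="coerce")
# Entries that are not numbers (here "N.A.") become NaN.
df_clean["median_age"] = pd.to_numeric(df_clean["median_age"], errors="coerce")

print(df_clean)
```

The identifier column is simply left untouched, which plays the same role as the elif branch in to_numeric.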


We can melt the dataframe as before:

In [16]:
df_long = df_num.melt(id_vars=["#", "Country (or dependency)"])
df_long.head()

,#,Country (or dependency),variable,value
0,1,China,Population (2017),1409517000.0
1,2,India,Population (2017),1339180000.0
2,3,U.S.,Population (2017),324459500.0
3,4,Indonesia,Population (2017),263991400.0
4,5,Brazil,Population (2017),209288300.0


In [17]:
df_wide = df_long.pivot_table(index=["#", "Country (or dependency)"], columns="variable", values="value")
df_wide.head()

,variable,Density (P/Km²),Fert. Rate,Land Area (Km²),Med. Age,Migrants (net),Net Change,Population (2017),Urban Pop %,World Share,Yearly Change
#,Country (or dependency),,,,,,,,,,
1,China,150.0,1.6,9388211.0,37.0,-339690.0,6017032.0,1409517000.0,57.0,18.67,0.43
2,India,450.0,2.4,2973190.0,27.0,-515643.0,15008773.0,1339180000.0,32.0,17.74,1.13
3,U.S.,35.0,1.9,9147420.0,38.0,900000.0,2279858.0,324459500.0,83.0,4.3,0.71
4,Indonesia,146.0,2.5,1811570.0,28.0,-167000.0,2875923.0,263991400.0,53.0,3.5,1.1
5,Brazil,25.0,1.8,8358140.0,31.0,3185.0,1635413.0,209288300.0,84.0,2.77,0.79


We just have to reset the index and clear the name of the column index:

In [18]:
df_wide = df_wide.reset_index()
df_wide.columns.name = ""
df_wide.head()

,#,Country (or dependency),Density (P/Km²),Fert. Rate,Land Area (Km²),Med. Age,Migrants (net),Net Change,Population (2017),Urban Pop %,World Share,Yearly Change
0,1,China,150.0,1.6,9388211.0,37.0,-339690.0,6017032.0,1409517000.0,57.0,18.67,0.43
1,2,India,450.0,2.4,2973190.0,27.0,-515643.0,15008773.0,1339180000.0,32.0,17.74,1.13
2,3,U.S.,35.0,1.9,9147420.0,38.0,900000.0,2279858.0,324459500.0,83.0,4.3,0.71
3,4,Indonesia,146.0,2.5,1811570.0,28.0,-167000.0,2875923.0,263991400.0,53.0,3.5,1.1
4,5,Brazil,25.0,1.8,8358140.0,31.0,3185.0,1635413.0,209288300.0,84.0,2.77,0.79


Without converting the input dataframe to numeric values, we can pass an aggregation function to pivot_table to do almost the same as our cast function:

In [19]:
df_wide = df_melted.pivot_table(index=["#", "Country (or dependency)"], 
                                columns="variable", values="value", aggfunc="first")
df_wide.head()

,variable,Density (P/Km²),Fert. Rate,Land Area (Km²),Med. Age,Migrants (net),Net Change,Population (2017),Urban Pop %,World Share,Yearly Change
#,Country (or dependency),,,,,,,,,,
1,China,150,1.6,9388211,37,-339690,6017032,1409517397,57 %,18.67 %,0.43 %
2,India,450,2.4,2973190,27,-515643,15008773,1339180127,32 %,17.74 %,1.13 %
3,U.S.,35,1.9,9147420,38,900000,2279858,324459463,83 %,4.30 %,0.71 %
4,Indonesia,146,2.5,1811570,28,-167000,2875923,263991379,53 %,3.50 %,1.10 %
5,Brazil,25,1.8,8358140,31,3185,1635413,209288278,84 %,2.77 %,0.79 %


In [20]:
df_wide = df_wide.reset_index()
df_wide.columns.name = ""
df_wide.head()

,#,Country (or dependency),Density (P/Km²),Fert. Rate,Land Area (Km²),Med. Age,Migrants (net),Net Change,Population (2017),Urban Pop %,World Share,Yearly Change
0,1,China,150,1.6,9388211,37,-339690,6017032,1409517397,57 %,18.67 %,0.43 %
1,2,India,450,2.4,2973190,27,-515643,15008773,1339180127,32 %,17.74 %,1.13 %
2,3,U.S.,35,1.9,9147420,38,900000,2279858,324459463,83 %,4.30 %,0.71 %
3,4,Indonesia,146,2.5,1811570,28,-167000,2875923,263991379,53 %,3.50 %,1.10 %
4,5,Brazil,25,1.8,8358140,31,3185,1635413,209288278,84 %,2.77 %,0.79 %


Same output again.
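
The choice of aggregation function only matters when an index/variable pair occurs more than once. The sketch below uses hypothetical data with a duplicated pair to show the difference between the default aggregation of pivot_table (the mean) and aggfunc="first".

```python
import pandas as pd

# Hypothetical long-format data where ("A", "population") appears twice.
df_dup = pd.DataFrame({
    "country": ["A", "A", "B"],
    "variable": ["population", "population", "population"],
    "value": [10.0, 30.0, 20.0],
})

# Default aggregation: duplicates are averaged.
mean_table = df_dup.pivot_table(index="country", columns="variable", values="value")

# aggfunc="first": the first value of each group is kept.
first_table = df_dup.pivot_table(index="country", columns="variable",
                                 values="value", aggfunc="first")

print(mean_table.loc["A", "population"])   # 20.0
print(first_table.loc["A", "population"])  # 10.0
```

In the population dataset each pair should appear only once, so both aggregations would return the same numbers on the numeric version of the data.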