<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Reshaping-Data" data-toc-modified-id="Reshaping-Data-1">Reshaping Data</a></span><ul class="toc-item"><li><span><a href="#Changing-Row-Indices" data-toc-modified-id="Changing-Row-Indices-1.1">Changing Row Indices</a></span></li><li><span><a href="#Long-and-Wide-Data-Formats" data-toc-modified-id="Long-and-Wide-Data-Formats-1.2">Long and Wide Data Formats</a></span></li><li><span><a href="#A-Quick-Note-about-pandas-Series" data-toc-modified-id="A-Quick-Note-about-pandas-Series-1.3">A Quick Note about <code>pandas</code> <code>Series</code></a></span></li></ul></li></ul></div>

# Reshaping Data



The usual preamble for importing the essential modules and configuring the plotting engine.

In [1]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
sns.set(style="ticks", color_codes=True)
sns.set_context("notebook")

In [2]:
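# set the default figure size directly; calling sns.set() again would reset the "ticks" style chosen above
plt.rcParams["figure.figsize"] = (12/1.5, 8/1.5)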

We will make use of `gapminder` for illustration. As usual, note that the path to the data file should be changed to match the directory structure in your home area.

In [4]:
gm_df = pd.read_csv('gapminder2.tsv', sep='\t')

A quick reminder of the structure and contents of the dataset:

In [5]:
gm_df.sample(5)

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
1197,Paraguay,Americas,1997,69.4,5154123,4247.400261
395,Cuba,Americas,2007,78.273,11416987,8948.102923
710,Indonesia,Asia,1962,42.518,99028000,849.28977
907,Libya,Africa,1987,66.234,3799845,11770.5898
1502,Taiwan,Asia,1962,65.2,11918938,1822.879028


## Changing Row Indices

We have discussed on more than one occasion that DataFrames are indexed by **row** and **column** labels. By default, `pandas` creates a range of integers for the **row labels**:

In [6]:
gm_df.index

RangeIndex(start=0, stop=1704, step=1)

And the **column labels** match the header of the data file (here, a `.tsv`):

In [7]:
gm_df.columns

Index(['country', 'continent', 'year', 'lifeExp', 'pop', 'gdpPercap'], dtype='object')

To address one or more rows, one would normally use `.loc` with the corresponding labels:

In [8]:
gm_df.loc[249]

country        Canada
continent    Americas
year             1997
lifeExp         78.61
pop          30305843
gdpPercap     28954.9
Name: 249, dtype: object

In many scenarios, one might want to make the row index more meaningful by turning it into something that corresponds to intrinsic properties of the data. For example, if we were looking at a student dataset, we could use student IDs or (albeit not as good) full names. More often than not, row labels are useful when they stand as unique identifiers for observations.

We can use `.set_index()` and `.reset_index()` for turning a column into a row index, and vice-versa. Let us illustrate with a subset of `gapminder`.

In [9]:
australia_df = gm_df.loc[ gm_df['country']=='Australia' ]

In [10]:
australia_df

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
60,Australia,Oceania,1952,69.12,8691212,10039.59564
61,Australia,Oceania,1957,70.33,9712569,10949.64959
62,Australia,Oceania,1962,70.93,10794968,12217.22686
63,Australia,Oceania,1967,71.1,11872264,14526.12465
64,Australia,Oceania,1972,71.93,13177000,16788.62948
65,Australia,Oceania,1977,73.49,14074100,18334.19751
66,Australia,Oceania,1982,74.74,15184200,19477.00928
67,Australia,Oceania,1987,76.32,16257249,21888.88903
68,Australia,Oceania,1992,77.56,17481977,23424.76683
69,Australia,Oceania,1997,78.83,18565243,26997.93657


I will take a subset of columns for an easier demonstration. Instead of using `.loc` to select the columns of interest, I am dropping those I do not wish to keep.

In [11]:
australia_df = australia_df.drop(columns=['country','continent', 'gdpPercap', 'pop'])

In [13]:
australia_df

Unnamed: 0,year,lifeExp
60,1952,69.12
61,1957,70.33
62,1962,70.93
63,1967,71.1
64,1972,71.93
65,1977,73.49
66,1982,74.74
67,1987,76.32
68,1992,77.56
69,1997,78.83


In [14]:
australia_df = australia_df.set_index('year')

In [15]:
australia_df

Unnamed: 0_level_0,lifeExp
year,Unnamed: 1_level_1
1952,69.12
1957,70.33
1962,70.93
1967,71.1
1972,71.93
1977,73.49
1982,74.74
1987,76.32
1992,77.56
1997,78.83


In [16]:
australia_df.index

Int64Index([1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002,
            2007],
           dtype='int64', name='year')

In [17]:
australia_df.loc[2002]

lifeExp    80.37
Name: 2002, dtype: float64

In [19]:
australia_df.loc[1992:2007]

Unnamed: 0_level_0,lifeExp
year,Unnamed: 1_level_1
1992,77.56
1997,78.83
2002,80.37
2007,81.235
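
The round trip between a column and the row index can be sketched on a tiny, self-contained frame. The values are copied from the Australia rows above; the variable names (`by_year`, `window`, `restored`) are only illustrative.

```python
import pandas as pd

# Four Australia observations, copied from the output above
df = pd.DataFrame({
    "year": [1992, 1997, 2002, 2007],
    "lifeExp": [77.56, 78.83, 80.37, 81.235],
})

# set_index returns a new DataFrame; df itself keeps 'year' as a column
by_year = df.set_index("year")

# label-based slices with .loc include both endpoints
window = by_year.loc[1997:2002]
print(window)

# reset_index moves 'year' back to a regular column and restores
# the default integer row labels
restored = by_year.reset_index()
print(restored.equals(df))
```

Note that neither method modifies the DataFrame it is called on, which is why the cells above assign the result back to a variable.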


**checkpoint**:

Take a subset of `gapminder` for `Oceania` and `2007`; it makes sense to assign it to a variable (with a meaningful name). Then, make `country` the row index, and try some indexing expressions with `.loc` for selecting unique observations from that subset.

In [30]:
gm_oceania_2007_df = gm_df.loc[ (gm_df['continent']=='Oceania') & (gm_df['year']==2007)]

In [31]:
gm_oceania_2007_df.head()

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
71,Australia,Oceania,2007,81.235,20434176,34435.36744
1103,New Zealand,Oceania,2007,80.204,4115771,25185.00911


In [32]:
gm_oceania_2007_df=gm_oceania_2007_df.set_index('country')

In [33]:
gm_oceania_2007_df

Unnamed: 0_level_0,continent,year,lifeExp,pop,gdpPercap
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Australia,Oceania,2007,81.235,20434176,34435.36744
New Zealand,Oceania,2007,80.204,4115771,25185.00911


In [39]:
gm_oceania_2007_df.reset_index()  # move 'country' back to a regular column

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Australia,Oceania,2007,81.235,20434176,34435.36744
1,New Zealand,Oceania,2007,80.204,4115771,25185.00911


More than one column can be pushed into the row index, which then becomes a **hierarchical index**. Each column occupies one index **level**. There are all sorts of complex operations and transformations that we can apply on and with hierarchical indices; for the moment, it suffices to realise that, in order to address specific rows, one should now use **tuples** representing combinations of index level labels.


In [40]:
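oceania_df = gm_df.loc[ gm_df['continent']=='Oceania' ]  # assumed definition: all Oceania observations
oceania_sub_df = oceania_df.drop(columns=['continent', 'gdpPercap', 'pop'])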

In [89]:
oceania_sub_df

Unnamed: 0,country,year,lifeExp
60,Australia,1952,69.12
61,Australia,1957,70.33
62,Australia,1962,70.93
63,Australia,1967,71.1
64,Australia,1972,71.93
65,Australia,1977,73.49
66,Australia,1982,74.74
67,Australia,1987,76.32
68,Australia,1992,77.56
69,Australia,1997,78.83


In [90]:
oceania_rsh_df = oceania_sub_df.set_index(['country', 'year'])

In [91]:
oceania_rsh_df

Unnamed: 0_level_0,Unnamed: 1_level_0,lifeExp
country,year,Unnamed: 2_level_1
Australia,1952,69.12
Australia,1957,70.33
Australia,1962,70.93
Australia,1967,71.1
Australia,1972,71.93
Australia,1977,73.49
Australia,1982,74.74
Australia,1987,76.32
Australia,1992,77.56
Australia,1997,78.83


`country` is the **outermost level** (0), and `year` is the **innermost level** (1).
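
The levels of a hierarchical index can be inspected directly. This is a minimal sketch on a four-row toy frame (values taken from the 2002 and 2007 observations shown elsewhere in this notebook); the name `rsh` is just a stand-in for `oceania_rsh_df`.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Australia", "Australia", "New Zealand", "New Zealand"],
    "year": [2002, 2007, 2002, 2007],
    "lifeExp": [80.37, 81.235, 79.11, 80.204],
})
rsh = df.set_index(["country", "year"])

# number of levels and their names, outermost first
print(rsh.index.nlevels)
print(list(rsh.index.names))

# the labels stored in one level, addressed by position or by name
countries = rsh.index.get_level_values(0).unique()
years = rsh.index.get_level_values("year").unique()
print(list(countries), list(years))
```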

We should explore this topic further in the coming weeks but, if you would like to look ahead, the online documentation can help: https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#advanced-indexing-with-hierarchical-index. Our **Python Data Science Handbook** covers it too.

Selecting a single observation by using a **pair of row index labels**:

In [92]:
oceania_rsh_df.loc[ ('New Zealand', 2007) ]

lifeExp    80.204
Name: (New Zealand, 2007), dtype: float64

Ranges are now also pairs of labels:

In [93]:
oceania_rsh_df.loc[ ('New Zealand', 1952):('New Zealand', 1967) ]

Unnamed: 0_level_0,Unnamed: 1_level_0,lifeExp
country,year,Unnamed: 2_level_1
New Zealand,1952,69.39
New Zealand,1957,70.26
New Zealand,1962,71.24
New Zealand,1967,71.52
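
The same selection patterns can be tried on a small frame built from scratch. The sketch below deliberately starts from shuffled rows and calls `.sort_index()` before slicing, on the assumption (worth checking against the pandas documentation) that range slices over tuples are only reliable on a sorted hierarchical index. The variable names are illustrative.

```python
import pandas as pd

# A few observations from the outputs above, in no particular order
df = pd.DataFrame({
    "country": ["New Zealand", "Australia", "New Zealand", "Australia", "New Zealand"],
    "year": [1957, 1952, 1952, 1957, 1962],
    "lifeExp": [70.26, 69.12, 69.39, 70.33, 71.24],
})

# build the hierarchical index, then sort it by country and year
rsh = df.set_index(["country", "year"]).sort_index()

# one observation: a tuple with one label per level
single = rsh.loc[("New Zealand", 1957)]

# a range: both endpoints are tuples, and both are included
span = rsh.loc[("New Zealand", 1952):("New Zealand", 1957)]

# a partial match on the outermost level drops that level from the result
nz = rsh.loc["New Zealand"]

print(single, span, nz, sep="\n\n")
```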


And partial matches using the outermost levels are possible:

In [94]:
oceania_rsh_df.loc[ 'New Zealand' ]

Unnamed: 0_level_0,lifeExp
year,Unnamed: 1_level_1
1952,69.39
1957,70.26
1962,71.24
1967,71.52
1972,71.89
1977,72.22
1982,73.84
1987,74.32
1992,76.33
1997,77.55


**checkpoint**:

In a similar fashion, obtain a subset for 2007 observations and use both `continent` and `country` as row index. Try a few `.loc` expressions using pairs of continent and country labels, as well as partial matching on a single `continent`.

In [95]:
gm_2007_df=gm_df.loc[ gm_df['year']==2007 ]
gm_2007_df.head()

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
11,Afghanistan,Asia,2007,43.828,31889923,974.580338
23,Albania,Europe,2007,76.423,3600523,5937.029526
35,Algeria,Africa,2007,72.301,33333216,6223.367465
47,Angola,Africa,2007,42.731,12420476,4797.231267
59,Argentina,Americas,2007,75.32,40301927,12779.37964


In [96]:
gm_2007_df = gm_2007_df.set_index(['continent','country']).sort_index()  # sort rows by continent, then country
gm_2007_df.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,year,lifeExp,pop,gdpPercap
continent,country,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Africa,Algeria,2007,72.301,33333216,6223.367465
Africa,Angola,2007,42.731,12420476,4797.231267
Africa,Benin,2007,56.728,8078314,1441.284873
Africa,Botswana,2007,50.728,1639131,12569.85177
Africa,Burkina Faso,2007,52.295,14326203,1217.032994


In [97]:
gm_2007_df.loc[('Europe','Germany')]

year         2.007000e+03
lifeExp      7.940600e+01
pop          8.240100e+07
gdpPercap    3.217037e+04
Name: (Europe, Germany), dtype: float64

In [98]:
gm_2007_df.loc['Africa']

Unnamed: 0_level_0,year,lifeExp,pop,gdpPercap
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Algeria,2007,72.301,33333216,6223.367465
Angola,2007,42.731,12420476,4797.231267
Benin,2007,56.728,8078314,1441.284873
Botswana,2007,50.728,1639131,12569.85177
Burkina Faso,2007,52.295,14326203,1217.032994
Burundi,2007,49.58,8390505,430.070692
Cameroon,2007,50.43,17696293,2042.09524
Central African Republic,2007,44.741,4369038,706.016537
Chad,2007,50.651,10238807,1704.063724
Comoros,2007,65.152,710960,986.147879


## Long and Wide Data Formats

In short: typically, the columns of a dataset represent properties (variables) of the data. In a number of situations, however (say, because of the way the data was recorded, or to make comparisons and plotting easier), the values of a variable are used as column labels instead. The former layout is usually called **long**, the latter **wide**.

In [99]:
oceania_rsh_df.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,lifeExp
country,year,Unnamed: 2_level_1
Australia,1952,69.12
Australia,1957,70.33
Australia,1962,70.93
Australia,1967,71.1
Australia,1972,71.93


In [100]:
# don't worry about the expression below and how unstack works; right now,
# I just wanted to produce a small dataset with values as column labels
oceania_wide_df = oceania_rsh_df.unstack('year')['lifeExp']
oceania_wide_df

year,1952,1957,1962,1967,1972,1977,1982,1987,1992,1997,2002,2007
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
Australia,69.12,70.33,70.93,71.1,71.93,73.49,74.74,76.32,77.56,78.83,80.37,81.235
New Zealand,69.39,70.26,71.24,71.52,71.89,72.22,73.84,74.32,76.33,77.55,79.11,80.204


In [104]:
oceania_wide_df.index

Index(['Australia', 'New Zealand'], dtype='object', name='country')

In [105]:
oceania_wide_df.columns

Int64Index([1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002,
            2007],
           dtype='int64', name='year')
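
To make the long/wide relationship concrete, here is a hedged sketch on a four-row toy frame. It uses `unstack` as in the cell above (applied to the `lifeExp` column directly, which avoids the extra `['lifeExp']` selection) and then `stack`, which is not used elsewhere in this notebook, to go back to the long layout. Depending on the pandas version, `stack` may emit a warning about its upcoming new implementation.

```python
import pandas as pd

# Long format: one row per (country, year) observation
long_df = pd.DataFrame({
    "country": ["Australia", "Australia", "New Zealand", "New Zealand"],
    "year": [2002, 2007, 2002, 2007],
    "lifeExp": [80.37, 81.235, 79.11, 80.204],
}).set_index(["country", "year"])

# Long -> wide: the labels of the 'year' level become column labels
wide_df = long_df["lifeExp"].unstack("year")
print(wide_df)

# Wide -> long: the column labels move back into the row index,
# giving a Series indexed by (country, year)
back = wide_df.stack()
print(back)
```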

In [106]:
oceania_wide_df.mean()

year
1952    69.2550
1957    70.2950
1962    71.0850
1967    71.3100
1972    71.9100
1977    72.8550
1982    74.2900
1987    75.3200
1992    76.9450
1997    78.1900
2002    79.7400
2007    80.7195
dtype: float64

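The **aggregation functions** that we previously applied to single columns also work on whole DataFrames and on **subsets of DataFrames**. With the `axis` argument, we can specify whether they aggregate across the columns (one result per row) or across the rows (one result per column).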

In [107]:
oceania_wide_df.mean(axis='columns')

country
Australia      74.662917
New Zealand    73.989500
dtype: float64

In [108]:
oceania_wide_df.mean(axis='rows')

year
1952    69.2550
1957    70.2950
1962    71.0850
1967    71.3100
1972    71.9100
1977    72.8550
1982    74.2900
1987    75.3200
1992    76.9450
1997    78.1900
2002    79.7400
2007    80.7195
dtype: float64

In [109]:
oceania_wide_df.max(axis='columns')

country
Australia      81.235
New Zealand    80.204
dtype: float64

In [110]:
oceania_wide_df.max(axis='rows')

year
1952    69.390
1957    70.330
1962    71.240
1967    71.520
1972    71.930
1977    73.490
1982    74.740
1987    76.320
1992    77.560
1997    78.830
2002    80.370
2007    81.235
dtype: float64
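
The meaning of `axis` is easy to mix up, so here is a small sketch on a two-by-two wide frame (the 2002 and 2007 values from above; the variable names are illustrative). The axis named is the one that gets collapsed. The `'rows'` spelling used above refers to the same axis as `'index'`.

```python
import pandas as pd

wide = pd.DataFrame(
    {2002: [80.37, 79.11], 2007: [81.235, 80.204]},
    index=pd.Index(["Australia", "New Zealand"], name="country"),
)

# collapse the rows: one value per column (per year)
per_year = wide.max(axis="index")

# collapse the columns: one value per row (per country)
per_country = wide.max(axis="columns")

print(per_year)
print(per_country)

# the integer spellings select the same axes: 0 for 'index', 1 for 'columns'
same_rows = per_year.equals(wide.max(axis=0))
same_cols = per_country.equals(wide.max(axis=1))
print(same_rows, same_cols)
```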

**checkpoint**:

On our wider Oceania dataset, produce a subset for observations concerning `New Zealand` from `1992` to `2007`.

In [122]:
oceania_rsh_df.loc[('New Zealand',1992):('New Zealand',2007)]

Unnamed: 0_level_0,Unnamed: 1_level_0,lifeExp
country,year,Unnamed: 2_level_1
New Zealand,1992,76.33
New Zealand,1997,77.55
New Zealand,2002,79.11
New Zealand,2007,80.204


Usefully, one can swap rows with columns via a **transpose** operation:

In [126]:
oceania_wide_df

year,1952,1957,1962,1967,1972,1977,1982,1987,1992,1997,2002,2007
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
Australia,69.12,70.33,70.93,71.1,71.93,73.49,74.74,76.32,77.56,78.83,80.37,81.235
New Zealand,69.39,70.26,71.24,71.52,71.89,72.22,73.84,74.32,76.33,77.55,79.11,80.204


In [127]:
oceania_wide_df.T

country,Australia,New Zealand
year,Unnamed: 1_level_1,Unnamed: 2_level_1
1952,69.12,69.39
1957,70.33,70.26
1962,70.93,71.24
1967,71.1,71.52
1972,71.93,71.89
1977,73.49,72.22
1982,74.74,73.84
1987,76.32,74.32
1992,77.56,76.33
1997,78.83,77.55


In [128]:
oceania_wide_df.T.index

Int64Index([1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002,
            2007],
           dtype='int64', name='year')

In [129]:
oceania_wide_df.T.columns

Index(['Australia', 'New Zealand'], dtype='object', name='country')

## A Quick Note about `pandas` `Series`

For the time being, think of a `Series` as a **one-dimensional DataFrame**. A row is a Series whose index is the column labels, each mapped to a value. A column is a Series whose index is the row labels, each mapped to a value.

In [130]:
oceania_wide_df.loc['New Zealand']

year
1952    69.390
1957    70.260
1962    71.240
1967    71.520
1972    71.890
1977    72.220
1982    73.840
1987    74.320
1992    76.330
1997    77.550
2002    79.110
2007    80.204
Name: New Zealand, dtype: float64

In [134]:
oceania_wide_df.loc['New Zealand'][1987]

74.32

In [135]:
oceania_wide_df.loc['New Zealand'].index

Int64Index([1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002,
            2007],
           dtype='int64', name='year')

In [136]:
oceania_wide_df.loc['New Zealand'].values

array([69.39 , 70.26 , 71.24 , 71.52 , 71.89 , 72.22 , 73.84 , 74.32 ,
       76.33 , 77.55 , 79.11 , 80.204])

In [140]:
oceania_wide_df[[2007]]

year,2007
country,Unnamed: 1_level_1
Australia,81.235
New Zealand,80.204


In [138]:
oceania_wide_df[2007].index

Index(['Australia', 'New Zealand'], dtype='object', name='country')

In [139]:
oceania_wide_df[2007].values

array([81.235, 80.204])

Most operations work the same way on both **DataFrames** and **Series**; it is worth pointing out that `Series` can have hierarchical indices too.
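
As a closing sketch, the snippet below pulls a row and a column out of a small wide frame and then builds a `Series` with a hierarchical index by hand. The data are the 2002 and 2007 values from above; `pd.MultiIndex.from_product` is simply one convenient way to construct such an index and is not used elsewhere in this notebook.

```python
import pandas as pd

wide = pd.DataFrame(
    {2002: [80.37, 79.11], 2007: [81.235, 80.204]},
    index=pd.Index(["Australia", "New Zealand"], name="country"),
)

row = wide.loc["New Zealand"]  # Series indexed by the column labels (years)
col = wide[2007]               # Series indexed by the row labels (countries)
print(row.index.tolist(), col.index.tolist())

# A Series whose index has two levels: country (outer) and year (inner)
idx = pd.MultiIndex.from_product(
    [["Australia", "New Zealand"], [2002, 2007]],
    names=["country", "year"],
)
s = pd.Series([80.37, 81.235, 79.11, 80.204], index=idx, name="lifeExp")

print(s.loc[("New Zealand", 2007)])  # a single value
print(s.loc["Australia"])            # partial match: a Series indexed by year
```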