# Chapter 8 - Data Wrangling: Join, Combine, and Reshape

## 8.3 Reshaping and Pivoting

In [1]:
import pandas as pd

This section covers two groups of reshaping tools:

- `stack()` and `unstack()`, which use hierarchical indices to move data between rows and columns

- `df.melt()` and `df.pivot()`, which convert between "wide" and "long" formats

In [2]:
# DATA PREPARATION STEP

# Read from CSV file
df_csv = pd.read_csv('dataset-C-enrolment.csv')
df_mf, df_f = df_csv.copy(), df_csv.copy()

# slice & dice, change index
df_mf = df_mf[df_mf.year>2012]
df_mf = df_mf[df_mf.sex=='MF']
df_mf.index=df_mf['year']
df_mf = df_mf[['intake', 'graduates']]

df_f = df_f[df_f.year>2012]
df_f = df_f[df_f.sex=='F']
df_f.index=df_f['year']
df_f = df_f[['intake', 'graduates']]

# Get the number of males from the 2 df
df_m = df_mf - df_f

# Join them together, reindex
df = pd.concat([df_f.copy(), df_m.copy(), df_mf], axis=1)
cols_arrays = [['female', 'female', 'male', 'male', 'mf', 'mf'], 
               ['intake', 'graduates', 'intake', 'graduates', 'intake', 'graduates']]
df.columns = cols_arrays
display(df)

,female,female,male,male,mf,mf
,intake,graduates,intake,graduates,intake,graduates
year,,,,,,
2013,208,179,195,189,403,368
2014,170,176,227,180,397,356
2015,171,168,234,187,405,355
2016,210,173,189,178,399,351
2017,201,188,190,187,391,375


Hierarchical indexing provides consistent ways to rearrange the data in a `df`.
- `stack()` pivots the column labels into the innermost level of the row index. For a `df` with single-level columns, the result is a `Series`.

In [3]:
m = df['male']
display(m)
m_ser = m.stack()
print(m_ser)

,intake,graduates
year,,
2013,195,189
2014,227,180
2015,234,187
2016,189,178
2017,190,187


year           
2013  intake       195
      graduates    189
2014  intake       227
      graduates    180
2015  intake       234
      graduates    187
2016  intake       189
      graduates    178
2017  intake       190
      graduates    187
dtype: int64
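
The same behaviour can be reproduced without the CSV file. The following is a minimal sketch that assumes a small, made-up table (the name `wide` and its figures are hypothetical, not taken from the enrolment dataset); it shows that stacking a frame with single-level columns yields a `Series` whose index has two levels.

```python
import pandas as pd

# Hypothetical figures keyed by year; not taken from the enrolment dataset.
wide = pd.DataFrame(
    {"intake": [10, 12], "graduates": [8, 9]},
    index=pd.Index([2020, 2021], name="year"),
)

# stack() moves the column labels into a new innermost index level.
stacked = wide.stack()

print(type(stacked).__name__)            # Series
print(stacked.index.nlevels)             # 2
print(stacked.loc[(2021, "graduates")])  # 9
```

Each value is now addressed by a `(year, column label)` pair instead of a row and a column.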


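`unstack()` does the reverse: it pivots one level of the row index (the innermost by default) back into columns, producing a `DataFrame`. Swapping the levels first, as below, turns the years into columns.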

In [4]:
m_p = m_ser.swaplevel()
print(m_p)
display(m_p.unstack())

           year
intake     2013    195
graduates  2013    189
intake     2014    227
graduates  2014    180
intake     2015    234
graduates  2015    187
intake     2016    189
graduates  2016    178
intake     2017    190
graduates  2017    187
dtype: int64


year,2013,2014,2015,2016,2017
intake,195,227,234,189,190
graduates,189,180,187,178,187
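
The level that becomes the columns can also be chosen directly. The sketch below assumes a small hypothetical `Series` (the names `ser`, `year` and `item` are illustrative only) and compares the default `unstack()` with `unstack(level="year")`, which is an alternative to calling `swaplevel()` first.

```python
import pandas as pd

# Hypothetical two-level Series; the level names are chosen for illustration.
ser = pd.Series(
    [10, 8, 12, 9],
    index=pd.MultiIndex.from_product(
        [[2020, 2021], ["intake", "graduates"]], names=["year", "item"]
    ),
)

by_item = ser.unstack()              # innermost level ("item") becomes the columns
by_year = ser.unstack(level="year")  # "year" becomes the columns instead

# Swapping the levels and unstacking holds the same values as by_year.
swapped = ser.swaplevel().unstack()

print(by_item.loc[2021, "graduates"])  # 9
print(by_year.loc["intake", 2021])     # 12
print(swapped.loc["intake", 2021])     # 12
```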


<hr>
Data is often stored in "long format", with one column for the date, one for the item name, and one for the value. The following is an example.

In [5]:
longf_df = pd.read_csv('dataset-I3-ES3.csv')
display(longf_df.head(4))

,date,item,val
0,2017-01-03,Close,2.96
1,2017-01-03,Volume,819500.0
2,2017-01-04,Close,2.98
3,2017-01-04,Volume,439000.0


`df.pivot()` converts data in "long format" to columns. The first parameter, `index`, names the column that supplies the **row index**, and the second, `columns`, names the column whose values become the **column index**. The third parameter, `values`, names the column that fills the table. If `values` is not specified, the resulting `df` has hierarchical columns.

In [6]:
# pandas 2.0 and later accept these arguments by keyword only
display(longf_df.pivot(index='date', columns='item').head(3))

,val,val
item,Close,Volume
date,,
2017-01-03,2.96,819500.0
2017-01-04,2.98,439000.0
2017-01-05,3.01,772500.0


In [7]:
pivot_df = longf_df.pivot(index='date', columns='item', values='val')
display(pivot_df.head(3))
# Retrieving values from the df
display(pivot_df['Close'].head(3))

item,Close,Volume
date,,
2017-01-03,2.96,819500.0
2017-01-04,2.98,439000.0
2017-01-05,3.01,772500.0


date
2017-01-03    2.96
2017-01-04    2.98
2017-01-05    3.01
Name: Close, dtype: float64
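
The effect of the `values` argument can be checked on a small inline table. This is a minimal sketch that assumes made-up records (the name `long_df` and its figures are hypothetical); it compares the column structure with and without `values`.

```python
import pandas as pd

# Hypothetical long-format records: one row per (date, item) observation.
long_df = pd.DataFrame({
    "date": ["2021-01-04", "2021-01-04", "2021-01-05", "2021-01-05"],
    "item": ["Close", "Volume", "Close", "Volume"],
    "val": [1.50, 1000.0, 1.55, 1200.0],
})

# With values given, the columns are a flat index of item names.
flat = long_df.pivot(index="date", columns="item", values="val")

# Without values, every remaining column is pivoted and the columns gain a level.
nested = long_df.pivot(index="date", columns="item")

print(flat.columns.nlevels)                   # 1
print(nested.columns.nlevels)                 # 2
print(flat.loc["2021-01-05", "Volume"])       # 1200.0
print(nested.loc["2021-01-05", ("val", "Close")])  # 1.55
```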

Note that `pivot()` is equivalent to creating a hierarchical index with `set_index()` and then calling `unstack()`.

In [8]:
longf_df_indices = longf_df.copy().set_index(['date', 'item'])
display(longf_df_indices.head(3))
pivot_df2 = longf_df_indices.unstack('item')
display(pivot_df2.head(3))

,,val
date,item,
2017-01-03,Close,2.96
2017-01-03,Volume,819500.0
2017-01-04,Close,2.98


,val,val
item,Close,Volume
date,,
2017-01-03,2.96,819500.0
2017-01-04,2.98,439000.0
2017-01-05,3.01,772500.0


In [9]:
# Flatten the column index so that the result matches pivot_df from In [7]
pivot_df2.columns = pd.Index(['Close', 'Volume'], name='item')
pivot_df2.head(3)

item,Close,Volume
date,,
2017-01-03,2.96,819500.0
2017-01-04,2.98,439000.0
2017-01-05,3.01,772500.0
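
The equivalence can also be verified programmatically. The sketch below assumes a small hypothetical table (`long_df` and its figures are made up) and selects the `val` column before unstacking, so that no manual renaming of the columns is needed.

```python
import pandas as pd

# Hypothetical long-format records, used only to compare the two routes.
long_df = pd.DataFrame({
    "date": ["2021-01-04", "2021-01-04", "2021-01-05", "2021-01-05"],
    "item": ["Close", "Volume", "Close", "Volume"],
    "val": [1.50, 1000.0, 1.55, 1200.0],
})

# Route 1: pivot() with an explicit values column.
via_pivot = long_df.pivot(index="date", columns="item", values="val")

# Route 2: build the hierarchical index, pick the value column, then unstack.
via_unstack = long_df.set_index(["date", "item"])["val"].unstack("item")

print(via_pivot.equals(via_unstack))  # True
```

Selecting a single column before `unstack()` yields flat columns directly, which plays the same role as passing `values` to `pivot()`.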


In [10]:
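# Read the enrolment data again and keep only the combined (MF) rows
df3 = pd.read_csv('dataset-C-enrolment.csv')
df3 = df3[df3['sex']=='MF']
df3 = df3.reset_index(drop=True)
# Last four rows; columns 0, 3 and 5 are year, intake and graduates
pivoted_enrolment = df3.iloc[-4:,[0,3,5]].copy()
display(pivoted_enrolment)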

,year,intake,graduates
9,2014,397,356
10,2015,405,355
11,2016,399,351
12,2017,391,375


The inverse of `df.pivot()` is `df.melt()`. Use the `id_vars` parameter to name the identifier columns: they are preserved as columns in the output `df`, while every other column is unpivoted into `variable` and `value` pairs.

In [11]:
melted_enrolment = pd.melt(pivoted_enrolment, id_vars=['year'])
melted_enrolment

,year,variable,value
0,2014,intake,397
1,2015,intake,405
2,2016,intake,399
3,2017,intake,391
4,2014,graduates,356
5,2015,graduates,355
6,2016,graduates,351
7,2017,graduates,375
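
The default `variable` and `value` column names can be replaced, and the set of melted columns can be restricted. The following sketch assumes a small hypothetical table (`wide` and the names `measure` and `count` are illustrative); `var_name`, `value_name` and `value_vars` are standard `melt()` parameters.

```python
import pandas as pd

# Hypothetical wide-format table; the figures are made up.
wide = pd.DataFrame({
    "year": [2020, 2021],
    "intake": [10, 12],
    "graduates": [8, 9],
})

# Name the two generated columns instead of accepting "variable" and "value".
long = wide.melt(id_vars="year", var_name="measure", value_name="count")

# Melt only a subset of the columns.
only_intake = wide.melt(id_vars="year", value_vars=["intake"])

print(list(long.columns))  # ['year', 'measure', 'count']
print(len(long))           # 4
print(len(only_intake))    # 2
```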


To reverse the melt, call `pivot()` again. Because `values` is not specified, the result has hierarchical columns, so keep only the inner level to restore flat column names.

In [12]:
pivoted_enrolment2 = melted_enrolment.pivot(index='year', columns='variable')
display(pivoted_enrolment2)
# Drop the outer 'value' level, then clear the leftover 'variable' label
pivoted_enrolment2.columns = pivoted_enrolment2.columns.droplevel(0)
pivoted_enrolment2.columns.name = None
display(pivoted_enrolment2)

,value,value
variable,graduates,intake
year,,
2014,356,397
2015,355,405
2016,351,399
2017,375,391


,graduates,intake
year,,
2014,356,397
2015,355,405
2016,351,399
2017,375,391
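
A full round trip can be checked on a small inline table. This is a sketch under the assumption of made-up data (`wide` is hypothetical); it passes `values` to `pivot()` to avoid hierarchical columns, then restores the original column order, because the pivoted columns come back sorted alphabetically.

```python
import pandas as pd

# Hypothetical wide-format table; the figures are made up.
wide = pd.DataFrame({
    "year": [2020, 2021],
    "intake": [10, 12],
    "graduates": [8, 9],
})

# Wide -> long.
long = wide.melt(id_vars="year")

# Long -> wide: pivot, turn the year index back into a column,
# clear the "variable" label on the columns, and restore the column order.
restored = long.pivot(index="year", columns="variable", values="value").reset_index()
restored = restored.rename_axis(None, axis="columns")[list(wide.columns)]

print(restored.equals(wide))  # True
```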


**References:**

Python for Data Analysis, 2nd Edition, McKinney (2017)