# Reshaping DataFrames Using Pandas




## Outline
* Wide to Long with `melt`
* Long to Wide with `pivot`




In [13]:
import pandas as pd
from pathlib import Path

# our dataset:
df = pd.read_csv(Path('data/mods_data_wide.csv'))
df

Unnamed: 0,id,status,data_worker,Manager,pursing_work,not_in_field
0,1,Completed,1.0,,,
1,2,Completed,1.0,,,
2,3,Completed,1.0,,,
3,4,Completed,1.0,,,
4,5,Completed,1.0,,,
...,...,...,...,...,...,...
329,330,Completed,1.0,,,
330,331,Completed,,,1.0,
331,332,Completed,,,1.0,
332,333,Completed,1.0,,,


## Wide to Long with `melt`

Our data is currently in "wide format": each variable has its own column.  Use `DataFrame.melt()` to convert it to long format, where the resulting table has a "variable" column containing the original column name and a "value" column containing the value from that column.

We can use this to collapse our "data_worker", "Manager", "pursing_work" (spelled this way in the dataset), and "not_in_field" columns into a single column called "work_type", whose value is one of the 4 types of work.

Pick the identifying columns for the `id_vars` argument, and include all variables you want unpivoted in `value_vars`.
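As a small illustration of how `id_vars` and `value_vars` interact, here is a minimal sketch on a hypothetical three-row frame (the toy data is invented for this example and is not taken from `mods_data_wide.csv`). It suggests that any column listed in neither argument is simply left out of the result.

```python
import pandas as pd

# Hypothetical toy data: one row per person, one flag column per work type.
wide = pd.DataFrame({
    "id": [1, 2, 3],
    "data_worker": [1.0, None, None],
    "Manager": [None, 1.0, None],
    "not_in_field": [None, None, 1.0],
})

# "id" identifies each row; only the two listed columns are unpivoted.
# "not_in_field" is in neither argument, so it does not appear in the output.
long = wide.melt(id_vars="id", value_vars=["data_worker", "Manager"])

print(long)
```

Each original row shows up once per melted column, so the three-row frame yields six rows here.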

In [14]:
df.melt(
    id_vars='id', 
    value_vars=['data_worker', 'Manager'])

Unnamed: 0,id,variable,value
0,1,data_worker,1.0
1,2,data_worker,1.0
2,3,data_worker,1.0
3,4,data_worker,1.0
4,5,data_worker,1.0
...,...,...,...
663,330,Manager,
664,331,Manager,
665,332,Manager,
666,333,Manager,


If `value_vars` is omitted, `melt()` treats every column not listed in `id_vars` as a value column. Use `var_name` and `value_name` to control the names of the two output columns.
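To make the defaults concrete, the following sketch compares the default output column names with custom ones. The frame and its `a`/`b` columns are hypothetical stand-ins for the work-type columns.

```python
import pandas as pd

# Hypothetical toy data; "a" and "b" stand in for the work-type columns.
wide = pd.DataFrame({
    "id": [1, 2],
    "status": ["Completed", "Completed"],
    "a": [1.0, None],
    "b": [None, 1.0],
})

# No value_vars given: every column that is not an id_var gets melted.
default_names = wide.melt(id_vars=["id", "status"])

# Same reshape, but with explicit names for the two new columns.
custom_names = wide.melt(
    id_vars=["id", "status"],
    var_name="work_type",
    value_name="has_data",
)

print(default_names.columns.tolist())
print(custom_names.columns.tolist())
```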

In [4]:
melted = df.melt(
    id_vars=['id', 'status'], 
    var_name='work_type', 
    value_name='has_data')

melted

Unnamed: 0,id,status,work_type,has_data
0,1,Completed,data_worker,1.0
1,2,Completed,data_worker,1.0
2,3,Completed,data_worker,1.0
3,4,Completed,data_worker,1.0
4,5,Completed,data_worker,1.0
...,...,...,...,...
1331,330,Completed,not_in_field,
1332,331,Completed,not_in_field,
1333,332,Completed,not_in_field,
1334,333,Completed,not_in_field,


Predictably, the resulting dataframe has a row count equal to the original number of rows multiplied by the number of columns melted as `value_vars`.

In [5]:
if len(melted) == len(df) * 4 == 1336: # there are 4 work-type columns that we are melting
    print('okay!')
else:
    print('ERROR')

okay!
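The same check can be written without hard-coding the column count or the expected total. This is only a sketch on a hypothetical toy frame; the names `id_vars`, `value_vars`, and `expected_rows` are illustrative.

```python
import pandas as pd

# Hypothetical toy data: 4 rows, 3 columns to melt.
wide = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "a": [1.0, None, None, None],
    "b": [None, 1.0, None, 1.0],
    "c": [None, None, 1.0, None],
})

id_vars = ["id"]
# Derive the melted columns instead of counting them by hand.
value_vars = [col for col in wide.columns if col not in id_vars]

long = wide.melt(id_vars=id_vars, value_vars=value_vars)

# One output row per (input row, melted column) pair.
expected_rows = len(wide) * len(value_vars)
assert len(long) == expected_rows
```

Deriving `value_vars` from the frame keeps the check valid if columns are added or removed later.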


We now have a column called `work_type` that holds the types of work. However, every id also appears in rows where the 'has_data' field is empty (NaN), one for each work type that does not apply.  We only want the rows where the value is 1.  Remember selecting data from a dataframe?

In [150]:
melted = melted[melted.has_data == 1]

melted

Unnamed: 0,id,status,work_type,has_data
0,1,Completed,data_worker,1.0
1,2,Completed,data_worker,1.0
2,3,Completed,data_worker,1.0
3,4,Completed,data_worker,1.0
4,5,Completed,data_worker,1.0
...,...,...,...,...
1228,227,Completed,not_in_field,1.0
1236,235,Completed,not_in_field,1.0
1275,274,Completed,not_in_field,1.0
1276,275,Completed,not_in_field,1.0
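An alternative to the boolean mask is `DataFrame.dropna()` with `subset`. The sketch below, on hypothetical toy data, suggests the two approaches keep the same rows when the flag column contains only 1 or NaN. That assumption is what makes them interchangeable here.

```python
import pandas as pd

# Hypothetical long-format data where has_data is either 1.0 or NaN.
long = pd.DataFrame({
    "id": [1, 2, 1, 2],
    "work_type": ["data_worker", "data_worker", "Manager", "Manager"],
    "has_data": [1.0, None, None, 1.0],
})

# Option 1: boolean mask, as in the cell above.
kept_by_mask = long[long["has_data"] == 1]

# Option 2: drop rows whose has_data is missing.
kept_by_dropna = long.dropna(subset=["has_data"])

print(kept_by_mask.equals(kept_by_dropna))
```

If the flag column could hold other values, such as 0, the mask is the safer choice, because `dropna` would keep those rows.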


Our 'has_data' column is now all 1's, so we might as well drop it.

In [151]:
melted = melted.drop(columns='has_data')
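The melt, filter, and drop steps can also be chained into a single expression. This is one possible sketch on hypothetical toy data; the final `sort_values` and `reset_index` calls are optional tidying that the notebook itself does not perform.

```python
import pandas as pd

# Hypothetical toy data shaped like the wide frame in this notebook.
wide = pd.DataFrame({
    "id": [1, 2, 3],
    "status": ["Completed", "Completed", "Completed"],
    "data_worker": [1.0, None, 1.0],
    "Manager": [None, 1.0, None],
})

tidy = (
    wide.melt(id_vars=["id", "status"], var_name="work_type", value_name="has_data")
    .dropna(subset=["has_data"])   # keep only the flags that are set
    .drop(columns="has_data")      # the flag column is now constant
    .sort_values("id")             # optional: restore id order
    .reset_index(drop=True)        # optional: clean 0..n-1 index
)

print(tidy)
```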

In [152]:
df # original dataframe

Unnamed: 0,id,status,data_worker,Manager,pursing_work,not_in_field
0,1,Completed,1.0,,,
1,2,Completed,1.0,,,
2,3,Completed,1.0,,,
3,4,Completed,1.0,,,
4,5,Completed,1.0,,,
...,...,...,...,...,...,...
329,330,Completed,1.0,,,
330,331,Completed,,,1.0,
331,332,Completed,,,1.0,
332,333,Completed,1.0,,,


In [153]:
melted # finished conversion

Unnamed: 0,id,status,work_type
0,1,Completed,data_worker
1,2,Completed,data_worker
2,3,Completed,data_worker
3,4,Completed,data_worker
4,5,Completed,data_worker
...,...,...,...
1228,227,Completed,not_in_field
1236,235,Completed,not_in_field
1275,274,Completed,not_in_field
1276,275,Completed,not_in_field


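Let's look at a couple of individual rows to check our work. The output below was captured before the NaN rows were filtered out and `has_data` was dropped, so each id still appears once per work type: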

In [12]:
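def check(row_id):
    """Print the original and melted rows for one id, for side-by-side comparison."""
    # `row_id` avoids shadowing Python's built-in `id`
    orig_row = df[df.id == row_id]
    melted_row = melted[melted.id == row_id]
    print(orig_row)
    print(' ')
    print(melted_row)
    print('\n')

check(129)
check(301)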

      id     status  data_worker  Manager  pursing_work  not_in_field
128  129  Completed          NaN      1.0           NaN           NaN
 
       id     status     work_type  has_data
128   129  Completed   data_worker       NaN
462   129  Completed       Manager       1.0
796   129  Completed  pursing_work       NaN
1130  129  Completed  not_in_field       NaN


      id     status  data_worker  Manager  pursing_work  not_in_field
300  301  Completed          1.0      NaN           NaN           NaN
 
       id     status     work_type  has_data
300   301  Completed   data_worker       1.0
634   301  Completed       Manager       NaN
968   301  Completed  pursing_work       NaN
1302  301  Completed  not_in_field       NaN
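Spot checks like the ones above can be complemented by an automated check over every row. The sketch below uses hypothetical toy data and illustrative names (`flag_cols`, `n_flags`, `wide_by_id`). It assumes the flags are either 1 or NaN, as in this dataset.

```python
import pandas as pd

# Hypothetical toy data shaped like the wide frame in this notebook.
wide = pd.DataFrame({
    "id": [1, 2, 3],
    "data_worker": [1.0, None, 1.0],
    "Manager": [None, 1.0, None],
})
flag_cols = ["data_worker", "Manager"]

long = (
    wide.melt(id_vars="id", value_vars=flag_cols,
              var_name="work_type", value_name="has_data")
    .dropna(subset=["has_data"])
    .drop(columns="has_data")
)

# Every flag set in the wide frame should produce exactly one long row.
n_flags = int(wide[flag_cols].notna().sum().sum())
assert len(long) == n_flags

# Every long row should point back at a cell that is set in the wide frame.
wide_by_id = wide.set_index("id")
for row in long.itertuples(index=False):
    assert wide_by_id.at[row.id, row.work_type] == 1
```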




## Long to Wide with `pivot`

Let's say we want to go in the opposite direction.  Our `melted` dataframe is in long form, and we want to build a new dataframe in wide form (like the dataframe we started with).
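As a minimal sketch of what `DataFrame.pivot()` does with its three arguments, consider this hypothetical three-row long frame: `index` supplies the row labels, each distinct value in `columns` becomes a new column, and `values` fills the cells.

```python
import pandas as pd

# Hypothetical long-format data: one row per (id, work_type) pair.
long = pd.DataFrame({
    "id": [1, 2, 3],
    "work_type": ["data_worker", "Manager", "data_worker"],
    "has_data": [1, 1, 1],
})

wide = long.pivot(index="id", columns="work_type", values="has_data")

# Combinations that never occur in the long data become NaN.
print(wide)
```

The new columns come out in sorted order rather than in their original order, and the column index carries the name `work_type`.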

In [157]:
# re-add our 'has_data' column: every remaining row represents a flag that was set
melted2 = melted.copy()
melted2['has_data'] = 1

pivoted = melted2.pivot(
    index='id',
    columns='work_type', 
    values='has_data').reset_index()

pivoted.columns.name = None # this is unnecessary, but the column index name is confusing when displayed in Jupyter

pivoted

Unnamed: 0,id,Manager,data_worker,not_in_field,pursing_work
0,1,,1.0,,
1,2,,1.0,,
2,3,,1.0,,
3,4,,1.0,,
4,5,,1.0,,
...,...,...,...,...,...
327,330,,1.0,,
328,331,,,,1.0
329,332,,,,1.0
330,333,,1.0,,


In [146]:
df

Unnamed: 0,id,status,data_worker,Manager,pursing_work,not_in_field
0,1,Completed,1.0,,,
1,2,Completed,1.0,,,
2,3,Completed,1.0,,,
3,4,Completed,1.0,,,
4,5,Completed,1.0,,,
...,...,...,...,...,...,...
329,330,Completed,1.0,,,
330,331,Completed,,,1.0,
331,332,Completed,,,1.0,
332,333,Completed,1.0,,,
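Comparing the two tables, the pivoted frame has lost the `status` column and its columns are in a different order. One possible way to get closer to the original is sketched below on hypothetical toy data. It assumes a pandas version in which `pivot()` accepts a list for `index` (1.1 or later) and reorders the columns to match the original frame.

```python
import pandas as pd

# Hypothetical toy data shaped like the wide frame in this notebook.
wide = pd.DataFrame({
    "id": [1, 2, 3],
    "status": ["Completed", "Completed", "Completed"],
    "data_worker": [1.0, None, 1.0],
    "Manager": [None, 1.0, None],
})

long = (
    wide.melt(id_vars=["id", "status"], var_name="work_type", value_name="has_data")
    .dropna(subset=["has_data"])
)

# Using both identifier columns as the index carries "status" through the pivot.
restored = long.pivot(
    index=["id", "status"],
    columns="work_type",
    values="has_data",
).reset_index()

restored.columns.name = None        # cosmetic, as in the cell above
restored = restored[wide.columns]   # put columns back in the original order

print(restored.equals(wide))
```

The round trip is only exact when every row has at least one flag set. A row whose flags are all NaN is removed by the filtering step and does not reappear after pivoting, which is consistent with the pivoted frame above having fewer rows than `df`.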
