In [1]:
import pandas as pd

In [11]:
trials = pd.read_csv('trials_01.csv', sep=";") 

In [12]:
trials

Unnamed: 0,id,treatment,gender,response
0,1,A,F,5
1,2,A,M,3
2,3,B,F,8
3,4,B,M,9


In [13]:
trials.pivot(index="treatment", columns="gender", values="response")

gender,F,M
treatment,Unnamed: 1_level_1,Unnamed: 2_level_1
A,5,3
B,8,9


In [14]:
trials.pivot(index="treatment", columns="gender")

Unnamed: 0_level_0,id,id,response,response
gender,F,M,F,M
treatment,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
A,1,2,5,3
B,3,4,8,9


## Pivoting a single variable
Suppose you started a blog for a band, and you would like to log how many visitors you have had and how many signed up for your newsletter. To help design the tours later, you track where the visitors are. A DataFrame called users consisting of this information has been pre-loaded for you.

Inspect users in the IPython Shell and make a note of which variable you want to use to index the rows ('weekday'), which variable you want to use to index the columns ('city'), and which variable will populate the values in the cells ('visitors'). Try to visualize what the result should be.

For example, in the video, Dhavide used 'treatment' to index the rows, 'gender' to index the columns, and 'response' to populate the cells. Prior to pivoting, the DataFrame looked like the trials output of In [12] above; after pivoting, it looked like the output of In [13].

In this exercise, your job is to pivot users so that the focus is on 'visitors', with the columns indexed by 'city' and the rows indexed by 'weekday'.
- Pivot the users DataFrame with the rows indexed by 'weekday', the columns indexed by 'city', and the values populated with 'visitors'.
- Print the pivoted DataFrame. This has been done for you, so hit 'Submit Answer' to view the result.

In [15]:
users = pd.read_csv("site_visitors.csv", sep=";")

In [16]:
users

Unnamed: 0,weekday,city,visitors,signups
0,Sun,Austin,139,7
1,Sun,Dallas,237,12
2,Mon,Austin,326,3
3,Mon,Dallas,456,5


In [20]:
visitors_pivot = users.pivot(index="weekday", columns = "city", values='visitors')

In [21]:
visitors_pivot

city,Austin,Dallas
weekday,Unnamed: 1_level_1,Unnamed: 2_level_1
Mon,326,456
Sun,139,237


## Pivoting all variables
If you do not select any particular variables, all of them will be pivoted. In this case - with the users DataFrame - both 'visitors' and 'signups' will be pivoted, creating hierarchical column labels.

You will explore this for yourself now in this exercise.

- Pivot the users DataFrame with the 'signups' indexed by 'weekday' in the rows and 'city' in the columns.
- Print the new DataFrame. This has been done for you.
- Pivot the users DataFrame with both 'signups' and 'visitors' pivoted - that is, all the variables. This will happen automatically if you do not specify an argument for the values parameter of .pivot().
- Print the pivoted DataFrame. This has been done for you, so hit 'Submit Answer' to see the result.

In [22]:
signups_pivot = users.pivot(index='weekday', columns='city', values='signups')

In [23]:
signups_pivot

city,Austin,Dallas
weekday,Unnamed: 1_level_1,Unnamed: 2_level_1
Mon,3,5
Sun,7,12


In [24]:
pivot = users.pivot(index='weekday', columns='city')

In [25]:
pivot

Unnamed: 0_level_0,visitors,visitors,signups,signups
city,Austin,Dallas,Austin,Dallas
weekday,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
Mon,326,456,3,5
Sun,139,237,7,12
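
A minimal sketch of how the hierarchical column labels produced above can be read back. It assumes the users data is rebuilt by hand from the output shown earlier (the CSV file itself is not reproduced here), and the names visitors_only and dallas_signups are illustrative only.

```python
import pandas as pd

# Assumption: users rebuilt by hand from the output shown above.
users = pd.DataFrame({
    "weekday": ["Sun", "Sun", "Mon", "Mon"],
    "city": ["Austin", "Dallas", "Austin", "Dallas"],
    "visitors": [139, 237, 326, 456],
    "signups": [7, 12, 3, 5],
})

# Pivot every remaining variable: columns become (variable, city) pairs.
pivot = users.pivot(index="weekday", columns="city")

# Selecting the outer label returns the single-variable pivot.
visitors_only = pivot["visitors"]

# A (variable, city) tuple selects one column as a Series indexed by weekday.
dallas_signups = pivot[("signups", "Dallas")]

print(visitors_only)
print(dallas_signups)
```

Selecting the outer label 'visitors' gives the same cells as passing values='visitors' to .pivot() directly.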


# Stacking and unstacking dataframes

In [26]:
# working with the trials DataFrame defined above
trials

Unnamed: 0,id,treatment,gender,response
0,1,A,F,5
1,2,A,M,3
2,3,B,F,8
3,4,B,M,9


In [28]:
# make a multi-level (hierarchical) index
trials_md = trials.set_index(['treatment', 'gender'])

In [29]:
trials_md

Unnamed: 0_level_0,Unnamed: 1_level_0,id,response
treatment,gender,Unnamed: 2_level_1,Unnamed: 3_level_1
A,F,1,5
A,M,2,3
B,F,3,8
B,M,4,9


In [30]:
trials_md.unstack(level='gender')

Unnamed: 0_level_0,id,id,response,response
gender,F,M,F,M
treatment,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
A,1,2,5,3
B,3,4,8,9
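
The unstacked result above has the same layout as the earlier .pivot() output without a values argument. The following sketch checks that on a hand-built copy of trials (an assumption, since the CSV file is not reproduced here); via_pivot and via_unstack are illustrative names.

```python
import pandas as pd

# Assumption: trials rebuilt by hand from the output shown above.
trials = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "treatment": ["A", "A", "B", "B"],
    "gender": ["F", "M", "F", "M"],
    "response": [5, 3, 8, 9],
})

# Route 1: pivot all remaining variables.
via_pivot = trials.pivot(index="treatment", columns="gender")

# Route 2: build a two-level index, then move 'gender' into the columns.
via_unstack = trials.set_index(["treatment", "gender"]).unstack(level="gender")

same = via_pivot.equals(via_unstack)
print(same)
```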


In [31]:
trials_md.unstack(level=1)

Unnamed: 0_level_0,id,id,response,response
gender,F,M,F,M
treatment,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
A,1,2,5,3
B,3,4,8,9


### Stacking dataframes

In [35]:
trials_by_gender = trials_md.unstack(level="gender")
trials_by_gender

Unnamed: 0_level_0,id,id,response,response
gender,F,M,F,M
treatment,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
A,1,2,5,3
B,3,4,8,9


In [37]:
stacked = trials_by_gender.stack(level = 'gender')

In [38]:
stacked

Unnamed: 0_level_0,Unnamed: 1_level_0,id,response
treatment,gender,Unnamed: 2_level_1,Unnamed: 3_level_1
A,F,1,5
A,M,2,3
B,F,3,8
B,M,4,9


### Swapping levels

In [39]:
swapped = stacked.swaplevel(0, 1)

In [40]:
swapped

Unnamed: 0_level_0,Unnamed: 1_level_0,id,response
gender,treatment,Unnamed: 2_level_1,Unnamed: 3_level_1
F,A,1,5
M,A,2,3
F,B,3,8
M,B,4,9


In [41]:
sorted_trials = swapped.sort_index()

In [42]:
sorted_trials

Unnamed: 0_level_0,Unnamed: 1_level_0,id,response
gender,treatment,Unnamed: 2_level_1,Unnamed: 3_level_1
F,A,1,5
F,B,3,8
M,A,2,3
M,B,4,9


### Stacking & unstacking I
You are now going to practice stacking and unstacking DataFrames. The users DataFrame you have been working with in this chapter has been pre-loaded for you, this time with a MultiIndex. Explore it in the IPython Shell to see the data layout. Pay attention to the index, and notice that the index levels are ['city', 'weekday']. So 'weekday' - the second entry - has position 1. This position is what corresponds to the level parameter in .stack() and .unstack() calls. Alternatively, you can specify 'weekday' as the level instead of its position.

Your job in this exercise is to unstack users by 'weekday'. You will then use .stack() on the unstacked DataFrame to see if you get back the original layout of users.

- Define a DataFrame byweekday with the 'weekday' level of users unstacked.
- Print the byweekday DataFrame to see the new data layout. This has been done for you.
- Stack byweekday by 'weekday' and print it to check if you get the same layout as the original users DataFrame.

In [43]:
users

Unnamed: 0,weekday,city,visitors,signups
0,Sun,Austin,139,7
1,Sun,Dallas,237,12
2,Mon,Austin,326,3
3,Mon,Dallas,456,5


In [60]:
users_md = users.set_index(["city", "weekday"])

In [61]:
users_md = users_md.sort_index()

In [62]:
users_md

Unnamed: 0_level_0,Unnamed: 1_level_0,visitors,signups
city,weekday,Unnamed: 2_level_1,Unnamed: 3_level_1
Austin,Mon,326,3
Austin,Sun,139,7
Dallas,Mon,456,5
Dallas,Sun,237,12


In [63]:
users_md.index.names

FrozenList(['city', 'weekday'])

In [67]:
byweekday = users_md.unstack(level=1)

In [68]:
byweekday

Unnamed: 0_level_0,visitors,visitors,signups,signups
weekday,Mon,Sun,Mon,Sun
city,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
Austin,326,139,3,7
Dallas,456,237,5,12


In [69]:
byweekday.stack(level = 'weekday')

Unnamed: 0_level_0,Unnamed: 1_level_0,visitors,signups
city,weekday,Unnamed: 2_level_1,Unnamed: 3_level_1
Austin,Mon,326,3
Austin,Sun,139,7
Dallas,Mon,456,5
Dallas,Sun,237,12
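
The round trip above can also be checked programmatically rather than by eye. This is a sketch under the assumption that users is rebuilt by hand from the output shown earlier; the helper normalized is hypothetical and only sorts both axes so that row and column order cannot affect the comparison.

```python
import pandas as pd

# Assumption: users rebuilt by hand from the output shown above.
users = pd.DataFrame({
    "weekday": ["Sun", "Sun", "Mon", "Mon"],
    "city": ["Austin", "Dallas", "Austin", "Dallas"],
    "visitors": [139, 237, 326, 456],
    "signups": [7, 12, 3, 5],
})
users_md = users.set_index(["city", "weekday"]).sort_index()

# Move 'weekday' from the row index into the columns, then back again.
byweekday = users_md.unstack(level="weekday")
restored = byweekday.stack(level="weekday")


def normalized(df):
    """Sort rows and columns so ordering does not affect the comparison."""
    return df.sort_index().sort_index(axis=1)


round_trip_ok = normalized(restored).equals(normalized(users_md))
print(round_trip_ok)
```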


### Stacking & unstacking II
You are now going to continue working with the users DataFrame. As always, first explore it in the IPython Shell to see the layout and note the index.

Your job in this exercise is to unstack and then stack the 'city' level, as you did previously for 'weekday'. Note that you won't get the same DataFrame.

In [70]:
users_md1 = users.set_index(["city", "weekday"])

In [73]:
users_md1 = users_md1.sort_index()

In [74]:
users_md1

Unnamed: 0_level_0,Unnamed: 1_level_0,visitors,signups
city,weekday,Unnamed: 2_level_1,Unnamed: 3_level_1
Austin,Mon,326,3
Austin,Sun,139,7
Dallas,Mon,456,5
Dallas,Sun,237,12


- Define a DataFrame bycity with the 'city' level of users unstacked.
- Print the bycity DataFrame to see the new data layout. This has been done for you.
- Stack bycity by 'city' and print it to check if you get the same layout as the original users DataFrame.

In [75]:
bycity = users_md1.unstack(level='city')

In [76]:
bycity

Unnamed: 0_level_0,visitors,visitors,signups,signups
city,Austin,Dallas,Austin,Dallas
weekday,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
Mon,326,456,3,5
Sun,139,237,7,12


In [77]:
print(bycity.stack(level='city'))

                visitors  signups
weekday city                     
Mon     Austin       326        3
        Dallas       456        5
Sun     Austin       139        7
        Dallas       237       12


### Restoring the index order
Continuing from the previous exercise, you will now use .swaplevel(0, 1) to flip the index levels. Note that the result won't be sorted; to sort it, follow up with .sort_index(), which gives back the original DataFrame. Sorting matters because label-based slicing on an unsorted MultiIndex can fail.

To begin, print both users and bycity in the IPython Shell. The goal here is to convert bycity back to something that looks like users.

- Define a DataFrame newusers with the 'city' level stacked back into the index of bycity.
- Swap the levels of the index of newusers.
- Print newusers and verify that the index is not sorted. This has been done for you.
- Sort the index of newusers.
- Print newusers and verify that the index is now sorted. This has been done for you.
- Assert that newusers equals users. This has been done for you, so hit 'Submit Answer' to see the result.
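
To illustrate why the sort step matters, here is a small sketch of a label slice failing on an unsorted MultiIndex and succeeding once the index is sorted. It assumes users is rebuilt by hand from the output shown earlier; the unsorted frame is built directly with set_index rather than via stack and swaplevel, and the variable names are illustrative.

```python
import pandas as pd

# Assumption: users rebuilt by hand from the output shown above.
users = pd.DataFrame({
    "weekday": ["Sun", "Sun", "Mon", "Mon"],
    "city": ["Austin", "Dallas", "Austin", "Dallas"],
    "visitors": [139, 237, 326, 456],
    "signups": [7, 12, 3, 5],
})

# The rows keep their original order, so the (city, weekday) index is unsorted.
unsorted = users.set_index(["city", "weekday"])
print(unsorted.index.is_monotonic_increasing)

# Slicing by label on the unsorted index raises an error.
try:
    unsorted.loc["Austin":"Dallas"]
    slice_failed = False
except pd.errors.UnsortedIndexError:
    slice_failed = True
print(slice_failed)

# After sorting, the same slice works.
sorted_users = unsorted.sort_index()
austin_to_dallas = sorted_users.loc["Austin":"Dallas"]
print(austin_to_dallas)
```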

In [79]:
bycity

Unnamed: 0_level_0,visitors,visitors,signups,signups
city,Austin,Dallas,Austin,Dallas
weekday,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
Mon,326,456,3,5
Sun,139,237,7,12


In [80]:
# Stack 'city' back into the index of bycity: newusers
newusers = bycity.stack(level="city")

In [84]:
# Swap the levels of the index of newusers: newusers
newusers = newusers.swaplevel(0,1)

In [86]:
# Sort the index of newusers: newusers
newusers = newusers.sort_index()

In [87]:
newusers

Unnamed: 0_level_0,Unnamed: 1_level_0,visitors,signups
city,weekday,Unnamed: 2_level_1,Unnamed: 3_level_1
Austin,Mon,326,3
Austin,Sun,139,7
Dallas,Mon,456,5
Dallas,Sun,237,12


In [90]:
# Verify that the new DataFrame is equal to the original (here it's users_md1)
print(newusers.equals(users_md1))

True


# Melting dataframes

In [91]:
trials = pd.read_csv("trials_01.csv", sep=';')

In [92]:
trials

Unnamed: 0,id,treatment,gender,response
0,1,A,F,5
1,2,A,M,3
2,3,B,F,8
3,4,B,M,9


In [94]:
#let's pivot
trials.pivot(index="treatment", columns = 'gender', values = 'response')

gender,F,M
treatment,Unnamed: 1_level_1,Unnamed: 2_level_1
A,5,3
B,8,9


In [96]:
new_trials = pd.read_csv("trials_02.csv", sep=";")

In [97]:
new_trials

Unnamed: 0,treatment,F,M
0,A,5,3
1,B,8,9


In [99]:
#this won't give us the result we wanted
pd.melt(new_trials)

Unnamed: 0,variable,value
0,treatment,A
1,treatment,B
2,F,5
3,F,8
4,M,3
5,M,9


In [101]:
# we need to specify which columns should remain as identifier columns
pd.melt(new_trials, id_vars=["treatment"])

Unnamed: 0,treatment,variable,value
0,A,F,5
1,B,F,8
2,A,M,3
3,B,M,9


In [102]:
# explicitly list which columns are to be converted to values
pd.melt(new_trials, id_vars=['treatment'], value_vars=["F", "M"])

Unnamed: 0,treatment,variable,value
0,A,F,5
1,B,F,8
2,A,M,3
3,B,M,9


In [104]:
# the default column names 'variable' and 'value' above are not very descriptive, so set them explicitly with var_name and value_name
pd.melt(new_trials, id_vars = ['treatment'], var_name='gender', value_name = 'response')

Unnamed: 0,treatment,gender,response
0,A,F,5
1,B,F,8
2,A,M,3
3,B,M,9


### Adding names for readability
You are now going to practice melting DataFrames. A DataFrame called visitors_by_city_weekday has been pre-loaded for you. Explore it in the IPython Shell and see that it is the users DataFrame from previous exercises with the rows indexed by 'weekday', columns indexed by 'city', and values populated with 'visitors'.

Recall from the video that the goal of melting is to restore a pivoted DataFrame to its original form, or to change it from a wide shape to a long shape. You can explicitly specify the columns that should remain in the reshaped DataFrame with id_vars, and list which columns to convert into values with value_vars. As Dhavide demonstrated, if you don't pass a name to the values in pd.melt(), you will lose the name of your variable. You can fix this by using the value_name keyword argument.

Your job in this exercise is to melt visitors_by_city_weekday to move the city names from the column labels to values in a single column called 'city'. If you were to use just pd.melt(visitors_by_city_weekday), you would obtain a DataFrame with only 'city' and 'value' columns, in which the 'weekday' information is lost (see the output of In [112] below).

Therefore, you have to specify the id_vars keyword argument to ensure that 'weekday' is retained in the reshaped DataFrame, and the value_name keyword argument to change the name of value to visitors.

- Reset the index of visitors_by_city_weekday with .reset_index().
- Print visitors_by_city_weekday and verify that you have just a range index, 0, 1, 2, 3. This has been done for you.
- Melt visitors_by_city_weekday to move the city names from the column labels to values in a single column called 'city'. Assign the result to visitors.
- Print visitors to check that the city values are in a single column now and that the dataframe is longer and skinnier.


In [105]:
users


Unnamed: 0,weekday,city,visitors,signups
0,Sun,Austin,139,7
1,Sun,Dallas,237,12
2,Mon,Austin,326,3
3,Mon,Dallas,456,5


In [110]:
# now getting the data for the exercise: the data needs to be pivoted first
visitors_by_city_weekday = users.pivot(index="weekday", columns = 'city', values = "visitors")

In [111]:
visitors_by_city_weekday

city,Austin,Dallas
weekday,Unnamed: 1_level_1,Unnamed: 2_level_1
Mon,326,456
Sun,139,237


### Now the exercise

In [112]:
pd.melt(visitors_by_city_weekday)

Unnamed: 0,city,value
0,Austin,326
1,Austin,139
2,Dallas,456
3,Dallas,237


In [114]:
# Reset the index: visitors_by_city_weekday
visitors_by_city_weekday = visitors_by_city_weekday.reset_index()

In [115]:
# Print visitors_by_city_weekday
print(visitors_by_city_weekday)

city weekday  Austin  Dallas
0        Mon     326     456
1        Sun     139     237


In [120]:
visitors = pd.melt(visitors_by_city_weekday, id_vars = ["weekday"], value_name = "visitors")

In [121]:
visitors

Unnamed: 0,weekday,city,visitors
0,Mon,Austin,326
1,Sun,Austin,139
2,Mon,Dallas,456
3,Sun,Dallas,237
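
As a sketch of the claim that melting restores a pivoted DataFrame to its original long form, the following compares the melted result with the matching columns of users. It assumes users is rebuilt by hand from the output shown earlier; var_name is passed explicitly here, although pd.melt would also pick up 'city' from the name of the column index. The names wide, long_df, original and restored are illustrative.

```python
import pandas as pd

# Assumption: users rebuilt by hand from the output shown above.
users = pd.DataFrame({
    "weekday": ["Sun", "Sun", "Mon", "Mon"],
    "city": ["Austin", "Dallas", "Austin", "Dallas"],
    "visitors": [139, 237, 326, 456],
    "signups": [7, 12, 3, 5],
})

# Long -> wide: one row per weekday, one column per city.
wide = users.pivot(index="weekday", columns="city", values="visitors")

# Wide -> long: keep 'weekday', fold the city columns into 'city'/'visitors'.
long_df = pd.melt(
    wide.reset_index(),
    id_vars=["weekday"],
    var_name="city",
    value_name="visitors",
)

# Compare the rows after putting both frames in the same order.
key = ["city", "weekday"]
original = users[["weekday", "city", "visitors"]].sort_values(key).reset_index(drop=True)
restored = long_df.sort_values(key).reset_index(drop=True)
same_rows = original.values.tolist() == restored.values.tolist()
print(same_rows)
```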


### Going from wide to long
You can move multiple columns into a single column (making the data long and skinny) by "melting" multiple columns. In this exercise, you will practice doing this.

The users DataFrame has been pre-loaded for you. As always, explore it in the IPython Shell and note the index.

- Define a DataFrame skinny where you melt the 'visitors' and 'signups' columns of users into a single column.
- Print skinny to verify the results. Note the value column that had the cell values in users.

In [122]:
users = pd.read_csv("users_wide.csv", sep=";")

In [123]:
users

Unnamed: 0,weekday,city,visitors,signups
0,Sun,Austin,139,7
1,Sun,Dallas,237,12
2,Mon,Austin,326,3
3,Mon,Dallas,456,5


In [127]:
# Melt users: skinny
skinny = pd.melt(users, id_vars=["weekday", "city"])

In [128]:
skinny

Unnamed: 0,weekday,city,variable,value
0,Sun,Austin,visitors,139
1,Sun,Dallas,visitors,237
2,Mon,Austin,visitors,326
3,Mon,Dallas,visitors,456
4,Sun,Austin,signups,7
5,Sun,Dallas,signups,12
6,Mon,Austin,signups,3
7,Mon,Dallas,signups,5


### Obtaining key-value pairs with melt()
Sometimes, all you need is some key-value pairs, and the context does not matter. If said context is in the index, you can easily obtain what you want. For example, in the users DataFrame, the visitors and signups columns lend themselves well to being represented as key-value pairs. So if you set a hierarchical index from the 'city' and 'weekday' columns, you can easily extract key-value pairs for the 'visitors' and 'signups' columns by melting the result and specifying col_level=0.

- Set the index of users to ['city', 'weekday'].
- Print the DataFrame users_idx to see the new index.
- Obtain the key-value pairs corresponding to visitors and signups by melting users_idx with the keyword argument col_level=0.

In [129]:
users

Unnamed: 0,weekday,city,visitors,signups
0,Sun,Austin,139,7
1,Sun,Dallas,237,12
2,Mon,Austin,326,3
3,Mon,Dallas,456,5


In [133]:
users_idx = users.set_index(["city", "weekday"])

In [134]:
users_idx

Unnamed: 0_level_0,Unnamed: 1_level_0,visitors,signups
city,weekday,Unnamed: 2_level_1,Unnamed: 3_level_1
Austin,Sun,139,7
Dallas,Sun,237,12
Austin,Mon,326,3
Dallas,Mon,456,5


In [135]:
# Obtain the key-value pairs: kv_pairs
kv_pairs = pd.melt(users_idx, col_level=0)

In [136]:
kv_pairs

Unnamed: 0,variable,value
0,visitors,139
1,visitors,237
2,visitors,326
3,visitors,456
4,signups,7
5,signups,12
6,signups,3
7,signups,5


# Pivot tables

In [137]:
more_trials = pd.read_csv("trials_03.csv", sep=";")

In [138]:
more_trials

Unnamed: 0,id,treatment,gender,response
0,1,A,F,5
1,2,A,M,3
2,3,A,M,8
3,4,A,F,9
4,5,B,F,1
5,6,B,M,8
6,7,B,F,4
7,8,B,F,6


In [139]:
# this raises an error because several (treatment, gender) pairs occur more than once (e.g. A/F for ids 1 and 4)
more_trials.pivot(index='treatment',columns='gender',values='response') 

ValueError: Index contains duplicate entries, cannot reshape
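
One way to see which rows cause this error is to flag repeated (treatment, gender) pairs before pivoting. This is a sketch under the assumption that more_trials is rebuilt by hand from the output shown above; dup_mask and pivot_failed are illustrative names.

```python
import pandas as pd

# Assumption: more_trials rebuilt by hand from the output shown above.
more_trials = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7, 8],
    "treatment": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "gender": ["F", "M", "M", "F", "F", "M", "F", "F"],
    "response": [5, 3, 8, 9, 1, 8, 4, 6],
})

# Mark every row whose (treatment, gender) pair appears more than once.
dup_mask = more_trials.duplicated(subset=["treatment", "gender"], keep=False)
print(more_trials[dup_mask])

# .pivot() cannot place two responses in the same cell, so it raises.
try:
    more_trials.pivot(index="treatment", columns="gender", values="response")
    pivot_failed = False
except ValueError:
    pivot_failed = True
print(pivot_failed)
```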

In [140]:
# here we need to use pivot_table to transform the data
more_trials.pivot_table(index="treatment", columns='gender', values='response')
# by default, duplicate entries are reduced to their mean

gender,F,M
treatment,Unnamed: 1_level_1,Unnamed: 2_level_1
A,7.0,5.5
B,3.666667,8.0
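
To make the default reduction concrete, this sketch compares the pivot table with an explicit group-wise mean. It assumes more_trials is rebuilt by hand from the output shown above; by_pivot and by_group are illustrative names.

```python
import pandas as pd

# Assumption: more_trials rebuilt by hand from the output shown above.
more_trials = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7, 8],
    "treatment": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "gender": ["F", "M", "M", "F", "F", "M", "F", "F"],
    "response": [5, 3, 8, 9, 1, 8, 4, 6],
})

# Pivot table with the default aggregation.
by_pivot = more_trials.pivot_table(index="treatment", columns="gender", values="response")

# The same numbers computed by hand: mean per (treatment, gender) group,
# then 'gender' moved from the row index into the columns.
by_group = (
    more_trials.groupby(["treatment", "gender"])["response"]
    .mean()
    .unstack(level="gender")
)

max_diff = (by_pivot - by_group).abs().to_numpy().max()
print(max_diff)
```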


In [142]:
# to aggregate with something other than the mean, pass aggfunc
more_trials.pivot_table(index="treatment", columns='gender', values='response', aggfunc="count")
# with aggfunc="count" the result is a frequency table

gender,F,M
treatment,Unnamed: 1_level_1,Unnamed: 2_level_1
A,2,2
B,3,1


### Setting up a pivot table
Recall from the video that a pivot table allows you to see all of your variables as a function of two other variables. In this exercise, you will use the .pivot_table() method to see how the users DataFrame entries appear when presented as functions of the 'weekday' and 'city' columns. That is, with the rows indexed by 'weekday' and the columns indexed by 'city'.

Before using the pivot table, print the users DataFrame in the IPython Shell and observe the layout.

- Use a pivot table to index the rows of users by 'weekday' and the columns of users by 'city'. These correspond to the index and columns parameters of .pivot_table().
- Print by_city_day. This has been done for you, so hit 'Submit Answer' to see the result.

In [143]:
# Create the DataFrame with the appropriate pivot table: by_city_day
by_city_day = users.pivot_table(index="weekday", columns = "city")


In [145]:

# Print by_city_day
by_city_day

Unnamed: 0_level_0,signups,signups,visitors,visitors
city,Austin,Dallas,Austin,Dallas
weekday,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
Mon,3,5,326,456
Sun,7,12,139,237


### Using other aggregations in pivot tables
You can also use aggregation functions within a pivot table by specifying the aggfunc parameter. In this exercise, you will practice using the 'count' and len aggregation functions on the users DataFrame. They produce the same result here because users has no missing values: 'count' skips missing values, whereas len counts every row.

- Define a DataFrame count_by_weekday1 that shows the count of each column with the parameter aggfunc='count'. The index here is 'weekday'.
- Print count_by_weekday1. This has been done for you.
- Replace aggfunc='count' with aggfunc=len and verify you obtain the same result.


In [152]:
# Use a pivot table to display the count of each column: count_by_weekday1
count_by_weekday1 = users.pivot_table(index="weekday", aggfunc="count")

In [153]:
count_by_weekday1


Unnamed: 0_level_0,city,signups,visitors
weekday,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Mon,2,2,2
Sun,2,2,2


In [154]:
# Replace 'aggfunc='count'' with 'aggfunc=len': count_by_weekday2
count_by_weekday2 = users.pivot_table(index="weekday", aggfunc=len)

In [155]:
count_by_weekday2

Unnamed: 0_level_0,city,signups,visitors
weekday,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Mon,2,2,2
Sun,2,2,2


In [156]:
# Verify that the same result is obtained
print('==========================================')
print(count_by_weekday1.equals(count_by_weekday2))

True
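
The two aggregations agree above only because users has no missing values. As a hedged sketch, the following uses a hypothetical copy of the data, users_gap, in which one visitors value has been replaced by NaN purely for illustration (this is not part of the original data); counted and sized are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical variant of users: the Sunday/Austin visitor count is missing.
users_gap = pd.DataFrame({
    "weekday": ["Sun", "Sun", "Mon", "Mon"],
    "city": ["Austin", "Dallas", "Austin", "Dallas"],
    "visitors": [np.nan, 237, 326, 456],
    "signups": [7, 12, 3, 5],
})

# 'count' counts only the non-missing values in each group.
counted = users_gap.pivot_table(index="weekday", values="visitors", aggfunc="count")

# len counts every row in each group, missing or not.
sized = users_gap.pivot_table(index="weekday", values="visitors", aggfunc=len)

print(counted)
print(sized)
```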


### Using margins in pivot tables
Sometimes it's useful to add totals in the margins of a pivot table. You can do this with the argument margins=True. In this exercise, you will practice using margins in a pivot table along with a new aggregation function: sum.

The users DataFrame, which you are now probably very familiar with, has been pre-loaded for you.

- Define a DataFrame signups_and_visitors that shows the breakdown of signups and visitors by day.
- You will need to use aggfunc=sum to do this.
- Print signups_and_visitors. This has been done for you.
- Now pass the additional argument margins=True to the .pivot_table() method to obtain the totals.
- Print signups_and_visitors_total. This has been done for you, so hit 'Submit Answer' to see the result.

In [159]:
# Create the DataFrame with the appropriate pivot table: signups_and_visitors
signups_and_visitors = users.pivot_table(index="weekday", aggfunc=sum)

In [160]:
# Print signups_and_visitors
print(signups_and_visitors)

         signups  visitors
weekday                   
Mon            8       782
Sun           19       376


In [161]:
# Add in the margins: signups_and_visitors_total 
signups_and_visitors_total = users.pivot_table(index="weekday", aggfunc=sum, margins=True)

In [162]:
# Print signups_and_visitors_total
print(signups_and_visitors_total)

         signups  visitors
weekday                   
Mon            8       782
Sun           19       376
All           27      1158
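
A variant of the margins example, offered as a sketch rather than the course solution: it names the numeric columns explicitly with values and passes the aggregation as the string "sum", so the non-numeric 'city' column is never aggregated. It assumes users is rebuilt by hand from the output shown earlier; totals is an illustrative name.

```python
import pandas as pd

# Assumption: users rebuilt by hand from the output shown above.
users = pd.DataFrame({
    "weekday": ["Sun", "Sun", "Mon", "Mon"],
    "city": ["Austin", "Dallas", "Austin", "Dallas"],
    "visitors": [139, 237, 326, 456],
    "signups": [7, 12, 3, 5],
})

# Sum the two numeric columns per weekday and append an 'All' totals row.
totals = users.pivot_table(
    index="weekday",
    values=["signups", "visitors"],
    aggfunc="sum",
    margins=True,
)
print(totals)
```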
