# Pivot tables
* A pivot table is itself a DataFrame that summarizes one column across the combinations of two grouping variables:
  1. the rows represent one variable that you're interested in,
  2. the columns represent another variable, and
  3. the cell content is some *aggregate* value of a third column
* Often a pivot table includes marginal values as well, which are aggregates across all the groups in a row or column (more in a minute)
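
To make the three ingredients concrete, here is a minimal sketch on hypothetical toy data (the `toy` frame and its column names are made up for illustration; they are not the ranking dataset used below).

```python
import pandas as pd

# Hypothetical toy data, not the ranking dataset used in this notebook.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "B"],
    "tier":    ["first", "second", "first", "first", "second"],
    "score":   [90.0, 70.0, 80.0, 60.0, 50.0],
})

# Rows come from `country`, columns from `tier`, and each cell holds the
# mean `score` of the rows that fall into that (country, tier) pair.
toy_pivot = toy.pivot_table(values="score", index="country",
                            columns="tier", aggfunc="mean")
print(toy_pivot)
```

Country B has two first-tier rows (80 and 60), so its first-tier cell is their mean, 70.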

In [1]:
# Here we have the Center for World University Rankings (CWUR) dataset
import pandas as pd
import numpy as np
df = pd.read_csv('datasets/cwurData.csv')
df.head()

Unnamed: 0,world_rank,institution,country,national_rank,quality_of_education,alumni_employment,quality_of_faculty,publications,influence,citations,broad_impact,patents,score,year
0,1,Harvard University,USA,1,7,9,1,1,1,1,,5,100.0,2012
1,2,Massachusetts Institute of Technology,USA,2,9,17,3,12,4,4,,1,91.67,2012
2,3,Stanford University,USA,3,17,11,5,4,2,2,,15,89.5,2012
3,4,University of Cambridge,United Kingdom,1,10,24,4,16,16,11,,50,86.17,2012
4,5,California Institute of Technology,USA,4,2,29,7,37,22,22,,18,85.21,2012


In [None]:
# Let's say we want to create a new column called *Rank_Level*, where institutions with world ranking 1-100 are
# categorized as *first tier*, those with world ranking 101-200 are *second tier*, those with ranking 201-300 are
# *third tier*, and everything from 301 onward is *other top universities*.

# You do that now, please.

In [2]:
#Put interesting student solution here.
def set_rank(row):
    if row["world_rank"] <= 100:
        row["Rank_Level"]="first tier"
    elif row["world_rank"] <= 200:
        row["Rank_Level"]="second tier"
    elif row["world_rank"] <= 300:
        row["Rank_Level"]="third tier"
    else:
        row["Rank_Level"]="other top universities"
    return row

df=df.apply(set_rank, axis=1)
df.head()


Unnamed: 0,world_rank,institution,country,national_rank,quality_of_education,alumni_employment,quality_of_faculty,publications,influence,citations,broad_impact,patents,score,year,Rank_Level
0,1,Harvard University,USA,1,7,9,1,1,1,1,,5,100.0,2012,first tier
1,2,Massachusetts Institute of Technology,USA,2,9,17,3,12,4,4,,1,91.67,2012,first tier
2,3,Stanford University,USA,3,17,11,5,4,2,2,,15,89.5,2012,first tier
3,4,University of Cambridge,United Kingdom,1,10,24,4,16,16,11,,50,86.17,2012,first tier
4,5,California Institute of Technology,USA,4,2,29,7,37,22,22,,18,85.21,2012,first tier
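
The row-wise `apply` above works, but it calls a Python function once per row. As an alternative sketch, `pd.cut` can assign the same labels in one vectorized step. The `ranks` frame below is a hypothetical stand-in for the real `world_rank` column, chosen to sit on the bin boundaries.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the `world_rank` column of the real data.
ranks = pd.DataFrame({"world_rank": [1, 100, 101, 200, 201, 300, 301, 950]})

# Bins are right-inclusive by default: (0, 100], (100, 200], (200, 300], (300, inf].
ranks["Rank_Level"] = pd.cut(
    ranks["world_rank"],
    bins=[0, 100, 200, 300, np.inf],
    labels=["first tier", "second tier", "third tier", "other top universities"],
).astype(str)  # plain strings, matching what the apply-based solution produces
print(ranks)
```

The `.astype(str)` step is a design choice for this sketch: `pd.cut` returns a categorical column, and converting it keeps the labels as ordinary strings like those in the solution above.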


* Let's pivot! We need two columns, let's say the *country* and our *rank level*. These will become our new rows (index) and columns (labels)
* Now we need one column of interest for the cell value, let's use the *score*
* Then we need an aggregation function, which we'll apply to *score*. Let's use `np.mean`

* Essentially this means we're comparing two groups, "Countries" vs. "Rank Level", with respect to score using an average. Think for a moment how you might tackle this with group by
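
One way to approach it with group by, sketched on hypothetical toy data rather than the ranking dataset: group on both columns, take the mean, then move one grouping level into the columns with `unstack`. The `toy` frame is an assumption made for illustration only.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the notebook's df.
toy = pd.DataFrame({
    "country":    ["A", "A", "B", "B", "B"],
    "Rank_Level": ["first tier", "second tier", "first tier", "first tier", "second tier"],
    "score":      [90.0, 70.0, 80.0, 60.0, 50.0],
})

# Group by both variables, average the score, then pivot Rank_Level
# out of the row index and into the columns.
by_group = toy.groupby(["country", "Rank_Level"])["score"].mean().unstack("Rank_Level")

# The pivot_table call expresses the same idea in one step.
by_pivot = toy.pivot_table(values="score", index="country",
                           columns="Rank_Level", aggfunc="mean")

print(by_group)
print(by_pivot)
```

Both frames hold the same numbers in this sketch; `pivot_table` is mostly a convenience for the group-then-unstack pattern.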

In [4]:
df.pivot_table(values='score', 
               index='country', 
               columns='Rank_Level', 
               aggfunc=[np.mean])

Unnamed: 0_level_0,mean,mean,mean,mean
Rank_Level,first tier,other top universities,second tier,third tier
country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
Argentina,,44.672857,,
Australia,47.9425,44.64575,49.2425,47.285
Austria,,44.864286,,47.066667
Belgium,51.875,45.081,49.084,46.746667
Brazil,,44.499706,49.565,
Bulgaria,,44.335,,
Canada,53.633846,44.760541,49.218182,46.826364
Chile,,44.7675,,
China,53.5925,44.564267,47.868,46.92625
Colombia,,44.4325,,


* we notice that there are some NaN values, e.g. Argentina only has observations in the "other top universities" category

* pivot tables aren't limited to one aggregation! We could use multiple functions and see those results with hierarchical column labels

In [5]:
df.pivot_table(values='score', index='country', columns='Rank_Level', 
               aggfunc=[np.mean, np.max]).head()

Unnamed: 0_level_0,mean,mean,mean,mean,amax,amax,amax,amax
Rank_Level,first tier,other top universities,second tier,third tier,first tier,other top universities,second tier,third tier
country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
Argentina,,44.672857,,,,45.66,,
Australia,47.9425,44.64575,49.2425,47.285,51.61,45.97,50.4,47.47
Austria,,44.864286,,47.066667,,46.29,,47.78
Belgium,51.875,45.081,49.084,46.746667,52.03,46.21,49.73,47.14
Brazil,,44.499706,49.565,,,46.08,49.82,
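
Aggregations can also be named by string. The sketch below, on hypothetical toy data, passes `"mean"` and `"max"` instead of the NumPy functions, so the top column level is labelled `mean` and `max` rather than `mean` and `amax`. Some recent pandas releases may also warn when NumPy functions such as `np.mean` are passed to `aggfunc`; the string form avoids that.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the notebook's df.
toy = pd.DataFrame({
    "country":    ["A", "A", "B", "B", "B"],
    "Rank_Level": ["first tier", "second tier", "first tier", "first tier", "second tier"],
    "score":      [90.0, 70.0, 80.0, 60.0, 50.0],
})

# A list of aggregations produces two-level column labels:
# the aggregation name on top, the Rank_Level value underneath.
summary = toy.pivot_table(values="score", index="country",
                          columns="Rank_Level", aggfunc=["mean", "max"])
print(summary.columns.tolist())
print(summary)
```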


In [6]:
# we can also provide those marginal values
df.pivot_table(values='score', 
               index='country', 
               columns='Rank_Level', 
               aggfunc=[np.mean, np.std], 
               margins=True).head()

Unnamed: 0_level_0,mean,mean,mean,mean,mean,std,std,std,std,std
Rank_Level,first tier,other top universities,second tier,third tier,All,first tier,other top universities,second tier,third tier,All
country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2
Argentina,,44.672857,,,44.672857,,0.59913,,,0.59913
Australia,47.9425,44.64575,49.2425,47.285,45.825517,3.798397,0.386542,0.82064,0.26163,2.297206
Austria,,44.864286,,47.066667,45.139583,,0.590191,,0.695725,0.947929
Belgium,51.875,45.081,49.084,46.746667,47.011,0.219203,0.786419,0.829958,0.481283,2.461225
Brazil,,44.499706,49.565,,44.781111,,0.490476,0.360624,,1.270909
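
A detail worth checking about the `All` margin: it is computed from the underlying rows, not by averaging the cells already in the table. The sketch below uses hypothetical toy data in which the two approaches give different answers.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the notebook's df.
toy = pd.DataFrame({
    "country":    ["A", "A", "B", "B", "B"],
    "Rank_Level": ["first tier", "second tier", "first tier", "first tier", "second tier"],
    "score":      [90.0, 70.0, 80.0, 60.0, 50.0],
})

with_margins = toy.pivot_table(values="score", index="country",
                               columns="Rank_Level", aggfunc="mean",
                               margins=True)
print(with_margins)

# Country B: the margin is the mean of its three raw scores (80, 60, 50),
# which differs from the mean of its two cell values (70 and 50).
margin_b = with_margins.loc["B", "All"]
mean_of_cells_b = with_margins.loc["B", ["first tier", "second tier"]].mean()
print(margin_b, mean_of_cells_b)
```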


In [7]:
# A pivot table is just a multi-level dataframe
new_df=df.pivot_table(values='score', index='country', columns='Rank_Level', aggfunc=[np.mean, np.max], 
               margins=True)
# Now let's look at the index
print(new_df.index)
# And let's look at the columns
print(new_df.columns)

Index(['Argentina', 'Australia', 'Austria', 'Belgium', 'Brazil', 'Bulgaria',
       'Canada', 'Chile', 'China', 'Colombia', 'Croatia', 'Cyprus',
       'Czech Republic', 'Denmark', 'Egypt', 'Estonia', 'Finland', 'France',
       'Germany', 'Greece', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Iran',
       'Ireland', 'Israel', 'Italy', 'Japan', 'Lebanon', 'Lithuania',
       'Malaysia', 'Mexico', 'Netherlands', 'New Zealand', 'Norway', 'Poland',
       'Portugal', 'Puerto Rico', 'Romania', 'Russia', 'Saudi Arabia',
       'Serbia', 'Singapore', 'Slovak Republic', 'Slovenia', 'South Africa',
       'South Korea', 'Spain', 'Sweden', 'Switzerland', 'Taiwan', 'Thailand',
       'Turkey', 'USA', 'Uganda', 'United Arab Emirates', 'United Kingdom',
       'Uruguay', 'All'],
      dtype='object', name='country')
MultiIndex([('mean',             'first tier'),
            ('mean', 'other top universities'),
            ('mean',            'second tier'),
            ('mean',             'third

In [None]:
# How would we query this if we want to get the average scores of First Tier universities broken down by country?
# You do that now, please.
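
One possible approach, sketched on hypothetical toy data instead of `new_df`: because the columns are a two-level index, you can select the top level first and the second level next, or pass both labels at once as a tuple.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the notebook's df.
toy = pd.DataFrame({
    "country":    ["A", "A", "B", "B", "B"],
    "Rank_Level": ["first tier", "second tier", "first tier", "first tier", "second tier"],
    "score":      [90.0, 70.0, 80.0, 60.0, 50.0],
})
toy_table = toy.pivot_table(values="score", index="country",
                            columns="Rank_Level", aggfunc=["mean", "max"])

# Step through the column levels one at a time ...
first_tier_means = toy_table["mean"]["first tier"]
# ... or address the column with a (level 0, level 1) tuple.
same_thing = toy_table[("mean", "first tier")]

print(first_tier_means)
```

Either form returns a Series indexed by country.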

* Let's get weird. We can `stack` and `unstack` columns in our dataframe.
* `stack` pivots the lowermost (innermost) column level to become the innermost row index level. `unstack` is the inverse
* let's look back at that pivot table...

In [None]:
new_df.head() # we want to move the university tier (Rank_Level) from the columns into the row index, so we are stacking

In [8]:
new_df.stack().head()

Unnamed: 0_level_0,Unnamed: 1_level_0,mean,amax
country,Rank_Level,Unnamed: 2_level_1,Unnamed: 3_level_1
Argentina,other top universities,44.672857,45.66
Argentina,All,44.672857,45.66
Australia,first tier,47.9425,51.61
Australia,other top universities,44.64575,45.97
Australia,second tier,49.2425,50.4
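
The same move can be followed on a smaller, hypothetical table, where it is easier to see which level goes where. Depending on your pandas version, `stack()` may print a warning about its implementation changing; the sketch only relies on behaviour common to both.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the notebook's df.
toy = pd.DataFrame({
    "country":    ["A", "A", "B", "B", "B"],
    "Rank_Level": ["first tier", "second tier", "first tier", "first tier", "second tier"],
    "score":      [90.0, 70.0, 80.0, 60.0, 50.0],
})
wide = toy.pivot_table(values="score", index="country",
                       columns="Rank_Level", aggfunc=["mean", "max"])

# Before: rows = country, columns = (aggregation, Rank_Level).
# After:  rows = (country, Rank_Level), columns = aggregation.
stacked = wide.stack()
print(stacked.index.names)
print(stacked)
```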


In [9]:
# It can get complex! You are just comparing two groups and a value (or multiple values in this case!)
# We can unstack() all the way if we want to, which means moving a row index level into the column index
new_df.head() # new_df has only one row index level (country); what shape do you think unstack() will create?

Unnamed: 0_level_0,mean,mean,mean,mean,mean,amax,amax,amax,amax,amax
Rank_Level,first tier,other top universities,second tier,third tier,All,first tier,other top universities,second tier,third tier,All
country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2
Argentina,,44.672857,,,44.672857,,45.66,,,45.66
Australia,47.9425,44.64575,49.2425,47.285,45.825517,51.61,45.97,50.4,47.47,51.61
Austria,,44.864286,,47.066667,45.139583,,46.29,,47.78,47.78
Belgium,51.875,45.081,49.084,46.746667,47.011,52.03,46.21,49.73,47.14,52.03
Brazil,,44.499706,49.565,,44.781111,,46.08,49.82,,49.82


In [10]:
new_df.unstack().head(10)

      Rank_Level  country  
mean  first tier  Argentina          NaN
                  Australia    47.942500
                  Austria            NaN
                  Belgium      51.875000
                  Brazil             NaN
                  Bulgaria           NaN
                  Canada       53.633846
                  Chile              NaN
                  China        53.592500
                  Colombia           NaN
dtype: float64
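
Two things are worth verifying on a small, hypothetical table: `unstack` undoes `stack`, and unstacking a frame whose row index has only one level leaves no row index behind, so pandas returns a Series instead of a DataFrame.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the notebook's df.
toy = pd.DataFrame({
    "country":    ["A", "A", "B", "B", "B"],
    "Rank_Level": ["first tier", "second tier", "first tier", "first tier", "second tier"],
    "score":      [90.0, 70.0, 80.0, 60.0, 50.0],
})
wide = toy.pivot_table(values="score", index="country",
                       columns="Rank_Level", aggfunc=["mean", "max"])

# stack() then unstack() moves Rank_Level into the rows and back out again.
round_trip = wide.stack().unstack()

# With only `country` in the row index, unstack() has nothing left to keep
# as rows, so the result is a Series indexed by
# (aggregation, Rank_Level, country).
flat = wide.unstack()
print(type(flat).__name__, flat.index.nlevels)
print(flat)
```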

* Remember, you can pass any function you want to the `aggfunc` parameter, including those that you define yourself!
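
As a minimal sketch of a user-defined aggregation, the hypothetical `score_range` function below reports the spread between the best and worst score in each cell. The function name and the toy data are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the notebook's df.
toy = pd.DataFrame({
    "country":    ["A", "A", "B", "B", "B"],
    "Rank_Level": ["first tier", "second tier", "first tier", "first tier", "second tier"],
    "score":      [90.0, 70.0, 80.0, 60.0, 50.0],
})

def score_range(scores):
    """Spread between the highest and lowest score in one group."""
    return scores.max() - scores.min()

# The function is called once per (country, Rank_Level) group.
spread = toy.pivot_table(values="score", index="country",
                         columns="Rank_Level", aggfunc=score_range)
print(spread)
```

Cells backed by a single observation have a spread of 0, which is a useful reminder to check how many rows sit behind each aggregate.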