# Pivot tables
* A pivot table is itself a DataFrame that summarizes one column across two grouping variables:
  1. the rows represent one variable that you're interested in,
  2. the columns represent another variable, and
  3. each cell holds some *aggregate* value of a third column
* Often a pivot table includes marginal values as well, which are aggregates taken across all the groups in a row or column (more in a minute)
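As a minimal sketch of the three ingredients listed above, the toy frame below (hypothetical data with made-up column values, not the rankings file) is pivoted so that `country` supplies the rows, `tier` supplies the columns, and each cell holds the mean `score`.

```python
import pandas as pd

# Hypothetical toy data: one row per observation.
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B"],
    "tier":    ["first", "first", "second", "first", "second"],
    "score":   [90.0, 80.0, 60.0, 70.0, 50.0],
})

# Rows come from `country`, columns from `tier`, cells are the mean `score`.
toy_pivot = toy.pivot_table(values="score", index="country",
                            columns="tier", aggfunc="mean")
print(toy_pivot)
```

Country A has two "first" observations (90 and 80), so its "first" cell is their average, 85.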

In [2]:
# Here we have the Center for World University Rankings (CWUR) dataset
import pandas as pd
import numpy as np
df = pd.read_csv('datasets/cwurData.csv')  # load the CSV file into a DataFrame called df
df.head()  # show the first five rows of df

Unnamed: 0,world_rank,institution,country,national_rank,quality_of_education,alumni_employment,quality_of_faculty,publications,influence,citations,broad_impact,patents,score,year
0,1,Harvard University,USA,1,7,9,1,1,1,1,,5,100.0,2012
1,2,Massachusetts Institute of Technology,USA,2,9,17,3,12,4,4,,1,91.67,2012
2,3,Stanford University,USA,3,17,11,5,4,2,2,,15,89.5,2012
3,4,University of Cambridge,United Kingdom,1,10,24,4,16,16,11,,50,86.17,2012
4,5,California Institute of Technology,USA,4,2,29,7,37,22,22,,18,85.21,2012


Let's say we want to create a new column called *rank_level*, where institutions with a world ranking of 1-100 are
categorized as *first tier*, those ranked 101-200 as *second tier*, those ranked 201-300 as
*third tier*, and everything from 301 onward as *other* top universities.

Try it now!

In [3]:
def set_rank(row):  # takes one row of df and returns it with a rank_level entry added
    if row['world_rank'] <= 100:  # ranks 1-100
        row['rank_level'] = 'first tier'
    elif row['world_rank'] <= 200:  # ranks 101-200
        row['rank_level'] = 'second tier'
    elif row['world_rank'] <= 300:  # ranks 201-300
        row['rank_level'] = 'third tier'
    else:  # ranks 301 and beyond
        row['rank_level'] = 'other'
    return row
df = df.apply(set_rank, axis=1)  # axis=1 calls set_rank once per row
df.head()  # show the first five rows, now with the rank_level column

Unnamed: 0,world_rank,institution,country,national_rank,quality_of_education,alumni_employment,quality_of_faculty,publications,influence,citations,broad_impact,patents,score,year,rank_level
0,1,Harvard University,USA,1,7,9,1,1,1,1,,5,100.0,2012,first tier
1,2,Massachusetts Institute of Technology,USA,2,9,17,3,12,4,4,,1,91.67,2012,first tier
2,3,Stanford University,USA,3,17,11,5,4,2,2,,15,89.5,2012,first tier
3,4,University of Cambridge,United Kingdom,1,10,24,4,16,16,11,,50,86.17,2012,first tier
4,5,California Institute of Technology,USA,4,2,29,7,37,22,22,,18,85.21,2012,first tier


In [4]:
df.tail() #prints last few rows of df

Unnamed: 0,world_rank,institution,country,national_rank,quality_of_education,alumni_employment,quality_of_faculty,publications,influence,citations,broad_impact,patents,score,year,rank_level
2195,996,University of the Algarve,Portugal,7,367,567,218,926,845,812,969.0,816,44.03,2015,other
2196,997,Alexandria University,Egypt,4,236,566,218,997,908,645,981.0,871,44.03,2015,other
2197,998,Federal University of Ceará,Brazil,18,367,549,218,830,823,812,975.0,824,44.03,2015,other
2198,999,University of A Coruña,Spain,40,367,567,218,886,974,812,975.0,651,44.02,2015,other
2199,1000,China Pharmaceutical University,China,83,367,567,218,861,991,812,981.0,547,44.02,2015,other
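The row-by-row `apply` above works, but the same tiering can probably be expressed more compactly with `pd.cut`. The sketch below is an alternative, not the approach used in this notebook; the `ranks` frame is a hypothetical stand-in for the real `world_rank` column.

```python
import pandas as pd

# Hypothetical stand-in for the world_rank column of the real dataset.
ranks = pd.DataFrame({"world_rank": [1, 100, 101, 200, 201, 300, 301, 1000]})

# Bins are right-inclusive by default: (0, 100], (100, 200], (200, 300], (300, inf].
ranks["rank_level"] = pd.cut(
    ranks["world_rank"],
    bins=[0, 100, 200, 300, float("inf")],
    labels=["first tier", "second tier", "third tier", "other"],
).astype(str)  # convert the categorical result to plain strings, like the apply version

print(ranks)
```

Converting to strings is a design choice made here so that the column behaves like the one built with `set_rank`; leaving it categorical would also be reasonable.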


* Let's pivot! We need two columns, say *country* and our *rank level*; these will become the new row labels (index) and column labels
* Now we need one column of interest for the cell values; let's use *score*
* Then we need an aggregation function to apply to *score*; let's use `np.mean`

* Essentially this means we're comparing two groups, "Countries" vs. "Rank Level", with respect to score using an average. Think for a moment how you might tackle this with group by...
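One possible answer to the group by question, sketched on hypothetical toy data rather than the rankings file: group on both columns, average the score, then move the inner group level into the columns with `unstack` (introduced later in this section). Under these assumptions the result should line up cell for cell with `pivot_table`.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the rankings DataFrame.
toy = pd.DataFrame({
    "country":    ["A", "A", "A", "B", "B"],
    "rank_level": ["first tier", "first tier", "other", "other", "other"],
    "score":      [90.0, 80.0, 45.0, 44.0, 46.0],
})

# Group by both variables, average the score, then spread rank_level across the columns.
by_group = toy.groupby(["country", "rank_level"])["score"].mean().unstack()

# The pivot_table call expresses the same comparison in one step.
by_pivot = toy.pivot_table(values="score", index="country",
                           columns="rank_level", aggfunc="mean")

print(by_group)
print(by_pivot)
```

Country B has no first tier rows, so that cell is NaN in both results.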

In [5]:
df.pivot_table(values='score',
              index= 'country',
              columns='rank_level',
              aggfunc=[np.mean]).head() # mean score for each (country, rank_level) pair; head() shows the first five countries

Unnamed: 0_level_0,mean,mean,mean,mean
rank_level,first tier,other,second tier,third tier
country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
Argentina,,44.672857,,
Australia,47.9425,44.64575,49.2425,47.285
Austria,,44.864286,,47.066667
Belgium,51.875,45.081,49.084,46.746667
Brazil,,44.499706,49.565,


* Notice that there are some NaN values, e.g. Argentina only has observations in the "other" universities category
* Pivot tables aren't limited to one aggregation! We could use multiple functions and see those results with hierarchical column labels

In [6]:
df.pivot_table(values='score',
              index= 'country',
              columns='rank_level',
              aggfunc=[np.mean, np.max]).head() # mean and max score for each (country, rank_level) pair; the max columns are labelled amax in the output

Unnamed: 0_level_0,mean,mean,mean,mean,amax,amax,amax,amax
rank_level,first tier,other,second tier,third tier,first tier,other,second tier,third tier
country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
Argentina,,44.672857,,,,45.66,,
Australia,47.9425,44.64575,49.2425,47.285,51.61,45.97,50.4,47.47
Austria,,44.864286,,47.066667,,46.29,,47.78
Belgium,51.875,45.081,49.084,46.746667,52.03,46.21,49.73,47.14
Brazil,,44.499706,49.565,,,46.08,49.82,


In [7]:
# we can also provide those marginal values
df.pivot_table(values='score',
              index= 'country',
              columns='rank_level',
              aggfunc=[np.mean, np.std],
              margins=True).tail() # margins=True adds an "All" row and an "All" column that aggregate over every country and every rank level

Unnamed: 0_level_0,mean,mean,mean,mean,mean,std,std,std,std,std
rank_level,first tier,other,second tier,third tier,All,first tier,other,second tier,third tier,All
country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2
Uganda,,44.28,,,44.28,,0.169706,,,0.169706
United Arab Emirates,,44.22,,,44.22,,0.19799,,,0.19799
United Kingdom,63.937931,44.881299,48.9575,46.862273,49.474653,18.737306,0.589956,0.688636,0.510704,11.130161
Uruguay,,44.255,,,44.255,,0.13435,,,0.13435
All,58.350675,44.738871,49.06545,46.84345,47.798395,13.589643,0.525101,0.939407,0.49779,7.759042
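A point that is easy to miss: the "All" margins are computed from the underlying rows, not by averaging the cells already in the table. The sketch below uses hypothetical toy data to show the difference.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the rankings DataFrame.
toy = pd.DataFrame({
    "country":    ["A", "A", "A", "B"],
    "rank_level": ["first tier", "first tier", "other", "other"],
    "score":      [90.0, 80.0, 40.0, 50.0],
})

with_margins = toy.pivot_table(values="score", index="country",
                               columns="rank_level", aggfunc="mean",
                               margins=True)
print(with_margins)

# The bottom-right "All" cell is the mean of every score in the data...
overall_mean = toy["score"].mean()
# ...which differs from the mean of the three populated cell means (85, 40, 50).
mean_of_cell_means = (85.0 + 40.0 + 50.0) / 3
print(overall_mean, mean_of_cell_means)
```

Here the overall mean is 65, while the mean of the cell means is about 58.3, because the first tier cell for country A summarizes two rows.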


In [8]:
# A pivot table is just a multi-level dataframe
new_df = df.pivot_table(values='score',
              index= 'country',
              columns='rank_level',
              aggfunc=[np.mean, np.std],
              margins=True)

# let's look at the index
print(new_df.index)  # the row labels: one per country, plus the 'All' margin
# and the columns
print(new_df.columns)  # a MultiIndex of (aggregation, rank_level) pairs

Index(['Argentina', 'Australia', 'Austria', 'Belgium', 'Brazil', 'Bulgaria',
       'Canada', 'Chile', 'China', 'Colombia', 'Croatia', 'Cyprus',
       'Czech Republic', 'Denmark', 'Egypt', 'Estonia', 'Finland', 'France',
       'Germany', 'Greece', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Iran',
       'Ireland', 'Israel', 'Italy', 'Japan', 'Lebanon', 'Lithuania',
       'Malaysia', 'Mexico', 'Netherlands', 'New Zealand', 'Norway', 'Poland',
       'Portugal', 'Puerto Rico', 'Romania', 'Russia', 'Saudi Arabia',
       'Serbia', 'Singapore', 'Slovak Republic', 'Slovenia', 'South Africa',
       'South Korea', 'Spain', 'Sweden', 'Switzerland', 'Taiwan', 'Thailand',
       'Turkey', 'USA', 'Uganda', 'United Arab Emirates', 'United Kingdom',
       'Uruguay', 'All'],
      dtype='object', name='country')
MultiIndex([('mean',  'first tier'),
            ('mean',       'other'),
            ('mean', 'second tier'),
            ('mean',  'third tier'),
            ('mean',         'All')

How would we query this if we want to get the average scores of First Tier universities broken down by country?

In [11]:
new_df.loc[:, ('mean', 'first tier')] #get series with all the countries and the mean of their first tier scores

country
Argentina                     NaN
Australia               47.942500
Austria                       NaN
Belgium                 51.875000
Brazil                        NaN
Bulgaria                      NaN
Canada                  53.633846
Chile                         NaN
China                   53.592500
Colombia                      NaN
Croatia                       NaN
Cyprus                        NaN
Czech Republic                NaN
Denmark                 49.180000
Egypt                         NaN
Estonia                       NaN
Finland                 44.415000
France                  51.914444
Germany                 49.153636
Greece                        NaN
Hong Kong                     NaN
Hungary                       NaN
Iceland                       NaN
India                         NaN
Iran                          NaN
Ireland                       NaN
Israel                  56.307143
Italy                   48.736667
Japan                   58.812692
Lebano
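The tuple passed to `.loc` above is one way to address a column in a two-level column index. As a sketch on hypothetical toy data, here are two other selections that should behave equivalently or similarly: chained indexing by level, and `xs` to pull one rank level across every aggregation.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the rankings DataFrame.
toy = pd.DataFrame({
    "country":    ["A", "A", "A", "B"],
    "rank_level": ["first tier", "first tier", "other", "other"],
    "score":      [90.0, 80.0, 40.0, 50.0],
})

# A list of aggregations produces two-level columns: (aggregation, rank_level).
wide = toy.pivot_table(values="score", index="country",
                       columns="rank_level", aggfunc=["mean", "max"])

# 1. Address the column with a tuple, outer level first.
by_tuple = wide.loc[:, ("mean", "first tier")]

# 2. Select the outer level, then the inner level.
by_chain = wide["mean"]["first tier"]

# 3. Take a cross-section: every aggregation for the 'first tier' level.
first_tier_all = wide.xs("first tier", axis=1, level="rank_level")

print(by_tuple)
print(by_chain)
print(first_tier_all)
```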

* Let's get weird. We can `stack` and `unstack` columns in our dataframe.
* `stack` pivots the lowermost column index to become the innermost row index. `unstack` is the inverse
* Let's look back at that pivot table...
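Before returning to the pivot table, here is a minimal sketch of the round trip on a small hypothetical frame with single-level columns, where the effect is easier to see.

```python
import pandas as pd

# Hypothetical wide frame: countries as rows, rank levels as columns.
wide = pd.DataFrame(
    {"first tier": [85.0, 70.0], "other": [45.0, 44.0]},
    index=pd.Index(["A", "B"], name="country"),
)
wide.columns.name = "rank_level"

# stack: the column labels become the innermost level of the row index.
# With single-level columns the result is a Series indexed by (country, rank_level).
stacked = wide.stack()
print(stacked)

# unstack: the innermost row level moves back into the columns.
restored = stacked.unstack()
print(restored)
```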

In [9]:
new_df.head() # we want to move the university tier from the columns into the row index, so we stack...

Unnamed: 0_level_0,mean,mean,mean,mean,mean,std,std,std,std,std
rank_level,first tier,other,second tier,third tier,All,first tier,other,second tier,third tier,All
country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2
Argentina,,44.672857,,,44.672857,,0.59913,,,0.59913
Australia,47.9425,44.64575,49.2425,47.285,45.825517,3.798397,0.386542,0.82064,0.26163,2.297206
Austria,,44.864286,,47.066667,45.139583,,0.590191,,0.695725,0.947929
Belgium,51.875,45.081,49.084,46.746667,47.011,0.219203,0.786419,0.829958,0.481283,2.461225
Brazil,,44.499706,49.565,,44.781111,,0.490476,0.360624,,1.270909


In [10]:
new_df.stack().head()  # the rows now have a two-level MultiIndex: (country, rank_level)

Unnamed: 0_level_0,Unnamed: 1_level_0,mean,std
country,rank_level,Unnamed: 2_level_1,Unnamed: 3_level_1
Argentina,other,44.672857,0.59913
Argentina,All,44.672857,0.59913
Australia,first tier,47.9425,3.798397
Australia,other,44.64575,0.386542
Australia,second tier,49.2425,0.82064


In [15]:
new_df.stack().index 

MultiIndex([(     'Argentina',       'other'),
            (     'Argentina',         'All'),
            (     'Australia',  'first tier'),
            (     'Australia',       'other'),
            (     'Australia', 'second tier'),
            (     'Australia',  'third tier'),
            (     'Australia',         'All'),
            (       'Austria',       'other'),
            (       'Austria',  'third tier'),
            (       'Austria',         'All'),
            ...
            ('United Kingdom', 'second tier'),
            ('United Kingdom',  'third tier'),
            ('United Kingdom',         'All'),
            (       'Uruguay',       'other'),
            (       'Uruguay',         'All'),
            (           'All',  'first tier'),
            (           'All',       'other'),
            (           'All', 'second tier'),
            (           'All',  'third tier'),
            (           'All',         'All')],
           names=['country', 'rank_level'],

In [16]:
new_df.stack().columns

Index(['mean', 'std'], dtype='object')

In [14]:
# It can get complex! You are just comparing two groups and a value (or multiple values in this case!)
# We can unstack() all the way if we want to, which means moving a row index into the column index
new_df.head()

Unnamed: 0_level_0,mean,mean,mean,mean,mean,std,std,std,std,std
rank_level,first tier,other,second tier,third tier,All,first tier,other,second tier,third tier,All
country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2
Argentina,,44.672857,,,44.672857,,0.59913,,,0.59913
Australia,47.9425,44.64575,49.2425,47.285,45.825517,3.798397,0.386542,0.82064,0.26163,2.297206
Austria,,44.864286,,47.066667,45.139583,,0.590191,,0.695725,0.947929
Belgium,51.875,45.081,49.084,46.746667,47.011,0.219203,0.786419,0.829958,0.481283,2.461225
Brazil,,44.499706,49.565,,44.781111,,0.490476,0.360624,,1.270909


In [17]:
new_df.unstack().head(10)

      rank_level  country  
mean  first tier  Argentina          NaN
                  Australia    47.942500
                  Austria            NaN
                  Belgium      51.875000
                  Brazil             NaN
                  Bulgaria           NaN
                  Canada       53.633846
                  Chile              NaN
                  China        53.592500
                  Colombia           NaN
dtype: float64

In [18]:
new_df.unstack().index

MultiIndex([('mean', 'first tier',            'Argentina'),
            ('mean', 'first tier',            'Australia'),
            ('mean', 'first tier',              'Austria'),
            ('mean', 'first tier',              'Belgium'),
            ('mean', 'first tier',               'Brazil'),
            ('mean', 'first tier',             'Bulgaria'),
            ('mean', 'first tier',               'Canada'),
            ('mean', 'first tier',                'Chile'),
            ('mean', 'first tier',                'China'),
            ('mean', 'first tier',             'Colombia'),
            ...
            ( 'std',        'All',          'Switzerland'),
            ( 'std',        'All',               'Taiwan'),
            ( 'std',        'All',             'Thailand'),
            ( 'std',        'All',               'Turkey'),
            ( 'std',        'All',                  'USA'),
            ( 'std',        'All',               'Uganda'),
            ( 'std',    

In [19]:
new_df.unstack().columns

AttributeError: 'Series' object has no attribute 'columns'
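The error above is expected: when the row index has only one level, `unstack` has nothing left to keep as rows, so pandas returns a Series whose index combines the column levels with the old row index. A Series has an `index` but no `columns`. The sketch below reproduces this on a small hypothetical frame.

```python
import pandas as pd

# Hypothetical wide frame with a single-level row index.
wide = pd.DataFrame(
    {"first tier": [85.0, 70.0], "other": [45.0, 44.0]},
    index=pd.Index(["A", "B"], name="country"),
)
wide.columns.name = "rank_level"

# Unstacking the only row level leaves no rows, so the result is a Series.
flat = wide.unstack()
print(type(flat).__name__)   # Series
print(flat.index.names)      # column level first, then the old row index

# Unstacking the Series once more yields a DataFrame again,
# this time with rank_level as rows and country as columns.
transposed = flat.unstack()
print(transposed)
```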