# Pivot tables
* A pivot table is itself a DataFrame, which compares two groups on some shared columns, thus:
  1. the rows represent one variable that you're interested in
  2. the columns represent another variable, and
  3. the cell content is some *aggregate* value of a third column 
* Often a pivot table includes marginal values as well, which are aggregates across all the groups in a row or column (more in a minute)
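
To make the three roles concrete, here is a minimal sketch on a made-up table. The `toy` DataFrame, its `tier` column and its values are hypothetical and not part of the ranking data; the string `"mean"` is used as the aggregation alias.

```python
import pandas as pd

# Hypothetical toy data: two grouping variables and one numeric column
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "B"],
    "tier": ["first", "second", "first", "first", "second"],
    "score": [90.0, 70.0, 80.0, 60.0, 50.0],
})

# rows = country, columns = tier, cells = mean score of each (country, tier) pair
toy_pivot = toy.pivot_table(values="score", index="country", columns="tier", aggfunc="mean")
print(toy_pivot)
```

Country B has two first-tier rows (80 and 60), so its first-tier cell holds their mean.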

In [None]:
# Here we have the Center for World University Rankings (CWUR) dataset
import pandas as pd
import numpy as np
df = pd.read_csv('datasets/cwurData.csv')
df.head()

In [None]:
# Let's say we want to create a new column called *Rank_Level*, where institutions with world ranking 1-100 are
# categorized as *first tier*, those with world ranking 101-200 as *second tier*, those with ranking 201-300 as
# *third tier*, and anything beyond 300 as *other top universities*.

# You do that now, please.

In [None]:
#Put interesting student solution here.
def set_rank(row):
    if row["world_rank"] <= 100:
        row["Rank_Level"]="first tier"
    elif row["world_rank"] <= 200:
        row["Rank_Level"]="second tier"
    elif row["world_rank"] <= 300:
        row["Rank_Level"]="third tier"
    else:
        row["Rank_Level"]="other top universities"
    return row

df=df.apply(set_rank, axis=1)
df.head()
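
One possible alternative to the row-wise `apply` above is to bin the ranks with `pd.cut`. This is only a sketch: the `ranks` DataFrame is a hypothetical stand-in for the real data, and it assumes `world_rank` is numeric.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the ranking data; only world_rank matters here
ranks = pd.DataFrame({"world_rank": [1, 100, 101, 200, 250, 300, 301, 1000]})

# Bins are right-inclusive by default: (0, 100], (100, 200], (200, 300], (300, inf]
ranks["Rank_Level"] = pd.cut(
    ranks["world_rank"],
    bins=[0, 100, 200, 300, np.inf],
    labels=["first tier", "second tier", "third tier", "other top universities"],
)
print(ranks)
```

Note that `pd.cut` returns a categorical column rather than plain strings, which can change how later grouping treats tiers that have no observations.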


* Let's pivot! We need two columns, say *country* and our *Rank_Level*. These become the new row labels (index) and column labels.
* Now we need one column of interest for the cell values; let's use *score*.
* Then we need an aggregation function to apply to *score*; let's use `np.mean`.

* Essentially, this means we're comparing two groups, "country" vs. "Rank_Level", with respect to score, using an average. Think for a moment about how you might tackle this with `groupby`.

In [None]:
df.pivot_table(values='score', index='country', columns='Rank_Level', aggfunc=[np.mean]).head()
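
For comparison, here is one way the same kind of table could be built with `groupby`, sketched on made-up data. The `toy` frame and its values are hypothetical; only the column names mirror the ones used above.

```python
import pandas as pd

# Hypothetical toy data using the same column names as the ranking data
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C"],
    "Rank_Level": ["first tier", "second tier", "first tier", "first tier", "second tier"],
    "score": [90.0, 70.0, 80.0, 60.0, 50.0],
})

# Pivot-table route
via_pivot = toy.pivot_table(values="score", index="country", columns="Rank_Level", aggfunc="mean")

# groupby route: aggregate per (country, Rank_Level), then move Rank_Level into the columns
via_groupby = toy.groupby(["country", "Rank_Level"])["score"].mean().unstack()

print(via_groupby)
print(via_pivot.equals(via_groupby))
```

Country C has no first-tier rows, so that cell comes out as NaN in both versions.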

* We notice that there are some NaN values, e.g. Argentina only has observations in the "other top universities" category.

* Pivot tables aren't limited to one aggregation! We could pass multiple functions and see the results under hierarchical column labels.

In [None]:
df.pivot_table(values='score', index='country', columns='Rank_Level', aggfunc=[np.mean, np.max]).head()

In [None]:
# we can also provide those marginal values
df.pivot_table(values='score', index='country', columns='Rank_Level', aggfunc=[np.mean, np.max], 
               margins=True).head()
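
To see what the margins actually contain, here is a small sketch on hypothetical data (the `toy` frame is made up for illustration). With `margins=True`, pandas adds a row and a column labelled `All`, computed from the underlying rows rather than from the cell values.

```python
import pandas as pd

# Hypothetical toy data using the same column names as the ranking data
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "B"],
    "Rank_Level": ["first tier", "second tier", "first tier", "first tier", "second tier"],
    "score": [90.0, 70.0, 80.0, 60.0, 50.0],
})

with_margins = toy.pivot_table(values="score", index="country", columns="Rank_Level",
                               aggfunc="mean", margins=True)
print(with_margins)

# B's margin is the mean of its three raw scores (80, 60, 50),
# not the mean of its two cell values (70 and 50)
print(with_margins.loc["B", "All"])
```

The `All` row is appended at the bottom, so `.head()` on a long table shows only the `All` column; `.tail()` shows the `All` row.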

In [None]:
# A pivot table with several aggregation functions is just a DataFrame with multi-level column labels
new_df=df.pivot_table(values='score', index='country', columns='Rank_Level', aggfunc=[np.mean, np.max], 
               margins=True)
# Now let's look at the index
print(new_df.index)
# And let's look at the columns
print(new_df.columns)

In [None]:
# How would we query this if we want to get the average scores of First Tier universities broken down by country?
# You do that now, please.
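
One possible approach, sketched on hypothetical data: because the columns are a two-level index (aggregation name, then rank level), a single column can be selected with a tuple. The `toy` frame is made up, and the string aliases `"mean"` and `"max"` are used so the top-level labels are predictable. With `np.mean` and `np.max` the labels may differ between pandas versions, so check `new_df.columns` first.

```python
import pandas as pd

# Hypothetical toy data using the same column names as the ranking data
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "B"],
    "Rank_Level": ["first tier", "second tier", "first tier", "first tier", "second tier"],
    "score": [90.0, 70.0, 80.0, 60.0, 50.0],
})

toy_pt = toy.pivot_table(values="score", index="country", columns="Rank_Level",
                         aggfunc=["mean", "max"])

# Select one column of the two-level column index with a (function, tier) tuple
first_tier_mean = toy_pt[("mean", "first tier")]
print(first_tier_mean)
```

Chained selection, `toy_pt["mean"]["first tier"]`, picks out the same values one level at a time.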

* Let's get weird. We can `stack` and `unstack` columns in our dataframe.
* `stack` pivots the innermost (lowermost) level of the column labels so that it becomes the innermost level of the row index. `unstack` is the inverse.
* Let's look back at that pivot table...
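
Before trying this on the real pivot table, here is a minimal sketch of the round trip on a hypothetical two-by-two frame (the `wide` DataFrame and its values are made up for illustration).

```python
import pandas as pd

# Hypothetical wide table: countries as rows, rank levels as columns
wide = pd.DataFrame(
    [[90.0, 70.0], [80.0, 50.0]],
    index=pd.Index(["A", "B"], name="country"),
    columns=pd.Index(["first tier", "second tier"], name="Rank_Level"),
)

# stack: the column labels become the innermost level of the row index
tall = wide.stack()
print(tall)

# unstack: the innermost row level moves back into the columns
back = tall.unstack()
print(back)
```

Because `wide` has only one level of column labels, stacking it leaves no columns at all and the result is a Series indexed by (country, Rank_Level).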

In [None]:
new_df.head() # we want to move the university tier from the column labels into the row index, so we are stacking...

In [None]:
new_df.stack().head()

In [None]:
# It can get complex! You are just comparing two groups and a value (or multiple values in this case!)
# We can unstack() all the way if we want to, which means moving a row index into the column index
new_df.head() # country is the only row index left to move; what shape do you think this will create?

In [None]:
new_df.unstack().head(10)

* Remember, you can pass any function you want as the aggregation function (`aggfunc`), including ones that you define yourself!
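
As an illustration, here is a sketch with a user-defined aggregation. The `score_range` function and the `toy` data are hypothetical, made up for this example.

```python
import pandas as pd

# Hypothetical toy data using the same column names as the ranking data
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "B"],
    "Rank_Level": ["first tier", "second tier", "first tier", "first tier", "second tier"],
    "score": [90.0, 70.0, 80.0, 60.0, 50.0],
})

def score_range(values):
    """Hypothetical custom aggregation: spread between the best and worst score in a group."""
    return values.max() - values.min()

ranges = toy.pivot_table(values="score", index="country", columns="Rank_Level",
                         aggfunc=score_range)
print(ranges)
```

Each cell receives the scores for one (country, Rank_Level) pair, so a group with a single observation has a range of zero.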