# Pandas

This notebook covers the following advanced but useful pandas techniques:

- [Map](#map)
- [Apply](#apply)
- [Groupby](#groupby)
- [Pivot Tables](#Pivot-Tables)

This tutorial will use the Titanic Data Set. This dataset records the information of the passengers onboard the Titanic.


In [1]:
import pandas as pd

df_titanic = pd.read_csv("Titanic.csv")
df_titanic.head(20)


Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
5,897,3,"Svensson, Mr. Johan Cervin",male,14.0,0,0,7538,9.225,,S
6,898,3,"Connolly, Miss. Kate",female,30.0,0,0,330972,7.6292,,Q
7,899,2,"Caldwell, Mr. Albert Francis",male,26.0,1,1,248738,29.0,,S
8,900,3,"Abrahim, Mrs. Joseph (Sophie Halaut Easu)",female,18.0,0,0,2657,7.2292,,C
9,901,3,"Davies, Mr. John Samuel",male,21.0,2,0,A/4 48871,24.15,,S


The columns are as follows:

- Survival - Survival (0 = No; 1 = Yes). Not included in test.csv file.
- Pclass - Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
- Name - Name
- Sex - Sex
- Age - Age
- SibSp - Number of Siblings/Spouses Aboard
- Parch - Number of Parents/Children Aboard
- Ticket - Ticket Number
- Fare - Passenger Fare
- Cabin - Cabin
- Embarked - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

## Map

The `map` method lets us map values in a column to other values. Let's use `map` to create a binary column that is 1 for males and 0 for females. We specify the mapping we want by providing the appropriate dictionary.

In [3]:
#Using map
df_titanic["Sex"].map({"male":1, "female":0}).head()

0    1
1    0
2    1
3    1
4    0
Name: Sex, dtype: int64

The result is a series, which we can again store as a column.

In [4]:
#Use map to create new column
df_titanic["Binary_Male"] = df_titanic.Sex.map({"male":1, "female":0})

df_titanic.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Binary_Male
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,1
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S,0
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,1
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S,1
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S,0


`map` works for any lookup, not just binary ones. Below we relabel the passenger classes, and then compute the fraction of male passengers.

In [11]:
df_titanic['Pclass'] = df_titanic['Pclass'].map({3: 'Eco', 2: 'Bus', 1: 'Fst'})
df_titanic

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Binary_Male
0,892,Eco,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,1
1,893,Eco,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0000,,S,0
2,894,Bus,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,1
3,895,Eco,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S,1
4,896,Eco,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S,0
...,...,...,...,...,...,...,...,...,...,...,...,...
413,1305,Eco,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S,1
414,1306,Fst,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9000,C105,C,0
415,1307,Eco,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S,1
416,1308,Eco,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,,S,1


In [10]:
# Compute the fraction of male passengers
num_male = len(df_titanic[df_titanic['Sex'] == 'male'])
# num_female = len(df_titanic[df_titanic['Sex'] == 'female'])
num_male / len(df_titanic)
# Equivalent, since Binary_Male is 1 for males: df_titanic["Binary_Male"].mean()

0.6363636363636364

More Pandas `map` examples: [here](https://www.w3resource.com/pandas/series/series-map.php) 
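
Two details of `map` are worth seeing on a small example: values that are missing from the dictionary become `NaN`, and `map` also accepts a function. The sketch below uses a made-up series (not taken from `Titanic.csv`) to illustrate both.

```python
import pandas as pd

# Toy data for illustration only (hypothetical, not from Titanic.csv)
sex = pd.Series(["male", "female", "unknown", "male"])

# Values that do not appear as dictionary keys are mapped to NaN
as_binary = sex.map({"male": 1, "female": 0})
print(as_binary.tolist())   # [1.0, 0.0, nan, 1.0]

# map also accepts a function, which is called once per value
lengths = sex.map(len)
print(lengths.tolist())     # [4, 6, 7, 4]
```

Because of the `NaN`, the mapped column becomes a float column, so it is a good idea to check for unexpected labels before mapping.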

## Apply

If we write a function (or there is a built-in function) that works for a single entry, we can use the `apply` method to apply it to each entry in a column (or to the whole dataset).


Let's say we want to round `Fare` and `Age`. The `round` function can be used on a single entry.

In [7]:
round(df_titanic.Fare[0])

8

Actually, the `round` function is written in a very flexible way so that you can directly use it for the dataframe.

In [12]:
round(df_titanic[["Age","Fare"]],ndigits=2) 

Unnamed: 0,Age,Fare
0,34.5,7.83
1,47.0,7.00
2,62.0,9.69
3,27.0,8.66
4,22.0,12.29
...,...,...
413,,8.05
414,39.0,108.90
415,38.5,7.25
416,,8.05


But in general, functions written for a single entry cannot be directly applied to the whole data frame. Should we write another (far more complex) version? Luckily, the `.apply` method lets us scale such a function up to the whole data frame.

- `.apply` is called on the data frame. The function itself (`round` in this case) is passed as an argument.

In [8]:
df_titanic[["Age","Fare"]].apply(round)

Unnamed: 0,Age,Fare
0,34.0,8.0
1,47.0,7.0
2,62.0,10.0
3,27.0,9.0
4,22.0,12.0
...,...,...
413,,8.0
414,39.0,109.0
415,38.0,7.0
416,,8.0


Note that we can provide an additional argument for `round`, here `ndigits=2`, as a keyword argument to `apply`.

In [13]:
df_titanic[["Age","Fare"]].apply(round, ndigits=2)

Unnamed: 0,Age,Fare
0,34.5,7.83
1,47.0,7.00
2,62.0,9.69
3,27.0,8.66
4,22.0,12.29
...,...,...
413,,8.05
414,39.0,108.90
415,38.5,7.25
416,,8.05
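
It may help to see what `apply` actually hands to the function when it is called on a data frame. The following sketch, on a made-up two-row frame, suggests that each column arrives as a whole `Series`, and that extra keyword arguments are forwarded to the function.

```python
import pandas as pd

# Toy frame for illustration only (values are hypothetical)
toy = pd.DataFrame({"Age": [34.5, 47.0], "Fare": [7.8292, 7.0]})

# DataFrame.apply hands each column to the function as a Series
seen = toy.apply(lambda col: type(col).__name__)
print(seen.tolist())              # ['Series', 'Series']

# Extra keyword arguments are forwarded to the function
rounded = toy.apply(round, ndigits=2)
print(rounded["Fare"].tolist())   # [7.83, 7.0]
```

This is why `round` works here: it knows how to round a whole `Series`. A function that only understands a single number should be applied to one column at a time with `Series.apply`.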


In [15]:
x = 3.14
round(x) # function

3

In [16]:
s = 'kuro'
s.upper()   # string's method

'KURO'

In [22]:
df_titanic['Pclass'] = df_titanic['Pclass'].str.upper()
df_titanic.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Binary_Male
0,892,ECO,"Kelly, Mr. James",MALE,34.5,0,0,330911,7.8292,,Q,1
1,893,ECO,"Wilkes, Mrs. James (Ellen Needs)",FEMALE,47.0,1,0,363272,7.0,,S,0
2,894,BUS,"Myles, Mr. Thomas Francis",MALE,62.0,0,0,240276,9.6875,,Q,1
3,895,ECO,"Wirz, Mr. Albert",MALE,27.0,0,0,315154,8.6625,,S,1
4,896,ECO,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",FEMALE,22.0,1,1,3101298,12.2875,,S,0


In [23]:
df_titanic['Sex'] = df_titanic['Sex'].apply(str.upper)
df_titanic.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Binary_Male
0,892,ECO,"Kelly, Mr. James",MALE,34.5,0,0,330911,7.8292,,Q,1
1,893,ECO,"Wilkes, Mrs. James (Ellen Needs)",FEMALE,47.0,1,0,363272,7.0,,S,0
2,894,BUS,"Myles, Mr. Thomas Francis",MALE,62.0,0,0,240276,9.6875,,Q,1
3,895,ECO,"Wirz, Mr. Albert",MALE,27.0,0,0,315154,8.6625,,S,1
4,896,ECO,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",FEMALE,22.0,1,1,3101298,12.2875,,S,0


`apply` works with custom functions as well.


In [24]:
def age_group(age, threshold=30): 
    """Return 0 if age < threshold, and 1 otherwise.

    Note: a missing age (NaN) fails the comparison, so it is labeled 1.
    """
    if age<threshold:
        return 0
    else:
        return 1
# age_group(df_titanic["Age"]) would fail: `if` cannot evaluate a whole column at once.
# apply calls age_group once per entry instead.
df_titanic["Age"].apply(age_group, threshold=30)

0      1
1      1
2      1
3      0
4      0
      ..
413    1
414    1
415    1
416    1
417    1
Name: Age, Length: 418, dtype: int64
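
In the output above, passengers with a missing age end up in group 1, because a comparison with `NaN` is always false. If that is not desired, one option is to check for missing values explicitly. The function `age_group_nan_safe` below is a hypothetical variant written for this sketch, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy ages for illustration only; the last one is missing
ages = pd.Series([34.5, 27.0, np.nan])

def age_group_nan_safe(age, threshold=30):
    """Hypothetical variant of age_group that keeps missing ages missing."""
    if pd.isna(age):
        return np.nan
    return 0 if age < threshold else 1

labels = ages.apply(age_group_nan_safe, threshold=30)
print(labels.tolist())   # [1.0, 0.0, nan]
```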

#### Custom Functions

The main benefit of `apply` is using it with custom functions. Each passenger has a title (Mr., Mrs., etc.). The title comes after the comma in the name. Let's write a custom function to get this title from each name and store it in a column.

In [None]:
def Get_Title(name):
    # split the name into a list of words, and return the first word that contains a period
    parsed_name = name.split(" ")
    for word in parsed_name:
        if "."  in word:
            return word
    return "NA"

This function splits the name by spaces. The first word that contains a period is taken to be the title.

In [10]:
#Use apply with this custom function
df_titanic.Name.apply(Get_Title).tail()

413        Mr.
414      Dona.
415        Mr.
416        Mr.
417    Master.
Name: Name, dtype: object

Just like with the built in functions, we supply the name of the function we want applied to each value in the column.  The result is a series that we can store as a new column.

In [11]:
#Add  in the new Title column
df_titanic["Title"] = df_titanic.Name.apply(Get_Title)
df_titanic.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Binary_Male,Title
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,1,Mr.
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S,0,Mrs.
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,1,Mr.
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S,1,Mr.
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S,0,Mrs.


In [12]:
#Let's see the breakdown of each title
df_titanic.Title.value_counts()

Mr.        240
Miss.       78
Mrs.        72
Master.     21
Col.         2
Rev.         2
Ms.          1
Dr.          1
Dona.        1
Name: Title, dtype: int64
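
As an aside, the same kind of extraction can often be done without `apply`, using the string methods that pandas provides under `.str`. The sketch below is an alternative to `Get_Title`, not a drop-in replacement: it assumes the title is the text between the first comma and the following period, and it is shown on three names copied from the outputs above.

```python
import pandas as pd

names = pd.Series([
    "Kelly, Mr. James",
    "Wilkes, Mrs. James (Ellen Needs)",
    "Oliva y Ocana, Dona. Fermina",
])

# Capture the text between the comma and the first period (period included)
titles = names.str.extract(r",\s*([^.]+\.)", expand=False)
print(titles.tolist())   # ['Mr.', 'Mrs.', 'Dona.']
```

Names that do not match the pattern yield `NaN` rather than the string "NA" returned by `Get_Title`.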

#### Applying Functions to Rows/Columns 

On other occasions, we do not want to apply a function to every entry of the data frame.
We may have some functions, like `mean` or `max` that originally work with a list (series).
We may want to apply them to the data frame *row-wise* or *column-wise*.
Let's look at the following example.

The following is the standard use of `max`. 

In [13]:
max(df_titanic.Age)

76.0

Let's say we want to apply it to multiple columns of df_titanic.

In [27]:
df_titanic[["Age", "Fare"]].apply(max, axis=0)

Age      76.0000
Fare    512.3292
dtype: float64

- We use `.apply` after a dataframe.
- This case is different from before, because we don't want to apply `max` to each entry. Instead, we want to apply it to each column of the data frame. This is why we use **axis=0**.
- In general, `axis=0` applies the function to each column, and `axis=1` applies it to each row.
- The outcome is itself a series.

Next we see an example of using `apply` on the rows (`axis=1`).
Note that the index of the resulting series is the index of the original dataframe.

In [15]:
#Using apply - axis = 1
df_titanic[["Age", "Fare"]].apply(max, axis=1)

0       34.5
1       47.0
2       62.0
3       27.0
4       22.0
       ...  
413      NaN
414    108.9
415     38.5
416      NaN
417      NaN
Length: 418, dtype: float64
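
The difference between the two axes is easy to check on a tiny example. The frame below is made up for illustration; the row labels `a` and `b` are arbitrary.

```python
import pandas as pd

# Toy frame for illustration only (values are hypothetical)
toy = pd.DataFrame({"Age": [34.5, 22.0], "Fare": [7.83, 12.29]}, index=["a", "b"])

per_column = toy.apply(max, axis=0)   # one result per column
per_row = toy.apply(max, axis=1)      # one result per row

print(per_column.to_dict())   # {'Age': 34.5, 'Fare': 12.29}
print(per_row.to_dict())      # {'a': 34.5, 'b': 22.0}
```

With `axis=0` the result is indexed by the column names; with `axis=1` it is indexed by the row labels of the original frame.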

When we use `apply` with `axis=1`, the function receives one row at a time. Let's create a column called "Old_Man" that is 1 if the passenger is male and aged 60 or above.

In [25]:
def Get_Old_Man(row):
    
    # Sex was upper-cased in an earlier cell, so compare case-insensitively
    gender  = row["Sex"].lower()
    age  = row["Age"]
    
    if gender == "male" and age>=60:
        return 1
    else:
        return 0

In [30]:
#Make the Old Man column
df_titanic["Old_Man"] = df_titanic.apply(Get_Old_Man, axis=1)
df_titanic.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Binary_Male,Old_Man
0,892,ECO,"Kelly, Mr. James",MALE,34.5,0,0,330911,7.8292,,Q,1,0
1,893,ECO,"Wilkes, Mrs. James (Ellen Needs)",FEMALE,47.0,1,0,363272,7.0,,S,0,0
2,894,BUS,"Myles, Mr. Thomas Francis",MALE,62.0,0,0,240276,9.6875,,Q,1,1
3,895,ECO,"Wirz, Mr. Albert",MALE,27.0,0,0,315154,8.6625,,S,1,0
4,896,ECO,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",FEMALE,22.0,1,1,3101298,12.2875,,S,0,0


In [18]:
#Lets compute the number of old men.
df_titanic.Old_Man.sum()

7

Note that since we are again applying the function to multiple columns, I have to specify `axis=1` because I want the function applied to each row.

Lastly, let's see how you can use apply with a custom function that achieves a similar goal. In addition to the input of a row, it also allows the input of a lower bound of age of an old man. Let's recreate the column with this lower bound set to 50.

In [31]:
#Function has new input
def Get_Old_Man(row, old_man_age):
    
    # Sex was upper-cased in an earlier cell, so compare case-insensitively
    gender  = row["Sex"].lower()
    age  = row["Age"]
    
    if gender == "male" and age>=old_man_age:
        return 1
    else:
        return 0

#Make the Old Man column
#need to specify old_man_age explicitly
df_titanic["Old_Man"] = df_titanic.apply(Get_Old_Man, old_man_age=50,  axis=1)
df_titanic.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Binary_Male,Old_Man
0,892,ECO,"Kelly, Mr. James",MALE,34.5,0,0,330911,7.8292,,Q,1,0
1,893,ECO,"Wilkes, Mrs. James (Ellen Needs)",FEMALE,47.0,1,0,363272,7.0,,S,0,0
2,894,BUS,"Myles, Mr. Thomas Francis",MALE,62.0,0,0,240276,9.6875,,Q,1,1
3,895,ECO,"Wirz, Mr. Albert",MALE,27.0,0,0,315154,8.6625,,S,1,0
4,896,ECO,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",FEMALE,22.0,1,1,3101298,12.2875,,S,0,0


Let's recompute the number of old men, which should go up.

In [20]:
#Lets compute the number of old man
df_titanic.Old_Man.sum()

20
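
Row-wise `apply` is flexible, but for a simple condition like this one the same column can usually be built from boolean operations on whole columns, which tends to be faster on large data. The sketch below is an alternative under that assumption, shown on a made-up frame rather than on `df_titanic`.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only (values are hypothetical)
toy = pd.DataFrame({
    "Sex": ["male", "female", "MALE", "male"],
    "Age": [62.0, 70.0, 55.0, np.nan],
})

old_man_age = 50
is_male = toy["Sex"].str.lower() == "male"   # case-insensitive comparison
is_old = toy["Age"] >= old_man_age           # a missing age compares as False
toy["Old_Man"] = (is_male & is_old).astype(int)
print(toy["Old_Man"].tolist())   # [1, 0, 1, 0]
```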

Let's see another example. We want to create a new label combining "Embarked" and the first character of "Ticket" (which may be a digit or a letter).


In [21]:
def New_Label(row):

    tix = str(row["Ticket"])
    emb = row["Embarked"]
    new_label = emb+tix[0]

    return new_label

In [22]:
df_titanic.apply(New_Label, axis = 1)

0      Q3
1      S3
2      Q2
3      S3
4      S3
       ..
413    SA
414    CP
415    SS
416    S3
417    C2
Length: 418, dtype: object

*(Exercise)*: use the `apply` method to calculate the adjusted fare of each passenger. If the passenger is *older than 50* or *less than 15*, then `adj_fare` is 10% off the listed fare.
  - Can you rewrite the function that `10%` becomes a function input? When you apply the function, you can then specify the discount rate.


## Groupby

Very often, as data scientists, we want to achieve the following goal:

- **Splitting** the data into groups based on some criteria
- **Applying** a function to each group independently
- **Combining** the results into a data structure

For example, in the Titanic dataset, we may want to calculate the average age of passengers in each class.

In [33]:
df_titanic = pd.read_csv("Titanic.csv")
df_titanic.head(20)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
5,897,3,"Svensson, Mr. Johan Cervin",male,14.0,0,0,7538,9.225,,S
6,898,3,"Connolly, Miss. Kate",female,30.0,0,0,330972,7.6292,,Q
7,899,2,"Caldwell, Mr. Albert Francis",male,26.0,1,1,248738,29.0,,S
8,900,3,"Abrahim, Mrs. Joseph (Sophie Halaut Easu)",female,18.0,0,0,2657,7.2292,,C
9,901,3,"Davies, Mr. John Samuel",male,21.0,2,0,A/4 48871,24.15,,S


We may sort the values according to ``Pclass``, and do the calculation separately.
But this is very inconvenient especially for large datasets.

In [34]:
df_titanic.sort_values(by=["Pclass", "Age"], ascending=[True, False]).head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
96,988,1,"Cavendish, Mrs. Tyrell William (Julia Florence...",female,76.0,1,0,19877,78.85,C46,S
81,973,1,"Straus, Mr. Isidor",male,67.0,1,0,PC 17483,221.7792,C55 C57,S
179,1071,1,"Compton, Mrs. Alexander Taylor (Mary Eliza Ing...",female,64.0,0,2,PC 17756,83.1583,E45,C
236,1128,1,"Warren, Mr. Frank Manley",male,64.0,1,0,110813,75.25,D37,C
305,1197,1,"Crosby, Mrs. Edward Gifford (Catherine Elizabe...",female,64.0,1,1,112901,26.55,B26,S


In [37]:
df_titanic.groupby('Pclass')['Age'].mean()

Pclass
1    40.918367
2    28.777500
3    24.027945
Name: Age, dtype: float64
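
The split, apply, and combine steps listed at the start of this section can also be spelled out by hand, which may make it clearer what `groupby` does for us. The sketch below uses a made-up four-row frame and compares the manual version with the one-line `groupby` version.

```python
import pandas as pd

# Toy frame for illustration only (values are hypothetical)
toy = pd.DataFrame({"Pclass": [1, 3, 1, 3], "Age": [40.0, 20.0, 50.0, 30.0]})

# Split:   iterating over a groupby yields (key, sub-DataFrame) pairs
# Apply:   compute the mean age of each piece
# Combine: collect the results into one Series indexed by the key
by_hand = pd.Series({key: piece["Age"].mean() for key, piece in toy.groupby("Pclass")})

shortcut = toy.groupby("Pclass")["Age"].mean()
print(by_hand.to_dict())    # {1: 45.0, 3: 25.0}
print(shortcut.to_dict())   # {1: 45.0, 3: 25.0}
```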

``groupby`` allows one to, as the name indicates, group the dataset by a value and then apply a function to each group.

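A basic use is to combine it with the `.first` method, which returns the first non-null entry of each column within each group (so the values in one output row may come from different passengers).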

In [38]:
df_titanic.groupby(["Pclass", "Sex"]).first()

Unnamed: 0_level_0,Unnamed: 1_level_0,PassengerId,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
Pclass,Sex,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
1,female,904,"Snyder, Mrs. John Pillsbury (Nelle Stevenson)",23.0,1,0,21228,82.2667,B45,S
1,male,903,"Jones, Mr. Charles Cresson",46.0,0,0,694,26.0,A21,S
2,female,907,"del Carlo, Mrs. Sebastiano (Argenia Genovesi)",24.0,1,0,SC/PARIS 2167,27.7208,F4,C
2,male,894,"Myles, Mr. Thomas Francis",62.0,0,0,240276,9.6875,F,Q
3,female,893,"Wilkes, Mrs. James (Ellen Needs)",47.0,1,0,363272,7.0,G6,S
3,male,892,"Kelly, Mr. James",34.5,0,0,330911,7.8292,F G63,Q
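
In the table above, the `Cabin` values are filled in even for passengers whose own cabin is missing in the data, which suggests that `.first` skips missing values column by column. The sketch below, on a made-up frame, contrasts `.first()` with `.head(1)`, which returns the actual first row of each group.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only (values are hypothetical)
toy = pd.DataFrame({
    "Pclass": [3, 3, 1],
    "Cabin": [np.nan, "G6", "B45"],
})

# first(): first non-null value per column within each group
first_values = toy.groupby("Pclass")["Cabin"].first()
print(first_values.to_dict())   # {1: 'B45', 3: 'G6'}

# head(1): the actual first row of each group, NaN included
first_rows = toy.groupby("Pclass").head(1)
print(first_rows["Cabin"].tolist())   # [nan, 'B45']
```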


In addition to showing the first entries of each group, we can also use the `.size` method to get the size of each group.

In [26]:
df_titanic.groupby(["Pclass", "Sex"]).size()

Pclass  Sex   
1       female     50
        male       57
2       female     30
        male       63
3       female     72
        male      146
dtype: int64

In [43]:
def kind_of_person(age):
    if age < 30:
        return "Young"
    if age < 65:
        return "Middle Age"
    return "Old"

df_titanic['Age'].apply(kind_of_person)
# df_titanic['Kind'] = df_titanic['Age'].apply(kind_of_person)
# df_titanic.head()

0      Middle Age
1      Middle Age
2      Middle Age
3           Young
4           Young
          ...    
413           Old
414    Middle Age
415    Middle Age
416           Old
417           Old
Name: Age, Length: 418, dtype: object

In [41]:
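# Store the labels as a column first, so that we can group by them.
# Note: missing ages (NaN) fail both comparisons and end up labeled "Old".
df_titanic['Kind'] = df_titanic['Age'].apply(kind_of_person)
df_titanic.groupby(['Kind', 'Sex']).size()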

Kind        Sex   
Middle Age  female     55
            male       90
Old         female     26
            male       62
Young       female     71
            male      114
dtype: int64

`groupby` also lets us compute group-specific statistics such as the mean.

In [None]:
df_titanic.groupby("Pclass").Age.mean()

What happened in the above process?

- ``groupby`` creates three separate datasets, corresponding to the three ``Pclass``.
- When applying ``mean``, the function is applied to the three datasets/dataframes separately. Each outcome is recorded in a row.
- The three rows are combined, indexed by ``Pclass``.

*(Question)* Why is each outcome a row instead of a single number?

``groupby`` can be easily combined with ``apply``. For example, we can write a function that outputs the range of ``Age``.

But think about what the input and output of the function are. In this case, we apply a function that takes a *DataFrame* as input and outputs a *number*. The result is a *Series*. 

In [28]:
def range_age(df):
    return df.Age.max()-df.Age.min()

df_titanic.groupby("Pclass").apply(range_age)

Pclass
1    70.00
2    62.08
3    60.33
dtype: float64

We can also sort the groups by the range of age.

In [31]:
# We can also sort the groups by the range of age.
df_titanic.groupby("Pclass").apply(range_age).sort_values()

Pclass
3    60.33
2    62.08
1    70.00
dtype: float64

What should we do if we want both the *mean* and the *standard deviation* of the age within each group?

- The function should output two values.
- Maybe we can return a tuple.

In [42]:
def mean_sd_age(df):
    return (df.Age.mean(), df.Age.std())

df_Pclass = df_titanic.groupby("Pclass").apply(mean_sd_age)
df_Pclass

Pclass
1    (40.91836734693877, 13.956798933932587)
2              (28.7775, 12.943457873541288)
3    (24.02794520547945, 10.537105270084329)
dtype: object

This is not exactly what we want. The output is a **pandas series**
in which each entry is a tuple.
A better way is to make it a two-column dataframe, so that we can operate on it later within the pandas framework.

To do this, we need the function to output a pandas series.

In [44]:
def mean_sd_age(df):
    return pd.Series((df.Age.mean(), df.Age.std()))

df_Pclass = df_titanic.groupby("Pclass").apply(mean_sd_age)
df_Pclass

Unnamed: 0_level_0,0,1
Pclass,Unnamed: 1_level_1,Unnamed: 2_level_1
1,40.918367,13.956799
2,28.7775,12.943458
3,24.027945,10.537105


This is nice, but it would be better to have more informative headers. 
The header names should be included in the returned value.
A tuple cannot carry that information when we create a pandas series/dataframe from it.
So we can use a dictionary instead.

In [45]:
def mean_sd_age(df):
    return pd.Series({"average":df.Age.mean(), "standard deviation":df.Age.std()})

df_Pclass = df_titanic.groupby("Pclass").apply(mean_sd_age)
df_Pclass

Unnamed: 0_level_0,average,standard deviation
Pclass,Unnamed: 1_level_1,Unnamed: 2_level_1
1,40.918367,13.956799
2,28.7775,12.943458
3,24.027945,10.537105
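
For standard statistics like these, pandas also offers a shortcut that avoids writing a custom function: the `agg` method with named aggregations. The sketch below shows it on a made-up frame; the output column names `average` and `sd` are chosen for this example.

```python
import pandas as pd

# Toy frame for illustration only (values are hypothetical)
toy = pd.DataFrame({"Pclass": [1, 1, 3, 3], "Age": [40.0, 50.0, 20.0, 30.0]})

# Named aggregation: keyword = output column name, value = aggregation to run
summary = toy.groupby("Pclass")["Age"].agg(average="mean", sd="std")
print(summary)
```

The custom-function approach remains useful when the statistic is not available as a built-in aggregation.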


*(Exercise):* we want to select the row of the youngest person in each ``Pclass``, and output them. How should we do it?

More examples about pandas `groupby` method [here](https://www.machinelearningplus.com/pandas/pandas-groupby-examples/).

## Pivot Tables

Pivot tables are a common tabularization method to summarize the data.

In [46]:
pivot_age = df_titanic.pivot_table(values='Age', index='Pclass', columns='Sex', aggfunc='mean')
pivot_age

Sex,female,male
Pclass,Unnamed: 1_level_1,Unnamed: 2_level_1
1,41.333333,40.52
2,24.376552,30.940678
3,23.0734,24.525104


There are a few inputs to the method.
- `values` is the column that the calculation is applied to.
- `index` is the column whose values become the rows of the pivot table.
- `columns` works like `index`, but its values become the columns of the pivot table.
- `aggfunc` is the function applied to `values` within each subgroup defined by `index` and `columns`.
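
One way to understand these inputs is to relate them to `groupby` from the previous section. The sketch below, on a made-up frame, builds the same table twice: once with `pivot_table`, and once by grouping on both keys and then moving one key from the rows into the columns with `unstack`.

```python
import pandas as pd

# Toy frame for illustration only (values are hypothetical)
toy = pd.DataFrame({
    "Pclass": [1, 1, 3, 3],
    "Sex": ["female", "male", "female", "male"],
    "Age": [40.0, 50.0, 20.0, 30.0],
})

pivot = toy.pivot_table(values="Age", index="Pclass", columns="Sex", aggfunc="mean")

# Same numbers via groupby: group by the index and columns keys, then
# move the "Sex" level from the row index into the columns
via_groupby = toy.groupby(["Pclass", "Sex"])["Age"].mean().unstack("Sex")

print(pivot)
print(via_groupby)
```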

Think about which of these inputs should be categorical variables, and which can be continuous.

You don't have to have both index and columns. For example, if we don't set `columns`, what will happen?
We can also have more than one column.

In [47]:
# You can have multiple columns in the table
multi_pivot_example = df_titanic.pivot_table(values='Age', 
                                             index='Embarked', 
                                             columns=['Pclass', 'Sex'], 
                                             aggfunc='mean')
multi_pivot_example

Pclass,1,1,2,2,3,3
Sex,female,male,female,male,female,male
Embarked,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
C,41.0,39.962963,19.75,29.4,24.166667,22.423077
Q,37.0,,,52.666667,25.681818,23.928571
S,42.0,41.173913,25.1168,29.813725,22.005152,24.939605


The columns have multiple levels. To access one of them, you can use tuples:


In [48]:
multi_pivot_example[(1, 'female')]

Embarked
C    41.0
Q    37.0
S    42.0
Name: (1, female), dtype: float64
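
Besides a full tuple, multi-level columns can be selected in a couple of other ways. The sketch below builds a small frame with two column levels by hand (the numbers are made up) and shows selection by full tuple, by the first level only, and by the second level using `xs`.

```python
import pandas as pd

# Toy frame with two column levels, for illustration only
cols = pd.MultiIndex.from_tuples(
    [(1, "female"), (1, "male"), (3, "female")], names=["Pclass", "Sex"]
)
toy = pd.DataFrame([[41.0, 40.0, 24.0], [37.0, 39.0, 25.0]], index=["C", "Q"], columns=cols)

one_column = toy[(1, "female")]                   # Series: a single column
first_class = toy[1]                              # DataFrame: all columns under Pclass 1
females = toy.xs("female", level="Sex", axis=1)   # DataFrame: one column per Pclass

print(one_column.tolist())         # [41.0, 37.0]
print(list(first_class.columns))   # ['female', 'male']
print(list(females.columns))       # [1, 3]
```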

In [None]:
multi_pivot_example[1]

What functions can we use in `aggfunc`?

- Many built-in functions:
  - `sum`, `mean`, `median`, `min`, `max`, `count`, `nunique`

- Custom functions. You can write your own function that takes in a **series of values and returns a single value**. 


In [None]:
def range_func(x):
    # use the Series methods, which skip missing values (NaN)
    return x.max() - x.min()
pivot_age_range = df_titanic.pivot_table(values='Age', index = 'Embarked', columns = 'Pclass', aggfunc=range_func)
pivot_age_range
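
To check that a custom aggregation behaves as expected, it can help to try it on a frame small enough to verify by hand. The sketch below uses made-up data with one missing age; `age_range` is a name chosen for this example.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only (values are hypothetical)
toy = pd.DataFrame({
    "Embarked": ["C", "C", "S", "S", "S"],
    "Age": [20.0, 50.0, 30.0, np.nan, 45.0],
})

def age_range(values):
    """Take a Series of ages and return one number (missing values are skipped)."""
    return values.max() - values.min()

ranges = toy.pivot_table(values="Age", index="Embarked", aggfunc=age_range)
print(ranges["Age"].to_dict())   # {'C': 30.0, 'S': 15.0}
```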

*(Exercise)*: using the Titanic dataset
- Before creating the pivot tables, categorize passengers into two age groups: "Younger than 30" and "30 and Older". Use the `apply` function to create a new column called `Age Group` based on the `Age` column.
- Create a pivot table that displays the average `Fare` of passengers based on their age group and sex. Which group had a higher average fare?
- Which embarkation port had the highest number of passengers who are "Younger than 30"?


## Selected Solutions to Exercises



In [None]:
# *(Exercise)*: use the `apply` method to calculate the adjusted fare of each passenger. If the passenger is *older than 50* or *less than 15*, then `adj_fare` is 10% off the listed fare.
# solution
def adjust_fare(row):
    # 10% off if the passenger is older than 50 or younger than 15
    if row.Age>50 or row.Age<15:
        return row.Fare * 0.9
    else:
        return row.Fare

df_titanic["adj_fare"] = df_titanic.apply(adjust_fare, axis=1)
df_titanic

In [49]:
# Last exercise
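# Note: this uses age <= 30, so passengers aged exactly 30 fall in group 0,
# whereas the exercise wording ("Younger than 30") corresponds to age < 30.
# Passengers with a missing age (NaN) fall in group 1.
def age_group(age):
    if age<=30:
        return 0
    else:
        return 1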
df_titanic["Age Group"] = df_titanic.Age.apply(age_group)
# Create a pivot table that displays the average `fare` of passengers based on their age group and sex. Which group had a higher average fare?
pivot_tab1 = df_titanic.pivot_table(values = 'Fare', index = "Age Group", columns="Sex", aggfunc='mean')
# Which embarkation port had the highest number of passengers who are "Younger than 30"?
pivot_tab2 = df_titanic.pivot_table(values = 'Fare', index = "Embarked", columns="Age Group", aggfunc='count')

In [50]:
pivot_tab1

Sex,female,male
Age Group,Unnamed: 1_level_1,Unnamed: 2_level_1
0,34.638261,23.107756
1,65.260055,31.356573


In [51]:
pivot_tab2

Age Group,0,1
Embarked,Unnamed: 1_level_1,Unnamed: 2_level_1
C,36,66
Q,14,32
S,150,119
