# Pandas

This notebook covers the following advanced but useful pandas techniques:

- [String Methods](#string-methods)
- [Map](#map)
- [Apply](#apply)
- [Groupby](#groupby)

This tutorial will use the Titanic Data Set. This dataset records the information of the passengers onboard the Titanic.


In [1]:
import pandas as pd

df_titanic = pd.read_csv("Data/Titanic.csv")
df_titanic.head(20)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
5,897,3,"Svensson, Mr. Johan Cervin",male,14.0,0,0,7538,9.225,,S
6,898,3,"Connolly, Miss. Kate",female,30.0,0,0,330972,7.6292,,Q
7,899,2,"Caldwell, Mr. Albert Francis",male,26.0,1,1,248738,29.0,,S
8,900,3,"Abrahim, Mrs. Joseph (Sophie Halaut Easu)",female,18.0,0,0,2657,7.2292,,C
9,901,3,"Davies, Mr. John Samuel",male,21.0,2,0,A/4 48871,24.15,,S


The columns are as follows:

- Survival - Survival (0 = No; 1 = Yes). Not included in test.csv file.
- Pclass - Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
- Name - Name
- Sex - Sex
- Age - Age
- SibSp - Number of Siblings/Spouses Aboard
- Parch - Number of Parents/Children Aboard
- Ticket - Ticket Number
- Fare - Passenger Fare
- Cabin - Cabin
- Embarked - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

## String Methods

We can use the usual string methods to manipulate data inside of a dataframe. To invoke a string method, use the `.str` attribute of a series. We will work with the Name column, which is stored as strings.

This is analogous to using `.dt` to access datetime methods.
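
To make the comparison with `.dt` concrete, here is a minimal sketch. The two dates are hypothetical values chosen for illustration and do not come from the Titanic file.

```python
import pandas as pd

# Hypothetical dates, not part of the Titanic data
dates = pd.Series(pd.to_datetime(["1912-04-10", "1912-04-15"]))

# .dt exposes datetime methods element by element,
# just as .str exposes string methods
years = dates.dt.year
day_names = dates.dt.day_name()

print(years.tolist())
print(day_names.tolist())
```

Both accessors follow the same idea: the method after the accessor is applied to every element of the series.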

In [2]:
#Using the lower() method

df_titanic["Name"].str.lower().head()

0                                kelly, mr. james
1                wilkes, mrs. james (ellen needs)
2                       myles, mr. thomas francis
3                                wirz, mr. albert
4    hirvonen, mrs. alexander (helga e lindqvist)
Name: Name, dtype: object

- Here `.str` gives access to string methods that operate on each element of the pandas series.
- We apply `.lower()` to make each string lowercase.

The above bit of code returns the Name column as a series with all letters in lowercase. Another useful string method is the `contains()` method, which returns, for each entry, a boolean indicating whether the entry contains the input string.

In [3]:
#using contains() method to check for title Mr.
#the raw string r"..." keeps the backslash intact for the regex engine
df_titanic["Name"].str.contains(r"Mr\.").head()

0     True
1    False
2     True
3     True
4    False
Name: Name, dtype: bool

*(Exercise)*

Note that we need the escape character "\\" to look for the literal ".". This is because "." matches any character in regex. Read more about regular expressions (regex) [here](https://docs.python.org/3/library/re.html).
We can use the `regex=False` argument to treat the pattern as a plain string instead, which avoids the escaping.
We can also store the returned series in a new column, which we do after the `regex=False` example.
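
A small sketch of why the escaping matters. The first two names are from the dataset; the third one (without a period) is a hypothetical entry added for illustration.

```python
import pandas as pd

names = pd.Series([
    "Kelly, Mr. James",
    "Wilkes, Mrs. James (Ellen Needs)",
    "Smith, Mr James",  # hypothetical name without the period
])

# Unescaped: "." matches any character, so "Mrs" and "Mr " also match
unescaped = names.str.contains("Mr.")

# Escaped: only a literal "Mr." matches
escaped = names.str.contains(r"Mr\.")

# regex=False: the pattern is treated as a plain string
literal = names.str.contains("Mr.", regex=False)

print(unescaped.tolist())  # [True, True, True]
print(escaped.tolist())    # [True, False, False]
print(literal.tolist())    # [True, False, False]
```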

In [2]:
df_titanic["Name"].str.contains("Mr.", regex=False).head()

0     True
1    False
2     True
3     True
4    False
Name: Name, dtype: bool

In [None]:
#Creating new column
df_titanic["Bool_Mr"] = df_titanic["Name"].str.contains(r"Mr\.")
# the same as 
# df_titanic["Bool_Mr"] = df_titanic["Name"].str.contains("Mr.", regex=False)
df_titanic.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Bool_Mr
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,True
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S,False
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,True
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S,True
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S,False


Now we have a new boolean column and we can easily figure out, for example, the fraction of "Mr."s as follows:

In [None]:
#Computing the proportion of Mr.s
df_titanic.Bool_Mr.mean()

0.5741626794258373

We can also use these string methods on the column names. Let's say I want to remove all the underscores (there is only one). I can do that with the replace method, replacing each underscore with an empty string. Recall that I access the column names through the columns attribute of any dataframe.

In [None]:
df_titanic.columns

Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Cabin', 'Embarked', 'Bool_Mr'],
      dtype='object')

In [None]:
#Replace the underscore
df_titanic.columns = df_titanic.columns.str.replace("_", "")

df_titanic.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,BoolMr
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,True
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S,False
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,True
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S,True
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S,False


Notice the name of the column we created above has been changed.  Before we move to the next section, I will delete this column.

In [None]:
del df_titanic["BoolMr"]

df_titanic.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


For a comprehensive list of string methods, see the [pandas tutorial](https://pandas.pydata.org/docs/user_guide/text.html) on working with text data.

In [None]:
df_titanic.Name.unique()

array(['Kelly, Mr. James', 'Wilkes, Mrs. James (Ellen Needs)',
       'Myles, Mr. Thomas Francis', 'Wirz, Mr. Albert',
       'Hirvonen, Mrs. Alexander (Helga E Lindqvist)',
       'Svensson, Mr. Johan Cervin', 'Connolly, Miss. Kate',
       'Caldwell, Mr. Albert Francis',
       'Abrahim, Mrs. Joseph (Sophie Halaut Easu)',
       'Davies, Mr. John Samuel', 'Ilieff, Mr. Ylio',
       'Jones, Mr. Charles Cresson',
       'Snyder, Mrs. John Pillsbury (Nelle Stevenson)',
       'Howard, Mr. Benjamin',
       'Chaffee, Mrs. Herbert Fuller (Carrie Constance Toogood)',
       'del Carlo, Mrs. Sebastiano (Argenia Genovesi)',
       'Keane, Mr. Daniel', 'Assaf, Mr. Gerios',
       'Ilmakangas, Miss. Ida Livija',
       'Assaf Khalil, Mrs. Mariana (Miriam")"', 'Rothschild, Mr. Martin',
       'Olsen, Master. Artur Karl',
       'Flegenheim, Mrs. Alfred (Antoinette)',
       'Williams, Mr. Richard Norris II',
       'Ryerson, Mrs. Arthur Larned (Emily Maria Borie)',
       'Robins, Mr. Alexander

*(Exercise)*: check whether every row with "Mr." in the name is also labeled "male".

In [None]:
# check the rows with "Mr." in the name column have "male" in the Sex column
# all(df_titanic[df_titanic.Name.str.contains("Mr\.")].Sex == "male")

In [5]:
def vc(x):
    return x + 2

lst = [5, 2, 1, 3, 1, 4, 8]

lst2 = list(map(vc, lst))
lst2

lst2 = list(map(lambda x: x + 2, lst))
lst2


# lst3 = list(map(str, lst))
# lst3

[7, 4, 3, 5, 3, 6, 10]

## Map

The map method lets us map values in a column to other values. Let's use map to create a binary column that is 1 for males and 0 for females. We give the mapping that we want by providing the appropriate dictionary.

In [None]:
#Using map
df_titanic.Sex.map({"male":1, "female":0}).head()

0    1
1    0
2    1
3    1
4    0
Name: Sex, dtype: int64

The result is a series, which we can again store as a column.

In [None]:
#Use map to create new column
df_titanic["Binary_Male"] = df_titanic.Sex.map({"male":1, "female":0})

df_titanic.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Binary_Male
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,1
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S,0
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,1
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S,1
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S,0


Let's compute the fraction of male passengers.

In [None]:
#Compute fraction of male passengers
df_titanic.Binary_Male.mean()

0.6363636363636364

This number differs from the fraction of "Mr."s computed earlier, so there are passengers labeled as "male" who don't have "Mr." in their name.

*(Exercise)*: Can you filter out these rows and see what happens?

More Pandas `map` examples: [here](https://www.w3resource.com/pandas/series/series-map.php) 
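
One detail worth knowing when mapping with a dictionary: values that are not keys of the dictionary become `NaN`. The sketch below uses a hypothetical series containing an `"unknown"` label, which does not occur in the Titanic data.

```python
import pandas as pd

# Hypothetical series; "unknown" is not a key of the mapping
sex = pd.Series(["male", "female", "unknown"])

mapped = sex.map({"male": 1, "female": 0})

# The unmapped value turns into NaN, and the result becomes float
print(mapped)
```

Checking `mapped.isna().sum()` after a `map` call is a quick way to spot labels the dictionary missed.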

## Apply

The apply method applies built-in or custom functions to each row or column of a dataframe. Let's use the apply method to compute the average age and fare. We will need the mean function from numpy to do so.

In [None]:
#Using apply - axis =0
import numpy as np

df_titanic[["Age", "Fare"]].apply(np.mean, axis=0)

Age     30.272590
Fare    35.627188
dtype: float64

- We use `.apply` after a dataframe.
- The first argument in `apply()` is the function name that we want to apply.
- The second argument is the axis: apply the function to each column (`axis=0`) or to each row (`axis=1`).
- The outcome is itself a series.
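
The points above can be sketched on a tiny dataframe. The three rows copy the first three Age and Fare values of the dataset; the lambdas are used only to make explicit what `apply` hands to the function.

```python
import pandas as pd

toy = pd.DataFrame({
    "Age": [34.5, 47.0, 62.0],
    "Fare": [7.8292, 7.0, 9.6875],
})

# axis=0: the function receives one column at a time,
# so the result is indexed by the column names
col_means = toy.apply(lambda col: col.mean(), axis=0)

# axis=1: the function receives one row at a time,
# so the result is indexed like the original dataframe
row_maxes = toy.apply(lambda row: row.max(), axis=1)

print(col_means)
print(row_maxes)
```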

Next we see an example of applying a function to each row (`axis=1`).
Note that the index of the resulting series is the index of the original dataframe.

In [None]:
#Using apply - axis = 1
df_titanic[["Age", "Fare"]].apply(np.max, axis=1).head()

0    34.5
1    47.0
2    62.0
3    27.0
4    22.0
dtype: float64

Now let's say we also want to look at rounded-up versions of the columns corresponding to the age and fare. We can use numpy's ceil function and the apply method to accomplish this. If we don't specify the axis, apply uses the default `axis=0` and passes each column to the function; since `np.ceil` works elementwise, every entry is rounded up.

In [None]:
#Using apply with the ceil function
rounded_cols = df_titanic[["Age", "Fare"]].apply(np.ceil)
rounded_cols.head()

Unnamed: 0,Age,Fare
0,35.0,8.0
1,47.0,7.0
2,62.0,10.0
3,27.0,9.0
4,22.0,13.0


*(Question)*: why doesn't this one have "axis" argument?

This provides a convenient way to create intermediate columns.

In [None]:
df_titanic["Ceil_Fare"] = df_titanic["Fare"].apply(np.ceil)
df_titanic.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Binary_Male,Ceil_Fare
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,1,8.0
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S,0,7.0
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,1,10.0
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S,1,9.0
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S,0,13.0


The main benefit of apply is using it with custom functions. Each passenger has a title (Mr., Mrs., etc.). The title comes after the comma in the name. Let's write a custom function to get this title from each name and store it in a column.

In [3]:
def Get_Title(string):
    # split the name into a list of words, and take the first word after ","
    parsed_name = string.split(" ")
    for i in range(len(parsed_name)):
        if ","  in parsed_name[i]:
            return parsed_name[i+1]
        
    return string

This function splits the name by space. If there is a comma, the title is the first word after the comma.
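
As a hedged alternative to a custom function, the same idea (take the first word after the comma) can be sketched with the `.str.extract` string method and a regular expression. The pattern below is an assumption about the name format, namely that every name looks like "Surname, Title. Rest".

```python
import pandas as pd

# Three names taken from the dataset
names = pd.Series([
    "Kelly, Mr. James",
    "Wilkes, Mrs. James (Ellen Needs)",
    "Olsen, Master. Artur Karl",
])

# ",\s*" matches the comma and any following spaces;
# "(\S+)" captures the next run of non-space characters
titles = names.str.extract(r",\s*(\S+)", expand=False)

print(titles.tolist())  # ['Mr.', 'Mrs.', 'Master.']
```

Names without a comma would give `NaN` here, whereas `Get_Title` returns the full string, so the two approaches differ on malformed input.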

In [4]:
#Use apply with this custom function
df_titanic.Name.apply(Get_Title).tail()

413        Mr.
414      Dona.
415        Mr.
416        Mr.
417    Master.
Name: Name, dtype: object

Just like with the built-in functions, we supply the name of the function we want applied to each value in the column. The result is a series that we can store as a new column.

In [None]:
#Add  in the new Title column
df_titanic["Title"] = df_titanic['Name'].apply(Get_Title)
df_titanic.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Binary_Male,Ceil_Fare,Title
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,1,8.0,Mr.
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S,0,7.0,Mrs.
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,1,10.0,Mr.
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S,1,9.0,Mr.
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S,0,13.0,Mrs.


In [None]:
#Let's see the breakdown of each title
df_titanic.Title.value_counts()

Mr.        240
Miss.       78
Mrs.        72
Master.     21
Col.         2
Rev.         2
Ms.          1
Dr.          1
Dona.        1
Name: Title, dtype: int64

The above example demonstrates how we can use apply on a single column or series.
When we use apply on a dataframe with `axis=1`, the function is passed one row at a time. Let's create a column called "Old_Man" that is 1 if the passenger is male and aged 60 or above, and 0 otherwise.

In [None]:
def Get_Old_Man(row):
    
    gender  = row["Sex"]
    age  = row["Age"]
    
    if gender == "male" and age>=60:
        return 1
    else:
        return 0

In [None]:
#Make the Old Man column
df_titanic["Old_Man"] = df_titanic.apply(Get_Old_Man, axis=1)
df_titanic.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Bool_Mr,Old_Man
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,True,0
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S,False,0
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,True,1
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S,True,0
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S,False,0


In [None]:
#Let's compute the number of old men.
df_titanic.Old_Man.sum()

7

Note that since we are again applying the function to multiple columns, I have to specify `axis=1` because I want the function applied to each row.
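
For simple conditions like this one, a row-wise `apply` is not the only option. The sketch below builds the same kind of column with boolean operations on whole columns; the four rows are hypothetical and include a missing age.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration
toy = pd.DataFrame({
    "Sex": ["male", "female", "male", "male"],
    "Age": [34.5, 61.0, 62.0, np.nan],
})

# Combine two boolean series with & and convert True/False to 1/0.
# A missing age compares as False, so that row gets 0,
# which matches what the row-wise function returns.
toy["Old_Man"] = ((toy["Sex"] == "male") & (toy["Age"] >= 60)).astype(int)

print(toy["Old_Man"].tolist())  # [0, 0, 1, 0]
```

The column-wise version is usually faster on large dataframes, while `apply` with `axis=1` stays more flexible for logic that is hard to express with column operations.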

Lastly, let's see how to use apply with a custom function that takes an extra argument. In addition to the row, the function below also takes the minimum age of an old man as input. Let's recreate the column with this lower bound set to 50.

In [None]:
#Function has new input
def Get_Old_Man(row, old_man_age):
    
    gender  = row["Sex"]
    age  = row["Age"]
    
    if gender == "male" and age>=old_man_age:
        return 1
    else:
        return 0
    
#Make the Old Man column
#need to specify old_man_age explicitly
df_titanic["Old_Man"] = df_titanic.apply(Get_Old_Man, old_man_age=50,  axis=1)
df_titanic.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Binary_Male,Ceil_Fare,Title,Old_Man
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,1,8.0,Mr.,0
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S,0,7.0,Mrs.,0
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,1,10.0,Mr.,1
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S,1,9.0,Mr.,0
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S,0,13.0,Mrs.,0


Let's recompute the number of old men, which should go up.

In [None]:
#Let's compute the number of old men
df_titanic.Old_Man.sum()

20
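
Extra arguments can reach the function in two ways: as keyword arguments, as done above, or positionally through the `args` parameter of `apply`. A minimal sketch on hypothetical rows (the lowercase function name is used here only to keep the example separate from the notebook's own function):

```python
import pandas as pd

def get_old_man(row, old_man_age):
    # 1 if the passenger is male and at least old_man_age years old
    if row["Sex"] == "male" and row["Age"] >= old_man_age:
        return 1
    return 0

# Hypothetical rows for illustration
toy = pd.DataFrame({
    "Sex": ["male", "female", "male"],
    "Age": [55.0, 70.0, 30.0],
})

# Keyword form: extra keyword arguments are forwarded to the function
by_keyword = toy.apply(get_old_man, old_man_age=50, axis=1)

# Positional form: extra arguments are given as a tuple via args
by_args = toy.apply(get_old_man, axis=1, args=(50,))

print(by_keyword.tolist())  # [1, 0, 0]
print(by_args.tolist())     # [1, 0, 0]
```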

Let's see another example. We want to create a new label combining "Embarked" and the first character of "Ticket".


In [None]:
def New_Label(row):

    tix = str(row["Ticket"])
    emb = row["Embarked"]
    new_label = emb+tix[0]

    return new_label

In [None]:
df_titanic.apply(New_Label, axis = 1)

0      Q3
1      S3
2      Q2
3      S3
4      S3
       ..
413    SA
414    CP
415    SS
416    S3
417    C2
Length: 418, dtype: object
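
The same label can also be sketched without a row-wise function by combining the string methods from the first section with series addition. The three rows below reuse ticket and port values that appear in the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "Ticket": ["330911", "A/4 48871", "PC 17758"],
    "Embarked": ["Q", "S", "C"],
})

# .str[0] takes the first character of each ticket;
# adding two string series concatenates them element by element
labels = toy["Embarked"] + toy["Ticket"].astype(str).str[0]

print(labels.tolist())  # ['Q3', 'SA', 'CP']
```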

*(Exercise)*: use the `apply` method to calculate the adjusted fare of each passenger. If the passenger is *aged 60 or older* or *aged 15 or younger*, then `adj_fare` is 10% off the listed fare.

In [None]:
# solution
def adjust_fare(row):
    # return row.Fare * 0.9 if row.Age is >=60 or <=15
    if row.Age>=60 or row.Age<=15:
        return row.Fare * 0.9
    else:
        return row.Fare

df_titanic["adj_fare"] = df_titanic.apply(adjust_fare, axis=1)
df_titanic

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Bool_Mr,Old_Man,adj_fare
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,True,0,7.82920
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0000,,S,False,0,7.00000
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,True,1,8.71875
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S,True,0,8.66250
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S,False,0,12.28750
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
413,1305,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S,True,0,8.05000
414,1306,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9000,C105,C,False,0,108.90000
415,1307,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S,True,0,7.25000
416,1308,3,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,,S,True,0,8.05000


## Groupby

Very often, as data scientists, we want to carry out the following three steps:

- **Splitting** the data into groups based on some criteria
- **Applying** a function to each group independently
- **Combining** the results into a data structure

For example, in the Titanic dataset, we may want to calculate the average age of passengers in each class.

In [5]:
df_titanic

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0000,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
...,...,...,...,...,...,...,...,...,...,...,...
413,1305,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S
414,1306,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9000,C105,C
415,1307,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S
416,1308,3,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,,S


In [9]:
df_titanic.groupby('Pclass')['Age'].mean()

Pclass
1    40.918367
2    28.777500
3    24.027945
Name: Age, dtype: float64

Without ``groupby``, we could sort the values according to ``Pclass`` and do the calculation for each class separately.
But this is very inconvenient, especially for large datasets.

In [5]:
df_titanic.sort_values(by="Pclass").head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
208,1100,1,"Rosenbaum, Miss. Edith Louise",female,33.0,0,0,PC 17613,27.7208,A11,C
350,1242,1,"Greenfield, Mrs. Leo David (Blanche Strouse)",female,45.0,0,1,PC 17759,63.3583,D10 D12,C
122,1014,1,"Schabert, Mrs. Paul (Emma Mock)",female,35.0,1,0,13236,57.75,C28,C
343,1235,1,"Cardeza, Mrs. James Warburton Martinez (Charlo...",female,58.0,0,1,PC 17755,512.3292,B51 B53 B55,C
131,1023,1,"Gracie, Col. Archibald IV",male,53.0,0,0,113780,28.5,C51,C


``groupby`` allows one to, as the name indicates, group the dataset by a value and then apply a function to each group.

A basic use is to combine it with the `.describe` method to get summary statistics for each group.

In [13]:
df_titanic.groupby(["Pclass", "Sex"]).describe()

Unnamed: 0_level_0,Unnamed: 1_level_0,PassengerId,PassengerId,PassengerId,PassengerId,PassengerId,PassengerId,PassengerId,PassengerId,Age,Age,...,Parch,Parch,Fare,Fare,Fare,Fare,Fare,Fare,Fare,Fare
Unnamed: 0_level_1,Unnamed: 1_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
Pclass,Sex,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2,Unnamed: 22_level_2
1,female,50.0,1104.08,135.748242,904.0,985.0,1088.0,1246.5,1306.0,48.0,41.333333,...,1.0,4.0,50.0,115.591168,97.587457,25.7,52.69065,79.025,161.537525,512.3292
1,male,57.0,1093.087719,118.242227,903.0,974.0,1094.0,1190.0,1299.0,50.0,40.52,...,0.0,3.0,57.0,75.586551,66.339086,0.0,27.7208,51.8625,83.1583,262.375
2,female,30.0,1111.2,98.041652,907.0,1067.25,1122.0,1163.75,1277.0,29.0,24.376552,...,1.75,3.0,30.0,26.43875,11.653268,10.5,21.0,26.0,35.0625,65.0
2,male,63.0,1121.142857,123.401111,894.0,1024.5,1122.0,1231.0,1298.0,59.0,30.940678,...,0.0,2.0,63.0,20.184654,14.634267,9.6875,12.54375,13.0,26.0,73.5
3,female,72.0,1085.722222,124.352998,893.0,979.75,1070.5,1186.25,1304.0,50.0,23.0734,...,1.0,9.0,72.0,13.735129,11.898984,6.95,7.75,8.08125,15.5,69.55
3,male,146.0,1098.349315,118.293416,892.0,998.25,1102.5,1191.75,1309.0,96.0,24.525104,...,0.0,9.0,145.0,11.82635,10.200631,3.1708,7.75,7.8958,9.5,69.55


In addition to summarizing each group, we can also use the `.size` method to get the size of each group.

In [14]:
df_titanic.groupby(["Pclass", "Sex"]).size()

Pclass  Sex   
1       female     50
        male       57
2       female     30
        male       63
3       female     72
        male      146
dtype: int64

It allows us to get group-specific statistics such as the mean.

In [13]:
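#numeric_only=True skips text columns such as Name, which recent pandas versions refuse to average
df_titanic.groupby("Pclass").mean(numeric_only=True)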

Unnamed: 0_level_0,PassengerId,Age,SibSp,Parch,Fare
Pclass,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,1098.224299,40.918367,0.476636,0.383178,94.280297
2,1117.935484,28.7775,0.376344,0.344086,22.202104
3,1094.178899,24.027945,0.463303,0.417431,12.459678


What happened in the above process?

- ``groupby`` creates three separate datasets, corresponding to the three ``Pclass``.
- When applying ``mean``, the function is applied to the three datasets/dataframes separately.
Each outcome is recorded in a row. *(Question)* why is it a row instead of a single number?
- The three rows are combined, indexed by ``Pclass``.
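
The three steps above can be sketched by hand with a loop over the groups. This is only an illustration of the idea on hypothetical rows, not how pandas implements `groupby` internally.

```python
import pandas as pd

# Hypothetical rows for illustration
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 2, 1],
    "Age": [34.5, 40.0, 27.0, 62.0, 50.0],
})

results = {}
# Split: iterating over a groupby yields (key, sub-dataframe) pairs
for pclass, group in toy.groupby("Pclass"):
    # Apply: compute the statistic on each sub-dataframe
    results[pclass] = group["Age"].mean()

# Combine: collect the per-group results into one series
manual = pd.Series(results, name="Age")

# The built-in shortcut gives the same numbers
builtin = toy.groupby("Pclass")["Age"].mean()

print(manual.tolist())   # [45.0, 62.0, 30.75]
print(builtin.tolist())  # [45.0, 62.0, 30.75]
```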

``groupby`` can be easily combined with ``apply``. For example, we can write a function that outputs only the mean of ``Age``.

But think about what the input and output of the function are.

In [6]:
def mean_age(df):
    return df.Age.mean()

df_titanic.groupby("Pclass").apply(mean_age)

Pclass
1    40.918367
2    28.777500
3    24.027945
dtype: float64

What if we want both *mean* and *standard deviation* of the age within each group, what should we do?

- The function should output two values.
- Maybe we can return a tuple.

In [16]:
def mean_sd_age(df):
    return (df.Age.mean(), df.Age.std())

df_Pclass = df_titanic.groupby("Pclass").apply(mean_sd_age)
df_Pclass

Pclass
1    (40.91836734693877, 13.956798933932587)
2              (28.7775, 12.943457873541288)
3    (24.02794520547945, 10.537105270084329)
dtype: object

This is not exactly what we want. The output is a **pandas series** in which each entry is a tuple.
A better way is to make it a two-column dataframe, so that we can operate on it later within the pandas framework.

To do this, we need the function to output a pandas series.

In [None]:
def mean_sd_age(df):
    return pd.Series((df.Age.mean(), df.Age.std()))

df_Pclass = df_titanic.groupby("Pclass").apply(mean_sd_age)
df_Pclass

Unnamed: 0_level_0,0,1
Pclass,Unnamed: 1_level_1,Unnamed: 2_level_1
1,40.918367,13.956799
2,28.7775,12.943458
3,24.027945,10.537105


This is nice, but it would be better if we could make the headers more informative.
The header information should be included in the returned value.
A tuple cannot carry that information when we create a pandas series,
so we can use a dictionary instead: its keys become the column headers.

In [None]:
def mean_sd_age(df):
    return pd.Series({"average":df.Age.mean(), "standard deviation":df.Age.std()})

df_Pclass = df_titanic.groupby("Pclass").apply(mean_sd_age)
df_Pclass

Unnamed: 0_level_0,average,standard deviation
Pclass,Unnamed: 1_level_1,Unnamed: 2_level_1
1,40.918367,13.956799
2,28.7775,12.943458
3,24.027945,10.537105
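
For standard statistics such as these, a hedged alternative to a custom function is the `agg` method, which accepts a list of function names. The sketch below uses hypothetical ages and renames the resulting columns to match the headers used above.

```python
import pandas as pd

# Hypothetical rows for illustration
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2],
    "Age": [40.0, 50.0, 20.0, 30.0],
})

# agg with a list of names returns one column per statistic
summary = (
    toy.groupby("Pclass")["Age"]
    .agg(["mean", "std"])
    .rename(columns={"mean": "average", "std": "standard deviation"})
)

print(summary)
```

The custom-function approach remains useful when the per-group computation is not one of the built-in statistics.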


*(Exercise):* we want to select the row of the youngest person in each ``Pclass``, and output them. How should we do it?

Try to explain the code below.

In [None]:
def youngest(df):
    return df.loc[df.Age == df.Age.min(), :]

df_titanic.groupby("Pclass").apply(youngest)

Unnamed: 0_level_0,Unnamed: 1_level_0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Bool_Mr,Old_Man,adj_fare
Pclass,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
1,196,1088,1,"Spedden, Master. Robert Douglas",male,6.0,0,2,16966,134.5,E34,C,False,0,121.05
2,250,1142,2,"West, Miss. Barbara J",female,0.92,1,2,C.A. 34651,27.75,,S,False,0,24.975
3,354,1246,3,"Dean, Miss. Elizabeth Gladys Millvina""""",female,0.17,1,2,C.A. 2315,20.575,,S,False,0,18.5175
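
A related sketch, on hypothetical rows, uses `idxmin` to look up the index label of the youngest passenger in each group and then selects those rows with `.loc`. Unlike the `youngest` function, this returns exactly one row per group even when several passengers share the minimum age.

```python
import pandas as pd

# Hypothetical rows for illustration
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3],
    "Name": ["A", "B", "C", "D", "E"],
    "Age": [6.0, 40.0, 30.0, 0.92, 0.17],
})

# idxmin returns, for each group, the index label of the smallest Age
youngest_idx = toy.groupby("Pclass")["Age"].idxmin()

# .loc then selects the corresponding rows of the original dataframe
youngest_rows = toy.loc[youngest_idx]

print(youngest_rows)
```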


More examples of the pandas `groupby` method are available [here](https://www.machinelearningplus.com/pandas/pandas-groupby-examples/).