# Titanic Study: Pivot Table, dropna(), sort_values()

From the Kaggle web site, https://www.kaggle.com/c/titanic/data, you can download the Titanic data set; the code below reads it from `data/train.csv`.
This study is inspired by a Dataquest course.

In [1]:
import pandas as pd
titanic = pd.read_csv('data/train.csv')
titanic.head(3)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |


* `Pclass`: The passenger's cabin class from 1 to 3 where 1 was the highest class
* `Survived`: `1` if the passenger survived, and `0` if they did not.
* `Sex`: The passenger's gender
* `Age`: The passenger's age
* `Fare`: The amount the passenger paid for their ticket
* `Embarked`: `C`, `Q`, or `S`, to indicate which port the passenger boarded the ship from.

Let us count how many values in the `Age` column are null:

In [2]:
age_serie = titanic["Age"]
# Boolean Series: True where the age is null, False where it is defined
isnull = pd.isnull(age_serie)
# Keep only the entries of the Age Series that are null
age_null = age_serie[isnull]
print(len(age_null))

177
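The same count can be obtained in a single expression, because `True` counts as 1 when a boolean Series is summed. The sketch below uses a small made-up Series (`ages` is a hypothetical stand-in for `titanic["Age"]`), so it runs without the CSV file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for titanic["Age"]; the values are made up.
ages = pd.Series([22.0, np.nan, 26.0, np.nan, 35.0], name="Age")

# isnull() returns a boolean Series; summing it counts the True entries.
n_missing = ages.isnull().sum()
print(n_missing)
```

On the real data, `titanic["Age"].isnull().sum()` should give the same count as the cell above.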


Let us calculate the average age for the passengers considering the age values that are defined:

In [3]:
# Keep only the defined ages (~ negates the boolean mask)
age_def = age_serie[~isnull]
mean_age = sum(age_def)/len(age_def)
print(mean_age)

29.6991176471


A more convenient way is to use the **Series.mean()** method to calculate the mean of a column. By default, missing values are not included in the calculation.

In [4]:
mean_age = titanic["Age"].mean()
print(mean_age)

29.69911764705882
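To make the handling of missing values explicit, here is a minimal sketch on made-up numbers (not the Titanic data). It relies on the `skipna` parameter of `Series.mean()`, which defaults to `True`.

```python
import numpy as np
import pandas as pd

# Made-up values with one missing entry.
ages = pd.Series([20.0, np.nan, 40.0])

# Default behaviour: the NaN is skipped, so the mean is (20 + 40) / 2.
mean_default = ages.mean()

# With skipna=False the NaN propagates and the result is NaN.
mean_strict = ages.mean(skipna=False)

print(mean_default, mean_strict)
```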


In [5]:
mean_fare = titanic["Fare"].mean()
print(mean_fare)

32.2042079685746


Let us create a dictionary, `fares_by_class`, with 1, 2, and 3 as keys and the average fare for each class as the corresponding values.

In [6]:
classes = [1, 2, 3]
fares_by_class = {}
for pclass in classes:
    is_same_class = (titanic["Pclass"] == pclass)
    fare = titanic['Fare'][is_same_class]
    mean = fare.mean()
    fares_by_class[pclass] = mean
print(fares_by_class)

{1: 84.15468749999992, 2: 20.66218315217391, 3: 13.675550101832997}


A more convenient way is to use a **pivot table**. By default, `pivot_table()` aggregates with the mean:

In [8]:
# Mean fare per passenger class (aggfunc defaults to the mean)
p_fare = titanic.pivot_table(index='Pclass', values='Fare')
print(p_fare)

             Fare
Pclass           
1       84.154687
2       20.662183
3       13.675550
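For comparison, the same per-class mean can also be written with `groupby`. The sketch below uses a small made-up frame (`df` is hypothetical, not the Titanic data) and checks that both approaches agree.

```python
import pandas as pd

# Made-up rows; only the column names mirror the Titanic data.
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3],
    "Fare": [80.0, 90.0, 20.0, 10.0, 14.0],
})

# pivot_table aggregates with the mean unless aggfunc says otherwise.
by_pivot = df.pivot_table(index="Pclass", values="Fare")

# Equivalent grouping, returned as a Series instead of a DataFrame.
by_group = df.groupby("Pclass")["Fare"].mean()

print(by_pivot)
print(by_group)
```

The pivot table returns a DataFrame with one column per entry in `values`, which is convenient once several columns are aggregated at the same time.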


Let us make a pivot table that calculates the total fares collected and total number of survivors for each embarkation port.

In [8]:
# Pass the aggregation by name; recent pandas versions warn when given the builtin sum
port_stats = titanic.pivot_table(index='Embarked', values=['Survived', 'Fare'], aggfunc='sum')
print(port_stats)

                Fare  Survived
Embarked                      
C         10072.2962        93
Q          1022.2543        30
S         17439.3988       217
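When each column needs a different aggregation, `aggfunc` also accepts a dictionary that maps column names to functions. This is a minimal sketch on made-up rows (`df` and its values are assumptions for illustration), averaging the fare while summing the survivors.

```python
import pandas as pd

# Made-up rows; only the column names mirror the Titanic data.
df = pd.DataFrame({
    "Embarked": ["C", "C", "Q", "S", "S"],
    "Fare": [10.0, 30.0, 8.0, 7.0, 9.0],
    "Survived": [1, 0, 1, 0, 1],
})

# One aggregation per column: mean fare, total survivors.
stats = df.pivot_table(
    index="Embarked",
    values=["Fare", "Survived"],
    aggfunc={"Fare": "mean", "Survived": "sum"},
)
print(stats)
```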


Let us drop all rows in `titanic` that have missing values and assign the result to `drop_na_rows`.

The `dropna()` method takes an axis parameter:
* axis=0 or axis='index' will drop any rows that have null values
* axis=1 or axis='columns' will drop any columns that have null values. 
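The effect of the two `axis` values is easiest to see on a tiny frame. The sketch below uses made-up data (`df` is hypothetical) in which only one row and only one column are free of nulls.

```python
import numpy as np
import pandas as pd

# Made-up frame: only the last row and only the Sex column are complete.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Sex": ["male", "female", "female"],
    "Cabin": [np.nan, "C85", "E46"],
})

rows_kept = df.dropna(axis=0)  # drops every row containing a null
cols_kept = df.dropna(axis=1)  # drops every column containing a null

print(rows_kept.shape)
print(cols_kept.columns.tolist())
```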

In [9]:
titanic.shape

(891, 12)

In [10]:
drop_na_rows = titanic.dropna(axis=0)
drop_na_rows.shape

(183, 12)

Let us drop all rows in `titanic` where the `Age` or `Sex` column has a missing value and assign the result to `new_titanic`.

In [11]:
new_titanic = titanic.dropna(axis=0, subset=['Age','Sex'])
new_titanic.shape

(714, 12)
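Far more rows survive here (714 instead of 183) because nulls in columns outside `subset`, such as the sparsely filled `Cabin`, no longer cause a row to be dropped. A minimal sketch on made-up data (`df` is hypothetical) shows the same effect.

```python
import numpy as np
import pandas as pd

# Made-up frame: Cabin is mostly missing, Age is missing once.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Sex": ["male", "female", "female", "male"],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

all_columns = df.dropna(axis=0)                        # any null drops the row
age_sex_only = df.dropna(axis=0, subset=["Age", "Sex"])  # only Age and Sex are checked

print(len(all_columns), len(age_sex_only))
```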

Let us sort our new dataframe by age:

In [12]:
sorted_titanic = new_titanic.sort_values(by='Age')

In [13]:
sorted_titanic.head(10)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 803 | 804 | 1 | 3 | Thomas, Master. Assad Alexander | male | 0.42 | 0 | 1 | 2625 | 8.5167 | NaN | C |
| 755 | 756 | 1 | 2 | Hamalainen, Master. Viljo | male | 0.67 | 1 | 1 | 250649 | 14.5 | NaN | S |
| 644 | 645 | 1 | 3 | Baclini, Miss. Eugenie | female | 0.75 | 2 | 1 | 2666 | 19.2583 | NaN | C |
| 469 | 470 | 1 | 3 | Baclini, Miss. Helene Barbara | female | 0.75 | 2 | 1 | 2666 | 19.2583 | NaN | C |
| 78 | 79 | 1 | 2 | Caldwell, Master. Alden Gates | male | 0.83 | 0 | 2 | 248738 | 29.0 | NaN | S |
| 831 | 832 | 1 | 2 | Richards, Master. George Sibley | male | 0.83 | 1 | 1 | 29106 | 18.75 | NaN | S |
| 305 | 306 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.92 | 1 | 2 | 113781 | 151.55 | C22 C26 | S |
| 827 | 828 | 1 | 2 | Mallet, Master. Andre | male | 1.0 | 0 | 2 | S.C./PARIS 2079 | 37.0042 | NaN | C |
| 381 | 382 | 1 | 3 | Nakid, Miss. Maria ("Mary") | female | 1.0 | 0 | 2 | 2653 | 15.7417 | NaN | C |
| 164 | 165 | 0 | 3 | Panula, Master. Eino Viljami | male | 1.0 | 4 | 1 | 3101295 | 39.6875 | NaN | S |


You can see that the row labels for the first 5 rows are 803, 755, 644, 469 and 78
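Sorting reorders the rows but keeps their original labels, which is why the labels above look shuffled. The sketch below reproduces this on a made-up frame and shows `reset_index(drop=True)` as one way to renumber the rows. That step is not part of the original notebook and is included only as an illustration.

```python
import pandas as pd

# Made-up rows with the default labels 0, 1, 2.
df = pd.DataFrame({"Name": ["a", "b", "c"], "Age": [30.0, 2.0, 15.0]})

by_age = df.sort_values(by="Age")
print(by_age.index.tolist())      # original labels, now out of order

# Optional: discard the old labels and renumber from 0.
renumbered = by_age.reset_index(drop=True)
print(renumbered.index.tolist())
```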

To select rows by position, we can use the `DataFrame.iloc[]` indexer as follows:

In [14]:
sorted_titanic.iloc[0:5]

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 803 | 804 | 1 | 3 | Thomas, Master. Assad Alexander | male | 0.42 | 0 | 1 | 2625 | 8.5167 | NaN | C |
| 755 | 756 | 1 | 2 | Hamalainen, Master. Viljo | male | 0.67 | 1 | 1 | 250649 | 14.5 | NaN | S |
| 644 | 645 | 1 | 3 | Baclini, Miss. Eugenie | female | 0.75 | 2 | 1 | 2666 | 19.2583 | NaN | C |
| 469 | 470 | 1 | 3 | Baclini, Miss. Helene Barbara | female | 0.75 | 2 | 1 | 2666 | 19.2583 | NaN | C |
| 78 | 79 | 1 | 2 | Caldwell, Master. Alden Gates | male | 0.83 | 0 | 2 | 248738 | 29.0 | NaN | S |


Note the difference with `DataFrame.loc[]`, which selects rows by their index label (the leftmost column in the output above) and not by their position.

In [15]:
sorted_titanic.loc[[1,2]]

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |


In [16]:
row_position_fifth = sorted_titanic.iloc[4]
print(row_position_fifth)

PassengerId                               79
Survived                                   1
Pclass                                     2
Name           Caldwell, Master. Alden Gates
Sex                                     male
Age                                     0.83
SibSp                                      0
Parch                                      2
Ticket                                248738
Fare                                      29
Cabin                                    NaN
Embarked                                   S
Name: 78, dtype: object


In [17]:
row_index_4 = sorted_titanic.loc[4]
print(row_index_4)

PassengerId                           5
Survived                              0
Pclass                                3
Name           Allen, Mr. William Henry
Sex                                male
Age                                  35
SibSp                                 0
Parch                                 0
Ticket                           373450
Fare                               8.05
Cabin                               NaN
Embarked                              S
Name: 4, dtype: object
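The contrast between the two indexers can be reproduced on a tiny made-up frame. In this sketch the labels `803`, `755` and `4` are chosen only to mimic a sorted frame; `df` and its values are hypothetical.

```python
import pandas as pd

# Made-up frame whose labels are deliberately out of order.
df = pd.DataFrame(
    {"Age": [0.42, 0.67, 35.0]},
    index=[803, 755, 4],
)

first_by_position = df.iloc[0]  # first row, whatever its label is
row_labelled_4 = df.loc[4]      # the row whose label is 4 (here the last one)

print(first_by_position.name, first_by_position["Age"])
print(row_labelled_4.name, row_labelled_4["Age"])
```

With the default labels 0, 1, 2, ... both indexers return the same rows, so the difference only becomes visible after sorting or filtering.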
