## Titanic Survival Project

- Clean and analyze data on passenger survival from the Titanic.

- pclass -- The passenger's cabin class from 1 to 3 where 1 was the highest class
- survived -- 1 if the passenger survived, and 0 if they did not.
- sex -- The passenger's gender
- age -- The passenger's age
- fare -- The amount the passenger paid for their ticket
- embarked -- Either C, Q, or S, to indicate which port the passenger boarded the ship from.

Many of the columns, such as age and sex, have missing values.


In [26]:
# read the file
import pandas as pd
import numpy as np
titanic_survival = pd.read_csv('titanic_survival.csv')

In [27]:
# find the missing data
## In Python, None indicates no value.
## The pandas library uses NaN, which stands for "not a number", to indicate a missing value.
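
To make the `None` versus `NaN` point concrete, here is a minimal sketch on a made-up Series (the values are hypothetical and not taken from `titanic_survival.csv`). It shows that pandas stores `None` as `NaN` in a numeric column, so `pd.isnull()` flags both.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, not from the real dataset
ages = pd.Series([22.0, None, 38.0, np.nan])

# In a float Series, None is stored as NaN, so both are reported as missing
missing_mask = pd.isnull(ages)

print(ages.dtype)             # float64
print(missing_mask.tolist())  # [False, True, False, True]
```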

In [28]:
# count the missing values
age_is_null = pd.isnull(titanic_survival["age"])
age_null_true = titanic_survival["age"][age_is_null]
len(age_null_true)

264

In [29]:
# get the valid (non-null) values
good_ages = titanic_survival['age'][~age_is_null]
len(good_ages)

1046
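
The two counts above can also be obtained without building filtered Series. The sketch below uses a hypothetical toy column (not the real data): summing a boolean mask counts the `True` values, and `Series.count()` counts only non-null entries.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column, not the real dataset
ages = pd.Series([29.0, np.nan, 2.0, np.nan, 30.0], name="age")

n_missing = int(ages.isnull().sum())  # each True counts as 1
n_valid = int(ages.count())           # count() ignores NaN

print(n_missing, n_valid)  # 2 3
```

Together the two numbers always add up to the length of the column, which is a quick sanity check on the mask.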

In [30]:
# calculate the mean fare for each passenger class (fares_by_class)

In [31]:
passenger_classes = [1, 2, 3]
fares_by_class = {}
for x in passenger_classes:
    pclass_rows = titanic_survival[titanic_survival['pclass'] == x]
    pclass_fares = pclass_rows['fare']
    fares_for_class = pclass_fares.mean()
    fares_by_class[x] = fares_for_class
print(fares_by_class)

{1: 87.508991640866881, 2: 21.179196389891697, 3: 13.302888700564973}
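
The loop above can probably be replaced by a single `groupby` call. This is a sketch on hypothetical toy data rather than the original notebook's approach; the frame `toy` and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the dataset
toy = pd.DataFrame({
    "pclass": [1, 1, 2, 3, 3],
    "fare": [100.0, 50.0, 20.0, 8.0, 12.0],
})

# Group rows by class, take the mean fare per group, convert to a dict
mean_fares = toy.groupby("pclass")["fare"].mean().to_dict()
print(mean_fares)
```

On the real data, the equivalent call would be `titanic_survival.groupby("pclass")["fare"].mean()`.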


In [32]:
# Making Pivot Tables: DataFrame.pivot_table()

In [33]:
passenger_age = titanic_survival.pivot_table(index='pclass', values='age', aggfunc='mean')
print(passenger_age)

pclass
1.0    39.159918
2.0    29.506705
3.0    24.816367
Name: age, dtype: float64


In [34]:
# Complex Pivot Tables with multiple columns
port_stats = titanic_survival.pivot_table(index='embarked', values=['fare', 'survived'], aggfunc='sum')
print(port_stats)

                fare  survived
embarked                      
C         16830.7922     150.0
Q          1526.3085      44.0
S         25033.3862     304.0
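
`pivot_table()` also accepts a dictionary for `aggfunc`, so each value column can use its own aggregation. The sketch below uses hypothetical toy data: it sums `fare` but averages `survived`, which gives a survival rate per port instead of a raw count.

```python
import pandas as pd

# Hypothetical toy data, not the real dataset
toy = pd.DataFrame({
    "embarked": ["C", "C", "S", "S", "Q"],
    "fare": [10.0, 20.0, 5.0, 15.0, 7.0],
    "survived": [1, 0, 1, 1, 0],
})

# One aggregation per value column: total fare, mean survival
port_stats = toy.pivot_table(
    index="embarked",
    values=["fare", "survived"],
    aggfunc={"fare": "sum", "survived": "mean"},
)
print(port_stats)
```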


In [35]:
# Drop Missing Values -- DataFrame.dropna()
# axis=0 or axis='index' will drop any rows that have null values 
# axis=1 or axis='columns' will drop any columns that have null values

In [36]:
drop_na_rows = titanic_survival.dropna(axis=0)
drop_na_columns = titanic_survival.dropna(axis=1)

In [37]:
# drop only the rows that have missing values in the listed columns
new_titanic_survival = titanic_survival.dropna(axis=0,subset=["age", "sex"])
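
The three `dropna()` variants behave quite differently, which is easier to see on a tiny frame. This is a sketch with hypothetical toy values, not the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "age" and "sex" each have one missing value
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0],
    "sex": ["male", "female", None],
    "fare": [7.25, 71.28, 8.05],
})

rows_kept = toy.dropna(axis=0)                  # keeps only fully populated rows
cols_kept = toy.dropna(axis=1)                  # keeps only fully populated columns
age_known = toy.dropna(axis=0, subset=["age"])  # checks the "age" column only

print(rows_kept.index.tolist())    # [0]
print(cols_kept.columns.tolist())  # ['fare']
print(age_known.index.tolist())    # [0, 2]
```

Using `subset` keeps rows whose only gaps are in columns that do not matter for the analysis at hand.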

In [38]:
#  Reindexing Rows -- DataFrame.reset_index()

In [39]:
# Reindex the new_titanic_survival dataframe so the row indexes start from 0,
# and the old index is dropped.

In [40]:
titanic_reindexed = new_titanic_survival.reset_index(drop=True)
print(titanic_reindexed.iloc[0:4,0:2])

   pclass  survived
0     1.0       1.0
1     1.0       1.0
2     1.0       0.0
3     1.0       0.0
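
The effect of `drop=True` is clearer when compared with the default. In this sketch (hypothetical toy data), dropping a row leaves a gap in the index; `reset_index()` closes the gap, and without `drop=True` the old labels are kept in a new column named `index`.

```python
import pandas as pd

# Hypothetical toy data with one missing age
toy = pd.DataFrame({"age": [22.0, None, 35.0, 4.0]})

filtered = toy.dropna()                        # row 1 is removed, labels keep the gap
renumbered = filtered.reset_index(drop=True)   # fresh 0..n-1 index, old labels discarded
with_old = filtered.reset_index()              # old labels kept in an "index" column

print(filtered.index.tolist())    # [0, 2, 3]
print(renumbered.index.tolist())  # [0, 1, 2]
print(with_old.columns.tolist())  # ['index', 'age']
```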


In [41]:
# DataFrame.apply(): run a function on every row (axis=1) or every column (axis=0)

In [42]:
def generate_age_label(row):
    age = row["age"]
    if pd.isnull(age):
        return "unknown"
    elif age < 18:
        return "minor"
    else:
        return "adult"

age_labels = titanic_survival.apply(generate_age_label, axis=1) # apply function to the rows
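
To check what `apply()` returns, the same function can be run on a small hypothetical frame and the resulting labels tallied with `value_counts()`. The toy ages below are invented for illustration.

```python
import numpy as np
import pandas as pd

def generate_age_label(row):
    # Same labelling rule as above: missing, under 18, or 18 and over
    age = row["age"]
    if pd.isnull(age):
        return "unknown"
    elif age < 18:
        return "minor"
    else:
        return "adult"

# Hypothetical toy data, not the real dataset
toy = pd.DataFrame({"age": [10.0, 40.0, np.nan, 17.0]})

# axis=1 passes each row to the function; the result is a Series of labels
labels = toy.apply(generate_age_label, axis=1)
label_counts = labels.value_counts().to_dict()

print(labels.tolist())  # ['minor', 'adult', 'unknown', 'minor']
```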