### Data Exploration of the Titanic Dataset from Kaggle

In [47]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# load the Titanic training CSV file as a DataFrame
titanic_train = pd.read_csv("./train.csv")

# preview the data (in a notebook only the last expression, describe(), is displayed)
titanic_train.head()
titanic_train.describe()

,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


First, I want to see how many null values exist in the data.

In [48]:
titanic_train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
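
The raw counts above are easier to judge as fractions of the 891 rows: 177/891 is about 20% for "Age", 687/891 about 77% for "Cabin", and 2/891 about 0.2% for "Embarked". A minimal sketch of how such fractions could be computed, using a small made-up frame (`toy` and its values are hypothetical stand-ins, not the real data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for titanic_train (hypothetical values, not the real data)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# isnull() gives a boolean frame; the mean of booleans is the fraction of True values
missing_fraction = toy.isnull().mean()
print(missing_fraction)
```

Applied to the real frame, the same call would be `titanic_train.isnull().mean()`.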

The results above show many null values in the "Age" column (177), even more in "Cabin" (687), and two in "Embarked". Since we care about "Age", let's substitute its null values with the median of all the ages we have available.

In [49]:
titanic_train["Age"]=titanic_train["Age"].fillna(titanic_train["Age"].median())

As a result, checking the null counts again shows that no null values remain in the "Age" column.

In [50]:
titanic_train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
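
To see what the median fill does on a small scale, here is a sketch on made-up ages. The `Age_was_missing` column is a hypothetical addition, not part of the original analysis; it just records which rows were filled so they can be told apart later. In the real data the median is 28.0, as shown in the 50% row of the `describe()` output.

```python
import numpy as np
import pandas as pd

# Toy ages (hypothetical values); two entries are missing
toy = pd.DataFrame({"Age": [22.0, np.nan, 26.0, 35.0, np.nan]})

# Hypothetical helper column: remember which ages were originally missing
toy["Age_was_missing"] = toy["Age"].isnull()

# median() skips NaN by default, so this is the median of 22, 26 and 35
median_age = toy["Age"].median()
toy["Age"] = toy["Age"].fillna(median_age)

print(median_age)
print(toy)
```

Keep in mind that filling many rows with a single value pulls later averages of "Age" toward that value.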

I think it is worth exploring some groupings in our dataset. First, let's create a new dataframe grouped by gender and see how the average values vary by gender.

In [51]:
# Drop the columns we don't care about
titanic_gender_mean_data = titanic_train.drop(['PassengerId','Pclass'], axis=1)
# numeric_only=True skips text columns such as Name and Ticket
titanic_gender_mean_data.groupby("Sex").mean(numeric_only=True)

Sex,Survived,Age,SibSp,Parch,Fare
female,0.742038,27.929936,0.694268,0.649682,44.479818
male,0.188908,30.140676,0.429809,0.235702,25.523893


It is very interesting to point out that the survival rate was 74% for females and only 19% for males.

I was also interested in how survival rates were influenced by passenger class, while at the same time looking at the average age in each class.

In [52]:
titanic_train.groupby(["Pclass"]).mean(numeric_only=True)

Pclass,PassengerId,Survived,Age,SibSp,Parch,Fare
1,461.597222,0.62963,36.81213,0.416667,0.356481,84.154687
2,445.956522,0.472826,29.76538,0.402174,0.380435,20.662183
3,439.154786,0.242363,25.932627,0.615071,0.393075,13.67555


Looking at the table above, one can observe that the highest survival rate was in 1st class: 0.62963, or 63% to two significant digits. First class also had the highest average age, at about 36.8 years (note that this average includes the ages filled in with the median).
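
If only the survival rate and average age per class are of interest, the grouping can be restricted to those columns. A minimal sketch using pandas named aggregation on a made-up frame (`toy`, its values, and the output column names `survival_rate`, `mean_age` and `passengers` are all hypothetical):

```python
import pandas as pd

# Toy stand-in for titanic_train (hypothetical values, not the real data)
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
    "Age":      [40.0, 36.0, 30.0, 28.0, 20.0, 24.0, 26.0, 30.0],
})

# Survived is 0/1, so its mean per group is the survival rate of that group
by_class = toy.groupby("Pclass").agg(
    survival_rate=("Survived", "mean"),
    mean_age=("Age", "mean"),
    passengers=("Survived", "size"),
)
print(by_class.round(2))
```

Including the group size next to each rate makes it easier to see how many passengers each percentage is based on.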

We can dive deeper into grouping and see how the passenger class and gender relate to each other. 

In [53]:
titanic_train.groupby(["Pclass","Sex"]).mean(numeric_only=True)

Pclass,Sex,PassengerId,Survived,Age,SibSp,Parch,Fare
1,female,469.212766,0.968085,33.978723,0.553191,0.457447,106.125798
1,male,455.729508,0.368852,38.995246,0.311475,0.278689,67.226127
2,female,443.105263,0.921053,28.703947,0.486842,0.605263,21.970121
2,male,447.962963,0.157407,30.512315,0.342593,0.222222,19.741782
3,female,399.729167,0.5,23.572917,0.895833,0.798611,16.11881
3,male,455.51585,0.135447,26.911873,0.498559,0.224784,12.661633


Once again, we observe that the survival rate of females was higher than that of males in every class.

However, these means do not show the actual numbers involved, so instead of taking the means, let's take the counts over all the data, grouped the same way.

In [54]:
titanic_train.groupby(["Pclass","Sex"]).count()

Pclass,Sex,PassengerId,Survived,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,female,94,94,94,94,94,94,94,94,81,92
1,male,122,122,122,122,122,122,122,122,95,122
2,female,76,76,76,76,76,76,76,76,10,76
2,male,108,108,108,108,108,108,108,108,6,108
3,female,144,144,144,144,144,144,144,144,6,144
3,male,347,347,347,347,347,347,347,347,6,347


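Combining these group sizes with the survival rates from the previous table, more males than females died in each class: roughly 77 versus 3 in 1st class, 91 versus 6 in 2nd class, and 300 versus 72 in 3rd class. Note that count() returns the number of non-null entries in each column, not a sum, so the identical values under "Survived" and "SibSp" (Number of Siblings/Spouses Aboard) are simply the group sizes. They do not mean that the number of siblings/spouses equals the number of survivors, so this table alone cannot tell us whether couples left the ship together.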
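
To get actual survivor and death counts per group, one option is to sum the 0/1 "Survived" column and compare it with the group size. A minimal sketch on a made-up frame (`toy`, its values, and the column names `total`, `survived` and `died` are hypothetical):

```python
import pandas as pd

# Toy stand-in for titanic_train (hypothetical values, not the real data)
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3],
    "Sex":      ["female", "male", "male", "female", "male", "female", "male", "male"],
    "Survived": [1, 0, 1, 1, 0, 0, 0, 1],
})

# size counts the rows in each group; sum of a 0/1 column counts the survivors
summary = toy.groupby(["Pclass", "Sex"])["Survived"].agg(total="size", survived="sum")
summary["died"] = summary["total"] - summary["survived"]
print(summary)
```

On the real data the same pattern would be applied to `titanic_train` instead of `toy`.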

In [None]:
import matplotlib.pyplot as plt
plt.scatter(titanic_train.Age, titanic_train.Fare)
plt.xlabel('age')
plt.ylabel('fare price')
plt.show()
