## Exploring Continuous Features

This notebook uses the Titanic dataset from [this](https://www.kaggle.com/c/titanic/overview) Kaggle competition.

### Read in Data

In [18]:
import pandas as pd
titanic = pd.read_csv('../data/titanic.csv')

In [19]:
titanic.shape

(891, 12)

In [20]:
titanic.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


### Drop the categorical features
A categorical feature has a limited number of unique values. For example, Sex has only 2 unique values and Pclass has only 3 in this dataset.
Here, however, we keep Pclass in the dataset. The list below also drops PassengerId, Name, and Ticket, which are identifiers or free text rather than continuous features.
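
One rough way to spot categorical candidates is to count the distinct values in each column. The sketch below uses a small made-up frame (`toy`, not the Titanic file), and the threshold `max_unique` is a hypothetical choice made only for illustration.

```python
import pandas as pd

# Toy data for illustration only; these are not rows from the Titanic file.
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male', 'male'],
    'Pclass': [3, 1, 3, 1, 2],
    'Fare': [7.25, 71.28, 7.93, 53.10, 8.05],
})

# Number of distinct values in each column.
unique_counts = toy.nunique()
print(unique_counts)

# Columns with few distinct values are candidates for categorical treatment.
# The threshold is an assumption for this sketch, not a general rule.
max_unique = 3
low_cardinality = unique_counts[unique_counts <= max_unique].index.tolist()
print(low_cardinality)
```

A count like this is only a starting point: Pclass has few values but is ordered, which is one reason to keep it alongside the continuous features.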

In [21]:
cat_features = ['PassengerId', 'Name', 'Ticket', 'Sex', 'Cabin', 'Embarked']
titanic.drop(cat_features, axis=1, inplace=True)

In [22]:
titanic.head()

,Survived,Pclass,Age,SibSp,Parch,Fare
0,0,3,22.0,1,0,7.25
1,1,1,38.0,1,0,71.2833
2,1,3,26.0,0,0,7.925
3,1,1,35.0,1,0,53.1
4,0,3,35.0,0,0,8.05


### Explore continuous features

In [23]:
# Look at general distributions and simple statistics
titanic.describe()

,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,714.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.0
max,1.0,3.0,80.0,8.0,6.0,512.3292


In [24]:
# Look at correlations between features with titanic.corr()
titanic.corr()

,Survived,Pclass,Age,SibSp,Parch,Fare
Survived,1.0,-0.338481,-0.077221,-0.035322,0.081629,0.257307
Pclass,-0.338481,1.0,-0.369226,0.083081,0.018443,-0.5495
Age,-0.077221,-0.369226,1.0,-0.308247,-0.189119,0.096067
SibSp,-0.035322,0.083081,-0.308247,1.0,0.414838,0.159651
Parch,0.081629,0.018443,-0.189119,0.414838,1.0,0.216225
Fare,0.257307,-0.5495,0.096067,0.159651,0.216225,1.0


1. Fare has a correlation of about 0.26 with the Survived column, which makes it one of the stronger signals among these features.
2. Pclass has a correlation of about -0.34 with Survived. A negative correlation is just as informative as a positive one; only the direction differs.
3. As Pclass gets larger, the chance of survival gets smaller: 3rd class passengers were less likely to survive.
4. Between features, Pclass and Fare have the strongest correlation, about -0.55.
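
To make the sign of a correlation concrete, here is a minimal sketch that computes Pearson's r from its definition and compares it with pandas. The frame `toy` is invented for illustration and does not reproduce the Titanic numbers.

```python
import pandas as pd

# Toy data for illustration only; these are not rows from the Titanic file.
toy = pd.DataFrame({
    'Pclass':   [1, 1, 2, 2, 3, 3, 3, 3],
    'Survived': [1, 1, 1, 0, 0, 0, 1, 0],
})

x, y = toy['Pclass'], toy['Survived']

# Pearson r = cov(x, y) / (std(x) * std(y)), using sample statistics (ddof=1).
r_manual = x.cov(y) / (x.std() * y.std())

# pandas computes the same quantity; Pearson is the default method.
r_pandas = x.corr(y)
print(round(r_manual, 4), round(r_pandas, 4))

# A negative r means the survival rate tends to fall as the class number rises.
print(toy.groupby('Pclass')['Survived'].mean())
```

Because Survived is a 0/1 column, the group means printed at the end are survival rates, which is often easier to read than the coefficient itself.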

In [27]:
# Look at fare by different passenger class levels
titanic.groupby('Pclass')['Fare'].mean()

Pclass
1    84.154687
2    20.662183
3    13.675550
Name: Fare, dtype: float64

The mean Fare differs markedly between classes, but let's look at the full distributions too.

In [28]:
titanic.groupby('Pclass')['Fare'].describe()

Pclass,count,mean,std,min,25%,50%,75%,max
1,216.0,84.154687,78.380373,0.0,30.92395,60.2875,93.5,512.3292
2,184.0,20.662183,13.417399,0.0,13.0,14.25,26.0,73.5
3,491.0,13.67555,11.778142,0.0,7.75,8.05,15.5,69.55


1. The 25th percentile of Fare for 2nd class passengers (13.0) is lower than the 75th percentile for 3rd class passengers (15.5).
2. So although the mean Fare differs clearly between classes, the Fare distributions still overlap.
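
The overlap check can also be done in code rather than by eye. This is a sketch on invented fares (`toy`), using the same `describe()` call as above and comparing the quartile columns directly.

```python
import pandas as pd

# Toy fares for illustration only; these are not rows from the Titanic file.
toy = pd.DataFrame({
    'Pclass': [2, 2, 2, 2, 3, 3, 3, 3],
    'Fare':   [10.0, 13.0, 14.0, 26.0, 7.0, 8.0, 12.0, 20.0],
})

# Per-class summary; the '25%' and '75%' columns hold the quartiles.
stats = toy.groupby('Pclass')['Fare'].describe()
print(stats[['25%', '75%']])

# The middle halves overlap if the lower quartile of the pricier class
# falls below the upper quartile of the cheaper class.
overlap = stats.loc[2, '25%'] < stats.loc[3, '75%']
print(overlap)
```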

In [29]:
# Look at the average value of each feature based on whether the Age is missing.
titanic.groupby(titanic['Age'].isnull()).mean()

Age is null,Survived,Pclass,Age,SibSp,Parch,Fare
False,0.406162,2.236695,29.699118,0.512605,0.431373,34.694514
True,0.293785,2.59887,,0.564972,0.180791,22.158567


When Age is missing, passengers on average:
1. Paid a lower Fare (22.2 vs. 34.7).
2. Were less likely to survive (29% vs. 41%).
3. Had a higher class number (2.60 vs. 2.24).
4. Had fewer parents and children aboard (Parch of 0.18 vs. 0.43).
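
The grouping used for this comparison relies on a boolean mask as the group key. A minimal sketch on invented data (`toy`) shows the mechanics; the variable name `age_missing` is introduced here only for illustration.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; these are not rows from the Titanic file.
toy = pd.DataFrame({
    'Age':      [22.0, np.nan, 35.0, np.nan, 40.0],
    'Fare':     [30.0, 8.0, 50.0, 10.0, 40.0],
    'Survived': [1, 0, 1, 0, 0],
})

# Boolean mask: True where Age is not recorded.
age_missing = toy['Age'].isnull()

# Share of rows with a missing Age.
print(age_missing.mean())

# Mean of the other columns, split by whether Age is missing.
by_missing = toy.groupby(age_missing)[['Survived', 'Fare']].mean()
print(by_missing)
```

If the two groups differ this clearly, the missingness itself carries information, so it may be worth keeping as a feature before filling in Age.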

In [34]:
# Look at the distribution of each feature based on whether they survived or not.
for feature in ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']:
    print('\n**************** Describing feature: {} ****************'.format(feature))
    print(titanic.groupby('Survived')[feature].describe())


**************** Describing feature: Pclass ****************
          count      mean       std  min  25%  50%  75%  max
Survived                                                    
0         549.0  2.531876  0.735805  1.0  2.0  3.0  3.0  3.0
1         342.0  1.950292  0.863321  1.0  1.0  2.0  3.0  3.0

**************** Describing feature: Age ****************
          count       mean        std   min   25%   50%   75%   max
Survived                                                           
0         424.0  30.626179  14.172110  1.00  21.0  28.0  39.0  74.0
1         290.0  28.343690  14.950952  0.42  19.0  28.0  36.0  80.0

**************** Describing feature: SibSp ****************
          count      mean       std  min  25%  50%  75%  max
Survived                                                    
0         549.0  0.553734  1.288399  0.0  0.0  0.0  1.0  8.0
1         342.0  0.473684  0.708688  0.0  0.0  0.0  1.0  4.0

**************** Describing feature: Parch **************

1. The mean Age of passengers who did not survive is 30.6, versus 28.3 for those who survived.
2. The medians (50th percentile), however, are the same: 28.
3. Fare differs substantially in mean, median, and interquartile range.
4. Pclass also differs substantially by survival.
5. Keep in mind, though, that Fare and Pclass are highly correlated.
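
Means can differ while medians agree when a few extreme values pull one group's mean. The sketch below shows this on invented ages (`toy`), with the mean and median side by side per group.

```python
import pandas as pd

# Toy ages for illustration only; these are not rows from the Titanic file.
toy = pd.DataFrame({
    'Survived': [0, 0, 0, 0, 1, 1, 1, 1],
    'Age':      [20.0, 28.0, 28.0, 60.0, 4.0, 28.0, 28.0, 40.0],
})

# Mean and median of Age within each survival group.
summary = toy.groupby('Survived')['Age'].agg(['mean', 'median'])
print(summary)
```

In this toy frame one older passenger raises the mean of the first group and one young child lowers the mean of the second, while both medians stay at 28.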