# Data Investigation

**Methods:**
>1. Load training data
>2. Investigate features
>3. Cleanse any features that require cleansing

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

## 1. Load training data

In [2]:
data_train = pd.read_csv('./../../data/raw/train.csv')

## 2. Investigate features

First, I look at each feature on its own:

### PassengerId

In [19]:
feature_name = data_train.columns[0]
print 
print '###########'
print feature_name
print
print 'Value Counts:'
print data_train[feature_name].value_counts().head(10)
print
print 'Describe'
print data_train[feature_name].describe()


###########
PassengerId

Value Counts:
891    1
293    1
304    1
303    1
302    1
301    1
300    1
299    1
298    1
297    1
Name: PassengerId, dtype: int64

Describe
count    891.000000
mean     446.000000
std      257.353842
min        1.000000
25%      223.500000
50%      446.000000
75%      668.500000
max      891.000000
Name: PassengerId, dtype: float64


* This is a unique identifier for each passenger. There are 891 passengers in the training data.

#### ToDo: Set this as the data frame index after loading the data in the data_loader library
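
A minimal sketch of that step, using a small made-up frame (`toy`, illustrative rows only) in place of `data_train`. It checks uniqueness before promoting the column to the index; how the data_loader library will actually do this is not shown here.

```python
import pandas as pd

# Toy stand-in for the training data; the rows are illustrative only.
toy = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Survived': [0, 1, 1],
})

# An index should identify rows unambiguously, so verify uniqueness first.
assert toy['PassengerId'].is_unique

# Promote the identifier column to the index.
toy = toy.set_index('PassengerId')
print(toy.index.name)
```

The same result can also be obtained at load time by passing `index_col='PassengerId'` to `pd.read_csv`.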

### Survived

In [18]:
feature_name = data_train.columns[1]
print 
print '###########'
print feature_name
print
print 'Value Counts:'
print data_train[feature_name].value_counts()
print
print 'Describe'
print data_train[feature_name].describe()


###########
Survived

Value Counts:
0    549
1    342
Name: Survived, dtype: int64

Describe
count    891.000000
mean       0.383838
std        0.486592
min        0.000000
25%        0.000000
50%        0.000000
75%        1.000000
max        1.000000
Name: Survived, dtype: float64


* The Survived response variable is binary, with fewer people surviving than not: 342 survived and 549 did not (about 38% survival).

#### ToDo: Convert the Survived variable to a categorical variable
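
One possible way to do the conversion, sketched on a made-up column (`toy` is illustrative, not the real data):

```python
import pandas as pd

# Toy stand-in for the response column; values are illustrative only.
toy = pd.DataFrame({'Survived': [0, 1, 1, 0, 0]})

# astype('category') stores the two outcomes as labelled categories
# instead of plain integers.
toy['Survived'] = toy['Survived'].astype('category')
print(toy['Survived'].cat.categories.tolist())
```

After the conversion, `describe()` reports count/unique/top/freq rather than a mean and quartiles, which suits a binary label better.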

### Pclass

In [20]:
feature_name = data_train.columns[2]
print 
print '###########'
print feature_name
print
print 'Value Counts:'
print data_train[feature_name].value_counts()
print
print 'Describe'
print data_train[feature_name].describe()


###########
Pclass

Value Counts:
3    491
1    216
2    184
Name: Pclass, dtype: int64

Describe
count    891.000000
mean       2.308642
std        0.836071
min        1.000000
25%        2.000000
50%        3.000000
75%        3.000000
max        3.000000
Name: Pclass, dtype: float64


* Stands for "Passenger Class": 1 = first class, 2 = second class, 3 = third class.
* More than half of the passengers (491 of 891) are in Pclass=3.
* All passengers have a Pclass assigned to them

#### ToDo: Convert Pclass into a category
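
Because the classes have a natural order, an ordered categorical may fit better than a plain one. The sketch below assumes the ordering 3 < 2 < 1 (third class lowest); that ordering, like the `toy` frame, is an assumption made for illustration.

```python
import pandas as pd

# Toy stand-in for the Pclass column; values are illustrative only.
toy = pd.DataFrame({'Pclass': [3, 1, 2, 3, 3]})

# Assumed ordering: third class is the lowest category, first class the highest.
toy['Pclass'] = pd.Categorical(toy['Pclass'], categories=[3, 2, 1], ordered=True)

# With an ordered categorical, comparisons follow the category order,
# so ">= 2" selects second and first class.
first_or_second = toy['Pclass'] >= 2
print(first_or_second.tolist())
```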

### Name

In [22]:
feature_name = data_train.columns[3]
print 
print '###########'
print feature_name
print
print 'Value Counts:'
print data_train[feature_name].value_counts().head(10)


###########
Name

Value Counts:
Graham, Mr. George Edward                              1
Elias, Mr. Tannous                                     1
Madill, Miss. Georgette Alexandra                      1
Cumings, Mrs. John Bradley (Florence Briggs Thayer)    1
Beane, Mrs. Edward (Ethel Clarke)                      1
Roebling, Mr. Washington Augustus II                   1
Moran, Mr. James                                       1
Padro y Manent, Mr. Julian                             1
Scanlan, Mr. James                                     1
Ali, Mr. William                                       1
Name: Name, dtype: int64


* The full name of each passenger (surname, title, given names).
* These are unlikely to be useful without a deeper NLP investigation.
* The passenger's sex is in another column, so there is no need to extract it here. The same goes for familial connections.

#### ToDo: Drop Name from training data

### Sex

In [23]:
feature_name = data_train.columns[4]
print 
print '###########'
print feature_name
print
print 'Value Counts:'
print data_train[feature_name].value_counts().head(10)
print
print 'Describe'
print data_train[feature_name].describe()


###########
Sex

Value Counts:
male      577
female    314
Name: Sex, dtype: int64

Describe
count      891
unique       2
top       male
freq       577
Name: Sex, dtype: object


* More males (577) than females (314). Every entry has an associated sex.

#### ToDo: Convert Sex into a category

### Age

In [24]:
feature_name = data_train.columns[5]
print 
print '###########'
print feature_name
print
print 'Value Counts:'
print data_train[feature_name].value_counts().head(10)
print
print 'Describe'
print data_train[feature_name].describe()


###########
Age

Value Counts:
24.0    30
22.0    27
18.0    26
30.0    25
28.0    25
19.0    25
21.0    24
25.0    23
36.0    22
29.0    20
Name: Age, dtype: int64

Describe
count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%             NaN
50%             NaN
75%             NaN
max       80.000000
Name: Age, dtype: float64




* There are quite a few passengers with missing ages (177 of 891)
* There are some babies on the ship: the minimum age is 0.42
* The average age is 29.7

#### ToDo: Attempt imputation methods, using the actual ages as a benchmark. Use the best one on the 177 missing cases
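
A rough sketch of how such a benchmark could work: hide some of the known ages, impute them with each candidate method, and compare the error against the true values. The two candidates here (global median and per-Pclass median), the mean absolute error metric, and the `toy` rows are all assumptions chosen for illustration, not the methods this notebook ends up using.

```python
import numpy as np
import pandas as pd

# Toy rows, illustrative only; NaN marks a missing age.
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    'Age': [40.0, 50.0, 45.0, 30.0, 35.0, np.nan, 20.0, 22.0, 18.0, np.nan],
})

# Only rows with a known age can serve as a benchmark.
known = toy.dropna(subset=['Age'])

# Hold out every other known row (a random sample would be the usual choice
# on real data; a fixed split keeps this sketch deterministic).
holdout = known.iloc[::2]
train = known.iloc[1::2]

# Candidate 1: global median of the remaining known ages.
global_median = train['Age'].median()
global_pred = pd.Series(global_median, index=holdout.index)

# Candidate 2: median age within each Pclass, falling back to the global median.
class_median = train.groupby('Pclass')['Age'].median()
class_pred = holdout['Pclass'].map(class_median).fillna(global_median)

def mean_abs_error(pred, actual):
    """Average absolute difference between imputed and true ages."""
    return (pred - actual).abs().mean()

scores = {
    'global_median': mean_abs_error(global_pred, holdout['Age']),
    'pclass_median': mean_abs_error(class_pred, holdout['Age']),
}
print(scores)
```

The method with the lowest error on the held-out ages would then be refit on all known ages and applied to the rows where Age is missing.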

### SibSp

In [26]:
feature_name = data_train.columns[6]
print 
print '###########'
print feature_name
print
print 'Value Counts:'
print data_train[feature_name].value_counts().head(10)
print
print 'Describe'
print data_train[feature_name].describe()


###########
SibSp

Value Counts:
0    608
1    209
2     28
4     18
3     16
8      7
5      5
Name: SibSp, dtype: int64

Describe
count    891.000000
mean       0.523008
std        1.102743
min        0.000000
25%        0.000000
50%        0.000000
75%        1.000000
max        8.000000
Name: SibSp, dtype: float64


* Number of Siblings/Spouses Aboard
* No missing values
* Most people had 0 siblings/spouses aboard
* The next largest segment was 1 sibling/spouse. These passengers are very likely part of a couple, for example married without kids
* Values of 3 and above most likely indicate a family travelling together
    * There are 5 cases where the passenger had 5 siblings/spouses. These are likely all from the same family

#### ToDo: Create dummy variables for ["alone", "couple", and "family"]
#### ToDo: Look up if the larger families fared better than the smaller families
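
Both ToDos could be approached along these lines. The sketch bins SibSp into the three groups, builds dummy columns, and compares survival rates per group. Treating a SibSp of 2 or more as "family" is an assumption (the notes above only discuss 0, 1, and 3+), and the `toy` rows are made up.

```python
import numpy as np
import pandas as pd

# Toy rows, illustrative only.
toy = pd.DataFrame({
    'SibSp': [0, 1, 0, 3, 5, 2],
    'Survived': [0, 1, 1, 0, 0, 1],
})

# Bin SibSp: 0 -> alone, 1 -> couple, 2 or more -> family (assumed cut-off).
toy['sibsp_group'] = pd.cut(
    toy['SibSp'],
    bins=[-1, 0, 1, np.inf],
    labels=['alone', 'couple', 'family'],
)

# One dummy column per group.
dummies = pd.get_dummies(toy['sibsp_group'], prefix='sibsp')
toy = pd.concat([toy, dummies], axis=1)

# Survival rate per group, as a first look at how larger families fared.
survival_by_group = toy.groupby('sibsp_group', observed=True)['Survived'].mean()
print(survival_by_group)
```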

### Parch

In [27]:
feature_name = data_train.columns[7]
print 
print '###########'
print feature_name
print
print 'Value Counts:'
print data_train[feature_name].value_counts().head(10)
print
print 'Describe'
print data_train[feature_name].describe()


###########
Parch

Value Counts:
0    678
1    118
2     80
5      5
3      5
4      4
6      1
Name: Parch, dtype: int64

Describe
count    891.000000
mean       0.381594
std        0.806057
min        0.000000
25%        0.000000
50%        0.000000
75%        0.000000
max        6.000000
Name: Parch, dtype: float64


* Number of parents/children aboard
* This could be used along with SibSp to work out the familial status of each passenger

#### ToDo: Create a feature that takes both SibSp and Parch, and creates a "familial_status" categorical variable
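
A possible shape for that feature, with hypothetical rules: "alone" when both counts are zero, "couple" when there is exactly one sibling/spouse and no parents/children, and "family" otherwise. These rules and the `toy` rows are assumptions for illustration; the real definition may need refining.

```python
import numpy as np
import pandas as pd

# Toy rows, illustrative only.
toy = pd.DataFrame({
    'SibSp': [0, 1, 1, 0, 4],
    'Parch': [0, 0, 2, 1, 2],
})

# Hypothetical rules, checked in order; anything unmatched counts as family.
conditions = [
    (toy['SibSp'] == 0) & (toy['Parch'] == 0),
    (toy['SibSp'] == 1) & (toy['Parch'] == 0),
]
labels = ['alone', 'couple']

toy['familial_status'] = pd.Categorical(
    np.select(conditions, labels, default='family'),
    categories=['alone', 'couple', 'family'],
)
print(toy['familial_status'].tolist())
```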

### Ticket

In [28]:
feature_name = data_train.columns[8]
print 
print '###########'
print feature_name
print
print 'Value Counts:'
print data_train[feature_name].value_counts().head(10)
print
print 'Describe'
print data_train[feature_name].describe()


###########
Ticket

Value Counts:
CA. 2343        7
347082          7
1601            7
347088          6
CA 2144         6
3101295         6
382652          5
S.O.C. 14879    5
PC 17757        4
4133            4
Name: Ticket, dtype: int64

Describe
count          891
unique         681
top       CA. 2343
freq             7
Name: Ticket, dtype: object


* The ticket number
* Could give clues as to which passengers are related, since their tickets should be similarly named
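
As a first step in that direction, one could count how many passengers share each exact ticket value. The sketch uses made-up ticket strings; matching "similarly named" tickets that are not identical would need extra parsing that is not attempted here.

```python
import pandas as pd

# Toy rows with hypothetical ticket strings, illustrative only.
toy = pd.DataFrame({'Ticket': ['T-001', 'T-001', 'T-002', 'T-003', 'T-001']})

# For every passenger, the number of passengers holding the same ticket.
toy['ticket_group_size'] = toy.groupby('Ticket')['Ticket'].transform('count')
print(toy)
```

A group size above 1 flags passengers who may be travelling together.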

### Fare

In [30]:
feature_name = data_train.columns[9]
print 
print '###########'
print feature_name
print
print 'Value Counts:'
print data_train[feature_name].value_counts().head(10)
print
print 'Describe'
print data_train[feature_name].describe()


###########
Fare

Value Counts:
8.0500     43
13.0000    42
7.8958     38
7.7500     34
26.0000    31
10.5000    24
7.9250     18
7.7750     16
7.2292     15
26.5500    15
Name: Fare, dtype: int64

Describe
count    891.000000
mean      32.204208
std       49.693429
min        0.000000
25%        7.910400
50%       14.454200
75%       31.000000
max      512.329200
Name: Fare, dtype: float64


* Passenger fare
* No missing values
* Could also be used to group passengers together. It's possible that the ticket number and this price could link couples and families together

#### ToDo: Investigate whether the ticket number and fare can link couples together. Linked couples would make it possible to see whether one spouse surviving is associated with the other spouse surviving.

#### ToDo: After finding links between couples, check whether one spouse surviving raises the other spouse's chance of survival
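
One way to start on both ToDos, under a simplifying assumption: any two passengers who share the same Ticket and Fare, and nobody else does, are treated as a candidate couple. That rule will also catch pairs that are not spouses, so it is only a rough proxy. The `toy` rows are made up.

```python
import pandas as pd

# Toy rows, illustrative only.
toy = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4, 5, 6, 7],
    'Ticket': ['T1', 'T1', 'T2', 'T2', 'T3', 'T4', 'T4'],
    'Fare': [26.0, 26.0, 52.0, 52.0, 8.05, 13.0, 13.0],
    'Survived': [1, 1, 0, 1, 0, 0, 0],
})

key = ['Ticket', 'Fare']

# Keep only groups of exactly two passengers (candidate couples).
group_size = toy.groupby(key)['PassengerId'].transform('count')
pairs = toy[group_size == 2].copy()

# In a two-person group, the partner's outcome is the group total minus one's own.
pairs['partner_survived'] = (
    pairs.groupby(key)['Survived'].transform('sum') - pairs['Survived']
)

# Survival rate split by whether the partner survived.
rate_by_partner = pairs.groupby('partner_survived')['Survived'].mean()
print(rate_by_partner)
```

On the real data, the gap between the two rates would be a first indication of whether a partner's survival is informative.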

### Cabin

In [32]:
feature_name = data_train.columns[10]
print 
print '###########'
print feature_name
print
print 'Value Counts:'
print data_train[feature_name].value_counts().head(10)
print
print 'Describe'
print data_train[feature_name].describe()


###########
Cabin

Value Counts:
C23 C25 C27        4
G6                 4
B96 B98            4
D                  3
C22 C26            3
E101               3
F2                 3
F33                3
B57 B59 B63 B66    2
C68                2
Name: Cabin, dtype: int64

Describe
count             204
unique            147
top       C23 C25 C27
freq                4
Name: Cabin, dtype: object


* Only 204 of the 891 passengers have a cabin recorded (about 23%), which is very low

#### ToDo: Drop "Cabin" from the features
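
A short sketch of that decision: measure the share of missing values first, then drop the column. The cabin strings and the `toy` frame are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows with hypothetical cabin labels, illustrative only.
toy = pd.DataFrame({
    'Cabin': ['C1', np.nan, np.nan, 'E2', np.nan],
    'Fare': [71.28, 7.25, 8.05, 53.1, 7.9],
})

# Share of rows without a cabin value.
missing_share = toy['Cabin'].isnull().mean()
print(missing_share)

# Drop the sparsely populated column.
toy = toy.drop(columns=['Cabin'])
```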

### Embarked

In [33]:
feature_name = data_train.columns[11]
print 
print '###########'
print feature_name
print
print 'Value Counts:'
print data_train[feature_name].value_counts().head(10)
print
print 'Describe'
print data_train[feature_name].describe()


###########
Embarked

Value Counts:
S    644
C    168
Q     77
Name: Embarked, dtype: int64

Describe
count     889
unique      3
top         S
freq      644
Name: Embarked, dtype: object


* The port each passenger embarked from
* Two values are missing (889 of 891 are recorded)
* This could be useful in linking couples/families

#### ToDo: Include this in the linking of couples/families
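
Before relying on the port for linking, it may be worth checking whether passengers on the same ticket actually boarded at the same port. A minimal sketch with made-up rows (`toy` and its ticket strings are hypothetical):

```python
import pandas as pd

# Toy rows, illustrative only.
toy = pd.DataFrame({
    'Ticket': ['T1', 'T1', 'T2', 'T2', 'T3'],
    'Embarked': ['S', 'S', 'C', 'S', 'Q'],
})

# Number of distinct ports per ticket (missing ports are ignored by nunique).
ports_per_ticket = toy.groupby('Ticket')['Embarked'].nunique()

# Tickets whose holders boarded at more than one port.
mixed_tickets = ports_per_ticket[ports_per_ticket > 1].index.tolist()
print(mixed_tickets)
```

If mixed tickets turn out to be rare in the real data, Embarked adds little beyond Ticket for linking; if they are common, it could help separate unrelated passengers.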

## 3. Cleanse any features that require cleansing

I will perform a few more investigations first and then move on to data cleansing.