In [95]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import os
import pandas as pd

filepath = os.path.join(os.getcwd(), "Data")
train = pd.read_csv(os.path.join(filepath, "train.csv"))
train.shape
survival = pd.read_csv(os.path.join(filepath, "gender_submission.csv"))
test = pd.read_csv(os.path.join(filepath, "test.csv"))
# Attach the Survived column to the test rows so both sets share the same 12 columns
test = survival.merge(test, on="PassengerId")
test.shape
# Stack the two sets into one frame with a fresh 0..n-1 index
train = pd.concat([train, test], ignore_index=True)
train.shape

(891, 12)

(418, 12)

(1309, 12)

Instead of relying on Kaggle's partitioning into training and test sets, we'll combine the two sets and create our own splits for cross-validation. The combined data has 1309 rows and 12 variables. Let's have a look.

In [96]:
train.head(n=7)
train.isnull().sum()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S


PassengerId       0
Survived          0
Pclass            0
Name              0
Sex               0
Age             263
SibSp             0
Parch             0
Ticket            0
Fare              1
Cabin          1014
Embarked          2
dtype: int64

There are a couple of variables that are of questionable value for survival prediction.

PassengerId is just an index, and since it was assigned long after the sinking of the Titanic, it will not have any predictive power.

A passenger's name should not have any effect on survival. Names can give us information about the passenger's gender (extracting it is an interesting task better suited to another analysis), but there's already a pre-existing gender variable.

The ticket variable is probably the most confusing. It's not clear what information the ticket variable is supposed to carry. Some tickets are just digits, and others combine letters, punctuation and numbers. It could be that the ticket number reflects the order in which passengers purchased their tickets, which could have some predictive value, but without further information on how to extract meaning from the ticket variable, it should not be used as a predictor.
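To make the mixed ticket formats concrete, here is a minimal sketch that separates an optional text prefix from the trailing number. The helper name `split_ticket` and the regular expression are illustrative assumptions, not part of the original analysis, and the split says nothing about what either part means.

```python
import re

# Ticket values copied from the first rows displayed above.
tickets = ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "373450"]

def split_ticket(ticket):
    """Return (prefix, number); prefix is '' when the ticket is digits only.

    Hypothetical helper: tickets without a trailing number come back
    unchanged as (ticket, None).
    """
    match = re.match(r"^(?:(.*)\s)?(\d+)$", ticket.strip())
    if match is None:
        return ticket, None
    prefix, number = match.groups()
    return prefix or "", int(number)

parsed = [split_ticket(t) for t in tickets]
for raw, (prefix, number) in zip(tickets, parsed):
    print(f"{raw!r:22} prefix={prefix!r:12} number={number}")
```

Even with the parts separated, there is no obvious ordering or grouping to exploit, which is consistent with leaving the ticket out of the model.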

There are some other variables that might present issues besides the lack of predictive power.

There are passengers missing ages, cabin numbers, and to a lesser extent, fare and embarkation location.
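As a quick check on the scale of the problem, the missing counts printed above can be turned into percentages of the 1309 rows. This is plain arithmetic on the numbers already shown; the variable names are only for illustration.

```python
# Missing-value counts and row total copied from the output above.
total_rows = 1309
missing = {"Age": 263, "Fare": 1, "Cabin": 1014, "Embarked": 2}

# Share of rows missing each variable, as a percentage rounded to one decimal.
missing_pct = {col: round(100 * n / total_rows, 1) for col, n in missing.items()}

for col, pct in missing_pct.items():
    print(f"{col:<9} {pct:5.1f}%")
# Age 20.1%, Fare 0.1%, Cabin 77.5%, Embarked 0.2%
```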

More than 75% of passengers are missing cabin numbers (1014 of 1309). The first letter of a cabin number might indicate which section of the ship the passenger was housed in, but with so many labels missing, the value it provides is dubious; imputing roughly three quarters of the cabin numbers from the remaining quarter might lead to spurious results. Also worth noting is that passenger class ("Pclass") can serve as a proxy for ship location, as the different passenger classes were housed in separate sections of the ship.
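For illustration, the section letter mentioned above can be pulled out with a string accessor. The sketch below uses only the Cabin values from the seven rows displayed earlier, and the name `deck` is an assumption made for this example.

```python
import numpy as np
import pandas as pd

# Cabin values from the first seven rows shown above (NaN where missing).
cabins = pd.Series([np.nan, "C85", np.nan, "C123", np.nan, np.nan, "E46"], name="Cabin")

# The first character of the cabin label; missing cabins stay missing.
deck = cabins.str[0]

print(deck.value_counts(dropna=False))
```

Even in this small sample most rows have no section letter, which is the same pattern seen in the full data.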

The missing age problem is more tractable, with about 20% of passengers (263 of 1309) missing an age. We can attempt age imputation with more confidence. As for the handful of missing fare and embarkation values, we can substitute the mean fare and the most common embarkation port, respectively.
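A minimal sketch of these substitutions on a toy frame follows. The Age, Fare and Embarked values come from the first rows shown earlier, with one Fare and one Embarked entry blanked out purely for illustration. The text does not commit to a method for age, so the median fill used here is an assumption and only a simple stand-in.

```python
import numpy as np
import pandas as pd

# Toy data: values from the displayed rows, with one Fare and one Embarked
# entry deliberately removed so there is something to fill in.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan, 54.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, np.nan, 8.4583, 51.8625],
    "Embarked": ["S", "C", "S", None, "S", "Q", "S"],
})

filled = toy.copy()
# Fare: replace missing values with the mean of the observed fares.
filled["Fare"] = filled["Fare"].fillna(filled["Fare"].mean())
# Embarked: replace missing values with the most common port.
filled["Embarked"] = filled["Embarked"].fillna(filled["Embarked"].mode()[0])
# Age: median fill as a placeholder (assumption); a model-based approach
# could be swapped in here.
filled["Age"] = filled["Age"].fillna(filled["Age"].median())

print(filled)
```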

In [97]:
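import hashlib
from numbers import Real

def test_split(id, seed, test_proportion):
    """Deterministically assign an id to the test set (1) or the training set (0)."""
    # bool is a subclass of int, so it is rejected explicitly
    if (isinstance(test_proportion, bool) or not isinstance(test_proportion, Real)
            or not 0 <= test_proportion <= 1):
        raise ValueError("Test proportion must be a real number between 0 and 1")
    key = str(id) + str(seed)
    digest = hashlib.md5(key.encode("ascii")).hexdigest()
    # Scale the last 6 hex digits (24 bits) of the digest onto [0, 1]
    split = int(digest[-6:], 16) / 0xFFFFFF
    return 0 if split > test_proportion else 1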

In [98]:
train["PassengerId"].map(lambda x: test_split(id = x, seed = 42, test_proportion = 0.2))

0       1
1       0
2       0
3       0
4       0
5       0
6       0
7       0
8       1
9       0
10      1
11      0
12      0
13      0
14      1
15      0
16      0
17      0
18      0
19      0
20      0
21      0
22      1
23      0
24      0
25      0
26      0
27      0
28      0
29      0
       ..
1279    0
1280    0
1281    0
1282    0
1283    1
1284    0
1285    1
1286    0
1287    0
1288    0
1289    0
1290    0
1291    0
1292    0
1293    0
1294    1
1295    0
1296    0
1297    0
1298    1
1299    0
1300    0
1301    0
1302    0
1303    0
1304    0
1305    0
1306    0
1307    0
1308    0
Name: PassengerId, Length: 1309, dtype: int64
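The reason for splitting on a hash of the id, rather than drawing random numbers, is that each passenger's assignment depends only on the id and the seed. The sketch below restates the same hashing idea in a standalone helper (`in_test_set` is a name made up for this example) and checks three properties: assignments are repeatable, the realized test share lands near the requested proportion, and raising the proportion only adds ids to the test set.

```python
import hashlib

def in_test_set(passenger_id, seed, test_proportion):
    """Standalone restatement of the hash-based split, returning a bool."""
    digest = hashlib.md5(f"{passenger_id}{seed}".encode("ascii")).hexdigest()
    # Last 6 hex digits scaled onto [0, 1], compared with the requested share.
    return int(digest[-6:], 16) / 0xFFFFFF <= test_proportion

ids = list(range(1, 1310))  # PassengerId values 1..1309

flags = [in_test_set(i, seed=42, test_proportion=0.2) for i in ids]
repeat = [in_test_set(i, seed=42, test_proportion=0.2) for i in ids]
test_fraction = sum(flags) / len(flags)

# A smaller proportion with the same seed selects a subset of the same ids.
smaller = [in_test_set(i, seed=42, test_proportion=0.1) for i in ids]
is_subset = all(big or not small for big, small in zip(flags, smaller))

print(f"repeatable: {flags == repeat}")
print(f"realized test share: {test_fraction:.3f}")
print(f"0.1 split is a subset of 0.2 split: {is_subset}")
```

The realized share will not be exactly 0.2, because the hash only approximates a uniform draw over a finite set of ids.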

In [None]:
type(0.01)