In [2]:
import pandas as pd
titanic = pd.read_csv('train.csv')
titanic_test = pd.read_csv('test.csv')

This step loads the training and test sets into pandas DataFrames.

In [3]:
def clean_data(data, age_filler):
    data.Age = data.Age.fillna(age_filler)
    data.Fare = data.Fare.fillna(data.Fare.median())
    data.loc[data.Sex == "female", "Sex"] = 1
    data.loc[data.Sex == "male", "Sex"] = 0
    data.Embarked = data.Embarked.fillna("S")
    data.loc[data.Embarked == "S", "Embarked"] = 0
    data.loc[data.Embarked == "C", "Embarked"] = 1
    data.loc[data.Embarked == "Q", "Embarked"] = 2
    return data
    
titanic = clean_data(titanic, titanic.Age.median())
titanic_test = clean_data(titanic_test, titanic.Age.median())
print("cleaned")

cleaned


This step defines a function called clean_data that I can use to clean both the train and test sets in the same way.  Because the missing ages need to be filled with the same median in both sets, the function takes that fill value as an argument.  Overall, it just fills in the NaN values in the relevant columns and replaces string values with numbers.
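The same kind of cleaning can be sketched on a tiny made-up frame. This is only an illustration, not the function used in this notebook: `clean_toy` and the toy values are hypothetical, and it uses `Series.map` for the recoding and works on a copy instead of modifying the frame in place.

```python
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up for illustration).
toy = pd.DataFrame({
    "Age": [22.0, None, 30.0],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", None, "Q"],
})

def clean_toy(data, age_filler):
    """Hypothetical variant of clean_data that recodes strings with Series.map."""
    out = data.copy()  # leave the caller's frame untouched
    out["Age"] = out["Age"].fillna(age_filler)
    out["Sex"] = out["Sex"].map({"male": 0, "female": 1})
    # Fill missing ports with "S" first, then recode S/C/Q as 0/1/2.
    out["Embarked"] = out["Embarked"].fillna("S").map({"S": 0, "C": 1, "Q": 2})
    return out

cleaned = clean_toy(toy, toy["Age"].median())
print(cleaned)
```

Passing the fill value in as an argument is what lets the test set reuse the training set's median.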

In [4]:
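try:
    # scikit-learn 0.18+ moved cross_val_score into model_selection
    from sklearn import model_selection as cross_validation
except ImportError:
    # older releases, as used when this notebook was originally run
    from sklearn import cross_validation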
from sklearn.linear_model import LogisticRegression

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

alg = LogisticRegression(random_state=1)
# Compute the accuracy score for all the cross validation folds.  (much simpler than what we did before!)
scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)
# Take the mean of the scores (because we have one for each fold)
print(scores.mean())

0.787878787879


This step imports scikit-learn and uses it to build a logistic regression model.  We create an algorithm and cross-validate it on our training set to get an idea of what the scores might be.  With cv=3, the data is split into three folds and each fold is held out once for evaluation, as opposed to the in-class example, where we used half the data for training and half for evaluating.
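To make the fold idea concrete, here is a minimal pure-Python sketch of how 891 training rows could be divided into three folds. The helper names are hypothetical, and this is a plain contiguous split; scikit-learn stratifies the folds by class for classifiers, so its actual folds will differ.

```python
def kfold_indices(n_rows, n_folds):
    """Split row positions 0..n_rows-1 into n_folds contiguous, near-equal folds."""
    base, extra = divmod(n_rows, n_folds)
    folds, start = [], 0
    for k in range(n_folds):
        size = base + (1 if k < extra else 0)  # the first `extra` folds get one more row
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def train_test_splits(n_rows, n_folds):
    """Yield (train, test) position lists; each fold is held out exactly once."""
    folds = kfold_indices(n_rows, n_folds)
    for k, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

# 891 rows is the size of the training set shown in the output of this notebook.
for train, test in train_test_splits(891, 3):
    print(len(train), len(test))
```

Each row is used for evaluation exactly once, and every model is trained on two thirds of the data instead of half.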

In [5]:
# Train the algorithm using all the training data
alg.fit(titanic[predictors], titanic["Survived"])

# Make predictions using the test set.
predictions = alg.predict(titanic_test[predictors])

# Create a new dataframe with only the columns Kaggle wants from the dataset.
submission = pd.DataFrame({
        "PassengerId": titanic_test["PassengerId"],
        "Survived": predictions
    })

submission.to_csv("kaggle.csv", index=False)

This step fits our logistic regression algorithm to the training set using the relevant columns (we ignore the name, cabin, etc., as defined in the predictors list), with the Survived column as the thing we want to predict.  Then we use the trained algorithm to create predictions for the test set and write those predictions to a CSV to submit to Kaggle.

Kaggle submission score: 0.75120

I think one place to start improving the DataQuest model is based on what we talked about in class on Tuesday 1/26.  We discussed how assigning numbers to string values can skew the model because it creates an artificial ordering for that variable.  In this model, we turned the Embarked variable into 0, 1, or 2.  One way to see if the model can do better is to create three new columns, "EmbarkedC", "EmbarkedS", and "EmbarkedQ", each holding a true/false value, so that we can separate the effects of each port.
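As a rough sketch of the idea, pandas can build indicator columns like these with `pd.get_dummies`. The toy frame below is made up, and this is an alternative illustration rather than the approach used in this notebook.

```python
import pandas as pd

# Toy stand-in for the Embarked column after clean_data (S=0, C=1, Q=2).
toy = pd.DataFrame({"Embarked": [0, 1, 2, 0]})

# Map the codes back to letters so the new column names are readable.
ports = toy["Embarked"].map({0: "S", 1: "C", 2: "Q"})

# One indicator column per port; dtype=int gives 0/1 instead of True/False.
dummies = pd.get_dummies(ports, prefix="Embarked", prefix_sep="", dtype=int)
toy = pd.concat([toy, dummies], axis=1)
print(toy)
```

Each row has exactly one of the three indicators set, so the model no longer sees port C as "between" S and Q.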

In [6]:
def parse_port(port):
    if port == 0: # port is s=0, c=1, q=2
        return 1,0,0
    elif port == 1:
        return 0,1,0
    elif port == 2:
        return 0,0,1

def make_embarked_cols(data):
    s = []
    c = []
    q = []

    for port in data.Embarked:
        sbool, cbool, qbool = parse_port(port)
        s.append(sbool)
        c.append(cbool)
        q.append(qbool)

    data['EmbarkedS'] = s
    data['EmbarkedC'] = c
    data['EmbarkedQ'] = q
    
    return data

make_embarked_cols(titanic)
make_embarked_cols(titanic_test)
print(titanic.EmbarkedS)

0      1
1      0
2      1
3      1
4      1
5      0
6      1
7      1
8      1
9      0
10     1
11     1
12     1
13     1
14     1
15     1
16     0
17     1
18     1
19     0
20     1
21     1
22     0
23     1
24     1
25     1
26     0
27     1
28     0
29     1
      ..
861    1
862    1
863    1
864    1
865    1
866    0
867    1
868    1
869    1
870    1
871    1
872    1
873    1
874    0
875    0
876    1
877    1
878    1
879    0
880    1
881    1
882    1
883    1
884    1
885    0
886    1
887    1
888    1
889    0
890    0
Name: EmbarkedS, dtype: int64


These two functions walk through the Embarked column and create three new indicator columns in the dataset, one per port.

In [7]:
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "EmbarkedS", "EmbarkedQ", "EmbarkedC"]

alg = LogisticRegression(random_state=1)
# Compute the accuracy score for all the cross validation folds.  (much simpler than what we did before!)
scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)
# Take the mean of the scores (because we have one for each fold)
print(scores.mean())

0.789001122334


In [8]:
# Train the algorithm using all the training data
alg.fit(titanic[predictors], titanic["Survived"])

# Make predictions using the test set.
predictions = alg.predict(titanic_test[predictors])

# Create a new dataframe with only the columns Kaggle wants from the dataset.
submission = pd.DataFrame({
        "PassengerId": titanic_test["PassengerId"],
        "Survived": predictions
    })

submission.to_csv("kaggle2.csv", index=False)

Despite the _slightly_ better score in the cross-validation, my new Kaggle submission actually scored worse, at 0.74641. :(  Clearly, the issue with the Embarked ports is not as important as I thought, or the port of embarkation doesn't actually matter that much.

In [9]:
# Let's take a look at the ones it predicted wrong.

newpredictions = alg.predict(titanic[predictors])

predictedDead_isAlive = 0
predictedAlive_isDead = 0

for i in range(len(newpredictions)):
    if titanic.Survived[i] != newpredictions[i]:
        print(titanic[predictors].loc[i])
        print("SURVIVED:  %s" % titanic.Survived.loc[i])
        print("PREDICTED:  %s" % newpredictions[i])
        # On a mismatch exactly one of the two values is 1, so each
        # error is added to exactly one of the two counters.
        predictedDead_isAlive += titanic.Survived.loc[i]
        predictedAlive_isDead += newpredictions[i]

print("predictedDead_isAlive: %s" % predictedDead_isAlive)
print("predictedAlive_isDead: %s" % predictedAlive_isDead)


Pclass            3
Sex               1
Age              14
SibSp             0
Parch             0
Fare         7.8542
EmbarkedS         1
EmbarkedQ         0
EmbarkedC         0
Name: 14, dtype: object
SURVIVED:  0
PREDICTED:  1
Pclass        2
Sex           0
Age          28
SibSp         0
Parch         0
Fare         13
EmbarkedS     1
EmbarkedQ     0
EmbarkedC     0
Name: 17, dtype: object
SURVIVED:  1
PREDICTED:  0
Pclass        3
Sex           1
Age          31
SibSp         1
Parch         0
Fare         18
EmbarkedS     1
EmbarkedQ     0
EmbarkedC     0
Name: 18, dtype: object
SURVIVED:  0
PREDICTED:  1
Pclass        2
Sex           0
Age          34
SibSp         0
Parch         0
Fare         13
EmbarkedS     1
EmbarkedQ     0
EmbarkedC     0
Name: 21, dtype: object
SURVIVED:  1
PREDICTED:  0
Pclass          1
Sex             0
Age            28
SibSp           0
Parch           0
Fare         35.5
EmbarkedS       1
EmbarkedQ       0
EmbarkedC       0
Name: 23, dtype: objec

It's interesting that my model predicted more dead people than there should be, but I'm not entirely sure how to edit the model to improve my score based on this information.
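The two kinds of mistakes can also be tallied with explicit branches, which makes the direction of each error easier to read. This is a small sketch on made-up labels; `error_counts` is a hypothetical helper, and the numbers are not results from this model.

```python
def error_counts(actual, predicted):
    """Count correct predictions and the two kinds of errors for 0/1 labels."""
    counts = {"predicted_dead_is_alive": 0, "predicted_alive_is_dead": 0, "correct": 0}
    for truth, guess in zip(actual, predicted):
        if truth == guess:
            counts["correct"] += 1
        elif truth == 1:
            # Survived, but the model predicted death.
            counts["predicted_dead_is_alive"] += 1
        else:
            # Died, but the model predicted survival.
            counts["predicted_alive_is_dead"] += 1
    return counts

# Made-up labels purely for illustration (1 = survived, 0 = died).
actual = [0, 1, 0, 1, 1, 0]
predicted = [1, 0, 0, 0, 1, 0]
print(error_counts(actual, predicted))
```

If one count is much larger than the other, the model is leaning toward one class, which is one way to describe the imbalance noted above.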

From my exploration, whether the passenger was an adult or a child seemed to make a pretty big difference in survival rates.  I wonder if making this its own column, separate from age, might change my results?

In [18]:
titanic['Child'] = titanic.Age <= 18
titanic_test['Child'] = titanic_test.Age <= 18
print(titanic.Child)

0      False
1      False
2      False
3      False
4      False
5      False
6      False
7       True
8      False
9       True
10      True
11     False
12     False
13     False
14      True
15     False
16      True
17     False
18     False
19     False
20     False
21     False
22      True
23     False
24      True
25     False
26     False
27     False
28     False
29     False
       ...  
861    False
862    False
863    False
864    False
865    False
866    False
867    False
868    False
869     True
870    False
871    False
872    False
873    False
874    False
875     True
876    False
877    False
878    False
879    False
880    False
881    False
882    False
883    False
884    False
885    False
886    False
887    False
888    False
889    False
890    False
Name: Child, dtype: bool


In [33]:
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "EmbarkedS", "EmbarkedQ", "EmbarkedC", "Child"]

alg = LogisticRegression(random_state=1)
# Compute the accuracy score for all the cross validation folds.  (much simpler than what we did before!)
scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)
# Take the mean of the scores (because we have one for each fold)
print(scores.mean())

0.789001122334


In [26]:
for i in range(2,100):
    scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=i)
    # Take the mean of the scores (because we have one for each fold)
    print("%s %s" % (i, scores.mean()))

2 0.772187232327
3 0.789001122334
4 0.796850787567
5 0.795773475089
6 0.797985065603
7 0.798014645491
8 0.800286467474
9 0.796857463524
10 0.799171773919
11 0.80366948619
12 0.802582376897
13 0.803704224165
14 0.803654827316
15 0.801449444769
16 0.802594554568
17 0.79920142545
18 0.801337868481
19 0.800331077462
20 0.800231664471
21 0.798176352994
22 0.800268633805
23 0.801711410407
24 0.800501159054
25 0.800468720822
26 0.802578467284
27 0.798198488149
28 0.801497477657
29 0.801685141856
30 0.798050920776
31 0.801430160496
32 0.80273900748
33 0.802731219398
34 0.800324403266
35 0.800626373626
36 0.800381054131
37 0.800487661575
38 0.801344393593
39 0.803933988717
40 0.802879728967
41 0.80261714249
42 0.802709885008
43 0.801736900391
44 0.801711665527
45 0.800478975216
46 0.799383422324
47 0.801810377006
48 0.802643762183
49 0.803781161587
50 0.801683006536
51 0.80328078944
52 0.802758924082
53 0.801632445431
54 0.802892762043
55 0.802923351159
56 0.80175070028
57 0.802704678363
58 0.8

KeyboardInterrupt: 

In [28]:
# Train the algorithm using all the training data
alg.fit(titanic[predictors], titanic["Survived"])

# Make predictions using the test set.
predictions = alg.predict(titanic_test[predictors])

# Create a new dataframe with only the columns Kaggle wants from the dataset.
submission = pd.DataFrame({
        "PassengerId": titanic_test["PassengerId"],
        "Survived": predictions
    })

submission.to_csv("kaggle6.csv", index=False)

Good news: this improved my model's score to 0.76077!  It's not much of an increase, but it's something.  I moved up 190 positions on the leaderboard!