# Next step: Logistic Regression

I've been trying out several different methods on this Titanic data, many of them inspired by others' work. I'm using the dataset to formalize what I learned about these methodologies for my thesis (where they often went by different names) and to learn the ins and outs of using them with sklearn.

#### Here, I want to get into the habit of writing nicely contained functions that take the raw input and process it. This should clean up the code and make tweaks (and swapping algorithms) easier.

In [1]:
# Imports 
import numpy as np
import pandas as pd
%matplotlib inline

In [5]:
def cleanup_data(data):
    """Return a cleaned-up copy of the raw dataset."""
    # Work on a copy so the caller's DataFrame is left untouched
    data = data.copy()

    # Some ages are missing from the input data. For now, replace them with a
    # very negative number, since the fact that we don't know a passenger's age
    # may itself be interesting.
    data["Age"] = data["Age"].fillna(-100)

    # Sex is either "male" or "female". Code it as 0/1.
    data["Sex"] = data["Sex"].map({"male": 0, "female": 1})

    # Similarly, Embarked takes exactly three values (Q, C, S) or is missing.
    # Code these as 1/2/3, with 0 for missing.
    data["Embarked"] = data["Embarked"].fillna("Missing").map(
        {"Missing": 0, "Q": 1, "C": 2, "S": 3})

    # Some of the test data is missing a Fare; fill it with the median
    data["Fare"] = data["Fare"].fillna(data["Fare"].median())

    # Return the modified dataset
    return data
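
A quick sanity check after cleanup can catch a column that slipped through. The sketch below is not part of the original workflow: `assert_model_ready` is a hypothetical helper, and the toy frame is invented to look like the output of `cleanup_data`.

```python
import pandas as pd


def assert_model_ready(data, predictors):
    """Hypothetical helper: fail loudly if a predictor still has NaNs or strings."""
    missing = data[predictors].isnull().sum()
    bad = missing[missing > 0]
    assert bad.empty, "NaNs remain in: %s" % list(bad.index)
    # pd.to_numeric raises ValueError on leftover labels such as "male" or "S"
    for col in predictors:
        pd.to_numeric(data[col])


# Toy rows invented for illustration; these are not Titanic records
toy = pd.DataFrame({
    "Sex": [0, 1, 1],
    "Age": [22.0, -100.0, 35.0],
    "Embarked": [3, 2, 0],
    "Fare": [7.25, 71.28, 8.05],
})

assert_model_ready(toy, ["Sex", "Age", "Embarked", "Fare"])
print("toy frame is ready for sklearn")
```

Running something like this on both the train and test frames before fitting should turn a confusing sklearn error into a message naming the offending column.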

### While we're at it, let's make a tool to quickly create my submission

In [7]:
def make_submission(alg, train, test, predictors, filename):
    """Create Kaggle Output formatted CSV file.
    
    Input : alg -- object for the model
            train -- training set
            test -- testing set
            predictors -- columns used when fitting
            filename -- output file name
    """
    alg.fit(train[predictors], train["Survived"])
    predictions = alg.predict(test[predictors])
    
    submission = pd.DataFrame({
            "PassengerId" : test["PassengerId"],
            "Survived" : predictions
        })
    submission.to_csv(filename, index=False)
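
Since the model is passed in as `alg`, swapping algorithms should only mean changing one argument. Here is a minimal sketch of that with `LogisticRegression`, assuming a tiny made-up dataset and a temporary output path; the function body is repeated only so the block runs on its own.

```python
import os
import tempfile

import pandas as pd
from sklearn.linear_model import LogisticRegression


def make_submission(alg, train, test, predictors, filename):
    """Same logic as the function above, repeated so this block is self-contained."""
    alg.fit(train[predictors], train["Survived"])
    predictions = alg.predict(test[predictors])
    submission = pd.DataFrame({
        "PassengerId": test["PassengerId"],
        "Survived": predictions,
    })
    submission.to_csv(filename, index=False)


# Toy rows invented for illustration; these are not Titanic records
toy_train = pd.DataFrame({
    "Pclass": [1, 3, 2, 3, 1, 2],
    "Sex": [1, 0, 1, 0, 1, 0],
    "Survived": [1, 0, 1, 0, 1, 0],
})
toy_test = pd.DataFrame({
    "PassengerId": [892, 893],
    "Pclass": [3, 1],
    "Sex": [0, 1],
})

# Write to a temporary directory (an assumption, to avoid cluttering the notebook folder)
out_path = os.path.join(tempfile.mkdtemp(), "toy_submission.csv")
make_submission(LogisticRegression(), toy_train, toy_test, ["Pclass", "Sex"], out_path)

result = pd.read_csv(out_path)
print(result)
```

The CSV has one row per test passenger and the two columns Kaggle expects, `PassengerId` and `Survived`.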


## Let's start the main loop. 

In [10]:
train = pd.read_csv("data/train.csv", dtype={"Age": np.float64})
test = pd.read_csv("data/test.csv", dtype={"Age": np.float64})

In [11]:
# Run our cleanup script
train_data = cleanup_data(train)
test_data = cleanup_data(test)

In [15]:
# Let's run a random forest classifier
from sklearn.ensemble import RandomForestClassifier
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import cross_val_score

# List of the columns we want to use
predictors = ["Pclass", "Sex", "Age", "Fare", "Embarked", "SibSp", "Parch"]

model = RandomForestClassifier(
    random_state=69,
    n_estimators=100,
    min_samples_split=4,
    min_samples_leaf=2)

# 3-fold cross-validation on the training set
scores = cross_val_score(
    model,
    train_data[predictors],
    train_data["Survived"],
    cv=3)

print(scores.mean())

0.819304152637
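
The mean alone hides how much the folds disagree. A minimal sketch of printing the per-fold scores and their spread, using a synthetic dataset from `make_classification` as a stand-in (an assumption, so the block runs without the Titanic CSVs; its numbers will not match the score above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in with 7 features, like the 7 predictors used above
X, y = make_classification(n_samples=300, n_features=7, random_state=0)

model = RandomForestClassifier(
    random_state=69,
    n_estimators=100,
    min_samples_split=4,
    min_samples_leaf=2)

# One accuracy value per fold
scores = cross_val_score(model, X, y, cv=3)
print("per-fold accuracy:", scores)
print("mean %.3f, std %.3f" % (scores.mean(), scores.std()))
```

With only three folds the standard deviation is a rough guide at best, but it helps judge whether a small change in the mean is worth getting excited about.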


In [16]:
# Let's take a look at some subsplits
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(train_data[predictors], 
                                                    train_data["Survived"] ,
                                                    test_size=0.25, random_state=69)

In [17]:
# Fit the model on the training subsample
model.fit(X_train, y_train)

# Run predictions on the held-out subsample
y_pred = model.predict(X_test)
no_samples = len(y_test)
print("Misclassified Samples %d of %d" %
      ((y_test != y_pred).sum(), no_samples))

from sklearn.metrics import accuracy_score
print("Accuracy: %.2f" % accuracy_score(y_test, y_pred))

Misclassified Samples 41 of 223
Accuracy: 0.82
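
The two printed numbers say the same thing: accuracy is just one minus the error rate. As a quick check of the arithmetic, using only the counts from the output above:

```python
misclassified = 41
total = 223

# Accuracy is the fraction of held-out samples predicted correctly
accuracy = 1 - misclassified / total
print("Accuracy: %.2f" % accuracy)  # prints "Accuracy: 0.82"
```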
