### Model Iteration 2

In [1]:
import pandas as pd
import numpy as np
from sklearn import cross_validation
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import KFold
import re
%matplotlib inline



In [2]:
def clean(d, median_age):
    """
    cleans the data with 
    """
    
    # recodes male as 0 and female as 1
    d.loc[d.Sex == 'male', 'Sex'] = 0
    d.loc[d.Sex == 'female', 'Sex'] = 1
    
    # missing age with median
    d.Age = d["Age"].fillna(median_age)

    # recodes the ports to 0 1 2
    d.Embarked = d.Embarked.fillna('S')
    d.loc[d.Embarked == 'S', 'Embarked'] = 0
    d.loc[d.Embarked == 'C', 'Embarked'] = 1
    d.loc[d.Embarked == 'Q', 'Embarked'] = 2
    
    d["Fare"] = d["Fare"].fillna(d["Fare"].median())
    
    return d

train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

train = clean(train, train.Age.median())
test = clean(test, train.Age.median())

In [3]:
train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
dtype: int64

### Random forest mode

In [4]:
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
alg = RandomForestClassifier(random_state=1, n_estimators=10, min_samples_split=2, min_samples_leaf=1)
scores = cross_validation.cross_val_score(alg, train[predictors], train["Survived"], cv=3)
print "accuracy", scores.mean()

accuracy 0.801346801347


Random forest model is already doing better than previous models

In [5]:
def cross_validate(alg, data, predictors):
    return cross_validation.cross_val_score(alg, train[predictors], train["Survived"], cv=3).mean()

In [6]:
alg = RandomForestClassifier(random_state=1, n_estimators=150, min_samples_split=4, min_samples_leaf=2)
print "accuracy ", cross_validate(alg, train, predictors) 

accuracy  0.820426487093


When creating a Random Forest model with more estimaters, samples split and leaf to reduce over fitting and it does much better

In [7]:
def submit(file_name, alg, train, test, predictors):
    """
    generates submission csv with alg, train, and predictors
    """
    
    alg.fit(train[predictors], train["Survived"])
    predictions = alg.predict(test[predictors])
    submission = pd.DataFrame({
        'PassengerId': test["PassengerId"],
        'Survived': predictions
    })
    
    submission.to_csv(file_name, index=False)

In [8]:
submit("kaggle_rf.csv", alg, train, test, predictors)

.76077 for the submission. It did better than my previous attempts. One possiblity why random forest model is not performing super well on the kaggle set is because it might be overfitting. I want to try with samples split and leaf to reducing overfitting

In [9]:
alg = RandomForestClassifier(random_state=1, n_estimators=150, min_samples_split=10, min_samples_leaf=5)
print "accuracy", cross_validate(alg, train, predictors) 

accuracy 0.818181818182


In [10]:
submit("kaggle_rf2.csv", alg, train, test, predictors)

It seems like this didnt change much ai got the same accuracy as before

### New Features

In [11]:
# creating functions for all of the transformations since we have to apply them to both train and test
def get_title(name):
    # Use a regular expression to search for a title.  Titles always consist of capital and lowercase letters, and end with a period.
    title_search = re.search(' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        title = title_search.group(1)
        
        # Map each title to an integer.  Some titles are very rare, and are compressed into the same codes as other titles.
        title_mapping = {
            "Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Dr": 5, "Rev": 6, "Major": 7, "Col": 7, "Mlle": 8, "Mme": 8, "Don": 9, "Lady": 10, "Countess": 10, "Jonkheer": 10, "Sir": 9, "Capt": 7, "Ms": 2}
        try:
            return title_mapping[title]
        except KeyError:
            pass
    return 0

def new_features(data):
    
    # family number column
    data["FamilyNum"] = data["SibSp"] + data["Parch"] + 1 # add one to include the individual
    
    # The .apply method generates a new series
    data["NameLength"] = data["Name"].apply(lambda x: len(x))
    
    # add title
    data["Title"] = data.Name.apply(get_title)
    
    data["Child"] = data["Age"]
    data.loc[data["Child"] <= 8, "Child"] = 1
    data.loc[data["Child"] >8, "Child"] = 0


### Ensembling

In [12]:
def ensemble(algorithms, data):
    # Initialize the cross validation folds
    kf = KFold(data.shape[0], n_folds=3, random_state=1)

    predictions = []
    for train, test in kf:
        train_target = data["Survived"].iloc[train]
        full_test_predictions = []
        # Make predictions for each algorithm on each fold
        for alg, predictors in algorithms:
            # Fit the algorithm on the training data.
            alg.fit(data[predictors].iloc[train,:], train_target)
            # Select and predict on the test fold.  
            # The .astype(float) is necessary to convert the dataframe to all floats and avoid an sklearn error.
            test_predictions = alg.predict_proba(data[predictors].iloc[test,:].astype(float))[:,1]
            full_test_predictions.append(test_predictions)
        # Use a simple ensembling scheme -- just average the predictions to get the final classification.
        test_predictions = (full_test_predictions[0] + full_test_predictions[1]) / 2
        # Any value over .5 is assumed to be a 1 prediction, and below .5 is a 0 prediction.
        test_predictions[test_predictions <= .5] = 0
        test_predictions[test_predictions > .5] = 1
        predictions.append(test_predictions)

    # Put all the predictions together into one array.
    predictions = np.concatenate(predictions, axis=0)

    # Compute accuracy by comparing to the training data.
    accuracy = sum(predictions[predictions == data["Survived"]]) / len(predictions)
    return accuracy

In [13]:
algs = [
        [GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3), ["Pclass", "Sex", "Age", "Fare", "Embarked", "FamilyNum", "Child", "NameLength", "Title" ]],
        [LogisticRegression(random_state=1), ["Pclass", "Sex", "Age", "Fare", "Embarked", "FamilyNum", "Child", "NameLength","Title" ]],
        [RandomForestClassifier(random_state=1, n_estimators=150, min_samples_split=10, min_samples_leaf=5), ["Pclass", "Sex", "Age", "Fare", "Embarked", "FamilyNum", "Child", "NameLength","Title" ]]
    ]
new_features(train)
new_features(test)
print ensemble(algs, train)

0.82379349046




I am going to try combining GradientBoosting, Logistic Regrssion and random forest to see if this makes my result better 

In [14]:
def submitMultipe(file_name, algs, train, test):
    """
    generates submission csv with alg, train, and predictors
    """
    full_predictions = []
    for alg, predictors in algs:
        alg.fit(train[predictors], train["Survived"])
        
        predictions = alg.predict_proba(test[predictors].astype(float))[:,1]
        full_predictions.append(predictions)

    test_predictions = np.sum(full_predictions, axis=0) / len(full_predictions)
    predictions = (predictions > 0.5).astype(int)
        
    submission = pd.DataFrame({
        'PassengerId': test["PassengerId"],
        'Survived': predictions
    })
    
    submission.to_csv(file_name, index=False)

In [15]:
submitMultipe("ensemble.csv", algs, train, test)

I did much better with .78469. Combining difference algorithms definitely seemed to have helped

### Best Features

In [16]:
predictors = ["Pclass", "Sex", "Age", "Fare", "Embarked", "FamilyNum", "Child", "NameLength","Title" ]

selector = SelectKBest(f_classif, k=5)
selector.fit(train[predictors], train["Survived"])
scores = -np.log10(selector.pvalues_)
for i in range(len(scores)):
    print predictors[i], scores[i]

Pclass 24.5956714208
Sex 68.8519942529
Age 1.27768954597
Fare 14.2132351418
Embarked 2.85130099045
FamilyNum 0.207684583419
Child 5.02069887713
NameLength 23.6931901615
Title 26.9833860721


We can see that top 5 categories are Sex, Title, PClass, NameLength, Fare. Let's try running our tests with just these 5.

In [17]:
algs = [
        [GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3), ["Pclass", "Sex", "Title", "Fare", "NameLength"]],
        [LogisticRegression(random_state=1), ["Pclass", "Sex", "Title", "Fare", "NameLength"]],
        [RandomForestClassifier(random_state=1, n_estimators=150, min_samples_split=10, min_samples_leaf=5),["Pclass", "Sex", "Title", "Fare", "NameLength"]]
    ]
print ensemble(algs, train)
submitMultipe("5cat.csv", algs, train, test)

0.817059483726




The scores mean was less and it didn't do better on kaggle. I want to try excluding all of the predictors less than 2.

In [18]:
algs = [
        [GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3), ["Pclass", "Sex", "Title", "Fare", "NameLength", "Fare", "Child", "Embarked"]],
        [LogisticRegression(random_state=1), ["Pclass", "Sex", "Title", "Fare", "NameLength", "Fare", "Child", "Embarked"]],
        [RandomForestClassifier(random_state=1, n_estimators=150, min_samples_split=10, min_samples_leaf=5),["Pclass", "Sex", "Title", "Fare", "NameLength", "Fare", "Child", "Embarked"]]
    ]
print ensemble(algs, train)
submitMultipe("8cat.csv", algs, train, test)

0.813692480359




Still no improvement so it does seem like adding more predictors help our case.

This exploration has been very interesting. The data most of the times doesnt seem to behave the way I expected it to be, but I guess that is the beauty of data science. If everything behaved the way we expected to behave then nothing would be really interesting. With this warmup project I think I definitely learned some tools I can use to start my own machine learning journey. 
I think for my future iterations I want to possibly read more about the background story of titanic to understand the data set better and learn more algorithms to try them out and form a better ensemble of them.