Importing all necessary dependencies for this notebook

In [1]:
%matplotlib inline

import pandas as pd
import numpy as np
import sklearn
# import thinkstats2
# import thinkplot

# Import the linear regression class
from sklearn.linear_model import LinearRegression
# Sklearn also has a helper that makes it easy to do cross validation
from sklearn.cross_validation import KFold
from sklearn import cross_validation
from sklearn.linear_model import LogisticRegression

Read in data from the train and test data set csv files

In [2]:
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

# train_df.columns
# print train_df.describe()

**Clean data**

For the train dataset:
- Fill in missing age age with the median age (this prevents mathematical computations from going horribly)
- Recode the categorical variables that are coded using char's with numerical codes to make computation easier

[ Gender recode: {male:0, female:1}
  Embarkment recode: {S:0, C:1, Q:2}]

In [3]:
# fill in data
train_df["Age"] = train_df.fillna(train_df["Age"].median())
# Recode gender
train_df.loc[train_df["Sex"] == "male", "Sex"] = 0
train_df.loc[train_df["Sex"] == "female", "Sex"] = 1
# Recode embarkment
train_df["Embarked"] = train_df["Embarked"].fillna("S")
train_df.loc[train_df["Embarked"] == "S", "Embarked"] = 0
train_df.loc[train_df["Embarked"] == "C", "Embarked"] = 1
train_df.loc[train_df["Embarked"] == "Q", "Embarked"] = 2

For the test dataset, do all the same things as above, but also replace missing values in the Fare variable with the test data's median (since we want to use Fare as a variable in prediction)

In [4]:
# fill in NAN data
test_df["Age"] = test_df.fillna(train_df["Age"].median())
test_df["Fare"] = test_df.fillna(test_df["Fare"].median())
# Recode gender
test_df.loc[test_df["Sex"] == "male", "Sex"] = 0
test_df.loc[test_df["Sex"] == "female", "Sex"] = 1
# Recode embarkment
test_df["Embarked"] = test_df["Embarked"].fillna("S")
test_df.loc[test_df["Embarked"] == "S", "Embarked"] = 0
test_df.loc[test_df["Embarked"] == "C", "Embarked"] = 1
test_df.loc[test_df["Embarked"] == "Q", "Embarked"] = 2


Divide the dataset into 3 subgroups to test and compare for modeling validation. Set the model (Linear Regression) and the predictive variables ("predictors"), and run the model on all pairs of the 3 subgroups using those parameters.

In [32]:

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# Initialize our algorithm class
alg = LinearRegression()
# Generate cross validation folds for the titanic dataset.  It return the row indices corresponding to train and test.
# We set random_state to ensure we get the same splits every time we run this.
kf = KFold(df.shape[0], n_folds=3, random_state=1)

predictions = []
for train, test in kf:
    # The predictors we're using the train the algorithm.  Note how we only take the rows in the train folds.
    train_predictors = (df[predictors].iloc[train,:])
    # The target we're using to train the algorithm.
    train_target = df["Survived"].iloc[train]
    # Training the algorithm using the predictors and target.
    alg.fit(train_predictors, train_target)
    # We can now make predictions on the test fold
    test_predictions = alg.predict(df[predictors].iloc[test,:])
    predictions.append(test_predictions)

Combine the results of the 3 subgroup pairs, map them to binary results and calculate the accuracy based on how many predictions were correct.

In [34]:
predictions = np.concatenate(predictions, axis=0)

# Map predictions to outcomes (only possible outcomes are 1 and 0)
predictions[predictions > .5] = 1
predictions[predictions <=.5] = 0
total = len(predictions)

accuracy = sum(predictions == df["Survived"])/float(total)
accuracy

0.78114478114478114

Instead of using Linear Regression, try using Logistic Regression modeling on the data, and calculate the mean accuracy across the 3 subgroups.

In [11]:
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# Initialize our algorithm
alg = LogisticRegression(random_state=1)
# Compute the accuracy score for all the cross validation folds.  (much simpler than what we did before!)
scores = cross_validation.cross_val_score(alg, train_df[predictors], train_df["Survived"], cv=3)
# Take the mean of the scores (because we have one for each fold)
print(scores.mean())

0.796857463524


Export the predictions (found using the Logistic Regression) into a Kaggle submission format.

In [7]:
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# Initialize the algorithm class
alg = LogisticRegression(random_state=1)

# Train the algorithm using all the training data
alg.fit(train_df[predictors], train_df["Survived"])

# Make predictions using the test set.
predictions = alg.predict(test_df[predictors])

# Create a new dataframe with only the columns Kaggle wants from the dataset.
submission = pd.DataFrame({
        "PassengerId": test_df["PassengerId"],
        "Survived": predictions
    })

submission.to_csv("mod1Kaggle.csv", index = False)

Kaggle accuracy: .37799