Zoher Ghadyali

Data Science 2016

Model Iteration 1 - Warmup Project

The code below was all generated by the "Getting Started with Kaggle" DataQuest mission.

In [104]:
import pandas

# We can use the pandas library in python to read in the csv file.
# This creates a pandas dataframe and assigns it to the titanic variable.
titanic = pandas.read_csv("train.csv")

#cleaning up the data to make it ready for our model
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())
titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
titanic.loc[titanic["Sex"] == "female", "Sex"] = 1

titanic["Embarked"] = titanic["Embarked"].fillna("S")
titanic.loc[titanic["Embarked"] == "S", "Embarked"] = 0
titanic.loc[titanic["Embarked"] == "C", "Embarked"] = 1
titanic.loc[titanic["Embarked"] == "Q", "Embarked"] = 2

We now use a linear regression model. We take our training data and split it into three folds. The model trains on two of the folds and is tested on the remaining one, and this repeats until each fold has served as the test fold once. The resulting predictions are then compared to the actual outcomes to determine the accuracy of the model.
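To make the fold rotation concrete, here is a minimal pure-Python sketch of unshuffled three-fold splitting on a toy set of 7 rows. The helper name `contiguous_folds` is hypothetical and is not part of scikit-learn; it only illustrates which row indices land in the train and test parts of each round.

```python
def contiguous_folds(n_rows, n_folds=3):
    """Yield (train, test) index lists for unshuffled k-fold splitting.

    Hypothetical helper for illustration only.
    """
    base, extra = divmod(n_rows, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds get one additional row.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train, test
        start += size


folds = list(contiguous_folds(7))
for train, test in folds:
    print("train:", train, "test:", test)
```

Every row appears in exactly one test fold, so concatenating the per-fold predictions gives one prediction per row.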

In [105]:
# Import the linear regression class
from sklearn.linear_model import LinearRegression
# Sklearn also has a helper that makes it easy to do cross validation
from sklearn.cross_validation import KFold
import numpy as np

# The columns we'll use to predict the target
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# Initialize our algorithm class
alg = LinearRegression()
# Generate cross validation folds for the titanic dataset.  It returns the row indices corresponding to train and test.
# random_state is passed for reproducibility; it only changes the splits when shuffling is enabled, and these folds are not shuffled.
kf = KFold(titanic.shape[0], n_folds=3, random_state=1)

predictions = []
for train, test in kf:
    # The predictors we're using to train the algorithm.  Note how we only take the rows in the train folds.
    train_predictors = (titanic[predictors].iloc[train,:])
    # The target we're using to train the algorithm.
    train_target = titanic["Survived"].iloc[train]
    # Training the algorithm using the predictors and target.
    alg.fit(train_predictors, train_target)
    # We can now make predictions on the test fold
    test_predictions = alg.predict(titanic[predictors].iloc[test,:])
    predictions.append(test_predictions)
    
# The predictions are in three separate numpy arrays.  Concatenate them into one.  
# We concatenate them on axis 0, as they only have one axis.
predictions = np.concatenate(predictions, axis=0)

# Map predictions to outcomes (only possible outcomes are 1 and 0)
predictions[predictions > .5] = 1
predictions[predictions <=.5] = 0

correct = len(titanic[titanic.Survived == predictions])
accuracy = float(correct)/len(predictions)

print "Accuracy on test" , accuracy

Accuracy on test 0.783389450056


Next we try a logistic regression. It passes the linear output through the logistic function, which squeezes extreme values into the range between 0 and 1. That works for us in this case because we only care about predicting whether or not a passenger survived.
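As a rough illustration of that squeezing, the sketch below evaluates the logistic function on a few values. The function name `logistic` is chosen here for illustration; scikit-learn applies the same transformation internally rather than exposing it under this name.

```python
import math


def logistic(z):
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Large negative inputs approach 0, large positive inputs approach 1.
for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(z, round(logistic(z), 4))
```

An output above 0.5 can then be read as "survived" and anything else as "did not survive", which mirrors the thresholding done by hand in the linear regression cell.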

In [106]:
from sklearn.linear_model import LogisticRegression
from sklearn import cross_validation

# Initialize our algorithm
alg = LogisticRegression(random_state=1)
# Compute the accuracy score for all the cross validation folds.  (much simpler than what we did before!)
scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)
# Take the mean of the scores (because we have one for each fold)
print "Accuracy", scores.mean()

Accuracy 0.787878787879


The code below imports the Titanic test data provided by Kaggle and cleans it up in the same way we cleaned up the train data. This cleanup could be factored into a single function specific to the fields of the Titanic dataset. We then train the model on the train data and, using the same predictors, run the model on the test data. Finally, we write the predictions to a .csv file and submit it to Kaggle.
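A minimal sketch of what such a cleaning function might look like is shown here. The name `clean_titanic` and its `age_fill` and `fare_fill` parameters are hypothetical, and the toy frame stands in for the real CSV files. It uses `Series.map` instead of repeated `.loc` assignments, which is a swapped-in technique and not what the cells in this notebook do.

```python
import pandas as pd


def clean_titanic(df, age_fill, fare_fill):
    """Return a cleaned copy of a Titanic frame (hypothetical helper)."""
    out = df.copy()
    # Fill missing numeric values with the statistics passed in by the caller.
    out["Age"] = out["Age"].fillna(age_fill)
    out["Fare"] = out["Fare"].fillna(fare_fill)
    # Encode the categorical columns with the same codes used in this notebook.
    out["Sex"] = out["Sex"].map({"male": 0, "female": 1})
    out["Embarked"] = out["Embarked"].fillna("S").map({"S": 0, "C": 1, "Q": 2})
    return out


# Toy data standing in for train.csv / test.csv.
sample = pd.DataFrame({
    "Age": [22.0, None, 30.0],
    "Fare": [7.25, 71.28, None],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", None, "Q"],
})
cleaned = clean_titanic(
    sample,
    age_fill=sample["Age"].median(),
    fare_fill=sample["Fare"].median(),
)
print(cleaned)
```

Passing the fill values in as arguments lets the caller decide whether they come from the train data or the test data.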

In [107]:
titanic_test = pandas.read_csv("test.csv")

titanic_test["Age"] = titanic_test["Age"].fillna(titanic["Age"].median())
titanic_test.loc[titanic_test["Sex"] == "male", "Sex"] = 0
titanic_test.loc[titanic_test["Sex"] == "female", "Sex"] = 1

titanic_test["Embarked"] = titanic_test["Embarked"].fillna("S")
titanic_test.loc[titanic_test["Embarked"] == "S", "Embarked"] = 0
titanic_test.loc[titanic_test["Embarked"] == "C", "Embarked"] = 1
titanic_test.loc[titanic_test["Embarked"] == "Q", "Embarked"] = 2

titanic_test["Fare"] = titanic_test["Fare"].fillna(titanic_test["Fare"].median())

# Initialize the algorithm class
alg = LogisticRegression(random_state=1)

# Train the algorithm using all the training data
alg.fit(titanic[predictors], titanic["Survived"])

# Make predictions using the test set.
predictions = alg.predict(titanic_test[predictors])

# Create a new dataframe with only the columns Kaggle wants from the dataset.
submission = pandas.DataFrame({
        "PassengerId": titanic_test["PassengerId"],
        "Survived": predictions
    })

submission.to_csv("kaggle.csv", index=False)

Kaggle determined the accuracy of this model to be 0.75120.

I remembered that Ryan Louie, in my Software Design final project, used a random forest classifier to predict whether a drawing of a circuit component was a resistor or a capacitor. So I tried a random forest classifier on this data. n_estimators is the number of decision trees in the forest. Cross-validated on the train data, this classifier shows a slight increase in accuracy over the logistic regression (0.7957 versus 0.7879).

In [108]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=100)

scores = cross_validation.cross_val_score(rfc, titanic[predictors], titanic["Survived"], cv=3)
# Take the mean of the scores (because we have one for each fold)
print "Accuracy", scores.mean()

Accuracy 0.795735129068
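The cell above does not fix a seed for the forest, so the score can vary between runs. The sketch below shows one way to make it repeatable, assuming a newer scikit-learn where the cross-validation helpers live in `sklearn.model_selection` (the `sklearn.cross_validation` module used in this notebook was removed in later releases). The data is synthetic, generated with `make_classification`, so the printed score says nothing about the Titanic results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for titanic[predictors] and titanic["Survived"].
X, y = make_classification(n_samples=200, n_features=7, random_state=1)

# n_estimators is the number of trees; random_state makes runs repeatable.
rfc_seeded = RandomForestClassifier(n_estimators=100, random_state=1)

scores_seeded = cross_val_score(rfc_seeded, X, y, cv=3)
print("Accuracy", scores_seeded.mean())
```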


The code below trains the rfc on the train data, runs it on the test data, and outputs a .csv file to submit to Kaggle that uses the Random Forest Classifier instead of a logistic regression.

In [109]:
rfc.fit(titanic[predictors], titanic["Survived"])

predictions = rfc.predict(titanic_test[predictors])

submission = pandas.DataFrame({
        "PassengerId": titanic_test["PassengerId"],
        "Survived": predictions
    })

submission.to_csv("kaggle_annot1.csv", index=False)

Unfortunately, the Kaggle score of this submission was 0.74641, which did not beat the logistic regression's score of 0.75120.

I then decided to create my own column in the Titanic dataframe. So far we have been using Fare and Pclass as proxies for wealth, and my previous findings indicated that poorer passengers were less likely to survive. I decided to combine Fare and Pclass: I expressed each passenger's Fare as a fraction of the maximum Fare paid by any passenger, and I remapped the Pclass values so that first class carries more weight than third class. I see a slight improvement in cross-validated accuracy on the train data.

In [110]:
titanic.loc[titanic["Pclass"] == 1, "Pclass"] = 2
titanic.loc[titanic["Pclass"] == 2, "Pclass"] = 1
titanic.loc[titanic["Pclass"] == 3, "Pclass"] = 0

titanic['estimWealth'] = titanic.Fare.div(titanic.Fare.max()) + titanic.Pclass

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "estimWealth"]

alg = LogisticRegression(random_state=1)

scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)

print "Accuracy", scores.mean()

Accuracy 0.801346801347
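One thing to watch in the Pclass remapping above: the three `.loc` assignments run in order, so the rows set to 2 by the first line are matched again by the second line, and first and second class both end up as 1. The sketch below reproduces that effect on a toy Series and shows a single-pass alternative with `Series.map`. The variable names are made up for illustration, and the scores reported in this notebook came from the original cell, so they would likely differ with this change.

```python
import pandas as pd

pclass = pd.Series([1, 2, 3, 1])

# Sequential assignments, as in the cell above: 1 -> 2, then every 2 -> 1.
sequential = pclass.copy()
sequential[sequential == 1] = 2
sequential[sequential == 2] = 1
sequential[sequential == 3] = 0

# Single-pass remapping: each original value is looked up exactly once.
single_pass = pclass.map({1: 2, 2: 1, 3: 0})

print(sequential.tolist())
print(single_pass.tolist())
```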


I then create the same field in the test dataframe so that I can use it as a predictor for the logistic regression model. I am not sure this is valid: here the test Fare is scaled by the maximum Fare in the test set, whereas in the DataQuest mission we filled the missing ages in the test data with the median age from the train data, not the median age from the test data.
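For comparison, a common convention is to derive preprocessing statistics from the train data only and reuse them on the test data. The sketch below applies that idea to the Fare scaling, using toy frames and a hypothetical helper `add_estim_wealth`. It is an alternative to consider, not the method used for the submission in this notebook.

```python
import pandas as pd


def add_estim_wealth(df, fare_max):
    """Return a copy with estimWealth = Fare / fare_max + Pclass (hypothetical helper)."""
    out = df.copy()
    out["estimWealth"] = out["Fare"] / fare_max + out["Pclass"]
    return out


# Toy frames standing in for titanic and titanic_test (Pclass already remapped).
train = pd.DataFrame({"Fare": [10.0, 50.0, 200.0], "Pclass": [0, 1, 2]})
test = pd.DataFrame({"Fare": [20.0, 100.0], "Pclass": [0, 2]})

# The scaling constant comes from the train data and is reused for both frames.
train_fare_max = train["Fare"].max()
train = add_estim_wealth(train, train_fare_max)
test = add_estim_wealth(test, train_fare_max)

print(test)
```

With a shared constant, the same Fare produces the same estimWealth contribution in both frames, which is the property the train-median age fill also has.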

In [111]:
titanic_test.loc[titanic_test["Pclass"] == 1, "Pclass"] = 2
titanic_test.loc[titanic_test["Pclass"] == 2, "Pclass"] = 1
titanic_test.loc[titanic_test["Pclass"] == 3, "Pclass"] = 0

titanic_test['estimWealth'] = titanic_test.Fare.div(titanic_test.Fare.max()) + titanic_test.Pclass

alg = LogisticRegression(random_state=1)

alg.fit(titanic[predictors], titanic["Survived"])

predictions = alg.predict(titanic_test[predictors])

submission = pandas.DataFrame({
        "PassengerId": titanic_test["PassengerId"],
        "Survived": predictions
    })

submission.to_csv("kaggle_annot2.csv", index=False)

This method did improve the accuracy of the model, with a Kaggle score of 0.77512, which is 2.392 percentage points better than my previous logistic regression score of 0.75120.