# Kaggle: Titanic Survival

### The steps to a Kaggle submission
1. Find a problem.
2. Download the data.
3. Familiarize yourself with the data.
4. Format your data.
5. Run a machine learning algorithm on your data.
6. Save your predictions in the correct format.
7. Submit your predictions to Kaggle.
8. Repeat.

## Step 1: Find a problem

Go to https://www.kaggle.com and pick a competition.  
We're going to do the Titanic problem: https://www.kaggle.com/c/titanic

## Step 2: Download the data  
Download these three files from the competition's Data page:

1. train.csv
2. test.csv
3. gendermodel.csv (an example submission)

## Step 3: Familiarize yourself with the data

In [4]:
import pandas as pd
datafile = '~/databases/titanic/train.csv'
titanic = pd.read_csv(datafile)
titanic[:10]


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


I usually save a shortened copy of the csv file so that I have a quick reference. This dataset is small enough that it isn't strictly necessary, but we'll do it anyway: being able to look over the raw data at a glance is very important.

In [5]:
titanic[:100].to_csv('mini_titanic_train.csv', index=False)
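
A quick way to get familiar with a table is to count the missing values in each column and to look at the column types. The sketch below does that on the ten rows displayed above, pasted inline so that it runs without the downloaded file. The names `sample_csv` and `sample` are illustrative and not part of the original notebook.

```python
import io
import pandas as pd

# The first ten rows of train.csv as displayed above (one long name is cut with "...")
sample_csv = '''PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
'''
sample = pd.read_csv(io.StringIO(sample_csv))

# Count missing values per column: these are the gaps to handle before modelling
missing = sample.isna().sum()
print(missing[missing > 0])

# Column types show which fields are text and would need converting to numbers
print(sample.dtypes)
```

On these ten rows only `Age` and `Cabin` have gaps. Running the same two steps on the full `titanic` frame gives the complete picture.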

## Step 4: Format your data

Make sure to do all of your formatting in a structured way, preferably in a function that takes a dataset (or datafile) as an argument and returns the formatted data.

This is important because you will have to format your test.csv in EXACTLY the same way.

In [6]:
def format_titanic_data(datafile, predictors):
    '''import and format a csv'''
    titanic = pd.read_csv(datafile)

    # convert data to usable values
    # fill nan Age values with the median age
    titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())
    # replace Sex with numbers
    titanic["Sex"] = titanic["Sex"].map({"male": 0, "female": 1})
    # fill nan Embarked values with 'S', then replace the ports with numbers
    titanic["Embarked"] = titanic["Embarked"].fillna("S").map({"S": 0, "C": 1, "Q": 2})

    # fill any gaps left in the predictor columns with that column's median
    for field in predictors:
        titanic[field] = titanic[field].fillna(titanic[field].median())

    return titanic

In [8]:
#columns used to predict our target
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
#our formatted training set
titanic = format_titanic_data('~/databases/titanic/train.csv',predictors)
#our formatted testing set
test = format_titanic_data('~/databases/titanic/test.csv',predictors)

In [9]:
titanic[:10][['Survived']+predictors]

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,0,22,1,0,7.25,0
1,1,1,1,38,1,0,71.2833,1
2,1,3,1,26,0,0,7.925,0
3,1,1,1,35,1,0,53.1,0
4,0,3,0,35,0,0,8.05,0
5,0,3,0,28,0,0,8.4583,2
6,0,1,0,54,0,0,51.8625,0
7,0,3,0,2,3,1,21.075,0
8,1,3,1,27,0,2,11.1333,0
9,1,2,1,14,1,0,30.0708,1


In [10]:
test[:10][predictors]

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,3,0,34.5,0,0,7.8292,2
1,3,1,47.0,1,0,7.0,0
2,2,0,62.0,0,0,9.6875,2
3,3,0,27.0,0,0,8.6625,0
4,3,1,22.0,1,1,12.2875,0
5,3,0,14.0,0,0,9.225,0
6,3,1,30.0,0,0,7.6292,2
7,2,0,26.0,1,1,29.0,0
8,3,1,18.0,0,0,7.2292,1
9,3,0,21.0,2,0,24.15,0
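
Because both files must end up in the same shape, a small sanity check after formatting can catch a missed step. The following is a sketch only: `check_formatted` is a hypothetical helper, and the two tiny frames stand in for the formatted `titanic` and `test` sets.

```python
import pandas as pd

def check_formatted(frame, predictors):
    """Return a list of problems found in the predictor columns (empty if none)."""
    problems = []
    for field in predictors:
        if field not in frame.columns:
            problems.append(f"{field}: column missing")
            continue
        # Anything that is missing or cannot be read as a number becomes NaN here
        values = pd.to_numeric(frame[field], errors="coerce")
        if values.isna().any():
            problems.append(f"{field}: missing or non-numeric values")
    return problems

demo_predictors = ["Pclass", "Sex", "Age"]
# Stand-in for a correctly formatted set
good = pd.DataFrame({"Pclass": [3, 1], "Sex": [0, 1], "Age": [22.0, 38.0]})
# Stand-in for a set where Sex was never encoded and one Age is still missing
bad = pd.DataFrame({"Pclass": [3, 1], "Sex": ["male", "female"], "Age": [22.0, None]})

print(check_formatted(good, demo_predictors))
print(check_formatted(bad, demo_predictors))
```

Calling `check_formatted(titanic, predictors)` and `check_formatted(test, predictors)` on the real frames should return an empty list for both.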


## Step 5: Learn from the data. Make a Prediction

### KFolds

In [42]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import KFold

def regression1(titanic, test, predictors):

    # initialize our algorithm class
    alg = LinearRegression()

    # generate cross-validation folds for the training set
    '''
    Cross validation is a simple way to avoid overfitting. To cross validate, you split your data into some number of parts (or "folds"). Let's use 3 as an example. You then do this:
        Combine the first two parts, train a model, make predictions on the third.
        Combine the first and third parts, train a model, make predictions on the second.
        Combine the second and third parts, train a model, make predictions on the first.
    This way, we generate predictions for the whole dataset without ever evaluating accuracy on the same data we train our model on.
    '''
    kf = KFold(n_splits=3)

    predictions = []
    # the fold indices get their own names so they don't shadow the `test` argument
    for train_idx, test_idx in kf.split(titanic):
        # The predictors used to train the algorithm: only the rows in the train folds
        train_predictors = titanic[predictors].iloc[train_idx, :]
        # The target we're using to train the algorithm.
        train_target = titanic["Survived"].iloc[train_idx]
        # Training the algorithm using the predictors and target.
        alg.fit(train_predictors, train_target)
        # We can now make predictions on the held-out fold
        test_predictions = alg.predict(titanic[predictors].iloc[test_idx, :])
        predictions.append(test_predictions)
    # We concatenate them on axis 0, as they only have one axis.
    predictions = np.concatenate(predictions, axis=0)

    # Map predictions to outcomes (only possible outcomes are 1 and 0)
    print(predictions[:10].round(3))
    predictions[predictions > .5] = 1
    predictions[predictions <= .5] = 0
    print(predictions[:10])
    # accuracy = fraction of rows where the prediction matches the true outcome
    accuracy = (predictions == titanic["Survived"].values).mean()
    print(accuracy)

regression1(titanic, test, predictors)

[ 0.09   0.961  0.593  0.931  0.053  0.17   0.37   0.103  0.522  0.874]
[ 0.  1.  1.  1.  0.  0.  0.  0.  1.  1.]

Be careful with the accuracy line. Summing `predictions[predictions == titanic.Survived]` adds up the predicted values on the correctly classified rows, so it counts only the correctly predicted survivors. Divided by the number of rows, that gives 0.261503928171 on this data, which is the share of true positives rather than the accuracy. Taking the mean of the element-wise comparison, as above, gives the fraction of all passengers classified correctly.
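
To make the fold bookkeeping concrete, here is a small pure-Python sketch of how a dataset is cut into contiguous folds and how each fold takes a turn as the held-out part. `contiguous_folds` is a hypothetical helper written for illustration, not a scikit-learn function, and it assumes the rows are not shuffled.

```python
def contiguous_folds(n_rows, n_folds):
    """Yield (train_indices, held_out_indices) for each of n_folds contiguous folds."""
    # Spread the rows as evenly as possible; the first folds absorb any remainder
    sizes = [n_rows // n_folds + (1 if i < n_rows % n_folds else 0)
             for i in range(n_folds)]
    start = 0
    for size in sizes:
        held_out = list(range(start, start + size))
        train = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train, held_out
        start += size

# With 6 rows and 3 folds, every row is held out exactly once
for train, held_out in contiguous_folds(6, 3):
    print("train on", train, "-> predict", held_out)
```

Each row appears in exactly one held-out fold, which is why concatenating the per-fold predictions yields one prediction per row of the training set.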


In [34]:
from sklearn.model_selection import cross_val_predict, cross_val_score

def regression2(titanic, test, predictors):
    alg = LinearRegression()
    predictions = cross_val_predict(alg, titanic[predictors], titanic["Survived"], cv=3)
    # for a regressor, cross_val_score reports R^2 by default, not classification accuracy
    scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)

    print(predictions[:10].round(3))
    predictions[predictions > .5] = 1
    predictions[predictions <= .5] = 0
    print(predictions[:10])
    print(scores.mean())

regression2(titanic, test, predictors)

[ 0.09   0.961  0.593  0.931  0.053  0.17   0.37   0.103  0.522  0.874]
[ 0.  1.  1.  1.  0.  0.  0.  0.  1.  1.]
0.374682056691
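
The score printed here is not an accuracy. For a regressor such as `LinearRegression`, `cross_val_score` reports the R² score by default rather than the share of correct classifications. The sketch below computes both measures by hand for the ten predictions and the ten true `Survived` values shown above. It is an illustration on ten rows, not a re-run of the notebook.

```python
# First ten cross-validated predictions and true Survived values, copied from the outputs above
raw_predictions = [0.09, 0.961, 0.593, 0.931, 0.053, 0.17, 0.37, 0.103, 0.522, 0.874]
survived = [0, 1, 1, 1, 0, 0, 0, 0, 1, 1]

# Accuracy: threshold at 0.5, then count the rows where class and outcome match
classes = [1 if p > 0.5 else 0 for p in raw_predictions]
accuracy = sum(c == y for c, y in zip(classes, survived)) / len(survived)

# R^2: 1 - (residual sum of squares / total sum of squares), on the raw predictions
mean_y = sum(survived) / len(survived)
ss_res = sum((y - p) ** 2 for p, y in zip(raw_predictions, survived))
ss_tot = sum((y - mean_y) ** 2 for y in survived)
r_squared = 1 - ss_res / ss_tot

print(accuracy)             # 1.0
print(round(r_squared, 3))  # 0.759
```

On these ten rows every thresholded prediction is correct, yet R² is only about 0.76, because R² penalizes the distance between each raw prediction and the 0/1 outcome.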


In [16]:

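from sklearn.model_selection import cross_val_predict, cross_val_score

def log_regression(titanic, test, predictors):
    # older scikit-learn releases used the liblinear solver by default; naming it keeps that behaviour
    alg = LogisticRegression(solver="liblinear", random_state=1)
    predictions = cross_val_predict(alg, titanic[predictors], titanic["Survived"], cv=3)
    # for a classifier, cross_val_score reports accuracy by default
    scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)

    print(predictions[:10].round(3))
    # a classifier already predicts 0 or 1, so this thresholding changes nothing
    predictions[predictions > .5] = 1
    predictions[predictions <= .5] = 0
    print(predictions[:10])
    print(scores.mean())

log_regression(titanic, test, predictors)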

[0 1 1 1 0 0 0 0 1 1]
[0 1 1 1 0 0 0 0 1 1]
0.787878787879


## Step 6: Make our prediction and save it to .csv

In [14]:
def test_titanic_predictions(titanic, test, predictors):
    # train on the full training set, then predict the test set
    alg = LogisticRegression(solver="liblinear", random_state=1)
    alg.fit(titanic[predictors], titanic["Survived"])
    predictions = alg.predict(test[predictors])

    # save it, with PassengerId first as in the example submission
    submission = pd.DataFrame({"PassengerId": test["PassengerId"], "Survived": predictions})
    submission.to_csv('~/databases/titanic/submission1_tit_logregression.csv', index=False)
    return submission

test_titanic_predictions(titanic, test, predictors)[:10]

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,0
4,896,1
5,897,0
6,898,1
7,899,0
8,900,1
9,901,0
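
Before uploading, it can be worth checking that the file has the shape Kaggle expects: a `PassengerId` column and a `Survived` column holding only 0 or 1. The sketch below is one possible check. `check_submission` is a hypothetical helper, and the inline frame repeats the ten rows shown above rather than the full prediction.

```python
import pandas as pd

def check_submission(submission, expected_rows=None):
    """Raise ValueError if the frame does not look like a Titanic submission."""
    if list(submission.columns) != ["PassengerId", "Survived"]:
        raise ValueError(f"unexpected columns: {list(submission.columns)}")
    if not submission["Survived"].isin([0, 1]).all():
        raise ValueError("Survived must contain only 0 or 1")
    if submission["PassengerId"].duplicated().any():
        raise ValueError("PassengerId values must be unique")
    if expected_rows is not None and len(submission) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, found {len(submission)}")
    return True

# The first ten rows of the submission, as displayed above
preview = pd.DataFrame({
    "PassengerId": [892, 893, 894, 895, 896, 897, 898, 899, 900, 901],
    "Survived": [0, 0, 0, 0, 1, 0, 1, 0, 1, 0],
})
print(check_submission(preview, expected_rows=10))
```

For the real file, the row count to expect is the length of the formatted test set, for example `check_submission(submission, expected_rows=len(test))`.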


## Step 7: Submit our prediction

1. Click "Make a submission" on the competition page.
2. Upload your submission file.

## Step 8: Repeat... but make it better
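
One way to iterate is to score several candidate predictor lists with the same cross-validation and keep the best one. The sketch below assumes a recent scikit-learn, where `cross_val_score` lives in `sklearn.model_selection`. `compare_predictor_sets` is a hypothetical helper, and `demo` is synthetic stand-in data, not the Titanic set.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def compare_predictor_sets(data, target, candidate_sets, cv=3):
    """Return {name: mean cross-validated accuracy} for each candidate predictor list."""
    results = {}
    for name, columns in candidate_sets.items():
        alg = LogisticRegression(solver="liblinear", random_state=1)
        scores = cross_val_score(alg, data[columns], data[target], cv=cv)
        results[name] = scores.mean()
    return results

# Synthetic stand-in for a formatted training set (NOT real Titanic data)
rng = np.random.default_rng(0)
n = 90
demo = pd.DataFrame({
    "Pclass": rng.integers(1, 4, n),
    "Sex": rng.integers(0, 2, n),
    "Fare": rng.uniform(5, 80, n),
})
# The made-up outcome follows Sex, with roughly 20% of the rows flipped at random
demo["Survived"] = ((demo["Sex"] == 1) ^ (rng.random(n) < 0.2)).astype(int)

results = compare_predictor_sets(demo, "Survived", {
    "class and fare": ["Pclass", "Fare"],
    "class, fare and sex": ["Pclass", "Fare", "Sex"],
})
for name, score in results.items():
    print(name, round(score, 3))
```

With the real data, the call would take the formatted `titanic` frame and whichever predictor lists you want to compare.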