# Titanic survivor data 

http://en.wikipedia.org/wiki/RMS_Titanic
http://www.kaggle.com/c/titanic-gettingStarted    
http://www.kaggle.com/c/titanic-gettingStarted/data
    
## 1. Simple heuristic

Part of a data scientist's job is to use intuition and insight to
write algorithms and heuristics. A data scientist also builds mathematical models
that make predictions from attributes of the data under examination.
In this exercise, you will use your knowledge and intuition about the Titanic
and its passengers' attributes to predict whether each passenger survived
or perished. You can read more about the Titanic and this dataset at the links listed above.

For this exercise, you need to write a simple heuristic that uses
the passengers' gender to predict whether each passenger survived the Titanic disaster.

Your prediction should be at least 78% accurate.
        
Here's a simple heuristic to start with:
1) If the passenger is female, your heuristic should assume that the passenger survived.
2) If the passenger is male, your heuristic should assume that the passenger did not survive.
    
You can access the gender of a passenger via passenger['Sex'].
If the passenger is male, passenger['Sex'] will return a string "male".
If the passenger is female, passenger['Sex'] will return a string "female".

Write your prediction back into the "predictions" dictionary. The
key of the dictionary should be the passenger's id (which can be accessed
via passenger["PassengerId"]) and the associated value should be 1 if the
passenger survived or 0 otherwise.

For example, if a passenger is predicted to have survived:
passenger_id = passenger['PassengerId']
predictions[passenger_id] = 1

And if a passenger is predicted to have perished in the disaster:
passenger_id = passenger['PassengerId']
predictions[passenger_id] = 0
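
As a minimal sketch of the rule above, the snippet below applies it to a few hand-made records. The records and the `predict_by_sex` helper are hypothetical and used only for illustration; plain dictionaries stand in for the rows that the dataset would provide.

```python
# Hypothetical passenger records; plain dicts stand in for rows of the dataset.
passengers = [
    {'PassengerId': 1, 'Sex': 'male'},
    {'PassengerId': 2, 'Sex': 'female'},
    {'PassengerId': 3, 'Sex': 'female'},
]


def predict_by_sex(passenger):
    """Return 1 (survived) for female passengers, 0 (perished) otherwise."""
    return 1 if passenger['Sex'] == 'female' else 0


# Map each passenger id to its predicted outcome.
predictions = {}
for passenger in passengers:
    predictions[passenger['PassengerId']] = predict_by_sex(passenger)

print(predictions)
```

Keeping the rule in its own small function makes it easy to swap in a different heuristic later without touching the loop.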

In [2]:
from __future__ import division

import numpy
import pandas
import statsmodels.api as sm
import sys

def simple_heuristic(file_path):
    
    predictions = {}
    df = pandas.read_csv(file_path)
    # print(df.head())  # uncomment to inspect the first few rows
    for passenger_index, passenger in df.iterrows():
        # iterrows() yields (index, row) tuples rather than just the rows
        passenger_id = passenger['PassengerId']
        if passenger['Sex'] == 'male':
            predictions[passenger_id] = 0
        else:
            predictions[passenger_id] = 1
    return predictions

def check_accuracy(file_name):
    total_count = 0
    correct_count = 0
    df = pandas.read_csv(file_name)
    predictions = simple_heuristic(file_name)
    for row_index, row in df.iterrows():
        total_count += 1
        if predictions[row['PassengerId']] == row['Survived']:
            correct_count += 1
    return correct_count/total_count
    
predictions = simple_heuristic("data/titanic_data.csv")

if __name__ == "__main__":
    simple_heuristic_success_rate = check_accuracy('data/titanic_data.csv')
    # str.format keeps this line valid on both Python 2 and Python 3
    print("The success rate for the simple heuristic is: {} %".format(simple_heuristic_success_rate * 100))

The success rate for the simple heuristic is: 78.6756453423 %


## 2. More complex heuristics

Predict that the passenger survived if either of the following holds:
1. The passenger is female, or
2. The passenger's socioeconomic status is high AND the passenger is under 18.

Otherwise, your algorithm should predict that the passenger perished in the disaster.

In code terms, the condition is:
female or (high status and under 18)

You can access the gender of a passenger via passenger['Sex'].
If the passenger is male, passenger['Sex'] will return a string "male".
If the passenger is female, passenger['Sex'] will return a string "female".

You can access the socioeconomic status of a passenger via passenger['Pclass']:
High socioeconomic status -- passenger['Pclass'] is 1
Medium socioeconomic status -- passenger['Pclass'] is 2
Low socioeconomic status -- passenger['Pclass'] is 3

You can access the age of a passenger via passenger['Age'].

Write your prediction back into the "predictions" dictionary. The
key of the dictionary should be the Passenger's id (which can be accessed
via passenger["PassengerId"]) and the associated value should be 1 if the
passenger survived or 0 otherwise. 

For example, if a passenger is predicted to have survived:
passenger_id = passenger['PassengerId']
predictions[passenger_id] = 1

And if a passenger is predicted to have perished in the disaster:
passenger_id = passenger['PassengerId']
predictions[passenger_id] = 0
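
The combined rule can be sketched as a single predicate. This is an illustrative sketch, not the exercise's reference solution: `predict_complex` and the sample records are hypothetical, and it assumes that a missing age shows up as NaN, as pandas does for empty numeric cells. Because any ordering comparison with NaN is false, a passenger with no recorded age never satisfies the `Age < 18` test.

```python
def predict_complex(passenger):
    """Return 1 if female, or if first class (Pclass 1) and under 18; else 0."""
    is_female = passenger['Sex'] == 'female'
    is_high_status = passenger['Pclass'] == 1
    is_minor = passenger['Age'] < 18  # False when Age is NaN
    return 1 if is_female or (is_high_status and is_minor) else 0


# Hypothetical records, for illustration only.
samples = [
    {'PassengerId': 1, 'Sex': 'female', 'Pclass': 3, 'Age': 30.0},
    {'PassengerId': 2, 'Sex': 'male', 'Pclass': 1, 'Age': 11.0},
    {'PassengerId': 3, 'Sex': 'male', 'Pclass': 2, 'Age': 11.0},
    {'PassengerId': 4, 'Sex': 'male', 'Pclass': 1, 'Age': float('nan')},
]

predictions = {p['PassengerId']: predict_complex(p) for p in samples}
print(predictions)
```

Naming each part of the condition makes it easier to check the code against the written rule.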


In [6]:
def complex_heuristic(file_path):
    
    predictions = {}
    df = pandas.read_csv(file_path)
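    # print(df.head())  # uncomment to inspect the first few rows
    for passenger_index, passenger in df.iterrows():
        # iterrows() yields (index, row) tuples rather than just the rows
        passenger_id = passenger['PassengerId']
        # NOTE: the rule stated above treats Pclass == 1 as high status, but this
        # condition checks Pclass == 2. A missing Age (NaN) never satisfies Age < 18.
        if passenger['Sex'] == 'female' or (passenger['Pclass'] == 2 and passenger['Age'] < 18):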
            predictions[passenger_id] = 1
        else:
            predictions[passenger_id] = 0
    return predictions
    
predictions = complex_heuristic("data/titanic_data.csv")

def check_accuracy(file_name):
    total_count = 0
    correct_count = 0
    df = pandas.read_csv(file_name)
    predictions = complex_heuristic(file_name)
    for row_index, row in df.iterrows():
        total_count += 1
        if predictions[row['PassengerId']] == row['Survived']:
            correct_count += 1
    return correct_count/total_count


if __name__ == "__main__":
    complex_heuristic_success_rate = check_accuracy('data/titanic_data.csv')
    # str.format keeps this line valid on both Python 2 and Python 3
    print("The success rate for the complex heuristic is: {} %".format(complex_heuristic_success_rate * 100))

The success rate for the complex heuristic is: 79.4612794613 %
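
Both cells above define their own `check_accuracy`, differing only in which heuristic they call. One possible way to avoid the duplication, sketched here under assumptions, is to pass the heuristic in as an argument. The `accuracy` helper, the two rule functions, and the toy records are hypothetical and not part of the exercise; the numbers they produce say nothing about the real dataset.

```python
def accuracy(predict, passengers):
    """Fraction of passengers whose prediction matches the 'Survived' label."""
    correct = sum(1 for p in passengers if predict(p) == p['Survived'])
    return correct / float(len(passengers))


def by_sex(p):
    """Simple rule: female passengers survive."""
    return 1 if p['Sex'] == 'female' else 0


def by_sex_class_age(p):
    """Complex rule: female, or first class and under 18."""
    return 1 if p['Sex'] == 'female' or (p['Pclass'] == 1 and p['Age'] < 18) else 0


# Hypothetical labeled records, for illustration only.
toy = [
    {'Sex': 'female', 'Pclass': 3, 'Age': 26.0, 'Survived': 1},
    {'Sex': 'male', 'Pclass': 1, 'Age': 11.0, 'Survived': 1},
    {'Sex': 'male', 'Pclass': 3, 'Age': 22.0, 'Survived': 0},
    {'Sex': 'female', 'Pclass': 3, 'Age': 18.0, 'Survived': 0},
]

for name, rule in [('simple', by_sex), ('complex', by_sex_class_age)]:
    print(name, accuracy(rule, toy))
```

With the heuristic passed as a parameter, any new rule can be scored by the same function.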


## 3. Custom heuristic