[View in Colaboratory](https://colab.research.google.com/github/chug2k/Titanic-Workshop/blob/master/CoderSchool_Workshop.ipynb)

# Welcome to CoderSchool's Workshop.

Today we'll be predicting who lived or died on the RMS Titanic.

![alt text](https://upload.wikimedia.org/wikipedia/commons/thumb/d/db/Titanic-Cobh-Harbour-1912.JPG/330px-Titanic-Cobh-Harbour-1912.JPG)

If these steps are new to you, don't worry; the instructor will help.

## Initialization Code

Here we'll just do a few simple things to download our data and load it. Don't worry about this step too much; just click "run" and you'll be done with it.

In [0]:
import numpy as np
import pandas as pd
import seaborn as sns
%matplotlib inline
!wget https://raw.githubusercontent.com/chug2k/Titanic-Workshop/master/titanic_visualizations.py
!wget https://raw.githubusercontent.com/chug2k/Titanic-Workshop/master/train.csv
from titanic_visualizations import survival_stats

# Load the dataset
in_file = 'train.csv'
full_data = pd.read_csv(in_file)

# Print the first few entries of the RMS Titanic data
full_data.head()

## Data Cleanup

One common thing you'll have to do is _data engineering_. Here, we'll do a simple example: we'll remove the "Survived" column from our dataset and save it under "outcomes". We're doing this because that's the column we want to predict.

Below we've also defined a utility function that scores a set of predictions by its accuracy.

In [0]:
outcomes = full_data['Survived']
data = full_data.drop('Survived', axis = 1)
data.head()

def accuracy_score(truth, pred, name):
    """ Returns accuracy score for input truth and predictions. """

    if len(truth) == len(pred):
        # Calculate and return the accuracy as a percentage.
        # (truth == pred).mean() is the fraction of matching entries.
        # In the format string, ":" introduces the format specification
        # and ".2f" means fixed-point with 2 decimal places.
        return "{} Predictions have an accuracy of {:.2f}%.".format(name, (truth == pred).mean()*100)
    else:
        return "Number of predictions does not match number of outcomes!"
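
To see what the accuracy calculation is doing, here is a minimal sketch on made-up labels. The names `toy_truth` and `toy_pred` are illustrative only and the values do not come from train.csv.

```python
import pandas as pd

# Made-up labels for illustration only (not taken from train.csv)
toy_truth = pd.Series([1, 0, 0, 1, 0])
toy_pred = pd.Series([1, 0, 1, 1, 0])

# Elementwise comparison gives a boolean Series: True where the guess was right
matches = toy_truth == toy_pred

# The mean of a boolean Series is the fraction of True values
toy_accuracy = matches.mean() * 100
print("{:.2f}".format(toy_accuracy))  # prints 80.00 (4 of 5 guesses match)
```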



In [0]:


def predict_everyone_survives():
  return pd.Series(np.ones(outcomes.size, dtype = int))

def predict_everyone_dies():
  return pd.Series(np.zeros(outcomes.size, dtype = int))
  

predictions = predict_everyone_survives()

print(accuracy_score(outcomes, predictions, "Everyone Survived"))

predictions = predict_everyone_dies()

print(accuracy_score(outcomes, predictions, "Everyone Died"))
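
These two baselines are worth a second look: predicting the same answer for everyone scores exactly the share of passengers who had that outcome. The sketch below shows this on a made-up outcome column (`toy_outcomes` is a hypothetical name, not the real data).

```python
import numpy as np
import pandas as pd

# Made-up outcomes for illustration: 3 died (0), 2 survived (1)
toy_outcomes = pd.Series([0, 0, 0, 1, 1])

# Share of each outcome in the data
share = toy_outcomes.value_counts(normalize=True)

# "Everyone dies" baseline, built the same way as predict_everyone_dies()
all_die = pd.Series(np.zeros(toy_outcomes.size, dtype=int))
acc_all_die = (toy_outcomes == all_die).mean()

# The baseline accuracy equals the share of passengers who died
print(share.loc[0], acc_all_die)
```

Any model we build later should beat the better of these two baselines to be worth keeping.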




## Exploring Data

We have a utility plotting function called survival_stats. Let's use it to explore a few different features.

In [0]:
survival_stats(data, outcomes, 'Sex')
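
If you want the numbers behind a chart like this, one option is to group the outcomes by a feature and take the mean. This is a sketch on made-up rows; `toy_data` and `toy_outcomes` are hypothetical stand-ins for `data` and `outcomes`, and this is not necessarily how survival_stats works internally.

```python
import pandas as pd

# Made-up rows for illustration only
toy_data = pd.DataFrame({"Sex": ["female", "male", "male", "female", "male", "female"]})
toy_outcomes = pd.Series([1, 0, 0, 1, 1, 0], name="Survived")

# Group the 0/1 outcomes by sex; the mean of each group is its survival rate
rate_by_sex = toy_outcomes.groupby(toy_data["Sex"]).mean()
print(rate_by_sex)  # female: 2 of 3 survived, male: 1 of 3 survived
```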


In [0]:
def predict_men_died():
    """ Model with one feature: 
            - Predict a passenger survived if they are female. """
    predictions = []
    for index, passenger in data.iterrows():
        if passenger['Sex'] == 'female':
            predictions.append(1)
        else:
            predictions.append(0)

    # Return our predictions (after the loop has visited every passenger)
    return pd.Series(predictions)

# Make the predictions
predictions = predict_men_died()
print(accuracy_score(outcomes, predictions, "Men Died"))
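
The row-by-row loop is easy to read, but the same rule can also be written as a single column comparison. Here is a minimal sketch of that alternative on made-up rows; `predict_men_died_vectorized` and `toy_data` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Made-up rows for illustration only
toy_data = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

def predict_men_died_vectorized(frame):
    """Predict 1 (survived) for females and 0 (died) for everyone else."""
    # The comparison yields True/False per row; astype(int) turns that into 1/0
    return (frame["Sex"] == "female").astype(int)

toy_predictions = predict_men_died_vectorized(toy_data)
print(toy_predictions.tolist())  # prints [0, 1, 1, 0]
```

The loop version is kept in the exercises because it makes each condition explicit, which is easier to extend one rule at a time.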


## Simple Warmup Exercise: Write code to predict that all women died.


In [0]:
def predict_women_died():
    """ Model with one feature: 
            - Predict a passenger died if they are female.
            - Predict a passenger survived otherwise. """

    predictions = []
    for index, passenger in data.iterrows():
        # Remove the 'pass' statement below 
        # and write your prediction conditions here
        pass

    # Return our predictions
    return pd.Series(predictions)
  
predictions = predict_women_died()
print(accuracy_score(outcomes, predictions, "Women Died"))

In [0]:
survival_stats(data, outcomes, 'Age', ["Sex == 'male'"])

## Second Exercise

Looks like there's a strong correlation with age for men. Can you create a new function that returns a prediction that takes a man's age into account?

In summary:
* Women survived.
* Men younger than 10 survived.
* Men aged 10 or older died.
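
One detail to keep in mind when writing the age condition: the Age column in train.csv has missing values, which pandas loads as NaN. The sketch below uses made-up ages (`toy_ages` is a hypothetical name) to show how a comparison behaves on a missing value.

```python
import numpy as np
import pandas as pd

# Made-up ages for illustration; the last one is missing
toy_ages = pd.Series([4.0, 35.0, np.nan])

# Any comparison with NaN is False, so a passenger with no recorded age
# is never counted as "younger than 10"
is_young = toy_ages < 10
print(is_young.tolist())  # prints [True, False, False]

# Count how many ages are missing
n_missing = toy_ages.isna().sum()
```

With a rule like "male and younger than 10 survives", this means men with an unknown age fall through to the "died" branch.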

In [0]:
def predictions_2():
    """ Model with two features: 
            - Predict a passenger survived if they are female.
            - Predict a passenger survived if they are male and younger than 10. """

    predictions = []
    for index, passenger in data.iterrows():
        # Remove the 'pass' statement below 
        # and write your prediction conditions here
        pass

    # Return our predictions
    return pd.Series(predictions)
  
predictions = predictions_2()
print(accuracy_score(outcomes, predictions, "Men Over 10 Died"))

## Final Exercise - Getting to 80%

Exploring the data further, I'm going to add one more attribute: Pclass, the passenger's ticket class (1 is first class). Looks like the rich people were more likely to survive.

In [0]:
survival_stats(data, outcomes, 'Pclass')

def predictions_3():
    """ Model with three features: 
            - Predict a passenger survived if they are female.
            - Predict a passenger survived if they are male, younger than 10,
              and in Pclass 1 or 2. """

    predictions = []
    for index, passenger in data.iterrows():
        if passenger['Sex'] == 'female':
            predictions.append(1)
        elif passenger['Sex'] == 'male' and passenger['Age'] < 10 and passenger['Pclass'] < 3:
            predictions.append(1)
            predictions.append(1)
        else:
            predictions.append(0)


    # Return our predictions
    return pd.Series(predictions)
  
predictions = predictions_3()
print(accuracy_score(outcomes, predictions, "Sex, Age, and Pclass"))
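
When combining features, it can help to look at survival rates for each combination before writing the rule. The following is a sketch on made-up rows (the `toy` frame is hypothetical and its values are not from train.csv) that tabulates the rate by Sex and Pclass.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Pclass": [1, 3, 1, 3, 3, 1],
    "Survived": [1, 0, 1, 0, 0, 1],
})

# Mean of the 0/1 Survived column per (Sex, Pclass) group,
# reshaped so rows are Sex and columns are Pclass
rates = toy.groupby(["Sex", "Pclass"])["Survived"].mean().unstack()
print(rates)
```

Running the same two lines on the full dataset (with "Survived" still attached, as in full_data) shows which groups are worth a dedicated rule.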

## Reflection

I think our classifier works, as evidenced by the famous movie. Jack was Pclass 3, male, and over 10.

![alt text](https://hips.hearstapps.com/hmg-prod.s3.amazonaws.com/images/titanic-jack-1485877267.gif?crop=1xw:1xh;center,top&resize=480:*)

Rose was female and Pclass 1. She had a good chance to survive.

We could have skipped watching the movie and predicted the outcome from the beginning.