# Introduction
In this tutorial, we will explore how to tackle the Kaggle Titanic competition using Python and machine learning. When the Titanic sank, $1502$ of the $2224$ passengers and crew were killed. One of the main reasons for this high number of casualties was the lack of lifeboats on this self-proclaimed __"unsinkable"__ ship. We will learn how to apply machine learning techniques to predict a passenger's chance of surviving.

# Getting Data with Pandas
We start by loading the training and test sets into our Python environment. We will use the [training set](http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv) to build our model, and the [test set](http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv) to validate it. The first step is to load this data with the `read_csv()` function from the Pandas library.

In [1]:
# Import the Pandas library
import pandas as pd

# Load the train and test datasets to create two DataFrames
train_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
train = pd.read_csv(train_url)
test_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"
test = pd.read_csv(test_url)

#Print the `head` of the train dataframe
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S


In [2]:
#Print the `head` of the test dataframe
test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


One thing immediately stands out when looking at the two data sets: the `test` set has no variable (column) for whether or not the passenger `Survived`. This has been intentionally removed, as that's the variable we will be predicting with a model built on the `train` set.

# Understanding the Data
Before starting with the actual analysis, it's important to understand the structure of the data. Both `test` and `train` are DataFrame objects, the way pandas represents datasets. We can easily explore a DataFrame using the `.describe()` method. `.describe()` summarizes the numeric columns/features of the DataFrame, including the count of observations, mean, max and so on. Another useful trick is to look at the dimensions of the DataFrame. This is done by requesting the `.shape` attribute of the DataFrame object. It is also good practice to look for any missing values in the data set.

Next, we apply the `.describe()` method, look for missing values, and then check the `.shape` attribute of the training set.

In [3]:
# Describe the `train` data
train.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [4]:
# Look for missing values
train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [5]:
# Index of missing values: `Embarked`
train["Embarked"][train["Embarked"].isnull()]

61     NaN
829    NaN
Name: Embarked, dtype: object

In [6]:
# Look at the dimensions of `train`
train.shape

(891, 12)

As we can see, the training set has $891$ observations and $12$ variables, while the count for `Age` is only $714$ because $177$ values are missing. But how many people in the training set survived the Titanic disaster? To see this, we can use the `.value_counts()` method in combination with standard bracket notation to select a single column of a DataFrame:

In [7]:
# No. of people who survived (absolute numbers)
train["Survived"].value_counts()

0    549
1    342
dtype: int64

In [8]:
# No. of people who survived (percentages)
train["Survived"].value_counts(normalize = True) * 100

0    61.616162
1    38.383838
dtype: float64

We see that $549$ individuals died ($62\%$) and $342$ survived ($38\%$). A simple way to predict heuristically could be: "majority wins". This would mean that we will predict every unseen observation to not survive.
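
The "majority wins" rule can be scored directly from these counts. The sketch below rebuilds the `Survived` column from the counts above ($549$ zeros, $342$ ones) instead of loading the CSV, so the variable names are illustrative only.

```python
import pandas as pd

# Rebuild the label column from the counts shown above (549 died, 342 survived)
survived = pd.Series([0] * 549 + [1] * 342, name="Survived")

# The majority class is the most frequent label
majority_class = survived.value_counts().idxmax()

# Predicting the majority class for everyone is correct whenever the label matches it
baseline_accuracy = (survived == majority_class).mean()

print(majority_class)
print(round(baseline_accuracy, 4))
```

Any model we build later should beat this baseline accuracy to be worth using.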

To dive a little deeper, we can perform similar counts and percentage calculations on subsets of the `Survived` column. For example, could gender play a role as well? We can explore this using the `.value_counts()` method for a two-way comparison of the number of __males__ and __females__ that survived.

In [9]:
# Count of males who survived
train["Survived"][train["Sex"] == "male"].value_counts()

0    468
1    109
dtype: int64

In [10]:
# Count of females who survived
train["Survived"][train["Sex"] == "female"].value_counts()

1    233
0     81
dtype: int64

To get proportions,  we again pass in the argument `normalize = True` to the `.value_counts()` method.

In [11]:
# Count of males who survived (percentage)
train["Survived"][train["Sex"] == "male"].value_counts(normalize = True) * 100

0    81.109185
1    18.890815
dtype: float64

In [12]:
# Count of females who survived (percentage)
train["Survived"][train["Sex"] == "female"].value_counts(normalize = True) * 100

1    74.203822
0    25.796178
dtype: float64

It makes sense to include gender in the predictions, since there is a clear difference between the survival rates of males and females: around $74\%$ of females survived, as opposed to around $19\%$ of males.
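
The same comparison can be obtained in one step with `groupby`. This is a sketch on a frame rebuilt from the counts above (the name `toy` is illustrative); on the real data the equivalent call would be made on `train`.

```python
import pandas as pd

# Rebuild Sex/Survived pairs from the counts shown above
rows = (
    [("male", 0)] * 468 + [("male", 1)] * 109
    + [("female", 0)] * 81 + [("female", 1)] * 233
)
toy = pd.DataFrame(rows, columns=["Sex", "Survived"])

# Because Survived is coded 0/1, its mean per group is the survival rate
survival_rate = toy.groupby("Sex")["Survived"].mean() * 100

print(survival_rate.round(2))
```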

Another variable that could influence survival is `Age`, since it's probable that children were saved first. We can test this by creating a new column with a categorical variable `Child`. `Child` will take the value $1$ in cases where age is less than $18$, and a value of $0$ in cases where age is greater than or equal to $18$. So to add this new variable we need to do two things:

1. Create a new column.
2. Provide the values for each observation (i.e., row) based on the age of the passenger.

Adding a new column with Pandas in Python is easy and can be done via the following syntax:
```
train["new_var"] = 0
```
This code would create a new column in the `train` DataFrame titled `new_var` with $0$ for each observation. To set the values based on the age of the passenger, we make use of a boolean test inside square brackets. With the `.loc[]` indexer we select a subset of rows and assign a value to a certain variable for that subset of observations. For example:

```
train.loc[train["Fare"] > 10, "new_var"] = 1
```

This would give a value of $1$ to the variable `new_var` for the subset of passengers whose fares are greater than $10$. Keep in mind that `new_var` keeps its value of $0$ for all other rows (including those where `Fare` is missing).
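
A minimal sketch of the two steps on a made-up frame (the data and the name `toy` are illustrative), written with `.loc` so that row selection and assignment happen in a single indexing operation:

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame; the third fare is missing on purpose
toy = pd.DataFrame({"Fare": [7.25, 71.28, np.nan, 53.10]})

# Step 1: create the column with a default of 0
toy["new_var"] = 0

# Step 2: set it to 1 only where the boolean test is True
toy.loc[toy["Fare"] > 10, "new_var"] = 1

print(toy)
```

The comparison `NaN > 10` evaluates to `False`, which is why the row with the missing fare keeps its default of $0$.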

In [13]:
# Create the column Child and initialize it to NaN
train["Child"] = float('NaN')

# Assign 1 to passengers under 18, 0 to those 18 or older.
train.loc[train["Age"] < 18, "Child"] = 1
train.loc[train["Age"] >= 18, "Child"] = 0

# Print normalized survival rates for passengers under 18
print("Survival proportions for passengers under 18:")
train["Survived"][train["Child"] == 1].value_counts(normalize = True) * 100

Survival proportions for passengers under 18:


1    53.982301
0    46.017699
dtype: float64

In [14]:
# Print normalized survival rates for passengers 18 or older
print("Survival proportions for passengers 18 or older:")
train["Survived"][train["Child"] == 0].value_counts(normalize = True) * 100

Survival proportions for passengers 18 or older:


0    61.896839
1    38.103161
dtype: float64

As we can see from the survival proportions, age does seem to play a role. So the __[Birkenhead Drill](https://en.wikipedia.org/wiki/Women_and_children_first)__ appears to hold, and thus `Sex` and `Age` make good predictors.

# Basic Prediction
From exploring the data we can see that females had over a $50\%$ chance of surviving and males had less than a $50\%$ chance of surviving. Hence, we could use this information for a first and very basic prediction: 

__All females in the `test` set survive and all males in the `test` set die.__

To do this, we use the test set for validating our predictions. As mentioned above, the `test` set has no `Survived` column, so we can use this column for our predicted values. When we upload our results, Kaggle will use this variable, i.e. our predictions, to score the performance.

So to start with the first prediction, we will perform the following:

1. Create a variable `test_one`, a copy of the dataset `test`.
2. Add an additional column, `Survived`, that is initialized to zero.
3. Use vector subsetting to set the value of `Survived` to $1$ for observations whose `Sex` equals "female".
4. Print the `Survived` column of predictions from the `test_one` dataset.

In [15]:
# Create a copy of test: test_one
test_one = test.copy()

# Initialize a Survived column to 0
test_one["Survived"] = 0

# Set Survived to 1 if Sex equals "female"
test_one.loc[test_one["Sex"] == "female", "Survived"] = 1

# Print a sample of the survival predictions
test_one[["PassengerId", "Survived"]].head()

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,1
2,894,0
3,895,0
4,896,1
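
To upload predictions like these, Kaggle expects a CSV file with the two columns `PassengerId` and `Survived`. The sketch below uses the five rows shown above as a stand-in for `test_one` and writes to an in-memory buffer; the file name in the final comment is hypothetical.

```python
import io
import pandas as pd

# Stand-in for test_one: the five predictions shown above
submission = pd.DataFrame({
    "PassengerId": [892, 893, 894, 895, 896],
    "Survived": [0, 1, 0, 0, 1],
})

# index=False keeps the DataFrame's row index out of the file
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# To write an actual file instead (hypothetical file name):
# submission.to_csv("my_solution_one.csv", index=False)
```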


# Prediction using Decision Trees
In the basic prediction example, we did all the "slicing" and "dicing" ourselves to find subsets that have a higher chance of surviving. A decision tree automates this process for us and outputs a classification model or classifier.

Conceptually, the decision tree algorithm starts with all the data at the root node and scans all the variables for the best one to split on. Once a variable is chosen, it does the split and goes down one level (or one node) and repeats the process. The final nodes at the bottom of the decision tree are known as terminal nodes, and the majority vote of the observations in that node determines the prediction for new observations that end up in that terminal node.
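
To make "the best one to split on" concrete, here is a sketch that scores a split on `Sex` using the counts from the exploration above. It assumes Gini impurity as the quality measure, which is scikit-learn's default criterion; the helper `gini()` is written here for illustration and is not part of any library.

```python
def gini(n_died, n_survived):
    """Gini impurity of a node holding the given class counts."""
    total = n_died + n_survived
    p_died = n_died / total
    p_survived = n_survived / total
    return 1.0 - (p_died ** 2 + p_survived ** 2)

# Root node: all 891 training passengers
root = gini(549, 342)

# Candidate split on Sex: one child node per gender
male = gini(468, 109)
female = gini(81, 233)

# Impurity after the split is the size-weighted average of the children
n_male, n_female = 468 + 109, 81 + 233
after_split = (n_male * male + n_female * female) / (n_male + n_female)

print(round(root, 3), round(after_split, 3))
```

A lower impurity after the split means purer child nodes. The algorithm compares this drop across all candidate variables and thresholds and keeps the largest one.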

Before we can start using Decision Trees, we need to import the necessary libraries:

In [16]:
# Import the Numpy library
import numpy as np

# Import 'tree' from scikit-learn library
from sklearn import tree

# Reload the train and test datasets to create two DataFrames
train_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
train = pd.read_csv(train_url)
test_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"
test = pd.read_csv(test_url)

Before we can begin constructing our trees, we need to clean the data so that we can use all the features (predictors) available. In the first section, we saw that the `Age` variable had some missing values. Although dealing with missing values is a whole subject in and of itself, we will use a simple imputation technique where we substitute each missing value with the median of all present values. This is done using the `.fillna()` method, for example:
```
train["Age"] = train["Age"].fillna(train["Age"].median())
```
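
To see what this imputation does, here is a small sketch on a made-up `Age` column (the values are illustrative, not taken from the data set):

```python
import numpy as np
import pandas as pd

# Illustrative ages with one missing entry
age = pd.Series([22.0, np.nan, 38.0, 26.0])

# median() skips NaN by default, so it is computed from 22, 26 and 38
median_age = age.median()

# fillna() returns a new Series; the original is left unchanged
age_filled = age.fillna(median_age)

print(median_age)
print(age_filled.tolist())
```
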
Another problem is that the `Sex` and `Embarked` variables are categorical but in a non-numeric format. Thus, we will need to assign each class a unique integer so that Python can handle the information. `Embarked` also has some missing values which we should impute with the most common class of embarkation, which is "S". 

In [17]:
# Convert the male and female groups to integer form
train["Sex"] = train["Sex"].map({"male": 0, "female": 1})

# Impute the Embarked variable
train["Embarked"] = train["Embarked"].fillna("S")

# Confirm that `Embarked` has no missing values
print("No. of missing values in `Embarked` Column: ", train["Embarked"].isnull().sum())

# Convert the Embarked classes to integer form
train["Embarked"] = train["Embarked"].map({"S": 0, "C": 1, "Q": 2})

No. of missing values in `Embarked` Column:  0


Now that the data has been cleaned, we will use the scikit-learn and numpy libraries to build a decision tree. scikit-learn can be used to create tree objects from the `DecisionTreeClassifier` class. The methods that we will use take numpy arrays as inputs and therefore we will need to create those from the DataFrame that we already have. 

We will need the following to build a decision tree:

- `target`: A one-dimensional numpy array containing the target/response from the train data (`Survived`).
- `features`: A multidimensional numpy array containing the features/predictors from the train data (e.g. `Sex`, `Age`).

The following sample code shows what this would look like:
```
target = train["Survived"].values

features = train[["Sex", "Age"]].values

my_tree = tree.DecisionTreeClassifier()

my_tree = my_tree.fit(features, target)
```
One way to quickly see the result of the decision tree is to see the importance of the features that are included. This is done by requesting the `.feature_importances_` attribute of the tree object. Another quick metric is the mean accuracy that we can compute using the `.score()` function with `features_one` and `target` as arguments.
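
A runnable sketch of both metrics on a handful of made-up rows (the data is illustrative, and `random_state=0` is added only to make the run repeatable):

```python
import numpy as np
from sklearn import tree

# Illustrative data: columns are Sex (0 = male, 1 = female) and Age
features = np.array([
    [0, 22.0], [1, 38.0], [1, 26.0], [1, 35.0],
    [0, 35.0], [0, 54.0], [0, 2.0], [1, 27.0],
])
target = np.array([0, 1, 1, 1, 0, 0, 0, 1])

# random_state fixes tie-breaking between equally good splits
my_tree = tree.DecisionTreeClassifier(random_state=0)
my_tree = my_tree.fit(features, target)

# One importance value per feature column
importances = my_tree.feature_importances_
# Mean accuracy on the data passed in (here, the training data itself)
train_score = my_tree.score(features, target)

print(importances)
print(train_score)
```

Note that a score computed on the same data the tree was fitted on is optimistic: an unconstrained tree can memorize its training set, so this number says little about performance on unseen passengers.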

To build the decision tree, we will perform the following steps:
1. Build the `target` and `features_one` numpy arrays. The target will be based on the `Survived` column in `train`. The features array will be based on the variables passenger class (`Pclass`), `Sex`, `Age`, and passenger fare (`Fare`).
2. Build a decision tree `my_tree_one` to predict survival using `features_one` and `target`.
3. Look at the importance of the features in the decision tree and compute the score.

In [18]:
# Create the target and features numpy arrays: target, features_one
target = train["Survived"].values
features_one = train[["Pclass", "Sex", "Age", "Fare"]].values

# Fit your first decision tree: my_tree_one
my_tree_one = tree.DecisionTreeClassifier()
my_tree_one = my_tree_one.fit(features_one, target)

# Look at the importance and score of the included features
print(my_tree_one.feature_importances_)
print(my_tree_one.score(features_one, target))

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
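
This `ValueError` most likely comes from the missing `Age` values: the data was reloaded before cleaning, and the cleaning cell imputes `Embarked` but not `Age`. Below is a minimal sketch, on made-up rows with the same feature columns, of checking for missing values and imputing them before fitting. The frame `toy` and its values are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn import tree

# Made-up rows with the same columns as features_one; one Age is missing
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 2, 3],
    "Sex": [0, 1, 1, 1, 0, 0],
    "Age": [22.0, 38.0, np.nan, 35.0, 54.0, 2.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 13.00, 21.07],
    "Survived": [0, 1, 1, 1, 0, 0],
})
cols = ["Pclass", "Sex", "Age", "Fare"]

# Count missing values per feature before building the arrays
missing_before = toy[cols].isnull().sum()
print(missing_before)

# Impute Age with the median of the observed ages
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

features = toy[cols].values
target = toy["Survived"].values
has_nan = np.isnan(features).any()  # should be False before fitting

model = tree.DecisionTreeClassifier(random_state=0).fit(features, target)
print(has_nan)
print(model.score(features, target))
```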

In [None]:
features_one

In [19]:
import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.metrics import accuracy_score

# Load the train and test datasets
train_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
test_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"


train = pd.read_csv(train_url, dtype={"Age": np.float64}, )
test = pd.read_csv(test_url, dtype={"Age": np.float64}, )

train = pd.DataFrame(train)
train = train.replace(["male","female", "S", "C", "Q"],[1, 0, 1, 2, 3])
test = pd.DataFrame(test)

"""
#Print to standard output, and see the results in the "log" section below after running your script
print("\n\nTop of the training data:")
print(train.head())

print("\n\nSummary statistics of training data")
print(train.describe())

#Any files you save will be available in the output tab below
train.to_csv('copy_of_the_training_data.csv', index=False)
"""

train_features = train[["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]]
train_labels = train["Survived"]
test_features = test[["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]]

clf = tree.DecisionTreeClassifier()
clf.fit(train_features, train_labels)
pred = clf.predict(test_features)
print(pred)

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
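
The same error appears here. A likely cause, judging from the code, is that `train` still contains missing `Age` and `Embarked` values, and `test` is passed to `predict` without any encoding or imputation. The sketch below, on made-up rows, applies one hypothetical helper `prepare()` to both frames so they receive identical treatment. All names and values in it are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn import tree

train_toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Age": [22.0, 38.0, np.nan, 35.0, 54.0, 27.0],
    "Fare": [7.25, 71.28, 7.92, 8.05, 51.86, 11.13],
    "Embarked": ["S", "C", "S", np.nan, "S", "Q"],
    "Survived": [0, 1, 1, 0, 0, 1],
})
test_toy = pd.DataFrame({
    "Sex": ["male", "female"],
    "Age": [np.nan, 47.0],
    "Fare": [7.83, np.nan],
    "Embarked": ["Q", "S"],
})
cols = ["Sex", "Age", "Fare", "Embarked"]


def prepare(df, age_median, fare_median):
    """Hypothetical helper: impute and encode one frame, returning a copy."""
    out = df[cols].copy()
    out["Age"] = out["Age"].fillna(age_median)
    out["Fare"] = out["Fare"].fillna(fare_median)
    out["Embarked"] = out["Embarked"].fillna("S").map({"S": 0, "C": 1, "Q": 2})
    out["Sex"] = out["Sex"].map({"male": 0, "female": 1})
    return out


# Imputation values come from the training frame only
age_median = train_toy["Age"].median()
fare_median = train_toy["Fare"].median()

train_features = prepare(train_toy, age_median, fare_median)
test_features = prepare(test_toy, age_median, fare_median)

clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(train_features, train_toy["Survived"])
pred = clf.predict(test_features)
print(pred)
```

Computing the medians on the training frame and reusing them for the test frame keeps the two sets consistent and avoids letting test data influence the preprocessing.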