When the Titanic sank, 1502 of the 2224 passengers and crew were killed. One of the main reasons for this high level of casualties was the lack of lifeboats.

Those who have seen the movie know that some individuals were more likely to survive the sinking (lucky Rose) than others (poor Jack). In this lecture, we will start to learn machine learning by developing methods that learn from training data to predict a passenger's chance of surviving. We will focus today on **decision trees**.

Along the way, we will have the opportunity to learn a bit about Pandas and Scikit-Learn, two very useful Python libraries for data analysis. We will also have a look at Kaggle, a website full of very interesting machine learning competitions and datasets: you can use some of them for your project!

### 1.- Reading Data:
Let's start with loading in the training and testing set. In machine learning we always need to have a training set and an independent held-out test set. The training set is used to teach our algorithms, and the test set to understand how good they are **after the training has finished**.

The data we will use is stored on the web as CSV files; their URLs are below, and we can load them with the `read_csv()` function from the `pandas` library:

In [59]:
import pandas as pd
pd.options.mode.chained_assignment = None # avoid set copy warning

In [60]:
# Load the train and test datasets to create two DataFrames
train_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
train = pd.read_csv(train_url)

test_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"

* Exercise: Load the test data, and use the method `head()` to take a quick look at the training set. Which column contains the target variable we want to predict, and which columns can we use to predict it?

In [61]:
test = pd.read_csv(test_url)

train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


### 2.- Preliminary Data Analysis:

Before starting with the actual analysis, it's important to understand the structure of the data. Both `test` and `train` are `DataFrame` objects, the class `pandas` uses to represent datasets. You can use the method `.describe()` to summarize the numeric columns/features of the `DataFrame`. Another useful piece of information is the dimensions of the `DataFrame`, which are available through the `.shape` attribute:

In [62]:
train.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [63]:
train.shape

(891, 12)
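
Note that `describe()` counts only non-missing values (the `Age` count is 714, against 891 rows), so it can help to count the missing entries directly. The following is a minimal sketch on a small stand-in frame built from three columns of the five rows printed by `train.head()` above, rather than on the full dataset; the names `sample` and `missing_per_column` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in frame: three columns from the five rows shown by train.head()
sample = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
    "Embarked": ["S", "C", "S", "S", "S"],
})

# isna() marks missing cells; sum() then counts them per column
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

On the real data, the same check would be `train.isna().sum()`.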

**Female vs Male**

To see how many people in the training set survived the Titanic disaster, we can use the `value_counts()` method in combination with standard bracket notation to select a single column of a `DataFrame`:

In [64]:
train['Survived'].value_counts()

0    549
1    342
Name: Survived, dtype: int64

Note that `DataFrame` objects also allow dot notation to access columns:

In [65]:
train.Survived.value_counts()

0    549
1    342
Name: Survived, dtype: int64

We can also see how many of the male passengers, and how many of the female passengers, survived:

In [66]:
print(train.Survived[train.Sex == 'male'].value_counts())
print(train.Survived[train.Sex == 'female'].value_counts())

0    468
1    109
Name: Survived, dtype: int64
1    233
0     81
Name: Survived, dtype: int64


You can also easily look into percentages:

In [67]:
print(train.Survived[train.Sex == 'male'].value_counts(normalize=True))
print(train.Survived[train.Sex == 'female'].value_counts(normalize=True))

0    0.811092
1    0.188908
Name: Survived, dtype: float64
1    0.742038
0    0.257962
Name: Survived, dtype: float64
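
The per-group rates can also be computed in one step. Because `Survived` is coded as 0/1, its mean within each group is that group's survival rate. The sketch below rebuilds a two-column frame from the counts printed above instead of downloading the data; `counts`, `rows`, `toy` and `survival_rate` are hypothetical names used only here.

```python
import pandas as pd

# Counts taken from the value_counts() output above: (Sex, Survived) -> passengers
counts = {("male", 0): 468, ("male", 1): 109, ("female", 0): 81, ("female", 1): 233}

# Expand the counts into one row per passenger
rows = [(sex, survived) for (sex, survived), n in counts.items() for _ in range(n)]
toy = pd.DataFrame(rows, columns=["Sex", "Survived"])

# Mean of a 0/1 column within each group = fraction of survivors in that group
survival_rate = toy.groupby("Sex")["Survived"].mean()
print(survival_rate)
```

With the real data, the equivalent call would be `train.groupby("Sex")["Survived"].mean()`.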


### Creating a new column to account for children

In [68]:
train['Child'] = 0

In [69]:
# Flag passengers younger than 18; .loc avoids chained assignment
train.loc[train['Age'] < 18, 'Child'] = 1

In [70]:
train.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Child
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S,0
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S,0
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S,0
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C,0
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q,0
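
One detail worth checking: a comparison against a missing age evaluates to `False`, so passengers with an unknown age keep `Child = 0`, as row 888 above shows. Below is a minimal sketch using the ages from the `tail()` output plus one made-up age of 4 (an assumption, added so that at least one row gets flagged), with `.loc` used for the assignment. The frame name `ages` is illustrative.

```python
import numpy as np
import pandas as pd

# Ages from train.tail() above, plus one hypothetical child (age 4) for illustration
ages = pd.DataFrame({"Age": [27.0, 19.0, np.nan, 26.0, 32.0, 4.0]})

ages["Child"] = 0
# NaN < 18 is False, so the row with a missing age is left at 0
ages.loc[ages["Age"] < 18, "Child"] = 1
print(ages)
```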


In one of the previous exercises you discovered that in your training set, females had over a 50% chance of surviving and males had less than a 50% chance of surviving. Hence, you could use this information for your first prediction: all females in the test set survive and all males in the test set die.

You use your test set for validating your predictions. You might have seen that, unlike the training set, the test set has no `Survived` column. You add such a column using your predicted values. Next, when you upload your results, Kaggle will use this variable (= your predictions) to score your performance.

* Create a variable `test_one`, identical to the dataset `test`.
* Add an additional column, `Survived`, that you initialize to zero.
* Use vector subsetting like in the previous exercise to set the value of `Survived` to 1 for observations whose `Sex` equals "female".
* Print the `Survived` column of predictions from the `test_one` dataset.
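
One possible way to carry out these steps is sketched below. To keep the example self-contained, `test` here is a small stand-in frame with made-up rows (an assumption); in the notebook, the `test` DataFrame loaded earlier would be used instead.

```python
import pandas as pd

# Hypothetical stand-in for the test set; the real one comes from pd.read_csv(test_url)
test = pd.DataFrame({
    "PassengerId": [892, 893, 894, 895],
    "Sex": ["male", "female", "male", "female"],
})

# Work on a copy so the original test set is left untouched
test_one = test.copy()

# Start by predicting that nobody survives...
test_one["Survived"] = 0
# ...then predict survival for every female passenger
test_one.loc[test_one["Sex"] == "female", "Survived"] = 1

print(test_one["Survived"])
```

Copying first means `test` itself never gains a `Survived` column, which keeps later predictions from accidentally reusing this one.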

## Cleaning and Formatting Data
Before you can begin constructing your trees you need to get your hands dirty and clean the data so that you can use all the features available to you. In the first chapter, we saw that the Age variable had some missing values (`describe()` counted only 714 ages for 891 rows). Missingness is a whole subject in and of itself, but we will use a simple imputation technique where we substitute each missing value with the median of all the present values.

In [71]:
train["Age"] = train["Age"].fillna(train["Age"].median())
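
To see what this line does, here is a minimal sketch on the five ages from the `tail()` output above, one of which is missing. `median()` skips missing values by default, so the gap is filled with the median of the four known ages. The names `ages`, `median_age` and `filled` are illustrative.

```python
import numpy as np
import pandas as pd

# Ages from train.tail() above; the third one is missing
ages = pd.Series([27.0, 19.0, np.nan, 26.0, 32.0], name="Age")

# median() ignores NaN by default, so this is the median of 19, 26, 27 and 32
median_age = ages.median()

# Replace each missing entry with that median
filled = ages.fillna(median_age)
print(median_age)
print(filled.tolist())
```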

Another problem is that the Sex and Embarked variables are categorical but in a non-numeric format. Thus, we will need to assign each class a unique integer so that the learning algorithm can handle the information. Embarked also has some missing values, which you should impute with the most common class of embarkation, which is "S".

Replace each class of Embarked with a unique integer: 0 for S, 1 for C, and 2 for Q.

In [72]:
# Convert the male and female groups to integer form
# (map() replaces the chained assignments, which newer pandas versions may not apply)
train["Sex"] = train["Sex"].map({"male": 0, "female": 1})

# Impute the Embarked variable with the most common class, "S"
train["Embarked"] = train["Embarked"].fillna("S")

# Convert the Embarked classes to integer form
train["Embarked"] = train["Embarked"].map({"S": 0, "C": 1, "Q": 2})

# # Print the Sex and Embarked columns
# print(train["Sex"])
# print(train["Embarked"])

## Intro to decision trees

In the previous chapter, you did all the slicing and dicing yourself to find subsets that have a higher chance of surviving. A decision tree automates this process for you and outputs a classification model or classifier.

Conceptually, the decision tree algorithm starts with all the data at the root node and scans all the variables for the best one to split on. Once a variable is chosen, you do the split, go down one level (or one node), and repeat. The final nodes at the bottom of the decision tree are known as terminal nodes, and the majority vote of the observations in a terminal node determines the prediction for new observations that end up in it.
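
To make "the best variable to split on" concrete, the sketch below scores the split on `Sex` using Gini impurity and the survival counts found earlier. Gini impurity is one common splitting criterion (and the default in scikit-learn's `DecisionTreeClassifier`); the text above does not say which criterion is meant, so treat this choice as an assumption. The helper names `gini` and `majority_class` are made up for this illustration.

```python
def gini(n_died, n_survived):
    """Gini impurity of a node holding the given class counts."""
    total = n_died + n_survived
    p_died = n_died / total
    p_survived = n_survived / total
    return 1.0 - p_died ** 2 - p_survived ** 2


def majority_class(n_died, n_survived):
    """Prediction of a terminal node: 0 (died) or 1 (survived)."""
    return 1 if n_survived > n_died else 0


# (died, survived) counts from the training set, as printed earlier
root = (549, 342)
male = (468, 109)
female = (81, 233)

n_total = sum(root)
# Impurity after the split: child impurities averaged with node-size weights
split_impurity = (
    (sum(male) / n_total) * gini(*male)
    + (sum(female) / n_total) * gini(*female)
)

print("impurity before split:", round(gini(*root), 3))
print("impurity after split on Sex:", round(split_impurity, 3))
print("prediction in male node:", majority_class(*male))
print("prediction in female node:", majority_class(*female))
```

A lower impurity after the split means the child nodes are more homogeneous than the parent; the algorithm would compare this score across candidate variables and keep the best one. The majority vote in the two nodes reproduces the gender-based prediction from earlier.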

First, let's import the necessary libraries:

In [55]:
from sklearn import tree
import numpy as np