# Introduction
In this case study, we will explore how to tackle the Kaggle Titanic competition using Python and machine learning. When the Titanic sank, $1502$ of the $2224$ passengers and crew were killed. One of the main reasons for this high number of casualties was the lack of lifeboats on this self-proclaimed __"unsinkable"__ ship. In this tutorial, we will learn how to apply machine learning techniques to predict a passenger's chance of surviving.

# Getting Data with Pandas
We start by loading the training and test sets into the Python environment. We will use the [training set](http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv) to build our model, and the [test set](http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv) to validate it. The first step is to load this data with the `read_csv()` function from the Pandas library.

In [1]:
%%local
# Import the Pandas library
import warnings
warnings.filterwarnings('ignore')
import pandas as pd

# Load the train and test datasets to create two DataFrames
train_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
train = pd.read_csv(train_url)
test_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"
test = pd.read_csv(test_url)

#Print the `head` of the train dataframe
train.head()

In [2]:
%%local
#Print the `head` of the test dataframe
test.head()

One thing immediately stands out when looking at the two data sets: the `test` set has no variable (column) for whether or not the passenger `Survived`. This has been intentionally removed, as that's the variable we will be predicting with a model built on the `train` set.

# Exploring the Data
Before starting with the actual analysis, it's important to understand the structure of the data. Both `test` and `train` are DataFrame objects, the way pandas represents datasets. We can easily explore a DataFrame using the `.describe()` method. This method summarizes the numeric columns/features of the DataFrame, including the count of observations, mean, max and so on. Another useful trick is to look at the dimensions of the DataFrame. This is done by requesting the `.shape` attribute of the DataFrame object. It is also good practice to look for any missing values in the data set.

### Summary Statistics
Next we apply the `.describe()` method, look for missing values and then apply `.shape` attribute of the training set.

In [3]:
%%local
# Describe the `train` data
train.describe()

### Missing Values

In [4]:
%%local
# Look for missing values
train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [5]:
%%local
# Index for missing values: `Embarked`
train["Embarked"][train["Embarked"].isnull()]

61     NaN
829    NaN
Name: Embarked, dtype: object

### Dimensions

In [6]:
%%local
# Look at the dimensions of `train`
train.shape

(891, 12)

### Understanding the Data
As we can see, the training set has $891$ observations and $12$ variables, while the count for `Age` is only $714$ because of its $177$ missing values. But how many people in the training set survived the Titanic disaster? To see this, we can use the `value_counts()` method in combination with standard bracket notation to select a single column of a DataFrame:

In [7]:
%%local
# No. of people who survived (absolute numbers)
train["Survived"].value_counts()

0    549
1    342
Name: Survived, dtype: int64

In [8]:
%%local
# No. of people who survived (percentages)
train["Survived"].value_counts(normalize = True) * 100

0    61.616162
1    38.383838
Name: Survived, dtype: float64

We see that $549$ individuals died ($62\%$) and $342$ survived ($38\%$). A simple way to predict heuristically could be: "majority wins". This would mean that we will predict every unseen observation to not survive.
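The "majority wins" heuristic gives a useful baseline that any later model should beat. The following is a minimal sketch that only reuses the two counts printed above; the variable names are illustrative and not part of the original notebook.

```python
# Counts taken from the value_counts() output above
died, survived = 549, 342
total = died + survived

# "Majority wins": always predict the more frequent class (0 = did not survive)
majority_class = 0 if died >= survived else 1
baseline_accuracy = max(died, survived) / float(total)

print("Predicted class for everyone: {}".format(majority_class))
print("Baseline accuracy on the training set: {:.3f}".format(baseline_accuracy))
```

On the training set, this baseline is right for roughly $62\%$ of the passengers, simply because that is the share of passengers who died.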

To dive in a little deeper, we can perform similar counts and percentage calculations on subsets of the `Survived` column. For example, maybe gender plays a role as well? We can explore this using the `.value_counts()` method for a two-way comparison of the number of __males__ and __females__ that survived.

In [9]:
%%local
# Count of males who survived
train["Survived"][train["Sex"] == "male"].value_counts()

0    468
1    109
Name: Survived, dtype: int64

In [10]:
%%local
# Count of females who survived
train["Survived"][train["Sex"] == "female"].value_counts()

1    233
0     81
Name: Survived, dtype: int64

To get proportions,  we again pass in the argument `normalize = True` to the `.value_counts()` method.

In [11]:
%%local
# Count of males who survived (percentage)
train["Survived"][train["Sex"] == "male"].value_counts(normalize = True) * 100

0    81.109185
1    18.890815
Name: Survived, dtype: float64

In [12]:
%%local
# Count of females who survived (percentage)
train["Survived"][train["Sex"] == "female"].value_counts(normalize = True) * 100

1    74.203822
0    25.796178
Name: Survived, dtype: float64

It looks like it makes sense to include gender in the predictions, since there is a clear difference between the survival rates of males and females. Around $74\%$ of females survived, as opposed to around $19\%$ of males.
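The four separate `value_counts()` calls can also be condensed into a single cross-tabulation. The sketch below rebuilds a two-column table from the counts printed above rather than loading the real data, so the `toy` DataFrame is an assumption made for illustration only.

```python
import pandas as pd

# Rebuild a Sex/Survived table from the counts printed above
# (468 + 109 males, 81 + 233 females); this is not the original data set.
rows = ([("male", 0)] * 468 + [("male", 1)] * 109 +
        [("female", 0)] * 81 + [("female", 1)] * 233)
toy = pd.DataFrame(rows, columns=["Sex", "Survived"])

# Row-normalized cross-tabulation: each row sums to 100
rates = pd.crosstab(toy["Sex"], toy["Survived"], normalize="index") * 100
print(rates.round(1))
```

With the real data, the same call would be `pd.crosstab(train["Sex"], train["Survived"], normalize="index")`.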

Another variable that could influence survival is `Age`, since it's probable that children were saved first. We can test this by creating a new column with a categorical variable `Child`. `Child` will take the value $1$ in cases where age is less than $18$, and a value of $0$ in cases where age is greater than or equal to $18$. So to add this new variable we need to do two things:

1. Create a new column.
2. Provide the values for each observation (i.e., row) based on the age of the passenger.

Adding a new column with Pandas in Python is easy and can be done via the following syntax:
```
<variable>["new_variable"] = 0
```
This code would create a new column titled `new_variable` in the DataFrame `<variable>`, with $0$ for each observation. To set the values based on the age of the passenger, we make use of a boolean test inside the square brackets of the `.loc` indexer. With `.loc[]` we select a subset of rows and assign a value to a certain variable for that subset of observations. For example, assuming a column `new_var` has already been created in `train` and initialized to $0$:

```
train.loc[train["Fare"] > 10, "new_var"] = 1
```

This would give a value of $1$ to the variable `new_var` for the subset of passengers whose fare is greater than $10$. Keep in mind that `new_var` keeps its value of $0$ for all other observations (including those with a missing fare).
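The two steps can be tried out on a small, made-up DataFrame. This is a sketch under the assumption of four hypothetical fares, one of them missing; it uses the `.loc` indexer, which pandas recommends for this kind of conditional assignment.

```python
import pandas as pd

# Hypothetical mini data set; the last fare is missing
toy = pd.DataFrame({"Fare": [7.25, 71.28, 8.05, float("nan")]})

# 1. Create the column with a default value
toy["new_var"] = 0

# 2. Overwrite the rows that match the boolean test
toy.loc[toy["Fare"] > 10, "new_var"] = 1

print(toy)
```

The comparison `NaN > 10` evaluates to `False`, which is why the row with the missing fare keeps the default value of $0$.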

In [13]:
%%local
# Create the column Child and initialize it to NaN
train["Child"] = float('NaN')

# Assign 1 to passengers under 18, 0 to those 18 or older
train.loc[train["Age"] < 18, "Child"] = 1
train.loc[train["Age"] >= 18, "Child"] = 0

# Print normalized survival rates for passengers under 18
print("Survival proportions for passengers under 18:")
train["Survived"][train["Child"] == 1].value_counts(normalize = True) * 100

Survival proportions for passengers under 18:


1    53.982301
0    46.017699
Name: Survived, dtype: float64

In [14]:
%%local
# Print normalized survival rates for passengers 18 or older
print("Survival proportions for passengers 18 or older:")
train["Survived"][train["Child"] == 0].value_counts(normalize = True) * 100

Survival proportions for passengers 18 or older:


0    61.896839
1    38.103161
Name: Survived, dtype: float64

As we can see from the survival proportions, age certainly does seem to play a role. So the __[Birkenhead Drill](https://en.wikipedia.org/wiki/Women_and_children_first)__ appears to hold true, and thus `Sex` and `Age` make good predictors.

# Basic Prediction
From exploring the data we can see that females had over a $50\%$ chance of surviving and males had less than a $50\%$ chance of surviving. Hence, we could use this information for a first and very basic prediction: 

__All females in the `test` set survive and all males in the `test` set die.__

To do this, we use the `test` set for validating our predictions. As mentioned above, the `test` set has no `Survived` column. This is so that we can use this column for our predicted values. Then, when we upload our results, Kaggle uses this variable, i.e. our predictions, to score the performance.

So to start with the first prediction, we will perform the following:

1. Create a variable `test_one`, a copy of the dataset `test`.
2. Add an additional column, `Survived`, that is initialized to zero.
3. Use vector subsetting to set the value of `Survived` to $1$ for observations whose `Sex` equals "female".
4. Print the `Survived` column of predictions from the `test_one` dataset.

In [15]:
%%local
# Start the timer
from time import time
start = time()

# Create a copy of test: test_one
# (.copy() is needed; plain assignment would only create a second name for `test`)
test_one = test.copy()

# Initialize a Survived column to 0
test_one["Survived"] = 0

# Set Survived to 1 if Sex equals "female"
test_one.loc[test_one["Sex"] == "female", "Survived"] = 1

# Print a sample prediction of who survived
test_one[["PassengerId", "Survived"]].head()
#print "Our basic prediction took {:.2f} seconds.".format(time() - start)

# Prediction using Decision Trees
In the basic prediction example, we did all the "slicing" and "dicing" ourselves to find subsets that have a higher chance of surviving. A decision tree automates this process for us and outputs a classification model or classifier.

Conceptually, the decision tree algorithm starts with all the data at the root node and scans all the variables for the best one to split on. Once a variable is chosen, it does the split, goes down one level (or one node) and repeats the process. The final nodes at the bottom of the decision tree are known as terminal nodes, and the majority vote of the observations in a terminal node determines the prediction for new observations that end up in that node.
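To make "the best variable to split on" more concrete, the sketch below scores a split on `Sex` with Gini impurity, which is the default criterion of scikit-learn's `DecisionTreeClassifier`. It only reuses the survival counts from the exploration above; the helper function `gini` is a hypothetical name introduced for this illustration.

```python
def gini(n_died, n_survived):
    """Gini impurity of a node holding the given class counts."""
    total = float(n_died + n_survived)
    p_died, p_survived = n_died / total, n_survived / total
    return 1.0 - p_died ** 2 - p_survived ** 2

# Class counts (died, survived) taken from the training set figures above
root = (549, 342)
male = (468, 109)
female = (81, 233)

# Impurity after the split: child impurities weighted by child size
n_total = float(sum(root))
weighted_children = ((sum(male) / n_total) * gini(*male) +
                     (sum(female) / n_total) * gini(*female))

print("Gini at the root:        {:.3f}".format(gini(*root)))
print("Gini after split on Sex: {:.3f}".format(weighted_children))
print("Impurity decrease:       {:.3f}".format(gini(*root) - weighted_children))
```

A tree-growing algorithm would compute such a decrease for every candidate split and keep the one with the largest value, then repeat the procedure inside each child node.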

Before we can start using Decision Trees, we need to import the necessary libraries:

In [16]:
%%local
# Import the Numpy library
import numpy as np

# Import 'tree' from scikit-learn library
from sklearn import tree

# Reload the train and test datasets to create two DataFrames
train_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
train = pd.read_csv(train_url)
test_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"
test = pd.read_csv(test_url)

### Preprocessing
Before we can begin constructing our trees, we need to clean the data so that we can use all the features (predictors) available. In the first section, we saw that the `Age` variable had some missing values. Although dealing with missing values is a whole subject in and of itself, we will use a simple imputation technique where we substitute each missing value with the median of all the present values. This is done by using the `.fillna()` method, for example:
```
train["Age"] = train["Age"].fillna(train["Age"].median())
```
Another problem is that the `Sex` and `Embarked` variables are categorical but in a non-numeric format. Thus, we will need to assign each class a unique integer so that Python can handle the information. `Embarked` also has some missing values which we should impute with the most common class of embarkation, which is "S". 

In [17]:
%%local
# Convert the male and female groups to integer form
train["Sex"] = train["Sex"].map({"male": 0, "female": 1})

# Impute the `Embarked` variable
train["Embarked"] = train["Embarked"].fillna("S")

# Impute the `Age` variable
train["Age"] = train["Age"].fillna(train["Age"].median())

# Confirm that `Embarked` and `Age` have no missing values
print("No. of missing values:\n{}".format(train.isnull().sum()))

# Convert the Embarked classes to integer form
train["Embarked"] = train["Embarked"].map({"S": 0, "C": 1, "Q": 2})

No. of missing values:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
dtype: int64


### Fitting the Model
Now that the data has been cleaned, we will use the scikit-learn and numpy libraries to build a decision tree. scikit-learn can be used to create tree objects from the `DecisionTreeClassifier` class. The methods that we will use take numpy arrays as inputs and therefore we will need to create those from the DataFrame that we already have. 

We will need the following to build a decision tree:

- `target`: A one-dimensional numpy array containing the target/response from the train data. (`Survived`)
- `features`: A multidimensional numpy array containing the features/predictors from the train data. (e.g. `Sex`, `Age`)

The following sample code shows what this would look like:
```
target = train["Survived"].values

features = train[["Sex", "Age"]].values

my_tree = tree.DecisionTreeClassifier()

my_tree = my_tree.fit(features, target)
```
One way to quickly see the result of the decision tree is to look at the importance of the features that are included. This is done by requesting the `.feature_importances_` attribute of the tree object. Another quick metric is the mean accuracy, which we can compute using the `.score()` method with the features array and `target` as arguments. Note that when these are the same arrays the tree was fitted on, the result is the accuracy on the training data.
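Mean accuracy is simply the share of observations for which the predicted class equals the true class. The following sketch computes it by hand for eight made-up passengers; both lists are hypothetical and only serve to illustrate what `.score()` reports for a classifier.

```python
# Hypothetical labels and predictions for eight passengers
actual    = [0, 1, 1, 0, 0, 1, 0, 1]
predicted = [0, 1, 0, 0, 0, 1, 1, 1]

# Mean accuracy: share of predictions that match the true label
matches = [int(a == p) for a, p in zip(actual, predicted)]
accuracy = sum(matches) / float(len(matches))

print("Accuracy: {:.3f}".format(accuracy))
```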

To build the decision tree, we will perform the following steps:
1. Build the `target` and `features_one` numpy arrays. The target will be based on the `Survived` column in `train`. The features array will be based on the variables Passenger Class (`Pclass`), `Sex`, `Age`, and Passenger Fare (`Fare`).
2. Build a decision tree `my_tree_one` to predict survival using `features_one` and `target`.
3. Look at the importance of the features in the decision tree and compute the score.

In [18]:
%%local
# Create the target and features numpy arrays: target, features_one
target = train["Survived"].values
features_one = train[["Pclass", "Sex", "Age", "Fare"]].values

# Fit your first decision tree: my_tree_one
my_tree_one = tree.DecisionTreeClassifier()
start = time()
my_tree_one = my_tree_one.fit(features_one, target)

# Look at the importance and score of the included features
print("Our Decision Tree prediction took {:.2f} seconds.".format(time() - start))
print("Importance:\n{}".format(my_tree_one.feature_importances_))
print("Score:\n{}".format(my_tree_one.score(features_one, target)))

Our Decision Tree prediction took 0.00 seconds.
Importance:
[ 0.12901815  0.31274009  0.22207249  0.33616926]
Score:
0.977553310887


Looks like Passenger __Fare__ has the most significance in determining survival based on the model. Since we declared the features to use (`features_one`), we can assume that the importances are listed in the same order, but let's confirm that by mapping each feature name to its importance.

In [19]:
%%local
# List of feature names
names = ["Pclass", "Sex", "Age", "Fare"]

# Code courtesy of:
# http://blog.datadive.net/selecting-good-features-part-iii-random-forests/
print("Features sorted by their score:")
print(sorted(zip(map(lambda x: round(x, 4), my_tree_one.feature_importances_), names),
             reverse=True))

Features sorted by their score:
[(0.3362, 'Fare'), (0.3127, 'Sex'), (0.2221, 'Age'), (0.129, 'Pclass')]


# Dealing with Overfitting
When applying models to new data, one thing to pay special attention to is __[overfitting](https://en.wikipedia.org/wiki/Overfitting)__. When creating the decision tree above, the arguments `max_depth` and `min_samples_split` were left at their defaults (`None` and $2$, respectively). This means that no limit on the depth of the tree was set. This is not necessarily a good thing, as we are likely overfitting. This means that while our model describes the training data extremely well, it doesn't generalize to new data, which is frankly the point of prediction.

One solution to address this is to make a less complex model. In `DecisionTreeClassifier`, the complexity of the model can be limited with two parameters:
- The `max_depth` parameter determines when the splitting up of the decision tree stops.
- The `min_samples_split` parameter monitors the number of observations in a bucket. If a certain threshold is not reached (e.g. a minimum of $10$ passengers), no further splitting can be done.

By limiting the complexity of the decision tree we can increase its generality and thus its usefulness for prediction. To test this theory we now include the Siblings or Spouses Aboard (`SibSp`), Parents/Children Aboard (`Parch`), and `Embarked` features in a new set of features, fit a second tree (`my_tree_two`) with the new features, and control the model complexity by setting the `max_depth` and `min_samples_split` arguments.

In [20]:
%%local
# Create a new array with the added features: features_two
target = train["Survived"].values
features_two = train[["Pclass","Age","Sex","Fare", "SibSp", "Parch", "Embarked"]].values

# Control overfitting by setting "max_depth" to 10 and "min_samples_split" to 5
max_depth = 10
min_samples_split = 5
my_tree_two = tree.DecisionTreeClassifier(max_depth = max_depth, min_samples_split = min_samples_split, random_state = 1)

# Fit the my_tree_two model
start = time()
my_tree_two = my_tree_two.fit(features_two, target)

# Print the score of the new decision tree
print("Our new Decision Tree prediction took {:.2f} seconds.".format(time() - start))
print("Importance:\n{}".format(my_tree_two.feature_importances_))
print("Score:\n{}".format(my_tree_two.score(features_two, target)))

Our new Decision Tree prediction took 0.00 seconds.
Importance:
[ 0.14130255  0.17906027  0.41616727  0.17938711  0.05039699  0.01923751
  0.0144483 ]
Score:
0.905723905724


Although actually submitting the updated solution to Kaggle is outside the scope of this tutorial, doing so would show that, despite a lower `.score()` on the training data, this new model predicts better than the first one.
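Without submitting to Kaggle, the gap between training accuracy and accuracy on unseen data can be checked by holding back part of the labeled data. The sketch below is an illustration on synthetic data generated by scikit-learn, not on the Titanic set; the data set size, noise level and `max_depth = 3` are assumptions chosen only for the demonstration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Titanic set), with some label noise
X, y = make_classification(n_samples=600, n_features=7, n_informative=4,
                           flip_y=0.2, random_state=1)

# Hold back 30% of the rows; the trees never see them during fitting
X_fit, X_held, y_fit, y_held = train_test_split(X, y, test_size=0.3,
                                                random_state=1)

deep = DecisionTreeClassifier(random_state=1).fit(X_fit, y_fit)
shallow = DecisionTreeClassifier(max_depth=3, min_samples_split=5,
                                 random_state=1).fit(X_fit, y_fit)

for name, model in [("unconstrained", deep), ("constrained", shallow)]:
    print("{:>13}: train = {:.3f}, held-out = {:.3f}".format(
        name, model.score(X_fit, y_fit), model.score(X_held, y_held)))
```

The unconstrained tree memorizes the rows it was fitted on, so its training score says little about new data; the held-out score is the number to compare between models.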

# Feature Engineering
One of the most complicated aspects of Data Science is trying various machine learning algorithms, dealing with over- and under-fitting, and tweaking parameters to find the best possible fit to new and unseen data. Part of this tweaking process is feature engineering. This is the process of creatively engineering our own features by combining the different existing variables.

While feature engineering is a discipline in itself, too broad to be covered here in detail, we will have a look at a simple example by creating our very own new predictive attribute: `family_size`. A plausible assumption is that larger families need more time to get together on a sinking ship, and hence have a lower probability of surviving. Family size is determined by the variables `SibSp` and `Parch`, which indicate the number of family members a certain passenger is traveling with. So when doing feature engineering, we add a new variable `family_size`, which is the sum of `SibSp` and `Parch` plus one (for the passenger themselves), to the train set. To engineer this new feature, we do the following:

1. Create a "fresh" `train` set called `train_two` that differs from `train` only by having an extra column with the engineered variable `family_size`.
2. Add the new engineered variable `family_size` in addition to `Pclass`, `Sex`, `Age`, `Fare`, `SibSp` and `Parch` to a new set,  `features_three`.
3. Create a new decision tree as `my_tree_three` and fit the decision tree with the new feature set,  `features_three`.
4. Find the score of the new decision tree.

In [21]:
%%local
# Create train_two with the newly defined feature
target = train["Survived"].values
train_two = train.copy()
train_two["family_size"] = train_two["SibSp"] + train_two["Parch"] + 1

# Create a new feature set and add the new feature
features_three = train_two[["Pclass", "Sex", "Age", "Fare", "SibSp", "Parch", "family_size"]].values

# Define the tree classifier, then fit the model
my_tree_three = tree.DecisionTreeClassifier()
start = time()
my_tree_three = my_tree_three.fit(features_three, target)

# Print the score of this decision tree
print("Our Decision Tree with `family_size` prediction took {:.2f} seconds.".format(time() - start))
print("Importance:\n{}".format(my_tree_three.feature_importances_))
print("Score:\n{}".format(my_tree_three.score(features_three, target)))

Our Decision Tree with `family_size` prediction took 0.00 seconds.
Importance:
[ 0.11181314  0.31088095  0.22213523  0.26463439  0.02335337  0.02185157
  0.04533135]
Score:
0.979797979798


__Notice__ that this time the newly created variable is included in the model.

# Prediction using Random Forest
A detailed study of Random Forests is outside the scope of this tutorial. However, since it's an often used machine learning technique, we give a general overview in Python. The Random Forest technique addresses the overfitting problem we saw with decision trees. It grows multiple (very deep) classification trees using the training set. At the time of prediction, each tree is used to come up with a prediction and every outcome is counted as a vote. For example, if we have trained $3$ trees, with $2$ saying a passenger in the test set will survive and $1$ saying he will not, the passenger will be classified as a survivor. This approach of overtraining individual trees, but letting the majority's vote count as the actual classification decision, reduces overfitting.
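The voting step in the three-tree example can be written down in a few lines. This is a conceptual sketch only: `majority_vote` is a hypothetical helper, and scikit-learn's own `RandomForestClassifier` combines its trees by averaging their predicted class probabilities rather than by counting hard votes.

```python
from collections import Counter

def majority_vote(votes):
    """Return the class predicted by most trees."""
    return Counter(votes).most_common(1)[0][0]

# Three hypothetical trees: two predict survival (1), one predicts death (0)
tree_votes = [1, 1, 0]
print("Forest prediction: {}".format(majority_vote(tree_votes)))
```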

Building a random forest in Python is very similar to building a decision tree, with three key differences:
1. A different class is used: the `RandomForestClassifier()` class instead of the `DecisionTreeClassifier()` class.
2. A new argument, `n_estimators`, is set when using the `RandomForestClassifier()` class. This argument allows us to set the number of trees we wish to plant and average over.
3. The necessary library from `scikit-learn` must be imported.

The following example shows us how to build a Random Forest Classifier by doing the following:
1. Build the random forest with `n_estimators` set to $100$.
2. Fit the random forest model with the inputs `forest_features` and `target`.
3. Compute the score of the fitted classifier on the selected features.

In [22]:
%%local
# Import the `RandomForestClassifier`
from sklearn.ensemble import RandomForestClassifier

# Create the target and feature arrays
target = train["Survived"].values
forest_features = train[["Pclass", "Age", "Sex", "Fare", "SibSp", "Parch", "Embarked"]].values

# Building and fitting my_forest
forest = RandomForestClassifier(max_depth = 10, min_samples_split=2, n_estimators = 100, random_state = 1)
start = time()
my_forest = forest.fit(forest_features, target)

# Print the score of the fitted random forest
print("Our Random Forest prediction took {:.2f} seconds.".format(time() - start))
print("Score:\n{}".format(my_forest.score(forest_features, target)))

Our Random Forest prediction took 0.29 seconds.
Score:
0.939393939394


# Model Comparison
Recall that when using the Decision Tree models, we looked at the `.feature_importances_` attribute to see how each of the features influenced the decision trees. We can request the same attribute from the random forest as well and interpret the relevance of the included variables. Since the Random Forest alleviates the overfitting problem, it is a good exercise to compare it to the Decision Tree model in some quick and easy way. For this, we can use the `.score()` method, which takes the features data and the target vector and computes the mean accuracy of the model.

In [23]:
%%local
# Final score comparison
names = ["Pclass","Age","Sex","Fare", "SibSp", "Parch", "Embarked"]
print("Decision Tree final score: {}".format(my_tree_two.score(features_two, target)))
print("Decision Tree features sorted by their score:")
print(sorted(zip(map(lambda x: round(x, 4), my_tree_two.feature_importances_), names),
             reverse=True))
print("\n")
print("Random Forest final score: {}".format(my_forest.score(forest_features, target)))
print("Random Forest features sorted by their score:")
print(sorted(zip(map(lambda x: round(x, 4), my_forest.feature_importances_), names),
             reverse=True))

Decision Tree final score: 0.905723905724
Decision Tree features sorted by their score:
[(0.4162, 'Sex'), (0.1794, 'Fare'), (0.1791, 'Age'), (0.1413, 'Pclass'), (0.0504, 'SibSp'), (0.0192, 'Parch'), (0.0144, 'Embarked')]


Random Forest final score: 0.939393939394
Random Forest features sorted by their score:
[(0.3199, 'Sex'), (0.246, 'Fare'), (0.2014, 'Age'), (0.1038, 'Pclass'), (0.0527, 'SibSp'), (0.0416, 'Parch'), (0.0345, 'Embarked')]


# Conclusion
Based on our findings from the various models that have been run, we can determine which feature was of most importance, and for which model.

__The most important feature was `Sex`, and it carried more weight in the `my_tree_two` Decision Tree ($0.4162$) than in the Random Forest ($0.3199$).__

---

# Ensemble Methods
In a prediction task, it is important to try and test multiple algorithms to find the best possible fit. In most cases it is neither necessary nor possible to test every single algorithm type against the data, and a fair assessment may suffice, but it is good practice to at least try to achieve a better fit. That is why we introduced a different machine learning algorithm alongside the `DecisionTreeClassifier()` we were using. What we did, in essence, was introduce an ensemble method. The goal of ensemble methods is to combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability and robustness over a single estimator.

In our case, we used the `RandomForestClassifier()`, which builds several (very deep) classification trees. At prediction time, each tree comes up with a prediction and every outcome is counted as a vote. This is an example of an __averaging method__. Here, the driving principle is to build several estimators independently and then to average their predictions. On average, the combined estimator is usually better than any single base estimator because its variance is reduced. This is evident when comparing the scores of the two models above.

One of the biggest issues with ensemble methods is that they can be computationally expensive. Finding the best model fit also means trying many combinations of variables and of model parameters (hyperparameters), since there is no easy way to know which work best other than trying out many different combinations. Doing this manually is tedious. Fortunately, the `scikit-learn` package includes the `GridSearchCV` and `RandomizedSearchCV` classes, which evaluate each parameter setting independently and can do so in parallel.
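The difference between the two search strategies comes down to which candidates get evaluated. The sketch below uses scikit-learn's `ParameterGrid` and `ParameterSampler` helpers on a small, hypothetical two-parameter grid (the values are assumptions for illustration) to list the candidates each strategy would consider, without fitting any model.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Hypothetical two-parameter grid for a tree-based model
param_grid = {"max_depth": [3, 10, None],
              "min_samples_split": [2, 5, 10, 20]}

# Grid search: every combination is a candidate
all_candidates = list(ParameterGrid(param_grid))

# Randomized search: only a fixed number of sampled combinations
sampled = list(ParameterSampler(param_grid, n_iter=5, random_state=1))

print("Grid candidates:    {}".format(len(all_candidates)))
print("Sampled candidates: {}".format(len(sampled)))
```

A randomized search trades completeness for a fixed, predictable budget, which matters once the grid grows to hundreds of combinations.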

The downside is that this approach does not scale. The models above are executed on data in Pandas DataFrames. These DataFrames are memory resident, so unless the machine executing the models has sufficient memory or the data sets are small enough, the `GridSearchCV` and `RandomizedSearchCV` classes may not be helpful. Fortunately, the team at Databricks has released the [`spark-sklearn`](http://spark-packages.org/package/databricks/spark-sklearn) package to allow us to execute these searches over a Spark cluster.

The following example is based on the  [Auto-scaling scikit-learn with Spark](https://databricks.com/blog/2016/02/08/auto-scaling-scikit-learn-with-spark.html) example, but has been adapted to our Titanic example. The objective is to populate a "grid" of various model parameters and have `GridSearchCV` execute each set of parameters to determine the best score. The goal here is not necessarily to find a model with a better score but to illustrate the time it takes to search through the various model parameters to find the best one. The example has the following steps:

1. Import the `GridSearchCV` class from `scikit-learn`.
2. Create the function to report the top $3$ models, courtesy of [Databricks](http://go.databricks.com/hubfs/notebooks/Samples/Miscellaneous/blog_post_cv.html).
3. Create the "grid" of parameter combinations to execute. This is loosely based on the Databricks example.
4. Fit the model with the "grid" of parameters.

In [24]:
%%local
# Import necessary packages
import warnings
warnings.filterwarnings('ignore')
from operator import itemgetter
from sklearn.model_selection import GridSearchCV

# Utility function, adapted from the Databricks example, to report the top 3 scores.
# It reads `cv_results_`, which replaced the `grid_scores_` attribute in scikit-learn.
def report(results, n_top = 3):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results["rank_test_score"] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results["mean_test_score"][candidate],
                  results["std_test_score"][candidate]))
            print("Parameters: {0}".format(results["params"][candidate]))
            print("")

# Create the target and feature arrays
target = train["Survived"].values
forest_features = train[["Pclass", "Age", "Sex", "Fare", "SibSp", "Parch", "Embarked"]].values

# Add grid parameter code based on Databricks
param_grid = {"max_depth": [3, 10, None],
              "min_samples_split": [1.0, 2, 3, 10], 
              "min_samples_leaf": [1, 2, 3, 10], 
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"],
              "n_estimators": [10, 20, 40, 80, 160]}

# Execute the grid search
gs = GridSearchCV(RandomForestClassifier(), param_grid = param_grid)
start = time()
gs.fit(forest_features, target)
print("GridSearchCV took {:.2f} minutes for {:d} candidate settings.".format((time() - start) / 60, len(gs.cv_results_["params"])))
report(gs.cv_results_)

GridSearchCV took 12.00 minutes for 960 candidate settings.
Model with rank: 1
Mean validation score: 0.835 (std: 0.019)
Parameters: {'bootstrap': True, 'min_samples_leaf': 1, 'n_estimators': 40, 'criterion': 'gini', 'min_samples_split': 10, 'max_depth': None}

Model with rank: 2
Mean validation score: 0.834 (std: 0.029)
Parameters: {'bootstrap': True, 'min_samples_leaf': 2, 'n_estimators': 40, 'criterion': 'gini', 'min_samples_split': 10, 'max_depth': None}

Model with rank: 3
Mean validation score: 0.833 (std: 0.016)
Parameters: {'bootstrap': True, 'min_samples_leaf': 3, 'n_estimators': 10, 'criterion': 'gini', 'min_samples_split': 3, 'max_depth': None}



---

# Ensemble Methods at Scale
As we can see, `GridSearchCV` took $12$ minutes on an *m4.2xlarge* instance and provided us with the top $3$ candidates and their settings to try. By applying these suggestions we can very effectively find the best model as well as the optimum parameters to achieve the best score. Next, we apply the same procedure on a Spark cluster to parallelize the task across $5$ Spark workers.
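The run time follows directly from the size of the grid. The sketch below counts the combinations in the parameter grid used above and multiplies them by the number of cross-validation folds; the value of `cv_folds` is an assumption (the default depends on the scikit-learn version), not something reported in the output.

```python
from itertools import product

# The parameter grid used for the grid search above
param_grid = {"max_depth": [3, 10, None],
              "min_samples_split": [1.0, 2, 3, 10],
              "min_samples_leaf": [1, 2, 3, 10],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"],
              "n_estimators": [10, 20, 40, 80, 160]}

# Every combination of one value per parameter is one candidate setting
n_candidates = len(list(product(*param_grid.values())))

# Assumption: 3-fold cross-validation; each candidate is fitted once per fold
cv_folds = 3
n_fits = n_candidates * cv_folds

print("Candidate settings: {}".format(n_candidates))
print("Model fits with {}-fold CV: {}".format(cv_folds, n_fits))
```

Because every one of these fits is independent of the others, the work can be spread across several workers, which is what makes the search a good candidate for a cluster.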

## Provision EMR Cluster in Python

```
%%local
import boto3
connection = boto3.client(
    'emr',
    region_name='us-west-1',
    aws_access_key_id='<Your AWS Access Key>',
    aws_secret_access_key='<Your AWS Secret Key>',
)

cluster_id = connection.run_job_flow(
    Name='SparkaaS',
    LogUri='s3://chkrd/emr-log',
    ReleaseLabel='emr-5.4.0',
    Instances={
        'InstanceGroups': [
            {
                'Name': "Master nodes",
                'Market': 'ON_DEMAND',
                'InstanceRole': 'MASTER',
                'InstanceType': 'm1.large',
                'InstanceCount': 1,
            },
            {
                'Name': "Slave nodes",
                'Market': 'ON_DEMAND',
                'InstanceRole': 'CORE',
                'InstanceType': 'm1.large',
                'InstanceCount': 5,
            }
        ],
        'Ec2KeyName': '<Ec2 Keyname>',
        'KeepJobFlowAliveWhenNoSteps': True,
        'TerminationProtected': False,
        'Ec2SubnetId': '<Your Subnet ID>',
    },
    Steps=[],
    VisibleToAllUsers=True,
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
    Tags=[
        {
            'Key': 'tag_name_1',
            'Value': 'tag_value_1',
        },
        {
            'Key': 'tag_name_2',
            'Value': 'tag_value_2',
        },
    ],
)

print(cluster_id['JobFlowId'])
```
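
Provisioning is asynchronous, so the cluster is not usable the moment `run_job_flow` returns. The following is a minimal sketch of how one might wait for it and derive the endpoint to register with `sparkmagic`. The helper name `wait_for_livy_endpoint` is hypothetical, and the sketch assumes that Livy has been installed on the master node and listens on port `8998`, the port of the endpoint used in this notebook.

```python
def wait_for_livy_endpoint(emr_client, cluster_id, livy_port=8998):
    """Block until the EMR cluster is running, then return a Livy URL.

    `emr_client` is a boto3 EMR client such as `connection` above.
    Assumption: Livy is installed on the master node and listens on
    `livy_port`; this helper does not install or start it.
    """
    # 'cluster_running' is a built-in boto3 EMR waiter that polls the
    # cluster state until it is RUNNING or WAITING.
    waiter = emr_client.get_waiter('cluster_running')
    waiter.wait(ClusterId=cluster_id)

    # Once the cluster is up, its description includes the master's DNS name.
    cluster = emr_client.describe_cluster(ClusterId=cluster_id)['Cluster']
    return 'http://{0}:{1}'.format(cluster['MasterPublicDnsName'], livy_port)


# Example usage with the objects created above (requires AWS credentials):
# endpoint = wait_for_livy_endpoint(connection, cluster_id['JobFlowId'])
# print(endpoint)
```

The returned URL is only reachable if the master's security group allows inbound traffic on that port from the machine running the notebook.
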
## Provision EMR Cluster with AWS CLI
```bash
aws emr create-cluster \
  --termination-protected \
  --applications Name=Hadoop Name=Hive \
  --ec2-attributes '{"KeyName":"devenv-key","InstanceProfile":"EMR_EC2_DefaultRole","SubnetId":"subnet-39a22e5e","EmrManagedSlaveSecurityGroup":"sg-86bb82fe","EmrManagedMasterSecurityGroup":"sg-83bb82fb"}' \
  --release-label emr-5.4.0 \
  --log-uri 's3n://chkrd/elasticmapreduce/' \
  --steps '[
    {"Args":["nohup","/home/hadoop/livy-server-0.3.0/bin/livy-server",">","/dev/null","2>/tmp/livy.log","&"],"Type":"CUSTOM_JAR","ActionOnFailure":"CONTINUE","Jar":"command-runner.jar","Properties":"","Name":"Start Livy"},
    {"Args":["aws","s3","cp","s3://chkrd/artifacts/livy.conf","/home/hadoop/livy-server-0.3.0/conf/"],"Type":"CUSTOM_JAR","ActionOnFailure":"CONTINUE","Jar":"command-runner.jar","Properties":"","Name":"Copy Livy Config"},
    {"Args":["aws","s3","cp","s3://chkrd/artifacts/livy-env.sh","/home/hadoop/livy-server-0.3.0/conf/"],"Type":"CUSTOM_JAR","ActionOnFailure":"CONTINUE","Jar":"command-runner.jar","Properties":"","Name":"Copy Livy Environment"}
  ]' \
  --instance-groups '[
    {"InstanceCount":1,"InstanceGroupType":"MASTER","InstanceType":"m3.xlarge","Name":"Master - 1"},
    {"InstanceCount":2,"InstanceGroupType":"CORE","InstanceType":"m3.xlarge","Name":"Core - 2"}
  ]' \
  --auto-scaling-role EMR_AutoScaling_DefaultRole \
  --bootstrap-actions '[{"Path":"s3://chkrd/artifacts/emr_base_config.sh","Name":"EMR Base Configuration"}]' \
  --service-role EMR_DefaultRole \
  --enable-debugging \
  --name 'SparkaaS' \
  --scale-down-behavior TERMINATE_AT_INSTANCE_HOUR \
  --region us-west-2
```

## Load `sparkmagic` Jupyter Extension

In [1]:
%load_ext sparkmagic.magics

## Configure EMR Endpoint

In [2]:
%manage_spark

Added endpoint http://54.190.208.32:8998
Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
0,application_1489684169355_0001,pyspark,idle,Link,Link,✔


SparkSession available as 'spark'.


In [3]:
%%spark
sc

<pyspark.context.SparkContext object at 0x7f210b815f50>

In [4]:
%%spark
sqlContext

<pyspark.sql.context.SQLContext object at 0x7f210b523f10>

## Ensemble Methods using Spark-as-a-Service

In [8]:
%%spark
# Import necessary packages
import warnings
warnings.filterwarnings('ignore')
from operator import itemgetter
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from time import time
import numpy as np
from sklearn import tree
import pandas as pd

# Reload the train and test datasets to create two DataFrames
train_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
train = pd.read_csv(train_url)
test_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"
test = pd.read_csv(test_url)

# Preprocess the Data
# Fill missing values and encode the categorical columns as integers.
# Whole-column assignment avoids pandas chained-assignment pitfalls,
# where train["Sex"][mask] = ... may silently modify a copy.
train["Sex"] = train["Sex"].map({"male": 0, "female": 1})
train["Embarked"] = train["Embarked"].fillna("S")
train["Age"] = train["Age"].fillna(train["Age"].median())
train["Embarked"] = train["Embarked"].map({"S": 0, "C": 1, "Q": 2})

# Databricks utility function to report the top 3 best scores
def report(grid_scores, n_top = 3):
    top_scores = sorted(grid_scores, key = itemgetter(1), reverse = True)[:n_top]
    for i, score in enumerate(top_scores):
        print("Model with rank: {0}".format(i + 1))
        print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
              score.mean_validation_score,
              np.std(score.cv_validation_scores)))
        print("Parameters: {0}".format(score.parameters))
        print("")

# Create the target and feature arrays.
target = train["Survived"].values
forest_features = train[["Pclass", "Age", "Sex", "Fare", "SibSp", "Parch", "Embarked"]].values

# Add grid parameter code based on Databricks
param_grid = {"max_depth": [3, 10, None],
              "min_samples_split": [1.0, 2, 3, 10], 
              "min_samples_leaf": [1, 2, 3, 10], 
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"],
              "n_estimators": [10, 20, 40, 80, 160]}

# Import grid search for Spark
from spark_sklearn import GridSearchCV as SparkSearch

# Execute the grid search on Spark
Spark_gs = SparkSearch(sc, RandomForestClassifier(), param_grid = param_grid)
start = time()

Spark_gs.fit(forest_features, target)
print("Spark GridSearchCV took {:.2f} seconds for {:d} candidate settings.".format(time() - start, len(Spark_gs.grid_scores_)))
report(Spark_gs.grid_scores_)

Spark GridSearchCV took 201.13 seconds for 960 candidate settings.
Model with rank: 1
Mean validation score: 0.835 (std: 0.019)
Parameters: {'bootstrap': True, 'min_samples_leaf': 1, 'n_estimators': 40, 'criterion': 'entropy', 'min_samples_split': 3, 'max_depth': 10}

Model with rank: 2
Mean validation score: 0.834 (std: 0.025)
Parameters: {'bootstrap': True, 'min_samples_leaf': 1, 'n_estimators': 20, 'criterion': 'entropy', 'min_samples_split': 3, 'max_depth': 10}

Model with rank: 3
Mean validation score: 0.833 (std: 0.019)
Parameters: {'bootstrap': False, 'min_samples_leaf': 2, 'n_estimators': 20, 'criterion': 'gini', 'min_samples_split': 3, 'max_depth': 10}
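
The `report` helper used in this cell reads `grid_scores_`, an attribute that newer scikit-learn releases replaced with the `cv_results_` dictionary. As a hedged sketch, the same top-3 report could be produced from `cv_results_` as shown below. The function names `top_candidates` and `report_cv_results` are hypothetical, and whether the installed `spark_sklearn` version exposes `cv_results_` is an assumption to verify before relying on it.

```python
def top_candidates(cv_results, n_top=3):
    """Return (rank, mean, std, params) tuples for the n_top best candidates.

    `cv_results` is expected to look like scikit-learn's `cv_results_`:
    a mapping with 'rank_test_score', 'mean_test_score',
    'std_test_score' and 'params' entries of equal length.
    """
    # Order candidate indices by their rank (1 is best); ties keep input order.
    order = sorted(range(len(cv_results['params'])),
                   key=lambda i: cv_results['rank_test_score'][i])
    return [(cv_results['rank_test_score'][i],
             cv_results['mean_test_score'][i],
             cv_results['std_test_score'][i],
             cv_results['params'][i])
            for i in order[:n_top]]


def report_cv_results(cv_results, n_top=3):
    """Print the best candidates in the same layout as `report` above."""
    for rank, mean, std, params in top_candidates(cv_results, n_top):
        print("Model with rank: {0}".format(rank))
        print("Mean validation score: {0:.3f} (std: {1:.3f})".format(mean, std))
        print("Parameters: {0}".format(params))
        print("")


# Example usage, assuming a fitted search object that has `cv_results_`:
# report_cv_results(Spark_gs.cv_results_)
```
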

By running `GridSearchCV` on a cluster of $5$ Spark nodes, we executed the same ensemble task in about $201$ seconds (the time reported in the output above), compared with the $12$ minutes it took on a single instance. This demonstrates that distributing the candidate models across multiple nodes helps to narrow down the best fit without trying each model manually and wasting time.
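
As a quick sanity check on these figures, the sketch below counts the candidates in the parameter grid and estimates the speedup. It takes the $12$ minutes quoted for the single instance and the `201.13` seconds printed in the Spark output as given; both are single measurements, so the resulting ratio is only indicative.

```python
from itertools import product

# The same grid that is passed to the Spark grid search above.
param_grid = {"max_depth": [3, 10, None],
              "min_samples_split": [1.0, 2, 3, 10],
              "min_samples_leaf": [1, 2, 3, 10],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"],
              "n_estimators": [10, 20, 40, 80, 160]}

# A grid search evaluates every combination of the listed values.
n_candidates = sum(1 for _ in product(*param_grid.values()))

# Timings as reported in this notebook (assumed to be comparable runs).
single_instance_seconds = 12 * 60
spark_seconds = 201.13
speedup = single_instance_seconds / spark_seconds

print("Candidate settings: {0}".format(n_candidates))  # 960
print("Approximate speedup: {0:.2f}x".format(speedup))  # 3.58x
```
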