## Setting Up Your Environment

In [None]:
ENV["OS_AUTH_URL"]="https://keystone-yeg.cloud.cybera.ca:5000/v2.0"
ENV["OS_TENANT_NAME"]="julia_workshop"
ENV["OS_PROJECT_NAME"]="julia_workshop"
ENV["OS_USERNAME"]=""
ENV["OS_PASSWORD"]=""

include(joinpath("..", "src", "lib", "Config.jl"))

## Get dataset

In [None]:
titanic_data_clean = Dataset.fetch(:titanic_clean)

## Load Modules

In [None]:
using DataFrames
using DecisionTree
using FreqTables
using StatsBase
using Titanic

## Predictive Modeling: Data Preparation
### Cleaning your Data

In [None]:
describe(titanic_data_clean)

### Delete unused columns
Remove the PassengerId, Name, Ticket, and Cabin attributes.

In [None]:
delete!(titanic_data_clean, [:PassengerId,:Name,:Ticket,:Cabin])
names(titanic_data_clean)

At this point you could do some feature engineering to add value (i.e. columns) to your dataset. See the homework exercise at the bottom of the notebook to test your skills.

### Holdout dataset
Split your dataset: roughly 2/3 for training and 1/3 for testing.

In [None]:
nrow(titanic_data_clean)

In [None]:
# Number of training rows: 66% of the dataset, rounded to the nearest integer
training_size = round(Int, nrow(titanic_data_clean)*0.66)
all_titanic_index = 1:nrow(titanic_data_clean)
titanic_train_index = sample(all_titanic_index,training_size,replace=false)
titanic_train = titanic_data_clean[titanic_train_index,:]
nrow(titanic_train)

In [None]:
titanic_test_index = setdiff(all_titanic_index, titanic_train_index)
titanic_test = titanic_data_clean[titanic_test_index,:]
nrow(titanic_test)
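
The holdout split above can also be sketched with only Julia's standard library. This is a minimal illustration, assuming Julia 1.x; `holdout_indices` is a hypothetical helper (not part of any package used in this notebook), and the row count and seed are made up.

```julia
using Random

# Hypothetical helper: split row indices 1:n into disjoint training and test sets.
function holdout_indices(n::Integer, train_fraction::Real; rng=Random.default_rng())
    n_train = round(Int, n * train_fraction)   # number of training rows
    shuffled = randperm(rng, n)                # random permutation of 1:n
    train_idx = sort(shuffled[1:n_train])
    test_idx = sort(shuffled[n_train+1:end])
    return train_idx, test_idx
end

# Made-up size and seed, for illustration only.
train_idx, test_idx = holdout_indices(30, 0.66; rng=MersenneTwister(1))
println(length(train_idx), " training rows, ", length(test_idx), " test rows")
```

Shuffling once and slicing guarantees that the two index sets never overlap and together cover every row.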

### Convert data
For the DecisionTree package, the input data must be converted to arrays.

In [None]:
train_array_survived = convert(Array,titanic_train[:Survived])
train_array = convert(Array,titanic_train[:,[2,3,4,5,6,7,8]])

In [None]:
test_array_survived = convert(Array,titanic_test[:Survived])
test_array = convert(Array{Any},titanic_test[:,[2,3,4,5,6,7,8]])

## Build a predictive model
### Decision Tree example

In [None]:
dt_model = build_tree(train_array_survived, train_array)
dt_model

In [None]:
print_tree(dt_model,4)

### Random Forest classification model
Train a random forest with 3 features chosen at each random split (approximately the square root of the number of features, (n<sub>features</sub>)<sup>0.5</sup>), 100 trees, and a subsampling ratio of 1.0.

In [None]:
rf_model = build_forest(train_array_survived, train_array, 3, 100, 1.0)
rf_model
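
The value 3 is consistent with the square-root rule of thumb. A quick check, assuming the 7 feature columns (columns 2 through 8) selected in the conversion step:

```julia
n_features = 7                                  # feature columns 2 through 8
n_subfeatures = round(Int, sqrt(n_features))    # sqrt(7) is about 2.65
println(n_subfeatures)                          # prints 3
```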

## Evaluate your predictive model
### Cross-validation for evaluating a classifier model's performance
Run n-fold cross-validation. For `nfoldCV_forest`, the inputs are labels, features, n_randomfeatures, n_trees, n_folds, and partialsampling (optional).
Here n_folds is the number of subsets the data is divided into. Each subset is used once as the test set while the remaining data is used for training.
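
To make the fold idea concrete, here is a minimal sketch of how indices might be partitioned. It assumes Julia 1.x and uses only the standard library; `fold_indices` is a hypothetical helper and is not how DecisionTree necessarily implements its folds.

```julia
using Random

# Hypothetical helper: partition indices 1:n into n_folds disjoint test sets.
function fold_indices(n::Integer, n_folds::Integer; rng=Random.default_rng())
    shuffled = randperm(rng, n)
    # Fold k takes every n_folds-th shuffled element, starting at position k.
    return [sort(shuffled[k:n_folds:n]) for k in 1:n_folds]
end

n = 10   # made-up number of rows
folds = fold_indices(n, 4; rng=MersenneTwister(1))
for (k, fold_test) in enumerate(folds)
    fold_train = setdiff(1:n, fold_test)   # everything outside fold k is training data
    println("fold ", k, ": test = ", fold_test, ", train size = ", length(fold_train))
end
```

Every row appears in exactly one test fold, so each observation is used for testing once and for training in the other folds.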

#### Cross Validation Results
`Accuracy = (TP+TN)/(TP+TN+FP+FN)`

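The Kappa statistic compares the accuracy of the system to the accuracy expected from a random system. A value of 1 indicates perfect agreement, 0 indicates performance no better than chance, and negative values indicate performance worse than chance.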

`Kappa = (Accuracy-randomAccuracy)/(1-randomAccuracy)`

`randomAccuracy = ((TN+FP)*(TN+FN)+(FN+TP)*(FP+TP))/(TP+FP+TN+FN)^2`
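
As a worked example, the sketch below computes accuracy and Kappa from four counts. The counts are made up and `accuracy_and_kappa` is a hypothetical helper. Here randomAccuracy is computed as a proportion, so the sum of products is divided by the squared total.

```julia
# Hypothetical helper: accuracy and kappa from confusion-matrix counts.
function accuracy_and_kappa(TP, FN, FP, TN)
    total = TP + FN + FP + TN
    accuracy = (TP + TN) / total
    # Chance agreement: product of actual and predicted marginals for each class.
    random_accuracy = ((TN + FP) * (TN + FN) + (FN + TP) * (FP + TP)) / total^2
    kappa = (accuracy - random_accuracy) / (1 - random_accuracy)
    return accuracy, kappa
end

# Made-up counts, for illustration only.
accuracy, kappa = accuracy_and_kappa(40, 10, 5, 45)
println("accuracy = ", accuracy, ", kappa = ", kappa)
```

With these counts the accuracy is 0.85, the chance agreement is 0.5, and Kappa is about 0.7.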

In [None]:
nfoldCV_tree(train_array_survived, train_array, 3, 4)

In [None]:
nfoldCV_forest(train_array_survived, train_array, 3, 50, 4)

### Compare cross-validation results
Note: these are sample results for explanatory purposes only.

#### Accuracy
```
|---------|-------|-------|-------|-------|
| CV Fold |  DT   | RF    | RF2   | RF3   |
|---------|-------|-------|-------|-------|
|    1    | 0.721 | 0.816 | 0.501 | 0.901 | 
|---------|-------|-------|-------|-------|
|    2    | 0.748 | 0.830 | 0.502 | 0.851 |
|---------|-------|-------|-------|-------|
|    3    | 0.796 | 0.837 | 0.503 | 0.721 |
|---------|-------|-------|-------|-------|
|    4    | 0.741 | 0.864 | 0.504 | 0.805 |
|---------|-------|-------|-------|-------|
|   Mean  | 0.752 | 0.837 | 0.503 | 0.820 |
|---------|-------|-------|-------|-------|
|    SD   | 0.032 | 0.020 | 0.001 | 0.076 |
|---------|-------|-------|-------|-------|
```
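
The Mean and SD rows can be reproduced from the fold values. A minimal sketch using the DT column of the sample table, assuming Julia 1.x with the `Statistics` standard library:

```julia
using Statistics

# DT column of the sample table above.
dt_accuracy = [0.721, 0.748, 0.796, 0.741]

dt_mean = mean(dt_accuracy)
dt_sd = std(dt_accuracy)   # sample standard deviation (divides by n - 1)
println("mean = ", dt_mean, ", sd = ", dt_sd)
```

The SD in the table matches the sample standard deviation, which divides by n - 1 rather than n.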


## Apply your predictive model
### Use your predictive model with your holdout test data

In [None]:
test_array_predict = convert(Array{Int64,1},apply_forest(rf_model,test_array))

### Evaluate your predicted results
#### Confusion matrix
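Use a confusion matrix to examine actual vs predicted results and calculate summary statistics. Rows are actual classes and columns are predicted classes. The indexing used in the metric cells that follow treats the class in the first row and column as the positive ("Yes") class, so check which `Survived` value that is before interpreting the results.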

```
|---------------|---------------|---------------|
|               | Predicted Yes | Predicted No  |
|---------------|---------------|---------------|
|   Actual Yes  |       TP      |      FN       | 
|---------------|---------------|---------------|
|   Actual No   |       FP      |      TN       |
|---------------|---------------|---------------|
```

In [None]:
CM = confusion_matrix(test_array_survived,test_array_predict)
CM
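
To see where the four cells come from, the counts can be tallied by hand. This is a sketch with made-up label vectors, in which 1 is assumed to be the positive ("Yes") class. It does not use the DecisionTree API.

```julia
# Made-up labels, for illustration only: 1 is the positive class, 0 the negative.
actual    = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]

TP = count((actual .== 1) .& (predicted .== 1))   # actual yes, predicted yes
FN = count((actual .== 1) .& (predicted .== 0))   # actual yes, predicted no
FP = count((actual .== 0) .& (predicted .== 1))   # actual no, predicted yes
TN = count((actual .== 0) .& (predicted .== 0))   # actual no, predicted no

# Same layout as the table above: rows are actual, columns are predicted.
counts = [TP FN; FP TN]
println(counts)
```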

## Other Metrics to Evaluate Your Model

#### Precision
Calculate precision: the fraction of predicted positives that are actually positive. A high-precision model produces few false positives.

`Precision = TP/(TP+FP)`

In [None]:
precision_metric = CM.matrix[1]/sum(CM.matrix[:,1])
precision_metric

#### Recall
Calculate recall: the fraction of actual positives that the model predicts as positive. A high-recall model produces few false negatives.

`Recall = TP/(TP+FN)`

In [None]:
recall_metric = CM.matrix[1]/sum(CM.matrix[1,:])
recall_metric

#### F1 Score
Calculate the F1 score, the harmonic mean of precision and recall. It is high only when both false positives and false negatives are few.

`F1 = 2TP/(2TP+FN+FP)`

In [None]:
f1_metric = 2*CM.matrix[1]/(2*CM.matrix[1]+CM.matrix[1,2]+CM.matrix[2,1])
f1_metric
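
The three formulas above can be checked against each other with plain counts. A minimal sketch, in which the helper names and the counts are hypothetical:

```julia
# Hypothetical helpers implementing the three formulas above.
precision_score(TP, FP) = TP / (TP + FP)
recall_score(TP, FN) = TP / (TP + FN)
f1_score(TP, FP, FN) = 2 * TP / (2 * TP + FN + FP)

# Made-up counts, for illustration only.
TP, FP, FN = 30, 10, 20
p = precision_score(TP, FP)
r = recall_score(TP, FN)
f1 = f1_score(TP, FP, FN)

# F1 should agree with the harmonic mean of precision and recall.
println(f1 ≈ 2 * p * r / (p + r))
```

Writing the metrics as functions of counts keeps them independent of how a particular confusion matrix is indexed.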

## Homework Exercise
Recall the concept of feature engineering.

### Feature engineering
The term feature engineering refers to creating value-added data from your data sources, which is then fed into your machine learning algorithm to train your predictive model.

#### Q1. How can you add the ChildType Feature from Lesson 2 to train your predictive random forest model?


#### Q2. Does this feature improve the accuracy of your predictive model when applied to your test data?