## Setting Up Your Environment

In [None]:
ENV["OS_AUTH_URL"]="https://keystone-yeg.cloud.cybera.ca:5000/v2.0"
ENV["OS_TENANT_NAME"]="julia_workshop"
ENV["OS_PROJECT_NAME"]="julia_workshop"
ENV["OS_USERNAME"]="julia_workshop"
ENV["OS_PASSWORD"]="Y2RhM2Ni"

include(joinpath("..", "src", "lib", "Config.jl"))

## Get dataset

In [None]:
titanic_data = Dataset.fetch(:titanic)

## Load Modules

In [None]:
using DataFrames
Pkg.add("DecisionTree")
using DecisionTree
using FreqTables
using StatsBase
using Titanic

## Predictive Modeling: Data Preparation
### Cleaning your Data

In [None]:
describe(titanic_data)

#### Delete unused columns
Remove the PassengerID, Name, Ticket, and Cabin attributes (columns 1, 4, 9, and 11).

In [None]:
titanic_data_clean = delete!(titanic_data,[1,4,9,11])
names(titanic_data_clean)

#### Dealing with missing values
Age: replace NA values with the mean age. The next four cells time equivalent ways of computing that mean, so you can compare their performance.

In [None]:
@time mean(titanic_data_clean[:Age],skipna=true)

In [None]:
@time mean(dropna(titanic_data_clean[:Age]))

In [None]:
@time mean(titanic_data_clean[!isna(titanic_data_clean[:Age]),:Age])

In [None]:
@time mean(titanic_data_clean[!isna(titanic_data_clean[:Age]),:][:Age])

In [None]:
titanic_data_clean[isna(titanic_data_clean[:Age]), :Age] = mean(titanic_data_clean[:Age],skipna=true)
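
The mean-imputation step can also be sketched on a plain vector, independent of the workshop data. This is a minimal illustration, assuming a recent Julia (1.x) where `missing` plays the role of `NA`; the `ages` values are made up.

```julia
using Statistics  # provides mean

# Hypothetical ages; `missing` stands in for NA here (assumption: Julia 1.x)
ages = [22.0, missing, 38.0, 26.0, missing, 35.0]

# Mean over the observed values only
mean_age = mean(skipmissing(ages))

# Replace every missing entry with that mean
ages_filled = coalesce.(ages, mean_age)
```

Computing the mean once and reusing it avoids recomputing it for every replaced entry.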

Embarked: replace NA values with the most frequent value.

In [None]:
countmap(titanic_data_clean[:Embarked])

In [None]:
titanic_data_clean[isna(titanic_data_clean[:Embarked]),:Embarked] = "S"
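
Rather than reading the most frequent value off the `countmap` output and hard-coding it, the value can be derived in code. A minimal sketch on made-up port codes, assuming Julia 1.x with `missing` in place of `NA` and using only Base:

```julia
# Hypothetical port codes; `missing` stands in for NA (assumption: Julia 1.x)
embarked = ["S", "C", missing, "S", "Q", "S", missing, "C"]

# Count how often each observed value occurs
counts = Dict{String,Int}()
for port in skipmissing(embarked)
    counts[port] = get(counts, port, 0) + 1
end

# Sort the value => count pairs by count, descending, and take the top value
most_frequent = first(sort(collect(counts), by = last, rev = true)).first

# Fill the gaps with that value
embarked_filled = coalesce.(embarked, most_frequent)
```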

In [None]:
describe(titanic_data_clean)

### Feature engineering
Feature engineering is the creation of new, more informative variables from your source data. These derived features are fed into the machine learning algorithm that builds your predictive model.

In [None]:
@enum ChildType Child=0 Adult=1

titanic_data_clean[:Child] = to_enum(ChildType, map(titanic_data_clean[:Age]) do x
  if isna(x)
    NA
  elseif x < 13
    Child
  else
    Adult
  end
end)
head(titanic_data_clean,20)
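
The same age-to-category rule can be written as a small standalone function. This is a sketch only: the enum and function names below are hypothetical (chosen so they do not clash with `ChildType`), `missing` stands in for `NA`, and it does not use the `to_enum` helper from the cell above.

```julia
# Hypothetical enum for illustration; values mirror the 0/1 coding used above
@enum AgeGroup ChildGroup=0 AdultGroup=1

# Unknown ages stay missing; otherwise apply the under-13 cutoff
age_group(age) = ismissing(age) ? missing : (age < 13 ? ChildGroup : AdultGroup)

ages = [4.0, 29.5, missing, 13.0, 12.9]
groups = map(age_group, ages)
```

Keeping the cutoff inside one function makes it easy to change the threshold later without touching the rest of the pipeline.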

### Holdout dataset
Split your dataset: roughly 2/3 for training and 1/3 for testing.

In [None]:
nrow(titanic_data_clean)

In [None]:
training_size = convert(Integer,round(nrow(titanic_data_clean)*0.66))
all_titanic_index = 1:nrow(titanic_data_clean)
titanic_train_index = sample(all_titanic_index,training_size,replace=false)
titanic_train = titanic_data_clean[titanic_train_index,:]
nrow(titanic_train)

In [None]:
titanic_test_index = setdiff(all_titanic_index, titanic_train_index)
titanic_test = titanic_data_clean[titanic_test_index,:]
nrow(titanic_test)
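
An alternative way to build the two index sets is to shuffle the row indices once and cut the permutation in two. A minimal sketch, assuming Julia 1.x and the `Random` standard library; the row count and seed are illustrative assumptions, not values taken from the workshop data.

```julia
using Random

n_rows = 891                      # assumed row count, for illustration only
rng = MersenneTwister(42)         # fixed seed so the split is reproducible
shuffled = randperm(rng, n_rows)  # random permutation of 1:n_rows

# First two thirds for training, the remainder for testing
n_train = round(Int, n_rows * 2 / 3)
train_index = shuffled[1:n_train]
test_index = shuffled[n_train+1:end]
```

Because both sets come from one permutation, they cannot overlap, and together they cover every row exactly once.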

### Convert data
For the DecisionTree package, the input data must be converted to arrays.

In [None]:
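# Labels: the Survived column as an integer vector
train_array_survived = convert(Array{Int64,1},titanic_train[:Survived])
# Features: every column except Survived (column 1)
train_array = convert(Array,titanic_train[:,[2,3,4,5,6,7,8,9]])
# Use the same feature columns for the test set so it matches the training layout
test_array  = convert(Array,titanic_test[:,[2,3,4,5,6,7,8,9]])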

## Build a predictive model
### Decision Tree example

In [None]:
dt_model = build_tree(train_array_survived, train_array)
dt_model

In [None]:
print_tree(dt_model,4)

### Random Forest classification model
Train a random forest with 3 features chosen at each random split (about (n<sub>features</sub>)<sup>0.5</sup> for the 8 feature columns), 100 trees, and a subsampling ratio of 1.0.

In [None]:
rf_model = build_forest(train_array_survived, train_array, 3, 100, 1.0)
rf_model
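
The value 3 follows from the square-root rule of thumb for the number of features per split. A short sketch of that arithmetic, where `n_features` and `n_subfeatures` are illustrative names:

```julia
# Eight feature columns are passed to the forest (columns 2 through 9)
n_features = 8

# Square-root heuristic, rounded to the nearest integer
n_subfeatures = round(Int, sqrt(n_features))
```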

## Evaluate your predictive model
### Cross-validation for evaluating a classifier model's performance
Run n-fold cross-validation. The inputs are, in order: labels, features, nsubfeatures, ntrees, nfolds, and partialsampling.
Here, n-fold refers to the number of subsets the data is split into. Each subset is used once as the test set while the remaining data is used for training.

In [None]:
accuracy = nfoldCV_forest(train_array_survived, train_array, 3, 50, 5, 1.0)
accuracy
mean(accuracy)
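
To make the fold mechanics concrete, the sketch below partitions ten sample indices into five folds and lists the train/test split for each. It illustrates the general idea only and is not a description of how `nfoldCV_forest` partitions data internally; the sizes and seed are assumptions.

```julia
using Random

n_samples = 10
n_folds = 5
indices = randperm(MersenneTwister(1), n_samples)

# Round-robin assignment: fold k takes positions k, k + n_folds, ...
folds = [indices[k:n_folds:end] for k in 1:n_folds]

# Each fold serves once as the test set; everything else is training data
for (k, test_idx) in enumerate(folds)
    train_idx = setdiff(indices, test_idx)
    println("fold ", k, ": test = ", test_idx, ", train size = ", length(train_idx))
end
```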

## Apply your predictive model
### Use your predictive model with your holdout test data

In [None]:
test_array_predict = convert(Array{Int64,1},apply_forest(rf_model, test_array))

### Evaluate your predicted results
#### Confusion matrix
Use a confusion matrix to examine actual vs. predicted results and calculate summary statistics, where:

Accuracy = (TP+TN)/(TP+TN+FP+FN)

The Kappa statistic compares the accuracy of the model to the accuracy expected from a random classifier. A value of 1 indicates perfect agreement, and a value of 0 indicates agreement no better than chance.

Kappa = (Accuracy-randomAccuracy)/(1-randomAccuracy)

randomAccuracy = ((TN+FP)\*(TN+FN)+(FN+TP)\*(FP+TP))/(total\*total)

In [None]:
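# Actual labels come from the holdout set; predictions come from the forest above
test_labels_actual = convert(Array{Int64,1},titanic_test[:Survived])
CM = confusion_matrix(test_labels_actual,test_array_predict)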
CM
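
As a worked example of the accuracy and Kappa formulas, with the chance-agreement term written using explicit parentheses, the sketch below plugs in made-up counts. The counts and variable names are hypothetical and unrelated to the workshop results.

```julia
# Hypothetical confusion-matrix counts
TP, TN, FP, FN = 40, 45, 5, 10
total = TP + TN + FP + FN

# Share of correct predictions
acc = (TP + TN) / total

# Agreement expected by chance, from the row and column totals
random_acc = ((TN + FP) * (TN + FN) + (FN + TP) * (FP + TP)) / (total * total)

# Improvement over chance, scaled by the maximum possible improvement
kappa = (acc - random_acc) / (1 - random_acc)
```

With these counts the accuracy is 0.85, the chance agreement is 0.5, and Kappa comes out at about 0.7.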

#### Precision
Calculate precision, TP/(TP+FP). A high-precision model produces few false positives. The cell below treats the first class in the confusion matrix as the positive class.

In [None]:
precision_metric = CM.matrix[1]/sum(CM.matrix[:,1])
precision_metric

#### Recall
Calculate recall, TP/(TP+FN). A high-recall model produces few false negatives.

In [None]:
recall_metric = CM.matrix[1]/sum(CM.matrix[1,:])
recall_metric

#### F1 Score
Calculate the F1 score, the harmonic mean of precision and recall: 2\*TP/(2\*TP+FP+FN). A high F1 score means the model has both few false positives and few false negatives.

In [None]:
f1_metric = 2*CM.matrix[1]/(2*CM.matrix[1]+CM.matrix[1,2]+CM.matrix[2,1])
f1_metric
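
The three metrics can be checked against each other on a small hand-made matrix. This sketch assumes rows hold actual classes, columns hold predicted classes, and the first class is the positive one; the matrix values and variable names are hypothetical.

```julia
# Hypothetical 2x2 matrix (assumption: rows = actual, columns = predicted)
cm = [40 10;
       5 45]

tp = cm[1, 1]   # actual positive, predicted positive
fn = cm[1, 2]   # actual positive, predicted negative
fp = cm[2, 1]   # actual negative, predicted positive

precision_score = tp / (tp + fp)
recall_score = tp / (tp + fn)

# Harmonic mean of precision and recall
f1_score = 2 * precision_score * recall_score / (precision_score + recall_score)
```

The harmonic-mean form gives the same number as the count-based expression 2\*TP/(2\*TP+FP+FN), which is a quick way to sanity-check an implementation.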