# Using R to Build and Evaluate a Basic Decision Tree Model

First, we need to import the libraries required for our analysis.

In [None]:
library(rpart)
library(rattle)
library(rpart.plot)
library(RColorBrewer)
library(caret)
library(repr)
library(tidyverse)

options(repr.plot.width=15, repr.plot.height=15)

In [None]:
## Get the data from a csv file

titanic <- read.csv("titanic.csv")

### Examine Data

You can print the dataframe by running a code block that contains only its name, in this case ```titanic```. You can also use the ```summary``` command to see summary statistics for each column.

In [None]:
## Examine dataframe

titanic

Looking at your data is critical before you start any analysis. In this dataset, the "Survived" column indicates whether the individual survived ("1") or perished ("0"). So, "1" is the positive case.

In [None]:
summary(titanic)
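
Since "1" is the positive case, it can help to see how often each class occurs before modeling. The sketch below uses a small made-up vector, `survived_toy`, standing in for `titanic$Survived`; the values are hypothetical and are not taken from the dataset.

```R
## Hypothetical 0/1 outcomes standing in for titanic$Survived
survived_toy <- c(0, 0, 1, 0, 1, 0, 0, 1)

## Count of each class
class_counts <- table(survived_toy)
class_counts

## Share of each class (the proportions sum to 1)
class_share <- prop.table(class_counts)
class_share
```

On the real data, the same two calls on `titanic$Survived` would show the share of passengers in the positive class.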

### Set up Training and Testing Datasets

Now we need to organize our data for modeling. Remember that you have to both train and evaluate your model, so the first step is splitting the data into training and testing (evaluation) dataframes.

In [None]:
## 75% sample for training data
sample_size <- floor(0.75 * nrow(titanic))

If we want to compare models directly, the random sample must be the same each time. We ensure this by setting a seed, which initializes the random number generator used to draw the sample. Anyone who uses the same seed with the same code and the same version of R will get the same random sample (the default sampling algorithm changed in R 3.6.0).

In [None]:
## Use seed to make models reproducible
set.seed(123)

## Determine the row numbers to sample
train_split <- sample(seq_len(nrow(titanic)), size = sample_size)
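
To see what the seed does, here is a minimal sketch that draws the same sample twice. It uses the integers 1 to 100 rather than the Titanic rows, and the names `first_draw` and `second_draw` are only for illustration.

```R
## Draw 10 numbers from 1..100 after setting the seed
set.seed(123)
first_draw <- sample(seq_len(100), size = 10)

## Reset the seed and draw again
set.seed(123)
second_draw <- sample(seq_len(100), size = 10)

## TRUE: resetting the seed reproduces the draw exactly
identical(first_draw, second_draw)
```

Without the second `set.seed(123)`, the second draw would continue from the generator's current state and would generally differ from the first.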

The following code block takes our original dataframe and separates it into training and testing datasets based on the sample size we set earlier.

In [None]:
## Split the data into 75% training and 25% testing
train <- titanic[train_split, ]
test <- titanic[-train_split, ]

## Validate that the dataframes are correct

## cat() already separates its arguments with a space
cat("There are", nrow(train), "rows in the training data.\n")
cat("There are", nrow(test), "rows in the testing data.\n")
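
Beyond row counts, a quick sanity check is that no row ends up in both sets. The sketch below repeats the same split on a small made-up dataframe, `toy`, with an `id` column so that overlap is easy to test; all names and values here are hypothetical.

```R
## Hypothetical 20-row dataframe standing in for titanic
toy <- data.frame(id = 1:20, x = (1:20)^2)

set.seed(123)
toy_size  <- floor(0.75 * nrow(toy))
toy_split <- sample(seq_len(nrow(toy)), size = toy_size)

## Positive indices select the training rows, negative indices drop them
toy_train <- toy[toy_split, ]
toy_test  <- toy[-toy_split, ]

## No id appears in both sets
length(intersect(toy_train$id, toy_test$id)) == 0

## Together the two sets cover every row
nrow(toy_train) + nrow(toy_test) == nrow(toy)
```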

### Use the Training Dataset to Build the Model

We are going to use the R library ```rpart``` to create our decision tree. You can read more about plotting ```rpart``` trees here: http://www.milbo.org/doc/prp.pdf

In R, models are specified with a formula, which resembles a standard linear equation. The predicted (dependent) variable goes on the left, ```~``` plays the role of the equals sign, and the independent variables go on the right, separated by ```+```.

```R
predicted_variable ~ independent_variable_1 + independent_variable_2
```
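
A formula is an ordinary R object, so you can store it in a variable and inspect it before handing it to a modeling function. The short sketch below uses three of the Titanic column names; `toy_formula` is just an illustrative name.

```R
## Store a formula in a variable
toy_formula <- Survived ~ Pclass + Sex + Age

## The object has class "formula"
class(toy_formula)

## Variable names in the formula, dependent variable first
all.vars(toy_formula)
```

Nothing is evaluated at this point. The column names are only looked up when the formula is passed to a function together with a `data` argument.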



In [None]:
## Build a decision tree model
my_tree <- rpart(
  Survived ~ Pclass + Sex + Age + Siblings.Spouses.Aboard +
    Parents.Children.Aboard + Fare,
  data = train,
  method = "class",
  cp = 0.01
)
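
The `cp` argument is the complexity parameter: a split is only attempted if it improves the overall fit by at least that factor, so larger values produce smaller trees. The sketch below illustrates this on `kyphosis`, a small example dataset bundled with `rpart`, rather than on the Titanic data; the names `loose_tree` and `strict_tree` and the two `cp` values are chosen only for illustration.

```R
library(rpart)

## A small cp allows many splits
loose_tree <- rpart(Kyphosis ~ Age + Number + Start,
                    data = kyphosis, method = "class", cp = 0.001)

## A large cp rejects splits that do not improve the fit enough
strict_tree <- rpart(Kyphosis ~ Age + Number + Start,
                     data = kyphosis, method = "class", cp = 0.5)

## Number of nodes (internal nodes plus leaves) in each tree
nrow(loose_tree$frame)
nrow(strict_tree$frame)
```

Comparing the two node counts shows how raising `cp` prunes the tree.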

### Plot the Decision Tree

In [None]:
## Plot the model results
rpart.plot(my_tree)

### Examine the Rules from our Tree Model

The ```rpart.plot``` library also provides ```rpart.rules```, which prints a table of the rules for the fitted model. After you run the code below, the "Survived" column gives the predicted probability of survival, followed by the rule that leads to it. Each row corresponds to one end (leaf) node of the tree, and with ```cover=TRUE``` the table also shows the percentage of observations that each rule covers.

In [None]:
rpart.rules(my_tree, cover=TRUE)

### Evaluate the Model

Finally, we need to evaluate how well our model performs on the testing data (also called the holdout data). To do this, we use the model to predict the outcomes for the test data. Because we already know the actual outcomes, we can compare them with the predictions. The confusion matrix tabulates the different types of errors the model makes.

In [None]:
## Predict class probabilities for each test data point
## (one column per class, named "0" and "1")
predict_probs <- as.data.frame(predict(my_tree, newdata = test, type = "prob"))

## Convert probabilities to 0/1 predictions using a .5 threshold,
## and pull the ground truth values from the test data
predicted <- as.integer(predict_probs$`1` > .5)
actual <- test$Survived

## Build confusion matrix
confusionMatrix(as.factor(predicted), as.factor(actual), positive = "1")

## Guide to confusion matrix terminology:
## https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/
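
To make the confusion matrix output easier to read, here is a minimal base R sketch that computes accuracy, sensitivity, and specificity by hand. The vectors `predicted_toy` and `actual_toy` are made-up values, not results from the Titanic model, and "1" is treated as the positive case as above.

```R
## Hypothetical predictions and ground truth (not model output)
predicted_toy <- c(1, 0, 1, 1, 0, 0, 1, 0, 0, 1)
actual_toy    <- c(1, 0, 0, 1, 0, 1, 1, 0, 0, 0)

## Rows are predictions, columns are actual outcomes
tab <- table(
  Predicted = factor(predicted_toy, levels = c(0, 1)),
  Actual    = factor(actual_toy, levels = c(0, 1))
)
tab

## The four cells of the matrix, with "1" as the positive case
tp <- tab["1", "1"]  # predicted 1, actually 1
tn <- tab["0", "0"]  # predicted 0, actually 0
fp <- tab["1", "0"]  # predicted 1, actually 0
fn <- tab["0", "1"]  # predicted 0, actually 1

accuracy    <- (tp + tn) / sum(tab)  # share of all predictions that are correct
sensitivity <- tp / (tp + fn)        # share of actual positives that were found
specificity <- tn / (tn + fp)        # share of actual negatives that were found

c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```

These are the same quantities that `confusionMatrix` reports under the names Accuracy, Sensitivity, and Specificity.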