# Assessment 3 - A new tool for detecting prostate cancers
### Authors: Jeffrey Mills (28083938), 

**Required Libraries -** The following R libraries are referenced in this notebook
* ggplot2
* psych
* ROCR
* leaps
* glmnet
* caret

---
### Table of Contents

[1. Exploratory Data Analysis](#1.-Exploratory-Data-Analysis)

[2. EDA Report](#2.-EDA-Report)

[3. XGBoost](#3.-XGBoost)

---

In [None]:
# disable scientific notation and set to 4 digits
options(scipen=999)
options(digits=4)
# set the default plot size
options(repr.plot.width = 6)
options(repr.plot.height = 4)

In [None]:
# install ggplot2 for richer plots, psych for more comprehensive summary
# statistics, and the modelling packages used later in the notebook
# only install packages that are not already installed
list.of.packages <- c("ggplot2", "psych", "ROCR", "leaps", "glmnet", "caret")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages, repos="http://cran.uk.r-project.org")

In [None]:
# import ggplot2 for richer plots if needed
library("ggplot2")
# import a more comprehensive summary statistics package
library("psych")
# import ROCR for easier model performance checking
library("ROCR")
# import caret for data partitioning
library("caret")

## 1. Exploratory Data Analysis

In [None]:
prostate <- read.csv("./prostate.csv")

In [None]:
# Check the head of the dataframe
head(prostate)

In [None]:
# Check structure of the datatypes
str(prostate)

**Result** is the target variable. It encodes one of four cancer stages (0 = curable, 1 = tumour stage, 2 = node stage, 3 = incurable), so it should be stored as a factor rather than as a number.

In [None]:
prostate$Result <- as.factor(prostate$Result) 
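
As an illustration of the conversion above, the sketch below attaches readable labels to the four stage codes. The vector `result_codes` and the label strings are assumptions made for this example; they are not read from `prostate.csv`.

```r
# Hypothetical stage codes, standing in for prostate$Result
result_codes <- c(0, 1, 2, 3, 1, 0)

# Convert to a factor with explicit levels and readable labels
# (labels follow the stage descriptions given in the text)
result_factor <- factor(result_codes,
                        levels = 0:3,
                        labels = c("curable", "tumour", "node", "incurable"))

# Inspect the levels and the count per stage
levels(result_factor)
table(result_factor)
```

Explicit `levels` keep all four stages in the factor even if one of them happens to be absent from a subset of the data.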

In [None]:
# Count the complete observations; any shortfall against the
# row count reported by str() means some rows have missing values
sum(complete.cases(prostate))
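
If the count above turns out to be lower than the number of rows, a per-column breakdown shows where the gaps are. The following is a minimal sketch on a made-up data frame; `toy` and its `ATT2` column are hypothetical and only stand in for the real data.

```r
# Hypothetical data frame with some missing values (not the real prostate data)
toy <- data.frame(ATT1 = c(1.2, NA, 3.4, 2.2),
                  ATT2 = c(5.0, 6.1, NA, 4.8),
                  Result = c(0, 1, 2, 3))

# Number of rows with at least one missing value
n_incomplete <- sum(!complete.cases(toy))

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

n_incomplete
na_per_column
```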

In [None]:
# Check the count and the proportion of each level of the Result variable
w <- table(prostate$Result)
w
prop.table(w)

In [None]:
# Summary statistics for every variable, rounded to 3 decimal places
round(describe(prostate), 3)

In [None]:
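# Mean of ATT1 for the curable class (Result == 0)
meanAtt1Result0 <- mean(prostate$ATT1[prostate$Result == 0])
meanAtt1Result0

# Box plot of ATT1 for each Result stage
plot(prostate$Result, prostate$ATT1, xlab = "Result", ylab = "ATT1")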
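
The same per-class mean can be computed for every stage at once. This is a small sketch on made-up values; the data frame `toy` is hypothetical and only mirrors the `ATT1` and `Result` column names.

```r
# Hypothetical measurements and stage labels (not the real prostate data)
toy <- data.frame(ATT1 = c(1.0, 2.0, 3.0, 4.0, 5.0, 6.0),
                  Result = factor(c(0, 0, 1, 1, 2, 3)))

# Mean of ATT1 within each Result stage
mean_by_stage <- tapply(toy$ATT1, toy$Result, mean)
mean_by_stage
```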

## 2. EDA Report

Perform an EDA and prepare a summary of your findings in fewer than 300 words. Emphasise the aspects of the EDA that guide your choice of model or algorithm for the classifier. Include a chart only if you learnt something from it that you will use in model selection, and explain what you learnt in a short note directly after the cell. Variable types, the dimensions of the data, and correlations between variables are particularly important. Looking at the distributions of some variables may also provide insight.
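
One way to cover the dimension, type, and correlation checks mentioned above is sketched below in base R. The data frame `toy`, its columns `ATT2` and `ATT3`, and the 0.8 cut-off are all assumptions chosen for illustration, not properties of the prostate data.

```r
# Hypothetical numeric predictors (not the real prostate data)
set.seed(1)  # arbitrary seed, only to make the sketch reproducible
x1 <- rnorm(50)
toy <- data.frame(ATT1 = x1,
                  ATT2 = x1 + rnorm(50, sd = 0.1),  # nearly a copy of ATT1
                  ATT3 = rnorm(50))                 # unrelated noise

# Dimensions and variable types
dim(toy)
sapply(toy, class)

# Pairwise Pearson correlations between the numeric columns
cor_matrix <- cor(toy)
round(cor_matrix, 2)

# Flag variable pairs with |r| above an arbitrary 0.8 threshold
high <- which(abs(cor_matrix) > 0.8 & upper.tri(cor_matrix), arr.ind = TRUE)
high
```

Strongly correlated pairs flagged this way are candidates for regularisation or removal, which is one of the things that can inform the choice of model.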

## 3. XGBoost

In [None]:
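# Arbitrary seed so that the split is reproducible
set.seed(123)
# Split the data: 2/3 for training, 1/3 for testing, sampled within each Result level
inTrain <- createDataPartition(prostate$Result, p = 2/3, list = FALSE)
dfTrain <- prostate[inTrain, ]
dfTest <- prostate[-inTrain, ]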

In [None]:
# Check the count and the proportion of each level of Result in the test set
w <- table(dfTest$Result)
w
prop.table(w)
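
For a factor outcome, `createDataPartition` samples within each class, so the class proportions in the two subsets should be similar. The sketch below reproduces that idea in base R on made-up labels; it is not caret's implementation, and the vector `result` and the class sizes are hypothetical.

```r
set.seed(42)  # arbitrary seed, only to make the sketch reproducible

# Hypothetical imbalanced labels, standing in for prostate$Result
result <- factor(rep(0:3, times = c(60, 30, 20, 10)))

# Sample 2/3 of the row indices within each class
idx_by_class <- split(seq_along(result), result)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(length(idx) * 2 / 3))
}))

train <- result[train_idx]
test  <- result[-train_idx]

# Class proportions should be close in both subsets
round(prop.table(table(train)), 3)
round(prop.table(table(test)), 3)
```

Comparing the two tables is a quick check that no stage is under-represented in the test set.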