Welcome to the first assignment of EPI7913!

# Overview of the Assignment
In this assignment, you'll do some exploratory / descriptive analysis of the dataset.

It will test your knowledge of the following topics:

- Data exploration
  - Take a first look at the data and get summary information for each variable.
  - Tabulate the data for categorical variables.
  - Generate violin plots for the continuous variables.
  
- Descriptive data analysis, _i.e._, identifying problematic variables:
  - High missingness
  - Highly imbalanced data
  - Extreme outliers
  - Inconsistent coding
  - Data redundancy

## N0147 data: the control arm of a colon cancer trial

For this assignment, we will use a synthetic version of the `N0147` data into which some data problems have been introduced. More details about this data can be found [here](https://pubmed.ncbi.nlm.nih.gov/30297240/).

# Import Packages

We'll first import all the packages that we need for this assignment.

- `dplyr` is what we'll use to manipulate our data. For more details, see [dplyr](https://dplyr.tidyverse.org/)
- `ggplot2` is a plotting library. For more details, see [ggplot2](https://ggplot2.tidyverse.org/)


In [None]:
library(dplyr)
library(ggplot2)

# Load Data

First we will load the dataset that we will explore in this assignment.


In [None]:
data <- read.csv("n0147_syn_issue.csv", header = TRUE, stringsAsFactors = TRUE)

# Explore the Dataset

## Summary table

The dataset includes the following fields (you can view this as a data dictionary):

- `arm`: Labels of experimental arms, this dataset contains arms A, B, and C.
- `ps`: ECOG Performance Status: 0 = 0, 1 = 1, 2 = 2+.
- `stage_g`: Clinical T Stage: 1 = T1 or T2, 2 = T3, 3 = T4.
- `wild`: Biomarker KRAS: 0 = Mutant, 1 = Wild-type, Missing = indeterminate.
- `histo_g`: Histology: 1=High (poorly differentiated or undifferentiated), 2=Low (well or moderately differentiated).
- `nodes`: Positive lymph node involvement: 1 = 1-3, 2 = >=4.
- `bwl_obs`: Bowel obstruction : 1=Yes, 2=No.
- `age`: Age.
- `agecat`: Age category: < 40, 40-69, >=70.
- `sex`: Sex: m=Male, f=Female.
- `bmi2`: Body mass index (BMI).
- `logbmi`: The logarithm of `bmi2`.
- `racecat`: Race: b=black, w=white, oth=other.
- `numcycle`: Total Number of Cycles Given.
- `dfsstat5`: Disease free survival status (5yr censor): 0 = Event-Free, 1 = Event.
- `fustat8`: Overall survival status (8 year censor): 0 = Alive, 1 = Dead.
- `dfstime5`: Time in days of disease free survival.
- `futime8`: Time in days of overall survival.

We first get the number of subjects (`nSub`) and number of variables (`nVar`) of this dataset.


In [None]:
nSub <- nrow(data)
nVar <- ncol(data)

nSub
nVar

We can use the `head()` function to display the first six records of the dataset.



In [None]:
head(data)

We can use the `summary()` function to display summary information for each variable.



In [None]:
summary(data)

The function `str()` is another helpful way to display summary information, especially the data type of each variable.



In [None]:
str(data)

You can see that, because we set `stringsAsFactors = TRUE` when reading in the data, the variables `arm`, `agecat`, `sex`, and `racecat` were automatically recognized as categorical variables (data type `Factor`). However, a few categorical variables are still stored as `int`, and we need to convert them to `Factor`.

__Question 1:__ Following the example in the cell below, identify the other variables that need to be converted and convert them. Don't forget to call `str()` afterwards to make sure you have done it properly.

_Hint:_ Please read the data dictionary carefully to determine the right type for each variable.
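
To see why the data type matters, here is a minimal sketch on a made-up data frame. The object `toy` and its columns `grade` and `weight` are hypothetical and are not part of the assignment data. A category stored as integers gets numeric summaries from `summary()`, while the same column stored as a factor is tabulated.

```r
# Hypothetical toy data: `grade` is a category coded as integers.
toy <- data.frame(
  grade  = c(1L, 2L, 2L, 3L, 1L, 2L),
  weight = c(61.2, 70.5, 58.9, 80.1, 66.0, 73.4)
)

summary(toy$grade)   # treated as a number: min, quartiles, mean, max

# Convert in place; the integer codes become the factor levels.
toy$grade <- as.factor(toy$grade)

summary(toy$grade)   # treated as a category: counts per level
str(toy)             # `grade` is now a Factor, `weight` is still num
```

A mean of category codes is rarely meaningful, which is why categorical variables are usually stored as factors before summarizing.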


In [None]:
# Example for Question 1
data$fustat8 <- as.factor(data$fustat8)

In [None]:
### START CODE HERE ###  

### END CODE HERE ###

Re-run the `summary()` function to display the summary information of the current data frame. You will need this summary table to answer some of the questions below.



In [None]:
summary(data)

## Data issues

### Missing data

We have talked about missing data and the different types of missingness in class.

__Question 2:__ Calculate the percentage of missing data in each variable and rank them (from high to low).

_Hint:_

- For calculating the percentage of missing data in each variable, [see here](https://stackoverflow.com/questions/33512837/calculate-using-dplyr-percentage-of-nas-in-each-column).
- We only consider `NA` as missing at this step.
- For sorting a vector, [see here](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/sort).
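
As a rough illustration of the idea behind these hints, the sketch below computes and ranks missingness on a small made-up data frame using base R only. The object `toy` and its columns are hypothetical. This is one possible approach, not necessarily the one in the linked answer.

```r
# Hypothetical toy data with NA values in two of the three columns.
toy <- data.frame(
  a = c(1, NA, 3, NA),
  b = c("x", "y", NA, "z"),
  c = c(TRUE, FALSE, TRUE, TRUE)
)

# is.na() returns a logical matrix; the column means of that matrix
# are the proportions of NA values, so multiply by 100 for percentages.
pct_missing <- colMeans(is.na(toy)) * 100

# Rank from highest to lowest missingness.
pct_sorted <- sort(pct_missing, decreasing = TRUE)
pct_sorted
```

Only `NA` is counted here; placeholder codes that stand in for missing values are not detected by `is.na()`.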


In [None]:
### START CODE HERE ###  

### END CODE HERE ###

### Highly imbalanced data

__Question 3:__ Looking at the summary table above, do you think the dataset is highly imbalanced? If yes, which variable(s) is/are imbalanced? Please add your answer in this cell.

_Answer:_ 

### Extreme outliers

__Question 4:__ Can you identify a variable that contains extreme outliers? Please remove the outliers.

_Hint:_ In clinical trials, it is common practice to code missing values of continuous variables as `9999`.
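
The effect of such a placeholder code can be seen in a small made-up example. The vector `x` and the threshold `cutoff` below are hypothetical values chosen for illustration only.

```r
# Hypothetical measurements; 9999 stands in for a missing value.
x <- c(12, 15, 9999, 14, NA, 11)

mean(x, na.rm = TRUE)   # heavily inflated by the 9999 entry

# Hypothetical threshold: anything at or above it is treated as missing.
cutoff <- 1000

x_clean <- x
# which() returns only positions where the comparison is TRUE,
# so the existing NA is left untouched.
x_clean[which(x_clean >= cutoff)] <- NA

mean(x_clean, na.rm = TRUE)   # now based on the four plausible values
```

Choosing the threshold is a judgment call that depends on the plausible range of the variable in question.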


In [None]:
### START CODE HERE  (REPLACE INSTANCES OF 'None' with your code) ###  

# Define the maximum allowed value of the variable you find.
max <- None

# Set values at or above max to NA

data$None[which(data$None >= max)] <- NA

### END CODE HERE ###

__Question 5:__ One variable in this dataset is derived from the variable with outliers above. Please update the derived variable in the cell below.



In [None]:
### START CODE HERE ###  


### END CODE HERE ###

### Inconsistent coding

__Question 6:__ Looking at the summary table above, can you identify a variable whose coding/categories are inconsistent with the data dictionary? Modify the variable in the cell below. Don't forget to call `summary()` on this variable to make sure that you have done it properly.


In [None]:
### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###  

# Repeat the line below until you have corrected all the inconsistent categories of the problematic variable.
data$None[which(data$None == None)] <- None

summary(data$None)

### END CODE HERE ###

__Note:__ To remove the empty categories, you can use the function `droplevels()`.
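
The following sketch shows, on a made-up factor, why empty categories appear after recoding and what `droplevels()` does about them. The variable `smoker` and its values are hypothetical and unrelated to the assignment data.

```r
# Hypothetical factor with inconsistent coding: "Y"/"N" mixed with "yes"/"no".
smoker <- factor(c("yes", "no", "Y", "no", "N", "yes"))
nlevels(smoker)   # four levels, although only two categories are intended

# Recode the stray values to levels that already exist in the factor.
smoker[which(smoker == "Y")] <- "yes"
smoker[which(smoker == "N")] <- "no"

table(smoker)     # "Y" and "N" are still listed, now with a count of 0

# droplevels() removes levels that no longer occur in the data.
smoker_clean <- droplevels(smoker)
table(smoker_clean)
```

Note that assigning a value that is not an existing level of a factor produces `NA` with a warning, so the replacement value should already be one of the factor's levels.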



In [None]:
data <- droplevels(data)

### Redundant variable

__Question 7:__ Looking at the summary table above, can you identify a redundant variable that has no variance at all? If yes, replace `None` in the cell below with the variable name.

_Hint:_ If you need help with the `dplyr::select` function, [see here](https://dplyr.tidyverse.org/reference/select.html).


In [None]:
### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###  

data <- data %>% select(- None)

### END CODE HERE ###

We can also examine bivariate correlations and see if there is redundancy.

__Note:__

- The function `cor()` only works on numeric variables. For correlations between two categorical variables, or between a categorical and a continuous variable, see [here](https://datascience.stackexchange.com/questions/893/how-to-get-correlation-between-two-categorical-variable-and-a-categorical-variab) and [here](https://medium.com/@outside2SDs/an-overview-of-correlation-measures-between-categorical-and-continuous-variables-4c7f85610365).
- Your answer to Question 1 affects the results of the cell below, so please make sure to re-run it.
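
As a small illustration of how a correlation matrix can reveal redundancy, the sketch below uses a made-up data frame in which one column is just a unit conversion of another. The object `toy` and its columns are hypothetical.

```r
# Hypothetical toy data: height_in is height_cm expressed in inches,
# so it carries no additional information.
toy <- data.frame(
  height_cm = c(150, 160, 170, 180, 190),
  height_in = c(150, 160, 170, 180, 190) / 2.54,
  score     = c(3, NA, 5, 2, 4)
)

# use = "complete.obs" drops every row that contains an NA
# before the correlations are computed.
cor_mat <- cor(toy, use = "complete.obs")
round(cor_mat, 2)
```

An off-diagonal correlation at or very near 1 (or -1) suggests that one of the two variables can be dropped without losing information.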


In [None]:
data_cnt <- data %>% select(where(is.numeric))
cor(data_cnt, use = "complete.obs")

__Question 8:__ Looking at the correlation matrix above, which variable is redundant? Remove it in the cell below.



In [None]:
### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###  

data <- data %>% select(- None)

### END CODE HERE ###

__Question 9:__ Can you identify any other data issues beyond the ones raised in the questions above? Please name the variable and explain its issue in this cell.

_Answer:_

Variable name:
Issue:

## Data visualization 

Here we will use two types of plots:

- [Bar plot](https://ggplot2.tidyverse.org/reference/geom_bar.html) for categorical variables
- [Violin plot](https://ggplot2.tidyverse.org/reference/geom_violin.html) for continuous variables. For more details about violin plots, [see here](https://mode.com/blog/violin-plot-examples/).

__Question 10:__ Following the example below, choose another _categorical_ variable and draw its bar plot.


In [None]:
# Example for Question 10
ggplot(data, aes(arm)) + geom_bar()

In [None]:
### START CODE HERE ###  

### END CODE HERE ###

__Question 11:__ Following the example below, choose another _continuous_ variable and draw its violin plot.



In [None]:
# Example for Question 11
ggplot(data, aes(fustat8, age)) + geom_violin(draw_quantiles = c(0.25, 0.5, 0.75))

In [None]:
### START CODE HERE ###  

### END CODE HERE ###

# Congratulations!

You have finished the first assignment of EPI7913.
