# Project Introduction

Congratulations on finishing the modules! This is the project notebook for the data mining track. In this project, you will select a dataset, carry out your own analysis using the skills you learned in the previous modules, and write up your findings. This will give you a chance to get your hands dirty with real healthcare data.

## Project Datasets

Below is a list of carefully curated datasets for you to use. Please read about your selected dataset and run the associated code cell for it.

However, if you prefer to use your own dataset, go ahead!

### Diabetes Dataset

This dataset contains over 700 female Pima American Indian patients. It includes several clinical and demographic variables. Your goal will be to predict which patients develop diabetes.

In [None]:
# Code reading in the dataset
diabetes_data <- read.csv(file="data/diabetes.csv",  encoding="UTF-8", header=TRUE, sep=",")

#### Diabetes Data Dictionary

<center>

| *Variable*               | *Definition*                                                                 |
| ------------------------ | ---------------------------------------------------------------------------- |
| Pregnancies              | Number of times pregnant                                                     |
| Glucose                  | Plasma glucose concentration following a 2-hour oral glucose tolerance test  |
| BloodPressure            | Diastolic blood pressure (mm Hg)                                             |
| SkinThickness            | Triceps skin fold thickness (mm)                                             |
| Insulin                  | 2-hour serum insulin level                                                   |
| BMI                      | Weight in kg / (Height in m)^2                                               |
| DiabetesPedigreeFunction | Measure of expected genetic influence on the subject's diabetes risk         |
| Age                      | Age (years)                                                                  |
| Outcome                  | Diabetes, defined as a plasma glucose concentration > 200 mg/dl two hours after ingestion of a 75 gram carbohydrate solution |


</center>

#### Extra: What is Diabetes?

Diabetes describes a group of metabolic disorders that are characterized by abnormally high blood glucose (blood sugar). Insulin is a critical hormone that allows the body to take up glucose and use it as energy. Diabetes occurs when your body does not produce enough insulin or cannot use insulin well. When this happens, glucose stays in your blood instead of being used as energy. Over time, this excess glucose in your blood can lead to various health problems, including serious complications such as blindness, limb amputation, and stroke.

<img src="http://cdn.shopify.com/s/files/1/0582/0445/files/blood-sugar-levels-and-paleo_diagram_of_excessive_blood_glucose.jpg?14686865159083212781" align="center" style="margin-bottom: 0.5em; margin-top: 0.5em;">

Diabetes is more common among individuals who are over 45 years old, have a family history of diabetes, or are overweight. Lifestyle factors such as physical inactivity or smoking can also increase the risk. Diabetes is one of the most common conditions in the United States, affecting over 30 million people. The total cost of diabetes to the health care system is estimated at $327 billion. Current treatment includes insulin and other medications to control blood glucose levels. Lifestyle modifications before the onset of diabetes can make a huge difference in whether a person develops the disease. An algorithm that could reliably predict diabetes would give providers critical information to intervene before a patient develops it.

### Heart Failure Dataset

This is a dataset of over 900 patients enrolled in several international clinical studies. The dataset includes several demographic and clinical variables. Your goal will be to predict which patients develop heart disease.

In [None]:
# Code reading in the dataset
heart_failure_data <- read.csv(file="data/heart_disease.csv",  encoding="UTF-8", header=TRUE, sep=",")

#### Heart Failure Data Dictionary

| *Column #*| *Definition*                                                                        |
| --------- | ----------------------------------------------------------------------------------- |
| V1         | age in years                                                                        |
| V2         | sex (1 = male; 0 = female)                                                          |
| V3         | chest pain (1:typical angina; 2:atypical angina; 3:non-anginal pain; 4:asymptomatic)| 
| V4         | resting blood pressure on hospital admission (mmHg)                                 |
| V5         | serum cholesterol in mg/dl                                                          |
| V6         | fasting blood sugar > 120 mg/dl (1=true; 0 = false)                                 |
| V7         | resting electrocardiographic results (0: normal; 1: ST-T wave abnormality (T wave inversions and/or ST elevation or depression of 0.05 mV); 2: showing probable or definite left ventricular hypertrophy by Estes' criteria) |
| V8         | maximum heart rate achieved                                                         |
| V9         | exercise induced angina (1=yes; 0=no)                                               |
| V10        | ST depression induced by exercise relative to rest                                  |
| V11        | Slope of the peak exercise ST segment (1: upsloping; 2: flat; 3: downsloping)          |
| V12        | Number of major vessels (0-3) colored by fluoroscopy                                |
| V13        | Exercise thallium scintigraphic defects (fixed, reversible, or none)                |
| V14        | Diagnosis of heart disease (angiographic disease status) (0: <50% diameter narrowing, 1: >50% diameter narrowing)                                                                        |
| location        | The location of the research study that the patient participated in                                                                        |
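
Because the columns are named `V1` to `V14`, it can help to give them descriptive names before you start your analysis. The sketch below shows one way to do this on a small made-up data frame. The new names (`age`, `sex`, `chest_pain`) and the data values are illustrative assumptions, not part of the dataset; the coding of sex follows the data dictionary above.

```R
# Toy stand-in for heart_failure_data (values are made up for illustration)
toy_heart <- data.frame(V1 = c(63, 41, 57), V2 = c(1, 0, 1), V3 = c(4, 2, 3))

# Lookup from original column names to hypothetical descriptive names
new_names <- c(V1 = "age", V2 = "sex", V3 = "chest_pain")

# Rename only the columns that appear in the lookup
matched <- names(toy_heart) %in% names(new_names)
names(toy_heart)[matched] <- unname(new_names[names(toy_heart)[matched]])

# Label the coded sex variable using the data dictionary (1 = male; 0 = female)
toy_heart$sex <- factor(toy_heart$sex, levels = c(0, 1), labels = c("female", "male"))

# Confirm the new names and data types
str(toy_heart)
```

The same pattern can be extended to the remaining columns by adding entries to `new_names`.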

#### Extra: What is Heart Disease?

Heart disease, or cardiovascular disease, refers to a condition where blood vessels become narrowed or occluded. This can lead to numerous life-threatening conditions such as heart attack or stroke. Heart disease can be caused by numerous risk factors such as high blood pressure, high cholesterol, smoking, a sedentary lifestyle, and stress.

<center>
  <img width="400" height="200" src="https://www.health.harvard.edu/media/content/images/p6_PlaqueArtery_HL1902_gi958536398.jpg">
</center>

What makes heart disease so dangerous is that it is a gradual process that often does not become symptomatic until you're middle-aged or older. At this point, much of the damage has already been done. While heart disease mortality has improved in recent years, it is still one of the most dangerous conditions in the US. Over 600,000 people die of heart disease in the US every year (1 of every 4 deaths). Heart disease is still the leading cause of death among both men and women. Any predictive algorithm that could identify heart disease early could potentially save countless lives.

### Stroke Dataset

This is a dataset of over 40,000 patients with several associated demographic and clinical features. The goal with this dataset is to predict whether someone will have a stroke.

In [None]:
# Code reading in the dataset
stroke_data <- read.csv(file="data/stroke_predict.csv",  encoding="UTF-8", header=TRUE, sep=",")

#### Stroke Data Dictionary

The data dictionary below will help you make sense of the features in the data.

<center>

| *Variable*        | *Definition*                                           |
| ----------------- | ------------------------------------------------------ |
| id                | Patient ID                                             |
| gender            | Gender of Patient                                      |
| age               | Age of Patient                                         | 
| hypertension      | 0 - no hypertension, 1 - suffering from hypertension   |
| heart_disease     | 0 - no heart disease, 1 - suffering from heart disease |
| ever_married      | Yes/No                                                 |
| work_type         | Type of occupation                                     |
| Residence_type    | Area type of residence (Urban/Rural)                   |
| avg_glucose_level | Average Glucose level (measured after meal)            |
| bmi               | Body mass index                                        |
| smoking_status    | patient's smoking status                               |
| stroke            | Whether the patient had a stroke (outcome variable)    |

</center>

#### Extra: What is a Stroke?

Stroke is an acute neurologic condition also referred to as a cerebrovascular event. This means stroke is a condition that affects the brain ("cerebro-") and involves blood vessels ("vascular"). In a stroke, arteries leading to and within the brain are either blocked by a clot or ruptured. The end result is a lack of oxygen and nutrients to the brain, leading to brain damage.


<img width="500" height=300 src="https://www.strokeinfo.org/wp-content/uploads/2019/06/HTN_16_pg39_art600x400.png">


Stroke is usually diagnosed clinically (by symptoms) and with imaging (non-contrast head CT scan). Stroke can present with a wide range of symptoms depending on the location affected within the brain. Some nonspecific symptoms include headache ("worst headache of my life"), nausea, vomiting, loss of consciousness, and neck stiffness. If a stroke is suspected, a non-contrast head CT is ordered to detect bleeding. Treatment depends on whether the stroke is caused by a clot or a rupture: a clot is treated with blood thinners, while a rupture is treated with emergent neurosurgery.

> Stroke requires prompt diagnosis and treatment before irreversible damage sets in. Any tool (such as a predictive model) that could make stroke diagnosis quicker or easier could make a large difference in preventing stroke.

# List of functions

First, please run the code cell below to set up the module.

In [None]:
source("setup.R")
cat('Setup Complete!')

Below is a list of functions you will need to complete the project, grouped by the module in the data mining track where you encountered them. If you do not remember what a function does, you may want to review the module it comes from.

## Module 1 (Preparing Data) Functions

```R

# Select desired columns from a dataframe into a new dataframe. The numbers are the
# positions of the selected columns; columns can be selected individually or as a
# range using a colon (:). Replace new_dataframe with a name of your choosing.
library(dplyr)
new_dataframe <- select(original_dataframe, 43, 56:123)

# Insert a new column whose values are determined by another column in the dataframe.
# Useful when turning a single categorical variable into multiple yes/no columns.
# For example, if the variable is state, the new column could show whether a person
# lives in one specific state. Yes is represented by 1 and no by 0.
library(dplyr)
dataframe <- mutate(dataframe, new_column_name = ifelse(name_of_reference_column == 'value_in_reference_column', '1', '0'))

# Delete a column from a dataframe. The number after the - sign is the position of
# the column that is removed when the code is executed.
dataframe <- select(dataframe, -17)

# Change data type of a column to factor data type
dataframe$column_name <- as.factor(dataframe$column_name)

# Change data type of a column to numeric data type
dataframe$column_name <- as.numeric(dataframe$column_name)

# Check all variable data types for a dataframe
str(dataframe)

# Replace missing (NA) cells in a column with the column mean
dataframe <- transform(dataframe, column_name = ifelse(is.na(column_name), mean(column_name, na.rm=TRUE), column_name))

# Replace missing (NA) cells in a column with 0. To fill with a different value (for example 1), change the 0 to that value.
dataframe$column_name[is.na(dataframe$column_name)] <- 0

# Random forest imputation. maxiter is the maximum number of iterations performed and
# ntree is the number of decision trees grown. Replace imputed_dataframe.imp with a
# name of your choosing, but make sure it ends in .imp as shown below.
library(missForest)
library(randomForest)
set.seed(96)
imputed_dataframe.imp <- missForest(old_dataframe, verbose = TRUE, maxiter = 3, ntree = 20)

# Check the imputed values and assign them to a new dataframe. Replace new_dataframe
# with a name of your choosing.
imputed_dataframe.imp$ximp
new_dataframe <- imputed_dataframe.imp$ximp
```
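
To see how a few of these data preparation steps fit together, here is a minimal sketch on a small made-up data frame. The column names (`glucose`, `state`, `lives_in_nc`) and the values are assumptions chosen for illustration; only base R functions are used, so it should run without loading any packages.

```R
# Toy data frame with one missing value (made-up values, hypothetical column names)
toy <- data.frame(
  glucose = c(90, NA, 110, 100),
  state   = c("NC", "VA", "NC", "SC"),
  stringsAsFactors = FALSE
)

# Replace the missing glucose value with the mean of the observed values
toy <- transform(toy, glucose = ifelse(is.na(glucose), mean(glucose, na.rm = TRUE), glucose))

# Add a 1/0 indicator column for one category of a categorical variable
toy$lives_in_nc <- ifelse(toy$state == "NC", 1, 0)

# Convert the indicator to a factor, then confirm the data types
toy$lives_in_nc <- as.factor(toy$lives_in_nc)
str(toy)
```

Mean imputation is the simplest option shown here; whether it, zero filling, or random forest imputation is appropriate depends on the variable and on how much data is missing.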

## Module 2 (Univariate Analysis) Functions

```R
# Determine Variable Type
class()

# Calculate Mean
mean()

# Calculate Median
median()

# Calculate Frequencies
table()

# Percentages are trickier
# First use the table function on your variable, then apply prop.table() to the result
prop.table()

# For example
prop.table(table(my$variable))

# Create Bar Graph
bar_graph()

# Create Histogram
histogram()

```
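
As a quick illustration of the univariate functions above, the sketch below summarizes one numeric and one categorical variable. The variables `age` and `smoker` and their values are made up for this example.

```R
# Toy variables (made-up values)
age <- c(34, 45, 52, 45, 61)
smoker <- c("yes", "no", "no", "yes", "no")

# Variable type, mean, and median of a numeric variable
class(age)
mean(age)
median(age)

# Frequencies of a categorical variable
smoker_counts <- table(smoker)
print(smoker_counts)

# Proportions, then percentages rounded to one decimal place
smoker_props <- prop.table(smoker_counts)
print(round(smoker_props * 100, 1))
```

Comparing the mean and the median is a quick way to spot skew: when they differ noticeably, a histogram of the variable is worth a look.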

## Module 3 (Bivariate Analysis) Functions

```R
# Load csv file
read.csv()

# Show content in the output
print()

# Concatenate two pieces of text (strings)
paste()

# Look up variable dictionary
var_dict()

# Show the column names of the dataframe generated by read.csv
colnames()

# Generate a dataframe by subsetting from the original dataframe
subset()

# Pearson correlation test
cor.test()

# Load module/package
library()

# Generate scatter plot using package ggplot2
qplot()

# Output levels of a categorical variable
levels()

# Generate contingency table for frequency
table()

# Refresh the levels in your categorical variable after you make changes
factor()

# Pearson's Chi-squared test
chisq.test()

# Transform a table into a dataframe
data.frame()

# Generating bar plot using ggplot
ggplot() + geom_bar()

# For example
ggplot(MyDataframe, aes(x=CategoricalVariable1, y=Frequency, fill=CategoricalVariable2)) + geom_bar(stat="identity",position=position_dodge())

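# Welch two-sample t-test (the default; it does not assume equal variances)
# For Student's t-test, which assumes equal variances, set var.equal=TRUE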
t.test()

# Generating box plot using ggplot
ggplot() + geom_boxplot()

# For example
ggplot(MyDataframe, aes(x=CategoricalVariable, y=NumericalVariable)) + 
  geom_boxplot(outlier.colour="red", outlier.shape=8,
               outlier.size=4)

# Show all unique values in a numerical variable
unique()

# ANOVA
anova(aov())

# For example
anova(aov(NumericalVariable~CategoricalVariable, data=MyDataframe))
```
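
Which test to use depends on the types of the two variables being compared. The sketch below runs each of the main tests from this module on a small made-up data frame; the column names and values are assumptions for illustration, and with so few rows the p-values are not meaningful in themselves.

```R
# Toy data frame (made-up values, hypothetical column names)
toy <- data.frame(
  age     = c(34, 45, 52, 45, 61, 38, 70, 55),
  glucose = c(85, 99, 110, 102, 130, 90, 140, 118),
  sex     = factor(c("F", "M", "F", "M", "F", "M", "F", "M")),
  disease = factor(c("no", "no", "yes", "no", "yes", "no", "yes", "yes"))
)

# Numeric vs. numeric: Pearson correlation test
cor_result <- cor.test(toy$age, toy$glucose)
print(cor_result$estimate)

# Categorical vs. categorical: contingency table, then chi-squared test
# (with counts this small, R warns that the approximation may be inaccurate)
sex_by_disease <- table(toy$sex, toy$disease)
chi_result <- suppressWarnings(chisq.test(sex_by_disease))
print(chi_result$p.value)

# Numeric vs. two-level categorical: Welch two-sample t-test (the default)
t_result <- t.test(glucose ~ disease, data = toy)
print(t_result$p.value)

# Numeric vs. categorical with any number of levels: one-way ANOVA
print(anova(aov(glucose ~ sex, data = toy)))
```

Each test returns an object whose components (such as `estimate` and `p.value`) can be extracted with `$`, which is convenient when you report results in your writeup.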


## Module 4 (Feature Selection) Functions

```R
# Create training and testing dataframes, in this case 75 percent will be used for training and 25 percent for testing. 
library(caTools) 
smp_size<- floor(0.75 * nrow(dataframe_to_be_split)) 
set.seed(123) 
train_ind<- sample(seq_len(nrow(dataframe_to_be_split)), size = smp_size) 
training_dataframe<- dataframe_to_be_split[train_ind, ] 
testing_dataframe<- dataframe_to_be_split[-train_ind, ] 

# Oversample the minority outcome to balance the training data in an effort to improve model performance (not always necessary). In this case the minority outcome will be oversampled so that it occurs 40 percent of the time. Only the training data is oversampled; the testing data is left unchanged.
library(ROSE)
oversampled_dataframe <- ROSE(column_of_interest~., p = 0.4, data=training_dataframe, seed=3)$data

# Create a table after oversampling to view the count of the oversampled outcome.
table(oversampled_dataframe$column_of_interest)

# Feature selection using recursive feature elimination (RFE). 2:18 in this case is the range of the columns used to predict the outcome variable. The outcome variable in this example is column 1.
library(e1071)
library(mlbench)
library(caret)
library(randomForest)
rfe_training <- rfeControl(functions=rfFuncs, method="cv", number=10)
rfe_results <- rfe(oversampled_dataframe[,2:18], oversampled_dataframe[,1], sizes=c(2:18), rfeControl=rfe_training)
print(rfe_results)

# Show the variables selected by RFE
predictors(rfe_results)

# Display graph that highlights the most accurate number of features found using RFE.
plot(rfe_results, type=c("g", "o"))

# Feature selection using Random Forest 
library(e1071)
library(mlbench)
library(caret)
library(randomForest)
t_training<- trainControl(method = "repeatedcv", number=10, repeats=3)
seed<- 7
metric<- "Accuracy"
set.seed(seed)
mtry<- sqrt(ncol(oversampled_dataframe))
tunegrid<-expand.grid(.mtry=mtry)

# Train the model using the oversampled dataframe
t_model<- train(column_of_interest~., data=oversampled_dataframe, method="rf", metric=metric, tuneGrid=tunegrid, trControl=t_training)

# Apply the model to the testing_dataframe. Replace predicted_outcomes with a name of your choosing.
predicted_outcomes <- predict(t_model, testing_dataframe)

# Analyze the results of running the model on the testing dataframe using a confusion matrix. The predictions used here are the ones created in the previous step.
confusionMatrix(predicted_outcomes, testing_dataframe$column_of_interest)
                 
# Visualize variable importance in Random Forest model 
variable_importance<- varImp(t_model)
print(variable_importance)
```
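
Before training any model, it is worth confirming that the train/test split behaves as expected. The sketch below repeats the 75/25 split from the list above on a made-up data frame of 20 rows; the names `toy`, `training_toy`, and `testing_toy` are assumptions for this example, and only base R is used.

```R
# Toy data frame with 20 rows (made-up values, hypothetical column names)
toy <- data.frame(id = 1:20, outcome = factor(rep(c("no", "yes"), times = c(14, 6))))

# 75/25 split with a fixed seed so the split is reproducible
smp_size <- floor(0.75 * nrow(toy))
set.seed(123)
train_ind <- sample(seq_len(nrow(toy)), size = smp_size)
training_toy <- toy[train_ind, ]
testing_toy <- toy[-train_ind, ]

# Confirm the sizes and that no row appears in both sets
nrow(training_toy)
nrow(testing_toy)
length(intersect(training_toy$id, testing_toy$id))

# Check class balance in the training set before deciding whether to oversample
prop.table(table(training_toy$outcome))
```

If the outcome is heavily imbalanced in the training set, oversampling can be applied to the training data only. The testing data should keep its original distribution so that the confusion matrix reflects how the model would perform on unseen patients.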

## Module 5 (Predictive Analysis) Functions

This section will be updated once data mining module 5 is released. This section will be updated earlier if possible. 

# Student Input

**Warning**  
<font color="blue" size="4">
    Your work will not be saved in Jupyter Notebook. We recommend copying your work and pasting it somewhere safe so that you have a record of it.
</font>

## How To Download Your Work

The project will be much easier if you are able to download your work and save your progress. The link below leads to a resource with instructions for setting up Jupyter Notebook on your own computer. This will allow you to download this notebook and save your work.

<a href="https://datamine.unc.edu/wp-content/uploads/2020/06/FAQs.pdf">Link to Instructions</a>

## Project Input

Use the code cell below to perform your analysis.

Use the cell below the horizontal line for your writeup. Your writeup should describe your analytic process and the findings from your analysis.

---