# Introduction to Data Mining

## Welcome to the data mining portion of the course
- This part of the course moves away from studying free text and focuses instead on structured data analysis.
- From now on, the notebooks use R, a language created specifically for data analytics.
- As in the previous sections, the code is provided for your reference; you do not need to understand the specifics to progress through the remainder of the course.

### In this notebook we will be covering 
1. How to read in a CSV file from GitHub
2. How to select variable columns from a data frame
3. How to change data types
4. Replacing null numeric cells with the column mean
5. Replacing null categorical cells with the column mode (0 or 1 for the columns used here)
6. Replacing the remaining null cells using random forest imputation

## Read in the CSV file

In [3]:
install.packages("readr")
library(readr)

#location of the raw CSV file in the GitHub repository
urlfile <- "https://raw.githubusercontent.com/e-cui/ENABLE-HiDAV-Online-Modules/master/Data%20Mining%20Modules/csv_files/training_v2.csv"

#read the CSV file into a data frame called training_v2
training_v2 <- read_csv(url(urlfile))


"package 'readr' is in use and will not be installed"
Parsed with column specification:
cols(
  .default = col_double(),
  ethnicity = col_character(),
  gender = col_character(),
  hospital_admit_source = col_character(),
  icu_admit_source = col_character(),
  icu_stay_type = col_character(),
  icu_type = col_character(),
  urineoutput_apache = col_logical(),
  apache_3j_bodysystem = col_character(),
  apache_2_bodysystem = col_character()
)
See spec(...) for full column specifications.
"41783 parsing failures.
 row                col           expected    actual         file
8119 urineoutput_apache 1/0/T/F/TRUE/FALSE 680.3136  <connection>
8120 urineoutput_apache 1/0/T/F/TRUE/FALSE 665.9712  <connection>
8121 urineoutput_apache 1/0/T/F/TRUE/FALSE 929.4048  <connection>
8122 urineoutput_apache 1/0/T/F/TRUE/FALSE 2869.9488 <connection>
8124 urineoutput_apache 1/0/T/F/TRUE/FALSE 1291.1616 <connection>
.... .................. .................. ......... ............
See problems(...) fo

tibble [91,713 x 186] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ encounter_id                 : num [1:91713] 66154 114252 119783 79267 92056 ...
 $ patient_id                   : num [1:91713] 25312 59342 50777 46918 34377 ...
 $ hospital_id                  : num [1:91713] 118 81 118 118 33 83 83 33 118 118 ...
 $ hospital_death               : num [1:91713] 0 0 0 0 0 0 0 0 1 0 ...
 $ age                          : num [1:91713] 68 77 25 81 19 67 59 70 45 50 ...
 $ bmi                          : num [1:91713] 22.7 27.4 31.9 22.6 NA ...
 $ elective_surgery             : num [1:91713] 0 0 0 1 0 0 0 0 0 0 ...
 $ ethnicity                    : chr [1:91713] "Caucasian" "Caucasian" "Caucasian" "Caucasian" ...
 $ gender                       : chr [1:91713] "M" "F" "F" "F" ...
 $ height                       : num [1:91713] 180 160 173 165 188 ...
 $ hospital_admit_source        : chr [1:91713] "Floor" "Floor" "Emergency Department" "Operating Room" ...
 $ icu_admit_source             : ch
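The parsing failures reported above occur because `read_csv` guessed `urineoutput_apache` to be a logical column and then met numeric values further down the file. One way to avoid this is to declare the column type explicitly with `col_types`. The sketch below is only an illustration on a small stand-in file written to a temporary location; the `id` column and the four data rows are made up for the demo and are not taken from the course data.

```r
library(readr)

# Build a small stand-in CSV: the numeric column is empty in the first rows
tmp <- tempfile(fileext = ".csv")
writeLines(
  c("id,urineoutput_apache",
    "1,",
    "2,",
    "3,680.3136",
    "4,665.9712"),
  tmp
)

# Declare the column types up front instead of letting readr guess them
demo <- read_csv(
  tmp,
  col_types = cols(
    id = col_double(),
    urineoutput_apache = col_double()
  )
)

# The column is numeric, with NA where the cell was empty
str(demo$urineoutput_apache)
```

For the course file, the same idea would mean passing `col_types = cols(urineoutput_apache = col_double())` to the existing `read_csv` call; readr also offers a `guess_max` argument that controls how many rows are used for type guessing.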

## Selecting variable columns from a data frame 

In [4]:
#select variable columns (by position) from the training_v2 data frame into a new data frame called practice_csv
install.packages("dplyr")
library(dplyr)
practice_csv <- select(training_v2, 4, 43, 36, 79, 61, 42, 123, 107, 115, 46, 28, 29, 5, 16, 6, 35, 20)

#create sepsis and cardiovascular diagnosis columns from the diagnosis column
practice_csv <- mutate(practice_csv, sepsis = ifelse(apache_2_diagnosis == '113', '1', '0'))
practice_csv <- mutate(practice_csv, cardiovascular_diagnosis = ifelse(apache_2_diagnosis == '114', '1', '0'))

#delete original diagnosis column (column 17 of practice_csv)
practice_csv <- select(practice_csv, -17)

"package 'dplyr' is in use and will not be installed"
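Selecting columns by position works, but it breaks silently if the column order of the source file ever changes. As a hedged alternative, the same steps can be written with column names. The toy data frame below is made up for illustration; only the column names `hospital_death`, `age`, `bmi` and `apache_2_diagnosis` and the diagnosis code 113 come from the notebook.

```r
library(dplyr)

# Toy stand-in for training_v2 (values are invented for the demo)
toy <- data.frame(
  hospital_death = c(0, 1, 0),
  age = c(68, 77, 25),
  bmi = c(22.7, NA, 31.9),
  apache_2_diagnosis = c(113, 114, NA)
)

# Select by name rather than by position
subset_by_name <- select(toy, hospital_death, age, apache_2_diagnosis)

# Derive an indicator column, then drop the source column by name
subset_by_name <- subset_by_name %>%
  mutate(sepsis = ifelse(apache_2_diagnosis == 113, "1", "0")) %>%
  select(-apache_2_diagnosis)

subset_by_name
```

Rows with a missing diagnosis get `NA` in the new `sepsis` column, which is why the later sections still have nulls to fill in for the derived columns.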

## Changing data types of the selected variables 

In [5]:
#change categorical variables to factor data type 
practice_csv$hospital_death<- as.factor(practice_csv$hospital_death)
practice_csv$sepsis<- as.factor(practice_csv$sepsis)
practice_csv$cardiovascular_diagnosis<- as.factor(practice_csv$cardiovascular_diagnosis)
practice_csv$intubated_apache<- as.factor(practice_csv$intubated_apache)
practice_csv$gcs_eyes_apache<- as.factor(practice_csv$gcs_eyes_apache)
practice_csv$gcs_motor_apache<- as.factor(practice_csv$gcs_motor_apache)

#change numeric variables to numeric data type
practice_csv$temp_apache<- as.numeric(practice_csv$temp_apache)
practice_csv$map_apache<- as.numeric(practice_csv$map_apache)
practice_csv$h1_heartrate_max<- as.numeric(practice_csv$h1_heartrate_max)
practice_csv$d1_resprate_max<- as.numeric(practice_csv$d1_resprate_max)
practice_csv$d1_potassium_max<- as.numeric(practice_csv$d1_potassium_max)
practice_csv$d1_creatinine_max<- as.numeric(practice_csv$d1_creatinine_max)
practice_csv$d1_hematocrit_max<- as.numeric(practice_csv$d1_hematocrit_max)
practice_csv$sodium_apache<- as.numeric(practice_csv$sodium_apache)
practice_csv$wbc_apache<- as.numeric(practice_csv$wbc_apache)
practice_csv$age<- as.numeric(practice_csv$age)
practice_csv$pre_icu_los_days<- as.numeric(practice_csv$pre_icu_los_days)
practice_csv$bmi<- as.numeric(practice_csv$bmi)

#check all variable data types to make sure they are correct
str(practice_csv)

tibble [91,713 x 18] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ hospital_death          : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 1 ...
 $ temp_apache             : num [1:91713] 39.3 35.1 36.7 34.8 36.7 36.6 35 36.6 36.9 36.3 ...
 $ map_apache              : num [1:91713] 40 46 68 60 103 130 138 60 66 58 ...
 $ h1_heartrate_max        : num [1:91713] 119 114 96 100 89 83 79 118 82 96 ...
 $ d1_resprate_max         : num [1:91713] 34 32 21 23 18 32 38 28 24 44 ...
 $ sodium_apache           : num [1:91713] 134 145 NA NA NA 137 135 140 142 139 ...
 $ d1_potassium_max        : num [1:91713] 4 4.2 NA 5 NA 3.9 5 5.8 5.2 4.1 ...
 $ d1_creatinine_max       : num [1:91713] 2.51 0.71 NA NA NA 0.71 0.85 2.05 1.16 0.83 ...
 $ d1_hematocrit_max       : num [1:91713] 27.4 36.9 NA 34 NA 44.2 37.5 25.5 37.9 37.2 ...
 $ wbc_apache              : num [1:91713] 14.1 12.7 NA 8 NA 10.9 5.9 12.8 24.7 8.4 ...
 $ gcs_eyes_apache         : Factor w/ 4 levels "1","2","3","4": 3 1 3 4 NA 4 4 4 4 4 ...
 $
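The conversions above are written one column at a time. A more compact sketch, assuming the column names are collected into two character vectors first, applies the same conversion to a whole group of columns with `lapply`. The toy data frame and the vector names `factor_cols` and `numeric_cols` are hypothetical and exist only for this example.

```r
# Toy stand-in for practice_csv (values are invented for the demo)
toy <- data.frame(
  hospital_death = c(0, 1, 0),
  gcs_eyes_apache = c(3, 1, 4),
  age = c("68", "77", "25"),
  stringsAsFactors = FALSE
)

# Group the columns by the type they should have
factor_cols <- c("hospital_death", "gcs_eyes_apache")
numeric_cols <- c("age")

# Convert each group in one step
toy[factor_cols] <- lapply(toy[factor_cols], as.factor)
toy[numeric_cols] <- lapply(toy[numeric_cols], as.numeric)

# Check the resulting data types
str(toy)
```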

## Replacing null values using measures of center (mean, median, mode)  

In [6]:
#replace some numeric variable nulls with the column mean 
practice_csv = transform(practice_csv, bmi = ifelse(is.na(bmi), mean(bmi, na.rm=TRUE), bmi))
practice_csv = transform(practice_csv, temp_apache = ifelse(is.na(temp_apache), mean(temp_apache, na.rm=TRUE), temp_apache))
practice_csv = transform(practice_csv, map_apache = ifelse(is.na(map_apache), mean(map_apache, na.rm=TRUE), map_apache))
practice_csv = transform(practice_csv, h1_heartrate_max = ifelse(is.na(h1_heartrate_max), mean(h1_heartrate_max, na.rm=TRUE), h1_heartrate_max))
practice_csv = transform(practice_csv, d1_resprate_max = ifelse(is.na(d1_resprate_max), mean(d1_resprate_max, na.rm=TRUE), d1_resprate_max))
practice_csv = transform(practice_csv, d1_potassium_max = ifelse(is.na(d1_potassium_max), mean(d1_potassium_max, na.rm=TRUE), d1_potassium_max))
practice_csv = transform(practice_csv, d1_creatinine_max = ifelse(is.na(d1_creatinine_max), mean(d1_creatinine_max, na.rm=TRUE), d1_creatinine_max))
practice_csv = transform(practice_csv, d1_hematocrit_max = ifelse(is.na(d1_hematocrit_max), mean(d1_hematocrit_max, na.rm=TRUE), d1_hematocrit_max))
practice_csv = transform(practice_csv, sodium_apache = ifelse(is.na(sodium_apache), mean(sodium_apache, na.rm=TRUE), sodium_apache))

#replace some categorical variable nulls with 0 (0 is the mode for these columns)
#gcs_motor_apache nulls replaced with a 1 (1 is the mode for this column) 
practice_csv$sepsis[is.na(practice_csv$sepsis)] <- 0
practice_csv$cardiovascular_diagnosis[is.na(practice_csv$cardiovascular_diagnosis)] <- 0
practice_csv$intubated_apache[is.na(practice_csv$intubated_apache)] <- 0
practice_csv$gcs_motor_apache[is.na(practice_csv$gcs_motor_apache)] <- 1
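The pattern above can also be expressed as a loop, with the mode computed from the data rather than typed in by hand. This is a minimal sketch on invented toy data; the helper `most_common_level` is hypothetical and not part of any package. Note that it fills every numeric column, whereas the notebook deliberately leaves some columns for the random forest step.

```r
# Toy stand-in for practice_csv (values are invented for the demo)
toy <- data.frame(
  bmi = c(22.7, NA, 31.9, 27.4),
  intubated_apache = factor(c("0", NA, "0", "1"))
)

# Replace nulls in every numeric column with that column's mean
numeric_cols <- names(toy)[sapply(toy, is.numeric)]
for (col in numeric_cols) {
  missing <- is.na(toy[[col]])
  toy[[col]][missing] <- mean(toy[[col]], na.rm = TRUE)
}

# Hypothetical helper: the most frequent level of a factor (NA is ignored)
most_common_level <- function(x) names(which.max(table(x)))

# Replace nulls in the categorical column with its mode
mode_value <- most_common_level(toy$intubated_apache)
toy$intubated_apache[is.na(toy$intubated_apache)] <- mode_value

toy
```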

## Replacing null values using a random forest algorithm 

In [None]:
#use a random forest algorithm to replace the remaining nulls
#random forest is non-parametric and works for both categorical and numeric variables
#ntree is the number of trees, maxiter is the maximum number of iterations (repeats)
#more trees and more iterations can increase accuracy but take more time to run
install.packages("missForest")
install.packages("randomForest")
library(missForest)
library(randomForest)
set.seed(96) 
practice_csv.imp <- missForest(practice_csv, verbose = TRUE, maxiter = 3, ntree = 10)

#check imputed values
practice_csv.imp$ximp

#check imputation error
practice_csv.imp$OOBerror

#assign imputed data frame to a new data frame called t
t<- practice_csv.imp$ximp

also installing the dependencies 'codetools', 'iterators', 'randomForest', 'foreach', 'itertools'



package 'codetools' successfully unpacked and MD5 sums checked
package 'iterators' successfully unpacked and MD5 sums checked
package 'randomForest' successfully unpacked and MD5 sums checked
package 'foreach' successfully unpacked and MD5 sums checked
package 'itertools' successfully unpacked and MD5 sums checked
package 'missForest' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\Aaron\AppData\Local\Temp\RtmpM7hIIX\downloaded_packages
package 'randomForest' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\Aaron\AppData\Local\Temp\RtmpM7hIIX\downloaded_packages


"package 'missForest' was built under R version 3.6.3"
Loading required package: randomForest
"package 'randomForest' was built under R version 3.6.3"
randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.

Attaching package: 'randomForest'

The following object is masked from 'package:dplyr':

    combine

Loading required package: foreach
"package 'foreach' was built under R version 3.6.3"
Loading required package: itertools
"package 'itertools' was built under R version 3.6.3"
Loading required package: iterators
"package 'iterators' was built under R version 3.6.3"
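Because the full data set has over 90,000 rows, it can help to try the imputation on something small first. The sketch below is only an illustration: it uses R's built-in `iris` data as a stand-in and the `prodNA` helper from missForest to blank out roughly 10% of the values, then runs the same call with the same `maxiter` and `ntree` settings as above.

```r
library(missForest)

set.seed(96)

# iris is a stand-in data set; prodNA() removes about 10% of its values at random
iris_missing <- prodNA(iris, noNA = 0.1)

# Same settings as in the notebook: few trees and iterations to keep the run short
iris_imputed <- missForest(iris_missing, maxiter = 3, ntree = 10)

# The completed data frame should no longer contain any NA values
sum(is.na(iris_imputed$ximp))

# Out-of-bag error estimates: NRMSE for numeric columns, PFC for factor columns
iris_imputed$OOBerror
```

Counting the remaining nulls with `sum(is.na(...))` is usually more practical than printing `ximp` in full, which for the course data would list every row.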

## This concludes our introduction to data analytics 
- The following questions were addressed across this notebook and the accompanying video lecture:
1. What is a CSV file and how is it loaded into R?
2. What are some simple and more advanced methods for dealing with nulls?
3. How should values that are present but highly implausible be handled?


# Exporting CSV

This section provides a space to export the data you've already cleaned. I made a new section so it wouldn't interfere with your other work.

-Eric 
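
A minimal sketch of the export step, assuming the cleaned data is held in a data frame. The toy data frame `cleaned`, the file name `cleaned_practice.csv` and the use of a temporary folder are placeholders for illustration; in the notebook, the imputed data frame `t` would take the place of `cleaned`.

```r
# Toy stand-in for the cleaned data (values are invented for the demo)
cleaned <- data.frame(
  age = c(68, 77),
  hospital_death = factor(c("0", "1"))
)

# Placeholder output location; replace with the folder you want to save to
out_path <- file.path(tempdir(), "cleaned_practice.csv")

# row.names = FALSE keeps R's row numbers out of the file
write.csv(cleaned, out_path, row.names = FALSE)

# Read the file back to confirm the export worked
check <- read.csv(out_path)
str(check)
```

Factor columns are written as plain text, so they need to be converted back with `as.factor` after the file is read in again.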