# Type conversions

## Types of variables in R

* character
* numeric : 22.44, Inf, NaN
* integer : 4L
* logical : TRUE, FALSE, NA
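
As a minimal sketch of the types listed above, the snippet below checks a few literals with `class()` and coerces between types with the base R `as.*()` functions. The variable names (`x_chr`, `x_num`, `f`, `codes`, `values`) are made up for illustration and are not part of the course data.

```r
# Inspect the class of a few literal values
class("hello")   # "character"
class(22.44)     # "numeric"
class(4L)        # "integer"
class(TRUE)      # "logical"

# Coerce a character vector to numeric; "abc" cannot be parsed, so it becomes NA
x_chr <- c("1", "2.5", "abc")
x_num <- suppressWarnings(as.numeric(x_chr))

# Factors store integer codes, so convert through character to recover the labels
f <- factor(c("10", "20", "10"))
codes  <- as.numeric(f)                # 1 2 1   (the internal codes)
values <- as.numeric(as.character(f))  # 10 20 10 (the labels as numbers)
```

Without `suppressWarnings()`, the failed coercion of `"abc"` would also emit a warning that NAs were introduced.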

## Overview of lubridate

* Written by Garrett Grolemund & Hadley Wickham

## Dates with lubridate

* mdy
* ymd
* hms
* ymd_hms
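
The function names above spell out the order of the components in the input. A small sketch, assuming the lubridate package is installed and using made-up input strings:

```r
library(lubridate)

d1 <- mdy("09/17/2015")               # month, day, year
d2 <- ymd("2015-09-17")               # year, month, day
t1 <- hms("12:56:09")                 # hours, minutes, seconds (a Period, not a date)
dt <- ymd_hms("2015-09-17 12:56:09")  # date and time (POSIXct, UTC unless tz is given)

class(d1)   # "Date"
d1 == d2    # TRUE: both strings describe the same day
class(dt)   # "POSIXct" "POSIXt"
```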

## Types of variables in R

In [None]:
# Make this evaluate to character
# class(true)
class("true")

# Make this evaluate to numeric
# class("8484.00")
class(8484.00)

# Make this evaluate to integer
# class(99)
class(99L)

# Make this evaluate to factor
# class("factor")
class(as.factor("factor"))

# Make this evaluate to logical
# class("FALSE")
class(FALSE)

## Common type conversions

In [None]:
# Preview students with str()
str(students)

# Coerce Grades to character
students$Grades <- as.character(students$Grades)

# Coerce Medu to factor
students$Medu <- as.factor(students$Medu)

# Coerce Fedu to factor
students$Fedu <- as.factor(students$Fedu)
    
# Look at students once more with str()
str(students)

## Working with dates

In [2]:
# Preview students2 with str()
str(students2)

# Load the lubridate package
library(lubridate)

# Parse as date
# "17 Sep 2015"
dmy("17 Sep 2015")

# Parse as date and time (with no seconds!)
# "July 15, 2012 12:56"
mdy_hm("July 15, 2012 12:56")

# Coerce dob to a date (with no time)
students2$dob <- ymd(students2$dob)

# Coerce nurse_visit to a date and time
students2$nurse_visit <- ymd_hms(students2$nurse_visit)
    
# Look at students2 once more with str()
str(students2)

# String manipulation

## Overview of stringr

* R package written by Hadley Wickham
* Suite of helpful functions for working with strings
* Functions share consistent interface

## Key functions in stringr for cleaning data

* str_trim() : Trim leading and trailing white space
* str_pad() : Pad string with additional characters
* str_detect() : Search for a pattern in each string of a vector and return logical values
* str_replace() : Replace the first match of a pattern in each string of a vector
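
A short sketch of the four functions on one toy vector, assuming stringr is installed. The `ids` vector is invented for illustration; `str_replace_all()` is included only to contrast with `str_replace()`, which changes the first match in each string.

```r
library(stringr)

ids <- c(" a1 ", "b22", "c333 ")

# Remove leading and trailing whitespace
trimmed <- str_trim(ids)                            # "a1" "b22" "c333"

# Pad on the left with zeros to a fixed width of 5
padded <- str_pad(trimmed, width = 5, side = "left", pad = "0")
                                                    # "000a1" "00b22" "0c333"

# Test each string for a pattern
has_two <- str_detect(trimmed, "2")                 # FALSE TRUE FALSE

# Replace the first match versus every match
replaced     <- str_replace(trimmed, "3", "X")      # "a1" "b22" "cX33"
replaced_all <- str_replace_all(trimmed, "3", "X")  # "a1" "b22" "cXXX"
```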

## Other helpful functions in base R

* tolower() : Make all lowercase
* toupper() : Make all uppercase

## Trimming and padding strings

In [4]:
# Load the stringr package
library(stringr)

# Trim all leading and trailing whitespace
# c("   Filip ", "Nick  ", " Jonathan")
str_trim(c("   Filip ", "Nick  ", " Jonathan"))

# Pad these strings with leading zeros
# c("23485W", "8823453Q", "994Z")
str_pad(c("23485W", "8823453Q", "994Z"), width = 9, pad = "0")

## Upper and lower case

In [7]:
states <- c("al", "ak", "az", "ar", "ca", "co", "ct", "de", "fl", "ga", "hi", "id", "il", "in", "ia",
            "ks", "ky", "la", "me", "md", "ma", "mi", "mn", "ms", "mo", "mt", "ne", "nv", "nh", "nj",
            "nm", "ny", "nc", "nd", "oh", "ok", "or", "pa", "ri", "sc", "sd", "tn", "tx", "ut", "vt",
            "va", "wa", "wv", "wi", "wy")

# Print state abbreviations
print(states)

# Make states all uppercase and save result to states_upper
states_upper <- toupper(states)
states_upper

# Make states_upper all lowercase again
tolower(states_upper)

 [1] "al" "ak" "az" "ar" "ca" "co" "ct" "de" "fl" "ga" "hi" "id" "il" "in" "ia"
[16] "ks" "ky" "la" "me" "md" "ma" "mi" "mn" "ms" "mo" "mt" "ne" "nv" "nh" "nj"
[31] "nm" "ny" "nc" "nd" "oh" "ok" "or" "pa" "ri" "sc" "sd" "tn" "tx" "ut" "vt"
[46] "va" "wa" "wv" "wi" "wy"


## Finding and replacing strings

* str_detect()
* str_replace()

In [None]:
## stringr has been loaded for you

# Look at the head of students2
head(students2)

# Detect all dates of birth (dob) in 1997
str_detect(students2$dob, "1997")

# In the sex column, replace "F" with "Female"...
students2$sex <- str_replace(students2$sex, "F", "Female")

# ...And "M" with "Male"
students2$sex <- str_replace(students2$sex, "M", "Male")

# View the head of students2
head(students2)

# Missing and special values

## Missing values

* May be random, but dangerous to assume
* Sometimes associated with variable/outcome of interest
* In R, represented as NA
* May appear in other forms
    - #N/A (Excel)
    - Single dot (SPSS, SAS)
    - Empty string
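
One possible way to normalize these alternative encodings is to recode them to `NA`, either after loading or while reading the file. The sketch below uses invented data; `missing_codes` is a hypothetical list that would need to match whatever codes the real source uses.

```r
# Recode known missing-value markers in an existing character vector
raw <- c("12", "#N/A", ".", "", "7")
missing_codes <- c("#N/A", ".", "")   # assumption: these three markers mean "missing"
raw[raw %in% missing_codes] <- NA
values <- as.numeric(raw)             # 12 NA NA NA 7

# Or declare the markers while reading, via the na.strings argument of read.csv()
df <- read.csv(text = "a,b\n1,#N/A\n.,2", na.strings = missing_codes)
df$a   # 1 NA
df$b   # NA 2
```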

## Special values

* Inf - "Infinite value" (indicative of outliers?)
    - 1/0
    - 1/0 + 1/0
    - 33333^33333
* NaN - "Not a number" (rethink a variable?)
    - 0/0
    - 1/0 - 1/0
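
The expressions above can be checked directly with the base R predicates `is.infinite()`, `is.nan()`, `is.na()` and `is.finite()`. A small sketch on an invented vector `x`:

```r
x <- c(1/0, -1/0, 0/0, 1/0 - 1/0, NA, 42)

is.infinite(x)  # TRUE  TRUE  FALSE FALSE FALSE FALSE
is.nan(x)       # FALSE FALSE TRUE  TRUE  FALSE FALSE
is.na(x)        # FALSE FALSE TRUE  TRUE  TRUE  FALSE  (NaN also counts as NA)
is.finite(x)    # FALSE FALSE FALSE FALSE FALSE TRUE
```

Note that `is.na()` is TRUE for both `NA` and `NaN`, while `is.nan()` is TRUE only for `NaN`.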

## Finding missing values

* is.na() - Check for NAs
* any(is.na()) - Are there any NAs?
* sum(is.na()) - Count the number of NAs
* summary() - Report the number of NAs in each column that has any
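
A brief sketch of these checks on an invented data frame `df`. The `colSums(is.na(df))` idiom is not in the list above; it is added here as one common way to get a per-column count.

```r
df <- data.frame(id    = 1:4,
                 score = c(90, NA, 75, NA),
                 grade = c("A", NA, "C", "B"))

is.na(df$score)           # FALSE TRUE FALSE TRUE
any(is.na(df))            # TRUE
total_na <- sum(is.na(df))         # 3 missing values in total
na_per_col <- colSums(is.na(df))   # id: 0, score: 2, grade: 1
```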

## Dealing with missing values

* complete.cases() - Find rows with no missing values
* df[complete.cases(df), ] - Subset data, keeping only complete cases
* na.omit() - Another way to remove rows with NAs

In [8]:
social_df <- data.frame(name = c("Sarah", "Tom", "David", "Alice"),
                       n_friends = c(244, NA, 145, 43),
                       status = c("Going out!", "", "Movie night...", ""))

# Call is.na() on the full social_df to spot all NAs
is.na(social_df)

# Use the any() function to ask whether there are any NAs in the data
any(is.na(social_df))

# View a summary() of the dataset
summary(social_df)

# Call table() on the status column
table(social_df$status)

      name n_friends status
[1,] FALSE     FALSE  FALSE
[2,] FALSE      TRUE  FALSE
[3,] FALSE     FALSE  FALSE
[4,] FALSE     FALSE  FALSE


    name     n_friends                status 
 Alice:1   Min.   : 43.0                 :2  
 David:1   1st Qu.: 94.0   Going out!    :1  
 Sarah:1   Median :145.0   Movie night...:1  
 Tom  :1   Mean   :144.0                     
           3rd Qu.:194.5                     
           Max.   :244.0                     
           NA's   :1                         


                   Going out! Movie night... 
             2              1              1 

## Dealing with missing values

In [9]:
## The stringr package is preloaded

# Replace all empty strings in status with NA
social_df$status[social_df$status == ""] <- NA

# Print social_df to the console
print(social_df)

# Use complete.cases() to see which rows have no missing values
complete.cases(social_df)

# Use na.omit() to remove all rows with any missing values
na.omit(social_df)

   name n_friends         status
1 Sarah       244     Going out!
2   Tom        NA           <NA>
3 David       145 Movie night...
4 Alice        43           <NA>


   name n_friends         status
1 Sarah       244     Going out!
3 David       145 Movie night...


# Outliers and obvious errors

## Outliers

* boxplot() - Box-and-whisker plot makes outliers easy to spot
* Extreme values distant from other values
* Several causes:
    - Valid measurements
    - Variability in measurement
    - Experimental error
    - Data entry error
* May be discarded or retained depending on cause

## Obvious errors

* May appear in many forms
    - Values so extreme they can't be plausible (e.g. person aged 243)
    - Values that don't make sense (e.g. negative age)
* Several causes
    - Measurement error
    - Data entry error
    - Special code for missing data (e.g. -1 means missing)
* Should generally be removed or replaced
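
One way to replace such values is to set them to `NA` so that later summaries ignore them. The sketch below uses an invented `people` data frame; treating -1 as the missing-data code and 0 to 120 as the plausible age range are assumptions for illustration, not rules from the course data.

```r
people <- data.frame(age = c(34, -1, 243, 27, -5))

# Assumption: -1 is a special code meaning "missing"
people$age[people$age == -1] <- NA

# Assumption: ages outside 0-120 are treated as obvious errors
implausible <- !is.na(people$age) & (people$age < 0 | people$age > 120)
people$age[implausible] <- NA

people$age                       # 34 NA NA 27 NA
mean(people$age, na.rm = TRUE)   # 30.5
```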

## Finding outliers and errors

* boxplot()
* hist()
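
Besides the plots, the values that a box plot would draw as outliers can be listed numerically with `boxplot.stats()` from base R. A sketch on an invented vector `x`, using the default whisker rule of 1.5 times the hinge spread:

```r
x <- c(12, 14, 15, 15, 16, 17, 18, 95)

stats <- boxplot.stats(x)
stats$stats                            # lower whisker, lower hinge, median, upper hinge, upper whisker
outliers <- stats$out                  # 95
outlier_pos <- which(x %in% outliers)  # 8 (position of the outlier in x)
```

Whether a flagged value is dropped or kept still depends on its cause, as noted in the Outliers section.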

## Dealing with outliers and obvious errors

In [None]:
# Look at a summary() of students3
summary(students3)

# View a histogram of the age variable
hist(students3$age)

# View a histogram of the absences variable
hist(students3$absences)

# View a histogram of absences, but force zeros to be bucketed to the right of zero
hist(students3$absences, right = FALSE)
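
To see what `right = FALSE` changes, the sketch below compares bin counts on an invented vector with explicit breaks; `plot = FALSE` returns the counts without drawing. With the default `right = TRUE` the bins are closed on the right, so a value sitting exactly on a break falls into the bin to its left; with `right = FALSE` it falls into the bin to its right.

```r
x <- c(0, 0, 0, 1, 2, 5, 5, 10)
breaks <- c(0, 5, 10)

# Default: bins are [0, 5] and (5, 10]
h_right <- hist(x, breaks = breaks, plot = FALSE)

# right = FALSE: bins are [0, 5) and [5, 10]
h_left <- hist(x, breaks = breaks, right = FALSE, plot = FALSE)

h_right$counts   # 7 1
h_left$counts    # 5 3
```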

## Another look at strange values

In [None]:
# View a boxplot of age
boxplot(students3$age)

# View a boxplot of absences
boxplot(students3$absences)