
# Lab 7: Missing Values and Cleaning Messy Data

## March 15th, 2022

In [None]:
library(tidyverse)

In [None]:
(stocks <- tibble(
  Year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  Qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  Return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
))

# 1. Missing Values
Missing values can be:
1. *Explicit* (marked as NA in our data)
1. *Implicit* (not present in the data)

In this example we have one explicitly missing value for the 4th quarter of 2015. 

Are there any other missing values? Yes: there is no row at all for the first quarter of 2016, so that value is implicitly missing.
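
One way to check both kinds of missingness in code is sketched below. It is a minimal illustration rather than the only approach: `n_explicit` and `implicit` are names introduced here, and `expand()` plus `anti_join()` are used to list the Year/Qtr combinations that have no row.

```r
library(tidyverse)

stocks <- tibble(
  Year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  Qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  Return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

# Explicit missing values: the row exists but holds NA
n_explicit <- sum(is.na(stocks$Return))

# Implicit missing values: Year/Qtr combinations with no row at all.
# expand() builds every combination; anti_join() keeps those absent from stocks.
implicit <- stocks %>%
  expand(Year, Qtr) %>%
  anti_join(stocks, by = c("Year", "Qtr"))

n_explicit  # 1
implicit    # one row: Year 2016, Qtr 1
```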

## Handling Missing Data

`complete`: Turns implicit missing values into explicit missing values.

Specify a comma-separated list of columns to generate every possible combination of their values. Combinations that are absent from the data get a new row, with `NA` in the remaining columns.

In [None]:
stocks %>% complete(Year, Qtr)
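
`complete` also accepts a `fill` argument, a named list of values to use instead of `NA`. The sketch below uses 0 purely as a placeholder for illustration (a zero return is not the same as an unknown one), and `filled` is a name introduced here. As far as I know, this also overwrites `NA`s that were already in the data by default; recent tidyr versions have an `explicit` argument to control that.

```r
library(tidyverse)

stocks <- tibble(
  Year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  Qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  Return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

# Add the missing Year/Qtr combination and use 0 instead of NA for Return
filled <- stocks %>% complete(Year, Qtr, fill = list(Return = 0))

filled
```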

`fill`: Fills missing values in selected columns. By default (`.direction = "down"`) each `NA` is replaced with the most recent non-missing entry above it.

In [None]:
stocks %>% complete(Year, Qtr) %>% fill(Return)

In [None]:
stocks %>% complete(Year, Qtr) %>% fill(Return, .direction="up")
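
The direction matters most at the edges of a column. Here is a small sketch on a made-up vector (`toy` is invented for illustration) comparing `"down"`, `"up"`, and `"downup"`:

```r
library(tidyverse)

toy <- tibble(x = c(NA, 1, NA, 3, NA))

# Default: carry the previous value down; a leading NA has nothing above it
down <- toy %>% fill(x)

# Carry the next value up; a trailing NA has nothing below it
up <- toy %>% fill(x, .direction = "up")

# Fill down first, then fill up whatever is still missing
downup <- toy %>% fill(x, .direction = "downup")

down$x    # NA  1  1  3  3
up$x      #  1  1  3  3 NA
downup$x  #  1  1  1  3  3
```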

The missing values also become explicit if we widen the tibble.

In [None]:
stocks_wide = stocks %>% pivot_wider(names_from = Year, values_from = Return)

stocks_wide

`pivot_longer` will keep all these explicitly missing values by default.

In [None]:
stocks_wide %>% pivot_longer(cols = `2015`:`2016`, names_to = 'Year') %>%
arrange(Year)
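
If the explicit `NA`s are not wanted after lengthening, `pivot_longer` has a `values_drop_na` argument. The sketch below rebuilds `stocks` so it runs on its own; `stocks_long` is a name introduced here.

```r
library(tidyverse)

stocks <- tibble(
  Year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  Qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  Return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

stocks_long <- stocks %>%
  pivot_wider(names_from = Year, values_from = Return) %>%
  pivot_longer(
    cols = `2015`:`2016`,
    names_to = "Year",
    values_to = "Return",
    values_drop_na = TRUE  # drop rows whose Return is NA
  )

nrow(stocks_long)  # 6: the 8 Qtr/Year cells minus the 2 NAs
```

Note that `Year` comes back as a character column, because it was built from column names.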

# 2. Cleaning Messy Data

In [None]:
datacamp_url = "https://assets.datacamp.com/production/repositories/34/datasets/b3c1036d9a60a9dfe0f99051d2474a54f76055ea/weather.rds"
weather = readRDS(url(datacamp_url))

In [None]:
weather %>% glimpse

In [None]:
weather %>% head

The first column, `X`, only holds row numbers, so let's drop it.

In [None]:
weather = weather %>% select(-X)

Each row holds one weather measurement (named in column 3, `measure`) for one month, with the values for each day of the month spread across the columns `X1` to `X31`. From a **tidy data** perspective, the data set is messy because:
* Values are given as column names (`X1` to `X31`)
* Variable names are represented as values (column 3 - `measure`)

We can correct it by using `pivot_longer`.

In [None]:
tidy_weather = weather %>% 
  pivot_longer(cols = `X1`:`X31`, names_to = 'day', values_to = "value") %>%
  select(year, month, day, everything())

head(tidy_weather)

Each distinct value in the `measure` column is really a variable name, so it should become its own column. We can do that with `pivot_wider`.

In [None]:
tidy_weather = tidy_weather %>% 
  pivot_wider(names_from = measure, values_from = value)

head(tidy_weather)

In [None]:
tidy_weather %>% glimpse
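
Since the real file has to be downloaded, here is a minimal sketch of the same two-step reshape on a made-up table. The tibble `messy_toy` and its values are invented for illustration; only the column layout mimics the lab data.

```r
library(tidyverse)

# Hypothetical miniature of the messy layout: one row per measure,
# one column per day of the month
messy_toy <- tibble(
  year    = 2014,
  month   = 12,
  measure = c("Max.TemperatureF", "Events"),
  X1      = c("64", "Rain"),
  X2      = c("42", "")
)

tidy_toy <- messy_toy %>%
  # Step 1: day columns become rows
  pivot_longer(cols = X1:X2, names_to = "day", values_to = "value") %>%
  # Step 2: each measure becomes its own column
  pivot_wider(names_from = measure, values_from = value)

tidy_toy
```

Because numbers and text shared one `value` column, every measurement column comes out as character, which is why type conversion is still needed afterwards.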

A few things about this data set are still odd. For one, the entries in the `day` column start with an `X`. We can fix this with the `str_replace` function. We saw this a few labs back, but let's review!

In [None]:

# str_replace replaces only the first instance of a substring (2nd arg)
str_replace("tattoo", "t", "l")

str_replace("tattoo", "tatt", "yah")

# use str_replace_all to replace every occurrence
str_replace_all("tattoo", "t", "b")

# Replace $ with nothing, so it removes the dollar sign
# we are applying this function to a vector!
# notice the use of \\...
# this is because $ is a reserved regex character
cost = c("$8", "12.5$", "$45")
cost = str_replace_all(cost, "\\$", "")
print(cost) 

#change its type to numeric
cost = as.numeric(cost)
print(cost)
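
An alternative to escaping with `\\` is to wrap the pattern in stringr's `fixed()`, which matches the text literally instead of as a regex. A small sketch, where `cost_fixed` is a name introduced here:

```r
library(tidyverse)

cost <- c("$8", "12.5$", "$45")

# fixed("$") matches a literal dollar sign, so no escaping is needed
cost_fixed <- str_replace_all(cost, fixed("$"), "")
as.numeric(cost_fixed)  # 8.0 12.5 45.0

# "." is another regex metacharacter: unescaped, it matches any character
str_replace_all("1.5", ".", "-")         # "---"
str_replace_all("1.5", fixed("."), "-")  # "1-5"
```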

#### Exercise 1: Remove `X` from the `day` entries and change its type to `numeric`

In [None]:
#as.integer

#### Exercise 2: Combine the year, month, and day columns into a new column called date.

*Hint: Use the `unite` function.*
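
As a reminder of how `unite` behaves, here is a sketch on a made-up tibble. The table `toy` and its column names `yr`, `mo`, and `dy` are hypothetical, not the weather data.

```r
library(tidyverse)

toy <- tibble(yr = c(2014, 2015), mo = c(12, 1), dy = c(31, 2))

# Paste the three columns into one character column; the originals are dropped
toy_date <- toy %>% unite("date", yr, mo, dy, sep = "-")
toy_date$date  # "2014-12-31" "2015-1-2"

# The result is text; as.Date() turns it into a proper date
toy_parsed <- toy_date %>% mutate(date = as.Date(date))
toy_parsed$date
```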

#### Exercise 3: Move the events variable to the second column (just after `date`)

#### Exercise 4: `PrecipitationIn` contains "T" for "trace", defined as precipitation of less than 0.005 inch. Map each "T" to 0.

### Fun exercise: What is happening in the cell below?

In [None]:
l = list(as.numeric, sqrt, `+`, c('1','9'))

l[[2]](l[[3]](l[[1]](l[[4]])[1], l[[1]](l[[4]])[2]))

#### Exercise 5: What are the unique events in the dataset?

#### Exercise 6: An empty entry means that there is no weather event. Change empties to `Clear`.

Reference: [Cleaning Messy Weather Dataset with tidyverse](https://www.rpubs.com/justinhtet/cleaning-messy-weather-dataset-with-tidyverse)