# Time to put it all together!

## The challenge

* Historical weather data from Boston, USA
* 12 months beginning Dec 2014
* The data are dirty
    - Column names are values
    - Variables coded incorrectly
    - Missing and extreme values
    - ...
* Goal : Clean the data

## Understanding the structure of your data

* class() - Class of data object
* dim() - Dimensions of data
* names() - Column names
* str() - Preview of data with helpful details
* glimpse() - Better version of str() from dplyr package
* summary() - Summary of data

## Looking at your data

* head() - View top of dataset
* tail() - View bottom of dataset
* print() - View entire dataset (not recommended!)

## Visualizing your data

* hist() - View histogram of a single variable
* plot() - View plot of two variables
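
As a minimal sketch of these two functions, the snippet below uses a small made-up data frame (the `toy` object and its values are assumptions for illustration, not the Boston data). Passing `plot = FALSE` to `hist()` returns the bin counts without drawing, which is one way to check what the histogram would show.

```r
# Hypothetical toy data, invented for illustration only
toy <- data.frame(
  temp_f   = c(30, 32, 35, 41, 45, 52, 60, 61, 58, 47),
  humidity = c(80, 78, 75, 70, 66, 60, 55, 57, 59, 68)
)

# Histogram of a single variable
hist(toy$temp_f, main = "Temperature", xlab = "Degrees F")

# Scatter plot of two variables
plot(toy$temp_f, toy$humidity, xlab = "Degrees F", ylab = "Humidity (%)")

# Compute the histogram bins without drawing anything
h <- hist(toy$temp_f, plot = FALSE)

# Every observation falls in exactly one bin
sum(h$counts) == nrow(toy)
```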
    

## Get a feel for the data

In [None]:
# Verify that weather is a data.frame
class(weather)

# Check the dimensions
dim(weather)

# View the column names
names(weather)

## Summarize the data

In [None]:
# View the structure of the data
str(weather)

# Load dplyr package
library(dplyr)

# Look at the structure using dplyr's glimpse()
glimpse(weather)

# View a summary of the data
summary(weather)

## Take a closer look

In [None]:
# View first 6 rows
head(weather)

# View first 15 rows
head(weather, 15)

# View the last 6 rows
tail(weather)

# View the last 10 rows
tail(weather, 10)

# Let's tidy the data

## Column names are values
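
To see the problem in miniature, here is a sketch on a made-up two-day table (the `wide` object and its numbers are assumptions, not rows from the weather data). The day columns `X1` and `X2` are really values of a single "day" variable, so `gather()` stacks them into key-value pairs.

```r
library(tidyr)

# Hypothetical wide data: one column per day (values invented)
wide <- data.frame(
  measure = c("Max.TemperatureF", "Min.TemperatureF"),
  X1 = c(40, 28),
  X2 = c(45, 31),
  stringsAsFactors = FALSE
)

# Collapse the day columns into a key column (day) and a value column
long <- gather(wide, day, value, X1:X2)
long
```

Current tidyr releases mark `gather()` as superseded by `pivot_longer()`; it still works, and this sketch keeps it to match the rest of the walkthrough.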

In [None]:
# Load the tidyr package
library(tidyr)

# Gather the columns
weather2 <- gather(weather, day, value, X1:X31, na.rm = TRUE)

# View the head
head(weather2)

## Values are variable names
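
The reverse situation can be sketched on a small made-up table (the `long` object and its numbers are assumptions for illustration). Here the `measure` column holds what should be variable names, so `spread()` gives each measure its own column.

```r
library(tidyr)

# Hypothetical long data: the measure column holds variable names
long <- data.frame(
  day     = c("1", "1", "2", "2"),
  measure = c("Max.TemperatureF", "Min.TemperatureF",
              "Max.TemperatureF", "Min.TemperatureF"),
  value   = c(40, 28, 45, 31),
  stringsAsFactors = FALSE
)

# Turn each distinct measure into a column of its own
tidy <- spread(long, measure, value)
tidy
```

Current tidyr releases mark `spread()` as superseded by `pivot_wider()`; the older function is kept here for consistency with the exercise code.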

In [None]:
## The tidyr package is already loaded

# First remove column of row names
weather2 <- weather2[, -1]

# Spread the data
weather3 <- spread(weather2, measure, value)

# View the head
head(weather3)

# Prepare the data for analysis

## Dates with lubridate

* ymd() : year-month-day
* mdy() : month-day-year
* hms() : hour:minute:second
* ymd_hms() : year-month-day hour:minute:second
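
A brief sketch of these parsers in use, with made-up input strings chosen only for illustration. Each function is named after the order of the components it expects.

```r
library(lubridate)

# Each parser is named after the order of the components in the input
d1 <- ymd("2014-12-01")
d2 <- mdy("12/01/2014")
dt <- ymd_hms("2014-12-01 13:45:20")
tm <- hms("13:45:20")

# Different input layouts, same calendar date
d1 == d2

# Components can be pulled back out with accessor functions
year(dt)
hour(tm)
```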

## Type Conversion

* as.character() : To character string
* as.numeric() : To numeric
* as.integer() : To integer
* as.factor() : To factor
* as.logical() : To logical (TRUE/FALSE)
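
The conversions above can be sketched on small made-up vectors (all values here are assumptions for illustration). Note what happens to a string that is not a number: it becomes `NA`, which is why stray codes in a numeric column need handling before conversion.

```r
# Hypothetical character vector; "T" mimics a non-numeric code in a numeric column
x <- c("0.10", "0.00", "T", "1.25")

# Non-numeric strings become NA (R also emits a coercion warning)
num <- suppressWarnings(as.numeric(x))
num

# as.integer() drops the fractional part
int <- as.integer(c("3", "7.9"))
int

# as.logical() recognizes "TRUE"/"FALSE" and "T"/"F"; other strings become NA
lgl <- as.logical(c("TRUE", "F", "yes"))
lgl

# as.factor() stores the distinct values as levels
fct <- as.factor(c("Rain", "Snow", "Rain"))
levels(fct)
```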

## Clean up dates

In [None]:
## Load tidyr and dplyr (no effect if they are already attached)
library(tidyr)
library(dplyr)

# Load the stringr and lubridate packages
library(stringr)
library(lubridate)

# Remove X's from day column
weather3$day <- str_replace(weather3$day, "X", "")

# Unite the year, month, and day columns
weather4 <- unite(weather3, date, year, month, day, sep = "-")

# Convert date column to proper date format using lubridate's ymd()
weather4$date <- ymd(weather4$date)

# Rearrange columns using dplyr's select()
weather5 <- select(weather4, date, Events, CloudCover:WindDirDegrees)

# View the head of weather5
head(weather5)

## A closer look at column types

In [None]:
# View the structure of weather5
str(weather5)

# Examine the first 20 rows of weather5. Do most of the character columns hold numeric values?
head(weather5, 20)

# See what happens if we try to convert PrecipitationIn to numeric
as.numeric(weather5$PrecipitationIn)

## Column type conversions

In [None]:
## The dplyr and stringr packages are already loaded

# Replace T with "0" (T = trace); the replacement must be a character string
weather5$PrecipitationIn <- str_replace(weather5$PrecipitationIn, "T", "0")

# Convert characters to numerics
# (mutate_each() and funs() are deprecated; across() requires dplyr >= 1.0.0)
weather6 <- mutate(weather5, across(CloudCover:WindDirDegrees, as.numeric))

# Look at result
str(weather6)

# Missing, extreme, and unexpected values

## Finding missing values

* is.na() - Return logical vector or matrix of TRUEs and FALSEs
* sum(is.na()) - Count number of TRUEs
* which(is.na()) - Find indices of missing values
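
A minimal sketch of these three idioms on a made-up data frame (the `toy` object and its values are assumptions for illustration). `colSums()` is added here as one convenient way to get a per-column count.

```r
# Hypothetical data frame with a few missing values (values invented)
toy <- data.frame(
  gust = c(25, NA, 31, NA),
  temp = c(40, 42, NA, 39)
)

# Logical matrix, TRUE where a value is missing
is.na(toy)

# Total number of missing values
n_missing <- sum(is.na(toy))

# Missing values per column
per_column <- colSums(is.na(toy))

# Row indices where gust is missing
gust_na <- which(is.na(toy$gust))
```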

## Identifying errors

* Context matters!
* Plausible ranges
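
One way to turn a plausible range into a check is sketched below on made-up readings (the `toy` object, its values, and the chosen bounds are assumptions for illustration). Humidity is a percentage, so anything outside 0 to 100 is suspect; a distance such as visibility cannot be negative.

```r
# Hypothetical readings; 1000 and -1 stand in for data-entry errors
toy <- data.frame(
  humidity   = c(74, 1000, 88, 61),
  visibility = c(10, 9, -1, 8)
)

# Humidity is a percentage, so it should lie between 0 and 100
bad_humidity <- which(toy$humidity < 0 | toy$humidity > 100)

# Visibility is a distance, so it cannot be negative
bad_visibility <- which(toy$visibility < 0)

# Inspect the full rows before deciding how to correct them
toy[bad_humidity, ]
toy[bad_visibility, ]
```

Whether to correct, drop, or keep a flagged value still depends on context; the check only points at rows worth a closer look.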

## Finding missing values

In [None]:
# Count missing values
sum(is.na(weather6))

# Use summary() to see which columns contain NAs
summary(weather6)

# Find indices of NAs in Max.Gust.SpeedMPH
ind <- which(is.na(weather6$Max.Gust.SpeedMPH))

# Look at the full rows for records missing Max.Gust.SpeedMPH
weather6[ind, ]

## An obvious error

In [None]:
# Review distributions for all variables
summary(weather6)

# Find row with Max.Humidity of 1000
ind <- which(weather6$Max.Humidity == 1000)

# Look at the data for that day
weather6[ind, ]

# Change 1000 to 100
weather6$Max.Humidity[ind] <- 100

## Another obvious error

In [None]:
# Look at summary of Mean.VisibilityMiles
summary(weather6$Mean.VisibilityMiles)

# Get index of row with -1 value
ind <- which(weather6$Mean.VisibilityMiles == -1)

# Look at full row
weather6[ind, ]

# Set Mean.VisibilityMiles to the appropriate value: the midpoint of that row's max and min
weather6$Mean.VisibilityMiles[ind] <- (weather6$Max.VisibilityMiles[ind] + weather6$Min.VisibilityMiles[ind]) / 2 # 10

## Check other extreme values

In [None]:
# Review summary of full data once more
summary(weather6)

# Look at histogram for MeanDew.PointF
hist(weather6$MeanDew.PointF)

# Look at histogram for Min.TemperatureF
hist(weather6$Min.TemperatureF)

# Compare to histogram for Mean.TemperatureF
hist(weather6$Mean.TemperatureF)

## Finishing touches

In [None]:
# Clean up column names
# (new_colnames is assumed to be a predefined character vector, one name per column)
names(weather6) <- new_colnames

# Replace empty cells in events column
weather6$events[weather6$events == ""] <- "None"
    
# Print the first 6 rows of weather6
head(weather6)

# Your data are clean!

## Clean weather data

## Summary of your accomplishments

* Inspected the data
* Tidied the data
* Improved date representations
* Dealt with incorrect variable codings
* Found and dealt with missing data
* Identified and corrected errors
* Visualized the result