<p align="center">
<img src="https://github.com/datacamp/r-live-training-template/blob/master/assets/datacamp.svg?raw=True" alt = "DataCamp icon" width="50%">
<br>
<h1 align="center">Cleaning Data in R Live Training</h1>
</p>
<br>


Welcome to this hands-on training where you'll identify issues in a dataset and clean it from start to finish using R. It's often said that data scientists spend 80% of their time cleaning and manipulating data and only about 20% of their time analyzing it, so cleaning data is an important skill to master!

In this session, you will:

- Examine a dataset and identify its problem areas, and what needs to be done to fix them.
- Convert between data types to make analysis easier.
- Correct inconsistencies in categorical data.
- Deal with missing data.
- Perform data validation to ensure every value makes sense.

## **The Dataset**

The dataset we'll use is a CSV file named `nyc_airbnb.csv`, which contains data on [*Airbnb*](https://www.airbnb.com/) listings in New York City. It contains the following columns:

- `listing_id`: The unique identifier for a listing
- `name`: The description used on the listing
- `host_id`: Unique identifier for a host
- `host_name`: Name of host
- `nbhood_full`: Name of borough and neighborhood
- `coordinates`: Coordinates of listing _(latitude, longitude)_
- `room_type`: Type of room 
- `price`: Price per night for listing
- `nb_reviews`: Number of reviews received 
- `last_review`: Date of last review
- `reviews_per_month`: Average number of reviews per month
- `availability_365`: Number of days available per year
- `avg_rating`: Average rating (from 0 to 5)
- `avg_stays_per_month`: Average number of stays per month
- `pct_5_stars`: Percent of reviews that were 5-stars
- `listing_added`: Date when listing was added


In [0]:
# Install non-tidyverse packages
install.packages("visdat")

In [0]:
# Load packages
library(readr)
library(dplyr)
library(stringr)
library(ggplot2)
library(visdat)
library(tidyr)

In [0]:
# Load dataset
airbnb <- read_csv("https://raw.githubusercontent.com/datacamp/cleaning-data-in-r-live-training/master/assets/nyc_airbnb.csv")

In [0]:
# Examine the first few rows
head(airbnb)

## Diagnosing data cleaning problems

We'll need to get a good look at the data frame in order to identify any problems that may cause issues during an analysis. There are a variety of functions (both from base R and `dplyr`) that can help us with this:

1. `head()` to look at the first few rows of the data
2. `glimpse()` to get a summary of the variables' data types
3. `summary()` to compute summary statistics of each variable and display the number of missing values
4. `duplicated()` to find duplicates


In [0]:
# Print the first few rows of data
head(airbnb)

In [0]:
# Inspect data types
glimpse(airbnb)

3. Columns like `coordinates` and `price` are stored as character strings instead of numeric values.
4. Columns with dates like `last_review` and `listing_added` are stored as character strings instead of the `Date` data type.

In [0]:
# Examine summary statistics and missing values
summary(airbnb)

5. There are 2075 missing values in `reviews_per_month`, `avg_rating`, `avg_stays_per_month`, and `pct_5_stars`.
6. The max of `avg_rating` is above 5 (an out-of-range value).
7. There are inconsistencies in the categories of `room_type`, i.e. `"Private"`, `"Private room"`, and `"PRIVATE ROOM"`.

In [0]:
# Count data with duplicated listing_id
airbnb %>%
  filter(duplicated(listing_id)) %>%
  nrow()

8. Duplicates: there are 20 rows whose `listing_id` already appeared earlier in the dataset.
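
Note that `duplicated()` flags only the second and later occurrences of a value, not the first. A minimal base R sketch on a made-up vector of IDs (the values are hypothetical, not taken from `nyc_airbnb.csv`) shows the difference between counting repeated rows and counting IDs that occur more than once:

```r
# Hypothetical listing IDs; 101 appears three times and 103 twice
ids <- c(101, 102, 101, 103, 103, 101)

# duplicated() marks each repeat of an already-seen value
dup_flags <- duplicated(ids)
print(dup_flags)

# Number of repeated rows (what the cell above counts)
n_repeat_rows <- sum(dup_flags)

# Number of distinct IDs that occur more than once
id_counts <- table(ids)
n_repeated_ids <- sum(id_counts > 1)

print(n_repeat_rows)
print(n_repeated_ids)
```

The two numbers differ whenever an ID appears three or more times, so it helps to be explicit about which one is being reported.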

## What do we need to do?

**Data type issues**

1. Split `coordinates` into latitude and longitude and convert to `numeric` data type.
2. Remove `$`s from `price` column and convert to `numeric`.
3. Convert `last_review` and `listing_added` to `Date`.

**Text & categorical data issues**

4. Split `nbhood_full` into separate neighborhood and borough columns.
5. Collapse the categories of `room_type` so that they're consistent.

**Data range issues**

6. Fix the `avg_rating` column so it doesn't exceed `5`.

**Missing data issues**

7. Further investigate the missing data and decide how to handle them.

**Duplicate data issues**

8. Further investigate duplicate data points and decide how to handle them.

***But also...***
- We need to validate our data using various sanity checks
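
As a rough idea of what such sanity checks could look like, here is a sketch using base R's `stopifnot()` on a small made-up data frame. The rows and the specific rules (ratings within 0 to 5, availability within 0 to 365, unique IDs) are illustrative assumptions based on the column descriptions above, not checks prescribed by the training.

```r
# Made-up rows that mimic a few columns of the cleaned data
toy <- data.frame(
  listing_id = c(1, 2, 3),
  avg_rating = c(4.5, NA, 3.9),
  availability_365 = c(120, 0, 365)
)

# Ratings should lie between 0 and 5 (missing ratings are allowed)
ratings_ok <- all(toy$avg_rating >= 0 & toy$avg_rating <= 5, na.rm = TRUE)

# Availability should be a number of days within one year
availability_ok <- all(toy$availability_365 >= 0 & toy$availability_365 <= 365)

# Each listing_id should appear only once
ids_unique <- !any(duplicated(toy$listing_id))

# stopifnot() raises an error if any condition is not TRUE
stopifnot(ratings_ok, availability_ok, ids_unique)
```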

---

**Q & A**

---




## Cleaning the data


### Data type issues


In [0]:
# Reminder: what does the data look like?
head(airbnb)

#### **Task 1:** Split `coordinates` into latitude and longitude and convert to `numeric` data type.


In [0]:
lat_lon <- airbnb$coordinates %>%
  # Remove left parentheses
  str_remove_all(fixed("(")) %>%
  # Remove right parentheses
  str_remove_all(fixed(")")) %>%
  # Split latitude and longitude
  str_split(", ", simplify = TRUE) %>%
  # Convert from matrix to data frame
  as.data.frame(stringsAsFactors = FALSE) %>%
  # Rename columns
  rename(latitude = V1, longitude = V2)

In [0]:
airbnb <- airbnb %>%
  # Combine lat_lon with original data frame
  cbind(lat_lon) %>%
  # Convert to numeric
  mutate(latitude = as.numeric(latitude),
         longitude = as.numeric(longitude)) %>%
  # Remove coordinates column
  select(-coordinates)
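
Once the coordinates are numeric, a quick range check can catch values that failed to convert or that fall outside valid bounds. The sketch below repeats the split in base R on made-up coordinate strings (hypothetical values in the same `(latitude, longitude)` layout) and asserts that the results are plausible.

```r
# Made-up coordinate strings in "(latitude, longitude)" form
coords <- c("(40.63222, -73.93398)", "(40.78761, -73.96862)")

# Drop both parentheses, then split on the comma-space separator
stripped <- gsub("[()]", "", coords)
parts <- do.call(rbind, strsplit(stripped, ", ", fixed = TRUE))

# Build a data frame of numeric latitude and longitude
lat_lon_toy <- data.frame(
  latitude  = as.numeric(parts[, 1]),
  longitude = as.numeric(parts[, 2])
)

# Valid latitudes lie in [-90, 90], valid longitudes in [-180, 180];
# an NA from a failed conversion also makes stopifnot() fail
stopifnot(all(abs(lat_lon_toy$latitude) <= 90),
          all(abs(lat_lon_toy$longitude) <= 180))
print(lat_lon_toy)
```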

#### **Task 2:** Remove `$`s from `price` column and convert to `numeric`.

In [0]:
# Remove $ and convert to numeric
price_clean <- airbnb$price %>%
  str_remove_all(fixed("$")) %>%
  as.numeric()

Notice we get a warning here that values are being converted to `NA`, so before we move on, we need to look into this further to ensure that the values are actually missing and we're not losing data by mistake.

Let's take a look at the values of `price`.


In [0]:
# Look at values of price
airbnb %>%
  count(price, sort = TRUE)

It looks like we have a non-standard representation of `NA` here, `$NA`, so these are getting coerced to `NA`s. This is the behavior we want, so we can ignore the warning.
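
To double-check that a conversion like this is not silently discarding real prices, one option is to list the raw strings that end up as `NA`. The sketch below uses base R on made-up values; `"$1,200"` is a hypothetical entry (not observed in the data above) included to show what a genuinely lost value would look like.

```r
# Made-up raw prices; "$1,200" is hypothetical, "$NA" mimics the marker seen above
raw_price <- c("$45", "$NA", "$120", "$1,200")

# Strip the dollar sign and convert; suppress the coercion warning for this demo
converted <- suppressWarnings(as.numeric(gsub("$", "", raw_price, fixed = TRUE)))

# Raw strings that became NA after conversion
lost <- raw_price[is.na(converted)]
print(lost)
```

Here `"$NA"` is an expected loss, while `"$1,200"` would be a real price dropped because of its thousands separator. Anything in this list other than the known missing-value marker deserves a closer look.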

In [0]:
# Add to data frame
airbnb <- airbnb %>%
  mutate(price = price_clean)

#### **Task 3:** Convert `last_review` and `listing_added` to `Date`.

<img src="https://raw.githubusercontent.com/datacamp/cleaning-data-in-r-live-training/master/assets/date_formats.png" alt="%d = day number, %m = month number, %Y = 4 digit year, %y = 2 digit year, %B = month, %b = month abbreviation" width="250px;"/>


In [0]:
# Look up date formatting symbols
?strptime

In [0]:
# Convert strings to Dates
airbnb <- airbnb %>%
  mutate(last_review = as.Date(last_review, format = "%m/%d/%Y"),
         listing_added = as.Date(listing_added, format = "%m/%d/%Y"))
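
As a small illustration of how the format string drives parsing, here is a base R sketch on made-up date strings (not values from the dataset). It also shows that a string that does not match the format becomes `NA` without raising an error, which is worth checking for after a conversion.

```r
# Made-up date strings; the last one does not follow month/day/year
raw_dates <- c("05/21/2018", "12/01/2019", "2018-05-21")

# %m = month number, %d = day number, %Y = 4-digit year
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")
print(parsed)

# Count strings that were present but failed to parse
n_failed <- sum(is.na(parsed) & !is.na(raw_dates))
print(n_failed)
```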

### Text & categorical data issues


#### **Task 4:** Split `nbhood_full` into separate `nbhood` and `borough` columns.

In [0]:
borough_nbhood <- airbnb$nbhood_full %>%
  # Split column
  str_split(", ", simplify = TRUE) %>%
  # Convert from matrix to data frame
  as.data.frame() %>%
  # Rename columns
  rename(borough = V1, nbhood = V2)

In [0]:
airbnb <- airbnb %>%
  # Combine borough_nbhood with data
  cbind(borough_nbhood) %>%
  # Remove nbhood_full
  select(-nbhood_full)

#### **Task 5:** Collapse the categories of `room_type` so that they're consistent.

In [0]:
# Count categories of room_type
airbnb %>%
  count(room_type)

In [0]:
room_type_clean <- airbnb$room_type %>%
  # Change all to lowercase
  str_to_lower() %>%
  # Collapse categories
  forcats::fct_collapse(private_room = c("private", "private room"),
                        entire_place = c("entire home/apt", "home"),
                        shared_room = "shared room")

In [0]:
# Add to data frame
airbnb <- airbnb %>% 
  mutate(room_type = room_type_clean)
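
The same idea can be sketched in base R with a named lookup vector instead of `forcats::fct_collapse()`. This is an alternative shown for illustration on made-up values, not the approach used in the cells above. One property of it is that any label missing from the lookup turns into `NA`, so unexpected categories are easy to spot.

```r
# Made-up raw labels with inconsistent capitalisation and wording
raw_type <- c("Private", "PRIVATE ROOM", "Entire home/apt", "home", "Shared room")

# Map each lowercase label to its collapsed category
lookup <- c(
  "private"         = "private_room",
  "private room"    = "private_room",
  "entire home/apt" = "entire_place",
  "home"            = "entire_place",
  "shared room"     = "shared_room"
)

# Lowercase first, then look up; unname() drops the lookup names
room_type_toy <- unname(lookup[tolower(raw_type)])
print(table(room_type_toy, useNA = "ifany"))
```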

---

**Q & A**

---



### Data range issues

#### **Task 6:** Fix the `avg_rating` column so it doesn't exceed `5`.

In [0]:
# How many places with avg_rating above 5?
airbnb %>%
  filter(avg_rating > 5) %>%
  count()

In [0]:
# What does the data for these places look like?
airbnb %>%
  filter(avg_rating > 5)

In [0]:
# Remove the rows with rating > 5
airbnb <- airbnb %>%
  filter(avg_rating <= 5 | is.na(avg_rating))
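
The `| is.na(avg_rating)` part matters because a comparison with `NA` yields `NA`, and row filters keep only rows where the condition is `TRUE`. The sketch below uses base R's `subset()`, which treats `NA` conditions as false, as a stand-in on made-up data; `dplyr::filter()` likewise drops rows whose condition evaluates to `NA`.

```r
# Made-up ratings: one valid, one out of range, one missing
toy <- data.frame(listing_id = 1:3, avg_rating = c(4.5, 5.2, NA))

# The comparison itself returns NA for the missing rating
print(toy$avg_rating <= 5)

# Without the is.na() clause, the unrated listing is dropped as well
strict <- subset(toy, avg_rating <= 5)

# With the clause, only the out-of-range row is removed
lenient <- subset(toy, avg_rating <= 5 | is.na(avg_rating))

print(strict$listing_id)
print(lenient$listing_id)
```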

### Missing data issues

#### **Task 7:** Further investigate the missing data and decide how to handle them.

*Are the missing values related in any way?*

The `visdat` package is useful for investigating missing data.

In [0]:
head(airbnb)

In [0]:
airbnb %>%
  # Focus only on columns with missing values
  select(price, last_review, reviews_per_month, avg_rating, avg_stays_per_month) %>%
  # Visualize missing data
  visdat::vis_miss()

It looks like the missingness of `last_review`, `reviews_per_month`, `avg_rating`, and `avg_stays_per_month` is related: these columns tend to be missing in the same rows. This suggests that these are places that have never been visited before (and therefore have no ratings, reviews, or stays).

However, `price` is unrelated to the other columns, so we'll need to take a different approach for that.
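
Besides the plot, the relationship can be checked numerically by comparing the `NA` patterns of two columns. A base R sketch on made-up data (the values are illustrative only):

```r
# Made-up data: reviews_per_month and avg_rating are missing together,
# while price is missing in an unrelated row
toy <- data.frame(
  price             = c(100, NA, 80, 60),
  reviews_per_month = c(1.2, 0.4, NA, NA),
  avg_rating        = c(4.8, 4.1, NA, NA)
)

# TRUE if both columns are missing in exactly the same rows
same_pattern <- identical(is.na(toy$reviews_per_month), is.na(toy$avg_rating))

# Cross-tabulate missingness of price against avg_rating
cross <- table(price_missing = is.na(toy$price),
               rating_missing = is.na(toy$avg_rating))

print(same_pattern)
print(cross)
```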

In [0]:
# Sanity check: if our hypothesis is correct, both filters return zero rows
airbnb %>%
    filter(nb_reviews != 0,
           is.na(reviews_per_month))
airbnb %>%
    filter(nb_reviews != 0,
           is.na(avg_stays_per_month))

Now that we know our hypothesis is correct,
- We'll set any missing values in `reviews_per_month` or `avg_stays_per_month` to `0`.
    - Use `tidyr::replace_na()`
- We'll leave `last_review` and `avg_rating` as `NA`.
- We'll create a `logical` (`TRUE`/`FALSE`) column called `is_visited`, indicating whether or not the listing has been visited before.
    - Use `ifelse(condition, value if true, value if false)`

In [0]:
airbnb <- airbnb %>%
    # Replace missing values in reviews_per_month or avg_stays_per_month with 0
    replace_na(list(reviews_per_month = 0, avg_stays_per_month = 0)) %>%
    # Create is_visited
    mutate(is_visited = ifelse(is.na(avg_rating), FALSE, TRUE))

**Treating the `price` column**

There are lots of ways we could handle missing prices:
- Remove all rows with missing price values
- Fill in missing prices with the overall average price
- Fill in missing prices based on other columns like `borough` or `room_type`
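
To make the last two options concrete, here is a base R sketch on made-up prices comparing an overall-mean fill with a per-`room_type` median fill. `ave()` is used here as a base R stand-in for a grouped `mutate()`; the numbers are invented for illustration.

```r
# Made-up listings with one missing price in each room type
toy <- data.frame(
  room_type = c("private_room", "private_room", "private_room",
                "entire_place", "entire_place", "entire_place"),
  price     = c(50, 70, NA, 200, 300, NA)
)

# Option: fill with the overall mean price
overall_mean <- mean(toy$price, na.rm = TRUE)
fill_overall <- ifelse(is.na(toy$price), overall_mean, toy$price)

# Option: fill with the median price of the same room_type
group_median <- ave(toy$price, toy$room_type,
                    FUN = function(x) median(x, na.rm = TRUE))
fill_grouped <- ifelse(is.na(toy$price), group_median, toy$price)

print(fill_overall)
print(fill_grouped)
```

In this toy example the overall mean (155) sits far from both groups, while the grouped medians (60 and 250) stay close to typical prices for each room type.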

**Let's examine the relationship between `room_type` and `price`.**

<img src='https://raw.githubusercontent.com/datacamp/cleaning-data-in-r-live-training/master/assets/boxplot.png' alt='Box plot diagram' width='350px;'>

In [0]:
# Create a boxplot showing the distribution of price for each room_type
# Note: ylim() drops prices outside 0-1000 before the boxes are computed
ggplot(airbnb, aes(x = room_type, y = price)) +
    geom_boxplot() +
    ylim(0, 1000)

In [0]:
# Use a grouped mutate to fill in missing prices with median of their room_type
airbnb %>%
    group_by(room_type) %>%
    mutate(price_filled = ifelse(is.na(price), median(price, na.rm = TRUE), price)) %>%
    # Look at the values we filled in to make sure it looks how we want
    filter(is.na(price)) %>%
    select(listing_id, name, room_type, price, price_filled)

In [0]:
# Overwrite price column in original data frame
airbnb <- airbnb %>%
    group_by(room_type) %>%
    mutate(price = ifelse(is.na(price), median(price, na.rm = TRUE), price)) %>%
    ungroup()

### Duplicate data issues


#### **Task 8:** Further investigate duplicate data points and decide how to handle them.

In [0]:
# Find duplicated listing_ids
duplicate_ids <- airbnb %>%
    count(listing_id) %>%
    filter(n > 1)

In [0]:
# Look at duplicated data
airbnb %>%
    filter(listing_id %in% duplicate_ids$listing_id) %>%
    arrange(listing_id)