<p align="center">
<img src="https://github.com/datacamp/r-live-training-template/blob/master/assets/datacamp.svg?raw=True" alt = "DataCamp icon" width="50%">
<br>
<h1 align="center">Cleaning Data in R Live Training</h1>
</p>
<br>


Welcome to this hands-on training where you'll identify issues in a dataset and clean it from start to finish using R. It's often said that data scientists spend 80% of their time cleaning and manipulating data and only about 20% of their time analyzing it, so cleaning data is an important skill to master!

In this session, you will:

- Examine a dataset to identify its problem areas and what needs to be done to fix them.
- Convert between data types to make analysis easier.
- Correct inconsistencies in categorical data.
- Deal with missing data.
- Perform data validation to ensure every value makes sense.

## **The Dataset**

The dataset we'll use is a CSV file named `nyc_airbnb.csv`, which contains data on [*Airbnb*](https://www.airbnb.com/) listings in New York City. It contains the following columns:

- `listing_id`: The unique identifier for a listing
- `name`: The description used on the listing
- `host_id`: Unique identifier for a host
- `host_name`: Name of host
- `nbhood_full`: Name of borough and neighborhood
- `coordinates`: Coordinates of listing _(latitude, longitude)_
- `room_type`: Type of room 
- `price`: Price per night for listing
- `nb_reviews`: Number of reviews received 
- `last_review`: Date of last review
- `reviews_per_month`: Average number of reviews per month
- `availability_365`: Number of days available per year
- `avg_rating`: Average rating (from 0 to 5)
- `nb_stays`: Average number of stays per month
- `pct_5_stars`: Percent of reviews that were 5-stars
- `listing_added`: Date when listing was added


In [13]:
# Install packages
install.packages("readr")
install.packages("dplyr")
install.packages("stringr")
install.packages("ggplot2")

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

In [0]:
# Load packages
library(readr)
library(dplyr)
library(stringr)
library(ggplot2)

In [20]:
# Load dataset
airbnb <- read_csv("https://raw.githubusercontent.com/datacamp/cleaning-data-in-r-live-training/master/assets/nyc_airbnb.csv")

“Missing column names filled in: 'X1' [1]”
Parsed with column specification:
cols(
  X1 = col_double(),
  listing_id = col_double(),
  name = col_character(),
  host_id = col_double(),
  host_name = col_character(),
  nbhood_full = col_character(),
  coordinates = col_character(),
  room_type = col_character(),
  price = col_character(),
  nb_reviews = col_double(),
  last_review = col_date(format = ""),
  reviews_per_month = col_double(),
  availability_365 = col_double(),
  avg_rating = col_double(),
  nb_stays = col_double(),
  pct_5_stars = col_double(),
  listing_added = col_date(format = "")
)



In [18]:
head(airbnb)

X1,listing_id,name,host_id,host_name,nbhood_full,coordinates,room_type,price,nb_reviews,last_review,reviews_per_month,availability_365,avg_rating,nb_stays,pct_5_stars,listing_added
<dbl>,<dbl>,<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<date>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<date>
1,13740704,"Cozy,budget friendly, cable inc, private entrance!",20583125,Michel,"Brooklyn, Flatlands","(40.63222, -73.93398)",Private room,$45,10,2018-12-12,0.7,85,4.100954,12.0,0.6094315,2018-06-08
2,22005115,Two floor apartment near Central Park,82746113,Cecilia,"Manhattan, Upper West Side","(40.78761, -73.96862)",Entire home/apt,$135,1,2019-06-30,1.0,145,3.3676,1.2,0.7461346,2018-12-25
3,21667615,Beautiful 1BR in Brooklyn Heights,78251,Leslie,"Brooklyn, Brooklyn Heights","(40.7007, -73.99517)",Entire home/apt,$150,0,,,65,,,,2018-08-15
4,6425850,"Spacious, charming studio",32715865,Yelena,"Manhattan, Upper West Side","(40.79169, -73.97498)",Entire home/apt,$86,5,2017-09-23,0.13,0,4.763203,6.0,0.7699471,2017-03-20
5,22986519,Bedroom on the lively Lower East Side,154262349,Brooke,"Manhattan, Lower East Side","(40.71884, -73.98354)",Private room,$160,23,2019-06-12,2.29,102,3.822591,27.6,0.6493831,2020-10-23
6,271954,Beautiful brownstone apartment,1423798,Aj,"Manhattan, Greenwich Village","(40.73388, -73.99452)",Entire home/apt,$150,203,2019-06-20,2.22,300,4.478396,243.6,0.7434997,2018-12-15


## Diagnosing data cleaning problems

We'll need to get a good look at the data frame in order to identify any problems that may cause issues during an analysis. There are a variety of functions (both from base R and `dplyr`) that can help us with this:

1. `head()` to look at the first few rows of the data
2. `glimpse()` to get a summary of the variables' data types
3. `summary()` to compute summary statistics of each variable and display the number of missing values
4. `duplicated()` to find duplicates


In [9]:
# What does the data look like?
head(airbnb)

,X,listing_id,name,host_id,host_name,nbhood_full,coordinates,room_type,price,nb_reviews,last_review,reviews_per_month,availability_365,avg_rating,nb_stays,pct_5_stars,listing_added
,<int>,<int>,<fct>,<int>,<fct>,<fct>,<fct>,<fct>,<fct>,<int>,<fct>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<fct>
1,1,13740704,"Cozy,budget friendly, cable inc, private entrance!",20583125,Michel,"Brooklyn, Flatlands","(40.63222, -73.93398)",Private room,$45,10,2018-12-12,0.7,85,4.100954,12.0,0.6094315,2018-06-08
2,2,22005115,Two floor apartment near Central Park,82746113,Cecilia,"Manhattan, Upper West Side","(40.78761, -73.96862)",Entire home/apt,$135,1,2019-06-30,1.0,145,3.3676,1.2,0.7461346,2018-12-25
3,3,21667615,Beautiful 1BR in Brooklyn Heights,78251,Leslie,"Brooklyn, Brooklyn Heights","(40.7007, -73.99517)",Entire home/apt,$150,0,,,65,,,,2018-08-15
4,4,6425850,"Spacious, charming studio",32715865,Yelena,"Manhattan, Upper West Side","(40.79169, -73.97498)",Entire home/apt,$86,5,2017-09-23,0.13,0,4.763203,6.0,0.7699471,2017-03-20
5,5,22986519,Bedroom on the lively Lower East Side,154262349,Brooke,"Manhattan, Lower East Side","(40.71884, -73.98354)",Private room,$160,23,2019-06-12,2.29,102,3.822591,27.6,0.6493831,2020-10-23
6,6,271954,Beautiful brownstone apartment,1423798,Aj,"Manhattan, Greenwich Village","(40.73388, -73.99452)",Entire home/apt,$150,203,2019-06-20,2.22,300,4.478396,243.6,0.7434997,2018-12-15


**Problems so far:**
1. The first column (`X` here, `X1` when the file is read with `read_csv()`) only holds the row number, so we don't need it.
2. Multiple pieces of information are stored in one value:
   - `coordinates` are easier to work with when separated into latitude and longitude.
   - `nbhood_full` contains both the borough name (e.g. Manhattan, Brooklyn) and the neighborhood name (e.g. Lower East Side).
3. `price` values include a `$` sign, which keeps them from being treated as numbers.
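
As a rough preview of how these three problems might be addressed, the sketch below works on a tiny hand-made data frame (`toy`, not the real dataset) that mimics the format of the affected columns. The new column names (`borough`, `nbhood`, `lat`, `lon`, `price_num`) are assumptions chosen for illustration only.

```r
library(dplyr)
library(stringr)

# Hypothetical two-row sample that mimics the format of the real columns
toy <- data.frame(
  X = 1:2,
  nbhood_full = c("Brooklyn, Flatlands", "Manhattan, Upper West Side"),
  coordinates = c("(40.63222, -73.93398)", "(40.78761, -73.96862)"),
  price = c("$45", "$135"),
  stringsAsFactors = FALSE
)

# Split "a, b" strings into two-column character matrices
nbhood_parts <- str_split_fixed(toy$nbhood_full, ", ", n = 2)
coord_parts <- str_split_fixed(str_remove_all(toy$coordinates, "[()]"), ", ", n = 2)

toy_clean <- toy %>%
  select(-X) %>%                        # drop the row-number column
  mutate(
    borough = nbhood_parts[, 1],
    nbhood = nbhood_parts[, 2],
    lat = as.numeric(coord_parts[, 1]),
    lon = as.numeric(coord_parts[, 2]),
    price_num = as.numeric(str_remove(price, fixed("$")))  # strip the "$" before converting
  )

toy_clean
```

Using `fixed("$")` matters here because a bare `"$"` would be read as a regular-expression anchor rather than a literal dollar sign.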

In [10]:
# Inspect data types
glimpse(airbnb)

Rows: 10,019
Columns: 17
$ X                 <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
$ listing_id        <int> 13740704, 22005115, 21667615, 6425850, 22986519, 27…
$ name              <fct> "Cozy,budget friendly, cable inc, private entrance!…
$ host_id           <int> 20583125, 82746113, 78251, 32715865, 154262349, 142…
$ host_name         <fct> Michel, Cecilia, Leslie, Yelena, Brooke, Aj, Christ…
$ nbhood_full       <fct> "Brooklyn, Flatlands", "Manhattan, Upper West Side"…
$ coordinates       <fct> "(40.63222, -73.93398)", "(40.78761, -73.96862)", "…
$ room_type         <fct> Private room, Entire home/apt, Entire home/apt, Ent…
$ price             <fct> $45, $135, $150, $86, $160, $150, $200, $224, $169,…
$ nb_reviews        <int> 10, 1, 0, 5, 23, 203, 0, 2, 5, 8, 5, 2, 21, 0, 0

4. Columns like `coordinates` and `price` are factors instead of numeric values.
5. Columns with dates like `last_review` and `listing_added` are factors instead of the `Date` data type.
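
Converting factors needs some care, because calling `as.numeric()` directly on a factor returns its internal level codes rather than the values it displays. The sketch below illustrates this on small made-up vectors (`price_fct` and `date_fct` are hypothetical stand-ins, with the `$` assumed to be already removed); it is not the conversion code used later in the session.

```r
# Hypothetical factor vectors that mimic `price` (without "$") and `listing_added`
price_fct <- factor(c("45", "135", "150"))
date_fct <- factor(c("2018-06-08", "2018-12-25", "2018-08-15"))

# Pitfall: this returns the internal level codes, not 45, 135, 150
codes <- as.numeric(price_fct)

# Going through character first recovers the displayed values
price_num <- as.numeric(as.character(price_fct))

# Dates can be converted the same way, stating the format explicitly
listing_added <- as.Date(as.character(date_fct), format = "%Y-%m-%d")

print(codes)
print(price_num)
print(class(listing_added))
```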

In [11]:
# Examine summary statistics and missing values
summary(airbnb)

       X           listing_id                                  name     
 Min.   :    1   Min.   :    3831                                :   5  
 1st Qu.: 2506   1st Qu.: 9674772   Beautiful Brooklyn Brownstone:   5  
 Median : 5010   Median :20070296   New york Multi-unit building :   5  
 Mean   : 5010   Mean   :19276341   Hillside Hotel               :   4  
 3rd Qu.: 7514   3rd Qu.:29338637   Home away from home          :   4  
 Max.   :10019   Max.   :36487245   Brooklyn Apartment           :   3  
                                    (Other)                      :9993  
    host_id                 host_name                          nbhood_full  
 Min.   :     2787   Michael     :  89   Brooklyn, Bedford-Stuyvesant: 777  
 1st Qu.:  7910880   David       :  85   Brooklyn, Williamsburg      : 766  
 Median : 31651673   Sonder (NYC):  66   Manhattan, Harlem           : 541  
 Mean   : 67959227   Alex        :  52   Brooklyn, Bushwick          : 502  
 3rd Qu.:107434423   Daniel    

6. 2075 missing values in `reviews_per_month`, `avg_rating`, `nb_stays`, and `pct_5_stars`.
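
One way to count missing values per column directly, rather than reading them off the `summary()` output, is sketched below on a small made-up data frame (`toy` is hypothetical). It also checks whether the missing ratings coincide with listings that have zero reviews, a pattern suggested by the third row of the `head()` output above; whether it holds for the whole dataset is an assumption to verify.

```r
# Hypothetical sample: rows with 0 reviews have no rating or stay information
toy <- data.frame(
  listing_id = c(1, 2, 3, 4),
  nb_reviews = c(10, 0, 5, 0),
  avg_rating = c(4.1, NA, 4.8, NA),
  nb_stays = c(12, NA, 6, NA)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))

# Do all rows with a missing rating have zero reviews?
all_missing_have_no_reviews <- all(toy$nb_reviews[is.na(toy$avg_rating)] == 0)

print(na_counts)
print(all_missing_have_no_reviews)
```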

In [15]:
# Find data with duplicated listing_id
airbnb %>%
  filter(duplicated(listing_id))

X,listing_id,name,host_id,host_name,nbhood_full,coordinates,room_type,price,nb_reviews,last_review,reviews_per_month,availability_365,avg_rating,nb_stays,pct_5_stars,listing_added
<int>,<int>,<fct>,<int>,<fct>,<fct>,<fct>,<fct>,<fct>,<int>,<fct>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<fct>
2256,7319856,450ft Square Studio in Gramercy NY,11773680,Adam,"Manhattan, Kips Bay","(40.73813, -73.98098)",Entire home/apt,$280,4,2016-05-22,0.09,225,3.903764,4.8,0.756381,2015-11-17
3008,17861841,THE CREATIVE COZY ROOM,47591528,Janessa,"Brooklyn, Sheepshead Bay","(40.59211, -73.94126999999997)",Private room,$99,13,2019-05-23,0.52,82,4.80659,15.6,0.9374216,2018-11-17
3341,35646737,"Private Cabins @ Chelsea, Manhattan",117365574,Maria,"Manhattan, Chelsea","(40.74946, -73.99627)",Private room,$85,1,2019-06-22,1.0,261,4.951714,1.2,0.6713879,2018-12-17
3431,15027024,Newly renovated 1bd on lively & historic St Marks,8344620,Ethan,"Manhattan, East Village","(40.72693, -73.98385)",Entire home/apt,$180,10,2018-12-31,0.3,0,3.869729,12.0,0.7725126,2018-06-27
4188,4244242,Best Bedroom in Bedstuy/Bushwick. Ensuite bathroom,22023014,BrooklynSleeps,"Brooklyn, Bedford-Stuyvesant","(40.69496, -73.93949)",Private room,$73,110,2019-06-23,1.96,323,4.962314,132.0,0.809882,2018-12-18
5078,33831116,Sonder | Stock Exchange | Collected 1BR + Laundry,219517861,Sonder (NYC),"Manhattan, Financial District","(40.70621, -74.01199)",Entire home/apt,$229,5,2019-06-15,1.92,350,4.026379,6.0,0.6017374,2018-12-10
5398,16518377,East Village 1BR Apt with all the amenities,3012457,Cody,"Manhattan, East Village","(40.7235, -73.97963)",Entire home/apt,$200,3,2018-07-10,0.16,0,4.67667,3.6,0.6944427,2018-01-04
6069,22014840,Sunny Bedroom Only 1 Metro Stop to Manhattan,32093643,Scarlett,"Manhattan, Roosevelt Island","(40.76211, -73.94887)",Private room,$70,2,2018-01-07,0.11,0,4.024336,2.4,0.7194262,2017-07-04
6086,33346762,2BR Apartment in Brownstone Brooklyn!,50321289,Avery,"Brooklyn, Bedford-Stuyvesant","(40.682, -73.95681)",Entire home/apt,$140,4,2019-06-14,1.58,4,4.013393,4.8,0.7195908,2018-12-09
6133,23990868,1 Bedroom in Luxury Building,4447548,Grace,"Brooklyn, Bedford-Stuyvesant","(40.69336, -73.94453)",Entire home/apt,$88,8,2019-06-16,0.56,18,4.164548,9.6,0.640106,2018-12-11


7. Duplicates: there are 17 rows whose `listing_id` already appeared earlier in the dataset.
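
Rows that share a `listing_id` are not necessarily identical in every other column, so it can help to separate the two cases before deciding what to drop. The sketch below uses a small made-up data frame (`toy` is hypothetical), and keeping the first row per `listing_id` is just one possible choice, not necessarily the approach taken later in the session.

```r
library(dplyr)

# Hypothetical sample: listing 101 is repeated exactly, listing 102 with a different price
toy <- data.frame(
  listing_id = c(101, 102, 101, 103, 102),
  price = c("$45", "$135", "$45", "$86", "$140"),
  stringsAsFactors = FALSE
)

# Rows whose listing_id already appeared earlier (same check as in the cell above)
dup_rows <- toy %>%
  filter(duplicated(listing_id))

# Rows that repeat an earlier row in every column
full_dups <- toy[duplicated(toy), ]

# Keep only the first row seen for each listing_id
deduped <- toy %>%
  distinct(listing_id, .keep_all = TRUE)

print(nrow(dup_rows))
print(nrow(full_dups))
print(deduped)
```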