<p align="center">
<img src="https://github.com/datacamp/r-live-training-template/blob/master/assets/datacamp.svg?raw=True" alt = "DataCamp icon" width="50%">
<br>
<h1 align="center">Cleaning Data in R Live Training</h1>
</p>
<br>


Welcome to this hands-on training where you'll identify issues in a dataset and clean it from start to finish using R. It's often said that data scientists spend about 80% of their time cleaning and manipulating data and only about 20% analyzing it, so cleaning data is an important skill to master!

In this session, you will:

- Examine a dataset to identify its problem areas and what needs to be done to fix them.
- Convert between data types to make analysis easier.
- Correct inconsistencies in categorical data.
- Deal with missing data.
- Perform data validation to ensure every value makes sense.

## The Dataset

The dataset we'll use is a CSV file named `nyc_airbnb.csv`, which contains data on [*Airbnb*](https://www.airbnb.com/) listings in New York City. It contains the following columns:

- `listing_id`: The unique identifier for a listing
- `name`: The description used on the listing
- `host_id`: Unique identifier for a host
- `host_name`: Name of host
- `nbhood_full`: Name of borough and neighborhood
- `coordinates`: Coordinates of listing _(latitude, longitude)_
- `room_type`: Type of room 
- `price`: Price per night for listing
- `nb_reviews`: Number of reviews received 
- `last_review`: Date of last review
- `reviews_per_month`: Average number of reviews per month
- `availability_365`: Number of days available per year
- `avg_rating`: Average rating (from 0 to 5)
- `nb_stays`: Total number of stays thus far
- `pct_5_stars`: Percentage of reviews that were 5 stars
- `listing_added`: Date when listing was added


In [0]:
# Install packages
install.packages("dplyr")
install.packages("stringr")
install.packages("ggplot2")

In [6]:
# Load packages
library(dplyr)
library(stringr)
library(ggplot2)


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




In [0]:
# Load dataset
airbnb <- read.csv("https://raw.githubusercontent.com/datacamp/cleaning-data-in-r-live-training/master/assets/nyc_airbnb.csv")
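
The printed output further down shows text columns such as `name` and `price` typed as `<fct>`, which is what `read.csv()` produced by default before R 4.0.0. From R 4.0.0 onward the default is `stringsAsFactors = FALSE`, so the same call returns character columns instead. The sketch below illustrates the difference on a small made-up inline CSV; `toy_csv`, `as_chr`, and `as_fct` are hypothetical names, and the rows are not taken from `nyc_airbnb.csv`.

```r
# Hypothetical two-row CSV standing in for the real file
toy_csv <- "listing_id,room_type,price
1,Private room,$99
2,Entire home/apt,$280"

# Text columns kept as character (the default from R 4.0.0 onward)
as_chr <- read.csv(text = toy_csv, stringsAsFactors = FALSE)

# Text columns converted to factors (the default before R 4.0.0)
as_fct <- read.csv(text = toy_csv, stringsAsFactors = TRUE)

class(as_chr$price)
class(as_fct$price)
```

Setting `stringsAsFactors` explicitly in the `read.csv()` call is one way to get the same column types regardless of the R version in use.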

## Diagnosing data cleaning problems

We need a good look at the data frame to identify problems that could cause issues during analysis. Several functions, from both base R and `dplyr`, can help with this:

1. `head()` to look at the first few rows of the data
2. `glimpse()` to get a summary of the variables' data types
3. `summary()` to compute summary statistics of each variable and display the number of missing values
4. `duplicated()` to find duplicates
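
To get a feel for what each of these functions returns before running them on the full dataset, here is a minimal sketch on a made-up data frame. The `toy` data frame and its values are hypothetical and only borrow a few column names from the dataset.

```r
library(dplyr)

# Hypothetical toy data frame, not drawn from nyc_airbnb.csv
toy <- data.frame(
  listing_id = c(101L, 102L, 102L, 103L),
  price      = c("$99", "$280", "$280", NA),
  nb_reviews = c(13L, 4L, 4L, NA),
  stringsAsFactors = FALSE
)

head(toy, 2)                # first two rows
glimpse(toy)                # one line per column: type and first values
summary(toy)                # per-column statistics; NA counts appear for numeric and factor columns
duplicated(toy$listing_id)  # TRUE where an ID has already appeared earlier
```

Note that `summary()` reports only length, class, and mode for character columns, so `sum(is.na(toy$price))` is a more direct check for missing values there.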


In [0]:
# What does the data look like?
head(airbnb)

In [0]:
# Inspect data types
glimpse(airbnb)

In [3]:
# Examine summary statistics
summary(airbnb)

       X           listing_id                                  name     
 Min.   :    1   Min.   :    3831                                :   5  
 1st Qu.: 2506   1st Qu.: 9674772   Beautiful Brooklyn Brownstone:   5  
 Median : 5010   Median :20070296   New york Multi-unit building :   5  
 Mean   : 5010   Mean   :19276341   Hillside Hotel               :   4  
 3rd Qu.: 7514   3rd Qu.:29338637   Home away from home          :   4  
 Max.   :10019   Max.   :36487245   Brooklyn Apartment           :   3  
                                    (Other)                      :9993  
    host_id                 host_name                          nbhood_full  
 Min.   :     2787   Michael     :  89   Brooklyn, Bedford-Stuyvesant: 777  
 1st Qu.:  7910880   David       :  85   Brooklyn, Williamsburg      : 766  
 Median : 31651673   Sonder (NYC):  66   Manhattan, Harlem           : 541  
 Mean   : 67959227   Alex        :  52   Brooklyn, Bushwick          : 502  
 3rd Qu.:107434423   Daniel    

In [8]:
# Find data with duplicated listing_id
airbnb %>%
  filter(duplicated(listing_id))

X,listing_id,name,host_id,host_name,nbhood_full,coordinates,room_type,price,nb_reviews,last_review,reviews_per_month,availability_365,avg_rating,nb_stays,pct_5_stars,listing_added
<int>,<int>,<fct>,<int>,<fct>,<fct>,<fct>,<fct>,<fct>,<int>,<fct>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<fct>
2256,7319856,450ft Square Studio in Gramercy NY,11773680,Adam,"Manhattan, Kips Bay","(40.73813, -73.98098)",Entire home/apt,$280,4,2016-05-22,0.09,225,3.903764,4.8,0.756381,2015-11-17
3008,17861841,THE CREATIVE COZY ROOM,47591528,Janessa,"Brooklyn, Sheepshead Bay","(40.59211, -73.94126999999997)",Private room,$99,13,2019-05-23,0.52,82,4.80659,15.6,0.9374216,2018-11-17
3341,35646737,"Private Cabins @ Chelsea, Manhattan",117365574,Maria,"Manhattan, Chelsea","(40.74946, -73.99627)",Private room,$85,1,2019-06-22,1.0,261,4.951714,1.2,0.6713879,2018-12-17
3431,15027024,Newly renovated 1bd on lively & historic St Marks,8344620,Ethan,"Manhattan, East Village","(40.72693, -73.98385)",Entire home/apt,$180,10,2018-12-31,0.3,0,3.869729,12.0,0.7725126,2018-06-27
4188,4244242,Best Bedroom in Bedstuy/Bushwick. Ensuite bathroom,22023014,BrooklynSleeps,"Brooklyn, Bedford-Stuyvesant","(40.69496, -73.93949)",Private room,$73,110,2019-06-23,1.96,323,4.962314,132.0,0.809882,2018-12-18
5078,33831116,Sonder | Stock Exchange | Collected 1BR + Laundry,219517861,Sonder (NYC),"Manhattan, Financial District","(40.70621, -74.01199)",Entire home/apt,$229,5,2019-06-15,1.92,350,4.026379,6.0,0.6017374,2018-12-10
5398,16518377,East Village 1BR Apt with all the amenities,3012457,Cody,"Manhattan, East Village","(40.7235, -73.97963)",Entire home/apt,$200,3,2018-07-10,0.16,0,4.67667,3.6,0.6944427,2018-01-04
6069,22014840,Sunny Bedroom Only 1 Metro Stop to Manhattan,32093643,Scarlett,"Manhattan, Roosevelt Island","(40.76211, -73.94887)",Private room,$70,2,2018-01-07,0.11,0,4.024336,2.4,0.7194262,2017-07-04
6086,33346762,2BR Apartment in Brownstone Brooklyn!,50321289,Avery,"Brooklyn, Bedford-Stuyvesant","(40.682, -73.95681)",Entire home/apt,$140,4,2019-06-14,1.58,4,4.013393,4.8,0.7195908,2018-12-09
6133,23990868,1 Bedroom in Luxury Building,4447548,Grace,"Brooklyn, Bedford-Stuyvesant","(40.69336, -73.94453)",Entire home/apt,$88,8,2019-06-16,0.56,18,4.164548,9.6,0.640106,2018-12-11
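
Keep in mind that `duplicated()` flags only the second and later occurrences of a value, so the rows above are the repeats and not their first appearances. The sketch below, on a hypothetical `toy` data frame, shows one way to pull every row that shares an ID and to count rows that are identical across all columns.

```r
library(dplyr)

# Hypothetical toy data frame, not drawn from nyc_airbnb.csv
toy <- data.frame(
  listing_id = c(101L, 102L, 102L, 103L),
  price      = c("$99", "$280", "$280", "$73"),
  stringsAsFactors = FALSE
)

# Repeats only: the first row with ID 102 is not flagged
repeats <- toy %>%
  filter(duplicated(listing_id))

# Every row that shares its ID with another row, first occurrence included
all_copies <- toy %>%
  filter(listing_id %in% listing_id[duplicated(listing_id)])

# Number of rows that repeat an earlier row in every column
n_full_dups <- sum(duplicated(toy))
```

Comparing `all_copies` side by side is a quick way to tell exact duplicates from rows that share an ID but differ in other columns.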
