* [Day 1: Reading in common data file formats: .json, .txt and .xlsx](https://www.kaggle.com/rtatman/data-cleaning-challenge-json-txt-and-xls/)
* [Day 2: Filling in missing values ](https://www.kaggle.com/rtatman/data-cleaning-challenge-imputing-missing-values/)
* [Day 3: Identifying & handling outliers](https://www.kaggle.com/rtatman/data-cleaning-challenge-outliers/)
* [Day 4: Removing duplicate records](https://www.kaggle.com/rtatman/data-cleaning-challenge-deduplication/)
* [Day 5: Cleaning numbers (percentages, money, dates and times)](https://www.kaggle.com/rtatman/data-cleaning-challenge-cleaning-numeric-columns/)
    
___

Welcome to Day 5 of the 5-Day Data Challenge! (Can you believe it's the last day already?) Today, we're going to be learning how to clean up columns with dates and numbers in them that R doesn’t realize are dates or numbers. In particular, we'll learn how to remove symbols like "%" or "$", and how to get R to correctly parse dates so you can do things like plot days in order. 

I'll start by introducing each concept or technique, and then you'll get a chance to apply it with an exercise (look for the **Your turn!** section). Ready? Let's get started!
___

**Kernel FAQs:**

* **How do I get started?**   To get started, click the blue "Fork Notebook" button in the upper, right hand corner. This will create a private copy of this notebook that you can edit and play with. Once you're finished with the exercises, you can choose to make your notebook public to share with others. :)

* **How do I run the code in this notebook?** Once you fork the notebook, it will open in the notebook editor. From there you can write code in any code cell (the ones with the grey background) and run the code by either 1) clicking in the code cell and then hitting CTRL + ENTER or 2) clicking in the code cell and then clicking on the white "play" arrow to the left of the cell. If you want to run all the code in your notebook, you can use the double, "fast forward" arrows at the bottom of the notebook editor.

* **How do I save my work?** Any changes you make are saved automatically as you work. You can run all the code in your notebook and save a static version by hitting the blue "Commit & Run" button in the upper right hand corner of the editor. 

* **How can I find my notebook again later?** The easiest way is to go to your user profile (https://www.kaggle.com/replace-this-with-your-username), then click on the "Kernels" tab. All of your kernels will be under the "Your Work" tab, and all the kernels you've upvoted will be under the "Favorites" tab.

___


# Get our environment set up
___

At this point you know the drill: we need to get our libraries and data all read in and ready to go. 

In [None]:
# libraries we'll need
library(tidyverse)
library(lubridate)

# read in the data we'll need
listings <- read_csv("../input/boston/listings.csv")
hospital_charges <- read_csv("../input/inpatient-hospital-charges/inpatientCharges.csv") %>%
    janitor::clean_names() # clean up column names, with clean_names() from the janitor package
drone_strikes <- read_csv("../input/pakistandroneattacks/PakistanDroneAttacksWithTemp Ver 9 (October 19, 2017).csv") %>%
    janitor::clean_names() # clean up column names
# remove the last row (has totals in it)
drone_strikes <- drone_strikes[-nrow(drone_strikes),] 
mass_shootings <- read_csv("../input/us-mass-shootings-last-50-years/Mass Shootings Dataset Ver 5.csv") %>%
    janitor::clean_names() # clean up column names

# Removing non-numeric characters
___

Sometimes you'll get a dataset and find that someone has helpfully added percentage signs, dollar signs or other non-numeric symbols that tell you the units of the data you're working with. While this is great when you're looking at the data, it can lead to problems when you're trying to work with it in R. This is because when you read in a dataframe in R, it makes a guess about what data type each column is by looking at what is in that column. And if it runs into any characters that aren't numbers in your column, it will play it safe and say that the datatype is "character" rather than "numeric". This means that in order to actually do any math with that column, you need some way to remove any non-numeric characters.
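
As a rough illustration of that guessing step, here is a minimal sketch using `guess_parser()` from readr, which reports the type readr would pick for a vector of strings. The two vectors are made-up values, not taken from any of the datasets in this kernel.

```r
library(readr)

# hypothetical values, as they might appear in a raw file
plain_values  <- c("50", "100", "2500")
dollar_values <- c("$50", "$100", "$2500")

# guess_parser() reports which type readr would guess for each vector
guess_parser(plain_values)   # all digits, so guessed as a number type
guess_parser(dollar_values)  # the "$" makes readr fall back to character
```

The same data with a single symbol in front of each value ends up as text, which is why the columns need to be re-parsed before doing any math.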

If you have tried to solve this problem in the past, you may have come across people suggesting that you use regular expressions. While I'm a big fan of regular expressions (I use one to strip days of the week later on in this kernel), I would *strongly* advise you to steer clear of them in data cleaning unless you have no other choice. Why?

* **Regular expressions are brittle.** Since they rely on matching a very specific set of characters in a specific order, if you get new data later that's slightly different your regular expressions won't work. And if you haven't had to debug regular expressions previously, count yourself lucky: it's a huge pain.
* **Regular expressions are hard to write & read.** If you've worked with regular expressions before, you know that they are hard to get right! They're also very hard to read: can you tell what this is supposed to do at first glance? `^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$`.  (It's [a Roman Numeral finder](https://stackoverflow.com/a/800868/9403317), in case you were wondering.) If something is both hard to write and hard to read that means you and your team will spend more time on it than you need to, especially if there's another option.

Instead, the option I'd recommend is `parse_number()` from the readr package by Jim Hester. This is one of the things we load in with `library(tidyverse)`. `parse_number()` can handle most of the common ways that numbers are presented and neatly tidies up the non-numeric characters and changes the class of the resulting object:

In [None]:
# character vector of numbers
to_parse <- c(100, "10,000", "%100", "$50")

# check the class before parsing (it's "character", not "numeric")
print("Class before:")
class(to_parse)

# parse numbers
parsed_numbers <- parse_number(to_parse)

# check class
print("Class after:")
class(parsed_numbers)

# see what it looks like now
parsed_numbers
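
To make the earlier point about brittle regular expressions a bit more concrete, here is a small sketch comparing a narrowly written regex with `parse_number()`. The `prices` vector and the regex are made up for illustration; a more careful regex could handle more cases, at the cost of being harder to read.

```r
library(readr)
library(stringr)

# hypothetical price column with several formats mixed together
prices <- c("$50", "$1,200", "75%", "USD 30")

# a regex that only strips a leading "$" breaks on the other formats:
# as.numeric() returns NA for anything that still isn't a plain number
regex_result <- suppressWarnings(as.numeric(str_replace(prices, "^\\$", "")))
regex_result

# parse_number() drops the surrounding symbols and the grouping comma
parsed_result <- parse_number(prices)
parsed_result
```

Only the first value survives the regex version, while `parse_number()` returns a number for all four.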

Of course, it's easy to use with a toy example, but how well does it extend to real datasets? Pretty well, in my experience! 

First, I like to pull out all the columns that are "character" class and look at them. (I focus on character columns because I don't need to re-parse the numeric columns that were parsed correctly the first time.)

In [None]:
# get only columns with the data type "character" 
character_columns <- hospital_charges[, sapply(hospital_charges, class) == "character"]

# look at these columns
str(character_columns)

Looking at these columns, it seems like only the last three have been parsed incorrectly: each of them should actually be numeric. My next step is to select just those columns and parse each of them. 

In [None]:
# select columns with "charge" or "pay" in the name
money_columns <- character_columns %>%
    select(contains("charge"), contains("pay")) 

# parse each of those columns as numeric using sapply()
# (as_tibble() replaces as_data_frame(), which is deprecated in recent versions of tibble)
money_columns_parsed <- sapply(money_columns, parse_number) %>%
    as_tibble()

Finally, I make a new copy of the dataset with the parsed columns. (You may want to avoid making a bunch of copies like this if your dataset is very large, but I like the safety net of not modifying data in place.)

In [None]:
# replace columns with their parsed versions
hospital_charges_parsed <- hospital_charges %>%
    # the next line *removes* the columns we selected earlier
    select(-contains("charge"), -contains("pay")) %>%
    # add the columns we parsed earlier
    bind_cols(money_columns_parsed)

# double check that our data types are correct
str(hospital_charges_parsed)

And now our numeric columns are actually numeric! From here you can work with them as you would any other numeric values.
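
If you'd rather not split the columns out and bind them back on, the same idea can be sketched in one step with `mutate()` and `across()`. This is an alternative to the approach above, not a replacement for it, and it assumes dplyr 1.0.0 or later (where `across()` was introduced). The `toy_charges` data frame is a made-up stand-in for `hospital_charges`.

```r
library(dplyr)
library(readr)
library(tibble)

# hypothetical toy data standing in for hospital_charges
toy_charges <- tibble(
    provider_name = c("Hospital A", "Hospital B"),
    average_covered_charges = c("$32,963.07", "$15,131.85"),
    average_total_payments = c("$5,777.24", "$5,787.57")
)

# parse every column with "charge" or "pay" in its name, in place
toy_charges_parsed <- toy_charges %>%
    mutate(across(c(contains("charge"), contains("pay")), parse_number))

# double check that the data types are correct
str(toy_charges_parsed)
```

One difference from the `bind_cols()` version is that the parsed columns stay in their original positions instead of moving to the end of the data frame.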

## Your turn!
___

Take a look at the `listings` dataset, which contains information on Airbnb listings. You should find quite a few numeric columns read in as characters that need to be parsed. If you're looking for more of a challenge, there are also some logical columns (with 't' and 'f') that have been parsed as characters. Try parsing them using `parse_logical()`. 

If you're looking for an extra-tough challenge, try writing a function that identifies which numeric columns might have been mis-parsed as characters and attempts to parse them correctly.  

In [None]:
# your code here :)

# Parsing dates & times
___

Sometimes you're lucky and your dates will be in a format that R knows how to handle and will be read in and parsed automatically. Sometimes, however, you'll have to let R know that a certain column contains a date and what the format of that date is. This is known as *parsing* a date. I generally use [the lubridate package](https://cran.r-project.org/web/packages/lubridate/index.html) by Vitalie Spinu and co-authors for parsing dates. These are the four functions I use most often to parse dates in different formats:

* **mdy()**, for dates that are month, day and then year
* **ymd()**, for dates that are year, month and then day
* **dmy()**, for dates that are day, month and then year
* **ymd_hms()**, for dates that are year, month, day, hour, minute and finally second

Once a date is parsed, you can easily extract parts of it using functions like month() and year() and plot or analyze it like any other numeric variable. You can learn more about parsing dates [here](https://cran.r-project.org/web/packages/lubridate/vignettes/lubridate.html).
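
Here is a quick sketch of the four parsing functions listed above, plus a couple of the extractor functions. The date strings are made up for illustration and use numeric months, so the results shouldn't depend on your locale.

```r
library(lubridate)

# each function expects the parts of the date in a particular order
from_mdy <- mdy("01/31/2018")                 # month, day, year
from_ymd <- ymd("2018-01-31")                 # year, month, day
from_dmy <- dmy("31-01-2018")                 # day, month, year
from_hms <- ymd_hms("2018-01-31 13:45:10")    # date plus hour, minute, second

# the first three all parse to the same date
from_mdy
from_ymd
from_dmy

# once parsed, pieces of the date are easy to pull out
year(from_mdy)
month(from_mdy)
hour(from_hms)
```

If you pick a function whose order doesn't match your data, lubridate will either fail to parse (returning NA with a warning) or quietly parse the wrong date, so it's worth checking a few values by eye.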

Now that we've got the basics, let's put them into practice. Here are some of the dates in the date column of the `drone_strikes` dataset. 

In [None]:
# look at the dates
drone_strikes$date[1:10]

So our dates are in the format: day of week, month, day of month, year. Unfortunately, there's no lubridate parsing function for handling the day of the week, so we'll need to strip them. (Don't worry, once our dates are parsed we can get them all back with the `wday()` function.) Here, I've written a function to remove them for us:

In [None]:
# function to remove the day of the week (friday, monday, etc.) & following comma
remove_dow <- function(column){
    no_dow <- str_replace_all(column, '[A-Za-z]*day, ', '')
    return(no_dow)
}

# test it out
remove_dow(drone_strikes$date[1:10])

Now we can use this function to remove our day of the week, then parse the remaining data with the `mdy()` function. If you don't need to remove the days of the week, you can just skip that line.

In [None]:
# tidy up dates & convert them to date format
dates <- drone_strikes %>%
    select(date) %>% # get the "date" column
    mutate(date_formatted = remove_dow(date)) %>% # remove the day of the week
    mutate(date_formatted = mdy(date_formatted)) # convert to date format

# compare before & after formatting
head(dates)

# add formatted dates to our dataframe
drone_strikes$date_formatted <- dates$date_formatted

And that's all there is to it! You're now ready to start your time series analysis or easily plot your dates using ggplot.
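
As mentioned earlier, the day of the week isn't lost when we strip it out: `wday()` can recover it from the parsed date. Here is a minimal sketch with two made-up strings in the same shape as the `drone_strikes` dates (it assumes an English locale, since the month names are spelled out).

```r
library(lubridate)
library(stringr)

# hypothetical date strings: day of week, month, day of month, year
raw_dates <- c("Friday, June 18, 2004", "Sunday, May 08, 2005")

# strip the day of the week and the following comma, then parse
no_dow <- str_replace_all(raw_dates, "[A-Za-z]*day, ", "")
parsed <- mdy(no_dow)
parsed

# get the day of the week back from the parsed dates
wday(parsed, label = TRUE, abbr = FALSE)

# or as a number, counting from Sunday = 1
wday(parsed, week_start = 7)
```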

## Your turn!
___

Correctly parse the "date" column from the mass_shootings dataset. Print some of the unconverted & converted dates using the head() function to compare them.

In [None]:
# your code goes here :)

# And that's it, you've completed the whole challenge!
____

Congrats, you've finished the entire 5-Day Challenge on data cleaning with R! I hope you learned lots of helpful tips and tricks that you can apply in your work further down the line. :)

If you're looking for more content, try checking out [the other 5-Day Challenges](https://www.kaggle.com/rtatman/list-of-5-day-challenges/) or some of the great content we have on the [Kaggle Learn page](https://www.kaggle.com/learn/overview). If you're looking for more practice with data cleaning specifically, I'd recommend picking a new dataset ([maybe one of these](https://www.kaggle.com/datasets?sortBy=hottest&group=public&page=1&pageSize=20&size=all&filetype=all&license=all&tagids=13202)), getting to know it (mainly doing a lot of visualizations) and then making the changes you need to use it for modelling or analysis. 