# Changing Data
## Background for this activity
In this activity, you'll review a scenario and focus on manipulating and changing real data in R. You will learn more about functions you can use to manipulate your data, use statistical summaries to explore it, and gain initial insights for your stakeholders.

Throughout this activity, you will also have the opportunity to practice writing your own code by making changes to the code chunks yourself.
## The Scenario
In this scenario, you are a junior data analyst working for a hotel booking company. You have been asked to clean a .csv file that was created after querying a database to combine two different tables from different hotels. You have already performed some basic cleaning functions on this data; this activity will focus on using functions to conduct basic data manipulation.

### Load Packages

In [None]:
install.packages("tidyverse")
install.packages("skimr")
install.packages("janitor")
library(tidyverse)
library(skimr)
library(janitor)

### Import Data

In [None]:
hotel_bookings <- read_csv("hotel_bookings.csv")

### Get to Know the Data

In [None]:
head(hotel_bookings)
str(hotel_bookings)
glimpse(hotel_bookings)
colnames(hotel_bookings)

### Manipulating the Data
Let's say you want to arrange the data from the longest lead time to the shortest because you want to focus on bookings that were made far in advance. You decide to try the `arrange()` function.

In [None]:
arrange(hotel_bookings, lead_time)

By default, `arrange()` sorts in ascending order, so the shortest lead times come first. To sort in descending order instead, wrap the column name in `desc()`.

In [None]:
arrange(hotel_bookings, desc(lead_time))
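
To see the difference between the two orderings at a glance, here is a minimal sketch on a small made-up data frame. The name `toy_bookings` and its values are hypothetical and are not taken from `hotel_bookings.csv`.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy_bookings <- data.frame(
  hotel = c("City Hotel", "Resort Hotel", "City Hotel"),
  lead_time = c(30, 200, 5)
)

# Default behavior: ascending order (shortest lead time first)
ascending <- arrange(toy_bookings, lead_time)

# desc() reverses the order (longest lead time first)
descending <- arrange(toy_bookings, desc(lead_time))

print(ascending$lead_time)   # 5 30 200
print(descending$lead_time)  # 200 30 5
```

Note that `arrange()` returns a new, sorted data frame and leaves `toy_bookings` itself unchanged.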

So far the sorted result has only been printed, not saved. To keep it, use the assignment operator `<-` to store the arranged data in a new data frame named `hotel_bookings_v2`:

In [None]:
hotel_bookings_v2 <-
  arrange(hotel_bookings, desc(lead_time))

**The highest lead time for a hotel booking in this data set is 737 days.**

Check out the new data frame: 

In [None]:
head(hotel_bookings_v2)

You can also find the maximum and minimum lead times without sorting the whole data set with `arrange()`. Use the `max()` and `min()` functions on the `lead_time` column instead. Try it out:

In [None]:
max(hotel_bookings$lead_time)
min(hotel_bookings$lead_time)
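
One thing to watch for with `max()` and `min()` is missing values: if a column contains any `NA`, both functions return `NA` unless you pass `na.rm = TRUE`. The sketch below uses a made-up vector, `lead_times`, to show this; it is an assumption for illustration and says nothing about whether `lead_time` in the real data has missing values.

```r
# Hypothetical lead times with one missing value
lead_times <- c(12, NA, 340, 0)

highest_raw <- max(lead_times)               # NA, because of the missing value
highest <- max(lead_times, na.rm = TRUE)     # 340
lowest <- min(lead_times, na.rm = TRUE)      # 0

# range() returns the minimum and maximum together
both <- range(lead_times, na.rm = TRUE)      # 0 340

print(highest_raw)
print(highest)
print(lowest)
print(both)
```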

Now, let's say you just want to know what the average lead time for booking is because your boss asks you how early you should run promotions for hotel rooms.

In [None]:
mean(hotel_bookings$lead_time)
mean(hotel_bookings_v2$lead_time)

**The average lead time is 104.0114 days**

You were able to report to your boss what the average lead time before booking is, but now they want to know what the average lead time before booking is for just city hotels. They want to focus the promotion they're running by targeting major cities.

You know that your first step will be creating a new data set that only contains data about city hotels. You can do that using the `filter()` function, and name your new data frame 'hotel_bookings_city':

In [None]:
hotel_bookings_city <- 
  filter(hotel_bookings, hotel == "City Hotel")
head(hotel_bookings_city)
mean(hotel_bookings_city$lead_time)
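
Inside `filter()`, a column can be referred to by its bare name, and the same step can be written with the pipe operator. The following is a small sketch on made-up data; `toy_bookings`, `toy_city`, and the values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy_bookings <- data.frame(
  hotel = c("City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel"),
  lead_time = c(10, 100, 30, 300)
)

# Keep only the rows where the hotel column equals "City Hotel"
toy_city <- toy_bookings %>%
  filter(hotel == "City Hotel")

print(nrow(toy_city))            # 2
print(mean(toy_city$lead_time))  # 20
```

Only the two city rows survive the filter, so the mean is computed over 10 and 30.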

Now, your boss wants to know a lot more about city hotels, including the maximum and minimum lead time. They are also interested in how city hotels differ from resort hotels. You don't want to run each line of code over and over again, so you decide to use the `group_by()` and `summarize()` functions (`summarise()`, as spelled in the code below, is a synonym). You can also use the pipe operator to make your code easier to follow. You will store the new data set in a data frame named `hotel_summary`:

In [None]:
hotel_summary <- 
  hotel_bookings %>%
  group_by(hotel) %>%
  summarise(average_lead_time=mean(lead_time),
            min_lead_time=min(lead_time),
            max_lead_time=max(lead_time))
head(hotel_summary)
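
To make the grouping step concrete, here is a minimal sketch of the same pattern on made-up data, with an extra `n()` column that counts the bookings in each group. The names `toy_bookings` and `toy_summary`, the `bookings` column, and all values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy_bookings <- data.frame(
  hotel = c("City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel"),
  lead_time = c(10, 100, 30, 300)
)

# One output row per hotel type, with summary statistics for each group
toy_summary <- toy_bookings %>%
  group_by(hotel) %>%
  summarise(
    bookings = n(),                       # number of rows in the group
    average_lead_time = mean(lead_time),
    min_lead_time = min(lead_time),
    max_lead_time = max(lead_time)
  )

print(toy_summary)
```

Here the City Hotel group has 2 bookings with an average lead time of 20, and the Resort Hotel group has 2 bookings with an average of 200, so both hotel types can be compared in a single table.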