### Packages & Libraries

In [39]:
install.packages("dplyr")
install.packages("questionr")
library(dplyr)
library(questionr)

"package 'dplyr' is in use and will not be installed"
"package 'questionr' is in use and will not be installed"

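The "is in use and will not be installed" messages appear because `install.packages()` is called for packages that are already loaded. A minimal sketch of a guard that installs only what is missing is shown below; the helper name `ensure_packages` is hypothetical and not part of the original notebook.

```r
# Hypothetical helper (an assumption, not from the notebook): install only the
# packages that are not yet available and return the ones that were missing.
ensure_packages <- function(pkgs) {
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  not_installed <- pkgs[!available]
  if (length(not_installed) > 0) {
    install.packages(not_installed)
  }
  invisible(not_installed)
}

# Demonstrated with packages that ship with R, so nothing is downloaded here.
# In this notebook the call would be ensure_packages(c("dplyr", "questionr")).
already_there <- ensure_packages(c("stats", "utils"))
print(length(already_there))
```

With a guard like this, re-running the cell in a session where the packages are already loaded should not trigger the warnings.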

### Upload the datasets

In [40]:
setwd('C:/Users/Carlos/Documents/Varios/OnPremise/Datasets/TimeSeries')
# Empty fields are read as NA
air_visit <- read.csv("air_visit_data.csv", header = TRUE, na.strings = "")
date_info <- read.csv("date_info.csv", header = TRUE, na.strings = "")

In [41]:
air_visit

air_store_id,visit_date,visitors
<chr>,<chr>,<int>
air_ba937bf13d40fb24,2016-01-13,25
air_ba937bf13d40fb24,2016-01-14,32
air_ba937bf13d40fb24,2016-01-15,29
air_ba937bf13d40fb24,2016-01-16,22
air_ba937bf13d40fb24,2016-01-18,6
air_ba937bf13d40fb24,2016-01-19,9
air_ba937bf13d40fb24,2016-01-20,31
air_ba937bf13d40fb24,2016-01-21,21
air_ba937bf13d40fb24,2016-01-22,18
air_ba937bf13d40fb24,2016-01-23,26


date_info:
- Contains 517 calendar_date values, from 2016-01-01 to 2017-05-31

In [42]:
date_info

calendar_date,day_of_week,holiday_flg
<chr>,<chr>,<int>
2016-01-01,Friday,1
2016-01-02,Saturday,1
2016-01-03,Sunday,1
2016-01-04,Monday,0
2016-01-05,Tuesday,0
2016-01-06,Wednesday,0
2016-01-07,Thursday,0
2016-01-08,Friday,0
2016-01-09,Saturday,0
2016-01-10,Sunday,0


#### Using the **'questionr'** package, the **'freq'** function shows the frequency of each element

Observations:
- The dataset covers 829 restaurants
- Some restaurants were open almost every day, while others were open only 20 days in the 477-day measurement period
- The restaurant that was open the most days in the 477-day period is "air_5c817ef28f236bdf"

In [43]:
freq(air_visit$air_store_id, sort = "dec")

,n,%,val%
,<dbl>,<dbl>,<dbl>
air_5c817ef28f236bdf,477,0.2,0.2
air_36bcf77d3382d36e,476,0.2,0.2
air_a083834e7ffe187e,476,0.2,0.2
air_d97dabf7aae60da5,476,0.2,0.2
air_232dcee6f7c51d37,475,0.2,0.2
air_60a7057184ec7ec7,475,0.2,0.2
air_71903025d39a4571,475,0.2,0.2
air_7a946aada80376a4,474,0.2,0.2
air_883ca28ef0ed3d55,474,0.2,0.2
air_4cca5666eaf5c709,473,0.2,0.2
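Because each row is one open day for one restaurant, the same counts can be obtained in base R with `table()`. The sketch below is an alternative to `freq`, not the notebook's method, and runs on a small made-up data frame (the store ids are placeholders).

```r
# Toy stand-in for air_visit; store ids and rows are made up for illustration.
toy_visits <- data.frame(
  air_store_id = c("store_a", "store_a", "store_a", "store_b", "store_c", "store_c"),
  visit_date   = c("2016-01-02", "2016-01-03", "2016-01-04",
                   "2016-01-02", "2016-01-02", "2016-01-04"),
  stringsAsFactors = FALSE
)

# One row per open day, so counting rows per store gives the days open.
days_open <- sort(table(toy_visits$air_store_id), decreasing = TRUE)

n_stores  <- length(days_open)     # number of distinct restaurants
busiest   <- names(days_open)[1]   # store with the most recorded days
open_span <- range(days_open)      # fewest and most recorded days
print(days_open)
```

On the real data, `n_stores` would correspond to the 829 restaurants and `open_span` to the 20 and 477 day extremes noted in the observations.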


We extract the rows for the "air_5c817ef28f236bdf" restaurant from the "air_visit" dataset
- Its records start on 2016-01-02 and finish on 2017-04-22, with 477 values
- We must check whether that period spans exactly 477 days, to see whether any day is missing. The dataset description states: "The training set omits days where the restaurants were closed"
- By definition, "a time series is a sequence taken at successive **equally spaced** points in time". If days are omitted, the sequence is not equally spaced
- We must find any omitted days and fill them with 0 visitors

In [44]:
air_visit_5c817ef28f236bdf <- air_visit[air_visit$air_store_id == "air_5c817ef28f236bdf", 
                                        c("air_store_id","visit_date","visitors")]
air_visit_5c817ef28f236bdf

,air_store_id,visit_date,visitors
,<chr>,<chr>,<int>
158603,air_5c817ef28f236bdf,2016-01-02,24
158604,air_5c817ef28f236bdf,2016-01-03,49
158605,air_5c817ef28f236bdf,2016-01-04,10
158606,air_5c817ef28f236bdf,2016-01-05,2
158607,air_5c817ef28f236bdf,2016-01-06,9
158608,air_5c817ef28f236bdf,2016-01-07,15
158609,air_5c817ef28f236bdf,2016-01-08,36
158610,air_5c817ef28f236bdf,2016-01-09,44
158611,air_5c817ef28f236bdf,2016-01-10,25
158612,air_5c817ef28f236bdf,2016-01-11,7
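Whether 477 records cover the whole period can be checked with date arithmetic: the number of calendar days between the first and last record, counting both endpoints, should equal the number of rows. A small sketch with illustrative variable names:

```r
first_day <- as.Date("2016-01-02")
last_day  <- as.Date("2017-04-22")

# Both endpoints count as days, hence the + 1.
expected_days <- as.integer(last_day - first_day) + 1L
print(expected_days)  # 477

# Row count reported by freq() for this restaurant.
observed_days <- 477L
no_days_omitted <- expected_days == observed_days
```

If `observed_days` were smaller than `expected_days`, the difference would be the number of omitted (closed) days.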


#### Using the "dplyr" package, we left-join the visits of the "air_5c817ef28f236bdf" restaurant (recorded from 2016-01-02 to 2017-04-22) onto the "date_info" calendar, so every calendar date is kept

In [45]:
date_info_air_5c817ef28f236bdf <- date_info %>%
    # Keep every calendar date; dates without a visit record get NA
    left_join(air_visit_5c817ef28f236bdf, by = c(calendar_date = 'visit_date'))
date_info_air_5c817ef28f236bdf

calendar_date,day_of_week,holiday_flg,air_store_id,visitors
<chr>,<chr>,<int>,<chr>,<int>
2016-01-01,Friday,1,,
2016-01-02,Saturday,1,air_5c817ef28f236bdf,24
2016-01-03,Sunday,1,air_5c817ef28f236bdf,49
2016-01-04,Monday,0,air_5c817ef28f236bdf,10
2016-01-05,Tuesday,0,air_5c817ef28f236bdf,2
2016-01-06,Wednesday,0,air_5c817ef28f236bdf,9
2016-01-07,Thursday,0,air_5c817ef28f236bdf,15
2016-01-08,Friday,0,air_5c817ef28f236bdf,36
2016-01-09,Saturday,0,air_5c817ef28f236bdf,44
2016-01-10,Sunday,0,air_5c817ef28f236bdf,25
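After a left join onto the calendar, a date with no visit record shows up as `NA` in "visitors". The sketch below shows one way to turn such gaps into 0 visitors, using base R `merge()` in place of dplyr and a made-up three-row sample with one missing day; all names and values are illustrative.

```r
# Toy visits with 2016-01-04 missing (values are illustrative).
toy_visits <- data.frame(
  visit_date = as.Date(c("2016-01-02", "2016-01-03", "2016-01-05")),
  visitors   = c(24L, 49L, 2L)
)

# Full calendar between the first and last recorded visit.
calendar <- data.frame(
  visit_date = seq(min(toy_visits$visit_date), max(toy_visits$visit_date), by = "day")
)

# all.x = TRUE keeps every calendar day (a left join); unmatched days get NA.
filled <- merge(calendar, toy_visits, by = "visit_date", all.x = TRUE)

# Treat days without a record as closed days with zero visitors.
filled$visitors[is.na(filled$visitors)] <- 0L
print(filled)
```

Restricting the calendar to the span between the first and last visit avoids filling zeros for dates outside the restaurant's recorded period, such as 2016-01-01 in the table above.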


- Drop the redundant "air_store_id" column from the dataset
- Note that "calendar_date" is in chr (character) format; we must convert it to date format

In [46]:
date_info_air_ <- select(date_info_air_5c817ef28f236bdf, -air_store_id)
date_info_air_

calendar_date,day_of_week,holiday_flg,visitors
<chr>,<chr>,<int>,<int>
2016-01-01,Friday,1,
2016-01-02,Saturday,1,24
2016-01-03,Sunday,1,49
2016-01-04,Monday,0,10
2016-01-05,Tuesday,0,2
2016-01-06,Wednesday,0,9
2016-01-07,Thursday,0,15
2016-01-08,Friday,0,36
2016-01-09,Saturday,0,44
2016-01-10,Sunday,0,25


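Using the "mutate" function of dplyr, we convert "calendar_date" to date format. The result is only displayed here, not assigned back, so "date_info_air_" itself still stores the dates as characters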

In [47]:
date_info_air_ %>% 
    mutate(calendar_date = as.Date(calendar_date, format = "%Y-%m-%d"))

calendar_date,day_of_week,holiday_flg,visitors
<date>,<chr>,<int>,<int>
2016-01-01,Friday,1,
2016-01-02,Saturday,1,24
2016-01-03,Sunday,1,49
2016-01-04,Monday,0,10
2016-01-05,Tuesday,0,2
2016-01-06,Wednesday,0,9
2016-01-07,Thursday,0,15
2016-01-08,Friday,0,36
2016-01-09,Saturday,0,44
2016-01-10,Sunday,0,25
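A conversion only persists if its result is assigned to a name. The sketch below illustrates this with base R `transform()` on a made-up two-row data frame (used here instead of `mutate` so it runs without extra packages); the same applies to a `mutate` pipeline.

```r
# Toy data frame with dates stored as character strings.
toy <- data.frame(
  calendar_date = c("2016-01-01", "2016-01-02"),
  visitors      = c(NA, 24L),
  stringsAsFactors = FALSE
)

# Assign the converted result to a new name; `toy` itself is left unchanged.
converted <- transform(toy, calendar_date = as.Date(calendar_date, format = "%Y-%m-%d"))

original_class  <- class(toy$calendar_date)        # still character
converted_class <- class(converted$calendar_date)  # now Date
print(c(original_class, converted_class))
```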


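Now we have the final dataset. The number of days from 2016-01-02 to 2017-04-22 matches the number of records for the "air_5c817ef28f236bdf" restaurant (477), so it did not close on any day in that period. Because "calendar_date" is still a character column here, the filter compares ISO-formatted date strings, which sort in chronological order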

In [52]:
date_info_air_final <- date_info_air_[(date_info_air_$calendar_date > "2016-01-01" 
                                       & date_info_air_$calendar_date < "2017-04-23"),]
date_info_air_final

,calendar_date,day_of_week,holiday_flg,visitors
,<chr>,<chr>,<int>,<int>
2,2016-01-02,Saturday,1,24
3,2016-01-03,Sunday,1,49
4,2016-01-04,Monday,0,10
5,2016-01-05,Tuesday,0,2
6,2016-01-06,Wednesday,0,9
7,2016-01-07,Thursday,0,15
8,2016-01-08,Friday,0,36
9,2016-01-09,Saturday,0,44
10,2016-01-10,Sunday,0,25
11,2016-01-11,Monday,1,7
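The "no closed days" conclusion can also be verified directly: filter the joined table to the restaurant's period and look for dates whose "visitors" value is `NA`. A sketch on a made-up four-row sample, with dates stored as `Date` and illustrative variable names:

```r
# Toy joined table: the first date precedes the restaurant's first record.
toy <- data.frame(
  calendar_date = as.Date(c("2016-01-01", "2016-01-02", "2016-01-03", "2016-01-04")),
  visitors      = c(NA, 24L, 49L, 10L)
)
first_day <- as.Date("2016-01-02")
last_day  <- as.Date("2016-01-04")

# Inclusive bounds on real Date values rather than strings.
in_period <- toy[toy$calendar_date >= first_day & toy$calendar_date <= last_day, ]

# Dates in the period with no visitor record; an empty result means no gaps.
gap_dates <- in_period$calendar_date[is.na(in_period$visitors)]
print(length(gap_dates))
```

Any dates returned in `gap_dates` would be the omitted days that need to be filled with 0 visitors.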


In [54]:
write.csv(date_info_air_final, "C:/MBD/TimeSeries_R_dplyr_questionr.csv", row.names = FALSE)