# Cary open data analysis - COMPLETED

In [None]:
if(!require(RCurl)) {
    install.packages("RCurl", repos = "http://cran.us.r-project.org")
    library(RCurl)
}

if(!require(RDSTK)) {
    install.packages("RDSTK", repos = "http://cran.us.r-project.org")
    library(RDSTK)
}
 
if(!require(ggplot2)) {
    install.packages("ggplot2", repos = "http://cran.us.r-project.org")
    library(ggplot2)
}
 
if(!require(lazyeval)) {
    install.packages("lazyeval", repos = "http://cran.us.r-project.org")
    library(lazyeval)
}

if(!require(tidyverse)) {
    install.packages("tidyverse", repos = "http://cran.us.r-project.org")
    library(tidyverse)
}

if(!require(ggmap)) {
    install.packages("ggmap", repos = "http://cran.us.r-project.org")
    library(ggmap)
}
print('Finished loading libraries.')

First off, let's load some packages that we need in order to perform data processing and visualization in the steps below.
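
The install-if-missing pattern above repeats once per package, so it can be wrapped in a small helper. The sketch below is one possible way to do that; `ensure_package` is a hypothetical name that is not part of the original notebook, and the example call uses the built-in stats package so that it runs without downloading anything.

```r
# Hypothetical helper (not part of the original notebook): install a package
# only when it is missing, then attach it.
ensure_package <- function(pkg, repos = "http://cran.us.r-project.org") {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = repos)
  }
  library(pkg, character.only = TRUE)
  invisible(pkg)
}

# Example call with a package that ships with R, so nothing is downloaded.
ensure_package("stats")

# The notebook's packages could be loaded the same way, for example:
# for (pkg in c("RCurl", "RDSTK", "tidyverse", "ggmap")) ensure_package(pkg)
```

Using `requireNamespace` for the check avoids attaching the package twice, and `character.only = TRUE` lets `library` accept the package name as a string.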

In [None]:
restaurants <- read.csv("http://www.catallaxyservices.com/media/blog/restaurants.csv", sep=",", header=TRUE)

Our first step is to take advantage of the Cary open data portal and grab a data set.  Cary has a listing of Wake County restaurant inspections which looked interesting.  The problem is that it has addresses but no latitude and longitude pairs, so we need to go through a geocoding service to get them.

Running every address through a free geocoding service takes a long time, so the geocoded final product is available as **restaurants.csv** and is ready for processing.

In [None]:
##Get, geocode, and cleanse data
##If you want to run this in your own environment, remove the if(FALSE) block.
if(FALSE)
{
  tbl <- read.csv("https://data.townofcary.org/explore/dataset/wake_county_food_inspections/download/?format=csv&timezone=America/New_York&use_labels_for_header=true", sep=";", header=TRUE)
  tbl$address.string <- paste(tbl$address, tbl$city.town, tbl$state, tbl$postal_code, sep = " ")
  tbl$lat <- 0
  tbl$long <- 0
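  # Geocode one address at a time; rows that fail keep the placeholder value of 0.
  for(i in seq_len(nrow(tbl))) {
    print(tbl$address.string[i])
    
    tryCatch({
      sc <- street2coordinates(tbl$address.string[i])
      tbl$lat[i] <- sc$latitude[1]
      tbl$long[i] <- sc$longitude[1]
    }, error = function(e) {
      # Skip addresses the geocoder cannot resolve.
    })
  }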
}

In case you are interested in building the original data set on your own, here is the code used to build restaurants.csv.  It took at least a couple of hours to process all of the restaurants, so my recommendation is to use my pre-generated file.
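
One way to cut geocoding time is to look up each distinct address only once, since an inspection data set can list the same address on many rows. The sketch below works under that assumption. `fake_geocode` is a hypothetical stand-in that returns made-up coordinates so the example runs offline; a real call such as street2coordinates would take its place. The toy addresses and scores are invented for illustration.

```r
# Hypothetical stand-in for a real geocoder; the coordinates it returns are made up.
fake_geocode <- function(address) {
  data.frame(latitude = 35 + nchar(address) / 1000,
             longitude = -78 - nchar(address) / 1000)
}

# Toy inspection rows: the first and third share an address.
inspections <- data.frame(
  address.string = c("100 Main St Cary NC 27511",
                     "200 Oak Ave Apex NC 27502",
                     "100 Main St Cary NC 27511"),
  score = c(97.5, 88, 94),
  stringsAsFactors = FALSE
)

# Build a lookup table with one row per distinct address.
lookup <- data.frame(address.string = unique(inspections$address.string),
                     lat = NA_real_, long = NA_real_,
                     stringsAsFactors = FALSE)

# Geocode each distinct address once; failures leave NA in place.
for (i in seq_len(nrow(lookup))) {
  result <- tryCatch(fake_geocode(lookup$address.string[i]),
                     error = function(e) NULL)
  if (!is.null(result) && nrow(result) > 0) {
    lookup$lat[i] <- result$latitude[1]
    lookup$long[i] <- result$longitude[1]
  }
}

# Attach the coordinates back to every inspection row.
geocoded <- merge(inspections, lookup, by = "address.string", all.x = TRUE)
print(geocoded)
```

Here the geocoder is called twice for three rows. Using NA rather than 0 for failed lookups also makes unresolved addresses easy to spot later.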

**Your mission, should you choose to accept it:**

Generate summary data for this data set.  In the first box, we want to see a summary of results.  In the second box, we want to see the first five rows of the data set.  In the third box, we want to see the last five rows of the data set.  For the final box, we want to focus on restaurant scores and get an understanding of the range of scores.

In [None]:
# TODO:  generate a summary of the data
summary(restaurants)

In [None]:
# TODO:  look at the first five rows of the data set
head(restaurants, 5)

In [None]:
# TODO:  look at the last five rows of the data set
tail(restaurants, 5)

In [None]:
# TODO:  show a summary of just the restaurant scores.
summary(restaurants$score)
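
Beyond summary(), a few base R calls can describe the range of scores in more detail. This is a minimal sketch on made-up scores, not on the real data; with the notebook's data, `scores` would be replaced by restaurants$score.

```r
# Made-up scores for illustration, including one missing value.
scores <- c(100, 97.5, 92, 84, NA, 99, 78.5)

# How many inspections have no score recorded?
n_missing <- sum(is.na(scores))

# Lowest and highest score, ignoring missing values.
score_range <- range(scores, na.rm = TRUE)

# Selected quantiles give a feel for how tightly the scores cluster.
score_quantiles <- quantile(scores, probs = c(0.05, 0.25, 0.5, 0.75), na.rm = TRUE)

# Share of scored inspections that fall below 85.
share_below_85 <- mean(scores < 85, na.rm = TRUE)

print(n_missing)
print(score_range)
print(score_quantiles)
print(share_below_85)
```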

This command calls the Google Maps API and returns an image of a specific area, centered on the mean of our longitude/latitude pairs.  The zoom parameter controls how far in the map is zoomed; larger values show a smaller area in more detail.  The scale parameter takes one of two options for the free version:  1 is low-res and 2 is high-res.

**Feel free to modify parameters and play with the resulting map.**

In [None]:
wakemap <- get_map(location = c(lon = mean(restaurants$long), 
                                lat = mean(restaurants$lat)), zoom = 11, 
                   maptype = "roadmap", scale = 2)
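
Because the map is centered on the mean longitude and latitude, any row that was not geocoded can pull the center off target. The geocoding code in this notebook leaves 0 in lat and long when a lookup fails. Whether restaurants.csv still contains such rows is an assumption here, not something the notebook states. The sketch below uses made-up coordinates to show one way to check the center before requesting a map.

```r
# Made-up coordinates: one 0/0 placeholder row and one row of missing values.
coords <- data.frame(
  lat  = c(35.79, 35.82, 0, 35.90, NA),
  long = c(-78.78, -78.80, 0, -78.64, NA)
)

# Keep only rows that have real coordinates.
has_coords <- !is.na(coords$lat) & !is.na(coords$long) &
  !(coords$lat == 0 & coords$long == 0)
usable <- coords[has_coords, ]

# Center in the same lon/lat form that get_map's location argument uses above.
center <- c(lon = mean(usable$long), lat = mean(usable$lat))

# Coordinate ranges are a quick sanity check that every point is in the expected area.
lon_range <- range(usable$long)
lat_range <- range(usable$lat)

print(center)
print(lon_range)
print(lat_range)
```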

This snippet will lay out a heat map of the Wake County area based on the number of restaurant ratings.  It isn't very helpful for us, so we'll want to create different types of graphs to visualize the data a bit better.

In [None]:
ggmap(wakemap, extent = "device") +
  geom_density2d(data = restaurants, aes(x=long,y=lat),size = 0.3) +
  stat_density2d(data=restaurants, aes(x=long, y=lat, fill=..level.., alpha=..level..), size=0.01, bins=16, geom="polygon") +
  scale_fill_gradient(low="red",high="green") +
  scale_alpha(range = c(0,0.3), guide="none")

**Problems we are trying to solve:**
1.  Where are the sketchy restaurants (rating below 85)?
2.  Which parts of Wake County have the worst ratings?
3.  Zooming into Cary, what do the ratings look like?

We will solve each of these problems in the following sections.

In [None]:
# dplyr has a function called "filter"
# TODO:  create a filter which gets ONLY restaurants with a rating UNDER 85.  Call the resulting data set "sketchy"
sketchy <- dplyr::filter(restaurants, score < 85)

# Once you have sketchy filled out, we will use dplyr to get the number of failures (scores < 85) per lat-long pair.
sketchy <- sketchy %>%
            dplyr::group_by(lat, long) %>%
            dplyr::summarize(failures = n())
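
As a cross-check on the filter and count steps, the same result can be computed with base R alone. This is a sketch on a made-up data frame named `toy`, not the real data, and it is meant as a comparison point for the dplyr pipeline rather than a replacement.

```r
# Made-up inspections: two low scores at one location, one at another.
toy <- data.frame(
  lat   = c(35.79, 35.79, 35.80, 35.81),
  long  = c(-78.78, -78.78, -78.80, -78.76),
  score = c(82, 84.5, 96, 80)
)

# Step 1: keep only inspections that scored under 85.
toy_sketchy <- toy[!is.na(toy$score) & toy$score < 85, ]

# Step 2: count the remaining rows for each lat-long pair.
toy_failures <- aggregate(score ~ lat + long, data = toy_sketchy, FUN = length)
names(toy_failures)[names(toy_failures) == "score"] <- "failures"

print(toy_failures)
```

The location with two low scores ends up with failures equal to 2, which is what becomes the point size on the map.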

In [None]:
# We now want to produce a POINT map of the sketchy restaurant locations.  To do this, we start with wakemap
# and need to include POINTs.  Let's make the size of each point correspond to the number of failures and 
# hard-code the alpha channel to 0.1.  If you need help, look up ggplot2 types (specifically relating to POINTs)
# TODO:  create a ggmap call which includes points of restaurant failures
ggmap(wakemap) +
    geom_point(data = sketchy, aes(x = long, y = lat, size = failures), colour = "red", alpha = 0.1)

We now have a map of the restaurants with failed inspections.  Each restaurant is its own point and the size of the point represents how many failed inspections the restaurant had.

This is pretty interesting, and leads to the next problem:  what do ratings look like for **areas** of Wake County?

In [None]:
# We want to group restaurants by longitude and latitude.  Specifically, longitude and latitude at 1 place
# after the decimal.  This way, we can get clusters of restaurants and they'll show up on the map more clearly.
restgrp <- restaurants
# TODO:  round restgrp's lat and long values to 1 place after the decimal
restgrp$lat <- round(restgrp$lat, 1)
restgrp$long <- round(restgrp$long, 1)

# TODO:  use dplyr like in the sketchy example to do the following:
# 1)  Filter out any N/A scores
# 2)  Group by latitude and longitude
# 3)  Create two aggregates:  ratings, which is the number of records; and meanscore, which is the mean of scores
restgrp <- restgrp %>%
            dplyr::filter(!is.na(score)) %>%
            dplyr::group_by(lat, long) %>%
            dplyr::summarize(ratings = n(), meanscore = mean(score))


Fill out the above section.  As a hint, look at the sketchy block to see how we grouped and aggregated.  When in doubt, use R's help (?[topic]) to read up on options.
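
To see what the rounding and grouping steps produce, here is a base R sketch on made-up rows. The data frame `toy` and its values are invented for illustration. Rounding to one decimal place snaps nearby restaurants onto the same grid cell, and each cell then gets a count and a mean.

```r
# Made-up inspections; the third row has no score.
toy <- data.frame(
  lat   = c(35.791, 35.812, 35.768, 35.903),
  long  = c(-78.781, -78.795, -78.842, -78.655),
  score = c(90, 96, NA, 88)
)

# Snap coordinates to a grid of 0.1 degrees.
toy$lat  <- round(toy$lat, 1)
toy$long <- round(toy$long, 1)

# Drop rows without a score.
toy <- toy[!is.na(toy$score), ]

# Count the records and average the score for each grid cell.
ratings   <- aggregate(score ~ lat + long, data = toy, FUN = length)
meanscore <- aggregate(score ~ lat + long, data = toy, FUN = mean)
cells <- merge(ratings, meanscore, by = c("lat", "long"),
               suffixes = c(".n", ".mean"))
names(cells) <- c("lat", "long", "ratings", "meanscore")

print(cells)
```

The first two rows land in the same cell, so that cell has two ratings and a mean score of 93.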

In [None]:
# TODO:  Set wakemap to a new map value.  This time, set the zoom value to 10 so you can see the entire region.
wakemap <- get_map(location = c(lon = mean(restgrp$long), lat = mean(restgrp$lat)), zoom = 10, maptype = "roadmap", scale = 2)

# TODO:  fill in ggmap with settings.  We want three function calls:
# scale_fill_gradient (to give us a visual cue of restaurants.  Pick good colors for low & high.)
# geom_text (to display the meanscore, giving us precise values.  Round meanscore to 1 spot after decimal)
# geom_tile (to display blocks of color.  Set alpha = 1)
ggmap(wakemap) +
  scale_fill_gradient(low="black",high="orange") +
  geom_text(data = restgrp, aes(x=long, y=lat, label = round(meanscore, 1))) +
  geom_tile(data = restgrp, aes(x=long, y=lat, alpha=1, fill=meanscore)) 



Once you have filled out the above section, you should see a map of the area with blocks and ratings.  You can also see that mean scores don't vary much within the area, but they look a little lower toward the edges of the map.

But let's suppose you wanted to get more details on a specific part of the RTP area.  Let's pick Cary, because that's the origin of our data set.

In [None]:
# For Cary restaurants, we want to go down to 2 spots after the decimal.  This is a nice compromise and should
# fit our map size better than prior examples.
caryrestgrp <- restaurants

# TODO:  round caryrestgrp's lat and long values to 2 places after the decimal.
caryrestgrp$lat <- round(caryrestgrp$lat, 2)
caryrestgrp$long <- round(caryrestgrp$long, 2)

# TODO:  use dplyr like in the sketchy example to do the following:
# 1)  Filter out any N/A scores
# 2)  Group by latitude and longitude
# 3)  Create two aggregates:  ratings, which is the number of records; and meanscore, which is the mean of scores
caryrestgrp <- caryrestgrp %>%
  dplyr::filter(!is.na(score)) %>%
  dplyr::group_by(lat, long) %>%
  dplyr::summarize(ratings = n(), meanscore = mean(score))


We now are going to build a new data set to plot on top of the map of Cary.  Optionally, you could filter out any values outside of [-78.85, -78.75] longitude and [35.76, 35.84] latitude.  But in this scenario there aren't too many data points once we've aggregated results (492), so it's not necessary to filter.
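
If you do want to apply the optional filter, a bounding-box test on the aggregated data is enough. The sketch below uses a made-up data frame named `toy_grp` with the same columns as caryrestgrp; the bounds are the ones given above.

```r
# Made-up aggregated cells with the same columns as caryrestgrp.
toy_grp <- data.frame(
  lat       = c(35.79, 35.80, 35.90, 35.77),
  long      = c(-78.78, -78.80, -78.64, -78.90),
  ratings   = c(12, 7, 3, 5),
  meanscore = c(96.1, 93.4, 97.0, 95.2)
)

# Keep only cells inside the Cary bounding box.
in_cary <- toy_grp$long >= -78.85 & toy_grp$long <= -78.75 &
  toy_grp$lat >= 35.76 & toy_grp$lat <= 35.84
cary_only <- toy_grp[in_cary, ]

print(cary_only)

# A dplyr version of the same filter might look like this (not run here):
# dplyr::filter(caryrestgrp, dplyr::between(long, -78.85, -78.75),
#                            dplyr::between(lat, 35.76, 35.84))
```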

In [None]:
# We will now create a Cary map with a zoom of 13 and longitude and latitude around a fixed point.
carymap <- get_map(location = c(lon = -78.8, lat = 35.8), zoom = 13, maptype = "roadmap", scale = 2)

# TODO:  fill in ggmap with settings.  We want the same function calls as the wakemap example above,
# but fill in values from caryrestgrp instead of restgrp.
ggmap(carymap) +
  scale_fill_gradient(low="black",high="orange") +
  geom_text(data = caryrestgrp, aes(x=long, y=lat, label = round(meanscore, 1))) +
  geom_tile(data = caryrestgrp, aes(x=long, y=lat, alpha=1, fill=meanscore)) 

This final exercise creates a heat map for just the Cary area.

**Your mission, should you choose to accept it:**

Now that you've gone through and solved several problems around filtering and displaying data, what else can you do?  Maybe focus on another part of town, maybe play around with some of the charting options.  Whatever you want to do, make sure to update the notebook!  Use Insert --> Insert Cell Below to add new cells.  Code blocks are runnable and perform actions, whereas Markdown blocks show text.