<img src='https://raw.githubusercontent.com/dxkikuchi/SparkSnips/master/dsxlogo.jpg' width="30%" height="30%"></img>

## New York State Restaurant Inspections Notebook
This notebook uses data on restaurant inspections for most of the state of New York.  For additional details, please see <a href="https://health.data.ny.gov/Health/Food-Service-Establishment-Last-Inspection/cnih-y5dw" target="_blank">https://health.data.ny.gov/Health/Food-Service-Establishment-Last-Inspection/cnih-y5dw</a>

In [None]:
nyr = 'https://health.data.ny.gov/api/views/cnih-y5dw/rows.csv?accessType=DOWNLOAD'

The CSV (comma-separated values) data is read into a pandas dataframe (nyr), and the first 5 records are displayed using the 'head()' method.<br>
Please attempt to write this code in the following cell to do this without looking at the solution provided 2 cells from here.

In [None]:
import pandas as pd
# Please add your code here

Solution: Please copy and paste the following code into the previous cell where specified.<br>
nyr = pd.read_csv(nyr)<br>
nyr.head()

The latitude and longitude are provided in the final column (Location1).  Functions are defined to extract the latitude and longitude independently.

In [None]:
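def get_lon(loc):
    # The code assumes a value shaped like "(lat, lon)"; the longitude is the
    # part after the comma. An empty string is returned if there is no comma.
    parts = loc.split(",")
    if len(parts) > 1:
        lon = parts[1].replace(")", "").strip()
    else:
        lon = ""
    return lon

def get_lat(loc):
    # The latitude is the part before the comma, with the opening parenthesis removed.
    parts = loc.split(",")
    if len(parts) > 1:
        lat = parts[0].replace("(", "").strip()
    else:
        lat = ""
    return lat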
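As a minimal sketch of the same idea, the two extractions can be combined into one helper that returns numeric values. The name `parse_location` and the sample string are hypothetical and not part of this notebook; the sketch assumes Python 3 and a value shaped like "(lat, lon)", which is what get_lat and get_lon above imply.

```python
def parse_location(loc):
    """Return (lat, lon) as floats, or (None, None) if loc cannot be parsed.

    Hypothetical helper: assumes loc looks like "(42.65, -73.75)".
    """
    # Missing values read by pandas arrive as float NaN, not as strings.
    if not isinstance(loc, str):
        return (None, None)
    # Drop surrounding whitespace and parentheses, then split on the comma.
    parts = loc.strip().strip("()").split(",")
    if len(parts) != 2:
        return (None, None)
    try:
        return (float(parts[0]), float(parts[1]))
    except ValueError:
        return (None, None)

print(parse_location("(42.65, -73.75)"))  # sample value, not from the dataset
print(parse_location(""))
print(parse_location(float("nan")))
```

Returning `None` rather than an empty string for unparseable values would make the later casts in SQL unnecessary, at the cost of handling nulls downstream.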

The data is transformed into a Spark dataframe (nyrDF) and a table is registered.

In [None]:
nyrDF = sqlContext.createDataFrame(nyr)
nyrDF.registerTempTable("nyrDF")

A Spark dataframe (nyvDF) is created that contains only the relevant columns.<br>
Please attempt to write this code in the following cell to register the table without looking at the solution provided 2 cells from here.

In [None]:
from pyspark.sql import Row
# Map over the underlying RDD (.rdd); a Spark 2.x DataFrame has no .map method.
# Line continuations are not needed inside the parentheses.
nyvDF = nyrDF.rdd.map(lambda x: Row(FACILITY=x.FACILITY,
                                    ADDRESS=x.ADDRESS,
                                    lat=get_lat(x.Location1),
                                    lon=get_lon(x.Location1),
                                    VIOLATIONS=x.VIOLATIONS,
                                    TOTAL_CRIT_VIOLATIONS=x["TOTAL # CRITICAL VIOLATIONS"])).toDF()
# Please add your code here

Solution: Please copy and paste the following code into the previous cell where specified.<br>
nyvDF.registerTempTable("nyvDF")

A SQL query is defined that returns the restaurant name (facility), latitude, longitude, and number of critical violations.  The results are ordered by the number of critical violations in descending order and limited to the top 1000 records, of which the first 10 are displayed.

In [None]:
query = """
select 
    FACILITY, 
    cast(lat as float) as lat,
    cast(lon as float) as lon,
    cast(TOTAL_CRIT_VIOLATIONS as int) as Violations
from nyvDF 
order by cast(TOTAL_CRIT_VIOLATIONS as decimal(10,2)) desc
limit 1000
"""
nyv1000 = sqlContext.sql(query)
nyv1000.show(10)
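The ordering and limit logic of the query above can be tried without a Spark cluster. The following is a rough sketch that uses Python's built-in `sqlite3` module as a stand-in for Spark SQL, so the type names differ (`real` and `integer` instead of `float` and `int`). The three sample rows are invented for illustration, and the violation counts are stored as text here only to show why the cast in the `order by` clause matters.

```python
import sqlite3

# Hypothetical sample rows (not from the real dataset), stored as text.
rows = [
    ("DINER A", "42.65", "-73.75", "3"),
    ("CAFE B", "43.05", "-76.15", "12"),
    ("GRILL C", "40.71", "-74.01", "7"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table nyvDF "
    "(FACILITY text, lat text, lon text, TOTAL_CRIT_VIOLATIONS text)"
)
conn.executemany("insert into nyvDF values (?, ?, ?, ?)", rows)

# Same shape as the notebook query: cast, sort descending, keep the top N.
query = """
select
    FACILITY,
    cast(lat as real) as lat,
    cast(lon as real) as lon,
    cast(TOTAL_CRIT_VIOLATIONS as integer) as Violations
from nyvDF
order by cast(TOTAL_CRIT_VIOLATIONS as integer) desc
limit 2
"""
top = conn.execute(query).fetchall()
for row in top:
    print(row)
conn.close()
```

Without the cast, text values would sort lexicographically, so "7" would rank above "12". Sorting on the numeric cast puts the facility with 12 violations first.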

The latitude and longitude are mapped to a New York state map using Brunel visualization.  The color represents the number of violations as shown in the key.  

In [None]:
import brunel
nyv1000pan = nyv1000.toPandas()
%brunel map ('NY') + data('nyv1000pan') x(lon) y(lat) color(Violations) tooltip(FACILITY)

PixieDust also provides charting and visualization.  It is an open source Python library that works as an add-on to Jupyter notebooks to improve the user experience of working with data.  For example, if you hover over the lonely yellow dot in the middle of New York State in the map above, you can see that it is for 'CAMP KINGSLEY - CC'.  If you start typing the value 'camp' in the 'Search table' text field below, the record will be displayed.  Numerous visualizations are available, with map support planned for the future.  In addition, the data can be downloaded as a file, or stashed to Cloudant or Object Storage.

In [None]:
from pixiedust.display import *
display(nyv1000)