<a href="https://colab.research.google.com/github/dirtydupe/cisc_3140_Midterm/blob/master/CISC3140_Midterm_Notebook_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Joe Troia - Notebook 1
## Dataset: DOHMH New York City Restaurant Inspection Results

The objectives for this dataset are:

1. Calculate the number of restaurants per borough (graded and ungraded) and record the number of each grade attained per borough.
2. Use these counts to find the percentage of each grade per borough.
3. Find the percentage share of each borough per grade.
4. Calculate the average inspection score per borough (a numerical score in which lower is better).
5. Calculate the average number of violations per borough.

*Note that restaurants in the dataset can have multiple inspection dates and multiple violations recorded per inspection, so there will be several records for the same restaurant. Only the one grade and score recorded per inspection count towards the grade totals. Also, not every citation has a grade attached to it, so this needs to be taken into account when parsing the records.*
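To make objectives 2 and 3 concrete, here is a minimal sketch of the two percentages computed from a table of grade counts. The counts are made-up placeholder numbers, not results from the dataset, and the names `grade_counts`, `pct_within_borough`, and `share_of_grade` are hypothetical helpers that do not appear elsewhere in this notebook.

```python
# Hypothetical grade counts per borough (placeholder numbers, not real results).
grade_counts = {
    "Brooklyn": {"A": 80, "B": 15, "C": 5},
    "Queens": {"A": 60, "B": 30, "C": 10},
}

def pct_within_borough(counts, borough):
    """Objective 2: percentage of each grade among one borough's grades."""
    total = sum(counts[borough].values())
    return {g: 100 * n / total for g, n in counts[borough].items()}

def share_of_grade(counts, grade):
    """Objective 3: percentage share of each borough among one grade."""
    total = sum(c.get(grade, 0) for c in counts.values())
    return {b: 100 * c.get(grade, 0) / total for b, c in counts.items()}

print(pct_within_borough(grade_counts, "Brooklyn"))
print(share_of_grade(grade_counts, "B"))
```

The first function divides by a borough's own grade total, the second by the citywide total for one grade, so the same count table answers both questions. A real version would need to guard against a total of zero.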

To follow the specifications provided by Professor Chuang, I tackled many of the problems with a focus on list comprehensions, `lambda`, and the following Python functions:
* `map()`
* `reduce()`
* `filter()`
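As a quick refresher, the sketch below shows each of these tools on a small list of made-up scores (the values are illustrative only). In Python 3, `reduce()` lives in `functools`, and `map()` and `filter()` return lazy iterators, which is why they are wrapped in `list()` here.

```python
from functools import reduce

scores = [12, 7, 30, 9]  # made-up inspection scores

# map(): apply a function to every item.
doubled = list(map(lambda s: s * 2, scores))

# filter(): keep only the items that pass a test.
low = list(filter(lambda s: s < 14, scores))

# reduce(): fold all items into a single value, starting from 0.
total = reduce(lambda acc, s: acc + s, scores, 0)

# List comprehension: build a new list from an expression.
squares = [s * s for s in scores]

print(doubled, low, total, squares)
```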

### Inspection Grades:
*  **A** 
*  **B**
*  **C**
*  **P** - Grade Pending
*  **Z** - Grade Pending issued on re-opening following an initial inspection that resulted in a closure
*  **N** - Not Yet Graded

Importing required libraries:

In [0]:
import functools
import json
import urllib.request

The API endpoint of the data to be examined is at https://data.cityofnewyork.us/resource/43nn-pn8j.json

By default, this returns only 1,000 results, so we pass the query parameters `$limit=100000&$offset=0` so that the JSON returned contains many, but not all, of the elements in the dataset.

Colaboratory seems to hang when trying to process all 380,000+ elements in the dataset, so this compromise needs to be made.
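One possible way around the single large request is to walk the dataset in smaller pages using the same `$limit` and `$offset` parameters. The sketch below is an assumption about how that could look, not something this notebook actually runs: `page_url` and `fetch_pages` are hypothetical helpers, and the page size of 10,000 is an arbitrary choice.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://data.cityofnewyork.us/resource/43nn-pn8j.json"

def page_url(limit, offset):
    """Build the URL for one page of results."""
    # safe="$" keeps the literal "$" in the parameter names.
    query = urllib.parse.urlencode({"$limit": limit, "$offset": offset}, safe="$")
    return f"{BASE_URL}?{query}"

def fetch_pages(page_size=10000, max_records=100000):
    """Yield records page by page, stopping early when a page comes back short."""
    for offset in range(0, max_records, page_size):
        with urllib.request.urlopen(page_url(page_size, offset)) as response:
            page = json.loads(response.read())
        yield from page
        if len(page) < page_size:
            break

# Only the URL builder is exercised here; fetch_pages() would hit the network.
print(page_url(1000, 2000))
```

Because `fetch_pages` is a generator, records could be processed as they arrive instead of being held in one large list.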

In [0]:
url = "https://data.cityofnewyork.us/resource/43nn-pn8j.json?$limit=100000&$offset=0"

Opening the URL, reading the response body, and parsing the JSON string into the list `data`:

In [0]:
response = urllib.request.urlopen(url)
jsonObj = response.read()
data = json.loads(jsonObj)

Checking the total number of elements in the list

In [184]:
dataSize = len(data)
print(dataSize)

100000


Defining an Inspection class and a Restaurant class to hold accumulator fields and other data that will allow us to make the final calculations.

*Aside: "Restaurant" is one of those words that look weird when you spell it out, amirite?*

In [0]:
class Inspection():
    def __init__(self):
        self.date = ""
        self.grade = ""
        self.score = 0
        self.numViolations = 0

    def incrViolations(self):
        self.numViolations += 1


class Restaurant():
    def __init__(self):
        self.inspections = {}  # inspection_date -> Inspection
        self.camis = ""

Here I'm defining a Borough class whose fields and methods hold and aggregate the data for each borough, including a dictionary of the `Restaurant` objects belonging to it, keyed by `camis`.

In [0]:
class Borough():
    def __init__(self, name):
        self.name = name
        self.restaurants = {}
        self.allCitations = []
        
    def addRestaurant(self, camis, restaurant):
        self.restaurants[camis] = restaurant
        
    def isNewRestaurant(self, camis):
        return camis not in self.restaurants

These helper functions filter out ungraded citations so we have a clean list of graded records:

In [0]:
def isGraded(record):
    # A record counts as graded only if it carries a 'grade' field.
    return 'grade' in record


def accumulateGraded(records):
    # Keep only the graded records. The parameter is named `records`
    # so that it does not shadow the built-in list().
    return list(filter(isGraded, records))

Instantiating an object for each borough and putting them into the `boroughs` list 
* The lambda function is defined to call `Borough`'s constructor
* List comprehension is used to create the list of `Borough` objects

In [202]:
constructBoroughs = lambda x: Borough(x)

boroughs = ["Brooklyn", "Manhattan", "Queens", "Bronx", "Staten Island"]
boroughs = [constructBoroughs(b) for b in boroughs]

for b in boroughs:
    print(b.name)

Brooklyn
Manhattan
Queens
Bronx
Staten Island


Using `map()`, I call a function that itself calls `filter()` on the bulk data. This filters `data` based on the `boro` field. The resulting list is stored in the current borough object's `allCitations` field.

In [221]:
def buildCitationLists(borough):
    # Keep only the records whose 'boro' field matches this borough's name.
    citList = list(filter(lambda x: x['boro'] == borough.name, data))
    borough.allCitations = citList
    return borough

boroughs = list(map(buildCitationLists, boroughs))

print("CITATIONS TOTALS")
for b in boroughs:
    print(b.name + ":", len(b.allCitations))

CITATIONS TOTALS
Brooklyn: 25504
Manhattan: 39509
Queens: 22598
Bronx: 8978
Staten Island: 3383
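The approach above runs `filter()` over `data` once per borough. As an alternative, the grouping could be done in a single pass with `reduce()`. The sketch below illustrates the idea on made-up records; `add_to_group` and the `"Unknown"` fallback key are hypothetical, and using `.get()` for rows without a `boro` field is an assumption about how such rows should be handled.

```python
from functools import reduce

# Made-up records shaped like the dataset's rows (only the fields used here).
sample = [
    {"camis": "1", "boro": "Brooklyn"},
    {"camis": "2", "boro": "Queens"},
    {"camis": "3", "boro": "Brooklyn"},
    {"camis": "4"},  # a row with no 'boro' field
]

def add_to_group(groups, record):
    """Append the record to its borough's list, creating the list if needed."""
    groups.setdefault(record.get("boro", "Unknown"), []).append(record)
    return groups

# One pass over the records builds every borough's list at once.
by_boro = reduce(add_to_group, sample, {})
print({name: len(rows) for name, rows in by_boro.items()})
```

The trade-off is that the accumulator dictionary is mutated inside the reducer, which is less purely functional than five independent `filter()` calls but touches each record only once.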


Using `map()` again, the `restaurants` dictionary of each borough is built. A unique 'camis' id key maps to each unique Restaurant object. Restaurant objects are created for each new camis by testing if the key is present in that `Borough` object's `restaurants` dictionary.

Additionally, if the citation belongs to a new inspection, its details are recorded; if the inspection was already recorded in the `Restaurant` object, the number of violations is incremented when that line of data notes a violation.

In [220]:
def buildRestaurantDicts(record, borough):
    camis = record['camis']
    date = record['inspection_date']

    if borough.isNewRestaurant(camis):
        restaurant = Restaurant()
        restaurant.camis = camis
        borough.addRestaurant(camis, restaurant)

    inspections = borough.restaurants[camis].inspections

    if date in inspections:
        # Known inspection: this row only adds another violation, if any.
        if 'violation_code' in record:
            inspections[date].incrViolations()
    else:
        # First row seen for this inspection: record its details.
        inspection = Inspection()
        inspection.date = date

        if 'grade' in record:
            inspection.grade = record['grade']

        if 'score' in record:
            inspection.score = record['score']

        if 'violation_code' in record:
            inspection.numViolations = 1

        inspections[date] = inspection


def buildBoroughObjects(borough):
    for record in borough.allCitations:
        buildRestaurantDicts(record, borough)

    return borough

boroughs = list(map(buildBoroughObjects, boroughs))

print("RESTAURANTS PER BOROUGH")
print("Brooklyn:", len(boroughs[0].restaurants))
print("Manhattan:", len(boroughs[1].restaurants))
print("Queens:", len(boroughs[2].restaurants))
print("Bronx:", len(boroughs[3].restaurants))
print("Staten Island:", len(boroughs[4].restaurants))

RESTAURANTS PER BOROUGH
Brooklyn: 5844
Manhattan: 9233
Queens: 5286
Bronx: 2120
Staten Island: 843


Find the number of grades per borough:
* The total number of grades given
* The number of each grade attained

Total the inspection scores and find the average for each borough.

Sum the number of violations and divide by the number of restaurants for each borough.

In [215]:
gradedRecords = accumulateGraded(data)
totalGradedRecords = len(gradedRecords)
print(totalGradedRecords)

50740
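The per-borough calculations listed above could be sketched as follows for a single borough. This is an illustration under stated assumptions, not the notebook's final code: `Insp` is a hypothetical stand-in for the `Inspection` class, the sample values are made up, and the score average is taken over every inspection, so inspections with no recorded score would need to be excluded in a real run.

```python
from collections import Counter, namedtuple
from functools import reduce

# Stand-in for the notebook's Inspection objects (hypothetical sample values).
Insp = namedtuple("Insp", ["grade", "score", "numViolations"])

# One borough's restaurants: camis -> list of inspections.
restaurants = {
    "1001": [Insp("A", 10, 2), Insp("", 25, 4)],
    "1002": [Insp("B", 20, 3)],
}

# Flatten every restaurant's inspections into one list.
inspections = reduce(lambda acc, insps: acc + insps, restaurants.values(), [])

# Count each grade, skipping inspections that carry no grade.
graded = filter(lambda i: i.grade != "", inspections)
grade_counts = Counter(map(lambda i: i.grade, graded))

# Average score per inspection and average violations per restaurant.
avg_score = sum(map(lambda i: i.score, inspections)) / len(inspections)
total_violations = reduce(lambda acc, i: acc + i.numViolations, inspections, 0)
avg_violations = total_violations / len(restaurants)

print(dict(grade_counts), avg_score, avg_violations)
```

Mapping this logic over the `boroughs` list would give one set of figures per borough.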


It was a challenge for me to break out of the object-oriented way of thinking. In writing the code, I felt like I was iterating through the original dataset many more times than I would have if I'd followed a programming style that I was more familiar with.  I found myself breaking the flow of the program into more numerous, yet admittedly more succinct, parts in order to fulfill the spec.