# Capstone Project - The Battle of the Neighborhoods
### Applied Data Science Capstone by IBM/Coursera
### Vivek Tiwari

This Jupyter Notebook contains all the code and brief comments of the Coursera Capstone project.

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Datasets](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)


## Introduction <a name="introduction"></a>

The pandemic has made hygiene a fundamental concern, and people are extra cautious about which restaurants they visit. In this project, we set out to find the best, most hygienic restaurants in San Francisco's neighbourhoods.

## Datasets <a name="data"></a>

1. A **GeoJSON file with the names and boundaries of 41 San Francisco neighborhoods (GeoJSON)**
2. **Foursquare APIs (URI)**
3. The City of San Francisco Health Department's **hygiene inspection program (CSV)**


In [316]:
pip install sodapy

Note: you may need to restart the kernel to use updated packages.


In [317]:
pip install fuzzy_pandas

Note: you may need to restart the kernel to use updated packages.


In [318]:
import pandas as pd
import numpy as np
import requests
import folium
import fuzzy_pandas as fpd
from sodapy import Socrata



<h3> Importing the Datasets </h3>

The neighborhoods and hygiene inspection datasets are easily accessible through the Socrata API provided by the San Francisco government.

In [319]:
client = Socrata("data.sfgov.org", None)
results = client.get("pyih-qa8i", limit=60000)
hygiene_df=pd.DataFrame.from_records(results)

nhoods=client.get("743h-p4bq", limit=60000) # We will use this JSON file later on to map out San Francisco's neighborhoods
nhoods_df=pd.DataFrame.from_records(nhoods)



In [320]:
print(hygiene_df.shape)
print(nhoods_df.shape)

(53973, 23)
(92, 4)


In [321]:
hygiene_df.head()

Unnamed: 0,business_id,business_name,business_address,business_city,business_state,business_phone_number,inspection_id,inspection_date,inspection_type,business_postal_code,...,risk_category,business_latitude,business_longitude,business_location,:@computed_region_fyvs_ahh9,:@computed_region_p5aj_wyqh,:@computed_region_rxqg_mtj9,:@computed_region_yftq_j783,:@computed_region_bh8s_q3mv,:@computed_region_ajp5_b2md
0,101192,Cochinita #2,2 Marina Blvd Fort Mason,San Francisco,CA,14150429222.0,101192_20190606,2019-06-06T00:00:00.000,New Ownership,,...,,,,,,,,,,
1,97975,BREADBELLY,1408 Clement St,San Francisco,CA,14157240859.0,97975_20190725,2019-07-25T00:00:00.000,Routine - Unscheduled,94118.0,...,Moderate Risk,,,,,,,,,
2,92982,Great Gold Restaurant,3161 24th St.,San Francisco,CA,,92982_20170912,2017-09-12T00:00:00.000,New Ownership,94110.0,...,,,,,,,,,,
3,101389,HOMAGE,214 CALIFORNIA ST,San Francisco,CA,14154878161.0,101389_20190625,2019-06-25T00:00:00.000,New Construction,94111.0,...,,,,,,,,,,
4,85986,Pronto Pizza,798 Eddy St,San Francisco,CA,,85986_20161011,2016-10-11T00:00:00.000,New Ownership,94109.0,...,High Risk,,,,,,,,,


We will now use the Foursquare API to search for restaurants in each neighborhood.

In [322]:
client_id = 'U0BHFR2CGBOER0NS2E3LDULEVT032SXA3KVWLR2U1RTQBJCV' # your Foursquare ID
client_secret = 'WRRQIHUGH45BSIKD4HCNE5ZXRNAK3E1JJNIXVNRVBNYLZYEC' # your Foursquare Secret
version = '20180605' # Foursquare API version
category= '4d4b7105d754a06374d81259' #Food Category
limit=1000


print('Your credentials:')
print('CLIENT_ID: ' + client_id)
print('CLIENT_SECRET:' + client_secret)

Your credentials:
CLIENT_ID: U0BHFR2CGBOER0NS2E3LDULEVT032SXA3KVWLR2U1RTQBJCV
CLIENT_SECRET:WRRQIHUGH45BSIKD4HCNE5ZXRNAK3E1JJNIXVNRVBNYLZYEC
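
Printing the credentials as above leaves them in the saved notebook. A minimal sketch of one alternative, assuming the keys are exported as environment variables before the kernel starts. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helpers `load_foursquare_credentials` and `mask` are hypothetical and not part of the original project:

```python
import os


def load_foursquare_credentials(env=None):
    """Return (client_id, client_secret) read from environment variables.

    The variable names used here are illustrative assumptions.
    """
    env = os.environ if env is None else env
    try:
        return env['FOURSQUARE_CLIENT_ID'], env['FOURSQUARE_CLIENT_SECRET']
    except KeyError as missing:
        raise RuntimeError('Missing environment variable: {}'.format(missing)) from None


def mask(secret, visible=4):
    """Show only the first few characters of a secret when printing."""
    return secret[:visible] + '*' * max(len(secret) - visible, 0)


# Demonstration with a stand-in mapping instead of the real environment
demo_env = {'FOURSQUARE_CLIENT_ID': 'ABCD1234', 'FOURSQUARE_CLIENT_SECRET': 'WXYZ5678'}
demo_id, demo_secret = load_foursquare_credentials(demo_env)
print('CLIENT_ID:', mask(demo_id))
```

Calling `load_foursquare_credentials()` without arguments reads from `os.environ`, so the notebook itself never contains the keys.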


In [323]:
def getVenuesLoc(names, radius=600):
    """Query Foursquare for food venues near each neighbourhood name."""
    venues_list = []
    unexplored_nhoods = []
    explored_nhoods = []
    for name in names:
        # build the API request; requests URL-encodes the query parameters
        params = {
            'client_id': client_id,
            'client_secret': client_secret,
            'v': version,
            'near': '{},San Francisco, CA'.format(name),
            'categoryId': category,
            'radius': radius,
            'limit': limit,
        }
        results = requests.get('https://api.foursquare.com/v2/venues/explore', params=params).json()

        # skip neighbourhoods for which the API reports an error
        if 'errorType' in results['meta']:
            print("Couldn't get venues from:", name)
            unexplored_nhoods.append(name)
            continue

        print(name)
        explored_nhoods.append(name)
        items = results['response']['groups'][0]['items']
        # keep only the relevant information for each nearby venue
        venues_list.extend(
            (name,
             v['venue']['name'],
             v['venue']['location']['lat'],
             v['venue']['location']['lng'],
             v['venue']['categories'][0]['name']) for v in items)

    # build the DataFrame once, after all requests have completed;
    # this also works when no neighbourhood could be explored
    nearby_venues = pd.DataFrame(venues_list, columns=['Neighbourhood',
                                                       'Venue',
                                                       'Venue_Latitude',
                                                       'Venue_Longitude',
                                                       'Venue_Category'])
    return nearby_venues, explored_nhoods, unexplored_nhoods
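
The neighbourhood name sent in the `near` parameter contains spaces and commas, which have to be percent-encoded in the query string. A short sketch, using only the standard library, of how such a URL can be assembled. The helper name `build_explore_url` is hypothetical, and the credentials passed in are placeholders:

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'


def build_explore_url(name, client_id, client_secret, version, category,
                      radius=600, limit=1000):
    """Return the explore URL for one neighbourhood with encoded parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'near': '{},San Francisco, CA'.format(name),
        'categoryId': category,
        'radius': radius,
        'limit': limit,
    }
    # urlencode turns spaces into '+' and commas into '%2C'
    return EXPLORE_ENDPOINT + '?' + urlencode(params)


# 'ID' and 'SECRET' are placeholders, not real credentials
url = build_explore_url('Alamo Square', 'ID', 'SECRET',
                        '20180605', '4d4b7105d754a06374d81259')
print(url)
```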

Because some neighborhoods are grouped under a single name, we split them apart before applying the function.

In [324]:
nhood_names = []
for name in nhoods_df['nbrhood']:
    if '/' in name:
        # grouped entry such as "A / B": split on the first slash only
        first, second = name.split('/', 1)
        nhood_names.append(first.strip())
        nhood_names.append(second.strip())
    else:
        nhood_names.append(name)
    

In [325]:
sf_venues, explored_nhoods, unexplored_nhoods = getVenuesLoc(nhood_names)

Alamo Square
Anza Vista
Balboa Terrace
Couldn't get venues from: Bayview
Bernal Heights
Buena Vista Park
Ashbury Heights
Couldn't get venues from: Central Richmond
Central Sunset
Clarendon Heights
Couldn't get venues from: Corona Heights
Cow Hollow
Crocker Amazon
Couldn't get venues from: Diamond Heights
Downtown
Duboce Triangle
Couldn't get venues from: Eureka Valley
Couldn't get venues from: Dolores Heights
Excelsior
Financial District
Couldn't get venues from: Barbary Coast
Couldn't get venues from: Yerba Buena
Forest Hill
Couldn't get venues from: Forest Hills Extension
Forest Knolls
Glen Park
Golden Gate Heights
Golden Gate Park
Haight Ashbury
Hayes Valley
Hunters Point
Ingleside
Ingleside Heights
Ingleside Terrace
Couldn't get venues from: Inner Mission
Inner Parkside
Couldn't get venues from: Inner Richmond
Inner Sunset
Jordan Park
Laurel Heights
Couldn't get venues from: Lake Street
Monterey Heights
Couldn't get venues from: Lake Shore
Lakeside
Lone Mountain
Lower Pacific Heigh

We now need to look up each restaurant gathered from the Foursquare API in the hygiene dataframe in order to attach its hygiene score and status. Before that, we clean up `hygiene_df`, keeping only the columns we need and the rows that contain an inspection score.

In [326]:
hygiene_df.columns

Index(['business_id', 'business_name', 'business_address', 'business_city',
       'business_state', 'business_phone_number', 'inspection_id',
       'inspection_date', 'inspection_type', 'business_postal_code',
       'inspection_score', 'violation_id', 'violation_description',
       'risk_category', 'business_latitude', 'business_longitude',
       'business_location', ':@computed_region_fyvs_ahh9',
       ':@computed_region_p5aj_wyqh', ':@computed_region_rxqg_mtj9',
       ':@computed_region_yftq_j783', ':@computed_region_bh8s_q3mv',
       ':@computed_region_ajp5_b2md'],
      dtype='object')

In [327]:
hygiene_df=hygiene_df[['business_name','business_address','inspection_date', 'inspection_type','violation_description', 'risk_category','inspection_score' ]]
hygiene_df = hygiene_df[hygiene_df['inspection_score'].notna()]
hygiene_df

Unnamed: 0,business_name,business_address,inspection_date,inspection_type,violation_description,risk_category,inspection_score
1,BREADBELLY,1408 Clement St,2019-07-25T00:00:00.000,Routine - Unscheduled,Inadequately cleaned or sanitized food contact...,Moderate Risk,96
7,Fools Errand,639 Divisadero St A,2019-03-27T00:00:00.000,Routine - Unscheduled,Inadequately cleaned or sanitized food contact...,Moderate Risk,84
8,MoBowL,428 11th St,2017-04-29T00:00:00.000,Routine - Unscheduled,Moderate risk food holding temperature,Moderate Risk,94
11,VICTOR'S,210 TOWNSEND St,2018-10-30T00:00:00.000,Routine - Unscheduled,Improper storage use or identification of toxi...,Low Risk,71
12,"New Garden Restaurant, Inc.",716 Kearny St,2019-04-01T00:00:00.000,Routine - Unscheduled,Improper or defective plumbing,Low Risk,85
...,...,...,...,...,...,...,...
53967,7 Eleven #2366-35722A,4850 Geary Blvd,2018-09-19T00:00:00.000,Routine - Unscheduled,Inadequate and inaccessible handwashing facili...,Moderate Risk,84
53968,Snowbird Coffee,1352 A 9th Ave,2019-04-11T00:00:00.000,Routine - Unscheduled,Wiping cloths not clean or properly stored or ...,Low Risk,94
53969,Buffalo Kitchen,107 Leland Ave,2019-04-17T00:00:00.000,Routine - Unscheduled,Foods not protected from contamination,Moderate Risk,75
53970,BUNN MIKE,300 DE HARO ST,2019-03-21T00:00:00.000,Routine - Unscheduled,Inadequate and inaccessible handwashing facili...,Moderate Risk,84


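Before we can map out the locations of each restaurant, we keep only the most recent inspection of each business and match the inspections to the Foursquare venues by name.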

In [328]:
hygiene_df=hygiene_df.sort_values('inspection_date', ascending=False).drop_duplicates(subset='business_name', keep='first')
hygiene_df

Unnamed: 0,business_name,business_address,inspection_date,inspection_type,violation_description,risk_category,inspection_score
3788,Frisco Fried,5176 03rd St,2019-10-03T00:00:00.000,Routine - Unscheduled,Unclean or degraded floors walls or ceilings,Low Risk,92
6775,GENEVA STEAK HOUSE,5130 MISSION ST,2019-10-03T00:00:00.000,Routine - Unscheduled,Improper thawing methods,Moderate Risk,92
1041,Tokyo Express,160 Spear St Lobby ID,2019-10-03T00:00:00.000,Routine - Unscheduled,High risk food holding temperature,High Risk,87
3732,5A5 Steak Lounge,244 Jackson St,2019-10-03T00:00:00.000,Routine - Unscheduled,Noncompliance with HAACP plan or variance,Moderate Risk,81
2600,SHERIDAN ELEMENTARY SCHOOL,431 CAPITOL Ave,2019-10-03T00:00:00.000,Routine - Unscheduled,Inadequate and inaccessible handwashing facili...,Moderate Risk,92
...,...,...,...,...,...,...,...
28072,Sally's Restaurant and Deli,300 De Haro St #332,2016-10-06T00:00:00.000,Routine - Unscheduled,Moderate risk food holding temperature,Moderate Risk,71
30846,Way To Life Foods,1 United Nations Plaza,2016-10-05T00:00:00.000,Routine - Unscheduled,,,100
29937,Hey Hey Gourmet,1 United Nations Plaza,2016-10-05T00:00:00.000,Routine - Unscheduled,,,100
26965,Christine's Sausage,1 United Nations Plaza,2016-10-05T00:00:00.000,Routine - Unscheduled,,,100
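
The cell above relies on the fact that ISO 8601 timestamps sort chronologically even as strings, so sorting in descending order and keeping the first row per `business_name` retains the latest inspection. A toy sketch of that step; the three rows below are made up for illustration and do not come from the inspection data:

```python
import pandas as pd

# Made-up rows: "Cafe A" was inspected twice, "Deli B" once
toy = pd.DataFrame({
    'business_name': ['Cafe A', 'Cafe A', 'Deli B'],
    'inspection_date': ['2017-05-01T00:00:00.000',
                        '2019-03-10T00:00:00.000',
                        '2018-08-21T00:00:00.000'],
    'inspection_score': ['80', '95', '88'],
})

# Newest inspection first, then keep one row per business
latest = (toy.sort_values('inspection_date', ascending=False)
             .drop_duplicates(subset='business_name', keep='first'))

scores = dict(zip(latest['business_name'], latest['inspection_score']))
print(scores)
```

Because the duplicates are dropped on the name alone, two different businesses that share a name would be collapsed into one row.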


In [344]:
sf_venues=fpd.fuzzy_merge(sf_venues, hygiene_df,
                        keep='all',
                        left_on=['Venue'],
                        right_on=['business_name'],
                        method='metaphone',
                        ignore_nonalpha=True,
                        ignore_nonlatin=True,
                        ignore_case=True,
                        join='inner')

sf_venues

Unnamed: 0,Neighbourhood,Venue,Venue_Latitude,Venue_Longitude,Venue_Category,business_name,business_address,inspection_date,inspection_type,violation_description,risk_category,inspection_score
0,Alamo Square,Little Star Pizza,37.777489,-122.438281,Pizza Place,Little Star Pizza,846 Divisadero,2018-04-24T00:00:00.000,Routine - Unscheduled,,,100
1,Alamo Square,Brenda's Meat & Three,37.778265,-122.438584,Southern / Soul Food Restaurant,Brendas Meat & Three,919 DIVISADERO ST,2019-03-13T00:00:00.000,Routine - Unscheduled,Inadequately cleaned or sanitized food contact...,Moderate Risk,92
2,Alamo Square,The Mill,37.776425,-122.437970,Bakery,The Mill,736 DIVISADERO St,2019-04-11T00:00:00.000,Routine - Unscheduled,Moderate risk vermin infestation,Moderate Risk,88
3,Alamo Square,Jane the Bakery,37.783797,-122.434283,Bakery,Jane the Bakery,1875 Geary Blvd,2019-07-03T00:00:00.000,Routine - Unscheduled,Improper storage of equipment utensils or linens,Low Risk,87
4,Alamo Square,The Progress,37.783745,-122.432972,American Restaurant,The Progress,1525 Fillmore St,2019-02-14T00:00:00.000,Routine - Unscheduled,Inadequately cleaned or sanitized food contact...,Moderate Risk,90
...,...,...,...,...,...,...,...,...,...,...,...,...
5024,Nob Hill,Osso Steakhouse,37.791447,-122.413530,Steakhouse,Osso Steakhouse,1177 California St,2019-06-03T00:00:00.000,Routine - Unscheduled,Inadequately cleaned or sanitized food contact...,Moderate Risk,96
5025,Nob Hill,Batter Bakery,37.789551,-122.420776,Bakery,Batter Bakery,1517 Pine St,2018-08-21T00:00:00.000,Routine - Unscheduled,Wiping cloths not clean or properly stored or ...,Low Risk,98
5026,Nob Hill,Nobhill Pizza & Shawerma,37.790767,-122.419747,Pizza Place,Nobhill Pizza & Shawerma,1534 California St,2019-09-23T00:00:00.000,Routine - Unscheduled,High risk food holding temperature,High Risk,93
5027,Nob Hill,Kasa Indian Eatery,37.789655,-122.420449,Indian Restaurant,Kasa Indian Eatery,4001 18th St,2019-09-23T00:00:00.000,Routine - Unscheduled,Low risk vermin infestation,Low Risk,86
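
The merge above uses a phonetic method and ignores case and non-alphabetic characters, which is why `Brenda's Meat & Three` is matched to `Brendas Meat & Three`. The sketch below is not the metaphone algorithm and does not reproduce what `fuzzy_pandas` does internally. It only illustrates the normalisation idea with the standard library, and `normalise_name` is a hypothetical helper that also drops spaces:

```python
import re


def normalise_name(name):
    """Lower-case a name and drop everything except the letters a-z."""
    return re.sub(r'[^a-z]', '', name.lower())


venue = "Brenda's Meat & Three"      # name as returned by Foursquare
business = "Brendas Meat & Three"    # name in the inspection data

# Both names reduce to the same key once punctuation and case are ignored
print(normalise_name(venue) == normalise_name(business))
```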


In [346]:
sf_center = [37.7749, -122.4194]
sf_map = folium.Map(location=sf_center, zoom_start=13)
folium.Marker(sf_center, popup='City Center').add_to(sf_map)
color = 'blue'  # define the marker colour before the loop uses it
for name, lat, lng in zip(sf_venues.Venue, sf_venues.Venue_Latitude, sf_venues.Venue_Longitude):
    folium.CircleMarker([lat, lng], radius=3, color=color, fill=True, fill_color=color, fill_opacity=1).add_to(sf_map)
sf_map
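
The map above draws every restaurant in the same colour. One possible refinement, sketched under assumptions, is to colour each marker by its `inspection_score`. The function `score_to_color` and the cut-offs of 90 and 80 are hypothetical choices, not values taken from the inspection program:

```python
def score_to_color(score, good=90, fair=80):
    """Map an inspection score to a marker colour (thresholds are illustrative)."""
    value = float(score)  # scores may arrive as strings
    if value >= good:
        return 'green'
    if value >= fair:
        return 'orange'
    return 'red'


for s in ['100', '87', '71']:
    print(s, score_to_color(s))
```

The returned string could be passed as `color` and `fill_color` to `folium.CircleMarker` inside the loop.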