<h1 align=center><font size = 5>IBM Data Science Capstone Project</font></h1>
<h3 align=center><font size = 4>Opening an Italian restaurant in Brooklyn, New York</font></h3>
<h4 align=center><font size = 2>(Finding optimal neighborhood in Brooklyn for new restaurant)</font></h4>
<h4 align=center><font size = 2>By: Zdravko Radulovic</font></h4>
<h4 align=center><font size = 2>December 2019</font></h4>



## Table of Contents





1. Download and Explore Dataset

2. Explore Neighborhoods in Brooklyn

3. Analyze Each Neighborhood

4. Cluster Neighborhoods

5. Examine Clusters



Before we download the data, we need to import all required libraries.


In [46]:
#!conda install -c conda-forge folium

In [21]:
import requests
import pandas as pd 
import numpy as np
import random 
import folium
import json
from IPython.display import Image 
from IPython.core.display import HTML
from geopy.geocoders import Nominatim 
from pandas import json_normalize  # the pandas.io.json import path is deprecated
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors
print('Libraries imported.')

Libraries imported.


## 1. Download and Explore Dataset

The dataset is available at https://cocl.us/new_york_dataset, so we download that file first:

In [4]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
print('Data downloaded!')

Data downloaded!


In [5]:
with open('newyork_data.json') as json_data:
    ny = json.load(json_data)

In [6]:

neighborhoods_data = ny['features']

In [9]:
# quick look at the data
neighborhoods_data[0]

{'type': 'Feature',
 'id': 'nyu_2451_34572.1',
 'geometry': {'type': 'Point',
  'coordinates': [-73.84720052054902, 40.89470517661]},
 'geometry_name': 'geom',
 'properties': {'name': 'Wakefield',
  'stacked': 1,
  'annoline1': 'Wakefield',
  'annoline2': None,
  'annoline3': None,
  'annoangle': 0.0,
  'borough': 'Bronx',
  'bbox': [-73.84720052054902,
   40.89470517661,
   -73.84720052054902,
   40.89470517661]}}

In [10]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

In [11]:
# fill dataframe with data
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

# DataFrame.append was removed in pandas 2.0, so build the frame in one step
neighborhoods = pd.DataFrame(rows, columns=column_names)

In [12]:
# Quick examination of data
neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


In [13]:
neighborhoods.shape

(306, 4)
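
The same table can probably be built without an explicit loop by using `pandas.json_normalize`, which flattens nested dictionaries into dotted column names. This is only a sketch under the assumption that every feature has the layout shown in the quick look above; `sample_features` and `neighborhoods_alt` are hypothetical names, and the two records are trimmed stand-ins for `ny['features']` (the Bay Ridge coordinates are the rounded values from the table below).

```python
import pandas as pd

# Hypothetical stand-in for ny['features'], trimmed to the fields used here.
sample_features = [
    {'geometry': {'type': 'Point',
                  'coordinates': [-73.84720052054902, 40.89470517661]},
     'properties': {'name': 'Wakefield', 'borough': 'Bronx'}},
    {'geometry': {'type': 'Point',
                  'coordinates': [-74.030621, 40.625801]},
     'properties': {'name': 'Bay Ridge', 'borough': 'Brooklyn'}},
]

# Nested dicts become dotted columns such as 'properties.name';
# list values (the coordinates) are kept as lists.
flat = pd.json_normalize(sample_features)

neighborhoods_alt = pd.DataFrame({
    'Borough': flat['properties.borough'],
    'Neighborhood': flat['properties.name'],
    # GeoJSON stores coordinates as [longitude, latitude]
    'Latitude': flat['geometry.coordinates'].map(lambda c: c[1]),
    'Longitude': flat['geometry.coordinates'].map(lambda c: c[0]),
})
print(neighborhoods_alt)
```

Building the frame in one step avoids growing it row by row, which tends to be slow for larger inputs.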

In [14]:
# filter only Brooklyn neighborhoods

brooklyn_data = neighborhoods[neighborhoods['Borough'] == 'Brooklyn'].reset_index(drop=True)
brooklyn_data.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Brooklyn,Bay Ridge,40.625801,-74.030621
1,Brooklyn,Bensonhurst,40.611009,-73.99518
2,Brooklyn,Sunset Park,40.645103,-74.010316
3,Brooklyn,Greenpoint,40.730201,-73.954241
4,Brooklyn,Gravesend,40.59526,-73.973471


In [15]:
brooklyn_data.shape

(70, 4)

#### Use the geopy library to get the latitude and longitude of Brooklyn

In [23]:
from geopy.exc import GeocoderUnavailable

address = 'Brooklyn, NY'

geolocator = Nominatim(user_agent="ny_explorer")
try:
    location = geolocator.geocode(address)
    latitude = location.latitude
    longitude = location.longitude
    # print inside the try block so a failed lookup does not hit undefined names
    print('The geographical coordinates of Brooklyn are {}, {}.'.format(latitude, longitude))
except GeocoderUnavailable as error:
    print("Service not available")

The geographical coordinates of Brooklyn are 40.6501038, -73.9495823.


In [24]:
# create map of Brooklyn using latitude and longitude values
map_brooklyn = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(brooklyn_data['Latitude'], brooklyn_data['Longitude'], brooklyn_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_brooklyn)  
    
map_brooklyn

In [25]:
# save the map as HTML file
map_brooklyn.save('map_brooklyn.html')

####  Foursquare Credentials and Version

In [26]:
CLIENT_ID = 'JCLOQGVPA0KRLEIJ2DRGABAOW4P4CDZXW0WTWHJT3LEDSKJC' # your Foursquare ID
CLIENT_SECRET = 'QLGQN5VXHDIXUTPDJMZTQZ3RFDVAOMA3ZNVG20FQZHZ4H5CY' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: JCLOQGVPA0KRLEIJ2DRGABAOW4P4CDZXW0WTWHJT3LEDSKJC
CLIENT_SECRET:QLGQN5VXHDIXUTPDJMZTQZ3RFDVAOMA3ZNVG20FQZHZ4H5CY
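
Hard-coding and printing API credentials makes them visible to anyone who reads the notebook. One possible alternative, sketched here as an assumption rather than as part of the original workflow, is to read them from environment variables and print only a masked form. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`, the helper functions, and the demo values are all hypothetical.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Read credentials from environment variables (hypothetical names)."""
    client_id = env.get('FOURSQUARE_CLIENT_ID')
    client_secret = env.get('FOURSQUARE_CLIENT_SECRET')
    if not client_id or not client_secret:
        raise RuntimeError(
            'Set FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET first.')
    return client_id, client_secret


def mask(secret, visible=4):
    """Keep the first few characters and replace the rest with asterisks."""
    return secret[:visible] + '*' * (len(secret) - visible)


# Demonstration with placeholder values, not real credentials.
demo_env = {'FOURSQUARE_CLIENT_ID': 'EXAMPLEID123',
            'FOURSQUARE_CLIENT_SECRET': 'EXAMPLESECRET456'}
demo_id, demo_secret = load_foursquare_credentials(demo_env)
print('CLIENT_ID: ' + mask(demo_id))
print('CLIENT_SECRET: ' + mask(demo_secret))
```

In a real session the function would be called without arguments so that it reads `os.environ`.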


## 2. Explore Neighborhoods in Brooklyn

In [27]:
def getNearbyVenues(names, latitudes, longitudes,radius=500, limit=100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            limit)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
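
`getNearbyVenues` indexes straight into the response, so a reply without groups or a venue without categories would raise an exception partway through the loop. The sketch below shows one way the parsing step could be made more tolerant. It assumes the response layout that the function above already relies on; `extract_venue_rows`, the `'Uncategorized'` label, and `fake_payload` are hypothetical and not part of the Foursquare API.

```python
def extract_venue_rows(payload, name, lat, lng):
    """Turn one explore response (already parsed from JSON) into flat rows."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Assumption: label venues without a category instead of failing.
        category = categories[0]['name'] if categories else 'Uncategorized'
        rows.append((name, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows


# Hand-written payload in the shape read above; not a real API response.
fake_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Bagel Boy',
               'location': {'lat': 40.627896, 'lng': -74.029335},
               'categories': [{'name': 'Bagel Shop'}]}},
    {'venue': {'name': 'Venue Without Category',
               'location': {'lat': 40.62, 'lng': -74.03},
               'categories': []}},
]}]}}

rows = extract_venue_rows(fake_payload, 'Bay Ridge', 40.625801, -74.030621)
print(rows)
```

Separating parsing from the HTTP call also makes the logic testable without network access or credentials.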

In [28]:
# creating new dataframe called brooklyn_venues

brooklyn_venues = getNearbyVenues(names=brooklyn_data['Neighborhood'],
                                   latitudes=brooklyn_data['Latitude'],
                                   longitudes=brooklyn_data['Longitude']
                                  )

Bay Ridge
Bensonhurst
Sunset Park
Greenpoint
Gravesend
Brighton Beach
Sheepshead Bay
Manhattan Terrace
Flatbush
Crown Heights
East Flatbush
Kensington
Windsor Terrace
Prospect Heights
Brownsville
Williamsburg
Bushwick
Bedford Stuyvesant
Brooklyn Heights
Cobble Hill
Carroll Gardens
Red Hook
Gowanus
Fort Greene
Park Slope
Cypress Hills
East New York
Starrett City
Canarsie
Flatlands
Mill Island
Manhattan Beach
Coney Island
Bath Beach
Borough Park
Dyker Heights
Gerritsen Beach
Marine Park
Clinton Hill
Sea Gate
Downtown
Boerum Hill
Prospect Lefferts Gardens
Ocean Hill
City Line
Bergen Beach
Midwood
Prospect Park South
Georgetown
East Williamsburg
North Side
South Side
Ocean Parkway
Fort Hamilton
Ditmas Park
Wingate
Rugby
Remsen Village
New Lots
Paerdegat Basin
Mill Basin
Fulton Ferry
Vinegar Hill
Weeksville
Broadway Junction
Dumbo
Homecrest
Highland Park
Madison
Erasmus


In [29]:
#check size of new dataframe

print(brooklyn_venues.shape)
brooklyn_venues.head()

(2793, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Bay Ridge,40.625801,-74.030621,Pilo Arts Day Spa and Salon,40.624748,-74.030591,Spa
1,Bay Ridge,40.625801,-74.030621,Bagel Boy,40.627896,-74.029335,Bagel Shop
2,Bay Ridge,40.625801,-74.030621,Cocoa Grinder,40.623967,-74.030863,Juice Bar
3,Bay Ridge,40.625801,-74.030621,Pegasus Cafe,40.623168,-74.031186,Breakfast Spot
4,Bay Ridge,40.625801,-74.030621,Ho' Brah Taco Joint,40.62296,-74.031371,Taco Place


In [30]:
# grouping neighborhoods and counting venues for each neighborhood

brooklyn_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Bath Beach,46,46,46,46,46,46
Bay Ridge,86,86,86,86,86,86
Bedford Stuyvesant,25,25,25,25,25,25
Bensonhurst,33,33,33,33,33,33
Bergen Beach,6,6,6,6,6,6
Boerum Hill,90,90,90,90,90,90
Borough Park,20,20,20,20,20,20
Brighton Beach,45,45,45,45,45,45
Broadway Junction,12,12,12,12,12,12
Brooklyn Heights,100,100,100,100,100,100


In [31]:
# check unique categories
print('There are {} unique categories.'.format(len(brooklyn_venues['Venue Category'].unique())))

There are 284 unique categories.


## 3. Analyze Each Neighborhood

In [32]:
# one hot encoding
brooklyn_onehot = pd.get_dummies(brooklyn_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
brooklyn_onehot['Neighborhood'] = brooklyn_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [brooklyn_onehot.columns[-1]] + list(brooklyn_onehot.columns[:-1])
brooklyn_onehot = brooklyn_onehot[fixed_columns]

# adding neighborhood latitude and longitude
brooklyn_onehot['Latitude'] = brooklyn_venues['Neighborhood Latitude']
brooklyn_onehot['Longitude'] = brooklyn_venues['Neighborhood Longitude']
brooklyn_onehot.head()

Unnamed: 0,Yoga Studio,Accessories Store,Adult Boutique,African Restaurant,American Restaurant,Antique Shop,Arepa Restaurant,Argentinian Restaurant,Art Gallery,Arts & Crafts Store,...,Video Store,Vietnamese Restaurant,Waterfront,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Latitude,Longitude
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,40.625801,-74.030621
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,40.625801,-74.030621
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,40.625801,-74.030621
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,40.625801,-74.030621
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,40.625801,-74.030621


In [33]:
brooklyn_onehot.shape

(2793, 286)

Group rows by neighborhood and take the mean of each one-hot column, which gives the share of a neighborhood's venues that fall into each category.

In [34]:
brooklyn_grouped = brooklyn_onehot.groupby('Neighborhood').mean().reset_index()
brooklyn_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,Adult Boutique,African Restaurant,American Restaurant,Antique Shop,Arepa Restaurant,Argentinian Restaurant,Art Gallery,...,Video Store,Vietnamese Restaurant,Waterfront,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Latitude,Longitude
0,Bath Beach,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,0.00,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,40.599519,-73.998752
1,Bay Ridge,0.000000,0.000000,0.00,0.000000,0.034884,0.000000,0.00,0.000000,0.000000,...,0.000000,0.011628,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,40.625801,-74.030621
2,Bedford Stuyvesant,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,0.00,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.040000,0.040000,0.000000,0.000000,40.687232,-73.941785
3,Bensonhurst,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,0.00,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,40.611009,-73.995180
4,Bergen Beach,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,0.00,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,40.615150,-73.898556
5,Boerum Hill,0.022222,0.000000,0.00,0.000000,0.011111,0.011111,0.00,0.000000,0.011111,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.011111,0.000000,0.000000,40.685683,-73.983748
6,Borough Park,0.000000,0.000000,0.00,0.000000,0.050000,0.000000,0.00,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,40.633131,-73.990498
7,Brighton Beach,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,0.00,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,40.576825,-73.965094
8,Broadway Junction,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,0.00,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,40.677861,-73.903317
9,Brooklyn Heights,0.050000,0.000000,0.00,0.000000,0.020000,0.000000,0.00,0.000000,0.000000,...,0.000000,0.010000,0.000000,0.000000,0.010000,0.020000,0.000000,0.010000,40.695864,-73.993782
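
To see why the mean of one-hot columns equals a category's share of venues, consider a toy example. The data below is made up for illustration and is not taken from the Foursquare results; the names `toy_venues`, `grouped`, and `freq` are hypothetical. The sketch also shows `pd.crosstab` with `normalize='index'`, which should give the same frequencies in one call.

```python
import pandas as pd

# Toy data: hypothetical venues, not from the Foursquare results.
toy_venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B'],
    'Venue Category': ['Italian Restaurant', 'Italian Restaurant',
                       'Cafe', 'Cafe'],
})

# Same approach as the notebook: one-hot encode, then average per group.
onehot = pd.get_dummies(toy_venues['Venue Category'], dtype=float)
onehot.insert(0, 'Neighborhood', toy_venues['Neighborhood'])
grouped = onehot.groupby('Neighborhood').mean()

# Shortcut: a contingency table with each row normalized to sum to 1.
freq = pd.crosstab(toy_venues['Neighborhood'],
                   toy_venues['Venue Category'],
                   normalize='index')

print(grouped)
print(freq)
```

Neighborhood A has two Italian restaurants among three venues, so its frequency is 2/3, while neighborhood B has none and gets 0.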


Count the neighborhoods that contain at least one Italian restaurant

In [35]:
len(brooklyn_grouped[brooklyn_grouped["Italian Restaurant"] > 0])

27

In [36]:
# creating new dataframe called it_rest

it_rest = brooklyn_grouped[['Neighborhood', 'Italian Restaurant','Latitude','Longitude']]

In [37]:
it_rest.head()

Unnamed: 0,Neighborhood,Italian Restaurant,Latitude,Longitude
0,Bath Beach,0.043478,40.599519,-73.998752
1,Bay Ridge,0.05814,40.625801,-74.030621
2,Bedford Stuyvesant,0.0,40.687232,-73.941785
3,Bensonhurst,0.060606,40.611009,-73.99518
4,Bergen Beach,0.0,40.61515,-73.898556


## 4. Cluster Neighborhoods

In [38]:
# set number of clusters
kclusters = 3

# keep only the frequency column; the positional axis argument was removed in pandas 2.0
it_rest_clustering = it_rest.drop(columns=["Neighborhood", 'Latitude', 'Longitude'])

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(it_rest_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([1, 1, 0, 1, 0, 0, 0, 0, 0, 1], dtype=int32)
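
k-means assigns label numbers arbitrarily, so there is no general guarantee that a higher label means a higher Italian restaurant frequency. A minimal sketch of one way to make the labels ordered is shown below: it ranks clusters by their centre value and remaps the labels accordingly. The function name `relabel_by_center` and the toy inputs are hypothetical; with the notebook's objects the call would presumably be `relabel_by_center(kmeans.labels_, kmeans.cluster_centers_)`.

```python
import numpy as np


def relabel_by_center(labels, centers):
    """Remap labels so that 0 is the cluster with the smallest centre.

    Intended for one-dimensional clustering, as in this notebook.
    """
    # Old label numbers, sorted by their centre value.
    order = np.argsort(np.asarray(centers).ravel())
    # mapping[old_label] = rank of that cluster's centre.
    mapping = np.empty_like(order)
    mapping[order] = np.arange(order.size)
    return mapping[np.asarray(labels)]


# Toy inputs: cluster 1 has the lowest centre, cluster 2 the highest.
toy_centers = np.array([[0.05], [0.00], [0.10]])
toy_labels = np.array([0, 1, 2, 1])

ordered = relabel_by_center(toy_labels, toy_centers)
print(ordered)
```

With ordered labels, the cluster with the highest number can be read directly as the one with the highest share of Italian restaurants.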

In [39]:
it_rest_merged = it_rest.copy()

# add clustering labels
it_rest_merged["Cluster Labels"] = kmeans.labels_

In [40]:
# sort rows by cluster label, from 0 to 2
it_rest_merged.sort_values(["Cluster Labels"], inplace=True)
it_rest_merged.head()

Unnamed: 0,Neighborhood,Italian Restaurant,Latitude,Longitude,Cluster Labels
34,Gerritsen Beach,0.0,40.590848,-73.930102,0
37,Greenpoint,0.02,40.730201,-73.954241,0
38,Highland Park,0.0,40.681999,-73.890346,0
39,Homecrest,0.0,40.598525,-73.959185,0
40,Kensington,0.0,40.642382,-73.980421,0


Visualise clusters on map

In [41]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(it_rest_merged['Latitude'], it_rest_merged['Longitude'], it_rest_merged['Neighborhood'], it_rest_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [42]:
map_clusters.save('map_italian_restarurants_in_brooklyn.html')

## 5. Examine Clusters

#### Cluster 0

In [43]:
it_rest_merged.loc[it_rest_merged['Cluster Labels'] == 0]

Unnamed: 0,Neighborhood,Italian Restaurant,Latitude,Longitude,Cluster Labels
34,Gerritsen Beach,0.0,40.590848,-73.930102,0
37,Greenpoint,0.02,40.730201,-73.954241,0
38,Highland Park,0.0,40.681999,-73.890346,0
39,Homecrest,0.0,40.598525,-73.959185,0
40,Kensington,0.0,40.642382,-73.980421,0
42,Manhattan Beach,0.0,40.577914,-73.943537,0
43,Manhattan Terrace,0.0,40.614433,-73.957438,0
44,Marine Park,0.0,40.609748,-73.931344,0
45,Midwood,0.0,40.625596,-73.957595,0
47,Mill Island,0.0,40.606336,-73.908186,0


#### Cluster 1

In [44]:
it_rest_merged.loc[it_rest_merged['Cluster Labels'] == 1]

Unnamed: 0,Neighborhood,Italian Restaurant,Latitude,Longitude,Cluster Labels
9,Brooklyn Heights,0.03,40.695864,-73.993782,1
30,Fort Greene,0.056338,40.688527,-73.972906,1
3,Bensonhurst,0.060606,40.611009,-73.99518,1
1,Bay Ridge,0.05814,40.625801,-74.030621,1
64,Sunset Park,0.029412,40.645103,-74.010316,1
61,Sheepshead Bay,0.043478,40.58689,-73.943186,1
16,Cobble Hill,0.030303,40.68792,-73.998561,1
53,Park Slope,0.043478,40.672321,-73.97705,1
15,Clinton Hill,0.053191,40.693229,-73.967843,1
46,Mill Basin,0.030303,40.615974,-73.915154,1


#### Cluster 2

In [45]:
it_rest_merged.loc[it_rest_merged['Cluster Labels'] == 2]

Unnamed: 0,Neighborhood,Italian Restaurant,Latitude,Longitude,Cluster Labels
13,Carroll Gardens,0.11,40.68054,-73.994654,2
41,Madison,0.076923,40.609378,-73.948415,2
36,Gravesend,0.125,40.59526,-73.973471,2
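
A per-cluster summary can make the three tables easier to compare than reading them row by row. The sketch below assumes the column names used in `it_rest_merged` and runs on a handful of rows copied from the cluster tables above, not on the full dataset; `sample` and `summary` are hypothetical names. On the real data the same `groupby` call would be applied to `it_rest_merged`.

```python
import pandas as pd

# A few rows copied from the cluster tables above (not the full dataset).
sample = pd.DataFrame({
    'Neighborhood': ['Kensington', 'Greenpoint', 'Bay Ridge', 'Park Slope',
                     'Carroll Gardens', 'Madison', 'Gravesend'],
    'Italian Restaurant': [0.0, 0.02, 0.05814, 0.043478,
                           0.11, 0.076923, 0.125],
    'Cluster Labels': [0, 0, 1, 1, 2, 2, 2],
})

# Size and frequency range of each cluster.
summary = (sample.groupby('Cluster Labels')['Italian Restaurant']
                 .agg(['count', 'mean', 'min', 'max']))
print(summary)
```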


### Thank you!