# Analyzing Neighborhoods in Vancouver to open a Bakery

## Business Problem

##### The objective of this project is to analyze and select the best locations in Vancouver BC, Canada, to open a new Bakery. This project is mainly focused on geospatial analysis of the Vancouver to determine the best location to open a new bakery. 



## Importing Libraries

In [127]:
!pip install beautifulsoup4
!pip install lxml
import requests # library to handle requests
import pandas as pd # library for data analsysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation

#!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 


from IPython.display import display_html
import pandas as pd
import numpy as np
    
# tranforming json file into a pandas dataframe library
from pandas.io.json import json_normalize

!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library
from bs4 import BeautifulSoup
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors

print('Folium installed')
print('Libraries imported.')

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Folium installed
Libraries imported.


## Importing Data

##### We need the following data to solve the Business Problem:
* List of neighbourhoods in Vancouver BC, Canada. 
* Finding the latitude and longitude coordinates of those neighbourhoods. 
* Fetching the Venue data, specifically data of the Bakeries. 
* Using the fetched data to perform clustering on the neighbourhoods.

### Webscrapping - Fetching data from the web

In [132]:
source = requests.get('https://canadapostcode.net/location/CA-BC/Vancouver').text
soup=BeautifulSoup(source,'html.parser')

In [133]:
print(soup.title.text)

Vancouver Post Codes & Zip Codes List


In [4]:
postal_table = soup.find('table')
print(postal_table)

<table class="table table-condensed table-bordered table-striped">
<tr>
<th>Location</th>
<th>City/District</th>
<th>States or Territories</th>
<th>States or Territories Abbrieviation</th>
<th>Postcode</th>
</tr>
<tr>
<td><a href="https://canadapostcode.net/postcode/CA-BC/V5K">Vancouver (North Hastings-Sunrise)</a></td>
<td><a href="https://canadapostcode.net/location/CA-BC/Vancouver">Vancouver</a></td>
<td>British Columbia</td>
<td>BC</td>
<td><a href="https://canadapostcode.net/postcode/CA-BC/V5K">V5K</a></td>
</tr>
<tr>
<td><a href="https://canadapostcode.net/postcode/CA-BC/V5L">Vancouver (North Grandview-Woodlands)</a></td>
<td><a href="https://canadapostcode.net/location/CA-BC/Vancouver">Vancouver</a></td>
<td>British Columbia</td>
<td>BC</td>
<td><a href="https://canadapostcode.net/postcode/CA-BC/V5L">V5L</a></td>
</tr>
<tr>
<td><a href="https://canadapostcode.net/postcode/CA-BC/V5M">Vancouver (South Hastings-Sunrise / North Renfrew-Collingwood)</a></td>
<td><a href="https://cana

### Creating a dataframe to capture the data

In [136]:
df = pd.DataFrame(columns = ['Location','City/District','States or Territories','States or Territories Abbrieviation','Postcode'])
df

Unnamed: 0,Location,City/District,States or Territories,States or Territories Abbrieviation,Postcode


In [49]:
location_arr = []
zip_arr = []

table = soup.find("table", { "class" : "table" })
for row in table.findAll("tr"):
    cells = row.findAll("td")
    print(cells)      

[]
[<td><a href="https://canadapostcode.net/postcode/CA-BC/V5K">Vancouver (North Hastings-Sunrise)</a></td>, <td><a href="https://canadapostcode.net/location/CA-BC/Vancouver">Vancouver</a></td>, <td>British Columbia</td>, <td>BC</td>, <td><a href="https://canadapostcode.net/postcode/CA-BC/V5K">V5K</a></td>]
[<td><a href="https://canadapostcode.net/postcode/CA-BC/V5L">Vancouver (North Grandview-Woodlands)</a></td>, <td><a href="https://canadapostcode.net/location/CA-BC/Vancouver">Vancouver</a></td>, <td>British Columbia</td>, <td>BC</td>, <td><a href="https://canadapostcode.net/postcode/CA-BC/V5L">V5L</a></td>]
[<td><a href="https://canadapostcode.net/postcode/CA-BC/V5M">Vancouver (South Hastings-Sunrise / North Renfrew-Collingwood)</a></td>, <td><a href="https://canadapostcode.net/location/CA-BC/Vancouver">Vancouver</a></td>, <td>British Columbia</td>, <td>BC</td>, <td><a href="https://canadapostcode.net/postcode/CA-BC/V5M">V5M</a></td>]
[<td><a href="https://canadapostcode.net/postcod

### Cleaning the Data by eliminating the empty values

In [137]:
data_arr = []
rows = postal_table.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data_arr.append([ele for ele in cols if ele]) # Get rid of empty values

data_arr


[[],
 ['Vancouver (North Hastings-Sunrise)',
  'Vancouver',
  'British Columbia',
  'BC',
  'V5K'],
 ['Vancouver (North Grandview-Woodlands)',
  'Vancouver',
  'British Columbia',
  'BC',
  'V5L'],
 ['Vancouver (South Hastings-Sunrise / North Renfrew-Collingwood)',
  'Vancouver',
  'British Columbia',
  'BC',
  'V5M'],
 ['Vancouver (South Grandview-Woodlands / NE Kensington)',
  'Vancouver',
  'British Columbia',
  'BC',
  'V5N'],
 ['Vancouver (SE Kensington / Victoria-Fraserview)',
  'Vancouver',
  'British Columbia',
  'BC',
  'V5P'],
 ['Vancouver (South Renfrew-Collingwood)',
  'Vancouver',
  'British Columbia',
  'BC',
  'V5R'],
 ['Vancouver (Killarney)', 'Vancouver', 'British Columbia', 'BC', 'V5S'],
 ['Vancouver (East Mount Pleasant)',
  'Vancouver',
  'British Columbia',
  'BC',
  'V5T'],
 ['Vancouver (West Kensington / NE Riley Park-Little Mountain)',
  'Vancouver',
  'British Columbia',
  'BC',
  'V5V'],
 ['Vancouver (SE Riley Park-Little Mountain / SW Kensington / NE Oakridge

### Populating the data into the dataframe with the data captured above

In [138]:
df = pd.DataFrame(columns=['Location','City/District','States or Territories','States or Territories Abbrieviation','Postcode'], data=data_arr)

In [10]:
df

Unnamed: 0,Location,City/District,States or Territories,States or Territories Abbrieviation,Postcode
0,,,,,
1,Vancouver (North Hastings-Sunrise),Vancouver,British Columbia,BC,V5K
2,Vancouver (North Grandview-Woodlands),Vancouver,British Columbia,BC,V5L
3,Vancouver (South Hastings-Sunrise / North Renf...,Vancouver,British Columbia,BC,V5M
4,Vancouver (South Grandview-Woodlands / NE Kens...,Vancouver,British Columbia,BC,V5N
5,Vancouver (SE Kensington / Victoria-Fraserview),Vancouver,British Columbia,BC,V5P
6,Vancouver (South Renfrew-Collingwood),Vancouver,British Columbia,BC,V5R
7,Vancouver (Killarney),Vancouver,British Columbia,BC,V5S
8,Vancouver (East Mount Pleasant),Vancouver,British Columbia,BC,V5T
9,Vancouver (West Kensington / NE Riley Park-Lit...,Vancouver,British Columbia,BC,V5V


In [140]:
df.shape

(31, 5)

### Downnloading the dataframe into a CSV file

In [141]:
df.to_csv('Vancouver_zipcodes_dataframe')

### Created a new file from the file downloaded in the above step
* A new file "Vanc_Zipcoddes.csv" that contains the Postal Code, Latitude and Longitude
* The file will be read into a new dataframe

In [145]:
lat_lon = pd.read_csv(r"C:\Users\hache\Downloads\Vanc_Zipcodes.csv")
lat_lon.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,V5K,49.281719,-123.040006
1,V5L,49.280201,-123.066561
2,V5M,49.260082,-123.040164
3,V5N,49.255278,-123.070663
4,V5P,49.222751,-123.067557


## Merging the two dataframes created above into a single dataframe

In [146]:
lat_lon.rename(columns={'Postal Code':'Postcode'},inplace=True)
df2 = pd.merge(df,lat_lon,on='Postcode')
df2.head()

Unnamed: 0,Location,City/District,States or Territories,States or Territories Abbrieviation,Postcode,Latitude,Longitude
0,Vancouver (North Hastings-Sunrise),Vancouver,British Columbia,BC,V5K,49.281719,-123.040006
1,Vancouver (North Grandview-Woodlands),Vancouver,British Columbia,BC,V5L,49.280201,-123.066561
2,Vancouver (South Hastings-Sunrise / North Renf...,Vancouver,British Columbia,BC,V5M,49.260082,-123.040164
3,Vancouver (South Grandview-Woodlands / NE Kens...,Vancouver,British Columbia,BC,V5N,49.255278,-123.070663
4,Vancouver (SE Kensington / Victoria-Fraserview),Vancouver,British Columbia,BC,V5P,49.222751,-123.067557


In [78]:
first_column_df = df2.iloc[:, 0]

In [79]:
first_column_df

0                    Vancouver (North Hastings-Sunrise)
1                 Vancouver (North Grandview-Woodlands)
2     Vancouver (South Hastings-Sunrise / North Renf...
3     Vancouver (South Grandview-Woodlands / NE Kens...
4       Vancouver (SE Kensington / Victoria-Fraserview)
5                 Vancouver (South Renfrew-Collingwood)
6                                 Vancouver (Killarney)
7                       Vancouver (East Mount Pleasant)
8     Vancouver (West Kensington / NE Riley Park-Lit...
9     Vancouver (SE Riley Park-Little Mountain / SW ...
10    Vancouver (SE Oakridge / East Marpole / South ...
11    Vancouver (West Mount Pleasant / West Riley Pa...
12             Vancouver (East Fairview / South Cambie)
13    Vancouver (Strathcona / Chinatown / Downtown E...
14    Vancouver (NE Downtown / Harbour Centre / Gast...
15    Vancouver (Waterfront / Coal Harbour / Canada ...
16                           Vancouver (South West End)
17            Vancouver (North West End / Stanle

## Using FourSquare API 
* Fetching the geographical coordinates of the neighborhood locations in Vancouver BC, Canada
* We use the Geocoder package that converts the address into geographical coordinates

In [63]:
# Defining a function to get coordinates
def get_latlng(neighborhood):
    # initialize your variable to None
    lat_lng_coords = None
    # loop until you get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, Vancouver, Canada'.format(neighborhood))
        lat_lng_coords = g.latlng
    return lat_lng_coords

### Creating the neighborhood list
* Fetching the first column from the dataframe to get the neighborhoods and creating a list

In [82]:
neighborhoodList = []

for neighborhood in first_column_df:
          neighborhoodList.append(neighborhood)

print(neighborhoodList)

['Vancouver (North Hastings-Sunrise)', 'Vancouver (North Grandview-Woodlands)', 'Vancouver (South Hastings-Sunrise / North Renfrew-Collingwood)', 'Vancouver (South Grandview-Woodlands / NE Kensington)', 'Vancouver (SE Kensington / Victoria-Fraserview)', 'Vancouver (South Renfrew-Collingwood)', 'Vancouver (Killarney)', 'Vancouver (East Mount Pleasant)', 'Vancouver (West Kensington / NE Riley Park-Little Mountain)', 'Vancouver (SE Riley Park-Little Mountain / SW Kensington / NE Oakridge / North Sunset)', 'Vancouver (SE Oakridge / East Marpole / South Sunset)', 'Vancouver (West Mount Pleasant / West Riley Park-Little Mountain)', 'Vancouver (East Fairview / South Cambie)', 'Vancouver (Strathcona / Chinatown / Downtown Eastside)', 'Vancouver (NE Downtown / Harbour Centre / Gastown / Yaletown)', 'Vancouver (Waterfront / Coal Harbour / Canada Place)', 'Vancouver (South West End)', 'Vancouver (North West End / Stanley Park)', 'Vancouver (West Fairview / Granville Island / NE Shaughnessy)', 'Va

### Fetching the coordinates
* Using the Geocoder package to convert the neighborhood list into geographical coordinates

In [84]:
import geocoder
# Call the function to get the coordinates, store in a new list using list comprehension
coords = [ get_latlng(neighborhood) for neighborhood in neighborhoodList]

print(coords)

[[49.28021000000007, -123.02868999999998], [49.28010000000006, -123.07040999999998], [49.249700000000075, -123.02974999999998], [49.28010000000006, -123.07040999999998], [49.22027000000003, -123.06940999999995], [49.249700000000075, -123.02974999999998], [49.22670000000005, -123.03656999999998], [49.26983000000007, -123.09988999999996], [49.25028000000003, -123.10068999999999], [49.22026000000005, -123.08022999999997], [49.22026000000005, -123.08022999999997], [49.26983000000007, -123.09988999999996], [49.25010000000003, -123.11998999999997], [49.283930000000055, -123.09044999999998], [49.274320000000046, -123.12179999999995], [49.28834661628875, -123.11627307480667], [49.28619000000003, -123.13372999999996], [49.301060000000064, -123.13572999999997], [49.25039000000004, -123.12980999999996], [49.260380000000055, -123.11335999999994], [49.26833000000005, -123.16541999999998], [49.24936000000008, -123.16533999999996], [49.24936000000008, -123.16533999999996], [49.249960000000044, -123.1

In [149]:
# Getting the coordinates of Vancouver BC, Canada
address = 'Vancouver, Canada'
geolocator = Nominatim(user_agent="techmeritt@gmail.com")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Vancouver BC, Canada {}, {}.'.format(latitude, longitude))

The geograpical coordinate of Vancouver BC, Canada 49.2608724, -123.113952.


In [86]:
map_kl = folium.Map(location=[latitude, longitude], zoom_start=11)

In [87]:
# Create temporary dataframe to populate the coordinates into Latitude and Longitude
df_coords = pd.DataFrame(coords, columns=['Latitude', 'Longitude'])
# Merge the coordinates into the original dataframe
df2['Latitude'] = df_coords['Latitude']
df2['Longitude'] = df_coords['Longitude']
print(df2.shape)
df2

(30, 7)


Unnamed: 0,Location,City/District,States or Territories,States or Territories Abbrieviation,Postcode,Latitude,Longitude
0,Vancouver (North Hastings-Sunrise),Vancouver,British Columbia,BC,V5K,49.28021,-123.02869
1,Vancouver (North Grandview-Woodlands),Vancouver,British Columbia,BC,V5L,49.2801,-123.07041
2,Vancouver (South Hastings-Sunrise / North Renf...,Vancouver,British Columbia,BC,V5M,49.2497,-123.02975
3,Vancouver (South Grandview-Woodlands / NE Kens...,Vancouver,British Columbia,BC,V5N,49.2801,-123.07041
4,Vancouver (SE Kensington / Victoria-Fraserview),Vancouver,British Columbia,BC,V5P,49.22027,-123.06941
5,Vancouver (South Renfrew-Collingwood),Vancouver,British Columbia,BC,V5R,49.2497,-123.02975
6,Vancouver (Killarney),Vancouver,British Columbia,BC,V5S,49.2267,-123.03657
7,Vancouver (East Mount Pleasant),Vancouver,British Columbia,BC,V5T,49.26983,-123.09989
8,Vancouver (West Kensington / NE Riley Park-Lit...,Vancouver,British Columbia,BC,V5V,49.25028,-123.10069
9,Vancouver (SE Riley Park-Little Mountain / SW ...,Vancouver,British Columbia,BC,V5W,49.22026,-123.08023


## Data Visualization

* The data populated into the dataframe is then visualized on a map using the Folium package.

In [151]:
map_vancover = folium.Map(location=[latitude, longitude], zoom_start=11)
# Adding markers to map
for lat, lng, neighborhood in zip(df2['Latitude'],  df2['Longitude'], df2['Location']):
 label = '{}'.format(neighborhood)
 label = folium.Popup(label, parse_html=True)
 folium.CircleMarker([lat, lng],radius=5,popup=label,color='blue',fill=True,fill_color='#3186cc',fill_opacity=0.7).add_to(map_vancover)
map_vancover

## Use the foursquare API to explore the neighbourhoods

In [173]:
CLIENT_ID = 'GQL0TFPFRD4YTZVYTYLR2E2K5AEOUK5BDC2S1W02NOUN5CTY' # your Foursquare ID
CLIENT_SECRET = 'FHA23RM4DV0SQC0OEZU1ZAL5VLJ5S0FM2IEF1PT0Y0AT3BYL' # your Foursquare Secret
VERSION = '20211017'
radius = 2000
LIMIT = 100
venues = []

for lat, lng, neighborhood in zip(df2['Latitude'], df2['Longitude'], df2['Location']):
# Create the API request URL
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(CLIENT_ID,CLIENT_SECRET,VERSION,lat,lng,radius,LIMIT)
    #print(url)
    # Make the GET request
    results = requests.get(url).json()["response"]['groups'][0]['items']
    # Return only relevant information for each nearby venue
    for venue in results:
                venues.append((neighborhood,lat,long,venue['venue']['name'],
                venue['venue']['location']['lat'],venue['venue']['location']    ['lng'],venue['venue']['categories'][0]['name']))

In [168]:
venues

[('Vancouver (North Hastings-Sunrise)',
  49.2817186,
  -122.9278,
  'The Fair at the PNE',
  49.28297109526185,
  -123.04210889782496,
  'Fair'),
 ('Vancouver (North Hastings-Sunrise)',
  49.2817186,
  -122.9278,
  'Wooden Roller Coaster',
  49.281743917117005,
  -123.03512832025775,
  'Theme Park Ride / Attraction'),
 ('Vancouver (North Hastings-Sunrise)',
  49.2817186,
  -122.9278,
  'Livestock Barns',
  49.2840365696587,
  -123.03927841558797,
  'Farm'),
 ('Vancouver (North Hastings-Sunrise)',
  49.2817186,
  -122.9278,
  'Tamam Fine Palestinian Cuisine',
  49.281070177811365,
  -123.05143787191706,
  'Middle Eastern Restaurant'),
 ('Vancouver (North Hastings-Sunrise)',
  49.2817186,
  -122.9278,
  'Pacific Coliseum',
  49.2858225364904,
  -123.04272666905692,
  'Hockey Arena'),
 ('Vancouver (North Hastings-Sunrise)',
  49.2817186,
  -122.9278,
  'New Brighton Park',
  49.28961412535829,
  -123.03848392509391,
  'Park'),
 ('Vancouver (North Hastings-Sunrise)',
  49.2817186,
  -122.

In [174]:
venues_df = pd.DataFrame(venues)
# Defining the column names
venues_df.columns = ['Neighborhood', 'Latitude', 'Longitude', 'VenueName', 'VenueLatitude', 'VenueLongitude', 'VenueCategory']
print(venues_df.shape)
venues_df.head()

(2647, 7)


Unnamed: 0,Neighborhood,Latitude,Longitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
0,Vancouver (North Hastings-Sunrise),49.281719,-122.9278,The Fair at the PNE,49.282971,-123.042109,Fair
1,Vancouver (North Hastings-Sunrise),49.281719,-122.9278,Wooden Roller Coaster,49.281744,-123.035128,Theme Park Ride / Attraction
2,Vancouver (North Hastings-Sunrise),49.281719,-122.9278,Livestock Barns,49.284037,-123.039278,Farm
3,Vancouver (North Hastings-Sunrise),49.281719,-122.9278,Tamam Fine Palestinian Cuisine,49.28107,-123.051438,Middle Eastern Restaurant
4,Vancouver (North Hastings-Sunrise),49.281719,-122.9278,Pacific Coliseum,49.285823,-123.042727,Hockey Arena


In [175]:
venues_df

Unnamed: 0,Neighborhood,Latitude,Longitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
0,Vancouver (North Hastings-Sunrise),49.281719,-122.9278,The Fair at the PNE,49.282971,-123.042109,Fair
1,Vancouver (North Hastings-Sunrise),49.281719,-122.9278,Wooden Roller Coaster,49.281744,-123.035128,Theme Park Ride / Attraction
2,Vancouver (North Hastings-Sunrise),49.281719,-122.9278,Livestock Barns,49.284037,-123.039278,Farm
3,Vancouver (North Hastings-Sunrise),49.281719,-122.9278,Tamam Fine Palestinian Cuisine,49.281070,-123.051438,Middle Eastern Restaurant
4,Vancouver (North Hastings-Sunrise),49.281719,-122.9278,Pacific Coliseum,49.285823,-123.042727,Hockey Arena
...,...,...,...,...,...,...,...
2642,North Vancouver Outer East,49.367800,-122.9278,Mystery Peak Express,49.368014,-122.948706,Ski Chairlift
2643,North Vancouver Outer East,49.367800,-122.9278,Dog Mountain Trail,49.368491,-122.950298,Trail
2644,North Vancouver Outer East,49.367800,-122.9278,Deep Cove Lookout,49.355513,-122.941562,Scenic Lookout
2645,North Vancouver Outer East,49.367800,-122.9278,Enquist Lodge,49.362060,-122.950469,Ski Lodge


### Group of neighbourhood by taking the sum of the frequency of occurrence of each category.

In [238]:
# Lets check how many venues were returned for each neighbourhood
venues_df.groupby(["Neighborhood"]).count()

# Lets check out how many unique categories can be curated from all the returned values
print('There are {} unique categories.'.format(len(venues_df['VenueCategory'].unique())))
    
# Displaying the first 50 Venue Category names
venues_df['VenueCategory'].unique()[:]


There are 236 unique categories.


array(['Fair', 'Theme Park Ride / Attraction', 'Farm',
       'Middle Eastern Restaurant', 'Hockey Arena', 'Park', 'Theme Park',
       'Vietnamese Restaurant', 'Mexican Restaurant', 'Grocery Store',
       'Coffee Shop', 'Indian Restaurant', 'Café', 'Hot Dog Joint',
       'Racetrack', 'Breakfast Spot', 'Food Truck', 'Malay Restaurant',
       'Brewery', 'Southern / Soul Food Restaurant', 'Liquor Store',
       'Dessert Shop', 'Sushi Restaurant', 'Bakery', 'Beer Garden',
       'Deli / Bodega', 'French Restaurant', 'Motorcycle Shop',
       'Amphitheater', 'Hobby Shop', 'Theater', 'Soccer Field',
       'Japanese Restaurant', 'Event Space', 'Latin American Restaurant',
       'Seafood Restaurant', 'Bank', 'Thai Restaurant',
       'Greek Restaurant', 'BBQ Joint', 'Chinese Restaurant',
       'Restaurant', 'Convenience Store', 'Fried Chicken Joint',
       'Thrift / Vintage Store', 'Sandwich Place', 'Pub',
       'Asian Restaurant', 'Fast Food Restaurant',
       'Furniture / Home Stor

## Analyzing each neighbourhood
* Here we apply one hot encoding to all the venues. So now the number of columns becomes 7

In [205]:
#One hot encoding
vanc_onehot = pd.get_dummies(venues_df[['VenueCategory']], prefix="", prefix_sep="")
# Adding neighborhood column back to dataframe
vanc_onehot['Neighborhoods'] = venues_df['Neighborhood']
# Moving neighbourhood column to the first column
fixed_columns = [vanc_onehot.columns[-1]] + list(vanc_onehot.columns[:-1])
vanc_onehot = vanc_onehot[fixed_columns]
print(vanc_onehot.shape)

(2647, 237)


In [239]:
vanc_grouped=vanc_onehot.groupby(["Neighborhoods"]).sum().reset_index()
print(vanc_grouped.shape)
vanc_grouped

(30, 237)


Unnamed: 0,Neighborhoods,Accessories Store,American Restaurant,Amphitheater,Aquarium,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Dealership,...,Trail,Vegetarian / Vegan Restaurant,Video Store,Vietnamese Restaurant,Volleyball Court,Water Park,Waterfront,Wine Shop,Women's Store,Yoga Studio
0,North Vancouver Outer East,0,0,0,0,0,0,0,0,0,...,3,0,0,0,0,0,0,0,0,0
1,Vancouver (Central Kitsilano),0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,4
2,Vancouver (Chaldecutt / South University Endow...,0,0,0,0,0,0,0,0,0,...,1,0,0,1,0,0,0,0,0,0
3,Vancouver (Dunbar-Southlands / Musqueam),0,0,0,0,0,0,1,0,0,...,0,0,0,1,0,0,0,0,0,0
4,Vancouver (East Fairview / South Cambie),0,0,0,0,0,0,0,0,0,...,1,3,0,2,0,0,2,0,1,1
5,Vancouver (East Mount Pleasant),0,0,0,0,2,0,0,0,1,...,1,2,0,4,0,0,0,0,0,2
6,Vancouver (Killarney),0,0,0,0,0,0,0,0,0,...,1,0,2,0,0,0,0,0,0,0
7,Vancouver (NE Downtown / Harbour Centre / Gast...,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,2,0,0,0
8,Vancouver (NW Arbutus Ridge),0,0,0,0,0,0,1,0,0,...,0,2,0,0,0,0,0,0,1,1
9,Vancouver (NW Shaughnessy / East Kitsilano / Q...,0,2,0,0,0,0,0,0,0,...,1,1,0,0,0,0,1,0,0,3


In [208]:
len((vanc_grouped[vanc_grouped["Bakery"] > 0]))

27

In [237]:
# Creating a dataframe for Bakery data only
vanc_bakery = vanc_grouped[["Neighborhoods","Bakery"]]

In [223]:
vanc_bakery

Unnamed: 0,Neighborhoods,Bakery
0,North Vancouver Outer East,0
1,Vancouver (Central Kitsilano),6
2,Vancouver (Chaldecutt / South University Endow...,1
3,Vancouver (Dunbar-Southlands / Musqueam),2
4,Vancouver (East Fairview / South Cambie),7
5,Vancouver (East Mount Pleasant),6
6,Vancouver (Killarney),0
7,Vancouver (NE Downtown / Harbour Centre / Gast...,5
8,Vancouver (NW Arbutus Ridge),5
9,Vancouver (NW Shaughnessy / East Kitsilano / Q...,8


## Clustering the neighbourhoods

* Cluster all the neighbourhoods into different clusters
* The results will allow us to identify neighborhoods that have a higher and lower concentrations of bakeries
* Based on the results we could identify neighbourhoods that are most suitable to open a new bakery

In [240]:
# Setting the number of clusters
kclusters = 4
vanc_clustering = vanc_bakery.drop(["Neighborhoods"], 1)
# Run k-means clustering algorithm
kmeans = KMeans(n_clusters=kclusters,random_state=0).fit(kl_clustering)
# Checking cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([2, 1, 2, 0, 1, 1, 2, 3, 3, 1])

In [241]:
# Creating a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.
vanc_merged = vanc_bakery.copy()
# Add the clustering labels
vanc_merged["Cluster Labels"] = kmeans.labels_
vanc_merged.rename(columns={"Neighborhoods": "Neighborhood"}, inplace=True)
vanc_merged.head(10)

Unnamed: 0,Neighborhood,Bakery,Cluster Labels
0,North Vancouver Outer East,0,2
1,Vancouver (Central Kitsilano),6,1
2,Vancouver (Chaldecutt / South University Endow...,1,2
3,Vancouver (Dunbar-Southlands / Musqueam),2,0
4,Vancouver (East Fairview / South Cambie),7,1
5,Vancouver (East Mount Pleasant),6,1
6,Vancouver (Killarney),0,2
7,Vancouver (NE Downtown / Harbour Centre / Gast...,5,3
8,Vancouver (NW Arbutus Ridge),5,3
9,Vancouver (NW Shaughnessy / East Kitsilano / Q...,8,1


In [246]:
 # Adding latitude and longitude values to the existing dataframe
vanc_merged['Latitude'] = df2['Latitude']
vanc_merged['Longitude'] = df2['Longitude']
# Sorting the results by Cluster Labels
vanc_merged.sort_values(["Cluster Labels"], inplace=True)
vanc_merged

Unnamed: 0,Neighborhood,Bakery,Cluster Labels,Latitude,Longitude
29,Vancouver (West Mount Pleasant / West Riley Pa...,2,0,49.3678,-122.9278
16,Vancouver (SE Riley Park-Little Mountain / SW ...,3,0,49.284837,-123.126301
15,Vancouver (SE Oakridge / East Marpole / South ...,2,0,49.287717,-123.115193
20,Vancouver (South Renfrew-Collingwood),2,0,49.268952,-123.165018
13,Vancouver (SE Kensington / Victoria-Fraserview),3,0,49.273527,-123.099559
21,Vancouver (South Shaughnessy / NW Oakridge / N...,2,0,49.252375,-123.169951
11,Vancouver (North Hastings-Sunrise),3,0,49.240339,-123.111677
10,Vancouver (North Grandview-Woodlands),3,0,49.215425,-123.097798
18,Vancouver (South Grandview-Woodlands / NE Kens...,3,0,49.256797,-123.133132
25,Vancouver (Waterfront / Coal Harbour / Canada ...,2,0,49.269118,-123.197198


## Visualizing the resulting clusters

### Looking at the map below we can spot the various clusters


In [252]:
# Creating the map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)
# Setting color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
# Add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(vanc_merged['Latitude'], vanc_merged['Longitude'], vanc_merged['Neighborhood'], vanc_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker([lat,lon],radius=5,popup=label,color=rainbow[cluster-1],fill=True,fill_color=rainbow[cluster-1],fill_opacity=0.7).add_to(map_clusters)
map_clusters

In [249]:
len(vanc_merged.loc[vanc_merged['Cluster Labels'] == 0])


12

In [251]:
len(vanc_merged.loc[vanc_merged['Cluster Labels'] == 1])


5

In [250]:
len(vanc_merged.loc[vanc_merged['Cluster Labels'] == 2])

7

## Result

* Cluster 0 contains 12 bakery places which is the highest among the 3 clusters
* Cluster 1 contains 5 bakery places
* Cluster 2 contains 7 bakery places

## Conclusion

### We should open a bakery in cluster 1 as there are the least number of bakeries there, the competition will be less and it has the most potential