# Capstone Project - The Battle of the Neighborhoods (Week 4)

## Best neighborhood to operate an independant coffee shop in Toronto

### Introduction

This is a "Capstone project" for the final course in the IBM Data Science Professional Certificate, Weeks 4/5. This project aims to determine the best location to open an independent coffee shop using Python scripts, visualization libraries, pandas & numpy packages, FourSquare API, etc.

This project aims to enable aspiring enterpreneures in making an informed choice in terms of location identification by taking into account various factors like population demographics, existing competitors and so on.

### Description and Background of the Problem

For this project, Toronto has been chosen as it has a growing independent coffee shop scene. According to Wikipedia - "Toronto is the capital city of the Canadian province of Ontario. With a recorded population of 2,731,571 in 2016, it is the most populous city in Canada and the fourth most populous city in North America. The city is the anchor of the Golden Horseshoe, an urban agglomeration of 9,245,438 people (as of 2016) surrounding the western end of Lake Ontario, while the Greater Toronto Area (GTA) proper had a 2016 population of 6,417,516. Toronto is an international centre of business, finance, arts, and culture, and is recognized as one of the most multicultural and cosmopolitan cities in the world."

As it is known, choosing a neighborhood and location for a coffee shop is an important and diffucult step for the entrepreneurs. Most of time it can be as crucial factor as it determines the coffee shop's success and also affects the type of service, quality, menu, interior design.

### Description of Data

To identify the best location, the following data is required:

1. The list of Toronto neighborhoods and their boroughs ('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

2. The postal codes of the neighborhoods ('https://cocl.us/Geospatial_data/Geospatial_Coordinates.csv')

3. The most updated record of traffic signal vehicle and pedestrian volumes in Toronto.(https://tinyurl.com/vehicle-foot-traffic)

4. The popular or venues of a given neighborhood in Toronto.(https://developer.foursquare.com/)

The List of neighborhood data will be used to obtain the exact coordinates for each neighborhood based on the postal code, this allows exploration of the city map.
Then the coordinates and Foursquare credentials will be used to access the popular venues along with their details, especially for coffee shops. The pedestrian volumes can be used to analyze the foot traffic.

### Pre-Processing the raw data

In [1]:
# importing libraries and packages
import pandas as pd
import numpy as np
import seaborn as sns
import requests
import matplotlib.pyplot as plt
%config IPCompleter.greedy=True
%config IPCompleter.use_jedi=False
from bs4 import BeautifulSoup
import warnings
warnings.filterwarnings('ignore')

### Neighbourhood Data Analysis

In [2]:
url ='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
source = requests.get(url).text
soup = BeautifulSoup(source)

table_data = soup.find('div', class_='mw-parser-output')
table = table_data.table.tbody

columns = ['PostalCode', 'Borough', 'Neighbourhood']
data = dict({key:[]*len(columns) for key in columns})

for row in table.find_all('tr'):
    for i,column in zip(row.find_all('td'),columns):
        i = i.text
        i = i.replace('\n', '')
        data[column].append(i)

df = pd.DataFrame.from_dict(data=data)[columns]
print(df.shape)
df.head()

(180, 3)


Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [3]:
# cleaning the data

df = df[df['Borough'] != 'Not assigned'].reset_index(drop = True)
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [4]:
p, b,n = [],[],[]
for postcode, borough, neigh in zip(df['PostalCode'], df['Borough'], df['Neighbourhood']):
    p.append(postcode)
    b.append(borough)
    if neigh == 'Not assigned':
        n.append(borough)
    else:
        n.append(neigh)
        
df = pd.DataFrame({'PostalCode': p, 'Borough': b, 'Neighbourhood':n})[columns]
print(df.shape)
df.head()

(103, 3)


Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [5]:
postcodes = df['PostalCode'].values
boroughs = df['Borough'].values
neighs = df['Neighbourhood'].values

#create a dictionary with keys as Postcode and Borough, keys of dictioaries are unique
dic = dict({(key1,key2): [] for key1, key2 in zip(postcodes, boroughs)})
print('Number of keys in the dictionary are: ', len(dic.keys()))

#filling the values of keys of dictionary
for postcode, borough, neigh in zip(postcodes,boroughs, neighs):
    key = (postcode, borough)
    dic[key].append(neigh)
    
df = pd.DataFrame(columns = ['Postal Code', 'Borough', 'Neighbourhood'])
for key, value in dic.items():
    postcode, borough, neig = key[0], key[1], value
    neig = ', '.join(neig)
    df = df.append({'Postal Code': postcode,
                     'Borough': borough,
                     'Neighbourhood': neig}, ignore_index = True)
print('Shape of final data is: ', df.shape)
df.head()

Number of keys in the dictionary are:  103
Shape of final data is:  (103, 3)


Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


Getting the Coordinates

In [6]:
url='https://cocl.us/Geospatial_data/Geospatial_Coordinates.csv'
df_geo=pd.read_csv(url)
df_geo.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [7]:
df_ll = pd.merge(df, df_geo, on = 'Postal Code')
df_ll.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


In [8]:
df_ll.shape

(103, 5)

In [9]:
# saving the DataFrame as a new csv file
df_ll.to_csv('toronto_postalcodes_lalo.csv', index = False)
df_ll.shape

(103, 5)

### Clustering Coffee Shops in Toronto

In [10]:
from sklearn.cluster import KMeans
from geopy.geocoders import Nominatim

In [18]:
CLIENT_ID='XDD3O354EQ4HSN3MBOVZMD54IE05U1RZ1YEGMGZJJ4AAK5EA'
CLIENT_SECRET='PCHDYEDPYRHQAB4HJPDHZGXSSTWPSG23DWW4IHWWDIB02T2K'
VERSION='20212302'
geo=Nominatim(user_agent="my_app")
location=geo.geocode('Toronto')
lat=location.latitude
lon=location.longitude

In [19]:
import sys
!{sys.executable} -m pip install geopy



Calling FourSquare API for identifying existing coffee shops

In [20]:
url='https://api.foursquare.com/v2/venues/search?&client_id={}&client_secret={}&v={}&ll={},{}&limit={}&query={}'.format(
    CLIENT_ID,
    CLIENT_SECRET,
    VERSION,
    lat,
    lon,
    50,
    'coffee'    
)
url

'https://api.foursquare.com/v2/venues/search?&client_id=XDD3O354EQ4HSN3MBOVZMD54IE05U1RZ1YEGMGZJJ4AAK5EA&client_secret=PCHDYEDPYRHQAB4HJPDHZGXSSTWPSG23DWW4IHWWDIB02T2K&v=20212302&ll=43.6534817,-79.3839347&limit=50&query=coffee'

Information extraction and reading it to a DataFrame

In [21]:
result=requests.get(url).json()
result

{'meta': {'code': 200, 'requestId': '603518d0ff5f9a25f4a59717'},
 'response': {'venues': [{'id': '59f784dd28122f14f9d5d63d',
    'name': 'HotBlack Coffee',
    'location': {'address': '245 Queen Street West',
     'crossStreet': 'at St Patrick St',
     'lat': 43.65036434800487,
     'lng': -79.38866907575726,
     'labeledLatLngs': [{'label': 'display',
       'lat': 43.65036434800487,
       'lng': -79.38866907575726}],
     'distance': 515,
     'postalCode': 'M5V 1Z4',
     'cc': 'CA',
     'neighborhood': 'Entertainment District',
     'city': 'Toronto',
     'state': 'ON',
     'country': 'Canada',
     'formattedAddress': ['245 Queen Street West (at St Patrick St)',
      'Toronto ON M5V 1Z4',
      'Canada']},
    'categories': [{'id': '4bf58dd8d48988d1e0931735',
      'name': 'Coffee Shop',
      'pluralName': 'Coffee Shops',
      'shortName': 'Coffee Shop',
      'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/coffeeshop_',
       'suffix': '.png'},
      'pr

In [22]:
venues=result['response']['venues']
coffee_lat=[]
coffee_lon=[]
coffee_name=[]
coffee_neigh=[]
for i in range(len(venues)):
    coffee_lat.append(venues[i]['location']['lat'])
    coffee_lon.append(venues[i]['location']['lng'])
    coffee_name.append(venues[i]['name'])
data={'name':coffee_name,'lat':coffee_lat,'lon':coffee_lon}
df_coffee=pd.DataFrame(data)
df_coffee.head()

Unnamed: 0,name,lat,lon
0,HotBlack Coffee,43.650364,-79.388669
1,Timothy's World Coffee,43.654053,-79.38809
2,Timothy's World Coffee,43.653436,-79.382314
3,Timothy's World Coffee,43.652135,-79.381172
4,Balzac's Coffee,43.657854,-79.3792


In [23]:
import sys
!{sys.executable} -m pip install folium

Collecting folium
  Downloading folium-0.12.1-py2.py3-none-any.whl (94 kB)
[K     |████████████████████████████████| 94 kB 6.7 MB/s  eta 0:00:01
Collecting branca>=0.3.0
  Downloading branca-0.4.2-py3-none-any.whl (24 kB)
Installing collected packages: branca, folium
Successfully installed branca-0.4.2 folium-0.12.1


In [24]:
import folium
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

The locations of coffee shops around Toronto are clustered into 5 different groups based on their coordinates.

In [25]:
map_lat=df_coffee['lat'][0]
map_lon=df_coffee['lon'][0]
toronto_map=folium.Map(location=[map_lat,map_lon],zoom_start=13)
km=KMeans(n_clusters=5, random_state=4).fit(df_coffee[['lat','lon']])
x=np.arange(5)
ys = [i + x + (i*x)**2 for i in range(5)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
df_coffee.insert(0,"cluster",km.labels_)

for lat,long,name,cluster in zip(df_coffee['lat'],df_coffee['lon'],df_coffee['name'],df_coffee['cluster']):
    label='{}'.format(name)
    label=folium.Popup(label,parse_html=True)
    folium.CircleMarker(
        [lat,long],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7,
        parse_html=False).add_to(toronto_map)        
    

toronto_map

### Identifying trending coffee shops-

Studying trending venues is useful in understanding the likes and prefrences of potential customers. Based on the trending coffee shops further research can be done into their business model,product offering, target audience, branding and such factors in order to develop better branding, pricing and marketing startigies. 

In [35]:
url='https://api.foursquare.com/v2/venues/trending?&client_id={}&client_secret={}&v={}&ll={},{}&limit={}&query={}'.format(
    CLIENT_ID,
    CLIENT_SECRET,
    VERSION,
    lat,
    lon,
    50,
    'coffee')

# send GET request and get trending venues
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '60351de9a6f02c11e202b2c6'},
 'response': {'venues': []}}

In [36]:
if len(results['response']['venues']) == 0:
    trending_venues_df = 'No trending venues are available at the moment!'
    
else:
    trending_venues = results['response']['venues']
    trending_venues_df = json_normalize(trending_venues)

    # filter columns
    columns_filtered = ['name', 'categories'] + ['location.distance', 'location.city', 'location.postalCode', 'location.state', 'location.country', 'location.lat', 'location.lng']
    trending_venues_df = trending_venues_df.loc[:, columns_filtered]

    # filter the category for each row
    trending_venues_df['categories'] = trending_venues_df.apply(get_category_type, axis=1)

In [37]:
trending_venues_df

'No trending venues are available at the moment!'

In [38]:
if len(results['response']['venues']) == 0:
    venues_map = 'Cannot generate visual as no trending venues are available at the moment!'

else:
    venues_map = folium.Map(location=[latitude, longitude], zoom_start=15) # generate map centred around Ecco


    # add Ecco as a red circle mark
    folium.CircleMarker(
        [latitude, longitude],
        radius=10,
        popup='Ecco',
        fill=True,
        color='red',
        fill_color='red',
        fill_opacity=0.6
    ).add_to(venues_map)


    # add the trending venues as blue circle markers
    for lat, lng, label in zip(trending_venues_df['location.lat'], trending_venues_df['location.lng'], trending_venues_df['name']):
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            poup=label,
            fill=True,
            color='blue',
            fill_color='blue',
            fill_opacity=0.6
        ).add_to(venues_map)

In [39]:
venues_map

'Cannot generate visual as no trending venues are available at the moment!'