# IBM Data Science Final Project

Use the Foursquare API to solve a real business problem

Welcome to the final project of the IBM Datascience Professional Certification.  

## The Problem: Find the best area of Toronto to open a pet store.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Introduction</a>

2. <a href="#item2">Data</a>

3. <a href="#item3">Methodology</a>

4. <a href="#item4">Discussion</a>

5. <a href="#item5">Conclusion</a>    
</font>
</div>

## 1. Introduction

<b>The business problem:</b>

What is the best location to open a pet store in Toronto, Canada.

<b>Who would be interested in this project:

</b> For 2018, it estimated that $72.13 billion was spent on pets in the U.S. 

Expansion into Canada is a massive business opportunity, with Toronto being the largest market in Canada. 

By using data, it is hoped that a more informed approach to neighbourhood selection could be utilized.

It is worth noting that this approach could be extended almost any other retail sector. 

## 2. The Data 

Stastistical data on distribution of pets in the city as well as existing potentially competitive retail stores and services are critical factors in determine the suitability of a location for a pet store. Data on dsitribution of pets and potentially competitive stores is needed. 

The following data sets will be utilized:

- The City of Toronto’s Open Data Catalogue for data on pets Toronto be area code
- Wikipedia for a list of Toronto's Areacodes by Borough and Neighbourhood
- Forsquare API to get the Pet related store and services in a given Borough and Neighbourhood of Toronto
- Long and Lat corrdinates provided by https://cocl.us/Geospatial_data


### 2.1 Importing and Cleaning the Data

Step 1 - Scraping the neighborhoods data from the wikipedia page

Let's download all the dependencies that we will need.

In [2]:
import pandas as pd
from bs4 import BeautifulSoup
import urllib.request

Now lets download the contents of the wikipedia Toronto web page using urllib and BeautifulSoup

In [3]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')

Find the table with postal codes from html of the page and clean it up so it can be easily imported into a Pandas Date frame, ready to be worked with

In [4]:
table = soup.find_all('table')[0]
del table['class']
dfs = pd.read_html(table.prettify(), flavor='bs4')
df = dfs[0]
df.columns = df.iloc[0]
df = df.reindex(df.index.drop(0))
df.rename(columns={'Postcode':'PostalCode'}, inplace=True)
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront


Great, now we have a dataframe with the correct colums! But now we need to get rid of the "Not Assigned" rows. 

First we get rid of the Not assigned boroughs, then we replace the Not assigned neighbourhoods with the Borough names if needed. 

Then we will group the data by postal code and Comma Seperate the Neighbourhood.

In [5]:
df = df[df.Borough != 'Not assigned']
df.Neighbourhood.replace('Not assigned',df.Borough,inplace=True)
df = df.groupby(['PostalCode', 'Borough'])['Neighbourhood'].apply(", ".join)
df = df.reset_index()
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


Now lets add the long and lat to the Neighbourhood! First lets download the long and lat data by postal code. 

In [6]:
data_url="https://cocl.us/Geospatial_data"
c=pd.read_csv(data_url)
df = df.set_index('PostalCode').join(c.set_index('Postal Code'))
df = df.reset_index()
df.head(11)

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848


Finally lets add the Pet Data to the Area Code regions and eliminate any areas with no cats and dogs. 

In [7]:
cats_url="dogs_and_cats_toronto.csv"
cats_dogs=pd.read_csv(cats_url)
df = df.set_index('PostalCode').join(cats_dogs.set_index('PostalCode'))

In [8]:
df = df.reset_index()
df.head(11)

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude,CAT,DOG
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353,285,627
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,297,775
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,467,963
3,M1G,Scarborough,Woburn,43.770992,-79.216917,220,385
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,155,309
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476,220,398
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029,453,654
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577,299,510
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476,250,627
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848,393,852


In [9]:
df = df.dropna()

In [10]:

df['DOG'] = df['DOG'].str.replace(',', '')


### Section 1.2 Understanding the Data

In [11]:
import numpy as np # library to handle data in a vectorized manner

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


In [12]:
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Toronto are {}, {}.'.format(latitude, longitude))

neighborhoods = df

The geograpical coordinate of Toronto are 43.653963, -79.387207.


## Bubble Map of Areas of Toronto and Quantity of Registered Cats

In [13]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, borough, neighborhood, cats in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighbourhood'], neighborhoods['CAT']):
    label = '{}, {}, {}'.format(neighborhood, borough, cats)
    label = folium.Popup(label, parse_html=True)
    nubcat = int(cats) / 25
    if nubcat > 20:
        colord = 'red'
    else:
        colord = 'blue'
    folium.CircleMarker(
        [lat, lng],
        radius=nubcat,
        popup=label,
        color='blue',
        fill=True,
        fill_color=colord,
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

## Bubble Map of Areas of Toronto and Quantity of Registered Dogs

In [14]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, borough, neighborhood, dogs in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighbourhood'], neighborhoods['DOG']):
    label = '{}, {}, Registered Dogs - {}'.format(neighborhood, borough, dogs)
    label = folium.Popup(label, parse_html=True)
    nubcat = int(dogs) / 25
    if nubcat > 40:
        colord = 'red'
    else:
        colord = 'blue'
    folium.CircleMarker(
        [lat, lng],
        radius=nubcat,
        popup=label,
        color='blue',
        fill=True,
        fill_color=colord,
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

K Means to Cluster Areas on Cats and Dogs

In [15]:
from sklearn.preprocessing import StandardScaler


In [16]:
X = df.values[:,5:]

In [17]:
cluster_dataset = StandardScaler().fit_transform(X)



In [18]:
num_clusters = 5

k_means = KMeans(init="k-means++", n_clusters=num_clusters, n_init=12)
k_means.fit(cluster_dataset)
labels = k_means.labels_

In [19]:
df["Labels"] = labels
df.head(5)

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude,CAT,DOG,Labels
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353,285,627,3
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,297,775,3
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,467,963,1
3,M1G,Scarborough,Woburn,43.770992,-79.216917,220,385,0
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,155,309,0


In [20]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, borough, neighborhood, lable in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighbourhood'], neighborhoods['Labels']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    if lable == 0:
        colord = 'lightgreen'
    elif lable == 1:
        colord = 'yellow'
    elif lable == 2:
        colord = 'pink'
    elif lable == 3:
        colord = 'green'
    else:
        colord = 'red'
    folium.CircleMarker(
        [lat, lng],
        radius=20,
        popup=label,
        color='red',
        fill=True,
        fill_color=colord,
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

In [21]:
df2 = df.where(df["Labels"]==3)

In [22]:
df2 = df2.dropna()

In [23]:
df2

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude,CAT,DOG,Labels
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353,285,627,3.0
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,297,775,3.0
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577,299,510,3.0
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476,250,627,3.0
10,M1P,Scarborough,"Dorset Park, Scarborough Town Centre, Wexford ...",43.75741,-79.273304,298,500,3.0
18,M2J,North York,"Fairview, Henry Farm, Oriole",43.778517,-79.346556,295,659,3.0
24,M2R,North York,Willowdale West,43.782736,-79.442259,280,509,3.0
25,M3A,North York,Parkwoods,43.753259,-79.329656,309,646,3.0
28,M3H,North York,"Bathurst Manor, Downsview North, Wilson Heights",43.754328,-79.442259,249,675,3.0
35,M4B,East York,"Woodbine Gardens, Parkview Hill",43.706397,-79.309937,262,488,3.0


In [24]:
CLIENT_ID = '2C3JZ512Y1LVCCZGLPDVYCQIBZYLJJWQTHFSEUCK1CMWWJJ1' # your Foursquare ID
CLIENT_SECRET = 'XACOFOSDV3L5UNB2TFQ3E4BQQNHUDHDEIK2SWJBZXX1Y01TZ' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 100000 # define radius

Your credentails:
CLIENT_ID: 2C3JZ512Y1LVCCZGLPDVYCQIBZYLJJWQTHFSEUCK1CMWWJJ1
CLIENT_SECRET:XACOFOSDV3L5UNB2TFQ3E4BQQNHUDHDEIK2SWJBZXX1Y01TZ


In [25]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/search?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}&query=pet'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['categories'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [26]:
toronto_venues = getNearbyVenues(names=df2['Neighbourhood'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude']
                                  )

Rouge, Malvern


KeyError: 'categories'

In [40]:
print(toronto_venues.shape)
toronto_venues.head()

(39, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Willowdale South,43.806686,-79.194353,Wendy's,43.807448,-79.199056,Fast Food Restaurant
1,Woodbine Heights,43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
2,Woodbine Heights,43.784535,-79.160497,Affordable Toronto Movers,43.787919,-79.162977,Moving Target
3,East Toronto,43.763573,-79.188711,Swiss Chalet Rotisserie & Grill,43.767697,-79.189914,Pizza Place
4,East Toronto,43.763573,-79.188711,G & G Electronics,43.765309,-79.191537,Electronics Store


In [41]:
toronto_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Dovercourt Village, Dufferin",7,7,7,7,7,7
East Toronto,6,6,6,6,6,6
"High Park, The Junction South",2,2,2,2,2,2
"Humber Bay Shores, Mimico South, New Toronto",10,10,10,10,10,10
"Runnymede, Swansea",7,7,7,7,7,7
"The Beaches West, India Bazaar",4,4,4,4,4,4
Willowdale South,1,1,1,1,1,1
Woodbine Heights,2,2,2,2,2,2


In [42]:
print('There are {} uniques categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 31 uniques categories.


In [43]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Neighborhood,Athletics & Sports,Bakery,Bank,Bar,Breakfast Spot,Bus Line,Bus Station,Caribbean Restaurant,Chinese Restaurant,Coffee Shop,Department Store,Discount Store,Electronics Store,Fast Food Restaurant,Fried Chicken Joint,Hakka Restaurant,Insurance Office,Intersection,Korean Restaurant,Medical Center,Metro Station,Mexican Restaurant,Moving Target,Park,Pizza Place,Playground,Rental Car Location,Soccer Field,Spa,Thai Restaurant,Train Station
0,Willowdale South,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Woodbine Heights,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Woodbine Heights,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
3,East Toronto,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
4,East Toronto,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [45]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Athletics & Sports,Bakery,Bank,Bar,Breakfast Spot,Bus Line,Bus Station,Caribbean Restaurant,Chinese Restaurant,Coffee Shop,Department Store,Discount Store,Electronics Store,Fast Food Restaurant,Fried Chicken Joint,Hakka Restaurant,Insurance Office,Intersection,Korean Restaurant,Medical Center,Metro Station,Mexican Restaurant,Moving Target,Park,Pizza Place,Playground,Rental Car Location,Soccer Field,Spa,Thai Restaurant,Train Station
0,"Dovercourt Village, Dufferin",0.142857,0.142857,0.142857,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0
1,East Toronto,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.166667,0.0,0.0,0.166667,0.0,0.166667,0.0,0.0,0.0,0.0
2,"High Park, The Junction South",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.5,0.0,0.0
3,"Humber Bay Shores, Mimico South, New Toronto",0.0,0.2,0.0,0.0,0.0,0.2,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.1,0.0,0.0,0.1,0.0,0.0,0.1,0.0,0.0,0.0,0.1,0.0,0.0,0.0
4,"Runnymede, Swansea",0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.142857,0.142857,0.142857,0.285714,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857
5,"The Beaches West, India Bazaar",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Willowdale South,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Woodbine Heights,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [46]:
num_top_venues = 3

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Dovercourt Village, Dufferin----
                  venue  freq
0    Athletics & Sports  0.14
1  Caribbean Restaurant  0.14
2       Thai Restaurant  0.14


----East Toronto----
                 venue  freq
0       Breakfast Spot  0.17
1  Rental Car Location  0.17
2          Pizza Place  0.17


----High Park, The Junction South----
                venue  freq
0                 Spa   0.5
1          Playground   0.5
2  Athletics & Sports   0.0


----Humber Bay Shores, Mimico South, New Toronto----
          venue  freq
0        Bakery   0.2
1      Bus Line   0.2
2  Intersection   0.1


----Runnymede, Swansea----
            venue  freq
0  Discount Store  0.29
1   Train Station  0.14
2     Bus Station  0.14


----The Beaches West, India Bazaar----
               venue  freq
0        Coffee Shop  0.50
1   Insurance Office  0.25
2  Korean Restaurant  0.25


----Willowdale South----
                  venue  freq
0  Fast Food Restaurant   1.0
1    Athletics & Sports   0.0
2      Insurance O