# IBM Data Science Final Project

Use the Foursquare API to solve a real business problem

Welcome to the final project of the IBM Datascience Professional Certification.  

## The Problem: Helping Doggy NYC find the best area of Toronto to open a pet store.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Introduction</a>

2. <a href="#item2">Data</a>

3. <a href="#item3">Methodology</a>

4. <a href="#item4">Discussion</a>

5. <a href="#item5">Conclusion</a>    
</font>
</div>

## 1. Introduction

<b>The business problem:</b>

The NYC Store Doggy Style NYC, would like to open a store in Toronto, Canada. What is the best location to open a pet store in Toronto, Canada. In terms of customer quantity and quality.

<b>Who would be interested in this project:

</b> For 2018, it estimated that $72.13 billion was spent on pets in the U.S. 

Expansion into Canada is a massive business opportunity, with Toronto being the largest market in Canada. 

By using data, it is hoped that a more informed approach to neighbourhood selection could be utilized.

It is worth noting that this approach could be extended almost any other retail sector. 

## 2. The Data 

Stastistical data on distribution of pets in the city as well as existing potentially competitive retail stores and services are critical factors in determine the suitability of a location for a pet store. Data on dsitribution of pets and potentially competitive stores is needed. 

The following data sets will be utilized:

- The City of Toronto’s Open Data Catalogue for data on pets Toronto by area code
- Wikipedia for a list of Toronto's Areacodes by Borough and Neighbourhood and area code
- Forsquare API to get the store and services in a given Borough and Neighbourhood of Toronto and the store in NYC
- Long and Lat corrdinates by area code provided by https://cocl.us/Geospatial_data


# 3. Methodology

To select the ideal location for a neighbourhood, we will use two phases, with both leveraging the power of machine learning to help us un our search.

Phase 1 - Quantity - Clustering the neighbourhoods based on dog and cat populations - i.e. our customers. 

Data Needed : Toronto Open Data, Toronto Wikipedia, Log and Lat Coortinates. 

Machine Learning Algorythmn used : K Means - popular and effective for unsupervised clustering. 

Phase 2 - Quality - From the best neighbourhoods, selecting the ones that more most like the one our shop is in in NYC - Doggy style knows its neighbours love the store in New York so they want the most similar neighbours in Toronto! 

Data Needed : Foursquare API - neighbourhood data

Machine Learning Algorythmn used : K Means - popular and effective for unsupervised clustering. 

By identifying the neighbourhoods with the most customers and the best customers, Doggy Style NYC can mitigate the risks of opening a new store in a bad location in a new city. 

### 3.1 Importing and Cleaning the Data

Step 1 - Scraping and cleaning the neighborhoods data from the Toronto wikipedia page

Let's download all the dependencies that we will need.

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import urllib.request

Now lets download the contents of the wikipedia Toronto web page using urllib and BeautifulSoup

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')

Find the table with postal codes from html of the page and clean it up so it can be easily imported into a Pandas Date frame, ready to be worked with

In [3]:
table = soup.find_all('table')[0]
del table['class']
dfs = pd.read_html(table.prettify(), flavor='bs4')
df = dfs[0]
df.columns = df.iloc[0]
df = df.reindex(df.index.drop(0))
df.rename(columns={'Postcode':'PostalCode'}, inplace=True)
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront


Great, now we have a dataframe with the correct colums! But now we need to get rid of the "Not Assigned" rows. 

First we get rid of the Not assigned boroughs, then we replace the Not assigned neighbourhoods with the Borough names if needed. 

Then we will group the data by postal code and Comma Seperate the Neighbourhood.

In [4]:
df = df[df.Borough != 'Not assigned']
df.Neighbourhood.replace('Not assigned',df.Borough,inplace=True)
df = df.groupby(['PostalCode', 'Borough'])['Neighbourhood'].apply(", ".join)
df = df.reset_index()
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


Step 2. Adding the Lat and Long coordinates to the data.

Now lets add the long and lat to the Neighbourhood! First lets download the long and lat data by postal code. 

In [5]:
data_url="https://cocl.us/Geospatial_data"
c=pd.read_csv(data_url)
df = df.set_index('PostalCode').join(c.set_index('Postal Code'))
df = df.reset_index()
df.head(11)

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848


Finally lets download and add the Pet Data to the Area Code regions and eliminate any areas with no cats and dogs. 

In [6]:
cats_url="dogs_and_cats_toronto.csv"
cats_dogs=pd.read_csv(cats_url)
df = df.set_index('PostalCode').join(cats_dogs.set_index('PostalCode'))

In [7]:
df = df.reset_index()
df.head(11)

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude,CAT,DOG
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353,285,627
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,297,775
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,467,963
3,M1G,Scarborough,Woburn,43.770992,-79.216917,220,385
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,155,309
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476,220,398
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029,453,654
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577,299,510
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476,250,627
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848,393,852


In [8]:
df = df.dropna()

In [9]:

df['DOG'] = df['DOG'].str.replace(',', '')


### Section 3.2 Understanding the Data

TO visualize the data, we will install some libraries to help us work with and visualize the data. 

In [17]:
import numpy as np # library to handle data in a vectorized manner

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as colors


# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


Now we need to find the geograpical coordinate of Toronto

In [13]:
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Toronto are {}, {}.'.format(latitude, longitude))

neighborhoods = df

The geograpical coordinate of Toronto are 43.653963, -79.387207.


In [None]:
df.plot

## Bubble Map of Areas of Toronto and Quantity of Registered Cats

In [14]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, borough, neighborhood, cats in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighbourhood'], neighborhoods['CAT']):
    label = '{}, {}, {}'.format(neighborhood, borough, cats)
    label = folium.Popup(label, parse_html=True)
    nubcat = int(cats) / 25
    if nubcat > 20:
        colord = 'red'
    else:
        colord = 'blue'
    folium.CircleMarker(
        [lat, lng],
        radius=nubcat,
        popup=label,
        color='blue',
        fill=True,
        fill_color=colord,
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

## Bubble Map of Areas of Toronto and Quantity of Registered Dogs

In [15]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, borough, neighborhood, dogs in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighbourhood'], neighborhoods['DOG']):
    label = '{}, {}, Registered Dogs - {}'.format(neighborhood, borough, dogs)
    label = folium.Popup(label, parse_html=True)
    nubcat = int(dogs) / 25
    if nubcat > 40:
        colord = 'red'
    else:
        colord = 'blue'
    folium.CircleMarker(
        [lat, lng],
        radius=nubcat,
        popup=label,
        color='blue',
        fill=True,
        fill_color=colord,
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

### 3.3 Data Anaylsis and Machine Learning

K Means to Cluster Areas on Cats and Dogs

In [40]:
from sklearn.preprocessing import StandardScaler


In [41]:
X = df.values[:,5:]

In [42]:
cluster_dataset = StandardScaler().fit_transform(X)



In [43]:
num_clusters = 5

k_means = KMeans(init="k-means++", n_clusters=num_clusters, n_init=12)
k_means.fit(cluster_dataset)
labels = k_means.labels_

In [44]:
df["Labels"] = labels
df.head(5)

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude,CAT,DOG,Labels
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353,285,627,1
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,297,775,1
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,467,963,3
3,M1G,Scarborough,Woburn,43.770992,-79.216917,220,385,4
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,155,309,4


In [47]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, borough, neighborhood, lable in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighbourhood'], neighborhoods['Labels']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    if lable == 0:
        colord = 'red'
    elif lable == 1:
        colord = 'lightblue'
    elif lable == 2:
        colord = 'lightgray'
    elif lable == 3:
        colord = 'white'
    else:
        colord = 'lightgreen'
    folium.CircleMarker(
        [lat, lng],
        radius=20,
        popup=label,
        color='red',
        fill=True,
        fill_color=colord,
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

In [51]:
df2 = df.where(df["Labels"]==0)

In [52]:
df2 = df2.dropna()

In [53]:
df2

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude,CAT,DOG,Labels
22,M2N,North York,Willowdale South,43.77012,-79.408493,513,1034,0.0
36,M4C,East York,Woodbine Heights,43.695344,-79.318389,663,1201,0.0
40,M4J,East York,East Toronto,43.685347,-79.338106,571,1110,0.0
42,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572,513,1315,0.0
76,M6H,West Toronto,"Dovercourt Village, Dufferin",43.669005,-79.442259,676,1085,0.0
82,M6P,West Toronto,"High Park, The Junction South",43.661608,-79.464763,630,1179,0.0
84,M6S,West Toronto,"Runnymede, Swansea",43.651571,-79.48445,516,1177,0.0
88,M8V,Etobicoke,"Humber Bay Shores, Mimico South, New Toronto",43.605647,-79.501321,613,1235,0.0


Lat and and Long of Doggy Style NYC Neighourhood - 40.7324897,-73.9941879

Remove unneeded features and add NYC Doggy Style

In [60]:
df3 = df2[['Neighbourhood', 'Latitude', 'Longitude']].copy()
row = ['Doggy Style NYC', 40.7324897, -73.9941879]
df3.loc[len(df)] = row
df = df.reset_index()
df3

Unnamed: 0,Neighbourhood,Latitude,Longitude
22,Willowdale South,43.77012,-79.408493
36,Woodbine Heights,43.695344,-79.318389
40,East Toronto,43.685347,-79.338106
42,"The Beaches West, India Bazaar",43.668999,-79.315572
76,"Dovercourt Village, Dufferin",43.669005,-79.442259
82,"High Park, The Junction South",43.661608,-79.464763
84,"Runnymede, Swansea",43.651571,-79.48445
88,"Humber Bay Shores, Mimico South, New Toronto",43.605647,-79.501321
97,Doggy Style NYC,40.73249,-73.994188


In [61]:
CLIENT_ID = '2C3JZ512Y1LVCCZGLPDVYCQIBZYLJJWQTHFSEUCK1CMWWJJ1' # your Foursquare ID
CLIENT_SECRET = 'XACOFOSDV3L5UNB2TFQ3E4BQQNHUDHDEIK2SWJBZXX1Y01TZ' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 100000 # define radius

Your credentails:
CLIENT_ID: 2C3JZ512Y1LVCCZGLPDVYCQIBZYLJJWQTHFSEUCK1CMWWJJ1
CLIENT_SECRET:XACOFOSDV3L5UNB2TFQ3E4BQQNHUDHDEIK2SWJBZXX1Y01TZ


In [62]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [64]:
toronto_venues = getNearbyVenues(names=df3['Neighbourhood'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude']
                                  )

Willowdale South
Woodbine Heights
East Toronto
The Beaches West, India Bazaar
Dovercourt Village, Dufferin
High Park, The Junction South
Runnymede, Swansea
Humber Bay Shores, Mimico South, New Toronto
Doggy Style NYC


In [65]:
print(toronto_venues.shape)
toronto_venues.head()

(36, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Willowdale South,43.806686,-79.194353,Wendy's,43.807448,-79.199056,Fast Food Restaurant
1,Woodbine Heights,43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
2,East Toronto,43.763573,-79.188711,Swiss Chalet Rotisserie & Grill,43.767697,-79.189914,Pizza Place
3,East Toronto,43.763573,-79.188711,G & G Electronics,43.765309,-79.191537,Electronics Store
4,East Toronto,43.763573,-79.188711,Big Bite Burrito,43.766299,-79.19072,Mexican Restaurant


In [66]:
toronto_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Doggy Style NYC,2,2,2,2,2,2
"Dovercourt Village, Dufferin",7,7,7,7,7,7
East Toronto,6,6,6,6,6,6
"High Park, The Junction South",1,1,1,1,1,1
"Humber Bay Shores, Mimico South, New Toronto",10,10,10,10,10,10
"Runnymede, Swansea",5,5,5,5,5,5
"The Beaches West, India Bazaar",3,3,3,3,3,3
Willowdale South,1,1,1,1,1,1
Woodbine Heights,1,1,1,1,1,1


In [67]:
print('There are {} uniques categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 29 uniques categories.


In [68]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Neighborhood,American Restaurant,Athletics & Sports,Bakery,Bank,Bar,Breakfast Spot,Bus Line,Bus Station,Caribbean Restaurant,Coffee Shop,Convenience Store,Department Store,Discount Store,Electronics Store,Fast Food Restaurant,Fried Chicken Joint,Hakka Restaurant,Intersection,Korean Restaurant,Medical Center,Metro Station,Mexican Restaurant,Motel,Park,Pizza Place,Playground,Rental Car Location,Soccer Field,Thai Restaurant
0,Willowdale South,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Woodbine Heights,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,East Toronto,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
3,East Toronto,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,East Toronto,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0


In [104]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,American Restaurant,Athletics & Sports,Bakery,Bank,Bar,Breakfast Spot,Bus Line,Bus Station,Caribbean Restaurant,Coffee Shop,Convenience Store,Department Store,Discount Store,Electronics Store,Fast Food Restaurant,Fried Chicken Joint,Hakka Restaurant,Intersection,Korean Restaurant,Medical Center,Metro Station,Mexican Restaurant,Motel,Park,Pizza Place,Playground,Rental Car Location,Soccer Field,Thai Restaurant
0,Doggy Style NYC,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0
1,"Dovercourt Village, Dufferin",0.0,0.142857,0.142857,0.142857,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857
2,East Toronto,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.166667,0.0,0.0,0.166667,0.0,0.166667,0.0,0.0
3,"High Park, The Junction South",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,"Humber Bay Shores, Mimico South, New Toronto",0.0,0.0,0.2,0.0,0.0,0.0,0.2,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.1,0.0,0.0,0.1,0.0,0.0,0.1,0.0,0.0,0.0,0.1,0.0
5,"Runnymede, Swansea",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.2,0.2,0.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,"The Beaches West, India Bazaar",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.666667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Willowdale South,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Woodbine Heights,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [105]:
num_top_venues = 3

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Doggy Style NYC----
                 venue  freq
0  American Restaurant   0.5
1                Motel   0.5
2  Fried Chicken Joint   0.0


----Dovercourt Village, Dufferin----
                 venue  freq
0      Thai Restaurant  0.14
1   Athletics & Sports  0.14
2  Fried Chicken Joint  0.14


----East Toronto----
                 venue  freq
0  Rental Car Location  0.17
1       Breakfast Spot  0.17
2          Pizza Place  0.17


----High Park, The Junction South----
                 venue  freq
0           Playground   1.0
1  American Restaurant   0.0
2  Fried Chicken Joint   0.0


----Humber Bay Shores, Mimico South, New Toronto----
                  venue  freq
0                Bakery   0.2
1              Bus Line   0.2
2  Fast Food Restaurant   0.1


----Runnymede, Swansea----
               venue  freq
0     Discount Store   0.4
1        Coffee Shop   0.2
2  Convenience Store   0.2


----The Beaches West, India Bazaar----
                 venue  freq
0          Coffee Shop  0.67

In [106]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [138]:
num_top_venues = 20

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,11th Most Common Venue,12th Most Common Venue,13th Most Common Venue,14th Most Common Venue,15th Most Common Venue,16th Most Common Venue,17th Most Common Venue,18th Most Common Venue,19th Most Common Venue,20th Most Common Venue
0,Doggy Style NYC,American Restaurant,Motel,Electronics Store,Athletics & Sports,Bakery,Bank,Bar,Breakfast Spot,Bus Line,Bus Station,Caribbean Restaurant,Coffee Shop,Convenience Store,Department Store,Discount Store,Thai Restaurant,Soccer Field,Fried Chicken Joint,Hakka Restaurant,Intersection
1,"Dovercourt Village, Dufferin",Thai Restaurant,Athletics & Sports,Bakery,Bank,Caribbean Restaurant,Hakka Restaurant,Fried Chicken Joint,Discount Store,Bar,Breakfast Spot,Bus Line,Bus Station,Coffee Shop,Convenience Store,Department Store,Fast Food Restaurant,Electronics Store,Soccer Field,Intersection,Korean Restaurant
2,East Toronto,Rental Car Location,Electronics Store,Pizza Place,Mexican Restaurant,Medical Center,Breakfast Spot,Thai Restaurant,Bus Line,Coffee Shop,Caribbean Restaurant,Bus Station,Bar,Department Store,Bank,Bakery,Athletics & Sports,Convenience Store,Fast Food Restaurant,Discount Store,Soccer Field
3,"High Park, The Junction South",Playground,Thai Restaurant,Electronics Store,Athletics & Sports,Bakery,Bank,Bar,Breakfast Spot,Bus Line,Bus Station,Caribbean Restaurant,Coffee Shop,Convenience Store,Department Store,Discount Store,Fast Food Restaurant,Soccer Field,Fried Chicken Joint,Hakka Restaurant,Intersection
4,"Humber Bay Shores, Mimico South, New Toronto",Bakery,Bus Line,Fast Food Restaurant,Park,Metro Station,Bus Station,Intersection,Soccer Field,Department Store,Athletics & Sports,Bank,Bar,Breakfast Spot,Caribbean Restaurant,Coffee Shop,Convenience Store,Thai Restaurant,Discount Store,Electronics Store,Fried Chicken Joint
5,"Runnymede, Swansea",Discount Store,Coffee Shop,Convenience Store,Department Store,Thai Restaurant,Electronics Store,Athletics & Sports,Bakery,Bank,Bar,Breakfast Spot,Bus Line,Bus Station,Caribbean Restaurant,Fast Food Restaurant,Soccer Field,Fried Chicken Joint,Hakka Restaurant,Intersection,Korean Restaurant
6,"The Beaches West, India Bazaar",Coffee Shop,Korean Restaurant,Thai Restaurant,Electronics Store,Athletics & Sports,Bakery,Bank,Bar,Breakfast Spot,Bus Line,Bus Station,Caribbean Restaurant,Convenience Store,Department Store,Discount Store,Fast Food Restaurant,Soccer Field,Fried Chicken Joint,Hakka Restaurant,Intersection
7,Willowdale South,Fast Food Restaurant,Electronics Store,Athletics & Sports,Bakery,Bank,Bar,Breakfast Spot,Bus Line,Bus Station,Caribbean Restaurant,Coffee Shop,Convenience Store,Department Store,Discount Store,Thai Restaurant,Soccer Field,Fried Chicken Joint,Hakka Restaurant,Intersection,Korean Restaurant
8,Woodbine Heights,Bar,Thai Restaurant,Electronics Store,Athletics & Sports,Bakery,Bank,Breakfast Spot,Bus Line,Bus Station,Caribbean Restaurant,Coffee Shop,Convenience Store,Department Store,Discount Store,Fast Food Restaurant,Soccer Field,Fried Chicken Joint,Hakka Restaurant,Intersection,Korean Restaurant


In [139]:
# set number of clusters
kclusters = 4

toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 0, 3, 0, 0, 0, 2, 1])

In [142]:

neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = df3.copy()

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighbourhood')

toronto_merged = toronto_merged.dropna()

toronto_merged

Unnamed: 0,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,11th Most Common Venue,12th Most Common Venue,13th Most Common Venue,14th Most Common Venue,15th Most Common Venue,16th Most Common Venue,17th Most Common Venue,18th Most Common Venue,19th Most Common Venue,20th Most Common Venue
22,Willowdale South,43.77012,-79.408493,2,Fast Food Restaurant,Electronics Store,Athletics & Sports,Bakery,Bank,Bar,Breakfast Spot,Bus Line,Bus Station,Caribbean Restaurant,Coffee Shop,Convenience Store,Department Store,Discount Store,Thai Restaurant,Soccer Field,Fried Chicken Joint,Hakka Restaurant,Intersection,Korean Restaurant
36,Woodbine Heights,43.695344,-79.318389,1,Bar,Thai Restaurant,Electronics Store,Athletics & Sports,Bakery,Bank,Breakfast Spot,Bus Line,Bus Station,Caribbean Restaurant,Coffee Shop,Convenience Store,Department Store,Discount Store,Fast Food Restaurant,Soccer Field,Fried Chicken Joint,Hakka Restaurant,Intersection,Korean Restaurant
40,East Toronto,43.685347,-79.338106,0,Rental Car Location,Electronics Store,Pizza Place,Mexican Restaurant,Medical Center,Breakfast Spot,Thai Restaurant,Bus Line,Coffee Shop,Caribbean Restaurant,Bus Station,Bar,Department Store,Bank,Bakery,Athletics & Sports,Convenience Store,Fast Food Restaurant,Discount Store,Soccer Field
42,"The Beaches West, India Bazaar",43.668999,-79.315572,0,Coffee Shop,Korean Restaurant,Thai Restaurant,Electronics Store,Athletics & Sports,Bakery,Bank,Bar,Breakfast Spot,Bus Line,Bus Station,Caribbean Restaurant,Convenience Store,Department Store,Discount Store,Fast Food Restaurant,Soccer Field,Fried Chicken Joint,Hakka Restaurant,Intersection
76,"Dovercourt Village, Dufferin",43.669005,-79.442259,0,Thai Restaurant,Athletics & Sports,Bakery,Bank,Caribbean Restaurant,Hakka Restaurant,Fried Chicken Joint,Discount Store,Bar,Breakfast Spot,Bus Line,Bus Station,Coffee Shop,Convenience Store,Department Store,Fast Food Restaurant,Electronics Store,Soccer Field,Intersection,Korean Restaurant
82,"High Park, The Junction South",43.661608,-79.464763,3,Playground,Thai Restaurant,Electronics Store,Athletics & Sports,Bakery,Bank,Bar,Breakfast Spot,Bus Line,Bus Station,Caribbean Restaurant,Coffee Shop,Convenience Store,Department Store,Discount Store,Fast Food Restaurant,Soccer Field,Fried Chicken Joint,Hakka Restaurant,Intersection
84,"Runnymede, Swansea",43.651571,-79.48445,0,Discount Store,Coffee Shop,Convenience Store,Department Store,Thai Restaurant,Electronics Store,Athletics & Sports,Bakery,Bank,Bar,Breakfast Spot,Bus Line,Bus Station,Caribbean Restaurant,Fast Food Restaurant,Soccer Field,Fried Chicken Joint,Hakka Restaurant,Intersection,Korean Restaurant
88,"Humber Bay Shores, Mimico South, New Toronto",43.605647,-79.501321,0,Bakery,Bus Line,Fast Food Restaurant,Park,Metro Station,Bus Station,Intersection,Soccer Field,Department Store,Athletics & Sports,Bank,Bar,Breakfast Spot,Caribbean Restaurant,Coffee Shop,Convenience Store,Thai Restaurant,Discount Store,Electronics Store,Fried Chicken Joint
97,Doggy Style NYC,40.73249,-73.994188,0,American Restaurant,Motel,Electronics Store,Athletics & Sports,Bakery,Bank,Bar,Breakfast Spot,Bus Line,Bus Station,Caribbean Restaurant,Coffee Shop,Convenience Store,Department Store,Discount Store,Thai Restaurant,Soccer Field,Fried Chicken Joint,Hakka Restaurant,Intersection


# Results

In [132]:
#### create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Red Locations on the map above identify the best locations for the Doggy Style NYC Expansion

# Discussion

The model successfully identified optimal neighbourhoods. 

It is worth noting that differnt quantities for clusters and numbers of features were used but the model, but these numbers seem to produce the best results. 

This model and dataset could be futher improved by looking at the following data
- competition in the areas from other pet stores
- cost of rental in the areas selected by the model
to improve the viability of each location. 

Furthermore, additional features in other datasets not explored by this report could help to further refine the models and produce better results. 



# Conclusion

According to the data and the machine learning models used to cluster the data, the 5 Best locations for the Doggy Style NYC expansion are:
- East Toronto
- The Beaches West, India Bazaar
- Dovercourt Village, Dufferin
- Runnymede, Swansea
- Humber Bay Shores, Mimico South, New Toronto
