# Capstone Project - The Battle of the Neighborhoods 
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

The owners of a restaurant chain in **Ontario, Canada** want to expand their business.
They currently operate restaurants in cities such as **Ottawa, Brampton and Hamilton**.

They expect to make more profit by opening a restaurant in **Toronto**, as **Toronto** is the largest city in Canada. They want the new restaurant to be in a pleasant location with a good neighbourhood, but they are having trouble deciding where within Toronto to open it.

Our task is to help them choose locations where business is likely to be good, competition is low and the surrounding neighbourhood is pleasant. They would like to know about 2-3 such places so that they can decide for themselves which one is best.


## Data <a name="data"></a>

#### First Dataset: List of neighbourhoods in Toronto:

First, I will use data from a Wikipedia page that lists the neighbourhoods of Toronto, Canada. I will use the web scraping library BeautifulSoup to extract the data from this page in the form of a table.
This table contains 3 columns: Postal Code, Borough and Neighbourhood.
The link to this Wikipedia page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M .
After preprocessing the table and adding two more columns with the Latitude and Longitude of each Neighbourhood, the dataset is ready for use.
The final DataFrame has 5 columns: Postal Code, Borough, Neighbourhood, Latitude and Longitude.
It contains 103 rows covering 103 unique neighbourhoods of Toronto in 11 unique boroughs.

For example, the first row contains the borough **North York**, the neighbourhood **Parkwoods** and the postal code **M3A**. The geographical coordinates of this neighbourhood are **(43.753259, -79.329656)**.
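The scraping step itself is not shown in this notebook; the cells further down load the finished table from `data1.csv`. As a rough sketch of the idea, the following parses a small inline HTML table into rows of Postal Code, Borough and Neighbourhood. It uses the standard library's `html.parser` as a stand-in for BeautifulSoup so that it runs without extra packages. The HTML snippet and the `TableParser` class are illustrative assumptions, not the actual Wikipedia markup.

```python
from html.parser import HTMLParser

# Illustrative HTML only; the real Wikipedia page markup may differ.
SAMPLE_HTML = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collects the text of every <th>/<td> cell, one list per <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Only text inside a cell is kept; whitespace between tags is ignored.
        if self._in_cell:
            self._row[-1] += data.strip()


parser = TableParser()
parser.feed(SAMPLE_HTML)
header, *records = parser.rows
print(header)
print(records)
```

With BeautifulSoup the same result would typically come from locating the table element and iterating over its rows and cells; the structure of the loop is the same.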

#### Second Dataset: List of different venues in the neighbourhoods of Toronto:

This dataset will be formed using the Foursquare API. I will use the Foursquare location data to explore different venues in each neighbourhood of Toronto.
These venues can be any kind of place, for example parks, coffee shops, hotels or gyms.
Using the Foursquare location data, I can get information about these venues and analyse the neighbourhoods of Toronto based on this information.

The geographical coordinates from the first dataset are used to generate this location dataset.

**In general, I will be using these two datasets to solve the business problem of finding the best place to open a restaurant within Toronto**


Before we get the data and start exploring it, let's import all the dependencies that we will need.

In [573]:
# Importing libraries
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis

import json # library to handle JSON files

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import clustering algorithms
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN

import folium # map rendering library

print('Libraries imported.')


Libraries imported.


Importing the first dataset in form of a DataFrame:

In [574]:
df=pd.read_csv('data1.csv')

In [575]:
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,latitude,longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65426,-79.360636
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763
4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494


In [576]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(df['Borough'].unique()),
        df.shape[0]
    )
)

The dataframe has 11 boroughs and 103 neighborhoods.


Geographical coordinates of Toronto:

In [577]:
latitude=43.6532
longitude=-79.3832

**Creating a map of Toronto with all 103 neighbourhoods marked on this map:**

In [578]:
# create map of Toronto using latitude and longitude values:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighbourhood in zip(df['latitude'], df['longitude'], df['Borough'], df['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

**Folium** is a great visualization library. Feel free to zoom into the above map, and click on each circle mark to reveal the name of the neighborhood and its respective borough.

Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them by creating the second dataset.

#### Define Foursquare Credentials and Version

In [579]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID (keep real credentials out of shared notebooks)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100
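
One common way to keep credentials out of a shared notebook is to read them from environment variables. The sketch below assumes hypothetical variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`; the `setdefault` calls only provide dummy values so that the example runs on its own and would be removed in real use.

```python
import os

# Hypothetical environment variable names; in practice they are set outside
# the notebook. The dummy defaults exist only so this sketch is runnable.
os.environ.setdefault("FOURSQUARE_CLIENT_ID", "demo-id")
os.environ.setdefault("FOURSQUARE_CLIENT_SECRET", "demo-secret")

client_id = os.environ["FOURSQUARE_CLIENT_ID"]
client_secret = os.environ["FOURSQUARE_CLIENT_SECRET"]

# Print only whether the values are present, never the values themselves.
print(bool(client_id), bool(client_secret))
```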

### Explore different venues in different Neighborhoods of Toronto:

#### Let's create a function that fetches the nearby venues for every neighbourhood in Toronto:

In [580]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
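
To make the list comprehension in `getNearbyVenues` easier to follow, here is a minimal sketch that applies the same extraction to a hand-written dictionary shaped like the nested `response`, `groups`, `items` structure the function indexes into. The sample payload and the helper name `flatten_items` are assumptions for illustration; a real API response contains many more fields.

```python
# Hypothetical payload mimicking only the fields that getNearbyVenues reads.
sample_response = {
    "response": {
        "groups": [
            {
                "items": [
                    {
                        "venue": {
                            "name": "Brookbanks Park",
                            "location": {"lat": 43.751976, "lng": -79.33214},
                            "categories": [{"name": "Park"}],
                        }
                    },
                    {
                        "venue": {
                            "name": "KFC",
                            "location": {"lat": 43.754387, "lng": -79.333021},
                            "categories": [{"name": "Fast Food Restaurant"}],
                        }
                    },
                ]
            }
        ]
    }
}


def flatten_items(name, lat, lng, payload):
    """Return one tuple per venue, in the column order used by toronto_venues."""
    items = payload["response"]["groups"][0]["items"]
    return [
        (
            name,
            lat,
            lng,
            item["venue"]["name"],
            item["venue"]["location"]["lat"],
            item["venue"]["location"]["lng"],
            item["venue"]["categories"][0]["name"],
        )
        for item in items
    ]


rows = flatten_items("Parkwoods", 43.753259, -79.329656, sample_response)
for row in rows:
    print(row)
```

Each tuple repeats the neighbourhood name and coordinates, which is why the resulting dataframe has one row per venue rather than one row per neighbourhood.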

In [71]:
toronto_venues = getNearbyVenues(names=df['Neighbourhood'],
                                   latitudes=df['latitude'],
                                   longitudes=df['longitude']
                                  )


**toronto_venues** is a dataframe that contains all the information about different neighbourhoods of Toronto along with their nearby venues like Park, Restaurant, Coffee shop, etc. It is the second dataset that we require to solve the problem:

In [581]:
toronto_venues.head()

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,KFC,43.754387,-79.333021,Fast Food Restaurant
2,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
3,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
4,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop


In [582]:
toronto_venues.groupby('Neighbourhood').count()

Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
"Adelaide, King, Richmond",100,100,100,100,100,100
Agincourt,4,4,4,4,4,4
"Agincourt North, L'Amoreaux East, Milliken, Steeles East",3,3,3,3,3,3
"Albion Gardens, Beaumond Heights, Humbergate, Jamestown, Mount Olive, Silverstone, South Steeles, Thistletown",11,11,11,11,11,11
"Alderwood, Long Branch",10,10,10,10,10,10
"Bathurst Manor, Downsview North, Wilson Heights",18,18,18,18,18,18
Bayview Village,4,4,4,4,4,4
"Bedford Park, Lawrence Manor East",25,25,25,25,25,25
Berczy Park,55,55,55,55,55,55
"Birch Cliff, Cliffside West",4,4,4,4,4,4


### Note:
**Foursquare returned no venues for 3 neighbourhoods in the df dataframe, so these 3 neighbourhoods are missing from the toronto_venues dataframe. We therefore remove the same 3 neighbourhoods from the df dataframe as well:**

In [583]:
df.drop([5,52,95],axis=0,inplace=True)
df.reset_index(drop=True,inplace=True)
df.head(6)

Unnamed: 0,Postcode,Borough,Neighbourhood,latitude,longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65426,-79.360636
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763
4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494
5,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
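
The row labels 5, 52 and 95 used in the drop above are hard-coded. As a hedged alternative, the neighbourhoods that returned no venues could be derived by comparing the two dataframes. The sketch below uses small made-up frames (`df_demo`, `venues_demo`, and the invented neighbourhood "Nowhere Heights") rather than the real data.

```python
import pandas as pd

# Toy stand-ins for df and toronto_venues; "Nowhere Heights" is invented.
df_demo = pd.DataFrame({
    "Neighbourhood": ["Parkwoods", "Victoria Village", "Nowhere Heights"],
    "latitude": [43.753259, 43.725882, 43.7],
    "longitude": [-79.329656, -79.315572, -79.3],
})
venues_demo = pd.DataFrame({
    "Neighbourhood": ["Parkwoods", "Parkwoods", "Victoria Village"],
    "Venue": ["Brookbanks Park", "KFC", "Tim Hortons"],
})

# Flag neighbourhoods that appear at least once in the venues table.
has_venues = df_demo["Neighbourhood"].isin(venues_demo["Neighbourhood"])
missing = df_demo.loc[~has_venues, "Neighbourhood"].tolist()
print(missing)

# Keep only neighbourhoods with at least one venue and renumber the rows.
df_kept = df_demo[has_venues].reset_index(drop=True)
print(df_kept.shape[0])
```

Deriving the list this way keeps the two dataframes consistent even if a later API call returns venues for a different set of neighbourhoods.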


Preprocessing the second dataset, the **toronto_venues** dataframe, with **one-hot encoding** so that it can be clustered easily:

In [584]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighbourhood'] = toronto_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Neighbourhood,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,...,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
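
As a small illustration of what one-hot encoding does here, the sketch below encodes a three-row toy table and then counts categories per neighbourhood. The frame `venues_demo` is a stand-in for `toronto_venues`, and `dtype=int` is passed so that the indicators print as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Toy stand-in for toronto_venues with only the two columns needed here.
venues_demo = pd.DataFrame({
    "Neighbourhood": ["Parkwoods", "Parkwoods", "Victoria Village"],
    "Venue Category": ["Park", "Fast Food Restaurant", "Coffee Shop"],
})

# One column per category, 1 where the venue belongs to that category.
onehot = pd.get_dummies(
    venues_demo[["Venue Category"]], prefix="", prefix_sep="", dtype=int
)
onehot.insert(0, "Neighbourhood", venues_demo["Neighbourhood"])

# Summing the indicators gives a per-neighbourhood count of each category.
counts = onehot.groupby("Neighbourhood").sum()
print(counts)
```

Each venue contributes exactly one 1 to its row, so the grouped sums are venue counts per category.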


We are interested in venues in the 'food' category, but only those that are proper restaurants. Coffee shops, pizza places, bakeries and the like are not direct competitors, so we do not consider them. We therefore keep only venues that have 'Restaurant' in the category name, which also captures every restaurant subcategory present in a neighbourhood, for example Afghan Restaurant or Italian Restaurant. To do this, we select only the restaurant columns of the **toronto_onehot** dataframe:

In [585]:
col=['Neighbourhood']
for column in toronto_onehot.columns:
    if 'Restaurant' in column:
        col.append(column)

In [586]:
toronto_restaurants=toronto_onehot[col]
toronto_restaurants=toronto_restaurants.groupby('Neighbourhood').sum().reset_index()
toronto_restaurants.head()

Unnamed: 0,Neighbourhood,Afghan Restaurant,American Restaurant,Asian Restaurant,Belgian Restaurant,Brazilian Restaurant,Cajun / Creole Restaurant,Caribbean Restaurant,Chinese Restaurant,Colombian Restaurant,...,Restaurant,Seafood Restaurant,Southern / Soul Food Restaurant,Sushi Restaurant,Taiwanese Restaurant,Tapas Restaurant,Thai Restaurant,Theme Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant
0,"Adelaide, King, Richmond",0,4,2,0,1,0,0,0,1,...,3,1,0,2,0,0,3,0,1,0
1,Agincourt,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,"Agincourt North, L'Amoreaux East, Milliken, St...",0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Albion Gardens, Beaumond Heights, Humbergate, ...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Alderwood, Long Branch",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


**Adding a column containing the total number of restaurants in each neighbourhood. This will help us in making clusters using the K-Means clustering algorithm.**

In [587]:
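# sum only the numeric restaurant columns; the 'Neighbourhood' column holds strings
toronto_restaurants['Total']=toronto_restaurants.drop('Neighbourhood',axis=1).sum(axis=1)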
toronto_restaurants= toronto_restaurants.drop('Neighbourhood',axis=1)


**Using the K-Means clustering algorithm to group the neighbourhoods into clusters, which simplifies the analysis:**

In [588]:
# set number of clusters
kclusters = 5


# run k-means clustering
kmeans = KMeans(n_clusters=kclusters,random_state=0).fit(toronto_restaurants)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([4, 0, 0, 0, 0, 3, 0, 1, 1, 0])
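
For readers unfamiliar with `KMeans`, a minimal sketch on made-up restaurant counts follows. It does not use the notebook's data; it only shows that `labels_` holds one cluster id per input row and that the ids themselves are arbitrary. `n_init=10` is set explicitly because the default value changed between scikit-learn releases.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up per-neighbourhood counts: [Italian, Sushi, Total].
counts = np.array([
    [0, 0, 0],
    [1, 0, 1],
    [0, 1, 1],
    [6, 5, 11],
    [7, 6, 13],
    [5, 7, 12],
])

# Two well-separated groups: few restaurants versus many restaurants.
model = KMeans(n_clusters=2, random_state=0, n_init=10).fit(counts)
labels = model.labels_
print(labels)
```

Because the `Total` column is the sum of the other columns, it dominates the distances, so the clusters mostly separate neighbourhoods by how many restaurants they have.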

In [589]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Preparing a dataframe **venues_sorted** in which every neighbourhood of Toronto is listed along with its **top 5 most common venues**. This helps to characterise each cluster once the clusters are formed.
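
The next cell reads from a dataframe called `toronto_grouped`, which is not defined in the cells shown here. A plausible construction, stated purely as an assumption, is the per-neighbourhood mean of the one-hot columns, so that each value is the share of venues in that category. The sketch below builds such a frame from toy data and applies the same ranking idea as `return_most_common_venues`; the names `onehot_demo`, `grouped_demo` and `top_venues` are illustrative.

```python
import pandas as pd

# Toy one-hot table standing in for toronto_onehot.
onehot_demo = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "B", "B"],
    "Coffee Shop": [1, 1, 0, 0, 0],
    "Park": [0, 0, 1, 0, 1],
    "Pizza Place": [0, 0, 0, 1, 0],
})

# Assumption: toronto_grouped holds the mean frequency of each category.
grouped_demo = onehot_demo.groupby("Neighbourhood").mean().reset_index()


def top_venues(row, num_top_venues):
    """Skip the neighbourhood name, then rank the remaining categories."""
    categories = row.iloc[1:].astype(float).sort_values(ascending=False)
    return list(categories.index[:num_top_venues])


top_a = top_venues(grouped_demo.iloc[0], 2)
print(top_a)
```

For neighbourhood "A" the frequencies are 2/3 for Coffee Shop and 1/3 for Park, so those two categories rank first and second.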

In [610]:
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
venues_sorted = pd.DataFrame(columns=columns)
venues_sorted['Neighbourhood'] = toronto_grouped['Neighbourhood']

for ind in np.arange(toronto_grouped.shape[0]):
    venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,"Adelaide, King, Richmond",Coffee Shop,Café,American Restaurant,Bar,Steakhouse
1,Agincourt,Lounge,Sandwich Place,Breakfast Spot,Chinese Restaurant,Yoga Studio
2,"Agincourt North, L'Amoreaux East, Milliken, St...",Park,Playground,Asian Restaurant,Yoga Studio,Drugstore
3,"Albion Gardens, Beaumond Heights, Humbergate, ...",Grocery Store,Fast Food Restaurant,Pizza Place,Sandwich Place,Beer Store
4,"Alderwood, Long Branch",Pizza Place,Coffee Shop,Skating Rink,Dance Studio,Pharmacy


In [611]:
# add clustering labels
venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

After adding cluster labels to **venues_sorted** dataframe:

In [612]:
venues_sorted.head()

Unnamed: 0,Cluster Labels,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,4,"Adelaide, King, Richmond",Coffee Shop,Café,American Restaurant,Bar,Steakhouse
1,0,Agincourt,Lounge,Sandwich Place,Breakfast Spot,Chinese Restaurant,Yoga Studio
2,0,"Agincourt North, L'Amoreaux East, Milliken, St...",Park,Playground,Asian Restaurant,Yoga Studio,Drugstore
3,0,"Albion Gardens, Beaumond Heights, Humbergate, ...",Grocery Store,Fast Food Restaurant,Pizza Place,Sandwich Place,Beer Store
4,0,"Alderwood, Long Branch",Pizza Place,Coffee Shop,Skating Rink,Dance Studio,Pharmacy


Creating a dataframe **toronto_merged**, by merging two dataframes: **df** and **venues_sorted**. 

In [613]:

toronto_merged = df

toronto_merged = toronto_merged.join(venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

toronto_merged.head(10) 

Unnamed: 0,Postcode,Borough,Neighbourhood,latitude,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,0,Fast Food Restaurant,Park,Food & Drink Shop,Dumpling Restaurant,Diner
1,M4A,North York,Victoria Village,43.725882,-79.315572,0,Intersection,Coffee Shop,Hockey Arena,Portuguese Restaurant,Drugstore
2,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65426,-79.360636,3,Coffee Shop,Pub,Bakery,Park,Theater
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763,0,Clothing Store,Furniture / Home Store,Women's Store,Coffee Shop,Fraternity House
4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494,1,Coffee Shop,Sushi Restaurant,Gym,Japanese Restaurant,Park
5,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353,0,Fast Food Restaurant,Print Shop,Dessert Shop,Diner,Discount Store
6,M3B,North York,Don Mills North,43.745906,-79.352188,0,Japanese Restaurant,Gym / Fitness Center,Caribbean Restaurant,Baseball Field,Café
7,M4B,East York,"Woodbine Gardens, Parkview Hill",43.706397,-79.309937,0,Fast Food Restaurant,Pizza Place,Pharmacy,Athletics & Sports,Gastropub
8,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937,4,Coffee Shop,Clothing Store,Café,Cosmetics Shop,Middle Eastern Restaurant
9,M6B,North York,Glencairn,43.709577,-79.445073,0,Pizza Place,Park,Japanese Restaurant,Pub,Yoga Studio


**Creating a map of Toronto showing all 100 neighbourhoods, with different colours representing the neighbourhoods that belong to different clusters:**

In [614]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['latitude'], toronto_merged['longitude'], toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### Cluster-wise segmentation of the main dataset, the toronto_merged dataframe:

In [615]:
df0=toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
df0.head()

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,North York,0,Fast Food Restaurant,Park,Food & Drink Shop,Dumpling Restaurant,Diner
1,North York,0,Intersection,Coffee Shop,Hockey Arena,Portuguese Restaurant,Drugstore
3,North York,0,Clothing Store,Furniture / Home Store,Women's Store,Coffee Shop,Fraternity House
5,Scarborough,0,Fast Food Restaurant,Print Shop,Dessert Shop,Diner,Discount Store
6,North York,0,Japanese Restaurant,Gym / Fitness Center,Caribbean Restaurant,Baseball Field,Café


In [616]:
df1=toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
df1.head()

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
4,Queen's Park,1,Coffee Shop,Sushi Restaurant,Gym,Japanese Restaurant,Park
19,Downtown Toronto,1,Coffee Shop,Cocktail Bar,Bakery,Cheese Shop,Café
32,North York,1,Clothing Store,Fast Food Restaurant,Coffee Shop,Restaurant,Kids Store
35,Downtown Toronto,1,Coffee Shop,Aquarium,Hotel,Café,Restaurant
40,East Toronto,1,Greek Restaurant,Coffee Shop,Ice Cream Shop,Italian Restaurant,Furniture / Home Store


In [617]:
df2=toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
df2.head()

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
23,Downtown Toronto,2,Coffee Shop,Italian Restaurant,Café,Sandwich Place,Middle Eastern Restaurant
82,Downtown Toronto,2,Café,Vegetarian / Vegan Restaurant,Bakery,Coffee Shop,Mexican Restaurant
96,Downtown Toronto,2,Japanese Restaurant,Coffee Shop,Sushi Restaurant,Restaurant,Gay Bar


In [618]:
df3=toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
df3.head()

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
2,Downtown Toronto,3,Coffee Shop,Pub,Bakery,Park,Theater
12,North York,3,Gym,Beer Store,Coffee Shop,Asian Restaurant,Chinese Restaurant
22,East York,3,Sporting Goods Shop,Coffee Shop,Grocery Store,Sushi Restaurant,Burger Joint
25,Scarborough,3,Hakka Restaurant,Lounge,Fried Chicken Joint,Athletics & Sports,Bakery
27,North York,3,Coffee Shop,Deli / Bodega,Fast Food Restaurant,Bank,Supermarket


In [619]:
df4=toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
df4.head()

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
8,Downtown Toronto,4,Coffee Shop,Clothing Store,Café,Cosmetics Shop,Middle Eastern Restaurant
14,Downtown Toronto,4,Café,Coffee Shop,Restaurant,Hotel,Bakery
29,Downtown Toronto,4,Coffee Shop,Café,American Restaurant,Bar,Steakhouse
36,West Toronto,4,Bar,Men's Store,Asian Restaurant,Coffee Shop,Restaurant
41,Downtown Toronto,4,Coffee Shop,Hotel,Café,Restaurant,Gastropub


## Analysis: <a name="analysis"></a>

In [620]:
print('Total number of neighbourhoods in cluster 0 is',toronto_restaurants.loc[df0.index,:].shape[0])
print('Total number of restaurants in this cluster is', toronto_restaurants.loc[df0.index,:]['Total'].sum())
print('Ratio of Restaurant/Neighbourhood in this cluster is',(toronto_restaurants.loc[df0.index,:]['Total'].sum()/toronto_restaurants.loc[df0.index,:].shape[0]) )

Total number of neighbourhoods in cluster 0 is 60
Total number of restaurants in this cluster is 299
Ratio of Restaurant/Neighbourhood in this cluster is 4.983333333333333


In [621]:
print('Total number of neighbourhoods in cluster 1 is',toronto_restaurants.loc[df1.index,:].shape[0])
print('Total number of restaurants in this cluster is', toronto_restaurants.loc[df1.index,:]['Total'].sum())
print('Ratio of Restaurant/Neighbourhood in this cluster is',(toronto_restaurants.loc[df1.index,:]['Total'].sum()/toronto_restaurants.loc[df1.index,:].shape[0]) )

Total number of neighbourhoods in cluster 1 is 12
Total number of restaurants in this cluster is 89
Ratio of Restaurant/Neighbourhood in this cluster is 7.416666666666667


In [622]:
print('Total number of neighbourhoods in cluster 2 is',toronto_restaurants.loc[df2.index,:].shape[0])
print('Total number of restaurants in this cluster is', toronto_restaurants.loc[df2.index,:]['Total'].sum())
print('Ratio of Restaurant/Neighbourhood in this cluster is',(toronto_restaurants.loc[df2.index,:]['Total'].sum()/toronto_restaurants.loc[df2.index,:].shape[0]) )

Total number of neighbourhoods in cluster 2 is 3
Total number of restaurants in this cluster is 23
Ratio of Restaurant/Neighbourhood in this cluster is 7.666666666666667


In [623]:
print('Total number of neighbourhoods in cluster 3 is',toronto_restaurants.loc[df3.index,:].shape[0])
print('Total number of restaurants in this cluster is', toronto_restaurants.loc[df3.index,:]['Total'].sum())
print('Ratio of Restaurant/Neighbourhood in this cluster is',(toronto_restaurants.loc[df3.index,:]['Total'].sum()/toronto_restaurants.loc[df3.index,:].shape[0]) )

Total number of neighbourhoods in cluster 3 is 17
Total number of restaurants in this cluster is 87
Ratio of Restaurant/Neighbourhood in this cluster is 5.117647058823529


In [624]:
print('Total number of neighbourhoods in cluster 4 is',toronto_restaurants.loc[df4.index,:].shape[0])
print('Total number of restaurants in this cluster is', toronto_restaurants.loc[df4.index,:]['Total'].sum())
print('Ratio of Restaurant/Neighbourhood in this cluster is',(toronto_restaurants.loc[df4.index,:]['Total'].sum()/toronto_restaurants.loc[df4.index,:].shape[0]) )

Total number of neighbourhoods in cluster 4 is 8
Total number of restaurants in this cluster is 31
Ratio of Restaurant/Neighbourhood in this cluster is 3.875
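
The five cells above repeat the same calculation for each cluster. A more compact route, sketched here on made-up numbers, is to attach the cluster label to each neighbourhood's restaurant total by name and then aggregate with `groupby`. Matching on the `Neighbourhood` column rather than on row position matters because `toronto_restaurants` comes out of a `groupby` and is therefore ordered alphabetically, while `toronto_merged` follows the order of `df`, so the same integer index may not refer to the same neighbourhood in both. The frames `totals_demo` and `labels_demo` are illustrative.

```python
import pandas as pd

# Made-up restaurant totals, in alphabetical order like a groupby result.
totals_demo = pd.DataFrame({
    "Neighbourhood": ["Agincourt", "Berczy Park", "Parkwoods", "Rouge"],
    "Total": [1, 9, 1, 2],
})
# Made-up cluster labels, deliberately in a different row order.
labels_demo = pd.DataFrame({
    "Neighbourhood": ["Parkwoods", "Rouge", "Agincourt", "Berczy Park"],
    "Cluster Labels": [0, 0, 0, 1],
})

# Align on the neighbourhood name, not on the integer index.
merged = labels_demo.merge(totals_demo, on="Neighbourhood", how="left")

# One row per cluster: number of neighbourhoods and number of restaurants.
summary = merged.groupby("Cluster Labels")["Total"].agg(
    neighbourhoods="count", restaurants="sum"
)
summary["ratio"] = summary["restaurants"] / summary["neighbourhoods"]
print(summary)
```

The `ratio` column corresponds to the Restaurant/Neighbourhood ratio printed by the cells above, computed for all clusters in one pass.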


### Note: The Restaurant/Neighbourhood ratio is lowest for cluster 4 (3.875), so we will further analyse only the neighbourhoods belonging to cluster 4.

In [625]:
toronto_restaurants.loc[df4.index,:]

Unnamed: 0,Afghan Restaurant,American Restaurant,Asian Restaurant,Belgian Restaurant,Brazilian Restaurant,Cajun / Creole Restaurant,Caribbean Restaurant,Chinese Restaurant,Colombian Restaurant,Comfort Food Restaurant,...,Seafood Restaurant,Southern / Soul Food Restaurant,Sushi Restaurant,Taiwanese Restaurant,Tapas Restaurant,Thai Restaurant,Theme Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Total
8,0,0,0,1,0,0,0,0,0,1,...,2,0,0,0,0,1,0,1,0,12
14,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
29,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
36,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
41,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
47,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
90,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
94,0,0,0,0,0,0,0,0,0,0,...,0,0,3,0,0,0,0,0,1,15


As we can see, the first and last rows have a very high total number of restaurants (12 and 15), so we remove these two neighbourhoods from the df4 dataframe:

In [626]:
df4.drop([8,94],axis=0,inplace=True)

In [628]:
toronto_restaurants.loc[df4.index,:]

Unnamed: 0,Afghan Restaurant,American Restaurant,Asian Restaurant,Belgian Restaurant,Brazilian Restaurant,Cajun / Creole Restaurant,Caribbean Restaurant,Chinese Restaurant,Colombian Restaurant,Comfort Food Restaurant,...,Seafood Restaurant,Southern / Soul Food Restaurant,Sushi Restaurant,Taiwanese Restaurant,Tapas Restaurant,Thai Restaurant,Theme Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Total
14,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
29,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
36,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
41,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
47,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
90,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2


In [629]:
toronto_merged.loc[df4.index,:]

Unnamed: 0,Postcode,Borough,Neighbourhood,latitude,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
14,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,4,Café,Coffee Shop,Restaurant,Hotel,Bakery
29,M5H,Downtown Toronto,"Adelaide, King, Richmond",43.650571,-79.384568,4,Coffee Shop,Café,American Restaurant,Bar,Steakhouse
36,M6J,West Toronto,"Little Portugal, Trinity",43.647927,-79.41975,4,Bar,Men's Store,Asian Restaurant,Coffee Shop,Restaurant
41,M5K,Downtown Toronto,"Design Exchange, Toronto Dominion Centre",43.647177,-79.381576,4,Coffee Shop,Hotel,Café,Restaurant,Gastropub
47,M5L,Downtown Toronto,"Commerce Court, Victoria Hotel",43.648198,-79.379817,4,Coffee Shop,Café,Hotel,Restaurant,Italian Restaurant
90,M5W,Downtown Toronto,Stn A PO Boxes 25 The Esplanade,43.646435,-79.374846,4,Coffee Shop,Restaurant,Café,Hotel,Beer Bar


In the table above, the neighbourhoods with index 36 and 47 each have two restaurant categories among their five most common venues, so they are less suitable for a new restaurant. We therefore remove these rows from the df4 dataframe:

In [630]:
df4.drop([36,47],axis=0,inplace=True)

In [632]:
toronto_merged.loc[df4.index,:]

Unnamed: 0,Postcode,Borough,Neighbourhood,latitude,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
14,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,4,Café,Coffee Shop,Restaurant,Hotel,Bakery
29,M5H,Downtown Toronto,"Adelaide, King, Richmond",43.650571,-79.384568,4,Coffee Shop,Café,American Restaurant,Bar,Steakhouse
41,M5K,Downtown Toronto,"Design Exchange, Toronto Dominion Centre",43.647177,-79.381576,4,Coffee Shop,Hotel,Café,Restaurant,Gastropub
90,M5W,Downtown Toronto,Stn A PO Boxes 25 The Esplanade,43.646435,-79.374846,4,Coffee Shop,Restaurant,Café,Hotel,Beer Bar
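
The manual inspection above can also be expressed as a rule. The sketch below re-types the top-five venue lists of the six cluster 4 neighbourhoods that remained before this step and counts, for each, how many entries contain the word 'Restaurant'. Rows with more than one such entry are the ones that were dropped. The threshold of one and the helper name `restaurant_hits` are assumptions about how such a rule might be written.

```python
# Top-five venue categories copied from the cluster 4 table, keyed by row index.
top_venues = {
    14: ["Café", "Coffee Shop", "Restaurant", "Hotel", "Bakery"],
    29: ["Coffee Shop", "Café", "American Restaurant", "Bar", "Steakhouse"],
    36: ["Bar", "Men's Store", "Asian Restaurant", "Coffee Shop", "Restaurant"],
    41: ["Coffee Shop", "Hotel", "Café", "Restaurant", "Gastropub"],
    47: ["Coffee Shop", "Café", "Hotel", "Restaurant", "Italian Restaurant"],
    90: ["Coffee Shop", "Restaurant", "Café", "Hotel", "Beer Bar"],
}


def restaurant_hits(categories):
    """Count how many category names mention 'Restaurant'."""
    return sum("Restaurant" in name for name in categories)


hits = {idx: restaurant_hits(cats) for idx, cats in top_venues.items()}

# Assumed rule: drop a neighbourhood when more than one top venue is a restaurant.
to_drop = [idx for idx, n in hits.items() if n > 1]
kept = [idx for idx in top_venues if idx not in to_drop]
print(hits)
print(to_drop, kept)
```

Applied to these six rows, the rule selects the same two indices that the notebook removes by hand.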


**The above neighbourhoods look well suited for opening a restaurant. We therefore store the information on these 4 neighbourhoods in a dataframe named final:**

In [631]:
final=toronto_merged.loc[df4.index,'Postcode':'longitude']
final

Unnamed: 0,Postcode,Borough,Neighbourhood,latitude,longitude
14,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
29,M5H,Downtown Toronto,"Adelaide, King, Richmond",43.650571,-79.384568
41,M5K,Downtown Toronto,"Design Exchange, Toronto Dominion Centre",43.647177,-79.381576
90,M5W,Downtown Toronto,Stn A PO Boxes 25 The Esplanade,43.646435,-79.374846


**Visualising these 4 neighbourhoods on a map:**

In [640]:
# create map of Toronto using latitude and longitude values:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=15)

# add markers to map
for lat, lng, borough, neighbourhood in zip(final['latitude'], final['longitude'], final['Borough'], final['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=9,
        popup=label,
        color='blue',
        fill=True,
        fill_color='blue',
        fill_opacity=1,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

### The 4 neighbourhoods are depicted by 4 blue dots in the above map.

## Results and Discussion <a name="results"></a>

Our analysis shows that although there is a great number of restaurants in Toronto, there are pockets of low restaurant density fairly close to the city centre. To identify these pockets, we used a clustering algorithm and segmented our neighbourhood dataset accordingly.

We used the K-means clustering algorithm to form 5 clusters of neighbourhoods based on the number of restaurants in their vicinity. We then analysed each cluster by calculating its Restaurant/Neighbourhood ratio. Cluster 4 had the lowest ratio, which means that on average few restaurants are present in the vicinity of each of its neighbourhoods. Cluster 4 contained 8 neighbourhoods. We removed 2 of them because of their high restaurant totals (12 and 15), and further analysis of the remaining 6 showed that 2 more were not good candidates for a new restaurant. This left 4 neighbourhoods.

According to our analysis, there are 4 neighbourhoods where a restaurant business is likely to do well, for two reasons. First, these neighbourhoods do not have many restaurants in their vicinity, which lowers the competition. Second, as the map above shows, these 4 neighbourhoods lie in the centre of Toronto, which suggests a high population density and therefore more potential customers and more profit.

The final 4 neighbourhoods are stored in a dataframe named final, which contains the postcode, borough, neighbourhood name, latitude and longitude of each.

The owners can then choose among these 4 locations according to the type of restaurant they plan to open.

## Conclusion <a name="conclusion"></a>

The purpose of this project was to identify neighbourhoods in Toronto with a low number of restaurants, in order to help stakeholders narrow down the search for an optimal location for a new restaurant. By calculating the distribution of restaurants from Foursquare data, we first identified the most common nearby venues of each neighbourhood. Then, with the help of clustering and further analysis, we narrowed the search down to 4 neighbourhoods that look suitable for opening a new restaurant. This concludes the Battle of the Neighbourhoods project.