# Capstone Project - The Battle of the Neighborhoods (Week 2)
### Applied Data Science Capstone by IBM/Coursera
#### Chelsea Huang

## Table of contents
* [Introduction/Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Results & Discussion](#results)
* [Conclusion](#conclusion)

#### import necessary Libraries

In [8]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analsysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Folium installed')
print('Libraries imported.')

Folium installed
Libraries imported.


## Introduction/Business Problem<a name="introduction"></a>

Canada has an open immigration policy and a diverse and inclusive culture. Its diverse dining culture makes many restaurants with different styles or exotic flavors emerge in Canada. Toronto, the capital of Ontario, is the largest city in Canada and also one of the most diverse. But it is not easy for a restaurant to survive for a long time in a metropolis like Toronto. I am a Chinese who has lived in Toronto for four years and I found that many of my favorite restaurants have closed down after only a few years of opening. The location of the restaurant not only affects the market development capacity of the catering company, but also the size of its attractiveness to consumers, but more importantly, it has a strategic impact on the long-term benefits.

Here we are mainly trying to investigate where would be a good choice for opening a restaurant from the perspective of location selection. Therefore, the target audience would be people that are considering opening a restaurant and they wonder how to choose the best location for their business.

## Data <a name="data"></a>

Data that will be used to solve the problem:

1. Toronto neighborhood data
scraped from a Wikipedia page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

2. a csv file that has the geographical coordinates of each postal code: https://cocl.us/Geospatial_data

3. Foursquare location data, with latitude and longitude coordinates of each neighborhood.

4. Foursquare API to explore neighborhoods in Toronto.

For the Toronto neighborhood data, the data that we use is collected from a Wikipedia page, which provides all the information we need to explore and cluster the neighborhoods in Toronto.

#### data wrangling

In [35]:
data = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
df=data[0]

# Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned
df.drop(df[df['Borough']=="Not assigned"].index,axis=0, inplace=True)
df1 = df.reset_index(drop=True)

# More than one neighborhood can exist in one postal code area
df2=df1.groupby("Postal Code").agg(lambda x:','.join(x))

# If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough

df2.loc[df2['Neighbourhood']=="Not assigned",'Neighbourhood']=df2.loc[df2['Neighbourhood']=="Not assigned",'Borough']
df3 = df2.reset_index()
df3.rename(columns={'Postal Code': 'PostalCode', 'Neighbourhood': 'Neighborhood'}, inplace=True)

df3.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park"
7,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge"
8,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


First, we built a dataframe of the postal code of each neighborhood along with the borough name and neighborhood name. Then, we used a csv file that has the geographical coordinates of each postal code: https://cocl.us/Geospatial_data Finally, we merge these two dataframes together.

In [36]:
df_geo_coor = pd.read_csv("https://cocl.us/Geospatial_data")
df_geo_coor.rename(columns={'Postal Code': 'PostalCode'}, inplace=True)

df_toronto = pd.merge(df3, df_geo_coor, on = 'PostalCode')
print(df_toronto.shape)
df_toronto.head(12)

(103, 5)


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park",43.727929,-79.262029
7,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848


In [37]:
toronto_data = df_toronto.drop(['PostalCode'], axis = 1)
print(toronto_data.shape)
toronto_data.head()

(103, 4)


Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,Scarborough,Woburn,43.770992,-79.216917
4,Scarborough,Cedarbrae,43.773136,-79.239476


Also, we use the Foursquare API to explore neighborhoods in Toronto.

In [60]:
# Define Foursquare Credentials and Version
CLIENT_ID = 'HCIVIDXNSDTQLK22TG50GUXXJYWHKTIXMETS2A4ZWSTWRYND' # your Foursquare ID
CLIENT_SECRET = '5JDNAWKQ2RSHUCMVJOFCEGVSOMDGCZPALRMTUB2POND205NP' # your Foursquare Secret
VERSION = '20200801' # Foursquare API version

#### Explore Neighborhoods in Toronto

In [61]:
# define a function for getting the top 100 nearby venues with a radius of 500m

def get_venues(lat, lng, radius=500, LIMIT=100):
    
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng,
            radius, 
            LIMIT)
        
        # make the GET request
        results = requests.get(url).json()
        venue_data=results["response"]['groups'][0]['items']
        venues_list=[]
        
        for v in venue_data:
        # return only relevant information for each nearby venue
            try:
                venue_id=v['venue']['id']
                venue_name=v['venue']['name']
                venue_category=v['venue']['categories'][0]['name']
                venues_list.append([venue_id, venue_name, venue_category])
            except KeyError:
                pass

        column_names=['ID','Name','Category']
        df = pd.DataFrame(venues_list, columns=column_names)
    
        return(df)

In [62]:
# we get all the Chinese restaurant in Toronto

column_names=['Borough', 'Neighborhood', 'ID','Name']
cn_rest_to=pd.DataFrame(columns=column_names)
count=1
for row in toronto_data.values.tolist():
    Borough, Neighborhood, Latitude, Longitude=row
    venues = get_venues(Latitude,Longitude)
    cn_resturants=venues[venues['Category']=='Chinese Restaurant']  
    for resturant_detail in cn_resturants.values.tolist():
        id, name , category=resturant_detail
        cn_rest_to = cn_rest_to.append({'Borough': Borough,
                                        'Neighborhood': Neighborhood, 
                                        'ID': id,
                                        'Name' : name}, ignore_index=True)
    count+=1

KeyError: 'groups'

In [None]:
print(cn_rest_to.shape)
cn_rest_to.head()
# among all the boroughs, Scarborough has the largest number of Chinese restaurants

East York and West Toronto only have one Chinese restaurant in the database. Almost all Chinese restaurants are located in Scarborough. North York, East Toronto, Central Toronto, Downtown Toronto, Mississauga, Etobicoke have about the same number.

In [None]:
print(cn_rest_to.groupby('Neighborhood')['ID'].count().nlargest(5))

In [None]:
nei_list=['Agincourt', "Milliken, Agincourt North, Steeles East, L'Amoreaux East", "Steeles West, L'Amoreaux West", 
          'Kennedy Park, Ionview, East Birchmount Park', 'Canada Post Gateway Processing Centre']

In [None]:
df_selected = toronto_data.loc[toronto_data['Neighborhood'].isin(nei_list)]
df_selected.head()

In [None]:
address = "Toronto, ON"

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Toronto are {}, {}.'.format(latitude, longitude))

In [None]:
# Create the map centering Toronto
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, borough, neighborhood in zip(
        df_selected['Latitude'], 
        df_selected['Longitude'], 
        df_selected['Borough'], 
        df_selected['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=9,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  

map_toronto

In [63]:
def getNearbyVenues(names, latitudes, longitudes, radius=500, LIMIT=100):
    venues_list=[]
    
    for name, lat, lng in zip(names, latitudes, longitudes):
        # print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [64]:
toronto_venues = getNearbyVenues(names=df_toronto['Neighborhood'],
                                   latitudes=df_toronto['Latitude'],
                                   longitudes=df_toronto['Longitude']
                                  )

KeyError: 'groups'

In [None]:
print(toronto_venues.shape)
toronto_venues.head()

In [None]:
print('There are {} Uniques Categories.'.format(len(toronto_venues['Venue Category'].unique())))

We create a dataframe using pandas one hot encoding for the venue categories.

In [None]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

In [None]:
# let's group rows by neighborhood and by taking the mean of the frequency of occurrence of each category

toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped.head()

Next, we create a dataframe with top 10 most common venues for each neighborhood to get an overview.

In [None]:
# sort the venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# let's create the new dataframe and display the top 10 venues for each neighborhood.

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

In [None]:
# create a dataframe of top 10 categories
Venues_Top10 = toronto_venues['Venue Category'].value_counts()[0:10].to_frame(name='frequency')
Venues_Top10=Venues_Top10.reset_index()

Venues_Top10.rename(index=str, columns={"index": "Venue_Category", "frequency": "Frequency"}, inplace=True)
Venues_Top10

## Methodology  <a name="methodology"></a>

In this project we will direct our efforts on detecting areas of the Great Toronto Area (GTA) that have a high number of Chinese restaurants. 

First, we have collected the required **data: location and type (category) of every restaurant **. We have also **identified Chinese restaurants** (according to Foursquare categorization).

We will present map of all such locations but also create clusters (using **k-means clustering**) of those locations to identify general zones / neighborhoods / addresses which should be a starting point for final 'street level' exploration and search for optimal venue location by stakeholders.

### _Exploratory Data Analysis_

In [None]:
# data visualisation for Preliminary analysis
toronto_data.groupby('Borough')['Neighborhood'].count().plot(kind='bar',
                                                            figsize=(10,5),
                                                            width=0.8,
                                                            color='#20B2AA')
                                                    
plt.title('Number of Neighborhood of each Borough in Toronto', fontsize=16)
plt.legend(fontsize=14)
plt.xlabel('Borough', fontsize = 15)
plt.ylabel('Number of Neighborhood', fontsize=15)

plt.show()

In [None]:
cn_rest_to.groupby('Borough')['ID'].count().plot(kind='bar',
                                                 figsize=(10,5),
                                                 width=0.8,
                                                 color='#20B2AA')
plt.title('Number of Chinese Restaurant of each Borough in Toronto', fontsize=16)
plt.legend(fontsize=14)
plt.xlabel('Borough', fontsize = 15)
plt.ylabel('Number of Neighborhood', fontsize=15)

plt.show()

From the above barplot, we can see that Scarbrough has the largest number of Chinese restaurants. And North York ranks the second place.

In [None]:
cn_rest_to.groupby('Neighborhood')['ID'].count().nlargest(5).plot(kind='barh',
                                                 figsize=(10,5),
                                                 width=0.8,
                                                 color='#20B2AA')
plt.title('Number of Chinese Restaurant of each Borough in Toronto', fontsize=16)
plt.legend(fontsize=14)
plt.ylabel('Borough', fontsize = 15)
plt.xlabel('Number of Chinese Restaurant', fontsize=15)

plt.show()

From above, we can see that Agincourt borough has the largest number of Chinese restaurant, which is in Scarborough.

There are 267 uniques venue categories. 

In [None]:
import seaborn as sns
fig = plt.figure(figsize=(18,7))
s=sns.barplot(x="Venue_Category", y="Frequency", data=Venues_Top10)
s.set_xticklabels(s.get_xticklabels(), rotation=30)
plt.title('10 Most Frequently Occuring Venues in Toronto', fontsize=15)
plt.xlabel("Venue Category", fontsize=15)
plt.ylabel ("Frequency", fontsize=15)
plt.savefig("Most_Freq_Venues.png", dpi=300)
plt.show()

Now, it seems like coffee shops is the most frequent venue among all categories in Toronto according to Foursquare API. This makes sense since the coffee shop market is already booming in North America. For Chinese restaurants specifically, the frequency is quite low, maybe some of them have not been verified and added to Foursquare API database yet. But it can be said that there is still a lot of room for expansion in the market for Chinese restaurant.

### Clustering the neighborhoods using K-means

In [None]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

In [None]:
# sort the venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# let's create the new dataframe and display the top 10 venues for each neighborhood.

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

In [None]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = df_selected

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head() # check the last columns!

Unfortunately there is no information about rating, likes and tips for Chinese restaurants in Toronto through Foursquare API, so I cannot get any additional information. And somehow my analysis was stopped getting keyerror of groups due to unknown reason. It worked just fine at 10 am this morning and I swear my code is untouched. I cannot give specific recommendations but can only have these results below.

## Results & Discussion <a name="results"></a>

* Coffee shops, Chinese Restaurants, Sandwich Shops, French restaurants, and Wine Bars are the most common venues in our 6 preferred neighborhoods.
* Coffee shops is the most frequent venue among all categories in Toronto according to Foursquare API. For Chinese restaurants specifically, the frequency is quite low, maybe some of them have not been verified and added to Foursquare API database yet. But it can be said that there is still a lot of room for expansion in the market for Chinese restaurant.
* Clustering neighborhoods based on their most popular venues grouped.
* Scarborough has the largest number of Chinese restaurants. And North York ranks the second place. And Agincourt has the largest number of Chinese restaurants, which is a neighborhood located in Scarborough.

As outlined previously, we used Foursquare data so that we have first identified general boroughs that justify further analysis. We used K-means clustering algorithm to cluster those locations and then performed in order to create major zones of interest (containing greatest number of potential locations) and addresses of those zone centers were created to be used as starting points for final exploration by stakeholders.

There are several limitations of our study. First, the findings of the present study were limited by the Foursquare API database. If we really want to open a Chinese restaurant, we haven't considered many other factors, such as rent, passenger flow, traffic and so on. 

## Conclusion <a name="conclusion"></a>

Depend on the results of our analysis, we conclude that final decision on optimal restaurant location will be made by stakeholders based on specific characteristics of neighborhoods and locations in every recommended zone, taking into consideration additional factors like rent, passenger flow, traffic, etc.