# Capstone Project The Battle of the Neighborhoods

### Table of contents

* [Introduction](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction

Toronto is Canada's largest city, the most populous city in Canada, and home to a diverse population of about 2.9 million people.
The city is ranked as one of the top destinations around the globe. It boasts world-class restaurants, cultural attractions as varied as the cultures themselves.
Moreover, Toronto is recognized for being Canada’s commercial capital and for its excellence in a number of sectors including life sciences, technology, and education. Thus, the outstanding opportunities attract investors all around the world. 

A group of stakeholders intend to open a **Chinese restaurant** in **downtown Toronto**, the main central business district of Toronto.

The location will make an impact on succeed of the restaurant. We particularly interested in:
1)	**areas with no Chinese restaurants;**
2)  **areas which are not crowded with restaurants.**

## Data

### 2.1 Data sources

Information of Neighborhoods of Toronto can be found in a Wikipedia page.  A table in this page list postal code, borough and neighborhood name.

In week 3, the course provides a link of a csv document, through which we can obtain the georgical coordinate conveniently. 

We can use Foursquare API to get location information of neighborhoods.

### 2.2 Data cleaning

#### Get the neighborhood information from the Wikipedia page.

Firstly, we use lxml package to scrape the table from Wikipedia page.

In [1]:
import pandas as pd

In [2]:
#install lxml
! pip install lxml
print('successfully installed the package!')

successfully installed the package!


In [3]:
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
df=pd.read_html(url)[0]
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


Then, we delete the rouws,that have Borough with 'Not assigned' value; and check the rows, that have Neighborhood with 'Not assigned' value. 

In [4]:
df.drop(df[df['Borough']=='Not assigned'].index,inplace=True)

In [5]:
df[df['Neighbourhood']=='Not assigned'].count()

Postal Code      0
Borough          0
Neighbourhood    0
dtype: int64

As every Neighborhood has a valid values, we don't need to process. 
Reset index and print the shape of the Dataframe.

In [6]:
df=df.reset_index(drop=True)
df.rename(columns={'Neighbourhood':'Neighborhood'},inplace=True)
print("The size of the dataframe is",df.shape)

The size of the dataframe is (103, 3)


In [7]:
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


#### Get the geographical information (latitude and the longitude coordinates) of each neighborhood

Use the csv file to get the geographical information and load the data into a dataframe.

In [8]:
data='http://cocl.us/Geospatial_data'
df_ll=pd.read_csv(data)
df_ll.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Join the two dataframes to get a new dataframe, with neighborhood information (Postcode,Borough and Neighborhood) and the geographical coordinates.

In [9]:
df_toronto=pd.merge(df,df_ll,on='Postal Code')
df_toronto.rename(columns={'Postal Code':'PostalCode'},inplace=True)
df_toronto.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


Check the shape of the dataframe.

In [10]:
print("The size of the dataframe is",df.shape)

The size of the dataframe is (103, 3)


#### Get the venues information

Import all packages we need.

In [11]:
import numpy as np # library to handle data in a vectorized manner

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't installed
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


Our stakeholders intend to open a Chinese restaurant in downtown Toronto. So we slice the original dataframe and create a new dataframe of the Downtown Toronto data.

In [12]:
df_DT = df_toronto[df_toronto['Borough']=='Downtown Toronto']

In [13]:
df_DT=df_DT.reset_index(drop=True)
df_DT.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306


Use Foursquare API to get venues’ information in each neighborhood. We write a function to get the top 100 venues in a radius of 500 meters in every neighborhood.

In [14]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    CLIENT_ID = 'GL5S4H4TRAC2XW2SLXHB4VIJPLFYW5L1S5T2DJDXBZ2VJ25T' # your Foursquare ID
    CLIENT_SECRET = 'HNWCHMUSPKSXPPB1LLXLHMKTWN2IMAIHJY45PNGGFSCRALDX' # your Foursquare Secret
    VERSION = '20180605' # Foursquare API version
    LIMIT = 100 # A default Foursquare API limit value
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [15]:
#get venues information of all neighborhoods in Downtown Toronto
DT_venues = getNearbyVenues(names=df_DT['Neighborhood'],
                                   latitudes=df_DT['Latitude'],
                                   longitudes=df_DT['Longitude']
                                  )


Regent Park, Harbourfront
Queen's Park, Ontario Provincial Government
Garden District, Ryerson
St. James Town
Berczy Park
Central Bay Street
Christie
Richmond, Adelaide, King
Harbourfront East, Union Station, Toronto Islands
Toronto Dominion Centre, Design Exchange
Commerce Court, Victoria Hotel
University of Toronto, Harbord
Kensington Market, Chinatown, Grange Park
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport
Rosedale
Stn A PO Boxes
St. James Town, Cabbagetown
First Canadian Place, Underground city
Church and Wellesley


##### **Important remainder** ---

The Foursquare server is not steady, sometimes we fail to get the dataframe DT_vennues. After I successfully runned the code and got the DT_venues, I exported it to a CSV document. If you can't get the DT_venues, please use the CSV document.

If the Foursquare is not steady and can not get the data, Double-click **here** to use the CSV 

<!-- 
DT_venues=pd.read_csv('https://raw.githubusercontent.com/XM-Shang/Coursera_Capstone_Week3/main/DT_venues.csv')
print('The shape of DT_venues is  ',DT_venues.shape)
DT_venues.head()
--> 

##### ---*The remainder comes to an end.*

Let's check the venues dataframe.

In [16]:
print('The shape of DT_venues is  ',DT_venues.shape)
DT_venues.head()

The shape of DT_venues is   (1248, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Regent Park, Harbourfront",43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,"Regent Park, Harbourfront",43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,"Regent Park, Harbourfront",43.65426,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
3,"Regent Park, Harbourfront",43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa
4,"Regent Park, Harbourfront",43.65426,-79.360636,Impact Kitchen,43.656369,-79.35698,Restaurant


In Downtown Toronto, there are 1253 venues.

### 2.3 Feature selection

We care about the category of each venue. Let's check how many unique  categories in downtown Toronto.

In [17]:
print('There are {} unique categories.'.format(len(DT_venues['Venue Category'].unique())))

There are 212 unique categories.


In [18]:
#print every category
category_list=[]
for category in DT_venues['Venue Category'].unique():
    category_list.append(category)
    #print(category)
print(category_list)

['Bakery', 'Coffee Shop', 'Distribution Center', 'Spa', 'Restaurant', 'Pub', 'Breakfast Spot', 'Park', 'Gym / Fitness Center', 'Historic Site', 'Farmers Market', 'Chocolate Shop', 'Dessert Shop', 'Performing Arts Venue', 'Theater', 'French Restaurant', 'Yoga Studio', 'Mexican Restaurant', 'Café', 'Event Space', 'Shoe Store', 'Art Gallery', 'Electronics Store', 'Brewery', 'Bank', 'Beer Store', 'Hotel', 'Antique Shop', 'Italian Restaurant', 'Portuguese Restaurant', 'Beer Bar', 'Creperie', 'Sushi Restaurant', 'Hobby Shop', 'Diner', 'Fried Chicken Joint', 'Chinese Restaurant', 'Smoothie Shop', 'Sandwich Place', 'Gym', 'Bar', 'College Auditorium', 'College Cafeteria', 'Music Venue', 'Clothing Store', 'Comic Shop', 'Plaza', 'Burrito Place', 'Ramen Restaurant', 'Pizza Place', 'Thai Restaurant', 'Burger Joint', 'Shopping Mall', 'New American Restaurant', 'Gastropub', 'Bookstore', 'Tanning Salon', 'Fast Food Restaurant', 'Steakhouse', 'Japanese Restaurant', 'Movie Theater', 'College Rec Center'

We should focus on category 'restaurant'. However, there are so many categories related to restaurant, in another words, they are specific restaurant categories, such as 'French Restaurant', 'Mexican Restaurant', 'Portuguese Restaurant', 'Italian Restaurant'.

Those categories should be taken into consideration as well. We examined every category and extracted categories, whose name contains ‘Restaurant’. We regard them as competitor category.

In [19]:
import re

In [22]:
#extract all category, whose name contains 'Restaurant'
competitor_category=[]
temp=[]
for category in category_list:
    temp=re.findall('Restaurant',category)
    if temp==['Restaurant']:
        competitor_category.append(category)        

In [23]:
len(competitor_category)

43

In [25]:
print(competitor_category)

['Restaurant', 'French Restaurant', 'Mexican Restaurant', 'Italian Restaurant', 'Portuguese Restaurant', 'Sushi Restaurant', 'Chinese Restaurant', 'Ramen Restaurant', 'Thai Restaurant', 'New American Restaurant', 'Fast Food Restaurant', 'Japanese Restaurant', 'Middle Eastern Restaurant', 'Modern European Restaurant', 'Ethiopian Restaurant', 'Seafood Restaurant', 'Vietnamese Restaurant', 'American Restaurant', 'Latin American Restaurant', 'Vegetarian / Vegan Restaurant', 'German Restaurant', 'Comfort Food Restaurant', 'Asian Restaurant', 'Moroccan Restaurant', 'Belgian Restaurant', 'Greek Restaurant', 'Eastern European Restaurant', 'Falafel Restaurant', 'Indian Restaurant', 'Korean Restaurant', 'Colombian Restaurant', 'Mediterranean Restaurant', 'Brazilian Restaurant', 'Gluten-free Restaurant', 'Caribbean Restaurant', 'Dumpling Restaurant', 'Doner Restaurant', 'Filipino Restaurant', 'Dim Sum Restaurant', 'Molecular Gastronomy Restaurant', 'Taiwanese Restaurant', 'Theme Restaurant', 'Afg

We found, there are 43 competitor categoryies. Next step, we get all the competitor venues, and structure them into a dataframe.

In [28]:
mask=DT_venues['Venue Category'].isin(competitor_category)
DT_restaurant=DT_venues[mask]
DT_restaurant=DT_restaurant.reset_index(drop=True)
print('How many competitor venues?',DT_restaurant.shape[0])
DT_restaurant.head()

How many competitor venues? 296


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Regent Park, Harbourfront",43.65426,-79.360636,Impact Kitchen,43.656369,-79.35698,Restaurant
1,"Regent Park, Harbourfront",43.65426,-79.360636,Cluny Bistro & Boulangerie,43.650565,-79.357843,French Restaurant
2,"Regent Park, Harbourfront",43.65426,-79.360636,El Catrin,43.650601,-79.35892,Mexican Restaurant
3,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,Mercatto,43.660391,-79.387664,Italian Restaurant
4,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,Nando's,43.661728,-79.386391,Portuguese Restaurant


Let's explore further which type of restanut is most common and show the top 10 category.

In [34]:
Category_group=DT_restaurant.groupby('Venue Category').count()
Category_group.sort_values(by=['Neighborhood'],ascending=False,inplace=True)
Category_group.head(10)

Unnamed: 0_level_0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude
Venue Category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Restaurant,44,44,44,44,44,44
Japanese Restaurant,31,31,31,31,31,31
Italian Restaurant,22,22,22,22,22,22
Seafood Restaurant,20,20,20,20,20,20
Sushi Restaurant,17,17,17,17,17,17
American Restaurant,16,16,16,16,16,16
Thai Restaurant,15,15,15,15,15,15
Vegetarian / Vegan Restaurant,14,14,14,14,14,14
Asian Restaurant,9,9,9,9,9,9
Mexican Restaurant,9,9,9,9,9,9


From the result, we found that Japanese, Italian restaurant are very popular in downtown Toronto.

In [35]:
# Get the geograpical coordinate of Downtown Toronto
address='Downtown Toronto, TO'
geolocator=Nominatim(user_agent="to_explorer")
location=geolocator.geocode(address)
latitude=location.latitude
longitude=location.longitude
print('The geograpical coordinate of Downtown Toronto are {},{}.'.format(latitude,longitude))

The geograpical coordinate of Downtown Toronto are 43.6563221,-79.3809161.


Let's create a map and mark these venues.

In [36]:
#create a map of Downtown Toronto using the geograpical coordinate
map_DT=folium.Map(location=[latitude,longitude],zoom_start=11)

In [38]:
#add markers to map
for lat, lng, label in zip(DT_restaurant['Venue Latitude'],DT_restaurant['Venue Longitude'],DT_restaurant['Venue']):
    label=folium.Popup(label,parse_html=True)
    folium.CircleMarker(
    [lat,lng],
    radius=5,
    popup=label,
    color='blue',
    #color = 'red' if == else 'blue'
    fill=True,
    fill_color='#3186cc',
    fill_opacity=0.7,
    parse_html=False).add_to(map_DT)

map_DT