# Coursera IBM Data Science Professional Capstone Project
## Capstone Project - The Battle of the Neighborhoods (Week 2)
<p>
This notebook contains the capstone project that fulfills part of the requirements of the IBM Data Science Professional Course offered by Coursera.

### Problem Statement
In this project, we are working for a relocation firm that helps people move from Toronto, Ontario, Canada to Manhattan, New York, USA. To help customers ease into their new environment, we want to find Manhattan neighborhoods that match well with their existing neighborhood in Toronto.

In [1]:
import pandas as pd
import numpy as np

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe


In [2]:
# @hidden_cell
My_secret = 12345
CLIENT_ID = ''
CLIENT_SECRET = ''
VERSION = '20180605' # Foursquare API version

## Part 1 Web Scraping Toronto Neighbourhoods
This section scrapes the Wikipedia page that lists the Toronto neighbourhoods by postal code.

In [3]:
!pip install bs4

Collecting bs4
  Downloading https://files.pythonhosted.org/packages/10/ed/7e8b97591f6f456174139ec089c769f89a94a1a4025fe967691de971f314/bs4-0.0.1.tar.gz
Collecting beautifulsoup4 (from bs4)
  Downloading https://files.pythonhosted.org/packages/66/25/ff030e2437265616a1e9b25ccc864e0371a0bc3adb7c5a404fd661c6f4f6/beautifulsoup4-4.9.1-py3-none-any.whl (115kB)
Collecting soupsieve>1.2 (from beautifulsoup4->bs4)
  Downloading https://files.pythonhosted.org/packages/6f/8f/457f4a5390eeae1cc3aeab89deb7724c965be841ffca6cfca9197482e470/soupsieve-2.0.1-py3-none-any.whl
Building wheels for collected packages: bs4
  Building wheel for bs4 (setup.py) ... done
  Stored in directory: /home/jupyterlab/.cache/pip/wheels/a0/b0/b2/4f80b9456b87abedbc0bf2d52235414c3467d8889be38dd472
Successfully built bs4
Installing collected packages: soupsieve, beautifulsoup4, bs4
Successfully installed beautifulsoup4-4.9.1 bs4-0.0.

In [4]:
# we shall use BeautifulSoup to scrape the website
import requests
from bs4 import BeautifulSoup

web = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = requests.get(web)

soup = BeautifulSoup(page.content, "html.parser")
table = soup.find('table')
rows = table.find_all('tr')

# The first row holds the column headers
hlist = []
for h in rows[0].find_all('th'):
    t = h.text.split('\n') # remove the newline at the end
    hlist.append(t[0])

# collect one dict per table row, then build the DataFrame in one go
tbl_list = []

for r in rows[1:]:
    clist = []
    for c in r.find_all('td'):
        clist.append(c.text.split('\n')[0])
    t={hlist[0]:clist[0], hlist[1]:clist[1], hlist[2]:clist[2]}
    tbl_list.append(t)

df = pd.DataFrame(tbl_list)
# re-order the columns to match the table header
df = df[hlist]
print(df.shape)
df.head()

(180, 3)


Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


Sanity check: confirm that there are no unassigned neighborhoods and no duplicate postal codes. It seems the Wikipedia page has since been cleaned of such errors.

In [5]:
# drop rows whose borough is not assigned; copy so later in-place edits do not touch df
df1 = df[df['Borough']!= 'Not assigned'].copy()

print("Check that no neighborhoods are unassigned:", (df1['Neighborhood']=='Not assigned').sum())

print("Check following postal codes are not unique")
pc = df1['Postal Code'].value_counts()
# print every postal code that occurs more than once (prints nothing if all are unique)
for code in pc[pc > 1].index:
    print(code)

Check that no neighborhoods are unassigned: 0
Check following postal codes are not unique


## Part 2 Adding the Longitude and Latitude
I choose to use the CSV data instead of the unreliable geocoder.  We can always update the CSV file if necessary, though the neighborhoods are not likely to change with time.

In [6]:
dfll = pd.read_csv('http://cocl.us/Geospatial_data')

# index both tables by postal code, then join them side by side on that index
dfll.set_index('Postal Code', inplace=True)
df1.set_index('Postal Code', drop=True, inplace=True)
df_result = pd.concat([df1, dfll], sort=True, axis=1)
df_result.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
M1G,Scarborough,Woburn,43.770992,-79.216917
M1H,Scarborough,Cedarbrae,43.773136,-79.239476


### For each neighbourhood, find the top 5 categories of venues within.

In [7]:
LIMIT = 100
df_toronto = df_result[df_result['Borough'].str.contains('Toronto')]
print(df_toronto.shape)


(39, 4)


In [8]:
CategoryIdMap = {}

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    """Query Foursquare for the venues near each neighborhood centre.

    Side effect: records the id of every venue category seen in CategoryIdMap.
    """
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):

        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)

        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']

        # keep only the relevant information for each nearby venue
        for v in results:
            category = v['venue']['categories'][0]
            venues_list.append((
                name, 
                lat, 
                lng, 
                v['venue']['name'], 
                v['venue']['location']['lat'], 
                v['venue']['location']['lng'],  
                category['name']))
            # remember the category id so we can search Manhattan by category later
            CategoryIdMap[category['name']] = category['id']

    nearby_venues = pd.DataFrame(venues_list, columns=[
                  'Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category'])

    return nearby_venues

In [9]:
# LIMIT and df_toronto were already defined in In [7] above
tor_venues = getNearbyVenues(names=df_toronto['Neighborhood'],
                             latitudes=df_toronto['Latitude'],
                             longitudes=df_toronto['Longitude'])


In [10]:
# initial analysis of the data
c =tor_venues['Venue Category'].unique()
print('There are ', len(c), " unique categories")
t = tor_venues.groupby('Neighborhood').count()
t[['Venue Category']].describe()

There are  233  unique categories


Unnamed: 0,Venue Category
count,39.0
mean,41.589744
std,33.520393
min,2.0
25%,16.0
50%,35.0
75%,62.0
max,100.0


From the above, the number of venues per neighbourhood varies greatly, from 2 to 100. The upper value is the cap set by LIMIT, so the true count may be higher. The median neighbourhood has 35 venues. There are 233 unique categories in total in the Foursquare results. We shall focus on the top 5 categories for each neighborhood in Toronto.
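
The per-neighbourhood category frequencies used for the top-5 ranking can be obtained by one-hot encoding the category column and averaging per neighbourhood. A minimal sketch on made-up data follows; the `toy` frame and its rows are illustrative assumptions, not Foursquare results.

```python
import pandas as pd

# Hypothetical venues: neighbourhood A has 4 venues, B has 2
toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Bakery', 'Bakery', 'Park', 'Gym', 'Park', 'Park'],
})

# One indicator column per category, then the mean per neighbourhood
onehot = pd.get_dummies(toy['Venue Category']).astype(float)
freq = onehot.groupby(toy['Neighborhood']).mean()

# Top 2 categories for neighbourhood A, by share of its venues
top2_A = freq.loc['A'].sort_values(ascending=False).head(2)
print(freq)
print(top2_A)
```

Each row of `freq` sums to 1, so an entry is the share of that neighbourhood's venues that fall in the category.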

In [11]:
tor_onehot = pd.get_dummies(tor_venues[['Venue Category']], prefix="", prefix_sep="")
tor_onehot.insert(0,'Toronto Neighborhood', tor_venues['Neighborhood']) # use a different column name because there is a venue category named "Neighborhood"

tor_grouped = tor_onehot.groupby('Toronto Neighborhood').mean().reset_index()
tor_grouped

Unnamed: 0,Toronto Neighborhood,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Women's Store,Yoga Studio
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.017241,0.0,0.0,0.0,0.0,0.0,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Business reply mail Processing Centre, South C...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824
3,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.058824,0.058824,0.058824,0.117647,0.176471,0.117647,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.015385,0.0,0.0,0.015385,0.0,0.0,0.015385
5,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Church and Wellesley,0.012987,0.0,0.0,0.0,0.0,0.0,0.0,0.012987,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.012987,0.0,0.025974
7,"Commerce Court, Victoria Hotel",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,...,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.0,0.0
8,Davisville,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.028571,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Davisville North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [12]:
def return_most_common_venues(row, num_top_venues):
    """Return the top categories followed by their weights, as one Series."""
    row_categories = row.iloc[1:]  # skip the neighborhood name
    row_categories_sorted = row_categories.sort_values(ascending=False)
    t_ratings = row_categories_sorted.iloc[0:num_top_venues]
    # copy so that blanking entries below does not modify the Series index in place
    t_categories = row_categories_sorted.index.values[0:num_top_venues].copy()
    for i in range(0, num_top_venues):
        if t_ratings.iloc[i] == 0:
            t_categories[i] = None  # fewer than num_top_venues categories present
    t_results = pd.concat([pd.Series(t_categories), t_ratings], axis=0, ignore_index=True)
    return t_results

In [13]:
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Toronto Neighborhood']
nth_venues = []
nth_wt = []
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
        nth_venues.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # 4th and beyond take the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))
        nth_venues.append('{}th Most Common Venue'.format(ind+1))

for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Venue Weight'.format(ind+1, indicators[ind]))
        nth_wt.append('{}{} Venue Weight'.format(ind+1, indicators[ind]))
    except IndexError:  # 4th and beyond take the 'th' suffix
        columns.append('{}th Venue Weight'.format(ind+1))
        nth_wt.append('{}th Venue Weight'.format(ind+1))

# create a new dataframe
tor_hoods_venues_sorted = pd.DataFrame(columns=columns)
tor_hoods_venues_sorted['Toronto Neighborhood'] = tor_grouped['Toronto Neighborhood']

for ind in np.arange(tor_grouped.shape[0]):
    t_mcv = return_most_common_venues(tor_grouped.iloc[ind, :], num_top_venues)
    tor_hoods_venues_sorted.iloc[ind, 1:] = t_mcv.values


In [14]:
# Find the total weight and normalize the individual weights by the total
tor_hoods_venues_sorted["Total Wt"] = tor_hoods_venues_sorted.iloc[:, 6:11].sum(axis=1)
tor_hoods_venues_sorted.sort_values(by="Total Wt", axis=0, inplace=True)
for w in nth_wt:
    tor_hoods_venues_sorted[w] = tor_hoods_venues_sorted[w]/tor_hoods_venues_sorted['Total Wt']
    
tor_hoods_venues_sorted.head(10)

Unnamed: 0,Toronto Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,1st Venue Weight,2nd Venue Weight,3rd Venue Weight,4th Venue Weight,5th Venue Weight,Total Wt
29,St. James Town,Café,Coffee Shop,Cocktail Bar,Restaurant,American Restaurant,0.263158,0.263158,0.157895,0.157895,0.157895,0.2375
0,Berczy Park,Coffee Shop,Cocktail Bar,Seafood Restaurant,Bakery,Restaurant,0.357143,0.214286,0.142857,0.142857,0.142857,0.241379
31,Stn A PO Boxes,Coffee Shop,Café,Seafood Restaurant,Cocktail Bar,Restaurant,0.416667,0.166667,0.166667,0.125,0.125,0.247423
13,"Garden District, Ryerson",Clothing Store,Coffee Shop,Bubble Tea Shop,Middle Eastern Restaurant,Café,0.36,0.28,0.12,0.12,0.12,0.25
30,"St. James Town, Cabbagetown",Coffee Shop,Pizza Place,Pub,Italian Restaurant,Bakery,0.25,0.25,0.166667,0.166667,0.166667,0.272727
6,Church and Wellesley,Coffee Shop,Sushi Restaurant,Japanese Restaurant,Restaurant,Gay Bar,0.285714,0.238095,0.190476,0.142857,0.142857,0.272727
25,"Richmond, Adelaide, King",Coffee Shop,Restaurant,Café,Gym,Hotel,0.37037,0.185185,0.185185,0.148148,0.111111,0.287234
17,"Kensington Market, Chinatown, Grange Park",Café,Bakery,Mexican Restaurant,Vietnamese Restaurant,Coffee Shop,0.294118,0.176471,0.176471,0.176471,0.176471,0.288136
14,"Harbourfront East, Union Station, Toronto Islands",Coffee Shop,Aquarium,Hotel,Café,Scenic Lookout,0.448276,0.172414,0.137931,0.137931,0.103448,0.29
2,"Business reply mail Processing Centre, South C...",Yoga Studio,Skate Park,Auto Workshop,Brewery,Burrito Place,0.2,0.2,0.2,0.2,0.2,0.294118


In [15]:
tor_hoods_venues_sorted.tail(10)

Unnamed: 0,Toronto Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,1st Venue Weight,2nd Venue Weight,3rd Venue Weight,4th Venue Weight,5th Venue Weight,Total Wt
10,"Dufferin, Dovercourt Village",Bakery,Pharmacy,Bank,Supermarket,Bar,0.285714,0.285714,0.142857,0.142857,0.142857,0.5
3,"CN Tower, King and Spadina, Railway Lands, Har...",Airport Service,Airport Lounge,Airport Terminal,Boutique,Coffee Shop,0.333333,0.222222,0.222222,0.111111,0.111111,0.529412
9,Davisville North,Gym,Food & Drink Shop,Sandwich Place,Hotel,Dog Run,0.2,0.2,0.2,0.2,0.2,0.625
5,Christie,Grocery Store,Café,Park,Baby Store,Nightclub,0.363636,0.272727,0.181818,0.0909091,0.0909091,0.6875
26,Rosedale,Park,Playground,Trail,,,0.5,0.25,0.25,0.0,0.0,1.0
27,Roselawn,Music Venue,Garden,,,,0.5,0.5,0.0,0.0,0.0,1.0
12,"Forest Hill North & West, Forest Hill Road Park",Park,Trail,Jewelry Store,Sushi Restaurant,,0.25,0.25,0.25,0.25,0.0,1.0
35,The Beaches,Trail,Neighborhood,Health Food Store,Pub,,0.25,0.25,0.25,0.25,0.0,1.0
20,"Moore Park, Summerhill East",Gym,Trail,,,,0.5,0.5,0.0,0.0,0.0,1.0
18,Lawrence Park,Park,Swim School,Bus Line,,,0.333333,0.333333,0.333333,0.0,0.0,1.0


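Looking at the total weight, 6 neighborhoods have a 'Total Wt' of 1.0, meaning every venue found there falls within the top 5 categories; these neighborhoods have only a handful of venues. Parks, trails, or gardens appear among their top categories. Plotting the total weight on a map shows how it varies across the city.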
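
To make the 'Total Wt' column concrete: it is the sum of the top-5 category frequencies, and each individual weight is then divided by it. A small sketch in plain Python, using made-up counts (the `counts` dictionary is a hypothetical example, not data from this notebook):

```python
# Hypothetical venue counts by category for one neighborhood (20 venues in total)
counts = {'Coffee Shop': 5, 'Restaurant': 3, 'Bakery': 2, 'Pub': 2, 'Park': 2,
          'Gym': 2, 'Hotel': 2, 'Bank': 2}
n_venues = sum(counts.values())

# Frequencies of the 5 most common categories
top5 = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:5]
freqs = {cat: n / n_venues for cat, n in top5}

# 'Total Wt' is the share of all venues covered by the top 5 categories
total_wt = sum(freqs.values())

# Normalized weights sum to 1 regardless of how many venues there are
weights = {cat: f / total_wt for cat, f in freqs.items()}

print(round(total_wt, 2))
print({cat: round(w, 3) for cat, w in weights.items()})
```

A neighborhood with 5 or fewer categories gets a total of exactly 1.0, which is the case for the last six rows of the table above.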

### Density Map
A neighborhood with a lower 'Total Wt' score has many venues outside its top 5 categories, so it typically has a lot of venues in its vicinity. Intuitively these would be nearer the city centre, where density is higher. We can check whether this is the case with a map. A map of Toronto is overlaid with neighborhood information: a bigger circle indicates a neighborhood that is sparser in venue categories (bigger service area per venue) and a smaller circle indicates a neighborhood that is denser.

In [16]:
#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library
# Use Toronto coordinates 43.6532° N, 79.3832° W
m = folium.Map(location=[43.6532, -79.3832], zoom_start=11)
# look up each neighborhood's 'Total Wt' by name, then add one circle per neighborhood
ratings = tor_hoods_venues_sorted[['Toronto Neighborhood','Total Wt']].set_index('Toronto Neighborhood')

names=df_toronto['Neighborhood']
latitudes=df_toronto['Latitude']
longitudes=df_toronto['Longitude']

for name, lat, lng in zip(names, latitudes, longitudes):
    scale = ratings.loc[name, 'Total Wt']
    p = name[0:5]
    folium.Circle(
      location=[lat, lng],
      popup=p,
      radius=scale*500,
      color='crimson',
      fill=True,
      fill_color='crimson'
    ).add_to(m)

m

For each Toronto neighborhood, we need to establish the weights of the top 5 categories.

Next we prepare the New York neighborhoods:
1. First retrieve the Manhattan neighborhood information.
2. For each Manhattan neighborhood, search for venues in each feature category of the Toronto neighborhoods.
3. The number of matching venues returned forms the weight of the target Manhattan neighborhood for that category.
4. Score for each Manhattan neighborhood = number of venues per category × weight of the category, summed over the categories.
5. Select the top 3 matching neighborhoods.
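
The scoring in steps 3 to 5 can be sketched as follows. This is a minimal illustration with made-up numbers, not the notebook's actual data; the names `toronto_profile`, `manhattan_counts`, `score`, and the neighborhood labels `M1` to `M3` are all hypothetical.

```python
# Hypothetical Toronto neighborhood profile: category -> normalized weight
toronto_profile = {'Coffee Shop': 0.5, 'Park': 0.3, 'Gym': 0.2}

# Hypothetical venue counts per category for three Manhattan neighborhoods
manhattan_counts = {
    'M1': {'Coffee Shop': 40, 'Park': 5, 'Gym': 10},
    'M2': {'Coffee Shop': 6, 'Park': 12, 'Gym': 2},
    'M3': {'Coffee Shop': 20, 'Park': 8, 'Gym': 6},
}

def score(profile, counts):
    """Weighted sum: venues per category times the category's weight."""
    return sum(w * counts.get(cat, 0) for cat, w in profile.items())

scores = {hood: score(toronto_profile, c) for hood, c in manhattan_counts.items()}

# Rank Manhattan neighborhoods from best to worst match and keep the top 3
top3 = sorted(scores, key=scores.get, reverse=True)[:3]
print(top3)
```

With raw counts like these, a neighborhood that simply has many venues tends to win, which is why the counts are rescaled per category before the final scoring later in this notebook.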


### Retrieve Manhattan Neighborhood Information

In [17]:
import json # library to handle JSON files
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

In [18]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset

with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

In [19]:
# define the dataframe columns

column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
ny_neighborhoods = pd.DataFrame(columns=column_names)

ny_neighborhoods_data = newyork_data['features']

In [20]:
# collect one record per neighborhood, then build the dataframe in one go
# (DataFrame.append was removed in pandas 2.0)
records = []
for data in ny_neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    neighborhood_latlon = data['geometry']['coordinates']  # [longitude, latitude]
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    records.append({'Borough': borough,
                    'Neighborhood': neighborhood_name,
                    'Latitude': neighborhood_lat,
                    'Longitude': neighborhood_lon})

ny_neighborhoods = pd.DataFrame(records, columns=column_names)

In [21]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(ny_neighborhoods['Borough'].unique()),
        ny_neighborhoods.shape[0]
    )
)

The dataframe has 5 boroughs and 306 neighborhoods.


We now want to find out how closely each Manhattan neighborhood matches each Toronto neighborhood by searching Manhattan for the top 5 categories of the Toronto neighborhoods. Because categories such as Coffee Shop and Italian Restaurant appear in several Toronto neighborhoods, there is no need to repeat those queries. To reduce the number of Foursquare API calls, we take the union of all the categories that need to be searched for.

In [22]:
s1 = tor_hoods_venues_sorted['1st Most Common Venue']
s2 = tor_hoods_venues_sorted['2nd Most Common Venue']
s3 = tor_hoods_venues_sorted['3rd Most Common Venue']
s4 = tor_hoods_venues_sorted['4th Most Common Venue']
s5 = tor_hoods_venues_sorted['5th Most Common Venue']
all_categories = pd.concat([s1, s2, s3, s4, s5]).dropna().unique()


In [23]:
print("There are a total of ", len(all_categories), " categories")

There are a total of  71  categories


In [24]:
radius = 500
LIMIT=100
#catId = '4bf58dd8d48988d1e0931735'

def searchByCategory(categoryId, nborhood):
    """Return, for each neighborhood row, the number of venues of the given category."""
    numVenues = []
    for latitude, longitude, nbor in zip(nborhood['Latitude'], nborhood['Longitude'], nborhood['Neighborhood']):
        url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&categoryId={}&radius={}&limit={}'.format(CLIENT_ID, 
            CLIENT_SECRET, latitude, longitude, VERSION, categoryId, radius, LIMIT)
        results = requests.get(url).json()
        try:
            v = results["response"]['venues']
            numVenues.append(len(v))
        except (KeyError, TypeError):
            # no venue list in the response (e.g. an API error): count as zero
            numVenues.append(0)
    return numVenues

In [25]:
manhattan_data = ny_neighborhoods[ny_neighborhoods['Borough'] == 'Manhattan'].reset_index(drop=True)
print("No. of neighborhoods in Manhattan: ", manhattan_data.shape[0])

No. of neighborhoods in Manhattan:  40


In [26]:
# one row per Manhattan neighborhood, one column per category to be searched
hood_cat_columns = all_categories.tolist()
hood_cat_columns.insert(0,'Manhattan Neighborhood')
hood_cat = pd.DataFrame(columns=hood_cat_columns)
hood_cat['Manhattan Neighborhood']= manhattan_data['Neighborhood']

In [27]:
for cat in all_categories:
    catId = CategoryIdMap[cat]
    print(cat)
    numVenues = searchByCategory(catId, manhattan_data)
    hood_cat[cat] = numVenues

hood_cat.head(10)

Café
Coffee Shop
Clothing Store
Yoga Studio
Bar
Thai Restaurant
Sandwich Place
Park
Greek Restaurant
Pub
Breakfast Spot
Bakery
Airport Service
Gym
Grocery Store
Music Venue
Trail
Cocktail Bar
Pizza Place
Sushi Restaurant
Restaurant
Aquarium
Skate Park
Vietnamese Restaurant
Hotel
Mexican Restaurant
Dessert Shop
Italian Restaurant
Gift Shop
Pharmacy
Airport Lounge
Food & Drink Shop
Playground
Garden
Neighborhood
Swim School
Seafood Restaurant
Bubble Tea Shop
Japanese Restaurant
Auto Workshop
Vegetarian / Vegan Restaurant
American Restaurant
Fast Food Restaurant
Diner
Sports Bar
Cuban Restaurant
Bank
Airport Terminal
Jewelry Store
Health Food Store
Bus Line
Middle Eastern Restaurant
Brewery
Asian Restaurant
Discount Store
Sporting Goods Shop
Ice Cream Shop
Liquor Store
Eastern European Restaurant
Supermarket
Boutique
Baby Store
Gay Bar
Scenic Lookout
Burrito Place
Men's Store
Salad Place
Pet Store
Dog Run
Indian Restaurant
Nightclub


Unnamed: 0,Manhattan Neighborhood,Café,Coffee Shop,Clothing Store,Yoga Studio,Bar,Thai Restaurant,Sandwich Place,Park,Greek Restaurant,...,Baby Store,Gay Bar,Scenic Lookout,Burrito Place,Men's Store,Salad Place,Pet Store,Dog Run,Indian Restaurant,Nightclub
0,Marble Hill,3,6,16,1,3,1,7,3,0,...,3,0,2,0,0,0,3,3,1,0
1,Chinatown,49,49,48,4,50,16,18,18,5,...,1,1,7,2,21,0,15,3,6,28
2,Washington Heights,9,11,37,1,13,1,5,5,0,...,2,0,5,0,6,2,8,1,2,2
3,Inwood,8,7,17,2,10,2,4,7,0,...,0,0,3,0,3,2,5,1,0,6
4,Hamilton Heights,9,14,14,5,12,4,10,8,1,...,0,2,2,0,1,0,2,1,2,1
5,Manhattanville,8,8,7,3,7,2,4,7,0,...,0,0,2,0,0,1,0,1,2,4
6,Central Harlem,4,6,9,0,13,0,5,2,0,...,0,1,0,0,2,0,0,3,1,0
7,East Harlem,10,8,19,3,8,5,10,5,1,...,0,1,6,0,2,0,6,3,3,0
8,Upper East Side,27,34,48,11,17,6,6,9,0,...,3,0,10,3,5,5,6,4,5,3
9,Yorkville,15,21,15,5,37,11,11,8,0,...,1,1,12,0,0,0,17,3,7,7


We have now retrieved the number of venues in each Manhattan neighborhood for the categories found in the Toronto neighborhoods, and are ready to score the compatibility of the neighborhoods. A simple way of calculating the score would be to take the number of venues in each category multiplied by the normalized weight for that category. However, Chinatown, for example, has a disproportionate number of coffee shops, which may skew the results towards it even though it may lack, say, a bookstore that is also desired. We therefore rescale the raw counts within each category before scoring.
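
The per-category rescaling described above can be done with min-max scaling, which maps each column to the range 0 to 1 so that no single category dominates the score. A minimal sketch on made-up counts (the `toy_counts` frame and its values are hypothetical):

```python
import pandas as pd

# Hypothetical venue counts for three neighborhoods
toy_counts = pd.DataFrame(
    {'Coffee Shop': [50, 10, 30], 'Bookstore': [1, 3, 2], 'Aquarium': [0, 0, 0]},
    index=['X', 'Y', 'Z'],
)

# Min-max scale each column: (value - min) / (max - min)
col_min = toy_counts.min()
col_range = toy_counts.max() - col_min
scaled = (toy_counts - col_min) / col_range

# A constant column has range 0, giving 0/0 = NaN; treat it as 0
scaled = scaled.fillna(0)
print(scaled)
```

After scaling, 50 coffee shops and 3 bookstores both map to 1.0, so a category with large raw counts no longer outweighs a category with small ones.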



In [28]:
hood_cat.describe()

Unnamed: 0,Café,Coffee Shop,Clothing Store,Yoga Studio,Bar,Thai Restaurant,Sandwich Place,Park,Greek Restaurant,Pub,...,Baby Store,Gay Bar,Scenic Lookout,Burrito Place,Men's Store,Salad Place,Pet Store,Dog Run,Indian Restaurant,Nightclub
count,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,...,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0
mean,28.925,31.75,35.725,9.775,30.125,7.3,17.725,15.75,2.9,6.525,...,1.125,3.425,11.425,2.1,12.975,5.475,7.6,4.025,8.6,15.1
std,17.375768,16.760378,15.980738,10.77149,16.99802,6.801961,13.739308,8.799038,2.667948,6.694765,...,1.435583,4.706174,8.214707,2.19323,15.967094,8.161786,5.067746,2.506274,8.613645,15.475042
min,0.0,3.0,0.0,0.0,0.0,0.0,2.0,2.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,13.0,14.75,20.5,2.75,13.0,1.75,7.0,8.0,0.0,1.75,...,0.0,0.0,5.75,0.0,1.0,0.75,3.0,2.0,2.0,3.75
50%,29.0,33.5,45.0,5.5,32.0,6.0,15.0,15.0,2.0,4.0,...,0.5,1.5,10.0,2.0,6.0,3.0,7.0,4.0,6.0,9.0
75%,47.25,48.0,47.25,11.25,50.0,10.0,23.25,21.25,5.0,9.25,...,2.0,5.0,16.25,3.0,21.75,6.25,12.0,5.25,10.0,28.5
max,50.0,50.0,50.0,48.0,50.0,27.0,50.0,35.0,9.0,26.0,...,5.0,19.0,32.0,9.0,50.0,38.0,17.0,10.0,34.0,47.0


In [29]:
hood_cat_norm = hood_cat.copy()
cl = all_categories.tolist()
hood_cat_norm[cl] = (hood_cat_norm[cl]-hood_cat_norm[cl].min())/(hood_cat_norm[cl].max()-hood_cat_norm[cl].min())
hood_cat_norm.fillna(0, inplace=True) # fix div-by-zero for the columns where there is only 1 value and min=max.

hood_cat_norm.head()

Unnamed: 0,Manhattan Neighborhood,Café,Coffee Shop,Clothing Store,Yoga Studio,Bar,Thai Restaurant,Sandwich Place,Park,Greek Restaurant,...,Baby Store,Gay Bar,Scenic Lookout,Burrito Place,Men's Store,Salad Place,Pet Store,Dog Run,Indian Restaurant,Nightclub
0,Marble Hill,0.06,0.06383,0.32,0.020833,0.06,0.037037,0.104167,0.030303,0.0,...,0.6,0.0,0.0625,0.0,0.0,0.0,0.176471,0.3,0.029412,0.0
1,Chinatown,0.98,0.978723,0.96,0.083333,1.0,0.592593,0.333333,0.484848,0.555556,...,0.2,0.052632,0.21875,0.222222,0.42,0.0,0.882353,0.3,0.176471,0.595745
2,Washington Heights,0.18,0.170213,0.74,0.020833,0.26,0.037037,0.0625,0.090909,0.0,...,0.4,0.0,0.15625,0.0,0.12,0.052632,0.470588,0.1,0.058824,0.042553
3,Inwood,0.16,0.085106,0.34,0.041667,0.2,0.074074,0.041667,0.151515,0.0,...,0.0,0.0,0.09375,0.0,0.06,0.052632,0.294118,0.1,0.0,0.12766
4,Hamilton Heights,0.18,0.234043,0.28,0.104167,0.24,0.148148,0.166667,0.181818,0.111111,...,0.0,0.105263,0.0625,0.0,0.02,0.0,0.117647,0.1,0.058824,0.021277


In [30]:
# now we are ready to score neighbourhoods against each other.
ny_tor_cols = tor_hoods_venues_sorted['Toronto Neighborhood'].tolist()
ny_tor_cols.insert(0, 'Manhattan Neighborhood')
ny_tor_score = pd.DataFrame(columns = ny_tor_cols)
ny_tor_score['Manhattan Neighborhood'] = hood_cat['Manhattan Neighborhood']

# set Toronto Neighborhood as index to facilitate subsequent processing
tor_hoods_venues_sorted.set_index('Toronto Neighborhood', inplace=True)
tor_hoods_venues_sorted.head()        

Unnamed: 0_level_0,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,1st Venue Weight,2nd Venue Weight,3rd Venue Weight,4th Venue Weight,5th Venue Weight,Total Wt
Toronto Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
St. James Town,Café,Coffee Shop,Cocktail Bar,Restaurant,American Restaurant,0.263158,0.263158,0.157895,0.157895,0.157895,0.2375
Berczy Park,Coffee Shop,Cocktail Bar,Seafood Restaurant,Bakery,Restaurant,0.357143,0.214286,0.142857,0.142857,0.142857,0.241379
Stn A PO Boxes,Coffee Shop,Café,Seafood Restaurant,Cocktail Bar,Restaurant,0.416667,0.166667,0.166667,0.125,0.125,0.247423
"Garden District, Ryerson",Clothing Store,Coffee Shop,Bubble Tea Shop,Middle Eastern Restaurant,Café,0.36,0.28,0.12,0.12,0.12,0.25
"St. James Town, Cabbagetown",Coffee Shop,Pizza Place,Pub,Italian Restaurant,Bakery,0.25,0.25,0.166667,0.166667,0.166667,0.272727


In [31]:
for tor in ny_tor_cols[1:]:
#    print(tor)
    ny_tor_score[tor] = 0
    # take each of top 5 category
    for c, w in zip(nth_venues, nth_wt):
        cat = tor_hoods_venues_sorted.loc[tor,c]
        weight = tor_hoods_venues_sorted.loc[tor,w]
        if (weight > 0):
            ny_tor_score[tor] = ny_tor_score[tor]+ hood_cat_norm[cat]* weight

ny_tor_score.head()

Unnamed: 0,Manhattan Neighborhood,St. James Town,Berczy Park,Stn A PO Boxes,"Garden District, Ryerson","St. James Town, Cabbagetown",Church and Wellesley,"Richmond, Adelaide, King","Kensington Market, Chinatown, Grange Park","Harbourfront East, Union Station, Toronto Islands",...,"Dufferin, Dovercourt Village","CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",Davisville North,Christie,Rosedale,Roselawn,"Forest Hill North & West, Forest Hill Road Park",The Beaches,"Moore Park, Summerhill East",Lawrence Park
0,Marble Hill,0.058786,0.056572,0.057182,0.140272,0.100486,0.028258,0.045026,0.057723,0.046113,...,0.151793,0.018203,0.155944,0.128367,0.065152,0.01,0.057576,0.059615,0.1,0.137761
1,Chinatown,0.788691,0.840724,0.877584,0.863558,0.677282,0.573499,0.767779,0.859122,0.709643,...,0.929543,0.217636,0.550241,0.791404,0.467424,0.291395,0.537497,0.334615,0.348936,0.466581
2,Washington Heights,0.154599,0.167125,0.167823,0.341975,0.17873,0.100115,0.138456,0.187421,0.122811,...,0.341261,0.030024,0.191044,0.187485,0.045455,0.074884,0.073873,0.044231,0.031915,0.143778
3,Inwood,0.137027,0.115377,0.119311,0.16543,0.105465,0.052266,0.099605,0.137708,0.072677,...,0.221599,0.02279,0.090111,0.112475,0.188258,0.066512,0.169794,0.159615,0.2,0.085966
4,Hamilton Heights,0.158151,0.170016,0.178908,0.206879,0.190553,0.130988,0.170893,0.195083,0.166554,...,0.276815,0.252671,0.216955,0.113768,0.265909,0.12814,0.125324,0.103846,0.121277,0.124436


In [41]:
ny_tor_score.iloc[:, 1:14].describe()

Unnamed: 0,St. James Town,Berczy Park,Stn A PO Boxes,"Garden District, Ryerson","St. James Town, Cabbagetown",Church and Wellesley,"Richmond, Adelaide, King","Kensington Market, Chinatown, Grange Park","Harbourfront East, Union Station, Toronto Islands","Business reply mail Processing Centre, South Central Letter Processing Plant Toronto","Little Portugal, Trinity","First Canadian Place, Underground city","Toronto Dominion Centre, Design Exchange"
count,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0
mean,0.51672,0.450157,0.489284,0.544259,0.441317,0.425036,0.536833,0.44133,0.453077,0.228229,0.445943,0.535344,0.492545
std,0.322717,0.287701,0.295978,0.277428,0.268894,0.277667,0.318203,0.270373,0.267347,0.140817,0.269655,0.31919,0.30603
min,0.022654,0.006211,0.015435,0.016849,0.0,0.010021,0.041473,0.017647,0.023491,0.0,0.0,0.038117,0.02856
25%,0.205385,0.169978,0.201258,0.276036,0.208079,0.164605,0.250582,0.21714,0.205808,0.1125,0.188064,0.256897,0.232923
50%,0.48212,0.376558,0.441772,0.612425,0.421005,0.402246,0.538225,0.406612,0.434832,0.234028,0.446883,0.537877,0.463388
75%,0.825258,0.688776,0.762926,0.786125,0.679695,0.659434,0.858806,0.699255,0.710632,0.307986,0.658032,0.85743,0.786003
max,0.986819,0.912799,0.905833,0.902647,0.910474,0.890927,0.98777,0.884447,0.79944,0.5625,0.956044,0.986556,0.989565


In [42]:
ny_tor_score.iloc[:, 14:27].describe()

Unnamed: 0,Studio District,"High Park, The Junction South","Commerce Court, Victoria Hotel","North Toronto West, Lawrence Park","University of Toronto, Harbord",Davisville,"India Bazaar, The Beaches West","Runnymede, Swansea",Central Bay Street,"Brockton, Parkdale Village, Exhibition Place","The Danforth West, Riverdale","Regent Park, Harbourfront","Queen's Park, Ontario Provincial Government"
count,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0
mean,0.506852,0.37054,0.523986,0.468072,0.510296,0.427822,0.388579,0.457734,0.511906,0.490582,0.378781,0.463235,0.522445
std,0.30789,0.230542,0.320641,0.241961,0.312537,0.280418,0.230838,0.286375,0.317677,0.293219,0.264773,0.275617,0.297054
min,0.018462,0.015,0.026787,0.002915,0.032843,0.01964,0.04596,0.012857,0.0096,0.018,0.0203,0.028708,0.012987
25%,0.206766,0.17097,0.236958,0.262249,0.196219,0.212546,0.230068,0.18449,0.214676,0.237055,0.10757,0.206889,0.238894
50%,0.507829,0.326776,0.498355,0.510122,0.517017,0.398464,0.326364,0.480114,0.490922,0.474239,0.36736,0.450293,0.55455
75%,0.801582,0.579918,0.829525,0.653412,0.825763,0.692597,0.536661,0.686215,0.814979,0.735627,0.643565,0.696833,0.802195
max,0.987567,0.912619,0.99051,0.875009,0.9405,0.922153,0.89816,0.903878,0.955029,0.926779,0.82619,0.917755,0.926424


In [44]:
ny_tor_score.iloc[:, 27:].describe()

Unnamed: 0,"Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park","Parkdale, Roncesvalles","The Annex, North Midtown, Yorkville","Dufferin, Dovercourt Village","CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",Davisville North,Christie,Rosedale,Roselawn,"Forest Hill North & West, Forest Hill Road Park",The Beaches,"Moore Park, Summerhill East",Lawrence Park
count,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0
mean,0.40872,0.291955,0.478565,0.414356,0.1753,0.487487,0.370576,0.383958,0.262192,0.348865,0.235865,0.413138,0.326566
std,0.202194,0.220131,0.290713,0.223605,0.152649,0.253583,0.211775,0.201305,0.203451,0.224835,0.188289,0.240927,0.170433
min,0.020408,0.028571,0.016364,0.03207,0.0,0.052611,0.056843,0.045455,0.01,0.005319,0.009615,0.0,0.060606
25%,0.237587,0.105464,0.189576,0.24329,0.069409,0.272843,0.215649,0.266477,0.127267,0.148826,0.092788,0.231915,0.178272
50%,0.421326,0.217945,0.464587,0.400791,0.159362,0.503543,0.326868,0.356818,0.182326,0.298862,0.190385,0.404255,0.317645
75%,0.537141,0.478853,0.705476,0.524054,0.218227,0.678416,0.525056,0.460606,0.327965,0.515749,0.351442,0.589362,0.435203
max,0.836735,0.79616,0.981283,0.929543,0.762222,0.896401,0.849663,0.847348,0.906977,0.818529,0.721154,0.968085,0.766495
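
The `describe()` calls above inspect the score columns in batches of thirteen. As a minimal sketch of an alternative, the summary can be transposed so that each Toronto neighborhood becomes one row, which makes the statistics sortable. The `scores` frame below is a hypothetical miniature stand-in for `ny_tor_score`, and its values are illustrative only.

```python
import pandas as pd

# Hypothetical miniature stand-in for ny_tor_score (illustrative values only)
scores = pd.DataFrame({
    'Manhattan Neighborhood': ['Midtown', 'Soho', 'Chinatown'],
    'Christie': [0.2, 0.8, 0.5],
    'Rosedale': [0.1, 0.3, 0.2],
})

# One row per Toronto neighborhood, one column per statistic,
# ordered by the mean matching score
summary = (
    scores.set_index('Manhattan Neighborhood')
          .describe()
          .T
          .sort_values('mean', ascending=False)
)
print(summary[['mean', 'max']])
```

Sorting by `max` instead of `mean` would surface the Toronto neighborhoods that have at least one strong Manhattan match.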


In [34]:
num_matching_hoods = 4

# build the column names: "1st Matching Neighborhood", "1st Matching Score", ...
bmcolumns = []
for ind in range(num_matching_hoods):
    try:
        suffix = indicators[ind]
    except (IndexError, KeyError):
        # no ordinal suffix defined for this position, so fall back to "th"
        suffix = 'th'
    bmcolumns.append('{}{} Matching Neighborhood'.format(ind+1, suffix))
    bmcolumns.append('{}{} Matching Score'.format(ind+1, suffix))

df_best_matches = pd.DataFrame(index=ny_tor_cols[1:], columns=bmcolumns)
# for every Toronto neighborhood column (column 1 onwards)
for tor in ny_tor_cols[1:]:
    # subset the "Manhattan Neighborhood" column and this Toronto column,
    # then sort by descending score; sort_values returns a new DataFrame here
    # (sorting the slice in place triggers a SettingWithCopyWarning)
    MTdf = ny_tor_score[['Manhattan Neighborhood', tor]].sort_values(
        tor, ascending=False, ignore_index=True)
    # copy the top num_matching_hoods entries to the result DataFrame
    for i in range(num_matching_hoods):
        nbor = MTdf.iloc[i, 0]
        score = MTdf.iloc[i, 1]
        df_best_matches.loc[tor, bmcolumns[i*2]] = nbor
        df_best_matches.loc[tor, bmcolumns[i*2+1]] = score


In [35]:
df_best_matches.sort_values('1st Matching Score', ascending=False, inplace=True)
df_best_matches.index.rename('Toronto Neighborhood', inplace=True)
df_best_matches.head(39)

Toronto Neighborhood,1st Matching Neighborhood,1st Matching Score,2nd Matching Neighborhood,2nd Matching Score,3rd Matching Neighborhood,3rd Matching Score,4th Matching Neighborhood,4th Matching Score
"Commerce Court, Victoria Hotel",Midtown South,0.99051,Midtown,0.988213,Financial District,0.937204,Murray Hill,0.929568
"Toronto Dominion Centre, Design Exchange",Midtown,0.989565,Midtown South,0.96102,Murray Hill,0.917313,Financial District,0.890235
"Richmond, Adelaide, King",Midtown South,0.98777,Midtown,0.98027,Financial District,0.939819,Murray Hill,0.937939
Studio District,Midtown South,0.987567,Soho,0.95375,Greenwich Village,0.946224,Midtown,0.916797
St. James Town,Midtown,0.986819,Midtown South,0.982041,Noho,0.945272,Flatiron,0.941845
"First Canadian Place, Underground city",Midtown South,0.986556,Midtown,0.982103,Soho,0.940827,Financial District,0.939222
"The Annex, North Midtown, Yorkville",Midtown,0.981283,Midtown South,0.951889,Financial District,0.914043,Murray Hill,0.890407
"Moore Park, Summerhill East",Financial District,0.968085,Greenwich Village,0.878723,Flatiron,0.689362,Civic Center,0.678723
"Little Portugal, Trinity",Little Italy,0.956044,Noho,0.859302,Greenwich Village,0.853575,Chinatown,0.84031
Central Bay Street,Financial District,0.955029,Midtown,0.934248,Noho,0.910914,Murray Hill,0.892248
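
The best-match table above is filled by sorting one score column at a time and copying the top rows. As a minimal sketch, the same top-N selection can be expressed with `Series.nlargest`. The `scores` frame and the `top_matches` helper are hypothetical names introduced for illustration, and the values are not taken from the real data.

```python
import pandas as pd

# Hypothetical miniature stand-in for ny_tor_score (illustrative values only)
scores = pd.DataFrame({
    'Manhattan Neighborhood': ['Midtown', 'Soho', 'Chinatown', 'Noho'],
    'Christie': [0.2, 0.8, 0.5, 0.6],
    'Rosedale': [0.4, 0.3, 0.1, 0.9],
})

def top_matches(scores, toronto_hood, n=2):
    """Return the n highest-scoring Manhattan neighborhoods for one Toronto column."""
    return (
        scores.set_index('Manhattan Neighborhood')[toronto_hood]
              .nlargest(n)
    )

best = top_matches(scores, 'Christie')
print(best)
```

The result is a Series indexed by Manhattan neighborhood, so both the names and the scores are available without positional `iloc` lookups.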


In [36]:
def CheckNborHood(check_nborhood):
    """Tabulate a Toronto neighborhood's top venue categories against its best Manhattan matches."""
    check_preferences = tor_hoods_venues_sorted.loc[check_nborhood]
    # the first num_top_venues entries are category names, the next ones are their weightages
    check_cat = check_preferences.iloc[0:num_top_venues].dropna().tolist()

    # first row: the weightage of each category for the Toronto neighborhood
    row = {'Nhood': 'Weightage'}
    for x, cat in enumerate(check_cat):
        row[cat] = check_preferences.iloc[num_top_venues + x]
    rows = [row]

    # get the list of matching Manhattan neighborhoods from df_best_matches
    cols = [k for k in df_best_matches.columns if 'Neighborhood' in k]
    check_hoods = df_best_matches.loc[check_nborhood, cols]

    # index hood_cat by Manhattan neighborhood (only needed on the first call)
    if hood_cat.index.name != 'Manhattan Neighborhood':
        hood_cat.set_index('Manhattan Neighborhood', inplace=True)

    # one row per matching neighborhood: the venue count for each category
    for hood in check_hoods:
        venue_count = hood_cat.loc[hood, check_cat]
        rows.append({'Nhood': hood, **venue_count.to_dict()})

    # DataFrame.append was removed in pandas 2.0, so build the frame in one step
    return pd.DataFrame(rows, columns=['Nhood'] + check_cat)

In [37]:
dfcheck = CheckNborHood('The Beaches')
dfcheck

,Nhood,Trail,Neighborhood,Health Food Store,Pub
0,Weightage,0.25,0.25,0.25,0.25
1,West Village,4.0,2.0,7.0,10.0
2,Greenwich Village,4.0,1.0,8.0,17.0
3,Midtown South,1.0,0.0,10.0,26.0
4,Gramercy,5.0,1.0,3.0,10.0
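
The check table above lists the category weightages for The Beaches next to the venue counts of its four best Manhattan matches. As a rough sanity check, the sketch below combines them into a weighted score per neighborhood. It rests on an assumption: each category count is scaled by its maximum among the four rows shown. This section does not show how `ny_tor_score` was actually built, so treat this as one plausible reading rather than the notebook's scoring method. The numbers are copied from the table.

```python
import pandas as pd

# Values copied from the 'The Beaches' check table above
weights = pd.Series({'Trail': 0.25, 'Neighborhood': 0.25,
                     'Health Food Store': 0.25, 'Pub': 0.25})
counts = pd.DataFrame(
    {'Trail': [4, 4, 1, 5],
     'Neighborhood': [2, 1, 0, 1],
     'Health Food Store': [7, 8, 10, 3],
     'Pub': [10, 17, 26, 10]},
    index=['West Village', 'Greenwich Village', 'Midtown South', 'Gramercy'],
)

# Assumption: scale each category by its maximum among the rows shown
scaled = counts / counts.max()

# Weighted sum of the scaled counts for each Manhattan neighborhood
weighted = scaled.mul(weights, axis=1).sum(axis=1)
print(weighted.round(6))
```

Under this assumption West Village scores about 0.7212, in line with the maximum of 0.721154 reported for The Beaches in the `describe()` output, and the four neighborhoods keep the same order as in the table.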


In [38]:
dfcheck = CheckNborHood('Commerce Court, Victoria Hotel')
dfcheck




,Nhood,Coffee Shop,Restaurant,Café,Hotel,American Restaurant
0,Weightage,0.333333,0.194444,0.194444,0.166667,0.111111
1,Midtown South,50.0,48.0,49.0,49.0,49.0
2,Midtown,50.0,46.0,50.0,49.0,50.0
3,Financial District,50.0,40.0,48.0,45.0,48.0
4,Murray Hill,50.0,37.0,47.0,47.0,49.0


In [39]:
dfcheck = CheckNborHood('Business reply mail Processing Centre, South Central Letter Processing Plant Toronto')
dfcheck




,Nhood,Yoga Studio,Skate Park,Auto Workshop,Brewery,Burrito Place
0,Weightage,0.2,0.2,0.2,0.2,0.2
1,Midtown,15.0,0.0,2.0,3.0,9.0
2,Midtown South,37.0,1.0,0.0,6.0,4.0
3,Flatiron,48.0,0.0,0.0,3.0,8.0
4,Noho,24.0,0.0,1.0,4.0,3.0


In [40]:
dfcheck = CheckNborHood('Christie')
dfcheck





,Nhood,Grocery Store,Café,Park,Baby Store,Nightclub
0,Weightage,0.363636,0.272727,0.181818,0.090909,0.090909
1,Soho,41.0,50.0,22.0,5.0,37.0
2,Little Italy,48.0,50.0,13.0,3.0,42.0
3,Chinatown,49.0,49.0,18.0,1.0,28.0
4,Civic Center,14.0,47.0,35.0,4.0,14.0


In [None]:
# @hidden_cell
# #!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
# import folium # map rendering library

# # create map
# # Use Toronto coordinates 43.6532° N, 79.3832° W
# map_clusters = folium.Map(location=[43.6532, -79.3832], zoom_start=11)

# # set color scheme for the clusters
# x = np.arange(kclusters)
# ys = [i + x + (i*x)**2 for i in range(kclusters)]
# colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
# rainbow = [colors.rgb2hex(i) for i in colors_array]

# # add markers to the map
# markers_colors = []
# for lat, lon, poi, cluster in zip(tor_merged['Latitude'], tor_merged['Longitude'], tor_merged['Toronto Neighborhood'], tor_merged['Cluster Labels']):
#     label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
#     folium.CircleMarker(
#         [lat, lon],
#         radius=5,
#         popup=label,
#         color=rainbow[cluster-1],
#         fill=True,
#         fill_color=rainbow[cluster-1],
#         fill_opacity=0.7).add_to(map_clusters)
       
# map_clusters