# Capstone Project Data Science Coursera

# Introduction/Business Problem

When moving to a new city, it takes some time to get to know the place and get oriented. Particularly in major metropolitan cities, which neighborhood you live in matters quite a lot: each one has its own atmosphere, like a miniature city. How, when moving somewhere for the first time, could you pick a neighborhood? As a solution, I'm proposing a tool to map neighborhoods in one city to the most similar neighborhoods in another. To prove the concept, I will use Toronto and New York.

# Data

For this task, Foursquare data will be essential. The specific data will be all venues in both cities (unless I limit this for speed and proof of concept.) For example, a sample row of Foursquare data that I use would need at minimum the latitude, longitude, name, and category of a venue. I will also need GeoJSON files of Toronto and New York for visualization. 

Using this data, I'll determine, for each neighborhood in New York, which is the most similar neighborhood in Toronto, and vice versa. Instead of using a clustering algorithm for this, since I want a one-to-one mapping, I'll simply calculate the Euclidean distance between all the neighborhoods (once they've been properly encoded). The neighborhood in the new city with the smallest distance from the neighborhood in your current city will be the best choice!
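To make the idea concrete, here is a minimal pure-Python sketch of the nearest-match rule. The area names and category shares are made up for illustration; the real vectors come from the Foursquare venue categories later on.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def closest(target, candidates):
    """Return the name of the candidate vector nearest to `target`."""
    return min(candidates, key=lambda name: euclidean(target, candidates[name]))

# Hypothetical category shares, in the order [cafes, parks, bars]
home = [0.5, 0.3, 0.2]
new_city = {
    "Area A": [0.1, 0.8, 0.1],
    "Area B": [0.4, 0.3, 0.3],
}

best = closest(home, new_city)
print(best)
```

The smaller the distance, the more alike the two venue mixes are, so the candidate with the minimum distance is taken as the match.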

# Methodology

To get started, I'll retrieve the GeoJSON data for the neighborhoods in the two cities.

In [43]:
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation

!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 
    
# transforming json file into a pandas dataframe library
from pandas.io.json import json_normalize

!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library
import json
import sklearn
import urllib.request
import bs4 as bs
print('Folium installed')
print('Libraries imported.')



Solving environment: done


  current version: 4.4.10
  latest version: 4.8.3

Please update conda by running

    $ conda update -n base conda



# All requested packages already installed.

Folium installed
Libraries imported.


In [44]:
#geoJSON neighborhoods
geo_ny_url = '../../../downloads/nyu-geojson.json'
with open(geo_ny_url) as geo_ny_file:
    geo_ny_data = json.load(geo_ny_file)

In [45]:
geo_ny_data = geo_ny_data['features']

In [46]:
geo_ny_data

[{'geometry': {'coordinates': [-73.84720052054902, 40.89470517661],
   'type': 'Point'},
  'geometry_name': 'geom',
  'id': 'nyu_2451_34572.1',
  'properties': {'annoangle': 0.0,
   'annoline1': 'Wakefield',
   'annoline2': None,
   'annoline3': None,
   'bbox': [-73.84720052054902,
    40.89470517661,
    -73.84720052054902,
    40.89470517661],
   'borough': 'Bronx',
   'name': 'Wakefield',
   'stacked': 1},
  'type': 'Feature'},
 {'geometry': {'coordinates': [-73.82993910812398, 40.87429419303012],
   'type': 'Point'},
  'geometry_name': 'geom',
  'id': 'nyu_2451_34572.2',
  'properties': {'annoangle': 0.0,
   'annoline1': 'Co-op',
   'annoline2': 'City',
   'annoline3': None,
   'bbox': [-73.82993910812398,
    40.87429419303012,
    -73.82993910812398,
    40.87429419303012],
   'borough': 'Bronx',
   'name': 'Co-op City',
   'stacked': 2},
  'type': 'Feature'},
 {'geometry': {'coordinates': [-73.82780644716412, 40.887555677350775],
   'type': 'Point'},
  'geometry_name': 'geom',


In [50]:
data_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

In [47]:
column_names_ny = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 
neighborhoods_ny = pd.DataFrame(columns=column_names_ny)

In [48]:
rows = []
for data in geo_ny_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_lon, neighborhood_lat = data['geometry']['coordinates']

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

# build the dataframe in one step instead of appending row by row
neighborhoods_ny = pd.DataFrame(rows, columns=column_names_ny)
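As a small illustration of the flattening step, the sketch below turns one GeoJSON point feature into a table row. The sample feature is a trimmed copy of the Wakefield entry shown above; the helper name `feature_to_row` is my own and not part of any library.

```python
def feature_to_row(feature):
    """Flatten one GeoJSON point feature into a flat dictionary."""
    # GeoJSON lists the longitude first, then the latitude
    lon, lat = feature["geometry"]["coordinates"]
    props = feature["properties"]
    return {
        "Borough": props["borough"],
        "Neighborhood": props["name"],
        "Latitude": lat,
        "Longitude": lon,
    }

sample = {
    "geometry": {"coordinates": [-73.84720052054902, 40.89470517661], "type": "Point"},
    "properties": {"borough": "Bronx", "name": "Wakefield"},
    "type": "Feature",
}

row = feature_to_row(sample)
print(row)
```

The coordinate order is the detail most likely to go wrong here, since map libraries such as folium expect latitude first.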

In [52]:
source = urllib.request.urlopen(data_url).read()
soup = bs.BeautifulSoup(source,'lxml')

table = soup.find('table')
table_rows = table.find_all('tr')
rows = []
for tr in table_rows:
    cells = tr.find_all('td')
    row = [cell.text.replace('\n','') for cell in cells]
    # skip header rows and postal codes with no assigned borough
    if (len(row) > 2 and row[1] != 'Not assigned'):
        # a blank neighborhood falls back to the borough name
        if (row[2] == ''):
            row[2] = row[1]
        rows.append(row)
neighborhoods_tr = pd.DataFrame(rows, columns=['PostalCode', 'Borough', 'Neighborhood'])
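The filtering rule in that loop can be tried out without touching the network. This is a sketch on already-extracted rows; the sample rows (including the `M9Z` entry) are invented for illustration, and `clean_rows` is a hypothetical helper name.

```python
def clean_rows(rows):
    """Keep assigned postal codes; use the borough when the neighborhood is blank."""
    cleaned = []
    for row in rows:
        # header rows have no <td> cells, so they arrive as short or empty lists
        if len(row) > 2 and row[1] != "Not assigned":
            code, borough, neighborhood = row[0], row[1], row[2]
            cleaned.append([code, borough, neighborhood or borough])
    return cleaned

raw = [
    [],                                      # header row
    ["M1A", "Not assigned", "Not assigned"],  # dropped
    ["M3A", "North York", "Parkwoods"],       # kept as-is
    ["M9Z", "Example Borough", ""],           # made-up row with a blank neighborhood
]

result = clean_rows(raw)
print(result)
```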

In [54]:
#get latitude and longitude from csv instead
zips = pd.read_csv('../../../downloads/Geospatial_Coordinates.csv')
zips.rename(columns={'Postal Code': 'PostalCode'}, inplace=True)
neighborhoods_tr = pd.merge(neighborhoods_tr, zips, on='PostalCode')

In [57]:
print(neighborhoods_tr.head())
neighborhoods_ny.head()

  PostalCode           Borough                                 Neighborhood  \
0        M3A        North York                                    Parkwoods   
1        M4A        North York                             Victoria Village   
2        M5A  Downtown Toronto                    Regent Park, Harbourfront   
3        M6A        North York             Lawrence Manor, Lawrence Heights   
4        M7A  Downtown Toronto  Queen's Park, Ontario Provincial Government   

    Latitude  Longitude  
0  43.753259 -79.329656  
1  43.725882 -79.315572  
2  43.654260 -79.360636  
3  43.718518 -79.464763  
4  43.662301 -79.389494  


Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


## For simplicity, let's limit to one borough each

In [62]:
manhattan_data = neighborhoods_ny[neighborhoods_ny['Borough'] == 'Manhattan'].reset_index(drop=True)
manhattan_data.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Manhattan,Marble Hill,40.876551,-73.91066
1,Manhattan,Chinatown,40.715618,-73.994279
2,Manhattan,Washington Heights,40.851903,-73.9369
3,Manhattan,Inwood,40.867684,-73.92121
4,Manhattan,Hamilton Heights,40.823604,-73.949688


In [64]:
north_data = neighborhoods_tr[neighborhoods_tr['Borough'] == 'North York'].reset_index(drop=True)
north_data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
3,M3B,North York,Don Mills,43.745906,-79.352188
4,M6B,North York,Glencairn,43.709577,-79.445073


In [65]:
address = 'Manhattan, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Manhattan are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Manhattan are 40.7896239, -73.9598939.


In [66]:
# create map of Manhattan using latitude and longitude values
map_manhattan = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(manhattan_data['Latitude'], manhattan_data['Longitude'], manhattan_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_manhattan)  
    
map_manhattan

In [67]:
# create map of North York using latitude and longitude values
latitude, longitude = north_data['Latitude'][0], north_data['Longitude'][0]
neighborhoods=north_data
map_north = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_north)  
    
map_north

## Foursquare

With both neighborhood data sets loaded, we can retrieve the venues we need to compare them.

In [38]:
#Foursquare info
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID (placeholder, keep real credentials out of the notebook)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret (placeholder)
VERSION = '20180604'

In [77]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    LIMIT = 100
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
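The request URL in `getNearbyVenues` can also be assembled from a dictionary of parameters, which keeps the pieces easier to read. The sketch below only builds the URL and makes no network call; the parameter names are the ones used in the function above, while `build_explore_url` and the `MY_ID` / `MY_SECRET` values are placeholders of my own.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Assemble the explore URL for one neighborhood centre."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),  # latitude first, then longitude
        "radius": radius,                # search radius in metres
        "limit": limit,                  # maximum number of venues returned
    }
    # urlencode percent-escapes the values (the comma in `ll` becomes %2C)
    return EXPLORE_ENDPOINT + "?" + urlencode(params)

url = build_explore_url("MY_ID", "MY_SECRET", "20180604", 40.876551, -73.91066)
print(url)
```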

In [78]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        # fall back to the flattened (json_normalize-style) column name
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [79]:

manhattan_venues = getNearbyVenues(names=manhattan_data['Neighborhood'],
                                   latitudes=manhattan_data['Latitude'],
                                   longitudes=manhattan_data['Longitude']
                                  )
north_venues = getNearbyVenues(names=north_data['Neighborhood'],
                                   latitudes=north_data['Latitude'],
                                   longitudes=north_data['Longitude']
                                  )

In [82]:
# one hot encoding
north_onehot = pd.get_dummies(north_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
north_onehot['Neighborhood'] = north_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [north_onehot.columns[-1]] + list(north_onehot.columns[:-1])
north_onehot = north_onehot[fixed_columns]

north_onehot.head()
print(north_onehot.shape)

(241, 99)


In [83]:
# one hot encoding
manhattan_onehot = pd.get_dummies(manhattan_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
manhattan_onehot['Neighborhood'] = manhattan_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [manhattan_onehot.columns[-1]] + list(manhattan_onehot.columns[:-1])
manhattan_onehot = manhattan_onehot[fixed_columns]

manhattan_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,American Restaurant,Antique Shop,Arcade,Arepa Restaurant,Argentinian Restaurant,...,Video Store,Vietnamese Restaurant,Volleyball Court,Waterfront,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [86]:
manhattan_grouped = manhattan_onehot.groupby('Neighborhood').mean().reset_index()
print(manhattan_grouped.head())
toronto_grouped = north_onehot.groupby('Neighborhood').mean().reset_index()
print(toronto_grouped.head())

        Neighborhood  Accessories Store  Adult Boutique  Afghan Restaurant  \
0  Battery Park City                0.0             0.0                0.0   
1      Carnegie Hill                0.0             0.0                0.0   
2     Central Harlem                0.0             0.0                0.0   
3            Chelsea                0.0             0.0                0.0   
4          Chinatown                0.0             0.0                0.0   

   African Restaurant  American Restaurant  Antique Shop  Arcade  \
0            0.000000             0.000000           0.0     0.0   
1            0.000000             0.011905           0.0     0.0   
2            0.066667             0.044444           0.0     0.0   
3            0.000000             0.030000           0.0     0.0   
4            0.000000             0.030000           0.0     0.0   

   Arepa Restaurant  Argentinian Restaurant  ...  Video Store  \
0               0.0                0.000000  ...         
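What the one-hot encoding followed by `groupby('Neighborhood').mean()` produces is, for each neighborhood, the share of its venues that fall in each category. A dependency-free sketch of the same calculation is below; the neighborhood and venue values are invented, and `category_frequencies` is a hypothetical helper.

```python
from collections import Counter, defaultdict

def category_frequencies(venues):
    """Map each neighborhood to {category: share of that neighborhood's venues}."""
    counts = defaultdict(Counter)
    for neighborhood, category in venues:
        counts[neighborhood][category] += 1
    return {
        hood: {cat: n / sum(per_cat.values()) for cat, n in per_cat.items()}
        for hood, per_cat in counts.items()
    }

# (neighborhood, venue category) pairs, one per venue
venues = [
    ("Hood A", "Park"),
    ("Hood A", "Park"),
    ("Hood A", "Gym"),
    ("Hood A", "Cafe"),
    ("Hood B", "Cafe"),
]

freqs = category_frequencies(venues)
print(freqs)
```

Because each row is normalised by the neighborhood's own venue count, a neighborhood with 100 venues and one with 10 venues end up on the same scale.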

In [88]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
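The same "top N categories" idea, sketched on a plain dictionary of frequencies rather than a dataframe row. The frequencies are made up, and `top_categories` is a hypothetical name.

```python
def top_categories(freqs, n):
    """Return the n category names with the highest frequency, highest first."""
    ranked = sorted(freqs.items(), key=lambda item: item[1], reverse=True)
    return [category for category, _ in ranked[:n]]

freqs = {"Park": 0.5, "Gym": 0.25, "Cafe": 0.25}
top = top_categories(freqs, 2)
print(top)
```

Ties are kept in their original order because Python's sort is stable, so which of two equally common categories comes first depends on how the input was ordered.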

In [91]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
manhattan_neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
manhattan_neighborhoods_venues_sorted['Neighborhood'] = manhattan_grouped['Neighborhood']

for ind in np.arange(manhattan_grouped.shape[0]):
    manhattan_neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(manhattan_grouped.iloc[ind, :], num_top_venues)

print(manhattan_neighborhoods_venues_sorted.head())

#same for north
indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
tr_neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
tr_neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    tr_neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

print(tr_neighborhoods_venues_sorted.head())


        Neighborhood 1st Most Common Venue 2nd Most Common Venue  \
0  Battery Park City                  Park                 Hotel   
1      Carnegie Hill           Coffee Shop           Pizza Place   
2     Central Harlem    African Restaurant        Cosmetics Shop   
3            Chelsea           Art Gallery           Coffee Shop   
4          Chinatown    Chinese Restaurant          Cocktail Bar   

  3rd Most Common Venue 4th Most Common Venue 5th Most Common Venue  \
0           Coffee Shop                   Gym         Memorial Site   
1                  Café   Japanese Restaurant                   Gym   
2                   Bar   American Restaurant     French Restaurant   
3                  Café        Ice Cream Shop   American Restaurant   
4                Bakery          Optical Shop   American Restaurant   

  6th Most Common Venue  7th Most Common Venue 8th Most Common Venue  \
0    Mexican Restaurant            Beer Garden         Shopping Mall   
1         Grocery St

## Pairing

Now that the dataframes are encoded, we have to match neighborhoods between the cities. Let's start by calculating the Euclidean distance between every Manhattan neighborhood and every North York neighborhood.

In [92]:
import scipy.spatial.distance # makes scipy.spatial.distance.cdist available

In [97]:
manhattan_pairing = manhattan_grouped.drop(columns='Neighborhood')
tr_pairing = toronto_grouped.drop(columns='Neighborhood')

The dataframes have different numbers of columns, because some venue categories appear in only one of the two cities. Let's ignore category types that aren't present in both.

In [110]:
print(manhattan_pairing.shape)
print(tr_pairing.shape)
# keep only the categories present in both dataframes, in the same column order
common_cols = [col for col in manhattan_pairing.columns if col in tr_pairing.columns]
manhattan_trim = manhattan_pairing[common_cols]
tr_trim = tr_pairing[common_cols]
print(manhattan_trim.shape)
print(tr_trim.shape)

(40, 328)
(17, 98)
(40, 90)
(17, 90)
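The trimming step matters because a distance is only meaningful when position i means the same category in both vectors. Here is a small sketch of that alignment on two frequency dictionaries; the values are invented and `align` is a hypothetical helper.

```python
def align(freq_a, freq_b):
    """Restrict two frequency dicts to their shared categories, in one fixed order."""
    common = sorted(set(freq_a) & set(freq_b))
    vec_a = [freq_a[c] for c in common]
    vec_b = [freq_b[c] for c in common]
    return common, vec_a, vec_b

manhattan_hood = {"Park": 0.5, "Hotel": 0.25, "Coffee Shop": 0.25}
north_york_hood = {"Coffee Shop": 0.6, "Bank": 0.2, "Park": 0.2}

common, vec_a, vec_b = align(manhattan_hood, north_york_hood)
print(common, vec_a, vec_b)
```

Dropping the unshared categories discards some information. An alternative that I did not use here would be to keep the union of categories and fill the missing ones with zero.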


In [112]:
dist = scipy.spatial.distance.cdist(manhattan_trim, tr_trim, metric='euclidean')

In [113]:
print(dist)

[[0.27054644 0.51937728 0.26086372 0.26227623 0.27749186 0.24605596
  0.46084239 0.51937728 0.72660132 0.72660132 0.47695268 0.41673609
  0.39592309 0.41673609 0.35827268 0.22933906 0.39758323]
 [0.231504   0.47886156 0.19409593 0.21421798 0.29591776 0.20855296
  0.49113456 0.52055265 0.68800873 0.72178602 0.45999145 0.47678534
  0.36030348 0.49315036 0.28372322 0.16996935 0.45002519]
 [0.2662033  0.48163815 0.24202657 0.24912687 0.30481602 0.22636137
  0.50418006 0.5150812  0.70220112 0.71785001 0.46429692 0.47192801
  0.37298548 0.47192801 0.35944058 0.2278594  0.43783277]
 [0.22759613 0.48764741 0.19468942 0.22271057 0.2978884  0.19128283
  0.5027922  0.5027922  0.70908392 0.71610055 0.44537864 0.46370489
  0.36441734 0.4708385  0.31272992 0.17200786 0.43623388]
 [0.24738634 0.47665501 0.22583096 0.23579652 0.31919821 0.21292474
  0.50714889 0.51691392 0.7051241  0.71916618 0.46042687 0.46125433
  0.38180025 0.4893079  0.34234486 0.21346112 0.45243784]
 [0.24457967 0.49754044 0.2116

I now have the distance between every pair of neighborhoods! To find the closest match for an individual neighborhood, I simply take the minimum value in that neighborhood's row (for a Manhattan neighborhood) or column (for a North York neighborhood) of the matrix. For example, let's find the closest match for Manhattan's Battery Park City in North York.

In [125]:
bpc_min = np.argmin(dist[0])
print(bpc_min)
print(north_onehot['Neighborhood'][bpc_min])

15
Lawrence Manor, Lawrence Heights


Unnamed: 0,Neighborhood,Accessories Store,Airport,American Restaurant,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Bakery,Bank,Bar,...,Supermarket,Sushi Restaurant,Tea Room,Thai Restaurant,Theater,Toy / Game Store,Video Game Store,Video Store,Vietnamese Restaurant,Women's Store
0,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


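That is the only step needed. According to this lookup based on venue information, Lawrence Manor, Lawrence Heights is the most similar neighborhood in North York, Toronto to Battery Park City in Manhattan. One caveat: the columns of `dist` follow the row order of `toronto_grouped`, which has one row per neighborhood, whereas `north_onehot` has one row per venue. The name printed above is therefore the neighborhood of venue row 15 in `north_onehot`, so the match should be re-checked against `toronto_grouped['Neighborhood'][bpc_min]`.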
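A sketch of the lookup in both directions on a tiny, made-up distance matrix. The key assumption is that the label lists are in the same order as the rows and columns of the matrix, which for this notebook means the grouped (one row per neighborhood) dataframes. The labels `M1`, `T1` and so on are placeholders, and `best_matches` is a hypothetical helper.

```python
def best_matches(dist, row_labels, col_labels):
    """Return (row -> nearest column, column -> nearest row) as two dicts."""
    row_to_col = {}
    for i, row in enumerate(dist):
        j = min(range(len(row)), key=lambda c: row[c])      # minimum along the row
        row_to_col[row_labels[i]] = col_labels[j]
    col_to_row = {}
    for j, col_label in enumerate(col_labels):
        i = min(range(len(dist)), key=lambda r: dist[r][j])  # minimum down the column
        col_to_row[col_label] = row_labels[i]
    return row_to_col, col_to_row

dist_small = [
    [0.27, 0.52, 0.23],
    [0.40, 0.19, 0.45],
]
row_to_col, col_to_row = best_matches(dist_small, ["M1", "M2"], ["T1", "T2", "T3"])
print(row_to_col)
print(col_to_row)
```

Note that the two directions need not agree: `M1` is nearest to `T3`, and `T1` is nearest to `M1`, so the mapping is not guaranteed to be symmetric or one-to-one.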

# Results

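In conclusion, I was able to pair each Manhattan neighborhood with its closest North York neighborhood, using the Euclidean distance between venue-category frequencies over the 90 categories the two boroughs share.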




# Discussion

The best way to test the accuracy of these claims would probably be to consult someone with local knowledge of both cities. 

However, we can at least get a feel for this with some 'top venue' exploration:

In [131]:
print(tr_neighborhoods_venues_sorted[tr_neighborhoods_venues_sorted['Neighborhood'] == 'Lawrence Manor, Lawrence Heights'])
print(manhattan_neighborhoods_venues_sorted[manhattan_neighborhoods_venues_sorted['Neighborhood'] == 'Battery Park City'])

                        Neighborhood 1st Most Common Venue  \
10  Lawrence Manor, Lawrence Heights        Clothing Store   

   2nd Most Common Venue 3rd Most Common Venue 4th Most Common Venue  \
10     Accessories Store           Coffee Shop              Boutique   

     5th Most Common Venue 6th Most Common Venue 7th Most Common Venue  \
10  Furniture / Home Store           Event Space    Miscellaneous Shop   

    8th Most Common Venue 9th Most Common Venue 10th Most Common Venue  
10  Vietnamese Restaurant                  Bank                Dog Run  
        Neighborhood 1st Most Common Venue 2nd Most Common Venue  \
0  Battery Park City                  Park                 Hotel   

  3rd Most Common Venue 4th Most Common Venue 5th Most Common Venue  \
0           Coffee Shop                   Gym         Memorial Site   

  6th Most Common Venue 7th Most Common Venue 8th Most Common Venue  \
0    Mexican Restaurant           Beer Garden         Shopping Mall   

  9th Most C

Intuitively, these don't seem to have much overlap.
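That impression can be put into numbers with a simple set overlap. The lists below are copied from the output above; the Battery Park City list stops at the 8th entry because the 9th and 10th are cut off in the printed output, so this is only a rough check.

```python
lawrence_top = [
    "Clothing Store", "Accessories Store", "Coffee Shop", "Boutique",
    "Furniture / Home Store", "Event Space", "Miscellaneous Shop",
    "Vietnamese Restaurant", "Bank", "Dog Run",
]
# only the first eight entries are visible in the output above
battery_top = [
    "Park", "Hotel", "Coffee Shop", "Gym", "Memorial Site",
    "Mexican Restaurant", "Beer Garden", "Shopping Mall",
]

shared = set(lawrence_top) & set(battery_top)
# Jaccard index: shared categories divided by all distinct categories
jaccard = len(shared) / len(set(lawrence_top) | set(battery_top))

print(shared)
print(round(jaccard, 3))
```

Only one category is shared between the two visible lists, which matches the intuition that these two neighborhoods are not an obvious pair.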

In [5]:
address_syd = 'Sydney, NSW'
address_sf = 'San Francisco, CA'
geolocator = Nominatim(user_agent="neighborhood_matching")
location_syd = geolocator.geocode(address_syd)
location_sf = geolocator.geocode(address_sf)
latitude_syd = location_syd.latitude
longitude_syd = location_syd.longitude
latitude_sf = location_sf.latitude
longitude_sf = location_sf.longitude
print('The geographical coordinates of Sydney, NSW are {}, {}.'.format(latitude_syd, longitude_syd))
print('The geographical coordinates of San Francisco, CA are {}, {}.'.format(latitude_sf, longitude_sf))

The geographical coordinates of Sydney, NSW are -33.8548157, 151.2164539.
The geographical coordinates of San Francisco, CA are 45.3842702, -73.995482.


