# Report for Capstone Project 
## Title: Finding Similar Neighborhoods between Toronto, ON and Queens, New York

### Written by Collins Opoku-Baah

### 1. Problem Description and Background

The world has become one giant global village, and every minute people travel from one place to another. 
The purpose of traveling may be temporary (e.g. vacation, business, visits) or permanent (e.g. school, work).
When people live in a particular region for a long time, they tend to embrace the culture (e.g. food, clothing, language) of that region,
which can make it difficult to transition into other neighborhoods. For example, a person who loves seafood and attends yoga classes will want to
move to a new place with such venues in order to continue having that pleasant experience. However, finding a place that shares similar venues
with your current neighborhood can be burdensome, given how large and developed most cities are. Hence, this project aims to find
neighborhoods in two large urban areas, namely Toronto, ON and Queens, New York, that are similar with respect to venues. To do this, I will employ
machine learning approaches and other techniques to segment and cluster the neighborhoods in these two areas. 

### 2. Description of the Data and its Intended Usage

The data for this project comprise the neighborhoods of Toronto, ON, Canada and Queens, New York, USA. We will obtain the coordinates 
(latitude and longitude) of these neighborhoods and, using the Foursquare API, retrieve venues such as yoga studios and restaurants that lie within a defined radius of
each neighborhood. Using k-means clustering, we will then segment and cluster these neighborhoods to determine which ones across the two 
cities are similar.
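
As a rough illustration of the clustering idea described above, the sketch below runs k-means on a made-up table of venue-category frequencies. The neighborhood names, the three categories, and the numbers are all hypothetical, and `n_clusters=2` is an arbitrary choice for the toy data rather than a value taken from this project.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical venue-category frequencies (each row sums to 1); not project data.
# Columns stand for: restaurants, gyms, parks.
names = ["Toronto A", "Queens B", "Toronto C", "Queens D"]
features = np.array([
    [0.70, 0.20, 0.10],   # mostly restaurants
    [0.65, 0.25, 0.10],   # mostly restaurants
    [0.10, 0.20, 0.70],   # mostly parks
    [0.05, 0.25, 0.70],   # mostly parks
])

# Fit k-means with a fixed seed so the toy result is repeatable.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
labels = kmeans.labels_

# Neighborhoods from different cities that share a label count as "similar" here.
for name, label in zip(names, labels):
    print(name, "-> cluster", label)
```

In this toy setup the two restaurant-heavy rows end up in one cluster and the two park-heavy rows in the other, regardless of which city they belong to.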

### Import All Necessary Libraries

In [8]:
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation
from bs4 import BeautifulSoup

#!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

import json # library to handle JSON files

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 
    
# transforming JSON into a pandas dataframe
from pandas import json_normalize

# import k-means from clustering stage
from sklearn.cluster import KMeans

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

#!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library

print('Libraries imported.')

Libraries imported.


### Establish Credentials for the Foursquare API

In [9]:
CLIENT_ID = '0QRBYXQQQNHYFERTIUKAL2W4QKA5FLNNSHFLS3ZJBW4VODAX' # your Foursquare ID
CLIENT_SECRET = '2I3XMEMQFY5C4XBXGKUURPCMWA15AGA412EGVVYHMEPHOJNG' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: 0QRBYXQQQNHYFERTIUKAL2W4QKA5FLNNSHFLS3ZJBW4VODAX
CLIENT_SECRET:2I3XMEMQFY5C4XBXGKUURPCMWA15AGA412EGVVYHMEPHOJNG
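
One possible way to keep the keys out of the notebook itself is to read them from environment variables and mask them when printing. This is only a sketch: the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and both helper functions are assumptions made for illustration, not part of the Foursquare API.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) read from a mapping of environment variables.

    The variable names used here are assumptions made for this sketch.
    """
    client_id = env.get("FOURSQUARE_CLIENT_ID")
    client_secret = env.get("FOURSQUARE_CLIENT_SECRET")
    if not client_id or not client_secret:
        raise RuntimeError("Foursquare credentials are not set in the environment")
    return client_id, client_secret

def mask(secret, visible=4):
    """Show only the first few characters of a secret when printing it."""
    return secret[:visible] + "*" * (len(secret) - visible)

# Example with a stand-in environment (placeholder values, not real keys).
demo_env = {
    "FOURSQUARE_CLIENT_ID": "EXAMPLEID123",
    "FOURSQUARE_CLIENT_SECRET": "EXAMPLESECRET456",
}
demo_id, demo_secret = load_foursquare_credentials(demo_env)
print("CLIENT_ID: " + mask(demo_id))
```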


### Import and Clean Data for the Neighborhoods of Toronto

Get List of postal codes of Canada from Wikipedia

In [10]:
r = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
soup = BeautifulSoup(r.text, 'html.parser')

webtable = soup.table

Postcode = []
Borough = []
Neighborhood = []

for num, tabElmt in enumerate(webtable.find_all('td')):
    
    if num%3 == 0:
        Postcode.append(tabElmt.text.rstrip())
        
    elif num%3 == 1:
        Borough.append(tabElmt.text.rstrip())
        
    elif num%3 == 2:
        Neighborhood.append(tabElmt.text.rstrip())
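
The modulo-3 loop above assigns every third cell to the same column. The same grouping can be sketched as chunking a flat list into rows of three. The cell texts below are made up to stand in for the `<td>` elements, and `rows_from_cells` is a hypothetical helper name. Unlike the loop above, this version silently drops a trailing incomplete row.

```python
def rows_from_cells(cells, width=3):
    """Group a flat list of cell texts into tuples of `width` columns."""
    cleaned = [cell.rstrip() for cell in cells]
    # Zipping the same iterator `width` times yields consecutive chunks.
    it = iter(cleaned)
    return list(zip(*[it] * width))

# Made-up cell texts standing in for the <td> elements of the table.
cells = ["M1A\n", "Not assigned\n", "Not assigned\n",
         "M5A\n", "Downtown Toronto\n", "Harbourfront\n"]

rows = rows_from_cells(cells)

# Transpose the rows to get one list per column.
postcodes, boroughs, neighborhoods = (list(col) for col in zip(*rows))
print(postcodes, boroughs, neighborhoods)
```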
            

Create a DataFrame from the imported Wikipedia data

In [11]:
d = {'PostalCode':Postcode, 'Borough':Borough, 'Neighborhood':Neighborhood}
newtable = pd.DataFrame(data = d)

newtable.drop(index = newtable.index[newtable['Borough']=='Not assigned'], inplace = True)
newtable.reset_index(drop = True, inplace = True)

for num, i in enumerate(newtable['Neighborhood']):
    if i == 'Not assigned':
        newtable.loc[num, 'Neighborhood'] = newtable.loc[num, 'Borough']
        
indx = []
for i in range(newtable['Borough'].size):
    indx.append(newtable['Borough'][i].endswith('Toronto'))

# Use that list to create the new dataframe
Toronto_neigh = newtable[indx].reset_index(drop=True)

t67 = Toronto_neigh.loc[67, 'Neighborhood']
t67 = t67[t67.find('T'):]
Toronto_neigh.loc[67, 'Neighborhood'] = t67

t73 = Toronto_neigh.loc[73, 'Neighborhood']
t73 = t73[:t73.find('9')]
Toronto_neigh.loc[73, 'Neighborhood'] = t73

Toronto_neigh.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M5A,Downtown Toronto,Harbourfront
1,M5A,Downtown Toronto,Regent Park
2,M5B,Downtown Toronto,Ryerson
3,M5B,Downtown Toronto,Garden District
4,M5C,Downtown Toronto,St. James Town
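
The cleaning steps in the cell above (dropping unassigned boroughs, filling missing neighborhood names, keeping boroughs that end in "Toronto") can also be written with boolean masks instead of explicit loops. The sketch below uses a small made-up table in the same shape as the scraped data; the rows are illustrative only.

```python
import pandas as pd

# Made-up rows in the same shape as the scraped table.
table = pd.DataFrame({
    "PostalCode": ["M1A", "M5A", "M7A", "M1B"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Scarborough"],
    "Neighborhood": ["Not assigned", "Harbourfront", "Not assigned", "Rouge"],
})

# Drop rows without a borough.
table = table[table["Borough"] != "Not assigned"].reset_index(drop=True)

# Where the neighborhood is missing, fall back to the borough name.
missing = table["Neighborhood"] == "Not assigned"
table.loc[missing, "Neighborhood"] = table.loc[missing, "Borough"]

# Keep only boroughs whose name ends with "Toronto".
toronto = table[table["Borough"].str.endswith("Toronto")].reset_index(drop=True)
print(toronto)
```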


__Use a geocoder to get the latitude and longitude of all the neighborhoods in boroughs with Toronto in the name__

In [12]:
# Rows 64 and 73 use hard-coded coordinates instead of a geocoder lookup
manual_coords = {64: (43.777140, -79.332610),
                 73: (43.638080, -79.273890)}

geolocator = Nominatim(user_agent="T_explorer")
rows = []

for i, neigh in enumerate(Toronto_neigh['Neighborhood']):
    
    if i in manual_coords:
        latitude, longitude = manual_coords[i]
    else:
        address = neigh + ', Toronto, ON'
        location = None
        
        # the geocoder may return None, so try up to 10 times
        for attempt in range(10):
            location = geolocator.geocode(address)
            if location is not None:
                break
        
        # leave the coordinates empty if every attempt failed
        latitude = location.latitude if location is not None else np.nan
        longitude = location.longitude if location is not None else np.nan
             
    rows.append({'Neighborhood': neigh,
                 'Latitude': latitude,
                 'Longitude': longitude})

# build the dataframe once from the collected rows
Toronto_cord = pd.DataFrame(rows, columns = ['Neighborhood', 'Latitude', 'Longitude'])

Toronto_data = Toronto_neigh.join(Toronto_cord[['Latitude', 'Longitude']])

Toronto_data

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,Harbourfront,43.640080,-79.380150
1,M5A,Downtown Toronto,Regent Park,43.660706,-79.360457
2,M5B,Downtown Toronto,Ryerson,43.621573,-79.559130
3,M5B,Downtown Toronto,Garden District,43.656502,-79.377128
4,M5C,Downtown Toronto,St. James Town,43.669403,-79.372704
5,M4E,East Toronto,The Beaches,43.671024,-79.296712
6,M5E,Downtown Toronto,Berczy Park,43.648001,-79.375385
7,M5G,Downtown Toronto,Central Bay Street,43.660920,-79.385878
8,M6G,Downtown Toronto,Christie,43.664111,-79.418405
9,M5H,Downtown Toronto,Adelaide,43.650298,-79.380477


### Import and Clean Data for the Neighborhoods of New York

In [13]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)
neighborhoods_data = newyork_data['features']


# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

neighborhoods.head()

rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough'] 
    neighborhood_name = data['properties']['name']
        
    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    
    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

# build the dataframe once from the collected rows
neighborhoods = pd.DataFrame(rows, columns=column_names)
    
    
neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585
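
As an alternative to looping over the features, `pandas.json_normalize` can flatten the nested `properties` and `geometry` dictionaries in one call. The sketch below uses two made-up features in the same layout the loop above reads; the names and coordinates are illustrative, not values from the dataset.

```python
import pandas as pd

# Two made-up features in the same layout as newyork_data['features'].
features = [
    {"properties": {"borough": "Queens", "name": "Example One"},
     "geometry": {"coordinates": [-73.90, 40.75]}},
    {"properties": {"borough": "Bronx", "name": "Example Two"},
     "geometry": {"coordinates": [-73.85, 40.89]}},
]

# Nested dictionaries become dotted column names such as 'properties.borough'.
flat = pd.json_normalize(features)

demo = pd.DataFrame({
    "Borough": flat["properties.borough"],
    "Neighborhood": flat["properties.name"],
    # coordinates are stored as [longitude, latitude]
    "Latitude": flat["geometry.coordinates"].apply(lambda c: c[1]),
    "Longitude": flat["geometry.coordinates"].apply(lambda c: c[0]),
})
print(demo)
```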


Get Neighborhoods for a specified borough in New York

In [14]:
NYborough_choice = 'Queens'
NYborough_data = neighborhoods[neighborhoods['Borough'] == NYborough_choice].reset_index(drop=True)
NYborough_data.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Queens,Astoria,40.768509,-73.915654
1,Queens,Woodside,40.746349,-73.901842
2,Queens,Jackson Heights,40.751981,-73.882821
3,Queens,Elmhurst,40.744049,-73.881656
4,Queens,Howard Beach,40.654225,-73.838138


Get the coordinates for the chosen borough for New York

__Combine the data of the Two Cities into one DataFrame__

In [15]:
# stack the Queens rows on top of the Toronto rows (PostalCode column dropped)
Comb_data = pd.concat([NYborough_data, Toronto_data.iloc[:,1:]], ignore_index=True)

Comb_data.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Queens,Astoria,40.768509,-73.915654
1,Queens,Woodside,40.746349,-73.901842
2,Queens,Jackson Heights,40.751981,-73.882821
3,Queens,Elmhurst,40.744049,-73.881656
4,Queens,Howard Beach,40.654225,-73.838138


__Use the Foursquare API to get venues around these neighborhoods__

In [16]:
def getNearbyVenues(names, borough, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, bor, lat, lng in zip(names, borough, latitudes, longitudes):
        print(name, bor)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name,
            bor,
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                  'Borough',
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
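
The function above mixes URL construction, the network call, and response parsing. A sketch of separating the two offline parts is given below, so that they can be checked without calling the API. `build_explore_params` and `extract_venue_rows` are hypothetical helper names, the sample items are hand-written to mimic the fields the function above reads, and returning `None` for a venue with no category is an assumption of this sketch.

```python
def build_explore_params(client_id, client_secret, version, lat, lng,
                         radius=500, limit=100):
    """Query parameters matching the URL template used above."""
    return {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }

def extract_venue_rows(name, borough, lat, lng, items):
    """Flatten the `items` list of an explore response into tuples."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Assumption: use None when a venue has no category.
        category = categories[0]["name"] if categories else None
        rows.append((name, borough, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"],
                     category))
    return rows

# Hand-written stand-in for response["response"]["groups"][0]["items"].
sample_items = [
    {"venue": {"name": "Example Cafe", "location": {"lat": 40.1, "lng": -73.9},
               "categories": [{"name": "Cafe"}]}},
    {"venue": {"name": "Example Spot", "location": {"lat": 40.2, "lng": -73.8},
               "categories": []}},
]

params = build_explore_params("ID", "SECRET", "20180605", 40.768509, -73.915654)
rows = extract_venue_rows("Astoria", "Queens", 40.768509, -73.915654, sample_items)
print(params["ll"], rows)
```

The dictionary can be passed to `requests.get(url, params=params)`, which takes care of URL encoding.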

Use the function above to get the venues for each of the neighborhoods

In [17]:
LIMIT = 200
Combined_venues = getNearbyVenues(names=Comb_data['Neighborhood'],
                                  borough=Comb_data['Borough'],
                                   latitudes=Comb_data['Latitude'],
                                   longitudes=Comb_data['Longitude'])

Astoria Queens
Woodside Queens
Jackson Heights Queens
Elmhurst Queens
Howard Beach Queens
Corona Queens
Forest Hills Queens
Kew Gardens Queens
Richmond Hill Queens
Flushing Queens
Long Island City Queens
Sunnyside Queens
East Elmhurst Queens
Maspeth Queens
Ridgewood Queens
Glendale Queens
Rego Park Queens
Woodhaven Queens
Ozone Park Queens
South Ozone Park Queens
College Point Queens
Whitestone Queens
Bayside Queens
Auburndale Queens
Little Neck Queens
Douglaston Queens
Glen Oaks Queens
Bellerose Queens
Kew Gardens Hills Queens
Fresh Meadows Queens
Briarwood Queens
Jamaica Center Queens
Oakland Gardens Queens
Queens Village Queens
Hollis Queens
South Jamaica Queens
St. Albans Queens
Rochdale Queens
Springfield Gardens Queens
Cambria Heights Queens
Rosedale Queens
Far Rockaway Queens
Broad Channel Queens
Breezy Point Queens
Steinway Queens
Beechhurst Queens
Bay Terrace Queens
Edgemere Queens
Arverne Queens
Rockaway Beach Queens
Neponsit Queens
Murray Hill Queens
Floral Park Queens
Holli

In [18]:
print(Combined_venues.shape)
Combined_venues.head()

(5872, 8)


Unnamed: 0,Neighborhood,Borough,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Astoria,Queens,40.768509,-73.915654,Favela Grill,40.767348,-73.917897,Brazilian Restaurant
1,Astoria,Queens,40.768509,-73.915654,Orange Blossom,40.769856,-73.917012,Gourmet Shop
2,Astoria,Queens,40.768509,-73.915654,Titan Foods Inc.,40.769198,-73.919253,Gourmet Shop
3,Astoria,Queens,40.768509,-73.915654,CrossFit Queens,40.769404,-73.918977,Gym
4,Astoria,Queens,40.768509,-73.915654,Simply Fit Astoria,40.769114,-73.912403,Gym


__Utilize one hot encoding to create features for the clustering__

In [19]:
# one hot encoding
Combined_onehot = pd.get_dummies(Combined_venues[['Venue Category']], prefix="", prefix_sep="")

# add the Neighborhood and Borough columns back to the dataframe
Combined_onehot[['Neighborhood', 'Borough']] = Combined_venues[['Neighborhood', 'Borough']]

# move the last column (Borough) to the front
fixed_columns = [Combined_onehot.columns[-1]] + list(Combined_onehot.columns[:-1])
Combined_onehot = Combined_onehot[fixed_columns]

Combined_onehot.head()

Unnamed: 0,Borough,Accessories Store,Afghan Restaurant,African Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,...,Vietnamese Restaurant,Warehouse Store,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo
0,Queens,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Queens,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Queens,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Queens,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Queens,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
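
To make the encoding step concrete, the sketch below applies `pd.get_dummies` to a tiny made-up venue table. The neighborhood labels and categories are illustrative only. Note that `insert` would raise an error if a venue category happened to be named "Neighborhood" as well, so real data may need that column renamed first.

```python
import pandas as pd

# Made-up venues: one row per venue, as in Combined_venues.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "B"],
    "Venue Category": ["Gym", "Park", "Gym"],
})

# One indicator column per category; empty prefix keeps the bare category names.
onehot = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="")

# Put the neighborhood label in the first column.
onehot.insert(0, "Neighborhood", venues["Neighborhood"])
print(onehot)
```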


__Group the one-hot encoded dataframe by Neighborhood and Borough, taking the mean of each category__

In [20]:
Combined_grouped = Combined_onehot.groupby(['Neighborhood', 'Borough']).mean().reset_index()
Combined_grouped

Unnamed: 0,Neighborhood,Borough,Accessories Store,Afghan Restaurant,African Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,...,Vietnamese Restaurant,Warehouse Store,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo
0,Adelaide,Downtown Toronto,0.00000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.000000,0.0,0.00000,0.0,0.010000,0.000000,0.000000,0.010000,0.000000,0.0
1,Arverne,Queens,0.00000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.000000,0.0,0.00000,0.0,0.000000,0.062500,0.000000,0.000000,0.000000,0.0
2,Astoria,Queens,0.00000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.000000,0.0,0.00000,0.0,0.000000,0.010000,0.000000,0.000000,0.000000,0.0
3,Astoria Heights,Queens,0.00000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.000000,0.0,0.00000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0
4,Auburndale,Queens,0.00000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.000000,0.0,0.00000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0
5,Bathurst Quay,Downtown Toronto,0.00000,0.000000,0.0,0.0,0.0,0.0,0.0,0.038462,...,0.000000,0.0,0.00000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0
6,Bay Terrace,Queens,0.02381,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.000000,0.0,0.02381,0.0,0.000000,0.000000,0.000000,0.071429,0.000000,0.0
7,Bayside,Queens,0.00000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.000000,0.0,0.00000,0.0,0.014925,0.000000,0.000000,0.000000,0.014925,0.0
8,Bayswater,Queens,0.00000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.000000,0.0,0.00000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0
9,Beechhurst,Queens,0.00000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.000000,0.0,0.00000,0.0,0.000000,0.000000,0.000000,0.000000,0.083333,0.0
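
Taking the mean of one-hot rows within a group gives the share of that group's venues falling in each category, which is why each value above lies between 0 and 1. The toy example below, with made-up neighborhoods and categories, shows this and also picks the most frequent category per neighborhood.

```python
import pandas as pd

# Made-up one-hot rows: neighborhood A has four venues, B has two.
onehot = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Gym":  [1, 1, 0, 0, 0, 0],
    "Park": [0, 0, 1, 0, 0, 1],
    "Cafe": [0, 0, 0, 1, 1, 0],
})

# The mean of 0/1 indicators is the frequency of each category.
grouped = onehot.groupby("Neighborhood").mean().reset_index()

# Frequencies within a neighborhood sum to 1.
row_sums = grouped[["Gym", "Park", "Cafe"]].sum(axis=1)

# Most frequent category per neighborhood (ties go to the first column).
top = grouped.set_index("Neighborhood").idxmax(axis=1)
print(grouped)
print(top)
```

These per-neighborhood frequency vectors are the features that a clustering step would consume.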
