# Data Science Real World Project

# The Battle of Neighborhoods: Identifying the Similarity or Dissimilarity of New York City and Toronto

### Business problem:

New York and Toronto are both very diverse cities and the financial capitals of their respective countries. The problem is to determine how similar or dissimilar the two cities are by comparing their neighborhoods. Is New York City more like Toronto, Paris, or some other multicultural city?

### Source of Data
We need several data sources to address this problem. Wikipedia provides the lists of postal codes of Canada[2] and New York City[3]. This data is unstructured and has to be scraped and cleaned before it is structured and ready for analysis.

We can get the latitude and longitude of each neighborhood from a dataset of geographical coordinates[4].

Another source of data is the Foursquare database. We will use the Foursquare API[4] to get all venues. The data is returned in JSON format, and we can then extract the venue categories from it.

## Data Preparation
## Preparing Toronto Data
We are going to scrape Toronto's postal code data from https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
and read Toronto's coordinates from https://cocl.us/Geospatial_data


In [17]:
# import pandas
import pandas as pd

In [2]:
toronto_df = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")[0]

In [3]:
toronto_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,M1ANot assigned,M2ANot assigned,M3ANorth York(Parkwoods),M4ANorth York(Victoria Village),M5ADowntown Toronto(Regent Park / Harbourfront),M6ANorth York(Lawrence Manor / Lawrence Heights),M7AQueen's Park(Ontario Provincial Government),M8ANot assigned,M9AEtobicoke(Islington Avenue)
1,M1BScarborough(Malvern / Rouge),M2BNot assigned,M3BNorth York(Don Mills)North,M4BEast York(Parkview Hill / Woodbine Gardens),"M5BDowntown Toronto(Garden District, Ryerson)",M6BNorth York(Glencairn),M7BNot assigned,M8BNot assigned,M9BEtobicoke(West Deane Park / Princess Garden...
2,M1CScarborough(Rouge Hill / Port Union / Highl...,M2CNot assigned,M3CNorth York(Don Mills)South(Flemingdon Park),M4CEast York(Woodbine Heights),M5CDowntown Toronto(St. James Town),M6CYork(Humewood-Cedarvale),M7CNot assigned,M8CNot assigned,M9CEtobicoke(Eringate / Bloordale Gardens / Ol...
3,M1EScarborough(Guildwood / Morningside / West ...,M2ENot assigned,M3ENot assigned,M4EEast Toronto(The Beaches),M5EDowntown Toronto(Berczy Park),M6EYork(Caledonia-Fairbanks),M7ENot assigned,M8ENot assigned,M9ENot assigned
4,M1GScarborough(Woburn),M2GNot assigned,M3GNot assigned,M4GEast York(Leaside),M5GDowntown Toronto(Central Bay Street),M6GDowntown Toronto(Christie),M7GNot assigned,M8GNot assigned,M9GNot assigned


In [4]:
to_coord = pd.read_csv("https://cocl.us/Geospatial_data")

In [5]:
to_coord.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


## Data Cleansing Part I:
Cleaning the Toronto Data

In [6]:
toronto_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,M1ANot assigned,M2ANot assigned,M3ANorth York(Parkwoods),M4ANorth York(Victoria Village),M5ADowntown Toronto(Regent Park / Harbourfront),M6ANorth York(Lawrence Manor / Lawrence Heights),M7AQueen's Park(Ontario Provincial Government),M8ANot assigned,M9AEtobicoke(Islington Avenue)
1,M1BScarborough(Malvern / Rouge),M2BNot assigned,M3BNorth York(Don Mills)North,M4BEast York(Parkview Hill / Woodbine Gardens),"M5BDowntown Toronto(Garden District, Ryerson)",M6BNorth York(Glencairn),M7BNot assigned,M8BNot assigned,M9BEtobicoke(West Deane Park / Princess Garden...
2,M1CScarborough(Rouge Hill / Port Union / Highl...,M2CNot assigned,M3CNorth York(Don Mills)South(Flemingdon Park),M4CEast York(Woodbine Heights),M5CDowntown Toronto(St. James Town),M6CYork(Humewood-Cedarvale),M7CNot assigned,M8CNot assigned,M9CEtobicoke(Eringate / Bloordale Gardens / Ol...
3,M1EScarborough(Guildwood / Morningside / West ...,M2ENot assigned,M3ENot assigned,M4EEast Toronto(The Beaches),M5EDowntown Toronto(Berczy Park),M6EYork(Caledonia-Fairbanks),M7ENot assigned,M8ENot assigned,M9ENot assigned
4,M1GScarborough(Woburn),M2GNot assigned,M3GNot assigned,M4GEast York(Leaside),M5GDowntown Toronto(Central Bay Street),M6GDowntown Toronto(Christie),M7GNot assigned,M8GNot assigned,M9GNot assigned


In [7]:
# transforming into one long column
toronto_df = toronto_df.melt()

In [8]:
toronto_df.head()

Unnamed: 0,variable,value
0,0,M1ANot assigned
1,0,M1BScarborough(Malvern / Rouge)
2,0,M1CScarborough(Rouge Hill / Port Union / Highl...
3,0,M1EScarborough(Guildwood / Morningside / West ...
4,0,M1GScarborough(Woburn)


In [9]:
# extract postal code
toronto_df['Postal Code'] = toronto_df['value'].apply(lambda x: x[:3])

In [10]:
toronto_df.head()

Unnamed: 0,variable,value,Postal Code
0,0,M1ANot assigned,M1A
1,0,M1BScarborough(Malvern / Rouge),M1B
2,0,M1CScarborough(Rouge Hill / Port Union / Highl...,M1C
3,0,M1EScarborough(Guildwood / Morningside / West ...,M1E
4,0,M1GScarborough(Woburn),M1G


In [11]:
# again, extract Borough
toronto_df['Borough'] = toronto_df['value'].apply(lambda x: x[3:])

In [12]:
toronto_df.head()

Unnamed: 0,variable,value,Postal Code,Borough
0,0,M1ANot assigned,M1A,Not assigned
1,0,M1BScarborough(Malvern / Rouge),M1B,Scarborough(Malvern / Rouge)
2,0,M1CScarborough(Rouge Hill / Port Union / Highl...,M1C,Scarborough(Rouge Hill / Port Union / Highland...
3,0,M1EScarborough(Guildwood / Morningside / West ...,M1E,Scarborough(Guildwood / Morningside / West Hill)
4,0,M1GScarborough(Woburn),M1G,Scarborough(Woburn)


In [13]:
# drop the columns variable and value
toronto_df.drop(['variable', 'value'], axis=1, inplace=True)

In [14]:
toronto_df.head()

Unnamed: 0,Postal Code,Borough
0,M1A,Not assigned
1,M1B,Scarborough(Malvern / Rouge)
2,M1C,Scarborough(Rouge Hill / Port Union / Highland...
3,M1E,Scarborough(Guildwood / Morningside / West Hill)
4,M1G,Scarborough(Woburn)


In [15]:
# drop rows whose Borough is 'Not assigned'
toronto_df = toronto_df[toronto_df['Borough']!='Not assigned'].reset_index(drop=True)

In [16]:
toronto_df.head()

Unnamed: 0,Postal Code,Borough
0,M1B,Scarborough(Malvern / Rouge)
1,M1C,Scarborough(Rouge Hill / Port Union / Highland...
2,M1E,Scarborough(Guildwood / Morningside / West Hill)
3,M1G,Scarborough(Woburn)
4,M1H,Scarborough(Cedarbrae)


In [17]:
# extract neighborhoods
import re

In [18]:
toronto_df['Neighborhood'] = toronto_df['Borough'].apply(lambda x : (re.findall(r"\((.*?)\)", x))[0])

In [19]:
toronto_df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1B,Scarborough(Malvern / Rouge),Malvern / Rouge
1,M1C,Scarborough(Rouge Hill / Port Union / Highland...,Rouge Hill / Port Union / Highland Creek
2,M1E,Scarborough(Guildwood / Morningside / West Hill),Guildwood / Morningside / West Hill
3,M1G,Scarborough(Woburn),Woburn
4,M1H,Scarborough(Cedarbrae),Cedarbrae


In [20]:
toronto_df['Borough'] = toronto_df['Borough'].apply(lambda x: x.split('(')[0])

In [21]:
toronto_df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1B,Scarborough,Malvern / Rouge
1,M1C,Scarborough,Rouge Hill / Port Union / Highland Creek
2,M1E,Scarborough,Guildwood / Morningside / West Hill
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [22]:
toronto_df['Neighborhood'] = toronto_df['Neighborhood'].apply(lambda x: x.replace(" / ", ','))

In [23]:
toronto_df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1B,Scarborough,"Malvern,Rouge"
1,M1C,Scarborough,"Rouge Hill,Port Union,Highland Creek"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [24]:
toronto_df

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1B,Scarborough,"Malvern,Rouge"
1,M1C,Scarborough,"Rouge Hill,Port Union,Highland Creek"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village,St. Phillips,Martin Grove Ga..."
101,M9V,Etobicoke,"South Steeles,Silverstone,Humbergate,Jamestown..."


In [None]:
toronto_df['Neighborhood'] = toronto_df.Neighborhood.apply(lambda x: x.replace(' / ',',').strip())

In [None]:
toronto_df
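The slicing, regex, and split steps above can also be expressed as one parsing function. The following is a minimal sketch, not the notebook's own method: the name `parse_cell` and the pattern are illustrative assumptions, and the pattern assumes every cell starts with a three-character postal code followed by the borough and an optional parenthesised neighborhood list. Text after the closing parenthesis (as in `M3BNorth York(Don Mills)North`) is ignored.

```python
import re

# Assumed cell layout: postal code, borough text, optional "(neighborhoods)".
CELL_PATTERN = re.compile(
    r"^(?P<code>[A-Z]\d[A-Z])(?P<borough>[^(]*)(?:\((?P<hoods>[^)]*)\))?"
)

def parse_cell(cell):
    """Split one scraped cell into (postal_code, borough, neighborhoods)."""
    match = CELL_PATTERN.match(cell.strip())
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    code = match.group("code")
    borough = match.group("borough").strip()
    if borough == "Not assigned":
        # Unassigned postal codes carry no borough or neighborhoods.
        return code, None, []
    hoods = match.group("hoods") or ""
    names = [name.strip() for name in hoods.split("/") if name.strip()]
    return code, borough, names

print(parse_cell("M1BScarborough(Malvern / Rouge)"))
print(parse_cell("M1ANot assigned"))
```

Returning the neighborhoods as a list keeps the individual names available; they can be joined with `','.join(names)` to reproduce the comma-separated column used in this notebook.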

## Merging the two dataframes

In [None]:
# shapes of the data
(toronto_df.shape, to_coord.shape)

In [None]:
# sorting both dataframes
to_coord = to_coord.sort_values(by="Postal Code", ascending=True)
toronto_df = toronto_df.sort_values(by="Postal Code", ascending=True)

In [None]:
# merging the two dataframes
trt_df = toronto_df.merge(to_coord, on='Postal Code', sort=False)

In [None]:
trt_df.head()

In [None]:
# make sure that the data has 15 boroughs and 103 neighborhoods
print("The dataframe has {} boroughs and {} neighborhoods.".format(len(trt_df['Borough'].unique()), 
                                                                  trt_df.shape[0]))
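Because `merge` performs an inner join by default, postal codes that appear in only one of the two dataframes are dropped without any warning. A small sanity check such as the sketch below can list those codes before merging. The helper name `unmatched_keys` is hypothetical, and the sample codes are made up for illustration.

```python
def unmatched_keys(left_keys, right_keys):
    """Return the keys found only on the left and only on the right, sorted."""
    left, right = set(left_keys), set(right_keys)
    return sorted(left - right), sorted(right - left)

# Illustrative values; in the notebook one would pass
# toronto_df['Postal Code'] and to_coord['Postal Code'] instead.
only_left, only_right = unmatched_keys(["M1B", "M1C", "M9Z"], ["M1B", "M1C", "M1E"])
print(only_left)   # ['M9Z']
print(only_right)  # ['M1E']
```

If both lists come back empty, every postal code has a partner and the inner join loses no rows.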

In [None]:
trt_df.to_csv("toronto_data.csv", index=False)

# Downloading and Parsing New York Data

In [29]:
# download New York data in json format
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset

In [2]:
# parsing json
import json

In [3]:
with open('newyork_data.json') as json_data:
    newyork = json.load(json_data)


In [4]:
neighborhoods_data = newyork['features']

In [5]:
col_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude']

In [6]:
ny_df = pd.DataFrame(columns=col_names)

In [7]:
ny_df

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude


In [8]:
neighborhoods_data[0]

{'type': 'Feature',
 'id': 'nyu_2451_34572.1',
 'geometry': {'type': 'Point',
  'coordinates': [-73.84720052054902, 40.89470517661]},
 'geometry_name': 'geom',
 'properties': {'name': 'Wakefield',
  'stacked': 1,
  'annoline1': 'Wakefield',
  'annoline2': None,
  'annoline3': None,
  'annoangle': 0.0,
  'borough': 'Bronx',
  'bbox': [-73.84720052054902,
   40.89470517661,
   -73.84720052054902,
   40.89470517661]}}
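The feature above stores its point in GeoJSON order, longitude first and latitude second, which is easy to swap by accident. The sketch below isolates that step in a small function. The name `feature_to_row` is an assumption for illustration, and the sample dictionary is a trimmed copy of the feature printed above.

```python
def feature_to_row(feature):
    """Turn one GeoJSON point feature into a flat row dictionary."""
    # GeoJSON coordinates are ordered [longitude, latitude].
    lon, lat = feature["geometry"]["coordinates"][:2]
    props = feature["properties"]
    return {
        "Borough": props["borough"],
        "Neighborhood": props["name"],
        "Latitude": lat,
        "Longitude": lon,
    }

# Trimmed copy of the first feature shown above.
wakefield = {
    "type": "Feature",
    "geometry": {"type": "Point",
                 "coordinates": [-73.84720052054902, 40.89470517661]},
    "properties": {"name": "Wakefield", "borough": "Bronx"},
}

row = feature_to_row(wakefield)
print(row)
```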

In [9]:
# loop through the data and fill the dataframe one row at a time


In [10]:
# DataFrame.append was removed in pandas 2.0, so collect the rows first
rows = []
for data in neighborhoods_data:
    # GeoJSON stores coordinates as [longitude, latitude]
    lon, lat = data['geometry']['coordinates'][:2]
    rows.append({'Borough': data['properties']['borough'],
                 'Neighborhood': data['properties']['name'],
                 'Latitude': lat,
                 'Longitude': lon})

# build the dataframe in one step, keeping the column order defined above
ny_df = pd.DataFrame(rows, columns=col_names)

In [11]:
ny_df.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


In [12]:
# check the shape of the dataset
print("The dataframe has {} boroughs and {} neighborhoods".format(
len(ny_df['Borough'].unique()), ny_df.shape[0]))

The dataframe has 5 boroughs and 306 neighborhoods


In [13]:
ny_df.shape

(306, 4)

# Get Latitude and Longitude of both New York and Toronto Using Geopy

In [14]:
# To Install geocoder and geopy uncomment the following
#!pip install geocoder
#!pip install geopy


In [15]:
import geocoder
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

In [16]:
# New York
address_ny = "New York City, NY"

geolocator_ny = Nominatim(user_agent='ny_explorer')
location_ny = geolocator_ny.geocode(address_ny)
# get long and lat
lat_ny = location_ny.latitude
lon_ny = location_ny.longitude


In [17]:
# Toronto City
address_ca = "Toronto City, CA"

geolocator_ca = Nominatim(user_agent='ca_explorer')
location_ca = geolocator_ca.geocode(address_ca)
# get long and lat
lat_ca = location_ca.latitude
lon_ca = location_ca.longitude


In [18]:
print("The latitude and Longitude of New York City are {} and {} respectively.".format(lat_ny, lon_ny))
print("The latitude and Longitude of Toronto City are {} and {} respectively.".format(lat_ca, lon_ca))

The latitude and Longitude of New York City are 40.7127281 and -74.0060152 respectively.
The latitude and Longitude of Toronto City are 43.6534817 and -79.3839347 respectively.
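geopy's `geocode` returns `None` when the service finds no match, in which case reading `.latitude` fails with an `AttributeError`. The sketch below wraps the lookup so that an empty result produces a clearer error. The helper `geocode_or_raise` and the offline stand-in `fake_geocode` are hypothetical names introduced only for this example.

```python
from types import SimpleNamespace

def geocode_or_raise(geocode, address):
    """Call a geopy-style geocode function and fail loudly on an empty result."""
    location = geocode(address)
    if location is None:
        raise LookupError(f"no geocoding result for {address!r}")
    return location.latitude, location.longitude

# Stand-in geocoder so the sketch runs offline; the coordinates are
# copied from the printed output above.
def fake_geocode(address):
    known = {
        "New York City, NY": SimpleNamespace(latitude=40.7127281,
                                             longitude=-74.0060152),
    }
    return known.get(address)

lat, lon = geocode_or_raise(fake_geocode, "New York City, NY")
print(lat, lon)
```

In the notebook, the real lookup would be passed in the same way, for example `geocode_or_raise(geolocator_ny.geocode, address_ny)`.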


# Visualize Map of Toronto and New York City using Folium

In [19]:
#!pip install folium # uncomment this line if you haven't installed the folium library


In [20]:
import folium # import map rendering library

In [8]:
# Read and explore toronto_data.csv, which was saved previously
to_df = pd.read_csv('toronto_data.csv')
to_df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern,Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill,Port Union,Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


In [22]:
# Explore the New York data
ny_df.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


In [28]:
# create map of Toronto using latitude and longitude values
# zoom_start expects an integer zoom level; 10 shows the whole city
map_toronto = folium.Map(location=[lat_ca, lon_ca], zoom_start=10)

# add markers to map
for lat, lon, borough, neighborhood in zip(to_df['Latitude'], to_df['Longitude'], to_df['Borough'], to_df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup = label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)


In [40]:
map_toronto

## Create Foursquare Developer Account

Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.

N.B.: You need to register for a Foursquare developer account to get your credentials. If you are new to the Foursquare API, you can register here: <a href="https://foursquare.com/developers/home">Create Developer Account</a>. Don't share your credentials publicly.

Define Foursquare Credentials and Version

In [18]:
CLIENT_ID = 'YOUR_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YOUR_CLIENT_ID
CLIENT_SECRET:YOUR_CLIENT_SECRET
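One way to keep credentials out of a shared notebook is to read them from environment variables instead of typing them into a cell. The sketch below illustrates the idea. The function name and the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` are assumptions chosen for this example, not names required by Foursquare.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Read the client ID and secret from a mapping of environment variables."""
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        # Report which variable is absent without printing any secret value.
        raise RuntimeError(
            f"set the environment variable {missing.args[0]} before running this cell"
        ) from None

# Typical use in the notebook, once both variables are set in the shell:
# CLIENT_ID, CLIENT_SECRET = load_foursquare_credentials()
```

Passing the mapping as a parameter makes the function easy to try with a plain dictionary before touching the real environment.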


## Get Location Data From Foursquare 

Now, let's get the top 100 venues within a radius of 500 meters.

Let's create a function to repeat the same process to all the neighborhoods.

In [10]:
import requests # import requests for sending an http request

In [29]:
       
# create the API request URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    )
headers = {"Accept": "application/json"}
# make the GET request
results = requests.get(url, headers=headers)
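The request URL above is assembled by string formatting. As an alternative, the query string can be built from a dictionary with `urllib.parse.urlencode`, which escapes special characters and makes each parameter visible by name. This is a minimal sketch: `build_explore_url` is a hypothetical helper, the credentials are placeholders, and the parameter names mirror those used elsewhere in this notebook for the v2 `venues/explore` endpoint.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=500, limit=100):
    """Build the explore URL for one latitude/longitude pair."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",   # latitude,longitude of the search centre
        "radius": radius,       # search radius in meters
        "limit": limit,         # maximum number of venues to return
    }
    return f"{EXPLORE_ENDPOINT}?{urlencode(params)}"

# Placeholder credentials and the coordinates of postal code M1B from above.
example_url = build_explore_url("YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET",
                                "20180605", 43.806686, -79.194353)
print(example_url)
```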

In [30]:
results

<Response [410]>

In [31]:
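def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()
        
        # stop at the first error response (such as 410) and return it for inspection
        if results['meta']['code'] != 200:
            return results
        
        # keep the response of every neighborhood, not only the last one
        venues_list.append((name, results['response']))
    
    return venues_list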

In [32]:
toronto_venues = getNearbyVenues(
names=to_df['Neighborhood'], latitudes=to_df['Latitude'], longitudes=to_df['Longitude'])

In [33]:
toronto_venues

{'meta': {'code': 410,
  'errorType': 'deprecated',
  'errorDetail': 'Usage of the V2 Places API has been deprecated for new Projects. Please see our updated documentation for V3 for more details: https://docs.foursquare.com/reference',
  'requestId': '62999ba5ef9a3e7bda3e4d09'},
 'response': {}}
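The response above carries its status in `meta.code`, so it is worth checking that field before reading any venues. The sketch below does so and then flattens a successful response into rows. It assumes the `groups` → `items` → `venue` layout used by v2 explore responses; that layout, the helper name `venues_from_response`, and the sample venue are assumptions for illustration, and a V3 response would need different parsing.

```python
def venues_from_response(neighborhood, payload):
    """Return (neighborhood, venue, lat, lng, category) rows from one response."""
    meta = payload.get("meta", {})
    if meta.get("code") != 200:
        raise RuntimeError(
            f"Foursquare error {meta.get('code')}: {meta.get('errorDetail')}"
        )
    rows = []
    for group in payload["response"].get("groups", []):
        for item in group.get("items", []):
            venue = item["venue"]
            # A venue may have no category; fall back to None in that case.
            categories = venue.get("categories") or [{}]
            rows.append((neighborhood,
                         venue["name"],
                         venue["location"]["lat"],
                         venue["location"]["lng"],
                         categories[0].get("name")))
    return rows

# Made-up successful payload in the assumed v2 layout.
sample = {
    "meta": {"code": 200},
    "response": {"groups": [{"items": [
        {"venue": {"name": "Example Cafe",
                   "location": {"lat": 43.77, "lng": -79.21},
                   "categories": [{"name": "Coffee Shop"}]}},
    ]}]},
}
print(venues_from_response("Woburn", sample))
```

Raising on a non-200 code surfaces problems such as the 410 response shown above immediately instead of letting an empty `response` pass through silently.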