# Capstone Project

## Introduction / Business problem

For this project I want to compare the neighborhoods of two cities: New York and Toronto.

**Why?**

New York and Toronto are two major North American cities and business capitals, so we can assume that their inhabitants have much in common. A manager who wants to open a business in both cities, or expand an existing business from one city to the other, may therefore want to know which neighborhoods are similar.

For example, if the manager wants to open a restaurant, its design, menu, and atmosphere will attract different kinds of customers depending on the neighborhood.
 
**How?**

To achieve this, we'll combine the datasets of both cities into one and cluster all the neighborhoods together.
We're likely to find many neighborhoods from the same city in a given cluster, but any neighborhood from the other city that lands in that cluster is valuable information: it points to a cross-city similarity.
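
As a rough illustration of this idea, the sketch below stacks two small, made-up tables of per-neighborhood venue frequencies (one per city) and clusters all rows together with `KMeans`. The neighborhood labels, the feature columns, their values, and the choice of two clusters are all assumptions for illustration, not results from this project.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical venue-category frequencies per neighborhood (made-up values)
ny = pd.DataFrame({
    "Neighborhood": ["NY_A", "NY_B"],
    "Restaurant": [0.9, 0.1],
    "Park": [0.1, 0.9],
})
toronto = pd.DataFrame({
    "Neighborhood": ["TO_A", "TO_B"],
    "Restaurant": [0.2, 0.8],
    "Park": [0.8, 0.2],
})

# Tag each row with its city, then stack both cities into one dataframe
ny["City"] = "New York"
toronto["City"] = "Toronto"
combined = pd.concat([ny, toronto], ignore_index=True)

# Cluster every neighborhood together, ignoring which city it comes from
features = combined[["Restaurant", "Park"]]
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
combined["Cluster"] = kmeans.labels_

# Number of distinct cities per cluster: a value of 2 marks a mixed cluster
cities_per_cluster = combined.groupby("Cluster")["City"].nunique()
print(combined[["City", "Neighborhood", "Cluster"]])
print(cities_per_cluster)
```

In this toy data each cluster contains one neighborhood from each city, which is the kind of cross-city match the study is looking for.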

## The data to be used

We'll use Foursquare to explore neighborhoods in New York City and Toronto. We'll build a dataframe that combines the neighborhoods of both cities.
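
To give an idea of what exploring a neighborhood with Foursquare could look like, here is a minimal sketch that only builds the request URL, without sending it. It assumes the legacy Foursquare v2 `venues/explore` endpoint; the helper name `build_explore_url`, the credentials, the version date, the radius, and the limit are placeholders chosen for illustration.

```python
from urllib.parse import urlencode

# Placeholders: real credentials come from a Foursquare developer account
CLIENT_ID = "YOUR_CLIENT_ID"
CLIENT_SECRET = "YOUR_CLIENT_SECRET"
VERSION = "20180605"  # API version date (assumed value)


def build_explore_url(lat, lng, radius=500, limit=100):
    """Build a venues/explore URL around one neighborhood centre (hypothetical helper)."""
    params = {
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
        "v": VERSION,
        "ll": f"{lat},{lng}",  # latitude,longitude of the neighborhood
        "radius": radius,      # search radius in metres
        "limit": limit,        # maximum number of venues to return
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)


# Example with the coordinates of one New York neighborhood
url = build_explore_url(40.894705, -73.847201)
print(url)
```

The resulting URL could then be passed to `requests.get(url).json()` once real credentials are filled in.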

**Necessary libraries**

In [4]:
import json  # reads the New York GeoJSON file

import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim  # converts an address into latitude and longitude
import requests  # HTTP requests
from pandas import json_normalize  # flattens nested JSON (lives at the top level since pandas 1.0)
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
#!conda install -c conda-forge folium=0.5.0 --yes
import folium  # map rendering

Collecting package metadata (repodata.json): done
Solving environment: done

# All requested packages already installed.



**New York Data**

In [6]:
with open('nyu-geojson.json') as json_data:
    newyork_data = json.load(json_data)
print('Data for New York ready')  

Data for New York ready


In [7]:
neighborhoods_data = newyork_data['features']
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude']

# collect one row per neighborhood, then build the New York dataframe in one step
# (DataFrame.append was removed in pandas 2.0)
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

neighborhoods_NY = pd.DataFrame(rows, columns=column_names)
neighborhoods_NY.head()

| | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Bronx | Wakefield | 40.894705 | -73.847201 |
| 1 | Bronx | Co-op City | 40.874294 | -73.829939 |
| 2 | Bronx | Eastchester | 40.887556 | -73.827806 |
| 3 | Bronx | Fieldston | 40.895437 | -73.905643 |
| 4 | Bronx | Riverdale | 40.890834 | -73.912585 |
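
As an alternative to the explicit loop, the nested GeoJSON records could probably be flattened with `pd.json_normalize`. The sketch below uses a two-feature stand-in with the same nesting as the cell above (`properties.name`, `properties.borough`, `geometry.coordinates`); it assumes pandas 1.0 or later and is not the code that produced the table.

```python
import pandas as pd

# Tiny stand-in with the same nesting as the GeoJSON features used above
features = [
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [-73.847201, 40.894705]},
     "properties": {"name": "Wakefield", "borough": "Bronx"}},
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [-73.829939, 40.874294]},
     "properties": {"name": "Co-op City", "borough": "Bronx"}},
]

# Nested dicts become dotted column names; the coordinate list stays a list
flat = pd.json_normalize(features)

# GeoJSON order is [longitude, latitude], so index 1 is the latitude
neighborhoods = pd.DataFrame({
    "Borough": flat["properties.borough"],
    "Neighborhood": flat["properties.name"],
    "Latitude": flat["geometry.coordinates"].str[1],
    "Longitude": flat["geometry.coordinates"].str[0],
})
print(neighborhoods)
```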


**Toronto data**

In [27]:
from bs4 import BeautifulSoup

res = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table')[0] 
df = pd.read_html(str(table))
dataset = df[0]
headers = dataset.iloc[0]
dataset  = pd.DataFrame(dataset.values[1:], columns=headers)
dataset.head()

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


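Adding latitude and longitude. First, merge rows that share a postcode into a single row and drop postcodes with no assigned borough: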

In [33]:
unique_Postcode = dataset["Postcode"].unique()
frames = []

for val in unique_Postcode:
    f1 = dataset[dataset['Postcode'] == val]
    if len(f1) > 1:
        # several neighbourhoods share this postcode: join their names into one row
        newN = ", ".join(f1['Neighbourhood'])
        row = f1.iloc[-1]
        df2 = pd.DataFrame({"Postcode": row['Postcode'], "Borough": row['Borough'], "Neighbourhood": newN}, index=[0])
        frames.append(df2)
    else:
        frames.append(f1)

# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
datasetUniquePost = pd.concat(frames)

# drop postcodes that have no borough assigned
datasetUniquePost = datasetUniquePost[datasetUniquePost["Borough"] != "Not assigned"]
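
The same per-postcode merge could also be written with `groupby`. This is a sketch on a four-row toy table shaped like the scraped one, not the cell that produced the output below; unlike the loop, it resets the index to 0, 1, 2, and so on.

```python
import pandas as pd

# Toy rows in the same shape as the scraped table
dataset = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M6A", "M6A"],
    "Borough": ["Not assigned", "North York", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods",
                      "Lawrence Heights", "Lawrence Manor"],
})

# Drop postcodes without a borough, then join neighbourhood names per postcode
assigned = dataset[dataset["Borough"] != "Not assigned"]
merged = (
    assigned.groupby(["Postcode", "Borough"], sort=False)["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
print(merged)
```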


In [34]:
datasetUniquePost

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 0 | M6A | North York | Lawrence Heights, Lawrence Manor |
| 7 | M7A | Downtown Toronto | Queen's Park |
| 9 | M9A | Queen's Park | Not assigned |
| 0 | M1B | Scarborough | Rouge, Malvern |
| 13 | M3B | North York | Don Mills North |
| 0 | M4B | East York | Woodbine Gardens, Parkview Hill |
| 0 | M5B | Downtown Toronto | Ryerson, Garden District |


In [35]:
withNoNeighbourhoodIndex  = datasetUniquePost[datasetUniquePost["Neighbourhood"]=="Not assigned"].index
datasetUniquePost.loc[withNoNeighbourhoodIndex,'Neighbourhood'] = datasetUniquePost.loc[withNoNeighbourhoodIndex,'Borough']
datasetUniquePost.shape

(103, 3)
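
The cell above fills every "Not assigned" neighbourhood with the name of its borough. A compact way to express the same rule, sketched here on a two-row toy frame, is `Series.mask`, which replaces values where a condition holds:

```python
import pandas as pd

# Toy frame: the second row has a borough but no neighbourhood
df = pd.DataFrame({
    "Borough": ["Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Harbourfront", "Not assigned"],
})

# Where the neighbourhood is "Not assigned", take the borough name instead
df["Neighbourhood"] = df["Neighbourhood"].mask(
    df["Neighbourhood"] == "Not assigned", df["Borough"]
)
print(df)
```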

In [36]:
geoloc = pd.read_csv("Geospatial_Coordinates.csv")
geoloc.rename(columns={"Postal Code": "Postcode"},inplace=True)
datasetUniquePost = datasetUniquePost.join(geoloc.set_index('Postcode'), on='Postcode')
datasetUniquePost.reset_index(drop=True,inplace=True)
datasetUniquePost.head()

| | Postcode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park | 43.662301 | -79.389494 |
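
Before relying on the joined coordinates, it may be worth checking that every postcode actually found a match. The sketch below shows one way to do that with `DataFrame.merge`, using `validate` to guard against duplicate keys and `indicator` to flag unmatched rows. The postcode `M0X` and the borough `Nowhere` are made up to trigger the unmatched case.

```python
import pandas as pd

postcodes = pd.DataFrame({
    "Postcode": ["M3A", "M4A", "M0X"],  # "M0X" is a made-up code with no coordinates
    "Borough": ["North York", "North York", "Nowhere"],
})
geoloc = pd.DataFrame({
    "Postcode": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# Left merge keeps every postcode; validate raises if a key is duplicated
merged = postcodes.merge(geoloc, on="Postcode", how="left",
                         validate="one_to_one", indicator=True)

# Rows marked "left_only" have no coordinates in the geolocation table
missing = merged.loc[merged["_merge"] == "left_only", "Postcode"].tolist()
print(missing)
```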


We now have the data we need for our study. In the following sections (next submissions) we'll aggregate the two dataframes and start the clustering.