# Capstone Project - The Battle of Neighborhoods

Author: Airton Raimundo

## Table of contents
* [Introduction](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction <a name="introduction"></a>

This project analyzes the similarities between New York and Toronto neighborhoods, with the aim of supporting the decision of a person moving from Toronto to New York.
The decision criterion is the proximity of commercial establishments, so that the person can keep in New York the habits they are used to in Toronto.

## Data <a name="data"></a>

First, we need a dataset containing all the neighborhoods in both cities; such datasets are freely available on the web.
Second, we collect the coordinates of New York and Toronto using the geopy library to help us visualize the data.
Finally, we retrieve information about nearby venues using the Foursquare API.
These steps are carried out below:

### General dependencies

In [28]:
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis
import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import, pandas >= 1.0)

### Building the New York dataframe

In [29]:
# download the file
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset

# open and read as json
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)
    
neighborhoods_data = newyork_data['features']

# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude']

# collect one row per neighborhood
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON coordinates are ordered [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

# build the dataframe in one step (DataFrame.append was removed in pandas 2.0)
neighborhoods = pd.DataFrame(rows, columns=column_names)

In [30]:
# using geopy library to get the latitude and longitude values of New York City
address = 'New York City, NY'

geolocator = Nominatim(user_agent="Capstone-Airton")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
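
Nominatim's `geocode` returns `None` when it finds no match, in which case reading `location.latitude` would raise an `AttributeError`. The following is a minimal sketch of a guard around that call; the helper name `geocode_or_raise` is hypothetical and not part of the original notebook.

```python
def geocode_or_raise(geocode, address):
    """Return (latitude, longitude) for `address`, raising if nothing is found.

    `geocode` is any callable shaped like Nominatim.geocode: it takes an
    address string and returns an object with .latitude and .longitude,
    or None when there is no match.
    """
    location = geocode(address)
    if location is None:
        raise ValueError(f"No geocoding result for {address!r}")
    return location.latitude, location.longitude
```

With the geolocator defined above, this could be called as `latitude, longitude = geocode_or_raise(geolocator.geocode, address)`.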

In [31]:
ny_dataframe = neighborhoods
ny_geolocator = [latitude, longitude]

In [32]:
ny_dataframe.shape

(306, 4)

In [48]:
# The code was removed by Watson Studio for sharing.
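
The hidden cell presumably defines the names that `getNearbyVenues` relies on (`CLIENT_ID`, `CLIENT_SECRET`, `VERSION`, `LIMIT`). One way to supply them without putting secrets in the notebook is to read them from environment variables. This is only a sketch: the environment variable names, the helper `load_foursquare_settings`, and the default values for the version date and the limit are assumptions, not the values used in the original notebook.

```python
import os


def load_foursquare_settings(env=os.environ):
    """Read Foursquare settings from a mapping (environment variables by default).

    All variable names and defaults here are assumptions for illustration.
    """
    required = ("FOURSQUARE_CLIENT_ID", "FOURSQUARE_CLIENT_SECRET")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError("Missing Foursquare settings: " + ", ".join(missing))
    return {
        "CLIENT_ID": env["FOURSQUARE_CLIENT_ID"],
        "CLIENT_SECRET": env["FOURSQUARE_CLIENT_SECRET"],
        # hypothetical API version date
        "VERSION": env.get("FOURSQUARE_VERSION", "20180605"),
        # hypothetical cap on venues returned per request
        "LIMIT": int(env.get("FOURSQUARE_LIMIT", "100")),
    }
```

In the notebook, the returned values could then be assigned to the global names, for example `CLIENT_ID = settings["CLIENT_ID"]` after `settings = load_foursquare_settings()`.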

In [35]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    """Return a dataframe of Foursquare venues within `radius` meters of each neighborhood."""
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)

        # make the GET request and keep only the list of recommended items
        results = requests.get(url).json()['response']['groups'][0]['items']

        # keep only the relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            # guard against venues that come back without a category
            v['venue']['categories'][0]['name'] if v['venue']['categories'] else None)
            for v in results])

    # flatten the per-neighborhood lists into one dataframe
    nearby_venues = pd.DataFrame(
        [item for venue_list in venues_list for item in venue_list],
        columns=['Neighborhood',
                 'Neighborhood Latitude',
                 'Neighborhood Longitude',
                 'Venue',
                 'Venue Latitude',
                 'Venue Longitude',
                 'Venue Category'])

    return nearby_venues
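
The parsing step of `getNearbyVenues` can be tried offline, without credentials or network access. The sketch below applies the same field lookups to a hand-built payload; the helper name `extract_venue_rows` and the payload itself are hypothetical, and the payload only mimics the nesting that the function reads (`response` → `groups` → `items` → `venue`).

```python
def extract_venue_rows(payload, name, lat, lng):
    """Flatten one explore-style response into rows for a single neighborhood."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        rows.append((
            name,
            lat,
            lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            # venues without a category get None instead of raising IndexError
            categories[0]["name"] if categories else None,
        ))
    return rows


# Hand-built payload (not real API output) with one categorized venue
# and one venue whose category list is empty.
sample_payload = {
    "response": {
        "groups": [{
            "items": [
                {"venue": {"name": "Lollipops Gelato",
                           "location": {"lat": 40.894123, "lng": -73.845892},
                           "categories": [{"name": "Dessert Shop"}]}},
                {"venue": {"name": "Uncategorized Venue",
                           "location": {"lat": 40.895, "lng": -73.846},
                           "categories": []}},
            ]
        }]
    }
}

rows = extract_venue_rows(sample_payload, "Wakefield", 40.894705, -73.847201)
for row in rows:
    print(row)
```

Separating the parsing from the HTTP call in this way also makes it easier to see which neighborhoods return no venues at all, since an empty or malformed payload yields an empty list rather than an exception.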

In [36]:
ny_venues = getNearbyVenues(names=ny_dataframe['Neighborhood'],
                                   latitudes=ny_dataframe['Latitude'],
                                   longitudes=ny_dataframe['Longitude']
                                  )

In [37]:
ny_venues.head()

|   | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Wakefield | 40.894705 | -73.847201 | Lollipops Gelato | 40.894123 | -73.845892 | Dessert Shop |
| 1 | Wakefield | 40.894705 | -73.847201 | Rite Aid | 40.896649 | -73.844846 | Pharmacy |
| 2 | Wakefield | 40.894705 | -73.847201 | Carvel Ice Cream | 40.890487 | -73.848568 | Ice Cream Shop |
| 3 | Wakefield | 40.894705 | -73.847201 | Walgreens | 40.896687 | -73.84485 | Pharmacy |
| 4 | Wakefield | 40.894705 | -73.847201 | Dunkin' | 40.890459 | -73.849089 | Donut Shop |


### Building the Toronto dataframe

In [40]:
# The data was downloaded from https://www.aggdata.com/free/canada-postal-codes and saved to github
df = pd.read_csv('https://raw.githubusercontent.com/airtonraimundo/Coursera_Capstone/master/ca_postal_codes_v3.csv')
df.head()

|   | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3K | Downsview East | CFB Toronto |
| 1 | M4E | East Toronto | The Beaches |
| 2 | M4J | East Toronto | The Danforth East |
| 3 | M4K | East Toronto | The Danforth West / Riverdale |
| 4 | M4L | East Toronto | India Bazaar / The Beaches West |


In [41]:
df = df.rename(columns={'Neighbourhood': 'Neighborhood'})
geo_data = pd.read_csv('https://cocl.us/Geospatial_data')
geo_data.head()

|   | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [42]:
geo_data.rename(columns={'Postal Code':'PostalCode'}, inplace=True)
df2 = pd.merge(df, geo_data, on='PostalCode')
df2.head()

|   | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3K | Downsview East | CFB Toronto | 43.737473 | -79.464763 |
| 1 | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 |
| 2 | M4J | East Toronto | The Danforth East | 43.685347 | -79.338106 |
| 3 | M4K | East Toronto | The Danforth West / Riverdale | 43.679557 | -79.352188 |
| 4 | M4L | East Toronto | India Bazaar / The Beaches West | 43.668999 | -79.315572 |


In [43]:
# using geopy library to get the latitude and longitude values of Toronto
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="Capstone-Airton")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

In [44]:
tt_dataframe = df2.drop(["PostalCode"], axis=1)
tt_geolocator = [latitude, longitude]

In [45]:
tt_venues = getNearbyVenues(names=tt_dataframe['Neighborhood'],
                                   latitudes=tt_dataframe['Latitude'],
                                   longitudes=tt_dataframe['Longitude']
                                  )

In [46]:
tt_venues.head()

|   | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | CFB Toronto | 43.737473 | -79.464763 | Toronto Downsview Airport (YZD) | 43.738883 | -79.470111 | Airport |
| 1 | CFB Toronto | 43.737473 | -79.464763 | Ancaster Park | 43.734706 | -79.464777 | Park |
| 2 | CFB Toronto | 43.737473 | -79.464763 | VAUGHAN GARAGE DOOR REPAIR | 43.734877 | -79.468584 | Other Repair Shop |
| 3 | The Beaches | 43.676357 | -79.293031 | Glen Manor Ravine | 43.676821 | -79.293942 | Trail |
| 4 | The Beaches | 43.676357 | -79.293031 | The Big Carrot Natural Food Market | 43.678879 | -79.297734 | Health Food Store |


## Methodology <a name="methodology"></a>

In this project we look for similar characteristics between the neighborhoods of New York and Toronto.
First, we use machine learning to determine which characteristics make each neighborhood distinctive, so that neighborhoods can be differentiated and grouped.
Using the k-means algorithm, we group similar neighborhoods into clusters within each city, and then use a classification algorithm to relate the neighborhoods of the two cities to each other.
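
The clustering step can be illustrated on toy data. The sketch below assumes that each neighborhood is described by the relative frequency of its venue categories (an assumption about the features, since the analysis itself is not shown here) and runs a plain Lloyd's-algorithm k-means in pure Python. The neighborhood names, venue rows, and helper functions are all made up for illustration; in practice a library implementation such as scikit-learn's `KMeans` would be a more typical choice. The classification step is not covered.

```python
from collections import Counter, defaultdict


def category_profiles(venue_rows):
    """Turn (neighborhood, category) pairs into per-neighborhood frequency vectors."""
    counts = defaultdict(Counter)
    for neighborhood, category in venue_rows:
        counts[neighborhood][category] += 1
    categories = sorted({c for counter in counts.values() for c in counter})
    profiles = {}
    for neighborhood, counter in counts.items():
        total = sum(counter.values())
        profiles[neighborhood] = [counter[c] / total for c in categories]
    return categories, profiles


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # assignment step: each point goes to its nearest centroid
        labels = [min(range(k), key=lambda j: squared_distance(p, centroids[j]))
                  for p in points]
        # update step: each centroid moves to the mean of its members
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Made-up venue rows: A and B are shop-heavy, C and D are park-heavy.
venue_rows = [
    ("A", "Pharmacy"), ("A", "Donut Shop"),
    ("B", "Pharmacy"), ("B", "Donut Shop"), ("B", "Pharmacy"),
    ("C", "Park"), ("C", "Trail"),
    ("D", "Park"), ("D", "Trail"), ("D", "Park"),
]

categories, profiles = category_profiles(venue_rows)
names = sorted(profiles)
labels, _ = kmeans([profiles[n] for n in names], k=2)
clusters = dict(zip(names, labels))
print(clusters)
```

Using frequencies rather than raw counts keeps neighborhoods with many venues from dominating the distance computation, which matters because the number of venues returned per neighborhood can vary widely.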

## Analysis <a name="analysis"></a>

## Results and Discussion <a name="results"></a>

## Conclusion <a name="conclusion"></a>