# Capstone Project - The Battle of Neighborhoods

Author: Airton Raimundo

## Table of contents
* [Introduction](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction <a name="introduction"></a>

#### Background

In the context of globalization, migration flows are growing stronger and more people are experiencing the cultural diversity the world offers.
However, many people have consumption habits that are part of their daily lives, and they would like to keep them even after moving to another country.

#### Problem

The question this project tries to answer is: "What is the best way to choose a neighborhood that provides the services I am used to consuming?"

#### Interest

The results presented here are aimed at people who are relocating, temporarily or permanently, and who want as little change as possible in their essential services. Although the analysis is restricted to Toronto and New York, the project is intended to present a generic way of comparing two major cities and making the best decision.

## Data <a name="data"></a>

#### Data sources

First, we need a dataset containing all the neighborhoods in each city; these datasets can be downloaded <a href="https://cocl.us/new_york_dataset" >here</a> (New York) and <a href="https://raw.githubusercontent.com/airtonraimundo/Coursera_Capstone/master/ca_postal_codes_v3.csv" >here</a> (Toronto). Next, we obtain the coordinates of New York and Toronto with the geopy library to help visualize the data. Finally, we retrieve information about nearby venues using the Foursquare API.

#### Data preparation

The data come from different sources, so column names had to be aligned and unnecessary columns removed.
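As a minimal sketch of this kind of preparation, assume two small hypothetical tables, `names` and `coords`, that spell the same columns differently (they are toy stand-ins, not the project's full datasets). The column names can be aligned before the tables are joined and the join key dropped:

```python
import pandas as pd

# Hypothetical toy tables; the real datasets are loaded later in the notebook.
names = pd.DataFrame({
    "PostalCode": ["M4E", "M4J"],
    "Borough": ["East Toronto", "East Toronto"],
    "Neighbourhood": ["The Beaches", "The Danforth East"],
})
coords = pd.DataFrame({
    "Postal Code": ["M4E", "M4J"],
    "Latitude": [43.676357, 43.685347],
    "Longitude": [-79.293031, -79.338106],
})

# Align column names so both tables share the same key and spelling.
names = names.rename(columns={"Neighbourhood": "Neighborhood"})
coords = coords.rename(columns={"Postal Code": "PostalCode"})

# Join on the shared key, then drop the key once it is no longer needed.
prepared = names.merge(coords, on="PostalCode").drop(columns=["PostalCode"])
print(prepared.columns.tolist())
```

The resulting table has the same four columns as the New York dataframe, which is what allows one venue-lookup function to serve both cities.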

### General dependencies

In [1]:
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis
import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform a JSON file into a pandas dataframe

### Building the New York dataframe

In [2]:
# download the file
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset

# open and read as json
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)
    
neighborhoods_data = newyork_data['features']

# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude']

# collect one row per neighborhood (GeoJSON coordinates are [longitude, latitude])
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

# build the dataframe once from the collected rows
neighborhoods = pd.DataFrame(rows, columns=column_names)

In [3]:
# using geopy library to get the latitude and longitude values of New York City
address = 'New York City, NY'

geolocator = Nominatim(user_agent="Capstone-Airton")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

In [4]:
ny_dataframe = neighborhoods
ny_geolocator = [latitude, longitude]

In [5]:
ny_dataframe.shape

(306, 4)

In [6]:
# The code was removed by Watson Studio for sharing.

In [7]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues

In [8]:
ny_venues = getNearbyVenues(names=ny_dataframe['Neighborhood'],
                                   latitudes=ny_dataframe['Latitude'],
                                   longitudes=ny_dataframe['Longitude']
                                  )

In [9]:
ny_venues.head()

| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Wakefield | 40.894705 | -73.847201 | Lollipops Gelato | 40.894123 | -73.845892 | Dessert Shop |
| 1 | Wakefield | 40.894705 | -73.847201 | Rite Aid | 40.896649 | -73.844846 | Pharmacy |
| 2 | Wakefield | 40.894705 | -73.847201 | Carvel Ice Cream | 40.890487 | -73.848568 | Ice Cream Shop |
| 3 | Wakefield | 40.894705 | -73.847201 | Walgreens | 40.896687 | -73.84485 | Pharmacy |
| 4 | Wakefield | 40.894705 | -73.847201 | Dunkin' | 40.890459 | -73.849089 | Donut Shop |


### Building the Toronto dataframe

In [10]:
# The data was downloaded from https://www.aggdata.com/free/canada-postal-codes and saved to github
df = pd.read_csv('https://raw.githubusercontent.com/airtonraimundo/Coursera_Capstone/master/ca_postal_codes_v3.csv')
df.head()

| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3K | Downsview East | CFB Toronto |
| 1 | M4E | East Toronto | The Beaches |
| 2 | M4J | East Toronto | The Danforth East |
| 3 | M4K | East Toronto | The Danforth West / Riverdale |
| 4 | M4L | East Toronto | India Bazaar / The Beaches West |


In [11]:
df = df.rename(columns={'Neighbourhood': 'Neighborhood'})
geo_data = pd.read_csv('https://cocl.us/Geospatial_data')
geo_data.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [12]:
geo_data.rename(columns={'Postal Code':'PostalCode'}, inplace=True)
df2 = pd.merge(df, geo_data, on='PostalCode')
df2.head()

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3K | Downsview East | CFB Toronto | 43.737473 | -79.464763 |
| 1 | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 |
| 2 | M4J | East Toronto | The Danforth East | 43.685347 | -79.338106 |
| 3 | M4K | East Toronto | The Danforth West / Riverdale | 43.679557 | -79.352188 |
| 4 | M4L | East Toronto | India Bazaar / The Beaches West | 43.668999 | -79.315572 |


In [13]:
# using geopy library to get the latitude and longitude values of Toronto
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="Capstone-Airton")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

In [14]:
tt_dataframe = df2.drop(["PostalCode"], axis=1)
tt_geolocator = [latitude, longitude]

In [15]:
tt_venues = getNearbyVenues(names=tt_dataframe['Neighborhood'],
                                   latitudes=tt_dataframe['Latitude'],
                                   longitudes=tt_dataframe['Longitude']
                                  )

In [16]:
tt_venues.head()

| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | CFB Toronto | 43.737473 | -79.464763 | Toronto Downsview Airport (YZD) | 43.738883 | -79.470111 | Airport |
| 1 | CFB Toronto | 43.737473 | -79.464763 | Ancaster Park | 43.734706 | -79.464777 | Park |
| 2 | The Beaches | 43.676357 | -79.293031 | Glen Manor Ravine | 43.676821 | -79.293942 | Trail |
| 3 | The Beaches | 43.676357 | -79.293031 | The Big Carrot Natural Food Market | 43.678879 | -79.297734 | Health Food Store |
| 4 | The Beaches | 43.676357 | -79.293031 | Grover Pub and Grub | 43.679181 | -79.297215 | Pub |


## Methodology <a name="methodology"></a>

First, we explore the data to determine which characteristics make each neighborhood unique, so that neighborhoods can be differentiated and grouped. We then use the k-means algorithm to group similar neighborhoods into clusters, based on the venue categories selected beforehand, and visualize the similarities between the two cities.
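The clustering step can be illustrated with a small sketch. It assumes scikit-learn's `KMeans` as the k-means implementation, four hypothetical neighborhoods (`A` to `D`), and an arbitrary choice of two clusters; none of these values come from the project's data. Each neighborhood is described by the share of its venues that falls into each category, and neighborhoods with similar shares end up in the same cluster:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical venue table in the same shape as the output of getNearbyVenues.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D"],
    "Venue Category": ["Restaurant", "Restaurant", "Bar",
                       "Restaurant", "Bar", "Restaurant",
                       "Park", "Park", "Gym",
                       "Park", "Gym", "Park"],
})

# One row per neighborhood, one column per category, values = share of venues.
profiles = pd.crosstab(venues["Neighborhood"], venues["Venue Category"],
                       normalize="index")

# Group neighborhoods with similar category profiles (k=2 is arbitrary here).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(profiles.to_numpy())
clusters = pd.Series(kmeans.labels_, index=profiles.index, name="Cluster")
print(clusters)
```

Normalizing each row to shares rather than raw counts keeps a neighborhood with many venues from dominating the distance computation simply because of its size.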

## Analysis <a name="analysis"></a>

#### Category selection

For each neighborhood in the two cities, we search for all commercial establishments within a radius of 500 meters. We then merge the categories into broader groups and select the 22 most representative ones.
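The grouping idea can be sketched compactly with an ordered list of substring rules. The helper `group_categories` and the toy `raw` series below are hypothetical and only illustrate the approach; the rules shown are a small subset of the mappings used in the cells that follow:

```python
import pandas as pd

# Hypothetical raw categories; the real values come from the Foursquare responses.
raw = pd.Series(["Italian Restaurant", "Sushi Restaurant", "Café", "Coffee Shop",
                 "Pizza Place", "Pub", "Trail", "Park", "Bank"])

# Ordered (substring, group) rules; later rules see the result of earlier ones.
rules = [
    ("Restaurant", "Restaurant"),
    ("Café", "Coffee Shop"),
    ("Pizza Place", "Fast Food"),
    ("Pub", "Bar"),
    ("Trail", "Park"),
]

def group_categories(categories, rules):
    """Replace every category containing a rule's substring with that rule's group."""
    grouped = categories.copy()
    for pattern, group in rules:
        matches = grouped.str.contains(pattern, regex=False)
        grouped = grouped.mask(matches, group)
    return grouped

grouped = group_categories(raw, rules)

# Keep the N most frequent groups (the project keeps 22; 3 suits this toy data).
top_groups = grouped.value_counts().head(3).index.tolist()
print(grouped.value_counts())
```

Because the rules match substrings and run in order, a broad pattern can also catch unrelated categories that happen to contain it, so the rule order and the specificity of each pattern are worth checking against the actual category list.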

In [17]:
tt_venues["Venue Category"].value_counts()

Coffee Shop                   91
Café                          69
Park                          45
Restaurant                    42
Pizza Place                   39
Bakery                        31
Sandwich Place                30
Italian Restaurant            29
Japanese Restaurant           23
Grocery Store                 22
Bar                           21
Bank                          18
Pharmacy                      17
Gym                           17
Pub                           17
Fast Food Restaurant          17
Clothing Store                16
Sushi Restaurant              16
Breakfast Spot                14
Chinese Restaurant            14
Gastropub                     14
Liquor Store                  13
Dessert Shop                  13
Thai Restaurant               13
Diner                         13
American Restaurant           13
Ice Cream Shop                12
Seafood Restaurant            12
Greek Restaurant              12
Mexican Restaurant            11
          

In [18]:
# work on a copy so the raw venue table stays unchanged
tt_venues_clean = tt_venues.copy()

In [19]:
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Restaurant",regex=False), "Restaurant", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Burrito Place",regex=False), "Restaurant", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Diner",regex=False), "Restaurant", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Steakhouse",regex=False), "Restaurant", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Café",regex=False), "Coffee Shop", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Cafe",regex=False), "Coffee Shop", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Cafeteria",regex=False), "Coffee Shop", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Pizza Place",regex=False), "Fast Food", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Sandwich Place",regex=False), "Fast Food", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Burger Joint",regex=False), "Fast Food", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Fried Chicken Joint",regex=False), "Fast Food", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Bakery",regex=False), "Bakery", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Breakfast Spot",regex=False), "Bakery", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Dessert Shop",regex=False), "Grocery Store", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Ice Cream Shop",regex=False), "Grocery Store", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Candy Store",regex=False), "Grocery Store", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Donut Shop",regex=False), "Grocery Store", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Pub",regex=False), "Bar", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Gastropub",regex=False), "Bar", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Liquor Store",regex=False), "Bar", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Beer Bar",regex=False), "Bar", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Deli / Bodega",regex=False), "Bar", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Beer Store",regex=False), "Bar", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Bar",regex=False), "Bar", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Brewery",regex=False), "Bar", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Gym",regex=False), "Gym", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Hostel",regex=False), "Hotel", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Theater",regex=False), "Theater", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Discount Store",regex=False), "Clothing Store", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Furniture / Home Store",regex=False), "General Store", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Sporting Goods Shop",regex=False), "General Store", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Cosmetics Shop",regex=False), "General Store", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Electronics Store",regex=False), "General Store", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Hardware Store",regex=False), "General Store", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Video Store",regex=False), "General Store", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Bridal Shop",regex=False), "General Store", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Convenience Store",regex=False), "Supermarket", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Farmers Market",regex=False), "Supermarket", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Yoga Studio",regex=False), "Spa", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Tea",regex=False), "Tea Room", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Bus",regex=False), "Bus Line", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Train Station",regex=False), "Bus Line", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Food",regex=False), "Fast Food", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Park",regex=False), "Park", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Garden",regex=False), "Park", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Lake",regex=False), "Park", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Monument / Landmark",regex=False), "Park", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Harbor / Marina",regex=False), "Park", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Field",regex=False), "Park", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Accessories Store",regex=False), "General Store", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Men's Store",regex=False), "General Store", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Shoe Store",regex=False), "General Store", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Video Game Store",regex=False), "General Store", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Arts & Crafts Store",regex=False), "General Store", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Organic Grocery",regex=False), "Grocery Store", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Club",regex=False), "Club", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Plaza",regex=False), "Shopping Mall", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Skating Rink",regex=False), "Park", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Playground",regex=False), "Park", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Trail",regex=False), "Park", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Department Store",regex=False), "General Store", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("River",regex=False), "Park", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Motel",regex=False), "Hotel", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Yogurt",regex=False), "Grocery Store", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Baby Store",regex=False), "General Store", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Record Shop",regex=False), "General Store", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"] = np.where(tt_venues_clean["Venue Category"].str.contains("Drugstore",regex=False), "Pharmacy", tt_venues_clean['Venue Category'])
tt_venues_clean["Venue Category"].value_counts()

Restaurant                    321
Coffee Shop                   162
Bar                           126
Fast Food                      96
Park                           76
Grocery Store                  50
Bakery                         45
General Store                  45
Gym                            29
Clothing Store                 25
Supermarket                    23
Pharmacy                       18
Bank                           18
Spa                            14
Theater                        13
Shopping Mall                  13
Bus Line                       12
Hotel                          12
Tea Room                       12
Pet Store                      10
Art Gallery                     8
Intersection                    8
Bookstore                       8
Gas Station                     6
Rental Car Location             5
Gift Shop                       5
Lounge                          5
Nightclub                       5
Museum                          4
Athletics & Sp

In [20]:
ny_venues["Venue Category"].value_counts()

Pizza Place                        306
Deli / Bodega                      186
Italian Restaurant                 170
Coffee Shop                        157
Chinese Restaurant                 149
Grocery Store                      131
Pharmacy                           130
Sandwich Place                     129
Donut Shop                         128
Bakery                             125
Park                               124
Bar                                115
Bank                               111
Mexican Restaurant                 106
Ice Cream Shop                      95
American Restaurant                 94
Café                                94
Supermarket                         80
Bagel Shop                          79
Gym                                 73
Caribbean Restaurant                71
Fast Food Restaurant                69
Diner                               69
Gym / Fitness Center                66
Bus Station                         62
Sushi Restaurant         

In [21]:
# work on a copy so the raw venue table stays unchanged
ny_venues_clean = ny_venues.copy()

In [22]:
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Restaurant",regex=False), "Restaurant", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Burrito Place",regex=False), "Restaurant", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Diner",regex=False), "Restaurant", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Steakhouse",regex=False), "Restaurant", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Café",regex=False), "Coffee Shop", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Cafe",regex=False), "Coffee Shop", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Cafeteria",regex=False), "Coffee Shop", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Pizza Place",regex=False), "Fast Food", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Sandwich Place",regex=False), "Fast Food", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Burger Joint",regex=False), "Fast Food", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Fried Chicken Joint",regex=False), "Fast Food", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Bakery",regex=False), "Bakery", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Breakfast Spot",regex=False), "Bakery", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Dessert Shop",regex=False), "Grocery Store", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Ice Cream Shop",regex=False), "Grocery Store", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Candy Store",regex=False), "Grocery Store", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Donut Shop",regex=False), "Grocery Store", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Pub",regex=False), "Bar", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Gastropub",regex=False), "Bar", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Liquor Store",regex=False), "Bar", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Beer Bar",regex=False), "Bar", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Deli / Bodega",regex=False), "Bar", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Beer Store",regex=False), "Bar", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Bar",regex=False), "Bar", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Brewery",regex=False), "Bar", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Gym",regex=False), "Gym", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Hostel",regex=False), "Hotel", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Theater",regex=False), "Theater", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Discount Store",regex=False), "Clothing Store", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Furniture / Home Store",regex=False), "General Store", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Sporting Goods Shop",regex=False), "General Store", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Cosmetics Shop",regex=False), "General Store", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Electronics Store",regex=False), "General Store", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Hardware Store",regex=False), "General Store", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Video Store",regex=False), "General Store", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Bridal Shop",regex=False), "General Store", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Convenience Store",regex=False), "Supermarket", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Farmers Market",regex=False), "Supermarket", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Yoga Studio",regex=False), "Spa", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Tea",regex=False), "Tea Room", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Bus",regex=False), "Bus Line", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Train Station",regex=False), "Bus Line", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Food",regex=False), "Fast Food", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Park",regex=False), "Park", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Garden",regex=False), "Park", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Lake",regex=False), "Park", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Monument / Landmark",regex=False), "Park", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Harbor / Marina",regex=False), "Park", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Field",regex=False), "Park", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Accessories Store",regex=False), "General Store", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Men's Store",regex=False), "General Store", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Shoe Store",regex=False), "General Store", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Video Game Store",regex=False), "General Store", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Arts & Crafts Store",regex=False), "General Store", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Organic Grocery",regex=False), "Grocery Store", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Club",regex=False), "Club", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Plaza",regex=False), "Shopping Mall", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Skating Rink",regex=False), "Park", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Playground",regex=False), "Park", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Trail",regex=False), "Park", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Department Store",regex=False), "General Store", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("River",regex=False), "Park", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Motel",regex=False), "Hotel", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Yogurt",regex=False), "Grocery Store", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Baby Store",regex=False), "General Store", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Record Shop",regex=False), "General Store", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"] = np.where(ny_venues_clean["Venue Category"].str.contains("Drugstore",regex=False), "Pharmacy", ny_venues_clean['Venue Category'])
ny_venues_clean["Venue Category"].value_counts()

Restaurant                    1559
Fast Food                      617
Bar                            580
Grocery Store                  415
Park                           260
Coffee Shop                    256
General Store                  194
Bakery                         151
Gym                            150
Bus Line                       139
Supermarket                    132
Pharmacy                       130
Bank                           111
Spa                             92
Clothing Store                  88
Bagel Shop                      79
Hotel                           51
Mobile Phone Shop               42
Beach                           37
Wine Shop                       36
Theater                         36
Shopping Mall                   34
Bookstore                       28
Metro Station                   27
Lounge                          26
Pet Store                       23
Tea Room                        22
Gourmet Shop                    22
Intersection        
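
The chained `np.where` calls above all follow one pattern: if the category contains a substring, replace it with a broader label. As a minimal sketch, the same idea can be written as an ordered list of rules applied in a loop. The names `RULES`, `recategorize`, and `sample` are hypothetical and only a few of the rules are reproduced here.

```python
import pandas as pd

# Ordered (substring, replacement) pairs; a small, illustrative subset of the rules above.
RULES = [
    ("Yoga Studio", "Spa"),
    ("Tea", "Tea Room"),
    ("Bus", "Bus Line"),
    ("Train Station", "Bus Line"),
    ("Garden", "Park"),
    ("Drugstore", "Pharmacy"),
]


def recategorize(categories: pd.Series, rules) -> pd.Series:
    """Apply each rule in order, mirroring the chained np.where calls."""
    result = categories.copy()
    for needle, target in rules:
        # Plain substring match (no regex); missing values never match.
        matches = result.str.contains(needle, regex=False, na=False)
        result = result.mask(matches, target)
    return result


# Hypothetical sample categories, not taken from the real dataset.
sample = pd.Series(
    ["Yoga Studio", "Bubble Tea Shop", "Bus Station", "Botanical Garden", "Bakery"]
)
cleaned = recategorize(sample, RULES)
print(cleaned.tolist())
```

Because the match is a plain substring test, rule order matters and short substrings such as "Bus" can also catch unrelated categories, so the rule list is worth reviewing against the actual category names.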

Now we need to choose the most frequent categories for our analysis.

In [23]:
tt_venues_top = tt_venues_clean[tt_venues_clean["Venue Category"].isin(
    ["Restaurant", 
     "Coffee Shop", 
     "Bar", 
     "Fast Food", 
     "Park", 
     "Grocery Store",
     "Bakery", 
     "General Store", 
     "Gym", 
     "Supermarket", 
     "Clothing Store", 
     "Pharmacy", 
     "Bank", 
     "Spa", 
     "Theater", 
     "Tea Room", 
     "Hotel", 
     "Shopping Mall", 
     "Bus Line", 
     "Pet Store", 
     "Art Gallery", 
     "Bookstore"])]

In [24]:
ny_venues_top = ny_venues_clean[ny_venues_clean["Venue Category"].isin(
    ["Restaurant", 
     "Coffee Shop", 
     "Bar", 
     "Fast Food", 
     "Park", 
     "Grocery Store",
     "Bakery", 
     "General Store", 
     "Gym", 
     "Supermarket", 
     "Clothing Store", 
     "Pharmacy", 
     "Bank", 
     "Spa", 
     "Theater", 
     "Tea Room", 
     "Hotel",  
     "Shopping Mall", 
     "Bus Line", 
     "Pet Store", 
     "Art Gallery", 
     "Bookstore"])]

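The category lists in the two cells above were picked by hand from the frequency tables. As an alternative sketch, a shared list could be derived from `value_counts()` by taking the most frequent categories of each city and using their union for both. The frames `ny_demo` and `tt_demo`, the helper `top_categories`, and the cutoff of 3 are hypothetical.

```python
import pandas as pd


def top_categories(venues: pd.DataFrame, n: int) -> list:
    """Return the n most frequent values of the 'Venue Category' column."""
    return venues["Venue Category"].value_counts().head(n).index.tolist()


# Hypothetical toy data standing in for the cleaned venue tables.
ny_demo = pd.DataFrame(
    {"Venue Category": ["Restaurant"] * 5 + ["Bar"] * 3 + ["Park"] * 2 + ["Beach"]}
)
tt_demo = pd.DataFrame(
    {"Venue Category": ["Restaurant"] * 4 + ["Coffee Shop"] * 3 + ["Park"] * 2 + ["Bar"]}
)

# Use the union of both top lists so the two cities share one set of categories.
shared = sorted(set(top_categories(ny_demo, 3)) | set(top_categories(tt_demo, 3)))

ny_demo_top = ny_demo[ny_demo["Venue Category"].isin(shared)]
tt_demo_top = tt_demo[tt_demo["Venue Category"].isin(shared)]
print(shared)
```

Filtering both cities with the same list keeps the one-hot columns aligned when the two tables are later combined.
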
With the data cleaned, we now prepare it for the k-means algorithm.

In [25]:
# Work on explicit copies so the new column is not written to a slice of the
# original frames (this avoids pandas' SettingWithCopyWarning)
ny_venues_top = ny_venues_top.copy()
tt_venues_top = tt_venues_top.copy()
ny_venues_top["city"] = "New York"
tt_venues_top["city"] = "Toronto"


In [26]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
dataset = pd.concat([ny_venues_top, tt_venues_top])

In [27]:
dataset_onehot = pd.get_dummies(dataset["Venue Category"], prefix="", prefix_sep="")
dataset_onehot["city"] = dataset["city"]
dataset_onehot["Neighborhood"] = dataset["Neighborhood"]

#### Performing K-means algorithm

To perform k-means clustering, we first count the occurrences of each category in each neighborhood and then apply the algorithm with K = 10, 15, and 20.
These values of K were chosen because Toronto has just over 100 neighborhoods.

In [None]:
dataset_group = dataset_onehot.groupby(["city", "Neighborhood"], as_index=False).sum()
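
To make the counting step concrete, here is a minimal sketch on a toy table. The frame `demo` and its values are hypothetical; the column names follow the ones used in this notebook.

```python
import pandas as pd

# Hypothetical venues: two bars in Toronto/A, one park in Toronto/B, one bar in New York/C.
demo = pd.DataFrame(
    {
        "city": ["Toronto", "Toronto", "Toronto", "New York"],
        "Neighborhood": ["A", "A", "B", "C"],
        "Venue Category": ["Bar", "Bar", "Park", "Bar"],
    }
)

# One indicator column per category (integer 0/1 so the sum below is a count).
demo_onehot = pd.get_dummies(demo["Venue Category"], prefix="", prefix_sep="", dtype=int)
demo_onehot["city"] = demo["city"]
demo_onehot["Neighborhood"] = demo["Neighborhood"]

# Summing the indicators per (city, Neighborhood) yields venue counts per category.
demo_counts = demo_onehot.groupby(["city", "Neighborhood"], as_index=False).sum()
print(demo_counts)
```

Each row of `demo_counts` is one neighborhood, and each category column holds the number of venues of that type, which is the feature vector passed to k-means.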

In [46]:
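from sklearn.cluster import KMeans

k = 20  # number of clusters for this run (the text also considers 10 and 15)
# Keep only the per-category counts as clustering features
dataset_clustering = dataset_group.drop(columns=['city', 'Neighborhood'])
kmeans = KMeans(n_clusters=k, random_state=0).fit(dataset_clustering)
# Store each neighborhood's cluster label as the first column
dataset_group.insert(0, 'Cluster', kmeans.labels_)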
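
The cell above runs a single value of K. A minimal sketch of trying all three values in one loop is shown below. It uses synthetic counts instead of the real `dataset_clustering`, and the names `demo_counts` and `labels_by_k` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the neighborhood-by-category count matrix (60 rows, 5 categories).
rng = np.random.default_rng(0)
demo_counts = rng.integers(0, 10, size=(60, 5))

labels_by_k = {}
for n_clusters in [10, 15, 20]:
    model = KMeans(n_clusters=n_clusters, random_state=0, n_init=10).fit(demo_counts)
    labels_by_k[n_clusters] = model.labels_
    # Inertia (within-cluster sum of squares) gives a rough way to compare runs.
    print(n_clusters, round(model.inertia_, 1))
```

Keeping the labels for every K in one dictionary makes it easy to check how much a neighborhood's cluster assignment changes as K varies.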

In [47]:
dataset_cluster = dataset_group[["Cluster","Neighborhood", "city"]]
ny_cluster = dataset_cluster[dataset_cluster['city'].str.contains('New York',regex=False)]
tt_cluster = dataset_cluster[dataset_cluster['city'].str.contains('Toronto',regex=False)]

In [48]:
ny_map = pd.merge(ny_cluster, ny_dataframe, on='Neighborhood')
tt_map = pd.merge(tt_cluster, tt_dataframe, on='Neighborhood')

## Results and Discussion <a name="results"></a>

The results show that Toronto is very homogeneous in the main services it offers, and that this finding depends very little on the chosen K. New York City, on the other hand, shows much more variety.
The maps of the results let us see directly which neighborhoods are most similar between the two cities.
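
Besides reading the maps, similar neighborhoods can also be listed in a table by joining the two cities on their cluster label. The sketch below uses hypothetical toy frames (`ny_demo`, `tt_demo`) with the same `Cluster` and `Neighborhood` columns as `ny_cluster` and `tt_cluster`.

```python
import pandas as pd

# Hypothetical cluster assignments for three neighborhoods in each city.
ny_demo = pd.DataFrame({"Cluster": [0, 1, 2], "Neighborhood": ["NY-A", "NY-B", "NY-C"]})
tt_demo = pd.DataFrame({"Cluster": [0, 0, 2], "Neighborhood": ["TT-X", "TT-Y", "TT-Z"]})

# The inner join keeps only clusters that appear in both cities.
matches = pd.merge(ny_demo, tt_demo, on="Cluster", suffixes=("_ny", "_tt"))
print(matches)
```

In this toy example NY-B has no Toronto counterpart because its cluster contains no Toronto neighborhood, which corresponds to the New York neighborhoods that stand apart on the map.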

In [32]:
!conda install -c conda-forge folium=0.5.0 --yes


In [49]:
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors

map_toronto = folium.Map(location=[tt_geolocator[0],tt_geolocator[1]],zoom_start=10)

# set color scheme for the clusters
x = np.arange(k)
ys = [i + x + (i*x)**2 for i in range(k)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, neighbourhood, cluster in zip(tt_map['Latitude'], tt_map['Longitude'], tt_map['Neighborhood'], tt_map['Cluster']):
    label = folium.Popup(' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_toronto)
       

In [50]:
map_newyork = folium.Map(location=[ny_geolocator[0],ny_geolocator[1]],zoom_start=10)

# set color scheme for the clusters
x = np.arange(k)
ys = [i + x + (i*x)**2 for i in range(k)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, neighbourhood, cluster in zip(ny_map['Latitude'], ny_map['Longitude'], ny_map['Neighborhood'], ny_map['Cluster']):
    label = folium.Popup(' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_newyork)

We can see on the maps that many New York neighborhoods are very different from those in Toronto. This is a favorable indication of the usefulness of this project.

In [51]:
from IPython.display import display, HTML  # importing display from IPython.core.display is deprecated

htmlmap = HTML('<iframe srcdoc="{}" style="float:left; width: {}px; height: {}px; display:inline-block; width: 50%; margin: 0 auto; border: 2px solid black"></iframe>'
           '<iframe srcdoc="{}" style="float:right; width: {}px; height: {}px; display:inline-block; width: 50%; margin: 0 auto; border: 2px solid black"></iframe>'
           .format(map_toronto.get_root().render().replace('"', '&quot;'),500,500,
                   map_newyork.get_root().render().replace('"', '&quot;'),500,500))
display(htmlmap)



## Conclusion <a name="conclusion"></a>

This work demonstrated a procedure for comparing the neighborhoods of two cities using public datasets, machine learning, and data visualization tools, with the practical objective of finding the most similar neighborhoods.