## A Tale of Two Cities through ML

## Table of contents
* [Introduction: Business Problem](#Introduction)
* [Data](#Data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

### 1.  Introduction: Business Problem <a name="Introduction"></a>

1.1 **Background** : 
New York is a far bigger city than Toronto, yet the two have much in common. Each is the financial and tourist capital of its country. Downtown Toronto and Manhattan are both crowded with skyscrapers, and each city has signature landmarks that make it stand out. People therefore move constantly between the two cities for business, travel, work, studies and other reasons. Where an individual stays within the city depends on the purpose and duration of the visit, so the choice of location is an important part of preparing for it.

1.2 **Problem** : For people who want to relocate from **Place A** to **Place B**, the **problem** is to find out which neighborhood in **Place B** would be suitable to move into. For example, an individual moving from **Toronto** to **New York** for higher education would like to stay close to an **educational hub**. Likewise, a professional moving to Toronto for an office job would like to stay in the **downtown** area, within a short commuting distance of the office. Selecting the appropriate neighborhood not only avoids the wasted expense of moving to the wrong neighborhood, but also helps in timely planning of other aspects of relocation, such as looking for accommodation, talking to the local school about admission, opening a bank account etc. Thus, the problem statement of the project is the following: -
 - **Given the purpose of relocation, which is the right neighborhood in Toronto while moving in from New York and vice versa?**

1.3 **Target Audience** : 
The **target audience** will be the people living in New York and wanting to relocate to Toronto for various reasons such as the following : -

    a. People migrating for a similar or better quality of life
    b. Professionals shifting to the other city looking to stay close to the office
    c. International students looking for a place to live with certain amenities within a budget
    d. Businessmen looking to open a shop in the other city, in a location that offers a good customer base and minimal competition

Likewise, the people living in Toronto and wanting to relocate to New York are also part of the **target audience** for this project.


Although this project concentrates only on movement between New York and Toronto, the solution model can be extended to a **target audience** interested in moving from any location to any other location in the world. The tool would aid in choosing an appropriate neighborhood depending on the purpose of the move (viz., new business, better quality of life, education, profession etc.).

### 2. Data acquisition and cleaning <a name="Data"></a>

2.1 **Data sources**.   The idea of the project is to first **understand the user's (i.e. the target audience's) requirements for relocation**, then **know his / her present neighborhood** and finally **recommend a suitable neighborhood in the destination city** (New York or Toronto). To achieve these objectives, the following data sources will be used: -  
> a. **Understand user's requirements**:  The user has to specify the following: -  
>> i. **Purpose of relocation** (i.e., new business, better quality of life, education or profession). Each option will imply a different set of neighborhood selection criteria  
>> ii.  For the selected purpose of relocation, specify a few characteristics of the neighborhood he / she would like to move into. This would help in narrowing down the recommended options: -
- New business: Clientele would be high income group, medium income group or low income group
- Better quality of life: Countryside, uptown or downtown
- Education: Name of the University
- Profession: Location of the Office where he / she would work    

> b. **Know his / her present neighborhood**.  The user has to specify the current residential and office addresses. The residential address will be used to benchmark the "quality of life" (as a function of the amenities available in the neighborhood). For this purpose, details of the individual's present neighborhood will be extracted from his / her residential address using the **Foursquare** app.
The office address will be used to obtain the distance between office and residence. The recommended neighborhood should be at a similar distance from the office that the individual would join in the new location. 

> c. **Recommend a suitable neighborhood**. To recommend a suitable neighborhood, we would need details of all the neighborhoods in New York and Toronto. Neighborhood details would comprise the following: -   
>>  i. **Neighborhood details of Toronto and New York** - Borough, Neighborhood name, Neighborhood Latitude, Neighborhood Longitude  
>> For this purpose, following data sources will be used:-
>>> List of all neighborhoods of Toronto will be scraped from https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
>>> However, the neighborhood dataset of Toronto available at this website does not contain Latitude, Longitude data of the neighborhoods. The Latitude, Longitude data will be retrieved from the site http://cocl.us/Geospatial_data.
>>> List of all neighborhoods of New York will be scraped from the database maintained at https://cocl.us/new_york_dataset (courtesy: "Segmenting and Clustering Neighborhoods in New York City" Lab Exercise of week 3 of this Module)  

>> ii. **List of prominent venues in each neighborhood**: This data will be used to characterise the various neighborhoods and compare them with the user's requirements for relocation. The following types of venues will be obtained for each neighborhood using the **Foursquare** application: - 
>>> - Public amenities (Departmental stores, Hospital, School, Bank, Post Office etc) 
>>> - Recreational & sports facilities (Sports centre, Cinema, Museum, Gym, Yoga studio etc.)
>>> - Shopping malls, airport terminal, railway station
>>> - Food joints (restaurant of various cuisines) 
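
Item b above relies on the distance between office and residence. A minimal sketch of that comparison follows, assuming straight-line (great-circle) distance is an acceptable stand-in for commuting distance. The function names, the office location and the 3 km commute are hypothetical; the coordinates are taken from the Toronto sample data shown later in this notebook.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def rank_by_commute_similarity(current_km, office, neighborhoods):
    """Order (name, lat, lon) tuples by how closely their distance to the
    office matches the user's current home-to-office distance."""
    scored = [
        (abs(haversine_km(lat, lon, office[0], office[1]) - current_km), name)
        for name, lat, lon in neighborhoods
    ]
    return [name for _, name in sorted(scored)]


# Hypothetical scenario: the new office is near Queen's Park and the user
# currently lives about 3 km from work.
office = (43.662301, -79.389494)
candidates = [
    ("Parkwoods", 43.753259, -79.329656),
    ("Victoria Village", 43.725882, -79.315572),
    ("Regent Park, Harbourfront", 43.65426, -79.360636),
]
ranking = rank_by_commute_similarity(3.0, office, candidates)
print(ranking)
```

Road or transit distance would be a better measure of commuting effort; the great-circle distance is used here only because it needs no external service.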
  
2.2 **Data cleaning**.  Data downloaded or scraped from the above-mentioned sources will be combined into one table for each city. The following data cleaning activities will be carried out: -

> a. **Borough not assigned**. In this case, the neighborhood record will be deleted.

> b. **Neighborhood not assigned**. In such cases, the neighborhood name will be same as the borough.

> c. **Wrong venue details**. In some cases, the venue type was mentioned as "neighborhood" in the **Foursquare** data. Such records will be deleted.

> d. **Pre-processing of Venues data**. The following pre-processing will be performed on the venues data obtained using **Foursquare** : - 
>> i.   Generalisation of venue types. Venue details received through the Foursquare application would be categorised into broader groups indicating the nature of the venue, e.g., Food joints, Public amenities, Recreational & sports facilities etc.   
>> ii.  Additional attributes of the venues will also be recorded, e.g., the cuisine (such as Indian, Thai etc.) in the case of Food joints. These will be used only if the individual has such specific requirements for stay / relocation. 
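
Cleaning rules a to c above can be sketched in pandas as follows. This is an illustration under assumptions, not the notebook's hidden code: it assumes the columns `Borough`, `Neighborhood` and `Venue Category` seen in the outputs further down, and the placeholder text `Not assigned` used by the scraped Toronto table. The function names and the `M0X` row are hypothetical.

```python
import pandas as pd


def clean_neighborhoods(df):
    """Apply rules a and b to a table with Borough and Neighborhood columns."""
    # Rule a: drop records whose borough is not assigned
    cleaned = df[df["Borough"] != "Not assigned"].copy()
    # Rule b: an unassigned neighborhood takes the name of its borough
    missing = cleaned["Neighborhood"] == "Not assigned"
    cleaned.loc[missing, "Neighborhood"] = cleaned.loc[missing, "Borough"]
    return cleaned.reset_index(drop=True)


def drop_neighborhood_venues(venues):
    """Rule c: remove venue records whose category is 'Neighborhood'."""
    is_neighborhood = venues["Venue Category"].str.lower() == "neighborhood"
    return venues[~is_neighborhood].reset_index(drop=True)


# Two rows from the scraped table plus one hypothetical row for rule b
sample = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M0X"],
    "Borough": ["Not assigned", "North York", "Example Borough"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Not assigned"],
})
print(clean_neighborhoods(sample))
```

Resetting the index is a choice made here for readability; the `df.head(10)` output later in the notebook keeps the original row labels.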


I will experiment with the pre-processed data described above to find out whether the pre-processing improves the clustering results.

2.3 **Feature selection**.  Venue details and categories will be used as the features of each neighborhood for further analysis. Selection of features will depend on the purpose of move. This is illustrated below: - 

Purpose of Move                   | Features of Neighborhood
:-------                          | :-----
Similar or better quality of life | Public amenities, educational institutes, shopping, recreational & sports facilities
Professionals                     | Proximity to office, food joints, public amenities
International students            | Public amenities, library, transport facility, low cost restaurant
Businessman                       | Availability of potential customers
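
One way to turn the table above into neighborhood features is to group venue categories into the broader groups of section 2.2 (d) and count venues per group. The sketch below assumes that approach. The `CATEGORY_GROUPS` mapping is hypothetical and covers only the five sample venues shown later in this notebook, and `FEATURES_BY_PURPOSE` is an abridged reading of the table, not a definitive feature list.

```python
import pandas as pd

# Hypothetical mapping from Foursquare venue categories to broader groups
CATEGORY_GROUPS = {
    "Park": "Recreational & sports facilities",
    "Hockey Arena": "Recreational & sports facilities",
    "Food & Drink Shop": "Public amenities",
    "Portuguese Restaurant": "Food joints",
    "Coffee Shop": "Food joints",
}

# Feature groups per purpose of move, abridged from the table above
FEATURES_BY_PURPOSE = {
    "Similar or better quality of life": ["Public amenities", "Recreational & sports facilities"],
    "Professionals": ["Food joints", "Public amenities"],
}


def neighborhood_features(venues, purpose):
    """Count venues per neighborhood in each group relevant to the purpose."""
    groups = venues["Venue Category"].map(CATEGORY_GROUPS).fillna("Other")
    counts = pd.crosstab(venues["Neighborhood"], groups)
    # Keep only the relevant groups; a group with no venues becomes a zero column
    return counts.reindex(columns=FEATURES_BY_PURPOSE[purpose], fill_value=0)


venues = pd.DataFrame({
    "Neighborhood": ["Parkwoods", "Parkwoods", "Victoria Village",
                     "Victoria Village", "Victoria Village"],
    "Venue Category": ["Park", "Food & Drink Shop", "Hockey Arena",
                       "Portuguese Restaurant", "Coffee Shop"],
})
features = neighborhood_features(venues, "Professionals")
print(features)
```

Proximity to office is not a venue count and would have to be added as a separate numeric column.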


### Preparations for the project

Import all libraries required for data processing

In [1]:
import json    # to handle JSON files

import numpy as np
import pandas as pd
from pandas import json_normalize  # transform JSON data into a pandas dataframe (pandas >= 1.0)

import requests    # for fetching location data through Foursquare

Get geopy to obtain city coordinates for map rendering

In [2]:
#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

Get Folium for map rendering 

In [None]:
!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

## Sample data on New York and Toronto neighborhoods, followed by venue details of Toronto neighborhoods extracted using the Foursquare app

### Toronto neighborhood data

In [3]:
#Download and save the web page into local directory
!wget -q -O 'Canada_data' https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [4]:
# The code was removed by Watson Studio for sharing.

    Postal Code           Borough  \
0           M1A      Not assigned   
1           M2A      Not assigned   
2           M3A        North York   
3           M4A        North York   
4           M5A  Downtown Toronto   
5           M6A        North York   
6           M7A  Downtown Toronto   
7           M8A      Not assigned   
8           M9A         Etobicoke   
9           M1B       Scarborough   
10          M2B      Not assigned   
11          M3B        North York   
12          M4B         East York   
13          M5B  Downtown Toronto   
14          M6B        North York   
15          M7B      Not assigned   
16          M8B      Not assigned   
17          M9B         Etobicoke   
18          M1C       Scarborough   
19          M2C      Not assigned   
20          M3C        North York   
21          M4C         East York   
22          M5C  Downtown Toronto   
23          M6C              York   
24          M7C      Not assigned   
25          M8C      Not assigned   
...

|   | PostalCode | Borough | Neighborhood |
| --- | --- | --- | --- |
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |


In [5]:
# The code was removed by Watson Studio for sharing.

In [14]:
df.head(10)


|   | PostalCode | Borough | Neighborhood |
| --- | --- | --- | --- |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| 8 | M9A | Etobicoke | Islington Avenue, Humber Valley Village |
| 9 | M1B | Scarborough | Malvern, Rouge |
| 11 | M3B | North York | Don Mills |
| 12 | M4B | East York | Parkview Hill, Woodbine Gardens |
| 13 | M5B | Downtown Toronto | Garden District, Ryerson |


In [6]:
#Obtained lat, long info using the file mentioned in the Assignment
!wget -q -O "lat_long.csv" http://cocl.us/Geospatial_data


In [7]:
# The code was removed by Watson Studio for sharing.

In [8]:
toronto_neigh.head()

|   | PostalCode | Borough | Neighborhood | Latitude | Longitude |
| --- | --- | --- | --- | --- | --- |
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
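
The cell that produced `toronto_neigh` is hidden, so the following is only a plausible sketch of how the coordinates could be attached to the cleaned neighborhood table. It assumes the geospatial CSV has the columns `Postal Code`, `Latitude` and `Longitude`; the function name `add_coordinates` is hypothetical.

```python
import pandas as pd


def add_coordinates(neigh, coords):
    """Attach Latitude and Longitude to each postal code with a left join."""
    # Align the key name with the neighborhood table (assumed CSV header)
    coords = coords.rename(columns={"Postal Code": "PostalCode"})
    # validate raises if a postal code appears more than once in coords
    return neigh.merge(coords, on="PostalCode", how="left", validate="many_to_one")


neigh = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
    "Neighborhood": ["Parkwoods", "Victoria Village"],
})
coords = pd.DataFrame({
    "Postal Code": ["M4A", "M3A"],
    "Latitude": [43.725882, 43.753259],
    "Longitude": [-79.315572, -79.329656],
})
merged = add_coordinates(neigh, coords)
print(merged)
```

A left join keeps every neighborhood even when a coordinate is missing, so gaps show up as `NaN` rather than as silently dropped rows.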


### New York neighborhood data

In [9]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset

with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

#all the relevant data is in the features key, which is basically a list of the neighborhoods. Define a new variable that includes this data.
neighborhoods_data = newyork_data['features']

In [10]:
# The code was removed by Watson Studio for sharing.

In [11]:
newyork_neigh.head()

|   | Borough | Neighborhood | Latitude | Longitude |
| --- | --- | --- | --- | --- |
| 0 | Bronx | Wakefield | 40.894705 | -73.847201 |
| 1 | Bronx | Co-op City | 40.874294 | -73.829939 |
| 2 | Bronx | Eastchester | 40.887556 | -73.827806 |
| 3 | Bronx | Fieldston | 40.895437 | -73.905643 |
| 4 | Bronx | Riverdale | 40.890834 | -73.912585 |
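
The conversion from `neighborhoods_data` to `newyork_neigh` is also hidden. A minimal sketch of such a conversion is given below, assuming each feature stores its names under `properties['borough']` and `properties['name']` and its position as a GeoJSON point (longitude first, then latitude). Those key names and the function name `features_to_frame` are assumptions, not confirmed by the notebook.

```python
import pandas as pd


def features_to_frame(features):
    """Flatten a list of GeoJSON-style features into a neighborhood table."""
    rows = []
    for feature in features:
        props = feature["properties"]
        # GeoJSON point coordinates are ordered longitude, latitude
        lon, lat = feature["geometry"]["coordinates"][:2]
        rows.append({
            "Borough": props["borough"],      # assumed key name
            "Neighborhood": props["name"],    # assumed key name
            "Latitude": lat,
            "Longitude": lon,
        })
    return pd.DataFrame(rows, columns=["Borough", "Neighborhood", "Latitude", "Longitude"])


# Hand-built feature mirroring the first row of the output above
example_features = [{
    "properties": {"borough": "Bronx", "name": "Wakefield"},
    "geometry": {"type": "Point", "coordinates": [-73.847201, 40.894705]},
}]
frame = features_to_frame(example_features)
print(frame)
```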


### Foursquare credentials

In [12]:
# The code was removed by Watson Studio for sharing.

My credentials:
CLIENT_ID: XETBJTAFF0JQNLQ55QGHFW2EFGRSQSQQQG5Z4MTCKPFDRSAC
CLIENT_SECRET:FJY50XE0LAR1L4C5LW4BIWY0HPG3QMP5GZLZ2ZM3Q3SRUBXZ


In [13]:
# The code was removed by Watson Studio for sharing.
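
Since the credential cells are hidden, the names `CLIENT_ID` and `CLIENT_SECRET` used in the request below must be defined somewhere out of view. One possible way to supply them without printing them in the notebook is to read them from environment variables, sketched here. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper function are hypothetical.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) read from the given mapping."""
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        # Fail early with a clear message instead of sending an empty key
        raise RuntimeError(f"Environment variable {missing} is not set") from None
```

In the notebook this could be used as `CLIENT_ID, CLIENT_SECRET = load_foursquare_credentials()` before the request loop.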

In [14]:
LIMIT = 100   # maximum number of venues returned per request
RADIUS = 500  # search radius around each neighborhood centre, in metres

venues_list = []

# Query Foursquare once for every neighborhood row
for name, neigh_lat, neigh_long in zip(toronto_neigh['Neighborhood'],
                                       toronto_neigh['Latitude'],
                                       toronto_neigh['Longitude']):

    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, neigh_lat, neigh_long, VERSION, RADIUS, LIMIT)

    results = requests.get(url).json()["response"]['groups'][0]['items']

    # One tuple per venue, tagged with the neighborhood it was found in
    venues_list.append([(
            name,
            neigh_lat,
            neigh_long,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

# Flatten the per-neighborhood lists into one dataframe, built once after the loop
toronto_venues = pd.DataFrame(
    [item for venue_list in venues_list for item in venue_list],
    columns=['Neighborhood',
             'Neighborhood Latitude',
             'Neighborhood Longitude',
             'Venue',
             'Venue Latitude',
             'Venue Longitude',
             'Venue Category'])

In [15]:
toronto_venues.head()

|   | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Parkwoods | 43.753259 | -79.329656 | Brookbanks Park | 43.751976 | -79.33214 | Park |
| 1 | Parkwoods | 43.753259 | -79.329656 | Variety Store | 43.751974 | -79.333114 | Food & Drink Shop |
| 2 | Victoria Village | 43.725882 | -79.315572 | Victoria Village Arena | 43.723481 | -79.315635 | Hockey Arena |
| 3 | Victoria Village | 43.725882 | -79.315572 | Portugril | 43.725819 | -79.312785 | Portuguese Restaurant |
| 4 | Victoria Village | 43.725882 | -79.315572 | Tim Hortons | 43.725517 | -79.313103 | Coffee Shop |
