## A Tale of Two Cities through ML

## Table of contents
* [Introduction: Business Problem](#Introduction)
* [Data](#Data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

### 1.  Introduction: Business Problem <a name="Introduction"></a>

1.1 **Background** : 
New York is a far larger city than Toronto, but the two have a lot in common. Each is the financial and tourist capital of its country. Downtown Toronto and Manhattan are both crowded with skyscrapers, and each city has signature landmarks that make it stand out. People move constantly between the two cities for business, travel, work, studies and so on. Where a visitor stays within the city depends on the purpose and duration of the visit, so choosing the location is an important part of preparing for it.

1.2 **Problem** : For people who want to move from '**Place A**' to '**Place B**', assessing the suitability of '**Place B**' upfront is very important. For example, when relocating from Toronto to New York, selecting an appropriate neighborhood not only avoids the wasteful expense of moving to the wrong neighborhood, but also helps with timely planning of other aspects of relocation, such as talking to the local school about admission or arranging to open a bank account. Someone relocating for higher education would probably look for an affordable neighborhood near the university. The problem statement of the project is therefore: -
 - Given the purpose of the visit/stay, which is the right neighborhood in Toronto when moving in from New York, and vice versa?


1.3 **Interest** : 
The following categories of people (living in New York and Toronto) will be interested in this solution: -

    a. People migrating from one city to the other for a similar or better quality of life
    b. Professionals shifting to the other city looking to stay in a place with similar amenities
    c. International students looking for a place to live with certain amenities within a budget
    d. Business owners looking to open a shop in the other city, in a location that offers a good customer base and minimal competition

Although this project concentrates only on movement between New York and Toronto, the solution model can be extended into a planning tool for moving from one location to any other location in the world. The tool would help in choosing an appropriate neighborhood depending on the purpose of the move (e.g., new business, better life, education, profession).

### 2. Data acquisition and cleaning <a name="Data"></a>

2.1 **Data sources**.   For the project, the following sets of input data will be necessary: -

     a. Neighborhood details of Toronto and New York - Borough and Neighborhood names
     b. Neighborhood location - Latitude, Longitude  
     c. Details of venues in each neighborhood - Venues will include the following: -
          i.   Public amenities (department stores, hospitals, schools, banks, post offices etc.)
          ii.  Recreational & sports facilities (sports centres, cinemas, museums, gyms, yoga studios etc.)
          iii. Shopping malls, airport terminals, railway stations
          iv.  Food joints (restaurants of various cuisines)
          v.   Any other significant venues
     
     
For New York, both the neighborhood details and the location data will be obtained from the dataset maintained at https://cocl.us/new_york_dataset (courtesy: "Segmenting and Clustering Neighborhoods in New York City" Lab Exercise of week 3 of this Module).

Neighborhood data for Toronto will be obtained from https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M. However, the neighborhood dataset available on this page does not contain the latitude and longitude of the neighborhoods. These coordinates will be retrieved from http://cocl.us/Geospatial_data.

Venue details will be extracted by using **Foursquare location data**.
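
As a rough illustration of the venue extraction, the sketch below builds a request URL and flattens a response into rows. It assumes the legacy Foursquare v2 `venues/explore` endpoint and its `response.groups[].items[].venue` layout; the helper names, the default `version`, `radius` and `limit` values, and the sample payload are all hypothetical, and no network call is made.

```python
from urllib.parse import urlencode

# Assumed endpoint: the legacy Foursquare Places API v2 "explore" call.
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, client_id, client_secret,
                      version="20180605", radius=500, limit=100):
    """Build the request URL for venues around one neighborhood centre."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,            # API version date (placeholder value)
        "ll": f"{lat},{lng}",    # latitude,longitude of the neighborhood
        "radius": radius,        # search radius in metres
        "limit": limit,          # maximum number of venues to return
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"


def parse_venues(neighborhood, payload):
    """Flatten a response into (neighborhood, venue, lat, lng, category) rows."""
    rows = []
    for group in payload.get("response", {}).get("groups", []):
        for item in group.get("items", []):
            venue = item["venue"]
            # Some venues may carry no category; fall back to "Unknown".
            categories = venue.get("categories") or [{}]
            rows.append((
                neighborhood,
                venue["name"],
                venue["location"]["lat"],
                venue["location"]["lng"],
                categories[0].get("name", "Unknown"),
            ))
    return rows


# Hand-written payload in the assumed response shape (not real API output).
sample_payload = {
    "response": {
        "groups": [{
            "items": [
                {"venue": {"name": "Example Park",
                           "location": {"lat": 43.7519, "lng": -79.3325},
                           "categories": [{"name": "Park"}]}},
                {"venue": {"name": "Example Cafe",
                           "location": {"lat": 43.7525, "lng": -79.3301},
                           "categories": []}},
            ]
        }]
    }
}

venue_rows = parse_venues("Parkwoods", sample_payload)
for row in venue_rows:
    print(row)
```

Keeping URL construction separate from parsing means the parsing step can be checked against a saved response without spending API quota.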

2.2 **Data cleaning**.  Data downloaded or scraped from the sources mentioned above will be combined into one table for each city. The following data cleaning activities will be carried out: -

    a. Borough not assigned. In this case, the neighborhood record will be deleted.
    b. Neighborhood not assigned. In such cases, the record will be deleted.
    c. Wrong venue details. In some cases the venue type could be mentioned as "neighborhood". Such records will be deleted.
    d. Pre-processing of venues data. The following pre-processing will be performed on the venues data obtained from Foursquare: -
         i.   Generalisation of venue types. Venue details received from Foursquare will be categorised into broader groups indicating the nature of the venue, e.g., food joints, public amenities, recreational & sports facilities etc.
         ii.  Further attributes of the venue will be recorded as separate characteristics, e.g., the cuisine (such as Indian, Thai etc.) in the case of food joints. These will be used only if the individual has such specific requirements for the stay / relocation.
         iii. A parameter indicating the concentration (or density) of each venue category within each neighborhood will be included.

     I will experiment with the pre-processed data above to check whether the pre-processing improves the clustering results.
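
The cleaning and pre-processing steps above can be sketched with pandas as follows. This is only an illustration under stated assumptions: the sample rows, the column names, the `GROUPS` mapping, the rule that any category ending in "Restaurant" is a food joint, and the definition of density as the share of a neighborhood's venues falling in each group are all hypothetical choices.

```python
import pandas as pd

# Hypothetical venue rows in the shape a Foursquare extraction might produce.
venues = pd.DataFrame({
    "Borough": ["North York", "North York", "Not assigned",
                "Downtown Toronto", "Downtown Toronto", "Downtown Toronto"],
    "Neighborhood": ["Parkwoods", "Parkwoods", "Not assigned",
                     "Regent Park", "Regent Park", "Regent Park"],
    "Venue Category": ["Park", "Indian Restaurant", "Bank",
                       "Thai Restaurant", "Neighborhood", "Gym"],
})

# Steps a-c: drop unassigned boroughs/neighborhoods and mislabelled venues.
clean = venues[
    (venues["Borough"] != "Not assigned")
    & (venues["Neighborhood"] != "Not assigned")
    & (venues["Venue Category"] != "Neighborhood")
].copy()

# Step d.i: assumed mapping from raw category to a broader group.
GROUPS = {
    "Park": "Recreation & sports",
    "Gym": "Recreation & sports",
    "Bank": "Public amenities",
}


def generalise(category):
    """Map a raw venue category to a broader group (illustrative rule)."""
    if category.endswith("Restaurant"):
        return "Food joints"
    return GROUPS.get(category, "Other")


clean["Venue Group"] = clean["Venue Category"].map(generalise)

# Step d.ii: keep the cuisine as a separate attribute (NaN for non-restaurants).
clean["Cuisine"] = clean["Venue Category"].str.extract(
    r"^(.*) Restaurant$", expand=False
)

# Step d.iii: density as the share of each group among a neighborhood's venues.
density = pd.crosstab(
    clean["Neighborhood"], clean["Venue Group"], normalize="index"
)

print(clean)
print(density)
```

Each row of `density` sums to 1, so neighborhoods with very different venue counts remain comparable when clustered.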

2.3 **Feature selection**.  Venue types will be used as the features of each neighborhood for further analysis. The selection of features will depend on the purpose of the move, as illustrated below: -

Purpose of move | Features of neighborhood
:------- | :-----
Similar or better quality of life | Public amenities, educational institutes, shopping, recreational & sports facilities
Professionals | Proximity to office, food joints, public amenities
International students | Public amenities, library, transport facilities, low-cost restaurants
Business owners | Availability of potential customers
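
One way to apply the table is to keep a mapping from purpose to feature names and filter each neighborhood's profile with it. The sketch below is hypothetical: the purpose keys, the feature names and the sample profile are illustrative, and features that are not venue-based (proximity to office, potential customers) are left out because they would need other data.

```python
# Hypothetical mapping from purpose of move to venue-based feature groups,
# loosely following the table above.
PURPOSE_FEATURES = {
    "quality_of_life": ["Public amenities", "Educational institutes",
                        "Shopping", "Recreation & sports"],
    "professional": ["Food joints", "Public amenities"],
    "student": ["Public amenities", "Library", "Transport",
                "Low-cost restaurant"],
}


def select_features(profile, purpose):
    """Keep only the features relevant to the purpose; absent ones count as 0."""
    return {name: profile.get(name, 0.0) for name in PURPOSE_FEATURES[purpose]}


# Illustrative venue-density profile of a single neighborhood.
neighborhood_profile = {
    "Food joints": 0.4,
    "Public amenities": 0.1,
    "Recreation & sports": 0.5,
}

selected = select_features(neighborhood_profile, "professional")
print(selected)
```

Filling absent features with 0 keeps every neighborhood's feature vector the same length, which clustering algorithms generally require.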

### Toronto location data

In [None]:
#Download and save the web page into local directory
!wget -q -O 'Canada_data' https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [None]:
# The code was removed by Watson Studio for sharing.

In [None]:
# The code was removed by Watson Studio for sharing.

In [None]:
#Obtained lat, long info using the file mentioned in the Assignment
!wget -q -O "lat_long.csv" http://cocl.us/Geospatial_data


In [None]:
# The code was removed by Watson Studio for sharing.
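
The hidden cell above presumably attached coordinates to the postal-code table. A minimal sketch of such a join with pandas follows; the column names (`Postal Code`, `Borough`, `Neighbourhood`, `Latitude`, `Longitude`) and the inline sample rows with rounded coordinates are assumptions standing in for the scraped table and `lat_long.csv`, not the removed code.

```python
import io

import pandas as pd

# Sample rows standing in for the scraped Wikipedia table (assumed columns).
postal_csv = io.StringIO(
    "Postal Code,Borough,Neighbourhood\n"
    "M3A,North York,Parkwoods\n"
    "M4A,North York,Victoria Village\n"
    "M5A,Downtown Toronto,Regent Park\n"
)

# Sample rows standing in for lat_long.csv (assumed columns, rounded values).
geo_csv = io.StringIO(
    "Postal Code,Latitude,Longitude\n"
    "M3A,43.7533,-79.3297\n"
    "M4A,43.7259,-79.3156\n"
)

postal_df = pd.read_csv(postal_csv)
geo_df = pd.read_csv(geo_csv)

# A left join keeps every neighborhood, so a missing coordinate shows up
# as NaN instead of silently dropping the row.
toronto_df = postal_df.merge(geo_df, on="Postal Code", how="left")

# Rows without coordinates cannot be used for venue lookups; list them.
missing_coords = toronto_df[toronto_df["Latitude"].isna()]

print(toronto_df)
print("Postal codes without coordinates:",
      missing_coords["Postal Code"].tolist())
```

With the real files, the two `io.StringIO` objects would be replaced by the parsed Wikipedia table and `pd.read_csv("lat_long.csv")`.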

### New York location data

In [None]:
import json

# Download the New York neighborhood dataset
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset

with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

# All the relevant data is under the 'features' key, which is a list of the neighborhoods.
neighborhoods_data = newyork_data['features']

In [None]:
# The code was removed by Watson Studio for sharing.
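
The removed cell most likely turned `neighborhoods_data` into a table. A minimal sketch of that flattening is shown below, assuming each feature follows the GeoJSON point layout with `properties.borough`, `properties.name` and `geometry.coordinates`; the `flatten_features` helper and the single sample feature (with rounded coordinates) are illustrative.

```python
def flatten_features(features):
    """Turn GeoJSON-style point features into one flat dict per neighborhood."""
    rows = []
    for feature in features:
        # GeoJSON stores coordinates as [longitude, latitude, ...].
        lon, lat = feature["geometry"]["coordinates"][:2]
        rows.append({
            "Borough": feature["properties"]["borough"],
            "Neighborhood": feature["properties"]["name"],
            "Latitude": lat,
            "Longitude": lon,
        })
    return rows


# One hand-written feature in the assumed shape of the dataset.
sample_features = [
    {
        "type": "Feature",
        "properties": {"borough": "Bronx", "name": "Wakefield"},
        "geometry": {"type": "Point", "coordinates": [-73.847, 40.895]},
    },
]

ny_rows = flatten_features(sample_features)
print(ny_rows)
```

In the notebook, `flatten_features(neighborhoods_data)` could be passed straight to `pandas.DataFrame` to get the same four columns as the Toronto table.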