# Battle of the Neighborhoods

### -- A Glance at the Major Cities in Canada from a Tourist's Point of View

## 1. Business Problem

### 1.1 Inspiration
As a first-time visitor to Canada, I have three major cities on my list: Toronto, Vancouver and Ottawa.
While planning my trip, I want to learn about the cities at a high level so I know what to expect and what to focus on in each city.
#####    1.1.1 Downtowns: How similar/dissimilar are the downtowns?
#####    1.1.2 Neighborhoods: What are the different neighborhoods in each city?
#####    1.1.3 Food Selection: What are the most popular cuisines in each city?
#####    1.1.4 To-do: What are the most popular recreational activities in each city?
#####    1.1.5 Other: What are some of the unique characteristics of each city?

### 1.2 Generalization
This analysis can be packaged into a product that provides customizable trip planning service to travelers.
##### 1.2.1 Custom Combination of Cities
Unlike most current travel sites, which provide plans only for fixed combinations of cities, this tool lets customers create their own combination of cities. The tool then runs the analysis in real time and provides customized reports.
##### 1.2.2 Compare and Contrast
Rather than just handing out plans, the tool compares and contrasts the cities from various perspectives to give travelers more background information. Travelers can then refine their plans to include or exclude cities, or to spend more or less time in certain cities, based on what they see and what interests them.
##### 1.2.3 Refining Plans by Category
In each category (such as food, culture, activities), customers get an overview of what’s most popular and what’s most unique to each city, so that they can prioritize based on their own interests.

## 2. Data

To achieve this, a few different datasets are needed. Each of them is discussed in the sections below.

### 2.1 Zip Code and Neighborhood

##### 2.1.1 Formatted Data
For Toronto, there’s a well-formatted data source that contains zip code – neighborhood information:
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

Once scraped from the website, the data needs little transformation to become a clean dataframe. However, some processing and cleansing is still needed to address two issues: 1) rows with missing information, and 2) records that fall outside the area we are interested in.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import requests # library to handle requests

# conda install -c anaconda beautifulsoup4 
from bs4 import BeautifulSoup

In [2]:
res = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table')[0] 
df = pd.read_html(str(table))
# Save initial DataFrame to df_toronto
df_toronto = df[0]
df_toronto.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [3]:
# Data Cleansing

# drop records where Borough = 'Not assigned'
df_toronto = df_toronto[df_toronto.Borough != 'Not assigned']
# Now check records where Borough = 'Not assigned'. There shouldn't be any records showing anymore
df_toronto[df_toronto.Borough == 'Not assigned']
# group data by Postcode, and combine records by concatenating "Neighbourhood" values as comma-delimited strings
df_toronto = df_toronto.groupby(['Postcode','Borough'])['Neighbourhood'].apply(lambda x: ','.join(x)).reset_index()
# If Neighbourhood = "Not assigned", assign Borough value to Neighbourhood
not_assigned = df_toronto.Neighbourhood == 'Not assigned'
df_toronto.loc[not_assigned, 'Neighbourhood'] = df_toronto.loc[not_assigned, 'Borough']
# Rename column "Postcode" to "PostalCode"
df_toronto.rename(columns = {'Postcode':'PostalCode'}, inplace = True)

In [4]:
# keep only records that have the string "Toronto" in their Borough
toronto_data = df_toronto[df_toronto['Borough'].str.contains('Toronto')]
np.shape(toronto_data)

(38, 3)

In [5]:
toronto_data.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
37,M4E,East Toronto,The Beaches
41,M4K,East Toronto,"The Danforth West,Riverdale"
42,M4L,East Toronto,"The Beaches West,India Bazaar"
43,M4M,East Toronto,Studio District
44,M4N,Central Toronto,Lawrence Park


##### 2.1.2 Unformatted Data
For Vancouver and Ottawa, more data processing is needed.

Vancouver: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_V

Ottawa: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_K

The data on these pages is not organized in the format we need, so some processing is required.
Some cleansing is also needed for missing and irrelevant data (for example, the Vancouver page covers all of British Columbia and must be limited to Vancouver records only).

In [6]:
# First, process Vancouver data
res = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_V")
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table')[0] 
df = pd.read_html(str(table))

In [7]:
df = pd.DataFrame(df[0])
df = pd.DataFrame(df.values.flatten())
df.head()

Unnamed: 0,0
0,V1AKimberley
1,V2APenticton
2,V3ALangley Township(Langley City)
3,V4ASurreySouthwest
4,V5ABurnaby(Government Road / Lake City / SFU /...


In [8]:
# parse out Postal Code (first 3 characters in column 0)
df['PostalCode'] = df[0].apply(lambda x: x[:3])
df['new_col'] = df[0].apply(lambda x: x[3:])
# keep only records that contain the string "Vancouver"
# .copy() makes an independent dataframe, so the assignments below do not trigger SettingWithCopyWarning
vancouver_data = df[df['new_col'].str.contains('Vancouver')].copy()
# parse out the remaining two columns
# find the location of the string "Vancouver"; Bourough is the text up to and including "Vancouver"
# Neiborhood is the text after "Vancouver"
vancouver_data['Bourough'] = vancouver_data.new_col.apply(lambda x: x[:x.find('Vancouver') + len('Vancouver')])
vancouver_data['Neiborhood'] = vancouver_data.new_col.apply(lambda x: x[x.find('Vancouver') + len('Vancouver'):])
# Clean up the dataframe by dropping columns created for intermediate steps
vancouver_data.drop(['new_col'], axis=1, inplace=True)
vancouver_data.drop(vancouver_data.columns[0], axis=1, inplace=True)
# check the shape of the final dataframe
np.shape(vancouver_data)


(44, 3)

In [9]:
vancouver_data.head()

Unnamed: 0,PostalCode,Bourough,Neiborhood
5,V6A,Vancouver,(Strathcona / Chinatown / Downtown Eastside)
14,V6B,Vancouver,(NE Downtown / Gastown / Harbour Centre / Inte...
23,V6C,Vancouver,(Waterfront / Coal Harbour / Canada Place)
32,V6E,Vancouver,(SE West End / Davie Village)
41,V6G,Vancouver,(NW West End / Stanley Park)


In [10]:
# Now repeat the previous steps to process Ottawa data
res = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_K")
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table')[0] 
df = pd.read_html(str(table))
df = pd.DataFrame(df[0])
df = pd.DataFrame(df.values.flatten())
df.head()

Unnamed: 0,0
0,K1AGovernment of CanadaOttawa and Gatineau off...
1,K2AOttawa(Highland Park / McKellar Park /Westb...
2,K4AOttawa(Fallingbrook)
3,K6AHawkesbury
4,K7ASmiths Falls


In [11]:
# parse out Postal Code (first 3 characters in column 0)
df['PostalCode'] = df[0].apply(lambda x: x[:3])
df['new_col'] = df[0].apply(lambda x: x[3:])
# keep only records that contain the string "Ottawa"
# .copy() makes an independent dataframe, so the assignments below do not trigger SettingWithCopyWarning
ottawa_data = df[df['new_col'].str.contains('Ottawa')].copy()
# parse out the remaining two columns
# find the location of the string "Ottawa"; Bourough is the text up to and including "Ottawa"
# Neiborhood is the text after "Ottawa"
ottawa_data['Bourough'] = ottawa_data.new_col.apply(lambda x: x[:x.find('Ottawa') + len('Ottawa')])
ottawa_data['Neiborhood'] = ottawa_data.new_col.apply(lambda x: x[x.find('Ottawa') + len('Ottawa'):])
# Clean up the dataframe by dropping columns created for intermediate steps
ottawa_data.drop(['new_col'], axis=1, inplace=True)
ottawa_data.drop(ottawa_data.columns[0], axis=1, inplace=True)
# In addition, we don't need the first record (Government of Canada, index = 0), so drop it
ottawa_data.drop(ottawa_data.index[0], inplace=True)
# check the shape of the final dataframe
np.shape(ottawa_data)


(40, 3)

In [12]:
ottawa_data.head()

Unnamed: 0,PostalCode,Bourough,Neiborhood
1,K2A,Ottawa,(Highland Park / McKellar Park /Westboro /Glab...
2,K4A,Ottawa,(Fallingbrook)
7,K1B,Ottawa,(Blackburn Hamlet / Pine View / Sheffield Glen)
8,K2B,Ottawa,(Britannia /Whitehaven / Bayshore / Pinecrest)
9,K4B,Ottawa,(Navan)


### 2.2 Coordinates
Coordinates for the FSAs (forward sortation areas, the first three characters of a postal code) starting with V and K are downloaded in CSV format from a Google Fusion Table:
https://fusiontables.google.com/DataSource?docid=1H_cl-oyeG4FDwqJUTeI_aGKmmkJdPDzRNccp96M&hl=en_US&pli=1
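
Once the CSV is downloaded, the coordinates can be joined to the postal-code tables. The following is a minimal sketch, assuming the file exposes columns named `PostalCode`, `Latitude` and `Longitude` (hypothetical names; the real export may label them differently). It uses made-up coordinate values in place of the real file.

```python
import io
import pandas as pd

# Stand-in for the downloaded CSV; column names and values are illustrative only
csv_text = """PostalCode,Latitude,Longitude
V6A,49.28,-123.09
V6B,49.28,-123.12
K2A,45.38,-75.76
"""
coords = pd.read_csv(io.StringIO(csv_text))

neighborhoods = pd.DataFrame({
    'PostalCode': ['V6A', 'V6B', 'V6C'],
    'Bourough': ['Vancouver', 'Vancouver', 'Vancouver'],
})

# A left join keeps every postal code; rows without a match get NaN coordinates
merged = neighborhoods.merge(coords, on='PostalCode', how='left')
missing = merged[merged['Latitude'].isna()]
print(merged)
print('Postal codes without coordinates:', list(missing['PostalCode']))
```

Checking the unmatched rows right after the join makes it easy to spot postal codes that the coordinate file does not cover before they reach the venue queries.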

### 2.3 Venue Information
Use the Foursquare API to explore venues near each zip code. The request format is https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}
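
As a rough illustration of how the placeholders in that URL are filled, the sketch below builds one request URL without sending it. The helper name `build_explore_url`, the credentials, the version date, the radius and the limit are all placeholder assumptions, not values taken from this project.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, lat, lng, version, radius, limit):
    """Build a venues/explore URL from the query parameters listed above."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'll': '{},{}'.format(lat, lng),
        'v': version,
        'radius': radius,
        'limit': limit,
    }
    # safe=',' keeps the comma in "lat,lng" instead of percent-encoding it
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params, safe=',')

# Placeholder credentials and illustrative coordinates
url = build_explore_url('YOUR_ID', 'YOUR_SECRET', 43.65, -79.38, '20180605', 500, 100)
print(url)
```

The resulting string could then be passed to `requests.get(url)`, with one call per postal code.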

Once data is obtained, a cluster analysis will be performed on neighborhoods, to find different groups of neighborhoods in each city.
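
Before any clustering can run, the venue list has to be turned into one numeric row per neighborhood. The sketch below shows one common way to do that, using one-hot encoding followed by a per-postal-code mean. The venue rows and the `Category` column name are illustrative assumptions, and the choice of clustering algorithm is left open.

```python
import pandas as pd

# Illustrative venue list; in practice these rows would come from the API responses
venues = pd.DataFrame({
    'PostalCode': ['V6A', 'V6A', 'V6A', 'V6B', 'V6B'],
    'Category': ['Cafe', 'Cafe', 'Park', 'Hotel', 'Cafe'],
})

# One-hot encode the category, then average per postal code
# so that each row holds the share of venues in each category
onehot = pd.get_dummies(venues['Category'], dtype=float)
onehot.insert(0, 'PostalCode', venues['PostalCode'])
features = onehot.groupby('PostalCode').mean()
print(features)
```

Each row of `features` sums to 1, so neighborhoods with very different venue counts remain comparable when the matrix is handed to a clustering algorithm.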

For the analyses of “food selection”, “to-do”, and “others”, the venue “category name” values turned out to be too granular for the purpose of this analysis. For example, restaurants are split into cuisines and sub-types of cuisines, and other categories are similarly detailed. Some manual grouping (most likely based on string matches) will be needed before further analysis.
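
A string-match grouping of this kind might look like the sketch below. The group names, the keyword lists and the helper `group_category` are hypothetical choices for illustration; the actual rules would need to be tuned against the category names returned by the API.

```python
# Hypothetical keyword rules: the first group whose keyword appears in the name wins
GROUP_RULES = [
    ('Food', ['restaurant', 'cafe', 'café', 'bakery', 'pizza']),
    ('Outdoors', ['park', 'trail', 'beach']),
    ('Culture', ['museum', 'gallery', 'theater']),
]

def group_category(name, rules=GROUP_RULES, default='Other'):
    """Map a detailed category name to a coarse group by substring match."""
    lowered = name.lower()
    for group, keywords in rules:
        if any(keyword in lowered for keyword in keywords):
            return group
    return default

examples = ['Vietnamese Restaurant', 'Café', 'Dog Park', 'Art Gallery', 'Hotel']
grouped = {name: group_category(name) for name in examples}
print(grouped)
```

Plain substring matching is easy to audit but can over-match (for instance, a name containing "parking" would land in the outdoors group here), so the unmatched and ambiguous names are worth reviewing by hand.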

## 3. Methodology