# The Battle of Neighborhoods - Week 01
## 1. Introduction/Business Problem


Opening a cafeteria in Toronto

Timeout Company plans to open a cafeteria in the city of Toronto. The choice of location depends heavily on information about Toronto's neighborhoods (e.g., the most popular venues, whether there is a park in the neighborhood, etc.).

The following information (among other things) is helpful for making the decision:

* Toronto population
* City demographics
* Nearby venues such as parks and shopping centers
* Competitors in that location

## 2. Data

**Neighborhoods in Toronto data source:**
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

To provide relevant information to Timeout, I first organize the neighborhoods by postal code and build a table with the corresponding latitude and longitude. The next step is to combine this information with the Foursquare API to collect data (venues, competitors, etc.) on the same neighborhoods.
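As a rough illustration of the Foursquare step, the sketch below only builds the request URL for one neighborhood and makes no network call. It assumes the legacy Foursquare v2 `venues/explore` endpoint; the credentials, version date, radius, and limit are placeholders, and `build_explore_url` is a hypothetical helper name that does not appear elsewhere in this notebook.

```python
from urllib.parse import urlencode

# Assumption: legacy Foursquare v2 "explore" endpoint; newer API versions differ.
EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'


def build_explore_url(lat, lng, client_id, client_secret,
                      version='20180605', radius=500, limit=100):
    """Return the URL that requests venues around one (lat, lng) point."""
    params = {
        'client_id': client_id,          # placeholder credential
        'client_secret': client_secret,  # placeholder credential
        'v': version,                    # API version date (assumed value)
        'll': '{},{}'.format(lat, lng),  # latitude,longitude of the neighborhood
        'radius': radius,                # search radius in meters (assumed value)
        'limit': limit,                  # maximum number of venues (assumed value)
    }
    return EXPLORE_ENDPOINT + '?' + urlencode(params)


# Coordinates of Parkwoods (M3A), taken from the merged table in this notebook
example_url = build_explore_url(43.753259, -79.329656, 'YOUR_ID', 'YOUR_SECRET')
print(example_url)
```

Passing the resulting URL to `requests.get(...)` with real credentials would return JSON, which is presumably what the `json` and `json_normalize` imports are intended for.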

In [10]:
# import libs

import pandas as pd
import numpy as np
from geopy.geocoders import Nominatim
import json
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated since pandas 1.0
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
!pip install folium
import folium
from bs4 import BeautifulSoup
import requests

print('Libraries imported!')

Libraries imported!


### 2.1 Download the data

In [11]:
# scrape the postal code table from Wikipedia
html = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(html, 'lxml')
table = soup.find('table')
# convert to pandas dataframe
col_names = ['PostalCode', 'Borough', 'Neighborhood']
df = pd.DataFrame(columns=col_names)
# collect the postal code, borough, and neighborhood from every table row
for tr_cell in table.find_all('tr'):
    row_data=[]
    for td_cell in tr_cell.find_all('td'):
        row_data.append(td_cell.text.strip())
    if len(row_data)==3:
        df.loc[len(df)] = row_data
df.head()

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
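Appending to a DataFrame one row at a time with `df.loc[len(df)]` works but gets slow on larger tables. A possible alternative, sketched here on a tiny inline HTML snippet rather than the live Wikipedia page, is to collect the rows in a list and build the DataFrame once. The `sample_html` content and the helper name `table_to_frame` are illustrative assumptions, and the built-in `html.parser` is used so the sketch does not need `lxml`.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Tiny stand-in for the Wikipedia table (illustrative HTML, not the live page)
sample_html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""


def table_to_frame(html, col_names):
    """Parse the first <table> in html into a DataFrame with col_names."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table')
    rows = []
    for tr_cell in table.find_all('tr'):
        # header rows contain <th> cells only, so they yield an empty list
        row_data = [td_cell.text.strip() for td_cell in tr_cell.find_all('td')]
        if len(row_data) == len(col_names):
            rows.append(row_data)
    # build the DataFrame in one step instead of growing it row by row
    return pd.DataFrame(rows, columns=col_names)


df_sample = table_to_frame(sample_html, ['PostalCode', 'Borough', 'Neighborhood'])
print(df_sample)
```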


In [12]:
# data wrangling and cleaning
## drop rows whose Borough is 'Not assigned' (copy to avoid SettingWithCopyWarning)
df = df[df['Borough'] != 'Not assigned'].copy()
## where the Neighborhood is 'Not assigned', use the Borough name instead
df.loc[df['Neighborhood'] == 'Not assigned', 'Neighborhood'] = df['Borough']
# merge neighborhoods that share the same postal code
temp = df.groupby(['PostalCode','Borough'], sort=False).agg( ', '.join)
df_merge = temp.reset_index()
df_merge.head()

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
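To make the three cleaning steps easier to follow, here is a minimal sketch that applies them to a toy DataFrame. The rows are made up for illustration (the `M0X` postal code and `Example Borough` are hypothetical), and the names `raw`, `clean`, and `merged` are not used elsewhere in this notebook.

```python
import pandas as pd

# Toy input: one unassigned borough, one postal code split over two rows,
# and one row whose neighborhood is 'Not assigned' (all illustrative)
raw = pd.DataFrame({
    'PostalCode': ['M1A', 'M5A', 'M5A', 'M0X'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', 'Example Borough'],
    'Neighborhood': ['Not assigned', 'Regent Park', 'Harbourfront', 'Not assigned'],
})

# step 1: drop rows without a borough
clean = raw[raw['Borough'] != 'Not assigned'].copy()

# step 2: fall back to the borough name when the neighborhood is missing
missing = clean['Neighborhood'] == 'Not assigned'
clean.loc[missing, 'Neighborhood'] = clean.loc[missing, 'Borough']

# step 3: join neighborhoods that share a postal code into one string
merged = (clean.groupby(['PostalCode', 'Borough'], sort=False)['Neighborhood']
               .agg(', '.join)
               .reset_index())
print(merged)
```

Selecting the `Neighborhood` column before `agg` makes it explicit which column is being joined; with only one remaining column, as in the cell above, the result is the same.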


In [13]:
# use csv file to get the latitude and longitude
# download and read the csv file
!wget -q -O 'Toronto_long_lat_data.csv'  http://cocl.us/Geospatial_data
df_lon_lat = pd.read_csv('Toronto_long_lat_data.csv')
df_lon_lat.head()

|   | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [14]:
# merge the latitude and longitude data into the neighborhood table
# rename 'Postal Code' to 'PostalCode' so both tables share the join key
df_lon_lat.columns = ['PostalCode', 'Latitude', 'Longitude']
df_new = pd.merge(df_merge,
                 df_lon_lat[['PostalCode','Latitude', 'Longitude']],
                 on='PostalCode')
df_new.head()

|   | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
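`pd.merge` defaults to an inner join, so any postal code missing from the coordinates file would be dropped without warning. One way to check for this, sketched on toy data, is a left join with `indicator=True`. The `M0X` row is a hypothetical postal code added only to trigger a mismatch, and the names `left`, `right`, `checked`, and `unmatched` are illustrative.

```python
import pandas as pd

# Toy neighborhood table; 'M0X' is a made-up code with no coordinates
left = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M0X'],
    'Borough': ['North York', 'North York', 'Example Borough'],
})

# Toy coordinates table with values copied from the output above
right = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# a left join keeps every neighborhood; '_merge' records where each row matched
checked = pd.merge(left, right, on='PostalCode', how='left', indicator=True)

# postal codes that found no coordinates
unmatched = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would suggest that the inner join in the cell above loses no neighborhoods.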


**So far, the geographical information for the neighborhoods has been collected.**