# Aim

This project explores and clusters the neighborhoods of Toronto.


# Methodology

1. Scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to obtain the table of postal codes, then transform that data into a pandas dataframe like the one shown below:

![alt text](https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/7JXaz3NNEeiMwApe4i-fLg_40e690ae0e927abda2d4bde7d94ed133_Screen-Shot-2018-06-18-at-7.17.57-PM.png?expiry=1588464000000&hmac=JOuB1OoX2V8d-sYkNligqHYnrtbWLMxh9JZjrx2roTE "Output")



### Requirements

    - The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood.
    - Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
    - More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
    - If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. 
    - In the last cell of the notebook, the .shape attribute will be used to print the number of rows of the dataframe.
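
The requirements above can be sketched on a small toy table. This is a minimal illustration, not the notebook's own code: the rows are made-up stand-ins for the scraped data, and the helper name `clean_postal_codes` is hypothetical.

```python
import pandas as pd

# Toy rows standing in for the scraped table (illustrative values only).
raw = pd.DataFrame(
    [
        ("M1A", "Not assigned", "Not assigned"),
        ("M5A", "Downtown Toronto", "Harbourfront"),
        ("M5A", "Downtown Toronto", "Regent Park"),
        ("M7A", "Queen's Park", "Not assigned"),
    ],
    columns=["PostalCode", "Borough", "Neighborhood"],
)


def clean_postal_codes(frame):
    """Hypothetical helper that applies the three cleaning rules in order."""
    # Rule 1: keep only rows with an assigned borough.
    out = frame[frame["Borough"] != "Not assigned"].copy()
    # Rule 2: an unassigned neighborhood takes the borough name.
    missing = out["Neighborhood"] == "Not assigned"
    out.loc[missing, "Neighborhood"] = out.loc[missing, "Borough"]
    # Rule 3: one row per postal code, neighborhoods joined with commas.
    return (
        out.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
        .agg(", ".join)
        .reset_index()
    )


cleaned = clean_postal_codes(raw)
print(cleaned)
print(cleaned.shape[0])  # number of rows, as the last requirement asks
```

Passing `sort=False` to `groupby` keeps the postal codes in the order they first appear, which makes the result easier to compare against the source table.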

In [1]:
#import all libraries needed for project

import numpy as np #numpy library to vectorise data

import pandas as pd #pandas library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json #json library to handle the JSON files that will be received

!conda install -c conda-forge geopy --yes

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests #to handle requests

from pandas.io.json import json_normalize #transform JSON file into a pandas dataframe

#import matplotlib
import matplotlib.cm as cm
import matplotlib.colors as colors

#import kmeans from sklearn
from sklearn.cluster import KMeans

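#pin folium without spaces around "="; "folium = 0.5.0" raises CondaValueError: invalid package specification
!conda install -c conda-forge folium=0.5.0 --yes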
import folium #map library to render maps
from bs4 import BeautifulSoup

print ('Libraries imported!')

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.




CondaValueError: invalid package specification: =



Libraries imported!


# Data downloading and Wrangling

In [2]:
#use requests and BeautifulSoup to download the page and extract the postal code table
data = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(data,'html.parser') #the page is HTML, so use an HTML parser rather than 'xml'
table = soup.find('table')

df = pd.DataFrame(columns = ['Postalcode','Borough','Neighborhood'])

for tr in table.find_all('tr'): #walk through every row of the table
    row=[]
    for td in tr.find_all('td'):
        row.append(td.text.strip())
    if len(row)==3: #keep only rows with exactly three data cells
        df.loc[len(df)] = row

df.head(15) #compare with the table on the website

| | Postalcode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | |
| 1 | M2A | Not assigned | |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park / Harbourfront |
| 5 | M6A | North York | Lawrence Manor / Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park / Ontario Provincial Government |
| 7 | M8A | Not assigned | |
| 8 | M9A | Etobicoke | Islington Avenue |
| 9 | M1B | Scarborough | Malvern / Rouge |


In [3]:
drop_index = df[df['Borough'] == 'Not assigned'].index
df.drop(drop_index, inplace = True) #drop the rows whose borough is not assigned
df.head(15) #check again

| | Postalcode | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park / Harbourfront |
| 5 | M6A | North York | Lawrence Manor / Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park / Ontario Provincial Government |
| 8 | M9A | Etobicoke | Islington Avenue |
| 9 | M1B | Scarborough | Malvern / Rouge |
| 11 | M3B | North York | Don Mills |
| 12 | M4B | East York | Parkview Hill / Woodbine Gardens |
| 13 | M5B | Downtown Toronto | Garden District, Ryerson |


In [4]:
#every row with a borough already has an assigned neighborhood, so no replacement is needed

In [5]:
#each postal code already appears in a single row, so no rows need to be combined
df.shape

(103, 3)
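
The two claims in the comments above (no unassigned neighborhoods, one row per postal code) could be checked rather than assumed. The following is a minimal sketch on a toy frame; `df_check` is a hypothetical stand-in for the notebook's `df`, with a few rows copied from the output shown earlier.

```python
import pandas as pd

# Toy stand-in for the notebook's df (a few rows copied from the output above).
df_check = pd.DataFrame(
    {
        "Postalcode": ["M3A", "M4A", "M5A"],
        "Borough": ["North York", "North York", "Downtown Toronto"],
        "Neighborhood": [
            "Parkwoods",
            "Victoria Village",
            "Regent Park / Harbourfront",
        ],
    }
)

# Claim 1: no neighborhood is blank or marked "Not assigned".
neighborhood = df_check["Neighborhood"].str.strip()
has_unassigned = bool(neighborhood.isin(["", "Not assigned"]).any())

# Claim 2: each postal code appears in exactly one row.
codes_unique = bool(df_check["Postalcode"].is_unique)

print(has_unassigned, codes_unique)
```

If either check fails on the real data, the corresponding cleaning step (filling the neighborhood from the borough, or grouping rows by postal code) would be needed after all.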

The dataframe now holds the postal code, borough name, and neighborhood name of each neighborhood. To use the Foursquare location data, the latitude and longitude coordinates of each neighborhood need to be retrieved.

The Geocoder Python package (https://geocoder.readthedocs.io/index.html) can retrieve these coordinates.

Example: 

import geocoder # import geocoder

# initialize your variable to None
lat_lng_coords = None

# loop until you get the coordinates
while(lat_lng_coords is None):
  g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
  lat_lng_coords = g.latlng

latitude = lat_lng_coords[0]
longitude = lat_lng_coords[1]

The geographical coordinates of each postal code are also available as a CSV file: http://cocl.us/Geospatial_data

Use either the Geocoder package or the CSV file to create the following dataframe:

![alt text](https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/HZ3jNHNOEeiMwApe4i-fLg_f44f0f10ccfaf42fcbdba9813364e173_Screen-Shot-2018-06-18-at-7.18.16-PM.png?expiry=1588464000000&hmac=sYdITc_Q4Z-Czub_hbVjujMLeGUclZ2bBKiQMDqOQiY "Output")

In [6]:
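import geocoder #needed by get_geocode; not imported in the first cell

def get_geocode(postal_code):
    #return the (latitude, longitude) of a Toronto postal code
    #note: this function is not called below; the CSV file is used instead
    lat_lng_coords = None
    #loop until the geocoder returns coordinates
    while(lat_lng_coords is None):
        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng
    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]
    return latitude,longitude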
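
The loop above never stops if the geocoding service keeps returning nothing. One possible variation, sketched here under stated assumptions, is to cap the number of attempts. The names `lookup_with_retries` and `fake_lookup` are hypothetical, and the fake lookup only simulates a service that fails twice before answering; it does not contact any geocoder.

```python
import time


def lookup_with_retries(lookup, query, max_attempts=5, delay_seconds=0.0):
    """Call lookup(query) until it returns coordinates or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        coords = lookup(query)
        if coords is not None:
            return coords[0], coords[1]
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # pause before the next attempt
    return None  # the caller decides how to handle a missing result


# Hypothetical stand-in for a geocoder: returns None twice, then the
# coordinates listed for M5A in the merged dataframe.
calls = {"count": 0}


def fake_lookup(query):
    calls["count"] += 1
    return None if calls["count"] < 3 else [43.65426, -79.360636]


result = lookup_with_retries(fake_lookup, "M5A, Toronto, Ontario")
print(result)

# A lookup that never succeeds gives up after max_attempts instead of hanging.
never = lookup_with_retries(lambda q: None, "M1A, Toronto, Ontario", max_attempts=3)
print(never)
```

With the Geocoder package, the `lookup` argument could be something like `lambda q: geocoder.google(q).latlng`, mirroring the call used in the cell above.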

In [7]:
geodata = pd.read_csv('http://cocl.us/Geospatial_data')
geodata.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [8]:
geodata.rename(columns = {'Postal Code':'Postalcode'},inplace = True)
geodata_2 = pd.merge(df, geodata, on = 'Postalcode')
geodata_2.head()

| | Postalcode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park / Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor / Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park / Ontario Provincial Government | 43.662301 | -79.389494 |
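
`pd.merge` defaults to an inner join, so a postal code missing from the coordinates file would be dropped without any warning. The sketch below shows one way to surface such rows using the `how`, `validate`, and `indicator` parameters. The frames are toy data: the coordinates are copied from the output above, and `M0Z` is a made-up code included only to show an unmatched row.

```python
import pandas as pd

# Toy frames; "M0Z" is a made-up postal code with no coordinates.
boroughs = pd.DataFrame(
    {
        "Postalcode": ["M3A", "M4A", "M0Z"],
        "Borough": ["North York", "North York", "Example Borough"],
    }
)
coords = pd.DataFrame(
    {
        "Postalcode": ["M3A", "M4A"],
        "Latitude": [43.753259, 43.725882],
        "Longitude": [-79.329656, -79.315572],
    }
)

merged = pd.merge(
    boroughs,
    coords,
    on="Postalcode",
    how="left",             # keep every row of the left frame
    validate="one_to_one",  # raise if a postal code repeats on either side
    indicator=True,         # add a "_merge" column describing each row's origin
)

# Rows that found no coordinates are flagged as "left_only".
unmatched = merged.loc[merged["_merge"] == "left_only", "Postalcode"].tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would confirm that the inner join in the cell above lost no rows.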
