# Analysing Toronto neighborhoods

This notebook scrapes and analyses data on the neighborhoods of Toronto. The first part scrapes the data from a Wikipedia page. The second part analyses the data in order to cluster the neighborhoods.

### Scraping neighborhood data from Wikipedia

The neighborhood data comes from the Wikipedia page [List of postal codes of Canada: M](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M).

#### Scraping data using pandas

In [25]:
import pandas as pd

In [26]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
df_list = pd.read_html(url)

len(df_list)

3

In [27]:
df = df_list[0]
df.head()

|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
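
Picking `df_list[0]` relies on the postal-code table being the first one on the page. A minimal sketch of a more defensive selection, assuming the wanted table can be recognised by its column names; `pick_table` is a hypothetical helper, and the frames below are toy stand-ins for the list returned by `pd.read_html`:

```python
import pandas as pd

def pick_table(tables, required_columns):
    """Return the first DataFrame whose columns include all required_columns."""
    required = set(required_columns)
    for candidate in tables:
        if required.issubset(candidate.columns):
            return candidate
    raise ValueError(f"No table has columns {sorted(required)}")

# Toy stand-ins for the list returned by pd.read_html (not real page data).
tables = [
    pd.DataFrame({"Note": ["navigation box"]}),
    pd.DataFrame({
        "Postal Code": ["M3A", "M4A"],
        "Borough": ["North York", "North York"],
        "Neighbourhood": ["Parkwoods", "Victoria Village"],
    }),
]

picked = pick_table(tables, ["Postal Code", "Borough", "Neighbourhood"])
print(picked.shape)
```

In the notebook this could be called as `pick_table(df_list, ["Postal Code", "Borough", "Neighbourhood"])`, so that a change in table order on the page raises an error instead of silently selecting the wrong table.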


#### Scraping data using Beautiful Soup

In [4]:
from bs4 import BeautifulSoup

import requests

In [None]:
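page = requests.get(url)

soup = BeautifulSoup(page.content, 'html.parser')

# Restrict the search to the main content area of the page
content = soup.find(id="bodyContent")

# The tag name must be the string "table", not a Tag object.
# "jquery-tablesorter" is added by JavaScript in the browser, so it is
# not present in the HTML that requests downloads; match on "wikitable".
elems = content.find_all("table", class_="wikitable")
print(elems)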

This section will be completed later. For now, we will use the pandas dataframe output.
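
One way this section might be completed is sketched below. It is not the notebook's final method. It assumes the table has a header row of `th` cells followed by data rows of `td` cells, and it runs on a hand-written HTML sample rather than the real page source. `table_to_dataframe` and `SAMPLE_HTML` are hypothetical names.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hand-written stand-in for the page HTML (an assumption, not the real source).
SAMPLE_HTML = """
<div id="bodyContent">
<table class="wikitable sortable">
<tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
</div>
"""

def table_to_dataframe(html):
    """Parse the first 'wikitable' in html into a DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    rows = table.find_all("tr")
    # The first row holds the column names, the remaining rows hold the data.
    header = [cell.get_text(strip=True) for cell in rows[0].find_all("th")]
    records = [
        [cell.get_text(strip=True) for cell in row.find_all("td")]
        for row in rows[1:]
    ]
    return pd.DataFrame(records, columns=header)

sample_df = table_to_dataframe(SAMPLE_HTML)
print(sample_df)
```

If the real page lays out its table differently, the row and cell handling would need to change accordingly.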

### Cleaning the data

The table needs to be cleaned by removing the rows whose Borough is "Not assigned".

In [28]:
df = df.drop(df[df['Borough']=="Not assigned"].index)
df.reset_index(inplace=True,drop=True)
df.head()

|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


A check of the data shows no duplicate postal codes.   
So, the final cleaning step is to replace any "Not assigned" Neighbourhood values with the corresponding Borough name.
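
The duplicate check itself is not shown in the notebook. A minimal sketch of how it could be done with `Series.duplicated`, using a toy frame in which one code is repeated on purpose:

```python
import pandas as pd

# Toy frame for illustration only; "M3A" is repeated on purpose.
toy = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M3A"],
    "Borough": ["North York", "North York", "North York"],
})

# keep=False flags every occurrence of a repeated code, not only the later ones.
duplicate_mask = toy["Postal Code"].duplicated(keep=False)
has_duplicates = bool(duplicate_mask.any())
duplicated_codes = sorted(toy.loc[duplicate_mask, "Postal Code"].unique())
print(has_duplicates, duplicated_codes)
```

On the notebook's own `df`, `df["Postal Code"].duplicated().any()` would be expected to return `False` if the observation above holds.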

In [29]:
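# Select the rows whose neighbourhood is missing, then copy the borough
# name into the Neighbourhood column for those rows only
not_assigned = df["Neighbourhood"] == "Not assigned"
df.loc[not_assigned, "Neighbourhood"] = df.loc[not_assigned, "Borough"]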
        
df.head(12)

|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| 5 | M9A | Etobicoke | Islington Avenue, Humber Valley Village |
| 6 | M1B | Scarborough | Malvern, Rouge |
| 7 | M3B | North York | Don Mills |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson |


In [30]:
df.shape

(103, 3)

So we have 103 postal codes with complete information. Now, moving on to the analysis.

### Location data

The following code tries to look up the latitude and longitude of each postal code using the `geocoder` Python package and to add this information to the dataframe.

In [31]:
import geocoder

In [32]:
"""
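count_calls = 0
latitudes, longitudes = [], []

for code in df["Postal Code"]:
    lat_lng_coords = None
    # Keep querying until the service returns coordinates
    # (this loop never ends if the lookup keeps failing)
    while lat_lng_coords is None:
        g = geocoder.google('{}, Toronto, Ontario'.format(code))
        lat_lng_coords = g.latlng
        count_calls += 1

    latitudes.append(lat_lng_coords[0])
    longitudes.append(lat_lng_coords[1])

# Assign whole columns; df.loc["Latitude"] would add a row labelled "Latitude"
df["Latitude"] = latitudes
df["Longitude"] = longitudes

print("Total number of calls: ", count_calls, "\n")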
"""

df = df.sort_values(by="Postal Code").reset_index(drop=True)
df.head()
    

|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |


The geocoding code above did not succeed, even after 2499 calls; the package is reported to be very unreliable.
So I will instead download the coordinates from this [link](https://cocl.us/Geospatial_data).
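
A lookup loop that retries without a limit can run indefinitely when the service keeps returning nothing. A minimal sketch of a bounded retry follows. `geocode_with_retries` and `fake_lookup` are hypothetical names, and the lookup is a stand-in callable so the example runs without network access:

```python
import time

def geocode_with_retries(lookup, query, max_attempts=5, delay_seconds=0.0):
    """Call lookup(query) until it returns coordinates or attempts run out.

    `lookup` is any callable returning [lat, lng] or None (an assumption).
    Returns (coords_or_None, number_of_attempts_made).
    """
    for attempt in range(1, max_attempts + 1):
        coords = lookup(query)
        if coords is not None:
            return coords, attempt
        time.sleep(delay_seconds)  # pause before the next attempt
    return None, max_attempts

# Stand-in lookup that fails twice before succeeding, to exercise the loop.
responses = iter([None, None, [43.806686, -79.194353]])

def fake_lookup(query):
    return next(responses)

coords, attempts = geocode_with_retries(fake_lookup, "M1B, Toronto, Ontario")
print(coords, attempts)
```

With the package used in this notebook, `lookup` could be something like `lambda q: geocoder.google(q).latlng`. A postal code whose lookup still returns `None` after `max_attempts` can then be reported and skipped rather than blocking the whole run.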

In [33]:
df_latlng = pd.read_csv("https://cocl.us/Geospatial_data")
df_latlng.head()

|   | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [34]:
df_latlng.shape

(103, 3)

In [35]:
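# Columns are copied by row index, so this relies on both dataframes
# listing the postal codes in the same order (df was sorted above)
df["Latitude"] = df_latlng["Latitude"]
df["Longitude"] = df_latlng["Longitude"]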
df.head(12)

|   | Postal Code | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
| 5 | M1J | Scarborough | Scarborough Village | 43.744734 | -79.239476 |
| 6 | M1K | Scarborough | Kennedy Park, Ionview, East Birchmount Park | 43.727929 | -79.262029 |
| 7 | M1L | Scarborough | Golden Mile, Clairlea, Oakridge | 43.711112 | -79.284577 |
| 8 | M1M | Scarborough | Cliffside, Cliffcrest, Scarborough Village West | 43.716316 | -79.239476 |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West | 43.692657 | -79.264848 |
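
Copying the coordinate columns by row index gives correct results only when both dataframes list the postal codes in the same order. A minimal sketch of a key-based alternative using `DataFrame.merge`, shown on toy frames whose rows are deliberately in different orders (the frame names here are illustrative, and the coordinates are taken from the output above):

```python
import pandas as pd

# Toy frames; the postal codes appear in different orders on purpose.
neighbourhoods = pd.DataFrame({
    "Postal Code": ["M1C", "M1B"],
    "Borough": ["Scarborough", "Scarborough"],
})
coordinates = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# Join on the postal code itself; validate raises if a code is repeated
# on either side, and how="left" keeps every neighbourhood row.
merged = neighbourhoods.merge(
    coordinates, on="Postal Code", how="left", validate="one_to_one"
)

# Count rows that found no matching coordinates.
missing = int(merged["Latitude"].isna().sum())
print(merged)
print("Rows without coordinates:", missing)
```

In the notebook the equivalent call would be `df.merge(df_latlng, on="Postal Code", how="left")`, which does not depend on the earlier sort.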
