<h1>Segmenting and Clustering Neighborhoods in Toronto</h1>

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Web Scraping</a>

2. <a href="#item2">Geocoder</a>

3. <a href="#item3">Analyze Each Cluster Neighborhood</a>   
</font>
</div>

<h3>1. Web Scraping</h3>

In this first cell we download the HTML of the Wikipedia page with requests and parse it with Beautiful Soup.

In [1]:
import pandas as pd
import requests
#!conda install -c conda-forge beautifulsoup4 --yes
from bs4 import BeautifulSoup

link = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

res = requests.get(link)

page = BeautifulSoup(res.text, "html.parser")

Next we define the column labels, find the rows of the table, and wrap them in an iterator. We skip the first row to exclude the header, then loop over the remaining rows and collect the postal code, borough, and neighborhood of each one into the data frame.

In [35]:
columns = ["PostalCode", "Borough", "Neighborhood"]

table = page.find("table", {"class": "wikitable"})
rows = iter(table.find_all("tr"))  # iterator over the table rows
next(rows)  # skip the header row

records = []
for row in rows:
    data = row.find_all("td")

    # one record per table row, with the trailing newlines removed
    records.append({
        "PostalCode": data[0].get_text().replace("\n", ""),
        "Borough": data[1].get_text().replace("\n", ""),
        "Neighborhood": data[2].get_text().replace("\n", ""),
    })

# DataFrame.append was removed in pandas 2.0, so build the frame in one step
df_t = pd.DataFrame(records, columns=columns)

df_t.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights


We ignore the cells with a borough that is 'Not assigned'.
If a cell has a borough but a 'Not assigned' neighborhood, then the neighborhood will be the same as the borough. 

In [37]:
df_t = df_t[df_t.Borough != "Not assigned"].copy()

# a single .loc call avoids chained assignment on a filtered frame
mask = df_t.Neighborhood == "Not assigned"
df_t.loc[mask, "Neighborhood"] = df_t.loc[mask, "Borough"]

df_t.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
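The two cleaning rules above can be checked on a tiny frame. This is a minimal sketch: the postal codes `X1A` to `X3A` are made-up placeholders rather than scraped values, and the variable names `toy` and `missing` are introduced here only for illustration.

```python
import pandas as pd

# Toy rows for illustration only (placeholder postal codes, not scraped data)
toy = pd.DataFrame({
    "PostalCode": ["X1A", "X2A", "X3A"],
    "Borough": ["Not assigned", "Queen's Park", "North York"],
    "Neighborhood": ["Not assigned", "Not assigned", "Parkwoods"],
})

# Rule 1: drop rows whose borough is 'Not assigned'
toy = toy[toy["Borough"] != "Not assigned"].copy()

# Rule 2: when the neighborhood is 'Not assigned', reuse the borough name
missing = toy["Neighborhood"] == "Not assigned"
toy.loc[missing, "Neighborhood"] = toy.loc[missing, "Borough"]

print(toy)
```

Taking a `.copy()` after filtering and assigning through one `.loc` call keeps the update on the frame itself instead of on a temporary slice.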


When several neighborhoods share the same postal code, we combine them into one row, with the neighborhood names separated by commas.

In [38]:
dataFrame = df_t.groupby(["PostalCode","Borough"],as_index=False).agg(lambda x: ','.join(x))

dataFrame.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
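To see the grouping step in isolation, here is a small sketch on three rows taken from the output above. The name `grouped` and the explicit column selection are choices made for this example, not part of the original cell.

```python
import pandas as pd

# Three rows before grouping: M1B appears twice, M1G once
toy = pd.DataFrame({
    "PostalCode": ["M1B", "M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighborhood": ["Rouge", "Malvern", "Woburn"],
})

# Group on the key columns and join the neighborhood names of each group
grouped = (
    toy.groupby(["PostalCode", "Borough"], as_index=False)["Neighborhood"]
       .agg(",".join)
)

print(grouped)
```

Selecting the `Neighborhood` column before aggregating makes it explicit which column is being joined.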


We print the shape of the dataframe, that is, its number of rows and columns.

In [6]:
dataFrame.shape

(103, 3)

<h3>2. Geocoder</h3>

In the second part we obtain the latitude and longitude of each postal code from a CSV file. I tried to use geocoder, but the calls hung for a long time and never returned the necessary information.

In [34]:
'''
!conda install -c conda-forge geocoder --yes
import geocoder

for index,row in dataFrame.iterrows():
    # initialize your variable to None
    lat_lng_coords = None

    # loop until you get the coordinates
    while(lat_lng_coords is None):
      g = geocoder.google('{}, Toronto, Ontario'.format(row["PostalCode"]))
      lat_lng_coords = g.latlng

    dataFrame.loc[index, "Latitude"] = lat_lng_coords[0]
    dataFrame.loc[index, "Longitude"] = lat_lng_coords[1]
'''

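The `while` loop in the disabled cell never stops if the service keeps returning `None`, which may be one reason the call appeared to freeze. A bounded retry is one way around that. The sketch below is hypothetical: `lookup_with_retry`, `flaky_lookup`, and the parameters `max_attempts` and `delay_seconds` are names invented here, and the stand-in lookup simulates a service rather than calling geocoder.

```python
import time


def lookup_with_retry(lookup, query, max_attempts=3, delay_seconds=0.0):
    """Call lookup(query) until it returns a non-None value or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        result = lookup(query)
        if result is not None:
            return result
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # pause before the next attempt
    return None  # the caller decides how to handle a failed lookup


# Hypothetical stand-in for a geocoding call: fails twice, then succeeds
calls = {"count": 0}


def flaky_lookup(query):
    calls["count"] += 1
    if calls["count"] < 3:
        return None
    return (43.806686, -79.194353)


coords = lookup_with_retry(flaky_lookup, "M1B, Toronto, Ontario")
always_fails = lookup_with_retry(lambda q: None, "M1B, Toronto, Ontario")

print(coords, always_fails)
```

With a limit on the attempts, a postal code that cannot be resolved yields `None` instead of blocking the whole loop, and those rows can be filled in from another source such as the CSV file.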

In [14]:
!wget -q -O 'Geospatial_Coordinates.csv' https://cocl.us/Geospatial_data
print('Data downloaded!')

Data downloaded!


After downloading the CSV, we read it into a dataframe and rename the column Postal Code to PostalCode so that it can be merged with the first dataframe.

In [39]:
# read_csv already returns a DataFrame, so no extra conversion is needed
ll_df = pd.read_csv("Geospatial_Coordinates.csv")

# rename the key column so that it matches the first dataframe
ll_df = ll_df.rename(columns = {"Postal Code" : "PostalCode"})

ll_df.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


We merge the two dataframes for analysis and clustering in the next part.

In [40]:
df = pd.merge(dataFrame, ll_df, on="PostalCode")

df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
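By default `pd.merge` performs an inner join, so a postal code without coordinates would silently disappear from the result. The sketch below shows one way to check for that with the `how`, `validate`, and `indicator` parameters of `pd.merge`. The frames `left` and `right` are toy data, and the coordinates for M1E are left out on purpose to produce an unmatched row.

```python
import pandas as pd

left = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M1E"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
})

# M1E is deliberately missing here to illustrate an unmatched key
right = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# A left join keeps every postal code; validate raises if a key is duplicated
checked = pd.merge(
    left, right, on="PostalCode",
    how="left", validate="one_to_one", indicator=True,
)

# Rows marked 'left_only' have no coordinates
unmatched = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()

print(unmatched)
```

An empty `unmatched` list on the real data would confirm that every postal code received a latitude and longitude.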
