# Segmenting and Clustering Neighborhoods in Toronto

### 1: Scrape the Wiki page using BeautifulSoup

We start by importing the requests library to send organic, grass-fed HTTP/1.1 requests.
<br>
BeautifulSoup parses the HTML source of a given web page into a BeautifulSoup (soup) object that we can search.
<br>
<br>
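If we inspect the Wiki page in the browser, we can see the table we need has the class 'wikitable sortable jquery-tablesorter'. The 'jquery-tablesorter' part is added by JavaScript after the page loads, so the HTML that requests downloads only carries 'wikitable sortable'. Let's retrieve that table from the soup object with the find() method.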
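
As a minimal sketch of that lookup, the snippet below runs find() on a small hand-written HTML string rather than on the live page. The sample markup and the names sample_html, sample_soup, by_find and by_select are assumptions made for illustration. select_one() with a CSS selector is shown as a possible alternative that matches on the individual classes instead of the full class string.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded Wiki page
sample_html = """
<table class="infobox"><tr><td>ignore me</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th></tr>
  <tr><td>M3A</td><td>North York</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# find() returns the first table whose class attribute reads 'wikitable sortable'
by_find = sample_soup.find('table', {'class': 'wikitable sortable'})

# A CSS selector matches a table carrying both classes, in any order
by_select = sample_soup.select_one('table.wikitable.sortable')

print(by_find == by_select)
print(by_find.find('td').text)
```

The selector form may be more forgiving if extra classes are ever present in the downloaded HTML.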

In [None]:
import requests
from bs4 import BeautifulSoup

# Download the page; .text holds the HTML source, not the URL
wiki_html = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(wiki_html, 'html.parser')

My_table = soup.find('table',{'class':'wikitable sortable'})
My_table

### 2: Turn HTML code into an array

As we can see, each row of the table is wrapped in a 'tr' tag, and each cell in a row is a 'th' (for the headers) or a 'td'.
<br>
Let's create an empty array (a Python list) called words. Then, run a for loop over every row (identified by 'tr').
<br>
For each row, we take the text of every 'th' or 'td' cell, use text.split() followed by ' '.join() to strip newlines and collapse repeated whitespace, and append the resulting list of cell values to our array words.
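
The whitespace clean-up is easier to see in isolation. A minimal sketch, using a made-up cell string (raw_cell and clean_cell are hypothetical names, and the text is not taken from the page):

```python
# Hypothetical cell text with doubled spaces and stray newlines
raw_cell = "Lawrence  Heights,\n Lawrence Manor\n"

# split() with no arguments breaks on any run of whitespace,
# and ' '.join() glues the pieces back together with single spaces
clean_cell = ' '.join(raw_cell.split())

print(repr(clean_cell))
```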

In [None]:
words = []

for items in My_table.find_all("tr"):
    data = [' '.join(item.text.split()) for item in items.find_all(['th','td'])]
    words.append(data)
    
words[0:5]

### 3: Turn the array of words into a DataFrame

We now have an array of rows from the original table. Let's turn it into a DataFrame.
<br>
First, we need to import pandas and DataFrame. We then use the from_records() method to convert the words array into a DataFrame, passing the first row as the header.
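
To see what from_records() does with the header row, here is a small sketch on hand-made data. The rows in sample_words are hypothetical and only mimic the shape of the words array.

```python
import pandas as pd

# Hypothetical rows in the same shape as the words array: header first
sample_words = [
    ['Postcode', 'Borough', 'Neighbourhood'],
    ['M1A', 'Not assigned', 'Not assigned'],
    ['M3A', 'North York', 'Parkwoods'],
]

# Rows after the first become records; the first row names the columns
sample_df = pd.DataFrame.from_records(sample_words[1:], columns=sample_words[0])

print(sample_df)
```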

In [None]:
import pandas as pd
from pandas import DataFrame

postal_df = DataFrame.from_records(words[1:], columns=words[0])
postal_df.head()

### 4: Data preparation

Let's remove the rows where the Borough is 'Not assigned'.
<br>
We then group the DataFrame by Postcode and Borough, joining the aggregated Neighbourhoods into a single comma-separated string. The grouping keys become the index in this operation, so we need to reset it to get them back as columns.
<br>
<br>
Finally, we locate all rows where Neighbourhood is 'Not assigned', and replace it with the respective Borough.
<br>
The shape of our DataFrame is displayed below.
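
The three steps can be traced on a toy table before running them on the real data. This is only a sketch: the rows in toy_df are invented for illustration, and the names toy_df and missing are hypothetical.

```python
import pandas as pd

# Hypothetical rows: one unassigned borough, one postcode listed twice,
# and one neighbourhood that is 'Not assigned'
toy_df = pd.DataFrame({
    'Postcode': ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})

# 1. Drop rows without a borough
toy_df = toy_df[toy_df['Borough'] != 'Not assigned']

# 2. One row per (Postcode, Borough), neighbourhoods joined with commas;
#    reset_index() turns the grouping keys back into columns
toy_df = (
    toy_df.groupby(['Postcode', 'Borough'])['Neighbourhood']
    .apply(', '.join)
    .reset_index()
)

# 3. Fall back to the borough name where no neighbourhood is given
missing = toy_df['Neighbourhood'] == 'Not assigned'
toy_df.loc[missing, 'Neighbourhood'] = toy_df.loc[missing, 'Borough']

print(toy_df)
```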

In [None]:
postal_df = postal_df[postal_df.Borough != 'Not assigned']
postal_df = pd.DataFrame(postal_df.groupby(['Postcode', 'Borough'])['Neighbourhood'].apply(', '.join))
postal_df = postal_df.reset_index()

postal_df.loc[postal_df['Neighbourhood'] == 'Not assigned', 'Neighbourhood'] = postal_df['Borough']

postal_df.head()

In [None]:
postal_df.shape

### 5: Get coordinates of each neighborhood

Install and import the Geocoder Python package to retrieve the latitude and longitude coordinates of each neighborhood.
<br>
Since the package can be very unreliable and often returns no result, we will run a while loop for each postal code, repeating the query until coordinates come back.
<br>
Finally, we create two new columns for latitude and longitude in our DataFrame, applying that function to each of the postal codes.
<br>
<br>
Since I reached the query limit for Geocoder, I'm just going to import the CSV with the geographical coordinates as a new DataFrame and merge it with my DataFrame. The Geocoder code is kept in the cell for reference, disabled inside a string.
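
A while loop with no exit condition can spin forever if the service never answers. One possible variation is to cap the number of attempts. The sketch below is only an illustration under stated assumptions: fetch_with_retry, flaky_lookup and max_tries are hypothetical names, and flaky_lookup is a stand-in for a real geocoding call that returns made-up coordinates.

```python
def fetch_with_retry(lookup, code, max_tries=5):
    """Call lookup(code) until it returns coordinates or max_tries is reached."""
    for _ in range(max_tries):
        coords = lookup(code)
        if coords is not None:
            return coords[0], coords[1]
    # Give up instead of looping forever
    return None, None


# Hypothetical stand-in for a geocoding call: fails twice, then succeeds
calls = {'count': 0}

def flaky_lookup(code):
    calls['count'] += 1
    if calls['count'] < 3:
        return None
    return [43.65, -79.38]  # made-up coordinates for illustration


latitude, longitude = fetch_with_retry(flaky_lookup, 'M5A')
print(latitude, longitude, calls['count'])
```

Returning (None, None) after the last attempt leaves the gaps visible in the DataFrame, so they can be filled from another source such as a CSV.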

In [None]:
'''
! pip install geocoder
import geocoder # import geocoder

def get_coord(code):
    # initialize your variable to None
    lat_lng_coords = None

    # loop until you get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.google(str(code) + ', Toronto, Ontario')
        lat_lng_coords = g.latlng
    
    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]
    return latitude, longitude


# apply() yields one (latitude, longitude) tuple per row; zip(*) splits them into two columns
postal_df['Latitude'], postal_df['Longitude'] = zip(*postal_df['Postcode'].apply(get_coord))
postal_df
'''

url = 'http://cocl.us/Geospatial_data'
coordinates = pd.read_csv(url)

postal_df = postal_df.merge(coordinates, how='inner', left_on='Postcode', right_on='Postal Code')
postal_df.drop(columns='Postal Code', inplace=True) # Drop repeated column 'Postal Code'
postal_df.head()