# Segmenting and Clustering Neighborhoods in Toronto -- PART 2

### Preparations (PART 1)

In [1]:
from requests import get       # import the get function from the requests module
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
response = get(url)            # retrieve the HTML page from the given Wikipedia URL

from bs4 import BeautifulSoup                         # import from the "beautifulsoup4" library
soup = BeautifulSoup(response.text, "html.parser")    # parse the HTML so the table can be extracted and preprocessed
table = soup.find("tbody")    # find the first table body and save it as "table"

import pandas as pd
data = []                                        # create an empty list "data"

rows = table.find_all('tr')                      # find all rows in the "table"
for row in rows:                                 # read each row of the wikitable into "data"
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])

df = pd.DataFrame(data)                          # create the required dataframe using pandas


df.columns = ["PostalCode", "Borough", "Neighborhood"]   # rename the columns as required

for i in df.index:                         # find all rows whose "Borough" value is "Not assigned"
    if df.iloc[i,1]=="Not assigned":
        df.iloc[i,1]=None                  # replace "Not assigned" with None so the row can be dropped using dropna()

df.dropna(inplace=True)                    # remove all rows with None values
df.reset_index(drop=True,inplace=True)     # reset the index

for j in df.index:                      # find all rows whose "Neighborhood" value is "Not assigned"
    if df.iloc[j,2]=="Not assigned":
        df.iloc[j,2]=df.iloc[j,1]       # set the neighborhood to the row's "Borough" value
        

pd.options.mode.chained_assignment = None               # suppress the chained-assignment warning (default='warn')

for k in range(1,len(df)):                              # combine adjacent rows with the same "Borough" value
    if df.iloc[k-1,1]==df.iloc[k,1]:                    # if rows k-1 and k share the same borough
        df.iloc[k,2]=df.iloc[k-1,2]+","+df.iloc[k,2]    # prepend row k-1's neighborhoods to row k's, separated by ","
        df.iloc[k-1,2]=None                             # set row k-1's neighborhood to None so dropna() removes it later

df.dropna(inplace=True)                                 # drop all rows with None values
df.reset_index(drop=True,inplace=True)                  # reset the index

print(df.shape)
df.head()

(85, 3)


| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M4A | North York | Parkwoods,Victoria Village |
| 1 | M5A | Downtown Toronto | Harbourfront,Regent Park |
| 2 | M6A | North York | Lawrence Heights,Lawrence Manor |
| 3 | M7A | Queen's Park | Queen's Park |
| 4 | M9A | Etobicoke | Islington Avenue |
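
As a point of comparison, the cleaning steps can also be written without explicit loops. The following is only a minimal sketch on a few made-up rows (the `raw` frame is a hypothetical stand-in for the scraped table). It groups rows by `PostalCode` and `Borough`, which is a different rule from the loop above (that loop compares the boroughs of adjacent rows), so the two approaches are not guaranteed to give the same result on the real data.

```python
import pandas as pd

# Toy rows standing in for the scraped table (hypothetical sample, not the real data).
raw = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],
        ["M3A", "North York", "Parkwoods"],
        ["M5A", "Downtown Toronto", "Harbourfront"],
        ["M5A", "Downtown Toronto", "Regent Park"],
        ["M7A", "Queen's Park", "Not assigned"],
    ],
    columns=["PostalCode", "Borough", "Neighborhood"],
)

# Drop rows whose borough is "Not assigned".
clean = raw[raw["Borough"] != "Not assigned"].copy()

# Where only the neighborhood is unassigned, fall back to the borough name.
unassigned = clean["Neighborhood"] == "Not assigned"
clean.loc[unassigned, "Neighborhood"] = clean.loc[unassigned, "Borough"]

# Join the neighborhoods that share a postal code and borough into one string.
combined = (
    clean.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
    .agg(",".join)
    .reset_index()
)
print(combined)
```

Passing `sort=False` keeps the groups in order of first appearance, so the row order of the input is preserved.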



### Building the Dataframe with Latitude and Longitude Data

Because the `geocoder` package does not work reliably here, we use the provided CSV file of coordinates instead.

In [2]:
latlng=pd.read_csv("https://cocl.us/Geospatial_data")    # read the online CSV file into pandas dataframe
latlng.head()                                            # check the dataframe

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


With the `latlng` dataframe loaded, we can build the new dataframe by adding the latitude and longitude data. We use pandas' built-in `.merge()` method to perform an **inner join** of the two dataframes on the postal code.

In [4]:
df_latlng=df.merge(latlng,left_on="PostalCode",right_on="Postal Code")    # inner join with "df" on the left and "latlng" on the right, keyed on the postal code column of each
print(df_latlng.shape)
df_latlng.head()                                                          # check the merged dataframe "df_latlng"

(85, 6)


| | PostalCode | Borough | Neighborhood | Postal Code | Latitude | Longitude |
|---|---|---|---|---|---|---|
| 0 | M4A | North York | Parkwoods,Victoria Village | M4A | 43.725882 | -79.315572 |
| 1 | M5A | Downtown Toronto | Harbourfront,Regent Park | M5A | 43.65426 | -79.360636 |
| 2 | M6A | North York | Lawrence Heights,Lawrence Manor | M6A | 43.718518 | -79.464763 |
| 3 | M7A | Queen's Park | Queen's Park | M7A | 43.662301 | -79.389494 |
| 4 | M9A | Etobicoke | Islington Avenue | M9A | 43.667856 | -79.532242 |
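
An inner join silently drops any postal code that has no match in the other dataframe. One way to check for this, sketched here on small hypothetical frames (`left`, `right`, and the `M0X` row are made up for illustration), is to run a left join with `indicator=True` and look for rows marked `left_only`.

```python
import pandas as pd

# Hypothetical toy frames; the column names mirror "df" and "latlng".
left = pd.DataFrame(
    {
        "PostalCode": ["M4A", "M5A", "M0X"],  # "M0X" is a made-up code with no coordinates
        "Borough": ["North York", "Downtown Toronto", "Example"],
    }
)
right = pd.DataFrame(
    {
        "Postal Code": ["M4A", "M5A"],
        "Latitude": [43.725882, 43.654260],
        "Longitude": [-79.315572, -79.360636],
    }
)

# A left join keeps every row of "left"; the "_merge" column records where each row came from.
checked = left.merge(
    right, left_on="PostalCode", right_on="Postal Code", how="left", indicator=True
)

# Postal codes that an inner join would have dropped.
unmatched = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)  # ['M0X']
```

If `unmatched` is empty, the inner join keeps every row of the left dataframe.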


The merge keeps both key columns (`PostalCode` and `Postal Code`), so we select only the columns we need, in the required order, to get the final dataframe.

In [5]:
df_latlng=df_latlng[["Postal Code","Borough","Neighborhood","Latitude","Longitude"]]    # keep the needed columns in the required order

In [6]:
df_latlng    # the final dataframe with latitude and longitude

| | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M4A | North York | Parkwoods,Victoria Village | 43.725882 | -79.315572 |
| 1 | M5A | Downtown Toronto | Harbourfront,Regent Park | 43.654260 | -79.360636 |
| 2 | M6A | North York | Lawrence Heights,Lawrence Manor | 43.718518 | -79.464763 |
| 3 | M7A | Queen's Park | Queen's Park | 43.662301 | -79.389494 |
| 4 | M9A | Etobicoke | Islington Avenue | 43.667856 | -79.532242 |
| 5 | M1B | Scarborough | Rouge,Malvern | 43.806686 | -79.194353 |
| 6 | M3B | North York | Don Mills North | 43.745906 | -79.352188 |
| 7 | M4B | East York | Woodbine Gardens,Parkview Hill | 43.706397 | -79.309937 |
| 8 | M5B | Downtown Toronto | Ryerson,Garden District | 43.657162 | -79.378937 |
| 9 | M6B | North York | Glencairn | 43.709577 | -79.445073 |
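
The duplicate key column could also be avoided altogether by giving both dataframes the same key name before merging. The sketch below uses one-row hypothetical frames (`left` and `right`) and keeps the name `PostalCode`, whereas the final dataframe above uses `Postal Code`, so treat it as an alternative rather than the method used in this notebook.

```python
import pandas as pd

# Hypothetical one-row frames with the same column names as "df" and "latlng".
left = pd.DataFrame(
    {
        "PostalCode": ["M4A"],
        "Borough": ["North York"],
        "Neighborhood": ["Parkwoods,Victoria Village"],
    }
)
right = pd.DataFrame(
    {"Postal Code": ["M4A"], "Latitude": [43.725882], "Longitude": [-79.315572]}
)

# Rename the key on the right so both frames share it, then merge on that single column.
merged = left.merge(right.rename(columns={"Postal Code": "PostalCode"}), on="PostalCode")
print(list(merged.columns))
# ['PostalCode', 'Borough', 'Neighborhood', 'Latitude', 'Longitude']
```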
