# Segmenting and Clustering Neighborhoods in Toronto -- PART 1

### Importing HTML File from Wikipedia

First, we download the page's HTML and load it into Python (in the Jupyter Notebook). Given the page URL, we use the `get` function from `requests` to retrieve the HTML from Wikipedia.

In [1]:
from requests import get       # import the get function from the requests module
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
response = get(url)            # retrieve the HTML from the given Wikipedia URL
print(response.text[:500])     # print the first 500 characters to check that we fetched the right page

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>List of postal codes of Canada: M - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_postal_codes_of_Canada:_M","w
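
Printing the first 500 characters confirms the page by eye. A minimal sketch of a programmatic check is shown here, assuming we want a failed request to raise an error instead of passing an error page on to the parser. The helper names `html_or_raise` and `fetch_html` and the 10-second timeout are illustrative choices, not part of the original notebook.

```python
import requests


def html_or_raise(response):
    """Return the page text, raising requests.HTTPError for 4xx/5xx responses."""
    response.raise_for_status()
    return response.text


def fetch_html(url, timeout=10.0):
    """Download a page; `timeout` (in seconds) keeps the call from hanging indefinitely."""
    return html_or_raise(requests.get(url, timeout=timeout))
```

With this in place, `fetch_html(url)` could stand in for `get(url).text` in the cell above.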


In [2]:
###########################################################################################################################################

### Retrieving `pandas` DataFrame Using `beautifulsoup4`

After importing the HTML file, we need to scrape the data from it using the **beautifulsoup4** library.

In [3]:
from bs4 import BeautifulSoup                       # import BeautifulSoup from the "beautifulsoup4" library
soup = BeautifulSoup(response.text, "html.parser")  # parse the HTML into a soup object, a format convenient for extracting and preprocessing the data

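Inspecting the HTML shows that the geographical data sits in a table with `class` **wikitable sortable**, and all table entries are inside its `tbody`. We use the `.find()` method to select the data. Because `.find()` returns only the first match, this relies on that table's `tbody` being the first one on the page.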

In [4]:
table = soup.find("tbody")    # find the first <tbody> element and save it as "table"
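
A small sketch of why selecting by class can be more robust than taking the first `tbody`. The HTML string below is made up for illustration (a decoy table followed by a data table); it is not the real Wikipedia markup.

```python
from bs4 import BeautifulSoup

# Toy HTML (made up for illustration): a decoy table followed by the data table.
html = """
<table class="infobox"><tbody><tr><td>decoy</td></tr></tbody></table>
<table class="wikitable sortable"><tbody>
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</tbody></table>
"""
toy_soup = BeautifulSoup(html, "html.parser")

first_tbody = toy_soup.find("tbody")                    # first <tbody> anywhere in the page
wikitable = toy_soup.find("table", class_="wikitable")  # first table carrying the "wikitable" class

first_cell = first_tbody.find("td").text
wiki_cells = [td.text.strip() for td in wikitable.find_all("td")]
print(first_cell)   # comes from the decoy table
print(wiki_cells)   # comes from the data table
```

On a page where another table precedes the data table, the class-based lookup still lands on the intended table.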

Looking more closely at the HTML, we can see that each row of the table is a `tr` element and each entry in a row is a `td` element.

In [5]:
import pandas as pd
data = []                                        # create an empty list "data"

rows = table.find_all('tr')                      # find all rows in the "table"
for row in rows:                                 # loop over the rows and read each entry of the wikitable into "data"
    cols = row.find_all('td')                    # all cells in the current row
    cols = [ele.text.strip() for ele in cols]    # strip surrounding whitespace from each cell's text
    data.append([ele for ele in cols if ele])    # keep the non-empty entries

df = pd.DataFrame(data)                          # create the required dataframe using pandas
df.shape                                         # check the size of the dataset, which matches the table on Wikipedia

(290, 3)
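
The loop above appends an empty list for any row without `td` cells, which pandas turns into a row of missing values (presumably the header row, which typically uses `th` cells). A minimal sketch of skipping such rows while building the records, using a made-up three-row table instead of the real page:

```python
from bs4 import BeautifulSoup
import pandas as pd

# Toy table (made up for illustration): one header row and two data rows.
html = """
<table class="wikitable sortable">
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""
toy_table = BeautifulSoup(html, "html.parser").find("table", class_="wikitable")

records = []
for row in toy_table.find_all("tr"):
    cells = [td.text.strip() for td in row.find_all("td")]
    if cells:                      # header rows contain only <th>, so they yield no cells
        records.append(cells)

toy_df = pd.DataFrame(records, columns=["PostalCode", "Borough", "Neighborhood"])
print(toy_df.shape)
```

Naming the columns at construction time also removes the need for a separate renaming step.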

In [6]:
###########################################################################################################################################

### Preprocessing the DataFrame in Pandas

Now we need to preprocess the retrieved dataframe. The data wrangling takes several steps.

1. ***"The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood."***

In [7]:
df.columns=["PostalCode","Borough","Neighborhood"]   # Rename the columns as required
df.head()                                            # check the column name change

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | | | |
| 1 | M1A | Not assigned | Not assigned |
| 2 | M2A | Not assigned | Not assigned |
| 3 | M3A | North York | Parkwoods |
| 4 | M4A | North York | Victoria Village |


2. ***"Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned."***
   Remove all rows whose "Borough" value is "Not assigned".

In [8]:
for i in df.index:                         # loop over the rows to find those whose "Borough" value is "Not assigned"
    if df.iloc[i,1]=="Not assigned":
        df.iloc[i,1]=None                  # replace "Not assigned" with None so the row can be dropped using dropna()

df.dropna(inplace=True)                    # remove all rows with None values
df.reset_index(drop=True,inplace=True)     # reset the index
df.shape                                   # check the size of the dataset to see the change

(212, 3)
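
The same filtering can be sketched without an explicit loop by using a boolean mask. The toy dataframe below is made up for illustration and includes one all-`None` row to mimic the empty first row seen earlier.

```python
import pandas as pd

# Toy data (made up for illustration), including an all-None row.
toy_df = pd.DataFrame({
    "PostalCode": [None, "M1A", "M3A", "M4A"],
    "Borough": [None, "Not assigned", "North York", "North York"],
    "Neighborhood": [None, "Not assigned", "Parkwoods", "Victoria Village"],
})

no_missing = toy_df.dropna()                                  # drop rows containing missing values
assigned = no_missing[no_missing["Borough"] != "Not assigned"]  # keep rows with an assigned borough
assigned = assigned.reset_index(drop=True)                    # renumber the remaining rows from 0
print(assigned)
```

The mask `no_missing["Borough"] != "Not assigned"` is evaluated for the whole column at once, so no placeholder `None` values are needed.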

3. ***If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.***

In [9]:
for j in df.index:                      # loop over the rows to find those whose "Neighborhood" value is "Not assigned"
    if df.iloc[j,2]=="Not assigned":
        df.iloc[j,2]=df.iloc[j,1]       # set the neighborhood equal to the row's "Borough" value

df.loc[df["Borough"]=="Queen's Park"]   # check the row of "Queen's Park"

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 6 | M7A | Queen's Park | Queen's Park |
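
As a loop-free sketch of the same rule, a boolean mask with `.loc` can copy the borough into the neighborhood wherever the neighborhood is "Not assigned". The two-row dataframe is made up for illustration.

```python
import pandas as pd

# Toy data (made up for illustration).
toy_df = pd.DataFrame({
    "PostalCode": ["M7A", "M3A"],
    "Borough": ["Queen's Park", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods"],
})

missing = toy_df["Neighborhood"] == "Not assigned"                     # rows lacking a neighborhood
toy_df.loc[missing, "Neighborhood"] = toy_df.loc[missing, "Borough"]   # fall back to the borough name
print(toy_df)
```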


4. ***More than one neighborhood can exist in one postal code area. Combine these rows into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.***

In [10]:
pd.options.mode.chained_assignment = None               # suppress the chained-assignment warning (default='warn')

for k in range(1, len(df)):                             # loop over adjacent rows and combine those with the same "Borough" value
    if df.iloc[k-1,1]==df.iloc[k,1]:                    # if the (k-1)th and kth rows share the same borough
        df.iloc[k,2]=df.iloc[k-1,2]+","+df.iloc[k,2]    # prepend the (k-1)th neighborhood to the kth neighborhood, separated by ","
        df.iloc[k-1,2]=None                             # set the (k-1)th neighborhood to None so dropna() removes the row later

df.dropna(inplace=True)                                 # drop all the rows with None values
df.reset_index(drop=True,inplace=True)                  # reset the index
df.shape                                                # check the size of the dataset to see the change

(85, 3)
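
The cell above merges adjacent rows that share a "Borough" value. The requirement is worded in terms of postal codes, so an alternative sketch groups on "PostalCode" instead and joins the neighborhoods with a comma. This is an assumption about the intended behavior, shown on a made-up four-row dataframe, and it would not necessarily reproduce the shape printed above.

```python
import pandas as pd

# Toy data (made up for illustration): M5A appears twice.
toy_df = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M6A", "M7A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York", "Queen's Park"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Lawrence Heights", "Queen's Park"],
})

merged = (
    toy_df.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
    .agg(",".join)        # join all neighborhoods of one postal code with ","
    .reset_index()        # turn the group keys back into ordinary columns
)
print(merged)
```

Grouping does not depend on row order, so rows with the same postal code are combined even when they are not adjacent.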

In [None]:
###########################################################################################################################################

### Results - Preprocessed Pandas Dataframe

In [11]:
df    # check the final dataframe

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M4A | North York | Parkwoods,Victoria Village |
| 1 | M5A | Downtown Toronto | Harbourfront,Regent Park |
| 2 | M6A | North York | Lawrence Heights,Lawrence Manor |
| 3 | M7A | Queen's Park | Queen's Park |
| 4 | M9A | Etobicoke | Islington Avenue |
| 5 | M1B | Scarborough | Rouge,Malvern |
| 6 | M3B | North York | Don Mills North |
| 7 | M4B | East York | Woodbine Gardens,Parkview Hill |
| 8 | M5B | Downtown Toronto | Ryerson,Garden District |
| 9 | M6B | North York | Glencairn |
