# IBM Data Science Professional Certification 
## Course 9: Applied Data Science Capstone
### Peer-graded assignment (Week 3):  Segmenting and Clustering Neighborhoods in Toronto - Part 1

The Python program in this notebook scrapes the Canadian postal codes listed on the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, together with the corresponding boroughs and neighborhoods. The data are used to build a table in which each row contains a postal code (first column), the corresponding borough (second column), and the list of neighborhoods (third column). This notebook was produced with the Jupyter Notebook environment provided by the Anaconda Python distribution.

**Importing the necessary libraries.**

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests

from bs4 import BeautifulSoup # library for parsing HTML code

**Reading the HTML code of the Wikipedia page.**

In [2]:
wikipedia_link="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M" 
raw_wikipedia_page=requests.get(wikipedia_link)
page=raw_wikipedia_page.text
#print(page)

**Using the BeautifulSoup library to identify the portion of the html code corresponding to the table with the desired data.** 

In [3]:
soup=BeautifulSoup(page,'html.parser')  #Convert the html code from string format to a BeautifulSoup object.

html_table=soup.tbody #Returns the first tbody tag of the page (a Tag object of the Beautiful Soup library). On this page, 
                      #it holds the table that contains the data.
#html_table
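
`soup.tbody` returns the first `<tbody>` in the page, so it only works while the data table comes first. A minimal sketch of a more explicit selection follows, run on a small inline HTML string rather than on the live page. The class name `wikitable`, the sample markup, and the variable names are assumptions made for illustration and are not taken from the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page.
sample_html = """
<table class="navbox"><tbody><tr><td>menu</td></tr></tbody></table>
<table class="wikitable">
  <tbody>
    <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
    <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  </tbody>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Select the table by its class attribute (assumed here to be "wikitable")
# instead of relying on it being the first table in the document.
sample_table = sample_soup.find("table", class_="wikitable")

# Collect the stripped text of every data cell in that table.
sample_cells = [td.get_text(strip=True) for td in sample_table.find_all("td")]
print(sample_cells)
```

Selecting by attribute is somewhat more robust to layout changes, although it still depends on the page keeping that class name.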

**Identifies and extracts the data from the HTML table and stores them in a pandas data frame object.**

In [4]:
html_table=html_table.find_all("td") #Returns a list with all the html code of each cell of the html table. The html code of the
                                     #cells are stored in td objects. td object is a tag object of the Beautiful Soup library.

code=[]  #List that will store the postal codes.
borough=[]  #List that will store the boroughs.
neighborhood=[]  #List that will store the neighborhoods.

#The rows of the html table are stored in the list "html_table" as non-overlapping sequences of three elements, corresponding 
#to the code, borough and neighborhood. The following loop extracts the contents of each row and stores them in the three 
#lists above, which are then combined into a pandas data frame.

for i in range(0,len(html_table)-2,3):  #The stop value len(html_table)-2 ensures that the last complete row is also read.
        aux=html_table[i].contents[0]  #The attribute "contents" returns the list of children of the td tag; its first 
                                       #element is the text within the cell of the table. Here, the cell contains the 
                                       #postal code.
        
        code.append(aux.replace("\n","")) #Some strings contain the new line character, which must be removed.
        
        try:
            aux=html_table[i+1].a.contents[0]  #If the content of the cell, in this case the borough, is a hyperlink, it is enclosed 
                                               #by an "a" tag, and this command returns the content enclosed by that tag.
            borough.append(aux.replace("\n",""))
        except (AttributeError,IndexError):  #Raised when the cell has no "a" tag or the tag is empty.
            aux=html_table[i+1].contents[0] #Returns the content of the html cell in the case that it is not a hyperlink.
            borough.append(aux.replace("\n",""))
            
        try:
            aux=html_table[i+2].a.contents[0]  #Same procedure for the neighborhood.
            neighborhood.append(aux.replace("\n",""))
        except (AttributeError,IndexError):
            aux=html_table[i+2].contents[0]
            neighborhood.append(aux.replace("\n",""))

df_neigh=pd.DataFrame({"PostalCode":code,"Borough":borough,"Neighborhood":neighborhood})
df_neigh.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
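
The grouping of consecutive cells into rows can be checked in isolation. Below is a minimal sketch on a plain list of strings; the sample values mimic the output above, and `cells` and `rows` are illustrative names that do not appear in the notebook.

```python
# Flat list of cell texts, three per table row
# (postal code, borough, neighborhood).
cells = ["M1A\n", "Not assigned", "Not assigned\n",
         "M3A\n", "North York", "Parkwoods\n",
         "M4A\n", "North York", "Victoria Village\n"]

# Step through the list three items at a time. Stopping at len(cells) - 2
# guarantees that every start index is followed by two more items.
rows = [tuple(c.strip() for c in cells[i:i + 3])
        for i in range(0, len(cells) - 2, 3)]

for row in rows:
    print(row)
```

Note that a stop value of `len(cells) - 3` would skip the last row whenever the list length is a multiple of three.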


**Removes the records without a borough and assigns to each "Not assigned" neighborhood the name of its corresponding borough.**

In [5]:
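#Keeps only the records with an assigned borough. The copy avoids modifying a view of the original data frame.
df_neigh=df_neigh[df_neigh["Borough"]!="Not assigned"].copy()

#Boolean mask of the records whose neighborhood is not assigned.
aux=df_neigh["Neighborhood"]=="Not assigned"
#Uses .loc instead of chained indexing, so that the assignment is applied to df_neigh itself.
df_neigh.loc[aux,"Neighborhood"]=df_neigh.loc[aux,"Borough"]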

df_neigh.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |
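
The same cleaning step can also be written with `Series.mask`, which replaces the values where a condition holds. The sketch below runs on a small made-up data frame with the same three columns; `toy`, `toy_clean` and `missing` are illustrative names only.

```python
import pandas as pd

# Toy data (made up for illustration) with the same three columns.
toy = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Keep only the rows that have a borough; copy() makes the result
# independent of the original data frame.
toy_clean = toy[toy["Borough"] != "Not assigned"].copy()

# Where the neighborhood is not assigned, take the borough name instead.
missing = toy_clean["Neighborhood"] == "Not assigned"
toy_clean["Neighborhood"] = toy_clean["Neighborhood"].mask(
    missing, toy_clean["Borough"]
)

print(toy_clean)
```

Whether `mask` or a `.loc` assignment reads better is largely a matter of taste; both avoid chained indexing.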


**Builds a pandas data frame in which each line contains a postal code (first column), the corresponding borough (second column) and the corresponding neighborhoods (third column).**

In [6]:
grouped=df_neigh.groupby("PostalCode")
borough=grouped["Borough"].first()  #Returns the first borough found for each postal code.
neighborhood=grouped["Neighborhood"].apply(", ".join)  #Returns, for each postal code, a string containing all its 
                                                       #neighborhoods separated by commas.
code=borough.index

#Note that len(borough)==len(neighborhood)==103 and the indexes in borough and neighborhood are the same (the postal code) and 
#are sorted in the same way. Hence, the data frame df_neigh below with columns "PostalCode", "Borough" and "Neighborhood" is correct.
df_neigh=pd.DataFrame({"PostalCode":code,"Borough":borough,"Neighborhood":neighborhood})
df_neigh.reset_index(drop=True,inplace=True)
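
To see how several rows with the same postal code collapse into one, the grouping can be tried on a few made-up rows. This is a sketch of one possible formulation, using a single `agg` call with `as_index=False`; the names `toy` and `grouped_toy` are illustrative.

```python
import pandas as pd

# Toy data (made up for illustration): M5A appears twice.
toy = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# One row per postal code: keep the first borough and join the
# neighborhoods into a comma-separated string.
grouped_toy = toy.groupby("PostalCode", as_index=False).agg(
    {"Borough": "first", "Neighborhood": ", ".join}
)

print(grouped_toy)
```

With `as_index=False` the postal code stays an ordinary column, so no separate index reset is needed.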

**Final data frame**

In [7]:
df_neigh.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |


In [8]:
df_neigh.tail()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 98 | M9N | York | Weston |
| 99 | M9P | Etobicoke | Westmount |
| 100 | M9R | Etobicoke | Kingsview Village, Martin Grove Gardens, Richv... |
| 101 | M9V | Etobicoke | Albion Gardens, Beaumond Heights, Humbergate, ... |
| 102 | M9W | Etobicoke | Northwest |


**Saving the final data frame.**

In [9]:
df_neigh.to_csv("Toronto_neighborhoods.csv",index=False)
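
The effect of `index=False` can be checked without touching the file system. The sketch below writes a small made-up data frame to CSV text and reads it back through an in-memory buffer; `toy`, `without_index` and `with_index` are illustrative names.

```python
import io

import pandas as pd

# Toy data (made up for illustration).
toy = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
})

# Write without the index, as in the cell above, then read the text back.
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

# For comparison: writing the index adds an unnamed first column on reload.
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

print(list(without_index.columns))
print(list(with_index.columns))
```

Dropping the index on export keeps the saved file limited to the data columns, which is convenient when the file is loaded again later.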

**Dimensions of the final data frame**

In [10]:
df_neigh.shape

(103, 3)