# Toronto Neighborhood Segmentation 
_________________________

### This notebook retrieves data about Toronto's neighborhoods from Wikipedia and uses it to cluster the city based on the venues in each part of the city.


In [1]:
# first install the needed libraries
!pip install bs4
!pip install requests

Collecting bs4
  Downloading bs4-0.0.1.tar.gz (1.1 kB)
Building wheels for collected packages: bs4
  Building wheel for bs4 (setup.py) ... done
  Created wheel for bs4: filename=bs4-0.0.1-py3-none-any.whl size=1272 sha256=3ff0869d41fbd361934e53e4933d7e67f1b98236d79463776e6adbd25478a08b
  Stored in directory: /tmp/wsuser/.cache/pip/wheels/0a/9e/ba/20e5bbc1afef3a491f0b3bb74d508f99403aabe76eda2167ca
Successfully built bs4
Installing collected packages: bs4
Successfully installed bs4-0.0.1


In [3]:
# next, import the libraries needed for the web scraping
from bs4 import BeautifulSoup  # parses an HTML document into a tree of navigable objects
import requests  # sends HTTP requests
import pandas as pd  # provides the dataframe structure

We use a Wikipedia page that lists the postal codes for the boroughs and neighborhoods in Toronto.


In [4]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

In [5]:
# use the requests library to retrieve the page's HTML as text
data = requests.get(url).text
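
The request above assumes the page is reachable and returns successfully. A slightly more defensive variant might look like the sketch below. The helper name `fetch_html` and the 10-second timeout are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the body of `url` as text, raising on HTTP errors.

    `fetch_html` and the default timeout are hypothetical choices
    made for this sketch.
    """
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text
```

With this helper, `data = fetch_html(url)` would stand in for the cell above and fail loudly instead of silently parsing an error page.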

Next, we parse the text into a tree-like structure by creating a BeautifulSoup object.

In [6]:
soup = BeautifulSoup(data,"html5lib")

The __find()__ method of the soup object returns the first matching element, so we can use it to get the first table in the HTML document. That is enough for our exploration, because the table we need is the first one on the page.


In [7]:
table = soup.find('table')
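
To illustrate how `find()` differs from `find_all()`, here is a small self-contained sketch on a made-up HTML snippet. The snippet, the `id` values and the `sample_*` names are assumptions for illustration only. It uses Python's built-in `html.parser` instead of `html5lib` so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# A made-up document with two tables (illustrative only)
sample_html = """
<html><body>
  <table id="first"><tr><td>M3A</td></tr></table>
  <table id="second"><tr><td>other</td></tr></table>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

first_table = sample_soup.find("table")     # first match only, or None if absent
all_tables = sample_soup.find_all("table")  # every match, in document order

print(first_table["id"])  # id attribute of the first table
print(len(all_tables))    # number of tables found
```

If the table of interest were not the first one on the page, indexing into the result of `find_all()` would be one way to select it.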

In [8]:
# let's make an empty list for holding the dictionaries that we will retrieve from our Wikipedia table 
table_list = []

In [10]:
# iterate through all the cells (<td> elements) in the table
for cell in table.find_all('td'):
    # skip postal codes that have no borough assigned
    if cell.span.text == 'Not assigned':
        continue
    temp_dic = {}  # holds the values extracted from one cell
    temp_dic['PostalCode'] = cell.p.text[:3]
    temp_dic['Borough'] = cell.span.text.split('(')[0]
    # text inside the parentheses: drop ')' and turn ' /' separators into commas
    temp_dic['Neighborhood'] = (cell.span.text.split('(')[1]
                                .strip(')').replace(' /', ',').replace(')', ' ').strip(' '))
    table_list.append(temp_dic)

# print(table_list)
df = pd.DataFrame(table_list)
# tidy borough names that were fused with extra text during scraping
df['Borough'] = df['Borough'].replace({
    'Downtown TorontoStn A PO Boxes25 The Esplanade': 'Downtown Toronto Stn A',
    'East TorontoBusiness reply mail Processing Centre969 Eastern': 'East Toronto Business',
    'EtobicokeNorthwest': 'Etobicoke Northwest',
    'East YorkEast Toronto': 'East York/East Toronto',
    'MississaugaCanada Post Gateway Processing Centre': 'Mississauga'})
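
The chained string calls in the loop are easier to follow on a single example. The sketch below applies the same steps to a string shaped like the text of one table cell. The helper `split_cell_text` and the sample string are hypothetical and serve only to illustrate the parsing.

```python
def split_cell_text(text):
    """Split 'Borough(Neighborhood / Neighborhood)' into two strings.

    Hypothetical helper that mirrors the string handling in the loop above.
    """
    parts = text.split('(')
    borough = parts[0]
    neighborhood = (
        parts[1]
        .strip(')')          # drop the trailing ')'
        .replace(' /', ',')  # 'A / B' becomes 'A, B'
        .replace(')', ' ')   # any ')' left inside the text becomes a space
        .strip(' ')
    )
    return borough, neighborhood


sample = "North York(Lawrence Manor / Lawrence Heights)"
print(split_cell_text(sample))
# ('North York', 'Lawrence Manor, Lawrence Heights')
```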

In [15]:
# print the dataframe's first 10 rows
df.head(10)

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Queen's Park | Ontario Provincial Government |
| 5 | M9A | Etobicoke | Islington Avenue |
| 6 | M1B | Scarborough | Malvern, Rouge |
| 7 | M3B | North York | Don Mills North |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson |


In the next cell we check that no __'Not assigned'__ cells remain in our dataframe.

In [35]:
x = df[df['Borough'] == 'Not assigned'].shape
y = df[df['Neighborhood'] == 'Not assigned'].shape
# use `and` here: `&` binds more tightly than `==`, so the original test never checked y
if x[0] == 0 and y[0] == 0:
    print("The dataset does not include any 'Not assigned' cells.")

The dataset does not include any 'Not assigned' cells.
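
As an alternative, the same check can be run over every column at once with `DataFrame.isin`. The sketch below uses a small made-up frame. `sample_df` and its contents are illustrative, not the scraped data.

```python
import pandas as pd

# Made-up data with one unassigned row (illustrative only)
sample_df = pd.DataFrame({
    "Borough": ["North York", "Not assigned"],
    "Neighborhood": ["Parkwoods", "Not assigned"],
})

# Boolean mask marking every cell equal to 'Not assigned'
mask = sample_df.isin(["Not assigned"])

# Rows where at least one column is 'Not assigned'
has_unassigned = mask.any(axis=1)
clean_df = sample_df[~has_unassigned]

print(int(has_unassigned.sum()))  # number of rows flagged
print(len(clean_df))              # number of rows kept
```

This approach does not need one comparison per column, so it keeps working if more columns are added later.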


The next cell prints the __dimensions__ of our dataset.


In [37]:
dim = df.shape
print('The dataframe has {} rows and {} columns.'.format(dim[0], dim[1]))

The dataframe has 103 rows and 3 columns.
