# Segmenting and Clustering Neighborhoods in Toronto

### Objective

In this assignment, we explore, segment, and cluster the neighborhoods in the city of Toronto. We scrape the Wikipedia page, wrangle and clean the data, and then read it into a pandas dataframe so that it is in a structured format like the New York dataset.

Once the data is in a structured format, we'll replicate the analysis that we performed on the New York City dataset to explore and cluster the neighborhoods in the city of Toronto.

#### Requirement 1

Use pandas, the BeautifulSoup package, or any other method to transform the data in the table on the Wikipedia page into a pandas dataframe.
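As an illustration of the "any other way" option, here is a minimal sketch that extracts table rows with only the standard library's `html.parser`. It is not the approach used in this notebook (which relies on BeautifulSoup). The `TableParser` class and the inline `sample_html` are hypothetical stand-ins, so nothing is downloaded.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('th', 'td') and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('th', 'td') and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical sample markup shaped like a small wikitable
sample_html = """
<table class="wikitable sortable">
<tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
<tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""

parser = TableParser()
parser.feed(sample_html)
header, *records = parser.rows  # first row holds the column names
print(header)
print(records)
```

The resulting `records` and `header` lists could then be passed to `pd.DataFrame(records, columns=header)`.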

In [3]:
# Installing required libraries
#!pip install beautifulsoup4
#!pip install lxml
#!pip install geopy
#!pip install geocoder
print('Installations done.')

Installations done.


In [4]:
# Importing libraries
import pandas as pd
import requests
from random import randint
from time import sleep
import sys
import logging

from bs4 import BeautifulSoup
#from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values
import geocoder as gc # import geocoder

import pprint

print('Libraries imported.')

# Adjust options
#pd.set_option('display.max_colwidth', 100)
#pd.reset_option('display.max_colwidth')
pp = pprint.PrettyPrinter(width=120)
print('Options adjusted.')

Libraries imported.
Options adjusted.


In [5]:
# Download the Wikipedia page and parse it
html = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

page = BeautifulSoup(html, 'lxml')

table = page.find('table', class_='wikitable sortable')

# Build column names from the header cells: strip newlines, title-case, drop spaces
header_list = []
for header in table.find_all('th'):
    header_list.append(header.text.replace('\n', '').title().replace(' ', ''))

# Collect the data rows, skipping header rows (no <td> cells) and unassigned boroughs
output_rows = []
for table_row in table.find_all('tr'):
    columns = table_row.find_all('td')
    output_row = []
    for column in columns:
        output_row.append(column.text.replace('\n', '').replace(' / ', ', '))
    if output_row and output_row[1] != 'Not assigned':
        output_rows.append(output_row)

df = pd.DataFrame(output_rows, columns=header_list)

# Sanity checks: postal codes are unique and no neighborhood is blank
print("'PostalCode' column has unique values:", df['PostalCode'].is_unique)
print("'Neighborhood' column doesn't have a blank value:", not (df['Neighborhood'] == '').any())

df.shape

'PostalCode' column has unique values: True
'Neighborhood' column doesn't have a blank value: True


(103, 3)
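The cleaning rules applied in the cell above can be sketched in isolation, without any network access. This is a minimal illustration, not the notebook's code: `normalize_header`, `clean_rows`, and the sample cells are hypothetical names and data chosen to mirror the text of the `<th>` and `<td>` tags.

```python
def normalize_header(text):
    """Strip newlines, title-case the text, and remove spaces."""
    return text.replace('\n', '').title().replace(' ', '')


def clean_rows(raw_rows, skip_value='Not assigned'):
    """Clean cell text and drop rows whose second cell equals skip_value."""
    cleaned = []
    for raw_row in raw_rows:
        row = [cell.replace('\n', '').replace(' / ', ', ') for cell in raw_row]
        # Empty rows correspond to header rows, which contain no <td> cells
        if row and row[1] != skip_value:
            cleaned.append(row)
    return cleaned


# Hypothetical sample cells, shaped like the raw text scraped from the table
sample_headers = ['Postal Code\n', 'Borough\n', 'Neighborhood\n']
sample_rows = [
    [],
    ['M1A\n', 'Not assigned\n', 'Not assigned\n'],
    ['M5A\n', 'Downtown Toronto\n', 'Regent Park / Harbourfront\n'],
]

headers = [normalize_header(h) for h in sample_headers]
rows = clean_rows(sample_rows)
print(headers)
print(rows)
```

Keeping the cleaning in small functions like these makes each rule easy to check on a few sample rows before running it against the full page.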

In [7]:
# Geocode each postal code with ArcGIS, retrying a few times on failure
lat_list = []
lng_list = []
max_attempts = 5

logging.basicConfig(level=logging.ERROR,
                    format='%(asctime)s %(name)-12s %(levelname)-8s %(message)s',
                    filename='errors.log',
                    filemode='w')

total = len(df)
sys.stdout.write("Location retrieval progress: %d%%   \r" % 0)
sys.stdout.flush()

for count, postal_code in enumerate(df['PostalCode'], start=1):
    address = '{}, Toronto, Ontario'.format(postal_code)
    lat_lng_coords = None

    # Retry until coordinates come back or the attempts run out.
    # An empty result counts as a failure too, so the loop cannot spin forever.
    for attempt in range(1, max_attempts + 1):
        try:
            lat_lng_coords = gc.arcgis(address).latlng
        except Exception:
            lat_lng_coords = None
        if lat_lng_coords:
            break
        logging.error('Failed attempt #{} for {}'.format(attempt, address))
        sleep(randint(1, 10) / 10)  # random pause of 0.1 to 1 second before retrying

    if not lat_lng_coords:
        raise RuntimeError('No coordinates found for {}'.format(address))

    lat_list.append(round(lat_lng_coords[0], 6))
    lng_list.append(round(lat_lng_coords[1], 6))

    sys.stdout.write("Location retrieval progress: %d%%   \r" % (count / total * 100))
    sys.stdout.flush()

df['Latitude'] = lat_list
df['Longitude'] = lng_list

df

Location retrieval progress: 100%   

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.752935 | -79.335641 |
| 1 | M4A | North York | Victoria Village | 43.728102 | -79.311890 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.650964 | -79.353041 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.723265 | -79.451211 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.661790 | -79.389390 |
| ... | ... | ... | ... | ... | ... |
| 98 | M8X | Etobicoke | The Kingsway, Montgomery Road , Old Mill North | 43.653340 | -79.509766 |
| 99 | M4Y | Downtown Toronto | Church and Wellesley | 43.666659 | -79.381472 |
| 100 | M7Y | East Toronto | Business reply mail Processing CentrE | 43.648700 | -79.385450 |
| 101 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... | 43.632798 | -79.493017 |
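The retry logic in the geocoding cell can be factored into a small helper that is testable without calling any service. The sketch below is an assumption-laden illustration: `geocode_with_retry` and `flaky_lookup` are hypothetical names, and the lookup callable is injected so that a fake can stand in for the real geocoder.

```python
import time


def geocode_with_retry(lookup, address, max_attempts=5, pause=0.0):
    """Call lookup(address) until it returns coordinates or attempts run out.

    lookup is any callable that returns a (lat, lng) pair, or None/empty on failure.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            coords = lookup(address)
        except Exception:
            coords = None
        if coords:
            return round(coords[0], 6), round(coords[1], 6)
        time.sleep(pause * attempt)  # wait a little longer after each failure
    raise RuntimeError(
        'No coordinates for {} after {} attempts'.format(address, max_attempts))


# Hypothetical stand-in for a geocoding service: fails twice, then succeeds
calls = {'count': 0}


def flaky_lookup(address):
    calls['count'] += 1
    if calls['count'] < 3:
        return None
    return [43.7529351, -79.3356412]


result = geocode_with_retry(flaky_lookup, 'M3A, Toronto, Ontario')
print(result, calls['count'])
```

In the notebook itself, the lookup could plausibly be `lambda address: gc.arcgis(address).latlng`, with a non-zero `pause` so that repeated failures do not hammer the service.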

