# Segmenting and Clustering Neighborhoods in Toronto

## Introduction

In this assignment, we explore, segment, and cluster the neighborhoods of Toronto. Every data science project is challenging in its own way, so we need to stay agile and sharpen our ability to pick up new libraries and tools quickly as each project demands.

A Wikipedia page contains all the information we need to explore and cluster Toronto's neighborhoods. We scrape that page, wrangle and clean the data, and then read it into a pandas dataframe so that it is in a structured format.

Once the data is in a structured format, we can replicate the analysis to explore and cluster the neighborhoods in the city of Toronto.

### Import libraries

In [1]:
from bs4 import BeautifulSoup
import requests
import numpy as np
import pandas as pd
from opencage.geocoder import OpenCageGeocode
from time import sleep
from IPython.display import display
import folium

### Downloading and parsing the data to work with

In [3]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source, 'lxml')

In [4]:
raw = []
# iterate over the rows (tr) of the first table body
for tr in soup.find('tbody'):
    # iterate over the cells (th/td) of each row
    for td in tr:
        # iterate over the contents of each cell
        for element in td:
            # skip the bare newlines that sit between tags
            if element != '\n':
                # plain text nodes are str subclasses; anything else is a tag
                # (for example a link), so keep only its text
                if not isinstance(element, str):
                    element = element.get_text()
                # cut trailing newlines
                element = element.rstrip('\n')
                # append the extracted cell text to the flat list
                raw.append(str(element))
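
The nested loops above flatten the whole table into one list, which the later steps have to regroup by position. As a hedged alternative, the sketch below keeps each table row as its own list by using `find_all` and `get_text`. The `SAMPLE_HTML` snippet and the `table_rows` helper are hypothetical and exist only for illustration; the real Wikipedia markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the postal-code table, for illustration only.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
    <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
    <tr><td>M3A</td><td><a href="/wiki/North_York">North York</a></td>
        <td><a href="/wiki/Parkwoods">Parkwoods</a></td></tr>
    <tr><td>M7A</td><td><a href="/wiki/Queen%27s_Park">Queen's Park</a></td>
        <td>Not assigned</td></tr>
  </tbody>
</table>
"""


def table_rows(html):
    """Return the first table as a list of rows, each a list of cell texts."""
    # "html.parser" is the parser bundled with Python, so lxml is not required here.
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find("table").find_all("tr"):
        # get_text(strip=True) drops surrounding whitespace and unwraps links.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows


rows = table_rows(SAMPLE_HTML)
header, body = rows[0], rows[1:]
```

Keeping rows together avoids index arithmetic later, at the cost of departing from the flat-list approach used in this notebook.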

### Removing rows with no assigned Borough

In [5]:
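# blank out every row whose last two cells (Borough, Neighbourhood) are equal;
# this is taken to mean that both are 'Not assigned'
for i in range(1, len(raw)):
    if raw[i] == raw[i-1]:
        raw[i-2] = ""
        raw[i-1] = ""
        raw[i] = ""
# drop the blanked-out cells
clean = list(filter(None, raw))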
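
The loop above relies on neighbouring cells being equal. A minimal sketch of the same idea with the condition spelled out follows: it regroups the flat list into rows of three and drops rows whose Borough is the literal marker `'Not assigned'`. The helper names `to_rows` and `drop_unassigned_boroughs` and the sample values are assumptions made for illustration, not part of the original notebook.

```python
def to_rows(flat, width=3):
    """Group a flat list of cell values into rows of `width` cells."""
    if len(flat) % width != 0:
        raise ValueError("flat list length is not a multiple of the row width")
    return [flat[i:i + width] for i in range(0, len(flat), width)]


def drop_unassigned_boroughs(rows, marker="Not assigned"):
    """Keep only rows whose Borough (second cell) is assigned."""
    return [row for row in rows if row[1] != marker]


# Illustrative sample: a header row followed by three data rows.
flat = [
    "Postcode", "Borough", "Neighbourhood",
    "M1A", "Not assigned", "Not assigned",
    "M3A", "North York", "Parkwoods",
    "M7A", "Queen's Park", "Not assigned",
]
kept = drop_unassigned_boroughs(to_rows(flat))
```

Testing the Borough cell directly means a row is removed only for the stated reason, not because two adjacent cells happen to match.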

### Setting "Not assigned" Neighbourhood value equal to the Borough

In [7]:
for i in range(0, len(clean)):      
    if(clean[i]=='Not assigned'):
        clean[i] = clean[i-1]
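
The same substitution can be sketched on row lists rather than on the flat list, which makes it explicit that only the Neighbourhood cell is replaced. The `fill_neighbourhood` helper and the sample rows are hypothetical and assume each row has exactly three cells.

```python
def fill_neighbourhood(rows, marker="Not assigned"):
    """Replace an unassigned Neighbourhood with the row's Borough."""
    return [
        [postcode, borough, borough if neighbourhood == marker else neighbourhood]
        for postcode, borough, neighbourhood in rows
    ]


# Illustrative sample rows (Postcode, Borough, Neighbourhood).
sample = [
    ["M3A", "North York", "Parkwoods"],
    ["M7A", "Queen's Park", "Not assigned"],
]
filled = fill_neighbourhood(sample)
```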

### Splitting the list into dataframe by categories (Postcode, Borough, Neighbourhood)

In [9]:
# split the flat list into rows of three cells (integer division keeps the section count an int)
split = np.array_split(clean, len(clean) // 3)
# build the dataframe, using the first row as the column headers
df = pd.DataFrame(split[1:], columns=split[0])
# join the Neighbourhoods that share a Postcode and Borough, then drop the duplicate rows this creates
df['Neighbourhood'] = df.groupby(['Postcode','Borough'])['Neighbourhood'].transform(lambda x: ', '.join(x))
df.drop_duplicates(inplace=True)
df.reset_index(drop=True, inplace=True)
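
The `transform` plus `drop_duplicates` combination above can also be written as a single aggregation. The sketch below runs on a small hand-made dataframe (the `toy` values are illustrative) and uses `groupby(...).agg` with `as_index=False` so the grouping keys stay as ordinary columns and `sort=False` so the original row order is kept.

```python
import pandas as pd

# Illustrative rows: two postcodes that each list two neighbourhoods, plus one single row.
toy = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M6A", "M6A", "M7A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto",
                "North York", "North York", "Queen's Park"],
    "Neighbourhood": ["Harbourfront", "Regent Park",
                      "Lawrence Heights", "Lawrence Manor", "Queen's Park"],
})

# One output row per (Postcode, Borough), with the neighbourhood names joined.
merged = (
    toy.groupby(["Postcode", "Borough"], as_index=False, sort=False)["Neighbourhood"]
       .agg(", ".join)
)
```

Because the aggregation already returns one row per group, no separate de-duplication or index reset is needed in this variant.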

### Checking the shape of the cleaned dataset and printing the first rows to see the formatting

In [19]:
display(df.shape)
display(df.head(10))

(103, 3)

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor |
| 4 | M7A | Queen's Park | Queen's Park |
| 5 | M9A | Etobicoke | Islington Avenue |
| 6 | M1B | Scarborough | Rouge, Malvern |
| 7 | M3B | North York | Don Mills North |
| 8 | M4B | East York | Woodbine Gardens, Parkview Hill |
| 9 | M5B | Downtown Toronto | Ryerson, Garden District |
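
Beyond eyeballing the shape and the first rows, a few assertions can confirm that the cleaning steps did what they were meant to. This is a minimal sketch: the `check_clean` helper and the `toy` dataframe are hypothetical, and the checks assume the three column names shown above.

```python
import pandas as pd


def check_clean(df, marker="Not assigned"):
    """Raise AssertionError if the cleaned table breaks basic expectations."""
    assert list(df.columns) == ["Postcode", "Borough", "Neighbourhood"]
    # each postcode should appear on exactly one row after the merge
    assert df["Postcode"].is_unique
    # no unassigned values should survive the cleaning steps
    assert not (df["Borough"] == marker).any()
    assert not (df["Neighbourhood"] == marker).any()
    return df.shape


# Illustrative two-row table in the cleaned format.
toy = pd.DataFrame({
    "Postcode": ["M3A", "M7A"],
    "Borough": ["North York", "Queen's Park"],
    "Neighbourhood": ["Parkwoods", "Queen's Park"],
})
shape = check_clean(toy)
```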
