# Segmenting and Clustering Neighborhoods in Toronto

## Getting Data

**We will get data from wikipedia website using Web Scraping**

In [67]:
#import libraries
from bs4 import BeautifulSoup #library for web scraping
import requests
import pandas as pd
import numpy as np

In [68]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source,'lxml')
table = soup.table

In [71]:
#Find the respective data from tag and store in list
postcode = []
borough = []
neighborhood = []

for cell in table.find_all('td'):
    try:
        postcode.append(cell.b.text) 
        borough.append(cell.span.a.text)
        neigh = cell.span.text
        #split neighborhood by brackets and repace / with ,
        neighbor = ''.join(neigh.split('(')[1].split(')'))
        neighborhood.append(neighbor.replace('/',','))  
        
    except Exception as e:
        borough.append('Not Assigned')
        neighborhood.append('Not Assigned')

**Converting data into Dataframe**

In [101]:
df = pd.DataFrame(list(zip(postcode,borough,neighborhood)),columns=['PostalCode','Borough','Neighborhood'])
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not Assigned,Not Assigned
1,M2A,Not Assigned,Not Assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park , Harbourfront"


## Data Cleaning

1. Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

In [103]:
#first check
df[df['Borough'] == 'Not Assigned'][0:5]

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not Assigned,Not Assigned
1,M2A,Not Assigned,Not Assigned
7,M8A,Not Assigned,Not Assigned
8,M9A,Not Assigned,Islington Avenue
11,M3B,Not Assigned,Don MillsNorth


In [105]:
df = df.drop(df[df['Borough'] == 'Not Assigned'].index)
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park , Harbourfront"
5,M6A,North York,"Lawrence Manor , Lawrence Heights"
6,M7A,Queen's Park,Not Assigned


In [107]:
#check again
df[df['Borough'] == 'Not Assigned'][0:5]

Unnamed: 0,PostalCode,Borough,Neighborhood


  #   

2. If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

In [160]:
#first check
df[df['Neighborhood'] == 'Not Assigned'][0:5]

Unnamed: 0,PostalCode,Borough,Neighborhood


In [161]:
#replace
df.loc[df['Neighborhood'] == 'Not Assigned', 'Neighborhood'] = df['Borough']

In [162]:
#check again
df[df['Neighborhood'] == 'Not Assigned'][0:5]

Unnamed: 0,PostalCode,Borough,Neighborhood


In [164]:
#number of rows and column of dataframe.
df.shape

(101, 3)