# Segmenting and Clustering Neighborhoods in Toronto

## Part 1 : Getting Data

**We will get the data from Wikipedia using web scraping.**

In [1]:
#import libraries
from bs4 import BeautifulSoup #library for web scraping
import requests
import pandas as pd
import numpy as np

In [2]:
#download the Wikipedia page and parse it
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source,'lxml')
#soup.table returns the first table on the page
table = soup.table

In [3]:
#extract the postal code, borough and neighborhood from each table cell
postcode = []
borough = []
neighborhood = []

for cell in table.find_all('td'):
    postcode.append(cell.b.text)
    borough.append(cell.span.text.split('(')[0])
    try:
        neigh = cell.span.text
        #split the neighborhood out of the brackets and replace / with ,
        neighbor = ''.join(neigh.split('(')[1].split(')'))
        neighborhood.append(neighbor.replace('/',','))
    except IndexError:
        #no brackets in the cell, so no neighborhood is listed
        neighborhood.append('Not assigned')
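
The string handling inside the loop can be tried out on its own, without downloading the page. The following is a minimal sketch, assuming each cell's text has the form `Borough(Neighborhood A / Neighborhood B)` as the loop above expects. The helper name `split_cell_text` is hypothetical and not part of the notebook; unlike the loop, it also strips surrounding whitespace and keeps everything after the first opening bracket.

```python
def split_cell_text(text):
    """Split 'Borough(Neighborhood A / Neighborhood B)' into two parts.

    Hypothetical helper for illustration, not part of the notebook.
    """
    borough, sep, rest = text.partition('(')
    if not sep:
        # no opening bracket: the cell lists no neighborhood
        return borough.strip(), 'Not assigned'
    # drop the closing brackets and use commas instead of slashes
    neighborhood = rest.replace(')', '').replace('/', ',')
    return borough.strip(), neighborhood.strip()


print(split_cell_text('North York(Parkwoods)'))
print(split_cell_text('Not assigned'))
```

Keeping the parsing in a small function like this makes it easy to check edge cases, such as a cell without brackets, before running the full scrape.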

**Converting the data into a DataFrame**

In [4]:
df = pd.DataFrame(list(zip(postcode,borough,neighborhood)),columns=['PostalCode','Borough','Neighborhood'])
df.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park , Harbourfront |


In [5]:
df.shape

(180, 3)

## Data Cleaning

1. Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

In [6]:
#first check
len(df[df['Borough'] == 'Not assigned'])

77

In [7]:
#drop rows
df = df.drop(df[df['Borough'] == 'Not assigned'].index)
df.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park , Harbourfront |
| 5 | M6A | North York | Lawrence Manor , Lawrence Heights |
| 6 | M7A | Queen's Park / Ontario Provincial Government | Not assigned |


In [8]:
#check again
len(df[df['Borough'] == 'Not assigned'])

0

In [9]:
df.shape

(103, 3)
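
Dropping rows by index, as above, works. An equivalent way to write the same step is to keep only the rows that pass a boolean mask. The sketch below uses a small made-up dataframe (`toy` is not the scraped data) and also resets the index, which the notebook does not do.

```python
import pandas as pd

# toy data, made up for illustration only
toy = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
})

# keep only the rows whose borough is assigned
assigned = toy[toy['Borough'] != 'Not assigned'].reset_index(drop=True)
print(assigned)
```

Resetting the index is optional; it only renumbers the remaining rows from 0.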

2. If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

In [10]:
#first check
df[df['Neighborhood'] == 'Not assigned']

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 6 | M7A | Queen's Park / Ontario Provincial Government | Not assigned |


In [11]:
#replace (the comparison is case-sensitive, so it must match 'Not assigned' exactly)
df.loc[df['Neighborhood'] == 'Not assigned', 'Neighborhood'] = df['Borough']

In [12]:
#check again
df[df['Neighborhood'] == 'Not assigned'][0:5]

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|


In [13]:
df.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park , Harbourfront |
| 5 | M6A | North York | Lawrence Manor , Lawrence Heights |
| 6 | M7A | Queen's Park / Ontario Provincial Government | Queen's Park / Ontario Provincial Government |


In [14]:
#number of rows and column of dataframe
df.shape

(103, 3)
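
String comparison in pandas is case-sensitive, so the placeholder in the replace step has to match the scraped value `Not assigned` exactly; otherwise the mask selects no rows and nothing changes. A minimal sketch of the replacement on a made-up two-row dataframe (`toy` and the shortened borough name are illustrative, not the scraped data):

```python
import pandas as pd

# toy data, made up for illustration only
toy = pd.DataFrame({
    'Borough': ['North York', "Queen's Park"],
    'Neighborhood': ['Parkwoods', 'Not assigned'],
})

# the placeholder must match exactly, including letter case
mask = toy['Neighborhood'] == 'Not assigned'

# copy the borough into the neighborhood for the masked rows only
toy.loc[mask, 'Neighborhood'] = toy.loc[mask, 'Borough']
print(toy)
```

Checking `mask.sum()` before assigning is a quick way to confirm that the condition matches the expected number of rows.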


## Part 2 : Extracting the Latitude and Longitude

In [15]:
#Second dataframe which has latitude and longitude 
path = 'http://cocl.us/Geospatial_data'
df_lat_lng = pd.read_csv(path)
df_lat_lng.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [16]:
#rename the column 'Postal Code' to 'PostalCode' so the two dataframes can be merged on it
df_lat_lng.columns = ['PostalCode','Latitude','Longitude']
df_lat_lng.head()

| | PostalCode | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [17]:
df_lat_lng.shape

(103, 3)
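
Assigning a new list to `columns` relies on the column order staying the same. A slightly more defensive alternative is `DataFrame.rename`, which changes only the named column. The sketch below uses a made-up frame `toy_geo` holding the first two rows shown above.

```python
import pandas as pd

# first two rows of the geospatial data shown above, typed in by hand
toy_geo = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# rename only the key column; the other columns are left untouched
toy_geo = toy_geo.rename(columns={'Postal Code': 'PostalCode'})
print(list(toy_geo.columns))
```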

In [18]:
#merge both dataframes on PostalCode
df_can = pd.merge(df,df_lat_lng, on='PostalCode')
df_can.head()

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park , Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor , Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Queen's Park / Ontario Provincial Government | Queen's Park / Ontario Provincial Government | 43.662301 | -79.389494 |


In [19]:
#df itself is unchanged by the merge; the merged result is in df_can
df.shape

(103, 3)
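
`pd.merge` performs an inner join by default, so a postal code missing from either dataframe would be dropped without any warning. One way to check for that is a left join with `indicator=True`, sketched below on made-up frames. The names `left`, `right`, `checked` and `unmatched`, and the postal code `M9Z`, are illustrative assumptions, not data from the notebook.

```python
import pandas as pd

# toy frames; 'M9Z' is a made-up code with no coordinates
left = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M9Z'],
    'Borough': ['North York', 'North York', 'Made-up borough'],
})
right = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# a left join keeps every row of `left`; `indicator` adds a _merge column,
# and `validate` raises if a postal code is duplicated on either side
checked = pd.merge(left, right, on='PostalCode', how='left',
                   validate='one_to_one', indicator=True)

# rows that found no match in `right`
unmatched = checked[checked['_merge'] == 'left_only']
print(unmatched['PostalCode'].tolist())
```

If `unmatched` is empty, the inner join and the left join return the same rows.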
