## Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto

**This assignment explores and clusters the neighborhoods of Toronto.**

#### 1. Scrape the data from Wikipedia and transform it into a pandas dataframe
[https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M]

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

print('Imported XD')

Imported XD


Use requests.get to download the raw HTML of the page

In [2]:
wiki = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
raw_source = requests.get(wiki)

Next, let's parse raw_source with BeautifulSoup to build a tree from which the data can be extracted

In [3]:
#Create a BeautifulSoup object from the downloaded page (the 'lxml' parser must be installed)
soup = BeautifulSoup(raw_source.text, 'lxml')
#print(soup.prettify()) would display the parsed HTML with nested indentation.

Now, let's extract the HTML that is relevant to our targeted table

In [4]:
table = soup.find('table', {'class': 'wikitable sortable'})
#print(table) by displaying the 'table', we can see the content and its relevant 'tag'

In [5]:
#Create a loop that collects the cells of each row into a list
#(named rows so that the built-in list is not shadowed)
rows = []

for tr in table.find_all('tr')[1:]:
    row = []
    for td in tr.find_all('td'):
        row.append(td.text)
    row = [str1.strip('\n') for str1 in row]
    rows.append(row)

Now that all rows are stored in rows, let's transform them into a pandas DataFrame

In [6]:
df = pd.DataFrame(rows, columns = ['Postcode', 'Borough', 'Neighborhood'])

In [7]:
df.head()

Unnamed: 0,Postcode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
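
The row-extraction steps above can be tried offline on a small inline table. This is only a sketch: the HTML snippet and the names html, toy_rows and toy_df are made up for illustration, and it uses Python's built-in 'html.parser' backend and get_text(strip=True) rather than the 'lxml' parser and strip('\n') used in the notebook.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical three-row table that mimics the layout of the Wikipedia page
html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village
</td></tr>
</table>
"""

# 'html.parser' ships with Python, so no extra parser package is needed here
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', class_='wikitable')

toy_rows = []
for tr in table.find_all('tr')[1:]:  # skip the header row
    # get_text(strip=True) removes the trailing newline inside each cell
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if cells:  # ignore rows that contain no <td> cells
        toy_rows.append(cells)

toy_df = pd.DataFrame(toy_rows, columns=['Postcode', 'Borough', 'Neighborhood'])
print(toy_df)
```

Skipping rows without td cells guards against extra header rows, which would otherwise become empty lists in the DataFrame.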


In [8]:
#Let's drop those rows whose 'Borough' is 'Not assigned'
df = df[df['Borough']!= 'Not assigned']

In [9]:
#More than one neighborhood can exist in one postal code area, so combine them into one comma-separated row per postcode.
df_new = df.groupby(['Postcode', 'Borough'])['Neighborhood'].apply(', '.join).reset_index()
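
To see what this grouping does, here is a minimal sketch on a made-up three-row frame (the names toy and grouped are illustrative only). Rows that share a Postcode and Borough collapse into one row whose Neighborhood values are joined with ', '.

```python
import pandas as pd

# Illustrative data: M1B appears twice with two different neighborhoods
toy = pd.DataFrame({
    'Postcode': ['M1B', 'M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighborhood': ['Rouge', 'Malvern', 'Woburn'],
})

# One row per (Postcode, Borough); neighborhoods are joined in their original order
grouped = (
    toy.groupby(['Postcode', 'Borough'])['Neighborhood']
       .agg(', '.join)
       .reset_index()
)
print(grouped)
```

Here agg is used in place of apply; for a single string-joining function the two should give the same result.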

In [10]:
df_new.head(10)

Unnamed: 0,Postcode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


If a cell has a borough but a 'Not assigned' neighborhood, then the neighborhood will be the same as the borough.

In [11]:
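#Select the rows whose neighborhood is missing and copy the borough name into them.
#Using .loc with a boolean mask avoids chained assignment and a manual row counter.
mask = df_new['Neighborhood'] == 'Not assigned'
df_new.loc[mask, 'Neighborhood'] = df_new.loc[mask, 'Borough']
for borough in df_new.loc[mask, 'Borough']:
    print('The Neighborhood without name has been changed to {}'.format(borough))

The Neighborhood without name has been changed to Queen's Park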


In [12]:
#Now, let's see the total number of rows in the DataFrame
print('This DataFrame has total {} rows'.format(df_new.shape[0]))

This DataFrame has total 103 rows


In [13]:
df_new.head(10)

Unnamed: 0,Postcode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


First, let's fetch the CSV file that contains the coordinates of each Postcode

In [14]:
df_csv = pd.read_csv('http://cocl.us/Geospatial_data')

In [15]:
df_csv.columns = ['Postcode', 'Latitude', 'Longitude']
df_csv.head()

Unnamed: 0,Postcode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [16]:
print('The number of postcodes in NEIGHBORHOOD dataframe, {}, matches with the one in coordinates file, {}'.format(df_new.shape[0], df_csv.shape[0]))

The number of postcodes in NEIGHBORHOOD dataframe, 103, matches with the one in coordinates file, 103


Let's merge the two dataframes together

In [17]:
df_merged = pd.concat([df_new, df_csv], ignore_index = False, axis = 1)

In [18]:
df_merged.head(10)

Unnamed: 0,Postcode,Borough,Neighborhood,Postcode.1,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",M1B,43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",M1C,43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",M1E,43.763573,-79.188711
3,M1G,Scarborough,Woburn,M1G,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,M1H,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,M1J,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",M1K,43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",M1L,43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",M1M,43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",M1N,43.692657,-79.264848


In [19]:
#The Postcode column now appears twice, let's get rid of the second copy.
df_merged.columns = ['Postcode', 'Borough', 'Neighborhood', 'duplicate', 'Latitude',
       'Longitude']
df_merged.drop('duplicate', axis = 1, inplace = True)
df_merged.head(10)

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848
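
Concatenating side by side only lines up correctly because both frames happen to be sorted by Postcode, and matching row counts alone do not guarantee that. As a hedged alternative, the sketch below joins on the Postcode key instead. The frames hoods and coords are made-up stand-ins for df_new and df_csv, with the coordinates deliberately shuffled.

```python
import pandas as pd

# Stand-in for df_new (values copied from the first rows shown above)
hoods = pd.DataFrame({
    'Postcode': ['M1B', 'M1C', 'M1E'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighborhood': [
        'Rouge, Malvern',
        'Highland Creek, Rouge Hill, Port Union',
        'Guildwood, Morningside, West Hill',
    ],
})

# Stand-in for df_csv, deliberately in a different row order
coords = pd.DataFrame({
    'Postcode': ['M1E', 'M1B', 'M1C'],
    'Latitude': [43.763573, 43.806686, 43.784535],
    'Longitude': [-79.188711, -79.194353, -79.160497],
})

# Stronger check than comparing row counts: the sets of postcodes must be equal
assert set(hoods['Postcode']) == set(coords['Postcode'])

# Join on the key; validate raises an error if a postcode is duplicated on either side
merged = hoods.merge(coords, on='Postcode', how='left', validate='one_to_one')
print(merged)
```

Because the join is keyed on Postcode, no duplicate column is produced and the result does not depend on row order.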
