<h1 align=center><font size = 5>Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto</font></h1>


---

<h1 align=center><font size = 10>~~~ Part 1 ~~~</font></h1>

### In part 1 we will: 
 * use the Notebook to build the code that scrapes the table from the Wikipedia page "List of postal codes of Canada: M"
 * obtain the data that is in the table of postal codes
 * transform the data into a pandas dataframe

### Preparations

In [1]:
# import requests to send the HTTP request

import requests 

# fetch the page; note that `url` holds the Response object, not the URL string
url = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")

Scraping the table from the page

In [2]:
# import BeautifulSoup from the bs4 library to scrape the table from the page

from bs4 import BeautifulSoup

soup = BeautifulSoup(url.content,'lxml')
table = soup.find_all('table')[0]
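
To make the scraping step easier to follow, here is a minimal sketch of what "pull the cells out of the first table" amounts to. It deliberately swaps BeautifulSoup for the standard library's `html.parser` and runs on a tiny inline HTML string, so `SAMPLE_HTML` and `TableParser` are illustrative names, not part of the notebook, and no request is sent to Wikipedia.

```python
from html.parser import HTMLParser

# Tiny inline stand-in for the Wikipedia table (hypothetical sample, not fetched)
SAMPLE_HTML = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


parser = TableParser()
parser.feed(SAMPLE_HTML)
header, *records = parser.rows
print(header)
print(records)
```

In the notebook itself, BeautifulSoup together with `pd.read_html` does this work in two lines; the sketch only shows the row and cell structure that those calls rely on.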

Convert the table into a pandas DataFrame

In [3]:
# import the pandas library to parse the table into a dataframe
import pandas as pd

# read_html returns a list of DataFrames; the table we want is the first element
df = pd.read_html(str(table))[0]

df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


### Cleaning Process

In [4]:
# Drop the rows whose 'Borough' value is 'Not assigned'

df.drop(df.loc[df['Borough']=='Not assigned'].index, inplace=True)
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
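
The same filtering can also be written as a boolean mask, which some readers find easier to scan than `drop` with an index lookup. The sketch below is an alternative under that assumption, run on a toy frame (`toy` and `assigned` are illustrative names) built from the first rows shown earlier.

```python
import pandas as pd

# Toy rows copied from the head of the scraped table
toy = pd.DataFrame({
    "Postcode": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only rows with an assigned Borough; the original index labels are preserved
assigned = toy[toy["Borough"] != "Not assigned"]
print(assigned)
```

Unlike `drop(..., inplace=True)`, this returns a new frame and leaves `toy` untouched, which makes the cell safe to re-run.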


In [5]:
# Combine the Neighbourhoods that share a Postcode into one comma-separated string
# groupby sorts the result by its keys by default
# .reset_index() turns the group keys back into regular columns
# .rename() renames the 'Postcode' column to 'PostalCode'

df_combined = df.groupby(['Postcode', 'Borough'])['Neighbourhood'].apply(', '.join).reset_index().rename(columns={"Postcode": "PostalCode"})
df_combined

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"
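
To see what the `groupby` and `join` chain does in isolation, here is a small sketch on three toy rows (the `toy` and `combined` names are illustrative). The two M6A rows collapse into one, and the keys come back sorted.

```python
import pandas as pd

# Two rows share postcode M6A; M3A stands alone
toy = pd.DataFrame({
    "Postcode": ["M6A", "M6A", "M3A"],
    "Borough": ["North York", "North York", "North York"],
    "Neighbourhood": ["Lawrence Heights", "Lawrence Manor", "Parkwoods"],
})

# Join the neighbourhood names within each (Postcode, Borough) group
combined = (
    toy.groupby(["Postcode", "Borough"])["Neighbourhood"]
    .apply(", ".join)
    .reset_index()
)
print(combined)
```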


In [6]:
# Check whether the 'Neighbourhood' column has any 'Not assigned' value

df_combined.loc[df_combined['Neighbourhood'].isin(['Not assigned'])]

Unnamed: 0,PostalCode,Borough,Neighbourhood


There is no 'Not assigned' value in the 'Neighbourhood' column.
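
The check came back empty, so nothing needs fixing here. Had it returned rows, one common approach (an assumption, not something this notebook does) is to fall back to the borough name. A minimal sketch with a made-up row, where `M9Z` and `Example Borough` are hypothetical values:

```python
import pandas as pd

toy = pd.DataFrame({
    "PostalCode": ["M9Z", "M1G"],                  # "M9Z" is a hypothetical code
    "Borough": ["Example Borough", "Scarborough"],
    "Neighbourhood": ["Not assigned", "Woburn"],
})

# Where Neighbourhood is 'Not assigned', take the value from Borough instead
missing = toy["Neighbourhood"] == "Not assigned"
filled = toy.assign(Neighbourhood=toy["Neighbourhood"].mask(missing, toy["Borough"]))
print(filled)
```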

In [7]:
# print the shape (rows, columns) of the dataframe
df_part1 = df_combined
df_part1.shape

(103, 3)

<h1 align=center><font size = 10>~~~ End of Part 1 ~~~</font></h1>

---

---

<h1 align=center><font size = 10>~~~ Part 2 ~~~</font></h1>

### In part 2 we will use the csv file of geographical coordinates (the alternative to the Geocoder package) to create a more comprehensive dataframe
Inner join the Part 1 dataframe with the coordinate data, so that the new dataframe includes the latitude and longitude of each postal code

In [8]:
# read the coordinate csv file into a dataframe

file = 'http://cocl.us/Geospatial_data'
df_coordinate = pd.read_csv(file)
df_coordinate.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [9]:
# inner join (the default for merge) the Part 1 dataframe with the coordinate dataframe
df_part2 = df_part1.merge(df_coordinate, left_on='PostalCode', right_on='Postal Code')
df_part2.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood,Postal Code,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",M1B,43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",M1C,43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",M1E,43.763573,-79.188711
3,M1G,Scarborough,Woburn,M1G,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,M1H,43.773136,-79.239476
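
An inner join silently drops any postal code that has no coordinates. As a sanity check, a left join with `indicator=True` can surface such rows before they disappear. The sketch below uses toy frames (`left`, `right`, `checked` and the `M9Z` code are illustrative, not from the real data).

```python
import pandas as pd

left = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M9Z"],   # "M9Z" is a hypothetical unmatched code
    "Borough": ["Scarborough", "Scarborough", "Example Borough"],
})
right = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# how="left" keeps every row of `left`; the _merge column records where each row matched
checked = left.merge(
    right, left_on="PostalCode", right_on="Postal Code", how="left", indicator=True
)
unmatched = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```

If `unmatched` is empty, the inner join used in the notebook loses nothing.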


### Cleaning

In [10]:
# remove the duplicate 'Postal Code' column left over from the merge
df_part2.drop(['Postal Code'], axis=1, inplace=True) 

In [11]:
df_part2

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848


<h1 align=center><font size = 10>~~~ End of Part 2 ~~~</font></h1>

---

---

<h1 align=center><font size = 10>~~~ Part 3 ~~~</font></h1>

### In part 3 we will generate a map to visualize the neighborhoods and how they cluster together, using the folium library

In [12]:
# install and import folium

!conda install -c conda-forge folium=0.5.0 --yes
import folium

Solving environment: done

# All requested packages already installed.



In [13]:
# reuse the Part 2 dataframe (it holds Toronto postal codes, despite the 'canada' name)
df_canada = df_part2
df_canada.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


In [14]:
# import plugins from folium library
from folium import plugins

# map centered on Toronto
latitude = 43.662301
longitude = -79.389494
canada_map = folium.Map(location=[latitude, longitude], zoom_start=12)

# instantiate a marker cluster object for the postal codes in the dataframe
postal = plugins.MarkerCluster().add_to(canada_map)

# loop through the dataframe and add each data point to the marker cluster
# label every marker with its own PostalCode
for lat, lng, label in zip(df_canada.Latitude, df_canada.Longitude, df_canada.PostalCode):
    folium.Marker(
        location=[lat, lng],
        icon=None,
        popup=label,
    ).add_to(postal)
 
    
# display map
canada_map
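
The map center above is hard-coded. As an alternative, it could be derived from the data by averaging the coordinates, which keeps the map centered if the set of postal codes changes. A minimal sketch under that assumption, using three coordinate pairs from the Part 2 output (`coords` and `center` are illustrative names):

```python
import pandas as pd

# First three coordinate pairs from the Part 2 dataframe
coords = pd.DataFrame({
    "Latitude": [43.806686, 43.784535, 43.763573],
    "Longitude": [-79.194353, -79.160497, -79.188711],
})

# The mean latitude/longitude gives a rough center for the plotted points
center = [coords["Latitude"].mean(), coords["Longitude"].mean()]
print(center)
```

The resulting `center` list has the same `[latitude, longitude]` shape that `folium.Map(location=...)` receives in the cell above.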

<h1 align=center><font size = 10>~~~ End of Part 3 ~~~</font></h1>

---

---

<h1 align=center><font size = 5>Thank you for reviewing my Peer-graded Assignment</font></h1>
<h1 align=center><font size = 2>Have a nice day, and keep up the good work!</font></h1>
