# Coursera Capstone Project
### Andrea Provino - Italy, TO
This notebook contains the code for the Coursera Capstone Project.

## <span style="color: #00a8b5">Libraries</span>

In [1]:
import numpy as np
import pandas as pd
# display up to 30 columns when printing dataframes
pd.set_option("display.max_columns",30)

In [2]:
# sample code
print("Hello Capstone Project Course!")

Hello Capstone Project Course!


## <span style="color: #00a8b5">Project</span>
We are required to explore and cluster the neighborhoods in Toronto.  
To acquire the data, we scrape the Wikipedia page [List of postal codes of Canada: M](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M).

In [12]:
# install the following dependencies
#!pip install bs4
#!pip install lxml
#!pip install html5lib
#!pip install requests





In [14]:
from bs4 import BeautifulSoup
import requests

# link to the resource
doc_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

# get the resource
source = requests.get(doc_url).text

# parse it:
soup = BeautifulSoup(source, 'lxml')

In [94]:
# isolate the table
table = soup.find('tbody')
table_rows = table.find_all('tr')
len(table_rows)

288

In [109]:
# get the data
raw_df = []
for tr in table_rows:
    row = tr.text.split('\n')[1:-1] # first and last lines are empty string
    raw_df.append(row)
raw_df = pd.DataFrame(raw_df[1:], columns=raw_df[0])
raw_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


## <span style="color: #00a8b5">Data Cleaning</span>
For the purpose of the project, we have to:
* Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
* More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
* If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.
* In the last cell of your notebook, use the `.shape` attribute to print the number of rows of your dataframe.
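
The three cleaning rules above can be sketched on a small hand-made frame. The sample rows are illustrative rather than scraped, and `clean_postcodes` is a hypothetical helper name introduced only for this sketch; the notebook's own cells take a slightly different route.

```python
import pandas as pd

# Illustrative sample rows (made up for this sketch, mirroring the table layout)
sample = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],
        ["M5A", "Downtown Toronto", "Harbourfront"],
        ["M5A", "Downtown Toronto", "Regent Park"],
        ["M7A", "Queen's Park", "Not assigned"],
    ],
    columns=["Postcode", "Borough", "Neighbourhood"],
)


def clean_postcodes(df):
    """Hypothetical helper applying the three cleaning rules."""
    # Rule 1: keep only rows with an assigned borough
    out = df[df["Borough"] != "Not assigned"].copy()
    # Rule 3: an unassigned neighbourhood takes the borough's name
    missing = out["Neighbourhood"] == "Not assigned"
    out.loc[missing, "Neighbourhood"] = out.loc[missing, "Borough"]
    # Rule 2: one row per postcode, neighbourhoods joined with ", "
    return (
        out.groupby("Postcode")
        .agg({"Borough": "first", "Neighbourhood": ", ".join})
        .reset_index()
    )


cleaned = clean_postcodes(sample)
print(cleaned)
print(cleaned.shape)
```

Applying rule 3 before the grouping step means a postcode never ends up with the literal string "Not assigned" inside its joined neighbourhood list.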

In [113]:
# remove rows whose borough is 'Not assigned'
raw_df = raw_df[raw_df.Borough != 'Not assigned']

# reset index
raw_df.reset_index(drop=True, inplace=True)

# sanity check
raw_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor


Reference used for the grouping step below: [Stack Overflow](https://stackoverflow.com/questions/17841149/pandas-groupby-how-to-get-a-union-of-strings)

In [116]:
# group by Postcode
grouped = raw_df.groupby('Postcode')

# join the neighborhoods of each postcode into one comma-separated string
neighborhood_grouped = grouped['Neighbourhood'].apply(', '.join)

# each postcode belongs to a single borough, so take the first value per group
borough_grouped = grouped['Borough'].first()

# turn borough_grouped and neighborhood_grouped into dataframes
borough = borough_grouped.to_frame()
neighborhood = neighborhood_grouped.to_frame()

# combine the dataframe borough and the dataframe neighborhood into one dataframe
grouped_final = borough.merge(neighborhood, on="Postcode")

# sanity check
grouped_final.head()

Postcode,Borough,Neighbourhood
M1B,Scarborough,"Rouge, Malvern"
M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
M1E,Scarborough,"Guildwood, Morningside, West Hill"
M1G,Scarborough,Woburn
M1H,Scarborough,Cedarbrae


In [117]:
grouped_final.shape

(103, 2)

## <span style="color: #00a8b5">Geo Data</span>

In [118]:
geospatial_data = pd.read_csv('Geospatial_Coordinates.csv')

geospatial_data.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [120]:
# rename 'Postal Code' to 'Postcode' so the key matches grouped_final
geospatial_data = geospatial_data.rename(columns={'Postal Code': 'Postcode'})

In [121]:
# merge data
full_table = grouped_final.merge(geospatial_data, on = 'Postcode')

full_table.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
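
Because `merge` defaults to an inner join, a postcode missing from the coordinates file would silently drop out of the result. Below is a minimal sketch, on made-up sample frames, of how a left join with `indicator=True` could surface such rows. The names `areas` and `coords`, and the postcode `M9Z`, are assumptions for illustration only; in the notebook the real frames are `grouped_final` and `geospatial_data`.

```python
import pandas as pd

# Made-up sample frames; "M9Z" is a hypothetical postcode with no coordinates
areas = pd.DataFrame({
    "Postcode": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Example"],
})
coords = pd.DataFrame({
    "Postcode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# A left join keeps every row of `areas`; the `_merge` column records
# whether a matching row was found in `coords`
checked = areas.merge(coords, on="Postcode", how="left", indicator=True)

# Postcodes that have no coordinates
unmatched = checked.loc[checked["_merge"] == "left_only", "Postcode"].tolist()
print(unmatched)
```

An empty `unmatched` list would indicate that the inner join used in the notebook loses no rows.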


In [124]:
#!pip install geopy
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images
from IPython.display import Image, HTML

import folium

import matplotlib.cm as cm
import matplotlib.colors as colors

# k-means for the clustering stage
from sklearn.cluster import KMeans

import json # library to handle JSON files
# requests is already imported in an earlier cell
from pandas import json_normalize # transform JSON into a pandas dataframe (top-level import, pandas >= 1.0)

## <span style="color: #00a8b5">Clustering Data</span>

In [125]:
# geocode Toronto to get the coordinates for the map center
address = 'Toronto, Ontario, Canada'
geolocator = Nominatim(user_agent="TO_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('Toronto is {}, {}.'.format(latitude, longitude))

Toronto is 43.653963, -79.387207.


In [127]:
# plot the neighborhoods on a map of Toronto
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add one marker per postcode area to the Toronto map
for lat, lng, bor, neigh in zip(full_table['Latitude'], full_table['Longitude'],
                                full_table['Borough'], full_table['Neighbourhood']):
    label = '{}, {}'.format(neigh, bor)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=7,
        popup=label,
        color='red',
        fill=True,
        fill_color='white',
        fill_opacity=0.7).add_to(map_toronto)

map_toronto