# Week 3 - Segmenting and Clustering Neighborhoods in Toronto (Toronto Capstone Project)


## Part 1 - Build a dataframe of the postal code of each neighborhood along with its borough name and neighborhood name

In [2]:
!conda install -c conda-forge geopy --yes
!pip install beautifulsoup4

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs:
    - geopy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    geographiclib-1.50         |             py_0          34 KB  conda-forge
    geopy-1.22.0               |     pyh9f0ad1d_0          63 KB  conda-forge
    ------------------------------------------------------------
                                           Total:          97 KB

The following NEW packages will be INSTALLED:

  geographiclib      conda-forge/noarch::geographiclib-1.50-py_0
  geopy              conda-forge/noarch::geopy-1.22.0-pyh9f0ad1d_0




### Project Requirements:

1. Start a new notebook for this project.
2. Use the Notebook to build the code to scrape the following Wikipedia page.
3. Create a dataframe according to the requirements in the assignment.

### 1. Start a new notebook for this project (import the libraries).

In [13]:
import random # library for random number generation
import numpy as np # library for vectorized computation
import pandas as pd # library to process data as dataframes
import requests

import matplotlib.pyplot as plt # plotting library
# backend for rendering plots within the browser
%matplotlib inline 

from bs4 import BeautifulSoup
from sklearn.cluster import KMeans 
from sklearn.datasets import make_blobs # the samples_generator submodule was removed in newer scikit-learn releases

import bs4 as bs
import urllib.request
!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

print('Libraries imported.')

Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs:
    - folium=0.5.0


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    altair-4.1.0               |             py_1         614 KB  conda-forge
    branca-0.4.1               |             py_0          26 KB  conda-forge
    folium-0.5.0               |             py_0          45 KB  conda-forge
    pandas-1.0.4               |   py36h830a2c2_0        10.1 MB  conda-forge
    toolz-0.10.0               |             py_0          46 KB  conda-forge
    vincent-0.4.4              |             py_1          28 KB  conda-forge
    --------------------------------------

### 2. Use the Notebook to build the code to scrape the following Wikipedia page (https://en.wikipedia.org).

In [2]:
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

# The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
#table = pd.read_html(url, thousands=' ', header=0)[0]
#table.columns
#table.head(10)

# Download the page and parse it with BeautifulSoup
source = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(source, 'html.parser')

# Take the first table on the page and collect its rows
table = soup.find('table')
table_rows = table.find_all('tr')

# Keep the non-empty cell texts of each row; rows without <td> cells (such as a header row) are skipped
l = []
for tr in table_rows:
    td = tr.find_all('td')
    row = [cell.text.strip() for cell in td if cell.text.strip()]
    if row:
        l.append(row)
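
The row-extraction loop above can be tried offline on a small hand-written table. The following is a minimal sketch, not the notebook's own code: the `SAMPLE_HTML` string and the `parse_table_rows` helper are hypothetical names introduced for illustration, while `BeautifulSoup`, `find`, and `find_all` are the same calls used in the cell above.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the scraped page: one header row and three data rows.
SAMPLE_HTML = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""


def parse_table_rows(html):
    """Return the non-empty <td> texts of each row in the first table."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find("table").find_all("tr"):
        cells = [td.text.strip() for td in tr.find_all("td") if td.text.strip()]
        if cells:  # the header row only has <th> cells, so it yields no data
            rows.append(cells)
    return rows


rows = parse_table_rows(SAMPLE_HTML)
print(rows)
```

Working from a fixed string like this makes it easy to check the parsing logic without depending on the live page, whose layout may change over time.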

### 3. Create a dataframe according to the requirements in the assignment.

* The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood.

In [3]:
df_Toronto = pd.DataFrame(l, columns=["PostalCode", "Borough", "Neighbourhood"])
df_Toronto.head()

|   | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |


* Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

In [7]:
df_Toronto = df_Toronto[df_Toronto.Borough != 'Not assigned']
df_Toronto.head()

|   | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


* More than one neighborhood can exist in one postal code area. These rows will be combined into one row with the neighborhoods separated with a comma.

In [10]:
df_Toronto = df_Toronto.groupby(['PostalCode', 'Borough']).agg(', '.join)
df_Toronto = df_Toronto.reset_index()
df_Toronto.head()

|   | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
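
To see what the `groupby` step does when a postal code is spread over several rows, here is a small sketch on made-up data. The `toy` dataframe is an assumption for illustration only; it is not taken from the scraped table.

```python
import pandas as pd

# Toy data (made up): M5A appears on two rows with different neighbourhoods.
toy = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M1G"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "Scarborough"],
    "Neighbourhood": ["Regent Park", "Harbourfront", "Woburn"],
})

# Join the neighbourhood names of each (PostalCode, Borough) group with ", ".
combined = (
    toy.groupby(["PostalCode", "Borough"])["Neighbourhood"]
       .agg(", ".join)
       .reset_index()
)
print(combined)
```

Because `groupby` sorts the group keys by default, the result is ordered by postal code, which is why the notebook's output starts at M1B rather than M3A.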


* If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

In [11]:
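# Copy the borough name into every row whose neighbourhood is 'Not assigned'
not_assigned = df_Toronto['Neighbourhood'] == 'Not assigned'
df_Toronto.loc[not_assigned, 'Neighbourhood'] = df_Toronto.loc[not_assigned, 'Borough']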
df_Toronto.head()

|   | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
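
The rule stated above (a "Not assigned" neighbourhood takes the name of its borough) can be sketched with a boolean mask. The `toy` dataframe below is made up for illustration and is not part of the scraped data.

```python
import pandas as pd

# Toy data (made up): the first row has a borough but no neighbourhood.
toy = pd.DataFrame({
    "PostalCode": ["M7A", "M1G"],
    "Borough": ["Queen's Park", "Scarborough"],
    "Neighbourhood": ["Not assigned", "Woburn"],
})

# Select the affected rows, then copy the borough name into them.
mask = toy["Neighbourhood"] == "Not assigned"
toy.loc[mask, "Neighbourhood"] = toy.loc[mask, "Borough"]
print(toy)
```

Copying from the `Borough` column rather than assigning a fixed string keeps the step correct for any borough that happens to lack a neighbourhood name.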


* Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.

   1. I used the BeautifulSoup package to scrape the Wikipedia page.
   2. The pandas dataframe comprises three columns: PostalCode, Borough, and Neighbourhood.
   3. Only cells that have an assigned borough were processed; cells with a "Not assigned" borough were ignored.
   4. Rows with the same PostalCode were merged into one row, with the neighborhoods separated by commas.
   5. "Not assigned" neighborhoods that have a borough were given the same name as the borough.

* In the last cell of your notebook, use the `.shape` attribute to print the number of rows of your dataframe.

In [12]:
df_Toronto.shape

(103, 3)

## Part 2 - Get the latitude and the longitude coordinates of each neighborhood.

In [19]:
# load the coordinates from the csv file on Coursera
coordinates = pd.read_csv("http://cocl.us/Geospatial_data/Geospatial_Coordinates.csv")
coordinates.rename(columns={"Postal Code": "PostalCode"}, inplace=True)
coordinates.head()

|   | PostalCode | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [21]:
df = df_Toronto.merge(coordinates, on="PostalCode", how="left")
df.head(10)

|   | PostalCode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned | NaN | NaN |
| 1 | M2A | Not assigned | Not assigned | NaN | NaN |
| 2 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 3 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
| 7 | M8A | Not assigned | Not assigned | NaN | NaN |
| 8 | M9A | Etobicoke | Islington Avenue, Humber Valley Village | 43.667856 | -79.532242 |
| 9 | M1B | Scarborough | Malvern, Rouge | 43.806686 | -79.194353 |
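
A left merge keeps every postal code from the left dataframe and leaves `Latitude` and `Longitude` empty wherever no coordinates match, as in the rows for M1A, M2A, and M8A above. One possible way to spot such rows is the `indicator=True` option of `merge`. The sketch below uses small made-up frames; the names `neighbourhoods`, `coords`, `unmatched`, and `clean` are assumptions for illustration only.

```python
import pandas as pd

# Toy data (made up): M1A has no entry in the coordinates table.
neighbourhoods = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
})
coords = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# indicator=True adds a "_merge" column telling where each row came from.
merged = neighbourhoods.merge(coords, on="PostalCode", how="left", indicator=True)

# Postal codes that found no coordinates are marked "left_only".
unmatched = merged.loc[merged["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)

# Drop rows without coordinates before any step that needs them.
clean = merged.dropna(subset=["Latitude", "Longitude"]).drop(columns="_merge")
print(clean)
```

Checking for unmatched rows right after the merge makes missing coordinates visible early, before they reach a plotting or clustering step.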


##### In this part of the project, we got the latitude and the longitude coordinates of each neighborhood.

## Part 3 - Segmenting and Clustering the neighbourhoods in Toronto.

In [15]:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# Use geopy library to get the latitude and longitude values of Toronto Ontario
address = 'Toronto Ontario, TO'
geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto Ontario are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto Ontario are 43.6534817, -79.3839347.


In [18]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# keep only rows that have coordinates; folium raises
# "ValueError: Location values cannot contain NaNs" for a [nan, nan] location
df_map = df.dropna(subset=['Latitude', 'Longitude'])

# add markers to map
for lat, lng, borough, neighborhood in zip(df_map['Latitude'], df_map['Longitude'], df_map['Borough'], df_map['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)

map_toronto