# Segmenting and Clustering Neighborhoods in Toronto
This notebook will be mainly used for the applied data science capstone project which is part of the [IBM Data Science Professional Certificate](https://www.coursera.org/professional-certificates/ibm-data-science)

## About this Assignment
In this assignment, you will be required to explore, segment, and cluster the neighborhoods in the city of Toronto based on the postalcode and borough information.. However, unlike New York, the neighborhood data is not readily available on the internet. What is interesting about the field of data science is that each project can be challenging in its unique way, so you need to learn to be agile and refine the skill to learn new libraries and tools quickly depending on the project.

For the Toronto neighborhood data, a Wikipedia page exists that has all the information we need to explore and cluster the neighborhoods in Toronto. You will be required to scrape the Wikipedia page and wrangle the data, clean it, and then read it into a pandas  dataframe so that it is in a structured format like the New York dataset.

Once the data is in a structured format, you can replicate the analysis that we did to the New York City dataset to explore and cluster the neighborhoods in the city of Toronto.

The submission will be a link to your Jupyter Notebook on your Github repository.

## Review criteria

This assignment will be graded by other peers who are completing this course during the same session. This assignment is worth 15% of your total grade.

You can prepare one notebook for all three parts, but please use Markdown to clearly label your work for each part in order to make it easy for your peers to grade your work. However, you will have to submit the notebook three times since a submission has to be associated with each question. Sorry about that.

<hr>

## [Hints for scraping Notebook](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/NewLinkWebscrapingHints.md)

### Tips for  Webscraping Updated Table in Week3 Peer Graded Assignment

 **After retreiving the URL and creating a Beautiful soup object** 

 **Firstly create a list**  

 **Later after finding the table and table data  create a dictionary called cell having 3 keys PostalCode, Borough and Neighborhood.**

**As postal code contains upto 3 characters extract that using tablerow.p.text**

 **Next use split ,strip and replace functions for getting Borough and Neighborhood information.**.

 **Append to the list**  

 **Create a dataframe with list**

## Sample code

```python
table_contents=[]
table=soup.find('table')
for row in table.findAll('td'):
    cell = {}
    if row.span.text=='Not assigned':
        pass
    else:
        cell['PostalCode'] = row.p.text[:3]
        cell['Borough'] = (row.span.text).split('(')[0]
        cell['Neighborhood'] = (((((row.span.text).split('(')[1]).strip(')')).replace(' /',',')).replace(')',' ')).strip(' ')
        table_contents.append(cell)

# print(table_contents)
df=pd.DataFrame(table_contents)
df['Borough']=df['Borough'].replace(
    {'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
    'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
    'EtobicokeNorthwest':'Etobicoke Northwest',
    'East YorkEast Toronto':'East York/East Toronto',
    'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})
```


<img src="dataframe 1.png"
     style="float: left; margin-right: 10px;"
     width="35%" />

<hr>

## Q1: create dataframe geo_coor

To create the above dataframe:

* The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
* Only process the cells that have an assigned borough. Ignore cells with a borough that is <b> Not assigned </b>.
* More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that <b>M5A </b> is listed twice and has two neighborhoods: <b>Harbourfront</b> and <b>Regent Park</b>. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in <b>row 11</b>  in the above table.

* If a cell has a borough but a <b>Not assigned</b>  neighborhood, then the neighborhood will be the same as the borough.
* Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
* In the last cell of your notebook, use the <b>.shape</b> method to print the number of rows of your dataframe.


Note: There are different website scraping libraries and packages in Python. For scraping the above table, you can simply use pandas  to read the table into a pandas dataframe.

Another way, which would help to learn for more complicated cases of web scraping is using the BeautifulSoup package. Here is the package's main documentation page: [http://beautiful-soup-4.readthedocs.io/en/latest/](http://beautiful-soup-4.readthedocs.io/en/latest/)

Use pandas, or the BeautifulSoup package, or any other way you are comfortable with to transform the data in the table on the Wikipedia page into the above pandas dataframe.

In [1]:
import pandas as pd
import numpy as np 

In [2]:
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
url_data = requests.get(url).text
soup = BeautifulSoup(url_data, 'html5lib')

In [3]:
table_contents=[]
table = soup.find('table')

for row in table.findAll('td'):
    cell = {}
    # ignore cells with borough that is not assigned
    if row.span.text=='Not assigned':
        pass
    else:
        cell['PostalCode'] = row.p.text[:3]
        cell['Borough'] = (row.span.text).split('(')[0]
        cell['Neighborhood'] = (((((row.span.text).split('(')[1]).strip(')')).replace(' /',',')).replace(')',' ')).strip(' ')
        table_contents.append(cell)

# print(table_contents)
geo_coor = pd.DataFrame(table_contents)
geo_coor['Borough'] = geo_coor['Borough'].replace(
    {'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
    'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
    'EtobicokeNorthwest':'Etobicoke Northwest',
    'East YorkEast Toronto':'East York/East Toronto',
    'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})

In [4]:
geo_coor.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government


<hr>

## Q2: find geographical coordinates of a given postal code

Now that you have built a dataframe of the postal code of each neighborhood along with the borough name and neighborhood name, in order to utilize the Foursquare location data, we need to get the latitude and the longitude coordinates of each neighborhood. 

In an older version of this course, we were leveraging the Google Maps Geocoding API to get the latitude and the longitude coordinates of each neighborhood. However, recently Google started charging for their API: [http://geoawesomeness.com/developers-up-in-arms-over-google-maps-api-insane-price-hike/](http://geoawesomeness.com/developers-up-in-arms-over-google-maps-api-insane-price-hike/), so we will use the Geocoder Python package instead: [https://geocoder.readthedocs.io/index.html](https://geocoder.readthedocs.io/index.html).

The problem with this Package is, you have to be persistent sometimes in order to get the geographical coordinates of a given postal code. So you can make a call to get the latitude and longitude coordinates of a given postal code and the result would be None, and then make the call again and you would get the coordinates. So, in order to make sure that you get the coordinates for all of our neighborhoods, you can run a while loop for each postal code.  

In [5]:
# install package if you haven't done so
# !pip install geocoder

import geocoder

In [6]:
# make a copy of the data frame in case I got it wrong
new_geo_coor = geo_coor.copy()

for index, row in geo_coor.iterrows():
    postal_code = row['PostalCode']
    g = geocoder.arcgis('{}, Toronto, Ontario'.format(postal_code))
    lat_lng_coords = g.json

    new_geo_coor['Latitude'] = latitude = lat_lng_coords['lat']
    new_geo_coor['Longitude'] = lat_lng_coords['lng']

In [7]:
new_geo_coor.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.62513,-79.52681
1,M4A,North York,Victoria Village,43.62513,-79.52681
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.62513,-79.52681
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.62513,-79.52681
4,M7A,Queen's Park,Ontario Provincial Government,43.62513,-79.52681


<hr>

### Q3: Cluster and Visualization
Given that this package can be very unreliable, in case you are not able to get the geographical coordinates of the neighborhoods using the Geocoder package, here is a link to a csv file that has the geographical coordinates of each postal code: 

[GeoSpatial Dataset](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv)

Use the Geocoder package or the csv file to create the following dataframe:

<img src="dataframe 2.png"
     style="float: left; margin-right: 10px;"
     width="35%" />

Important Note: There is a limit on how many times you can call geocoder.google function. It is 2500 times per day. This should be way more than enough for you to get acquainted with the package and to use it to get the geographical coordinates of the neighborhoods in the Toronto.

Explore and cluster the neighborhoods in Toronto. You can decide to work with only boroughs that contain the word Toronto and then replicate the same analysis we did to the New York City data. It is up to you. 

Just make sure:

1. to add enough Markdown cells to explain what you decided to do and to report any observations you make. 
2. to generate maps to visualize your neighborhoods and how they cluster together. 

Once you are happy with your analysis, submit a link to the new Notebook on your Github repository. (3 marks)

In [8]:
# import this file in case I did't get the above dataframe right
df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv")
df = df.rename(columns={'Postal Code': 'PostalCode'})
df.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [9]:
new_geo_coor = geo_coor.join(df.set_index('PostalCode'), on='PostalCode', how='inner')
new_geo_coor.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494


In [10]:
!pip install
import folium

# find latitude and lonitude of Toronto
g = geocoder.arcgis('Toronto, Ontario')
lat_lng_coords = g.json
toronto_latitude = lat_lng_coords['lat']
toronto_longitude = lat_lng_coords['lng']

# draw toronto map
toronto_map = folium.Map(location=[toronto_latitude, toronto_longitude], zoom_start=11)
    
toronto_map

## [Hints for scraping Notebook](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/NewLinkWebscrapingHints.md)

### Tips for  Webscraping Updated Table in Week3 Peer Graded Assignment

 **After retreiving the URL and creating a Beautiful soup object** 

 **Firstly create a list**  

 **Later after finding the table and table data  create a dictionary called cell having 3 keys PostalCode, Borough and Neighborhood.**

**As postal code contains upto 3 characters extract that using tablerow.p.text**

 **Next use split ,strip and replace functions for getting Borough and Neighborhood information.**.

 **Append to the list**  

 **Create a dataframe with list**

## Sample code

```python
table_contents=[]
table=soup.find('table')
for row in table.findAll('td'):
    cell = {}
    if row.span.text=='Not assigned':
        pass
    else:
        cell['PostalCode'] = row.p.text[:3]
        cell['Borough'] = (row.span.text).split('(')[0]
        cell['Neighborhood'] = (((((row.span.text).split('(')[1]).strip(')')).replace(' /',',')).replace(')',' ')).strip(' ')
        table_contents.append(cell)

# print(table_contents)
df=pd.DataFrame(table_contents)
df['Borough']=df['Borough'].replace(
    {'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
    'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
    'EtobicokeNorthwest':'Etobicoke Northwest',
    'East YorkEast Toronto':'East York/East Toronto',
    'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})
```


In the previous steps, we have draw a map of Toronto.

Next, we want to cluster the neiborhoods based on areas color, using k means. We will divide them into 4 groups (this number is an arbitrary number). 
The center of the areas will be represented as squares.

In [11]:
from sklearn.cluster import KMeans 

# initialize k means
k_means = KMeans(init="k-means++", n_clusters=4, n_init=12)
df = new_geo_coor[['Latitude', 'Longitude']]
x = df.values.tolist()

# fit the k means mode
k_means.fit(x)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=4, n_init=12, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)

In [12]:
k_means_labels = k_means.labels_
k_means_labels

array([2, 2, 0, 3, 0, 1, 2, 3, 0, 0, 3, 1, 2, 0, 0, 0, 3, 1, 2, 0, 0, 3,
       2, 0, 0, 0, 2, 3, 3, 0, 0, 0, 2, 3, 3, 0, 0, 0, 2, 3, 3, 0, 0, 0,
       2, 3, 1, 0, 0, 1, 1, 2, 3, 1, 0, 3, 1, 1, 2, 3, 1, 3, 3, 1, 1, 2,
       3, 3, 0, 1, 1, 2, 3, 3, 0, 1, 1, 1, 2, 0, 0, 1, 2, 0, 0, 2, 0, 0,
       1, 1, 2, 0, 0, 1, 1, 2, 0, 0, 1, 0, 0, 1, 1])

In [13]:
k_means_cluster_centers = k_means.cluster_centers_
k_means_cluster_centers

array([[ 43.66883403, -79.37416041],
       [ 43.68059059, -79.52478493],
       [ 43.76342274, -79.25682511],
       [ 43.74471936, -79.41377875]])

In [14]:
# set area color code
colors = {0:"#fe8a71", 1:"#f6cd61", 2:"#3da4ab", 3:'#4a4e4d'}

In [15]:
# draw neighborhood label based on cluster
for i, lat, lng, borough, neighborhood in zip(new_geo_coor.index, new_geo_coor['Latitude'], new_geo_coor['Longitude'], new_geo_coor['Borough'], new_geo_coor['Neighborhood']):
    cluster = k_means_labels[i]
    label = '{}, {}, {}'.format(neighborhood, borough, (cluster+1))
    label = folium.Popup(label, parse_html=True)

    folium.CircleMarker(
        [lat, lng],
        radius=3,
        popup=label,
        color=colors[cluster],
        fill=True,
        fill_color=colors[cluster],
        fill_opacity=0.7,
        parse_html=False).add_to(toronto_map)  

In [16]:
# draw neighborhood center based on clusterk_means_cluster_centers[0]

for j in range(0,4):
    lat,lng = [k_means_cluster_centers[j][0], k_means_cluster_centers[j][1]]
    label = '{}'.format(j+1)
    label = folium.Popup(label, parse_html=True)

    folium.RegularPolygonMarker(
        [lat, lng],
        popup=label,
        color=colors[j],
        number_of_sides=4,
        radius=10,
        fill_color=colors[j],
        fill_opacity=0.7).add_to(toronto_map)  

toronto_map

<hr>