# Segmenting and Clustering Neighborhoods in Toronto

For this assignment, you will explore and cluster the neighborhoods in Toronto.

1. Start by creating a new Notebook for this assignment.
2. In the Notebook, write code to scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, extract the data in the table of postal codes, and transform it into a pandas dataframe like the one shown below:

![title](https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/7JXaz3NNEeiMwApe4i-fLg_40e690ae0e927abda2d4bde7d94ed133_Screen-Shot-2018-06-18-at-7.17.57-PM.png?expiry=1587859200000&hmac=HfAJW7cKer9DR4eIizr-tozubLdriY7H9wpJHW3oulY)

3. To create the above dataframe:

  - The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
  - Only process the cells that have an assigned borough. Ignore cells with a borough that is **Not assigned**.
  - More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that **M5A** is listed twice and has two neighborhoods: **Harbourfront** and **Regent Park**. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in **row 11** in the above table.
  - If a cell has a borough but a **Not assigned** neighborhood, then the neighborhood will be the same as the borough.
  - Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
  - In the last cell of your notebook, use the **.shape** attribute to print the number of rows of your dataframe.


4. Submit a link to your Notebook on your GitHub repository. (10 marks)

  **Note:** There are different website scraping libraries and packages in Python. For scraping the above table, you can simply use _pandas_ to read the table into a _pandas_ dataframe.

Another approach, which is worth learning for more complicated web-scraping cases, is the BeautifulSoup package. Here is the package's main documentation page: http://beautiful-soup-4.readthedocs.io/en/latest/

The package is so popular that there is a plethora of tutorials and examples on how to use it. Here is a very good YouTube video on how to use the BeautifulSoup package: https://www.youtube.com/watch?v=ng2o98k983k

Use _pandas_, or the BeautifulSoup package, or any other way you are comfortable with to transform the data in the table on the Wikipedia page into the above _pandas_ dataframe.
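
The cleaning rules listed in step 3 can be sketched on a small hand-made frame. This is a minimal sketch and not the assignment's reference solution: the sample rows and the helper name `clean_postal_codes` are assumptions made for illustration, and the column names follow the PostalCode, Borough, Neighborhood layout described above.

```python
import pandas as pd


def clean_postal_codes(raw):
    """Apply the cleaning rules to a frame with PostalCode, Borough, Neighborhood columns."""
    # Keep only rows whose borough is assigned.
    out = raw[raw["Borough"] != "Not assigned"].copy()
    # A "Not assigned" neighborhood takes the name of its borough.
    unassigned = out["Neighborhood"] == "Not assigned"
    out.loc[unassigned, "Neighborhood"] = out.loc[unassigned, "Borough"]
    # Combine neighborhoods that share a postal code into one comma-separated row.
    out = (
        out.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
        .agg(", ".join)
        .reset_index()
    )
    return out


# Hypothetical sample rows in the layout of the Wikipedia table.
sample = pd.DataFrame(
    {
        "PostalCode": ["M1A", "M5A", "M5A", "M7A"],
        "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
        "Neighborhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
    }
)

cleaned = clean_postal_codes(sample)
print(cleaned)
print(cleaned.shape[0])  # number of rows after cleaning
```

Grouping on both `PostalCode` and `Borough` keeps the borough column in the result; this assumes a postal code never spans two boroughs.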

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import requests

In [2]:
url_list = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
source = requests.get(url_list).text

bs = BeautifulSoup(source, 'html.parser')  # parse the page as HTML
tab = bs.find('table')

column_names = ['Postalcode','Borough','Neighborhood']
df = pd.DataFrame(columns = column_names)

for tr in tab.find_all('tr'):
    row_data=[]
    for td in tr.find_all('td'):
        row_data.append(td.text.strip())
        
    if len(row_data)==3:
        df.loc[len(df)] = row_data
        
df.head()

| | Postalcode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | |
| 1 | M2A | Not assigned | |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park / Harbourfront |


In [3]:
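import geocoder  # third-party package, not imported in In [1]

def get_geocode(postal_code):
    """Return (latitude, longitude) for a Toronto postal code via the Google geocoder."""
    lat_lng_coords = None

    # Retry until a result comes back; this never exits if the service keeps returning None.
    while lat_lng_coords is None:
        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng
    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]

    return latitude, longitude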
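
The loop above retries until the geocoder returns coordinates, so it never ends if the service keeps returning nothing. A bounded variant can be sketched as follows. The names `get_geocode_bounded`, `lookup`, `max_attempts`, and `fake_lookup` are hypothetical and introduced only for illustration; the lookup is passed in as a callable so the sketch runs without network access or the geocoder package.

```python
def get_geocode_bounded(postal_code, lookup, max_attempts=5):
    """Return (latitude, longitude), or None if every attempt fails.

    `lookup` is any callable that takes an address string and returns
    a [lat, lng] pair or None (a hypothetical stand-in for a geocoder call).
    """
    address = '{}, Toronto, Ontario'.format(postal_code)
    for _ in range(max_attempts):
        coords = lookup(address)
        if coords is not None:
            return coords[0], coords[1]
    return None


# Stand-in lookup that fails twice before answering, to exercise the retry.
calls = {'count': 0}


def fake_lookup(address):
    calls['count'] += 1
    if calls['count'] < 3:
        return None
    return [43.806686, -79.194353]  # M1B coordinates as listed in Geospatial_data


result = get_geocode_bounded('M1B', fake_lookup)
print(result)
```

With the real package, the callable could plausibly be `lambda address: geocoder.google(address).latlng`, which mirrors the call used in the cell above.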

In [4]:
geo_df=pd.read_csv('http://cocl.us/Geospatial_data')
geo_df.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [5]:
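# Caution: concat with axis=1 pairs rows by index position, not by postal code, so each
# row gets an unrelated code's coordinates; pd.merge on the postal code columns avoids this.
merge = pd.concat([df,geo_df], axis=1, join='inner')
merge.head()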

| | Postalcode | Borough | Neighborhood | Postal Code | Latitude | Longitude |
|---|---|---|---|---|---|---|
| 0 | M1A | Not assigned | | M1B | 43.806686 | -79.194353 |
| 1 | M2A | Not assigned | | M1C | 43.784535 | -79.160497 |
| 2 | M3A | North York | Parkwoods | M1E | 43.763573 | -79.188711 |
| 3 | M4A | North York | Victoria Village | M1G | 43.770992 | -79.216917 |
| 4 | M5A | Downtown Toronto | Regent Park / Harbourfront | M1H | 43.773136 | -79.239476 |
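
Because `pd.concat(..., axis=1)` pairs rows by index position, the two postal code columns in the output above do not agree (M1A sits next to the coordinates for M1B). A key-based join can be sketched with `pd.merge`. The tiny frames below use made-up placeholder codes and coordinates, not values from the assignment data; only the column names mirror the frames above.

```python
import pandas as pd

# Made-up placeholder rows; only the column names mirror the notebook's frames.
left = pd.DataFrame(
    {
        "Postalcode": ["X1A", "X2A", "X3A"],
        "Borough": ["Not assigned", "North Side", "East Side"],
        "Neighborhood": ["", "Parkview", "Riverside"],
    }
)
right = pd.DataFrame(
    {
        "Postal Code": ["X3A", "X2A"],  # different order, and no X1A entry
        "Latitude": [10.0, 20.0],
        "Longitude": [-10.0, -20.0],
    }
)

# Match rows on the postal code value instead of on row position.
merged = pd.merge(left, right, left_on="Postalcode", right_on="Postal Code", how="inner")
# The right-hand key column duplicates the left one, so drop it.
merged = merged.drop(columns="Postal Code")
print(merged)
```

An inner join keeps only postal codes present in both frames, so a code with no coordinates (X1A here) drops out instead of borrowing another code's latitude and longitude.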


In [6]:
merge = merge[merge['Borough']!='Not assigned']
merge[['Postalcode','Borough','Neighborhood','Latitude','Longitude']].head()

| | Postalcode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 2 | M3A | North York | Parkwoods | 43.763573 | -79.188711 |
| 3 | M4A | North York | Victoria Village | 43.770992 | -79.216917 |
| 4 | M5A | Downtown Toronto | Regent Park / Harbourfront | 43.773136 | -79.239476 |
| 5 | M6A | North York | Lawrence Manor / Lawrence Heights | 43.744734 | -79.239476 |
| 6 | M7A | Downtown Toronto | Queen's Park / Ontario Provincial Government | 43.727929 | -79.262029 |
