## Assignment

In this assignment, you will explore, segment, and cluster the neighborhoods in the city of Toronto. Unlike New York, however, the neighborhood data is not readily available on the internet. Part of what makes data science interesting is that each project is challenging in its own way, so you need to be agile and able to pick up new libraries and tools quickly as the project demands.

For the Toronto neighborhood data, a Wikipedia page exists that has all the information we need to explore and cluster the neighborhoods in Toronto. You will scrape the Wikipedia page, wrangle and clean the data, and then read it into a pandas dataframe so that it is in a structured format like the New York dataset.

Once the data is in a structured format, you can replicate the analysis that we performed on the New York City dataset to explore and cluster the neighborhoods in the city of Toronto.

Your submission will be a link to your Jupyter Notebook in your GitHub repository.

In [None]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda update -n base -c defaults conda -y # update Conda

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # comment this line out if folium is already installed
import folium # map rendering library

print('Libraries imported.')

<a id='item1'></a>

## 1. Scrape Wiki Page List of Toronto Neighborhoods

Use BeautifulSoup to scrape the Wikipedia page that lists the Toronto postal codes (https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M).

In [2]:
!pip install beautifulsoup4

import urllib.request
import urllib.parse
import urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

# Fetch the page; if the request fails, ask for another URL and retry
html = None
while html is None:
    try:
        html = urllib.request.urlopen(url, context=ctx).read()
    except (urllib.error.URLError, ValueError) as err:
        print("Error:", err)
        url = input("Enter site: ")

soup = BeautifulSoup(html, "html.parser")

#print(soup.prettify())

# Retrieve all of the table cells (<td> tags)
tags = soup('td')

scraped_data = list()
for tag in tags:
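    try:
        #print('TAG:', tag.contents[0].string)
        # First child of the cell: plain text, or a link wrapping the text
        word = str(tag.contents[0].string)
        scraped_data.append(word)
    except IndexError:
        # Stop at the first cell that has no contents
        break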

# The cells repeat in groups of three: postal code, borough, neighborhood
postal_code = scraped_data[0::3]
borough = scraped_data[1::3]
# Neighborhood cells end with a trailing newline, so strip it
neighborhood = [item.rstrip() for item in scraped_data[2::3]]

#print(postal_code)
#print(borough)
#print(neighborhood)


zipped_list = list(zip(postal_code, borough, neighborhood))
zipped_list

Collecting beautifulsoup4
  Downloading https://files.pythonhosted.org/packages/cb/a1/c698cf319e9cfed6b17376281bd0efc6bfc8465698f54170ef60a485ab5d/beautifulsoup4-4.8.2-py3-none-any.whl (106kB)
Collecting soupsieve>=1.2 (from beautifulsoup4)
  Downloading https://files.pythonhosted.org/packages/81/94/03c0f04471fc245d08d0a99f7946ac228ca98da4fa75796c507f61e688c2/soupsieve-1.9.5-py2.py3-none-any.whl
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.8.2 soupsieve-1.9.5


[('M1A', 'Not assigned', 'Not assigned'),
 ('M2A', 'Not assigned', 'Not assigned'),
 ('M3A', 'North York', 'Parkwoods'),
 ('M4A', 'North York', 'Victoria Village'),
 ('M5A', 'Downtown Toronto', 'Harbourfront'),
 ('M6A', 'North York', 'Lawrence Heights'),
 ('M6A', 'North York', 'Lawrence Manor'),
 ('M7A', 'Downtown Toronto', "Queen's Park"),
 ('M8A', 'Not assigned', 'Not assigned'),
 ('M9A', "Queen's Park", 'Not assigned'),
 ('M1B', 'Scarborough', 'Rouge'),
 ('M1B', 'Scarborough', 'Malvern'),
 ('M2B', 'Not assigned', 'Not assigned'),
 ('M3B', 'North York', 'Don Mills North'),
 ('M4B', 'East York', 'Woodbine Gardens'),
 ('M4B', 'East York', 'Parkview Hill'),
 ('M5B', 'Downtown Toronto', 'Ryerson'),
 ('M5B', 'Downtown Toronto', 'Garden District'),
 ('M6B', 'North York', 'Glencairn'),
 ('M7B', 'Not assigned', 'Not assigned'),
 ('M8B', 'Not assigned', 'Not assigned'),
 ('M9B', 'Etobicoke', 'Cloverdale'),
 ('M9B', 'Etobicoke', 'Islington'),
 ('M9B', 'Etobicoke', 'Martin Grove'),
 ('M9B', 'Et
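
Collecting every `td` on the page and regrouping the cells in threes works only while every row has exactly three cells and no other table precedes the one of interest. As a hedged alternative, the sketch below reads the table row by row so that each tuple comes from a single `tr`. It deliberately uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra installs, and it parses a small hypothetical HTML fragment (`sample_html`) rather than the live Wikipedia page. The class name `TableRowParser` is an illustrative assumption, not part of any library.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a tuple of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td' and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text inside nested tags such as <a> also arrives here
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == 'td' and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            if self._row:   # skip header rows, which hold only <th> cells
                self.rows.append(tuple(self._row))
            self._row = None


# Hypothetical fragment shaped like the postal code table
sample_html = """
<table>
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
<tr><td>M3A</td><td><a href="/wiki/North_York">North York</a></td><td><a href="/wiki/Parkwoods">Parkwoods</a>
</td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(sample_html)
sample_rows = parser.rows
print(sample_rows)
```

Because each tuple is built from one row, a missing or extra cell affects only that row instead of shifting every later value into the wrong column.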

In [137]:
import pandas as pd
import re

neighborhoods = pd.DataFrame(zipped_list, columns = ['Postal Code' , 'Borough', 'Neighborhood']) 

# Drop rows if borough is 'Not assigned'. Reset index.

index_bor = neighborhoods[ neighborhoods['Borough'] == 'Not assigned' ].index
 
neighborhoods.drop(index_bor, inplace=True)

neighborhoods.reset_index(drop=True, inplace=True)

# Replace neighborhood name with name of borough if neighborhood not assigned.
# Using .loc avoids the chained assignment that triggers SettingWithCopyWarning.

not_assigned = neighborhoods['Neighborhood'] == 'Not assigned'
neighborhoods.loc[not_assigned, 'Neighborhood'] = neighborhoods.loc[not_assigned, 'Borough']

# Combine rows that share a Postal Code into one row, joining all of their
# neighborhood names with ', ' (sort=False keeps the scraped order)

neighborhoods = (
    neighborhoods
    .groupby('Postal Code', sort=False, as_index=False)
    .agg({'Borough': 'first', 'Neighborhood': ', '.join})
)

neighborhoods.head(20)


| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor |
| 4 | M7A | Downtown Toronto | Queen's Park |
| 5 | M9A | Queen's Park | Queen's Park |
| 6 | M1B | Scarborough | Rouge, Malvern |
| 7 | M3B | North York | Don Mills North |
| 8 | M4B | East York | Woodbine Gardens, Parkview Hill |
| 9 | M5B | Downtown Toronto | Ryerson, Garden District |
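
The three cleaning rules used above (drop rows whose borough is 'Not assigned', fall back to the borough name when the neighborhood is 'Not assigned', and combine neighborhoods that share a postal code) can be sanity-checked on a handful of rows in plain Python. The sketch below is only an illustration under stated assumptions: `sample` is a hypothetical excerpt in the same shape as `zipped_list`, and `clean_rows` is a helper name introduced here, not part of pandas or the notebook.

```python
# Hypothetical excerpt in the (postal code, borough, neighborhood) shape of zipped_list
sample = [
    ('M1A', 'Not assigned', 'Not assigned'),
    ('M6A', 'North York', 'Lawrence Heights'),
    ('M6A', 'North York', 'Lawrence Manor'),
    ('M9A', "Queen's Park", 'Not assigned'),
    ('M9B', 'Etobicoke', 'Cloverdale'),
    ('M9B', 'Etobicoke', 'Islington'),
    ('M9B', 'Etobicoke', 'Martin Grove'),
]


def clean_rows(rows):
    """Apply the three cleaning rules and return one tuple per postal code."""
    merged = {}  # postal code -> [borough, list of neighborhood names]
    for code, borough, hood in rows:
        if borough == 'Not assigned':
            continue                      # rule 1: drop unassigned boroughs
        if hood == 'Not assigned':
            hood = borough                # rule 2: fall back to the borough name
        entry = merged.setdefault(code, [borough, []])
        entry[1].append(hood)             # rule 3: collect names per postal code
    # dicts preserve insertion order, so the original row order is kept
    return [(code, borough, ', '.join(hoods))
            for code, (borough, hoods) in merged.items()]


cleaned = clean_rows(sample)
for row in cleaned:
    print(row)
```

A postal code with three neighborhoods, such as M9B in this sample, is a useful check because a merge that only compares each row with the next one would keep just two of the three names on the surviving row.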


In [14]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

neighborhoods.shape


The dataframe has 11 boroughs and 210 neighborhoods.


(210, 3)