# Capstone Project Notebook

##### "In this assignment, you will be required to explore, segment, and cluster the neighborhoods in the city of Toronto. However, unlike New York, the neighborhood data is not readily available on the internet. What is interesting about the field of data science is that each project can be challenging in its unique way, so you need to learn to be agile and refine the skill to learn new libraries and tools quickly depending on the project.

##### For the Toronto neighborhood data, a Wikipedia page exists that has all the information we need to explore and cluster the neighborhoods in Toronto. You will be required to scrape the Wikipedia page and wrangle the data, clean it, and then read it into a pandas dataframe so that it is in a structured format like the New York dataset.

##### Once the data is in a structured format, you can replicate the analysis that we did to the New York City dataset to explore and cluster the neighborhoods in the city of Toronto."

In [2]:
# Import the libraries used throughout the notebook
import numpy as np 
import pandas as pd
import json
import seaborn as sns

import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
print('Hello Capstone Project Course!')

Hello Capstone Project Course!


### Leg 1: Segmenting and Clustering Neighborhoods
#### Obtain data from Wikipedia website

In [4]:
# Beautiful Soup was not needed in the end; the lines below are kept for reference

#!pip install BeautifulSoup4
#!pip install requests
#from bs4 import BeautifulSoup
#import requests
#from urllib.request import urlopen
#!pip install lxml

In [None]:
# Install the wikipedia package, used to fetch the page HTML
!pip install wikipedia

In [6]:
import wikipedia as wp

In [7]:
# Fetch the page HTML and parse its first table into a DataFrame
html = wp.page("List of postal codes of Canada: M").html().encode("UTF-8")
df_tor = pd.read_html(html)[0]

# Drop rows where the borough is not assigned; copy so later edits do not touch a view
df_tor = df_tor[df_tor['Borough'] != 'Not assigned'].copy()

# Rename the columns
column_names = ['PostalCode', 'Borough', 'Neighborhood']
df_tor.columns = column_names
df_tor.head(2)

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
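
The filtering and renaming steps above can be tried offline on a small stand-in table. This is a minimal sketch: the rows and the source column names (`Postcode`, `Neighbourhood`) are illustrative assumptions, not a copy of the live Wikipedia table.

```python
import pandas as pd

# Toy rows standing in for the scraped table (illustrative values only;
# the source column names here are an assumption).
raw = pd.DataFrame({
    "Postcode": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only rows with an assigned borough; .copy() gives an independent frame.
assigned = raw[raw["Borough"] != "Not assigned"].copy()

# Rename the columns positionally, as the notebook cell does.
assigned.columns = ["PostalCode", "Borough", "Neighborhood"]

print(assigned)
```

The original row labels (2 and 3 here) survive the filter, which is why the notebook output above does not start at 0.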


In [8]:
# Combine all neighborhoods that share the same postal code and borough
df_tor = df_tor.groupby(['PostalCode', 'Borough'])['Neighborhood'].agg(', '.join).reset_index()

# View the table; after grouping, each row holds one distinct postal code
df_tor.head()

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
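
To see what the grouping step does in isolation, here is a small sketch on hand-made rows (the three rows are a toy subset, with names taken from the output above). It assumes one neighborhood per input row, which is how the table was laid out when this notebook ran.

```python
import pandas as pd

# One neighborhood per row, with a repeated postal code (toy subset).
rows = pd.DataFrame({
    "PostalCode": ["M1B", "M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighborhood": ["Malvern", "Rouge", "Woburn"],
})

# Join the neighborhood names within each (postal code, borough) group.
combined = (
    rows.groupby(["PostalCode", "Borough"])["Neighborhood"]
        .agg(", ".join)
        .reset_index()
)

print(combined)
```

The two `M1B` rows collapse into a single row whose `Neighborhood` value is `Malvern, Rouge`, while `M1G` passes through unchanged.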


In [9]:
# inspect size of dataset
df_tor.shape

(103, 3)
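
The row count alone does not prove that each postal code appears once. A quick check along these lines could confirm it; `has_unique_postal_codes` is a hypothetical helper introduced here for illustration, and the frame is toy data rather than the real `df_tor`.

```python
import pandas as pd

def has_unique_postal_codes(df: pd.DataFrame, col: str = "PostalCode") -> bool:
    """Return True when every value in `col` occurs exactly once."""
    return bool(df[col].is_unique)

# Toy stand-in for the grouped table.
grouped = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M1E"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
})

# Stacking the frame on itself simulates an ungrouped table with repeats.
duplicated = pd.concat([grouped, grouped], ignore_index=True)

print(has_unique_postal_codes(grouped))
print(has_unique_postal_codes(duplicated))
```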

In [13]:
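import io
import requests

# Download the CSV of postal code coordinates
url = "https://cocl.us/Geospatial_data"
s = requests.get(url).content
c = pd.read_csv(io.StringIO(s.decode('utf-8')))

# Attach latitude and longitude to each row by matching on the postal code
df_torl = df_tor.join(c.set_index('Postal Code'), on='PostalCode')
df_torl.head()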

|   | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
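
A left join silently leaves `NaN` coordinates for any postal code missing from the CSV. One possible way to surface such rows is sketched below using `DataFrame.merge` rather than the notebook's `join`. The frames are toy data: the two coordinate pairs are copied from the output above, and `M9Z` is a made-up code added only to produce a miss.

```python
import pandas as pd

neighborhoods = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M9Z"],  # "M9Z" is hypothetical, with no coordinates
    "Borough": ["Scarborough", "Scarborough", "Hypothetical"],
})
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# Left merge keeps every neighborhood row; validate raises if either
# side has duplicate keys.
merged = neighborhoods.merge(
    coords,
    how="left",
    left_on="PostalCode",
    right_on="Postal Code",
    validate="one_to_one",
).drop(columns="Postal Code")

# Postal codes that found no coordinates in the lookup table.
missing = merged.loc[merged["Latitude"].isna(), "PostalCode"].tolist()
print(missing)
```

An empty `missing` list on the real data would mean every postal code received a latitude and longitude.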
