## Part One: Segmenting and Clustering Neighborhoods in Toronto

In the next two cells, we import Pandas and NumPy, then install lxml, which Pandas needs in order to parse HTML from the Wikipedia page.

In [62]:
import pandas as pd
import numpy as np

In [63]:
!pip install lxml
print('done')

done


Next, we'll read the table from the Wikipedia page into a Pandas DataFrame so that we can use it in this project. `pd.read_html` returns a list of DataFrames, one per table found on the page, so we take the first one.

In [66]:
link = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

table = pd.read_html(link)

tor_neigh = table[0].iloc[1:]

tor_neigh.columns = ['Postal Code', 'Borough', 'Neighborhood']

tor_neigh

| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| ... | ... | ... | ... |
| 175 | M5Z | Not assigned | Not assigned |
| 176 | M6Z | Not assigned | Not assigned |
| 177 | M7Z | Not assigned | Not assigned |
| 178 | M8Z | Etobicoke | Mimico NW, The Queensway West, South of Bloor,... |


In [65]:
tor_neigh.shape

(179, 3)
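
The slicing and renaming steps used above can be sketched on a small hand-built DataFrame. This is only an illustration: the rows and the placeholder column names `a`, `b`, `c` are made up and are not what `pd.read_html` actually returns for the Wikipedia page. It shows that `.iloc[1:]` drops the first row by position while keeping the original index labels, which is why the displayed index starts at 1.

```python
import pandas as pd

# Hypothetical stand-in for the first table returned by pd.read_html
raw = pd.DataFrame(
    [
        ["X1A", "Not assigned", "Not assigned"],
        ["X2A", "Not assigned", "Not assigned"],
        ["X3A", "Some Borough", "Some Neighborhood"],
    ],
    columns=["a", "b", "c"],
)

# .iloc[1:] drops the first row by position; the remaining index labels are kept
subset = raw.iloc[1:].copy()

# Assigning a list to .columns renames every column, in order
subset.columns = ["Postal Code", "Borough", "Neighborhood"]

print(subset.index.tolist())
print(subset.shape)
```

Taking an explicit `.copy()` here is a design choice in this sketch, not something the original cell does; it makes `subset` independent of `raw` for later assignments.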

As you can see, numerous Boroughs and Neighborhoods are listed as "Not assigned", and these won't be very helpful to us. The next cell cleans up our DataFrame: a Neighborhood that is "Not assigned" takes the name of its Borough, and rows whose Borough is "Not assigned" are dropped. Prior to cleaning, we are looking at 179 Postal Code values.
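
To put a number on how many rows are undefined, one option is to count the rows whose Borough equals the placeholder string. The sketch below uses a made-up four-row DataFrame (the postal codes and borough names are hypothetical), not the real table.

```python
import pandas as pd

# Toy data with hypothetical values; one row has no borough assigned
toy = pd.DataFrame({
    "Postal Code": ["X1A", "X2A", "X3A", "X4A"],
    "Borough": ["Not assigned", "Borough A", "Borough A", "Borough B"],
})

# A boolean comparison yields True/False per row; summing counts the True values
unassigned = int((toy["Borough"] == "Not assigned").sum())
total = len(toy)

print(f"{unassigned} of {total} rows have no borough")
```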

In [68]:
# Drop rows with no Borough; .copy() avoids pandas' SettingWithCopyWarning
tor_neigh = tor_neigh[tor_neigh['Borough'] != 'Not assigned'].copy()
# A Neighborhood that is still 'Not assigned' takes the name of its Borough
tor_neigh['Neighborhood'] = np.where(tor_neigh['Neighborhood'] == 'Not assigned', tor_neigh['Borough'], tor_neigh['Neighborhood'])
tor_neigh


| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| ... | ... | ... | ... |
| 160 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North |
| 165 | M4Y | Downtown Toronto | Church and Wellesley |
| 168 | M7Y | East Toronto | Business reply mail Processing Centre, South C... |
| 169 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... |


In [69]:
tor_neigh.shape

(103, 3)
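
The cleaning logic can also be sketched on a toy DataFrame in a way that sidesteps pandas' `SettingWithCopyWarning`: take an explicit copy of the filtered rows, then assign through `.loc`. The data below is invented for illustration, and the names `toy`, `cleaned`, and `missing` are assumptions of this sketch rather than part of the notebook.

```python
import pandas as pd

# Toy data with hypothetical postal codes and names
toy = pd.DataFrame({
    "Postal Code": ["X1A", "X2A", "X3A"],
    "Borough": ["Not assigned", "Borough A", "Borough B"],
    "Neighborhood": ["Not assigned", "Neighborhood A", "Not assigned"],
})

# Keep only rows with an assigned borough; .copy() gives an independent DataFrame
cleaned = toy[toy["Borough"] != "Not assigned"].copy()

# Where the neighborhood is missing, fall back to the borough name
missing = cleaned["Neighborhood"] == "Not assigned"
cleaned.loc[missing, "Neighborhood"] = cleaned.loc[missing, "Borough"]

print(cleaned)
```

Filtering before assigning means the fallback is only computed for rows that are kept; the end result matches replacing first and filtering afterwards.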

This is looking a little cleaner. Now all of the Boroughs and Neighborhoods are named, allowing us to use every remaining row. Our DataFrame now consists of 103 Postal Code values, meaning 76 of the 179 rows we started with had an unassigned Borough; only about 58% of the data is useful! The only thing left to do is reset our index.

In [79]:
tor_neigh = tor_neigh.reset_index(drop=True)
tor_neigh

| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| ... | ... | ... | ... |
| 98 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North |
| 99 | M4Y | Downtown Toronto | Church and Wellesley |
| 100 | M7Y | East Toronto | Business reply mail Processing Centre, South C... |
| 101 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... |
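
One detail worth keeping in mind: `reset_index` returns a new DataFrame rather than modifying the existing one, so the result has to be assigned back to a variable if the new index should persist. A minimal sketch with a made-up two-row DataFrame:

```python
import pandas as pd

# Toy DataFrame whose index has gaps, as it would after filtering rows
df = pd.DataFrame({"Postal Code": ["X2A", "X3A"]}, index=[2, 3])

# Calling reset_index without assignment leaves df untouched
df.reset_index(drop=True)
before = df.index.tolist()

# Reassigning keeps the renumbered index; drop=True discards the old labels
df = df.reset_index(drop=True)
after = df.index.tolist()

print(before, after)
```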


There we go, one cleaned, re-indexed data set, ready to be used! Cheers!