# Clustering Neighborhoods in Toronto

In [1]:
import pandas as pd

In [2]:
url1 = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
postal = pd.read_html(url1, header=0)
postal

[    Postal code           Borough  \
 0           M1A      Not assigned   
 1           M2A      Not assigned   
 2           M3A        North York   
 3           M4A        North York   
 4           M5A  Downtown Toronto   
 5           M6A        North York   
 6           M7A  Downtown Toronto   
 7           M8A      Not assigned   
 8           M9A         Etobicoke   
 9           M1B       Scarborough   
 10          M2B      Not assigned   
 11          M3B        North York   
 12          M4B         East York   
 13          M5B  Downtown Toronto   
 14          M6B        North York   
 15          M7B      Not assigned   
 16          M8B      Not assigned   
 17          M9B         Etobicoke   
 18          M1C       Scarborough   
 19          M2C      Not assigned   
 20          M3C        North York   
 21          M4C         East York   
 22          M5C  Downtown Toronto   
 23          M6C              York   
 24          M7C      Not assigned   
 25         

In [3]:
type(postal)

list

In [4]:
postal[0]

Unnamed: 0,Postal code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
5,M6A,North York,Lawrence Manor / Lawrence Heights
6,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
7,M8A,Not assigned,
8,M9A,Etobicoke,Islington Avenue
9,M1B,Scarborough,Malvern / Rouge


In [5]:
postal[1]

Unnamed: 0.1,Unnamed: 0,Canadian postal codes,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17
0,NL NS PE NB QC ON MB SK AB BC NU/NT YT A B C E...,NL NS PE NB QC ON MB SK AB BC NU/NT YT A B C E...,NL NS PE NB QC ON MB SK AB BC NU/NT YT A B C E...,,,,,,,,,,,,,,,
1,NL,NS,PE,NB,QC,QC,QC,ON,ON,ON,ON,ON,MB,SK,AB,BC,NU/NT,YT
2,A,B,C,E,G,H,J,K,L,M,N,P,R,S,T,V,X,Y


### As shown, this approach returns a list of 2 data frames. The first one contains the required information, so we'll just create a new variable containing the relevant data frame.

In [6]:
postaldf = postal[0]
postaldf.head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
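Picking the table by position works for this page, but the position could shift if the page layout changes. A minimal sketch of selecting the table by its column names instead, assuming the same three column names seen above. The `tables` list is hand-built toy data standing in for the list returned by `pd.read_html`, and `required` and `matches` are illustrative names, not part of the original notebook:

```python
import pandas as pd

# Toy stand-ins for the list returned by pd.read_html (not the real page data).
tables = [
    pd.DataFrame({
        "Postal code": ["M1A", "M3A"],
        "Borough": ["Not assigned", "North York"],
        "Neighborhood": [None, "Parkwoods"],
    }),
    pd.DataFrame({"Canadian postal codes": ["NL", "A"]}),
]

# Column names we expect in the relevant table (taken from the output above).
required = {"Postal code", "Borough", "Neighborhood"}

# Keep only the tables whose columns include every required name.
matches = [t for t in tables if required.issubset(t.columns)]

# Take the first match; this raises IndexError if no table qualifies.
selected = matches[0]
print(selected.shape)
```

This fails loudly if no table has the expected columns, rather than silently picking the wrong one.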


In [7]:
postaldf.Neighborhood.isna().sum()

77

### We'll just ignore all the observations without an assigned borough

In [8]:
#create a mask that excludes "Not assigned" boroughs, then apply it
print('Before:', postaldf.shape)
na_mask=postaldf['Borough']!='Not assigned'
postaldf = postaldf[na_mask]
print('After:', postaldf.shape)

Before: (180, 3)
After: (103, 3)
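The shapes above show 180 - 103 = 77 rows removed, which matches the 77 missing `Neighborhood` values counted earlier. A small sketch of making that comparison explicit, using a hand-built toy frame (the data and the names `dropped` and `missing_neighborhoods` are illustrative assumptions, not the real table):

```python
import pandas as pd

# Toy frame with the same column layout as postaldf (values are made up).
df = pd.DataFrame({
    "Postal code": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighborhood": [None, None, "Parkwoods", "Victoria Village"],
})

before = len(df)
kept = df[df["Borough"] != "Not assigned"]

# Rows removed by the borough filter.
dropped = before - len(kept)

# Rows with no neighborhood in the unfiltered frame.
missing_neighborhoods = int(df["Neighborhood"].isna().sum())

print(dropped, missing_neighborhoods)
```

When the two counts agree and the filtered frame has no missing neighborhoods left, the unassigned boroughs accounted for every missing neighborhood.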


In [9]:
postaldf.reset_index(inplace=True, drop=True)
postaldf.head(6)

Unnamed: 0,Postal code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Regent Park / Harbourfront
3,M6A,North York,Lawrence Manor / Lawrence Heights
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
5,M9A,Etobicoke,Islington Avenue


### Then, the " / " separator for neighborhoods associated with the same code will be replaced by a comma

In [10]:
#get indexes that contain more than one Neighborhood:
plusone_n = postaldf[postaldf['Neighborhood'].str.contains("/")]
plusone_n.index

Int64Index([  2,   3,   4,   6,   8,  11,  12,  17,  18,  28,  30,  31,  33,
             34,  36,  37,  38,  41,  42,  43,  44,  45,  47,  48,  49,  51,
             52,  55,  56,  57,  58,  63,  65,  69,  71,  74,  75,  77,  80,
             81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  93,  96,  97,
             98, 101, 102],
           dtype='int64')
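The `str.contains` call above works because the rows with missing neighborhoods were already removed. As a hedged sketch on a toy Series (made-up values), passing `regex=False` treats the slash as a literal string, and `na=False` makes missing entries count as non-matches so the result can be used directly as a boolean mask:

```python
import pandas as pd

# Toy Series including a missing value (not the real data).
neighborhoods = pd.Series(["Parkwoods", "Malvern / Rouge", None])

# Literal match on "/", with missing values treated as False.
has_several = neighborhoods.str.contains("/", regex=False, na=False)

print(has_several.tolist())
```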

In [11]:
#replace slashes with commas:
postaldf['Neighborhood'] = postaldf['Neighborhood'].str.replace(' /', ',')
postaldf['Neighborhood'].head(10)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app


0                                      Parkwoods
1                               Victoria Village
2                      Regent Park, Harbourfront
3               Lawrence Manor, Lawrence Heights
4    Queen's Park, Ontario Provincial Government
5                               Islington Avenue
6                                 Malvern, Rouge
7                                      Don Mills
8                Parkview Hill, Woodbine Gardens
9                       Garden District, Ryerson
Name: Neighborhood, dtype: object
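The warning printed above appears because `postaldf` was produced by boolean filtering, so pandas cannot tell whether the assignment targets a copy or the original frame. One common way to avoid it, sketched here on toy data (the frame contents and the names `raw` and `clean` are illustrative assumptions), is to take an explicit `.copy()` right after filtering:

```python
import pandas as pd

# Toy frame (made-up values) with the same columns used in the notebook.
raw = pd.DataFrame({
    "Borough": ["Not assigned", "Downtown Toronto", "North York"],
    "Neighborhood": [None, "Regent Park / Harbourfront", "Parkwoods"],
})

# An explicit copy makes the filtered frame independent of `raw`.
clean = raw[raw["Borough"] != "Not assigned"].copy()

# regex=False replaces the literal " / " separator with ", ".
clean["Neighborhood"] = clean["Neighborhood"].str.replace(" / ", ", ", regex=False)

print(clean["Neighborhood"].tolist())
```

The resulting values are the same as in the output above. The difference is that the assignment is made on a frame pandas knows to be independent.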

### Checking if there are still NaN or 'Not assigned' values in the 'Neighborhood' column:

In [12]:
postaldf['Neighborhood'].isna().sum()

0

In [13]:
postaldf[postaldf['Neighborhood']=='Not assigned']

Unnamed: 0,Postal code,Borough,Neighborhood


### As there aren't any missing values in the Neighborhood column, no further operations are needed to fill them. At this point it's convenient to check how many rows we have:

In [14]:
postaldf.head(12)

Unnamed: 0,Postal code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


In [15]:
postaldf.shape

(103, 3)