## Segmenting and Clustering Neighbourhoods in Toronto

<p>The Toronto neighbourhood data comes from the Wikipedia page <a href="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M">List of postal codes of Canada: M</a>.
</p>


<h4>Import Pandas</h4>

In [1]:
import pandas as pd

In [2]:
pd.__version__

'0.24.1'

'pd.read_html()' reads every table on the Wikipedia page and returns them as a list of DataFrames.

In [3]:
# Read every HTML table on the page with read_html
table=pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M", header=None)

The returned object 'table' is a list.

In [4]:
type(table)

list

The list contains 3 tables.

In [5]:
len(table)

3
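To see why the result is a list, here is a minimal sketch that runs 'pd.read_html()' on a small inline HTML string instead of the live page. The HTML content and the variable names are hypothetical, and the call assumes an HTML parser such as lxml is installed. The 'match' argument is one way to keep only the tables that contain a given text.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-table document standing in for the Wikipedia page
html = """
<table>
  <tr><th>Postcode</th><th>Borough</th></tr>
  <tr><td>M3A</td><td>North York</td></tr>
  <tr><td>M4A</td><td>North York</td></tr>
</table>
<table>
  <tr><th>Code</th><th>Region</th></tr>
  <tr><td>M</td><td>Toronto</td></tr>
</table>
"""

# read_html returns one DataFrame per <table> element it finds
tables = pd.read_html(StringIO(html))
print(type(tables), len(tables))

# match= keeps only the tables whose text matches the string or regex
postcode_tables = pd.read_html(StringIO(html), match="Postcode")
print(postcode_tables[0])
```

Filtering with 'match' avoids relying on the position of the table in the list, which can change if the page layout changes.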

We need only the first table in the list 'table'.

In [6]:
# Display the first five rows of the first table in the list
table[0][0:5]

  Postcode           Borough     Neighbourhood
0      M1A      Not assigned      Not assigned
1      M2A      Not assigned      Not assigned
2      M3A        North York         Parkwoods
3      M4A        North York  Victoria Village
4      M5A  Downtown Toronto      Harbourfront


<p>Convert the first element of the list into a dataframe <b>'df'</b>. The dataframe has three columns: <b>Postcode, Borough, Neighbourhood</b>.</p>

In [7]:
# Select the first table parsed from the HTML content
df=pd.DataFrame(table[0])

# Display the first five rows of the dataframe
df.head()

  Postcode           Borough     Neighbourhood
0      M1A      Not assigned      Not assigned
1      M2A      Not assigned      Not assigned
2      M3A        North York         Parkwoods
3      M4A        North York  Victoria Village
4      M5A  Downtown Toronto      Harbourfront


<p>We need only the rows that have an assigned borough.
    Ignore rows whose borough is <b>Not assigned</b>.
</p>

In [8]:
# Collect the index labels of the rows whose Borough is 'Not assigned'
index_borough=df[df['Borough'].isin(['Not assigned'])].index

# Print the index labels
print(index_borough)

Int64Index([  0,   1,   9,  13,  20,  21,  30,  36,  37,  45,  46,  50,  51,
             52,  54,  55,  59,  60,  61,  73,  74,  75,  88,  89,  90, 104,
            105, 106, 120, 121, 136, 137, 148, 149, 155, 161, 162, 167, 175,
            181, 182, 188, 189, 190, 194, 195, 201, 202, 203, 204, 209, 210,
            223, 224, 237, 238, 241, 242, 247, 248, 253, 254, 258, 259, 260,
            261, 263, 264, 274, 275, 276, 277, 278, 279, 280, 281, 287],
           dtype='int64')


<p>Drop the rows whose borough is <b>Not assigned</b>.</p>

In [9]:
df.drop(index_borough, inplace=True, axis=0) # drop the rows labelled by index_borough
df.head(10)

   Postcode           Borough     Neighbourhood
2       M3A        North York         Parkwoods
3       M4A        North York  Victoria Village
4       M5A  Downtown Toronto      Harbourfront
5       M5A  Downtown Toronto       Regent Park
6       M6A        North York  Lawrence Heights
7       M6A        North York    Lawrence Manor
8       M7A      Queen's Park      Not assigned
10      M9A         Etobicoke  Islington Avenue
11      M1B       Scarborough             Rouge
12      M1B       Scarborough           Malvern
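The two steps above (collect the index labels, then drop them) can also be written as a single boolean filter. The sketch below uses a small hypothetical sample with the same column names as 'df'; it is an alternative formulation, not the notebook's own code.

```python
import pandas as pd

# Hypothetical sample with the same columns as df
sample = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M4A", "M7A"],
    "Borough": ["Not assigned", "North York", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village", "Not assigned"],
})

# Keep the rows whose Borough is assigned; sample itself is left unchanged
assigned = sample[sample["Borough"] != "Not assigned"]
print(assigned)
```

Unlike 'drop' with 'inplace=True', the filter returns a new dataframe, so the unfiltered data stays available for inspection.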


<p>Reset the row index with reset_index() so that it starts from 0 again, and pass <b>drop=True</b> so that the original index is discarded instead of being kept as a new column.
</p>

In [10]:
df.reset_index(drop=True, inplace=True)
df.head()

  Postcode           Borough     Neighbourhood
0      M3A        North York         Parkwoods
1      M4A        North York  Victoria Village
2      M5A  Downtown Toronto      Harbourfront
3      M5A  Downtown Toronto       Regent Park
4      M6A        North York  Lawrence Heights
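The effect of 'drop=True' is easiest to see side by side. This is a minimal sketch on a hypothetical three-row frame whose index starts at 2, as it would after rows have been dropped.

```python
import pandas as pd

# Hypothetical frame whose index no longer starts at 0
filtered = pd.DataFrame({"Postcode": ["M3A", "M4A", "M5A"]}, index=[2, 3, 4])

# Without drop=True the old labels are kept as a column named 'index'
kept = filtered.reset_index()

# With drop=True the old labels are discarded
dropped = filtered.reset_index(drop=True)

print(list(kept.columns))
print(list(dropped.columns))
```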


<p>Inspect the Neighbourhood column for any remaining <b>Not assigned</b> values.</p>

In [11]:
print(df.loc[df['Neighbourhood'].isin(['Not assigned'])])

  Postcode       Borough Neighbourhood
6      M7A  Queen's Park  Not assigned


<p>Where Neighbourhood is Not assigned, copy the Borough value into it so that the neighbourhood matches the borough. For the row at index 6 (postcode M7A), both the Borough and the Neighbourhood columns then hold Queen's Park.</p>

In [12]:
df.loc[df['Neighbourhood'].isin(['Not assigned']), 'Neighbourhood']=df['Borough']
df.iloc[6,:]

Postcode                  M7A
Borough          Queen's Park
Neighbourhood    Queen's Park
Name: 6, dtype: object
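The same replacement can be expressed with 'Series.mask', which swaps in the aligned Borough value wherever the condition is true. The sample data below is hypothetical and this is only one possible alternative to the 'loc' assignment.

```python
import pandas as pd

# Hypothetical two-row sample with the same column names as df
sample = pd.DataFrame({
    "Borough": ["North York", "Queen's Park"],
    "Neighbourhood": ["Parkwoods", "Not assigned"],
})

# Where Neighbourhood is 'Not assigned', take the value from Borough
# on the same row; all other rows keep their Neighbourhood
sample["Neighbourhood"] = sample["Neighbourhood"].mask(
    sample["Neighbourhood"] == "Not assigned", sample["Borough"]
)
print(sample)
```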

In [13]:
df['Borough'].value_counts()

Etobicoke           45
North York          38
Scarborough         37
Downtown Toronto    37
Central Toronto     17
West Toronto        13
York                 9
East Toronto         7
East York            6
Mississauga          1
Queen's Park         1
Name: Borough, dtype: int64

In [14]:
df.head()

  Postcode           Borough     Neighbourhood
0      M3A        North York         Parkwoods
1      M4A        North York  Victoria Village
2      M5A  Downtown Toronto      Harbourfront
3      M5A  Downtown Toronto       Regent Park
4      M6A        North York  Lawrence Heights


In [15]:
df_sort=df.sort_values('Postcode')
df_sort.head()

   Postcode      Borough   Neighbourhood
8       M1B  Scarborough           Rouge
9       M1B  Scarborough         Malvern
23      M1C  Scarborough      Port Union
22      M1C  Scarborough      Rouge Hill
21      M1C  Scarborough  Highland Creek


In [16]:
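# Compare each postcode in the sorted frame with the one on the next row
# and report every match (dict renamed to avoid shadowing the built-in)
shared_postcodes={}
postcodes=df_sort['Postcode'].tolist()
for current, following in zip(postcodes, postcodes[1:]):
    if current == following:
        print('yes')

In [None]:
shared_postcodes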
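If the aim of the row-by-row comparison above is to find postcodes shared by several rows and combine their neighbourhoods (an assumption; the notebook does not state the goal), pandas can do this without an explicit loop. The sketch below uses a hypothetical sample with the same column names as 'df_sort'.

```python
import pandas as pd

# Hypothetical sample with the same columns as df_sort
sample = pd.DataFrame({
    "Postcode": ["M1B", "M1B", "M3A", "M5A", "M5A"],
    "Borough": ["Scarborough", "Scarborough", "North York",
                "Downtown Toronto", "Downtown Toronto"],
    "Neighbourhood": ["Rouge", "Malvern", "Parkwoods",
                      "Harbourfront", "Regent Park"],
})

# Rows whose Postcode occurs more than once
shared = sample[sample["Postcode"].duplicated(keep=False)]

# One row per (Postcode, Borough), with the neighbourhoods joined by ", "
combined = (
    sample.groupby(["Postcode", "Borough"], sort=False)["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
print(combined)
```

Grouping on both Postcode and Borough keeps the Borough column in the result; 'sort=False' preserves the order in which the postcodes first appear.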