# Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto

## First part of the assignment (week 3)

### Scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [2]:
import pandas as pd
import numpy as np

In [3]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
df_list = pd.read_html(url, flavor='html5lib')
df_list[0]

|     | Postal Code | Borough | Neighbourhood |
|-----|-------------|---------|---------------|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| ... | ... | ... | ... |
| 175 | M5Z | Not assigned | Not assigned |
| 176 | M6Z | Not assigned | Not assigned |
| 177 | M7Z | Not assigned | Not assigned |
| 178 | M8Z | Etobicoke | Mimico NW, The Queensway West, South of Bloor,... |


In [4]:
df = df_list[0]  # read_html already returns a list of DataFrames; take the first table
print(df.head(5))
df.shape  # (rows, columns)

  Postal Code           Borough              Neighbourhood
0         M1A      Not assigned               Not assigned
1         M2A      Not assigned               Not assigned
2         M3A        North York                  Parkwoods
3         M4A        North York           Victoria Village
4         M5A  Downtown Toronto  Regent Park, Harbourfront


(180, 3)

### Clean up data

#### Only process the rows that have an assigned borough. Ignore rows whose borough is 'Not assigned'.

In [5]:
indexBoroughs = df[df['Borough'] == 'Not assigned'].index # Get index of rows where Borough == 'Not assigned'
print('Index of rows: ', indexBoroughs)

df = df.drop(indexBoroughs) # Drop rows with above condition
df.head(5)

Index of rows:  Int64Index([  0,   1,   7,  10,  15,  16,  19,  24,  25,  28,  29,  33,  34,
             35,  37,  38,  42,  43,  44,  51,  52,  53,  60,  61,  62,  69,
             70,  71,  78,  79,  87,  88,  96,  97, 101, 105, 106, 110, 115,
            118, 119, 123, 124, 125, 127, 128, 131, 132, 133, 134, 136, 137,
            140, 141, 145, 146, 149, 150, 154, 155, 158, 159, 161, 162, 163,
            164, 166, 167, 170, 171, 172, 173, 174, 175, 176, 177, 179],
           dtype='int64')


|     | Postal Code | Borough | Neighbourhood |
|-----|-------------|---------|---------------|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
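
The same filter can also be written as a boolean mask, which skips the step of collecting index labels first. This is only a minimal sketch on a hypothetical three-row frame: `toy` and `assigned` are made-up names for illustration, and only the column names match the scraped table.

```python
import pandas as pd

# Hypothetical sample rows for illustration, not the scraped data
toy = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Keep only the rows whose Borough is assigned
assigned = toy[toy['Borough'] != 'Not assigned']
print(assigned)
```

As with `drop`, the surviving rows keep their original index labels.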


#### More than one neighborhood can exist in one postal code area. For example, M5A has two neighborhoods: Regent Park and Harbourfront. If a postal code is listed on several rows, those rows are combined into one row with the neighborhoods separated by commas; in the scraped table above, M5A already appears as a single combined row.

In [6]:
# Work on a copy so the filtered frame stays available
df_copy = df.copy()

# Count rows whose postal code already appeared on an earlier row (0 for this table)
print('Number of duplicate postal codes: ', df_copy['Postal Code'].duplicated().sum())

# If duplicates did exist, this would merge them into one row per postal code.
# Note that groupby also sorts the result by postal code.
df_copy = df_copy.groupby(['Postal Code', 'Borough'])['Neighbourhood'].apply(', '.join).reset_index()

df_copy.head(5)

Number of duplicate postal codes:  0


|     | Postal Code | Borough | Neighbourhood |
|-----|-------------|---------|---------------|
| 0 | M1B | Scarborough | Malvern, Rouge |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
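
Because the scraped table has no duplicate postal codes, the `groupby` step never actually merges anything here. To illustrate what it would do, the sketch below runs the same call on a hypothetical frame in which M5A is deliberately split over two rows (`toy` and `merged` are made-up names, and the rows are not taken from the scrape).

```python
import pandas as pd

# Hypothetical rows: M5A is split over two rows on purpose
toy = pd.DataFrame({
    'Postal Code': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Harbourfront', 'Regent Park', 'Parkwoods'],
})

# One row per (postal code, borough); neighbourhoods joined with ', '
merged = (
    toy.groupby(['Postal Code', 'Borough'])['Neighbourhood']
       .apply(', '.join)
       .reset_index()
)
print(merged)
```

The two M5A rows collapse into one, and the result comes back sorted by postal code with a fresh 0-based index.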


#### If a cell has a Borough but a 'Not assigned' Neighbourhood, then the neighborhood will be the same as the borough.

In [8]:
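# Where the Neighbourhood is 'Not assigned', copy the Borough name into it
df_copy.loc[df_copy['Neighbourhood'] == 'Not assigned', 'Neighbourhood'] = df_copy['Borough']

# Save the cleaned table; the row index is written as an unnamed first column
df_copy.to_csv('applied_DS_week3_part1.csv')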

df_copy.shape

(103, 3)
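
To show the borough fallback in action, here is a small sketch on a hypothetical frame containing one row with an assigned borough but no neighbourhood. The rows and the names `toy` and `mask` are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical rows: the first has a borough but no neighbourhood
toy = pd.DataFrame({
    'Postal Code': ['M7A', 'M3A'],
    'Borough': ["Queen's Park", 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods'],
})

# Select the rows to fix, then copy the borough into the neighbourhood
mask = toy['Neighbourhood'] == 'Not assigned'
toy.loc[mask, 'Neighbourhood'] = toy.loc[mask, 'Borough']
print(toy)
```

Rows that already have a neighbourhood are left unchanged, since the assignment only touches the masked rows.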