# Segmenting and Clustering Neighborhoods in Toronto  
# PART - 1

##### First, import NumPy and pandas

In [2]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
# I will be using pandas to scrape the webpage
# BeautifulSoup is in fact not required

#### Next, scrape the data from the Wikipedia page into a pandas dataframe  
Let's use pandas' 'read_html' function. It returns a list of the tables found on the page, and the first one is the table we need.

In [3]:
wikilink="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
raw_df0=pd.read_html(wikilink, header=0)[0]
# let's check the first 10 rows of the dataframe
raw_df0.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned


#### The header needs a bit of cleanup.  
'Postcode' and 'Neighbourhood' should be renamed to 'PostalCode' and 'Neighborhood'.

In [4]:
raw_df0.rename(columns={'Postcode': 'PostalCode', 'Neighbourhood': 'Neighborhood'}, inplace= True)
raw_df0.head(5)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


#### Now let's check the number of rows and columns with the 'shape' attribute

In [5]:
raw_df0.shape

(288, 3)

The dataframe has 288 rows and 3 columns.  
Columns 'Borough' and 'Neighborhood' contain 'Not assigned' in some rows.  
Now count the number of records with 'Not assigned' in each column.

In [6]:
print ('number of \'Not assigned\' in Borough = ', raw_df0.Borough.str.count('Not assigned').sum())
print ('number of \'Not assigned\' in Neighborhood = ', raw_df0.Neighborhood.str.count('Not assigned').sum())

number of 'Not assigned' in Borough =  77
number of 'Not assigned' in Neighborhood =  78
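
Note that 'str.count' counts pattern matches inside each string, so it gives a row count here only because each cell holds the label at most once. A minimal sketch of an exact-match count follows. The `toy` frame is made-up illustration data, not the Wikipedia table.

```python
import pandas as pd

# Hypothetical toy data standing in for the scraped table
toy = pd.DataFrame({
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Exact-match comparison yields booleans; summing them counts the True values
borough_missing = int(toy["Borough"].eq("Not assigned").sum())
neighborhood_missing = int(toy["Neighborhood"].eq("Not assigned").sum())

print("Borough:", borough_missing, "Neighborhood:", neighborhood_missing)
```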


#### As per the rule, only process the cells that have an assigned Borough.  
So we eliminate the 77 records whose Borough is 'Not assigned'.  
There should be 211 rows remaining in the dataframe (288 - 77), one of which has no Neighborhood assigned.

In [7]:
# select the index labels of rows with 'Borough' equal to 'Not assigned'
indexNames = raw_df0[raw_df0['Borough'] == 'Not assigned' ].index

# now drop all the selected rows
raw_df1= raw_df0.drop(indexNames)

# reset the index after deleting rows so it runs from 0 again
raw_df1.reset_index(drop=True, inplace=True)

# let's check how many rows are now in the dataframe
raw_df1.shape[0]

211
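
The same filtering can also be written with a boolean mask instead of collecting index labels and dropping them. The sketch below is one possible equivalent, shown on a hypothetical `toy` frame (the data and the names `kept` and `n_unassigned` are assumptions for illustration). It also checks the arithmetic used above: remaining rows = total rows minus unassigned boroughs.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the notebook
toy = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Keep only rows whose Borough is assigned, then renumber the index from 0
kept = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)

# Remaining row count should equal the total minus the unassigned boroughs
n_unassigned = int((toy["Borough"] == "Not assigned").sum())
print(len(kept) == len(toy) - n_unassigned)
```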

Perfect!   
#### Now find the row with Neighborhood as 'Not assigned' and assign the corresponding Borough name to the Neighborhood

In [8]:
raw_df1.loc[raw_df1['Neighborhood']=='Not assigned']

Unnamed: 0,PostalCode,Borough,Neighborhood
6,M7A,Queen's Park,Not assigned


Just replace the string and verify the dataframe

In [9]:
raw_df1.loc[raw_df1['Neighborhood']=='Not assigned','Neighborhood'] = raw_df1[raw_df1['Neighborhood']=='Not assigned']['Borough']
raw_df1.loc[raw_df1['PostalCode']=='M7A']


Unnamed: 0,PostalCode,Borough,Neighborhood
6,M7A,Queen's Park,Queen's Park
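
The assignment above relies on pandas aligning the right-hand side to the selected rows by index. An alternative that some readers may find easier to follow is 'Series.mask', which replaces values where a condition holds. This is only a sketch on a hypothetical `toy` frame, not the notebook's data.

```python
import pandas as pd

# Hypothetical toy data with one unassigned neighborhood
toy = pd.DataFrame({
    "PostalCode": ["M3A", "M7A"],
    "Borough": ["North York", "Queen's Park"],
    "Neighborhood": ["Parkwoods", "Not assigned"],
})

# Where Neighborhood is 'Not assigned', take the Borough value from the same row
toy["Neighborhood"] = toy["Neighborhood"].mask(
    toy["Neighborhood"] == "Not assigned", toy["Borough"]
)

print(toy)
```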


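Nice! Postal code M7A now has both Borough and Neighborhood set to 'Queen's Park'.

Next, group the rows by PostalCode so that each postal code appears once, keeping its Borough and joining its Neighborhoods into a comma-separated string.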

In [10]:
df=(raw_df1.groupby('PostalCode').agg({'Borough':'first','Neighborhood' : ', '.join})
    .reset_index()
   )
df.head(25) # check a few extra rows to verify the Boroughs are listed

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"
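
To see what the 'groupby' and 'agg' step does in isolation, here is a small sketch on a hypothetical `toy` frame with one duplicated postal code. The rows are illustrative only.

```python
import pandas as pd

# Hypothetical toy data: M5A appears twice with different neighborhoods
toy = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# One row per postal code: keep the first Borough, join the Neighborhoods
grouped = (
    toy.groupby("PostalCode")
       .agg({"Borough": "first", "Neighborhood": ", ".join})
       .reset_index()
)

print(grouped)
```

Because 'groupby' sorts the group keys by default, the result is ordered by PostalCode rather than by first appearance.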


In [11]:
# let's check the number of rows in the clean dataframe
print('Cleaned dataframe \'df\' has ',df.shape[0], ' rows')

Cleaned dataframe 'df' has  103  rows
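
Beyond the row count, the cleaning rules themselves can be checked directly. The helper below, `check_clean`, is a hypothetical name introduced here as a sketch and is not part of the original notebook. It is demonstrated on made-up data, but it could be called on 'df' as well.

```python
import pandas as pd

def check_clean(frame: pd.DataFrame) -> None:
    """Raise AssertionError if the frame violates the cleaning rules."""
    # Each postal code should appear exactly once after grouping
    assert frame["PostalCode"].is_unique, "duplicate postal codes remain"
    # No unassigned boroughs or neighborhoods should be left
    assert not frame["Borough"].eq("Not assigned").any(), "unassigned Borough"
    assert not frame["Neighborhood"].eq("Not assigned").any(), "unassigned Neighborhood"

# Hypothetical toy data that satisfies all three rules
toy = pd.DataFrame({
    "PostalCode": ["M3A", "M7A"],
    "Borough": ["North York", "Queen's Park"],
    "Neighborhood": ["Parkwoods", "Queen's Park"],
})

check_clean(toy)  # passes silently when the rules hold
```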


#### --end of part 1--