# Segmenting and Clustering Neighborhoods in Toronto

----

In this assignment we use Python to pull data from the Internet for some neighbourhood analysis.

### Step One: Download and clean the neighbourhoods

In [33]:
import pandas as pd


The following code uses `pandas` to scrape a table from a Wikipedia page.  Since the table we want
is the first and only table on the page, the code is a little simpler than if we had to search through
multiple tables.

In [34]:
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]
print(df.shape)
df.head()

(180, 3)


Unnamed: 0,Postal code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront


Cleaning involves removing the rows where the postal code has not been assigned to a Borough and then, for the
postal codes that cover multiple neighbourhoods, replacing the '/' divider with a ','.  The assignment also wanted
Boroughs with no Neighbourhood name to use the Borough as the Neighbourhood.  A check for NaNs or `Not assigned`
in the Neighbourhood column did not find any of these cases.

In [35]:
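df = df[df.Borough != 'Not assigned'].copy()  # copy so the assignment below does not hit a view
df['Neighborhood'] = df['Neighborhood'].str.replace(' /', ',', regex=False)  # literal match, not a regex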

In [36]:
df[df['Neighborhood'].isnull()]

Unnamed: 0,Postal code,Borough,Neighborhood


In [37]:
df[df.Neighborhood == 'Not assigned'] 

Unnamed: 0,Postal code,Borough,Neighborhood


In [39]:
df.head()

Unnamed: 0,Postal code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [38]:
df.shape

(103, 3)

A total of 77 rows were removed in this process (180 - 77 = 103), leaving 103, which matches the number of Toronto Forward Sortation Areas (FSAs) according to the Wikipedia page.  This seems like a good check that the processing has succeeded.
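
The cleaning rules from this step can also be collected into one small function. The following is only a sketch: the function name `clean_postal_codes` and the sample rows are made up for illustration (the `M9Z` row in particular is hypothetical, added to exercise the Borough fallback that never triggered on the real table).

```python
import pandas as pd


def clean_postal_codes(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the Step One cleaning rules to a postal-code table (illustrative helper)."""
    # Drop postal codes with no borough; copy so later assignments
    # do not act on a view of the original frame.
    out = raw[raw["Borough"] != "Not assigned"].copy()

    # Blank out a literal 'Not assigned' neighbourhood so it is treated
    # like a missing value, then fall back to the borough name.
    hood = out["Neighborhood"].mask(out["Neighborhood"] == "Not assigned")
    out["Neighborhood"] = hood.fillna(out["Borough"])

    # Replace the ' /' divider with ',' as a literal (non-regex) match.
    out["Neighborhood"] = out["Neighborhood"].str.replace(" /", ",", regex=False)
    return out.reset_index(drop=True)


# Small hand-made sample; M9Z is a hypothetical row with no neighbourhood name.
sample = pd.DataFrame(
    {
        "Postal code": ["M1A", "M3A", "M5A", "M9Z"],
        "Borough": ["Not assigned", "North York", "Downtown Toronto", "Etobicoke"],
        "Neighborhood": [None, "Parkwoods", "Regent Park / Harbourfront", None],
    }
)

cleaned = clean_postal_codes(sample)
print(cleaned)
```

Keeping the fallback in the function means the same code should still behave sensibly if the Wikipedia table later gains rows with a Borough but no Neighbourhood.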

### Step Two: Get the lat-long coordinates for the postal codes

I didn't have much success getting the geocoder to work, so I went to the backup plan of using the CSV file provided.

In [64]:
ll = pd.read_csv("Geospatial_Coordinates.csv")
print(ll.shape)
ll.head()


(103, 3)


Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


The shape looks good and the head looks reasonable too.  The only difference is in the column name for the postal
code (`Postal Code` here versus `Postal code` in `df`).  That needs adjusting so a join will work easily.

In [68]:
ll.rename(columns={'Postal Code':'Postal code'}, inplace=True)
ll.head()
#result = pd.concat([df, ll], axis=1, sort=False)

Unnamed: 0,Postal code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [69]:
df = pd.merge(df, ll, on='Postal code')  # join on the shared postal code column
df.head()

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


In [70]:
df.shape

(103, 5)

The join looks like it had the correct effect: each postal code now has a latitude and longitude, and the row count is still 103.
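
For a stricter check than eyeballing the shape, `merge` can be asked to verify the join itself. The sketch below uses three rows copied from the outputs above; the variable names `hoods`, `coords`, `merged`, and `unmatched` are illustrative and not part of the notebook. It assumes each postal code should appear exactly once in each table.

```python
import pandas as pd

# Three rows taken from the notebook output above.
hoods = pd.DataFrame(
    {
        "Postal code": ["M1B", "M1C", "M1E"],
        "Borough": ["Scarborough", "Scarborough", "Scarborough"],
        "Neighborhood": [
            "Malvern, Rouge",
            "Rouge Hill, Port Union, Highland Creek",
            "Guildwood, Morningside, West Hill",
        ],
    }
)
coords = pd.DataFrame(
    {
        "Postal Code": ["M1B", "M1C", "M1E"],
        "Latitude": [43.806686, 43.784535, 43.763573],
        "Longitude": [-79.194353, -79.160497, -79.188711],
    }
)

merged = hoods.merge(
    coords.rename(columns={"Postal Code": "Postal code"}),
    on="Postal code",
    how="left",              # keep every neighbourhood row, matched or not
    validate="one_to_one",   # raise if a postal code is duplicated on either side
    indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
)

# Rows without coordinates would show up here as 'left_only'.
unmatched = merged[merged["_merge"] != "both"]
print(len(unmatched))

merged = merged.drop(columns="_merge")
print(merged.shape)
```

With `how="left"` a missing postal code in the coordinates file shows up as an unmatched row instead of silently disappearing, which is what the default inner join would do.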