# Week 3 Assignment - Part I

## Requirements:

1. Three columns:  PostalCode, Borough, and Neighborhood  
2. Ignore cells with borough "Not assigned"
3. Combine duplicate postal codes by combining neighborhoods into comma separated list  
4. Records with assigned borough but neighborhood as "Not assigned" should have the neighborhood updated to be the borough  
5. The last cell should use the .shape attribute to print the number of rows in the dataframe
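
As a rough sketch of how requirements 1 to 5 fit together, the snippet below runs the same cleaning steps on a tiny hand-made table. The rows are made up for illustration and are not the scraped data; the source column names (Postcode, Borough, Neighbourhood) are assumed to match the Wikipedia table used later in this notebook.

```python
import pandas as pd

# Toy stand-in for the scraped table; these rows are made up for illustration.
raw = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M5A', 'M5A', 'M7A'],
    'Borough': [None, 'North York', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighbourhood': [None, 'Parkwoods', 'Harbourfront', 'Regent Park', None],
})

# Requirement 2: drop rows whose borough is missing ("Not assigned" read in as NaN)
clean = raw.dropna(subset=['Borough']).copy()

# Requirement 4: a missing neighbourhood takes the borough name
clean['Neighbourhood'] = clean['Neighbourhood'].fillna(clean['Borough'])

# Requirement 3: one row per postal code, neighbourhoods joined with ", "
clean = (clean.groupby(['Postcode', 'Borough'])['Neighbourhood']
              .apply(', '.join)
              .reset_index())

# Requirement 1: final column names
clean.columns = ['PostalCode', 'Borough', 'Neighborhood']

# Requirement 5: shape is an attribute holding (rows, columns)
print(clean.shape)
```

Filling the missing neighbourhoods before grouping matters here: the join would fail on a missing value, so the order of the steps is part of the design.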

In [1]:
# import packages
import pandas as pd
import numpy as np

Read the postal code table from the Wikipedia page provided, then clean it.

In [2]:
# get the data from the url provided
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
postcodes = pd.read_html(url, na_values='Not assigned')[0]

In [3]:
# removing rows with missing boroughs
postcodes.dropna(subset=['Borough'], inplace=True)

In [4]:
# fill missing neighborhoods with the borough name
postcodes['Neighbourhood'] = postcodes['Neighbourhood'].fillna(postcodes['Borough'])

In [5]:
# combine duplicate rows and display the first 5 rows for the reader to check the column names
postcodes = postcodes.groupby(['Postcode','Borough'])['Neighbourhood'].apply(', '.join).reset_index()
postcodes.columns = ['PostalCode', 'Borough', 'Neighborhood']
postcodes.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [6]:
postcodes.shape

(103, 3)

__The dataset has 103 rows__

## Finish Part I

***

# Week 3 Assignment - Part II

## Requirements:

Use the [geocoder package](https://geocoder.readthedocs.io/index.html) to get the latitude and longitude of each neighborhood to create a dataframe showing:  
- PostalCode  
- Borough 
- Neighborhood 
- Latitude 
- Longitude 

Install geocoder using pip because it's much faster than the conda install

In [7]:
!pip install geocoder

# !conda install -c conda-forge geocoder --yes



In [8]:
import geocoder

In [9]:
# create an empty dataframe to hold the results
dfPostCoords = pd.DataFrame(columns=['PostalCode', 'Latitude', 'Longitude'])

In [10]:
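# look up coordinates for each postal code, retrying until the geocoder answers
rows = []
for postal_code in postcodes['PostalCode'].unique():
    coords = None

    while coords is None:
        g = geocoder.arcgis('{}, Toronto, Ontario'.format(postal_code))
        coords = g.latlng

    # unpack only after a successful lookup; indexing None would raise a TypeError
    rows.append({'PostalCode': postal_code, 'Latitude': coords[0], 'Longitude': coords[1]})

# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
dfPostCoords = pd.DataFrame(rows, columns=['PostalCode', 'Latitude', 'Longitude'])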
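
The lookup loop above keeps asking the geocoder until coordinates come back, which could spin forever if the service never answers for a given postal code. A minimal sketch of a bounded retry follows. The helper name geocode_with_retry, its parameters, and the fake lookup are all hypothetical and exist only for illustration; in this notebook the lookup argument would presumably wrap geocoder.arcgis(...).latlng.

```python
import time

def geocode_with_retry(lookup, query, max_tries=3, pause=0.0):
    """Call lookup(query) until it returns coordinates or max_tries is reached.

    `lookup` is any callable returning [lat, lon] or None, for example
    lambda q: geocoder.arcgis(q).latlng (an assumption about how it is wired up).
    """
    for attempt in range(max_tries):
        coords = lookup(query)
        if coords is not None:
            return coords[0], coords[1]
        time.sleep(pause)  # brief wait before asking again
    return None, None  # give up; the caller decides how to handle the gap


# Fake lookup used only to show the behaviour: fails twice, then succeeds.
calls = {'n': 0}

def flaky_lookup(query):
    calls['n'] += 1
    return None if calls['n'] < 3 else [43.65, -79.38]

lat, lon = geocode_with_retry(flaky_lookup, 'M5A, Toronto, Ontario')
print(lat, lon)
```

Passing the lookup in as a callable keeps the retry logic testable without a network connection.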
    


In [11]:
# merge the coordinates into the dataframe holding postal codes and neighborhoods
dfLocations = pd.merge(left=postcodes, right=dfPostCoords, on='PostalCode', how='left')
dfLocations.head()



Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.811525,-79.195517
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.785665,-79.158725
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.765815,-79.175193
3,M1G,Scarborough,Woburn,43.768369,-79.21759
4,M1H,Scarborough,Cedarbrae,43.769688,-79.23944
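
A left merge silently leaves NaN coordinates for any postal code missing from the right-hand table. One possible way to check for that is sketched below on a small example. The two coordinate rows are copied from the output above, and M1E is left out on purpose to show what an unmatched row looks like; the validate and indicator arguments are standard pd.merge options, but using them here is a suggestion rather than what the notebook did.

```python
import pandas as pd

left = pd.DataFrame({'PostalCode': ['M1B', 'M1C', 'M1E'],
                     'Borough': ['Scarborough'] * 3})

# M1E is deliberately missing so the check has something to find
right = pd.DataFrame({'PostalCode': ['M1B', 'M1C'],
                      'Latitude': [43.811525, 43.785665],
                      'Longitude': [-79.195517, -79.158725]})

# validate raises if either side has duplicate keys;
# indicator adds a _merge column saying where each row came from
merged = pd.merge(left, right, on='PostalCode', how='left',
                  validate='one_to_one', indicator=True)

# postal codes that found no coordinates
missing = merged.loc[merged['_merge'] == 'left_only', 'PostalCode'].tolist()
print(missing)
```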


## Finish Part II

***

# Week 3 Assignment - Part III

## Requirements:

Explore and cluster the neighborhoods in Toronto.  One can decide to work with only boroughs that contain the word "Toronto" and then replicate the same analysis as the lectures' New York City data.

1. add enough Markdown cells to explain what you decided to do and to report any observations you make
2. generate maps to visualize your neighborhoods and how they cluster together

Install folium in order to make some nice maps.  Then install geopy.

In [12]:
!pip -q install folium

In [13]:
!pip install geopy
# !conda install -c conda-forge geopy --yes



In [14]:
import folium
import json
from geopy.geocoders import Nominatim
import requests
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated since pandas 1.0
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans


Prepare the data for the map by keeping only the boroughs whose name contains "Toronto"

In [15]:
# filter to only Toronto neighborhoods
toronto = dfLocations[dfLocations.Borough.str.contains('Toronto')].reset_index(drop=True)


Get the coordinates to center the map on Toronto, then create the map of neighborhoods

In [16]:
# get the center of Toronto using Nominatim
address = "Toronto, Ontario"
geolocator = Nominatim(user_agent='tor_explorer')
torLoc = geolocator.geocode(address)
torLat = torLoc.latitude
torLong = torLoc.longitude

# create the map with neighborhoods superimposed on top
mToronto = folium.Map(location=[torLat, torLong], zoom_start=12)

# add markers
for lat, lon, label in zip(toronto['Latitude'], toronto['Longitude'], toronto['Neighborhood']) : 
    lbl = folium.Popup(label, parse_html=True)
    folium.CircleMarker([lat, lon], radius=5, popup=lbl, color='blue', fill=True, fill_color='#3186cc').add_to(mToronto)


    
mToronto

## Which neighborhoods have the highest home prices?

The data comes from an Excel workbook saved to IBM Cloud, so the cell that loads it is hidden for security.  The resulting dataframe is called dfHousing

In [17]:
# The code was removed by Watson Studio for sharing.

In [18]:
# take the dataframe loaded by the hidden cell above and keep only the home prices
dfHousing = dfHousing[['Neighbourhood', 'Home Prices']].reset_index(drop=True)
# empty dataframe to hold each neighborhood's coordinates so we can join later
dfZipLookup = pd.DataFrame(columns=['Neighborhood', 'Latitude', 'Longitude'])
dfZipLookup

Unnamed: 0,Neighborhood,Latitude,Longitude


In [19]:
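# look up coordinates for each neighborhood, retrying until the geocoder answers
rows = []
for neigh in dfHousing['Neighbourhood'].unique():
    coords = None

    while coords is None:
        g = geocoder.arcgis('{}, Toronto, Ontario'.format(neigh))
        coords = g.latlng

    # unpack only after a successful lookup; indexing None would raise a TypeError
    rows.append({'Neighborhood': neigh, 'Latitude': coords[0], 'Longitude': coords[1]})

# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
dfZipLookup = pd.DataFrame(rows, columns=['Neighborhood', 'Latitude', 'Longitude'])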
    
dfZipLookup.head()

Unnamed: 0,Neighborhood,Latitude,Longitude
0,West Humber-Clairville,43.71455,-79.59258
1,Mount Olive-Silverstone-Jamestown,43.74721,-79.58826
2,Thistletown-Beaumond Heights,43.73858,-79.56428
3,Rexdale-Kipling,43.72433,-79.5672
4,Elms-Old Rexdale,43.72429,-79.54956


In [20]:
# join coordinates to neighborhood housing data
# drop the duplicate neighborhood column left over from the merge

dfHousing = pd.merge(left=dfHousing, right=dfZipLookup, left_on='Neighbourhood', right_on='Neighborhood', how='left')
dfHousing = dfHousing.drop(['Neighbourhood'], axis=1)
dfHousing.head()

Unnamed: 0,Home Prices,Neighborhood,Latitude,Longitude
0,317508,West Humber-Clairville,43.71455,-79.59258
1,251119,Mount Olive-Silverstone-Jamestown,43.74721,-79.58826
2,414216,Thistletown-Beaumond Heights,43.73858,-79.56428
3,392271,Rexdale-Kipling,43.72433,-79.5672
4,233832,Elms-Old Rexdale,43.72429,-79.54956


Assign each neighborhood to a cluster based on physical location and house prices.

In [21]:
k = 5
torontodat = dfHousing.drop(columns='Neighborhood')
kmeans = KMeans(n_clusters=k, random_state=0).fit(torontodat)
kmeans.labels_[0:10]

array([2, 2, 2, 2, 2, 2, 0, 0, 3, 4], dtype=int32)

In [22]:
torontodat.insert(0, 'Cluster Labels', kmeans.labels_)
torontodat.head()

Unnamed: 0,Cluster Labels,Home Prices,Latitude,Longitude
0,2,317508,43.71455,-79.59258
1,2,251119,43.74721,-79.58826
2,2,414216,43.73858,-79.56428
3,2,392271,43.72433,-79.5672
4,2,233832,43.72429,-79.54956
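
One thing worth noting about clustering on these three columns: home prices are in the hundreds of thousands while latitude and longitude vary by a few hundredths of a degree, so the distances k-means works with are driven almost entirely by price. A common remedy is to standardize each column first. The sketch below uses the five rows shown above and plain pandas arithmetic; it illustrates the idea and is not what this notebook actually ran.

```python
import pandas as pd

# First rows of the housing table shown above
features = pd.DataFrame({
    'Home Prices': [317508, 251119, 414216, 392271, 233832],
    'Latitude': [43.71455, 43.74721, 43.73858, 43.72433, 43.72429],
    'Longitude': [-79.59258, -79.58826, -79.56428, -79.5672, -79.54956],
})

# Raw spreads differ by many orders of magnitude, so price dominates distances
print(features.std())

# z-score each column: subtract the mean, divide by the standard deviation
scaled = (features - features.mean()) / features.std()
print(scaled.std())
```

After scaling, every column has unit standard deviation, so location and price contribute on a comparable footing. The scaled frame could then be passed to KMeans in place of the raw one.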


Display the map of clusters

In [24]:
map_clusters = folium.Map(location=[torLat, torLong], zoom_start=11)
# set colors for clusters: one evenly spaced plasma color per cluster
colors_array = cm.plasma(np.linspace(0, 1, k))
rainbow = [colors.rgb2hex(i) for i in colors_array]
# add markers; cluster labels run from 0 to k-1, so they index rainbow directly
for lat, lon, price, cluster in zip(torontodat['Latitude'], torontodat['Longitude'], torontodat['Home Prices'], torontodat['Cluster Labels']) : 
    label = folium.Popup(str(price) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker([lat, lon], radius=5, popup=label, color=rainbow[cluster], fill=True, fill_color=rainbow[cluster],
                       fill_opacity=0.7).add_to(map_clusters)
    
map_clusters
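
The color assignment in the cell above can be tried on its own, without drawing a map. This small sketch builds one hex color per cluster from the plasma colormap and looks up a color for each label; the labels are the first ten values printed by kmeans.labels_ earlier, and the names palette and marker_colors are just illustrative.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

k = 5
# k evenly spaced samples from the plasma colormap, converted to hex strings
palette = [colors.rgb2hex(rgba) for rgba in cm.plasma(np.linspace(0, 1, k))]

# KMeans labels run from 0 to k-1, so a label can index the palette directly
labels = [2, 2, 2, 2, 2, 2, 0, 0, 3, 4]
marker_colors = [palette[label] for label in labels]
print(marker_colors)
```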

You may now compare real estate listings and their list prices to these clusters.  Homes listed below the prices of neighboring data points may make for interesting investment opportunities.
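
As a rough sketch of what such a comparison could look like, the snippet below flags listings priced under the median home price of their cluster. The clustered rows are copied from the table above; the listings table, its column names, and its prices are entirely hypothetical, and it assumes each listing has already been assigned a cluster label.

```python
import pandas as pd

# Clustered neighborhoods: prices and labels copied from the first rows above
clustered = pd.DataFrame({
    'Cluster Labels': [2, 2, 2, 2, 2],
    'Home Prices': [317508, 251119, 414216, 392271, 233832],
})

# Hypothetical listings, each already assigned to a cluster
listings = pd.DataFrame({
    'Listing': ['A', 'B'],
    'Cluster Labels': [2, 2],
    'List Price': [250000, 450000],
})

# median home price per cluster
median_price = (clustered.groupby('Cluster Labels')['Home Prices']
                         .median()
                         .rename('Cluster Median')
                         .reset_index())

# attach the cluster median to each listing and flag the cheaper ones
compared = listings.merge(median_price, on='Cluster Labels', how='left')
compared['Below Median'] = compared['List Price'] < compared['Cluster Median']
print(compared)
```

The median is used instead of the mean so that a few very expensive neighborhoods in a cluster do not distort the benchmark.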

## Finish Part III