# Applied Data Science Capstone: Week 3 
## Segmenting and Clustering Neighborhoods in the city of Toronto, Canada, Part 2

A Jupyter Notebook that uses pandas and other Python libraries to demonstrate 
*k*-means clustering. It builds on the work in Part 1, which is a separate notebook.
The new work begins under the "Part 2 code starts here" heading.

## Installing libraries

I'm not sure why, but in my environment these two pip commands need
to be in separate cells in the Jupyter Notebook. 

In [1]:
pip install pandas

You should consider upgrading via the '/usr/local/Cellar/jupyterlab/2.2.7/libexec/bin/python3.8 -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install lxml

You should consider upgrading via the '/usr/local/Cellar/jupyterlab/2.2.7/libexec/bin/python3.8 -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


## Import the libraries

I don't use lxml directly, but pandas uses it to parse HTML in `read_html`.
I found during development that it was useful to import it here, so a
missing install shows up before pandas complains about it.

In [3]:
import pandas as pd
import lxml
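
The early check described above can also be done without importing anything. This is a minimal sketch using the standard library; the helper name `missing_modules` is made up for illustration and is not part of the original notebook.

```python
import importlib.util


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Report anything that still needs a pip install before the real work starts.
missing = missing_modules(['pandas', 'lxml'])
if missing:
    print('Install these before continuing: {}'.format(', '.join(missing)))
```

An empty list means both packages can be found, so the imports in the cell above should succeed.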

## Load the data from Wikipedia

This is a pretty cool bit of functionality: `pd.read_html` fetches the page
and returns every HTML table it finds as a list of DataFrames.

In [4]:
# pd.read_html returns a list of DataFrames, one per table on the page
tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

# There are three tables on the page; the one with the postal codes 
# is the first one. If I were putting this in a production environment,
# where I was re-loading the data regularly, I would add some error checking
# to make sure the page's structure hadn't changed.

raw_postal_codes = tables[0]
raw_postal_codes.head()

|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
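
The error checking mentioned in the code comment could look something like this. It is only a sketch: the function name `find_postal_code_table` and the sample tables are invented for illustration, and it assumes the column names stay as they appear in the output above.

```python
import pandas as pd

# Column names as they appear in the table shown above.
EXPECTED_COLUMNS = {'Postal Code', 'Borough', 'Neighbourhood'}


def find_postal_code_table(tables, expected_columns=EXPECTED_COLUMNS):
    """Return the first table that contains every expected column."""
    for table in tables:
        if expected_columns.issubset(set(table.columns)):
            return table
    raise ValueError('no table with columns {}'.format(sorted(expected_columns)))


# Made-up stand-in for the list that pd.read_html returns.
sample_tables = [
    pd.DataFrame({'Unrelated': [1, 2]}),
    pd.DataFrame({'Postal Code': ['M3A'],
                  'Borough': ['North York'],
                  'Neighbourhood': ['Parkwoods']}),
]

checked = find_postal_code_table(sample_tables)
```

Selecting the table by its columns rather than by position means a reordered page still loads, and a renamed column fails loudly instead of silently producing the wrong data.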


## Clean the data

The Wikipedia page appears to have changed since the capstone assignment
was prepared, and some of the cleanup steps the assignment asks for are
already done in the source table. I noticed this after I had written most
of the code.

In [5]:
# remove unassigned boroughs

raw_postal_codes = raw_postal_codes[(raw_postal_codes.Borough != 'Not assigned')]

# As of this writing, there are no rows with an assigned Borough but
# an unassigned Neighbourhood. But the data set is small enough that
# doing this check doesn't hurt.

raw_postal_codes = raw_postal_codes[(raw_postal_codes.Neighbourhood != 'Not assigned')]

# Contrary to the instructions, each postal code is only listed once, and when
# a postal code has more than one neighborhood they are listed on the
# same line, separated by a comma, as required by the assignment. Like I said,
# I noticed that after I had already written the code.

# Since I'm creating a new dictionary, I change the column names here rather
# than use DataFrame.rename()

grouped = raw_postal_codes.groupby('Postal Code')
grouped_data = {'PostalCode':[], 'Borough':[], 'Neighborhood':[]}
for a, b in grouped:
    grouped_data['PostalCode'].append(a)
    grouped_data['Borough'].append(', '.join(b['Borough'].tolist()))
    grouped_data['Neighborhood'].append(', '.join(b['Neighbourhood'].tolist()))

postal_codes = pd.DataFrame(grouped_data)
postal_codes.head()

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
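
For comparison, the same grouping can be sketched with `groupby(...).agg(...)`. This is an alternative, not the code the notebook ran, and the sample rows are made up to show the case where a postal code spans several rows. Taking the first borough per group avoids output like "Scarborough, Scarborough", which joining the borough values would produce in that case.

```python
import pandas as pd

# Made-up sample in the long format the assignment describes:
# one row per neighbourhood, so M1B appears twice.
sample = pd.DataFrame({
    'Postal Code': ['M1B', 'M1B', 'M1C'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighbourhood': ['Malvern', 'Rouge', 'Rouge Hill'],
})

combined = (
    sample.groupby('Postal Code', as_index=False)
          # keep one borough per postal code, join the neighbourhood names
          .agg({'Borough': 'first', 'Neighbourhood': ', '.join})
          .rename(columns={'Postal Code': 'PostalCode',
                           'Neighbourhood': 'Neighborhood'})
)
```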


In [6]:
postal_codes.shape

(103, 3)

<a name="part-2-code-starts-here"></a>

## Part 2 code starts here

### geocoder code from Coursera

I tried this code but, as the description suggested, it didn't work at all. 
I leave it in for completeness, but to actually get the work done I did
something else. 

In [7]:
pip install geocoder

You should consider upgrading via the '/usr/local/Cellar/jupyterlab/2.2.7/libexec/bin/python3.8 -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


In [8]:
import geocoder

# Small change from the Coursera code: I found that geocoder often
# wouldn't respond at all, and the original loop would then run
# forever. So I added the counter as an extra sanity check. It didn't
# seem to matter how many times I tried; geocoder always came back
# with None. I suspect the project is dead: the code has not been
# updated in 2 years.

# initialize your variable to None
lat_lng_coords = None
i = 0

# loop until you get the coordinates or run out of attempts
while lat_lng_coords is None and i <= 100:
    i += 1
    g = geocoder.google('M1B, Toronto, Ontario, Canada')
    lat_lng_coords = g.latlng

latitude = lat_lng_coords[0]
longitude = lat_lng_coords[1]

TypeError: 'NoneType' object is not subscriptable
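
The `TypeError` comes from indexing `lat_lng_coords` after the loop gave up with it still set to `None`. A bounded retry that handles that case might look like the sketch below. The helper `lookup_with_retries` and the stand-in `always_none` are hypothetical; a real call would pass a function that wraps the geocoding service.

```python
def lookup_with_retries(lookup, query, max_attempts=5):
    """Call lookup(query) until it returns coordinates or attempts run out."""
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
    return None


def always_none(query):
    """Stand-in for a geocoder that never answers (hypothetical)."""
    return None


coords = lookup_with_retries(always_none, 'M1B, Toronto, Ontario, Canada')
if coords is None:
    # Handle the failure explicitly instead of indexing None.
    print('no coordinates found')
else:
    latitude, longitude = coords
```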

### Another attempt using geopy

I did a little research and found the geopy Python package, 
but the data behind its Nominatim geocoder is incomplete: it found only
20 of the 103 postal codes in the data set. 

In [None]:
pip install geopy

In [None]:
from geopy.geocoders import Nominatim

found = 0
not_found = 0

geolocator = Nominatim(user_agent="Capstone Week 3")
for i in postal_codes['PostalCode']:
    print(".", end='')
    location = geolocator.geocode("{}, Toronto, Ontario, Canada".format(i))
    if location is not None:
        found += 1
    else:
        not_found += 1

print("\nfound: {}\nnot found: {}".format(found, not_found))
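
The loop above only counts hits and misses. To see which codes were missed, the results can be collected instead, as in the sketch below. The function `geocode_codes` and the stand-in geocoder are invented for illustration; the sketch assumes the geocoder returns either `None` or an object with `latitude` and `longitude` attributes, as geopy's `Location` does. The pause between requests is there because the public Nominatim service asks clients to limit their request rate.

```python
import time
from types import SimpleNamespace


def geocode_codes(codes, geocode, delay_seconds=1.0):
    """Look up each postal code; return (found coordinates, missing codes)."""
    found, missing = {}, []
    for code in codes:
        location = geocode("{}, Toronto, Ontario, Canada".format(code))
        if location is None:
            missing.append(code)
        else:
            found[code] = (location.latitude, location.longitude)
        time.sleep(delay_seconds)  # be polite to the geocoding service
    return found, missing


def fake_geocode(query):
    """Stand-in geocoder (hypothetical): only knows about M1B."""
    if query.startswith('M1B'):
        return SimpleNamespace(latitude=43.806686, longitude=-79.194353)
    return None


# delay_seconds=0 only because the stand-in makes no network requests
found, missing = geocode_codes(['M1B', 'M1C'], fake_geocode, delay_seconds=0)
print("found: {}\nnot found: {}".format(len(found), len(missing)))
```

With the real geocoder, `geolocator.geocode` would be passed in place of `fake_geocode` and the default delay kept.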

### Using the Coursera CSV File

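I tried both geocoders, and neither returned usable coordinates, so I fell
back to the CSV file that Coursera provides. If I were doing this in a
production environment, I think I would get a Google developer API key and
use their service.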

In [11]:
long_lat = pd.read_csv("https://cocl.us/Geospatial_data")

# match the column name used in postal_codes so the merge key lines up
long_lat.rename(columns={'Postal Code':'PostalCode'}, inplace=True)

postal_codes = postal_codes.merge(long_lat, on='PostalCode')
postal_codes.head()

|   | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |


In [12]:
postal_codes.shape

(103, 5)
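
The shape still shows 103 rows, so the default inner merge dropped nothing. To make that check explicit, and to see which codes failed to match if any did, a left merge with pandas' `indicator` and `validate` options is one way to do it. The frames below are made-up samples, and `M9Z` is an invented code included only to show an unmatched row.

```python
import pandas as pd

# Made-up samples; M9Z is an invented code with no coordinates.
codes = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C', 'M9Z'],
    'Borough': ['Scarborough', 'Scarborough', 'Example Borough'],
})
coords = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

merged = codes.merge(
    coords,
    on='PostalCode',
    how='left',             # keep every postal code, matched or not
    validate='one_to_one',  # raise if a code is duplicated on either side
    indicator=True,         # adds a _merge column: 'both' or 'left_only'
)

unmatched = merged.loc[merged['_merge'] == 'left_only', 'PostalCode'].tolist()
print('codes without coordinates: {}'.format(unmatched))
```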