# Capstone Week 3 Project - Part 2

#### To set the context, the section below is a continuation of Part 1. Scroll to the bottom to pick up with Part 2.

Get neighborhood data for Toronto

In [1]:
# Import all needed packages here
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests
import lxml
from pprint import pprint
from collections import defaultdict
import csv
print('Libraries imported.')

Libraries imported.


In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
# go get the web page
web_page = requests.get(url)
# make some tasty soup
soup = BeautifulSoup(web_page.content, 'html.parser')

I viewed the Wikipedia page in Firefox then pressed Ctrl+U to see the source. The data is in a table of class 'wikitable'. With this I will:
1. Get the HTML for that specific table
2. Parse out the row data
3. Load the data into a Python list
4. Create the pandas DataFrame from the list

**Note:** I am using ``defaultdict`` from the handy Python ``collections`` module to accumulate the list of neighborhoods for each (borough, postal code) pair. Then I do some data wrangling to transform that list into a comma-separated string and append the result to the list used for pandas DataFrame creation.
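The grouping step described in the note can be tried in isolation. The following is a minimal sketch, assuming a few hypothetical sample rows in place of the scraped table; the helper name `group_neighborhoods` is illustrative and not part of the notebook.

```python
from collections import defaultdict

# Hypothetical sample rows in (postal code, borough, neighborhood) form.
rows = [
    ("M5A", "Downtown Toronto", "Harbourfront"),
    ("M5A", "Downtown Toronto", "Regent Park"),
    ("M7A", "Queen's Park", "Not assigned"),
    ("M1A", "Not assigned", "Not assigned"),
]


def group_neighborhoods(rows):
    """Group neighborhoods by (borough, postal code) and join them into one string."""
    grouped = defaultdict(list)
    for pc, boro, nbh in rows:
        if boro == "Not assigned":  # skip rows with no borough
            continue
        if nbh == "Not assigned":  # fall back to the borough name
            nbh = boro
        grouped[(boro, pc)].append(nbh)
    # Sort by (borough, postal code) and emit [postal code, borough, joined names]
    return [
        [pc, boro, ", ".join(hoods)]
        for (boro, pc), hoods in sorted(grouped.items())
    ]


result = group_neighborhoods(rows)
```

Because `defaultdict(list)` creates the empty list on first access, no explicit key check is needed before calling `append`.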

#### DataFrame Creation Below...

In [30]:
trono_fsas = soup.find('table', class_='wikitable')
html_rows = trono_fsas.find_all('tr')
data = []
d_pc_boro = defaultdict(list)
# loop thru the rows, populate dict with key (boro, pc), append list of neighborhoods
for tr in html_rows:
    td = tr.find_all('td')
    row = [cell.text.strip() for cell in td if cell.text.strip()]  # avoid reusing the name tr
    if len(row) == 3:
        pc = row[0]
        boro = row[1]
        nbh = row[2]
        if boro == 'Not assigned':  # ignore cells with unassigned Borough
            continue
        if nbh == 'Not assigned':  # name unassigned neighborhoods to the name of the Borough
            nbh = boro
        d_pc_boro[(boro, pc)].append(nbh)  # group the neighborhoods
# loop thru the dict, turn list of neighborhoods to a string and output a list for pandas
for (bro, pcode), hoods in sorted(d_pc_boro.items()):
    nbstr = ', '.join(hoods)  # concatenate the hoods
    data.append([pcode, bro, nbstr])

df = pd.DataFrame(data, columns=['PostalCode', 'Borough', 'Neighborhood'])

In [4]:
df.shape

(103, 3)

---

### Part 2 starts here. Make sure the above cells are run...

``Click Cell > Run All``

Download the Geospatial data

In [5]:
!conda install -c conda-forge geocoder --yes
import geocoder
print('Libraries imported.')

Solving environment: done

## Package Plan ##

  environment location: /home/steve.orr/anaconda3

  added / updated specs: 
    - geocoder


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    geocoder-1.38.1            |             py_0          52 KB  conda-forge
    ratelim-0.1.6              |           py36_0           5 KB  conda-forge
    orderedset-2.0             |           py36_0         231 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         288 KB

The following NEW packages will be INSTALLED:

    geocoder:   1.38.1-py_0  conda-forge
    orderedset: 2.0-py36_0   conda-forge
    ratelim:    0.1.6-py36_0 conda-forge


Downloading and Extracting Packages
geocoder-1.38.1      | 52 KB     | ##################################### | 100% 
ratelim-0.1.6        | 5 KB      | ##########################

In [31]:
# Looks like arcgis works okay - Google is not dependable
#g = geocoder.google('Toronto, Ontario')
g = geocoder.arcgis('Toronto, Ontario')
print(g.latlng)

[43.648690000000045, -79.38543999999996]


In [13]:
for idx, row in df.iterrows():
    lat = lng = None
    while lng is None:
        gc = geocoder.arcgis('{}, Toronto, Ontario'.format(row['PostalCode']))
        if gc.latlng:  # skip unpacking when the lookup returned no coordinates, then retry
            lat, lng = gc.latlng
    df.at[idx, 'Latitude'] = lat
    df.at[idx, 'Longitude'] = lng
    print(row['PostalCode'], lat, lng)

print('Done getting coordinates.')

M4N 43.728135000000066 -79.38709009599995
M4P 43.71275500000007 -79.38851449699996
M4R 43.71452278400005 -79.40695999999997
M4S 43.702765000000056 -79.38576922699997
M4T 43.69050500000003 -79.38297337799997
M4V 43.68600329800006 -79.40233499999994
M5N 43.711941154000044 -79.41911999999996
M5P 43.69478500000008 -79.41440483299994
M5R 43.674840000000074 -79.40369769099993
M4W 43.68196000000006 -79.37844455599998
M4X 43.66815500000007 -79.36660016899998
M4Y 43.666585000000055 -79.38130203699995
M5A 43.65512000000007 -79.36263979699999
M5B 43.65736301100003 -79.37817999999999
M5C 43.65121000000005 -79.37548057699996
M5E 43.64516015600003 -79.37367499999993
M5G 43.65609081300005 -79.38492999999994
M5H 43.64970000000005 -79.38258157399997
M5J 43.623470000000054 -79.39397931299999
M5K 43.64839853600006 -79.38393934999993
M5L 43.64839500000005 -79.37886491099994
M5S 43.663110000000074 -79.40180056699995
M5T 43.65352500000006 -79.39723062399997
M5V 43.64081500000003 -79.39953781899999
M5W 43.64
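The loop above keeps retrying until the service returns coordinates, which can hang if a lookup never succeeds. A bounded variant might look like the sketch below. The helper `geocode_with_retry` and the stand-in `flaky_lookup` are hypothetical names introduced for illustration; in the notebook the lookup would presumably wrap the `geocoder.arcgis(...).latlng` call used above.

```python
import time


def geocode_with_retry(lookup, query, max_attempts=5, delay=0.0):
    """Call lookup(query) until it returns coordinates or attempts run out.

    lookup is any callable returning a [lat, lng] pair, or None/empty on failure.
    """
    for _ in range(max_attempts):
        latlng = lookup(query)
        if latlng:  # None or an empty list means no result
            return tuple(latlng)
        time.sleep(delay)  # pause before the next attempt
    return None


# Hypothetical stand-in for a geocoding call: fails twice, then succeeds.
calls = {"n": 0}


def flaky_lookup(query):
    calls["n"] += 1
    return None if calls["n"] < 3 else [43.65, -79.38]


coords = geocode_with_retry(flaky_lookup, "M5A, Toronto, Ontario")
```

Returning `None` after the last attempt lets the caller decide whether to skip the row or fall back to another data source.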

In [14]:
df.head()

|   | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M4N | Central Toronto | Lawrence Park | 43.728135 | -79.38709 |
| 1 | M4P | Central Toronto | Davisville North | 43.712755 | -79.388514 |
| 2 | M4R | Central Toronto | North Toronto West | 43.714523 | -79.40696 |
| 3 | M4S | Central Toronto | Davisville | 43.702765 | -79.385769 |
| 4 | M4T | Central Toronto | Moore Park, Summerhill East | 43.690505 | -79.382973 |


In [15]:
!wget -q -O 'Geospatial_Coordinates.csv' https://cocl.us/Geospatial_data
print('Data downloaded!')

Data downloaded!


Look at the data...

In [16]:
with open('Geospatial_Coordinates.csv') as csv_data:
    longlatdat = csv_data.readlines()  # read every line into a list for inspection

In [17]:
longlatdat[:5]

['Postal Code,Latitude,Longitude\n',
 'M1B,43.8066863,-79.1943534\n',
 'M1C,43.7845351,-79.1604971\n',
 'M1E,43.7635726,-79.1887115\n',
 'M1G,43.7709921,-79.2169174\n']

There are three columns, so build a dictionary keyed by postal code so we can look up the coordinates when we loop through our data.

In [18]:
d_postal_code = {}  # {'code': (lat, long)}
with open('Geospatial_Coordinates.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        d_postal_code[row['Postal Code']] = (row['Latitude'], row['Longitude'])
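The cell above stores the latitude and longitude as strings, exactly as they appear in the file. If numeric values are wanted later (for mapping or distance calculations), one option is to convert while loading. The sketch below assumes the same three-column layout and uses an in-memory sample instead of the downloaded file; `load_coordinates` is a hypothetical helper name.

```python
import csv
import io

# In-memory sample with the same header and first rows as the downloaded file.
sample_csv = (
    "Postal Code,Latitude,Longitude\n"
    "M1B,43.8066863,-79.1943534\n"
    "M1C,43.7845351,-79.1604971\n"
)


def load_coordinates(csvfile):
    """Return {postal code: (lat, lng)} with coordinates converted to float."""
    coords = {}
    for row in csv.DictReader(csvfile):
        coords[row["Postal Code"]] = (
            float(row["Latitude"]),
            float(row["Longitude"]),
        )
    return coords


coords = load_coordinates(io.StringIO(sample_csv))
```

Any file-like object works here, so the same function could be called with the handle returned by `open('Geospatial_Coordinates.csv', newline='')`.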

In [19]:
data[:5]

[['M4N', 'Central Toronto', 'Lawrence Park'],
 ['M4P', 'Central Toronto', 'Davisville North'],
 ['M4R', 'Central Toronto', 'North Toronto West'],
 ['M4S', 'Central Toronto', 'Davisville'],
 ['M4T', 'Central Toronto', 'Moore Park, Summerhill East']]

**Critical Note:** Make sure the data appears in the output above. If it does not, rerun the **DataFrame creation** cell above!

Loop through the original data and create a list with the coordinate data.

In [20]:
l_hoods = []
for pc, boro, hoods in data:
    lat, lng = d_postal_code[pc]
    l_hoods.append([pc, boro, hoods, lat, lng])
l_hoods[:5]

[['M4N', 'Central Toronto', 'Lawrence Park', '43.7280205', '-79.3887901'],
 ['M4P', 'Central Toronto', 'Davisville North', '43.7127511', '-79.3901975'],
 ['M4R', 'Central Toronto', 'North Toronto West', '43.7153834', '-79.4056784'],
 ['M4S', 'Central Toronto', 'Davisville', '43.7043244', '-79.3887901'],
 ['M4T',
  'Central Toronto',
  'Moore Park, Summerhill East',
  '43.6895743',
  '-79.3831599']]
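The direct lookup `d_postal_code[pc]` raises a `KeyError` if a postal code is absent from the CSV. A more forgiving variant, sketched below under the assumption that missing codes should be reported and not abort the loop, collects them separately. The function name `attach_coordinates` and the postal code `M0X` are made up for illustration.

```python
def attach_coordinates(records, lookup):
    """Append coordinates to each record; return (matched rows, missing codes)."""
    matched, missing = [], []
    for pc, boro, hoods in records:
        coords = lookup.get(pc)  # None when the code is not in the lookup
        if coords is None:
            missing.append(pc)
            continue
        lat, lng = coords
        matched.append([pc, boro, hoods, lat, lng])
    return matched, missing


# One record taken from the output above, plus a hypothetical unmatched code.
records = [
    ["M4N", "Central Toronto", "Lawrence Park"],
    ["M0X", "Nowhere", "Example"],
]
lookup = {"M4N": ("43.7280205", "-79.3887901")}

matched, missing = attach_coordinates(records, lookup)
```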

Create the new DataFrame with the coordinate data...

In [29]:
coord_df = pd.DataFrame(l_hoods, columns=['PostalCode', 'Borough', 'Neighborhood', 'Latitude', 'Longitude'])

In [22]:
coord_df.shape

(103, 5)

In [23]:
df.shape

(103, 5)

#### Note from the output below that the coordinates differ depending on the data source.

In [25]:
df.head()

|   | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M4N | Central Toronto | Lawrence Park | 43.728135 | -79.38709 |
| 1 | M4P | Central Toronto | Davisville North | 43.712755 | -79.388514 |
| 2 | M4R | Central Toronto | North Toronto West | 43.714523 | -79.40696 |
| 3 | M4S | Central Toronto | Davisville | 43.702765 | -79.385769 |
| 4 | M4T | Central Toronto | Moore Park, Summerhill East | 43.690505 | -79.382973 |


In [26]:
coord_df.head()

|   | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M4N | Central Toronto | Lawrence Park | 43.7280205 | -79.3887901 |
| 1 | M4P | Central Toronto | Davisville North | 43.7127511 | -79.3901975 |
| 2 | M4R | Central Toronto | North Toronto West | 43.7153834 | -79.4056784 |
| 3 | M4S | Central Toronto | Davisville | 43.7043244 | -79.3887901 |
| 4 | M4T | Central Toronto | Moore Park, Summerhill East | 43.6895743 | -79.3831599 |


In [27]:
df.tail()

|   | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 98 | M6C | York | Humewood-Cedarvale | 43.692105 | -79.430355 |
| 99 | M6E | York | Caledonia-Fairbanks | 43.68861 | -79.451003 |
| 100 | M6M | York | Del Ray, Keelsdale, Mount Dennis, Silverthorn | 43.694545 | -79.484495 |
| 101 | M6N | York | The Junction North, Runnymede | 43.675795 | -79.48196 |
| 102 | M9N | York | Weston | 43.704905 | -79.517712 |


In [28]:
coord_df.tail()

|   | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 98 | M6C | York | Humewood-Cedarvale | 43.6937813 | -79.4281914 |
| 99 | M6E | York | Caledonia-Fairbanks | 43.6890256 | -79.453512 |
| 100 | M6M | York | Del Ray, Keelsdale, Mount Dennis, Silverthorn | 43.6911158 | -79.4760133 |
| 101 | M6N | York | The Junction North, Runnymede | 43.6731853 | -79.4872619 |
| 102 | M9N | York | Weston | 43.706876 | -79.5181884 |
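
To put a number on how far apart the two sources are, the great-circle distance between the two coordinate pairs for a postal code can be computed. The sketch below uses the haversine formula with a spherical Earth of radius 6,371 km (an approximation) and the M4N values shown in the outputs above; `haversine_m` is a hypothetical helper name, not something the notebook defines.

```python
import math


def haversine_m(lat1, lng1, lat2, lng2, radius_m=6371000.0):
    """Great-circle distance in meters between two (lat, lng) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    )
    return 2 * radius_m * math.asin(math.sqrt(a))


# M4N as returned by the ArcGIS geocoder vs. the downloaded CSV (values from above).
arcgis_m4n = (43.728135, -79.38709)
csv_m4n = (43.7280205, -79.3887901)

offset_m = haversine_m(*arcgis_m4n, *csv_m4n)
print("M4N offset between sources: {:.0f} m".format(offset_m))
```

Applying the same function row by row to `df` and `coord_df` would show whether the offsets are uniformly small or whether a few postal codes disagree substantially.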
