# Capstone Week 3 Project - Part 2

#### To set the context, the below is a continuation of Part 1. Scroll to the bottom to pick up with Part 2.

Get neighborhood data for Toronto

In [28]:
# Import all needed packages here
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests
import lxml
from pprint import pprint
from collections import defaultdict
import csv

In [29]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
# go get the web page
web_page = requests.get(url)
# make some tasty soup
soup = BeautifulSoup(web_page.content, 'html.parser')

I viewed the Wikipedia page in Firefox then pressed Ctrl+U to see the source. The data is in a table of class 'wikitable'. With this I will:
1. Get the HTML for that specific table
2. Parse out the row data
3. Load the data into a Python list
4. Create the pandas DataFrame from the list

**Note:** I am using ``defaultdict`` from the handy Python ``collections`` module to accumulate the list of neighborhoods for each (borough, postal code) pair. Then I do some data wrangling to transform that list into a comma-separated string and output the results to the list for pandas DataFrame creation.
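
To make the grouping step concrete, here is a minimal sketch of the same ``defaultdict`` pattern run on a handful of hand-written rows instead of the scraped table. The ``rows`` list and the names ``grouped`` and ``result`` are made up for illustration; only the grouping logic mirrors the cell below.

```python
from collections import defaultdict

# Hypothetical sample rows in (postal code, borough, neighborhood) order,
# shaped like the rows parsed from the wikitable
rows = [
    ("M4T", "Central Toronto", "Moore Park"),
    ("M4T", "Central Toronto", "Summerhill East"),
    ("M4W", "Downtown Toronto", "Rosedale"),
    ("M7A", "Queen's Park", "Not assigned"),
    ("M1A", "Not assigned", "Not assigned"),
]

grouped = defaultdict(list)  # missing keys start out as an empty list
for pc, boro, nbh in rows:
    if boro == "Not assigned":  # skip rows without a borough
        continue
    if nbh == "Not assigned":   # fall back to the borough name
        nbh = boro
    grouped[(boro, pc)].append(nbh)

# Sort by (borough, postal code) and join each list into one string
result = [[pc, boro, ", ".join(hoods)]
          for (boro, pc), hoods in sorted(grouped.items())]
print(result)
```

Because ``defaultdict(list)`` creates the empty list on first access, there is no need to check whether a key already exists before appending.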

#### DataFrame Creation Below...

In [50]:
trono_fsas = soup.find('table', class_='wikitable')
html_rows = trono_fsas.find_all('tr')
data = []
d_pc_boro = defaultdict(list)
# loop thru the rows, populate dict with key (boro, pc), append list of neighborhoods
for tr in html_rows:
    td = tr.find_all('td')
    row = [cell.text.strip() for cell in td if cell.text.strip()]  # 'cell' avoids reusing the loop name 'tr'
    if len(row) == 3:
        pc = row[0]
        boro = row[1]
        nbh = row[2]
        if boro == 'Not assigned':  # ignore cells with unassigned Borough
            continue
        if nbh == 'Not assigned':  # unassigned neighborhoods take the name of the Borough
            nbh = boro
        d_pc_boro[(boro, pc)].append(nbh)  # group the neighborhoods
# loop thru the dict, turn list of neighborhoods to a string and output a list for pandas
for (bro, pcode), hoods in sorted(d_pc_boro.items()):
    nbstr = ', '.join(hoods)  # concatenate the hoods
    data.append([pcode, bro, nbstr])

df = pd.DataFrame(data, columns=['PostalCode', 'Borough', 'Neighborhood'])
df

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M4N,Central Toronto,Lawrence Park
1,M4P,Central Toronto,Davisville North
2,M4R,Central Toronto,North Toronto West
3,M4S,Central Toronto,Davisville
4,M4T,Central Toronto,"Moore Park, Summerhill East"
5,M4V,Central Toronto,"Deer Park, Forest Hill SE, Rathnelly, South Hi..."
6,M5N,Central Toronto,Roselawn
7,M5P,Central Toronto,"Forest Hill North, Forest Hill West"
8,M5R,Central Toronto,"The Annex, North Midtown, Yorkville"
9,M4W,Downtown Toronto,Rosedale


The above output already gives the shape, but the assignment says to use ``shape`` (an attribute of the DataFrame, not a method), so here it is...

In [31]:
df.shape

(103, 3)

---

### Part 2 starts here. Make sure the above cells are run...

``Click Cell > Run All``

Download the Geospatial data

In [1]:
!wget -q -O 'Geospatial_Coordinates.csv' https://cocl.us/Geospatial_data
print('Data downloaded!')

Data downloaded!


Look at the data...

In [3]:
with open('Geospatial_Coordinates.csv') as csv_data:
    longlatdat = csv_data.readlines()  # read every line of the file into a list

In [7]:
longlatdat[:5]

['Postal Code,Latitude,Longitude\n',
 'M1B,43.8066863,-79.1943534\n',
 'M1C,43.7845351,-79.1604971\n',
 'M1E,43.7635726,-79.1887115\n',
 'M1G,43.7709921,-79.2169174\n']

There are three columns, so create a dictionary keyed by postal code so we can look up the coordinates when we loop through our data.

In [24]:
d_postal_code = {}  # {'code': (lat, long)}
with open('Geospatial_Coordinates.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        d_postal_code[row['Postal Code']] = (row['Latitude'], row['Longitude'])

In [40]:
data[:5]

[['M4N', 'Central Toronto', 'Lawrence Park'],
 ['M4P', 'Central Toronto', 'Davisville North'],
 ['M4R', 'Central Toronto', 'North Toronto West'],
 ['M4S', 'Central Toronto', 'Davisville'],
 ['M4T', 'Central Toronto', 'Moore Park, Summerhill East']]

**Critical Note:** Make sure ``data`` holds rows like the ones shown above. If not, rerun the **DataFrame creation** cell above!

Loop through the original data and create a list with the coordinate data.

In [46]:
l_hoods = []
for pc, boro, hoods in data:
    lat, lng = d_postal_code[pc]
    l_hoods.append([pc, boro, hoods, lat, lng])
l_hoods[:5]

[['M4N', 'Central Toronto', 'Lawrence Park', '43.7280205', '-79.3887901'],
 ['M4P', 'Central Toronto', 'Davisville North', '43.7127511', '-79.3901975'],
 ['M4R', 'Central Toronto', 'North Toronto West', '43.7153834', '-79.4056784'],
 ['M4S', 'Central Toronto', 'Davisville', '43.7043244', '-79.3887901'],
 ['M4T',
  'Central Toronto',
  'Moore Park, Summerhill East',
  '43.6895743',
  '-79.3831599']]
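
Note that the coordinates above are strings, because ``csv.DictReader`` does not convert types, and that ``d_postal_code[pc]`` raises a ``KeyError`` if a postal code is missing from the file. The following is one possible variation, not the approach used in this notebook: it converts the values to ``float`` and falls back to ``None`` for unknown codes. The in-memory CSV reuses two lines from the file preview; the ``rows`` list, its borough and neighborhood values, and the postal code ``M9Z`` are placeholders invented for the example.

```python
import csv
import io

# Two lines copied from the file preview; io.StringIO stands in for the real file
sample_csv = io.StringIO(
    "Postal Code,Latitude,Longitude\n"
    "M1B,43.8066863,-79.1943534\n"
    "M1C,43.7845351,-79.1604971\n"
)

coords = {}  # {'code': (lat, lng)} with float values
for row in csv.DictReader(sample_csv):
    coords[row["Postal Code"]] = (float(row["Latitude"]), float(row["Longitude"]))

# Placeholder rows; 'M9Z' is deliberately absent from the sample CSV
rows = [["M1B", "Example Borough", "Example Hood A"],
        ["M9Z", "Example Borough", "Example Hood B"]]

enriched = []
for pc, boro, hoods in rows:
    lat, lng = coords.get(pc, (None, None))  # no KeyError for unknown codes
    enriched.append([pc, boro, hoods, lat, lng])
print(enriched)
```

Numeric coordinates matter later if the values are passed to anything that does arithmetic on them, such as a mapping or clustering library.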

Create the new DataFrame with the coordinate data...

In [47]:
coord_df = pd.DataFrame(l_hoods, columns=['PostalCode', 'Borough', 'Neighborhood', 'Latitude', 'Longitude'])
coord_df

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M4N,Central Toronto,Lawrence Park,43.7280205,-79.3887901
1,M4P,Central Toronto,Davisville North,43.7127511,-79.3901975
2,M4R,Central Toronto,North Toronto West,43.7153834,-79.4056784
3,M4S,Central Toronto,Davisville,43.7043244,-79.3887901
4,M4T,Central Toronto,"Moore Park, Summerhill East",43.6895743,-79.3831599
5,M4V,Central Toronto,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",43.6864123,-79.4000493
6,M5N,Central Toronto,Roselawn,43.7116948,-79.4169356
7,M5P,Central Toronto,"Forest Hill North, Forest Hill West",43.6969476,-79.4113072
8,M5R,Central Toronto,"The Annex, North Midtown, Yorkville",43.6727097,-79.4056784
9,M4W,Downtown Toronto,Rosedale,43.6795626,-79.3775294


In [51]:
coord_df.shape

(103, 5)
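
As an aside, the dictionary lookup and loop above could also be expressed as a pandas merge. This is an alternative sketch rather than what the notebook does: it uses two rows taken from the output above, reads the coordinates from an in-memory CSV instead of the downloaded file, and the names ``left``, ``geo`` and ``merged`` are made up for illustration.

```python
import io
import pandas as pd

# Two rows taken from the DataFrame shown earlier
left = pd.DataFrame(
    [["M4N", "Central Toronto", "Lawrence Park"],
     ["M4P", "Central Toronto", "Davisville North"]],
    columns=["PostalCode", "Borough", "Neighborhood"],
)

# Matching coordinate lines; io.StringIO stands in for Geospatial_Coordinates.csv
geo_csv = io.StringIO(
    "Postal Code,Latitude,Longitude\n"
    "M4N,43.7280205,-79.3887901\n"
    "M4P,43.7127511,-79.3901975\n"
)
# Rename the key column so both frames share the name 'PostalCode'
geo = pd.read_csv(geo_csv).rename(columns={"Postal Code": "PostalCode"})

# A left join keeps every neighborhood row even if its code has no coordinates
merged = left.merge(geo, on="PostalCode", how="left")
print(merged.shape)
```

With ``read_csv`` the latitude and longitude columns come back as floats rather than strings, and a postal code missing from the coordinate file would produce ``NaN`` instead of an error.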