## Download and Cleanup Wiki-page Table

## Requirements

1. Use the notebook to build the code that scrapes the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, extracts the table of postal codes, and loads it into a pandas DataFrame.

2. To create the dataframe:
    - The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
    - Only process the cells that have an assigned borough. 
    - Ignore cells with a borough that is Not assigned.
    - More than one neighborhood can exist in one postal code area.
        - For example, in the table on the Wikipedia page, M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park.
        - These two rows will be combined into one row, with the neighborhoods separated by a comma, as shown in row 11 in the above table.
    - If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
    - Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
    - In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.
3. Generate some maps of Toronto
4. Generate a population density map based on census data
5. Use the FourSquare API to build an Italian restaurant density map
6. Use the FourSquare API to find the highest rated seafood restaurants on the east side of Toronto


### Download the wiki page and load the table into a Pandas DataFrame

In [1]:
# Pull in the required libraries
import pandas as pd
import numpy as np
import pickle

# To fetch URLs/HTML pages
import requests
# For parsing HTML Pages
from bs4 import BeautifulSoup

# For displaying Maps
import folium
import geocoder
# module to convert an address into latitude and longitude values
from geopy.geocoders import Nominatim 

# libraries for displaying images
from IPython.display import Image 
from IPython.display import HTML



In [2]:
# Define and download the wiki page
the_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
res = requests.get(the_url)

In [3]:
#Parse the table with BeautifulSoup
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table')[0]

In [4]:
# Load the Table into Pandas
df = pd.read_html(str(table))[0]
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [5]:
df.tail()

Unnamed: 0,Postal Code,Borough,Neighborhood
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."
179,M9Z,Not assigned,Not assigned


### Ignore cells with a borough that is 'Not assigned'.

In [6]:
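# We'll use 'wdf' as our 'working data frame' so that we can
# refer back to the original if needed. The .copy() lets us modify wdf
# later without pandas warning about writing to a slice of df.
wdf = df.loc[df['Borough'] != 'Not assigned'].copy()

# Validate: no 'Not assigned' boroughs should remain
wdf.loc[wdf['Borough'] == 'Not assigned']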

Unnamed: 0,Postal Code,Borough,Neighborhood


### Group Neighborhoods by Postal Code
For postal codes that span more than one neighborhood, join the neighborhoods with commas so that each postal code forms a single record.

In [7]:
# Check whether any postal code appears on more than one row
wdf[wdf.duplicated(['Postal Code'])]

Unnamed: 0,Postal Code,Borough,Neighborhood


The output above indicates that only one entry for each postal code exists, so this requirement is already met.
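
Had the scraped table listed a postal code on several rows (as the requirements describe for M5A), the rows could be merged with a `groupby`. The following is a minimal sketch on a toy frame; the sample rows are made up for illustration and are not the scraped data.

```python
import pandas as pd

# Toy data (illustrative only): M5A appears twice, as in the requirements' example
toy = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# One row per postal code, with the neighborhoods joined by a comma.
# sort=False keeps the postal codes in order of first appearance.
grouped = (
    toy.groupby(["Postal Code", "Borough"], sort=False)["Neighborhood"]
       .agg(", ".join)
       .reset_index()
)
print(grouped)
```

Grouping on both `Postal Code` and `Borough` keeps the borough column in the result; this assumes a postal code never maps to two different boroughs.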

### If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

In [8]:
wdf[wdf.Neighborhood == 'Not assigned']

Unnamed: 0,Postal Code,Borough,Neighborhood


In this dataset, all Neighborhoods are assigned.

In [9]:
# If we needed to change a few, we could do it with one line of code:

mask = wdf.Neighborhood == 'Not assigned'
wdf.loc[mask, 'Neighborhood'] = wdf.loc[mask, 'Borough']

print("Shape after 'hood assignment fix: {0}".format(wdf.shape))

Shape after 'hood assignment fix: (103, 3)


In [10]:
coord_df = pd.read_csv("https://cocl.us/Geospatial_data")

In [11]:
wdf2 = wdf.merge(coord_df, on="Postal Code")
wdf2.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
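
Because `merge` performs an inner join by default, any postal code missing from the coordinates file would be dropped silently. One way to check for that is a left join with `indicator=True`. This is a sketch on toy frames; the frame names and the unmatched code are chosen here for illustration.

```python
import pandas as pd

# Toy stand-ins for wdf and coord_df (illustrative only)
left = pd.DataFrame({"Postal Code": ["M3A", "M4A", "M9Z"]})
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# how="left" keeps every postal code; the _merge column records where each row came from
checked = left.merge(coords, on="Postal Code", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched)
```

An empty `unmatched` list would confirm that every postal code received coordinates.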


In [12]:
#!conda install -y -c conda-forge geocoder 

## **The GeoCoder for finding lat/long based on Postal Code did not return ANY results**

In [13]:
missed_codes = 0
found_codes = 0

for postal_code in wdf['Postal Code'][0:10]:
    # initialize your variable to None
    lat_lng_coords = None

    loop_count = 0

    # loop until you get the coordinates
    while(lat_lng_coords is None and loop_count < 10):
        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng
        loop_count += 1

    try:
        latitude = lat_lng_coords[0]
        longitude = lat_lng_coords[1]
        found_codes += 1
        print("{0} => Lat|Long: {1}".format(postal_code, lat_lng_coords))
    except TypeError:
        print("Couldn't get a lat/long for {0}".format(postal_code))
        missed_codes += 1
        
print("Found: {0} Missed: {1}".format(found_codes, missed_codes))

Couldn't get a lat/long for M3A
Couldn't get a lat/long for M4A
Couldn't get a lat/long for M5A
Couldn't get a lat/long for M6A
Couldn't get a lat/long for M7A
Couldn't get a lat/long for M9A
Couldn't get a lat/long for M1B
Couldn't get a lat/long for M3B
Couldn't get a lat/long for M4B
Couldn't get a lat/long for M5B
Found: 0 Missed: 10
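
The bounded retry used in the loop above can be factored into a small helper that accepts any lookup function. The sketch below uses stand-in lookups instead of a real geocoding service; `lookup_with_retry`, `always_missing`, and `flaky` are hypothetical names introduced here.

```python
import time

def lookup_with_retry(lookup, query, max_attempts=10, delay_seconds=0.0):
    """Call lookup(query) until it returns a non-None value or the attempts run out."""
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(delay_seconds)  # optional pause between attempts
    return None

# Stand-in that never finds anything, like the run above
def always_missing(query):
    return None

# Stand-in that succeeds on its second call
calls = {"n": 0}
def flaky(query):
    calls["n"] += 1
    return [43.753259, -79.329656] if calls["n"] >= 2 else None

missing = lookup_with_retry(always_missing, "M3A, Toronto, Ontario")
found = lookup_with_retry(flaky, "M3A, Toronto, Ontario")
print(missing, found)
```

Passing the lookup as a parameter makes it possible to swap geocoding providers without touching the retry logic.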


In [14]:
# I needed a comma-separated list of quoted postal codes for a third-party app
# that looks up the population of each postal code
s = ",".join("'{0}'".format(pc) for pc in wdf['Postal Code'])


## Let's look at the population in each Postal Code and map it on a choropleth

In [15]:
df_pop = pd.read_csv("data/can_pop_fsa.csv")
# we can immediately drop the columns we're not interested in 
df_pop=df_pop[['Geographic code','Province or territory', 'Population, 2016']]
df_pop=df_pop[(df_pop['Province or territory']== "Ontario")]
df_pop.head()

Unnamed: 0,Geographic code,Province or territory,"Population, 2016"
650,K0A,Ontario,103474
651,K0B,Ontario,20945
652,K0C,Ontario,52154
653,K0E,Ontario,38903
654,K0G,Ontario,37097


In [16]:
map_toronto = folium.Map(location=[43.653963, -79.387207], zoom_start=11)
ontario_geo = "data/toronto_fsa.geojson"
# Map.choropleth() is deprecated in newer folium releases; folium.Choropleth is the replacement
folium.Choropleth(geo_data=ontario_geo,
    data=df_pop,
    columns=['Geographic code','Population, 2016'],
    key_on='feature.properties.CFSAUID',
    fill_color='YlOrRd',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Population by FSA').add_to(map_toronto)
    
map_toronto
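
A choropleth only colors the shapes whose `key_on` property matches a value in the data's key column, so it can help to compare the two sets of codes before drawing. The sketch below uses a tiny in-memory stand-in for the GeoJSON file and made-up population numbers; only the `CFSAUID` property and the column names come from the cells above.

```python
import pandas as pd

# Tiny stand-in for the GeoJSON file; only the property used by key_on is modelled
geo = {"type": "FeatureCollection", "features": [
    {"type": "Feature", "properties": {"CFSAUID": "M3A"}, "geometry": None},
    {"type": "Feature", "properties": {"CFSAUID": "M4A"}, "geometry": None},
]}

# Made-up population values, for illustration only
pop = pd.DataFrame({"Geographic code": ["M3A", "K0A"], "Population, 2016": [100, 200]})

geo_codes = {f["properties"]["CFSAUID"] for f in geo["features"]}
data_codes = set(pop["Geographic code"])

no_data = sorted(geo_codes - data_codes)   # shapes that have no matching data row
no_shape = sorted(data_codes - geo_codes)  # data rows that have no shape to color
print(no_data, no_shape)
```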

## Set up FourSquare API Info

In [17]:
CLIENT_ID = '5HPET1K2F5KSIMNIAOJSTRSQBWW3CO5YTVYI5UQDO0DACZLH' # your Foursquare ID
CLIENT_SECRET = 'MDZZBZNLQ0XDHUCY43AVPJDXXYPYOKYFZKT4SYZRFSONU1S5' # your Foursquare Secret
VERSION = '20180604'
LIMIT = 50



## Loop through the postal codes - finding seafood restaurants

In [18]:
search_query = 'seafood'
radius = 2000
print(search_query + ' .... OK!')
count = 0
results_dict = {"Postal Code":[], "Latitude": [], "Longitude":[], "Hit Count": []}
the_cols = ["Postal Code", "Latitude", "Longitude", "Hit Count"]
# Keep the raw API responses in their own dict so results_dict holds only the DataFrame columns
raw_results = {}

for postal_code, latitude, longitude in wdf2[['Postal Code','Latitude', 'Longitude']].values:
    #url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, search_query, radius, LIMIT)
    url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&near={}&v={}&query={}&limit={}&radius={}'.format(CLIENT_ID, CLIENT_SECRET, postal_code, VERSION, search_query, LIMIT, radius)
    #print(url)
    results = requests.get(url).json()
    raw_results[postal_code] = results
    try:
        hit_count = len(results['response']['venues'])
    except KeyError:
        hit_count = 0
    
    # Fill out our results to populate a DataFrame later... 
    results_dict["Postal Code"].append(postal_code)
    results_dict["Latitude"].append(latitude)
    results_dict["Longitude"].append(longitude)
    results_dict["Hit Count"].append(hit_count)
    #print("PC: {0} - Lat: {1:.2f} - Long {2:.2f} - Hit Count: {3}".format(postal_code, latitude, longitude, hit_count))
    
print("Done.")

seafood .... OK!
Done.
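
The search URL above is assembled with `str.format`, which does not escape special characters in the query. A hedged alternative is to build the query string with the standard library's `urlencode`. In this sketch `build_search_url` is a hypothetical helper and the credentials are placeholders; the endpoint and parameter names are the ones used in the loop above.

```python
from urllib.parse import urlencode

def build_search_url(client_id, client_secret, near, query, version, limit, radius):
    # urlencode percent-escapes each value, so spaces and '&' in the query are safe
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "near": near,
        "v": version,
        "query": query,
        "limit": limit,
        "radius": radius,
    }
    return "https://api.foursquare.com/v2/venues/search?" + urlencode(params)

# Placeholder credentials, for illustration only
url = build_search_url("MY_ID", "MY_SECRET", "M5A", "fish & chips", "20180604", 50, 2000)
print(url)
```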


## Review the DataFrame

In [19]:
# Build out the seafood DataFrame
seafood_df = pd.DataFrame(data=results_dict,columns=the_cols)

In [20]:
seafood_df.describe()

Unnamed: 0,Latitude,Longitude,Hit Count
count,103.0,103.0,103.0
mean,43.704608,-79.397153,3.728155
std,0.052463,0.097146,5.576903
min,43.602414,-79.615819,0.0
25%,43.660567,-79.464763,0.0
50%,43.696948,-79.38879,2.0
75%,43.74532,-79.340923,3.0
max,43.836125,-79.160497,16.0


In [21]:
seafood_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude,Hit Count
0,M3A,43.753259,-79.329656,0
1,M4A,43.725882,-79.315572,0
2,M5A,43.65426,-79.360636,5
3,M6A,43.718518,-79.464763,0
4,M7A,43.662301,-79.389494,16


## Translate the DataFrame to the map

In [22]:
map_to_seafood = folium.Map(location=[43.653963, -79.387207], zoom_start=11)
ontario_geo = "data/toronto_fsa.geojson"
map_to_seafood.choropleth(geo_data=ontario_geo,
    data = seafood_df,
    columns=['Postal Code','Hit Count'],
    key_on='feature.properties.CFSAUID',
    fill_color='PuRd',
    fill_opacity=0.7, 
    line_opacity=0.2,
    legend_name='Seafood Restaurants by Postal Code')   

map_to_seafood

## Find the highest rated Seafood Restaurants on the East Side (east of -79.387207 Longitude)

### Define some help functions to find restaurants and ratings

In [23]:
def get_venue_list(postal_code, the_query):
    # Given a postal code and a query term, use the FourSquare 'search' endpoint to search the area near the postal code
    url = 'https://api.foursquare.com/v2/venues/search?query={}&client_id={}&client_secret={}&near={}&v={}&limit={}'.format(the_query,CLIENT_ID, CLIENT_SECRET, postal_code, VERSION, LIMIT)
    results = requests.get(url).json()
    
    # Grab the list of venues from the response
    try:
        my_venue_list = results['response']['venues']
    except (KeyError, TypeError):
        my_venue_list = []
        
    return my_venue_list    

def get_venue_rating(venue_id):
    # Given a FourSquare venue id, get its rating
    get_venue_url = 'https://api.foursquare.com/v2/venues/{0}?client_id={1}&client_secret={2}&v={3}'.format(venue_id, CLIENT_ID, CLIENT_SECRET, VERSION)
    venue_results = requests.get(get_venue_url).json()

    try:
        the_venue = venue_results['response']['venue']
        rating = the_venue['rating']
    except (KeyError, TypeError):
        # If we hit any issues (no data, etc.), return zero
        rating = 0
        
    return rating

### Build a list of seafood restaurants and ratings
We'll start with all postal codes, then filter for the east side and for rating later.

In [24]:
#43.662301,-79.389494
venue_dict = {"ID":[], "Postal Code": [], "Name":[], "Latitude":[], "Longitude":[], "Rating":[]}

duplicate_count = 0 
for code in wdf2['Postal Code'].values:
    
    venue_list = get_venue_list(code, 'seafood')
    
    for venue in venue_list:
        if venue['id'] not in venue_dict['ID']:
            venue_dict['ID'].append(venue['id'])
            venue_dict['Postal Code'].append(code)
            venue_dict['Name'].append(venue['name'])
            venue_dict['Latitude'].append(venue['location']['lat'])
            venue_dict['Longitude'].append(venue['location']['lng'])
            venue_dict['Rating'].append(get_venue_rating(venue['id']))
        else:
            duplicate_count += 1
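
The loop above checks `venue['id'] not in venue_dict['ID']`, which scans a list on every venue. A set of seen ids gives the same de-duplication with constant-time lookups. The sketch below runs on toy venue records; `collect_unique` is a hypothetical helper, and the records carry only the `id` and `name` keys.

```python
def collect_unique(venue_batches):
    """Return (unique_venues, duplicate_count) across several lists of venue dicts."""
    seen = set()
    unique = []
    duplicates = 0
    for batch in venue_batches:
        for venue in batch:
            if venue["id"] in seen:
                duplicates += 1
            else:
                seen.add(venue["id"])
                unique.append(venue)
    return unique, duplicates

# Toy batches (illustrative only): venue "a" is returned for two postal codes
batches = [
    [{"id": "a", "name": "First"}, {"id": "b", "name": "Second"}],
    [{"id": "a", "name": "First"}, {"id": "c", "name": "Third"}],
]
unique, duplicates = collect_unique(batches)
print(len(unique), duplicates)
```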
    

## Translate venue dictionary to a DataFrame
Fill out the dataframe and sort by rating

In [27]:
venue_df = pd.DataFrame(data=venue_dict)
venue_df.sort_values(by='Rating', inplace=True, ascending=False)
venue_df.head()

Unnamed: 0,ID,Postal Code,Name,Latitude,Longitude,Rating
0,4c85d008b139b7134c99c691,M3A,Diana's Seafood Delight,43.745745,-79.291634,8.1
65,4b985433f964a5201f3c35e3,M5A,"Snug Harbour Seafood, Bar & Grill",43.550433,-79.584689,7.9
95,4ae3398ff964a520ed9121e3,M5B,Red Lobster,43.656328,-79.383621,7.8
110,4cdc8e53d4ecb1f7843c8048,M1T,The Royal Chinese Restaurant 避風塘小炒,43.780505,-79.298844,7.7
51,4ddbe8697d8b771c0b09b885,M4A,Dim Sum King Seafood Restaurant,43.653503,-79.395405,7.6


## Build the map of highly rated seafood on the east side

In [30]:
venue_locations = venue_df[(venue_df['Rating'] > 6) & (venue_df['Longitude'] > -79.387207)]
#venue_locations = venue_locations[['Latitude','Longitude']].values.tolist()

rated_seafood_map = folium.Map(location=[43.653963, -79.387207])


# Add markers to the map
for index, row in venue_locations.iterrows():
    lat_long = [row['Latitude'], row['Longitude']]
    rating   = row['Rating']
    name     = row['Name']
    folium.Marker(lat_long, popup=str(rating)).add_to(rated_seafood_map)
    print("{0} - {1} - {2}".format(name, lat_long, rating))

# Fit the map to the southwest and northeast corners of the plotted venues
sw = venue_locations[['Latitude', 'Longitude']].min().values.tolist()
ne = venue_locations[['Latitude', 'Longitude']].max().values.tolist()

rated_seafood_map.fit_bounds([sw, ne]) 

Diana's Seafood Delight - [43.74574539734533, -79.29163423555568] - 8.1
Red Lobster - [43.656328, -79.383621] - 7.8
The Royal Chinese Restaurant 避風塘小炒 - [43.78050473445372, -79.29884391314476] - 7.7
Fairview Seafood Chinese Cuisine - [43.7929068448252, -79.23934784822218] - 6.6
Tak Fu Seafood Restaurant 德福點心皇 - [43.822633229713155, -79.29895769982762] - 6.5
Fishman Wharf Seafood Restaurant 漁人碼頭 - [43.82239944620864, -79.3133957901482] - 6.5
Aki Da Japanese Seafood House - [43.669050975420376, -79.30449946579239] - 6.3
Very Fair Seafood Cuisine 鴻福海鮮大酒樓 - [43.80307672983049, -79.29361725934118] - 6.1
Wah Too Seafood Restaurant - [43.65483285234745, -79.38720597193928] - 6.1
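
The filter and the bounding box used for this map can be checked on a small frame without any API calls. This is a sketch with made-up venues; `EAST_OF` and `MIN_RATING` are names introduced here for the two thresholds used in the cell above.

```python
import pandas as pd

# Made-up venues, for illustration only
venues = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Latitude": [43.70, 43.65, 43.80, 43.75],
    "Longitude": [-79.30, -79.45, -79.25, -79.20],
    "Rating": [8.0, 9.0, 5.0, 7.0],
})

EAST_OF = -79.387207  # longitude threshold used in the cell above
MIN_RATING = 6        # rating threshold used in the cell above

# Keep venues that are both well rated and east of the threshold
east = venues[(venues["Rating"] > MIN_RATING) & (venues["Longitude"] > EAST_OF)]

# Southwest and northeast corners of the filtered venues, as [lat, lng] pairs
sw = east[["Latitude", "Longitude"]].min().tolist()
ne = east[["Latitude", "Longitude"]].max().tolist()
print(list(east["Name"]), sw, ne)
```

Venue B is dropped for lying west of the threshold and venue C for its rating, so the bounds are computed from A and D only.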


### Top Rated Seafood on Toronto's East Side!

In [31]:
# Display the map
rated_seafood_map