# Explore and cluster the neighborhoods in Toronto

# Note:  For Toronto Exploration (3rd section of assignment), scroll down to  "Use Foursquare to get data on Toronto"


## Step One:  
Scrape the Wikipedia page for the table of neighbourhoods.
Source page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [3]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import lxml
import requests
from IPython.display import display
#pd.options.display.max_columns = None
pd.set_option('display.max_columns', 15)
print('Modules Ready')

Modules Ready


In [4]:
wiki = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(wiki, 'lxml')
print(soup.title.text)
print(type(soup))

List of postal codes of Canada: M - Wikipedia
<class 'bs4.BeautifulSoup'>


In [5]:
# Have a quick view of our HTML page

# print(soup.prettify())
# removing this - clutters final notebook with tons of HTML

## Step Two: 
Find the pertinent data on the webpage and put it into a dataframe

In [6]:
# Use soup to find the table body
table_body = soup.find('tbody')
#print(table_body)

In [14]:
# found help for this at https://datascience.stackexchange.com/questions/10857/how-to-scrape-a-table-from-a-webpage

In [7]:
# Create empty dataframe to store the data 
rows = table_body.find_all('tr')
df_cols = ['Postcode', 'Borough', 'Neighbourhood']
pc = pd.DataFrame(columns = df_cols)
print('starting with', len(rows), 'rows \n')


# iterate through rows, collecting one record per data row
# (DataFrame.append was removed in pandas 2.0, so build the frame from a list instead)
records = []
for row in rows:
    cols = [x.text.strip() for x in row.find_all('td')]
    # The header row uses <th> cells, so it yields no <td> values; skip it
    if len(cols) < 3:
        print('hit an empty row, moving on \n')
        continue
    if cols[0]:
        records.append(dict(zip(df_cols, cols[:3])))
pc = pd.DataFrame(records, columns=df_cols)

print('output is a dataframe of this shape: ', pc.shape)
print('before cleaning, dataframe looks like\n\n', pc.head(10))

starting with 289 rows 

hit an empty row, moving on 

output is a dataframe of this shape:  (288, 3)
before cleaning, dataframe looks like

   Postcode           Borough     Neighbourhood
0      M1A      Not assigned      Not assigned
1      M2A      Not assigned      Not assigned
2      M3A        North York         Parkwoods
3      M4A        North York  Victoria Village
4      M5A  Downtown Toronto      Harbourfront
5      M5A  Downtown Toronto       Regent Park
6      M6A        North York  Lawrence Heights
7      M6A        North York    Lawrence Manor
8      M7A      Queen's Park      Not assigned
9      M8A      Not assigned      Not assigned


## Step Three: 
Clean up the dataframe

In [8]:
# First remove rows where Borough is 'Not assigned'
cleanpc = pc[pc.Borough != 'Not assigned']
cleanpc.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern


Compare the M7A (Queen's Park) row in the dataframe above with the same row below, where its 'Not assigned' neighbourhood is replaced by the borough name.

In [9]:
# Next, if a row has a borough but a 'Not assigned' neighbourhood, set the neighbourhood to the borough name
# Work on an explicit copy so pandas does not raise SettingWithCopyWarning
cleanpc = cleanpc.copy()
cleanpc['Neighbourhood'] = np.where(cleanpc['Neighbourhood'] == 'Not assigned', cleanpc['Borough'], cleanpc['Neighbourhood'])

cleanpc.head(10)


Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Queen's Park
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern


In [10]:
# Now combine Neighbourhood field when PostCode is the same
grouppc = cleanpc.groupby(['Postcode','Borough'])['Neighbourhood'].apply(','.join).reset_index()
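
The three cleaning steps (drop unassigned boroughs, fall back to the borough name, join neighbourhoods per postcode) can be checked on a tiny hand-made frame. This is a minimal sketch, assuming hand-typed sample rows shaped like the scraped table; `clean_postcodes` is a hypothetical helper name and not part of the notebook.

```python
import pandas as pd

def clean_postcodes(pc):
    """Apply the three cleaning steps to a Postcode/Borough/Neighbourhood frame."""
    # 1. Drop rows whose borough is 'Not assigned'; copy so later writes are safe
    out = pc[pc['Borough'] != 'Not assigned'].copy()
    # 2. Where the neighbourhood is 'Not assigned', fall back to the borough name
    missing = out['Neighbourhood'] == 'Not assigned'
    out.loc[missing, 'Neighbourhood'] = out.loc[missing, 'Borough']
    # 3. Join neighbourhoods that share a postcode into one comma-separated string
    return (out.groupby(['Postcode', 'Borough'])['Neighbourhood']
               .apply(','.join)
               .reset_index())

# Illustrative sample rows (hand-typed, mirroring the first rows of the scraped table)
sample = pd.DataFrame({
    'Postcode': ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})
cleaned = clean_postcodes(sample)
print(cleaned)
```

Keeping the steps in one function makes it easy to rerun them if the Wikipedia table changes.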

## Step Four: 
Show the final shape of the dataframe.  Also see the first few rows:

In [11]:
grouppc.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park"
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge"
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff,Cliffside West"


In [12]:
grouppc.shape

(103, 3)

## GeoCode: 
Find lat/lon for each postal code

In [13]:
grouppc[grouppc.Postcode == 'M5G']

Unnamed: 0,Postcode,Borough,Neighbourhood
57,M5G,Downtown Toronto,Central Bay Street


In [14]:
filepath = 'https://cocl.us/Geospatial_data'
postcodes = pd.read_csv(filepath, index_col=0)
postcodes.head()

Unnamed: 0_level_0,Latitude,Longitude
Postal Code,Unnamed: 1_level_1,Unnamed: 2_level_1
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
M1E,43.763573,-79.188711
M1G,43.770992,-79.216917
M1H,43.773136,-79.239476


In [15]:
# merge lat/lon df into grouppc df
neighborhoods = pd.merge(grouppc, postcodes, left_on='Postcode', right_on='Postal Code', how='inner')
print(neighborhoods.shape)
neighborhoods.head()

(103, 5)


Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
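
An inner merge silently drops any postcode that has no match in the coordinates file. Here the shape stayed at 103 rows, so nothing was lost, but the check can be made explicit. The sketch below uses made-up frames (the postcode `M9Z` is invented for the example) and the `indicator` and `validate` options of `pd.merge`.

```python
import pandas as pd

# Illustrative frames: one postcode (M9Z, made up) has no coordinates on purpose
left = pd.DataFrame({'Postcode': ['M1B', 'M1C', 'M9Z'],
                     'Borough': ['Scarborough', 'Scarborough', 'Example']})
coords = pd.DataFrame({'Postal Code': ['M1B', 'M1C'],
                       'Latitude': [43.806686, 43.784535],
                       'Longitude': [-79.194353, -79.160497]})

# A left merge with indicator=True keeps every postcode and records where each row matched;
# validate='one_to_one' raises if a postcode is duplicated on either side
checked = pd.merge(left, coords, left_on='Postcode', right_on='Postal Code',
                   how='left', indicator=True, validate='one_to_one')
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Postcode'].tolist()
print('postcodes without coordinates:', unmatched)
```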


# Clustering the Data
Explore and cluster the neighborhoods in Toronto:

## Start by mapping using folium

In [34]:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

address = 'Downtown Toronto, ON'
geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))
print(location.latitude)

The geographical coordinates of Toronto are 43.655115, -79.380219.
43.655115


In [35]:
import folium # map rendering library
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

In [33]:
# Limit to Downtown Toronto
toronto = neighborhoods[neighborhoods['Borough'] == 'Downtown Toronto'].reset_index(drop=True)
toronto

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M4W,Downtown Toronto,Rosedale,43.679563,-79.377529
1,M4X,Downtown Toronto,"Cabbagetown,St. James Town",43.667967,-79.367675
2,M4Y,Downtown Toronto,Church and Wellesley,43.66586,-79.38316
3,M5A,Downtown Toronto,"Harbourfront,Regent Park",43.65426,-79.360636
4,M5B,Downtown Toronto,"Ryerson,Garden District",43.657162,-79.378937
5,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
6,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
7,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
8,M5H,Downtown Toronto,"Adelaide,King,Richmond",43.650571,-79.384568
9,M5J,Downtown Toronto,"Harbourfront East,Toronto Islands,Union Station",43.640816,-79.381752


# Use Foursquare to get data on Toronto

## To start:  focus on one particular Neighborhood: Berczy Park

In [112]:
# I will focus on one particular neighbourhood:  Berczy Park in M5E
berczy = toronto[toronto['Postcode'] == 'M5E']
berczy

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
6,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306


In [113]:
lat = berczy.iloc[0,3]
lon = berczy.iloc[0,4]
print('lat and long of Berczy Park are',lat,lon)

lat and long of Berczy Park are 43.644770799999996 -79.3733064


In [124]:
# Explore neighborhoods using Foursquare
CLIENT_ID = 'YOUR_CLIENT_ID' # your Foursquare ID (redacted; never publish real credentials)
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # your Foursquare Secret (redacted)
VERSION = '20180605' # Foursquare API version
LIMIT = 200
radius = 500
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, lat, lon, VERSION, radius, LIMIT)
print(url)

https://api.foursquare.com/v2/venues/search?client_id=YOUR_CLIENT_ID&client_secret=YOUR_CLIENT_SECRET&ll=43.644770799999996,-79.3733064&v=20180605&radius=500&limit=200
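
Formatting the query string by hand works, but it is easy to misplace a parameter, and printing the URL exposes the client secret. The following is a minimal sketch using the standard library's `urlencode`; `build_search_url` and `mask_secret` are hypothetical helper names, and the credentials are placeholders.

```python
from urllib.parse import urlencode

def build_search_url(client_id, client_secret, lat, lon,
                     version='20180605', radius=500, limit=200):
    """Build a venues/search URL; urlencode escapes each value."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'll': '{},{}'.format(lat, lon),
        'v': version,
        'radius': radius,
        'limit': limit,
    }
    return 'https://api.foursquare.com/v2/venues/search?' + urlencode(params)

def mask_secret(url, client_secret):
    """Hide the secret before printing or logging a URL."""
    return url.replace(client_secret, '***')

# Placeholder credentials, for illustration only
url = build_search_url('MY_ID', 'MY_SECRET', 43.6447708, -79.3733064)
print(mask_secret(url, 'MY_SECRET'))
```

Note that `urlencode` percent-encodes the comma in `ll` as `%2C`, which servers decode back to a comma.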


In [125]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5cc77c46f594df21bb75435c'},
 'response': {'venues': [{'id': '4ad94f83f964a520b91921e3',
    'name': 'Union Station',
    'location': {'address': '65 Front St W',
     'crossStreet': 'btwn Bay & York St',
     'lat': 43.645167120407564,
     'lng': -79.38064098358154,
     'labeledLatLngs': [{'label': 'display',
       'lat': 43.645167120407564,
       'lng': -79.38064098358154}],
     'distance': 592,
     'postalCode': 'M5J 1E6',
     'cc': 'CA',
     'city': 'Toronto',
     'state': 'ON',
     'country': 'Canada',
     'formattedAddress': ['65 Front St W (btwn Bay & York St)',
      'Toronto ON M5J 1E6',
      'Canada']},
    'categories': [{'id': '4bf58dd8d48988d129951735',
      'name': 'Train Station',
      'pluralName': 'Train Stations',
      'shortName': 'Train Station',
      'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/travel/trainstation_',
       'suffix': '.png'},
      'primary': True}],
    'referralId': 'v-1556577351',

In [126]:
from pandas import json_normalize # transform JSON into a pandas dataframe (pandas >= 1.0)
all_venues = results['response']['venues']


In [127]:
nearby_venues = json_normalize(all_venues)
nearby_venues.columns

Index(['categories', 'hasPerk', 'id', 'location.address', 'location.cc',
       'location.city', 'location.country', 'location.crossStreet',
       'location.distance', 'location.formattedAddress',
       'location.labeledLatLngs', 'location.lat', 'location.lng',
       'location.neighborhood', 'location.postalCode', 'location.state',
       'name', 'referralId', 'venuePage.id'],
      dtype='object')

In [117]:
chosen_columns = ['name', 'location.lat', 'location.lng', 'location.neighborhood','location.distance']
# reindex tolerates columns that are missing from the response (they are filled with NaN)
venues = nearby_venues.reindex(columns=chosen_columns)
venues.head(10)

Unnamed: 0,name,location.lat,location.lng,location.neighborhood,location.distance
0,Bell,43.64157,-79.37978,,179
1,Waterclub Gym & Pool,43.640023,-79.381777,,88
2,Harbourfront,43.639526,-79.380688,,167
3,Ten York,43.641125,-79.381022,,68
4,SP Parking,43.640091,-79.381494,,83
5,Infinity Condominiums,43.642506,-79.38297,,212
6,Artscape Daniels Launchpad,43.644298,-79.368339,,1147
7,Down On The Corner Put Im The Street,43.640226,-79.380628,,111
8,City of Toronto,43.650072,-79.383888,,1044
9,Scotiabank Arena,43.643446,-79.37904,,365


## Now generalize to all Downtown neighborhoods

### Note:
What is a little confusing is that the Foursquare lookup returns NaN for most neighborhood names.  However, this is probably irrelevant: we passed in the lat/lon of a particular neighborhood and got back the venues within a given radius of that point.  In a sense we have determined the neighborhood ourselves and don't need to rely on Foursquare to name it for us.
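
Since the neighborhood is defined here by distance from a lat/lon point, it can help to verify a venue's distance independently. The sketch below uses the haversine formula with an assumed spherical Earth radius of 6,371 km, and compares the M5E coordinates with the Union Station coordinates from the JSON response shown earlier.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres, assuming a spherical Earth of radius 6,371 km."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))

# M5E coordinates from the merged dataframe, Union Station from the search response
d = haversine_m(43.6447708, -79.3733064, 43.645167120407564, -79.38064098358154)
print(round(d))
```

The result lands close to the `distance` of 592 that Foursquare reported for Union Station. That is beyond the 500 m requested, which suggests the search endpoint may not treat `radius` as a hard cutoff.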

Next we're going to loop through all of downtown Toronto's neighborhoods and return their venues:

In [134]:
# copied from class-provided notebook
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    LIMIT = 250
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
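
The function above indexes `['categories'][0]` directly, so a venue without a category would raise an `IndexError`. Whether Foursquare actually returns such venues is an assumption here. The sketch below separates the parsing step into a hypothetical helper, `flatten_items`, so it can be exercised on a hand-written payload without calling the API.

```python
def flatten_items(name, lat, lng, items):
    """Turn the items of one explore response into flat tuples, one per venue."""
    rows = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        categories = venue.get('categories') or []
        # Assumption: a venue may come back with an empty category list
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue.get('name'),
                     location.get('lat'), location.get('lng'), category))
    return rows

# Hand-written payload in the shape that getNearbyVenues indexes into
fake_items = [
    {'venue': {'name': 'Rosedale Park',
               'location': {'lat': 43.682328, 'lng': -79.378934},
               'categories': [{'name': 'Playground'}]}},
    {'venue': {'name': 'Uncategorised place',
               'location': {'lat': 43.68, 'lng': -79.38},
               'categories': []}},
]
rows = flatten_items('Rosedale', 43.679563, -79.377529, fake_items)
print(rows)
```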

In [135]:
toronto_venues = getNearbyVenues(names=toronto['Neighbourhood'],
                                   latitudes=toronto['Latitude'],
                                   longitudes=toronto['Longitude']
                                  )

toronto_venues.shape

Rosedale
Cabbagetown,St. James Town
Church and Wellesley
Harbourfront,Regent Park
Ryerson,Garden District
St. James Town
Berczy Park
Central Bay Street
Adelaide,King,Richmond
Harbourfront East,Toronto Islands,Union Station
Design Exchange,Toronto Dominion Centre
Commerce Court,Victoria Hotel
Harbord,University of Toronto
Chinatown,Grange Park,Kensington Market
CN Tower,Bathurst Quay,Island airport,Harbourfront West,King and Spadina,Railway Lands,South Niagara
Stn A PO Boxes 25 The Esplanade
First Canadian Place,Underground city
Christie


(1291, 7)

In [136]:
toronto_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Rosedale,43.679563,-79.377529,Rosedale Park,43.682328,-79.378934,Playground
1,Rosedale,43.679563,-79.377529,Whitney Park,43.682036,-79.373788,Park
2,Rosedale,43.679563,-79.377529,Alex Murray Parkette,43.6783,-79.382773,Park
3,Rosedale,43.679563,-79.377529,Milkman's Lane,43.676352,-79.373842,Trail
4,"Cabbagetown,St. James Town",43.667967,-79.367675,Cranberries,43.667843,-79.369407,Diner


We now have over 1000 venues.  Let's group by neighborhood and see how many we have per neighborhood.

In [138]:
toronto_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Adelaide,King,Richmond",100,100,100,100,100,100
Berczy Park,55,55,55,55,55,55
"CN Tower,Bathurst Quay,Island airport,Harbourfront West,King and Spadina,Railway Lands,South Niagara",14,14,14,14,14,14
"Cabbagetown,St. James Town",47,47,47,47,47,47
Central Bay Street,89,89,89,89,89,89
"Chinatown,Grange Park,Kensington Market",100,100,100,100,100,100
Christie,16,16,16,16,16,16
Church and Wellesley,86,86,86,86,86,86
"Commerce Court,Victoria Hotel",100,100,100,100,100,100
"Design Exchange,Toronto Dominion Centre",100,100,100,100,100,100


It seems we never get more than 100 venues per neighborhood, even though I've raised the limit to 250 or higher.  This is probably a cap on the per-request results from Foursquare.
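
To see which neighborhoods hit the ceiling, and so may be under-counted, the grouped sizes can be filtered. This is a small sketch on made-up data, where `CAP = 3` stands in for the 100-venue ceiling observed above.

```python
import pandas as pd

CAP = 3  # stand-in for the 100-venue ceiling observed in the real data
# Made-up venue table: neighbourhood A reaches the cap, B does not
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B'],
    'Venue': ['a1', 'a2', 'a3', 'b1'],
})
counts = venues.groupby('Neighborhood').size()
capped = counts[counts >= CAP].index.tolist()
print(capped)
```

On the real table, the same filter with `CAP = 100` applied to `toronto_venues` would list the neighborhoods whose results were probably truncated.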

In [141]:
print("The number of unique venue categories is")
len(toronto_venues['Venue Category'].unique())

The number of unique venue categories is


204
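
The count above covers categories across all of downtown. A per-neighborhood view is a short step further. The sketch below, on made-up rows with the same two column names as `toronto_venues`, counts distinct categories and picks the most frequent category in each neighborhood.

```python
import pandas as pd

# Made-up rows with the same two columns as toronto_venues
venues = pd.DataFrame({
    'Neighborhood': ['Rosedale', 'Rosedale', 'Rosedale', 'Christie'],
    'Venue Category': ['Park', 'Park', 'Trail', 'Grocery Store'],
})
n_categories = venues['Venue Category'].nunique()
# Most frequent category per neighbourhood
top = (venues.groupby('Neighborhood')['Venue Category']
             .agg(lambda s: s.value_counts().idxmax()))
print(n_categories, top.to_dict())
```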

# Conclusion

I've used this section to gain experience with Foursquare and the Toronto neighborhoods.  Given that it's only worth 3 points, I don't think the assignment is intended to completely replicate the k-means clustering example given for NYC.  I'm going to move on and spend more of my time on the Capstone project itself. 