# Scraping Wikipedia For Neighborhood Info #
### Albert Olszewski ###
In this document, I gather information on Toronto neighborhoods from a Wikipedia page and perform a clustering analysis using venue data from the Foursquare API.

Import necessary packages.

In [596]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
import csv

Create a BeautifulSoup object from the HTML source of the given Wikipedia page.

In [597]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source, 'lxml')

The following block of code cleans the data as follows:
1. Gather each individual row in the table, catching the exception raised by empty rows or cells (this also handles the header row, because its cells are enclosed in different tags)
2. Remove rows that do not have an assigned borough
3. Assign the borough as the neighborhood name for postal codes with unassigned neighborhoods
4. Create a dataframe and combine neighborhoods that share a postal code (groupby and join)

In [598]:
districts = []

for district in soup.find('table').find_all('tr'):
    cells = district.find_all('td')
    try:
        # strip surrounding whitespace and newlines from every cell
        postcode = cells[0].text.strip()
        borough = cells[1].text.strip()
        neighborhood = cells[2].text.strip()
    except IndexError:
        # rows without three <td> cells (such as the header row) end up here
        postcode = None
        borough = None
        neighborhood = None
    # compile data into a list
    districts.append([postcode, borough, neighborhood])


# get rid of postal codes not assigned to a borough
assigned_districts = []
for i in range(1,len(districts)):
    if districts[i][1]!='Not assigned':
        assigned_districts.append(districts[i])

# assign borough as neighborhoods for unassigned neighborhoods
for j in range(0,len(assigned_districts)):
    if assigned_districts[j][2] == 'Not assigned':
        assigned_districts[j][2] = assigned_districts[j][1]
        



In [599]:
# creating dataframe
df = pd.DataFrame(data = assigned_districts, columns = ['Postal Code','Borough','Neighborhood'])
# joining neighborhoods with same postalcode
df = df.groupby(['Postal Code','Borough'])['Neighborhood'].apply(lambda x: ','.join(x.astype(str))).reset_index()

df.head()


Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
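The same cleaning steps can also be expressed directly in pandas. The following is a minimal sketch on a small hand-made sample rather than the scraped table; the rows and the names `raw`, `cleaned`, and `sample_df` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Illustrative sample rows (not the scraped data)
raw = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],
        ["M1B", "Scarborough", "Rouge"],
        ["M1B", "Scarborough", "Malvern"],
        ["M7A", "Queen's Park", "Not assigned"],
    ],
    columns=["Postal Code", "Borough", "Neighborhood"],
)

# Step 2: drop postal codes that are not assigned to a borough
cleaned = raw[raw["Borough"] != "Not assigned"].copy()

# Step 3: use the borough name where the neighborhood is unassigned
unassigned = cleaned["Neighborhood"] == "Not assigned"
cleaned.loc[unassigned, "Neighborhood"] = cleaned.loc[unassigned, "Borough"]

# Step 4: combine neighborhoods that share a postal code
sample_df = (
    cleaned.groupby(["Postal Code", "Borough"])["Neighborhood"]
    .agg(",".join)
    .reset_index()
)
print(sample_df)
```

Boolean masks replace the index-based loops, which avoids mutating nested lists in place.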


In [600]:
print('The final dataframe has the shape of: ', df.shape)

The final dataframe has the shape of:  (103, 3)


## Get Longitude and Latitude ##
The geocoder package was not working consistently, so I am loading the CSV file provided with the assignment instead.

In [601]:
latlong = pd.read_csv('Geospatial_Coordinates.csv')
latlong.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Join the dataframes on the Postal Code column using an inner join.

In [602]:
# join dataframes
df_torontoinfo = pd.merge(df, latlong, on='Postal Code', how='inner')
df_torontoinfo.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
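An inner join silently drops any postal code that has no coordinates. One way to check for that is a left join with pandas' `indicator` and `validate` options. This is a sketch on toy frames; the postal code `M9Z` and the frame names are made up for illustration.

```python
import pandas as pd

# Toy inputs; "M9Z" is a made-up code with no matching coordinates
left = pd.DataFrame(
    {"Postal Code": ["M1B", "M1C", "M9Z"],
     "Borough": ["Scarborough", "Scarborough", "Example Borough"]}
)
right = pd.DataFrame(
    {"Postal Code": ["M1B", "M1C"],
     "Latitude": [43.806686, 43.784535],
     "Longitude": [-79.194353, -79.160497]}
)

# validate raises if a postal code is duplicated on either side;
# indicator adds a "_merge" column saying where each row came from
checked = pd.merge(
    left, right, on="Postal Code", how="left",
    validate="one_to_one", indicator=True,
)

# postal codes that an inner join would have dropped
missing = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print(missing)
```

If `missing` is empty, the inner join loses no rows.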


The following code is commented out, but in the future it can be used to find latitude and longitude data.  It was not used because geocoder was not working.
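If the geocoder lookup is revisited, it may be worth bounding the retry loop so that a failing service cannot hang the notebook. The sketch below shows one way to do that; `fetch_with_retries` and the stand-in `flaky_lookup` are hypothetical helpers, and no real geocoding call is made.

```python
import time


def fetch_with_retries(lookup, query, max_attempts=5, delay_seconds=0.0):
    """Call lookup(query) until it returns a value or attempts run out."""
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(delay_seconds)  # brief pause before the next attempt
    return None  # caller decides how to handle a permanent failure


# Stand-in for a geocoding call: fails twice, then succeeds
calls = {"count": 0}


def flaky_lookup(query):
    calls["count"] += 1
    return [43.6532, -79.3832] if calls["count"] >= 3 else None


result = fetch_with_retries(flaky_lookup, "M5G, Toronto, Ontario")
print(result)
```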

In [603]:
'''# Documentation can be found https://geocoder.readthedocs.io/index.html.

import geocoder # import geocoder

# initialize your variable to None
lat_lng_coords = None

postal_code = 'M5G'

g = geocoder.google('Toronto, Ontario')
print(g.latlng)'''
'''# loop until you get the coordinates
while(lat_lng_coords is None):
  g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
  lat_lng_coords = g.latlng

latitude = lat_lng_coords[0]
longitude = lat_lng_coords[1]

print(latitude)
print(longitude)'''

## Foursquare Toronto Neighborhoods ##
Import necessary additional libraries and packages.

In [604]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library
print("Packages Installed...")

Packages Installed...


Create a map of Toronto with markers for the postal codes superimposed on it.

In [605]:
# create map of Toronto using latitude and longitude values
latitude = 43.6532
longitude = -79.3832
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_torontoinfo['Latitude'], df_torontoinfo['Longitude'], df_torontoinfo['Borough'], df_torontoinfo['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
    
map_toronto

Define Foursquare credentials.

In [606]:
CLIENT_ID = 'JFMMIPVEDBJHMKDINLMG5MU2Y45JMQA3E35JD14HUR5VCUEJ' # your Foursquare ID
CLIENT_SECRET = 'AYFWYZTKALQF5Q3540IKE5JDMCVDU0OW3LJEW5ZQ24THHMFQ' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: JFMMIPVEDBJHMKDINLMG5MU2Y45JMQA3E35JD14HUR5VCUEJ
CLIENT_SECRET:AYFWYZTKALQF5Q3540IKE5JDMCVDU0OW3LJEW5ZQ24THHMFQ
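Credentials can also be kept out of the notebook by reading them from environment variables. This is a minimal sketch; the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are assumptions, not something Foursquare prescribes.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Read the two credentials from an environment-style mapping."""
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None


# Demonstration with a plain dict standing in for os.environ
demo_env = {
    "FOURSQUARE_CLIENT_ID": "example-id",
    "FOURSQUARE_CLIENT_SECRET": "example-secret",
}
client_id, client_secret = load_foursquare_credentials(demo_env)
print("Credentials loaded:", bool(client_id and client_secret))
```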


Create a dataframe of the most common amenities in each neighborhood. The following cell defines a function that loops through the neighborhoods and requests venue information from Foursquare.

In [610]:
radius = 500
LIMIT = 100


def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [611]:
toronto_venues = getNearbyVenues(names=df_torontoinfo['Neighborhood'],
                                   latitudes=df_torontoinfo['Latitude'],
                                   longitudes=df_torontoinfo['Longitude']
                                  )




Rouge,Malvern


KeyError: 'groups'
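A `KeyError: 'groups'` like the one above means the JSON response had no `groups` entry under `response`, which is what indexing with `["response"]['groups']` assumes is always present. One defensive option is to treat a missing entry as "no venues" instead of failing. The sketch below uses made-up payloads; the shape of the failing payload is an assumption, and `extract_items` is a hypothetical helper.

```python
def extract_items(payload):
    """Return the list of venue items, or [] when no groups are present."""
    groups = payload.get("response", {}).get("groups") or []
    return groups[0].get("items", []) if groups else []


# Made-up payloads for illustration only
ok_payload = {
    "response": {"groups": [{"items": [{"venue": {"name": "Example Cafe"}}]}]}
}
failed_payload = {"meta": {"code": 429}, "response": {}}

print(len(extract_items(ok_payload)))      # one item
print(len(extract_items(failed_payload)))  # no items, no exception
```

Inside `getNearbyVenues`, the result of `requests.get(url).json()` could be passed through such a helper, so that one failed request does not abort the whole loop.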

In [612]:
print("There are {} unique venues categories.".format(len(toronto_venues['Venue Category'].unique())))

There are 279 unique venues categories.


Create a one-hot encoded dataframe to prepare the data for clustering.

In [613]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 
first_index = toronto_onehot.columns.get_loc('Neighborhood')
fixed_columns = [toronto_onehot.columns[first_index]] + list(toronto_onehot.columns[:first_index]) + list(toronto_onehot.columns[first_index+1:]) 
toronto_onehot = toronto_onehot[fixed_columns]

toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,"Adelaide,King,Richmond",0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.040000,...,0.010000,0.000000,0.000000,0.000000,0.000000,0.010000,0.0,0.000000,0.000000,0.000000
1,Agincourt,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000
2,"Agincourt North,L'Amoreaux East,Milliken,Steel...",0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000
3,"Albion Gardens,Beaumond Heights,Humbergate,Jam...",0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000
4,"Alderwood,Long Branch",0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000
5,"Bathurst Manor,Downsview North,Wilson Heights",0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.055556,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000
6,Bayview Village,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000
7,"Bedford Park,Lawrence Manor East",0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.045455,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000
8,Berczy Park,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.018182,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000
9,"Birch Cliff,Cliffside West",0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000
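To make the values in this table concrete: each number is the fraction of a neighborhood's venues that fall in a given category. A small sketch on toy data (the neighborhoods `A` and `B` and their venues are invented) shows the computation:

```python
import pandas as pd

# Toy venue list: neighborhood A has three venues, B has one
toy_venues = pd.DataFrame(
    {
        "Neighborhood": ["A", "A", "A", "B"],
        "Venue Category": ["Park", "Park", "Coffee Shop", "Park"],
    }
)

# One column per category, 1.0 where the venue belongs to it
toy_onehot = pd.get_dummies(
    toy_venues[["Venue Category"]], prefix="", prefix_sep="", dtype=float
)
toy_onehot.insert(0, "Neighborhood", toy_venues["Neighborhood"])

# The mean of the indicators is the share of venues in each category
toy_grouped = toy_onehot.groupby("Neighborhood").mean().reset_index()
print(toy_grouped)
```

For neighborhood `A`, two of its three venues are parks, so the `Park` column holds 2/3.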


In [614]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [615]:
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,"Adelaide,King,Richmond",Coffee Shop,Café,Steakhouse,Bar,American Restaurant
1,Agincourt,Lounge,Breakfast Spot,Clothing Store,Skating Rink,Yoga Studio
2,"Agincourt North,L'Amoreaux East,Milliken,Steel...",Park,Playground,Coffee Shop,Yoga Studio,Drugstore
3,"Albion Gardens,Beaumond Heights,Humbergate,Jam...",Grocery Store,Pharmacy,Pizza Place,Fast Food Restaurant,Coffee Shop
4,"Alderwood,Long Branch",Pizza Place,Coffee Shop,Pharmacy,Sandwich Place,Pub


## Cluster the Neighborhoods ##
We will cluster the data into 4 groups using K-Means. The number of groups was chosen by trying several values and picking the one whose result looked best on visual inspection.

In [616]:
# set number of clusters
kclusters = 4

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 3, 0, 0, 0, 0, 0, 0, 0], dtype=int32)
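Visual inspection can be complemented with a numeric check such as the silhouette score, which is higher when clusters are compact and well separated. The sketch below uses synthetic data from scikit-learn's `make_blobs`, not the Toronto venue frequencies, so the resulting `best_k` says nothing about the right value for this dataset.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (illustration only)
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Score a range of candidate cluster counts
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)
```

Replacing `X` with `toronto_grouped_clustering` would apply the same comparison to the neighborhood data.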

In [579]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = df_torontoinfo

# merge neighborhoods_venues_sorted into df_torontoinfo so each neighborhood keeps its latitude/longitude
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')


ValueError: cannot insert Cluster Labels, already exists
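The `ValueError` above is what `DataFrame.insert` raises when the column already exists, which happens if the cell is run a second time. One way to make the step safe to re-run is to drop any existing column first. This is a sketch on a toy frame, and `with_cluster_labels` is a hypothetical helper.

```python
import pandas as pd


def with_cluster_labels(frame, labels, column="Cluster Labels"):
    """Return a copy of frame with the label column placed first."""
    # errors="ignore" makes the drop a no-op when the column is absent
    out = frame.drop(columns=[column], errors="ignore")
    out.insert(0, column, labels)
    return out


toy = pd.DataFrame({"Neighborhood": ["A", "B"]})
once = with_cluster_labels(toy, [0, 1])
twice = with_cluster_labels(once, [1, 0])  # a second run does not raise
print(list(twice.columns))
```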

In [621]:
# drop rows with NA.  These rows are likely neighborhoods that did not have any data on Foursquare
toronto_merged = toronto_merged.dropna()
toronto_merged['Cluster Labels'] = toronto_merged['Cluster Labels'].astype('int64')

In [622]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Examine Clusters ##

In [625]:
# Cluster 0
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
1,"Highland Creek,Rouge Hill,Port Union",0,History Museum,Bar,Yoga Studio,Dumpling Restaurant,Dive Bar
2,"Guildwood,Morningside,West Hill",0,Intersection,Mexican Restaurant,Spa,Rental Car Location,Electronics Store
3,Woburn,0,Coffee Shop,Korean Restaurant,Yoga Studio,Discount Store,Dive Bar
4,Cedarbrae,0,Hakka Restaurant,Bakery,Athletics & Sports,Caribbean Restaurant,Thai Restaurant
6,"East Birchmount Park,Ionview,Kennedy Park",0,Discount Store,Hobby Shop,Department Store,Bus Station,Coffee Shop
8,"Cliffcrest,Cliffside,Scarborough Village West",0,American Restaurant,Motel,Dim Sum Restaurant,Discount Store,Dive Bar
9,"Birch Cliff,Cliffside West",0,Café,General Entertainment,Skating Rink,College Stadium,Concert Hall
10,"Dorset Park,Scarborough Town Centre,Wexford He...",0,Indian Restaurant,Pet Store,Chinese Restaurant,Light Rail Station,Vietnamese Restaurant
11,"Maryvale,Wexford",0,Auto Garage,Smoke Shop,Breakfast Spot,Middle Eastern Restaurant,Bakery
12,Agincourt,0,Lounge,Breakfast Spot,Clothing Store,Skating Rink,Yoga Studio


Cluster 0 is by far the largest cluster. Many of its neighborhoods have coffee shops, pizza places, and cafés among their top amenities.
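Cluster sizes can be confirmed by counting the labels directly. The sketch below uses a made-up label series; in the notebook the same call would be applied to `toronto_merged['Cluster Labels']`.

```python
import pandas as pd

# Made-up labels for illustration, not the notebook's clustering result
toy_labels = pd.Series([0, 0, 3, 0, 2, 0, 1, 3], name="Cluster Labels")

# Number of neighborhoods per cluster, ordered by cluster id
sizes = toy_labels.value_counts().sort_index()
print(sizes)
```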

In [624]:
# Cluster 1
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
98,Weston,1,Convenience Store,Yoga Studio,Eastern European Restaurant,Dive Bar,Dog Run


This cluster contains one neighborhood (Weston). It differs from the other clusters because a convenience store is its most common venue, which is not the case in the other clusters. It should be noted that the distinction between a convenience store and a drugstore is not clear.

In [559]:
# Cluster 2
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
32,Downsview Central,2,Korean Restaurant,Home Service,Baseball Field,Food Truck,Yoga Studio
63,Roselawn,2,Home Service,Garden,Yoga Studio,Drugstore,Diner
91,"Humber Bay,King's Mill Park,Kingsway Park Sout...",2,Baseball Field,Business Service,Construction & Landscaping,Yoga Studio,Dog Run
97,"Emery,Humberlea",2,Baseball Field,Yoga Studio,Dive Bar,Dog Run,Doner Restaurant


One trait that stands out in this cluster is the number of baseball fields. Gardening and construction businesses are also common. These neighborhoods are spaced evenly from each other. Perhaps they have enough open space for high schools and large businesses.

In [560]:
# Cluster 3
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,"Rouge,Malvern",3,Fast Food Restaurant,Dumpling Restaurant,Discount Store,Dive Bar,Dog Run
5,Scarborough Village,3,Spa,Playground,Drugstore,Diner,Discount Store
7,"Clairlea,Golden Mile,Oakridge",3,Bakery,Fast Food Restaurant,Intersection,Bus Line,Metro Station
14,"Agincourt North,L'Amoreaux East,Milliken,Steel...",3,Park,Playground,Coffee Shop,Yoga Studio,Drugstore
23,York Mills West,3,Park,Bank,Convenience Store,Yoga Studio,Dumpling Restaurant
25,Parkwoods,3,Food & Drink Shop,Park,Fast Food Restaurant,Falafel Restaurant,Event Space
30,"CFB Toronto,Downsview East",3,Park,Other Repair Shop,Airport,Snack Place,Yoga Studio
31,Downsview West,3,Hotel,Shopping Mall,Park,Grocery Store,Bank
40,East Toronto,3,Park,Coffee Shop,Convenience Store,Yoga Studio,Dumpling Restaurant
44,Lawrence Park,3,Park,Swim School,Bus Line,Yoga Studio,Drugstore


This cluster contains neighborhoods where outdoor venues such as parks, rivers, and fields are common. They are far from downtown Toronto.