<h1><font=18 align=center>Toronto Neighborhood Gym Analysis!</font></h1>
<h2><font=15 align=center>Part 3 - Exploring the Neighborhood</font></h2>
This project analyzes Toronto neighborhoods using data from the city's Wikipedia page and Foursquare venue data.  This notebook continues the project by creating a map and exploring the neighborhoods.


In [2]:
##import necessary modules
import pandas as pd
from bs4 import BeautifulSoup
import requests
#import geocoder # import geocoder - couldn't get it to work
import json
import numpy as np
import os #import for file handling
import folium
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

In [3]:
#scrape wikipedia table
url=requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
soup=url.text
#parse wiki data
wiki_html=BeautifulSoup(soup,'html.parser')

#instantiate data frame
columns=["Postal Code", "Borough", "Neighborhood"]
df_toronto=pd.DataFrame(columns=columns) 

#loop through table rows, slice table header
for tr in wiki_html.table.find_all('tr')[1:]:
        #set vars
        pc=tr.find_all('td')[0].get_text().rstrip()
        borough=tr.find_all('td')[1].get_text().rstrip()
        nh=tr.find_all('td')[2].get_text().rstrip()
        
        #Verify data.  Boroughs not assigned should be ignored.  Postal codes with multiple
        #neighborhoods should be combined
        if "Not assigned" not in borough:
            if pc not in df_toronto['Postal Code'].values:
                #add a new row; DataFrame.append was removed in pandas 2.0
                df_toronto.loc[len(df_toronto)]=[pc,borough,nh]
            else:
                dfloc=df_toronto.loc[df_toronto['Postal Code'] == pc].index
                df_toronto.loc[dfloc,'Neighborhood']=df_toronto.loc[dfloc,'Neighborhood'] + ',' + nh
                #print('PC:', pc, 'Neighborhood:', nh)
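
The row-by-row logic above (skip unassigned boroughs, merge neighborhoods that share a postal code) can also be written as a filter followed by a groupby. The following is a minimal sketch on made-up rows, not the notebook's actual method; the `raw` dataframe and its contents are illustrative assumptions, not data from the Wikipedia table.

```python
import pandas as pd

# Made-up sample rows standing in for the scraped table (not real Wikipedia data)
raw = pd.DataFrame({
    "Postal Code": ["M1A", "M5A", "M5A", "M3A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Not assigned", "Regent Park", "Harbourfront", "Parkwoods"],
})

# Drop rows whose borough is not assigned
assigned = raw[raw["Borough"] != "Not assigned"]

# Combine neighborhoods that share a postal code into one comma-separated string
combined = (
    assigned.groupby(["Postal Code", "Borough"], sort=False)["Neighborhood"]
    .agg(",".join)
    .reset_index()
)
print(combined)
```

This avoids growing the dataframe one row at a time, which tends to be slow for larger tables.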

In [4]:
#instantiate geo data frame and add lat/long columns
df_toronto_geo=df_toronto.copy(deep=True)
df_toronto_geo['Latitude']=np.nan
df_toronto_geo['Longitude']=np.nan

In [5]:
df_toronto_geo.head()

   Postal Code           Borough                                 Neighborhood  Latitude  Longitude
0          M3A        North York                                    Parkwoods       NaN        NaN
1          M4A        North York                             Victoria Village       NaN        NaN
2          M5A  Downtown Toronto                    Regent Park, Harbourfront       NaN        NaN
3          M6A        North York             Lawrence Manor, Lawrence Heights       NaN        NaN
4          M7A  Downtown Toronto  Queen's Park, Ontario Provincial Government       NaN        NaN


<h2><font=10 align=center>Import geodata from the Google Geocoding API</font></h2>

In [6]:
#define getgeodata function for pulling geodata from the google geocoding api

#read the api key from a local file and strip the trailing newline
with open('gk','r') as f:
    key=f.readline().strip()

def getgeodata(postal_code,key):
    #generate url
    apiurl='https://maps.googleapis.com/maps/api/geocode/json?address={},+Toronto,+Canada&key={}'.format(postal_code,key)

    #return the cached response if this postal code was already fetched
    fname='./geodata/{}.json'.format(postal_code)
    if os.path.exists(fname):
        with open(fname,'r') as f:
            return f.read()
    #otherwise call the api, then cache the raw json for later runs
    jsondata=requests.get(apiurl).text
    with open(fname,'x') as f:
        f.write(jsondata)
    return jsondata
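
The cache-or-fetch pattern used by `getgeodata` can be sketched on its own, independent of any particular API. In this minimal sketch, `cached_fetch` and `fake_fetch` are hypothetical names introduced for illustration, and a stand-in function replaces the real HTTP request.

```python
import os
import tempfile

def cached_fetch(path, fetch):
    """Return the text stored at path; call fetch() and save its result only on a cache miss."""
    if os.path.exists(path):
        with open(path, "r") as f:
            return f.read()
    text = fetch()  # only reached when no cached copy exists
    with open(path, "w") as f:
        f.write(text)
    return text

# Stand-in fetcher that records how often it is called (no network access)
calls = []
def fake_fetch():
    calls.append(1)
    return '{"results": []}'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "M3A.json")
    first = cached_fetch(path, fake_fetch)   # cache miss: calls fake_fetch
    second = cached_fetch(path, fake_fetch)  # cache hit: reads the file

print(first == second, len(calls))
```

Fetching before opening the file for writing means a failed request does not leave an empty cache file behind.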

In [7]:
#loop through dataframe, get geoData from google api and update dataframe.
for pc in df_toronto_geo['Postal Code']:
    dfloc=df_toronto_geo.loc[df_toronto_geo['Postal Code'] == pc].index #set dataframe location
    json_data=json.loads(getgeodata(pc,key)) #load json_data
    latitude=json_data['results'][0]['geometry']['location']['lat']
    longitude=json_data['results'][0]['geometry']['location']['lng']
    #write latitude/longitude
    df_toronto_geo.loc[dfloc,'Latitude']=latitude
    df_toronto_geo.loc[dfloc,'Longitude']=longitude
    #df_toronto_geo.loc[df_toronto_geo['Postal Code'] == pc]
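
The loop above indexes `results[0]` directly, so a response with no matches would raise an `IndexError`. A more defensive extraction might look like the sketch below. `extract_lat_lng` is a hypothetical helper, and the payloads are hand-written in the shape the loop reads, not real API responses.

```python
import json

def extract_lat_lng(payload):
    """Return (lat, lng) from a geocoding response dict, or None if it has no results."""
    results = payload.get("results", [])
    if not results:
        return None
    location = results[0]["geometry"]["location"]
    return location["lat"], location["lng"]

# Hand-written payloads in the shape the loop above reads (values are illustrative)
found = json.loads('{"results": [{"geometry": {"location": {"lat": 43.75, "lng": -79.33}}}]}')
empty = json.loads('{"results": []}')

print(extract_lat_lng(found))
print(extract_lat_lng(empty))
```

Rows for which the helper returns `None` could then keep their `NaN` coordinates instead of stopping the loop.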

In [8]:
df_toronto_geo.head()

   Postal Code           Borough                                 Neighborhood   Latitude   Longitude
0          M3A        North York                                    Parkwoods  43.753259  -79.329656
1          M4A        North York                             Victoria Village  43.725882  -79.315572
2          M5A  Downtown Toronto                    Regent Park, Harbourfront  43.654260  -79.360636
3          M6A        North York             Lawrence Manor, Lawrence Heights  43.718518  -79.464763
4          M7A  Downtown Toronto  Queen's Park, Ontario Provincial Government  43.662301  -79.389494


In [9]:
df_toronto_geo.shape

(103, 5)

<h2><font size=12 align=center>Map of Toronto</font></h2>

In [10]:
# create map of toronto
#lat/long of toronto
toronto_json=json.loads(getgeodata('Toronto',key))
latitude=toronto_json['results'][0]['geometry']['location']['lat']
longitude=toronto_json['results'][0]['geometry']['location']['lng']
toronto_map = folium.Map(location=[latitude, longitude], zoom_start=12)

In [11]:
#create dataframe of neighborhoods in Toronto
toronto_nh=df_toronto_geo[df_toronto_geo['Borough'].str.contains('Toronto')]

In [12]:
# add neighborhood markers to map
for lat, lng, borough, neighborhood in zip(toronto_nh['Latitude'], toronto_nh['Longitude'], toronto_nh['Borough'], toronto_nh['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(toronto_map)  

<h2><font size=12 align=center>Get Foursquare venue data!</font></h2>

In [13]:
#define foursquare credentials
with open('fk','r') as f:
    lines=f.readlines()
    CLIENT_ID=lines[0].strip()
    CLIENT_SECRET=lines[1].strip()
    VERSION = '20200517' # Foursquare API version

In [14]:
#setup a function to build and call the foursquare API.
#responses are cached in .json files so repeated runs while building the venue dataframe do not repeat API calls
def getvenuedata(pc,borough,lat,lng,radius,limit):
    #setup API URL
    apiurl = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
        CLIENT_ID, 
        CLIENT_SECRET, 
        VERSION, 
        lat, 
        lng,
        radius,
        limit)
    #file ops: return the cached response if present
    fname='./fdata/Venue-{}-{}.json'.format(pc,borough)
    if os.path.exists(fname):
        with open(fname,'r') as f:
            return json.load(f)
    #otherwise call the API, cache the raw response and return the parsed json
    data=requests.get(apiurl)
    with open(fname,'x') as f:
        f.write(data.text)
    return json.loads(data.text)

In [15]:
#instantiate a venue dataframe
columns=["Postal Code","Borough", "Neighborhood","Venue Category","Venue Name", "Venue Address", "Venue Foursqaure ID","Latitude", "Longitude"]
toronto_venues=pd.DataFrame(columns=columns)

In [17]:
#loop through all neighborhoods in Toronto dataframe.  
radius=500
limit=100

venue_rows=[] #collect rows in a list; DataFrame.append was removed in pandas 2.0
for pc,borough,nh,lat,long in zip(df_toronto_geo['Postal Code'],df_toronto_geo['Borough'],df_toronto_geo['Neighborhood'],df_toronto_geo['Latitude'],df_toronto_geo['Longitude']):

    jsondata=getvenuedata(pc,borough,lat,long,radius,limit)  # get the venue data
    
    #grab interesting data from json and add to the venue rows for analysis
    for venue in jsondata['response']['venues']:
        name = venue['name']
        ID = venue['id']
        #use 'None' if there is no address
        address=venue['location'].get('address','None')
        #check if there is a category. Categorize as the name if no category found.
        try:
            cat=venue['categories'][0]['shortName']
        except (KeyError, IndexError):
            cat=venue['name']
        vlat=venue['location']['lat']
        vlong=venue['location']['lng']      
        venue_rows.append({'Postal Code':pc, 'Borough':borough,'Neighborhood':nh,'Venue Category':cat,'Venue Name':name,'Venue Address':address,'Venue Foursqaure ID':ID,'Latitude':vlat,'Longitude': vlong})

#build the venue dataframe in one step, keeping the column order defined above
toronto_venues=pd.DataFrame(venue_rows, columns=toronto_venues.columns)

In [19]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 1545 unique categories.


<h2><font size=12 align=center>Analyze the types of gyms in each neighborhood!</font></h2>
Of the 1545 unique categories throughout Toronto, we will find all the venues that are categorized as a Gym and see which type is the most common.

First, we extract all of the venues with a category that includes the word "Gym".  We one-hot encode the extracted categories to enable analysis, then group the data by neighborhood.
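
To make the encoding and grouping step concrete, here is a minimal sketch on made-up venues. The `venues` dataframe and its category names are illustrative assumptions, not Foursquare data.

```python
import pandas as pd

# Made-up venues: two neighborhoods, three gym-like categories (illustrative only)
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B"],
    "Venue Category": ["Gym", "Gym", "Yoga Gym", "Climbing Gym"],
})

# One indicator column per category
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# The mean of each indicator column is that category's share of the neighborhood's gyms
grouped = onehot.groupby("Neighborhood").mean().reset_index()
print(grouped)
```

In this toy data, neighborhood A ends up with two thirds "Gym" and one third "Yoga Gym", while B is entirely "Climbing Gym".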

In [20]:
#dataframe of all venues with a category like Gym
toronto_gym=toronto_venues[toronto_venues['Venue Category'].str.contains('Gym')]

#one hot encode to use for analysis
toronto_gym_1hot=pd.get_dummies(toronto_gym[['Venue Category']],prefix="", prefix_sep="")

# add neighborhood column to 1hot dataframe
toronto_gym_1hot['Neighborhood'] = toronto_gym['Neighborhood']

# move neighborhood column to the first column
fixed_columns = [toronto_gym_1hot.columns[-1]] + list(toronto_gym_1hot.columns[:-1])
toronto_gym_1hot = toronto_gym_1hot[fixed_columns]

In [21]:
#group the 1hot gyms by neighborhood, take the mean of each category column and reset the index
toronto_gym_grouped = toronto_gym_1hot.groupby('Neighborhood').mean().reset_index()

We now create clusters of the neighborhoods based on the most common Gym venues.  We add labels to the dataframe to indicate which cluster the neighborhood belongs to.
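
As a rough illustration of what the clustering step does with per-neighborhood category shares, the sketch below runs a plain Lloyd's-algorithm k-means written in NumPy. This is a simplified stand-in, not scikit-learn's `KMeans` implementation; `kmeans_labels` and the `shares` array are hypothetical and made up for illustration.

```python
import numpy as np

def kmeans_labels(points, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm: return one cluster label per row of points."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows chosen at random (fancy indexing copies them)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every row to its nearest center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its rows; leave empty clusters in place
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

# Made-up category shares per neighborhood: columns are (Gym, Yoga Gym)
shares = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = kmeans_labels(shares, k=2)
print(labels)
```

Neighborhoods with similar mixes of gym categories receive the same label; the label numbers themselves are arbitrary.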

In [22]:
#borrowed from the DP0701EN-3-3-2 lab.  function to return most common venues
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
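
A quick usage sketch of the ranking function on a single made-up row. The function body is repeated so the example is self-contained, and the frequencies are illustrative assumptions.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    # Same logic as the function above: skip the first field (the neighborhood name)
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Made-up row of category frequencies for one neighborhood
row = pd.Series({"Neighborhood": "A", "Gym": 0.6, "Yoga Gym": 0.1, "Climbing Gym": 0.3})
top = return_most_common_venues(row, 2)
print(list(top))
```

The function returns category names, not frequencies, ordered from most to least common.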

In [23]:
#borrowed method from the DP0701EN-3-3-2 lab
#rank the venues. there are only seven, so not able to do a top 10.
num_top_venues = 7

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        #no 'st', 'nd' or 'rd' suffix left, so fall back to 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_gym_grouped['Neighborhood']

for ind in np.arange(toronto_gym_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_gym_grouped.iloc[ind, :], num_top_venues)

In [24]:
#Cluster neighborhoods
#set number of clusters
kclusters = 5

toronto_gym_grouped_clustering = toronto_gym_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_gym_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 0, 3, 3, 2, 2, 0, 0, 3], dtype=int32)

In [25]:
#add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
toronto_gym_merged = toronto_venues[toronto_venues['Venue Category'].str.contains('Gym')]

# join the cluster labels and ranked gym categories onto each gym venue, which already carries latitude/longitude
toronto_gym_merged = toronto_gym_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

<h2><font size=12 align=center>Plot Gym venue data on our map!</font></h2>
In this section, we plot the neighborhoods with labels based on which cluster they belong to.  The markers are color coded by cluster.  All of the neighborhoods with the same color coded marker are similar based on the ranking of Gym venues.
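
One way to assign a distinct color to each cluster label is sketched below using only the standard library. This is an alternative to the matplotlib colormap used in the next cell, not the notebook's method; `cluster_colors` is a hypothetical helper.

```python
import colorsys

def cluster_colors(n_clusters):
    """Return one hex color per cluster label, with hues spread evenly around the color wheel."""
    colors_hex = []
    for i in range(n_clusters):
        r, g, b = colorsys.hsv_to_rgb(i / n_clusters, 0.8, 0.9)
        colors_hex.append('#{:02x}{:02x}{:02x}'.format(int(r * 255), int(g * 255), int(b * 255)))
    return colors_hex

palette = cluster_colors(5)
# Index directly by the cluster label so label 0 maps to the first color
for label in [0, 3, 2]:
    print(label, palette[label])
```

Indexing the palette by the label itself keeps the mapping from cluster to color easy to read off a legend.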

In [26]:
# create map
toronto_gym_map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_gym_merged['Latitude'], toronto_gym_merged['Longitude'], toronto_gym_merged['Neighborhood'], toronto_gym_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(toronto_gym_map_clusters)
       
toronto_gym_map_clusters