## Table of Contents

[Part 1 - Scraping data from Wikipedia page](#p1)<br>
[Part 2 - Integrating Location Coordinates](#p2)<br>
[Part 3 - REPLICATING Analysis](#p3)

<a id='p1'></a>
# Part 1 - Scraping data from Wikipedia page

Importing Libraries

In [1]:
import pandas as pd
import urllib.request
import folium
import json
import requests
from sklearn.cluster import KMeans


import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

Turning Data into pandas Dataframe

In [2]:
toronto_neighborhoods = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")[0]

Adjusting the header (column titles)

In [3]:
toronto_neighborhoods.columns=["PostalCode","Borough","Neighborhood"]
toronto_neighborhoods = toronto_neighborhoods.drop([0]).reset_index(drop=True)

Removing Cells with borough "Not assigned"

In [4]:
toronto_neighborhoods = toronto_neighborhoods[toronto_neighborhoods["Borough"] != "Not assigned"].reset_index(drop=True)

Check for any remaining entries whose Neighborhood column is "Not assigned".

In [5]:
(toronto_neighborhoods["Neighborhood"]=="Not assigned").sum()

0

Number of rows

In [6]:
toronto_neighborhoods.shape[0]

103
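
The cleaning steps above can be sketched on a small toy table. This is only an illustration: the rows are made up, and the fallback that copies the borough name into a missing neighborhood is a hypothetical extra step that was not needed here, since the check above returned 0.

```python
import pandas as pd

# Toy stand-in for the scraped table (values are made up for illustration).
raw = pd.DataFrame(
    {
        "PostalCode": ["M1A", "M3A", "M5A", "M7A"],
        "Borough": ["Not assigned", "North York", "Downtown Toronto", "Downtown Toronto"],
        "Neighborhood": ["Not assigned", "Parkwoods", "Regent Park", "Not assigned"],
    }
)

# Drop rows whose borough is missing, as in the cells above.
clean = raw[raw["Borough"] != "Not assigned"].reset_index(drop=True)

# Hypothetical fallback: reuse the borough name where the neighborhood is missing.
missing = clean["Neighborhood"] == "Not assigned"
clean.loc[missing, "Neighborhood"] = clean.loc[missing, "Borough"]

print(clean)
```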

<a id='p2'></a>
# Part 2 - Integrating Location Coordinates

I attempted to use the geocoder library with the Google API, but the only response I got was `<[REQUEST_DENIED] Google - Geocode [empty]>`.

In [7]:
import geocoder
pcode = toronto_neighborhoods['PostalCode'][0]
g = geocoder.google(pcode + ', Toronto, Ontario, Canada')
print(g)
g.ok

<[REQUEST_DENIED] Google - Geocode [empty]>


False

So I decided to use the provided dataset of coordinates instead.

Getting coordinates dataframe

In [8]:
url = "https://cocl.us/Geospatial_data"
filename = "location_data.csv"
urllib.request.urlretrieve(url, filename)
print('Data downloaded!')

Data downloaded!


In [9]:
location_data = pd.read_csv(filename)
location_data.columns = ["PostalCode","Latitude","Longitude"]

Merging the two dataframes using PostalCode as the reference column

In [10]:
loc_data = pd.merge(toronto_neighborhoods,location_data,on="PostalCode")

In [11]:
loc_data.shape

(103, 5)
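
The shape check above confirms that no rows were lost. A slightly stricter check, sketched here on toy data, is a left merge with `indicator=True`, which lists any postal codes that have no coordinates. The frames and the code `M9Z` are made up for illustration.

```python
import pandas as pd

# Toy frames; "M9Z" is a made-up postal code with no coordinates.
left = pd.DataFrame(
    {"PostalCode": ["M3A", "M4A", "M9Z"],
     "Borough": ["North York", "North York", "Nowhere"]}
)
right = pd.DataFrame(
    {"PostalCode": ["M3A", "M4A"],
     "Latitude": [43.753259, 43.725882],
     "Longitude": [-79.329656, -79.315572]}
)

# how="left" keeps every neighborhood row; validate raises if a key repeats.
checked = pd.merge(left, right, on="PostalCode", how="left",
                   validate="one_to_one", indicator=True)

# Rows marked "left_only" found no match in the coordinates table.
unmatched = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```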

In [12]:
loc_data.to_csv('p2output.csv', index=False)

In [13]:
loc_data.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937


<a id='p3'></a>
# Part 3 - REPLICATING Analysis

All the neighborhood centers highlighted on the map

In [14]:
#Location of toronto
Latitude, Longitude = 43.6532,-79.3832

In [15]:
#placing markers for all neighborhoods
map_toronto = folium.Map(location=[Latitude,Longitude],zoom_start=10)

for lat, lng, borough, neighborhood in zip(loc_data['Latitude'],loc_data['Longitude'],loc_data['Borough'],loc_data['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker([lat, lng],
                        popup=label,
                        radius=5,
                        color='#D2691E',
                        fill=True,
                        fill_color='#D2691E',
                        fill_opacity=0.3).add_to(map_toronto)

map_toronto

Foursquare Credentials

In [16]:
CLIENT_ID = 'nope' # your Foursquare ID
CLIENT_SECRET = 'nope' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value
radius = 500

#setting up for trial run
lat_2 = loc_data["Latitude"][0]
lng_2 = loc_data["Longitude"][0]


In [17]:
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    lat_2, 
    lng_2, 
    radius, 
    LIMIT)
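
As an alternative to formatting the query string by hand, the same URL could be assembled with `urllib.parse.urlencode`, which handles escaping. This is a minimal sketch: `build_explore_url` is a hypothetical helper name and the credentials are placeholders.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the Foursquare explore URL from its query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    # urlencode percent-escapes the values (the comma in "ll" becomes %2C).
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)


# Placeholder credentials; the coordinates are those of the first row above.
example_url = build_explore_url("ID", "SECRET", "20180605", 43.753259, -79.329656, 500, 100)
print(example_url)
```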


In [18]:
def getNearbyVenues(neighborhoodNames, lats, lngs, rad=500):
    venues_list=[]
    
    for name, lat, lng in zip(neighborhoodNames, lats, lngs):
        #constructing url (uses the rad argument rather than the global radius)
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
                CLIENT_ID, 
                CLIENT_SECRET, 
                VERSION, 
                lat, 
                lng, 
                rad, 
                LIMIT)
        # sending GET request and extracting the venue items
        results = requests.get(url).json()['response']['groups'][0]['items']
        #making list of tuples
        venues_list.append([(name,
                           lat,
                           lng,
                           v['venue']['name'],
                           v['venue']['location']['lat'],
                           v['venue']['location']['lng'],
                           v['venue']['categories'][0]['name']) for v in results])

    #turning list into dataframe once, after all requests have finished
    nearby_venues = pd.DataFrame(
        [item for venue_list in venues_list for item in venue_list],
        columns=['Neighborhood', 
                 'Neighborhood Latitude', 
                 'Neighborhood Longitude', 
                 'Venue', 
                 'Venue Latitude', 
                 'Venue Longitude', 
                 'Venue Category'])

    return nearby_venues
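
The chained lookup `['response']['groups'][0]['items']` raises an exception if a neighborhood returns no groups or a venue has no category. A more defensive parser might look like the sketch below. `venues_from_response` is a hypothetical helper, and the sample payload is made up to mirror the keys used above, not taken from a real API response.

```python
def venues_from_response(name, lat, lng, payload):
    """Turn one explore response (already parsed from JSON) into row tuples."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Fall back to None when a venue carries no category.
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows


# Made-up payload with the same nesting as the keys accessed above.
sample = {
    "response": {
        "groups": [
            {"items": [
                {"venue": {"name": "Brookbanks Park",
                           "location": {"lat": 43.751976, "lng": -79.33214},
                           "categories": [{"name": "Park"}]}},
                {"venue": {"name": "Unnamed Spot",
                           "location": {"lat": 43.75, "lng": -79.33},
                           "categories": []}},
            ]}
        ]
    }
}

rows = venues_from_response("Parkwoods", 43.753259, -79.329656, sample)
empty_rows = venues_from_response("Parkwoods", 43.753259, -79.329656, {})
print(rows)
```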
            

In [19]:
toronto_venues=getNearbyVenues(loc_data.Neighborhood, loc_data.Latitude, loc_data.Longitude)
print("done!")

done!


In [20]:
toronto_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
3,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop
4,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant


Number of neighborhoods with at least one venue

In [21]:
toronto_venues.groupby('Neighborhood').count().shape

(96, 6)
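
The shape above counts the neighborhoods that returned venues. To see the number of venues in each neighborhood, a per-group count can be used. The sketch below reuses only the five rows shown in the `head()` output, so the counts are for illustration and not the real totals.

```python
import pandas as pd

# First five rows of the venues table shown above (not the full data).
venues = pd.DataFrame(
    {
        "Neighborhood": ["Parkwoods", "Parkwoods", "Victoria Village",
                         "Victoria Village", "Victoria Village"],
        "Venue": ["Brookbanks Park", "Variety Store", "Victoria Village Arena",
                  "Tim Hortons", "Portugril"],
    }
)

# One count per neighborhood instead of the shape of the grouped frame.
per_neighborhood = venues.groupby("Neighborhood")["Venue"].count()
print(per_neighborhood)
```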

Number of unique venue categories

In [22]:
len(toronto_venues['Venue Category'].unique())

277

### Analyzing each neighborhood

In [23]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues['Venue Category'])



In [24]:
#neighborhood column
Neighborhood = toronto_venues['Neighborhood']

#dropping the one-hot column named 'Neighborhood' (it appears because 'Neighborhood' is itself one of the venue categories)
toronto_onehot = toronto_onehot.drop(['Neighborhood'],axis=1)

toronto_onehot.insert(0,'Neighborhood', Neighborhood)

In [25]:
toronto_onehot.shape

(2167, 277)

Grouping all the venues for neighborhoods

In [26]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').sum().reset_index() # number of venues per category
toronto_grouped_2 = toronto_onehot.groupby('Neighborhood').mean().reset_index() # share of venues in each category
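
The difference between the two groupings can be seen on a toy example. The neighborhoods "A" and "B" and their venues are made up. `sum` gives the number of venues per category, while `mean` gives the share of a neighborhood's venues in each category.

```python
import pandas as pd

# Made-up venues: neighborhood A has 2 parks and 1 coffee shop, B has 1 coffee shop.
venues = pd.DataFrame(
    {
        "Neighborhood": ["A", "A", "A", "B"],
        "Venue Category": ["Park", "Park", "Coffee Shop", "Coffee Shop"],
    }
)

# One-hot encode the category, then put the neighborhood name back in front.
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

totals = onehot.groupby("Neighborhood").sum()   # counts per category
shares = onehot.groupby("Neighborhood").mean()  # fraction per category

print(totals)
print(shares)
```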

Make a column to check total number of venues in each neighborhood

In [27]:
toronto_grouped_noname = toronto_grouped.drop('Neighborhood', axis=1)
Total = toronto_grouped_noname.sum(axis=1)
toronto_grouped.insert(1, 'Total', Total)
toronto_grouped.head()

Unnamed: 0,Neighborhood,Total,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,...,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Agincourt,5,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Alderwood, Long Branch",7,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Bathurst Manor, Wilson Heights, Downsview North",21,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Bayview Village,4,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Bedford Park, Lawrence Manor East",22,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### Visualizing total number of venues

In [28]:
toronto_merged = toronto_grouped.join(loc_data.set_index('Neighborhood'), on='Neighborhood')

In [29]:
map_total = folium.Map(location=[43.6532,-79.3832], zoom_start=10)
for lat, lng, borough, neighborhood,total in zip(toronto_merged['Latitude'],toronto_merged['Longitude'],toronto_merged['Borough'],toronto_merged['Neighborhood'],toronto_merged['Total']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker([lat, lng],
                        popup=label,
                        radius=5,
                        color='#D2691E',
                        fill=True,
                        opacity=(total/100)*0.9+0.1,
                        fill_color='#D2691E',
                        fill_opacity=(total/100)*0.9+0.1).add_to(map_total)



map_total

As the map shows, neighborhoods with a higher number of reported venues (opaque circles) are concentrated in the center of the city. As a general trend, the number of venues decreases as you move outward in any direction (transparent circles).
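
The opacity used for the markers is a linear rescaling of the venue count from the range 0 to 100 onto 0.1 to 1.0. A small sketch of that mapping follows. `marker_opacity` is a hypothetical helper, and the clamping is an added safeguard that the map code above does not need, because `LIMIT` caps each request at 100 venues.

```python
def marker_opacity(total, limit=100):
    """Map a venue count in [0, limit] to an opacity in [0.1, 1.0]."""
    # Clamp the share so counts outside the expected range stay valid.
    share = min(max(total / limit, 0.0), 1.0)
    return share * 0.9 + 0.1


for total in (0, 5, 50, 100):
    print(total, round(marker_opacity(total), 3))
```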

#### Clustering

In [30]:
#number of clusters
kclusters = 3
cluster_in = toronto_grouped_2.drop(['Neighborhood'], axis=1)

kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(cluster_in)
kmeans.labels_

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1,
       0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 2, 1, 1, 0, 1, 1, 1, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 1, 1, 1, 0, 1])
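
A quick way to read the label array is to count how many neighborhoods fall into each cluster. The sketch below copies the labels printed above and tallies them with `collections.Counter`.

```python
from collections import Counter

# Labels copied from the kmeans.labels_ output above.
labels = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1,
          0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 2, 1, 1, 0, 1, 1, 1, 0,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          0, 1, 1, 1, 1, 1, 0, 1]

sizes = Counter(labels)
print(sorted(sizes.items()))  # [(0, 10), (1, 84), (2, 2)]
```

The clusters are very uneven: most neighborhoods share label 1, and only two have label 2.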

In [31]:
#adding coordinates and output of KMeans algorithm to our data
cluster_out = toronto_grouped[['Neighborhood']].copy()
cluster_out.insert(0, 'Cluster Labels', kmeans.labels_)

final_clusters = cluster_out.join(loc_data.set_index('Neighborhood'),on='Neighborhood')
#this is the final table used for the cluster map
final_clusters.head(5) 

Unnamed: 0,Cluster Labels,Neighborhood,PostalCode,Borough,Latitude,Longitude
0,1,Agincourt,M1S,Scarborough,43.7942,-79.262029
1,1,"Alderwood, Long Branch",M8W,Etobicoke,43.602414,-79.543484
2,1,"Bathurst Manor, Wilson Heights, Downsview North",M3H,North York,43.754328,-79.442259
3,1,Bayview Village,M2K,North York,43.786947,-79.385975
4,1,"Bedford Park, Lawrence Manor East",M5M,North York,43.733283,-79.41975


#### Mapping

In [32]:
# create map
map_clusters = folium.Map(location=[43.6532,-79.3832], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map (labels are 0-based, so rainbow[cluster-1] wraps label 0 around to the last color)
markers_colors = []
for lat, lon, poi, cluster in zip(final_clusters['Latitude'], final_clusters['Longitude'], final_clusters['Neighborhood'], final_clusters['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

The conclusion is given at the end of this notebook.

#### Getting top 5 venues

In [33]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [34]:
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped_2['Neighborhood']

for ind in np.arange(toronto_grouped_2.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped_2.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Agincourt,Lounge,Breakfast Spot,Latin American Restaurant,Skating Rink,Clothing Store
1,"Alderwood, Long Branch",Pizza Place,Sandwich Place,Coffee Shop,Pub,Pharmacy
2,"Bathurst Manor, Wilson Heights, Downsview North",Coffee Shop,Bank,Pharmacy,Deli / Bodega,Shopping Mall
3,Bayview Village,Japanese Restaurant,Café,Bank,Chinese Restaurant,Distribution Center
4,"Bedford Park, Lawrence Manor East",Sandwich Place,Italian Restaurant,Coffee Shop,Greek Restaurant,Thai Restaurant
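
The column names above rely on a `try`/`except` around a three-element suffix list, which is fine for five columns but would produce names like "11st" for larger values of `num_top_venues`. A small helper that computes the suffix directly is sketched below. `ordinal` is a hypothetical name, not part of the original notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


num_top_venues = 5
columns = ["Neighborhood"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, num_top_venues + 1)
]
print(columns)
```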


In [35]:
cvc = final_clusters.join(neighborhoods_venues_sorted.set_index('Neighborhood'),on='Neighborhood')

#### Cluster 1

In [36]:
cvc.loc[cvc['Cluster Labels']==0,cvc.columns[[1] + list(range(6, cvc.shape[1]))]].reset_index(drop=True)

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Caledonia-Fairbanks,Park,Women's Store,Pool,Drugstore,Discount Store
1,"East Toronto, Broadview North (Old East York)",Intersection,Park,Convenience Store,Drugstore,Discount Store
2,"Kingsview Village, St. Phillips, Martin Grove ...",Pizza Place,Park,Sandwich Place,Bus Line,Dog Run
3,Lawrence Park,Park,Swim School,Bus Line,Yoga Studio,Donut Shop
4,"Milliken, Agincourt North, Steeles East, L'Amo...",Intersection,Playground,Park,Bakery,Donut Shop
5,"North Park, Maple Leaf Park, Upwood Park",Bakery,Park,Construction & Landscaping,Yoga Studio,Dumpling Restaurant
6,Parkwoods,Park,Food & Drink Shop,Yoga Studio,Drugstore,Discount Store
7,Rosedale,Park,Trail,Playground,Tennis Court,Donut Shop
8,Weston,Park,Yoga Studio,Drugstore,Discount Store,Distribution Center
9,York Mills West,Park,Convenience Store,Yoga Studio,Drugstore,Discount Store


#### Cluster 2

In [37]:
cvc.loc[cvc['Cluster Labels']==1,cvc.columns[[1] + list(range(6, cvc.shape[1]))]].reset_index(drop=True)

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Agincourt,Lounge,Breakfast Spot,Latin American Restaurant,Skating Rink,Clothing Store
1,"Alderwood, Long Branch",Pizza Place,Sandwich Place,Coffee Shop,Pub,Pharmacy
2,"Bathurst Manor, Wilson Heights, Downsview North",Coffee Shop,Bank,Pharmacy,Deli / Bodega,Shopping Mall
3,Bayview Village,Japanese Restaurant,Café,Bank,Chinese Restaurant,Distribution Center
4,"Bedford Park, Lawrence Manor East",Sandwich Place,Italian Restaurant,Coffee Shop,Greek Restaurant,Thai Restaurant
5,Berczy Park,Coffee Shop,Restaurant,Bakery,Cocktail Bar,Beer Bar
6,"Birch Cliff, Cliffside West",College Stadium,General Entertainment,Skating Rink,Café,Donut Shop
7,"Brockton, Parkdale Village, Exhibition Place",Café,Breakfast Spot,Nightclub,Coffee Shop,Gym
8,"Business reply mail Processing Centre, South C...",Pizza Place,Auto Workshop,Garden Center,Garden,Fast Food Restaurant
9,"CN Tower, King and Spadina, Railway Lands, Har...",Airport Lounge,Airport Service,Boutique,Historic Site,Bar


#### Cluster 3

In [38]:
cvc.loc[cvc['Cluster Labels']==2,cvc.columns[[1] + list(range(6, cvc.shape[1]))]].reset_index(drop=True)

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,"Humberlea, Emery",Baseball Field,Yoga Studio,Dumpling Restaurant,Dive Bar,Dog Run
1,"Old Mill South, King's Mill Park, Sunnylea, Hu...",Construction & Landscaping,Baseball Field,Yoga Studio,Dumpling Restaurant,Dive Bar


Conclusion: Judging by the most common venues, Cluster 1 (red) has the most parks, Cluster 2 (purple) has more coffee shops, and Cluster 3 (cyan) has venues that are uncommon in the other two clusters, such as baseball fields.