# Part 1

## Parsing the Toronto, Canada Page for Neighborhood, Borough and Postal Code Details

* Pandas is used for both parsing and data analysis.
* The data is scraped from "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M".

In [1]:
from itertools import zip_longest
import pandas as pd
import requests
from sklearn.cluster import KMeans
import folium
from geopy.geocoders import Nominatim
import matplotlib.cm as cm
import matplotlib.colors as colors
from numpy import arange, linspace

## Find the table and convert it into a DataFrame

In [2]:
toronto = pd.read_html("http://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M", attrs={'class': 'wikitable'})[0]

## Drop all rows where the Borough value is 'Not assigned'

In [3]:
toronto.drop(toronto[toronto['Borough'] == 'Not assigned'].index, inplace=True)
toronto.reset_index(drop=True, inplace=True)
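
The same filtering can also be written as a boolean mask. The sketch below runs on a small made-up frame (the rows are illustrative, not the scraped data) and assumes the column is named `Borough` as above.

```python
import pandas as pd

# Toy stand-in for the scraped table; the rows are made up for illustration.
sample = pd.DataFrame({
    "Postal code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only rows whose Borough is assigned, then renumber the index from 0.
assigned = sample[sample["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned)
```

This version returns a new frame instead of modifying the original in place, which can make it easier to re-run a cell.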

## Where a Neighborhood is 'Not assigned', replace it with the Borough value

In [4]:
toronto['Neighborhood'] = toronto.apply(lambda row: row['Borough'] if row['Neighborhood'] == 'Not assigned' else row['Neighborhood'], axis=1)
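
A row-wise `apply` works, but the same replacement can be sketched in vectorized form with `Series.mask`. The frame below is a made-up example, not the scraped table.

```python
import pandas as pd

# Toy frame; values are illustrative only.
sample = pd.DataFrame({
    "Borough": ["Queen's Park", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods"],
})

# Where the neighborhood is unassigned, take the borough name instead.
unassigned = sample["Neighborhood"] == "Not assigned"
sample["Neighborhood"] = sample["Neighborhood"].mask(unassigned, sample["Borough"])
print(sample)
```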

## Replace the " /" separator between joined Neighborhood names with ","

In [5]:
toronto['Neighborhood'] = toronto.Neighborhood.apply(lambda x: x.replace(" /", ","))
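
As a small illustration of the separator change, here is the same replacement using the vectorized string accessor on two made-up values. Passing `regex=False` is an assumption on my part to make clear that the pattern is a literal string.

```python
import pandas as pd

# Two illustrative values: one joined with " /", one single name.
names = pd.Series(["Regent Park / Harbourfront", "Rosedale"])

# Literal replacement of " /" with ","; names without the separator are unchanged.
cleaned = names.str.replace(" /", ",", regex=False)
print(cleaned.tolist())
```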

## Finally get the shape of the resulting dataset

In [6]:
toronto.shape

(103, 3)

# Part 2

### Geocoding did not return results, so the coordinates are read from a CSV dataset instead.

In [7]:
cordinates = pd.read_csv("Geospatial_Coordinates.csv")
toronto = toronto.merge(left_on='Postal code', right=cordinates, right_on='Postal Code')
toronto.drop(columns='Postal Code',inplace=True)
del cordinates
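
The merge above is an inner join, so postal codes without coordinates disappear silently. A possible variation, sketched on toy data, is a left join with `indicator=True` and `validate`, which shows which codes found no match. The code `M9Z` is hypothetical and exists only to demonstrate a miss.

```python
import pandas as pd

codes = pd.DataFrame({
    "Postal code": ["M5A", "M7A", "M9Z"],  # "M9Z" is a hypothetical code with no match
    "Borough": ["Downtown Toronto", "Downtown Toronto", "Example Borough"],
})
coords = pd.DataFrame({
    "Postal Code": ["M5A", "M7A"],
    "Latitude": [43.65426, 43.662301],
    "Longitude": [-79.360636, -79.389494],
})

# Left join keeps every postal code; validate raises if a key repeats on either side.
merged = codes.merge(coords, left_on="Postal code", right_on="Postal Code",
                     how="left", validate="one_to_one", indicator=True)

# Postal codes that found no coordinates.
missing = merged.loc[merged["_merge"] == "left_only", "Postal code"].tolist()
print(missing)
```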

### Keep only those boroughs whose names contain "Toronto"

In [8]:
toronto_boroughs = toronto[toronto.Borough.str.contains("Toronto", case=False, regex=False)].reset_index(drop=True)

### Neighborhood is a multivalued attribute, so first split it into one value per row

In [9]:
toronto_boroughs['Neighborhood'] = toronto_boroughs.Neighborhood.str.split(", ")
toronto_boroughs = toronto_boroughs.explode("Neighborhood")
toronto_boroughs.reset_index(drop=True, inplace=True)
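
To make the split-and-explode step concrete, here is a minimal sketch on two made-up rows. Each comma-separated name gets its own row, and the other columns are repeated.

```python
import pandas as pd

sample = pd.DataFrame({
    "Postal code": ["M5A", "M4W"],
    "Neighborhood": ["Regent Park, Harbourfront", "Rosedale"],
})

# Turn the comma-separated string into a list, then emit one row per list item.
sample["Neighborhood"] = sample["Neighborhood"].str.split(", ")
long_form = sample.explode("Neighborhood").reset_index(drop=True)
print(long_form)
```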

In [10]:
toronto_boroughs.head()

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,Regent Park,43.65426,-79.360636
1,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
2,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494
3,M7A,Downtown Toronto,Ontario Provincial Government,43.662301,-79.389494
4,M5B,Downtown Toronto,Garden District,43.657162,-79.378937


### Credentials for FourSquare API and Global Variables to be used.

In [11]:
CLIENT_ID = ''
CLIENT_SECRET = ''
VERSION = '20200412'
LIMIT = 100
RADIUS = 500

### Get venues near each neighborhood from Foursquare

In [12]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
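
The function above indexes straight into the JSON response, so a venue without a category or an empty response raises an exception. Below is a hedged sketch of a more tolerant parsing step. `extract_venues` is a hypothetical helper, and `fake_payload` is a made-up dictionary that only mimics the keys the function above reads; it is not real API output.

```python
def extract_venues(payload, name, lat, lng):
    """Flatten one explore-style response into row tuples, tolerating missing keys."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        # An empty or missing category list yields None instead of an IndexError.
        categories = venue.get("categories") or [{}]
        rows.append((name, lat, lng, venue.get("name"),
                     location.get("lat"), location.get("lng"),
                     categories[0].get("name")))
    return rows


# Made-up payload shaped like the keys used above (assumption, not real data).
fake_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Example Bakery",
               "location": {"lat": 43.65, "lng": -79.36},
               "categories": [{"name": "Bakery"}]}},
    {"venue": {"name": "Uncategorized Place",
               "location": {"lat": 43.66, "lng": -79.37},
               "categories": []}},
]}]}}

rows = extract_venues(fake_payload, "Regent Park", 43.65426, -79.360636)
empty = extract_venues({"response": {}}, "Regent Park", 43.65426, -79.360636)
print(rows)
```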

In [13]:
toronto_venues = getNearbyVenues(names=toronto_boroughs['Neighborhood'],
                                latitudes=toronto_boroughs['Latitude'],
                                longitudes=toronto_boroughs['Longitude'])

Regent Park
Harbourfront
Queen's Park
Ontario Provincial Government
Garden District
Ryerson
St. James Town
The Beaches
Berczy Park
Central Bay Street
Christie
Richmond
Adelaide
King
Dufferin
Dovercourt Village
Harbourfront East
Union Station
Toronto Islands
Little Portugal
Trinity
The Danforth West
Riverdale
Toronto Dominion Centre
Design Exchange
Brockton
Parkdale Village
Exhibition Place
India Bazaar
The Beaches West
Commerce Court
Victoria Hotel
Studio District
Lawrence Park
Roselawn
Davisville North
Forest Hill North & West
High Park
The Junction South
North Toronto West
The Annex
North Midtown
Yorkville
Parkdale
Roncesvalles
Davisville
University of Toronto
Harbord
Runnymede
Swansea
Moore Park
Summerhill East
Kensington Market
Chinatown
Grange Park
Summerhill West
Rathnelly
South Hill
Forest Hill SE
Deer Park
CN Tower
King and Spadina
Railway Lands
Harbourfront West
Bathurst Quay
South Niagara
Island airport
Rosedale
Stn A PO Boxes
St. James Town
Cabbagetown
First Canadian Place
U

In [14]:
toronto_venues.shape

(3127, 7)

## Data Analysis

In [15]:
toronto_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Adelaide,94,94,94,94,94,94
Bathurst Quay,18,18,18,18,18,18
Berczy Park,56,56,56,56,56,56
Brockton,23,23,23,23,23,23
Business reply mail Processing CentrE,14,14,14,14,14,14
...,...,...,...,...,...,...
Underground city,100,100,100,100,100,100
Union Station,100,100,100,100,100,100
University of Toronto,35,35,35,35,35,35
Victoria Hotel,100,100,100,100,100,100


### Check the unique venue categories, as the clustering is based only on them.

In [16]:
toronto_venues['Venue Category'].unique()

array(['Bakery', 'Coffee Shop', 'Distribution Center', 'Spa',
       'Breakfast Spot', 'Restaurant', 'Park', 'Historic Site', 'Pub',
       'Farmers Market', 'Chocolate Shop', 'Dessert Shop', 'Theater',
       'Performing Arts Venue', 'Gym / Fitness Center',
       'French Restaurant', 'Café', 'Mexican Restaurant', 'Event Space',
       'Yoga Studio', 'Ice Cream Shop', 'Asian Restaurant', 'Shoe Store',
       'Art Gallery', 'Cosmetics Shop', 'Electronics Store', 'Bank',
       'Beer Store', 'Hotel', 'Health Food Store', 'Antique Shop',
       'Italian Restaurant', 'Beer Bar', 'Sushi Restaurant', 'Creperie',
       'Arts & Crafts Store', 'Burrito Place', 'Diner', 'Hobby Shop',
       'Discount Store', 'Fried Chicken Joint', 'Burger Joint',
       'Juice Bar', 'Sandwich Place', 'Gym', 'College Auditorium', 'Bar',
       'Clothing Store', 'Comic Shop', 'Plaza', 'Tea Room', 'Music Venue',
       'Ramen Restaurant', 'Thai Restaurant', 'Steakhouse',
       'Sporting Goods Shop', 'Shopping Ma

### Now, one-hot encode the dataset, with one column per venue category.

+ The `get_dummies()` method is used for this.
+ The one-hot column named `Neighborhood` (a venue category with that name) is dropped because it clashes with our original Neighborhood column. It could have been renamed instead, but I decided to drop it.

In [17]:
toronto_categories = pd.concat([toronto_venues['Neighborhood'], toronto_venues['Venue Category'].str.get_dummies().drop(columns='Neighborhood')], axis='columns')
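
An alternative to dropping the clashing column is to prefix the one-hot columns so that no category name can collide with `Neighborhood`. This is only a sketch on made-up rows. The `cat` prefix is my own choice, and it changes the column names that later cells would see.

```python
import pandas as pd

# Toy venues; one of them has the category literally named "Neighborhood".
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "B"],
    "Venue Category": ["Bakery", "Neighborhood", "Park"],
})

# Prefixing every dummy column keeps all categories and avoids the name clash.
one_hot = pd.get_dummies(venues["Venue Category"], prefix="cat", dtype=int)
encoded = pd.concat([venues["Neighborhood"], one_hot], axis="columns")
print(encoded)
```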

In [18]:
toronto_categories = toronto_categories.groupby('Neighborhood').mean().reset_index()
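
Taking the mean of one-hot columns per neighborhood gives the share of venues in each category. A tiny made-up example shows the arithmetic: three coffee shops out of four venues become 0.75.

```python
import pandas as pd

# One row per venue, already one-hot encoded (illustrative values).
one_hot = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B"],
    "Coffee Shop": [1, 1, 1, 0, 0],
    "Park": [0, 0, 0, 1, 1],
})

# The mean of a 0/1 column is the fraction of venues in that category.
freq = one_hot.groupby("Neighborhood").mean().reset_index()
print(freq)
```

Each row of the result sums to 1, so neighborhoods with very different venue counts become comparable.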

In [19]:
toronto_categories.head()

Unnamed: 0,Neighborhood,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Women's Store,Yoga Studio
0,Adelaide,0.0,0.0,0.0,0.0,0.0,0.0,0.031915,0.0,0.0,...,0.0,0.0,0.0,0.0,0.010638,0.0,0.0,0.0,0.010638,0.0
1,Bathurst Quay,0.055556,0.055556,0.055556,0.111111,0.166667,0.111111,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.017857,0.0,0.0,0.0,0.0,0.0
3,Brockton,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Business reply mail Processing CentrE,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Get the top 10 venues of each Neighborhood

In [22]:
top_10_venues  = pd.DataFrame(list(toronto_categories.apply(lambda x: x.iloc[1:].sort_values(ascending=False).index[:10].values, axis=1)),
                             columns=['{}{} most common venue'.format(i+1, j) for i , j in zip_longest(range(10), ['st', 'nd', 'rd'], fillvalue='th')])
top_10_venues.insert(0, 'Neighborhood', toronto_categories['Neighborhood'])
top_10_venues.head()

Unnamed: 0,Neighborhood,1st most common venue,2nd most common venue,3rd most common venue,4th most common venue,5th most common venue,6th most common venue,7th most common venue,8th most common venue,9th most common venue,10th most common venue
0,Adelaide,Coffee Shop,Café,Restaurant,Hotel,Thai Restaurant,Deli / Bodega,Gym,American Restaurant,Bakery,Bar
1,Bathurst Quay,Airport Service,Airport Lounge,Airport Terminal,Airport,Bar,Coffee Shop,Plane,Rental Car Location,Sculpture Garden,Boat or Ferry
2,Berczy Park,Coffee Shop,Cocktail Bar,Bakery,Cheese Shop,Café,Italian Restaurant,Restaurant,Farmers Market,Beer Bar,Seafood Restaurant
3,Brockton,Café,Breakfast Spot,Coffee Shop,Grocery Store,Bakery,Stadium,Burrito Place,Restaurant,Climbing Gym,Pet Store
4,Business reply mail Processing CentrE,Light Rail Station,Garden Center,Burrito Place,Auto Workshop,Spa,Fast Food Restaurant,Farmers Market,Garden,Pizza Place,Gym / Fitness Center
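
The one-liner above is compact but hard to read. Below is a more explicit sketch of the same ranking idea on a made-up frequency table. `ordinal` and `top_venues` are hypothetical helper names, not part of the notebook.

```python
import pandas as pd


def ordinal(n):
    """Return '1st', '2nd', '3rd', '4th', ... with 11th to 13th handled."""
    if 11 <= n % 100 <= 13:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def top_venues(freq, n):
    """For each neighborhood, list the n categories with the highest frequency."""
    cats = freq.set_index("Neighborhood")
    rows = {
        hood: row.sort_values(ascending=False, kind="mergesort").index[:n].tolist()
        for hood, row in cats.iterrows()
    }
    cols = [f"{ordinal(i)} most common venue" for i in range(1, n + 1)]
    table = pd.DataFrame.from_dict(rows, orient="index", columns=cols)
    return table.rename_axis("Neighborhood").reset_index()


# Illustrative frequencies, one row per neighborhood.
freq = pd.DataFrame({
    "Neighborhood": ["A", "B"],
    "Coffee Shop": [0.6, 0.1],
    "Park": [0.3, 0.7],
    "Pub": [0.1, 0.2],
})
result = top_venues(freq, 2)
print(result)
```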


# Clustering the Neighborhoods

In [23]:
kclusters = 5
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_categories.drop(columns='Neighborhood'))
top_10_venues.insert(1, 'Cluster Labels', kmeans.labels_)
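
To show what the clustering step does, here is a minimal sketch of `KMeans` on four made-up frequency profiles. Setting `n_init=10` explicitly is my assumption, made so the result does not depend on the library's default. The label numbers themselves are arbitrary, and only the grouping matters.

```python
import numpy as np
from sklearn.cluster import KMeans

# Four illustrative neighborhoods described by three category frequencies.
profiles = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.9],
])

model = KMeans(n_clusters=2, random_state=0, n_init=10).fit(profiles)
labels = model.labels_

# Number of neighborhoods assigned to each cluster.
sizes = np.bincount(labels, minlength=2)
print(labels, sizes)
```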

In [24]:
# join toronto_boroughs with top_10_venues to add the cluster label and top venues to each neighborhood
toronto_merged = toronto_boroughs.join(top_10_venues.set_index('Neighborhood'), on='Neighborhood')
toronto_merged.head()

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st most common venue,2nd most common venue,3rd most common venue,4th most common venue,5th most common venue,6th most common venue,7th most common venue,8th most common venue,9th most common venue,10th most common venue
0,M5A,Downtown Toronto,Regent Park,43.65426,-79.360636,1,Coffee Shop,Park,Pub,Bakery,Breakfast Spot,Restaurant,Café,Theater,Mexican Restaurant,Shoe Store
1,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636,1,Coffee Shop,Park,Pub,Bakery,Breakfast Spot,Restaurant,Café,Theater,Mexican Restaurant,Shoe Store
2,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494,1,Coffee Shop,Sushi Restaurant,Diner,Yoga Studio,Creperie,Beer Bar,Sandwich Place,Burger Joint,Burrito Place,Café
3,M7A,Downtown Toronto,Ontario Provincial Government,43.662301,-79.389494,1,Coffee Shop,Sushi Restaurant,Diner,Yoga Studio,Creperie,Beer Bar,Sandwich Place,Burger Joint,Burrito Place,Café
4,M5B,Downtown Toronto,Garden District,43.657162,-79.378937,1,Clothing Store,Coffee Shop,Café,Bubble Tea Shop,Middle Eastern Restaurant,Cosmetics Shop,Japanese Restaurant,Fast Food Restaurant,Italian Restaurant,Restaurant


### Using geopy library to get the latitude and longitude values of Toronto.

In [25]:
address = 'Toronto,  Ontario'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [31]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
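
The color scheme can be checked on its own, without drawing a map. The sketch below builds one hex color per cluster and maps each label to a color directly; `color_for` is a hypothetical name, and indexing by the label itself (rather than label minus one) is my own simplification.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 5

# Sample the rainbow colormap at kclusters evenly spaced points and convert to hex.
palette = [colors.rgb2hex(rgba) for rgba in cm.rainbow(np.linspace(0, 1, kclusters))]

# Map each cluster label 0..kclusters-1 to its own color.
color_for = {cluster: palette[cluster] for cluster in range(kclusters)}
print(color_for)
```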

### Checking the cluster sizes, as one cluster is prevalent

+ Cluster 1 is by far the most prevalent: the outputs below show only 12 of the 74 rows in the other four clusters, so the rest fall under label 1
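
Cluster sizes can be read off directly with `value_counts` rather than by inspecting each subset. A minimal sketch on made-up labels, assuming the column is named `Cluster Labels` as above:

```python
import pandas as pd

# Illustrative labels only; the real column comes from kmeans.labels_.
merged = pd.DataFrame({"Cluster Labels": [1, 1, 1, 0, 3, 3, 2]})

# Count rows per label and find the label with the most rows.
sizes = merged["Cluster Labels"].value_counts().sort_index()
largest = sizes.idxmax()
print(sizes.to_dict(), largest)
```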

In [32]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0]

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st most common venue,2nd most common venue,3rd most common venue,4th most common venue,5th most common venue,6th most common venue,7th most common venue,8th most common venue,9th most common venue,10th most common venue
34,M5N,Central Toronto,Roselawn,43.711695,-79.416936,0,Home Service,Garden,Yoga Studio,Dessert Shop,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop


In [33]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1]

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st most common venue,2nd most common venue,3rd most common venue,4th most common venue,5th most common venue,6th most common venue,7th most common venue,8th most common venue,9th most common venue,10th most common venue
0,M5A,Downtown Toronto,Regent Park,43.654260,-79.360636,1,Coffee Shop,Park,Pub,Bakery,Breakfast Spot,Restaurant,Café,Theater,Mexican Restaurant,Shoe Store
1,M5A,Downtown Toronto,Harbourfront,43.654260,-79.360636,1,Coffee Shop,Park,Pub,Bakery,Breakfast Spot,Restaurant,Café,Theater,Mexican Restaurant,Shoe Store
2,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494,1,Coffee Shop,Sushi Restaurant,Diner,Yoga Studio,Creperie,Beer Bar,Sandwich Place,Burger Joint,Burrito Place,Café
3,M7A,Downtown Toronto,Ontario Provincial Government,43.662301,-79.389494,1,Coffee Shop,Sushi Restaurant,Diner,Yoga Studio,Creperie,Beer Bar,Sandwich Place,Burger Joint,Burrito Place,Café
4,M5B,Downtown Toronto,Garden District,43.657162,-79.378937,1,Clothing Store,Coffee Shop,Café,Bubble Tea Shop,Middle Eastern Restaurant,Cosmetics Shop,Japanese Restaurant,Fast Food Restaurant,Italian Restaurant,Restaurant
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70,M4X,Downtown Toronto,Cabbagetown,43.667967,-79.367675,1,Coffee Shop,Restaurant,Italian Restaurant,Café,Pub,Bakery,Market,Pizza Place,Grocery Store,Pet Store
71,M5X,Downtown Toronto,First Canadian Place,43.648429,-79.382280,1,Coffee Shop,Café,Restaurant,Hotel,Asian Restaurant,American Restaurant,Gym,Bar,Steakhouse,Gastropub
72,M5X,Downtown Toronto,Underground city,43.648429,-79.382280,1,Coffee Shop,Café,Restaurant,Hotel,Asian Restaurant,American Restaurant,Gym,Bar,Steakhouse,Gastropub
73,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160,1,Coffee Shop,Sushi Restaurant,Japanese Restaurant,Restaurant,Pub,Men's Store,Mediterranean Restaurant,Hotel,Gay Bar,Gastropub


In [34]:
toronto_merged[toronto_merged['Cluster Labels'] == 2]

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st most common venue,2nd most common venue,3rd most common venue,4th most common venue,5th most common venue,6th most common venue,7th most common venue,8th most common venue,9th most common venue,10th most common venue
50,M4T,Central Toronto,Moore Park,43.689574,-79.38316,2,Playground,Restaurant,Yoga Studio,Department Store,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant
51,M4T,Central Toronto,Summerhill East,43.689574,-79.38316,2,Playground,Restaurant,Yoga Studio,Department Store,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant


In [35]:
toronto_merged[toronto_merged['Cluster Labels'] == 3]

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st most common venue,2nd most common venue,3rd most common venue,4th most common venue,5th most common venue,6th most common venue,7th most common venue,8th most common venue,9th most common venue,10th most common venue
60,M5V,Downtown Toronto,CN Tower,43.628947,-79.39442,3,Airport Service,Airport Lounge,Airport Terminal,Airport,Bar,Coffee Shop,Plane,Rental Car Location,Sculpture Garden,Boat or Ferry
61,M5V,Downtown Toronto,King and Spadina,43.628947,-79.39442,3,Airport Service,Airport Lounge,Airport Terminal,Airport,Bar,Coffee Shop,Plane,Rental Car Location,Sculpture Garden,Boat or Ferry
62,M5V,Downtown Toronto,Railway Lands,43.628947,-79.39442,3,Airport Service,Airport Lounge,Airport Terminal,Airport,Bar,Coffee Shop,Plane,Rental Car Location,Sculpture Garden,Boat or Ferry
63,M5V,Downtown Toronto,Harbourfront West,43.628947,-79.39442,3,Airport Service,Airport Lounge,Airport Terminal,Airport,Bar,Coffee Shop,Plane,Rental Car Location,Sculpture Garden,Boat or Ferry
64,M5V,Downtown Toronto,Bathurst Quay,43.628947,-79.39442,3,Airport Service,Airport Lounge,Airport Terminal,Airport,Bar,Coffee Shop,Plane,Rental Car Location,Sculpture Garden,Boat or Ferry
65,M5V,Downtown Toronto,South Niagara,43.628947,-79.39442,3,Airport Service,Airport Lounge,Airport Terminal,Airport,Bar,Coffee Shop,Plane,Rental Car Location,Sculpture Garden,Boat or Ferry
66,M5V,Downtown Toronto,Island airport,43.628947,-79.39442,3,Airport Service,Airport Lounge,Airport Terminal,Airport,Bar,Coffee Shop,Plane,Rental Car Location,Sculpture Garden,Boat or Ferry


In [36]:
toronto_merged[toronto_merged['Cluster Labels'] == 4]

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st most common venue,2nd most common venue,3rd most common venue,4th most common venue,5th most common venue,6th most common venue,7th most common venue,8th most common venue,9th most common venue,10th most common venue
33,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,4,Park,Bus Line,Swim School,Dessert Shop,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant
67,M4W,Downtown Toronto,Rosedale,43.679563,-79.377529,4,Park,Playground,Trail,Deli / Bodega,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run


### Finally, each cluster can be named after the value that occurs most often among its members' 1st most common venue

In [37]:
labels = [toronto_merged.loc[toronto_merged['Cluster Labels'] == i, '1st most common venue'].value_counts().index[0] for i in range(kclusters)]
labels

['Home Service', 'Coffee Shop', 'Playground', 'Airport Service', 'Park']
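
The same naming rule can be sketched with a `groupby`, which returns a mapping from cluster label to name instead of a positional list. The rows below are made up, and `names` is a hypothetical variable.

```python
import pandas as pd

merged = pd.DataFrame({
    "Cluster Labels": [0, 1, 1, 1, 2],
    "1st most common venue": ["Home Service", "Coffee Shop", "Coffee Shop",
                              "Clothing Store", "Playground"],
})

# For each cluster, take the most frequent value of the top-venue column.
names = (merged.groupby("Cluster Labels")["1st most common venue"]
               .agg(lambda s: s.value_counts().index[0])
               .to_dict())
print(names)
```

A dictionary keyed by label stays correct even if some label has no rows, whereas a list built with `range(kclusters)` would fail on an empty cluster.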