# Battle of the Neighborhoods
### Coursera Data Science Capstone Project  
**Author**: Aaron Carter  
**Date**: 12/22/2020

This notebook extracts Postal Code data from an online resource using BeautifulSoup web scraping library.  
The resulting table is merged with coordinates for each Postal Code, to be used with folium visualizations.  
MORE

## Part 1: Data Collection and Cleaning  

Data is scraped from Wikipedia, and cleaned according to project specifications. 

In [36]:
import pandas as pd
from bs4 import BeautifulSoup as bs
import requests

Postal codes will be scraped from a provided [wikipedia](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M "Canada Postal Codes") resource  
This operation will be performed using the BeautifulSoup (bs4) library. 

In [37]:
#Link has been updated since course creation, using historical version
resp = requests.get("https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=1011037969")
resp

<Response [200]>

Wikipedia uses standard html formatting, so we are using the html parser included with bs4

In [38]:
soup = bs(resp.text, 'html.parser')

In [39]:
#See HTML response body
rows = []
respRows = soup.table.tbody.tr.find_next_siblings("tr") #Creates a list of table row elements (ignoring table header)

**This Cell uses nested "for" loops to convert the html response into a pandas dataframe**

In [40]:

c = 0


for row in respRows:
    newRow = {}
    cols = row.find_all("td")
    j = 0
    for col in cols:
        newRow[["Postal Code", "Borough", "Neighborhood"][j]] = col.text.replace('\n','') #Load columns one at a time 
        j+=1
    rows.append(newRow)
    
wikiData = pd.DataFrame(rows)
wikiData

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


In [41]:
wikiData_copy = wikiData #Just in case we need to go back to the original table

**Data is cleaned such that no Borough is "Not Assigned"**

In [42]:
wikiData = wikiData[wikiData["Borough"]!="Not assigned"]
if len(wikiData[wikiData["Neighborhood"]=="Not assigned"]) == 0:
    print("All neighborhoods assigned!")

if len(wikiData['Postal Code'].unique()) == len(wikiData):
    print("No Duplicate Postal Codes!")
wikiData

All neighborhoods assigned!
No Duplicate Postal Codes!


Unnamed: 0,Postal Code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


## TABLE FOR QUESTION 1
This is the first required table in the Coursera assignment

In [43]:
print("Table Shape: ",wikiData.shape)
wikiData

Table Shape:  (103, 3)


Unnamed: 0,Postal Code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


## Part 2: Merging Data with Geospatial Coordinates

Provided in the project requirements is a .csv file (GeoSpatial_Coordinates.csv) that provides Latitude and Longitude for each postal code.  
This file is consumed into a pandas dataframe and merged with the scraped Postal Code dataframe

In [44]:
geo1 = pd.read_csv("Geospatial_Coordinates.csv")
wikiData = pd.merge(left=wikiData, right=geo1, how='left', on="Postal Code")

## TABLE FOR QUESTION 2
This is the second required table in the Coursera Assignment

In [48]:
wikiData['Latitude'] = wikiData['Latitude'].astype(str)
wikiData['Longitude'] = wikiData['Longitude'].astype(str)
print(wikiData.shape)
wikiData

(103, 5)


Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.7532586,-79.3296565
1,M4A,North York,Victoria Village,43.725882299999995,-79.31557159999998
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6542599,-79.3606359
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718517999999996,-79.46476329999999
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.6623015,-79.3894938
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653653600000005,-79.5069436
99,M4Y,Downtown Toronto,Church and Wellesley,43.6658599,-79.38315990000001
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.6627439,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.6362579,-79.49850909999999


In [49]:
wikiData.to_csv("CanadaNeighborhoods.csv")

## Part 3: Analysis and Visualization

In [53]:
CLIENT_ID = 'P1AWGSLTDJSXOC2EMIRVMQOWD3CO0HRTCNRNJEUWM2R5XZNN' # your Foursquare ID
CLIENT_SECRET = '5RUWXK2HOV011WDQLMTWECVKEVB45K3N31ERCSIXNIS2K4RM' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 300 # A default Foursquare API limit value

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Postal Code', 
                  'Postal Code Latitude', 
                  'Postal Code Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [54]:
Canada_data = wikiData
Canada_venues = getNearbyVenues(Canada_data['Postal Code'], Canada_data['Latitude'], Canada_data['Longitude'])
Canada_venues

Unnamed: 0,Postal Code,Postal Code Latitude,Postal Code Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M3A,43.7532586,-79.3296565,Brookbanks Park,43.751976,-79.332140,Park
1,M3A,43.7532586,-79.3296565,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,M4A,43.725882299999995,-79.31557159999998,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
3,M4A,43.725882299999995,-79.31557159999998,Portugril,43.725819,-79.312785,Portuguese Restaurant
4,M4A,43.725882299999995,-79.31557159999998,Tim Hortons,43.725517,-79.313103,Coffee Shop
...,...,...,...,...,...,...,...
2112,M8Z,43.6288408,-79.52099940000001,Koala Tan Tanning Salon & Sunless Spa,43.631370,-79.519006,Tanning Salon
2113,M8Z,43.6288408,-79.52099940000001,Once Upon A Child,43.631075,-79.518290,Kids Store
2114,M8Z,43.6288408,-79.52099940000001,Value Village,43.631269,-79.518238,Thrift / Vintage Store
2115,M8Z,43.6288408,-79.52099940000001,Kingsway Boxing Club,43.627254,-79.526684,Gym


In [55]:
Canada_venues.to_csv("Canada_venues.csv")

In [56]:
Canada_venues = pd.read_csv("Canada_venues.csv").drop(["Unnamed: 0"], axis=1)
Canada_venues

Unnamed: 0,Postal Code,Postal Code Latitude,Postal Code Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M3A,43.753259,-79.329656,Brookbanks Park,43.751976,-79.332140,Park
1,M3A,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,M4A,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
3,M4A,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant
4,M4A,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop
...,...,...,...,...,...,...,...
2112,M8Z,43.628841,-79.520999,Koala Tan Tanning Salon & Sunless Spa,43.631370,-79.519006,Tanning Salon
2113,M8Z,43.628841,-79.520999,Once Upon A Child,43.631075,-79.518290,Kids Store
2114,M8Z,43.628841,-79.520999,Value Village,43.631269,-79.518238,Thrift / Vintage Store
2115,M8Z,43.628841,-79.520999,Kingsway Boxing Club,43.627254,-79.526684,Gym


In [57]:
Canada_venues['count'] = Canada_venues.groupby('Postal Code')['Postal Code'].transform('count')
Canada_venues.sort_values('count', ascending=False, inplace=True)
Canada_venues

Unnamed: 0,Postal Code,Postal Code Latitude,Postal Code Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,count
1058,M5K,43.647177,-79.381576,Bay Adelaide Centre,43.650879,-79.380003,Office,100
841,M5J,43.640816,-79.381752,Starbucks,43.643090,-79.383071,Coffee Shop,100
843,M5J,43.640816,-79.381752,Redpath Stage,43.638764,-79.383317,Music Venue,100
844,M5J,43.640816,-79.381752,Subway,43.639708,-79.383441,Sandwich Place,100
845,M5J,43.640816,-79.381752,Kellys Landing,43.645082,-79.383050,Restaurant,100
...,...,...,...,...,...,...,...,...
2100,M8Y,43.636258,-79.498509,Woodford Park,43.633152,-79.496266,Baseball Field,1
1236,M2M,43.789053,-79.408493,Newtonbrook Park,43.786942,-79.410022,Park,1
217,M9B,43.650943,-79.554724,Marius Bakery,43.648965,-79.549381,Bakery,1
1099,M2L,43.757490,-79.374714,Vyner Greenbelt,43.759642,-79.369590,Park,1


In [58]:
#268 categories
len(pd.unique(Canada_venues['Venue Category']))

269

In [59]:
#46 types of restaurants
len(pd.unique(Canada_venues[Canada_venues['Venue Category'].str.contains("Restaurant")]['Venue Category']))

47

#### Data Preparation
In order to feed a clustering algorithm, it is necessary to transform our data into a numeric representation.  
OneHot encoding is a method of doing so by creating a sparse matrix of (0,1) values.  
These binary values are then averaged in order to get the most popular type of venues for each postal code.

In [60]:
# one hot encoding
Canada_encoded = pd.get_dummies(Canada_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
Canada_encoded['Postal Code'] = Canada_venues['Postal Code'] 

# move neighborhood column to the first column
fixed_columns = [Canada_encoded.columns[-1]] + list(Canada_encoded.columns[:-1])
Canada_encoded = Canada_encoded[fixed_columns]

Canada_encoded.head()

Unnamed: 0,Postal Code,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Train Station,Truck Stop,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
1058,M5K,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
841,M5J,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
843,M5J,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
844,M5J,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
845,M5J,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [61]:
canada_grouped = Canada_encoded.groupby('Postal Code').mean().reset_index()
canada_grouped

Unnamed: 0,Postal Code,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Train Station,Truck Stop,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,M1B,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,M1C,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,M1E,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,M1G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,M1H,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
96,M9N,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
97,M9P,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
98,M9R,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
99,M9V,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [62]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

## KMeans Clustering 
This section divides the postal codes into 6 distinct clusters, visualizing them on a map of Toronto. 

The clusters are divided using the proximity to certain categories of venues. 
Clusters are distinct in that they represent a different experience to the people living within them. 

Note: This analysis makes no attempt to find postal codes that offer a diverse experience, but this map is useful for finding such locations, if they are desired. 

In [63]:
from sklearn.cluster import KMeans

In [64]:
#!conda install -c conda-forge folium=0.5.0 --yes 

In [65]:
import numpy as np
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Postal Code']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Postal Code'] = canada_grouped['Postal Code']

for ind in np.arange(canada_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(canada_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()


# set number of clusters
kclusters = 8

grouped_clustering = canada_grouped.drop('Postal Code', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(grouped_clustering)

# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

canada_merged = Canada_data

# merge manhattan_grouped with manhattan_data to add latitude/longitude for each neighborhood
canada_merged = canada_merged.join(neighborhoods_venues_sorted.set_index('Postal Code'), on='Postal Code')
canada_merged.dropna(inplace=True)
canada_merged['Cluster Labels'] = canada_merged['Cluster Labels'].astype(int)
canada_merged[['Latitude','Longitude']]=canada_merged[['Latitude','Longitude']].astype(float)
canada_merged

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,4,Food & Drink Shop,Park,Yoga Studio,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant
1,M4A,North York,Victoria Village,43.725882,-79.315572,0,Pizza Place,Coffee Shop,Intersection,Portuguese Restaurant,Hockey Arena,Dumpling Restaurant,Eastern European Restaurant,Electronics Store,Drugstore,Escape Room
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636,0,Coffee Shop,Park,Bakery,Pub,Breakfast Spot,Café,Theater,Yoga Studio,Shoe Store,Restaurant
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,0,Clothing Store,Furniture / Home Store,Accessories Store,Vietnamese Restaurant,Coffee Shop,Boutique,Ethiopian Restaurant,Escape Room,Electronics Store,Eastern European Restaurant
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,0,Coffee Shop,Sushi Restaurant,Diner,Bar,Beer Bar,Smoothie Shop,Burrito Place,Sandwich Place,Café,Portuguese Restaurant
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944,7,Park,River,Yoga Studio,Doner Restaurant,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Donut Shop
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160,0,Coffee Shop,Sushi Restaurant,Japanese Restaurant,Gay Bar,Restaurant,Fast Food Restaurant,Hotel,Mediterranean Restaurant,Men's Store,Pub
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558,0,Yoga Studio,Auto Workshop,Garden Center,Garden,Fast Food Restaurant,Light Rail Station,Farmers Market,Comic Shop,Park,Restaurant
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509,2,Baseball Field,Yoga Studio,Drugstore,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant,Fast Food Restaurant


In [66]:
import folium # map rendering library
import matplotlib.cm as cm
import matplotlib.colors as colors

In [67]:
# create map
map_clusters = folium.Map(location=[43.636258, -79.498509], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(canada_merged['Latitude'], canada_merged['Longitude'], canada_merged['Postal Code'], canada_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [68]:
df = canada_grouped.drop('Postal Code', 1)
df.columns = range(df.shape[1])
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,259,260,261,262,263,264,265,266,267,268
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
96,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
97,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
98,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
99,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## DBSCAN Clustering
This seciton utilizes DBSCAN Clustering, and defines 2 clusters. 
1. Downtown District
2. Anything Else

This cluster, like with the KMeans method, is identified through frequency of common venues

In [69]:
#Not happy with the results, trying alternate clustering algorithm
from sklearn.cluster import DBSCAN
ms = DBSCAN(eps=0.2).fit(canada_grouped.drop('Postal Code', 1))

#neighborhoods_venues_sorted.drop('MS Cluster', axis=1, inplace=True)
neighborhoods_venues_sorted.insert(0, 'MS Cluster', ms.labels_)

canada_merged2 = Canada_data

# merge manhattan_grouped with manhattan_data to add latitude/longitude for each neighborhood
canada_merged2 = canada_merged2.join(neighborhoods_venues_sorted.set_index('Postal Code'), on='Postal Code')
canada_merged2.dropna(inplace=True)
canada_merged2['MS Cluster'] = canada_merged2['MS Cluster'].astype(int)
canada_merged2[['Latitude','Longitude']]=canada_merged2[['Latitude','Longitude']].astype(float)
canada_merged2

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,MS Cluster,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,-1,4.0,Food & Drink Shop,Park,Yoga Studio,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant
1,M4A,North York,Victoria Village,43.725882,-79.315572,-1,0.0,Pizza Place,Coffee Shop,Intersection,Portuguese Restaurant,Hockey Arena,Dumpling Restaurant,Eastern European Restaurant,Electronics Store,Drugstore,Escape Room
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636,0,0.0,Coffee Shop,Park,Bakery,Pub,Breakfast Spot,Café,Theater,Yoga Studio,Shoe Store,Restaurant
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,-1,0.0,Clothing Store,Furniture / Home Store,Accessories Store,Vietnamese Restaurant,Coffee Shop,Boutique,Ethiopian Restaurant,Escape Room,Electronics Store,Eastern European Restaurant
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,0,0.0,Coffee Shop,Sushi Restaurant,Diner,Bar,Beer Bar,Smoothie Shop,Burrito Place,Sandwich Place,Café,Portuguese Restaurant
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944,-1,7.0,Park,River,Yoga Studio,Doner Restaurant,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Donut Shop
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160,0,0.0,Coffee Shop,Sushi Restaurant,Japanese Restaurant,Gay Bar,Restaurant,Fast Food Restaurant,Hotel,Mediterranean Restaurant,Men's Store,Pub
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558,-1,0.0,Yoga Studio,Auto Workshop,Garden Center,Garden,Fast Food Restaurant,Light Rail Station,Farmers Market,Comic Shop,Park,Restaurant
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509,-1,2.0,Baseball Field,Yoga Studio,Drugstore,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant,Fast Food Restaurant


In [70]:
# create map
map_clusters = folium.Map(location=[43.636258, -79.498509], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(canada_merged2['Latitude'], canada_merged2['Longitude'], canada_merged2['Postal Code'], canada_merged2['MS Cluster']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [71]:
canada_grouped_disp = canada_grouped
#canada_grouped_disp.drop('KMeans Cluster', axis=1, inplace=True)
#canada_grouped_disp.drop('DBSCAN Cluster', axis=1, inplace=True)
canada_grouped_disp.insert(0, 'KMeans Cluster', canada_merged['Cluster Labels'].astype(int))
canada_grouped_disp.insert(0, 'DBSCAN Cluster', canada_merged2['MS Cluster'].astype(int))
canada_grouped_disp=canada_grouped_disp.dropna()


In [72]:
#canada_grouped_disp = Canada_encoded.groupby('Postal Code').sum()
#canada_grouped_disp

In [73]:
dbs1=canada_grouped_disp.groupby('DBSCAN Cluster').mean().drop('KMeans Cluster', axis=1).sort_values(axis=1, by=0, ascending=False).columns[:10]
dbs1

Index(['Park', 'Coffee Shop', 'Bakery', 'Pizza Place', 'Bank',
       'Sandwich Place', 'Grocery Store', 'Café', 'Discount Store',
       'Food & Drink Shop'],
      dtype='object')

In [74]:
dbs2=canada_grouped_disp.groupby('DBSCAN Cluster').mean().drop('KMeans Cluster', axis=1).sort_values(axis=1, by=-1, ascending=False).columns[:10]
dbs2

Index(['Park', 'Coffee Shop', 'Pizza Place', 'Café', 'Fast Food Restaurant',
       'Sandwich Place', 'Bakery', 'Restaurant', 'Grocery Store',
       'Convenience Store'],
      dtype='object')

In [75]:
k1=canada_grouped_disp.groupby('KMeans Cluster').mean().drop('DBSCAN Cluster', axis=1).sort_values(axis=1, by=0, ascending=False).columns[:10]
k2=canada_grouped_disp.groupby('KMeans Cluster').mean().drop('DBSCAN Cluster', axis=1).sort_values(axis=1, by=1, ascending=False).columns[:10]
k3=canada_grouped_disp.groupby('KMeans Cluster').mean().drop('DBSCAN Cluster', axis=1).sort_values(axis=1, by=2, ascending=False).columns[:10]
k4=canada_grouped_disp.groupby('KMeans Cluster').mean().drop('DBSCAN Cluster', axis=1).sort_values(axis=1, by=3, ascending=False).columns[:10]
k5=canada_grouped_disp.groupby('KMeans Cluster').mean().drop('DBSCAN Cluster', axis=1).sort_values(axis=1, by=4, ascending=False).columns[:10]
k6=canada_grouped_disp.groupby('KMeans Cluster').mean().drop('DBSCAN Cluster', axis=1).sort_values(axis=1, by=5, ascending=False).columns[:10]


## DBSCAN Cluster Exploration

1. Based on the generated map, our 2 clusters have identified the shopping district.  
2. This district is distinct in it's closer proximity to "Store" type venues. 

In [76]:
dbscan_popularity={
    "Cluster 1": dbs1,
    "Cluster 2": dbs2
}
pd.DataFrame(dbscan_popularity)

Unnamed: 0,Cluster 1,Cluster 2
0,Park,Park
1,Coffee Shop,Coffee Shop
2,Bakery,Pizza Place
3,Pizza Place,Café
4,Bank,Fast Food Restaurant
5,Sandwich Place,Sandwich Place
6,Grocery Store,Bakery
7,Café,Restaurant
8,Discount Store,Grocery Store
9,Food & Drink Shop,Convenience Store


# KMEANS Most Common Venues
This demonstrates the diversity in the KMeans Clusters' venues.  
Someone intending to move to Toronto would be able to use these clusters, in addition to their personal preference, to find a good place to live. 

In [77]:
kmeans_popularity={
    "Cluster 1": k1,
    "Cluster 2": k2,
    "Cluster 3": k3,
    "Cluster 4": k4,
    "Cluster 5": k5,
    "Cluster 6": k6,
}
pd.DataFrame(kmeans_popularity)

Unnamed: 0,Cluster 1,Cluster 2,Cluster 3,Cluster 4,Cluster 5,Cluster 6
0,Park,Coffee Shop,Coffee Shop,Sandwich Place,Fast Food Restaurant,Coffee Shop
1,Coffee Shop,Athletics & Sports,Café,Bakery,Moving Target,Café
2,Pizza Place,Discount Store,Restaurant,Auto Garage,Medical Center,Park
3,Bakery,Grocery Store,Hotel,Middle Eastern Restaurant,Mediterranean Restaurant,Sandwich Place
4,Café,Intersection,Gym,Monument / Landmark,Men's Store,Mexican Restaurant
5,Restaurant,Liquor Store,Clothing Store,Moving Target,Metro Station,Pub
6,Grocery Store,Gym / Fitness Center,Deli / Bodega,Movie Theater,Mexican Restaurant,Clothing Store
7,Bank,Park,Thai Restaurant,Motel,Middle Eastern Restaurant,Sporting Goods Shop
8,Bar,Gym,Cosmetics Shop,Moroccan Restaurant,Miscellaneous Shop,Pet Store
9,Trail,Sandwich Place,Concert Hall,Molecular Gastronomy Restaurant,Mobile Phone Shop,Bakery
