# Battle of the Neighborhoods
### Coursera Data Science Capstone Project  
**Author**: Aaron Carter  
**Date**: 12/22/2020

This notebook extracts Postal Code data from an online resource using BeautifulSoup web scraping library.  
The resulting table is merged with coordinates for each Postal Code, to be used with folium visualizations.  
MORE

## Part 1: Data Collection and Cleaning  

Data is scraped from Wikipedia, and cleaned according to project specifications. 

In [19]:
import pandas as pd
from bs4 import BeautifulSoup as bs
import requests

Postal codes will be scraped from a provided [wikipedia](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M "Canada Postal Codes") resource  
This operation will be performed using the BeautifulSoup (bs4) library. 

In [20]:
#Link has been updated since course creation, using historical version
resp = requests.get("https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=1011037969")
resp

<Response [200]>

Wikipedia uses standard html formatting, so we are using the html parser included with bs4

In [21]:
soup = bs(resp.text, 'html.parser')

In [22]:
#See HTML response body
rows = []
respRows = soup.table.tbody.tr.find_next_siblings("tr") #Creates a list of table row elements (ignoring table header)
respRows

[<tr>
 <td>M1A
 </td>
 <td>Not assigned
 </td>
 <td>Not assigned
 </td></tr>, <tr>
 <td>M2A
 </td>
 <td>Not assigned
 </td>
 <td>Not assigned
 </td></tr>, <tr>
 <td>M3A
 </td>
 <td>North York
 </td>
 <td>Parkwoods
 </td></tr>, <tr>
 <td>M4A
 </td>
 <td>North York
 </td>
 <td>Victoria Village
 </td></tr>, <tr>
 <td>M5A
 </td>
 <td>Downtown Toronto
 </td>
 <td>Regent Park, Harbourfront
 </td></tr>, <tr>
 <td>M6A
 </td>
 <td>North York
 </td>
 <td>Lawrence Manor, Lawrence Heights
 </td></tr>, <tr>
 <td>M7A
 </td>
 <td>Downtown Toronto
 </td>
 <td>Queen's Park, Ontario Provincial Government
 </td></tr>, <tr>
 <td>M8A
 </td>
 <td>Not assigned
 </td>
 <td>Not assigned
 </td></tr>, <tr>
 <td>M9A
 </td>
 <td>Etobicoke
 </td>
 <td>Islington Avenue, Humber Valley Village
 </td></tr>, <tr>
 <td>M1B
 </td>
 <td>Scarborough
 </td>
 <td>Malvern, Rouge
 </td></tr>, <tr>
 <td>M2B
 </td>
 <td>Not assigned
 </td>
 <td>Not assigned
 </td></tr>, <tr>
 <td>M3B
 </td>
 <td>North York
 </td>
 <td>Don Mills
 

**This Cell uses nested "for" loops to convert the html response into a pandas dataframe**

In [83]:

c = 0


for row in respRows:
    newRow = {}
    cols = row.find_all("td")
    j = 0
    for col in cols:
        newRow[["Postal Code", "Borough", "Neighborhood"][j]] = col.text.replace('\n','') #Load columns one at a time 
        j+=1
    rows.append(newRow)
    
wikiData = pd.DataFrame(rows)
wikiData

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
535,M5Z,Not assigned,Not assigned
536,M6Z,Not assigned,Not assigned
537,M7Z,Not assigned,Not assigned
538,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


In [84]:
wikiData_copy = wikiData #Just in case we need to go back to the original table

**Data is cleaned such that no Borough is "Not Assigned"**

In [85]:
wikiData = wikiData[wikiData["Borough"]!="Not assigned"]
if len(wikiData[wikiData["Neighborhood"]=="Not assigned"]) == 0:
    print("All neighborhoods assigned!")

if len(wikiData['Postal Code'].unique()) == len(wikiData):
    print("No Duplicate Postal Codes!")
wikiData

All neighborhoods assigned!


Unnamed: 0,Postal Code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
520,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
525,M4Y,Downtown Toronto,Church and Wellesley
528,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
529,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [86]:
wikiData.shape

(309, 3)

## Part 2: Merging Data with Geospatial Coordinates

Provided in the project requirements is a .csv file (GeoSpatial_Coordinates.csv) that provides Latitude and Longitude for each postal code.  
This file is consumed into a pandas dataframe and merged with the scraped Postal Code dataframe

In [27]:
geo1 = pd.read_csv("Geospatial_Coordinates.csv")
wikiData = pd.merge(left=wikiData, right=geo1, how='left', on="Postal Code")

In [28]:
wikiData['Latitude'] = wikiData['Latitude'].astype(str)
wikiData['Longitude'] = wikiData['Longitude'].astype(str)
wikiData

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.7532586,-79.3296565
1,M4A,North York,Victoria Village,43.725882299999995,-79.31557159999998
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6542599,-79.3606359
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718517999999996,-79.46476329999999
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.6623015,-79.3894938
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653653600000005,-79.5069436
99,M4Y,Downtown Toronto,Church and Wellesley,43.6658599,-79.38315990000001
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.6627439,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.6362579,-79.49850909999999


In [29]:
wikiData.to_csv("CanadaNeighborhoods.csv")

## Part 3: Analysis and Visualization

In [30]:
CLIENT_ID = 'P1AWGSLTDJSXOC2EMIRVMQOWD3CO0HRTCNRNJEUWM2R5XZNN' # your Foursquare ID
CLIENT_SECRET = '5RUWXK2HOV011WDQLMTWECVKEVB45K3N31ERCSIXNIS2K4RM' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 300 # A default Foursquare API limit value

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Postal Code', 
                  'Postal Code Latitude', 
                  'Postal Code Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [31]:
Canada_data = wikiData
Canada_venues = getNearbyVenues(Canada_data['Postal Code'], Canada_data['Latitude'], Canada_data['Longitude'])
Canada_venues

M3A
M4A
M5A
M6A
M7A
M9A
M1B
M3B
M4B
M5B
M6B
M9B
M1C
M3C
M4C
M5C
M6C
M9C
M1E
M4E
M5E
M6E
M1G
M4G
M5G
M6G
M1H
M2H
M3H
M4H
M5H
M6H
M1J
M2J
M3J
M4J
M5J
M6J
M1K
M2K
M3K
M4K
M5K
M6K
M1L
M2L
M3L
M4L
M5L
M6L
M9L
M1M
M2M
M3M
M4M
M5M
M6M
M9M
M1N
M2N
M3N
M4N
M5N
M6N
M9N
M1P
M2P
M4P
M5P
M6P
M9P
M1R
M2R
M4R
M5R
M6R
M7R
M9R
M1S
M4S
M5S
M6S
M1T
M4T
M5T
M1V
M4V
M5V
M8V
M9V
M1W
M4W
M5W
M8W
M9W
M1X
M4X
M5X
M8X
M4Y
M7Y
M8Y
M8Z


Unnamed: 0,Postal Code,Postal Code Latitude,Postal Code Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M3A,43.7532586,-79.3296565,Brookbanks Park,43.751976,-79.332140,Park
1,M3A,43.7532586,-79.3296565,TTC stop #8380,43.752672,-79.326351,Bus Stop
2,M3A,43.7532586,-79.3296565,Variety Store,43.751974,-79.333114,Food & Drink Shop
3,M4A,43.725882299999995,-79.31557159999998,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
4,M4A,43.725882299999995,-79.31557159999998,Portugril,43.725819,-79.312785,Portuguese Restaurant
...,...,...,...,...,...,...,...
2098,M8Z,43.6288408,-79.52099940000001,Jim & Maria's No Frills,43.631152,-79.518617,Grocery Store
2099,M8Z,43.6288408,-79.52099940000001,Koala Tan Tanning Salon & Sunless Spa,43.631370,-79.519006,Tanning Salon
2100,M8Z,43.6288408,-79.52099940000001,Once Upon A Child,43.631075,-79.518290,Kids Store
2101,M8Z,43.6288408,-79.52099940000001,Value Village,43.631269,-79.518238,Thrift / Vintage Store


In [32]:
Canada_venues.to_csv("Canada_venues.csv")

In [59]:
Canada_venues = pd.read_csv("Canada_venues.csv").drop(["Unnamed: 0"], axis=1)
Canada_venues

Unnamed: 0,Postal Code,Postal Code Latitude,Postal Code Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M3A,43.753259,-79.329656,Brookbanks Park,43.751976,-79.332140,Park
1,M3A,43.753259,-79.329656,TTC stop #8380,43.752672,-79.326351,Bus Stop
2,M3A,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
3,M4A,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
4,M4A,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant
...,...,...,...,...,...,...,...
2098,M8Z,43.628841,-79.520999,Jim & Maria's No Frills,43.631152,-79.518617,Grocery Store
2099,M8Z,43.628841,-79.520999,Koala Tan Tanning Salon & Sunless Spa,43.631370,-79.519006,Tanning Salon
2100,M8Z,43.628841,-79.520999,Once Upon A Child,43.631075,-79.518290,Kids Store
2101,M8Z,43.628841,-79.520999,Value Village,43.631269,-79.518238,Thrift / Vintage Store


In [60]:
Canada_venues['count'] = Canada_venues.groupby('Postal Code')['Postal Code'].transform('count')
Canada_venues.sort_values('count', ascending=False, inplace=True)
Canada_venues

Unnamed: 0,Postal Code,Postal Code Latitude,Postal Code Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,count
1051,M5K,43.647177,-79.381576,Sushi-Q,43.647738,-79.379699,Sushi Restaurant,100
1041,M5K,43.647177,-79.381576,Assembly Chef's Hall,43.650579,-79.383412,Food Court,100
1039,M5K,43.647177,-79.381576,Irish Embassy,43.647918,-79.377273,Irish Pub,100
1038,M5K,43.647177,-79.381576,Fast Fresh Foods,43.647708,-79.379549,Salad Place,100
1037,M5K,43.647177,-79.381576,Boxcar Social Temperance,43.650557,-79.381956,Bar,100
...,...,...,...,...,...,...,...,...
1340,M5N,43.711695,-79.416936,Rosalind's Garden Oasis,43.712189,-79.411978,Garden,1
1294,M9M,43.724766,-79.532242,Strathburn Park,43.721765,-79.532854,Baseball Field,1
1226,M2M,43.789053,-79.408493,Wedgewood park,43.790635,-79.405494,Park,1
1090,M2L,43.757490,-79.374714,Vyner Greenbelt,43.759642,-79.369590,Park,1


In [61]:
#268 categories
len(pd.unique(Canada_venues['Venue Category']))

271

In [62]:
#46 types of restaurants
len(pd.unique(Canada_venues[Canada_venues['Venue Category'].str.contains("Restaurant")]['Venue Category']))

47

#### Data Preparation
In order to feed a clustering algorithm, it is necessary to transform our data into a numeric representation.  
OneHot encoding is a method of doing so by creating a sparse matrix of (0,1) values.  
These binary values are then averaged in order to get the most popular type of venues for each postal code.

In [63]:
# one hot encoding
Canada_encoded = pd.get_dummies(Canada_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
Canada_encoded['Postal Code'] = Canada_venues['Postal Code'] 

# move neighborhood column to the first column
fixed_columns = [Canada_encoded.columns[-1]] + list(Canada_encoded.columns[:-1])
Canada_encoded = Canada_encoded[fixed_columns]

Canada_encoded.head()

Unnamed: 0,Postal Code,Accessories Store,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Train Station,Truck Stop,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
1051,M5K,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1041,M5K,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1039,M5K,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1038,M5K,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1037,M5K,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [64]:
canada_grouped = Canada_encoded.groupby('Postal Code').mean().reset_index()
canada_grouped

Unnamed: 0,Postal Code,Accessories Store,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Train Station,Truck Stop,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,M1B,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,M1C,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,M1E,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,M1G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,M1H,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
94,M9M,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
95,M9P,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
96,M9R,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
97,M9V,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [65]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

## KMeans Clustering 
This section divides the postal codes into 6 distinct clusters, visualizing them on a map of Toronto. 

The clusters are divided using the proximity to certain categories of venues. 
Clusters are distinct in that they represent a different experience to the people living within them. 

Note: This analysis makes no attempt to find postal codes that offer a diverse experience, but this map is useful for finding such locations, if they are desired. 

In [66]:
from sklearn.cluster import KMeans

In [67]:
#!conda install -c conda-forge folium=0.5.0 --yes 

In [68]:
import numpy as np
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Postal Code']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Postal Code'] = canada_grouped['Postal Code']

for ind in np.arange(canada_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(canada_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()


# set number of clusters
kclusters = 8

grouped_clustering = canada_grouped.drop('Postal Code', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(grouped_clustering)

# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

canada_merged = Canada_data

# merge manhattan_grouped with manhattan_data to add latitude/longitude for each neighborhood
canada_merged = canada_merged.join(neighborhoods_venues_sorted.set_index('Postal Code'), on='Postal Code')
canada_merged.dropna(inplace=True)
canada_merged['Cluster Labels'] = canada_merged['Cluster Labels'].astype(int)
canada_merged[['Latitude','Longitude']]=canada_merged[['Latitude','Longitude']].astype(float)
canada_merged

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,2,Park,Food & Drink Shop,Bus Stop,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dim Sum Restaurant
1,M4A,North York,Victoria Village,43.725882,-79.315572,1,Hockey Arena,Pizza Place,Portuguese Restaurant,Coffee Shop,Intersection,Eastern European Restaurant,Electronics Store,Dumpling Restaurant,Drugstore,Donut Shop
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636,3,Coffee Shop,Park,Pub,Bakery,Breakfast Spot,Theater,Café,Yoga Studio,Dessert Shop,Hotel
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,3,Accessories Store,Clothing Store,Vietnamese Restaurant,Carpet Store,Coffee Shop,Furniture / Home Store,Boutique,Electronics Store,Escape Room,Eastern European Restaurant
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,3,Coffee Shop,Diner,Gym,Sushi Restaurant,Café,Italian Restaurant,Beer Bar,Japanese Restaurant,Sandwich Place,Burger Joint
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944,3,River,Pool,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Deli / Bodega
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160,3,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Gay Bar,Restaurant,Yoga Studio,Pub,Café,Mediterranean Restaurant,Men's Store
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558,3,Light Rail Station,Yoga Studio,Auto Workshop,Garden Center,Garden,Fast Food Restaurant,Farmers Market,Comic Shop,Pizza Place,Restaurant
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509,2,Park,Deli / Bodega,Baseball Field,Dim Sum Restaurant,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore


In [69]:
import folium # map rendering library
import matplotlib.cm as cm
import matplotlib.colors as colors

In [70]:
# create map
map_clusters = folium.Map(location=[43.636258, -79.498509], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(canada_merged['Latitude'], canada_merged['Longitude'], canada_merged['Postal Code'], canada_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [71]:
df = canada_grouped.drop('Postal Code', 1)
df.columns = range(df.shape[1])
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,261,262,263,264,265,266,267,268,269,270
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
94,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
95,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
96,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
97,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## DBSCAN Clustering
This seciton utilizes DBSCAN Clustering, and defines 2 clusters. 
1. Downtown District
2. Anything Else

This cluster, like with the KMeans method, is identified through frequency of common venues

In [72]:
#Not happy with the results, trying alternate clustering algorithm
from sklearn.cluster import DBSCAN
ms = DBSCAN(eps=0.2).fit(canada_grouped.drop('Postal Code', 1))

#neighborhoods_venues_sorted.drop('MS Cluster', axis=1, inplace=True)
neighborhoods_venues_sorted.insert(0, 'MS Cluster', ms.labels_)

canada_merged2 = Canada_data

# merge manhattan_grouped with manhattan_data to add latitude/longitude for each neighborhood
canada_merged2 = canada_merged2.join(neighborhoods_venues_sorted.set_index('Postal Code'), on='Postal Code')
canada_merged2.dropna(inplace=True)
canada_merged2['MS Cluster'] = canada_merged2['MS Cluster'].astype(int)
canada_merged2[['Latitude','Longitude']]=canada_merged2[['Latitude','Longitude']].astype(float)
canada_merged2

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,MS Cluster,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,-1,2.0,Park,Food & Drink Shop,Bus Stop,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dim Sum Restaurant
1,M4A,North York,Victoria Village,43.725882,-79.315572,-1,1.0,Hockey Arena,Pizza Place,Portuguese Restaurant,Coffee Shop,Intersection,Eastern European Restaurant,Electronics Store,Dumpling Restaurant,Drugstore,Donut Shop
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636,0,3.0,Coffee Shop,Park,Pub,Bakery,Breakfast Spot,Theater,Café,Yoga Studio,Dessert Shop,Hotel
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,-1,3.0,Accessories Store,Clothing Store,Vietnamese Restaurant,Carpet Store,Coffee Shop,Furniture / Home Store,Boutique,Electronics Store,Escape Room,Eastern European Restaurant
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,0,3.0,Coffee Shop,Diner,Gym,Sushi Restaurant,Café,Italian Restaurant,Beer Bar,Japanese Restaurant,Sandwich Place,Burger Joint
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944,-1,3.0,River,Pool,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Deli / Bodega
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160,0,3.0,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Gay Bar,Restaurant,Yoga Studio,Pub,Café,Mediterranean Restaurant,Men's Store
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558,-1,3.0,Light Rail Station,Yoga Studio,Auto Workshop,Garden Center,Garden,Fast Food Restaurant,Farmers Market,Comic Shop,Pizza Place,Restaurant
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509,-1,2.0,Park,Deli / Bodega,Baseball Field,Dim Sum Restaurant,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore


In [73]:
# create map
map_clusters = folium.Map(location=[43.636258, -79.498509], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(canada_merged2['Latitude'], canada_merged2['Longitude'], canada_merged2['Postal Code'], canada_merged2['MS Cluster']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [74]:
canada_grouped_disp = canada_grouped
#canada_grouped_disp.drop('KMeans Cluster', axis=1, inplace=True)
#canada_grouped_disp.drop('DBSCAN Cluster', axis=1, inplace=True)
canada_grouped_disp.insert(0, 'KMeans Cluster', canada_merged['Cluster Labels'].astype(int))
canada_grouped_disp.insert(0, 'DBSCAN Cluster', canada_merged2['MS Cluster'].astype(int))
canada_grouped_disp=canada_grouped_disp.dropna()


In [75]:
#canada_grouped_disp = Canada_encoded.groupby('Postal Code').sum()
#canada_grouped_disp

In [76]:
dbs1=canada_grouped_disp.groupby('DBSCAN Cluster').mean().drop('KMeans Cluster', axis=1).sort_values(axis=1, by=0, ascending=False).columns[:10]
dbs1

Index(['Park', 'Coffee Shop', 'Pizza Place', 'Grocery Store', 'Café',
       'Sandwich Place', 'Bank', 'Trail', 'Bus Line', 'Intersection'],
      dtype='object')

In [77]:
dbs2=canada_grouped_disp.groupby('DBSCAN Cluster').mean().drop('KMeans Cluster', axis=1).sort_values(axis=1, by=-1, ascending=False).columns[:10]
dbs2

Index(['Park', 'Coffee Shop', 'Pizza Place', 'Bakery', 'Fast Food Restaurant',
       'Café', 'Baseball Field', 'Bar', 'Sandwich Place', 'Intersection'],
      dtype='object')

In [78]:
k1=canada_grouped_disp.groupby('KMeans Cluster').mean().drop('DBSCAN Cluster', axis=1).sort_values(axis=1, by=0, ascending=False).columns[:10]
k2=canada_grouped_disp.groupby('KMeans Cluster').mean().drop('DBSCAN Cluster', axis=1).sort_values(axis=1, by=1, ascending=False).columns[:10]
k3=canada_grouped_disp.groupby('KMeans Cluster').mean().drop('DBSCAN Cluster', axis=1).sort_values(axis=1, by=2, ascending=False).columns[:10]
k4=canada_grouped_disp.groupby('KMeans Cluster').mean().drop('DBSCAN Cluster', axis=1).sort_values(axis=1, by=3, ascending=False).columns[:10]
k5=canada_grouped_disp.groupby('KMeans Cluster').mean().drop('DBSCAN Cluster', axis=1).sort_values(axis=1, by=4, ascending=False).columns[:10]
k6=canada_grouped_disp.groupby('KMeans Cluster').mean().drop('DBSCAN Cluster', axis=1).sort_values(axis=1, by=5, ascending=False).columns[:10]


## DBSCAN Cluster Exploration

1. Based on the generated map, our 2 clusters have identified the shopping district.  
2. This district is distinct in it's closer proximity to "Store" type venues. 

In [79]:
dbscan_popularity={
    "Cluster 1": dbs1,
    "Cluster 2": dbs2
}
pd.DataFrame(dbscan_popularity)

Unnamed: 0,Cluster 1,Cluster 2
0,Park,Park
1,Coffee Shop,Coffee Shop
2,Pizza Place,Pizza Place
3,Grocery Store,Bakery
4,Café,Fast Food Restaurant
5,Sandwich Place,Café
6,Bank,Baseball Field
7,Trail,Bar
8,Bus Line,Sandwich Place
9,Intersection,Intersection


# KMEANS Most Common Venues
This demonstrates the diversity in the KMeans Clusters' venues.  
Someone intending to move to Toronto would be able to use these clusters, in addition to their personal preference, to find a good place to live. 

In [80]:
kmeans_popularity={
    "Cluster 1": k1,
    "Cluster 2": k2,
    "Cluster 3": k3,
    "Cluster 4": k4,
    "Cluster 5": k5,
    "Cluster 6": k6,
}
pd.DataFrame(kmeans_popularity)

Unnamed: 0,Cluster 1,Cluster 2,Cluster 3,Cluster 4,Cluster 5,Cluster 6
0,Coffee Shop,Bar,Garden,Park,Bus Line,Coffee Shop
1,Department Store,Park,Fast Food Restaurant,Coffee Shop,Jewelry Store,Clothing Store
2,Hobby Shop,Bakery,Bakery,Pizza Place,Sushi Restaurant,Airport Service
3,Chinese Restaurant,Athletics & Sports,Brewery,Café,Trail,Café
4,Café,Indian Restaurant,Sandwich Place,Baseball Field,Molecular Gastronomy Restaurant,Airport Lounge
5,Restaurant,Japanese Restaurant,Park,Grocery Store,Mobile Phone Shop,Airport Terminal
6,Clothing Store,Pizza Place,Greek Restaurant,Intersection,Miscellaneous Shop,Ramen Restaurant
7,Deli / Bodega,Liquor Store,Coffee Shop,Sandwich Place,Middle Eastern Restaurant,Fast Food Restaurant
8,Thai Restaurant,Grocery Store,Middle Eastern Restaurant,Bank,Mexican Restaurant,Ice Cream Shop
9,Hotel,Construction & Landscaping,Smoke Shop,Bakery,Accessories Store,Restaurant
