<a href="https://colab.research.google.com/github/eolus87/Coursera_Capstone/blob/master/Web_scraping_Toronto_neighbourhoods_3rdpart.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web scraping Toronto neighborhood data from Wikipedia
This notebook was developed for the Coursera course "Applied Data Science Capstone", following the instructions in [Segmenting and Clustering Neighborhoods in Toronto](https://www.coursera.org/learn/applied-data-science-capstone/peer/I1bDq/segmenting-and-clustering-neighborhoods-in-toronto/submit).

References used are:
- [How To Web Scrape Wikipedia Using Python, Urllib, Beautiful Soup and Pandas](https://simpleanalytical.com/how-to-web-scrape-wikipedia-python-urllib-beautiful-soup-pandas)

Nicolas Gutierrez  
UK, 17th May 2020

## From Web to a dataframe

### Importing needed libraries
- [**urllib.request**](https://docs.python.org/3.0/library/urllib.request.html)
- [**BeautifulSoup**](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [**Pandas**](https://pandas.pydata.org/) (The ubiquitous)



In [0]:
import urllib.request
from bs4 import BeautifulSoup
import pandas as pd

print("Libraries needed imported!")

Libraries needed imported!


### Variables initialization, web request and data checking

In [0]:
# As indicated in the course project instructions, the URL is the following
url  = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
# We make the call
page =  urllib.request.urlopen(url)
# Now it is the turn of Beautiful Soup to parse the HTML code we got
soup = BeautifulSoup(page, "lxml")
#print(soup.prettify())
# I am not displaying the soup.prettify() field above because it is too long, but 
# it can be uncommented if needed

In [0]:
# Now that we have the HTML code, we can have a look at it. The table we are interested in begins with "<table class="wikitable sortable">".
all_tables   = soup.find_all("table")
#all_tables
# I am not displaying the all_tables object above because it is too long, but 
# it can be uncommented if needed

In [0]:
# Let's filter the obtained tables list and get the only table we want
table_wanted = soup.find("table", class_ = "wikitable sortable")
#table_wanted
# I am not displaying the table_wanted object above because it is too long, but 
# it can be uncommented if needed

### Extracting and cleaning data
First we will find the table headers. They all have this format "\<th>Header\</th>" (This step is just for fun, because we assign the table headers manually afterwards).

In [0]:
# Initialize the list that will contain the headers
table_headers = []

# Find all headers in the table
thheaders = table_wanted.find_all('th')

# Include all headers in table_headers. The 'find' method separates the text 
# from the HTML syntax. The names end with a '\n' newline character, so we 
# strip it with rstrip (safer than cropping the last character with [0:-1])
for header in thheaders:
  table_headers.append(header.find(text=True).rstrip('\n'))

# Printing the headers
print(table_headers)

['Postal Code', 'Borough', 'Neighborhood']


Next, the table data. Every line in the table is enclosed in "\<tr>DataRow\</tr>" and then every field is enclosed in "\<td>DataField\</td>" and we have three fields per row.

In [0]:
# Initializing the lists that will contain the data
postalcode   = []
borough      = []
neighborhood = []

# We locate every field in the data (we strip the trailing newline character 
# again)
for row in table_wanted.find_all('tr'):
  row_data = row.find_all('td')
  if len(row_data) == 3:
    postalcode.append(row_data[0].find(text=True).rstrip('\n'))
    borough.append(row_data[1].find(text=True).rstrip('\n'))
    neighborhood.append(row_data[2].find(text=True).rstrip('\n'))

# Printing the results of the fields
print("Content of every column:")
print("Postal_code: {}".format(postalcode))
print("Borough: {}".format(borough))
print("Neighborhood: {}\n".format(neighborhood))

# Printing the length of every column
print("Size of every column:")
print(len(postalcode))
print(len(borough))
print(len(neighborhood))

Content of every column:
Postal_code: ['M1A', 'M2A', 'M3A', 'M4A', 'M5A', 'M6A', 'M7A', 'M8A', 'M9A', 'M1B', 'M2B', 'M3B', 'M4B', 'M5B', 'M6B', 'M7B', 'M8B', 'M9B', 'M1C', 'M2C', 'M3C', 'M4C', 'M5C', 'M6C', 'M7C', 'M8C', 'M9C', 'M1E', 'M2E', 'M3E', 'M4E', 'M5E', 'M6E', 'M7E', 'M8E', 'M9E', 'M1G', 'M2G', 'M3G', 'M4G', 'M5G', 'M6G', 'M7G', 'M8G', 'M9G', 'M1H', 'M2H', 'M3H', 'M4H', 'M5H', 'M6H', 'M7H', 'M8H', 'M9H', 'M1J', 'M2J', 'M3J', 'M4J', 'M5J', 'M6J', 'M7J', 'M8J', 'M9J', 'M1K', 'M2K', 'M3K', 'M4K', 'M5K', 'M6K', 'M7K', 'M8K', 'M9K', 'M1L', 'M2L', 'M3L', 'M4L', 'M5L', 'M6L', 'M7L', 'M8L', 'M9L', 'M1M', 'M2M', 'M3M', 'M4M', 'M5M', 'M6M', 'M7M', 'M8M', 'M9M', 'M1N', 'M2N', 'M3N', 'M4N', 'M5N', 'M6N', 'M7N', 'M8N', 'M9N', 'M1P', 'M2P', 'M3P', 'M4P', 'M5P', 'M6P', 'M7P', 'M8P', 'M9P', 'M1R', 'M2R', 'M3R', 'M4R', 'M5R', 'M6R', 'M7R', 'M8R', 'M9R', 'M1S', 'M2S', 'M3S', 'M4S', 'M5S', 'M6S', 'M7S', 'M8S', 'M9S', 'M1T', 'M2T', 'M3T', 'M4T', 'M5T', 'M6T', 'M7T', 'M8T', 'M9T', 'M1V', 'M2V', 'M
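
The row-by-row extraction above can be tried offline on a small HTML fragment. The sketch below is only an illustration: it swaps Beautiful Soup for the standard library's `html.parser`, so it is not the parser used in this notebook. `SAMPLE_HTML`, `RowCollector` and `parse_rows` are hypothetical names, and the fragment merely mimics the shape of the Wikipedia table.

```python
from html.parser import HTMLParser

# Hypothetical fragment that mimics the structure of the Wikipedia table
SAMPLE_HTML = """
<table class="wikitable sortable">
<tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td></td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""


class RowCollector(HTMLParser):
    """Collects the text of every <tr> that holds exactly three <td> cells."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text chunks of the cell being read, or None outside <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            # the header row has only <th> cells, so it is skipped here
            if len(self._row) == 3:
                self.rows.append(tuple(self._row))
            self._row = None


def parse_rows(html):
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    return parser.rows


sample_rows = parse_rows(SAMPLE_HTML)
print(sample_rows)
```

As in the cell above, only rows with exactly three data fields are kept, which is what drops the header row.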

### DataFrame and final cleaning
Now we will create a Pandas Dataframe and do the last stage of cleaning in Pandas

In [0]:
# Initializing the data frame
table_dataframe                 = pd.DataFrame(postalcode,columns=['PostalCode'])
# Adding the rest of columns
table_dataframe['Borough']      = borough
table_dataframe['Neighborhood'] = neighborhood
table_dataframe

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,
176,M6Z,Not assigned,
177,M7Z,Not assigned,
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


Next, the steps required by the assignment:
1. Ignore cells with a borough that is "Not assigned".
2. No PostalCode should be duplicated: combine rows that share the same PostalCode.
3. If a cell has a borough but a "Not assigned" neighborhood, the neighborhood will be the same as the borough.

In [0]:
#1. Removing (ignoring) rows with Borough "Not assigned"
table_dataframe = table_dataframe[table_dataframe['Borough']!="Not assigned"]
print("DataFrame shape: {}\n".format(table_dataframe.shape))

# Sanity check about datatypes
print(table_dataframe.dtypes)

#2. Joining Neighborhoods as indicated in the assignment instructions
print(table_dataframe[table_dataframe['PostalCode'].isin(['M5A'])])
table_dataframe = table_dataframe.groupby(['PostalCode','Borough'])['Neighborhood'].apply(', '.join).reset_index()
print(table_dataframe[table_dataframe['PostalCode'].isin(['M5A'])])
# NOTE: PostalCodes are not duplicated in the Wikipedia page. The table is 
# already grouped by PostalCode, so the group by step changes nothing here

#3. Looking for empty neighborhoods and filling them with the borough name
# (.loc avoids the chained assignment df[col][i] = ..., which may not write
# back to the DataFrame)
for i in range(len(table_dataframe)):
  if table_dataframe.loc[i, 'Neighborhood'] == '':
    print(table_dataframe.loc[i, 'Borough'])
    table_dataframe.loc[i, 'Neighborhood'] = table_dataframe.loc[i, 'Borough']
# NOTE: There were no changes in step 3. It seems that empty neighborhoods are
# linked to "Not assigned" boroughs, so they were already removed in step 1.

DataFrame shape: (103, 3)

PostalCode      object
Borough         object
Neighborhood    object
dtype: object
  PostalCode           Borough               Neighborhood
4        M5A  Downtown Toronto  Regent Park, Harbourfront
   PostalCode           Borough               Neighborhood
53        M5A  Downtown Toronto  Regent Park, Harbourfront
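
Since the Wikipedia table no longer triggers steps 2 and 3, the three cleaning rules can be checked on a toy table instead. This is a minimal sketch, not the notebook's own cell: `clean_postal_table` is a hypothetical helper, and the toy rows (including the made-up code `Z9Z` and "Example Borough") are invented so that every rule has something to act on.

```python
import pandas as pd


def clean_postal_table(df):
    """Apply the three assignment rules to a PostalCode/Borough/Neighborhood table."""
    # 1. drop rows whose borough is "Not assigned"
    out = df[df["Borough"] != "Not assigned"].copy()
    # 2. one row per postal code, neighborhoods joined with ", "
    out = (out.groupby(["PostalCode", "Borough"])["Neighborhood"]
              .apply(", ".join)
              .reset_index())
    # 3. a missing neighborhood takes the name of its borough
    missing = out["Neighborhood"].isin(["", "Not assigned"])
    out.loc[missing, "Neighborhood"] = out.loc[missing, "Borough"]
    return out


# Invented rows: M5A is split in two on purpose, Z9Z is a made-up postal code
toy = pd.DataFrame({
    "PostalCode":   ["M1A", "M5A", "M5A", "Z9Z"],
    "Borough":      ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Example Borough"],
    "Neighborhood": ["", "Regent Park", "Harbourfront", ""],
})

cleaned = clean_postal_table(toy)
print(cleaned)
```

The boolean mask with `.loc` replaces the explicit loop of step 3; both approaches should give the same table.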


Finally, we print the DataFrame and its shape.

In [0]:
#Printing the dataframe
table_dataframe

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ..."
101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest..."


In [0]:
# Printing the shape of the dataframe
table_dataframe.shape

(103, 3)

## Including Latitude and Longitude data 

### Importing the needed library

In [0]:
# Google colab doesn't have geocoder by default
!pip install -q geocoder
# import geocoder
import geocoder
# import time for sleeping between calls to geocoder 
import time

print('All libraries installed and imported!')

All libraries installed and imported!


### Variables initialization and latitude retrieval
In this section I first tried retrieving the coordinates with geocoder, but it was not possible (see the code below), so I loaded the [csv](https://cocl.us/Geospatial_data) indicated by the course organizers instead.

In [0]:
# I tried the code below and after one hour it had neither finished nor timed 
# out, so in the next cell I will load the csv file with the latitudes and 
# longitudes provided by the course organizers.

# Initializing the latitude and longitude vectors
# latitude  = []
# longitude = []

# for i in range(len(table_dataframe)):
#   # initialize your variable to None
#   lat_lng_coords = None
#   # loop until you get the coordinates
#   while(lat_lng_coords is None):
#     g = geocoder.google('{}, Toronto, Ontario'.format(table_dataframe['PostalCode'][i]))
#     lat_lng_coords = g.latlng
#     time.sleep(0.1)
#   latitude.append(lat_lng_coords[0]) 
#   longitude.append(lat_lng_coords[1])
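
One likely reason the loop above never ended is that `while(lat_lng_coords is None)` has no exit condition when the service keeps returning nothing. The sketch below shows a bounded retry under that assumption. `geocode_with_retries` and `fake_lookup` are hypothetical names; in the notebook, `lookup` would be a small function wrapping the geocoder call and returning its `latlng` value.

```python
import time


def geocode_with_retries(lookup, query, max_attempts=5, delay_s=0.1):
    """Call lookup(query) until it returns coordinates or attempts run out.

    lookup is any callable returning a (lat, lng) pair or None.
    Returns the coordinates, or None if every attempt failed.
    """
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(delay_s)  # wait a little before asking again
    return None


# Stand-in for a real geocoding service: fails twice, then answers
attempts = []

def fake_lookup(query):
    attempts.append(query)
    return None if len(attempts) < 3 else (43.806686, -79.194353)


found = geocode_with_retries(fake_lookup, "M1B, Toronto, Ontario", delay_s=0)
never = geocode_with_retries(lambda query: None, "M1B, Toronto, Ontario",
                             max_attempts=3, delay_s=0)
print(found, never)
```

With a cap on attempts, a postal code that cannot be resolved yields `None` and the loop moves on instead of blocking the whole notebook.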

In [0]:
# -O sets the name of the downloaded file (lowercase -o only names wget's log file)
!wget -q -O Geospatial_data https://cocl.us/Geospatial_data

In [0]:
geospatial_data = pd.read_csv('Geospatial_data')
geospatial_data

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


### Adding the latitude and longitude to the dataframe

In [0]:
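# Order table_dataframe by PostalCode (sort_values returns a new DataFrame, 
# so the result has to be assigned back)
table_dataframe = table_dataframe.sort_values('PostalCode').reset_index(drop=True)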
print(table_dataframe.shape)
table_dataframe.head()

(103, 3)


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [0]:
# Order geolocation dataframe by Postal Code
geospatial_data = geospatial_data.sort_values('Postal Code').reset_index(drop=True)
print(geospatial_data.shape)
geospatial_data.head()

(103, 3)


Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [0]:
# Loading Latitude and Longitude to the dataframe
table_dataframe['Latitude']  = geospatial_data['Latitude']
table_dataframe['Longitude'] = geospatial_data['Longitude']
table_dataframe.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
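
Assigning the columns directly works here because both tables are sorted the same way and share the same index. A join on the postal code does not depend on row order. The sketch below illustrates that alternative on two tiny tables; the variable names are hypothetical, the coordinates are the first two rows of the csv shown above, and the row order is shuffled on purpose.

```python
import pandas as pd

# Two rows from the tables above, deliberately in a different order
neighborhoods = pd.DataFrame({
    "PostalCode":   ["M1C", "M1B"],
    "Borough":      ["Scarborough", "Scarborough"],
    "Neighborhood": ["Rouge Hill, Port Union, Highland Creek", "Malvern, Rouge"],
})
coordinates = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude":    [43.806686, 43.784535],
    "Longitude":   [-79.194353, -79.160497],
})

# Match rows by postal code; validate raises if a code appears more than once
merged = (neighborhoods
          .merge(coordinates, left_on="PostalCode", right_on="Postal Code",
                 how="left", validate="one_to_one")
          .drop(columns="Postal Code"))
print(merged)
```

With `how="left"`, a postal code missing from the csv would show up as `NaN` coordinates rather than silently receiving the values of another row.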


## Plotting the location of neighborhoods in a map


Importing the needed library for this part of the work

In [0]:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import folium # map rendering library
print("Libraries imported!")

Libraries imported!


Retrieving the coordinates of Toronto

In [0]:
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="toronto_couseracapstone")
location   = geolocator.geocode(address)
latitude   = location.latitude
longitude  = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


Creating a map of Toronto and plotting the coordinates of the different neighborhoods

In [0]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(table_dataframe['Latitude'], table_dataframe['Longitude'], table_dataframe['Borough'], table_dataframe['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=3,
        popup=label,
        color='red',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Reducing the list of neighborhoods to those whose borough has Toronto in its name

In [0]:
newdftoronto = table_dataframe[table_dataframe['Borough'].str.contains('Toronto')].reset_index(drop=True)
newdftoronto.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
2,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879


Creating a new map with just the boroughs selected

In [0]:
# create map of Toronto using latitude and longitude values
toronto_selected = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, borough, neighborhood in zip(newdftoronto['Latitude'], newdftoronto['Longitude'], newdftoronto['Borough'], newdftoronto['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=3,
        popup=label,
        color='red',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(toronto_selected)  
    
toronto_selected

## Retrieving venue information from Foursquare

In [0]:
import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe
import numpy as np

Now we will connect to Foursquare to retrieve information about the neighborhoods

In [47]:
CLIENT_ID     = 'XXX' # your Foursquare ID
CLIENT_SECRET = 'XXX' # your Foursquare Secret
VERSION       = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: XXX
CLIENT_SECRET: XXX


Just to check how it works, let's do it first for one neighborhood

In [0]:
print(newdftoronto.loc[0, 'Neighborhood'])

neighborhood_latitude  = newdftoronto.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = newdftoronto.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name      = newdftoronto.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

The Beaches
Latitude and longitude values of The Beaches are 43.67635739999999, -79.2930312.


Let's retrieve the top 100 venues in The Beaches

In [0]:
LIMIT  = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius

# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
print(url) # display URL

results = requests.get(url).json()
print("The results have a length of {}".format(len(results))) 

https://api.foursquare.com/v2/venues/explore?&client_id=XXX&client_secret=XXX&v=20180605&ll=43.67635739999999,-79.2930312&radius=500&limit=100
The results have a length of 2
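
Because the request URL carries the credentials, printing it can leak them into the saved notebook. The sketch below builds the same query string from a dictionary and masks the credentials before displaying it. It uses only the standard library; `build_explore_url` and `redact_credentials` are hypothetical helpers, and `"my-id"` / `"my-secret"` are placeholders.

```python
from urllib.parse import parse_qs, parse_qsl, urlencode, urlsplit, urlunsplit

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=500, limit=100):
    """Assemble the explore URL, letting urlencode escape the values."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return "{}?{}".format(EXPLORE_ENDPOINT, urlencode(params))


def redact_credentials(url):
    """Return the URL with the credential values replaced by XXX."""
    parts = urlsplit(url)
    pairs = [(key, "XXX" if key in ("client_id", "client_secret") else value)
             for key, value in parse_qsl(parts.query)]
    return urlunsplit(parts._replace(query=urlencode(pairs)))


# Placeholder credentials, coordinates of The Beaches from the cell above
full_url = build_explore_url("my-id", "my-secret", "20180605",
                             43.67635739999999, -79.2930312)
safe_url = redact_credentials(full_url)
print(safe_url)  # safe to display; full_url is the one to send
```

`requests.get` also accepts such a dictionary through its `params` argument, which avoids formatting the URL by hand.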


Transforming the JSON into a pandas dataframe

In [0]:
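# function that extracts the category of the venue
def get_category_type(row):
    # the column is called 'categories' or 'venue.categories' depending on
    # whether the column names have already been shortened
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']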

venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()


Unnamed: 0,name,categories,lat,lng
0,Glen Manor Ravine,Trail,43.676821,-79.293942
1,The Big Carrot Natural Food Market,Health Food Store,43.678879,-79.297734
2,Grover Pub and Grub,Pub,43.679181,-79.297215
3,Upper Beaches,Neighborhood,43.680563,-79.292869


Let's generalize this to import venues from all selected neighborhoods. The following function will help us.

In [0]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
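
The function above reads `v['venue']['categories'][0]['name']`, which raises an `IndexError` for a venue with an empty category list. The sketch below shows one way to guard against that while flattening the items. `first_category_name` and `flatten_items` are hypothetical helpers, the dictionaries only mimic the part of the response used in this notebook, and the second venue is invented.

```python
def first_category_name(venue):
    """Name of the venue's first category, or None when it has none."""
    categories = venue.get("categories") or []
    return categories[0]["name"] if categories else None


def flatten_items(name, lat, lng, items):
    """One tuple per venue, in the column order used by getNearbyVenues."""
    return [(name,
             lat,
             lng,
             item["venue"]["name"],
             item["venue"]["location"]["lat"],
             item["venue"]["location"]["lng"],
             first_category_name(item["venue"])) for item in items]


# Mock items: the first follows the output shown earlier, the second is invented
mock_items = [
    {"venue": {"name": "Glen Manor Ravine",
               "location": {"lat": 43.676821, "lng": -79.293942},
               "categories": [{"name": "Trail"}]}},
    {"venue": {"name": "Uncategorized Example Venue",
               "location": {"lat": 43.68, "lng": -79.29},
               "categories": []}},
]

flat_rows = flatten_items("The Beaches", 43.676357, -79.293031, mock_items)
for flat_row in flat_rows:
    print(flat_row)
```

Rows with a `None` category can then be dropped or kept, depending on what the later grouping should count.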

In [0]:
torontoselected_venues = getNearbyVenues(names    =newdftoronto['Neighborhood'],
                                        latitudes =newdftoronto['Latitude'],
                                        longitudes=newdftoronto['Longitude']
                                        )

The Beaches
The Danforth West, Riverdale
India Bazaar, The Beaches West
Studio District
Lawrence Park
Davisville North
North Toronto West
Davisville
Moore Park, Summerhill East
Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park
Rosedale
St. James Town, Cabbagetown
Church and Wellesley
Regent Park, Harbourfront
Garden District, Ryerson
St. James Town
Berczy Park
Central Bay Street
Richmond, Adelaide, King
Harbourfront East, Union Station, Toronto Islands
Toronto Dominion Centre, Design Exchange
Commerce Court, Victoria Hotel
Roselawn
Forest Hill North & West
The Annex, North Midtown, Yorkville
University of Toronto, Harbord
Kensington Market, Chinatown, Grange Park
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport
Stn A PO Boxes
First Canadian Place, Underground city
Christie
Dufferin, Dovercourt Village
Little Portugal, Trinity
Brockton, Parkdale Village, Exhibition Place
High Park, The Junction South
Parkdale, Ron

In [0]:
print(torontoselected_venues.shape)
torontoselected_venues.head()

(1602, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,The Beaches,43.676357,-79.293031,Glen Manor Ravine,43.676821,-79.293942,Trail
1,The Beaches,43.676357,-79.293031,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
2,The Beaches,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
3,The Beaches,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Neighborhood
4,"The Danforth West, Riverdale",43.679557,-79.352188,MenEssentials,43.67782,-79.351265,Cosmetics Shop


In [0]:
# Filtering the most common types of venues
numberofvenues = 10

listofvenues   = list(((torontoselected_venues.groupby('Venue Category').count().sort_values('Venue', ascending = False)).index)[:numberofvenues])
print(listofvenues)

# Boolean mask: True for the venues whose category is among the most common ones
booleantypeofvenues = torontoselected_venues['Venue Category'].isin(listofvenues)

torontoselected_venues_short = torontoselected_venues[booleantypeofvenues].copy()

torontoselected_venues[booleantypeofvenues]

['Coffee Shop', 'Café', 'Restaurant', 'Italian Restaurant', 'Park', 'Japanese Restaurant', 'Hotel', 'Bakery', 'Gym', 'Bar']


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
6,"The Danforth West, Riverdale",43.679557,-79.352188,Cafe Fiorentina,43.677743,-79.350115,Italian Restaurant
14,"The Danforth West, Riverdale",43.679557,-79.352188,7 Numbers,43.677062,-79.353934,Italian Restaurant
17,"The Danforth West, Riverdale",43.679557,-79.352188,Rikkochez,43.677267,-79.353274,Restaurant
29,"The Danforth West, Riverdale",43.679557,-79.352188,Marvel Coffee Co.,43.678630,-79.347460,Coffee Shop
31,"The Danforth West, Riverdale",43.679557,-79.352188,Dough Bakeshop,43.676643,-79.356846,Bakery
...,...,...,...,...,...,...,...
1579,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,SUDS,43.659880,-79.394712,Bar
1580,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,Tim Hortons,43.658175,-79.390681,Coffee Shop
1582,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,Tim Hortons,43.658906,-79.388696,Coffee Shop
1591,Business reply mail Processing Centre,43.662744,-79.321558,The Green Wood,43.664728,-79.324117,Restaurant


In [0]:
torontoselected_venues_short.groupby('Neighborhood').count().head()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Berczy Park,13,13,13,13,13,13
"Brockton, Parkdale Village, Exhibition Place",10,10,10,10,10,10
Business reply mail Processing Centre,2,2,2,2,2,2
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",1,1,1,1,1,1
Central Bay Street,25,25,25,25,25,25


In [0]:
print('There are {} unique categories.'.format(len(torontoselected_venues_short['Venue Category'].unique())))

There are 10 unique categories.


In [0]:
# one hot encoding
torontoselected_onehot = pd.get_dummies(torontoselected_venues_short[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
torontoselected_onehot['Neighborhood'] = torontoselected_venues_short['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [torontoselected_onehot.columns[-1]] + list(torontoselected_onehot.columns[:-1])
torontoselected_onehot = torontoselected_onehot[fixed_columns]

torontoselected_onehot.head()

Unnamed: 0,Neighborhood,Bakery,Bar,Café,Coffee Shop,Gym,Hotel,Italian Restaurant,Japanese Restaurant,Park,Restaurant
6,"The Danforth West, Riverdale",0,0,0,0,0,0,1,0,0,0
14,"The Danforth West, Riverdale",0,0,0,0,0,0,1,0,0,0
17,"The Danforth West, Riverdale",0,0,0,0,0,0,0,0,0,1
29,"The Danforth West, Riverdale",0,0,0,1,0,0,0,0,0,0
31,"The Danforth West, Riverdale",1,0,0,0,0,0,0,0,0,0


In [0]:
toronto_grouped = torontoselected_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped.head()

Unnamed: 0,Neighborhood,Bakery,Bar,Café,Coffee Shop,Gym,Hotel,Italian Restaurant,Japanese Restaurant,Park,Restaurant
0,Berczy Park,0.153846,0.0,0.153846,0.307692,0.0,0.076923,0.0,0.076923,0.076923,0.153846
1,"Brockton, Parkdale Village, Exhibition Place",0.1,0.1,0.3,0.2,0.1,0.0,0.1,0.0,0.0,0.1
2,Business reply mail Processing Centre,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.5
3,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Central Bay Street,0.0,0.08,0.12,0.44,0.0,0.04,0.16,0.08,0.04,0.04
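
Taking the mean of one-hot columns per neighborhood gives the share of that neighborhood's venues in each category, which is why every row of the table above sums to 1. The sketch below reproduces the two steps on invented data; neighborhoods "A" and "B" and the variable names are hypothetical.

```python
import pandas as pd

# Invented venues: "A" has two bars and two parks, "B" has a single park
toy_venues = pd.DataFrame({
    "Neighborhood":   ["A", "A", "A", "A", "B"],
    "Venue Category": ["Bar", "Bar", "Park", "Park", "Park"],
})

# One column per category, 1.0 where the venue belongs to it
toy_onehot = pd.get_dummies(toy_venues["Venue Category"], dtype=float)
toy_onehot.insert(0, "Neighborhood", toy_venues["Neighborhood"])

# Mean of the 0/1 columns = fraction of venues of each category
toy_frequencies = toy_onehot.groupby("Neighborhood").mean()
print(toy_frequencies)
```

These frequencies, rather than raw counts, are what the clustering step later receives, so a neighborhood with a single coffee shop looks the same as one with fifty.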


Listing, for each neighborhood, its 5 most common venue categories (out of the 10 categories kept above).

In [0]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # beyond 'st', 'nd', 'rd' every position uses 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Berczy Park,Coffee Shop,Restaurant,Café,Bakery,Park
1,"Brockton, Parkdale Village, Exhibition Place",Café,Coffee Shop,Restaurant,Italian Restaurant,Gym
2,Business reply mail Processing Centre,Restaurant,Park,Japanese Restaurant,Italian Restaurant,Hotel
3,"CN Tower, King and Spadina, Railway Lands, Har...",Coffee Shop,Restaurant,Park,Japanese Restaurant,Italian Restaurant
4,Central Bay Street,Coffee Shop,Italian Restaurant,Café,Japanese Restaurant,Bar
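
The ranking above comes down to two small operations: sorting one row of frequencies and building ordinal column names. The sketch below isolates both. `ordinal` and `top_categories` are hypothetical helpers and the frequencies are invented, chosen without ties so that the order is unambiguous.

```python
import pandas as pd


def ordinal(n):
    """1 -> '1st', 2 -> '2nd', 3 -> '3rd', 4 -> '4th', 11 -> '11th', ..."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


def top_categories(frequencies, n):
    """Labels of the n largest values of a Series, largest first."""
    return frequencies.sort_values(ascending=False).index[:n].tolist()


# Invented frequencies for a single neighborhood
toy_row = pd.Series({"Bakery": 0.1, "Bar": 0.0, "Coffee Shop": 0.6, "Park": 0.3})

toy_columns = ["{} Most Common Venue".format(ordinal(i + 1)) for i in range(3)]
toy_top = top_categories(toy_row, 3)
print(dict(zip(toy_columns, toy_top)))
```

When several categories share the same frequency, as happens in the real table, the order among them depends on the sort and should not be read as a ranking.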


## Clustering Neighborhoods and plotting them
Finally, some unsupervised machine learning with the data. We will run the *k*-means algorithm with 5 clusters.

In [0]:
# import k-means
from sklearn.cluster import KMeans
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

In [0]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([4, 2, 1, 0, 4, 2, 4, 4, 2, 1, 2, 4, 3, 4, 4, 2, 1, 2, 3, 2, 1, 4,
       4, 0, 4, 4, 3, 2, 4, 4, 4, 2, 0, 2, 4, 4, 2], dtype=int32)
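
To make the labels above less of a black box, the sketch below runs the basic *k*-means loop (assign each point to its nearest centroid, then move each centroid to the mean of its points) in plain Python. It is an illustration only, not scikit-learn's implementation, which also takes care of choosing the initial centroids. `simple_kmeans` is a hypothetical name, and the four points and the starting centroids are invented.

```python
def simple_kmeans(points, centroids, n_iter=10):
    """Basic k-means iterations on tuples of floats, from given starting centroids."""
    labels = []
    for _ in range(n_iter):
        # assignment step: index of the closest centroid for every point
        labels = [
            min(range(len(centroids)),
                key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])))
            for point in points
        ]
        # update step: each centroid becomes the mean of its members
        updated = []
        for k in range(len(centroids)):
            members = [pt for pt, lab in zip(points, labels) if lab == k]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(centroids[k])  # keep an empty cluster where it was
        if updated == centroids:
            break  # nothing moved, the algorithm has converged
        centroids = updated
    return labels, centroids


# Invented two-category frequency vectors, e.g. (coffee shop share, park share)
toy_points = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
toy_labels, toy_centroids = simple_kmeans(toy_points, [(1.0, 0.0), (0.0, 1.0)])
print(toy_labels, toy_centroids)
```

The label numbers themselves carry no meaning: they only say which rows ended up together, which is why the clusters are named further down by looking at their venues.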

In [0]:
toronto_grouped_clustering.head()

Unnamed: 0,Bakery,Bar,Café,Coffee Shop,Gym,Hotel,Italian Restaurant,Japanese Restaurant,Park,Restaurant
0,0.153846,0.0,0.153846,0.307692,0.0,0.076923,0.0,0.076923,0.076923,0.153846
1,0.1,0.1,0.3,0.2,0.1,0.0,0.1,0.0,0.0,0.1
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.5
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.08,0.12,0.44,0.0,0.04,0.16,0.08,0.04,0.04


In [0]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

newdftoronto_merged = newdftoronto

# join neighborhoods_venues_sorted to newdftoronto so that every neighborhood keeps its latitude/longitude
newdftoronto_merged = newdftoronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

newdftoronto_merged.head() # check the last columns!
print(newdftoronto_merged.shape)

(39, 11)


In [0]:
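# Clean newdftoronto_merged by removing rows with NaN, i.e. neighborhoods that 
# received no cluster label because they are absent from neighborhoods_venues_sorted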
newdftoronto_merged.dropna(axis=0,inplace=True)
newdftoronto_merged['Cluster Labels'] = newdftoronto_merged['Cluster Labels'].astype('int64')
print(newdftoronto_merged.shape)

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
# NOTE: with rainbow[cluster-1], label 0 takes rainbow[-1], the last colour of the list
markers_colors = []
for lat, lon, poi, cluster in zip(newdftoronto_merged['Latitude'], newdftoronto_merged['Longitude'], newdftoronto_merged['Neighborhood'], newdftoronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

(37, 11)
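
The colour of each cluster follows from the `rainbow[cluster-1]` indexing used above: Python's negative indexing sends label 0 to the last colour of the list. The sketch below spells the mapping out. The colour names are approximations taken from the section headings that follow, assuming the rainbow colormap runs from purple to red; `palette` and `colour_for` are hypothetical names.

```python
# Approximate names of the five rainbow colours, in list order (an assumption)
palette = ["purple", "light blue", "green", "orange", "red"]


def colour_for(label, colours):
    """Same indexing as the map cell: colours[label - 1]."""
    return colours[label - 1]


# k-means label -> colour; "Cluster n" in the headings is label n - 1
label_colours = {label: colour_for(label, palette) for label in range(len(palette))}
for label, colour in label_colours.items():
    print("Cluster {} (label {}): {}".format(label + 1, label, colour))
```

Indexing with `rainbow[cluster]` instead would give label 0 the first colour, at the cost of changing every colour named in the headings.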


## Naming the clusters according to their most common venues

### Cluster 1 -> Red

In [0]:
newdftoronto_merged.loc[newdftoronto_merged['Cluster Labels'] == 0, newdftoronto_merged.columns[[1] + list(range(5, newdftoronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
9,Central Toronto,0,Coffee Shop,Restaurant,Park,Japanese Restaurant,Italian Restaurant
27,Downtown Toronto,0,Coffee Shop,Restaurant,Park,Japanese Restaurant,Italian Restaurant
37,Downtown Toronto,0,Coffee Shop,Park,Japanese Restaurant,Italian Restaurant,Gym


Cluster 1 seems to be characterized by coffee shops, restaurants and parks.

### Cluster 2 -> Purple

In [0]:
newdftoronto_merged.loc[newdftoronto_merged['Cluster Labels'] == 1, newdftoronto_merged.columns[[1] + list(range(5, newdftoronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
2,East Toronto,1,Park,Restaurant,Italian Restaurant,Gym,Coffee Shop
5,Central Toronto,1,Park,Hotel,Gym,Restaurant,Japanese Restaurant
8,Central Toronto,1,Park,Gym,Restaurant,Japanese Restaurant,Italian Restaurant
38,East Toronto,1,Restaurant,Park,Japanese Restaurant,Italian Restaurant,Hotel


Cluster 2 seems to be characterized by parks, restaurants and gyms.

### Cluster 3 -> Light Blue

In [0]:
newdftoronto_merged.loc[newdftoronto_merged['Cluster Labels'] == 2, newdftoronto_merged.columns[[1] + list(range(5, newdftoronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
3,East Toronto,2,Café,Coffee Shop,Bakery,Park,Italian Restaurant
7,Central Toronto,2,Italian Restaurant,Gym,Coffee Shop,Café,Restaurant
24,Central Toronto,2,Café,Coffee Shop,Park,Restaurant,Japanese Restaurant
25,Downtown Toronto,2,Café,Restaurant,Japanese Restaurant,Italian Restaurant,Bar
26,Downtown Toronto,2,Café,Coffee Shop,Bakery,Bar,Park
30,Downtown Toronto,2,Café,Park,Restaurant,Italian Restaurant,Coffee Shop
31,West Toronto,2,Bakery,Park,Café,Bar,Restaurant
32,West Toronto,2,Bar,Restaurant,Café,Park,Japanese Restaurant
33,West Toronto,2,Café,Coffee Shop,Restaurant,Italian Restaurant,Gym
34,West Toronto,2,Café,Park,Italian Restaurant,Bar,Bakery


Cluster 3 seems to be characterized by cafés, bakeries, coffee shops, parks and restaurants.

### Cluster 4 -> Green

In [0]:
newdftoronto_merged.loc[newdftoronto_merged['Cluster Labels'] == 3, newdftoronto_merged.columns[[1] + list(range(5, newdftoronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
4,Central Toronto,3,Park,Restaurant,Japanese Restaurant,Italian Restaurant,Hotel
10,Downtown Toronto,3,Park,Restaurant,Japanese Restaurant,Italian Restaurant,Hotel
23,Central Toronto,3,Park,Restaurant,Japanese Restaurant,Italian Restaurant,Hotel


Cluster 4 seems to be characterized by parks and restaurants.

### Cluster 5 -> Orange

In [0]:
newdftoronto_merged.loc[newdftoronto_merged['Cluster Labels'] == 4, newdftoronto_merged.columns[[1] + list(range(5, newdftoronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
1,East Toronto,4,Italian Restaurant,Coffee Shop,Restaurant,Café,Bakery
6,Central Toronto,4,Coffee Shop,Restaurant,Park,Café,Japanese Restaurant
11,Downtown Toronto,4,Coffee Shop,Bakery,Restaurant,Park,Italian Restaurant
12,Downtown Toronto,4,Japanese Restaurant,Coffee Shop,Restaurant,Hotel,Café
13,Downtown Toronto,4,Coffee Shop,Park,Bakery,Café,Restaurant
14,Downtown Toronto,4,Coffee Shop,Restaurant,Japanese Restaurant,Italian Restaurant,Café
15,Downtown Toronto,4,Coffee Shop,Café,Restaurant,Italian Restaurant,Gym
16,Downtown Toronto,4,Coffee Shop,Restaurant,Café,Bakery,Park
17,Downtown Toronto,4,Coffee Shop,Italian Restaurant,Café,Japanese Restaurant,Bar
18,Downtown Toronto,4,Coffee Shop,Café,Restaurant,Hotel,Gym


Cluster 5 seems to be characterized by coffee shops, restaurants and cafés.

## Conclusions
During this exercise we have gone from web data to a pandas DataFrame, then to a map, and finally we have used machine learning to cluster neighborhoods according to the types of venues they have.  
We have used 5 clusters:  
- Clusters 1 (red), 3 (light blue) and 5 (orange) have many coffee shops and cafés. You can also eat at the restaurants around there, and there are some parks as well. If you want cake with your coffee, the best cluster is 3 (light blue). Cluster 1 has more parks and Cluster 5 more bars and restaurants.
- Cluster 2 (purple) is the healthy cluster, where you can find parks and gyms.
- Cluster 4 (green) is the most balanced cluster. It has parks, restaurants and hotels. The only thing missing there is coffee shops.