# Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto

In this assignment, you will be required to explore, segment, and cluster the neighborhoods in the city of Toronto. However, unlike New York, the neighborhood data is not readily available on the internet. What is interesting about the field of data science is that each project can be challenging in its unique way, so you need to learn to be agile and refine the skill to learn new libraries and tools quickly depending on the project.

For the Toronto neighborhood data, a Wikipedia page exists that has all the information we need to explore and cluster the neighborhoods in Toronto. You will be required to scrape the Wikipedia page and wrangle the data, clean it, and then read it into a pandas dataframe so that it is in a structured format like the New York dataset.

Once the data is in a structured format, you can replicate the analysis that we did to the New York City dataset to explore and cluster the neighborhoods in the city of Toronto.

Your submission will be a link to your Jupyter Notebook on your Github repository.

***This notebook contains all three parts of the assignment***

<br>

## **Task 1**

For this assignment, you will be required to explore and cluster the neighborhoods in Toronto.

1. Start by creating a new Notebook for this assignment.
2. Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe like the one shown below:
<p align="center">
 <img src="https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/7JXaz3NNEeiMwApe4i-fLg_40e690ae0e927abda2d4bde7d94ed133_Screen-Shot-2018-06-18-at-7.17.57-PM.png?expiry=1606608000000&hmac=tpWz4QHI5JMGGOkiMOqi3FQJgOdYCIETnxmqPY35YTs" alt = "dataframe"/>
</p>
3. To create the above dataframe:

  - The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
  - Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
  - More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
  
  - If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
  - Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
  - In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.
  
4. Submit a link to your Notebook on your Github repository.

Note: There are different website scraping libraries and packages in Python. For scraping the above table, you can simply use pandas to read the table into a pandas dataframe.

Another way, which would help to learn for more complicated cases of web scraping is using the BeautifulSoup package. Here is the package's main documentation page: http://beautiful-soup-4.readthedocs.io/en/latest/

Use pandas, or the BeautifulSoup package, or any other way you are comfortable with to transform the data in the table on the Wikipedia page into the above pandas dataframe.
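
The cleaning rules from step 3 can be sketched on a few made-up rows. This is only an illustration, assuming the older table layout with one neighborhood per row; the rows in `raw` are hypothetical and are not scraped from the page.

```python
import pandas as pd

# Hypothetical rows in the older, one-neighborhood-per-row table layout.
raw = pd.DataFrame({
    "PostalCode": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
})

# Rule 1: keep only rows with an assigned borough.
clean = raw[raw["Borough"] != "Not assigned"].copy()

# Rule 2: an unassigned neighborhood takes the borough's name.
unassigned = clean["Neighborhood"] == "Not assigned"
clean.loc[unassigned, "Neighborhood"] = clean.loc[unassigned, "Borough"]

# Rule 3: merge rows that share a postal code into one comma-separated row.
clean = (
    clean.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
print(clean)
```

If the scraped table already lists several neighborhoods per postal code, rule 3 leaves the rows unchanged.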

In [1]:
# importing libraries
import pandas as pd
import numpy as np
import requests
#import bs4 #it was not needed after all

Assign the URL to a variable for ease of use, then send a request to it.
A response code of 200 means the request succeeded.

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
res = requests.get(url)
res

<Response [200]>

The request succeeded. Now we need to parse the table from the page into a pandas dataframe. I tried the bs4 module first, but <code>pandas.read_html</code> turned out to be much easier. We pass it the response's <code>.content</code> attribute, and it returns a list of dataframes, one per table found on the page. The first element of that list is the table we need, so we use it as our dataframe.

In [3]:
url_raw = pd.read_html(res.content)
type(url_raw)

list

In [4]:
url_raw = pd.read_html(res.content)[0]
url_raw

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


Finally, we create the dataframe we'll be working with. To do this, we filter out all entries whose 'Borough' is not assigned and then reset the index.

In [5]:
# filter and reset the index in one step to avoid modifying a slice in place
df = url_raw[url_raw['Borough'] != 'Not assigned'].reset_index(drop=True)
df

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


Check whether there are entries where 'Neighbourhood' values are not assigned.

In [6]:
df[df['Neighbourhood'] == 'Not assigned']

Unnamed: 0,Postal Code,Borough,Neighbourhood


No rows are returned, so every remaining entry has an assigned neighbourhood. Now we can check the number of rows and move on to Task 2.

In [7]:
print(f"The dataframe has {df.shape[0]} rows in it.")

The dataframe has 103 rows in it.


<br>
<br>

## **Task 2**

Now that you have built a dataframe of the postal code of each neighborhood along with the borough name and neighborhood name, in order to utilize the Foursquare location data, we need to get the latitude and the longitude coordinates of each neighborhood.

In an older version of this course, we were leveraging the Google Maps Geocoding API to get the latitude and the longitude coordinates of each neighborhood. However, recently Google started charging for their API: http://geoawesomeness.com/developers-up-in-arms-over-google-maps-api-insane-price-hike/, so we will use the Geocoder Python package instead: https://geocoder.readthedocs.io/index.html.

The problem with this Package is you have to be persistent sometimes in order to get the geographical coordinates of a given postal code. So you can make a call to get the latitude and longitude coordinates of a given postal code and the result would be None, and then make the call again and you would get the coordinates. So, in order to make sure that you get the coordinates for all of our neighborhoods, you can run a while loop for each postal code. Taking postal code M5G as an example, your code would look something like this:
```python
import geocoder # import geocoder

# initialize your variable to None
lat_lng_coords = None

# loop until you get the coordinates
while(lat_lng_coords is None):
  g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
  lat_lng_coords = g.latlng

latitude = lat_lng_coords[0]
longitude = lat_lng_coords[1]
```
Given that this package can be very unreliable, in case you are not able to get the geographical coordinates of the neighborhoods using the Geocoder package, here is a link to a csv file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data

Use the Geocoder package or the csv file to create the following dataframe:
<img src="https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/HZ3jNHNOEeiMwApe4i-fLg_f44f0f10ccfaf42fcbdba9813364e173_Screen-Shot-2018-06-18-at-7.18.16-PM.png?expiry=1606608000000&hmac=fOqqJgXrl5tNYH6TPCai-_iqS5Xn_PZms-sD0UM6BbM">

**Important Note:** There is a limit on how many times you can call geocoder.google function. It is 2500 times per day. This should be way more than enough for you to get acquainted with the package and to use it to get the geographical coordinates of the neighborhoods in the Toronto.

Once you are able to create the above dataframe, submit a link to the new Notebook on your Github repository.

<br>
Import geocoder (installed previously) and try the code given in the task:

```python
import geocoder # import geocoder

# initialize your variable to None
lat_lng_coords = None

#give a postal code
postal_code = 'M3A'

# loop until you get the coordinates
while(lat_lng_coords is None):
  g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
  lat_lng_coords = g.latlng

latitude = lat_lng_coords[0]
longitude = lat_lng_coords[1]
```

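The code ran for a long time without returning a result, so I had to interrupt it via "Kernel --> Interrupt". Presumably <code>g.latlng</code> kept coming back as <code>None</code>, in which case the <code>while</code> loop never terminates.
Let's use the provided .csv file instead.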
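
A retry loop with an upper bound avoids the endless run described above. The following is a minimal sketch, not part of the assignment: `geocode_with_retry` and its parameters are hypothetical names, and the `lookup` callable stands in for whatever geocoding call is used.

```python
import time

def geocode_with_retry(lookup, query, max_attempts=5, delay=0.0):
    """Call lookup(query) until it returns coordinates or attempts run out.

    `lookup` is any callable returning a (lat, lng) pair or None; in the
    notebook it could wrap a geocoder call. Returns None on failure.
    """
    for attempt in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(delay)  # brief pause before the next attempt
    return None

# Stand-in lookup that fails twice and then succeeds (hypothetical values).
responses = iter([None, None, (43.75, -79.33)])
result = geocode_with_retry(lambda q: next(responses), "M3A, Toronto, Ontario")
print(result)
```

When the function returns `None`, the caller can fall back to the csv file rather than wait indefinitely.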

In [8]:
p_df = pd.read_csv('https://cocl.us/Geospatial_data')
p_df

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


Okay, here's a new dataframe with the coordinates. Now we have to join the two. Before doing so, we check whether they are compatible by comparing their dimensions and data types.

In [9]:
print(f"The first dataframe df has a shape of {df.shape} and the following datatypes: \n{df.dtypes}")
print(f"\nThe second dataframe has a shape of {p_df.shape} and the following datatypes: \n{p_df.dtypes}")

The first dataframe df has a shape of (103, 3) and the following datatypes: 
Postal Code      object
Borough          object
Neighbourhood    object
dtype: object

The second dataframe has a shape of (103, 3) and the following datatypes: 
Postal Code     object
Latitude       float64
Longitude      float64
dtype: object


It looks like we can join the two dataframes on the 'Postal Code' column.
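
Matching shapes and dtypes do not by themselves show that the postal codes line up. One stricter check, sketched here on small hypothetical stand-ins for `df` and `p_df`, uses the `validate` and `indicator` parameters of `DataFrame.merge`.

```python
import pandas as pd

# Small stand-ins for the notebook's `df` and `p_df` (hypothetical rows).
left = pd.DataFrame({"Postal Code": ["M3A", "M4A", "M5A"],
                     "Borough": ["North York", "North York", "Downtown Toronto"]})
right = pd.DataFrame({"Postal Code": ["M5A", "M3A", "M4A"],
                      "Latitude": [43.65, 43.75, 43.73],
                      "Longitude": [-79.36, -79.33, -79.32]})

# validate raises MergeError if a postal code repeats on either side;
# indicator adds a `_merge` column showing where each key was found.
checked = left.merge(right, on="Postal Code", how="outer",
                     validate="one_to_one", indicator=True)

unmatched = checked[checked["_merge"] != "both"]
print(f"{len(unmatched)} postal codes appear in only one dataframe")
```

An empty `unmatched` frame means an inner join will keep every row from both sides.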

In [10]:
comb_df = df.join(p_df.set_index('Postal Code'), on='Postal Code', how='inner')
comb_df

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509


In [11]:
print(f"The new dataframe has {comb_df.shape[0]} rows and no errors occurred during its creation.")

The new dataframe has 103 rows and no errors occurred during its creation.


<br>
<br>

## **Task 3**

Explore and cluster the neighborhoods in Toronto. You can decide to work with only boroughs that contain the word Toronto and then replicate the same analysis we did to the New York City data. It is up to you.

Just make sure:

1. to add enough Markdown cells to explain what you decided to do and to report any observations you make.
2. to generate maps to visualize your neighborhoods and how they cluster together. 

Once you are happy with your analysis, submit a link to the new Notebook on your Github repository.

<br>
Let's import from the <code>geopy</code> package; specifically, we are interested in its <code>Nominatim</code> class. Then we set the address to Toronto and get the coordinates.

In [12]:
#import geocoder
from geopy.geocoders import Nominatim 

address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print(f'The geographical coordinates of Toronto are {latitude}, {longitude}.')

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


The next step is to build the map to visualize the clusters. For this task the <code>Folium</code> module is needed. Then, very much like in previous labs we add markers to the map using a <code>for</code> loop.

In [13]:
import folium

Toronto_map = folium.Map(location=[latitude, longitude], zoom_start=10)

# loop variables are named lat/lng so that Toronto's latitude/longitude are not overwritten
for lat, lng, borough, neighbourhood in zip(comb_df['Latitude'], comb_df['Longitude'], comb_df['Borough'], comb_df['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='green',
        fill=True
        ).add_to(Toronto_map)  
    
Toronto_map

From the previous labs we now copy our Foursquare credentials, or create new ones if needed.

In [14]:
CLIENT_ID = '3WFWS0OEZXUVJW2HOYQIEJSUQKMRQJJG3TTZDOCPY2USGGES' 
CLIENT_SECRET = 'I5GTXXEWPAHE35DJTYGFNIUQVJPRYQE3J3RDUAMPTHOL0MHY'
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: 3WFWS0OEZXUVJW2HOYQIEJSUQKMRQJJG3TTZDOCPY2USGGES
CLIENT_SECRET:I5GTXXEWPAHE35DJTYGFNIUQVJPRYQE3J3RDUAMPTHOL0MHY
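
Since the notebook is shared through a public repository, it may be safer to keep the credentials out of the code and the printed output. The sketch below reads them from environment variables instead; the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and both helper functions are assumptions made for this example.

```python
import os

def load_credentials(env=os.environ):
    """Read Foursquare credentials from environment variables.

    The variable names are assumptions for this sketch, not something the
    course or Foursquare prescribes.
    """
    client_id = env.get("FOURSQUARE_CLIENT_ID")
    client_secret = env.get("FOURSQUARE_CLIENT_SECRET")
    if not client_id or not client_secret:
        raise RuntimeError("Foursquare credentials are not set")
    return client_id, client_secret

def mask(value, visible=4):
    """Show only the first few characters when printing a secret."""
    return value[:visible] + "*" * (len(value) - visible)

# Demo with a plain dict standing in for os.environ.
demo_env = {"FOURSQUARE_CLIENT_ID": "abcd1234", "FOURSQUARE_CLIENT_SECRET": "wxyz5678"}
cid, secret = load_credentials(demo_env)
print("CLIENT_ID:", mask(cid))
```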


Next we want to see all the venues in the area. With the help from previously done labs we can write a function which will perform just that task.

In [15]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius
            )
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'],
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue',
                  'Venue Latitude', 
                  'Venue Longitude',           
                  'Venue Category']
    
    return(nearby_venues)
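
The function above indexes straight into the JSON response, so a response without the expected keys raises an exception partway through the loop. A more defensive parsing step could look like the sketch below. `extract_venues` is a hypothetical helper, and the sample payloads only mimic the keys that the function above reads.

```python
def extract_venues(payload):
    """Return (name, lat, lng, category) tuples from an explore-style payload.

    Missing keys yield an empty list or a None category rather than an exception.
    """
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories", [])
        category = categories[0]["name"] if categories else None
        rows.append((venue.get("name"), location.get("lat"), location.get("lng"), category))
    return rows

# Hypothetical payloads shaped like the response the function above indexes into.
ok = {"response": {"groups": [{"items": [
    {"venue": {"name": "Brookbanks Park",
               "location": {"lat": 43.751976, "lng": -79.33214},
               "categories": [{"name": "Park"}]}},
    {"venue": {"name": "Unnamed Spot",
               "location": {"lat": 43.75, "lng": -79.33},
               "categories": []}},
]}]}}
error = {"meta": {"code": 429}}

print(extract_venues(ok))
print(extract_venues(error))
```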

Now, using the resulting dataframe from Task 2, we can create a dataframe with venues for each neighbourhood using the function above.

In [16]:
toronto_venues = getNearbyVenues(comb_df['Neighbourhood'], comb_df['Latitude'], comb_df['Longitude'])

Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Queen's Park, Ontario Provincial Government
Islington Avenue, Humber Valley Village
Malvern, Rouge
Don Mills
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Don Mills
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto, Broadview North (Old East York)
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
Kennedy Park, Ionview, East Birchmo

In [17]:
toronto_venues.head()

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
3,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop
4,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant


All right, the dataframe looks good. Let's check how many rows it has.

In [18]:
print(f"The 'toronto_venues' dataframe has {toronto_venues.shape[0]} rows in it.")

The 'toronto_venues' dataframe has 1338 rows in it.


In [19]:
toronto_venues.groupby('Neighbourhood').head()

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.332140,Park
1,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
3,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop
4,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant
...,...,...,...,...,...,...,...
1322,"Mimico NW, The Queensway West, South of Bloor,...",43.628841,-79.520999,South St. Burger,43.631314,-79.518408,Burger Joint
1323,"Mimico NW, The Queensway West, South of Bloor,...",43.628841,-79.520999,Wingporium,43.630275,-79.518169,Wings Joint
1324,"Mimico NW, The Queensway West, South of Bloor,...",43.628841,-79.520999,Dollarama,43.629883,-79.518627,Discount Store
1325,"Mimico NW, The Queensway West, South of Bloor,...",43.628841,-79.520999,Healthy Planet,43.630214,-79.518495,Supplement Shop


Okay, this gives us an overview of the neighbourhoods. What about the number of venues per neighbourhood?

In [20]:
toronto_venues.groupby('Neighbourhood').count()

Unnamed: 0_level_0,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighbourhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Agincourt,5,5,5,5,5,5
"Alderwood, Long Branch",7,7,7,7,7,7
"Bathurst Manor, Wilson Heights, Downsview North",21,21,21,21,21,21
Bayview Village,4,4,4,4,4,4
"Bedford Park, Lawrence Manor East",22,22,22,22,22,22
...,...,...,...,...,...,...
"Willowdale, Willowdale West",5,5,5,5,5,5
Woburn,4,4,4,4,4,4
Woodbine Heights,9,9,9,9,9,9
York Mills West,2,2,2,2,2,2


Finally, let's see how many unique venue categories there are.

In [21]:
toronto_venues.groupby('Venue Category').max()

Unnamed: 0_level_0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude
Venue Category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Accessories Store,"Wexford, Maryvale",43.750072,-79.295849,Puffin Gear,43.750947,-79.290047
Airport,Downsview,43.737473,-79.394420,Toronto Downsview Airport (YZD),43.738883,-79.396033
Airport Food Court,"CN Tower, King and Spadina, Railway Lands, Har...",43.628947,-79.394420,Billy Bishop Café,43.631132,-79.396139
Airport Gate,"CN Tower, King and Spadina, Railway Lands, Har...",43.628947,-79.394420,Gate 8,43.631536,-79.394570
Airport Lounge,"CN Tower, King and Spadina, Railway Lands, Har...",43.628947,-79.394420,Porter Lounge,43.631360,-79.395756
...,...,...,...,...,...,...
Warehouse Store,Thorncliffe Park,43.705369,-79.349372,Costco,43.707051,-79.348093
Wine Bar,"Little Portugal, Trinity",43.653206,-79.400049,Paris Paris Bar,43.653479,-79.401427
Wings Joint,"Mimico NW, The Queensway West, South of Bloor,...",43.628841,-79.520999,Wingporium,43.630275,-79.518169
Women's Store,"Lawrence Manor, Lawrence Heights",43.718518,-79.453512,Maximum Woman,43.717878,-79.456333


In [22]:
print(f'There are {toronto_venues.groupby("Venue Category").max().shape[0]} unique venue categories.')

There are 240 unique venue categories.
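
Grouping by category and reading the row count works, but the same number can be obtained more directly with `Series.nunique()`. A small sketch on hypothetical rows:

```python
import pandas as pd

# Hypothetical venue rows with a repeated category.
venues = pd.DataFrame({
    "Venue": ["Brookbanks Park", "Tim Hortons", "Another Cafe", "Portugril"],
    "Venue Category": ["Park", "Coffee Shop", "Coffee Shop", "Portuguese Restaurant"],
})

n_categories = venues["Venue Category"].nunique()
print(f"There are {n_categories} unique venue categories.")
```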


Now, to work with this dataframe further, we need a numerical representation of which venue categories occur in which neighbourhood. The venue category is a categorical variable, so we convert it into a binary vector: each row (one venue) gets a value of **one (1)** in the column of its own category and **zero (0)** in every other column. This is called one-hot encoding, and in <code>pandas</code> it can be applied via the <code>get_dummies</code> function.
We pass <code>prefix=""</code> and <code>prefix_sep=""</code> so that the new columns are named after the categories alone.
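
As a toy illustration of one-hot encoding, the sketch below encodes three hypothetical venues. It passes `dtype=int` to get 0/1 columns, because recent pandas versions return booleans by default.

```python
import pandas as pd

toy = pd.DataFrame({"Venue Category": ["Park", "Coffee Shop", "Park"]})

# dtype=int requests 0/1 columns; recent pandas versions default to booleans.
encoded = pd.get_dummies(toy[["Venue Category"]], prefix="", prefix_sep="", dtype=int)
print(encoded)
```

Each row has exactly one 1, in the column named after its category.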

In [36]:
toronto_venues_encoded = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")
toronto_venues_encoded

Unnamed: 0,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1333,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1334,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1335,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1336,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


As we want to know which venues are in which neighbourhood, we need to add a "Neighbourhood" column to our one-hot encoded dataframe. For this we use the <code>concat()</code> function along axis 1, that is, along columns.

In [37]:
toronto_venues_encoded = pd.concat([toronto_venues['Neighbourhood'],toronto_venues_encoded],axis=1)
toronto_venues_encoded.head(10)

Unnamed: 0,Neighbourhood,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [40]:
print(f'This dataframe has {toronto_venues_encoded.shape[0]} rows and {toronto_venues_encoded.shape[1]} columns.')

This dataframe has 1338 rows and 241 columns.


Now, to see which venue categories dominate each neighbourhood and how venue types are distributed across Toronto, we group the dataframe by neighbourhood and take the mean of each category column. The mean of a 0/1 column is the share of the neighbourhood's venues that fall into that category.

In [42]:
toronto_grouped = toronto_venues_encoded.groupby('Neighbourhood').mean().reset_index()
toronto_grouped.head()

Unnamed: 0,Neighbourhood,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.045455,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [43]:
print(f'This dataframe has {toronto_grouped.shape[0]} rows and {toronto_grouped.shape[1]} columns.')

This dataframe has 96 rows and 241 columns.
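
To see why the grouped means can be read as frequencies, consider this small sketch with hypothetical one-hot rows for two neighbourhoods, "A" and "B".

```python
import pandas as pd

# Hypothetical one-hot rows: one row per venue.
onehot = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "A", "B"],
    "Coffee Shop":   [1, 1, 1, 0, 0],
    "Park":          [0, 0, 0, 1, 1],
})

freq = onehot.groupby("Neighbourhood").mean().reset_index()
print(freq)
```

Three of the four venues in "A" are coffee shops, so its "Coffee Shop" value is 0.75, and the values in each row sum to 1.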


Again, referring to the previous lab, we can borrow a function from it. The function returns the most common venue categories for a given row.

In [44]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
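
A quick usage sketch of the function on a single hypothetical row may help. The row mimics one row of `toronto_grouped`: the neighbourhood name comes first, followed by the category frequencies.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]  # skip the leading 'Neighbourhood' entry
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Hypothetical frequency row shaped like one row of toronto_grouped.
row = pd.Series({"Neighbourhood": "A", "Bank": 0.1, "Coffee Shop": 0.6, "Park": 0.3})
top2 = return_most_common_venues(row, 2)
print(list(top2))
```

Note that when a neighbourhood has fewer distinct categories than `num_top_venues`, the remaining slots are filled with categories whose frequency is 0.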

Additionally, we can borrow the code that finds the top 10 venue categories in each neighbourhood and stores them in a new dataframe, ordered by how frequently they occur. That code uses <code>numpy</code>, which we already imported at the beginning.

In [48]:
#import numpy as np

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted['Neighbourhood'] = toronto_grouped['Neighbourhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighbourhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Lounge,Breakfast Spot,Latin American Restaurant,Skating Rink,Clothing Store,Dessert Shop,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Donut Shop
1,"Alderwood, Long Branch",Pizza Place,Pharmacy,Coffee Shop,Pub,Sandwich Place,Gym,Gay Bar,Donut Shop,Dive Bar,Distribution Center
2,"Bathurst Manor, Wilson Heights, Downsview North",Coffee Shop,Bank,Pizza Place,Chinese Restaurant,Bridal Shop,Shopping Mall,Sandwich Place,Diner,Deli / Bodega,Restaurant
3,Bayview Village,Café,Bank,Chinese Restaurant,Japanese Restaurant,Yoga Studio,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Donut Shop
4,"Bedford Park, Lawrence Manor East",Coffee Shop,Sandwich Place,Italian Restaurant,Pharmacy,Thai Restaurant,Pizza Place,Pub,Restaurant,Café,Butcher
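
The column labels above come from a three-item suffix list with an exception as the fallback, which is enough for positions 1 to 10. If more columns were ever needed, a general ordinal helper could be used instead. The `ordinal` function below is a hypothetical sketch, not code from the lab.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

columns = ["Neighbourhood"] + [f"{ordinal(i)} Most Common Venue" for i in range(1, 11)]
print(columns[:4])
```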


Now, with this frequency data at hand, we can start clustering with k=5. For that we import <code>KMeans</code> from <code>sklearn</code>.

In [49]:
from sklearn.cluster import KMeans

k_clusters = 5

#drop the Neighbourhood column to work with numerical values only
toronto_k_clustering = toronto_grouped.drop(columns='Neighbourhood')

KM = KMeans(n_clusters=k_clusters, random_state=0)

In [50]:
KM.fit(toronto_k_clustering)
KM

KMeans(n_clusters=5, random_state=0)

In [52]:
KM.labels_[0:100]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 2, 1, 0, 4, 0, 1, 0, 0, 0, 3, 0, 0, 1, 0, 0, 0, 1,
       0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0])
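
In the array above, most neighbourhoods fall into cluster 0. One way to quantify the cluster sizes is `numpy.bincount`, sketched here on a short hypothetical label array; in the notebook the input would be `KM.labels_`.

```python
import numpy as np

# Hypothetical label array; in the notebook this would be KM.labels_.
labels = np.array([0, 0, 1, 0, 2, 0, 1, 0])

sizes = np.bincount(labels, minlength=3)
for cluster, size in enumerate(sizes):
    print(f"cluster {cluster}: {size} neighbourhoods")
```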

The array above holds the cluster label assigned to each neighbourhood. Now we add these labels to the **neighbourhoods_venues_sorted** dataframe. We then merge it with the dataframe from Task 2, so that a single dataframe holds the location, the cluster label, and the top venues for each neighbourhood.

In [53]:
#adding the labels to the top10 df
neighbourhoods_venues_sorted.insert(0, 'Cluster Labels', KM.labels_)

#creating a copy of a combined dataframe from Task 2
toronto_final = comb_df.copy()
toronto_final

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509


In [56]:
toronto_ffinal = toronto_final.join(neighbourhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

toronto_ffinal.head(10)

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,1.0,Park,Food & Drink Shop,Yoga Studio,Department Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Donut Shop,Dog Run,Dive Bar
1,M4A,North York,Victoria Village,43.725882,-79.315572,0.0,Intersection,Hockey Arena,Pizza Place,French Restaurant,Coffee Shop,Portuguese Restaurant,Department Store,Dessert Shop,Dim Sum Restaurant,Diner
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,0.0,Coffee Shop,Park,Bakery,Breakfast Spot,Theater,French Restaurant,Performing Arts Venue,Chocolate Shop,Pub,Restaurant
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,0.0,Clothing Store,Accessories Store,Boutique,Gift Shop,Furniture / Home Store,Event Space,Coffee Shop,Women's Store,Vietnamese Restaurant,Airport Terminal
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,0.0,Coffee Shop,Yoga Studio,Diner,Sandwich Place,Portuguese Restaurant,Park,Mexican Restaurant,Italian Restaurant,Hobby Shop,Fried Chicken Joint
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242,,,,,,,,,,,
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353,2.0,Fast Food Restaurant,Yoga Studio,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Donut Shop,Dog Run,Dive Bar
7,M3B,North York,Don Mills,43.745906,-79.352188,0.0,Gym,Beer Store,Japanese Restaurant,Coffee Shop,Dim Sum Restaurant,Discount Store,Supermarket,Caribbean Restaurant,Café,Chinese Restaurant
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937,0.0,Coffee Shop,Pizza Place,Bank,Intersection,Road,Athletics & Sports,Bus Line,Breakfast Spot,Gastropub,Gym / Fitness Center
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,0.0,Coffee Shop,Café,Clothing Store,Theater,Pizza Place,Burger Joint,Japanese Restaurant,Electronics Store,New American Restaurant,Shopping Mall


In [57]:
# check for null values
toronto_ffinal[toronto_ffinal['Cluster Labels'].isnull()]

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242,,,,,,,,,,,
52,M2M,North York,"Willowdale, Newtonbrook",43.789053,-79.408493,,,,,,,,,,,
95,M1X,Scarborough,Upper Rouge,43.836125,-79.205636,,,,,,,,,,,


These three rows have NaN cluster labels, so we remove them from the dataframe.

In [58]:
toronto_final_nonulls = toronto_ffinal.dropna(subset=['Cluster Labels'])

In [60]:
toronto_final_nonulls[toronto_final_nonulls['Cluster Labels'].isnull()]

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
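
The cleanup above can be sketched on a small toy frame. The frame `toy` and the extra cast to `int` are assumptions added for illustration and are not part of the original notebook; the values are copied from the rows shown earlier.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for toronto_ffinal (hypothetical, values from the rows above).
toy = pd.DataFrame({
    "Postal Code": ["M3A", "M9A", "M1B"],
    "Cluster Labels": [1.0, np.nan, 2.0],
    "1st Most Common Venue": ["Park", np.nan, "Fast Food Restaurant"],
})

# Drop rows without a cluster label, then store the labels as integers.
toy_nonulls = toy.dropna(subset=["Cluster Labels"]).copy()
toy_nonulls["Cluster Labels"] = toy_nonulls["Cluster Labels"].astype(int)

print(toy_nonulls)
```

Casting to `int` is optional, but it would remove the need for the `int(cluster)` conversions when the labels are later used as list indices.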


All right, we are cleared to plot! We need to import a few things from <code>matplotlib</code>, and we can refer to the lab for help with the plotting code.

In [61]:
import matplotlib.cm as cm
import matplotlib.colors as colors

map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, k_clusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(toronto_final_nonulls['Latitude'], toronto_final_nonulls['Longitude'], toronto_final_nonulls['Neighbourhood'], toronto_final_nonulls['Cluster Labels']):
    label = folium.Popup('Cluster ' + str(int(cluster) +1) + '\n' + str(poi) , parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster-1)],
        fill=True,
        fill_color=rainbow[int(cluster-1)]
        ).add_to(map_clusters)
        
map_clusters
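
The color lookup in the cell above uses `rainbow[int(cluster-1)]`, so label 0 maps to index -1, which Python reads as the last color. The mapping is still one color per cluster, so the map is correct. As a minimal sketch of a more direct alternative, the colors can be indexed by the label itself. Here `k_clusters = 5` is an assumption based on the labels 0 through 4 that appear in the cluster tables.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

k_clusters = 5  # assumed value; the cluster tables show labels 0 through 4

# One evenly spaced color from the rainbow colormap per cluster.
colors_array = cm.rainbow(np.linspace(0, 1, k_clusters))
rainbow = [colors.rgb2hex(c) for c in colors_array]

# Index by the label itself, so cluster 0 gets the first color.
for cluster in [0.0, 1.0, 4.0]:
    print(int(cluster), rainbow[int(cluster)])
```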

<br>
Now we can examine each cluster to get more insight and compare it with the map.

**Cluster 1** [Index 0]

In [62]:
toronto_final_nonulls.loc[toronto_final_nonulls['Cluster Labels'] == 0, toronto_final_nonulls.columns[[1] + list(range(5, toronto_final_nonulls.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,North York,0.0,Intersection,Hockey Arena,Pizza Place,French Restaurant,Coffee Shop,Portuguese Restaurant,Department Store,Dessert Shop,Dim Sum Restaurant,Diner
2,Downtown Toronto,0.0,Coffee Shop,Park,Bakery,Breakfast Spot,Theater,French Restaurant,Performing Arts Venue,Chocolate Shop,Pub,Restaurant
3,North York,0.0,Clothing Store,Accessories Store,Boutique,Gift Shop,Furniture / Home Store,Event Space,Coffee Shop,Women's Store,Vietnamese Restaurant,Airport Terminal
4,Downtown Toronto,0.0,Coffee Shop,Yoga Studio,Diner,Sandwich Place,Portuguese Restaurant,Park,Mexican Restaurant,Italian Restaurant,Hobby Shop,Fried Chicken Joint
7,North York,0.0,Gym,Beer Store,Japanese Restaurant,Coffee Shop,Dim Sum Restaurant,Discount Store,Supermarket,Caribbean Restaurant,Café,Chinese Restaurant
...,...,...,...,...,...,...,...,...,...,...,...,...
97,Downtown Toronto,0.0,Café,Coffee Shop,Restaurant,Seafood Restaurant,Gym,Tea Room,Gastropub,Gym / Fitness Center,Hotel,Concert Hall
98,Etobicoke,0.0,Pool,River,Yoga Studio,Deli / Bodega,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Donut Shop,Dog Run,Dive Bar
99,Downtown Toronto,0.0,Coffee Shop,Beer Bar,Breakfast Spot,Bubble Tea Shop,Burger Joint,Salon / Barbershop,Café,Restaurant,Ramen Restaurant,Pub
100,East Toronto,0.0,Recording Studio,Auto Workshop,Skate Park,Park,Light Rail Station,Burrito Place,Farmers Market,Fast Food Restaurant,Butcher,Restaurant
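
The same selection is repeated for every cluster label in this section. A minimal sketch of a small helper that wraps it is shown here; `show_cluster` and the toy frame are hypothetical names introduced for illustration, and the toy frame only mimics the column order of `toronto_final_nonulls`.

```python
import pandas as pd

def show_cluster(df, label):
    """Return Borough plus the cluster and venue columns for one cluster label."""
    # Same column positions as the cells in this section: 1 is Borough, 5 onward are labels and venues.
    keep = [1] + list(range(5, df.shape[1]))
    return df.loc[df["Cluster Labels"] == label, df.columns[keep]]

# Toy frame with the same column order (values from the tables above).
toy = pd.DataFrame(
    [
        ["M3A", "North York", "Parkwoods", 43.753259, -79.329656, 1.0, "Park"],
        ["M4A", "North York", "Victoria Village", 43.725882, -79.315572, 0.0, "Intersection"],
        ["M1B", "Scarborough", "Malvern, Rouge", 43.806686, -79.194353, 2.0, "Fast Food Restaurant"],
    ],
    columns=["Postal Code", "Borough", "Neighbourhood", "Latitude", "Longitude",
             "Cluster Labels", "1st Most Common Venue"],
)

result = show_cluster(toy, 0)
print(result)
```

With such a helper, each cluster cell would reduce to a call like `show_cluster(toronto_final_nonulls, 1)`.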


**Cluster 2** [Index 1]

In [63]:
toronto_final_nonulls.loc[toronto_final_nonulls['Cluster Labels'] == 1, toronto_final_nonulls.columns[[1] + list(range(5, toronto_final_nonulls.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,North York,1.0,Park,Food & Drink Shop,Yoga Studio,Department Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Donut Shop,Dog Run,Dive Bar
21,York,1.0,Park,Women's Store,Pool,Deli / Bodega,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Donut Shop,Dog Run,Dive Bar
35,East York,1.0,Park,Convenience Store,Intersection,Yoga Studio,Dessert Shop,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Donut Shop
49,North York,1.0,Park,Construction & Landscaping,Bakery,Yoga Studio,Dessert Shop,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Donut Shop
61,Central Toronto,1.0,Park,Swim School,Bus Line,Yoga Studio,Dessert Shop,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Donut Shop
64,York,1.0,Park,Yoga Studio,Department Store,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Donut Shop,Dog Run,Dive Bar
66,North York,1.0,Park,Convenience Store,Yoga Studio,Department Store,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Donut Shop,Dog Run
85,Scarborough,1.0,Playground,Park,Bakery,Intersection,Department Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Donut Shop,Dog Run
91,Downtown Toronto,1.0,Park,Tennis Court,Trail,Playground,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run


**Cluster 3** [Index 2]

In [64]:
toronto_final_nonulls.loc[toronto_final_nonulls['Cluster Labels'] == 2, toronto_final_nonulls.columns[[1] + list(range(5, toronto_final_nonulls.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6,Scarborough,2.0,Fast Food Restaurant,Yoga Studio,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Donut Shop,Dog Run,Dive Bar


**Cluster 4** [Index 3]

In [65]:
toronto_final_nonulls.loc[toronto_final_nonulls['Cluster Labels'] == 3, toronto_final_nonulls.columns[[1] + list(range(5, toronto_final_nonulls.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
57,North York,3.0,Baseball Field,Yoga Studio,Event Space,Escape Room,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Donut Shop,Dog Run
101,Etobicoke,3.0,Construction & Landscaping,Baseball Field,Yoga Studio,Dim Sum Restaurant,Escape Room,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Donut Shop


**Cluster 5** [Index 4]

In [66]:
toronto_final_nonulls.loc[toronto_final_nonulls['Cluster Labels'] == 4, toronto_final_nonulls.columns[[1] + list(range(5, toronto_final_nonulls.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
32,Scarborough,4.0,Playground,Smoke Shop,Jewelry Store,Department Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Donut Shop,Dog Run,Dive Bar
83,Central Toronto,4.0,Playground,Trail,Deli / Bodega,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Donut Shop,Dog Run,Dive Bar


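It looks like the clustering was performed successfully! Cluster 1 is by far the largest, and the rows shown for it mostly feature coffee shops and restaurants. Cluster 2 groups nine neighborhoods where parks rank first or second, and Clusters 3, 4, and 5 contain only one or two neighborhoods each.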
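
To back up this impression with numbers, we could count the rows per cluster and look at the most frequent top venue in each. The sketch below does this on a small sample copied from the cluster tables above. The frame `sample` is a hypothetical subset, so its counts are not the real cluster sizes; on the full data the same two lines would be run on `toronto_final_nonulls`.

```python
import pandas as pd

# A few (label, top venue) pairs copied from the cluster tables above.
sample = pd.DataFrame({
    "Cluster Labels": [0, 0, 0, 1, 1, 1, 2, 3, 4, 4],
    "1st Most Common Venue": [
        "Coffee Shop", "Coffee Shop", "Clothing Store",
        "Park", "Park", "Playground",
        "Fast Food Restaurant",
        "Baseball Field",
        "Playground", "Playground",
    ],
})

# Rows per cluster, and the most frequent top venue within each cluster.
sizes = sample["Cluster Labels"].value_counts().sort_index()
top_venue = sample.groupby("Cluster Labels")["1st Most Common Venue"].agg(
    lambda s: s.mode().iloc[0]
)

print(sizes)
print(top_venue)
```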