




---

<a id="0"></a>
# Applied Data Science Capstone Week 3 Assignment: <br> Segmenting and Clustering Neighborhoods in Toronto

---


## Table of Contents:

1. [Create *List of postal codes of Canada: M* Dataframe](#1.-Create-%22List-of-postal-codes-of-Canada:-M%22-Dataframe:)

2. [Get the latitude and the longitude coordinates of each neighborhood](#2.-Get-the-latitude-and-the-longitude-coordinates-of-each-neighborhood:)

3. [Explore and cluster the neighborhoods in Toronto](#3.-Explore-and-cluster-the-neighborhoods-in-Toronto)


---

## 1. Create "*List of postal codes of Canada: M*" Dataframe:

* The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
* Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
* More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
* If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
* Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
* In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.
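
The Wikipedia table read further down already lists one row per postal code, so the combining rule above never comes into play in this notebook. For a table that still had one row per neighborhood, a minimal sketch of that rule could look like the following. The sample frame is built by hand for illustration: the M5A rows follow the example in the text above, and M4A is added to show a postal code with a single neighborhood.

```python
import pandas as pd

# Hand-built rows for illustration only
raw = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M4A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Victoria Village"],
})

# One row per postal code: join the neighborhood names with a comma
combined = (
    raw.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
       .agg(", ".join)
       .reset_index()
)
print(combined)
```

Grouping on both PostalCode and Borough keeps the borough column in the result without needing a separate aggregation for it.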

Importing pandas and numpy:

In [4]:
import pandas as pd
import numpy as np

Use pandas to read the table of postal codes from the Wikipedia page "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M". `pd.read_html` returns a list of dataframes, one per table found on the page, so the postal code table is taken from the first element:

In [5]:
df = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")

In [6]:
df[0].head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


Preparing the cells for further cleaning by replacing "**Not assigned**" with "**NaN**":

In [7]:
df_post = df[0].replace("Not assigned",np.nan)
df_post.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,,
1,M2A,,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


Postal codes such as "**M1A**" and "**M2A**" have neither a Borough nor a Neighborhood assigned, so they aren't needed.

Below, "**Not assigned**" (now "**NaN**") Neighborhood values are replaced with the Borough name
(although the Wikipedia table currently has no "**Not assigned**" Neighborhood with an assigned Borough).

Rows with a "Not assigned" ("**NaN**") value in the Borough column are then dropped.

In [8]:
# Replace Not assigned (now NaN) Neighborhoods with the Borough name of the same row (the table currently has no such rows)
df_post['Neighborhood'] = df_post['Neighborhood'].fillna(df_post['Borough'])
# Drop rows whose Borough is Not assigned (now NaN)
df_post = df_post.dropna(subset=['Borough']).reset_index(drop=True)
df_post.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
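
Since the Wikipedia table has no row with an assigned Borough and a "Not assigned" Neighborhood, the fallback rule is never exercised on the real data. A small sketch on hand-built rows (M9Z is a hypothetical postal code added for the purpose) shows what the two cleaning steps do when such a row is present. It uses `fillna`, which fills each missing value from the Borough of the same row:

```python
import numpy as np
import pandas as pd

# Hand-built rows: M9Z is a hypothetical code with a Borough but no Neighborhood
sample = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M9Z"],
    "Borough": ["Not assigned", "North York", "Etobicoke"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Not assigned"],
}).replace("Not assigned", np.nan)

# Missing Neighborhood values take the Borough from the same row (index-aligned)
sample["Neighborhood"] = sample["Neighborhood"].fillna(sample["Borough"])
# Rows that still have no Borough are dropped
sample = sample.dropna(subset=["Borough"]).reset_index(drop=True)
print(sample)
```

The order matters little here, but filling first means the Borough column is still complete when it is used as the source of the fallback values.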


Here, the "***.shape***" attribute is used to show the number of rows and columns of the dataframe:

In [9]:
df_post.shape

(103, 3)

---

## 2. Get the latitude and the longitude coordinates of each neighborhood:               

Now that we have built a dataframe of the postal code of each neighborhood along with the borough name and neighborhood name, in order to utilize the Foursquare location data, we need to get the latitude and the longitude coordinates of each neighborhood.


In [10]:
#!conda install -c conda-forge geopy --yes # uncomment if the geopy library isn't installed yet
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

!pip install pgeocode # comment out if the pgeocode library is already installed
import pgeocode

!pip install folium # comment out if the folium library is already installed       # !conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library

import json # library to handle JSON files
import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (needs pandas >= 1.0; pandas.io.json.json_normalize is deprecated)
print('Libraries imported.')

Collecting pgeocode
  Downloading https://files.pythonhosted.org/packages/86/44/519e3db3db84acdeb29e24f2e65991960f13464279b61bde5e9e96909c9d/pgeocode-0.2.1-py2.py3-none-any.whl
Installing collected packages: pgeocode
Successfully installed pgeocode-0.2.1
Collecting folium
  Downloading https://files.pythonhosted.org/packages/a4/f0/44e69d50519880287cc41e7c8a6acc58daa9a9acf5f6afc52bcc70f69a6d/folium-0.11.0-py2.py3-none-any.whl (93kB)
Collecting branca>=0.3.0 (from folium)
  Downloading https://files.pythonhosted.org/packages/13/fb/9eacc24ba3216510c6b59a4ea1cd53d87f25ba76237d7f4393abeaf4c94e/branca-0.4.1-py3-none-any.whl
Installing collected packages: branca, folium
Successfully installed branca-0.4.1 folium-0.11.0
Libraries imported.


Now, we'll create two functions to get coordinates: one with the ***geopy*** library and the other with the ***pgeocode*** library:

In [11]:
def getCoordinatesGeocode(postal_code, borough, neighborhood, user_agent = 'geocode'):
    loc = None    # initialize variable to None
    # loop until you get the coordinates
    while(loc is None):
        geolocator = Nominatim(user_agent=user_agent)
        loc = geolocator.geocode('{}, {}, {}'.format(postal_code, borough, neighborhood))
    return loc.latitude, loc.longitude

def getCoordinatesPGeocode(postal_code, country_code = 'CA'):
    geolocator = pgeocode.Nominatim(country_code)
    loc = geolocator.query_postal_code(postal_code)
    return loc['latitude'], loc['longitude']
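
The `while` loop in `getCoordinatesGeocode` never ends if the service keeps returning `None` for an address. A bounded variant could look like the sketch below. The names `geocode_with_retries`, `max_attempts` and `pause_seconds` are hypothetical, and the lookup is passed in as a plain callable (for example `Nominatim(user_agent=...).geocode`) so that the retry logic can be tried without network access:

```python
import time

def geocode_with_retries(geocode, address, max_attempts=3, pause_seconds=1.0):
    """Call geocode(address) up to max_attempts times; return (lat, lon) or (None, None)."""
    for attempt in range(max_attempts):
        loc = geocode(address)
        if loc is not None:
            return loc.latitude, loc.longitude
        if attempt < max_attempts - 1:
            time.sleep(pause_seconds)  # wait a little before asking again
    return None, None

# Stand-in for a geocoder that never finds the address
def never_found(address):
    return None

print(geocode_with_retries(never_found, 'M4A, North York, Victoria Village', pause_seconds=0))
```

Returning `(None, None)` instead of looping lets the caller decide what to do with postal codes that cannot be resolved. Exceptions raised by the lookup itself are not handled in this sketch.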

In [12]:
# A function using the geopy library was written, but since it can't retrieve information for some postal codes (e.g. M4A), it was replaced by the pgeocode library
#postal_code, borough, neighborhood = list(df_post["Postal Code"].head(2)), list(df_post["Borough"].head(2)), list(df_post["Neighborhood"].head(2))
#latitude, longitude = list(map(getCoordinatesGeocode, postal_code, borough, neighborhood))

# Getting coordinates with function created with pgeocode library
latitude, longitude = getCoordinatesPGeocode(list(df_post["Postal Code"]))
df_post['latitude'], df_post['longitude'] = latitude, longitude
df_post.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,latitude,longitude
0,M3A,North York,Parkwoods,43.7545,-79.33
1,M4A,North York,Victoria Village,43.7276,-79.3148
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6555,-79.3626
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.7223,-79.4504
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.6641,-79.3889


### Getting the latitude and the longitude coordinates of each neighborhood from IBM's csv:             

Since those packages proved unreliable, the coordinates are taken instead from IBM's csv file, which holds the geographical coordinates of each postal code (http://cocl.us/Geospatial_data), to work with reliable values.

In [13]:
!wget -q -O 'Geospatial_Coordinates.csv' https://cocl.us/Geospatial_data
geoCoordinates_df = pd.read_csv('Geospatial_Coordinates.csv')
geoCoordinates_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [14]:
df_post = df_post.drop(columns=['latitude', 'longitude'])
df_post = pd.merge(df_post, geoCoordinates_df,
                        how="left", on=["Postal Code"])
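
A left merge keeps every row of `df_post` even when the CSV has no matching postal code, in which case the coordinates come back as NaN. One way to check for that is the `indicator` option of `pd.merge`, sketched here on small hand-built frames (M0X is a made-up code that has no match on purpose):

```python
import pandas as pd

left = pd.DataFrame({"Postal Code": ["M3A", "M4A", "M0X"]})  # M0X is a made-up code
right = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# indicator=True adds a _merge column telling where each row was found
checked = pd.merge(left, right, how="left", on="Postal Code", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print("Postal codes without coordinates:", unmatched)
```

An empty list would mean that every postal code received coordinates.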

In [15]:
df_post.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
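
To get a feel for how far the pgeocode coordinates are from the ones in the CSV file, the great-circle distance between the two can be computed for rows shown in the two tables above. This is a minimal sketch using the haversine formula with a mean Earth radius of 6371 km (an approximation); the function name `haversine_km` is made up here:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2, earth_radius_km=6371.0):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(a))

# (postal code, pgeocode lat/lon, CSV lat/lon) copied from the two tables above
pairs = [
    ("M3A", (43.7545, -79.33), (43.753259, -79.329656)),
    ("M6A", (43.7223, -79.4504), (43.718518, -79.464763)),
]
distances = {code: haversine_km(*a, *b) for code, a, b in pairs}
for code, km in distances.items():
    print(code, round(km, 2), "km")
```

With a venue search radius of 500 meters, an offset of this size can decide which venues are returned for a neighborhood, so it matters which source is used.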


Let's check the size of the data set:

In [16]:
df_post.shape

(103, 5)

---

## 3. Explore and cluster the neighborhoods in Toronto

In this task, we can choose to work only with boroughs whose name contains the word Toronto and then replicate the analysis we did on the New York City data.

To be done:

* add enough Markdown cells to explain what was decided and to report any observations made.
* generate maps to visualize the neighborhoods and how they cluster together.

First, we get the latitude and longitude of Toronto:

In [17]:
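# Getting Toronto's latitude and longitude and preparing the URL to call Foursquare
# CLIENT_ID, CLIENT_SECRET and VERSION are assumed to be defined in a credentials cell that is not shown here
radius = 500 # search radius in meters
LIMIT = 100 # maximum number of venues to return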
latitude, longitude = getCoordinatesGeocode('CA', 'Toronto', '') 
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    latitude, 
    longitude, 
    radius, 
    LIMIT)
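
The same URL can also be assembled from a dictionary of query parameters, which keeps each parameter name next to its value. This is only a sketch with the standard library: the credential strings and the version date are placeholders rather than real values, and the coordinates are illustrative.

```python
from urllib.parse import urlencode

params = {
    "client_id": "YOUR_CLIENT_ID",          # placeholder
    "client_secret": "YOUR_CLIENT_SECRET",  # placeholder
    "v": "20200601",                        # placeholder API version date
    "ll": "43.6532,-79.3853",               # illustrative latitude,longitude
    "radius": 500,
    "limit": 100,
}
# safe="," keeps the comma in the ll parameter readable instead of encoding it as %2C
example_url = "https://api.foursquare.com/v2/venues/explore?" + urlencode(params, safe=",")
print(example_url)
```

`requests.get` also accepts such a dictionary through its `params` argument, which avoids building the string by hand.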

Now, we explore Toronto with Foursquare within a radius of 500 meters, limiting the results to 100 venues:

In [18]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5ee9f6617d58314072603628'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Bay Street Corridor',
  'headerFullLocation': 'Bay Street Corridor, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 76,
  'suggestedBounds': {'ne': {'lat': 43.6579817045, 'lng': -79.37772678059432},
   'sw': {'lat': 43.6489816955, 'lng': -79.39014261940568}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '5227bb01498e17bf485e6202',
       'name': 'Downtown Toronto',
       'location': {'lat': 43.65323167517444,
        'lng': -79.38529600606677,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.65323167517444,
          'lng'

We'll prepare a function to plot a map with spots:


In [19]:
def plotMap(main_latitude, main_longitude, df_Latitude, df_Longitude, df_Borough, df_Neighborhood, fillColor, zoom_start = 10):
    # create map of Toronto using latitude and longitude values
    map1 = folium.Map(location=[main_latitude, main_longitude], zoom_start=zoom_start)
    # add markers to map
    for lat, lng, borough, neighborhood in zip(df_Latitude, df_Longitude, df_Borough, df_Neighborhood):
        if pd.isna(lat) or pd.isna(lng):
            print('Geo Coordinates not found for', lat, lng, borough, neighborhood)
        else:
            label = '{}, {}'.format(neighborhood, borough)
            label = folium.Popup(label, parse_html=True)
            folium.CircleMarker(
                [lat, lng],
                radius=5,
                popup=label,
                color='blue',
                fill=True,
                fill_color=fillColor,
                fill_opacity=0.7,
                parse_html=False).add_to(map1)  

    return map1

Let's take a look at Toronto's neighborhoods:

In [20]:
plotMap(latitude, longitude, df_post['Latitude'], df_post['Longitude'], df_post['Borough'], df_post['Neighborhood'], '#3186cc',11)



Now, let's borrow a function from the Foursquare lab that extracts the category of the venue:

In [21]:
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
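
To see what the function returns, it can be applied to hand-written rows. The rows below are made up and only mimic the shape of a flattened Foursquare venue record; the function is repeated in compact form so that the example runs on its own:

```python
import pandas as pd

def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
    return categories_list[0]['name'] if len(categories_list) > 0 else None

# Made-up rows shaped like flattened Foursquare items; the second has no category
rows = pd.DataFrame({
    'venue.name': ['Some Plaza', 'Unlabeled Spot'],
    'venue.categories': [[{'name': 'Plaza'}], []],
})
category_names = rows.apply(get_category_type, axis=1).tolist()
print(category_names)
```

The first row yields the name of its first category, while the row with an empty category list yields a missing value.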

We will borrow one more function, to repeat the same process of exploration in a specific Borough:

In [22]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
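
`getNearbyVenues` indexes `v['venue']['categories'][0]` directly, which raises an `IndexError` for a venue that has no category. A defensive variant of the row-building step might look like this sketch; `items_to_rows` is a hypothetical helper and the items are made up to mimic the response structure:

```python
import pandas as pd

def items_to_rows(name, lat, lng, items):
    """Turn the 'items' list of one explore response into rows for the venues table."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        category = categories[0]['name'] if categories else None  # tolerate missing categories
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows

# Made-up items; the second one has no category
items = [
    {'venue': {'name': 'Venue A', 'location': {'lat': 43.71, 'lng': -79.36},
               'categories': [{'name': 'Gym'}]}},
    {'venue': {'name': 'Venue B', 'location': {'lat': 43.70, 'lng': -79.35},
               'categories': []}},
]
columns = ['Neighborhood', 'Neighborhood Latitude', 'Neighborhood Longitude',
           'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category']
sample_venues = pd.DataFrame(items_to_rows('Leaside', 43.70906, -79.363452, items),
                             columns=columns)
print(sample_venues)
```

Passing `columns=` to the constructor also avoids an error when assigning column names to an empty dataframe, which would happen if no venue were returned at all.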

Now we need to clean the JSON and structure it into a dataframe:

In [23]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]
print('A total of {} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

nearby_venues.head()

A total of 76 venues were returned by Foursquare.


Unnamed: 0,name,categories,lat,lng
0,Downtown Toronto,Neighborhood,43.653232,-79.385296
1,Nathan Phillips Square,Plaza,43.65227,-79.383516
2,Chatime 日出茶太,Bubble Tea Shop,43.655542,-79.384684
3,Textile Museum of Canada,Art Museum,43.654396,-79.3865
4,Indigo,Bookstore,43.653515,-79.380696
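
`json_normalize` flattens nested dictionaries into columns whose names join the nested keys with a dot, which is why the filtered columns are called `venue.name`, `venue.location.lat` and so on. A small sketch on a made-up item (assuming pandas 1.0 or later, where the function is available as `pd.json_normalize`):

```python
import pandas as pd

# Made-up item mimicking one entry of results['response']['groups'][0]['items']
items = [{'venue': {'name': 'Some Plaza',
                    'location': {'lat': 43.65, 'lng': -79.38},
                    'categories': [{'name': 'Plaza'}]}}]

flat = pd.json_normalize(items)
print(sorted(flat.columns))
```

Lists such as `categories` are not expanded and stay as a single cell, which is what `get_category_type` then reduces to a category name.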


East York seems like a good place: not too crowded and relatively close to Downtown. Let's explore it:

In [24]:
df_EastYork = df_post[df_post['Borough'] == 'East York'].reset_index(drop=True)
df_EastYork.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
1,M4C,East York,Woodbine Heights,43.695344,-79.318389
2,M4G,East York,Leaside,43.70906,-79.363452
3,M4H,East York,Thorncliffe Park,43.705369,-79.349372
4,M4J,East York,"East Toronto, Broadview North (Old East York)",43.685347,-79.338106


In [25]:
eastYork_venues = getNearbyVenues(names=df_EastYork['Neighborhood'],
                                   latitudes=df_EastYork['Latitude'],
                                   longitudes=df_EastYork['Longitude']
                                  )

Parkview Hill, Woodbine Gardens
Woodbine Heights
Leaside
Thorncliffe Park
East Toronto, Broadview North (Old East York)


We'll take a look at the "Leaside" neighborhood:

In [26]:
df_EastYork.loc[2, 'Neighborhood']

'Leaside'

In [27]:
latitude, longitude = df_EastYork.loc[2, 'Latitude'], df_EastYork.loc[2, 'Longitude']
url2 = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    latitude, 
    longitude, 
    radius, 
    LIMIT)

In [28]:
results_EastYork = requests.get(url2).json()
results_EastYork

{'meta': {'code': 200, 'requestId': '5ee9f5fbb1a23637626b4f1a'},
 'response': {'headerLocation': 'Leaside',
  'headerFullLocation': 'Leaside, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 32,
  'suggestedBounds': {'ne': {'lat': 43.7135604045, 'lng': -79.3572380270639},
   'sw': {'lat': 43.704560395499996, 'lng': -79.3696653729361}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '5531956d498e24c6e9994f2e',
       'name': 'Local Leaside',
       'location': {'address': '180 Laird Dr',
        'lat': 43.71001166793114,
        'lng': -79.36351433524794,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.71001166793114,
          'lng': -79.36351433524794}],
        'distance': 106,
        'postalCode': 'M4G 3V7',
        'cc':

In [29]:
venues = results_EastYork['response']['groups'][0]['items']
nearby_venues = json_normalize(venues) # flatten JSON
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]
nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Local Leaside,Sports Bar,43.710012,-79.363514
1,Rack Attack,Sporting Goods Shop,43.706934,-79.362261
2,Olde Yorke Fish & Chips,Fish & Chips Shop,43.706141,-79.361829
3,CrossFit Toronto,Gym,43.7081,-79.35906
4,LCBO,Liquor Store,43.710571,-79.360287


In [30]:



# plotMap(*getCoordinatesGeocode('CA', 'Toronto', 'East York'), df_EastYork['Latitude'], df_EastYork['Longitude'], df_EastYork['Borough'], df_EastYork['Neighborhood'], '#31cc86')


In [31]:
neighborhood_latitude = df_EastYork.loc[2, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = df_EastYork.loc[2, 'Longitude'] # neighborhood longitude value
neighborhood_name = df_EastYork.loc[2, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

In [None]:
# getNearbyVenues zips over its arguments, so single values are wrapped in lists
getNearbyVenues([neighborhood_name], [neighborhood_latitude], [neighborhood_longitude])



eastYork_venues = getNearbyVenues(names=df_EastYork['Neighborhood'],
                                   latitudes=df_EastYork['Latitude'],
                                   longitudes=df_EastYork['Longitude']
                                  )

In [None]:
df_post.iloc[:3]
# numpp(df_post["Postal Code"].head(2))  # left commented out: numpp is not defined anywhere in this notebook