<h1> DATA RPOCESSING: TORONTO NEIGHBOURHOODS </h1>

<h3>Part one: obtain and cleand data</h3>

In this section data should be imported from Wikipedia and cleaned to satisfy these criteria:
<ul><li>
    The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood</li>
<li>Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.</li>
    
<li>More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.</li>
    
<li>If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.</li>
</ul>

In [150]:
#pip install geocoder

In [151]:
#pip install folium

In [152]:
import pandas as pd
import numpy as np
import geocoder

Import data and exclude rows where Borough is not assigned

In [153]:
x = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]

x = x[x.Borough != "Not assigned"]
x.reset_index(inplace = True)
x.head()

Unnamed: 0,index,Postcode,Borough,Neighbourhood
0,2,M3A,North York,Parkwoods
1,3,M4A,North York,Victoria Village
2,4,M5A,Downtown Toronto,Harbourfront
3,5,M6A,North York,Lawrence Heights
4,6,M6A,North York,Lawrence Manor


For neigbourhoods without a name (not assigned) name of Bourough should be assigned:

In [154]:
#If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
for i in range(0, x.shape[0]):
    if x.Neighbourhood[i]=="Not assigned":
        x.Neighbourhood[i]=x.Borough[i]

In order to merge all Neigbourhoods with a comma for the same postal codes, let's firstly sort data by Postal Code and then iterate by rows. If postal code of row is same with previous row value then neigbourhood should be concatenated with previous ones. <br> Also let's introduce new column Keep = 0 and if current postal code is not equal to next one, then keep = 1. Finally, keep only rows where corresponding column equal to 1 and final row. Then drop keep columns as it's not needed column.

In [155]:
#sort df by postal code
x.initial_shape = range(0, x.shape[0])
x.sort_values(by = "Postcode")

x['Keep'] = 0



x.shape
#loop trough dataframe to append Neighborhoods for the same Postal Codes
#Assign 1 in column keep for rows which hold the last row for each Postal Code, as it contains all the neighbourhoods
for i in range(1, x.shape[0]):
    if x.Postcode[i]==x.Postcode[i-1]:
        x.Neighbourhood[i] = x.Neighbourhood[i-1] + ", " + x.Neighbourhood[i] 
    if i<x.shape[0]-1 and x.Postcode[i]!=x.Postcode[i+1]:
        x.Keep[i]=1
#As loop always look at i+1 the last row was uncovered. Last row always should be kept as it always last row of a Postal COde.        
x.Keep[x.shape[0]]=1   

#keep only rows with keep == 1, so there would be rows with unique postal codes and all their neighbourhoods.
x = x[x.Keep == 1]
x.reset_index(inplace = True)
x.drop(['Keep', 'index', 'level_0'], axis = 1, inplace = True)    
    
#Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
#In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.
x.head()


  
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M4A,North York,Victoria Village
1,M5A,Downtown Toronto,Harbourfront
2,M6A,North York,"Lawrence Heights, Lawrence Manor"
3,M7A,Queen's Park,Queen's Park
4,M9A,Downtown Toronto,Queen's Park


<h3>PART TWO: OBTAIN FINAL DATASET WITH COORDINATED</h3>

<p>I honestly tried to obtain coordinates using geocoder, but request is denied. After googling (what an irony) the problem, it could be because some additional parameters are not defiend. Nevertheless, this package can sometimes returns NA, so let's use prepared file.</p>

In [156]:
g = geocoder.google('Mountain View, CA')
g

<[REQUEST_DENIED] Google - Geocode [empty]>

So here is the coordinated file:

In [157]:
coords_file = pd.read_csv("https://cocl.us/Geospatial_data")
coords_file.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


So here each row stand for Postal Code and coordinates. We can do left join by postal code:

In [158]:
x = pd.merge(x, coords_file, left_on = 'Postcode', right_on = 'Postal Code', how = 'left')
x.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Postal Code,Latitude,Longitude
0,M4A,North York,Victoria Village,M4A,43.725882,-79.315572
1,M5A,Downtown Toronto,Harbourfront,M5A,43.65426,-79.360636
2,M6A,North York,"Lawrence Heights, Lawrence Manor",M6A,43.718518,-79.464763
3,M7A,Queen's Park,Queen's Park,M7A,43.662301,-79.389494
4,M9A,Downtown Toronto,Queen's Park,M9A,43.667856,-79.532242


<h3>PART THREE: CLUSTERIZATOIN </h3>

Let's use functions prepared in the course Lab to obtain all the venues for given neigbourhoods and obtain top-10 categories. Then investigate given results to build clusters.

In [159]:
import requests # library to handle requests

In [160]:

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [129]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

If you would like to run the code yourself, insert your Foursquare API credentials:

In [130]:
CLIENT_ID = '' # your Foursquare ID
CLIENT_SECRET = '' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100

In [131]:
# type your answer here

toronto_venues = getNearbyVenues(names=x['Neighbourhood'],
                                   latitudes=x['Latitude'],
                                   longitudes=x['Longitude']
                                  )

So we got now final dataset:

In [161]:
toronto_venues.head()

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
1,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop
2,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Restaurant
3,Victoria Village,43.725882,-79.315572,The Frig,43.727051,-79.317418,Restaurant
4,Harbourfront,43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery


In [162]:
toronto_venues.shape

(2225, 7)

We can see that there are more than 2000 venues. To what category do they belong?

In [163]:
toronto_venues['Venue Category'].value_counts()

Restaurant                 512
Coffee Shop                281
Park                        53
Pizza Place                 52
Bakery                      51
Sandwich Place              42
Hotel                       42
Bar                         42
Clothing Store              37
Gym                         32
Pub                         25
Burger Joint                25
Breakfast Spot              24
Gastropub                   23
Steakhouse                  22
Diner                       21
Grocery Store               21
Beer Bar                    20
Pharmacy                    20
Furniture / Home Store      18
Bookstore                   18
Bank                        18
Fried Chicken Joint         18
Cosmetics Shop              17
Ice Cream Shop              17
Gym / Fitness Center        17
Deli / Bodega               16
Tea Room                    16
Dessert Shop                16
Liquor Store                15
                          ... 
Airport Gate                 1
Hospital

Most common places are caffes and coffe shops, as well as another restaurants, bars, i.e. places to have meal and rest. Let's take a look of all unique places types:

In [164]:
toronto_venues['Venue Category'].unique()

array(['Hockey Arena', 'Coffee Shop', 'Restaurant', 'Bakery',
       'Gym / Fitness Center', 'Spa', 'Breakfast Spot', 'Park',
       'Historic Site', 'Pub', 'Chocolate Shop', 'Farmers Market',
       'Dessert Shop', 'Performing Arts Venue', 'Theater', 'Event Space',
       'Art Gallery', 'Ice Cream Shop', 'Shoe Store', 'Cosmetics Shop',
       'Brewery', 'Electronics Store', 'Bank', 'Beer Store', 'Hotel',
       'Health Food Store', 'Antique Shop', 'Boutique',
       'Furniture / Home Store', 'Clothing Store', 'Accessories Store',
       'Fraternity House', 'Carpet Store', 'Miscellaneous Shop', 'Gym',
       'Creperie', 'Burrito Place', 'Yoga Studio', 'Hobby Shop',
       'Arts & Crafts Store', 'Diner', 'Beer Bar', 'Wings Joint',
       'Burger Joint', 'Nightclub', 'Fried Chicken Joint',
       'Smoothie Shop', 'Sandwich Place', 'College Auditorium', 'Bar',
       'Music Venue', 'Print Shop', 'Gastropub', 'Pharmacy',
       'Pizza Place', 'Intersection', 'Bus Line', 'Athletics & Sports

We can see that there are a lot of types of places: theaters, gyms, airport infrastructure and various stores. Let's merge together Caffe and Coffe Shops in one group, and merge all restaurants to another to make clusterization more specific to types of objects (i.e. restaurant, gym, .etc) instead of slight difference among some of them (i.e. different types of restaurants)

In [165]:
toronto_venues.replace({'Venue Category': r'Caf√©*'}, {'Venue Category': 'Coffee Shop'}, regex=True, inplace = True)
toronto_venues.replace({'Venue Category': r'.+Restaurant'}, {'Venue Category': 'Restaurant'}, regex=True, inplace = True)
toronto_venues['Venue Category'].value_counts()

Restaurant                 512
Coffee Shop                281
Park                        53
Pizza Place                 52
Bakery                      51
Sandwich Place              42
Hotel                       42
Bar                         42
Clothing Store              37
Gym                         32
Pub                         25
Burger Joint                25
Breakfast Spot              24
Gastropub                   23
Steakhouse                  22
Diner                       21
Grocery Store               21
Beer Bar                    20
Pharmacy                    20
Furniture / Home Store      18
Bookstore                   18
Bank                        18
Fried Chicken Joint         18
Cosmetics Shop              17
Ice Cream Shop              17
Gym / Fitness Center        17
Deli / Bodega               16
Tea Room                    16
Dessert Shop                16
Liquor Store                15
                          ... 
Airport Gate                 1
Hospital

After some reshaping of vanues let's create dataframe with top-10 venues in columns for each area in rows and their quantity as values:

In [184]:
#toronto_venues = toronto_venues.groupby(['Neighborhood', 'Venue Category'], as_index =False).agg({'Venue':'count'})

# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighbourhood'] = toronto_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_grouped = toronto_onehot.groupby('Neighbourhood').mean().reset_index()


num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighbourhood'] = toronto_grouped['Neighbourhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)


neighborhoods_venues_sorted.head()


Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, King, Richmond",Restaurant,Coffee Shop,Steakhouse,Burger Joint,Salad Place,Bar,Bakery,Concert Hall,Cosmetics Shop,Hotel
1,Agincourt,Skating Rink,Restaurant,Lounge,Clothing Store,Breakfast Spot,Donut Shop,Dog Run,Discount Store,Diner,Dessert Shop
2,"Agincourt North, L'Amoreaux East, Milliken, St...",Park,Playground,Arts & Crafts Store,Yoga Studio,Creperie,Donut Shop,Dog Run,Discount Store,Diner,Dessert Shop
3,"Albion Gardens, Beaumond Heights, Humbergate, ...",Grocery Store,Video Store,Fried Chicken Joint,Sandwich Place,Pizza Place,Restaurant,Pharmacy,Beer Store,Diner,Creperie
4,"Alderwood, Long Branch",Pizza Place,Coffee Shop,Sandwich Place,Skating Rink,Pharmacy,Gym,Pub,Dance Studio,Curling Ice,Deli / Bodega


In [186]:
import folium
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors

In [187]:
kclusters = 4

toronto_grouped_clustering = toronto_grouped.drop('Neighbourhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = x.join(neighborhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')
toronto_merged['Cluster Labels'].replace(np.nan, 4, inplace = True)

toronto_merged.head() # check the last columns!

Unnamed: 0,Postcode,Borough,Neighbourhood,Postal Code,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4A,North York,Victoria Village,M4A,43.725882,-79.315572,3.0,Restaurant,Hockey Arena,Coffee Shop,College Gym,Cupcake Shop,Drugstore,Donut Shop,Dog Run,Discount Store,Diner
1,M5A,Downtown Toronto,Harbourfront,M5A,43.65426,-79.360636,3.0,Coffee Shop,Restaurant,Park,Bakery,Pub,Theater,Breakfast Spot,Hotel,Spa,Beer Store
2,M6A,North York,"Lawrence Heights, Lawrence Manor",M6A,43.718518,-79.464763,2.0,Furniture / Home Store,Clothing Store,Accessories Store,Boutique,Fraternity House,Miscellaneous Shop,Carpet Store,Restaurant,Coffee Shop,Aquarium
3,M7A,Queen's Park,Queen's Park,M7A,43.662301,-79.389494,3.0,Coffee Shop,Restaurant,Diner,Park,Gym,Yoga Studio,Hobby Shop,College Auditorium,Beer Bar,Smoothie Shop
4,M9A,Downtown Toronto,Queen's Park,M9A,43.667856,-79.532242,3.0,Coffee Shop,Restaurant,Diner,Park,Gym,Yoga Studio,Hobby Shop,College Auditorium,Beer Bar,Smoothie Shop


So let's plot out clusters on the map:

In [195]:
# create map
map_clusters = folium.Map(location=[43.725882,-79.315572], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster-1)],
        fill=True,
        fill_color=rainbow[int(cluster-1)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters