# A Place for Video Game Developers

## Business Problem

Toronto is an international centre for business and finance and is generally considered the financial capital of Canada. The city is an important centre for the media, publishing, telecommunication, information technology and film production industries. Although much of the region's manufacturing takes place outside the city limits, Toronto continues to be a wholesale and distribution point for the industrial sector. The city's strategic position along the Quebec City–Windsor Corridor and its road and rail connections help support the nearby production of motor vehicles, iron, steel, food, machinery, chemicals and paper. The completion of the Saint Lawrence Seaway in 1959 gave ships access to the Great Lakes from the Atlantic Ocean.

There has recently been substantial interest in the emergence of video game development as an industry in Canada: its impact on the economy and the creative industries, the role studios play in specific city ecosystems, and how video games affect people physically and mentally. For example, a recent study at McMaster University examined how playing video games improves the eyesight of people with vision problems. Toronto, Montreal and Quebec are particularly popular subjects of study due to the maturity of their gaming industries and their overall urban ecology.

Therefore, finding space for enough people to work in the industry, or to get started in it, requires a selection of places where they can share and build a network. Since finding new places can be overwhelming, I decided to turn to an office / coworking space locator in order to get to the right place.

## Methodology

We will use k-means clustering to segment and cluster Toronto neighborhoods and understand their similarity. With that understanding, we will be able to recommend a suitable place. Such a location would be near universities, media agencies, tech startups and coworking spaces.

We will use the following data:

- The list of Toronto boroughs and neighborhoods to explore, segment, and cluster: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
- Toronto's sociodemographic data: https://en.wikipedia.org/wiki/Demographics_of_Toronto_neighbourhoods
- Information on venues in Toronto extracted from Foursquare.com

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import csv
import json

import os
from dotenv import load_dotenv
load_dotenv()

CLIENT_ID = os.environ["CLIENT_ID"]
CLIENT_SECRET = os.environ["CLIENT_SECRET"]
VERSION = os.environ["VERSION"]

VERSION_2 = os.environ["VERSION_2"]

from geopy.geocoders import Nominatim 
import numpy as np
# JSON responses are flattened later with pd.json_normalize, so no separate import is needed

# Visualisation
import matplotlib.cm as cm
import matplotlib.colors as colores
import folium 


#Modeling
from sklearn.cluster import KMeans

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

# The page is fetched and parsed inside url_par below

def url_par(url):
    """Fetch a page and return its first 'wikitable sortable' table as a DataFrame."""
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    
    # We search for the table that stores the info we want inside the class "wikitable sortable"
    for table in soup.find_all('table', class_="wikitable sortable"):
        n_columns = 0
        n_rows=0
        column_names = []
        
        for row in table.find_all('tr'):
            td_tags = row.find_all('td')
            if len(td_tags) > 0:
                n_rows+=1
                if n_columns == 0:
                    n_columns = len(td_tags)
                        
            th_tags = row.find_all('th') 
            if len(th_tags) > 0 and len(column_names) == 0:
                for th in th_tags:
                    column_names.append(th.get_text())
    
        if len(column_names) > 0 and len(column_names) != n_columns:
            raise Exception("Column titles != number columns")
    
        columns = column_names if len(column_names) > 0 else range(0,n_columns)
        
        df = pd.DataFrame(columns = columns, index= range(0,n_rows))
        row_marker = 0
       
        for row in table.find_all('tr'):
            column_marker = 0
            columns = row.find_all('td')
            for column in columns:
                df.iat[row_marker,column_marker] = column.get_text()
                column_marker += 1
            if len(columns) > 0:
                row_marker += 1
                    
        for col in df:
            try:
                df[col] = df[col].astype(float)
                
            except ValueError:
                pass
            
        return df

def cleanup(df):
    df = df[df.Borough != 'Not assigned']
    df = df[df['Neighbourhood\n'] != 'Not assigned']

    df = df.replace('\n',' ', regex=True)
    return df

In [3]:
table_init = url_par(url)
df_fin = cleanup(table_init)
df_fin.head()

| | Postcode | Borough | Neighbourhood\n |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M6A | North York | Lawrence Heights |
| 6 | M6A | North York | Lawrence Manor |


In [4]:
df = df_fin.groupby(['Postcode','Borough'])['Neighbourhood\n'].apply(lambda x: ", ".join(x.astype(str))).reset_index()
df_final = df.sample(frac=1).reset_index(drop=True)

print("The dataframe shape is: ",df_final.shape)
display(df_final.head(10))

The dataframe shape is:  (103, 3)


| | Postcode | Borough | Neighbourhood\n |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge , Malvern |
| 1 | M9B | Etobicoke | Cloverdale , Islington , Martin Grove , Prince... |
| 2 | M5S | Downtown Toronto | Harbord , University of Toronto |
| 3 | M3H | North York | Bathurst Manor , Downsview North , Wilson Heig... |
| 4 | M2N | North York | Willowdale South |
| 5 | M6J | West Toronto | Little Portugal , Trinity |
| 6 | M5T | Downtown Toronto | Chinatown , Grange Park , Kensington Market |
| 7 | M5C | Downtown Toronto | St. James Town |
| 8 | M4P | Central Toronto | Davisville North |
| 9 | M6R | West Toronto | Parkdale , Roncesvalles |


In [5]:
url_geo ="http://cocl.us/Geospatial_data"

geo_data=pd.read_csv(url_geo)

df_geo = pd.merge(left=df_final, right=geo_data, left_on='Postcode', right_on='Postal Code')
df_geo.head()

| | Postcode | Borough | Neighbourhood\n | Postal Code | Latitude | Longitude |
|---|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge , Malvern | M1B | 43.806686 | -79.194353 |
| 1 | M9B | Etobicoke | Cloverdale , Islington , Martin Grove , Prince... | M9B | 43.650943 | -79.554724 |
| 2 | M5S | Downtown Toronto | Harbord , University of Toronto | M5S | 43.662696 | -79.400049 |
| 3 | M3H | North York | Bathurst Manor , Downsview North , Wilson Heig... | M3H | 43.754328 | -79.442259 |
| 4 | M2N | North York | Willowdale South | M2N | 43.77012 | -79.408493 |
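
The merge above is an inner join, so any postcode without a match in the geospatial file would be dropped silently. A minimal sketch of one way to check for that, using toy stand-ins for `df_final` and `geo_data` (the postcode `M0X` is made up for illustration):

```python
import pandas as pd

# Toy stand-ins for df_final and geo_data; M0X is a hypothetical postcode
# used only to show what an unmatched row looks like.
postcodes = pd.DataFrame({
    "Postcode": ["M1B", "M5S", "M0X"],
    "Borough": ["Scarborough", "Downtown Toronto", "Hypothetical"],
})
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M5S"],
    "Latitude": [43.806686, 43.662696],
    "Longitude": [-79.194353, -79.400049],
})

# how="left" keeps every postcode; indicator=True adds a "_merge" column
# that says whether each row found a partner in the right-hand table.
checked = pd.merge(postcodes, coords, how="left",
                   left_on="Postcode", right_on="Postal Code", indicator=True)
missing = checked.loc[checked["_merge"] == "left_only", "Postcode"].tolist()
print(missing)
```

An empty `missing` list would suggest that the inner join loses nothing.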


In [6]:
address = 'Toronto'

geolocator = Nominatim(user_agent="toronto_explorer")  # geopy requires a user_agent; this name is a placeholder
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude



In [7]:
map_geo = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(df_geo['Latitude'], df_geo['Longitude'], df_geo['Neighbourhood\n']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='green',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
    ).add_to(map_geo)
    
map_geo

In [8]:
# We check the API status
i = 2
neigh_lat = df_geo.loc[i, 'Latitude'] #Latitude
neigh_lng = df_geo.loc[i, 'Longitude']
radius = 500
LIMIT = 100 

url = f'https://api.foursquare.com/v2/venues/explore?client_id={CLIENT_ID}&client_secret={CLIENT_SECRET}&ll={neigh_lat},{neigh_lng}&v={VERSION}&radius={radius}&limit={LIMIT}'

requests.get(url).status_code

200

In [9]:
# Here we select the category codes needed to find like-minded places where we could settle in Toronto.
# We checked on the Foursquare website which codes fit our search:


In [10]:
list_of_interest = {
    'coworking_code':'4bf58dd8d48988d174941735',
    'tech_startup':'4bf58dd8d48988d125941735',
    'corporate_coffee_shop':'5665c7b9498e7d8a4f2c0f06',
    'recruiting_agency':'52f2ab2ebcbc57f1066b8b57',
    'college_technology':'4bf58dd8d48988d19f941735',
    'design_studio':'4bf58dd8d48988d1f4941735'
}
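
These category IDs are sent to the explore endpoint through the `categoryId` query parameter. As a sketch only, the request URL could be assembled from a parameter dictionary instead of a long f-string. The helper name `build_explore_url` and the placeholder credentials are assumptions for illustration; the endpoint and parameter names are the ones used in this notebook.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(client_id, client_secret, version, lat, lng,
                      category_id, radius=500, limit=100):
    """Build an explore URL with the same query parameters as the notebook."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
        "categoryId": category_id,
    }
    # safe="," keeps the comma in "ll" unescaped, matching the hand-written URL
    return f"{EXPLORE_ENDPOINT}?{urlencode(params, safe=',')}"

# Placeholder credentials; real ones should come from the environment
url = build_explore_url("YOUR_ID", "YOUR_SECRET", "20181022",
                        43.662696, -79.400049, "4bf58dd8d48988d174941735")
print(url)
```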

In [11]:
def get_place_type(value):
    # Return the key of list_of_interest whose category ID equals value, or None
    for k, v in list_of_interest.items():
        if v == value:
            return k
    return None

df = pd.DataFrame(columns = ['Id','Name','Category', 'Latitude','Longitude', 'type'])

cat_id_list = list(list_of_interest.values())
print(cat_id_list)

['4bf58dd8d48988d174941735', '4bf58dd8d48988d125941735', '5665c7b9498e7d8a4f2c0f06', '52f2ab2ebcbc57f1066b8b57', '4bf58dd8d48988d19f941735', '4bf58dd8d48988d1f4941735']


In [12]:
get_place_type('5432ce85498e08f48dbefd81')

In [23]:
explore_df_list = []
# CLIENT_ID and CLIENT_SECRET are loaded from the environment in the first cell;
# API credentials should never be hardcoded in a notebook.
VERSION = '20181022'

radius = 500  # search radius in metres around each neighbourhood centre
LIMIT = 100   # maximum number of venues per request

for ide in cat_id_list:
    # The try block wraps the whole neighbourhood loop, so a failed request
    # (e.g. a response without a 'groups' key) skips the remaining
    # neighbourhoods for that category.
    try:
        for neigh_name, neigh_lat, neigh_lng in zip(df_geo['Neighbourhood\n'], df_geo['Latitude'], df_geo['Longitude']):
            url = f'https://api.foursquare.com/v2/venues/explore?client_id={CLIENT_ID}&client_secret={CLIENT_SECRET}&ll={neigh_lat},{neigh_lng}&v={VERSION}&radius={radius}&limit={LIMIT}&categoryId={ide}'

            results = requests.get(url).json()
            results = results['response']['groups'][0]['items']

            # Flatten the nested JSON and keep only the columns we need
            near = pd.json_normalize(results) 
            filtered_columns = ["venue.id","venue.name","venue.categories","venue.location.lat","venue.location.lng"]
            near = near.filter(items=filtered_columns)

            # Renaming the columns
            near = near.rename(columns = {'venue.id':'Id','venue.name':'Name', 'venue.categories':'Category', 'venue.location.lat':'Latitude','venue.location.lng':'Longitude'})

            # One output row per venue, prefixed with its neighbourhood info
            for row in near.values.tolist():
                explore_df_list.append([neigh_name, neigh_lat, neigh_lng] + row)

    except Exception as e:
        print("An exception occurred: ", e)


An exception occurred:  'groups'
An exception occurred:  'groups'
An exception occurred:  'groups'
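
The `'groups'` messages above are `KeyError`s raised when a response has no `groups` entry under `response`. A minimal sketch of a more defensive lookup follows. The helper `extract_items` is hypothetical, and the shape of the error payload is an assumption, since the notebook does not show what the failing responses contained.

```python
def extract_items(payload):
    """Return the venue items from an explore response, or [] if absent."""
    groups = payload.get("response", {}).get("groups") or []
    if not groups:
        return []
    return groups[0].get("items", [])

ok_payload = {"response": {"groups": [{"items": [{"venue": {"name": "Example Hub"}}]}]}}
# Hypothetical error payload: the exact fields are an assumption
error_payload = {"meta": {"code": 429}, "response": {}}

print(len(extract_items(ok_payload)), len(extract_items(error_payload)))
```

With a helper like this, one failed request yields an empty list rather than ending the loop for the whole category.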


In [37]:
def get_category(row):        
    if len(row) == 0:
        return None
    else:
        return row[0]['name']
    

In [28]:
tor_df = pd.DataFrame([item for item in explore_df_list])

tor_df.columns = ['Neighbourhood', 'Neighbourhood Latitude', 'Neighbourhood Longitude','venue_id', 'Venue Name', 'Venue Category', 'Venue Latitude', 'Venue Longitude']
tor_df['Venue Category'] = tor_df['Venue Category'].apply(get_category)
print(tor_df.shape)
display(tor_df.head())

(353, 8)


| | Neighbourhood | Neighbourhood Latitude | Neighbourhood Longitude | venue_id | Venue Name | Venue Category | Venue Latitude | Venue Longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | Rouge , Malvern | 43.806686 | -79.194353 | 50c5f88fe4b0eaecd9bec902 | Imminent Concepts | [{'id': '4bf58dd8d48988d174941735', 'name': 'C... | 43.804597 | -79.199744 |
| 1 | Harbord , University of Toronto | 43.662696 | -79.400049 | 5086c6ede4b0c33d74e5c691 | Carolyn's Office | [{'id': '4bf58dd8d48988d174941735', 'name': 'C... | 43.662975 | -79.399147 |
| 2 | Harbord , University of Toronto | 43.662696 | -79.400049 | 560a2a14498e624fdeecac0a | Grape Capital Office | [{'id': '4bf58dd8d48988d174941735', 'name': 'C... | 43.662628 | -79.403021 |
| 3 | Harbord , University of Toronto | 43.662696 | -79.400049 | 4adf49b8f964a5201a7921e3 | Health Strategy Innovation Cell | [{'id': '4bf58dd8d48988d174941735', 'name': 'C... | 43.664691 | -79.397242 |
| 4 | Willowdale South | 43.77012 | -79.408493 | 51b0eeb7011c0a4b4d080de3 | somolopro.com | [{'id': '4bf58dd8d48988d174941735', 'name': 'C... | 43.769718 | -79.411798 |


## Methodology

In this project we will direct our efforts toward detecting areas of Toronto with a high density of coworking spaces. We will limit our analysis to the area within ~10 km of the city center.
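
As an illustration of the ~10 km limit, the sketch below filters neighbourhoods by great-circle (haversine) distance from the city center. The center coordinates are an assumed approximation (the notebook obtains them from Nominatim), and `haversine_km` is a helper written for this example.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# Assumed city-center coordinates, for illustration only
CENTER = (43.6532, -79.3832)

# Two neighbourhood centroids taken from the geospatial table
neighbourhoods = {
    "Rouge , Malvern": (43.806686, -79.194353),
    "Harbord , University of Toronto": (43.662696, -79.400049),
}
within_10km = {name for name, (lat, lng) in neighbourhoods.items()
               if haversine_km(*CENTER, lat, lng) <= 10}
print(within_10km)
```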

In the first step we collected the required **data: location and type (category) of every space within 10 km of the Toronto center**. We also **identified the types of these spaces** (according to Foursquare categorization).

The second step in our analysis will be the calculation and exploration of '**office density**' across different areas of Toronto. We will use **heatmaps** to identify a few promising areas close to the center with a low number of spaces in general and focus our attention on those areas.
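
One simple proxy for this density is the number of venues found per neighbourhood. Every query uses the same 500 m radius, so the counts are roughly comparable. The sketch below uses a handful of rows copied from the sample output; it is an illustration, not the analysis itself.

```python
from collections import Counter

# (neighbourhood, venue name) pairs in the shape of tor_df rows,
# copied from the sample output shown earlier
venues = [
    ("Rouge , Malvern", "Imminent Concepts"),
    ("Harbord , University of Toronto", "Carolyn's Office"),
    ("Harbord , University of Toronto", "Grape Capital Office"),
    ("Harbord , University of Toronto", "Health Strategy Innovation Cell"),
    ("Willowdale South", "somolopro.com"),
]

# Count venues per neighbourhood and list the densest first
venue_counts = Counter(neigh for neigh, _ in venues)
for neigh, count in venue_counts.most_common():
    print(f"{neigh}: {count}")
```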

In the third and final step we will focus on the most promising areas and, within those, create **clusters of locations that meet some basic requirements** established in discussion with stakeholders. We will present a map of all such locations and also cluster them (using **k-means clustering**) to identify general zones / neighborhoods / addresses that should be the starting point for the final 'street level' exploration and search for the optimal venue location by stakeholders.

## Analysis

Let's perform some basic exploratory data analysis and derive some additional information from our raw data. This can be achieved by clustering the neighborhoods on the basis of the office data we have acquired. Clustering is a common unsupervised machine learning technique. It groups data entries into clusters according to the similarity of their attributes, which k-means measures with the Euclidean distance.

We can then analyze these clusters separately and focus on those that show strong trends in our dataset.
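
To make the role of the Euclidean distance concrete, here is a small pure-Python sketch of the k-means loop (assign each point to its nearest centroid, then move each centroid to the mean of its points). It is a teaching sketch with hand-picked starting centroids, not the scikit-learn implementation used in this notebook. The points are the (Coworking Space, Tech Startup) frequencies of four neighbourhoods from the grouped table.

```python
from math import dist  # Euclidean distance, available from Python 3.8

def assign(points, centroids):
    """Index of the nearest centroid for each point."""
    return [min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
            for p in points]

def update(points, labels, centroids):
    """Mean of the points assigned to each centroid (unchanged if empty)."""
    new = []
    for c, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == c]
        new.append(tuple(sum(col) / len(members) for col in zip(*members))
                   if members else old)
    return new

def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the labels stop changing."""
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

# (Coworking Space, Tech Startup) frequencies for four neighbourhoods
points = [(0.5, 0.5), (0.0, 1.0), (0.666667, 0.333333), (0.0, 1.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[1]])
print(labels)
```

Neighbourhoods with a similar mix of venue types end up with the same label, and that is the only sense in which a cluster is "similar" here.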

### Normalization of the data for clustering

In [39]:
toronto_onehot = pd.get_dummies(tor_df[['Venue Category']], prefix="", prefix_sep="")

toronto_onehot['Neighbourhood'] = tor_df['Neighbourhood'] 

fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

| | Neighbourhood | Bank | Coworking Space | Design Studio | Office | Recruiting Agency | Tech Startup |
|---|---|---|---|---|---|---|---|
| 0 | Rouge , Malvern | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | Harbord , University of Toronto | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | Harbord , University of Toronto | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | Harbord , University of Toronto | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | Willowdale South | 0 | 1 | 0 | 0 | 0 | 0 |


In [40]:
toronto_grouped = toronto_onehot.groupby('Neighbourhood').mean().reset_index()
toronto_grouped.head()

| | Neighbourhood | Bank | Coworking Space | Design Studio | Office | Recruiting Agency | Tech Startup |
|---|---|---|---|---|---|---|---|
| 0 | Adelaide , King , Richmond | 0.0 | 0.5 | 0.0 | 0.0 | 0.0 | 0.5 |
| 1 | Agincourt | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | Agincourt North , L'Amoreaux East , Milliken ,... | 0.0 | 0.5 | 0.0 | 0.0 | 0.0 | 0.5 |
| 3 | Alderwood , Long Branch | 0.0 | 0.666667 | 0.0 | 0.0 | 0.0 | 0.333333 |
| 4 | Bedford Park , Lawrence Manor East | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
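
Taking the mean of one-hot columns per neighbourhood gives the share of each venue category in that neighbourhood, so every row sums to 1. A toy sketch with made-up neighbourhood names `A` and `B`:

```python
import pandas as pd

# Toy data: neighbourhood A has two coworking spaces and one startup,
# neighbourhood B has a single startup
toy = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "B"],
    "Venue Category": ["Coworking Space", "Coworking Space",
                       "Tech Startup", "Tech Startup"],
})

# dtype=float keeps the indicator columns numeric across pandas versions
onehot = pd.get_dummies(toy[["Venue Category"]], prefix="", prefix_sep="", dtype=float)
onehot["Neighbourhood"] = toy["Neighbourhood"]

# The mean of 0/1 indicators is the fraction of venues in each category
shares = onehot.groupby("Neighbourhood").mean()
print(shares)
```

Because these are shares rather than counts, a neighbourhood with one startup and a neighbourhood with fifty look identical to the clustering step.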


In [41]:
# With this function, we get the most common venues in our df. This way, we can create columns according 
# to number of top venues
def common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
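
Note that `common_venues` also ranks categories whose frequency is zero, so the lower ranks for a neighbourhood can list venue types that do not occur there at all. A possible variant, sketched here with a hypothetical helper `present_venues`, keeps only the categories that actually appear:

```python
import pandas as pd

def present_venues(row, num_top_venues):
    """Top categories by frequency, skipping categories that never occur."""
    freqs = row.iloc[1:].astype(float)  # drop the Neighbourhood label
    ranked = freqs[freqs > 0].sort_values(ascending=False)
    return ranked.index.tolist()[:num_top_venues]

# Row copied from the grouped frequency table
row = pd.Series({
    "Neighbourhood": "Alderwood , Long Branch",
    "Bank": 0.0, "Coworking Space": 0.666667, "Design Studio": 0.0,
    "Office": 0.0, "Recruiting Agency": 0.0, "Tech Startup": 0.333333,
})
print(present_venues(row, 6))
```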

In [43]:
num_class_venues = 6
indicators = ['st', 'nd', 'rd']

# Columns as number of class venues
columns = ['Neighbourhood']
for ind in np.arange(num_class_venues):
    columns.append(f'{ind+1} Most-common Type Venue')

# Create a new dataframe
venues_sorted = pd.DataFrame(columns=columns)
venues_sorted['Neighbourhood'] = toronto_grouped['Neighbourhood']

for ind in np.arange(toronto_grouped.shape[0]):
    venues_sorted.iloc[ind, 1:] = common_venues(toronto_grouped.iloc[ind, :], num_class_venues)

venues_sorted.head()

| | Neighbourhood | 1 Most-common Type Venue | 2 Most-common Type Venue | 3 Most-common Type Venue | 4 Most-common Type Venue | 5 Most-common Type Venue | 6 Most-common Type Venue |
|---|---|---|---|---|---|---|---|
| 0 | Adelaide , King , Richmond | Tech Startup | Coworking Space | Recruiting Agency | Office | Design Studio | Bank |
| 1 | Agincourt | Tech Startup | Recruiting Agency | Office | Design Studio | Coworking Space | Bank |
| 2 | Agincourt North , L'Amoreaux East , Milliken ,... | Tech Startup | Coworking Space | Recruiting Agency | Office | Design Studio | Bank |
| 3 | Alderwood , Long Branch | Coworking Space | Tech Startup | Recruiting Agency | Office | Design Studio | Bank |
| 4 | Bedford Park , Lawrence Manor East | Tech Startup | Recruiting Agency | Office | Design Studio | Coworking Space | Bank |


### K-Means

In [44]:
k = 5
tor_clusters = toronto_grouped.drop(columns='Neighbourhood')  # keyword form; the positional axis argument was removed in pandas 2.0

# Run k-means clustering
kmeans = KMeans(n_clusters = k, random_state = 0).fit(tor_clusters)

# Check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

# Add clustering labels
venues_sorted.insert(0, 'K-Labels', kmeans.labels_)

In [45]:
df_final = df_final.rename(columns = {'Neighbourhood\n':'Neighbourhood'})
df_final = pd.merge(left=df_final, right=geo_data, left_on='Postcode', right_on='Postal Code')
tor_merged = df_final

tor_merged = tor_merged.join(venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')
tor_merged.dropna(inplace = True)
tor_merged['K-Labels'] = tor_merged['K-Labels'].astype(int)
tor_merged.head()

| | Postcode | Borough | Neighbourhood | Postal Code | Latitude | Longitude | K-Labels | 1 Most-common Type Venue | 2 Most-common Type Venue | 3 Most-common Type Venue | 4 Most-common Type Venue | 5 Most-common Type Venue | 6 Most-common Type Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge , Malvern | M1B | 43.806686 | -79.194353 | 3 | Coworking Space | Tech Startup | Recruiting Agency | Office | Design Studio | Bank |
| 2 | M5S | Downtown Toronto | Harbord , University of Toronto | M5S | 43.662696 | -79.400049 | 0 | Coworking Space | Tech Startup | Recruiting Agency | Office | Design Studio | Bank |
| 4 | M2N | North York | Willowdale South | M2N | 43.77012 | -79.408493 | 1 | Coworking Space | Recruiting Agency | Tech Startup | Office | Design Studio | Bank |
| 5 | M6J | West Toronto | Little Portugal , Trinity | M6J | 43.647927 | -79.41975 | 1 | Coworking Space | Tech Startup | Recruiting Agency | Office | Design Studio | Bank |
| 6 | M5T | Downtown Toronto | Chinatown , Grange Park , Kensington Market | M5T | 43.653206 | -79.400049 | 0 | Tech Startup | Coworking Space | Design Studio | Recruiting Agency | Office | Bank |


In [46]:
map_Kmeans = folium.Map(location=[latitude, longitude], zoom_start=11)


# One color per cluster
colors_list = cm.rainbow(np.linspace(0, 1, k))
rainbow = [colores.rgb2hex(c) for c in colors_list]

# Markers to the map
for lat, lon, name, cluster in zip(tor_merged['Latitude'], tor_merged['Longitude'], tor_merged['Neighbourhood'], tor_merged['K-Labels']):
    label = folium.Popup(f'{name} (Cluster {cluster})', parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],  # labels run from 0 to k-1, so index directly
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_Kmeans)
       
map_Kmeans