# Battle of the neighbourhoods II

In this project, we step into the role of a consultant advising young entrepreneurs who want to open a business in the city of Toronto. We leverage the Foursquare API and complement it with economic, demographic and safety datasets published by the City of Toronto on its Open Data portal (https://www.toronto.ca/city-government/data-research-maps/open-data/). With these, we can provide a comprehensive analysis of the economic opportunities the city offers and improve these entrepreneurs' chances of success.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Problem statement and approach</a>

2. <a href="#item2">Datasets and role in the project</a>

3. <a href="#item2">Datasets cleansing and data preparation</a>

4. <a href="#item2">Methodology and Data analysis</a>

5. <a href="#item2">Results</a>

6. <a href="#item2">Discussions</a>

7. <a href="#item2">Conclusions</a>
 
</font>
</div>

# 1. Problem statement and approach

## 1.1 Background

Toronto is Canada’s business and financial capital, a growing financial hub in North America, and a top ten global financial centre. 

The Toronto region’s GDP accounts for 18% of Canada’s GDP. It is home to Canada’s five major banks, the vast majority of foreign banks operating in Canada, and the Toronto Stock Exchange (TSX) – the world’s principal exchange for mining, oil and gas and a leader in cleantech listings.

Toronto is competitive in almost every other major business sector from technology and life sciences to green energy; from fashion and design to food and beverage; from film and television production to music and digital media. Toronto’s rich industrial diversity drives growth, innovation and cross-sectoral synergies and knowledge spillovers have spawned new leading-edge hybrid sectors including med-tech, green-tech and food-tech. (Source: https://www.toronto.ca/business-economy/invest-in-toronto/strong-economy/)

## 1.2 Problem statement

As an entrepreneur, you often have to make key decisions early on with little or no information, and those decisions can have a tremendous effect on your future chances of success. One of these decisions, and the one this project focuses on, is where to locate your business, which requires correctly assessing the competition and the business opportunities arising from public or private investment.

## 1.3 Approach

To navigate this project we will create a decision framework organized into three lines of action, and then design a business case example that applies these tools to a specific client. The three lines of action are the following:
1. **Economic environment**: analysis of economic data per neighbourhood, with leading industry sectors and economic activity at the small and medium-sized enterprise (SME) level (leveraging the Foursquare API).
2. **Potential customer base**: evaluation of demographic factors considering population, age segmentation and median income per neighbourhood among others.
3. **Safety and public services**: exploration of different crimes and felonies recorded in each neighbourhood.

## 1.4 Business case introduction
We will advise Standard Digital Services Ltd., a promising startup which aims to connect financial and small-business data to provide better loan conditions to individual stakeholders while connecting them with banks offering premium conditions. The CEO, Elisabeth Mapple, has an intuition about what they want, although as a startup their strategy is not yet clearly detailed. The location of their dreams would have to fulfil the following:
* **Branding:** they want to be seen with the big players, for which location and prestige are mandatory features.
* **Customers:** they target customers who need a loan and are savvy with digital technologies; their ideal customer is between 25 and 54 years old. Apart from this, any individual with an income of 100,000 dollars or higher would also be a potential customer.
* **Providers:** as an IT company, they need to be close to banks and to have storage facilities nearby for their infrastructure. We will leverage Foursquare for this analysis.
* **IP protection:** because of the previous point, break and enters are to be avoided at all costs.
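
The customer criteria above can be turned into a rough figure per neighbourhood. The following is a minimal sketch in plain Python, using three rows copied from the demographics preview in section 3.2 (the working-age column and the 100,000-dollars-and-over income column). The function name `potential_customers` is hypothetical, and because the two groups overlap, the sum is only an upper bound on the target customer base.

```python
# Figures copied from the demographics preview in section 3.2:
# (working-age residents, residents in the 100,000-dollars-and-over bracket).
sample = {
    "Agincourt North": (11305, 665),
    "Alderwood": (5220, 845),
    "Annex": (15040, 5255),
}


def potential_customers(working_age, high_income):
    """Upper bound: the two groups overlap, so the sum double-counts some people."""
    return working_age + high_income


scores = {hood: potential_customers(*figures) for hood, figures in sample.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Among these three rows, Annex ranks first on this measure.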

# 2. Datasets to be used and role in the project

Section 1.3 explained the approach followed in this project and how we structure it to determine which neighbourhood is most suitable for locating a startup. The problem statement also identified a lack of information about the economic environment as a potential pitfall for a business. We will therefore leverage the following datasets:

## 2.1 Toronto Administrative Organization

**Source:** https://www.toronto.ca/city-government/data-research-maps/neighbourhoods-communities/neighbourhood-profiles/

Before clustering and preparing our data, and once the scope of the problem has been defined, we need to decide how the data will be broken down to become actionable. As the core of the project is to determine the optimal conditions for a startup to thrive, we will use the administrative neighbourhoods into which Toronto is divided to draw our conclusions.

The relevant fields of this dataset are:
* Neighbourhood name
* Latitude
* Longitude
* Shape

## 2.2 Toronto Demographics

**Source:** https://www.toronto.ca/city-government/data-research-maps/open-data/open-data-catalogue/locations-and-mapping/#8c732154-5012-9afe-d0cd-ba3ffc813d5a

This dataset, provided via Toronto's Open Data portal, contains demographic data for each of Toronto's neighbourhoods, from population density to income data and the main ethnicities and languages spoken. It will be used either to define the customer profile of a given neighbourhood or to find the neighbourhoods that fit a target customer. Given the economic focus of this project, we have selected a subset of this data containing the following fields:

* Neighbourhood
* Population density per sqm
* Land area (sqm)
* Children (0-14 years)
* Youth (15-24 years)
* Working Age (25-54 years)
* Pre-retirement (55-64 years)
* Seniors (65+ years)
* Older Seniors (85+ years)
* Civil status - Married
* Civil status - Never married
* Civil status - Separated
* Civil status - Divorced
* Median income - Under 10,000 (including loss)
* Median income - 10,000 to 19,999
* Median income - 20,000 to 29,999
* Median income - 30,000 to 39,999
* Median income - 40,000 to 49,999
* Median income - 50,000 to 59,999
* Median income - 60,000 to 69,999
* Median income - 70,000 to 79,999
* Median income - 80,000 to 89,999
* Median income - 90,000 to 99,999
* Median income - 100,000 and over
* Median income - 100,000 to 149,999
* Median income - 150,000 and over


## 2.3 Toronto Economics

**Source:** https://www.toronto.ca/city-government/data-research-maps/open-data/open-data-catalogue/business/#e3a085d5-8e94-e279-4c17-33c209141464

A second lever after customer segmentation is understanding the economic context of the area where the business would be based. By describing the general state of the economy (debt level, real estate, employment) as well as venues and small businesses, it is possible to infer how suitable an area is for investment, or for securing customers while understanding the competition. The fields contained in the dataset are the following:

* Neighbourhoods
* Businesses
* Child care spaces
* Debt risk score
* Home prices
* Local employment
* Social Assistance

## 2.4 Foursquare API

We will use Foursquare to obtain data on the venues in each neighbourhood, which lets us profile the small businesses operating there.

## 2.5 Toronto Safety

**Source:** https://www.toronto.ca/city-government/data-research-maps/open-data/open-data-catalogue/public-safety/#6ff36980-d2f4-f438-d940-3e6a5c315588

Finally, once our customer profile is clear and tailored to the candidate neighbourhoods, and the economic situation in them is viable, we will use public records to check the major crimes and felonies recorded in each neighbourhood and their nature, so that the final assessment can take place. Some of the fields to be considered are:
* Break and enters
* Fire and fire alarms 
* Robberies
* Total Major Crimes incidents

# 3. Datasets cleansing and data preparation

As the data uploaded to the Open Data portal is already cleaned and grouped by category as shown above, data preparation requires little effort. We now load the datasets and assign each one to a dataframe:

In [35]:
import types
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

## 3.1 Toronto administrative organization

In this case, as stated above, we only require the name of the neighbourhood, its code and its coordinates (longitude and latitude).

In [2]:
import pandas as pd
body = client_d3d4563280a44b8eb7ad63b8c7fd2fac.get_object(Bucket='locationbasedanalysis-donotdelete-pr-volx3spnyzafuc',Key='Toronto_neighbourhoods_coordinates.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df_neighbourhood = pd.read_csv(body)
df_neighbourhood.head()

Unnamed: 0,_id,AREA_ID,AREA_ATTR_ID,PARENT_AREA_ID,AREA_SHORT_CODE,AREA_LONG_CODE,AREA_NAME,AREA_DESC,LONGITUDE,LATITUDE,OBJECTID
0,1,25886861,25926662,49885,94,94,Wychwood (94),Wychwood (94),-79.425515,43.676919,16491505
1,2,25886820,25926663,49885,100,100,Yonge-Eglinton (100),Yonge-Eglinton (100),-79.40359,43.704689,16491521
2,3,25886834,25926664,49885,97,97,Yonge-St.Clair (97),Yonge-St.Clair (97),-79.397871,43.687859,16491537
3,4,25886593,25926665,49885,27,27,York University Heights (27),York University Heights (27),-79.488883,43.765736,16491553
4,5,25886688,25926666,49885,31,31,Yorkdale-Glen Park (31),Yorkdale-Glen Park (31),-79.457108,43.714672,16491569


We clean the data and shape it to the requirements of the project, which are linking area short code with coordinates (latitude and longitude):

In [3]:
df_neighbourhood.drop(['_id','AREA_ATTR_ID','PARENT_AREA_ID','AREA_LONG_CODE','AREA_DESC','OBJECTID'], axis=1, inplace=True)
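
The same column selection could also be made at load time instead of dropping columns afterwards. The following is a minimal sketch, assuming the CSV has the columns shown in the preview above; it reads two of those rows from an in-memory string so that it runs without the storage client, and the names `csv_text`, `wanted` and `df_small` are illustrative. `usecols` and `index_col` are standard `pd.read_csv` parameters.

```python
import io

import pandas as pd

# Two rows copied from the preview above, written out as CSV text.
csv_text = """_id,AREA_ID,AREA_ATTR_ID,PARENT_AREA_ID,AREA_SHORT_CODE,AREA_LONG_CODE,AREA_NAME,AREA_DESC,LONGITUDE,LATITUDE,OBJECTID
1,25886861,25926662,49885,94,94,Wychwood (94),Wychwood (94),-79.425515,43.676919,16491505
2,25886820,25926663,49885,100,100,Yonge-Eglinton (100),Yonge-Eglinton (100),-79.40359,43.704689,16491521
"""

# Keep only the columns the project needs and index by the area code in one step.
wanted = ["AREA_ID", "AREA_SHORT_CODE", "AREA_NAME", "LONGITUDE", "LATITUDE"]
df_small = pd.read_csv(
    io.StringIO(csv_text), usecols=wanted, index_col="AREA_SHORT_CODE"
)
print(df_small)
```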

We then index it by area code, which will serve as the key for joining the other dataframes:

In [4]:
df_neighbourhood.sort_values(by=['AREA_SHORT_CODE'], inplace=True)
df_neighbourhood.set_index('AREA_SHORT_CODE', inplace=True)
df_neighbourhood.head()

AREA_SHORT_CODE,AREA_ID,AREA_NAME,LONGITUDE,LATITUDE
1,25886718,West Humber-Clairville (1),-79.596356,43.71618
2,25886715,Mount Olive-Silverstone-Jamestown (2),-79.587259,43.746868
3,25886723,Thistletown-Beaumond Heights (3),-79.563491,43.737988
4,25886730,Rexdale-Kipling (4),-79.566228,43.723725
5,25886733,Elms-Old Rexdale (5),-79.548983,43.721519


## 3.2 Toronto Demographics

We load the whole dataset described in section 2.2. Once the target customer has been defined in later sections, we will split it into subsets per demographic topic (income, age, etc.):

In [5]:
body = client_d3d4563280a44b8eb7ad63b8c7fd2fac.get_object(Bucket='locationbasedanalysis-donotdelete-pr-volx3spnyzafuc',Key='WB-Demographics.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df_demographics = pd.read_csv(body)
df_demographics.rename(columns={"Neigbourhood Id": "Neighbourhood Id"},inplace=True)
df_demographics.head()

Unnamed: 0,Neighbourhood,Neighbourhood Id,Population density sqm,Land area sqm,Children (0-14 years),Youth (15-24 years),Working Age (25-54 years),Pre-retirement (55-64 years),Seniors (65+ years),Older Seniors (85+ years),...,Median income - $30k to $39k,Median income -$40k to $49k,Median income - $50k to $59k,Median income - $60k to $69k,Median income - $70k to $79k,Median income - $80k to $89k,Median income -$90k to $99k,Median income - $100k and over,Median income - $100k to $149k,Median income - $150k and over
0,Agincourt North,129,3929,7.41,3840,3705,11305,4230,6045,925,...,2465,1895,1265,865,655,435,365,665,530,135
1,Agincourt South-Malvern West,128,3034,7.83,3075,3360,9965,3265,4105,555,...,2020,1560,1125,825,570,435,315,685,525,165
2,Alderwood,20,2435,4.95,1760,1235,5220,1825,2015,320,...,1095,950,825,690,530,395,370,845,620,225
3,Annex,95,10863,2.81,2360,3750,15040,3480,5910,1040,...,2150,1935,1655,1460,1290,1000,830,5255,2190,3055
4,Banbury-Don Mills,42,2775,9.98,3605,2730,10810,3555,6975,1640,...,1980,1915,1665,1425,1220,960,820,3670,2035,1635


Now we need to link a set of coordinates to this data, for which we use the pandas merge function and the df_neighbourhood dataframe:

In [6]:
df_demographics = pd.merge(df_neighbourhood, df_demographics, how='inner', left_on = 'AREA_SHORT_CODE', right_on = 'Neighbourhood Id')
df_demographics = df_demographics.drop('Neighbourhood', axis=1)
df_demographics.head()

Unnamed: 0,AREA_ID,AREA_NAME,LONGITUDE,LATITUDE,Neighbourhood Id,Population density sqm,Land area sqm,Children (0-14 years),Youth (15-24 years),Working Age (25-54 years),...,Median income - $30k to $39k,Median income -$40k to $49k,Median income - $50k to $59k,Median income - $60k to $69k,Median income - $70k to $79k,Median income - $80k to $89k,Median income -$90k to $99k,Median income - $100k and over,Median income - $100k to $149k,Median income - $150k and over
0,25886718,West Humber-Clairville (1),-79.596356,43.71618,1,1117,29.81,5060,5445,13845,...,3305,2700,1900,1170,720,535,370,600,505,105
1,25886715,Mount Olive-Silverstone-Jamestown (2),-79.587259,43.746868,2,7291,4.52,7090,5240,13615,...,3135,2125,1250,750,460,275,180,230,200,30
2,25886723,Thistletown-Beaumond Heights (3),-79.563491,43.737988,3,3130,3.31,1730,1410,4160,...,1025,740,490,340,255,175,135,250,195,55
3,25886730,Rexdale-Kipling (4),-79.566228,43.723725,4,4229,2.49,1640,1355,4300,...,935,835,625,430,310,220,185,240,205,45
4,25886733,Elms-Old Rexdale (5),-79.548983,43.721519,5,3306,2.86,1805,1440,3700,...,890,675,495,320,215,155,100,185,170,30
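
An inner merge silently drops any neighbourhood whose identifier is missing from either table. The following is a minimal sketch of how the join could be checked, using small stand-in frames rather than the project data; the identifier 999 and its value are made up for illustration. `indicator=True` and `validate` are standard `pd.merge` parameters.

```python
import pandas as pd

# Illustrative stand-ins for df_neighbourhood and df_demographics.
# AREA_SHORT_CODE is kept as a plain column here for simplicity.
left = pd.DataFrame(
    {
        "AREA_SHORT_CODE": [1, 4],
        "AREA_NAME": ["West Humber-Clairville (1)", "Rexdale-Kipling (4)"],
    }
)
right = pd.DataFrame(
    {
        "Neighbourhood Id": [1, 4, 999],  # 999 is a made-up, unmatched identifier
        "Working Age (25-54 years)": [13845, 4300, 0],
    }
)

# An outer merge with indicator=True keeps unmatched rows and labels their origin;
# validate raises a MergeError if an identifier is duplicated on either side.
check = pd.merge(
    left,
    right,
    how="outer",
    left_on="AREA_SHORT_CODE",
    right_on="Neighbourhood Id",
    indicator=True,
    validate="one_to_one",
)
unmatched = check[check["_merge"] != "both"]
print(unmatched[["Neighbourhood Id", "_merge"]])
```

If `unmatched` is empty on the real data, the inner merge used in this notebook loses no neighbourhoods.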


Finally, we index it by neighbourhood Id, which will serve to join other dataframes:

In [7]:
df_demographics.sort_values(by=['Neighbourhood Id'], inplace=True)
df_demographics.set_index('Neighbourhood Id', inplace=True)
df_demographics.head()

Neighbourhood Id,AREA_ID,AREA_NAME,LONGITUDE,LATITUDE,Population density sqm,Land area sqm,Children (0-14 years),Youth (15-24 years),Working Age (25-54 years),Pre-retirement (55-64 years),...,Median income - $30k to $39k,Median income -$40k to $49k,Median income - $50k to $59k,Median income - $60k to $69k,Median income - $70k to $79k,Median income - $80k to $89k,Median income -$90k to $99k,Median income - $100k and over,Median income - $100k to $149k,Median income - $150k and over
1,25886718,West Humber-Clairville (1),-79.596356,43.71618,1117,29.81,5060,5445,13845,3990,...,3305,2700,1900,1170,720,535,370,600,505,105
2,25886715,Mount Olive-Silverstone-Jamestown (2),-79.587259,43.746868,7291,4.52,7090,5240,13615,3475,...,3135,2125,1250,750,460,275,180,230,200,30
3,25886723,Thistletown-Beaumond Heights (3),-79.563491,43.737988,3130,3.31,1730,1410,4160,1195,...,1025,740,490,340,255,175,135,250,195,55
4,25886730,Rexdale-Kipling (4),-79.566228,43.723725,4229,2.49,1640,1355,4300,1520,...,935,835,625,430,310,220,185,240,205,45
5,25886733,Elms-Old Rexdale (5),-79.548983,43.721519,3306,2.86,1805,1440,3700,1255,...,890,675,495,320,215,155,100,185,170,30


## 3.3 Toronto Economics

We apply the same techniques to the Toronto economics dataset as in the previous section:

In [8]:
body = client_d3d4563280a44b8eb7ad63b8c7fd2fac.get_object(Bucket='locationbasedanalysis-donotdelete-pr-volx3spnyzafuc',Key='WB-Economics.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df_economics = pd.read_csv(body)

We append the latitude and longitude coordinates from the df_neighbourhood dataframe:

In [9]:
df_economics = pd.merge(df_neighbourhood, df_economics, how='inner', left_on = 'AREA_SHORT_CODE', right_on = 'Neighbourhood Id')
df_economics = df_economics.drop('Neighbourhood', axis=1)
df_economics.head()

Unnamed: 0,AREA_ID,AREA_NAME,LONGITUDE,LATITUDE,Neighbourhood Id,Businesses,Child Care Spaces,Debt Risk Score,Home Prices,Local Employment,Social Assistance Recipients
0,25886718,West Humber-Clairville (1),-79.596356,43.71618,1,2463,195,719,317508,58271,2912
1,25886715,Mount Olive-Silverstone-Jamestown (2),-79.587259,43.746868,2,271,60,687,251119,3244,6561
2,25886723,Thistletown-Beaumond Heights (3),-79.563491,43.737988,3,217,25,718,414216,1311,1276
3,25886730,Rexdale-Kipling (4),-79.566228,43.723725,4,144,75,721,392271,1178,1323
4,25886733,Elms-Old Rexdale (5),-79.548983,43.721519,5,67,60,692,233832,903,1683


Finally, we index it by neighbourhood Id, which will serve to join other dataframes:

In [10]:
df_economics.sort_values(by=['Neighbourhood Id'], inplace=True)
df_economics.set_index('Neighbourhood Id', inplace=True)
df_economics.head()

Neighbourhood Id,AREA_ID,AREA_NAME,LONGITUDE,LATITUDE,Businesses,Child Care Spaces,Debt Risk Score,Home Prices,Local Employment,Social Assistance Recipients
1,25886718,West Humber-Clairville (1),-79.596356,43.71618,2463,195,719,317508,58271,2912
2,25886715,Mount Olive-Silverstone-Jamestown (2),-79.587259,43.746868,271,60,687,251119,3244,6561
3,25886723,Thistletown-Beaumond Heights (3),-79.563491,43.737988,217,25,718,414216,1311,1276
4,25886730,Rexdale-Kipling (4),-79.566228,43.723725,144,75,721,392271,1178,1323
5,25886733,Elms-Old Rexdale (5),-79.548983,43.721519,67,60,692,233832,903,1683


## 3.4 Foursquare API

Following the same approach applied to the clustering project in this same Capstone project, we will:
* Define Foursquare Credentials and Version to explore each neighbourhood
* Explore the different neighbourhoods

### 3.4.1 Credentials and version

In [11]:
import os
import requests

# Read the credentials from environment variables instead of hardcoding them
# (the variable names below are an assumption; set them before running this cell)
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID']
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET']
VERSION = '20180605' # Foursquare API version
LIMIT=100

print('Foursquare credentials loaded.')

Foursquare credentials loaded.


### 3.4.2 Neighbourhood venues data collection
We prepare the function that will allow us to fetch all the data required in a single call:

In [12]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
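
The list comprehension above assumes that every response contains a `groups` entry and that every venue has at least one category, so a single incomplete record raises an exception and stops the whole loop. The following is a minimal sketch of a more defensive parser. `parse_venues` is a hypothetical helper that is not part of the original notebook, and the sample payload is hand-written to mimic the structure the original function indexes into.

```python
def parse_venues(payload, name, lat, lng):
    """Extract venue rows from an explore-style response, skipping incomplete items."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        # Fall back to a single empty category when the list is missing or empty.
        categories = venue.get("categories") or [{}]
        if "name" not in venue or "lat" not in location or "lng" not in location:
            continue  # skip records that lack the fields we need
        rows.append((
            name, lat, lng,
            venue["name"], location["lat"], location["lng"],
            categories[0].get("name", "Unknown"),
        ))
    return rows


# Hand-written payload: one complete venue, one without categories, one empty item.
payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Tim Hortons",
               "location": {"lat": 43.714657, "lng": -79.593716},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Unnamed kiosk",
               "location": {"lat": 43.7150, "lng": -79.5940},
               "categories": []}},
    {},
]}]}}

rows = parse_venues(payload, "West Humber-Clairville (1)", 43.71618, -79.596356)
print(rows)
```

A venue without a category is kept and labelled "Unknown", while an item with no usable venue data is skipped.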

We apply it to the Toronto neighbourhoods:

In [14]:
toronto_venues = getNearbyVenues(names=df_neighbourhood['AREA_NAME'],
                                   latitudes=df_neighbourhood['LATITUDE'],
                                   longitudes=df_neighbourhood['LONGITUDE']
                                  )

West Humber-Clairville (1)
Mount Olive-Silverstone-Jamestown (2)
Thistletown-Beaumond Heights (3)
Rexdale-Kipling (4)
Elms-Old Rexdale (5)
Kingsview Village-The Westway (6)
Willowridge-Martingrove-Richview (7)
Humber Heights-Westmount (8)
Edenbridge-Humber Valley (9)
Princess-Rosethorn (10)
Eringate-Centennial-West Deane (11)
Markland Wood (12)
Etobicoke West Mall (13)
Islington-City Centre West (14)
Kingsway South (15)
Stonegate-Queensway (16)
Mimico (includes Humber Bay Shores) (17)
New Toronto (18)
Long Branch (19)
Alderwood (20)
Humber Summit (21)
Humbermede (22)
Pelmo Park-Humberlea (23)
Black Creek (24)
Glenfield-Jane Heights (25)
Downsview-Roding-CFB (26)
York University Heights (27)
Rustic (28)
Maple Leaf (29)
Brookhaven-Amesbury (30)
Yorkdale-Glen Park (31)
Englemount-Lawrence (32)
Clanton Park (33)
Bathurst Manor (34)
Westminster-Branson (35)
Newtonbrook West (36)
Willowdale West (37)
Lansing-Westgate (38)
Bedford Park-Nortown (39)
St.Andrew-Windfields (40)
Bridle Path-Sunnyb

We check if Foursquare has fetched the venues correctly, and leave them as they are, as we will manipulate them in the final section of this project:

In [15]:
print(toronto_venues.shape)
toronto_venues.head(15)

(2073, 7)


Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,West Humber-Clairville (1),43.71618,-79.596356,Mandarin Buffet,43.719798,-79.59582,Chinese Restaurant
1,West Humber-Clairville (1),43.71618,-79.596356,Tim Hortons,43.714657,-79.593716,Coffee Shop
2,West Humber-Clairville (1),43.71618,-79.596356,Xawaash,43.715786,-79.593053,Mediterranean Restaurant
3,West Humber-Clairville (1),43.71618,-79.596356,Winners,43.719819,-79.594923,Clothing Store
4,West Humber-Clairville (1),43.71618,-79.596356,Staples Rexdale,43.718539,-79.59457,Paper / Office Supplies Store
5,West Humber-Clairville (1),43.71618,-79.596356,Subway,43.719075,-79.595933,Sandwich Place
6,West Humber-Clairville (1),43.71618,-79.596356,TD Canada Trust,43.71963,-79.599896,Bank
7,West Humber-Clairville (1),43.71618,-79.596356,Swiss Pick,43.71615,-79.593843,Swiss Restaurant
8,West Humber-Clairville (1),43.71618,-79.596356,Comfort Hotel Airport North,43.716187,-79.594093,Hotel
9,West Humber-Clairville (1),43.71618,-79.596356,Planet Fitness,43.719063,-79.595205,Gym / Fitness Center
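
Once the venues are collected, the provider criterion from section 1.4 (proximity to banks) comes down to counting venue categories per neighbourhood. The following is a minimal sketch using the ten rows of the preview above and only the standard library; the variable names are illustrative.

```python
from collections import Counter

# (neighbourhood, venue category) pairs copied from the preview above.
preview = [
    ("West Humber-Clairville (1)", "Chinese Restaurant"),
    ("West Humber-Clairville (1)", "Coffee Shop"),
    ("West Humber-Clairville (1)", "Mediterranean Restaurant"),
    ("West Humber-Clairville (1)", "Clothing Store"),
    ("West Humber-Clairville (1)", "Paper / Office Supplies Store"),
    ("West Humber-Clairville (1)", "Sandwich Place"),
    ("West Humber-Clairville (1)", "Bank"),
    ("West Humber-Clairville (1)", "Swiss Restaurant"),
    ("West Humber-Clairville (1)", "Hotel"),
    ("West Humber-Clairville (1)", "Gym / Fitness Center"),
]

# Count each (neighbourhood, category) pair, then keep only the banks.
counts = Counter(preview)
banks = {hood: n for (hood, category), n in counts.items() if category == "Bank"}
print(banks)
```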


## 3.5 Public Safety
Finally, our last dataset includes the major crimes and minor felonies recorded in Toronto, from which we select only those that could affect a business location:

In [16]:
body = client_d3d4563280a44b8eb7ad63b8c7fd2fac.get_object(Bucket='locationbasedanalysis-donotdelete-pr-volx3spnyzafuc',Key='WB-Safety.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df_safety = pd.read_csv(body)

We clean it to display only the relevant features mentioned in section 2.5:

In [17]:
df_safety.drop(['Drug Arrests','Fire Medical Calls' ,'Fire Vehicle Incidents','Hazardous Incidents','Thefts'], axis=1, inplace=True)

We append the latitude and longitude coordinates from the df_neighbourhood dataframe:

In [18]:
df_safety = pd.merge(df_neighbourhood, df_safety, how='inner', left_on = 'AREA_SHORT_CODE', right_on = 'Neighbourhood Id')
df_safety = df_safety.drop('Neighbourhood', axis=1)
df_safety.head()

Unnamed: 0,AREA_ID,AREA_NAME,LONGITUDE,LATITUDE,Neighbourhood Id,Arsons,Assaults,Break & Enters,Fires & Fire Alarms,Murders,Robberies,Sexual Assaults,Total Major Crime Incidents,Vehicle Thefts
0,25886718,West Humber-Clairville (1),-79.596356,43.71618,1,4,390,175,705,0,82,68,1119,288
1,25886715,Mount Olive-Silverstone-Jamestown (2),-79.587259,43.746868,2,3,316,61,361,1,78,75,690,62
2,25886723,Thistletown-Beaumond Heights (3),-79.563491,43.737988,3,0,85,36,90,0,17,24,192,12
3,25886730,Rexdale-Kipling (4),-79.566228,43.723725,4,0,59,32,94,1,16,20,164,18
4,25886733,Elms-Old Rexdale (5),-79.548983,43.721519,5,1,77,25,107,0,23,5,185,22


Finally, we index it by neighbourhood Id, which will serve to join other dataframes:

In [19]:
df_safety.sort_values(by=['Neighbourhood Id'], inplace=True)
df_safety.set_index('Neighbourhood Id', inplace=True)
df_safety.head()

Neighbourhood Id,AREA_ID,AREA_NAME,LONGITUDE,LATITUDE,Arsons,Assaults,Break & Enters,Fires & Fire Alarms,Murders,Robberies,Sexual Assaults,Total Major Crime Incidents,Vehicle Thefts
1,25886718,West Humber-Clairville (1),-79.596356,43.71618,4,390,175,705,0,82,68,1119,288
2,25886715,Mount Olive-Silverstone-Jamestown (2),-79.587259,43.746868,3,316,61,361,1,78,75,690,62
3,25886723,Thistletown-Beaumond Heights (3),-79.563491,43.737988,0,85,36,90,0,17,24,192,12
4,25886730,Rexdale-Kipling (4),-79.566228,43.723725,0,59,32,94,1,16,20,164,18
5,25886733,Elms-Old Rexdale (5),-79.548983,43.721519,1,77,25,107,0,23,5,185,22
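
Because the client's IP protection criterion favours areas with few break-ins, the safety table is most useful when read in ascending order. The following is a minimal sketch in plain Python using the five rows of the preview above; the variable names are illustrative. On the full dataframe, pandas offers `DataFrame.nsmallest` for the same purpose.

```python
# (Neighbourhood Id, name, Break & Enters) copied from the safety preview above.
safety_preview = [
    (1, "West Humber-Clairville (1)", 175),
    (2, "Mount Olive-Silverstone-Jamestown (2)", 61),
    (3, "Thistletown-Beaumond Heights (3)", 36),
    (4, "Rexdale-Kipling (4)", 32),
    (5, "Elms-Old Rexdale (5)", 25),
]

# Sort by the break-in count and keep the three lowest values.
fewest_break_ins = sorted(safety_preview, key=lambda row: row[2])[:3]
for hood_id, name, count in fewest_break_ins:
    print(hood_id, name, count)
```

These are raw incident counts, so a large neighbourhood can look worse than a small one with a similar rate.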


# 4. Methodology and Data analysis

In this section, we apply different techniques to gain a more visual understanding of the datasets, using Folium and data analysis tools to display some key features of each. Later in the project this will support the decision-making process in section 4.2 - Business case.

In [20]:
#!conda install -c conda-forge folium=0.5.0 --yes
import folium

print('Folium installed and imported!')

Folium installed and imported!


Now that Folium is imported we create a map of Toronto:

In [21]:
import numpy as np
import json

latitude= df_neighbourhood['LATITUDE'].mean()+0.27
longitude= df_neighbourhood['LONGITUDE'].mean()

map_toronto = folium.Map(location=[latitude,longitude], zoom_start=11)
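
Centring the map on the mean coordinates plus a manual offset works, but the offset has to be tuned by hand. As an alternative that is not taken from the original notebook, the following sketch computes a bounding box in the `[[south, west], [north, east]]` form accepted by Folium's `fit_bounds` method. It uses the first five coordinate pairs from df_neighbourhood, and `bounding_box` is a hypothetical helper.

```python
# (latitude, longitude) pairs copied from the first five rows of df_neighbourhood.
coords = [
    (43.71618, -79.596356),
    (43.746868, -79.587259),
    (43.737988, -79.563491),
    (43.723725, -79.566228),
    (43.721519, -79.548983),
]


def bounding_box(points):
    """Return [[south, west], [north, east]] for a list of (lat, lon) pairs."""
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return [[min(lats), min(lons)], [max(lats), max(lons)]]


bounds = bounding_box(coords)
print(bounds)
# With Folium available, the map could then be framed with:
# map_toronto.fit_bounds(bounds)
```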

## 4.1 Environment exploration: Economics and City safety

To start exploring our data, we focus on the economic environment and safety conditions in the different areas of Toronto. We select the top 10 values for each key category and superimpose them on a map of Toronto as an initial search for patterns:

### 4.1.1 Economic framework

As a general rule, and regardless of the business at hand, we will try to place our business in the most developed areas. First we select the top 10 neighbourhoods by number of businesses:

In [22]:
top10_business=df_economics.nlargest(10, ['Businesses']) 
top10_business.head(10)

Neighbourhood Id,AREA_ID,AREA_NAME,LONGITUDE,LATITUDE,Businesses,Child Care Spaces,Debt Risk Score,Home Prices,Local Employment,Social Assistance Recipients
76,25886914,Bay Street Corridor (76),-79.385721,43.657511,4324,30,755,457787,185891,678
77,25886962,Waterfront Communities-The Island (77),-79.377202,43.63388,2899,130,748,416759,88058,2864
27,25886593,York University Heights (27),-79.488883,43.765736,2643,156,698,359372,42885,4720
1,25886718,West Humber-Clairville (1),-79.596356,43.71618,2463,195,719,317508,58271,2912
75,25886906,Church-Yonge Corridor (75),-79.379017,43.659649,2443,45,736,410703,54044,3090
95,25886874,Annex (95),-79.404001,43.671585,2328,201,755,993491,25719,1150
14,25886767,Islington-City Centre West (14),-79.543317,43.633463,2263,204,755,491678,45794,2534
130,25886411,Milliken (130),-79.275009,43.820691,2090,220,770,387879,16901,1921
78,25886955,Kensington-Chinatown (78),-79.39724,43.653554,1883,247,728,477989,37205,2523
31,25886688,Yorkdale-Glen Park (31),-79.457108,43.714672,1468,82,729,421045,24685,1502


In [23]:
import matplotlib.cm as cm
import matplotlib.colors as colors

NUMBER_HOODS= 10

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(top10_business['LATITUDE'], top10_business['LONGITUDE'], top10_business['AREA_NAME'], top10_business['Businesses']):
    label = folium.Popup(str(poi) + ' Businesses: ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=30,
        popup=label,
        color='green',
        fill=True,
        fill_color='green',
        fill_opacity=0.7).add_to(map_toronto)

Next comes local employment, the second most relevant factor:

In [24]:
top10=df_economics.nlargest(10, ['Local Employment']) 
top10.head(10)

Neighbourhood Id,AREA_ID,AREA_NAME,LONGITUDE,LATITUDE,Businesses,Child Care Spaces,Debt Risk Score,Home Prices,Local Employment,Social Assistance Recipients
76,25886914,Bay Street Corridor (76),-79.385721,43.657511,4324,30,755,457787,185891,678
77,25886962,Waterfront Communities-The Island (77),-79.377202,43.63388,2899,130,748,416759,88058,2864
1,25886718,West Humber-Clairville (1),-79.596356,43.71618,2463,195,719,317508,58271,2912
75,25886906,Church-Yonge Corridor (75),-79.379017,43.659649,2443,45,736,410703,54044,3090
14,25886767,Islington-City Centre West (14),-79.543317,43.633463,2263,204,755,491678,45794,2534
27,25886593,York University Heights (27),-79.488883,43.765736,2643,156,698,359372,42885,4720
78,25886955,Kensington-Chinatown (78),-79.39724,43.653554,1883,247,728,477989,37205,2523
95,25886874,Annex (95),-79.404001,43.671585,2328,201,755,993491,25719,1150
42,25886643,Banbury-Don Mills (42),-79.349718,43.737657,834,255,776,613647,25614,1030
31,25886688,Yorkdale-Glen Park (31),-79.457108,43.714672,1468,82,729,421045,24685,1502


In [25]:
import matplotlib.cm as cm
import matplotlib.colors as colors

NUMBER_HOODS= 10

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(top10['LATITUDE'], top10['LONGITUDE'], top10['AREA_NAME'], top10['Local Employment']):
    label = folium.Popup(str(poi) + ' Employment: ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=20,
        popup=label,
        color='lightgreen',
        fill=True,
        fill_color='lightgreen',
        fill_opacity=0.9).add_to(map_toronto)

We conclude with the average home price, the final factor relevant to understanding how developed an area is:

In [26]:
top10=df_economics.nlargest(10, ['Home Prices']) 
top10.head(10)

Neighbourhood Id,AREA_ID,AREA_NAME,LONGITUDE,LATITUDE,Businesses,Child Care Spaces,Debt Risk Score,Home Prices,Local Employment,Social Assistance Recipients
41,25886663,Bridle Path-Sunnybrook-York Mills (41),-79.378904,43.731013,58,0,788,1849084,11756,37
101,25886389,Forest Hill South (101),-79.414318,43.694526,175,30,775,1585984,1945,134
40,25886620,St.Andrew-Windfields (40),-79.379037,43.756246,591,124,774,1363202,13023,606
98,25886836,Rosedale-Moore Park (98),-79.379669,43.68282,683,162,777,1265389,19160,499
103,25886812,Lawrence Park South (103),-79.406039,43.717212,167,140,784,1215390,1634,240
39,25886655,Bedford Park-Nortown (39),-79.420227,43.731486,676,60,776,1191040,8187,484
96,25886842,Casa Loma (96),-79.408007,43.681852,258,142,771,1083381,3283,189
56,25886328,Leaside-Bennington (56),-79.366072,43.703797,313,274,791,1071823,2901,239
97,25886834,Yonge-St.Clair (97),-79.397871,43.687859,468,20,771,995616,7858,283
95,25886874,Annex (95),-79.404001,43.671585,2328,201,755,993491,25719,1150


In [27]:
# Mark the ten neighbourhoods with the highest home prices (light blue).
for lat, lon, name, price in zip(top10['LATITUDE'], top10['LONGITUDE'], top10['AREA_NAME'], top10['Home Prices']):
    label = folium.Popup(str(name) + ' House prices: ' + str(price), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=20,
        popup=label,
        color='lightblue',
        fill=True,
        fill_color='lightblue',
        fill_opacity=0.7).add_to(map_toronto)
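
The marker loops above differ only in the column plotted, the popup text, the colour and the radius. A minimal sketch of how they could be folded into one helper follows. `marker_specs` and `add_markers` are hypothetical names that do not appear in the original notebook, and the two sample rows are copied from the home-prices table above.

```python
import pandas as pd


def marker_specs(df, value_col, label, color, radius=10, opacity=0.7):
    """Build one dict of circle-marker settings per row of df (hypothetical helper)."""
    specs = []
    for lat, lon, name, value in zip(df['LATITUDE'], df['LONGITUDE'], df['AREA_NAME'], df[value_col]):
        specs.append({
            'location': [lat, lon],
            'popup_text': f'{name} {label}: {value}',
            'radius': radius,
            'color': color,
            'fill_opacity': opacity,
        })
    return specs


def add_markers(fmap, specs):
    """Draw the specs on a folium map.

    folium is imported here so that the spec-building step above
    can be checked without it being installed.
    """
    import folium
    for spec in specs:
        folium.CircleMarker(
            location=spec['location'],
            radius=spec['radius'],
            popup=folium.Popup(spec['popup_text'], parse_html=True),
            color=spec['color'],
            fill=True,
            fill_color=spec['color'],
            fill_opacity=spec['fill_opacity'],
        ).add_to(fmap)
    return fmap


# Two rows copied from the home-prices table, for illustration only.
sample = pd.DataFrame({
    'AREA_NAME': ['Bridle Path-Sunnybrook-York Mills (41)', 'Forest Hill South (101)'],
    'LONGITUDE': [-79.378904, -79.414318],
    'LATITUDE': [43.731013, 43.694526],
    'Home Prices': [1849084, 1585984],
})
specs = marker_specs(sample, 'Home Prices', 'House prices', 'lightblue', radius=20)
print(specs[0]['popup_text'])
```

With a folium map at hand, `add_markers(map_toronto, specs)` would draw the same circles as the loop above.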

### 4.1.2 Public Safety

As a general rule, we will also strive to locate the headquarters in the safest possible area of the city. For this we use the safety dataset, which reports the number of incidents recorded in each area. We map the ten neighbourhoods with the most incidents in each of the following categories, since these are the areas to avoid:
* Break & Enters
* Fires & Fire Alarms
* Robberies

In [28]:
# Break & Enters: mark the ten neighbourhoods with the most incidents (red).
top10 = df_safety.nlargest(10, ['Break & Enters'])
for lat, lon, name, incidents in zip(top10['LATITUDE'], top10['LONGITUDE'], top10['AREA_NAME'], top10['Break & Enters']):
    label = folium.Popup(str(name) + ' Break & Enters: ' + str(incidents), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=15,
        popup=label,
        color='red',
        fill=True,
        fill_color='red',
        fill_opacity=0.7).add_to(map_toronto)

# Fires & Fire Alarms: mark the ten neighbourhoods with the most incidents (orange).
top10 = df_safety.nlargest(10, ['Fires & Fire Alarms'])
for lat, lon, name, incidents in zip(top10['LATITUDE'], top10['LONGITUDE'], top10['AREA_NAME'], top10['Fires & Fire Alarms']):
    label = folium.Popup(str(name) + ' Fires & Fire Alarms: ' + str(incidents), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=10,
        popup=label,
        color='orange',
        fill=True,
        fill_color='orange',
        fill_opacity=0.7).add_to(map_toronto)

# Robberies: mark the ten neighbourhoods with the most incidents (yellow).
top10 = df_safety.nlargest(10, ['Robberies'])
for lat, lon, name, incidents in zip(top10['LATITUDE'], top10['LONGITUDE'], top10['AREA_NAME'], top10['Robberies']):
    label = folium.Popup(str(name) + ' Robberies: ' + str(incidents), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=10,
        popup=label,
        color='yellow',
        fill=True,
        fill_color='yellow',
        fill_opacity=0.7).add_to(map_toronto)

# Display the map with all layers added so far.
map_toronto

### 4.1.3 Partial conclusions

So far we have tackled the first part of the project: defining an economic framework to identify the top locations for a startup, based on a variety of industry-wide factors. From this perspective, our leading options are the following:
* Yorkdale-Glen Park
* Waterfront Communities-The Island
* Rosedale-Moore Park

One interesting insight from this map is that the more economically prosperous an area is, the more it tends to concentrate minor crimes such as robbery and theft. However, compromises are possible: some of the highly ranked areas above combine strong economic development with a relatively safe record.

## 4.2 Business case

As anticipated earlier in the project, we will advise Standard Digital Services Ltd., a promising startup that aims to connect financial and small-business data to offer better loan conditions to individual stakeholders while connecting them with banks at premium conditions. The CEO, Elisabeth Mapple, has an intuition about what they want, although as a startup their strategy is not yet clearly detailed. Their ideal location would have to fulfil the following:
* **Branding:** they want to be seen alongside the big players, so location and prestige are mandatory.
* **Customers:** they target customers who need a loan and are comfortable with digital technologies; their ideal customer is between 25 and 54 years old. Apart from this, any individual with an income of 100,000 dollars or higher is also a potential customer.
* **Providers:** as an IT company, they need to be close to banks and to have storage facilities nearby for their infrastructure. We will leverage Foursquare for this analysis.
* **IP protection:** because of the previous point, break-and-enter hot spots are to be avoided at all costs.

Branding (location) and IP protection (public safety) have already been covered, so we now focus on customer segmentation (demographics) and providers (via Foursquare).

### 4.2.1 Customer analysis

With the customer description above, we can narrow the Toronto demographics dataset down to the columns we need:

In [29]:
top10_demographics=df_demographics[['AREA_NAME','LONGITUDE','LATITUDE','Working Age (25-54 years)','Median income - $100k and over']]

In [30]:
top10_income=top10_demographics.nlargest(10, ['Median income - $100k and over']) 
top10_income.head(10)

| Neighbourhood Id | AREA_NAME | LONGITUDE | LATITUDE | Working Age (25-54 years) | Median income - $100k and over |
| ---------------- |:---------:|:---------:|:--------:|:-------------------------:|:------------------------------:|
| 77 | Waterfront Communities-The Island (77) | -79.377202 | 43.63388 | 45105 | 11300 |
| 98 | Rosedale-Moore Park (98) | -79.379669 | 43.68282 | 7925 | 5580 |
| 95 | Annex (95) | -79.404001 | 43.671585 | 15040 | 5255 |
| 39 | Bedford Park-Nortown (39) | -79.420227 | 43.731486 | 8410 | 5095 |
| 82 | Niagara (82) | -79.41242 | 43.636681 | 23320 | 4775 |
| 51 | Willowdale East (51) | -79.401484 | 43.770602 | 25850 | 4300 |
| 63 | The Beaches (63) | -79.299601 | 43.67105 | 9590 | 4280 |
| 103 | Lawrence Park South (103) | -79.406039 | 43.717212 | 5870 | 4080 |
| 56 | Leaside-Bennington (56) | -79.366072 | 43.703797 | 6455 | 3885 |
| 14 | Islington-City Centre West (14) | -79.543317 | 43.633463 | 20640 | 3830 |


In [31]:
top10_customer=top10_demographics.nlargest(10, ['Working Age (25-54 years)'])
top10_customer.head(10)

| Neighbourhood Id | AREA_NAME | LONGITUDE | LATITUDE | Working Age (25-54 years) | Median income - $100k and over |
| ---------------- |:---------:|:---------:|:--------:|:-------------------------:|:------------------------------:|
| 77 | Waterfront Communities-The Island (77) | -79.377202 | 43.63388 | 45105 | 11300 |
| 51 | Willowdale East (51) | -79.401484 | 43.770602 | 25850 | 4300 |
| 82 | Niagara (82) | -79.41242 | 43.636681 | 23320 | 4775 |
| 137 | Woburn (137) | -79.228586 | 43.76674 | 21945 | 1045 |
| 14 | Islington-City Centre West (14) | -79.543317 | 43.633463 | 20640 | 3830 |
| 93 | Dovercourt-Wallace Emerson-Junction (93) | -79.438541 | 43.665677 | 19790 | 1860 |
| 75 | Church-Yonge Corridor (75) | -79.379017 | 43.659649 | 18780 | 3410 |
| 131 | Rouge (131) | -79.186343 | 43.821201 | 18510 | 2110 |
| 132 | Malvern (132) | -79.222517 | 43.803658 | 17865 | 500 |
| 17 | Mimico (includes Humber Bay Shores) (17) | -79.500137 | 43.615924 | 17695 | 3545 |
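
Since the client cares about both criteria, a natural follow-up is to ask which neighbourhoods appear in both top-10 lists. The sketch below rebuilds the two rankings from the rows printed above (the full `df_demographics` is not reproduced here) and keeps the areas present in both. The variable names `demo`, `by_income`, `by_age` and `both` are illustrative and are not part of the original notebook.

```python
import pandas as pd

# (Neighbourhood Id, name, working age 25-54, income $100k and over),
# copied from the two tables above.
rows = [
    (77, 'Waterfront Communities-The Island (77)', 45105, 11300),
    (98, 'Rosedale-Moore Park (98)', 7925, 5580),
    (95, 'Annex (95)', 15040, 5255),
    (39, 'Bedford Park-Nortown (39)', 8410, 5095),
    (82, 'Niagara (82)', 23320, 4775),
    (51, 'Willowdale East (51)', 25850, 4300),
    (63, 'The Beaches (63)', 9590, 4280),
    (103, 'Lawrence Park South (103)', 5870, 4080),
    (56, 'Leaside-Bennington (56)', 6455, 3885),
    (14, 'Islington-City Centre West (14)', 20640, 3830),
    (137, 'Woburn (137)', 21945, 1045),
    (93, 'Dovercourt-Wallace Emerson-Junction (93)', 19790, 1860),
    (75, 'Church-Yonge Corridor (75)', 18780, 3410),
    (131, 'Rouge (131)', 18510, 2110),
    (132, 'Malvern (132)', 17865, 500),
    (17, 'Mimico (includes Humber Bay Shores) (17)', 17695, 3545),
]
cols = ['Neighbourhood Id', 'AREA_NAME',
        'Working Age (25-54 years)', 'Median income - $100k and over']
demo = pd.DataFrame(rows, columns=cols).set_index('Neighbourhood Id')

by_income = demo.nlargest(10, 'Median income - $100k and over')
by_age = demo.nlargest(10, 'Working Age (25-54 years)')

# Keep the income ranking order, restricted to areas that are
# also in the working-age top 10.
both = by_income[by_income.index.isin(by_age.index)]
print(both['AREA_NAME'].tolist())
```

On these rows, four neighbourhoods appear in both lists: Waterfront Communities-The Island, Niagara, Willowdale East and Islington-City Centre West.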


Displayed on the map:

In [32]:
# Top 10 by number of residents with an income of $100k and over (light blue).
for lat, lon, name, residents in zip(top10_income['LATITUDE'], top10_income['LONGITUDE'], top10_income['AREA_NAME'], top10_income['Median income - $100k and over']):
    label = folium.Popup(str(name) + ' Median income - $100k and over: ' + str(residents), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=10,
        popup=label,
        color='lightblue',
        fill=True,
        fill_color='lightblue',
        fill_opacity=0.7).add_to(map_toronto)

# Top 10 by working-age population, 25-54 years (blue).
for lat, lon, name, residents in zip(top10_customer['LATITUDE'], top10_customer['LONGITUDE'], top10_customer['AREA_NAME'], top10_customer['Working Age (25-54 years)']):
    label = folium.Popup(str(name) + ' Working Age (25-54 years): ' + str(residents), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=10,
        popup=label,
        color='blue',
        fill=True,
        fill_color='blue',
        fill_opacity=0.7).add_to(map_toronto)
map_toronto

### 4.2.2 Providers display

The final step of this analysis is to display the potential providers close to the locations we are targeting. To do so, we filter the Foursquare venues dataset to keep only the venues labelled as banks or storage facilities:

In [33]:
df_providers = toronto_venues[(toronto_venues['Venue Category'] == 'Bank') | (toronto_venues['Venue Category'] == 'Storage Facility')]
df_providers.head()

| | Neighbourhood | Neighbourhood Latitude | Neighbourhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
| --- |:-------------:|:----------------------:|:-----------------------:|:-----:|:--------------:|:---------------:|:--------------:|
| 6 | West Humber-Clairville (1) | 43.71618 | -79.596356 | TD Canada Trust | 43.71963 | -79.599896 | Bank |
| 10 | West Humber-Clairville (1) | 43.71618 | -79.596356 | Access Storage - Etobicoke | 43.718771 | -79.593344 | Storage Facility |
| 22 | Thistletown-Beaumond Heights (3) | 43.737988 | -79.563491 | TD Canada Trust | 43.736602 | -79.562379 | Bank |
| 57 | Markland Wood (12) | 43.633542 | -79.573431 | TD Canada Trust | 43.631251 | -79.575869 | Bank |
| 136 | York University Heights (27) | 43.765736 | -79.488883 | TD Canada Trust | 43.762826 | -79.490243 | Bank |
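
The same filter can also be written with `isin`, which stays readable if more provider categories are added later, and a `groupby` then counts the providers found per neighbourhood. This is a sketch on the five rows shown above plus one made-up non-provider row (marked in the code) so that the filter has something to discard. `venues_sample` and `provider_categories` are illustrative names.

```python
import pandas as pd

venues_sample = pd.DataFrame({
    'Neighbourhood': [
        'West Humber-Clairville (1)',
        'West Humber-Clairville (1)',
        'Thistletown-Beaumond Heights (3)',
        'Markland Wood (12)',
        'York University Heights (27)',
        'Markland Wood (12)',          # hypothetical row, not from the dataset
    ],
    'Venue': [
        'TD Canada Trust',
        'Access Storage - Etobicoke',
        'TD Canada Trust',
        'TD Canada Trust',
        'TD Canada Trust',
        'Example Coffee Shop',         # hypothetical venue, for illustration only
    ],
    'Venue Category': [
        'Bank', 'Storage Facility', 'Bank', 'Bank', 'Bank',
        'Coffee Shop',                 # hypothetical category to be filtered out
    ],
})

# Keep only venues whose category is one of the provider types.
provider_categories = ['Bank', 'Storage Facility']
providers = venues_sample[venues_sample['Venue Category'].isin(provider_categories)]

# Number of providers found in each neighbourhood, largest first.
counts = providers.groupby('Neighbourhood').size().sort_values(ascending=False)
print(counts)
```

On the real data, `toronto_venues` would take the place of `venues_sample`.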


In [34]:
# Mark every bank and storage facility at its own coordinates (purple).
for lat, lon, neighbourhood, category in zip(df_providers['Venue Latitude'], df_providers['Venue Longitude'], df_providers['Neighbourhood'], df_providers['Venue Category']):
    label = folium.Popup(str(neighbourhood) + ' Venue Category: ' + str(category), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color='purple',
        fill=True,
        fill_color='purple',
        fill_opacity=0.7).add_to(map_toronto)
map_toronto
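
The client asked for providers that are close, and the map conveys this only visually. One way to put a number on it, offered here as an assumption rather than something the notebook does, is the great-circle (haversine) distance between a neighbourhood centroid and a venue. The coordinates below are taken from the first provider row shown above; `haversine_km` is a hypothetical helper, and 6371 km is the commonly used mean Earth radius.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# West Humber-Clairville centroid and the TD Canada Trust branch listed for it.
centroid = (43.71618, -79.596356)
bank = (43.71963, -79.599896)

distance_km = haversine_km(*centroid, *bank)
print(f'{distance_km:.2f} km')
```

For this pair the distance comes out at roughly half a kilometre. Applied to every row of `df_providers`, the same function would allow providers to be ranked by distance instead of judged by eye.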

# 5. Results

With the map above, which is the main output of our analysis, we can select the top three locations that would be feasible as headquarters for our client:

| Candidates    | # Businesses  | Local employment  | Customer base (ages 25-54) | Close providers | Top 10 Break and enter | Top 10 Robberies | Status |
| ------------- |:-------------:| :------------------:|:------------:|:---------------:|:----------------------:|:----------------:|:------:|
| Waterfront Communities-The Island      | 2899 | 88058 | 45105 | Yes | No | No | Mandatory and optional criteria approved |
| Church-Yonge Corridor      | 2443 | 54044 | 18780 | Yes | No | Yes | Mandatory criteria approved |
| Yorkdale-Glen Park      | 1468 | 24685 | Not in top 10 | Yes | No | Yes | Mandatory criteria approved |

Among them, **Waterfront Communities-The Island** clearly performs best on the metrics requested by the client, while keeping the IP secure, as it appears in neither the break-and-enter top 10 nor the robbery top 10.

# 6. Discussions 
One curious insight is that, despite the strong customer base, the absence of break-and-enter incidents is the client's main concern. This shows how personal judgements (such as the perception of safety) often weigh on the decision-making process alongside, and sometimes over, the data.

Another curious insight is that real estate prices (in light blue) do not appear to correlate with the number of businesses or with local employment, factors that one would expect to influence housing prices.
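
An observation like this one can be checked numerically with a correlation matrix. The sketch below uses only the ten rows of the top-10 home-prices table from section 4.1.1, so it illustrates the method rather than settling the question; a real check would call `corr()` on the full `df_economics`. `econ_sample` is an illustrative name.

```python
import pandas as pd

# Rows copied from the top-10 home-prices table, in the same order.
econ_sample = pd.DataFrame({
    'Businesses': [58, 175, 591, 683, 167, 676, 258, 313, 468, 2328],
    'Home Prices': [1849084, 1585984, 1363202, 1265389, 1215390,
                    1191040, 1083381, 1071823, 995616, 993491],
    'Local Employment': [11756, 1945, 13023, 19160, 1634,
                         8187, 3283, 2901, 7858, 25719],
})

# Pairwise Pearson correlation between the three columns.
corr = econ_sample.corr()
print(corr['Home Prices'].round(2))
```

Values close to 0 in the `Home Prices` column would support the observation, while values close to 1 or -1 would contradict it. A sample made only of the most expensive neighbourhoods is biased by construction, so the result should be read with care.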

# 7. Conclusions

In this project, we have stepped into the role of a consultant, advising and helping a young digital company to set up in the city of Toronto by leveraging the capabilities of the Foursquare API and complementing it with economic, demographic and safety datasets provided by the city of Toronto via its Open Data web portal.

We have tested the power of visualizations to draw valuable conclusions from limited data, and seen how decision-making can become biased depending on what the client, the final stakeholder of this analysis, demands.

We have also been able to uncover additional relationships that were hidden across the different datasets.

To wrap up, we confirm the importance of a structured approach to data visualization and data understanding, and the need to reinforce decision-making based on data over intuition.
