# A new gym for Toronto

## Background

The following notebook is based around the question, in which neighbourhood in Toronto one would want to open a new gym. The subject targets individual business people, who would like to own their own company and decided to establish a sports venue and also gym chains, which want to open a new venue in a neighbourhood without a gym. 

## Data

For this project, I will use, inter alia, data from a Wikipedia page, which presents a list of all postal codes with their boroughs and neighbourhoods. Please find a link here below: 
'https://web.archive.org/web/20200303121502/https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M' 

Additionally, for the purpose of locating the different postcode areas, I will use the gespatial data of Toronto provided on the Coursera website, which was used in the previous segmentation and clustering assignment. Finally, to find out which neighbourhood best fits our purpose, I will use Foursquare data for the different neighbourhoods to see which venues correlate with the existence of a gym in an area.

## Methodological Section

#### Import libraries

In [2]:
import pandas as pd
import numpy as np

# library to open URLs
import urllib.request

#library for parsing HTML and XML documents
from bs4 import BeautifulSoup

!conda install -c conda-forge folium=0.5.0 --yes # install folium package to visualize map
import folium # map rendering library

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # to create requests from Foursquare database

from sklearn.metrics import r2_score # for regression model evaluation
from sklearn.preprocessing import PolynomialFeatures
from sklearn import linear_model

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - folium=0.5.0


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    python_abi-3.6             |          1_cp36m           4 KB  conda-forge
    folium-0.5.0               |             py_0          45 KB  conda-forge
    vincent-0.4.4              |             py_1          28 KB  conda-forge
    certifi-2019.11.28         |   py36h9f0ad1d_1         149 KB  conda-forge
    branca-0.4.0               |             py_0          26 KB  conda-forge
    altair-4.0.1               |             py_0         575 KB  conda-forge
    ca-certificates-2019.11.28 |       hecc5488_0         145 KB  conda-forge
    openssl-1.1.1e             |       h516909a_0         2.1 MB  conda-forge
    ------------------------------------------------------------
                       

#### Import data from wikipedia page

In [3]:
# define URL of the page we want to scrap
url = 'https://web.archive.org/web/20200303121502/https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
        
# Open URL using urllib.request and put the HTML into the page variable
wiki_page = urllib.request.urlopen(url)

# Parse the HTML from the URL into the BeautifulSoup parse tree format
soup = BeautifulSoup(wiki_page,'lxml' )

# find the right table in the HTML code
table = soup.find('table', class_ ="wikitable sortable" )

#### Create a dataframe with the information from the wikipadia page

In [4]:
# define empty lists for the 3 columns 
A=[]
B=[]
C=[]

# iterate through the rows to extract the information for the different postcodes
# find row start by searching for "tr"
for row in table.findAll('tr'): 
    # find column start and end by searching for "td"
    cells = row.findAll('td')
    # fill in the column lists, when information for 3 columns received
    if len(cells)==3:
        A.append(cells[0].find(text=True))
        B.append(cells[1].find(text=True))
        C.append(cells[2].find(text=True))
        
# put all 3 column lists to one dataframe
df = pd.DataFrame(A, columns=['Postcode'])
df['Borough'] = B
df['Neighbourhood'] = C
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


#### Data wrangling

In [5]:
# only display Boroughs, which are assigned
df = df[df['Borough'] != 'Not assigned']

#reset index for future use
df.reset_index(drop=True, inplace=True)

# Group the neighbourhoods with the same postcode and write them in one cell
# define n as the number of the last index to be able to iterate from the back
n = df['Postcode'].count() -1

# interate from the back with a for loop
for postcode in range(n):
    
    # check whether the postcode is the same as the one above
    if df['Postcode'][n] == df['Postcode'][n-1]:
        
        # if the postcodes are the same, define the Neighbourhood, which is above, as both Neighbourhoods seperated by a comma
        df['Neighbourhood'][n-1] = df['Neighbourhood'][n] + ", " + df['Neighbourhood'][n-1] 
        
        # drop the line below to prevent duplicates 
        df.drop(n, inplace=True)
    n = n-1
    

# add geospatial data to the dataframe    
# read the csv file with the geospatial data
geo = pd.read_csv('http://cocl.us/Geospatial_data')

# rename the column for the Postcode for merging purposes
geo.rename(columns={'Postal Code': 'Postcode'}, inplace=True) 
geo.head()

# merge the two dataframe on the Postcode
df_geo = pd.merge(df,geo,how='inner', on='Postcode')
df_geo.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494


#### Import Foursquare data

In [6]:
# The code was removed by Watson Studio for sharing.

In [7]:
# create function to iterate through the dataframe and create requests of the venues in the neighbourhoods
LIMIT = 100

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])
        
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Postcode Latitude', 
                  'Postcode Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [8]:
# run function for each postcode

toronto_venues = getNearbyVenues(names=df_geo['Neighbourhood'],
                                   latitudes=df_geo['Latitude'],
                                   longitudes=df_geo['Longitude']
                                  )
toronto_venues.shape

(2197, 7)

In [9]:
# find out which different names are associated with the venue 'gym'

df_gym = toronto_venues[toronto_venues['Venue Category'].str.contains('Gym')]
df_gym['Venue Category'].unique()

array(['Gym', 'Gym / Fitness Center', 'Climbing Gym', 'College Gym'],
      dtype=object)

#### Prepare the dataframe for our use

In [10]:
# create dummies for the existence of a venue for each neighbourhood

toronto_dummies = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_dummies['Neighbourhood'] = toronto_venues['Neighbourhood']

# move neighbourhood column to the first column
fixed_columns = [toronto_dummies.columns[-1]] + list(toronto_dummies.columns[:-1])
toronto_dummies = toronto_dummies[fixed_columns]

# group rows by neighbourhood and calculate the mean of the frequency of occurrence of each category
toronto_grouped = toronto_dummies.groupby('Neighbourhood').sum().reset_index()
toronto_grouped.head()

Unnamed: 0,Neighbourhood,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Agincourt,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Bayview Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Berczy Park,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
3,Business Reply Mail Processing Centre 969 East...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Caledonia-Fairbanks,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [11]:
# aggregate the 4 different types of gyms to one column

toronto_grouped['Gym_all'] = toronto_grouped[['Gym', 'Gym / Fitness Center', 'Climbing Gym', 'College Gym']].sum(axis=1)
toronto_grouped.head()

Unnamed: 0,Neighbourhood,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Gym_all
0,Agincourt,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Bayview Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Berczy Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Business Reply Mail Processing Centre 969 East...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,Caledonia-Fairbanks,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [12]:
# drop the individual gym columns
toronto_grouped.drop(['Gym', 'Gym / Fitness Center', 'Climbing Gym', 'College Gym'], axis=1, inplace=True)
toronto_grouped.rename(columns={'Gym_all':'Gym'}, inplace=True)
toronto_grouped.head()

Unnamed: 0,Neighbourhood,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Gym
0,Agincourt,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Bayview Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Berczy Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Business Reply Mail Processing Centre 969 East...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,Caledonia-Fairbanks,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


##### Separating the data into 2 dataframes, one with all neighbourhoods, where gyms exist (for training the regression model) and one with all neighbourhoods without a gym (to find out in which neighbourhood, we should place our gym

In [13]:
toronto_gym = toronto_grouped[toronto_grouped['Gym'] > 0]
toronto_gym.shape

(29, 268)

In [14]:
toronto_nogym = toronto_grouped[toronto_grouped['Gym'] == 0]
toronto_nogym.shape

(69, 268)

#### Data Analysis

In [31]:
# train / test split
msk = np.random.rand(len(toronto_gym)) < 0.8
train = toronto_gym[msk]
test = toronto_gym[~msk]

X_train = train[train.columns[1:267]]
y_train = train['Gym']
X_test = test[test.columns[1:267]]
y_test = test['Gym']

In [32]:
# train the model

poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
clf = linear_model.LinearRegression()
y_train = clf.fit(X_train_poly, y_train)

In [33]:
from sklearn.metrics import r2_score

In [34]:
# Evaluation
X_test_poly = poly.fit_transform(X_test)
y_test_ = clf.predict(X_test_poly)
print("R2-score: %.2f" % r2_score(y_test_ , y_test) )

R2-score: 0.95


##### Application of the model to the dataframe without gyms

In [35]:
Nogym_poly = poly.fit_transform(toronto_nogym[toronto_nogym.columns[1:267]])
toronto_nogym['Gym prediction'] = clf.predict(Nogym_poly)
toronto_gymprediction = toronto_nogym[['Neighbourhood', 'Gym prediction']]
toronto_gymprediction.sort_values('Gym prediction', ascending=False, inplace=True)
toronto_top5 = toronto_gymprediction[0:5]
toronto_top5

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Unnamed: 0,Neighbourhood,Gym prediction
31,"Kensington Market, Grange Park, Chinatown",2.009892
63,"St. James Town, Cabbagetown",1.974261
45,"Oriole\n, Henry Farm, Fairview\n",1.858331
78,"Trinity, Little Portugal",1.753322
24,Harbourfront,1.586246


##### Visualization of the 5 neighbourhoods 

In [36]:
# Get the latitude and longitude from the df_geo dataframe
# merge the two dataframe on the Neighbourhood name
result = pd.merge(toronto_top5,df_geo,how='inner', on='Neighbourhood')
result

Unnamed: 0,Neighbourhood,Gym prediction,Postcode,Borough,Latitude,Longitude
0,"Kensington Market, Grange Park, Chinatown",2.009892,M5T,Downtown Toronto,43.653206,-79.400049
1,"St. James Town, Cabbagetown",1.974261,M4X,Downtown Toronto,43.667967,-79.367675
2,"Oriole\n, Henry Farm, Fairview\n",1.858331,M2J,North York,43.778517,-79.346556
3,"Trinity, Little Portugal",1.753322,M6J,West Toronto,43.647927,-79.41975
4,Harbourfront,1.586246,M5A,Downtown Toronto,43.65426,-79.360636


In [37]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[43.653963, -79.387207], zoom_start=11)

# add markers to map
for lati, long, label in zip(result['Latitude'], result['Longitude'], result['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lati, long],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto