# Report of the Final Project

In this notebook we address the following requirements of the final project:
+ **Introduction** where you discuss the business problem and who would be interested in this project.
+ Data where you **describe the data** that will be used to solve the problem and the source of the data.
+ **Methodology** section which represents the main component of the report where you discuss and describe any exploratory data analysis that you did, any inferential statistical testing that you performed, if any, and which machine learning techniques were used and why.
+ **Results** section where you discuss the results.
+ **Discussion** section where you discuss any observations you noted and any recommendations you can make based on the results.
+ **Conclusion** section where you conclude the report.

An accompanying presentation for the reviewers is available through the provided links or on GitHub. The methodology is discussed along the way, between the code cells.


## 1. Introduction
To start, we state the problem and give a description of the data we will be using.

**Definition of the problem:**

Suppose we are an individual immigrating to Toronto who wants to know the best neighborhood to live in. The question is then:

**Which neighborhood in Toronto should be picked for purchasing a new house or condo?**

This question is relevant because the transaction costs involved in buying a home are substantial, so we would prefer to base the decision on real data. That way, the choice can be the best one for the individual's well-being.

## 2. Data Description

We use the skills learned in the course to combine four data sources: a table scraped from Wikipedia, geographical coordinates, venue data retrieved from Foursquare, and a dataset of our own expectations/ratings, used to maximize our utility/well-being.

The Wikipedia data contains a list of postal codes of Canada. It is retrieved by scraping a table from the website. This data is used to create a geographical segmentation of Toronto based on postal code, and it is linked with the actual coordinates later, much as we did in the previous projects.

In [4]:
!pip install bs4
!pip install lxml



In [5]:
# Install folium before importing it.
!conda install -c conda-forge folium=0.5.0 --yes

import urllib.request

import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import folium # plotting library

print('Libraries imported')


In [6]:
#Obtain the Wikipedia article as a local copy.
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
request = urllib.request.urlopen(url)
wiki_article = request.read().decode()

with open('List_of_postal_codes_of_Canada:_M.html', 'w') as fo:
    fo.write(wiki_article)
    

# Load article, use beautiful soup to get the tables.
wiki_article = open('List_of_postal_codes_of_Canada:_M.html').read()
soup = BeautifulSoup(wiki_article, 'html.parser')
tables = soup.find_all('table', class_='sortable')

# Search through all the tables, identify the table with the header we want.
for table in tables:
    header_cells = table.find_all('th')
    header = [th.text.strip() for th in header_cells]
    # Compare the first three header cells with the three expected column names.
    if header[:3] == ['Postcode', 'Borough', 'Neighborhood']:
        break

# Extract the columns we want and write to a semicolon-delimited text file.
with open('List_of_postal_codes_of_Canada:_M.txt', 'w') as fo:
    for tr in table.find_all('tr'):
        tds = tr.find_all('td')
        if not tds:
            continue
        # Take exactly three cells so the unpacking into three names cannot fail.
        Postcode, Borough, Neighborhood = [td.text.strip() for td in tds[:3]]
        
        print('; '.join([Postcode, Borough, Neighborhood]), file=fo)
        
# Load to pandas dataframe        
df = pd.read_table('List_of_postal_codes_of_Canada:_M.txt', delimiter = ';', header = None)
df.columns = ['PostCode', 'Borough', 'Neighborhood']

# Ignore not assigned borough
df1 = df[df['Borough'] != ' Not assigned']
df1.reset_index(inplace = True, drop = True)

# Assign borough value to neighborhood if the neighborhood is not assigned.
neigh_list = []
for borough, neighborhood in zip(df1['Borough'], df1['Neighborhood']):
    # The leading space comes from the '; ' separator written to the text file.
    if neighborhood == ' Not assigned':
        neigh_list.append(borough)
    else:
        neigh_list.append(neighborhood)

post_list = df1['PostCode'].tolist()
br_list = df1['Borough'].tolist()

df1 = pd.DataFrame([post_list, br_list, neigh_list]).T
df1.columns = ['PostCode', 'Borough', 'Neighborhood']
df1.head()

# Group dataframe by PostCode and combine neighborhood values.
borough_list =[] 
Neighborhood_list = []

for item in df1['Borough']:
    item_new = str(item)[1:] + ':'
    borough_list.append(item_new)
    
for item in df1['Neighborhood']:
    item_new = str(item)[1:] + ':'
    Neighborhood_list.append(item_new)
    
PostCode_list = df1['PostCode'].tolist()

df2 = pd.DataFrame([PostCode_list, borough_list, Neighborhood_list]).T
df2.columns = ['PostCode', 'Borough', 'Neighborhood']

new_df = df2.groupby('PostCode').sum()

borough_list=[]
Neighborhood_list = []
PostCode_list = new_df.index.tolist()

for item in new_df['Borough']:
    item_new = np.unique(np.array(str(item).split(':')))[1]
    borough_list.append(item_new)

for item in new_df['Neighborhood']:
    item_new = str(np.array(str(item).split(':'))[:-1].tolist())[1:][:-1].replace("'","")
    Neighborhood_list.append(item_new)

df3 = pd.DataFrame([PostCode_list, borough_list, Neighborhood_list]).T
df3.columns = ['PostCode', 'Borough', 'Neighborhood']
df3



| | PostCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern / Rouge |
| 1 | M1C | Scarborough | Rouge Hill / Port Union / Highland Creek |
| 2 | M1E | Scarborough | Guildwood / Morningside / West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | Kennedy Park / Ionview / East Birchmount Park |
| 7 | M1L | Scarborough | Golden Mile / Clairlea / Oakridge |
| 8 | M1M | Scarborough | Cliffside / Cliffcrest / Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff / Cliffside West |
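
The cleaning and grouping steps in the cell above can probably be written more compactly with pandas alone. The following is only a sketch on made-up rows (the `raw` frame and the `M9Z` code are illustrative assumptions, not scraped data); it drops unassigned boroughs, falls back to the borough name for unassigned neighborhoods, and joins the neighborhoods of each postal code.

```python
import pandas as pd

# Toy rows standing in for the scraped table (illustrative values only).
raw = pd.DataFrame({
    'PostCode': ['M1B', 'M1B', 'M1G', 'M9Z'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough', 'Not assigned'],
    'Neighborhood': ['Malvern', 'Rouge', 'Not assigned', 'Not assigned'],
})

# Ignore rows whose borough is not assigned.
clean = raw[raw['Borough'] != 'Not assigned'].copy()

# Use the borough name when the neighborhood is not assigned.
missing = clean['Neighborhood'] == 'Not assigned'
clean.loc[missing, 'Neighborhood'] = clean.loc[missing, 'Borough']

# One row per postal code, with its neighborhoods joined into one string.
grouped = (clean.groupby(['PostCode', 'Borough'], as_index=False)
                .agg({'Neighborhood': ', '.join}))
print(grouped)
```

Compared with building intermediate lists and splitting on `':'`, this keeps every step as a DataFrame operation, which may be easier to check.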


**We now get the geographical coordinates.** This dataset was provided in the previous project. It contains the coordinates for the various postal codes. The aim of this dataset is to link each postal code with its coordinates.

In [7]:
# Load the csv file from source.
df_co = pd.read_csv('https://cocl.us/Geospatial_data')

lat_list = []
long_list = []

for PostCode_target in df3['PostCode']:
    for PostCode, Latitude, Longitude in zip (df_co['Postal Code'],
                                                 df_co['Latitude'],
                                               df_co['Longitude']):
        if PostCode_target == PostCode:
            lat_list.append(Latitude)
            long_list.append(Longitude)

# Add coordinates information to the pandas dataframe.
Final_df = pd.DataFrame([df3['PostCode'].tolist(), 
                         df3['Borough'].tolist(), 
                         df3['Neighborhood'].tolist(),
                         lat_list,
                         long_list]).T
Final_df.columns = ['PostCode','Borough','Neighborhood', 'Latitude', 'Longitude']
Final_df

| | PostCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Malvern / Rouge | 43.8067 | -79.1944 |
| 1 | M1C | Scarborough | Rouge Hill / Port Union / Highland Creek | 43.7845 | -79.1605 |
| 2 | M1E | Scarborough | Guildwood / Morningside / West Hill | 43.7636 | -79.1887 |
| 3 | M1G | Scarborough | Woburn | 43.771 | -79.2169 |
| 4 | M1H | Scarborough | Cedarbrae | 43.7731 | -79.2395 |
| 5 | M1J | Scarborough | Scarborough Village | 43.7447 | -79.2395 |
| 6 | M1K | Scarborough | Kennedy Park / Ionview / East Birchmount Park | 43.7279 | -79.262 |
| 7 | M1L | Scarborough | Golden Mile / Clairlea / Oakridge | 43.7111 | -79.2846 |
| 8 | M1M | Scarborough | Cliffside / Cliffcrest / Scarborough Village West | 43.7163 | -79.2395 |
| 9 | M1N | Scarborough | Birch Cliff / Cliffside West | 43.6927 | -79.2648 |
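
Linking postal codes with coordinates is a join on the postal code, so it could also be expressed as a pandas merge instead of nested loops. The sketch below uses two small hand-made frames (the `M9Z` row and its coordinates are hypothetical); the column names `Postal Code`, `Latitude` and `Longitude` follow the cell above.

```python
import pandas as pd

# Small stand-ins for df3 and df_co (illustrative rows only).
postcodes = pd.DataFrame({
    'PostCode': ['M1B', 'M1C'],
    'Borough': ['Scarborough', 'Scarborough'],
})
coords = pd.DataFrame({
    'Postal Code': ['M1C', 'M1B', 'M9Z'],      # 'M9Z' is a hypothetical extra code
    'Latitude': [43.7845, 43.8067, 43.0],
    'Longitude': [-79.1605, -79.1944, -79.0],
})

# A left merge keeps every postal code and matches coordinates by key,
# independent of the row order in either frame.
merged = postcodes.merge(
    coords.rename(columns={'Postal Code': 'PostCode'}),
    on='PostCode',
    how='left',
)
print(merged)
```

With a left merge, a postal code that has no entry in the coordinate file would show up with missing coordinates in its own row, rather than shortening the coordinate lists and shifting the values of the rows after it.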


Data retrieved from Foursquare will be used to **evaluate each zone of interest.** For each postal code area, we obtain the frequency of occurrence of the venues of interest through further data wrangling and preparation.

In [8]:
import requests # library to handle requests

import matplotlib.cm as cm
import matplotlib.colors as colors

import folium # map rendering library

print('Libraries imported.')

Libraries imported.


In [9]:
# Extract latitude and longitude of downtown Toronto
df_to = Final_df
lat = Final_df[Final_df['Borough']=='Downtown Toronto'].iloc[0].Latitude
long = Final_df[Final_df['Borough']=='Downtown Toronto'].iloc[0].Longitude

print('The geographical coordinate of downtown Toronto City are {}, {}.'.format(lat, long))

# create map of Toronto using latitude and longitude values.
map_to = folium.Map(location=[lat, long], zoom_start=10)

# add markers to map
for lat, lng, borough, postcode in zip(df_to['Latitude'], df_to['Longitude'], df_to['Borough'], 
                                           df_to['PostCode']):
    label = '{}, {}'.format(postcode, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_to)  
    
map_to

The geographical coordinate of downtown Toronto City are 43.6795626, -79.37752940000001.


In [11]:
from pandas.io.json import json_normalize

# Enter your own credentials below. The actual values were removed before sharing since they are sensitive.
CLIENT_ID = 'YOUR_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # your Foursquare secret
VERSION = '20180605' # Foursquare API version

LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 300 # define radius

neighborhood_latitude = df_to[df_to['PostCode']=='M5C'].Latitude.iloc[0]
neighborhood_longitude = df_to[df_to['PostCode']=='M5C'].Longitude.iloc[0]

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)

# Retrieve information from FourSquare
# results = requests.get(url).json()

# This function comes from the Applied Data Science course materials and is reused in this assignment.
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
#         print(name)       
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']

        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])
    

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    
    nearby_venues.columns = ['PostCode','Neighborhood Latitude','Neighborhood Longitude','Venue','Venue Latitude','Venue Longitude','Venue Category']
    
    return(nearby_venues)

to_venues = getNearbyVenues(names=df_to['PostCode'], latitudes=df_to['Latitude'], longitudes=df_to['Longitude'])
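
`getNearbyVenues` indexes straight into the response, so a venue without categories or a response without groups would raise an error. As a hedged sketch, the parsing step could be isolated in a small helper that tolerates those cases. The helper name `extract_venues` and the `fake` payload are assumptions made for illustration; the nesting (`response` → `groups` → `items` → `venue`) simply mirrors the keys accessed in the function above.

```python
def extract_venues(postcode, payload):
    """Flatten one explore-style response into (postcode, name, lat, lng, category) rows."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Use None when a venue carries no category instead of raising IndexError.
        category = categories[0]['name'] if categories else None
        rows.append((postcode, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows


# Hand-made payload with the same nesting as the keys used above (not real API output).
fake = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Example Cafe',
               'location': {'lat': 43.65, 'lng': -79.38},
               'categories': [{'name': 'Café'}]}},
    {'venue': {'name': 'Unlabelled Place',
               'location': {'lat': 43.66, 'lng': -79.39},
               'categories': []}},
]}]}}

rows = extract_venues('M5C', fake)
print(rows)
```

Keeping the parsing separate from the HTTP call also makes it possible to check the logic without network access or credentials.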

In [12]:
# one hot encoding
to_onehot = pd.get_dummies(to_venues[['Venue Category']], prefix="", prefix_sep="")

# add postal code column back to dataframe
to_onehot['PostCode'] = to_venues['PostCode'] 

# move postal code column to the first column
fixed_columns = [to_onehot.columns[-1]] + list(to_onehot.columns[:-1])
to_onehot = to_onehot[fixed_columns]

to_onehot.head()

Unnamed: 0,PostCode,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,American Restaurant,Antique Shop,Aquarium,...,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,M1B,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,M1C,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,M1C,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,M1E,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,M1E,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [13]:
#Group rows by PostCode and take the mean of the frequency of occurrence of each category
to_grouped = to_onehot.groupby('PostCode').mean().reset_index()
to_grouped

Unnamed: 0,PostCode,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,American Restaurant,Antique Shop,Aquarium,...,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,M1B,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,...,0.000,0.000000,0.000000,0.000000,0.000000,0.0,0.00000,0.0000,0.000000,0.000000
1,M1C,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,...,0.000,0.000000,0.000000,0.000000,0.000000,0.0,0.00000,0.0000,0.000000,0.000000
2,M1E,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,...,0.000,0.000000,0.000000,0.000000,0.000000,0.0,0.00000,0.0000,0.000000,0.000000
3,M1G,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,...,0.000,0.000000,0.000000,0.000000,0.000000,0.0,0.00000,0.0000,0.000000,0.000000
4,M1H,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,...,0.000,0.000000,0.000000,0.000000,0.000000,0.0,0.00000,0.0000,0.000000,0.000000
5,M1J,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,...,0.000,0.000000,0.000000,0.000000,0.000000,0.0,0.00000,0.0000,0.333333,0.000000
6,M1K,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,...,0.125,0.000000,0.000000,0.000000,0.000000,0.0,0.00000,0.0000,0.000000,0.000000
7,M1L,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,...,0.000,0.000000,0.000000,0.000000,0.000000,0.0,0.00000,0.0000,0.000000,0.000000
8,M1M,0.000000,0.000000,0.0,0.0,0.0,0.0,0.500000,0.000000,0.0,...,0.000,0.000000,0.000000,0.000000,0.000000,0.0,0.00000,0.0000,0.000000,0.000000
9,M1N,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,...,0.000,0.000000,0.000000,0.000000,0.000000,0.0,0.00000,0.0000,0.000000,0.000000
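
To make the frequency computation concrete, here is a small worked sketch on invented venues (the rows in `venues` are illustrative, not Foursquare results). One-hot encoding turns each venue into a row of 0/1 indicators, and the per-postal-code mean of those indicators is the share of venues in each category.

```python
import pandas as pd

# Four made-up venues in two postal code areas.
venues = pd.DataFrame({
    'PostCode': ['M1G', 'M1G', 'M1G', 'M1N'],
    'Venue Category': ['Coffee Shop', 'Coffee Shop', 'Park', 'Café'],
})

# One indicator column per category (floats so the mean is a plain fraction).
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot.insert(0, 'PostCode', venues['PostCode'])

# Mean of the indicators = frequency of each category within the area.
freq = onehot.groupby('PostCode').mean()
print(freq)
```

In this toy data two of the three venues in `M1G` are coffee shops, so the `Coffee Shop` frequency for that area is two thirds.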


Now we incorporate **our own preferences**. Detailed data processing will be provided in the next project. Note that the metrics that are important to us might have broad and vague definitions, so they will be carefully mapped to the corresponding categories retrieved from Foursquare in the next project.

In [14]:
columns = ['Customer_Name', 'Cafe', 'Health Care', 'School', 'Gym', 'Full Score']
Rating = ['Luis', 8, 10, 6, 7, 10]
df_rating = pd.DataFrame([Rating])
df_rating.columns = columns
# Customer_Name stays a regular column, as in the output below.
df_rating

| | Customer_Name | Cafe | Health Care | School | Gym | Full Score |
|---|---|---|---|---|---|---|
| 0 | Luis | 8 | 10 | 6 | 7 | 10 |


Next, we map the customer's expectations to venue categories more precisely. For example, "café" could cover both café and coffee shop, so both should be taken into consideration. Likewise, "school" could mean college, high school, university, etc. These alternative keywords should be identified and considered.

In [15]:
# Explore other possible key words that also mean the same metric that the customer cares about.
key_words = to_grouped.columns.tolist()
cafe_alt = []
health_alt=[]
school_alt = []
gym_alt = []

for i in key_words:
    if 'Coffee' in str(i):
        cafe_alt.append(i)
    elif 'Cafe' in str(i):
        cafe_alt.append(i)
    elif 'Café' in str(i):
        cafe_alt.append(i)


for i in key_words:
    if 'Hospital' in str(i):
        health_alt.append(i)
    elif 'Clinic' in str(i):
        health_alt.append(i)
    elif 'Pharmacy' in str(i):
        health_alt.append(i)
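
for i in key_words:
    if 'University' in str(i):
        school_alt.append(i)
    elif 'College' in str(i):
        school_alt.append(i)
    # Note: the lowercase tests below are unlikely to match the capitalized
    # category names (e.g. a name containing 'School').
    elif 'school' in str(i):
        school_alt.append(i)
    elif 'college' in str(i):
        school_alt.append(i)

for i in key_words:
    # Two independent `if`s: a name containing both words
    # (e.g. 'Gym / Fitness Center') is appended twice; the duplicate is removed later.
    if 'Gym' in str(i):
        gym_alt.append(i)
    if 'Fitness' in str(i):
        gym_alt.append(i)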


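# Scores per metric. Note that these hard-coded values differ from df_rating above
# (Cafe 8, School 6, Gym 7); the outputs below are based on the values set here.
health_r = [10]* len(health_alt)
cafe_r = [7]* len(cafe_alt)
school_r = [9]*len(school_alt)
gym_r = [6] * len(gym_alt)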

new_metric = (cafe_alt + health_alt + school_alt + gym_alt)

new_r = (cafe_r + health_r + school_r + gym_r)

extra_rating = pd.DataFrame([new_metric, new_r])
extra_rating

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,Cafeteria,Café,Coffee Shop,Gaming Cafe,Hospital,Pharmacy,College Arts Building,College Auditorium,College Gym,College Rec Center,College Stadium,Climbing Gym,College Gym,Gym,Gym / Fitness Center,Gym / Fitness Center
1,7,7,7,7,10,10,9,9,9,9,9,6,6,6,6,6


In [17]:
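extra_rating.columns = extra_rating.iloc[0].tolist()
extra_rating.drop(extra_rating.index[0], inplace = True)
# Dropping by label removes every column named 'College Gym' or 'Gym / Fitness Center'
# (both names are duplicated), so 'Gym / Fitness Center' is added back once with the gym score.
new_rating = extra_rating.drop(extra_rating.columns[[8, 15]], axis = 1).copy()
new_rating['Gym / Fitness Center']=6
new_rating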



Unnamed: 0,Cafeteria,Café,Coffee Shop,Gaming Cafe,Hospital,Pharmacy,College Arts Building,College Auditorium,College Rec Center,College Stadium,Climbing Gym,Gym,Gym / Fitness Center
1,7,7,7,7,10,10,9,9,9,9,6,6,6
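
One way to avoid the duplicated columns in the first place is to assign each category to at most one metric. The sketch below is an alternative under stated assumptions, not the approach used above: the `categories` list is a hand-picked sample, the `keywords` dictionary and the helper `assign_metric` are hypothetical, and a category matching several metrics goes to the first one listed (so 'College Gym' counts as School here, whereas the cell above drops it).

```python
# A hand-picked sample of category names (not the full Foursquare list).
categories = ['Café', 'Coffee Shop', 'Hospital', 'Pharmacy',
              'College Gym', 'Gym', 'Gym / Fitness Center', 'Park']

# Keywords per metric; earlier metrics win when a category matches more than one.
keywords = {
    'Cafe': ['Coffee', 'Cafe', 'Café'],
    'Health Care': ['Hospital', 'Clinic', 'Pharmacy'],
    'School': ['University', 'College', 'School'],
    'Gym': ['Gym', 'Fitness'],
}


def assign_metric(category):
    """Return the first metric whose keywords occur in the category name, else None."""
    for metric, words in keywords.items():
        if any(word in category for word in words):
            return metric
    return None


# Keep only the categories that map to some metric; each appears exactly once.
mapping = {}
for category in categories:
    metric = assign_metric(category)
    if metric is not None:
        mapping[category] = metric
print(mapping)
```

Because the result is a dictionary keyed by category, no category can appear twice, so no positional column dropping is needed afterwards.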


After removing the duplicated columns, we convert the scores to relative rating scores, which will be used as scaling factors to weight the venues later.


In [18]:
# There are four main categories, each scored out of 10, so we divide each score by 40 to get a relative score.
# Dividing by the sum of all scores instead could dilute the importance of a metric
# in proportion to the number of subcategories present in that category.
rating_r = new_rating/40
rating_r

Unnamed: 0,Cafeteria,Café,Coffee Shop,Gaming Cafe,Hospital,Pharmacy,College Arts Building,College Auditorium,College Rec Center,College Stadium,Climbing Gym,Gym,Gym / Fitness Center
1,0.175,0.175,0.175,0.175,0.25,0.25,0.225,0.225,0.225,0.225,0.15,0.15,0.15


In [19]:
filtered_columns = rating_r.columns.tolist()
filtered_columns.append('PostCode')
to_filtered = to_grouped [filtered_columns]
to_filtered

Unnamed: 0,Cafeteria,Café,Coffee Shop,Gaming Cafe,Hospital,Pharmacy,College Arts Building,College Auditorium,College Rec Center,College Stadium,Climbing Gym,Gym,Gym / Fitness Center,PostCode
0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.00,0.000000,0.000000,0.0000,M1B
1,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.00,0.000000,0.000000,0.0000,M1C
2,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.00,0.000000,0.000000,0.0000,M1E
3,0.0,0.000000,0.666667,0.0,0.0,0.000000,0.0,0.000000,0.0,0.00,0.000000,0.000000,0.0000,M1G
4,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.00,0.000000,0.000000,0.0000,M1H
5,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.00,0.000000,0.000000,0.0000,M1J
6,0.0,0.000000,0.125000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.00,0.000000,0.000000,0.0000,M1K
7,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.00,0.000000,0.000000,0.0000,M1L
8,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.00,0.000000,0.000000,0.0000,M1M
9,0.0,0.250000,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.25,0.000000,0.000000,0.0000,M1N


We now weight the venues by the actual frequency of occurrence of each venue category and by our personal preferences (the rating dataset). This is done by multiplying the venue frequency matrix element-wise by the row of relative ratings.

In [20]:
weighted_to = (to_filtered.iloc[:,0:13].values) * (rating_r.values)
weighted_df = pd.DataFrame(weighted_to)
weighted_df.columns = to_filtered.columns.tolist()[:-1]
weighted_df['PostCode'] = to_filtered['PostCode']
weighted_df

Unnamed: 0,Cafeteria,Café,Coffee Shop,Gaming Cafe,Hospital,Pharmacy,College Arts Building,College Auditorium,College Rec Center,College Stadium,Climbing Gym,Gym,Gym / Fitness Center,PostCode
0,0,0,0,0,0,0,0,0,0,0,0,0,0,M1B
1,0,0,0,0,0,0,0,0,0,0,0,0,0,M1C
2,0,0,0,0,0,0,0,0,0,0,0,0,0,M1E
3,0,0,0.116667,0,0,0,0,0,0,0,0,0,0,M1G
4,0,0,0,0,0,0,0,0,0,0,0,0,0,M1H
5,0,0,0,0,0,0,0,0,0,0,0,0,0,M1J
6,0,0,0.021875,0,0,0,0,0,0,0,0,0,0,M1K
7,0,0,0,0,0,0,0,0,0,0,0,0,0,M1L
8,0,0,0,0,0,0,0,0,0,0,0,0,0,M1M
9,0,0.04375,0,0,0,0,0,0,0,0.05625,0,0,0,M1N
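
The weighting is an element-wise product: each frequency is multiplied by the relative rating of its column, and the row sum gives the area's score. The sketch below reproduces two rows of the tables above (`M1G` and `M1N`, restricted to three columns) as a worked check; the variable names are illustrative.

```python
import pandas as pd

# Frequencies for two postal codes, taken from the filtered table above.
freq = pd.DataFrame(
    {'Café': [0.0, 0.25], 'Coffee Shop': [2 / 3, 0.0], 'College Stadium': [0.0, 0.25]},
    index=['M1G', 'M1N'],
)

# Relative ratings: raw score divided by 40, as in the rating table above.
weights = pd.Series({'Café': 7, 'Coffee Shop': 7, 'College Stadium': 9}) / 40

# Multiply every column by its weight, then sum across columns for the score.
weighted = freq.mul(weights, axis='columns')
weighted['Score'] = weighted.sum(axis=1)
print(weighted.round(6))
```

For `M1G` this gives 0.666667 × 0.175 ≈ 0.116667, and for `M1N` it gives 0.25 × 0.175 + 0.25 × 0.225 = 0.1, in line with the values shown in the tables.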


Now we generate the recommended candidate areas given our preferences.

In [21]:
weighted_df.set_index('PostCode', inplace = True)
weighted_df

PostCode,Cafeteria,Café,Coffee Shop,Gaming Cafe,Hospital,Pharmacy,College Arts Building,College Auditorium,College Rec Center,College Stadium,Climbing Gym,Gym,Gym / Fitness Center
M1B,0,0,0,0,0,0,0,0,0,0,0,0,0
M1C,0,0,0,0,0,0,0,0,0,0,0,0,0
M1E,0,0,0,0,0,0,0,0,0,0,0,0,0
M1G,0,0,0.116667,0,0,0,0,0,0,0,0,0,0
M1H,0,0,0,0,0,0,0,0,0,0,0,0,0
M1J,0,0,0,0,0,0,0,0,0,0,0,0,0
M1K,0,0,0.021875,0,0,0,0,0,0,0,0,0,0
M1L,0,0,0,0,0,0,0,0,0,0,0,0,0
M1M,0,0,0,0,0,0,0,0,0,0,0,0,0
M1N,0,0.04375,0,0,0,0,0,0,0,0.05625,0,0,0


In [22]:
# Calculate the overall score for each postal code area using sum.
weighted_df['Score'] = weighted_df.sum(axis=1)
weighted_df

PostCode,Cafeteria,Café,Coffee Shop,Gaming Cafe,Hospital,Pharmacy,College Arts Building,College Auditorium,College Rec Center,College Stadium,Climbing Gym,Gym,Gym / Fitness Center,Score
M1B,0,0,0,0,0,0,0,0,0,0,0,0,0,0.000000
M1C,0,0,0,0,0,0,0,0,0,0,0,0,0,0.000000
M1E,0,0,0,0,0,0,0,0,0,0,0,0,0,0.000000
M1G,0,0,0.116667,0,0,0,0,0,0,0,0,0,0,0.116667
M1H,0,0,0,0,0,0,0,0,0,0,0,0,0,0.000000
M1J,0,0,0,0,0,0,0,0,0,0,0,0,0,0.000000
M1K,0,0,0.021875,0,0,0,0,0,0,0,0,0,0,0.021875
M1L,0,0,0,0,0,0,0,0,0,0,0,0,0,0.000000
M1M,0,0,0,0,0,0,0,0,0,0,0,0,0,0.000000
M1N,0,0.04375,0,0,0,0,0,0,0,0.05625,0,0,0,0.100000


In [28]:
# Sort the postal code areas by score.
weighted_df_sorted = weighted_df.sort_values(by=['Score'], ascending = False)
weighted_df_sorted.head(10)

PostCode,Cafeteria,Café,Coffee Shop,Gaming Cafe,Hospital,Pharmacy,College Arts Building,College Auditorium,College Rec Center,College Stadium,Climbing Gym,Gym,Gym / Fitness Center,Score
M2L,0.175,0.0,0.0,0,0,0.0,0,0.0,0,0.0,0,0.0,0.0,0.175
M1G,0.0,0.0,0.116667,0,0,0.0,0,0.0,0,0.0,0,0.0,0.0,0.116667
M1N,0.0,0.04375,0.0,0,0,0.0,0,0.0,0,0.05625,0,0.0,0.0,0.1
M2R,0.0,0.0,0.035,0,0,0.05,0,0.0,0,0.0,0,0.0,0.0,0.085
M3B,0.0,0.035,0.0,0,0,0.0,0,0.0,0,0.0,0,0.0,0.03,0.065
M8W,0.0,0.0,0.0194444,0,0,0.0277778,0,0.0,0,0.0,0,0.0166667,0.0,0.063889
M7A,0.0,0.00486111,0.04375,0,0,0.0,0,0.00625,0,0.0,0,0.00416667,0.0,0.059028
M4J,0.0,0.0,0.0583333,0,0,0.0,0,0.0,0,0.0,0,0.0,0.0,0.058333
M1V,0.0,0.0,0.0583333,0,0,0.0,0,0.0,0,0.0,0,0.0,0.0,0.058333
M6H,0.0,0.0109375,0.0,0,0,0.03125,0,0.0,0,0.0,0,0.0,0.009375,0.051563


To visualize the top 10 areas on the map, we use a marker size that represents the score. Areas with a higher score get a bigger marker to guide the eye, as they are of more importance given our ratings.

In [29]:
# Extract latitude and longitude of the top 10 areas

top10 = weighted_df_sorted.index.tolist()[0:10]
lat_list = Final_df.loc[Final_df['PostCode'].isin(top10)].Latitude
long_list = Final_df.loc[Final_df['PostCode'].isin(top10)].Longitude

# Center the map on the first of these areas
lat = lat_list.iloc[0]
long = long_list.iloc[0]

In [30]:
# Get corresponding info for visualization
# .copy() gives an independent frame, so adding a column does not trigger SettingWithCopyWarning.
selected_df = Final_df.loc[Final_df['PostCode'].isin(weighted_df_sorted.index.tolist()[0:10])].copy()
Score = []
for i in selected_df['PostCode']:
    Score.append(weighted_df_sorted.loc[i]['Score'])

selected_df['Score'] = Score
# Scale the scores so that the marker radii are visible on the map
radius_list = (selected_df['Score'].values)*80



In [32]:
from folium.features import DivIcon

# create map of Toronto using latitude and longitude values.
map_to = folium.Map(location=[lat, long], zoom_start=10)
selected_df = Final_df.loc[Final_df['PostCode'].isin(weighted_df_sorted.index.tolist()[0:10])]

# add markers to map
for lat, lng, borough, postcode,radius in zip(lat_list,
                                              long_list, 
                                              selected_df['Borough'], 
                                              selected_df['PostCode'],
                                              radius_list):
    label = '{}, {}'.format(postcode, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=radius,
        popup=label,
        color='red',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False,
    ).add_to(map_to)  
    
map_to
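
The marker size used above is simply the score multiplied by 80. A minimal sketch of that scaling follows, using three scores from the sorted table; the helper name `marker_radius` and the `min_radius` floor are assumptions added for illustration, so that very low scores would still be visible.

```python
def marker_radius(score, scale=80, min_radius=2.0):
    """Scale a score to a marker radius, never going below min_radius (hypothetical floor)."""
    return max(score * scale, min_radius)


# Three scores taken from the sorted table above.
scores = {'M2L': 0.175, 'M1G': 0.116667, 'M6H': 0.051563}
radii = {code: marker_radius(score) for code, score in scores.items()}
print(radii)
```

Because the scaling is linear, the radius of a marker grows in direct proportion to the score of its area; the floor only matters for scores close to zero.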

With that, the code is complete for the moment, since we have found the recommended areas.