# Final Project of Applied Data Science Capstone: Find a New Living Area in Toronto

## 1. Introduction
We are a real estate agency with a new client, David, who has just accepted a job in Toronto and has to move there within two months. He has hired us to find a suitable neighborhood and apartment for him.

David is a software engineer. He works mostly from home and only occasionally goes to the office for meetings, so the distance between his home and the office is not a concern. What really matters is the quality of life he will have in the new neighborhood.

David doesn't like to cook because he hates the smell of cooking in the house, so his top priority is having restaurants within 200 meters of his home. He also likes drinking coffee and working out at a gym; these two are equally important to him. However, David has many allergies, so he would rather not live close to parks, flower shops, pet stores, or similar places.

## 2. Data for this project
To help David find the ideal place to live, we need data on the venues in Toronto's neighborhoods, which we retrieve through the Foursquare API. We then cluster the neighborhoods and filter them down to a shortlist of up to three areas for David to choose from; that shortlist is the end result of this project.

## 3. Methods
In this section, we implement the following steps:

1. Retrieve necessary location data from Foursquare
2. Clean up data
3. Perform analysis, clustering, and visualization


### 3.1 Retrieve necessary location data from Foursquare

#### 3.1.1 Import libraries

In [1]:
#!pip install bs4
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup 
import requests
from sklearn.cluster import KMeans
import json # library to handle JSON files

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# !conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
from pandas import json_normalize # transform JSON into a pandas dataframe (the pandas.io.json import path is deprecated)
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
# !conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

#### 3.1.2 Set up credentials for retrieving data from Foursquare

In [2]:
CLIENT_ID = 'YOUR_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # your Foursquare Secret
ACCESS_TOKEN = 'YOUR_ACCESS_TOKEN' # your Foursquare Access Token
VERSION = '20210501' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value
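
Credentials written directly into a notebook are easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the keys have been exported as environment variables beforehand; the variable names and the `load_credentials` helper are hypothetical and not part of the original notebook:

```python
import os

def load_credentials(env=os.environ):
    """Read the three Foursquare credentials from environment variables.

    The variable names below are assumptions made for this sketch;
    Foursquare does not require any particular names.
    """
    names = ['FOURSQUARE_CLIENT_ID',
             'FOURSQUARE_CLIENT_SECRET',
             'FOURSQUARE_ACCESS_TOKEN']
    # Fail early with a clear message if any variable is unset or empty
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise KeyError('Missing environment variables: ' + ', '.join(missing))
    return tuple(env[name] for name in names)
```

In the notebook this could be used as `CLIENT_ID, CLIENT_SECRET, ACCESS_TOKEN = load_credentials()`.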

#### 3.1.3 Get the borough data of Toronto from Wikipedia

In [3]:
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html_data=requests.get(url).text
soup=BeautifulSoup(html_data,'html5lib')
table_contents=[]
table=soup.find('table')
for row in table.find_all('td'):  # each <td> cell holds one postal code entry
    if row.span.text == 'Not assigned':
        continue  # skip postal codes that have no borough assigned
    cell = {}
    cell['PostalCode'] = row.p.text[:3]
    cell['Borough'] = (row.span.text).split('(')[0]
    cell['Neighborhood'] = (((((row.span.text).split('(')[1]).strip(')')).replace(' /',',')).replace(')',' ')).strip(' ')
    table_contents.append(cell)
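
The chained `split`/`strip`/`replace` calls above are hard to follow. The sketch below shows the same idea as a small helper, assuming the cell text has the form `Borough(Neighborhood / Neighborhood)`; `parse_cell_text` is a hypothetical name, and unlike the notebook code it also trims whitespace around the borough:

```python
def parse_cell_text(text):
    """Split text like 'North York(Parkwoods)' into (borough, neighborhoods)."""
    # Everything before the first '(' is the borough
    borough, _, rest = text.partition('(')
    # Turn 'A / B)' into 'A, B'
    neighborhoods = rest.replace(')', ' ').replace(' /', ',').strip()
    return borough.strip(), neighborhoods

print(parse_cell_text('Downtown Toronto(Regent Park / Harbourfront)'))
# ('Downtown Toronto', 'Regent Park, Harbourfront')
```

Text without parentheses yields an empty neighborhood string here, whereas the notebook code would raise an `IndexError` in that case.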

### 3.2 Clean up data

#### 3.2.1 Create a dataframe with the data from Wikipedia

In [4]:
toronto_data=pd.DataFrame(table_contents)
toronto_data['Borough']=toronto_data['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                             'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                             'EtobicokeNorthwest':'Etobicoke Northwest','East YorkEast Toronto':'East York/East Toronto',
                                             'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})
toronto_data.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Queen's Park | Ontario Provincial Government |


#### 3.2.2 Get the latitudes and longitudes to append to the toronto_data

In [5]:
# latitude and longitude of each postal code
ll=pd.read_csv('Geospatial_Coordinates.csv')

toronto_data=toronto_data.set_index('PostalCode').join(ll.set_index('Postal Code'))
toronto_data.reset_index(inplace=True)

#### 3.2.3 Make a function to loop through the neighborhoods

In [6]:
def getNearbyVenues(names, latitudes, longitudes, radius=200):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        # print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
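
`getNearbyVenues` indexes straight into the JSON response, so an empty response or a venue without a category raises an exception partway through the loop. A minimal sketch of a more defensive extraction step, assuming the response has the same shape as used above (`response` → `groups` → `items` → `venue`); `extract_venues` and the `'Unknown'` fallback label are hypothetical:

```python
def extract_venues(payload):
    """Return (name, lat, lng, category) tuples from an explore-style payload."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    venues = []
    for item in items:
        venue = item['venue']
        # Some venues may come back without any category
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else 'Unknown'
        venues.append((venue['name'],
                       venue['location']['lat'],
                       venue['location']['lng'],
                       category))
    return venues
```

A payload without any groups simply produces an empty list, so the surrounding loop can continue with the next neighborhood.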

#### 3.2.4 Run the function and get the venues in Toronto

In [7]:
toronto_venues=getNearbyVenues(toronto_data['Neighborhood'],toronto_data['Latitude'],toronto_data['Longitude'])
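
The call above relies on the default `radius=200`, which matches David's wish for venues within 200 meters. To double-check how far a returned venue really is from the neighborhood coordinates, one option is the haversine formula. This is a sketch only; `haversine_m` is a hypothetical helper and the Earth radius of 6,371 km is the usual mean-radius approximation:

```python
import math

def haversine_m(lat1, lon1, lat2, lon2, radius_m=6371000.0):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_m * math.asin(math.sqrt(a))
```

Applied to the `Neighborhood Latitude`/`Neighborhood Longitude` and `Venue Latitude`/`Venue Longitude` columns of `toronto_venues`, this gives the distance of each venue from its neighborhood center.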

#### 3.2.5 One-hot encode the venue categories and average them per neighborhood

In [8]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")
# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped.head()
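
Taking the mean of one-hot columns per neighborhood yields the share of that neighborhood's venues in each category. A toy example with made-up data (the neighborhoods `A` and `B` are purely illustrative) shows the effect; `dtype=float` is passed here only to keep the dummy columns numeric regardless of the pandas version:

```python
import pandas as pd

# Made-up venues: three in neighborhood A, one in neighborhood B
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B'],
    'Venue Category': ['Café', 'Café', 'Gym', 'Park'],
})

onehot = pd.get_dummies(venues[['Venue Category']], prefix='', prefix_sep='', dtype=float)
onehot['Neighborhood'] = venues['Neighborhood']

# Each value is the fraction of the neighborhood's venues in that category
grouped = onehot.groupby('Neighborhood').mean().reset_index()
print(grouped)
```

For neighborhood `A`, two of the three venues are cafés, so the `Café` column holds about 0.667 and `Gym` about 0.333.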

#### 3.2.6 Merge columns into categories to match the requirement of the customer

In [9]:
def merge_categories(df, keywords, extras, new_column, loc):
    """Sum every column whose name contains one of `keywords`, plus the columns
    listed in `extras`, into `new_column` (inserted at position `loc`), then
    drop the merged columns. Returns the list of merged column names."""
    merged = [name for name in df.columns if any(key in name for key in keywords)]
    merged.extend(extras)
    df.insert(loc=loc, column=new_column, value=df[merged].sum(axis=1))
    df.drop(columns=merged, inplace=True)
    return merged

# Merge all the food related places into one category
merge_categories(toronto_grouped, ['Restaurant'],
                 ['Bakery','Breakfast Spot','Burger Joint','Diner','Fish & Chips Shop','Food Court','Fried Chicken Joint','Gourmet Shop','Pizza Place','Sandwich Place','Soup Place','Steakhouse','Taco Place','Noodle House','Salad Place'],
                 'All restaurants', loc=1)

# Merge all bar related places
merge_categories(toronto_grouped, ['Bar','Brewery','club','Pub','Beer'], ['Liquor Store'], 'All bars', loc=2)

# Merge all sport related places
merge_categories(toronto_grouped, ['Gym','Sport'], ['Yoga Studio'], 'All sports', loc=2)

# Merge all coffee related places
str_coffee = merge_categories(toronto_grouped, ['Coffee','Café'], [], 'All coffee', loc=3)
print(str_coffee)

# Merge all allergy related places
merge_categories(toronto_grouped, ['Outdoor','Park','Lake','Garden','Pet','Flower','Field'], ['Fountain'], 'All allergies', loc=4)

toronto_grouped.head()

['Café', 'Coffee Shop']


Unnamed: 0,Neighborhood,All restaurants,All sports,All coffee,All allergies,All bars,Accessories Store,Adult Boutique,Art Gallery,Arts & Crafts Store,Auto Workshop,BBQ Joint,Bank,Bike Shop,Bookstore,Boutique,Bubble Tea Shop,Burrito Place,Cheese Shop,Chocolate Shop,Clothing Store,College Rec Center,Comic Shop,Concert Hall,Convenience Store,Costume Shop,Creperie,Cupcake Shop,Deli / Bodega,Department Store,Dessert Shop,Discount Store,Electronics Store,Escape Room,Farmers Market,Frozen Yogurt Shop,Furniture / Home Store,Gas Station,Gastropub,General Entertainment,General Travel,Gift Shop,Golf Course,Grocery Store,Health & Beauty Service,History Museum,Home Service,Hot Dog Joint,Hotel,Housing Development,Ice Cream Shop,Intersection,Jewelry Store,Kids Store,Light Rail Station,Martial Arts School,Miscellaneous Shop,Movie Theater,Music Venue,Opera House,Organic Grocery,Performing Arts Venue,Pharmacy,Playground,Plaza,Pool,Record Shop,Road,Shoe Store,Shopping Mall,Smoothie Shop,Spa,Supermarket,Supplement Shop,Tea Room,Theater,Thrift / Vintage Store,Toy / Game Store,Trail,Wings Joint,Women's Store
0,"Alderwood, Long Branch",0.333333,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Bathurst Manor, Wilson Heights, Downsview North",0.714286,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bedford Park, Lawrence Manor East",0.666667,0.0,0.111111,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"Birch Cliff, Cliffside West",0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Brockton, Parkdale Village, Exhibition Place",0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### 3.2.7 A function to sort the venues in descending order

In [10]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

#### 3.2.8 Run the function to sort the dataframe with the top 5 venues and append it back

In [11]:
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # no 'st'/'nd'/'rd' suffix beyond the third position
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)
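
The column names above are built with a `try`/`except` around a three-element suffix list, which works for the top 5 but would label an 11th column correctly only by accident. A small sketch of an explicit ordinal helper; `ordinal` is a hypothetical name, not part of the original notebook:

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 12 -> '12th'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 11th, 12th, 13th and the other teens
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

# Same column layout as in the cell above, for the top 5 venues
columns = ['Neighborhood'] + ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 6)]
print(columns)
```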

### 3.3 Perform analysis and clustering

#### 3.3.1 Use K-means clustering

In [12]:
# set number of clusters
kclusters = 5
toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=6).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:50] 

array([0, 2, 0, 4, 4, 1, 0, 2, 0, 0, 0, 0, 0, 3, 0, 1, 1, 1, 4, 1, 0, 1,
       1, 0, 1, 0, 2, 0, 0, 2, 4, 0, 0, 1, 1, 0, 0, 3, 2, 2, 0, 3, 0, 0,
       0, 3, 1, 2, 4, 0])
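
The notebook fixes `kclusters = 5` without comparing alternatives. One common way to sanity-check such a choice is the silhouette score. The sketch below is not what the notebook did; it runs on synthetic points with three obvious groups, and `best_k_by_silhouette` is a hypothetical helper:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(X, k_values, random_state=6):
    """Fit k-means for each k and return (best k, {k: silhouette score})."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get), scores

# Synthetic demo data: three well-separated blobs of 30 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

best_k, scores = best_k_by_silhouette(X_demo, range(2, 6))
print(best_k)
```

The same function could be called with `toronto_grouped_clustering` in place of `X_demo` to see how 5 clusters compares with nearby values of k.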

#### 3.3.2 Create a new dataframe that includes the cluster as well as the top 5 venues for each neighborhood

In [20]:
# add clustering labels
if 'Cluster Labels' in neighborhoods_venues_sorted.columns:
    neighborhoods_venues_sorted.drop(['Cluster Labels'],axis=1,inplace=True)

neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = toronto_data

# join the cluster labels and top venues onto toronto_data, which holds latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')
toronto_merged.dropna(inplace=True)
toronto_merged['Cluster Labels']=toronto_merged['Cluster Labels'].astype(int)
neighborhoods_venues_sorted.head()

| | Cluster Labels | Neighborhood | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue |
|---|---|---|---|---|---|---|---|
| 0 | 0 | Alderwood, Long Branch | All restaurants | All coffee | Pharmacy | Wings Joint | Organic Grocery |
| 1 | 2 | Bathurst Manor, Wilson Heights, Downsview North | All restaurants | All coffee | Deli / Bodega | Jewelry Store | Opera House |
| 2 | 0 | Bedford Park, Lawrence Manor East | All restaurants | All coffee | All bars | Cupcake Shop | Kids Store |
| 3 | 4 | Birch Cliff, Cliffside West | All coffee | All restaurants | Jewelry Store | Opera House | Music Venue |
| 4 | 4 | Brockton, Parkdale Village, Exhibition Place | All coffee | Playground | All restaurants | Jewelry Store | Opera House |


#### 3.3.3 Get the coordinates of Toronto to center the map

In [14]:
address = 'Toronto'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
# print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

#### 3.3.4 Plot the clusters on the map

In [15]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [29]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]].tail(10)

| | Borough | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue |
|---|---|---|---|---|---|---|---|
| 70 | Etobicoke | 0 | All restaurants | All coffee | Jewelry Store | Opera House | Music Venue |
| 79 | Central Toronto | 0 | All restaurants | All coffee | Dessert Shop | Toy / Game Store | Bookstore |
| 80 | Downtown Toronto | 0 | All sports | All coffee | BBQ Joint | All restaurants | Trail |
| 84 | Downtown Toronto | 0 | All restaurants | All coffee | All bars | Arts & Crafts Store | Dessert Shop |
| 90 | Scarborough | 0 | All restaurants | Bank | Pharmacy | All sports | All coffee |
| 92 | Downtown Toronto Stn A | 0 | All restaurants | All bars | All coffee | All allergies | Hotel |
| 93 | Etobicoke | 0 | All restaurants | All coffee | Pharmacy | Wings Joint | Organic Grocery |
| 96 | Downtown Toronto | 0 | All restaurants | All coffee | All allergies | All bars | General Entertainment |
| 97 | Downtown Toronto | 0 | All restaurants | All coffee | All sports | Deli / Bodega | All bars |
| 99 | Downtown Toronto | 0 | All restaurants | Burrito Place | Martial Arts School | All allergies | All bars |


## 4. Results

### 4.1 Find the top 3 areas that match the customer's needs
Based on David's requirements, the category "All restaurants" must be the most common venue, and "All coffee" and "All sports" must take 2nd and 3rd place (in either order).

Most importantly, the category "All allergies" must not appear anywhere in the top 5 list, for his own good.

Therefore, cluster 0 (the first cluster, listed in the table above) is the only choice for David.

Now, let's try to find the best place for him.

In [27]:
# Remove the areas that don't have the category "All restaurants" at the 1st most common venue
i= toronto_merged[toronto_merged['1st Most Common Venue']!='All restaurants'].index
top_area=toronto_merged.drop(i,axis=0)

# Remove the areas that have the category "All allergies" among their top 5 venues
venue_columns = [c for c in top_area.columns if c.endswith('Most Common Venue')]
has_allergy = top_area[venue_columns].eq('All allergies').any(axis=1)
top_area = top_area[~has_allergy]

# Remove the areas that don't have the category "All coffee" and "All sports" at either 2nd or 3rd most common venue
top_area=top_area[(top_area['2nd Most Common Venue']=='All coffee') | (top_area['3rd Most Common Venue']=='All coffee')]
top_area=top_area[(top_area['2nd Most Common Venue']=='All sports') | (top_area['3rd Most Common Venue']=='All sports')]

top_area

| | PostalCode | Borough | Neighborhood | Latitude | Longitude | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 97 | M5X | Downtown Toronto | First Canadian Place, Underground city | 43.648429 | -79.38228 | 0 | All restaurants | All coffee | All sports | Deli / Bodega | All bars |
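
The three filters can also be read as a single rule on a neighborhood's top 5 list. The sketch below restates that rule as a plain predicate and checks it against rows from the cluster 0 table in section 3.3; `matches_profile` is a hypothetical helper, not part of the original notebook:

```python
def matches_profile(top_venues):
    """top_venues: the five most common venue categories, most common first."""
    return (top_venues[0] == 'All restaurants'
            and set(top_venues[1:3]) == {'All coffee', 'All sports'}
            and 'All allergies' not in top_venues)

# Rows 97, 96 and 80 from the cluster 0 table
row_97 = ['All restaurants', 'All coffee', 'All sports', 'Deli / Bodega', 'All bars']
row_96 = ['All restaurants', 'All coffee', 'All allergies', 'All bars', 'General Entertainment']
row_80 = ['All sports', 'All coffee', 'BBQ Joint', 'All restaurants', 'Trail']

print(matches_profile(row_97), matches_profile(row_96), matches_profile(row_80))
# True False False
```

Row 96 fails because of "All allergies" in third place, and row 80 fails because restaurants are not its most common venue.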


### 4.2 Find the most suitable areas for David

In [17]:
print(top_area['Neighborhood'])

97    First Canadian Place, Underground city
Name: Neighborhood, dtype: object


In [30]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=15)

# add markers to the map
for lat, lon, poi, cluster in zip(top_area['Latitude'], top_area['Longitude'], top_area['Neighborhood'], top_area['Cluster Labels']):
    label = folium.Popup('Here is our recommended area for David', parse_html=True)
    folium.Circle(  # Circle takes its radius in meters; CircleMarker's radius is in pixels
        [lat, lon],
        radius=200,
        popup=label,
        color='red',
        fill=True,
        fill_color='green',
        ).add_to(map_clusters)
       
map_clusters

## 5. Discussion and Conclusion

After the data analysis, we found that the most suitable area for David to live in Toronto is postal code area M5X in Downtown Toronto, which covers First Canadian Place and the Underground city. Restaurants are the most common venues there, followed by coffee shops or cafés and sports-related facilities. Most importantly, no allergy-related category (parks, gardens, pet stores, flower shops, and the like) appears among its five most common venue categories. This area is therefore the end result of this project.

As a next step, apartment prices in this area should be investigated more thoroughly so that David can find the apartment that suits him best. This will require more information from him, because many criteria influence the price of an apartment, such as its size, the number of rooms, the interior design, and so on.