# Capstone Project - The Battle of the Neighborhoods
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Building the Recommender System](#recommender)
* [Results and Discussion](#discussion)
* [Conclusion and Future Direction](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

In this project we build a **recommender system** that groups the neighborhoods of the city of Toronto into clusters and lists the most common venue categories in each cluster. It is meant to help anyone who wants to open a shop, restaurant or similar business get a rough estimate of which neighborhoods might be the best fit.

The recommender system presents several venue categories for each cluster and leaves it to the interested person to select the cluster that best suits their purpose.

The project relies on two main data science approaches, k-means clustering and recommender systems, to reach the desired results. The following sections describe the data and its sources, the methodology and the analysis that was carried out, and finally the discussion and conclusion.

## Data <a name="data"></a>

Based on the definition of the problem, the factors that will influence our decision are:
* The most common venues/venue categories in each neighborhood or borough
* The cut-off point for how many of the most common venues to take into consideration

The following data sources are needed to extract/generate the required information:
* A **Wikipedia page**, used to obtain the neighborhood names, postal codes and borough names for the city of Toronto
* The **Foursquare API**, used to obtain the venue categories, their locations in every neighborhood and their counts
* A **CSV** file shared earlier by the course instructors, used to obtain the coordinates (lats/longs) of each neighborhood
* The code in **Part 1**, **Part 2** & **Part 3**, used to obtain the data required for the analysis

## Part Number 1: Web Scraping and Data Wrangling

### We start by importing the libraries that will help us in putting the dataframe into place

In [1]:
# Importing the necessary libraries to complete the assignment - Data Wrangling and Web Scraping
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import numpy as np
import requests
from bs4 import BeautifulSoup

In [2]:
# Importing the Libraries that have to do with the geospatial data/plotting/Clustering
# !conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
import seaborn as sns

# import k-means from clustering stage
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

### We set up the variables and import the table from the wiki page and prepare the columns and rows for the dataframe

In [4]:
# Setting up the variables
URL = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(URL)
soup = BeautifulSoup(response.text, 'html.parser')
table = soup.find('table',{'class':'wikitable sortable'}).tbody

# Preparing the Rows and Columns for the DataFrame
rows = table.find_all('tr')
columns = [v.text.replace('\n','') for v in rows[0].find_all('th')]
df = pd.DataFrame(columns=columns)

### Populating the dataframe with the data from the wiki table

In [5]:
# Looping over the table rows and collecting their values,
# then building the dataframe in one step (DataFrame.append was removed in pandas 2.0)
records = []
for i in range(1,len(rows)):
    tds = rows[i].find_all('td')
    records.append([td.text.replace('\n','') for td in tds])
df = pd.DataFrame(records, columns=columns)
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


### Now, we remove the **Not assigned** cells from the dataframe

In [6]:
# data wrangling to remove the "Not assigned" cells
df_new = df.set_index('Borough')
df_new.drop('Not assigned',inplace=True)
df_new = df_new.reset_index()
df_new = df_new[['Postal Code','Borough','Neighborhood']]
df_new.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
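The same filtering can also be written with a boolean mask, which avoids moving `Borough` into the index and back. A minimal sketch on a toy frame (the rows are copied from the scraped table above; the names `sample` and `assigned` are only illustrative):

```python
import pandas as pd

# Toy rows copied from the scraped table above
sample = pd.DataFrame({
    'Postal Code': ['M1A', 'M2A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'Not assigned', 'North York', 'North York'],
    'Neighborhood': ['Not assigned', 'Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Keep only the rows whose borough is assigned, then renumber the index
assigned = sample[sample['Borough'] != 'Not assigned'].reset_index(drop=True)
print(assigned)
```

The column order is left untouched, so no reordering step is needed afterwards.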


## Part Number 2: Getting the Lats/Longs for the Neighborhoods

### First, we download the geospatial data

In [7]:
# Downloading the geospatial data
!wget -q -O 'Geospatial_data.csv' http://cocl.us/Geospatial_data
geospatial_df = pd.read_csv('Geospatial_data.csv')
geospatial_df = geospatial_df.rename(columns={'Postal Code':'PC'})
geospatial_df.head()

Unnamed: 0,PC,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


### We merge the two dataframes in order to have it in a format ready for analysis

In [8]:
# using df.merge in order to join the two dataframes
Toronto_df = df_new.merge(geospatial_df,left_on='Postal Code',right_on='PC')
Toronto_df = Toronto_df.drop(['PC'],axis=1)
Toronto_df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
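Note that `merge` performs an inner join by default, so a postal code that has no coordinates would silently disappear from the result. A small sketch of how this could be checked, using toy data (the code `M9Z` and the borough `Hypothetical` are made up for illustration):

```python
import pandas as pd

neighborhoods = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M9Z'],   # 'M9Z' is a made-up code with no coordinates
    'Borough': ['North York', 'North York', 'Hypothetical'],
})
coords = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# A left merge keeps every neighborhood; indicator=True adds a '_merge' column
checked = neighborhoods.merge(coords, on='Postal Code', how='left', indicator=True)

# Postal codes that found no match in the coordinates table
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Postal Code'].tolist()
print(unmatched)
```

If the list of unmatched codes is empty, the default inner merge loses nothing.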


## Part Number 3: Using Foursquare API to get the Venues Data

### We start by getting the lats and longs for Toronto

In [9]:
# We get the lats/longs by using the geolocator 
address = 'Toronto'
geolocator = Nominatim(user_agent="Toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


### Defining Foursquare Credentials
#### First we start with Client_ID and Client Secret

In [10]:
CLIENT_ID = '' # your Foursquare ID
CLIENT_SECRET = '' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: [redacted]
CLIENT_SECRET:[redacted]
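Since printed credentials end up in the saved notebook, one option is to read them from environment variables and only print a masked version. A minimal sketch, assuming the hypothetical variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` (they are not defined anywhere in this notebook) and using placeholder values for the demo:

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Read the Foursquare credentials from a mapping of environment variables.

    The variable names used here are hypothetical; use whatever names
    your own environment defines.
    """
    client_id = env.get('FOURSQUARE_CLIENT_ID', '')
    client_secret = env.get('FOURSQUARE_CLIENT_SECRET', '')
    if not client_id or not client_secret:
        raise RuntimeError('Foursquare credentials are not set in the environment')
    return client_id, client_secret

def mask(secret, visible=4):
    """Show only the first few characters of a secret when printing it."""
    return secret[:visible] + '*' * max(len(secret) - visible, 0)

# Demo with a stand-in mapping and placeholder values instead of the real environment
demo_env = {
    'FOURSQUARE_CLIENT_ID': 'EXAMPLEID123',
    'FOURSQUARE_CLIENT_SECRET': 'EXAMPLESECRET456',
}
demo_id, demo_secret = load_foursquare_credentials(demo_env)
print('CLIENT_ID: ' + mask(demo_id))
print('CLIENT_SECRET: ' + mask(demo_secret))
```

Calling `load_foursquare_credentials()` without arguments reads from the real environment and raises an error if either value is missing.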


### We create a function that will explore all the neighborhoods in Toronto

In [11]:
def getNearbyVenues(names, latitudes, longitudes, radius=500, LIMIT=100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

# Applying the function to get the venues in Toronto
Toronto_venues = getNearbyVenues(names=Toronto_df['Neighborhood'],
                                   latitudes=Toronto_df['Latitude'],
                                   longitudes=Toronto_df['Longitude'])

Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Queen's Park, Ontario Provincial Government
Islington Avenue, Humber Valley Village
Malvern, Rouge
Don Mills
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Don Mills
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto, Broadview North (Old East York)
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
Kennedy Park, Ionview, East Birchmo

In [12]:
# having a look at the final output (dataframe)
Toronto_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,Parkwoods,43.753259,-79.329656,Corrosion Service Company Limited,43.752432,-79.334661,Construction & Landscaping
3,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
4,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop
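`getNearbyVenues` indexes straight into the JSON response, so a response without groups or a venue without a category would raise an error in the middle of the loop. The parsing step could be made more forgiving along the following lines. This is only a sketch: the payload shape is taken from the indexing used in `getNearbyVenues`, while the function name `extract_venue_rows` and the `'Unknown'` fallback are assumptions made for illustration.

```python
def extract_venue_rows(payload, name, lat, lng):
    """Flatten one explore-style JSON payload into row tuples.

    Returns an empty list when the payload has no groups, and uses
    'Unknown' when a venue has no category, instead of raising.
    """
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        categories = venue.get('categories') or [{}]
        rows.append((
            name, lat, lng,
            venue.get('name'),
            location.get('lat'),
            location.get('lng'),
            categories[0].get('name', 'Unknown'),
        ))
    return rows

# Hand-written payload shaped like the response parsed in getNearbyVenues
sample_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Brookbanks Park',
               'location': {'lat': 43.751976, 'lng': -79.33214},
               'categories': [{'name': 'Park'}]}},
    {'venue': {'name': 'Uncategorised Venue',   # made-up venue with no category
               'location': {'lat': 43.75, 'lng': -79.33},
               'categories': []}},
]}]}}

rows = extract_venue_rows(sample_payload, 'Parkwoods', 43.753259, -79.329656)
for row in rows:
    print(row)
```

Each tuple has the same seven fields, in the same order, as the columns of `Toronto_venues`.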


## Building the Recommender System <a name="recommender"></a>

We start by one-hot encoding the venue categories, which puts the dataframe in a format that can be analyzed and that suits the recommender system we are planning to build.

In [45]:
# one hot encoding
Toronto_onehot = pd.get_dummies(Toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
Toronto_onehot['Neighborhood'] = Toronto_venues['Neighborhood'] 

# check the size of the one-hot dataframe, then average per neighborhood
# to get the frequency of each venue category
print(Toronto_onehot.shape)
Toronto_grouped = Toronto_onehot.groupby('Neighborhood').mean().reset_index()

(2132, 278)
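To see why the mean of the one-hot columns is useful, here is a small sketch on made-up data (the neighborhoods `A` and `B` and their venues are invented for illustration): averaging a 0/1 column per neighborhood gives the share of that neighborhood's venues that belong to the category.

```python
import pandas as pd

# Made-up venues for two hypothetical neighborhoods
toy_venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Park', 'Park', 'Bakery', 'Bank', 'Bakery', 'Bakery'],
})

# One column per category, one row per venue
toy_onehot = pd.get_dummies(toy_venues['Venue Category'], dtype=float)
toy_onehot.insert(0, 'Neighborhood', toy_venues['Neighborhood'])

# The mean of a 0/1 column is the share of venues in that category
toy_grouped = toy_onehot.groupby('Neighborhood').mean().reset_index()
print(toy_grouped)
```

In this toy data, half of the venues in `A` are parks and every venue in `B` is a bakery, and the grouped frame reflects exactly those shares.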


In [46]:
# defining a function to return the most common venues
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

We keep only the 1st and 2nd most common venues for each neighborhood, following the Pareto (80/20) principle: out of the 10 most common venues, the top two (20%) are assumed to carry most of the weight. Together they give a rough profile of the neighborhood or the borough.
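Whether the top two categories really dominate a neighborhood is an assumption that can be checked. The sketch below computes the share of venues covered by the most frequent categories; the helper `top_share` and the toy category list are hypothetical and not part of the analysis in this notebook.

```python
from collections import Counter

def top_share(categories, top_n=2):
    """Fraction of venues that fall in the top_n most frequent categories."""
    counts = Counter(categories)
    if not counts:
        return 0.0
    top_total = sum(count for _, count in counts.most_common(top_n))
    return top_total / sum(counts.values())

# Made-up venue categories for one neighborhood
toy_categories = ['Coffee Shop'] * 5 + ['Park'] * 3 + ['Bank', 'Bakery']
print(top_share(toy_categories))
```

In this toy list the two most frequent categories cover 8 of the 10 venues. Applying the same helper to the categories of each real neighborhood would show how well the 80/20 assumption holds there.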

In [47]:
num_top_venues = 2
indicators = ['st', 'nd']

# create columns according to number of top venues - 2 Columns
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = Toronto_grouped['Neighborhood']

for ind in np.arange(Toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(Toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue
0,Agincourt,Lounge,Breakfast Spot
1,"Alderwood, Long Branch",Pizza Place,Pharmacy
2,"Bathurst Manor, Wilson Heights, Downsview North",Coffee Shop,Bank
3,Bayview Village,Café,Bank
4,"Bedford Park, Lawrence Manor East",Sandwich Place,Italian Restaurant
5,Berczy Park,Coffee Shop,Cocktail Bar
6,"Birch Cliff, Cliffside West",General Entertainment,College Stadium
7,"Brockton, Parkdale Village, Exhibition Place",Café,Bakery
8,"Business reply mail Processing Centre, South C...",Light Rail Station,Park
9,"CN Tower, King and Spadina, Railway Lands, Har...",Airport Lounge,Airport Service


Here, we use k-means clustering to group neighborhoods with similar venue profiles into buckets. This makes it easier to base recommendations on the data and to apply a recommender system algorithm to it.

In [48]:
# set number of clusters
kclusters = 6

Toronto_grouped_clustering = Toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=3).fit(Toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([5, 0, 5, 0, 5, 5, 0, 5, 5, 5], dtype=int32)
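The number of clusters is fixed by hand above. One common way to sanity-check such a choice is to compare silhouette scores for several values of k. The following sketch uses synthetic data as a stand-in for `Toronto_grouped_clustering`, so the resulting k says nothing about the Toronto data itself; it only illustrates the procedure.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for Toronto_grouped_clustering: three well-separated blobs
rng = np.random.default_rng(3)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((30, 2)) for c in centers])

# Fit k-means for a range of k and record the silhouette score of each fit
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=3).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest silhouette score is a reasonable candidate
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

Replacing `X` with the real feature matrix would give one data-driven reference point for `kclusters`, to be weighed against how interpretable the resulting clusters are.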

We merge the resultant dataframe from the clustering exercise with the original one, in order to assign the clusters' labels to the neighborhoods' names and their respective boroughs.

In [49]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_+1)

Toronto_merged = Toronto_df

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
Toronto_merged = Toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')
Toronto_merged = Toronto_merged.dropna()
Toronto_merged.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,2.0,Park,Construction & Landscaping
1,M4A,North York,Victoria Village,43.725882,-79.315572,6.0,Intersection,Pizza Place
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,6.0,Coffee Shop,Park
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,6.0,Clothing Store,Accessories Store
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,6.0,Coffee Shop,Sushi Restaurant


We then get the output of the recommender system, showing each cluster with its most common venue categories.

In [52]:
# loop each cluster and get most common venue in each one based on their rank
for i in list(range(kclusters)):
    Cluster = Toronto_merged.loc[Toronto_merged['Cluster Labels'] == i+1, Toronto_merged.columns[[1] + list(range(5, Toronto_merged.shape[1]))]]
    frame = [Cluster['1st Most Common Venue'].value_counts(),Cluster['2nd Most Common Venue'].value_counts()]
    # name the index and the counts explicitly so the column names do not depend on the pandas version
    result = pd.concat(frame).rename_axis('Venue Category').rename('Rank').reset_index()
    result = result.groupby('Venue Category').sum().reset_index().sort_values(by='Rank',ascending=False).head()
    print('Cluster #',i+1,'has the following Venue Categories',result['Venue Category'].values.tolist())

Cluster # 1 has the following Venue Categories ['Pizza Place', 'Sandwich Place', 'Pharmacy', 'Café', 'Gym / Fitness Center']
Cluster # 2 has the following Venue Categories ['Park', 'Convenience Store', 'Playground', 'Bus Line', 'Construction & Landscaping']
Cluster # 3 has the following Venue Categories ['Baseball Field', 'Yoga Studio']
Cluster # 4 has the following Venue Categories ['Bakery', 'Trail', 'American Restaurant', 'Bus Line', 'Garden']
Cluster # 5 has the following Venue Categories ['Cafeteria', 'Dog Run']
Cluster # 6 has the following Venue Categories ['Coffee Shop', 'Café', 'Park', 'Bakery', 'Grocery Store']


To see a cluster with its respective **Neighborhoods** and **Boroughs**, set `i` in the following code to the label of the cluster you would like to inspect.

In [58]:
i = 1
C = Toronto_merged.loc[Toronto_merged['Cluster Labels'] == i, Toronto_merged.columns[[1,2] + list(range(5, Toronto_merged.shape[1]))]]
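The lookup above can also be wrapped in a small function that selects the columns by name instead of by position. A sketch, where `neighborhoods_in_cluster` is a hypothetical helper and the toy frame reuses three rows from the merged table shown earlier:

```python
import pandas as pd

def neighborhoods_in_cluster(merged, cluster_label):
    """Return borough, neighborhood and top-venue columns for one cluster."""
    wanted = ['Borough', 'Neighborhood', 'Cluster Labels',
              '1st Most Common Venue', '2nd Most Common Venue']
    selected = merged.loc[merged['Cluster Labels'] == cluster_label, wanted]
    return selected.reset_index(drop=True)

# Toy frame with rows copied from the merged table shown earlier
toy_merged = pd.DataFrame({
    'Borough': ['North York', 'North York', 'Downtown Toronto'],
    'Neighborhood': ['Parkwoods', 'Victoria Village', 'Regent Park, Harbourfront'],
    'Cluster Labels': [2.0, 6.0, 6.0],
    '1st Most Common Venue': ['Park', 'Intersection', 'Coffee Shop'],
    '2nd Most Common Venue': ['Construction & Landscaping', 'Pizza Place', 'Park'],
})

print(neighborhoods_in_cluster(toy_merged, 6))
```

With the real data, `neighborhoods_in_cluster(Toronto_merged, 1)` would return the same rows as the cell above for `i = 1`.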

## Results and Discussion <a name="discussion"></a>

With reference to the output of the recommender system, we can see only mild variation between **Cluster # 1**, **Cluster # 4**, **Cluster # 5** and **Cluster # 6**: they are, by and large, similar in nature (i.e. the most common venues in each are cafés, restaurants or, simply put, venues that have to do with leisure or entertainment).

**Cluster # 2** and **Cluster # 3**, however, appear to be very different from the other clusters. More interestingly, even the items within each of these clusters show some variation that probably needs further investigation, e.g. by looking more closely at the neighborhoods and trying to understand what they are like in reality. One assumption, yet to be validated, is that the neighborhoods in these clusters are new and not well defined yet. One could also add more of the most common venues (i.e. 3rd, 4th, 5th, etc.) to see whether the variation continues or whether more harmony appears as more venue categories are added.

Ultimately, the decision as to where to open a certain shop or restaurant is left to the reader, since the reasoning behind the choice depends on the person making the decision. Put differently, one option is to open a restaurant, for example, in a neighborhood known to have restaurants as its most common venue category, on the reasoning that more people visit that neighborhood for its restaurants, which would most probably increase the chance of the newly opened restaurant gaining traction. The flip side of that coin is that an area with fewer restaurants may be more profitable for the business, since few or no competitors are present.

While both arguments hold under different conditions, the specifics behind such decisions need to be studied rather than assumed, e.g. by studying the market in each cluster from a supply/demand matching perspective. This is, however, outside the scope of what this recommender system tries to achieve.

## Conclusion and Future Direction  <a name="conclusion"></a>

The purpose of this report is to present a recommender system that clusters neighborhoods/boroughs in the city of Toronto and outputs the most common venues that characterize each cluster. First, we looked up the neighborhoods/boroughs in the city of Toronto. Second, we obtained the lats/longs for these neighborhoods, which allowed us to combine the resultant dataframe with the data pulled from the Foursquare API in the third step. The data was then fed into a k-means clustering algorithm to reduce it into buckets of similar nature. Finally, the recommender system was built on the most common venue categories in each cluster, using the concept of the Pareto distribution.

Future development could automate the process even further for those who want the machine to pick the neighborhood for them when opening a certain shop or business, by further studying the economics and/or the urban aspects of those clusters.

# Thank You :)