# Coursera Capstone
## Introduction
#### Let's pretend I received a good job offer in Toronto and I'm considering moving abroad. Since I live in Brazil, a long way from Toronto, it would be easier to adapt to a neighbourhood that is similar to the one I live in (Praia da Costa).
#### How can I solve this problem?
#### 1) Obtaining the name and coordinates of the neighbourhoods in Toronto (using Wikipedia) and the ones of my neighbourhood
#### 2) Use the Foursquare data to obtain the 40 most common venues in each neighbourhood
#### 3) Use the k-means method to cluster the neighbourhoods based on the similarity of their venue frequencies
#### 4) Identify which cluster Praia da Costa is in and list the similar neighbourhoods

## 1) Obtaining the name and coordinates of the neighbourhoods in Toronto (using Wikipedia) and the ones of my neighbourhood

In [24]:
#importing libraries
import requests
import pandas as pd
!pip install beautifulsoup4
from bs4 import BeautifulSoup
!pip install lxml
import folium
!pip install geopy
from geopy.geocoders import Nominatim



In [40]:
#Obtaining the name of the neighbourhoods in Toronto
web_page = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
soup = BeautifulSoup(web_page.text, 'html.parser')
table = soup.find_all('table')[0]


#data wrangling
df = pd.read_html(str(table))[0]
postcodes = df["Postcode"]
borough = df["Borough"]
neighbourhood = df["Neighbourhood"]
df_TO= pd.DataFrame({'Postcode': postcodes, 'Borough': borough, 'Neighbourhood': neighbourhood})


#drop lines with not assigned borough
import numpy as np
df_TO.replace("Not assigned", np.nan, inplace = True)
df_TO.dropna(subset=["Borough"], axis=0, inplace=True)


#group postal areas
df_TO_group=df_TO.groupby(['Postcode','Borough'])['Neighbourhood'].agg(lambda x:', '.join(map(str, x))).reset_index()


#geolocations of each postal code obtained from a csv
coord= pd.read_csv('Geospatial_Coordinates.csv')
coord.head()

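#attach the coordinates by matching on the postal code rather than on row order
df2 = df_TO_group.merge(coord, left_on='Postcode', right_on='Postal Code')
df2.drop('Postal Code', axis = 1, inplace=True)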

#selecting Toronto neighbourhoods only
#.copy() makes df_clean an independent DataFrame, so the in-place drop below does not trigger a SettingWithCopyWarning
df_clean=df2.loc[df2['Borough'].str.contains('Toronto')].copy()
df_clean.drop(['Postcode','Borough'],axis=1, inplace=True)
df_clean.head()


| | Neighbourhood | Latitude | Longitude |
|---|---|---|---|
| 37 | The Beaches | 43.676357 | -79.293031 |
| 41 | The Danforth West, Riverdale | 43.679557 | -79.352188 |
| 42 | The Beaches West, India Bazaar | 43.668999 | -79.315572 |
| 43 | Studio District | 43.659526 | -79.340923 |
| 44 | Lawrence Park | 43.72802 | -79.38879 |
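
When attaching coordinates to the grouped postal codes, matching on the postal code value rather than on row order guards against the two tables being sorted differently. The following is a minimal sketch on made-up rows: the postal codes `X1A`, `X1B` and `X1C` are placeholders, and only the column names are taken from the notebook.

```python
import pandas as pd

# Illustrative rows only; the postal codes are placeholders, not real ones.
areas = pd.DataFrame({
    'Postcode': ['X1A', 'X1B', 'X1C'],
    'Neighbourhood': ['The Beaches', 'The Danforth West, Riverdale',
                      'The Beaches West, India Bazaar'],
})
# Same postal codes, deliberately listed in a different order.
coords = pd.DataFrame({
    'Postal Code': ['X1C', 'X1A', 'X1B'],
    'Latitude': [43.668999, 43.676357, 43.679557],
    'Longitude': [-79.315572, -79.293031, -79.352188],
})

# Match rows by postal code value; validate= raises if a code is duplicated.
merged = areas.merge(coords, left_on='Postcode', right_on='Postal Code',
                     how='left', validate='one_to_one')
merged = merged.drop(columns='Postal Code')

# Any row that found no coordinates would show up here as NaN.
missing = merged['Latitude'].isna().sum()
print(merged)
print('rows without coordinates:', missing)
```

Counting the missing values after a left merge is a cheap check that every postal code found its coordinates.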


#### Appending Praia da Costa coordinates to the dataframe

In [3]:
#my neighbourhood coordinates
brazil={'Neighbourhood': 'Praia da Costa', 'Latitude':[-20.329265], 'Longitude':[-40.281969]}
br=pd.DataFrame(data=brazil)

In [4]:
#appending (DataFrame.append was removed in pandas 2.0; pd.concat gives the same result)
df_clean=pd.concat([df_clean, br])
df_clean

| | Neighbourhood | Latitude | Longitude |
|---|---|---|---|
| 37 | The Beaches | 43.676357 | -79.293031 |
| 41 | The Danforth West, Riverdale | 43.679557 | -79.352188 |
| 42 | The Beaches West, India Bazaar | 43.668999 | -79.315572 |
| 43 | Studio District | 43.659526 | -79.340923 |
| 44 | Lawrence Park | 43.72802 | -79.38879 |
| 45 | Davisville North | 43.712751 | -79.390197 |
| 46 | North Toronto West | 43.715383 | -79.405678 |
| 47 | Davisville | 43.704324 | -79.38879 |
| 48 | Moore Park, Summerhill East | 43.689574 | -79.38316 |
| 49 | Deer Park, Forest Hill SE, Rathnelly, South Hi... | 43.686412 | -79.400049 |


## 2) Use the Foursquare data to obtain the 40 most common venues in each neighbourhood

In [28]:
#Foursquare credentials (redacted here; substitute your own and keep real keys out of shared notebooks)

CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID'
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET'
VERSION = '20180605'
LIMIT = 100
radius = 500

In [26]:
def NearbyVenues(names, latitudes, longitudes, radius=500):
    #query the Foursquare explore endpoint once per neighbourhood and collect one row per venue
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)
        #the venues are nested under response -> groups[0] -> items
        results = requests.get(url).json()["response"]['groups'][0]['items']
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])
    #flatten the per-neighbourhood lists into a single DataFrame
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood',
                  'Neighbourhood Latitude',
                  'Neighbourhood Longitude',
                  'Venue',
                  'Venue Latitude',
                  'Venue Longitude',
                  'Venue Category']
    return nearby_venues
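
The request and parsing steps can be tried without calling the API. The sketch below is only an illustration under stated assumptions: the helper names `build_explore_url` and `extract_venues` are hypothetical, and the payload is hand-written to mimic the `response` / `groups` / `items` nesting that the function above reads, not real Foursquare output. It builds the query string with `urllib.parse.urlencode` so that every value is percent-encoded, and it tolerates a venue that has no category.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Assemble the explore URL with every value percent-encoded."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)

def extract_venues(payload):
    """Return (name, lat, lng, category) tuples; category is None when absent."""
    rows = []
    for item in payload['response']['groups'][0]['items']:
        venue = item['venue']
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else None
        rows.append((venue['name'], venue['location']['lat'],
                     venue['location']['lng'], category))
    return rows

# Hand-written payload in the shape read by the function above; not real API output.
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Venue A', 'location': {'lat': 43.0, 'lng': -79.0},
               'categories': [{'name': 'Trail'}]}},
    {'venue': {'name': 'Venue B', 'location': {'lat': 43.1, 'lng': -79.1},
               'categories': []}},
]}]}}

url = build_explore_url('ID', 'SECRET', '20180605', 43.676357, -79.293031, 500, 100)
venues = extract_venues(sample)
print(url)
print(venues)
```

Separating URL construction from parsing makes it possible to test both on fixed inputs before spending API quota.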

In [27]:
toronto_venues = NearbyVenues(names=df_clean['Neighbourhood'], latitudes=df_clean['Latitude'], longitudes=df_clean['Longitude'])

The Beaches
The Danforth West, Riverdale
The Beaches West, India Bazaar
Studio District
Lawrence Park
Davisville North
North Toronto West
Davisville
Moore Park, Summerhill East
Deer Park, Forest Hill SE, Rathnelly, South Hill, Summerhill West
Rosedale
Cabbagetown, St. James Town
Church and Wellesley
Harbourfront, Regent Park
Ryerson, Garden District
St. James Town
Berczy Park
Central Bay Street
Adelaide, King, Richmond
Harbourfront East, Toronto Islands, Union Station
Design Exchange, Toronto Dominion Centre
Commerce Court, Victoria Hotel
Roselawn
Forest Hill North, Forest Hill West
The Annex, North Midtown, Yorkville
Harbord, University of Toronto
Chinatown, Grange Park, Kensington Market
CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara
Stn A PO Boxes 25 The Esplanade
First Canadian Place, Underground city
Christie
Dovercourt Village, Dufferin
Little Portugal, Trinity
Brockton, Exhibition Place, Parkdale Village
High Park, The 

In [29]:
print(toronto_venues.shape)
toronto_venues.head()

(1717, 7)


| | Neighbourhood | Neighbourhood Latitude | Neighbourhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | The Beaches | 43.676357 | -79.293031 | Glen Manor Ravine | 43.676821 | -79.293942 | Trail |
| 1 | The Beaches | 43.676357 | -79.293031 | The Big Carrot Natural Food Market | 43.678879 | -79.297734 | Health Food Store |
| 2 | The Beaches | 43.676357 | -79.293031 | Grover Pub and Grub | 43.679181 | -79.297215 | Pub |
| 3 | The Beaches | 43.676357 | -79.293031 | Upper Beaches | 43.680563 | -79.292869 | Neighborhood |
| 4 | The Danforth West, Riverdale | 43.679557 | -79.352188 | Pantheon | 43.677621 | -79.351434 | Greek Restaurant |


In [30]:
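#one-hot encode the venue categories: one 0/1 column per category
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

#add the neighbourhood name back as the last column
toronto_onehot['Neighbourhood'] = toronto_venues['Neighbourhood']

#move the neighbourhood column to the first position
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]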

toronto_onehot.head()

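| | Neighbourhood | Afghan Restaurant | Airport | Airport Food Court | Airport Gate | Airport Lounge | Airport Service | Airport Terminal | American Restaurant | Antique Shop | ... | Toy / Game Store | Trail | Train Station | Vegetarian / Vegan Restaurant | Video Game Store | Video Store | Vietnamese Restaurant | Wine Bar | Wings Joint | Yoga Studio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Beaches | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | The Beaches | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | The Beaches | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | The Beaches | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | The Danforth West, Riverdale | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |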


In [39]:
toronto_grouped = toronto_onehot.groupby('Neighbourhood').mean().reset_index()
toronto_grouped.head()

| | Neighbourhood | Afghan Restaurant | Airport | Airport Food Court | Airport Gate | Airport Lounge | Airport Service | Airport Terminal | American Restaurant | Antique Shop | ... | Toy / Game Store | Trail | Train Station | Vegetarian / Vegan Restaurant | Video Game Store | Video Store | Vietnamese Restaurant | Wine Bar | Wings Joint | Yoga Studio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Adelaide, King, Richmond | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.03 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.01 | 0.0 | 0.0 | 0.0 | 0.01 | 0.0 | 0.0 |
| 1 | Berczy Park | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.017857 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | Brockton, Exhibition Place, Parkdale Village | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | Business Reply Mail Processing Centre 969 Eastern | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | CN Tower, Bathurst Quay, Island airport, Harbo... | 0.0 | 0.058824 | 0.058824 | 0.058824 | 0.117647 | 0.176471 | 0.117647 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
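
The grouped values are frequencies because the mean of a 0/1 indicator column is the share of rows where it equals 1. A minimal sketch on made-up venue rows (the neighbourhood labels `A` and `B` and the categories are illustrative only):

```python
import pandas as pd

# Made-up venue rows; only the two columns used by the encoding step.
venues = pd.DataFrame({
    'Neighbourhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Coffee Shop', 'Coffee Shop', 'Park', 'Pub', 'Park', 'Park'],
})

# One indicator column per category, then the neighbourhood label alongside.
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot.insert(0, 'Neighbourhood', venues['Neighbourhood'])

# The mean of the 0/1 indicators is the share of venues in each category.
freq = onehot.groupby('Neighbourhood').mean()
print(freq)
```

Each row of `freq` sums to 1, so neighbourhoods with many venues and neighbourhoods with few are compared on the same scale.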


In [32]:
num_top_venues = 40

In [33]:
def most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

In [34]:
num_top_venues = 40
indicators = ['st', 'nd', 'rd']

columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  #no special suffix defined for this position, fall back to 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))
        
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighbourhood'] = toronto_grouped['Neighbourhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

| | Neighbourhood | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | ... | 31th Most Common Venue | 32th Most Common Venue | 33th Most Common Venue | 34th Most Common Venue | 35th Most Common Venue | 36th Most Common Venue | 37th Most Common Venue | 38th Most Common Venue | 39th Most Common Venue | 40th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Adelaide, King, Richmond | Coffee Shop | Café | Steakhouse | Thai Restaurant | Bar | Breakfast Spot | Hotel | Gym | American Restaurant | ... | Indian Restaurant | Office | Poke Place | Plaza | Opera House | Food Court | Deli / Bodega | Italian Restaurant | Japanese Restaurant | Neighborhood |
| 1 | Berczy Park | Coffee Shop | Cocktail Bar | Farmers Market | Beer Bar | Steakhouse | Bakery | Seafood Restaurant | Cheese Shop | Café | ... | Fountain | Hotel | Thai Restaurant | Tea Room | Art Gallery | Grocery Store | Vegetarian / Vegan Restaurant | Tailor Shop | French Restaurant | Greek Restaurant |
| 2 | Brockton, Exhibition Place, Parkdale Village | Coffee Shop | Café | Breakfast Spot | Grocery Store | Intersection | Performing Arts Venue | Pet Store | Gym | Climbing Gym | ... | German Restaurant | Donut Shop | Cupcake Shop | Cuban Restaurant | Gift Shop | Creperie | Gluten-free Restaurant | Coworking Space | Gourmet Shop | Cosmetics Shop |
| 3 | Business Reply Mail Processing Centre 969 Eastern | Light Rail Station | Smoke Shop | Fast Food Restaurant | Restaurant | Recording Studio | Brewery | Auto Workshop | Burrito Place | Garden | ... | Creperie | Coworking Space | Cosmetics Shop | Convenience Store | Concert Hall | Comfort Food Restaurant | Department Store | Falafel Restaurant | Electronics Store | Fried Chicken Joint |
| 4 | CN Tower, Bathurst Quay, Island airport, Harbo... | Airport Service | Airport Lounge | Airport Terminal | Plane | Boutique | Sculpture Garden | Coffee Shop | Boat or Ferry | Harbor / Marina | ... | Deli / Bodega | Dance Studio | Cupcake Shop | Cuban Restaurant | Creperie | Coworking Space | Filipino Restaurant | Food | Fish Market | Garden Center |
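
The column headers above read "31th", "32th" and "33th" because the suffix list only covers the first three positions. One possible way to label every rank correctly is a small helper; `ordinal` below is a hypothetical function written for this sketch, not part of the notebook.

```python
def ordinal(n):
    """Return n followed by its English ordinal suffix, e.g. 1 -> '1st'."""
    # 11, 12 and 13 take 'th' even though they end in 1, 2 and 3.
    if 11 <= n % 100 <= 13:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

num_top_venues = 40
columns = ['Neighbourhood'] + [
    '{} Most Common Venue'.format(ordinal(i)) for i in range(1, num_top_venues + 1)
]
print(columns[1], '|', columns[21], '|', columns[31])
```

Using it would rename the affected columns, so any later code that refers to them by name would need the new labels.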


## 3) Use the k-means method to cluster the neighbourhoods based on the similarity of their venue frequencies

In [35]:
from sklearn.cluster import KMeans

kclusters = 7
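#keep only the numeric frequency columns (the positional axis argument to drop was removed in pandas 2.0)
toronto_grouped_clustering = toronto_grouped.drop(columns='Neighbourhood')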
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

In [36]:
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
toronto_merged = df_clean
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')
toronto_merged.head()

| | Neighbourhood | Latitude | Longitude | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | ... | 31th Most Common Venue | 32th Most Common Venue | 33th Most Common Venue | 34th Most Common Venue | 35th Most Common Venue | 36th Most Common Venue | 37th Most Common Venue | 38th Most Common Venue | 39th Most Common Venue | 40th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 37 | The Beaches | 43.676357 | -79.293031 | 4 | Trail | Health Food Store | Pub | Neighborhood | Ethiopian Restaurant | Electronics Store | ... | Falafel Restaurant | Filipino Restaurant | Fast Food Restaurant | Gaming Cafe | Grocery Store | Greek Restaurant | Gourmet Shop | Gluten-free Restaurant | Gift Shop | German Restaurant |
| 41 | The Danforth West, Riverdale | 43.679557 | -79.352188 | 0 | Greek Restaurant | Coffee Shop | Italian Restaurant | Ice Cream Shop | Furniture / Home Store | Bubble Tea Shop | ... | Donut Shop | Doner Restaurant | Dog Run | Dive Bar | Discount Store | German Restaurant | Electronics Store | Department Store | Gift Shop | Deli / Bodega |
| 42 | The Beaches West, India Bazaar | 43.668999 | -79.315572 | 0 | Park | Sandwich Place | Ice Cream Shop | Pizza Place | Pub | Movie Theater | ... | Cupcake Shop | Cuban Restaurant | Creperie | Coworking Space | Cosmetics Shop | Gift Shop | Convenience Store | Concert Hall | Comic Shop | Comfort Food Restaurant |
| 43 | Studio District | 43.659526 | -79.340923 | 0 | Café | Coffee Shop | Italian Restaurant | American Restaurant | Gastropub | Bakery | ... | Comfort Food Restaurant | Dessert Shop | Dumpling Restaurant | Donut Shop | Cuban Restaurant | Doner Restaurant | Dive Bar | Dog Run | Cupcake Shop | Dance Studio |
| 44 | Lawrence Park | 43.72802 | -79.38879 | 3 | Park | Swim School | Bus Line | Yoga Studio | Discount Store | Falafel Restaurant | ... | Filipino Restaurant | Colombian Restaurant | Garden | Gym | Grocery Store | Greek Restaurant | Gourmet Shop | Gluten-free Restaurant | Gift Shop | German Restaurant |
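
To see what the clustering step does on a scale that can be checked by eye, here is a minimal sketch on a made-up frequency matrix. The five rows, their names and the three category columns are all illustrative assumptions. `n_init` is set explicitly because its default differs between scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up frequency rows: columns are shares of (coffee shop, park, airport).
names = ['N1', 'N2', 'N3', 'N4', 'Target']
freq = np.array([
    [0.80, 0.20, 0.00],
    [0.10, 0.90, 0.00],
    [0.75, 0.25, 0.00],
    [0.00, 0.10, 0.90],
    [0.85, 0.15, 0.00],
])

# Fit k-means with a fixed seed so repeated runs give the same partition.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(freq)
labels = dict(zip(names, kmeans.labels_))

# Label numbers are arbitrary, so compare memberships rather than the numbers.
target_label = labels['Target']
similar = [n for n in names if labels[n] == target_label and n != 'Target']
print(similar)
```

Because the label numbers carry no meaning, "cluster 0" in one run may be "cluster 2" in another configuration; only the grouping itself is comparable.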


In [37]:
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors

#centre the map on Toronto
address = 'Toronto'
geolocator = Nominatim(user_agent="toronto")
location = geolocator.geocode(address)
Latitude = location.latitude
Longitude = location.longitude
map_final = folium.Map(location=[Latitude, Longitude], zoom_start=11)

#one colour per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

#one marker per neighbourhood, coloured by its cluster label
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_final)
map_final

## 4) Identify which cluster Praia da Costa is in and list the similar neighbourhoods

#### Praia da Costa is in cluster 0

In [38]:
neighbourhoods=toronto_merged.loc[(toronto_merged['Cluster Labels']==0)].reset_index()
print("The similar neighbourhoods are:\n", neighbourhoods['Neighbourhood'])


The similar neighbourhoods are:
 0                          The Danforth West, Riverdale
1                        The Beaches West, India Bazaar
2                                       Studio District
3                                      Davisville North
4                                    North Toronto West
5                                            Davisville
6     Deer Park, Forest Hill SE, Rathnelly, South Hi...
7                           Cabbagetown, St. James Town
8                                  Church and Wellesley
9                             Harbourfront, Regent Park
10                             Ryerson, Garden District
11                                       St. James Town
12                                          Berczy Park
13                                   Central Bay Street
14                             Adelaide, King, Richmond
15    Harbourfront East, Toronto Islands, Union Station
16             Design Exchange, Toronto Dominion Centre
17             
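
The cluster number can also be read from the data rather than typed in, which keeps the lookup correct if the labels change between runs. A minimal sketch, assuming a table with the same `Neighbourhood` and `Cluster Labels` columns as the merged DataFrame; the four rows are a small hand-picked subset for illustration.

```python
import pandas as pd

# A few rows in the shape of the merged table; a subset for illustration only.
merged = pd.DataFrame({
    'Neighbourhood': ['Studio District', 'Lawrence Park', 'Davisville', 'Praia da Costa'],
    'Cluster Labels': [0, 3, 0, 0],
})

home = 'Praia da Costa'
# Read the home neighbourhood's label from the data instead of hard-coding it.
home_cluster = merged.loc[merged['Neighbourhood'] == home, 'Cluster Labels'].iloc[0]

# Same cluster, excluding the home neighbourhood itself.
same_cluster = merged['Cluster Labels'] == home_cluster
similar = merged.loc[same_cluster & (merged['Neighbourhood'] != home), 'Neighbourhood']
print(home_cluster, list(similar))
```

Excluding the home neighbourhood from the result avoids counting it among its own matches.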

## This concludes the project: 31 neighbourhoods in Toronto were found to be similar to Praia da Costa