# Capstone Coursera Project

This is the final project for the Coursera Data Science Professional Certificate. In this project I segment and cluster the neighbourhoods of Toronto, Canada, by the types of venues found near them. I download and clean the data, then use K-Means clustering to analyse it. I visualise the analysis with tables and maps.

## Installing Libraries

Here I am importing the libraries used in this project.

In [1]:
import geocoder
import folium
import numpy as np
import pandas as pd
import requests
import matplotlib.cm as cm
import matplotlib.colors as colors

print('Libraries have been successfully imported!')

Libraries have been successfully imported!


## Importing and Cleaning the Data

Here I am scraping the Toronto postal codes, with their associated boroughs and neighbourhoods, from Wikipedia for analysis.

In [5]:
toronto_df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M',header = 0)[0] # scrape data from Wikipedia
toronto_df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


Now I am cleaning the data by removing the rows with "Not assigned" values, so the data is ready for analysis.

In [6]:
# Removing "Not assigned" value
toronto_df.drop(toronto_df[toronto_df['Borough'] == 'Not assigned'].index, inplace = True) 
toronto_df.drop(toronto_df[toronto_df['Neighborhood'] == 'Not assigned'].index, inplace = True) 
toronto_df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
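
The same filtering can also be written as a single boolean mask instead of two in-place drops. The sketch below uses a small hand-made dataframe (not the scraped table) purely for illustration; the names `sample_df` and `clean_df` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the scraped table; these rows are illustrative only.
sample_df = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Keep only rows where neither column holds the placeholder value.
assigned = (sample_df['Borough'] != 'Not assigned') & (sample_df['Neighborhood'] != 'Not assigned')
clean_df = sample_df[assigned].reset_index(drop=True)
```

Resetting the index is optional; it only makes the row labels start at 0 again after the filter.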


Here I am looking up the latitude and longitude of each postal code and adding them to the dataframe, so I can plot the data on a map.

In [7]:
from opencage.geocoder import OpenCageGeocode # importing the OpenCage library

key = '1065458a9dd24cd39091974764f06ec2' # API Key

geocoder = OpenCageGeocode(key)

# Create empty lists
list_lat = []   
list_long = []


for index, row in toronto_df.iterrows(): # Iterate over rows in dataframe

    PostCode = row['Postal Code']      
    query = PostCode

    results = geocoder.geocode(query)   
    lat = results[0]['geometry']['lat']
    long = results[0]['geometry']['lng']

    list_lat.append(lat)
    list_long.append(long)

# Create new columns from lists 
toronto_df['Latitude'] = list_lat
toronto_df['Longitude'] = list_long
toronto_df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
2,M3A,North York,Parkwoods,49.484606,8.466197
3,M4A,North York,Victoria Village,49.48429,8.467
4,M5A,Downtown Toronto,"Regent Park, Harbourfront",45.440588,28.018025
5,M6A,North York,"Lawrence Manor, Lawrence Heights",53.794164,-1.752006
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",44.428198,26.165951
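
The coordinates above do not fall in Toronto: the maps in this notebook are centred on `[43.6532, -79.3832]`, while the first rows sit near latitude 49.48, longitude 8.47. A bare postal code such as `M3A` is an ambiguous query. One possible way to narrow it, sketched below, is to add the city to the query text and pass a country filter. The helper name `geocode_postcode` is hypothetical, the query wording is an assumption, and `countrycode` is assumed to be forwarded by the OpenCage client as a request parameter. The helper also returns `None` instead of raising when nothing is found.

```python
def geocode_postcode(geocode, postcode, city='Toronto, Ontario', country='ca'):
    """Return (lat, lng) for a postal code, or None when there is no match.

    `geocode` is any callable shaped like OpenCageGeocode.geocode,
    for example: geocode_postcode(geocoder.geocode, 'M3A').
    """
    # Assumption: the client accepts a `countrycode` keyword argument.
    results = geocode('{}, {}'.format(postcode, city), countrycode=country)
    if not results:
        return None
    geometry = results[0]['geometry']
    return geometry['lat'], geometry['lng']


# Offline stand-in for the real client, with dummy coordinates,
# used only to exercise the helper without a network call.
def fake_geocode(query, **kwargs):
    if query.startswith('M3A'):
        return [{'geometry': {'lat': 1.0, 'lng': 2.0}}]
    return []
```

Passing the geocoding function in as an argument keeps the helper testable offline; whether the restricted query returns better matches for every postal code would need to be checked against the real service.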


## Map of Neighbourhoods

This map shows the different neighbourhoods in Toronto. Each neighbourhood is highlighted by a blue circle.

In [8]:
# creating new map
m = folium.Map(
    location=[43.6532, -79.3832],
    zoom_start=11,
    tiles='Stamen Terrain'
)

# adding the blue circle markers
toronto_df.apply(lambda row:folium.CircleMarker(location=[row["Latitude"], row["Longitude"]], 
                                        radius=15,
                                        popup=row['Neighborhood'],
                                        fill = True,
                                        fill_opacity = 0.7,
                                        ).add_to(m),
                                        axis=1)

m

In [9]:
CLIENT_ID = 'OAFCQS0SRE3F2IPWXJC5QF4BHFASNWNJ2PSBZY2ZOBB50OST' # Foursquare ID
CLIENT_SECRET = 'DVSK1BBJ2WQHV4JSBGRHVDFNBQBBC50VGIX0JXM0H4MS4K4C' # Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: OAFCQS0SRE3F2IPWXJC5QF4BHFASNWNJ2PSBZY2ZOBB50OST
CLIENT_SECRET:DVSK1BBJ2WQHV4JSBGRHVDFNBQBBC50VGIX0JXM0H4MS4K4C
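
Since the cell above writes the credentials into the notebook and its output, a common alternative is to keep them out of the source and read them from environment variables. This is a minimal sketch; the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helpers `load_credentials` and `mask` are hypothetical.

```python
import os


def load_credentials(env=os.environ):
    """Read Foursquare credentials from a mapping of environment variables."""
    client_id = env.get('FOURSQUARE_CLIENT_ID')
    client_secret = env.get('FOURSQUARE_CLIENT_SECRET')
    if not client_id or not client_secret:
        raise RuntimeError('Foursquare credentials are not set')
    return client_id, client_secret


def mask(secret, visible=4):
    """Show only the first few characters when printing a secret."""
    return secret[:visible] + '*' * (len(secret) - visible)
```

With this in place, a confirmation such as `print('CLIENT_ID: ' + mask(client_id))` shows that a value was loaded without exposing it in full.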


In [10]:
# creating new function to retrieve the venues near each neighbourhood
def getNearbyVenues(names, latitudes, longitudes):
    radius=500
    LIMIT=100
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()['response']['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
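
The list comprehension in `getNearbyVenues` assumes that every response contains a group and that every venue has at least one category. As a sketch of a more defensive version of that step, the hypothetical helper `extract_venues` below flattens one decoded response using the same keys as the function above, and skips venues with an empty `categories` list. It is exercised on a hand-written payload with made-up values, not on real API output.

```python
def extract_venues(payload, name, lat, lng):
    """Flatten one decoded 'explore' response into rows for the dataframe."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        if not categories:
            continue  # no category to cluster on, so skip this venue
        rows.append((name, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     categories[0]['name']))
    return rows


# Hand-written payload with made-up values, shaped like the keys used above.
sample_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Venue A', 'location': {'lat': 1.0, 'lng': 2.0},
               'categories': [{'name': 'Café'}]}},
    {'venue': {'name': 'Venue B', 'location': {'lat': 1.5, 'lng': 2.5},
               'categories': []}},
]}]}}

rows = extract_venues(sample_payload, 'Parkwoods', 0.0, 0.0)
```

Here `rows` holds a single tuple for "Venue A", because "Venue B" has no category.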

In [11]:
# list of venues
toronto_venues = getNearbyVenues(names=toronto_df['Neighborhood'],
                                   latitudes=toronto_df['Latitude'],
                                   longitudes=toronto_df['Longitude']
                                  )
toronto_venues.head()

Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Queen's Park, Ontario Provincial Government
Islington Avenue, Humber Valley Village
Malvern, Rouge
Don Mills
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Don Mills
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto, Broadview North (Old East York)
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
Kennedy Park, Ionview, East Birchmo

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,49.484606,8.466197,Café Sammo,49.485903,8.464466,Café
1,Parkwoods,49.484606,8.466197,SENJU,49.485447,8.466555,Japanese Restaurant
2,Parkwoods,49.484606,8.466197,Novus,49.484556,8.466838,Cocktail Bar
3,Parkwoods,49.484606,8.466197,Helder & Leuween,49.485595,8.467464,Café
4,Parkwoods,49.484606,8.466197,Mémoires d'Indochine,49.487279,8.464476,Vietnamese Restaurant


## Data Preparation for K-Means

Here I am preparing the data for K-Means clustering. Venue categories are text, so I use one-hot encoding to convert each category into its own numeric column.

In [12]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.shape

(2318, 275)
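
To make the encoding concrete, here is a small sketch on three made-up venues: each distinct category becomes its own column, and each row has exactly one entry set (shown as `1` or `True`, depending on the pandas version) marking its category. The names `toy_venues` and `toy_onehot` are hypothetical.

```python
import pandas as pd

# Three made-up venues; the real input is toronto_venues.
toy_venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'B'],
    'Venue Category': ['Café', 'Park', 'Café'],
})

# Same call as in the cell above, applied to the toy data.
toy_onehot = pd.get_dummies(toy_venues[['Venue Category']], prefix="", prefix_sep="")
```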

In [13]:
toronto_onehot

Unnamed: 0,Yoga Studio,Accessories Store,Airport,Airport Terminal,American Restaurant,Antique Shop,Aquarium,Arcade,Argentinian Restaurant,Art Gallery,...,Udon Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Xinjiang Restaurant
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [16]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped.head()

Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,Airport,Airport Terminal,American Restaurant,Antique Shop,Aquarium,Arcade,Argentinian Restaurant,...,Udon Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Xinjiang Restaurant
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.013699,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
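
Because each one-hot row marks a single category, the mean of a column within a neighbourhood is the share of that neighbourhood's venues in that category. A toy check with made-up rows (the names `toy_onehot` and `toy_grouped` are hypothetical):

```python
import pandas as pd

# Neighbourhood A has three cafés and one park; B has one park.
toy_onehot = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B'],
    'Café':         [1, 1, 1, 0, 0],
    'Park':         [0, 0, 0, 1, 1],
})

# Same grouping as in the cell above: one row per neighbourhood,
# one column per category, values between 0 and 1.
toy_grouped = toy_onehot.groupby('Neighborhood').mean().reset_index()
```

In this toy data, neighbourhood A ends up with 0.75 for cafés and 0.25 for parks.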


In [14]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

This is a generated list of the 5 most common venue categories near each neighbourhood in Toronto.

In [17]:
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Agincourt,Badminton Court,Breakfast Spot,Latin American Restaurant,Skating Rink,Falafel Restaurant
1,"Alderwood, Long Branch",Pharmacy,Convenience Store,Dance Studio,Pub,Gym
2,"Bathurst Manor, Wilson Heights, Downsview North",Pizza Place,Mediterranean Restaurant,Middle Eastern Restaurant,Coffee Shop,Fried Chicken Joint
3,"Bedford Park, Lawrence Manor East",Coffee Shop,Restaurant,Sandwich Place,Italian Restaurant,Thai Restaurant
4,Berczy Park,Coffee Shop,Boat or Ferry,Restaurant,Hotel,Deli / Bodega
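
For a single neighbourhood, the ranking done by `return_most_common_venues` amounts to sorting the frequency row and keeping the first few labels. The sketch below shows the same idea with `Series.nlargest` on made-up frequencies; `toy_row` and `top_three` are hypothetical names.

```python
import pandas as pd

# Made-up category frequencies for one neighbourhood.
toy_row = pd.Series({'Café': 0.5, 'Park': 0.1, 'Pub': 0.3, 'Gym': 0.06, 'Bank': 0.04})

# Keep the labels of the three largest values, in descending order.
top_three = toy_row.nlargest(3).index.tolist()
```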


## K-Means Clustering

Here I use K-Means clustering to analyse the data. I split the neighbourhoods into 5 clusters and visualise them on the map below.

In [19]:
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters = kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([2, 2, 2, 2, 2, 2, 2, 2, 2, 1], dtype=int32)
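
As a small illustration of what the labels mean, the sketch below runs K-Means on four made-up frequency vectors that form two obvious groups. The data and `n_init=10` are choices made for this example only. The label numbers themselves are arbitrary, so only which rows share a label is meaningful.

```python
import numpy as np
from sklearn.cluster import KMeans

# Four made-up neighbourhoods described by two category frequencies.
toy_features = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
])

# Two clusters; random_state fixes the initialisation for repeatability.
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_features)
toy_labels = toy_kmeans.labels_
```

The first two rows receive one label and the last two rows the other.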

In [20]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Clusters Label', kmeans.labels_)

toronto_merged = toronto_df

# join the cluster labels and top venues onto toronto_df, which holds latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head() # check the last columns!

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Clusters Label,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
2,M3A,North York,Parkwoods,49.484606,8.466197,2.0,Café,Clothing Store,Thai Restaurant,Plaza,Sushi Restaurant
3,M4A,North York,Victoria Village,49.48429,8.467,2.0,Café,Clothing Store,Lounge,Thai Restaurant,Coffee Shop
4,M5A,Downtown Toronto,"Regent Park, Harbourfront",45.440588,28.018025,2.0,Memorial Site,Bowling Alley,Bar,Plaza,Auto Garage
5,M6A,North York,"Lawrence Manor, Lawrence Heights",53.794164,-1.752006,2.0,Italian Restaurant,Coffee Shop,Bar,Hotel,Café
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",44.428198,26.165951,2.0,Supermarket,Park,Restaurant,Italian Restaurant,Eastern European Restaurant


In [23]:
#dropping NaN from the dataframe
toronto_merged = toronto_merged.dropna(axis=0, how='any', inplace=False)

In [26]:
toronto_merged.columns

Index(['Postal Code', 'Borough', 'Neighborhood', 'Latitude', 'Longitude',
       'Clusters Label', '1st Most Common Venue', '2nd Most Common Venue',
       '3rd Most Common Venue', '4th Most Common Venue',
       '5th Most Common Venue'],
      dtype='object')

## Map of Clusters

In [27]:
# create map
map_clusters = folium.Map(location=[43.6532, -79.3832], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Clusters Label']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=12,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
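
The colour lookup above only needs one colour per cluster label. A minimal sketch of that mapping, using the same `cm.rainbow` colormap and indexing directly by label, is shown below; the helper name `cluster_colours` is hypothetical.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors


def cluster_colours(n_clusters):
    """Return one hex colour per cluster label, evenly spaced on the rainbow colormap."""
    return [colors.rgb2hex(rgba) for rgba in cm.rainbow(np.linspace(0, 1, n_clusters))]


palette = cluster_colours(5)

# Labels are floats after the join (for example 2.0), hence the int() cast.
colour_for_label_2 = palette[int(2.0)]
```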