<h1 align = 'center'> Whether to accept the job or not? </h1>

In today's world, good jobs are not common, and they may be located far from where we live. Leaving behind comfort or quality of life can be inconvenient, but life is a trade-off. Can data science help us measure that inconvenience and decide on a relocation?
<p></p>
This analysis demonstrates how data science can help make that decision.
<p></p>
This part of the analysis addresses the following:
 
-  **A description of the problem and a discussion of the background**
-  **A description of the data and how it will be used to solve the problem**
    
<h2> 1. Problem statement</h2>

The question this analysis sets out to answer is:

-  <font color='blue'>*How might someone decide whether a new place is worth relocating to?*</font>

<h2> Background</h2>

Suppose you live on the west side of Toronto, Canada. Recently, you received a better job offer on the other side of the city, that is, the east. You have a good life in the west, within a lively community and with all the amenities a metro city can offer.

<p></p>

Before accepting the job offer, you want to evaluate whether you will have the same quality of life in the east. If not, is it worth compromising your quality of life for the increased paycheck?

Things to consider:

-  Identify a few neighbourhoods which may be as good as the present location, that is, the west of Toronto
-  Check the venues around those neighbourhoods and decide whether they are as good as those in the west

<h2> 2. Data sources and their usage in this analysis </h2>

To be able to do this analysis, the following data will be used:

- Current location and future job location
- Neighbourhoods of Toronto
- Geocoded data
- Venues around neighbourhoods from the Foursquare API

## 3. Methodology

### 3.1 Data Cleansing

The postal codes, borough names and neighbourhood names of Toronto were obtained from this <a href = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"> Wikipedia page</a>. A sample of the data is given below.

In [11]:
import pandas as pd
raw_data = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")[0]
raw_data.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


It can be noted that the data is not entirely clean: some postal codes have no borough or neighbourhood and are marked "Not assigned". Further analysis revealed the profile of the data.

In [12]:
raw_data.describe(include = 'all')

Unnamed: 0,Postal Code,Borough,Neighbourhood
count,180,180,180
unique,180,11,100
top,M5W,Not assigned,Not assigned
freq,1,77,77


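It can further be noted that there are 180 postal codes. Because "Not assigned" is counted as a value, the 11 unique boroughs correspond to 10 actual boroughs and the 100 unique neighbourhood entries to 99 actual entries, several of which list more than one neighbourhood.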
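
The unique counts reported by `describe()` include the placeholder value "Not assigned". The following is a minimal sketch of one way to count only the assigned values. It uses a hand-built sample that mirrors the first rows of the table above, not the full Wikipedia data, and the names `sample` and `assigned` are illustrative.

```python
import pandas as pd

# Hand-built sample mirroring the first rows shown above (not the full table).
sample = pd.DataFrame({
    'Postal Code': ['M1A', 'M2A', 'M3A', 'M4A', 'M5A'],
    'Borough': ['Not assigned', 'Not assigned', 'North York', 'North York',
                'Downtown Toronto'],
    'Neighbourhood': ['Not assigned', 'Not assigned', 'Parkwoods',
                      'Victoria Village', 'Regent Park, Harbourfront'],
})

# Keep only rows whose borough is assigned, then count distinct values.
assigned = sample[sample['Borough'] != 'Not assigned']
n_boroughs = assigned['Borough'].nunique()
n_neighbourhood_entries = assigned['Neighbourhood'].nunique()
print(n_boroughs, n_neighbourhood_entries)
```

Applied to `raw_data`, the same filter would presumably give the counts without the placeholder, although an entry such as "Regent Park, Harbourfront" still counts as one value until it is split.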

### 3.2 Identifying the East areas

Several services provide geocoding data, that is, the latitude and longitude of places. In this analysis, the Google Maps Geocoding API is used.

In [13]:
df = raw_data.loc[raw_data['Borough'] != "Not assigned"].reset_index(drop=True)
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [14]:
Boroughs=pd.DataFrame({'Borough':df['Borough'].unique()})
Boroughs

Unnamed: 0,Borough
0,North York
1,Downtown Toronto
2,Etobicoke
3,Scarborough
4,East York
5,York
6,East Toronto
7,West Toronto
8,Central Toronto
9,Mississauga


In [15]:
import requests


def get_lat_lng(apiKey, address):
    """Return (lat, lng) for an address via the Google Maps Geocoding API, or (0, 0) on failure."""
    url = 'https://maps.googleapis.com/maps/api/geocode/json'
    try:
        # let requests URL-encode the address and the key
        response = requests.get(url, params={'address': address, 'key': apiKey})
        location = response.json()['results'][0]['geometry']['location']
        lat, lng = location['lat'], location['lng']
    except (requests.RequestException, KeyError, IndexError, ValueError):
        # network failure, malformed JSON or an empty result list
        print('ERROR: {}'.format(address))
        lat, lng = 0, 0
    return lat, lng


if __name__ == '__main__':
    # read the API key from a local file and strip the trailing newline
    fname = 'GoogleMapsAPIKey.txt'
    with open(fname, 'r') as file:
        apiKey = file.read().strip()
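
`get_lat_lng` returns `(0, 0)` when geocoding fails, which is itself a valid coordinate pair far from Toronto and would silently end up on the map. As a hedged alternative, the sketch below returns `None` instead. It relies on the top-level `status` field of the Geocoding API response (`OK`, `ZERO_RESULTS`, and so on); `parse_geocode_payload` is a hypothetical helper, and the payloads are hand-written for illustration rather than real API responses.

```python
def parse_geocode_payload(payload):
    """Return (lat, lng) from a Geocoding API JSON payload, or None if no result."""
    # The API reports success through a top-level "status" field.
    if payload.get('status') != 'OK' or not payload.get('results'):
        return None
    location = payload['results'][0]['geometry']['location']
    return location['lat'], location['lng']


# Hand-written payloads for illustration (not real API responses).
ok_payload = {
    'status': 'OK',
    'results': [{'geometry': {'location': {'lat': 43.653226, 'lng': -79.383184}}}],
}
empty_payload = {'status': 'ZERO_RESULTS', 'results': []}

print(parse_geocode_payload(ok_payload))
print(parse_geocode_payload(empty_payload))
```

Rows whose coordinates come back as `None` can then be dropped or retried before plotting.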

In [16]:
for i in range(len(Boroughs)):
    Boroughs.loc[i, 'Address'] = Boroughs.loc[i, 'Borough'] + ", Toronto, Canada"
    Coordinates = get_lat_lng(apiKey, Boroughs.loc[i, 'Address'])
    Boroughs.loc[i, 'Latitude'] = Coordinates[0]
    Boroughs.loc[i, 'Longitude'] = Coordinates[1]

Plotting the boroughs on a map

In [17]:
import folium
latitude = 43.653226
longitude = -79.383184

venues_map = folium.Map(location=[latitude, longitude], zoom_start=10) # generate map centred around Toronto


# add Toronto as a red circle mark
folium.CircleMarker(
    [latitude, longitude],
    radius=150,
    popup='Toronto',
    fill=True,
    color='red',
    fill_color='red',
    fill_opacity=0.5
    ).add_to(venues_map)


# add boroughs to the map as blue circle markers
for lat, lng, label in zip(Boroughs.Latitude, Boroughs.Longitude, Boroughs.Borough):
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        fill=True,
        color='blue',
        fill_color='blue',
        fill_opacity=0.6
        ).add_to(venues_map)

# display map
venues_map

The boroughs were visualised on the map and the following target boroughs were shortlisted:

- North York
- East York
- Scarborough

### 3.3 Identify and analyse neighbourhoods in the East

In [18]:
East = df[df.Borough.isin(['North York', 'East York', 'Scarborough'])]

Split the comma-separated neighbourhoods so that each one has its own row

In [23]:
East = (East.set_index(['Postal Code', 'Borough'])
   .stack()
   .str.split(',', expand=True)
   .stack()
   .unstack(-2)
   .reset_index(-1, drop=True)
   .reset_index()
)
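
Splitting on `','` alone leaves a leading space on every neighbourhood after the first in a cell, which is visible in names such as " Rouge" printed later in the notebook. The sketch below shows one alternative, assuming pandas 0.25 or later for `explode`. It runs on a hand-built two-row sample, and `sample` and `tidy` are illustrative names.

```python
import pandas as pd

# Hand-built sample mirroring two rows of the table (not the full data).
sample = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Borough': ['Scarborough', 'Scarborough'],
    'Neighbourhood': ['Malvern, Rouge', 'Rouge Hill, Port Union, Highland Creek'],
})

# Split on commas, give each neighbourhood its own row, and trim whitespace.
tidy = (sample
        .assign(Neighbourhood=sample['Neighbourhood'].str.split(','))
        .explode('Neighbourhood')
        .reset_index(drop=True))
tidy['Neighbourhood'] = tidy['Neighbourhood'].str.strip()
print(tidy)
```

Trimming the names also keeps later joins on the neighbourhood name from depending on stray whitespace.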

Obtain latitude and longitude of all such neighbourhoods

In [36]:
for i in range(len(East)):
    East.loc[i, 'Address'] = East.loc[i, 'Neighbourhood'] + ", " + East.loc[i, 'Borough'] + ", " + East.loc[i, 'Postal Code'] + ", Toronto, Canada"
    Coordinates = get_lat_lng(apiKey, East.loc[i, 'Address'])
    East.loc[i, 'Latitude'] = Coordinates[0]
    East.loc[i, 'Longitude'] = Coordinates[1]

In [37]:
East

Unnamed: 0,Postal Code,Borough,Neighbourhood,Address,Latitude,Longitude
0,M1B,Scarborough,Malvern,"Malvern, Scarborough, M1B, Toronto, Canada",43.806686,-79.194353
1,M1B,Scarborough,Rouge,"Rouge, Scarborough, M1B, Toronto, Canada",43.806686,-79.194353
2,M1C,Scarborough,Rouge Hill,"Rouge Hill, Scarborough, M1C, Toronto, Canada",43.794719,-79.134478
3,M1C,Scarborough,Port Union,"Port Union, Scarborough, M1C, Toronto, Canada",43.784535,-79.160497
4,M1C,Scarborough,Highland Creek,"Highland Creek, Scarborough, M1C, Toronto, Ca...",43.790121,-79.173392
...,...,...,...,...,...,...
78,M6L,North York,Maple Leaf Park,"Maple Leaf Park, North York, M6L, Toronto, Ca...",43.715895,-79.493079
79,M6L,North York,Upwood Park,"Upwood Park, North York, M6L, Toronto, Canada",43.708633,-79.502524
80,M9L,North York,Humber Summit,"Humber Summit, North York, M9L, Toronto, Canada",43.756303,-79.565963
81,M9M,North York,Humberlea,"Humberlea, North York, M9M, Toronto, Canada",43.721319,-79.533217


Visualise the neighbourhoods on a map

In [38]:
import folium
latitude = 43.653226
longitude = -79.383184

venues_map = folium.Map(location=[latitude, longitude], zoom_start=10) # generate map centred around Toronto


# add Toronto as a red circle mark
folium.CircleMarker(
    [latitude, longitude],
    radius=150,
    popup='Toronto',
    fill=True,
    color='red',
    fill_color='red',
    fill_opacity=0.5
    ).add_to(venues_map)


# add neighbourhoods to the map as blue circle markers
for lat, lng, label in zip(East.Latitude, East.Longitude, East.Address):
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        fill=True,
        color='blue',
        fill_color='blue',
        fill_opacity=0.6
        ).add_to(venues_map)

# display map
venues_map
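
The boroughs were shortlisted by eye, and a borough such as North York spans both sides of the city. As a rough complement to the map, the sketch below compares each neighbourhood's longitude with the Toronto reference longitude used for the maps. The four sample pairs are copied from the table above; `is_east_of_centre` is a hypothetical helper, and using a single longitude as the east/west boundary is an assumption, not part of the original analysis.

```python
# Longitude of the Toronto reference point used for the maps above.
CENTRE_LONGITUDE = -79.383184

# A few (neighbourhood, longitude) pairs copied from the table above.
sample = [
    ('Malvern', -79.194353),
    ('Rouge Hill', -79.134478),
    ('Maple Leaf Park', -79.493079),
    ('Humber Summit', -79.565963),
]


def is_east_of_centre(longitude, centre=CENTRE_LONGITUDE):
    """Toronto longitudes are negative; a larger value lies further east."""
    return longitude > centre


east = [name for name, lng in sample if is_east_of_centre(lng)]
west = [name for name, lng in sample if not is_east_of_centre(lng)]
print('East of centre:', east)
print('West of centre:', west)
```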

### 3.4 Venues in the East

In [39]:
df = East

Set up Foursquare API's requirements and test on one neighbourhood

In [65]:
# read the Foursquare credentials from local files and strip trailing newlines
with open('FoursquareID.txt', 'r') as file:
    CLIENT_ID = file.read().strip()
with open('FoursquareSecret.txt', 'r') as file:
    CLIENT_SECRET = file.read().strip()

VERSION = '20180605' # Foursquare API version
LIMIT = 50 # limit of number of venues returned by Foursquare API
radius = 500 # search radius in metres
neighborhood_latitude = df.loc[0,'Latitude']
neighborhood_longitude =  df.loc[0,'Longitude']

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID,
    CLIENT_SECRET,
    VERSION,
    neighborhood_latitude,
    neighborhood_longitude,
    radius,
    LIMIT)

url # display URL (credentials are redacted in the output below)

'https://api.foursquare.com/v2/venues/explore?&client_id=<CLIENT_ID>&client_secret=<CLIENT_SECRET>&v=20180605&ll=43.8066863,-79.1943534&radius=500&limit=50'

In [66]:
import requests
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5fb6a31ef2dca11a895bdb19'},
  'headerLocation': 'Malvern',
  'headerFullLocation': 'Malvern, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 1,
  'suggestedBounds': {'ne': {'lat': 43.811186304500005,
    'lng': -79.1881295807304},
   'sw': {'lat': 43.8021862955, 'lng': -79.20057721926959}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4bb6b9446edc76b0d771311c',
       'name': 'Wendy’s',
       'location': {'crossStreet': 'Morningside & Sheppard',
        'lat': 43.80744841934756,
        'lng': -79.19905558052072,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.80744841934756,
          'lng': -79.19905558052072}],
        'distance': 387,
        'cc': 'CA',
        'city': 'Toronto',

In [67]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [68]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = pd.json_normalize(venues) # flatten JSON


# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Wendy’s,Fast Food Restaurant,43.807448,-79.199056
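
The response reports a `distance` of 387 for this venue, measured from the query point. As a sanity check on the coordinates, the sketch below computes the great-circle distance with the haversine formula, assuming a spherical Earth with a mean radius of 6,371 km. `haversine_m` is an illustrative helper, not part of the notebook, and the result should land close to the reported value.

```python
import math


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres between two latitude/longitude points."""
    r = 6371000.0  # mean Earth radius in metres (spherical approximation)
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))


# Query point from the request URL and venue location from the response above.
distance = haversine_m(43.8066863, -79.1943534,
                       43.80744841934756, -79.19905558052072)
print(round(distance))
```

The same function can be used to confirm that every returned venue lies within the 500 m radius passed to the API.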


Obtain venues around all the selected neighbourhoods

In [69]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
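
`getNearbyVenues` indexes `v['venue']['categories'][0]` directly, which would raise an `IndexError` for a venue that has no category. The sketch below shows one way to guard against that case. `extract_venue_rows` is a hypothetical helper, and the two items are hand-written in the shape of the response shown earlier; the second venue is invented to exercise the empty case.

```python
def extract_venue_rows(name, lat, lng, items):
    """Flatten Foursquare 'items' into rows (hypothetical helper)."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Fall back to None when a venue carries no category.
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'],
                     category))
    return rows


# Hand-written items in the shape of the response shown earlier.
items = [
    {'venue': {'name': 'Wendy’s',
               'location': {'lat': 43.807448, 'lng': -79.199056},
               'categories': [{'name': 'Fast Food Restaurant'}]}},
    # Invented venue without a category, to exercise the fallback.
    {'venue': {'name': 'Unnamed venue',
               'location': {'lat': 43.8070, 'lng': -79.1990},
               'categories': []}},
]
rows = extract_venue_rows('Malvern', 43.806686, -79.194353, items)
print(rows)
```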

In [70]:
# fetch the venues around every selected neighbourhood
toronto_venues = getNearbyVenues(names=df['Neighbourhood'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude']
                                  )

Malvern
 Rouge
Rouge Hill
 Port Union
 Highland Creek
Guildwood
 Morningside
 West Hill
Woburn
Cedarbrae
Scarborough Village
Kennedy Park
 Ionview
 East Birchmount Park
Golden Mile
 Clairlea
 Oakridge
Cliffside
 Cliffcrest
 Scarborough Village West
Birch Cliff
 Cliffside West
Dorset Park
 Wexford Heights
 Scarborough Town Centre
Wexford
 Maryvale
Agincourt
Clarks Corners
 Tam O'Shanter
 Sullivan
Milliken
 Agincourt North
 Steeles East
 L'Amoreaux East
Steeles West
 L'Amoreaux West
Upper Rouge
Hillcrest Village
Fairview
 Henry Farm
 Oriole
Bayview Village
York Mills
 Silver Hills
Willowdale
 Newtonbrook
Willowdale
 Willowdale East
York Mills West
Willowdale
 Willowdale West
Parkwoods
Don Mills
Don Mills
Bathurst Manor
 Wilson Heights
 Downsview North
Northwood Park
 York University
Downsview
Downsview
Downsview
Downsview
Victoria Village
Parkview Hill
 Woodbine Gardens
Woodbine Heights
Leaside
Thorncliffe Park
East Toronto
 Broadview North (Old East York)
Bedford Park
 Lawrence Manor Ea

Converting the venues into a dataframe which also captures the venue category

In [71]:
print(toronto_venues.shape)
toronto_venues.head()

(786, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Malvern,43.806686,-79.194353,Wendy’s,43.807448,-79.199056,Fast Food Restaurant
1,Rouge,43.806686,-79.194353,Wendy’s,43.807448,-79.199056,Fast Food Restaurant
2,Rouge Hill,43.794719,-79.134478,Rouge Bowl,43.797993,-79.137013,Bowling Alley
3,Rouge Hill,43.794719,-79.134478,Bikram Yoga East,43.798124,-79.137291,Yoga Studio
4,Port Union,43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar


Summarising the data by neighbourhood

In [72]:
toronto_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Agincourt North,3,3,3,3,3,3
Broadview North (Old East York),10,10,10,10,10,10
Clairlea,5,5,5,5,5,5
Cliffcrest,2,2,2,2,2,2
Cliffside West,9,9,9,9,9,9
...,...,...,...,...,...,...
Willowdale,38,38,38,38,38,38
Woburn,4,4,4,4,4,4
Woodbine Heights,7,7,7,7,7,7
York Mills,1,1,1,1,1,1


### 3.5 Cluster Analysis

In [73]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Yoga Studio,Accessories Store,Airport,Airport Service,American Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Garage,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [74]:
toronto_onehot.shape

(786, 160)

In [75]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,Airport,Airport Service,American Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wings Joint,Women's Store
0,Agincourt North,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
1,Broadview North (Old East York),0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
2,Clairlea,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.2,0.000000,...,0.0,0.0,0.0,0.2,0.0,0.0,0.000000,0.0,0.0,0.0
3,Cliffcrest,0.0,0.0,0.0,0.0,0.500000,0.0,0.000000,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
4,Cliffside West,0.0,0.0,0.0,0.0,0.111111,0.0,0.000000,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
71,Willowdale,0.0,0.0,0.0,0.0,0.000000,0.0,0.026316,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.026316,0.0,0.0,0.0
72,Woburn,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
73,Woodbine Heights,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.142857,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
74,York Mills,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
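
To make the two steps above concrete, the sketch below repeats them on the first five venue rows shown earlier, with neighbourhood names trimmed. Each value in the grouped table is the share of a neighbourhood's venues that fall in a category, so every row sums to 1. The names `venues`, `onehot` and `grouped` are illustrative.

```python
import pandas as pd

# The first five venue rows shown earlier, reduced to two columns.
venues = pd.DataFrame({
    'Neighborhood': ['Malvern', 'Rouge', 'Rouge Hill', 'Rouge Hill', 'Port Union'],
    'Venue Category': ['Fast Food Restaurant', 'Fast Food Restaurant',
                       'Bowling Alley', 'Yoga Studio', 'Bar'],
})

# One column per category, 1 where the venue belongs to it.
onehot = pd.get_dummies(venues['Venue Category'], dtype=int)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# The mean of 0/1 indicators is the share of venues in each category.
grouped = onehot.groupby('Neighborhood').mean()
print(grouped.loc['Rouge Hill'])
```

Rouge Hill has two venues in two different categories, so each of those categories receives a share of one half.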


In [153]:
toronto_grouped['Neighborhood'].describe(include='all')

count           76
unique          76
top       Fairview
freq             1
Name: Neighborhood, dtype: object

In [154]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

---- Agincourt North----
               venue  freq
0         Playground  0.33
1               Park  0.33
2             Bakery  0.33
3  Mobile Phone Shop  0.00
4     Medical Center  0.00


---- Broadview North (Old East York)----
                       venue  freq
0                Pizza Place   0.1
1           Greek Restaurant   0.1
2  Middle Eastern Restaurant   0.1
3                Bus Station   0.1
4                   Bus Stop   0.1


---- Clairlea----
                           venue  freq
0               Asian Restaurant   0.2
1  Vegetarian / Vegan Restaurant   0.2
2          General Entertainment   0.2
3                  Movie Theater   0.2
4                 Discount Store   0.2


---- Cliffcrest----
                      venue  freq
0                     Motel   0.5
1       American Restaurant   0.5
2               Yoga Studio   0.0
3  Mediterranean Restaurant   0.0
4               Men's Store   0.0


---- Cliffside West----
              venue  freq
0       Coffee Shop  0.11
1 

#### Let's put that into a _pandas_ dataframe


First, let's write a function to sort the venues in descending order.


In [155]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
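
The sketch below applies the same idea to a small hand-built row whose frequencies are invented so that no two categories tie. When categories do tie, including the many categories with a frequency of zero, their relative order after sorting is not meaningful. This is a likely reason why the same tail of categories appears in the "most common" columns of neighbourhoods that have only a few venues. `top_categories` is an illustrative name.

```python
import pandas as pd


def top_categories(row, num_top_venues):
    """Same idea as return_most_common_venues, shown on a hand-built row."""
    frequencies = row.iloc[1:].astype(float)  # skip the Neighborhood label
    ordered = frequencies.sort_values(ascending=False)
    return list(ordered.index[:num_top_venues])


# Invented frequencies chosen so that no two categories tie.
row = pd.Series({'Neighborhood': 'Example', 'Bakery': 0.3, 'Park': 0.5, 'Café': 0.2})
print(top_categories(row, 2))
```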

Now let's create the new dataframe and display the top 10 venues for each neighborhood.


In [156]:
import numpy as np

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt North,Playground,Park,Bakery,Women's Store,Deli / Bodega,Dog Run,Discount Store,Diner,Dim Sum Restaurant,Dessert Shop
1,Broadview North (Old East York),Pizza Place,Bakery,Middle Eastern Restaurant,Bus Stop,Bus Station,Sandwich Place,Discount Store,Bank,Pharmacy,Greek Restaurant
2,Clairlea,General Entertainment,Discount Store,Vegetarian / Vegan Restaurant,Movie Theater,Asian Restaurant,Dance Studio,Dog Run,Diner,Dim Sum Restaurant,Dessert Shop
3,Cliffcrest,American Restaurant,Motel,Women's Store,Deli / Bodega,Electronics Store,Dog Run,Discount Store,Diner,Dim Sum Restaurant,Dessert Shop
4,Cliffside West,Pharmacy,Sushi Restaurant,Park,Pizza Place,American Restaurant,Hotel,Coffee Shop,Pet Store,Fish & Chips Shop,Dim Sum Restaurant


In [157]:
neighborhoods_venues_sorted.shape

(76, 11)

Run _k_-means to group the neighbourhoods into 5 clusters.


In [140]:
# import k-means from clustering stage
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([4, 1, 1, 1, 1, 1, 1, 1, 0, 1], dtype=int32)
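
Before interpreting the clusters, it can help to check how evenly the neighbourhoods are spread across them. The sketch below counts only the ten labels printed above; for the full picture the same count would be run on `kmeans.labels_`.

```python
from collections import Counter

# The first ten labels printed above; the full array has one label per neighbourhood.
labels = [4, 1, 1, 1, 1, 1, 1, 1, 0, 1]

sizes = Counter(labels)
for cluster, size in sorted(sizes.items()):
    print('cluster label {}: {} neighbourhoods'.format(cluster, size))
```

A strongly unbalanced count, with most neighbourhoods under a single label, suggests that the clustering separates a few outliers rather than finding several comparable groups.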

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.


In [165]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
toronto_merged = df

# join the sorted venues (with cluster labels) onto the neighbourhood data, which holds latitude/longitude
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighbourhood')



In [167]:
toronto_merged = toronto_merged.loc[toronto_merged['Cluster Labels'].notna()].reset_index(drop=True)

Finally, let's visualize the resulting clusters


In [172]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
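
The map code looks up colours with `rainbow[int(cluster)-1]`, so label 0 takes the last colour through negative indexing. Each cluster still gets its own colour, but the offset is easy to misread. The sketch below indexes a palette with the label itself. It builds the palette with the standard-library `colorsys` module as a stand-in for the matplotlib colour map, and `cluster_colours` is a hypothetical helper.

```python
import colorsys


def cluster_colours(k):
    """Return k evenly spaced hex colours (stand-in for the matplotlib palette)."""
    colours = []
    for i in range(k):
        r, g, b = colorsys.hsv_to_rgb(i / k, 0.9, 0.9)
        colours.append('#{:02x}{:02x}{:02x}'.format(
            int(r * 255), int(g * 255), int(b * 255)))
    return colours


palette = cluster_colours(5)
# Labels run from 0 to k - 1, so they can index the palette directly.
label = 3.0  # labels become floats after the join, so cast before indexing
print(palette[int(label)])
```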

<a id='item5'></a>


### 3.6 Examine Clusters


Each cluster can now be examined to determine the venue categories that distinguish it. Based on these defining categories, a name can be assigned to each cluster.


#### Cluster 1


In [173]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
16,Scarborough,-79.274886,0.0,Bus Stop,Park,Convenience Store,Ice Cream Shop,Dessert Shop,Restaurant,Deli / Bodega,Dog Run,Discount Store,Diner
40,North York,-79.346524,0.0,Park,Intersection,Middle Eastern Restaurant,Restaurant,Women's Store,Dance Studio,Discount Store,Diner,Dim Sum Restaurant,Dessert Shop
44,North York,-79.370491,0.0,Coffee Shop,Park,Middle Eastern Restaurant,Deli / Bodega,Dog Run,Discount Store,Diner,Dim Sum Restaurant,Dessert Shop,Department Store
48,North York,-79.400049,0.0,Park,Convenience Store,Women's Store,Deli / Bodega,Electronics Store,Dog Run,Discount Store,Diner,Dim Sum Restaurant,Dessert Shop
51,North York,-79.329656,0.0,Food & Drink Shop,Park,Women's Store,Deli / Bodega,Dog Run,Discount Store,Diner,Dim Sum Restaurant,Dessert Shop,Department Store
55,North York,-79.446661,0.0,Park,Women's Store,IT Services,Dance Studio,Dog Run,Discount Store,Diner,Dim Sum Restaurant,Dessert Shop,Department Store
64,East York,-79.325415,0.0,Pharmacy,Park,Women's Store,Dance Studio,Dog Run,Discount Store,Diner,Dim Sum Restaurant,Dessert Shop,Department Store
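
One simple way to find a defining category is to count how often each category appears among the top venues of a cluster's members. The sketch below does this for the first and second most common venues of the seven rows in the table above, copied by hand. Restricting the count to the top two columns is an assumption made for illustration.

```python
from collections import Counter

# 1st and 2nd most common venues of the seven rows in the table above.
top_two = [
    ('Bus Stop', 'Park'),
    ('Park', 'Intersection'),
    ('Coffee Shop', 'Park'),
    ('Park', 'Convenience Store'),
    ('Food & Drink Shop', 'Park'),
    ('Park', "Women's Store"),
    ('Pharmacy', 'Park'),
]

counts = Counter(category for pair in top_two for category in pair)
print(counts.most_common(3))
```

Every member of this cluster has Park among its two most common venues, which suggests "parks" as a working name for it.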


#### Cluster 2


In [174]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,Scarborough,-79.134478,1.0,Yoga Studio,Bowling Alley,Cocktail Bar,Electronics Store,Dog Run,Discount Store,Diner,Dim Sum Restaurant,Dessert Shop,Department Store
3,Scarborough,-79.160497,1.0,Construction & Landscaping,Bar,Women's Store,Department Store,Electronics Store,Dog Run,Discount Store,Diner,Dim Sum Restaurant,Dessert Shop
4,Scarborough,-79.173392,1.0,Construction & Landscaping,Airport Service,IT Services,Women's Store,Department Store,Falafel Restaurant,Electronics Store,Dog Run,Discount Store,Diner
5,Scarborough,-79.192777,1.0,Other Repair Shop,Baseball Field,Train Station,Women's Store,Deli / Bodega,Dog Run,Discount Store,Diner,Dim Sum Restaurant,Dessert Shop
6,Scarborough,-79.204997,1.0,Coffee Shop,Park,Pharmacy,Fast Food Restaurant,Café,Beer Store,Mobile Phone Shop,Supermarket,Discount Store,Sandwich Place
...,...,...,...,...,...,...,...,...,...,...,...,...,...
76,North York,-79.475099,1.0,Fast Food Restaurant,Liquor Store,Supermarket,Pizza Place,Sandwich Place,Shopping Mall,Breakfast Spot,Big Box Store,Pet Store,Gas Station
78,North York,-79.502524,1.0,Pizza Place,Gas Station,Vietnamese Restaurant,Convenience Store,Mexican Restaurant,Discount Store,Diner,Dim Sum Restaurant,Dessert Shop,Department Store
79,North York,-79.565963,1.0,Pizza Place,Furniture / Home Store,Falafel Restaurant,Dog Run,Discount Store,Diner,Dim Sum Restaurant,Dessert Shop,Department Store,Deli / Bodega
80,North York,-79.533217,1.0,Convenience Store,Baseball Field,Italian Restaurant,Gas Station,Women's Store,Dessert Shop,Electronics Store,Dog Run,Discount Store,Diner


#### Cluster 3


In [175]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
43,North York,-79.374714,2.0,Martial Arts School,Hockey Arena,Falafel Restaurant,Electronics Store,Dog Run,Discount Store,Diner,Dim Sum Restaurant,Dessert Shop,Department Store


#### Cluster 4


In [176]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Scarborough,-79.194353,3.0,Fast Food Restaurant,Women's Store,Deli / Bodega,Electronics Store,Dog Run,Discount Store,Diner,Dim Sum Restaurant,Dessert Shop,Department Store
1,Scarborough,-79.194353,3.0,Fast Food Restaurant,Women's Store,Deli / Bodega,Electronics Store,Dog Run,Discount Store,Diner,Dim Sum Restaurant,Dessert Shop,Department Store


#### Cluster 5


In [177]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
31,Scarborough,-79.284577,4.0,Playground,Park,Bakery,Women's Store,Deli / Bodega,Dog Run,Discount Store,Diner,Dim Sum Restaurant,Dessert Shop
32,Scarborough,-79.284577,4.0,Playground,Park,Bakery,Women's Store,Deli / Bodega,Dog Run,Discount Store,Diner,Dim Sum Restaurant,Dessert Shop
34,Scarborough,-79.284577,4.0,Playground,Park,Bakery,Women's Store,Deli / Bodega,Dog Run,Discount Store,Diner,Dim Sum Restaurant,Dessert Shop
77,North York,-79.493079,4.0,Park,Construction & Landscaping,Business Service,Bakery,Women's Store,Department Store,Electronics Store,Dog Run,Discount Store,Diner


Thank you!

# If you cannot see the presentation or report, please click the following:

- https://github.com/anix-anirban/Coursera_Capstone/blob/main/Capstone_Report.pdf
- https://1drv.ms/p/s!Aqm_Z3_acITwgoJdfCRpyFoMQFzR7w?e=mzpZfA