# Task 1

#### Get data from Wikipedia

In [1]:
# Get data from Wikipedia
import pandas as pd
import requests

url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

r = requests.get(url)
df_list = pd.read_html(r.text) # parse every table on the page into a list of dataframes
df = df_list[0]
df.columns = ['PostalCode', 'Borough',  'Neighborhood']
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


The dataframe consists of three columns: PostalCode, Borough, and Neighborhood. A postal code area can contain more than one neighborhood; in that case the neighborhoods appear in a single row, separated by commas, as in row M5A above.
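
The Wikipedia table already arrives with one row per postal code. If a source instead listed one row per neighborhood, the combining step could be sketched roughly like this; the `raw` frame is made-up sample data, not part of the notebook.

```python
import pandas as pd

# Hypothetical sample: one row per (postal code, neighborhood) pair.
raw = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# Collapse to one row per postal code, joining neighborhood names with ", ".
combined = (
    raw.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
       .agg(", ".join)
       .reset_index()
)
print(combined)
```

`sort=False` keeps the postal codes in their order of first appearance rather than sorting them alphabetically.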

#### Process only rows with an assigned borough: drop every row whose Borough is 'Not assigned'

In [2]:
df = df.drop(df[df['Borough'] == 'Not assigned'].index).reset_index(drop=True)
df

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


#### Check whether any row has a borough but a 'Not assigned' neighborhood. If so, the neighborhood is set to the borough name.

In [3]:
df[df['Neighborhood'] == 'Not assigned']

Unnamed: 0,PostalCode,Borough,Neighborhood


The result is empty, so no row has a 'Not assigned' neighborhood and nothing needs to be changed.
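
Had the check returned rows, the replacement rule could be applied with a boolean mask. This is a minimal sketch on made-up rows (`sample` is hypothetical), not a step the notebook actually needed to run.

```python
import pandas as pd

# Hypothetical rows; the second one has a borough but no neighborhood.
sample = pd.DataFrame({
    "PostalCode": ["M3A", "M7A"],
    "Borough": ["North York", "Queen's Park"],
    "Neighborhood": ["Parkwoods", "Not assigned"],
})

# Where the neighborhood is unassigned, copy the borough name into it.
missing = sample["Neighborhood"] == "Not assigned"
sample.loc[missing, "Neighborhood"] = sample.loc[missing, "Borough"]
print(sample)
```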

#### Use the .shape attribute to check the number of rows in the dataframe

In [12]:
df.shape

(103, 5)

# Task 2

In this task, I use the CSV file of geospatial coordinates instead of a geocoding API.

In [14]:
# Read CSV file
coordinates = pd.read_csv('D://Downloads/Geospatial_Coordinates.csv')
coordinates.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [15]:
# Loop over every postal code to look up its latitude and longitude
lat_list, long_list = [], []
for row in df['PostalCode']:
    lat_list.append(coordinates.loc[coordinates['Postal Code'] == row, 'Latitude'].values[0])
    long_list.append(coordinates.loc[coordinates['Postal Code'] == row, 'Longitude'].values[0])

In [16]:
# Add the latitude and longitude columns to the dataframe
df['Latitude'] = lat_list
df['Longitude'] = long_list
df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
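
The row-by-row lookup in cell 15 could also be expressed as a single merge. The sketch below uses two small made-up frames (`df_small`, `coords_small`) that only mimic the column names of `df` and `coordinates`.

```python
import pandas as pd

# Hypothetical stand-ins for `df` and `coordinates` with the same column names.
df_small = pd.DataFrame({"PostalCode": ["M3A", "M1B"],
                         "Borough": ["North York", "Scarborough"]})
coords_small = pd.DataFrame({"Postal Code": ["M1B", "M3A"],
                             "Latitude": [43.806686, 43.753259],
                             "Longitude": [-79.194353, -79.329656]})

# A left merge keeps every row of df_small in its original order
# and attaches the matching coordinates.
merged = (
    df_small.merge(coords_small, how="left",
                   left_on="PostalCode", right_on="Postal Code")
            .drop(columns="Postal Code")
)
print(merged)
```

One difference in behavior: a postal code missing from the coordinates file would get NaN here, whereas the `.values[0]` lookup in the loop would raise an `IndexError`.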


# Task 3

In this task, I work with all of Toronto. The latitude and longitude of each postal code are used to fetch nearby venues, which are then used to segment and cluster the neighborhoods.

In [21]:
import folium

# Latitude and longitude of Toronto
latitude = 43.6532
longitude = -79.3832

In [22]:
# Create a map of Toronto
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

for lat, lng, borough, neighborhood, pst_code in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighborhood'], df['PostalCode']):
    label = '{}, {}, {}'.format(neighborhood, borough, pst_code)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

In [23]:
# Insert credential
CLIENT_ID = 'VOWLFKJCLNN5HY3KIKLHXCVRUU5SNKCW3LE13UZSAMM1WOQJ' # your Foursquare ID
CLIENT_SECRET = 'AIALEK502TNFTN4TFYDIFGY3X4AQSJ0GQ0QRQNV5C5EBARDA' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)
LIMIT = 100
radius = 500 # define radius

Your credentials:
CLIENT_ID: VOWLFKJCLNN5HY3KIKLHXCVRUU5SNKCW3LE13UZSAMM1WOQJ
CLIENT_SECRET:AIALEK502TNFTN4TFYDIFGY3X4AQSJ0GQ0QRQNV5C5EBARDA
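
Printing credentials leaves them visible to anyone who reads the notebook output. One alternative, sketched here on the assumption that the keys are stored in environment variables, is to read them at run time and print only a masked form. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the `mask` helper are hypothetical.

```python
import os

# Environment variable names are assumptions made for this sketch.
CLIENT_ID = os.environ.get("FOURSQUARE_CLIENT_ID", "")
CLIENT_SECRET = os.environ.get("FOURSQUARE_CLIENT_SECRET", "")

def mask(secret, visible=4):
    """Show only the first few characters of a secret."""
    return secret[:visible] + "*" * max(len(secret) - visible, 0)

print("CLIENT_ID:", mask(CLIENT_ID))
print("CLIENT_SECRET:", mask(CLIENT_SECRET))
```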


#### Define a function to get the venues near a neighborhood's latitude and longitude

In [24]:
def getnearbyvenues(pst_code, name, brgh, lat, lng,  radius=500):    
    print(pst_code)

    # create the API request URL
    url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
        CLIENT_ID, 
        CLIENT_SECRET, 
        VERSION, 
        lat, 
        lng, 
        radius, 
        LIMIT)

    results = requests.get(url).json()["response"]['groups'][0]['items']
    return [(pst_code, name,
            brgh,
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results]
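
The list comprehension above indexes `v['venue']['categories'][0]`, which raises an `IndexError` for a venue with an empty category list. A more defensive extraction might look like the sketch below. The `extract_venue` helper and the `items` payload are hypothetical: the payload is hand-written to mimic only the fields the function reads and is not actual API output.

```python
def extract_venue(item):
    """Return (name, lat, lng, category) from one item shaped like those read above."""
    venue = item["venue"]
    categories = venue.get("categories") or []
    # Fall back to None when a venue carries no category.
    category = categories[0]["name"] if categories else None
    return (venue["name"], venue["location"]["lat"], venue["location"]["lng"], category)

# Hand-written sample items (hypothetical, not actual API output).
items = [
    {"venue": {"name": "Brookbanks Park",
               "location": {"lat": 43.751976, "lng": -79.33214},
               "categories": [{"name": "Park"}]}},
    {"venue": {"name": "Unnamed Spot",
               "location": {"lat": 43.75, "lng": -79.33},
               "categories": []}},
]

rows = [extract_venue(item) for item in items]
print(rows)
```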


#### Column names for the venues dataframe

In [27]:
col = ['PostalCode','Neighborhood', 'Borough',
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']

#### Loop over the dataframe to get the venues near each neighborhood's latitude and longitude

In [25]:
res = []
for pstl_code, name, brgh, lat, long in zip(df['PostalCode'], df['Neighborhood'], df['Borough'], df['Latitude'], df['Longitude']):
    try:
        res.append(getnearbyvenues(pstl_code, name, brgh, lat, long))
    except Exception:
        # Skip postal codes whose request or response parsing fails
        print('Error in Postal Code: ', pstl_code)


M3A
M4A
M5A
M6A
M7A
M9A
M1B
M3B
M4B
M5B
M6B
M9B
M1C
M3C
M4C
M5C
M6C
M9C
M1E
M4E
M5E
M6E
M1G
M4G
M5G
M6G
M1H
M2H
M3H
M4H
M5H
M6H
M1J
M2J
M3J
M4J
M5J
M6J
M1K
M2K
M3K
M4K
M5K
M6K
M1L
M2L
M3L
M4L
M5L
M6L
M9L
M1M
M2M
M3M
M4M
M5M
M6M
M9M
M1N
M2N
M3N
M4N
M5N
M6N
M9N
M1P
M2P
M4P
M5P
M6P
M9P
M1R
M2R
M4R
M5R
M6R
M7R
M9R
M1S
M4S
M5S
M6S
M1T
M4T
M5T
M1V
M4V
M5V
M8V
M9V
M1W
M4W
M5W
M8W
M9W
M1X
M4X
M5X
M8X
M4Y
M7Y
M8Y
M8Z


#### Convert the result to a dataframe

In [37]:
toronto_venues = pd.DataFrame([item for sublist in res for item in sublist], columns = col)
toronto_venues.head()

Unnamed: 0,PostalCode,Neighborhood,Borough,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M3A,Parkwoods,North York,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,M3A,Parkwoods,North York,43.753259,-79.329656,Brookbanks Pool,43.751389,-79.332184,Pool
2,M3A,Parkwoods,North York,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
3,M4A,Victoria Village,North York,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
4,M4A,Victoria Village,North York,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop


#### Count the venues returned for each postal code

In [38]:
toronto_venues.groupby('PostalCode').count()

Unnamed: 0_level_0,Neighborhood,Borough,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
PostalCode,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
M1B,2,2,2,2,2,2,2,2
M1C,2,2,2,2,2,2,2,2
M1E,7,7,7,7,7,7,7,7
M1G,5,5,5,5,5,5,5,5
M1H,9,9,9,9,9,9,9,9
...,...,...,...,...,...,...,...,...
M9N,1,1,1,1,1,1,1,1
M9P,8,8,8,8,8,8,8,8
M9R,3,3,3,3,3,3,3,3
M9V,10,10,10,10,10,10,10,10


In [41]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 272 unique categories.


#### Convert the categorical variable into numerical variables using one-hot encoding

In [42]:
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add the postal code column back to the dataframe
toronto_onehot['PostalCode'] = toronto_venues['PostalCode'] 

# move the postal code column to the first position
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,PostalCode,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,M3A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,M3A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,M3A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,M4A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,M4A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
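
To make the encoding step easier to follow, here is the same idea on a three-row frame. The `venues` data is made up for illustration, and `insert(0, ...)` is used as one possible way to place `PostalCode` first instead of reordering the column list.

```python
import pandas as pd

# Hypothetical venue rows.
venues = pd.DataFrame({
    "PostalCode": ["M3A", "M3A", "M4A"],
    "Venue Category": ["Park", "Pool", "Park"],
})

# One indicator column per category; empty prefix arguments keep the bare category names.
onehot = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="")

# Put the postal code in the first column.
onehot.insert(0, "PostalCode", venues["PostalCode"])
print(onehot)
```

Depending on the pandas version, the indicator columns may be shown as 0/1 or as True/False; both average the same way in the grouping step that follows.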


#### Group by postal code and take the mean of each category

In [43]:
toronto_grouped = toronto_onehot.groupby('PostalCode').mean().reset_index()
toronto_grouped

Unnamed: 0,PostalCode,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,M1B,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,M1C,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,M1E,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,M1G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,M1H,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
94,M9N,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
95,M9P,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
96,M9R,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
97,M9V,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
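
Because every venue row contains exactly one 1, the mean per postal code is the share of that postal code's venues in each category, and each row of shares adds up to 1. A small sketch on made-up one-hot rows:

```python
import pandas as pd

# Hypothetical one-hot rows: M3A has two venues, M4A has one.
onehot = pd.DataFrame({
    "PostalCode": ["M3A", "M3A", "M4A"],
    "Park": [1, 0, 1],
    "Pool": [0, 1, 0],
})

grouped = onehot.groupby("PostalCode").mean().reset_index()
print(grouped)

# Each row of category shares should add up to 1.
row_sums = grouped.drop(columns="PostalCode").sum(axis=1)
print(row_sums.tolist())
```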


#### Show the top 5 venue categories for every postal code


In [44]:
num_top_venues = 5

for hood in toronto_grouped['PostalCode']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['PostalCode'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----M1B----
                        venue  freq
0        Fast Food Restaurant   0.5
1                  Print Shop   0.5
2  Modern European Restaurant   0.0
3           Mobile Phone Shop   0.0
4          Miscellaneous Shop   0.0


----M1C----
                             venue  freq
0                              Bar   0.5
1       Construction & Landscaping   0.5
2                Accessories Store   0.0
3               Mexican Restaurant   0.0
4  Molecular Gastronomy Restaurant   0.0


----M1E----
                venue  freq
0  Mexican Restaurant  0.14
1      Breakfast Spot  0.14
2                Bank  0.14
3      Medical Center  0.14
4   Electronics Store  0.14


----M1G----
                             venue  freq
0                      Coffee Shop   0.4
1                         Pharmacy   0.2
2                Korean Restaurant   0.2
3                Convenience Store   0.2
4  Molecular Gastronomy Restaurant   0.0


----M1H----
                  venue  freq
0                Bakery  0
...
            venue  freq
0     Coffee Shop  0.10
1  Clothing Store  0.10
2     Yoga Studio  0.05
3      Bagel Shop  0.05
4           Diner  0.05


----M4S----
                venue  freq
0      Sandwich Place  0.09
1        Dessert Shop  0.09
2         Pizza Place  0.06
3                Café  0.06
4  Italian Restaurant  0.06


----M4T----
           venue  freq
0           Park  0.25
1   Tennis Court  0.25
2     Playground  0.25
3     Restaurant  0.25
4  Metro Station  0.00


----M4V----
                venue  freq
0                 Pub  0.12
1         Coffee Shop  0.12
2                Bank  0.06
3  Light Rail Station  0.06
4         Supermarket  0.06


----M4W----
                        venue  freq
0                        Park  0.50
1                       Trail  0.25
2                  Playground  0.25
3           Accessories Store  0.00
4  Modern European Restaurant  0.00


----M4X----
         venue  freq
0  Coffee Shop  0.06
1         Café  0.04
2  Pizza Place  0.04
3       Bake
...
4      Burrito Place  0.07


----M9C----
          venue  freq
0      Pharmacy  0.12
1   Coffee Shop  0.12
2  Liquor Store  0.12
3    Beer Store  0.12
4          Café  0.12


----M9L----
                             venue  freq
0                              Gym   0.5
1                     Home Service   0.5
2                Accessories Store   0.0
3                    Metro Station   0.0
4  Molecular Gastronomy Restaurant   0.0


----M9M----
                             venue  freq
0       Construction & Landscaping   0.5
1                   Baseball Field   0.5
2                Accessories Store   0.0
3               Mexican Restaurant   0.0
4  Molecular Gastronomy Restaurant   0.0


----M9N----
                        venue  freq
0                        Park   1.0
1           Accessories Store   0.0
2                 Men's Store   0.0
3  Modern European Restaurant   0.0
4           Mobile Phone Shop   0.0


----M9P----
                venue  freq
0         Pizza Place  0.25
...

#### Define a function to return the most common venues


In [45]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
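
Note that sorting alone also returns categories whose frequency is 0 when a postal code has fewer distinct categories than `num_top_venues`. A variant that skips those could look like the sketch below; `top_categories` and the sample `row` are hypothetical and are not used elsewhere in the notebook.

```python
import pandas as pd

def top_categories(row, num_top_venues):
    """Return up to num_top_venues category names with non-zero frequency."""
    freqs = row.iloc[1:].astype(float)   # skip the PostalCode entry
    freqs = freqs[freqs > 0]             # drop categories with no venues
    return freqs.nlargest(num_top_venues).index.tolist()

# Hypothetical grouped row in which only two categories are present.
row = pd.Series({"PostalCode": "M1B", "Bar": 0.0,
                 "Fast Food Restaurant": 0.5, "Print Shop": 0.5, "Zoo": 0.0})
print(top_categories(row, 3))
```

The returned list can be shorter than `num_top_venues`, so code that fills a fixed number of columns would need to pad it.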

In [77]:
import numpy as np
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['PostalCode']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
postalcode_venues_sorted = pd.DataFrame(columns=columns)
postalcode_venues_sorted['PostalCode'] = toronto_grouped['PostalCode']

for ind in np.arange(toronto_grouped.shape[0]):
    postalcode_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

postalcode_venues_sorted.head()

Unnamed: 0,PostalCode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Fast Food Restaurant,Print Shop,Dog Run,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Doner Restaurant,Deli / Bodega
1,M1C,Bar,Construction & Landscaping,Yoga Studio,Donut Shop,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Drugstore
2,M1E,Medical Center,Rental Car Location,Breakfast Spot,Electronics Store,Mexican Restaurant,Bank,Intersection,Dog Run,Discount Store,Distribution Center
3,M1G,Coffee Shop,Convenience Store,Pharmacy,Korean Restaurant,Donut Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run
4,M1H,Hakka Restaurant,Thai Restaurant,Fried Chicken Joint,Bank,Bakery,Caribbean Restaurant,Athletics & Sports,Gas Station,Lounge,Coworking Space
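
The `indicators` lookup in cell 77 gives correct labels for the ten columns used here, but it would label position 21 as "21th". If more columns were ever needed, a general suffix helper could be sketched as follows; `ordinal` is a hypothetical name.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    # 11, 12 and 13 take 'th' even though they end in 1, 2 and 3.
    if n % 100 in (11, 12, 13):
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

columns = ["PostalCode"] + [f"{ordinal(i)} Most Common Venue" for i in range(1, 11)]
print(columns[:4])
```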


#### Clustering

In [78]:
from sklearn.cluster import KMeans
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='PostalCode')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
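
The first ten labels are all 0, which suggests the clusters may be very uneven in size. A quick way to check is to count the rows per label. The `labels` array below is a made-up stand-in for `kmeans.labels_`, so the counts it prints are illustrative only.

```python
import numpy as np

# Hypothetical label array standing in for kmeans.labels_.
labels = np.array([0, 0, 1, 0, 3, 0, 2, 0, 1, 4])

# Count how many rows fall into each cluster label 0..4.
sizes = np.bincount(labels, minlength=5)
for cluster, size in enumerate(sizes):
    print(f"cluster {cluster}: {size} rows")
```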

#### Merge the top-venue table with df to add latitude/longitude for each postal code

We pass how='inner' so that postal codes with no row in postalcode_venues_sorted (those for which no venues were collected) are dropped instead of producing NaN values.

In [79]:
postalcode_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)


In [80]:
df_merged = df.copy()
df_merged = df_merged.join(postalcode_venues_sorted.set_index('PostalCode'), on='PostalCode', how='inner').reset_index(drop=True)

df_merged.tail() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
94,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944,1,Park,River,College Rec Center,College Stadium,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Drugstore,Donut Shop,Doner Restaurant
95,M4Y,Downtown Toronto,Church and Wellesley,43.66586,-79.38316,0,Coffee Shop,Sushi Restaurant,Japanese Restaurant,Gay Bar,Restaurant,Hotel,Mediterranean Restaurant,Men's Store,Café,Pub
96,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558,0,Yoga Studio,Auto Workshop,Garden Center,Garden,Fast Food Restaurant,Farmers Market,Comic Shop,Park,Pizza Place,Restaurant
97,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509,3,Baseball Field,Yoga Studio,Doner Restaurant,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Donut Shop,Fast Food Restaurant
98,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,...",43.628841,-79.520999,0,Grocery Store,Tanning Salon,Gym,Convenience Store,Discount Store,Burrito Place,Burger Joint,Sandwich Place,Flower Shop,Bakery


#### Check the shape of data

In [81]:
df_merged.shape

(99, 16)
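
The merged table has fewer rows than `df`, because the inner join silently drops postal codes without venue data. To see which ones were dropped, a left merge with `indicator=True` can be used. The frames below are made-up stand-ins, and the postal code shown as dropped is purely illustrative.

```python
import pandas as pd

# Hypothetical stand-ins: M7R has no entry on the venue side.
left = pd.DataFrame({"PostalCode": ["M3A", "M7R", "M4A"]})
right = pd.DataFrame({"PostalCode": ["M3A", "M4A"], "Cluster Labels": [1, 0]})

# indicator=True adds a _merge column saying where each row was found.
check = left.merge(right, on="PostalCode", how="left", indicator=True)
dropped = check.loc[check["_merge"] == "left_only", "PostalCode"].tolist()
print(dropped)
```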

#### Plot the segmentation and cluster

In [82]:
import matplotlib.cm as cm
import matplotlib.colors as colors
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(df_merged['Latitude'], df_merged['Longitude'], df_merged['PostalCode'], df_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### Get insight from the data

#### 1. Check the first cluster

We count the five most frequent values of the 1st most common venue in cluster 1 (cluster label 0):

In [83]:
df_merged.loc[df_merged['Cluster Labels'] == 0, '1st Most Common Venue'].value_counts()[:5]

Coffee Shop       16
Café               9
Park               6
Sandwich Place     5
Grocery Store      5
Name: 1st Most Common Venue, dtype: int64

Most of the top venues in cluster 1 are coffee shops and cafés. Let's look at the 2nd most common venues:

In [84]:
df_merged.loc[df_merged['Cluster Labels'] == 0, '2nd Most Common Venue'].value_counts()[:5]

Coffee Shop         10
Café                 7
Restaurant           4
Sushi Restaurant     4
Bakery               4
Name: 2nd Most Common Venue, dtype: int64

Again, coffee shops and cafés dominate cluster 1, along with food places such as restaurants and bakeries. Hence, we conclude that cluster 1 is characterized by coffee shops and food places.

#### 2. Check the second cluster

In [85]:
df_merged.loc[df_merged['Cluster Labels'] == 1, df_merged.columns[[1] + list(range(5, df_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,North York,1,Park,Food & Drink Shop,Pool,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run
19,York,1,Park,Pool,Women's Store,Grocery Store,Curling Ice,Electronics Store,Eastern European Restaurant,Drugstore,Donut Shop,Doner Restaurant
33,East York,1,Convenience Store,Park,Doner Restaurant,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Yoga Studio
38,North York,1,Park,Airport,Snack Place,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Drugstore,Department Store
58,Central Toronto,1,Park,Swim School,Bus Line,Dog Run,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Doner Restaurant,Department Store
61,York,1,Park,Yoga Studio,Doner Restaurant,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Donut Shop
74,Etobicoke,1,Park,Mobile Phone Shop,Sandwich Place,Discount Store,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Distribution Center,Event Space
80,Central Toronto,1,Park,Tennis Court,Playground,Restaurant,Diner,Dance Studio,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant
82,Scarborough,1,Park,Playground,Dog Run,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Doner Restaurant
88,Downtown Toronto,1,Park,Trail,Playground,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run


From the table above, Park is the 1st most common venue for nearly every postal code in cluster 2. Hence, we conclude that cluster 2 is characterized by parks.

#### 3. Check the third cluster

In [86]:
df_merged.loc[df_merged['Cluster Labels'] == 2, df_merged.columns[[1] + list(range(5, df_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
47,North York,2,Home Service,Gym,Doner Restaurant,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Yoga Studio
49,North York,2,Home Service,Yoga Studio,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Deli / Bodega


From the table above, Home Service is the 1st most common venue for every postal code in cluster 3. Hence, we conclude that cluster 3 is characterized by home services.

#### 4. Check the fourth cluster

In [90]:
df_merged.loc[df_merged['Cluster Labels'] == 3, df_merged.columns[[1] + list(range(5, df_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
50,North York,3,Food Truck,Baseball Field,Yoga Studio,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Dessert Shop
54,North York,3,Construction & Landscaping,Baseball Field,Yoga Studio,Donut Shop,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Drugstore
97,Etobicoke,3,Baseball Field,Yoga Studio,Doner Restaurant,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Donut Shop,Fast Food Restaurant


From the table above, Baseball Field is the 1st or 2nd most common venue for every postal code in cluster 4. Hence, we conclude that cluster 4 is characterized by baseball fields.

#### 5. Check the fifth cluster

In [91]:
df_merged.loc[df_merged['Cluster Labels'] == 4, df_merged.columns[[1] + list(range(5, df_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
57,North York,4,Gym / Fitness Center,Liquor Store,Grocery Store,Athletics & Sports,Discount Store,Doner Restaurant,Dim Sum Restaurant,Diner,Distribution Center,Dog Run


From the table above, cluster 5 contains a single postal code, whose 1st most common venue is Gym / Fitness Center. Hence, we conclude that cluster 5 is characterized by gyms and fitness centers.

### Conclusion

In this notebook, we segment and cluster Toronto neighborhoods based on their most common venue categories. We use 5 clusters and obtain a map that shows each postal code and its corresponding cluster. From the results:

1. Cluster 1 is characterized by coffee shops and food places
2. Cluster 2 is characterized by parks
3. Cluster 3 is characterized by home services
4. Cluster 4 is characterized by baseball fields
5. Cluster 5 is characterized by gyms and fitness centers

With fewer clusters, for example 4, cluster 5 is the most likely to disappear, since it has the fewest members (only 1). K-means re-partitions all points when the number of clusters changes, so this is an expectation rather than a guarantee.