# Capstone Project

This notebook will be mainly used for the capstone project.

In [2]:
import pandas as pd
import numpy as np

In [2]:
print('Hello Capstone Project Course!')

Hello Capstone Project Course!


## Segmenting and Clustering Neighborhoods in Toronto

### Preprocessing data

Let's read the postal code table from the Wikipedia URL

In [6]:
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]
df.head()

Unnamed: 0,Postal Code,District,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


Renaming columns

In [7]:
df.rename(columns = {'District':'Borough', 'Neighbourhood':'Neighborhood'}, inplace = True)
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


Dropping rows whose Borough is 'Not assigned'

In [16]:
df = df[df['Borough'] != 'Not assigned']
df.reset_index(drop = True, inplace = True)
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


Combining rows with the same 'Postal Code'

In [25]:
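# Keep the borough and join neighborhood names with ', ' (a plain sum() would concatenate strings without a separator)
df = df.groupby('Postal Code', as_index = False).agg({'Borough': 'first', 'Neighborhood': ', '.join})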
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
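
As a minimal sketch of what combining rows means here, the toy frame below (invented data, not the Wikipedia table) has one postal code split over two rows. Aggregating the names with `', '.join` keeps them readable, whereas summing strings would glue them together without a separator. The frame and variable names are assumptions made for illustration.

```python
import pandas as pd

# Toy data (hypothetical): postal code M5A appears on two rows
toy = pd.DataFrame({
    'Postal Code': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Regent Park', 'Harbourfront', 'Parkwoods'],
})

# Keep the first borough per code and join the neighborhood names with a comma
combined = toy.groupby('Postal Code', as_index=False).agg(
    {'Borough': 'first', 'Neighborhood': ', '.join}
)
print(combined)
```

The result has one row per postal code, sorted by code, with the two M5A neighborhoods merged into a single comma-separated string.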


Checking whether any 'Not assigned' values remain in the Neighborhood column

In [34]:
count = 0
for value in df['Neighborhood'].values:
    if 'Not assigned' in value:
        count += 1
print(count)

0
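
The same check can probably be written without an explicit loop. The sketch below uses a toy Series (hypothetical values) and pandas string matching; it is an alternative formulation, not the cell that produced the output above.

```python
import pandas as pd

# Toy column (hypothetical values), standing in for df['Neighborhood']
neighborhoods = pd.Series(['Parkwoods', 'Not assigned', 'Regent Park, Harbourfront'])

# Boolean mask of entries containing the literal text, then count the True values
mask = neighborhoods.str.contains('Not assigned', regex=False)
not_assigned_count = int(mask.sum())
print(not_assigned_count)
```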


Number of rows in the dataframe

In [37]:
df.shape

(103, 3)

### Getting latitude and longitude

I tried to use geocoder, but it did not work, so I will load the coordinates from a CSV file instead

In [45]:
ll = pd.read_csv('http://cocl.us/Geospatial_data')
ll.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [54]:
ll_df = df.join(ll.set_index('Postal Code'), on = 'Postal Code')
ll_df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
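
When attaching coordinates like this, it can be worth checking that every postal code actually found a match. The sketch below uses two toy frames (hypothetical data, including a made-up code `M9Z` with no coordinates) and `DataFrame.merge` with its `indicator` and `validate` options; it illustrates the idea and is not the join used above.

```python
import pandas as pd

# Toy frames (hypothetical): one postal code has no coordinates
codes = pd.DataFrame({'Postal Code': ['M1B', 'M1C', 'M9Z'],
                      'Borough': ['Scarborough', 'Scarborough', 'Etobicoke']})
coords = pd.DataFrame({'Postal Code': ['M1B', 'M1C'],
                       'Latitude': [43.806686, 43.784535],
                       'Longitude': [-79.194353, -79.160497]})

# A left merge keeps every row of `codes`; `indicator` records where each row matched,
# and `validate` raises if a postal code is duplicated on either side
merged = codes.merge(coords, on='Postal Code', how='left',
                     indicator=True, validate='one_to_one')

# Postal codes that did not receive coordinates
missing = merged.loc[merged['_merge'] == 'left_only', 'Postal Code'].tolist()
print(missing)
```

An empty `missing` list would mean every row received a latitude and longitude.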


### Explore and cluster the neighborhoods in Toronto 

Selecting only boroughs which contain "Toronto"

In [59]:
new_df = ll_df[ll_df['Borough'].str.contains('Toronto')]
new_df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
37,M4E,East Toronto,The Beaches,43.676357,-79.293031
41,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
42,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572
43,M4M,East Toronto,Studio District,43.659526,-79.340923
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879


How many examples do we have?

In [61]:
# Reassign instead of modifying a filtered slice in place
new_df = new_df.reset_index(drop = True)
new_df.shape

(39, 5)

Importing geopy and folium

In [66]:
from geopy.geocoders import Nominatim

In [69]:
#!conda install folium -c conda-forge
import folium


Getting latitude and longitude of Toronto city

In [68]:
address = 'Toronto, Canada'
geolocator = Nominatim(user_agent = 'toronto_agent')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('Latitude:{}, Longitude:{}'.format(latitude, longitude))

Latitude:43.6534817, Longitude:-79.3839347


Creating a map

In [72]:
toronto_map = folium.Map(location = [latitude, longitude], zoom_start = 11)
for neighborhood, borough, lat, long in zip(new_df['Neighborhood'], new_df['Borough'], new_df['Latitude'], new_df['Longitude']):
    label = '{}, {}'.format(borough, neighborhood)
    folium.CircleMarker([lat, long], 
                        radius = 5, 
                        popup = label, 
                        color = 'blue', 
                        fill = True, 
                        fill_color = 'blue', 
                        fill_opacity = 0.7).add_to(toronto_map)

toronto_map

In [73]:
# @hidden_cell
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
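
One way to keep credentials out of a shared notebook is to read them from environment variables. The sketch below is an assumption about how that could look: the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_credentials` are hypothetical and not part of any Foursquare tooling.

```python
import os

def load_credentials(env=os.environ):
    """Read the (hypothetically named) Foursquare variables from a mapping."""
    client_id = env.get('FOURSQUARE_CLIENT_ID')
    client_secret = env.get('FOURSQUARE_CLIENT_SECRET')
    if not client_id or not client_secret:
        raise RuntimeError('Foursquare credentials are not set')
    return client_id, client_secret

# Demonstration with an explicit mapping instead of the real environment
creds = load_credentials({'FOURSQUARE_CLIENT_ID': 'id',
                          'FOURSQUARE_CLIENT_SECRET': 'secret'})
print(creds)
```

In the notebook, calling `load_credentials()` with no argument would read from the process environment and fail loudly if the values are missing.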

Importing requests

In [81]:
#!conda install -c conda-forge requests
import requests

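This function queries the Foursquare explore endpoint for every neighborhood in the input and collects the venues returned within the given radius of its coordinates.

After the loop, it builds a dataframe with one row per venue, holding the neighborhood and the venue details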

In [86]:
def get_nearby_venues(neighborhoods, latitudes, longitudes, radius = 500, LIMIT = 100):
    venues_list = []
    for name, lat, lon in zip(neighborhoods, latitudes, longitudes):
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lon, 
            radius, 
            LIMIT)
        results = requests.get(url).json()["response"]['groups'][0]['items']
        venues_list.append([(
            name, 
            lat, 
            lon, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    return nearby_venues
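
The function above indexes straight into the decoded response, so a reply without groups or a venue without categories would raise. As a hedged sketch, the parsing step could be separated and made tolerant of those cases. The helper name `extract_venues` and the hand-written payload are hypothetical; the payload only mimics the nesting that the function above relies on.

```python
def extract_venues(payload, name, lat, lon):
    """Return one tuple per venue found in a decoded explore response."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        # Fall back to None when a venue has no category
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lon, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows

# Hand-written payload (hypothetical) with the same nesting as used above
fake = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Glen Manor Ravine',
               'location': {'lat': 43.676821, 'lng': -79.293942},
               'categories': [{'name': 'Trail'}]}},
    {'venue': {'name': 'Unnamed Spot',
               'location': {'lat': 43.68, 'lng': -79.29},
               'categories': []}},
]}]}}

rows = extract_venues(fake, 'The Beaches', 43.676357, -79.293031)
empty = extract_venues({}, 'The Beaches', 43.676357, -79.293031)
print(len(rows), len(empty))
```

Keeping the parsing separate from the HTTP call also makes it possible to test it without network access or API credentials.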

Apply the function to our dataset

In [87]:
venues = get_nearby_venues(neighborhoods = new_df['Neighborhood'], latitudes = new_df['Latitude'], longitudes = new_df['Longitude'])

In [88]:
venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,The Beaches,43.676357,-79.293031,Glen Manor Ravine,43.676821,-79.293942,Trail
1,The Beaches,43.676357,-79.293031,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
2,The Beaches,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
3,The Beaches,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Neighborhood
4,The Beaches,43.676357,-79.293031,Seaspray Restaurant,43.678888,-79.298167,Asian Restaurant


Let's count how many venues were returned for each neighborhood

In [90]:
venues.groupby(by=['Neighborhood']).count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Berczy Park,58,58,58,58,58,58
"Brockton, Parkdale Village, Exhibition Place",24,24,24,24,24,24
"Business reply mail Processing Centre, South Central Letter Processing Plant Toronto",18,18,18,18,18,18
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",17,17,17,17,17,17
Central Bay Street,62,62,62,62,62,62
Christie,17,17,17,17,17,17
Church and Wellesley,76,76,76,76,76,76
"Commerce Court, Victoria Hotel",100,100,100,100,100,100
Davisville,34,34,34,34,34,34
Davisville North,8,8,8,8,8,8


Let's one-hot encode the venue categories

In [118]:
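toronto_onehot = pd.get_dummies(venues[['Venue Category']], prefix = "", prefix_sep = "")
# 'Neighborhood' is also a venue category (see venues.head()), so drop that dummy column before adding the label
toronto_onehot = toronto_onehot.drop(columns = 'Neighborhood', errors = 'ignore')
toronto_onehot.insert(0, 'Neighborhood', venues['Neighborhood'])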
toronto_onehot.head()

Unnamed: 0,Neighborhood,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Women's Store
0,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Let's group the rows by neighborhood and take the mean of each one-hot column, which gives the frequency of each category per neighborhood

In [93]:
grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
grouped.head()

Unnamed: 0,Neighborhood,Yoga Studio,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Women's Store
0,Berczy Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Brockton, Parkdale Village, Exhibition Place",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Business reply mail Processing Centre, South C...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"CN Tower, King and Spadina, Railway Lands, Har...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Central Bay Street,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
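
To make the encoding and averaging steps concrete, here is a small sketch on invented data (two made-up neighborhoods `A` and `B`). The mean of a 0/1 column within a group is the share of that group's venues falling in the category, so each row of the result sums to 1.

```python
import pandas as pd

# Toy venues (hypothetical): four in neighborhood A, two in neighborhood B
toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Cafe', 'Cafe', 'Pub', 'Park', 'Park', 'Park'],
})

# One 0/1 column per category, then the neighborhood label as the first column
onehot = pd.get_dummies(toy['Venue Category'], dtype=float)
onehot.insert(0, 'Neighborhood', toy['Neighborhood'])

# Mean of each 0/1 column per neighborhood = relative frequency of that category
freq = onehot.groupby('Neighborhood').mean()
print(freq)
```

In this toy case, half of A's venues are cafes, so `freq.loc['A', 'Cafe']` is 0.5, while every venue in B is a park.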


Let's write a function that sorts the venue categories of one row by frequency in descending order and returns the top ones

In [97]:
def return_sorted_row(row, num):
    row_categories = row.iloc[1:]
    row_sorted = row_categories.sort_values(ascending = False)
    return row_sorted.index.values[0:num]
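
As a quick illustration, the function can be tried on a single hand-made row (hypothetical frequencies). The first element is skipped because it holds the neighborhood name rather than a frequency.

```python
import pandas as pd

def return_sorted_row(row, num):
    row_categories = row.iloc[1:]  # skip the leading 'Neighborhood' entry
    row_sorted = row_categories.sort_values(ascending = False)
    return row_sorted.index.values[0:num]

# Toy row (hypothetical): a neighborhood name followed by category frequencies
row = pd.Series({'Neighborhood': 'A', 'Cafe': 0.5, 'Park': 0.2, 'Pub': 0.3})
top2 = list(return_sorted_row(row, 2))
print(top2)
```

Note that categories with equal frequency, for example many zeros, come back in no meaningful order, so the lower ranks should be read with care when a neighborhood has only a few venues.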

Now let's create the new dataframe and display the top 10 venues for each neighborhood

In [101]:
N = 10
index = ['st', 'nd', 'rd']
columns = ['Neighborhood']
for ind in range(N):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, index[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))
    
most_common = pd.DataFrame(columns = columns)
most_common['Neighborhood'] = grouped['Neighborhood']

for i in np.arange(grouped.shape[0]):
    most_common.iloc[i, 1:] = return_sorted_row(grouped.iloc[i, :], N)
    
most_common.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Dessert Shop,Women's Store,Department Store,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run
1,"Brockton, Parkdale Village, Exhibition Place",Coffee Shop,Women's Store,Department Store,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run
2,"Business reply mail Processing Centre, South C...",Coffee Shop,Women's Store,Department Store,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run
3,"CN Tower, King and Spadina, Railway Lands, Har...",Greek Restaurant,Department Store,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run,Distribution Center
4,Central Bay Street,Greek Restaurant,Department Store,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run,Distribution Center
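
The column labels above rely on an exception to switch from 'st', 'nd', 'rd' to 'th'. A small helper could build the same labels more directly; the sketch below is one possible version, and the name `ordinal` is hypothetical.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        # 11th, 12th, 13th and the rest of the teens all take 'th'
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

# Same column layout as above: the neighborhood plus the top 10 labels
columns = ['Neighborhood'] + ['{} Most Common Venue'.format(ordinal(i))
                              for i in range(1, 11)]
print(columns)
```

Unlike the list-index approach, this would also stay correct for values of N above 20, where labels such as 21st appear.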


### Clustering

In [102]:
from sklearn.cluster import KMeans

Apply the k-means algorithm with 5 clusters

In [106]:
num_clusters = 5
clustering_data = grouped.drop(columns = 'Neighborhood')
k_means = KMeans(init = 'k-means++', n_clusters = num_clusters).fit(clustering_data)
k_means.labels_[0:10]

array([0, 2, 2, 1, 1, 0, 0, 0, 1, 0], dtype=int32)
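
Because k-means starts from randomly chosen centers, the labels can differ between runs unless a seed is fixed. The sketch below shows this on toy 2-D points (hypothetical data, not the venue frequencies) by passing `random_state` to `KMeans`; the seed value and the helper `fit_labels` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (hypothetical): two well-separated groups of 2-D points
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

def fit_labels(data, k, seed):
    """Fit k-means with a fixed seed and return the cluster label of each row."""
    model = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=seed)
    return model.fit(data).labels_

# Two runs with the same seed give identical labels
first = fit_labels(points, 2, seed=0)
second = fit_labels(points, 2, seed=0)
print(first, second)
```

The cluster numbers themselves carry no meaning; only which points share a label matters.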

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood

In [107]:
most_common.insert(0, 'Cluster', k_means.labels_)
final_data = new_df.join(most_common.set_index('Neighborhood'), on = 'Neighborhood')
final_data.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,3,Trail,Women's Store,Department Store,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,0,Health Food Store,Women's Store,Department Store,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run
2,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572,0,Pub,Women's Store,Deli / Bodega,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run,Distribution Center
3,M4M,East Toronto,Studio District,43.659526,-79.340923,0,Women's Store,Department Store,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run,Distribution Center
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,0,Asian Restaurant,Department Store,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run,Distribution Center


In [112]:
import matplotlib.pyplot as plt
import matplotlib.colors as colors

Finally, let's visualize the clusters on a map

In [113]:
cluster_map = folium.Map(location = [latitude, longitude], zoom_start = 11)
colors_arr = plt.cm.rainbow(np.linspace(0,1,num_clusters))
rainbow = [colors.rgb2hex(i) for i in colors_arr]

for name, lat, lon, cluster in zip(final_data['Neighborhood'], final_data['Latitude'], final_data['Longitude'], final_data['Cluster']):
    folium.CircleMarker([lat, lon], 
                        radius = 5, 
                        popup = name,
                        color = rainbow[cluster], 
                        fill = True, 
                        fill_color = rainbow[cluster], 
                        fill_opacity = 0.7).add_to(cluster_map)
    
cluster_map