## Assignment - Segmenting and Clustering Neighborhoods in Toronto
This notebook segments and clusters neighborhoods in Toronto for the week 2 graded assignment of IBM's 'Applied Data Science Capstone' course, offered through Coursera as part of the IBM Data Science Professional Certificate.

Import all libraries required:

In [1]:
import pandas as pd
import numpy as np
import folium
import requests
import json

Scrape the postal code table from Wikipedia, read it into a pandas dataframe, and view it:

In [2]:
tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
df = tables[0]
df.head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront


Format and clean DataFrame:

In [3]:
df = df.set_index("Borough")
df = df.drop("Not assigned", axis=0)
df.rename(columns={"Postal code": "PostalCode"}, inplace=True)
df.reset_index(inplace=True)
df=df[['PostalCode','Borough','Neighborhood']]
df

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Regent Park / Harbourfront
3,M6A,North York,Lawrence Manor / Lawrence Heights
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
...,...,...,...
98,M8X,Etobicoke,The Kingsway / Montgomery Road / Old Mill North
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,Business reply mail Processing CentrE
101,M8Y,Etobicoke,Old Mill South / King's Mill Park / Sunnylea /...
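
The same cleaning step can also be written as a boolean filter, which avoids moving 'Borough' into the index and back. The following is only a minimal sketch on a small hand-made frame: the sample rows are copied from the preview above, and `clean_postal_table` is a hypothetical helper name, not part of the original notebook.

```python
import pandas as pd

def clean_postal_table(raw):
    """Drop rows whose Borough is 'Not assigned' and normalise the column name."""
    out = raw[raw["Borough"] != "Not assigned"]
    out = out.rename(columns={"Postal code": "PostalCode"})
    return out[["PostalCode", "Borough", "Neighborhood"]].reset_index(drop=True)

# illustrative rows only, shaped like the scraped table
sample = pd.DataFrame({
    "Postal code": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighborhood": [None, None, "Parkwoods", "Victoria Village"],
})
cleaned = clean_postal_table(sample)
print(cleaned)
```

Filtering with a mask keeps the default integer index throughout, so no `set_index` / `reset_index` round trip is needed.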


Each postal code already occupies a single row; separate its neighborhoods with a comma instead of ' / ':

In [4]:
# vectorized replacement; avoids chained assignment (df['Neighborhood'][i] = ...)
df['Neighborhood'] = df['Neighborhood'].str.replace(' / ', ', ', regex=False)
df

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,Business reply mail Processing CentrE
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


Check whether any neighborhood has the value 'Not Assigned' (the loop prints one line per unique neighborhood):

In [5]:
for x in df['Neighborhood'].unique():
    if x == 'Not Assigned':
        print('Not Assigned value is present')
    else:
        print('Not Present')

Not Present
... (the same line repeats once per unique neighborhood; output truncated)
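
The per-value loop above prints one line for every unique neighborhood. A single vectorized test gives one answer instead. This is a sketch under assumptions: `has_placeholder` is a hypothetical helper, and it compares case-insensitively because the scraped table spells the placeholder 'Not assigned' while the check above looks for 'Not Assigned'.

```python
import pandas as pd

def has_placeholder(series, placeholder="Not assigned"):
    """Return True if any entry equals the placeholder, ignoring case and surrounding spaces."""
    normalised = series.astype(str).str.strip().str.lower()
    return bool((normalised == placeholder.lower()).any())

# illustrative values only
sample = pd.Series(["Parkwoods", "Victoria Village", "Regent Park, Harbourfront"])
print("Not assigned value is present" if has_placeholder(sample) else "Not Present")
```

In the notebook this would be called as `has_placeholder(df['Neighborhood'])`.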

#### Question 1:
Show shape of dataframe as required by Question 1:

In [6]:
df.shape

(103, 3)

Read the geographical coordinates of each postal code from a CSV file hosted online:

In [7]:
geo_df = pd.read_csv('http://cocl.us/Geospatial_data')
geo_df.shape

(103, 3)

Rename the 'Postal Code' column to 'PostalCode' so that both dataframes share a key column, then sort `df` by postal code. The merge matches rows on the key, so sorting is not strictly required for it:

In [10]:
geo_df.rename(columns={"Postal Code": "PostalCode"}, inplace=True)

In [11]:
# sort by postal code and rebuild a clean 0..n-1 index in one step
df = df.sort_values('PostalCode').reset_index(drop=True)

#### Question 2
Merge both dataframes and display new dataframe as required in Question 2:

In [12]:
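# name the join key explicitly instead of relying on the shared column being inferred
df = df.merge(geo_df, on='PostalCode', how='right')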
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
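
Because both frames have 103 rows, it is easy to assume that every postal code found a partner. A quick way to verify this is sketched below on toy data, using the `validate` and `indicator` options of `DataFrame.merge`. The frames `left` and `right` are made-up stand-ins for `df` and `geo_df`.

```python
import pandas as pd

# toy stand-ins for df and geo_df; M1E and M1G deliberately have no partner
left = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M1E"],
    "Borough": ["Scarborough"] * 3,
})
right = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M1G"],
    "Latitude": [43.806686, 43.784535, 43.770992],
    "Longitude": [-79.194353, -79.160497, -79.216917],
})

# validate raises if a key is duplicated on either side;
# indicator adds a '_merge' column saying where each row came from
merged = left.merge(right, on="PostalCode", how="outer",
                    validate="one_to_one", indicator=True)
unmatched = merged[merged["_merge"] != "both"]
print(unmatched[["PostalCode", "_merge"]])
```

An empty `unmatched` frame on the real data would confirm that the merge lost or invented no postal codes.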


Create a new dataframe 'neigh' containing only the rows whose Borough name contains 'Toronto':

In [13]:
# boolean mask: True where the Borough name contains 'Toronto'
# (the earlier find(...) > 0 test would miss a name that starts with 'Toronto')
x = df['Borough'].str.contains('Toronto', na=False)
len(x)

103

In [14]:
# reset_index returns a new frame, so no in-place change on a filtered slice
neigh = df[x].reset_index()

In [15]:
neigh = neigh.drop(['index'],axis = 1)

In [16]:
neigh.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
2,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879


## Map Plotting
Plot the postal codes that belong to Toronto on a map, placing a circle marker at each latitude and longitude in the 'neigh' dataframe:

In [17]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[43.679023, -79.40815], zoom_start=11)

# add markers to map
for lat,lng,label in zip(neigh['Latitude'],neigh['Longitude'],neigh['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)

map_toronto

Initialise Foursquare API credentials:

In [18]:
CLIENT_ID = '3SSGSPUECSZI3XKCU0PTTNPQHAXLGAIHD3WLWHAWKAJ1SAC2' # your Foursquare ID
CLIENT_SECRET = 'K5P5E15RWELKDOYI4L443JDJA20I3WNWTCK5SWUWFHNAGHVN' # your Foursquare Secret
VERSION = '20200504' # Foursquare API version

print('Credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Credentials:
CLIENT_ID: 3SSGSPUECSZI3XKCU0PTTNPQHAXLGAIHD3WLWHAWKAJ1SAC2
CLIENT_SECRET:K5P5E15RWELKDOYI4L443JDJA20I3WNWTCK5SWUWFHNAGHVN
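
Hard-coding and printing API secrets exposes them to anyone who can read the notebook. One common alternative is to read them from environment variables and print only a masked value. This is a minimal sketch: the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and both helper functions are assumptions made for illustration, not something Foursquare or the course prescribes.

```python
import os

def load_credentials(env=os.environ):
    """Read the Foursquare credentials from environment variables (hypothetical names)."""
    client_id = env.get("FOURSQUARE_CLIENT_ID")
    client_secret = env.get("FOURSQUARE_CLIENT_SECRET")
    if not client_id or not client_secret:
        raise RuntimeError("Set FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET first")
    return client_id, client_secret

def mask(secret, visible=4):
    """Show only the first few characters so the full value never lands in cell output."""
    return secret[:visible] + "*" * (len(secret) - visible)

# demo with a fake mapping instead of the real environment
demo_id, demo_secret = load_credentials({
    "FOURSQUARE_CLIENT_ID": "abcd1234",
    "FOURSQUARE_CLIENT_SECRET": "wxyz5678",
})
print("CLIENT_ID: " + mask(demo_id))
```

In the notebook, `CLIENT_ID, CLIENT_SECRET = load_credentials()` would replace the hard-coded strings.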


Define a function that requests nearby venues through the Foursquare explore endpoint and stores the results in a tidy dataframe:

In [21]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            100)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
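
The function above indexes straight into the response, so a venue without a category or a response without groups raises an exception partway through the loop. The parsing step can be separated out and made tolerant of missing keys, as sketched below. `extract_venue_rows` is a hypothetical helper, and `sample_payload` is a made-up dictionary shaped only after the keys that `getNearbyVenues` reads; it is not a real API response.

```python
def extract_venue_rows(payload, name, lat, lng):
    """Flatten one explore-style response into row tuples, tolerating missing keys."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        # fall back to an empty dict when the venue has no category
        categories = venue.get("categories") or [{}]
        rows.append((name, lat, lng, venue.get("name"),
                     location.get("lat"), location.get("lng"),
                     categories[0].get("name")))
    return rows

# made-up payload for illustration
sample_payload = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Demo Cafe",
                   "location": {"lat": 43.65, "lng": -79.38},
                   "categories": [{"name": "Café"}]}},
        {"venue": {"name": "Uncategorised Spot",
                   "location": {"lat": 43.66, "lng": -79.39},
                   "categories": []}},
    ]}]}
}
rows = extract_venue_rows(sample_payload, "Demo Neighborhood", 43.651, -79.381)
print(rows)
```

Keeping the parsing separate from the HTTP call also makes it possible to test it without network access or credentials.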

Get the nearby venues (at most 100 each) for every postal code in 'neigh' and store them in a dataframe named 'toronto_venues':

In [20]:
toronto_venues = getNearbyVenues(names=neigh['Neighborhood'],
                                   latitudes=neigh['Latitude'],
                                   longitudes=neigh['Longitude']
                                  )

The Beaches
The Danforth West, Riverdale
India Bazaar, The Beaches West
Studio District
Lawrence Park
Davisville North
North Toronto West
Davisville
Moore Park, Summerhill East
Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park
Rosedale
St. James Town, Cabbagetown
Church and Wellesley
Regent Park, Harbourfront
Garden District, Ryerson
St. James Town
Berczy Park
Central Bay Street
Richmond, Adelaide, King
Harbourfront East, Union Station, Toronto Islands
Toronto Dominion Centre, Design Exchange
Commerce Court, Victoria Hotel
Roselawn
Forest Hill North & West
The Annex, North Midtown, Yorkville
University of Toronto, Harbord
Kensington Market, Chinatown, Grange Park
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst  Quay, South Niagara, Island airport
Stn A PO Boxes
First Canadian Place, Underground city
Christie
Dufferin, Dovercourt Village
Little Portugal, Trinity
Brockton, Parkdale Village, Exhibition Place
High Park, The Junction South
Parkdale, Ro

In [22]:
toronto_venues['Venue Category'].unique()

array(['Trail', 'Health Food Store', 'Pub', 'Neighborhood', 'Coffee Shop',
       'Greek Restaurant', 'Cosmetics Shop', 'Italian Restaurant',
       'Ice Cream Shop', 'Yoga Studio', 'Brewery',
       'Fruit & Vegetable Store', 'Pizza Place', 'Bookstore',
       'Restaurant', 'Dessert Shop', 'Juice Bar', 'Bubble Tea Shop',
       'Spa', 'Diner', 'Grocery Store', 'Furniture / Home Store', 'Café',
       'Bakery', 'Caribbean Restaurant', 'Indian Restaurant',
       'Frozen Yogurt Shop', 'Lounge', 'Liquor Store', 'Gym',
       'Fish & Chips Shop', 'Fast Food Restaurant', 'Sushi Restaurant',
       'Park', 'Pet Store', 'Steakhouse', 'Burrito Place',
       'Movie Theater', 'Sandwich Place', 'Intersection',
       'Food & Drink Shop', 'Fish Market', 'Gay Bar', 'Cheese Shop',
       'Middle Eastern Restaurant', 'Comfort Food Restaurant',
       'Thai Restaurant', 'Seafood Restaurant', 'American Restaurant',
       'Stationery Store', 'Coworking Space', 'Wine Bar', 'Bar',
       'Gym / Fitness

### NOTE:
The output above shows that 'Neighborhood' is itself one of the venue categories, so one-hot encoding produces a column named 'Neighborhood'. To avoid a name clash, the neighborhood names of each postal code are stored in a new column of the one-hot encoded dataframe called 'Neighborhood_Name'.

In [23]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood_Name'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

In [24]:
toronto_onehot.head()

Unnamed: 0,Neighborhood_Name,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Thrift / Vintage Store,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Women's Store,Yoga Studio
0,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [25]:
toronto_onehot.shape

(1681, 236)

In [26]:
toronto_grouped = toronto_onehot.groupby('Neighborhood_Name').mean().reset_index()
toronto_grouped = toronto_grouped.sort_values(['Neighborhood_Name'])
toronto_grouped.head()

Unnamed: 0,Neighborhood_Name,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Thrift / Vintage Store,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Women's Store,Yoga Studio
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.018182,0.0,0.0,0.0,0.0,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Business reply mail Processing CentrE,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625
3,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.0625,0.0625,0.0625,0.125,0.125,0.125,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.012987,0.0,...,0.0,0.0,0.0,0.0,0.012987,0.0,0.0,0.012987,0.0,0.012987
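
To see why the grouped values are fractions such as 0.0625, it helps to run the same two steps on a toy table: after one-hot encoding, the mean of each 0/1 column per neighborhood is the share of that neighborhood's venues in that category. The data below is invented for illustration.

```python
import pandas as pd

# invented venues: neighborhood A has 4 venues, B has 2
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Café", "Café", "Park", "Pub", "Park", "Park"],
})

# one 0/1 column per category (cast to int so the values print as 0/1)
onehot = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="").astype(int)
onehot.insert(0, "Neighborhood_Name", venues["Neighborhood"])

# mean of a 0/1 column = fraction of venues in that category
freq = onehot.groupby("Neighborhood_Name").mean().reset_index()
print(freq)
```

Each row of `freq` sums to 1 across the category columns, so a value of 0.0625 in the real table corresponds to 1 venue out of 16.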


## Clustering
Let's fit a k-means model to group the Toronto neighborhoods into clusters based on the venue categories found near them.

In [27]:
from sklearn.cluster import KMeans
# set number of clusters
kclusters = 5

# pass the column by keyword; the positional axis argument is not accepted by recent pandas
toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood_Name')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
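
The first ten labels are all 0, which says little about how the neighborhoods are spread over the five clusters. Counting the members of each cluster gives a fuller picture. The sketch below uses made-up labels; in the notebook the input would be `kmeans.labels_` and `kclusters`. `cluster_sizes` is a hypothetical helper.

```python
import numpy as np

def cluster_sizes(labels, n_clusters):
    """Count how many rows fall into each cluster; empty clusters show up as 0."""
    return np.bincount(np.asarray(labels), minlength=n_clusters)

# made-up labels standing in for kmeans.labels_
demo_labels = np.array([0, 0, 0, 2, 0, 1, 0, 0, 4, 0])
sizes = cluster_sizes(demo_labels, 5)
print(dict(enumerate(sizes.tolist())))
```

If one cluster holds almost every neighborhood, a different `kclusters` or different features may be worth trying.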

In [28]:
neigh = neigh.sort_values(['Neighborhood'])
neigh = neigh.reset_index(drop=True)
neigh['Cluster_Number'] = kmeans.labels_
neigh.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster_Number
0,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306,0
1,M6K,West Toronto,"Brockton, Parkdale Village, Exhibition Place",43.636847,-79.428191,0
2,M7Y,East Toronto,Business reply mail Processing CentrE,43.662744,-79.321558,0
3,M5V,Downtown Toronto,"CN Tower, King and Spadina, Railway Lands, Har...",43.628947,-79.39442,0
4,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383,0
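
The assignment above works only when 'neigh', once sorted, has exactly the same rows in the same order as 'toronto_grouped', the frame k-means was fitted on. A neighborhood with no venues, or a name shared by two postal codes, would break that alignment. Joining the labels by name is more robust, as sketched below on toy data; all frames and labels in the sketch are invented stand-ins.

```python
import pandas as pd

# stand-ins: 'grouped' plays toronto_grouped, 'labels' plays kmeans.labels_
grouped = pd.DataFrame({"Neighborhood_Name": ["Berczy Park", "Christie", "Rosedale"]})
labels = [0, 2, 1]

# 'neigh_demo' plays neigh; its order differs and one row has no venues
neigh_demo = pd.DataFrame({
    "Neighborhood": ["Rosedale", "Berczy Park", "Christie", "No Venues Here"],
})

# attach each label to its neighborhood name, then join on the name
label_table = (grouped.assign(Cluster_Number=labels)
                      .rename(columns={"Neighborhood_Name": "Neighborhood"}))
joined = neigh_demo.merge(label_table, on="Neighborhood", how="left")
print(joined)
```

Rows without a matching label end up with a missing 'Cluster_Number' instead of silently receiving another neighborhood's cluster.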


## Plotting Final Clusters on Map
Let's plot the clusters created from the k-means clustering algorithm on the map.

In [29]:
import matplotlib.cm as cm
import matplotlib.colors as colors
# create map
map_clusters = folium.Map(location=[43.679023, -79.40815], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(neigh['Latitude'], neigh['Longitude'], neigh['Neighborhood'], neigh['Cluster_Number']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters