# Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto
Week 3 of the Coursera course Applied Data Science Capstone: https://www.coursera.org/learn/applied-data-science-capstone/home/welcome

## Solutions

1. <a href="#item1">Create first dataframe</a>

2. <a href="#item2">Add Latitude and Longitude</a>

3. <a href="#item3">Cluster and Analyze Neighborhoods</a>


In [1]:
# importing important libraries
import numpy as np
import pandas as pd
import requests

# plot, rendering libraries
import folium
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.cm as cm
import matplotlib.colors as colors

# import KMeans for later clustering
from sklearn.cluster import KMeans

## Obtaining neighborhoods data

In [2]:
# wikipedia scraping
url= r'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
# using pd built-in read_html to find the table and create the df, setting Not assigned as NaN
toronto_neighborhoods= pd.read_html(requests.get(url).content, na_values='Not assigned')[0]
toronto_neighborhoods[:3]

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | NaN | NaN |
| 1 | M2A | NaN | NaN |
| 2 | M3A | North York | Parkwoods |


### toronto_neighborhoods df cleaning

In [3]:
# drop rows without a borough
toronto_neighborhoods.dropna(subset=['Borough'], inplace= True)
# fill NaN neighbourhoods with the borough name (assign back instead of chained inplace)
toronto_neighborhoods['Neighbourhood'] = toronto_neighborhoods['Neighbourhood'].fillna(toronto_neighborhoods['Borough'])
toronto_neighborhoods[:3]

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


In [4]:
# check whether any NaN remains in Neighbourhood
toronto_neighborhoods['Neighbourhood'].isnull().sum()

0

## First Dataframe (point 3 of the assignment)
<a id="item1"></a>

In [5]:
# joining Neighbourhoods with the same Postcode
df_group_hoods= pd.DataFrame(toronto_neighborhoods.groupby(['Postcode', 'Borough']).apply(lambda x: ', '.join(x['Neighbourhood'])))
df_group_hoods.reset_index(inplace= True) # resetting index after groupby
df_group_hoods.columns= toronto_neighborhoods.columns # renaming columns as previous df
df_group_hoods[:3]

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |


In [6]:
df_group_hoods.shape

(103, 3)
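The grouping step can also be written with `agg`, which returns the joined names without wrapping the result in a new `DataFrame` and renaming its columns. This is only a sketch on a few hand-typed rows (the toy frame is illustrative, not the scraped table):

```python
import pandas as pd

# Toy rows (illustrative) with two neighbourhoods sharing one postcode
toy = pd.DataFrame({
    'Postcode': ['M1B', 'M1B', 'M1C'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighbourhood': ['Rouge', 'Malvern', 'Highland Creek'],
})

# Join the neighbourhood names of each (Postcode, Borough) pair into one string
grouped = (toy.groupby(['Postcode', 'Borough'], as_index=False)['Neighbourhood']
              .agg(', '.join))
print(grouped)
```

With `as_index=False` the group keys stay as ordinary columns, so no `reset_index` is needed afterwards.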

## Importing Geospatial Data

I tried to use the geocoder library, but Google was not responding and the other services (OSM and HERE) were not able to identify all the postcodes, so I am going to download the CSV file instead.

In [7]:
!wget -q -O geospatial.csv https://cocl.us/Geospatial_data
print('geospatial.csv downloaded')

geospatial.csv downloaded


In [8]:
# read the csv file
df_geospatial= pd.read_csv('geospatial.csv')
df_geospatial.columns= ['Postcode', 'Latitude', 'Longitude']
df_geospatial[:3]

| | Postcode | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |


## Dataframe including latitude and longitude
<a id="item2"></a>

In [9]:
df_toronto= pd.merge(df_group_hoods, df_geospatial, how= 'left',  on='Postcode')
df_toronto[:3]

| | Postcode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
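Because the merge is a left join, any postcode missing from the CSV would silently end up with NaN coordinates. One possible way to check for that is the `indicator` option of `pd.merge`; the sketch below uses small hand-made frames, and the postcode `M9Z` is invented to force a miss:

```python
import pandas as pd

# Left frame: postcodes to look up ('M9Z' is a made-up code with no match)
left = pd.DataFrame({'Postcode': ['M1B', 'M1C', 'M9Z']})
# Right frame: coordinates for the known postcodes
right = pd.DataFrame({'Postcode': ['M1B', 'M1C'],
                      'Latitude': [43.806686, 43.784535],
                      'Longitude': [-79.194353, -79.160497]})

# indicator=True adds a '_merge' column telling where each row came from
checked = pd.merge(left, right, how='left', on='Postcode', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Postcode'].tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would confirm that every postcode received coordinates.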


## Exploring Toronto's neighborhoods

In [10]:
CLIENT_ID = 'IJEGQZHUQM3MZYQNYBFYTBSP5JFIMV3XLAWHBZ1RLPUIIKW5' # your Foursquare ID
CLIENT_SECRET = 'C1SAKNBHCXE2MIG0AHO421M2EHLRQFQFXBM14DOPNLANU3RU' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

In [11]:
def getNearbyVenues(names, latitudes, longitudes, radius=500, LIMIT= 100):
    '''Given the latitude and longitude of each place (name), use Foursquare to find the surrounding venues, limited by radius and by the total number of venues returned.'''
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [12]:
toronto_venues = getNearbyVenues(names=df_toronto['Neighbourhood'],
                                 latitudes=df_toronto['Latitude'],
                                 longitudes=df_toronto['Longitude']
                                  )

Rouge, Malvern
Highland Creek, Rouge Hill, Port Union
Guildwood, Morningside, West Hill
Woburn
Cedarbrae
Scarborough Village
East Birchmount Park, Ionview, Kennedy Park
Clairlea, Golden Mile, Oakridge
Cliffcrest, Cliffside, Scarborough Village West
Birch Cliff, Cliffside West
Dorset Park, Scarborough Town Centre, Wexford Heights
Maryvale, Wexford
Agincourt
Clarks Corners, Sullivan, Tam O'Shanter
Agincourt North, L'Amoreaux East, Milliken, Steeles East
L'Amoreaux West, Steeles West
Upper Rouge
Hillcrest Village
Fairview, Henry Farm, Oriole
Bayview Village
Silver Hills, York Mills
Newtonbrook, Willowdale
Willowdale South
York Mills West
Willowdale West


KeyError: 'groups'
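The `KeyError: 'groups'` means that one response did not contain the `response` → `groups` structure that `getNearbyVenues` indexes into, so the loop stopped partway through the list of postcodes. The cause is not recorded in this run (an exhausted request quota or rejected credentials are possibilities, but that is a guess). A minimal sketch of a more defensive parser, assuming the payload layout used in `getNearbyVenues`; `extract_items` is a hypothetical helper name and both sample payloads are made up:

```python
def extract_items(payload):
    """Return the venue items from an explore-style payload, or [] if absent."""
    groups = payload.get('response', {}).get('groups') or []
    if not groups:
        return []
    return groups[0].get('items', [])

# Made-up payloads: one with the expected layout, one without 'groups'
ok = {'response': {'groups': [{'items': [{'venue': {'name': 'Cafe'}}]}]}}
failed = {'meta': {'code': 429}, 'response': {}}

print(len(extract_items(ok)), len(extract_items(failed)))
```

Inside the loop, `results = extract_items(requests.get(url).json())` would then yield an empty list for a failed request rather than raising, at the cost of hiding the failure unless it is logged separately.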

### checking the df obtained from Foursquare

Let's check whether all the neighborhoods are present.

In [None]:
len(toronto_venues['Neighborhood'].unique())

It seems that 3 neighborhoods are missing; let's check which ones.

In [None]:
no_venues_neighborhood= np.setdiff1d(df_group_hoods['Neighbourhood'],toronto_venues['Neighborhood'])
no_venues_neighborhood
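`np.setdiff1d(a, b)` returns the sorted unique values that appear in `a` but not in `b`, which is why it identifies the neighborhoods for which no venue came back. A small sketch with placeholder names (not real neighborhoods):

```python
import numpy as np

# Placeholder names: every neighbourhood vs. those that returned venues
all_names = np.array(['Hood C', 'Hood A', 'Hood B'])
with_venues = np.array(['Hood B', 'Hood C'])

# Values present in the first array and absent from the second
missing = np.setdiff1d(all_names, with_venues)
print(missing)
```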

Checking on Foursquare, there are not many venues in these locations, so I decided not to increase the radius and to leave these 3 neighborhoods with None.

In [None]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

In [None]:
print(toronto_venues.shape)
toronto_venues[:3]

We have 103 different postal codes, so having 2233 results means that many neighborhood groups are far below the 100-venue limit set earlier.

In [None]:
print('On average, there are {:.0f} venues per neighborhood group'.format(toronto_venues.shape[0]/df_toronto.shape[0]))

Checking how the venues are distributed

In [None]:
venues_neighborhood= toronto_venues.groupby('Neighborhood').count().sort_values(by='Venue', ascending= True)

In [None]:
venues_neighborhood[:5]

In [None]:
venues_neighborhood[-5:]

There are neighborhoods with only one venue; we will check how the venue counts are distributed.

In [None]:
plt.figure(figsize= (10,5))
# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(venues_neighborhood['Venue'], kde= False)
plt.xlim(0,100)
plt.xlabel('number of venues')
plt.title('Toronto Neighborhood Venues Distribution')

The histogram shows that the majority of the neighborhoods have fewer than 20 venues.

## Neighborhood Analysis and Clustering
<a id="item3"></a>

In [None]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot[:5]

Now let's group rows by neighborhood, taking the mean of the frequency of occurrence of each category.

In [None]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
print(toronto_grouped.shape) # checking the new shape
toronto_grouped[:5]
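To make the one-hot and mean steps concrete: after one-hot encoding, each row holds a single 1 in the column of its venue category, so the per-neighborhood mean of a column is the share of that neighborhood's venues in that category. A sketch on invented data (neighborhoods `A` and `B` and their venues are made up; `dtype=float` is passed so the means are computed on numbers regardless of the pandas default):

```python
import pandas as pd

# Invented venues: A has two cafes and a park, B has one park
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B'],
    'Venue Category': ['Cafe', 'Cafe', 'Park', 'Park'],
})

# One indicator column per category
onehot = pd.get_dummies(venues[['Venue Category']], prefix='', prefix_sep='', dtype=float)
onehot['Neighborhood'] = venues['Neighborhood']

# Mean of the indicators = relative frequency of each category
freq = onehot.groupby('Neighborhood').mean().reset_index()
print(freq)
```

Here neighborhood `A` gets 2/3 for `Cafe` and 1/3 for `Park`, and each row of frequencies sums to 1.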

I will now add back the neighborhoods with no venues, setting their mean occurrence values to 0.

In [None]:
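# one all-zero row per neighborhood without venues, with the same columns as toronto_grouped
a= np.zeros(shape=(no_venues_neighborhood.shape[0], toronto_grouped.shape[1]))
no_venues_df= pd.DataFrame(data=a, columns= toronto_grouped.columns)
# replace the placeholder zeros in the name column with the neighborhood names
no_venues_df['Neighborhood'] = no_venues_neighborhood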

After appending these rows, I will create a df with the most common venues per neighborhood.

In [None]:
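# DataFrame.append was removed in pandas 2.0; concat with a fresh index does the same job
toronto_grouped= pd.concat([toronto_grouped, no_venues_df], ignore_index=True)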

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()
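The column names above rely on an `IndexError` to fall back to the `th` suffix, which works for the 10 columns used here but would give `11st` style labels if the list of indicators were reused past 20. A small helper can make the suffix rule explicit; `ordinal` is a hypothetical name, and this is just one way to write it:

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

# Same column layout as above: the name column plus ten ranked venue columns
columns = ['Neighborhood'] + ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)]
print(columns[1], '|', columns[4], '|', columns[10])
```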

In [None]:
# set number of clusters
kclusters = 10

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

In [None]:
# converting the UK English spelling to American English to be able to merge the dfs
df_toronto.rename(columns={'Neighbourhood': 'Neighborhood'},inplace= True )

Add the cluster labels to the df.

In [None]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = df_toronto

# join the cluster labels and top venues onto df_toronto, which holds latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')
toronto_merged.head() # check the last columns!

In [None]:
# create map
map_clusters = folium.Map(location=[43.6532, -79.3832], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
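In the map cell, `x` and `ys` are only used for `len(ys)`, which equals `kclusters`, so the palette can be built directly from the number of clusters. A simplified sketch that reuses the `cm` and `colors` aliases imported at the top; indexing with `cluster` instead of `cluster-1` is a choice made here so that label 0 maps to the first colour:

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 10  # same value as in the clustering cell

# one evenly spaced rainbow colour per cluster, converted to hex strings
rainbow = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, kclusters))]

# index directly by label so that cluster 0 gets the first colour
cluster = 0
print(rainbow[cluster])
```

Either indexing scheme gives every cluster a distinct colour; the direct one is just easier to read against a legend.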

Now checking which clusters occur most often

In [None]:
toronto_merged['Cluster Labels'].value_counts().plot(kind= 'bar')
plt.title('Number of Neighborhoods per Cluster')
plt.ylabel('neighborhood count')
plt.xlabel('Cluster')

Checking the 2 most frequent clusters

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]][:10]

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]][:10]