# The IBM Applied DS Capstone Project

This notebook walks through the [IBM Data Science Professional Certificate Specialization: Applied Data Science Capstone](https://www.coursera.org/learn/applied-data-science-capstone).

In [1]:
import pandas as pd
import numpy as np

print('Hello Capstone Project Course!')

Hello Capstone Project Course!


## Scrape the "List of postal codes of Canada: M" Wikipedia page for Toronto postal codes

Required imports

In [2]:
from bs4 import BeautifulSoup
import requests

Fetch wiki page as a BeautifulSoup object

In [3]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

soup = BeautifulSoup(requests.get(url).text, 'lxml')

Find the HTML table that contains the codes

In [4]:
soup_codes_table = soup.find('table', class_='wikitable sortable')

Convert to pandas dataframe

In [5]:
pd.set_option('display.max_colwidth', None) # show full column contents for wide columns later on (pandas >= 1.0; older versions used -1)

postal_codes_df = pd.read_html(str(soup_codes_table), header = 0)[0]
postal_codes_df.columns = ['PostalCode', 'Borough', 'Neighborhood']

postal_codes_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


Drop entries with "Not assigned" Borough

In [6]:
postal_codes_df = postal_codes_df[postal_codes_df.Borough != 'Not assigned']

postal_codes_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights


Aggregate function to be used with groupby()

In [7]:
# this function forms a comma+space separated string of the unique sorted elements of `series`
def aggreg(series):
    return ', '.join(sorted(set(series)))

Aggregate neighborhoods with the same PostalCode to comma+space separated strings

In [8]:
postal_codes_df = postal_codes_df.groupby('PostalCode').agg(aggreg).reset_index()

postal_codes_df.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Highland Creek, Port Union, Rouge Hill"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"
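To see the aggregation in isolation, here is a minimal sketch on a made-up three-row frame. The rows and the helper name `join_unique` are illustrative assumptions, not part of the scraped data; the helper is meant to behave like `aggreg` for non-empty groups.

```python
import pandas as pd

# toy rows for illustration only; not taken from the Wikipedia table
toy_df = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M6A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Regent Park', 'Harbourfront', 'Lawrence Heights'],
})

def join_unique(series):
    # comma+space separated string of the unique, sorted values
    return ', '.join(sorted(set(series)))

# every non-key column is collapsed to one string per postal code
toy_grouped = toy_df.groupby('PostalCode').agg(join_unique).reset_index()
print(toy_grouped)
```

Because the values pass through `set`, a borough repeated within one postal code collapses to a single name, while distinct neighborhoods are joined in alphabetical order.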


Do we have 'Not assigned' Neighborhoods?

In [9]:
not_assigned_neighborhoods = postal_codes_df.Neighborhood == 'Not assigned'
#postal_codes_df[postal_codes_df.Neighborhood == 'Not assigned'].head()
postal_codes_df[not_assigned_neighborhoods]

Unnamed: 0,PostalCode,Borough,Neighborhood
85,M7A,Queen's Park,Not assigned


Let's replace each 'Not assigned' Neighborhood with its respective Borough

In [10]:
postal_codes_df.Neighborhood = list(map(lambda x: x[1] if x[1] != 'Not assigned' else x[0], zip(postal_codes_df.Borough, postal_codes_df.Neighborhood)))

# test
postal_codes_df.loc[not_assigned_neighborhoods]

Unnamed: 0,PostalCode,Borough,Neighborhood
85,M7A,Queen's Park,Queen's Park
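The same replacement can probably be written without `map` and `zip`. The sketch below uses `Series.where` on a made-up two-row frame; it is one possible alternative, not the approach used in this notebook.

```python
import pandas as pd

# toy rows for illustration only
toy_df = pd.DataFrame({
    'Borough': ["Queen's Park", 'North York'],
    'Neighborhood': ['Not assigned', 'Parkwoods'],
})

# keep Neighborhood where it is assigned, otherwise fall back to Borough
assigned = toy_df['Neighborhood'] != 'Not assigned'
toy_df['Neighborhood'] = toy_df['Neighborhood'].where(assigned, toy_df['Borough'])
print(toy_df)
```

`where` keeps the original value wherever the condition is true and takes the value from the second argument elsewhere, aligned by index.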


`postal_codes_df` size

In [11]:
postal_codes_df.shape

(103, 3)

## Add Latitude and Longitude

First, load the dataframe of coordinates for each postal code

In [12]:
coordinates_df = pd.read_csv('https://cocl.us/Geospatial_data')
coordinates_df.columns = ['PostalCode', 'Latitude', 'Longitude']

coordinates_df.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Now create a merged postal codes dataframe

In [13]:
postal_codes_with_coords_df = pd.merge(postal_codes_df, coordinates_df, on='PostalCode')

postal_codes_with_coords_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Port Union, Rouge Hill",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


Check the shape to verify nothing got lost due to inner join (i.e., `coordinates_df` has all the needed postal codes entries)

In [14]:
postal_codes_with_coords_df.shape

(103, 5)

So we're all good: `postal_codes_with_coords_df` has as many rows as `postal_codes_df`.
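Comparing shapes only tells us that no rows were lost. If some were, a left join with `indicator=True` would show which postal codes lack coordinates. The frames below are made up for illustration, and 'M9Z' is a deliberately unmatched code.

```python
import pandas as pd

# toy frames for illustration only; 'M9Z' deliberately has no coordinates
codes = pd.DataFrame({'PostalCode': ['M1B', 'M1C', 'M9Z']})
coords = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# a left join keeps every postal code; `_merge` records where each row came from
checked = pd.merge(codes, coords, on='PostalCode', how='left', indicator=True)
missing = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(missing)
```

An empty `missing` list would confirm the same thing as the shape check, with the advantage of naming the offending codes when the check fails.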

## Explore and cluster the neighborhoods

Let's first see which boroughs we have

In [15]:
postal_codes_with_coords_df.Borough.unique()

array(['Scarborough', 'North York', 'East York', 'East Toronto',
       'Central Toronto', 'Downtown Toronto', 'York', 'West Toronto',
       "Queen's Park", 'Mississauga', 'Etobicoke'], dtype=object)

All 103 rows are more than we need for learning purposes. What if we keep only the boroughs whose names contain the substring `Toronto`?

In [16]:
boroughs = ['East Toronto', 'Central Toronto', 'Downtown Toronto', 'West Toronto']

postal_codes_with_coords_df[postal_codes_with_coords_df.Borough.isin(boroughs)].shape

(38, 5)
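Instead of typing the borough names by hand, the list could also be derived with a substring match. A minimal sketch, using the borough names from the `unique()` output above:

```python
import pandas as pd

# borough names copied from the unique() output above
borough_names = pd.Series(['Scarborough', 'North York', 'East York', 'East Toronto',
                           'Central Toronto', 'Downtown Toronto', 'York', 'West Toronto',
                           "Queen's Park", 'Mississauga', 'Etobicoke'])

# select the names that contain the substring instead of listing them by hand
toronto_boroughs = borough_names[borough_names.str.contains('Toronto')].tolist()
print(toronto_boroughs)
```

The same boolean mask, `df.Borough.str.contains('Toronto')`, should work directly as a row filter in place of `isin(boroughs)`.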

38 rows looks good. Let's create a new `toronto_df` dataframe with the chosen boroughs

In [17]:
toronto_df = postal_codes_with_coords_df[postal_codes_with_coords_df.Borough.isin(boroughs)].sort_values(['Borough', 'Neighborhood'])
toronto_df.reset_index(inplace = True, drop = True)

toronto_df

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M4S,Central Toronto,Davisville,43.704324,-79.38879
1,M4P,Central Toronto,Davisville North,43.712751,-79.390197
2,M4V,Central Toronto,"Deer Park, Forest Hill SE, Rathnelly, South Hill, Summerhill West",43.686412,-79.400049
3,M5P,Central Toronto,"Forest Hill North, Forest Hill West",43.696948,-79.411307
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
5,M4T,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316
6,M5R,Central Toronto,"North Midtown, The Annex, Yorkville",43.67271,-79.405678
7,M4R,Central Toronto,North Toronto West,43.715383,-79.405678
8,M5N,Central Toronto,Roselawn,43.711695,-79.416936
9,M5H,Downtown Toronto,"Adelaide, King, Richmond",43.650571,-79.384568


### Let's visualize the neighborhoods on a map of Toronto

Required imports and `geopy` options first

In [None]:
!conda install -c conda-forge folium=0.5.0 --yes
!conda install -c conda-forge geopy --yes

Fetching package metadata .............
Solving package specifications: .

# All requested packages already installed.
# packages in environment at /opt/conda/envs/DSX-Python35:
#
folium                    0.5.0                      py_0    conda-forge
Fetching package metadata .............
Solving package specifications: 

In [None]:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import folium # map rendering library
import geopy
geopy.geocoders.options.default_user_agent = "my-application"

Get Toronto coordinates

In [None]:
address = 'Toronto, ON'

geolocator = Nominatim()
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

Create a map of Toronto with chosen neighborhoods superimposed on top.

In [None]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_df['Latitude'], toronto_df['Longitude'], toronto_df['Borough'], toronto_df['Neighborhood']):
    label = '{}: {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

## Utilize the Foursquare API to explore the neighborhoods and segment them

Secret credentials cell

In [None]:
# The code was removed by Watson Studio for sharing.

API version and LIMIT

In [None]:
VERSION = '20180605' # Foursquare API version
LIMIT = 100

Let's borrow the `getNearbyVenues` function from the 'Neighborhoods New York' lab

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
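`getNearbyVenues` indexes straight into the JSON response, so a response without groups or a venue without categories would raise an error. The sketch below separates the parsing step into a hypothetical helper, `extract_venue_rows`, that tolerates both cases. The payload is made up and only mirrors the keys used above; whether the API ever returns such responses is an assumption.

```python
def extract_venue_rows(name, lat, lng, payload):
    """Flatten one explore response into per-venue tuples (hypothetical helper)."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        # assumption: fall back to None when a venue has no category
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows

# made-up payload in the same shape that getNearbyVenues indexes into
fake_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Cafe A', 'location': {'lat': 43.65, 'lng': -79.38},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Spot B', 'location': {'lat': 43.66, 'lng': -79.39},
               'categories': []}},
]}]}}

rows = extract_venue_rows('Toy Neighborhood', 43.65, -79.38, fake_payload)
print(rows)

# a response without groups yields no rows instead of raising
print(extract_venue_rows('Empty', 0.0, 0.0, {'response': {}}))
```

Keeping the parsing separate from the HTTP request also makes it possible to test it without credentials or network access.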

And use it to get Toronto venues

In [None]:
torronto_venues = getNearbyVenues(names=toronto_df['Neighborhood'],
                                  latitudes=toronto_df['Latitude'],
                                  longitudes=toronto_df['Longitude']
                                 )

Size of the resulting dataframe

In [None]:
print(torronto_venues.shape)
torronto_venues.head()

How many unique categories do we have?

In [None]:
print('There are {} unique categories.'.format(len(torronto_venues['Venue Category'].unique())))

### Now let's prepare the `torronto_grouped` dataframe for clustering and generate a dataframe with the 10 most common venue categories for each neighborhood

Do it the same way as in the New York Neighborhoods lab.

In [None]:
# one hot encoding
torronto_onehot = pd.get_dummies(torronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
torronto_onehot['Neighborhood'] = torronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [torronto_onehot.columns[-1]] + list(torronto_onehot.columns[:-1])
torronto_onehot = torronto_onehot[fixed_columns]

# group rows by neighborhood and by taking the mean of the frequency of occurrence of each category
torronto_grouped = torronto_onehot.groupby('Neighborhood').mean().reset_index()

torronto_grouped
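To make the meaning of the grouped values concrete, here is a small sketch on made-up venues. After one-hot encoding, the mean per neighborhood is the share of that neighborhood's venues that fall into each category, so every row sums to 1.

```python
import pandas as pd

# toy venues for illustration only
toy_venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B'],
    'Venue Category': ['Cafe', 'Cafe', 'Park', 'Bar', 'Park'],
})

# one column per category, 1.0 where the venue belongs to it
toy_onehot = pd.get_dummies(toy_venues['Venue Category'], dtype=float)
toy_onehot.insert(0, 'Neighborhood', toy_venues['Neighborhood'])

# the mean of the 0/1 indicators is the frequency of each category
toy_grouped = toy_onehot.groupby('Neighborhood').mean().reset_index()
print(toy_grouped)
```

Neighborhood A has four venues, two of them cafes, so its `Cafe` value is 0.5; neighborhood B has a single park, so its `Park` value is 1.0.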

In [None]:
# a function to sort the venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # positions past the 3rd get the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = torronto_grouped['Neighborhood']

for ind in np.arange(torronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(torronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted
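The suffix list with a fallback to 'th' works for the 10 columns used here, but it would label a 21st column as '21th' if `num_top_venues` were ever raised above 20. A hypothetical `ordinal` helper, sketched below, covers those cases as well.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        # 11th, 12th, 13th (and 111th, 112th, ...) are irregular
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

num_top_venues = 10
columns = ['Neighborhood'] + ['{} Most Common Venue'.format(ordinal(i))
                              for i in range(1, num_top_venues + 1)]
print(columns)
```

For `num_top_venues = 10` this should produce the same column names as the loop above.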

### Cluster the neighborhoods

In [None]:
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5

torronto_grouped_clustering = torronto_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(torronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]
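For intuition about what the fit does, here is a bare-bones sketch of Lloyd's algorithm in NumPy. It is not scikit-learn's implementation, which by default uses k-means++ initialization among other refinements; the function name `lloyd_kmeans` and the toy points are assumptions made for illustration.

```python
import numpy as np

def assign(points, centroids):
    # squared distance from every point to every centroid, shape (n, k)
    dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def lloyd_kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    rng = np.random.RandomState(seed)
    # start from k distinct rows picked at random
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = assign(points, centroids)
        new_centroids = centroids.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # final assignment against the final centroids
    return assign(points, centroids), centroids

# two well-separated toy groups
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels, centroids = lloyd_kmeans(points, k=2)
print(labels)
```

In the notebook, each row of `torronto_grouped_clustering` plays the role of one point, with one coordinate per venue category.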

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [None]:
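# kmeans.labels_ follows the row order of torronto_grouped (sorted by Neighborhood),
# not the order of toronto_df, so attach the labels to the per-neighborhood table first
cluster_labels_df = neighborhoods_venues_sorted.copy()
cluster_labels_df.insert(0, 'Cluster Labels', kmeans.labels_)

# merge with toronto_df to add borough and latitude/longitude for each neighborhood
toronto_merged = pd.merge(toronto_df, cluster_labels_df, on='Neighborhood')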

toronto_merged

Let's visualize the resulting clusters

In [None]:
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
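The color scheme only needs one distinct hex color per cluster. As a sketch of an alternative that avoids the matplotlib dependency, the hypothetical helper below spaces hues evenly with the standard-library `colorsys` module; the resulting palette differs from `cm.rainbow`.

```python
import colorsys

def cluster_colors(n_clusters):
    """Return n_clusters evenly spaced hues as '#rrggbb' strings."""
    palette = []
    for i in range(n_clusters):
        # full saturation and brightness, hue stepped around the color wheel
        rgb = colorsys.hsv_to_rgb(i / n_clusters, 1.0, 1.0)
        palette.append('#' + ''.join('{:02x}'.format(int(round(c * 255))) for c in rgb))
    return palette

palette = cluster_colors(5)
print(palette)
```

A marker for cluster label `c` would then take `palette[c]` as its `color` and `fill_color`.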

### Examine clusters

#### Cluster 1

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1, 2] + list(range(5, toronto_merged.shape[1]))]]

#### Cluster 2

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1, 2] + list(range(5, toronto_merged.shape[1]))]]

#### Cluster 3

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1, 2] + list(range(5, toronto_merged.shape[1]))]]

#### Cluster 4

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1, 2] + list(range(5, toronto_merged.shape[1]))]]

#### Cluster 5

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1, 2] + list(range(5, toronto_merged.shape[1]))]]
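The five cells above repeat one expression with a different label each time. A small helper can express the same selection once. The sketch below uses a hypothetical `cluster_view` function and a made-up two-row frame with the same column layout as `toronto_merged`; the venue values are invented.

```python
import pandas as pd

def cluster_view(merged, label, first_venue_col=5):
    """Rows of one cluster: Borough, Neighborhood and the columns from index 5 on."""
    keep = [1, 2] + list(range(first_venue_col, merged.shape[1]))
    return merged.loc[merged['Cluster Labels'] == label, merged.columns[keep]]

# toy frame with the same column layout as toronto_merged (venue values are made up)
toy_merged = pd.DataFrame({
    'PostalCode': ['M4S', 'M5H'],
    'Borough': ['Central Toronto', 'Downtown Toronto'],
    'Neighborhood': ['Davisville', 'Adelaide, King, Richmond'],
    'Latitude': [43.704324, 43.650571],
    'Longitude': [-79.38879, -79.384568],
    'Cluster Labels': [0, 1],
    '1st Most Common Venue': ['Park', 'Cafe'],
})

# cluster headings are 1-based while the labels are 0-based
for label in sorted(toy_merged['Cluster Labels'].unique()):
    print('Cluster', label + 1)
    print(cluster_view(toy_merged, label))
```

With the real data, `cluster_view(toronto_merged, 0)` should return the same rows as the Cluster 1 cell.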