Week 3 of the coursera capstone project

## Part 1

I first import all the libraries that will be used.

In [3]:
import urllib.request
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import geocoder
import requests
from sklearn.cluster import KMeans
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors

The first step to create the requested dataset, is to scrape the wikipedia page.
A combination of urllib and BeautifulSoup will be used to do the scraping.

In [52]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page,'lxml')
#soup.prettify() #Printing the output of the prettify makes the output format in github very unreadable,
#it is left as a comment

With this information i can find the location of the table where the information i required is stored, and extract it.
A loop is put in place to have the information filled into colums and then into a pandas dataframe.

In [6]:
right_table=soup.find('table', class_='wikitable sortable')

Col_1 = []
Col_2 = []
Col_3 = []

for row in right_table.findAll('tr'):
    cells = row.findAll('td')
    if len(cells)==3:
        Col_1.append(cells[0].find(text=True))
        Col_2.append(cells[1].find(text=True))
        Col_3.append(cells[2].find(text=True))

data = pd.DataFrame({'Postal code': Col_1,"Borough":Col_2,"Neighborhood":Col_3})
data.head(20)

Unnamed: 0,Postal code,Borough,Neighborhood
0,M1A\n,Not assigned\n,\n
1,M2A\n,Not assigned\n,\n
2,M3A\n,North York\n,Parkwoods\n
3,M4A\n,North York\n,Victoria Village\n
4,M5A\n,Downtown Toronto\n,Regent Park / Harbourfront\n
5,M6A\n,North York\n,Lawrence Manor / Lawrence Heights\n
6,M7A\n,Downtown Toronto\n,Queen's Park / Ontario Provincial Government\n
7,M8A\n,Not assigned\n,\n
8,M9A\n,Etobicoke\n,Islington Avenue\n
9,M1B\n,Scarborough\n,Malvern / Rouge\n


While the scraping seems to have worked succesfully, there is a /n on many of the entries. I proceed to clean them.

In [7]:
data["Postal code"] = data["Postal code"].str.replace(r'\n','')
data["Borough"] = data["Borough"].str.replace(r'\n','')
data["Neighborhood"] = data["Neighborhood"].str.replace(r'\n','')
data.head(20)

Unnamed: 0,Postal code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
5,M6A,North York,Lawrence Manor / Lawrence Heights
6,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
7,M8A,Not assigned,
8,M9A,Etobicoke,Islington Avenue
9,M1B,Scarborough,Malvern / Rouge


The data is now cleaner.
Proceeding with the instructions given to achieve the desired dataset, I eliminate all Boroughs that have "Not assigned" as its value.

In [8]:
data2 = data[data['Borough']!='Not assigned'].reset_index()
data2.groupby("Postal code").count()

Unnamed: 0_level_0,index,Borough,Neighborhood
Postal code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
M1B,1,1,1
M1C,1,1,1
M1E,1,1,1
M1G,1,1,1
M1H,1,1,1
...,...,...,...
M9N,1,1,1
M9P,1,1,1
M9R,1,1,1
M9V,1,1,1


The scraping process already grouped the neighbourhoods by postal code. In order for the data to be as exact  as possible i have to change the '/' to ',' to divide the neighborhoods, that hav ethe same postal code.

In [9]:
data2['Neighborhood']=data2['Neighborhood'].str.replace(r' /',',')
data2=data2.drop('index',axis=1)
data2.head(20)

Unnamed: 0,Postal code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


The final request is that there should be Neighborhodd with 'Not assigned' as its value. And if it were to exist, to have the same name as the borough. I chekc if there are any value in the columm.

In [10]:
data2[data2['Neighborhood']=='Not assigned']

Unnamed: 0,Postal code,Borough,Neighborhood


There is no Neighborhood with Not assigned.
That was the last requierement.
I print the final version of the dataframe, alongiside size.

In [11]:
data2.head(30)

Unnamed: 0,Postal code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


In [12]:
data2.shape

(103, 3)

## Part 2

This next part is about adding latitude and longitude values to the Postal codes of the previou dataframe. I will be using the geocoder library. Regrettably while the instructions use Google's API, i had to use arcgis to get all the values. Since the source of the values is different, they may differe a bit from the example in the instructions.

In [13]:
latitud = []
longitud = []
for postal in data2['Postal code']:
    gosm = geocoder.arcgis('{}, Toronto, Ontario'.format(postal))
    latitud.append(gosm.lat)
    longitud.append(gosm.lng)
data2['latitud']=latitud
data2['longitud']=longitud
data2.head(20)

Unnamed: 0,Postal code,Borough,Neighborhood,latitud,longitud
0,M3A,North York,Parkwoods,43.752935,-79.335641
1,M4A,North York,Victoria Village,43.728102,-79.31189
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.650964,-79.353041
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.723265,-79.451211
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66179,-79.38939
5,M9A,Etobicoke,Islington Avenue,43.667481,-79.528953
6,M1B,Scarborough,"Malvern, Rouge",43.808626,-79.189913
7,M3B,North York,Don Mills,43.7489,-79.35722
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.707193,-79.311529
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657491,-79.377529


## Part 3

In this part i will recreate the analysis done on the New York data in the "Lab - Segmenting and Clustering Neighborhoods in New York City. 
There is no change done to the analysis. The only thing that was changed were a few issues with missing data (only 3 rows, they were dropped) and a larger number of K means clustering. With a lower K of K-means, most of the data was part of a single cluster.

API Credentials

In [14]:
CLIENT_ID = 'QU0NQZJH5AKMGF5WU041FWUWBWGECFWBZSYYYYKH5QLMUJ0N' # your Foursquare ID
CLIENT_SECRET = 'FCA3CZJJMG40EW2BX5R5GPCY5H2MPJ3BF5TBN1FLMRJ0T01K' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: QU0NQZJH5AKMGF5WU041FWUWBWGECFWBZSYYYYKH5QLMUJ0N
CLIENT_SECRET:FCA3CZJJMG40EW2BX5R5GPCY5H2MPJ3BF5TBN1FLMRJ0T01K


Defining the Fuctions

In [15]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [16]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Data adquisition and Wrangling

In [43]:
LIMIT = 20
radius = 300
Toronto_venues = getNearbyVenues(names=data2['Neighborhood'],
                                   latitudes=data2['latitud'],
                                   longitudes=data2['longitud']
                                  )
Toronto_venues.head()

Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Queen's Park, Ontario Provincial Government
Islington Avenue
Malvern, Rouge
Don Mills
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Don Mills
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
Kennedy Park, Ionview, East Birchmount Park
Bayview Village
Downsview
The Danforth West, Ri

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.752935,-79.335641,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.752935,-79.335641,Corrosion Service Company Limited,43.752432,-79.334661,Construction & Landscaping
2,Parkwoods,43.752935,-79.335641,Variety Store,43.751974,-79.333114,Food & Drink Shop
3,Victoria Village,43.728102,-79.31189,Tim Hortons,43.725517,-79.313103,Coffee Shop
4,Victoria Village,43.728102,-79.31189,Portugril,43.725819,-79.312785,Portuguese Restaurant


In [44]:
Toronto_onehot = pd.get_dummies(Toronto_venues[['Venue Category']], prefix="", prefix_sep="")
Toronto_onehot = Toronto_onehot.drop('Neighborhood',axis=1)
neigh = Toronto_venues['Neighborhood']
Toronto_onehot.insert(0, 'Neighborhood', neigh)


Toronto_onehot.head()

Unnamed: 0,Neighborhood,Airport,American Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Dealership,Auto Garage,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Women's Store,Yoga Studio
0,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [45]:
Toronto_grouped = Toronto_onehot.groupby('Neighborhood').mean().reset_index()
Toronto_grouped

Unnamed: 0,Neighborhood,Airport,American Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Dealership,Auto Garage,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Women's Store,Yoga Studio
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,0.0
1,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,...,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,0.0
2,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0
3,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.5,0.0,0.0,0.00,0.0,0.0,0.0,0.0,0.0
4,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
91,"Willowdale, Newtonbrook",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,0.0
92,Woburn,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,0.0
93,Woodbine Heights,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,0.0
94,York Mills West,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,0.0


In [46]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = Toronto_grouped['Neighborhood']

for ind in np.arange(Toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(Toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Breakfast Spot,Skating Rink,Supermarket,Sushi Restaurant,Badminton Court,Discount Store,Farmers Market,Farm,Falafel Restaurant,Electronics Store
1,"Alderwood, Long Branch",Gym,Gas Station,Pizza Place,Coffee Shop,Convenience Store,Pub,Dance Studio,Sandwich Place,Pharmacy,Athletics & Sports
2,"Bathurst Manor, Wilson Heights, Downsview North",Coffee Shop,Bank,Pharmacy,Frozen Yogurt Shop,Bridal Shop,Sandwich Place,Restaurant,Diner,Supermarket,Sushi Restaurant
3,Bayview Village,Construction & Landscaping,Trail,Yoga Studio,Dessert Shop,Farm,Falafel Restaurant,Electronics Store,Eastern European Restaurant,Donut Shop,Dog Run
4,"Bedford Park, Lawrence Manor East",Italian Restaurant,Coffee Shop,Sandwich Place,Pharmacy,Sushi Restaurant,Indian Restaurant,Juice Bar,Liquor Store,Comfort Food Restaurant,Pizza Place


K-Means Clustering

In [47]:
kclusters = 8
Toronto_grouped_clustering = Toronto_grouped.drop('Neighborhood', 1)

kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(Toronto_grouped_clustering)

kmeans.labels_[0:10] 

array([3, 1, 1, 0, 3, 3, 3, 3, 3, 3])

In [48]:
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

Toronto_merged = data2
Toronto_merged = Toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')
Toronto_merged=Toronto_merged.dropna()
Toronto_merged['Cluster Labels']=Toronto_merged['Cluster Labels'].astype(int)

Toronto_merged.head()

Unnamed: 0,Postal code,Borough,Neighborhood,latitud,longitud,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.752935,-79.335641,2,Construction & Landscaping,Food & Drink Shop,Park,Yoga Studio,Diner,Farm,Falafel Restaurant,Electronics Store,Eastern European Restaurant,Donut Shop
1,M4A,North York,Victoria Village,43.728102,-79.31189,1,Pizza Place,Intersection,Portuguese Restaurant,Park,Coffee Shop,French Restaurant,Electronics Store,Eastern European Restaurant,Donut Shop,Dog Run
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.650964,-79.353041,3,Pub,Café,Athletics & Sports,Performing Arts Venue,Theater,Distribution Center,Mediterranean Restaurant,Food Truck,Mexican Restaurant,French Restaurant
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.723265,-79.451211,3,Clothing Store,Cosmetics Shop,Mediterranean Restaurant,American Restaurant,Leather Goods Store,Tea Room,Shopping Mall,Kitchen Supply Store,Men's Store,Toy / Game Store
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66179,-79.38939,3,Coffee Shop,Yoga Studio,Spa,Burrito Place,Sushi Restaurant,Distribution Center,Italian Restaurant,Discount Store,Fried Chicken Joint,Middle Eastern Restaurant


Cluster Plotting

In [49]:
map_clusters = folium.Map(location=[43.651070, -79.347015], zoom_start=11)

x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

markers_colors = []
for lat, lon, poi, cluster in zip(Toronto_merged['latitud'], Toronto_merged['longitud'], Toronto_merged['Neighborhood'], Toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters.save(outfile='test.html')

The final output of the code is the same cluster map that was designed for Ney York, but for Toroto.
Regrettably since Github does not render the HTML objet in this notebook correctly, i have to use a non interactive screenshot of the folium map and paste it as follows:

<img src="cluster_map.PNG">