# CAPSTONE PROJECT FOR DATA SCIENCE CERTIFICATE 
### This notebook's author is Eduardo Mendoza; it is used mainly for the capstone project of this certificate
#### Running Python 3.6

# PART 1: DOWNLOADING AND CLEANING DATAFRAME

In [66]:
import numpy as np
import pandas as pd

import requests
from pandas.io.json import json_normalize
import json

from bs4 import BeautifulSoup

from geopy.geocoders import Nominatim

import matplotlib.cm as cm
import matplotlib.colors as colors

!pip install folium==0.5.0

import folium

from sklearn.cluster import KMeans

print('Libraries imported!')

print("Hello Capstone Project Course")

Libraries imported!
Hello Capstone Project Course


In [67]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
df = pd.read_html(url)

type(df)

list

In [68]:
len(df)

3

In [69]:
df = df[0]
df.head()

| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
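
Because `pd.read_html` returns a list with every table found on the page, selecting the table by position (`df[0]`) can break if the page layout changes. A minimal sketch, assuming a hypothetical helper `pick_table` that is not part of the original notebook, selects the first table containing the expected columns. The toy data frames stand in for the real list of tables.

```python
import pandas as pd

def pick_table(tables, required_columns):
    """Return the first DataFrame in `tables` that has all `required_columns`.

    Hypothetical helper, shown for illustration only.
    """
    required = set(required_columns)
    for table in tables:
        if required.issubset(table.columns):
            return table
    raise ValueError("No table contains columns: {}".format(sorted(required)))

# Toy stand-ins for the list that pd.read_html would return
tables = [
    pd.DataFrame({"Key": ["a"], "Value": [1]}),
    pd.DataFrame({"Postal Code": ["M3A"],
                  "Borough": ["North York"],
                  "Neighborhood": ["Parkwoods"]}),
]

postal_table = pick_table(tables, ["Postal Code", "Borough", "Neighborhood"])
print(postal_table.shape)
```

With the real page, the call would look like `pick_table(pd.read_html(url), ["Postal Code", "Borough", "Neighborhood"])`.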


Checking the data frame's info to confirm that each column has a suitable type; if not, we must convert the columns before working with them.

In [70]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 3 columns):
Postal Code     180 non-null object
Borough         180 non-null object
Neighborhood    180 non-null object
dtypes: object(3)
memory usage: 4.3+ KB


Dropping rows whose borough is "Not assigned"

In [71]:
ignoringNA = df[df.Borough != 'Not assigned'].reset_index(drop=True)

Merging neighborhoods that share the same postal code and borough into a single row

In [72]:
TOmerged = ignoringNA.groupby(['Postal Code','Borough'], as_index=False).agg(lambda x: ','.join(x))

Assigning the borough name to neighborhoods that are "Not assigned"

In [73]:
mask = TOmerged['Neighborhood'] == "Not assigned"
TOmerged.loc[mask, 'Neighborhood'] = TOmerged.loc[mask, 'Borough']

In [74]:
TOmerged.head()

| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
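
The three cleaning steps above are easier to verify on a small example. The sketch below repeats them on a toy table; the rows are illustrative and are not taken from the downloaded data.

```python
import pandas as pd

# Toy rows that mimic the raw table (values are illustrative)
raw = pd.DataFrame({
    "Postal Code": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto",
                "Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Regent Park",
                     "Harbourfront", "Not assigned"],
})

# 1. Drop rows without a borough
assigned = raw[raw["Borough"] != "Not assigned"].reset_index(drop=True)

# 2. Join neighborhoods that share a postal code and borough
merged = assigned.groupby(["Postal Code", "Borough"], as_index=False).agg(
    lambda x: ", ".join(x))

# 3. Fall back to the borough name where the neighborhood is missing
mask = merged["Neighborhood"] == "Not assigned"
merged.loc[mask, "Neighborhood"] = merged.loc[mask, "Borough"]

print(merged)
```

The row for M1A disappears, the two M5A rows collapse into one, and the M7A neighborhood takes the borough name.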


Using the `shape` attribute of the final data frame to check its number of rows and columns

In [75]:
print(TOmerged.shape)
print("The data frame has {} rows and {} columns".format(*TOmerged.shape))

(103, 3)
The data frame has 103 rows and 3 columns


# PART 2: DOWNLOADING AND MERGING COORDINATES TO DATAFRAME

Downloading the postal code coordinates from a CSV file, because the API was unresponsive

In [76]:
!wget -q -O "toronto_coordinates.csv" http://cocl.us/Geospatial_data
print('Coordinates downloaded!')
coors = pd.read_csv('toronto_coordinates.csv')

Coordinates downloaded!


We need to merge the two dataframes so that a single dataframe holds the postal codes, boroughs, neighborhoods and coordinates. To do this, we first set a common index (the postal code) on both and then apply an inner join.

In [77]:
df1 = TOmerged.set_index('Postal Code')
coors_temp = coors.set_index('Postal Code')
toronto_df_coors = pd.concat([df1, coors_temp], axis=1, join='inner')

toronto_df_coors.index.name = 'PostalCode'
toronto_df_coors.reset_index(inplace=True)

print(toronto_df_coors.shape)
toronto_df_coors.head()

(103, 5)


| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
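
The same inner join can also be written with `pd.merge` on the shared column, which avoids setting and resetting the index. This is an alternative sketch on toy data, not the method used above; the `M9Z` row and its zero coordinates are made-up placeholders that show how an unmatched postal code is dropped.

```python
import pandas as pd

neighborhoods = pd.DataFrame({
    "Postal Code": ["M1B", "M1C", "M1E"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
})
coordinates = pd.DataFrame({
    "Postal Code": ["M1B", "M1C", "M9Z"],   # M9Z is a made-up code
    "Latitude": [43.806686, 43.784535, 0.0],
    "Longitude": [-79.194353, -79.160497, 0.0],
})

# how="inner" keeps only postal codes present in both frames;
# validate="one_to_one" raises if a postal code is duplicated on either side
merged = pd.merge(neighborhoods, coordinates, on="Postal Code",
                  how="inner", validate="one_to_one")
print(merged)
```

Only M1B and M1C survive the join, because M1E has no coordinates and M9Z has no neighborhood row.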


# PART 3: EXPLORING THE NEIGHBORHOODS OF TORONTO 

In [78]:
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="tl-toronto-neigh")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))


The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [79]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

for lat, long, post, borough, neigh in zip(toronto_df_coors['Latitude'], toronto_df_coors['Longitude'], toronto_df_coors['PostalCode'], toronto_df_coors['Borough'], toronto_df_coors['Neighborhood']):
    label = "{} ({}): {}".format(borough, post, neigh)
    popup = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, long],
        radius=5,
        popup=popup,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)
    
map_toronto

As recommended in the assignment, we reduce the number of boroughs and analyze only those with "Toronto" in their names.

In [80]:
toronto_boroughs = ['East Toronto', 'Central Toronto', 'Downtown Toronto', 'West Toronto']
toronto_central_df = toronto_df_coors[toronto_df_coors['Borough'].isin(toronto_boroughs)].reset_index(drop=True)
print(toronto_central_df.shape)
toronto_central_df.head()

(39, 5)


| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 |
| 1 | M4K | East Toronto | The Danforth West, Riverdale | 43.679557 | -79.352188 |
| 2 | M4L | East Toronto | India Bazaar, The Beaches West | 43.668999 | -79.315572 |
| 3 | M4M | East Toronto | Studio District | 43.659526 | -79.340923 |
| 4 | M4N | Central Toronto | Lawrence Park | 43.72802 | -79.38879 |
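
Instead of listing the four boroughs by hand, the filter could match on the borough name itself. The sketch below is an assumption about how that might look on toy data; on the real table the result should be compared with the hard-coded list, since any other borough name containing "Toronto" would also be selected.

```python
import pandas as pd

boroughs = pd.DataFrame({
    "Borough": ["East Toronto", "Scarborough", "Downtown Toronto", "North York"],
})

# regex=False treats "Toronto" as a plain substring
has_toronto = boroughs["Borough"].str.contains("Toronto", regex=False)
central = boroughs[has_toronto].reset_index(drop=True)
print(central)
```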


In [62]:

map_toronto = folium.Map(location=[latitude, longitude], zoom_start=12)

for lat, long, post, borough, neigh in zip(toronto_central_df['Latitude'], toronto_central_df['Longitude'], toronto_central_df['PostalCode'], toronto_central_df['Borough'], toronto_central_df['Neighborhood']):
    label = "{} ({}): {}".format(borough, post, neigh)
    popup = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, long],
        radius=5,
        popup=popup,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)
    
map_toronto

### Exploring Toronto using the Foursquare API

Hidden cell with the Foursquare credentials (`CLIENT_ID`, `CLIENT_SECRET` and `VERSION`)

In [93]:
# The code was removed by Watson Studio for sharing.

In [94]:
# LIMIT is set in the search cell below, next to radius
search_query = 'Italian'

Searching for Italian venues within 500 m of each postal code's coordinates in the selected Toronto boroughs

In [89]:
radius = 500
LIMIT = 100

venues = []

for lat, long, post, borough, neighborhood in zip(toronto_central_df['Latitude'], toronto_central_df['Longitude'], toronto_central_df['PostalCode'], toronto_central_df['Borough'], toronto_central_df['Neighborhood']):
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}&query={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        radius, 
        LIMIT,
        search_query)
    
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    for venue in results:
        venues.append((
            post, 
            borough,
            neighborhood,
            lat, 
            long, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))
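
The loop above assumes that every response contains `response.groups[0].items` and that every venue has at least one category. A more defensive variant could separate building the query parameters from parsing the response. The sketch below is only an illustration: `build_explore_params` and `extract_venues` are hypothetical helpers, and `sample` is a hand-written payload shaped like the fields the notebook reads, not real API output.

```python
def build_explore_params(client_id, client_secret, version, lat, lng,
                         radius, limit, query):
    """Collect the query-string parameters used in the URL above."""
    return {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
        "query": query,
    }

def extract_venues(payload):
    """Return (name, lat, lng, category) tuples, or [] if there are no items."""
    groups = payload.get("response", {}).get("groups", [])
    if not groups:
        return []
    rows = []
    for item in groups[0].get("items", []):
        venue = item["venue"]
        categories = venue.get("categories", [])
        # Some venues may come back without a category
        category = categories[0]["name"] if categories else None
        rows.append((venue["name"],
                     venue["location"]["lat"],
                     venue["location"]["lng"],
                     category))
    return rows

# Hand-written payload for illustration (not real API output)
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Example Trattoria",
               "location": {"lat": 43.65, "lng": -79.38},
               "categories": [{"name": "Italian Restaurant"}]}},
    {"venue": {"name": "Uncategorized Place",
               "location": {"lat": 43.66, "lng": -79.39},
               "categories": []}},
]}]}}

params = build_explore_params("ID", "SECRET", "VERSION", 43.65, -79.38,
                              500, 100, "Italian")
rows = extract_venues(sample)
print(rows)
```

In the notebook, the parameters would be passed as `requests.get("https://api.foursquare.com/v2/venues/explore", params=params)`, which lets `requests` handle the URL encoding.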

Once we extract the data from the API we can take a look at the data frame. 

In [90]:
venues_df = pd.DataFrame(venues)
venues_df.columns = ['PostalCode', 'Borough', 'Neighborhood', 'BoroughLatitude', 'BoroughLongitude', 'VenueName', 'VenueLatitude', 'VenueLongitude', 'VenueCategory']
print(venues_df.shape)
venues_df.head()

(202, 9)


| | PostalCode | Borough | Neighborhood | BoroughLatitude | BoroughLongitude | VenueName | VenueLatitude | VenueLongitude | VenueCategory |
|---|---|---|---|---|---|---|---|---|---|
| 0 | M4K | East Toronto | The Danforth West, Riverdale | 43.679557 | -79.352188 | Cafe Fiorentina | 43.677743 | -79.350115 | Italian Restaurant |
| 1 | M4K | East Toronto | The Danforth West, Riverdale | 43.679557 | -79.352188 | 7 Numbers | 43.677062 | -79.353934 | Italian Restaurant |
| 2 | M4K | East Toronto | The Danforth West, Riverdale | 43.679557 | -79.352188 | Pizzeria Libretto | 43.678489 | -79.347576 | Italian Restaurant |
| 3 | M4K | East Toronto | The Danforth West, Riverdale | 43.679557 | -79.352188 | IL FORNELLO on Danforth | 43.678604 | -79.346904 | Italian Restaurant |
| 4 | M4K | East Toronto | The Danforth West, Riverdale | 43.679557 | -79.352188 | Trapezzi | 43.678153 | -79.34905 | Italian Restaurant |


We count the venues per postal code. This is interesting because it can hint at the demographics of an area, and it raises a question: why are there so many Italian restaurants in certain postal codes? Could there be a large Italian immigrant community nearby?

In [92]:
venues_df.groupby(['PostalCode', 'Borough', 'Neighborhood'])['VenueName'].count()

PostalCode  Borough           Neighborhood                                     
M4K         East Toronto      The Danforth West, Riverdale                          5
M4L         East Toronto      India Bazaar, The Beaches West                        1
M4M         East Toronto      Studio District                                       2
M4R         Central Toronto   North Toronto West, Lawrence Park                     1
M4S         Central Toronto   Davisville                                            3
M4X         Downtown Toronto  St. James Town, Cabbagetown                           2
M4Y         Downtown Toronto  Church and Wellesley                                  5
M5A         Downtown Toronto  Regent Park, Harbourfront                             3
M5B         Downtown Toronto  Garden District, Ryerson                             11
M5C         Downtown Toronto  St. James Town                                       20
M5E         Downtown Toronto  Berczy Park                   

In [91]:
venues_df['VenueCategory'].nunique()

12
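
To see at a glance which postal codes have the most venues, the counts can be sorted in descending order. A minimal sketch on toy data follows; the rows are illustrative and do not come from the Foursquare results.

```python
import pandas as pd

venues = pd.DataFrame({
    "PostalCode": ["M4K", "M4K", "M5C", "M5C", "M5C", "M4L"],
    "VenueCategory": ["Italian Restaurant", "Pizza Place",
                      "Italian Restaurant", "Italian Restaurant",
                      "Cafe", "Italian Restaurant"],
})

# Venues per postal code, largest first
counts = (venues.groupby("PostalCode")
                .size()
                .sort_values(ascending=False))

# Number of distinct venue categories
n_categories = venues["VenueCategory"].nunique()

print(counts)
print(n_categories)
```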