# Toronto Neighborhood Segmentation 
_________________________

### This notebook retrieves data about Toronto neighborhoods from Wikipedia and uses it to cluster the city by the venues found in each area.


In [None]:
# first install the needed libraries (html5lib is the parser passed to BeautifulSoup further down)
!pip install bs4
!pip install requests
!pip install html5lib

In [3]:
# next import the libraries needed for the web scraping
from bs4 import BeautifulSoup # parses the HTML document into a tree of objects
import requests  # sends the HTTP request that downloads the page
import pandas as pd # provides the dataframe structure

Use the Wikipedia page that lists the postal codes for the boroughs and neighborhoods in Toronto.


In [4]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

In [5]:
# use the request library to retrieve the HTML from the page, in a text format
data  = requests.get(url).text

Now we need to parse the text into a tree-like structure using a BeautifulSoup object.

In [6]:
soup = BeautifulSoup(data,"html5lib")

The __find()__ method of the soup object returns the first matching element, so here it gives us the first table in the HTML document. The postal code table is the first table on this page, so that one table is enough for our exploration.


In [7]:
table = soup.find('table')

In [8]:
# let's make an empty list for holding the dictionaries that we will retrieve from our Wikipedia table 
table_list = []

In [10]:
# iterate through all the cells (<td> elements) in the table; each cell holds one postal code
for cell in table.find_all('td'):
    # skip postal codes that have no borough assigned
    if cell.span.text == 'Not assigned':
        continue
    temp_dic = {}    # dictionary holding the values of one cell
    temp_dic['PostalCode'] = cell.p.text[:3]
    temp_dic['Borough'] = cell.span.text.split('(')[0]
    # take the text inside the parentheses and turn ' /' separators into commas
    temp_dic['Neighborhood'] = cell.span.text.split('(')[1].strip(')').replace(' /', ',').replace(')', ' ').strip(' ')
    table_list.append(temp_dic)

# print(table_list)
df = pd.DataFrame(table_list)
# tidy up borough names whose parts were run together in the page markup
df['Borough'] = df['Borough'].replace({
    'Downtown TorontoStn A PO Boxes25 The Esplanade': 'Downtown Toronto Stn A',
    'East TorontoBusiness reply mail Processing Centre969 Eastern': 'East Toronto Business',
    'EtobicokeNorthwest': 'Etobicoke Northwest',
    'East YorkEast Toronto': 'East York/East Toronto',
    'MississaugaCanada Post Gateway Processing Centre': 'Mississauga',
})
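
The string handling in the loop above can be tried out without downloading the page. The following is only a minimal sketch: `parse_cell` is a hypothetical helper, not part of the notebook, and it assumes the text of each cell has the form `Borough(Neighborhood / Neighborhood)`, as the loop above expects. Unlike the loop, it also strips surrounding whitespace before comparing with 'Not assigned'.

```python
def parse_cell(code_text, span_text):
    """Split the text of one table cell into postal code, borough and neighborhood.

    Hypothetical helper: assumes span_text looks like
    'Borough(Neighborhood / Neighborhood)'. Returns None for unassigned codes.
    """
    if span_text.strip() == 'Not assigned':
        return None
    # everything before the first '(' is the borough, the rest is the neighborhood list
    borough, _, rest = span_text.partition('(')
    neighborhood = rest.replace(')', '').replace(' /', ',').strip()
    return {
        'PostalCode': code_text.strip()[:3],
        'Borough': borough.strip(),
        'Neighborhood': neighborhood,
    }


print(parse_cell('M5A', 'Downtown Toronto(Regent Park / Harbourfront)'))
```

Keeping the parsing in a small function like this makes it easier to check single cells when the page layout changes.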

In [15]:
# print the first 10 rows of the dataframe
df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills North
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


In the next cell we will check that no __'Not assigned'__ cells remain in our dataframe.

In [35]:
x = df[df['Borough'] == 'Not assigned'].shape
y = df[df['Neighborhood'] == 'Not assigned'].shape
# both row counts must be zero
if x[0] == 0 and y[0] == 0:
    print("The dataset does not include any 'Not assigned' cells.")

The dataset does not include any 'Not assigned' cells.
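
A note on combining comparisons in a check like this one: in Python the bitwise `&` operator binds more tightly than `==`, so joining two comparisons with `&` does not test each of them separately. The sketch below uses made-up row counts (`x_rows` and `y_rows` are illustrative names, not variables from the notebook) to show the difference from the logical `and`.

```python
# hypothetical row counts: the second one is NOT zero
x_rows, y_rows = 0, 5

# parsed as x_rows == (0 & y_rows) == 0, i.e. the chained comparison 0 == 0 == 0
bitwise_check = (x_rows == 0 & y_rows == 0)

# evaluates each comparison on its own and then combines the two results
logical_check = (x_rows == 0 and y_rows == 0)

print(bitwise_check)  # True, although y_rows is not zero
print(logical_check)  # False
```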


The next cell will print out the __dimensions__ of our dataset 


In [37]:
dim = df.shape
print('The dataframe has {} rows and {} columns.'.format(dim[0], dim[1]))

The dataframe has 103 rows and 3 columns.


___________
# END OF PART 1
_______________

## In this section we get the longitude and latitude values for the postal codes


Install the needed libraries for transforming our address into longitude and latitude values.

In [45]:
!pip install geopy



In [76]:
!pip install geocoder

Collecting geocoder
  Downloading geocoder-1.38.1-py2.py3-none-any.whl (98 kB)
Collecting ratelim
  Downloading ratelim-0.1.6-py2.py3-none-any.whl (4.0 kB)
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6


In [46]:
from geopy.geocoders import Nominatim

Create two empty lists for the coordinates and a geolocator object for the lookups, then print the intermediate values to see how the geocoder behaves.

In [None]:
list_lat = []
list_long = []
geolocator = Nominatim(user_agent="foursquare_agent")

for postcode in ['M4H']:
    location = None
    # keep asking until the geocoder returns a result (it returns None when nothing is found)
    while location is None:
        address = '{}, Toronto, Ontario'.format(postcode)
        print(address)
        location = geolocator.geocode(address)
        print(location)
    
    latitude = location.latitude
    longitude = location.longitude
    print(latitude)

    list_lat.append(latitude)
    list_long.append(longitude)

Sadly, the geolocator cannot resolve every address: the geocode() call returns None when it finds no match, so the loop above never ends for those postal codes. Let's try another method.
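
A loop of this kind can be kept from running forever by capping the number of attempts. The following is only a sketch under assumptions: `geocode_with_retry` and `always_missing` are hypothetical names, and `always_missing` stands in for a geocoder call that never finds a match, so the example runs without any network access.

```python
import time


def geocode_with_retry(geocode, address, max_attempts=3, delay_seconds=1.0):
    """Call geocode(address) up to max_attempts times; return None if every attempt fails."""
    for attempt in range(1, max_attempts + 1):
        location = geocode(address)
        if location is not None:
            return location
        # pause before the next attempt, but not after the last one
        if attempt < max_attempts:
            time.sleep(delay_seconds)
    return None


def always_missing(address):
    """Stand-in for a geocoder lookup that never finds the address."""
    return None


result = geocode_with_retry(always_missing, 'M4H, Toronto, Ontario',
                            max_attempts=2, delay_seconds=0)
print(result)  # None
```

With the real geolocator, the first argument would be its geocode method; a postal code that still comes back as None can then be skipped or handled separately.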

In [None]:
import geocoder

# initialize your variable to None
lat_lng_coords = None
postal_code = 'M3A'
# loop until you get the coordinates
while lat_lng_coords is None:
    g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
    lat_lng_coords = g.latlng
    print(lat_lng_coords)

latitude = lat_lng_coords[0]
longitude = lat_lng_coords[1]

This also ends up in an infinite loop in some cases, so let's use the __.csv file__ of coordinates that is provided instead.

Now load the .csv file into the notebook.


In [81]:
import urllib.request
url = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv'
filename = 'geo.csv'
urllib.request.urlretrieve(url, filename)

('geo.csv', <http.client.HTTPMessage at 0x7fbdd06fe050>)

In [82]:
geo_data = pd.read_csv('geo.csv')

In [83]:
geo_data.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Load the Latitude and Longitude values into two dictionaries, with the postal codes as the keys.

In [102]:
lat_d = {}
long_d = {}
for post, lat, long in zip(geo_data['Postal Code'], geo_data['Latitude'], geo_data['Longitude']):
    lat_d[post] = lat
    long_d[post] = long
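
The same kind of lookup tables can also be built in one step each with `dict(zip(...))`. This is a small sketch on plain lists; the three postal codes and their coordinates are taken from the first rows of the geo data shown above, and the variable names are illustrative.

```python
# first three rows of the geo data, written out as plain lists
postal_codes = ['M1B', 'M1C', 'M1E']
latitudes = [43.806686, 43.784535, 43.763573]
longitudes = [-79.194353, -79.160497, -79.188711]

# pair each postal code with its value and turn the pairs into a dictionary
lat_by_code = dict(zip(postal_codes, latitudes))
long_by_code = dict(zip(postal_codes, longitudes))

print(lat_by_code['M1B'], long_by_code['M1B'])
```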


In [100]:
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government


Match each postal code in the dataframe to its dictionary entry and write the values into two new columns:

In [104]:
# look up each postal code in the dictionaries; a code missing from them would give NaN
df.insert(3, "Latitude", df["PostalCode"].map(lat_d))
df.insert(4, "Longitude", df["PostalCode"].map(long_d))
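
Joining the coordinates onto the dataframe can also be written as a pandas merge, without going through dictionaries. The sketch below works on two tiny stand-in dataframes (`df_demo` and `geo_demo` are illustrative names; the values are taken from the tables in this notebook) and assumes the only difference between the key columns is the name, 'Postal Code' versus 'PostalCode'.

```python
import pandas as pd

# small stand-ins for the scraped table and the coordinates file
df_demo = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A'],
    'Borough': ['North York', 'North York'],
    'Neighborhood': ['Parkwoods', 'Victoria Village'],
})
geo_demo = pd.DataFrame({
    'Postal Code': ['M4A', 'M3A'],
    'Latitude': [43.725882, 43.753259],
    'Longitude': [-79.315572, -79.329656],
})

# rename the key column so both frames share it, then keep every row of df_demo
merged = df_demo.merge(
    geo_demo.rename(columns={'Postal Code': 'PostalCode'}),
    on='PostalCode',
    how='left',
)
print(merged)
```

A left merge keeps the row order of the first dataframe and leaves NaN in the coordinate columns for any postal code that is missing from the coordinates file.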
    



##### Print out the dataframe with the latitude and longitude values added
_______________________

In [107]:
df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills North,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937


___________
# END OF PART 2
_______________