# Segmenting and Clustering Neighborhoods in Toronto

# Part 1

## Web Scraping the Data
This section of the notebook scrapes the table of postal codes from the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M and transforms the data into a pandas dataframe.

In [5]:
! pip install bs4

Collecting bs4
  Downloading bs4-0.0.1.tar.gz (1.1 kB)
Building wheels for collected packages: bs4
  Building wheel for bs4 (setup.py) ... done
  Created wheel for bs4: filename=bs4-0.0.1-py3-none-any.whl size=1272 sha256=6ad900a0c7bef27fc37d578638b222b8d8589e55f8eaa1f06a1cc37c7854ebee
  Stored in directory: /tmp/wsuser/.cache/pip/wheels/0a/9e/ba/20e5bbc1afef3a491f0b3bb74d508f99403aabe76eda2167ca
Successfully built bs4
Installing collected packages: bs4
Successfully installed bs4-0.0.1


In [8]:
# importing libraries
import requests
import pandas as pd
from bs4 import BeautifulSoup
import numpy as np

print("required libraries imported")

required libraries imported


In [211]:
# send the GET request
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
result = requests.get(url).text

# pass the data into beautifulsoup
soup = BeautifulSoup(result, 'lxml')

In [212]:
# create column names for the data frame
column_names = ['Postal Code', 'Borough', 'Neighborhood']

df = pd.DataFrame(columns=column_names)

In [213]:
# collect the table rows from the soup object, then build the data frame
# (DataFrame.append was removed in pandas 2.0, so rows are gathered in a list first)
rows = []
for row in soup.find('table').find_all('tr'):
    cells = row.find_all('td')
    if len(cells) > 0:
        rows.append({'Postal Code': cells[0].text.rstrip('\n'),
                     'Borough': cells[1].text.rstrip('\n'),
                     'Neighborhood': cells[2].text.rstrip('\n')})

df = pd.DataFrame(rows, columns=column_names)
df.head(5)

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
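
The row loop above can be tried offline on a small inline table. The sketch below is illustrative only: the HTML string is a hand-written stand-in rather than the real Wikipedia markup, `parse_rows` is a hypothetical helper name, and it assumes Python's built-in `html.parser` backend in place of `lxml` so that no extra parser is required.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the Wikipedia table (an assumption, not the real markup)
SAMPLE_HTML = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A
</td><td>Not assigned
</td><td>Not assigned
</td></tr>
  <tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""


def parse_rows(html):
    """Return one dict per table row that contains <td> cells."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for row in soup.find('table').find_all('tr'):
        cells = row.find_all('td')
        # the header row holds only <th> cells, so it has no <td> and is skipped
        if len(cells) > 0:
            rows.append({
                'Postal Code': cells[0].text.rstrip('\n'),
                'Borough': cells[1].text.rstrip('\n'),
                'Neighborhood': cells[2].text.rstrip('\n'),
            })
    return rows


sample_rows = parse_rows(SAMPLE_HTML)
print(sample_rows)
```

The `len(cells) > 0` check is what keeps the header row out of the result, and `rstrip('\n')` removes the trailing newline that each cell's text carries.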


In [214]:
# drop rows whose borough is 'Not assigned'
df = df[df.Borough != 'Not assigned'].reset_index(drop=True)
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [215]:
# where the neighborhood is 'Not assigned', use the borough name instead
df['Neighborhood'] = np.where(df['Neighborhood'] == 'Not assigned', df['Borough'], df['Neighborhood'])

In [216]:
# pd.set_option('display.max_rows', df.shape[0]+1)
df1 = df.groupby(['Postal Code','Borough'])['Neighborhood'].apply(', '.join).reset_index()
df1.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [217]:
df.describe()

Unnamed: 0,Postal Code,Borough,Neighborhood
count,103,103,103
unique,103,10,99
top,M1G,North York,Downsview
freq,1,24,4


### Steps for data cleaning
* The Python bs4 package was used to scrape the data from the website.
* The scraped data was then converted into a pandas data frame, and rows whose Borough was "Not assigned" were removed.
* Where a neighbourhood was "Not assigned", the Borough name was used as the neighbourhood.
* Some postal codes have more than one neighbourhood, so these were combined into a single row with the neighbourhoods separated by commas.
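
The cleaning cells above can be replayed end to end on a tiny data frame. This is a minimal sketch with made-up rows; `M9Z`, `Alpha` and `Beta` are hypothetical values chosen only so that one postal code appears on two rows.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; M9Z is a hypothetical code listed twice
raw = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M7A', 'M9Z', 'M9Z'],
    'Borough': ['Not assigned', 'North York', "Queen's Park", 'Etobicoke', 'Etobicoke'],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Not assigned', 'Alpha', 'Beta'],
})

# 1. drop rows without a borough
clean = raw[raw['Borough'] != 'Not assigned'].reset_index(drop=True)

# 2. fall back to the borough name when the neighborhood is not assigned
clean['Neighborhood'] = np.where(clean['Neighborhood'] == 'Not assigned',
                                 clean['Borough'], clean['Neighborhood'])

# 3. one row per postal code, with its neighborhoods joined by commas
grouped = (clean.groupby(['Postal Code', 'Borough'])['Neighborhood']
                .apply(', '.join)
                .reset_index())
print(grouped)
```

The five input rows collapse to three: the unassigned borough is dropped, `M7A` takes its borough name as its neighborhood, and the two `M9Z` rows merge into one.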

In [218]:
df1.shape

(103, 3)

# Part 2
## Adding Longitude and Latitudes to the dataframe

In [200]:
# The geocoder Python package can be used to get the latitude and longitude coordinates of each neighborhood.
!pip install geocoder

Collecting geocoder
  Downloading geocoder-1.38.1-py2.py3-none-any.whl (98 kB)
     |████████████████████████████████| 98 kB 9.3 MB/s
Collecting ratelim
  Downloading ratelim-0.1.6-py2.py3-none-any.whl (4.0 kB)
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6


In [219]:
import pandas as pd
import numpy as np
import geocoder
print("Imported!")

Imported!


In [220]:
def get_lat_long(postal_code):
    # initialize your variable to None
    lat_lng_coords = None

    # loop until you get the coordinates
    # (note: this never exits if the service keeps returning None)
    while lat_lng_coords is None:
        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng

    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]
    return latitude, longitude

The problem with this package is that you sometimes have to be persistent in order to get the geographical coordinates of a given postal code.
A call for the latitude and longitude of a postal code can return None, and the same call made again can then return the coordinates.
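
Because a call can keep returning None, it may be safer to cap the number of retries. The following is a sketch under stated assumptions, not the notebook's method: `get_lat_long_bounded`, `max_attempts`, `delay_seconds` and `flaky_lookup` are hypothetical names, and the geocoding call is passed in as a plain function so that the example runs without any network access.

```python
import time


def get_lat_long_bounded(postal_code, lookup, max_attempts=5, delay_seconds=1.0):
    """Call lookup(query) until it returns coordinates or the attempts run out.

    lookup is any callable returning a [latitude, longitude] pair or None;
    it stands in for a geocoding call such as the one in get_lat_long.
    """
    query = '{}, Toronto, Ontario'.format(postal_code)
    for attempt in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords[0], coords[1]
        if attempt < max_attempts - 1:
            time.sleep(delay_seconds)  # brief pause before the next try
    # give up and let the caller decide what to do with missing coordinates
    return None, None


# Stand-in lookup that fails twice before succeeding (made-up behaviour)
calls = {'count': 0}


def flaky_lookup(query):
    calls['count'] += 1
    if calls['count'] < 3:
        return None
    return [43.806686, -79.194353]


lat, lng = get_lat_long_bounded('M1B', flaky_lookup, delay_seconds=0)
print(lat, lng, calls['count'])
```

With the geocoder package, the `lookup` argument could be something like `lambda q: geocoder.google(q).latlng`, which mirrors the call used in `get_lat_long`.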

In [205]:
!wget http://cocl.us/Geospatial_data

--2021-01-25 00:34:48--  http://cocl.us/Geospatial_data
Resolving cocl.us (cocl.us)... 169.63.96.176, 169.63.96.194
Connecting to cocl.us (cocl.us)|169.63.96.176|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://cocl.us/Geospatial_data [following]
--2021-01-25 00:34:48--  https://cocl.us/Geospatial_data
Connecting to cocl.us (cocl.us)|169.63.96.176|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2021-01-25 00:34:49--  https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Resolving ibm.box.com (ibm.box.com)... 107.152.26.197
Connecting to ibm.box.com (ibm.box.com)|107.152.26.197|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2021-01-25 00:34:49--  https://ibm.box.com/public/static/9afzr83p

In [223]:
# Using Geospatial csv data
df2 = pd.read_csv('Geospatial_data')
df2.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Merge the data frames from Part 1 and Part 2 on the postal code.

In [226]:
df3 = pd.merge(df1, df2, on='Postal Code')
df3.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
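
By default `pd.merge` performs an inner join, so a postal code missing from either frame would be dropped without any message. One way to check for that, sketched here on made-up frames (`M9Z` is a hypothetical code with no coordinates, and the frame names are illustrative), is a left join with the `indicator` and `validate` options.

```python
import pandas as pd

# Made-up frames; 'M9Z' is a hypothetical code with no coordinates
neighborhoods = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C', 'M9Z'],
    'Borough': ['Scarborough', 'Scarborough', 'Etobicoke'],
})
coordinates = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# left join keeps every neighborhood row; indicator adds a '_merge' column;
# validate raises an error if a postal code is duplicated in either frame
checked = pd.merge(neighborhoods, coordinates, on='Postal Code',
                   how='left', indicator=True, validate='one_to_one')

# rows marked 'left_only' had no match in the coordinates frame
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Postal Code'].tolist()
print(unmatched)
```

An empty `unmatched` list would mean every postal code from Part 1 found its coordinates.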
