# Applied Data Science Capstone: Week 3: Segmenting and Clustering Neighborhoods in Toronto

In [1]:
import pandas as pd
import numpy as np
import requests

## Submission 1 of 3 - Task: "Use pandas, or the BeautifulSoup package, or any other way you are comfortable with to transform the data in the table on the Wikipedia page into the above pandas dataframe."
This notebook contains Python code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M.

We store the HTML table of postal codes in a DataFrame.
The DataFrame consists of three columns: 'Postal Code', 'Borough', and 'Neighbourhood'.
If a cell has a borough but a 'Not assigned' neighborhood, then the neighborhood is set to the borough.
At the end of this submission, we print the .shape attribute of the DataFrame to display its number of rows and columns.
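
The 'Not assigned' rule can be sketched on a small frame. This is a minimal illustration, not the code used in this notebook: the name `sample_df` and all of its rows are made up and do not come from the Wikipedia table.

```python
import pandas as pd

# Hypothetical sample rows (placeholders, not data from the Wikipedia table).
sample_df = pd.DataFrame({
    'Postal Code': ['X1A', 'X2A', 'X3A'],
    'Borough': ['Borough A', 'Borough B', 'Not assigned'],
    'Neighbourhood': ['Area 1', 'Not assigned', 'Not assigned'],
})

# Rows that have a borough but no assigned neighbourhood.
needs_fill = (
    (sample_df['Borough'] != 'Not assigned')
    & (sample_df['Neighbourhood'] == 'Not assigned')
)

# Copy the borough name into the neighbourhood column for those rows only.
sample_df.loc[needs_fill, 'Neighbourhood'] = sample_df.loc[needs_fill, 'Borough']
```

Only the second row changes here; the third row has no borough, so it is left alone and would be removed by the drop step instead.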

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
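response = requests.get(url)  # Note: pd.read_html(url) in the next cell downloads the page itself, so this response is not used.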

In [3]:
dfs = pd.read_html(url, header=0)

The table we would like to scrape for data is the first table of the webpage.

In [4]:
table_df = dfs[0]

We use pandas to transform the .html table into a DataFrame with three (3) columns.

In [5]:
table_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Postal Code    180 non-null    object
 1   Borough        180 non-null    object
 2   Neighbourhood  180 non-null    object
dtypes: object(3)
memory usage: 4.3+ KB


In [6]:
table_df

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


We drop every row of the DataFrame whose 'Borough' column has the value 'Not assigned'.

In [7]:
table_df_drop_not_assigned = table_df.copy() # Work on an independent copy so that later edits cannot change table_df.
table_df_drop_not_assigned

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


In [8]:
table_df_drop_not_assigned[['Borough']] = table_df[['Borough']].replace(to_replace='Not assigned', value=np.nan, inplace=False)
table_df_drop_not_assigned

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,,Not assigned
1,M2A,,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,,Not assigned
176,M6Z,,Not assigned
177,M7Z,,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


In [9]:
table_df_drop_not_assigned = table_df_drop_not_assigned.dropna(axis='index', how='any', inplace=False)
table_df_drop_not_assigned

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."
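
The replace-then-dropna steps above could probably be collapsed into a single boolean filter. The sketch below assumes a made-up frame named `sample_df` with placeholder rows; it is one possible alternative, not the approach used in this notebook.

```python
import pandas as pd

# Hypothetical sample rows (placeholders, not data from the Wikipedia table).
sample_df = pd.DataFrame({
    'Postal Code': ['X1A', 'X2A', 'X3A'],
    'Borough': ['Not assigned', 'Borough A', 'Borough B'],
    'Neighbourhood': ['Not assigned', 'Area 1', 'Area 2'],
})

# Keep only the rows whose borough is assigned, then renumber the index.
assigned_df = sample_df[sample_df['Borough'] != 'Not assigned'].reset_index(drop=True)
```

One difference to keep in mind: `dropna(how='any')` also removes rows with missing values in any other column, whereas this filter looks at 'Borough' only.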


We can print a list of the unique values in the column Neighbourhood.

In [10]:
pd.unique(table_df_drop_not_assigned['Neighbourhood'])

array(['Parkwoods', 'Victoria Village', 'Regent Park, Harbourfront',
       'Lawrence Manor, Lawrence Heights',
       "Queen's Park, Ontario Provincial Government",
       'Islington Avenue, Humber Valley Village', 'Malvern, Rouge',
       'Don Mills', 'Parkview Hill, Woodbine Gardens',
       'Garden District, Ryerson', 'Glencairn',
       'West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale',
       'Rouge Hill, Port Union, Highland Creek', 'Woodbine Heights',
       'St. James Town', 'Humewood-Cedarvale',
       'Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood',
       'Guildwood, Morningside, West Hill', 'The Beaches', 'Berczy Park',
       'Caledonia-Fairbanks', 'Woburn', 'Leaside', 'Central Bay Street',
       'Christie', 'Cedarbrae', 'Hillcrest Village',
       'Bathurst Manor, Wilson Heights, Downsview North',
       'Thorncliffe Park', 'Richmond, Adelaide, King',
       'Dufferin, Dovercourt Village', 'Scarborough Village',
       'Fairview, H

We check that 'Not assigned' does not occur as an entry in the Neighbourhood column, so no neighbourhood needs to be replaced by its borough.

In [11]:
'Not assigned' in pd.unique(table_df_drop_not_assigned['Neighbourhood'])

False

We would like to determine if more than one row of the DataFrame has the same postal code.

In [12]:
# This line returns the number of unique postal codes in the DataFrame.
print('Number of unique postal codes in table: {}'.format(len(pd.unique(table_df_drop_not_assigned['Postal Code']))))

Number of unique postal codes in table: 103


Since the number of unique postal codes (103) is equal to the number of rows of the DataFrame, we conclude that no two rows have the same postal code.
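
The same uniqueness check can also be made directly, without comparing counts by hand. This is a small sketch on a made-up Series named `codes`; the values are placeholders chosen so that one code repeats.

```python
import pandas as pd

# Hypothetical postal codes with one deliberate repeat ('X2A').
codes = pd.Series(['X1A', 'X2A', 'X3A', 'X2A'], name='Postal Code')

# True when at least one value occurs more than once.
has_duplicates = not codes.is_unique

# Every occurrence of a repeated value (keep=False marks all copies).
duplicate_rows = codes[codes.duplicated(keep=False)]
```

On a column with no repeats, `is_unique` is True and `duplicate_rows` is empty.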

In [13]:
# Sort by postal code and renumber the rows from 0.
postal_codes_df = (
    table_df_drop_not_assigned
    .sort_values(by='Postal Code')
    .reset_index(drop=True)
)
postal_codes_df

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ..."
101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest..."


In [14]:
postal_codes_df.shape

(103, 3)

## Submission 2 of 3 - Task: "Use the Geocoder package or the csv file to create the following dataframe:..."

In [15]:
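import geocoder # third-party package; install it first (e.g. pip install geocoder)

# Postal code to look up; here, the first one in postal_codes_df.
postal_code = postal_codes_df.loc[0, 'Postal Code']

# initialize your variable to None
lat_lng_coords = None

# loop until you get the coordinates
while lat_lng_coords is None:
    g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
    lat_lng_coords = g.latlng

latitude = lat_lng_coords[0]
longitude = lat_lng_coords[1]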

ModuleNotFoundError: No module named 'geocoder'
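
The loop above never terminates if the geocoder keeps returning no coordinates. A bounded retry is one way to guard against that. The sketch below is an assumption-laden illustration: `fetch_latlng`, `lookup`, `max_attempts`, and `delay_seconds` are hypothetical names, and the stand-in lookup returns fake coordinates so that the example runs without network access.

```python
import time

def fetch_latlng(postal_code, lookup, max_attempts=5, delay_seconds=0.0):
    """Call lookup(postal_code) until it returns coordinates or attempts run out."""
    for attempt in range(max_attempts):
        coords = lookup(postal_code)
        if coords is not None:
            return coords
        # Pause between attempts, but not after the final one.
        if attempt < max_attempts - 1:
            time.sleep(delay_seconds)
    return None  # Give up instead of looping forever.

# Stand-in lookup that fails twice before succeeding (fake coordinates).
calls = {'count': 0}

def flaky_lookup(postal_code):
    calls['count'] += 1
    return None if calls['count'] < 3 else [0.0, 0.0]

result = fetch_latlng('X1A', flaky_lookup)
```

With the geocoder package installed, `lookup` could wrap the call from the cell above, for example `lambda code: geocoder.google('{}, Toronto, Ontario'.format(code)).latlng`.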