# Toronto Neighborhood Segmentation

## Part 1: Toronto's Neighborhoods
In this part we retrieve basic information about the Toronto neighborhoods whose postal codes start with *M* and summarize it in a pandas DataFrame.
The data are scraped from the Wikipedia page: [https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M).

Steps:
- Downloading the Webpage Using Requests Library
- Parsing Webpage HTML Using BeautifulSoup
- Extracting Data and Building DataFrame

In [1]:
#import necessary packages
import pandas as pd 
import requests
from bs4 import BeautifulSoup

In [2]:
#download the Wikipedia page using requests
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
html_data = requests.get(url).text
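
A slightly more defensive variant of the download step is sketched below. The helper name `fetch_html` and the 10-second timeout are assumptions made for illustration; they are not part of the original notebook.

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its HTML text (hypothetical helper)."""
    # A timeout prevents the call from hanging indefinitely on a slow server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html(url)` would then take the place of `requests.get(url).text`, with the difference that a failed request surfaces as an exception rather than as unexpected HTML further down.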

In [3]:
#parse the HTML data using BeautifulSoup
soup = BeautifulSoup(html_data,"html5lib")

In [4]:
#get the page title
soup.title

<title>List of postal codes of Canada: M - Wikipedia</title>

Using BeautifulSoup, extract the table with the neighborhood data and store it in a DataFrame.

In [44]:
# collect one dict per table row, then build the dataframe in one step
# (DataFrame.append was removed in pandas 2.0)
rows = []

#extract the table, and extract data row by row, column by column
for row in soup.find("tbody").find_all("tr"):
    col = row.find_all("td")
    if col:
        rows.append({
            "PostalCode": col[0].text,
            "Borough": col[1].text,
            "Neighborhood": col[2].text})

toronto_neighborhoods = pd.DataFrame(rows, columns=[
    "PostalCode", 
    "Borough", 
    "Neighborhood"])

toronto_neighborhoods.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A\n,Not assigned\n,Not assigned\n
1,M2A\n,Not assigned\n,Not assigned\n
2,M3A\n,North York\n,Parkwoods\n
3,M4A\n,North York\n,Victoria Village\n
4,M5A\n,Downtown Toronto\n,"Regent Park, Harbourfront\n"
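
The row-by-row extraction can be tried offline on a small inline snippet. The sketch below is an illustration only: `SAMPLE_HTML` is made-up markup standing in for the Wikipedia table, `table_to_frame` is a hypothetical helper, and it uses Python's built-in `html.parser` instead of `html5lib`. It also strips the trailing newlines during extraction via `get_text(strip=True)`.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up markup that mimics the layout of the Wikipedia table;
# each "\n" becomes a real newline inside the cell text.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
    <tr><td>M1A\n</td><td>Not assigned\n</td><td>Not assigned\n</td></tr>
    <tr><td>M3A\n</td><td>North York\n</td><td>Parkwoods\n</td></tr>
  </tbody>
</table>
"""


def table_to_frame(html: str) -> pd.DataFrame:
    """Turn the first HTML table into a three-column DataFrame (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for row in soup.find("table").find_all("tr"):
        cells = row.find_all("td")
        if len(cells) < 3:
            continue  # the header row only has <th> cells
        # get_text(strip=True) removes surrounding whitespace, including "\n"
        postcode, borough, neighborhood = (c.get_text(strip=True) for c in cells[:3])
        records.append({
            "PostalCode": postcode,
            "Borough": borough,
            "Neighborhood": neighborhood,
        })
    return pd.DataFrame(records, columns=["PostalCode", "Borough", "Neighborhood"])


sample_df = table_to_frame(SAMPLE_HTML)
print(sample_df)
```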


Clean the dataframe:
- remove "\n"
- remove unassigned postal codes (rows with Borough="Not assigned")
- group Neighborhoods with the same postal code in the same row
- replace Neighborhood cells containing "Not assigned" with the name of the corresponding Borough
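
The four cleaning steps listed above can be sketched on a toy frame as follows. The rows in `raw` are invented for illustration (in particular, the duplicated postal code and the "Not assigned" neighborhood do not occur in the scraped data), and the variable names are assumptions.

```python
import pandas as pd

# Invented rows that exercise every cleaning step.
raw = pd.DataFrame({
    "PostalCode": ["M1A\n", "M5A\n", "M5A\n", "M7A\n"],
    "Borough": ["Not assigned\n", "Downtown Toronto\n",
                "Downtown Toronto\n", "Downtown Toronto\n"],
    "Neighborhood": ["Not assigned\n", "Regent Park\n",
                     "Harbourfront\n", "Not assigned\n"],
})

# 1. remove "\n" (and any other surrounding whitespace) from every column
clean = raw.apply(lambda col: col.str.strip())

# 2. drop rows whose Borough is "Not assigned"
clean = clean[clean["Borough"] != "Not assigned"].copy()

# 3. replace "Not assigned" neighborhoods with the Borough name
clean["Neighborhood"] = clean["Neighborhood"].where(
    clean["Neighborhood"] != "Not assigned", clean["Borough"])

# 4. put neighborhoods that share a postal code in a single row
clean = clean.groupby(["PostalCode", "Borough"], as_index=False)["Neighborhood"].agg(", ".join)

print(clean)
```

The replacement is done before the grouping so that a placeholder value never ends up inside a joined neighborhood string.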

In [45]:
#remove "\n"
toronto_neighborhoods=toronto_neighborhoods.replace(to_replace=r'\n', value='', regex=True)

In [46]:
#filter out unassigned postal codes
mask = toronto_neighborhoods['Borough']=="Not assigned"
toronto_neighborhoods = toronto_neighborhoods[~mask]
toronto_neighborhoods.reset_index(drop=True,inplace=True)
toronto_neighborhoods.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [47]:
# Check whether there are rows with duplicated postal code
duplicateDFRow = toronto_neighborhoods[toronto_neighborhoods.duplicated(['PostalCode'])]
print(duplicateDFRow)

Empty DataFrame
Columns: [PostalCode, Borough, Neighborhood]
Index: []


There are no rows with duplicated postal codes.

In [48]:
#Check whether there are rows with a "Not assigned" Neighborhood field
mask_nb = toronto_neighborhoods['Neighborhood']=="Not assigned"
df_nb = toronto_neighborhoods[mask_nb]
df_nb.head()

Unnamed: 0,PostalCode,Borough,Neighborhood


There are no rows with "Not assigned" Neighborhood field.

In [49]:
#print out the shape of the dataframe
print(f'Number of rows (unique postal codes with assigned borough) in the toronto_neighborhoods dataframe: {toronto_neighborhoods.shape[0]}.')

Number of rows (unique postal codes with assigned borough) in the toronto_neighborhoods dataframe: 103.


### Assumption:
For the remainder of the project we assume that we only need the postal codes shown in the picture below, i.e., only 12 rows:

![](https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/7JXaz3NNEeiMwApe4i-fLg_40e690ae0e927abda2d4bde7d94ed133_Screen-Shot-2018-06-18-at-7.17.57-PM.png?expiry=1615593600000&hmac=unqSqgkLjy999x2SlSPGTtwyQY3V-RE76_R0fAdH2IY)

Source: [https://www.coursera.org/learn/applied-data-science-capstone/peer/I1bDq/segmenting-and-clustering-neighborhoods-in-toronto/submit](https://www.coursera.org/learn/applied-data-science-capstone/peer/I1bDq/segmenting-and-clustering-neighborhoods-in-toronto/submit)

In [60]:
#select the same rows as shown in the assignment, in the same order, and put them in a new dataframe
postalcodes = ['M5G','M2H','M4B','M1J','M4G','M4M','M1R','M9V','M9L','M5V','M1B','M5A']
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
toronto_neighborhoods_xs = pd.concat(
    [toronto_neighborhoods[toronto_neighborhoods['PostalCode']==postalcode]
     for postalcode in postalcodes],
    ignore_index=True)
toronto_neighborhoods_xs.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M5G,Downtown Toronto,Central Bay Street
1,M2H,North York,Hillcrest Village
2,M4B,East York,"Parkview Hill, Woodbine Gardens"
3,M1J,Scarborough,Scarborough Village
4,M4G,East York,Leaside
5,M4M,East Toronto,Studio District
6,M1R,Scarborough,"Wexford, Maryvale"
7,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest..."
8,M9L,North York,Humber Summit
9,M5V,Downtown Toronto,"CN Tower, King and Spadina, Railway Lands, Har..."
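
An order-preserving selection like the one above can also be written without a loop, by indexing on the postal code. This is a sketch on a four-row excerpt of the data; the names `df`, `wanted`, and `subset` are assumptions.

```python
import pandas as pd

# Small excerpt of the neighborhood data shown earlier.
df = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M5A", "M5G"],
    "Borough": ["North York", "North York", "Downtown Toronto", "Downtown Toronto"],
    "Neighborhood": ["Parkwoods", "Victoria Village",
                     "Regent Park, Harbourfront", "Central Bay Street"],
})

# Postal codes in the order we want them to appear.
wanted = ["M5G", "M3A"]

# .loc with a list of labels returns the rows in the order of that list.
subset = df.set_index("PostalCode").loc[wanted].reset_index()
print(subset)
```

Unlike filtering with `isin`, which keeps the original row order, `.loc` follows the order of `wanted`. It raises a `KeyError` if a requested postal code is missing, which can be a useful early warning.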


## Part 2: Add latitude and longitude to the Toronto Neighborhoods DataFrame

We will retrieve latitude and longitude for each postal code from the following csv file: [http://cocl.us/Geospatial_data](http://cocl.us/Geospatial_data).

In [29]:
# Download the csv with the geospatial data
#use -L to follow redirects (https://www.unix.com/shell-programming-and-scripting/263133-how-get-content-webpage-curl-vs-wget.html)
!curl -o Geospatial_data.csv -L http://cocl.us/Geospatial_data/Geospatial_Coordinates.csv 

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   171  100   171    0     0    924      0 --:--:-- --:--:-- --:--:--   924
100   524    0   524    0     0    420      0 --:--:--  0:00:01 --:--:--  102k
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0
100     4    0     4    0     0      1      0 --:--:--  0:00:02 --:--:--     1
100  2891  100  2891    0     0   1040      0  0:00:02  0:00:02 --:--:-- 34011


In [30]:
# Read the data into a dataframe
geo_data = pd.read_csv("Geospatial_data.csv",delimiter=",")
geo_data.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [None]:
#rename the key column so that both dataframes call it 'PostalCode' and the inner join can use on="PostalCode"
geo_data.rename(columns={'Postal Code':'PostalCode'},errors="raise",inplace=True) #inplace=True modifies geo_data directly instead of returning a copy
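
Renaming is one option; `merge` can also join on differently named key columns through `left_on` and `right_on`. The sketch below uses two rows taken from the tables in this notebook, and the frame names `neighborhoods` and `coords` are assumptions.

```python
import pandas as pd

neighborhoods = pd.DataFrame({
    "PostalCode": ["M5G", "M2H"],
    "Borough": ["Downtown Toronto", "North York"],
    "Neighborhood": ["Central Bay Street", "Hillcrest Village"],
})

# The coordinates keep their original column name "Postal Code".
coords = pd.DataFrame({
    "Postal Code": ["M2H", "M5G"],
    "Latitude": [43.803762, 43.657952],
    "Longitude": [-79.363452, -79.387383],
})

merged = (
    neighborhoods
    .merge(coords, how="inner", left_on="PostalCode", right_on="Postal Code")
    .drop(columns="Postal Code")  # both key columns are kept, so drop the duplicate
)
print(merged)
```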

In [61]:
#perform the merge
toronto_neighborhoods_xs=toronto_neighborhoods_xs.merge(geo_data,how="inner",on="PostalCode")
toronto_neighborhoods_xs.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
1,M2H,North York,Hillcrest Village,43.803762,-79.363452
2,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
3,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
4,M4G,East York,Leaside,43.70906,-79.363452
5,M4M,East Toronto,Studio District,43.659526,-79.340923
6,M1R,Scarborough,"Wexford, Maryvale",43.750072,-79.295849
7,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",43.739416,-79.588437
8,M9L,North York,Humber Summit,43.756303,-79.565963
9,M5V,Downtown Toronto,"CN Tower, King and Spadina, Railway Lands, Har...",43.628947,-79.39442
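
Because an inner join silently drops rows without a match, it can be worth checking for unmatched postal codes. A minimal sketch using a left join with `indicator=True` follows; the code `M0X` is a made-up postal code without coordinates, and the frame names are assumptions.

```python
import pandas as pd

# "M0X" is an invented postal code that has no entry in the coordinates.
neighborhoods = pd.DataFrame({"PostalCode": ["M5G", "M2H", "M0X"]})
coords = pd.DataFrame({
    "PostalCode": ["M5G", "M2H"],
    "Latitude": [43.657952, 43.803762],
    "Longitude": [-79.387383, -79.363452],
})

# indicator=True adds a "_merge" column that records where each row came from.
checked = neighborhoods.merge(coords, how="left", on="PostalCode", indicator=True)

# Rows marked "left_only" had no matching coordinates.
missing = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print(missing)
```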


The above dataframe corresponds to the one shown in the assignment. See
[https://www.coursera.org/learn/applied-data-science-capstone/peer/I1bDq/segmenting-and-clustering-neighborhoods-in-toronto/submit](https://www.coursera.org/learn/applied-data-science-capstone/peer/I1bDq/segmenting-and-clustering-neighborhoods-in-toronto/submit).
