# Segmenting and Clustering Neighborhoods

## Part 1

### Toronto, Canada

This notebook is for the Coursera Module 9, Week 3 graded assignment.

#### Scrape data from Wiki

In [3]:
#install the requests library for Python
! pip3 install requests
print('Requests installed')

Requests installed


In [3]:
#import requests and set up URL
import requests

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = requests.get(url).text

#### Parse HTML code with Beautiful Soup

In [4]:
#install beautiful soup
! pip3 install beautifulsoup4



In [7]:
#import libraries BeautifulSoup, Pandas, and Numpy
from bs4 import BeautifulSoup #for parsing data
import pandas as pd #for creating dataframe
import numpy as np 

#create BeautifulSoup object to take HTML content scraped earlier as input (set up HTML parser)
soup = BeautifulSoup(page, 'html.parser')

From inspecting the wiki page, I identified that the data is stored in a table element with class="wikitable sortable jquery-tablesorter" and that its rows are stored in tr elements.

In [9]:
#use Beautiful Soup to find the table
results = soup.find('table')
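
Because soup.find('table') returns the first table in the document, the lookup depends on the page layout. The following is a minimal sketch of selecting the table by class instead. It runs on a small made-up HTML snippet rather than the live page, so the snippet and the names sample_html, sample_soup, and sample_table are assumptions for illustration. The jquery-tablesorter class is typically added by JavaScript in the browser, so it may not be present in the HTML that requests downloads; matching on wikitable alone sidesteps that.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the scraped page (not the real Wikipedia markup).
sample_html = """
<table class="infobox"><tr><td>ignore me</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# class_ matches any table whose class list contains 'wikitable'
sample_table = sample_soup.find('table', class_='wikitable')

# first data cell of the selected table
first_cell = sample_table.find('td').text.strip()
print(first_cell)
```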

Before collecting the table rows, I will set up the columns for the Pandas dataframe and call the dataframe to display the columns.

In [10]:
#create the table column names
column_names = ['PostalCode', 'Borough', 'Neighborhood']

#convert to dataframe
df = pd.DataFrame(columns=column_names)
df

Unnamed: 0,PostalCode,Borough,Neighborhood


From inspecting the page, the row data is stored in tr elements, and I need to extract them using .find_all()

In [12]:
table_rows = results.find_all('tr')

In [13]:
#loop through the rows and add to a table
for tr in table_rows:
    rows = []
    for td in tr.find_all('td'):
        rows.append(td.text.strip())
    if len(rows) == 3:
        df.loc[len(df)] = rows
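
Appending with df.loc[len(df)] grows the dataframe one row at a time. As an alternative sketch, the rows can be collected in a plain list and converted to a dataframe once at the end. The HTML below is made up for illustration, and sample_rows and sample_df are hypothetical names; the column names are the ones used above.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Made-up table with a header row and two data rows.
sample_html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

sample_rows = []
for tr in BeautifulSoup(sample_html, 'html.parser').find_all('tr'):
    cells = [td.text.strip() for td in tr.find_all('td')]
    if len(cells) == 3:  # the header row has only <th> cells, so it is skipped
        sample_rows.append(cells)

# build the dataframe in one step from the collected rows
sample_df = pd.DataFrame(sample_rows, columns=['PostalCode', 'Borough', 'Neighborhood'])
print(sample_df)
```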

In [14]:
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


#### Clean the data

The next step is to clean up the data. This involves:
- removing cells that do not have an assigned borough,
- removing duplicates for postal codes that have more than one neighborhood,
- assigning borough names as neighborhood names for cells that have borough names but a Not assigned neighborhood.
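
The three cleaning steps can be sketched end to end on a toy table. The rows below are made up so that each case occurs at least once; they are not taken from the scraped data, and the names toy, toy_clean, and toy_final are hypothetical. The borough fallback is applied before the duplicates are merged so that the joined neighborhood text never contains "Not assigned".

```python
import numpy as np
import pandas as pd

# Made-up rows covering each case; not taken from the scraped table.
toy = pd.DataFrame({
    'PostalCode':   ['M1A', 'M5A', 'M5A', 'M9A'],
    'Borough':      ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', 'Etobicoke'],
    'Neighborhood': ['Not assigned', 'Regent Park', 'Harbourfront', 'Not assigned'],
})

# 1. drop rows without an assigned borough (copy so the later assignment is safe)
toy_clean = toy[toy['Borough'] != 'Not assigned'].copy()

# 2. fall back to the borough name where the neighborhood is not assigned
toy_clean['Neighborhood'] = np.where(
    toy_clean['Neighborhood'] == 'Not assigned',
    toy_clean['Borough'],
    toy_clean['Neighborhood'],
)

# 3. keep one row per postal code, joining its neighborhoods with a comma
toy_final = (
    toy_clean.groupby(['PostalCode', 'Borough'], sort=False)['Neighborhood']
    .agg(', '.join)
    .reset_index()
)
print(toy_final)
```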

In [15]:
# remove cells with no assigned borough
df_v1 = df[df['Borough'] != 'Not assigned']

#view first 5 rows to validate that 'Not assigned' Boroughs were dropped
df_v1.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [16]:
# assign borough names as neighborhood names where cells have borough names but a Not assigned neighborhood
# (assign() returns a new dataframe, which avoids pandas' SettingWithCopyWarning on the filtered slice)
df_v1 = df_v1.assign(
    Neighborhood=np.where(df_v1['Neighborhood'] == 'Not assigned', df_v1['Borough'], df_v1['Neighborhood'])
)
df_v1


Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [17]:
# merge duplicate rows for postal codes
df_final = df_v1.groupby(['PostalCode', 'Borough'], sort=False).agg(','.join)
df_final.reset_index(inplace=True) #adjust index to start with 0

#read first five rows of the dataframe
df_final.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [18]:
# use the shape attribute to view the number of rows and columns in the final table

df_final.shape

(103, 3)

## Part 2

In [21]:
# set url path for downloading csv file with latitude and longitude data
geo_url = "https://cocl.us/Geospatial_data"

# import csv file into pandas dataframe and view first 5 rows
geo_df = pd.read_csv(geo_url)
geo_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [22]:
# rename Postal Code in geo_df to PostalCode
geo_df.columns = ['PostalCode', 'Latitude', 'Longitude']
geo_df.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
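
Assigning to geo_df.columns depends on the column order, and the join that follows assumes each postal code appears exactly once on each side. A hedged sketch of one way to make both assumptions explicit is shown below, using rename with a column mapping and the validate and indicator options of pd.merge. The toy frames reuse a few values from the outputs in this notebook, except for the postal code M9Z, which is invented to show an unmatched row; toy_left, toy_geo, checked, and unmatched are hypothetical names.

```python
import pandas as pd

# Left side: boroughs by postal code ('M9Z' is invented for illustration).
toy_left = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M9Z'],
    'Borough': ['North York', 'North York', 'Etobicoke'],
})

# Right side: coordinates, still carrying the original 'Postal Code' header.
toy_geo = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M1B'],
    'Latitude': [43.753259, 43.725882, 43.806686],
    'Longitude': [-79.329656, -79.315572, -79.194353],
})

# rename by name, so the result does not depend on column order
toy_geo = toy_geo.rename(columns={'Postal Code': 'PostalCode'})

# validate raises if a postal code is duplicated on either side;
# indicator adds a '_merge' column recording where each row came from
checked = pd.merge(
    toy_left, toy_geo,
    on='PostalCode', how='outer',
    validate='one_to_one', indicator=True,
)

# rows present in only one of the two tables
unmatched = checked[checked['_merge'] != 'both']
print(unmatched[['PostalCode', '_merge']])
```

With how='outer', unmatched rows are kept with missing values, so inspecting the _merge column is one way to confirm that nothing was left without coordinates.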


In [25]:
# merge df_final and geo_df on the shared PostalCode column
merged_df = pd.merge(df_final, geo_df, on='PostalCode', how='outer')
merged_df.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
