# Book 1: Scraping Wiki for Neighborhoods in Toronto, Canada

Peer-graded assignment for IBM Data Science Professional certification capstone course: Week 3 - Brian Vineyard

**Tools used**:
- [BeautifulSoup](https://https://beautiful-soup-4.readthedocs.io/en/latest/)
- [Foursquare API tool](https://foursquare.com)
- [http://cocl.us/Geospatial_data](http://cocl.us/Geospatial_data)

Toronto neighborhoods information is loaded using postal codes pulled from Wikipedia using BeautifulSoup, then analyzed using the Foursquare API tool.

This project uses the following Wiki page: [List of postal codes of Canada: M](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) 

### Steps to analyze Toronto Neighborhood Data

1. Load the wiki containing postal codes for Toronto beginning with the letter 'M' into a Pandas dataframe. BeautifulSoup is used to scrape the wiki for the postal data. 
2. The data is cleaned up for processing.
3. Load the Geospatial data from a .CSV file built from [http://cocl.us/Geospatial_data](http://cocl.us/Geospatial_data). This provides latitude and longitude parameters to use in Foursquare.
4. Build out the dataframe with the Toronto neighborhoods including the map coordinates.
5. Analyze and model the data to gain insights into neighborhood characteristics.

**This notebook is for phase 1, scraping the Wikipedia page using BeautifulSoup to build a Pandas dataframe of Toronto neighborhoods based on their postal code**.


In [1]:
# Import Pandas and Beautiful Soup libraries
import pandas as pd
from bs4 import BeautifulSoup
import requests

### Link to source data and parsing of Wiki website using BeautifulSoup

In [2]:
# Use Beautiful Soup to get the postal code data from the wiki page
sourcelink = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
source = requests.get(sourcelink).text
Postal_Info = BeautifulSoup(source, 'lxml')

### Set up Pandas dataframe to hold postal data

In [3]:
# Build dataframe with postal data
Column_Names = ['PostalCode','Borough','Neighborhood']
Postal_Data = pd.DataFrame(columns=Column_Names)

### Initialize variables for postal code, borough, and neighborhood, then loop through table rows to pull data

In [6]:
content = Postal_Info.find('div', class_='mw-parser-output')
postal_table = content.table.tbody
postal_code = 0
borough = 0
neighborhood = 0

for tr in postal_table.find_all('tr'):
    i = 0
    for td in tr.find_all('td'):
        if i == 0:
            postal_code = td.text
            i = i + 1
        elif i == 1:
            borough = td.text
            i = i + 1
        elif i == 2:
            neighborhood = td.text.strip('\n').replace(']','')
            Postal_Data = Postal_Data.append({'Postalcode': postal_code,'Borough': borough,'Neighborhood': neighborhood},ignore_index=True)

In [10]:
Postal_Data.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Postalcode
0,,Not assigned,Not assigned,M1A
1,,Not assigned,Not assigned,M2A
2,,North York,Parkwoods,M3A
3,,North York,Victoria Village,M4A
4,,Downtown Toronto,Harbourfront,M5A
5,,North York,Lawrence Heights,M6A
6,,North York,Lawrence Manor,M6A
7,,Downtown Toronto,Queen's Park,M7A
8,,Not assigned,Not assigned,M8A
9,,Queen's Park,Not assigned,M9A


## Data cleaning - remove boroughs with 'Not assigned',  0 values

In [11]:
Postal_Data = Postal_Data[Postal_Data.Borough!='Not assigned']
Postal_Data = Postal_Data[Postal_Data.Borough!= 0]
Postal_Data.reset_index(drop = True, inplace = True)
i = 0
for i in range(0,Postal_Data.shape[0]):
    if Postal_Data.iloc[i][2] == 'Not assigned':
        Postal_Data.iloc[i][2] = Postal_Data.iloc[i][1]
        i = i+1
df = Postal_Data.groupby(['Postalcode','Borough'])['Neighborhood'].apply(', '.join).reset_index()

In [12]:
df = df.dropna()
empty = 'Not assigned'
df = df[(df.Postalcode != empty ) & (df.Borough != empty) & (df.Neighborhood != empty)]   

### Get first 5 and last 5 rows of data from the postal dataframe

In [13]:
df.head()

Unnamed: 0,Postalcode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern, Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union, Highla..."
2,M1E,Scarborough,"Guildwood, Morningside, West Hill, Guildwood, ..."
3,M1G,Scarborough,"Woburn, Woburn"
4,M1H,Scarborough,"Cedarbrae, Cedarbrae"


In [14]:
df.tail()

Unnamed: 0,Postalcode,Borough,Neighborhood
98,M9N,York,"Weston, Weston"
99,M9P,Etobicoke,"Westmount, Westmount"
100,M9R,Etobicoke,"Kingsview Village, Martin Grove Gardens, Richv..."
101,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ..."
102,M9W,Etobicoke,"Northwest, Northwest"


### Group data by Postal Code and Borough

In [74]:
def neighborhood_list(grouped):
    return ', '.join(sorted(grouped['Neighborhood'].tolist()))
grp = df.groupby(['Postalcode', 'Borough'])
df2 = grp.apply(neighborhood_list).reset_index(name='Neighborhood')

In [75]:
print(df2.shape)
df2.head()

(103, 3)


Unnamed: 0,Postalcode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
