# Scraping Wikipedia For Neighborhood Info #
### Albert Olszewski ###
In this document, I gather information on Toronto neighborhoods by scraping a Wikipedia page, then perform a clustering analysis using a Foursquare plug-in.

Import necessary packages.

In [596]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
import csv

Create a BeautifulSoup object from the HTML source of the given Wikipedia page.

In [597]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source, 'lxml')

The following block of code cleans the data as follows:
1. Gather each individual row of the table, handling rows that have no data cells (this takes care of the header row, because its cells are enclosed in a different tag)
2. Remove rows that do not have an assigned borough
3. Use the borough as the neighborhood name for postal codes with unassigned neighborhoods
4. Create a dataframe and combine neighborhoods that share a postal code (groupby and join)
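
To make steps 1 to 3 easy to check by hand, here is a minimal sketch that runs them on a small hard-coded table instead of the live page. The markup, the placeholder values (`X1A`, `Borough B`, and so on) and the helper name `clean_rows` are assumptions made for illustration; they are not taken from the Wikipedia page. The built-in `html.parser` is used so the sketch does not depend on `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with placeholder values (not scraped data).
SAMPLE_HTML = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>X1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>X2A</td><td>Borough B</td><td>Hood One</td></tr>
  <tr><td>X3A</td><td>Borough C</td><td>Not assigned</td></tr>
</table>
"""


def clean_rows(html):
    """Return [postcode, borough, neighborhood] lists for assigned boroughs."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for tr in soup.find('table').find_all('tr'):
        cells = [td.get_text(strip=True) for td in tr.find_all('td')]
        if len(cells) < 3:
            continue  # step 1: the header row has <th> cells, so no <td> is found
        postcode, borough, neighborhood = cells[:3]
        if borough == 'Not assigned':
            continue  # step 2: drop postal codes without a borough
        if neighborhood == 'Not assigned':
            neighborhood = borough  # step 3: fall back to the borough name
        rows.append([postcode, borough, neighborhood])
    return rows


cleaned = clean_rows(SAMPLE_HTML)
print(cleaned)
```

Only the `X2A` and `X3A` rows survive, and the `X3A` row takes its borough as its neighborhood name.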

In [598]:
districts = []

# walk through every row of the first table on the page
for row in soup.find('table').find_all('tr'):
    cells = row.find_all('td')
    # the header row uses <th> cells, so it has no <td> entries; skip it
    if len(cells) < 3:
        continue
    postcode = cells[0].text.strip()
    borough = cells[1].text.strip()
    neighborhood = cells[2].text.strip()
    # compile data into a list
    districts.append([postcode, borough, neighborhood])

# get rid of postal codes not assigned to a borough
assigned_districts = [d for d in districts if d[1] != 'Not assigned']

# assign borough as neighborhood for unassigned neighborhoods
for d in assigned_districts:
    if d[2] == 'Not assigned':
        d[2] = d[1]
        



In [599]:
# creating dataframe
df = pd.DataFrame(data = assigned_districts, columns = ['Postal Code','Borough','Neighborhood'])
# joining neighborhoods with same postalcode
df = df.groupby(['Postal Code','Borough'])['Neighborhood'].apply(lambda x: ','.join(x.astype(str))).reset_index()

df.head()


| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge,Malvern |
| 1 | M1C | Scarborough | Highland Creek,Rouge Hill,Port Union |
| 2 | M1E | Scarborough | Guildwood,Morningside,West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
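
The groupby-and-join step can be hard to picture from the final table alone, so here is a small sketch of it on toy data. The rows and the names `sample` and `grouped` are hypothetical placeholders, and `.agg(','.join)` is used as a shorter equivalent of the `apply` with a lambda, assuming every neighborhood value is already a string.

```python
import pandas as pd

# Hypothetical rows, for illustration only: two share the postal code X1A.
sample = pd.DataFrame(
    [['X1A', 'Borough A', 'Hood One'],
     ['X1A', 'Borough A', 'Hood Two'],
     ['X2A', 'Borough B', 'Hood Three']],
    columns=['Postal Code', 'Borough', 'Neighborhood'],
)

# Collapse each (postal code, borough) pair into one row and
# join its neighborhood names with commas.
grouped = (
    sample.groupby(['Postal Code', 'Borough'])['Neighborhood']
    .agg(','.join)
    .reset_index()
)
print(grouped)
```

The two `X1A` rows collapse into a single row whose neighborhood is `Hood One,Hood Two`, while `X2A` is left unchanged.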


In [600]:
print('The final dataframe has the shape of: ', df.shape)

The final dataframe has the shape of:  (103, 3)
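
Beyond the shape, a few quick checks can confirm that the cleaning steps did what they were meant to. The sketch below is one possible set of checks, not part of the original analysis; the helper name `check_grouped` and the toy frame are assumptions, and the uniqueness check assumes each postal code belongs to a single borough.

```python
import pandas as pd


def check_grouped(frame):
    """Raise AssertionError if the frame does not look like the cleaned result."""
    # columns are the three expected ones, in order
    assert list(frame.columns) == ['Postal Code', 'Borough', 'Neighborhood']
    # after grouping, each postal code should appear exactly once
    assert frame['Postal Code'].is_unique
    # no 'Not assigned' placeholders should remain
    assert not (frame['Borough'] == 'Not assigned').any()
    assert not (frame['Neighborhood'] == 'Not assigned').any()
    return True


# Hypothetical frame standing in for df
toy = pd.DataFrame(
    [['X1A', 'Borough A', 'Hood One,Hood Two'],
     ['X2A', 'Borough B', 'Borough B']],
    columns=['Postal Code', 'Borough', 'Neighborhood'],
)
result = check_grouped(toy)
```

Calling `check_grouped(df)` on the real dataframe would run the same checks against the scraped data.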
