<a href="https://colab.research.google.com/github/eolus87/Coursera_Capstone/blob/master/Web_scraping_Toronto_neighbourhoods.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web scraping Toronto neighborhood data from Wikipedia
This notebook was developed for the Coursera course "Applied Data Science Capstone", following the instructions in [Segmenting and Clustering Neighborhoods in Toronto](https://www.coursera.org/learn/applied-data-science-capstone/peer/I1bDq/segmenting-and-clustering-neighborhoods-in-toronto/submit).

References used are:
- [How To Web Scrape Wikipedia Using Python, Urllib, Beautiful Soup and Pandas](https://simpleanalytical.com/how-to-web-scrape-wikipedia-python-urllib-beautiful-soup-pandas)

Nicolas Gutierrez  
UK, 17th May 2020

### Importing the required libraries
- [**urllib.request**](https://docs.python.org/3.0/library/urllib.request.html)
- [**BeautifulSoup**](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [**Pandas** (The ubiquitous)](https://pandas.pydata.org/)



In [1]:
import urllib.request
from bs4 import BeautifulSoup
import pandas as pd

print("Libraries needed imported!")

Libraries needed imported!


### Variable initialization, web request and data checking

In [0]:
# As indicated in the course project instructions, the URL is the following
url  = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
# Request the page
page = urllib.request.urlopen(url)
# Now it is Beautiful Soup's turn to parse the HTML we received
soup = BeautifulSoup(page, "lxml")
# The output of soup.prettify() is not displayed because it is too long;
# uncomment the next line to inspect it
#print(soup.prettify())
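
The call above sends Python's default User-Agent and sets no explicit timeout. Some sites, Wikipedia included, ask automated clients to identify themselves, so a request may be refused without a descriptive header. The following is a minimal sketch, not part of the original notebook, of building the request with an explicit header and timeout. The `USER_AGENT` string and the helper names `build_request` and `fetch_html` are hypothetical.

```python
import urllib.request

URL = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
# Hypothetical identifier; replace it with something describing your own script
USER_AGENT = "toronto-neighbourhoods-notebook/0.1 (educational use)"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request object carrying an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Download the page and return its raw bytes (performs a network call)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()


# Building the request does not touch the network, so it is safe to inspect
request = build_request(URL)
print(request.full_url)
print(request.get_header("User-agent"))
```

Under these assumptions, the parsing step would become `soup = BeautifulSoup(fetch_html(URL), "lxml")`.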

In [0]:
# Now that we have the HTML, we can have a look at it. The table we are interested in begins with '<table class="wikitable sortable">'.
all_tables = soup.find_all("table")
# all_tables is not displayed because the output is too long; uncomment the
# next line to inspect it
#all_tables

In [4]:
# Let's filter the obtained tables list and get the only table we want
table_wanted = soup.find("table", class_ = "wikitable sortable")
table_wanted

<table class="wikitable sortable">
<tbody><tr>
<th>Postal Code
</th>
<th>Borough
</th>
<th>Neighborhood
</th></tr>
<tr>
<td>M1A
</td>
<td>Not assigned
</td>
<td>
</td></tr>
<tr>
<td>M2A
</td>
<td>Not assigned
</td>
<td>
</td></tr>
<tr>
<td>M3A
</td>
<td>North York
</td>
<td>Parkwoods
</td></tr>
<tr>
<td>M4A
</td>
<td>North York
</td>
<td>Victoria Village
</td></tr>
<tr>
<td>M5A
</td>
<td>Downtown Toronto
</td>
<td>Regent Park, Harbourfront
</td></tr>
<tr>
<td>M6A
</td>
<td>North York
</td>
<td>Lawrence Manor, Lawrence Heights
</td></tr>
<tr>
<td>M7A
</td>
<td>Downtown Toronto
</td>
<td>Queen's Park, Ontario Provincial Government
</td></tr>
<tr>
<td>M8A
</td>
<td>Not assigned
</td>
<td>
</td></tr>
<tr>
<td>M9A
</td>
<td>Etobicoke
</td>
<td>Islington Avenue
</td></tr>
<tr>
<td>M1B
</td>
<td>Scarborough
</td>
<td>Malvern, Rouge
</td></tr>
<tr>
<td>M2B
</td>
<td>Not assigned
</td>
<td>
</td></tr>
<tr>
<td>M3B
</td>
<td>North York
</td>
<td>Don Mills
</td></tr>
<tr>
<td>M4B
</td>
<td>East Y
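
Passing `class_="wikitable sortable"` compares against the `class` attribute as one exact string, so the lookup would return `None` if the page ever listed the classes in a different order or added a third one. A minimal sketch of a more tolerant lookup uses a CSS selector instead. The HTML snippet below is made up for illustration and is not the Wikipedia markup; `"html.parser"` is used only so the sketch runs without the lxml package.

```python
from bs4 import BeautifulSoup

# Made-up markup: same two classes as the Wikipedia table, in a different order
html = """
<table class="sortable wikitable">
  <tr><th>Postal Code</th></tr>
  <tr><td>M3A</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact-string match on the class attribute: fails when the order changes
exact_match = soup.find("table", class_="wikitable sortable")

# CSS selector: requires both classes to be present, in any order
css_match = soup.select_one("table.wikitable.sortable")

print(exact_match is None)
print(css_match is not None)
```

With the real page, `soup.select_one("table.wikitable.sortable")` could stand in for the `soup.find(...)` call, assuming the table keeps both classes.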

### Extracting and cleaning data
First we will find the table headers. They all have the format "\<th>Header\</th>". (This step is just for fun, because we assign the column names manually afterwards.)

In [5]:
# Initialize the list that will contain the headers
table_headers = []

# Find all headers in the table
thheaders = table_wanted.find_all('th')

# Add every header to table_headers. find(string=True) returns the first text
# node inside the tag, separating the text from the HTML syntax. That text ends
# with a newline character ('\n'), so we drop it by slicing the string with [0:-1]
for header in thheaders:
  table_headers.append(header.find(string=True)[0:-1])

# Printing the headers
print(table_headers)

['Postal Code', 'Borough', 'Neighborhood']
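
Slicing with `[0:-1]` relies on every cell ending with exactly one newline character. As a hedged alternative, `get_text(strip=True)` removes surrounding whitespace only when it is present. The sketch below compares the two on a made-up header row in which one cell has no trailing newline.

```python
from bs4 import BeautifulSoup

# Made-up header row: the first cell ends with a newline, the second does not
html = "<table><tr><th>Postal Code\n</th><th>Borough</th></tr></table>"
soup = BeautifulSoup(html, "html.parser")
cells = soup.find_all("th")

# Slicing off the last character assumes a trailing newline is always present
sliced = [cell.get_text()[0:-1] for cell in cells]

# get_text(strip=True) strips surrounding whitespace only when it exists
stripped = [cell.get_text(strip=True) for cell in cells]

print(sliced)    # the second header loses its final letter
print(stripped)  # both headers are intact
```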


Next, the table data. Every row in the table is enclosed in "\<tr>DataRow\</tr>", every field within a row is enclosed in "\<td>DataField\</td>", and each row has three fields.

In [6]:
# Initializing the lists that will contain the data
postalcode   = []
borough      = []
neighborhood = []

# Locate every field in the data (each string is sliced again to remove the
# trailing newline character). The header row has no <td> cells, so it is skipped
for row in table_wanted.find_all('tr'):
  row_data = row.find_all('td')
  if len(row_data) == 3:
    postalcode.append(row_data[0].find(string=True)[0:-1])
    borough.append(row_data[1].find(string=True)[0:-1])
    neighborhood.append(row_data[2].find(string=True)[0:-1])

# Printing the results of the fields
print("Content of every column:")
print("Postal_code: {}".format(postalcode))
print("Borough: {}".format(borough))
print("Neighborhood: {}\n".format(neighborhood))

# Printing the length of every column
print("Size of every column:")
print(len(postalcode))
print(len(borough))
print(len(neighborhood))

Content of every column:
Postal_code: ['M1A', 'M2A', 'M3A', 'M4A', 'M5A', 'M6A', 'M7A', 'M8A', 'M9A', 'M1B', 'M2B', 'M3B', 'M4B', 'M5B', 'M6B', 'M7B', 'M8B', 'M9B', 'M1C', 'M2C', 'M3C', 'M4C', 'M5C', 'M6C', 'M7C', 'M8C', 'M9C', 'M1E', 'M2E', 'M3E', 'M4E', 'M5E', 'M6E', 'M7E', 'M8E', 'M9E', 'M1G', 'M2G', 'M3G', 'M4G', 'M5G', 'M6G', 'M7G', 'M8G', 'M9G', 'M1H', 'M2H', 'M3H', 'M4H', 'M5H', 'M6H', 'M7H', 'M8H', 'M9H', 'M1J', 'M2J', 'M3J', 'M4J', 'M5J', 'M6J', 'M7J', 'M8J', 'M9J', 'M1K', 'M2K', 'M3K', 'M4K', 'M5K', 'M6K', 'M7K', 'M8K', 'M9K', 'M1L', 'M2L', 'M3L', 'M4L', 'M5L', 'M6L', 'M7L', 'M8L', 'M9L', 'M1M', 'M2M', 'M3M', 'M4M', 'M5M', 'M6M', 'M7M', 'M8M', 'M9M', 'M1N', 'M2N', 'M3N', 'M4N', 'M5N', 'M6N', 'M7N', 'M8N', 'M9N', 'M1P', 'M2P', 'M3P', 'M4P', 'M5P', 'M6P', 'M7P', 'M8P', 'M9P', 'M1R', 'M2R', 'M3R', 'M4R', 'M5R', 'M6R', 'M7R', 'M8R', 'M9R', 'M1S', 'M2S', 'M3S', 'M4S', 'M5S', 'M6S', 'M7S', 'M8S', 'M9S', 'M1T', 'M2T', 'M3T', 'M4T', 'M5T', 'M6T', 'M7T', 'M8T', 'M9T', 'M1V', 'M2V', 'M
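
The three parallel lists can also be filled in a single pass by collecting each row as a tuple and transposing the result. This is only a sketch of that idea on a made-up two-row table with the same three-column layout; it is not the notebook's own method.

```python
from bs4 import BeautifulSoup

# Made-up table with the same three-column layout as the Wikipedia one
html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for row in soup.find_all("tr"):
    cells = row.find_all("td")
    # The header row holds <th> tags only, so it yields no <td> cells
    if len(cells) == 3:
        rows.append(tuple(cell.get_text(strip=True) for cell in cells))

# zip(*rows) transposes the row tuples into one tuple per column
postalcode, borough, neighborhood = zip(*rows)

print(postalcode)
print(borough)
print(neighborhood)
```

Note that the unpacking on the `zip(*rows)` line raises a `ValueError` when `rows` is empty, so a real script would want to check that the table was found first.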

### DataFrame and final cleaning
Now we will create a pandas DataFrame and do the last stage of cleaning in pandas.

In [7]:
# Initializing the data frame
table_dataframe                 = pd.DataFrame(postalcode,columns=['PostalCode'])
# Adding the rest of columns
table_dataframe['Borough']      = borough
table_dataframe['Neighborhood'] = neighborhood
table_dataframe

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,
176,M6Z,Not assigned,
177,M7Z,Not assigned,
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


Next, the steps required by the assignment:
1. Ignore cells with a borough that is "Not assigned".
2. No PostalCode should be duplicated: combine the cells that share the same PostalCode.
3. If a cell has a borough but a "Not assigned" neighborhood, then the neighborhood will be the same as the borough.

In [8]:
#1. Removing (ignoring) rows with Borough "Not assigned"
table_dataframe = table_dataframe[table_dataframe['Borough']!="Not assigned"]
print("DataFrame shape: {}\n".format(table_dataframe.shape))

# Sanity check about datatypes
print(table_dataframe.dtypes)

#2. Joining Neighborhoods as indicated in the assignment instructions
print(table_dataframe[table_dataframe['PostalCode'].isin(['M5A'])])
table_dataframe = table_dataframe.groupby(['PostalCode','Borough'])['Neighborhood'].apply(lambda x: ', '.join(x)).reset_index()
print(table_dataframe[table_dataframe['PostalCode'].isin(['M5A'])])
# NOTE: PostalCodes are not duplicated in the Wikipedia page. The table is
# already grouped by PostalCode, so the group-by step is not strictly needed

#3. Finding and filling empty neighborhoods
# .at is used because chained indexing such as df['Neighborhood'][i] = ...
# may write to a temporary copy instead of the DataFrame itself
for i in table_dataframe.index:
  if table_dataframe.at[i, 'Neighborhood'] == '':
    print(table_dataframe.at[i, 'Borough'])
    table_dataframe.at[i, 'Neighborhood'] = table_dataframe.at[i, 'Borough']
# NOTE: There were no changes in step 3. It seems that empty neighborhoods are
# linked to "Not assigned" Boroughs, so this step was already covered by step 1.

DataFrame shape: (103, 3)

PostalCode      object
Borough         object
Neighborhood    object
dtype: object
  PostalCode           Borough               Neighborhood
4        M5A  Downtown Toronto  Regent Park, Harbourfront
   PostalCode           Borough               Neighborhood
53        M5A  Downtown Toronto  Regent Park, Harbourfront
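
Because the scraped table triggers neither the merge in step 2 nor the fill in step 3, the three steps are easier to verify on a small example where each case does occur. The sketch below uses made-up rows (not taken from the Wikipedia page) and a vectorized fill in place of the loop. It treats an empty string as the unassigned neighborhood, as the notebook does.

```python
import pandas as pd

# Made-up rows covering the three cases named in the assignment
df = pd.DataFrame({
    "PostalCode": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["", "Regent Park", "Harbourfront", ""],
})

# 1. Drop rows whose Borough is "Not assigned"
df = df[df["Borough"] != "Not assigned"]

# 2. Merge rows sharing a PostalCode into one comma-separated Neighborhood
df = (
    df.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)

# 3. Where Neighborhood is empty, fall back to the Borough name
empty = df["Neighborhood"] == ""
df.loc[empty, "Neighborhood"] = df.loc[empty, "Borough"]

print(df)
```

The boolean mask with `.loc` assigns to the DataFrame directly, which avoids the chained-indexing pitfall of writing to a temporary copy.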


Finally, we print the DataFrame and its shape.

In [9]:
#Printing the dataframe
table_dataframe

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ..."
101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest..."


In [10]:
# Printing the shape of the dataframe
table_dataframe.shape

(103, 3)