# Web Scraping Wikipedia Tables using BeautifulSoup and Python

## IBM Data Science Specialization
##### Project Capstone.
### Foursquare. Toronto Neighborhoods Segmentation and Classification
**Article in Medium: Introduction to Web Scraping with BeautifulSoup**
https://towardsdatascience.com/introduction-to-web-scraping-with-beautifulsoup-e87a06c2b857

In [1]:
import pandas as pd
import numpy as np

In [2]:
# The `requests` library lets you send HTTP/1.1 requests with very little code.
# It is optional here: the cells below use `urllib.request` from the standard library.
#!conda install -c anaconda requests --yes

In [3]:
# importing libraries
from bs4 import BeautifulSoup
import urllib.request
import re

In [4]:
url="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

In [5]:
# Connect to the website using urllib.
page = urllib.request.urlopen(url)  # connect to the website

In [6]:
try:
    req = urllib.request.urlopen(url)
except Exception as err:
    # Avoid a bare `except`, and report what actually went wrong.
    print("An error occurred:", err)
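The request-and-fallback step above can be sketched as a small helper. This is only an illustration: the name `fetch_html`, the 10-second `timeout`, and the choice to return `None` on failure are assumptions, not part of the original notebook. The demonstration uses `file://` URLs on a temporary file so that it runs without network access.

```python
import pathlib
import tempfile
import urllib.error
import urllib.request


def fetch_html(url, timeout=10):
    """Return the decoded body of `url`, or None if the request fails.

    Hypothetical helper; the timeout value is an arbitrary assumption.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except (urllib.error.URLError, ValueError) as err:
        # URLError covers unreachable hosts and missing files,
        # ValueError covers malformed URLs.
        print(f"Could not fetch {url}: {err}")
        return None


# Offline demonstration: urlopen also accepts file:// URLs.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample = pathlib.Path(tmp_dir) / "sample.html"
    sample.write_text("<table><tr><th>Postcode</th></tr></table>", encoding="utf-8")

    local_html = fetch_html(sample.as_uri())
    missing_html = fetch_html((pathlib.Path(tmp_dir) / "missing.html").as_uri())
```

With the real page, the call would be `fetch_html(url)`, followed by a check for `None` before parsing.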

In [7]:
article = req.read().decode()
# The Wikipedia page holds the postal codes in an HTML table,
# which can be scraped quite easily as follows.

with open('ISO_3166-1_alpha-2.html', 'w') as fo:
    fo.write(article)

### Load and parse the Wikipedia page
Then load it and parse it with Beautiful Soup. Extract all the `<table>` tags and search for the one with the headings corresponding to the data we want.

In [8]:
from bs4 import BeautifulSoup

# Load article, turn into soup and get the <table>s.
article = open('ISO_3166-1_alpha-2.html').read()
soup = BeautifulSoup(article, 'html.parser')
table_str = soup.find_all('table', class_='sortable')
len(table_str)

1

The variable **`table_str`** looks like:

`
[<table class="wikitable sortable">
 <tbody><tr>
 <th>Postcode</th>
 <th>Borough</th>
 <th>Neighbourhood
 </th></tr>
 <tr>
 <td>M1A</td>
 <td>Not assigned</td>
 <td>Not assigned
 </td></tr>
 ...
 <tr>
 <td>M3B</td>
 <td><a href="/wiki/North_York" title="North York">North York</a></td>
 <td>Don Mills North
 </td></tr>
 ...
 <tr>
 <td>M5A</td>
 <td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
 <td><a href="/wiki/Harbourfront_(Toronto)" title="Harbourfront (Toronto)">Harbourfront</a>
 </td></tr>
 ...
 ...
 ...
 <tr>
 <td>M9Z</td>
 <td>Not assigned</td>
 <td>Not assigned
 </td></tr>
 </tbody></table>]`
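As a minimal offline sketch of the lookup above: `class_='sortable'` matches any table whose `class` attribute includes `sortable`, so a table marked `wikitable sortable` qualifies. The HTML snippet and the names `sample_html`, `sortable_tables` and `first_header` are made up for illustration and are much smaller than the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for the downloaded page.
sample_html = """
<table class="infobox"><tr><td>ignored</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# class_ is compared against each CSS class separately, so the table with
# class="wikitable sortable" is found and the infobox table is not.
sortable_tables = soup.find_all("table", class_="sortable")
first_header = sortable_tables[0].find("th").text.strip()
print(len(sortable_tables), first_header)
```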

### Find specific elements in the page, i.e. the contents of the table's cells
The BeautifulSoup object can now be used to find elements in the HTML. Inspecting the page shows that every column heading of the table sits in a `<th>` element, so we can use BeautifulSoup’s `find_all` method to collect all of them.

In [9]:
for table in table_str:
    ths = table.find_all('th')
    heading = [th.text.strip() for th in ths]

print('Data Frame columns: ', heading)

Data Frame columns:  ['Postcode', 'Borough', 'Neighbourhood']


Inspecting the page also shows that every row of the table is a `<tr>` element, and that within each row the **Postcode**, **Borough** and **Neighbourhood** values sit in `<td>` elements. We can again use BeautifulSoup’s `find_all` method to collect all of them.

In [30]:
for table in table_str:
    tds = table.find_all('td')
    body = [td.text.strip() for td in tds]

# Display the first 7 elements of body
body[0:7]

['M1A',
 'Not assigned',
 'Not assigned',
 'M2A',
 'Not assigned',
 'Not assigned',
 'M3A']
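The flat list above loses the row boundaries. As an alternative sketch (not the approach used in this notebook), the cells can be collected row by row by walking the `<tr>` elements first. The HTML snippet and the names `sample_html` and `rows` are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical excerpt shaped like the Wikipedia table.
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
  </th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
  </td></tr>
  <tr><td>M3A</td><td><a href="/wiki/North_York">North York</a></td><td>Parkwoods
  </td></tr>
</table>
"""

table = BeautifulSoup(sample_html, "html.parser").find("table")

rows = []
for tr in table.find_all("tr"):
    # .text drops nested tags such as <a>, .strip() removes the trailing newline.
    cells = [td.text.strip() for td in tr.find_all("td")]
    if cells:  # the header row has only <th> cells, so it is skipped
        rows.append(cells)

print(rows)
```

Each inner list is then one table row, which can be passed to `pd.DataFrame(rows, columns=heading)` without any index arithmetic.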

### Convert variable `body` into a Pandas DataFrame: `my_table`

Assign the column names and fill the DataFrame with the contents of `body`.

In [12]:
my_table=pd.DataFrame([], columns=heading)
my_table['Postcode']=body[0::3]
my_table['Borough']=body[1::3]
my_table['Neighbourhood']=body[2::3]
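The slices `body[0::3]`, `body[1::3]` and `body[2::3]` work because the flat list repeats the three columns in order. A small sketch with a hand-written list shows the pattern; the names `n_cols`, `columns` and `sample_table` are assumptions for illustration.

```python
import pandas as pd

heading = ["Postcode", "Borough", "Neighbourhood"]

# Flat list in row-major order, as produced by find_all('td').
body = ["M1A", "Not assigned", "Not assigned",
        "M3A", "North York", "Parkwoods",
        "M5A", "Downtown Toronto", "Harbourfront"]

n_cols = len(heading)

# Column i is every n_cols-th element of the flat list, starting at offset i.
columns = {name: body[i::n_cols] for i, name in enumerate(heading)}
sample_table = pd.DataFrame(columns, columns=heading)
print(sample_table)
```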

Sort the table by `Borough` and display the first 6 rows.

In [13]:
my_table.sort_values(by='Borough').head(6)

Unnamed: 0,Postcode,Borough,Neighbourhood
211,M4V,Central Toronto,Deer Park
158,M5P,Central Toronto,Forest Hill West
197,M4T,Central Toronto,Summerhill East
168,M4R,Central Toronto,North Toronto West
157,M5P,Central Toronto,Forest Hill North
156,M4P,Central Toronto,Davisville North


### Unique values of `Borough` and their count
We can see that `Borough` contains the value `'Not assigned'`.

In [14]:
print('*** Unique values (Boroughs):\n',  my_table.Borough.unique())
print('\n*** Nr of Unique values(Boroughs): ', my_table.Borough.nunique())

*** Unique values (Boroughs):
 ['Not assigned' 'North York' 'Downtown Toronto' "Queen's Park" 'Etobicoke'
 'Scarborough' 'East York' 'York' 'East Toronto' 'West Toronto'
 'Central Toronto' 'Mississauga']

*** Nr of Unique values(Boroughs):  12


### Drop rows whose `Borough` is `Not assigned`

In [15]:
# Filter my_table and discard all rows whose 'Borough' is equal to 'Not assigned'
my_table=my_table[my_table.Borough != 'Not assigned']
bor_names=my_table.Borough.unique()
print('*** Nr of Unique values(Boroughs): ', my_table.Borough.nunique())
print('\n*** Unique values (Boroughs): \n','\n', my_table.Borough.unique())

*** Nr of Unique values(Boroughs):  11

*** Unique values (Boroughs): 
 
 ['North York' 'Downtown Toronto' "Queen's Park" 'Etobicoke' 'Scarborough'
 'East York' 'York' 'East Toronto' 'West Toronto' 'Central Toronto'
 'Mississauga']


### Nr. of unique values and values of `Neighbourhood`

Show the number of unique neighbourhoods and the first 5 values.

In [16]:

unique_neig=my_table.Neighbourhood.unique()
print('\n*** Nr of Unique values(Neighbourhoods): ', my_table.Neighbourhood.nunique())
print('*** 5 first Unique values (Neighbourhoods):\n', my_table.Neighbourhood.unique()[0:5])


*** Nr of Unique values(Neighbourhoods):  209
*** 5 first Unique values (Neighbourhoods):
 ['Parkwoods' 'Victoria Village' 'Harbourfront' 'Regent Park'
 'Lawrence Heights']


### Replace `Neighbourhood = 'Not assigned'` with its `Borough` value

In [17]:
# Replace each 'Not assigned' Neighbourhood with the Borough of the same row
assigned = my_table.Neighbourhood != 'Not assigned'
my_table = my_table.assign(Neighbourhood=my_table.Neighbourhood.where(assigned, my_table.Borough))

In [18]:
my_table.head(8)

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Queen's Park
10,M9A,Etobicoke,Islington Avenue


Postcode `M7A` (row index 8): the Neighbourhood is now `Queen's Park`, the same as its Borough.
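The replacement can be checked on a small frame in which two different boroughs have an unassigned neighbourhood; each row should pick up its own borough. This is a sketch under assumptions: the `M8A` row is made up for the test and the names `sample` and `assigned` are illustrative.

```python
import pandas as pd

# Small test frame; the M8A row is hypothetical.
sample = pd.DataFrame({
    "Postcode": ["M5A", "M7A", "M8A"],
    "Borough": ["Downtown Toronto", "Queen's Park", "Etobicoke"],
    "Neighbourhood": ["Harbourfront", "Not assigned", "Not assigned"],
})

assigned = sample["Neighbourhood"] != "Not assigned"

# Series.where keeps the value where the condition is True and takes the
# replacement (here the Borough of the same row) where it is False.
sample = sample.assign(
    Neighbourhood=sample["Neighbourhood"].where(assigned, sample["Borough"])
)
print(sample)
```

Because the replacement is aligned by row, it stays correct even when several boroughs contain unassigned neighbourhoods.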

### Create dictionary `my_Dict` with all `'Postcode', 'Borough','Neighborhood'`

In [19]:
my_Dict = {}

# One entry per postcode, holding its borough and the list of its neighbourhoods
for pc, group in my_table.groupby('Postcode', sort=False):
    my_Dict[pc] = {'Postcode': pc,
                   'Borough': group.Borough.iloc[0],
                   'Neighborhood': list(group.Neighbourhood)}

#### Accessing elements of `my_Dict`
Examples:

In [20]:
print(my_Dict['M6A'])
print(my_Dict['M6A']['Neighborhood'])

{'Postcode': 'M6A', 'Borough': 'North York', 'Neighborhood': ['Lawrence Heights', 'Lawrence Manor']}
['Lawrence Heights', 'Lawrence Manor']


## Create pandas DataFrame table `df` from dictionary `my_Dict`

Use `', '.join` to turn each Python list of neighbourhoods into a single comma-separated string, so the brackets are not shown.

In [23]:
df=pd.DataFrame(my_Dict).T
df = df[['Postcode', 'Borough', 'Neighborhood']]
df=df.reset_index(drop=True)
my_table=df
df['Neighborhood'] = df.Neighborhood.apply(', '.join)
my_table.head()

Unnamed: 0,Postcode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront, Regent Park"
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Queen's Park,Queen's Park
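The same one-row-per-postcode table can also be produced without the intermediate dictionary. The sketch below is an alternative, not the method used above: it groups by `Postcode` and `Borough` and joins the neighbourhood names directly. The rows are taken from the table shown earlier; the names `sample` and `grouped` are assumptions.

```python
import pandas as pd

sample = pd.DataFrame({
    "Postcode": ["M3A", "M5A", "M5A", "M6A", "M6A"],
    "Borough": ["North York", "Downtown Toronto", "Downtown Toronto",
                "North York", "North York"],
    "Neighbourhood": ["Parkwoods", "Harbourfront", "Regent Park",
                      "Lawrence Heights", "Lawrence Manor"],
})

grouped = (
    sample.groupby(["Postcode", "Borough"], sort=False)["Neighbourhood"]
    .agg(", ".join)          # concatenate the names within each group
    .reset_index()           # turn the group keys back into columns
    .rename(columns={"Neighbourhood": "Neighborhood"})
)
print(grouped)
```

`sort=False` keeps the postcodes in their order of first appearance, which matches the order produced by the dictionary loop.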


# Solution `df`:

Show first 15 rows

In [26]:
df.head(15)

Unnamed: 0,Postcode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront, Regent Park"
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Queen's Park,Queen's Park
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Rouge, Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens, Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson, Garden District"
