# Web Scraping

## Web Scraping with BeautifulSoup4

Importing the necessary libraries.

In [1]:
from bs4 import BeautifulSoup
import requests

Sending an HTTP GET request to the specified URL and parsing the page's HTML content.

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_the_largest_population_centres_in_Canada"

page = requests.get(url)

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(page.text, "html.parser")

In [3]:
# print(soup)
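
The request above does not check whether the server actually returned the page. A minimal sketch of a more defensive fetch is shown here; the helper name `fetch_soup`, the `User-Agent` string, and the 10-second timeout are illustrative assumptions, not part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Fetch a page and return parsed HTML (hypothetical helper)."""
    # Assumption: a descriptive User-Agent; some sites reject anonymous clients.
    headers = {"User-Agent": "example-scraper/0.1"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")
```

Calling `fetch_soup(url)` would then stand in for the `requests.get` and `BeautifulSoup` lines, failing early if the download did not succeed.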

Finding the first table with the class 'wikitable sortable' in the parsed HTML.

In [4]:
soup.find('table', class_ = 'wikitable sortable')

<table class="wikitable sortable">
<tbody><tr>
<th>Rank
</th>
<th>Population centre<sup class="reference" id="cite_ref-2021census_5-0"><a href="#cite_note-2021census-5"><span class="cite-bracket">[</span>5<span class="cite-bracket">]</span></a></sup>
</th>
<th>Province<sup class="reference" id="cite_ref-2021census_5-1"><a href="#cite_note-2021census-5"><span class="cite-bracket">[</span>5<span class="cite-bracket">]</span></a></sup>
</th>
<th>Size group<sup class="reference" id="cite_ref-2021census_5-2"><a href="#cite_note-2021census-5"><span class="cite-bracket">[</span>5<span class="cite-bracket">]</span></a></sup>
</th>
<th>Population (2021)<sup class="reference" id="cite_ref-2021census_5-3"><a href="#cite_note-2021census-5"><span class="cite-bracket">[</span>5<span class="cite-bracket">]</span></a></sup>
</th>
<th>Population (2016)<sup class="reference" id="cite_ref-2021census_5-4"><a href="#cite_note-2021census-5"><span class="cite-bracket">[</span>5<span class="cite-bracket">]</s
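
Passing a string with a space to `class_` matches the exact value of the `class` attribute, so the order of the class names matters. The sketch below uses a small made-up HTML snippet (not the Wikipedia page) to contrast this with a CSS selector, which matches the classes in any order.

```python
from bs4 import BeautifulSoup

# Illustrative HTML: two tables with the same classes in different order.
html = """
<table class="sortable wikitable"><tr><th>A</th></tr></table>
<table class="wikitable sortable"><tr><th>B</th></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact string match on the class attribute: skips the first table.
exact = soup.find("table", class_="wikitable sortable")

# CSS selector: matches any table carrying both classes, in any order.
css = soup.select_one("table.wikitable.sortable")

print(exact.th.text, css.th.text)
```

If a page lists its classes in a different order or adds extra ones, the selector form is likely the more forgiving choice.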

Selecting the first table on the page by index, which here is the same population table found above.

In [5]:
table = soup.find_all("table")[0]

In [6]:
# print(table)
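
Picking a table by position breaks if the page gains another table ahead of it. One possible alternative, sketched here on made-up HTML, is to pick the table by a header it is expected to contain; the helper `find_table_with_header` is hypothetical.

```python
from bs4 import BeautifulSoup

# Illustrative HTML: an unrelated table precedes the one we want.
html = """
<table class="infobox"><tr><th>Note</th></tr></table>
<table class="wikitable sortable"><tr><th>Rank</th><th>Province</th></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")


def find_table_with_header(soup, header):
    """Return the first table that has a header cell equal to `header`, else None."""
    for candidate in soup.find_all("table"):
        headers = [th.get_text(strip=True) for th in candidate.find_all("th")]
        if header in headers:
            return candidate
    return None


table = find_table_with_header(soup, "Rank")
print(table["class"])
```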

Extracting all header cells from the selected table.

In [7]:
titles = table.find_all('th')

In [8]:
titles

[<th>Rank
 </th>,
 <th>Population centre<sup class="reference" id="cite_ref-2021census_5-0"><a href="#cite_note-2021census-5"><span class="cite-bracket">[</span>5<span class="cite-bracket">]</span></a></sup>
 </th>,
 <th>Province<sup class="reference" id="cite_ref-2021census_5-1"><a href="#cite_note-2021census-5"><span class="cite-bracket">[</span>5<span class="cite-bracket">]</span></a></sup>
 </th>,
 <th>Size group<sup class="reference" id="cite_ref-2021census_5-2"><a href="#cite_note-2021census-5"><span class="cite-bracket">[</span>5<span class="cite-bracket">]</span></a></sup>
 </th>,
 <th>Population (2021)<sup class="reference" id="cite_ref-2021census_5-3"><a href="#cite_note-2021census-5"><span class="cite-bracket">[</span>5<span class="cite-bracket">]</span></a></sup>
 </th>,
 <th>Population (2016)<sup class="reference" id="cite_ref-2021census_5-4"><a href="#cite_note-2021census-5"><span class="cite-bracket">[</span>5<span class="cite-bracket">]</span></a></sup>
 </th>,
 <th>Cha
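
Note that `find_all('th')` collects every header cell in the table, not only those in the top row. If a table also uses `th` for row labels, the header list would be longer than the number of data columns. A minimal sketch on made-up HTML shows the difference and one way to restrict the search to the first row.

```python
from bs4 import BeautifulSoup

# Illustrative HTML: the second row uses <th> as a row label.
html = """
<table>
<tr><th>Rank</th><th>City</th></tr>
<tr><th scope="row">1</th><td>Toronto</td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table")

# Every <th> in the table, including the row label.
all_headers = table.find_all("th")

# Only the <th> cells of the first row.
first_row_headers = table.find("tr").find_all("th")

print(len(all_headers), len(first_row_headers))
```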

Creating a list of table header titles.

In [9]:
table_titles = [title.text.strip() for title in titles]

print(table_titles)

['Rank', 'Population centre[5]', 'Province[5]', 'Size group[5]', 'Population (2021)[5]', 'Population (2016)[5]', 'Change[5]', 'Land area (km2)[5]', 'Population density (/km2)[5]']


Cleaning up the titles by using a regular expression to remove the bracketed citation markers (such as `[5]`).

In [10]:
import re

table_titles = [re.sub(r'\[.*?\]', '', title).strip() for title in table_titles]

print(table_titles)

['Rank', 'Population centre', 'Province', 'Size group', 'Population (2021)', 'Population (2016)', 'Change', 'Land area (km2)', 'Population density (/km2)']
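
An alternative to the regular expression is to drop the citation markup before reading the text. The sketch below, on a shortened made-up header row, removes each `<sup class="reference">` element with `decompose()`; it assumes the citations are always wrapped that way, as they are in the output shown earlier.

```python
from bs4 import BeautifulSoup

# Illustrative header row modelled on the markup above.
html = (
    "<table><tr><th>Rank\n</th>"
    '<th>Province<sup class="reference"><a href="#n">[5]</a></sup>\n</th>'
    "</tr></table>"
)
soup = BeautifulSoup(html, "html.parser")

headers = []
for th in soup.find_all("th"):
    # Remove citation superscripts from the tree so their text is not included.
    for sup in th.find_all("sup", class_="reference"):
        sup.decompose()
    headers.append(th.get_text(strip=True))

print(headers)
```

Unlike the regular expression, this leaves any legitimate square brackets in a header untouched.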


Creating an empty DataFrame with columns named after the cleaned table titles.

In [11]:
import pandas as pd

df = pd.DataFrame(columns=table_titles)

df

Unnamed: 0,Rank,Population centre,Province,Size group,Population (2021),Population (2016),Change,Land area (km2),Population density (/km2)


Extracting all table rows from the selected table.

In [12]:
column_data = table.find_all('tr')

Finally, iterating over each table row to extract the text of its `td` cells, then appending that list as a new row to the DataFrame. The header row contains no `td` cells, so it is skipped.

In [12]:
for row in column_data:
    row_data = row.find_all('td')
    
    if row_data:
        row_datas = [data.text.strip() for data in row_data]
        
        length = len(df)
        df.loc[length] = row_datas
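
Appending with `df.loc[len(df)]` grows the DataFrame one row at a time. A common alternative, sketched here on a tiny made-up table, is to collect the rows in a plain list and build the DataFrame once at the end.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Illustrative table with the same shape as the scraped one.
html = """
<table>
<tr><th>Rank</th><th>Population centre</th></tr>
<tr><td>1</td><td>Toronto</td></tr>
<tr><td>2</td><td>Montreal</td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table")
columns = [th.get_text(strip=True) for th in table.find_all("th")]

rows = []
for tr in table.find_all("tr"):
    cells = tr.find_all("td")
    if cells:  # skip the header row, which has no <td> cells
        rows.append([cell.get_text(strip=True) for cell in cells])

# Build the DataFrame in a single step from the collected rows.
df = pd.DataFrame(rows, columns=columns)
print(df)
```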

In [13]:
df

Unnamed: 0,Rank,Population centre,Province,Size group,Population (2021),Population (2016),Change,Land area (km2),Population density (/km2)
0,1,Toronto,Ontario,Large urban,5647656,5433590,+3.9%,1829.05,3087.8
1,2,Montreal,Quebec,Large urban,3675219,3528651,+4.2%,1382.47,2658.4
2,3,Vancouver,British Columbia,Large urban,2426160,2268864,+6.9%,911.64,2661.3
3,4,Calgary,Alberta,Large urban,1305550,1240413,+5.3%,621.72,2099.9
4,5,Edmonton,Alberta,Large urban,1151635,1070998,+7.5%,627.2,1836.2
...,...,...,...,...,...,...,...,...,...
95,96,Parksville,British Columbia,Small,27330,25364,+7.8%,27.45,995.6
96,97,Keswick – Elmhurst Beach,Ontario,Small,27145,26999,+0.5%,16.56,1639.2
97,98,Fort Saskatchewan,Alberta,Small,26831,23944,+12.1%,21.85,1228.0
98,99,Bolton,Ontario,Small,26795,26378,+1.6%,20.71,1293.8
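
Every value scraped this way is a string, so numeric columns need converting before any arithmetic or sorting. The following is a rough sketch on made-up sample values; it assumes the text may contain thousands separators, a trailing `%`, or a Unicode minus sign, which may or may not be true of the actual page.

```python
import pandas as pd

# Illustrative sample values, not taken from the scraped table.
df = pd.DataFrame({
    "Population (2021)": ["5,647,656", "27330"],
    "Change": ["+3.9%", "\u22120.5%"],
    "Land area (km2)": ["1,829.05", "27.45"],
})

# Strip thousands separators, then convert to numbers.
for col in ["Population (2021)", "Land area (km2)"]:
    df[col] = pd.to_numeric(df[col].str.replace(",", "", regex=False))

# Drop the percent sign and normalise a Unicode minus to an ASCII hyphen.
df["Change"] = pd.to_numeric(
    df["Change"].str.rstrip("%").str.replace("\u2212", "-", regex=False)
)

print(df.dtypes)
```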
