![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Flesson-plans&branch=master&subPath=notebooks/webscraping/webscraping.ipynb&depth=1"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"/></a>

## Introduction to webscraping

In this short lesson, we discover how to collect data from tables in a webpage and put them into a Pandas dataframe for data analysis. 

Outline:

- introduction to the idea of webscraping
- explain the libraries "requests" and "beautifulsoup"
- a simple example: tables in Wikipedia (say, Canadian cities)
- a more complex example. Canadian parliament, baseball stats, Stats Canada, Kaggle?
- mention APIs as a possibility for data collection
- going further

Some useful resources online:
- https://realpython.com/beautiful-soup-web-scraper-python/
- https://github.com/psf/requests-html
- https://medium.com/geekculture/web-scraping-tables-in-python-using-beautiful-soup-8bbc31c5803e



## Installing libraries

We begin by installing and importing some important code libraries. First, the **requests** library, which handles the process of requesting raw data from a webpage. Next, the **beautifulsoup** library, which takes the raw data from the webpage and parses it into a form that is easily saved as a data frame. Finally, the **pandas** library, which provides the data frame itself.

Normally, we just use the **import** command to make the libraries accessible in our code. However, if the libraries are not pre-installed in our system, we need to run the following two **pip install** commands. To run them in the next cell, remove the # symbol which made those lines of code inactive. 

In [None]:
#pip install requests
#pip install beautifulsoup4

In [77]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

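If you are unsure whether a library is already available, one way to check before installing is sketched below. The helper name `is_installed` is our own invention, not part of any library; it relies on the standard `importlib.util.find_spec` function, which returns `None` when a top-level module cannot be found.

```python
from importlib.util import find_spec

def is_installed(module_name):
    """Return True if a top-level module can be found for import."""
    return find_spec(module_name) is not None

# "bs4" is the import name of the beautifulsoup4 package
for name in ["requests", "bs4", "pandas"]:
    status = "installed" if is_installed(name) else "missing (try pip install)"
    print(name, "->", status)
```
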
## First scrape: Getting a list of Canadian cities

We start with a URL pointing to Wikipedia, for its article on the list of Canadian cities. A GET request will grab the data from the webpage for us. We then use the BeautifulSoup library to convert the HTML code from the webpage into something we can use.


In [None]:
# URL for the Wikipedia webpage
url = 'https://en.wikipedia.org/wiki/List_of_cities_in_Canada'

# Send a GET request to the URL
response = requests.get(url)

# Parse the webpage from the response, and save as the data item "soup"
soup = BeautifulSoup(response.content, 'html.parser')

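To see what BeautifulSoup does without downloading anything, here is a minimal sketch that parses a small, made-up HTML string. The HTML content and the names `sample_html` and `sample_soup` are illustrative assumptions, not taken from the Wikipedia page.

```python
from bs4 import BeautifulSoup

# A tiny hand-written web page, standing in for response.content
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1>Cities</h1>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

page_title = sample_soup.title.text                        # text inside <title>
heading = sample_soup.find('h1').text                      # first <h1> element
paragraphs = [p.text for p in sample_soup.find_all('p')]   # every <p> element

print(page_title)
print(heading)
print(paragraphs)
```
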
## Looking at the response

We can now look at the information, using the "prettify" method on "soup." Since the result is quite long, let's just print out the first 1000 characters.

In [31]:
print(soup.prettify()[:1000]) 

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-client-prefs-pinned-disabled vector-feature-night-mode-disabled skin-night-mode-clientpref-0 vector-toc-available" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of cities in Canada - Wikipedia
  </title>
  <script>
   (function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-mai

## Grabbing a table

The webpage contains a number of tables. We use BeautifulSoup to identify each one, and save them all in a list. 

In [32]:
## Find the tables and count how many there are
tables = soup.find_all('table')
len(tables)

23

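Picking a table by its position works, but the position can change when the page is edited. As a possible alternative, `find_all` accepts a `class_` argument that filters elements by CSS class. The sketch below uses a made-up HTML fragment with three tables; the variable names are our own.

```python
from bs4 import BeautifulSoup

# Three small tables: two data tables and one navigation box (all invented)
sample_html = """
<table class="wikitable sortable"><tr><th>Capital</th></tr><tr><td>Ottawa</td></tr></table>
<table class="navbox"><tr><td>Navigation links</td></tr></table>
<table class="wikitable"><tr><th>City</th></tr><tr><td>Calgary</td></tr></table>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

all_tables = sample_soup.find_all('table')
# class_ matches any table whose class list includes "wikitable"
wiki_tables = sample_soup.find_all('table', class_='wikitable')

first_header = wiki_tables[0].find('th').text.strip()

print(len(all_tables), len(wiki_tables), first_header)
```
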
## Viewing a table

Let's take a look at the first table. The command "tables[0]" will print out the information in the first table, which we see is a list of capital cities. The table is made up of two columns, like this:

<div align="center">
<img src="images/capitals.jpg" alt="A table showing the capital cities" width="400"/><br>
The first table on the Wiki page, showing two columns: regions and capital cities.
</div>

Here is the command to display the data.

In [151]:
tables[0]

<table class="wikitable sortable">
<tbody><tr>
<th>Geographic area
</th>
<th>Capital
</th></tr>
<tr>
<td><b>Canada</b></td>
<td><b><a href="/wiki/Ottawa" title="Ottawa">Ottawa</a></b>
</td></tr>
<tr>
<td><a href="/wiki/Alberta" title="Alberta">Alberta</a></td>
<td><a href="/wiki/Edmonton" title="Edmonton">Edmonton</a>
</td></tr>
<tr>
<td><a href="/wiki/British_Columbia" title="British Columbia">British Columbia</a></td>
<td><a href="/wiki/Victoria,_British_Columbia" title="Victoria, British Columbia">Victoria</a>
</td></tr>
<tr>
<td><a href="/wiki/Manitoba" title="Manitoba">Manitoba</a></td>
<td><a href="/wiki/Winnipeg" title="Winnipeg">Winnipeg</a>
</td></tr>
<tr>
<td><a href="/wiki/New_Brunswick" title="New Brunswick">New Brunswick</a></td>
<td><a href="/wiki/Fredericton" title="Fredericton">Fredericton</a>
</td></tr>
<tr>
<td><a href="/wiki/Newfoundland_and_Labrador" title="Newfoundland and Labrador">Newfoundland and Labrador</a></td>
<td><a href="/wiki/St._John%27s,_Newfoundland_an

## Understanding the table data

We see the table consists of a number of items with tags like \<tr> or \<td> and some text data as well.

The \<tr> tag indicates a row in the table, while the \<td> tag marks a piece of data in that row.

We can write a simple loop to put this information into a list of dictionaries we call "contents" and then convert it to a data frame.



In [152]:
contents = []

for row in tables[0].tbody.find_all('tr'):    
    columns = row.find_all('td')
    
    if columns:  # rows without <td> cells (such as the header row) are skipped
        cell = {}
        cell['Region'] = columns[0].text.strip()
        cell['City']  = columns[1].text.strip()
        contents.append(cell)

contents

[{'Region': 'Canada', 'City': 'Ottawa'},
 {'Region': 'Alberta', 'City': 'Edmonton'},
 {'Region': 'British Columbia', 'City': 'Victoria'},
 {'Region': 'Manitoba', 'City': 'Winnipeg'},
 {'Region': 'New Brunswick', 'City': 'Fredericton'},
 {'Region': 'Newfoundland and Labrador', 'City': "St. John's"},
 {'Region': 'Nova Scotia', 'City': 'Halifax'},
 {'Region': 'Ontario', 'City': 'Toronto'},
 {'Region': 'Prince Edward Island', 'City': 'Charlottetown'},
 {'Region': 'Quebec', 'City': 'Quebec City'},
 {'Region': 'Saskatchewan', 'City': 'Regina'},
 {'Region': 'Northwest Territories', 'City': 'Yellowknife'},
 {'Region': 'Nunavut', 'City': 'Iqaluit'},
 {'Region': 'Yukon', 'City': 'Whitehorse'}]

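The loop above names each column by hand. As a rough sketch of a more general approach, the helper below reads the column names from the \<th> cells of the first row. The function name `table_to_records` and the sample HTML are our own assumptions, and the helper only suits simple tables with a single header row and no merged cells.

```python
from bs4 import BeautifulSoup

def table_to_records(table):
    """Turn a simple HTML table into a list of dictionaries, one per data row."""
    rows = table.find_all('tr')
    # Column names come from the <th> cells of the first row
    headers = [th.text.strip() for th in rows[0].find_all('th')]
    records = []
    for row in rows[1:]:
        cells = [td.text.strip() for td in row.find_all('td')]
        if len(cells) == len(headers):   # skip rows that do not match the header
            records.append(dict(zip(headers, cells)))
    return records

# A hand-written table in the same shape as the capitals table
sample_html = """
<table>
  <tr><th>Geographic area</th><th>Capital</th></tr>
  <tr><td>Alberta</td><td>Edmonton</td></tr>
  <tr><td>Manitoba</td><td>Winnipeg</td></tr>
</table>
"""
sample_table = BeautifulSoup(sample_html, 'html.parser').find('table')
records = table_to_records(sample_table)
print(records)
```

The resulting list can be passed to `pd.DataFrame` in the same way as "contents".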
## Viewing the data frame

Here we create the data frame and view it.

In [153]:
df1 = pd.DataFrame(contents)
df1

| | Region | City |
|---|---|---|
| 0 | Canada | Ottawa |
| 1 | Alberta | Edmonton |
| 2 | British Columbia | Victoria |
| 3 | Manitoba | Winnipeg |
| 4 | New Brunswick | Fredericton |
| 5 | Newfoundland and Labrador | St. John's |
| 6 | Nova Scotia | Halifax |
| 7 | Ontario | Toronto |
| 8 | Prince Edward Island | Charlottetown |
| 9 | Quebec | Quebec City |


## Viewing another table about Canadian cities

Let's take a look at the next table from the same Wikipedia webpage. The command "tables[1]" will print out the information in the second table, which is information about the cities in Alberta. It looks like this:

<div align="center">
<img src="images/alberta.jpg" alt="A table showing cities in Alberta" width="400"/><br>
The second table on the Wiki page, showing cities in Alberta.
</div>

Run the following cell to see the raw data. It is a bit long, though, so we just print the first 2000 characters to see the basic format of the data. 

In [159]:
print(str(tables[1])[:2000])

<table class="wikitable sortable">
<tbody><tr>
<th rowspan="2" scope="col">Name
</th>
<th rowspan="2" scope="col"><a class="mw-redirect" href="/wiki/List_of_regions_of_Alberta" title="List of regions of Alberta">Region</a>
</th>
<th rowspan="2" scope="col">Incorporation<br/>date (city)<sup class="reference" id="cite_ref-ABcityprofiles_3-0"><a href="#cite_note-ABcityprofiles-3">[3]</a></sup>
</th>
<th rowspan="2" scope="“col”">Council<br/>size<sup class="reference" id="cite_ref-ABcityprofiles_3-1"><a href="#cite_note-ABcityprofiles-3">[3]</a></sup>
</th>
<th colspan="5" scope="“col”"><a href="/wiki/2021_Canadian_census" title="2021 Canadian census">2021 Census of Population</a><sup class="reference" id="cite_ref-2021census_4-0"><a href="#cite_note-2021census-4">[4]</a></sup>
</th></tr>
<tr>
<th scope="“col”">Population<br/>(2021)
</th>
<th scope="“col”">Population<br/>(2016)
</th>
<th scope="“col”">Change<br/>(%)
</th>
<th scope="“col”">Land<br/>area<br/>(km<sup>2</sup>)
</th>
<th data-

## Creating another data frame

While we could use all the data from the table, we will just take the name, region, and population information. From the table shown above, we see this corresponds to columns 0, 1 and 4. 

The code to create the data frame is as follows:

In [168]:
contents = []

for row in tables[1].tbody.find_all('tr'):    
    columns = row.find_all('td')
    
    if columns:  # rows without <td> cells (such as the header rows) are skipped
        cell = {}
        cell['Name'] = columns[0].text.strip()
        cell['Region']  = columns[1].text.strip()
        cell['Population (2021)']  = columns[4].text.strip()
        contents.append(cell)

df1 = pd.DataFrame(contents)
df1

| | Name | Region | Population (2021) |
|---|---|---|---|
| 0 | Airdrie | Calgary Metro | 74100 |
| 1 | Beaumont[AB 1] | Edmonton Metro | 20888 |
| 2 | Brooks[AB 2] | Southern | 14924 |
| 3 | Calgary[AB 3] | Calgary Metro | 1306784 |
| 4 | Camrose | Central | 18772 |
| 5 | Chestermere[AB 4] | Calgary Metro | 22163 |
| 6 | Cold Lake | North | 15661 |
| 7 | Edmonton[AB 5] | Edmonton Metro | 1010899 |
| 8 | Fort Saskatchewan | Edmonton Metro | 27088 |
| 9 | Grande Prairie | Northern | 64141 |

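Some of the scraped names carry footnote markers such as "[AB 1]" from the Wikipedia page. One possible way to strip them is a regular expression applied to the column, sketched here on a small sample data frame built from three of the names above. The variable name `sample` is our own.

```python
import pandas as pd

sample = pd.DataFrame({'Name': ['Airdrie', 'Beaumont[AB 1]', 'Brooks[AB 2]']})

# Remove any bracketed footnote marker such as "[AB 1]", then trim spaces
sample['Name'] = (
    sample['Name']
    .str.replace(r'\[[^\]]*\]', '', regex=True)
    .str.strip()
)

print(sample['Name'].tolist())
```
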

## Describing the data

We can get a quick summary of the data by using the "describe" function on the data frame, as follows:

In [169]:
df1.describe()

| | Name | Region | Population (2021) |
|---|---|---|---|
| count | 20 | 20 | 20 |
| unique | 20 | 7 | 20 |
| top | Airdrie | Edmonton Metro | 74100 |
| freq | 1 | 6 | 1 |


## From strings to numbers

You may have noticed that the population data is being treated as strings, not as actual numbers, in the data frame. That is why "describe" reported counts and unique values rather than statistics.

We can use the following lines of code to convert these strings to numbers: first remove the commas from the text, then convert the text to integers.

In [170]:
df1['Population (2021)'] = df1['Population (2021)'].apply(lambda x: x.replace(',',''))
df1['Population (2021)'] = df1['Population (2021)'].apply(int)
df1

| | Name | Region | Population (2021) |
|---|---|---|---|
| 0 | Airdrie | Calgary Metro | 74100 |
| 1 | Beaumont[AB 1] | Edmonton Metro | 20888 |
| 2 | Brooks[AB 2] | Southern | 14924 |
| 3 | Calgary[AB 3] | Calgary Metro | 1306784 |
| 4 | Camrose | Central | 18772 |
| 5 | Chestermere[AB 4] | Calgary Metro | 22163 |
| 6 | Cold Lake | North | 15661 |
| 7 | Edmonton[AB 5] | Edmonton Metro | 1010899 |
| 8 | Fort Saskatchewan | Edmonton Metro | 27088 |
| 9 | Grande Prairie | Northern | 64141 |

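The conversion with `int` stops with an error if any cell holds something that is not a number. A more forgiving sketch, assuming we are happy to mark such cells as missing, uses `pd.to_numeric` with `errors='coerce'`. The sample values below are for illustration only; the 'n/a' entry is invented.

```python
import pandas as pd

raw = pd.Series(['74,100', '1,306,784', 'n/a'])

# Remove the thousands separators, treating the comma as plain text
cleaned = raw.str.replace(',', '', regex=False)

# Anything that cannot be read as a number becomes NaN instead of raising an error
numbers = pd.to_numeric(cleaned, errors='coerce')

print(numbers)
```
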

## Describing the numerical data

Now that the population column holds actual numbers, the "describe" function will give us some basic statistics about these numbers, such as the minimum, maximum, mean, and standard deviation.

In [166]:
df1.describe()

| | Population (2021) |
|---|---|
| count | 20.0 |
| mean | 302364.1 |
| std | 728640.8 |
| min | 12594.0 |
| 25% | 19497.25 |
| 50% | 35869.5 |
| 75% | 80176.5 |
| max | 3023641.0 |

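With numeric data, we can also summarize by group. As a small sketch, the code below totals the population by region using six of the rows shown earlier (footnote markers removed by hand). The variable names `sample` and `totals` are our own.

```python
import pandas as pd

sample = pd.DataFrame({
    'Name': ['Airdrie', 'Calgary', 'Chestermere',
             'Beaumont', 'Edmonton', 'Fort Saskatchewan'],
    'Region': ['Calgary Metro'] * 3 + ['Edmonton Metro'] * 3,
    'Population (2021)': [74100, 1306784, 22163, 20888, 1010899, 27088],
})

# Add up the population of the cities within each region
totals = sample.groupby('Region')['Population (2021)'].sum()

print(totals)
```
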

## A second scraping: Canadian Parliament webpage

For our second demonstration, let's go to a webpage that shows a list of the Members of Parliament who are speaking on a given day in the Canadian House of Commons. There is a table on this webpage, https://openparliament.ca/debates/, and we will examine one particular date, say February 29, 2024. The webpage looks like this:

<div align="center">
<img src="images/parliament.jpg" alt="The speakers in parliament webpage" width="400"/><br>
The openparliament.ca debates webpage, showing the speakers in Parliament.
</div>

Here is the code to grab the data, using the requests library, and beautifulsoup to parse the data. 



In [196]:
dateOfDebate = '2024/02/29/'

# The query string '?singlepage=1' gets all of the speakers
page = requests.get('https://openparliament.ca/debates/' + dateOfDebate + '?singlepage=1').text
data = BeautifulSoup(page, 'html.parser')

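To look at other dates, the URL can be built from a date object rather than typed by hand. This is a sketch only: the helper name `debate_url` is our own, and the URL pattern is simply the one used in the cell above.

```python
from datetime import date
from urllib.parse import urlencode

def debate_url(day):
    """Build the openparliament.ca debates URL for a given date."""
    base = 'https://openparliament.ca/debates/'
    query = urlencode({'singlepage': 1})   # same query string as in the cell above
    return base + day.strftime('%Y/%m/%d/') + '?' + query

url_feb_29 = debate_url(date(2024, 2, 29))
print(url_feb_29)
```
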
While we could print out this data and examine its structure, using the print(data) command, instead let's focus on a section that is informative, such as this one:

```
<div class="row statement_browser statement" data-floor="" data-hocid="12614875" data-url="/debates/2024/2/29/anita-anand-1/" id="sanita-anand-1">
<div class="l-ctx-col">
<noscript><p><a href="/debates/2024/2/29/anita-anand-1/only/">Permalink</a></p></noscript>
<p><strong class="statement_topic">Main Estimates, 2024-25</strong><span class="br"></span>Routine Proceedings</p>
<p>10:10 a.m.
				
				

				<p>Oakville
				<span class="br"></span>Ontario</p><p class="partytag"><span class="tag partytag_liberal">Liberal</span>
</p></p></div>
<div class="text-col">
<a href="/politicians/anita-anand/">
<img class="headshot_thumb" src="/media/CACHE/images/polpics/anita-anand/76708c03c398389aa08f038af639a182.jpg"/>
</a>
<p class="speaking">
<a href="/politicians/anita-anand/">
<span class="pol_name">Anita Anand</span>
</a> <span class="partytag tag partytag_liberal">Liberal</span><span class="pol_affil">President of the Treasury Board</span>
</p>
```

Here we see the identifier **class="row statement_browser statement"** which starts a new section with a new person speaking. 

The attribute **class="pol_name"** marks the politician's name, **class="partytag tag partytag_liberal"** identifies the party, and **class="pol_affil"** shows their affiliation.

We use these class attributes to build up a dictionary of unique names of people in the debate, track their party and affiliation, and also count how many times they speak.

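As a minimal sketch of how these class attributes are used, the code below looks up each piece of information in a shortened copy of the section shown above. The fragment is trimmed by hand for illustration, and the variable names are our own.

```python
from bs4 import BeautifulSoup

# A shortened version of one speaker's section
fragment = """
<div class="row statement_browser statement">
  <p class="speaking">
    <span class="pol_name">Anita Anand</span>
    <span class="partytag tag partytag_liberal">Liberal</span>
    <span class="pol_affil">President of the Treasury Board</span>
  </p>
</div>
"""
# class_ matches an element when the given name is one of its classes
statement = BeautifulSoup(fragment, 'html.parser').find('div', class_='statement')

name = statement.find('span', class_='pol_name').text.strip()
party = statement.find('span', class_='partytag').text.strip()
affiliation = statement.find('span', class_='pol_affil').text.strip()

print(name, '|', party, '|', affiliation)
```
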

In [186]:
debateDict = {'Name': [],
              'Party' : [],
              'Affiliation' : [],
              'Count' : [],
             }
for item in data.find_all("div", class_="row statement_browser statement"):
    try:  # getting the name of the speaker
        name = item.find('span', class_='pol_name').text
        name = str(name)
    except AttributeError:
        continue
    try:  # if they have spoken already, we do not find their party or affiliation
        index = debateDict['Name'].index(name)
        indexFound = True
    except ValueError:
        indexFound = False
        try:  # finding the affiliation
            affiliation = item.find('span', class_="pol_affil").text
            affiliation = str(affiliation).replace("\n","")
            affiliation = affiliation.replace("						", "")
        except AttributeError:
            affiliation = 'N/A'
        try:  # For speakers without party tags
            party = item.find('p', class_='partytag').text
            party = str(party).replace("\n","")
        except AttributeError:
            party = 'N/A'
    if indexFound:
        debateDict["Count"][index] = debateDict["Count"][index] + 1
    else:
        debateDict['Name'].append(name)
        debateDict['Party'].append(party)
        debateDict['Affiliation'].append(affiliation)
        debateDict['Count'].append(1)
 

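If only the number of times each person speaks is needed, a shorter alternative to the dictionary above is `collections.Counter` from the standard library. The sketch below runs on a small invented fragment, not the real debate page, and the variable names are our own.

```python
from collections import Counter
from bs4 import BeautifulSoup

# Four invented statements; the last one has no speaker name
fragment = """
<div class="statement"><span class="pol_name">Mark Holland</span></div>
<div class="statement"><span class="pol_name">Anita Anand</span></div>
<div class="statement"><span class="pol_name">Mark Holland</span></div>
<div class="statement"><p>No speaker name here</p></div>
"""
sample = BeautifulSoup(fragment, 'html.parser')

names = []
for item in sample.find_all('div', class_='statement'):
    tag = item.find('span', class_='pol_name')
    if tag is not None:          # skip statements without a named speaker
        names.append(tag.text.strip())

# Counter tallies how often each name appears in the list
speech_counts = Counter(names)
print(speech_counts.most_common())
```
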
From this dictionary, we create a dataframe that shows all the collected information. 

In [195]:
df2 = pd.DataFrame.from_dict(debateDict)
display(df2)

| | Name | Party | Affiliation | Count |
|---|---|---|---|---|
| 0 | The Speaker | Liberal | Greg Fergus | 16 |
| 1 | Anita Anand | Liberal | President of the Treasury Board | 3 |
| 2 | Kevin Lamoureux | Liberal | Parliamentary Secretary to the Leader of the G... | 17 |
| 3 | Mark Holland | Liberal | Minister of Health | 18 |
| 4 | Karen Vecchio | Conservative | Elgin—Middlesex—London, ON | 1 |
| ... | ... | ... | ... | ... |
| 90 | Taylor Bachrach | NDP | Skeena—Bulkley Valley, BC | 1 |
| 91 | Gabriel Ste-Marie | Bloc | Joliette, QC | 1 |
| 92 | Annie Koutrakis | Liberal | Parliamentary Secretary to the Minister of Tou... | 1 |
| 93 | Ryan Williams | Conservative | Bay of Quinte, ON | 2 |

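Once the data is in a data frame, a couple of lines are enough to rank the speakers or total the counts by party. The sketch below uses only the first five rows shown above, so its totals are for illustration and do not describe the whole debate. The variable names are our own.

```python
import pandas as pd

sample = pd.DataFrame({
    'Name': ['The Speaker', 'Anita Anand', 'Kevin Lamoureux',
             'Mark Holland', 'Karen Vecchio'],
    'Party': ['Liberal', 'Liberal', 'Liberal', 'Liberal', 'Conservative'],
    'Count': [16, 3, 17, 18, 1],
})

# The three people who spoke most often in this sample
most_active = sample.sort_values('Count', ascending=False).head(3)

# Total number of statements per party in this sample
by_party = sample.groupby('Party')['Count'].sum()

print(most_active)
print(by_party)
```
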

If you would like to see more done with this example, please check out this other Callysto notebook, 
[open-parliament.ipynb](https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Fcurriculum-notebooks&branch=master&urlpath=notebooks/curriculum-notebooks/SocialStudies/OpenParliament/open-parliament.ipynb&depth=1)


## More examples

-- work to follow

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)