## Some notes on web scraping

https://realpython.com/beautiful-soup-web-scraper-python/

https://github.com/psf/requests-html

Outline:

- introduction to the idea of web scraping
- explain the libraries "requests" and "beautifulsoup"
- a simple example: tables in Wikipedia, say, Canadian cities
- a more complex example: baseball stats, Statistics Canada?
- mention APIs as a possibility for data collection
- going further

In [None]:
#pip install requests
#pip install beautifulsoup4

In [77]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

## Getting a list of Canadian cities

We start with a URL pointing to Wikipedia's article listing Canadian cities. A GET request retrieves the contents of the webpage for us. We then use the BeautifulSoup library to parse the page's HTML into an object we can search and navigate.


In [None]:
# URL for the Wikipedia webpage
url = 'https://en.wikipedia.org/wiki/List_of_cities_in_Canada'

# Send a GET request to the URL
response = requests.get(url)

# Parse the webpage from the response, and save the result as the object "soup"
soup = BeautifulSoup(response.content, 'html.parser')
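
Network requests can fail or hang, so it may help to check the response before parsing it. The sketch below wraps the request in a small helper. The function name `fetch_html`, the 10-second timeout and the User-Agent string are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML text of a page, raising an error on a failed request.

    The helper name, default timeout and User-Agent value are assumptions
    chosen for illustration.
    """
    headers = {'User-Agent': 'webscraping-notes-example/0.1'}
    response = requests.get(url, headers=headers, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

Calling `fetch_html(url)` with the `url` defined above would return the page source as a string, ready to pass to BeautifulSoup.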

## Looking at the response

We can now look at the information, using the "prettify" method on "soup". Since the result is quite long, let's just print out the first 1000 characters.

In [31]:
print(soup.prettify()[:1000]) 

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-client-prefs-pinned-disabled vector-feature-night-mode-disabled skin-night-mode-clientpref-0 vector-toc-available" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of cities in Canada - Wikipedia
  </title>
  <script>
   (function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-mai
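
Instead of reading through the prettified text, individual elements can be pulled out by tag name. A minimal sketch, using a short hand-written HTML string (an assumption standing in for the real page) so it runs without a network connection:

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the downloaded page (hand-written for illustration)
sample_html = """
<html>
 <head><title>List of cities in Canada - Wikipedia</title></head>
 <body><h1>List of cities in Canada</h1></body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Access the <title> tag as an attribute of the soup and read its text
page_title = sample_soup.title.get_text(strip=True)
print(page_title)

# find() returns the first matching tag, or None if there is no match
heading = sample_soup.find('h1').get_text(strip=True)
print(heading)
```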

## Grabbing a table

The webpage contains a number of tables. We use BeautifulSoup to identify each one, and save them all in a list. 

In [32]:
## Find the tables and count how many there are
tables = soup.find_all('table')
len(tables)

23

## Viewing a table

Let's take a look at the first table. The command "tables[0]" will print out the information in the first table, which we see is a list of capital cities. The table is made up of two columns, like this:

<div align="center">
<img src="images/capitals.jpg" alt="A table showing the capital cities" width="400"/><br>
The first table on the Wiki page, showing two columns: regions and capital cities.
</div>

Here is the command to display the data.

In [151]:
tables[0]

<table class="wikitable sortable">
<tbody><tr>
<th>Geographic area
</th>
<th>Capital
</th></tr>
<tr>
<td><b>Canada</b></td>
<td><b><a href="/wiki/Ottawa" title="Ottawa">Ottawa</a></b>
</td></tr>
<tr>
<td><a href="/wiki/Alberta" title="Alberta">Alberta</a></td>
<td><a href="/wiki/Edmonton" title="Edmonton">Edmonton</a>
</td></tr>
<tr>
<td><a href="/wiki/British_Columbia" title="British Columbia">British Columbia</a></td>
<td><a href="/wiki/Victoria,_British_Columbia" title="Victoria, British Columbia">Victoria</a>
</td></tr>
<tr>
<td><a href="/wiki/Manitoba" title="Manitoba">Manitoba</a></td>
<td><a href="/wiki/Winnipeg" title="Winnipeg">Winnipeg</a>
</td></tr>
<tr>
<td><a href="/wiki/New_Brunswick" title="New Brunswick">New Brunswick</a></td>
<td><a href="/wiki/Fredericton" title="Fredericton">Fredericton</a>
</td></tr>
<tr>
<td><a href="/wiki/Newfoundland_and_Labrador" title="Newfoundland and Labrador">Newfoundland and Labrador</a></td>
<td><a href="/wiki/St._John%27s,_Newfoundland_an

## Understanding the table data

We see the table consists of a number of items with tags like \<tr> or \<td>, along with some text data.

The \<tr> tag indicates a row in the table, while each \<td> tag marks a piece of data in that row. Header cells use \<th> instead, so a row that contains only headers has no \<td> cells and is skipped by the loop below.

We can write a simple loop to put this information into a list we call "contents" and then convert it to a data frame.



In [152]:
contents = []

for row in tables[0].tbody.find_all('tr'):    
    columns = row.find_all('td')
    
    if columns:  # header rows have no <td> cells, so skip them
        cell = {}
        cell['Region'] = columns[0].text.strip()
        cell['City']  = columns[1].text.strip()
        contents.append(cell)

contents

[{'Region': 'Canada', 'City': 'Ottawa'},
 {'Region': 'Alberta', 'City': 'Edmonton'},
 {'Region': 'British Columbia', 'City': 'Victoria'},
 {'Region': 'Manitoba', 'City': 'Winnipeg'},
 {'Region': 'New Brunswick', 'City': 'Fredericton'},
 {'Region': 'Newfoundland and Labrador', 'City': "St. John's"},
 {'Region': 'Nova Scotia', 'City': 'Halifax'},
 {'Region': 'Ontario', 'City': 'Toronto'},
 {'Region': 'Prince Edward Island', 'City': 'Charlottetown'},
 {'Region': 'Quebec', 'City': 'Quebec City'},
 {'Region': 'Saskatchewan', 'City': 'Regina'},
 {'Region': 'Northwest Territories', 'City': 'Yellowknife'},
 {'Region': 'Nunavut', 'City': 'Iqaluit'},
 {'Region': 'Yukon', 'City': 'Whitehorse'}]
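
The loop above hard-codes the column names. A slightly more general sketch reads the names from the \<th> cells instead. The helper name `table_to_records` and the sample HTML are assumptions made for illustration, and the approach only suits simple tables with a single header row and no merged cells.

```python
from bs4 import BeautifulSoup


def table_to_records(table):
    """Convert a simple HTML table into a list of dictionaries.

    Assumes one header row of <th> cells and no rowspan/colspan.
    """
    headers = [th.get_text(strip=True) for th in table.find_all('th')]
    records = []
    for row in table.find_all('tr'):
        cells = row.find_all('td')
        if not cells:
            continue  # header rows contain no <td> cells
        values = [td.get_text(strip=True) for td in cells]
        records.append(dict(zip(headers, values)))
    return records


# Hand-written sample mirroring the structure of the capitals table
sample_html = """
<table>
 <tr><th>Geographic area</th><th>Capital</th></tr>
 <tr><td><b>Canada</b></td><td><b><a href="/wiki/Ottawa">Ottawa</a></b></td></tr>
 <tr><td><a href="/wiki/Alberta">Alberta</a></td><td><a href="/wiki/Edmonton">Edmonton</a></td></tr>
</table>
"""

sample_table = BeautifulSoup(sample_html, 'html.parser').find('table')
records = table_to_records(sample_table)
print(records)
```

The resulting list can be passed to `pd.DataFrame` in the same way as `contents`.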

## Viewing the data frame

Here we create the data frame and view it.

In [153]:
df1 = pd.DataFrame(contents)
df1

|   | Region | City |
|---|--------|------|
| 0 | Canada | Ottawa |
| 1 | Alberta | Edmonton |
| 2 | British Columbia | Victoria |
| 3 | Manitoba | Winnipeg |
| 4 | New Brunswick | Fredericton |
| 5 | Newfoundland and Labrador | St. John's |
| 6 | Nova Scotia | Halifax |
| 7 | Ontario | Toronto |
| 8 | Prince Edward Island | Charlottetown |
| 9 | Quebec | Quebec City |


## Viewing another table

Let's take a look at the next table. The command "tables[1]" will print out the information in the second table, which lists the cities in Alberta. It looks like this:

<div align="center">
<img src="images/alberta.jpg" alt="A table showing cities in Alberta" width="400"/><br>
The second table on the Wiki page, showing cities in Alberta.
</div>

Run the following cell to see the raw data. It is a bit long, though, so we just print the first 2000 characters to see the basic format of the data. 

In [159]:
print(str(tables[1])[:2000])

<table class="wikitable sortable">
<tbody><tr>
<th rowspan="2" scope="col">Name
</th>
<th rowspan="2" scope="col"><a class="mw-redirect" href="/wiki/List_of_regions_of_Alberta" title="List of regions of Alberta">Region</a>
</th>
<th rowspan="2" scope="col">Incorporation<br/>date (city)<sup class="reference" id="cite_ref-ABcityprofiles_3-0"><a href="#cite_note-ABcityprofiles-3">[3]</a></sup>
</th>
<th rowspan="2" scope="“col”">Council<br/>size<sup class="reference" id="cite_ref-ABcityprofiles_3-1"><a href="#cite_note-ABcityprofiles-3">[3]</a></sup>
</th>
<th colspan="5" scope="“col”"><a href="/wiki/2021_Canadian_census" title="2021 Canadian census">2021 Census of Population</a><sup class="reference" id="cite_ref-2021census_4-0"><a href="#cite_note-2021census-4">[4]</a></sup>
</th></tr>
<tr>
<th scope="“col”">Population<br/>(2021)
</th>
<th scope="“col”">Population<br/>(2016)
</th>
<th scope="“col”">Change<br/>(%)
</th>
<th scope="“col”">Land<br/>area<br/>(km<sup>2</sup>)
</th>
<th data-

## Creating another data frame

While we could use all the data from the table, we will just take the name, region, and population information. From the table shown above, we see this corresponds to columns 0, 1 and 4. 

The code to create the data frame is as follows:

In [168]:
contents = []

for row in tables[1].tbody.find_all('tr'):    
    columns = row.find_all('td')
    
    if columns:  # header rows have no <td> cells, so skip them
        cell = {}
        cell['Name'] = columns[0].text.strip()
        cell['Region']  = columns[1].text.strip()
        cell['Population (2021)']  = columns[4].text.strip()
        contents.append(cell)

df1 = pd.DataFrame(contents)
df1

|   | Name | Region | Population (2021) |
|---|------|--------|-------------------|
| 0 | Airdrie | Calgary Metro | 74100 |
| 1 | Beaumont[AB 1] | Edmonton Metro | 20888 |
| 2 | Brooks[AB 2] | Southern | 14924 |
| 3 | Calgary[AB 3] | Calgary Metro | 1306784 |
| 4 | Camrose | Central | 18772 |
| 5 | Chestermere[AB 4] | Calgary Metro | 22163 |
| 6 | Cold Lake | North | 15661 |
| 7 | Edmonton[AB 5] | Edmonton Metro | 1010899 |
| 8 | Fort Saskatchewan | Edmonton Metro | 27088 |
| 9 | Grande Prairie | Northern | 64141 |
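
Several names in the output above end with footnote markers such as `[AB 1]`, carried over from the Wikipedia table. One way to remove them is a regular expression that deletes any bracketed text. The helper name `strip_footnotes` is an assumption for illustration.

```python
import re


def strip_footnotes(text):
    """Remove bracketed footnote markers such as '[AB 1]' or '[3]'."""
    return re.sub(r'\[[^\]]*\]', '', text).strip()


names = ['Airdrie', 'Beaumont[AB 1]', 'Calgary[AB 3]']
cleaned = [strip_footnotes(name) for name in names]
print(cleaned)
```

With the data frame above, something like `df1['Name'] = df1['Name'].apply(strip_footnotes)` would apply the same cleaning to the whole column.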


## Describing the data

We can get a quick summary of the data by using the "describe" function on the data frame, as follows:

In [169]:
df1.describe()

|   | Name | Region | Population (2021) |
|---|------|--------|-------------------|
| count | 20 | 20 | 20 |
| unique | 20 | 7 | 20 |
| top | Airdrie | Edmonton Metro | 74100 |
| freq | 1 | 6 | 1 |


## From strings to numbers

You may have noticed that the population data is being treated as strings, not actual numbers, in the data frame. That is why "describe" reported counts and unique values rather than statistics.

We can use the following lines of code to convert these strings to numbers. First, remove the commas from the text, then convert the text to integers.

In [170]:
df1['Population (2021)'] = df1['Population (2021)'].apply(lambda x: x.replace(',',''))
df1['Population (2021)'] = df1['Population (2021)'].apply(int)
df1

|   | Name | Region | Population (2021) |
|---|------|--------|-------------------|
| 0 | Airdrie | Calgary Metro | 74100 |
| 1 | Beaumont[AB 1] | Edmonton Metro | 20888 |
| 2 | Brooks[AB 2] | Southern | 14924 |
| 3 | Calgary[AB 3] | Calgary Metro | 1306784 |
| 4 | Camrose | Central | 18772 |
| 5 | Chestermere[AB 4] | Calgary Metro | 22163 |
| 6 | Cold Lake | North | 15661 |
| 7 | Edmonton[AB 5] | Edmonton Metro | 1010899 |
| 8 | Fort Saskatchewan | Edmonton Metro | 27088 |
| 9 | Grande Prairie | Northern | 64141 |
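
The same conversion can also be written with pandas' vectorized string methods and `pd.to_numeric`. This is offered as an alternative sketch on a small hand-made data frame, using three population values from the Alberta table:

```python
import pandas as pd

sample = pd.DataFrame({
    'Name': ['Airdrie', 'Calgary', 'Camrose'],
    'Population (2021)': ['74,100', '1,306,784', '18,772'],
})

# Remove thousands separators, then convert the strings to numbers
no_commas = sample['Population (2021)'].str.replace(',', '', regex=False)
sample['Population (2021)'] = pd.to_numeric(no_commas)

print(sample.dtypes)
print(sample['Population (2021)'].sum())
```

`pd.to_numeric` also accepts `errors='coerce'`, which turns entries it cannot parse into NaN rather than raising an error.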


## Describing the numerical data

Now that the population column holds actual numbers, the "describe" function gives us some basic statistics about them, such as the minimum, maximum, mean and standard deviation.

In [166]:
df1.describe()

|   | Population (2021) |
|---|-------------------|
| count | 20.0 |
| mean | 302364.1 |
| std | 728640.8 |
| min | 12594.0 |
| 25% | 19497.25 |
| 50% | 35869.5 |
| 75% | 80176.5 |
| max | 3023641.0 |
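
The maximum reported above, 3,023,641, is larger than the population of any single city in the table, and the mean multiplied by the count (302,364.1 times 20) is exactly twice that value. This suggests that the last row of the Wikipedia table is a totals row that was scraped along with the cities. A minimal sketch of dropping such a row before summarizing, assuming the totals row is the last one. The small data frame is hand-made, the label 'Total' is hypothetical, and the total equals the sum of the three cities shown.

```python
import pandas as pd

# Hand-made example: three cities plus a totals row (illustrative data,
# the 'Total' label is a hypothetical stand-in)
sample = pd.DataFrame({
    'Name': ['Airdrie', 'Camrose', 'Cold Lake', 'Total'],
    'Population (2021)': [74100, 18772, 15661, 108533],
})

# Keep every row except the last one (assumed to hold the totals)
cities_only = sample.iloc[:-1]

print(sample['Population (2021)'].max())       # includes the totals row
print(cities_only['Population (2021)'].max())  # cities only
```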


## Canadian Parliament webpage


In [None]:
dateOfDebate = '2024/02/29/'

# '?singlepage=1' gets all of the speakers on a single page
page = requests.get('https://openparliament.ca/debates/' + dateOfDebate + '?singlepage=1').text
data = BeautifulSoup(page, 'html.parser')

While we could print out this data with the print(data) command and examine its structure, let's instead focus on an informative section such as this one:

```
<div class="row statement_browser statement" data-floor="" data-hocid="12614875" data-url="/debates/2024/2/29/anita-anand-1/" id="sanita-anand-1">
<div class="l-ctx-col">
<noscript><p><a href="/debates/2024/2/29/anita-anand-1/only/">Permalink</a></p></noscript>
<p><strong class="statement_topic">Main Estimates, 2024-25</strong><span class="br"></span>Routine Proceedings</p>
<p>10:10 a.m.
				
				

				<p>Oakville
				<span class="br"></span>Ontario</p><p class="partytag"><span class="tag partytag_liberal">Liberal</span>
</p></p></div>
<div class="text-col">
<a href="/politicians/anita-anand/">
<img class="headshot_thumb" src="/media/CACHE/images/polpics/anita-anand/76708c03c398389aa08f038af639a182.jpg"/>
</a>
<p class="speaking">
<a href="/politicians/anita-anand/">
<span class="pol_name">Anita Anand</span>
</a> <span class="partytag tag partytag_liberal">Liberal</span><span class="pol_affil">President of the Treasury Board</span>
</p>
```

Here we see the identifier **class="row statement_browser statement"**, which starts a new section with a new person speaking.

The attribute **class="pol_name"** marks the politician's name. The attribute **class="partytag tag partytag_liberal"** identifies the party and **class="pol_affil"** marks their affiliation.

We use these class attributes to build up a dictionary of the unique names of people in the debate, track their party and affiliation, and also count how many times they speak.
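
To see how these class names are used, here is a minimal sketch that pulls the three fields out of a single statement. The HTML string is a shortened, hand-trimmed version of the snippet above, so it runs without downloading the page.

```python
from bs4 import BeautifulSoup

# Shortened version of the statement markup shown above
snippet = """
<div class="row statement_browser statement">
 <p class="speaking">
  <a href="/politicians/anita-anand/"><span class="pol_name">Anita Anand</span></a>
  <span class="partytag tag partytag_liberal">Liberal</span>
  <span class="pol_affil">President of the Treasury Board</span>
 </p>
</div>
"""

statement = BeautifulSoup(snippet, 'html.parser').find('div', class_='statement')

# class_ matches any tag whose class list contains the given name
name = statement.find('span', class_='pol_name').get_text(strip=True)
party = statement.find('span', class_='partytag').get_text(strip=True)
affiliation = statement.find('span', class_='pol_affil').get_text(strip=True)

print(name, party, affiliation, sep=' | ')
```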


In [186]:
debateDict = {'Name': [],
              'Party' : [],
              'Affiliation' : [],
              'Count' : [],
             }
for item in data.find_all("div", class_="row statement_browser statement"):
    try:  # getting the name of the speaker
        name = item.find('span', class_='pol_name').text
        name = str(name)
    except AttributeError:
        continue
    try:  # if they have spoken already, we do not find their party or affiliation
        index = debateDict['Name'].index(name)
        indexFound = True
    except ValueError:
        indexFound = False
        try:  # finding the affiliation
            affiliation = item.find('span', class_="pol_affil").text
            affiliation = str(affiliation).replace("\n","")
            affiliation = affiliation.replace("						", "")
        except AttributeError:
            affiliation = 'N/A'
        try:  # For speakers without party tags
            party = item.find('p', class_='partytag').text
            party = str(party).replace("\n","")
        except AttributeError:
            party = 'N/A'
    if indexFound:
        debateDict["Count"][index] = debateDict["Count"][index] + 1
    else:
        debateDict['Name'].append(name)
        debateDict['Party'].append(party)
        debateDict['Affiliation'].append(affiliation)
        debateDict['Count'].append(1)
 

From this dictionary, we create a dataframe that shows all the collected information. 

In [195]:
df2 = pd.DataFrame.from_dict(debateDict)
display(df2)

|   | Name | Party | Affiliation | Count |
|---|------|-------|-------------|-------|
| 0 | The Speaker | Liberal | Greg Fergus | 16 |
| 1 | Anita Anand | Liberal | President of the Treasury Board | 3 |
| 2 | Kevin Lamoureux | Liberal | Parliamentary Secretary to the Leader of the G... | 17 |
| 3 | Mark Holland | Liberal | Minister of Health | 18 |
| 4 | Karen Vecchio | Conservative | Elgin—Middlesex—London, ON | 1 |
| ... | ... | ... | ... | ... |
| 90 | Taylor Bachrach | NDP | Skeena—Bulkley Valley, BC | 1 |
| 91 | Gabriel Ste-Marie | Bloc | Joliette, QC | 1 |
| 92 | Annie Koutrakis | Liberal | Parliamentary Secretary to the Minister of Tou... | 1 |
| 93 | Ryan Williams | Conservative | Bay of Quinte, ON | 2 |
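
With the counts in a data frame, sorting shows who spoke most often. A small sketch using four rows copied from the output above (the full `df2` has 94 rows):

```python
import pandas as pd

# Four rows copied from the output above
sample = pd.DataFrame({
    'Name': ['The Speaker', 'Anita Anand', 'Kevin Lamoureux', 'Mark Holland'],
    'Party': ['Liberal', 'Liberal', 'Liberal', 'Liberal'],
    'Count': [16, 3, 17, 18],
})

# Sort so the most frequent speakers come first
by_count = sample.sort_values('Count', ascending=False).reset_index(drop=True)
print(by_count)

# Total number of statements per party
per_party = sample.groupby('Party')['Count'].sum()
print(per_party)
```

The same two calls should work unchanged on `df2`, since it has the same `Name`, `Party` and `Count` columns.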


If you would like to see more done with this example, please check out this other Callysto notebook, 
[open-parliament.ipynb](https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Fcurriculum-notebooks&branch=master&urlpath=notebooks/curriculum-notebooks/SocialStudies/OpenParliament/open-parliament.ipynb&depth=1)


In [9]:
# Candidate URLs tried while exploring; only the Wikipedia URL is active
# url = 'https://www.mlb.com/scores'
# url = 'https://www.mlb.com/stats/'
url = 'https://en.wikipedia.org/wiki/List_of_cities_in_Canada'

# Send a GET request to the URL
response = requests.get(url)

In [11]:
soup = BeautifulSoup(response.content, 'html.parser')
#print(soup.prettify()) 

print('Classes of each table:')
for table in soup.find_all('table'):
    print(table.get('class'))


Classes of each table:
['wikitable', 'sortable']
['wikitable', 'sortable']
['wikitable', 'sortable', 'sticky-header']
['wikitable', 'sortable', 'sticky-header']
['wikitable', 'sortable', 'sticky-header']
['wikitable', 'sortable']
['wikitable']
['wikitable']
['wikitable', 'sortable', 'sticky-header']
['wikitable', 'sortable']
['wikitable', 'sortable']
['wikitable', 'sortable', 'sticky-header']
['wikitable']
['nowraplinks', 'mw-collapsible', 'mw-collapsed', 'navbox-inner']
['nowraplinks', 'hlist', 'mw-collapsible', 'expanded', 'navbox-inner']
['nowraplinks', 'navbox-subgroup']
['nowraplinks', 'hlist', 'navbox-subgroup']
['nowraplinks', 'hlist', 'navbox-subgroup']
['nowraplinks', 'hlist', 'navbox-subgroup']
['nowraplinks', 'hlist', 'navbox-subgroup']
['nowraplinks', 'hlist', 'navbox-subgroup']
['nowraplinks', 'hlist', 'mw-collapsible', 'autocollapse', 'navbox-inner']
['nowraplinks', 'mw-collapsible', 'mw-collapsed', 'navbox-inner']
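
The list above shows two groups: data tables carrying the class `wikitable`, and navigation boxes carrying `nowraplinks`. Passing `class_` to `find_all` keeps only the tables of interest. A minimal sketch on a hand-written HTML string (an assumption standing in for the real page):

```python
from bs4 import BeautifulSoup

sample_html = """
<table class="wikitable sortable"><tr><td>Ottawa</td></tr></table>
<table class="wikitable sortable sticky-header"><tr><td>Calgary</td></tr></table>
<table class="nowraplinks navbox-inner"><tr><td>Navigation</td></tr></table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

all_tables = sample_soup.find_all('table')
# class_ matches a tag if 'wikitable' is any one of its classes
data_tables = sample_soup.find_all('table', class_='wikitable')

print(len(all_tables), len(data_tables))
```

On the real page this would presumably keep the 13 `wikitable` entries listed above and skip the 10 navigation boxes.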


In [45]:
for row in tables[1].tbody.find_all('tr'):    
    columns = row.find_all('td')
    for col in columns: 
        print(col.text.strip(), "--", len(col))


Airdrie -- 1
Calgary Metro -- 1
Jan 1, 1985 -- 1
7 -- 1
74,100 -- 1
61,581 -- 1
+20.3% -- 2
84.39 -- 1
878.1 -- 2
Beaumont[AB 1] -- 2
Edmonton Metro -- 1
Jan 1, 2019 -- 1
7 -- 1
20,888 -- 1
17,457 -- 1
+19.7% -- 2
24.70 -- 1
845.7 -- 2
Brooks[AB 2] -- 2
Southern -- 1
Sep 1, 2005 -- 1
7 -- 1
14,924 -- 1
14,451 -- 1
+3.3% -- 2
18.21 -- 1
819.5 -- 2
Calgary[AB 3] -- 2
Calgary Metro -- 1
Jan 1, 1894 -- 1
15 -- 1
1,306,784 -- 1
1,239,220 -- 1
+5.5% -- 2
820.62 -- 1
1,592.4 -- 2
Camrose -- 1
Central -- 1
Jan 1, 1955 -- 1
9 -- 1
18,772 -- 1
18,742 -- 1
+0.2% -- 2
41.67 -- 1
450.5 -- 2
Chestermere[AB 4] -- 2
Calgary Metro -- 1
Jan 1, 2015 -- 1
7 -- 1
22,163 -- 1
19,887 -- 1
+11.4% -- 2
32.83 -- 1
675.1 -- 2
Cold Lake -- 1
North -- 1
Oct 1, 2000 -- 1
7 -- 1
15,661 -- 1
14,976 -- 1
+4.6% -- 2
66.61 -- 1
235.1 -- 2
Edmonton[AB 5] -- 2
Edmonton Metro -- 1
Oct 8, 1904 -- 1
13 -- 1
1,010,899 -- 1
933,088 -- 1
+8.3% -- 2
765.61 -- 1
1,320.4 -- 2
Fort Saskatchewan -- 1
Edmonton Metro -- 1
Jul 1, 1985 

In [37]:
columns = row.find_all('td')

In [42]:
for col in columns:
    print(col.text.strip())

Airdrie
Calgary Metro
Jan 1, 1985
7
74,100
61,581
+20.3%
84.39
878.1


In [28]:
columns[4].text.strip()

'3,023,641'

In [None]:
# Parse the HTML content
soup = BeautifulSoup(response.content, 'html.parser')

# Find all elements with class 'game-row__container' which contains the game details
game_containers = soup.find_all(class_='game-row__container')

# Initialize a list to store the extracted data
game_scores = []

# Loop through each game container
for container in game_containers:
    # Extract relevant information
    teams = container.find_all(class_='team-name')
    score_tag = container.find(class_='total-score')

    # Skip containers that lack two team names or a score
    if len(teams) < 2 or score_tag is None:
        continue

    team_1 = teams[0].text.strip()
    team_2 = teams[1].text.strip()
    score = score_tag.text.strip()

    # Store the data as a dictionary
    game_info = {
        'Team 1': team_1,
        'Team 2': team_2,
        'Score': score
    }

    # Append the dictionary to the list
    game_scores.append(game_info)

# Print the extracted game scores
for game in game_scores:
    print(game)

In [6]:
response.content




In [None]:
response


In [None]:
!pip install requests-html

In [4]:
from requests_html import HTMLSession, AsyncHTMLSession

# Create an HTML session
asession = AsyncHTMLSession()

# URL of the MLB website with scores
url = 'https://www.mlb.com/scores'

async def get_mlb():
    r = await asession.get(url)
    return r

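# Send a GET request to the URL (JavaScript is rendered in a later step).
# Inside a notebook an event loop is already running, so asession.run(get_mlb)
# fails with "RuntimeError: This event loop is already running".
# Awaiting the coroutine directly avoids this in Jupyter.
response = await get_mlb()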

In [None]:
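# Render the JavaScript (arender is the awaitable counterpart of render,
# needed when the response comes from an AsyncHTMLSession)
await response.html.arender()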

# Parse the HTML content
soup = response.html

# Find all elements with class 'game-row__container' which contains the game details
game_containers = soup.find('.game-row__container')

# Initialize a list to store the extracted data
game_scores = []

# Loop through each game container
for container in game_containers:
    # Extract relevant information
    teams = container.find('.team-name')
    team_1 = teams[0].text.strip()
    team_2 = teams[1].text.strip()
    score = container.find('.total-score', first=True).text.strip()

    # Store the data as a dictionary
    game_info = {
        'Team 1': team_1,
        'Team 2': team_2,
        'Score': score
    }

    # Append the dictionary to the list
    game_scores.append(game_info)

# Print the extracted game scores
for game in game_scores:
    print(game)

In [None]:
import requests
from bs4 import BeautifulSoup

# URL of the Statistics Canada website with greenhouse gas emissions data
url = 'https://www150.statcan.gc.ca/n1/pub/11-621-m/11-621-m2019013-eng.htm'

# Send a GET request to the URL
response = requests.get(url)

# Parse the HTML content
soup = BeautifulSoup(response.content, 'html.parser')

# Find the table containing greenhouse gas emissions data
table = soup.find('table')

# Initialize lists to store data
years = []
emissions = []

# Loop through each row in the table
for row in table.find_all('tr')[1:]:  # Skip the first row (header row)
    # Extract year and emissions from the row
    columns = row.find_all('td')
    if len(columns) < 2:  # Skip rows without at least two data cells
        continue
    year = columns[0].text.strip()
    emission = columns[1].text.strip()

    # Append the data to the lists
    years.append(year)
    emissions.append(emission)

# Print the extracted data
print("Year\tGreenhouse Gas Emissions (kt)")
for year, emission in zip(years, emissions):
    print(f"{year}\t{emission}")

In [None]:
soup
