![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Flesson-plans&branch=master&subPath=notebooks/webscraping/webscraping.ipynb&depth=1"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"/></a>

## Overview: Webscraping notebook

In this short lesson, we discover how to collect data from tables in a webpage and put it into a Pandas dataframe for analysis. This is called webscraping.

Outline:

1. Introduction to webscraping
2. The code libraries "requests" and "beautifulsoup"
3. A simple example: Tables of Canadian Cities, from Wikipedia
4. A more complex example: Canadian parliament and creating a data frame
5. Using APIs: Canada Open Data examples
6. Using CSV files: baseball data example
7. Going further: dynamic webpages, Kaggle data sources

Some useful online resources:
- https://realpython.com/beautiful-soup-web-scraper-python/
- https://github.com/psf/requests-html
- https://medium.com/geekculture/web-scraping-tables-in-python-using-beautiful-soup-8bbc31c5803e
- https://careerfoundry.com/en/blog/data-analytics/open-data-sources/



## 1. Introduction to webscraping

**Webscraping** is the process of collecting (or scraping) information from a webpage, in order to use it for further analysis. 

This is often useful for gathering a large amount of data in a short amount of time, whether it be a large collection of numbers and text from a big table in a single webpage, or to automatically gather data from many different webpages hosting useful information. 

While much of this data could be collected by hand, say by copying with a pen and paper (or more likely, copying and pasting with a keyboard command), it is much more efficient to gather the data using a computer program.

This notebook shows how to use some simple tools in Python to webscrape data from interesting webpages. 


## 2. Installing libraries

We begin by installing and importing some important code libraries. The first is the **requests** library, which handles the process of requesting raw data from a webpage. The second is the **beautifulsoup** library, which parses the raw data from the webpage into a form that is easy to search and to save as a data frame.

Normally, we just use the **import** command to make the libraries accessible in our code. However, if the libraries are not pre-installed in our system, we need to run the following two **pip install** commands. To run them in the next cell, remove the # symbol at the start of each line, which marks the line as a comment so that it does not run.

In [None]:
#pip install requests
#pip install beautifulsoup4
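
To find out whether the pip commands are needed at all, one option is to ask Python whether it can locate each library. The following is a minimal sketch using only the standard library; the helper name `is_installed` is our own invention. Note that the `beautifulsoup4` package is imported under the name `bs4`.

```python
from importlib.util import find_spec

def is_installed(module_name):
    """Return True if Python can locate the named top-level module."""
    return find_spec(module_name) is not None

# These are the import names used in this notebook
for module_name in ['requests', 'bs4', 'pandas']:
    if is_installed(module_name):
        print(module_name, '-> installed')
    else:
        print(module_name, '-> missing, try the pip install command')
```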

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

## 3. A simple example: Scraping a list of Canadian cities

We start with a URL pointing to Wikipedia, for its article on Canadian cities. A GET request will grab the data from the webpage for us. We then use the `beautifulsoup` library to convert the HTML code from the webpage into something we can use. 


In [None]:
# URL for the Wikipedia webpage
url = 'https://en.wikipedia.org/wiki/List_of_cities_in_Canada'

# Send a GET request to the URL
response = requests.get(url)

# Parse the webpage from the response, and save as the data item "soup"
soup = BeautifulSoup(response.content, 'html.parser')

### Looking at the response

We can now look at the information, using the "prettify" instruction on "soup". Since the result is quite long, let's just print out the first 1000 characters.

In [None]:
print(soup.prettify()[:1000]) 

### Grabbing a table

The webpage contains a number of tables. We use `beautifulsoup` to identify each one, and save them all in a list. 

In [None]:
## Find the tables and count how many there are
tables = soup.find_all('table')
len(tables)

### Viewing a table

Let's take a look at the first table. The command "tables[0]" will print out the information in the first table, which we see is a list of capital cities. The table is made up of two columns, like this:

<div align="center">
<img src="images/capitals.jpg" alt="A table showing the capital cities" width="400"/><br>
The first table on the Wiki page, showing two columns: regions and capital cities.
</div>

Here is the command to display the data.

In [None]:
tables[0]

### Understanding the table data

We see the table consists of a number of items with tags like \<tr> or \<td> and some text data as well.

The \<tr> indicates a row in the table, while the \<td>  tags a piece of data in that row.
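
To see the row and cell structure in isolation, here is a small sketch that parses a made-up table (not the real Wikipedia HTML, which is much longer). The variable names `html`, `soup_demo` and `rows` are chosen just for this illustration.

```python
from bs4 import BeautifulSoup

# A tiny made-up table, standing in for the much longer HTML on the real page
html = """
<table>
  <tr><th>Region</th><th>City</th></tr>
  <tr><td>Alberta</td><td>Edmonton</td></tr>
  <tr><td>Manitoba</td><td>Winnipeg</td></tr>
</table>
"""

soup_demo = BeautifulSoup(html, 'html.parser')

rows = []
for tr in soup_demo.find_all('tr'):
    # Header rows use <th>, so they contain no <td> cells and are skipped
    cells = [td.text.strip() for td in tr.find_all('td')]
    if cells:
        rows.append(cells)

print(rows)
```

Each inner list holds the text of one table row, with one entry per \<td> cell.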

We can write a simple loop to put this information into a list we call "contents" and then convert it to a dataframe.

In [None]:
contents = []

for row in tables[0].tbody.find_all('tr'):    
    columns = row.find_all('td')
    
    if columns:  # skip header rows, which contain no <td> cells
        cell = {}
        cell['Region'] = columns[0].text.strip()
        cell['City']  = columns[1].text.strip()
        contents.append(cell)

contents

### Viewing the dataframe

Here we create the dataframe and view it.

In [None]:
df1 = pd.DataFrame(contents)
df1

### Viewing another table about Canadian cities

Let's take a look at the next table from the same Wikipedia webpage. The command "tables[1]" will print out the information in the second table, which is information about the cities in Alberta. It looks like this:

<div align="center">
<img src="images/alberta.jpg" alt="A table showing cities in Alberta" width="800"/><br>
The second table on the Wiki page, showing cities in Alberta.
</div>

Run the following cell to see the raw data. It is a bit long, though, so we just print the first 2000 characters to see the basic format of the data. 

In [None]:
print(str(tables[1])[:2000])

### Creating another dataframe

While we could use all the data from the table, we will just take the name, region, and population information. From the table shown above, we see this corresponds to columns 0, 1 and 4. 
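
Rather than counting columns by eye, we can list the header cells together with their positions. This is only a sketch on a made-up table: the header names and values below are invented, and real Wikipedia tables sometimes use merged or multi-row headers, in which case the header positions may not line up with the \<td> positions.

```python
from bs4 import BeautifulSoup

# Made-up table; the real page has different headers and many more rows
html = """
<table>
  <tr><th>Name</th><th>Region</th><th>Status</th><th>Area</th><th>Population</th></tr>
  <tr><td>Sampletown</td><td>Example Region</td><td>City</td><td>10</td><td>1,234</td></tr>
</table>
"""

table = BeautifulSoup(html, 'html.parser').find('table')

# Collect the text of every header cell, in order
headers = [th.text.strip() for th in table.find_all('th')]
for index, header in enumerate(headers):
    print(index, header)

# Look up the positions of the columns we want to keep
wanted = [headers.index(name) for name in ['Name', 'Region', 'Population']]
print(wanted)
```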

The code to create the dataframe is as follows:

In [None]:
contents = []

for row in tables[1].tbody.find_all('tr'):    
    columns = row.find_all('td')
    
    if columns:  # skip header rows, which contain no <td> cells
        cell = {}
        cell['Name'] = columns[0].text.strip()
        cell['Region']  = columns[1].text.strip()
        cell['Population (2021)']  = columns[4].text.strip()
        contents.append(cell)

df1 = pd.DataFrame(contents)
df1

### Describing the data

We can get a quick summary of the data by using the "describe" function on the data frame, as follows:

In [None]:
df1.describe()

### From strings to numbers

You may have noticed that the population data is being treated as strings, not actual numbers, in the dataframe.
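
One way to confirm this is to ask Pandas whether a column is numeric. The sketch below uses a small made-up dataframe (`df_demo`, with invented values) so that it runs on its own.

```python
import pandas as pd

# Made-up values formatted like the scraped text (digits with comma separators)
df_demo = pd.DataFrame({'Population': ['1,234', '56,789', '101']})

# is_numeric_dtype reports whether the column holds numbers
is_numeric = pd.api.types.is_numeric_dtype(df_demo['Population'])
print(is_numeric)  # False: the column holds text

# For a text column, describe() reports counts and the most frequent value
# rather than numerical statistics such as the mean
summary = df_demo['Population'].describe()
print(summary.index.tolist())
```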

We can use the following lines of code to convert these strings to numbers. First, we remove the commas from the text, then we convert the text to integers.

In [None]:
df1['Population (2021)'] = df1['Population (2021)'].apply(lambda x: x.replace(',',''))
df1['Population (2021)'] = df1['Population (2021)'].apply(int)
df1

### Describing the numerical data

Now that the population column holds actual numbers, the "describe" function will give us some basic statistics about these numbers, such as the minimum, maximum, mean, and standard deviation.

In [None]:
df1.describe()

## 4. A more complex example: Canadian parliament webpage

For our second demonstration, let's go to a webpage that shows a list of Members of Parliament who spoke on a given day in the Canadian House of Commons. The debates are listed on this webpage, https://openparliament.ca/debates/, and we will examine one particular date: **February 29, 2024**.

The webpage looks like this:

<div align="center">
<img src="images/parliament.jpg" alt="The speakers in parliament webpage" width="400"/><br>
The openparliament.ca debates webpage, showing the speakers in parliament.
</div>

Here is the code to grab the data, using the `requests` library, and `beautifulsoup` to parse the data.

In [None]:
dateOfDebate = ('2024/02/29/')

# Adding '?singlepage=1' to the URL gets all of the speakers on one page
page = requests.get('https://openparliament.ca/debates/' + dateOfDebate + '?singlepage=1').text
data = BeautifulSoup(page, 'html.parser')

While we could print out this data and examine its structure, using the print(data) command, instead let's focus on a section that is informative such as this one:

```
<div class="row statement_browser statement" data-floor="" data-hocid="12614875" data-url="/debates/2024/2/29/anita-anand-1/" id="sanita-anand-1">
<div class="l-ctx-col">
<noscript><p><a href="/debates/2024/2/29/anita-anand-1/only/">Permalink</a></p></noscript>
<p><strong class="statement_topic">Main Estimates, 2024-25</strong><span class="br"></span>Routine Proceedings</p>
<p>10:10 a.m.
				
				

				<p>Oakville
				<span class="br"></span>Ontario</p><p class="partytag"><span class="tag partytag_liberal">Liberal</span>
</p></p></div>
<div class="text-col">
<a href="/politicians/anita-anand/">
<img class="headshot_thumb" src="/media/CACHE/images/polpics/anita-anand/76708c03c398389aa08f038af639a182.jpg"/>
</a>
<p class="speaking">
<a href="/politicians/anita-anand/">
<span class="pol_name">Anita Anand</span>
</a> <span class="partytag tag partytag_liberal">Liberal</span><span class="pol_affil">President of the Treasury Board</span>
</p>
```

Here we see the identifier **class="row statement_browser statement"** which starts a new section with a new person speaking. 

The attribute **class="pol_name"** marks the politician's name. The attribute **class="partytag tag partytag_liberal"** identifies the party, and **class="pol_affil"** marks their affiliation.
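
As a quick check of how these class names can be used, here is a sketch that searches a shortened copy of the snippet above. The variable names are our own, and `no_such_class` is a deliberately invented class name used to show what happens when nothing matches.

```python
from bs4 import BeautifulSoup

# A shortened copy of the snippet shown above
html = """
<div class="row statement_browser statement">
  <p class="speaking">
    <a href="/politicians/anita-anand/"><span class="pol_name">Anita Anand</span></a>
    <span class="partytag tag partytag_liberal">Liberal</span>
    <span class="pol_affil">President of the Treasury Board</span>
  </p>
</div>
"""

# Searching by a single class name matches any element that has that class
statement = BeautifulSoup(html, 'html.parser').find('div', class_='statement')

name = statement.find('span', class_='pol_name').text.strip()
party = statement.find('span', class_='partytag').text.strip()
affiliation = statement.find('span', class_='pol_affil').text.strip()
print(name, '|', party, '|', affiliation)

# find() returns None when nothing matches, so asking for .text would fail
missing = statement.find('span', class_='no_such_class')
print(missing)
```

Because `find` returns `None` for a missing element, calling `.text` on the result raises an `AttributeError`, which is why code that scrapes real pages usually guards against it.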

We use these tags to build up a dictionary of unique names of people in the debate, track their party and affiliation, and also count how many times they speak.

In [None]:
debateDict = {'Name': [],
              'Party' : [],
              'Affiliation' : [],
              'Count' : [],
             }
for item in data.find_all("div", class_="row statement_browser statement"):
    try:  # getting the name of the speaker
        name = item.find('span', class_='pol_name').text
        name = str(name)
    except AttributeError:
        continue
    try:  # if they have spoken already, we do not find their party or affiliation
        index = debateDict['Name'].index(name)
        indexFound = True
    except ValueError:
        indexFound = False
        try:  # finding the affiliation
            affiliation = item.find('span', class_="pol_affil").text
            # Collapse newlines and runs of tabs or spaces into single spaces
            affiliation = ' '.join(str(affiliation).split())
        except AttributeError:
            affiliation = 'N/A'
        try:  # For speakers without party tags
            party = item.find('p', class_='partytag').text
            party = str(party).replace("\n","")
        except AttributeError:
            party = 'N/A'
    if indexFound:
        debateDict["Count"][index] = debateDict["Count"][index] + 1
    else:
        debateDict['Name'].append(name)
        debateDict['Party'].append(party)
        debateDict['Affiliation'].append(affiliation)
        debateDict['Count'].append(1)
 

### Viewing the data

From this dictionary, we create a dataframe that shows all the collected information. 

In [None]:
df2 = pd.DataFrame.from_dict(debateDict)
display(df2)

We are now ready to analyze this data in the dataframe. But let's save that for another day. 

## 5. Using APIs: Canada Open Data

An **API**, which stands for **Application Programming Interface**, is a bridge allowing different software applications to communicate and interact with each other.

Imagine you're at a restaurant. The menu acts as an API because it provides a simplified way for you to interact with the kitchen. Instead of going into the kitchen directly and asking the chef how to cook your dish, you simply order off the menu. The kitchen staff then uses your order to prepare and serve your meal.

Using an API simplifies the process of creating a data frame from information appearing on a webpage, because the data arrives already structured and there is no HTML to parse.
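
Many APIs reply with JSON text, which Python can turn into dictionaries and lists. The sketch below uses a made-up reply so that it runs without a network connection; the field names `number` and `result` are invented for illustration.

```python
import json
import pandas as pd

# Made-up JSON text: a list of records stored under the key 'objects'
reply_text = '{"objects": [{"number": 1, "result": "Passed"}, {"number": 2, "result": "Failed"}]}'

reply = json.loads(reply_text)             # JSON text -> Python dictionary
df_demo = pd.DataFrame(reply['objects'])   # list of dictionaries -> dataframe
print(df_demo)
```

Each dictionary in the list becomes one row, and its keys become the column names.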

For instance, we can use the same *openparliament* source from above, to grab information and directly convert it into a dataframe. 

Let's start with a request to a specific web address, requesting information about votes on bills in parliament.

In [None]:
r = requests.get('http://api.openparliament.ca/votes/?format=json&limit=100')
df3 = pd.DataFrame(r.json()['objects'])
df3

### Wasn't that easy?

If you would like to see more done with this example, please check out this other Callysto notebook: [Open Parliament](https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Fcurriculum-notebooks&branch=master&urlpath=notebooks/curriculum-notebooks/SocialStudies/OpenParliament/open-parliament.ipynb&depth=1)

## 5a: More examples: Canada open data

The Canadian government publishes reams of data that is freely available for analysis. We can browse through the available data sets at this website: https://open.canada.ca/en

For instance, here is a link to information about government contracts: https://search.open.canada.ca/contracts/

Let's use this link and create a dataframe. We are using the API call that the website supplies, along with a resource ID that points to this particular data set. (We can find all this information on the website.) We limit the search to the first 10 items.
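
An API call like this is just a web address followed by a question mark and a list of parameters. As a sketch, here is how the address could be assembled from its parts using the standard library; the variable names `base_url`, `params` and `url_demo` are our own.

```python
from urllib.parse import urlencode

# The API endpoint, followed by the parameters for this search
base_url = 'https://open.canada.ca/data/en/api/action/datastore_search'
params = {
    'limit': 10,
    'resource_id': 'fac950c0-00d5-4ec1-a4d3-9cbebf98a305',
}

# urlencode joins the parameters as key=value pairs separated by &
url_demo = base_url + '?' + urlencode(params)
print(url_demo)
```

The `requests.get` function also accepts a `params` dictionary directly and builds the full address itself.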

Our code below prints out the status code, to indicate success (200) or not (403, 404, etc).

In [None]:
url='https://open.canada.ca/data/en/api/action/datastore_search?limit=10&resource_id=fac950c0-00d5-4ec1-a4d3-9cbebf98a305'
response = requests.get(url)
print(response.status_code, "  Note: 200 indicates a successful request.")


### Reading the JSON object

The response contains a JSON object, which is a dictionary with many entries. Let's look at the result of the query, which holds the records of various contracts. 
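
When a JSON reply is unfamiliar, it helps to list its keys one level at a time. The following sketch does this on a made-up reply: only the `result` and `records` nesting and the two field names are taken from this notebook, while the other keys and all of the values are invented.

```python
import json

# Made-up reply text, nested in the same way as the real response
reply_text = '''
{
  "success": true,
  "result": {
    "total": 2,
    "records": [
      {"vendor_name": "Example Vendor A", "contract_value": "1000.00"},
      {"vendor_name": "Example Vendor B", "contract_value": "2500.00"}
    ]
  }
}
'''

reply = json.loads(reply_text)

# Look at the keys one level at a time to find where the records live
top_keys = list(reply.keys())
result_keys = list(reply['result'].keys())
first_record = reply['result']['records'][0]

print(top_keys)
print(result_keys)
print(first_record)
```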

We limit it to the first 2 items, to keep the screen clear. 

In [None]:
response.json()['result']['records'][:2]

### Next step - data frame
We turn this JSON information into a data frame. 

In [None]:
df4 = pd.DataFrame(response.json()['result']['records'])
df4

### Narrowing the results

Let's look at just the name of the vendor and the value of the contract, using the following Pandas command:

In [None]:
df4[['vendor_name','contract_value']]

## 6. Using CSV files: Baseball data example

Many webpages give users access to their data by providing downloads of files containing all their information. A common format is the CSV file, or Comma Separated Values file, which is like a generic spreadsheet table. By downloading these files directly, we can load them into a Pandas dataframe immediately, skipping all the webscraping and parsing tricks discussed above.
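
To see what a CSV file looks like inside, here is a sketch that reads a few lines of made-up CSV text instead of a real file. The column names `team` and `fouls` and all of the values are invented.

```python
import io
import pandas as pd

# Made-up CSV text: the first line holds the column names,
# and each later line is one row with values separated by commas
csv_text = """team,fouls
Team A,12
Team B,7
"""

# StringIO lets read_csv treat the text as if it were a file
df_demo = pd.read_csv(io.StringIO(csv_text))
print(df_demo)
```

The `read_csv` function accepts a file name, as in the cell below, and it can also read directly from a web address that points to a CSV file.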

The data journalism site FiveThirtyEight provides all kinds of open data in the form of CSV files, which are stored in their GitHub repo at https://github.com/fivethirtyeight. For instance, we see some interesting baseball data stored at this directory: 
https://github.com/fivethirtyeight/data/tree/master/foul-balls

We have downloaded this baseball file from the repo to accompany this notebook, so the following "read_csv" command will load the file directly into a dataframe.

In [None]:
df5 = pd.read_csv("foul-balls.csv")
print(df5)

### Describe

The describe command gives a quick summary about this data.

In [None]:
df5.describe()

We can now analyze this data, asking such questions as: which team hit the most foul balls, or in which zone did most foul balls end up?

We'll leave this up to a later notebook. 

## 7. Going further

We have demonstrated the basics of scraping data from static web pages using the `requests` and `beautifulsoup` libraries. We have also seen how to gather data through APIs and by downloading CSV files.

### Dynamic webpages 

Some webpages are not static, but instead are built "on the fly" with code that runs when the user views the web pages. These **dynamic** webpages load information onto the webpage using JavaScript code, as a live rendering of the information. To access these types of webpages, we need more powerful tools such as *Selenium* or *Pyppeteer*. 

These tools launch a "headless" browser from the Python environment, where the JavaScript can render the webpage content locally for us. We can then pass the rendered HTML to `beautifulsoup` to look at the data. You can read more about the Selenium process here:

https://www.selenium.dev/

and also a tutorial on the setup here:

https://www.zenrows.com/blog/dynamic-web-pages-scraping-python

Note that recent versions of Selenium (4.6 and later) manage the browser driver automatically, so we no longer need to install a webdriver by hand, which had been an obstacle to using the tool within Jupyter. Currently, a simple "pip install selenium" is usually enough to give you access to the tools you need to run a headless browser and webscrape from dynamic webpages.

### Kaggle and other data sources

Some popular data sources such as *Kaggle* (https://www.kaggle.com/datasets) provide a wide variety of datasets for download. They often sponsor hackathons where people are encouraged to analyze a suite of data in a competition.

To access this data, however, you need to create a "token" to identify yourself and gain permission to download this data. Typically, this process is free, but it does require you to provide some personal information to get access. Be sure to read the privacy rules before providing any such personal information online.


### Good luck, and enjoy!

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)