# Introduction to Web Scraping 
<br><br>
**ONS / NISR**<br>
2021

## What is web scraping 

Web scraping (Screen Scraping, Web Data Extraction, Web Harvesting etc.) is a technique used to automatically extract large amounts of data from websites and save it to a file or database. 

<center><img src="./imgs/web_scraping.png"></center>

The Internet is a data store of the world's information, be it text, media or data in any other format. Every web page displays data in one form or another. Access to this data is crucial for the success of most businesses in the modern world. Unfortunately, most of this data is not open. Most websites do not provide the option to save the data they display to your local storage, or to your own website.

## Why do web scraping 

Web Scraping is used for getting data. Data collection and analysis is important even for government, non-profit and educational institutions.

The following are a few of the many common applications of Web Scraping:

1. In eCommerce, Web Scraping is used for competition price monitoring.

2. In Marketing, Web Scraping is used for lead generation, to build phone and email lists for cold outreach.

3. In Real Estate, Web Scraping is used to get property and agent/owner details.

4. Web Scraping is used to collect training and testing data for Machine Learning projects.


## Is Web Scraping Legal?

One of the first questions that comes to mind once you have decided to scrape data is whether web scraping is legal. Scraping data that is already publicly available is generally considered **legal**, as long as you use the data **ethically**.

### Additional considerations 

Whilst the process of web scraping is legal, consideration should be given to the data that you're attempting to collect. Whilst it may be in the public domain, you may not have a legal standing to collect **personal** or **copyrighted** data. 

**Personal Data** - As a rule of thumb, it is recommended to have a lawful reason to obtain, store and use personal data without the user’s consent.

**Copyrighted Data** - It is not illegal to scrape copyrighted data as long as you don’t plan to reuse or publish it.

## What are web pages?

### HTML (Hypertext Markup Language)

The backbone of any web page is HTML. This is a relatively simple markup language that uses `<tags>`, denoted by angle brackets, to mark up different elements.

Open https://www.statistics.gov.rw in any web browser, right-click on the page and select `View Source`. 

<center><img src="./imgs/webscrape_view_source.png" width=400></center>

<center><img src="./imgs/nisr_page_source.png"></center>


### Creating a Basic HTML page

As `HTML` is just a series of `<tags>` written in plain text, we can create a web page that can be rendered in any browser just using a text editor. 

Create a new file called `my_webpage.html` and add the following text.
```html
<html> <!-- Open the HTML tag to declare that everything inside is HTML -->
    <body> <!-- Open the body tag, this is where we can write visible elements -->
        <h1>Page title</h1> <!-- h1 stands for Heading, see the use of </> to close the tag -->
        <p>This is my webpage.</p> <!-- p stands for paragraph -->
    </body> <!-- Close the body tag -->
</html> <!-- Close the HTML tag-->
```

<center><img src="./imgs/my_webpage.png" width=400 border=1></center>
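Because the page is just nested tags in plain text, a program can read that structure too. The sketch below uses `HTMLParser` from Python's standard library to list each opening tag in the page above, indented by how deeply it is nested. The class name `TagLister` is made up for this illustration, and the depth tracking assumes every tag is explicitly closed, as in our example page.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Record each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Store the tag at the current depth, then step one level deeper.
        self.tags.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag takes us back up one level.
        self.depth -= 1


# The same markup as my_webpage.html, without the comments.
page = """
<html>
    <body>
        <h1>Page title</h1>
        <p>This is my webpage.</p>
    </body>
</html>
"""

lister = TagLister()
lister.feed(page)

for depth, tag in lister.tags:
    print('    ' * depth + f'<{tag}>')
```

The printed outline mirrors the indentation we used when writing the file by hand: `<html>` contains `<body>`, which in turn contains `<h1>` and `<p>`.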

There are plenty of other `<tags>` we can use in `HTML`; a full list can be found [here](https://www.tutorialrepublic.com/html-reference/html5-tags.php).

Some common ones you'll see are listed below.

| Tag | Usage |
| --- | --- | 
| `<div>` | Used to group elements together, or to provide structure to the web page |
| `<span>` | Used to group inline content such as a few words of text. Unlike `<div>`, which is a block-level element, it does not start a new line | 
| `<img>` | Adds an image to the web page |
| `<table>`, `<th>`, `<tr>`, `<td>` | Defines a table in HTML with the sub-elements defining the table header, table row and table cell respectively. |
| `<a>` | Create a hyperlink around a specific element |
| `<b>`, `<i>` | Create bold and italic elements respectively | 
| `<ol>`, `<ul>`, `<li>` | Create ordered and unordered lists where `<li>` tags list items. |


Let's create a second web page called `my_complex_webpage.html` that incorporates some of these other HTML elements. 

```html
<html>
    <body>
        <h1>My Complex Webpage</h1>
        <p>This is my more complex webpage with additional elements</p>
        <a href="https://www.statistics.gov.rw">This is a link to https://www.statistics.gov.rw</a>
        <p>Below here is the NISR logo</p>
        <img src="https://www.statistics.gov.rw/sites/default/files/images/logo.png">
        <h2>This is an unordered list of fruits</h2>
        <ul>
            <li>Apple</li>
            <li>Banana</li>
        </ul>
        <h2>This is a HTML table</h2>
        <table>
            <tr><th>Column 1</th><th>Column2</th><th>Column3</th></tr>
            <tr><td>1</td><td>2</td><td>3</td></tr>
            <tr><td>4</td><td>5</td><td>6</td></tr>
            <tr><td>7</td><td>8</td><td>9</td></tr>
        </table>
    </body>
</html>
```     

<center><img src="./imgs/my_complex_webpage.png" width=400 border=1></center>

### Cascading Style Sheets (CSS)

`HTML` is good for structure, but it isn't very useful for styling elements on a web page. That's where Cascading Style Sheets (CSS) come in. `CSS` is a separate language that allows us to apply "styles" to elements on our `HTML` web page. 

For example if we wanted to set the background of our web page to black and the font colour to white we could use the following CSS code. 

```css
/* The body tells the browser to only apply the contained styles onto the <body> element */

body {  
    background: black; /* Set the page background to black */
    color: white; /* Set the page font colour to white */
}
```

Save the above code as `style.css`

There are two ways to add `CSS` to our web page. We can add it directly into the `HTML` document using the `<style>` tags. More commonly, you'll see `CSS` stored in a separate `.css` file which is linked in the `.html` file using the `<head>` and `<link>` tags. 

The `<head>` tag is like the `<body>` tag, but it is used to store additional meta information that isn't directly displayed on the page. 

```html
<html>
    <head>
        <link rel='stylesheet' href='style.css'>
    </head>
    <body>
        ...
    </body>
</html>
```

Create a copy of `my_complex_webpage.html` and add the `<head>` and `<link>` tags as described above. 

<center><img src="./imgs/css.png" width=400></center>

`CSS` is able to define styles not just for **types** of elements (e.g. `<body>`, `<li>`, `<p>`); it can also define **classes** that can be applied to numerous elements. 

```css 
/* The "." at the start of the definition tells the browser to apply this style to any elements */
/* that have the specified class name. */

.red_text {
    color: red;
}
```

Add this style to your `style.css` file. 

We can now use the `class` attribute on any HTML element to assign this style to specific elements. 

```html
<html>
    <head>
        <link rel='stylesheet' href='style.css'>
    </head>
    <body>
        <h1>My Complex Webpage</h1>
        <p class="red_text">This is my more complex webpage with additional elements</p>
        ...
        <h2 class="red_text">This is an unordered list of fruits</h2>
        <ul>
            <li class="red_text">Apple</li>
            <li>Banana</li>
        </ul>
        ...
    </body>
</html>
```

Edit your copy of `my_complex_webpage.html` to include the `class` attribute on some tags.

<center><img src="./imgs/css_classes.png" width=400></center>

### Congrats, you're officially a web designer!

## Scraping web pages with Pandas

Pandas has a built in function called `read_html` that allows us to read HTML tables directly from a web page. We can try this with the web page we just finished creating by using the following code. 

```python 
import pandas as pd 
df = pd.read_html('./my_complex_webpage.html')
df
```

In [1]:
import pandas as pd 
df = pd.read_html('./assets/my_complex_webpage.html')

print(df[0])

   Column 1  Column2  Column3
0         1        2        3
1         4        5        6
2         7        8        9


Pandas correctly found our table and ignored all of the other `HTML`. Note that `read_html` always returns a `list` of all the tables that pandas can find on the web page, even if there is only one. 

Pandas also filters out any of the CSS that's been applied to our tables, returning only the data. 

```python
import pandas as pd

# Select the first / only dataframe in the list
df_no_css = pd.read_html('./my_complex_webpage.html')[0]
df_css = pd.read_html('./my_complex_webpage_with_css.html')[0]

# This will error if the dataframes aren't identical.
pd.testing.assert_frame_equal(df_no_css, df_css)
```

In [2]:
import pandas as pd

# Select the first / only dataframe in the list
df_no_css = pd.read_html('./assets/my_complex_webpage.html')[0]
df_css = pd.read_html('./assets/my_complex_webpage_with_css.html')[0]

# This will error if the dataframes aren't identical.
pd.testing.assert_frame_equal(df_no_css, df_css)

### Real world application 

Obviously real world websites are much messier than our example page, so we will also need to employ some basic data cleaning techniques to deal with them. 

Let's look at the Wikipedia page for the [Rwandan Men's National Basketball Team](https://en.wikipedia.org/wiki/Rwanda_men%27s_national_basketball_team). There are lots of different tables, in different styles, some with images, some with complex headers. We can pass the URL directly to `read_html` and see what comes out.  

```python 
import pandas as pd 

basketball_tables = pd.read_html('https://en.wikipedia.org/wiki/Rwanda_men%27s_national_basketball_team')

print(f'Tables found: {len(basketball_tables)}')
```

In [3]:
basketball_tables = pd.read_html('https://en.wikipedia.org/wiki/Rwanda_men%27s_national_basketball_team')
print(f'Tables found: {len(basketball_tables)}')

Tables found: 13


Often web developers will use `<table>` tags as a structural element, rather than to explicitly display some data. Note that index `0` in `basketball_tables` doesn't refer to the first visible data table, but instead to the information card at the top of the page. 

<center><img src="./imgs/wiki_info_card.png" width=200></center>
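To see why the number of tables found can be larger than the number of data tables you can see, here is a small sketch that counts every `<table>` tag in a snippet of HTML using the standard library's `HTMLParser`. The snippet is invented for illustration (it is not taken from Wikipedia), and `TableCounter` is a hypothetical helper name.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Count every <table> opening tag, whether or not it holds data."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            self.count += 1


# An invented page: an outer table used purely for layout,
# wrapping one inner table that holds the actual data.
page = """
<table class="layout"><tr><td>
    <table>
        <tr><th>Pos.</th><th>Name</th></tr>
        <tr><td>PG</td><td>Player A</td></tr>
    </table>
</td></tr></table>
"""

counter = TableCounter()
counter.feed(page)
print(f'Tables found: {counter.count}')
```

Only one of the two tables counted here contains data a reader would recognise as a table; the other exists just to position content on the page.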

Looking through the 13 parsed tables, we can find the current roster table at position `4`. However, as Wikipedia pages change over time, we want to write code that **always** selects the roster table. We can do that using the `match` argument, which makes `read_html` return only the tables containing text that matches the string (or regular expression) passed. 

Once we've done that, we can add our usual `skiprows` and `header` arguments to make sure the correct row is used as the header of the table. 

```python 
url = 'https://en.wikipedia.org/wiki/Rwanda_men%27s_national_basketball_team'

roster = pd.read_html(url, 
                      match="Rwanda men's national basketball team roster", 
                      skiprows=1, 
                      header=2)[0]
roster.head()
```

In [4]:
url = 'https://en.wikipedia.org/wiki/Rwanda_men%27s_national_basketball_team'

roster = pd.read_html(url, 
                      match="Rwanda men's national basketball team roster", 
                      skiprows=1, 
                      header=2)[0]
roster.head()

|  | Pos. | No. | Name | Age – Date of birth | Height | Club | Ctr. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | PG | 4 | Jean Nshobozwabyose | 23 – | 1.83 m (6 ft 0 in) | Patriots |  |
| 1 | G | 5 | Ntore Habimana | 24 – | 1.96 m (6 ft 5 in) | Wilfrid Laurier Golden Hawks |  |
| 2 | SG | 6 | Steven Hagumintwari | 27 – | 1.93 m (6 ft 4 in) | Patriots |  |
| 3 | SG | 7 | Armel Sangwe | 24 – | 1.90 m (6 ft 3 in) | Espoir |  |
| 4 | SG | 8 | Emile Kazeneza | 20 – | 2.01 m (6 ft 7 in) | William Carey University |  |


We now have code that can scrape that table whenever we want. However, something looks a little wrong with the `Age - Date of Birth` column. Not all of the data has been scraped, notably the actual dates of birth. 

<center><img src="./imgs/display_none.png" width=300></center>

This is because there is **hidden** data within these cells. By default, pandas skips elements that are hidden from display (for example with `display: none`) unless we explicitly tell it to keep them using `displayed_only=False`. 

```python
url = 'https://en.wikipedia.org/wiki/Rwanda_men%27s_national_basketball_team'

roster = pd.read_html(url, 
                      match="Rwanda men's national basketball team roster", 
                      skiprows=1, 
                      header=2,
                      displayed_only=False)[0]
roster.head()
```


In [5]:
url = 'https://en.wikipedia.org/wiki/Rwanda_men%27s_national_basketball_team'

roster = pd.read_html(url, 
                      match="Rwanda men's national basketball team roster", 
                      skiprows=1, 
                      header=2,
                      displayed_only=False)[0]
roster.head()

|  | Pos. | No. | Name | Age – Date of birth | Height | Club | Ctr. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | PG | 4 | Jean Nshobozwabyose | 23 – (1998-06-26)26 June 1998 | 1.83 m (6 ft 0 in) | Patriots |  |
| 1 | G | 5 | Ntore Habimana | 24 – (1997-08-15)15 August 1997 | 1.96 m (6 ft 5 in) | Wilfrid Laurier Golden Hawks |  |
| 2 | SG | 6 | Steven Hagumintwari | 27 – (1993-10-01)1 October 1993 | 1.93 m (6 ft 4 in) | Patriots |  |
| 3 | SG | 7 | Armel Sangwe | 24 – (1997-04-15)15 April 1997 | 1.90 m (6 ft 3 in) | Espoir |  |
| 4 | SG | 8 | Emile Kazeneza | 20 – (2000-08-30)30 August 2000 | 2.01 m (6 ft 7 in) | William Carey University |  |


There we go, now we've got all the data we want from the table. Unfortunately, as Wikipedia has used images rather than text to represent the countries of the players, we're unable to scrape them using pandas. 

We'll look at other methods to get this data later.

### Limitations of Pandas for web scraping

Obviously we've seen some of the limitations already, notably pandas not being able to parse images and collecting tables that aren't relevant to our intended goal. However, in most web pages the data we want to scrape won't be formatted into a nice table for us. If it isn't in `<table>` tags then we won't be able to scrape it using pandas. 

- Good for websites with predefined tables 
- Won't collect information that isn't text

There are lots of other methods for accessing that data, but first we need to understand a little of how websites function. 

## How do web sites work?

Now that we understand the structure of a web page, we can see how it might be extremely tedious to create every individual web page, especially if we want to include regularly changing data. 

That's why most web pages are created **dynamically**. This means that the web page is put together **on-the-fly** whenever someone requests to see it. 

### Client-side versus Server-side scripting

Web pages are usually generated in one of two ways: via **client-side** scripting or via **server-side** scripting. The difference is where the data gets turned into HTML elements. With **client-side** scripting, the raw data is sent directly to our browser and our computer creates the web page. With **server-side** scripting, we never see the raw data, only the computed HTML elements.

<table>
    <tr><th>Client-side Scripting</th><th>Server-side scripting</th></tr>
    <tr><td><img src="./imgs/client_side.png" width=400></td><td><img src="./imgs/server_side.png" width=400></td></tr>
    <tr><td>Data usually processed with JavaScript</td><td>Data can be processed with PHP, JavaScript, Python etc.</td></tr>
    <tr><td>Is possible to see the underlying data</td><td>Is not possible to see the underlying data</td></tr>
</table>
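The difference can be sketched in a few lines of Python. The example below is purely illustrative: the data and the function names `render_server_side` and `respond_client_side` are invented, and a real site would use a web framework rather than plain functions. It shows the two kinds of response a scraper might receive for the same underlying data.

```python
import json
from html import escape

# Invented example data held by the web server.
rows = [
    {'fruit': 'Apple', 'count': 3},
    {'fruit': 'Banana', 'count': 5},
]


def render_server_side(rows):
    """Server-side: the data is turned into HTML before it is sent."""
    cells = ''.join(
        f"<tr><td>{escape(row['fruit'])}</td><td>{row['count']}</td></tr>"
        for row in rows
    )
    return f'<table><tr><th>Fruit</th><th>Count</th></tr>{cells}</table>'


def respond_client_side(rows):
    """Client-side: the raw data is sent and the browser builds the HTML."""
    return json.dumps(rows)


html_response = render_server_side(rows)
json_response = respond_client_side(rows)

print(html_response)
print(json_response)

# The JSON response loads straight back into Python objects,
# whereas the HTML response would first need to be parsed.
recovered = json.loads(json_response)
print(recovered == rows)
```

This is why finding the raw data behind a client-side page is so convenient for scraping: there is no `HTML` to strip away.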


### Inspecting a web page's creation

We've already looked at a web page's source by using `View page source`. There is a more advanced tool for working with web pages built into most browsers, usually called `Inspect` (`Right-Click > Inspect`). Let's inspect the Wikipedia page for the [Rwandan Men's National Basketball Team](https://en.wikipedia.org/wiki/Rwanda_men%27s_national_basketball_team).

We'll come back to the `Elements` tab later; for now we want to look at the `Network` tab on the toolbar.
<center><img src="./imgs/inspect.png" width=400></center>

The network tab records all the requests that go between our browser and the server (as well as other servers) in the production of the web page. When you first open the page it will be blank. Refreshing the browser page will cause the network tab to record all the different requests that occur. 

<table>
    <tr><td><img src="./imgs/network.png" width=400></td><td><img src="./imgs/network_recorded.png" width=400></td></tr>
</table>

Clicking on any one of the files that has been requested, you can see the full HTTP request (more on this later) as well as a preview and actual full response from the server for that request. Looking at the response for the first request (the page) we can see that the data was included directly in the page as HTML. This implies that this particular page was processed **server-side**. Another clue was the reference to `php` which is an exclusively server-side language. 

<table><tr>
    <td><img src="./imgs/network_preview.png" width=400></td>
    <td><img src="./imgs/serverside_table.png" width=500></td>
    </tr>
</table>

#### Client Side Example

Let's look at the [NBA website](https://www.nba.com/stats/leaders/?SeasonType=Regular%20Season) instead. This page shows us statistics for the regular season for players in the NBA, ordered by the number of points.

<center><img src="./imgs/nba_scoreboard.png" width=600></center>

We could try to scrape this data using `pandas`, but let's see if we can find the source of this data first. Opening up the `Inspect` tool, we can look at the `Network` tab to try to find where this data is loaded from. 

<center><img src="./imgs/inspect_fetch.png" width=500></center>

There are **a lot** of files that are loaded as part of this web page. We can reduce the number we need to search through by using the built-in filters on the `Network` tab. Let's look at `Fetch/XHR`, which filters the list to requests usually associated with data. 

Looking through this shorter list of files, one stands out as potentially containing the data that we want to extract from the web page. 

We can click on the file and then the `Response` tab to see what information is sent to our browser from this file. Looking at the response, we can see that the data that goes into our table is not encoded as `HTML`, so we can be relatively sure that the web page is generated at least partly on the client side.  

<table>
    <tr><td><img src="./imgs/inspect_datarequest.png" width=400></td>
        <td><img src="./imgs/inspect_data.png" width=600></td>
    </tr>
</table>


### Key takeaways 

* If a website is processed **client** side then it may be possible to get at the data that creates the web page **without** having to parse `HTML`
* However, some web scraping programs won't be able to execute the client-side code, meaning we have to use a web browser. 
* If a website is processed **server** side then it **is not** possible to get the data without having to parse the `HTML` served. 

## Requests Library

The `requests` library is the de-facto standard for making `HTTP requests` in Python; it abstracts away all of the complexity we just saw using the `Inspect` tool. `requests` is a third-party package rather than part of the Python standard library. It comes pre-installed with distributions such as Anaconda; otherwise it can be installed with `pip install requests`. 

The `requests` library is very powerful, but importantly we can use it to do in Python what our web browser was doing when it loaded in our data. 

Let's return to our [NBA example](https://www.nba.com/stats/leaders/?SeasonType=Regular%20Season). Looking at the `Network` tab shows us all of the `HTTP` requests that have been made in the process of creating the web page that we see.

If we look in the `Headers` tab, we can see the form that this HTTP request took.

The `URL` has the request information encoded into it, and we can also see that the request type is `GET`. 

<center><img src="./imgs/inspect_header.png" width=400></center>

Let's see what happens if we recreate that request in Python using the `requests` library. First we need to get the request URL from the `Headers` tab. We also need to note that the method is `GET`. 

```python
import requests

url = 'https://stats.nba.com/stats/leagueLeaders?LeagueID=00&PerMode=PerGame&Scope=S&Season=2021-22&SeasonType=Regular+Season&StatCategory=PTS'

# We are using the .get method to match the GET HTTP request 
# we also include the .json() method to return to us the response 
# from the request as a python dictionary.
response = requests.get(url).json()

print(response)
```

In [2]:
import requests

url = 'https://stats.nba.com/stats/leagueLeaders?LeagueID=00&PerMode=PerGame&Scope=S&Season=2021-22&SeasonType=Regular+Season&StatCategory=PTS'

# We are using the .get method to match the GET HTTP request 
# we also include the .json() method to return to us the response 
# from the request as a python dictionary.
response = requests.get(url).json()

We can see that the result of that command is the same as the data we saw using the `Inspect` tool. We can look through this nested dictionary object to try and understand the structure of the response. **It is important to note that not every response will look the same**. You'll need to dig into each response to extract the data as and when.  

We can look through the object and see if there is a way to convert information into a table that we can use. 

```python 
print(response.keys())
print(response['resultSet'].keys())
```

In [7]:
print(response.keys())
print(response['resultSet'].keys())

dict_keys(['resource', 'parameters', 'resultSet'])
dict_keys(['name', 'headers', 'rowSet'])


Looking at the keys in the data, we can see that the response contains three objects called `resource`, `parameters` and `resultSet`. `resource` and `parameters` are metadata about the table that we've just requested. `resultSet` contains another dictionary with the keys `name`, `headers` and `rowSet`. `rowSet` is a list of lists, each representing a row of data, and `headers` contains a list of column headers. 
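To make the relationship between `headers` and `rowSet` concrete, here is a minimal sketch using a heavily trimmed sample in the same shape as `response['resultSet']`. The sample keeps only five of the columns and two of the rows (the values are taken from the output shown later in this section), so it can be run without making a request.

```python
# A trimmed stand-in for response['resultSet']: only five columns and
# two rows are kept so that the structure is easy to see.
result_set = {
    'headers': ['PLAYER_ID', 'RANK', 'PLAYER', 'TEAM', 'PTS'],
    'rowSet': [
        [201142, 1, 'Kevin Durant', 'BKN', 29.5],
        [201939, 2, 'Stephen Curry', 'GSW', 27.4],
    ],
}

# Pair each value in a row with the header in the same position.
records = [
    dict(zip(result_set['headers'], row))
    for row in result_set['rowSet']
]

for record in records:
    print(record['PLAYER'], record['TEAM'], record['PTS'])
```

Each row only makes sense alongside the list of headers, which is why both pieces are needed to rebuild the table.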

We can put these together using `pandas` into a dataframe very easily. 

```python 
import requests
import pandas as pd 

url = 'https://stats.nba.com/stats/leagueLeaders?LeagueID=00&PerMode=PerGame&Scope=S&Season=2021-22&SeasonType=Regular+Season&StatCategory=PTS'

response = requests.get(url).json()
table_headers = response['resultSet']['headers']
table_data = response['resultSet']['rowSet']

df = pd.DataFrame(table_data, columns=table_headers)
df
```

In [8]:
import requests
import pandas as pd 

url = 'https://stats.nba.com/stats/leagueLeaders?LeagueID=00&PerMode=PerGame&Scope=S&Season=2021-22&SeasonType=Regular+Season&StatCategory=PTS'

response = requests.get(url).json()
table_headers = response['resultSet']['headers']
table_data = response['resultSet']['rowSet']

df = pd.DataFrame(table_data, columns=table_headers)
df

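|  | PLAYER_ID | RANK | PLAYER | TEAM | GP | MIN | FGM | FGA | FG_PCT | FG3M | ... | FT_PCT | OREB | DREB | REB | AST | STL | BLK | TOV | PTS | EFF |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 201142 | 1 | Kevin Durant | BKN | 12 | 34.4 | 11.2 | 19.1 | 0.585 | 1.9 | ... | 0.829 | 0.5 | 8.0 | 8.5 | 5.0 | 0.6 | 0.7 | 3.5 | 29.5 | 31.8 |
| 1 | 201939 | 2 | Stephen Curry | GSW | 11 | 33.6 | 8.6 | 19.9 | 0.434 | 5.0 | ... | 0.949 | 0.8 | 5.6 | 6.5 | 6.5 | 1.6 | 0.6 | 3.1 | 27.4 | 28.0 |
| 2 | 202331 | 3 | Paul George | LAC | 11 | 35.3 | 10.0 | 21.9 | 0.456 | 3.2 | ... | 0.867 | 0.5 | 7.3 | 7.8 | 5.4 | 2.5 | 0.5 | 4.5 | 26.7 | 26.0 |
| 3 | 203507 | 4 | Giannis Antetokounmpo | MIL | 12 | 32.9 | 9.5 | 19.2 | 0.496 | 1.3 | ... | 0.688 | 1.9 | 9.9 | 11.8 | 6.0 | 1.1 | 1.8 | 3.0 | 26.6 | 31.8 |
| 4 | 1629630 | 5 | Ja Morant | MEM | 11 | 35.3 | 10.0 | 20.6 | 0.485 | 1.7 | ... | 0.779 | 1.3 | 4.5 | 5.7 | 7.3 | 1.7 | 0.3 | 4.0 | 26.5 | 25.5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 274 | 1629216 | 275 | Gabe Vincent | MIA | 10 | 8.9 | 0.8 | 2.0 | 0.400 | 0.1 | ... | 1.000 | 0.3 | 0.5 | 0.8 | 1.7 | 0.2 | 0.0 | 0.6 | 1.9 | 2.8 |
| 275 | 203085 | 276 | Austin Rivers | DEN | 9 | 12.4 | 0.8 | 3.0 | 0.259 | 0.2 | ... | 0.500 | 0.4 | 0.7 | 1.1 | 0.8 | 0.3 | 0.1 | 0.7 | 1.9 | 1.2 |
| 276 | 1630541 | 276 | Moses Moody | GSW | 9 | 6.8 | 0.8 | 2.0 | 0.389 | 0.2 | ... | 0.500 | 0.0 | 0.9 | 0.9 | 0.3 | 0.0 | 0.1 | 0.1 | 1.9 | 1.8 |
| 277 | 1626161 | 278 | Willie Cauley-Stein | DAL | 11 | 10.0 | 0.7 | 1.7 | 0.421 | 0.0 | ... | 0.000 | 0.7 | 1.6 | 2.4 | 0.5 | 0.3 | 0.0 | 0.2 | 1.5 | 3.4 |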


If we look closer at the URL, we can see it encodes a lot of arguments. These arguments look very similar to the filters that are available on the web page. 

```
https://stats.nba.com/stats/leagueLeaders?
    LeagueID=00&
    PerMode=PerGame&
    Scope=S&
    Season=2021-22&
    SeasonType=Regular+Season&
    StatCategory=PTS
```

<img src='./imgs/nba_filters.png'>

If we change "PerGame" to "Totals" and re-run our code, we should get the data that would populate the website's table had we selected that option. What we've done here is discover the `API` that sits behind the NBA website, and we can exploit this to extract data. 
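Rather than editing the URL by hand, the arguments can be kept in a dictionary and encoded when needed. The sketch below uses only `urllib.parse` from the standard library and does not send any request. `build_leaders_url` is a hypothetical helper written for this illustration; the parameter names and values are the ones visible in the URL above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = 'https://stats.nba.com/stats/leagueLeaders'


def build_leaders_url(per_mode='PerGame', season='2021-22', stat_category='PTS'):
    """Build the request URL from a dictionary of query arguments."""
    params = {
        'LeagueID': '00',
        'PerMode': per_mode,
        'Scope': 'S',
        'Season': season,
        'SeasonType': 'Regular Season',  # the space is encoded as "+"
        'StatCategory': stat_category,
    }
    return f'{BASE_URL}?{urlencode(params)}'


totals_url = build_leaders_url(per_mode='Totals')
print(totals_url)

# Going the other way: decode the query string back into its arguments.
decoded = parse_qs(urlsplit(totals_url).query)
print(decoded['PerMode'], decoded['SeasonType'])
```

The `requests` library can also do this encoding itself if the dictionary is passed through the `params` argument of `requests.get`, which avoids building the query string manually.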

In [9]:
import requests
import pandas as pd 

per_mode = 'Totals'

url = f'https://stats.nba.com/stats/leagueLeaders?LeagueID=00&PerMode={per_mode}&Scope=S&Season=2021-22&SeasonType=Regular+Season&StatCategory=PTS'

response = requests.get(url).json()
table_headers = response['resultSet']['headers']
table_data = response['resultSet']['rowSet']

df = pd.DataFrame(table_data, columns=table_headers)
df

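|  | PLAYER_ID | RANK | PLAYER | TEAM | GP | MIN | FGM | FGA | FG_PCT | FG3M | ... | REB | AST | STL | BLK | TOV | PF | PTS | EFF | AST_TOV | STL_TOV |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 201142 | 1 | Kevin Durant | BKN | 12 | 413 | 134 | 229 | 0.585 | 23 | ... | 102 | 60 | 7 | 8 | 42 | 16 | 354 | 381 | 1.43 | 0.17 |
| 1 | 203507 | 2 | Giannis Antetokounmpo | MIL | 12 | 395 | 114 | 230 | 0.496 | 16 | ... | 142 | 72 | 13 | 21 | 36 | 36 | 319 | 381 | 2.00 | 0.36 |
| 2 | 201939 | 3 | Stephen Curry | GSW | 11 | 370 | 95 | 219 | 0.434 | 55 | ... | 71 | 72 | 18 | 7 | 34 | 17 | 301 | 308 | 2.12 | 0.53 |
| 3 | 202331 | 4 | Paul George | LAC | 11 | 388 | 110 | 241 | 0.456 | 35 | ... | 86 | 59 | 28 | 5 | 49 | 31 | 294 | 286 | 1.20 | 0.57 |
| 4 | 1629630 | 5 | Ja Morant | MEM | 11 | 388 | 110 | 227 | 0.485 | 19 | ... | 63 | 80 | 19 | 3 | 44 | 16 | 292 | 281 | 1.82 | 0.43 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 443 | 1630536 | 418 | Sharife Cooper | ATL | 2 | 7 | 0 | 3 | 0.000 | 0 | ... | 0 | 2 | 0 | 0 | 1 | 0 | 0 | -2 | 2.00 | 0.00 |
| 444 | 1629605 | 418 | Tacko Fall | CLE | 3 | 3 | 0 | 1 | 0.000 | 0 | ... | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0.00 | 0.00 |
| 445 | 1628962 | 418 | Udoka Azubuike | UTA | 2 | 2 | 0 | 0 | 0.000 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.00 |
| 446 | 1630176 | 418 | Vernon Carey Jr. | CHA | 1 | 1 | 0 | 1 | 0.000 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.00 |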


## Beautiful Soup Library

Sometimes, in fact most of the time, the information that we want to scrape won't be found neatly formatted into a table. What we need is a way to extract the relevant information programmatically from non-table objects. Enter `beautifulsoup`. `beautifulsoup` is an `HTML` parsing library for Python; it allows us to pull out all the relevant information from a web page using a nice, easy-to-use syntax.

`beautifulsoup` does not come as part of the standard Python installation, so we need to `pip` install it. We can do this inside our Jupyter notebook using 

```
!pip install beautifulsoup4
```

Or just on the command line by running the same command, without the `!` at the beginning of the line. 

In [10]:
!pip install beautifulsoup4



Once we've installed Beautiful Soup, we can start to use it to parse our HTML data. Let's start again by parsing the web page that we made earlier. 

```python
from bs4 import BeautifulSoup 

with open('./my_complex_webpage.html', 'r') as f:
    soup = BeautifulSoup(f, 'html.parser')

print(soup)
```

In [11]:
from bs4 import BeautifulSoup 

with open('./assets/my_complex_webpage_with_css_classes.html', 'r') as f:
    soup = BeautifulSoup(f, 'html.parser')

print(str(soup)[:300] + '\n...')

<html>
<head>
<link href="style.css" rel="stylesheet"/>
</head>
<body>
<h1>My Complex Webpage</h1>
<p class="red_text">This is my more complex webpage with additional elements</p>
<a href="https://www.statistics.gov.rw">This is a link to https://www.statistics.gov.rw</a>
<p>Below here is the NISR lo
...


`beautifulsoup` has lots of functions that make it very easy to extract information from an HTML page, the most useful of which is the `find_all()` method. Full documentation for the `find_all` method can be found [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all).

Earlier we were able to use pandas to extract the HTML table very easily, but what if we were more interested in the "Unordered list of fruits"? We can use the `find_all` method to retrieve all of the list item `<li>` tags. 

```python
soup.find_all('li')
```

In [12]:
soup.find_all('li')

[<li class="red_text">Apple</li>, <li>Banana</li>]

We've successfully extracted all of the `<li>` tags, but our data still isn't very clean. We're not interested in the `HTML` tag, just the data contained within. We can deal with this by using `beautifulsoup` to strip out our HTML tags. 

```python 
# We can do this with a loop
for tag in soup.find_all('li'):
    print(tag.get_text())

# Or by using a list comprehension
[tag.get_text() for tag in soup.find_all('li')]
```

In [13]:
# We can do this with a loop
for tag in soup.find_all('li'):
    print(tag.get_text())

# Or by using a list comprehension
[tag.get_text() for tag in soup.find_all('li')]

Apple
Banana


['Apple', 'Banana']

Success! However, it is common that not all the information we want to extract shares the same `<tag>`, or that lots of irrelevant information shares that `<tag>` too. Fortunately, when people design web pages they tend to give similar information the same visual appearance. We know that visual appearance is controlled by `css`, and using `beautifulsoup` we can extract data by `css` class!

```python
for red_text in soup.find_all(class_="red_text"):
    print(red_text)
    
[red_text.get_text() for red_text in soup.find_all(class_='red_text')]
```

In [14]:
for red_text in soup.find_all(class_="red_text"):
    print(red_text)
    
[red_text.get_text() for red_text in soup.find_all(class_='red_text')]

<p class="red_text">This is my more complex webpage with additional elements</p>
<h2 class="red_text">This is an unordered list of fruits</h2>
<li class="red_text">Apple</li>


['This is my more complex webpage with additional elements',
 'This is an unordered list of fruits',
 'Apple']

### Real world example

Let's go back to our Wikipedia example. If you remember, we were able to use pandas to scrape the table, but weren't able to get the country information because it wasn't stored as plain text. We can use `beautifulsoup` to parse out that information with much finer control. 

First we need to get the `HTML` that generates that Wikipedia page. We can do this using our trusty `requests` library. 

```python 
import requests

URL = 'https://en.wikipedia.org/wiki/Rwanda_men%27s_national_basketball_team'
wiki_page = requests.get(URL)
print(wiki_page)
```

In [15]:
import requests

URL = 'https://en.wikipedia.org/wiki/Rwanda_men%27s_national_basketball_team'
wiki_page = requests.get(URL)
print(wiki_page)

<Response [200]>


Oh, this is just a response code, not the `HTML` that we were expecting. Fortunately `Response [200]` means that the request **successfully** executed. In order to get the `HTML` we need to use the `.text` attribute.

```python 
import requests

URL = 'https://en.wikipedia.org/wiki/Rwanda_men%27s_national_basketball_team'
wiki_page = requests.get(URL).text
print(wiki_page)
```

In [16]:
import requests

URL = 'https://en.wikipedia.org/wiki/Rwanda_men%27s_national_basketball_team'
wiki_page = requests.get(URL).text
print(wiki_page[:200]+'\n...')

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Rwanda men's national basketball team - Wikipedia</title>
<script>document.documentElement.classNam
...
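A short aside on the `200` in `<Response [200]>` seen earlier: it is a standard HTTP status code, and Python's standard library knows the meaning of each one. The sketch below looks up a few common codes without making any request. `explain` is a hypothetical helper name, and treating only codes in the 200 range as successful is a simplification.

```python
from http import HTTPStatus


def explain(code):
    """Return the code, its standard phrase and whether it signals success."""
    status = HTTPStatus(code)
    kind = 'success' if 200 <= code < 300 else 'not a success'
    return f'{code} {status.phrase} ({kind})'


for code in (200, 404, 500):
    print(explain(code))
```

With `requests`, the same number is available as the `status_code` attribute of the response, and calling `raise_for_status()` on the response raises an error for codes in the 400 and 500 ranges, which is a useful check before parsing the `HTML`.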


Now we can parse this `HTML` code with Beautiful Soup. As we're only interested in the roster table, we can filter out all the `HTML` that isn't related to it. 

```python
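from bs4 import BeautifulSoup

# Parse the HTML text we downloaded above (wiki_page) with Beautiful Soup
soup = BeautifulSoup(wiki_page, 'html.parser')

tables = soup.find_all('table')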
print(f'Found {len(tables)} tables.\n')

# Filter list of tables to just those that contain a country 
# column called Ctr.
country_tables = [tbl for tbl in tables if 'Ctr.' in str(tbl)]

# This is more complex html than usual, there is a table in a table, so we need to 
# select the second country_table which represents the inner table.
roster_html = country_tables[1]
```

In [17]:
import requests
from bs4 import BeautifulSoup

URL = 'https://en.wikipedia.org/wiki/Rwanda_men%27s_national_basketball_team'
wiki_page = requests.get(URL).text
soup = BeautifulSoup(wiki_page)

tables = soup.find_all('table')
print(f'Found {len(tables)} tables.\n')

# Filter list of tables to just those that contain a country 
# column called Ctr.
country_tables = [tbl for tbl in tables if 'Ctr.' in str(tbl)]

# This is more complex html than usual, there is a table in a table, so we need to 
# select the second country_table which represents the inner table.

roster_html = country_tables[1]
print(str(roster_html)[:300] + '\n...')

Found 13 tables.

<table class="sortable" style="background:transparent; margin:0px; width:100%;">
<tbody><tr>
<th><abbr title="Position(s)">Pos.</abbr></th>
<th><abbr title="Number">No.</abbr></th>
<th>Name</th>
<th>Age – <small>Date of birth</small></th>
<th>Height</th>
<th>Club</th>
<th><abbr title="Country">Ctr.<
...


Try parsing the `roster_html` Beautiful Soup object into a `pandas` dataframe.

```python
import pandas as pd

# Use a list comprehension to look for all the <th> tags, for each 
# one, get the text and strip the result. These are the column headers
# for the table.
header = [col.get_text().strip() for col in roster_html.find_all('th')]

# Create an empty list to store our processed rows.
rows = []
# Loop over all of the <tr> tags, each one corresponds to a row
# in our table. Skip the first row as it contains the <th> tags.
for tr in roster_html.find_all('tr')[1:]:
    # Create an empty row variable where we can store all of our processed
    # data
    row = []
    # Loop over all of the <td> tags inside the current <tr> tag. These are 
    # going to be our data items.
    for data in tr.find_all('td'):
        
        # If the data item isn't blank (or just a new line character)
        # then add it to our row, stripping out the excess whitespace
        if data.get_text() != '\n':
            row.append(data.get_text().strip())
            
        # If there is an <img> tag in the <td> tag then we're on our 
        # flag column. We want to extract the country information. 
        # We could extract this from the image, but all the images are 
        # wrapped in a <a> hyperlink tag to that country, which will be
        # easier to clean. 
        if data.find('img') is not None:
            # Get the <a> hyperlink tag
            img = data.find('a')
            # Add the href attribute (this is the link address) to our row
            row.append(img['href'])
    
    # Finally add the row into our list of rows.
    rows.append(row)
    
# Construct a dataframe from our list of rows and our header data
df = pd.DataFrame(rows, columns=header)
df
```

In [18]:
# Use a list comprehension to look for all the <th> tags, for each 
# one, get the text and strip the result. These are the column headers
# for the table.
header = [col.get_text().strip() for col in roster_html.find_all('th')]

# Create an empty list to store our processed rows.
rows = []
# Loop over all of the <tr> tags, each one corresponds to a row
# in our table. Skip the first row as it contains the <th> tags.
for tr in roster_html.find_all('tr')[1:]:
    # Create an empty row variable where we can store all of our processed
    # data
    row = []
    # Loop over all of the <td> tags inside the current <tr> tag. These are 
    # going to be our data items.
    for data in tr.find_all('td'):
        
        # If the data item isn't blank (or just a new line character)
        # then add it to our row, stripping out the excess whitespace
        if data.get_text() != '\n':
            row.append(data.get_text().strip())
            
        # If there is an <img> tag in the <td> tag then we're on our 
        # flag column. We want to extract the country information. 
        # We could extract this from the image, but all the images are 
        # wrapped in a <a> hyperlink tag to that country, which will be
        # easier to clean. 
        if data.find('img') is not None:
            # Get the <a> hyperlink tag
            img = data.find('a')
            # Add the href attribute (this is the link address) to our row
            row.append(img['href'])
    
    # Finally add the row into our list of rows.
    rows.append(row)
    
# Construct a dataframe from our list of rows and our header data
df = pd.DataFrame(rows, columns=header)
df

| | Pos. | No. | Name | Age – Date of birth | Height | Club | Ctr. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | PG | 4 | Jean Nshobozwabyose | 23 – (1998-06-26)26 June 1998 | 1.83 m (6 ft 0 in) | Patriots | /wiki/Rwanda |
| 1 | G | 5 | Ntore Habimana | 24 – (1997-08-15)15 August 1997 | 1.96 m (6 ft 5 in) | Wilfrid Laurier Golden Hawks | /wiki/Canada |
| 2 | SG | 6 | Steven Hagumintwari | 27 – (1993-10-01)1 October 1993 | 1.93 m (6 ft 4 in) | Patriots | /wiki/Rwanda |
| 3 | SG | 7 | Armel Sangwe | 24 – (1997-04-15)15 April 1997 | 1.90 m (6 ft 3 in) | Espoir | /wiki/Rwanda |
| 4 | SG | 8 | Emile Kazeneza | 20 – (2000-08-30)30 August 2000 | 2.01 m (6 ft 7 in) | William Carey University | /wiki/United_States |
| 5 | SG | 9 | Dieudonné Ndizeye | 24 – (1996-10-14)14 October 1996 | 1.98 m (6 ft 6 in) | Patriots | /wiki/Rwanda |
| 6 | PF | 10 | Olivier Shyaka | 26 – (1995-08-14)14 August 1995 | 2.00 m (6 ft 7 in) | REG | /wiki/Rwanda |
| 7 | F | 11 | Alex Mpoyo | 24 – (1997-01-05)5 January 1997 | 2.01 m (6 ft 7 in) | Trepça | /wiki/Kosovo |
| 8 | SG | 12 | Kenny Gasana | 36 – (1984-11-09)9 November 1984 | 1.90 m (6 ft 3 in) | Patriots | /wiki/Rwanda |
| 9 | C | 13 | Elie Kaje | 26 – (1995-03-17)17 March 1995 | 1.90 m (6 ft 3 in) | Patriots | /wiki/Rwanda |
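
The scraped `Ctr.` and `Age – Date of birth` values still need a little cleaning. The sketch below shows one possible way to do that using only the standard library. The helper names `country_from_href` and `birth_date_from_cell` are hypothetical, and the sample strings are copied from the output above.

```python
import re
from datetime import date
from urllib.parse import unquote


def country_from_href(href):
    """Turn '/wiki/United_States' into 'United States'."""
    # Keep the part after the last slash, decode any %XX escapes,
    # then swap underscores for spaces.
    return unquote(href.rsplit('/', 1)[-1]).replace('_', ' ')


def birth_date_from_cell(text):
    """Pull the bracketed ISO date out of an 'Age – Date of birth' cell."""
    match = re.search(r'\((\d{4})-(\d{2})-(\d{2})\)', text)
    if match is None:
        # No bracketed date in this cell.
        return None
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)


print(country_from_href('/wiki/United_States'))
print(birth_date_from_cell('23 – (1998-06-26)26 June 1998'))
```

Applied to the dataframe, something like `df['Ctr.'].map(country_from_href)` would clean the whole column in one step.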


### Non-tabular data 

Let's look at less tabular data. This is the [fiba.basketball news page](https://www.fiba.basketball/afrobasket/2021/qualifiers/news). We can see there is a list of news articles, each with a headline, a date and a small blurb. To start, let's inspect one of the items and see if there is anything common to all of them that we can select on.

<table>
    <tr>
        <td><img src="./imgs/fibanews.png" width=400></td>
        <td><img src="./imgs/fibanews_class.png" width=400></td>
    </tr>
</table>

All the additional news articles are in `<div>` tags with the class `related_row`. 

In [19]:
import requests
from bs4 import BeautifulSoup

URL = 'https://www.fiba.basketball/afrobasket/2021/qualifiers/news'

fiba_news = requests.get(URL).text
soup = BeautifulSoup(fiba_news)

soup.find_all(class_='related_row')[0]

<div class="related_row">
<a href="http://www.fiba.basketball/afrobasket/2021/qualifiers/news/enabu-rwahwire-reflect-on-uganda-tenacity-ahead-of-morocco-cracker">
<div class="related_top right">
<div class="date_highlighted">07/07/2021</div>
<div class="category" style="background-color: #000000;">News</div>
<h6>Enabu, Rwahwire reflect on Uganda tenacity ahead of Morocco cracker</h6>
</div>
<div class="related_image left adaptive_image" data-adaptive-image-breakpoints="{ default: '/images.fiba.com/Graphic/2/F/7/8/8iqS79S7ukebtkZoyPWr0Q.jpg?v=20210113123220303', 480: '/images.fiba.com/Graphic/F/3/0/5/b8Zt49T0X0CHF0MVS2z46Q.jpg?v=20210113123217152' }" data-adaptive-image-extra-attrs="{ alt: '5 Jimmy Enabu (UGA)' }">
</div>
<div class="related_bottom right">
<p>SALE (Morocco) - Qualifying for the FIBA AfroBasket is a lifetime dream for many and for Uganda captain Jimmy Enabu, it is a befitting reward to a diligent servant of the game back home. </p>
</div>
</a>
</div>

We can see that all the information we want is stored inside this `<div>` tag with the class `related_row`. There is a `<div>` inside with the class `date_highlighted` that contains the date, one with the class `category` that contains the article category information. We can see the title of the article is wrapped in header `<h6>` tags and the blurb of the article is the only `<p>` tag within the `<div>`. 

Using all this we can write a very simple loop to go through all of the `related_row` objects and pull out the pertinent information using the exact same methods we've already used.

Try parsing the FIBA news page into a pandas dataframe.

In [20]:
import requests
import pandas as pd 
from bs4 import BeautifulSoup

# Set the URL
URL = 'https://www.fiba.basketball/afrobasket/2021/qualifiers/news'

# Download the webpage
fiba_news = requests.get(URL).text
# Convert the web page into a beautifulsoup object
soup = BeautifulSoup(fiba_news)

# Create an empty list to store our extracted information
news_items = []

# Loop over all the related_row classed divs
for news_html in soup.find_all(class_='related_row'):
    
    # For each related_row class:
    # Extract the date using the date_highlighted
    date = news_html.find(class_='date_highlighted').get_text()
    # Extract the category using the category class 
    category = news_html.find(class_='category').get_text()
    # Extract the headline using the <h6> tag
    headline = news_html.find('h6').get_text()
    # Extract the blurb using the <p> tags
    blurb = news_html.find('p').get_text()
    
    # Create a dictionary of all the data extracted
    data_extracted = dict(date=date, headline=headline, category=category, blurb=blurb)
    
    # Add the dictionary to the list to store extracted information for each article
    news_items.append(data_extracted)

# Convert the list of dictionaries into a pandas dataframe
df = pd.DataFrame(news_items)
df

| | date | headline | category | blurb |
| --- | --- | --- | --- | --- |
| 0 | 07/07/2021 | Enabu, Rwahwire reflect on Uganda tenacity ahe... | News | SALE (Morocco) - Qualifying for the FIBA AfroB... |
| 1 | 11/03/2021 | Madagascar's Botou wants to keep the AfroBaske... | News | ANTANANARIVO (Madagascar) – Despite having mis... |
| 2 | 10/03/2021 | 10 standout performers from the last window of... | News | ABIDJAN - As we look back at the Second Round ... |
| 3 | 08/03/2021 | Guinea celebrate second straight AfroBasket ap... | News | Guinea left Cameroon at the end of the FIBA Af... |
| 4 | 02/03/2021 | Decisions concerning the February window of th... | News | FIBA has taken decisions regarding Equatorial ... |
| 5 | 26/02/2021 | Impressive operational efforts in FIBA Contine... | News | MIES (Switzerland) - Another successful window... |
| 6 | 24/02/2021 | "Senegal have some room for improvement," says... | News | DAKAR (Senegal) - Senegal finished top of Grou... |
| 7 | 23/02/2021 | History Makers Kenya's confidence is sky-high | News | By beating eleven-time Africa champions Angola... |
| 8 | 21/02/2021 | Four teams undefeated at the end of AfroBasket... | Review | MONASTIR/YAOUNDE (Tunisia/Cameroon) - The 20-t... |
| 9 | 21/02/2021 | Top performers at Day 3 in Yaounde | News | YAOUNDE (Cameroon) - There was great frenzy at... |
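
The scraped dates are plain strings, so they cannot be sorted or filtered as dates yet. As a small sketch, the block below parses a few of the date strings from the output above with the standard library. It assumes the site writes dates day first, which values such as `26/02/2021` suggest.

```python
from datetime import datetime

# A few of the date strings shown in the output above.
scraped_dates = ['07/07/2021', '11/03/2021', '26/02/2021']

# Assumption: dates are written day/month/year, so the format
# string reads the day first.
parsed = [datetime.strptime(text, '%d/%m/%Y').date() for text in scraped_dates]

# Once parsed, the values compare as real dates rather than as text.
oldest = min(parsed)
print(oldest.isoformat())
```

In `pandas` the same conversion can be done on the whole column with `pd.to_datetime(df['date'], format='%d/%m/%Y')`.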


## Selenium Library

Let's look at one final example piece of `HTML`. In this one we use some basic `javascript` to add elements to the page.

```html
<html>
    <head>
    <script type='text/javascript'>
        window.onload = function(){
            for(i=0; i<5; i++){
                var paragraph = document.createElement('p');
                paragraph.innerHTML = 'This is paragraph '+i;
                document.body.appendChild(paragraph);
            }
        }
    </script>
    </head>
    <body>
    </body>
</html>
```

If we open this `.html` file up in a web browser we get what we'd expect which is 5 paragraph objects labeled 0 to 4. 

<center><img src="./imgs/dynamic_javascript.png"></center>

But when we try our usual approach of loading this file into `beautifulsoup`, we get the following result.

```python
from bs4 import BeautifulSoup

with open('./dynamic_javascript.html', 'r') as f:
    html = f.read()
    
soup = BeautifulSoup(html)
soup.find_all('p')
```

In [21]:
from bs4 import BeautifulSoup

with open('./assets/dynamic_javascript.html', 'r') as f:
    html = f.read()
    
soup = BeautifulSoup(html)
soup.find_all('p')

[]

This is because the `<p>` tags only appear on the page once something has executed the javascript that generates them. `BeautifulSoup` is just an `HTML` parser; it isn't able to execute javascript that is stored in the `.html` file.
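
The same behaviour can be reproduced with any parser that only reads the static markup. The sketch below swaps in the standard library's `html.parser.HTMLParser` in place of `BeautifulSoup` (so it runs without extra packages) and counts the tags it sees in the page source. The class name `TagCounter` is hypothetical.

```python
from html.parser import HTMLParser

# The same page as above: the <p> tags only exist once the script has run.
PAGE = """
<html>
    <head>
    <script type='text/javascript'>
        window.onload = function(){
            for(i=0; i<5; i++){
                var paragraph = document.createElement('p');
                paragraph.innerHTML = 'This is paragraph '+i;
                document.body.appendChild(paragraph);
            }
        }
    </script>
    </head>
    <body>
    </body>
</html>
"""


class TagCounter(HTMLParser):
    """Count every start tag the parser sees in the static markup."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


counter = TagCounter()
counter.feed(PAGE)

# The script is treated as text and never executed, so no <p> is found.
print(counter.counts.get('p', 0))
```

The parser finds the `<script>` tag itself, but the paragraphs it would create are not part of the source text.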

This is a difficult problem to deal with. It would be useful if we could access the `.html` file as it is rendered inside our browser. Enter Selenium.

Selenium is a browser automation tool that is primarily used for testing websites, but can be put to a whole host of different tasks. What the Selenium library allows us to do is control a web browser using python and interact with the results.  

### Installing Selenium 

Installing Selenium is a little trickier than most Python packages. In addition to the Python library, we need a web driver: a separate program (here the Chrome driver) that lets the Selenium library communicate with our web browser.

First we can pip install the selenium library.

```python
!pip install selenium
```

Then we need to go to https://sites.google.com/chromium.org/driver/ and download the latest **stable** release of the Chrome driver that matches the version of Chrome installed on our machine. Next we need to make sure the selenium library can find the web driver. To do this we add it to our `system path`, a list of directories in which our computer looks for programs.

1. Create the directory `C:\bin`
2. Extract the chromedriver.exe into `C:\bin`
3. Run the following command 

```sh
setx PATH "%PATH%;C:\bin"
```


4. Check this worked by restarting the command prompt and running the following

```sh
chromedriver -v 
```
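
The same check can be made from Python. This is a minimal sketch using the standard library's `shutil.which`, which searches the directories on the system path. The helper name `find_driver` is hypothetical.

```python
import shutil


def find_driver(name='chromedriver'):
    """Return the full path of an executable found on the PATH, or None."""
    return shutil.which(name)


driver_path = find_driver()
if driver_path is None:
    print('chromedriver was not found on the PATH')
else:
    print(f'chromedriver found at {driver_path}')
```

If the driver is not found, the usual cause is that the command prompt or notebook was started before the path was updated.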

### Using Selenium 

Once Selenium is installed we can import and use it just like any other package. Its syntax is very similar to the packages we've looked at so far. To start a Selenium browser session we have to specify the type of browser that we're planning to use. As we installed the Chrome driver, we can do this with the following code.

```python
from selenium import webdriver

driver = webdriver.Chrome()
```

You'll notice that running this code opens up a new browser window with the message `Chrome is being controlled by automated test software`. This is the browser that python is going to control. 

Be careful running this cell multiple times. Every time `webdriver.Chrome()` is called it starts a new browser but won't close the old one. Calling `driver.quit()` closes a browser that is no longer needed.

In [22]:
from selenium import webdriver

driver = webdriver.Chrome()

Now that we have our web driver running, we can tell it to navigate to various pages using `driver.get()`. For example, if we wanted to open the Rwanda men's basketball Wikipedia page we would use:

```python
driver.get('https://en.wikipedia.org/wiki/Rwanda_men%27s_national_basketball_team')
```

Similarly, if we wanted to open our dynamic javascript page, we just need to tell the driver to navigate there. Opening a local file is a little different: we need to use `file://` rather than `http://`, and we need to enter the full file path, which we can build from `os.getcwd()`.

```python
import os 

# Note the leading slash, which separates the directory from the file name.
local_file = 'file://' + os.getcwd() + '/dynamic_javascript.html'
driver.get(local_file)
```

In [23]:
import os 

local_file = 'file://' + os.getcwd() + '/assets/dynamic_javascript.html'
driver.get(local_file)
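
Joining strings by hand is easy to get wrong, for example by forgetting the slash before the file name or by leaving backslashes in a Windows path. As an alternative sketch, the standard library's `pathlib` can build the `file://` URL for us. The `assets` folder name follows the cell above.

```python
from pathlib import Path

# Absolute path to the example page inside the current working directory.
# The file does not have to exist for the URL to be built.
page = Path.cwd() / 'assets' / 'dynamic_javascript.html'

# as_uri() adds the file:// prefix and converts backslashes and
# special characters where needed. It requires an absolute path.
local_file = page.as_uri()
print(local_file)
```

The resulting string can be passed to `driver.get()` in the same way as the hand-built one.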

Once we've navigated our browser to the right page there are several methods we can use to extract data from the processed HTML. Nearly all of them have a `find_element` and a `find_elements` version, returning the first matching element and a list of all matching elements respectively.

| Driver Method | Usage |
| --- | --- |
| find_element_by_id | Select element by the `id` attribute |
| find_elements_by_name | Select elements by the `name` attribute |
| find_elements_by_xpath | Select elements by an XML path |
| find_elements_by_link_text | Select elements with specific hyperlink text |
| find_elements_by_partial_link_text | Select elements matching part of hyperlink text |
| find_elements_by_tag_name | Select elements by tag name / tag type |
| find_elements_by_class_name | Select elements with the same class |
| find_elements_by_css_selector | Select elements by CSS selectors | 

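We can use the `find_elements_by_tag_name` method to collect our dynamically generated `<p>` tags. What we get is a list of `WebElement` objects. We can use the `.text` property to retrieve the text inside the tag, or use `.get_attribute('outerHTML')` to extract the full tag as a string. Note that these `find_element_by_*` helpers belong to the older Selenium API used in this course. Selenium 4 deprecated and later removed them in favour of calls such as `driver.find_elements(By.TAG_NAME, 'p')`, with `By` imported from `selenium.webdriver.common.by`.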

```python
import os
from selenium import webdriver 

driver = webdriver.Chrome()
local_file = 'file://' + os.getcwd() + '/dynamic_javascript.html'

driver.get(local_file)
p_tags = driver.find_elements_by_tag_name('p')

for tag in p_tags:
    print(type(tag))
    print(tag.get_attribute('outerHTML'))
    print(tag.text)
```

In [24]:
import os
from selenium import webdriver 

# driver = webdriver.Chrome()
local_file = 'file://' + os.getcwd() + '/assets/dynamic_javascript.html'

driver.get(local_file)
p_tags = driver.find_elements_by_tag_name('p')

for tag in p_tags:
    print(type(tag))
    print(tag.get_attribute('outerHTML'))
    print(tag.text)

<class 'selenium.webdriver.remote.webelement.WebElement'>
<p>This is paragraph 0</p>
This is paragraph 0
<class 'selenium.webdriver.remote.webelement.WebElement'>
<p>This is paragraph 1</p>
This is paragraph 1
<class 'selenium.webdriver.remote.webelement.WebElement'>
<p>This is paragraph 2</p>
This is paragraph 2
<class 'selenium.webdriver.remote.webelement.WebElement'>
<p>This is paragraph 3</p>
This is paragraph 3
<class 'selenium.webdriver.remote.webelement.WebElement'>
<p>This is paragraph 4</p>
This is paragraph 4


In some cases it may be more useful to use Selenium to generate the page, but then parse the resulting HTML using BeautifulSoup. Fortunately Selenium allows us to access the full HTML of the page including all of the generated elements. 

```python 
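import os 
from bs4 import BeautifulSoup
from selenium import webdriver 

driver = webdriver.Chrome()
# Note the leading slash, which separates the directory from the file name.
local_file = 'file://' + os.getcwd() + '/dynamic_javascript.html'

driver.get(local_file)
# page_source holds the HTML after the javascript has run.
html = driver.page_source
soup = BeautifulSoup(html)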

soup.find_all('p')
```

In [25]:
import os 
from bs4 import BeautifulSoup
from selenium import webdriver 

# driver = webdriver.Chrome()
local_file = 'file://' + os.getcwd() + '/assets/dynamic_javascript.html'

driver.get(local_file)
html = driver.page_source
soup = BeautifulSoup(html)

soup.find_all('p')

[<p>This is paragraph 0</p>,
 <p>This is paragraph 1</p>,
 <p>This is paragraph 2</p>,
 <p>This is paragraph 3</p>,
 <p>This is paragraph 4</p>]

### Practical Example 

Let's look at the [fiba.basketball news page](https://www.fiba.basketball/afrobasket/2021/qualifiers/news). If we go to the bottom of the page we can see there is a button that says `Show More News`. This button dynamically loads more news onto the page we're currently viewing.

We wouldn't be able to get this using `requests` alone, but maybe selenium can help. First we need to tell selenium that we want to click that button, but before we can click it we need to find it. 

<center><img src="./imgs/fiba_showmore.png" width=500></center>


Using the `Inspect` tool we can see that the button has the class `show_more_button` so we can use that and a selenium class selector to isolate the element.

Once we've done that we can use the built-in `click` method of `WebElement` objects to simulate clicking the `Show More News` button.

```python 
from selenium import webdriver 

driver = webdriver.Chrome()

URL = 'https://www.fiba.basketball/afrobasket/2021/qualifiers/news'
driver.get(URL)

button = driver.find_element_by_class_name('show_more_button')
button.click()
```

In [26]:
from selenium import webdriver 

driver = webdriver.Chrome()

URL = 'https://www.fiba.basketball/afrobasket/2021/qualifiers/news'
driver.get(URL)

button = driver.find_element_by_class_name('show_more_button')
button.click()

Now if we use the same code we used previously to scrape this information, swapping out the `requests` call for the Selenium one, we can parse even more news than we did previously.

Note that clicking the button doesn't generate the new news items instantly; it takes a moment for the browser to collect them. We need to add a wait using `time.sleep` so that the page can load before we scrape the data.

In [27]:
import time
import pandas as pd 
from bs4 import BeautifulSoup
from selenium import webdriver


driver = webdriver.Chrome()

# Set the URL
URL = 'https://www.fiba.basketball/afrobasket/2021/qualifiers/news'

# Navigate our page to the URL 
driver.get(URL)

# Find the show_more_button and click it 
button = driver.find_element_by_class_name('show_more_button')
button.click()

# We need to add a wait here, otherwise the page won't have
# loaded the new elements before we attempt to parse them.
time.sleep(10)

# Download the webpage
fiba_news = driver.page_source
# Convert the web page into a beautifulsoup object
soup = BeautifulSoup(fiba_news)

# Create an empty list to store our extracted information
news_items = []

# Loop over all the related_row classed divs
for news_html in soup.find_all(class_='related_row'):
    
    # For each related_row class:
    # Extract the date using the date_highlighted
    date = news_html.find(class_='date_highlighted').get_text()
    # Extract the category using the category class 
    category = news_html.find(class_='category').get_text()
    # Extract the headline using the <h6> tag
    headline = news_html.find('h6').get_text()
    # Extract the blurb using the <p> tags
    blurb = news_html.find('p').get_text()
    
    # Create a dictionary of all the data extracted
    data_extracted = dict(date=date, headline=headline, category=category, blurb=blurb)
    
    # Add the dictionary to the list to store extracted information for each article
    news_items.append(data_extracted)

# Convert the list of dictionaries into a pandas dataframe
df = pd.DataFrame(news_items)
df

| | date | headline | category | blurb |
| --- | --- | --- | --- | --- |
| 0 | 07/07/2021 | Enabu, Rwahwire reflect on Uganda tenacity ahe... | News | SALE (Morocco) - Qualifying for the FIBA AfroB... |
| 1 | 11/03/2021 | Madagascar's Botou wants to keep the AfroBaske... | News | ANTANANARIVO (Madagascar) – Despite having mis... |
| 2 | 10/03/2021 | 10 standout performers from the last window of... | News | ABIDJAN - As we look back at the Second Round ... |
| 3 | 08/03/2021 | Guinea celebrate second straight AfroBasket ap... | News | Guinea left Cameroon at the end of the FIBA Af... |
| 4 | 02/03/2021 | Decisions concerning the February window of th... | News | FIBA has taken decisions regarding Equatorial ... |
| 5 | 26/02/2021 | Impressive operational efforts in FIBA Contine... | News | MIES (Switzerland) - Another successful window... |
| 6 | 24/02/2021 | "Senegal have some room for improvement," says... | News | DAKAR (Senegal) - Senegal finished top of Grou... |
| 7 | 23/02/2021 | History Makers Kenya's confidence is sky-high | News | By beating eleven-time Africa champions Angola... |
| 8 | 21/02/2021 | Four teams undefeated at the end of AfroBasket... | Review | MONASTIR/YAOUNDE (Tunisia/Cameroon) - The 20-t... |
| 9 | 21/02/2021 | Top performers at Day 3 in Yaounde | News | YAOUNDE (Cameroon) - There was great frenzy at... |
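
A fixed `time.sleep(10)` is either longer than necessary or, on a slow connection, too short. One alternative is to poll until a condition holds or a timeout passes. The block below is a minimal sketch of that idea. `wait_until` is a hypothetical helper, and the condition used here is only a stand-in for "the new articles have loaded".

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns True as soon as the condition holds, or False if the
    timeout (in seconds) passes first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        # Pause briefly before checking again.
        time.sleep(interval)


# Stand-in condition: becomes true roughly 0.2 seconds after the start.
start = time.monotonic()
loaded = wait_until(lambda: time.monotonic() - start > 0.2,
                    timeout=5, interval=0.05)
print(loaded)
```

In the scraper, the condition could instead check that the number of `related_row` elements on the page has grown. Selenium also ships its own waiting tool for this purpose, `WebDriverWait` in `selenium.webdriver.support.ui`.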
