# <center>ECON 725: Computer Programming and Data Management in Economics <a class="tocSkip"></center>    
# <center>Python - Web scraping <a class="tocSkip"></center>

## Learning Objectives
<hr>

- Understand the basics of web scraping
- Learn how to use the `MechanicalSoup` library to scrape web pages.
- Learn how to scrape data from a website that uses forms and pagination.



# Web scraping Basics

This lecture is about how to get "cheap" data off the web and onto our computer. This is a very useful skill to have, as it allows us to collect data that is not available in a downloadable format.

Web scraping is the process of extracting information from websites. A scraper simulates human browsing behavior, then converts what it retrieves into a structured format that can be analyzed further and stored in a database. There are two ways that web content gets rendered in a browser:

1. Server-side rendering: The server sends the HTML content to the browser, which then renders the content.
2. Client-side rendering: The server sends minimal HTML content to the browser, which then executes JavaScript code to render the content.

You can read [here](https://www.codeconquest.com/website/client-side-vs-server-side/) for more details (including example scripts), but for the purposes of this lecture, we will focus on the following:

### Server-side rendering

* The scripts that “build” the website are not run on our computer, but rather on a host server that sends down all of the HTML code.
E.g. Wikipedia tables are already populated with all of the information — numbers, dates, etc. — that we see in our browser.
* In other words, the information that we see in our browser has already been processed by the host server.
* You can think of this information as being embedded directly in the webpage’s HTML.
* **Webscraping challenges**: Finding the correct CSS (or Xpath) “selectors”. Iterating through dynamic webpages (e.g. “Next page” and “Show More” tabs).
* Key concepts: CSS, Xpath, HTML

### Client-side rendering
* The website contains an empty template of HTML and CSS.
E.g. It might contain a “skeleton” table without any values.
* However, when we actually visit the page URL, our browser sends a request to the host server.
* If everything is okay (e.g. our request is valid), then the server sends a response script, which our browser executes and uses to populate the HTML template with the specific information that we want.
* **Webscraping challenges**: Finding the “API endpoints” can be tricky, since these are sometimes hidden from view.
* Key concepts: APIs, API endpoints
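
To make the contrast concrete, here is a minimal offline sketch that uses only Python's standard-library `html.parser`. The two HTML snippets, the `CellCollector` class, and the `table_cells` helper are made up for illustration and are not taken from any real site. The first snippet arrives already populated, as under server-side rendering. The second is an empty skeleton of the kind a client-side page might send before any JavaScript runs.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text inside every <td> element of an HTML document."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data


def table_cells(html):
    """Return the stripped text of each <td> cell found in `html`."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return [cell.strip() for cell in parser.cells]


# Hypothetical server-side page: the values are already in the HTML.
server_side_html = """
<table id="teams">
  <tr><td>Team A</td><td>10</td></tr>
  <tr><td>Team B</td><td>7</td></tr>
</table>
"""

# Hypothetical client-side page: an empty skeleton plus a script tag.
client_side_html = """
<table id="teams"></table>
<script src="/static/app.js"></script>
"""

print(table_cells(server_side_html))  # the four populated cells
print(table_cells(client_side_html))  # an empty list: nothing to scrape yet
```

A scraper that only downloads HTML sees exactly what these strings contain, which is why the second case calls for finding the API endpoint that supplies the data.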


In this lecture, we'll focus on server-side rendering, which is the most common type of rendering. However, I want to forewarn you that webscraping typically involves a fair bit of detective work.

You will often have to adjust your steps according to the type of data you want, and the steps that worked on one website may not work on another. (Or even work on the same website a few months later). 

**All this is to say that webscraping involves as much art as it does science.**

The good news is that both server-side and client-side websites allow for webscraping. If you can see it in your browser, you can scrape it!


### Caveat: Ethical and legal considerations

TL;DR: Just because you can scrape it doesn’t mean you should.

It is currently legal to scrape public websites (read more [here](https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn)), but there are some important caveats to keep in mind. The tools we’ll be using over these next two lectures are very powerful: a computer can process commands much, much faster than we can ever type them up manually. It’s pretty easy to write a function or program that overwhelms a host server or application through the sheer weight of requests. Or, just as likely, the host server has built-in safeguards that will block you if it suspects a malicious attack. So, be careful and nice when scraping websites 🙂.
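
One simple way to be "nice" is to pause between requests. The sketch below is only an illustration: the helper name `fetch_politely` and its parameters are assumptions introduced here, not part of any library. It accepts any callable that downloads one page, so the same loop works regardless of which scraping library does the fetching.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call `fetch(url)` for each URL, pausing between consecutive requests.

    `fetch` is any callable that downloads one page.
    `sleep` is injectable so the pause can be replaced in tests.
    """
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch(url))
    return results
```

With a MechanicalSoup browser, a call might look like `fetch_politely(list_of_urls, browser.open, delay_seconds=2)`. The right delay depends on the site, so treat the default of one second as a placeholder rather than a recommendation.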



<hr>

# MechanicalSoup 

### What is MechanicalSoup?

[MechanicalSoup](https://mechanicalsoup.readthedocs.io/en/stable/) is a Python browser automation library built on top of `Requests` (for making HTTP requests) and `BeautifulSoup` (for parsing HTML). It acts as a headless browser, mimicking a browser's behavior without the need for a graphical user interface. This makes it lightweight and efficient compared to full-fledged browser automation tools.

Here is a simple example of using `MechanicalSoup` to open a webpage, print its title, and close the browser session:
```python
import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
browser.open("https://example.com")
page = browser.get_current_page()
print(page.title.text)
browser.close()
```

### Why use MechanicalSoup for web scraping?

`MechanicalSoup` provides some features that make it a valuable tool for web scraping:

- Navigation
- Form handling
- Session management

#### Navigation

```python
# Follow the first link whose text is exactly "Login"
browser.follow_link(link_text="Login")

# Open a page directly by its URL
browser.open("https://example.com/next_page")

```

#### Form handling

```python
import mechanicalsoup

# Replace with the login URL you want
browser = mechanicalsoup.StatefulBrowser()
browser.open("https://old.reddit.com")  # Example page containing a login form

# Select the form (replace 'form' with a more specific CSS selector if needed)
login_form = browser.select_form('form')

# Fill the form fields. "username" and "password" are placeholders: the keys
# must match the `name` attributes of the inputs in the form you selected
login_form["username"] = "your_username"
login_form["password"] = "your_password"

# Submit the form
response = browser.submit_selected()

# Inspect the result (what comes back depends on the site)
print(response.status_code)
print(browser.url)
```

#### Session management

`MechanicalSoup` automatically manages cookies and other session information:

```python
# Access data from a protected page after login (session is maintained)
content = browser.page.find("div", class_="protected_content").text
print(content)

```

`MechanicalSoup` is a powerful tool for web scraping, but it is not the only one. Other popular libraries include `Scrapy`, `Selenium`, and `BeautifulSoup`. Each has its own strengths and weaknesses, so it's important to choose the right tool for the job.

#### MechanicalSoup vs. BeautifulSoup
While both libraries are used for web scraping, `MechanicalSoup` and `BeautifulSoup` have distinct functionalities:

- `BeautifulSoup`: This library specializes in parsing HTML content. It excels at identifying and extracting data from downloaded HTML code. However, it doesn't handle tasks like form submission, navigation, or session management.
- `MechanicalSoup`: Built on top of `BeautifulSoup`, `MechanicalSoup` adds form handling, navigation, and session management functionalities. It allows you to interact with websites more dynamically by mimicking user behavior.

If you simply need to extract data from downloaded HTML content, `BeautifulSoup` is sufficient.

If your scraping task involves interacting with forms, navigating through pages, or maintaining sessions, `MechanicalSoup` is the better choice.


#### MechanicalSoup vs. Selenium

Another popular library for web scraping is `Selenium`, as it provides full-fledged browser automation. Here's a breakdown of the key differences between `Selenium` and `MechanicalSoup`:

* Functionality

    - `Selenium` can handle complex JavaScript, render pages with dynamic content, and automate browser interactions beyond scraping.
    - `MechanicalSoup` focuses on scraping tasks. It does not execute JavaScript or mimic advanced browser behavior.

* Complexity
    - `Selenium` has a steeper learning curve due to its comprehensive functionality.
    - `MechanicalSoup` offers a simpler API, which makes it easier to learn and use for basic scraping tasks.

* Performance
    - `Selenium` is slower and needs more processing power because it drives a full browser.
    - `MechanicalSoup` is generally faster for simpler tasks as it doesn't involve full browser rendering.


If you need to handle advanced JavaScript or mimic complex browser interactions, `Selenium` is the way to go.

For most basic scraping tasks, especially for static websites and those with straightforward forms and navigation, `MechanicalSoup` is a more efficient and lightweight option.

<hr>

### Getting Started with MechanicalSoup

<br>

<center>
<div>
<img src="https://raw.githack.com/DataSciEcon-Emory/fall2025/main/images/mechanical_soup.png" width="400"/>
</div>
</center>

This topic material is based on the [scrapingbee.com](https://www.scrapingbee.com/blog/getting-started-with-mechanicalsoup/) blog and adapted for our purposes in the course.


Let's start with a simple example: using `MechanicalSoup` to extract the title and the reference links of this [Wikipedia](https://en.wikipedia.org/wiki/Web_scraping) page. Of course, there's a lot of information that you can scrape from a Wikipedia page, but extracting just the title and the reference links should help you understand how to find HTML tags and how to extract data from deep within the HTML hierarchy of a page.

```python
import mechanicalsoup
```

```python
# Connect to the website
browser = mechanicalsoup.StatefulBrowser(
    user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
)
browser.open("https://en.wikipedia.org/wiki/Web_scraping")
```

```
<Response [200]>
```

#### Extracting the title

To extract the title, you first need to understand the structure of the Wikipedia page. Open the page in Google Chrome and press `F12` to open the developer tools. In the `Elements` tab, look for the HTML tag for the title of the page. Here's what it will look like:

<div>
<img src="https://www.scrapingbee.com/blog/getting-started-with-mechanicalsoup/page-title-html_hu17009774656312328514.png" width="600"/>
</div>

To get the title from the page, look for a `span` tag that has the class `mw-page-title-main`:

```python
title = browser.page.select_one('span.mw-page-title-main')
print("title: " + title.string)
```

```
title: Web scraping
```


#### Extracting the reference links

Next, you'll find and extract all the links from the references section of the web page. First, take a quick peek at the HTML structure of the References section by inspecting it with your browser's developer tools:

<div>
<img src="https://www.scrapingbee.com/blog/getting-started-with-mechanicalsoup/references-structure_hu4756474786081050296.png" width="1000"/>
</div>


First, select the ordered list (`ol`) tag that has the class `references`. With that tag selected, you might consider using the `find_all` method to extract all links inside this `ol` tag. However, on a closer look, you'll find that this section also contains backlinks to places in the same Wikipedia page where each reference was cited:

<div>
<img src="https://www.scrapingbee.com/blog/getting-started-with-mechanicalsoup/backlinks_hu8625713131764143085.png" width="1000"/>
</div>

If you extract all links from this tag using the `find_all` function, the results will contain same-page backlinks too, which you don't need. To avoid that, you need to select all `reference-text` span tags and extract the links from them. 

```python
# Select the References section
references = browser.page.select_one('ol.references')

# Select all span tags with the class `reference-text` to exclude backlinks
references_list = references.select('span.reference-text')

# For each reference span tag, find all anchor elements and print their href attribute values
for reference in references_list:
    link_tags = reference.find_all("a")

    for link_tag in link_tags:
        link = link_tag.get("href")
        if link:
            print(link)
```

```
https://doi.org/10.5334%2Fdsj-2021-024
/wiki/Doi_(identifier)
https://doi.org/10.5334%2Fdsj-2021-024
/wiki/ISSN_(identifier)
https://search.worldcat.org/issn/1683-1470
/wiki/S2CID_(identifier)
https://api.semanticscholar.org/CorpusID:237719804
http://www.searchenginehistory.com/
https://thehistoryoftheweb.com/ebay-apis-connected-web/
https://web.archive.org/web/20161011080619/https://pdfs.semanticscholar.org/4fb4/3c5a212df751e84c3b2f8d29fabfe56c3616.pdf
/wiki/Doi_(identifier)
https://doi.org/10.1145%2F1281192.1281287
/wiki/ISBN_(identifier)
/wiki/Special:BookSources/9781595936097
/wiki/S2CID_(identifier)
https://api.semanticscholar.org/CorpusID:833565
https://pdfs.semanticscholar.org/4fb4/3c5a212df751e84c3b2f8d29fabfe56c3616.pdf
http://www.gooseeker.com/en/node/knowledgebase/freeformat
http://www.xconomy.com/san-francisco/2012/07/25/diffbot-is-using-computer-vision-to-reinvent-the-semantic-web/
https://web.archive.org/web/20020308222536/http://www.chillingeffects.org/linking/faq.cgi#QI
```
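
The printed links mix absolute URLs with site-relative paths such as `/wiki/Doi_(identifier)`. As a possible follow-up step, the sketch below resolves every `href` against the page URL with the standard-library `urllib.parse.urljoin` and then keeps only links that leave Wikipedia. The helper names `absolute_links` and `external_only` are introduced here for illustration, and the `#cite_ref-1` value is a hypothetical same-page anchor.

```python
from urllib.parse import urljoin, urlparse

PAGE_URL = "https://en.wikipedia.org/wiki/Web_scraping"


def absolute_links(hrefs, base_url=PAGE_URL):
    """Resolve each href against `base_url` and drop same-page anchors."""
    resolved = []
    for href in hrefs:
        if href.startswith("#"):
            continue  # same-page backlink, e.g. the hypothetical "#cite_ref-1"
        resolved.append(urljoin(base_url, href))
    return resolved


def external_only(urls, own_host="en.wikipedia.org"):
    """Keep only URLs whose host differs from `own_host`."""
    return [url for url in urls if urlparse(url).netloc != own_host]


# Three hrefs copied from the output above, plus one hypothetical anchor.
sample = [
    "https://doi.org/10.5334%2Fdsj-2021-024",
    "/wiki/Doi_(identifier)",
    "http://www.searchenginehistory.com/",
    "#cite_ref-1",
]

links = absolute_links(sample)
print(links)
print(external_only(links))
```

In the scraping loop, the same helpers could be applied to the collected `href` values instead of printing them one by one.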

### Advanced Web Scraping with MechanicalSoup

Now that you know how to scrape simple websites with `MechanicalSoup`, you can learn how to use it to scrape multipage websites that require navigation, forms, and pagination. In this section, you'll scrape a sandbox called [Scrape This Site](https://www.scrapethissite.com/).

To navigate through web pages, you can use MechanicalSoup's `follow_link` method. You'll need to provide it with the link that you want to follow, so locate and select the appropriate anchor tag on each page and pass it to `follow_link`, which reads its `href` attribute. First, navigate to the sandbox's home page:

```python
browser.open("https://www.scrapethissite.com/")

print(browser.url)
```

```
https://www.scrapethissite.com/
```


Next, find the link to the sandbox page from the navigation bar. If you inspect the page, you'll find that the link is inside an `<li>` tag whose `id` is `nav-sandbox`. You can use the following code to extract and follow the link from this tag:

```python
sandbox_nav_link = browser.page.select_one('li#nav-sandbox')

sandbox_link = sandbox_nav_link.select_one('a')

browser.follow_link(sandbox_link)

print(browser.url)
```

```
https://www.scrapethissite.com/pages/
```


Next, locate and follow the **Hockey Teams: Forms, Searching and Pagination** link. If you inspect the HTML structure around this element, you'll notice that each link on this page is inside `<div class="page">` tags.

<div>
<img src="https://www.scrapingbee.com/blog/getting-started-with-mechanicalsoup/inspect-sandbox-page_hu8403476182413272040.png" width="1000"/>
</div>

You can extract all these `divs` first, select the second `div`, and then select and extract the link from the `a` tag from within it. Here's the code you can use to do that:

```python
sandbox_list_items = browser.page.select('div.page')

hockey_list_item = sandbox_list_items[1]

forms_sandbox_link = hockey_list_item.select_one('a')

browser.follow_link(forms_sandbox_link)

print(browser.url)
```

```
https://www.scrapethissite.com/pages/forms/
```


This will bring your browser to the hockey teams information page. Putting it all together, we can create a function to do the navigation for us:

```python
def navigate():
    # Start at the home page
    browser.open("https://www.scrapethissite.com/")
    print(browser.url)

    # Follow the sandbox link in the navigation bar
    sandbox_nav_link = browser.page.select_one('li#nav-sandbox')
    sandbox_link = sandbox_nav_link.select_one('a')
    browser.follow_link(sandbox_link)
    print(browser.url)

    # Follow the link inside the second `div.page` (the hockey teams page)
    sandbox_list_items = browser.page.select('div.page')
    hockey_list_item = sandbox_list_items[1]
    forms_sandbox_link = hockey_list_item.select_one('a')
    browser.follow_link(forms_sandbox_link)
    print(browser.url)
```

```python
navigate()
```

```
https://www.scrapethissite.com/
https://www.scrapethissite.com/pages/
https://www.scrapethissite.com/pages/forms/
```


#### Handling Forms

Suppose that we want to extract information about teams that contain the word "new" in their name. To do this, we need to fill in the search form with the word "new" and submit it. `MechanicalSoup` offers a `Form` class to make form handling simpler. You can select a form using its CSS selector and then pass input to it according to the names of its input fields.

MechanicalSoup also provides a simple method called `submit_selected()`, which allows you to submit a form without having to locate and click its submit button.

If you inspect the website's structure, you'll notice that the form element has the classes `form` and `form-inline`. You can use one of these to identify and select the form. Notice that the input box has been named `q`. You can use these two details to write the code for handling this page's search form in the `handle_form()` function. But first, let's write a function to extract the data from a table.

To extract the table data from the page, you'll use the `select_one` and `select` functions from Beautiful Soup, just as you did earlier in this tutorial. Save the following code in your `extract_table_data` function:

```python
import csv


def extract_table_data(file_name):
    # Select the table element
    results = browser.page.select_one('table')

    # Extract the header cells (get_text also works when a cell has nested tags)
    headers = [th.get_text(strip=True).replace(",", "") for th in results.select('th')]

    # Open the file with newline='' as the csv module recommends; the `with`
    # block closes the file automatically, even if an error occurs
    with open(file_name, 'w', newline='', encoding='utf-8') as file:
        csvwriter = csv.writer(file)

        # Write the header row first
        csvwriter.writerow(headers)

        # Then write one CSV row for each team row of the table
        for row in results.select('tr.team'):
            cells = row.select('td')
            csvwriter.writerow([cell.get_text(strip=True).replace(",", "") for cell in cells])
```

Now, going back to the `handle_form` function, you can use the `extract_table_data` function to extract the data from the table after submitting the form. Here's the complete code for the `handle_form` function:

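```python
def handle_form():
    # Select the search form by one of its CSS classes
    browser.select_form('form.form-inline')

    # Fill in the input named `q` with the search term
    browser["q"] = "new"

    # Submit the form, then save the first page of results
    browser.submit_selected()

    extract_table_data("first-page-results.csv")
```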

```python
handle_form()
```
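
Judging by the query string that appears in this site's URLs after a search (for example `?page_num=2&q=new`), the search form is sent as a GET request, with each field name and value appended to the URL. The sketch below rebuilds such a URL with the standard library only. The helper name `search_url` is an assumption introduced here, and the code illustrates what the submission amounts to rather than how MechanicalSoup implements it.

```python
from urllib.parse import parse_qs, urlencode, urlparse

FORMS_URL = "https://www.scrapethissite.com/pages/forms/"


def search_url(term, page_num=None):
    """Build the URL that a GET submission of the search form would request."""
    params = {"q": term}
    if page_num is not None:
        params["page_num"] = page_num
    return FORMS_URL + "?" + urlencode(params)


url = search_url("new")
print(url)

# Parse the query string back into a dictionary of field names and values
print(parse_qs(urlparse(url).query))
```

Seeing the field names in the URL is also a quick way to confirm that the input really is called `q` before writing the form-handling code.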

#### Handling Pagination

MechanicalSoup doesn't offer any special methods for handling pagination. You have to select the appropriate link from the pagination element and follow it via your browser object to navigate to the desired page.

Here's what the HTML structure of the pagination element looks like:

<div>
<img src="https://www.scrapingbee.com/blog/getting-started-with-mechanicalsoup/pagination-links-html-structure_hu9569013241349928453.png" width="1000"/>
</div>

First, select the `<ul>` tag that has the pagination class added to it. Then, find the right `<li>` tag in it (based on the page number you want to navigate to), and finally, extract the `href` value from the `<a>` tag inside the `<li>` tag you identified. Here's the code to do it:

```python
def handle_pagination():
    pagination_links = browser.page.select_one('ul.pagination')

    page_links = pagination_links.select('li')

    # Choose the link at index 1 to navigate to page 2
    next_page = page_links[1]

    next_page_link = next_page.select_one('a')

    browser.follow_link(next_page_link)

    print(browser.url)

    # Extract the table data from this page too
    extract_table_data('second-page-results.csv')
```

```python
handle_pagination()
```

```
https://www.scrapethissite.com/pages/forms/?page_num=2&q=new
```
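
To visit every results page rather than just the second one, it helps to know which page numbers the pagination element offers. The sketch below reads the `page_num` parameter out of a list of `href` values using only the standard library. The helper name `page_numbers` is introduced here for illustration, and the sample `href` values are hypothetical, modeled on the URL printed above.

```python
from urllib.parse import parse_qs, urlparse


def page_numbers(hrefs):
    """Return the sorted, de-duplicated page numbers found in `hrefs`."""
    numbers = set()
    for href in hrefs:
        values = parse_qs(urlparse(href).query).get("page_num", [])
        for value in values:
            if value.isdigit():
                numbers.add(int(value))
    return sorted(numbers)


# Hypothetical hrefs; the last one imitates a "next" arrow that repeats
# a page number already listed.
hrefs = [
    "/pages/forms/?page_num=1&q=new",
    "/pages/forms/?page_num=2&q=new",
    "/pages/forms/?page_num=2&q=new",
]

print(page_numbers(hrefs))  # [1, 2]
```

In the scraper itself, the `href` values would come from the `<a>` tags inside `ul.pagination`, and a loop could follow each link in turn and call `extract_table_data` with a different file name per page.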


Notice that two new CSV files will be created in the project directory: `first-page-results.csv` and `second-page-results.csv`. These will contain the results data extracted from the tables. This completes the tutorial for setting up web scraping using MechanicalSoup in Python!
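
Since both result files share the same header row, a natural next step is to combine them into a single dataset. Here is a minimal sketch using the standard-library `csv` module. The function name `merge_csv_files` and the output file name in the usage note are assumptions introduced here, and the sketch assumes the files are UTF-8 encoded.

```python
import csv


def merge_csv_files(input_paths, output_path):
    """Concatenate CSV files that share a header row into one file.

    The header is written once, taken from the first non-empty input file.
    Returns the number of data rows written.
    """
    rows_written = 0
    header_written = False

    with open(output_path, "w", newline="", encoding="utf-8") as out_file:
        writer = csv.writer(out_file)

        for path in input_paths:
            with open(path, newline="", encoding="utf-8") as in_file:
                reader = csv.reader(in_file)
                header = next(reader, None)
                if header is None:
                    continue  # skip empty files

                if not header_written:
                    writer.writerow(header)
                    header_written = True

                # Copy the remaining rows, skipping this file's header
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1

    return rows_written
```

For example, `merge_csv_files(["first-page-results.csv", "second-page-results.csv"], "all-results.csv")` would write one combined file and return the total number of team rows.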

# <center>Thank you!<a class="tocSkip"></center>
***