# <font color='grey'>Web Scraping with Python and Beautiful Soup Pt I:<br>Scraping HTML</font>

## <font color='orange'>Workshop Description</font>

This workshop will introduce students to techniques for scraping information from the web using Python’s Beautiful Soup (bs4) toolkit. We will begin with a basic overview of the “anatomy” or structure of a webpage. Students will then learn how to write a script for extracting textual data from websites like Reddit and organizing it into spreadsheets. The second half of the workshop will explore how to use Python’s Pandas library to clean and analyze your data. In addition to technical skills, participants are encouraged to engage with critical questions such as: What can we, as researchers, learn from publicly available data? What are the potential ethical and legal complexities around data harvesting, and how do we harvest data responsibly?

By the end of the workshop, students will know how to:

- Scrape HTML content from a webpage
- Clean and analyze data

### Requirements:

*Tip: Use the Anaconda package manager to install Jupyter, Python 3, and Beautiful Soup: [Installation Guide](https://docs.anaconda.com/free/anaconda/install/)*

- Install Jupyter Notebook: [Installation Guide](https://jupyter.org/install)
    - [JupyterLite](https://jupyterlite.readthedocs.io/) is a browser-based version that comes with pre-installed packages, no installation required. 
- Install Python 3 
- Install Beautiful Soup 4: See [Installation Guide](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup)

### Resources:
- [BeautifulSoup4 Quickstart](https://beautiful-soup-4.readthedocs.io/en/latest/#quick-start)
- [Markdown Cheat Sheet](https://www.markdownguide.org/cheat-sheet/)
- [Regex 101](https://regex101.com/)

## <font color='orange'>What is Jupyter Notebook?</font> 

- Jupyter Notebook is a Graphical User Interface (GUI) that sits on top of the code, making it easier to interact with.
- Jupyter Notebook is for:
    - a) writing and executing "live" computer code
    - b) creating written and visual notes or commentary (like for tutorials!)

### Using Jupyter:
- **Cells** are boxes for entering code or text.
- Switch between **Cell Types** using the dropdown menu:
    - a) **Code**: e.g. Python, Java, R
    - b) **Markdown**: create text and visual content, easy-to-read and write
    
### Working with Cells
- **Double-click** a Cell to edit it.
- You can **Insert** Cells above or below, **Copy and Paste** Cell contents, **Move** Cells up/down, and **Delete** Cells.
- Press **Shift + Return** to execute Cell contents (run code).
- Click the **STOP** icon to stop (interrupt) code.

### <font color='grey'> *Try it!*</font>
- In the upper-right corner of *this* cell, click the **Rectangle Box with a Plus Sign Underneath** to add a cell below.
- Click on the cell.
- At the top, change the dropdown menu to select "Code."
- Type the code below

```
print('Oh, what a beautiful morning!')
```

- Press **Shift + Return**
- If successful, you'll see the lyrics to the opening song from Oklahoma.

In [140]:
print('Oh, what a beautiful morning!')

Oh, what a beautiful morning!


## <font color='orange'>What is Web Scraping?</font>
- Extracting content (text and/or metadata) from websites
- Can iterate over multiple websites/pages

### Static vs Dynamic Websites
- **Static:** flat, pure HTML
    - what you see is what you get!
- **Dynamic:** database-driven, often relies on JavaScript
    - generates content on the fly, often personalised or triggered by user actions (e.g. clicking, scrolling)
 
*Tip: Not sure if a website is dynamic? Try disabling Javascript in Chrome's Dev Tools. On a webpage, right-click "Inspect" > Open Command Menu (click three dots) > Select "Run Command" > begin typing Javascript > Select "Disable Javascript".*

### Techniques: HTML vs APIs
- ### HTML Scraping (with Beautiful Soup):
    - DIY approach
    - Works for simple, static HTML
- ### Application Programming Interfaces (APIs):
  - Mediates between two systems (like a mobile app and your phone)
  - Allows them to exchange data
  - Major social media platforms offer APIs to developers, as a way to encourage the creation of third-party apps and services.
  - ... but APIs also restrict usage in different ways (more on that in Part Two).

## <font color='orange'>Web Scraping Etiquette</font>

### Best Practices
- Take only what you need
- Anonymise or (better yet!) avoid scraping identifying information
- Try not to overload with requests; build in pauses (e.g. sleep)
- Identify yourself
- Credit source

### Robots.txt
- Guidelines for webscrapers about which parts of a site to scrape
- Helps prevent overloading
- Plain text file, found in root directory
- For example, here is a directive from the [McMaster University website](https://www.mcmaster.ca/robots.txt), which bans all webcrawlers from scraping content from pages in the "busstrike" filepath: 

```
    User-agent: *
    Disallow:   /busstrike/
```

- Robots.txt files *can* negatively impact SEO (findability) if used incorrectly
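
To see how a crawler reads a directive like the one above, here is a minimal sketch that feeds the same two lines to Python's built-in `urllib.robotparser` without touching the network. The page paths and the `MyWebScraper` name are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The same directive shown above, as plain text lines
robots_lines = [
    "User-agent: *",
    "Disallow: /busstrike/",
]

rp = RobotFileParser()
rp.parse(robots_lines)  # parse the rules directly instead of downloading them

# Hypothetical page paths, used only to illustrate the rule
busstrike_ok = rp.can_fetch("MyWebScraper", "https://www.mcmaster.ca/busstrike/update.html")
other_ok = rp.can_fetch("MyWebScraper", "https://www.mcmaster.ca/admissions/")

print(busstrike_ok)  # False: the path starts with /busstrike/
print(other_ok)      # True: no rule covers this path
```

Because the rule is written for `User-agent: *`, it applies to any bot name you pass to `can_fetch`.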

## <font color='orange'> Check for Permission to Scrape </font>
- You can use Python tools like **RobotFileParser** to check whether you're allowed to fetch the contents of a particular website.
- Today we'll be scraping rebuttal speeches from UCSB's [American Presidency Project](https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/annual-messages-congress-the-state-the-union/list) website.
- The code below queries the robots.txt rules for a specific site.

### <font color='grey'>Let's try it!</font>
- Click *Shift + Return* to execute the code.
- To try checking a different website's robots.txt protocols, just replace the URL.

*See [RobotFileParser Documentation](https://docs.python.org/3/library/urllib.robotparser.html) to learn more about how this tool works.*

In [142]:
import urllib.robotparser

# Create instance of RobotFileParser
rp = urllib.robotparser.RobotFileParser()

# Specify the URL of the site's robots.txt file
rp.set_url("https://www.presidency.ucsb.edu/robots.txt")
rp.read()

# Specify the user agent and URL you're interested in
user_agent = 'MyWebScraper'
url_to_scrape = "https://www.presidency.ucsb.edu/documents/republican-party-response-president-obamas-address-before-joint-session-the-congress-the-3"

# Check if fetching the URL is allowed for your user agent
can_fetch = rp.can_fetch(user_agent, url_to_scrape)

# Check the request rate (number of requests per number of seconds); returns None if robots.txt does not set one
rrate = rp.request_rate("*")

if can_fetch:
    print("Fetching is allowed.")
else:
    print("Fetching is disallowed.")

Fetching is allowed.
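
The cell above stores the request rate in `rrate` but never uses it. As a rough sketch of how that value could feed into your pauses, the example below parses a made-up robots.txt (the rules are hypothetical, not the ones served by the American Presidency Project) and reads both `crawl_delay` and `request_rate`. Both methods return `None` when the file does not set them, so a fallback is needed. The `pause_seconds` helper is an illustration, not part of any library.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only
sample_robots = """\
User-agent: *
Crawl-delay: 5
Request-rate: 1/10
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(sample_robots.splitlines())

delay = rp.crawl_delay("*")   # seconds between requests, or None
rate = rp.request_rate("*")   # named tuple (requests, seconds), or None

def pause_seconds(delay, rate, default=10):
    """Pick the most cautious pause the site asks for (hypothetical helper)."""
    candidates = [default]
    if delay is not None:
        candidates.append(delay)
    if rate is not None and rate.requests:
        candidates.append(rate.seconds / rate.requests)
    return max(candidates)

print(delay)
print(rate.requests, rate.seconds)
print(pause_seconds(delay, rate))
```

Taking the largest of the candidate values errs on the side of the slowest pace the site requests.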


## <font color='orange'>About Beautiful Soup</font>

### What is Beautiful Soup?
- A Python library for parsing (i.e. extracting *specific* content) from HTML and XML documents

### Installation Instructions
- To begin, first you need to install Beautiful Soup 4 and its dependencies, mainly Python3.
- Installation instructions can be found under **'Requirements'** section of the **Workshop Description.**

## <font color='orange'>Making the Soup</font>
1. Import bs4 and urllib (another Python library for opening URLs).
2. Fetch the website content using urllib's "request" module and its "urlopen" function.
3. Use Beautiful Soup to extract the HTML.
4. Print a 'prettified' or reader friendly version of the content.

In [144]:
from bs4 import BeautifulSoup
import urllib.request

r = urllib.request.urlopen('https://www.presidency.ucsb.edu/documents/republican-party-response-president-bidens-address-before-joint-session-the-congress-the-1').read()

soup = BeautifulSoup(r, 'html.parser')

#print(soup.prettify())
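
If you want to see what the soup looks like without fetching a live page, the same steps work on an HTML string. The snippet below is a small sketch using invented markup in place of a downloaded page; only the `BeautifulSoup(..., 'html.parser')` and `prettify()` calls come from the cell above.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document standing in for a downloaded page
html = "<html><head><title>Test Page</title></head><body><p>Hello, soup!</p></body></html>"

soup = BeautifulSoup(html, "html.parser")

# prettify() returns the markup as an indented, easier-to-read string
pretty = soup.prettify()
print(pretty)

page_title = soup.title.get_text()
first_paragraph = soup.p.get_text()
print(page_title)
print(first_paragraph)
```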

## <font color='orange'>Anatomy of a Website</font>
- HTML websites are made up of different parts (e.g. head, body, paragraphs, tables)
- You can demarcate different sections (elements) using *div tags*
- Divs can be further classified into *classes* and *IDs*
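
As a rough illustration of how divs, classes, and IDs fit together, the sketch below parses a small invented page and looks elements up both ways. The markup, the `header` ID, and the `content` class are assumptions made for the example.

```python
from bs4 import BeautifulSoup

# Invented markup: one div with an ID, two divs sharing a class
html = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <div id="header"><h1>Welcome</h1></div>
    <div class="content"><p>First paragraph.</p><p>Second paragraph.</p></div>
    <div class="content"><p>Another section.</p></div>
  </body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

header = soup.find("div", id="header")                 # an ID should be unique on a page
content_divs = soup.find_all("div", class_="content")  # a class can repeat

heading_text = header.h1.get_text()
second_paragraph = content_divs[0].find_all("p")[1].get_text()

print(heading_text)
print(len(content_divs))
print(second_paragraph)
```

Note the trailing underscore in `class_`: `class` is a reserved word in Python, so Beautiful Soup uses `class_` as the keyword argument.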

## <font color='orange'>Examine the Website</font>
1. On the webpage you want to examine, *right-click* and select *Inspect*
2. Under 'Elements,' hover over different parts of the website to see how different elements are tagged and fit within the overall structure of the website.

## <font color='orange'>Find our Ingredients</font>
- Once you've made the soup, you can home in on specific elements.

  ### <font color='grey'>Try it!</font>
  - Return the *title* of the webpage:
      ```
      soup.title
      ```
  
  - Return the *first paragraph*:
      ```
      soup.p
      ```
        
  - Return the *first link*:
    ```
    soup.a
    ```
        
  - Return *all links*:
        
    ```
    soup.find_all('a')
    ```
    
- We want to grab only the speech text, which is in the div with the class "field-docs-content".
- To return the contents of this div, use the code below.
- *Note: We will create a new object from this text by assigning it the name "speech" using the equals sign.*

```
speech = soup.find("div", {"class": "field-docs-content"})
```

In [146]:
speech = soup.find("div", {"class": "field-docs-content"})
#print(speech.get_text())

### Continue to Refine
- But wait...!
- Before the start of the transcript, there is an extra chunk of contextual information.
- Let's exclude this paragraph...


In [159]:
speech_wintro = soup.find("div", {"class": "field-docs-content"})
just_speech = speech_wintro.find_all('p')[1:]
#print(just_speech)

- Let's also extract *just the text* and get rid of the "p" tags

In [162]:
speech = ' '.join([tag.get_text() for tag in just_speech])
#print(speech)
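
To make the slicing and joining steps easier to follow, here is a small offline sketch on invented markup that mimics the structure described above: one contextual paragraph followed by the speech itself. The text inside the paragraphs is made up.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the speech container
html = """
<div class="field-docs-content">
  <p>[Editor's note: delivered from the state capitol.]</p>
  <p>Good evening.</p>
  <p>Thank you for joining us.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

container = soup.find("div", {"class": "field-docs-content"})
paragraphs = container.find_all("p")

just_speech = paragraphs[1:]  # slice off the first (contextual) paragraph
speech = " ".join(tag.get_text() for tag in just_speech)
print(speech)
```

`soup.find` returns `None` when no matching div exists, so on real pages it is worth checking the result before calling `find_all` on it.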

In [155]:
r = urllib.request.urlopen('https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/annual-messages-congress-the-state-the-union/list').read()
soup = BeautifulSoup(r, 'html.parser')
#print(soup.prettify())

## <font color='orange'>Build the Corpus</font>
- Grabbing a single chunk of text is easy (why not just copy/paste?) 
- Web scraping is most useful for building *datasets.*
- Let's compile our corpus...  

### Iterate List Items
- The website has an [index](https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/annual-messages-congress-the-state-the-union/list) of all the rebuttal speeches. It lists each speech's date and speaker, and links to the individual speech pages like the one from our initial test.

1. ### Make New Soup
- To extract information from the index page (as opposed to the individual page), we'll need to make a new soup.

```
r = urllib.request.urlopen('https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/annual-messages-congress-the-state-the-union/list').read()
#soup = BeautifulSoup(r)
soup = BeautifulSoup(r, 'html.parser')
```

2. ### Inspect Page
- After inspecting the page, we can see that the information we need is stored as rows in a table.
- The dates are hyperlinks that link to the speeches.

3. ### Narrow Scope
- First let's define the section of text (the table rows) that we want to scrape data from.

```
rows = soup.tbody.find_all('tr')

```

4. ### Iterate List of Speeches
- Then we'll need to cycle through the list of rebuttal speeches, executing the same code each time.

```
for row in rows:
```

5. ### Check for link
- Before executing the code, we need to check whether the row contains a link.
- If so, we'll grab the hyperlink...
- ... then we're off to the races!

```
a = row.find('a')
        
if a:
    URL = a['href']

```
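
The row loop and link check can be tried offline on a small invented table. This is a sketch under assumptions: the markup and the `example.org` addresses are placeholders, and the real index page may be structured differently. It also uses `urljoin` from the standard library, which is an addition to the steps above, in case some links turn out to be relative.

```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Invented table that mimics an index page; the real markup may differ
html = """
<table><tbody>
  <tr><td>A</td><td><a href="/documents/speech-1991">January 29, 1991</a></td><td>Speaker One</td></tr>
  <tr><td>B</td><td>no link here</td><td></td></tr>
  <tr><td>C</td><td><a href="https://example.org/documents/speech-1993">February 17, 1993</a></td><td>Speaker Two</td></tr>
</tbody></table>
"""
base_url = "https://example.org/list"  # placeholder address of the index page
soup = BeautifulSoup(html, "html.parser")

links = []
for row in soup.tbody.find_all("tr"):
    a = row.find("a")
    if a:  # skip rows without a hyperlink
        # urljoin leaves absolute links alone and completes relative ones
        links.append(urljoin(base_url, a["href"]))

print(links)
```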



## <font color='orange'>Gather Metadata</font>
- But wait!
- What other information do we need?

1. ### Get Date
- The hyperlink is embedded in the speech date.
- All we need to do is extract the text from the link.
  
```
date = a.get_text()
```

2. ### Format Date
- However, the date is just plain text (e.g. "January 31, 1990").
- We can convert it into a datetime object (using Python's datetime library), so that our code recognizes it as a date rather than an ordinary string.
- Then we can reformat our dates in any style we like (e.g. 1990-01-31).
- Doing so gives us the option to order our list by time, and examine trends (in speech text) over time.

```
from datetime import datetime

datetime_object = datetime.strptime(date, '%B %d, %Y')
print(datetime_object.date())
```
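
Here is a short, self-contained sketch of the parse, sort, and reformat steps on a few sample strings. The date strings are invented, and the example assumes English month names (the default locale on most systems), since `%B` is locale-dependent.

```python
from datetime import datetime

# Sample date strings, including one that does not fit the format
raw_dates = ["January 31, 1990", "February 17, 1993", "January 29, 1991", "not a date"]

parsed = []
for text in raw_dates:
    try:
        parsed.append(datetime.strptime(text, "%B %d, %Y").date())
    except ValueError:
        # strptime raises ValueError when the text does not fit the format
        print(f"Skipping unparseable date: {text!r}")

parsed.sort()  # date objects sort chronologically
iso_dates = [d.strftime("%Y-%m-%d") for d in parsed]
print(iso_dates)
```

Sorting the plain strings instead would order them alphabetically by month name, which is why the conversion matters.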

3. ### Get Speaker Name
- In HTML, table rows are divided into cells (tagged as 'td')
- Examining the website shows that the speaker name is in the 3rd cell.
- In Python, you can access elements in a list or series using square brackets.
- As before, we use the function 'get_text()' to extract just the text.
- Finally, we strip extra whitespace from the start and end of the string with 'strip()'

```
tds = row.find_all('td')
name = tds[2].get_text().strip()
print(name)
```
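
Indexing with `tds[2]` raises an `IndexError` if a row has fewer than three cells. The sketch below, on invented rows, shows one cautious way to handle that; the length check is an addition to the steps above, not something the index page is known to require.

```python
from bs4 import BeautifulSoup

# Invented rows: one complete, one too short to hold a speaker name
html = (
    "<table>"
    "<tr><td>1995</td><td>Jan 24</td><td>  Governor Example (NJ)\n</td></tr>"
    "<tr><td>only one cell</td></tr>"
    "</table>"
)
soup = BeautifulSoup(html, "html.parser")

names = []
for row in soup.find_all("tr"):
    tds = row.find_all("td")
    if len(tds) >= 3:
        # indexes start at 0, so tds[2] is the 3rd cell
        names.append(tds[2].get_text().strip())
    else:
        names.append(None)  # row is too short to hold a speaker name

print(names)
```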

4. ### Get President's Name
- Finally, we grab the name of the sitting President at the time.
- The President's name is part of the title, which is enclosed in 'h1' tags.

```
title = soup.h1.get_text()

```

## <font color='orange'>Regular Expressions</font>

- What if you need to grab a section of text, which *isn't* enclosed in special tags?
- Here's where it gets a bit tricky (but doable!)
- **Regular Expressions** are tools for matching characters (like find and replace!). They work by using a combination of brackets and special characters to capture only what you need.
- Regular Expressions can be hard to master, but there are online tools (like [RegexR](https://regexr.com/)) for testing your code.

### Grab *just* the President's Name
- Once we've got our regular expression (like a net), we can search for a match using the format below.

```
import re

president_match = re.search(r'^.*President ([a-zA-Z]*)\'s', title)
president = president_match.group(1) if president_match else 'Unknown'
print(president)
```
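
To get a feel for what the capture group returns, the sketch below runs a close variant of the pattern over a few invented titles. Two things differ from the snippet above and are assumptions on my part: the pattern drops the leading `^.*` (so it finds the first "President ..." rather than the last), and it accepts a curly apostrophe as well as a straight one.

```python
import re

# Invented titles, including one with a curly apostrophe and one with no match
titles = [
    "Republican Party Response to President Obama's Address Before a Joint Session of the Congress",
    "Democratic Party Response to President Bush’s State of the Union Address",
    "A Title Without a Name",
]

# ['’] accepts either a straight or a curly apostrophe
pattern = re.compile(r"President ([A-Za-z]+)['’]s")

presidents = []
for title in titles:
    match = pattern.search(title)
    # group(1) is the text captured by the parentheses
    presidents.append(match.group(1) if match else "Unknown")

print(presidents)
```

Names containing spaces, hyphens, or accented letters would still fall through to "Unknown" with this character class, so check your results against the page titles.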

## <font color='orange'>Ethical Practices</font>

### Identify yourself as 'friendly'
- At the start of your code, set a new variable 'user_agent.'
- Add a name for your bot.
- Add a link to your website (or institution).

```
# Identify your bot using a User-Agent string
user_agent = 'MyWebScraper/1.0 (+https://github.com/cmiya)'
```
- In the initial request, insert a new line with a 'header' that identifies yourself to the server.

```
req = urllib.request.Request(
    URL,
    headers={
        'User-Agent': user_agent
    }
)
```
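
You can confirm the header is attached before anything is sent. This is a minimal sketch with placeholder values (the `example.org` addresses stand in for your own contact page and target URL); it only builds the request object and does not contact a server.

```python
import urllib.request

# Placeholder values: swap in your own bot name, contact page, and target URL
user_agent = "MyWebScraper/1.0 (+https://example.org/about-my-bot)"
url = "https://example.org/some-page"

req = urllib.request.Request(url, headers={"User-Agent": user_agent})

# Nothing has been sent yet; we are only inspecting the request object.
# urllib stores header names capitalised as 'User-agent'.
print(req.get_header("User-agent"))
print(req.full_url)
```

Passing `req` to `urllib.request.urlopen` in place of the bare URL is what actually sends the header.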

### Insert Pauses
- We don't want to overload the server!
- It's good practice to insert "rests."
- Here, we've added a random 10-15 second pause between each request (the finished code at the end uses a shorter 1-3 second pause).

```
#Give server time to breathe!
time.sleep(random.uniform(10, 15))
```
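
One way to keep the pause length in a single place is to wrap it in a small function. The `polite_pause` helper below is hypothetical, and the demo call uses tiny bounds only so that it finishes quickly.

```python
import random
import time

def polite_pause(min_seconds=10, max_seconds=15):
    """Sleep for a random interval and return how long we waited (hypothetical helper)."""
    wait = random.uniform(min_seconds, max_seconds)
    time.sleep(wait)
    return wait

# Tiny bounds here so the demo finishes quickly; use longer ones when scraping
waited = polite_pause(0.01, 0.05)
print(f"Paused for {waited:.3f} seconds")
```

Randomising the interval spreads requests out unevenly, which is gentler on the server than a fixed rhythm.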

## <font color='orange'>Save Dataset</font>
- Finally, let's save our dataset.

1. ### Create a Folder and CSV File.
- At the *start* of your code, establish where you want to save your data.

``` 
folder_path = 'Desktop/rebuttals'
if not os.path.exists(folder_path):
    os.makedirs(folder_path)

csv_path = 'Desktop/USRebuttals.csv'
with open(csv_path, 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['year', 'president', 'rebuttal_speaker'])
```

2. ### Set Filename
- As you cycle through the list, for each speech you need to create a new file name and path.
- The date is unique to each speech, so we'll use that to title the files.

```

file_date = datetime_object.date().strftime('%Y-%m-%d')
file_name = f'rebut-{file_date}.txt'
full_path = os.path.join(folder_path, file_name)
```

3. ### Save File to Folder
- Then we'll save the speech as a txt file to our folder.

```
with open(full_path, 'w') as f:
    f.write(speech)
```

4. ### Save Metadata as CSV
- Finally, we'll write the date, president name, and speaker name to a new row in our CSV spreadsheet.

```
writer.writerow([datetime_object.date(), president, name])
```
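
The saving steps can be tried end to end without scraping anything. The sketch below uses invented records and a temporary folder (via `tempfile`) instead of the Desktop paths, and labels the first column `date` because it holds a full date; those choices are assumptions for the demo. Note that every `writer.writerow` call has to happen inside the `with` block, while the CSV file is still open.

```python
import csv
import os
import tempfile

# Invented records standing in for scraped results
records = [
    ("1991-01-29", "Bush", "Senator Example One (ME)", "Speech text one."),
    ("1993-02-17", "Clinton", "Rep. Example Two (IL)", "Speech text two."),
]

base_dir = tempfile.mkdtemp()             # throwaway folder for the demo
folder_path = os.path.join(base_dir, "rebuttals")
os.makedirs(folder_path, exist_ok=True)   # no error if the folder already exists
csv_path = os.path.join(base_dir, "USRebuttals.csv")

with open(csv_path, "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["date", "president", "rebuttal_speaker"])
    for date, president, speaker, speech in records:
        # One text file per speech, named after its date
        full_path = os.path.join(folder_path, f"rebut-{date}.txt")
        with open(full_path, "w", encoding="utf-8") as f:
            f.write(speech)
        writer.writerow([date, president, speaker])

# Read the CSV back to confirm what was written
with open(csv_path, newline="", encoding="utf-8") as file:
    rows = list(csv.reader(file))
saved_files = sorted(os.listdir(folder_path))

print(rows)
print(saved_files)
```

Setting `encoding="utf-8"` explicitly avoids platform-dependent defaults when speeches contain non-ASCII characters.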

## <font color='orange'>Finished Code</font>

- Now that we've built the components of our 'bot,' let's put the pieces together!

In [163]:
from bs4 import BeautifulSoup
import urllib.request
import time
import random
import re
from datetime import datetime
import os
import csv

r = urllib.request.urlopen('https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/annual-messages-congress-the-state-the-union/list').read()
#soup = BeautifulSoup(r)

soup = BeautifulSoup(r, 'html.parser')

#print(soup.prettify())

# Identify your bot using a User-Agent string
user_agent = 'MyWebScraper/1.0 (+https://github.com/cmiya)'

rows = soup.tbody.find_all('tr')

folder_path = 'Desktop/rebuttals'
if not os.path.exists(folder_path):
    os.makedirs(folder_path)

csv_path = 'Desktop/USRebuttals.csv'
with open(csv_path, 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['year', 'president', 'rebuttal_speaker'])
    
    for row in rows:
        a = row.find('a')
        
        if a:
            URL = a['href']
            print(URL)

            date = a.get_text()
            try:
                datetime_object = datetime.strptime(date, '%B %d, %Y')
                print(datetime_object.date())
            except ValueError as e:
                print(f"Error parsing date: {e}")
                continue

            tds = row.find_all('td')
            name = tds[2].get_text().strip()
            print(name)

            # Create a request with the User-Agent header
            req = urllib.request.Request(
                URL,
                headers={
                    'User-Agent': user_agent
                }
            )
            
            try:
                # Use the request object to open the URL
                r = urllib.request.urlopen(req).read()
            except Exception as e:
                print(f"Error opening URL {URL}: {e}")
                continue

            soup = BeautifulSoup(r, 'html.parser')

            title = soup.h1.get_text()
            president_match = re.search(r'^.*President ([a-zA-Z]*)\'s', title)
            president = president_match.group(1) if president_match else 'Unknown'
            print(president)

            speech_wintro = soup.find("div", {"class": "field-docs-content"})
            just_speech = speech_wintro.find_all('p')[1:]
            speech = ' '.join([tag.get_text() for tag in just_speech])

            file_date = datetime_object.date().strftime('%Y-%m-%d')
            file_name = f'rebut-{file_date}.txt'
            full_path = os.path.join(folder_path, file_name)

            with open(full_path, 'w') as f:
                f.write(speech)

            writer.writerow([datetime_object.date(), president, name])

            #Give server time to breathe!
            time.sleep(random.uniform(1, 3))

https://www.presidency.ucsb.edu/ws/index.php?pid=109261
1991-01-29
Senator George Mitchell (ME)
Bush
https://www.presidency.ucsb.edu/ws/index.php?pid=109260
1992-01-28
House Speaker Tom Foley (WA)
Bush
https://www.presidency.ucsb.edu/ws/index.php?pid=109235
1993-02-17
Rep. Bob Michel (IL)
Clinton
https://www.presidency.ucsb.edu/ws/index.php?pid=109236
1994-01-25
Senator Robert Dole (KS)
Clinton
https://www.presidency.ucsb.edu/ws/index.php?pid=109259
1995-01-24
Governor Christine Todd Whitman (NJ)
Clinton
https://www.presidency.ucsb.edu/ws/index.php?pid=109237
1996-01-23
Senator Robert Dole (KS)
Clinton
https://www.presidency.ucsb.edu/ws/index.php?pid=109238
1997-02-04
Rep. J.C. Watts (OK)
Clinton
https://www.presidency.ucsb.edu/ws/index.php?pid=109239
1998-01-27
Senator Trent Lott (MS)
Clinton
https://www.presidency.ucsb.edu/ws/index.php?pid=109240
1999-01-19
Rep. Jennifer Dunn (WA) and Rep. Steven Largent (OK)
Clinton
https://www.presidency.ucsb.edu/ws/index.php?pid=109241
2000-01-27
