---

### 🎓 **Professor**: Apostolos Filippas

### 📘 **Class**: Web Analytics

### 📋 **Topic**: Scraping data using Beautiful Soup

🚫 **Note**: You are not allowed to share the contents of this notebook with anyone outside this class without written permission by the professor.

---

# 🌐 0. Introduction

This notebook details how we can scrape White House news.

As always, we will begin by importing our libraries.

Let's also create:
- `data`: an empty list, to which we will append our scraped content
- `my_headers`: a dictionary that tells the server what kind of browser is requesting the webpage

In [None]:
# 1. importing useful libraries

import requests # For making HTTP requests to the web.
import time     # To introduce pauses in our code, ensuring we don't overwhelm the server.
from bs4 import BeautifulSoup # A popular library to parse and navigate HTML content

# create an empty list to store our scraped data
data  = [] 
# Define headers to simulate a browser request. 
# This can help bypass restrictions that prevent scripts or bots from accessing web content.
my_headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}

---
# 📕 1. Fetching the Webpage

First, we will try to get the page up to five times:
- ✅ **Success**: if we are successful, we will immediately exit the loop
- 🚫 **Failure**: if we are unsuccessful, the `requests.get` method will raise an exception. In response, the code will print a message indicating this, wait two seconds, and try again

At the end, we will tell the user whether we got the page or not.

In [None]:
# Define the target URL
page = 'https://www.whitehouse.gov/news/page/1/' 

# Initialize the source content to None 
# (False would work as well, but using None is more Python-idiomatic)
src = None

# Now get the page

# Try to scrape the page up to 5 times
for i in range(5): 
    try:
        # Fetch the content of the URL with the specified headers
        response = requests.get(page, headers = my_headers)
        # If the request was successful, store the page content in src & exit the loop
        src = response.content
        break 
    # Catch only the exceptions raised by requests.get(), e.g., connection errors
    except requests.exceptions.RequestException:
        print(f'Failed attempt # {i+1}')
        # wait 2 secs before trying again
        time.sleep(2)

# Let the user know if the page content was fetched successfully
if src:
    print(f'Successfully got page: {page}')
else:
    print(f'Could not get page: {page}')


Try to get a page that does not exist, e.g. `https://www.whitsadsadasdasdehouse.gov/news/page/1/`, and see what happens!
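
The retry logic above can also be wrapped in a reusable function. The sketch below is one possible way to do it, not the notebook's own code: the name `fetch_with_retries` and its parameters (`attempts`, `wait`, `get`) are hypothetical, and the `get` argument exists only so the function can be tried with a stand-in instead of a real network call. Unlike the cell above, this version also treats HTTP error codes (such as 404) as failures, via `raise_for_status()`.

```python
import time
import requests


def fetch_with_retries(url, headers=None, attempts=5, wait=2, get=requests.get):
    """Return the page content as bytes, or None if every attempt fails.

    `get` defaults to requests.get; it is a parameter only so that a
    stand-in function can be supplied when experimenting offline.
    """
    for attempt in range(1, attempts + 1):
        try:
            response = get(url, headers=headers, timeout=10)
            # Raise an HTTPError for 4xx/5xx responses (an added assumption:
            # the original loop accepts any response as a success)
            response.raise_for_status()
            return response.content
        except requests.exceptions.RequestException as err:
            print(f"Failed attempt # {attempt}: {err}")
            # Pause before the next try, but not after the last one
            if attempt < attempts:
                time.sleep(wait)
    return None
```

Calling `fetch_with_retries(page, headers=my_headers)` would then return the bytes of the page, or `None` if all attempts failed.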

---
# 🔢 2. Encoding

When we fetch data from the web using libraries like requests, the data we get can be in various formats such as HTML, JSON, images, etc. This raw data is sent over the internet in the form of bytes. Therefore, when we receive the data, it's in its raw byte form. 

Bytes are represented in Python as a `bytes` object.
- The bytes object in Python is an immutable sequence of bytes used to represent raw, binary data.
- "Bytes" are the smallest addressable unit in a computer and can represent a wide variety of data, including text, images, audio, and more.

`bytes` are different from `strings`
-  Strings in Python represent text, and each character in a string corresponds to a specific symbol. When we want to store or transmit this text, it needs to be encoded into bytes, which is where encodings like ASCII, utf-8, and others come into play. 
-  Each encoding has its own way of mapping characters to sequences of bytes.
-  Strings are sequences of characters, while bytes are sequences of bytes.
-  Conversion between strings and bytes is done via encoding (to bytes) and decoding (to string).

To decode bytes to strings, we need to know how the bytes were encoded.
- The server tells us the encoding of the data it sends us in the response headers.
- requests also comes with a built-in encoding detector, which guesses the encoding of the content it receives
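
To make the difference between strings and bytes concrete, here is a minimal sketch of the round trip. The sample word is our own choice and is not taken from the scraped page.

```python
# A text string and its UTF-8 byte representation
text = "café"
encoded = text.encode("utf-8")      # str -> bytes
decoded = encoded.decode("utf-8")   # bytes -> str

print(type(text), len(text))        # 4 characters
print(type(encoded), len(encoded))  # 5 bytes: "é" takes two bytes in UTF-8
print(encoded)
print(decoded == text)

# Decoding with the wrong encoding does not fail here, but it garbles the text
wrong = encoded.decode("latin-1")
print(wrong)
```

This is why we need to know (or guess) the encoding before turning the bytes in `src` into text.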

In [None]:
print(src)
print("Encoding as per headers:", response.encoding)
print("Apparent encoding:", response.apparent_encoding)

--- 

# 🍜 3. Using Beautiful Soup to parse the information

Beautiful Soup is a Python library designed for web scraping purposes to pull the data out of HTML and XML files. It creates parse trees from these files that are helpful to extract the data easily.

## 3.1 Parsing the HTML as a Beautiful Soup object
Before we parse the content, we need to decode it from bytes into a string. The web contains data in a plethora of encodings, but UTF-8 is by far the most common, so we decode using UTF-8. Because we pass `'ignore'`, any bytes that are not valid UTF-8 are dropped, so it's worth noting that we may lose a few special characters or symbols if the page was actually sent in a different encoding.

In [None]:
# create a BeautifulSoup object with the BeautifulSoup constructor
soup = BeautifulSoup(src.decode('utf-8', 'ignore'), 'html.parser')
# confirm that you have successfully created a BeautifulSoup object
type(soup)

Why `decode('utf-8', 'ignore')`?
- **decode('utf-8')**: This converts bytes (which is the type of our fetched content) into a string using the 'utf-8' encoding.
- **'ignore' parameter**: If some bytes aren't valid in 'utf-8', instead of throwing an error, the 'ignore' option tells Python to just skip over them. This is especially useful when scraping web content as we can sometimes encounter non-standard characters.
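
The effect of the error-handling argument can be seen on a small made-up example. The bytes below are our own illustration: they contain an "é" encoded as Latin-1, which is not valid UTF-8.

```python
raw = b"caf\xe9 au lait"   # "é" encoded as Latin-1, which is not valid UTF-8

try:
    raw.decode("utf-8")    # strict decoding (the default) raises an error
    strict_failed = False
except UnicodeDecodeError as err:
    strict_failed = True
    print(f"Strict decoding failed: {err}")

ignored = raw.decode("utf-8", "ignore")    # silently drops the invalid byte
replaced = raw.decode("utf-8", "replace")  # substitutes the U+FFFD marker instead
print(ignored)
print(replaced)
```

Note that `'ignore'` loses information without telling us, so `'replace'` can be a useful alternative when we want to see where the problems were.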

The `BeautifulSoup()` constructor needs two arguments:
- **The HTML (or XML) content** that we are going to parse.
- **The parser library name.** In this case, we're using the `html.parser`, which is a nice built-in parser. There are other parsers, such as `html5lib`, and `lxml`. Often, the `lxml` parser is preferred for its speed.

## 3.2 Parse Tree

A parse tree, often referred to as a *parsing tree* or *syntax tree*, represents syntactic constructs in source code or any structured text. It's utilized to portray the code or content's structure in a hierarchical format.

For HTML and XML documents, the parse tree denotes the structure of the document. Each tag, attribute, and piece of text stands as a node in this tree.

### Key Concepts:

- **Node**: Each unique part of the tree, such as an HTML tag, is termed a node. For instance, in an HTML document like `<title>My Web Page</title>`, both the `<title>` tag and the text "My Web Page" are nodes.

- **Root Node**: This is the highest node in the tree. In the context of an HTML document, the root node is typically represented by the `<html>` tag.

- **Child Nodes**: These are nodes that stem from another node. For HTML, tags and content nested within another tag are its child nodes. As an example, in `<body><p>Text</p></body>`, `<p>` is a child node of `<body>`.

- **Parent Node**: The opposite of a child node. If node A is a child of node B, then node B is the parent of node A.

- **Sibling Nodes**: These nodes share the same parent. If two tags or textual pieces are housed within the same parent tag, they are acknowledged as siblings.

### Why is the Parse Tree Important?

- **Navigation**: With the HTML or XML content transformed into a parse tree, navigation becomes possible. This empowers users to discover specific document segments, traverse the structure in various directions, and pull out the desired data.

- **Search**: Specific tags, classes, IDs, or other attributes can be located. This capability is a cornerstone in web scraping.

- **Modification**: Content can be altered through the parse tree. This includes the addition, removal, or alteration of tags and attributes.

- **Extraction**: Once the desired data is pinpointed, structured extraction becomes feasible.

### Example
Here's some HTML
```
<html>
    <head>
        <title>My Web Page</title>
    </head>
    <body>
        <p class="intro">Welcome to my web page!</p>
        <p>Here's some more text.</p>
    </body>
</html>
```

Parse Tree Structure:
- `<html>` (Root Node)
  - `<head>` (Child of `<html>`)
    - `<title>` (Child of `<head>`)
      - "My Web Page" (Text node, child of `<title>`)
  - `<body>` (Child of `<html>`, sibling of `<head>`)
    - `<p class="intro">` (Child of `<body>`)
      - "Welcome to my web page!" (Text node, child of first `<p>`)
    - `<p>` (Second `<p>` tag, child of `<body>`, sibling of the first `<p>`)
      - "Here's some more text." (Text node, child of second `<p>`)

The representation above is a basic idea of how the parse tree would look for the given HTML.

### Beautiful Soup and the Parse Tree:

When Beautiful Soup parses an HTML or XML document, it builds an in-memory parse tree from the page source. This lets us navigate the page's structure, search it, and modify its components. Methods and attributes such as
- `find()`
- `find_all()`
- `parent`
- `children`
- `next_sibling`
- `previous_sibling`

and more let us move around and interact with this tree.
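
As a minimal sketch, here is how a few of these look on the example HTML from above. The variable name `soup_demo` is ours, and the HTML is written on one line on purpose: with indentation, the whitespace between tags would itself become text nodes and show up as siblings.

```python
from bs4 import BeautifulSoup

# The example document from above, without whitespace between tags
html = (
    "<html><head><title>My Web Page</title></head>"
    '<body><p class="intro">Welcome to my web page!</p>'
    "<p>Here's some more text.</p></body></html>"
)

soup_demo = BeautifulSoup(html, "html.parser")

first_p = soup_demo.find("p")            # first <p> in document order
print(first_p.get_text())                # text inside the tag
print(first_p["class"])                  # class values come back as a list
print(first_p.parent.name)               # name of the enclosing tag
print(first_p.next_sibling.get_text())   # the second <p>
print([child.name for child in soup_demo.body.children])
print(soup_demo.title.string)
```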


## 3.3 Using find_all with Beautiful Soup
The `find_all` method is one of the most commonly used methods in Beautiful Soup to search the parse tree. It returns all the matching tags found in the document, in the form of a list.

In [None]:
links = soup.find_all("a")
# observe the similarity with  re.findall('<a .+</a>', response.text ) 
print(len(links))
links[16]

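In the above line, `soup.find_all("a")` looks for all `<a>` tags, commonly used for hyperlinks, in the parsed HTML document. It is similar in spirit to using a regular expression like `re.findall('<a .+</a>', response.text)`, which attempts to find all patterns in the raw HTML that resemble anchor tags. The two are not equivalent, though: the regular expression only matches text patterns (and the greedy `.+` can merge several links on the same line into a single match), whereas Beautiful Soup works on the parsed tree and returns one object per tag.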
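
The difference between the two approaches shows up on a small made-up snippet with two links on the same line (the HTML here is our own illustration, not taken from the scraped page):

```python
import re
from bs4 import BeautifulSoup

html = '<p><a href="/a">First</a> and <a href="/b">Second</a></p>'

# The greedy ".+" runs from the first "<a " to the last "</a>" on the line
regex_matches = re.findall('<a .+</a>', html)

# Beautiful Soup returns one Tag object per <a> element
tag_matches = BeautifulSoup(html, "html.parser").find_all("a")

print(len(regex_matches), regex_matches)
print(len(tag_matches), [a["href"] for a in tag_matches])
```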

In [None]:
element = links[16]


In [None]:
element.parent

In [None]:
for k in element.children:
    print(k)

Let's do another example:

In [None]:
element = links[-3]
print(f"Here is the element we grabbed: {element}")

In [None]:
print("Here is the text of the element:")
print(element.text)

In [None]:
print("Here are the attributes of the element:")
print(element.attrs)

In [None]:
print("Here is the parent of the element:")
print(element.parent)

In [None]:
print("Here are the children of the element's parent (the element and its siblings):")
for k in element.parent.children:
    print(f"- {k}")

---
# 🔍 4. Finding Data Location using the "Inspect" tool


To accurately extract information, it's crucial to first identify where it resides within the webpage's structure. For each statement, we aim to retrieve its type, link text, and link address.

<div align="center">
    <img src="https://github.com/apostolosfilippas/wa/blob/main/assets/whitehouse-1.png?raw=true">
    <br><br>
</div>

Using your browser's "inspect" feature expedites the scraping process:

1. **Initiate Inspect Mode**: Right-click on any webpage element and select "Inspect".
2. **Element Selector**: Click the "Select an element" tool.
3. **Element Navigation**: Hover over the webpage elements. The corresponding HTML structure will be highlighted. Click when you've located the desired element to pin its HTML code in the inspector.

<div align="center">
    <img src="https://github.com/apostolosfilippas/wa/blob/main/assets/whitehouse-2.png?raw=true">
    <br><br>
</div>

Using the inspect tool allows you to find the HTML segments containing the data you want to access.

### 🌍 **Google Chrome**:
1. **Right-click** on an element.
2. Select **"Inspect"** from the context menu.
3. Alternatively, use the shortcut **Ctrl + Shift + I** (Windows/Linux) or **Cmd + Option + I** (Mac).

### 🦊 **Mozilla Firefox**:
1. **Right-click** on an element.
2. Choose **"Inspect Element"** from the dropdown.
3. Shortcut: **Ctrl + Shift + C** (Windows/Linux) or **Cmd + Option + C** (Mac).

### 🌐 **Microsoft Edge**:
1. **Right-click** on the desired item.
2. Opt for **"Inspect Element"**.
3. You can also use **F12** or **Ctrl + Shift + I** to directly open the developer tools.

### 🍏 **Safari**:
(Note: You might need to enable the "Develop" menu first from Preferences > Advanced)
1. **Right-click** on the element (if you have a Mac with a single button mouse, Ctrl + click).
2. Choose **"Inspect Element"**.
3. Shortcut: **Cmd + Option + C**.


---
# 🎯 5. Getting what we want

## 5.1 Finding a tag

Next, we need to find the tag that "contains" the information we want.

Inspecting the HTML, we find that each `<li>` tag whose `data-wp-key` attribute has a value of the form "post-template-item-*****" contains the tags with the info that we want to parse:

1. the `<a>` tag whose `class` attribute has the value "news-item__title"
   
2. ... and other information

By "contains" we mean that this information is found after the `<li>` tag opens, and before it closes.
- alternatively, we are looking at the "children" of the `<li>` tag

This is easy to see in the "inspect view" because everything that an element contains is indented further than that element!

See for example below:

<div align="center">
    <img src="https://github.com/apostolosfilippas/wa/blob/main/assets/whitehouse-3.png?raw=true">
    <br><br>
</div>

To find all of these `<li>` tags, we simply ask Beautiful Soup to look for them in the `soup` variable, which contains the parsed HTML.

In [None]:
# find all li tags with a data-wp-key attribute
entries = soup.find_all("li", attrs={"data-wp-key": True})

# filter them to only keep the <li> tags whose "data-wp-key" attribute starts with "post-template-item-"
filtered_entries = [li for li in entries if li["data-wp-key"].startswith("post-template-item-")]

# how many entries did we find?
print(f"We found {len(filtered_entries)} entries")
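
As an alternative to filtering with a list comprehension, Beautiful Soup also accepts a compiled regular expression as an attribute value and keeps only the tags whose attribute matches it. The sketch below runs on a small made-up list rather than the real page; the HTML and the names `demo` and `posts` are assumptions for illustration.

```python
import re
from bs4 import BeautifulSoup

# Made-up HTML that mimics the structure described above
html = """
<ul>
  <li data-wp-key="post-template-item-101"><h2><a href="/one/">One</a></h2></li>
  <li data-wp-key="sidebar-item-7">Not a post</li>
  <li data-wp-key="post-template-item-102"><h2><a href="/two/">Two</a></h2></li>
  <li>No key at all</li>
</ul>
"""
demo = BeautifulSoup(html, "html.parser")

# Keep only <li> tags whose data-wp-key starts with "post-template-item-"
posts = demo.find_all("li", attrs={"data-wp-key": re.compile(r"^post-template-item-")})

print(len(posts))
print([li["data-wp-key"] for li in posts])
```

Both approaches should select the same tags; the list comprehension is more explicit, while the regular expression does the search and the filter in one step.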

Let's take a look at the first entry:

In [None]:
print(filtered_entries[0].prettify())

In [None]:
print(filtered_entries[9].prettify())

You can take a look at the attributes of each HTML tag, and verify that these are the tags that we want.

### Note
Let's say you have the following HTML:
```html
<div>
    <a href="#">Link 1</a>
    <div>
        <a href="#">Nested Link 1</a>
        <a href="#">Nested Link 2</a>
    </div>
</div>
```
If you use `soup.find_all("a")`, it will return all three `<a>` tags: 
- `<a href="#">Link 1</a>`
- `<a href="#">Nested Link 1</a>`
- `<a href="#">Nested Link 2</a>`

Even though Nested Link 1 and Nested Link 2 are inside a nested `<div>`, Beautiful Soup returns them as separate matches.
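
A short sketch of this behavior on the HTML from the note, together with the `recursive=False` option, which restricts the search to direct children only. The variable names are ours.

```python
from bs4 import BeautifulSoup

html = (
    '<div><a href="#">Link 1</a>'
    '<div><a href="#">Nested Link 1</a><a href="#">Nested Link 2</a></div>'
    "</div>"
)
demo = BeautifulSoup(html, "html.parser")
outer_div = demo.find("div")   # the outermost <div>

# By default, find_all searches all descendants, however deeply nested
all_links = outer_div.find_all("a")

# With recursive=False, only direct children of outer_div are considered
direct_links = outer_div.find_all("a", recursive=False)

print([a.get_text() for a in all_links])
print([a.get_text() for a in direct_links])
```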


Alright, now that we have all the briefings, statements, etc., let's grab the information we need for each entry.
We need to grab two kinds of information:

- the text of the statement
- the url of the statement

## 5.3 Getting the text
Let's first do this for a single entry, and then do it for every entry using a for loop.

In [None]:
entry = filtered_entries[2]
print(entry.prettify())
print(type(entry))

In [None]:
# first grab the <h2> tag and everything it contains from the entry
# to do so, simply use Beautiful Soup's "find" functionality 
# (find returns only the first occurrence, which is fine here because there is only one <h2>)
entry_text = entry.find('h2').get_text()
print(entry_text)

In [None]:
# We could have also grabbed the text of the first <a> tag
entry_text = entry.find('a').get_text()
print(entry_text)

In [None]:
# strip unnecessary leading and trailing whitespace using Python's built-in string method strip()
entry_text = entry.find('a').get_text().strip()
print(entry_text)

## 5.4 Getting the URL

In [None]:
# find the <a> tag with the attribute target="_self"
entry_url = entry.find('a', attrs={'target': "_self"})
# keep only the value of the href attribute
entry_url = entry_url.attrs.get('href')
print(entry_url)

In [None]:
print(f'url   = {entry_url}')
print(f'text  = {entry_text}')


## 5.5 Getting information for all articles
Now to scrape all articles from the webpage, simply use a for loop!

In [None]:
data = []

entries = soup.find_all('li', {'data-wp-key': True})

for entry in entries:
    
    # find the <h2> tag, grab its text, and strip() it
    entry_text = entry.find('h2').get_text().strip()
    
    # and get the url too
    entry_url = entry.find('a', attrs={'target': "_self"}).attrs.get('href')
    
    # add all the info to the data list
    data.append([entry_url, entry_text])

for i, entry in enumerate(data):
    print(f"Entry {i+1}:")
    print(f"- URL:  {entry[0]}")
    print(f"- Text: {entry[1]}")


## 5.6 Scraping all articles in webpage with failsafes
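Let's now see how to deal with cases where we search for elements that do not exist. When `find` has no match it returns `None`, and calling a method such as `get_text()` on `None` raises an error, so we check that the element was found before using it.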
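
To see why a failsafe is needed, here is a minimal sketch on a made-up entry that has a link but no `<h2>` tag. The HTML and the names `entry_demo`, `missing`, and `title` are our own illustration.

```python
from bs4 import BeautifulSoup

# A made-up entry that has a link but no <h2> heading
entry_demo = BeautifulSoup('<li><a href="/x/">No heading here</a></li>', "html.parser").li

missing = entry_demo.find("h2")
print(missing)   # find() returns None when nothing matches

try:
    missing.get_text()   # calling a method on None fails
except AttributeError as err:
    print(f"AttributeError: {err}")

# The failsafe: only use the element if it was actually found
title = missing.get_text().strip() if missing else None
print(title)
```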


In [None]:
data = []

entries = soup.find_all('li', {'data-wp-key': True})

for entry in entries:
    
    entry_text = None
    entry_url = None

    # find h2, grab text, strip() it
    found_element = entry.find('h2')
    if found_element:
        entry_text = found_element.get_text().strip()
    
    # find a, get href
    found_element = entry.find('a', attrs={'target': "_self"})
    if found_element:
        entry_url = found_element.attrs.get('href')
    
    # add all the info to the data list
    data.append([entry_url, entry_text])

for i, entry in enumerate(data):
    print(f"Entry {i+1}:")
    print(f"- URL:  {entry[0]}")
    print(f"- Text: {entry[1]}")

# 📃 6. Scraping many pages!

To scrape many pages, we simply add a for loop to everything...




In [None]:
data = []
numPages = 5

for k in range(1,numPages+1):
    
    # Give the url of the page
    page = f'https://www.whitehouse.gov/news/page/{k}/' 
    # Initialize src to None
    src  = None

    # Now get the page

    # try to scrape 5 times
    for i in range(1,6): 
        try:
            # get url content
            response = requests.get(page, headers = my_headers)
            # get the html content
            src = response.content
            # if we successfully got the page, break the loop
            break 
        # if requests.get() raised an exception, i.e., the attempt to get the response failed
        except requests.exceptions.RequestException:
            print(f'failed attempt # {i}')
            # wait 2 secs before trying again
            time.sleep(2)

    # if successful, let the user know
    if src:
        print(f'Successfully got page: {page}')
    # if unsuccessful, notify the user and move to the next page
    else:
        print(f'Could not get page: {page}')
        continue 
    
    soup     = BeautifulSoup(src.decode('utf-8', 'ignore'), 'html.parser')
    entries = soup.find_all('li', {'data-wp-key': True})

    for entry in entries:
        
        entry_text = None
        entry_url = None

        # find h2, grab text, strip() it
        found_element = entry.find('h2')
        if found_element:
            entry_text = found_element.get_text().strip()
        
        # find a, get href
        found_element = entry.find('a', attrs={'target': "_self"})
        if found_element:
            entry_url = found_element.attrs.get('href')
        
        # add all the info to the data list
        data.append([entry_url, entry_text])
    
    # always a good idea to take a nap
    time.sleep(2)

In [None]:
for entry in data:
    print(entry)

# <center><font color='red'>CHALLENGE</font></center>

<div align="center">
    <img src="https://github.com/apostolosfilippas/wa/blob/main/assets/deal.png?raw=true" width="500">
    <br><br>
</div>

In the remainder of the class I ask that you work on the following challenge (+1% bonus)

1. Extend today's code to also scrape the "type" and "date" of each announcement

2. Save all information that you scraped about each announcement in a .txt file. Each announcement on the website should be a line in the file, and each article's attributes (title, url, date, and type) should be tab-separated, like `Article_title\tArticle_url\tArticle_date\tArticle_type\n`
3. If an article has several types, then separate the types with a dash "-", like `Article_title\tArticle_url\tArticle_date\tArticle_type1-Article_type2-Article_type3`
4. After you've saved the data, make sure you can read it using Python.