In [None]:
import pandas as pd
import numpy as np
import os
import requests
import json
import bs4

from IPython.display import display, Image

# Lecture 16 – More Parsing Examples

## DSC 80, Spring 2023

### Agenda

- Parsing HTML using Beautiful Soup.
    - Example: Scraping the HDSI Faculty page.
    - Example: Scraping quotes.
- Nested vs. flat data structures.

## Parsing HTML using Beautiful Soup

### `BeautifulSoup` objects

- `bs4.BeautifulSoup` takes in a string or file-like object representing HTML (markup) and returns a **parsed** document.
- Remember, HTML documents are represented as **trees**, under the "Document Object Model."
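
To make the tree idea concrete, here is a minimal sketch that records how deeply each tag is nested. It uses only the standard library's `html.parser` rather than Beautiful Soup. The names `TreePrinter`, `VOID_TAGS`, and `toy_html` are made up for this illustration.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they should not increase the depth.
VOID_TAGS = {'hr', 'br', 'img', 'meta', 'link', 'input'}

class TreePrinter(HTMLParser):
    """Records (depth, tag) pairs in the order the opening tags appear."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        self.nodes.append((self.depth, tag))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1

toy_html = '<html><body><div><h1>Title</h1><p>Text</p></div><ul><li>a</li></ul></body></html>'

printer = TreePrinter()
printer.feed(toy_html)

# Indentation shows each tag's depth in the tree.
for depth, tag in printer.nodes:
    print('  ' * depth + tag)
```

The order in which the tags are recorded is the order in which a depth-first traversal of the tree visits them.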

In [None]:
html_string = '''
<html>
    <body>
      <div id="content">
        <h1>Heading here</h1>
        <p>My First paragraph</p>
        <p>My <em>second</em> paragraph</p>
        <hr>
      </div>
      <div id="nav">
        <ul>
          <li>item 1</li>
          <li>item 2</li>
          <li>item 3</li>
        </ul>
      </div>
    </body>
</html>
'''.strip()

In [None]:
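# Naming the parser explicitly avoids a warning and keeps results consistent across machines.
soup = bs4.BeautifulSoup(html_string, 'html.parser')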

In [None]:
type(soup)

### Finding elements in a tree

The most common methods you'll use to find _tags_ in a `soup` object are:
- `soup.find(tag)`, which finds the **first** instance of a tag (the first one on the page, i.e. the first one that a depth-first search of the tree encounters).
    - More general: `soup.find(name=None, attrs={}, recursive=True, text=None, **kwargs)`.
- `soup.find_all(tag)`, which finds **all** instances of a tag.
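
The difference between the two methods can be sketched on a small, made-up document. Here `toy_html` and `toy_soup` are hypothetical names, and the parser is assumed to be Python's built-in `'html.parser'`.

```python
import bs4

toy_html = '<div id="a"><p>first</p><div id="b"><p>second</p></div></div><p>third</p>'
toy_soup = bs4.BeautifulSoup(toy_html, 'html.parser')

# find returns a single tag: the first match in document order.
first_p = toy_soup.find('p')

# find_all returns every match, however deeply nested.
all_p = toy_soup.find_all('p')

# attrs narrows the search to tags with specific attribute values.
inner_div = toy_soup.find('div', attrs={'id': 'b'})

# recursive=False only looks at direct children of the object being searched.
top_level_p = toy_soup.find_all('p', recursive=False)

# When nothing matches, find returns None (and find_all returns an empty list).
missing = toy_soup.find('table')

print(first_p.text)
print([p.text for p in all_p])
print(inner_div.find('p').text)
print([p.text for p in top_level_p])
print(missing)
```

Since `find` returns `None` when there is no match, calling `.text` on its result without checking can raise an `AttributeError`.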


### Using `find_all`

`find_all` returns a list of all matches.

In [None]:
soup.find_all('div')

In [None]:
soup.find_all('li')

In [None]:
[x.text for x in soup.find_all('li')]

### Node attributes
* The `text` attribute of a tag element gets the text between the opening and closing tags.
* The `attrs` attribute is a dictionary containing all of a tag's attributes.
* The `get(key)` method gets the value of a tag attribute.

In [None]:
soup.find('p')

In [None]:
soup.find('p').text

In [None]:
soup.find('div')

In [None]:
soup.find('div').attrs

In [None]:
soup.find('div').get('id')

The `get` method must be called directly on the node that contains the attribute you're looking for.

In [None]:
soup

In [None]:
# While there are multiple 'id' attributes, none of them are in the <html> tag at the top.
soup.get('id')

In [None]:
soup.find('div').get('id')
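
As a small sketch of how `get` behaves when an attribute is missing, consider the made-up snippet below (`toy_soup` is a hypothetical name, and `'html.parser'` is assumed as the parser).

```python
import bs4

toy_soup = bs4.BeautifulSoup(
    '<div id="content"><p class="intro lead">Hi</p></div>', 'html.parser'
)
div = toy_soup.find('div')
p = toy_soup.find('p')

print(div.get('id'))       # 'content': the attribute lives on the <div> itself
print(p.get('id'))         # None: the <p> has no id, even though its parent does
print(p.get('id', 'n/a'))  # 'n/a': get accepts a default, like dict.get
print(p.get('class'))      # ['intro', 'lead']: class values come back as a list

# Indexing with square brackets raises a KeyError instead of returning None.
try:
    p['id']
    raised = False
except KeyError:
    raised = True
print(raised)
```

Using `get` is the safer choice when some tags on a page may lack the attribute.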

## Example: Scraping the HDSI Faculty page

### Example

Let's try to extract a list of HDSI Faculty from https://datascience.ucsd.edu/faculty/.

A good first step is to use the "inspect element" tool in our web browser.

In [None]:
fac_response = requests.get('https://datascience.ucsd.edu/faculty/')
fac_response

In [None]:
soup = bs4.BeautifulSoup(fac_response.text, 'html.parser')

It seems like each faculty member's name is stored in a `<div>` whose `class` attribute is `'vc_custom_heading vc_gitem-post-data vc_gitem-post-data-source-post_title'`. Let's find all of those.

In [None]:
divs = soup.find_all('div', attrs={'class': 'vc_custom_heading vc_gitem-post-data vc_gitem-post-data-source-post_title'})

In [None]:
divs[0]

Within here, we need to extract each faculty member's name. It seems like names are stored as text within the `<a>` tag.

In [None]:
divs[0].find('a').text
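
A note on filtering by `class`: a tag can have several classes, and the way Beautiful Soup matches them is worth seeing on a toy example. The markup and names below are made up, and `'html.parser'` is assumed as the parser.

```python
import bs4

toy_soup = bs4.BeautifulSoup(
    '<div class="card featured">A</div><div class="card">B</div>', 'html.parser'
)

# A single class name matches any tag that has that class among its classes.
one_class = toy_soup.find_all('div', attrs={'class': 'card'})

# A string with spaces must match the full class attribute exactly...
exact = toy_soup.find_all('div', attrs={'class': 'card featured'})

# ...so the same classes in a different order match nothing.
reordered = toy_soup.find_all('div', attrs={'class': 'featured card'})

print([d.text for d in one_class])
print([d.text for d in exact])
print([d.text for d in reordered])
```

This is why copying the full class string from "inspect element" works, but can break if the site reorders or adds classes.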

We can also extract job titles:

In [None]:
divs1 = soup.find_all('div', attrs={'class': 'vc_gitem-align-left fields'})

divs1[0]

In [None]:
divs1 = soup.find_all('div', attrs={'class': 'field pendari_people_title'})

divs1[0].text

Let's create a DataFrame consisting of names and job titles for each faculty member.

In [None]:
names = [div.find('a').text for div in divs]
names

In [None]:
# titles = []
# for div in divs:
#     h4 = div.find('h4')
#     if h4:
#         titles.append(h4.text)
#     else:
#         titles.append('')

titles = [div.text for div in divs1]
titles

In [None]:
faculty = pd.DataFrame().assign(name=names, title=titles)
faculty.head()

Now we have a DataFrame!

In [None]:
faculty[faculty['title'] == 'Professor']

What if we want to get faculty members' pictures? It seems like we should look at the attributes of an `<img>` tag.

In [None]:
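def show_picture(name):
    '''Displays the picture of the faculty member with the given name.'''
    # Assumes the i-th matching <img> on the page belongs to the i-th name in `names`.
    idx = names.index(name)
    imgs = soup.find_all('img', attrs={'class': 'vc_gitem-zone-img'})
    url = imgs[idx].get('src')
    display(Image(url))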

In [None]:
show_picture('Tauhidur Rahman')

## Example: Scraping quotes

### Example: Scraping quotes

Let's scrape quotes from https://quotes.toscrape.com/.

<center><img src="imgs/quotes2scrape.png" width=60%></center>

Specifically, let's try to make a DataFrame that looks like the one below:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>quote</th>
      <th>author</th>
      <th>author_url</th>
      <th>tags</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</td>
      <td>Albert Einstein</td>
      <td>https://quotes.toscrape.com/author/Albert-Einstein</td>
      <td>change,deep-thoughts,thinking,world</td>
    </tr>
    <tr>
      <th>1</th>
      <td>“It is our choices, Harry, that show what we truly are, far more than our abilities.”</td>
      <td>J.K. Rowling</td>
      <td>https://quotes.toscrape.com/author/J-K-Rowling</td>
      <td>abilities,choices</td>
    </tr>
    <tr>
      <th>2</th>
      <td>“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</td>
      <td>Albert Einstein</td>
      <td>https://quotes.toscrape.com/author/Albert-Einstein</td>
      <td>inspirational,life,live,miracle,miracles</td>
    </tr>
  </tbody>
</table>

### The plan

Eventually, we will create a single function – `quote_df` – which takes in an integer `n` and returns a **DataFrame** with the quotes on the **first `n` pages** of https://quotes.toscrape.com/.

To do this, we will define several helper functions:

- `download_page(i)`, which downloads a **single page** (page `i`) and returns a `BeautifulSoup` object of the response.

- `process_quote(div)`, which takes in a `<div>` tree corresponding to a **single quote** and returns a Series containing all of the relevant information for that quote.

- `process_page(divs)`, which takes in a list of `<div>` trees corresponding to a **single page** and returns a DataFrame containing all of the relevant information for all quotes on that page.

Key principle: some of our helper functions will make **requests**, and others will **parse**, but none will do both! 
- Easier to debug and catch errors.
- Avoids unnecessary requests.
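
One payoff of this separation is that a parsing function can be tried out on a hand-written HTML string, with no network access at all. A minimal sketch, where `parse_item_texts` and `sample_html` are hypothetical names that are not part of the helpers planned above:

```python
import bs4

def parse_item_texts(html):
    """Parsing only: takes an HTML string and makes no requests."""
    soup = bs4.BeautifulSoup(html, 'html.parser')
    return [li.text.strip() for li in soup.find_all('li')]

# Because the function never touches the network, we can check it on a tiny example.
sample_html = '<ul><li> apples </li><li>pears</li></ul>'
item_texts = parse_item_texts(sample_html)
print(item_texts)
```

If the output looks wrong, we know the bug is in the parsing logic and not in the request.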

### Downloading a single page

In [None]:
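def download_page(i):
    '''Downloads page i of https://quotes.toscrape.com/ and returns it as a BeautifulSoup object.'''
    url = f'https://quotes.toscrape.com/page/{i}'
    response = requests.get(url)
    return bs4.BeautifulSoup(response.text, 'html.parser')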

In `quote_df`, we will call `download_page` repeatedly: once for `i=1`, once for `i=2`, ..., and once for `i=n`. For now, we will work with just page 5 (chosen arbitrarily).

In [None]:
soup = download_page(5)

### Parsing a single page

Let's look at the page's source code (via "inspect element") to find where the quotes in the page are located.

In [None]:
divs = soup.find_all('div', attrs={'class': 'quote'})

In [None]:
divs[0]

From this `<div>`, we can extract the quote, author name, author's URL, and tags.

In [None]:
divs[0].find('span', attrs={'class': 'text'}).text

In [None]:
divs[0].find('small', attrs={'class': 'author'}).text

In [None]:
divs[0].find('a').get('href')

In [None]:
divs[0].find('meta', attrs={'class': 'keywords'}).get('content')

Let's implement our next function, `process_quote`, which takes in a `<div>` corresponding to a single quote and returns a **Series** containing the quote's information.

Note that this approach is different from the one taken in the HDSI Faculty page example. There, we created each column of our final DataFrame separately, while here we are creating one **row** of our final DataFrame at a time.

In [None]:
def process_quote(div):
    quote = div.find('span', attrs={'class': 'text'}).text
    author = div.find('small', attrs={'class': 'author'}).text
    author_url = 'https://quotes.toscrape.com' + div.find('a').get('href')
    tags = div.find('meta', attrs={'class': 'keywords'}).get('content')
    
    return pd.Series({'quote': quote, 'author': author, 'author_url': author_url, 'tags': tags})

In [None]:
process_quote(divs[3])

Our last helper function will take in a **list** of `<div>`s, call `process_quote` on each `<div>` in the list, and return a **DataFrame**.

In [None]:
def process_page(divs):
    return pd.DataFrame([process_quote(div) for div in divs])

In [None]:
process_page(divs)
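
To compare the row-at-a-time approach with the column-at-a-time approach from the faculty example, here is a small sketch on made-up data (the names and titles below are placeholders, not scraped values). Both routes should produce the same table.

```python
import pandas as pd

# Column-at-a-time: one list per column.
toy_names = ['Ada', 'Grace']
toy_titles = ['Professor', 'Lecturer']
by_columns = pd.DataFrame({'name': toy_names, 'title': toy_titles})

# Row-at-a-time: one Series per row, as process_quote returns.
toy_rows = [
    pd.Series({'name': 'Ada', 'title': 'Professor'}),
    pd.Series({'name': 'Grace', 'title': 'Lecturer'}),
]
by_rows = pd.DataFrame(toy_rows)

print(by_columns)
print(by_rows)
```

Building rows is convenient when all of a record's fields live in the same subtree of the HTML. Building columns works when each field can be collected with a single `find_all`, but it relies on the resulting lists lining up.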

### Putting it all together

In [None]:
def quote_df(n):
    '''Returns a DataFrame containing the quotes on the first n pages of https://quotes.toscrape.com/.'''
    dfs = []
    for i in range(1, n + 1):
        # Download page i and create a BeautifulSoup object.
        soup = download_page(i)
        
        # Create DataFrame using the information in that page.
        divs = soup.find_all('div', attrs={'class': 'quote'})
        df = process_page(divs)
        
        # Append DataFrame to dfs.
        dfs.append(df)
        
    # Stitch all DataFrames together.
    return pd.concat(dfs).reset_index(drop=True)

In [None]:
first_three_pages = quote_df(3)
first_three_pages.head()

The elements in the `'tags'` column are all strings, but each one really represents a list of tags (separated by commas). This is not ideal, as we will see shortly.

### Key takeaways

* Make as few requests as possible.
* Create a request and parsing plan **beforehand**.
* Create your output schema **beforehand**.
* Make requests and parse in **separate functions**!
* See Lab 6, Question 2 for a related example.
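
One way to make as few requests as possible is to remember pages that have already been downloaded. The sketch below uses a hypothetical `fake_download` function in place of a real request so that it runs offline; `get_page` and `page_cache` are also made-up names.

```python
# Stand-in for a real download. It records every call so we can count "requests".
calls = []

def fake_download(i):
    calls.append(i)
    return f'<html>page {i}</html>'

page_cache = {}

def get_page(i, download=fake_download):
    """Returns the HTML for page i, downloading it only the first time it is needed."""
    if i not in page_cache:
        page_cache[i] = download(i)
    return page_cache[i]

# Ask for pages 1 and 2 several times each.
for i in [1, 2, 1, 1, 2]:
    get_page(i)

print(calls)  # [1, 2]: each page was "downloaded" only once
```

With this pattern, re-running parsing code while debugging does not trigger new requests.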

## Nested vs. flat data formats

### Nested vs. flat data formats

- **Nested** data formats, like HTML, JSON, and XML, allow us to represent hierarchical relationships between variables.

- **Flat** (i.e. tabular) data formats, like CSV, do not.

<center><img src="imgs/hierarchy.png" width=40%></center>

### Aside: JSON Crack

The site [jsoncrack.com](https://jsoncrack.com/editor) allows you to upload a JSON file and visualizes it. Let's try it with `data/family.json`!

### Example: Scraping quotes, again

- Suppose we obtained the quotes data via an API and saved it to the file `data/quotes2scrape.json`.
- `quotes2scrape.json` is a **JSON records** file; each line is a valid JSON object, **but the entire document is not**.
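
To see what this means in practice, here is a minimal sketch using a made-up two-line string in place of the real file. The keys mimic those in the data, but the values are invented.

```python
import io
import json

records_text = (
    '{"quote_auth": "A", "tags": ["x", "y"]}\n'
    '{"quote_auth": "B", "tags": []}\n'
)

# Parsing the whole document at once fails, since it holds two top-level objects.
try:
    json.loads(records_text)
    whole_document_parses = True
except json.JSONDecodeError:
    whole_document_parses = False

# Parsing one line at a time works. io.StringIO lets us loop over lines as with a file.
records = [json.loads(line) for line in io.StringIO(records_text)]

print(whole_document_parses)
print(records[0]['tags'])
```

Note that the tags come back as Python lists, not strings.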

In [None]:
f = open(os.path.join('data', 'quotes2scrape.json'))

In [None]:
json.loads(f.readline())

Note that for a single quote, we have keys for `'auth_url'`, `'quote_auth'`, `'quote_text'`, `'bio'`, `'dob'`, and `'tags'`.

Since each line is a separate JSON object, let's read in each line one at a time.

In [None]:
with open(os.path.join('data', 'quotes2scrape.json')) as f:
    L = [json.loads(line) for line in f]

Let's convert the result to a DataFrame.

In [None]:
df = pd.DataFrame(L)
df.head()

What data type is the `'tags'` column?

In [None]:
df['tags'].iloc[0]

Let's save `df` to a CSV and read it back in.

In [None]:
df.to_csv('out.csv')

In [None]:
df_again = pd.read_csv('out.csv')
df_again.head()

What data type is the `'tags'` column now?

In [None]:
df_again['tags'].iloc[0]
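
The same round trip can be sketched on a tiny, made-up DataFrame, writing to an in-memory buffer instead of a file. Using `ast.literal_eval` to turn the strings back into lists is one possible fix, not necessarily the approach this course recommends.

```python
import ast
import io

import pandas as pd

toy = pd.DataFrame({'quote': ['q1', 'q2'], 'tags': [['life', 'love'], ['humor']]})

# Write to an in-memory CSV and read it back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
toy_again = pd.read_csv(buffer)

print(type(toy['tags'].iloc[0]))        # a list before saving
print(type(toy_again['tags'].iloc[0]))  # a string after reading back

# CSV cells are plain text, so each list was written as its string representation.
restored = toy_again['tags'].apply(ast.literal_eval)
print(restored.tolist())
```

CSV is a flat format, so nested values such as lists do not survive a round trip without extra work.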

### One-hot encoding

- So that we don't have to deal with lists within Series, we can **flatten** lists of tags so that there is **one column per unique tag**.
    - For example, consider the tag `'inspirational'`.
    - If a quote has a 1 in the `'inspirational'` column, it **was** tagged `'inspirational'`.
    - If a quote has a 0 in the `'inspirational'` column, it **was not** tagged `'inspirational'`.

- This process – of converting categorical variables into columns of 1s and 0s – is called **one-hot encoding**. We will revisit it in a few weeks.
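
Before doing this in `pandas`, it may help to see the idea in plain Python. The tag lists below are made up, and `all_tags` and `encoded` are hypothetical names.

```python
tag_lists = [['life', 'love'], ['humor'], ['life']]

# One column per unique tag, in sorted order.
all_tags = sorted({tag for tags in tag_lists for tag in tags})

# One row per quote: 1 if the quote has the tag, 0 otherwise.
encoded = [[int(tag in tags) for tag in all_tags] for tags in tag_lists]

print(all_tags)
for row in encoded:
    print(row)
```

Each row has exactly as many 1s as the quote has tags, and every other entry is 0.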

Let's write a function that takes in the list of tags (`taglist`) for a given quote and returns the one-hot-encoded sequence of 1s and 0s for that quote.

In [None]:
# Use the first quote's list of tags as an example.
taglist = df['tags'].iloc[0]
taglist

In [None]:
def flatten_tags(taglist):
    return pd.Series({k:1 for k in taglist}, dtype=float)

tags = df['tags'].apply(flatten_tags).fillna(0).astype(int)
tags.head()

Let's combine this one-hot-encoded DataFrame with `df`.

In [None]:
df_full = pd.concat([df, tags], axis=1).drop(columns='tags')
df_full.head()

If we want all quotes tagged `'inspirational'`, we can simply query:

In [None]:
df_full[df_full['inspirational'] == 1].head()

Note that this DataFrame representation of the response JSON takes up much more space than the original JSON. Why is that?

## Summary, next time

### Summary

- Beautiful Soup is an HTML parser that allows us to (somewhat) easily extract information from HTML documents.
    - `soup.find` and `soup.find_all` are the functions you will use most often.
- When writing scraping code:
    - Use "inspect element" to identify the names of tags and attributes that are relevant to the information you want to extract.
    - Separate your logic for making requests and for parsing.

### Next time

All about regular expressions!