In [None]:
import pandas as pd
import numpy as np

import requests
from IPython.display import HTML, Image

# Lecture 15 – Requests and Parsing HTML

## DSC 80, Spring 2022

### Announcements

- Project 3 is released, and is due on **Thursday, May 12th at 11:59PM**.
    - See [dsc80.com/project3](https://dsc80.com/project3/) for all the details.
- Midterm Exam grades are released! See [#935](https://campuswire.com/c/G325FA25B/feed/935) for details.
- Lab 5 is due **today at 11:59PM**.

### Agenda

- APIs and web scraping.
- The anatomy of HTML documents.
- Parsing HTML via Beautiful Soup.
- Example: Scraping the HDSI Faculty page.

## APIs and web scraping

### Programmatic requests

* We learned how to use the Python `requests` package to exchange data via HTTP.
    - `GET` requests are used to request data **from** a server.
    - `POST` requests are used to **send** data to a server. 
* There are two ways of collecting data via requests:
    * By using a published API (application programming interface).
    * By scraping a webpage to collect its HTML source code.

### APIs

* An API is a service that makes data directly available to the user in a convenient fashion.

* Advantages:
    - The data are usually clean, up-to-date, and ready to use.
    - The presence of an API signals that the data provider is okay with you using their data.
    - The data provider can plan and regulate data usage.
        - Some APIs require you to create an API "key", which is like an account for using the API.
        - APIs can also give you access to data that isn't publicly available on a webpage.

* Disadvantages:
    - APIs don't always exist for the data you want!

### API terminology

- A URL, or uniform resource locator, describes the location of a website or resource.

- An **API endpoint** is a URL of the data source that the user wants to make requests to.

- For example, on the [Reddit API](https://www.reddit.com/dev/api/):
    * the `/comments` endpoint retrieves information about comments.
    * the `/hot` endpoint retrieves data about posts labeled "hot" right now. 
    - To access these endpoints, you add the endpoint name to the base URL of the API.
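
As a rough illustration of "adding the endpoint name to the base URL", here is a minimal sketch using the standard library's `urllib.parse.urljoin`. The base URL is the PokéAPI one used in the next section; the helper name `endpoint_url` is made up for this example.

```python
from urllib.parse import urljoin

# Base URL of the PokéAPI. The trailing slash matters: without it, urljoin
# would replace the last path segment instead of appending to it.
BASE_URL = 'https://pokeapi.co/api/v2/'

def endpoint_url(endpoint, resource=''):
    """Hypothetical helper: join the base URL, an endpoint name, and a resource."""
    return urljoin(BASE_URL, f'{endpoint}/{resource}')

print(endpoint_url('pokemon', 'squirtle'))
# https://pokeapi.co/api/v2/pokemon/squirtle
```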

### API requests

- API requests are just `GET`/`POST` requests to a specially maintained URL.
- Let's test out the [Pokémon API](https://pokeapi.co).

First, let's make a `GET` request for `'squirtle'`.

In [None]:
r = requests.get('https://pokeapi.co/api/v2/pokemon/squirtle')
r

Remember, the 200 status code is good! Let's take a look at the **content**:

In [None]:
r.content[:1000]

Looks like JSON. We can extract the JSON from this response with the `json` method (or by passing `r.text` to `json.loads`).

In [None]:
r.json()

Let's try a `GET` request for `'billy'`.

In [None]:
r = requests.get('https://pokeapi.co/api/v2/pokemon/billy')
r

Uh oh...
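
Since a request can come back with an error status, it may help to check the response before trying to parse it. The sketch below is one possible pattern, not the only one: the helper name `json_or_none` is hypothetical, and it relies only on the `ok` attribute and `json` method of a `requests.Response`.

```python
def json_or_none(response):
    """Hypothetical helper: parsed JSON for successful responses, else None.

    `response` is assumed to be a `requests.Response`. Its `ok` attribute is
    True when the status code is below 400.
    """
    if not response.ok:
        return None
    return response.json()
```

For example, `json_or_none(requests.get('https://pokeapi.co/api/v2/pokemon/billy'))` should evaluate to `None` whenever the server answers with an error code. If you would rather have the failure stop your code, `r.raise_for_status()` raises an exception for error responses instead.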

### Scraping

* Scraping is the act of programmatically "browsing" the web, downloading the source code (HTML) of pages that you're interested in extracting data from.

* Advantages:
    * You can always do it!
        - e.g. Google scrapes webpages in order to make them searchable.

* Disadvantages:
    - It is often difficult to parse and clean scraped data.
        - Source code often includes a lot of content unrelated to the data you're trying to find (e.g. formatting, advertisements, other text).
    - Websites can change often, so scraping code can get outdated quickly.
    - Websites may not want you to scrape their data!

- In general, we prefer APIs.

### Accessing HTML

Let's make a `GET` request to the HDSI Faculty page and see what the resulting HTML looks like. 

In [None]:
url = 'https://datascience.ucsd.edu/about/faculty/faculty/'
r = requests.get(url)
r

In [None]:
urlText = r.text
len(urlText)

In [None]:
print(urlText[:1000])

Wow, that is gross looking! 😰 

- It is **raw** HTML, which web browsers use to display websites.
- The information we are looking for – faculty information – is in there somewhere, but we have to search for it and extract it, which we wouldn't have to do if we had an API.

### Best practices for scraping

1. **Send requests slowly** and be upfront about what you are doing!
2. Respect the policy published in the page's `robots.txt` file.
    - Many sites have a `robots.txt` file in their root directory, which contains a policy that allows or disallows automatic access to their site. 
    - See [here](https://moz.com/learn/seo/robotstxt) or Lab 5, Question 5 for more details.
3. Don't spoof your User-agent (i.e. don't try to trick the server into thinking you are a person).
4. Read the Terms of Service for the site and follow it.
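
One way to check a `robots.txt` policy from Python is the standard library's `urllib.robotparser`. The sketch below parses a made-up policy (not any real site's) so that it runs without network access; the `example.com` URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt policy, used here so the example runs offline.
EXAMPLE_ROBOTS_TXT = """
User-agent: *
Disallow: /private/
""".strip().splitlines()

parser = RobotFileParser()
parser.parse(EXAMPLE_ROBOTS_TXT)

# can_fetch(useragent, url) reports whether the policy allows the request.
print(parser.can_fetch('*', 'https://example.com/faculty/'))   # True
print(parser.can_fetch('*', 'https://example.com/private/x'))  # False
```

For a real site, calling `parser.set_url(...)` with the address of the site's `robots.txt` and then `parser.read()` downloads and parses the live file instead.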

### Consequences of irresponsible scraping

If you make too many requests:
* The server may block your IP Address.
    - Everyone in your dorm might lose access to Google! (Seriously!)
* You may take down the website.
    - A journalist scraped and accidentally took down the Cook County Inmate Locator.
    - As a result, inmates' families weren't able to contact them while the site was down.
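
A simple way to avoid sending too many requests is to pause between them. The generator below is a minimal sketch of that idea; the name `throttled` and the one-second default are assumptions for illustration, and a site's own policy should determine the actual delay.

```python
import time

def throttled(items, delay_seconds=1.0, sleep=time.sleep):
    """Yield items one at a time, pausing between consecutive items.

    The `sleep` parameter exists only so the pause can be swapped out in tests.
    """
    for i, item in enumerate(items):
        if i > 0:
            sleep(delay_seconds)  # wait before every item except the first
        yield item

# Hypothetical usage: at most one request per second.
# for url in throttled(urls, delay_seconds=1.0):
#     response = requests.get(url)
```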

## The anatomy of HTML documents

### What is HTML?

* HTML (HyperText Markup Language) is **the** basic building block of the web.
* It defines the content and layout of a webpage, and as such, it is what you get back when you scrape a webpage.
* See [this tutorial](http://fab.academany.org/2018/labs/fablaboshanghai/students/bob-wu/Fabclass/week2_project_management/HTML.html) for more details.

In [None]:
!cat data/lec15_ex1.html

### The anatomy of HTML documents

* **HTML document**: The totality of markup that makes up a webpage.

* **Document Object Model (DOM)**: The internal representation of an HTML document as a hierarchical **tree** structure.

* **HTML element**: An object in the DOM, such as a paragraph, header, or title.
* **HTML tags**: Markers that denote the **start** and **end** of an element, such as `<p>` and `</p>`.

<center><img src='imgs/dom.jpg'></center>

<center><a href='https://simplesnippets.tech/what-is-document-object-modeldom-how-js-interacts-with-dom/'>(source)</a></center>

### Useful tags to know


|Element|Description|
|:---|:---|
|`<html>`|the document|
|`<head>`|the document head (metadata, such as the title)|
|`<body>`|the body|
|`<div>` |a logical division of the document|
|`<span>`|an *in-line* logical division|
|`<p>`|a paragraph|
| `<a>`| an anchor (hyper-link)|
|`<h1>, <h2>, ...`| heading(s) |
|`<img>`| an image |

There are many, many more. See [this article](https://en.wikipedia.org/wiki/HTML_element) for examples.

### Example: images and hyperlinks

Tags can have **attributes**, which further specify how to display information on a webpage.

For instance, `<img>` tags have `src` and `alt` attributes (among others):

```html
<img src="billy-selfie.png" alt="A photograph of Billy." width=500>
```

Hyperlinks have `href` attributes: 

```html
Click <a href="https://dsc80.com/project3">this link</a> to access Project 3.
```

What do you think this webpage looks like?

In [None]:
!cat data/lec15_ex2.html

### The `<div>` tag

```html
<div style="background-color:lightblue">
  <h3>This is a heading</h3>
  <p>This is a paragraph.</p>
</div>
```

* The `<div>` tag defines a division or a "section" of an HTML document.
    * Think of a `<div>` as a "cell" in a Jupyter Notebook.

* The `<div>` element is often used as a container for other HTML elements to style them with CSS or to perform operations involving them using JavaScript.

* `<div>` elements often have attributes, which are important when scraping!

### Document trees

In [None]:
!cat data/lec15_ex1.html

Under the document object model (DOM), HTML documents are trees. In DOM trees, child nodes are **ordered**.
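
To make the tree structure concrete, here is a small sketch that prints each opening tag indented by its depth. It uses the standard library's `html.parser` rather than Beautiful Soup, the document string is made up, and the class name `TreePrinter` is hypothetical. It assumes every tag is explicitly closed, so void tags such as `<hr>` or `<img>` would throw off the depth count.

```python
from html.parser import HTMLParser

class TreePrinter(HTMLParser):
    """Hypothetical helper: record each opening tag, indented by nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append('  ' * self.depth + tag)
        self.depth += 1  # everything until the matching end tag is a descendant

    def handle_endtag(self, tag):
        self.depth -= 1

# A made-up document in which every tag is closed.
doc = '<html><head><title>Page</title></head><body><h1>Hi</h1><p>Text</p></body></html>'
printer = TreePrinter()
printer.feed(doc)
print('\n'.join(printer.lines))
```

The tags come out in document order, which is why siblings such as `h1` and `p` keep their relative order in the tree.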

<center>

<img src="imgs/webpage_anatomy.png" width="50%">

</center>    

What does the DOM tree look like for this document?

<center><img src="imgs/dom_tree.png" width="50%"></center>

### Example: Quote scraping

Consider the following webpage.

<center><img src="imgs/quotes2scrape.png" width=60%></center>

- What do you think the DOM tree looks like?
- If you had to store the data on this page in a DataFrame, what would the rows and columns represent?

<center><img src="imgs/quote_dom.png" width="50%"></center>

## Parsing HTML via Beautiful Soup

### Beautiful Soup 🍜

* [Beautiful Soup 4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a Python HTML parser.
    - To "parse" means to "extract meaning from a sequence of symbols".
* **Warning:** Beautiful Soup 4 and Beautiful Soup 3 work differently, so make sure you are using and looking at documentation for Beautiful Soup 4.

### Example HTML document

To start, let's instantiate a `BeautifulSoup` object, using the source code for an HTML page with the DOM tree shown below:

<center><img src="imgs/dom_tree_1.png" width="60%"></center>

The string `html_string` contains an HTML "document".

In [None]:
html_string = '''
<html>
    <body>
      <div id="content">
        <h1>Heading here</h1>
        <p>My First paragraph</p>
        <p>My <em>second</em> paragraph</p>
        <hr>
      </div>
      <div id="nav">
        <ul>
          <li>item 1</li>
          <li>item 2</li>
          <li>item 3</li>
        </ul>
      </div>
    </body>
</html>
'''.strip()

Using the `HTML` function in the `IPython.display` module, we can render an HTML document from within our Jupyter Notebook:

In [None]:
HTML(html_string)

### `BeautifulSoup` objects

`bs4.BeautifulSoup` takes in a string or file-like object representing HTML (`markup`) and returns a **parsed** document.

In [None]:
import bs4

In [None]:
bs4.BeautifulSoup?

Normally, we pass the text of a `GET` response to `bs4.BeautifulSoup`, but here we will pass our hand-crafted `html_string`.

In [None]:
soup = bs4.BeautifulSoup(html_string)
soup

In [None]:
type(soup)

`BeautifulSoup` objects have several useful attributes, e.g. `text`:

In [None]:
print(soup.text)

### Child nodes

- Recall, HTML documents are represented as trees.
    - Each page element becomes a node in this tree.
- A `BeautifulSoup` object represents a **node** in the tree.
    - Each `BeautifulSoup` object has 0 or more child nodes.
    - To access the children of a node, use the `children` attribute.

In [None]:
soup

In [None]:
soup.children

### Aside: iterators

On the previous slide, we saw that `soup.children` isn't another `BeautifulSoup` object, but rather something of the form `<list_iterator at 0x7f7b0ab8c370>`.

What are [iterators](https://www.w3schools.com/python/python_iterators.asp), again?

In [None]:
nums = [1, 2, 3, 4]
double = map(lambda x: x * 2, nums)  # lazily doubles each number
double

In [None]:
next(double)

In [None]:
list(double)
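
The key property to remember is that an iterator can only be consumed once. The following self-contained sketch (with its own variable names, so it does not interfere with the cells above) shows what is left after each step.

```python
nums = [1, 2, 3, 4]
doubled = map(lambda x: x * 2, nums)  # lazy: nothing is computed yet

first = next(doubled)   # consumes the first element
rest = list(doubled)    # consumes everything that is left
again = list(doubled)   # the iterator is now exhausted

print(first, rest, again)
# 2 [4, 6, 8] []
```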

### Child nodes

The `children` attribute returns an iterator, so child nodes are produced one at a time as you loop over them rather than being copied into a new list up front.

In [None]:
soup

In [None]:
soup.children

In [None]:
len(list(soup.children))

In [None]:
root = next(soup.children)
root

In [None]:
list(root.children)

In [None]:
list(list(root.children)[1].children)

In [None]:
list(list(list(root.children)[1].children)[3].children)
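
The odd-looking indices above come from the fact that the whitespace between tags is also stored as child nodes. The sketch below shows this on a made-up snippet and filters the children down to actual tags; it passes `'html.parser'` explicitly so the result does not depend on which parser happens to be installed.

```python
import bs4

snippet = '<div>\n<h1>A</h1>\n<p>B</p>\n</div>'
div = bs4.BeautifulSoup(snippet, 'html.parser').find('div')

# The children include the newline strings between the tags...
all_children = list(div.children)

# ...so keep only the nodes that are actual tags.
tag_children = [child for child in div.children if isinstance(child, bs4.Tag)]

print(len(all_children))               # 5
print([t.name for t in tag_children])  # ['h1', 'p']
```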

### Depth-first traversal through `descendants`

- While we could use the `children` attribute to navigate to any node in a `BeautifulSoup` tree, there are easier ways of navigating the tree.

- The `descendants` attribute traverses a `BeautifulSoup` tree using **depth-first traversal**.
    - Why depth-first? Elements closer to one another on a page are more likely to be related than elements further away.
    - Question: What type of depth-first traversal does this use – preorder, inorder, or postorder traversal?

<center><img src="imgs/dom_tree_1.png" width="60%"></center>

In [None]:
for child in soup.descendants:
    # print(child) # What would happen if we ran this instead?
    if isinstance(child, str):
        continue
    print(child.name)

### Finding elements in a tree

Practically speaking, you will not use the `children` or `descendants` attributes directly very often. Instead, you will use the following methods:

- `soup.find(tag)`, which finds the **first** instance of a tag (the first one on the page, i.e. the first one that DFS sees).
    - More general: `soup.find(name=None, attrs={}, recursive=True, text=None, **kwargs)`.
- `soup.find_all(tag)` will find **all** instances of a tag.
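
One difference worth keeping in mind is what the two methods return when nothing matches. The sketch below uses a made-up two-item list, and the variable name `soup_demo` is chosen so it does not overwrite `soup`.

```python
import bs4

soup_demo = bs4.BeautifulSoup('<ul><li>item 1</li><li>item 2</li></ul>', 'html.parser')

first_item = soup_demo.find('li')          # a single Tag
all_items = soup_demo.find_all('li')       # a list-like collection of Tags
missing = soup_demo.find('table')          # None: there is no <table> here
no_matches = soup_demo.find_all('table')   # empty, not None

print(first_item.text, len(all_items), missing, len(no_matches))
# item 1 2 None 0
```

Because `find` can return `None`, chaining something like `.text` directly onto it raises an `AttributeError` when the tag is absent.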


### Using `find`

Let's try and extract the first `<div>` subtree.

<center><img src="imgs/dom_tree_1.png" width="60%"></center>  

In [None]:
soup

In [None]:
div = soup.find('div')
div

<center><img src="imgs/dom_subtree_1.png" width="30%"></center>  

Let's try and find the `<div>` element that has an `id` attribute equal to `'nav'`.

In [None]:
soup.find('div', attrs={'id': 'nav'})

`find` will return the first occurrence of a tag, regardless of what depth it is in the tree.

In [None]:
soup.find('ul')

In [None]:
soup.find('li')

### Using `find_all`

`find_all` returns a list of all matches.

In [None]:
soup.find_all('div')

In [None]:
soup.find_all('li')

In [None]:
[x.text for x in soup.find_all('li')]

`text` is a node attribute.

### Node attributes
* The `text` attribute of a tag element gets the text between the opening and closing tags.
* The `attrs` attribute lists all attributes of a tag.
* The `get(key)` method gets the value of a tag attribute.

In [None]:
soup.find('p')

In [None]:
soup.find('p').text

In [None]:
soup.find('div')

In [None]:
soup.find('div').attrs

In [None]:
soup.find('div').get('id')
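
Tags also support dictionary-style indexing, which behaves differently from `get` when an attribute is missing. A minimal sketch on a made-up tag:

```python
import bs4

tag = bs4.BeautifulSoup('<div id="nav" class="menu wide"></div>', 'html.parser').find('div')

print(tag.get('id'))     # nav
print(tag['id'])         # nav
print(tag.get('href'))   # None: get does not raise for a missing attribute

# class can hold several values, so Beautiful Soup stores it as a list.
print(tag.get('class') == ['menu', 'wide'])  # True

try:
    tag['href']          # indexing raises KeyError for a missing attribute
except KeyError:
    print('no href attribute')
```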

You can access tags using attribute notation, too.

In [None]:
soup

In [None]:
soup.html.div.h1

In [None]:
soup.html.div.h1.text

In [None]:
soup.html.div.next_sibling.next_sibling.attrs
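
Two calls to `next_sibling` are needed above because the first sibling is the whitespace between the two `<div>` tags. As a sketch of an alternative, `find_next_sibling` skips straight to the next matching tag; the document string here is made up.

```python
import bs4

html_doc = '<body>\n<div id="content"></div>\n<div id="nav"></div>\n</body>'
first_div = bs4.BeautifulSoup(html_doc, 'html.parser').find('div')

# The immediate next sibling is the newline between the two <div> tags.
print(first_div.next_sibling == '\n')                # True

# find_next_sibling skips over text and returns the next matching tag.
print(first_div.find_next_sibling('div').get('id'))  # nav
```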

## Example: Scraping the HDSI Faculty page

### Example

Let's try and extract a list of HDSI Faculty from https://datascience.ucsd.edu/about/faculty/faculty/.

A good first step is to use the "inspect element" tool in our web browser.

In [None]:
fac_response = requests.get('https://datascience.ucsd.edu/about/faculty/faculty/')
fac_response

In [None]:
soup = bs4.BeautifulSoup(fac_response.text)

It seems like the relevant `<div>`s for faculty are the ones where the `data-entry-type` attribute is equal to `'individual'`. Let's find all of those.

In [None]:
divs = soup.find_all('div', attrs={'data-entry-type': 'individual'})

In [None]:
divs[0]

Within here, we need to extract each faculty member's name. It seems like names are stored in the `title` attribute within an `<a>` tag.

In [None]:
divs[0].find('a').get('title')

We can also extract job titles:

In [None]:
divs[0].find('h4').text

And bios:

In [None]:
divs[0].find('div', attrs={'class': 'cn-bio'}).text.strip()

Let's create a DataFrame consisting of names, titles, and bios for each faculty member.

In [None]:
names = [div.find('a').get('title') for div in divs]
names[:5]

In [None]:
titles = [div.find('h4').text if div.find('h4') else '' for div in divs]

In [None]:
bios = [div.find('div', attrs={'class': 'cn-bio'}).text.strip() for div in divs]

In [None]:
faculty = pd.DataFrame().assign(name=names, title=titles, bio=bios)
faculty.head()

Now we have a DataFrame!

In [None]:
faculty[faculty['title'] == 'Lecturer']
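
The list comprehensions above only guard against a missing `<h4>`; an entry without a bio or a link would raise an error. Below is a sketch of a more defensive version. The HTML is made up to imitate the structure described above (it is not the real page), and the helper names `text_or_empty` and `to_record` are hypothetical.

```python
import bs4

# Made-up HTML that imitates the structure described above; the second
# entry deliberately lacks a job title and a bio.
sample = '''
<div data-entry-type="individual">
  <a title="Ada Example"></a><h4>Lecturer</h4><div class="cn-bio"> Teaches DSC 80. </div>
</div>
<div data-entry-type="individual">
  <a title="Bob Example"></a>
</div>
'''

def text_or_empty(tag):
    """Return the stripped text of a tag, or '' if `find` returned None."""
    return tag.text.strip() if tag is not None else ''

def to_record(div):
    """Collect one entry's name, title, and bio into a dictionary."""
    link = div.find('a')
    return {
        'name': link.get('title') if link is not None else '',
        'title': text_or_empty(div.find('h4')),
        'bio': text_or_empty(div.find('div', attrs={'class': 'cn-bio'})),
    }

entries = bs4.BeautifulSoup(sample, 'html.parser').find_all(
    'div', attrs={'data-entry-type': 'individual'}
)
records = [to_record(div) for div in entries]
print(records[1])
# {'name': 'Bob Example', 'title': '', 'bio': ''}
```

A list of dictionaries like `records` can be passed straight to `pd.DataFrame(records)` to get one row per entry.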

What if we want to get faculty members' pictures? It seems like we should look at the attributes of an `<img>` tag.

In [None]:
divs[0].find('img')

In [None]:
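def show_picture(name):
    idx = names.index(name)
    srcset = divs[idx].find('img').get('srcset')
    # Keep only the URL: drop the trailing ' 1x' descriptor and the leading '//'.
    # (str.strip(' 1x') strips a set of characters, not the suffix ' 1x'.)
    url = 'https://' + srcset.split()[0].lstrip('/')
    display(Image(url))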

In [None]:
show_picture('Suraj Rampure')
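
More generally, a `srcset` value can list several candidates, and its URLs can be relative. The sketch below is one way to pull out the first URL and make it absolute with `urllib.parse.urljoin`. The helper name `first_srcset_url` and the `example.com` image path are made up, and the parsing assumes the URLs themselves contain no commas or spaces.

```python
from urllib.parse import urljoin

def first_srcset_url(srcset, page_url):
    """Return the first URL in a srcset string as an absolute URL.

    Assumes candidates are comma-separated and that each one looks like
    'URL' or 'URL descriptor' (for example '//host/pic.jpg 1x').
    """
    first_candidate = srcset.split(',')[0].strip()
    relative_url = first_candidate.split()[0]  # drop a descriptor such as '1x'
    return urljoin(page_url, relative_url)     # resolves '//host/...' and '/path/...'

page = 'https://datascience.ucsd.edu/about/faculty/faculty/'
print(first_srcset_url('//example.com/img/pic.jpg 1x', page))
# https://example.com/img/pic.jpg
```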

## Summary, next time

### Summary

- APIs allow us to request information from web servers in a convenient fashion.
- When APIs don't exist, we instead scrape webpages to access their source HTML and then parse the HTML to extract the information we care about.
- Under the document object model (DOM), HTML documents are trees.
    - Elements are defined by tags.
- Beautiful Soup is an HTML parser that allows us to (somewhat) easily extract information from HTML documents.
    - `soup.find` and `soup.find_all` are the methods you will use most often.
- **Next time:** Another scraping example. Regular expressions.