In [None]:
import pandas as pd
import numpy as np
import os
import requests

import util

# Lecture 15 – Requests and Parsing HTML

## DSC 80, Spring 2023

### Agenda

- Data formats.
- APIs and scraping.
- The anatomy of HTML documents.
- Parsing HTML using Beautiful Soup.

## Data formats

### The data formats of the internet

Responses typically come in one of two formats: HTML or JSON.

- The response body of a `GET` request is usually either JSON (when using an API) or HTML (when accessing a webpage).

- The response body of a `POST` request is usually JSON.

- XML is also a common format, but not as popular as it once was.

### JSON

- JSON stands for **JavaScript Object Notation**. It is a lightweight format for storing and transferring data.

- It is:
    - very easy for computers to read and write.
    - moderately easy for programmers to read and write by hand.
    - meant to be generated and parsed.

- Most modern languages have an interface for working with JSON objects.
    - JSON objects _resemble_ Python dictionaries (but are not the same!).
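
To make the "resemble, but are not the same" point concrete, here is a small sketch that round-trips a Python dictionary through JSON text. The `record` dictionary is made up for illustration. Non-string keys, tuples, `True`, and `None` all change form along the way.

```python
import json

# A made-up Python dictionary with features JSON cannot represent directly.
record = {1: 'one', 'flag': True, 'missing': None, 'pair': (1, 2)}

# Python -> JSON text: the integer key becomes a string, True becomes true,
# None becomes null, and the tuple becomes an array.
encoded = json.dumps(record)

# JSON text -> Python: the array comes back as a list and the key stays a string.
decoded = json.loads(encoded)

print(encoded)
print(decoded == record)
```

Since the key `1` comes back as `'1'` and the tuple comes back as a list, the decoded dictionary is not equal to the original.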

### JSON data types

| Type | Description |
| --- | --- |
| String | Anything inside double quotes. |
| Number | Any number (no difference between ints and floats). |
| Boolean | `true` and `false`. |
| Null | JSON's empty value, denoted by `null`. |
| Array | Like Python lists. |
| Object | A collection of key-value pairs, like dictionaries. Keys must be strings, values can be anything (even other objects). |

See [json-schema.org](https://json-schema.org/understanding-json-schema/reference/type.html) for more details.
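
As a rough illustration of the table above, the sketch below parses a hand-written JSON string (the `doc` contents are invented for this example) and records which Python type the `json` module produces for each JSON type.

```python
import json

# A hand-written JSON document that uses every type from the table.
doc = '''
{
    "name": "Ada",
    "age": 36,
    "gpa": 3.9,
    "enrolled": true,
    "advisor": null,
    "courses": ["DSC 80"],
    "address": {"city": "La Jolla"}
}
'''

parsed = json.loads(doc)

# Map each key to the name of the Python type that json.loads produced.
types = {key: type(value).__name__ for key, value in parsed.items()}
print(types)
```

Note that although JSON has a single Number type, Python's `json` module returns an `int` for `36` and a `float` for `3.9`.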

### Example JSON object

See `data/family.json`.

<center><img src='imgs/hierarchy.png' width=50%></center>

In [None]:
import json

# Using a `with` block ensures the file is closed once it has been read.
with open(os.path.join('data', 'family.json'), 'r') as f:
    family_tree = json.load(f)

In [None]:
family_tree

In [None]:
family_tree['children'][0]['children'][0]['age']

### Aside: `eval`

- `eval`, which stands for "evaluate", is a function built into Python.

- It takes in a **string containing a Python expression** and evaluates it in the current context.

In [None]:
x = 4
eval('x + 5')

- It seems like `eval` can do the same thing that `json.load` does...

In [None]:
f = open(os.path.join('data', 'family.json'), 'r')
eval(f.read())

- But you should **never use `eval`**. The next slide demonstrates why.

### `eval` gone wrong

Observe what happens when we use `eval` on a string representation of a JSON object:

In [None]:
f_other = open(os.path.join('data', 'evil_family.json'))
eval(f_other.read())

- Oh no! Since `evil_family.json`, which could have been downloaded from the internet, contained malicious code, we have now lost all of our files.


- This happened because `eval` **evaluates** all parts of the input string as if it were Python code.

- You never need to do this – instead, use the `.json()` method of a response object, or use the `json` library.

### Using the `json` module

Let's process the same file using the `json` module. Recall:
- `json.load(f)` parses JSON from a file object `f`.
- `json.loads(s)` parses JSON from a **s**tring `s`.
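
A minimal sketch of the difference between the two functions, using `io.StringIO` as a stand-in for an open file so that the example does not depend on anything in `data/` (the `text` string is invented for illustration):

```python
import io
import json

text = '{"name": "Grandma", "children": []}'

# json.loads takes the JSON text itself.
from_string = json.loads(text)

# json.load takes a file-like object and reads the text from it.
fake_file = io.StringIO(text)
from_file = json.load(fake_file)

print(from_string == from_file)
```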

In [None]:
f_other = open(os.path.join('data', 'evil_family.json'))
s = f_other.read()
s

In [None]:
json.loads(s)

- Since `util.err()` is not a valid JSON value (it is not a quoted string, number, boolean, `null`, array, or object), `json.loads` refuses to parse the input and raises an error instead of running it.

- This "safety check" is intentional.
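
One possible way to lean on this safety check in your own code is to catch the parsing error. The helper name `parse_json_safely` and both input strings below are hypothetical, chosen only to mimic the situation described here.

```python
import json

def parse_json_safely(text):
    """Return the parsed JSON value, or None if `text` is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # The text is only ever inspected by the parser, never executed.
        return None

good = parse_json_safely('{"name": "Grandma", "age": 94}')
bad = parse_json_safely('{"name": "Grandma", "age": util.err()}')

print(good)
print(bad)
```

The second string contains something that looks like a function call, but it is rejected as malformed JSON and nothing is ever run.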

### Handling _unfamiliar_ data

- Never trust data from an unfamiliar site.

- **Never** use `eval` on "raw" data that you didn't create!

- The JSON data format needs to be **parsed**, not evaluated as a dictionary.
    - It was designed with safety in mind!

## APIs and scraping

### Programmatic requests

* We learned how to use the Python `requests` package to exchange data via HTTP.
    - `GET` requests are used to request data **from** a server.
    - `POST` requests are used to **send** data to a server.

* There are two ways of collecting data through a request:
    * By using a published API (application programming interface).
    * By scraping a webpage to collect its HTML source code.

### APIs

An API is a service that makes data directly available to the user in a convenient fashion.

Advantages:

- The data are usually clean, up-to-date, and ready to use.

- The presence of an API signals that the data provider is okay with you using their data.

- The data provider can plan and regulate data usage.
    - Some APIs require you to create an API "key", which is like an account for using the API.
    - APIs can also give you access to data that isn't publicly available on a webpage.

Disadvantages:
- APIs don't always exist for the data you want!

### API terminology

- A URL, or uniform resource locator, describes the location of a website or resource.

- An **API endpoint** is a URL of the data source that the user wants to make requests to.

- For example, on the [Reddit API](https://www.reddit.com/dev/api/):
    - the `/comments` endpoint retrieves information about comments.
    - the `/hot` endpoint retrieves data about posts labeled "hot" right now.
    - To access these endpoints, you add the endpoint name to the base URL of the API.
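
As a sketch of "adding the endpoint name to the base URL", the helper below joins the two pieces with exactly one slash between them. The function name `endpoint_url` is hypothetical; the base URL is the Pokémon API address used later in this lecture.

```python
BASE_URL = 'https://pokeapi.co/api/v2'

def endpoint_url(base, endpoint):
    """Join a base URL and an endpoint path with exactly one slash."""
    return base.rstrip('/') + '/' + endpoint.lstrip('/')

url = endpoint_url(BASE_URL, '/pokemon/squirtle')
print(url)
```

The resulting string is what you would pass to `requests.get`.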

### API requests

- API requests are just `GET`/`POST` requests to a specially maintained URL.
- Let's test out the [Pokémon API](https://pokeapi.co).

First, let's make a `GET` request for `'squirtle'`.

In [None]:
r = requests.get('https://pokeapi.co/api/v2/pokemon/squirtle')
r

Remember, the 200 status code is good! Let's take a look at the **content**:

In [None]:
r.content[:1000]

Looks like JSON. We can extract the JSON from this response with the `json` method (or by passing `r.text` to `json.loads`).

In [None]:
rr = r.json()
rr

In [None]:
rr.keys()

In [None]:
rr['weight']

In [None]:
rr['abilities'][1]['ability']['name']

Let's try a `GET` request for `'billy'`.

In [None]:
r = requests.get('https://pokeapi.co/api/v2/pokemon/billy')
r

Uh oh...
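
A request for a name the API does not recognize typically comes back with a 4xx status code such as 404 rather than 200 (the exact code is not shown here, so treat 404 as an assumption). The sketch below uses only the standard library to summarize what a status code means; `describe_status` is a hypothetical helper, not part of `requests`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short human-readable summary of an HTTP status code."""
    phrase = HTTPStatus(code).phrase
    if 200 <= code < 300:
        outcome = 'success'
    elif 400 <= code < 500:
        outcome = 'client error'
    elif 500 <= code < 600:
        outcome = 'server error'
    else:
        outcome = 'other'
    return f'{code} {phrase} ({outcome})'

ok_summary = describe_status(200)
missing_summary = describe_status(404)

print(ok_summary)
print(missing_summary)
```

With a real response `r`, the code is available as `r.status_code`. `r.ok` is `False` for 4xx and 5xx responses, and `r.raise_for_status()` raises an exception for them, so it is worth checking one of these before calling `r.json()`.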

### Scraping

Scraping is the act of programmatically "browsing" the web, downloading the source code (HTML) of pages that you're interested in extracting data from.

Advantages:

* You can always do it!
    - e.g. Google scrapes webpages in order to make them searchable.

Disadvantages:

- It is often difficult to parse and clean scraped data.
    - Source code often includes a lot of content unrelated to the data you're trying to find (e.g. formatting, advertisements, other text).

- Websites can change often, so scraping code can get outdated quickly.

- Websites may not want you to scrape their data!

- **In general, we prefer APIs.**

### Accessing HTML

**Goal**: Access information about HDSI faculty members from the HDSI Faculty page.

Let's start by making a `GET` request to the HDSI Faculty page and see what the resulting HTML looks like. 

In [None]:
r = requests.get('https://datascience.ucsd.edu/faculty/')
r

In [None]:
faculty_text = r.text
len(faculty_text)

In [None]:
print(faculty_text[:1000])

In [None]:
'Tauhidur Rahman' in faculty_text

Wow, that is gross looking! 😰 

- It is **raw** HTML, which web browsers use to display websites.
- The information we are looking for – faculty information – is in there somewhere, but we have to search for it and extract it, which we wouldn't have to do if we had an API.
- We'll now look at how HTML documents are structured and how to extract information from them.

### Best practices for scraping

1. **Send requests slowly** and be upfront about what you are doing!
2. Respect the policy published in the page's `robots.txt` file.
    - Many sites have a `robots.txt` file in their root directory, which contains a policy that allows or disallows automatic access to their site. 
3. Don't spoof your User-agent (i.e. don't try to trick the server into thinking you are a person).
4. Read the Terms of Service for the site and follow it.
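
One way to check a `robots.txt` policy programmatically is the standard library's `urllib.robotparser`. The sketch below parses a made-up policy directly from a list of lines so that it runs without network access. The user-agent name `dsc80-student-bot` and the `example.com` URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt policy: everything is allowed except /private/.
robots_lines = [
    'User-agent: *',
    'Disallow: /private/',
]

parser = RobotFileParser()
parser.parse(robots_lines)

allowed = parser.can_fetch('dsc80-student-bot', 'https://example.com/faculty/')
blocked = parser.can_fetch('dsc80-student-bot', 'https://example.com/private/data')

print(allowed, blocked)
```

For a real site you would instead call `parser.set_url(...)` with the address of the site's `robots.txt` followed by `parser.read()`, and pause between requests (for example with `time.sleep`) to keep the request rate low.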

### Consequences of irresponsible scraping

If you make too many requests:
* The server may block your IP Address.
    - Everyone in your dorm might lose access to Google! (Seriously!)
* You may take down the website.
    - A journalist scraped and accidentally took down the Cook County Inmate Locator.
    - As a result, inmates' families weren't able to contact them while the site was down.

## The anatomy of HTML documents

### What is HTML?

* HTML (HyperText Markup Language) is **the** basic building block of the web.


* It defines the content and layout of a webpage, and as such, it is what you get back when you scrape a webpage.

* See [this tutorial](http://fab.academany.org/2018/labs/fablaboshanghai/students/bob-wu/Fabclass/week2_project_management/HTML.html) for more details.

For instance, here's the content of a very basic webpage.

In [None]:
!cat data/lec15_ex1.html

Using `IPython.display.HTML`, we can render it directly in our notebook.

In [None]:
from IPython.display import HTML
HTML(os.path.join('data', 'lec15_ex1.html'))

### The anatomy of HTML documents

* **HTML document**: The totality of markup that makes up a webpage.

* **Document Object Model (DOM)**: The internal representation of an HTML document as a hierarchical **tree** structure.

* **HTML element**: An object in the DOM, such as a paragraph, header, or title.
* **HTML tags**: Markers that denote the **start** and **end** of an element, such as `<p>` and `</p>`.

<center><img src='imgs/dom.jpg'></center>

<center><a href='https://simplesnippets.tech/what-is-document-object-modeldom-how-js-interacts-with-dom/'>(source)</a></center>

### Useful tags to know


|Element|Description|
|:---|:---|
|`<html>`|the document|
|`<head>`|the head (metadata about the document, such as its title)|
|`<body>`|the body|
|`<div>` |a logical division of the document|
|`<span>`|an *inline* logical division|
|`<p>`|a paragraph|
| `<a>`| an anchor (hyperlink)|
|`<h1>, <h2>, ...`| heading(s) |
|`<img>`| an image |

There are many, many more. See [this article](https://en.wikipedia.org/wiki/HTML_element) for examples.

### Example: images and hyperlinks

Tags can have **attributes**, which further specify how to display information on a webpage.

For instance, `<img>` tags have `src` and `alt` attributes (among others):

```html
<img src="king-selfie.png" alt="A photograph of King Triton." width=500>
```

Hyperlinks have `href` attributes: 

```html
Click <a href="https://dsc80.com/project3">this link</a> to access Project 3.
```

What do you think this webpage looks like?

In [None]:
!cat data/lec15_ex2.html

### The `<div>` tag

```html
<div style="background-color:lightblue">
  <h3>This is a heading</h3>
  <p>This is a paragraph.</p>
</div>
```

* The `<div>` tag defines a division or a "section" of an HTML document.
    * Think of a `<div>` as a "cell" in a Jupyter Notebook.

* The `<div>` element is often used as a container for other HTML elements to style them with CSS or to perform operations involving them using JavaScript.

* `<div>` elements often have attributes, **which are important when scraping**!

### Document trees

Under the document object model (DOM), HTML documents are trees. In DOM trees, child nodes are **ordered**.

<center>

<img src="imgs/webpage_anatomy.png" width="50%">

</center>    

What does the DOM tree look like for this document?

<center><img src="imgs/dom_tree.png" width="50%"></center>
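
To get a feel for how nesting in the markup turns into a tree, here is a rough sketch that prints an indented outline of a small page using the standard library's `html.parser`. The `OutlineParser` class and the `page` string are invented for this example, and the list of void tags is deliberately incomplete.

```python
from html.parser import HTMLParser

class OutlineParser(HTMLParser):
    """Record each opening tag, indented by its depth in the tree."""

    # Tags that never have a closing tag (a partial list, for illustration).
    VOID_TAGS = {'br', 'hr', 'img', 'meta', 'link', 'input'}

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append('  ' * self.depth + tag)
        if tag not in self.VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in self.VOID_TAGS:
            self.depth -= 1

page = '<html><body><h1>Title</h1><div><p>First</p><p>Second</p></div></body></html>'

parser = OutlineParser()
parser.feed(page)
outline = parser.outline

print('\n'.join(outline))
```

Children appear in the outline in the same order as in the source, which is what it means for child nodes to be ordered.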

### Example: Quote scraping

Consider the following webpage.

<center><img src="imgs/quotes2scrape.png" width=60%></center>

- What do you think the DOM tree looks like?
- If you had to store the data on this page in a DataFrame, what would the rows and columns represent?

<center><img src="imgs/quote_dom.png" width="50%"></center>

## Parsing HTML using Beautiful Soup

### Beautiful Soup 🍜

* [Beautiful Soup 4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a Python HTML parser.
    - To "parse" means to "extract meaning from a sequence of symbols".
* **Warning:** Beautiful Soup 4 and Beautiful Soup 3 work differently, so make sure you are using and looking at documentation for Beautiful Soup 4.

### Example HTML document

To start, we'll work with the source code for an HTML page with the DOM tree shown below:

<center><img src="imgs/dom_tree_1.png" width="50%"></center>

The string `html_string` contains an HTML "document".

In [None]:
html_string = '''
<html>
    <body>
      <div id="content">
        <h1>Heading here</h1>
        <p>My First paragraph</p>
        <p>My <em>second</em> paragraph</p>
        <hr>
      </div>
      <div id="nav">
        <ul>
          <li>item 1</li>
          <li>item 2</li>
          <li>item 3</li>
        </ul>
      </div>
    </body>
</html>
'''.strip()

In [None]:
HTML(html_string)

### `BeautifulSoup` objects

`bs4.BeautifulSoup` takes in a string or file-like object representing HTML (`markup`) and returns a **parsed** document.

In [None]:
import bs4

In [None]:
bs4.BeautifulSoup?

Normally, we pass the text of the response to a `GET` request (e.g. `r.text`) to `bs4.BeautifulSoup`, but here we will pass our hand-crafted `html_string`.

In [None]:
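# Naming the parser explicitly avoids a warning and keeps results consistent across machines.
soup = bs4.BeautifulSoup(html_string, 'html.parser')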
soup

In [None]:
type(soup)

`BeautifulSoup` objects have several useful attributes, e.g. `text`:

In [None]:
print(soup.text)

### Traversing through `descendants`

The `descendants` attribute traverses a `BeautifulSoup` tree using **depth-first traversal**.

Why depth-first? Elements closer to one another on a page are more likely to be related than elements further away.

<center><img src="imgs/dom_tree_1.png" width="60%"></center>
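
As a sketch of what depth-first traversal means, the code below walks a hand-built copy of the tree above, written as nested `(name, children)` tuples. This is not how Beautiful Soup stores documents internally; it is only meant to show the visiting order.

```python
# Each node is (tag name, list of child nodes), mirroring the example document.
tree = ('html', [
    ('body', [
        ('div', [('h1', []), ('p', []), ('p', [('em', [])]), ('hr', [])]),
        ('div', [('ul', [('li', []), ('li', []), ('li', [])])]),
    ]),
])

def depth_first(node):
    """Yield tag names in depth-first (pre-order) order."""
    name, children = node
    yield name
    for child in children:
        # Fully explore one child's subtree before moving to its sibling.
        yield from depth_first(child)

order = list(depth_first(tree))
print(order)
```

Notice that `em` is visited right after the second `p`, before `hr`, because the traversal goes as deep as possible before moving sideways.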

In [None]:
soup.descendants

In [None]:
for child in soup.descendants:
#     print(child) # What would happen if we ran this instead?
    if isinstance(child, str):
        continue
    print(child.name)

### Finding elements in a tree

Practically speaking, you will not use the `descendants` attribute (or the related `children` attribute) directly very often. Instead, you will use the following methods:

- `soup.find(tag)`, which finds the **first** instance of a tag (the first one on the page, i.e. the first one that DFS sees).
    - More general: `soup.find(name=None, attrs={}, recursive=True, string=None, **kwargs)`.
- `soup.find_all(tag)` will find **all** instances of a tag.
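
A short sketch of how the two methods behave, including what they return when nothing matches. It uses its own small document (`demo_html`, invented for this example) so that it can be run independently of the cells above.

```python
import bs4

demo_html = (
    '<div id="content"><p>one</p><p>two</p></div>'
    '<div id="nav"><ul><li>item 1</li></ul></div>'
)
demo_soup = bs4.BeautifulSoup(demo_html, 'html.parser')

first_p = demo_soup.find('p')          # the first <p> only
all_p = demo_soup.find_all('p')        # every <p>, in document order
nav = demo_soup.find('div', attrs={'id': 'nav'})

# When nothing matches, find returns None and find_all returns an empty list.
missing = demo_soup.find('table')
missing_all = demo_soup.find_all('table')

print(first_p.text, [p.text for p in all_p], nav.get('id'))
print(missing, len(missing_all))
```

Because `find` can return `None`, chaining something like `.text` onto a failed search raises an `AttributeError`, so it can be worth checking the result first.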


### Using `find`

Let's try and extract the first `<div>` subtree.

<center><img src="imgs/dom_tree_1.png" width="60%"></center>  

In [None]:
soup

In [None]:
div = soup.find('div')
div

<center><img src="imgs/dom_subtree_1.png" width="30%"></center>  

Let's try and find the `<div>` element that has an `id` attribute equal to `'nav'`.

In [None]:
soup.find('div', attrs={'id': 'nav'})

`find` will return the first occurrence of a tag, regardless of its depth in the tree.

In [None]:
soup.find('ul')

In [None]:
soup.find('li')

### Using `find_all`

`find_all` returns a list of all matches.

In [None]:
soup.find_all('div')

In [None]:
soup.find_all('li')

In [None]:
[x.text for x in soup.find_all('li')]

### Node attributes
* The `text` attribute of a tag element gets the text between the opening and closing tags.
* The `attrs` attribute lists all attributes of a tag.
* The `get(key)` method gets the value of a tag attribute.

In [None]:
soup.find('p')

In [None]:
soup.find('p').text

In [None]:
soup.find('div')

In [None]:
soup.find('div').attrs

In [None]:
soup.find('div').get('id')

The `get` method must be called directly on the node that contains the attribute you're looking for.

In [None]:
soup

In [None]:
# While there are multiple 'id' attributes, none of them are in the <html> tag at the top.
soup.get('id')

In [None]:
soup.find('div').get('id')
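
The sketch below shows what happens when an attribute is absent from the node you ask. It uses a small stand-alone document (`attr_soup`, invented for this example) rather than the `soup` object above.

```python
import bs4

attr_soup = bs4.BeautifulSoup(
    '<div id="nav"><ul><li>item 1</li></ul></div>', 'html.parser'
)
div = attr_soup.find('div')
ul = attr_soup.find('ul')

div_id = div.get('id')                  # the <div> itself has an id
ul_id = ul.get('id')                    # the <ul> does not, so this is None
ul_id_default = ul.get('id', 'no id')   # an optional fallback value

# Square-bracket access raises KeyError when the attribute is missing.
try:
    ul['id']
    raised = False
except KeyError:
    raised = True

print(div_id, ul_id, ul_id_default, raised)
```

Even though the `<ul>` sits inside a `<div>` that has an `id`, asking the `<ul>` for `id` does not look at its parent.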

## Summary, next time

### Summary

- The two main file formats used for storing information on the internet are HTML and JSON.
    - JSON objects resemble Python dictionaries, but they are not quite the same. 
    - Use the `.json()` method of a response object or the `json` package to parse them, **not** `eval`.
- APIs allow us to request information from web servers in a convenient fashion.
- When APIs don't exist, we instead scrape webpages to access their source HTML and then parse the HTML to extract the information we care about.
- Under the document object model (DOM), HTML documents are trees.
    - Elements are defined by tags.
- Beautiful Soup is an HTML parser that allows us to (somewhat) easily extract information from HTML documents.
    - `soup.find` and `soup.find_all` are the methods you will use most often.

### Next time

- Friday: More practical scraping examples. Towards string processing and regular expressions.