In [None]:
%matplotlib inline
import requests
import bs4
import numpy as np
import pandas as pd
from IPython.display import HTML

# Announcements

* Midterm Tuesday
* Online via Gradescope; start any time in a 24-hour window, with a time limit of 90 minutes
* Practice questions will be posted today

### Lecture 11 - Part 1

# Parsing HTML

## Last time

* HTTP Requests:
    * GET, POST, HEAD
    * Status codes
* Data of the Internet:
    * HTML, JSON
    * json.loads
* Scraping vs. APIs

## This time

- What is HTML?
- Once we've downloaded the HTML, how do we extract what we want?

## What is HTML?

* HTML (HyperText Markup Language) is the most basic building block of the Web. 
* It defines the content and layout of a webpage.
* It is what you get back when you scrape a webpage.
* HTML markup includes special "elements" (tags) such as 
    * `<head>`, `<title>`, `<body>`, `<p>`, `<div>`, `<img>`, and many more.
    

See [this tutorial](http://fab.academany.org/2018/labs/fablaboshanghai/students/bob-wu/Fabclass/week2_project_management/HTML.html) for more reference.

In [None]:
!cat data/lec10.html

In [None]:
!cat data/lec10_pic_ref.html

In [None]:
# Display an HTML page, inline
from IPython.display import HTML
HTML(open('data/lec10_pic_ref.html').read())

### The Anatomy of HTML

- HTML is **hierarchical**.

* **HTML Document**: the totality of markup that makes up a web page
* **Document Object Model**: the internal representation of an HTML document as a *tree* structure.

* **HTML Element**: An object in the DOM, such as a paragraph, header, title.
* **HTML Tags**: Markers that denote the *start* and *end* of an element. E.g. `<p>` and `</p>`.



![image.png](attachment:image.png)

In [None]:
!cat data/lec10.html

### HTML Tags

HTML tags define both the document structure and elements within:

<img src="imgs/webpage_anatomy.png" width="50%">

### Useful tags to know:

|Structure Elements|Description|Head/Body Elements|Description|
|---|---|---|---|
|`<html>`|the document|`<p>`|the paragraph|
|`<head>`|the header|`<h1>, <h2>, ...`|header(s)|
|`<body>`|the body|`<img>`|images|
|`<div>` |a logical division of the document|`<a>`| anchor (hyper-link)|
|`<span>`|an *in-line* logical division|[MANY MORE](https://en.wikipedia.org/wiki/HTML_element)||


### Example: Images and Hyperlinks

- Tags can have attributes and text.

* Tag for a picture (can use a link to the image):
```
<img src="HumDum.png" alt="Humpty Dumpty">
```

* Tag for a hyperlink: 

```
<a href="https://ucsd.edu/">Visit our page on the World Wide Web!</a>
```
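
As a small sketch of why attributes matter for scraping, the two tags above can be parsed and their attributes read back with Beautiful Soup (covered in Part 2). The variable names here are made up for illustration.

```python
import bs4

# The two example tags from above, combined into one snippet.
snippet = '''
<img src="HumDum.png" alt="Humpty Dumpty">
<a href="https://ucsd.edu/">Visit our page on the World Wide Web!</a>
'''

soup_tags = bs4.BeautifulSoup(snippet, 'html.parser')

img = soup_tags.find('img')
link = soup_tags.find('a')

img_src = img.get('src')      # value of the src attribute
img_alt = img.get('alt')      # value of the alt attribute
link_href = link.get('href')  # where the hyperlink points
link_text = link.text         # text between <a> and </a>

print(img_src, img_alt, link_href, link_text)
```

Note that `<img>` has attributes but no text, while `<a>` has both.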


## div tag

```
<div style="background-color:lightblue">
  <h3>This is a heading</h3>
  <p>This is a paragraph.</p>
</div>
```

* The `<div>` tag defines a division or a section in an HTML document.

* The `<div>` element is often used as a container for other HTML elements to style them with CSS or to perform certain tasks with JavaScript.

* Often has attributes (important for scraping!)


## The HTML Document Tree

* The DOM represents a document as a logical tree.

<div class="image-txt-container">

<img src="imgs/webpage_anatomy.png" width="50%">

</div>    

Draw a DOM (tree) yourself!

## The HTML Document Tree

* The DOM represents a document as a logical tree.
* Unlike in DSC 40B, child nodes are ordered.

<div class="image-txt-container">

<img src="imgs/webpage_anatomy.png" width="50%">

<img src="imgs/dom_tree.png" width="50%">

</div> 

### Question: "quotes collection" website

* What do you think the DOM tree looks like? (roughly)
* What would your table schema (i.e. rows/columns) look like?

<img src="imgs/quotes2scrape.png">

### Example (rough) document tree

* How would you parse it, if you wanted to collect data:
    - Quote-by-quote (all attributes)?
    - Attribute-by-attribute?
    
<div class="image-txt-container">

<img src="imgs/quotes2scrape.png" width="50%">
    
<img src="imgs/quote_dom.png" width="50%">

</div> 

### Part 2

# BeautifulSoup

## BeautifulSoup: parsing the document tree

* [Beautiful Soup 4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a Python HTML parser.
* **Warning:** BeautifulSoup has changed between versions, so make sure you are looking at documentation for the version you are using (4 here).

* Parse a small HTML "page", with corresponding tree below:

<img src="imgs/dom_tree_1.png" width="75%">

In [None]:
s = '''
<html>
    <body>
      <div id="content">
        <h1>Heading here</h1>
        <p>My First paragraph</p>
        <p>My <em>second</em> paragraph</p>
        <hr>
      </div>
      <div id="nav">
        <ul>
          <li>item 1</li>
          <li>item 2</li>
          <li>item 3</li>
        </ul>
      </div>
    </body>
</html>
'''.strip()

In [None]:
# we can render it

HTML(s)

### BeautifulSoup Parsing

* `bs4.BeautifulSoup` parses a string or file-like object representing HTML
* Returns a *parsed document*
* Use the `children` property to access child nodes.
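
A minimal sketch of the "file-like object" case: `io.StringIO` stands in for a file opened with `open(...)`, and the markup is made up for illustration.

```python
import io
import bs4

# io.StringIO behaves like an open text file; the HTML here is a toy example.
fake_file = io.StringIO('<html><body><p>one</p><p>two</p></body></html>')
doc = bs4.BeautifulSoup(fake_file, 'html.parser')

# .children yields direct children only, not deeper descendants.
top_level = [child.name for child in doc.children]
body_children = [child.name for child in doc.body.children]

print(top_level, body_children)
```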

In [None]:
import bs4

In [None]:
# builder: what we use to parse HTML
#help(bs4.BeautifulSoup)

In [None]:
print(s)

In [None]:
# usually s is a result of a GET request
# name the parser explicitly so results don't depend on which parsers are installed
soup = bs4.BeautifulSoup(s, 'html.parser')
soup

In [None]:
# DOM object, tree structure
type(soup)

In [None]:
# many useful attributes 
# print just the text
print(soup.text)

### Child Nodes

- The document is a tree.
- The BeautifulSoup object is a *node* in the tree.
- It has 0 or more child nodes.
- Each page element becomes a node in this tree.

In [None]:
soup.children

### Detour: iterators

In [None]:
lst = [1,2,3,4]
double = map(lambda x: x*2, lst)
double
# what is the output?

In [None]:
next(double)

In [None]:
list(double)
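
The detour above shows that an iterator is consumed as it is read. A self-contained sketch of the same cells, with the results stored in illustrative variable names:

```python
lst = [1, 2, 3, 4]
double = map(lambda x: x * 2, lst)   # lazy: nothing is computed yet

first = next(double)    # consumes the first element
rest = list(double)     # only the elements that are left
again = list(double)    # the iterator is now exhausted

print(first, rest, again)
```

This matters below because `soup.children` is also an iterator: it has no length and cannot be indexed until it is wrapped in `list(...)`.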

### Child nodes

In [None]:
# type `soup.` and press Tab to browse the available attributes

In [None]:
len(list(soup.children))

In [None]:
child = next(soup.children)
child

In [None]:
soup.html?

In [None]:
type(child)

### Child nodes

In [None]:
root = list(soup.children)[0]
root

In [None]:
for grandchild in child:
    print(type(grandchild))
    print(grandchild)
    print('-' * 80)

### Child nodes

In [None]:
body = list(child.children)[1]

In [None]:
body

In [None]:
list(child)[1]

In [None]:
print(body.prettify())

In [None]:
for greatgrandchild in body:
    print(greatgrandchild)
    print('-' * 80)

## Traversing the tree

In [None]:
soup.html

### Document tree traversal: depth-first

* Using the `.descendants` attribute, traverse `soup` depth-first.
* Take care to only print node elements!

<img src="imgs/dom_tree_1.png" width="75%">

In [None]:
# DFS using `descendants` property
# a DFS produces elements in order that they appear in the HTML
for x in soup.descendants:
    if isinstance(x, str):
        continue       
    print(str(x.name))

In [None]:
soup.html

## Finding in the tree

- `.find()` will find the *first* instance of a tag (first in DFS order, which is also first on the page)
- `.find_all()` will find *all* instances of the tag
- `soup.find(name=None, attrs={}, recursive=True, text=None, **kwargs)`
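
A short sketch of how these arguments behave, on a made-up snippet (the markup and variable names are illustrative only):

```python
import bs4

html = '''
<div id="content"><p class="intro">First</p><p>Second</p></div>
<div id="nav"><ul><li>item 1</li><li>item 2</li></ul></div>
'''
page = bs4.BeautifulSoup(html, 'html.parser')

first_p = page.find('p')                           # first <p> in document order
intro = page.find('p', attrs={'class': 'intro'})   # filter on an attribute
nav = page.find('div', attrs={'id': 'nav'})
items = page.find_all('li')                        # every <li>, at any depth
missing = page.find('table')                       # no match: find returns None

# recursive=False looks at direct children only
top_divs = page.find_all('div', recursive=False)
top_lis = page.find_all('li', recursive=False)

print(first_p.text, nav.get('id'), [li.text for li in items], missing)
```

Since `.find()` returns `None` when nothing matches, check the result before chaining `.text` or `.get(...)` onto it.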

## Selecting subtrees of the document tree

* Using `soup.find('div')`:


<div class="image-txt-container">
    
<img src="imgs/dom_tree_1.png" width="50%">
  
<img src="imgs/dom_subtree_1.png" width="40%">
    
</div>

In [None]:
div = soup.find('div')
div

In [None]:
type(div)

In [None]:
# more specific: match on an attribute
soup.find('div', attrs={'id': 'nav'})

In [None]:
# will recurse into tree until it finds something
soup.find('ul')

In [None]:
# all divs
soup.find_all('div')

In [None]:
lsts = soup.find_all('li')
lsts

In [None]:
[x.text for x in soup.find_all('li')]

### Document tree traversal: depth-first

* Using `.children` attribute, traverse `soup` depth-first.
* Take care to only print node elements!
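
One possible way to approach this exercise is a recursive helper, sketched below on a small made-up document. The function name `dfs_tag_names` is hypothetical, and this is not the only way to write the traversal.

```python
import bs4

def dfs_tag_names(node):
    """Return the tag names under `node` in depth-first (document) order."""
    names = []
    for child in node.children:
        # text nodes are not Tags, so they are skipped here
        if isinstance(child, bs4.element.Tag):
            names.append(child.name)
            names.extend(dfs_tag_names(child))
    return names

small = bs4.BeautifulSoup(
    '<html><body><div><h1>A</h1><p>B <em>c</em></p></div>'
    '<ul><li>x</li></ul></body></html>',
    'html.parser',
)
names = dfs_tag_names(small)

# the same order falls out of the built-in .descendants iterator
via_descendants = [x.name for x in small.descendants
                   if isinstance(x, bs4.element.Tag)]
print(names)
```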

<img src="imgs/dom_tree_1.png" width="75%">

In [None]:
soup

## Selecting attributes of nodes
* The `.text` property of a tag element gets the text elements between the tags.
* The `.attrs` property lists all attributes of a tag.
* The `.get(key)` method gets the value of a tag attribute (and returns `None` if the attribute is missing).

In [None]:
# navigate by tag name: each step returns the first matching tag
hdr = soup.html.body.h1

In [None]:
hdr

In [None]:
hdr.text

In [None]:
hdr.attrs

In [None]:
soup

In [None]:
# key: id

div = list(soup.descendants)[21]
div

In [None]:
div.attrs

In [None]:
div.attrs['id']

In [None]:
div.get('id')

In [None]:
div.get('nonexistent')

In [None]:
div.text

### Part 3

# Example: Scraping HDSI faculty

### Example

- Get a list of HDSI faculty from https://datascience.ucsd.edu/about/faculty/faculty/
- hint: use browser's inspect tool

In [None]:
import requests
response = requests.get('https://datascience.ucsd.edu/about/faculty/faculty/')

In [None]:
# live demo

### Part 4

# Example: Scraping Quotes

### Scraping Quotes

Go to http://quotes.toscrape.com/

### Example: scraping quotes
* Collect famous quotes and related data
* Requires scraping many pages to get data
* Parse the pages to extract information

<img src="imgs/quotes2scrape.png" width="50%">

### The plan

- `download_page(n)`
    Downloads and parses page `n` from the website.
- `parse_page(n)`
    Parses HTML into a dictionary with keys: `quote`, `author`, `tags`.
    
**Tip 1:** Write functions that make requests and functions that parse, but no function that does both!

- Easier to debug and catch errors!
- Avoids unnecessary requests!
    
**Tip 2:** Make lots of helper functions!
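
A rough sketch of how the two tips could shape the code, not the version written in the live demo. Several things here are assumptions: the URL pattern, the CSS class names (`quote`, `text`, `author`, `tag`), having `download_page` return raw HTML text, and having `parse_page` take that text rather than a page number. Check them against the real page with the browser's inspect tool.

```python
import bs4
import requests

# Assumed URL pattern for page n of the site.
BASE_URL = 'http://quotes.toscrape.com/page/{}/'

def download_page(n):
    """Request page n and return its HTML as text. No parsing happens here."""
    response = requests.get(BASE_URL.format(n), timeout=10)
    response.raise_for_status()
    return response.text

def parse_page(html):
    """Parse HTML text into a list of dicts with keys quote, author, tags."""
    soup = bs4.BeautifulSoup(html, 'html.parser')
    rows = []
    # The class names below are assumptions about the page's markup.
    for div in soup.find_all('div', attrs={'class': 'quote'}):
        rows.append({
            'quote': div.find('span', attrs={'class': 'text'}).text,
            'author': div.find('small', attrs={'class': 'author'}).text,
            'tags': [a.text for a in div.find_all('a', attrs={'class': 'tag'})],
        })
    return rows

# Because parsing is separate, it can be checked offline on a made-up snippet.
sample = '''
<div class="quote">
  <span class="text">A made-up quote.</span>
  <small class="author">Nobody</small>
  <a class="tag" href="#">fake</a><a class="tag" href="#">demo</a>
</div>
'''
parsed = parse_page(sample)
print(parsed)
```

With this split, a parsing bug can be fixed and re-tested on saved HTML without sending another request.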

In [None]:
# live demo

### Scraping quotes: conclusion

* Make as few requests as possible
* Create a request/parsing plan *beforehand*
* Create your output schema *beforehand*
* Separate parsing and requests into different functions!

### Part 5

# Nested vs. Flat Data Structures

## Nested vs flat data structures

* Nested: HTML, JSON, XML
* Flat: CSV

Suppose we obtained the quotes data via an API and saved it to the file `quotes2scrape.json`
- `quotes2scrape.json` is a "JSON records file": each line is a valid JSON object
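
To make the format concrete, here is a sketch with two made-up records. `io.StringIO` stands in for the real file, so nothing here depends on `data/quotes2scrape.json`.

```python
import io
import json

# Two made-up records, one JSON object per line.
records_text = (
    '{"quote": "q1", "author": "a1", "tags": ["x", "y"]}\n'
    '{"quote": "q2", "author": "a2", "tags": []}\n'
)

# With a real file this would be: with open(path) as f
with io.StringIO(records_text) as f:
    records = [json.loads(line) for line in f]

print(records[0]['tags'], len(records))
```

Each parsed line is a dictionary, and the `tags` value is still a nested list.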

In [None]:
import json
json.loads(open('data/quotes2scrape.json').readline())

In [None]:
# read in all the lines: each element is a dictionary
L = [json.loads(x) for x in open('data/quotes2scrape.json')]

In [None]:
# What happened to the tags column?
df = pd.DataFrame(L)
df.head()

In [None]:
df.to_csv('out.csv')

In [None]:
type(pd.read_csv('out.csv')['tags'].iloc[0])

## One-hot encoding

- "Flatten" the lists by creating a column for each tag
- A 1 in the column means that this quote has that tag
- A 0 means the quote does not have that tag
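
A sketch of the idea on a made-up table. It uses `explode` and `pd.get_dummies`, which is one of several ways to build the indicator columns; the column and variable names are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    'author': ['a1', 'a2', 'a3'],
    'tags': [['books', 'life'], ['life'], []],
})

# explode: one row per (quote, tag), keeping the original row label;
# a quote with no tags becomes a single NaN row
long_tags = toy['tags'].explode()

# one indicator column per distinct tag, then collapse back to one row per quote
one_hot = pd.get_dummies(long_tags).groupby(level=0).sum().astype(int)

toy_full = pd.concat([toy, one_hot], axis=1)
print(toy_full)
```

The quote with an empty tag list ends up with 0 in every tag column.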

In [None]:
distinct_tags = np.unique(df.tags.sum())

In [None]:
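def list2series(taglist):
    # one entry per tag on this quote, each set to 1
    return pd.Series({k: 1 for k in taglist}, dtype=float)

# tags a quote lacks come back as NaN; fill them with 0 to match the encoding above
tags = df.tags.apply(list2series).fillna(0).astype(int)
tags.head()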

In [None]:
# combine them
pd_full = pd.concat([df, tags], axis=1)
pd_full

In [None]:
pd_full[pd_full['books'] == 1]

### Converting JSON to CSV

* Flattening the nested list requires a lot of space. Why?
* We can't always read in *all* the JSON; might need to extract just what we need line-by-line.
* A JSON records file is **not** valid JSON. Why? Why can't we just use JSON?
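
The last two points can be sketched with made-up records: parsing the whole text at once fails, while going line by line lets us keep only the fields we need. The field names and the `n_tags` column are illustrative.

```python
import csv
import io
import json

records_text = (
    '{"quote": "q1", "author": "a1", "tags": ["x", "y"]}\n'
    '{"quote": "q2", "author": "a2", "tags": []}\n'
)

# Taken as a whole, the text is several JSON objects, not one JSON document.
try:
    json.loads(records_text)
    whole_file_ok = True
except json.JSONDecodeError:
    whole_file_ok = False

# Stream one line at a time, writing only the needed fields to CSV.
out = io.StringIO()                      # stands in for an output file
writer = csv.writer(out, lineterminator='\n')
writer.writerow(['author', 'n_tags'])
for line in io.StringIO(records_text):   # stands in for the input file
    rec = json.loads(line)
    writer.writerow([rec['author'], len(rec['tags'])])

csv_text = out.getvalue()
print(whole_file_ok)
print(csv_text)
```

Only one record is held in memory at a time, which is what makes the records format practical for files too large to load whole.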