## Lecture 21: Parsing HTML

### STA 141B

- HTML is short for HyperText Markup Language
- a descriptive markup language
- annotates text with tags that the browser does not display
- the browser reads the HTML file, tags included, and renders it as text with different fonts, colors, structure, etc.
- an HTML document has a head (metadata, style, scripts) and a body (main content to be displayed)
- a page written in plain HTML with no styling tends to look dated: http://anson.ucdavis.edu/~affarris/

### Tags
&lt;TAG NAME&gt;STUFF BETWEEN TAGS&lt;/TAG NAME&gt;

- paragraph: &lt;p&gt;&lt;/p&gt;
- bold: &lt;strong&gt;
- hyperlinks: &lt;a href="site.html"&gt;
- headers: &lt;h1&gt;, &lt;h2&gt;
- create unordered lists: &lt;ul&gt;
- tables: &lt;table&gt;
- newline: &lt;br&gt;
- display image: &lt;img&gt;
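
To see opening and closing tags in action, here is a minimal sketch that uses only the standard library's `html.parser` (not the BeautifulSoup workflow used later in these notes). The page in `SAMPLE_HTML` and the class name `TagLogger` are made up for illustration.

```python
from html.parser import HTMLParser

# A small, made-up page that uses the tags listed above.
SAMPLE_HTML = """
<h1>Menu</h1>
<p>Try the <strong>salmon</strong>, see <a href="site.html">details</a>.<br>
<img src="fish.png"></p>
<ul><li>soup</li><li>salad</li></ul>
"""


class TagLogger(HTMLParser):
    """Record every opening and closing tag in document order."""

    def __init__(self):
        super().__init__()
        self.opened = []
        self.closed = []

    def handle_starttag(self, tag, attrs):
        self.opened.append(tag)

    def handle_endtag(self, tag):
        self.closed.append(tag)


logger = TagLogger()
logger.feed(SAMPLE_HTML)
logger.close()

# Tags that were opened but never closed in this page.
never_closed = [t for t in logger.opened if t not in logger.closed]

print(logger.opened)
print(never_closed)
```

In this sample, `<br>` and `<img>` are the only tags without a closing partner; every other tag wraps some content.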

### Style

- HTML only gives semantic structure to text; it does not specify or customize how the tags are visually represented
- CSS defines the visual presentation of the HTML elements, usually in separate stylesheet files

```
.page-header {
  padding: 2rem 6rem; }
```

- `tag.class` selectors: select every `tag` element with that class
- `tag#id` selectors: select the `tag` element with that id
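
The same selector syntax can be used to pull elements out of a parsed page. The sketch below assumes BeautifulSoup's `select` and `select_one` methods and uses the built-in `html.parser` so that it runs without lxml; the HTML snippet and its class and id names are invented for the example.

```python
from bs4 import BeautifulSoup

# Made-up snippet with two classes and two ids.
html = """
<div class="page-header" id="top">Title</div>
<div class="note">first note</div>
<p class="note">second note</p>
<div id="footer">Footer</div>
"""
soup = BeautifulSoup(html, "html.parser")

div_notes = soup.select("div.note")     # every <div> with class "note"
all_notes = soup.select(".note")        # any tag with class "note"
footer = soup.select_one("div#footer")  # the <div> whose id is "footer"

print([t.get_text() for t in div_notes])
print([t.name for t in all_notes])
print(footer.get_text())
```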

### HTML structure

Tags such as `div`, `a`, `body`, and `span` wrap more HTML between an opening and a closing tag:

```
<div class='example'>text in between</div>
```

- hierarchical document structure
- If a tag is nested inside of another tag then it is the child (`<a>` is the child of `<p>`)
```
<p>Have you tried the <a href="salmon.html">salmon</a></p>
```
- tree structure is called the Document Object Model (DOM)
- lxml and BeautifulSoup parse the HTML string into this tree structure
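
As a small illustration of the parent/child relationship, the sketch below parses the salmon example and walks one step up and one step down the tree. It uses the built-in `html.parser` backend rather than lxml so that it has no extra dependency; that choice is an assumption of this example only.

```python
from bs4 import BeautifulSoup

html = '<p>Have you tried the <a href="salmon.html">salmon</a></p>'
soup = BeautifulSoup(html, "html.parser")

link = soup.a                   # first <a> tag in the document
parent_name = link.parent.name  # name of the tag that contains it

# Direct child tags of <p> (text nodes are skipped by find_all).
child_tags = [c.name for c in soup.p.find_all(recursive=False)]
href = link["href"]

print(parent_name, child_tags, href)
```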

## Breaking down data.gov climate data

Data.gov maintains a list of climate-related datasets, but the datasets themselves are maintained by different agencies. I'd like to get a sense of the number and diversity of these datasets and of which agencies they come from.

In [100]:
import requests
import requests_ftp
import requests_cache
import lxml
from bs4 import BeautifulSoup
from collections import Counter
from matplotlib import pyplot as plt
import pandas as pd
plt.style.use('ggplot')
requests_cache.install_cache('coll_cache')
%matplotlib inline

I went to data.gov and selected the climate data tab. Here is the URL: https://catalog.data.gov/dataset?groups=climate5434&page=1

Let's make a request for this page.

In [102]:
urlbase = "https://catalog.data.gov/dataset"
dataparams = {"groups":"climate5434","page":1}
climreq = requests.get(urlbase,params = dataparams)

In [103]:
climreq.url

'https://catalog.data.gov/dataset?groups=climate5434&page=1'
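
The URL above is just the base address followed by the encoded query parameters. A rough sketch of that encoding with the standard library, which also shows how the `page` parameter could be varied, is given here; `catalog_url` is a hypothetical helper name, and no request is sent.

```python
from urllib.parse import urlencode

urlbase = "https://catalog.data.gov/dataset"


def catalog_url(page):
    """Build the catalog URL for one page of the climate group."""
    params = {"groups": "climate5434", "page": page}
    return f"{urlbase}?{urlencode(params)}"


# URLs for the first three result pages (nothing is downloaded here).
first_three = [catalog_url(p) for p in range(1, 4)]
for url in first_three:
    print(url)
```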

In [104]:
climhtml = climreq.text
clim = BeautifulSoup(climhtml,'lxml')

Here we fetched page 1 of the climate data catalogue and turned it into a BeautifulSoup object. We used the lxml parser, so lxml has to be installed (BeautifulSoup loads it by name; the explicit `import lxml` above is not strictly needed). Let's look at the raw HTML.

In [105]:
print(climhtml[:500])

<!DOCTYPE html>
<!--[if IE 7]> <html lang="en" class="ie ie7"> <![endif]-->
<!--[if IE 8]> <html lang="en" class="ie ie8"> <![endif]-->
<!--[if IE 9]> <html lang="en" class="ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html lang="en"> <!--<![endif]-->
  <head>
    <!--[if lte ie 8]><script type="text/javascript" src="/fanstatic/vendor/:version:2020-02-07T22:53:07.54/html5.min.js"></script><![endif]-->
<link rel="stylesheet" type="text/css" href="/fanstatic/vendor/:version:2020-02-07T22:53:07.54/se


The BeautifulSoup object and the tags inside it contain other tags, which you can access either as attributes (like below) or with the `find` method.

In [106]:
clim.body

<body data-locale-root="https://catalog.data.gov/" data-site-root="https://catalog.data.gov/">
<div class="hide"><a href="#content">Skip to content</a></div>
<a class="hide" href="#content">Skip to content</a>
<header class="navbar navbar-static-top masthead">
<div class="container">
<div class="searchbox-row skip-navigation">
<div class="skip-link">
<a href="#">Jump to Content</a>
</div>
<div>
<form action="/dataset" class="search-form form-inline navbar-right navbar-nav col-sm-6 col-md-6 col-lg-6" method="get" role="search">
<div class="input-group">
<label class="hide" for="search-header">Search for:</label>
<input class="search-field form-control" id="search-header" name="q" onblur="if(value=='') value = 'Search Data.Gov'" onfocus="if(value=='Search Data.Gov') value = ''" placeholder="Search Data.Gov" type="search" value="Search Data.Gov"/> <span class="input-group-btn">
<button class="search-submit btn_new btn-default" type="submit">
<i class="fa fa-search"></i>
<span class="sr-on
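
Attribute access and `find` return the same thing: the first matching tag. The sketch below checks this on a made-up snippet loosely modeled on the page body above, and also shows that a missing tag yields `None`. The `html.parser` backend is used only to keep the example free of extra dependencies.

```python
from bs4 import BeautifulSoup

# Made-up snippet, loosely modeled on the body shown above.
html = """
<html><body>
<div class="hide"><a href="#content">Skip to content</a></div>
<div class="container"><a href="#">Jump to Content</a></div>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

first_by_attr = soup.body.div  # attribute access: first <div> in <body>
first_by_find = soup.find("div")  # find: first <div> in the document

# find also accepts attribute filters; find_all returns every match.
container = soup.find("div", attrs={"class": "container"})
all_links = soup.find_all("a")

missing = soup.table  # no <table> in this snippet, so this is None

print(first_by_attr is first_by_find, len(all_links), missing)
```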

In [34]:
collset_content = clim.find_all(name='div',attrs={'class':'dataset-content'})
collset1 = collset_content[0]
print(collset1.prettify())

<div class="dataset-content">
 <div class="organization-type-wrap">
  <span class="organization-type" data-organization-type="federal" title="Federal Government">
   <span>
    Federal
   </span>
  </span>
 </div>
 <h3 class="dataset-heading">
  <a href="/dataset/u-s-hourly-precipitation-data">
   U.S. Hourly Precipitation Data
  </a>
  <!-- Snippet snippets/popular.html start -->
  <span class="recent-views" title="855 recent views" xmlns="http://www.w3.org/1999/xhtml">
   <i class="fa fa-line-chart">
   </i>
   855 recent views
  </span>
  <!-- Snippet snippets/popular.html end -->
 </h3>
 <div class="notes">
  <p class="dataset-organization">
   National Oceanic and Atmospheric Administration, Department of Commerce —
  </p>
  <div>
   Hourly Precipitation Data (HPD) is digital data set DSI-3240, archived at the National Climatic Data Center (NCDC). The primary source of data for this file is...
  </div>
 </div>
 <ul class="dataset-resources unstyled">
  <li>
   <a class="label" dat

In [35]:
collset1.name, collset1.attrs, collset1['class']

('div', {'class': ['dataset-content']}, ['dataset-content'])
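
Note that `class` comes back as a list, because an HTML element can carry several classes. A minimal sketch of this behaviour, using two invented `a` tags (the `example.org` link and the class names are placeholders), also shows `get` as a way to read an attribute that may be absent:

```python
from bs4 import BeautifulSoup

# Two invented links: one with two classes, one with none.
html = (
    '<a class="label btn" data-format="html" href="https://example.org/x">HTML</a>'
    '<a href="/dataset/y">Y</a>'
)
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("a")

classes = first["class"]               # list of class names
fmt = first["data-format"]             # ordinary attributes are strings
no_class = second.get("class")         # None instead of a KeyError
safe_classes = second.get("class", [])  # default value if missing

print(classes, fmt, no_class, safe_classes)
```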

In [36]:
aset1 = collset1.find_all('a')
print(aset1[0])
print(aset1[1])

<a href="/dataset/u-s-hourly-precipitation-data">U.S. Hourly Precipitation Data</a>
<a class="label" data-format="html" data-organization="National Oceanic and Atmospheric Administration, Department of Commerce" href="https://data.nodc.noaa.gov/cgi-bin/iso?id=gov.noaa.ncdc:C00313" target="_blank">HTML</a>


Here is the title of the dataset collection. Let's write a script to extract the title and the hrefs from the `a` tags.

In [37]:
aset1[0].text

'U.S. Hourly Precipitation Data'

In [38]:
aset1[0].attrs

{'href': '/dataset/u-s-hourly-precipitation-data'}

In [41]:
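adict = {'label':[],'coll':[],'more':[]}
for a in aset1:
    try:
        # links with a class ('label' or 'more') are filed under that class
        adict[a['class'][0]].append(a['href'])
    except KeyError:
        # no class attribute (or an unexpected class): this is the title link
        adict['coll'].append(a['href'])
        collname = a.text.strip()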

In [42]:
collname

'U.S. Hourly Precipitation Data'

In [43]:
adict

{'label': ['https://data.nodc.noaa.gov/cgi-bin/iso?id=gov.noaa.ncdc:C00313',
  'https://www.ncdc.noaa.gov/cdo-web/search?datasetid=PRECIP_HLY#',
  'https://gis.ncdc.noaa.gov/maps/ncei/cdo/hourly?layers=001',
  'ftp://ftp.ncdc.noaa.gov/pub/data/hourly_precip-3240/',
  'https://gis.ncdc.noaa.gov/arcgis/rest/services/cdo/precip_hly/MapServer',
  '/dataset/u-s-hourly-precipitation-data/resource/42d54fed-eb7a-4ec9-b7e2-7cd6db22b1bc'],
 'coll': ['/dataset/u-s-hourly-precipitation-data'],
 'more': ['/dataset/u-s-hourly-precipitation-data']}
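
Some of these hrefs are relative to the site root (they start with `/dataset/`), while others are already absolute. One way to normalize them, sketched here with the standard library's `urljoin`, is to resolve each href against the site root reported in the `data-site-root` attribute of the page body; the variable names are chosen for this example.

```python
from urllib.parse import urljoin

site_root = "https://catalog.data.gov/"

# A few hrefs copied from the dictionary above.
hrefs = [
    "/dataset/u-s-hourly-precipitation-data",
    "https://gis.ncdc.noaa.gov/maps/ncei/cdo/hourly?layers=001",
    "ftp://ftp.ncdc.noaa.gov/pub/data/hourly_precip-3240/",
]

# Relative hrefs get the site root prepended; absolute ones are unchanged.
absolute = [urljoin(site_root, h) for h in hrefs]
for url in absolute:
    print(url)
```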

We can also extract the description of the collection.

In [44]:
collnotes = collset1.find('div',attrs={'class':'notes'})

In [45]:
print(collnotes.prettify())

<div class="notes">
 <p class="dataset-organization">
  National Oceanic and Atmospheric Administration, Department of Commerce —
 </p>
 <div>
  Hourly Precipitation Data (HPD) is digital data set DSI-3240, archived at the National Climatic Data Center (NCDC). The primary source of data for this file is...
 </div>
</div>



In [46]:
collnotes.p.text.strip()

'National Oceanic and Atmospheric Administration, Department of Commerce —'

In [47]:
collnotes.div.text.strip()

'Hourly Precipitation Data (HPD) is digital data set DSI-3240, archived at the National Climatic Data Center (NCDC). The primary source of data for this file is...'

In [48]:
collorg = collnotes.p.text.strip()
colldesc = collnotes.div.text.strip()
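
The organization string ends with a dash that is only there for layout. A possible cleanup step is sketched below. Splitting on the first comma into agency and department is an assumption based on this single example and may not hold for every collection; the variable names are hypothetical.

```python
# The organization text as extracted above (\u2014 is the trailing dash).
raw_org = (
    "National Oceanic and Atmospheric Administration, "
    "Department of Commerce \u2014"
)

# Drop the trailing dash and any whitespace around it.
clean_org = raw_org.rstrip(" \u2014")

# Assumed layout: "<agency>, <department>".
agency, _, department = clean_org.partition(", ")

print(agency)
print(department)
```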

Now that we have some code to process the collection HTML, let's wrap it in a function so that it can be reused.

In [52]:
def process_collset(collset1):
    """
    Processes the data.gov html within the tags <div class = dataset-content>...
    Input: BeautifulSoup Tag for one <div class = dataset-content>
    Output: tuple of title (string), organization (string),
      description (string), hrefs (dictionary)
    """
    aset1 = collset1.find_all('a')
    adict = {'label':[],'coll':[],'more':[]}
    collname = ""  # stays empty if no title link is found
    for a in aset1:
        try:
            adict[a['class'][0]].append(a['href'])
        except KeyError:
            # no class attribute (or an unexpected class): treat as the title link
            adict['coll'].append(a['href'])
            collname = a.text.strip()
    collnotes = collset1.find('div',attrs={'class':'notes'})
    collorg = collnotes.p.text.strip()
    colldescdiv = collnotes.div
    # some collections have no description <div>
    if colldescdiv:
        colldesc = colldescdiv.text.strip()
    else:
        colldesc = ""
    return collname, collorg, colldesc, adict

In [53]:
help(process_collset)

Help on function process_collset in module __main__:

process_collset(collset1)
    Processes the data.gov html within the tags <div class = dataset-content>...
    Input: BeautifulSoup Tag for one <div class = dataset-content>
    Output: tuple of title (string), organization (string),
      description (string), hrefs (dictionary)



In [54]:
collname, collorg, colldesc, adict = process_collset(collset1)

In [55]:
collname

'U.S. Hourly Precipitation Data'

In [56]:
collorg

'National Oceanic and Atmospheric Administration, Department of Commerce —'

In [57]:
colldesc

'Hourly Precipitation Data (HPD) is digital data set DSI-3240, archived at the National Climatic Data Center (NCDC). The primary source of data for this file is...'

In [58]:
adict

{'label': ['https://data.nodc.noaa.gov/cgi-bin/iso?id=gov.noaa.ncdc:C00313',
  'https://www.ncdc.noaa.gov/cdo-web/search?datasetid=PRECIP_HLY#',
  'https://gis.ncdc.noaa.gov/maps/ncei/cdo/hourly?layers=001',
  'ftp://ftp.ncdc.noaa.gov/pub/data/hourly_precip-3240/',
  'https://gis.ncdc.noaa.gov/arcgis/rest/services/cdo/precip_hly/MapServer',
  '/dataset/u-s-hourly-precipitation-data/resource/42d54fed-eb7a-4ec9-b7e2-7cd6db22b1bc'],
 'coll': ['/dataset/u-s-hourly-precipitation-data'],
 'more': ['/dataset/u-s-hourly-precipitation-data']}

In [59]:
for collset in collset_content:
    collname = process_collset(collset)[0]
    print(collname)

U.S. Hourly Precipitation Data
Fruit and Vegetable Prices
NCDC Storm Events Database
National Hydrography Dataset (NHD) - USGS National Map Downloadable Data Collection
Soil Survey Geographic Database (SSURGO)
Global Surface Summary of the Day - GSOD
Feed Grains Database
Military Installations, Ranges, and Training Areas
NCEP North American Regional Reanalysis (NARR), for 1979 to Present
Census Data
MyPyramid Food Raw Data
USGS National Structures Dataset - USGS National Map Downloadable Data Collection
Toxics Release Inventory (TRI)
Global Forecast System (GFS) [0.5 Deg.]
1/9th Arc-second Digital Elevation Models (DEMs) - USGS National Map 3DEP Downloadable Data Collection
NOAA NEXt-Generation RADar (NEXRAD) Products
Internet Weather Source
Fertilizer Use and Price
Freight Analysis Framework
Integrated Surface Dataset (Global)
