In [1]:
import numpy as np
from bs4 import BeautifulSoup

# Parsing an html document

The library you want is called Beautiful Soup. The docs are available here: https://beautiful-soup-4.readthedocs.io/en/latest/

In the cells below, I open a saved html file, take a peek at it, and extract some data from it.

In [4]:
with open("hydrogen.html") as fp:
    soup = BeautifulSoup(fp, "html.parser")  # name the parser so every machine parses the file the same way

In [5]:
soup

<html>
<head><title>H</title></head>
<link href="../periodic.css" rel="STYLESHEET" type="TEXT/CSS"/>
<body>
<h2>H</h2>
<table align="CENTER" cellpadding="4" cellspacing="2">
<tr align="CENTER">
<th class="head" width="10%">Charge</th>
<th class="head" width="10%">C.N.</th>
<th class="head" width="10%">Spin</th>
<th class="head" width="10%">I.R./Å</th>
<tr align="CENTER">
<td class="NM">+1</td>
<td class="NM">1</td>
<td class="NM"> </td>
<td class="NM">-0.38</td>
<tr align="CENTER">
<td class="NM"></td>
<td class="NM">2</td>
<td class="NM"> </td>
<td class="NM">-0.18</td>
</tr></tr></tr></table>
<h2>D</h2>
<table align="CENTER" cellpadding="4" cellspacing="2">
<tr align="CENTER">
<th class="head" width="10%">Charge</th>
<th class="head" width="10%">C.N.</th>
<th class="head" width="10%">Spin</th>
<th class="head" width="10%">I.R./Å</th>
<tr align="CENTER">
<td class="NM">+1</td>
<td class="NM">2</td>
<td class="NM"> </td>
<td class="NM">-0.10</td>
</tr></tr></table>
</body>
</html>

In [15]:
soup.h2
#shows the first item whose tag is "h2": the header "H" that labels the table

<h2>H</h2>

In [14]:
soup.tr
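#shows the first item whose tag is "tr". The rows in this file aren't closed before the next one opens,
#so the first "tr" ends up holding the whole hydrogen table, with the later rows nested inside it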

<tr align="CENTER">
<th class="head" width="10%">Charge</th>
<th class="head" width="10%">C.N.</th>
<th class="head" width="10%">Spin</th>
<th class="head" width="10%">I.R./Å</th>
<tr align="CENTER">
<td class="NM">+1</td>
<td class="NM">1</td>
<td class="NM"> </td>
<td class="NM">-0.38</td>
<tr align="CENTER">
<td class="NM"></td>
<td class="NM">2</td>
<td class="NM"> </td>
<td class="NM">-0.18</td>
</tr></tr></tr>

In [16]:
soup.tr.th
#prints the first "th" entry within the first "tr" entry

<th class="head" width="10%">Charge</th>

In [17]:
soup.tr.td
#prints the first "td" entry within the first "tr" entry

<td class="NM">+1</td>

In [29]:
soup.find_all('td')
#finds every "td" tag

[<td class="NM">+1</td>,
 <td class="NM">1</td>,
 <td class="NM"> </td>,
 <td class="NM">-0.38</td>,
 <td class="NM"></td>,
 <td class="NM">2</td>,
 <td class="NM"> </td>,
 <td class="NM">-0.18</td>,
 <td class="NM">+1</td>,
 <td class="NM">2</td>,
 <td class="NM"> </td>,
 <td class="NM">-0.10</td>]

In [25]:
#what if I want just the contents of a tag, not the whole tag? Here are a couple of examples
tag = soup.tr.td
tag.contents[0]

'+1'

In [31]:
# Here I'm finding all the items within the td tag, then displaying the contents of the first one.
# You could imagine using a loop to step through the list and grab each value
data = soup.find_all('td')
data[0].contents[0]

'+1'
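
The loop mentioned above could look something like this. It's a minimal sketch on a made-up one-row table (the `html` string is my own stand-in for the saved file), using `get_text(strip=True)` so that empty or whitespace-only cells come back as empty strings instead of raising an IndexError on `contents[0]`.

```python
from bs4 import BeautifulSoup

# Stand-in for the saved file: one row copied from the hydrogen table.
html = "<table><tr><td>+1</td><td>1</td><td> </td><td>-0.38</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

# Step through every <td> and keep its text, stripped of whitespace.
values = [td.get_text(strip=True) for td in soup.find_all("td")]
print(values)
```

An empty `<td></td>` has an empty `contents` list, which is why I'd reach for `get_text` over `contents[0]` once you loop over the whole table.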

In [45]:
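#this one steps through the children of the first tr tag and prints each one's string
#(the blank lines are the newlines between tags; None is the nested tr, which has several children)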
for child in soup.tr.children:
    print(child.string)



Charge


C.N.


Spin


I.R./Å


None
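
The blank lines and the `None` come from how `.string` works: it only returns text when a tag has a single child, and the newlines between tags count as children too. Here's a small sketch of a tidier way to pull out just the header cells. The `html` string is a cut-down copy of the hydrogen table that I typed in by hand, so treat it as an assumption about the real file.

```python
from bs4 import BeautifulSoup

# Cut-down copy of the hydrogen table. As the nested output above suggests,
# the <tr> tags are left unclosed.
html = """<table>
<tr><th>Charge</th><th>C.N.</th><th>Spin</th><th>I.R./Å</th>
<tr><td>+1</td><td>1</td><td> </td><td>-0.38</td>
</table>"""

soup = BeautifulSoup(html, "html.parser")

# recursive=False looks only at the direct children of the first <tr>,
# so cells from any rows nested inside it are left out.
header_cells = soup.tr.find_all("th", recursive=False)

# get_text(strip=True) returns plain text with surrounding whitespace removed.
header = [cell.get_text(strip=True) for cell in header_cells]
print(header)
```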


There are a lot of ways to navigate through these documents. You'll want to consider how to search through the html on that page. I would imagine you need to plan out the structure first, then write loops that grab each piece of data. I'd probably work out how to store either rows or columns as lists, then reconstruct the table as a dataframe. You'll also need to decide whether to fill in the implied values (for example, a blank Charge cell that appears to carry over from the row above) or leave them blank.
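
Here's one way the row-by-row plan could be sketched. The helper names `table_rows` and `fill_down` are mine, the `HTML` string is a hand-typed copy of the hydrogen table, and filling down assumes a blank Charge cell means "same as the row above", which you should check against the page before trusting it.

```python
from bs4 import BeautifulSoup

# Hand-typed copy of the hydrogen table (an assumption about the saved file).
HTML = """<table>
<tr><th>Charge</th><th>C.N.</th><th>Spin</th><th>I.R./Å</th>
<tr><td>+1</td><td>1</td><td> </td><td>-0.38</td>
<tr><td></td><td>2</td><td> </td><td>-0.18</td>
</table>"""


def table_rows(table):
    """Return one list of cell texts per <tr>, using only each row's own cells."""
    rows = []
    for tr in table.find_all("tr"):
        cells = tr.find_all(["th", "td"], recursive=False)
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows


def fill_down(rows, column):
    """Copy the last non-empty value of `column` into the blank cells below it."""
    last = ""
    filled = []
    for row in rows:
        row = list(row)  # work on a copy so the input is left alone
        if row[column]:
            last = row[column]
        else:
            row[column] = last
        filled.append(row)
    return filled


soup = BeautifulSoup(HTML, "html.parser")
header, *body = table_rows(soup.table)
body = fill_down(body, column=0)
print(header)
print(body)
```

Skipping the `fill_down` call is the "leave them blank" option; everything else stays the same.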

Once you've figured out how to parse the hydrogen data into a dataframe, save it as a csv. You don't want to start working with the data right away, because you still need to learn how to get it all from the web. You'll end up in a new notebook by the time you try to ML this data. Just save it in the folder and move on to the next skill to learn.
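
Saving could look like the sketch below, assuming you already have the header and the rows as plain lists. I'm using pandas here, and the filename `hydrogen.csv` is just a placeholder I picked.

```python
import pandas as pd

# Assumed result of parsing the hydrogen table: one header list, one list per row.
header = ["Charge", "C.N.", "Spin", "I.R./Å"]
rows = [["+1", "1", "", "-0.38"], ["+1", "2", "", "-0.18"]]

df = pd.DataFrame(rows, columns=header)

# Radii are numbers, so convert that column; Charge stays text to keep the "+" sign.
df["I.R./Å"] = pd.to_numeric(df["I.R./Å"])

# index=False keeps the 0, 1, 2... row labels out of the file.
df.to_csv("hydrogen.csv", index=False)

# Read it back as text to confirm what actually landed on disk.
check = pd.read_csv("hydrogen.csv", dtype=str, keep_default_na=False)
print(check)
```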

# Downloading the internet

Beautiful Soup lets you parse an html file, but it doesn't talk to the internet directly. For that, you need another library, and I've got two options for you. The first is urllib.request, which is in the Python standard library. The code below shows how to import it, open a URL, and run a small crawler that collects the URLs linked from a page.

https://docs.python.org/3/library/urllib.request.html
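
Before pointing this at a real site, it may help to wrap the open-and-parse step in a function that sets a timeout and reports failures instead of crashing. This is only a sketch: `fetch_soup` is a name I made up, and the demo uses a `data:` URL (which `urlopen` understands) so it runs without touching the network.

```python
from urllib.error import URLError
from urllib.parse import quote
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Open `url` and return its parsed soup, or None if it cannot be opened."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return BeautifulSoup(response.read(), "html.parser")
    except URLError as err:  # HTTPError is a subclass, so 404s land here too
        print(f"could not open {url}: {err}")
        return None


# A data: URL carries the page inside the URL itself, handy for a dry run.
demo_url = "data:text/html," + quote("<h2>H</h2>")
demo_soup = fetch_soup(demo_url)
print(demo_soup.h2.string)
```

Swapping `demo_url` for one of the element pages should work the same way, as long as the site is reachable.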

In [13]:
from urllib.request import urlopen

In [39]:
with urlopen('http://mrlweb.mrl.ucsb.edu/~seshadri/Periodic/Elements/Eu.html') as response:
    print("this opened")
    soup = BeautifulSoup(response, "html.parser")

this opened


In [40]:
soup

<html>
<head><title>Eu</title></head>
<link href="../periodic.css" rel="STYLESHEET" type="TEXT/CSS"/>
<body>
<h2>Eu</h2>
<table align="CENTER" cellpadding="4" cellspacing="2">
<tr align="CENTER">
<th class="head" width="10%">Charge</th>
<th class="head" width="10%">C.N.</th>
<th class="head" width="10%">Spin</th>
<th class="head" width="10%">I.R./Å</th>
<tr align="CENTER">
<td class="LA">+2</td>
<td class="LA">6</td>
<td class="LA"></td>
<td class="LA">1.17</td>
<tr align="CENTER">
<td class="LA"></td>
<td class="LA">7</td>
<td class="LA"></td>
<td class="LA">1.20</td>
<tr align="CENTER">
<td class="LA"></td>
<td class="LA">8</td>
<td class="LA"></td>
<td class="LA">1.25</td>
<tr align="CENTER">
<td class="LA"></td>
<td class="LA">9</td>
<td class="LA"></td>
<td class="LA">1.30</td>
<tr align="CENTER">
<td class="LA"></td>
<td class="LA">10</td>
<td class="LA"></td>
<td class="LA">1.35</td>
<tr align="CENTER">
<td class="LA"></td>
<td class="LA">12</td>
<td class="LA"></td>
<td class="

In [35]:
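from urllib.parse import urljoin

def crawl(page, depth=1):
    # Collects the links found on `page`. Every pass of the loop looks at the
    # same `page`, so a depth above 1 does not follow links any deeper.
    indexed_url = []
    for i in range(depth):
        if page not in indexed_url:
            indexed_url.append(page)
            try:
                c = urlopen(page)
                print("opened")
            except Exception:
                print("could not open %s" % page)
                continue
            soup = BeautifulSoup(c.read(), "html.parser")
            links = soup('a')  # shorthand for soup.find_all('a')
            for link in links:
                if 'href' in dict(link.attrs):
                    # turn relative links into absolute URLs
                    url = urljoin(page, link['href'])
                    if url.find("'") != -1:
                        continue
                    url = url.split('#')[0]  # drop any #fragment
                    if url[0:4] == 'http':
                        indexed_url.append(url)
    return indexed_url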

In [37]:
crawl("http://mrlweb.mrl.ucsb.edu/~seshadri/Periodic/periodic.html", depth = 2)

opened


['http://mrlweb.mrl.ucsb.edu/~seshadri/Periodic/periodic.html',
 'http://mrlweb.mrl.ucsb.edu/~seshadri/Periodic/Elements/H.html',
 'http://mrlweb.mrl.ucsb.edu/~seshadri/Periodic/Elements/He.html',
 'http://mrlweb.mrl.ucsb.edu/~seshadri/Periodic/Elements/Li.html',
 'http://mrlweb.mrl.ucsb.edu/~seshadri/Periodic/Elements/Be.html',
 'http://mrlweb.mrl.ucsb.edu/~seshadri/Periodic/Elements/B.html',
 'http://mrlweb.mrl.ucsb.edu/~seshadri/Periodic/Elements/C.html',
 'http://mrlweb.mrl.ucsb.edu/~seshadri/Periodic/Elements/N.html',
 'http://mrlweb.mrl.ucsb.edu/~seshadri/Periodic/Elements/O.html',
 'http://mrlweb.mrl.ucsb.edu/~seshadri/Periodic/Elements/F.html',
 'http://mrlweb.mrl.ucsb.edu/~seshadri/Periodic/Elements/Ne.html',
 'http://mrlweb.mrl.ucsb.edu/~seshadri/Periodic/Elements/Na.html',
 'http://mrlweb.mrl.ucsb.edu/~seshadri/Periodic/Elements/Mg.html',
 'http://mrlweb.mrl.ucsb.edu/~seshadri/Periodic/Elements/Al.html',
 'http://mrlweb.mrl.ucsb.edu/~seshadri/Periodic/Elements/Si.html',
 'ht
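
The `crawl` function above only ever opens the starting page, which is all you need to list the element pages. If you do want `depth` to mean "follow links this many hops", a breadth-first version could be sketched like this. The names `page_links` and `crawl_bfs` are mine, and the demo runs on a tiny made-up link map instead of the real site.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup


def page_links(url):
    """Return the absolute http(s) links on `url`, with #fragments removed."""
    with urlopen(url, timeout=10) as response:
        soup = BeautifulSoup(response.read(), "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        link = urldefrag(urljoin(url, a["href"])).url
        if link.startswith("http"):
            links.append(link)
    return links


def crawl_bfs(start, depth=1, get_links=page_links):
    """Return `start` plus every URL found within `depth` link-hops of it."""
    seen = {start}
    order = [start]
    queue = deque([(start, 0)])
    while queue:
        url, level = queue.popleft()
        if level >= depth:
            continue  # far enough from the start; do not open this page
        try:
            links = get_links(url)
        except OSError as err:  # urllib's URLError is an OSError
            print(f"could not open {url}: {err}")
            continue
        for link in links:
            if link not in seen:
                seen.add(link)
                order.append(link)
                queue.append((link, level + 1))
    return order


# Dry run on a made-up site: each key "links to" the pages in its list.
fake_site = {"a": ["b", "c"], "b": ["d"], "c": ["a"], "d": []}
found = crawl_bfs("a", depth=2, get_links=lambda u: fake_site.get(u, []))
print(found)
```

Passing the link-fetching function in as `get_links` is what lets you test the crawl logic offline; leave it at the default to crawl real pages.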

# Learning the Requests library

https://docs.python-requests.org/en/latest/user/advanced/

My nerd spouse says that `requests` is the more common way to talk to the internet. Here's how to use it, though without a handy crawler because I need to move on to other things today.
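
If you end up fetching lots of pages this way, a small wrapper might save some typing. This is a sketch under my own assumptions: `get_soup` is a made-up name, and the `http` argument exists only so you can hand in a `requests.Session` (or a stand-in for testing) instead of the module itself.

```python
import requests
from bs4 import BeautifulSoup


def get_soup(url, http=requests, timeout=10):
    """Fetch `url` with requests and return the parsed soup."""
    response = http.get(url, timeout=timeout)
    response.raise_for_status()  # raise an error on 4xx/5xx responses
    return BeautifulSoup(response.text, "html.parser")
```

Something like `get_soup('http://mrlweb.mrl.ucsb.edu/~seshadri/Periodic/periodic.html')` would then stand in for the separate `requests.get` and `BeautifulSoup` steps.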

In [47]:
import requests


In [50]:
r = requests.get('http://mrlweb.mrl.ucsb.edu/~seshadri/Periodic/periodic.html')

In [53]:
soup = BeautifulSoup(r.text, "html.parser")

In [54]:
soup

<html>
<head>
<title>Ram Seshadri Group at UCSB: Periodic table of the elements</title>
<link href="mailto:seshadri@sscu.iisc.ernet.in" rev="made"/>
<meta content="Ram Seshadri Group Periodic Table Effective Ionic Radii Shannon Prewitt Radii" name="keywords"/>
<meta content="Shannon Prewitt Effective Ionic Radii for Ions and Elements" name="description"/>
<meta content="General" name="rating"/>
<meta content="15 days" name="revisit-after"/>
<meta content="Homepage" name="VW96.objecttype"/>
</head>
<link href="periodic.css" rel="STYLESHEET" type="TEXT/CSS"/>
<body>
<h2>Periodic table of the elements</h2>
<h3>Click on the element for tables of the Effective Ionic Radii</h3>
<table align="CENTER" cellpadding="2" cellspacing="2" width="90%">
<tr align="CENTER">
<th class="head" width="5%">1</th> <th class="head" width="5%">2</th>
<th class="head" width="5%">3</th> <th class="head" width="5%">4</th>
<th class="head" width="5%">5</th> <th class="head" width="5%">6</th>
<th class="head" width