# BMI565: Bioinformatics Programming & Scripting

#### (C) Michael Mooney (mooneymi@ohsu.edu)

## Week 4: XML, HTML and Web Scraping

*** Thanks to Ryan Swan for the materials on HTML and web scraping.**

1. XML Overview
    - XML Format
2. The Python ElementTree Class
    - Reading XML
    - Writing XML
3. XML and Bioinformatics
4. HTML
    - Organization of HTML files
5. LXML Package
    - HTML as a tree structure
    - XPath queries
    - Element objects
    - HTML tag attributes
6. Beautiful Soup
    - Soup objects and methods
    - Using tag attributes with BeautifulSoup
7. The Developer's Console
8. A note about APIs and `robots.txt`



#### Requirements

- Python 2.7 or 3.5
- `xml.etree.ElementTree` module
- `lxml` module
- `requests` module
- `BeautifulSoup (beautifulsoup4)` module
- `io` module
- Data Files
    - `./data/book.xml`
    - `./data/SHH.xml`
- Miscellaneous Files
    - `./images/book_tree.jpg`

In [1]:
from __future__ import print_function, division

## XML Overview

<b>XML</b> stands for E<u>x</u>tensible <u>M</u>arkup <u>L</u>anguage, and is a set of rules for encoding documents in a machine-readable format. In bioinformatics, XML is a commonly used format for sharing heterogeneous data, in contrast to delimited files, where every record (row) contains the same data elements.

XML development began in 1996 under the World Wide Web Consortium (W3C), and XML 1.0 became a W3C Recommendation in 1998.

#### XML Design Goals:
1. XML shall be straightforwardly usable over the Internet
2. XML shall support a wide variety of applications
3. XML shall be compatible with Standard Generalized Markup Language (SGML)
4. It shall be easy to write programs that process XML documents
5. The number of optional features in XML is to be kept to the absolute minimum
6. XML documents should be human-legible and reasonably clear
7. The XML design should be prepared quickly
8. The design of XML shall be formal and concise
9. XML documents shall be easy to create
10. Terseness in XML markup is of minimal importance

#### Why can't we use CSV formats?
We usually can, but...

1. CSV files are not always human readable (other documentation is often necessary to identify data elements)
2. Inconsistencies are more likely
3. CSV files don't easily support multiple levels of data
4. CSV files don't easily support additional details such as formatting or metadata (experimental protocols, etc.)


#### UniProt Example: Sonic Hedgehog Protein

[http://www.uniprot.org/uniprot/Q15465.xml](http://www.uniprot.org/uniprot/Q15465.xml)

I've provided this file in the course materials, saved as `SHH.xml`.

### XML Format

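The first lines of an XML document typically give the XML version and encoding and may include comments. They are followed by the opening tag of the root element, which can declare namespaces and the location of the schema that describes the document structure:

#### XML Declaration
    <?xml version='1.0' encoding='UTF-8'?>
    
#### Root Element (with Namespace and Schema Declarations)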
    <uniprot xmlns="http://uniprot.org/uniprot" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://uniprot.org/uniprot http://www.uniprot.org/support/docs/uniprot.xsd">

#### XML Document Body

The body of an XML document contains labeled data elements. Data elements can be nested to show relationships. Data labels are called "tags", which can also contain attributes (values are always strings) that provide additional information about the data.
    
    <parent_tag>
        <child_tag attribute1="value1" attribute2="value2">data</child_tag>
    </parent_tag>

It is subjective whether to provide additional information as attributes or additional data elements:

    <contact birthdate="1-1-1980">
        <name>John Smith</name>
    </contact>
    
    <contact>
        <name>John Smith</name>
        <birthdate>1-1-1980</birthdate>
    </contact>
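
Either layout is easy to read with ElementTree, which is introduced later in this lecture. The following is a minimal sketch, assuming the two `contact` snippets above are held in Python strings (the variable names are illustrative only). An attribute is read with `Element.get()`, while a child element is located with `Element.find()`.

```python
import xml.etree.ElementTree as et

# The same contact stored two ways (copied from the snippets above)
as_attribute = '<contact birthdate="1-1-1980"><name>John Smith</name></contact>'
as_element = ('<contact><name>John Smith</name>'
              '<birthdate>1-1-1980</birthdate></contact>')

contact1 = et.fromstring(as_attribute)
contact2 = et.fromstring(as_element)

# Attribute values are stored on the element itself
birthdate1 = contact1.get("birthdate")
# Child element values are stored in the child's text
birthdate2 = contact2.find("birthdate").text

print(birthdate1, birthdate2)
```

In both cases the value comes back as a string, so any conversion (to a date, for example) is left to the program reading the file.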

#### DTD and XML Schema

- Document Type Definitions (DTD) and XML Schemas are two ways of describing the structure and content of an XML document
- XML Schemas (a.k.a. XML Schema Definitions or XSDs) were designed to improve upon the shortcomings of DTDs
    - data type support
    - namespace aware
- Example: the UniProt XSD - [http://www.uniprot.org/support/docs/uniprot.xsd](http://www.uniprot.org/support/docs/uniprot.xsd)

## ElementTree
### Reading XML

There are two strategies for reading an XML document:

1. Document Object Model
    - Read the entire file, analyze relationships between elements, and build a tree structure which can be navigated/searched
    - Uses the innate organization of the data
    - Examples: `xml.dom.minidom`, `xml.etree.ElementTree`, and `lxml` Python modules
2. Event Driven Parsers (SAX or Simple API for XML)
    - Read the XML file and report events, such as the start and end of an element
    - Uses less memory, no tree construction
    - Examples: `xml.sax` and `xml.etree.ElementTree` (via `iterparse()`) Python modules
    
We will be covering both the `xml.etree.ElementTree` and `lxml` modules in this lecture.
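
The event-driven strategy is not demonstrated with `xml.sax` elsewhere in this lecture, so here is a small sketch of what it looks like. The `TagCounter` class and the sample XML string are made up for illustration. The parser calls `startElement()` each time it meets an opening tag, and no tree is ever built.

```python
import xml.sax

class TagCounter(xml.sax.ContentHandler):
    """Count how often each tag starts, without building a tree."""
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.counts = {}

    def startElement(self, name, attrs):
        # Called once for every opening tag the parser encounters
        self.counts[name] = self.counts.get(name, 0) + 1

# Made-up sample document (bytes, as accepted by parseString)
xml_text = b"""<book>
    <title>Ender's Game</title>
    <chapter>Third</chapter>
    <chapter>Peter</chapter>
</book>"""

handler = TagCounter()
xml.sax.parseString(xml_text, handler)
print(handler.counts)
```

Because the handler only keeps a dictionary of counts, memory use stays small no matter how large the input is. The trade-off is that relationships between elements must be tracked by hand.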

#### A Simple Example

    <book>
        <title>Nineteen Eighty‐Four</title>
        <author>George Orwell</author>
        <character>Winston Smith</character>
        <character>Julia</character>
    </book>

    import xml.etree.ElementTree as et
    tree = et.parse("1984.xml")

In the example above, `tree` is an ElementTree object representing the entire XML file. Start by accessing the root of the tree with `getroot()`. Element objects are iterable: looping over an element yields its direct children. Each element object has three main attributes: the tag name `tag`, the text inside the tag `text`, and a dictionary of the tag's attributes `attrib`.

    root_element = tree.getroot()
    for element in root_element:
        print(element.tag)
        print(element.text)
        print(element.attrib)

<img src="./images/book_tree.jpg" align="left" width="700" />

#### Another Example: `book.xml`

    <book>
        <title>Ender's Game</title>
        <author>Orson Scott Card</author>
        <chapter>Third</chapter>
        <chapter>Peter</chapter>
        <chapter>Graff</chapter>
        <publication_info>
            <publisher location="New York">Tor Books</publisher>
            <publication_date>1985</publication_date>
        </publication_info>
    </book>

In [2]:
import xml.etree.ElementTree as et
tree = et.parse('./data/book.xml')
root_element = tree.getroot()
root_element

<Element 'book' at 0x104278bf0>

In [3]:
list(root_element)

[<Element 'title' at 0x104281dd0>,
 <Element 'author' at 0x104291890>,
 <Element 'chapter' at 0x1042918f0>,
 <Element 'chapter' at 0x104291950>,
 <Element 'chapter' at 0x1042919b0>,
 <Element 'publication_info' at 0x104291a10>]

In [4]:
len(root_element)

6

In [5]:
for element in root_element:
    print(element.tag + ":", element.text.strip())

title: Ender's Game
author: Orson Scott Card
chapter: Third
chapter: Peter
chapter: Graff
publication_info: 


In [6]:
root_element[5]

<Element 'publication_info' at 0x104291a10>

In [7]:
len(root_element[5])

2

In [8]:
list(root_element[5])

[<Element 'publisher' at 0x104291ad0>,
 <Element 'publication_date' at 0x104291b30>]

In [9]:
## Each element is iterable, which allows access
## to child elements. Here we check the length of
## each element to get the number of children
for element in root_element:
    if len(element) > 0:
        print(element.tag + ":", element.text.strip(), ", ", element.attrib)
        for child in element:
            print("\t" + child.tag + ":", child.text.strip(), ", ", child.attrib)
    else:
        print(element.tag + ":", element.text.strip(), ", ", element.attrib)

title: Ender's Game ,  {}
author: Orson Scott Card ,  {}
chapter: Third ,  {}
chapter: Peter ,  {}
chapter: Graff ,  {}
publication_info:  ,  {}
	publisher: Tor Books ,  {'location': 'New York'}
	publication_date: 1985 ,  {}


#### ElementTree Element Methods

<table align="left">
<tr><td style="text-align:center"><b>Method</b></td><td><b>Description</b></td></tr>
<tr><td style="text-align:center">`Element.iter(tag=None)`</td><td>Creates a tree iterator with the current element as root.<br />If `tag` is specified, only those elements with a tag equal to `tag` are returned by the iterator.</td></tr>
<tr><td style="text-align:center">`Element.find(tag)`</td><td>Returns the first subelement with a tag equal to `tag` or `None` if no match.</td></tr>
<tr><td style="text-align:center">`Element.findall(tag)`</td><td>Returns a list of all matching subelements.</td></tr>
</table>

In [10]:
author = root_element.find("author")
author.text

'Orson Scott Card'

In [11]:
chapters = root_element.findall("chapter")
[c.text for c in chapters]

['Third', 'Peter', 'Graff']
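
The `iter()` method from the table above is not shown in the cells, so here is a short sketch of how it differs from `findall()`. It uses an abbreviated copy of `book.xml` held in a string, so it does not depend on the data file. With a bare tag name, `find()` and `findall()` look only at direct children, while `iter()` walks the whole subtree.

```python
import xml.etree.ElementTree as et

# Abbreviated copy of book.xml, inlined for illustration
book = et.fromstring("""
<book>
    <title>Ender's Game</title>
    <publication_info>
        <publisher location="New York">Tor Books</publisher>
        <publication_date>1985</publication_date>
    </publication_info>
</book>
""")

# findall() with a bare tag name searches direct children only
direct = book.findall("publisher")

# iter() visits every descendant, so nested elements are found too
nested = [e.text for e in book.iter("publisher")]

# With no tag, iter() yields all elements in document order,
# starting with the element it was called on
all_tags = [e.tag for e in book.iter()]

print(direct)
print(nested)
print(all_tags)
```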

If the XML file is very large, you may want to use an iterator, rather than creating a tree of the entire file all at once. The `iterparse()` function implements an event-driven parser. It returns an iterator of (event, element) tuples, where event indicates the part of an element encountered (e.g. the start tag or end tag). By default, only end events are returned. Since `iterparse()` still builds a tree in memory as it goes, you can call `Element.clear()` on each element once you have processed it to free memory.

In [12]:
iter_et = et.iterparse('./data/book.xml')
for event, element in iter_et:
    print(event)
    print(element.tag + ":", element.text.strip())

end
title: Ender's Game
end
author: Orson Scott Card
end
chapter: Third
end
chapter: Peter
end
chapter: Graff
end
publisher: Tor Books
end
publication_date: 1985
end
publication_info: 
end
book: 


In [13]:
## Use clear() to clear each element after processing
## including the root element
iter_et = et.iterparse('./data/book.xml', events=['start', 'end'])
event, root = next(iter_et)
for event, element in iter_et:
    if event == "end" and element.tag != root.tag:
        print(element.tag + ":", element.text.strip())
        element.clear()

root.clear()

title: Ender's Game
author: Orson Scott Card
chapter: Third
chapter: Peter
chapter: Graff
publisher: Tor Books
publication_date: 1985
publication_info: 


#### XML Namespaces

XML namespaces are used to create uniquely named elements and attributes in an XML document. Since a single document may contain element names from multiple vocabularies, ambiguity can arise when the same element name is used for different entity definitions. The namespace is prepended to tag names to create unique names. In the UniProt example shown above, the attribute `xmlns="http://uniprot.org/uniprot"` in the root element's opening tag specifies the UniProt namespace.

A document's namespace can be extracted from the root element:

In [14]:
## Get the XML document's namespace
import re
shh_tree = et.parse('./data/SHH.xml')
shh_root = shh_tree.getroot()
namespace = re.match(r"{.*}", shh_root.tag).group()
namespace

'{http://uniprot.org/uniprot}'

In [15]:
## Prepend the namespace to any element name
## you want to find
entry = shh_root.find(namespace+'entry')
entry.find(namespace+'name').text

'SHH_HUMAN'

In [16]:
ns = {'uniprot':'http://uniprot.org/uniprot'}
entry = shh_root.find('uniprot:entry', ns)
entry.find("uniprot:name", ns).text

'SHH_HUMAN'

In [17]:
entry = shh_root.find('{http://uniprot.org/uniprot}entry')
entry

<Element '{http://uniprot.org/uniprot}entry' at 0x10431a770>

In [18]:
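## Without the namespace, find() matches nothing and
## returns None, so this cell produces no output
entry2 = shh_root.find('entry')
entry2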
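
To make the ambiguity problem concrete, here is a sketch with a made-up document that mixes two vocabularies, both of which define a `name` element. The URIs under `example.org` and the prefixes are hypothetical. ElementTree expands each prefix to the `{URI}tag` form, and a prefix-to-URI dictionary keeps the queries readable.

```python
import xml.etree.ElementTree as et

# Hypothetical document: two vocabularies that both define <name>
doc = et.fromstring("""
<record xmlns:gene="http://example.org/gene"
        xmlns:person="http://example.org/person">
    <gene:name>SHH</gene:name>
    <person:name>John Smith</person:name>
</record>
""")

# Prefixes are replaced by the full namespace URI in each tag
tags = [child.tag for child in doc]

# The prefixes used for searching are our own choice; only the URIs
# have to match the ones declared in the document
ns = {"g": "http://example.org/gene", "p": "http://example.org/person"}
gene_name = doc.find("g:name", ns).text
person_name = doc.find("p:name", ns).text

print(tags)
print(gene_name, person_name)
```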

### Writing XML

#### Methods for Writing XML
<table align="left">
<tr><td style="text-align:center"><b>Method</b></td><td><b>Description</b></td></tr>
<tr><td style="text-align:center">`et.Element(tag)`</td><td>Creates an element with the specified tag. Returns an element object.</td></tr>
<tr><td style="text-align:center">`et.SubElement(element, tag)`</td><td>Creates a child element under the specified element.</td></tr>
<tr><td style="text-align:center">`Element.set(key, value)`</td><td>Sets the attribute `key` of an element to `value`.</td></tr>
<tr><td style="text-align:center">`et.ElementTree(root)`</td><td>Returns an ElementTree object.</td></tr>
<tr><td style="text-align:center">`ElementTree.write(file)`</td><td>Writes an ElementTree object to a file.</td></tr>
</table>

In [19]:
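## Create a simple XML file
root = et.Element("book")
title = et.SubElement(root, "title")
title.text = "Nineteen Eighty-Four"
author = et.SubElement(root, "author")
author.text = "George Orwell"
## Nested elements and the attribute seen in the output below
pub_info = et.SubElement(root, "publication_info")
publisher = et.SubElement(pub_info, "publisher")
publisher.text = "Secker and Warburg"
publisher.set("location", "London")

## Wrap the root element in an ElementTree before writing
tree = et.ElementTree(root)
tree.write("1984.xml")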

In [20]:
with open('1984.xml') as fh:
    data = fh.read()
data

'<book><title>Nineteen Eighty-Four</title><author>George Orwell</author><publication_info><publisher location="London">Secker and Warburg</publisher></publication_info></book>'
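
When a file on disk is not needed, the same elements can be serialized to a string. The sketch below uses `et.tostring()` and then parses the result back to check that the data survived the round trip. It builds a shortened version of the book above, and the variable names are illustrative.

```python
import xml.etree.ElementTree as et

root = et.Element("book")
title = et.SubElement(root, "title")
title.text = "Nineteen Eighty-Four"
publisher = et.SubElement(root, "publisher")
publisher.text = "Secker and Warburg"
publisher.set("location", "London")

# tostring() returns bytes by default; decode to get a text string
xml_text = et.tostring(root).decode("ascii")

# Parse the string back and read the attribute that was set above
parsed = et.fromstring(xml_text)
location = parsed.find("publisher").get("location")

print(xml_text)
print(location)
```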

#### Drawbacks to XML?

- More difficult to parse than CSV
- Verbose syntax means larger files
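
The size difference is easy to see on a small scale. The sketch below, written for Python 3, serializes three rows from the amino acid table used later in this lecture as both CSV and XML and compares the lengths. The field and tag names are invented for the example.

```python
import csv
import io
import xml.etree.ElementTree as et

# Three rows from the amino acid table; field names are made up
records = [("A", "Ala", "Alanine"),
           ("C", "Cys", "Cysteine"),
           ("D", "Asp", "Aspartic Acid")]
fields = ("code", "abbreviation", "name")

# CSV: one header line, then one line per record
csv_buffer = io.StringIO()
writer = csv.writer(csv_buffer)
writer.writerow(fields)
writer.writerows(records)
csv_text = csv_buffer.getvalue()

# XML: every value is wrapped in an opening and a closing tag
root = et.Element("amino_acids")
for record in records:
    entry = et.SubElement(root, "amino_acid")
    for field, value in zip(fields, record):
        et.SubElement(entry, field).text = value
xml_text = et.tostring(root).decode("ascii")

print(len(csv_text), len(xml_text))
```

The XML version repeats every field name twice per record, which is where the extra size comes from. In exchange, each value is labeled where it appears.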

## XML and Bioinformatics
#### SBML (Systems Biology Markup Language)
- Used to communicate models of biological processes (cell-signaling pathways, regulatory networks). Models can represent:
    - Chemical Equations
    - Cellular Components: nucleus, cytoplasm, etc.
    - Species: genomes, proteomes, etc.
- Supported by many applications: [http://sbml.org/SBML_Software_Guide](http://sbml.org/SBML_Software_Guide)
- [http://www.ebi.ac.uk/biomodels-main/](http://www.ebi.ac.uk/biomodels-main/)

#### KGML (KEGG Markup Language)
- A format for KEGG pathway maps
    - [http://www.kegg.jp/kegg/xml/](http://www.kegg.jp/kegg/xml/)
    
#### PDBML (Protein Databank Markup Language)
- Describes 3D protein structure
    - relative atomic coordinates
    - secondary structure assignment
    - atomic connectivity
- [http://www.rcsb.org/pdb/home/home.do](http://www.rcsb.org/pdb/home/home.do)
- [http://pdbml.pdb.org/](http://pdbml.pdb.org/)

## HTML

Hypertext Markup Language (HTML) is the basis for most pages served on the internet. HTML is very similar to XML (Extensible Markup Language), with the caveat that it also contains presentation semantics, which are attributes that specify how information is meant to be displayed or arranged on a screen. The nested format is much like that of an XML document, so we can extract information from an HTML page in much the same way we would from an XML document, provided the parser tolerates HTML that is not strictly well-formed. Below is a simple example of an HTML document:

    <html>
    <head>
        <title>Hey look, a webpage!</title>
    </head>
    <body>
        <p>webpage goes here</p>
    </body>
    </html>
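
As a quick check of how close HTML is to XML, the sketch below feeds the example page above to the ElementTree XML parser from earlier in the lecture. That page happens to be well-formed, so it parses. The second snippet is made up and contains an unclosed `<br>` tag, which is legal HTML but is rejected by a strict XML parser.

```python
import xml.etree.ElementTree as et

# The example page from above, which is also well-formed XML
page = """<html>
<head>
    <title>Hey look, a webpage!</title>
</head>
<body>
    <p>webpage goes here</p>
</body>
</html>"""
root = et.fromstring(page)
title = root.find("head/title").text
paragraph = root.find("body/p").text

# Made-up snippet with an unclosed <br>: fine for a browser,
# but not well-formed XML
try:
    et.fromstring("<p>line one<br>line two</p>")
    strict_parse_ok = True
except et.ParseError:
    strict_parse_ok = False

print(title)
print(paragraph)
print(strict_parse_ok)
```

This is one reason to use a dedicated HTML parser for real web pages rather than a plain XML parser.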


## LXML package

The LXML package for Python contains methods to read HTML pages as a tree structure. It uses a querying syntax called XML Path Language (XPath) to navigate the tree structure and return relevant information from the document.

Before we get started, it helps to have an idea of some of the ways that HTML arranges documents. Most scrapable HTML data is contained in tables like the one at http://www.bioinformatics.org/sms/iupac.html. HTML tables are arranged in the following format:

    <table>
        <tr>
            <td></td>
            <td></td>
            <td></td>
            ...
        </tr>
        <tr>
            ...
        </tr>
    </table>

This general format specifies table rows (`<tr>`) and table data cells (`<td>`), where each cell within a row belongs to a different column. The data in the table is contained inside each of the nested `<td></td>` tag pairs.

XPath querying allows us to find specific kinds of elements and their contents. Let's use the tables on the following webpage as an example: [http://www.bioinformatics.org/sms/iupac.html](http://www.bioinformatics.org/sms/iupac.html)

In [21]:
from lxml import etree
import requests
from io import StringIO # This will help us deal with string inputs

## Get the code from the url
html = requests.get("http://www.bioinformatics.org/sms/iupac.html").text

## Next we have to create a parser that will read the info from the HTML 
## file and tell it what kind of data it will be receiving
parser = etree.HTMLParser()
tree = etree.parse(StringIO(html),parser)

We now have the webpage represented as a tree of data. The elements of this tree are iterable, just like we saw above when working with XML documents, so there are many ways to explore it.

For example, we can iterate through the tree with a for loop:

In [22]:
## Note: here we are only showing two levels of the tree
root = tree.getroot()

for e in root:
    print(e)
    for i in e:
        print('\t' + str(i))

<Element head at 0x10511c690>
	<Element meta at 0x10511c7d0>
	<Element meta at 0x10511c820>
	<Element meta at 0x10511c870>
	<Element title at 0x10511c7d0>
<Element body at 0x10511c6e0>
	<Element table at 0x10511c9b0>
	<Element br at 0x10511ca00>
	<Element table at 0x10511c7d0>


In [23]:
## The following function will print the entire tree structure
## This function looks in each element node, and if it has 
## contents it performs the same action on the descendent node
## Note that this is an example of recursion - a function 
## that calls itself.

def parseTree(e,t='\t'):
    for i in e:
        print(str(t) + str(i))
        parseTree(i,t=t + '\t')

parseTree(tree.getroot())

	<Element head at 0x10511df00>
		<Element meta at 0x10511e050>
		<Element meta at 0x10511e0a0>
		<Element meta at 0x10511e0f0>
		<Element title at 0x10511e050>
	<Element body at 0x10511c6e0>
		<Element table at 0x10511e050>
			<Element tr at 0x10511e230>
				<Element td at 0x10511e370>
					<Element font at 0x10511e4b0>
				<Element td at 0x10511e3c0>
					<Element font at 0x10511e410>
			<Element tr at 0x10511e280>
				<Element td at 0x10511e410>
				<Element td at 0x10511e370>
			<Element tr at 0x10511e3c0>
				<Element td at 0x10511e410>
				<Element td at 0x10511e5a0>
			<Element tr at 0x10511e370>
				<Element td at 0x10511e410>
				<Element td at 0x10511e690>
			<Element tr at 0x10511e5a0>
				<Element td at 0x10511e410>
				<Element td at 0x10511e780>
			<Element tr at 0x10511e690>
				<Element td at 0x10511e410>
				<Element td at 0x10511e870>
			<Element tr at 0x10511e780>
				<Element td at 0x10511e410>
				<Element td at 0x10511e960>
			<Element tr at 0x10511e870>
				<Eleme

The tree object has a method called `xpath()`, which allows us to perform queries on the tree structure to identify specific elements within the HTML document. For example, if we want to find all tables within the body of the document we would do the following:

In [24]:
## This will return a list of table elements
tables = tree.xpath('body/table')
tables

[<Element table at 0x104315410>, <Element table at 0x10511c7d0>]

We can also use tag attributes to perform more specific queries. For instance, we know that the table containing amino acid codes has three columns. To extract this table we could do something like:

In [25]:
## This will find all tables with three columns
## Note: the // means it will look anywhere under the current element (root in this case) 
## (i.e. the table could be nested within another element)
amino = tree.xpath("//table[@cols='3']")
amino

[<Element table at 0x10511c7d0>]

In [26]:
## We can iterate through this table to get the data
for row in amino[0]:
    for cell in row:
        print(cell.text)

None
None
None
A
Ala
Alanine
C
Cys
Cysteine
D
Asp
Aspartic Acid
E
Glu
Glutamic Acid
F
Phe
Phenylalanine
G
Gly
Glycine
H
His
Histidine
I
Ile
Isoleucine
K
Lys
Lysine
L
Leu
Leucine
M
Met
Methionine
N
Asn
Asparagine
P
Pro
Proline
Q
Gln
Glutamine
R
Arg
Arginine
S
Ser
Serine
T
Thr
Threonine
V
Val
Valine
W
Trp
Tryptophan
Y
Tyr
Tyrosine


Note that the column headers are missing above. This is because that text is not directly within the table cells; it is nested within a `<font>` tag, which allows additional formatting of the text. The code below solves this problem. The XPath `text()` function extracts text, and using `//` means that it will find text anywhere under the `<td>` tag.

In [27]:
for i in tree.xpath("//table[@cols='3']/tr/td//text()"):
    print(i)

IUPAC amino acid code
Three letter code
Amino acid
A
Ala
Alanine
C
Cys
Cysteine
D
Asp
Aspartic Acid
E
Glu
Glutamic Acid
F
Phe
Phenylalanine
G
Gly
Glycine
H
His
Histidine
I
Ile
Isoleucine
K
Lys
Lysine
L
Leu
Leucine
M
Met
Methionine
N
Asn
Asparagine
P
Pro
Proline
Q
Gln
Glutamine
R
Arg
Arginine
S
Ser
Serine
T
Thr
Threonine
V
Val
Valine
W
Trp
Tryptophan
Y
Tyr
Tyrosine


We can now start using for loops to write more interesting queries, and convert the entire table to a data structure we can more easily use.

One thing to keep in mind is that calling `xpath()` on a particular element does not by itself restrict the query to that element, because the element still belongs to the full tree of the HTML document. A query that starts with `/` is evaluated from the root of the whole document, while a query that starts with `.` is evaluated relative to the current element. Here we use `.` to define a path relative to the table element stored in `amino[0]`.

In [28]:
## Remember here we are only interested in the amino acid table
## Use the . to ensure you are searching for rows within that table only
table_list = []
for tr in amino[0].xpath('./tr'):
    table_list.append(tr.xpath('./td//text()'))
table_list

[['IUPAC amino acid code', 'Three letter code', 'Amino acid'],
 ['A', 'Ala', 'Alanine'],
 ['C', 'Cys', 'Cysteine'],
 ['D', 'Asp', 'Aspartic Acid'],
 ['E', 'Glu', 'Glutamic Acid'],
 ['F', 'Phe', 'Phenylalanine'],
 ['G', 'Gly', 'Glycine'],
 ['H', 'His', 'Histidine'],
 ['I', 'Ile', 'Isoleucine'],
 ['K', 'Lys', 'Lysine'],
 ['L', 'Leu', 'Leucine'],
 ['M', 'Met', 'Methionine'],
 ['N', 'Asn', 'Asparagine'],
 ['P', 'Pro', 'Proline'],
 ['Q', 'Gln', 'Glutamine'],
 ['R', 'Arg', 'Arginine'],
 ['S', 'Ser', 'Serine'],
 ['T', 'Thr', 'Threonine'],
 ['V', 'Val', 'Valine'],
 ['W', 'Trp', 'Tryptophan'],
 ['Y', 'Tyr', 'Tyrosine']]
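
The difference between absolute and relative queries is easiest to see side by side. The sketch below uses a small made-up page with two tables, standing in for the real one so that no network access is needed. Both queries are called on the same table element, but only the one starting with `.` stays inside it.

```python
from lxml import etree
from io import StringIO

# Made-up page with two tables (one row, then two rows)
page = u"""<html><body>
<table cols="2"><tr><td>A</td><td>Adenine</td></tr></table>
<table cols="3">
  <tr><td>A</td><td>Ala</td><td>Alanine</td></tr>
  <tr><td>C</td><td>Cys</td><td>Cysteine</td></tr>
</table>
</body></html>"""
doc = etree.parse(StringIO(page), etree.HTMLParser())
three_col = doc.xpath("//table[@cols='3']")[0]

# Evaluated from the document root, even though it is
# called on a single table element
rows_absolute = three_col.xpath("//tr")

# Evaluated relative to the table element itself
rows_relative = three_col.xpath(".//tr")

print(len(rows_absolute), len(rows_relative))
```

The absolute query also returns the row from the two-column table, which is usually not what is wanted when processing one table at a time.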

## Beautiful Soup 

While that was certainly a fun demonstration of how HTML is organized and can be digested for further analysis, manual XPath evaluations can be a tedious process. Beautiful Soup is a package meant to make the process of getting information from web documents much simpler.

In Beautiful Soup, we first import the package in order to create a "soup" object. Here we use the `html` string that we downloaded earlier.

In [29]:
from bs4 import BeautifulSoup as bs
soup = bs(html, "lxml")

From here we can perform all sorts of different manipulations on the data, and Beautiful Soup takes care of many of the details behind the scenes. Let's take a look at a couple of quick examples:

In [30]:
## Find all tables in the document
tables = soup.find_all("table")
tables

[<table border="" cellpadding="2" cellspacing="0" cols="2" width="350">
 <tr>
 <td bgcolor="#B0C4DE"><font color="#000000">IUPAC nucleotide code</font></td>
 <td bgcolor="#B0C4DE"><font color="#000000">Base</font></td>
 </tr>
 <tr>
 <td>A</td>
 <td>Adenine</td>
 </tr>
 <tr>
 <td>C</td>
 <td>Cytosine</td>
 </tr>
 <tr>
 <td>G</td>
 <td>Guanine</td>
 </tr>
 <tr>
 <td>T (or U)</td>
 <td>Thymine (or Uracil)</td>
 </tr>
 <tr>
 <td>R</td>
 <td>A or G</td>
 </tr>
 <tr>
 <td>Y</td>
 <td>C or T</td>
 </tr>
 <tr>
 <td>S</td>
 <td>G or C</td>
 </tr>
 <tr>
 <td>W</td>
 <td>A or T</td>
 </tr>
 <tr>
 <td>K</td>
 <td>G or T</td>
 </tr>
 <tr>
 <td>M</td>
 <td>A or C</td>
 </tr>
 <tr>
 <td>B</td>
 <td>C or G or T</td>
 </tr>
 <tr>
 <td>D</td>
 <td>A or G or T</td>
 </tr>
 <tr>
 <td>H</td>
 <td>A or C or T</td>
 </tr>
 <tr>
 <td>V</td>
 <td>A or C or G</td>
 </tr>
 <tr>
 <td>N</td>
 <td>any base</td>
 </tr>
 <tr>
 <td>. or -</td>
 <td>gap</td>
 </tr>
 </table>,
 <table border="" cellpadding="2" cellspaci

In [31]:
## Find the first table that matches some criteria
table = soup.find("table",{"width":"350","cols":"3"})
table

<table border="" cellpadding="2" cellspacing="0" cols="3" width="350">
<tr>
<td bgcolor="#B0C4DE"><font color="#000000">IUPAC amino acid code</font></td>
<td bgcolor="#B0C4DE"><font color="#000000">Three letter code</font></td>
<td bgcolor="#B0C4DE"><font color="#000000">Amino acid</font></td>
</tr>
<tr>
<td>A</td>
<td>Ala</td>
<td>Alanine</td>
</tr>
<tr>
<td>C</td>
<td>Cys</td>
<td>Cysteine</td>
</tr>
<tr>
<td>D</td>
<td>Asp</td>
<td>Aspartic Acid</td>
</tr>
<tr>
<td>E</td>
<td>Glu</td>
<td>Glutamic Acid</td>
</tr>
<tr>
<td>F</td>
<td>Phe</td>
<td>Phenylalanine</td>
</tr>
<tr>
<td>G</td>
<td>Gly</td>
<td>Glycine</td>
</tr>
<tr>
<td>H</td>
<td>His</td>
<td>Histidine</td>
</tr>
<tr>
<td>I</td>
<td>Ile</td>
<td>Isoleucine</td>
</tr>
<tr>
<td>K</td>
<td>Lys</td>
<td>Lysine</td>
</tr>
<tr>
<td>L</td>
<td>Leu</td>
<td>Leucine</td>
</tr>
<tr>
<td>M</td>
<td>Met</td>
<td>Methionine</td>
</tr>
<tr>
<td>N</td>
<td>Asn</td>
<td>Asparagine</td>
</tr>
<tr>
<td>P</td>
<td>Pro</td>
<td>Proline</td>


In [32]:
## Iterate through the table and create a list of lists
table_list2 = []
for row in table.find_all("tr"):
    cells = row.find_all("td")
    newCells = []
    for c in cells:
        newCells.append(c.get_text())
    table_list2.append(newCells)
table_list2

[['IUPAC amino acid code', 'Three letter code', 'Amino acid'],
 ['A', 'Ala', 'Alanine'],
 ['C', 'Cys', 'Cysteine'],
 ['D', 'Asp', 'Aspartic Acid'],
 ['E', 'Glu', 'Glutamic Acid'],
 ['F', 'Phe', 'Phenylalanine'],
 ['G', 'Gly', 'Glycine'],
 ['H', 'His', 'Histidine'],
 ['I', 'Ile', 'Isoleucine'],
 ['K', 'Lys', 'Lysine'],
 ['L', 'Leu', 'Leucine'],
 ['M', 'Met', 'Methionine'],
 ['N', 'Asn', 'Asparagine'],
 ['P', 'Pro', 'Proline'],
 ['Q', 'Gln', 'Glutamine'],
 ['R', 'Arg', 'Arginine'],
 ['S', 'Ser', 'Serine'],
 ['T', 'Thr', 'Threonine'],
 ['V', 'Val', 'Valine'],
 ['W', 'Trp', 'Tryptophan'],
 ['Y', 'Tyr', 'Tyrosine']]
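
Tag attributes can be read as well as searched on. The sketch below uses a made-up snippet modeled on the header row of the table above, parsed with Python's built-in `"html.parser"` so that it runs without `lxml`. The names `mini_soup` and `mini_table` are illustrative.

```python
from bs4 import BeautifulSoup

# Made-up snippet modeled on the amino acid table
snippet = """<table cols="3" width="350">
<tr><td bgcolor="#B0C4DE"><font color="#000000">Amino acid</font></td></tr>
<tr><td>Alanine</td></tr>
</table>"""
mini_soup = BeautifulSoup(snippet, "html.parser")
mini_table = mini_soup.find("table")

# Dictionary-style access to a single attribute
cols = mini_table["cols"]
# All attributes of the tag as a dictionary
table_attrs = mini_table.attrs
# get() returns None when the attribute is absent
missing = mini_table.get("border")

# True matches any tag that has the attribute, whatever its value;
# here that selects only the header cell
header_cells = mini_soup.find_all("td", attrs={"bgcolor": True})
header_text = [c.get_text() for c in header_cells]

print(cols, table_attrs, missing)
print(header_text)
```

Filtering on an attribute such as `bgcolor` is one way to separate header cells from data cells when a page does not use `<th>` tags.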

## The Developer's Console

Both Chrome and Firefox are equipped with a developer's console, meant for debugging code while writing websites. This console can also be used to see what elements your computer is interfacing with while you surf the web. 

To open the developer's console in Firefox, press Ctrl+Shift+K on Windows or Cmd+Opt+K on macOS. The Network tab lets you see what information is being sent and when, while the Inspector tab lets you hover over code and see which element of the page it represents.

Chrome's developer console can be accessed with Ctrl+Shift+J on Windows or Cmd+Opt+J on macOS. While the tabs are named slightly differently, the functions are essentially the same. Notably, Chrome provides native support for web scraping, though the data it gives are usually oriented more toward the organization of entire sites and less toward acquiring data from an individual page.

If you plan on getting data from the web, this is an invaluable tool that will save you a lot of time finding out where data is stored.

## A Word On APIs And robots.txt

Before scraping a site, it is worth taking a couple of things into account to make sure that you are a good citizen of the web. The `robots.txt` file located in the root directory of most websites tells automated clients which paths they may and may not request, using `Allow:` and `Disallow:` directives. If you are scraping a large amount of data, it is good practice to stay within the paths that `robots.txt` permits.
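
Python 3 ships a parser for these rules in `urllib.robotparser`. The sketch below checks two URLs against made-up `robots.txt` content supplied as a list of lines, so it needs no network access. The `example.org` URLs and the rules themselves are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; for a real site the file could be
# loaded with set_url(...) followed by read()
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /",
]

robots = RobotFileParser()
robots.parse(robots_lines)

# can_fetch(useragent, url) reports whether the rules permit a request
public_ok = robots.can_fetch("*", "http://example.org/data/table.html")
private_ok = robots.can_fetch("*", "http://example.org/private/notes.html")

print(public_ok, private_ok)
```

Running a check like this before each request is a simple way to keep a scraper out of paths the site owner has asked automated clients to avoid.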

Many sites also provide an Application Programming Interface (API) that allows you to acquire information directly without scraping web data from the HTML interface, saving both you and the site manager time and money. If an API is available, it is almost always advisable to make use of it.

## In-Class Exercises

In [None]:
## Exercise 1.
## Extract the title and author list for the 
## first reference in SHH.xml
## Get the XML document's namespace
import xml.etree.ElementTree as et
import re



In [None]:
## Exercise 2.
## Using either lxml or BeautifulSoup, scrape the values from the first 
## table at the URL below, which contains nucleotides and their corresponding name
## Create a dictionary from these values where the nucleotide code is the key.
## "http://www.bioinformatics.org/sms/iupac.html"



## References

- <u>Python Essential Reference</u>, David Beazley, 4th Edition, Addison‐Wesley (2008)
- <u>Python for Bioinformatics</u>, Sebastian Bassi, CRC Press (2010)
- [http://en.wikipedia.org/wiki/XML](http://en.wikipedia.org/wiki/XML)
- [http://docs.python.org/](http://docs.python.org/)
- [https://docs.python.org/2/library/xml.etree.elementtree.html](https://docs.python.org/2/library/xml.etree.elementtree.html)
- [LXML HTML Xpath Tutorial](http://lxml.de/parsing.html)
- [BeautifulSoup Documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [XPath Syntax Guide](https://www.w3schools.com/xml/xpath_syntax.asp)

#### Last Updated: 22-Sep-2021