# BMI565: Bioinformatics Programming & Scripting

#### (C) Michael Mooney (mooneymi@ohsu.edu)

## Week 4: XML, HTML and Web Scraping

**Thanks to Ryan Swan for the materials on HTML and web scraping.**



# XML Overview

><b>XML</b> stands for E<u>x</u>tensible <u>M</u>arkup <u>L</u>anguage, and is a set of rules for encoding documents in a machine-readable format. In bioinformatics, XML is a commonly used format for sharing heterogeneous data (as opposed to delimited files, where every record (row) contains the same data elements).

XML development began in 1996 under the World Wide Web Consortium (W3C), and XML 1.0 became a W3C Recommendation in 1998.

### XML Design Goals:
1. XML shall be straightforwardly usable over the Internet
2. XML shall support a wide variety of applications
3. XML shall be compatible with Standard Generalized Markup Language (SGML)
4. It shall be easy to write programs that process XML documents
5. The number of optional features in XML is to be kept to the absolute minimum
6. XML documents should be human-legible and reasonably clear
7. The XML design should be prepared quickly
8. The design of XML shall be formal and concise
9. XML documents shall be easy to create
10. Terseness in XML markup is of minimal importance

### Why can't we use CSV formats?

We usually can, but...

1. CSV files are not always human readable (other documentation is often necessary to identify data elements)
2. Inconsistencies are more likely 
3. CSV files don't easily support multiple levels of data
4. CSV files don't easily support additional details such as formatting or metadata (experimental protocols, etc.)


### XML Format

The first couple lines of an XML document contain information about the XML version used, the document structure and comments:

#### Version

```xml
<?xml version='1.0' encoding='UTF-8'?>
```
    
#### Root Element (Namespace and Schema Declarations)
```xml
<uniprot xmlns="http://uniprot.org/uniprot" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://uniprot.org/uniprot http://www.uniprot.org/support/docs/uniprot.xsd">
```

#### XML Document Body

The body of an XML document contains labeled data elements. Data elements can be nested to show relationships. Data labels are called "tags", which can also contain attributes (values are always strings) that provide additional information about the data.
    
```xml
    <parent_tag>
        <child_tag attribute1="value1" attribute2="value2">data</child_tag>
    </parent_tag>
```

It is subjective whether to provide additional information as attributes or additional data elements:

```xml
    <contact birthdate="1-1-1980">
        <name>John Smith</name>
    </contact>
    
    <contact>
        <name>John Smith</name>
        <birthdate>1-1-1980</birthdate>
    </contact>
```
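The two layouts are read with different calls. The following is a minimal sketch using lxml, assuming the two `contact` records above are wrapped in a hypothetical `<contacts>` root so that they form one well-formed document:

```python
from lxml import etree as et

# Hypothetical wrapper element so both records fit in one document
xml = b"""
<contacts>
    <contact birthdate="1-1-1980">
        <name>John Smith</name>
    </contact>
    <contact>
        <name>John Smith</name>
        <birthdate>1-1-1980</birthdate>
    </contact>
</contacts>
"""

root = et.fromstring(xml)
as_attribute, as_element = root.findall("contact")

# Attribute values are read with .get() (None if the attribute is absent)
print(as_attribute.get("birthdate"))

# Child element text is read with .findtext() (None if the child is absent)
print(as_element.findtext("birthdate"))
```

Both calls return the string `1-1-1980`; neither layout carries a date type, so any conversion is left to the reader of the file.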

<center><img src="xml_graph.png"></center>

#### DTD and XML Schema

- Document Type Definitions (DTD) and XML Schemas are two ways of describing the structure and content of an XML document
- XML Schemas (a.k.a. XML Schema Definitions or XSDs) were designed to improve upon the shortcomings of DTDs
    - data type support
    - namespace aware
- Example: the UniProt XSD - [http://www.uniprot.org/support/docs/uniprot.xsd](http://www.uniprot.org/support/docs/uniprot.xsd)
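
To illustrate the data type support mentioned above, here is a small sketch of schema validation with lxml. The schema and the two `employee` documents are made up for this example and are not part of the UniProt XSD:

```python
from lxml import etree as et

# Hypothetical schema: <employee> must hold a string name and an integer salary
xsd = b"""
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="employee">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="name" type="xs:string"/>
        <xs:element name="salary" type="xs:integer"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""
schema = et.XMLSchema(et.fromstring(xsd))

good = et.fromstring(b"<employee><name>John Doe</name><salary>2000</salary></employee>")
bad = et.fromstring(b"<employee><name>John Doe</name><salary>lots</salary></employee>")

good_ok = schema.validate(good)   # True when the document conforms
bad_ok = schema.validate(bad)     # False: 'lots' is not an xs:integer

print(good_ok, bad_ok)
for error in schema.error_log:    # Messages from the most recent validation
    print(error.message)
```

A DTD could require that `salary` is present, but it could not express that its content must be an integer.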

In [3]:

company_xmlpath = 'company.xml'

company_xmlstring = ('''<?xml version='1.0' ?>
                        <company>
                            <department>
                                <employee>
                                    <name>John Doe</name>
                                    <job>Software Analyst</job>
                                    <salary>2000</salary>
                                </employee>
                                <employee>
                                    <name>Jane Fletcher</name>
                                    <job>Designer</job>
                                    <salary>2500</salary>
                                </employee>        
                                <employee>
                                    <name>Mike Mooney</name>
                                    <job>Professor</job>
                                    <salary>250000</salary>
                                </employee>
                                <employee>
                                    <name>Gareth Harman</name>
                                    <job>Student</job>
                                    <salary>10</salary>
                                </employee>
                            </department>
                        </company>''')


# LXML

`parse()`
> Reads an XML file
> Returns an `ElementTree` object

`fromstring()`
> Creates an `Element` object from a string containing XML

Note that the two methods return different types of objects, as the type checks below show.

In [19]:

import lxml.etree as et
import re

# Let's load our object from the XML string defined above
parse_company = et.fromstring(company_xmlstring)
tree_company = parse_company.getroottree()

print(f'parse_company: {type(parse_company)}')
print(f'tree_company: {type(tree_company)}')

# And load from an xml file directly
tree_company = et.parse(company_xmlpath)

print(f'tree_company: {type(tree_company)}')


parse_company: <class 'lxml.etree._Element'>
tree_company: <class 'lxml.etree._ElementTree'>
tree_company: <class 'lxml.etree._ElementTree'>
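
The same distinction can be reproduced without a file on disk. This sketch assumes a tiny stand-in document and uses `io.BytesIO` as the file-like source; it also shows one pitfall of `fromstring()` with `str` input:

```python
import io
from lxml import etree as et

xml_bytes = b"<?xml version='1.0' encoding='UTF-8'?><company><department/></company>"

# fromstring() parses in-memory data and returns the root Element
root = et.fromstring(xml_bytes)

# parse() reads a path or file-like object and returns an ElementTree
tree = et.parse(io.BytesIO(xml_bytes))

print(type(root).__name__, type(tree).__name__)
print(tree.getroot().tag == root.tag)

# lxml raises ValueError for a str (rather than bytes) that carries an
# encoding declaration, so pass bytes when the declaration is present
try:
    et.fromstring(xml_bytes.decode("utf-8"))
except ValueError as err:
    print("ValueError:", err)
```

The `company_xmlstring` defined earlier works as a `str` because its declaration has no `encoding` attribute.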


Let's look at the root node and some of the information about our XML document

In [26]:

# Obtain the root element of our tree
root_company = tree_company.getroot()
print(f'root_company: {type(root_company)}')

print(f'Company: {root_company.tag} Len: {len(root_company)}')
print(f'Company: {root_company[0].tag} Len: {len(root_company[0])}')
print(f'Company: {root_company[0][0].tag} Len: {len(root_company[0][0])}')

print(f'{root_company[0][0][0].tag}')
print(f'{root_company[0][0][0].text}')
print(f'{root_company[0][0][0].attrib}')


root_company: <class 'lxml.etree._Element'>
Company: company Len: 1
Company: department Len: 4
Company: employee Len: 3
name
John Doe
{}


### Walking the tree

`iterwalk()`
> Iteratively *walk* the elements of an `ElementTree` or `Element` object

`iterparse()`
> Iteratively *parse* and walk the elements of an .xml file

`element.clear()`
> This statement is important!

> XML files can be very large (sometimes >4GB) and may not fit in the available RAM, so when traversing the tree we need to clear elements from memory as we go.

In [14]:

def walk(iter_obj):

    ''' Walk the (event, element) pairs from iterparse() or iterwalk() '''

    # The first event is the 'start' of the root (requires 'start' in events)
    event, root = next(iter_obj)

    for event, element in iter_obj:         # Walk through the remaining events
        if event == 'end' and element.tag != root.tag:  # Element is complete and is not the root
            text = (element.text or '').strip()         # Text may be None or only whitespace
            if text:
                print(f'{element.tag}:{text}')
            element.clear()                 # Clear this element (and its children) from memory

    root.clear()                            # Clear the root from memory
    



In [27]:
# Create an iterator that parses the file incrementally
iter_et_fromfile = et.iterparse('company.xml', events=['start', 'end'])

# walk() accepts the iterator from either iterparse() or iterwalk()
walk(iter_et_fromfile)


name:John Doe
job:Software Analyst
salary:2000
name:Jane Fletcher
job:Designer
salary:2500
name:Mike Mooney
job:Professor
salary:250000
name:Gareth Harman
job:Student
salary:10


In [33]:
iter_et_fromtree = et.iterwalk(root_company, events=['start', 'end'])
walk(iter_et_fromtree)


name:John Doe
job:Software Analyst
salary:2000
name:Jane Fletcher
job:Designer
salary:2500
name:Mike Mooney
job:Professor
salary:250000
name:Gareth Harman
job:Student
salary:10
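
For very large files, a common refinement is to ask `iterparse()` for only the elements of interest and to drop processed siblings as well. The sketch below assumes a small in-memory stand-in for the file and sums the salaries; the `tag` filter and the sibling cleanup are additions, not part of the `walk()` function above:

```python
import io
from lxml import etree as et

# Stand-in for a large file on disk; iterparse() also accepts a file path
source = io.BytesIO(b"""<company><department>
    <employee><name>John Doe</name><salary>2000</salary></employee>
    <employee><name>Jane Fletcher</name><salary>2500</salary></employee>
</department></company>""")

total = 0
# Only 'end' events for <employee> are reported, so each one is complete when seen
for event, employee in et.iterparse(source, events=("end",), tag="employee"):
    total += int(employee.findtext("salary"))
    employee.clear()                          # Drop the element's children and text
    while employee.getprevious() is not None:
        del employee.getparent()[0]           # Drop already-processed siblings

print(total)  # 4500
```

Without the `while` loop the emptied `<employee>` elements stay attached to `<department>`, which still adds up over millions of records.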


In [35]:
# walk() cleared the in-memory tree, so parse the file again
tree_company = et.parse(company_xmlpath)
root_company = tree_company.getroot()


In [36]:

#### FIND ####: Search for name among the direct children of the root (no match, so None)
print(root_company.find("name"))


#### FIND ####: Search for the first name anywhere in the tree (.//)
print(root_company.find(".//name").text)


#### FINDALL #: Find all instances of name anywhere in the tree
print(root_company.findall(".//name"))


#### ITERFIND : Generator over the instances of name anywhere in the tree
for jj in root_company.iterfind('.//name'):
    print(jj.text)


# Same thing, but collected into a list
[jj.text for jj in root_company.iterfind('.//name')]


None
John Doe
[<Element name at 0x7fec53ecb840>, <Element name at 0x7fec53ecb680>, <Element name at 0x7fec53ecb540>, <Element name at 0x7fec53ecbb80>]
John Doe
Jane Fletcher
Mike Mooney
Gareth Harman


['John Doe', 'Jane Fletcher', 'Mike Mooney', 'Gareth Harman']
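
Two related conveniences are `findtext()` and simple path predicates. This is a short sketch on a trimmed copy of the company data (two employees, no salaries), not the full `company.xml`:

```python
from lxml import etree as et

root = et.fromstring(b"""
<company><department>
    <employee><name>John Doe</name><job>Software Analyst</job></employee>
    <employee><name>Jane Fletcher</name><job>Designer</job></employee>
</department></company>
""")

# A bare tag name only matches direct children, so spell out the path from the root
first = root.find("department/employee/name")
print(first.text)                                   # John Doe

# findtext() returns the text directly, or a default when nothing matches
print(root.findtext(".//salary", default="n/a"))    # n/a

# A [child='text'] predicate selects elements by the text of a child
designer = root.find(".//employee[job='Designer']")
print(designer.findtext("name"))                    # Jane Fletcher
```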

# Building/Writing XML

#### Methods for Writing XML
<table align="left">
<tr><td style="text-align:center"><b>Method</b></td><td><b>Description</b></td></tr>
<tr><td style="text-align:center">`et.Element(tag)`</td><td>Creates an element with the specified tag. Returns an element object.</td></tr>
<tr><td style="text-align:center">`et.SubElement(element, tag)`</td><td>Creates a child element under the specified element.</td></tr>
<tr><td style="text-align:center">`Element.set(key, value)`</td><td>Sets the attributes of an element.</td></tr>
<tr><td style="text-align:center">`et.ElementTree(root)`</td><td>Returns an ElementTree object.</td></tr>
<tr><td style="text-align:center">`ElementTree.write(file)`</td><td>Writes an ElementTree object to a file.</td></tr>
</table>

In [55]:
## Create a simple XML file

# Create the root element and wrap it in an ElementTree
page = et.Element('library')
doc = et.ElementTree(page)

# Add a book and its child elements
book = et.SubElement(page, "book")
title = et.SubElement(book, "title")
title.text = "Nineteen Eighty-Four"
author = et.SubElement(book, "author")
author.text = "George Orwell"
pub_info = et.SubElement(book, "publication_info")

# Multiple attributes can be set with keyword arguments
pub = et.SubElement(pub_info, "publisher",
                    name="Secker and Warburg",
                    location="London")

# Save to an XML file (the file is closed when the with block ends)
with open('output.xml', 'wb') as outFile:
    doc.write(outFile, xml_declaration=True, encoding='utf-16')

# Walk the new tree and print every element that has text
iter_et = et.iterwalk(page)
for event, element in iter_et:
    if element.text is not None:
        print(element.tag + ":", element.text.strip())
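
To inspect a tree without writing a file, `et.tostring()` can serialize it to a string. The sketch below builds a smaller, stand-alone `book` element (an assumption for brevity) and uses `Element.set()` from the table above:

```python
from lxml import etree as et

book = et.Element("book")
et.SubElement(book, "title").text = "Nineteen Eighty-Four"
publisher = et.SubElement(book, "publisher")
publisher.text = "Secker and Warburg"
publisher.set("location", "London")   # .set() adds or overwrites one attribute

# encoding="unicode" returns a str instead of bytes; pretty_print adds indentation
xml_text = et.tostring(book, pretty_print=True, encoding="unicode")
print(xml_text)

# Round trip: parsing the serialized text gives back the same values
reparsed = et.fromstring(xml_text)
print(reparsed.find("publisher").attrib)
```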



# XPath Queries


- Stands for XML Path Language
- Uses "path like" syntax to identify and navigate nodes in an XML document
- XPath 2.0 and later define over 200 built-in functions (lxml implements XPath 1.0, which has a smaller core function library)
- A major element in the XSLT standard
- A W3C recommendation

Reference:
- [W3Schools XPath Tutorial](https://www.w3schools.com/xml/xpath_intro.asp)

In [270]:

### XPATH
employees = tree_company.xpath('.//name/text()')
employees


['John Doe', 'Jane Fletcher', 'Mike Mooney', 'Gareth Harman']
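
XPath goes beyond simple paths with predicates, attribute selection, variables and functions. The following sketch uses a trimmed copy of the company data; the `id` attributes are hypothetical and were added only to demonstrate `@` selection:

```python
from lxml import etree as et

root = et.fromstring(b"""
<company><department>
    <employee id="e1"><name>John Doe</name><salary>2000</salary></employee>
    <employee id="e2"><name>Jane Fletcher</name><salary>2500</salary></employee>
    <employee id="e3"><name>Gareth Harman</name><salary>10</salary></employee>
</department></company>
""")

# Predicate on a child's value (compared numerically)
high = root.xpath("//employee[salary > 2000]/name/text()")

# XPath variables ($min) are passed as keyword arguments
above = root.xpath("//employee[salary >= $min]/name/text()", min=2000)

# Attribute selection with @, and positional predicates (1-based)
ids = root.xpath("//employee/@id")
second = root.xpath("//employee[2]/name/text()")

# Functions such as count() and sum() return floats
n = root.xpath("count(//employee)")
payroll = root.xpath("sum(//salary)")

print(high, above, ids, second, n, payroll)
```

Passing values as variables rather than formatting them into the query string avoids quoting problems when the value comes from user input.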

In [38]:
namespace = re.match(r"{.*}", root_company.tag)
print(namespace)

None
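
The match above is `None` because `company.xml` declares no namespace. When a document does declare one, as UniProt files do, every tag carries it and queries must account for it. This sketch assumes a heavily trimmed, hypothetical UniProt-style record:

```python
from lxml import etree as et

# Hypothetical, heavily trimmed UniProt-style record with a default namespace
root = et.fromstring(b"""
<uniprot xmlns="http://uniprot.org/uniprot">
    <entry><accession>Q15465</accession><name>SHH_HUMAN</name></entry>
</uniprot>
""")

# QName splits a namespaced tag into its two parts
qname = et.QName(root)
print(qname.namespace, qname.localname)

# An unqualified name does not match elements that are in a namespace
print(root.find("entry"))                                  # None

# find()/findall() accept the {namespace}tag form...
ns = qname.namespace
print(root.find(f"{{{ns}}}entry/{{{ns}}}name").text)       # SHH_HUMAN

# ...while xpath() needs a prefix mapping (the prefix 'u' is arbitrary)
print(root.xpath("//u:accession/text()", namespaces={"u": ns}))   # ['Q15465']
```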


In [17]:
# Create the root element
page = et.Element('results')

# Make a new document tree
doc = et.ElementTree(page)

# Add the subelements
pageElement = et.SubElement(page, 'Country',
                            name='Germany',
                            Code='DE',
                            Storage='Basic')
# Multiple attributes can be passed as keyword arguments, as shown above

In [35]:
for i in doc.iter():
    print(f'{i} {i.text}')
    if hasattr(i, 'items'):
        for k, j in i.items():
            print(f'\t{k} {j}')

<Element results at 0x7fe95dfc2380> None
<Element Country at 0x7fe95dfc2400> None
	name Germany
	Code DE
	Storage Basic


In [55]:
for e in root:
    print(f'{e.tag} : {e.text.strip()}')
    
list(root)

{http://uniprot.org/uniprot}entry : 
{http://uniprot.org/uniprot}copyright : Copyrighted by the UniProt Consortium, see https://www.uniprot.org/terms Distributed under the Creative Commons Attribution (CC BY 4.0) License


[<Element {http://uniprot.org/uniprot}entry at 0x7fe95e33b040>,
 <Element {http://uniprot.org/uniprot}copyright at 0x7fe95de70280>]

In [66]:
book_xml_str = ('''
    <book>
        <title>Ender's Game</title>
        <author>Orson Scott Card</author>
        <chapter>Third</chapter>
        <chapter>Peter</chapter>
        <chapter>Graff</chapter>
        <publication_info>
            <publisher location="New York">Tor Books</publisher>
            <publication_date>1985</publication_date>
        </publication_info>
    </book>
    '''
               )

# fromstring() returns the root Element; getroottree() gives the enclosing ElementTree
book_tree = et.fromstring(book_xml_str)

dir(book_tree.getroottree())

['__class__',
 '__copy__',
 '__deepcopy__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__ne__',
 '__new__',
 '__pyx_vtable__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '_setroot',
 'docinfo',
 'find',
 'findall',
 'findtext',
 'getelementpath',
 'getiterator',
 'getpath',
 'getroot',
 'iter',
 'iterfind',
 'parse',
 'parser',
 'relaxng',
 'write',
 'write_c14n',
 'xinclude',
 'xmlschema',
 'xpath',
 'xslt']

In [38]:
import xmltodict

In [228]:


with open('company.xml') as fd:
    doc = xmltodict.parse(fd.read())

In [229]:
doc

OrderedDict([('company',
              OrderedDict([('department',
                            OrderedDict([('employee',
                                          [OrderedDict([('name', 'John Doe'),
                                                        ('job',
                                                         'Software Analyst'),
                                                        ('salary', '2000')]),
                                           OrderedDict([('name',
                                                         'Jane Fletcher'),
                                                        ('job', 'Designer'),
                                                        ('salary', '2500')]),
                                           OrderedDict([('name',
                                                         'Mike Mooney'),
                                                        ('job', 'Professor'),
                                                        ('salary', '25

OrderedDict([('uniprot',
              OrderedDict([('@xmlns', 'http://uniprot.org/uniprot'),
                           ('@xmlns:xsi',
                            'http://www.w3.org/2001/XMLSchema-instance'),
                           ('@xsi:schemaLocation',
                            'http://uniprot.org/uniprot http://www.uniprot.org/docs/uniprot.xsd'),
                           ('entry',
                            OrderedDict([('@dataset', 'Swiss-Prot'),
                                         ('@created', '1999-07-15'),
                                         ('@modified', '2022-08-03'),
                                         ('@version', '236'),
                                         ('@xmlns',
                                          'http://uniprot.org/uniprot'),
                                         ('accession',
                                          ['Q15465', 'A4D247', 'Q75MC9']),
                                         ('name', 'SHH_HUMAN'),
              