# Parsing XML/HTML using BeautifulSoup

This notebook illustrates how to work with XML/HTML data using a Python package called BeautifulSoup. Both XML and HTML are markup languages. Some datasets are provided in XML (Extensible Markup Language), and when you use a web scraper to gather data from websites, chances are your data is in HTML (HyperText Markup Language), since that is the standard markup language for documents displayed in a web browser (i.e. websites). This notebook is a primer on how to get information from XML/HTML into a format that can be processed in Python.

## The basic structure of XML/HTML

An XML/HTML document consists of elements demarcated by tags, which are organized in a tree structure. Elements can be nested into each other, and multiple elements can be the 'children' of one parent element.

### Tags
A tag begins with < and ends with >. There are three kinds of tags:
- start-tag, e.g. \<section>
- end-tag, e.g. \</section>
- empty-element tag, e.g. \<line-break />

Start-tags and end-tags always appear in pairs (see 'Elements').

### Content
Any text that is not inside a tag. When processing scraped data for text mining, this is usually where the data you are looking for is.

### Elements
An element is a component of a document that either begins with a start-tag and ends with a matching end-tag, or consists only of an empty-element tag. The characters between the start-tag and end-tag, if any, are the element's content; this content may contain markup, including other elements, which are called child elements. An example is \<greeting>Hello, world!\</greeting>. Another is \<line-break />.
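As a minimal sketch of how these terms map onto parsed objects, the following wraps the two example elements in a made-up \<section> parent (the wrapper is an assumption for illustration, not part of the dataset) and inspects the resulting tree with BeautifulSoup, which is introduced further below.

```python
from bs4 import BeautifulSoup

# Illustrative snippet: a parent element with one regular child and one empty-element child
snippet = "<section><greeting>Hello, world!</greeting><line-break /></section>"
soup = BeautifulSoup(snippet, "html.parser")

section = soup.find("section")

# Direct child elements of <section>, in document order
child_names = [child.name for child in section.find_all(recursive=False)]

# The content of the <greeting> element
greeting_text = section.find("greeting").string

# An empty-element tag has no content at all
line_break_contents = section.find("line-break").contents

print(child_names, greeting_text, line_break_contents)
```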

### Attributes
Attributes appear within a start-tag or empty-element tag, and consist of a name–value pair. An example is \<img src="madonna.jpg" alt="Madonna" />, where the names of the attributes are "src" and "alt", and their values are "madonna.jpg" and "Madonna", respectively. Attributes usually do not contain running text, but they may still be useful in the context of text mining as they may provide useful meta-information or even labels.

### Entities
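Entities are variables used to store text, most often a character that would otherwise be interpreted as markup. They begin with & and end with ;. For example, \&lt; stands for < and \&amp; stands for &.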
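The sketch below shows what happens to entities during parsing, using a made-up \<dish> element (an assumption for illustration): the parser replaces each entity with the character it stands for, so the extracted text no longer contains the & ... ; notation.

```python
from bs4 import BeautifulSoup

# Illustrative snippet containing three predefined entities
snippet = "<dish>Fish &amp; Chips &lt;deluxe&gt;</dish>"
soup = BeautifulSoup(snippet, "html.parser")

# The parsed text holds the actual characters, not the entity notation
text = soup.find("dish").string

print(text)
```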

### Comments
Comments begin with \<!-- and end with -->. They are not part of the content. You can think of them as similar to placing a comment after # in Python.
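To see how comments are kept apart from content, here is a small sketch on a made-up \<note> element (an assumption for illustration). BeautifulSoup represents comments with its `Comment` class, so they can be collected separately, while `get_text()` returns only the regular text.

```python
from bs4 import BeautifulSoup, Comment

# Illustrative snippet: one comment followed by regular content
snippet = "<note><!-- internal remark -->Visible text</note>"
soup = BeautifulSoup(snippet, "html.parser")

# Collect every comment in the document
comments = soup.find_all(string=lambda s: isinstance(s, Comment))

# get_text() skips comments and returns only the regular text
text = soup.find("note").get_text()

print(comments, text)
```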


## Let's get started!

This is a small XML snippet we will be working with*. I've indented it here so that you can see the structure of the document more clearly. It comes from a dataset of book blurbs which is available through Kaggle.

At the top level of this document, we have two book elements. The book start-tag has two attributes: date and xml:lang. Each of the book elements has 4 children: title, body, copyright and metadata. metadata has 6 children itself, and one of them, topics, in turn has a varying number of children.

\* Technically, this is not a valid XML document, as an XML document needs a single root element to be considered well-formed. You will see we are using 'html.parser'. To process well-formed XML you could also use 'xml', but that parser cannot deal with our data, as we have multiple root elements. We can use html.parser because the two formats are so similar.
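The following sketch illustrates the point about multiple root elements. The two-book snippet and its attribute values are invented stand-ins that only mimic the shape described above; they are not taken from the actual dataset.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the data: two <book> elements side by side, no single root
snippet = """
<book date="2019-01-04" xml:lang="en"><title>First</title></book>
<book date="2019-01-05" xml:lang="en"><title>Second</title></book>
"""
soup = BeautifulSoup(snippet, "html.parser")

# html.parser keeps both top-level elements
top_level = [node.name for node in soup.find_all(recursive=False)]
titles = [book.find("title").string for book in soup.find_all("book")]

print(top_level, titles)
```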

## BeautifulSoup

To process XML using Python, we'll be using a package called Beautiful Soup, which provides an XML and HTML parser. There are other packages available, too. Check whether `from bs4 import BeautifulSoup` works; if it does not, install `beautifulsoup4` (do NOT install `BeautifulSoup`; that is an earlier, now outdated version of the same package).

BeautifulSoup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

In [None]:
# try to import the package
from bs4 import BeautifulSoup

In [41]:
# open the file and parse the data; the 'with' block closes the file automatically
with open('blurb_snippet.txt', 'r', encoding='utf8') as data:
    soup = BeautifulSoup(data, 'html.parser')

Now, we find all 'book' elements in the data, and put them in a list.

In [42]:
booklist = soup.find_all('book')

Let's say we are interested in the book title, the list of topics and the blurb, as well as the date of data collection. Note that this is not the date the book was published - the latter is an element in the metadata, whereas the former is an attribute of the 'book' element. We'll give every book in our dataset an integer as an ID, and put the whole thing in a Python dictionary.

We can access an element by concatenating the element names in the tree structure until we arrive at the element we are interested in.

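If we only want the content of an element, we can use `contents` or `string`. `contents` is a list of the element's children; if you know it only ever contains one item, you can unwrap it by indexing position 0. `string` returns the text directly, but only when the element contains a single piece of text; if the element has several children, `string` is `None`.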
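The difference between the two is easiest to see on a small example. The snippet below is made up for illustration (the \<em> markup inside body is an assumption, not something known to occur in the dataset); `get_text()` is shown as one possible fallback when `string` is `None`.

```python
from bs4 import BeautifulSoup

# Made-up book: the title holds plain text, the body mixes text and markup
snippet = "<book><title>Dune</title><body>A <em>classic</em> novel.</body></book>"
soup = BeautifulSoup(snippet, "html.parser")
book = soup.find("book")

# One text child: contents is a one-item list and string returns that item
title_contents = book.title.contents
title_string = book.title.string

# Several children: contents has three items and string is None
body_child_count = len(book.body.contents)
body_string = book.body.string

# get_text() joins all text inside the element, however it is nested
body_text = book.body.get_text()

print(title_contents, title_string, body_child_count, body_string, body_text)
```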

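If we want all tags nested inside an element, regardless of their names, we can use `findChildren()`. This is an older name for `find_all()`; called without arguments, it returns every descendant tag, not only the direct children.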
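A short sketch of this behaviour follows. The tag names d0, d1 and d2 are hypothetical placeholders, and the nesting is invented to show the difference between all descendants and direct children only.

```python
from bs4 import BeautifulSoup

# Hypothetical topics element with one nested tag
snippet = "<topics><d0>Fiction</d0><d1>Fantasy<d2>Epic</d2></d1></topics>"
soup = BeautifulSoup(snippet, "html.parser")
topics = soup.find("topics")

# Without arguments, every descendant tag is returned in document order
all_tags = [tag.name for tag in topics.find_all()]

# findChildren() is the older spelling and returns the same tags
legacy_tags = [tag.name for tag in topics.findChildren()]

# recursive=False restricts the result to direct children
direct_tags = [tag.name for tag in topics.find_all(recursive=False)]

print(all_tags, legacy_tags, direct_tags)
```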

If we want to find an attribute, we use `get()` with the name of the attribute as an argument.
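As an illustration, the sketch below reads attributes from a made-up book start-tag. The attribute values and the `isbn` attribute name are assumptions chosen only to show what happens when an attribute is missing.

```python
from bs4 import BeautifulSoup

# Made-up book element with the two attributes described earlier
snippet = '<book date="2019-01-04" xml:lang="en"><title>Dune</title></book>'
soup = BeautifulSoup(snippet, "html.parser")
book = soup.find("book")

date_collected = book.get("date")
language = book.get("xml:lang")

# get() returns None for a missing attribute, or a default if one is given
missing = book.get("isbn")
fallback = book.get("isbn", "unknown")

print(date_collected, language, missing, fallback)
```

Indexing with square brackets, as in `book["date"]`, also works, but raises a `KeyError` when the attribute is absent, so `get()` is the safer choice for attributes that may be missing.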

In [53]:
book_data = {}  # this will be the big dictionary that contains all our data

# number the books starting from 1; this number serves as the book ID
for book_id, book in enumerate(booklist, start=1):
    # Title: contents returns a list; there is always exactly one title, so we take the element at position 0
    title = book.title.contents[0]

    # Topics: we want to find all topics, regardless of the tag
    topics = []
    for tag in book.metadata.topics.findChildren():
        topics.extend(tag.contents)  # the tags may contain multiple topics; we add all of them to our list

    # Blurb
    blurb = book.body.string

    # Date of data collection: an attribute of the book start-tag (not the date under 'published')
    date_collected = book.get('date')

    # put everything in a dictionary, and add it to our big data dictionary with the current book ID as key
    book_data[book_id] = {'title': title,
                          'topics': topics,
                          'blurb': blurb,
                          'date_of_data_collection': date_collected}

## Write dictionary to JSON

Now let's save our data as JSON for easier future use.

In [60]:
import json

with open("book_data.json", "w") as f:
    json.dump(book_data, f)
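One detail to keep in mind when loading the file again: JSON object keys are always strings, so the integer book IDs come back as strings. The sketch below demonstrates this in memory with a made-up one-book dictionary and shows one possible way to restore the integer keys.

```python
import json

# Made-up stand-in for book_data with a single entry
book_data = {1: {"title": "Dune", "topics": ["Fiction"]}}

# Round trip through JSON (in memory here, instead of via a file)
serialized = json.dumps(book_data)
restored = json.loads(serialized)

# The integer key 1 has become the string "1"
restored_keys = list(restored.keys())

# Convert the keys back to integers if the original IDs are needed
restored_int = {int(key): value for key, value in restored.items()}

print(restored_keys, restored_int == book_data)
```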