# Parsing XML/HTML using BeautifulSoup

This notebook illustrates how to work with XML/HTML data using a package called BeautifulSoup. Both XML and HTML are markup languages. Some datasets are provided in XML (Extensible Markup Language), and when you use a web scraper to gather data from websites, chances are your data is in HTML (HyperText Markup Language), since that is the standard markup language for documents displayed in web browsers (i.e. web pages). This notebook is a primer on how to get information from XML/HTML into a format that is processable in Python.

## The basic structure of XML/HTML

An XML/HTML document consists of elements demarcated by tags, which are organized in a tree structure. Elements can be nested into each other, and multiple elements can be the 'children' of one parent element.

### Tags
A tag begins with < and ends with >. There are three kinds of tags:
- start-tag, e.g. \<section>
- end-tag, e.g. \</section>
- empty-element tag, e.g. \<line-break />

Start-tags and end-tags always appear in pairs (see 'Elements').

### Content
Content is any text that is not inside a tag. When processing scraped data for text mining, this is usually where the data you are looking for is.

### Elements
An element is a component of a document that either begins with a start-tag and ends with a matching end-tag, or consists only of an empty-element tag. The characters between the start-tag and end-tag, if any, are the element's content, and may contain markup, including other elements, which are called child elements. An example is \<greeting>Hello, world!\</greeting>. Another is \<line-break />.
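
The two example elements above can be inspected with BeautifulSoup. This is a minimal sketch, assuming `beautifulsoup4` is installed; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

# The two example elements from the text, placed side by side.
snippet = "<greeting>Hello, world!</greeting><line-break />"
soup = BeautifulSoup(snippet, "html.parser")

# An element with a start-tag, content, and an end-tag.
greeting = soup.find("greeting")
print(greeting.name)    # the tag name
print(greeting.string)  # the text between the start-tag and end-tag

# An empty-element tag has no content at all.
line_break = soup.find("line-break")
print(line_break.contents)
```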

### Attributes
Attributes appear within a start-tag or empty-element tag, and consist of a name–value pair. An example is \<img src="madonna.jpg" alt="Madonna" />, where the names of the attributes are "src" and "alt", and their values are "madonna.jpg" and "Madonna", respectively. Attributes usually do not contain running text, but they may still be useful in the context of text mining as they may provide useful meta-information or even labels.
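
As a small sketch of how the attributes in this example can be read with BeautifulSoup (assuming `beautifulsoup4` is installed): `attrs` gives all name–value pairs as a dictionary, while `get()` and square brackets look up a single attribute.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<img src="madonna.jpg" alt="Madonna" />', "html.parser")
img = soup.find("img")

all_attributes = img.attrs   # dictionary of every attribute on the tag
source = img.get("src")      # value of one attribute
alt_text = img["alt"]        # same lookup, dictionary style
missing = img.get("width")   # get() returns None for an absent attribute

print(all_attributes, source, alt_text, missing)
```

Using `get()` rather than square brackets is the safer choice when an attribute might be absent, since `img["width"]` would raise a `KeyError` here.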

### Entities
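Entities are named placeholders that stand for a character or a piece of text. They begin with & and end with ;. For example, \&lt; stands for < and \&amp; stands for &, which lets these characters appear in content without being mistaken for markup.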
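
When a document is parsed, entities are replaced by the characters they stand for. A minimal sketch, assuming `beautifulsoup4` is installed and using a made-up sentence:

```python
from bs4 import BeautifulSoup

# &amp; and &lt; are entities for the characters & and <.
soup = BeautifulSoup("<p>Fish &amp; chips &lt; 5 euro</p>", "html.parser")

# get_text() returns the content with the entities already decoded.
text = soup.p.get_text()
print(text)
```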

### Comments
Comments begin with \<!-- and end with -->. They are not part of the content. You can think of them as similar to placing a comment after # in Python.
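
BeautifulSoup keeps comments in the tree as a separate kind of string, so they can be skipped or collected on purpose. The sketch below assumes `beautifulsoup4` is installed; the `note` element and its text are made up for illustration.

```python
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup("<note><!-- internal remark -->Visible text</note>",
                     "html.parser")

# get_text() only returns regular content, not comments.
visible = soup.note.get_text()

# Comments can still be found by filtering strings on their type.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))

print(visible)
print(comments)
```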


## Let's get started!

This is a small XML snippet we will be working with*.  I've indented it here so that you can see the structure of the document more clearly. It comes from a dataset of book blurbs which is available through Kaggle.

At the top level of this document, we have two book elements. The book start-tag has two attributes: date and xml:lang. Each of the book elements has four children: title, body, copyright and metadata. metadata has six children itself, and one of them, topics, again has a varying number of children.

\* Technically, this is not a valid XML document, as an XML document needs a single root element to be considered well-formed. You will see we are using 'html.parser'; to process well-formed XML you could also use 'xml' (which requires the lxml package), but that parser doesn't know how to deal with our data, as we have multiple root elements. We can use html.parser because the two formats are so similar.
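
To illustrate the point about multiple root elements, here is a minimal sketch with two sibling book elements and no enclosing root. The titles and attribute values are made up; only the element and attribute names follow the description above.

```python
from bs4 import BeautifulSoup

# Two top-level <book> elements: not well-formed XML, but fine for html.parser.
snippet = """
<book date="2018-08-18" xml:lang="en"><title>First toy title</title></book>
<book date="2018-08-18" xml:lang="en"><title>Second toy title</title></book>
"""

soup = BeautifulSoup(snippet, "html.parser")
books = soup.find_all("book")
titles = [book.title.string for book in books]
language = books[0].get("xml:lang")

print(len(books), titles, language)
```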

## BeautifulSoup

To process XML using Python, we'll be using a package called Beautiful Soup, which provides an XML and HTML parser. There are other packages available, too. Check if you can `from bs4 import BeautifulSoup` and otherwise install `beautifulsoup4` (do NOT install BeautifulSoup; that is an earlier, now outdated version of the same package).

BeautifulSoup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

In [35]:
#try to import the package
from bs4 import BeautifulSoup

In [2]:
# open the file
data = open('BlurbGenreCollection_EN_train.txt', 'r', encoding="utf8")

# parse the data
soup = BeautifulSoup(data, 'html.parser')

# close the file
data.close()

In [3]:
# open the file
data2 = open('BlurbGenreCollection_EN_dev.txt', 'r', encoding="utf8")

# parse the data
soup2 = BeautifulSoup(data2, 'html.parser')

# close the file
data2.close()

Now, we find all 'book' elements in the data, and put them in a list.

In [4]:
booklist = soup.find_all('book')

In [5]:
booklist2 = soup2.find_all('book')

Let's say we are interested in the book title, the list of topics and the blurb, as well as the date of data collection. Note that this is not the date the book was published - the latter is an element in the metadata, whereas the former is an attribute of the 'book' element. We'll give every book in our dataset an integer as an ID, and put the whole thing in a Python dictionary.

We can access an element by chaining the element names along the tree structure until we arrive at the element we are interested in.

If we only want the content of an element, we can use `contents` or `string`. `contents` is a list of the element's children; if you know it only ever contains one item, you can unwrap it by indexing position 0. `string` returns the text directly, but only when the element contains a single string; otherwise it is `None`.

If we want all children of an element, regardless of their type, we can use `findChildren()`.

If we want to find an attribute, we use `get()` with the name of the attribute as an argument.
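
The access patterns described above can be tried out on a single toy book. This is a sketch, assuming `beautifulsoup4` is installed; the values are made up, and the `d1` tag is an assumption about what a second topic level might look like. It uses `find_all(True)`, which matches every tag below an element, in place of the older `findChildren()` name.

```python
from bs4 import BeautifulSoup

snippet = (
    '<book date="2018-08-18" xml:lang="en">'
    "<title>Toy Title</title>"
    "<body>Toy blurb.</body>"
    "<metadata><topics><d0>Fiction</d0><d1>Mystery</d1></topics></metadata>"
    "</book>"
)
book = BeautifulSoup(snippet, "html.parser").book

title_list = book.title.contents   # a list with one string
blurb = book.body.string           # the string itself
first_topic = book.metadata.topics.d0.string  # chained names walk down the tree
child_names = [tag.name for tag in book.metadata.topics.find_all(True)]
date = book.get("date")            # attribute of the book start-tag

print(title_list, blurb, first_topic, child_names, date)
```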

In [6]:
book_data = {} # this will be the big dictionary that contains all our data
book_id = 1

for book in booklist:
    # Title
    title = book.title.contents[0] # contents returns a list; we know there is always only 1 title, so we are interested in the element at position 0
    
    # Topics
    topictags = book.metadata.topics.d0.contents[0] # attribute access returns the first d0 tag, so this is the book's first main topic
   
    # Blurb
    blurb = book.body.string
    
    # Date of data collection. This is the date that is an attribute of the book opening tag (not the date under 'published')
    date_collected = book.get('date')

    #put everything in a dictionary, and add it to our big data dictionary with the current book ID as key
    book_data[book_id] = {'title' : title,
                         'topics' : topictags,
                         'blurb' : blurb,
                         'date_of_data_collection' : date_collected}
    
    
    book_id += 1

In [7]:
book_data2 = {} # this will be the big dictionary that contains all our data
book_id = 1

for book in booklist2:
    # Title
    title = book.title.contents[0] # contents returns a list; we know there is always only 1 title, so we are interested in the element at position 0
    
    # Topics
    topictags = book.metadata.topics.d0.contents[0] # attribute access returns the first d0 tag, so this is the book's first main topic
    
    # Blurb
    blurb = book.body.string
    
    # Date of data collection. This is the date that is an attribute of the book opening tag (not the date under 'published')
    date_collected = book.get('date')

    #put everything in a dictionary, and add it to our big data dictionary with the current book ID as key
    book_data2[book_id] = {'title' : title,
                         'topics' : topictags,
                         'blurb' : blurb,
                         'date_of_data_collection' : date_collected}
    
    
    book_id += 1

## Write dictionary to JSON

Now let's save our data as JSON for easier future use.
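
One thing to keep in mind when saving a dictionary like ours: JSON object keys are always strings, so the integer book IDs come back as strings after loading. A small sketch with a made-up entry, using `json.dumps`/`json.loads` instead of a file so that it runs on its own:

```python
import json

# A toy version of the big dictionary, keyed by integer book ID.
toy_book_data = {1: {"title": "Toy Title", "topics": "Fiction"}}

# Round trip through JSON text, as json.dump / json.load do through a file.
restored = json.loads(json.dumps(toy_book_data))

print(list(toy_book_data.keys()))  # integer keys before saving
print(list(restored.keys()))       # string keys after loading
```

This does not matter if you only iterate over the items, but a lookup by ID after loading needs the string form of the key.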

In [1]:
import json

In [9]:
with open("book_data2.json", "w") as f:
    json.dump(book_data2, f) # json.dump writes to the file and returns None
   

In [10]:

with open("book_data.json", "w") as f:
    json.dump(book_data, f) # json.dump writes to the file and returns None
   

----------------------------------

In [2]:
import pathlib
import numpy as np
import nltk
from nltk.corpus import stopwords
from collections import Counter
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn import svm
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report



In [3]:
# load the pre-trained Google News word-embedding model (300-dimensional vectors), limited to the first 500,000 words.
import gensim
word_embedding_model = gensim.models.KeyedVectors.load_word2vec_format('C:\\Users\\benth\\Documents\\AI\\jaar_3\\text_mining\\ba-text-mining-master\\lab_sessions\\lab4\\GoogleNews-vectors-negative300.bin.gz', binary=True, limit=500000)

In [4]:
with open('book_data.json') as f: # the with-block closes the file after loading
    training_books = json.load(f)

In [5]:
with open('book_data2.json') as f: # the with-block closes the file after loading
    valid_books = json.load(f)

In [6]:
# Get all blurbs from the training set in a list.
blurbs_train = []
for id_, book_info in training_books.items():
    blurbs_train.append(book_info['blurb'])


In [7]:
# Get all blurbs from the validation set in a list.
blurbs_valid = []
for id_, book_info in valid_books.items():
    blurbs_valid.append(book_info['blurb'])


In [8]:
# preprocess the blurbs by splitting them into tokens and removing stopwords.
def preprocess(list_of_blurbs):
    stopword_set = set(stopwords.words('english')) # build the set once; set lookups are fast
    result = []

    for blurb in list_of_blurbs:
        separated_tokens_blurb = nltk.word_tokenize(blurb) # split the blurb into tokens
        blurb_without_stopwords = []
        for token in separated_tokens_blurb:
            if token not in stopword_set:
                blurb_without_stopwords.append(token) # keep only tokens that are not stopwords
        result.append(blurb_without_stopwords)

    return result
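
One detail worth knowing about this stopword filter: the comparison is case-sensitive, and NLTK's English stopword list is lowercase, so a capitalised "The" at the start of a sentence is kept. The sketch below illustrates this with a tiny, made-up stopword set and `str.split` standing in for `nltk.word_tokenize` (both are assumptions so that the example runs without downloading NLTK data). `remove_stopwords` is a hypothetical helper, not part of the notebook.

```python
# Made-up stand-in for stopwords.words('english'), which is also lowercase.
TOY_STOPWORDS = {"the", "a", "of", "is"}

def remove_stopwords(tokens, stopword_set, lowercase=False):
    """Drop stopwords; optionally compare on the lowercased token."""
    if lowercase:
        return [t for t in tokens if t.lower() not in stopword_set]
    return [t for t in tokens if t not in stopword_set]

tokens = "The story of a hockey team is the heart of the book".split()

case_sensitive = remove_stopwords(tokens, TOY_STOPWORDS)
case_insensitive = remove_stopwords(tokens, TOY_STOPWORDS, lowercase=True)

print(case_sensitive)    # "The" survives the filter
print(case_insensitive)  # "The" is removed as well
```

Whether to lowercase before comparing is a judgment call that depends on what the later steps expect.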


In [9]:
processed_train_blurbs = preprocess(blurbs_train)

In [10]:
processed_valid_blurbs = preprocess(blurbs_valid)

In [13]:
# make embeddings of the blurbs
def make_embedding(blurb):
    word_vectors = []
    for token in blurb: # first look up an embedding for every separate token
        if token in word_embedding_model:
            vector = word_embedding_model[token]
        else:
            vector = np.zeros(300) # if the word is not in the model, we use a zero vector
        word_vectors.append(vector)
    if not word_vectors: # a blurb without tokens has nothing to average
        return np.zeros(300)
    result = np.average(word_vectors, axis=0) # the vector of the document is the average of all word vectors
    return result
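
To make the averaging step concrete, here is a small sketch with a made-up three-dimensional "model" (a plain dictionary) instead of the 300-dimensional Google News vectors. `average_embedding` and `toy_model` are hypothetical names used only for this illustration.

```python
import numpy as np

# Toy embedding table: two known words, three dimensions each.
toy_model = {
    "ice": np.array([1.0, 0.0, 2.0]),
    "hockey": np.array([3.0, 2.0, 0.0]),
}

def average_embedding(tokens, model, dim):
    """Average the word vectors; unknown words count as zero vectors."""
    vectors = [model[t] if t in model else np.zeros(dim) for t in tokens]
    if not vectors:  # no tokens at all: fall back to a zero vector
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

# "zamboni" is not in the toy model, so it contributes a zero vector.
doc_vector = average_embedding(["ice", "hockey", "zamboni"], toy_model, dim=3)
empty_vector = average_embedding([], toy_model, dim=3)

print(doc_vector)
print(empty_vector)
```

Note that unknown words pull the average towards zero because they still count in the denominator; skipping them instead would be an alternative design.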

In [26]:
# make every blurb in the training set into an embedding
embedded_train_blurbs = list(map(make_embedding, processed_train_blurbs))

In [28]:
# make every blurb in the validation set into an embedding
embedded_valid_blurbs = list(map(make_embedding, processed_valid_blurbs))

In [17]:
# gather the genres in a gold list.
genres_train = []
for id_, book_info in training_books.items():
    genres_train.append(book_info['topics'])
        
genres_valid = []
for id_, book_info in valid_books.items():
    genres_valid.append(book_info['topics'])
    

In [18]:
lin_clf = svm.LinearSVC(multi_class="crammer_singer") #Linear Support Vector classifier to predict the genres.

In [29]:
lin_clf.fit(embedded_train_blurbs, genres_train) #First train on the training set.
genre_pred_lin = lin_clf.predict(embedded_valid_blurbs) #predict on validation set.



In [30]:
print(classification_report(genres_valid,genre_pred_lin,digits=3)) 

                    precision    recall  f1-score   support

  Children’s Books      0.815     0.825     0.820      3125
          Classics      0.752     0.521     0.615       628
           Fiction      0.794     0.913     0.850      4889
             Humor      0.718     0.226     0.343       226
        Nonfiction      0.870     0.902     0.886      5218
            Poetry      0.553     0.265     0.359        98
Teen & Young Adult      0.781     0.042     0.079       601

          accuracy                          0.824     14785
         macro avg      0.755     0.528     0.565     14785
      weighted avg      0.820     0.824     0.804     14785
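
The two average rows in this report can be reproduced from the per-class rows, which helps when reading them: the macro average treats every genre equally, while the weighted average weights each genre by its support. A sketch for the recall column, with the values copied from the report above:

```python
# Per-class recall and support, in the order of the report rows.
recalls = [0.825, 0.521, 0.913, 0.226, 0.902, 0.265, 0.042]
supports = [3125, 628, 4889, 226, 5218, 98, 601]

# Macro average: plain mean over the seven classes.
macro_recall = sum(recalls) / len(recalls)

# Weighted average: each class counts in proportion to its support.
weighted_recall = sum(r * s for r, s in zip(recalls, supports)) / sum(supports)

print(round(macro_recall, 3), round(weighted_recall, 3))
```

The gap between the two numbers reflects that the small genres (Humor, Poetry, Teen & Young Adult) have low recall but few examples.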



In [31]:
from sklearn import linear_model
log_reg = linear_model.LogisticRegression() #Logistic regression to predict the genres.

In [32]:
log_reg.fit(embedded_train_blurbs,genres_train) #first train on the training set.
genre_pred_log = log_reg.predict(embedded_valid_blurbs) #predict on validation set.

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
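
This warning means the solver stopped before it had converged. A minimal sketch of the two remedies it suggests, raising `max_iter` and scaling the features, on synthetic data. The data from `make_classification`, the value `max_iter=1000` and the pipeline are illustrative assumptions, not the settings used for the results in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the embedded blurbs and their genres.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

# Scale the features first, then give the solver more iterations.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X, y)

predictions = clf.predict(X)
print(predictions[:10])
```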


In [33]:
print(classification_report(genres_valid,genre_pred_log,digits=3))

                    precision    recall  f1-score   support

  Children’s Books      0.822     0.813     0.817      3125
          Classics      0.703     0.538     0.610       628
           Fiction      0.792     0.909     0.846      4889
             Humor      0.723     0.208     0.323       226
        Nonfiction      0.865     0.899     0.882      5218
            Poetry      0.533     0.082     0.142        98
Teen & Young Adult      0.617     0.097     0.167       601

          accuracy                          0.820     14785
         macro avg      0.722     0.507     0.541     14785
      weighted avg      0.810     0.820     0.802     14785



In [110]:
i= 1264

print("Example Blurb: ", blurbs_valid[i])
print("Genre: ", genres_valid[i])
print("Predicted by Linear SVC as: ", genre_pred_lin[i])
print("Predicted by Logistic Regression as: ", genre_pred_log[i])

Example Blurb:  As kids, we all had passions — something we loved doing, experienced with our friends, dreamed about every spare moment. For Jay Atkinson, who grew up in a small Massachusetts town, it was hockey. When Bobby Orr scored the winning goal in the 1970 Stanley Cup Finals against the St. Louis Blues, Atkinson became a fan for life. In 1975, he played on the first Methuen Rangers varsity hockey team. Once and always a rink rat, Atkinson still plays hockey whenever and wherever he can. Twenty-five years after he played for the Rangers, Atkinson returns to his high school team as a volunteer assistant. Ice Time tells the team’s story as he follows the temperamental star, the fiery but troubled winger, the lovesick goalie, the rookie whose father is battling cancer, and the "old school" coach as the Rangers make a desperate charge into the state tournament. In emotionally vivid detail, Ice Time travels into the rinks, schools, and living rooms of small-town America, where friends

-------------------

In [36]:
# Now we process the test set to get the final results.
# open the file
data = open('BlurbGenreCollection_EN_test.txt', 'r', encoding="utf8")

# parse the data
soup = BeautifulSoup(data, 'html.parser')

# close the file
data.close()

In [37]:
booklist = soup.find_all('book')

In [38]:
book_data = {} # this will be the big dictionary that contains all our data
book_id = 1

for book in booklist:
    # Title
    title = book.title.contents[0] # contents returns a list; we know there is always only 1 title, so we are interested in the element at position 0
    
    # Topics
    topictags = book.metadata.topics.d0.contents[0] # attribute access returns the first d0 tag, so this is the book's first main topic
   
    # Blurb
    blurb = book.body.string
    
    # Date of data collection. This is the date that is an attribute of the book opening tag (not the date under 'published')
    date_collected = book.get('date')

    #put everything in a dictionary, and add it to our big data dictionary with the current book ID as key
    book_data[book_id] = {'title' : title,
                         'topics' : topictags,
                         'blurb' : blurb,
                         'date_of_data_collection' : date_collected}
    
    
    book_id += 1

In [39]:
with open("book_data_test.json", "w") as f:
    json.dump(book_data, f) # json.dump writes to the file and returns None
   

In [40]:
with open('book_data_test.json') as f: # the with-block closes the file after loading
    test_books = json.load(f)

In [43]:
# Get all blurbs from the test set in a list.
blurbs_test = []
for id_, book_info in test_books.items():
    blurbs_test.append(book_info['blurb'])


In [44]:
processed_test_blurbs = preprocess(blurbs_test)

In [45]:
embedded_test_blurbs = list(map(make_embedding, processed_test_blurbs))

In [46]:
# gather the genres in a gold list.
genres_test = []
for id_, book_info in test_books.items():
    genres_test.append(book_info['topics'])
  

In [47]:
SVM_test = lin_clf.predict(embedded_test_blurbs) #predict on test set.

In [48]:
LogReg_test = log_reg.predict(embedded_test_blurbs) #predict on test set.

In [49]:
print(classification_report(genres_test,SVM_test,digits=3))

                    precision    recall  f1-score   support

  Children’s Books      0.826     0.818     0.822      3903
          Classics      0.769     0.519     0.620       776
           Fiction      0.790     0.921     0.850      6092
             Humor      0.750     0.212     0.331       283
        Nonfiction      0.871     0.902     0.886      6477
            Poetry      0.554     0.263     0.356       118
Teen & Young Adult      0.582     0.043     0.080       745

          accuracy                          0.825     18394
         macro avg      0.734     0.525     0.564     18394
      weighted avg      0.815     0.825     0.805     18394



In [50]:
print(classification_report(genres_test,LogReg_test,digits=3))

                    precision    recall  f1-score   support

  Children’s Books      0.831     0.807     0.819      3903
          Classics      0.710     0.540     0.613       776
           Fiction      0.786     0.917     0.846      6092
             Humor      0.707     0.145     0.240       283
        Nonfiction      0.866     0.896     0.880      6477
            Poetry      0.450     0.076     0.130       118
Teen & Young Adult      0.620     0.107     0.183       745

          accuracy                          0.820     18394
         macro avg      0.710     0.498     0.530     18394
      weighted avg      0.810     0.820     0.802     18394

